Data Source

We found our data on Kaggle, a website that hosts many data sets, often contributed by its users. The data was scraped from www.sports-reference.com, a trusted source for sports data. It contains over 270,000 rows, each representing an Olympic athlete competing in a particular event at a given Games, from the inception of the Olympics to 2018.

Data Journey

Before beginning our analysis, we had to clean our data and reshape it to create meaningful visualizations. First, we converted the date column to a date/time type so that we could create time series visualizations. Next, we assigned each medal a point value, with gold equal to 3 points, silver equal to 2, and bronze equal to 1. For our analysis of host countries, we read in a list of host countries and cities and used it to make a new column called "Home/Away", where "Home" marks rows in which athletes competed in their own country. We then created a relay column to flag all relay events, because each athlete on a relay team receives their own medal even though only one event was won, which skews our medal count metrics. Our largest task was finding all of the teams that no longer map to countries on a present-day world map (Yugoslavia, the Soviet Union, and West Germany, for example) and reassigning them. We stored the reassignments in new Team2 and NOC2 (country code) columns so that we could visualize countries as recognized by the IOC and winners from the same geographic area side by side.
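
As a rough illustration of the steps above, the sketch below applies them to a single row in plain Python. The column names (Year, City, Team, NOC, Event, Medal, Date, Points, Relay), the lookup tables, the zero score for non-medalists, the substring test for relays, and the choice to compare the original country code against the host are all assumptions made for this example and may not match the actual workflow.

```python
from datetime import date

# Point values for medals. Bronze = 1 is implied by "and so on";
# scoring non-medalists as 0 is an assumption.
MEDAL_POINTS = {"Gold": 3, "Silver": 2, "Bronze": 1}

# Host city -> country code of the host nation.
# Illustrative subset only; the real list was read in from a separate file.
HOST_NOC = {"London": "GBR", "Sarajevo": "YUG", "Munich": "FRG"}

# Former teams -> (Team2, NOC2). Only the reassignments named in the text
# are shown here; the full mapping is not reproduced.
NOC_REMAP = {
    "YUG": ("Serbia", "SRB"),
    "FRG": ("Germany", "GER"),
    "GDR": ("Germany", "GER"),
}


def clean_row(row):
    """Return a copy of one athlete-event row with the derived columns added."""
    out = dict(row)
    # Date/time value for time series charts (1 January of the Games year).
    out["Date"] = date(int(row["Year"]), 1, 1)
    # Medal points; anything that is not a known medal scores 0.
    out["Points"] = MEDAL_POINTS.get(row["Medal"], 0)
    # Home if the athlete's country code matches the host's, otherwise Away.
    out["Home/Away"] = "Home" if HOST_NOC.get(row["City"]) == row["NOC"] else "Away"
    # Flag relay events by name (a simple substring test, assumed here).
    out["Relay"] = "relay" in row["Event"].lower()
    # Present-day team and code; unchanged when no reassignment applies.
    out["Team2"], out["NOC2"] = NOC_REMAP.get(row["NOC"], (row["Team"], row["NOC"]))
    return out


# Sample row written for illustration, not copied from the data set.
sample = {
    "Year": "1984",
    "City": "Sarajevo",
    "Team": "Yugoslavia",
    "NOC": "YUG",
    "Event": "Alpine Skiing Men's Giant Slalom",
    "Medal": "Silver",
}
cleaned = clean_row(sample)
print(cleaned["Points"], cleaned["Home/Away"], cleaned["Relay"], cleaned["Team2"], cleaned["NOC2"])
```

Keeping the original Team and NOC values alongside Team2 and NOC2 is what allows the official and geographic views to be compared from the same rows.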

Data Caveats + Assumptions

Our first caveat is more of a detail than an assumption: every visualization that tallies medals won includes relays, where multiple athletes win the same medal in the same event. Our visualizations show the number of medals won, not the number of events won, and relays make these two aggregations slightly different. Our first assumption came with the creation of the Team2 and NOC2 columns. For countries like Yugoslavia that have since split into several modern-day countries, we decided to assign the entirety of the old country or territory to the modern country with the largest area in that region. All Yugoslavian athletes, for example, are marked as Serbian in our new columns. For countries like East and West Germany, the new assignment was more obvious (Germany) and didn't require any assumptions. In an attempt to visualize both the official Olympic count and the count by geographic area, we have included visualizations of both old and new locations to show the difference.
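
To make the difference between medals won and events won concrete, here is a minimal sketch in plain Python. The rows, the placeholder country codes AAA and BBB, and the tuple layout are hypothetical and exist only for this example. Events won are approximated by counting each combination of Games, event, country, and medal once, which is one possible way to collapse relay teams and not necessarily how a production count should be done.

```python
from collections import Counter

# (games, event, noc, athlete, medal) - made-up rows for illustration only.
rows = [
    ("2016 Summer", "Men's 4 x 100 metres Relay", "AAA", "Runner 1", "Gold"),
    ("2016 Summer", "Men's 4 x 100 metres Relay", "AAA", "Runner 2", "Gold"),
    ("2016 Summer", "Men's 4 x 100 metres Relay", "AAA", "Runner 3", "Gold"),
    ("2016 Summer", "Men's 4 x 100 metres Relay", "AAA", "Runner 4", "Gold"),
    ("2016 Summer", "Men's 100 metres", "AAA", "Runner 1", "Gold"),
    ("2016 Summer", "Men's 100 metres", "BBB", "Runner 5", "Silver"),
    ("2016 Summer", "Men's 100 metres", "BBB", "Runner 6", None),
]

# Keep only rows where a medal was actually won.
medal_rows = [r for r in rows if r[4] is not None]

# Medals won: one count per athlete row, so a relay gold counts four times.
medals_by_noc = Counter(noc for _, _, noc, _, _ in medal_rows)

# Events won: drop the athlete so each relay team collapses to one entry.
unique_podiums = {(games, event, noc, medal) for games, event, noc, _, medal in medal_rows}
events_by_noc = Counter(noc for _, _, noc, _ in unique_podiums)

print(medals_by_noc["AAA"], events_by_noc["AAA"])
```

One limitation of this shortcut: two athletes from the same country who tie for the same medal in an individual event would also be merged into a single entry.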