Olympics

Data

We investigate 120 years of data on all Olympic athletes, from Athens 1896 to Rio 2016. Our primary aim is to explore whether the Olympics are an “even playing field,” or whether athletes from some countries benefit from systematic biases in their favor.

We begin by reading in Olympic data. We glimpse() the first few rows to get a sense of what the data contains. We also generate a new column iso_a3, which contains country codes in the ISO 3166-1 alpha-3 format. These country codes will be used to precisely join additional data on development indicators later in our analysis.

## Rows: 271,116
## Columns: 18
## $ id           <int> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ name         <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar...
## $ sex          <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M",...
## $ age          <int> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31...
## $ height       <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 1...
## $ weight       <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75...
## $ team         <chr> "China", "China", "Denmark", "Denmark/Sweden", "Nether...
## $ noc          <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED"...
## $ games        <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Sum...
## $ year         <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, ...
## $ season       <chr> "summer", "summer", "summer", "summer", "winter", "win...
## $ city         <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary"...
## $ sport        <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed...
## $ event        <chr> "Basketball Men's Basketball", "Judo Men's Extra-Light...
## $ medal        <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ region       <chr> "China", "China", "Denmark", "Denmark", "Netherlands",...
## $ host_country <chr> "Spain", "UK", "Belgium", "France", "Canada", "Canada"...
## $ iso_a3       <chr> "chn", "chn", "dnk", "dnk", "nld", "nld", "nld", "nld"...
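The original code for deriving iso_a3 is not shown, so the following is only a minimal sketch of one way such a column could be built. It assumes a hand-written lookup from IOC codes (the noc column) to ISO 3166-1 alpha-3 codes; the names `ioc_to_iso` and `unmatched` are hypothetical, and the lookup covers only the three codes visible in the output above.

```r
# Toy stand-in for the athlete table: only the noc column matters here.
olympics <- data.frame(
  noc = c("CHN", "CHN", "DEN", "DEN", "NED"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from IOC codes to ISO 3166-1 alpha-3 codes.
# Only three entries are shown; a real table would need to cover every NOC.
ioc_to_iso <- c(CHN = "CHN", DEN = "DNK", NED = "NLD")

# Look up each NOC; codes missing from the table come back as NA.
iso <- unname(ioc_to_iso[olympics$noc])

# Record unmatched NOCs so gaps in the lookup are easy to spot.
unmatched <- unique(olympics$noc[is.na(iso)])

# Store the codes in lowercase, matching the glimpse() output above.
olympics$iso_a3 <- tolower(iso)
```

Checking `unmatched` before any join is a cheap way to catch historical or defunct teams that have no ISO equivalent.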

EDA & Motivating Bias Analysis

We begin our analysis by seeking evidence of potential biases for high performance at the Olympics.

Perhaps the most basic indication that certain countries may be favored is a simple visualization of the all-time count of medals won by each country.
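As a rough sketch of how such an all-time count could be computed (the original code is not shown), the example below drops rows without a medal and tallies the rest by region. The toy data and the name `medal_counts` are assumptions made for illustration only.

```r
# Toy stand-in for the athlete table: one row per athlete-event entry.
olympics <- data.frame(
  region = c("USA", "USA", "China", "Denmark", "USA", "China"),
  medal  = c("Gold", NA, "Silver", "Gold", "Bronze", NA),
  stringsAsFactors = FALSE
)

# Keep only the rows where a medal was won.
medal_rows <- olympics[!is.na(olympics$medal), ]

# Count medal rows per region. If rows are athlete-event entries, a team
# medal is counted once per team member under this approach.
medal_counts <- as.data.frame(
  table(region = medal_rows$region),
  responseName = "medals",
  stringsAsFactors = FALSE
)

# Sort from most to fewest medals.
medal_counts <- medal_counts[order(-medal_counts$medals), ]
```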

Let’s visualize where these countries are.

We notice that most of the countries with high medal counts are in North America and Europe, with a few others in Asia and Australia that are known to be relatively well developed economically. This observation motivates our subsequent analysis of the relationship between Olympic performance and country development indicators.

Bias from Development Indicators

GDP Per Capita

We first examine GDP per capita, as we hypothesize that richer countries may have more resources to produce superior athletes. We begin with a simple plot of GDP per capita versus the all-time medal count for each country. We only plot countries that have won more than 10 medals.
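The data preparation behind this plot might look something like the sketch below, which joins medal counts to GDP per capita on iso_a3 and keeps countries with more than 10 medals. The country codes and all numbers are made up, and the names `plot_data` and `dropped` are hypothetical.

```r
# Made-up medal counts and GDP figures, keyed by placeholder country codes.
medal_counts <- data.frame(
  iso_a3 = c("aaa", "bbb", "ccc", "ddd"),
  medals = c(250, 40, 8, 12),
  stringsAsFactors = FALSE
)
gdp <- data.frame(
  iso_a3         = c("aaa", "bbb", "ccc"),
  gdp_per_capita = c(45000, 9000, 30000),
  stringsAsFactors = FALSE
)

# Left join: keep every country with a medal count, even without GDP data.
plot_data <- merge(medal_counts, gdp, by = "iso_a3", all.x = TRUE)

# Keep only countries that have won more than 10 medals.
plot_data <- plot_data[plot_data$medals > 10, ]

# Note which countries would be lost for lack of GDP data, then drop them.
dropped   <- plot_data$iso_a3[is.na(plot_data$gdp_per_capita)]
plot_data <- plot_data[!is.na(plot_data$gdp_per_capita), ]
```

Keeping track of `dropped` makes it visible when a high-medal country silently disappears from the plot because the join found no match.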

We find that countries with larger medal counts do tend to have larger GDP per capita, although there are countries with high GDP per capita but relatively low medal counts. In general, there are very few countries with low GDP per capita and high medal counts.

However, just based on the graph above, it’s unclear how these medals are being attained. To better understand what sports these medals are coming from and what their relationship is to GDP, we construct the following plot of sports and the mean GDP per capita of the countries that produced medalists in the sport (weighted by the number of medalists in the sport).
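One reading of the weighted mean described above is that, for each sport, every country's GDP per capita is weighted by the number of medalists it produced in that sport. The sketch below implements that reading on made-up numbers; the sport labels, country codes, and the name `mean_gdp` are placeholders, and the original computation may differ.

```r
# Made-up medalist rows: one row per medalist, with placeholder labels.
medalists <- data.frame(
  sport  = c("sport_a", "sport_a", "sport_a",
             "sport_b", "sport_b", "sport_b", "sport_b"),
  iso_a3 = c("aaa", "aaa", "bbb", "bbb", "ccc", "ccc", "ccc"),
  stringsAsFactors = FALSE
)
gdp <- data.frame(
  iso_a3         = c("aaa", "bbb", "ccc"),
  gdp_per_capita = c(50000, 20000, 10000),
  stringsAsFactors = FALSE
)

# Number of medalists per sport and country (these become the weights).
counts <- aggregate(n ~ sport + iso_a3,
                    data = transform(medalists, n = 1), FUN = sum)

# Attach each country's GDP per capita.
merged <- merge(counts, gdp, by = "iso_a3")

# Weighted mean GDP per capita within each sport.
mean_gdp <- sapply(split(merged, merged$sport), function(d) {
  weighted.mean(d$gdp_per_capita, w = d$n)
})

# Order sports from highest to lowest mean GDP per capita.
mean_gdp <- sort(mean_gdp, decreasing = TRUE)
```

For sport_a this gives (2 × 50000 + 1 × 20000) / 3 = 40000, and for sport_b (1 × 20000 + 3 × 10000) / 4 = 12500.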

This sports-wise analysis suggests that there may be a greater bias among winter sports for richer countries (the top 6 sports are all winter sports), while summer sports may have a lower “GDP threshold” to cross to be strongly competitive. However, compared to the global median of GDP, it’s clear that most low-GDP countries are probably not particularly competitive in these sports.

Case Study: Exploring Cultural Influences

Upon further investigation of the sports with low mean GDP per capita, we discover that in many cases this can be attributed to the dominance of Chinese athletes in these sports. We investigate Chinese performance further.

Chinese athletes are clearly very dominant at several sports, particularly Table Tennis, Badminton, and Trampolining. These sports are usually associated with athletes that are relatively nimble and lightweight rather than brawny and heavyweight. Thus, we further investigate the relationship between weight and performance in Chinese athletes.

We can see that as Chinese performance improves, weight tends to decrease. It’s unclear why this trend exists, but it’s possible it’s cultural. Thus, we explore other Asian countries.

We again find that most sports are relatively “light,” with the notable exception of baseball.

Interestingly, for Japan, athletes tend to be relatively lightweight across all sports, with the exceptions of Judo and Baseball.

It’s difficult to draw conclusions from numbers alone, with little cultural context. While thorough cultural research is outside the scope of this project, from my own personal experience and observation, I have found that Asians and Asian-Americans tend not to have the same obsession with extreme muscularity and physical fitness that I have observed among some Caucasians. While this is far from a substantiated conclusion, this analysis highlights the potential role of cultural factors in the performance of athletes in certain sports at the Olympics.

Population

Comparing this plot with the GDP per capita plot, we can see that the order of sports is roughly inverted. This could make sense: sports with lower barriers to entry (e.g. lower cost) may be dominated by countries with higher populations, and therefore a larger pool of athletes to choose from, while sports with higher barriers to entry may be dominated by countries whose people can afford them.

Life Expectancy

Our life expectancy plot appears quite similar to our GDP per capita plot, which is not particularly surprising, as we would expect life expectancy to correlate closely with GDP per capita. We investigate the oldest and youngest athletes over time at the Olympics to explore whether certain countries may have a significant advantage simply by having higher life expectancies and therefore being able to support older athletes.

The most interesting observation is the Art Competitions, which feature athletes as old as 97. One could imagine that certain countries with lower life expectancies simply could not produce athletes this old. Therefore, these kinds of events may have reduced the competition to far fewer countries.

Bias from Hosting

We now shift to our analysis of biases caused by hosting the Olympic Games. We are motivated by the following plot of the density of medals over time for the 10 countries that have won the most cumulative medals at the Olympics.

We can see that many countries tend not to exhibit clear trends in medal winning, but rather peak and trough at seemingly random intervals of time. We wonder if some of these peaks/troughs might be partially explained by contemporaneous events. In particular, we explore the event of hosting the Olympic Games.

Hosting Analysis

From the plot above, it seems clear that when a country hosts, it tends to win more medals compared to the previous Olympics. However, this is complicated by the fact that countries also tend to send more athletes when they host, as shown below.
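The comparison with the previous Olympics can be sketched as a within-country difference between consecutive Games, computed for both medals and athletes sent. The example below uses made-up per-Games totals for two placeholder countries; the column names are assumptions, and a real analysis would presumably compare Summer Games with Summer Games and Winter with Winter.

```r
# Made-up per-Games totals for two placeholder countries.
totals <- data.frame(
  country  = c("A", "A", "A", "B", "B", "B"),
  year     = c(2000, 2004, 2008, 2000, 2004, 2008),
  medals   = c(10, 18, 12, 30, 28, 40),
  athletes = c(100, 200, 110, 300, 290, 450),
  hosted   = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Sort so that consecutive rows within a country are consecutive Games.
totals <- totals[order(totals$country, totals$year), ]

# Change relative to the previous Games, computed separately per country.
# The first Games of each country has no predecessor, hence NA.
change_from_previous <- function(x) c(NA, diff(x))
totals$medal_change   <- ave(totals$medals,   totals$country,
                             FUN = change_from_previous)
totals$athlete_change <- ave(totals$athletes, totals$country,
                             FUN = change_from_previous)

# Look only at the Games each country hosted.
host_rows <- totals[totals$hosted,
                    c("country", "year", "medal_change", "athlete_change")]
```

In this toy data both hosts gain medals, but both also send many more athletes, which is exactly the confound described above.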

From these two plots, it is unclear whether hosting has a consistent effect on a country’s performance at the Olympics. We focus on the country that has hosted the most times, the USA.

Case Study: The United States

These two plots tell very different stories. The first plot, depicting simply the number of medals, seems to suggest that hosting does indeed provide an advantage, as all Olympics hosted by the US have produced medal counts greater than the median. However, when we divide by the number of athletes to get the number of medals per US athlete, we find that the relationship is more complicated, with some host years even falling below the median number of medals per athlete. While anomalies do exist, these can be attributed to historical events: the 1932 Winter Olympics occurred during the Great Depression (limiting the participation of foreign countries), and the 1984 Summer Olympics were boycotted by many countries in the Eastern Bloc due to tensions with the Soviet Union. In general, the second plot makes it much more difficult to say with any degree of certainty that US performance is consistently improved by hosting, as other political and economic factors can clearly also play a significant role.
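The contrast between the two plots comes down to a simple normalization, which can be sketched as follows. The numbers are invented purely to show how a host year can sit above the median in total medals yet below it in medals per athlete; they are not the US figures, and the column names are assumptions.

```r
# Invented per-Games totals for one country (not real US data).
games <- data.frame(
  games    = c("g1", "g2", "g3", "g4", "g5"),
  medals   = c(50, 100, 60, 110, 70),
  athletes = c(100, 400, 150, 200, 140),
  hosted   = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Normalize by delegation size.
games$medals_per_athlete <- games$medals / games$athletes

# Compare each Games with the median across all Games, on both measures.
games$above_median_total <- games$medals > median(games$medals)
games$above_median_rate  <- games$medals_per_athlete >
  median(games$medals_per_athlete)

# Restrict to the Games that were hosted.
host_games <- games[games$hosted, ]
```

Here both hosted Games beat the median medal total, but g2 falls below the median rate because its delegation was four times larger than that of g1.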

Overall, our analysis has revealed the importance of understanding data in context. While simply looking at numbers can point to certain correlations, we have found that causal effects may be better understood when context is taken into account, as highlighted by the potentially significant role of cultural, economic, and political factors in our data.