The hit game show Jeopardy! (“Jeopardy”) is as much about intelligence as it is about skill and strategy. Myriad analyses have been done on the categories contestants should focus on most, which Shakespeare play is more likely to be the answer, and which elements on the periodic table are most important to be familiar with. This analysis is a little different and will most likely not help anybody improve their chances of winning the game. But what you glean from it will likely impress your friends and family just as much as those other analyses: not at all.
The Data Set
The daily syndicated version of Jeopardy has been running since 1984. The data set I am using includes 6,300 games, 366,000 clues, and 12,000 contestants. It runs from 1984 until 2019, and although it is neither definitive nor complete, it certainly offers a few million lines of data to parse. The main portion of the data set is split across three JSON files, and a record from each looks like this:
clues.json
{'clue_id': 7,
'game_id': 1,
'clue_value': 200,
'round': 'Jeopardy!',
'category': 'SHAPE UP!',
'clue': 'This chess piece can only move in an L-shape',
'response': 'the knight'}
contestants.json
{'contestant_id': 208,
'name': 'Ken Jennings',
'notes': 'a software engineer from Salt Lake City, Utah',
'games_played': 94,
'total_winnings': 2522700}
games.json
{'game_id': 6314,
'air_date': '1984-09-27',
'contestant1_id': 12001,
'contestant2_id': 12002,
'contestant3_id': 11997,
'score1': 5,
'score2': 5,
'score3': 3700}
The clues.json file holds all the clues and responses (technically answers and questions, respectively). I’ll use responses, answers, and solutions interchangeably. The contestants.json file includes all the contestants. The “notes” key typically has their career and where they are from. The primary key, “contestant_id,” is referenced by games.json, which tracks each game, its scores, and the contestants who played.
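To make the relationship between the files concrete, here is a minimal sketch of joining a game to its contestants through “contestant_id.” It assumes each file holds a valid JSON array of records shaped like the samples above; the names, scores, and the `podium_results` helper are invented for illustration and are not taken from the data set.

```python
import json

# Hypothetical miniature stand-ins for contestants.json and games.json.
# Assumption: each real file is a JSON array of records like the samples above.
contestants_json = """[
  {"contestant_id": 1, "name": "Contestant A", "notes": "a teacher from Boise, Idaho"},
  {"contestant_id": 2, "name": "Contestant B", "notes": "an attorney from Austin, Texas"},
  {"contestant_id": 3, "name": "Contestant C", "notes": "a student from Toronto, Canada"}
]"""
games_json = """[
  {"game_id": 10, "air_date": "2001-01-01",
   "contestant1_id": 1, "contestant2_id": 2, "contestant3_id": 3,
   "score1": 4000, "score2": 12000, "score3": 0}
]"""

contestants = json.loads(contestants_json)
games = json.loads(games_json)

# Index contestants by primary key so each lookup is a dictionary access.
by_id = {c["contestant_id"]: c for c in contestants}


def podium_results(game):
    """Return (name, score) pairs for the three podiums of one game."""
    return [
        (by_id[game[f"contestant{i}_id"]]["name"], game[f"score{i}"])
        for i in (1, 2, 3)
    ]


results = podium_results(games[0])
print(results)
```

With the real files, `json.load(open(...))` would replace the inline strings, provided the files really are plain JSON arrays.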
Almost all of the charts, maps, and tables below are interactive and allow you to scroll, zoom, or even search. If you’re reading this in your email, visit the website or use the app to participate.
The Contestants
The data set has information about 12,000 contestants. Most of the “notes” are the introductions by Johnny Gilbert at the top of the show and typically include the contestant’s location. From this data set, 163 of the contestants told us they were from Canada. Below is a count of US states that we know contestants are from (hover and zoom to explore):
Although more contestants are from California than any other US state, which makes sense since it has the highest population, Washington, D.C., with a population of under 700k, ranks first by a mile in contestants per capita (almost fivefold compared to second place). MA, NY, and MD are the runners-up. Not only do North Dakota and Utah send few contestants to the show, they’re also at the bottom of the list by ratio of contestants to state population.
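A per-capita ranking like this one can be sketched in a few lines. This is only an illustration: the notes and the population figures are placeholders I made up, and the `state_of` helper assumes the state is whatever follows the last comma in the introduction, which will not hold for every note.

```python
from collections import Counter

# Made-up notes in the style of the "notes" field; not real contestants.
notes = [
    "a teacher from Sacramento, California",
    "a lawyer from Washington, D.C.",
    "a writer from Los Angeles, California",
    "a student from Fargo, North Dakota",
]

# Placeholder populations, NOT real census figures.
population = {"California": 1000, "D.C.": 10, "North Dakota": 20}


def state_of(note):
    """Treat the text after the last comma as the state, if it is a known one."""
    state = note.rsplit(",", 1)[-1].strip()
    return state if state in population else None


counts = Counter(s for s in map(state_of, notes) if s)

# Contestants per resident, then sort from highest to lowest ratio.
per_capita = {state: counts[state] / population[state] for state in counts}
ranking = sorted(per_capita, key=per_capita.get, reverse=True)
print(ranking)
```

Even in this toy example, the state with the most contestants lands last once population is taken into account.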
Also interesting is the breakdown of what the contestants do for a living. Below are some of the most common careers and jobs that they shared in their introduction.
The number one career of Jeopardy contestants appears to be teacher. This number includes professors. There might be a bias here because of the annual (since 2010) “Teachers Tournaments” that are included in the data set. The same goes for Student in second place with regard to the “Teen Tournament.” Both Actor and Musician made the top 20 along with Stay-at-Home Moms and Dads, but Attorney (which includes the keyword “lawyer” in the data set) seems to be a more common career for the average Jeopardy contestant.
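Grouping keywords such as “attorney” and “lawyer” under one career can be done with a small lookup table. The sketch below is one way to do it, not necessarily how the chart was built; the `CAREER_KEYWORDS` groups and the sample notes are hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword groups: each career absorbs its listed keywords.
CAREER_KEYWORDS = {
    "Teacher": ["teacher", "professor"],
    "Attorney": ["attorney", "lawyer"],
    "Student": ["student"],
}

# Made-up introductions in the style of the "notes" field.
notes = [
    "a high school teacher from Denver, Colorado",
    "a law professor from Boston, Massachusetts",
    "a lawyer from Miami, Florida",
    "a graduate student from Madison, Wisconsin",
]


def careers_in(note):
    """Return every career whose keyword appears as a whole word in the note."""
    text = note.lower()
    return [
        career
        for career, words in CAREER_KEYWORDS.items()
        if any(re.search(rf"\b{re.escape(w)}s?\b", text) for w in words)
    ]


career_counts = Counter(c for n in notes for c in careers_in(n))
print(career_counts.most_common())
```

The word boundaries matter: “a law professor” counts toward Teacher but not Attorney, because “law” alone does not match “lawyer.”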
Despite some notable streaks of runaway games in the past decade (the data set covers James Holzhauer’s run), the average score difference between 1st and 3rd place isn’t that large. Only 160 or so points/dollars on average has separated 2nd from 3rd place.
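For anyone who wants to compute place-to-place gaps from games.json, here is a rough sketch. It assumes final scores live in “score1” through “score3” as in the sample record; the second game is invented, and the helper name `place_gaps` is mine.

```python
from statistics import mean

# The first record reuses the scores from the sample above; the second is made up.
games = [
    {"score1": 5, "score2": 5, "score3": 3700},
    {"score1": 12000, "score2": 8000, "score3": 7600},
]


def place_gaps(game):
    """Return (1st minus 3rd, 2nd minus 3rd) after ranking the three scores."""
    first, second, third = sorted(
        (game["score1"], game["score2"], game["score3"]), reverse=True
    )
    return first - third, second - third


gaps = [place_gaps(g) for g in games]
avg_first_third = mean(g[0] for g in gaps)
avg_second_third = mean(g[1] for g in gaps)
print(avg_first_third, avg_second_third)
```

Podium order in the file is not finishing order, which is why the scores are sorted before subtracting.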
Clues
From the 365k+ clues in the data set, there are about 43k categories. That leaves a lot of categories that are unique or close to unique. But here are the most common categories.
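Tallying categories is a one-liner with a counter. The sketch below uses a few invented clue records and assumes that category names should be compared case-insensitively, which may or may not match how the figures above were produced.

```python
from collections import Counter

# Invented clue records; only the "category" key matters here.
clues = [
    {"category": "SCIENCE"},
    {"category": "SCIENCE"},
    {"category": "SHAPE UP!"},
    {"category": "science"},
]

# Assumption: normalize case and whitespace before counting.
category_counts = Counter(c["category"].strip().upper() for c in clues)

top = category_counts.most_common(1)
singletons = [cat for cat, n in category_counts.items() if n == 1]
print(top, singletons)
```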
Since Science is the number one category in Jeopardy, the following analysis shows a subcategory that is the exact opposite: Harry Potter.
Harry himself is the titular character, so the above is skewed heavily in his favor: any clue that names one of the books counts as a mention of Harry. “This old man is the headmaster in Harry Potter and the Chamber of Secrets” is an example that counts both Dumbledore (the correct solution) and Harry himself.
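The skew is easy to reproduce. In this sketch a character is counted whenever the name appears anywhere in the clue or the response, which is my assumption about the counting rule; the `CHARACTERS` list and the second clue are invented.

```python
from collections import Counter

CHARACTERS = ["Harry", "Dumbledore", "Hermione"]  # hypothetical watch list

clues = [
    {
        "clue": "This old man is the headmaster in Harry Potter and the Chamber of Secrets",
        "response": "Dumbledore",
    },
    # Made-up clue for illustration.
    {"clue": "She is the brightest witch of her age", "response": "Hermione Granger"},
]

mentions = Counter()
for row in clues:
    # Search clue and response together, so a book title in the clue counts too.
    text = f"{row['clue']} {row['response']}".lower()
    for name in CHARACTERS:
        if name.lower() in text:
            mentions[name] += 1

print(mentions)
```

The first clue adds one to Harry even though Dumbledore is the answer, which is exactly the double counting described above.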
In the same vein, Disney is both a category and a subcategory of many categories. Below is an almost random Disney character/film count:
In this chart, several of these Disney characters are based on other non-Disney source material and so are skewed. For example, Peter Pan clues may include questions about Cathy Rigby’s Peter Pan (which is not Disney) and Cinderella was a folk tale before the 1950 film. But without Disney, I can’t imagine these characters would make many Jeopardy clues/answers, so they are included nonetheless.
When it comes to Literature and Books & Authors, many questions rely on these categories. American Literature in particular is a seemingly popular subcategory.
It seems that these books made most students’ summer reading lists in high school. I’m just glad to see Great Expectations not among this list (which makes sense since I’m the one that curated it and that was a terrible novel).
When it comes to US Geography, there is a positive correlation between US States’ population and the number of times they are part of a clue or solution:
States with higher populations are more likely to be included in clues or correct responses, and lower-population states, e.g., North Dakota, are much less likely to be brought up on Jeopardy. Here is the correlation:
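For readers who want a number rather than a picture, the strength of a relationship like this is usually summarized with the Pearson correlation coefficient. The sketch below computes it by hand; the populations and mention counts are placeholders I invented, not values from the data set.

```python
from math import sqrt

# Placeholder figures for four hypothetical states (NOT real data):
# population in millions and number of clue/response mentions.
population = [39.0, 29.0, 0.8, 0.6]
mentions = [900, 700, 60, 40]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r = pearson(population, mentions)
print(round(r, 3))
```

A value near 1 means mentions rise almost in lockstep with population; a value near 0 would mean no linear relationship.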
Even more fun is to see the breakdown of country names that are part of a Jeopardy solution. Every. Single. Country on the planet has been represented in a solution on Jeopardy. (Disputed and/or controversial territories, e.g., Taiwan, the West Bank, and Western Sahara, are also represented.) That includes Palau, an island country with a population of only 18k that I still can’t find on this map, but I know it’s there. Of course, a Jeopardy clue about Tokyo that does not explicitly mention Japan will not be included, as this is based on mentions of the country’s name.
More Words
Using Natural Language Processing, or NLP, we can collect all of the clues and solutions, combine each set into a single string (a really long sentence), and then break those strings into individual words and count up the instances of each word. The most common words will be words like what, a, and is. Those are referred to as stopwords and can be removed from our “master counter” that keeps track of the words and their occurrences. This last section is dedicated to NLP.
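The steps above can be sketched with nothing but the standard library. This is a simplified illustration: the stopword set is a tiny hand-picked one (a real run would use a fuller list from an NLP library), the tokenizer is a plain regular expression, and the second clue is made up.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a real analysis would use a much longer list.
STOPWORDS = {"what", "a", "is", "the", "this", "in", "an", "can", "only"}

clues = [
    "This chess piece can only move in an L-shape",
    "This piece is the king",  # made-up clue
]

# 1. Combine everything into one long lowercase string.
text = " ".join(clues).lower()

# 2. Break the string into words, keeping hyphens and apostrophes inside words.
words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text)

# 3. Count the words that are not stopwords.
master_counter = Counter(w for w in words if w not in STOPWORDS)
print(master_counter.most_common(3))
```

Running the same three steps over the responses instead of the clues gives the second table.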
Just for fun, here are the longest single-word, correct solutions for Jeopardy. Correct spelling is not required. Supercalifragilisticexpialidocious is actually in the dictionary.
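Finding the longest single-word responses is a matter of filtering and sorting. In this sketch, “single-word” means the response contains no whitespace, which is my own working definition; the sample list mixes the response from the clues.json example with a few illustrative entries.

```python
# Sample responses; "the knight" comes from the record shown earlier,
# the rest are illustrative.
responses = [
    "the knight",
    "Supercalifragilisticexpialidocious",
    "Palau",
    "antidisestablishmentarianism",
    "Palau",
]

# Keep unique responses with no whitespace, i.e. a single word.
single_words = {r.strip() for r in responses if len(r.split()) == 1}

# Longest first; ties are broken alphabetically so the order is stable.
longest = sorted(single_words, key=lambda w: (-len(w), w))[:2]
print(longest)
```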
Finally, below are the words that occur most on Jeopardy in clues and in correct answers. There is one table for each. It is limited to the top 1000 words sorted by occurrences. You can easily search this collated list (you may need to put the phone down and move to an iPad or computer to do so). Don’t bother searching for curse words. I already did that for you.