# Are We in Kansas Anymore? Judging the State of Hollywood Film with Data from Wikipedia

I’ve been watching a lot of movies lately. Last month, I was on a Grisham adaptation kick. For younger readers who may be unfamiliar with John Grisham, the author wrote several bestselling courtroom dramas in the 90s. Many of his books are set in the Deep South, and they often involve a young, idealistic lawyer who bravely confronts the corrupt, white-male-dominated institutions of a southern city (usually Memphis). Before my run of Grisham adaptations, I was watching a lot of John Hughes movies, many of which are set in the Midwest. Planes, Trains, and Automobiles, directed by Hughes, takes the audience through rural Kansas and Missouri before concluding in the suburbs of Chicago, where many Hughes movies end. Anyway, my forays into Grisham adaptations and Hughes films led me to wonder: are fewer movies set in rural and smalltown America today than in the 80s and 90s? In other words, are we in Kansas anymore?

This question led me to examine the broader question of how Hollywood film has changed over the past few decades. In this post, I look at several important trends related to this subject, including the changing relationship between genre and movie box office returns, shifts in the representation of men and women among movies’ top-billed actors, and a whole lot more. I conduct these analyses using data I collected through Wikipedia’s APIs. The data consists of 9,712 movies released in the United States between 1980 and 2019. You can download it in its entirety from my GitHub page at github.com/datadiarist/large_files/blob/master/movie_metadata_tbl.rds.

# Data

One challenge with collecting movie data from the internet is that the two largest sources of online movie data, Rotten Tomatoes and the Internet Movie Database, do not allow web scraping and have limited APIs. Wikipedia, on the other hand, has a comprehensive set of APIs that allows users to collect pretty much anything from the site, even content from previous versions of Wikipedia pages. The catch is that Wikipedia is a database that relies on user-generated content. One consequence of this is that the data is fairly unstandardized. For instance, movie pages provide box office information in many different formats: $100 million, 100,000,000, 100 million dollars, and so forth. I won’t go into the gory details of pulling and preprocessing this data here. It’s a lot of sprawling conditional statements and regular expression syntax. I may devote a future post to the many edge cases one encounters when working with Wikipedia data.

My sample of movies comes from a group of pages that all have the headline “List of American films of [a year]”. Each of these pages has tables with movie titles and links to their pages. By drawing from these, I collected a list of names and links for 9,712 movies.

Next, I pulled information from the infobox of each movie page. This appears in the upper-right corner of the page. Here’s what the infobox looks like for Next, a timeless cinematic masterpiece starring Nicolas Cage as a small-time magician who can see the future, but only two minutes into the future (exactly two minutes). For each movie, I collected the release date, box office, budget, runtime, directors, and top-billed actors from the infobox.

I also gathered links to the pages of top-billed actors in each movie. Each actor page has categories, which provide information that can be used to infer actor gender and race/ethnicity. For example, take a look at the categories associated with the page of Tory Kittles (a Next co-star).
This tells us that Tory Kittles is a black male born in 1975.

I also collected additional information on movies by examining the main body of movie pages. Most movie pages have a “Critical Reception” section that has a movie’s Rotten Tomatoes score and the number of reviews on which this score is based. I also extracted movie genre from the introduction of each movie page. This bit of information almost always comes in the first sentence of the article, right before the first instance of the word “film” or “movie”. Finally, I used a set of rules for extracting where the film was set from the film synopsis.

Let’s have a look at the column names of the movie data.

colnames(movie_metadata_tbl)
##  [1] "name"            "name_lab"        "director"
##  [4] "director_link"   "genre_cat"       "runtime"
##  [7] "budget"          "budget_adj"      "box_office"
## [10] "box_office_adj"  "profit_adj"      "profit_lab"
## [13] "review"          "num_review"      "date"
## [16] "year"            "month"           "day"
## [19] "year_fin"        "cast"            "cast_link"
## [22] "cast_race"       "cast_gender"     "cast_age"
## [25] "cast_age_gender" "cast_bday"       "tot_white"
## [28] "tot_black"       "tot_hisp"        "tot_asian"
## [31] "white_prop"      "black_prop"      "hisp_prop"
## [34] "asian_prop"      "race_tots"       "tot_man"
## [37] "tot_woman"

This dataset has movie name, director and director link, genre, runtime, budget and box office information, Rotten Tomatoes review information, and release date information. After that, there is a set of columns that are nested lists containing data on top-billed actors in each movie. These nested lists contain actors’ names, links to their Wikipedia pages, race, gender, age, birthday, and more. Finally, there are several columns of movie-level actor data, including the proportion of top-billed actors who are black and the total number of women among top-billed actors.

Let’s start with some exploratory data analysis.
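One aside before the exploratory analysis. The preprocessing code is not shown in this post, but a rough sketch of two of the steps described above might look something like this. To be clear, this is not the code used to build the dataset: `parse_box_office()` and `extract_genre()` are hypothetical names, the short genre vocabulary is an assumption, the example sentence is made up, and the real pipeline has to handle far more edge cases.

```r
# Hypothetical helper: turn a free-text box office string into dollars.
# Only covers the formats mentioned above ("$100 million", "100,000,000",
# "100 million dollars"); real infoboxes are much messier.
parse_box_office <- function(x) {
  x <- tolower(trimws(x))
  # Grab the first number, allowing thousands separators and a decimal part.
  num_str <- regmatches(x, regexpr("[0-9][0-9,]*(\\.[0-9]+)?", x))
  if (length(num_str) == 0) return(NA_real_)
  value <- as.numeric(gsub(",", "", num_str))
  # Scale by a magnitude word if one is present.
  if (grepl("billion", x, fixed = TRUE)) {
    value <- value * 1e9
  } else if (grepl("million", x, fixed = TRUE)) {
    value <- value * 1e6
  }
  value
}

# Hypothetical helper: look for known genre words in the text that comes
# before the first "film" or "movie" in an article's opening sentence.
# The default vocabulary is an assumption, not the author's genre scheme.
extract_genre <- function(first_sentence,
                          genres = c("action", "comedy", "drama",
                                     "horror", "thriller")) {
  s <- tolower(first_sentence)
  pos <- as.integer(regexpr("\\b(film|movie)\\b", s, perl = TRUE))
  if (pos == -1) return(character(0))
  before <- substr(s, 1, pos - 1)
  found <- vapply(genres, function(g) grepl(g, before, fixed = TRUE),
                  logical(1))
  genres[found]
}

# All three box office formats should map to the same number.
sapply(c("$100 million", "100,000,000", "100 million dollars"),
       parse_box_office)

# A made-up opening sentence in the usual Wikipedia style.
extract_genre("Example Movie is a 1995 American courtroom drama film directed by Jane Doe.")
```

The sketch sticks to base R so it runs without extra packages; the same logic could be written with stringr.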
By sorting the data by the box office variable and taking the top ten entries, we can see the ten highest-grossing Hollywood movies according to the data.

movie_metadata_tbl %>% arrange(desc(box_office)) %>%
slice(1:10) %>% pull(name_lab)
##  [1] "Avengers: Endgame"            "Avatar"
##  [3] "Titanic"                      "Star Wars: The Force Awakens"
##  [5] "Avengers: Infinity War"       "Jurassic World"
##  [7] "The Lion King"                "The Avengers"
##  [9] "Furious 7"                    "Avengers: Age of Ultron"

Sure enough, these are the highest-grossing movies of all time before adjusting for inflation. Let’s see how this list compares to an inflation-adjusted list of highest-grossing films.

movie_metadata_tbl %>% arrange(desc(box_office_adj)) %>%
slice(1:10) %>% pull(name_lab)
##  [1] "Titanic"                      "Avatar"
##  [3] "Avengers: Endgame"            "Star Wars: The Force Awakens"
##  [5] "E.T. the Extra-Terrestrial"   "Avengers: Infinity War"
##  [7] "Jurassic Park"                "Jurassic World"
##  [9] "The Avengers"                 "The Empire Strikes Back"

Adjusting for inflation vaults James Cameron to the top of the list with Titanic and Avatar. We also see more of the old guard of blockbuster directors, such as Spielberg and Lucas, in this inflation-adjusted list.

Let’s try to find some weirder kinds of outliers in this data. Turning to runtime, I pull the longest and shortest movies from the data.

paste("Longest:", movie_metadata_tbl %>% arrange(desc(runtime)) %>%
pull(name_lab) %>% .[1])
## [1] "Longest: The Cure for Insomnia"
paste("Shortest:", movie_metadata_tbl %>% arrange(runtime) %>%
pull(name_lab) %>% .[1])
## [1] "Shortest: Luxo Jr."

The Cure for Insomnia is an 87-hour-long experimental film that consists of an artist reading a 4,080-page poem. It held the Guinness record for longest film before being supplanted by a non-American movie. Luxo Jr. is a 2-minute-long animated film released by Pixar in 1986.
It used computer-based technology that was groundbreaking at the time and was the first CGI movie to be nominated for an Oscar (it was nominated for Best Animated Short).

We can also look at which actors appear most in the data.

movie_metadata_tbl$cast_link %>% unlist %>% table %>%
sort %>% .[(length(.) - 4):length(.)]
## .
## /wiki/Christopher_Walken       /wiki/Nicolas_Cage     /wiki/Robert_De_Niro
##                       62                       65                       65
##       /wiki/Bruce_Willis  /wiki/Samuel_L._Jackson
##                       67                       76

It turns out that Samuel L. Jackson is the hardest working actor in show business, with 76 top billings since 1980. Jackson has this distinction on lock, holding a nine-film lead on Unbreakable co-star Bruce Willis.
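For readers less used to list columns, here is a small sketch of the same flatten-count-sort idea on toy data. The `cast_link` list and the actor slugs below are invented stand-ins for the real `movie_metadata_tbl$cast_link` column, and `head()` on a descending sort is just one alternative to the index arithmetic used above.

```r
# Toy stand-in for movie_metadata_tbl$cast_link: one character vector of
# actor page links per movie. The links are placeholders, not real data.
cast_link <- list(
  c("/wiki/Actor_A", "/wiki/Actor_B"),
  c("/wiki/Actor_A", "/wiki/Actor_C"),
  c("/wiki/Actor_A", "/wiki/Actor_B", "/wiki/Actor_D")
)

# Flatten the list into one vector, count each link, and sort the counts
# from most to fewest appearances.
billing_counts <- sort(table(unlist(cast_link)), decreasing = TRUE)

# Keep the two most frequently billed actors.
top_two <- head(billing_counts, 2)
print(top_two)
```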

What other amusing outliers can we find in the data? How about worst movie of all time? I get this by filtering the data to movies that have received more than 40 Rotten Tomatoes reviews and sorting by Rotten Tomatoes score.

movie_metadata_tbl %>% filter(num_review > 40) %>%
arrange(review) %>% pull(name) %>% .[1:10]
##  [1] "Pinocchio_(2002_film)"
##  [2] "National_Lampoon%27s_Gold_Diggers"
##  [3] "One_Missed_Call_(2008_film)"
##  [4] "A_Thousand_Words_(film)"
##  [5] "Gotti_(2018_film)"
##  [6] "The_Master_of_Disguise"
##  [7] "Twisted_(2004_film)"
##  [8] "Alone_in_the_Dark_(2005_film)"
##  [9] "Daddy_Day_Camp"
## [10] "Disaster_Movie"

These movies all received either a 0% or 1% on Rotten Tomatoes (again, based on more than 40 reviews). There are some derivative horror movies (One Missed Call, Alone in the Dark) and tasteless comedies (Disaster Movie, National Lampoon’s Gold Diggers) here. We also see movies that have ended careers (Roberto Benigni as Pinocchio in Pinocchio, Cuba Gooding Jr. in Daddy Day Camp). My favorite on this list is Dana Carvey’s incredibly misguided attempt to capitalize on the success of Mike Myers’s Austin Powers with The Master of Disguise.

There are many other interesting bits of information one can find in this data, and I encourage you to download the data yourself to answer some of your own questions. In the next section, I examine some broader patterns in the data.

# Movie Locations

At the beginning of this post, I promised I would evaluate whether smalltown and rural America are less well represented in Hollywood today than they were in the past. Well, my analyses did not show a meaningful change in the proportion of movies that are set in rural/smalltown areas. However, I did find some other interesting patterns.

I visualize these here with a regional map of the U.S. The purple points on the map represent the highest-grossing movies ever set in specific locations across the country. These points can be interpreted as reflecting how these locations are represented in popular culture. For instance, Magic Mike is the highest-grossing film ever set in Tampa, Florida, a distinction that reinforces the popular association between Florida and everything trashy in American culture (see the portrayal of Florida in The Good Place for another example of this). Other movies reinforce more harmless stereotypes about the places where they are set. For example, Snow Day is the highest-grossing movie ever set in Syracuse, a city that competes with Buffalo every year for the distinction of getting the most snow.
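Picking the highest-grossing movie for each location is a simple group-wise maximum. Here is a minimal sketch on made-up data. The `setting` column, the film names, and the numbers are all assumptions for illustration; the published dataset does not necessarily store settings this way.

```r
# Made-up data: four films across two settings, with arbitrary grosses.
toy_movies <- data.frame(
  name_lab = c("Film A", "Film B", "Film C", "Film D"),
  setting = c("Tampa", "Tampa", "Syracuse", "Syracuse"),
  box_office_adj = c(50, 120, 30, 10),
  stringsAsFactors = FALSE
)

# Sort from highest to lowest gross, then keep the first row seen for
# each setting, which is that setting's top-grossing film.
ordered <- toy_movies[order(toy_movies$box_office_adj, decreasing = TRUE), ]
top_by_setting <- ordered[!duplicated(ordered$setting), ]

print(top_by_setting[, c("setting", "name_lab")])
```

With dplyr, the same result could be had by grouping on the setting and taking the row with the largest gross in each group.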

Click on a U.S. region to see the distribution of movies by genre in that region. These distributions are presented as waffle charts. Hover over a square to see the name and settings of a particular movie. Click the button in the top right to return to the map. This can take a few seconds.

For most of the regions, comedies and dramas are more numerous than other types of movies. This may be a function of my genre classification scheme. However, there are a few differences among regions that are of interest. For example, New England has a noticeably higher proportion of horror movies than other regions of the country. This is likely due to the fact that Stephen King comes from Maine. Most of King’s books are set in New England, and his books are among the most frequently adapted to film of any author’s. Having grown up in New England, I can attest to the creepiness of this part of the country. The South has a disproportionate number of dramas. Many of these are historical films, courtroom dramas, and adaptations of works of Southern Gothic literature. We see more levity in the Midwest, where comedies are the most popular genre (e.g. Adventures in Babysitting, Home Alone, Ferris Bueller’s Day Off). My interpretation of this is that comedies try to make their characters as relatable as possible. In the 80s and 90s, one way to do this was to set your movie in middle America. This may no longer be true today, however.
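The regional comparisons above come down to the share of each genre within a region. A minimal sketch of that calculation, using a handful of invented rows, is below. The `region` column is an assumption (it is not among the column names listed earlier), and the counts are placeholders rather than results from the real data.

```r
# Invented rows: each one is a movie with a region and a genre label.
toy_genres <- data.frame(
  region = c("New England", "New England", "New England",
             "South", "South",
             "Midwest", "Midwest", "Midwest"),
  genre_cat = c("horror", "horror", "drama",
                "drama", "drama",
                "comedy", "comedy", "drama"),
  stringsAsFactors = FALSE
)

# Cross-tabulate region by genre, then convert counts to row proportions
# so that each region's shares sum to 1.
genre_counts <- table(toy_genres$region, toy_genres$genre_cat)
genre_share <- prop.table(genre_counts, margin = 1)

print(round(genre_share, 2))
```

Comparing rows of `genre_share` is what makes it possible to say one region leans more toward a genre than another, regardless of how many movies each region has overall.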

# Recap

That concludes our journey through forty years of Hollywood film. I hope you learned a thing or two. All of the data for this project is publicly available on my GitHub. Please do not hesitate to contact me if you have any questions about how I created this dataset or the plots.