Data Mining in the Humanities
Feb 1, 2022 • 2 min read

Blog post #1

https://www.kaggle.com/bidyutchanda/top-10-highest-grossing-films-19752018

I chose this dataset because I’m a fan of movies, but it isn’t always clear what exactly qualifies a movie as “good.” While awards and accolades are one useful metric, how much money a movie made is arguably more useful, since it gives an objective measure of which movies were the most popular. This dataset looks especially promising because it has many columns that can be used for different analyses, like the “imdb_rating” column, which makes it possible to compare popularity with critic rating.
The origin story isn’t well defined: the Kaggle description says the data was taken from a Crowdflower dataset, which was then cleaned up and updated for smoother use. Crowdflower notes that the dataset came from a “data categorization job,” so most likely it was put together for some user’s machine learning or data science project.
The dataset has a couple of limitations. It makes no mention of country of origin, or of something more ambitious like the lead actors of each movie, so anyone asking questions about these topics would not be able to draw useful conclusions from it. Furthermore, while it may seem obvious, the dataset does not include movies released before 1975 or after 2018.
However, there are many questions that can be explored with this dataset. For instance, one can compute the mean length or the mean IMDb rating for the whole set. The former could be a starting point for seeing whether runtime plays a significant role in popularity, and the latter for seeing whether there is a strong correlation between popularity (what the public enjoys) and rating (what is deemed good by critics). One could also look into which studios tend to dominate in terms of monetary returns. In all three of these cases, there isn’t much cleaning or restructuring to be done, as the dataset doesn’t have any missing entries.
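As a rough sketch of how those three questions could be approached, here is a small Python example that uses only the standard library. Apart from “imdb_rating,” every column name here (“studio,” “length,” “worldwide_gross”) is an assumption about how the CSV is laid out, and the four rows are invented placeholders rather than real entries from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented sample standing in for the Kaggle CSV. Only "imdb_rating" is a
# column name confirmed by the post; the other names and all values are
# assumptions made for illustration.
SAMPLE_CSV = """title,studio,length,imdb_rating,worldwide_gross
Film A,Studio X,120,7.5,900
Film B,Studio Y,95,6.0,400
Film C,Studio X,150,8.5,1100
Film D,Studio Z,105,6.5,500
"""


def load_rows(handle):
    """Read every CSV row into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def gross_by_studio(rows):
    """Total gross per studio, sorted from highest to lowest."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["studio"]] += float(row["worldwide_gross"])
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


rows = load_rows(io.StringIO(SAMPLE_CSV))
lengths = [float(r["length"]) for r in rows]
ratings = [float(r["imdb_rating"]) for r in rows]
gross = [float(r["worldwide_gross"]) for r in rows]

mean_length = mean(lengths)               # average runtime
mean_rating = mean(ratings)               # average IMDb rating
rating_vs_gross = pearson(ratings, gross) # popularity vs. rating
studio_totals = gross_by_studio(rows)     # which studios earn the most

print(mean_length, mean_rating, round(rating_vs_gross, 3), studio_totals)
```

To run this against the real file, the `io.StringIO(SAMPLE_CSV)` argument would be swapped for an opened file handle, and the column names adjusted to whatever the actual CSV header uses.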
If I were advising an up-and-coming actor or voice actor, as a friend or as a manager, who wanted to aim high and make a lot of money, I could use analyses from this dataset to show them, through a bar graph for example, that PG-13 and PG rated movies tend to generate more money among the top-grossing movies than movies with other MPAA ratings.
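A bar graph like that could be sketched in a few lines of Python. The example below is only an illustration: the (rating, gross) pairs are made-up numbers, not values from the dataset, so the output shows the shape of the analysis and is not evidence for the PG-13 and PG claim. The chart is drawn as plain text so that no plotting library is needed.

```python
from collections import defaultdict
from statistics import mean

# Invented (MPAA rating, gross in millions) pairs used purely as placeholders;
# the real values would come from the dataset's rating and gross columns.
SAMPLE = [
    ("PG-13", 800.0),
    ("PG", 600.0),
    ("R", 300.0),
    ("PG-13", 1000.0),
    ("G", 400.0),
    ("PG", 500.0),
]


def mean_gross_by_rating(records):
    """Average gross for each MPAA rating."""
    groups = defaultdict(list)
    for rating, gross in records:
        groups[rating].append(gross)
    return {rating: mean(values) for rating, values in groups.items()}


def text_bar_chart(averages, width=40):
    """Render one '#' bar per rating, longest bar first."""
    top = max(averages.values())
    lines = []
    ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
    for rating, value in ranked:
        bar = "#" * round(width * value / top)
        lines.append(f"{rating:<6}{bar} {value:.0f}")
    return "\n".join(lines)


averages = mean_gross_by_rating(SAMPLE)
chart = text_bar_chart(averages)
print(chart)
```

Averaging per rating, as opposed to summing, keeps a rating with many entries from looking stronger just because it appears more often; either choice could be reasonable depending on the question being asked.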

Guest post by: Neel S.