Data Mining in the Humanities

Dataset Analysis

Your first blog post is due Tuesday, February 1, before midnight. Please write 250-500 words in which you reflect on a dataset of your choosing.

Please consider the following questions and guidelines as you work on your blog post. Suggestions for datasets and places to look for them follow. Don’t forget to link to where your dataset can be found online. Please include a relevant screenshot of the data that illustrates one or more of the observations you make.


  1. Why did you choose this dataset? What makes it interesting, meaningful or confounding?
  2. What’s the data’s provenance or origin story? Who collected/created it? For what purpose is the data intended?
  3. What are the gaps or limitations of the dataset? What kinds of questions cannot productively be explored or answered using these data?
  4. What are 2-3 questions that you could potentially explore using this dataset? What modifications, if any, would be needed to carry out your plan(s)?
  5. How might you present an argument based on this dataset to an audience? To which audience would you direct your remarks?


The goal of this exercise is to think creatively and critically about your chosen dataset. It is NOT to analyze the biggest dataset. In fact, I’d suggest that you choose “small data” to keep this exercise manageable and to avoid technical glitches, so that you can devote your time to the critical aspect of this exercise.

With that in mind, please note that some of the suggested datasets below could be on the large side. Downloading the data is not a requirement of the assignment, but if you need to download it in order to examine it, look for files under 50 MB. Prefer the CSV (comma-separated values) format where available. A CSV file is basically a spreadsheet that can be opened in Excel, Numbers, or even a text editor. If the dataset is large, look for ways to subset the data meaningfully. For instance, in the “A New Nation Votes” dataset, just look at New Jersey. You can even focus on a table in a Wikipedia article, like the one entitled “National Priorities List” in “List of Superfund sites in New Jersey.” A table is a kind of dataset. Choose a dataset with enough variables (column headers in a spreadsheet) and observations (rows) to support your reflective process and writing. Email me if you’d like help with your selection.
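If you would rather subset a CSV file with a few lines of code than by hand in a spreadsheet, the following Python sketch shows one possible approach. It is entirely optional and is not required for the assignment. The column names (state, year, office, votes) and the sample rows are invented placeholders, not real election returns, and the function name subset_rows is hypothetical; the column headers in your own dataset will differ.

```python
import csv
import io

# Invented placeholder rows standing in for a downloaded CSV file.
# The column names and values are hypothetical, not real data.
SAMPLE_CSV = """state,year,office,votes
New Jersey,1800,Assembly,412
New York,1800,Assembly,655
New Jersey,1804,Governor,1203
Pennsylvania,1804,Governor,980
"""


def subset_rows(csv_file, column, value):
    """Return the column headers and the rows whose `column` equals `value`."""
    reader = csv.DictReader(csv_file)
    # Keep only the observations (rows) that match the chosen value.
    rows = [row for row in reader if row[column] == value]
    return reader.fieldnames, rows


if __name__ == "__main__":
    variables, observations = subset_rows(
        io.StringIO(SAMPLE_CSV), "state", "New Jersey"
    )
    print("Variables (column headers):", variables)
    print("Observations (rows) kept:", len(observations))
    for row in observations:
        print(row)
```

To try this on a file you have downloaded, you could replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline="", encoding="utf-8"), where your_file.csv is a stand-in for your own file name, and change "state" and "New Jersey" to a column header and value that actually appear in your data.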


A New Nation Votes
Dataset of election returns in the early American republic, 1787-1825

middlebury_amsterdam: Data for 2014 Kress Digital Mapping and Art History Summer Institute
Associated blog post: “Mapping Artistic Attention in Amsterdam, 1550-1750”

New Jersey Shipwreck Database

Race Film Database

U.S. News and World Report, “Best Colleges Ranking Criteria and Weights”
Table of criteria and weights used to determine their “Best Colleges” rankings

What’s on the Menu?
Dishes from early 20th-century New York City restaurant menus

New York Philharmonic Performance History
This Performance History database documents all known concerts of the NY Phil, amounting to more than 20,000 performances


Awesome Public Datasets
Thematically organized list of open datasets

Data Planet
Statistical (numeric) datasets, mostly focused on recent years

Kaggle Datasets

NYC Open Data
Official open data repository for New York City

Official open data repository for Philadelphia

Social Explorer
U.S. Decennial Census of population going back to 1790