I suppose if I kept up on my reading in the monthly magazine issued by one of my professional organizations, I’d have been able to bring this to the table in our DoOOFI cohort meetings, or post on it in a more timely fashion. I don’t, so I didn’t, but my Wednesday night reading proved timely nonetheless.

A team (of history, English, rhetoric, and engineering professors, plus computer science students and librarians) at Virginia Tech published a piece, “Mining Coverage of the Flu: Big Data’s Insights into an Epidemic,” in the AHA’s Perspectives on History, that I found enlightening. They concede that for historians “accustomed to interpreting the multiple causes of events within a narrative context, exploring the complicated meaning of polyvalent texts, and assessing the extent to which selected evidence is representative of broader trends, the shift toward data mining (specifically text mining) requires a willingness to think in terms of correlations between actions, accept the ‘messiness’ of large amounts of data, and recognize the value of identifying broad patterns in the flow of information.” It’s asking quite a bit, but their measured optimism is, I think, quite reasonable.

Using 20 weekly newspapers from throughout the US, they identified topics (defined by words that frequently appeared together, something I actually worked on a bit as a grad student research assistant) to think about broad patterns in reporting on the disease, including change over time. I don’t see the historical developments they identify as especially groundbreaking (and this recalls what Debra Schleef raised in relation to sociology, where she has seen projects use such methods without accomplishing much that traditional methods couldn’t anyway). But the research team then closely read selected articles to confirm the larger patterns and to further develop their arguments, which I think suggests they treat data mining as a supplementary tool, one in which researchers can build confidence over time as they gain experience and confirm some of what they find by applying more traditional tools as well.

A second component of the project involved identifying the tone of these newspaper reports, which the project could do on a larger scale than individual human readers could manage. Again, the categories of tone they identified (Reassuring, Explanatory, Warning, and Alarmist) weren’t surprising or really new, I don’t think. Nor did the classifier program’s 72% success rate at “correctly” identifying tone seem especially high. Yet the team’s report was cautiously optimistic, noting, “It is therefore potentially valuable as a knowledge discovery technique, but only if it can be refined,” which also suggests this process alone would provide an incomplete understanding. As they say, “Tone classification illustrates the real challenges that the complexity of written language poses for data mining.”

In other words, they’re very much in favor of employing new methods, but advocate their combination with existing methods–and the application of this combined approach to history, and especially to the 1918 influenza epidemic (which I talked about as a pandemic in my US History survey just last week), resonated with me.


4 Thoughts on “Sick of Big Data (or, Studying the 1918 influenza pandemic)”

  1. Jason,

    This is an excellent find, and I really appreciate you blogging about it because I’m very interested in how the analysis of big data is being used in the social sciences and humanities. What’s more, the idea that predicting tone might have far better percentages in the future is interesting as well. Once that number gets to 95%, does that make the establishment of correlation accounting for tone less concerning?

    One of the ideas that came up in other groups is a general discounting of this trend of big data, and while I can sympathize, it’s only going to get more important. Grappling with what it means, and how we plan on interrogating it on a discipline-by-discipline basis, will make for some very interesting discussions over the coming years.

  2. Also, this is a catchy title. I think you’re being a bit hard on yourself in your latest post 🙂

  3. admin on March 3, 2014 at 2:35 am said:

    Thanks, I’ve been trying harder with the titles, so I’m glad it’s better!

    I’m with you on the significance of big data (especially as we get more effective at using it for a broader range of things). Despite my wariness, I don’t think we should discount it, which is part of why I liked the article about the study I found–they use it cautiously, and so far are finding things that I don’t think most historians would otherwise dispute anyway (well, the bit we see in that article–they could be pushing boundaries further in the complete study). I do wonder if the reaction against big data (mine included) is more prompted by a sense that optimism about it is overinflated, rather than an outright refusal to recognize its utility as a tool.

    Should be some interesting disciplinary conversations coming, as you say.
