I suppose if I kept up on my reading in the monthly magazine issued by one of my professional organizations, I’d have been able to bring this to the table in our DoOOFI cohort meetings, or post on it in a more timely fashion. But I don’t, so I didn’t; my Wednesday night reading proved timely nonetheless.
A team at Virginia Tech (history, English, rhetoric, and engineering professors, plus computer science students and librarians) published a piece in the AHA’s Perspectives on History, “Mining Coverage of the Flu: Big Data’s Insights into an Epidemic,” that I found enlightening. They concede that for historians “accustomed to interpreting the multiple causes of events within a narrative context, exploring the complicated meaning of polyvalent texts, and assessing the extent to which selected evidence is representative of broader trends, the shift toward data mining (specifically text mining) requires a willingness to think in terms of correlations between actions, accept the ‘messiness’ of large amounts of data, and recognize the value of identifying broad patterns in the flow of information.” That’s asking quite a bit, but their measured optimism strikes me as reasonable.
Using 20 weekly newspapers from throughout the US, they identified topics (defined by words that frequently appeared together–something I actually worked on a bit as a grad student research assistant) to trace broad patterns in reporting on the disease, including change over time. I don’t see the historical developments they identify as especially groundbreaking (which recalls a point Debra Schleef raised about sociology, where she has seen projects use such methods without accomplishing much that traditional methods couldn’t). But the research team then closely read selected articles to confirm the larger patterns and to develop their arguments further, which suggests they treat data mining as a supplementary tool–one in which researchers can build confidence over time as they gain experience and confirm some of what they find by applying more traditional tools as well.
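For readers curious what “words that frequently appeared together” looks like in practice, the core idea can be sketched with nothing fancier than word-pair counting. To be clear, everything below–the three-snippet mini-corpus, the two-document threshold–is invented for illustration, and the team’s actual topic modeling was far more sophisticated than this:

```python
from collections import Counter
from itertools import combinations

# Hypothetical miniature "corpus" standing in for newspaper articles.
docs = [
    "influenza spreads through city schools close",
    "city schools close as influenza cases rise",
    "doctors urge masks as influenza spreads",
]

# Count word pairs that co-occur within the same document; pairs that
# recur across documents hint at a shared topic, in the loose sense
# of words that frequently appear together.
pair_counts = Counter()
for doc in docs:
    words = sorted(set(doc.split()))  # unique words, in stable order
    pair_counts.update(combinations(words, 2))

# Keep pairs seen in at least two documents (an arbitrary threshold).
frequent = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(frequent)
```

Here pairs like ("influenza", "spreads") and ("city", "schools") surface as candidate topic words because they recur across documents, while one-off pairings drop out. Real topic modeling replaces the crude threshold with a statistical model, but the intuition is the same.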
A second component of the project involved identifying the tone of these newspaper reports, which the project could do on a larger scale than individual human readers could manage. Again, the tone categories they identified–Reassuring, Explanatory, Warning, and Alarmist–weren’t especially surprising or new, I don’t think. Nor did the classifier program’s 72% success rate at “correctly” identifying tone seem especially high. Yet the team’s report was cautiously optimistic, noting, “It is therefore potentially valuable as a knowledge discovery technique, but only if it can be refined,” which also suggests this process alone would provide an incomplete understanding. As they say, “Tone classification illustrates the real challenges that the complexity of written language poses for data mining.”
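To make the tone-classification step a bit more concrete, here is a toy rule-based classifier. The four category names come from the article, but the keyword rules, the sample snippets, and the accuracy calculation are all invented for illustration; the article does not describe the team’s classifier at this level of detail, and their 72% figure came from a real model evaluated against human labels, not from anything this simple:

```python
# Hypothetical keyword rules for each tone category (the category names
# are from the article; the keywords are made up for this sketch).
TONE_KEYWORDS = {
    "Alarmist": {"panic", "deadly", "terror"},
    "Warning": {"caution", "avoid", "risk"},
    "Reassuring": {"calm", "safe", "improving"},
}

def classify_tone(text):
    """Assign the first tone whose keywords appear; default to Explanatory."""
    words = set(text.lower().split())
    for tone, keys in TONE_KEYWORDS.items():
        if words & keys:
            return tone
    return "Explanatory"

# Invented hand-labeled snippets, standing in for human-coded articles.
labeled = [
    ("deadly epidemic spreads panic in the city", "Alarmist"),
    ("officials urge caution and avoid crowds", "Warning"),
    ("conditions are improving and hospitals are safe", "Reassuring"),
    ("the virus is transmitted by respiratory droplets", "Explanatory"),
    ("stay calm say physicians", "Reassuring"),
]

# Accuracy is just the share of snippets the rules label the same way
# a human coder did -- the same kind of score as the team's 72%.
correct = sum(classify_tone(text) == tone for text, tone in labeled)
accuracy = correct / len(labeled)
print(f"accuracy: {accuracy:.0%}")
```

Even this toy version hints at why written language is hard for such tools: a single headline can mix reassurance and warning, and keyword overlap forces an arbitrary tie-break, which is exactly the kind of messiness a refined classifier has to handle.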
In other words, they’re very much in favor of employing new methods, but advocate their combination with existing methods–and the application of this combined approach to history, and especially to the 1918 influenza epidemic (which I talked about as a pandemic in my US History survey just last week), resonated with me.