My colleague Kate Starbird recently shared a very neat study entitled “Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions” (PDF). As she and her co-authors rightly argue, “most Twitter activity during mass disruption events is generated by the remote crowd.” So can we use advanced computing to rapidly identify Twitter users who are reporting from ground zero? The answer is yes.
An important indicator of whether or not a Twitter user is reporting from the scene of a crisis is the number of times they are retweeted. During the Egyptian revolution in early 2011, “nearly 30% of highly retweeted Twitter users were physically present at those protest events.” Kate et al. drew on this insight to study tweets posted during the Occupy Wall Street (OWS) protests in September 2011. The authors manually analyzed a sample of more than 2,300 Twitter users to determine which were tweeting from the protests. They found that 4.5% of Twitter users in their sample were actually onsite. Using this dataset as training data, Kate et al. were able to develop a classifier that can automatically identify Twitter users reporting from the protests with an accuracy of just shy of 70%. I expect that more training data could very well help increase this accuracy score.
In any event, “the information resulting from this or any filtering technique must be further combined with human judgment to assess its accuracy.” As the authors rightly note, “this ‘limitation’ fits well within an information space that is witnessing the rise of digital volunteer communities who monitor multiple data sources, including social media, looking to identify and amplify new information coming from the ground.” To be sure, “For volunteers like these, the use of techniques that increase the signal to noise ratio in the data has the potential to drastically reduce the amount of work they must do. The model that we have outlined does not result in perfect classification, but it does increase this signal-to-noise ratio substantially—tripling it in fact.”
I really hope that someone will leverage Kate’s important work to develop a standalone platform that automatically generates a list of Twitter users who are reporting from disaster-affected areas. This would be a very worthwhile contribution to the ecosystem of next-generation humanitarian technologies. In the meantime, perhaps QCRI’s Artificial Intelligence for Disaster Response (AIDR) platform will help digital humanitarians automatically identify tweets posted by eyewitnesses. I’m optimistic since we were able to create a machine learning classifier with an accuracy of 80%-90% for eyewitness tweets. More on this in our recent study.
One question that remains is how to automatically identify tweets like the one above? This person is not an eyewitness but was likely on the phone with her family who are closer to the action. How do we develop a classifier to catch these “second-hand” eyewitness reports?