Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.
Alright, just a little announcement about my never-ending speaking engagements before we begin today. I will be speaking tomorrow (March 30th) at SocialTech. I will be talking about how B2B enterprises can leverage the power of gamification, and John Pasquarette from National Instruments (a Lithium client) will be co-presenting with me. SocialTech is happening right now in Seattle, but unfortunately I only have time to fly there tomorrow, speak, and then fly back immediately. Too busy! :-(
Now we can begin. Last time, I showed you the results of a simple analysis I performed on the sentiment data from our social media monitoring (SMM) platform with respect to the presidential candidates for the 2012 election. I was able to demonstrate the predictive power of our data, which is able to predict 93.11% of the data variance in the Gallup data. So why did Attensity’s data only get the election half right? This is an interesting question, so I decided to do a bit more analysis and share my findings here.
Good Data Science Practice: Know the Limit of Your Data
Clearly, Attensity’s data is able to predict the election outcome in some states (e.g. Idaho, Massachusetts, Ohio, and Georgia), as indicated by the relatively high correlation coefficient (cc = 0.91, 0.99, 0.68, and 0.89 respectively). And where Attensity’s data fails to predict the election outcome, the correlation coefficient is relatively low: Oklahoma (cc = -0.48), Tennessee (cc = 0.41), Vermont (cc = -0.03), and North Dakota (cc = -0.10).
However, there are also cases where Attensity was able to predict the election outcome even though the correlation coefficient is relatively low: Alaska (cc = 0.32) and Virginia (cc = 0.44). What this means is that Attensity’s data really can’t predict the distribution of votes in these states, but it happened to pick the winning candidates anyway. In layman’s terms, honestly, it’s just luck.
So how predictive is Attensity’s data set in this prediction exercise?
To address this question, I computed the average correlation coefficient across all 10 states, and the result is cc = 0.403. That means on average Attensity’s data is only able to predict 16.24% of the data variance in the Super Tuesday result.
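The arithmetic behind that figure can be reproduced in a few lines. This is only a sketch that follows the calculation described above (average the ten per-state coefficients quoted in this post, then square the average); the function name `variance_explained` is my own label, not something from any SMM platform.

```python
from statistics import mean

# Per-state correlation coefficients quoted in this post.
state_cc = {
    "Idaho": 0.91, "Massachusetts": 0.99, "Ohio": 0.68, "Georgia": 0.89,
    "Oklahoma": -0.48, "Tennessee": 0.41, "Vermont": -0.03,
    "North Dakota": -0.10, "Alaska": 0.32, "Virginia": 0.44,
}

def variance_explained(cc):
    """Share of variance accounted for by a correlation coefficient (cc squared)."""
    return cc ** 2

avg_cc = mean(state_cc.values())
print(f"average cc = {avg_cc:.3f}")                               # average cc = 0.403
print(f"variance explained = {variance_explained(avg_cc):.2%}")   # variance explained = 16.24%
```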
Now, this is a retrospective analysis, so the computation is relatively simple. But there are ways to estimate the reliability and predictive power of your data with relative sample size and the intra- to inter-state variance ratio. Although there are many reasons (ranging from pure ignorance to willful marketing and PR tactics) for people to release marginally predictive data, I’m an advocate of responsible data practice. Moreover, it is always a good practice to know the limit of your data before making any inference and claims. Otherwise, your result could be very misleading.
After all, what good is analytics if all it does is give you the “illusion” of confidence?
How to Improve Prediction on Election Outcome?
As I alluded to in my previous post, prediction science is a very challenging subject. Not only does it require statistical prowess and technical skills in computing, it also needs expert knowledge in the specific subject matter and a lot of good intuition. I also mentioned that a better model can sometimes improve your visibility in the predictive window (i.e. boost your prediction accuracy). With that said, what can campaign analysts do to improve their predictions?
First, we must recognize that most SMM systems are designed for marketers and PR agencies; they are not built for election campaigns. Therefore, even though information about voter behavior may be implicit in SMM data, the analyses required to extract voters’ preferences are not built into most SMM platforms. Currently, these analyses must be performed by humans, and they can start where SMM leaves off. Since most SMM systems have some form of sentiment analysis, the sentiment data from SMM is a good common starting point.
1. Although voter sentiment is a good indicator of election outcome, raw sentiment data from SMM are not a very accurate reflection of voter sentiment because each individual can tweet multiple times. So the first and most important analysis to infer voter sentiment from SMM’s sentiment data is normalization. We must normalize the sentiments down to a single voter.
For example, Romney may get more positive sentiment on SMM because the voters he engaged with are more vocal. He may have 1M supporters and each of them tweets 10 times a day, giving him 10M positive mentions per day. However, Obama’s supporters may be less vocal even though he may have more of them. He may have 2M supporters, but each of them only tweets once a day, giving him 2M positive mentions per day.
Since Romney has 10M positive mentions and Obama only has 2M, the SMM sentiment data will predict Romney as the winning candidate. However, Obama actually has more supporters. When it comes to voting, it is the number of voters that each candidate gets that matters, not how vocal the voters are. After the ballots are counted, Obama will get 2M votes whereas Romney will only get 1M votes. To accurately predict the election outcome, we must normalize the positive mentions down to the number of unique users.
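Here is a minimal sketch of that normalization, using a scaled-down version of the example (10 and 20 supporters instead of 1M and 2M). It assumes each positive mention arrives as a `(user_id, candidate)` pair, which is a hypothetical format chosen for illustration; real SMM exports will look different. A user who is positive about both candidates is counted once for each.

```python
from collections import Counter

def raw_counts(mentions):
    """Count every positive mention, so vocal users are counted many times."""
    return Counter(candidate for _, candidate in mentions)

def unique_supporters(mentions):
    """Count each user at most once per candidate."""
    return Counter(candidate for _, candidate in set(mentions))

# Hypothetical data: 10 Romney supporters tweeting 10 times each,
# 20 Obama supporters tweeting once each.
mentions = [(f"r{i}", "Romney") for i in range(10) for _ in range(10)]
mentions += [(f"o{i}", "Obama") for i in range(20)]

print(raw_counts(mentions))         # Romney leads on raw mentions: 100 vs 20
print(unique_supporters(mentions))  # Obama leads on unique users: 20 vs 10
```

The raw counts and the normalized counts point to different winners, which is exactly the trap described above.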
2. So what’s next? The obvious next step is to model how online interactions translate to offline actions. Although many users may express positive sentiment for a candidate online via tweeting, sharing, blogging, vlogging, etc., there is no guarantee that any of them will actually vote. It is very possible that many young tweeters can’t even vote.
3. Once we have a good understanding of how online activities translate to offline voting behavior, we still need to model the electoral process. Social media is completely democratic (if we are able to accurately normalize the sentiment data down to individual users). That means social media is a good model of direct democracy. But the US government is not a direct democracy; instead it’s a representative democracy. In this system, 10K voters in California may contribute very differently from 10K voters in Alaska to the final election outcome.
4. To accurately model the indirect election of our electoral process, we must infer the geo-location of each user, because voters can only vote within their electoral districts. However, other than location-based services, which specifically record the user’s geo-location, geo-data is very sparse and not easily inferred.
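To see why steps 3 and 4 matter, here is a toy sketch contrasting a direct count with an indirect one. Everything in it is made up for illustration: the regions, the vote counts, the seat numbers, and the winner-take-all rule (actual allocation rules vary by state and by contest). It also assumes we already know which region each voter belongs to, which is precisely the geo-location problem in step 4.

```python
from collections import Counter

def popular_winner(votes):
    """Direct democracy: add up every vote across all regions."""
    totals = Counter()
    for region_votes in votes.values():
        totals.update(region_votes)
    return totals.most_common(1)[0][0]

def indirect_winner(votes, seats):
    """Indirect election: each region's seats all go to its local winner
    (a winner-take-all assumption made for this sketch)."""
    tally = Counter()
    for region, region_votes in votes.items():
        local_winner = max(region_votes, key=region_votes.get)
        tally[local_winner] += seats[region]
    return tally.most_common(1)[0][0]

# Hypothetical vote counts and seat allocations.
votes = {
    "Region 1": {"A": 90, "B": 10},
    "Region 2": {"A": 40, "B": 60},
    "Region 3": {"A": 45, "B": 55},
}
seats = {"Region 1": 3, "Region 2": 4, "Region 3": 4}

print(popular_winner(votes))          # A (175 votes to 125)
print(indirect_winner(votes, seats))  # B (8 seats to 3)
```

Candidate A wins the overall count while candidate B wins the indirect contest, so a model that stops at normalized sentiment can still call the wrong winner.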
All of these required analyses make predicting elections a science of its own. But keep in mind that the predictive power of our models is still constrained by the predictive window. If we are outside of the predictive window of the data, then any analysis will be futile.
Due to time limitations, I certainly did not do any of these analyses when I was analyzing our SMM data for USA Today. That’s why I was very surprised that it was able to predict the Gallup data so well (i.e. cc = 0.965, which is equivalent to 93.11% of the data variance). I consider this coincidental, or just dumb luck, rather than anything special that I did. If I were to use the same formula for the next election, I probably wouldn’t be quite so lucky.
Prediction is a fool’s game, especially when you don’t have the necessary data. Since SMM platforms are not designed to predict elections, SMM data must be analyzed by humans in order to accurately predict election outcomes. These analyses are typically specific to the domain of quantitative politics. To recap, they are:
Normalize net sentiment from mentions down to unique users
Model how each user’s online activities translate to actual voting offline
Infer or capture geo-location data of all online users
Model the indirect election process of our representative democracy
There are many more analyses that can be done to improve election-outcome prediction; how far you go is basically constrained by time and resources. Without doing any of these analyses, it is unreasonable to expect raw SMM sentiment to predict election outcomes with any accuracy. However, one can get lucky sometimes.
Next time we will return to the topic of big data analytics. But please keep in mind the concept of the predictive window. We will revisit this important concept when we talk about actionable analytics.