This post is the first of a two-part article addressing the question: How do you know if your influence score is correct? Today, I won’t actually answer this question, but I will show you a step-by-step procedure that we will use next time to address it.
In my last article, I wrote about the missing link of influence. We talked about the fact that nobody actually has any data on influence (i.e. data that explicitly says who actually influenced whom, when, where, how, etc.). Even influence vendors don’t have this measured influence data. The only measured data they have is people’s social media activity, which they didn’t measure themselves but instead collected from the respective social media platforms.
All influence scores are therefore computed from users’ social activity data based on some models and algorithms of how influence works. However, anyone can create these models and algorithms. So who is right, and who has the best model? More importantly, how can we tell whether your influence score is correct? In other words, how can we validate the accuracy of the models that influence vendors use to compute people’s influence scores?
To illustrate how statistical validation works, I will use a simpler and more tangible example, where we are trying to predict the stock price of a company, let’s say Apple.
Build a Predictive Model of the Stock Market
First, we need to build a model (or an algorithm) which takes various input data about Apple that might be predictive of its stock price. We can pick any data that we feel could potentially affect Apple’s stock price in any way as an input. For example:
- Sales data: units shipped for Apple devices, earnings data from different business units of Apple (e.g. iTunes Store, Apple Store, smartphones, tablets, laptops, etc.). Obviously the stock price should reflect how well the company does in terms of sales
- Fundamental company data: management, debts, liabilities, cash flow, etc. Various ratios that tell you about different aspects of the company’s financial health may be good predictors of its stock price
- Social data: share of voice and sentiments about Apple products and services (e.g. iCloud, etc.). Perhaps social media is indicative of public sentiment towards Apple and can therefore predict its traded volume and, in turn, its price
- Competitor data: all the above data from different competitors of Apple (e.g. Google, Dell, HP, RIM, etc.). Maybe Apple’s stock price will be anti-correlated with the performance of its competitors
- Industry and market data: international and national economic indicators, such as GDP growth rates, inflation, interest rates, exchange rates, productivity, energy prices, various market indices (e.g. S&P 500, Dow Jones, etc.), any industry-wide data on the technology sector, personal computing, and/or mobile phones. Apple will certainly be subject to the same market forces that affect the industry, so maybe its price will follow the industry trend to some extent
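To make the idea of "a model that takes various input data" concrete, here is a minimal sketch in Python. It assumes the simplest possible form, a weighted sum of inputs. The feature names, weights, and numbers are all hypothetical placeholders chosen for illustration; they are not real Apple data, and a real model would learn its weights from historical data rather than have them typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class LinearStockModel:
    """Toy model: predicted price = bias + sum(weight * input value)."""
    weights: dict[str, float]
    bias: float = 0.0

    def predict(self, features: dict[str, float]) -> float:
        # Only inputs that have a weight contribute to the prediction.
        return self.bias + sum(
            weight * features[name] for name, weight in self.weights.items()
        )


# Hypothetical weights, one per kind of input data (made up for illustration).
model = LinearStockModel(
    weights={
        "units_shipped_millions": 2.0,  # sales data
        "sentiment_score": 10.0,        # social data
        "market_index": 0.05,           # industry and market data
    },
    bias=50.0,
)

# Hypothetical input values for one time period (not real measurements).
inputs = {
    "units_shipped_millions": 30.0,
    "sentiment_score": 0.5,
    "market_index": 1200.0,
}

predicted_price = model.predict(inputs)
print(f"Predicted price: {predicted_price:.2f}")
```

Notice that nothing in this sketch tells us whether the prediction is any good. That is exactly the question validation has to answer.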
How Do You Validate the Model You’ve Built?
The important point is that regardless of how much data we put into the model, and how complex and brilliant the model might be in combining these data, the final test of whether the model actually works is whether it can predict the real stock price of Apple. How good a model is has nothing to do with its complexity or how much data it takes into consideration. If it doesn’t predict accurately, the model is no good, regardless of how logical or scientifically sound it is. So prediction accuracy offers an objective and empirical way to validate any statistical model.
There are three requirements to validate any statistical model or algorithm:
1. A model (or algorithm) that produces a predicted outcome
2. An independent measure of the outcome that the model is trying to predict
3. A way to compare the predicted outcome against the independently measured outcome
The most important of these is #2: having an independent measure of the outcome. It is pretty obvious if you think about it. To validate whether your model can accurately predict the stock price of Apple, you must have the actual stock price of Apple, so you can compare the prediction against it.
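As a rough illustration of comparing predictions against an independently measured outcome, the sketch below computes the mean absolute error between a list of predicted prices and a list of actual prices. Mean absolute error is just one common choice of accuracy measure that I am assuming here; the prices are made-up numbers, not real Apple quotes.

```python
def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    """Average size of the gap between each prediction and the measured value."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one prediction to validate")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up numbers for illustration only.
predicted_prices = [100.0, 110.0, 120.0]  # what the model said
actual_prices = [102.0, 108.0, 123.0]     # what was independently measured

error = mean_absolute_error(predicted_prices, actual_prices)
print(f"Mean absolute error: {error:.2f}")
```

The smaller the error, the better the model predicts. Without the second list, the independently measured prices, there is nothing to compare against and the model cannot be validated at all.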
What Does “Independently Measured” Mean and Why Is It So Important?
Many people don’t understand what it means to be “independent.” To be independently measured means the measured outcome is completely independent of the model. In the example of predicting Apple’s stock price, it means you cannot use any of the actual stock price data as part of the input to the model. If you use any actual stock price data as input to a model that is trying to predict the stock price, then it’s obvious that the model would predict the stock price very well, because the model would already have information about the actual stock price. So the actual stock price that you thought you measured independently is no longer truly independent of the model.
Hence the fact that this model is able to predict Apple’s stock price well is meaningless, because it didn’t actually predict anything. After all, it already has the actual stock price that it is trying to predict. This model is basically cheating because it’s based on circular reasoning.
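The circularity is easy to see in a small sketch. Below, a hypothetical "honest" model predicts the price from a sales signal only, while a hypothetical "leaky" model is handed the actual price as one of its inputs. Both models, the sales signal, and the prices are invented for illustration.

```python
def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    """Average size of the gap between each prediction and the measured value."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up data for illustration only.
actual_prices = [400.0, 405.0, 398.0, 410.0]  # the outcome we want to predict
sales_signal = [39.0, 41.0, 40.0, 42.0]       # an input that is independent of the price


def honest_model(sales: float) -> float:
    # Hypothetical rule that uses only independent input data.
    return 10.0 * sales


def leaky_model(sales: float, actual_price: float) -> float:
    # "Cheats": the actual price is fed in as an input, so the model
    # simply hands it back and ignores everything else.
    return actual_price


honest_error = mean_absolute_error(
    [honest_model(s) for s in sales_signal], actual_prices
)
leaky_error = mean_absolute_error(
    [leaky_model(s, p) for s, p in zip(sales_signal, actual_prices)], actual_prices
)

print(f"Honest model error: {honest_error:.2f}")
print(f"Leaky model error:  {leaky_error:.2f}")
```

The leaky model's error is zero, but that says nothing about its ability to predict, because the answer was part of the question. Only the honest model's error is a meaningful validation result.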
Conclusion
Today we illustrated the predictive validation framework through an example of predicting Apple’s stock price. This framework is very general and can be used to validate any model (or algorithm).
To properly validate a model (any model), we must be able to compare the model’s predicted outcome with an independent measure of the outcome. Here, the outcome can be literally anything (e.g. stock price, influence, weather, earthquakes, etc.). I’d like to re-emphasize the importance of having a measure that is truly independent of the model. That means you cannot use this measure anywhere in your model. Otherwise, the validation procedure will be confounded by circular reasoning.
Alright, now you know how to validate any model. Next time, we shall apply this framework to analyze the models that influence vendors use to score people’s influence, and we will be able to answer the question posed at the beginning of this post: How do you know if your influence score is correct?
Stay tuned... Have a warm and relaxing Thanksgiving... And see you next time.
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.
