Last time, I illustrated the predictive validation framework with a toy problem in which we tried to predict the stock price of Apple. Today we will apply this framework to analyze algorithms that compute people’s influence scores. Since this is the second part of a two-part article, you will need a solid understanding of the first post in order to make sense of today’s discussion. In fact, if you missed any of my past posts on Influence, I’d suggest you read them before moving forward.
To validate any model that influence vendors use to predict someone’s influence, they must have an independent measure of that person’s influence (if you don’t know why this is such an important requirement, go back and re-read part 1). But as we discussed before, nobody has any measured data on influence. So how can influence vendors be sure of the validity of their model?
Although some academic and research labs do properly validate their influence models using good proxies for influence, most influence vendors don't. When asked how they know their model is valid, influence vendors really have only three answers.
1. They don’t know because they don’t validate
They don’t validate their model, because they can’t, since no one has any influence data. So they can never be sure if their model/algorithm is correct!
2. They are not sure, because they overgeneralize their algorithm
Although no one can measure real influence, we do know some influential people (e.g. Barack Obama, because he is the president; Oprah Winfrey, because she is a celebrity and a community leader, etc.). So one can potentially gather these known influencers and see what scores they get from the model/algorithm. This validation procedure works, because these known influencers are truly independent of the model (i.e. the model has no knowledge of their influence a priori). The list of known influencers is also relatively short, constituting a tiny fraction of the population (probably much less than 0.01% of the users on social media). This list is small enough that vendors can rank these known influencers manually and see if their influence scoring algorithm predicts their influence correctly.
Vendors that are doing this are basically validating their algorithm based on the top 0.01% of the most influential people, and then generalizing it to everyone else. The problem with this approach is that even if the model predicts these top influencers correctly (i.e. they get scores that land roughly in the top 0.01% of the population), there is no guarantee that it predicts the influence of the rest of the population correctly. Because the validation sample is heavily biased toward highly influential individuals, the algorithm is pretty much un-validated for people who are not top influencers. That means if you are not a top influencer, your score may be completely wrong. If you get an influence score around, say, the 70th percentile, that score is not validated, and cannot be validated with this sample. The vendor doesn’t know if you really should be at the 70th percentile at all. Maybe you should really be in the 60th percentile or the 80th percentile, but the vendor can never know this for sure, because the algorithm is not validated for these ranges.
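To make the overgeneralization problem concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the hidden `true_influence` values are random numbers (in reality nobody can observe them), and the hypothetical model is deliberately built to rank a handful of known influencers correctly while assigning pure noise to everyone else. It is not any vendor's actual algorithm.

```python
import random

def spearman(xs, ys):
    """Spearman rank correlation (assumes no tied values)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

rng = random.Random(0)
N_USERS = 100_000
N_KNOWN = 10  # 0.01% of users: the hand-picked "known influencers"

# Hidden ground truth. Assumption for the simulation only; unobservable in practice.
true_influence = [rng.random() for _ in range(N_USERS)]
by_truth = sorted(range(N_USERS), key=lambda i: true_influence[i], reverse=True)
known, rest = by_truth[:N_KNOWN], by_truth[N_KNOWN:]

# Hypothetical model: correct for the known influencers, noise for everyone else.
score = [0.0] * N_USERS
for i in known:
    score[i] = 1.0 + true_influence[i]  # ranked above everyone, in the right order
for i in rest:
    score[i] = rng.random()             # unrelated to true influence

# Validation on the known influencers alone looks perfect...
top_by_score = sorted(range(N_USERS), key=lambda i: score[i], reverse=True)[:N_KNOWN]
top_hit_rate = len(set(top_by_score) & set(known)) / N_KNOWN

# ...while the scores for the other 99.99% carry no information at all.
rest_corr = spearman([true_influence[i] for i in rest], [score[i] for i in rest])

print(f"hit rate on known influencers: {top_hit_rate:.2f}")
print(f"rank correlation for everyone else: {rest_corr:+.3f}")
```

In this sketch the model passes the known-influencer check with a perfect hit rate, yet its rank correlation with true influence for the remaining users hovers around zero. A validation sample drawn only from the top cannot tell these two situations apart.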
3. They think they know but actually don’t, because the validation process is circular and flawed
Since nobody has influence data, influence vendors can only validate their algorithm with data that are good proxies for someone’s influence. Moreover, vendors must have this data for an unbiased sample of the social media users (i.e. both influencers as well as the rest of the world) in order to avoid the overgeneralization problem (see above) and properly validate their influence algorithm.
The good news is that good proxies for someone’s influence do exist. Most of them are based on reciprocity, such as retweets, likes, etc. So, as an example, the number of times someone has been retweeted (a.k.a. retweets) can be a proxy for that person’s influence. It is not a true measure of influence, because retweet ≠ influence, but it’s a proxy for it. Since we can get this data for pretty much everyone (influencers or not), we can use this data to validate the influence scoring algorithm without running into the overgeneralization problem.
But there is one more catch! In order to use retweets to validate a model that predicts people’s influence, the retweet data must be independent of the model. That means the model cannot use retweets as input. Otherwise, the model is just cheating and the validation process would be invalid due to circular reasoning. However, most influence models out there do use reciprocity metrics (i.e. retweets, likes, etc.) as input to their algorithm. So they are back to square one: they can’t use these reciprocity data to validate their model anymore, because the data are no longer independent of the model.
Validating a model against the same reciprocity data that went into it seems to work, but it is really based on circular logic that is completely invalid. Although influence vendors may believe their algorithm is validated, they actually haven’t validated anything. So they really don’t know if their score is accurate.
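The difference between circular and independent validation can be sketched with a small simulation. The setup is hypothetical: `true_influence` is hidden, `retweets` and `likes` are two proxies modeled as independent noisy views of it, and the model's score is simply a rescaling of retweets. Treating likes as a proxy held out of the model is an assumption made here for illustration; as noted above, real models often consume both.

```python
import random

def spearman(xs, ys):
    """Spearman rank correlation (assumes no tied values)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

rng = random.Random(1)
N_USERS = 5_000

# Hidden ground truth. Assumption for the simulation only.
true_influence = [rng.random() for _ in range(N_USERS)]

# Two proxies, each an independent noisy view of the hidden influence.
retweets = [t + rng.gauss(0, 0.3) for t in true_influence]
likes = [t + rng.gauss(0, 0.3) for t in true_influence]

# Hypothetical model whose only input is the retweet signal.
score = [2 * r for r in retweets]

# Circular validation: compare the score with its own input.
circular = spearman(score, retweets)

# Independent validation: compare the score with a proxy the model never saw.
independent = spearman(score, likes)

print(f"validated against its own input (retweets): {circular:.3f}")
print(f"validated against a held-out proxy (likes): {independent:.3f}")
```

The circular check reports a perfect correlation no matter how good or bad retweets are as a proxy, because the score is just a monotone function of them. Only the comparison with the held-out proxy says anything about how well the score tracks influence, and in this sketch that number is far from perfect.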
As nobody has explicit data on who influenced whom, all influence scoring is based on models and algorithms. Therefore, the models must be validated against an independent source of data. Otherwise, nobody can be certain of the model’s accuracy. Last time, we illustrated the validation process with a simple example where we tried to predict the stock price of Apple. This time, we applied this validation framework to analyze whether influence vendors sufficiently validate their influence scoring algorithms.
As it turns out, most influence vendors do not sufficiently validate their algorithm for the following reasons:
No Data: They don’t have an independent source of influence data. In fact, no one does.
Overgeneralization: They validate their algorithm based on a handful of known influencers, and then overgeneralize it to millions of users.
Invalid Circular Validation: They use the same reciprocity data (which are good proxies for one’s influence) that went into building the model as the validation data. This is a common error in model validation. The circular process is simply a waste of time, because it gives you no information about the accuracy of the algorithm. To properly validate any model, you must have an independent measure of the outcome, and that means you cannot use that measure as an input to your model.
So should you trust your influence scores? Just ask your influence vendor how they validate their model, and you will know whether you are wasting your marketing budget.
OK, I’ve been talking about influence for the past few weeks. Should we switch to something else? Let me know, and see you next time.