Since we digressed into the topic of influence over the past few weeks, it’s time to return to big data and talk about another big data fallacy.
In my previous Big Data posts, we discussed the data-information inequality (a.k.a. the Big Data Fallacy): information << data. We talked about what it is, how to quantify it, and why it is the way it is. We delved pretty deeply into some nontrivial concepts and statistical properties of big data, so the discussion got a little mathematical. However, if you enjoy the technical details, you should have a quick read of the following posts:
Today, I want to talk about the second fallacy of big data and discuss the distinction between information and insights. I promise I won’t go too deep into the statistics. But before I begin, I want to tie up a few loose ends concerning the statistical redundancy in big data.
Last time, I illustrated the predictive validation framework with a toy problem in which we tried to predict the stock price of Apple. Today we will apply this framework to analyze algorithms that compute people’s influence scores. Since this is the second part of a two-part article, you will need a solid understanding of the first post in order to make sense of today’s discussion.
To validate any model they use to predict someone’s influence, influence vendors must have an independent measure of that person’s influence. But as we discussed before, nobody has any measured data on influence. So how can influence vendors be sure of the validity of their models?