Since we digressed into the topic of influence over the past month, it’s time to return to big data and talk about another big data fallacy.
In my previous Big Data posts, we discussed the data-information inequality (a.k.a. the Big Data Fallacy): information << data. We talked about what it is, how to quantify it, and why it is the way it is. We delved pretty deeply into some nontrivial concepts and statistical properties of big data, so the discussion got a little mathematical. If you enjoy the technical details, you should have a quick read of the following posts:
Today, I want to talk about the second fallacy of big data and discuss the distinction between information and insights. I promise I won’t go too deep into the statistics. But before I begin, I want to tie up a few loose ends concerning the statistical redundancy in big data.
Statistical Redundancy is not Bad
Although redundancy limits the amount of information we can extract from any data set, it is not inherently bad, because the redundancy in a data set is a direct reflection of the correlation that exists in nature (see Why is there so Much Statistical Redundancy in Big Data?). So you shouldn't try to remove the redundancy from your data. If you do, your data will no longer accurately reflect reality, and the information you extract from it will no longer be useful.
For example, retweets create a lot of redundancy in Twitter's data. They inflate the data volume tremendously and turn Twitter into a big data company, even though the information in Twitter's data is several orders of magnitude smaller. But the redundancy created by retweets reflects the reality that people like some content more than others. If you remove all retweets, you will reduce the redundancy in Twitter's data: the data volume will shrink, and the gap between data volume and information volume will narrow. But then you won't be able to see which content people like more. In fact, you would conclude that all tweets are equal, which is not interesting, not useful, and simply incorrect.
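The retweet example can be made concrete with a toy sketch. Verbatim copies inflate the raw byte count, while a general-purpose compressor (here zlib, used as a rough stand-in for information content) barely grows. The tweet text and the retweet count below are made-up numbers for illustration, not real Twitter data:

```python
import zlib

# One hypothetical original tweet, retweeted 1,000 times verbatim.
original_tweet = b"Breaking: a single piece of original content, 140 chars of it."
retweets = original_tweet * 1000

raw_size = len(retweets)                      # what "data volume" measures
compressed_size = len(zlib.compress(retweets))  # crude proxy for information

# The raw volume grows 1,000x, but the compressed size stays close to
# that of a single tweet: redundancy inflates data, not information.
print(f"raw: {raw_size} bytes, compressed: {compressed_size} bytes")
print(f"inflation factor: {raw_size / compressed_size:.0f}x")
```

Real information content is not the same as zlib's output, of course, but the direction of the gap is the point: near-duplicate records add volume without adding much extractable information.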
Statistical redundancy in big data is neither good nor bad. It inflates the data volume significantly without increasing the actual information content, but it is also a direct reflection of how things operate in nature and should not be eliminated. It is an intrinsic property of all data, including big data. We just have to understand it and live with it.
The Second Fallacy of Big Data: Insight << Information
OK, now we are ready to discuss the second fallacy of big data. The promise of big data is that one could extract lots of information from it and uncover valuable insights. From the data-information inequality, we learned that the total amount of information we can extract from big data is actually much smaller than the raw data volume. Now the question is: what about valuable insights?
Insights are information, but not all information provides insights. There are three criteria for information to provide valuable insights:
If the information fails any one of these criteria, it cannot be a valuable insight. These three criteria successively restrict insights to a tiny subset of the extractable information. Out of a thousand bits of information we extract from big data, we'd be lucky if even one bit is a valuable insight. So, in general, the second fallacy of big data is: insight << information.
This can be combined with the data-information inequality (a.k.a. the first fallacy of big data): information << data.
So both big data fallacies can be summarized in a single inequality: insight << information << data.
So even with big data, the probability of finding valuable insights remains abysmal. This may sound disappointing, but believe it or not, these big data fallacies are actually strong arguments for why we need big data. We just have to look at the inequality from the other side.
Since the amount of valuable insight we can derive from big data is so tiny, we need to collect even more data to increase our chances of finding it. If 1% of the human population are geniuses, you are more likely to find a genius in a random sample of more than 100 people. Unfortunately, the probability of insight discovery is much smaller than 1%, which is why we need petabytes of data and powerful analytics to have any hope of finding that million-dollar insight.
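The sampling argument above can be sketched with the standard at-least-one-success formula, 1 − (1 − p)^n. The insight rates used here are hypothetical numbers for illustration only:

```python
def chance_of_at_least_one(p: float, n: int) -> float:
    """Probability that a random sample of n items contains at least
    one rare item, when each item independently has probability p."""
    return 1 - (1 - p) ** n

# The "genius" example: at p = 1%, a sample of 100 already gives
# roughly a 63% chance of finding at least one.
print(chance_of_at_least_one(0.01, 100))

# At a hypothetical insight rate of one in a million, a sample of
# 100 is hopeless; you need millions of records for a decent chance.
print(chance_of_at_least_one(1e-6, 100))
print(chance_of_at_least_one(1e-6, 1_000_000))
```

The curve makes the "look at the inequality from the other side" point quantitative: the rarer the insight, the more data you need before the odds of finding one become reasonable.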
First, we clarified that statistical redundancy is an intrinsic property of all data. Even though it limits the amount of information we can extract from big data, it is not bad, and we shouldn't try to remove it. Moreover, statistical redundancy reveals the reality of what we are measuring.
The second big data fallacy is the belief that big data will yield a lot of valuable insights. This is not true, because insight << information << data. Insights are information, but information must satisfy three criteria to provide valuable insights:
These criteria imply that insights are a much smaller subset of information. Although big data cannot guarantee the revelation of many insights, increasing the data volume does increase the odds of finding them.
Next time, we will examine these three criteria more carefully, so we know where to look within big data for insights. In the meantime, let's have an open discussion about the path from data to information to insights. If you have an inspirational story about how you discovered insights from data, feel free to share it here.