I know; it’s been a while since I blogged. This past weekend, I actually received a tweet asking “are you dead?” It’s a little dramatic, but it’s quite true. I’ve been kind of dead on social media for about 4 months, even though I am still alive in person. I’ve just been too busy with other work, which I will tell you about later, but that’s not an excuse for not writing.
To prove that I’m not dead, I am going to start blogging again! Today, I want to pick up where I left off with my mini-series on big data, so let me jump right in because I want to make up for lost time.
Data ≠ Information
Today, we are going to talk about data and information and the difference between them. Although they are different, many people speak of them as if they are synonymous, which is almost never true. However, the difference between data and information is quite subtle, so let’s try to understand it.
Data is simply a record of events that took place. It is the raw record that describes what happened, when, where, how, who was involved, etc. Well, isn’t that informative? Yes, it is!
Data does give you information. However, the fallacy of big data is the belief that more data means “proportionately” more information. In fact, the more data you have, the less information you gain as a proportion of the data. That means the information you can extract from big data shows diminishing returns as your data volume increases. This does seem counterintuitive, but it is true. Let’s see if we can clarify this with a few examples.
Example 1: data backups and copies
If you look inside your computer, you will find thousands of files you’ve created over the years. Whether they are pictures you took, emails you sent, or blogs you wrote, they contain a certain amount of information. These files are stored as data on your hard drive, where they take up a certain amount of storage.
Now, if you are as paranoid as I am, you probably back up your hard drive regularly. Think about what happened when you backed up your hard drive for the first time. In terms of data, you doubled the amount of data you have. If you had 50 GB of data on your hard drive, you would have 100 GB after the backup. But would you have twice the information after the backup? Certainly not! In fact, you gain NO additional information from this operation, because the information in the backup is exactly the same as the information on the original drive.
This happens at the file level too. Each of the thousands of files on your computer contains some fixed amount of information. If you make 100 copies of a file on your computer, you increase the amount of data on your hard drive by 100x the size of the original file. Yet, the amount of information you gain is zero.
Although our personal data is not big data by any means, this example illustrates the subtle difference between data and information: they are definitely not the same animal. Now let’s look at another example involving bigger data.
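The file-copy example can be sketched in a few lines of Python. This is only a rough illustration: it treats “bytes of distinct content” as a crude stand-in for information (my assumption, not a formal measure), and the file names and contents are made up.

```python
import hashlib


def data_volume(files):
    """Total bytes stored, counting every copy."""
    return sum(len(content) for content in files.values())


def unique_volume(files):
    """Bytes of distinct content: identical blobs (same SHA-256 digest) count once."""
    size_by_digest = {}
    for content in files.values():
        size_by_digest[hashlib.sha256(content).hexdigest()] = len(content)
    return sum(size_by_digest.values())


# Hypothetical drive: one original file plus 100 exact copies of it.
original = {"post.txt": b"Data is not information. " * 40}
copies = {f"copy_{i}.txt": original["post.txt"] for i in range(100)}
drive = {**original, **copies}

print("before copying:", data_volume(original), "bytes,", unique_volume(original), "unique")
print("after copying: ", data_volume(drive), "bytes,", unique_volume(drive), "unique")
```

The total volume grows with every copy, while the unique volume stays exactly where it started.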
Example 2: airport surveillance video logs
Firstly, video files are already pretty big. Secondly, closed-circuit television (CCTV) monitoring systems in an airport are on 24/7, and high definition (HD) devices increase the data volume further. Moreover, there are hundreds and probably thousands of security cameras all over the airport. So as you can see, the video logs created by all these surveillance cameras would probably qualify as big data.
Now, what happens when we double the number of camera installations? In terms of data volume, you will again get 2x the data. But will you get 2x the information? Probably not! Many of the cameras are probably seeing the same thing, perhaps from slightly different angles, sweeping different areas at slightly different times. In terms of information content, we almost never get 2x. Furthermore, as the number of cameras continues to increase, the chance of information overlap also increases. That is why, as data volume increases, information shows diminishing returns: more and more of it is redundant.
A simple inequality characterizes this property: information ≤ data. So information is not data; it’s only the non-redundant portion of the data. That is why when we copy data, we don’t gain any information even though the data volume increases: the copied data is redundant.
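To make the overlap argument concrete, here is a toy model in Python. It rests on simplifying assumptions of my own: the airport is split into a fixed number of zones, and each camera sees a fixed number of zones chosen uniformly at random, independently of the other cameras. Real camera placement is not random, so read the numbers as an illustration only.

```python
ZONES = 500            # hypothetical number of distinct areas in the airport
ZONES_PER_CAMERA = 10  # hypothetical number of areas each camera can see


def expected_unique_zones(cameras):
    """Expected number of distinct zones seen by at least one camera."""
    # A given zone is missed by one camera with probability 1 - k/Z,
    # and by all cameras with that probability raised to the camera count.
    p_zone_missed = (1 - ZONES_PER_CAMERA / ZONES) ** cameras
    return ZONES * (1 - p_zone_missed)


for cameras in (50, 100, 200, 400):
    zone_views = cameras * ZONES_PER_CAMERA    # grows linearly, like data volume
    distinct = expected_unique_zones(cameras)  # saturates, like information
    print(f"{cameras:3d} cameras: {zone_views:4d} zone-views, "
          f"{distinct:5.1f} distinct zones, ratio {distinct / zone_views:.2f}")
```

Each doubling of the cameras doubles the zone-views, but the number of distinct zones covered creeps toward the fixed total, so the ratio keeps falling.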
Example 3: updates on multiple social channels
What about social big data, like tweets, updates, and/or shares? If we tweet twice as often, Twitter is definitely getting 2x the data from us. But will Twitter get 2x the information? That depends on what we tweet. If there is absolutely zero redundancy among all our tweets, then Twitter will have 2x the information. But that almost never happens. Let’s think about why.
First of all, we retweet each other. Consequently, many tweets are redundant due to retweeting. Even if we exclude retweets, the chance that we coincidentally tweet about the same content is actually quite high because there are so many tweeters out there. Although the precise wording of each tweet may not be exactly the same, the redundancy among all the tweets containing the same web content (whether it’s a blog post, a cool video, or a news article) is very high. Finally, our interests and tastes in content are not random; they remain fairly consistent over time. Since our tweets tend to reflect our interests and tastes, even apparently unrelated tweets from the same user will have some redundancy, because the tweeter’s interests and tastes are the same.
Clearly, even if we tweet twice as often, Twitter is not going to get 2x the information because there is so much redundancy among our tweets (likewise with updates and shares on other social channels). Furthermore, we often co-syndicate content across multiple social channels. Since this is merely duplicate content across multiple social channels, it doesn’t give anyone extra information about the user.
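As a rough sketch of how quickly tweets collapse once retweets and shared links are accounted for, the Python snippet below groups tweets by the link they share, or by their normalized text when there is no link. The sample tweets, the user name, and the example.com URL are all made up, and this grouping rule is just one simple heuristic of my own; it does not capture the subtler redundancy that comes from a user’s consistent interests.

```python
import re


def content_key(tweet):
    """Crude key for 'what this tweet is about' (a heuristic, not a real measure)."""
    link = re.search(r"https?://\S+", tweet)
    if link:
        # Tweets sharing the same link are treated as the same content.
        return link.group(0).rstrip(".,!?)")
    # Otherwise drop a leading retweet prefix and normalize case and spacing.
    text = re.sub(r"^RT @\w+:\s*", "", tweet, flags=re.IGNORECASE)
    return " ".join(text.lower().split())


# Hypothetical sample of tweets.
tweets = [
    "Great read on big data https://example.com/post",
    "RT @alice: Great read on big data https://example.com/post",
    "Data is not information! https://example.com/post",
    "Lunch was good today",
    "lunch was   good today",
]

distinct = {content_key(t) for t in tweets}
print(len(tweets), "tweets,", len(distinct), "distinct pieces of content")
```

In this sample, three of the five tweets point at the same link and the other two say the same thing, so most of the volume is repetition.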
We’ve seen three examples that illustrate the subtle difference between data and information. Although data does give rise to information, they are not the same. Information is only the non-redundant parts of the data. Since most data, regardless of how it is generated, has lots of built-in redundancy, the information we can extract from any data set is typically a tiny fraction of the data’s sheer volume.
I refer to this property as the data-information inequality: information ≤ data. And in nearly all realistic data sets (especially big data), the amount of information one can extract from the data is much less than the data volume: information << data.
As a data scientist, I certainly recognize the importance of big data. Moreover, data is the foundation and key to most of my research. But even as a data scientist, I must confess that the value of big data is really overrated, because the value of big data is in the information that it can provide. And information is only the non-redundant portion of the data, which is a tiny and diminishing fraction of the overall data volume.
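One possible way to make this inequality precise is to measure information with Shannon entropy. That choice is an assumption on my part, since this post uses “information” informally. Under it, for two data sources X and Y:

```latex
\begin{align}
H(X, Y) &= H(X) + H(Y \mid X) \le H(X) + H(Y), \\
H(X, X) &= H(X) \quad \text{(an exact copy adds nothing, since } H(X \mid X) = 0\text{)}.
\end{align}
```

The first line says the information in two sources together is at most the sum of their separate information, with equality only when they are independent (no redundancy). The second line is the backup example: a copy doubles the data but leaves the information unchanged.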