Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and group behavior in online communities and social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics and its application to Social CRM. He's a regular blogger on the Lithosphere's Building Community blog and previously wrote in the Analytic Science blog. You can follow him on Twitter or Google+.
Time flies. Is this already the fifth article I've written in this analytic science mini-series? Previous posts are compiled below for easy access. If you missed any of them before, now is your chance to catch up.
Although I’ve been talking about big data for a while, I realized that I never really defined it. How big is big? What are the precise criteria for a data set to be considered big data?
If you ask around, most big data practitioners would probably say that big data is any data that is too big to be stored, managed and analyzed via conventional database technologies. So the “data” in big data can really be anything. It doesn’t have to be social media data, and it is certainly not limited to user-generated content. It can be genomic, financial, environmental, or even astronomical. Although this definition is very simple and easy to understand, I didn’t like it, because its meaning actually changes over time.
According to Moore’s law, the speed and storage capacity of computing devices are increasing at an exponential rate. Many data sets that were once too big can now be stored and analyzed easily. So what was once considered big data isn’t big anymore. Likewise, big data today may not be big in the future as computing power continues to increase.
As you can see, it is difficult to pinpoint precisely how big the data needs to be for it to be considered big data; this criterion is a moving target. Rather than trying to define big data, we will take a different approach and try to identify some of its common traits. But keep in mind that these traits are not strict definitions and they do change over time.
The Data Capturing Devices
One of the most obvious characteristics of big data is that the devices for capturing those data are either already ubiquitous or becoming ubiquitous. Examples are cell phones, digital cameras, digital video recorders, etc. When any data capturing device becomes ubiquitous, there is a high probability that whatever data those devices are capturing will eventually become big data. This is pretty obvious, because more data capturing devices translate directly into a proportional increase in data production rate.
Besides the increase in capturing units, there is also an increase in the variety of data sensors and input devices. The GPS and accelerometer on your smartphone capture very different types of information, even though both ultimately produce just a bunch of numbers. There is also an increase in the variety of input devices (i.e. different ways for a device to capture the same type of information). For example, search queries used to be captured strictly via a keyboard; now they can also be captured via any camera equipped with OCR, virtual keyboards on your smartphone or tablet, voice recognition, etc.
The variety of data sensors and input devices not only increases the data production rate, it also produces an explosion of metadata for segmentation. Using the search function as an example, what used to be just queries can now be segmented into queries from computers vs. queries from mobile devices. Those from mobile devices can further be segmented into those that are input via a virtual keyboard vs. camera vs. voice. Likewise, queries can also be segmented according to their geo-location using GPS data. All of this is valuable information that tells us how users are using the search function, and it certainly contributes to the size of big data.
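To make the segmentation idea concrete, here is a minimal sketch that counts search queries by device and by input method. The query log, its field names ("device", "input") and the helper `segment_counts` are all hypothetical, invented purely for illustration; a real search log would look different.

```python
from collections import Counter

# Hypothetical query log; field names and values are made up for illustration.
query_log = [
    {"query": "pizza near me", "device": "mobile", "input": "voice"},
    {"query": "big data definition", "device": "computer", "input": "keyboard"},
    {"query": "menu translation", "device": "mobile", "input": "camera"},
    {"query": "weather", "device": "mobile", "input": "virtual keyboard"},
    {"query": "moore's law", "device": "computer", "input": "keyboard"},
]


def segment_counts(records, *fields):
    """Count records per segment, where a segment is a tuple of field values."""
    return Counter(tuple(record[field] for field in fields) for record in records)


# One level of segmentation: computers vs. mobile devices.
by_device = segment_counts(query_log, "device")

# Two levels: device, then input method within each device.
by_device_and_input = segment_counts(query_log, "device", "input")

print(by_device)
print(by_device_and_input)
```

Each extra piece of metadata adds another field to segment on, so the number of possible segments multiplies even when the number of queries stays the same.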
Increased Data Resolution
Another major contributor to the bigness of big data is that data resolution is increasing rapidly. This is largely a consequence of Moore’s Law, which says that the number of transistors on an integrated circuit (IC) doubles approximately every 2 years. This means higher density CCDs in cameras and recorders, or equivalently higher image resolution. As a result, images and videos will take up more of your storage volume and make your data even bigger.
Many scientific instruments, medical diagnostics, satellite imaging systems, and telescopes benefit tremendously from this increase in spatial resolution. What used to be a blur due to a lack of resolution is now crystal clear. This can mean the difference between finding a star or a planet in a distant galaxy vs. not. And if it is a tumor that we are looking for, this could mean the difference between life and death.
Higher density ICs also mean faster CPUs, which allow you to capture data at a higher sampling rate. This increases the data resolution in a different dimension: time. Increased temporal resolution means that instead of storing 1,800 frames of data for a minute of video (30 fps), you now have to store 3,600 frames for that same minute of video (60 fps). This will certainly make your data bigger, but the benefit can also be huge, especially for time-sensitive data such as financial data, market reaction data, and audience measurements. The difference of a few seconds can mean the difference between making and losing millions of dollars.
Therefore, any data that is experiencing a rapid increase in data resolution (whether it is spatial, temporal or any other dimension) is likely to evolve into big data.
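The frame counts in the video example follow directly from frame rate times duration, and a short sketch shows how doubling the temporal resolution doubles the raw data. The resolution (1920x1080) and pixel depth (3 bytes per pixel) below are my own assumed values for an uncompressed frame, not figures from any particular camera.

```python
def frames_stored(fps: int, seconds: int = 60) -> int:
    """Frames captured in `seconds` of video at `fps` frames per second."""
    return fps * seconds


def raw_bytes(fps: int, seconds: int = 60, width: int = 1920,
              height: int = 1080, bytes_per_pixel: int = 3) -> int:
    """Uncompressed size in bytes; resolution and pixel depth are assumed values."""
    return frames_stored(fps, seconds) * width * height * bytes_per_pixel


for fps in (30, 60):
    gigabytes = raw_bytes(fps) / 1e9
    print(f"{fps} fps: {frames_stored(fps)} frames per minute, "
          f"about {gigabytes:.1f} GB uncompressed")
```

Real video is compressed, so actual file sizes are far smaller, but the proportional effect of a higher sampling rate is the same.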
Super-Linear Scaling of Data Production Rate
Although there are a few more common traits among big data, I will talk about just one more here in the interest of time. I call this property “super-linear scaling of data production rate.”
When the rate of data production scales super-linearly with the number of data producers, the data they create will likely grow rapidly into big data. The key concept here is super-linearity. That means for every incremental addition of data producers, there will be a disproportionately greater increment in the rate of data production.
Super-linear scaling is basically the network effect of data production. This property is particularly relevant to social data, because nearly all social media interactions scale super-linearly with the number of users. For example, if you have 4 users, the number of possible pairwise interactions among them is 6 (see figure 1a). But if the number of users doubles to 8, then the number of potential interactions among them more than doubles; in fact, it more than quadruples to 28 potential interactions (see figure 1b). This is the power of super-linear scaling (a.k.a. the network effect).
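The counts 6 and 28 are the number of distinct pairs among n users, which is n(n-1)/2. The sketch below, with function names of my own choosing, computes this and cross-checks it by enumerating the pairs explicitly. It only counts pairwise interactions, which is an assumption; group interactions would grow even faster.

```python
from itertools import combinations


def potential_interactions(num_users: int) -> int:
    """Number of distinct user pairs: n * (n - 1) / 2."""
    return num_users * (num_users - 1) // 2


def enumerate_pairs(num_users: int):
    """List every distinct pair of users, labeled 0 .. num_users - 1."""
    return list(combinations(range(num_users), 2))


for n in (4, 8, 16):
    # The closed-form count must match the explicit enumeration.
    assert potential_interactions(n) == len(enumerate_pairs(n))
    print(f"{n} users -> {potential_interactions(n)} potential interactions")
# 4 users -> 6 potential interactions
# 8 users -> 28 potential interactions
# 16 users -> 120 potential interactions
```

Each doubling of users roughly quadruples the number of potential interactions, which is what super-linear (here, quadratic) scaling means in practice.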
Because the majority of the social media data are generated through interactions between users, as more users adopt social media, the data production rate will increase super-linearly. That is why if you start capturing any social media data now, it is very likely that it will grow into big data very soon.
Since the precise criterion for “big” data is a moving target, it is useful to examine how “big” data were generated and try to identify the common traits that contribute to their “bigness.” There are at least three major factors that contribute to the bigness of big data.
Ubiquity and variety of data capturing devices for different types of information
Increased data resolution
Super-linear scaling of data production rate with data producers
BTW, my speaking engagement schedule is getting pretty packed. I'll be presenting at two meetings for the Consortium of Service Innovation (CSI) today. I'll be talking about boosting the relevance of internal search algorithms to make support and knowledge content more findable. It's a search algorithm with both social and contextual sensitivity that I've been working on. Then there is another CSI program team meeting March 21-23 in Reston, VA, where I will be talking about the Reputation Model for effective Intelligent Swarming. I am also very honored to be invited to Deloitte University (Westlake, TX) to participate in the ON Social Insights meeting (March 12-13). Lots of travel coming up for me, so I might not have time to write as much as I'd like to. But I'll do my best to keep up my blogging pace.
Alright, now that we know a bit about where big data come from, next time we will take a quick look at the big data processing pipeline. We will look at where the big data go and what analytics/data scientists (like me) do with them. So stay tuned for more big data!