Now that SxSW interactive is over, it’s time to get back and do some serious business. For me, that means I’ll return to the world of big data. But let me tell you a little secret: although I work with big data all the time, I never actually look at any big data, because big data isn’t made for human consumption.
No one can make sense of petabytes of data by examining them directly, not even analysts or data scientists. You can’t even plot them on a monitor, because even the highest-resolution monitors are nowhere near a petapixel. We may look at several small samples of the big data during exploratory data analysis (EDA), but that’s not big data per se, since each sample is just a tiny fraction of the whole. Frankly, I don’t know anyone who actually looks through an entire big data set with the naked eye. Instead, we apply many sophisticated analytics to big data and let our computers crunch it down to consumable digests. Then we look at the results of these analyses, and that’s where we spend most of our time.
From Exploratory Play to Statistical Rigor
In my previous big data post, we discussed the first step in analyzing any complex data set: EDA. This is one of the most important steps in data analysis. However, it is often not given the attention it deserves, because the result of EDA is rarely the end result that businesses want. Rather than helping the business answer a question, EDA often creates more questions for the analysts. Moreover, EDA is very challenging, because successful EDA requires both deep knowledge of statistics and creative imagination. Hence, most data scientists don’t spend enough time doing EDA. However, with that rare combination of knowledge and imagination, the result of EDA can be very valuable. It will guide subsequent analyses in a way that is most likely to lead to the discovery of new insights.
Suppose you’ve done your homework as a good data scientist and played with the data sufficiently to get a sense of what might be interesting in the data set. What can you do next?
This is where the number-crunching analytics for data reduction begins. Sometimes, your big data may go from hundreds of terabytes down to just a few bytes or bits. As with EDA, there are an infinite number of analytics for data reduction, but they can be grouped into three classes: descriptive, predictive, and prescriptive analytics.
In an attempt to write shorter and more digestible posts, we will only discuss the first class of data reduction techniques today—descriptive analytics. And we will cover predictive and prescriptive analytics in subsequent posts.
Descriptive Analytics: Summarize
I came from a pure academic background and joined the industry about four and a half years ago. What surprised me most on coming to the industry was what people call analytics and business intelligence. My naïve thinking was that it must be some really complex artificial intelligence (AI) and really advanced machine learning models. But after much investigation, I was very surprised to find that most of it is just simple descriptive statistics.
Over 80% of business analytics, especially social analytics, are descriptive analytics. They compute descriptive statistics (e.g. counts, sums, averages, percentages, min, max and simple arithmetic: + − × ÷) that summarize certain groupings or filtered versions of the data, typically as simple counts of events. They are mostly based on standard aggregate functions in databases that require nothing more than grade school math. Even basic statistics (e.g. standard deviations, variances, p-values, etc.) are pretty rare.
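To make this concrete, here is a minimal Python sketch of the kind of grouped aggregates described above. The event log, the channel names, and the `summarize` helper are all hypothetical and invented for illustration; a real system would more likely run the equivalent aggregate functions inside a database.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical event log: (channel, number_of_replies) pairs.
# These values are made up purely for illustration.
events = [
    ("forum", 3),
    ("forum", 5),
    ("blog", 2),
    ("blog", 0),
    ("forum", 4),
]


def summarize(rows):
    """Group rows by channel and compute basic aggregates per group."""
    groups = defaultdict(list)
    for channel, replies in rows:
        groups[channel].append(replies)
    return {
        channel: {
            "count": len(values),
            "sum": sum(values),
            "avg": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for channel, values in groups.items()
    }


summary = summarize(events)
for channel, stats in summary.items():
    print(channel, stats)
```

Nothing here goes beyond counting, adding, and dividing, which is the point.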
The purpose of descriptive analytics is simply to summarize and tell you what happened. Examples include the number of posts, mentions, fans, followers, page views, kudos, +1s, check-ins, pins, etc. There are literally thousands of these metrics (it’s pointless to list them), but they are all just simple event counters. Other descriptive analytics may be the results of simple arithmetic operations, such as share of voice, average response time, % index, average number of replies per post, etc.
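Two of the derived metrics mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: share of voice is taken here to mean a brand's mentions as a percentage of all tracked mentions (a common definition, but not one the post spells out), and the brand names, mention counts, and timestamps are invented.

```python
from datetime import datetime

# Hypothetical mention counts per brand (illustrative numbers only).
mentions = {"our_brand": 120, "competitor_a": 200, "competitor_b": 80}


def share_of_voice(counts, brand):
    """Assumed definition: brand mentions as a percentage of all mentions."""
    total = sum(counts.values())
    return 100.0 * counts[brand] / total if total else 0.0


# Hypothetical (posted_at, first_reply_at) pairs for two threads.
threads = [
    (datetime(2013, 3, 1, 9, 0), datetime(2013, 3, 1, 9, 30)),
    (datetime(2013, 3, 1, 10, 0), datetime(2013, 3, 1, 11, 30)),
]


def average_response_minutes(pairs):
    """Mean time between a post and its first reply, in minutes."""
    if not pairs:
        return 0.0
    total_seconds = sum((reply - post).total_seconds() for post, reply in pairs)
    return total_seconds / len(pairs) / 60


sov = share_of_voice(mentions, "our_brand")
avg_minutes = average_response_minutes(threads)
print(f"Share of voice: {sov:.1f}%")
print(f"Average response time: {avg_minutes:.1f} minutes")
```

Both functions reduce to one division over a sum of counts or durations.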
Most of what the industry calls advanced analytics is nothing more than applying some filters to the data before computing the descriptive statistics. For example, by applying a geo-filter first, you can get metrics like average posts per week from the UK vs. average posts per week from Japan. And you can show these data on a map for all countries. Then all of a sudden you get advanced analytics. But as you can see, the analytics beneath that map is really just grade school math.
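The filter-then-aggregate pattern can be sketched as follows. The post records, the country codes, and the two-week window are hypothetical, chosen only to show that the "advanced" metric is a filter followed by a count and a division.

```python
# Hypothetical post records tagged with country and week number.
posts = [
    {"country": "UK", "week": 1},
    {"country": "UK", "week": 1},
    {"country": "UK", "week": 2},
    {"country": "UK", "week": 2},
    {"country": "JP", "week": 1},
    {"country": "JP", "week": 1},
    {"country": "JP", "week": 1},
    {"country": "JP", "week": 2},
    {"country": "JP", "week": 2},
    {"country": "JP", "week": 2},
]


def avg_posts_per_week(rows, country, n_weeks):
    """Apply a geo-filter, then divide the post count by the number of weeks."""
    filtered = [row for row in rows if row["country"] == country]
    return len(filtered) / n_weeks


# One number per country; this is the data that would sit beneath the map.
by_country = {c: avg_posts_per_week(posts, c, n_weeks=2) for c in ("UK", "JP")}
print(by_country)
```

Swapping the geo-filter for a filter on date, product, or user segment yields a different metric with exactly the same arithmetic.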
So the next time you need to deal with analytics, don’t be afraid! Most of it is just counting, plus filtering and simple arithmetic. Next time we’ll move into predictive analytics. That's going to be more mathematically challenging, but that is where the fun begins, because you get to predict the future.
In the meantime, I'm curious to know whether you work with metrics, and how many of those metrics are just simple descriptive statistics. If you think you have a metric that's not a descriptive statistic, tell us about it and we can talk about it here.
See you next time.
Michael Wu, Ph.D. is Lithium's Chief Scientist. His research includes: deriving insights from big data, understanding the behavioral economics of gamification, engaging + finding true social media influencers, developing predictive + actionable social analytics algorithms, social CRM, and using cyber anthropology + social network analysis to unravel the collective dynamics of communities + social networks.
Michael was voted a 2010 Influential Leader by CRM Magazine for his work on predictive social analytics + its application to Social CRM. He's a blogger on Lithosphere, and you can follow him @mich8elwu or Google+.