Dr. Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and online communities.
He's a regular blogger on the Lithosphere and previously wrote in the Analytic Science blog.
You can follow him on Twitter at mich8elwu.
Last week, I discussed the Lorenz curve and how to use it to quantify precisely how much content is contributed by various portions of the community population. This article builds on my previous posts, so I would recommend reading through the following articles before diving into this one.
So, one of the concepts I discussed last week was the utility of the Lorenz curve with data from Lithosphere. The question is what does the Lorenz curve for other communities look like?
If there is a community where everyone participates equally, for example, everyone posts exactly 10 messages, then the Lorenz curve would be a straight line (fig 1: Perfect Equality). As participation deviates from perfect equality, the curve will bow downward (fig 2: Unequal). The greater the inequality, the further the curve is depressed down and to the right (fig 3: More Unequal). In the extreme case, where one person produces all the content and everyone else just lurks, then the curve would turn into a rectangular corner (a.k.a. the delta function) as shown below (fig 4: Total Inequality).
So the shape of the Lorenz curve tells us how unequal the participation is in any community.
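For readers who want to try this on their own data, here is a minimal Python sketch of how a Lorenz curve can be built from a list of per-user post counts. It is only an illustration under simple assumptions, not the code behind the figures in this post; the function name `lorenz_curve` and the sample counts are made up.

```python
def lorenz_curve(post_counts):
    """Return Lorenz curve points as (population share, content share) pairs.

    Users are sorted from least to most prolific, so the curve starts at
    (0, 0), ends at (1, 1), and never rises above the diagonal.
    """
    counts = sorted(post_counts)  # least prolific first
    total = sum(counts)
    n = len(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one user and at least one post")
    points = [(0.0, 0.0)]
    running = 0
    for i, c in enumerate(counts, start=1):
        running += c
        points.append((i / n, running / total))
    return points


# Hypothetical communities of four users each.
equal = lorenz_curve([10, 10, 10, 10])   # everyone posts 10 messages
unequal = lorenz_curve([0, 0, 0, 40])    # one person posts everything
```

In the equal community every point falls on the diagonal, while in the second one the curve stays at zero until the very last user, which is the rectangular corner described above.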
The Gini Coefficient
Although we can visually examine the Lorenz curve and get a good sense of how unequal the participation is in a community, we still haven't numerically quantified the degree of inequality. To do this, the Italian statistician Corrado Gini created the Gini coefficient, which, by definition, is the area between the Lorenz curve (the red line in the above figure) and the line of Perfect Equality (the diagonal blue dotted line). This area is also normalized, so that when there is total inequality, the area between the red rectangular corner and the dotted blue line of perfect equality is 1. So the Gini coefficient is just the area of the yellow patches in the above figures. The Gini coefficient is sometimes multiplied by 100 to rescale it to an easily understandable score. This rescaled version is also known as the Gini index.
When there is perfect equality, the Gini coefficient would be zero (fig 1: G1=0). As the participation level deviates from perfect equality, the Gini coefficient will increase (fig 2: G2>0). The greater the inequality, the larger the numerical value of the Gini coefficient (fig 3: G3>G2). In the extreme case of total inequality the Gini coefficient would be one (fig 4: G4=1).
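To make the definition concrete, here is a rough Python sketch that computes the Gini coefficient from per-user post counts. It assumes the area under the Lorenz curve (call it B) is approximated with the trapezoid rule, so that G = 1 - 2B, because the area under the line of perfect equality is 1/2. The function name `gini` and the sample data are hypothetical. One caveat of this discrete version: with n users, total inequality gives 1 - 1/n rather than exactly 1, so it only approaches 1 for large communities.

```python
def gini(post_counts):
    """Gini coefficient of a list of per-user post counts (trapezoid rule)."""
    counts = sorted(post_counts)  # least prolific first
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one user and at least one post")
    area_under = 0.0   # area under the Lorenz curve (B)
    prev_share = 0.0
    running = 0
    for c in counts:
        running += c
        share = running / total
        # Each user spans a strip of width 1/n on the population axis.
        area_under += (prev_share + share) / 2 / n
        prev_share = share
    return 1 - 2 * area_under


g_equal = gini([10] * 5)            # perfect equality
g_total = gini([0] * 99 + [100])    # one of 100 users posts everything
```

With these made-up inputs, `g_equal` is zero up to floating-point rounding and `g_total` is 0.99, matching the two extremes described above.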
Now, I can calculate the Gini coefficient for Lithosphere using the same data I used last week (lurkers excluded for simplicity). The Gini coefficient for Lithosphere's post activity cumulatively as of Feb 28, 2010 is Gc=0.79. I can do the same for all communities in our data warehouse, and compute the mean level of participation inequality. The mean Gini coefficient for all our communities turned out to be 0.64 with SD=0.11. So the participation in Lithosphere is rather unequal among the participants (remember we excluded the lurkers for now).
Slicing and Dicing the Data
Having the Gini coefficient for all the communities in our data warehouse, I can now compare and contrast the data across industry, community type (support, marketing, or innovation), and audience (B2C, B2B, or internal).
Please note:
We can see clearly that the mean Gini coefficient for marketing communities (0.71) is higher than those of support and innovation communities. Likewise, participation in B2C communities is more unequal than in B2B and internal communities, albeit less significantly (because the mean Gini coefficient for B2C communities is only 0.67, slightly above the overall mean).
If we segment the communities with a coarse binning by industry, we can see that the degree of participation inequality is not significantly different across industries, except for communities in the entertainment industry. Note that most of the industry average Gini coefficients are very close to the community mean of 0.64, whereas the average Gini coefficient for the entertainment industry is 0.75 (about 1 standard deviation above the mean).
What Does this Inequality Mean?
So what does inequality of participation mean to you? And what does it really mean for a Gini coefficient to be 0.64 as opposed to 0.75?
To address these questions, I will repeat the analysis from my earlier post on these segmented communities. So here is the data you've been asking for, at least some of it (I don't want to turn this blog into a full-fledged academic paper!).
Using the Lorenz curve, I can easily compute the fraction of content produced by the top 10% of the participants (lurkers excluded again), which should correspond to the 1% of hyper-contributors in the 90-9-1 rule.
These data show a strong correlation with the mean Gini coefficient data above. So, greater participation inequality (corresponding to a larger Gini coefficient) means that the hyper-contributors are more prolific. This makes intuitive sense because you can think of inequality as the difference between the most prolific and the least prolific users. Because the least prolific users are always the lurkers with zero participation, a bigger inequality (that is, a bigger difference between the two extremes) implies that the top users must be more productive.
Since marketing communities, B2C communities, and communities in the entertainment industry have higher Gini coefficients, their hyper-contributors (defined to be the top 10% of the participants in this case) produce more content than those of other communities: 64%, 60% and 69% respectively, compared to the mean of 55%.
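If you want to compute this number for your own community, a minimal Python sketch might look like the following. The function name `top_share` and the sample post counts are invented for illustration, and rounding the size of the top group up to a whole user is my assumption, not necessarily what was done for the charts here.

```python
import math


def top_share(post_counts, top_fraction=0.10):
    """Fraction of all content produced by the most prolific users.

    top_fraction is the share of participants counted as "top",
    rounded up to a whole number of users (at least one).
    """
    counts = sorted(post_counts, reverse=True)  # most prolific first
    total = sum(counts)
    if not counts or total == 0:
        raise ValueError("need at least one user and at least one post")
    # The small epsilon guards against floating-point noise such as
    # 0.1 * 30 evaluating to slightly more than 3.
    k = max(1, math.ceil(top_fraction * len(counts) - 1e-9))
    return sum(counts[:k]) / total


# Hypothetical community: 10 participants, 100 posts in total.
sample = [50, 20, 10, 5, 5, 3, 3, 2, 1, 1]
share_top_10 = top_share(sample)        # top 1 of 10 users
share_top_20 = top_share(sample, 0.20)  # top 2 of 10 users
```

In this toy community the top 10% of participants produce 50% of the content and the top 20% produce 70%.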
Once I have computed the Lorenz curve, turning the problem around is trivial. If we define "most of the community content" to be at least 50% (see The 90-9-1 Rule in Reality), then the Lorenz curve gives us an estimate of the hyper-contributor population as the fraction of participants that is required to produce at least 50% of the total content.
This data is anti-correlated with the Gini coefficient data in the previous section. So, greater participation inequality means that the percentage of hyper-contributors will be smaller. This is consistent with the observation we made earlier that the hyper-contributors will be more prolific when there is greater participation inequality. As they are more prolific, naturally fewer of them will be needed to contribute the same amount, in this case, 50% of the total content. I am not going to recite the data points here, just look at the chart and ask me if you have any questions.
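The reverse lookup can be sketched in a few lines of Python as well. Again, this is just an illustration with a made-up function name (`fraction_needed`) and made-up post counts: it walks down the list of users from the most prolific and stops as soon as the running total reaches the target share.

```python
def fraction_needed(post_counts, content_share=0.5):
    """Smallest fraction of participants that produces at least
    content_share of the total content, taking the most prolific first."""
    counts = sorted(post_counts, reverse=True)
    total = sum(counts)
    if not counts or total == 0:
        raise ValueError("need at least one user and at least one post")
    target = content_share * total
    running = 0
    for k, c in enumerate(counts, start=1):
        running += c
        if running >= target:
            return k / len(counts)
    return 1.0  # only reached if content_share is above 1


# Hypothetical communities of 10 participants each.
skewed = fraction_needed([50, 20, 10, 5, 5, 3, 3, 2, 1, 1])
even = fraction_needed([10] * 10)
```

In the skewed toy community a single user (10% of participants) already produces half the content, whereas the perfectly even one needs 50% of its participants to get there. That is the anti-correlation with inequality described above.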
Why use the Gini Coefficient?
Since the Gini coefficient is highly correlated with the fractional contribution of the top participants, you might wonder why bother with the Gini coefficient at all? The answer is its elegant simplicity and accuracy.
I have deliberately left out the lurkers in all our discussion, so there is only one number we have to track (i.e. either the fractional contribution or the proportion of hyper-contributors). If I were to put the lurkers back into the picture, then we would need another number that quantifies the ratio between lurkers and participants. If I want greater accuracy with finer granularity than just the lurkers, occasional- and hyper-contributors (say, I also want to know about a group called the moderate-contributors), then we would need even more numbers.
In contrast, because the Lorenz curve tracks the data for all possible participation levels, it has all the accuracy we will ever need. Despite that, the Gini coefficient will always be a single number. Let me illustrate the utility of this with a hypothetical example.
Suppose you encounter three communities where one follows the 90:9:1 rule precisely, the second one follows a rule that is numerically more like 94:4:2, and the third follows the 88:10:2 rule. Question: which community has the greatest level of participation inequality? Even if these numbers are accurate, it is not obvious how to rank them. With the Gini coefficient, we have a single number that accurately quantifies the participation inequality. So we can easily identify the community with the largest Gini coefficient, and thus the one that has the most unequal level of participation. Even if you don't care about participation inequality itself, you might want to know which community has the most prolific hyper-contributors, or the relative proportion of hyper-contributor populations.
With a simple yet accurate statistic like the Gini coefficient, we can compute the Gini coefficient for a window of activity at different times and watch how the participation inequality changes as the community grows. We can also build accurate models that have strong predictive powers. The possibilities become endless! When you can rigorously quantify something, that's when you turn it into a science. That is when you can gain quantitative and predictive insights. And that is when all the fun begins - at least for me.
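As a rough sketch of what tracking the Gini coefficient over time could look like, the Python below groups messages by a time window and computes one coefficient per window. Everything here is an assumption for illustration: the input format (one `(window, author)` pair per message), the function names `gini` and `gini_by_window`, and the tiny sample log. Note that only users who posted in a given window are counted, so lurkers are excluded, as elsewhere in this post.

```python
from collections import Counter, defaultdict


def gini(post_counts):
    """Gini coefficient via the trapezoid rule on the Lorenz curve."""
    counts = sorted(post_counts)
    n, total = len(counts), sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one user and at least one post")
    area_under, prev_share, running = 0.0, 0.0, 0
    for c in counts:
        running += c
        share = running / total
        area_under += (prev_share + share) / 2 / n
        prev_share = share
    return 1 - 2 * area_under


def gini_by_window(posts):
    """posts: iterable of (window_label, author) pairs, one per message.

    Returns a dict mapping each window label to the Gini coefficient of
    the per-author post counts within that window.
    """
    per_window = defaultdict(Counter)
    for window, author in posts:
        per_window[window][author] += 1
    return {w: gini(c.values()) for w, c in sorted(per_window.items())}


# Hypothetical message log: month label and author of each message.
posts = [
    ("2010-01", "ann"), ("2010-01", "bob"),
    ("2010-02", "ann"), ("2010-02", "ann"),
    ("2010-02", "ann"), ("2010-02", "bob"),
]
trend = gini_by_window(posts)
```

In this toy log, participation is perfectly equal in the first month and becomes more unequal in the second, so the coefficient rises from 0 to 0.25.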
As always, please let me know if you have questions or thoughts. This is a long blog with a lot of data, so we will take a short break from the 90-9-1 data mantra and come back to it later. Next time let's explore the science of influence.
