Last time, we talked about selecting predictive variables and the tedious process of nonlinear analysis. But the hard work is not over yet. Once we have the variables and the nonlinearities, we must combine them into a single function that, when evaluated, gives us the proper health level of a community. This process culminated in a health function, which is a product of 6 health factors that are important in determining the health of online communities. These health factors are:
Members: the number of registered members over time,
Content: a function of posts weighted by member and guest viewership,
Traffic: the number of page views over time taking into account search crawlers,
Liveliness: a function of the number of posts per board over time taking into account user expectations for engagement,
Interaction: the number of unique participants weighted by the amount of conversation between them within a thread, and
Responsiveness: a measure of the time to respond between successive message posts within a thread taking into account expected response time.
Each of these health factors usually involves one or more metrics with some nonlinear function applied to them.
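To make the idea of a product of factors concrete, here is a minimal sketch in Python. It is not the formula from the white paper: the function names are hypothetical, and the logarithm is only an assumed stand-in for whatever nonlinearity each factor actually uses.

```python
import math

# The six health factors named above.
FACTORS = (
    "members", "content", "traffic",
    "liveliness", "interaction", "responsiveness",
)

def log_factor(metric):
    """Hypothetical nonlinearity: dampen a raw count with log(1 + x).

    This is an assumption for illustration; the real factors may use
    different functions.
    """
    return math.log1p(metric)

def health(factors):
    """Combine the six factor values into one score by multiplying them."""
    missing = [name for name in FACTORS if name not in factors]
    if missing:
        raise ValueError(f"missing factors: {missing}")
    return math.prod(factors[name] for name in FACTORS)
```

One consequence of using a product rather than a sum is that a factor near zero pulls the whole score toward zero, so a community cannot compensate for, say, no interaction with a large membership alone.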
The health function is smoothed to give the health trend, much like smoothing the daily stock price gives a better indication of the underlying direction of movement. The health trend is then normalized to remove some of the bias introduced by the size of the community. I did not remove the size bias completely because human experts have the same bias and tend to rate larger communities as healthier. The normalization process takes into account the health history of the community, weighting recent health more heavily, as well as the volatility of health, so that a consistent progression of the health trend results in a greater value of CHI. By design, the community health index is robust to outliers yet still sensitive: if there is a consistent signal for a change in health, it will be reflected in the weekly value of CHI.
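The smoothing and recency weighting described above can be sketched as follows. The post does not say which smoothing method is used, so the exponentially weighted moving average here is an assumption, as are the names smooth and recency_weighted_mean and the parameters alpha and decay. The volatility adjustment is left out.

```python
def smooth(series, alpha=0.3):
    """Exponentially weighted moving average of a series.

    alpha close to 1 follows the raw values closely; alpha close to 0
    smooths heavily. This method is an assumed stand-in, not the
    smoothing used for CHI.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    trend = []
    for value in series:
        previous = trend[-1] if trend else value
        trend.append(alpha * value + (1 - alpha) * previous)
    return trend

def recency_weighted_mean(history, decay=0.5):
    """Mean of past values where newer entries count more.

    The latest value has weight 1, the one before it decay, then
    decay squared, and so on.
    """
    if not history:
        raise ValueError("history must not be empty")
    n = len(history)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(w * h for w, h in zip(weights, history))
    return total / sum(weights)
```

A single spike in the raw values moves the smoothed trend only part of the way, while a sustained change keeps pulling it in the same direction week after week, which is the kind of behavior the paragraph above describes.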
The final step of any mathematical modeling is model validation. Basically, this means that we must test the model on a data set that we did not use to build the model, and make sure that the model still performs as expected. Lithium now hosts roughly 170 communities, and I developed the community health index using data derived from 16 communities of varying size, age, and purpose, where we have plenty of non-metric data. Then I tested the resulting model on 4 other communities. As with any scientific discovery process, this went through several iterations before the model began to perform well at every stage of the modeling process. Once the model-predicted health started matching the health assessed by human experts, I computed CHI for all our communities and gathered more data to refine the initial formulation. The computation published in our white paper is actually the result of three iterations of major reformulation; each introduced just a few minor but important tweaks that increased the prediction accuracy of community health for a greater variety of communities. And we are already working on future refinements as we continue to learn from the data we collect.
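The hold-out idea can be illustrated with a short sketch. The 16 and 4 split comes from the text above; everything else is an assumption for illustration, including the function names and the choice of pairwise rank agreement as the way to compare model scores with expert ratings.

```python
from itertools import combinations

def split_communities(ids, n_holdout):
    """Reserve the last n_holdout communities for validation only."""
    if not 0 < n_holdout < len(ids):
        raise ValueError("n_holdout must be between 1 and len(ids) - 1")
    return ids[:-n_holdout], ids[-n_holdout:]

def pairwise_agreement(predicted, expert):
    """Fraction of community pairs ranked the same way by model and experts.

    Both arguments map a community id to a score. Pairs that are tied
    in either scoring are skipped. This metric is an assumed choice,
    not the one used to validate CHI.
    """
    agree = 0
    comparable = 0
    for a, b in combinations(sorted(predicted), 2):
        model_diff = predicted[a] - predicted[b]
        expert_diff = expert[a] - expert[b]
        if model_diff == 0 or expert_diff == 0:
            continue
        comparable += 1
        if (model_diff > 0) == (expert_diff > 0):
            agree += 1
    return agree / comparable if comparable else float("nan")
```

The important part is the discipline rather than the metric: the held-out communities are scored only after the model is fixed, so agreement on them says something about communities the model has never seen.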
Hopefully this series of blog posts has given you a peek at the development process behind the community health index and the effort that went into it. If you have any questions, I'd be more than happy to address them in the comments, or feel free to ask me on Twitter at mich8elwu.