Welcome back, Michael Wu! Here is his third installment in a series describing how the new Community Health Index was developed:
To begin the analysis of the previously collected data set, I gathered the non-metric data from various sources by talking to the moderators, the customer success managers (CSMs), and our best practice advocates, including Joe Cothrel and his team. As I mentioned earlier, these data are extremely important because they serve as the ground truth for our prediction problem. It is through the eyes of the moderators and the CSMs, who monitor and interact with the community every day, that we know how healthy a community is. Tabulating these non-metric data gives us a time series of the health level for each community. Since all the recorded metrics are already in the form of time series, we can now turn to statistics and begin the number crunching.
The idea is very simple. We know the health level of the community from the non-metric data; now we simply want to know which of the 20 commonly available metrics can best predict community health. This can be achieved by running a sequence of linear and nonlinear regression analyses using the 20 metrics as the predictor variables and the tabulated non-metric data as the response variable.
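To make the idea concrete, here is a minimal sketch of a first screening pass: fit a one-predictor linear regression for each metric and rank the metrics by R². This is not the actual analysis, which involved a sequence of linear and nonlinear regressions over all 20 metrics. The metric names ("posts", "logins"), the synthetic weekly data, and the helper functions are all made up for illustration.

```python
import random

def r_squared(x, y):
    """R^2 of a one-predictor linear fit (the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def rank_metrics(metrics, health):
    """Return (name, R^2) pairs sorted from most to least predictive."""
    scores = {name: r_squared(series, health) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical synthetic data: two years of weekly values.
# By construction, health depends on "posts" and not on "logins".
rng = random.Random(0)
weeks = 104
posts = [rng.gauss(100, 20) for _ in range(weeks)]
logins = [rng.gauss(500, 50) for _ in range(weeks)]
health = [0.05 * p + rng.gauss(0, 0.3) for p in posts]

ranking = rank_metrics({"posts": posts, "logins": logins}, health)
for name, score in ranking:
    print(f"{name}: R^2 = {score:.2f}")
```

Screening one metric at a time like this ignores correlations among the metrics, which is exactly why the issues discussed next matter.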
This, however, is not trivial. Some of the issues that must be dealt with include the correlation among the predictor variables, the nonlinearity between the predictors and the response, and the nonstationarity of the time series data.
That's quite a mouthful, so here is a bit of explanation about what I mean by that:
The problem of correlations among the predictors is known as multicollinearity. If some of the predictor variables are highly correlated, it is very difficult to determine which predictor actually causes the response. Computationally, this shows up as large regression coefficients that may jump seemingly at random between the correlated predictors. These jumps are highly sensitive to the data, making it difficult to determine which of the correlated predictors is most predictive. This is a very prominent problem in community data, as many of the metrics are highly correlated. For example, a community that has a lot of traffic tends to gain more members and achieve a higher level of activity. I have used partial least squares and boosting to try to overcome this problem.
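The coefficient instability can be seen in a small simulation. The sketch below illustrates the symptom only, not the partial least squares or boosting remedy. It fits a two-predictor least-squares regression on bootstrap resamples and measures how much the first coefficient moves around. The variable names ("traffic", "members", "unrelated") and the synthetic data are assumptions made for the example.

```python
import random
import statistics

def ols_two_predictors(x1, x2, y):
    """Least-squares slopes for y ~ x1 + x2, from the centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12  # close to zero when x1 and x2 are collinear
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def coefficient_spread(x1, x2, y, rng, n_boot=200):
    """Standard deviation of the first slope across bootstrap resamples."""
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        b1, _b2 = ols_two_predictors(
            [x1[i] for i in idx], [x2[i] for i in idx], [y[i] for i in idx]
        )
        estimates.append(b1)
    return statistics.stdev(estimates)

# Hypothetical synthetic data: health depends on traffic only.
rng = random.Random(1)
n = 200
traffic = [rng.gauss(0, 1) for _ in range(n)]
members = [t + rng.gauss(0, 0.05) for t in traffic]  # nearly a copy of traffic
unrelated = [rng.gauss(0, 1) for _ in range(n)]      # independent of traffic
health = [t + rng.gauss(0, 1) for t in traffic]

spread_collinear = coefficient_spread(traffic, members, health, rng)
spread_independent = coefficient_spread(traffic, unrelated, health, rng)
print(f"spread of traffic coefficient, collinear pair:   {spread_collinear:.2f}")
print(f"spread of traffic coefficient, independent pair: {spread_independent:.2f}")
```

When the second predictor is nearly a copy of the first, the traffic coefficient varies far more from resample to resample than it does next to an independent predictor, even though the underlying relationship never changes.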
Nonlinearity means that the predictors and the response may not be related in a linear fashion. That is, a fixed change in a predictor doesn't always lead to the same change in the response. The change may also depend on the history of the predictor as well as its interactions with other predictors. There is no out-of-the-box solution for nonlinearity. I just have to try some nonlinear forms, plot the data, look at them, reformulate the model, and see which one fits and predicts best.
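As a rough illustration of this try-and-compare loop, the sketch below fits the same straight-line model to several transformations of a single predictor and keeps whichever predicts held-out data best. The candidate transformations, the synthetic data (generated with a logarithmic relationship), and the helper functions are all assumptions for the example. It does not cover the dependence on history or the interactions between predictors mentioned above.

```python
import math
import random

def fit_line(x, y):
    """Least-squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(x, y, slope, intercept):
    """Mean squared prediction error of the fitted line."""
    return sum((slope * a + intercept - b) ** 2 for a, b in zip(x, y)) / len(x)

# Hypothetical synthetic data: the response saturates as the predictor grows.
rng = random.Random(2)
x = [rng.uniform(1, 100) for _ in range(300)]
y = [math.log(v) + rng.gauss(0, 0.2) for v in x]
x_train, y_train = x[:200], y[:200]
x_test, y_test = x[200:], y[200:]

# Candidate reformulations of the predictor (an assumed, non-exhaustive list).
candidates = {"linear": lambda v: v, "sqrt": math.sqrt, "log": math.log}

errors = {}
for name, transform in candidates.items():
    slope, intercept = fit_line([transform(v) for v in x_train], y_train)
    errors[name] = mse([transform(v) for v in x_test], y_test, slope, intercept)

best = min(errors, key=errors.get)
for name, err in errors.items():
    print(f"{name}: held-out MSE = {err:.3f}")
print("best form:", best)
```

Judging each form on held-out data rather than on the data used for fitting matters here, because a more flexible form will usually fit the training data better whether or not it predicts better.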
Finally, nonstationarity means that the behavior of the system, in this case the community, depends on absolute time. This makes prediction of any time series data very difficult. In layman's terms, it means that any statistical pattern we have learned may change from one time to another (this is what dependence on absolute time means). In other words, knowing the history does not predict the future. For example, if we want to accurately predict stock market prices, any pattern we learn from the history had better continue into the future. If there is a trend (or seasonality) in the history, the exact same trend (or seasonality) must persist in order for us to predict the future. If the trend changes in the future, then following the historical trend will lead to a wrong prediction. This is a very prevalent problem in communities, because communities are constantly changing due to management decisions, product launches, marketing efforts, etc. There is also no way to predict a completely nonstationary system, as seen by the fact that no one can predict the stock market. We can only make some assumptions about how nonstationary our system is, proceed, and hope for the best. To deal with this problem, researchers typically assume one of several weaker forms of stationarity, and I have assumed wide-sense stationarity in the analysis of our community data.
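One crude way to see whether a series' statistics depend on absolute time is to split it into consecutive windows and compare the window means. The sketch below does this for a synthetic stationary series and a synthetic trending one. It is only a rough diagnostic under assumed data, not the method used in the actual analysis, and it does not check variances or autocovariances, which would also need to stay constant over time. The function names and the window count are made up for the example.

```python
import math
import random
import statistics

def window_stats(series, n_windows=4):
    """Mean and variance of each of n_windows consecutive, equal-sized windows."""
    size = len(series) // n_windows
    windows = [series[i * size:(i + 1) * size] for i in range(n_windows)]
    return [(statistics.fmean(w), statistics.pvariance(w)) for w in windows]

def mean_drift(series, n_windows=4):
    """Range of the window means, in units of the pooled within-window std dev."""
    stats = window_stats(series, n_windows)
    means = [m for m, _ in stats]
    pooled_sd = math.sqrt(statistics.fmean([v for _, v in stats]))
    return (max(means) - min(means)) / pooled_sd

# Hypothetical synthetic series: pure noise versus noise plus a linear trend.
rng = random.Random(3)
n = 400
stationary = [rng.gauss(0, 1) for _ in range(n)]
trending = [0.01 * t + rng.gauss(0, 1) for t in range(n)]

drift_stationary = mean_drift(stationary)
drift_trending = mean_drift(trending)
print(f"mean drift, stationary series: {drift_stationary:.2f}")
print(f"mean drift, trending series:   {drift_trending:.2f}")
```

A drift that is large relative to the within-window spread suggests the mean is moving over time, so a pattern learned from the early windows may not carry over to the later ones.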
That is a lot to digest! If you have any questions, I'd be more than happy to address them in the comments, or feel free to ask me on Twitter at mich8elwu.