This is the first blogpost in a series about predicting churn. In this post we’re going to talk about analytics methodology. But first, let’s start with a simple definition:
Churn – a customer stops being a customer
This includes:
- A customer cancels their subscription service
- A customer with money invested in your services either loses all value or withdraws the money
- A recurring customer stops coming back to you for repeat purchases
To model this, you first need data on your customers – some who are still with you and some who have churned. In the table below, for example, customers Id1 and Id4 are still customers in April, but Id2, Id3 and Id5 are not:
Customer | Jan | Feb | Mar | Apr |
Id1 | 1 | 1 | 1 | 1 |
Id2 | 1 | 1 | 0 | 0 |
Id3 | 1 | 1 | 1 | 0 |
Id4 | 1 | 1 | 1 | 1 |
Id5 | 1 | 0 | 0 | 0 |
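As a rough sketch of how a table like this can be turned into something a model can learn from, the snippet below derives a churn label and the last active month for each customer. It assumes that "churned" simply means "not active in the last observed month"; the names `churn_label` and `last_active_month` are made up for this illustration, and a real project would need a definition of churn agreed with the business.

```python
# Monthly activity flags copied from the table above: 1 = customer, 0 = not.
MONTHS = ["Jan", "Feb", "Mar", "Apr"]
activity = {
    "Id1": [1, 1, 1, 1],
    "Id2": [1, 1, 0, 0],
    "Id3": [1, 1, 1, 0],
    "Id4": [1, 1, 1, 1],
    "Id5": [1, 0, 0, 0],
}


def churn_label(flags):
    """Return 1 if the customer is inactive in the last observed month, else 0.

    Assumption for this sketch: churn = not a customer in the final month.
    """
    return 0 if flags[-1] == 1 else 1


def last_active_month(flags, months=MONTHS):
    """Return the name of the latest month flagged 1, or None if there is none."""
    for month, flag in zip(reversed(months), reversed(flags)):
        if flag == 1:
            return month
    return None


labels = {cid: churn_label(flags) for cid, flags in activity.items()}
last_seen = {cid: last_active_month(flags) for cid, flags in activity.items()}

for cid in activity:
    status = "churned" if labels[cid] else "active"
    print(cid, status, "- last active:", last_seen[cid])
```

The `labels` dictionary is the target variable for a classifier, while `last_seen` is the kind of information a survival-style model would use.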
Second, you need data to use for predicting who is going to churn. What data you have and how you use it is very important, and will be covered in more detail in a separate blogpost. For the time being we will make do with the example table below, which contains both categorical data (Gender, Frequency) and numerical data (Age, AvgValue).
Customer | Gender | Age | AvgValue | Frequency |
Id1 | Female | 31 | 302 | High |
Id2 | Male | 22 | 195 | Low |
Id3 | Female | 29 | 152 | Mid |
Id4 | Female | 47 | 412 | Low |
Id5 | Male | 39 | 353 | Mid |
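Most models need numbers rather than labels, so the categorical columns have to be encoded before modelling. The sketch below shows one common option, dummy (one-hot) encoding with the first level of each category dropped as a baseline. The level ordering and the `encode` helper are assumptions made for this example, not a description of how any particular project did it.

```python
# Example records copied from the table above.
customers = {
    "Id1": {"Gender": "Female", "Age": 31, "AvgValue": 302, "Frequency": "High"},
    "Id2": {"Gender": "Male", "Age": 22, "AvgValue": 195, "Frequency": "Low"},
    "Id3": {"Gender": "Female", "Age": 29, "AvgValue": 152, "Frequency": "Mid"},
    "Id4": {"Gender": "Female", "Age": 47, "AvgValue": 412, "Frequency": "Low"},
    "Id5": {"Gender": "Male", "Age": 39, "AvgValue": 353, "Frequency": "Mid"},
}

NUMERICAL = ["Age", "AvgValue"]
# Assumed level order; the first level of each column acts as the baseline.
CATEGORICAL = {
    "Gender": ["Female", "Male"],
    "Frequency": ["Low", "Mid", "High"],
}

# Column names of the encoded feature vector, in output order.
FEATURE_NAMES = NUMERICAL + [
    f"{name}={level}"
    for name, levels in CATEGORICAL.items()
    for level in levels[1:]
]


def encode(row):
    """Turn one customer record into a flat list of floats.

    Numerical columns are passed through unchanged; each categorical column
    becomes one 0/1 indicator per non-baseline level.
    """
    features = [float(row[name]) for name in NUMERICAL]
    for name, levels in CATEGORICAL.items():
        for level in levels[1:]:
            features.append(1.0 if row[name] == level else 0.0)
    return features


matrix = {cid: encode(row) for cid, row in customers.items()}

print(FEATURE_NAMES)
for cid, features in matrix.items():
    print(cid, features)
```

Dropping one level per category avoids perfectly collinear columns in a regression model; tree-based methods would work equally well with all levels kept.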
Once you’ve got the data, you must decide how to go about predicting the churn risk for your customers. Churn prediction is a classification problem, so statistical methods (like logistic regression and survival regression) and classification algorithms (like decision trees, random forests and k-NN) are all candidates. You must also decide whether to use a single model or an ensemble of models when predicting.
In a recent project for a client we opted for a single-model logistic regression approach. Our main reasons for doing this were:
- We wanted an estimated model that could be inserted into Tableau to give daily updated churn probabilities for all of the company’s customers
- Model coefficients are interpretable and comparable, which is good when communicating results to non-data scientists
- Insights from the variable selection process can be shared with the wider organization – knowing which phenomena are not related to churn can sometimes prove as valuable an insight as knowing which phenomena are
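To make the first two reasons a bit more concrete: once a logistic regression has been estimated, scoring a customer is just a weighted sum passed through the logistic function, which is simple enough to write as a calculated field in a reporting tool. The sketch below illustrates this in Python. All coefficient values are invented for the example and are not results from the project; `churn_probability` and `odds_ratios` are hypothetical helper names.

```python
import math

# Hypothetical coefficients, invented purely for illustration.
INTERCEPT = -1.0
COEFFICIENTS = {
    "Age": -0.02,
    "Gender=Male": 0.40,
    "Frequency=Mid": -0.50,
    "Frequency=High": -1.20,
}


def churn_probability(features):
    """Logistic regression score: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = INTERCEPT + sum(
        COEFFICIENTS[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratios(coefficients):
    """exp(coefficient) = multiplicative change in churn odds per unit increase."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


# Two customers from the example table, already dummy-encoded.
id1 = {"Age": 31, "Gender=Male": 0, "Frequency=Mid": 0, "Frequency=High": 1}
id2 = {"Age": 22, "Gender=Male": 1, "Frequency=Mid": 0, "Frequency=High": 0}

print("Id1 churn probability:", round(churn_probability(id1), 3))
print("Id2 churn probability:", round(churn_probability(id2), 3))
for name, ratio in odds_ratios(COEFFICIENTS).items():
    print(name, "odds ratio:", round(ratio, 2))
```

An odds ratio above 1 means the variable is associated with higher churn odds and a value below 1 with lower odds, which is the kind of statement that tends to be easy to communicate to non-data scientists.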
Now, when predicting churn there is no right or wrong choice of methodology. Sometimes some methodologies are less viable than others, but more often it’s the strengths of one methodology in combination with the restrictions imposed on you by the project that make the choice for you. No right, no wrong, just picking the tool best suited for the job at hand.