In this blog post we are going to talk about model predictors, i.e. the *independent variables* if you prefer the statistical nomenclature, or *features* if you prefer the computer science nomenclature. Either way, we’re going to talk about the *data* that we will use to predict whether our customers are going to churn.

Below I’m going to introduce a number of variable types, the theory behind them and why they are useful for modelling churn. The arguments are written with a logistic regression churn prediction model in mind, but the variable concepts and their usefulness hold equally for other methods like random forests, gradient boosting or k-NN (even though the lingo may vary slightly).

- Since we commonly define churn as “a customer that will stop being a customer within X months”, it’s rather clear that there is a time component to when a customer will churn. As such, it comes naturally to observe **time series behaviour** for the customers: you can look at the number of products a customer has bought weekly during the twelve weeks leading up to the analysis, or at the number of times they have contacted customer service monthly during the six months leading up to the analysis. Or you could look at something else, but the key is to look at the extent to which they have performed actions of interest over a period of time. The idea with time series variables is to see if there is a crucial point in time at which something happens to or with your customers, or to find repeating patterns over time (one example could be bi-weekly visits or purchases).
- If you’ve gone through the trouble of creating the time series variables, a piece of low-hanging fruit is to calculate the **recency and frequency** of actions during the period. You calculate the recency by going backwards from the analysis date, using the increments of time that you find most usable. For example, it would be interesting to see how many days or weeks it has been since a customer last bought a product from you. Correspondingly, the frequency of an action is the total number of times it has been performed during the period (going back to the same example: the total number of purchases during the period). The idea with recency is to identify whether there is a critical length of inactivity that signals that a customer will churn. Frequency is also straightforward: is there a critical amount of activity (presumably low activity) that signals that a customer is planning on leaving?
- A third way of using your time series data is to create **velocity variables** (or rate of change variables, as they are commonly called). A variable’s velocity is the amount of change from one point in time to another. For example, if a customer visited your website 5 times one week and 7 times the week after, then the relative velocity is 7/5 = 1.4 and the actual velocity is 7 - 5 = 2. The easiest thing to look for is spikes in behaviour, either positive or negative. But other things, like consecutive weeks of positive velocity, are also very interesting since they indicate sustained growth in activity (and exponential growth if the relative velocity stays roughly constant above 1). Be sure to play around with time units as well: sometimes velocity from day to day is informative, but more often the time unit is too small and there is no clear signal to identify. When you go up to week-to-week velocity, or two weeks to two weeks, it might become more informative. Out of the three types of predictors covered so far, velocity variables are by far the most fickle, and it’s easy to get lost creating more and more velocity variables for different increments of time without getting many useful variables out of the endeavour. While it’s always important to know your customer base and your data, it becomes much more so when working with velocity variables.
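
To make the three predictor types a bit more concrete, here is a minimal Python sketch that derives them from a list of purchase dates for a single customer. Everything in it is an assumption made for illustration: the function names, the four-week window and the sample dates are hypothetical and not taken from any particular library or dataset.

```python
from datetime import date


def weekly_counts(event_dates, analysis_date, n_weeks=12):
    """Events per week for the n_weeks before analysis_date, oldest week first."""
    counts = [0] * n_weeks
    for d in event_dates:
        days_ago = (analysis_date - d).days
        # Only keep events strictly before the analysis date and inside the window.
        if 0 < days_ago <= 7 * n_weeks:
            weeks_ago = (days_ago - 1) // 7  # 0 = the most recent week
            counts[n_weeks - 1 - weeks_ago] += 1
    return counts


def recency_days(event_dates, analysis_date):
    """Days since the most recent event before analysis_date (None if no events)."""
    past = [d for d in event_dates if d < analysis_date]
    return (analysis_date - max(past)).days if past else None


def velocities(counts):
    """Actual (difference) and relative (ratio) change between consecutive periods."""
    pairs = list(zip(counts, counts[1:]))
    actual = [later - earlier for earlier, later in pairs]
    # The ratio is undefined when the earlier period had no activity.
    relative = [later / earlier if earlier else None for earlier, later in pairs]
    return actual, relative


# Hypothetical purchase history for one customer.
analysis_date = date(2024, 4, 1)
purchases = [date(2024, 3, 3), date(2024, 3, 4), date(2024, 3, 24),
             date(2024, 3, 25), date(2024, 3, 31)]

counts = weekly_counts(purchases, analysis_date, n_weeks=4)
print(counts)                                  # time series: [1, 0, 1, 2]
print(recency_days(purchases, analysis_date))  # recency: 1
print(sum(counts))                             # frequency in the window: 4
print(velocities([5, 7]))                      # the 5-then-7 visits example: ([2], [1.4])
```

Note that the relative velocity is left as `None` when the earlier period had no activity. How to treat those cases (dropping them, capping them, or falling back to the actual velocity) is a modelling choice that depends on your data.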

That’s the first half of the list: the time-related predictors. Please join us again in a fortnight when we delve into width, depth, grouping and net values, i.e. the aggregation-based predictors that you can use to predict churn.