Churn Analysis

What We Are Trying To Do

One of the more common tasks in Business Analytics is to try and understand consumer behaviour. By understanding the hope is that a company can better change this behaviour. In many industries it is more expensive to find a new customer then to entice an existing one to stay. This is usually known as “churn” analysis. The aim is to accurately identify the cohort who is likely to leave early enough so that the relationship can be saved.

Data Set

This data is taken from a telecommunications company and involves customer data for a collection of customers who either stayed with the company or left within a certain period. In many industries its often not the case that the cut off is so binary. Frequently it might be more likely that a client account lays dormant rather then getting explicitly closed - for example if the client only pays for usage. I will explain how to adjust for these situations latter in the piece. This dataset is taken from [here] ( with descriptions of the data available [here] ( This dataset also appears in the [C50 package] (

Lets load the required packages:


Next up we need to load and split the data. The purpose of the split is to try and have a more accurate estimate of how our algorithm/s will go in production. If we were to assess the accuracy on the same data that we use to tune the algorithm, we will have an unrealistically optimistic sense of the model. So we cut up the data and keep a “test” set which we never use to build the model. We can then run the model on “unseen” data and get a more accurate idea of the model’s worth.

We can use the Caret package to split the data into a “Training Set” with 75% of the data and a “Test” Set with 25%.

churn <- read.csv("D:/Users/soates/Downloads/churn.txt", header=T)
#churn <- read.csv("E:/Github Stuff/", header=T)
trainIndex <- caret::createDataPartition(churn$Churn., p = .75, list = FALSE, times = 1)
churnTrain <- churn[ trainIndex,]
churnTest <- churn[-trainIndex, ]

So the first step is to have a look at the balance of the outcomes. In this case its very binary, either the client has an existing contract with our telecommunications company or they have cancelled it.

## False.  True. 
##   2138    363

So we can see from this that about ~15% of customers left our service. So we now have a baseline for naive prediction of ~85% if we just predict all customers will stay with us (which would not be very useful).

The next step is to have a look at the data - lets start by just looking at numerical summaries. The main things we are looking for are:

  1. Missing Data: In particular is the missing data randomly spread or is affecting a particular cohort? Can we impute the data or should we just ignore it?

  2. Errors: These are often of two kinds. Sometimes just obvious errors that stand out (someone might be listed as an impossible age like in our case less then say 16 and more then 85 might be suspicious) sometimes when the initial data was entered a “special” number is used to indicate its a missing field. Often this is something like 999999 or -99999 etc. and it really stands out from the other values.

  3. Very low or high variance: Indicators that are nearly 100% one value are not likely to be very useful and in fact in many algorithms can have a detrimental affect. (If in doubt we can always run the algorithm twice and see which one does better). Conversely if an indicator is a randomly assigned label (think of an ID number) it is providing no information and can be discarded.

##      State      Account.Length   Area.Code        Phone      Int.l.Plan
##  WV     :  86   Min.   :  1    Min.   :408   327-3053:   1   no :2261  
##  NY     :  66   1st Qu.: 75    1st Qu.:408   327-3587:   1   yes: 240  
##  OH     :  65   Median :101    Median :415   327-3850:   1             
##  AL     :  63   Mean   :101    Mean   :437   327-3954:   1             
##  WI     :  63   3rd Qu.:126    3rd Qu.:415   327-4795:   1             
##  VA     :  59   Max.   :232    Max.   :510   327-5817:   1             
##  (Other):2099                                (Other) :2495             
##  VMail.Plan VMail.Message      Day.Mins     Day.Calls     Day.Charge  
##  no :1824   Min.   : 0.00   Min.   :  0   Min.   :  0   Min.   : 0.0  
##  yes: 677   1st Qu.: 0.00   1st Qu.:144   1st Qu.: 88   1st Qu.:24.5  
##             Median : 0.00   Median :179   Median :101   Median :30.5  
##             Mean   : 7.91   Mean   :180   Mean   :101   Mean   :30.6  
##             3rd Qu.:19.00   3rd Qu.:217   3rd Qu.:115   3rd Qu.:36.8  
##             Max.   :51.00   Max.   :351   Max.   :163   Max.   :59.6  
##     Eve.Mins     Eve.Calls       Eve.Charge     Night.Mins   
##  Min.   :  0   Min.   :  0.0   Min.   : 0.0   Min.   : 23.2  
##  1st Qu.:167   1st Qu.: 87.0   1st Qu.:14.2   1st Qu.:166.9  
##  Median :202   Median :100.0   Median :17.1   Median :201.4  
##  Mean   :201   Mean   : 99.7   Mean   :17.1   Mean   :201.2  
##  3rd Qu.:234   3rd Qu.:113.0   3rd Qu.:19.9   3rd Qu.:236.8  
##  Max.   :364   Max.   :164.0   Max.   :30.9   Max.   :381.9  
##   Night.Calls     Night.Charge     Intl.Mins      Intl.Calls  
##  Min.   : 33.0   Min.   : 1.04   Min.   : 0.0   Min.   : 0.0  
##  1st Qu.: 86.0   1st Qu.: 7.51   1st Qu.: 8.5   1st Qu.: 3.0  
##  Median :100.0   Median : 9.06   Median :10.3   Median : 4.0  
##  Mean   : 99.8   Mean   : 9.06   Mean   :10.2   Mean   : 4.5  
##  3rd Qu.:114.0   3rd Qu.:10.66   3rd Qu.:12.1   3rd Qu.: 6.0  
##  Max.   :175.0   Max.   :17.19   Max.   :20.0   Max.   :18.0  
##   Intl.Charge   CustServ.Calls    Churn.    
##  Min.   :0.00   Min.   :0.00   False.:2138  
##  1st Qu.:2.30   1st Qu.:1.00   True. : 363  
##  Median :2.78   Median :1.00                
##  Mean   :2.77   Mean   :1.56                
##  3rd Qu.:3.27   3rd Qu.:2.00                
##  Max.   :5.40   Max.   :9.00                
  1. We can see that there are no missing data in this set (something that almost never happens in real life!).

  2. I cannot see any super obvious mistakes though some of the fields seem a little uncertain. In particular

  • Its not clear what length of time these fields cover? I am guessing its an average of something (because otherwise the longer held accounts would have higher fields) but is it a daily, weekly or monthly average? Its hard to tell if these are unrealistic values without knowing the time covered. (As long as the data is correct though it won’t matter what the ratio is as long as its standardised)
  • We need to have a closer look at the States and Phone Fields as we cannot tell if there are errors from this summary.
## 36 63 36 46 25 46 52 45 49 46 46 38 32 47 44 58 49 50 37 47 52 42 49 58 50 
## 46 56 51 44 42 42 50 50 49 66 65 51 54 37 50 48 45 40 51 51 59 59 44 63 86 
## WY 
## 59

We can also see some features that are being treated as numerical that should not be (for example Area.Code is treated as a numerical factor when it should be a categorical factor as the numbers are not related to each other). The other factor that stands out is that Phone Field is not really usable as it currently stands as it seems that each phone number is singular. We will drop this variable for the moment. (Depending on the time frame we had for the project or if it was a competition we would spend time looking at the structure of the phone numberto try and extract something more meaningful).

So lets switch the Area Code to a categorical factor and drop the Phone Field.


The next step is to have a close look at the variables graphically. Often looking graphically is more illuminating then just a summary (and famously an indentical numerical summary can mask vast differences). We also want to see the Shape of the Distributions. Some algorithms work more accurately if we can transform highly skewed data to more closely resemble a standard (Gaussian) distribution. Sometimes its as simple as a log transformation, sometimes we need something like a Box-Cox transformation. It is usually good practice to scale and centre our data as well as the difference in magnitude will trip up some algorithms. We won’t have to worry about this Today as the Random Forest algorithim does not require these changes.

We can also start to form testable ideas about relationships. For example does the “Account Length” field have an impact on if they churn?

one<-ggplot(churn, aes(x=Account.Length, fill=Churn.))+geom_density()+ facet_grid(Churn. ~ .) + labs(title="Account Length")

plot of chunk unnamed-chunk-7

Nope! Does not appear to be any noticable difference.

I actually looked at all the variables but I won’t bore you with all of them ;)

Next up we can start the process of building an actual model. We will use the excellent caret package for this. As mentioned the imbalance in the number of churned customers and the fact that we really want to predict who will be a churned customer mean we have to make some modificiations to our approach. More technically we are intrested in sensitivity in our models rather then specificity.

rfmodel<-train(churnTrain$Churn.~., data=churnTrain, method="rf", trainControl = c(method = "adaptive_cv", number = 10, repeats = 5, classProbs = TRUE, summaryFunction = twoClassSummary, adaptive = list(min = 10, alpha = 0.05, method = "gls", complete = TRUE)), metric="Kappa")
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## Bootstrapped (25 reps) Confusion Matrix 
## (entries are percentages of table totals)
##           Reference
## Prediction False. True.
##     False.   84.3   3.9
##     True.     1.0  10.7

Now we can’t be too excited yet as this is how we did on the same data we used to build the model. So lets have a look how it goes with the “test” set that our model has never seen before.

pred<-predict(rfmodel, newdata=churnTest)
confusionMatrix(pred, churnTest$Churn.)
## Confusion Matrix and Statistics
##           Reference
## Prediction False. True.
##     False.    700    35
##     True.      12    85
##                Accuracy : 0.944         
##                  95% CI : (0.926, 0.958)
##     No Information Rate : 0.856         
##     P-Value [Acc > NIR] : 6.63e-16      
##                   Kappa : 0.751         
##  Mcnemar's Test P-Value : 0.00133       
##             Sensitivity : 0.983         
##             Specificity : 0.708         
##          Pos Pred Value : 0.952         
##          Neg Pred Value : 0.876         
##              Prevalence : 0.856         
##          Detection Rate : 0.841         
##    Detection Prevalence : 0.883         
##       Balanced Accuracy : 0.846         
##        'Positive' Class : False.        

Not too shabby. Of the 97 that we predicted would have left we got 85 right and only 12 wrong. We did miss 35 other clients who left but its a big improvement from a naive guess of “All Stay” that we started with. In a latter post I will show how we can try and capture more of those missing 35 by accepting a few more “false positives”.

  • Chapter 16 of the excellent Applied Predictive Modeling by Max Kuhn & Kjell Johnson covers cases like this where there is a class imbalance.
  • A new feature of the Caret package (authored by Max Kuhn) is adaptive resampling. The package website contains an overview.