One of the more common tasks in Business Analytics is to try and understand consumer behaviour. By understanding the hope is that a company can better change this behaviour. In many industries it is more expensive to find a new customer then to entice an existing one to stay. This is usually known as “churn” analysis. The aim is to accurately identify the cohort who is likely to leave early enough so that the relationship can be saved.
This data is taken from a telecommunications company and involves customer data for a collection of customers who either stayed with the company or left within a certain period. In many industries its often not the case that the cut off is so binary. Frequently it might be more likely that a client account lays dormant rather then getting explicitly closed - for example if the client only pays for usage. I will explain how to adjust for these situations latter in the piece. This dataset is taken from [here] (http://www.dataminingconsultant.com/data/churn.txt) with descriptions of the data available [here] (http://www.sgi.com/tech/mlc/db/churn.names). This dataset also appears in the [C50 package] (http://cran.r-project.org/web/packages/C50/index.html)
Lets load the required packages:
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(reshape2)
library(caret)
Next up we need to load and split the data. The purpose of the split is to try and have a more accurate estimate of how our algorithm/s will go in production. If we were to assess the accuracy on the same data that we use to tune the algorithm, we will have an unrealistically optimistic sense of the model. So we cut up the data and keep a “test” set which we never use to build the model. We can then run the model on “unseen” data and get a more accurate idea of the model’s worth.
We can use the Caret package to split the data into a “Training Set” with 75% of the data and a “Test” Set with 25%.
churn <- read.csv("D:/Users/soates/Downloads/churn.txt", header=T)
#churn <- read.csv("E:/Github Stuff/srepho.github.io/churn.txt", header=T)
set.seed(12)
trainIndex <- caret::createDataPartition(churn$Churn., p = .75, list = FALSE, times = 1)
churnTrain <- churn[ trainIndex,]
churnTest <- churn[-trainIndex, ]
So the first step is to have a look at the balance of the outcomes. In this case its very binary, either the client has an existing contract with our telecommunications company or they have cancelled it.
table(churnTrain$Churn.)
##
## False. True.
## 2138 363
So we can see from this that about ~15% of customers left our service. So we now have a baseline for naive prediction of ~85% if we just predict all customers will stay with us (which would not be very useful).
The next step is to have a look at the data - lets start by just looking at numerical summaries. The main things we are looking for are:
Missing Data: In particular is the missing data randomly spread or is affecting a particular cohort? Can we impute the data or should we just ignore it?
Errors: These are often of two kinds. Sometimes just obvious errors that stand out (someone might be listed as an impossible age like in our case less then say 16 and more then 85 might be suspicious) sometimes when the initial data was entered a “special” number is used to indicate its a missing field. Often this is something like 999999 or -99999 etc. and it really stands out from the other values.
Very low or high variance: Indicators that are nearly 100% one value are not likely to be very useful and in fact in many algorithms can have a detrimental affect. (If in doubt we can always run the algorithm twice and see which one does better). Conversely if an indicator is a randomly assigned label (think of an ID number) it is providing no information and can be discarded.
summary(churnTrain)
## State Account.Length Area.Code Phone Int.l.Plan
## WV : 86 Min. : 1 Min. :408 327-3053: 1 no :2261
## NY : 66 1st Qu.: 75 1st Qu.:408 327-3587: 1 yes: 240
## OH : 65 Median :101 Median :415 327-3850: 1
## AL : 63 Mean :101 Mean :437 327-3954: 1
## WI : 63 3rd Qu.:126 3rd Qu.:415 327-4795: 1
## VA : 59 Max. :232 Max. :510 327-5817: 1
## (Other):2099 (Other) :2495
## VMail.Plan VMail.Message Day.Mins Day.Calls Day.Charge
## no :1824 Min. : 0.00 Min. : 0 Min. : 0 Min. : 0.0
## yes: 677 1st Qu.: 0.00 1st Qu.:144 1st Qu.: 88 1st Qu.:24.5
## Median : 0.00 Median :179 Median :101 Median :30.5
## Mean : 7.91 Mean :180 Mean :101 Mean :30.6
## 3rd Qu.:19.00 3rd Qu.:217 3rd Qu.:115 3rd Qu.:36.8
## Max. :51.00 Max. :351 Max. :163 Max. :59.6
##
## Eve.Mins Eve.Calls Eve.Charge Night.Mins
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 23.2
## 1st Qu.:167 1st Qu.: 87.0 1st Qu.:14.2 1st Qu.:166.9
## Median :202 Median :100.0 Median :17.1 Median :201.4
## Mean :201 Mean : 99.7 Mean :17.1 Mean :201.2
## 3rd Qu.:234 3rd Qu.:113.0 3rd Qu.:19.9 3rd Qu.:236.8
## Max. :364 Max. :164.0 Max. :30.9 Max. :381.9
##
## Night.Calls Night.Charge Intl.Mins Intl.Calls
## Min. : 33.0 Min. : 1.04 Min. : 0.0 Min. : 0.0
## 1st Qu.: 86.0 1st Qu.: 7.51 1st Qu.: 8.5 1st Qu.: 3.0
## Median :100.0 Median : 9.06 Median :10.3 Median : 4.0
## Mean : 99.8 Mean : 9.06 Mean :10.2 Mean : 4.5
## 3rd Qu.:114.0 3rd Qu.:10.66 3rd Qu.:12.1 3rd Qu.: 6.0
## Max. :175.0 Max. :17.19 Max. :20.0 Max. :18.0
##
## Intl.Charge CustServ.Calls Churn.
## Min. :0.00 Min. :0.00 False.:2138
## 1st Qu.:2.30 1st Qu.:1.00 True. : 363
## Median :2.78 Median :1.00
## Mean :2.77 Mean :1.56
## 3rd Qu.:3.27 3rd Qu.:2.00
## Max. :5.40 Max. :9.00
##
We can see that there are no missing data in this set (something that almost never happens in real life!).
I cannot see any super obvious mistakes though some of the fields seem a little uncertain. In particular
table(churnTrain$State)
##
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO
## 36 63 36 46 25 46 52 45 49 46 46 38 32 47 44 58 49 50 37 47 52 42 49 58 50
## MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV
## 46 56 51 44 42 42 50 50 49 66 65 51 54 37 50 48 45 40 51 51 59 59 44 63 86
## WY
## 59
We can also see some features that are being treated as numerical that should not be (for example Area.Code is treated as a numerical factor when it should be a categorical factor as the numbers are not related to each other). The other factor that stands out is that Phone Field is not really usable as it currently stands as it seems that each phone number is singular. We will drop this variable for the moment. (Depending on the time frame we had for the project or if it was a competition we would spend time looking at the structure of the phone numberto try and extract something more meaningful).
So lets switch the Area Code to a categorical factor and drop the Phone Field.
churnTrain$Phone<-NULL
churnTest$Phone<-NULL
churnTrain$Area.Code<-as.factor(churnTrain$Area.Code)
churnTest$Area.Code<-as.factor(churnTest$Area.Code)
The next step is to have a close look at the variables graphically. Often looking graphically is more illuminating then just a summary (and famously an indentical numerical summary can mask vast differences). We also want to see the Shape of the Distributions. Some algorithms work more accurately if we can transform highly skewed data to more closely resemble a standard (Gaussian) distribution. Sometimes its as simple as a log transformation, sometimes we need something like a Box-Cox transformation. It is usually good practice to scale and centre our data as well as the difference in magnitude will trip up some algorithms. We won’t have to worry about this Today as the Random Forest algorithim does not require these changes.
We can also start to form testable ideas about relationships. For example does the “Account Length” field have an impact on if they churn?
one<-ggplot(churn, aes(x=Account.Length, fill=Churn.))+geom_density()+ facet_grid(Churn. ~ .) + labs(title="Account Length")
one
Nope! Does not appear to be any noticable difference.
I actually looked at all the variables but I won’t bore you with all of them ;)
Next up we can start the process of building an actual model. We will use the excellent caret package for this. As mentioned the imbalance in the number of churned customers and the fact that we really want to predict who will be a churned customer mean we have to make some modificiations to our approach. More technically we are intrested in sensitivity in our models rather then specificity.
set.seed(12)
rfmodel<-train(churnTrain$Churn.~., data=churnTrain, method="rf", trainControl = c(method = "adaptive_cv", number = 10, repeats = 5, classProbs = TRUE, summaryFunction = twoClassSummary, adaptive = list(min = 10, alpha = 0.05, method = "gls", complete = TRUE)), metric="Kappa")
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
confusionMatrix(rfmodel)
## Bootstrapped (25 reps) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction False. True.
## False. 84.3 3.9
## True. 1.0 10.7
Now we can’t be too excited yet as this is how we did on the same data we used to build the model. So lets have a look how it goes with the “test” set that our model has never seen before.
pred<-predict(rfmodel, newdata=churnTest)
confusionMatrix(pred, churnTest$Churn.)
## Confusion Matrix and Statistics
##
## Reference
## Prediction False. True.
## False. 700 35
## True. 12 85
##
## Accuracy : 0.944
## 95% CI : (0.926, 0.958)
## No Information Rate : 0.856
## P-Value [Acc > NIR] : 6.63e-16
##
## Kappa : 0.751
## Mcnemar's Test P-Value : 0.00133
##
## Sensitivity : 0.983
## Specificity : 0.708
## Pos Pred Value : 0.952
## Neg Pred Value : 0.876
## Prevalence : 0.856
## Detection Rate : 0.841
## Detection Prevalence : 0.883
## Balanced Accuracy : 0.846
##
## 'Positive' Class : False.
##
Not too shabby. Of the 97 that we predicted would have left we got 85 right and only 12 wrong. We did miss 35 other clients who left but its a big improvement from a naive guess of “All Stay” that we started with. In a latter post I will show how we can try and capture more of those missing 35 by accepting a few more “false positives”.