I’ve been posting a lot of raw learning material on this blog lately, and looking for the time and material to write something with a more personal flavor — I’m glad to say I’ve finally found both. This week, like the last one, concluded with a day-long case study, during which we had one short day to produce the best prediction and analysis of a dataset that we could. The problem of the day was churn prediction for an Uber-like ride-sharing company, a good occasion for our team of 4 to apply some of the practices absorbed over the last few weeks, plus a few others overheard in Kaggle forums. Here’s an overview of the whole process.
The dataset provided this week was refreshingly clean, compared to last week’s, where most of the day was spent making the data workable (details I thought I’d spare you). While this placed the exercise a bit further from real life and closer to a textbook study, it left much more room for experimentation and execution, which we used as much as we could.
The dataset was 50K rows by 12 columns. It held two timestamps: last_trip_date, from which the churn target was built, and signup_date, which we converted into a day count to use as a feature. All sign-ups were from January 2014, and neither the week nor the day of week seemed to carry much information, so we didn’t build more features around them. Two categorical features (phone and city) with 2 and 3 levels were turned into 1 and 2 dummy variables respectively. Only 3 columns had missing values, ranging from 1% to 20% of rows. Numerical gaps were filled with the column mean, and an indicator column was created a posteriori to record which values had been imputed.
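The imputation step can be sketched in a few lines of pandas — the column name here is made up for illustration, not taken from the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with missing values.
df = pd.DataFrame({"avg_rating": [4.0, np.nan, 5.0, 3.0, np.nan]})

# Record where values were missing, then fill with the column mean.
df["avg_rating_missing"] = df["avg_rating"].isna().astype(int)
df["avg_rating"] = df["avg_rating"].fillna(df["avg_rating"].mean())
```

The indicator column lets the models learn whether missingness itself is predictive of churn, which the mean-filled value alone would hide.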
We then spent some time understanding the data and how the different features affected churn.
Plotting the scatter matrix shows only marginal correlation between our variables, and moreover suggests that our observations will not be linearly separable, as none of the 2D projections of the feature space shows a linear separation.
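The scatter-matrix pass is a one-liner with pandas — the feature names and random data below are placeholders, not the real columns:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Stand-in data: three made-up numeric features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)),
                  columns=["avg_dist", "avg_rating", "trips"])

# Pairwise scatter plots off-diagonal, histograms on the diagonal.
axes = scatter_matrix(df, alpha=0.3, figsize=(6, 6), diagonal="hist")
```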
Here are some examples of the observed relationships between features and with the churn metric:
Besides some patterns due to rounded features and a clearly non-linear boundary, the correlation with churn remains relatively opaque at this point.
We find more insight by splitting the dataset across the few categorical variables, as churn seems to vary greatly from one subset to the next.
We run a couple of t-tests to challenge these observations, and indeed observe significant differences between phones and between cities: it might be worth building models at a smaller scale. We save this idea for later.
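A t-test of this kind is a quick scipy call. The churn rates below are invented for the example; the real ones came from the dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 churn indicators for two phone groups.
rng = np.random.default_rng(0)
iphone_churn = rng.binomial(1, 0.55, size=1000)
android_churn = rng.binomial(1, 0.45, size=1000)

# Welch's t-test: does mean churn differ significantly between groups?
t_stat, p_value = stats.ttest_ind(iphone_churn, android_churn,
                                  equal_var=False)
```

A small p-value here is what justified the "models per subset" idea we shelved for later.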
With our features clean and better understood, we move on to running a few models. We first discuss the best metric to measure our results, and decide to go with accuracy: our cost-benefit matrix is not straightforward, and AUC would make comparisons with other results harder.
We then roll up our sleeves and start fitting the usual suspects.
The logistic regression didn’t perform very well, with a score of 71.15%, but did offer a good explanation of the features, valuable when presenting the findings. The odds ratios of the different features were as follows:
These can be interpreted as the multiplying factor applied to the odds of churning for an increase of 1 unit in each feature. In more human-friendly words:
- If I increase my average distance by one, the odds of me churning are multiplied by 1.6.
- If I switch from iPhone to Android, the chances of me churning are divided by 1.67.
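Odds ratios fall straight out of a fitted logistic regression by exponentiating the coefficients. This sketch uses synthetic data rather than the ride-sharing set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)  # put features on comparable scales

model = LogisticRegression().fit(X, y)

# exp(beta): multiplying factor on the odds per +1 unit of each feature.
odds_ratios = np.exp(model.coef_[0])
```

A ratio above 1 means the feature pushes the odds of churning up; below 1, down — which is exactly how the avg_dist and phone examples above read.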
That’s great insight. Nonetheless, these figures say nothing about the actual significance of these relationships, nor how much we can rely on the ratios. We could compute the p-values and standard errors of the coefficients, but another way to get a feel for significance comes with our next model:
Random Forest is a good contender for best model, and also lets us take a peek at the feature importances.
We see that the correlation with the odds ratios is very superficial: features like avg_dist, which seemed the most relevant in the logistic regression context, actually rank very low in feature importance. Also worth highlighting, the 3 levels of the city feature all sit at the top of this chart, despite the fact that any one of them is entirely explained by the other two. This is because the random forest picks a random subset of variables for each tree, which makes it likely that one of the 3 cities is missing from any given subset, allowing all 3 of them to rank well.
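Reading the importances off a fitted forest is straightforward — again on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, only 3 of which are informative.
X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean impurity decrease per feature; the values sum to 1.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important first
```

Because importances are normalized to sum to 1, they compare features within one forest but aren't directly comparable across models — hence the superficial match with the odds ratios.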
At this point it’s worth saying a few words about our fine-tuning approach. To avoid spending our (little) time running hundreds of models in grid searches, we first run a quick analysis of the number of observations required to get a reasonable prediction.
This lets us cap the number of observations used for cross-validation at some 30% of our data, rounded up to 15,000 rows picked at random without replacement. We then run the same process to choose a reasonable number of trees as our threshold in the speed-accuracy trade-off, and settle on 300 trees.
With this setup, we can run a quick exhaustive grid search, and we settle on the following parameters:
- Max number of features: Square root of total.
- Minimum samples per leaf: 7.
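This kind of search maps directly onto sklearn's GridSearchCV; the grid below is smaller than the one we actually ran, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Toy grid over the two parameters we ended up tuning.
param_grid = {"max_features": ["sqrt", None],
              "min_samples_leaf": [1, 7]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid, cv=3,
).fit(X, y)

best = search.best_params_  # e.g. {"max_features": "sqrt", ...}
```

Fixing n_estimators ahead of the search, as described above, is what keeps the grid affordable: the tree count multiplies the cost of every cell.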
With these best parameters, the random forest achieves a very respectable accuracy of 78.1% on the test dataset.
Winner of the last case study, a faster and more robust alternative to sklearn’s Gradient Boosting Classifier, and an evergreen of Kaggle competitions, XGBoost feels like pretty much an unfair advantage, which we gladly used.
A first run already shows very promising results, with an accuracy close to 78% on the validation set. We then run a first analysis of the number of trees.
With this number fixed (for now), we run a grid search over several key parameters of XGBoost, and reinject the best parameters into our model:
- Max depth: 4
- Subsample: 0.5
- Max features (colsample): 0.4
Finally, we run another grid search on the learning rate / number of trees combination, which leads us to the following:
- Learning rate: 0.13
- Number of trees: 350.
The XGBClassifier then returns the best accuracy of all our models: a whopping 78.52%.
We then test different flavors of SVM, trying to take advantage of the flexible kernels.
The linear kernel doesn’t return anything interesting, as expected, since the data showed no sign of linear separability.
The polynomial kernel starts yielding interesting results: with a quadratic kernel it outperforms the logistic regression, achieving an accuracy of 73.9%.
The best results nonetheless come with the radial kernel, which achieves an accuracy of 76% after grid search. The best parameters are shown in the following heatmap, and come down to a penalty term
C of 1 and a kernel parameter
gamma of 0.1.
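The C/gamma search behind that heatmap is a standard grid over the RBF kernel's two knobs, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X = StandardScaler().fit_transform(X)  # SVMs are scale-sensitive

# C controls the penalty for misclassification; gamma the kernel width.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3).fit(X, y)
best = search.best_params_
```

The mean CV scores in `search.cv_results_` are exactly what gets reshaped into the C-by-gamma heatmap.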
KNN performs relatively well: better than the logistic regression, although worse than all the other models. The best K turns out to be 11, with which the model achieves an accuracy of 75.5%.
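Picking the best K is a one-dimensional sweep over cross-validated accuracy — a quick sketch on synthetic data (odd values of K avoid tied votes in binary classification):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Mean 5-fold CV accuracy for each odd K from 1 to 19.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                       X, y, cv=5).mean()
    for k in range(1, 21, 2)
}
best_k = max(scores, key=scores.get)
```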
To wrap up this section, here’s an overview of how the ROC curves and AUC compare across the models we fitted.
After testing and fine-tuning this set of models, we decide to move forward with the common Kaggle technique of model ensembling (see this very good article for more info).
We first try a simple majority vote amongst our 5 models, but are quickly disappointed: the ensemble actually performs… 1% worse than the XGB tree alone, bringing the ensemble accuracy down to 77%.
This is no surprise, given that our models are built on the same dataset and have very little to decorrelate them: they are therefore much too similar to add value to one another.
Weighted majority vote
The logic behind the weighted majority vote is simple: we want to give the best model more weight in the prediction. By giving the XGB a weight of 3, it accounts for 3/7 of the final vote: as a result, only unanimity amongst the four other voters can overrule it. This setup brings us to a score of… 78.4%, still lower than XGB alone.
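The weighted vote reduces to a dot product over the models' hard predictions. The predictions below are invented to show the mechanics, including the one case where the four other models overrule the XGB:

```python
import numpy as np

# Hypothetical 0/1 predictions from 5 models for 4 riders (rows = models).
preds = np.array([
    [1, 0, 1, 0],  # xgb (weight 3)
    [0, 0, 1, 1],  # random forest
    [0, 1, 1, 1],  # logistic regression
    [0, 0, 0, 1],  # svm
    [0, 1, 1, 1],  # knn
])
weights = np.array([3, 1, 1, 1, 1])

votes = weights @ preds                       # weighted "churn" votes
final = (votes > weights.sum() / 2).astype(int)  # majority of 7 needed
```

In the last column the XGB votes 0 but the other four unanimously vote 1 (4 > 3.5), so the ensemble overrules it — the only configuration that can.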
Averaged probability
A simple way to smooth the votes is to go one step up and use the probability each model outputs rather than its hard prediction. For the non-probabilistic models (KNN, SVM), scikit-learn provides pseudo-probabilities that make their outputs comparable. We then average them all and set our threshold to .5 (we haven’t fine-tuned this threshold, which could be a good option if we were able to attach realistic costs and benefits to our classifications).
This method outperforms the majority vote, but is still far from the XGB alone, with a modest 77.35%.
Weighted averaged probability
The next attempt boils down to weighting each model’s probability in the average, proportionally to its overall accuracy on the validation set. Alas, this merely raises the previous score to 77.65%.
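The accuracy-weighted average is again a dot product, this time over probabilities. The probabilities below are made up; the weights reuse the accuracies reported earlier in the post:

```python
import numpy as np

# Hypothetical churn probabilities from 3 models for 3 riders.
probas = np.array([
    [0.80, 0.40, 0.55],  # xgb
    [0.70, 0.45, 0.60],  # random forest
    [0.60, 0.30, 0.40],  # logistic regression
])

# Weights proportional to each model's validation accuracy.
accuracies = np.array([0.785, 0.781, 0.7115])
weights = accuracies / accuracies.sum()  # normalize to sum to 1

avg = weights @ probas            # weighted average probability
pred = (avg >= 0.5).astype(int)   # fixed .5 threshold, as in the post
```

Because the accuracies are so close (all within ~7 points), the weights barely differ from a plain average — consistent with the tiny 0.3% gain we saw.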
Blending is a much more elegant solution. It was introduced during the Netflix Prize, and has since spread to many of the winning solutions in Kaggle competitions. The concept is as follows:
* Split the training dataset in k folds
* For each of the k folds, train on the (k-1) other folds and output probabilities on the k-th fold. Stacking these gives probabilities for the whole training dataset, without any model ever predicting data it was trained on.
* Iterate over all models and concatenate the probabilities from each. In doing so, we recreate a set of p features, where p is this time the number of models we trained and cross-validated.
* Use these new features to train a new model.
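The steps above map neatly onto sklearn's cross_val_predict. This sketch blends two base models (rather than our five) on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               KNeighborsClassifier()]

# Out-of-fold probabilities on the training set: each model only ever
# predicts a fold it was not trained on, so the meta-features are clean.
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the full training set for test-time features.
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# The meta-model: a logistic regression over the p model probabilities.
blender = LogisticRegression().fit(meta_train, y_tr)
acc = blender.score(meta_test, y_te)
```

The out-of-fold step is the whole point: fitting the meta-model on in-sample probabilities would just teach it to trust the most overfit base model.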
I went through these steps and built a logistic regression on top of our 5 previous models. This time, the effort paid off: the final prediction on the test dataset reached an accuracy of 78.6%, outperforming the XGB-only model by 0.2%. (Surprisingly, the logistic regression outperformed the XGB as the meta-model. I assume probabilities, being continuous normalized variables by nature, make a better input for a logistic regression than for tree ensembles.)
All in all, a lot of effort for a small gain, which surely makes sense in the context of a Kaggle competition, and probably less in a real-life setting, unless that 0.2% actually translates into a substantial gain in the way the model is used. But an interesting experiment altogether.
The whole day has also been rich in learnings and a great occasion to practice building data science projects as a team. I’ll close this long post by cheering for the great team work with Catherine, Joel and Clement. International team FTW 😉