Abstract
Customer churn prediction is an active research topic for both the data mining community and business managers. The ability to identify churning customers accurately is something every company wishes to achieve. Experiments on customer churn show that customers can generally be divided into different types, and that customers in the same segment tend to have similar personas, behavioral preferences, and focus points. Therefore, a hybrid classification model named ClusGBDT is proposed for customer churn prediction. The model has three stages: a feature transformation stage, a customer clustering stage, and a prediction stage. First, a multi-layer perceptron is trained as a prediction model and the original attributes are replaced with low-dimensional vectors. Then, customers are divided into segments using K-means. Lastly, a separate prediction model based on GBDT is constructed for every customer segment. Several measures are used to evaluate the prediction performance. The experiments show that our design can improve the original classification algorithms, including GBDT, random forest, and logistic regression. Additionally, the proposed framework helps us to comprehend customer data.
Introduction
Customer churn prediction, one of the most frequently tackled tasks in customer relationship management, has gained increasing attention in recent years. It helps executives find target customers, retain customers and explore customer value [1]. Concretely, customer churn prediction means constructing a prediction model that uses data mining technologies and historical customer information to estimate the future churn probability of every customer. Generally speaking, long-term customers have more stable spending power than new customers. Moreover, according to a survey, attracting new customers in mature markets costs several times more than preventing regular customers from leaving [2]. Consequently, formulating customer retention strategies is crucial for enterprises that want to enhance profitability and competitiveness.
However, constructing a credible prediction model is challenging, because the historical customer information is hidden, noisy and complicated. In previous research, customer churn prediction has mainly followed two directions. On the one hand, some researchers concentrate on improving predictive performance by constructing complex algorithms [3]. For example, support vector machines (SVM) [4–7, 30], neural networks (NN) [8–10], and the rough set approach [11] greatly improve predictive performance but generally have difficulty revealing the relation between churn and the variables. On the other hand, other investigators want to understand from the model what drives customers to churn [12], so as to help executives take corresponding measures. For instance, decision trees (DT) [13–15], logistic regression (LR) [2, 17], and random forests (RF) [16–18] have been applied to customer churn prediction because of their robustness, comprehensibility, and good predictive performance. To date, customer churn prediction has been widely used in various domains, including the banking sector [19–21], online gaming [22, 23], the telecommunication industry [3, 24], the insurance industry [25], and financial services [26].
In this paper, a hybrid predictive algorithm named ClusGBDT is proposed. The model is motivated by the observation that customers in the same segment generally have similar personas, behavioral preferences, and focus points. In our approach, customers are assigned to several segments based on their behaviors before classification models are built, which improves the performance of churn prediction and helps us to comprehend the churn drivers. Specifically, a multi-layer perceptron (MLP) [27] is first used to reduce the dimension of the variables, which reduces the computational cost and eliminates variable outliers. Then k-means is applied to divide customers into groups, and data analysis is conducted for each group. Finally, a separate classifier based on gradient boosting decision tree (GBDT) is constructed for each customer segment.
The contributions of this study are summarized as follows: (1) ClusGBDT is proposed as a new hybrid classification algorithm that, according to the experimental results, enhances the predictive performance and robustness of GBDT. (2) It helps managers comprehend the characteristics of customers in different segments so as to formulate corresponding strategies. (3) A general churn prediction framework for distinct industries is developed.
The rest of the paper is organized as follows: Section 2 presents the related work. In Section 3, a review of preliminaries is provided. Section 4 introduces the churn prediction model. Section 5 presents the experimental set-up. The customer segments analysis and experimental results are revealed in Section 6. The conclusions and future work are presented in Section 7.
Related work
Customer churn
As a result of the economic globalization and trade liberalization, a large number of companies enter the market, which results in the continuously increasing customer liquidity. Customer churn refers to the target customers who decide to abandon business services, stop purchasing products, or switch to a competitor in the market. The previous study has revealed the following three types of customer churners [11]:
The first two types of churners can be predicted easily by manual methods. The third type, however, is difficult to predict because the historical information of these customers is extremely complicated. The aim of a customer churn prediction model is therefore to predict the third type of churner.
Review of customer churn prediction models
According to previous research, customer churn prediction has followed two directions.
On the one hand, some researchers concentrate on improving predictive performance by constructing complex algorithms. He et al. [28] proposed a prediction model based on SVM and random sampling. First, random sampling is used to address class imbalance by changing the data distribution of the samples. Then, the SVM is applied to construct the prediction model. Also with regard to class imbalance, Chen et al. [29] presented a classification algorithm based on the CSCUM chart. This algorithm only needs to collect the inter-arrival time (IAT), so that the churn possibility can be estimated for the purpose of individual monitoring. Gordini et al. [30] applied SVM with an AUC-based parameter-selection technique to customer churn prediction. This work showed that parameter optimization plays a significant role in prediction performance and that combining a data-driven algorithm with a retention strategy is better than the common heuristic management method. Amin et al. [11] designed an intelligent rule-based approach that extracts decision rules related to churn customers using four rule-generation mechanisms, namely, Exhaustive Algorithm (EA), Genetic Algorithm (GA), Covering Algorithm (CA), and LEM2 Algorithm (LA). Stripling et al. [2] incorporated the concept of profit maximization into customer churn prediction for the first time by using genetic algorithms to optimize the expected maximum profit measure (EMPC). Wang et al. [31] studied how GBDT predicts the future churn possibility based on customer activities in search advertising. This method first extracts two types of features for the GBDT, dynamic features and static features, and then constructs the GBDT prediction model based on them. Amin et al. [24] proposed a prediction method based on the distance factor, which is aimed at estimating the classification certainty of different regions in the dataset.
On the other hand, some investigators want to understand what drives customer churn. Xie et al. [18] used weighted random forest (WRF) to predict churn customers. This method not only handles class balance better but also retains good interpretability. De Bock et al. [32] presented a prediction model based on rotation forest and Adaboost in 2011. The rotation forest is applied to extract customer features while the Adaboost method is used to improve predictive performance. The experimental results revealed that the predictive performance is highly improved. However, the interpretability of the model and the understandability of churn factors are deficient. Hence, De Bock et al. [33] conducted another study, combining generalized additive models (GAM) with an ensemble classification algorithm. Experimental results showed that this method not only improves the predictive performance but also has great interpretability. Verbeke et al. [34] examined the use of social network information for customer churn prediction. This method uses social network effects to handle large scale networks, a time-dependent class label, and an imbalanced class distribution. In addition, this research introduced a new method incorporating non-Markov network effects within relational classifiers and a novel parallel modeling method that combines relational and non-relational classifiers. However, the utilization rate of useful information on social networks is low in this work for three reasons. Firstly, the network characterization is tedious due to the complexity of networks and the lack of corresponding methods. Secondly, the computational cost of deriving structural features in large scale networks is high. Thirdly, most dynamic features in networks are processed as static features. Therefore, Mitrovic et al. [35] proposed a panoptic representation learning approach that integrates interactive and structural information. 
This approach can account for different temporal granularities by slicing the information into different periods. Yang et al. [36] developed a framework based on interpretable user clustering and churn prediction. This framework first divides users into interpretable segments based on their daily activities and ego-network structures. Then a deep learning pipeline based on long short-term memory (LSTM) and an attention mechanism is designed. Extensive data analysis and experimental results revealed that this framework helps researchers comprehend user behaviors and outperforms other prediction approaches. Caigny et al. [37] designed a new hybrid classification algorithm for customer churn prediction based on LR and CART, which contains two stages: a segmentation stage and a prediction stage. In the segmentation stage, CART is applied to identify customer segments, and in the prediction stage a separate LR model is created for each leaf of this tree.
Preliminaries
Notation
For computational reasons,
Review of multi-layer perceptron
The multi-layer perceptron is a traditional neural network model, as shown in Fig. 1, in which each neuron of a given layer is fully connected to all neurons of the adjacent layers, without any cycle. The training data enter through the input layer and are then processed by one or several hidden layers to compute the low-dimensional representation of customer data. Finally, the output layer exports the prediction. Every neuron in the hidden layer consists of a linear combination of its inputs followed by a nonlinearity.

Fig. 1. Multi-layer perceptron.
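A hidden neuron of the kind described above is commonly written in the following form. The symbols are assumptions introduced here for illustration and are not necessarily the paper's own notation: $x$ is the input vector, $w_j$ and $b_j$ are the weights and bias of the $j$-th of $m$ hidden neurons, and $\sigma$ is the activation function.

```latex
h_j = \sigma\left( w_j^{\top} x + b_j \right), \qquad j = 1, \dots, m
```

Under this reading, the vector $(h_1, \dots, h_m)$ is the low-dimensional representation of one customer.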
Review of K-means
K-means is a popular clustering algorithm because of its comprehensibility and convergence. The goal of k-means is to find a set of centroids $C_k$, $k = 1, \dots, K$, such that the total squared loss, i.e. the sum of squared distances between each data point and its nearest centroid, is minimized.
As described in Algorithm 1, K-means alternates between optimizing the centroids $C_k$ and assigning each data point to its nearest centroid.
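The alternation between assignment and centroid update can be sketched as follows. This is a minimal pure-Python illustration of the standard Lloyd iteration, not a reproduction of Algorithm 1; the function names, the random initialization and the stopping rule are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster `points` into `k` groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: centroids are initialized from k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

In the proposed pipeline the points would be the low-dimensional customer vectors rather than the raw features.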
In this section, a detailed description of the proposed model named ClusGBDT is presented step by step. The model framework is shown in Fig. 2. Our proposed model has two main differences compared to previous methods. Firstly, in contrast to most predictive methods, we construct unique models for different customer segments in parallel. Secondly, the choice of classification algorithms is not fixed in step 3. The detailed steps are as follows.

Fig. 2. Presentation of the proposed model.
The original customer features are noisy and complex. Transforming them into a compact representation largely reduces the computational cost and eliminates the influence of differing variable distributions and useless variables. MLP has been successfully applied to customer churn prediction because of its strong prediction performance. In this paper, our purpose in training an MLP model is to obtain a low-dimensional representation of the customer variables.
A neural network with a single hidden layer is used to construct the prediction model. The activation functions tested for the hidden layer in the experiments are the Sigmoid, Tanh, ReLU, and ELU functions.
If the output value $o_i$ is greater than or equal to 0.5, the model classifies the customer as a churner; otherwise, the customer is classified as a normal one. Adaptive moment estimation (ADAM) [38] is used to minimize the cross-entropy loss between the outputs $o_i$ and the true churn labels.
Using typical clustering algorithms such as k-means to divide customers into segments directly is challenging for three reasons: (1) some customer features are useless for churn prediction; (2) every feature in the dataset has its own distribution; (3) the computational cost is high when dealing with large-scale datasets.
Hence, the MLP is applied in Step 1 to transform the original features into low-dimensional representation vectors. This largely reduces the computational cost and eliminates the influence of differing variable distributions and useless variables.
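The feature transformation can be illustrated with a small forward pass. The sketch below assumes the weights have already been trained (for example with ADAM) and only shows how a hidden-layer vector, a churn probability and the binary cross-entropy loss could be computed; all function and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hidden_representation(x, weights, biases, activation=math.tanh):
    """Map one input vector to its hidden-layer vector h = act(W x + b).

    `weights` holds one row of input weights per hidden neuron.
    """
    return [
        activation(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def churn_probability(h, out_weights, out_bias):
    """Output neuron: sigmoid of a linear combination of the hidden vector."""
    return sigmoid(sum(w * v for w, v in zip(out_weights, h)) + out_bias)


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

Only the output of `hidden_representation` would be passed on to the clustering stage; the output neuron serves to train the network.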
However, determining the number of customer segments is also challenging. Silhouette analysis [39] is a popular measure of clustering performance. For the i-th customer, let $a(i)$ denote the mean intra-cluster distance and $b(i)$ the mean nearest-cluster distance. The silhouette coefficient for the i-th customer is defined as:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
The best value is 1 and the worst value is -1. Positive values generally indicate that most samples are assigned to the right cluster, while negative values indicate the converse. Therefore, silhouette analysis is used to determine the number of customer segments.
The procedure of customer clustering is briefly described in Algorithm 2.
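A minimal sketch of the silhouette computation is given below, assuming Euclidean distance and a labeling that contains at least two clusters. The function names are illustrative, and the score of 0 for singleton clusters is a common convention rather than something stated in the paper.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

To choose the number of segments, one would compute this score for the labeling obtained with each candidate K and keep the K with the highest mean score.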
As shown in Fig. 3, we propose a parallel prediction model based on GBDT [31] for the experiments. GBDT is a high-performance method in data mining tasks that consists of numerous weak prediction models.

Fig. 3. Parallel GBDTs for customer churn prediction.
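The approximate function of GBDT is a sum of trees $T_p$:
$$f_p(x) = \sum_{j=1}^{p} T_j(x) = f_{p-1}(x) + T_p(x)$$
Here $f_{p-1}$ is the sum of the first $p - 1$ trees.
Hence, the optimization objective of $T_p(x)$ is to correct the errors of its $p - 1$ predecessors, that is, of $f_{p-1}(x)$.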
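The idea that each new tree corrects its predecessors can be sketched with one-split trees (stumps) fitted to the current residuals. This is a simplified illustration under stated assumptions: a single numeric feature, squared-error loss and a learning rate, none of which is taken from the paper, and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    if best is None:  # no valid split: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def fit_boosting(xs, ys, n_trees=20, learning_rate=0.5):
    """Return f(x) = sum of scaled stumps, each fitted to y - f_{p-1}(x)."""
    trees = []

    def predict(x):
        return sum(learning_rate * tree(x) for tree in trees)

    for _ in range(n_trees):
        # Residuals of the current ensemble are the targets of the next tree.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict
```

With 0/1 churn labels, a customer could then be flagged as a churner when the returned function exceeds 0.5; a production GBDT would typically use a classification loss and deeper trees.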
Note that the choice of classification algorithm is not fixed. In this paper, we use GBDT to construct the prediction model. However, other classification algorithms can also be used within the framework.
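Because the classifier is interchangeable, the per-segment training step can be written against a small fit/predict interface. The sketch below is illustrative only: `MajorityClassifier` is a hypothetical placeholder standing in for GBDT or any other algorithm, and the segment labels are assumed to come from the clustering stage.

```python
from collections import defaultdict


class MajorityClassifier:
    """Hypothetical placeholder: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def fit_segment_models(X, y, segments, make_classifier=MajorityClassifier):
    """Train one independent classifier per customer segment."""
    grouped = defaultdict(lambda: ([], []))
    for row, label, seg in zip(X, y, segments):
        grouped[seg][0].append(row)
        grouped[seg][1].append(label)
    return {
        seg: make_classifier().fit(rows, labels)
        for seg, (rows, labels) in grouped.items()
    }


def predict_by_segment(models, X, segments):
    """Route every customer to the model of its own segment."""
    return [models[seg].predict([row])[0] for row, seg in zip(X, segments)]
```

Since the segment models do not depend on each other, they can be trained in parallel; any object with `fit` and `predict` methods can be supplied through `make_classifier`.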
Experimental design
Experiments are conducted on four publicly available datasets from different industries. Table 1 provides an overview of these datasets and Table 2 gives some details about their features.
Table 1. Summary of datasets
Table 2. The description of some representative features
To keep the experiments reliable, the customer data are randomly split into training data and test data at a ratio of 8:2 five times, and the average performance is used for evaluation. All experiments are implemented on a single machine using Sklearn and TensorFlow with a 16-core 2.1 GHz CPU (E5-2620) and a 6 GB GPU (Quadro M5000).
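The repeated 8:2 split can be sketched as below. The helper name and the `evaluate` callback, which receives the training and test indices and returns a score, are assumptions made for this example.

```python
import random


def repeated_holdout(n_samples, evaluate, n_repeats=5, test_ratio=0.2, seed=0):
    """Average `evaluate(train_idx, test_idx)` over several random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        n_test = int(round(n_samples * test_ratio))
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)
```

Fixing the seed keeps the five splits reproducible across the compared models.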
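In customer churn prediction, data preprocessing is a crucial step. First, missing values in continuous variables are replaced by the mean value of the variable, and the variables are then processed via z-score normalization:
$$z = \frac{x - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the variable.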
Then, categorical variables are transformed into binary variables by one-hot encoding. This process creates v binary variables for each categorical variable, where v is the number of distinct values including the missing value. As shown in Fig. 4, for example, one input sample [Gender = Female, Weekday = Wednesday, Country = England] is transformed into a high-dimensional sparse vector. Lastly, because some datasets are heavily imbalanced, class weighting is applied to remedy this problem.
In addition, feature selection is a necessary component of data preprocessing, because variables with low variance are of little use to classifiers and one-hot encoding generates numerous sparse variables. Hence, features with low variance are removed to improve the predictive performance of the classifiers and reduce the computational cost.

Fig. 4. One-hot encoding.
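The preprocessing steps described in this section can be sketched in a few small functions. This is an illustration under assumptions: missing values are represented by `None`, the population standard deviation is used, and the missing-value token and the variance threshold are placeholders rather than values from the paper.

```python
import math


def impute_and_standardize(values):
    """Replace None by the column mean, then apply z-score normalization."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    std = math.sqrt(sum((v - mean) ** 2 for v in filled) / len(filled))
    if std == 0:
        return [0.0 for _ in filled]  # constant column carries no signal
    return [(v - mean) / std for v in filled]


def one_hot(values, missing="<missing>"):
    """One binary column per distinct value; missing counts as its own value."""
    filled = [missing if v is None else v for v in values]
    categories = sorted(set(filled))
    rows = [[1 if v == c else 0 for c in categories] for v in filled]
    return categories, rows


def variance(column):
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)


def keep_high_variance(columns, threshold=0.0):
    """Indices of the columns whose variance exceeds the threshold."""
    return [i for i, col in enumerate(columns) if variance(col) > threshold]
```

In practice the mean and standard deviation would be estimated on the training split only and then reused on the test split.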
Unlike in ordinary binary classification problems, most customer data in customer churn prediction are extremely imbalanced. Hence, several widely used evaluation measures are applied to assess the experimental results (i.e., accuracy, precision, recall, and F1). In this paper, True Positive (TP) is the number of samples correctly classified as churn customers, True Negative (TN) is the number of samples correctly classified as normal customers, False Positive (FP) is the number of samples falsely classified as churn customers, and False Negative (FN) is the number of samples falsely classified as normal customers. The measures used for the evaluation of classifiers are as follows.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
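These four measures can be computed directly from the confusion counts, as in the following sketch. The function name and the convention of returning 0 when a denominator is zero are assumptions made for the example.

```python
def churn_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with churn as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

On a heavily imbalanced dataset a model that never predicts churn can still reach high accuracy, which is why precision, recall and F1 are reported alongside it.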
Additionally, we use the Receiver Operating Characteristic (ROC) curve, a widely used tool for evaluating classifiers, to present the experimental results intuitively. The ROC curve is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied. Concretely, it plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis.
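How the curve arises from a sweep over thresholds can be sketched as follows. The example assumes 0/1 labels with both classes present and uses the trapezoidal rule for the area under the curve; the function names are illustrative.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A curve that stays closer to the top-left corner, and hence has a larger area, indicates a better ranking of churners above non-churners.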
Customer segment analysis
Dataset 4 is used to illustrate the customer segment analysis because it is convenient to visualize. Fig. 5 (a) presents the proportions of the two customer types. Additionally, the churn rate of each customer type is calculated and shown in Fig. 5 (b). The visualization shows that Cluster_0 customers are more likely to churn and that the churn rate of Cluster_1 is much lower, while the two clusters are nearly equal in size. Fig. 6 shows the visualization of customer segments via principal component analysis (PCA). The results show intuitively that, within the same customer type, non-churn customers and churn customers occupy distinguishable locations. Insights like these are valuable for data analysis, customer modeling and so on.

Fig. 5. Proportions and churn rates of the two customer types.

Fig. 6. The PCA of customer segments.
To demonstrate the predictive performance of our proposed model, several state-of-the-art models are used in comparative experiments: LR [16], EEMLP [40], RF [18], P-MLP [10], and GBDT [31]. Brief descriptions are as follows.
LR [16]: LR with regularization is a popular classification method for customer churn prediction due to its timeliness and interpretability.
EEMLP [40]: EEMLP is a neural network based on entity embedding. The entity embedding can efficiently force the network to learn the intrinsic properties of each feature.
RF [18]: RF is a bagging ensemble model that can reduce the variance of the single decision tree.
P-MLP [10]: P-MLP is a particle classification optimized-based neural network which has a lower learning error and a faster convergence speed than traditional networks.
GBDT [31]: GBDT is a boosting ensemble model. In contrast to RF, GBDT aims to minimize the bias and not only the variance.
The average cross-validation predictive performance of the different models over the four datasets is listed in Table 3. The best-performing classifier on each dataset is in bold. Additionally, the ROC curves are shown in Fig. 7. These average results are the basis of a statistical analysis of model performance. In the experiments, our proposed model performed better than the other algorithms on most evaluation measures. Further analysis of predictive performance is discussed in Section 6.3.
Table 3. The prediction performance of several models
First, on dataset 1 (a balanced dataset), ClusGBDT outperformed the other five methods, which indicates that the proposed method has strong binary classification ability. For the other three datasets, Precision, Recall, and F1-score are much more important because these datasets are imbalanced. When dealing with imbalanced problems, the ability to learn the features of churning customers is essential for a classification model. Here, too, ClusGBDT performed much better. Unlike single-threshold evaluation measures, the ROC curve expresses predictive performance as the discrimination threshold is varied, and it shows intuitively that our proposed method performs better. To sum up, ClusGBDT outperforms the other predictive methods and is robust regardless of whether the churn rate is low or high and whether the dataset is large or small.
The influence of activation function
We study the impact of activation functions on ClusGBDT. In this section, the experiments are conducted while holding the other settings fixed. Note that we use the identity as the activation function on the neurons of the MLP, as shown in Equation (3). A common practice in neural networks is to test non-linear activation functions on hidden layers. The choice of activation function affects the accuracy of the customer segments. We thus compare the F1-scores obtained with different activation functions in ClusGBDT. The experimental results are shown in Fig. 8.

Fig. 7. The ROC curves on four public datasets.

Fig. 8. The influence of activation function on F1-score.
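For reference, the activation functions compared in this section have the following standard definitions. The sketch assumes the usual ELU scale parameter of 1, which the paper does not specify, and the dictionary name is only illustrative.

```python
import math


def identity(z):
    return z


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def elu(z, alpha=1.0):
    """Exponential linear unit; alpha=1.0 is an assumed default."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


ACTIVATIONS = {
    "identity": identity,
    "sigmoid": sigmoid,
    "tanh": tanh,
    "relu": relu,
    "elu": elu,
}
```

Swapping the hidden-layer activation changes the customer vectors passed to the clustering stage, which is how it can influence the resulting segments and the F1-score.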
To study the impact of customer clustering on other classifiers in more depth, we also apply customer clustering to LR and RF, which are frequently used in customer churn prediction. As shown in Table 4, the improvements for LR and RF are evident on most evaluation measures. In general, customers in the same segment have similar personas, behavioral preferences, and focus points. Clustering therefore helps to identify the churn drivers of each segment. Managers can then adopt appropriate strategies for every segment and address its churn drivers, and the classification techniques can learn the customer features better, which improves predictive performance.
Table 4. The impact of customer clustering on LR and RF
In this study, the application of MLP to customer clustering for churn prediction is explored. We aim to ensure the accuracy and reliability of the customer segments. MLP can enhance the interaction between customer features and eliminate the influence of useless features without any feature selection, extraction, or generation. To evaluate the performance of the proposed model, five algorithms are compared and several datasets are tested. In our benchmarking study, the proposed model performs best overall compared to the other classification techniques. To sum up, our model offers three main contributions to the existing literature: (1) it improves the prediction performance and robustness of traditional algorithms, (2) it helps managers comprehend the causes of customer churn so as to formulate corresponding strategies, and (3) it provides a general churn prediction framework for distinct industries.
As a topic for further research, the meaning of the high-order vectors needs to be studied so that managers can formulate corresponding strategies for the different customer segments. Another future direction is to improve the efficiency of the model, because training the proposed model takes too much time. Training time is also a necessary metric to take into consideration, especially in cases where real-time predictions are needed. Lastly, the predictive performance of our proposed framework is limited by the performance of the MLP. Hence, it is important to find a way to ensure the accuracy of the customer segments.
Acknowledgments
Our work is partially supported by National Natural Science Foundation of China (No. 71862003), the Foundation of Guangxi Key Laboratory Cultivation Base of Cross-border E-commerce Intelligent Information Processing, Guangxi University of Finance and Economics, and the Foundation of Science of Business Administration, Guangxi University of Finance and Economics.
