Predicting Churn in Online Games by Quantifying Diversity of Engagement

Abstract

Understanding engagement patterns of users in online platforms, be they games, online social networks, or academic websites, is a widely studied topic with many real-world applications and economic consequences. A holy grail in this area of research is to develop an automatic prediction algorithm for when a user is going to leave the platform and to devise a proper intervention. In this work, we study online recreational games and propose to model the engagement patterns of players through an unsupervised learning framework. We think of engagement as a continuous temporal process, measured along specific axes derived from gaming users' data using principal component analysis. We track the overall trend of the projection of the data along the significant principal components. We find that the geometric variability of the trajectory is a good predictor of the users' engagement level. Users characterized by a time series with large variability are users with higher engagement; namely, they will continue playing the game for prolonged periods of time. We evaluated our methodology on two data sets of very different game types and compared the performance of our method with state-of-the-art black-box machine learning algorithms. Our results were fairly competitive with these methods, and we conclude that churn can be predicted using an explainable, intuitive, and white-box decision-rule algorithm.

Introduction

The availability of smartphones has led to the emergence of a large online gaming industry. Since its launch in 2012, the Google Play store has grown to host more than 300,000 active applications, with over 250 billion downloads in total. Among these, fewer than 10 applications have achieved 5–10 billion downloads. For free games, these numbers are much lower; there are only 500 games with over 1 million downloads, and fewer than 20 games have achieved over 100 million downloads. Thus, gaming companies are faced with heavy competition and high rates of disengagement.

Disengagement is the process in which the player gradually loses interest in the game, leading to the final abandonment of the platform, which is commonly termed “churn.” Similar phenomena manifest in various fields such as online learning systems,^1–3 e-commerce sites,⁴ and community engagement during an environmental crisis.⁵

Our work focuses on casual games, characterized by their simplicity and short sessions.⁶ In the past, casual games such as Pinball were played in arcades and followed a pay-per-play revenue model. Later, users bought games such as Tetris and Minesweeper for their computers; some operating systems included these for their users. Casual games such as Snake and Bubble Shooter transitioned from computers to mobile phone interfaces, usually following a freemium pricing model that provides either in-game purchases or space for advertisements.⁷ The casual gaming industry is especially susceptible to the problem of churn. This is largely due to the fact that casual games do not require many resources from the players. Skills are easily gained, levels are completed, and the overall trend to play a specific game passes. Thus, users are compelled to move on to the next popular game.

That being said, it is important for casual gaming companies to retain their users. It has been shown⁸ that a high retention rate results in higher profit for game companies compared with acquiring new users.

It is particularly difficult to model user engagement in casual gaming, as the users play in their leisure time with seemingly random patterns. Although challenging, modeling individual behavior patterns that lead to churn allows game developers to devise better marketing strategies to improve user retention (e.g., sending push notifications or providing free items in games). By improving user retention, companies would significantly reduce costs since the acquisition cost for new users is much higher than the retention cost for existing users.

A major challenge in modeling user behavior in games is dimensionality: games range in complexity from relatively simple to very complex information systems featuring millions of users and thousands of possible user actions and system responses.^9–11 Given the complexity of some games, having a data mining method that can reduce the complexity of the behavioral data sets and provide actionable insights is of great interest.^12,13 Interpretability and reliability of results are vital, as decisions based on them affect the game design and eventually the revenue.

Although the importance of user churn prediction (CP) of games has been acknowledged and researched, many existing works seem to suffer from a methodological limitation, which was pointed out in various articles.^14–16 Most studies define handcrafted features for predicting user churn, which are then fed to an off-the-shelf supervised learning algorithm.^17–19 Such a pipeline does not provide an in-depth analysis of how these features were selected or their importance to the model's success. In general, prediction performance significantly varies depending on how features are defined.^20,21 Furthermore, features selected for CP are, in many cases, application-dependent, which implies that current methods cannot be easily adapted to other domains.

In recent years, neural network (NN)-based algorithms were suggested to handle the problem of handcrafted features, and in some problems, these algorithms give state-of-the-art results (e.g., speech, image, and language processing). An NN-based algorithm was proposed in Liu et al.¹⁵ to predict churn in casual games. There are two caveats in using NN-based algorithms. First, they require a lot of data for training; Liu et al.¹⁵ trained on data from thousands of games from the Samsung Game Launcher platform. Second, they are hard to interpret. While there are methods that propose an interpretation of NNs, such as Shapley values,²² they often give local insights that help in understanding how the model made a decision for a single instance, rather than explaining the overall structure of how the model makes decisions.

The problem of disengagement is usually approached as a supervised classification of churned or returning players.^17–19 While traditional machine learning algorithms are the standard for predictive modeling in mobile games, simple heuristics (similar to our unsupervised principal component analysis [PCA]-based method) provide several advantages: they are easy to deploy as a simple rule system in the client device; they tend to have lower computational cost than machine learning-based models; and they are easier to explain to nonexpert decision makers and are thus more likely to gain organizational acceptance.

Our contribution

We study a universal pipeline for predicting churn in online games, which can be run both in supervised and unsupervised mode. A restricted version of this pipeline was introduced in Hershcovits et al.²³ and its usefulness was demonstrated in an e-learning platform. In this study, we elaborate on the pipeline, prove its usefulness in a more challenging platform, and explore different aspects such as robustness.

At the basis of our approach is the working hypothesis that diverse use of the system reflects high levels of engagement. This hypothesis is inspired by studies in e-learning showing that varying the pedagogical challenges posed to students and the skills required to meet the challenges positively contributes to students' motivation and retention.^24,25

We infer the user's engagement level from temporal sequences of user data. We quantify the diversity of using the system (a game in our case) by the geometric variability of the trajectory of the user's temporal data. Since the time series is multivariate, it defies simple geometric patterns such as monotonicity. To overcome this problem, we project the time series onto the leading principal components (PCs) and examine the geometry after projection. The different geometries are illustrated in Figure 1, top panel.

FIG. 1.

(Top) Illustration of the geometric variability of the time series trajectory along the first principal component (PC₁) axis in the Bubble Shooter game. Four different users are portrayed with monotonically decreasing (black), monotonically increasing (red), fixed (blue), and variable (green) trajectories. (Bottom left) The distribution of trajectory types in the Bubble Shooter batch1 data set, along with the percentage of churners in each type. Churn rate is lower for variable trajectories. (Bottom right) Zoom in on the variable trajectories for the Bubble Shooter batch1 data, and breakdown according to the number of times the trajectory changes trend (CC). Churn rate drops as CC increases. CC, change count; PC, principal component.

To get some intuition about how trajectory geometry correlates with user experience, consider, for example, a monotonically increasing trajectory. Such a pattern may suggest that the user is stressed by, for example, the increasing complexity of game episodes. Following this example, a monotone decreasing trajectory may point to boredom. The table in Figure 1 (bottom left) demonstrates how our working hypothesis manifests in the Bubble Shooter data set that we collected: users' trajectories are either monotone down or variable, and the gap in churn rates is consistent with our hypothesis. The second table in Figure 1 (bottom right) zooms in on the variable trajectories. The variable trajectories are broken down according to the number of times the trend changes. We see that the churn rate drops as the change count increases; that is again in line with our working hypothesis.

We evaluated our approach on data from two very different game types: a simple freemium Bubble Shooter game and a sophisticated massively multiplayer online role-playing game (MMORPG). The data for the Bubble Shooter game are an in-house data set to which we had exclusive access, and the second data set was used in an open competition.²⁶

We compared the performance of our simple, explainable algorithm to that of supervised “black-box” algorithms. In all cases, the performance achieved by our algorithm was in the same ballpark (around 10% worse in F1 score) as the best supervised algorithm. For example, in the gaming competition, our unsupervised variant obtained an F1 score of nearly 0.5, arriving in 9th place out of 13 teams. Our supervised variant came in 7th place with an F1 score of 0.53. The competition winner, whose method was based on NNs and a plethora of carefully designed features, obtained an F1 score of 0.62.

To conclude, we propose a complete pipeline from raw data to modeling of engagement types, which can be trained both in supervised and unsupervised manners. Our method is general, and the suggested pipeline can be adapted to other data sets and domains where user engagement levels are of interest, providing an explainable white-box alternative to black-box algorithms.

We tested the robustness of our approach by changing the set of features with which our pipeline is trained. The different sets of features resulted in the same performance (see the Robustness section). As pointed out before, which features to choose, and how that selection affects the performance, is a wide concern; our methodology provides a way to deal with that problem by introducing another layer of abstraction over the selected features—the trajectories.

Defining Churn

Churn is seen as a loss of interest in the product (game in our case). The definition of churn varies depending on the relationship between the company and the customer. For example, a customer is often bound by a contract in the telecommunication industry. Hence, churn can simply be defined as not renewing the contract or asking to leave midway. Some of the MMORPGs are also based on subscriptions. Casual games, however, are widely based on freemium models where players might not return to the game for a few months or even for good without notifying anyone. In such an environment, it becomes impossible to have one single golden definition of churn.

Churn may be quantified by not using the product for a defined period of time, after which most customers do not return. The threshold value for the limit time after which a player is labeled as a churner may be chosen from the data, and in most works that study casual games, this value is typically between 7 and 14 days (e.g., Runge et al.¹⁹). Figure 2 shows the distribution of active days per player in the Bubble Shooter data. Only 30% of the users played for more than 7 days.

FIG. 2.

Active days' histogram over the first 14 days from installation of 66,767 Bubble Shooter game users. Data collected during July 2019. Only 30% of the users play more than 7 days.

Another definition of churn is that of soft churn from Hadiji et al.,²⁷ which the authors claim to be more applicable for real-world settings. Players are labeled as churners if they have low activity. In the Discussion and Limitations section, we explain how our approach is compatible with this churn definition.

In this work, we adopt the prevalent binary definition of churn. We define a churner in terms of the observation period (OP) and CP period. An OP is a period for observing a user's play log data, from which features are extracted to predict churn. A CP is a period for determining whether a user actually churned or not. A churner is defined as a user playing the game at least once during the OP, and not playing the game at all in the CP period. Each individual player may have his own clock (namely, starting time), according to which the OP and CP are measured.
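The binary definition above can be sketched in a few lines of code. This is only an illustration under stated assumptions: each user's activity is assumed to be given as day offsets measured from that user's own starting time, and the function name `label_churn` and its parameters are hypothetical rather than part of the original pipeline.

```python
from typing import Iterable, Optional


def label_churn(active_days: Iterable[int], op_days: int, cp_days: int,
                gap_days: int = 0) -> Optional[bool]:
    """Return True for a churner, False for a non-churner, None if out of scope.

    active_days: day offsets (1 = first day on the user's own clock) on which
                 the user played. This input format is an assumption.
    op_days:     length of the observation period (OP) in days.
    cp_days:     length of the churn prediction (CP) period in days.
    gap_days:    days separating OP and CP (0 for the back-to-back setting).
    """
    days = set(active_days)
    op = range(1, op_days + 1)
    cp_start = op_days + gap_days + 1
    cp = range(cp_start, cp_start + cp_days)

    played_in_op = any(d in days for d in op)
    played_in_cp = any(d in days for d in cp)

    if not played_in_op:
        return None  # the definition only covers users active during the OP
    return not played_in_cp


# Back-to-back setting with OP = days 1-7 and CP = days 8-14.
print(label_churn([1, 2, 5], op_days=7, cp_days=7))  # True
print(label_churn([1, 2, 9], op_days=7, cp_days=7))  # False
```

Setting `gap_days=21` gives the variant in which 3 weeks separate the OP from the CP.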

In most works, the OP and CP are defined back-to-back as one continuous time block. While it is likely that user behaviors leading up to the point of churn (e.g., deleting the payment method) would play a vital role in CP, predicting churn just before the point of leaving has a limited effect, as at that point, little can be done to prevent users from leaving. In this work, we tested both the prevalent back-to-back setting and the data sets with a gap of 3 weeks between OP and CP.

Our first data set (the one we collected) consists of two batches, one collected in July 2019 and the other in August of that year. For the July batch, we chose the OP to be the first 7 days from the game installation date, and the CP was the following 7 days (days 8–14). In the August batch, we chose the OP to be 14 days from installation and the CP to be the next 7 days. This allows us to check the effect of a larger OP on the results of the classifier. See Figure 3 for illustration.

FIG. 3.

OP and CP period for the Bubble Shooter game collected during July and August 2019. CP, churn prediction; OP, observation period.

Our second data set comes from an open competition. The training set had an OP of 6 weeks and CP of 5 weeks; 3 weeks separated the OP and the CP. There were two test sets, each consisting of 8 weeks of OP and 5 weeks of CP, again separated by 3 weeks. The test sets are 2 and 4 months ahead of the training set (see Fig. 4).

FIG. 4.

Description of the competition data, along with the change in business model between the two test sets. Figure taken from Lee et al.²⁶

Related Work

Predicting churn is a topic of interest in many fields, such as retail,²⁸ customer retention,²⁹ telecommunications,³⁰ and direct marketing.³¹ With the recent rise of mobile phones and online gaming, attention has been drawn to predicting churn in the gaming industry. Most works extract handcrafted features from users' log data and apply supervised learning methods to predict churn.^14,27,32

The problem of highly biased data has been addressed in F2P games by Xie et al.,¹⁸ where various supervised learning algorithms were tested. Predicting churn for high-value players has been discussed in Runge et al.¹⁹ In Drachen et al.,³³ users likely to churn rapidly after installing the game were studied.

A related problem to predicting engagement levels of users is identifying which players will become payers. As most of the casual game industry comprises the aforementioned F2P games, it is interesting to understand what parameters can predict whether a player will become a payer. A binary supervised classification model for “payer” and “nonpayer” was introduced,³⁴ and it was concluded that previous purchases, in-game interactions, and total time spent playing are strong indicators to determine future player purchase behavior.

Another line of work that is tangent to ours is players' behavior modeling, such as in Canossa et al.³⁵ where players' frustration was studied, and in Sifa et al.,³⁶ where user behavior evolution in the Tomb Raider game was studied.

Hybrid models containing supervised/unsupervised methods have been applied to problems relating to the telecommunications industry.³⁰ Less work has been done for the casual gaming industry. Data from thousands of games from the Samsung Game Launcher platform have been used in Liu et al.¹⁵ to train a supervised inductive model, based on NN-derived embedding, to capture the dynamics of a player in a mobile game.

Several works used a time-series-based approach to predict churn. Khodadadi et al.¹⁶ propose a pipeline based on Temporal Point Processes and Recurrent Neural Networks to predict the user return time to the application. In Borbora and Srivastava,³⁷ users' behavior is modeled as three one-dimensional time series over three features. Each represents a different semantic direction: Engagement, Enthusiasm, and Persistence. The time series are then clustered, and clusters of churners are identified. In Zheng et al.,³⁸ a sophisticated context embedding is combined with a long short-term memory (LSTM) network to predict churn, and in Castro and Tsuzuki,³⁹ a discrete wavelet transform is performed on the login time series of users, and the extracted frequencies are used as features for classification.

Our work departs from all these works in several key aspects. First, our time series is one-dimensional, but each PC is a linear combination of all the features. Thus, we circumvent the dimensionality problem while not compromising the data. Second, we do not use any black-box methods such as recurrent neural networks or LSTMs, which make a method hard to interpret and explain to nonexperts. Third, we have a straightforward decision rule rooted in educational theory.

We may also compare our work with two supervised pipelines.^15,37 In both cases, most of the pipeline is unsupervised (embedding in the former and clustering in the latter), and labels are used only in the last step. The F1 score achieved in Borbora and Srivastava,³⁷ although on other data sets, was between 0.5 and 0.58, which is similar to what we obtained (see Table 5). The algorithm in Liu et al.¹⁵ achieved, on average, a recall of 0.78 and a precision of 0.32, both similar to what our method obtains.

Gaming Data

We evaluated our approach on two data sets from two different games, one we collected ourselves and the other from an open competition.²⁶ The competition data set contained “practical contamination.” For example, a much longer time span (3 weeks) was given between the training data and prediction window, reflecting the time required to apply the churn prevention strategies to retain potential churners. Another contamination was a change in the business model. Specifically, the train and test data sets are each from different periods. Between the two periods, the business model of the game changed. This allows measuring the robustness of the model when applied to constantly evolving conditions.

Bubble Shooter game

This data set was collected from a mobile casual game of the Bubble Shooter type, for which we had in-house exclusive access. Bubble Shooter is a single-player mobile game. The game's goal is to clear the screen of colorful balls by creating groups of identical colors; see screenshot in Figure 5. It can be played either under a time limit or a shots limit (this article's data refer to the shots limit version). In the current mobile version, the player can use “boosters” (in-game items) to help complete the level. These boosters are purchased using in-game currency (coins). The game also has a reward system that grants the user boosters or coins depending on the completion of side-quest levels and other in-game features. The game is free to download and play, and revenue is made from in-app purchases or advertisements.

FIG. 5.

Bubble Shooter game screenshot taken from www.bubbleshooter.com

The data set consists of raw log data of 66,767 users collected in July 2019; for each user, a period of 14 days from the installation date was recorded. A second set consists of 70,756 users collected during August 2019. This time, 21 days were observed for each user from the installation date. The raw log data represent event interactions in the game such as starting and ending a level event, using in-app currency, total session time, and more.

The data set was labeled according to the definition discussed in the Defining Churn section. In the July 2019 batch, the OP and CP were 7 days each. In the August 2019 batch, the OP was 14 days, and the CP was 7 days. The class distribution is shown in Table 1.

Table 1.

Class distribution of the evaluation data sets

Game	Churn %	Not churn %
Bubble Shooter, July 2019	35	65
Bubble Shooter, August 2019	50	50
Blade & Soul	30	70

Feature computation

We defined 17 features that belong to two categories: monetization and activity. Each feature is computed as an average over the days of the OP. Table 2 describes the features. Monetization-type features included any real money and/or in-game purchases and virtual currency transactions, and user daily currency balance. Activity-type features included daily counts of levels started/completed/failed or abandoned, in-game achievements, and rewards.

Table 2.

Description of features collected for every user in the Bubble Shooter game

Feature name	Description
AvgCompletedLevels	Average count of completed levels
AvgStartedLevels	Average count of started levels
AvgTimeInApp	Average game time in minutes
AvgSessions	Average session count
AvgNumOfBoostersUsed	Average “boosters” (in-game items) used
AvgStars	Average stars ranked from 1 to 3 according to the user score in the level
AvgTimesShopEntered	Average times user has entered shop
AvgMaxBalance	Average of the daily maximum coin balance of the user
AvgMinBalance	Average of the daily minimum coin balance of the user
AvgTransactionsCount	Average real money transaction count
AvgPurchaseRevenue	Average purchase revenue
AvgCoinSpent	Average coin spent
AvgInGamePurchases	Average in-game item purchases (purchases made using coins)
AvgVideoWatched	Average video watch
AvgRewardsGivenCount	Average rewards given to the user, rewards can be boosters or coins
AvgAbandonTrials	Average level trials that ended with the user leaving the level without finishing his moves
AvgFailedTrials	Average trials that ended in failing the level, meaning the user finished his moves without completion of the level

Blade & Soul

Blade & Soul is an MMORPG-type game, published by NCSOFT in 2012, that features a combination of epic martial art actions with highly customizable characters (see Fig. 6). The game is played in a single mode where the player can experience numerous quests and dungeons and can be played player-versus-player in an arena mode. The latter pushed the game to gain global popularity. The data set we used was made available in a CP competition hosted for 5 months, from March 28, 2017, to August 25, 2017. More than 300 registrants on the competition's Google Groups were given access to the log data, and a total of 13 final submissions were received.²⁶

FIG. 6.

Blade & Soul game screenshot taken from NCSOFT's Twitter account.

The competition was designed to incorporate concept drift, specifically, a change in the business model, to measure the participant model's robustness. Consequently, the competition comprised two test data sets, each from different periods.

The training data consisted of 4000 users whose log data were collected from April 1, 2016, till May 11, 2016, and two test data sets consisting of 3000 users each. Figure 4 depicts how the three sets spread over the time line. The class distribution is detailed in Table 1.

Feature computation

The data provided for training and test consisted of 82 log events that were categorized into 6 different types: connection, character, item, skills, quest, and guild. The OP in the training set was 6 weeks long and 8 weeks in the test. The logs were quite sparse in terms of user activity. We located for each user the week where that user was most active and computed the features only over the log events in that week. We found this approach to perform better than computing features over the entire period's logs. The feature list can be seen in Table 3. In this case, the features belong only to the activity category.

Table 3.

Description of features collected for every user in the Blade & Soul game

Feature name	Description
AvgLagBetweenSessions	Average of time between sessions
AvgConnectionTypeCount_1	Average count of Log type EnterWorld
AvgConnectionTypeCount_2	Average count of Log type LeaveWorld
AvgConnectionTypeCount_3	Average count of Log type EnterZone
AvgConnectionTypeCount_4	Average count of Log type LeaveZone
AvgConnectionTypeCount_5	Average count of Log type Teleport
AvgCharacterTypeCount	Average count of all Log types associated with character logs
AvgGuildTypeCount	Average count of all Log types associated with guild logs
AvgItemTypeCount	Average count of all Log types associated with item logs
AvgSkillTypeCount	Average count of all Log types associated with skill logs
AvgQuestTypeCount	Average count of all Log types associated with quest logs
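The most-active-week selection described in the feature computation paragraph above can be sketched as follows. The text does not specify the details, so this sketch makes several assumptions: each log event is reduced to a day offset within the OP, "most active" means the week with the largest number of log events, and weeks are consecutive non-overlapping 7-day blocks. The helper name `most_active_week` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_active_week(event_days: Iterable[int]) -> Tuple[int, List[int]]:
    """Return (week_index, events_in_that_week) for one user.

    event_days: 0-based day offsets of the user's log events within the OP.
    Weeks are non-overlapping blocks of 7 days; ties go to the earliest week.
    """
    events = list(event_days)
    if not events:
        raise ValueError("user has no log events in the OP")
    counts = Counter(day // 7 for day in events)
    # Highest event count first; on ties prefer the smaller week index.
    best_week = min(counts, key=lambda w: (-counts[w], w))
    return best_week, [d for d in events if d // 7 == best_week]


week, kept = most_active_week([0, 1, 15, 16, 17, 20, 30])
print(week, kept)  # 2 [15, 16, 17, 20]
```

The features of Table 3 would then be computed only over the events kept for that week.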

Methodology

We describe the methodology for inferring user engagement from temporal sequences of a player's log data. The method consists of four stages: (1) extracting features from the user's data log, (2) computing PCA on that set of features and extracting the leading PCs, (3) generating a time series for every user by computing the user's score on the leading PC(s), and (4) labeling users according to their time series pattern, where each pattern corresponds to an engagement level in the game.

Our method's pipeline is illustrated in Figure 7. We call our algorithm PCAT, which stands for PCA trajectories. Tables 2 and 3 list the features that we extracted from the gaming data that we have used in the evaluation of our pipeline [Step (1) in our pipeline]; the reader may get an idea of the type of features that are commonly used.

FIG. 7.

Our user CP pipeline, from raw data to labeling. Classification is done based on the geometry of the time series trajectory along one of the PCs.

Computing PCA and constructing time series

PCA is a popular technique for dimension reduction and feature selection with a wide range of applications involving multivariate data in many diverse fields such as engineering, biology, finance, and social sciences (see e.g., Anderson⁴⁰ and Jolliffe⁴¹). Typically, PCA is used statically; namely, a low-dimensional snapshot of the distribution is obtained, and patterns are examined. We use the axes provided by PCA to generate a dynamic view of the system, from which the level of engagement is derived.

The PCs are linear transformations of the original set of features, chosen according to the maximization of the variance criterion. As such, the PCs provide a new coordinate system along which the data may be redrawn. The significance of every PC is measured according to the amount of variance along its direction.

A time series for user u consists of projections of its accumulated data on a fixed PC at a fixed frequency. Formally, for every user u at time t, we associate a vector u_t, which consists of the current value of the p features that are measured for each user. For every user u and PC i, we generate the time series $S_{u}^{(i)} = \{α_{t_{1}}^{(i)}, α_{t_{2}}^{(i)}, \dots\}$, where $α_{t_{j}}^{(i)} = ⟨ u_{t_{j}}, \mathrm{PC}_{i} ⟩$ is the scalar product of the two vectors. Since the PCs are unit vectors, this is also the projection of $u_{t_{j}}$ in the direction of $\mathrm{PC}_{i}$.
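As a small illustration of this construction, the sketch below computes the time series for one user from a list of feature vectors and a given unit-length PC. The vectors and the PC are made-up toy values, and `project_series` is a hypothetical name; in practice the PC would come from the PCA fitted on the training set.

```python
import math
from typing import List, Sequence


def project_series(user_vectors: Sequence[Sequence[float]],
                   pc: Sequence[float]) -> List[float]:
    """Scalar products <u_t, PC_i> for t = t_1, t_2, ... (one user, one PC)."""
    norm = math.sqrt(sum(w * w for w in pc))
    if not math.isclose(norm, 1.0, rel_tol=1e-9):
        raise ValueError("PC must be a unit vector for the projection reading")
    series = []
    for u_t in user_vectors:
        if len(u_t) != len(pc):
            raise ValueError("feature vector and PC must have the same length")
        series.append(sum(x * w for x, w in zip(u_t, pc)))
    return series


# Toy example with p = 2 features observed at three time stamps.
pc1 = [0.6, 0.8]  # unit vector: 0.36 + 0.64 = 1
u = [[1.0, 0.0], [2.0, 1.0], [2.0, 3.0]]
print(project_series(u, pc1))  # approximately [0.6, 2.0, 3.6]
```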

We further introduce a smoothing over the trajectories to filter out small variabilities in the trajectory that one would like to treat as constant. To that end, for each PC index i and time stamp t_j, we compute the standard deviation (SD) $σ_{t_{j}}^{(i)}$ of all users' $α_{t_{j}}^{(i)}$ (the projection at time t_j on $\mathrm{PC}_{i}$). We introduce a smoothing constant $λ$.

For every pair of consecutive time steps $t_{j}, t_{j + 1}$, we say that the user's trajectory has remained fixed if $| α_{t_{j}}^{(i)} - α_{t_{j + 1}}^{(i)} | \leq λ σ_{t_{j}}^{(i)}$, and has stepped up if it is not fixed and $α_{t_{j}}^{(i)} < α_{t_{j + 1}}^{(i)}$; stepped down is defined symmetrically. We say that a trajectory is:

Fixed if all steps are fixed.

Monotone up if all steps are up (and similarly monotone down).

Variable if there is at least one step up and one step down.
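The step and trajectory definitions above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: a step "up" is taken to mean that the later projection is the larger one, and fixed steps are taken not to break monotonicity (so a trajectory with only up and fixed steps counts as monotone up). The function names are hypothetical.

```python
from typing import List, Sequence


def classify_steps(alpha: Sequence[float], sigma: Sequence[float],
                   lam: float) -> List[str]:
    """Label each consecutive step as 'fixed', 'up' or 'down'.

    alpha: one user's projections alpha_{t_1}, alpha_{t_2}, ...
    sigma: per-time-stamp SD of all users' projections (same length as alpha).
    lam:   smoothing constant lambda.
    """
    steps = []
    for j in range(len(alpha) - 1):
        diff = alpha[j + 1] - alpha[j]
        if abs(diff) <= lam * sigma[j]:
            steps.append("fixed")
        else:
            steps.append("up" if diff > 0 else "down")
    return steps


def trajectory_type(steps: Sequence[str]) -> str:
    """Map step labels to 'fixed', 'monotone up', 'monotone down' or 'variable'."""
    has_up, has_down = "up" in steps, "down" in steps
    if has_up and has_down:
        return "variable"
    if has_up:
        return "monotone up"    # assumption: fixed steps do not break monotonicity
    if has_down:
        return "monotone down"
    return "fixed"


steps = classify_steps([0.0, 2.0, 2.1, 0.5], sigma=[1.0, 1.0, 1.0, 1.0], lam=0.5)
print(steps, trajectory_type(steps))  # ['up', 'fixed', 'down'] variable
```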

Note that the PCA is computed over all the points (players) in the training set, disregarding the labels (churn or not). Thus, this step is entirely unsupervised.

Classification using trajectories

We now describe the decision rule, which maps the trajectory type to a label: “churn” or “not churn.” We assume that we have already computed the PCs. We describe the rule with respect to the leading PC. However, it may be easily generalized to using multiple PCs as well. The decision rule is parameterized with a predefined threshold $τ$ .

Compute the user's trajectory in the OP along the leading PC.

If the trajectory is either fixed or monotone, such a user is labeled as “churn.”

Otherwise, count how many times the trajectory changed orientation (see Fig. 8 for an illustration).

If this counter is larger than $τ$ , then the user is labeled “not churn,” otherwise “churn.”
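The decision rule above can be sketched in code. The sketch takes the per-step labels ('up', 'down', 'fixed') of a trajectory as input and assumes that fixed steps are skipped when counting orientation changes; this treatment of fixed steps and the names `change_count` and `pcat_label` are assumptions made for illustration.

```python
from typing import Sequence


def change_count(steps: Sequence[str]) -> int:
    """Number of up/down orientation changes; 'fixed' steps are skipped (assumption)."""
    moves = [s for s in steps if s in ("up", "down")]
    return sum(1 for a, b in zip(moves, moves[1:]) if a != b)


def pcat_label(steps: Sequence[str], tau: int) -> str:
    """Decision rule: fixed or monotone -> 'churn'; otherwise compare CC with tau."""
    cc = change_count(steps)
    if cc == 0:
        return "churn"  # fixed or monotone: the orientation never changes
    return "not churn" if cc > tau else "churn"


print(pcat_label(["up", "up", "fixed", "up"], tau=1))         # churn
print(pcat_label(["up", "down", "up", "down", "up"], tau=1))  # not churn
print(pcat_label(["up", "down"], tau=1))                      # churn
```

Because the rule reduces to a single comparison on a counter, it can be evaluated on the client device without any trained model.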

FIG. 8.

A user's 7-day trajectory along the first PC, $\mathrm{PC}_{1}$. The trajectory changes trend four times (CC = 4), and the TV, the sum of the absolute values of the d_i's (the differences between consecutive points), is $TV = \sum_{i} | d_{i} | = 100$. TV, total variability.

The hyperparameters $τ$ and $λ$ may be fixed according to prior or domain expert knowledge (unsupervised version), or tuned to optimize a desired performance measure (e.g., F1 score or accuracy). In the latter case, labeled data are used (supervised version). In either case, all other steps of the pipeline are unsupervised; labels, if used at all, enter only through the setting of the two hyperparameters.
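For the supervised version, one simple way to set the two hyperparameters is an exhaustive search over a small grid, keeping the pair that maximizes F1 on labeled training data. The sketch below is an assumption about how such tuning could be done, not a description of the procedure actually used; churn is taken as the positive class, and the grids and toy data are made up.

```python
from itertools import product
from typing import Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = churn)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def change_count(alpha: Sequence[float], sigma: Sequence[float], lam: float) -> int:
    """Orientation changes after smoothing; fixed steps are skipped (assumption)."""
    signs = []
    for j in range(len(alpha) - 1):
        diff = alpha[j + 1] - alpha[j]
        if abs(diff) > lam * sigma[j]:
            signs.append(diff > 0)
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def tune(trajs: Sequence[Sequence[float]], sigma: Sequence[float],
         y_true: Sequence[int], lams: Sequence[float],
         taus: Sequence[int]) -> Tuple[float, float, int]:
    """Return (best_f1, best_lambda, best_tau) over the grid."""
    best = (-1.0, lams[0], taus[0])
    for lam, tau in product(lams, taus):
        # A user is predicted to churn (1) unless the change count exceeds tau.
        y_pred = [0 if change_count(a, sigma, lam) > tau else 1 for a in trajs]
        score = f1_score(y_true, y_pred)
        if score > best[0]:
            best = (score, lam, tau)
    return best


# Toy data: four users, four time stamps, labels 1 = churn and 0 = not churn.
trajs = [[0, 1, 2, 3], [0, 2, 0, 2], [0, 0.2, 0, 0.2], [0, 3, 1, 4]]
print(tune(trajs, sigma=[1, 1, 1, 1], y_true=[1, 0, 1, 0],
           lams=[0.5, 1.0], taus=[0, 1]))  # (1.0, 0.5, 0)
```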

Meta-features and logistic regression

In the “Classification using trajectories” section, we described a decision-rule-based algorithm for labeling a user as churn or not. This approach straightforwardly encodes our working assumption about the diversity of using the system as a telling sign of higher levels of engagement.

In this study, we suggest an alternative classification algorithm, which is more opaque but still relies on the computed trajectories.

Compute the user's trajectory in the OP along the leading PC.

Extract from the computed trajectory a host of features: the total area under the trajectory's curve (AR); the number of times that the trajectory changes trend (CC = change count); and the total variability (TV) of the trajectory. The computation of TV is explained in Figure 8.

Train a supervised-learning classifier C with these features.

We use the notation $\mathrm{PCAT}_{C}$ for the classification algorithm that is derived and trained in the aforementioned way.
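The three meta-features can be computed from a trajectory as in the sketch below. It assumes unit spacing between time stamps, uses the trapezoidal rule for AR, takes TV as the sum of the absolute consecutive differences, and omits the smoothing step for brevity; these choices and the name `meta_features` are assumptions made for illustration.

```python
from typing import Dict, Sequence


def meta_features(alpha: Sequence[float]) -> Dict[str, float]:
    """AR, CC and TV of a trajectory sampled at unit time steps."""
    diffs = [b - a for a, b in zip(alpha, alpha[1:])]
    # AR: trapezoidal area under the curve (unit spacing is an assumption).
    ar = sum((a + b) / 2.0 for a, b in zip(alpha, alpha[1:]))
    # TV: sum of absolute differences between consecutive points.
    tv = sum(abs(d) for d in diffs)
    # CC: number of sign changes between consecutive nonzero differences.
    signs = [d > 0 for d in diffs if d != 0]
    cc = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    return {"AR": ar, "CC": cc, "TV": tv}


print(meta_features([0.0, 30.0, 10.0, 40.0, 20.0]))
# {'AR': 90.0, 'CC': 3, 'TV': 100.0}
```

The resulting feature dictionary (or a subset of it, such as CC alone or CC and TV) would then be passed to the supervised classifier C.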

Results

In this section, we report the evaluation of our pipeline, in various configurations, on the two data sets above.

Train-test setup

The Bubble Shooter data consist of two batches, July and August 2019 (Fig. 3). Each batch constitutes a separate train-test pair (batch1, batch2). We split the users of each batch in half, using one half for training and the other half for testing (about 35,000 users in each of the train and test sets). Features were computed only from the OP, and labels were computed according to the activity in the CP. PCA was computed over all the users in the training set, regardless of their label. The only step in the training where labels were used, if at all, was to set the two hyperparameters $λ$ (smoothing) and $τ$ (change count threshold).

For the Blade & Soul game, the training and testing data sets were prepared by the competition organizers. In addition to splitting the users into train and test, the time periods are different, as portrayed in Figure 4. The training set was collected in March–April 2016, and testing was done in two batches, July–September 2016 and December 2016–February 2017.

Computing PCA

The 17 × 17 covariance matrix for the Bubble Shooter game data was computed, and PCs were extracted. All the users in the train set were used, and only data from the train's OP were taken into account.

The leading PC explains 39.49% of the variance. Using the Kaiser-Guttman criterion,⁴² PCs that explain less than a $1/p$ fraction of the variance are treated as explaining incidental variance (or noise) and hence ignored. Here, the top four PCs each explained more than a $1/p = 1/17$ fraction (about 5.9%) of the variance. We give a full description of the top three PCs (Table 4).
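The selection rule can be illustrated with a short Python sketch. The eigenvalues below are made up for illustration and are not those of the Bubble Shooter covariance matrix; `kaiser_guttman` is a hypothetical helper name.

```python
def kaiser_guttman(eigenvalues):
    """Indices of PCs whose explained-variance fraction exceeds 1/p."""
    p = len(eigenvalues)
    total = sum(eigenvalues)
    return [i for i, ev in enumerate(eigenvalues) if ev / total > 1.0 / p]


# Toy spectrum with p = 5, so the threshold is 1/5 of the total variance.
toy_eigenvalues = [4.0, 2.0, 1.5, 0.3, 0.2]
print(kaiser_guttman(toy_eigenvalues))  # [0, 1]

# With p = 17 features the threshold is 1/17 of the variance.
print(round(1 / 17, 4))  # 0.0588
```

In the toy spectrum only the first two components pass, since the third explains 1.5/8 ≈ 0.19 of the variance, just under the 0.2 threshold.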

Table 4.

Feature weights for PC₁, PC₂, and PC₃ of the Bubble Shooter data

Table 5.

Evaluation of our pipeline on the Bubble Shooter gaming data

Method	Batch	Recall	Precision	F1 score
$P C A T_{L R} + C C$	batch1	0.497	0.507	0.502
$P C A T_{L R} + C C + T V$	batch1	0.574	0.465	0.514
PCAT with $λ = 1, τ = 1$	batch1	0.840	0.420	0.560
PCAT with $λ = 0.5, τ = 1$	batch1	0.737	0.481	0.582
$P C A T_{L R} + A R$	batch1	0.841	0.477	0.606
$P C A T_{L R} + A R + C C + T V$	batch1	0.773	0.525	0.621
Random Forest + Raw features	batch1	0.843	0.631	0.710
PCAT with $λ = 1, τ = 1$	batch2	0.910	0.539	0.677
PCAT with $λ = 0.5, τ = 1$	batch2	0.871	0.588	0.702
$P C A T_{L R} + C C + T V$	batch2	0.705	0.753	0.729
$P C A T_{L R} + A R + C C + T V$	batch2	0.794	0.745	0.769
Random Forest + Raw features	batch2	0.855	0.783	0.818

Results sorted in ascending F1-score order.

AR, area under trajectory curve; CC, change count; LR, logistic regression; TV, total variability.

For the Blade & Soul game, we computed the 11 × 11 covariance matrix; here, the first PC explained 58.7% of the variance, and the top five PCs passed the Kaiser-Guttman criterion.

Table 4 shows the feature weights for $P C_{1}$ , $P C_{2}$ , and $P C_{3}$ . The interpretation of the direction to which each PC points is made using the folklore rule: only features that correspond to significant PC entries (in absolute value) are considered. Significant entries are colored blue. Interestingly, the PCs are “semantically orthogonal,” namely, they do not share any significant features. It is important to note that by no means does our methodology rely on semantic interpretation. Our algorithm is oblivious to the semantics of the features.

PC₁: The features providing the largest contribution to PC₁ are from the engagement/activity feature category: the average number of started levels, time spent in the app, number of sessions, number of stars, the count of rewards given, and the number of failed trials.

PC₂: Here, the top largest features belong to the monetization category and include an average count of transactions made within the game, the user's purchase revenue, the number of coins spent, and the number of in-game purchases.

PC₃: The largest entries correspond to the user's average maximal and minimal coin balance per day.

The top three PCs point in different directions (activity, monetization). We can proceed with our pipeline by choosing one of them (or combining several). Experimenting with each of the PCs, we found that classifying by trajectories along PC₁ gave the best results. Adding information from trajectories along other PCs resulted in a minor improvement.

We repeated the PCA computation for the Blade & Soul game. Similar to the Bubble Shooter game, trajectories along the leading PC were the most informative. Nevertheless, a logistic regression over the area under the trajectories' curve along the top five PCs gave the best result (Table 6).

Table 6.

Evaluation of our pipeline on the Blade & Soul gaming data

Method	Test	Recall	Precision	F1 score
Lessang Team (13th place)	test1	0.29	0.30	0.29
	test2	0.29	0.29	0.29
PCAT with $λ = 0.5, τ = 2$ (9th place)	test1	0.704	0.377	0.491
	test2	0.646	0.38	0.479
$P C A T_{L R} + A R_{1} + {C C}_{1} + {T V}_{1}$ (9th place)	test1	0.702	0.379	0.493
	test2	0.772	0.402	0.529
$P C A T_{L R} + A R_{1 - 5}$ (7th place)	test1	0.746	0.417	0.537
	test2	0.842	0.413	0.554
MNDS Team (6th place)	test1	0.62	0.51	0.55
	test2	0.62	0.51	0.56
Yokozuna Team (1st place)	test1	0.69	0.55	0.61
	test2	0.76	0.54	0.63

Results sorted in ascending F1-score order. We also report the results of three competitors in the original competition (first place, a middle place, and the last, 13th place), taken from table 3 in Lee et al.²⁶ $AR_{i}$ stands for the area under the trajectory along the $i$th PC, and similarly for $CC_{i}$ and $TV_{i}$. The place of PCAT in parentheses corresponds to our ranking had we competed.

Classifiers' results

We tested the two pipelines that are described in the Methodology section: PCAT, which classifies according to predefined rules about the trajectory types, and $PCAT_{LR}$, which extracts meta-features from the trajectories and trains a logistic regression (LR) on these features to predict whether the user churns.

We tested PCAT with various configurations of the smoothing parameter $λ$ and the threshold parameter $τ$ . We ran $P C A T_{L R}$ with three meta-features: AR (area under the trajectory curve), CC (change count − the number of times the line changes trend), and TV. The coefficients of all three features were negative in the logistic regression, which is in line with our working hypothesis that the larger the variability (quantified by TV and CC), the lower the chances of churning.

We trained a data-driven black-box learning algorithm to benchmark our theory-driven white-box approach. Concretely, we trained a random forest classifier over the nonaveraged data features, namely the 17 × 7 = 119 features of the 7 OP days. We used Python's sklearn with the following parameters. For batch1: max_depth = 6, min_samples_split = 10, n_estimators = 500, min_samples_leaf = 5; for batch2: max_depth = 8, min_samples_split = 15, n_estimators = 800, min_samples_leaf = 5.
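For reference, the benchmark configuration described above could be set up roughly as follows. The hyperparameter values are those reported in the text; the dictionary layout, the `build_random_forest` helper, and the fixed `random_state` are assumptions of this sketch, and any settings not mentioned are left at scikit-learn's defaults.

```python
# Hyperparameters reported in the text for each batch.
RF_PARAMS = {
    "batch1": dict(max_depth=6, min_samples_split=10,
                   n_estimators=500, min_samples_leaf=5),
    "batch2": dict(max_depth=8, min_samples_split=15,
                   n_estimators=800, min_samples_leaf=5),
}

N_FEATURES = 17  # raw features per day
N_OP_DAYS = 7    # observation-period days used for the raw input
N_INPUTS = N_FEATURES * N_OP_DAYS  # 119 nonaveraged inputs per user


def build_random_forest(batch, random_state=0):
    """Return an untrained random forest for the given batch."""
    # Imported here so the parameter table can be inspected without
    # scikit-learn installed.
    from sklearn.ensemble import RandomForestClassifier
    return RandomForestClassifier(random_state=random_state, **RF_PARAMS[batch])
```

A call such as `build_random_forest("batch1").fit(X, y)` would then expect `X` to hold one row per user and one column per raw daily feature.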

Table 5 presents the results of all these methods for the Bubble Shooter game for the two batches, July 2019 and August 2019. The OP for batch1 is 7 days and for batch2 is 14 days. The CP is 7 days for both. The measures we report are precision, recall, and the F1-score on the churn class (the minority class), as these were the measures used in the Blade & Soul competition and in the previous work that we compared against.¹⁵,³⁷
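For concreteness, the F1-score is the harmonic mean of precision and recall on the churn class. The short sketch below recomputes it from the precision and recall of one Table 5 row; the helper name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# PCAT with lambda = 1, tau = 1 on batch1: recall 0.840, precision 0.420.
print(round(f1_score(precision=0.420, recall=0.840), 3))  # 0.56
```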

As evident from Table 5, the best result of PCAT for batch1 is obtained with smoothing parameter $λ = 0.5$ and change count threshold $τ = 1$. We treat this configuration as a completely unsupervised pipeline, since disregarding variations within a range of 1 SD (half to each side) around a point is common practice in statistics. Sometimes a window of 1 SD to each side is taken; we also show this configuration ($λ = 1$), which obtained similar results. Setting $τ = 1$ means that the trajectory type determines the label: variable (which gets the label “not churn”) or monotone/fixed (which is labeled as “churn”).

The best result of the supervised version $P C A T_{L R}$ is obtained with all three meta-features, improving over the unsupervised PCAT by $7 %$ . The random forest achieves the overall best result for batch1 with the raw features, a $13 %$ improvement over $P C A T_{L R}$ .

For batch2, both PCAT and the random forest do better in terms of F1-score compared with batch1. This is because the OP has increased from 7 to 14 days. The gap between the supervised and unsupervised versions of PCAT and between $PCAT_{LR}$ and the random forest reduces to merely $5 %$ .

Including the feature AR (the volume of activity) in $P C A T_{L R}$ improves performance on top of CC and TV (F1 increases from 0.514 to 0.621 in batch1 and from 0.729 to 0.769 in batch2). This finding is consistent with the literature where the volume of activity features ranks first in feature importance. What is worth noting is that when dropping this feature and relying only on variability features, the performance drops by only $5 %$ in batch2 (and by 20% in batch1). In settings where it is difficult to define a typical activity volume for a user, as volume is not a normalized quantity, variability offers a “unitless” quantity, which may help facilitate robustness in such settings.

In Table 6, we see the results obtained on the Blade & Soul competition data set.²⁶ Recall that this data set is more challenging in two respects. First, the OP and CP are separated by 3 weeks, and second, there was a business model change in the time span between test1 and test2. We compared the performance of PCAT with that of other participants in the competition; all of them used state-of-the-art machine learning (ML) models, including NNs, deep NNs, and different variants of random forests. In contrast, we used a lightweight, white-box, and easily explainable method, anchored in educational theory. Had we competed in this competition, $PCAT_{LR}$ would have placed 7th out of 13, and PCAT would have come 9th.

Also, note that our method performed just as well, and perhaps slightly better, on test2, which was collected after the business model change. This fact testifies to the robustness of our method. In addition, the optimal choice for $τ$ was no longer $τ = 1$ as in the Bubble Shooter game, but now it increased to $τ = 2$ . This gap may be explained by different types of games, a simple casual game such as Bubble Shooter versus an MMORPG.

In the article following the competition, the authors emphasize that proper data preprocessing was highly important in achieving high prediction performance. The winner, Yokozuna Data, used the most diverse set of features among the participants: daily, entire-period, time-weighted, and statistical features. We, in contrast, made little effort to engineer features. The crux of our method lies in the fact that regardless of the features, as long as they are reasonable, the PC trajectories give sufficient information for classification. Our results support this point: $PCAT_{LR}$ would have overtaken roughly half of the competitors.

Finally, let us note that in both games, the recall is often considerably higher than the precision across all algorithms we tested or cited. This phenomenon may be explained by the nature of gaming and the definition of churn. The recall is relatively high because the signature of churners is easy to pick up. However, some nonchurners, probably lightweight users, have a similar signature. Suppose a user played the game once within the OP and once within the CP. Such a user has a clear churner signature in the OP, but since he played once in the CP, he is labeled as a nonchurner. Other churn definitions, such as “soft churn” from Hadiji et al.,²⁷ may lessen this effect.

Robustness

Most articles that propose a churn prediction pipeline design ad hoc handcrafted features. In addition, it is not straightforward to determine to what extent the pipeline's success depends on the exact choice of features. More generally, prediction performance may vary significantly depending on how features are defined.²⁰,²¹ Furthermore, some features are relevant to a certain game or platform but not to another. All this hinders the transferability of the pipeline to other games or even similar domains, thus limiting its usefulness.

Our method alleviates such concerns by adding another abstraction layer: meta-features are derived from the trajectories rather than using the original features. To demonstrate our method's robustness, we ran an experiment in which we changed the set of features with which the covariance matrix is computed and on which PCA is applied. We find that such changes barely affect the performance of the pipeline (in fact, they slightly improve it).

We carried out the robustness test for the Bubble Shooter game. We randomly chose 9 features out of the 17 and followed the same testing scheme in the Results section. We repeated this three times. Table 7 reports the performance of PCAT with $λ = 0.5$ and $τ = 1$ . We see that the overall performance of PCAT actually improved by reducing the number of features. This may be explained by the fact that the 17 original features contain redundant features that may not play a neutral role and cause degradation in performance.⁴³
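The subset-sampling step of this test can be sketched as below. The seed and function name are arbitrary choices for illustration, so the sampled subsets will not match those reported in Table 7.

```python
import random


def sample_feature_subsets(n_features=17, subset_size=9, repeats=3, seed=0):
    """Draw `repeats` random feature subsets, numbered from 1."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(1, n_features + 1), subset_size))
            for _ in range(repeats)]


for subset in sample_feature_subsets():
    # Each subset would be used to recompute the covariance matrix and PCA
    # before rerunning PCAT with the same hyperparameters.
    print(subset)
```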

Table 7.

Evaluation of PCAT on the Bubble Shooter gaming data with hyperparameters $λ = 0.5, τ = 1$

Subset	Batch	Recall	Precision	F1 score
PCAT with $λ = 0.5, τ = 1$ using all features	batch1	0.737	0.481	0.582
${1, 2, 5, 7, 12, 13, 14, 15, 16}$	batch1	0.726	0.600	0.606
${1, 3, 4, 5, 7, 8, 9, 11, 16}$	batch1	0.773	0.525	0.624
${1, 3, 4, 6, 7, 9, 10, 12, 17}$	batch1	0.781	0.528	0.630
$P C A T_{L R} + C C, T V, A R$	batch1	0.797	0.524	0.632
$P C A T_{L R} + C C, T V, A R$ using all features	batch1	0.773	0.525	0.621

Three different subsets of features are examined. Features are numbered according to row number in Table 2. The second-to-last row gives the result for $PCAT_{LR}$ averaged over the three feature subsets. As evident, both in the supervised and unsupervised versions, using subsets of features gives slightly better results.

Discussion and Limitations

We studied a pipeline for predicting churn based on the time series along PCs to model engagement patterns of players in online games. Engagement is a cornerstone of flow,⁴⁴ a term in psychology that describes an experience so gratifying that people feel it is worth their while, even if there is no real reward. Flow consists of eight elements, from being a task that can be completed, through the ability to convey a sense of control over actions, to the alteration of the person's sense of time for the duration of the experience. The gaming research community readily adopted the term flow, as the eight elements can immediately be translated to game concepts such as concentration, control, and immersion⁴⁵; all are different facets of engagement.

In this article, we explored a facet of flow—the diversity of engagement. We quantified this dimension through the geometric variability of the time series trajectory along the PC axes (monotonically increasing/decreasing, constant and variable trajectories). We found that the trend of the data projection along the significant PC is a good predictor of the engagement level (in this study, the level was binary “churn” vs. “not churn”).

A well-observed fact is that the volume of activity is a good predictor of engagement. However, the volume of activity is a relative measure that is rarely transferable between games and even between different time windows within the same game. In contrast, variability is a unitless universal measure. Indeed, our method proved useful in predicting churn for two very different types of games: a simple, casual freemium game (Bubble Shooter) and a sophisticated MMORPG (Blade & Soul). In both cases, the same pipeline was used, while the underlying raw features were not engineered in any particular way. In both cases, a competitive result was achieved with respect to a supervised black-box algorithm.

As for the limitations of our method, the first and most obvious one is that the performance of our pipeline is inferior to that of the best supervised ML algorithm. In other words, there may be a price to pay for simplicity, generalizability, and explainability. Second, we have not checked our method on data sets where the majority class is the churner. In this case, the PCs may be less informative, and PCA should then be computed on the nonchurners only, which would forfeit the unsupervised mode. Finally, we tested our method on only two games. Further exploration of additional games and other domains is called for. Unfortunately, at this point in time, such an exploration may not be readily executed since gaming data are rarely shared.

Future work will study more nuanced connections between the trajectory type and the engagement level. We focused on a binary definition of churn. However, engagement is a continuous variable, and other churn definitions, such as soft-churn,²⁷ may be more suitable in some settings. We expect that we can usefully match trajectory types to a multiclass definition of churn.

Footnotes

Authors' Contributions

I.W.: Software (lead); methodology (equal); conceptualization (equal); and writing (supporting). D.V.: Writing (lead).

Author Disclosure Statement

No competing financial interests exist.

Funding Information

Support for D.V. was partially provided by Israeli Science Foundation grant number 1388/16.

References

1. Yukselturk, Ozekes, Türel. Predicting dropout student: An application of data mining methods in an online education program. Eur J Open Dist E Learn, 2014; 17(1):118–133.

2. Lloyd, Heffernan, Ruiz. Predicting Student Engagement in Intelligent Tutoring Systems Using Teacher Expert Knowledge. In: The Educational Data Mining Workshop Held at the 13th Conference on Artificial Intelligence in Education; 2007; pp. 40–49.

3. Balakrishnan, Coetzee. Predicting Student Retention in Massive Open Online Courses Using Hidden Markov Models. Electrical Engineering and Computer Sciences University of California at Berkeley, 2013; 53:57–58.

4. Dwivedi, Patil. A study on customer time engagement and perception of content for e-commerce sites in India. IOSR J Eng, 2019; 2250–3021(2278–8719).

5. Winters, Moore, Kuntz, et al. Principal components analysis to identify influences on research communication and engagement during an environmental disaster. BMJ Open, 2016; 6(8):e012106.

6. Juul. A Casual Revolution: Reinventing Video Games and Their Players. MIT Press: Cambridge, MA; 2010.

7. Koetsier. Mobile App Monetization: Freemium is King, But In-App Ads are Growing Fast. VentureBeat: San Francisco, CA; 2014.

8. Reichheld, Schefter. E-loyalty: Your secret weapon on the web. Harv Bus Rev, 2000; 78(4):105–113.

9. Kim, Gunn, Schuh, et al. Tracking Real-Time User Experience (True) a Comprehensive Instrumentation Solution for Complex Systems. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; 2008; pp. 443–452.

10. Bohannon. Game-miners grapple with massive data. Science, 2010; 330:30–31.

11. Drachen, Canossa. Evaluating motion: Spatial user behaviour in virtual environments (pre-print). Int J Arts Technol, 2013; 7:294–314.

12. Drachen, Canossa, Yannakakis. Player Modeling Using Self-Organization in Tomb Raider: Underworld. In: 2009 IEEE Symposium on Computational Intelligence and Games; 2009; pp. 1–8.

13. Weber, Mateas. A Data Mining Approach to Strategy Prediction. In: 2009 IEEE Symposium on Computational Intelligence and Games. IEEE; 2009; pp. 140–147.

14. Kim, Choi, Lee, et al. Churn prediction of mobile and online casual games using play log data. PLoS One, 2017; 12(7):1–19.

15. Liu, Xie, Wen, et al. A Semi-Supervised and Inductive Embedding Model for Churn Prediction of Large-Scale Mobile Games. In: 2018 IEEE International Conference on Data Mining (ICDM). 2018.

16. Khodadadi, Hosseini, Pajouheshgar, et al. Choracle: A unified statistical framework for churn prediction. IEEE Trans Knowl Data Eng, 2020; 34(4):1656–1666.

17. Hadiji, Sifa, Drachen, et al. Predicting Player Churn in the Wild. In: 2014 IEEE Conference on Computational Intelligence and Games; 2014; pp. 1–8.

18. Xie, Devlin, Kudenko. Predicting Disengagement in Free-to-Play Games with Highly Biased Data. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Number 2 in 12; 2016; pp. 143–150.

19. Runge, Gao, Garcin, et al. Churn Prediction for High-Value Players in Casual Social Games. In: 2014 IEEE Conference on Computational Intelligence and Games; 2014; pp. 1–8.

20. Guyon, Elisseeff. An introduction to variable and feature selection. J Mach Learn Res, 2003; 3:1157–1182.

21. Chandrashekar, Sahin. A survey on feature selection methods. Comput Electr Eng, 2014; 40(1; 40th-year commemorative issue):16–28.

22. Lundberg, Lee S-I. A unified approach to interpreting model predictions. NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017; pp. 4768–4777.

23. Hershcovits, Vilenchik, Gal. Modeling engagement in self-directed learning systems using principal component analysis. IEEE Trans Learn Technol, 2019; 13(1):164–171.

24. Rodríguez-Ardura, Meseguer-Artola. Flow in e-learning: What drives it and why it matters. Br J Educ Technol, 2017; 48(4):899–915.

25. Pearce. Engaging the Learner: How Can the flow Experience Support e-Learning? In: E-Learn: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education. Association for the Advancement of Computing in Education (AACE): Waynesville, NC; 2005; pp. 2288–2295.

26. Lee, Jang, Yoon, et al. Game Data Mining Competition on Churn Prediction and Survival Analysis Using Commercial Game Log Data. IEEE Trans Games, 2018; 11(3):215–226.

27. Hadiji, Sifa, Drachen, et al. Predicting Player Churn in the Wild. In: 2014 IEEE Conference on Computational Intelligence and Games. IEEE: Dortmund, Germany; 2014; pp. 1–8.

28. Patil, Deepshika, Mittal, et al. Customer Churn Prediction for Retail Business. In: 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS); 2017; pp. 845–851.

29. Tsai C-F, Lu Y-H. Customer churn prediction by hybrid neural networks. Expert Syst Appl, 2009; 36(10):12547–12553.

30. Bose, Chen. Hybrid models using unsupervised clustering for prediction of customer churn. J Organ Comput E Commerce, 2009; 19:133–151.

31. Daneshmandi, Ahmadzadeh. A hybrid data mining model to improve customer response modeling in direct marketing. Indian J Comput Sci Eng, 2013; 3(6):844–855.

32. Bertens, Guitart, Perianez. Games and Big Data: A Scalable Multi-Dimensional Churn Prediction Model. In: 2017 IEEE Conference on Computational Intelligence and Games (CIG). 2017.

33. Drachen, Lundquist, Kung, et al. Rapid Prediction of Player Retention in Free-to-Play Mobile Games. In: Twelfth Artificial Intelligence and Interactive Digital Entertainment Conference. 2016.

34. Sifa, Hadiji, Runge, et al. Predicting Purchase Decisions in Mobile Free-to-Play Games. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2015; 11(1):79–85.

35. Canossa, Drachen, Sørensen JRM. Arrrgghh!!! Blending Quantitative and Qualitative Methods to Detect Player Frustration. In: Proceedings of the 6th International Conference on Foundations of Digital Games, FDG’11. Association for Computing Machinery: New York, NY, USA; 2011; pp. 61–68.

36. Sifa, Drachen, Bauckhage, et al. Behavior Evolution in Tomb Raider Underworld. In: 2013 IEEE Conference on Computational Intelligence in Games (CIG); 2013; pp. 1–8.

37. Borbora, Srivastava. User Behavior Modelling Approach for Churn Prediction in Online Games. In: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing. IEEE; 2012; pp. 51–60.

38. Zheng, Chen, Xie, et al. Keep You From Leaving: Churn Prediction in Online Games. In: Database Systems for Advanced Applications. (Nah, Cui, Lee S-W. eds.) Springer International Publishing: Cham; 2020; pp. 263–279.

39. Castro, Tsuzuki. Churn prediction in online games using players' login records: A frequency analysis approach. IEEE Trans Comput Intell AI Games, 2015; 7:1–1.

40. Anderson TW. An Introduction to Multivariate Statistical Analysis. 3rd ed. Wiley-Interscience: Hoboken, New Jersey; 1962/2003.

41. Jolliffe. Principal Component Analysis. In: Encyclopedia of Statistics in Behavioral Science. (Everitt, Howell. eds.) Wiley Online Library: Hoboken, New Jersey; 2002.

42. Yeomans, Golder. The Guttman-Kaiser Criterion as a predictor of the number of common factors. Statistician, 1982; 1:221–229.

43. Weston, Mukherjee, Chapelle, et al. Feature selection for SVMs. Adv Neural Inf Process Syst, 2000; 13:668–674.

44. Csikszentmihalyi. Flow: The Psychology of Optimal Experience. Harper & Row: New York; 2009.

45. Sweetser, Wyeth. Gameflow: A model for evaluating player enjoyment in games. Comput Entertain, 2005; 3:3–3.