Abstract
The aim of this research was to investigate statistical learning techniques to predict gender based upon psychological constructs (measuring motivations to participate in masters sports). Motivations of marathoners scales (MOMS) sports psychological data for 3,928 masters athletes (2,010 males) from the World Masters Games (the largest sporting event in the world by participant numbers) were investigated using the regularized linear modelling methodologies ridge regression, the lasso and the elastic net. Comparison was made with previously published research utilizing logistic regression, discriminant function analysis, radial basis functions, multilayer perceptrons and a selection of boosted decision tree based models. It was hypothesized that the regularized linear models would perform better than other models except the boosted decision trees, and that ensembles of the regularized linear models and gradient boosted machines would result in improved accuracy over any prior models in the literature. Implementing modern regression methods with regularization provided improvements in classification accuracy based on gender compared to non-regularized linear models, but not compared to boosted trees such as a gradient boosted machine (GBM). Models that were solely or partially based on L2 regularization (including a penalty term to reduce the sum of squares of the parameters) performed better than those that relied solely or primarily on L1 regularization (including a penalty term to reduce the sum of the absolute values of the parameters) or subset selection. This finding had implications for analysis of MOMS data in general with respect to using the 56 questions in MOMS as opposed to the underlying nine constructs for analysis in order to compensate for multicollinearity. Ensemble methods stacking ridge regression and GBMs with out-of-sample prediction further improved accuracy, giving higher accuracy scores (0.7236) than obtained in any preceding literature.
This demonstrates the potential benefits from such an ensemble approach in terms of developing models with improved accuracy, as well as increasing the likelihood of developing practical applications from predictions using MOMS psychometric data.
Introduction
World Masters Games
The World Masters Games (WMG) is the world’s largest international sporting competition in terms of participant numbers, attracting tens of thousands of masters athletes. Masters athletes are people that systematically train for and compete in organized sporting events designed specifically for older adults (Raeburn & Dascombe, 2008). The WMG is held quadrennially as a non-invitational event. In 2009, the Sydney WMG attracted 28,089 competitors representing 95 countries and competing in 28 sports (Sydney 2009 World Masters Games Committee, 2009). Most masters athletes are adults over 30 years old, though this can vary by tournament with age requirements based upon sport and gender. For example, in some sports such as gymnastics the age requirements can be lower, such as over 25 years old. Previous research has been conducted on the Sydney WMG. This research included investigating body mass index (Walsh et al., 2011a, 2011b), injury incidence (Walsh et al., 2011c; Heazlewood et al., 2017) and health (Climstein et al., 2011; DeBeliso et al., 2011; Walsh et al., 2011d, 2012; DeBeliso et al., 2014; Climstein et al., 2016; DeBeliso et al., 2017) of the masters athletes at the WMG.
Motivations of marathoners scales
The motivations of marathoners scales (MOMS) (Masters et al., 1993) is used to gauge the importance of a range of psychological factors in determining sports participation. The MOMS is a psychometric instrument based upon a series of 56 questions and scored on a seven point Likert scale (Likert, 1932). Participants are requested to score 56 questions according to how important they are as a reason for them to participate in the WMG. The questions identified in the MOMS have been demonstrated (Ogles & Masters 2003; Havenar & Lochbaum, 2007; Ruiz & Zarauz Sancho, 2011; Buning & Walker, 2016) as important motivational constructs and have been used by sport psychology researchers for 25 years. A number of studies have been conducted on the MOMS in the context of masters athletes (Adams et al., 2011; Heazlewood et al., 2011, 2012; Sevene, 2012; Heazlewood et al., 2015, 2016b, 2016c, 2016d). Heazlewood et al. (2016a) and Walsh et al. (2018) investigated prediction of gender from the sports psychological data in the MOMS. As well as the cases laid out for predicting gender from MOMS in the literature (Heazlewood et al., 2016a; Walsh et al., 2018), being able to predict gender or other attributes of participants from MOMS data has value of its own based solely on the additional information (predicted gender) produced. Additionally, further research on relationships between gender classification prediction and this scale may lead to supplementary insights that might assist in other research using the MOMS instrument. The 56 questions in the MOMS have nine underlying factors that are different psychometric constructs. Further background to the MOMS, reasoning behind investigation into gender prediction based on MOMS and some examples of the 56 questions used in MOMS are found in Walsh et al. (2018).
Background
Previous research (Heazlewood et al., 2016a) compared four supervised learning (Hastie et al., 2009) models in terms of their ability as statistical techniques to classify the different genders based upon the MOMS 56 psychological motivational questions associated with motivations to participate in masters sports. The four supervised learning models used were radial basis functions and multilayer perceptrons (both neural networks), discriminant function analysis (Fisher, 1936) (a more traditional statistical approach under the general linear model) and logistic regression (Walsh et al., 2018). When the four methods were compared (Heazlewood et al., 2016a), none of the classification techniques (the two neural network analyses, the multivariate method of discriminant analysis and logistic regression) was overtly superior to the others. Classification accuracy was highest for each gender individually and also in overall terms (0.644) using a multilayer perceptron (MLP), whilst the radial basis function neural network displayed classification accuracy slightly lower (overall 0.605) than the other three methods (0.633 for discriminant function analysis and 0.630 for logistic regression). The classification accuracy of 0.644 for the MLP marginally outperformed the other methods and displayed a reasonable level of predictive validity.
Further research (Walsh et al., 2018) revealed that using decision tree (Quinlan, 1986; Quinlan, 1993) based models, specifically gradient boosted models, further improved accuracy. The efficacy of tree based models for prediction in this environment was established with even baseline older implementations, giving higher prediction accuracy than any methods used in prior research (Walsh et al., 2018). Whilst a more traditional decision tree such as J48 (Walsh et al., 2018) produced a higher accuracy than MLP, the accuracy lift obtained with application of more modern formats of this model (via boosting) produced even higher accuracy predictions. For these predictions via boosted trees, even the lower levels of the 95% confidence interval bands for this accuracy were higher than the accuracy obtained implementing MLP in prior research (Heazlewood et al., 2016a). The highest predictive accuracy was achieved using Gradient Boosted Machines (GBMs) (0.7134), exceeding accuracies of models using XGBoost (0.7012) or LightGBM (0.6904) (Walsh et al., 2018). These two recent implementations of boosting may have given lower predictive accuracy than GBM due to the high dimensionality relative to the number of cases in the data. It should be noted that a promising new implementation of boosted trees, CatBoost, developed by Yandex N.V., is now also available (Dorogush et al., 2018). As CatBoost specializes in dealing with categorical data, it was not adopted in this experiment (the training data used was not categorical, but ordinal).
Alternative linear models
Logistic regression is a technique that has been used for more than 50 years (Cox, 1958). It is a widely used method for modelling binary responses given either one or multiple predictors and under various conditions. Logistic regression is similar to a linear regression model, however is suited to models where the dependent variable is dichotomous. Logistic regression coefficients can be used to estimate odds ratios for each of the independent variables in the model. Logistic regression is applicable to a broader range of research situations than discriminant analysis. Prior research used a step-wise logistic regression approach (Heazlewood et al., 2016a). Improvements in accuracy were achieved using tree based models (Walsh et al., 2018), with further improvements utilizing more modern implementations of decision trees via boosting (Walsh et al., 2018). The aim of this paper was to investigate whether similar improvements in accuracy obtained using a linear model for prediction could be gained by employing a more modern implementation of linear modelling.
Ordinary least squares regression has been used since the early 1800s, with use by Legendre and Gauss (Stigler, 1981). With the advancement of modern computing techniques, there are a number of more modern regression approaches that incorporate methods to accommodate high dimensionality in terms of the number of training features. In modelling scenarios where there are a large number of input features, overfitting a model to the data can be a problem when there is insufficient training data. In these scenarios there are several approaches for prioritizing more predictive features: subset selection, ridge regression and the lasso.
Subset selection
Subset selection is a well-established technique, providing interpretable linear predictive models using only a subset of the available features for prediction. Subset selection focuses on the most predictive variables by discarding the other input features. With discriminant function analysis and logistic regression using step-wise methods only a limited number of questions would be used in formulating a predictive model, hence in Heazlewood et al. (2016a) the 56 MOMS questions were collapsed into nine distinct factors in order to increase accuracy. Whilst subset selection can improve prediction accuracy, it is a discrete process with features either completely dropped or kept without shrinkage within the predictive model. Shrinkage refers to reducing the size of some of the coefficients in a model (whilst dropping them is the equivalent of setting them to zero). This discrete deselection of features (i.e. dropping them) may result in a loss of predictive accuracy compared to a more continuous approach (e.g. incorporating shrinkage). Other less discrete (than subset selection) approaches to prioritizing features in linear modelling were used in this manuscript. These methods were the lasso (the least absolute shrinkage and selection operator), ridge regression and the elastic net.
Ridge regression
Ridge regression (Hoerl & Kennard, 1970) uses a proportional shrinkage to reduce the contribution of less predictive features, but without subsetting to completely deselect features. Ridge regression was proposed independently by several different authors and is also known as Tikhonov regularization. Tikhonov’s work, published in the Soviet Union (Tikhonov, 1963), considerably predates the term ridge regression, however it was not translated outside the Soviet Union until later (Tikhonov et al., 1977). Ridge regression aims to give more precise estimates of regression coefficients than ordinary least squares regression and minimize shrinkage (Batah et al., 2008). Ridge regression shrinks coefficients but does not drop features, except in the edge case when the regularization parameter lambda is set to infinity. Therefore in some cases where it is beneficial to drop the less predictive features, the lasso may produce a superior predictive model. Ridge regression can be used to moderate the effect of multicollinearity within the data (Türkan & Toktamış, 2012), which would make ridge regression an appropriate choice for predictive modelling based on the MOMS with 56 variables with nine underlying psychological constructs (thus some multicollinearity would be expected, particularly as psychological constructs likely overlap in complex ways). Ridge regression is sometimes referred to as L2 regularization.
The lasso
The lasso (the least absolute shrinkage and selection operator) is a shrinkage method proposed in 1996 (Tibshirani, 1996). Regression shrinkage and feature selection via the lasso is similar to ridge regression, with subtle but important differences. The lasso algorithm, sometimes called L1 regularization, is used when the contribution of many less predictive features should be minimized; with a sufficient number of input features, the coefficients of those with minimal contribution to the predictive model will be set to zero. The main difference between L1 and L2 regularization is that L1 regularization includes a penalty term to reduce the sum of the absolute values of the parameters, whilst for L2 regularization, the penalty term is used to reduce the sum of squares of the parameters (Ng, 2004).
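The practical difference between the two penalties is easiest to see in the special case of an orthonormal design, where each penalized coefficient is a simple function of its least squares estimate. The sketch below assumes that special case and a scaling in which ridge divides each estimate by (1 + lambda) while the lasso soft-thresholds it at lambda; the function names and example values are illustrative and are not taken from the analysis in this manuscript.

```python
import math

def ridge_shrink(beta_ols, lam):
    """Proportional shrinkage: every coefficient is scaled, none reaches zero."""
    return [b / (1.0 + lam) for b in beta_ols]

def lasso_shrink(beta_ols, lam):
    """Soft thresholding: coefficients smaller than lam in magnitude become zero."""
    shrunk = []
    for b in beta_ols:
        magnitude = max(abs(b) - lam, 0.0)
        shrunk.append(math.copysign(magnitude, b) if magnitude > 0.0 else 0.0)
    return shrunk

# Hypothetical least squares estimates for four features
beta_ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge = ridge_shrink(beta_ols, lam)   # all four coefficients remain non-zero
lasso = lasso_shrink(beta_ols, lam)   # the two smallest coefficients are dropped

print("ridge:", ridge)
print("lasso:", lasso)
```

Under these assumptions ridge keeps all four features with reduced weight, whereas the lasso removes the two weakest, which mirrors the distinction between shrinkage alone and shrinkage with selection.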
Tibshirani’s paper (Tibshirani, 1996) provides three example scenarios where discrete subset selection, the lasso and ridge regression each will perform better than the others. When there are a small number of features with a large effect, subset selection will perform better than the lasso, which in turn will perform better than ridge regression. Where there are a small to moderate number of features with a moderate sized effect, then the lasso will perform better than ridge regression, which in turn will perform better than subset selection. The third scenario is where many features have a small effect, in which case ridge regression will outperform the lasso, which in turn will outperform subset selection. The lasso can have lower predictive performance in some cases where many features are highly correlated with each other, as it will tend to select only one variable from this group to include in its predictive model and this can impact model performance.
The lasso has had a tremendous impact on high dimensional statistical inference and with the growth in readily available computing power since its creation it has been extensively implemented across many fields. This is evidenced by the exponential growth in citations (Tibshirani, 2011) of the paper originally proposing it (Tibshirani, 1996).
It should be noted that as an alternative to using a penalty term to reduce the sum of absolute values of the parameters (L1 regularization) or using a penalty term to reduce the sum of squares of the parameters (L2 regularization), other penalties have been investigated and have shown promise (Lipovetsky, 2010). These other penalty terms have involved either subjective application of alternative penalties to the ordinary least squares regression objective function (Lipovetsky & Conklin, 2005; Lipovetsky, 2006) or modification of the ordinary least squares regression objective function via a transformation (Lipovetsky, 2010). Currently these alternatives to purely applying L1 or L2 regularization are less commonly adopted, however there is experimental evidence indicating they can outperform standard ridge regression models (Lipovetsky, 2010). Improved performance has also been demonstrated using such penalties in the case of logistic regression (Asar & Genç, 2017).
Elastic net
The elastic net (Zou & Hastie, 2005) is a computationally more intensive method than the lasso or ridge regression alone as it is a convex sum forming a hybrid of ridge and lasso penalties. Combining subset selection and shrinkage penalties from both methods, it would be expected that the elastic net would produce a model superior to using the lasso or ridge regression alone. The original elastic net proposed by Zou and Hastie (Zou & Hastie, 2005) was demonstrated to perform poorly via empirical testing. They thus also introduced the least angle regression elastic net (LARS-EN) algorithm to improve solution of the elastic net (Zou & Hastie, 2005). These methods are used in the R package elasticnet written by Zou and Hastie. The R package glmnet uses a more recent implementation, cyclical coordinate descent (Friedman et al., 2010), when used for elastic nets.
Smaller data sets may be better modelled with either ridge regression or lasso alone, as the elastic net may not have sufficient data to train off and thus may over fit. Additionally, when there are only a few factors that contribute meaningfully to prediction, subset selection may also outperform the elastic net.
Relationship between ridge regression, the lasso and the elastic net
Expressions Eq. (1)–(3) demonstrate the relationship between ridge regression Eq. (1), the lasso Eq. (2) and the elastic net Eq. (3). Each equation is a cost function to be minimized for
The parameter
Empirical evidence (Zou & Hastie, 2005) demonstrates the naïve elastic net does not perform well except when
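For reference, the three cost functions are commonly written in the following form. This is a sketch only: the squared-error loss, the symbols (N cases, p features, coefficients beta, penalty strength lambda and mixing parameter alpha) and the scaling of the penalty are assumptions chosen to match the alpha and lambda hyperparameters described for glmnet, and may differ in detail from Eq. (1)–(3).

```latex
\begin{align}
\hat{\beta}^{\mathrm{ridge}} &= \arg\min_{\beta}\left\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{p} \beta_j^{2} \right\} \\
\hat{\beta}^{\mathrm{lasso}} &= \arg\min_{\beta}\left\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert \right\} \\
\hat{\beta}^{\mathrm{enet}} &= \arg\min_{\beta}\left\{ \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda \sum_{j=1}^{p} \Big[(1-\alpha)\,\beta_j^{2} + \alpha\,\lvert\beta_j\rvert\Big] \right\}
\end{align}
```

In this parameterization, setting alpha to zero recovers the ridge penalty and setting alpha to one recovers the lasso penalty, with intermediate values giving a convex combination of the two.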
Ensemble methods make use of combinations of different classifiers as a “hierarchical mixture of experts” (Cambon et al., 2015). The individual decisions of the different classifiers are combined, such as by weighted or unweighted voting (Dietterich, 2000). Classifiers can be considered diverse if they make different errors on a new prediction. If the prediction errors between two models are uncorrelated, then for models predicting at greater than 50% accuracy combining models should result in an increase in accuracy. More diversity between different accurate classifiers should lead to more accurate ensembles when they are combined (Hansen & Salamon, 1990). When combining classifiers to produce ensemble models the individual classifiers are weighted to determine their contribution to the overall model. This weighting can be arbitrarily set, such as for example using an equal weighting for all classifiers. An alternative approach is to weight each classifier according to some performance measure (e.g. model accuracy). The ensemble classifier
Tree based statistical models such as GBMs (for example those used in Walsh et al. (2018)) are able to model subtle non-linear interactions. GBM ensembles combine many small decision trees as base learners (Jiang, 2000). Linear models by their definition model linear interactions and approximate non-linear interactions as linear. In general, tree based ensembles such as GBM (and its more recent iterations such as XGBoost and LightGBM) perform better than linear models as predictors, as evidenced by their consistent high ranking in international machine learning competitions. There are, however, a number of situations where a linear model would be expected to outperform GBM models. Hypothetical examples include where there is minimal data to build a model from. In this case, a simpler model such as a linear model will perform better than complex models requiring more training data. Linear models would also be expected to perform better than their non-linear counterparts when the underlying relationship between the dependent and independent variables is linear (which a linear model should better fit). They are also a good choice when the dataset is so large with so many features that a more complex model would be computationally too costly (e.g. modelling would take too long to run, or be uneconomical). A linear model would similarly be appropriate if the dataset was of manageable size but there were time constraints, as a simpler linear model would execute faster. Data with sparsity (such as for large data sets with many empty values as found in text mining) may also perform well with linear models. Linear models and tree based ensembles can be considered a diverse model pairing. This diversity is due to linear models looking at linear patterns within the data and being able to operate with less data, whilst GBMs can model more subtle, deeper interactions but perhaps require more data.
Although linear models arguably do not have the predictive accuracy of GBMs, as they may have uncorrelated errors, combining them in an ensemble should provide an increased accuracy over the result obtained by either model type independently. If the lasso, ridge regression or the elastic net provide models with appropriate accuracy, it is planned to build a suitable ensemble prediction combining them with a tree based model.
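The benefit of combining classifiers with uncorrelated errors can be illustrated with an idealized calculation. The sketch below assumes an odd number of classifiers that each have the same accuracy and make fully independent errors, combined by unweighted majority vote; real models (including those in this manuscript) will usually have partly correlated errors, so the gain shown is an upper-end illustration rather than an expected result. The function name is hypothetical.

```python
from math import comb

def majority_vote_accuracy(p, n_models):
    """Accuracy of an unweighted majority vote over n_models independent
    classifiers that are each correct with probability p (n_models odd)."""
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p ** k * (1.0 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

single = 0.7
three = majority_vote_accuracy(single, 3)
five = majority_vote_accuracy(single, 5)

print(f"one model: {single:.4f}, three models: {three:.4f}, five models: {five:.4f}")
```

With each classifier at 0.7 accuracy, the vote of three independent classifiers is correct with probability 0.784. At exactly 0.5 accuracy the vote gives no improvement, which is why the individual models need to predict at better than chance.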
There are many metrics that can be adopted to assess the performance of a predictive model. As well as the classification accuracy (the proportion of masters athletes with gender correctly predicted by a model) a number of other evaluation metrics were planned to be calculated. As classification predictions are binomial (predictions are correct or incorrect), in addition to model accuracy, a binomial 95% confidence interval can be computed (utilising a one-tailed binomial test via the binom.test function in R).
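The interval reported by binom.test in R is the exact (Clopper–Pearson) interval. As a minimal sketch of how such an interval can be obtained, the code below solves for the two bounds by bisection on the binomial tail probabilities. It assumes a two-sided 95% interval, and the counts used are hypothetical rather than results from this study.

```python
import math

def _log_pmf(k, n, p):
    # log of the Binomial(n, p) probability mass at k, computed in log space
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))

def _cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.exp(_log_pmf(i, n, p)) for i in range(k + 1))

def _solve(f, target, increasing):
    """Bisection for f(p) = target on (0, 1), where f is monotone in p."""
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def clopper_pearson(correct, total, conf=0.95):
    """Exact two-sided binomial confidence interval for an accuracy."""
    tail = (1.0 - conf) / 2.0
    # Lower bound: value of p at which P(X >= correct) equals the tail area
    lower = 0.0 if correct == 0 else _solve(
        lambda p: 1.0 - _cdf(correct - 1, total, p), tail, True)
    # Upper bound: value of p at which P(X <= correct) equals the tail area
    upper = 1.0 if correct == total else _solve(
        lambda p: _cdf(correct, total, p), tail, False)
    return lower, upper

# Hypothetical example: 70 correct predictions out of 100 test cases
low, high = clopper_pearson(70, 100)
print(f"accuracy 0.70, 95% CI ({low:.4f}, {high:.4f})")
```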
One assessment of model performance is to compare with a model that solely predicts all individuals as having group membership in the majority class. The accuracy of this model is referred to as the no information rate. A one sided hypothesis test can be used to test whether the model accuracy is greater than the no information rate. The p-value from this test can be particularly relevant when there is a large class imbalance in the training data. Another statistic used for evaluation is the Cohen’s Kappa (Cohen, 1960) (also just referred to as Kappa) statistic. Kappa measures association between the observed test values and model predicted values. A Kappa value of one is associated with perfect agreement, zero is associated with an agreement level equivalent to random chance and a negative Kappa value would be associated with a model that predicted worse than random chance. A rudimentary McNemar’s test (McNemar, 1947) can also be applied to test predictive accuracy. The McNemar’s test is applied to the columns of a 2 × 2 contingency table.
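These statistics can all be computed from the four cells of a confusion matrix. The sketch below uses hypothetical counts, treats one class as "positive" purely for bookkeeping, and computes the McNemar statistic without a continuity correction, which is an assumption (some implementations apply the correction by default).

```python
def agreement_stats(tp, fn, fp, tn):
    """Accuracy, no information rate, Cohen's Kappa and the uncorrected
    McNemar chi-square statistic from a 2 x 2 confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Accuracy of always predicting the majority observed class
    no_information_rate = max(tp + fn, fp + tn) / n
    # Agreement expected by chance, from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1.0 - expected)
    # McNemar compares the two kinds of disagreement (off-diagonal cells)
    discordant = fn + fp
    mcnemar = (fn - fp) ** 2 / discordant if discordant else 0.0
    return {
        "accuracy": accuracy,
        "no_information_rate": no_information_rate,
        "kappa": kappa,
        "mcnemar_chi2": mcnemar,
    }

# Hypothetical counts: 50 observed positives and 50 observed negatives
stats = agreement_stats(tp=40, fn=10, fp=20, tn=30)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

For these counts the accuracy is 0.70 against a no information rate of 0.50, and Kappa is 0.40, indicating agreement well above chance but far from perfect.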
Often predictive models are formulated to predict presence or absence of a particular event (such as for example a medical condition). The presence or absence of such an event can lead to positive and negative classification categories (such as presence or absence of the predicted medical condition). Such binary outcomes have influenced the nomenclature of many model evaluation metrics, however the same metrics can be exceedingly useful when evaluating other binary outcomes without classification categories that are conventionally considered positive or negative, such as male and female as genders. In such an example one of the categories is treated as the positive outcome and these useful metrics can still be evaluated.
An important metric for assessing a model is sensitivity, where sensitivity is also called the true positive rate or recall. Sensitivity is the proportion of cases observed in the positive category that the model correctly classified as positive. The false positive rate is the proportion of cases observed in the negative category that were classified by the model as having membership of the positive category. Another useful metric is specificity, where specificity is equal to one minus the false positive rate. Other beneficial model evaluation metrics include prevalence (total number of cases within the positive category observed in the test data divided by the total number of cases within the test data), detection rate (the number of true positives divided by the total number of cases within the test data) and the detection prevalence (the total number of predicted events given by the predictive model divided by the total number of cases within the test data). Individual accuracies can be assessed for each case category (such as predictive accuracy scores considering only the males (or females) within the test set) or a balanced accuracy (equal to an average of sensitivity and specificity).
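The definitions above translate directly into arithmetic on the confusion matrix cells. The following sketch uses hypothetical counts and an arbitrary choice of positive class; the function name is illustrative.

```python
def class_metrics(tp, fn, fp, tn):
    """Per-class evaluation metrics from a 2 x 2 confusion matrix.

    tp/fn: observed positives predicted positive/negative
    fp/tn: observed negatives predicted positive/negative
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    false_positive_rate = fp / (fp + tn)
    specificity = 1.0 - false_positive_rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "prevalence": (tp + fn) / n,            # observed positives / all cases
        "detection_rate": tp / n,               # true positives / all cases
        "detection_prevalence": (tp + fp) / n,  # predicted positives / all cases
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }

# Hypothetical counts: 50 observed positives and 50 observed negatives
metrics = class_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```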
Aim and hypothesis
Research by Walsh et al. (2018) demonstrated significant improvements to gender classification prediction over accuracy scores reported in Heazlewood et al. (2016a). Improvements in accuracy were achieved using tree based models, with further improvements utilizing more modern implementations of decision trees via boosting (Walsh et al., 2018). The highest predictive accuracy for gender classification was obtained using boosted trees. The aim of this investigation was to explore whether a similar improvement could be gained using more modern methods for restricting the linear regression model. It has been shown that boosting methods are equivalent to a high dimensional version of the lasso, without explicitly performing the lasso (Friedman et al., 2004). It was therefore hypothesized that similar improvements to accuracy would be achieved by applying the lasso, ridge regression and the elastic net methods over those achieved in prior research that used a step-wise logistic regression approach (Heazlewood et al., 2016a).
Current accuracy scores from prior research using MLP of 0.644 (Heazlewood et al., 2016a), or using GBM 0.7134 (Walsh et al., 2018) have some promise, however it was believed if the accuracy of prediction could be significantly improved, more cogent outcomes could then be applied in terms of gender classification based off the MOMS. It was our hypothesis that there would be an improvement in accuracy using more modern linear modelling methods that would be significant when compared to previous research results (Heazlewood et al., 2016a). However, whilst a progression, it was believed this still might not be adequate for meaningful cogent applications of gender prediction from psychological motivations. It was not expected that the linear predictive models using regularization would give superior scores to those obtained with GBM in Walsh et al. (2018). This was hypothesized as gradient boosted trees are often a component of the winning models in international modelling competitions. There was interest in assessing the predictive capabilities of linear models, however, as there are distinct advantages in using linear models. As mentioned, these advantages included requiring less data to train off, being computationally faster to run, being easier to interpret and, as a straight hyperplane is being fit to the data, providing better predictive accuracy when the underlying pattern is linear. Linear models requiring less data to train off might be particularly relevant as, compared to data sets in modelling competitions, the WMG dataset was exceedingly small. Conversely, tree based models will perform better than linear models where underlying patterns within the data are far from linear, by splitting the feature space into progressively smaller sub-sections as required at each decision node. Due to the diversity between the two categories of classifier, if there was some promise in the more modern linear models investigated (e.g. improved accuracy over Heazlewood et al. (2016a)), it was a secondary aim of this investigation to build an ensemble of linear and GBM classifiers. Due to leveraging off uncorrelated errors, it was hypothesized that some small accuracy increase would be given with this ensemble over both the linear models developed in this manuscript and the GBM models built in prior research (Walsh et al., 2018).
glmnet
Much of the research conducted on the lasso (Tibshirani, 1996) and elastic nets was conducted by faculty from the Statistics Department at Stanford University. These include the initial papers proposing the lasso and elastic nets (Zou & Hastie, 2005) as well as using the LARS-EN algorithm (Zou & Hastie, 2005) and cyclical coordinate descent (Friedman et al., 2010). The same team authored and maintains an R package glmnet specifically for performing the lasso and elastic net, a package which also accommodates ridge regression. This package implements cyclical coordinate descent (Friedman et al., 2010) to optimize the objective function for each parameter individually with the others held fixed, cycling repeatedly until convergence is achieved. Whilst there are other choices of package within R (such as an earlier package elasticnet, also from some of the same authors and implementing the LARS-EN algorithm (Zou & Hastie, 2005)) or other languages (Matlab has its own version of glmnet, also created and maintained by the Statistics Department at Stanford University), implementing glmnet in R to investigate regularized linear modelling was considered the most appropriate selection.
Data and methods
Electronic invitations were sent to masters games athletes who provided a valid email address and a total of 3,928 masters athletes (2,010 male, 1,918 female) completed all 56 questions in the MOMS via an online survey created using Limesurvey
Analysis was conducted using the R programming language version 3.4.3 (2017-11-30) “Kite-Eating Tree” on platform Windows x86_64-w64-mingw32/x64 (64-bit). Ridge regression, lasso and elastic net models were trained by means of the glmnet package using repeated cross validation.
The models were built on an 80:20 train test split with the training data internally validated with cross fold validation to reduce overfitting. In order to evaluate our supervised learning models and optimize performance on the data set, hyperparameter (Bergstra et al., 2011) optimization was conducted using a grid search/hyperparameter sweep. This is an exhaustive search through manually specified subsets of parameters. The cartesian product of these parameter subsets can be computationally expensive. Alternative methods include random search (Bergstra & Bengio, 2012) and using a Bayesian approach (Snoek et al., 2012) to optimization of hyperparameters. After provisional tests conducted with both random search and Bayesian Optimisation, grid search was deemed a preferable choice as requirements of grid search were well within practical limits of the computational resources available, even with the extensive repeated cross fold validation to reduce likelihood of over-fitting. The linear models used in this manuscript have fewer parameters than alternatives, such as some decision tree based models (see listed parameter requirements for boosted tree based models in Walsh et al. (2018) for illustrative examples), another factor that supported use of grid search. After provisional evaluation of various numbers of folds with numerous repetitions and the convergence of their results, an eight-fold cross-validation with the whole process repeated a further ten times was selected as a sufficiently robust cross-validation approach and was the final format utilized for all the models built.
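The resampling scheme described above can be sketched as a generator of train and hold-out index sets. This is an illustration of repeated k-fold cross-validation in general, not the resampling code used in the analysis (which was run in R); the function name, the random seed and the training set size of 3,142 rows (roughly 80% of 3,928) are assumptions.

```python
import random

def repeated_kfold(n_rows, k=8, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, held_out_indices) for repeated
    k-fold cross-validation, reshuffling the rows before each repeat."""
    rng = random.Random(seed)
    for repeat in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        # Deal the shuffled rows into k folds of near-equal size
        folds = [indices[i::k] for i in range(k)]
        for fold_number, held_out in enumerate(folds):
            train = [row for other, fold in enumerate(folds)
                     if other != fold_number for row in fold]
            yield repeat, fold_number, train, held_out

n_train = 3142  # assumed size of the 80% training split
splits = list(repeated_kfold(n_train))
print(len(splits), "resamples per hyperparameter combination")
```

Each candidate hyperparameter setting is therefore evaluated on 80 resamples, with every training row held out exactly once per repeat.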
A ridge regression was trained first. The hyper parameters used were alpha and lambda. Alpha was set to zero in order to specify L2 regularization. Lambda was tuned from zero to one with increments of 0.001. Lambda can take values up to infinity and, should there not be clear convergence at these lower values, the tuning parameter range would be expanded to higher values.
A model using the lasso was trained next, followed by an elastic net. Both of these models using glmnet required the same hyper parameters, alpha and lambda, that were used for the ridge regression model. For the lasso, alpha was set to one in order to specify L1 regularization, whilst lambda was tuned across the same range as for ridge regression (lambda was tuned from zero to one with increments of 0.001). For the elastic net the hyper parameter grid was set with alpha ranging from 0 to 1 incremented in steps of 0.01, whilst lambda was tuned as per ridge regression and the lasso. In the case of the elastic net, for example, developing models across the full range of hyper parameters across 10 repeats of eight-fold cross-validation required approximately 8 million models to be generated for the elastic net alone, not including provisional models or test runs.
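The size of the elastic net search follows directly from the grid definition. The short sketch below counts one model per combination of alpha, lambda and resample, which is an assumption about how the total was tallied; the variable names are illustrative.

```python
# Hyperparameter grid as described: alpha in steps of 0.01, lambda in steps of 0.001
alphas = [i / 100 for i in range(101)]      # 0.00, 0.01, ..., 1.00
lambdas = [i / 1000 for i in range(1001)]   # 0.000, 0.001, ..., 1.000

folds = 8
repeats = 10

grid_points = len(alphas) * len(lambdas)
elastic_net_fits = grid_points * folds * repeats

print(f"{grid_points:,} grid points x {folds * repeats} resamples "
      f"= {elastic_net_fits:,} models")
# prints: 101,101 grid points x 80 resamples = 8,088,080 models
```

Ridge regression and the lasso each fix alpha, so their searches cover only the 1,001 lambda values, or 80,080 models each under the same counting.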
Building a simple ensemble model
Ridge regression was selected for use in conjunction with GBMs, which had shown good predictive accuracy in prior research (Walsh et al., 2018), for building an ensemble predictor. The first ensemble built was a simple ensemble merging ridge regression and GBM models using a linear weighting. To build this model a GBM was tuned using all the training data as per Walsh et al. (2018). This was then combined with the already built ridge regression. The predictions were weighted 10:1 (an order of magnitude) in favor of those made using the GBM over those made using ridge regression. The predictors were then combined to form an ensemble in a format as per Eq. (6). A naïve ensemble such as this was not expected to produce much of an accuracy lift over the GBM alone, as there would be overfitting on the training set due to exposure of both models to the same individuals in the training data. However, it was a simple way to gain a slight lift, as the improvement in accuracy from leveraging uncorrelated errors was believed to be more substantial than the decrease in accuracy due to overfitting on the same training subset.
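The 10:1 linear weighting can be illustrated as follows. This is a sketch under stated assumptions rather than a reproduction of the authors' equation: dividing by the sum of the weights, thresholding at 0.5, the function name `weighted_ensemble` and the example probabilities are all hypothetical.

```python
def weighted_ensemble(p_gbm, p_ridge, w_gbm=10.0, w_ridge=1.0):
    """Blend two lists of predicted probabilities with fixed linear weights."""
    total = w_gbm + w_ridge  # normalizing by the weight sum is an assumption
    return [(w_gbm * g + w_ridge * r) / total for g, r in zip(p_gbm, p_ridge)]

# Hypothetical predicted probabilities for three athletes.
p_gbm = [0.80, 0.45, 0.30]
p_ridge = [0.60, 0.55, 0.10]

blended = weighted_ensemble(p_gbm, p_ridge)
labels = [1 if p >= 0.5 else 0 for p in blended]  # assumed 0.5 decision threshold
print(labels)  # [1, 0, 0]
```

With this weighting the ridge model can only change the predicted class when the GBM probability is already close to the decision threshold.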
From Eq. (6), a function for the simple/naïve ensemble could be expressed Eq. (7). In this equation
In order to reduce the overfitting caused by building multiple models on the same training subset, a stacked ensemble with out-of-sample (OOS) predictions (predictions from a model trained using a different sample) was built. This involved splitting the training data into two subsets, referred to here as subsets 1 and 2 for ease of documenting the model building process. A GBM model was trained using the data in subset 1, whilst a ridge regression model was trained on subset 2. The GBM model was then used to make a data frame of predictions by applying the model to the data in subset 2. Similarly, the ridge regression model trained using subset 2 was used to make predictions on the training data in subset 1. The predictions on subset 1 made using the ridge regression model were then added to subset 1 as extra columns. Subset 1, now including the predictions from the ridge regression model originally trained on subset 2, was then used to train a fresh GBM model. This meant a GBM was built using OOS predictions from ridge regression. The same process was followed for subset 2: subset 2, combined with the predictions on subset 2 from the first GBM model (trained on subset 1), was used as training data for a new ridge regression model. This gave a ridge regression model with OOS predictions from a GBM model. The advantage of this method is that the OOS predictions avoid overexposure to the same training data. This overexposure was present in the simple ensemble built previously, where multiple models were built on the same subset of data. The predictions of the final GBM and ridge regression models were weighted 10:1 (an order of magnitude) in favor of the predictions that used the GBM as the final model.
This gave some OOS contribution to the GBM prediction from ridge regression models both in terms of OOS predictions as well as weighted contribution from a separate ridge regression model, itself containing GBM predictions from a model built OOS. The weighting was skewed to the GBM as it was expected to be a stronger model.
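The training procedure for the stacked ensemble can be sketched structurally as below. This is only an outline of the data flow, not the authors' code: `fit_centroid` is a toy stand-in used in place of both the GBM and the ridge regression so the example runs without external libraries, and all function names and data values are hypothetical.

```python
def fit_centroid(rows, labels):
    """Toy stand-in learner: score a row by its closeness to the class-1 centroid."""
    def centroid(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(col) / len(members) for col in zip(*members)]
    c0, c1 = centroid(0), centroid(1)

    def predict(new_rows):
        out = []
        for r in new_rows:
            d0 = sum((a - b) ** 2 for a, b in zip(r, c0))
            d1 = sum((a - b) ** 2 for a, b in zip(r, c1))
            out.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        return out
    return predict

def augment(rows, preds):
    """Append one prediction column to every row."""
    return [list(r) + [p] for r, p in zip(rows, preds)]

def train_stacked(X1, y1, X2, y2, fit_gbm, fit_ridge):
    gbm_1 = fit_gbm(X1, y1)        # first GBM, trained on subset 1
    ridge_1 = fit_ridge(X2, y2)    # first ridge model, trained on subset 2
    # OOS predictions: each first-stage model predicts the subset it never saw.
    X1_aug = augment(X1, ridge_1(X1))
    X2_aug = augment(X2, gbm_1(X2))
    gbm_2 = fit_gbm(X1_aug, y1)      # fresh GBM with ridge OOS predictions as a column
    ridge_2 = fit_ridge(X2_aug, y2)  # fresh ridge model with GBM OOS predictions as a column
    return gbm_1, ridge_1, gbm_2, ridge_2

# Hypothetical two-feature training subsets.
X1, y1 = [[1, 2], [2, 1], [8, 9], [9, 8]], [0, 0, 1, 1]
X2, y2 = [[0, 1], [1, 0], [9, 9], [8, 8]], [0, 0, 1, 1]

gbm_1, ridge_1, gbm_2, ridge_2 = train_stacked(X1, y1, X2, y2, fit_centroid, fit_centroid)
new_row = [[9, 9]]
print(gbm_2(augment(new_row, ridge_1(new_row))))
```

The key property is that the extra column seen by each second-stage model always comes from a model that was not trained on those rows.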
For predictions on the testing data, the testing data needed to be split into two subsets. These subsets will be referred to as testing subset one and testing subset two. For testing subset one predictions were made using the first ridge regression model and the predictions were added as columns to testing subset one. Predictions were then made on the data in this expanded testing subset one using the second GBM model. For testing subset two, predictions were made using the first GBM model and the predictions were added as columns to testing subset two. Predictions were then made on the data in this expanded testing subset two using the second ridge regression model. The final predictions from the last GBM and ridge regression models were then weighted as discussed in the training section.
This process was then repeated with the two testing inputs swapped. Namely, for testing subset two, predictions were now made using the first ridge regression model to form an expanded data frame, and further predictions were made on this data frame with the second GBM model. For testing subset one, predictions were now made using the first GBM model to form an expanded data frame, and further predictions were made on this data frame with the second ridge regression model. These additional predictions were again weighted 10:1 towards the model that had its final set of predictions from a GBM. This new set of final predictions on the testing set was then averaged with the previous set of final predictions to form the concluding predictions of this stacked ensemble. Switching the inputs in this manner to run predictions twice allowed averaging out of prediction error.
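One possible reading of the two-pass prediction procedure is sketched below. It assumes that, for every test row, the prediction from the GBM path and the prediction from the ridge path are combined with normalized 10:1 weights; the four models are hypothetical stand-in functions, not fitted models, and the exact combination used by the authors may differ.

```python
def augment(rows, preds):
    """Append one prediction column to every row."""
    return [list(r) + [p] for r, p in zip(rows, preds)]

def predict_stacked(T1, T2, gbm_1, ridge_1, gbm_2, ridge_2, w_gbm=10.0, w_ridge=1.0):
    def via_gbm(rows):    # first ridge model feeds the second GBM
        return gbm_2(augment(rows, ridge_1(rows)))

    def via_ridge(rows):  # first GBM feeds the second ridge model
        return ridge_2(augment(rows, gbm_1(rows)))

    total = w_gbm + w_ridge
    # Pass 1: subset one takes the GBM path, subset two takes the ridge path.
    pass_1 = {"T1": via_gbm(T1), "T2": via_ridge(T2)}
    # Pass 2: the two testing inputs are swapped.
    pass_2 = {"T1": via_ridge(T1), "T2": via_gbm(T2)}
    # Combine both passes row by row, weighted towards the GBM path.
    out_1 = [(w_gbm * g + w_ridge * r) / total for g, r in zip(pass_1["T1"], pass_2["T1"])]
    out_2 = [(w_gbm * g + w_ridge * r) / total for g, r in zip(pass_2["T2"], pass_1["T2"])]
    return out_1, out_2

# Hypothetical stand-in models returning probabilities.
gbm_1 = lambda rows: [0.9 if r[0] > 5 else 0.2 for r in rows]
ridge_1 = lambda rows: [r[0] / 10 for r in rows]
gbm_2 = lambda rows: [0.5 * (1.0 if r[0] > 5 else 0.0) + 0.5 * r[-1] for r in rows]
ridge_2 = lambda rows: [0.5 * r[0] / 10 + 0.5 * r[-1] for r in rows]

out_1, out_2 = predict_stacked([[8.0, 1.0]], [[2.0, 3.0]], gbm_1, ridge_1, gbm_2, ridge_2)
print(out_1, out_2)
```

Running both passes means every test row receives a prediction from each of the two second-stage models.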
Considering the model training data as split into two subsets of features
Where,
The weights for the different components of the ensemble are indicated next to each model in Eq. (8). The equation in Eq. (8) can be expressed more clearly rearranging Eq. (9)
Regularized linear models predictive accuracy
Tuning lambda via grid search optimization provided the highest prediction accuracy score for ridge regression of 0.6828 with 95% confidence interval (CI) (0.649, 0.7153) at lambda equal to 0.025. For the lasso, optimal prediction accuracy was 0.6815 with 95% CI (0.6477, 0.714) at lambda equal to zero. The relationship between lambda and accuracy is shown in Fig. 1 for ridge regression and Fig. 2 for the lasso. In both Figs. 1 and 2 the range and granularity of the lambda values shown were reduced in order to assist visualization. It is clear from Fig. 1 that accuracy increased as regularization was introduced and then decreased as the L2 penalty grew and began to cause too much shrinkage. For the lasso (Fig. 2) the highest accuracy score was obtained with no L1 regularization. From Fig. 2, it can be seen that as L1 regularization increases, it dramatically reduces accuracy by removing input variables from the model (setting coefficients to zero). At a lambda value of approximately 0.1, the accuracy saturates and does not decrease further (additional predictive independent variables are not dropped within the range shown in Fig. 2).
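As a rough consistency check on the reported intervals, an exact binomial (Clopper-Pearson) interval can be computed by bisection. The source does not state how the intervals were obtained or the exact test-set size, so the method, the test-set size of 785 (about 20% of 3,928) and the count of 536 correct classifications (785 multiplied by 0.6828, rounded) are all assumptions made for this sketch.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def bisect(condition, lo=0.0, hi=1.0, iters=60):
    """Return the boundary p where condition(p) switches from True to False."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if condition(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, level=0.95):
    """Clopper-Pearson interval for k successes in n trials."""
    tail = (1 - level) / 2
    # Lower limit: P(X >= k) rises with p and reaches the tail area here.
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) < tail)
    # Upper limit: P(X <= k) falls with p and reaches the tail area here.
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > tail)
    return lower, upper

n_test, n_correct = 785, 536  # assumed values, see the note above
lower, upper = exact_ci(n_correct, n_test)
print(round(n_correct / n_test, 4), round(lower, 4), round(upper, 4))
```

An interval of this width (roughly plus or minus 0.03) is typical for a test set of a few hundred observations, which is worth bearing in mind when comparing accuracy scores that differ by 0.01 or less.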
Figure 1. Relationship between accuracy (via repeated cross validation) and the regularization parameter lambda for the ridge regression models.
Figure 2. Relationship between accuracy (via repeated cross validation) and the regularization parameter lambda for the lasso models.
For the elastic net, optimal accuracy was 0.6828 with 95% CI (0.649, 0.7153), obtained with alpha equal to 0.01 and lambda equal to 0.020. The relationship between model accuracy (via repeated cross validation) and the regularization parameter lambda at different mixing percentages (alpha) for the elastic net models is demonstrated in Fig. 3. Figure 3 shows only a small subset of the mixing percentages (every fifth value of alpha) in order to assist visualization without an overly cluttered graph. Figure 3 demonstrates accuracy decreasing as either the mixing percentage or the regularization increases. The hyperparameters with the highest accuracy in training (alpha equal to 0.01 and lambda equal to 0.02) gave an accuracy of 0.6828 on the test set. As with the lasso (Fig. 2), the models with significant L1 regularization decrease in accuracy and saturate at a lambda of approximately 0.1.
Model evaluation metrics for the three different regularized linear models are shown in Table 1. The accuracy scores are all higher than the scores reported in Heazlewood et al. (2016a) of 0.630 for logistic regression as well as the accuracy obtained using MLP of 0.644 (Heazlewood et al., 2016a). The scores are however lower than 0.7134, obtained using GBMs (Walsh et al., 2018).
Table 1. Comparison of model evaluation metrics for the three regularized linear models developed to predict gender classification.
Figure 3. Relationship between accuracy (via repeated cross validation) and the regularization parameter lambda at different mixing percentages (alpha) for the elastic net models.
The simple ensemble gave a predictive accuracy of 0.7159 with 95% CI (0.683, 0.7472). The stacked ensemble with OOS prediction had a predictive accuracy on the test set of 0.7236 with 95% CI (0.6908, 0.7546).
Individually, the GBM models created as components of the ensemble were more accurate than the individual ridge regression models. The GBM models also had a wider range of hyperparameters to tune. Figure 4 illustrates the relationship between some of the different GBM hyperparameters. In Fig. 4, only a small selection of the range of parameters tuned is displayed to ease interpretability. The different colored lines represent different tree depths ranging from two to eight. The shrinkage ranges from 0.0010 to 0.0150, whilst the minimum number of observations in a node (n.minobsinnode) ranges from six to eight. The x-axis values are the number of trees per GBM and the y-axis values are the receiver operating characteristic (ROC) scores obtained via cross-validation.
Model evaluation metrics for the simple and stacked ensemble models are shown in Table 2. The accuracies of both the simple and stacked ensembles were higher than those of all other models in this manuscript, as well as higher than the accuracy of any of the models in prior literature (Heazlewood et al., 2016a; Walsh et al., 2018).
Table 2. Comparison of model evaluation metrics for the simple and stacked ensemble with OOS prediction models developed to predict gender classification.
Figure 4. Hyperparameter tuning of a series of GBM models using grid search optimization.
As hypothesized, the accuracy scores for the ridge regression, lasso and elastic net models were all improvements over the score of 0.630 previously reported for logistic regression and the accuracy of 0.644 obtained using MLP (Heazlewood et al., 2016a). The 95% confidence intervals (CI) for the ridge regression, lasso and elastic net models also lay above the accuracy scores from this preceding research (Heazlewood et al., 2016a), giving a good level of confidence that these new models provide significantly improved predictions. These new scores, however, were below the 0.7134 obtained in the literature using a GBM (Walsh et al., 2018). It is clear that the main hypothesis of the study was supported, namely that modern regularized regression models would give a predictive accuracy improvement and that, similar to the improvement obtained by implementing boosted trees, this would improve accuracy over that obtained in prior research using logistic regression, discriminant function analysis, Radial Basis Function (RBF) and MLP.
Figure 1 demonstrates the relationship between model accuracy via repeated cross validation and the regularization parameter lambda for ridge regression. It is clear that a small amount of L2 regularization increases model accuracy, however as the regularization parameter gets larger (approximately above 0.1) the accuracy starts to fall. The highest accuracy score of 0.6828 was obtained at lambda 0.025. In Fig. 2 we have the same accuracy relationship with lambda for the lasso. It is clear from Fig. 2 that restricting the number of variables via introducing L1 regularization results in reduced model accuracy. In fact the highest accuracy score in Fig. 2 was obtained when lambda equaled zero and from Eq. (2) the L1 penalty
The models obtained via elastic net are displayed in Fig. 3. Figure 3 does not contain the full range of lambdas used for modelling and has heavily reduced the number of different mixing percentages (alpha) for the elastic net models presented. This was done to improve visualization. It is well illustrated by Fig. 3 that
The regularized linear models improved classification accuracy for predicting gender in the WMG database based on the MOMS scale questions. The improvement was not deemed sufficient to have cogent applications, and the predictive accuracy was less than that achieved with prior models using GBM. It was of interest that linear models gave better predictions using the 56 raw psychological Likert type scores than the linear models in the literature that used the nine MOMS factors (Heazlewood et al., 2016a). This was the case even when there was no regularization (as per Figs. 1–3). This might have implications for the factors in MOMS: although they have been validated by previous research, adopting them appears to result in a decrease in predictive accuracy. Whilst this is only indicated for linear modelling, and the factors were not designed to predict gender, it is of interest because the MOMS has been used for 25 years and this is evidence of a loss of information about a sample when using the MOMS factors as opposed to the underlying data. In particular, it should be noted that L2 regularization can accommodate the multicollinearity in MOMS, providing predictions superior to those of linear models using the nine underlying factors in MOMS. For this reason it is recommended that, for any analysis using MOMS data, L2 regularization on the full 56 questions be preferred over unregularized linear analysis based on the nine underlying MOMS factors. As sports psychology utilizes MOMS as a tool, this is a finding with practical applications.
As linear models rely on linear patterns within the training data (or approximations to linear patterns), whilst boosted trees can model more subtle interactions, supplementary investigation was conducted to see if a further boost to accuracy could be obtained by ensembling a contribution from the ridge regression or elastic net model with a GBM. As the accuracy of the elastic net and ridge regression models was approximately equal and the best performing elastic net was an approximation to a ridge regression (alpha close to zero), it was deemed appropriate to select ridge regression (as opposed to the lasso or elastic net) to contribute the linear modelling component of the ensemble models built. As hypothesized, the simple ensemble model (accuracy 0.7159) gave an accuracy improvement over the linear models in this manuscript (highest accuracy 0.6828), as well as being higher than the 0.7134 obtained using GBMs alone in the literature (Walsh et al., 2018). The accuracy of the stacked ensemble with OOS prediction, 0.7236, was higher than that of all other models in this manuscript, as well as higher than the accuracy of any of the models in prior literature (Heazlewood et al., 2016a; Walsh et al., 2018). The accuracy of 0.7236 was only a 1.43% relative improvement over the 0.7134 score in Walsh et al. (2018) and a 12.4% relative improvement over the best accuracy (achieved using neural networks) of 0.644 from Heazlewood et al. (2016a). If improvement in classification error (i.e. the percentage of masters athletes incorrectly classified by gender) is considered, the error improves from 0.2866 (Walsh et al., 2018) and 0.356 (Heazlewood et al., 2016a) to 0.2764, relative reductions of 3.56% and 22.4%, respectively.
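The relative changes can be recomputed directly from the reported accuracy scores. The sketch below assumes that every change is expressed as a fraction of the earlier (baseline) value; the function and variable names are illustrative.

```python
def relative_change(new, old):
    """Change from old to new, expressed as a fraction of the old value."""
    return (new - old) / old

acc_stacked, acc_gbm, acc_mlp = 0.7236, 0.7134, 0.644

acc_gain_vs_gbm = relative_change(acc_stacked, acc_gbm)
acc_gain_vs_mlp = relative_change(acc_stacked, acc_mlp)

# Classification error is one minus accuracy; a reduction is reported as a positive number.
err_cut_vs_gbm = -relative_change(1 - acc_stacked, 1 - acc_gbm)
err_cut_vs_mlp = -relative_change(1 - acc_stacked, 1 - acc_mlp)

for name, value in [
    ("accuracy gain vs GBM", acc_gain_vs_gbm),
    ("accuracy gain vs MLP", acc_gain_vs_mlp),
    ("error reduction vs GBM", err_cut_vs_gbm),
    ("error reduction vs MLP", err_cut_vs_mlp),
]:
    print(f"{name}: {100 * value:.2f}%")
```

Expressing the change in terms of classification error makes the same absolute gain look larger, because the error rates are smaller numbers than the accuracies.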
Although the ensemble models improved accuracy, training such models carries a large computational overhead. The second ensemble required many millions of individual models to be built for grid search tuning, and in the case of GBMs these models were ensembles themselves, containing thousands of decision trees. Tuning required several days of run time, but prediction took only a few seconds. The ensemble models therefore take considerable time to build (both in terms of coding and run time) but are quick to use for prediction once developed, so the long training time does not prevent useful practical applications.
There are a number of techniques that can be adopted to improve model training time, even with extensive grid search methods. Figure 4 demonstrates the relationship between shrinkage and both maximum tree depth and the number of trees when optimizing the Receiver Operating Characteristic (ROC) score. If shrinkage is small, the ROC score increases as the number of trees is increased. This can be seen in the positive gradient of the graphs in the first column of graph panes, where shrinkage is only 0.0010. As shrinkage is increased, models produce higher ROC scores with fewer trees. Once the number of trees has increased over 1000, the gradient of the graphs (with shrinkage above 0.0010) is negative instead of positive. This can be seen for graphs with shrinkage values of 0.0075 and above, where the graphs give progressively lower ROC scores as the number of trees is increased, and this is more dramatic for higher shrinkage values. A similar trend can be seen with tree depth. At lower levels of shrinkage, such as 0.0010, higher ROC scores are obtained with deeper trees (for example a tree depth of eight, as opposed to a tree depth of two). As shrinkage increases this pattern progressively reverses, such that once shrinkage is 0.0150, higher ROC scores result when the trees are shallower (e.g. a maximum tree depth of two). This inverse relationship can be used to reduce training times in a number of ways. A simple strategy is to hold one hyperparameter constant, for example shrinkage at 0.01, and tune only tree depth and the number of trees. Eliminating variability for one hyperparameter effectively removes one dimension from the grid search and can dramatically reduce training time. For example, if the hyperparameter grid would have had ten values of shrinkage, training with one value would require a grid search ten times smaller and be approximately ten times faster.
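The effect of fixing one hyperparameter on the size of the grid can be illustrated with a small enumeration. The grid values below are hypothetical and chosen only to show the arithmetic; they are not the grid used in the study.

```python
from itertools import product

# Hypothetical grid values for illustration only.
shrinkage = [0.001 * k for k in range(1, 11)]   # ten candidate learning rates
depth = [2, 4, 6, 8]
n_trees = [500, 1000, 1500, 2000]

full_grid = list(product(shrinkage, depth, n_trees))
fixed_grid = list(product([0.01], depth, n_trees))   # shrinkage held constant

print(len(full_grid), len(fixed_grid), len(full_grid) // len(fixed_grid))  # 160 16 10
```

Because every grid point is refitted on each cross-validation resample, the tenfold reduction in grid points carries through to roughly a tenfold reduction in the number of models trained.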
There is far more potential for meaningful applications with a predictive accuracy of 0.7236 than with 0.644 (as per Heazlewood et al. (2016a)); however, increasing accuracy further would increase this scope. Some consideration should be given to other techniques that could be used to further improve on these results. The simplest option is the collection of additional training data. It should also be noted that there are many methods for building an ensemble. Building models with OOS predictions can be extended so that each model has OOS predictions from many other models, not just one. Additionally, the training data can be split into many subsets, as opposed to just two. In data modelling competitions it is common to build many complex ensembles using different methods and then in turn ensemble these separate ensembles to maximize predictive accuracy. It should therefore be considered that it may be possible to get a small uplift by combining other diverse models, such as neural networks, into the ensemble. Given that the elastic net (incorporating a blend of L1 and L2 penalties) did not outperform L2 regularization alone, it would be appropriate to investigate alternative regularization penalties, as such penalties have outperformed L2 regularization alone in some other studies (Lipovetsky, 2010; Asar & Genç, 2017). If predictive accuracy scores from linear based models could be further increased with such alternative penalties, this might also further improve the accuracy of the ensembles produced.
Neural networks (in the form of a multilayer perceptron) proved to be the most accurate approach in Heazlewood et al. (2016a). Modern neural networks, such as deep neural networks (with many hidden layers), are highly successful machine learning algorithms, particularly for image processing. Neural networks were not chosen as an optimal algorithm here (compared to boosted trees) because the amount of data was considered too small, relative to that in images or video, to take advantage of their modelling capabilities. Although they may not perform as well as boosted trees with such a small amount of training data, they might be able to contribute as one of many predictors in an ensemble model. As well as considering how to improve predictive accuracy, future research should look at how well this predictive model works when applied to other populations that have completed the MOMS. It would be interesting to see how well predictions hold if applied to other masters athlete populations, as well as to athletes who are not masters athletes but have completed the MOMS. Similarly, other variables may be collected when the MOMS is administered, and it would be of interest to see how well the ensemble methods developed could be translated to the prediction of those variables. It might be possible to hypothesize further insights about those completing the MOMS beyond gender classification.
Conclusion
Implementing modern regression methods with regularization resulted in substantial improvements in classification accuracy based upon gender compared to prior research utilizing neural networks, logistic regression and discriminant function analysis, but not compared to boosted trees such as GBMs. Models that were solely or partially based on L2 regularization performed better than those that relied solely or primarily on L1 regularization, most likely due to multicollinearity within the data. Predictive accuracy for regularized linear models did not exceed that produced in prior research using GBMs. Prediction utilizing raw psychological Likert questionnaire data gave better predictive accuracy than predictions based on linear models from prior research using factors, even when no regularization was incorporated. Although the MOMS factors were not designed for predicting gender classification and have been validated in the literature, this does imply that researchers using the MOMS should consider working with the raw data under L2 regularization as opposed to the nine underlying factors of the MOMS scales.
Ensemble methods stacking ridge regression and GBMs with OOS prediction, as hypothesized, further improved accuracy, giving higher accuracy scores (0.7236) than obtained in any preceding literature. This manuscript demonstrates the potential benefits of such an ensemble approach in terms of improved model accuracy. Developing models with improved accuracy increases the likelihood of developing meaningful practical applications from predictions using MOMS psychometric data.
Acknowledgments
The authors would like to thank the 3,928 masters athletes who gave their time to help by completing the 56 MOMS survey questions. Without their assistance, this research would not have been possible. This paper builds on initial model predictions using MLP, RBF, discriminant function analysis and logistic regression first presented at the 10th International Symposium on Computer Science in Sport (ISCSS 2015), September 09–11, 2015, Loughborough, UK. Helpful comments and insights from other researchers at the conference were greatly appreciated and motivated this further investigation.
