Abstract
Consumer IncomeView is the Equifax next-generation income model that estimates a consumer’s individual annual salary/wage income. This model provides an optimal income solution to our clients by incorporating premier multi-source data assets and advanced machine learning modeling techniques. As a result, the new model significantly increases the scorable population rate compared with the predecessor models and significantly improves the prediction accuracy. The results of Consumer IncomeView have been successfully validated with an array of new proprietary accuracy metrics for model performance measurement, on both internal out-of-time data and clients’ data. This paper presents the design, development, methodology and main results of the Equifax Consumer IncomeView model.
Introduction
Understanding a consumer’s income level strengthens customer relationships across the entire account lifecycle (Arslan & Karan, 2010). The primary sources of income data in the industry are verified income from real estate-based loan applications, unverified income from other loan applications, and employer direct deposits into checking accounts. However, many existing products use unverified income data that have been self-reported through consumer surveys or government censuses, which generally do not yield satisfactory results (Moore et al., 2000). Accurate modeled income estimates are therefore needed to meet market demand. In 2008, Equifax launched the first generation of the Personal Income Model (PIM) in the U.S. market, which provided in-depth income insights to identify the best places to deploy key resources (
More recently, Equifax developed and implemented a next-generation income model for the U.S. market, the Equifax Consumer IncomeView (
Methodology
Modeling data: Sources and validation
Equifax Workforce Solutions (EWS) is a subsidiary of Equifax that provides employment and income verifications on over 6,000 U.S. employers, including 75% of the Fortune 500 and Fortune 1000 companies maintained in The Work Number
All the selected Equifax attribute suites in the model contain attributes for use in model and criteria development. Specifically, all are available for use in FCRA-permitted prescreen or account management modeling programs, criteria development and portfolio analysis. They are developed using the Equifax consumer credit file.
The Advanced Decisioning Attributes provide extensive coverage of many credit analysis areas – depth of file to account activity, balance to high credit, recent to worst account status. Type-specific attributes are available for the general types: installment, retail, and revolving. Further industry-specific attributes are available for the following: auto, bankcard, credit union, department store, mortgage, personal finance and student loan.
The Mortgage CMA attributes provide extensive coverage of many mortgage credit analysis areas – depth of file to account activity, balance to high credit, recent to worst account status. Type-specific attributes are available for: first mortgage, home equity loan, home equity line, and non-HELOC revolving.
The Equifax Dimensions Attributes provide extensive coverage of many credit analysis areas. Equifax raw data fields are aggregated across accounts and trended for 24 months to provide a complete and historical view of a consumer’s credit behavior, such as transactor and revolver behavior types, balance direction, payment behavior, periodic spending, and activation. Type-specific attributes are available for the general types: installment, retail, and revolving. Further industry-specific attributes are available for the following: auto, bankcard, credit union and mortgage
Dependent variable
The dependent variable was the annualized income in dollars verified by The Work Number (TWN), i.e. annualized individual salary/wage before tax. Several data exclusion filters were applied: individuals with outlier incomes that could not be validated were excluded, as were consumers who were retired, restricted, deceased or surviving spouses, and records that had recently been added to the TWN database without payroll history. Consumers with no credit activity within the past 24 months, or with outdated records, were also excluded. Multiple transformation schemes, such as the Box-Cox power transformation, were explored for the dependent variable, and internal research showed that the log transformation provided the best in-sample fit. Due to the nature of the income distribution, the assumption that earnings are log-normally distributed is widely accepted (Drăgulescu & Yakovenko, 2001).
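The relationship between the Box-Cox family and the log transformation can be illustrated with a minimal sketch. The function names and the example income of 52,000 dollars are illustrative assumptions, not part of the production model; the sketch only shows that the log transform is the Box-Cox case with power parameter zero and that a prediction on the transformed scale can be mapped back to dollars.

```python
import math

def box_cox(income, lam):
    """Box-Cox power transform; lam == 0 gives the log transform."""
    if income <= 0:
        raise ValueError("income must be positive")
    if lam == 0:
        return math.log(income)
    return (income ** lam - 1.0) / lam

def inverse_box_cox(value, lam):
    """Map a transformed value back to the dollar scale."""
    if lam == 0:
        return math.exp(value)
    return (lam * value + 1.0) ** (1.0 / lam)

# Illustrative (hypothetical) annual income in dollars.
log_income = box_cox(52_000.0, 0)          # target on the log scale
restored = inverse_box_cox(log_income, 0)  # back to dollars
```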
Independent variables
Prior to model parametrization, the following variable treatment and selection steps were performed:
Apply standard data cleansing procedures to the sourced data; impute missing values and apply capping/flooring; and perform exploratory data analysis to understand the stability and predictive power of each attribute (not shown).
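As a rough illustration of the imputation and capping/flooring steps, the sketch below fills missing values, records a missing indicator, and clips extreme values. Median imputation and the specific floor and cap values are assumptions made for the example; the treatment actually applied to each attribute is not described here.

```python
import statistics

def impute_with_indicator(values):
    """Replace None with the median of observed values (assumed rule)
    and return a 0/1 indicator marking which entries were missing."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator

def cap_and_floor(values, floor, cap):
    """Clip every value into the closed interval [floor, cap]."""
    return [min(max(v, floor), cap) for v in values]

# Hypothetical attribute with one missing value and one extreme value.
raw = [1.0, None, 3.0, 500.0]
imputed, missing_flag = impute_with_indicator(raw)
treated = cap_and_floor(imputed, floor=0.0, cap=100.0)
```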
In addition, care was taken to avoid high correlation among the independent variables, which can affect model stability. Measures such as coefficient correlation (
Determine the necessity of transformation for each of the independent variables; determine the optimal method for the variable transformation; and enable transformed variables to be included as independent variables in model development.
Other proprietary variable treatments include creating additional derived variables and using interaction terms.
As mentioned in previous sections, all candidate attributes and their missing-value indicators were initially considered as independent variables. Together, these variables were run through computer-aided variable selection and reduction procedures to narrow the candidate set down to a smaller, more manageable list. The variable list was then further refined through several more iterations to ensure that the model worked from both a statistical and a business standpoint. In addition, from a statistical point of view, highly correlated variables were eliminated from subsequent regressions. Finally, variables were tested one at a time to determine the best possible combination of predictive variables.
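One simple way to eliminate highly correlated variables is a greedy filter on pairwise Pearson correlation, sketched below. The 0.9 threshold, the priority ordering and the variable names are hypothetical; the sketch is not a description of the selection procedure actually used.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.
    Assumes neither list is constant (non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.9):
    """Walk the variables in priority order and keep one only if its
    absolute correlation with every already-kept variable is below
    the (assumed) threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical attributes: "b" is almost a multiple of "a".
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],
    "c": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_correlated(columns)
```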
The purpose of segmentation analysis is to determine the possibility, as well as the necessity, of defining homogeneous segments or subgroups in the population that require separate models. If such groupings can be identified, it may be deemed necessary to build separate models for these groups to enhance the overall performance.
A decision tree was used to select the optimal segmentation scheme and splits. The size, significance, complexity and interpretability of various segmentation schemes were evaluated to finalize the segmentation. More than ten different segmentation scenarios were studied; the goal was to separate the most accurate group from the least accurate group so that a different confidence level can be provided for the income estimate in each segment. The best scheme uses two layers of decision trees with different target variables. Research shows that four segments optimize the performance; they are derived from three Equifax ACRO attributes: age of trades, consumer credit capacity, and available credit on revolving accounts.
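A two-layer scheme that maps three attributes to four segments could be structured as in the sketch below. Everything specific in it is hypothetical: the choice of which attribute splits at which layer, the cut-off values, and the segment numbering are placeholders chosen only to show the shape of such a rule, not the splits found by the decision trees.

```python
def assign_segment(age_of_trades_months, credit_capacity,
                   available_revolving_credit):
    """Return a segment number 1-4 from a two-layer rule.
    All thresholds below are hypothetical placeholders."""
    # Layer 1: hypothetical split on age of trades.
    if age_of_trades_months < 60:
        # Layer 2 (left branch): hypothetical split on credit capacity.
        return 1 if credit_capacity < 10_000 else 2
    # Layer 2 (right branch): hypothetical split on available
    # credit on revolving accounts.
    return 3 if available_revolving_credit < 20_000 else 4

# Example: a consumer with an older file and ample revolving credit.
segment = assign_segment(120, 35_000, 50_000)
```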
Modeling methods
More than 120 machine learning modeling techniques were explored to identify the best modeling approaches, with two goals: optimizing both the accuracy of the model estimates and the interpretability of the model. The techniques were selected based on several criteria: computational time, method complexity, additional accuracy lift, and feasibility of implementation. The selected machine learning methods have the least computational complexity yet achieve the maximum accuracy lift on the studied segments. The final income estimate is a combination of the following three modeling approaches (different algorithms used for each segment): linear regression with regularization (Ordinary Least Squares baseline model), Multivariate Adaptive Regression Splines (Friedman, 1991) and multi-layer Neural Networks (Svozil et al., 1997). The performance of each model was evaluated and compared using various proprietary accuracy metrics developed internally.
Multiple linear regression
Multiple linear regression is a proven successful modeling technique designed to model the relationship between a continuous dependent variable
where
Provided that the LASSO parameter
A neural network is a series of algorithms that assembles many “neurons” and outputs a prediction neuron, as shown in Figure 1 below (Matignon, 2005). The leftmost layer of the network is called the input layer, the rightmost layer is called the output layer, and the middle layers of nodes are the hidden layers. In this model, we chose to use two hidden layers to optimize the model prediction accuracy while preventing overfitting.
Neural network model configuration.
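The layered structure described above (input layer, two hidden layers, one output neuron) can be sketched as a plain forward pass. The tanh activation, the layer widths and all weight values are assumptions made for illustration; they are not the fitted network.

```python
import math

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def identity(z):
    return z

def predict(inputs, network):
    """Pass the inputs through each (weights, biases, activation)
    layer in order and return the single output neuron."""
    signal = inputs
    for weights, biases, activation in network:
        signal = dense(signal, weights, biases, activation)
    return signal[0]

# Hypothetical weights: 3 inputs -> 2 hidden -> 2 hidden -> 1 output.
network = [
    ([[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]], [0.1, -0.1], math.tanh),
    ([[0.5, -0.5], [0.3, 0.8]], [0.0, 0.2], math.tanh),
    ([[1.5, -0.7]], [10.8], identity),  # linear output on the log scale
]
log_income_estimate = predict([0.5, 1.0, -1.0], network)
```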
The input neurons take the input
Where
The multivariate adaptive regression splines method constructs nested “hockey-stick” spline basis functions in an adaptive way by automatically selecting appropriate knot values for different variables, and it obtains reduced models by applying model selection techniques (Kuhfeld & Cai, 2013). The method does not assume parametric model forms and does not require specification of knot values. The bases are constructed by using truncated power functions (hockey-stick functions) as follows:
The final income prediction
The spline knot is a key concept, which connects the end of one portion of data and the beginning of another. Figure 2 demonstrates a spline with three knots, and different basis
Spline example with variation of basis.
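A hockey-stick basis pair and a prediction built from such bases can be sketched as follows. The sketch is restricted to a single input variable, and the knot locations and coefficients are hypothetical; it shows only how a piecewise-linear fit changes slope at each knot.

```python
def hinge_pair(x, knot):
    """Return the mirrored hockey-stick bases
    (max(0, x - knot), max(0, knot - x))."""
    return max(0.0, x - knot), max(0.0, knot - x)

def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum of coefficient * hinge.
    Each term is (coefficient, knot, direction), where direction +1
    means max(0, x - knot) and -1 means max(0, knot - x)."""
    return intercept + sum(
        coef * max(0.0, direction * (x - knot))
        for coef, knot, direction in terms
    )

# Hypothetical one-variable model with a single knot at 3.0.
terms = [(2.0, 3.0, +1), (0.5, 3.0, -1)]
right_of_knot = mars_predict(5.0, 1.0, terms)  # 1 + 2.0 * (5 - 3)
left_of_knot = mars_predict(1.0, 1.0, terms)   # 1 + 0.5 * (3 - 1)
```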
Similar to forward selection in a linear regression model, pairs of corresponding basis functions were selected and added to the model. At each step, the pair that resulted in the largest reduction in the residual sum of squares was added. The next phase was backward elimination, which at each step removes the single basis function whose elimination minimizes the generalized cross-validation (GCV) criterion, a function of the residual sum of squares. Backward elimination iterates until all terms except the intercept are eliminated, and the model with the minimum GCV is then chosen. The Adaptivereg procedure was used to fit the final model. Like other nonparametric nonlinear regression procedures, the Adaptivereg algorithm can yield complicated models that involve high-order interactions in which many knot values or subsets are considered. Besides the basis functions, both the forward selection and backward elimination processes are also highly nonlinear. Because of the trade-off between bias and variance, complicated models that contain many parameters tend to have low bias but high variance. To select models that achieve good prediction performance, GCV was used:
where
where
Performance metrics
To assess the performance of Consumer IncomeView, Equifax examined the accuracy of the predicted income using two traditional accuracy metrics, Windowed Percent Error (WPE) and Concordance, and the following new accuracy metrics: One-tail Accuracy, Capture Rate and the Classification metric. These metrics were designed and implemented primarily for various business applications.
One-tail accuracy measures how accurately the model identifies a consumer’s income as higher than $x. For example, if the model estimates that a consumer’s income is
Capture rate answers the question: if a consumer’s true income is higher than $x, what percent of the predicted incomes are higher than $x? It evaluates what percent of the true incomes the model can correctly capture directionally. One-tail accuracy and capture rate are a combined accuracy measurement; they should be considered together as one measurement criterion.
Concordance compares pairs of income records (income1 and income2): a pair is concordant if the predicted incomes are ranked in the same order as the actual incomes. Otherwise, it is discordant. The final concordance measure is expressed as the percentage of correctly ranked pairs of income records, e.g. if the concordance statistic is 70.3%, then 70.3% of the pairs of predictions were rank-ordered correctly.
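The metrics above can be sketched in a few lines. Two points are assumptions rather than definitions taken from this paper: WPE is interpreted here as the share of predictions falling within a plus or minus 20% window of the actual income (WPE20), and pairs with tied actual incomes are skipped in the concordance count. One-tail accuracy is computed among records predicted above $x, and capture rate among records whose actual income is above $x. The four income records are made-up values in thousands of dollars.

```python
from itertools import combinations

def wpe(actual, predicted, window=0.20):
    """Assumed WPE: share of predictions within +/- window of actual."""
    hits = sum(1 for a, p in zip(actual, predicted)
               if abs(p - a) <= window * a)
    return hits / len(actual)

def one_tail_accuracy(actual, predicted, x):
    """Of records predicted above x, share whose actual is above x."""
    flagged = [a for a, p in zip(actual, predicted) if p > x]
    return sum(1 for a in flagged if a > x) / len(flagged)

def capture_rate(actual, predicted, x):
    """Of records whose actual is above x, share predicted above x."""
    above = [p for a, p in zip(actual, predicted) if a > x]
    return sum(1 for p in above if p > x) / len(above)

def concordance(actual, predicted):
    """Share of pairs whose predictions are ranked like the actuals.
    Pairs with tied actual incomes are skipped (assumption)."""
    concordant = total = 0
    for (a1, p1), (a2, p2) in combinations(zip(actual, predicted), 2):
        if a1 == a2:
            continue
        total += 1
        if (a1 - a2) * (p1 - p2) > 0:
            concordant += 1
    return concordant / total

# Made-up incomes in thousands of dollars.
actual = [30, 50, 80, 120]
predicted = [33, 70, 65, 100]
wpe20 = wpe(actual, predicted)                    # 3 of 4 within 20%
one_tail = one_tail_accuracy(actual, predicted, 60)
captured = capture_rate(actual, predicted, 60)
concord = concordance(actual, predicted)          # 5 of 6 pairs
```

Reading one-tail accuracy and capture rate together, as the text recommends, mirrors the usual trade-off between being right when the model says "above $x" and finding all consumers who truly are above $x.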
Income distribution
Consumer IncomeView outputs individual income scores in the range of 20 to 300, in units of one thousand dollars. Based on the out-of-time validation samples, Figure 3 compares the distributions of predicted income vs. actual income in vingtiles. The median incomes estimated by Consumer IncomeView correspond very closely with the median actual incomes.
Predicted vs. Actual income distribution comparison.
Compared with the older Equifax income model PIM3, Consumer IncomeView shows a significant accuracy lift at both the overall and segment levels. WPE20 accuracy is 67% for segment #1, 38.7% for segment #2, 29% for segment #3, and 29.8% for segment #4. The model tends to achieve better accuracy on lower incomes, where there is less variation in income.
Figure 4 shows the overall accuracy and scorable lift that Consumer IncomeView provides. The incremental WPE20 accuracy lift is 31%; Equifax proprietary data and advanced machine learning modeling techniques on a big data platform are the main drivers of this lift. The incremental scorable rate is 10%, which enables about 26 million additional consumers to be scored and generates incremental revenue for end product users.
Overall model results.
Figure 5 shows the One-tail and Classification Accuracy of the Consumer IncomeView. Compared with the general U.S. population distribution, the new solution significantly improves the One-tail (upwards) and classification accuracy. For instance, the One-tail has a
One-tail and classification accuracy.
Furthermore, we can calculate the area under the two classification accuracy curves (random population vs. Consumer IncomeView) to obtain the overall classification accuracy across the full income range from $20K to $300K. Compared with the benchmark random U.S. population (without a predictive model), the Consumer IncomeView has
Finally, the concordance statistic of Consumer IncomeView was also evaluated. When our concern is the overall rank-ordering rather than individual income estimates, the nonparametric concordance metric can assess the overall model performance. Compared with the PIM3 model, Consumer IncomeView significantly improved the concordance statistic, from 67.7% to 71.2%, a 5.2% relative lift.
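The reported lift is a relative change, not a difference in percentage points, which a short calculation makes explicit. The helper name is illustrative; the two concordance values are those quoted above.

```python
def relative_lift(old, new):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0

# Concordance moved from 67.7% (PIM3) to 71.2% (Consumer IncomeView).
point_gain = 71.2 - 67.7            # about 3.5 percentage points
lift = relative_lift(67.7, 71.2)    # about 5.2 percent
```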
In this paper we describe the development of the Equifax Consumer IncomeView model. This enhanced solution was built on the enriched Equifax proprietary consumer credit attributes, including the powerful newly developed trended credit attributes, which feature premier monthly consumer credit data with up to 24 months of extended financial account history. Compared with the predecessor PIM3 model, Consumer IncomeView significantly improves the overall WPE20 and expands the scorable population. When measured by the innovative One-tail and classification metrics, the new model also outperforms the PIM3 model by a significant margin. The additional multi-source data assets and machine learning modeling techniques contributed most of this substantial performance improvement.
Consumer IncomeView has also been validated through both in-time and out-of-time validation. The segment distribution is almost the same as in the model development sample, i.e. the two-layer segmentation scheme validates well. WPE20 validates solidly on out-of-time data, both by segment and overall, and the one-tail and classification accuracy also hold very well on out-of-time validation (not shown).
