Abstract
When both selection quality and diversity are important goals, a selection system is Pareto-optimal (PO) when its implementation is expected to result in an optimal balance between the levels achieved with respect to both these goals. The study addresses the critical issue of whether PO systems, as computed from calibration conditions, continue to perform well when applied to a large variety of different validation selection situations. To address this issue, we introduce two new measures for gauging the achievement of these designs and conduct a large simulation study in which we manipulate 10 factors (related to the selection situation, sensitivity/robustness, and the selection system) that culminate in a design with 3,888 cells and 24 selection systems. Results demonstrate that PO systems are superior to other, non-PO systems (including unit weighed system designs), both in terms of the achievement measures and in terms of more often yielding a better quality/diversity trade-off. The study also identifies a number of conditions that favor the achievement of PO systems in realistic selection situations.
Selection system design requires a number of decisions, including the type and number of predictors that will be used, the selection rule, the sequencing and weighing of the predictors, as well as the between-stage retention rates that will be implemented in case of multistage selection. To assist in making these decisions in cases that value both the quality (i.e., the expected job performance) and the diversity of the selected applicant group, De Corte, Sackett, and Lievens (2011) proposed a decision aid for identifying Pareto-optimal (PO) selection designs. As the concept of Pareto-optimality is relatively new to the psychological literature, it might be confusing to someone who infers that the concept refers to a single optimal solution. Rather, there is a PO solution for every attainable level of diversity; that is, it is the system that produces the highest level of expected performance among systems producing that specific level of diversity. The result is a set of PO solutions, commonly referred to as a “Pareto front,” ranging from a performance-maximizing solution to a diversity-maximizing solution. The argument is that PO solutions should be preferred to non-PO solutions. However, the choice among PO solutions is a value judgment, rather than a technical problem, as it depends on the relative value the organization assigns to the performance and diversity objectives.
Recently, Cortina, Aguinis, and DeShon (2017) reviewed key methodological developments in the last century and listed the PO approach as one of the methodological approaches in the last 10 years that has key applied implications for dealing with subgroup differences in personnel selection (see also the large-scale reviews of Bobko & Roth, 2013; Ryan & Ployhart, 2014). Although we agree that the PO decision aid offers a sound, psychometrically based contribution to the selection design problem, some critically important issues remain unresolved. At present, the PO approach has been primarily examined with meta-analytic input data. This is problematic for two key reasons. First, the model and the assumptions that drive the calculation of the expected selection outcomes will virtually never hold perfectly in the actual selection domain. Second, the PO decision aid uses data on the composition of the applicant pool, as well as on the effect sizes, the validities, and the intercorrelations of the predictors that are approximate at best. For example, little is known about how results of PO systems are affected if the applicant pool is substantially smaller and different in composition than expected, fewer minority candidates are retained, and/or their scores on the selection procedures are differently distributed than assumed.
Due to these key unresolved issues, we do not know the impact of (a) violations of the model assumptions and (b) deviations from the input data on the PO results obtained. We are also in the dark regarding which of these factors might have the most impact on the results. Our purpose is to provide insight into the value of PO selection system design in a large variety of selection conditions. We address robustness (related to the assumption violations), sensitivity (related to the input data deviations), and sampling variability issues.
The structure of this article is as follows. After providing an overview of previous developments on PO selection design, motivating the key research issues of the article and summarizing prior research (e.g., Song, Wee, & Newman, 2017), we propose new measures and a novel methodology for gauging the achievement of PO selection designs. This methodology is subsequently implemented within a factorial design to study the achievement of various PO designs when applied to a variety of validation settings that all differ from the calibration conditions (i.e., the assumptions and the predictor/criterion effect size and correlation data) used in deriving the PO systems. We also report on the relationship between the achievement level and the key dimensions that differentiate between the calibration conditions and the validation conditions that characterize the selection settings. Finally, we compare the achievement of PO designs in a large variety of selection settings to that of other design choices (e.g., unit weighed predictor composites).
In our study, we use simulation methods rather than analyzing real data sets or focusing on any one actual intervention implementing the principles of PO selection design. This is a deliberate choice, based on two considerations. First, real data sets, including not only the predictor but also the criterion scores of the entire applicant pool, are seldom if ever available in a real selection context. Second, our objective goes beyond an assessment of the achievement of PO selection designs in a particular context. Instead, our aim is to shed light on the achievement of these systems in a wide variety of selection contexts. Simulation methods fit this aim much better than one single selection case.
Assessing the Achievement of PO Selection Systems
PO Selection Systems: A Brief Tutorial
For selection applications where the goals of selection quality and diversity are both of importance, De Corte et al. (2011) proposed a psychometrically based decision aid for rational selection design that results in selection systems that offer an optimal balance (i.e., a PO trade-off) between the two valued goals. The decision aid conceives the shaping of a selection process as a series of mutually dependent decisions that define the resulting selection systems as particular sets of concrete choices with respect to the (a) predictor subset, (b) selection rule, (c) predictor staging, (d) predictor sequencing, (e) predictor weighing, and (f) between-stage retention rates that will be implemented during the selection process.
To derive the PO selection systems, the decision aid proceeds in two steps. The first step, the inventory stage, consists of identifying the set of selection systems that are feasible within constraints that govern the planned selection process. Constraints may include limits on selection costs and limits on the number of stages in a selection system, among others. In the second step, the computational stage, the decision aid computes from this set the subset of selection systems that are PO with respect to the selection diversity and quality goals.
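The second, computational step can be illustrated with a minimal sketch. It assumes that the feasible systems have already been reduced to a finite list of (quality, diversity) trade-offs, which is a simplification: the decision aid itself searches over continuous predictor weights and retention rates. The function name `pareto_front` and the trade-off values are hypothetical and are not taken from the article.

```python
from typing import List, Tuple

Tradeoff = Tuple[float, float]  # (quality, diversity); both are to be maximized


def pareto_front(points: List[Tradeoff]) -> List[Tradeoff]:
    """Return the trade-offs that are not dominated by any other trade-off."""
    front = []
    for q, d in points:
        # A point is dominated if another point is at least as good on both
        # goals and strictly better on at least one of them.
        dominated = any(
            (q2 >= q and d2 >= d) and (q2 > q or d2 > d)
            for q2, d2 in points
        )
        if not dominated:
            front.append((q, d))
    # Order from the quality-maximizing end to the diversity-maximizing end.
    return sorted(set(front), key=lambda p: -p[0])


# Hypothetical trade-offs of five feasible systems (illustrative values only).
feasible = [(1.30, 0.08), (1.25, 0.13), (1.09, 0.13), (1.10, 0.18), (0.80, 0.10)]
front = pareto_front(feasible)
print(front)
```

In this toy list, the systems with trade-offs (1.09, 0.13) and (0.80, 0.10) are dropped because another system offers at least the same diversity at a higher quality.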
These computations are based on the model proposed by De Corte, Lievens, and Sackett (2007) for gauging the quality and diversity outcome value of selection systems, and the formulae invoked by the model depend on essentially three assumptions: (a) the joint distribution of the selection predictors and the job performance criterion is multivariate normal in both the majority and the minority applicant population, 1 (b) the initial applicant pool is a mixture of infinite size of both applicant populations, and (c) a top-down selection rule (without applicant drop out) applies. In addition, the model calculations require data on the validity, the intercorrelation, and the effect size of subgroup differences of the available predictors as well as data on the final selection rate and the majority/minority composition of the total applicant pool. Henceforth, the assumptions, together with the input data used in the calculations, are referred to as the set of calibration conditions from which the results of the decision aid are derived and the symbol Cc will be used to denote the set.
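To make the role of assumptions (a) to (c) concrete, the following sketch computes the expected quality and the minority selection rate for the simplest possible case: single-stage, top-down selection on one predictor. It is not the De Corte et al. (2007) model, which handles composites and multiple stages. The function name, the parameter names, and the validity and effect size values are assumptions made for illustration; only the 0.20 selection rate and the 20% minority share echo the example situation in the text. Scores are assumed to be standardized in the majority population with unit variance in both groups.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def single_stage_outcomes(validity, d_pred, d_crit, p_minority, selection_rate):
    """Expected criterion mean of the selected group and minority selection rate.

    Assumes an infinite mixture of two normal populations; the minority means
    are -d_pred (predictor) and -d_crit (criterion), the majority means are 0.
    """
    def overall_rate(cutoff):
        majority = 1.0 - STD_NORMAL.cdf(cutoff)
        minority = 1.0 - STD_NORMAL.cdf(cutoff + d_pred)
        return (1.0 - p_minority) * majority + p_minority * minority

    # Bisection for the common cutoff; overall_rate decreases in the cutoff.
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if overall_rate(mid) > selection_rate:
            lo = mid
        else:
            hi = mid
    cutoff = (lo + hi) / 2.0

    groups = [(1.0 - p_minority, 0.0, 0.0), (p_minority, -d_pred, -d_crit)]
    total_selected, total_criterion, rates = 0.0, 0.0, []
    for share, mean_x, mean_y in groups:
        z = cutoff - mean_x
        rate = 1.0 - STD_NORMAL.cdf(z)
        # Mean criterion score of those above the cutoff (truncated normal).
        mean_selected = mean_y + validity * STD_NORMAL.pdf(z) / rate
        total_selected += share * rate
        total_criterion += share * rate * mean_selected
        rates.append(rate)
    return total_criterion / total_selected, rates[1]


# Hypothetical predictor data; selection rate and pool composition as in the text.
quality, minority_rate = single_stage_outcomes(
    validity=0.5, d_pred=0.7, d_crit=0.3, p_minority=0.2, selection_rate=0.2
)
```

With a nonzero predictor effect size, the minority selection rate falls below the overall selection rate, which is the tension between quality and diversity that the decision aid tries to balance.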
To illustrate, consider the example situation, henceforth referred to as situation S0, where the first, inventory stage results in considering the following five predictors for selecting with a 0.20 selection rate in an applicant pool consisting of 80% candidates from the majority and 20% applicants from the minority population: (a) a cognitive ability (CA) test, (b) a structured interview (SI), (c) a conscientiousness (CO) measure, (d) a biographical inventory (BI), and (e) an integrity test (IN). In the inventory stage it is further decided that only three different selection scenarios are feasible: (a) a single-stage scenario in which the final accept/reject decision is based on a weighed composite of the CA, CO, BI, and IN predictors; (b) a two-stage scenario where the candidates are first screened on the basis of a weighed composite of CA, CO, and BI, and the remaining candidates (anywhere between 35% and 60% of the initial number of applicants) are selected using a weighed composite of the SI and IN predictors; and (c) a three-stage scenario where the intermediate retention decisions involve top-down selection on a CA and IN composite (retaining anywhere between 60% and 75% of the candidates) and a CO and BI composite (retaining 35% to 45% of the initial candidates), for the first and the second stage respectively, and the SI predictor is used in the final selection stage. Finally, suppose that the inventory stage also leads to the decision that the predictors may have weights between 0 and 1 when forming the predictor composites; a decision that implies that several additional scenarios, such as, for example, a two-stage scenario using only the CA predictor and the SI predictor, are also feasible.
Given the above detailed situation S0, and using data estimates on the applicant group composition, the predictors and the criterion (the example uses the predictor/criterion data values displayed in Table 1, Selection Environment 3), the decision aid next proceeds by computing, over all feasible selection systems, the subset of systems that are PO with respect to the selection diversity and quality goals. Panel A of Figure 1 portrays the results of this second step, using the expected job performance of the selected applicants (expressed in standard score units) and the selection ratio in the minority applicant group as gauges for the selection quality and diversity goal, respectively. The upper bold line in Panel A represents the set of PO goal trade-offs (i.e., the PO trade-off curve or Pareto front), whereas the area enclosed by the upper and lower (orange) lines depicts the entire gamut of achievable quality/diversity trade-offs. The figure in Panel A also represents a number of particular PO trade-off points (i.e., the points P1 to P4) on the PO trade-off curve. It is of key importance to note that these PO points not only correspond to a particular value for the quality/diversity trade-off, but are each also associated with a particular selection system. For example, PO trade-off point number 2 (point P2 on the figure) is associated with a two-stage selection system in which the first stage selection, retaining 60% of the candidates, is based on a weighed composite of the CA and CO predictors (with weights equal to 0.707 and 0.687); whereas a composite of the SI and the IN predictors (with weights equal to 0.677 and 0.750) is used in the final selection stage.
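The association between a PO trade-off point and a concrete selection system can be made explicit by writing the system down as data. The sketch below encodes the system behind point P2 as described above. The class names and the field `cumulative_retention` are hypothetical, and the value 0.20 for the final stage assumes that the overall selection rate of the example situation applies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    weights: Dict[str, float]     # predictor label -> composite weight
    cumulative_retention: float   # share of the initial pool kept after this stage


@dataclass(frozen=True)
class SelectionSystem:
    stages: List[Stage]

    def predictors(self) -> List[str]:
        """Predictors that receive a nonzero weight, in order of use."""
        return [name for stage in self.stages
                for name, weight in stage.weights.items() if weight > 0]


# System associated with PO trade-off point P2 (weights as reported in the text).
p2 = SelectionSystem(stages=[
    Stage({"CA": 0.707, "CO": 0.687}, cumulative_retention=0.60),
    Stage({"SI": 0.677, "IN": 0.750}, cumulative_retention=0.20),
])
print(p2.predictors())
```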
Predictor/Criterion Data for the Three Types of Selection Environment.
a d corresponds to the standardized mean difference between the majority and the minority applicant populations.
b This d value seems low, but it is the value mentioned by Johnson et al. (2004).

Quality/diversity trade-offs achieved by various selection systems for situation S0 under calibration conditions Cc (Panel A) and validation conditions Cv (Panel B). Under Cc the systems with trade-offs P1 to P4 are Pareto-optimal (100% calibration quality achievement); the systems with trade-offs W1 to W4 are the worst possible (0% calibration quality achievement); and the systems with trade-offs S1 to S4, F1 to F4, and T1 to T4 have 75%, 50%, and 25% calibration quality achievement, respectively. The systems with trade-offs U1 to U4 correspond to unit weighed selection systems. In Panel B, the same symbols identify the trade-offs achieved by the systems under conditions Cv, whereas the symbols B1 to B4 and W1 to W4 show the best possible and the worst possible corresponding trade-offs that can be achieved under Cv.
Besides the PO trade-offs on the upper curve, Panel A of Figure 1 also displays five additional sets of trade-offs that all have the same diversity value as one of the PO trade-offs on the upper curve, but are inferior in terms of the quality value. Exploring these will be a major component of this article: We will generate selection systems that are inferior to PO systems when assumptions are met, and then examine the achievement of these PO and inferior systems when assumptions are violated.
The trade-offs on the lower curve (labeled with the letter Z) represent the worst possible trade-offs, whereas the trade-offs labeled with the letter U refer to trade-offs associated with selection systems in which any predictor that is assigned a nonzero weight in a selection system is given a weight of one. These fixed weight systems are henceforth referred to as unit weighed systems and they reflect the practice of using unit weighting to either the totality or a subset of the available predictors. So, unit weighed systems do not necessarily assign a weight of one to each predictor in the composite, but for each composite at least one of the predictors has a weight of one. The unit weighed system U1, for example, refers to a three-stage selection system with weights one and zero for the first stage predictors CA and IN respectively, weight one for both the second stage predictors CO and BI, and weight one to the third stage predictor SI. Finally, the trade-offs of the remaining three sets (i.e., the sets T1 to T4, F1 to F4, and S1 to S4) 2 correspond to feasible selection systems that are characterized by a quality trade-off value of a given fixed percentage as compared to the quality value achieved by the PO system that shows the same diversity trade-off value. These reflect 25% (T), 50% (F), and 75% (S) of the quality achieved via the PO system.
In summary, the application of the decision aid determines which of the feasible selection systems are PO and which are non-PO. Also, only PO systems should be implemented because all other systems (e.g., the U systems or the T, F, and S systems in Figure 1) result in a quality/diversity trade-off that can be bettered by a PO system (e.g., the system U2 is bettered by both P2 and P3). The decision aid does not indicate which PO system is to be preferred. As noted by De Corte et al. (2011, p. 913), the final decision in favor of a particular PO system calls for “a value judgment on the particular kind of balance between selection quality and work force diversity one is aiming at.”
Key Research Issues
The results of the decision aid are all dependent on the validity of the calibration conditions Cc. Yet, there is no doubt that these conditions will rarely, if ever, correspond to the unknown conditions that characterize the real selection situation. So, although input predictor, criterion, and applicant data values might come from a prior local validity study or from generalized validity evidence (e.g., transporting validity from a closely related setting or meta-analytic findings), they might at best approximate the values that will be found in the actual selection situation of interest. In addition, recruitment efforts may in real situations result in a size and a composition of the applicant pool such that different retention rates and a different selection rate than the rates initially used in deriving the PO systems must be applied to obtain the required number of selected candidates. Finally, it will almost surely be the case that the majority and minority candidates in the applicant pool will not represent samples from a multinormal distribution with mean and correlation structure values as assumed under Cc, but rather come from a possibly nonnormal distribution with a different mean and correlation structure.
So, the PO selection systems identified by the decision aid correspond to calibration conditions Cc that at best approximate the typically unknown conditions that characterize the actual selection application. Denoting the actual prevailing conditions, henceforth also referred to as the validation conditions, as Cv, the key issue then becomes how the achievement of the PO selection systems, as computed under the calibration conditions Cc, evolves when these systems are implemented for a selection application where the validation conditions Cv apply. Also, it is equally important to assess the two major types of circumstances that may impact the achievement level of the PO selection systems: (a) the nature of the selection environment and the calibration conditions Cc the systems are computed from and (b) the features that differentiate between the calibration conditions Cc and the actually prevailing validation conditions Cv. Finally, it is also worthwhile to consider whether the achievement in the validation conditions varies across the range of PO systems and to study the possibly different impact on the achievement in the validation condition of PO as compared to non-PO systems such as the unit weighed systems.
As a consequence, the first key research issue of the article focuses not only on the achievement level of PO systems when applied in a large variety of validation settings, but also on the relative impact on the achievement level of (a) the nature of the selection environment and the calibration conditions the systems are computed from and (b) the features that differentiate between the calibration and the validation conditions. As a second research issue, the article compares the achievement, across various validation circumstances, of PO selection systems to the achievement of non-PO systems, including unit weighted systems. Together, these issues speak not only to the robustness to violations of the distributional assumptions and the sensitivity to variability in the input data of PO selection system design, but also to the level of robustness and sensitivity of these systems compared with other non-PO systems. Finally, by considering actually realized applicant pools as finite-sized samples obtained under the validation conditions Cv, we address the issue of sampling variability in the PO systems' achievement as well.
Previous Related Research
In general, prior related research focused on comparing the quality/diversity trade-off of PO and unit weighed selection systems, as computed from calibration conditions Cc, to the trade-off achieved by the systems when applied in validation settings.
By and large, the previous studies reported rather favorable results on the achievement of PO selection systems when applied in new, validation settings. De Corte et al. (2011) found that in these new settings both PO and unit weighed systems continue to outperform their dominated systems to a fairly similar degree. Also, evaluated at identical diversity levels, PO systems, computed from calibration sample predictor/criterion data, maintained on average a high quality level relative to the average quality achieved by the corresponding PO systems as derived from the population validation data. Wee et al. (2014) concluded that the average gain in the diversity objective, when using the PO system with the same quality level as the unit weighed system, remains substantial across different validation settings. The average gain was also quite stable across different levels of sampling variability in the validation predictor/criterion data. Finally, Song et al. (2017) observed that validity shrinkage in the validation setting is fairly negligible when the PO systems are computed from calibration data obtained from samples of at least 100. Diversity shrinkage is more pronounced for samples of the same size, however, especially when some of the selection predictors show small effect sizes as is illustrated in Figure 2 of Song et al. by the considerably larger shrinkage along the diversity axis as compared to the shrinkage along the validity axis. The shrinkage also relates to the type of PO system: PO systems that give priority to the quality objective are more prone to validity shrinkage, whereas PO systems that favor the diversity objective show more diversity shrinkage. Finally, even accounting for the shrinkage observed for PO systems computed from small sample predictor/criterion data, these systems still offered potential for diversity/validity improvements over unit weighted selection systems.

Average diversity/quality shrinkage in a validation condition involving the applicant population.
Although previous studies suggest that PO systems may compare favorably to other selection system designs, further research is needed for several reasons. First, the previous studies do not cover the robustness issue and are all limited to the situation where the initial (calibration) and the new (validation) setting differ only in terms of the predictor/criterion correlation and effect size data, thereby neglecting the common instance where the calibration and the validation setting also differ in terms of the selection rate and the size and composition of the applicant pool. Second, thus far only single-stage selection systems have been investigated. Third, and even more importantly, all previous results are tentative at best because they relate to situations where the validation setting involves the applicant population instead of samples of limited size from this population. 3 Yet, as argued by Cattin (1980), personnel selection researchers and practitioners are essentially interested in how well selection systems derived in the calibration condition will perform in new, limited sized validation sample conditions. Finally, all previous research fails to address the question whether the quality/diversity trade-off of the PO systems achieved in the validation condition continues to compare favorably to the diversity/quality trade-offs that are at all possible in the validation situation. 4
Figure 2 is particularly helpful to explain the latter issue. At the same time, the figure illustrates the main difference between the approach of Song et al. (2017) and the one adopted in the present article to evaluate the validation potential of calibration-based PO systems. The figure relates to a single-stage selection situation (using a weighed composite with nonnegative weights of the five predictors of Selection Environment 3 detailed in Table 1) with a 0.15 selection rate and a 0.167 proportion of minority candidates in the applicant population. The vertical and horizontal axes of the figure correspond to the quality objective (operationalized as the predictor composite validity) and the diversity objective (measured as the minority applicant selection rate), respectively. The points
Whereas Song et al. (2017) focus on diversity (quality) shrinkage to assess the validation potential of the PO systems, the present approach proposes comparing the validation diversity/quality trade-off of the PO systems to the set of trade-offs that are at all possible in the validation condition. The area enclosed by the solid line contour in Figure 2 represents the latter gamut of possible diversity/quality trade-offs, and the average diversity/quality trade-offs achieved by the PO systems in the validation condition (i.e., the trade-offs
The next section further develops the basic idea underlying the present approach for gauging the validation potential of calibration-based selection systems. These developments lead to new measures for quantifying the achievement of PO and other selection system designs when applied in validation settings involving applicant groups of both limited and unlimited size. The new measures also enable a straightforward comparison of the achievement level of the various systems in these settings.
Measuring the Achievement of Selection Systems in the Calibration Condition
We first consider measuring the achievement level of selection systems as obtained in the calibration stage. In this stage, the systems are computed using the model proposed by De Corte et al. (2007) implying that the achievement level of the systems expresses the achievement as obtained with respect to an infinitely sized applicant pool; that is, with respect to the total applicant population. Panel A of Figure 1 represents such a situation. Suppose now that we aim for a measure, with values ranging between 0 and 1, to assess the achievement of the selection systems P1 to P4 and Z1 to Z4 depicted in the panel. In that case, the obvious choice is to assign a value of 1 to the systems P1,…, P4 and a value of 0 to the systems Z1,…, Z4 because the former systems show the maximum possible quality at the corresponding diversity level, whereas the latter systems have the worst possible quality at the same diversity level. Given these values, it is then straightforward to assign achievement values to the other systems (e.g., U1, F2, and so on) reflecting the percentage of the possible improvement over the Z system that is obtained with the PO system. More specifically, the achievement of these other systems can be expressed as a proportion relating (a) the difference in quality value of the system and the quality value of the worst possible system with the same diversity value to (b) the difference in quality value of the best and the worst possible system with the same diversity value. In the extremely rare event that the latter difference equals zero, we adopt the convention that the system has an achievement value of one.
As an illustration of the proposed achievement measure, consider the system U2. The system shows a quality/diversity trade-off of 1.086/0.126, whereas the best and worst possible systems with the same diversity value (i.e., P2 and Z2) have a quality value of 1.256 and 0.788, respectively. With these values, the achievement of system U2 is then equated to (1.086 – 0.788)/(1.256 – 0.788) = 0.64.
In what follows, the above described measure for the achievement of a selection system will be referred to as the calibration quality achievement of the system. So, the calibration quality achievement of a selection system indicates the proportional achievement, on the quality objective, of the system at its corresponding diversity level as computed under the calibration conditions Cc.
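The calibration quality achievement can be written as a short function. The sketch below follows the verbal definition given above, including the convention of returning one when the best and worst values coincide; the function name `achievement` is an assumption, and the input values are those reported for system U2.

```python
def achievement(value: float, worst: float, best: float) -> float:
    """Proportion of the possible gain over the worst system that is realized."""
    if best == worst:
        # Convention from the text: a zero denominator yields an achievement of one.
        return 1.0
    return (value - worst) / (best - worst)


# System U2: quality 1.086; worst (Z2) 0.788 and best (P2) 1.256 at the same diversity.
u2 = achievement(1.086, worst=0.788, best=1.256)
print(round(u2, 2))  # 0.64
```

The same function applies to the diversity objective by passing diversity values at a fixed quality level instead of quality values at a fixed diversity level.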
As the natural companion of the former gauge, we also introduce the calibration diversity achievement measure. Similar to the calibration quality achievement measure, the calibration diversity achievement measure indicates the proportional achievement in the calibration condition, but this time with respect to the diversity objective, of a selection system at its corresponding quality level. Using system S1 of panel A (with a diversity/quality trade-off value of 0.10/1.18) as an example, it can be seen that the systems with labels W1 and B1 have the same quality value (i.e., 1.18) as the system S1, with W1 showing the worst possible diversity value (i.e., 0.07) and B1 the best possible diversity value (i.e., 0.15). The calibration diversity achievement therefore equals (0.10 – 0.07)/(0.15 – 0.07) = 0.38.
Observe that the calibration diversity and the calibration quality achievement measure are undefined if the denominator in the corresponding proportion equals zero. As illustrated in Panel A of Figure 1, this will typically be the case for only four selection systems: the systems NB, NE, NO, and P1. System P1, for example, shows a zero difference between the lowest and the highest attainable diversity trade-off value at its quality trade-off level, but in this case as well as for the other three systems it is natural to equate the corresponding calibration diversity (quality) achievement measure to one. It is also important to note that both new measures result in dimensionless quantities that share the same metric. Although the quantities still relate to one specific selection objective, they no longer share the metric of the objective. In particular, a value of, for example, 0.75 on the calibration quality achievement (calibration diversity achievement) measure does not mean that the system has a value of 0.75 for the quality (diversity) objective but that its achievement on the quality (diversity) objective is at 75% of the gain over the worst possible system that could be obtained with the best system at the same diversity (quality) level of the system. Also, because both new measures share the same metric, it is admissible to combine their values to one aggregate achievement measure by taking the average of the two measure values.
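As a small worked example of the aggregate measure, consider system S1. Its calibration diversity achievement follows from the values given above, and its calibration quality achievement is 0.75 because the S systems are constructed to reach 75% of the PO quality gain. The function name `aggregate_achievement` is hypothetical, and the range check is an added safeguard rather than part of the original definition.

```python
from statistics import mean


def aggregate_achievement(quality_achievement: float,
                          diversity_achievement: float) -> float:
    """Average of the two dimensionless achievement values."""
    for value in (quality_achievement, diversity_achievement):
        if not 0.0 <= value <= 1.0:
            raise ValueError("achievement values must lie between 0 and 1")
    return mean([quality_achievement, diversity_achievement])


# System S1: 75% quality achievement by construction; diversity values from the text.
s1_quality = 0.75
s1_diversity = (0.10 - 0.07) / (0.15 - 0.07)
s1_aggregate = aggregate_achievement(s1_quality, s1_diversity)
print(round(s1_aggregate, 4))
```

Averaging is meaningful here only because both inputs are proportions on the same 0 to 1 scale, not raw quality or diversity values.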
Measuring the Achievement of Selection Systems in the Validation Condition
In selection practice, one is less interested in the achievement of selection systems when applied to an entire population but rather in the expected achievement of the systems when applied to a future, finite applicant pool because real-world applicant pools are always of finite size. We therefore focus on measuring the achievement of selection systems in validation conditions involving either a single finite-sized applicant pool or a population of finite-sized pools. Panel B of Figure 1 (the subsection “Computing the Validation Achievement of Selection Systems” details the procedure for obtaining the results depicted in the panel) illustrates the development of the achievement measures in the first case. The panel depicts the trade-offs achieved by the calibration selection systems of Panel A when applied to a given applicant pool of size 250 with an equal number of minority and majority applicants using an overall selection rate of 0.3. Panel B also shows the gamut of trade-offs that can be achieved in the validation applicant pool.
Comparing both panels of Figure 1 illustrates how the gamut of achievable quality/diversity trade-offs and the trade-offs of the PO and the non-PO selection systems as obtained under the calibration condition Cc may change substantially for the validation applicant pool. First, the upper and lower boundary of the gamut of achievable trade-offs consists of only a limited number of points because the validation condition involves a finite-sized applicant pool such that only certain values for the minority selection rate are possible. Similarly, and for the same reason, the gamut no longer corresponds to the area enclosed by the boundary points, but reduces to the collection of vertical dashed lines connecting the corresponding upper and lower boundary points (cf. the vertical orange dashed lines in Panel B). Also, within each vertical line, only a finite number (quickly increasing with the size of the applicant pool) of different quality values is achievable. Finally, observe the changes in the trade-off achieved by the selection systems in the calibration condition (cf. Panel A) versus the validation applicant pool. Consider, for example, PO System P2. Under Cc, the system has a quality/diversity trade-off of 1.26/0.13 (cf. Panel A), whereas the same system results in a trade-off of 1.09/0.16 when applied to the validation pool. Also, none of the systems that are PO under Cc remain PO in the validation pool because each one is dominated by a feasible system that has the same diversity, but a higher quality value (cf. the systems corresponding to the trade-offs B1,…, B4 in Panel B).
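The discreteness of the diversity axis in a finite pool can be shown with a few lines of code. The sketch below lists the minority selection rates that are arithmetically possible for the validation pool of Panel B (250 applicants, half of them minority, overall selection rate 0.3, hence 75 selected). The function name is hypothetical, and the list is an upper bound: the feasible selection systems need not reach every one of these values.

```python
from typing import List


def attainable_minority_rates(n_minority: int, n_majority: int,
                              n_selected: int) -> List[float]:
    """Minority selection rates that are possible when n_selected are hired."""
    if n_minority <= 0:
        raise ValueError("the pool must contain at least one minority applicant")
    # At least this many minority applicants must be selected if the majority
    # group alone cannot fill all positions.
    fewest = max(0, n_selected - n_majority)
    most = min(n_selected, n_minority)
    return [k / n_minority for k in range(fewest, most + 1)]


# Validation pool of Panel B: 125 minority, 125 majority, 75 selected.
rates = attainable_minority_rates(n_minority=125, n_majority=125, n_selected=75)
print(len(rates), rates[0], rates[-1])
```

Each attainable rate corresponds to one of the vertical lines along which quality values can vary in the validation pool.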
Despite the differences between the calibration and the validation conditions, the principle used to measure the achievement of the selection systems in the calibration condition can also be invoked to gauge the achievement of these systems when applied to the validation applicant sample. To distinguish the resulting measures for the applicant sample in the validation context from the corresponding measures in the calibration condition, they are henceforth referred to as the sample validation diversity achievement and the sample validation quality achievement, respectively. Thus, given the trade-offs achieved in the validation pool of, for example, P2, B2, W2, W2D, and B2D (i.e., 1.09/0.16, 1.20/0.16, 0.51/0.16, 1.09/0.12, and 1.09/0.23 for P2, B2, W2, W2D, and B2D, respectively; cf. Panel B), the sample validation quality achievement of System P2 can now be equated to (1.09 – 0.51)/(1.20 – 0.51) = 0.84, whereas the sample validation diversity achievement of the system equals (0.16 – 0.12)/(0.23 – 0.12) = 0.36.
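The two proportions above follow the same pattern, which can be written as a small helper. This is a minimal sketch; the function name and the convention of returning None for an undefined value are assumptions made for illustration.

```python
def achievement(value, worst, best):
    """Proportion of the worst-to-best range that a system attains.

    Returns None when worst == best, in which case the measure is undefined.
    """
    if best == worst:
        return None
    return (value - worst) / (best - worst)

# Trade-offs read from Panel B (quality/diversity):
# P2 = 1.09/0.16, B2 = 1.20/0.16, W2 = 0.51/0.16,
# W2D = 1.09/0.12, B2D = 1.09/0.23.
quality_ach = achievement(1.09, worst=0.51, best=1.20)
diversity_ach = achievement(0.16, worst=0.12, best=0.23)
print(round(quality_ach, 2), round(diversity_ach, 2))
```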
If the validation conditions refer to a population of finite-sized applicant pools, the sample validation achievement value of the selection systems will vary across the set of all possible applicant samples that are consistent with the validation conditions Cv . To account for this sampling variability, the validation diversity (quality) achievement of a selection system under such more general validation conditions Cv is henceforth defined as the expected sample validation diversity (quality) achievement across all possible applicant samples according to Cv .
As is the case for the calibration achievement measures, the validation achievement measures are undefined if the denominator in the corresponding proportion equals zero. Although the condition will generally not hold for the validation quality achievement measure, the same is not true for the validation diversity measure, especially if the validation conditions relate to a selection with a small selection rate applied to a small applicant pool. In that case, the number of possible values for the selection diversity trade-off, as gauged by either the minority selection rate or the AIR, is (very) small and the worst and the best possible diversity value for a given quality level are often identical. 5 So, although both validation achievement measures are conceptually on an equal footing, the validation quality achievement measure has, compared to the validation diversity measure, the net advantage that it is almost never undefined.
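The definition of the validation achievement as an expectation over applicant samples, together with the possibility of undefined sample values, can be sketched as follows. The sketch averages over the samples in which the measure is defined and reports how many were undefined; this handling of undefined samples is an assumption for illustration, as is the function name, and the input values are hypothetical.

```python
from statistics import mean

def expected_achievement(sample_triples):
    """Average sample achievement over the samples where it is defined.

    sample_triples: iterable of (value, worst, best), one per applicant sample.
    Returns (mean achievement or None, number of undefined samples).
    """
    defined, n_undefined = [], 0
    for value, worst, best in sample_triples:
        if best == worst:            # zero denominator: measure undefined
            n_undefined += 1
        else:
            defined.append((value - worst) / (best - worst))
    return (mean(defined) if defined else None), n_undefined

# Hypothetical (value, worst, best) triples for three applicant samples;
# in the third sample the worst and best values coincide.
samples = [(1.0, 0.5, 1.5), (0.75, 0.25, 1.25), (0.2, 0.2, 0.2)]
avg, undefined = expected_achievement(samples)
print(avg, undefined)
```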
Compared to previously proposed gauges, the novel measures of validation achievement have two distinct advantages. First, the measures offer an adequate, intuitively appealing, and easily interpretable quantification of the validation achievement level of a selection system. In essence, the measures express as a proportion how well a system is expected to perform on the quality (diversity) objective in a new setting Cv as compared to the best and the worst possible selection system designs that, under Cv , have the same diversity (quality) value as the system. Also, because the measures are dimensionless and in the same metric, they can be combined into a single aggregate validation achievement measure. Second (and except for the earlier discussed limitation for the validation diversity achievement measure), the measures are generally applicable because they can be used to evaluate the validation achievement of both PO and other selection systems with respect to any applicant pool corresponding to any set of validation conditions Cv . They therefore enable comparing the validation achievement of any one selection system with that of any other system, either under the same conditions Cv or across different conditions.
Finally, observe that the values on the new validation achievement measures as well as differences between these values can easily be converted to corresponding quantities that are of immediate relevance to practitioners. For example, consider again Panel B of Figure 1. The panel shows that the systems U4, P4, W4, and B4 result in the same minority selection rate of 0.21. Yet, each of these has quite a different value for the quality objective (i.e., quality values of 0.36, 0.82, 0.93, and 1.15 for W4, U4, P4, and B4, respectively), resulting in sample validation quality achievement values of 0, 0.58, 0.72, and 1, respectively. Obviously, the P4 system outperforms the U4 system and the difference in sample validation quality achievement, equal to 0.14, can be translated to a difference of 0.93 – 0.82 = 0.11 standard units in the expected job performance of the selected applicants.
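The conversion in this example amounts to multiplying the achievement difference by the width of the worst-to-best quality range at the shared diversity level. A minimal sketch with the Panel B values (the function name is hypothetical):

```python
def quality_gain(ach_a, ach_b, worst, best):
    """Translate an achievement difference back into quality units."""
    return (ach_a - ach_b) * (best - worst)

# Quality values of W4 (worst) and B4 (best) at a minority selection rate of 0.21.
worst, best = 0.36, 1.15
ach_u4 = (0.82 - worst) / (best - worst)
ach_p4 = (0.93 - worst) / (best - worst)

print(round(ach_u4, 2), round(ach_p4, 2))
print(round(ach_p4 - ach_u4, 2))
print(round(quality_gain(ach_p4, ach_u4, worst, best), 2))
```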
Computing the Validation Achievement of Selection Systems
We developed two suites of programs and accompanying shell scripts to compute the validation achievement in the validation conditions Cv of selection systems as derived under the calibration conditions Cc . The first suite is restricted to the study of single-stage selection systems with respect to validation conditions involving an infinite-sized applicant pool, as in the Song et al. (2017) study, and is executable on a personal computer. The suite also calculates the shrinkage in the quality and diversity objective of the systems in the validation conditions. The program solves a series of nonlinear optimization problems similar to the ones described in De Corte et al. (2011), using a classic, gradient-based sequential quadratic programming algorithm. The suite, including documentation about its usage, can be downloaded from http://users.ugent.be/~wdecorte/software.html, and the online material accompanying the article presents an application studying the robustness and sensitivity of both the shrinkage and the validation achievement of various single-stage selection systems under population validation conditions. These results assist the discussion on the main results reported in the article.
In contrast to the limited capabilities of the first suite of programs, the second suite addresses both single- and multistage selection systems under general validation conditions related to a finite-sized applicant pool. 6 The previous section shows that in that case the computation of the validation diversity (quality) achievement of a selection system in validation conditions Cv requires generating a large number of applicant samples according to Cv , computing the sample validation diversity (quality) achievement of the system for each sample, and taking the average of the resulting achievement values. Because these computations are extremely demanding, the second suite can be executed only on a high-performance computing facility.
To generate the applicant samples in the second suite, we use the procedure described by Ruscio and Kaczetow (2008) because it can deal with virtually any type of joint distribution (including real data distributions) of the predictor/criterion in the majority and the minority applicant populations and the procedure can therefore accommodate a very broad range of Cv conditions. Next, to compute the validation achievement values of the systems in each of the generated applicant samples, we wrote a mixed C and Fortran 77 program. The program repeatedly applies the evolutionary multiobjective optimization (EMOO) algorithm as implemented in the NSGA-2 program developed by Deb, Pratap, Agarwal, and Meyarivan (2002) to calculate the maximum and minimum quality (diversity) value that can be achieved (over all feasible selection systems) at the diversity (quality) level obtained by the systems in the sample. We adopted the latter EMOO algorithm because with finite-sized validation applicant pools the calculation of both the maximum and minimum achievable quality (diversity) involves the global, constrained optimization of a nonlinear, nonanalytic function (corresponding to either the quality or the diversity of the system) where one of the equality constraints (related to the diversity or the quality in case the quality or the diversity is maximized or minimized) is also nonanalytic. These optimization problems cannot be solved with classic gradient-based methods, such as invoked by the decision aid of De Corte et al. (2011), leaving no other option than to use a general metaheuristic approach instead. To assist the EMOO algorithm, its execution is preceded by an extensive grid search to generate an initial population of problem variable values that meet the nonanalytic equality constraint.
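To make the structure of the computation concrete, the sketch below evaluates the sample validation quality achievement of one system in a single simulated pool. It is a deliberately simplified stand-in, not the authors' program: a plain grid search over one composite weight replaces the NSGA-2 optimization, normal sampling replaces the Ruscio and Kaczetow (2008) procedure, and the two-predictor population model, its parameter values, and all function names are hypothetical. The best and worst quality are therefore only relative to the systems on the grid.

```python
import random

def simulate_pool(n_min, n_maj, rng):
    """Hypothetical finite applicant pool of (is_minority, x1, x2, y) tuples."""
    pool = []
    for is_minority, n in ((True, n_min), (False, n_maj)):
        for _ in range(n):
            x1 = rng.gauss(-1.0 if is_minority else 0.0, 1.0)  # assumed group gap
            x2 = rng.gauss(0.0, 1.0)
            y = 0.5 * x1 + 0.3 * x2 + rng.gauss(0.0, 0.8)      # assumed criterion model
            pool.append((is_minority, x1, x2, y))
    return pool

def apply_system(pool, w, n_selected):
    """Top-down selection on the composite w*x1 + (1 - w)*x2."""
    ranked = sorted(pool, key=lambda a: w * a[1] + (1 - w) * a[2], reverse=True)
    chosen = ranked[:n_selected]
    n_minority_selected = sum(a[0] for a in chosen)
    quality = sum(a[3] for a in chosen) / n_selected
    return n_minority_selected, quality

def sample_quality_achievement(pool, w_system, n_selected, grid_steps=200):
    """Quality achievement of one system relative to a grid of competitors."""
    k_sys, q_sys = apply_system(pool, w_system, n_selected)
    same_diversity = [q_sys]
    for i in range(grid_steps + 1):
        k, q = apply_system(pool, i / grid_steps, n_selected)
        if k == k_sys:                   # equality constraint on diversity
            same_diversity.append(q)
    worst, best = min(same_diversity), max(same_diversity)
    if best == worst:
        return None                      # measure undefined in this sample
    return (q_sys - worst) / (best - worst)

rng = random.Random(2024)
pool = simulate_pool(n_min=125, n_maj=125, rng=rng)
result = sample_quality_achievement(pool, w_system=0.5, n_selected=75)
print(result)
```

Because the diversity constraint is an equality on an integer count, many grid points are discarded, which hints at why an initial search for constraint-satisfying candidates is helpful before a metaheuristic is run.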
Although the above procedure succeeds in computing the minimum and maximum achievable quality at the diversity level obtained by a selection system in the validation sample, the procedure is unreliable when solving for the corresponding diversity optimizations at the quality level obtained by the system. 7 Using a different metaheuristic approach (i.e., ant colony optimization; Dorigo & Stutzle, 2004) instead of the evolutionary-based approach does not solve the problem. Apparently, the problems with the present procedure to maximize/minimize diversity under the quality equality constraint are caused by the fact that in finite applicant pools the number of possible values for the quality objective (gauged by the average job performance of the selected applicants) at a given diversity level is much larger than the corresponding number of possible values for the diversity objective (gauged by either the minority selection rate or the AIR) at a given quality level, making it much harder to implement the equality constraint with respect to the quality objective as compared to the implementation of the constraint with respect to the diversity objective.
Despite these computational problems, we decided to adopt the general approach for the remainder of the article, even though this means that only results about the validation quality achievement of the systems will be reported. The decision is motivated by the fact that only the general approach can shed light on the achievement of both single- and multistage selection systems when applied in realistic validation conditions, that is, in conditions involving finite-sized applicant pools. Also, using the findings from the study reported in the online supplement it is possible to at least indicate how the results about the validation diversity achievement of different selection systems are expected to evolve. Finally, note that even without the computational problems, the integration of the validation diversity achievement measure in the present study could still be somewhat problematic because the measure is often undefined for small applicant pool validation conditions (cf. the section “Measuring the Achievement of Selection Systems in the Validation Condition”).
Studying the Robustness and Sensitivity of Selection Systems
We use simulation methods within a design structured by 10 factors to address the key research questions about the validation achievement of PO and other selection systems when these systems are applied to a large variety of validation selection settings. The design adopts the framework of sensitivity analysis (Saltelli, Tarantola, Campolongo, & Ratto, 2004). This framework aims to assess the effect of different sources of uncertainty (variability or error) in the input data of a model on the model output, often using simulation and regression or ANOVA methods within a (preferably) factorial design to identify the most prominent sources of uncertainty or variability. The framework is therefore ideally suited to address the key research questions of the article. In addition, the present design also permits studying issues concerning the population to sample and the sample-to-sample cross-validation (cf. Cattin, 1980) of PO selection system designs.
The first three factors of our design, henceforth referred to as the selection situation factors, capture the impact of the nature of the selection environment and the initial calibration conditions Cc the systems are computed from. A second set of five factors, addressing the sensitivity and robustness issues, relates to the major features that differentiate between the initial conditions Cc and the actually prevailing validation conditions Cv . Finally, the remaining two factors, labeled as the selection system factors, structure the characteristics of the analyzed selection systems.
Selection Situation Factors
The first two selection situation factors relate to the overall selection rate under Cc (with three levels: 0.1, 0.2, and 0.4) and the proportional representation under Cc of the majority applicants in the candidate population (with two levels: a 0.8 and a 0.5 majority applicant representation), respectively. We included these factors in the design to investigate whether PO and other selection systems, as derived for different combinations of selection rate and majority/minority mixture proportion values under Cc , show different levels of robustness and sensitivity. The choice of the actual levels of both factors was driven by a double concern: sufficient variation in the level values to capture a possible effect of the factors and maintaining a reasonable degree of realism.
The final selection situation factor, labeled as the selection environment factor, has three levels that each refer to a quite different selection setting. Tables 1 and 2 detail the three environments. The first table identifies the available predictors in each environment and summarizes the predictor/criterion mean and intercorrelation data used to compute the different selection systems under Cc within the environment. In turn, Table 2 describes the contextual and other relevant constraints that demarcate the set of feasible selection systems for each environment.
Table 2. Constraints Demarcating the Set of Feasible Selection Systems.
We chose these three selection environments because we first and foremost wanted to assess the validation achievement of PO and non-PO systems over a wide variety of selection settings, even though this implied considering environments that differ not only in terms of the type and number of the predictors and, hence, in the predictor/criterion data, but also vary with respect to the staging of the predictors (i.e., single- vs. two-stage selection and mixed single, two-, and three-stage selection) and the nature of the feasible selection designs. At the present early stage of research on the robustness and sensitivity of PO selection system design, we decided in favor of including a wide variety of factors, representing all major types of circumstances that may impact on the validation achievement of the selection systems, rather than focusing too much on a single, albeit important aspect such as the nature of the selection environment. Given the multitude of ways in which selection environments may differ from each other (e.g., with respect to the number and type of predictors, the distribution of the predictor/criterion effect sizes, the factorial structure of the predictor battery and the nature of the set of feasible selection designs), a detailed analysis of the impact of this factor is best postponed until more is known about the other circumstances that are critical to the robustness and sensitivity of PO and other systems. For now, the decision to first consider the full scale of possibly important factors implied choosing between a set of fairly homogeneous levels for the selection environment factor that differ in only one aspect such as, for example, the staging of the selection process, and a set of heterogeneous levels. We decided in favor of the latter option because it enables a more general and informative answer about the validation achievement of PO as compared to other selection systems, even though this choice may entail some difficulties with the interpretation of the effect of the factor.
With three levels for the selection rate and the selection environment, and two levels for the proportional minority/majority representation, the crossing of the three selection situation factors results in a total of 18 different studied selection situations. These 18 situations are at best exemplary for the broad range of situations encountered in practice, but we believe that the situations are sufficiently heterogeneous to ensure that the study provides at least guiding evidence on the validation achievement of PO selection system designs.
Factors Differentiating Between the Calibration and the Validation Conditions
The design also includes five factors to capture the ways in which the validation conditions of the selection application, Cv , may deviate from calibration conditions,
Although the above five factors permit a fairly exhaustive investigation of the robustness and sensitivity issues, it is again noted that the choice of the number and the nature of the factor levels reflects a balance between the concerns of feasibility and adequate coverage. Thus, proportional representation under Cv of the minority/majority applicants has only two levels: either the same or different to the one under Cc (i.e., if different, the proportion majority applicants under
Finally, the factor about the normality versus nonnormality of the joint predictors/criterion score distribution in the parent majority and minority population from which the applicant pools are sampled under Cv has three levels. Level 1 corresponds to sampling from the multinormal distribution, whereas Levels 2 and 3 indicate sampling from moderately and severely nonnormal distributions (i.e., generalized lambda distributions; Chalabi, Scott, & Wuertz, 2012), respectively. More specifically, the marginal distributions of the predictors/criterion scores have skew and kurtosis of 0.75 (2.0) and 4 (9) under Level 2 (3) of the factor. In this way, both levels reflect the characteristics of predictor/criterion score distributions as often encountered in real samples (cf. Blanca, Arnau, Lopez-Montiel, Bono, & Bendayan, 2013; Micceri, 1989). Also, the heavily skewed criterion score distribution under Level 3 accords with recent arguments by O’Boyle and Aguinis (2012) that actually observed job performance scores follow a Pareto distribution; however, see Beck, Beatty, and Sackett (2014) for a contrary view.
Selection System Factors
The selection systems studied within each cell of the design correspond to the crossing of two selection system factors: the selection system type factor with six levels and the selection system relative diversity factor with four levels. More specifically, the first five levels of the selection system type factor differentiate the studied selection systems in terms of the calibration quality achievement level attained under Cc . Level 1 selection systems are PO under Cc , and therefore show the best possible calibration quality achievement value (i.e., a value of 1 or 100%), whereas the Levels 2, 3, 4, and 5 systems have, under Cc , a 0%, 25%, 50%, and 75% calibration quality achievement value, respectively. Panel A of Figure 1 illustrates the different types of selection systems: the systems corresponding to the trade-offs P1 to P4 represent Level 1 type of selection systems, the systems corresponding to the trade-offs Z1 to Z4 are Level 2 type of systems, and so on. The sixth level of the selection system type factor refers to selection systems in which unit weighed composites are used to perform the selection. In Panel A of Figure 1, these systems correspond to the trade-offs U1 to U4.
The inclusion of the selection system type factor permits addressing the differential robustness and sensitivity of different types of selection systems. Also, given the particular levels chosen for the factor it is possible to study whether PO selection systems, as derived under Cc , continue to outperform other selection system types and, in particular, unit weighed selection systems when these systems are applied in a large variety of validation settings. Note that the study does not include regression weighed selection systems as an additional level for the selection system type factor because these systems may, depending on the predictor/criterion correlation structure, assign negative weights to the predictors in forming the predictor composites. As a consequence the regression weighed systems may violate the constraint on the feasible selection systems, imposed in all three studied selection environments, that only nonnegative weights are permissible in forming the predictor composites.
The four levels of the selection system relative diversity factor refer to increasing degrees of diversity achieved by the systems under Cc . The actual values of the four diversity levels vary across the 18 different studied selection situations, however, because these situations differ in terms of the selection environment, the selection rate and the majority/minority applicant composition such that it is impossible to construct selection systems that show identical diversity values across the situations. The relative diversity factor is therefore nested within the crossing of the three selection situation factors. Also, within each situation, the diversity level values were chosen according to two criteria. First, the values must be attainable by at least one of the unit weighed selection systems that are feasible in the situation. Second, the level values should span as evenly as possible the major part of the range of diversity values achievable between the diversity level associated with the highest quality PO system (under Cc ) and the diversity corresponding to the lowest quality PO system (under Cc ). The diversity trade-off values corresponding to the PO systems P1 to P4 in Panel A of Figure 1 illustrate the resulting four factor levels for the selection situation S 0 described in the section “PO Selection Systems: A Brief Tutorial.”
We added the relative diversity factor to the design because both the measures of calibration and validation quality achievement are defined with reference to the diversity level of the system and it is therefore important to assess whether the robustness/sensitivity of PO selection systems varies, depending on the diversity trade-off value of the systems. Also, Song et al. (2017) found that validity (diversity) shrinkage in PO systems is more pronounced to the extent that the PO system gives priority to the validity (diversity) objective. If this finding also applied to the validation quality achievement of PO systems, then the low diversity PO systems (i.e., the systems that give a high priority to the quality objective) would show a lower level of validation quality achievement than the high diversity PO systems.
Finally, note that the relative diversity factor harbors an ambiguity with respect to the unit weighed selection systems. Whereas variable weight systems may show different quality trade-off values for the same diversity trade-off value by adjusting the predictor weights in the composites, this is not the case with unit weighed systems. For these systems the diversity trade-off value corresponds to a unique quality trade-off value and, hence, to a unique calibration quality achievement value. So, choosing the unit weighed selection systems within each selection situation according to the diversity level value also fixes the calibration quality achievement value of the systems. As a consequence, the levels of the relative diversity factor confound diversity and calibration quality achievement in case (but only in case) of the unit weighed systems, and this confound will have to be taken into account when comparing the validation achievement of unit weighed and PO selection systems.
Overview and Implementation of the Study Design
Table 3 provides a summary of the 10 factors of the design. The design corresponds to the full crossing of nine of the factors, whereas the relative diversity level of the selection systems factor is nested within the crossing of the three selection situation factors. Given the number of levels of the eight factors that are used to provide a fairly exhaustive coverage of the different selection situations and the ways in which real settings deviate from the idealized conditions Cc , the design has a total of 3,888 cells, with 24 selection systems (corresponding to the crossing of the two selection system factors) studied in each cell.
Table 3. Study Design Factors.
The implementation of the design proceeded in two stages, using throughout the minority selection rate and the average score on the job performance criterion as gauges for the diversity and the quality objective respectively. In the first stage a modified version of the COPOSS program (De Corte, 2011; De Corte et al., 2011) is used to identify the 24 selection systems under Cc for each of the 18 different selection situations obtained from the crossing of the three selection situation factors. The second, simulation stage involved the computation of the validation quality achievement of these systems when applied in the validation conditions Cv corresponding to each of the 3,888 cells of the design.
The actual execution of the second stage consisted of two steps. In the first step, the procedure of Ruscio and Kaczetow (2008) was used to generate 500 applicant predictor/criterion data samples within each of the 3,888 cells according to the situational features and the Cv conditions that are specific for the cell. In the second step, the above described procedure for assessing the validation quality achievement of the selection systems was applied to each data sample within each cell of the design, resulting in the sample validation quality achievement value of the selection systems for the particular sample.
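The scale of the simulation follows directly from the counts given above. The sketch below enumerates the 18 selection situations from the stated factor levels and derives the remaining totals from the reported number of cells, systems, and samples per cell; the variable names are hypothetical, and the levels of the factors that differentiate Cc from Cv are not enumerated because only their product is needed here.

```python
from itertools import product

# Levels of the three selection situation factors as stated in the text.
selection_rates = (0.1, 0.2, 0.4)
majority_shares = (0.8, 0.5)
environments = (1, 2, 3)
situations = list(product(selection_rates, majority_shares, environments))

n_cells = 3888            # reported number of design cells
n_systems = 6 * 4         # system type levels x relative diversity levels
samples_per_cell = 500    # applicant samples generated per cell

# Combinations of the calibration/validation discrepancy factors per situation.
validation_combos_per_situation = n_cells // len(situations)
total_samples = n_cells * samples_per_cell
total_evaluations = total_samples * n_systems

print(len(situations), validation_combos_per_situation)
print(total_samples, total_evaluations)
```

Each of these evaluations requires its own constrained optimization for the best and worst achievable quality, which explains the need for a high-performance computing facility.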
Obviously, given the size of the design and the numerical complexity, especially of the step to determine the validation quality achievement of the different systems for each sample within each of the cells of the design, the implementation of the study required massive computational resources that can be delivered only by a high-performance computing facility. In particular, all computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation–Flanders (FWO) and the Flemish Government–Department EWI. All results were subsequently transferred to the SAS/Stat software environment for further analysis, using the means, tabulate, and ANOVA procedures to provide answers to the key research questions of the article.
Validation Quality Achievement of PO Versus Other Selection Systems
First, we focus on the validation quality achievement of PO selection systems and on the outcomes related to the sensitivity and the robustness of these systems. Next, we compare the sensitivity and robustness of the validation quality achievement of PO systems to that of other types of selection systems. Given the categorical measurement level of the studied factors, all results are obtained using appropriate ANOVA models. When the dependent variable in the models is a (function of a) proportion, we first applied the logit transformation to the dependent variable before executing the ANOVA analyses.
8
For each analysis we report the percentage of variance explained by the models (i.e., the effect size measure
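The logit transformation of a proportion and the percentage of variance explained can be illustrated for the simplest case of a single factor. This is a minimal sketch rather than the SAS analysis used in the study: the clamping constant eps for proportions of exactly 0 or 1 is an assumption, the function names are hypothetical, and the achievement values are made up for illustration.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of a proportion; eps clamps values of 0 and 1 (an assumption)."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def pct_variance_explained(groups):
    """100 * SS_between / SS_total for a one-way layout."""
    values = [v for g in groups for v in g]
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    return 100 * ss_between / ss_total

# Hypothetical achievement proportions for two levels of one factor.
groups = [[0.55, 0.60, 0.65], [0.75, 0.80, 0.85]]
transformed = [[logit(p) for p in g] for g in groups]
explained = pct_variance_explained(transformed)
print(explained)
```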
Robustness and Sensitivity of PO Selection Systems
To address the first key research question we conducted an ANOVA with the validation quality achievement value of the PO selection systems as the dependent variable. The set of independent variables in the ANOVA comprises all terms in the full model of 9 of the 10 design factors. The selection system type factor can be dropped because we study only one type of selection system (i.e., PO systems). The model explains 46.4% of the total variance and the bulk of the explained variance is due to the main effect of three factors: (a) the diversity of the system under Cc , explaining 19.6% of the variance, (b) the selection environment, 14.6%, and (c) the size of the applicant pool, 6.5%. Table 4 presents the average validation quality achievement value of the PO systems overall and broken down according to the levels of the three factors. The tabled values reveal that, across all studied conditions, the systems that are PO under Cc have an average validation quality achievement value of 0.661, but the level of achievement varies substantially across the three selection environments and the levels of diversity that characterize the systems. Thus, the validation quality achievement is highest in selection environment three, whereas PO systems with higher diversity trade-off levels under Cc show a poorer achievement. The higher average validation quality achievement in environment three probably relates to the fact that this environment uses fewer predictors than the other environments. Previous research on shrinkage in regression models and the formulas used to predict the shrinkage (e.g., Cattin, 1980) indicates that the amount of shrinkage increases with the number of predictors in the model, a result that is mirrored by the present finding that the selection environment with the smallest number of predictors offers the best validation achievement.
Table 4. Validation Quality Achievement of PO Systems: Global and According to the Relative Diversity of the Selection System (DIV, With Levels 1 to 4 Coding for Increasing Degrees of Diversity), Selection Environment (SEN, With Levels 1 to 3 Coding for the Environments 1 to 3; see Tables 1 and 2), and Size of the Applicant Pool (SIZ, With Level 1: 80 Applicants; Level 2: 250 Applicants; Level 3: 800 Applicants; and Level 4: 2,500 Applicants).
In contrast, the result about the lower validation quality achievement of high diversity (and therefore low quality) PO systems defies the expectation as based on the shrinkage results of Song et al. (2017). Whereas Song et al. found that low quality PO systems exhibit less quality shrinkage (i.e., less validity shrinkage as Song et al. use validity for the quality objective), we find that these systems have a lower validation quality achievement than the high quality (low diversity) PO systems. Apparently, PO systems that give a higher priority to the diversity objective (and, hence, a lower priority to the quality objective) tend to show a smaller validation quality achievement as compared to the lower diversity systems. The finding thereby indicates that quality shrinkage, as proposed by Song et al., could be misinterpreted by users as a gauge for the loss in the quality achievement by a PO system when implemented in validation conditions, at least when smaller quality shrinkage would be considered as an indication of higher quality achievement. Looking back at Figure 2, this finding does not come as a surprise, however, because the figure clearly shows that quality shrinkage and validation quality achievement are rather inverse indicators of the validation potential of selection systems: Whereas smaller quality shrinkage might suggest a higher quality achievement, the reverse is the case. This is further substantiated by the results of the study reported in the section “Comparing Shrinkage and Validation Achievement” of the online material. This study, albeit restricted to validation conditions involving the applicant population, additionally shows that the relationship between the corresponding diversity (quality) shrinkage and validation diversity (quality) achievement measures is not linear and not even entirely monotone.
Although the present research does not permit studying the relation between the relative diversity of the PO systems and validation diversity achievement of the systems in general validation conditions with finite applicant pools, the online material presents at least indicative results on this issue in the case of validation conditions involving applicant populations instead of finite applicant pools. These results show a rather proportional relationship between the relative diversity of a system and its validation diversity achievement, again contrary to the expectation based on the diversity shrinkage results.
Of the four factors in the design that aim to study the sensitivity of the PO selection systems for discrepancies between the calibration conditions Cc and the validation conditions Cv, only the size of the applicant pool explains at least 1% of the variability in the validation quality achievement values. As expected, the validation quality achievement of the PO systems is higher when the applicant pool is larger. In small applicant pool samples, as compared to large-sized samples, the variability of the predictor correlation, validity, and effect size values is considerably larger, implying that these values are more often substantially different from the values on which the PO selection systems are based, thereby resulting in a poorer validation quality achievement of the systems.
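The argument that small applicant pools yield more variable correlation and validity values can be illustrated with a minimal simulation. The sketch below is not the study's procedure: the pool sizes (50 and 500), the population correlation of 0.30, and the function name `correlation_sd` are all illustrative assumptions.

```python
import numpy as np

def correlation_sd(n, rho=0.30, reps=2000, seed=0):
    """Empirical SD of the sample correlation across `reps` pools of size n.

    rho, reps, and the bivariate normal model are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # Draw all pools at once: array of shape (reps, n, 2).
    data = rng.multivariate_normal(np.zeros(2), cov, size=(reps, n))
    # Center predictor and criterion scores within each pool.
    x = data[..., 0] - data[..., 0].mean(axis=1, keepdims=True)
    y = data[..., 1] - data[..., 1].mean(axis=1, keepdims=True)
    # Pearson correlation per pool.
    r = (x * y).sum(axis=1) / np.sqrt((x**2).sum(axis=1) * (y**2).sum(axis=1))
    return r.std(ddof=1)

sd_small = correlation_sd(n=50)    # hypothetical small applicant pool
sd_large = correlation_sd(n=500)   # hypothetical large applicant pool
print(f"SD of sample correlation, n=50:  {sd_small:.3f}")
print(f"SD of sample correlation, n=500: {sd_large:.3f}")
```

Under these assumptions the sample correlation varies roughly three times as much in the small pool as in the large one, which is the mechanism invoked above for the poorer validation quality achievement in small pools.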
The effects related to the other sensitivity factors, although statistically significant (as are almost all other effects in the ANOVA because of the huge number of cases), explain only a negligible fraction of the total variability. Thus, the validation quality achievement of PO selection systems depends very little on the discrepancies between Cc and Cv as related to the proportional representation of the majority/minority candidates in the applicant pool and the predictor/criterion mean and correlation structure values in the majority/minority populations, although the level averages of the latter factor show that the average validation quality achievement decreases for larger discrepancies in the predictor/criterion mean and correlation structure.
The ANOVA further indicates that the effect of the factor about the normality versus nonnormality of the joint predictors/criterion distribution is also quite small (i.e., less than 1% explained variance). Although the validation quality achievement decreases somewhat in settings where the distribution is nonnormal, the effect is not entirely consistent across the different environments and the levels of the applicant pool size factor. By and large, the finding implies that the assumption invoked by the decision aid about the multivariate normal distribution of the predictor/criterion scores in the applicant populations is not really critical. Fairly different joint predictor/criterion distributions only marginally affect the validation quality achievement of the PO systems.
Finally, the ANOVA reveals that none of the effects related to the interaction of the selection environment factor with (any combination of) the other factors explains a sizable portion of the total variance, implying that the above discussed effects about selection system diversity level and the size of the actual applicant pool apply in a similar way across the different types of selection environment and therefore are quite general. The result is also of key importance with regard to future studies about the features of the selection environment that affect the actual performance of PO systems because it suggests that this future research can be conducted using a much simpler design that focuses on only these features without considering any additional factors.
Sampling Variability of the Validation Quality Achievement of PO Systems
With more than 50% unexplained variance, the ANOVA also shows that the sampling variability of the validation quality achievement of PO systems is quite large. Figure 3 illustrates this by showing the density plot of the validation quality achievement by selection environment (upper panel), by size of the applicant pool (middle panel), and by the diversity level of the PO systems (lower panel). Within the panels we also represented for each density the 0.1 (filled square) and the 0.9 quantile (filled circle) of the density, thereby indicating the interval that contains the 80% middle values of the validation quality achievement. Even for the largest applicant pool size, the width of this interval, with 0.1 and 0.9 quantile values of 0.52 and 0.92, is still quite substantial.

Figure 3. Sampling variability of the validation quality achievement of PO selection systems. The square dots on the horizontal axis indicate the 0.10 quantile of the density, whereas the circle dots show the 0.90 quantile.
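As a minimal sketch of how such a central 80% interval can be obtained from a set of simulated achievement values, the code below computes the 0.10 and 0.90 quantiles and, for comparison, the interquartile range. The beta-distributed values stand in for achievement values and are purely hypothetical; only the reported quantiles 0.52 and 0.92 come from the text.

```python
import numpy as np

def central_interval(values, lower=0.10, upper=0.90):
    """Return the lower quantile, the upper quantile, and the interval width."""
    lo, hi = np.quantile(values, [lower, upper])
    return lo, hi, hi - lo

rng = np.random.default_rng(1)
# Hypothetical achievement values on the (0, 1) scale; not the study's data.
achievement = rng.beta(6.0, 3.0, size=10_000)

lo, hi, width = central_interval(achievement)
q1, q3 = np.quantile(achievement, [0.25, 0.75])
iqr = q3 - q1  # interquartile range: width of the central 50% interval

print(f"0.10 quantile: {lo:.3f}, 0.90 quantile: {hi:.3f}, width: {width:.3f}")
print(f"interquartile range: {iqr:.3f}")

# Width implied by the quantiles reported for the largest applicant pool size.
reported_width = 0.92 - 0.52
print(f"reported interval width: {reported_width:.2f}")
```

With the reported quantiles, the central 80% of the validation quality achievement values for the largest pool size spans 0.40 units of a measure bounded by 0 and 1.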
To determine the conditions that affect the sampling variability we applied a second ANOVA, with the within-cell (logit transformed) interquartile range of the validation quality achievement of the PO systems as the dependent variable and the main and interaction effects (up to the fourth order) of the nine relevant factors of the design as independent variables. The model explains 94.4% of the variance. As expected, the effects related to the number of selected applicants together provide the largest contribution (i.e., the size of the applicant pool, 42.2%, and the selection rate under Cc and Cv factors with 5.0% and 2.3%, respectively), whereas the relative diversity factor (26.5%), the discrepancy between the moments of the joint predictor/criterion score distribution under Cc versus Cv (3.9%), and the selection environment factor (4.6%) are largely responsible for the remaining part of the explained variance. The average values of the dependent variable corresponding to these factors further show that the sampling variability of the validation quality achievement of a PO system increases with the relative diversity level of the system (i.e., the interquartile range values for the diversity Levels 1 to 4 are 0.134, 0.150, 0.174, and 0.213), increases for systems using a larger number of predictors (i.e., the interquartile range values for the environments 1 to 3 are 0.179, 0.170, and 0.155), and decreases in situations with a larger number of (selected) applicants. The variability also increases for bigger differences between the validation predictor/criteria moment data and the calibration moment data used to derive the PO systems (i.e., interquartile range values of 0.157, 0.164, and 0.182 for the Levels 1 to 3 of the difference of the mean and correlation structure of the joint predictor/criterion score distribution under Cc versus Cv).
Integrating the results of the previous analyses, it can be concluded that the conditions that substantially affect the magnitude and the variability of the validation quality achievement of a PO system are by and large the same. One may expect a higher validation quality achievement, and at the same time be more confident about this expectation (i.e., the sampling variability is smaller) when implementing a low diversity PO system, derived from fairly accurate predictor/criterion data and involving a small number of predictors, in a selection situation with a large number of (selected) applicants. However, the substantial decline from the value of 1 for the calibration quality achievement to the value of 0.661 for the validation quality achievement of the PO systems may raise concerns about the real practical utility of adopting these designs instead of other, more simple designs to address the selection quality/diversity quandary. To settle this issue, the next sections compare the robustness, the sensitivity, and the sampling variability of both PO and other non-PO selection system designs, including the unit weighed designs.
Comparing the Robustness, Sensitivity and Sampling Variability of PO and Sub-PO Systems
We first report the results of the analysis comparing the validation quality achievement of the PO and the 0%, 25%, 50%, and 75% calibration quality achievement systems. The analysis again applies an ANOVA model to the (logit transformed) validation quality achievement of the selection systems as the dependent variable, but this time using a slightly restricted model containing the main effects and all possible interactions up to the seventh order of all 10 factors in the design. We imposed the restriction to stay within the limitations inherent to the SAS ANOVA procedure. The restriction does not affect the quality of the analysis, however. Because the design is orthogonal, the sum of squares (and, hence, the proportion of explained variance) associated with the different effects remains the same whatever the set of effects that is included in the model.
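The orthogonality argument can be checked numerically on a toy example. The sketch below uses a hypothetical balanced 2 x 3 design with five replicates per cell (not the study's design) and compares the sum of squares for one factor when it is fitted alone with its sum of squares when the other factor is already in the model. The helper names `rss` and `dummies` are illustrative.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def dummies(levels):
    """Indicator columns for every level except the first (reference) level."""
    uniq = np.unique(levels)
    return np.column_stack([(levels == u).astype(float) for u in uniq[1:]])

rng = np.random.default_rng(2)
# Balanced (orthogonal) 2 x 3 design with 5 replicates per cell.
a = np.repeat([0, 1], 15)
b = np.tile(np.repeat([0, 1, 2], 5), 2)
y = 0.5 * a + 0.3 * b + rng.normal(size=a.size)  # hypothetical response

one = np.ones((a.size, 1))
A, B = dummies(a), dummies(b)

# Sum of squares for factor A when A is the only effect in the model.
ss_a_alone = rss(one, y) - rss(np.hstack([one, A]), y)
# Sum of squares for factor A when factor B is already in the model.
ss_a_given_b = rss(np.hstack([one, B]), y) - rss(np.hstack([one, B, A]), y)

print(f"SS(A) alone:   {ss_a_alone:.6f}")
print(f"SS(A) given B: {ss_a_given_b:.6f}")
```

In a balanced design the two values agree up to floating-point error, which is why omitting higher-order interactions from the model leaves the sums of squares of the retained effects unchanged. In an unbalanced design the two values would generally differ.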
The ANOVA model explains 52.6% of the total variance, with four effects related to the selection system type factor contributing at least 1%: the main effect of selection system type (25.9%), and the interaction of selection system type with the selection environment factor (9.1%), the size of the applicant pool (2.8%), and the system relative diversity (1.5%), respectively. Table 5 summarizes the average validation quality achievement values corresponding to these four effects. The averages support the major conclusion that the order in the validation quality achievement level of the selection systems is maintained when these systems are applied in a large variety of selection settings. Note in particular that the three interaction effects do not invalidate this conclusion. Both overall and within each selection environment, within each size of the total applicant pool, and within each relative diversity level of the systems, the PO systems perform best, followed by the 75% (the S systems), the 50% (the F systems), the 25% (the T systems), and the 0% estimated performance systems (the Z systems), but the degree of separation between the validation quality achievement levels of the different selection systems varies significantly across the selection environments, the applicant pool size conditions, and the system diversity levels. Also note that the average validation quality achievement values of the different selection system types across the levels of the applicant pool size factor reflect the expectation that the validation quality achievement of selection systems with a high calibration quality achievement level (i.e., the 75 EP and the PO systems) increases with the size of the applicant pool, whereas the reverse is the case for the selection systems with a low calibration quality achievement level (i.e., the 0 and the 25 EP systems).
Higher variability in the predictor/criterion data because of smaller applicant pool size should more often benefit the validation quality achievement of systems with a low calibration quality achievement and have the opposite effect for high calibration quality achievement systems.
Table 5. Average Validation Quality Achievement of PO and Non-PO Systems (0% Calibration Quality Achievement, 0 CA, Until 75% Calibration Quality Achievement, 75 CA): Overall, by Environment, by Size of the Applicant Pool, and by Relative Diversity of the Selection System.
With more than 47% unexplained variance, the ANOVA again indicates substantial sampling variability. Figure 4 further illustrates the issue by displaying the density of the validation quality achievement of the five different selection system types, both overall and by the different selection environments. To study whether the different selection system types are more or less susceptible to sampling variability, we performed a follow-up ANOVA with the within-cell (logit transformed) interquartile range of the validation quality achievement of the systems as the dependent variable and the main effects of all 10 design factors as well as the corresponding interactions (up to the fourth order) as independent variables. The ANOVA explains 94.7% of the variance, with several effects related to the selection system type factor contributing at least 1%. Briefly summarized, the average interquartile range values corresponding to these effects reveal that the sampling variability is inversely related to the validation quality achievement level of the systems, and that this trend is more pronounced for lower relative diversity systems but weaker for larger sizes of the applicant pool. Yet, despite this variation, the important practical finding remains that PO systems apparently show a smaller within-cell sampling variability than the non-PO systems.

Figure 4. Sampling variability of the validation quality achievement of PO and non-PO systems (P SYS: PO system; Z SYS: 0% calibration quality achievement system; T SYS: 25% calibration quality achievement system; F SYS: 50% calibration quality achievement system; S SYS: 75% calibration quality achievement system; U SYS: unit weight system).
Comparing the Validation Quality Achievement of PO and Sub-PO Systems at the Same, Single Application Level
Although PO systems maintain the highest validation quality achievement level, without showing more sampling variability, the substantial overlap of the density plots in Figure 4 suggests that PO systems may with some frequency result in a lower validation quality achievement than the sub-PO systems. To gather more precise information about this possibility, we recorded for each sample within each cell of the design the proportion with which the validation quality achievement of PO systems is at least equal to that achieved by the corresponding 0%, 25%, 50%, and 75% calibration achievement systems. Averaged across all samples and cells, these proportions equal 0.87, 0.84, 0.79, and 0.70, respectively, implying that the overall odds that PO systems outperform (i.e., have a higher validation quality achievement value) 0%, 25%, 50%, and 75% systems at the single application level are 6.69, 5.25, 3.76, and 2.33 to one, respectively. These odds clearly show that PO selection systems not only maintain the highest validation quality achievement level, but also are much more likely to perform better than the sub-PO systems when applied to the same single selection application.
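The conversion from the reported proportions to the reported odds follows the usual relation odds = p / (1 - p). The short sketch below reproduces the four values from the proportions given above; the dictionary keys and the function name `odds` are illustrative.

```python
def odds(p):
    """Odds in favor of an event that occurs with proportion p."""
    return p / (1.0 - p)

# Proportions with which PO systems do at least as well as each sub-PO system.
proportions = {"0%": 0.87, "25%": 0.84, "50%": 0.79, "75%": 0.70}

for label, p in proportions.items():
    print(f"PO versus {label} system: proportion {p:.2f}, odds {odds(p):.2f} to one")
```

The computed odds are 6.69, 5.25, 3.76, and 2.33 to one, matching the values in the text.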
From a practical perspective, the comparison in terms of robustness, sensitivity, and sampling variability between the PO and the sub-PO systems showed that selection practitioners may expect a substantially better and a less variable validation quality achievement when implementing a PO instead of a sub-PO system. In the next section we study whether PO systems also maintain an advantage when compared to simpler unit weighed designs.
Comparing the Robustness, Sensitivity, Sampling Variability and Validation Quality Achievement at the Same, Single Application Level of PO and Unit Weighed Systems
The ANOVA to explore the comparative robustness and sensitivity of PO and unit weighed selection systems, using the full model of all 10 factors, explains 48.5% of the variance of the validation quality achievement of the systems. Only four of the effects related to the selection system type factor contribute at least 1% to the explained variance: the main effect of system type (13.8%), and the first order interaction of selection system type with the selection environment (1.8%), the applicant pool size (1.5%), and selection system diversity (3.6%). Table 5 summarizes the average validation quality achievement values associated with these effects. Except for the latter selection system type by selection system diversity interaction effect, the averages related to the other effects essentially repeat the findings reported in the previous section, albeit this time with respect to the unit weighed systems: PO systems show a higher validation quality achievement than unit weighed systems; the difference grows for larger pool sizes and varies across selection environments.
The interpretation of the selection system type by selection system relative diversity interaction effect is less straightforward, however, because the relative diversity level of the unit weighed systems is inevitably confounded with the level of calibration quality achievement of these systems. Whereas PO systems have, by definition, 100% calibration quality achievement, the unit weighed systems have a calibration quality achievement that varies across the levels of the relative diversity factor. The selection system type by selection system relative diversity interaction may therefore very well reflect this difference in calibration quality achievement rather than indicate that the unit weighed systems have a different validation quality achievement pattern across the levels of the relative diversity factor as compared to that of the PO systems.
The study comparing the PO and unit weighed systems also included the above detailed analyses focusing on (a) the susceptibility to within-cell sampling variability of the two systems and (b) the likelihood that the PO systems show a better validation quality achievement than the unit weighed systems at the same, single application level. By and large, both analyses result in essentially the same findings as the corresponding studies comparing PO and non-PO systems. Thus, the first additional analysis reveals that PO systems show substantially less sampling variability than the unit weighed systems (cf. Figure 4) and that the difference in sampling variability between the two systems varies across selection environments and across the levels of the relative diversity and the applicant pool size factors. However, the average interquartile range values associated with these interactions never indicate that PO systems have a larger within-cell sampling variability. The variability in the calibration quality achievement of the unit weighed systems, as compared to the corresponding fixed 100% achievement of the PO systems, probably explains why the latter systems exhibit a smaller within-cell sampling variability of the validation quality achievement values.
In turn, the second additional analysis results in an overall proportion of 0.75 that PO systems have a better validation quality achievement than unit weighed systems when applied in the same setting. The results of this and the previous analyses therefore warrant the conclusion that PO systems not only outperform variable weight sub-PO systems (i.e., systems using variable weights for the predictors in forming the predictor composites), but also fixed, and in particular, unit weight systems.
Comparing the Validation Trade-Off Achieved by the Different Selection Systems
Thus far, all analyses and results focus on the new validation quality achievement measure as the criterion for evaluating the merits under general validation conditions of PO, sub-PO, and unit weighed systems. Yet, despite the advantage of using this measure instead of other possible gauges it remains true that selection practitioners will often also be interested in the merits of the different selection systems as operationalized by the diversity/quality trade-off value achieved by the systems under such general validation conditions. In particular, they may wonder whether PO systems, when applied in validation conditions, are expected to result in a better diversity/quality trade-off (i.e., a trade-off where the value of the PO system on one of the objectives is higher than the corresponding value of the non-PO system, whereas the value of the PO system on the other objective is at least as high as the corresponding value of the non-PO system) as compared to the one achieved by a non-PO system in the same setting.
To clarify whether or not this is the case, we conducted a final analysis. For each of the total of 7,776,000 studied sample selections we registered whether the trade-off achieved by the PO system, as compared to the trade-off achieved by the corresponding 0%, 25%, 50%, and 75% calibration achievement systems, is better, worse or incomparable. We found that the percentages with which the PO systems result in a better trade-off than the 0%, 25%, 50%, and 75% systems equal 50%, 48%, 47%, and 43%, respectively, whereas the corresponding percentages with which the PO systems result in a poorer trade-off are equal to 6%, 8%, 10%, and 15%. In the remaining 44%, 44%, 43%, and 42% of the comparisons with the 0%, 25%, 50%, and 75% systems the trade-offs were incomparable in that neither trade-off is better than the other trade-off. Compared to the unit weighed systems, the PO systems result in a better (worse) trade-off in 48% (12%) of all cases, with 40% incomparable results. Note that the substantial percentage of incomparable outcomes once again illustrates that an evaluation and comparison of both PO and non-PO systems solely on the basis of the trade-off that these systems achieve under validation conditions is quite unsatisfactory and that achievement measures such as proposed in the article are necessary to achieve this purpose.
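The better/worse/incomparable classification used in this analysis is a Pareto dominance check on two (quality, diversity) pairs. A minimal sketch, assuming both objectives are to be maximized, is given below; the function name `compare_tradeoffs` and the example values are hypothetical and are not taken from the study.

```python
def compare_tradeoffs(po, other):
    """Classify the PO trade-off relative to another system's trade-off.

    Each argument is a (quality, diversity) pair; higher is better on both.
    Returns "better", "worse", or "incomparable".
    """
    po_q, po_d = po
    ot_q, ot_d = other
    po_at_least = po_q >= ot_q and po_d >= ot_d
    other_at_least = ot_q >= po_q and ot_d >= po_d
    # Better: at least as high on both objectives, strictly higher on one.
    if po_at_least and not other_at_least:
        return "better"
    if other_at_least and not po_at_least:
        return "worse"
    # Neither dominates (including exact ties on both objectives).
    return "incomparable"

# Hypothetical (quality, diversity) outcomes for a single sample selection.
print(compare_tradeoffs((0.50, 0.20), (0.40, 0.20)))  # better
print(compare_tradeoffs((0.40, 0.20), (0.50, 0.30)))  # worse
print(compare_tradeoffs((0.50, 0.10), (0.40, 0.20)))  # incomparable
```

Because a gain on one objective that comes with a loss on the other counts as incomparable, a sizable share of incomparable outcomes is to be expected under this criterion, which is the limitation noted above.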
Discussion
When learning about PO selection design, and the decision aid for deriving these designs in particular, selection experts and practitioners may question whether PO systems will live up to expectations when implemented in a large variety of validation selection situations. These doubts can never be resolved conclusively because every future selection application harbors a number of inherent uncertainties. That said, the present article offers a theoretical as well as a practical contribution that together succeed in generating rather convincing evidence to decide on the issue of the validation achievement of PO selection system design. From a theoretical perspective, the article introduces two new gauges for expressing the achievement of PO and other selection systems when applied under almost any type of validation condition. Compared to previous approaches, the validation quality and the validation diversity achievement measures permit an adequate, unbiased, and intuitively appealing assessment and comparison of the validation achievement of both PO and non-PO selection system designs.
From a practical perspective, the article presents two novel procedures for computing the validation achievement of any selection system design as applied to virtually any selection situation, involving either finite applicant pool or infinite applicant population conditions. These procedures prove reliable, except for the evaluation of the validation diversity achievement in validation situations related to finite applicant pools. Also, the procedure for studying validation achievement with respect to applicant population validation conditions is made available to other researchers and practitioners. We encourage others to use the procedure because the analyses reported in the online material indicate that there is no real alternative short cut procedure that can provide a more easily obtainable estimation of the validation achievement of selection systems, even in the case of applicant population validation conditions. Finally, it is shown how the new procedures can be integrated within a factorial design to provide answers not only about the validation achievement of PO selection system design in a wide variety of validation conditions, but also, and even more importantly, about the major key issues addressed in the article: Do PO selection systems result in a higher validation achievement than non-PO systems, and is this higher achievement consistent across a large variety of validation conditions?
Are PO Selection Systems to be Preferred to Non-PO Systems?
Given our results, the answer to the above question is strongly in favor of a “yes.” In particular, we found an overall difference in validation quality achievement between the PO and the corresponding unit weighed systems of 0.16 (cf. Table 5). Using the procedure outlined in the section “Measuring the Achievement of Selection Systems in the Validation Condition,” this difference corresponds to an overall difference in expected job performance of 0.10 standard units. Although this may not seem impressive at first, this is the same difference one may expect to obtain when switching from a predictor with a rather low validity of 0.30 to a predictor with a substantially higher validity of 0.42 when performing a selection with a 0.20 selection rate. Also, the gain of 0.10 standard units in average job performance did not come at the expense of a lower minority selection rate because both the PO and the unit weighed systems showed a virtually identical overall value (i.e., 0.166 versus 0.164 for the PO and the unit weighed systems) for this selection rate.
The validation quality achievement of PO systems also consistently and substantially exceeds that of other non-PO systems. Furthermore, and although the sampling variability of the validation quality achievement level may be quite large for both PO and non-PO systems, the odds that PO systems have a higher validation quality achievement than the sub-PO or the unit weighed systems when they are all applied to the same setting are well above two to one in virtually all studied validation conditions. Finally, we found no evidence confirming that unit weighed systems are more robust and/or less sensitive than variable weight systems and PO selection systems in particular.
Observe that the present results substantially extend previous findings about the merits of PO selection systems. Whereas all former findings relate to the behavior of these systems in validation contexts involving the total applicant population, the present results inform about the achievement in (small, medium, etc.) sample validation conditions that are the real center of interest in an applied setting like personnel selection. In addition, the new measures used to assess the merits of the different selection systems avoid the deficiencies associated with the previously used methods.
Summing up, the message of the present analyses should be clear. If the goals of both selection quality and diversity are of importance, at least approximate data on the predictor/criterion characteristics are available, and provided that the design of the selection process is not entirely fixed by the constraints of the selection situation, practitioners have good reasons to implement a PO selection system design. Any selection design boils down to a decision that is to be made and according to the results of the present study, PO selection designs represent the decision with the most favorable expected consequences. When probed under a large variety of validation conditions, these designs show the best validation quality achievement. In addition, when applied to the same single selection, the odds that PO systems attain a higher validation quality achievement level than the non-PO systems are favorable, exceeding two to one in almost all applications. Finally, compared to non-PO systems, the implementation of PO systems results more often in a quality/diversity trade-off that is better than the trade-off achieved by these other systems.
Limitations of the Study and Avenues for Further Research
Let us start by repeating that neither the present nor any following study can produce conclusive and final answers about the robustness and sensitivity of PO selection systems. Although the design of the study aimed for a comprehensive inclusion of the factors that may affect the robustness/sensitivity of these systems, using representative levels for these factors, certain possibly important factors may have been omitted and other more pertinent specifications for the levels of the factors (e.g., smaller applicant pool sizes and higher selection rates than those considered to reflect current labor shortages) may have been missed. Thus, future studies could be designed to provide more detailed answers about the features of the selection environment that either favor or impede the robustness/sensitivity of PO systems. Although we varied the nature of the selection environment and, in particular, the number of available predictors in the different environments, other features, related to, for example, the specific blend of available predictors (e.g., the set of predictors is quite homogeneous, with all predictors assessing either the same construct or different lower order variations of the same construct, or more heterogeneous, focusing on a mixture of different constructs), the application context and the demarcation of the set of feasible selection systems, were not really considered systematically. To this end, a number of smaller scale studies, focusing on only one or a limited number of these environment features (e.g., comparing single-stage versus multistage environments, keeping all other features constant), may be more appropriate. The present results, showing that none of the interaction effects of the selection environment factor with the other studied factors explains a sizable portion of the PO system variability, indicate that such smaller scale studies may indeed be adequate.
As a second limitation, the study does not present results on the validation diversity achievement of the selection systems, primarily because of as yet unresolved technical issues in the computation of the measure in validation conditions with finite applicant pool sizes. We also note that, even without these issues, the measure remains somewhat problematic for studying the validation achievement in small applicant pool validation conditions because its value will often be undefined in these conditions. Also, the online material reports results about both the quality and diversity achievement in case of validation conditions related to applicant populations. These results confirm the finding that lower diversity PO systems (and, hence, higher quality PO systems) tend to show a higher validation quality achievement as compared to the higher diversity PO systems. In addition, they suggest a rather opposite trend with respect to the validation diversity achievement of the systems: Lower diversity PO systems have a lower validation diversity achievement. These results once again underscore that focusing on the level of diversity and quality shrinkage between the calibration and the validation condition as proposed by Song et al. (2017) may result in users forming a rather poor picture of the true merits under validation of the selection systems. The online supplement study implements side by side both the Song et al. shrinkage and the novel validation achievement calculations, showing that higher rather than lower levels of quality (diversity) shrinkage correspond to higher validation quality (diversity) achievement.
Whereas the present study confirms this for validation conditions involving finite applicant pools with respect to the quality dimension, future studies should consider the diversity dimension as well, provided that the difficulties regarding the computation of the validation diversity achievement in finite applicant pool validation conditions can be resolved.
As a further possibility for future study, it would be interesting to assess how the validation achievement of PO systems is affected by common realities such as the refusal of job offers or the drop-out of candidates during the selection process. Several variants of job refusal/drop-out could be considered in the validation condition, paying attention to, among others, the differential effect of random versus systematic forms of candidate self-selection where job refusal/drop-out is related to the quality of the candidates.
Adding the sample-to-sample cross-validation approach more explicitly to the study design constitutes a final avenue for future research. To achieve this purpose only one major extension is required. Instead of computing the PO and other systems only once, the systems (and the corresponding calibration trade-offs) should be computed with respect to a large number of (calibration) sample-based data as done by Song et al. (2017). The sample-to-sample cross-validity research question can then be addressed by invoking the above outlined procedure for computing the validation achievement of the systems when applied to a large number of finitely sized validation samples and by averaging the thus obtained achievement values.
Conclusion
The article reports a massive simulation study investigating whether PO systems live up to expectations when implemented under a large variety of validation selection conditions. Although by no means conclusive, the obtained results nevertheless converge to the conclusion that PO systems, as derived by the psychometric approach proposed by De Corte et al. (2007, 2011), are indeed expected to outperform other non-PO systems, including unit weighed systems. The results therefore add substantial weight to the advice that selection practitioners and researchers should consider applying PO selection designs whenever possible. Otherwise they may face complaints and even legal actions because plaintiffs can argue quite convincingly, not only on the formal grounds implied by the formulas of the psychometric approach, but also based on the results of the present study, that a better design was indeed possible.
Supplemental Material
Supplemental Material, corm04revonlinemat for Robustness, Sensitivity, and Sampling Variability of Pareto-Optimal Selection System Solutions to Address the Quality-Diversity Trade-Off by Wilfried De Corte, Paul R. Sackett and Filip Lievens in Organizational Research Methods
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The computational resources and services used in this work were provided to the first author by the VSC (Flemish Supercomputer Center), funded by the Research Foundation–Flanders (FWO) and the Flemish Government–Department EWI.
Supplemental Material
Supplemental material for this article is available online.
