A Demonstration of Mokken Scale Analysis Methods Applied to Cognitive Test Validation Using the Egyptian WAIS-IV

Abstract

The fourth edition of the Wechsler Adult Intelligence Scale (WAIS-IV) has been used extensively for assessing adult intelligence. This study uses Mokken scale analysis to investigate the psychometric properties of WAIS-IV subtests adapted for the Egyptian population in a sample of 250 adults between 18 and 25 years of age. The monotone homogeneity model and the double monotonicity model were consistent with the subtest data. The items of all subtests except Matrix Reasoning, Information, Similarities, and Vocabulary formed a unidimensional scale. The WAIS-IV subtests have discriminatory and invariantly ordered items, although some items violated the invariant item ordering and scalability criteria. Therefore, the WAIS-IV subtests, with the exception of some items, are hierarchical scales that allow items to be ordered according to difficulty and subjects to be ordered using the sum score. In conclusion, the current study provides evidence of the dimensionality and hierarchy of the WAIS-IV subtests in the framework of Mokken scaling, although care should be taken when interpreting or including certain items.

Keywords

Mokken, WAIS-IV, item response theory, intelligence

The Wechsler Adult Intelligence Scale (WAIS) is one of the most widely used scales for assessing the cognitive abilities of adults and older adolescents (Benson, Hulac, & Kranzler, 2010; Salthouse & Saklofske, 2010). The most recent version of the WAIS (Wechsler Adult Intelligence Scale–Fourth Edition [WAIS-IV]; Wechsler, 2008) updates the base theory of the intelligence construct, assuming it to be a general concept comprising four indices: Verbal Comprehension (VC), Perceptual Reasoning (PR), Working Memory (WM), and Processing Speed (PS; Saklofske et al., 2012). The WAIS-IV takes into consideration current concepts of fluid reasoning, WM, and PS from the Cattell–Horn–Carroll model and contains 15 subtests, five of which are supplemental (Bowden, Saklofske, & Weiss, 2011a; Kaufman, Salthouse, Scheiber, & Chen, 2016). For each subtest, it is assumed that items are administered in ascending order of difficulty; consequently, the start points, reversal rules, basal rules, and discontinue rules are used to reduce the administration time and estimate participants’ sum scores without having to apply all the test items (Climie & Rostad, 2011; Weiss, Saklofske, Coalson, & Raiford, 2010).

Several recent studies, including Abdelhamid, Gómez-Benito, Abdeltawwab, Abu Bakr, and Kazem (2019); Bowden, Saklofske, and Weiss (2011b); Miller, Davidson, Schindler, and Messier (2013); and Nelson, Canivez, and Watkins (2013), have explored the validity of the WAIS-IV using statistical tools based on classical test theory, such as internal consistency reliability, factor analysis, and confirmatory factor analysis. Together, these studies provide important insights into the statistical properties of the WAIS-IV, but their conclusions depend on the total score for the subtests in the framework of classical test theory, which has many acknowledged limitations. One such limitation is that the item parameter is dependent on the sample properties and the person parameter is dependent on the specific selection of items in a test (Embretson, 1996).

Sijtsma, Emons, Bouwmeester, Nyklíček, and Roorda (2008) argued that an instrument should meet two requirements to be an efficient measure. First, the true number of dimensions measured must be clear (e.g., one dimension or multidimensional). In the light of this requirement, if each subtest of the WAIS-IV assesses one dimension, the sum score of subtest items can be calculated to determine the adult’s level on the latent trait being measured. However, if the subtest encompasses two or more dimensions, it is necessary to estimate the sum score for each dimension that reflects a feature of the latent trait measured. The second requirement is that the psychometric properties of the items must be accurately estimated, as they are necessary in ensuring that the difficulty and discriminatory power are accurate.

In the literature, there is no discussion of the properties of WAIS-IV subtests adapted for Arabic speakers (Abdelhamid et al., 2019) in terms of individual item scores or, in particular, the dimensionality of each subtest. To our knowledge, no study has focused on item statistical properties or dimensionality in each of the WAIS-IV subtests using modern psychometric theory, such as item response theory (IRT) or Mokken scale analysis (MSA), which represents the main aim of the current study.

MSA is a nonparametric procedure that provides a series of methods for examining the relationship between items and the latent traits being measured and for investigating hierarchies of items in measures (Watson et al., 2012). It has some advantages over parametric procedures. First, MSA places fewer restrictions on the item response function (IRF) than parametric IRT models, which impose a specific shape such as the S-shaped logistic curve (Sijtsma & Van der Ark, 2017). This helps researchers to retain items that would otherwise be omitted from a measure under restrictive parametric IRT models. Second, MSA provides a set of exploratory tools for dimensionality analysis, which is not possible in parametric IRT models (Emons, Sijtsma, & Pedersen, 2012).

Nonparametric Item Response Theory (NIRT) Models

MSA uses a set of methods that assess the fit of two NIRT models: the monotone homogeneity model (MHM) and its special case, the double monotonicity model (DMM). The MHM and DMM share the assumptions of unidimensionality, monotonicity, and local independence, whereas the DMM adds the assumption of nonintersection of IRFs (Sijtsma & Van der Ark, 2017).

The MHM indicates that each item exhibits a monotonic and positive relationship with the latent variable (Emons et al., 2012). The MHM uses the sum score $X_{+}$ to order individuals according to their abilities; for a set of items with monotone homogeneity, the order of individuals on the latent variable is the same for any of the MHM items (Sijtsma & Molenaar, 2002). For a set of items that fit the DMM, the order of the items according to difficulty (i.e., mean score) is additionally the same for all subjects (Mokken, 1997). These models were extended for polytomous items by Molenaar (1997), who proposed the polytomous MHM and the polytomous DMM; a DMM was also proposed by Sijtsma, Debets, and Molenaar (1990).

MSA uses three scalability coefficients: the item-pair scalability coefficient ($H_{i j}$), the item scalability coefficient ($H_{i}$), and the scale total scalability coefficient ($H$). The item-pair scalability coefficient $H_{i j}$ is defined as the ratio of the covariance between any pair of items i and j to their maximum possible covariance given the marginal distributions of the two item scores (Mooij, 2012), reflecting the internal consistency of each pair of items. The item scalability coefficient $H_{i}$ is the ratio of the sum of all pairwise covariances involving item i to the sum of all pairwise maximum covariances involving item i, summarizing the accuracy of item discrimination and the strength of the relationship between the item and the whole set of items (e.g., the trait scale; Emons et al., 2012). Higher values of $H_{i}$ indicate higher discriminatory power. The scale total scalability coefficient $H$ is the ratio of the sum of all pairwise covariances to the sum of all pairwise maximum covariances (Mooij, 2012), reflecting the relationship between the sum score and the trait scale. Higher values of $H$ indicate that the total score orders individuals more accurately. A set of items is considered to be a Mokken scale when all values of the item-pair scalability coefficient are positive and all item scalability coefficients are greater than .30 (Watson et al., 2012).
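The three ratios defined above can be illustrated with a short computation. The following is a minimal Python sketch, not the implementation used by the mokken package; the function names (`cov`, `cov_max`, `scalability`) and the small response matrix are hypothetical. It assumes that the maximum covariance given the marginals is obtained by pairing the sorted scores of the two items, which for 0/1 items reduces to $\min(p_i, p_j) - p_i p_j$.

```python
from itertools import combinations


def cov(x, y):
    """Population covariance of two equally long score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def cov_max(x, y):
    """Largest covariance attainable with the observed marginals.

    Pairing the sorted scores of both items gives the maximum.
    """
    return cov(sorted(x), sorted(y))


def scalability(data):
    """Return (H_ij dict, H_i list, H) for a respondents-by-items matrix.

    Assumes every item has nonzero variance; otherwise a maximum
    covariance is zero and the ratio is undefined.
    """
    items = list(zip(*data))
    pairs = list(combinations(range(len(items)), 2))
    c = {p: cov(items[p[0]], items[p[1]]) for p in pairs}
    c_max = {p: cov_max(items[p[0]], items[p[1]]) for p in pairs}
    h_ij = {p: c[p] / c_max[p] for p in pairs}
    h_i = [
        sum(c[p] for p in pairs if i in p) / sum(c_max[p] for p in pairs if i in p)
        for i in range(len(items))
    ]
    h = sum(c.values()) / sum(c_max.values())
    return h_ij, h_i, h


# Hypothetical 0/1 responses forming a perfect Guttman pattern.
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
h_ij, h_i, h = scalability(guttman)
print(h_ij, h_i, h)
```

For a perfect Guttman pattern such as this one, every observed covariance equals its maximum, so all coefficients equal 1 up to floating-point rounding; response patterns that contradict the item ordering pull the coefficients below 1.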

Assumptions of NIRT Models

The unidimensionality assumption indicates that all items measure the same latent variable (denoted by $θ$; Straat, Van der Ark, & Sijtsma, 2013). MSA proposes an automated item selection procedure (AISP) to select sets of items that measure the same trait (Mokken, 1971). Straat et al. (2013) described an item selection procedure based on an objective function using a genetic algorithm (GA), which examines all the possible partitions of the item pool and reports the partition that best represents Mokken’s objective (i.e., to select a set of sufficiently discriminating items in each cluster). The second assumption is local independence, according to which a person’s response on any item i is independent of his or her responses on any other item j, conditional on $θ$ (Sijtsma & Molenaar, 2002); for example, a person’s response to one item is not affected by the score on another item. Mokken (1997) showed that sampling independence or independence of responses between individuals reflects that item parameter estimation is independent of the sample used. The third assumption is monotonicity of the IRF, which means that IRFs are monotone nondecreasing functions of the latent trait $θ$ (Mokken, 1971). The IRF depicts the relationship between the probability of a correct response on item $X_{j}$ and the latent trait level (i.e., a higher latent trait level corresponds to a higher expected item score). Finally, the nonintersection assumption indicates that IRFs do not intersect. For dichotomous data, it implies invariant item ordering (IIO; Sijtsma, Meijer, & Van der Ark, 2011); when the IIO assumption is satisfied for a set of items, these items form a hierarchical scale from easiest to most difficult.
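The assumptions described above can be stated compactly for dichotomous items. The display below is a sketch in standard NIRT notation rather than a formulation taken from the cited sources; the symbols $k$ (number of items), $x_j$ (a realized item score), and the numbering of items from most to least difficult in the last line are notational assumptions introduced here. Unidimensionality corresponds to $\theta$ being a single scalar.

$$
\begin{aligned}
&\text{Local independence:} && P(X_1 = x_1, \ldots, X_k = x_k \mid \theta) = \prod_{j=1}^{k} P(X_j = x_j \mid \theta) \\
&\text{Monotonicity:} && \theta_a \le \theta_b \;\Rightarrow\; P(X_j = 1 \mid \theta_a) \le P(X_j = 1 \mid \theta_b) \\
&\text{Invariant item ordering:} && E(X_1 \mid \theta) \le E(X_2 \mid \theta) \le \cdots \le E(X_k \mid \theta) \quad \text{for all } \theta
\end{aligned}
$$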

Numerous studies have used MSA to examine the psychometric properties of various tests. However, despite the advantages of the MHM and its special case, the DMM, described above (e.g., Sijtsma & Van der Ark, 2017), to our knowledge there is no published calibration of the WAIS-IV scale using MSA. As far as we are aware, the current study is also the first to examine the IIO and dimensionality of the WAIS-IV subtests adapted for an Egyptian sample (Abdelhamid et al., 2019) using these two NIRT models. It therefore offers an opportunity to advance the understanding of Mokken analysis and to assess the fit of IRT models for the WAIS-IV subtests analyzed. Our study has three purposes: (a) to evaluate the dimensionality of each of the WAIS-IV subtests using MSA, (b) to estimate whether the hierarchy of the items in each of the WAIS-IV subtests could be established with IIO, and (c) to check the ability of the items in each of the subtests analyzed to identify individual differences on the latent trait measured.

Method

Participants

Two hundred fifty normal adults agreed to participate voluntarily in this study. Once informed consent had been received, the participants were tested from 2015 to 2016 across Egypt. Participants were aged between 18 and 24 years, with an overall mean age of 20.6 years (SD = 1.7 years), and more than half the sample (62.7%) was female. All participants were native speakers of Egyptian Arabic and were evaluated individually by psychologists and educators who had received prior training in the application of the scale, based on guidelines explained in the administration manual (Wechsler, 2008). The research reported in this study was part of a project to adapt the WAIS-IV to Arabic speakers, and permission was obtained from the ethics committee of Fayoum University.

Measures

The WAIS-IV Arabic version was used (Abdelhamid et al., 2019), which, like the English version, has 10 core subtests and five supplemental subtests that generate scores for four indices: VC, PR, WM, and PS. Some of these subtests are scored 0, 1 (i.e., dichotomous data). For PR, the subtests analyzed were Visual Puzzles (26 items), Figure Weights (27 items), Picture Completion (24 items), and Matrix Reasoning (26 items); for WM, Arithmetic (22 items); for VC, Information (26 items); for PS, Symbol Search (60 items) and Coding (135 items). Others are scored polytomously (i.e., 0, 1, 2, etc.): for VC, the subtests Similarities (18 items) and Vocabulary (30 items); for PR, Block Design (14 items); for WM, Digit Span, which has three subscales (Digit Span-Forward [eight items], Digit Span-Backward [eight items], and Digit Span-Sequencing [eight items]), and Letter–Number Sequencing (10 items); for PS, Cancellation (two items).

The WAIS-IV subtests can also be classified as verbal and nonverbal. The verbal tests, such as Similarities, Vocabulary, Arithmetic, and Information, contain some changes to ensure that they are suitable for the Arabic-speaking population. By contrast, nonverbal tests such as Visual Puzzles, Figure Weights, Picture Completion, Matrix Reasoning, Block Design, and Symbol Search were unchanged. In the Letter–Number Sequencing subscale, the English alphabet and numbers were converted to Arabic script, taking into account the order of letters and numbers during item adaptation. For the Digit Span-Forward, Digit Span-Backward, and Digit Span-Sequencing subscales, which only contain numbers, the numerals were converted to the Arabic equivalents. It should be noted that the WAIS-IV subtests contain a set of easy items that do not fit the sample used in this study and are, therefore, omitted from the analysis.

Data Analysis

The R package mokken, version 2.8.2 (Van der Ark, 2016), was used to analyze the data for WAIS-IV subtests. The Mokken scalability coefficients $H_{i j}$, $H_{j}$, and $H$ were examined following the criteria proposed by Mokken (1971): A scale is considered weak if .30 ≤ $H$ < .40, medium if .40 ≤ $H$ < .50, and strong if $H$ ≥ .50; the values of $H_{i j}$ must be greater than zero; and, finally, if $H_{j}$ < .30, item j should be reviewed or deleted, but if $H_{j}$ ≥ .30, item j should be selected to make up the Mokken scale. In addition to the above coefficients, the dimensionality of each subtest was examined with the genetic algorithm, using a range of lower bound values (Straat et al., 2013); the lower bound c indicates the minimum value of discrimination ($H_{j}$) for items that make up the Mokken scale. We started with an initial value of c = 0, which was increased in increments of .05 up to a value of .55, as recommended by Sijtsma and Molenaar (2002). The scale is unidimensional when all items are selected in one scale for a lower bound of c ≤ .30; as c increases beyond this value, items are expected to drop out of the scale progressively. The IIO assumption was investigated using the manifest IIO method, by counting the number of violations, and the backward item selection method, which removes items in violation of IIO. In addition, we considered the H-transpose index ($H^{T}$), which estimates the distance between IRFs; the greater the distance between IRFs, the more accurate the item ordering.
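The decision rules in this paragraph can be written down directly. The Python sketch below only encodes the thresholds quoted above and does not reproduce the item selection procedure or the genetic algorithm; the function names and the "unscalable" label for $H$ < .30 are assumptions made for illustration. The example values are taken from Tables 2 and 3.

```python
def classify_h(h):
    """Label a scale by its total scalability H (thresholds from the text)."""
    if h < 0.30:
        return "unscalable"  # assumption: label for H below the weak range
    if h < 0.40:
        return "weak"
    if h < 0.50:
        return "medium"
    return "strong"


def flag_items(item_h, cutoff=0.30):
    """Return the item labels whose H_j falls below the cutoff."""
    return [item for item, h_j in item_h.items() if h_j < cutoff]


# Lower bounds c = .00, .05, ..., .55 used for item selection.
lower_bounds = [round(0.05 * step, 2) for step in range(12)]

# Total H values reported in Table 2 for three subtests.
for subtest, h in {"VP": 0.48, "SIM": 0.30, "BD": 0.59}.items():
    print(subtest, classify_h(h))

# H_j values reported in Table 3 for Matrix Reasoning Items 6 to 10.
print(flag_items({6: 0.01, 7: 0.19, 8: 0.23, 9: 0.37, 10: 0.39}))
```

Running an item selection procedure once per value in `lower_bounds` and recording how many scales emerge at each step gives the kind of summary reported in the last two columns of Table 2.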

Thus, the statistic $H^{T}$ was reported as an indicator of the accuracy of the IIO on the basis of the following criteria: $H^{T}$ < .30 signifies that the item ordering is inaccurate, .30 ≤ $H^{T}$ < .40 signifies low accuracy, .40 ≤ $H^{T}$ < .50 signifies medium accuracy, and $H^{T}$ ≥ .50 signifies high accuracy (Ligtvoet, Van der Ark, te Marvelde, & Sijtsma, 2010). Finally, the Crit value proposed by Sijtsma and Molenaar (2002) to check the effect size of a significant violation was used with the following criteria: if Crit < 40, then violation is minor; if 40 ≤ Crit < 80, violation is nonserious but must be reviewed by the researcher; and if Crit ≥ 80, violation is serious (Van Schuur, 2011).
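As an illustration of these criteria, the Python sketch below computes $H^{T}$ and applies the cutoffs quoted above. It assumes the usual definition of $H^{T}$ as the total scalability coefficient computed on the transposed data matrix, so that respondents take the role of items; it is not the mokken package's implementation, and the function names and the small response matrix are hypothetical.

```python
from itertools import combinations


def cov(x, y):
    """Population covariance of two equally long score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def coef_h(matrix):
    """Total H, treating each column of `matrix` as one unit to be scaled."""
    columns = list(zip(*matrix))
    observed = maximum = 0.0
    for a, b in combinations(columns, 2):
        observed += cov(a, b)
        # Maximum covariance given the marginals: pair the sorted scores.
        maximum += cov(sorted(a), sorted(b))
    return observed / maximum


def coef_ht(data):
    """H^T: H computed on the transposed respondents-by-items matrix."""
    return coef_h(list(zip(*data)))


def classify_ht(ht):
    """Accuracy of the item ordering, using the cutoffs quoted in the text."""
    if ht < 0.30:
        return "inaccurate"
    if ht < 0.40:
        return "low"
    if ht < 0.50:
        return "medium"
    return "high"


def classify_crit(crit):
    """Seriousness of a violation, using the Crit cutoffs quoted in the text."""
    if crit < 40:
        return "minor"
    if crit < 80:
        return "nonserious, review"
    return "serious"


# Hypothetical 0/1 responses forming a perfect Guttman pattern.
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(classify_ht(coef_ht(guttman)))

# H^T values reported in Table 2 for Vocabulary and Similarities.
print(classify_ht(0.38), classify_ht(0.65))
```

Respondents who answer every item identically contribute zero to both sums in `coef_h`, so they do not affect the ratio; the function is undefined only when no pair of respondents has a positive maximum covariance.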

To assess the reliability of the scale, we estimated three reliability coefficients for each subtest of the WAIS-IV: the lambda-2 statistic (Sijtsma, 2009), as an alternative to Cronbach’s alpha; the Molenaar–Sijtsma statistic, as a reliability estimator with a smaller bias for MSA (MS; Sijtsma & Molenaar, 2002); and the latent class reliability coefficient (LCRC), an unbiased statistic of test-score reliability (Van der Ark, Van der Palm, & Sijtsma, 2011).
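Of the three coefficients, lambda-2 is the simplest to compute by hand. The Python sketch below implements Guttman's lambda-2 alongside Cronbach's alpha for comparison, using only the inter-item covariance matrix; the MS and LCRC coefficients require additional machinery and are not shown. The function names and the response matrix are hypothetical.

```python
from math import sqrt


def covariance_matrix(data):
    """Population covariance matrix of a respondents-by-items matrix."""
    items = list(zip(*data))
    n, k = len(data), len(items)
    means = [sum(col) / n for col in items]
    return [
        [
            sum((items[i][r] - means[i]) * (items[j][r] - means[j]) for r in range(n)) / n
            for j in range(k)
        ]
        for i in range(k)
    ]


def lambda2(data):
    """Guttman's lambda-2 reliability estimate."""
    s = covariance_matrix(data)
    k = len(s)
    total = sum(sum(row) for row in s)  # variance of the sum score
    off = [s[i][j] for i in range(k) for j in range(k) if i != j]
    return (sum(off) + sqrt(k / (k - 1) * sum(v * v for v in off))) / total


def cronbach_alpha(data):
    """Cronbach's alpha, shown only as a point of comparison."""
    s = covariance_matrix(data)
    k = len(s)
    total = sum(sum(row) for row in s)
    trace = sum(s[i][i] for i in range(k))
    return k / (k - 1) * (1 - trace / total)


# Hypothetical 0/1 responses from five respondents to three items.
responses = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 0]]
print(lambda2(responses), cronbach_alpha(responses))
```

Lambda-2 is never smaller than alpha for the same data, which is one reason it is offered as an alternative lower bound to reliability.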

Results

Descriptive Statistics and Reliability

The majority of participants were students who were not employed and were not married; only around 22% were employed and married. With regard to educational attainment, most participants were at different stages of undergraduate study (first year, 22%; second year, 34.8%; third year, 9.2%; and fourth year, 8.8%), and the remainder held an undergraduate or postgraduate degree (25.2%).

As shown in Table 1, the normality assumption held for the WAIS-IV subtests; skewness and kurtosis values were within the rule of thumb. Table 1 also provides the results obtained with the three reliability coefficients for the WAIS-IV subtests. Overall, the subscales were shown to have good internal consistency: The values of the three coefficients ranged from .68 to .99.

Table 1.

Descriptive Statistics and Internal Consistency for the Wechsler Adult Intelligence Scale-IV Subtests.

Scale (factor)	Number of items	M range	Skewness	Kurtosis	Reliability (MS)	Reliability (λ₂)	Reliability (LCRC)
VP (PR)	26	0.005-0.995	0.35	−0.780	.86	.86	.91
FW (PR)	27	0.085-0.979	−0.052	−0.451	.89	.88	.93
PC (PR)	24	0.007-0.993	0.485	−0.103	.86	.85	.92
MR (PR)	26	0.068-0.996	−0.221	−0.221	.86	.85	.89
AR (WM)	22	0.047-0.995	0.328	−0.728	.86	.86	.91
INF (VC)	26	0.005-0.995	0.330	−0.145	.81	.82	.89
SIM (VC)	18	0.09-1.94	0.60	0.259	.73	.73	.80
V (VC)	30	0.21-1.84	−1.008	0.076	.88	.88	.91
BD (PR)	14	0.20-3.40	1.24	1.31	.69	.75	.82
DS (WM)	24	0.07-1.79	0.363	−0.566	.84	.84	.88
LN (WM)	10	0.04-2.99	0.682	0.409	.72	.68	.76
SS (PS)	60	0.01-0.99	0.424	1.20	.94	.92	.96
CD (PS)	135	0.004-0.995	−0.189	−0.211	.98	.96	.99

Note. MS = Molenaar–Sijtsma; λ₂ = lambda-2; LCRC = latent class reliability coefficient; VP = Visual Puzzles; PR = Perceptual Reasoning; FW = Figure Weights; PC = Picture Completion; MR = Matrix Reasoning; AR = Arithmetic; WM = Working Memory; INF = Information; VC = Verbal Comprehension; SIM = Similarities; V = Vocabulary; BD = Block Design; DS = Digit Span; LN = Letter–Number Sequencing; SS = Symbol Search; PS = Processing Speed; CD = Coding.

MHM Analysis

Tables 2 and 3 provide an overview of the Mokken analysis for the WAIS-IV subtests. It should first be noted that item-pair scalability ($H_{i j}$) and item scalability ($H_{j}$) were positive for all subtest items. However, $H_{j}$ > .30 was achieved for all items only in the subtests Visual Puzzles, Arithmetic, and Block Design, for which total scalability ($H$) was .48, .55, and .59, respectively, indicating medium and strong scales. In the case of the other subtests, some items failed to achieve the $H_{j}$ > .30 criterion: one item each in Figure Weights and Digit Span; three items each in Picture Completion, Information, and Letter–Number Sequencing; five items in Matrix Reasoning; six items in Similarities; 11 items in Vocabulary; and 19 items in Symbol Search. The total scalability ($H$) of these subtests was between .30 and .65 when no items were deleted (see Table 2). These results indicate that Similarities, Vocabulary, and Symbol Search are weak scales; Picture Completion, Matrix Reasoning, Information, Letter–Number Sequencing, and Digit Span are medium scales; and Figure Weights and Coding are strong scales. However, when those items that failed to satisfy the $H_{j}$ criterion were deleted, the total scalability $H$ was greater than or equal to .50 (strong scales) for all subtests except Matrix Reasoning, which was medium. These increases in total scalability show a trend toward greater unidimensionality of subtests.

Table 2.

Summary of the Mokken Scaling Analysis Results for the Wechsler Adult Intelligence Scale-IV Subtests.

Scale (factor)	H	H^T	Items violating scalability (H_i < .30)	Items violating monotonicity	Items violating IIO (total)	Items violating IIO (deleted by BIS)	Number of scales, genetic algorithm (c ≤ .30)	Number of scales, genetic algorithm (c = .40)
VP (PR)	.48	.84	0	0	0	0	1	1
FW (PR)	.51	.78	1	0	0	0	1	1
PC (PR)	.45	.81	3	0	6	Two items (8 and 10)	2	1
MR (PR)	.40	.60	5	0	0	0	2	2
AR (WM)	.55	.84	0	0	0	0	1	1
INF (VC)	.43	.88	3	0	7	Two items (5 and 8)	4	3
SIM (VC)	.30	.65	6	0	2	One item (11)	2	2
V (VC)	.32	.38	11	0	12	Three items (9, 10, 21)	3	3
BD (PR)	.59	.75	0	0	0	0	1	1
DS (WM)	.40	.71	1	0	0	0	1	2
LN (WM)	.42	.96	3	0	0	0	1	1
SS (PS)	.38	.86	19	0	—	—	—	—
CD (PS)	.65	.96	—	—	—	—	—	—

Note. H = scale scalability; H^T = coefficient showing accuracy of item ordering; IIO = invariant item ordering; BIS = backward item selection; VP = Visual Puzzles; PR = Perceptual Reasoning; FW = Figure Weights; PC = Picture Completion; MR = Matrix Reasoning; AR = Arithmetic; WM = Working Memory; INF = Information; VC = Verbal Comprehension; SIM = Similarities; V = Vocabulary; BD = Block Design; DS = Digit Span; LN = Letter–Number Sequencing; SS = Symbol Search; PS = Processing Speed; CD = Coding.

Table 3.

Fit Statistics of MSA Models for Wechsler Adult Intelligence Scale-IV Subtests.

	Visual Puzzles				Figure Weights				Picture Completion
I	M	H_j	V.IIO	Rank^a	M	H_j	V.IIO	Rank^a	M	H_j	V.IIO	Rank^a
2									0.993	.60		2
3									0.993	.60		3
4									0.964	.53		4
5	0.995	.72		5					0.593	.27	1	7
6	0.995	.72		6	0.880	.21		9	0.686	.33		5
7	0.943	.59		7	0.979	.58		6	0.60	.29	1	6
8	0.759	.47		8	0.838	.43		11	0.436	.30	2	11
9	0.604	.46		10	0.916	.63		7	0.507	.40		9
10	0.675	.53		9	0.901	.44		8	0.579	.50	2	8
11	0.557	.45		11	0.873	.54		10	0.279	.45		14
12	0.467	.47		12	0.768	.33		13	0.429	.49	1	12
13	0.458	.51		13	0.789	.47		12	0.471	.53	1	10
14	0.231	.31		17	0.549	.39		16	0.343	.58		13
15	0.316	.48		14	0.606	.53		14	0.071	.45		18
16	0.274	.50		15	0.444	.45		17	0.114	.53		16
17	0.259	.51		16	0.585	.52		15	0.136	.49		15
18	0.175	.53		18	0.444	.55		18	0.107	.63		17
19	0.151	.52		19	0.380	.62		19	0.036	.62		21
20	0.080	.48		20	0.197	.58		21	0.064	.66		19
21	0.080	.52		21	0.275	.59		20	0.029	.74		22
22	0.076	.50		22	0.148	.52		24	0.021	.59		23
23	0.038	.43		25	0.162	.56		23	0.050	.64		20
24	0.051	.51		23	0.190	.60		22	0.007	.26		24
25	0.042	.50		24	0.134	.61		25
26	0.005	.61		26	0.085	.54		26
27					0.092	.65		27
Matrix Reasoning					Arithmetic				Information
4	0.996	.67		4					0.995	.86		4
5	0.986	.68		5					0.103	.32	3	17
6	0.891	.01		8	0.995	.60		6	0.846	.22		8
7	0.941	.19		7	0.991	.79		7	0.897	.44		6
8	0.715	.23		10	0.972	.64		8	0.397	.29	2	13
9	0.946	.37		6	0.817	.46		10	0.608	.42		9
10	0.801	.39		9	0.864	.49		9	0.935	.57		5
11	0.669	.24		13	0.596	.47		12	0.061	.30		19
12	0.706	.28		11	0.676	.53		11	0.855	.34		7
13	0.688	.43		12	0.535	.50		13	0.509	.41		10
14	0.643	.35		14	0.441	.55		14	0.196	.47	1	15
15	0.539	.35		17	0.385	.57		15	0.299	.46		14
16	0.543	.42		15	0.277	.55		16	0.407	.54	1	11
17	0.543	.47		16	0.239	.58		19	0.402	.60	1	12
18	0.380	.43		19	0.249	.59		18	0.009	.34		24
19	0.394	.45		18	0.272	.64		17	0.149	.46	1	16
20	0.362	.44		20	0.085	.64		20	0.103	.47	1	18
21	0.303	.47		22	0.075	.62		21	0.051	.49		20
22	0.335	.49		21	0.047	.63		22	0.019	.29		21
23	0.299	.47		23					0.014	.48		22
24	0.226	.50		24					0.014	.48		23
25	0.122	.45		25					0.009	.56		25
26	0.068	.41		26					0.005	.65		26
Vocabulary					Similarities				Letter–Number Sequencing
4									2.99	.14		4
5									2.77	.25		5
6	1.71	.08	2	10	1.94	.25		6	2.53	.26		6
7	1.80	.08	2	7	1.55	.11		7	1.53	.45		7
8	1.74	.19	2	9	1.30	.22		8	0.65	.56		8
9	1.09	.14	3	22	0.95	.22		10	0.18	.64		9
10	1.57	.15	2	12	1.16	.26		9	0.04	.59		10
11	1.78	.13	2	8	0.37	.31	1	14	Digit Span (Forward [F], Backward [B], Sequencing [S])
12	1.84	.20	1	6	0.46	.30		12	F4, 1.79	.40		4
13	1.25	.30		17	0.36	.24		15	F5, 1.34	.44		5
14	0.70	.25		27	0.39	.39	1	13	F6, 0.85	.40		6
15	1.24	.27		18	0.51	.44		11	F7, 0.45	.41		7
16	1.48	.41		13	0.12	.38		16	F8, 0.18	.42		8
17	0.54	.27		29	0.09	.45		18	B3, 1.65	.26		3
18	1.20	.34		19	0.11	.36		17	B4, 1.18	.41		4
19	1.15	.34	1	20	Block Design				B5, 0.68	.44		5
20	1.40	.41		15	Item	M	H_j	Rank^a	B6, 0.30	.45		6
21	1.60	.57	8	11	9	3.40	.49	9	B7, 0.14	.46		7
22	0.98	.44		24	10	2.21	.58	10	B8, 0.07	.41		8
23	1.14	.36	1	21	11	0.80	.56	11	S4, 1.44	.37		4
24	0.90	.36		26	12	0.79	.63	12	S5, 1.05	.37		5
25	1.47	.49	1	14	13	0.43	.65	13	S6, 0.37	.35		6
26	1.40	.45		16	14	0.20	.64	14	S7, 0.13	.42		7
27	1.09	.43	1	23					S8, 0.07	.44		8
28	0.64	.30		28
29	0.21	.22		30
30	0.98	.37		25

Note. I = item; M = mean item score; H_j = item scalability; V.IIO = number of significant violations of the invariant item ordering.

^a Item ordering based on the mean score.

Additional information about dimensionality can also be found in Table 2, particularly in the last two columns; we report only the results for c ≤ .30 and c = .40 (the other values did not show interesting results), which show the results using the genetic algorithm for each WAIS-IV subtest. In general terms, we found that the items of the subtests Visual Puzzles, Figure Weights, Arithmetic, Block Design, Digit Span, and Letter–Number Sequencing were selected to form a scale in each subtest with lower bounds in the range .0 ≤ c ≤ .3. For the subtests Picture Completion, Matrix Reasoning, and Similarities, with lower bounds in the range .0 ≤ c ≤ .3, not all the items were selected for the same scale, suggesting that the items can be divided between two scales for each subtest. Finally, for the subtests Information and Vocabulary, using the same criterion (.0 ≤ c ≤ .3), the items can form up to four and three scales, respectively.

Using a slightly more restrictive lower bound criterion of c = .40, the items of the subtests Visual Puzzles, Figure Weights, Picture Completion, Arithmetic, Block Design, and Letter–Number Sequencing formed a single scale in each case, whereas the items of the subtests Matrix Reasoning, Similarities, and Digit Span formed two scales and those of Information and Vocabulary formed three scales. In summary, the results fitted the expected pattern of a unidimensional scale for Visual Puzzles, Figure Weights, Picture Completion, Arithmetic, Block Design, Digit Span, and Letter–Number Sequencing, as described by Sijtsma and Molenaar (2002), whereas the Matrix Reasoning, Information, Similarities, and Vocabulary subtests were multidimensional.

No significant violations of the monotonicity assumption were detected for the items of any subtest, but one item of the Matrix Reasoning and Vocabulary subtests had a Crit value in the range 40 ≤ Crit < 80, showing a nonserious degree of misfit. In addition, strong evidence of monotonicity was found when inspecting all IRFs of the subtests for all items across the range of ability. In summary, the results under the MHM indicate that the monotonicity assumption held for each of the WAIS-IV subtests and that unidimensionality was achieved by all subtests except Matrix Reasoning, Information, Similarities, and Vocabulary, which yielded multiple Mokken scales (i.e., they were multifactorial).

DMM Analysis

Tables 2 and 3 (V.IIO column) display the statistics for IIO applied to the subtests. The IIO assumption was also visualized using the mokken package. The results revealed no significant IIO violations for any items of the Visual Puzzles, Figure Weights, Matrix Reasoning, Arithmetic, Block Design, Digit Span, and Letter–Number Sequencing subtests, and the H^T coefficient was greater than .50 in each case, indicating a strong ordering.

However, six items (5, 7, 8, 10, 12, and 13) of the Picture Completion subtest violated the IIO assumption, with Crit values in the range 40 ≤ Crit < 80, although the backward item selection method indicated that only two items (8 and 10) needed to be removed. Once the two items had been removed from the Picture Completion subtest, H^T was .81 (strong ordering) and the test scalability H was .48, indicating a medium scale. The Information subtest showed significant violations of IIO for seven items, but only one item (8) had a Crit value greater than 80 (approximately 90). Backward item selection confirmed that two items (5 and 8) should be removed. Once these items had been removed, the H^T coefficient for the remaining Information items was .88, which is a very high value according to Ligtvoet et al. (2010), and H was .48, indicating a medium scale.

For the Similarities and Vocabulary subtests, the H^T coefficients were .65 (strong ordering) and .38 (weak ordering), and backward item selection indicated that one item (11) and three items (9, 10, and 21), respectively, should be removed. As shown in Figure 1, Item 11 intersected with Item 15 in the Similarities subtest, violating the IIO and nonintersection assumptions; the figure also depicts the intersections between item-pairs 20 and 9, and 21 and 10, in the Vocabulary subtest. For the Coding and Symbol Search subtests, although strong ordering was achieved, the backward item selection procedure reported violations by some items, which should be deleted.

Figure 1.

Example violations of the IIO assumption for item-pair 15 and 11 for the Similarities subtest and item-pairs 20 and 9, and 21 and 10 for the Vocabulary subtest using the manifest IIO method.

Table 3 (Rank column) displays the item ordering for each subtest using the mean score. Using this approach, items with a lower mean score are reflected as more difficult. Interestingly, the ordering was different for some items with respect to the original ranking described in the WAIS-IV manual.
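The ranking rule described here can be sketched in a few lines of Python. The function name is hypothetical, and the example uses the mean scores of only four Information items from Table 3, so the ranks it produces are relative to that subset and differ from the Rank column, which is based on the full item set.

```python
def rank_by_mean(item_means):
    """Rank items from easiest (highest mean) to hardest (lowest mean).

    Ties keep their original item order because the sort is stable.
    """
    ordered = sorted(item_means, key=lambda item: -item_means[item])
    return {item: rank for rank, item in enumerate(ordered, start=1)}


# Mean scores of four Information items as reported in Table 3.
information = {4: 0.995, 5: 0.103, 7: 0.897, 10: 0.935}
ranks = rank_by_mean(information)
print(ranks)
```

Within this subset, Item 10 is ranked ahead of Item 5 even though it appears later in the administration order, which is the kind of reordering the Rank column makes visible.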

It can, therefore, be concluded that the MHM and DMM fitted well to the subtest data, although caution is required with certain items for which a poor fit was reported: In the MHM, those items that failed to satisfy the H_j criterion, and in the DMM, those items that were removed following backward item selection.

Discussion

This study uses two NIRT models to assess the psychometric properties of WAIS-IV subtests. In reviewing the literature, no data were found on the application of MSA to WAIS-IV data. The most interesting finding was that the MHM fitted all items of the subtests, although a small number of items fitted poorly as measured by the scalability coefficient. Sijtsma and Molenaar (2002) noted that a fit of the MHM similar to the one reported in this study suggests that the sum score of each subtest is a good indicator of the latent trait. From a practical perspective, the sum score of each subtest can be used to order adults on the latent trait. This is reinforced by the results obtained with the item scalability coefficient H_i for the WAIS-IV subtests, which indicate that the subtest items discriminate well between levels of adults on the latent trait, such that adults with a higher level of intelligence will score higher for each subtest. In any case, although the results also showed that care should be taken with those items with a scalability coefficient less than .30 when interpreting the total scores, it was not essential to remove these items, as H was at least .30 for every subtest (see Table 2), indicating that they fitted the MHM.

Moreover, strong IIO (H^T ≥ .50) was recorded for all WAIS-IV subtests except Vocabulary, according to the criteria established by Ligtvoet et al. (2010). These findings imply that the WAIS-IV subtests present hierarchical information based on the difficulty of each item. Items can therefore be administered in ascending order of difficulty to reduce administration time, applying the discontinue rule when the individual fails several consecutive items; for example, the Matrix Reasoning subtest is discontinued if the individual fails to answer three consecutive items correctly. This also makes it possible to apply the start point rule according to the age of each individual. As expected, item order differs between the WAIS-IV subtests adapted to Arabic and those detailed in the U.S. WAIS-IV manual, although the original subtest structures have been maintained as far as possible. For verbal subtests, the order of item administration should be changed. For instance, most of the items in the Information subtest pertain to the Western canons of geography, science, history, and literature. Specifically, Item 5, “Martin Luther King,” is very easy for a U.S. sample, whereas individuals from Egypt may find it more difficult to answer correctly, so it was ranked 17th; Item 10, “Cleopatra,” is very easy for our sample and was ranked fifth. Similarly, the administration order of some items in the nonverbal subtests should be reexamined. For instance, Item 14 in Visual Puzzles was ranked 17th in our study, and Item 17 in Figure Weights was ranked 15th. On the basis of these results, administering the WAIS-IV subtests to Arabic speakers in the order given in the U.S. WAIS-IV manual may lower overall scores, because misplaced items can trigger the discontinue rule before easier items have been presented. In general, it seems that WAIS-IV subtest items should be resequenced for Arabic speakers to obtain more accurate scores. These results match those reported by Suwartono, Hidajat, Halim, Hendriks, and Kessels (2016), who found that the item orders established in the U.S. WAIS-IV manual were unsuitable for Indonesia.
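The interaction between item order and the discontinue rule can be illustrated with a small sketch. The data below are entirely hypothetical (eight items, five respondents) and the function names are assumptions introduced for illustration; the only detail taken from the text is a discontinue rule of three consecutive failures.

```python
def order_by_difficulty(rows):
    """Return item indices sorted from easiest to hardest by mean item score."""
    n = len(rows)
    k = len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(k)]
    # sorted() is stable, so tied items keep their original relative order.
    return sorted(range(k), key=lambda i: -means[i])


def score_with_discontinue(answers, order, max_failures=3):
    """Sum score when testing stops after max_failures consecutive failures.

    answers[i] is the 0/1 score the person would obtain on item i if it
    were administered; items after the stopping point are scored 0.
    """
    total = 0
    run = 0
    for item in order:
        if answers[item]:
            total += 1
            run = 0
        else:
            run += 1
            if run == max_failures:
                break
    return total


# Hypothetical sample in which the item at index 6 is easy for this
# population although the source manual places it late in the sequence.
sample = [
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
]

manual_order = list(range(8))
empirical_order = order_by_difficulty(sample)

person = sample[0]
score_manual = score_with_discontinue(person, manual_order)
score_empirical = score_with_discontinue(person, empirical_order)
```

For the first respondent, the manual order stops testing before the easy item at index 6 is reached, whereas the empirically derived order presents it early, so the same response pattern earns one more point. This is only a schematic of the mechanism discussed above, not an analysis of the study data.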

Moreover, according to Watson et al. (2012) and Ligtvoet et al. (2010), a lack of IIO arises when many items measure the same level of the latent trait. We can therefore infer from the IIO of the WAIS-IV items that they measure different levels of the cognitive construct, and this is confirmed by the variation in mean item scores; for instance, for the Visual Puzzles subtest, mean scores ranged from .005 (Item 26; very difficult) to .995 (Items 1–5; very easy). From this, and considering the IIO of the WAIS-IV, we can conclude that the ordering of subjects based on the total score of each WAIS-IV subtest is invariant (Ligtvoet et al., 2010; Sijtsma & Van der Ark, 2017).

Interestingly, the dimensionality results obtained with the genetic algorithm indicated that the WAIS-IV subtests analyzed in this study are unidimensional, except for Matrix Reasoning, Information, Similarities, and Vocabulary, which are multidimensional. The appearance of more than one scale for some of the WAIS-IV subtests in the Mokken analysis may explain the findings of previous studies such as Abdelhamid et al. (2019), Bowden et al. (2011a), and Weiss, Keith, Zhu, and Chen (2013a, 2013b), which suggested that some of these subtests loaded on more than one factor. As such, the total score of each of the unidimensional WAIS-IV subtests can be computed to determine the adult’s level on the latent trait being measured. For the multidimensional subtests (i.e., Matrix Reasoning, Information, Similarities, and Vocabulary), it is necessary to calculate a separate total score for each dimension, reflecting the feature of the latent trait that the dimension measures.

Moreover, the reliability coefficients used in the current study (Molenaar–Sijtsma, lambda-2, and latent class reliability) revealed high reliability for the subtests, which indicates good measurement quality. These findings are in line with those of previous studies such as Glass, Ryan, and Charter (2010).
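As an illustration of one of these coefficients, the sketch below computes Guttman's lambda-2 from the inter-item covariance matrix, alongside coefficient alpha for comparison. It is a minimal example under stated assumptions: the function names and the toy data are hypothetical, and the Molenaar–Sijtsma and latent class coefficients are not shown.

```python
from math import sqrt


def covariance_matrix(rows):
    """Population covariance matrix of item scores (respondents in rows)."""
    n = len(rows)
    k = len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(k)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
         for j in range(k)]
        for i in range(k)
    ]


def guttman_lambda2(rows):
    """Lambda-2: (sum of off-diagonal covariances
    + sqrt(k / (k - 1) * sum of squared off-diagonal covariances))
    divided by the variance of the sum score."""
    k = len(rows[0])
    cov = covariance_matrix(rows)
    total_var = sum(sum(row) for row in cov)  # variance of the sum score
    off = [cov[i][j] for i in range(k) for j in range(k) if i != j]
    return (sum(off) + sqrt(k / (k - 1) * sum(c * c for c in off))) / total_var


def cronbach_alpha(rows):
    """Coefficient alpha, included only as a familiar point of comparison."""
    k = len(rows[0])
    cov = covariance_matrix(rows)
    total_var = sum(sum(row) for row in cov)
    item_var = sum(cov[i][i] for i in range(k))
    return k / (k - 1) * (1 - item_var / total_var)


# Hypothetical 0/1 responses from four respondents to three items.
toy = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
lam2 = guttman_lambda2(toy)
alpha = cronbach_alpha(toy)
```

Lambda-2 is never smaller than alpha, and the two coincide when all inter-item covariances are equal, which is one reason lambda-2 is often reported alongside or instead of alpha.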

From an empirical perspective, this study provides new understanding of how to apply Mokken analysis to intelligence scales and how to assess the fit of NIRT models. Our analysis has shown that the MHM and DMM fit the WAIS-IV, providing evidence that these models can be applied successfully to intelligence scales. The current findings should be extrapolated only to the 18 to 24 years age group. Although the sample is expected to be representative of this age group and the data satisfied the normality assumption, care should be taken when drawing inferences about other age groups and regions. A limitation is that our study did not include the Comprehension and Cancellation subtests. In addition, Symbol Search and Coding are speeded subtests, so their results should be approached with caution; for this reason, we did not discuss them at the item level.

In conclusion, the present study provides several interesting findings on the dimensionality and hierarchy of WAIS-IV subtests in an MSA framework. In future research, it may be of interest to use different WAIS-IV data (samples from other countries, or an extension of the current sample to include other sociodemographic characteristics) to compare NIRT models and establish their statistical properties. Moreover, the use of a variety of IRT models may yield useful information from which to draw conclusions about item fit or item weakness. The current findings offer several suggestions that may improve the WAIS-IV subtests adapted for Arabic speakers. First, consideration should be given to reordering the items of some subtests, using modern theoretical approaches such as Mokken analysis, to obtain more accurate mean score estimates. Second, some WAIS-IV items that did not fit the NIRT models well could be revised, and some could be omitted in the construction of a shortened version of the WAIS-IV, as suggested by previous studies such as Denney, Ringe, and Lacritz (2015) and Meyers, Zellinger, Kockler, Wagner, and Miller (2013).

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Egyptian Ministry of Higher Education, Management of Supporting Excellence, Competitive Excellence Project of Higher Education Institutions (grant 2016), and by the Agency for the Management of University and Research Grants of the Government of Catalonia (grant 2017SGR1681). The funders played no role in the study design, data collection and analysis, decision to publish, or preparation of the article.

ORCID iDs

Gomaa S. M. Abdelhamid

Juana Gómez-Benito

References

Abdelhamid, G. S. M., Gómez-Benito, J., Abdeltawwab, A. T. M., Abu Bakr, M. H. S., & Kazem, A. M. (2019). Hierarchical structure of the Wechsler Adult Intelligence Scale–Fourth Edition with an Egyptian Sample. Journal of Psychoeducational Assessment, 37, 395-404. doi:10.1177/0734282917732857

Benson, Hulac, D. M., & Kranzler, J. H. (2010). Independent examination of the Wechsler Adult Intelligence Scale–Fourth Edition (WAIS-IV): What does the WAIS-IV measure? Psychological Assessment, 22, 121-130. doi:10.1037/a0017767

Bowden, S. C., Saklofske, D. H., & Weiss, L. G. (2011a). Augmenting the core battery with supplementary subtests: Wechsler Adult Intelligence Scale–IV measurement invariance across the United States and Canada. Assessment, 18, 133-140. doi:10.1177/1073191110381717

Bowden, S. C., Saklofske, D. H., & Weiss, L. G. (2011b). Invariance of the measurement model underlying the Wechsler Adult Intelligence Scale–IV in the United States and Canada. Educational and Psychological Measurement, 71, 186-199. doi:10.1177/0013164410387382

Climie, E. A., & Rostad (2011). Test review: Wechsler Adult Intelligence Scale. Journal of Psychoeducational Assessment, 29, 581-586. doi:10.1177/0734282911408707

Denney, D. A., Ringe, W. K., & Lacritz, L. H. (2015). Dyadic short forms of the Wechsler Adult Intelligence Scale–IV. Archives of Clinical Neuropsychology, 30, 404-412. doi:10.1093/arclin/acv035

Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8, 341-349. doi:10.1037/1040-3590.8.4.341

Emons, W. H. M., Sijtsma, & Pedersen, S. S. (2012). Dimensionality of the Hospital Anxiety and Depression Scale (HADS) in cardiac patients: Comparison of Mokken scale analysis and factor analysis. Assessment, 19, 337-353. doi:10.1177/1073191110384951

Glass, L. A., Ryan, J. J., & Charter, R. A. (2010). Discrepancy score reliabilities in the WAIS-IV standardization sample. Journal of Psychoeducational Assessment, 28, 201-208. doi:10.1177/0734282909346710

Kaufman, A. S., Salthouse, T. A., Scheiber, & Chen (2016). Age differences and educational attainment across the life span on three generations of Wechsler Adult Scales. Journal of Psychoeducational Assessment, 34, 421-441. doi:10.1177/0734282915619091

Ligtvoet, Van der Ark, L. A., te Marvelde, J. M., & Sijtsma (2010). Investigating an invariant item ordering for polytomously scored items. Educational and Psychological Measurement, 70, 578-595. doi:10.1177/0013164409355697

Meyers, J. E., Zellinger, M. M., Kockler, Wagner, & Miller, R. M. (2013). A validated seven-subtest short form for the WAIS-IV. Applied Neuropsychology-Adult, 20, 249-256. doi:10.1080/09084282.2012.710180

Miller, D. I., Davidson, P. S. R., Schindler, & Messier (2013). Confirmatory factor analysis of the WAIS-IV and WMS-IV in older adults. Journal of Psychoeducational Assessment, 31, 375-390. doi:10.1177/0734282912467961

Mokken, R. J. (1971). A theory and procedure of scale analysis: With applications in political research. The Hague, The Netherlands: De Gruyter.

Mokken, R. J. (1997). Nonparametric models for dichotomous responses. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 351-368). New York, NY: Springer.

Molenaar, I. W. (1997). Nonparametric models for polytomous responses. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 369-380). New York, NY: Springer.

Mooij (2012). A Mokken Scale to assess secondary pupils’ experience of violence in terms of severity. Journal of Psychoeducational Assessment, 30, 496-508. doi:10.1177/0734282912439387

Nelson, J. M., Canivez, G. L., & Watkins, M. W. (2013). Structural and incremental validity of the Wechsler Adult Intelligence Scale–Fourth Edition with a clinical sample. Psychological Assessment, 25, 618-630. doi:10.1037/a0032086

Saklofske, D. H., Zhu, Miller, J. L., Weiss, L. G., Babcock, S. E., Cayton, T. G., & Coalson, D. L. (2012). The Cognitive Proficiency Index for the Canadian Edition of the Wechsler Adult Intelligence Scale–Fourth Edition. Canadian Journal of Behavioural Science/Revue Canadienne des Sciences du Comportement, 44, 117-123. doi:10.1037/a0026734

Salthouse, T. A., & Saklofske, D. H. (2010). Do the WAIS-IV Tests measure the same aspects of cognitive functioning in adults under and over 65? In L. G. Weiss, D. L. Coalson, D. H. Saklofske, & S. E. Raiford (Eds.), WAIS-IV clinical use and interpretation: Scientist-practitioner perspectives (pp. 217-235). San Diego, CA: Academic Press.

Sijtsma (2009). Correcting fallacies in validity, reliability, and classification. International Journal of Testing, 9, 167-194. doi:10.1080/15305050903106883

Sijtsma, Debets, & Molenaar, I. W. (1990). Mokken scale analysis for polychotomous items: Theory, a computer program and an empirical application. Quality and Quantity, 24, 173-188. doi:10.1007/BF00209550

Sijtsma, Emons, W. H. M., Bouwmeester, Nyklíček, & Roorda, L. D. (2008). Nonparametric IRT analysis of Quality-of-Life Scales and its application to the World Health Organization Quality-of-Life Scale (WHOQOL-Bref). Quality of Life Research, 17, 275-290. doi:10.1007/s11136-007-9281-6

Sijtsma, Meijer, R. R., & Van der Ark, L. A. (2011). Mokken scale analysis as time goes by: An update for scaling practitioners. Personality and Individual Differences, 50, 31-37. doi:10.1016/j.paid.2010.08.016

Sijtsma, & Molenaar, I. W. (2002). Introduction to nonparametric item response theory (1st ed.). Thousand Oaks, CA: Sage.

Sijtsma, & Van der Ark, L. A. (2017). A tutorial on how to do a Mokken scale analysis on your test and questionnaire data. British Journal of Mathematical and Statistical Psychology, 70, 137-158. doi:10.1111/bmsp.12078

Straat, J. H., Van der Ark, L. A., & Sijtsma (2013). Comparing optimization algorithms for item selection in Mokken scale analysis. Journal of Classification, 30, 75-99. doi:10.1007/s00357-013-9122-y

Suwartono, Hidajat, L. L., Halim, M. S., Hendriks, M. P. H., & Kessels, R. P. C. (2016). External validity of the Indonesian Wechsler Adult Intelligence Scale–Fourth Edition (WAIS-IV-ID). ANIMA Indonesian Psychological Journal, 32(1), Article 16. doi:10.24123/aipj.v32i1.581

Van der Ark, L. A. (2016). R package Mokken V 2.8.2. Retrieved from https://cran.r-project.org/web/packages/mokken/

Van der Ark, L. A., Van der Palm, D. W., & Sijtsma (2011). A latent class approach to estimating test-score reliability. Applied Psychological Measurement, 35, 380-392. doi:10.1177/0146621610392911

Van Schuur, W. H. (2011). Ordinal item response theory: Mokken scale analysis. Thousand Oaks, CA: Sage.

Watson, van der Ark, L. A., Lin, L.-C., Fieo, Deary, I. J., & Meijer, R. R. (2012). Item response theory: How Mokken scaling can be used in clinical practice. Journal of Clinical Nursing, 21, 2736-2746. doi:10.1111/j.1365-2702.2011.03893.x

Wechsler (2008). WAIS-IV administration and scoring manual. San Antonio, TX: Psychological Corporation.

Weiss, L. G., Keith, T. Z., Zhu, & Chen (2013a). Technical and practical issues in the structure and clinical invariance of the Wechsler Scales: A rejoinder to commentaries. Journal of Psychoeducational Assessment, 31, 235-243. doi:10.1177/0734282913478050

Weiss, L. G., Keith, T. Z., Zhu, & Chen (2013b). WAIS-IV and clinical validation of the four- and five-factor interpretative approaches. Journal of Psychoeducational Assessment, 31, 94-113. doi:10.1177/0734282913478030

Weiss, L. G., Saklofske, D. H., Coalson, D. L., & Raiford, S. E. (2010). Theoretical, empirical and clinical foundations of the WAIS-IV Index Scores. In L. G. Weiss, D. L. Coalson, D. H. Saklofske, & S. E. Raiford (Eds.), WAIS-IV clinical use and interpretation: Scientist-practitioner perspectives (pp. 64-94). San Diego, CA: Academic Press.