Abstract
The purpose of this study was to investigate the effect of complex structure on dimensionality assessment in noncompensatory multidimensional item response models using dimensionality assessment procedures based on DETECT (dimensionality evaluation to enumerate contributing traits) and NOHARM (normal ogive harmonic analysis robust method). Five methods were evaluated: two DETECT-based methods—exploratory and cross-validated—and three NOHARM-based methods: root mean square residual (RMSR),
Keywords
Introduction
Dimensionality of a test can be defined as “the number of latent variables that account for the correlations among item responses in a particular data set” (Camilli, Wang, & Fesq, 1995, p. 80). A process of determining test dimensionality, dimensionality assessment, is crucial as it forms the basis for statistical analysis of the data (Hambleton, Swaminathan, & Rogers, 1991; Zhang, 2007). Stout et al. (1996) described the assessment of test dimensionality as a two-stage process: (a) verification or refutation of unidimensionality and (b) if necessary, description of the multidimensional test structure. Although many of the item response theory models make assumptions about a single dimension (unidimensionality) and local independence, if multidimensional structure is expected, understanding which items go with which dimension is just as crucial.
Over the past few decades, researchers have provided arguments for supporting dimensionality assessment and understanding the structure of a test as an important step in testing (Hambleton et al., 1991; Jang & Roussos, 2007; Tate, 2003; Zhang, 2007). Failure to accurately determine the number of dimensions and associated dimensional structure (or misalignment of the psychometric model and the data) may lead to severe consequences in various aspects of testing, including inaccurate and imprecise estimates of item and person parameters, problems in test linking and equating, item bias and test assembly, and problematic score reporting (e.g., Ackerman, 1989, 1994; Chen & Thissen, 1997; Reckase, Carlson, Ackerman, & Spray, 1986; Walker & Beretvas, 2003; Way, Ansley, & Forsyth, 1988; Yen, 1985). Consequently, it is argued that given the role of dimensionality assessment in supporting a variety of psychometric endeavors, assessing dimensionality should be a prerequisite to applying most commonly used item response theory models (Childs & Oppler, 2000; Jang & Roussos, 2007; Nandakumar & Yu, 1996; Nandakumar, Yu, Li, & Stout, 1998; Seraphine, 2000).
Dimensionality assessment of the item responses can be conducted using a number of different procedures (for a review of current and emerging methods, see Jasper, 2010; Levy & Svetina, 2010; Tate, 2003). A procedure can be characterized by researchers’ approach to modeling (e.g., exploratory or confirmatory) and by other characteristics associated with that particular method, such as the distributional assumptions it makes (e.g., parametric or nonparametric) or the modeling paradigm within which it is commonly applied (e.g., item response or factor analytic).
Research in dimensionality assessment has largely focused on examining dimensionality in situations where a compensatory type of a multidimensional model is assumed. In addition, in many of those cases, the structure of the data is assumed to be factorially simple—meaning, each item has primary association with a single dimension. Research has shown that to a large degree, commonly used methods perform well under conditions that align well with the principles on which they were built, such as simple structure and compensatory multidimensional item response theory (MIRT; Finch & Habing, 2005, 2008; Nandakumar, 1991, 1993; Nandakumar et al., 1998; Nandakumar & Stout, 1993; Nandakumar & Yu, 1996; Stout, 1987; Stout et al., 1996; van Abswoude, van der Ark, & Sijtsma, 2004; Zhang, 2007; Zhang & Stout, 1999b). Fewer studies examined conditions that depart from the principles or assumptions, such as noncompensatory MIRT (e.g., Froelich & Habing, 2008; Hattie, Krakowski, Rogers, & Swaminathan, 1996) and/or complex structure (e.g., Froelich & Habing, 2008; Gierl, Leighton, & Tan, 2006).
The current study focuses on how well two of the more commonly used methods work under conditions that do not align with the foundational principles of the tools. Specifically, this study systematically examines how well the dimensionality assessment procedures implemented in DETECT (dimensionality evaluation to enumerate contributing traits) and NOHARM (normal ogive harmonic analysis robust method) perform in conditions where complex structure exists and where the generating (underlying) MIRT model is noncompensatory.
The article is organized as follows. The next section provides an overview of the dimensionality assessment methods, DETECT and NOHARM, used in the study, followed by a brief summary of current research on the performance of those procedures. The Method section outlines the implemented study design and describes the outcome variables used in the study. In the Results section, performance across the procedures based on DETECT and NOHARM output is reported for the studied conditions. The last section discusses study limitations and implications.
Dimensionality Assessment
Overview of DETECT Procedure
Dimensionality evaluation to enumerate contributing traits (Kim, 1994; Zhang & Stout, 1999a, 1999b) is an estimation procedure based on conditional covariances among the items. As a procedure, it is often used as an exploratory tool for dimensionality assessment and is useful when the data are scored dichotomously.
The main goal of DETECT is to partition the items into clusters such that within a cluster the items are most homogeneous, and the clusters themselves are widely separated, reflecting an assumption of approximate simple structure. If approximate simple structure exists, the theoretical index D will be maximized at the correct dimensionality-based cluster partition. The maximum possible value of D, denoted as D*, indicates the amount of multidimensionality the test displays (i.e., how far the data depart from being perfectly fitted by a unidimensional model; Zhang & Stout, 1999b). This implies that when the partition matches approximate simple structure, the maximum value of DETECT will be obtained since all of the within-cluster conditional covariances will be positive and all between-cluster conditional covariances will be negative (Zhang & Stout, 1999b). There are a large number of possible ways in which DETECT can partition the items; therefore, to search the space intelligently, the DETECT procedure employs a genetic algorithm in addition to hierarchical cluster analysis to limit the search (Roussos, Stout, & Marden, 1998; Zhang & Stout, 1999b).
If the hypothesis of approximate simple structure is supported, the solution may be interpreted by taking the number of homogeneous item clusters as the number of dominant dimensions. This is possible because the DETECT procedure outputs the number of nonoverlapping clusters and the items associated with each of the clusters. To the extent that there are clusters with few items, or if approximate simple structure does not hold, inferring the number of dominant dimensions should be done with caution (Jang & Roussos, 2007; Zhang & Stout, 1999b). For a complete description of the DETECT procedure, see Zhang and Stout (1999a, 1999b).
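As a rough illustration of the index described above, the sketch below computes a DETECT-style value for a given partition from a matrix of estimated conditional covariances. The sign convention (+1 for within-cluster pairs, −1 for between-cluster pairs) and the averaging over all item pairs follow the way the index is commonly written (Zhang & Stout, 1999b). The function name and the toy covariance values are hypothetical, any scaling constant applied by the software is omitted, and the estimation of the conditional covariances themselves is not shown.

```python
from itertools import combinations

def detect_index(ccov, partition):
    """Average signed conditional covariance over all item pairs.

    ccov: square list of lists; ccov[i][j] is an estimated conditional
          covariance between items i and j (hypothetical input).
    partition: list giving the cluster label of each item.
    """
    n = len(partition)
    total = 0.0
    for i, j in combinations(range(n), 2):
        # Within-cluster pairs count positively, between-cluster pairs negatively.
        sign = 1.0 if partition[i] == partition[j] else -1.0
        total += sign * ccov[i][j]
    return 2.0 * total / (n * (n - 1))

# Toy example: items 0-1 and items 2-3 form two clusters.
toy = [
    [0.00, 0.02, -0.01, -0.01],
    [0.02, 0.00, -0.01, -0.01],
    [-0.01, -0.01, 0.00, 0.02],
    [-0.01, -0.01, 0.02, 0.00],
]
matched = detect_index(toy, [0, 0, 1, 1])     # partition matches the structure
mismatched = detect_index(toy, [0, 1, 0, 1])  # partition cuts across it
```

In this toy case the matching partition yields the larger value, which is the property the search over partitions relies on.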
Overview of NOHARM Procedure
Normal ogive harmonic analysis robust method (Fraser & McDonald, 1988) is a parametric nonlinear factor analytic estimation method. The model fitted in NOHARM can be represented by the nonlinear factor analytic model or its equivalent compensatory MIRT in the form of the following equation:
where
As originally developed, NOHARM does not produce a formal statistic for the model fit. In exploratory analysis, a user specifies the number of factors to be extracted for any one solution. Several approaches or tests have been proposed to examine model fit based on NOHARM output, including a heuristic approach by Tate (2003) and his reduction in root mean square residual (RMSR), and formal statistics approximate
Tate’s reduction in RMSR
Tate (2003) suggested evaluating model fit by the degree of improvement obtained via sequential model fitting. Using this approach, if the higher dimensional model produces a decrease of 10% or more in RMSR over the preceding model, the higher dimensional model should be retained. Suppose we fit four exploratory models in NOHARM and obtained an RMSR of .00721 for a model with a single factor and RMSRs of .00622, .00541, and .00511 for models with two, three, and four factors, respectively. The resulting decreases in RMSR, each computed relative to the preceding model, would be 14%, 13%, and 6% for the two-, three-, and four-factor solutions, respectively. Because adding the fourth dimension (i.e., going from three to four factors) decreased RMSR by only 6%, which falls short of the recommended 10%, we would conclude that the optimal solution was the three-factor solution.
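The arithmetic in this example can be checked with a short sketch. The function name and the choice to stop at the first decrease below the threshold are assumptions made for illustration; the text specifies only the 10% rule applied to sequentially fitted models.

```python
def tate_select(rmsr, threshold=0.10):
    """Return (chosen number of factors, list of proportional decreases).

    rmsr: RMSR values for the one-, two-, ... factor models, in order.
    """
    # Proportional decrease of each model relative to the preceding one.
    decreases = [
        (rmsr[k - 1] - rmsr[k]) / rmsr[k - 1] for k in range(1, len(rmsr))
    ]
    chosen = 1
    for drop in decreases:
        if drop >= threshold:
            chosen += 1   # retain the higher dimensional model
        else:
            break         # assumption: stop at the first insufficient decrease
    return chosen, decreases

rmsr_values = [0.00721, 0.00622, 0.00541, 0.00511]  # one- to four-factor models
n_factors, drops = tate_select(rmsr_values)
print(n_factors, [round(100 * d) for d in drops])  # 3 [14, 13, 6]
```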
Approximate χ² (Gessaroli & De Champlain, 1996)
A formal test of the goodness of fit of a particular dimensionality solution based on NOHARM output was introduced by Gessaroli and De Champlain (1996) as an approximate χ² statistic:

$$\chi^2_{G/D} = (N - 3) \sum_{j=2}^{J} \sum_{j'=1}^{j-1} z_{jj'}^{2},$$

where N is the number of examinees, J is the total number of items, j and j′ serve to index the items to define the unique pairings of items, and $z_{jj'}$ is the Fisher’s z transformation of the residual correlation for a given item pairing; and
where
Approximate likelihood ratio (Gessaroli et al., 1997)
An alternative statistics to
where
where
To determine the optimal number of factors or dimensions, a χ² difference test is adopted. First, for each M-dimensional fitted model,
Research Using DETECT and NOHARM Procedures
Previous studies generally found support for dimensionality assessment using DETECT- or NOHARM-based methods when compared to other procedures, although varying degrees of superiority were noted. As an example, consider Finch and Habing (2005), who in an extensive simulation study evaluated DETECT’s and NOHARM’s performances. The authors investigated two- and six-dimensional structures using two- and three-parameter logistic models and found that in two-dimensional (2D) cases, DETECT and NOHARM performed equally well in conditions with higher correlation, although DETECT was more successful in recreating dimensional structures in conditions with lower correlations. Neither DETECT nor NOHARM seemed to be affected by the sample size, although test length and skewness of the data had varying impact (it should be noted that Finch and Habing found ALR to be superior over
In six-dimensional (6D) conditions, ALR tended to outperform DETECT, although as in the 2D case, the number of subjects did not seem to have an impact on either procedure. An increase in the number of items, however, had the opposite effect in the 6D case compared with the 2D case; ALR benefited from the increase in items, whereas DETECT’s performance deteriorated in high-dimensional cases with the increased number of items.
Furthermore, Finch and Habing (2005) found that when the procedures erred (i.e., did not successfully recreate the dimensional structure), the DETECT method appeared to separate items that should have been grouped together, whereas the ALR appeared to group items that should have been kept separate. Last, unlike in the Finch and Habing (2003) study, the authors found in the 2005 study that guessing had little effect on either of the methods.
As alluded to earlier, it was often the case that the studies assumed the (approximate) simple structure (e.g., Finch & Habing, 2005; van Abswoude et al., 2004; Zhang & Stout, 1999b), while they investigated a number of different effects on the procedures’ performance, such as the model choice—that is, when guessing was present (e.g., Finch & Habing, 2003, 2005, 2007); the varying number of dimensions; test length; sample size; and distributional characteristics of the data (e.g., Finch & Habing, 2003, 2005, 2007; Gessaroli & De Champlain, 1996; Zhang & Stout, 1999b).
Only a few studies examined the performance of these methods when complex structure was present in the data. Gierl et al. (2006) examined DETECT’s performance in 2D conditions varying the amount of complexity present, whereas De Champlain and Gessaroli (1998) investigated performance of NOHARM-based
In a simulation study with 2D cases, Gierl et al. (2006) found that DETECT was generally very successful in accurately recovering the dimensional structure for some complex structures, particularly when 30% or fewer of the items were complex, the correlation between the traits was ≤ .75, and N ≥ 1,000. The authors recommended that in cases when large numbers of items were expected to display complex structure, DETECT should be used for dimensionality analysis with a large sample size, N ≥ 1,500, and in situations where latent traits were correlated up to .60. Initial results from De Champlain and Gessaroli (1998) suggested that
Although many of the results from these studies seem encouraging, the authors widely acknowledge that positive results should be situated within a restrictive set of studied conditions. Therefore, the current study focuses on how well the two commonly used procedures in dimensionality assessment work under conditions that do not align with the foundational principles of the tools but align with the use of psychometric models that allow for rich and refined inferences—specifically in noncompensatory MIRT when complex structure exists.
Method
Methods to Evaluate Dimensionality
As stated previously, this study examined the performance of DETECT- and NOHARM-based methods. Within DETECT, dimensionality assessment was conducted using
exploratory DETECT (DETECTE) and
cross-validated DETECT (DETECTCV),
where 50% of the sample was randomly selected as a training sample and the remaining 50% was used in the cross-validation analysis. In this study, the maximum number of clusters to be extracted by DETECT was set at five.
Similarly, in NOHARM, for each condition, exploratory one- through five-factor models were fit and three NOHARM-based methods were computed for each using solutions based on Promax rotation:
RMSR (Tate, 2003)
Approximate χ² (Gessaroli & De Champlain, 1996)
ALR (Gessaroli et al., 1997)
Default options were employed for the procedures.
Determining the (optimal) number of dimensions
Since DETECT output nonoverlapping clusters of items, the optimal number of dimensions was obtained in a straightforward way. The partition that maximized the DETECT index was considered a preferred solution or optimal dimensionality. For example, if DETECT output two clusters of items, the recorded preferred solution was considered to be 2D. For the NOHARM-based models (as with any factor analytic procedure), determining the optimal number of dimensions required an outside evaluation. In the current study, this evaluation was based on formal test of sequential model fitting via difference tests (for
Study Design
Several factors were manipulated in this study, including
Number of dimensions (two or three)
Structure type of data (0% complexity, i.e., simple structure; or 10%, 30%, or 50% of items exhibiting complexity)
Correlations between dimensions (.00, .30, .60, .75, or .90)
Sample size (small: 500, medium: 1,000, or large: 2,000)
Number of items per dimension (10 items per dimension for shorter tests or 20 items per dimension for longer tests)
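Fully crossing the factor levels listed above gives 2 × 4 × 5 × 3 × 2 = 240 conditions. A minimal sketch of that enumeration follows; the variable names are illustrative only.

```python
from itertools import product

dimensions = [2, 3]
complexity = [0.0, 0.10, 0.30, 0.50]        # proportion of complex items
correlations = [0.00, 0.30, 0.60, 0.75, 0.90]
sample_sizes = [500, 1000, 2000]
items_per_dimension = [10, 20]

# Each tuple is one cell of the fully crossed design.
conditions = list(product(dimensions, complexity, correlations,
                          sample_sizes, items_per_dimension))
print(len(conditions))  # 2 * 4 * 5 * 3 * 2 = 240
```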
Binary item responses were generated to follow a two-parameter normal-ogive noncompensatory MIRT (Sympson, 1978) model:
where
R Version 2.10 (R Development Core Team, 2010) was used to simulate the binary item responses such that each item response vector conformed to the conditions outlined above. The fully crossed design yielded a total of 240 conditions, where each condition was replicated 500 times. Item parameters were fixed across all conditions, and they ranged in values similar to those used in comparable studies (e.g., Ansley & Forsyth, 1985; Gierl et al., 2006). Item locations ranged from −1.50 to 1.50, with 0.75 increments, whereas item discrimination parameters ranged from 0.80 to 1.60, with increments of 0.20. A complex item had a discrimination parameter on multiple latent traits such that the influence of the dominant trait was always the highest, and the remaining dimensional influence(s) decreased by 0.20. For example, in the 3D condition, a complex item might have a discrimination parameter of 1.40 on its dominant dimension and discrimination parameters of 1.20 on the remaining two dimensions. Person parameters were generated from multivariate normal distributions with an appropriately sized mean vector of
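To make the data-generation step concrete, the following is a minimal sketch of how binary responses could be drawn under a noncompensatory normal-ogive model, written in Python rather than the R code used in the study. It assumes the product form P(X = 1 | θ) = ∏ Φ(a_m(θ_m − b)) taken over the dimensions on which an item loads, which is one common parameterization of such models and may differ from the exact form used here. The equal-correlation construction of the traits, the zero means and unit variances, the single location per item, the function names, and the example parameter values are likewise illustrative assumptions.

```python
import math
import random

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def draw_traits(n_dims, rho, rng):
    """Draw one trait vector with unit variances and common correlation rho >= 0.

    Uses a shared-factor construction: each trait mixes one common normal
    draw with its own independent normal draw.
    """
    common = rng.gauss(0.0, 1.0)
    return [math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_dims)]

def prob_correct(theta, a, b):
    """Noncompensatory probability: product of per-dimension normal ogives.

    a: discriminations per dimension (0.0 means the item does not load there).
    b: item location, assumed here to be shared across dimensions.
    """
    p = 1.0
    for theta_m, a_m in zip(theta, a):
        if a_m > 0.0:
            p *= normal_cdf(a_m * (theta_m - b))
    return p

def simulate(n_persons, items, n_dims, rho, seed=1):
    """Return an n_persons x n_items matrix of 0/1 responses."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        theta = draw_traits(n_dims, rho, rng)
        data.append([1 if rng.random() < prob_correct(theta, a, b) else 0
                     for a, b in items])
    return data

# Hypothetical 2D items: one simple item per dimension and one complex item.
items = [([1.2, 0.0], 0.0), ([0.0, 1.2], -0.75), ([1.4, 1.2], 0.75)]
responses = simulate(500, items, n_dims=2, rho=0.30)
```

Because the per-dimension probabilities are multiplied, a low standing on one trait cannot be offset by a high standing on another, which is what distinguishes this model from its compensatory counterpart.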
Outcome Variables
Two outcome variables were used to evaluate the performance of the procedures: (a) the proportion of correct selection of true (modeled) dimensionality and (b) the ability to label sets of items as dimension-like.
The proportion of correct selection of true dimensionality
This variable was computed as the proportion of times across 500 replications within each condition that a “true” dimensional space was favored by a method. A “true” dimensional space was defined by the generating model; either 2D or 3D structures were considered and therefore counted.
The ability to label sets of items as “dimension-like”
This outcome variable was computed in a two-step process: (a) grouping of the items and (b) labeling of the groups of items. Including this outcome variable placed the emphasis on the interpretation of the clusters or sets of items resulting from the analysis. The goal was to examine how often a set of items could be interpreted as adequately representing one of the true underlying dimensions.
Grouping of the items
In DETECT, sets of items were determined and grouped “automatically,” as the procedure output nonoverlapping clusters with specifications to which cluster each item belongs.
In NOHARM, however, some manipulation was required to obtain groupings of items. Items were grouped by applying the following criteria to the rotated Promax factor solution selected as optimal by each of the methods. For an item to be considered associated with a particular factor, the item must have an estimated loading of >.40 on that factor, and the difference between that loading and each of its other loadings (i.e., loadings on the remaining factors) must be >.20.
If the item had an estimated loading >.40 and the difference between its largest loading and at least one other loading was <.20, the item was labeled as complex (note that this complexity was with respect to the fitted factor model, which may not necessarily correspond to whether the item was generated as a factorially complex item). Alternatively, if an item failed to meet either criterion (i.e., its loadings were <.40 on all factors), the item was considered to be unexplained (i.e., the item failed to have a substantial loading on, or association with, any dimension).
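The grouping rules above can be expressed as a small classification function. This is a sketch under stated assumptions: the function name and example loadings are hypothetical, signed loadings are used as given, and boundary values (a loading of exactly .40 or a difference of exactly .20), which the text does not address, are treated as failing the respective criterion.

```python
def classify_item(loadings, min_loading=0.40, min_gap=0.20):
    """Classify one item from its rotated loadings.

    Returns ("factor", index), ("complex", None), or ("unexplained", None).
    Boundary values are treated as failing the criterion (an assumption).
    """
    top = max(loadings)
    top_index = loadings.index(top)
    if top <= min_loading:
        return ("unexplained", None)   # no substantial loading on any factor
    others = [x for k, x in enumerate(loadings) if k != top_index]
    if all(top - x > min_gap for x in others):
        return ("factor", top_index)   # clearly associated with one factor
    return ("complex", None)           # substantial but not clearly separated

print(classify_item([0.62, 0.15, 0.08]))  # ('factor', 0)
print(classify_item([0.55, 0.45, 0.05]))  # ('complex', None)
print(classify_item([0.30, 0.25, 0.10]))  # ('unexplained', None)
```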
Labeling of sets of items
After all items were grouped in their respective sets, the labeling of these “item groups” or “item sets” as dimension-like began. A set of items can be labeled as Dimension-1-like, Dimension-2-like, or Dimension-3-like set of items, depending on the true dimensionality of the data (e.g., in 2D conditions, item sets could be labeled as Dimension-1-like or Dimension-2-like).
For a set of items to be labeled as dimension-like, three criteria must be met. First, at least 50% of the items that were generated as factorially simple on a given dimension (i.e., items that ought to reflect a single dominant dimension) must be contained in the set. Second, these factorially simple items must make up more than half of the set to which they belong (i.e., a majority of items in the set must be factorially simple, reflecting a particular dimension) in order to consider all of the items in that set as dimension-like items. Third, the proportion of factorially simple items in the cluster under consideration (e.g., Dimension-1-like items in Cluster 1) must be larger than the proportion of the same type of items in any other cluster (e.g., Dimension-1-like items in Cluster 2 or Cluster 3). Combined, these three criteria allow individual clusters to be evaluated separately when the optimal solution favors an m-cluster solution, without labeling multiple clusters as the same dimension-like set (i.e., no two clusters can be labeled as the same dimension-like set).
To illustrate how these three criteria were applied, consider the following situation. Let us assume a 2D condition with 10 items per dimension and 30% complexity in the data. According to the study design, items 1 to 10 belong to Dimension 1 and items 11 to 20 belong to Dimension 2. Furthermore, because data exhibit 30% complexity, a total of 6 items (3 per dimension) are modeled as complex. Therefore, the “true” associations of the items are as follows: Items 3, 7, and 9 associated with Dimension 1 are modeled as factorially complex, whereas Items 1 and 2, 4 to 6, 8, and 10 are modeled as factorially simple items associated with Dimension 1. Likewise, Items 13, 17, and 19 are modeled as factorially complex items, whereas Items 11 and 12, 14 to 16, 18, and 20 are modeled as factorially simple items associated with Dimension 2.
If DETECTE outputs a solution of two clusters such that Items 1 to 9 are grouped into Cluster 1 and Items 10 to 20 are grouped into Cluster 2, our evaluation of the criteria would be as follows:
In Cluster 1, we have ~86% of the items modeled as factorially simple on Dimension 1 (six out of the original seven items), which meets the first criterion to label this set as Dimension-1-like. The second criterion is also met, as the six factorially simple items comprise more than 50% of the total size of Cluster 1 (six out of nine items; ~67%). Finally, the proportion of factorially simple Dimension 1 items in Cluster 1 is .86, which is greater than the remaining proportion of these items in Cluster 2, which is .14, so the third criterion is met as well. Given that all three criteria are met, we would consider Items 1 through 9 from Cluster 1 as Dimension-1-like.
The same procedure is applied to Cluster 2, which contains Items 10 through 20. Here, we have 1 item (Item 10) that is originally modeled as a factorially simple item associated with Dimension 1; Items 11 and 12, 14 to 16, 18, and 20 are modeled as factorially simple items associated with Dimension 2; and Items 13, 17, and 19 are modeled as factorially complex items. Applying the first criterion, we see that all 7 items modeled as factorially simple and associated with Dimension 2 are grouped together (100% of items originally modeled as factorially simple are grouped together). The second criterion is also met, as these 7 items comprise more than 50% of the total set of 11 items (7 out of 11). Last, the third criterion of proportionality is also met, as the proportion of items of this particular modeled dimensionality is larger in Cluster 2 than in Cluster 1 (1.0 > .0). For this one replication (i.e., selected optimal solution), DETECTE’s ability to label both sets of items as dimension-like would be counted.
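The worked example can be reproduced with a short sketch of the three criteria. The function and variable names are hypothetical, and the reading of the first and third criteria as shares of a dimension's factorially simple items follows the worked example above rather than any published code.

```python
def label_clusters(clusters, simple_items):
    """Map each cluster index to a dimension label, or None.

    clusters: list of non-empty sets of item numbers (one set per cluster).
    simple_items: dict mapping a dimension label to the set of items
                  generated as factorially simple on that dimension.
    """
    labels = {}
    for c, cluster in enumerate(clusters):
        labels[c] = None
        for dim, simple in simple_items.items():
            inside = cluster & simple
            share = len(inside) / len(simple)       # criterion 1: share of the dimension's simple items
            majority = len(inside) / len(cluster)   # criterion 2: share of the cluster they occupy
            other_shares = [len(other & simple) / len(simple)
                            for k, other in enumerate(clusters) if k != c]
            if (share >= 0.5 and majority > 0.5
                    and all(share > s for s in other_shares)):  # criterion 3
                labels[c] = dim
    return labels

# Example from the text: 2D, 10 items per dimension, 30% complexity.
clusters = [set(range(1, 10)), set(range(10, 21))]
simple_items = {
    "Dimension 1": {1, 2, 4, 5, 6, 8, 10},
    "Dimension 2": {11, 12, 14, 15, 16, 18, 20},
}
print(label_clusters(clusters, simple_items))
# {0: 'Dimension 1', 1: 'Dimension 2'}
```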
It is to be noted that the marginal proportions computed across replications within any one condition ignore the accuracy aspect of the methods. In other words, ability to label sets of items as dimension-like is evaluated for a dimensional solution irrespective of whether that solution is indeed reflective of true, generated, dimensionality. Going back to our example of 2D condition, if DETECTE provided a three-cluster solution, each cluster of items would be submitted to the three criteria described above and count of the ability to label sets of items as Dimension-1-like and Dimension-2-like would be taken across all three clusters.
Results
Overall, the results are presented in two parts, each corresponding to the outcome variable considered in the study. Within each part, relevant selected results are presented. Complete tables and figures may be obtained from the author by request.
Part 1: Proportion Correct
Shorter tests with 10 items per dimension in 2D structures
In Table 1, the proportion of correctly identified dimensions (i.e., counting of dimensions), henceforth proportion correct, for 2D conditions with 10 items per dimension is presented. Several patterns emerged for the studied procedures. First, the NOHARM-based RMSR seldom identified the correct number of dimensions across any complexity level, sample size, or correlation. Second, the other two NOHARM-based methods—
Table 1. Proportion Correct for Methods Where Data Follow 2D Noncompensatory MIRT With 10 Items per Dimension.
Note. 2D = two-dimensional; MIRT = multidimensional item response theory; ALR = approximate likelihood ratio; NOHARM = normal ogive harmonic analysis robust method; DETECT = dimensionality evaluation to enumerate contributing traits; RMSR = root mean square residual. R stands for NOHARM-based RMSR; DE stands for DETECT exploratory; DCV stands for DETECT cross-validated.
Indicates actual zero correct.
Indicates <.01 proportion correct.
Impact of lengthening the 2D test
An increase in the number of items associated with each dimension had differential impact on the methods (see Table 2). Out of the five methods, the greatest impact of lengthening the test was observed for RMSR, which yielded some of the highest proportions correct. The greatest improvement was noted in conditions with small sample sizes, across complexity and correlation levels. This result is in contrast to that obtained in 2D conditions with shorter tests, in which RMSR was the worst performing method.
Table 2. Proportion Correct for Methods Where Data Follow 2D Noncompensatory MIRT With 20 Items per Dimension.
Note. 2D = two-dimensional; MIRT = multidimensional item response theory; ALR = approximate likelihood ratio; NOHARM = normal ogive harmonic analysis robust method; DETECT = dimensionality evaluation to enumerate contributing traits; RMSR = root mean square residual. R stands for NOHARM-based RMSR; DE stands for DETECT exploratory; DCV stands for DETECT cross-validated.
Indicates actual zero correct.
Indicates <.01 proportion correct.
The ALR method was affected by the increase in the number of items in different ways. In conditions with no complexity, positive impact (i.e., higher proportion correct) was noted in conditions with correlations between .30 and .75 across sample sizes. As the complexity level increased, ALR typically yielded higher rates in conditions with correlations at .60 or lower; it is to be noted, however, that in conditions with N = 500 and increased complexity, ALR generally tended to deteriorate in performance.
The
DETECT-based methods were particularly positively affected by the increase in the number of items in conditions with correlations of ≤.30, across complexity levels and sample sizes. As the sample size increased, the proportion correct tended to increase for both DETECT-based methods, particularly DETECTE, in these low(er) correlation conditions. At correlations of ≥.60, proportions correct were generally very low, never exceeding .10 at any complexity level or sample size for either DETECT-based method.
Shorter tests with 10 items per dimension in 3D structures
The proportion correct for 3D conditions with 10 items per dimension is presented in Table 3. Generally, none of the methods performed well in recovering the true underlying number of dimensions, except in conditions with no correlation and large sample size where the NOHARM-based methods
Table 3. Proportion Correct for Methods Where Data Follow 3D Noncompensatory MIRT With 10 Items per Dimension.
Note. 3D = three-dimensional; MIRT = multidimensional item response theory; ALR = approximate likelihood ratio; NOHARM = normal ogive harmonic analysis robust method; DETECT = dimensionality evaluation to enumerate contributing traits; RMSR = root mean square residual. R stands for NOHARM-based RMSR; DE stands for DETECT exploratory; DCV stands for DETECT cross-validated.
Indicates actual zero correct.
Indicates <.01 proportion correct.
Impact of lengthening the 3D test
Lengthening the test in 3D conditions had a differential, although in some cases meaningful, effect on the methods. As seen in Table 4, the most notable and generally positive effect was found for NOHARM-based RMSR, for which proportions correct increased in many conditions; in particular, high proportions correct were obtained when complexity was 30% or less and correlations were ≤.30. The other two NOHARM-based methods, ALR and
Table 4. Proportion Correct for Methods Where Data Follow 3D Noncompensatory MIRT With 20 Items per Dimension.
Note. 3D = three-dimensional; MIRT = multidimensional item response theory; ALR = approximate likelihood ratio; NOHARM = normal ogive harmonic analysis robust method; DETECT = dimensionality evaluation to enumerate contributing traits; RMSR = root mean square residual. R stands for NOHARM-based RMSR; DE stands for DETECT exploratory; DCV stands for DETECT cross-validated.
Indicates actual zero correct.
Indicates <.01 proportion correct.
Part 2: Ability to Label Sets of Items as Dimension-Like
Two-dimensional structures with shorter tests
In addition to examining the performance of the methods in terms of accurate counting of dimensions, the performance of the methods was examined by their ability to label sets of items as dimension-like. Recall that the purpose of this evaluation was to go beyond the assessment of the number of dimensions present. The purpose here was to evaluate, regardless of the selected solution, how often grouped items can be thought of in some meaningful way as dimensions. Using the previously specified criteria, marginal proportions were calculated for each method across the 500 replications. In Table 5, marginal proportions of labeling both sets of items as dimension-like are reported (complete results for labeling any one set of items as dimension-like may be obtained by contacting the author).
Table 5. Marginal Proportion of Labeling Both Sets of Items as Dimension-Like for 2D Conditions in Short Tests.
Note. 2D = two-dimensional; ALR = approximate likelihood ratio; NOHARM = normal ogive harmonic analysis robust method; DETECT = dimensionality evaluation to enumerate contributing traits; RMSR = root mean square residual. R stands for NOHARM-based RMSR; DE stands for DETECT exploratory; DCV stands for DETECT cross-validated.
Indicates actual zero.
Indicates <.01 proportion.
It was observed that the DETECT-based methods, in particular DETECTE, had more success in labeling the two sets of items as dimension-like compared with their NOHARM-based counterparts across all conditions. The RMSR method did not perform well with respect to labeling both sets of items in any condition.
When the data exhibited 30% or more of complexity, none of the methods was very successful in labeling the two sets of items as dimension-like. In these conditions, out of all of the methods, DETECTE was the most successful in labeling both sets of items as dimension-like, in particular with lower levels of complexity and correlations of at most .75.
Generally, an increase in correlations between dimensions yielded lower marginal proportions of labeling both sets of items as dimension-like for all complexity levels. Additional, unreported results showed that as the complexity in the data increased, the methods’ ability to label any one set of items as dimension-like also increased. In other words, in conditions with complexity of 30% and 50%, the methods were generally very successful in identifying any one set of items. This was found in conditions across different levels of correlation and sample size.
Impact of lengthening the test
In 2D structures, the increase in the number of items had varying impact on the methods’ ability to label the sets of items as dimension-like. As can be seen in Table 6, the DETECT-based methods tended to be the least affected by the increase in test length and remained the most successful procedures for labeling both sets of items as dimension-like across conditions with 30% or less complexity, in particular when correlations were ≤.60. In conditions with 50% complexity, none of the five methods was able to label both sets of items as dimension-like.
Marginal Proportion of Labeling Both Sets of Items as Dimension-Like for 2D Conditions in Longer Tests.
Note. 2D = two-dimensional; ALR = approximate likelihood ratio; NOHARM = normal ogive harmonic analysis robust method; DETECT = dimensionality evaluation to enumerate contributing traits; RMSR = root mean square residual. R stands for NOHARM-based RMSR; DE stands for DETECT exploratory; DCV stands for DETECT cross-validated.
Indicates actual zero.
Indicates <.01 proportion.
The increase in test length had dissimilar effects on the NOHARM-based methods. Of all five methods, RMSR benefited the most from the increase in test length. The positive effect was observed for almost all conditions; however, as the complexity and correlations increased, the impact was smaller in magnitude. The direction of the impact (positive or negative) varied across the other two NOHARM-based methods. For
ALR, however, was affected differentially by the increase in test length. For example, in conditions with no complexity and no correlation, the increase in test length had a negative impact on the procedure’s performance. As the complexity levels or correlations increased, lengthening the test tended to positively affect ALR’s performance. Lastly, it was observed that the five procedures studied were not successful in labeling all three sets of items as dimension-like in conditions with .90 correlation, except for DETECTE, which in conditions with no complexity yielded rates of .42 and .66 for medium and large sample sizes, respectively.
Three-Dimensional Structures With Shorter and Longer Tests
Although marginal proportions for labeling sets of items as dimension-like in 2D conditions were high for the NOHARM-based methods under certain conditions (see Tables 5 and 6), the NOHARM-based methods were not successful in identifying the sets of items as dimension-like that would correspond to all three dimensions across any complexity levels, sample size, or correlations.
For ALR,
The Marginal Proportion of Labeling All Three Sets of Items as Dimension-Like in 3D Conditions for DETECT-Based Methods in Shorter (J = 10 per Dimension) and Longer (J = 20 per Dimension) Tests.
Note. 3D = three-dimensional; ALR = approximate likelihood ratio; DETECT = dimensionality evaluation to enumerate contributing traits; RMSR = root mean square residual. DETECTE stands for DETECT exploratory; DETECTCV stands for DETECT cross-validated.
Indicates actual zero.
Indicates <.01 proportion.
As illustrated in Table 7, DETECTE was generally more successful than DETECTCV in labeling all three sets of items as dimension-like. Both methods were the most successful in conditions with zero and 10% complexity when correlations among dimensions were .60 or less. As the complexity increased to 30% or higher, in conditions with small and medium sample sizes, the methods yielded low marginal proportions (rarely did the rates rise above .20 for either procedure). With respect to test length, the increase in the number of items tended to positively affect the methods’ performance. The effect was largely noted in conditions with a correlation of at most .60 and a complexity of 0% or 10%.
Discussion
This study examined the performance of popular tools used in conducting dimensionality assessment of binary data where the underlying MIRT model was noncompensatory and the data exhibited varied degrees of complexity. The methodological contribution of this study is that little is currently known about dimensionality assessment procedures under these types of conditions, although such conditions may indeed reflect complex assessments. As was observed in the current study, the use of DETECT- and NOHARM-based methods should be studied further before they are applied in situations that resemble those of the present study, as the procedures performed at what might in practice be considered an acceptable level in only a small number of instances.
For example, in 2D conditions, the DETECT-based methods seemed to have difficulty in correctly counting the number of dimensions in shorter tests (Table 1), but they were quite successful in appropriately grouping items together (Table 5). On the other hand, the NOHARM-based methods ALR and
Even though little is known about the performance of dimensionality assessment procedures in situations where a noncompensatory MIRT model underlies the data, recognizing the work conducted in compensatory MIRT, we are able to further understand how these procedures operate across a variety of conditions. As discussed above, Finch and Habing (2005) found that when 2D and 6D conditions were examined, in higher dimensional conditions, the increase in the number of items benefited the ALR, whereas DETECT’s performance deteriorated. Lengthening the test had a somewhat different effect in the current study, where in 3D conditions (as compared with 2D cases), the increase in the number of items had a negative effect on the proportion correct for ALR and
Similarly, Gierl et al. (2006) found that DETECT performed quite satisfactorily when using a compensatory model and conditions with at most 30% complexity and at most .60 correlations. The results in this study were often different; DETECT-based methods generally performed poorly, with only a few exceptions where none of the methods accurately identified the correct dimensionality. Notice, however, that comparisons of this type cannot be made directly, as the generating MIRT models differed between the current and previous studies.
Although the current study contributes to the literature on dimensionality assessment by using the noncompensatory MIRT model with complex data, it has certain limitations. Among those, the findings on the methods’ performance generalize only within the limits of the studied conditions. Future work should include other possible testing scenarios, such as the use of pseudo-guessing parameters and the analysis of other model structure types (e.g., unbalanced structures where the number of complex items varies across dimensions). In addition, this study only considered 2D and 3D conditions. As Finch and Habing (2005) found, the procedures may indeed perform differently in test structures with a large(r) number of dimensions. As our assessments become more complex and multidimensional, considerations of higher dimensions may be useful.
Furthermore, our understanding of the performance of any method in dimensionality assessment should be approached in a comprehensive manner. In the current study, an attempt was made to further illustrate how groups of items may be considered to meaningfully reflect a dimension. The criteria and choices in the operationalization of these outcome variables were made with practical considerations in mind; however, these should be further studied and potentially refined. Additionally, as future research further examines data with complex structure, a deeper examination of the identification and classification of complex items is warranted. This is especially possible with procedures such as NOHARM, which can capture the structural aspect of the test by providing estimates of loadings and other item parameters. Finally, the current study does not directly address the issue of what happens when the methods err. Given the incorrect identification of the dimensionality, methods may be more likely to under- or overfactor. Together with under- or overfactoring, future studies could investigate how procedures group and separate items, in a fashion similar to what Finch and Habing examined in their 2005 study. This knowledge will provide a better understanding of how the methods behave and should be considered for future research.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
