Specificity and False Positive Rates of the Test of Memory Malingering, Rey 15-Item Test, and Rey Word Recognition Test Among Forensic Inpatients With Intellectual Disabilities

Abstract

This study evaluated the specificity and false positive (FP) rates of the Rey 15-Item Test (FIT), Word Recognition Test (WRT), and Test of Memory Malingering (TOMM) in a sample of 21 forensic inpatients with mild intellectual disability (ID). The FIT demonstrated an FP rate of 23.8% with the standard quantitative cutoff score. Certain qualitative error types on the FIT showed promise and had low FP rates. The WRT obtained an FP rate of 0.0% with previously reported cutoff scores. Finally, the TOMM demonstrated low FP rates of 4.8% and 0.0% on Trial 2 and the Retention Trial, respectively, when applying the standard cutoff score. FP rates are reported for a range of cutoff scores and compared with published research on individuals diagnosed with ID. Results indicated that although the quantitative variables on the FIT had unacceptably high FP rates, the TOMM and WRT had low FP rates, increasing the confidence clinicians can place in scores reflecting poor effort on these measures during ID evaluations.

Keywords

malingering, effort testing, intellectual disability, mental retardation

The American Psychiatric Association’s (2013) Diagnostic and Statistical Manual of Mental Disorders, fifth edition (DSM-5) defines malingering as “the intentional production of false or grossly exaggerated physical or psychological symptoms, motivated by external incentives” (p. 726). Published base-rate estimates for malingering in forensic evaluations range from as low as 8% to as high as 40% depending on the population studied and the method used to derive an estimate (e.g., Griffin, Normington, May, & Glassmire, 1996; Larrabee, 2003; Mittenberg, Patton, Canyock, & Condit, 2002; Rogers, Sewell, & Goldstein, 1994). Malingering is often manifested as poor effort on cognitive testing. Therefore, competent assessment of cognitive functioning in forensic evaluations requires that clinicians address the effort level of examinees due to the significant incentive such examinees may have to exaggerate symptoms (Otto, 2008). There are a number of forensic contexts in which the presence or absence of an intellectual disability (ID; formerly referred to as mental retardation; see Schalock et al., 2007) is directly relevant to forensic legal criteria, including evaluations for Social Security disability, competency to stand trial, mental state at the time of an offense, and death penalty mitigation.

The sensitivity of a cutoff score on an effort measure refers to the percentage of individuals who are putting forth poor effort who score in the poor effort range when using that cutoff. The specificity of a cutoff score refers to the percentage of individuals in a relevant clinical comparison group who score in the valid effort range. The false positive (FP) rate (also referred to as nonspecificity) of a cutoff score refers to the percentage of validly performing individuals in a relevant clinical comparison group who are erroneously classified as putting forth poor effort. The FP rate is important, as it is considered an egregious error to classify a valid responder erroneously as putting forth poor effort due to the significant negative ramifications of such an error (Vitacco, 2008). It is important to underscore that although sensitivity estimates derived from studies often can generalize to a variety of clinical settings, specificity estimates and FP rates are inherently tied to the clinical population/question that is being evaluated.
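The three definitions above reduce to simple proportions. The following Python sketch is illustrative only: the score lists are hypothetical (they are not data from this or any cited study), the function names are introduced here for convenience, and a "below the cutoff" decision rule is assumed.

```python
def percent_below(scores, cutoff):
    """Percentage of scores falling below the cutoff (flagged as poor effort)."""
    flagged = sum(1 for score in scores if score < cutoff)
    return 100.0 * flagged / len(scores)


def sensitivity(poor_effort_scores, cutoff):
    """Percentage of known poor-effort examinees who are correctly flagged."""
    return percent_below(poor_effort_scores, cutoff)


def false_positive_rate(valid_scores, cutoff):
    """Percentage of validly performing examinees who are erroneously flagged."""
    return percent_below(valid_scores, cutoff)


def specificity(valid_scores, cutoff):
    """Percentage of validly performing examinees who are not flagged."""
    return 100.0 - false_positive_rate(valid_scores, cutoff)


# Hypothetical scores, for illustration only.
valid_group = [50] * 20 + [41]            # 21 examinees giving valid effort
poor_effort_group = [30, 35, 38, 44, 47]  # 5 examinees giving poor effort

print(round(false_positive_rate(valid_group, 45), 1))
print(round(specificity(valid_group, 45), 1))
print(round(sensitivity(poor_effort_group, 45), 1))
```

Because specificity and the FP rate are computed from the same comparison group, they always sum to 100%, which is why both depend on the clinical population chosen for comparison.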

The Test of Memory Malingering (TOMM; Tombaugh, 1996) evaluates for suboptimal effort during memory testing by using a visual recognition paradigm with three trials. A recent survey of forensic practitioners indicated that the TOMM was the most commonly used test to assess for feigned ID, with 64% of respondents indicating that they used the TOMM for this purpose (Victor & Boone, 2007). Although a number of studies indicate that the TOMM is relatively insensitive to many bona fide neuropsychological and psychiatric disorders, the FP rates of standard TOMM cutoff scores among individuals with documented ID have been equivocal.

Two previous studies obtained unacceptably high FP rates on Trial 2 and the Retention Trial of the TOMM when using the standard cutoff score of <45 items. Hurley and Deal (2006) found an FP rate of 41% on Trial 2 among nonforensic outpatients with documented ID. The 39 participants in Hurley and Deal’s sample had Full Scale IQ (FSIQ) scores ranging from 50 to 78 with the majority (51%) falling between 60 and 69. Graue et al. (2007) found FP rates of 31% for Trial 2 and 19% for the Retention Trial using the cutoff of <45 among 26 ID individuals with an average FSIQ of 61.7 (SD = 8.9). Graue et al. recommended adopting a cutoff of <30 and found an FP rate of only 4% on each trial using this cutoff.

Two other studies found more promising results for the TOMM with lower FP rates among individuals with ID. Simon (2007) administered the TOMM to 21 forensic inpatients with documented ID (FSIQ M = 60, SD = 4.8) who were no longer facing legal charges and obtained an FP rate of 4.8% for Trial 2 and 0.0% for the Retention Trial using the <45 cutoff. Shandera et al. (2010) obtained FP rates of 12% for Trial 2 and 8% for the Retention Trial with the <45 cutoff among a sample of 24 participants with mild ID (FSIQ M = 63.21, SD = 6.66). Using the cutoff of <30, they obtained an FP rate of only 4% for Trial 2 and 0% for the Retention Trial.

It does not appear that the differences in these previous findings regarding the FP rate of the TOMM are related purely to differences in FSIQ among the four samples studied, as all four samples had similar average FSIQs generally falling between 60 and 69. Moreover, Hurley and Deal (2006) found a small, nonsignificant correlation of r = −.03 between FSIQ and Trial 2 of the TOMM. Delain (2006) found that a sample of individuals with ID living in a group home setting had a 44% failure rate on the TOMM, whereas an independent living ID sample had only a 10% failure rate. Although the independent living sample had significantly higher IQ scores, these participants were also more likely to have attended mainstream school settings and were rated as having significantly higher levels of adaptive functioning. Some of these other factors may account for the widely differing findings in studies evaluating the FP rates of the TOMM among individuals with ID.

The Rey 15-Item Test (FIT; Rey, 1964) is one of the most commonly used measures for evaluating effort during cognitive testing (Rabin, Barr, & Burton, 2005) and was the third most commonly used effort measure in forensic ID evaluations in Victor and Boone’s (2007) survey, with 44% of respondents indicating that they use the test for this purpose. The most commonly identified cutoff score for predicting poor effort on the FIT is <9 items recalled (Nitch & Glassmire, 2007), and this cutoff generally results in FP rates below 10% in various clinical samples and diagnostic groups (Boone, Salazar, Lu, Warner-Chacon, & Razani, 2002). However, FP rates for the <9 cutoff have been significantly higher among individuals with documented ID. FP rates identified for the <9 cutoff in various studies of ID populations include 37.5% (Goldberg & Miller, 1986), 79.5% (Hurley & Deal, 2006), and 55.7% (Marshall & Happe, 2007). When using a recognition procedure developed by Boone et al. (2002), the Combination Score (free recall + [recognition – FP errors]) cutoff of <20 suggested by Boone et al. resulted in an FP rate of 17.1% among ID individuals in Marshall and Happe’s study. In comparison, Boone et al. found that the same Combination Score cutoff resulted in only an 8.3% FP rate among a mixed sample of neuropsychology referrals.

Some authors have found that alternative scoring procedures focusing on the number of qualitative errors (e.g., item reversals, confabulated items, repeated items, incorrect ordering of rows) are less correlated with IQ, whereas other authors have found that certain types of qualitative errors are common among low IQ individuals. Griffin, Normington, and Glassmire (1996) found that although the number of items recalled on the FIT significantly correlated with IQ in a sample of mixed psychiatric patients, the number of qualitative error types committed was not significantly correlated with IQ. This finding suggested that qualitative errors may be less sensitive to intellectual disabilities than the total number of items recalled. Within a sample of individuals who were receiving Social Security disability income for psychiatric disabilities, Griffin et al. found FP rates below 10% for between-row (3.8%), “dyslexic” (i.e., character reversal; 0.0%), embellishment (0.0%), gestalt (1.9%), indistinct character (7.5%), and roman numeral (1.9%) errors. Marshall and Happe (2007) found that many of the qualitative error types identified by Griffin et al. were rare among individuals with ID, including “dyslexic” (1.4%) and embellishment (2.8%) errors. In contrast, Hays, Emmons, and Stallings (2000) found that certain types of qualitative errors are quite common among individuals with ID and, therefore, would produce unacceptably high FP rates. For example, 48.2% of Hays et al.’s sample of individuals with low IQ (<70) produced confabulated responses and 43.6% of their sample produced repetitions on the FIT.

The Rey Word Recognition Test (WRT; Lezak, 1995; Rey, 1941) has shown promise as a measure of effort on cognitive testing (Frederick, 2002; Nelson et al., 2003; Nitch, Boone, Wen, Arnold, & Alfano, 2006). Using a cutoff of ≤5 (after subtracting FPs), Greiffenstein, Baker, and Gola (1994) obtained an FP rate of 12% and a sensitivity rate of 59%. Nitch et al. (2006) obtained sensitivity rates on the WRT of greater than 80% with no more than a 10% FP rate in a population of patients who attempted to feign mild traumatic brain injury symptoms. Unfortunately, the FP rate of the WRT among individuals with ID has not been established in previous research.

Despite the proliferation of research on effort measures during the past few decades, few studies have directly addressed the FP rates of common effort measures among individuals with documented ID (Victor & Boone, 2007). Moreover, the limited research evaluating FP rates among ID individuals has produced equivocal findings. Additionally, most prior studies evaluating this issue provided data only for the FP rates of traditional cutoff scores, and did not provide FP rates for a range of possible cutoff scores that may be obtained in clinical practice. This prevents clinicians from being able to provide accurate forensic reports and testimony regarding the true likelihood of an FP error for any obtained score that deviates from the usual cutoff score. The current study sought to evaluate the FP rates of a wide range of cutoff scores for the TOMM, FIT, and WRT among individuals with documented mild ID. This study was not focused on establishing additional sensitivity estimates for these measures, as their sensitivity to poor effort has been well established in previous studies. Rather, the purpose of this study was to evaluate the failure rates on these measures for a wide range of possible cutoffs among a well-defined sample of ID individuals who were thoroughly screened for the presence of ID and for the lack of any incentive to give poor effort. Because participants were thoroughly screened to ensure that they were likely giving their best effort, failure of an effort measure at any given cutoff in our sample indicates a likely FP finding for that cutoff score among individuals with mild ID.

Method

Participants

Participants were recruited from a state-operated inpatient facility that treats patients with various developmental disabilities, including those diagnosed with ID. Diagnosis of ID for all patients involved an extensive interdisciplinary evaluation and review process. To be referred to the hospital, patients must have been identified as eligible for Regional Center services in California, which requires the presence and documented diagnosis of a developmental disability before an individual attains 18 years of age. Many patients also underwent psychological testing as part of the process of commitment to the hospital. Prior testing was reviewed and, if needed, additional intellectual testing was conducted by a psychologist to substantiate the diagnosis of ID. Patients were housed in an inpatient forensic facility where they could be observed on a 24-hour basis and their functioning in the facility was well documented by nursing staff. Observations were available from inpatient staff (psychiatric technicians, group leaders, social workers, psychologists, psychiatrists, etc.) and off-unit staff at the hospital (e.g., teachers and job coaches). The final diagnosis of ID was based on an interdisciplinary decision process that included a review of the aforementioned information and observations by a psychologist, psychiatrist, social worker, and nursing staff.

Predetermined criteria that precluded consideration for the study included (a) previous diagnosis of malingering or suspicion of malingering by the treatment team, (b) legal cases still under adjudication (because of the potential motivation to feign/exaggerate ID symptoms), (c) history of traumatic brain injury or other neuropsychological insult, (d) diagnosed ID below mild ID, or (e) documented IQ results above the borderline range. Patients with moderate to severe ID were excluded because of institutional concerns about their increased overall level of psychological vulnerability and decreased ability to provide assent or to complete the evaluation process. To reduce the probability of including patients who intentionally performed poorly during the study, an a priori decision was made to exclude results from patients who scored statistically significantly below the chance range on any trials of the TOMM, which was the only two-item forced-choice test administered that allowed for statistical evaluation of below-chance performance using the binomial formula. Forty-one patients were approached for possible participation in the study; 28 (68.3%) expressed interest in participation and underwent the consent/assent procedures described below. Six of these patients (21.4%) were excluded from consideration in the study because they had unresolved legal charges. None of the remaining patients met any of the additional preliminary exclusion criteria, leaving a sample of 22 participants who completed all measures from the study. One participant was excluded from the final analyses because of producing scores on two TOMM trials that fell significantly below the chance range at the p < .05 level of significance.
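The below-chance screen described above can be illustrated with the binomial formula. The sketch below is not the authors' computation: it assumes 50 two-choice items per TOMM trial, a guessing probability of .5, and a one-tailed test (the text does not state whether a one- or two-tailed criterion was applied). The function name `below_chance_p` is hypothetical.

```python
from math import comb


def below_chance_p(num_correct, num_items=50, p_guess=0.5):
    """One-tailed binomial probability of num_correct or fewer hits by pure guessing."""
    return sum(
        comb(num_items, k) * p_guess**k * (1 - p_guess) ** (num_items - k)
        for k in range(num_correct + 1)
    )


# Probability of scoring at or below each value if the examinee were guessing.
for score in (25, 19, 18):
    print(score, round(below_chance_p(score), 4))
```

Under these assumptions, a score of 18 or lower on a 50-item trial would fall significantly below chance at the p < .05 level, whereas a score of 19 would not.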

The final sample (n = 21) included 20 men (95.2%) and one woman (4.8%). The participants were all committed to the hospital involuntarily under a legal commitment that required the presence of a documented developmental disability. All the patients had been judicially determined to be a danger to self, danger to others, or gravely disabled and unable to care for themselves on an outpatient basis. The ethnic breakdown of the sample was 66.7% Caucasian, 23.8% Hispanic/Latino, 4.8% African American, and 4.8% who self-identified with more than one ethnicity. The participants were an average of 31.7 (SD = 6.9) years old. All 21 participants had received a DSM-IV-TR diagnosis of mental retardation by their interdisciplinary treatment teams prior to participation in the study and all participants were under legal conservatorships. The average Wechsler Adult Intelligence Scale–III (WAIS-III; Wechsler, 1997) FSIQ for the final sample was 60.89 (SD = 6.8; range = 45-73).

Procedure

Participants granted assent to participate after their legal conservator granted consent for research participation. All participants were evaluated individually by a psychologist or graduate student trained in the administration of the three effort measures. At the conclusion of the testing, all participants were asked if they had ever been administered any of the three effort measures previously; no participant answered in the affirmative. All participants were administered the TOMM, FIT, and WRT during a single testing session. The ordering of the measures was as follows: (a) FIT free recall and recognition trials, (b) TOMM Trials 1 and 2, (c) WRT, (d) a standardized set of general mental status questions (to ensure that adequate time elapsed between Trial 2 and the Retention Trial of the TOMM), and (e) TOMM Retention Trial.

Results

The average TOMM scores were 45.29 (SD = 4.72) for Trial 1, 49.23 (SD = 2.10) for Trial 2, and 49.81 (SD = 0.40) for the Retention Trial. FP rates for all TOMM scores are provided in Table 1. The standard cutoff of <45 resulted in an FP rate of 4.8% on Trial 2 and 0.0% on the Retention Trial. Neither TOMM Trial 2 (r = .168, p = .49) nor the Retention Trial (r = −.086, p = .73) was significantly correlated with FSIQ at the p < .05 level.

Table 1.

Specificity and Nonspecificity (False Positive) Rates for Trial 2 and the Retention Trial of the TOMM.

Cutoff scores	Specificity (true negative rate), %	Nonspecificity (false positive rate), %
Trial 2
<41	100.0	0.0
<42	95.2	4.8
<43	95.2	4.8
<44	95.2	4.8
<45	95.2	4.8
<46	95.2	4.8
<47	90.5	9.5
<48	90.5	9.5
<49	90.5	9.5
<50	66.7	33.3
Retention Trial
<45	100.0	0.0
<46	100.0	0.0
<47	100.0	0.0
<48	100.0	0.0
<49	100.0	0.0
<50	81.0	19.0

Note. Specificity = percentage of participants scoring above the cutoff who were classified as having put forth valid effort using the cutoff. Nonspecificity = percentage of participants scoring below the cutoff who were classified as having put forth suboptimal effort using the cutoff. TOMM = Test of Memory Malingering.

Two previous studies (Graue et al., 2007, and Shandera et al., 2010) provided data regarding average FSIQ, as well as data regarding Trial 2 and the Retention Trial of the TOMM for two different cutoffs (<45 and <30). Therefore, we combined our data with the available data from these studies to provide a more stable estimate with a larger sample size for two different cutoffs. We also separately combined our data with the data from Simon (2007) to provide FP rates for the <45 cutoff for a sample that consisted of only forensic inpatients with ID.

When data from the present study are combined with the published data from Graue et al. (2007) and Shandera et al. (2010), the total sample size is 71 and the weighted average FSIQ score is 61.35 (pooled SD = 6.30). In the combined sample, the FP rate for the <45 cutoff was 16.9% for Trial 2 and 9.9% for the Retention Trial. When adopting the <30 cutoff, the combined FP rate was 2.8% for Trial 2 and 1.4% for the Retention Trial. When combining our sample with Simon’s sample, the total sample size is 42 and the weighted average FSIQ is 60.45 (pooled SD = 5.8). In this combined forensic inpatient sample, the FP rate for the <45 cutoff on the TOMM was 4.8% for Trial 2 and 0.0% for the Retention Trial.
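The pooled FP rates above can be approximately reconstructed from the percentages and sample sizes reported in the text. The sketch below is a reconstruction under a stated assumption, not the authors' procedure: each study's number of failures is recovered by multiplying its reported FP percentage by its sample size and rounding to the nearest whole participant. The variable and function names are hypothetical.

```python
# Each entry: (sample size, Trial 2 FP %, Retention Trial FP %), as reported in the text.
FP_LT45 = [
    (21, 4.8, 0.0),    # present study
    (26, 31.0, 19.0),  # Graue et al. (2007)
    (24, 12.0, 8.0),   # Shandera et al. (2010)
]
FP_LT30 = [
    (21, 0.0, 0.0),    # present study (Table 1 shows 0.0% at the lowest listed cutoffs)
    (26, 4.0, 4.0),    # Graue et al. (2007)
    (24, 4.0, 0.0),    # Shandera et al. (2010)
]


def pooled_fp_rate(studies, column):
    """Pool FP rates across studies; column 1 = Trial 2, column 2 = Retention Trial."""
    total_n = sum(study[0] for study in studies)
    # Assumption: failure count = reported percentage times n, rounded to a whole person.
    total_failures = sum(round(study[0] * study[column] / 100) for study in studies)
    return 100.0 * total_failures / total_n


print(round(pooled_fp_rate(FP_LT45, 1), 1), round(pooled_fp_rate(FP_LT45, 2), 1))
print(round(pooled_fp_rate(FP_LT30, 1), 1), round(pooled_fp_rate(FP_LT30, 2), 1))
```

Pooling counts rather than averaging percentages weights each study by its sample size, so larger samples contribute proportionally more to the combined estimate.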

The average free recall score on the FIT was 10.90 (SD = 4.54). The average Combination Score (Boone et al., 2002) was 20.76 (SD = 9.40). The specificity and FP rates for the FIT quantitative variables are provided in Table 2. The cutoff of <9 items resulted in an FP rate of 23.8%. A cutoff of <6 items on FIT Recall was needed to achieve an FP rate below 10%. Boone et al.’s recommended cutoff of <20 on the Combination Score resulted in an FP rate of 38.1%. A Combination Score cutoff of <6 was required to attain an FP rate below 10%. Both the FIT Recall score (r = .484, p = .04) and the Combination Score (r = .504, p = .03) were significantly correlated with the WAIS-III FSIQ score at the p < .05 level.

Table 2.

Specificity and Nonspecificity (False Positive) Rates of FIT Variables.

Cutoff scores	Specificity (true negative rate), %	Nonspecificity (false positive rate), %
Recall Correct
<15	33.3	66.7
<14	38.1	61.9
<13	42.9	57.1
<12	57.1	42.9
<11	61.9	38.1
<10	61.9	38.1
<9	76.2	23.8
<8	81.0	19.0
<7	85.7	14.3
<6	90.5	9.5
<5	95.2	4.8
<4	95.2	4.8
<3	100.0	0.0
Recognition Correct (true positive)
<15	33.3	66.7
<14	38.1	61.9
<13	42.9	57.1
<12	57.1	42.9
<11	61.9	38.1
<10	61.9	38.1
<9	76.2	23.8
<8	81.0	19.0
<7	85.7	14.3
<6	90.5	9.5
<5	95.2	4.8
<4	95.2	4.8
<3	100.0	0.0
False Positive Recognitions
>0	57.1	42.9
>1	71.4	28.6
>2	71.4	28.6
>3	90.5	9.5
>4	90.5	9.5
>5	95.2	4.8
>6	95.2	4.8
>7	95.2	4.8
>8	100.0	0.0
Combination Score
<30	23.8	76.2
<29	23.8	76.2
<28	28.6	71.4
<27	42.9	57.1
<26	47.6	52.4
<25	47.6	52.4
<24	47.6	52.4
<23	57.1	42.9
<22	61.9	38.1
<21	61.9	38.1
<20	61.9	38.1
<19	61.9	38.1
<18	66.7	33.3
<17	76.2	23.8
<16	76.2	23.8
<15	76.2	23.8
<14	76.2	23.8
<13	76.2	23.8
<12	76.2	23.8
<11	76.2	23.8
<10	76.2	23.8
<9	85.7	14.3
<8	85.7	14.3
<7	85.7	14.3
<6	90.5	9.5
<5	95.2	4.8
<4	95.2	4.8
<3	95.2	4.8
<2	100.0	0.0

FIT protocols were scored for the 10 types of qualitative errors identified by Griffin, Normington, and Glassmire (1996). Each error type was counted only once regardless of how many times it was produced. The percentages of protocols containing each error type were as follows: within-row errors (i.e., placing characters within a row out of order; 42.9%), repetition errors (19.0%), between-row errors (combining items from two different rows into a single row; 9.5%), Roman numeral errors (placing horizontal lines on the top and bottom of the I, II, III rows; 9.5%), wrong item errors (9.5%), row sequence errors (placing rows out of sequence; 4.8%), and character reversal errors (e.g., “d” instead of “b”; 4.8%). There were no other qualitative error types in this sample. Overall, 38.1% of participants made no qualitative errors, 47.6% made one qualitative error, 9.5% made two qualitative errors, and 4.8% made three qualitative errors. No participant made more than three qualitative error types. All nine (100.0%) of the within-row errors across protocols were made on the row of shapes. When within-row errors on the row of shapes were removed from consideration, 61.9% of participants made no qualitative errors, 23.8% made one qualitative error, 9.5% made two qualitative errors, and 4.8% made three qualitative errors.

The average True Recognition score on the WRT was 12.5 (SD = 2.4). The average number of False Recognition responses on the WRT was 4.0 (SD = 3.1). The average True Recognition-minus-False Recognition score was 8.5 (SD = 3.6). The FP rates for the WRT variables are provided in Table 3. When applying True Recognition cutoff scores from the literature, a cutoff of ≤7 resulted in an FP rate of 4.8% and a cutoff of ≤6 resulted in an FP rate of 0.0%. When evaluating the number of False Recognition errors, a cutoff of ≥6 was required to achieve an FP rate of less than 10%. When evaluating the True Recognition-minus-False Recognition score, a cutoff of ≤4 was required to achieve an FP rate of less than 10%. The WRT Recognition score was significantly negatively correlated with FSIQ (r = −.51, p = .03) at the p < .05 level. However, visual inspection of a scatter plot indicated two outliers: one participant had an FSIQ 8 points below the next lowest FSIQ with a perfect WRT score and one participant had a WRT Recognition score that fell 3 points below the next lowest WRT score with one of the highest FSIQ scores. When these two outliers were removed from analyses, the correlation was not significant (r = −.45, p = .06). Neither the number of False Recognition errors (r = −.30, p = .21) nor the True Recognition-minus-False Recognition score (r = −.09, p = .72) was significantly correlated with FSIQ at the p < .05 level.

Table 3.

Specificity and Nonspecificity (False Positive) Rates of WRT Variables.

Cutoff scores	Specificity (true negative rate), %	Nonspecificity (false positive rate), %
Recognition Correct
≤15	28.6	71.4
≤14	38.1	61.9
≤13	52.4	47.6
≤12	71.4	28.6
≤11	81.0	19.0
≤10	90.5	9.5
≤9	95.2	4.8
≤8	95.2	4.8
≤7	95.2	4.8
≤6	100.0	0.0
False Recognition Errors
≥1	0.0	100.0
≥2	33.3	66.7
≥3	47.6	52.4
≥4	71.4	28.6
≥5	81.0	19.0
≥6	90.5	9.5
≥7	90.5	9.5
≥8	95.2	4.8
≥9	100.0	0.0
TR – FR
≤14	9.5	90.5
≤13	9.5	90.5
≤12	23.8	76.2
≤11	33.3	66.7
≤10	42.9	57.1
≤9	52.4	47.6
≤8	57.1	42.9
≤7	76.2	23.8
≤6	81.0	19.0
≤5	85.7	14.3
≤4	90.5	9.5
≤3	95.2	4.8
≤2	95.2	4.8
≤1	95.2	4.8
≤0	100.0	0.0

To determine the combined FP rates when using all three measures simultaneously, we calculated the percentage of participants scoring in the poor effort range on 0, 1, 2, or 3 measures using standard cutoffs (TOMM <45 on either Trial 2 or Retention; FIT Recall <9; WRT Recognition ≤6). When evaluating the aforementioned cutoff scores simultaneously, the FP rates were 23.8% for ≥1 test failure, 4.8% for ≥2 test failures, and 0.0% for 3 test failures. The FIT was failed by all participants who failed at least one measure. When the FIT Recall cutoff score was lowered to <6, the FP rate was 9.5% for ≥1 test failure, 4.8% for ≥2 test failures, and 0.0% for 3 test failures. When the analyses were limited to the standard cutoffs on the TOMM and WRT, the FP rate for a cutoff of ≥1 failure was 4.8% and the FP rate for 2 failures was 0.0%.
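The multiple-measure tally can be sketched as follows. Participant-level data are not reported, so the failure pattern below is hypothetical: it was constructed only to be consistent with the rates reported for the standard cutoffs (one participant failing both the TOMM and FIT, four failing the FIT only, none failing the WRT). The function name is likewise introduced here for illustration.

```python
def multi_test_fp_rates(failures_per_participant):
    """Map k to the percentage of participants failing at least k measures."""
    n = len(failures_per_participant)
    num_measures = len(failures_per_participant[0])
    fail_counts = [sum(flags) for flags in failures_per_participant]
    return {
        k: 100.0 * sum(count >= k for count in fail_counts) / n
        for k in range(1, num_measures + 1)
    }


# Hypothetical flags in the order (TOMM, FIT, WRT); True = poor-effort range.
participants = (
    [(True, True, False)]            # 1 participant failing the TOMM and FIT
    + [(False, True, False)] * 4     # 4 participants failing the FIT only
    + [(False, False, False)] * 16   # 16 participants failing no measure
)

for k, rate in multi_test_fp_rates(participants).items():
    print(f">= {k} failures: {rate:.1f}%")
```

Requiring more failures before concluding poor effort can only lower or hold constant the FP rate, because every participant who fails at least two measures has also failed at least one.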

Discussion

The present study evaluated the FP rates of various cutoff scores on the TOMM, FIT, and WRT among individuals with documented mild ID who were screened to ensure removal of any participants likely to give poor effort. By reporting the FP rates of a range of cutoff scores, clinicians can be more confident in decisions made about effort among examinees with mild ID who obtain scores that fall below traditional cutoffs.

Our FP rates for the standard TOMM cutoff scores were substantially lower than those obtained by Hurley and Deal (2006) and Graue et al. (2007), slightly lower than those obtained by Shandera et al. (2010), and comparable to those obtained by Simon (2007). Consistent with Hurley and Deal (2006), the TOMM was not significantly correlated with FSIQ in our sample. Our sample had a similar average FSIQ to all the samples from previous studies of the TOMM. However, our sample was composed of forensic patients who were involuntarily committed to an inpatient treatment setting that may provide more environmental support and structure than the outpatient settings where Graue et al. (2007) and Shandera et al. (2010) collected their data. Our results on the TOMM were similar to those of Simon (2007), who also studied a forensic inpatient sample. Additionally, our participants were administered a shorter battery of tests in comparison with the participants in Graue et al. and Shandera et al.’s studies. Like our study, Simon administered a brief testing battery. Therefore, fatigue may have been less of an issue for the participants in our study and in the study by Simon.

Because the individual results of each study may be less stable than the combined results of the studies when evaluated simultaneously, we combined our data with the data from Graue et al. (2007) and Shandera et al. (2010), as these were the only two studies that provided means and standard deviations for the FSIQ and that also provided data for more than one cutoff on the TOMM. In the combined sample (N = 71), the FP rate for the <45 cutoff was 16.9% for Trial 2, but fell just below 10% for the Retention Trial. When adopting the lower cutoff of <30 items correct on Trial 2 and the Retention Trial, the combined FP rate was 2.8% for Trial 2 and 1.4% for the Retention Trial. These findings indicate that obtained scores below 30 on Trial 2 or scores below 45 on the Retention Trial are unlikely to represent FP errors. Therefore, clinicians can be more confident in interpreting TOMM scores in these ranges as indicative of suboptimal effort in ID evaluations. We also computed the FP rate of our data combined with Simon’s (2007) data, as both these samples were forensic inpatient samples. In this combined forensic inpatient sample (N = 42), the FP rate for the <45 cutoff was 4.8% for Trial 2 and 0.0% for the Retention Trial, indicating that scores below 45 on Trial 2 or the Retention Trial are unlikely to represent FP errors among forensic inpatients with ID. Therefore, clinicians can be more confident in interpreting TOMM scores below 45 as indicative of poor effort when conducting ID evaluations in an inpatient setting.

Consistent with previous literature, our findings regarding the FP rates for the FIT among ID individuals were less promising. The traditional recall cutoff of <9 resulted in an FP rate of almost 24% in our sample and the previously recommended combination score cutoff of <20 resulted in a 38.1% FP rate. It was not surprising that the <20 cutoff for the combination score produced a higher FP rate than the recall cutoff of <9, as the FIT recognition procedure was created by Boone et al. primarily as a method of increasing the FIT’s sensitivity (rather than specificity). Additionally, Boone et al. (2002) found that the combination score cutoff of <20 resulted in higher FP rates than the recall cutoff of <9 among a mixed sample of neuropsychology referrals, a learning disabled sample, and a nonclinical sample. To reduce the FP rate to below 10% in our sample, we needed to adopt a cutoff of <6 items for both the Free Recall and Combination Score. A FIT cutoff of <6 on either of these variables is low enough that it likely reduces the sensitivity of the FIT significantly in this population. The lowest scores with reported sensitivity estimates we could find in the literature for the FIT were for a Recall cutoff of <8 and a combination cutoff of <18 (Boone et al., 2002). These cutoff scores resulted in low sensitivity estimates of 46.9% and 61.2%, respectively.

We evaluated the frequency of qualitative errors identified by Griffin, Normington, and Glassmire (1996) in our sample and found that the most common qualitative error type involved within-row errors among the row of shapes. The only other qualitative errors that were produced by more than 10% of participants were repetition errors, which Hays et al. (2000) also found to occur frequently among individuals with ID. Overall, our findings indicate that several qualitative error types on the FIT are relatively rare among individuals with ID and, therefore, should increase suspicion of poor effort in ID evaluations.

We were unable to find any previous studies that evaluated the FP rates for the WRT among individuals with ID. Our results indicated that a cutoff of ≤9 produced an FP rate of only 4.8%. The standard cutoff of ≤6 resulted in an FP rate of 0.0% in our sample, indicating that the WRT may be a promising measure for effort among individuals with documented ID. Because this is the only study to evaluate the WRT among individuals with ID, further research on the FP rates of the WRT among other samples of individuals with ID is needed.

When traditional cutoffs for all three measures were used simultaneously, a combined cutoff of ≥2 test failures resulted in an FP rate of 4.8% and a cutoff of ≥3 test failures resulted in an FP rate of 0.0%. When analyses were limited to simultaneous use of the traditional cutoffs on the TOMM and WRT, a cutoff of ≥1 test failure resulted in a 4.8% FP rate and a cutoff of ≥2 test failures resulted in a 0.0% FP rate. This finding indicates that the simultaneous use of the TOMM and WRT traditional cutoffs can retain the low FP rate and possibly increase the sensitivity of the test battery to poor effort by evaluating for effort more than one time in a battery. Because we did not include a group of participants with known or suspected poor effort, the actual impact of using multiple measures on sensitivity in our sample is unknown.

Our findings have several implications for clinicians who conduct ID evaluations. First, because different studies have produced equivocal results regarding the FP rate of the TOMM among individuals with ID, certain contextual factors should be considered by clinicians when determining whether failure on the TOMM is likely to represent an FP error. When TOMM results are compared across studies, the FP rates were lower among forensic inpatient ID samples than among outpatients with ID. Therefore, clinicians may be able to place more confidence in TOMM scores that are obtained by inpatients (who presumably have more environmental support) and should be more cautious when interpreting TOMM scores among outpatients. Second, lower FP rates were found for the TOMM in studies that did not include a lengthy battery of tests. Therefore, if the TOMM is used in ID evaluations, it should be administered at the beginning of the assessment battery to reduce any impact that fatigue may have on performance. Third, because the FIT recall score consistently produces high FP rates among individuals with ID across studies, this test may not be appropriate for ID evaluations unless a very low cutoff score is adopted or the clinician focuses on qualitative errors in the interpretation. Because it will reduce the sensitivity of the FIT to lower the cutoff enough to produce an acceptable FP rate, clinicians should not rely on the FIT as the sole instrument for evaluating effort in ID evaluations. Finally, our results indicate that a combination of the TOMM and WRT may produce valuable information in ID evaluations, as no participant in our sample failed both measures simultaneously and the use of two instruments gives clinicians more than one opportunity to measure effort in a test battery.

Our study has several strengths that add to the literature on cognitive effort testing in ID evaluations. First, our sample had well-documented histories of mild ID. Second, we excluded individuals who had any known incentive to exaggerate or who produced below-chance test scores, thereby reducing the possibility that our results were affected by participants who did not put forth valid effort. Third, this is one of the few existing studies to include more than one effort measure in an ID sample, thereby allowing for a direct comparison of the relative FP rates across measures and for evaluation of the impact of using multiple tests on overall FP rates. Fourth, this is the first study that directly evaluated the FP rate of the WRT among individuals with documented ID. Finally, this is the first study that provides the FP rates for a wide range of cutoff scores on these measures among individuals with documented ID.

Our study also has some limitations that should be considered when interpreting the findings. First, our sample size is small, making it possible that outliers affected the results. This limitation is not unique to this study, given the difficulty of obtaining access to well-defined ID samples for research. The four published studies that evaluated the FP rate of the TOMM among individuals with ID had an average sample size of 27.5 (SD = 7.9; range = 21-39). We combined our data with published data to arrive at more stable FP estimates of the TOMM among individuals with ID. Because a large sample (n = 100) was already available for individuals with documented ID on the FIT (Marshall & Happe, 2007) and our findings were generally consistent with their findings, we did not conduct a combined analysis for the FIT. Unfortunately, we were unable to find any previous studies on the performance of individuals with ID on the WRT.

An additional limitation of this study is that it does not provide data regarding the sensitivity of these effort measures. Because we did not have a group of participants with known or suspected poor effort, we were able to evaluate only the specificity and FP rates of the instruments among individuals with ID. This limitation precluded the generation of statistics representing overall classification rates or cutoff scores that optimize both sensitivity and specificity concurrently. However, the primary purpose of this study was to determine whether scores that are typically associated with poor effort on these measures are, in fact, common among individuals with bona fide ID and, therefore, may be produced by FP errors in this population. Additionally, sensitivity estimates are available from previous studies for the traditional cutoffs for all three measures. Although FP rates are inherently tied to the population/question being evaluated (e.g., “is this examinee’s performance consistent with ID?”), sensitivity rates can often generalize from one study to other populations depending on how the sensitivity estimates were derived (i.e., “does available research indicate that the obtained score is sensitive to poor effort?”).
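The distinction drawn above can be stated with the standard classification definitions. The symbols TN, FP, TP, and FN (true negatives, false positives, true positives, false negatives) are notation introduced here for illustration and do not appear in the original analyses. The worked line assumes a single participant scoring below a cutoff in a sample of 21, which is the count implied by a 4.8% rate.

```latex
\[
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{FP rate} = 1 - \text{Specificity} = \frac{FP}{TN + FP}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}
\]
\[
\text{FP rate} = \frac{1}{21} \approx 4.8\%, \qquad
\text{Specificity} = \frac{20}{21} \approx 95.2\%
\]
```

Specificity and the FP rate depend only on examinees presumed to be giving valid effort, so a sample such as this one can estimate them. Sensitivity depends only on examinees giving poor effort, so it cannot be estimated without such a group.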

An additional limitation is that our sample included only individuals with mild ID. It is likely that FP rates would be higher if individuals with moderate or severe ID were included in the sample. Therefore, our findings generalize only to individuals with mild ID. Because individuals with moderate and severe ID may have better documented histories of adaptive deficits and intellectual limitations, as well as more difficulties completing a battery of tests, the issue of effort testing may be less salient in such evaluations. Nonetheless, clinicians should not apply these FP estimates to individuals with a documented history of moderate to severe ID.

Finally, our analyses rest on the assumption that the participants were giving valid effort when they took the tests. It is possible that some participants did not put forth their best effort. This limitation is not unique to this study and is inherent in all studies that use predefined clinical samples in order to arrive at specificity/FP estimates. This limitation is a trade-off that is required in order to use real-world clinical samples (which increase the generalizability of findings) and requires researchers to use careful screening procedures in order to reduce as much as possible the probability of including participants who give poor effort during the study. It is unlikely that a significant portion of our sample put forth poor effort for three reasons. First, patients were not invited to participate in the study if they had ever been diagnosed with malingering or were suspected of exaggeration by treatment staff who had access to 24-hour observations. Second, patients were not invited to participate in the study if they had pending legal charges. Finally, participants were excluded from analysis if they produced any statistically below-chance test performances, as such performance indicates a high probability that they intentionally selected incorrect answers. If any of the participants included in the analyses were, in fact, giving poor effort, such a scenario would result in overestimates of the FP rates, thereby resulting in more conservative use of the resultant cutoffs in clinical practice. This would reduce the probability of an examinee falsely being labeled as putting forth poor effort in clinical practice.

It is recommended that future researchers provide FP rates for a wide range of cutoff scores on effort measures rather than limiting analyses to traditional cutoff scores. Furthermore, more research is needed regarding the sensitivity of various effort measures in this population. Because this is the first study to evaluate the FP rate of the WRT, our promising findings regarding the low FP rate on the WRT require replication. Given the relative lack of published studies on the performance of individuals with ID on commonly used effort measures, the significant clinical and legal decisions that are often made with this population based on psychological assessment results, and the equivocal findings across studies, more research should be conducted to determine normative performances of this group.

Footnotes

Authors’ Note

This was not grant-funded research. The California Department of State Hospitals allocated some time to the authors to work on the data analysis, but most of the project and writing was completed on our own time after work hours. The opinions expressed in this manuscript are those of the authors and do not reflect the opinions of the California Department of State Hospitals.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC: Author.

Boone, K. B., Salazar, Warner-Chacon, & Razani (2002). The Rey 15-Item recognition trial: A technique to enhance sensitivity of the Rey 15-Item Memorization Test. Journal of Clinical and Experimental Neuropsychology, 24, 561-573. doi:10.1076/jcen.24.5.561.1004

Delain, S. L. (2006). Use of the Test of Memory Malingering (TOMM) in adults with mental retardation. Dissertation Abstracts International: Section B: The Sciences and Engineering, 67(12-B), 7369.

Frederick, R. I. (2002). A review of Rey’s strategies for detecting malingered neuropsychological impairment. Journal of Forensic Neuropsychology, 2(3/4), 1-25. doi:10.1300/J151v02n03_01

Goldberg, J. O., & Miller, H. R. (1986). Performance of psychiatric inpatients and intellectually deficient individuals on a task that assesses the validity of memory complaints. Journal of Clinical Psychology, 42, 792-795. doi:10.1002/1097-4679(198609)42:5<792::AID-JCLP2270420519>3.0.CO;2-8

Graue, L. O., Berry, D. T. R., Clark, J. A., Sollman, M. J., Cardi, Hopkins, & Werline (2007). Identification of feigned mental retardation using the new generation of malingering detection instruments: Preliminary findings. The Clinical Neuropsychologist, 21, 929-942. doi:10.1080/13854040600932137

Greiffenstein, M. F., Baker, W. J., & Gola (1994). Validation of malingered amnesia measures with a large clinical sample. Psychological Assessment, 6, 218-224. doi:10.1037/1040-3590.6.3.218

Griffin, G. A. E., Normington, & Glassmire (1996). Qualitative dimensions in scoring the Rey Visual Memory Test of Malingering. Psychological Assessment, 8, 383-387. doi:10.1037/1040-3590.8.4.383

Griffin, G. A. E., Normington, May, & Glassmire (1996). Assessing dissimulation among Social Security disability income claimants. Journal of Consulting and Clinical Psychology, 64, 1425-1430. doi:10.1037/0022-006X.64.6.1425

Hays, J. R., Emmons, & Stallings (2000). Dementia and mental retardation markers on the Rey 15-Item Visual Memory Test. Psychological Reports, 86, 179-182. doi:10.2466/PR0.86.1.179-182

Hurley, K. E., & Deal, W. P. (2006). Assessment instruments measuring malingering used with individuals who have mental retardation: Potential problems and issues. Mental Retardation, 44, 112-119. doi:10.1352/0047-6765(2006)44[112:AIMMUW]2.0.CO;2

Larrabee, G. J. (2003). Detection of malingering using atypical performance patterns on standard neuropsychological tests. The Clinical Neuropsychologist, 17, 410-425.

Lezak, M. D. (1995). Neuropsychological assessment (3rd ed.). New York, NY: Oxford University Press.

Marshall & Happe (2007). The performance of individuals with mental retardation on cognitive tests assessing effort and motivation. The Clinical Neuropsychologist, 21, 826-840. doi:10.1080/13854040600801001

Mittenberg, Patton, Canyock, E. M., & Condit, D. C. (2002). Base rates of malingering and symptom exaggeration. Journal of Clinical and Experimental Neuropsychology, 24, 1094-1102.

Nelson, N. W., Boone, Dueck, Wagener, & Grills (2003). Relationship between eight measures of suspect effort. The Clinical Neuropsychologist, 17, 263-272. doi:10.1076/clin.17.2.263.16511

Nitch, Boone, K. B., Wen, Arnold, & Alfano (2006). The utility of the Rey Word Recognition Test in the detection of suspect effort. The Clinical Neuropsychologist, 20, 873-887. doi:10.1080/13854040590967603

Nitch, S. R., & Glassmire, D. M. (2007). Non-forced-choice measures to detect noncredible cognitive performance. In K. B. Boone (Ed.), Assessment of feigned cognitive impairment: A neuropsychological perspective (pp. 78-102). New York, NY: Guilford Press.

Otto, R. K. (2008). Challenges and advances in the assessment of response styles in forensic examination contexts. In Rogers (Ed.), Clinical assessment of malingering and deception (3rd ed., pp. 365-375). New York, NY: Guilford Press. doi:10.1111/j.2044-8333.2011.02016.x

Rabin, L. A., Barr, W. B., & Burton, L. A. (2005). Assessment practices of clinical neuropsychologists in the United States and Canada: A survey of INS, NAN, and APA Division 40 members. Archives of Clinical Neuropsychology, 20, 33-65. doi:10.1016/j.acn.2004.02.005

Rey (1941). L’examen psychologique dans les cas d’encéphalopathie traumatique [The psychological examination in cases of traumatic encephalopathy]. Archives de Psychologie, 28, 286-340.

Rey (1964). L’examen clinique en psychologie [The clinical examination in psychology]. Paris: Presses Universitaires de France.

Rogers, Sewell, K. W., & Goldstein (1994). Explanatory models of malingering: A prototypical analysis. Law and Human Behavior, 18, 543-552. doi:10.1007/BF01499173

Schalock, R. L., Luckasson, R. A., Shogren, K. A., Borthwick-Duffy, Bradley, Buntinx, W. H. E., . . . Yeagerm, N. H. (2007). The renaming of mental retardation: Understanding the change to the term intellectual disability. Intellectual and Developmental Disabilities, 45, 116-124. doi:10.1352/1934-9556(2007)45[116:TROMRU]2.0.CO;2

Shandera, A. L., Berry, D. T. R., Clark, J. A., Schipper, L. J., Graue, L. O., & Harp, J. P. (2010). Detection of malingered mental retardation. Psychological Assessment, 22, 50-56. doi:10.1037/a0016585

Simon, M. J. (2007). Performance of mentally retarded forensic patients on the Test of Memory Malingering. Journal of Clinical Psychology, 63, 339-344. doi:10.1002/jclp.20351

Tombaugh, T. N. (1996). Test of memory malingering. North Tonawanda, NY: Multi-Health Systems.

Victor, T. L., & Boone, K. B. (2007). Identification of feigned mental retardation. In K. B. Boone (Ed.), Assessment of feigned cognitive impairment: A neuropsychological perspective (pp. 310-345). New York, NY: Guilford Press.

Vitacco, M. J. (2008). Syndromes associated with deception. In Rogers (Ed.), Clinical assessment of malingering and deception (pp. 39-50). New York, NY: Guilford Press. doi:10.1111/j.2044-8333.2011.02016.x

Wechsler (1997). Wechsler Adult Intelligence Scale (3rd ed.). San Antonio, TX: Psychological Corporation.