Abstract
Background
Although the Boston Naming Test has been thoroughly validated at the whole-test level, its item-level properties have rarely been assessed using modern psychometric methods.
Objective
This study aimed to investigate the construct validity and item-level properties of the color-picture version of the Boston Naming Test (CP-BNT) in a Chinese cohort with neurodegenerative diseases.
Methods
This retrospective study included 424 participants, consisting of 118 normal controls, 152 with Alzheimer's disease, 101 with primary progressive aphasia, and 53 with other neurodegenerative diseases. All participants underwent a comprehensive neuropsychological assessment that included the CP-BNT. Factor analysis and item response theory were conducted.
Results
The CP-BNT exhibited a multidimensional structure with three factors: Factor 1, consisting of nine items with moderate difficulty levels, demonstrated peak measurement function for mild anomia (highest information value = 33.7, estimated ability value = −0.8, reliability = 0.97); Factor 2, comprising eleven items with lower difficulty levels, performed well in cases of mild to moderate anomia (highest information value = 34.1, estimated ability value = −1.2, reliability = 0.97); and Factor 3, including ten items with higher difficulty levels, provided the most measurement information for normal naming (highest information value = 9.9, estimated ability value = 0, reliability = 0.90). All items except igloo showed good discrimination (discrimination parameters across the 30 items ranged from 1.15 to 5.46). Most items occupied a different difficulty position from that in the original version, which yielded a novel item sequence with an ascending difficulty hierarchy for Chinese samples.
Conclusions
These findings indicate that the CP-BNT has good validity, reliability, and cultural appropriateness in the Chinese context, which supports its use in clinical assessment and intervention.
Introduction
The Boston Naming Test (BNT) is the most widely used neuropsychological measure for detecting naming impairment and has been translated into several languages since its initial publication in the United States during the 1980s.1–4 Despite its widespread use, the BNT was not originally designed as an international instrument. Several researchers have cautioned that language and cultural background could affect the psychometric properties of the BNT.5–8 For example, the item pretzel showed significant variability in difficulty between the original version and its translated counterparts (e.g., Spanish, Greek, and French versions) due to limited familiarity with pretzels outside of North America.2,4,5,8 This variation in difficulty usually results in changes in the item's discriminating power and/or measuring function.7,9 In addition to pretzel, other items such as beaver, yoke, and scroll have also shown different difficulty levels and measurement functions across modified versions. Individuals from diverse cultural backgrounds with similar naming abilities may therefore respond differently to specific items, potentially leading to a high rate of false positives or false negatives.2,5,8 This highlights the importance of validating item-level psychometric properties when modifying this naming test.
Item response theory (IRT) is an item-based analysis that has the advantage of providing information on how well each item discriminates between different levels of an individual's ability (e.g., between high and low naming ability), as well as the difficulty of a given item in comparison to others. 10 In classical test theory, item difficulty is determined by the proportion of examinees passing an item, which depends on characteristics of the sample such as age, level of education, or cognitive ability. Results obtained by classical test theory can therefore only be generalized to samples with similar characteristics. In contrast, IRT assumes that the probability of a correct answer depends on item properties (i.e., difficulty and discrimination) and on the examinee's ability measured on an interval scale, so the estimation of item properties is independent of the sample's characteristics.11,12 In addition, classical test theory assumes that all items have equal value in estimating an individual's ability, meaning that an easy item has the same predictive value for an examinee's ability as a difficult one, whereas IRT estimates the ability level at which an item provides the most measurement precision. 12 Therefore, IRT is increasingly becoming the leading methodology used to develop, evaluate, and score clinical measures. 13
Several modified versions of BNT have been validated using IRT. It was observed that item difficulty hierarchy did not consistently increase with item number, especially when the BNT was used outside of North America.14–18 In terms of discrimination parameters, several studies have reported that certain items were psychometrically redundant, contributing very little predictive value to the total scale.15–17 These IRT studies highlighted variations regarding the item-level properties of BNT across diverse cultural backgrounds.
The Chinese version of BNT has been used in the Chinese context for more than 30 years. During this time, several modifications have been made to improve its cultural appropriateness and diagnostic accuracy.3,19 In 2022, our team further modified this 30-item Chinese version by replacing the original black-and-white line-drawing pictures with colored, visually realistic pictures. The color-picture version of BNT was validated to have better reliability and validity than the black-and-white version, particularly in reducing the false positive rate for detecting mild cognitive impairment and mild dementia with Alzheimer's disease (AD).19,20 Similar to other non-English versions, the Chinese version of BNT has its own linguistic and cultural salience.21,22 Furthermore, the 30 items in the Chinese version of BNT were selected based on experts’ experience, and no item analysis has yet been conducted. These findings suggest that while the validation of the Chinese version of BNT has been well established at a global “test” level, the item-level properties have not yet been evaluated using modern psychometric methods. Hence, this study aimed to explore the underlying structure and item-level properties of the Chinese version of BNT using IRT.
Methods
Participants
Data for this retrospective study were obtained from a clinical sample with neurodegenerative diseases referred to a language impairment clinic and memory clinic at Xuanwu Hospital, Capital Medical University from 2015 to 2022. Control participants were recruited from the community or from among the spouses of patients. They were community-dwelling, cognitively and neurologically healthy individuals. The inclusion and exclusion criteria have been described in detail elsewhere.19,20 All participants underwent comprehensive assessments comprising medical history, neurological and physical examination, a neuropsychological battery, laboratory tests, and an MRI/CT scan. The clinical diagnosis was made in a multidisciplinary consensus meeting using international diagnostic consensus criteria. For AD, the clinical core diagnostic criteria of the National Institute on Aging-Alzheimer's Association (NIA-AA) were used.23,24 The diagnosis of primary progressive aphasia (PPA) was based on the international consensus. 25 Posterior cortical atrophy (PCA) was diagnosed according to the current research criteria. 26 Parkinson's dementia was diagnosed according to the Movement Disorder Society criteria. 27 The study sample consisted of 118 cognitively normal controls, 152 patients with AD, 101 patients with PPA, 35 patients with PCA, and 18 patients with Parkinson's dementia. All participants provided written informed consent before joining the study.
The color-picture version of Boston Naming Test
The color-picture version of BNT was administered as part of a comprehensive neuropsychological battery that also included the Mini-Mental State Examination (MMSE), Montreal Cognitive Assessment (MoCA), Clinical Dementia Rating scale (CDR), Auditory Verbal Learning Test, Digit Span, Trail Making Test, and Rey Complex Figure Copy. 20 The color-picture version of BNT is a modification of the black-and-white line-drawing version of BNT, which has been validated to have better acculturation than the original version for the Chinese population. It showed higher internal consistency (α = 0.767) and higher diagnostic effectiveness for distinguishing AD, PPA, and PCA from normal controls at the cutoff value of 25/26 (sensitivity ranged from 82.6% to 100%; specificity = 90%; area under the curve ranged from 90.7% to 99.4%).19,20 The 30-item color-picture version of BNT was administered in the order of the original version without basal or discontinuation rules. Based on the spontaneous naming (SN) score, naming impairment can be classified into different levels of severity. SN scores of 26 or above indicate normal naming ability, while scores more than 1 standard deviation (SD) below the mean indicate mild naming impairment, scores more than 2 SD below the mean indicate moderate naming impairment, and scores more than 3 SD below the mean indicate severe naming impairment.19,20
Statistical analysis
Continuous and categorical variables were described as mean ± standard deviation (mean ± SD) and as numbers with percentages [n (%)], respectively. The dimensionality of the BNT was evaluated by exploratory factor analysis (EFA) using maximum likelihood estimation with Promax rotation (SPSS version 22, Chicago, USA). Prior to the EFA, data suitability and sampling adequacy were checked using the Kaiser-Meyer-Olkin value and Bartlett's test of sphericity. Factors with an eigenvalue larger than 1 were extracted. The criteria for unidimensionality were: (1) variance explained by the measurement dimension ≥ 40%; (2) variance explained by the first principal component of residuals ≤ 15%; and (3) a ratio of (1) to (2) of at least 3:1. 15,28
In line with the dimensionality of CP-BNT, confirmatory factor analysis was used to evaluate the model fit of the suggested construct with the comparative fit index (CFI) and Tucker–Lewis Index (TLI), as well as the root-mean-square error of approximation (RMSEA). CFI and TLI values higher than 0.97 underlined a good fit; values ranging from 0.95 to 0.97 showed an acceptable fit. RMSEA values were categorized as good (≤0.05), adequate (0.05–0.08), mediocre (0.08–0.10), and unacceptable (>0.10).
Subsequently, item response theory analysis was carried out to evaluate the psychometric properties of the items. The discrimination parameter is denoted α and the difficulty parameter β. The discrimination parameter (α) represents the degree to which an item distinguishes persons with higher naming ability from those with lower naming ability. Values of α < 0.64 reflect unacceptable discrimination, values of 0.65–1.34 moderate discrimination, values of 1.35–1.69 high discrimination, and values > 1.70 very high discrimination. 29 The difficulty parameter (β) refers to the point on the ability scale at which a person has a 50% chance of responding correctly to the item. Item characteristic curves (ICC) were plotted for visual inspection of the discrimination and difficulty parameters of each item.
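As an illustration of how α and β shape an item characteristic curve, the following Python sketch uses the unidimensional two-parameter logistic (2PL) form. This is a simplification assumed for clarity: the present analysis is multidimensional, and the function names `icc_2pl` and `discrimination_label` are hypothetical. The quoted cutoffs leave small gaps between bands (for example between 0.64 and 0.65), so the sketch treats the bands as contiguous, which is also an assumption.

```python
import math


def icc_2pl(theta: float, alpha: float, beta: float) -> float:
    """Probability of a correct response under a unidimensional 2PL model."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))


def discrimination_label(alpha: float) -> str:
    """Map a discrimination parameter to the bands quoted in the text.

    Assumption: bands are treated as contiguous, closing the small gaps
    (e.g. 0.64 to 0.65) that the quoted cutoffs leave open.
    """
    if alpha >= 1.70:
        return "very high"
    if alpha >= 1.35:
        return "high"
    if alpha >= 0.65:
        return "moderate"
    return "unacceptable"


if __name__ == "__main__":
    # Illustrative parameters only, not estimates from this study.
    alpha, beta = 1.5, -1.0
    for theta in (-3.0, -1.0, 1.0):
        p = icc_2pl(theta, alpha, beta)
        print(f"theta={theta:+.1f}  P(correct)={p:.3f}")
    print(discrimination_label(alpha))
```

When θ equals β the function returns 0.5, which matches the definition of the difficulty parameter given above; a larger α makes the curve steeper around that point.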
If multidimensionality was identified, each factor was analyzed for its association with demographic variables (age, sex, and educational level) and cognitive performance (MMSE, MoCA, and CDR global score) using Spearman's correlation coefficient. Cronbach's alpha coefficient was calculated for each factor and for the total scale to assess internal consistency. Furthermore, measurement information and reliability were estimated for each factor within the IRT framework using the Metropolis-Hastings Robbins-Monro algorithm. The factor information function was graphed to show the ability level at which each factor yields the highest measurement precision. An information value > 3.30 and a reliability > 0.70 indicate high measurement function and good reliability. 29
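The information and reliability cutoffs quoted above are linked if conditional reliability is computed as 1 − 1/I(θ), where 1/I(θ) is the squared standard error of the ability estimate on a unit-variance scale. That relation is an assumption here, since the text does not state the formula, although it does reproduce the pairing of 3.30 with roughly 0.70. A minimal Python sketch, with hypothetical function names:

```python
def standard_error(information: float) -> float:
    """Standard error of the ability estimate at a given information value."""
    if information <= 0:
        raise ValueError("information must be positive")
    return information ** -0.5


def conditional_reliability(information: float) -> float:
    """Reliability at a given ability level, assuming r = 1 - 1/I.

    Assumption: latent ability is scaled to unit variance, so the
    squared standard error 1/I is the error share of the variance.
    """
    return 1.0 - standard_error(information) ** 2


if __name__ == "__main__":
    # 3.30 is the cutoff quoted in the text; the other values are the
    # peak information values reported for Factors 1, 2, and 3.
    for info in (3.30, 33.69, 34.10, 9.91):
        print(f"I={info:5.2f}  reliability={conditional_reliability(info):.2f}")
```

Under this assumption, the peak information values reported in the Results (33.69, 34.10, and 9.91) correspond to reliabilities of about 0.97, 0.97, and 0.90.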
RStudio for Windows (Version 2022.07.0, Posit Software, Boston, MA, USA) was used to analyze the data. In each subsection, the specific R package employed for statistical analyses is provided. The significance level was 0.05 for statistical tests.
Results
Participant characteristics
Table 1 presents the demographic characteristics, cognitive performance, and color-picture version of BNT scores of all 424 participants. The average age was 64.75 years (SD = 7.08), ranging from 47 to 84 years. The majority were female (n = 235, 55.4%). The most common categories of cognitive level were mild cognitive impairment (CDR = 0.5, n = 142, 34.2%) and mild dementia (CDR = 1, n = 127, 30.0%). The spontaneous naming (SN) score of the color-picture version of BNT varied from 1 to 30 (21.17 ± 7.58). Among the participants, 162 had normal naming ability (SN = 26–30), 101 had mild to moderate naming impairment (SN = 21–25), and 161 had severe naming impairment (SN = 1–20).
Table 1. The demographic data and clinical diagnosis (n = 424).
NC: normal control; AD: Alzheimer's disease; PPA: primary progressive aphasia; MMSE: Mini-Mental State Examination; MoCA: Montreal Cognitive Assessment; CDR: Clinical Dementia Rating; CP-BNT: Color-picture version of Boston Naming Test; SN: spontaneous naming; SD: standard deviation.
Supplemental Table 1 shows the demographic characteristics and the cognitive and CP-BNT performance of each of the four groups.
The results of factor analysis
Sampling adequacy was excellent (Kaiser-Meyer-Olkin = 0.96), and Bartlett's test of sphericity was significant (approximate χ2 = 2345.9, d.f. = 29, p < 0.001). The results from EFA revealed a 3-factor structure. The eigenvalues were 12.68, 2.23, and 1.15. The factor loadings are shown in Table 2. Each item showed an acceptable factor loading, ranging from 0.40 to 0.71. As shown in Table 2, the variance in the data explained by measures was 52.5%. The first factor explained 19.8% of the variance. The ratio of variance explained by measures to variance explained by the first component of residuals was 2.65:1 (less than 3:1), which does not meet the criteria for unidimensionality. Thus, this test was regarded as a multidimensional scale comprising three factors.
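The three unidimensionality criteria from the Methods can be checked mechanically. The Python sketch below is illustrative, and `check_unidimensionality` is a hypothetical helper. It assumes that the 19.8% figure is the variance attributed to the first component of residuals, which is inferred from the reported 2.65:1 ratio (52.5 / 19.8) rather than stated outright in the text.

```python
def check_unidimensionality(explained_pct: float, residual_pct: float) -> dict:
    """Apply the three criteria listed in the Methods.

    explained_pct: variance explained by the measurement dimension (%).
    residual_pct:  variance explained by the first principal component
                   of residuals (%).
    """
    ratio = explained_pct / residual_pct
    criteria = {
        "explained_ok": explained_pct >= 40.0,   # criterion (1)
        "residual_ok": residual_pct <= 15.0,     # criterion (2)
        "ratio_ok": ratio >= 3.0,                # criterion (3)
    }
    return {
        **criteria,
        "ratio": ratio,
        "unidimensional": all(criteria.values()),
    }


if __name__ == "__main__":
    # Assumption: 19.8% is the first residual component's share.
    result = check_unidimensionality(explained_pct=52.5, residual_pct=19.8)
    print(f"ratio = {result['ratio']:.2f}:1")
    print("unidimensional:", result["unidimensional"])
```

With these inputs only criterion (1) is met, which is consistent with the conclusion that the scale is not unidimensional.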
Table 2. Results of exploratory factor analysis of CP-BNT (n = 424).
CP-BNT: Color-picture version of Boston Naming Test.
Model fit
The following model fit indices were obtained: M2 = 476.96, d.f. = 405; Comparative Fit Index = 0.998; Tucker–Lewis Index = 0.997; Root Mean Square Error of Approximation = 0.020 (95% confidence interval: 0.009, 0.029). All indices met the criteria for good fit, supporting the three-factor structure.
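The cutoffs given in the Methods can be applied to these indices with a small lookup. The following Python sketch is illustrative; the function names are hypothetical, and the label "poor" for CFI or TLI values below 0.95 is an assumption, because the text does not name that band.

```python
def classify_incremental_fit(value: float) -> str:
    """Label a CFI or TLI value using the cutoffs stated in the Methods."""
    if value > 0.97:
        return "good"
    if value >= 0.95:
        return "acceptable"
    return "poor"  # assumption: the text gives no label below 0.95


def classify_rmsea(value: float) -> str:
    """Label an RMSEA value using the four bands stated in the Methods."""
    if value <= 0.05:
        return "good"
    if value <= 0.08:
        return "adequate"
    if value <= 0.10:
        return "mediocre"
    return "unacceptable"


if __name__ == "__main__":
    # Indices reported for the three-factor model.
    print("CFI  ", classify_incremental_fit(0.998))
    print("TLI  ", classify_incremental_fit(0.997))
    print("RMSEA", classify_rmsea(0.020))
```

All three reported indices fall in the "good" band, and the upper bound of the RMSEA confidence interval (0.029) also stays below 0.05.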
Item-level properties
Table 3 presents the results of the multidimensional item response theory (MIRT) analysis, with item discrimination and difficulty parameters. The majority of the items had discrimination parameters above 1.7, signifying very high discrimination. Among all 30 items, abacus had the highest discrimination (α2 = 5.46), followed by camel (α2 = 4.94) and funnel (α1 = 4.65). The least discriminating item was igloo (α3 = 1.15), followed by protractor (α3 = 1.53) and wheelchair (α1 = 1.69).
Table 3. The results of multidimensional item response theory.
The first column displays the sequence number of 30 items, arranged in ascending order based on the actual difficulty parameter in the Chinese context. The second column on the left presents the sequence number in the original version. Items in bold signify noticeable changes in difficulty position from the original version.
CP-BNT: Color-picture version of Boston Naming Test.
As for the difficulty parameters, most items had β values below zero, indicating that most of the 30 items were relatively easy. Igloo was the most difficult item (β = 2.97), followed by harp, protractor, and dart (β = 0.69, 0.13, and 0.10, respectively). The least difficult items were pencil, tree, flower, and scissors (β = −2.63, −2.41, −2.40, and −2.12, respectively). Notably, some items differed markedly in difficulty position from the original version. Four items (abacus, tongs, funnel, and accordion) moved to easier positions and appear near the top of the order. Conversely, items such as dart, harp, igloo, and seahorse moved to more difficult positions and appear near the lower end of the order. These shifts produced a novel item sequence of BNT with monotonically increasing difficulty for Chinese individuals. Table 3 presents the current order of the BNT items alongside their ranks in the original version.
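The reordering described above amounts to sorting items by their estimated β. The Python sketch below does this for the eight items whose difficulty values are quoted in the text; it is only a partial illustration, since the β values of the remaining 22 items appear in Table 3 and are not reproduced here. The variable names are hypothetical.

```python
# Difficulty parameters quoted in the text for eight of the 30 items.
reported_beta = {
    "igloo": 2.97,
    "harp": 0.69,
    "protractor": 0.13,
    "dart": 0.10,
    "pencil": -2.63,
    "tree": -2.41,
    "flower": -2.40,
    "scissors": -2.12,
}

# Ascending beta gives an easiest-to-hardest administration order.
ordered_items = sorted(reported_beta, key=reported_beta.get)

if __name__ == "__main__":
    for rank, item in enumerate(ordered_items, start=1):
        print(f"{rank}. {item:<11} beta = {reported_beta[item]:+.2f}")
```

Among these eight items the order runs from pencil (easiest) to igloo (hardest), in line with the ascending hierarchy that the new sequence is intended to provide.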
Item characteristic curves (ICC) of thirty items
The results of factor analysis and IRT are visualized as item characteristic curves in Figure 1(a)–(c). Figure 1(a) shows the curves of the nine items loading on Factor 1, Figure 1(b) those of the 11 items loading on Factor 2, and Figure 1(c) those of the 10 items loading on Factor 3. Factor 3 is regarded as more difficult than Factor 1 and Factor 2. The curve of item igloo lies furthest to the right on the x-axis, indicating that this item has the highest level of difficulty. However, this curve is also the flattest, indicating the lowest discrimination.

Figure 1. Item characteristic curves (ICC) of thirty items.
Factor information function
Table 4 shows the measurement information and conditional reliability for the three factors. Factor 1 had its peak information value at θ = −0.8 (information value = 33.69), with high reliability between θ = −1.6 and θ = 0.2. Factor 2 had its peak information value at θ = −1.2 (information value = 34.10), with high reliability between θ = −3.0 and θ = −0.4. Factor 3 had its peak information value at θ = 0 (information value = 9.91), with high reliability between θ = −1.2 and θ = 1. Factor 1 and Factor 2 showed the highest degree of measurement precision near the low average to mildly impaired range of naming ability. Furthermore, the highest value of measurement information was found in Factor 2, followed by Factor 1 and Factor 3.
Table 4. Information value and reliability for three factors.
Supplemental Table 2 shows the convergent and divergent validity of three factors. The Cronbach's alpha of the total scale and each factor were greater than 0.80, indicating a good level of internal consistency (Total scale: α = 0.947; Factor 1: α = 0.923; Factor 2: α = 0.896; Factor 3: α = 0.848).
Discussion
This study investigated the underlying structure and item-level properties of the color-picture version of the Boston Naming Test in a Chinese sample with neurodegenerative diseases. First, the color-picture version of BNT was shown to have a multidimensional structure with three factors representing lower, moderate, and higher levels of difficulty. Each factor demonstrated its highest measurement precision and reliability in a particular range of naming ability. Second, IRT offered quantitative evidence on the difficulty parameter of each item, which facilitated the reordering of the 30 items in an ascending difficulty hierarchy. Finally, most items, with the exception of igloo, demonstrated good discrimination, suggesting satisfactory cultural appropriateness of this revised version in the Chinese context.
Naming ability has traditionally been viewed as a unitary concept, which has led to the assumption that naming tests should have a unidimensional structure. However, studies on the underlying construct of BNT have failed to demonstrate satisfactory unidimensionality.15–17,30,31 For instance, Pedraza et al. found that BNT data from Caucasian adults did not exhibit sufficient unidimensionality to meet IRT assumptions. 30 This discrepancy was also noted by Medvedev et al., who found that the performance of BNT items varied across different levels of naming ability. 16 These studies had to eliminate four to thirteen items to achieve an improved but still unsatisfactory unidimensional model.16,17,30,31 One possible explanation for this discrepancy is that visual confrontation naming may not, in fact, be a unitary concept. Modern neurobiological and neuropsychological research has shown that visual confrontation naming is a multidimensional skill involving many cognitive abilities, including visual recognition, semantic activation, word retrieval, and physiological articulation processing.9,32 Various neurodegenerative diseases, such as AD, PPA, and PCA, disrupt different regions of the cerebral cortex and present with naming impairment of varying features and severity. A second explanation is that previous studies were conducted in homogeneous samples of normal controls and individuals with mild naming impairment.15–17,30,31 The probability of responding correctly to a test item depends not only on item difficulty but also on the individual's latent ability. 33 If the sample does not include individuals with moderate to severe naming impairment, the relationship between items with lower difficulty levels and low naming abilities may not be captured. As a result, those studies had to remove some easy items that did not fit a unidimensional model dominated by more difficult items and high naming ability.
The current study utilized a sample covering a wide range of naming impairments (from BNT = 1 to BNT = 30; 38% of individuals with severe naming impairment). Both factor analysis and item response theory confirmed that the 30-item color-picture version of BNT has three factors, each with varying difficulty levels and an optimal working range. As shown in Figure 1 and Table 4, Factor 1 includes nine items with a moderate level of difficulty, which has the highest measurement function for mild naming impairment. Factor 2 has eleven items with a lower level of difficulty, which works well from mild to moderate naming impairment. Factor 3 is composed of ten items with a higher level of difficulty, which yields the highest degree of measurement precision near normal naming. This finding is consistent with previous research indicating that BNT items do not perform equally well across different levels of naming deficit. 16 When the difficulty level matches the target ability level, the item or factor will show the best discrimination. 31 Therefore, the three factors with varying difficulty and optimal working range could be used as an index of a certain level of naming deficit, offering additional information about the extent of anomia in an individual.
Previous studies on cross-cultural adaptation of psychological measures have demonstrated that cultural salience significantly influences the constructs of measurement scales. 34 This effect is expected to be more pronounced for naming tests, as their construction and application rely heavily on word familiarity, word length, and lexical and articulatory difficulty, all of which in turn affect the psychometric properties of items. 7 In the current study, most items occupied a different difficulty position from that in the original version, especially abacus and igloo. Variation in item-level difficulty is a common but important manifestation of cultural salience in cross-cultural studies of measurement scales. There are strong theoretical reasons to reorder the 30 items of the color-picture version of BNT according to their actual level of difficulty, so that patients are most likely to perform well and respond rapidly on the less difficult items and to perform worse on the more difficult ones. Incremental difficulty across the ordered items would optimize the administration procedure of the naming test, leading to smoother and faster administration. This new sequence of 30 items could also facilitate the application of the standard basal and discontinuation rules of BNT in Chinese samples.
The impact of cultural salience is also evident in the changes observed in discrimination parameters. Among American adults, the item igloo demonstrated the highest level of discrimination (α = 3.46) with a moderate level of difficulty. 17 In the Chinese context, however, igloo was the least discriminating item (α3 = 1.15) and the most difficult one. Except for igloo, most items exhibited a high level of discrimination. These favorable item-level properties support the overall validity and cultural appropriateness of the test in a Chinese context. Unfortunately, even after the stimulus images were optimized, igloo still exhibited obvious cultural bias, suggesting that it should be replaced or removed in future Chinese versions.
This study has several strengths. First, this is the first factor analysis of BNT conducted in a sample with a wide range of naming abilities, from severe naming deficit to normal naming. The three-factor model addressed the problem of unsatisfactory unidimensional models in previous studies. The multidimensional structure of the scale is consistent both with the modern theoretical account of the visual naming process and with the difficulty gradient on which the BNT is built. Using graded-difficulty factors and their optimal working ranges would provide more comprehensive information, enabling clinicians to draw conclusions from factor-level performance rather than from the total score alone. Second, this study is the first to investigate item-level properties of the BNT with IRT in the Chinese context. The results generated a different item sequence with an increasing difficulty hierarchy, which enables the practical application of standard basal and discontinuation rules in the Chinese context. Third, the results showed that most items have excellent discrimination and reliability, indicating the good cultural appropriateness of the color-picture version of BNT in the Chinese context. Finally, this study identified certain items with poor or redundant item properties in the Chinese context, which may be helpful for the future development of shorter naming tasks using BNT items without a loss of discrimination.
Some limitations should be acknowledged. First, the sample size was relatively small (fewer than 500 cases), which may not be sufficient for robust estimation of item discrimination and could result in imprecise estimates of fit residuals. Nevertheless, within the exploratory scope of this study, the findings address the longstanding challenge of difficulty ranking in the Chinese version of BNT and should improve the clinical utility of BNT in the Chinese context. We will continue to recruit patients and aim to develop a condensed version of CP-BNT. Second, although Factor 3 worked well for normal naming, its measurement precision was much lower than that of Factor 1 and Factor 2. Therefore, in the Chinese context, the color-picture version of BNT still serves as a test of naming impairment rather than as an achievement test. Third, because the CP-BNT is a visual confrontation test, visual deficits will affect performance. In our previous study, PCA patients scored markedly lower in spontaneous naming (SN) than AD patients and normal controls, while scoring significantly higher in the percentage of correct responses after semantic cueing. 20 Therefore, CP-BNT may be unsuitable for individuals with severe visual and auditory deficits.
In summary, this study showed that the Chinese version of BNT has a multidimensional structure with three factors. The IRT analysis deepened our understanding of the psychometric properties of the color-picture version of BNT and provided updated estimates of its item difficulty sequence, discriminatory ability, and measurement information. These results may improve the clinical application of the BNT in the Chinese context and could be used to develop shorter revisions by replacing or removing items with poor or redundant properties.
Supplemental Material
Supplemental material, sj-docx-1-alz-10.1177_13872877241305820 for Item response theory for the color-picture version of Boston naming test in a Chinese sample with neurodegenerative diseases by Dan Li, Xining Liu, Jiaming Yu, Yifei Zhang, Nan Hu, Yuanyuan Lu, Fangling Sun, Min Zhang, Xiaowei Ma and Fen Wang in Journal of Alzheimer's Disease
Footnotes
Acknowledgments
We thank all participants and their families for their time and contribution to this study.
Author contributions
Dan Li (Conceptualization; Data curation; Formal analysis; Funding acquisition; Investigation; Methodology; Project administration; Resources; Writing – original draft; Writing – review & editing); Xining Liu (Conceptualization; Data curation; Formal analysis; Methodology; Writing – original draft; Writing – review & editing); Jiaming Yu (Data curation; Formal analysis; Investigation; Methodology; Software; Validation; Visualization; Writing – original draft; Writing – review & editing); Yifei Zhang (Data curation; Formal analysis; Investigation; Methodology; Validation; Visualization; Writing – original draft; Writing – review & editing); Nan Hu (Methodology; Software; Visualization; Writing – original draft; Writing – review & editing); Yuanyuan Lu (Data curation; Investigation; Validation; Writing – original draft; Writing – review & editing); Fangling Sun (Formal analysis; Software; Validation; Writing – original draft; Writing – review & editing); Min Zhang (Software; Visualization; Writing – original draft; Writing – review & editing); Xiaowei Ma (Writing – review & editing); Fen Wang (Writing – review & editing).
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Beijing Natural Science Foundation (Grant number L182049).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The data supporting the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
Supplemental material
Supplemental material for this article is available online.
