Abstract
Purpose
Although researchers have examined empathy among many populations worldwide, investigations of empathy among Farsi-speakers are limited. The purpose of this study is to evaluate the psychometric properties of the Interpersonal Reactivity Index (IRI) for Farsi-speakers (IRI-Farsi).
Methods
After translating the IRI into Farsi, we explored the psychometric properties of the IRI-Farsi with exploratory factor analysis and item response theory using a sample of Iranians (N = 517).
Results
The IRI-Farsi appeared to exhibit a four-factor structure and acceptable item properties within each subscale. Moreover, the IRI-Farsi rating scale categories were generally ordered and distinct, and emotion-triggering items were easier to endorse than more complex, cognitively oriented statements.
Conclusions
Results support using the IRI to measure dispositional empathy in mainland Iran. Social work researchers can use these results to inform research and practice related to empathy in this population and design more effective interventions to increase awareness of empathic feelings and understanding for practitioners and clients.
Empathy is regarded as one of the most fundamental components of human and non-human emotional life and their social interactions (Yaghoubi Jami et al., 2021a), and it has been a popular line of research in social psychology, moral psychology, psychiatry, and social work (Aaltola, 2013; Eisenberg & Morris, 2001; Gerdes, Segal, Lietz, 2010). Empathy contributes to the development of prosocial behavior, emotional management strategies, parenting, psychological well-being, social relationships, and social work practitioner-client relationships (Charbonneau & Nicol, 2002; Gerdes & Segal, 2011). As Hoffman (2001) stated, “empathy is the spark of human concern for others, the glue that makes social life possible” (p. 3). Given his argument, empathy enables us to be concerned with others’ emotional states and behave altruistically leading to reducing others’ pain while promoting their welfare (Charbonneau & Nicol, 2002). Lack of empathy has been found to be associated with mental health problems. For example, findings from numerous studies have suggested a strong relationship between lack of empathic feeling and narcissistic personality disorder (e.g., Ritter et al., 2011), aggressive behavior (Steffgen et al., 2011), psychopathy (Domes, Hollerbach, Vohs, Mokros, Habermeyer, 2013), and sexual offending (Simons et al., 2002).
The definition of empathy has changed substantially, from a purely affective response to a mixture of affective and cognitive processes (Cuff, Brown, Taylor, & Howat, 2016; see Gerdes et al., 2010 for details). Affective empathy can be defined as experiencing an emotion similar to that of the observed person (similar rather than identical, because the empathizer cannot experience exactly congruent feelings). The other component of empathic responsiveness is cognitive empathy, which is synonymous with mentalizing, theory of mind, and perspective taking (Smith, 2006; Yaghoubi Jami, Mansouri, Thoma, & Han, 2019). Cognitive empathy refers to understanding another individual’s thoughts and feelings from their point of view with the help of imagination and mental positioning; however, the empathizer should be able to distinguish between their own and others' emotional states (Blair, 2005; Schieman & Van Gundy, 2000).
Having a clear definition of the concept and employing the most comprehensive measure that covers the different components of empathy is important if we want to “make significant conclusions about how we define and measure this key component of human behavior – necessary steps if social work practitioners are to cultivate empathy in ourselves and others” (Gerdes et al., 2010, p. 2327). Therefore, the following section summarizes the most commonly used approaches to measuring empathy in order to highlight the uniqueness of the Interpersonal Reactivity Index (IRI; discussed further below) as the most comprehensive questionnaire for measuring dispositional empathy.
Assessing Empathy
Researchers typically study empathy using either self-reported reflections on empathic reactions in general settings (i.e., trait/dispositional empathy) or empathic responsiveness toward a specific situation (i.e., situational empathy; Zhou, Valiente, & Eisenberg, 2003). Researchers have offered numerous methods for investigating people’s empathic behavior, including situational empathy: experimental and quasi-experimental laboratory studies (e.g., Yaghoubi Jami et al., 2021a), neuroimaging techniques (e.g., Singer et al., 2004; Yaghoubi Jami, Han, Thoma, Mansouri, & Houser, 2021b), and qualitative approaches (e.g., interviews; Picard et al., 2016). Investigating dispositional empathy with self-reported and other-reported questionnaires has also caught researchers’ attention.
Among published self-reported empathy questionnaires, some have received more attention from scholars; however, each of these questionnaires reflects different perspectives toward measuring empathy. For example, the Empathy Scale (Hogan, 1969) considers only the cognitive facet of empathy, the Questionnaire Measure of Emotional Empathy (QMEE; Mehrabian & Epstein, 1972) assesses the affective aspect of empathy, and the Empathy Quotient (EQ; Baron-Cohen & Wheelwright, 2004) measures empathy as a one-dimensional construct. To the best of our knowledge, the only questionnaire that considers empathy as a multi-dimensional construct and simultaneously measures affective and cognitive components of empathy is the IRI (Davis, 1983).
The IRI has received more attention than other self-reported empathy questionnaires for several reasons. First, the IRI is the only self-report scale designed to measure dispositional empathy as a multidimensional construct through its four subscales: Perspective Taking (PT), Empathic Concern (EC), Fantasy Scale (FS), and Personal Distress (PD). The PT subscale (inclinations to adopt others’ viewpoints) and the FS subscale (inclinations to imaginatively put oneself into the feelings of characters in films, books, etc.) are designed to measure the cognitive component of empathy, while the EC subscale (inclinations to feel warmth, sympathy, compassion, and concern for others) and the PD subscale (inclinations to feel discomfort in response to others’ negative experiences) assess the affective component of empathy (Davis, 1983). However, due to legitimate criticisms regarding the validity and reliability of FS items, which may measure imagination abilities rather than perspective taking (Spreng, McKinnon, Mar, & Levine, 2009), researchers have recommended excluding these items when measuring empathy (Baron-Cohen & Wheelwright, 2004). Second, the IRI is exceptionally comprehensive, short, and easy to administer (De Corte et al., 2007). Third, researchers have reported that the IRI subscales have acceptable internal consistency estimates (0.70 ≤ α ≤ 0.78) and test-retest reliability estimates ranging from 0.61 to 0.81 depending on the subscale and gender (see Davis, 1980, for details). Last, the IRI subscales correlated in the theoretically expected directions and magnitudes with measures of self-esteem, sensitivity to others, emotionality, and social functioning (Davis, 1983).
IRI in Languages Other Than English
A common strategy in the field of psychological assessment is to translate a questionnaire into another language and evaluate its psychometric properties in the new linguistic context. Using such an approach allows researchers not only to conduct cross-cultural studies and assess the validity and reliability of the same questionnaire in different populations, but also to save money and time. Using a translated version of a questionnaire can also be beneficial for respondents. When an instrument is translated, respondents can reflect on their opinions and feelings in a preferred language, which could reduce undesired bias and enhance the fairness of a measurement procedure (Hambleton & Kanjee, 1995). For these reasons, researchers have used translations of the IRI in several studies. For example, the available literature shows that researchers have translated the IRI into, and evaluated it in, Chinese (Siu & Shek, 2005), Dutch (De Corte et al., 2007), French (Gilet, Mella, Studer, Gruhn, & Labouvie-Vief, 2013), German (Grevenstein, 2020), Italian (Albiero, Ingoglia, & Lo Coco, 2006), Korean (Yang & Kang, 2020), Spanish (Fernández et al., 2011; Garcia-Barrera et al., 2017; Mestre, Frias, & Samper, 2004; Tello, Delgado Egido, Carrasco Ortiz, & Del Barrio Gandara, 2013), Turkish (Engeler & Yargıç, 2007), and Swedish (Cliffordson, 2002).
Results from these studies on translations of the IRI support its multidimensionality; however, the number of factors varied across populations. For example, Siu and Shek (2005) identified three factors in the Chinese version of the IRI: PD, ES, and FS. Similarly, Cliffordson (2002) found three factors, but only items in PT, FS, and EC loaded on the model. On the other hand, De Corte et al. (2007) reported a four-factor structure of the IRI with a Dutch translation, but the authors suggested scale improvements. Garcia-Barrera and colleagues (2017) also examined the psychometric properties of the IRI in a Colombian population. Specifically, these researchers used confirmatory factor analysis (CFA) to explore the IRI factor structure and measurement invariance analyses to explore the stability of the factor structure across genders. Their findings also supported a four-factor structure, but they chose to remove 10 items (9 of which were reverse coded) due to low factor loadings, which they argued could be explained by participants’ demographic characteristics: their participants were selected from a specific group of people who had experienced a long history of violence (guerrilla and paramilitary ex-combatants) and who may not have been able to comprehend some of the items. Other Spanish versions of the IRI supported a four-factor structure with acceptable reliability estimates ranging from .56 to .80 and gender differences in favor of female participants (Fernández et al., 2011; Mestre et al., 2004).
Researchers have also investigated the psychometric properties of the IRI within English speaking contexts (Chrysikou & Thompson, 2016; Pulos, Elison, Lennon, 2004). In short, the literature pertaining to evaluating validity and reliability of the IRI strongly suggests that this instrument can be used as a reliable and valid tool for investigating dispositional empathy. In addition, the four-factor and two-factor models were assessed with English and non-English-speaking populations and the results indicated stronger model fit for a four-factor model (see Chrysikou & Thompson, 2016, for discussion). However, there have been inconsistent findings regarding item factor loading to four factors of the IRI and measurement invariance (e.g., Garcia-Barrera et al., 2017).
Purpose of the Study
There has been a great deal of inconsistency regarding the factor structure of the IRI in the literature. As mentioned earlier, the IRI was developed as a questionnaire with four factors. However, not all researchers who have studied the IRI observed the four-factor model empirically in their samples. Such inconsistency could be linked to the applied analytical approach or cultural values within the studied population. Culture affects many aspects of people’s lives, including their personality and behavior. Empathy, as an important factor of people’s social life, is not an exception. Indeed, numerous studies have shown that the expression of empathic responsiveness and even empathic feelings depend on people’s cultural backgrounds (Yaghoubi Jami et al., 2019).
Additionally, despite the large Farsi-speaking population (more than 100 million globally, 1.5% of the world population), Iranians are underrepresented in social psychology research (Wind et al., 2021). Although empathy has been examined among many populations worldwide, investigations of empathy among Farsi-speakers, as well as of differences and similarities between the Iranian community and other populations, are limited to a few studies (e.g., see Yaghoubi Jami, Mansouri, & Thoma, 2021, for details). As our search of the related literature showed, neither the English nor the translated versions of the IRI have been adapted and/or evaluated within an Iranian context. Therefore, the primary objective of the current paper is to adapt the most practical empathy questionnaire and evaluate its psychometric properties for Farsi-speakers. We focus our analysis on identifying the order of IRI items for Farsi speakers as a method for understanding the nature of empathy in this population. We also examine the degree to which the IRI rating scale functions in a psychometrically useful way for this population. The following research questions guide the analysis: 1. What is the order of the IRI items for Farsi speakers? 2. To what extent does the IRI rating scale function as expected for Farsi speakers?
Method
Participants
In total, 677 questionnaires were distributed online among Iranians. One hundred sixty questionnaires were deleted due to incomplete responses or suspected random answering. Participants who were below 18 years old, could not understand Farsi, or were not Iranian were to be excluded from the study (N = 0). The final dataset consisted of 517 complete responses from Iranian individuals (182 males) ranging in age from 16 to 66 years old (Mage = 28.31, SDage = 9.84) living either in Iran or outside the country, who voluntarily participated in this study. The majority of participants had a bachelor’s degree (N = 202) and were single (N = 316). All participants spoke Farsi as their first language and were Iranians. Prior to participating in the study, all participants completed an online informed consent form approved by the Institutional Review Board (IRB #15-OR-336-R1) at the University of Alabama. The study was performed in accordance with the ethical standards of the Declaration of Helsinki (1964) and its later amendments.
Instrument
Per Hambleton and Kanjee’s (1995) recommendation, the current study used translation techniques to adapt the English version of the IRI for Iranian samples. Specifically, following the International Test Commission’s (ITC) guidelines for translating tests (Hambleton, 2001) and the procedure suggested by Wind and collaborators (2021), the current study used a combination of translation techniques and a team of translators. In short, three expert translators translated the IRI into Farsi. The translated versions were compared and collated into one coherent version. To ensure similarity between the translated and original versions, a back-translation procedure was used: a second group of translators (N = 3) retranslated the Farsi version of the IRI into English. Differences found in the translation process (i.e., between the original and the back-translation) were addressed by group discussion. The principal investigator finalized the translated version to ensure that the meaning of the items was fully compatible.
Among all the methods of data collection, use of electronic techniques (i.e., email-based and web-based surveys) is increasing due to its ability in collecting large-scale data. Thus, the current study utilized online distribution as a method of data collection. The translated version of the IRI was generated using the Qualtrics survey-hosting provider (https://www.qualtrics.com). For recruiting participants, a flyer explaining procedure and purpose of the study was posted on social media (e.g., Facebook, Instagram) to be visible in Farsi speakers' news feeds. The Farsi version of the IRI and consent form were sent to interested participants through a Qualtrics link.
The Interpersonal Reactivity Index.
The IRI consists of 28 items in four subscales: Empathic Concern (EC; e.g., “Sometimes I don’t feel very sorry for other people when they are having problems”), Fantasy Scale (FS; e.g., “I really get involved with the feelings of the characters in a novel”), Perspective Taking (PT; e.g., “I try to look at everybody’s side of a disagreement before I make a decision”), and Personal Distress (PD; e.g., “When I see someone get hurt, I tend to remain calm”). Items in each subscale are paired with a 5-point Likert response scale (0 = does not describe me well to 4 = describes me very well), yielding a total score ranging from 0 to 28 on each seven-item subscale. Prior to analyses, items were scored per Davis’ (1980) guidelines. Specifically, responses to 9 items were reverse coded: they were given a score of 4 if the participant selected does not describe me well and a score of 0 if the participant selected describes me very well. The remaining items (i.e., items that are not reverse coded) were scored as 4 if describes me very well was selected and 0 if does not describe me well was selected. Higher scores on each subscale indicate higher tendencies on that subscale (e.g., a score of 28 on the empathic concern subscale suggests that the participant reported high affective empathy).
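The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the item-to-subscale key and the set of reverse-coded items are assumptions taken from Davis's (1980) published scoring guide rather than from this article, and the function name `score_iri` is hypothetical. The key should be verified against the scoring guide before use.

```python
# Hypothetical scoring helper. The item key below is an assumption based on
# Davis's (1980) published scoring guide; it is not reported in this article.
SUBSCALES = {
    "PT": [3, 8, 11, 15, 21, 25, 28],
    "FS": [1, 5, 7, 12, 16, 23, 26],
    "EC": [2, 4, 9, 14, 18, 20, 22],
    "PD": [6, 10, 13, 17, 19, 24, 27],
}
REVERSED = {3, 4, 7, 12, 13, 14, 15, 18, 19}  # nine reverse-coded items
MAX_RATING = 4  # ratings run from 0 to 4


def score_iri(responses):
    """Return subscale totals for one participant.

    `responses` maps item number (1-28) to the raw rating (0-4).
    """
    totals = {}
    for name, items in SUBSCALES.items():
        total = 0
        for item in items:
            raw = responses[item]
            if not 0 <= raw <= MAX_RATING:
                raise ValueError(f"item {item}: rating {raw} is outside 0-{MAX_RATING}")
            # Reverse-coded items: 0 -> 4, 1 -> 3, ..., 4 -> 0.
            total += MAX_RATING - raw if item in REVERSED else raw
        totals[name] = total
    return totals


# Example: a participant who selects the top category (4) on every item.
all_fours = {item: 4 for item in range(1, 29)}
print(score_iri(all_fours))  # {'PT': 20, 'FS': 20, 'EC': 16, 'PD': 20}
```

A participant who picks the highest category on every item does not reach 28 on any subscale, because the reverse-coded items then contribute 0. That is the reason the reverse coding has to be applied before totals are interpreted.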
Originally, items in the EC and PD subscales were proposed to measure affective empathy, and items in the PT and FS subscales were intended to measure cognitive empathy. However, researchers have raised some concerns about the construct-related validity of items in the FS subscale in measuring cognitive empathy, as FS items might assess a respondent’s imagination ability rather than empathic ability (for further discussion, see Baron-Cohen & Wheelwright, 2004). Regarding the involvement of PD items in affective empathy, researchers have argued that personal distress is different from empathic concern because distress caused by observing others’ misfortune may result in non-altruistic behavior (Yaghoubi Jami et al., 2021a). The internal consistency in the current dataset was calculated using Cronbach’s alpha, which showed a high internal consistency: total IRI, α = 0.79 (95% CI: 0.77, 0.82); fantasy scale, α = 0.79 (95% CI: 0.77, 0.82); empathic concern, α = 0.67 (95% CI: 0.63, 0.72); personal distress, α = 0.69 (95% CI: 0.65, 0.73); and perspective taking, α = 0.62 (95% CI: 0.58, 0.67).
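For readers who want to see how an alpha coefficient of this kind is obtained, the following minimal Python sketch computes Cronbach's alpha from a persons-by-items score matrix. It is not the authors' code; the function name and the toy data are hypothetical, and the confidence intervals reported above are not reproduced here.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-participant item-score lists.

    Every inner list must hold the (already reverse-coded) scores of one
    participant on the same k items.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across participants.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the total score across participants.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data (hypothetical): three items that rise and fall together.
toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(cronbach_alpha(toy))
```

In the toy data the items are perfectly consistent, so alpha reaches its upper bound of 1. If every participant has the same total score, the total variance is zero and the function raises a `ZeroDivisionError`.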
Data Analysis Procedures
The purpose and research questions for our study focus on scaling the IRI items and evaluating their psychometric properties within a new population. Accordingly, our primary analytic technique was based on item response theory (IRT). Before we applied the IRT analyses, we conducted two preliminary investigations to inform the application and interpretation of the IRT analyses. In the following paragraphs, we describe these procedures and how their results informed our subsequent modeling decisions.
Descriptive Item Analyses.
Table 1. Descriptive Item Analysis Results.
Table 1 also shows corrected item-total correlations for each IRI item within its subscale (ri,subscale) and for the complete IRI (ri,complete). For most items, the item-total correlation within its subscale was higher than the item-total correlation for the complete IRI. However, there were exceptions within each subscale (e.g., Items 1-FS, 4-EC [reverse coded], 10-PD, 3-PT [reverse coded]), indicating that these items may reflect distinct aspects of empathy from other items in their subscale. In addition, two items from the PT subscale had particularly low item-total correlations within the subscale: Item 15-PT (reverse coded; ri,subscale = 0.10) and Item 3-PT (reverse coded; ri,subscale = 0.15).
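A corrected item-total correlation correlates each item with the sum of the remaining items, so that the item does not inflate its own correlation. The sketch below illustrates the calculation in plain Python; the function names and toy data are hypothetical and are not taken from the authors' analysis.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(rows, j):
    """Correlate item j with the total of all other items in `rows`."""
    item = [row[j] for row in rows]
    rest = [sum(row) - row[j] for row in rows]  # total score without item j
    return pearson(item, rest)


# Toy data (hypothetical): rows are participants, columns are items.
toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(corrected_item_total(toy, 0))
```

Passing only the items of one subscale gives the within-subscale value, and passing all 28 items gives the value for the complete instrument.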
Exploratory Factor Analyses.
Second, we analyzed the IRI using exploratory factor analysis (EFA) methods. We applied EFA to help us identify relationships among the IRI items from the perspective of dimensionality and to align our work with previous research on the IRI. We selected EFA rather than confirmatory factor analysis (CFA) for several reasons. First, the purpose of our study was not to identify an appropriate factor structure to represent the IRI in our population; model-fit comparisons between various CFA models were less important than item analyses from a scaling perspective. Moreover, as Bandalos (2018) pointed out, the distinction between EFA and CFA is not always clear-cut: “EFA is sometimes used in a confirmatory manner, and CFA is sometimes used in an exploratory manner” (p. 350). Although there is previous research on the IRI that could warrant theory testing and statistical model comparisons, we were more interested in exploring the patterns of relationships among the IRI items than in comparing the fit of various factor solutions. In addition, an exploratory approach was useful for our purposes because we were exploring the structure of the IRI with a new population; EFA allowed us to explore the structure of the item responses without assuming that the patterns that researchers have observed in other populations would hold in this context.
Given the non-normal item distributions and ordinal rating scale, we conducted an EFA on the polychoric item correlation matrix using unweighted least squares (ULS) estimation (Sellbom & Tellegen, 2019). We used varimax rotation to reflect previous research and theory suggesting that the IRI subscales are distinct. We applied two EFA models: one with two factors specified to reflect affective empathy through EC and PD items and cognitive empathy through PT and FS items, and one with four factors that treated each subscale individually, with affective empathy measured by the empathic concern items and cognitive empathy measured by the perspective taking items of the IRI (Davis, 1980). We conducted the EFA using the psych package for R (Revelle, 2016).
Overall indicators of suitability for factor analysis generally supported the use of EFA to examine relationships among IRI item responses: The overall KMO measure of sampling adequacy was equal to 0.81, with item-specific KMO measures ranging from 0.52 to 0.89. Furthermore, Bartlett’s test of sphericity suggested that the polychoric correlation matrix was significantly different from an identity matrix (χ2(378) = 3516.41, p < 0.001, V = 0.13). However, for both the two-factor and four-factor EFA models, other indicators of suitability for factor analysis were only marginally acceptable. The two-factor and four-factor models explained 26% and 37% of the variance in item responses, respectively. The Tucker-Lewis measure of factoring reliability was equal to 0.52 and 0.76 for the two-factor and four-factor models, respectively; both values are lower than what is generally recommended (> 0.9; Tabachnick & Fidell, 2001). Finally, the RMSEA value was higher for the two-factor model (0.102) than for the four-factor model (0.074).
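The 378 degrees of freedom in Bartlett's test follow from the number of items: with p = 28 items there are p(p − 1)/2 = 378 distinct correlations. The sketch below shows the textbook form of the test statistic in plain Python. It is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the correlation matrix is assumed to be positive definite, and the p-value is omitted because the standard library has no chi-square distribution.

```python
from math import log


def log_det(matrix):
    """Log-determinant of a positive-definite matrix via Gaussian elimination."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    total = 0.0
    for col in range(n):
        pivot = a[col][col]
        if pivot <= 0:
            raise ValueError("matrix is not positive definite")
        total += log(pivot)
        for r in range(col + 1, n):
            factor = a[r][col] / pivot
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return total


def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    p = len(corr)
    statistic = -(n_obs - 1 - (2 * p + 5) / 6) * log_det(corr)
    df = p * (p - 1) // 2
    return statistic, df


# Degrees of freedom for a 28-item instrument.
print(28 * 27 // 2)  # 378

# Toy 2 x 2 correlation matrix (hypothetical values).
print(bartlett_sphericity([[1.0, 0.5], [0.5, 1.0]], n_obs=101))
```

An identity matrix has a log-determinant of zero, so the statistic is zero when the items are uncorrelated and grows as the correlations move away from zero.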
Table 2. Exploratory Factor Analysis (EFA) Results.
To summarize, results supported the four-factor structure, such that each subscale of the IRI should be treated individually rather than in two pairs of EC and PD together and FS and PT together (Davis, 1980). As mentioned earlier, EC items and PD items cannot be treated similarly as measurement of affective empathy due to different motivational and emotional consequences resulting from each facet. Similarly, as Baron-Cohen and Wheelwright (2004) noted, FS items would be more accurate in measuring imagination abilities than cognitive empathy. Therefore, items in the PT subscale may be the most suitable candidate for measuring cognitive empathy.
Item Response Theory Analyses.
Next, we conducted an item response theory (IRT) analysis to explore the order and structure of the IRI items in detail. Our IRT analysis was evaluative, rather than exploratory: Our goal was to identify items with useful psychometric properties rather than to identify a model that reflected the patterns of item responses in our data. Accordingly, we used the Partial Credit Model (PCM; Masters, 1982; 2018) to analyze each subscale separately. The PCM can be stated in log-odds form as follows, where $P_{nik}$ is the probability that Participant n responds in category k of Item i:

$$\ln\left[\frac{P_{nik}}{P_{ni(k-1)}}\right] = \theta_n - \delta_{ik}$$
The PCM defines the probability for a response in a given rating scale category as a function of person locations on a latent variable and item locations on the same latent variable. In the equation, θn is the location of Participant n on the logit scale that represents the latent variable. Higher participant locations indicate higher levels of empathy. Next, δik is a combination of the overall item location (δi) and the location of the threshold between adjacent categories (τk) in the rating scale. The δik estimate can be interpreted as the level of empathy required for participants to select a response in the higher of two rating scale categories. In the context of the IRI, items with low locations may describe attitudes or behaviors that require lower levels of empathy for participants to endorse, and items with high locations may represent attitudes or behaviors that require high levels of empathy for participants to endorse. The PCM provides estimates of person and item-category threshold locations on a linear scale that represents the latent variable.
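To make the role of the person location and the item-category thresholds concrete, the sketch below computes PCM category probabilities for a single item in plain Python. It is a minimal illustration of the model's standard form, not the estimation routine used in the study; the function name and the threshold values are hypothetical.

```python
from math import exp, log


def pcm_probabilities(theta, thresholds):
    """Category probabilities for one item under the Partial Credit Model.

    theta: person location in logits.
    thresholds: item-category thresholds (delta_i1 ... delta_im) in logits.
    Returns the probabilities of categories 0 ... m.
    """
    # Cumulative sums of (theta - delta_ik); category 0 has an empty sum of 0.
    numerators = [0.0]
    for delta in thresholds:
        numerators.append(numerators[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(numerators)
    weights = [exp(value - top) for value in numerators]
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical person location and thresholds for a five-category item.
theta = 0.5
thresholds = [-1.0, 0.0, 1.0, 2.0]
probs = pcm_probabilities(theta, thresholds)
print([round(p, 3) for p in probs])

# The log-odds of adjacent categories recover theta - delta_ik.
for k in range(1, len(probs)):
    print(k, round(log(probs[k] / probs[k - 1]), 3), theta - thresholds[k - 1])
```

The second loop shows why a threshold can be read as the level of the latent variable at which the higher of two adjacent categories becomes as likely as the lower one: when theta equals the threshold, the log-odds are zero.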
Situated within the framework of Rasch measurement theory (Rasch, 1960; Wright & Mok, 2004), the PCM is characterized by strict requirements for measurement. In addition to the typical IRT model requirements of unidimensionality (one primary latent variable is sufficient for explaining variation in item responses) and local independence (after controlling for the latent variable, item responses are statistically independent), the PCM requires adherence to invariant measurement. In the context of the PCM, invariant measurement means that item parameter estimates are invariant across persons, and that person estimates are invariant across items (Engelhard, 2013; Wright & Stone, 1979). When these requirements are met, a single item hierarchy is established that describes the progression of items on the latent variable for all persons. This item hierarchy can help researchers and practitioners understand the progression of behaviors and attitudes related to a construct, which can inform theory and practice. In our case, identifying and examining an item hierarchy for each subscale of the IRI provided preliminary insight into the nature of the components of empathy among Farsi-speaking participants.
Rather than analyzing only the items that loaded as expected in the EFA (Table 2), we included all of the items in each subscale analysis. We did this for several reasons. First, this method reflects the Rasch measurement theory approach to identify and seek to explain potential deviations from model requirements rather than to discard items that do not perfectly adhere to expectations (Bond, Yan, Heene, 2020). Second, including all items was useful from a content validity perspective; excluding items that did not load as expected in the EFA may prevent us from including important components of empathy in our analysis. Finally, it is possible to empirically evaluate adherence to the unidimensionality requirement using the PCM (discussed below). Accordingly, we confirmed an approximate unidimensional structure within each of the IRI subscales before proceeding with our interpretation of the model results.
We applied the PCM using Conditional Maximum Likelihood Estimation (CMLE) in the extended Rasch models (eRm) package for R (Mair, Hatzinger, & Maier, 2020). We examined evidence of adherence to the model requirements for each set of items using several indices. First, we evaluated overall adherence to the PCM requirement of unidimensionality by examining the proportion of variance in participant responses to each subscale explained by the PCM person and item location estimates (Linacre, 2003). We also used a principal components analysis (PCA) of standardized residual correlations (standardized residual PCA; Bond et al., 2020; Linacre, 1998) to evaluate adherence to the unidimensionality requirement within each subscale. The standardized residual PCA approach is different from PCA methods that are used as a standalone statistical technique. Specifically, standardized residual PCA involves estimating person and item location parameters using a measurement model such as the PCM, and calculating standardized residuals based on the difference between model-expected responses and the original data. Residuals are calculated for each participant’s response to each item ($Y_{ni}$):

$$Y_{ni} = X_{ni} - E_{ni}$$

where $X_{ni}$ is the observed response of Participant n to Item i and $E_{ni}$ is the corresponding model-expected response.
To evaluate the local independence requirement, we examined correlations between item-specific standardized residuals (Q3; Yen, 1984). Rather than using critical values to classify items as dependent or independent, we examined the overall magnitude of these inter-item residual correlations for each subscale. We also interpreted values of Q3 according to guidance from Christensen, Makransky, and Horton (2017), who suggested that inter-item residual correlations that exceed the average inter-item residual correlation by more than 0.2 may indicate local dependence.
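The Q3 screening rule described above can be sketched as follows, assuming a matrix of standardized residuals (participants in rows, items in columns) has already been obtained from the fitted model. The function names, the `margin` parameter, and the toy residuals are hypothetical and serve only to illustrate the rule.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def q3_flags(residuals, margin=0.2):
    """Return (Q3 per item pair, average Q3, pairs above average + margin)."""
    n_items = len(residuals[0])
    columns = [[row[j] for row in residuals] for j in range(n_items)]
    q3 = {
        (i, j): pearson(columns[i], columns[j])
        for i, j in combinations(range(n_items), 2)
    }
    average = mean(q3.values())
    flagged = [pair for pair, r in q3.items() if r > average + margin]
    return q3, average, flagged


# Toy residuals (hypothetical): items 0 and 1 move together, item 2 does not.
toy = [[1, 1, 1], [-1, -1, 1], [1, 1, -1], [-1, -1, -1]]
q3, average, flagged = q3_flags(toy)
print(average, flagged)
```

Item pairs are reported by zero-based column index, so `(0, 1)` refers to the first and second items in the residual matrix.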
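Next, we examined fit indices for individual items and persons to evaluate adherence to the invariance requirements. Specifically, we calculated unweighted and weighted mean square error statistics (MSE) that summarize the residuals associated with each item and person. For items, the unweighted MSE statistic (“outfit MSE”) is calculated as:

$$\text{Outfit MSE}_i = \frac{1}{N}\sum_{n=1}^{N} Z_{ni}^{2}$$

where $Z_{ni}$ is the standardized residual for Participant n on Item i and N is the number of participants.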
Researchers who use Rasch models generally interpret values of infit and outfit MSE statistics close to 1.00 as evidence of acceptable fit (Smith, 2004; Wu & Adams, 2013). Item and person MSE fit statistics that are greater than 1.00 indicate more variation than expected in item responses, and values less than 1.00 indicate less variation than expected in item responses. In most cases, higher values of MSE fit statistics are more cause for concern than lower values (Linacre & Wright, 1994).
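As an illustration of how the two item fit statistics differ, the sketch below computes outfit and infit MSE from observed responses, model-expected responses, and model variances. It uses the standard Rasch definitions (outfit as the mean squared standardized residual, infit as the variance-weighted version), which is an assumption about the exact formulas applied in the study. The function name and the toy values are hypothetical.

```python
def item_fit_mse(observed, expected, variances):
    """Return a list of (outfit, infit) tuples, one per item.

    All three arguments are persons-by-items matrices (lists of lists):
    observed responses, model-expected responses, and model variances.
    """
    n_persons = len(observed)
    n_items = len(observed[0])
    results = []
    for i in range(n_items):
        squared = [(observed[n][i] - expected[n][i]) ** 2 for n in range(n_persons)]
        var = [variances[n][i] for n in range(n_persons)]
        # Outfit: unweighted mean of squared standardized residuals.
        outfit = sum(s / w for s, w in zip(squared, var)) / n_persons
        # Infit: squared residuals weighted by the model variance.
        infit = sum(squared) / sum(var)
        results.append((outfit, infit))
    return results


# Toy values (hypothetical): two participants, one item.
observed = [[1], [3]]
expected = [[2.0], [2.0]]
variances = [[0.5], [2.0]]
print(item_fit_mse(observed, expected, variances))  # [(1.25, 0.8)]
```

In the toy example the unexpected response with a small model variance inflates outfit more than infit, which is why outfit is usually described as the more outlier-sensitive of the two statistics.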
As a final check for psychometric quality, we also examined the location of item-category thresholds for each item. In theory, these values should be ordered such that lower categories in the rating scale reflect lower levels of empathy than higher categories. In addition, the level of empathy required to respond in each category should be distinct, such that each category provides unique information. For example, distinct categories would imply that participants who respond in Strongly Disagree have different levels of empathy than participants who respond in Disagree, and so on along the scale. Given the specification of the PCM, it is possible to evaluate category ordering and distinctiveness for each item using item-specific threshold estimates (Linacre, 2002). In our analyses, we examined rating scale ordering and distinctiveness for each item in each of the IRI subscales.
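The ordering and distinctiveness checks described above can be expressed as a simple comparison of adjacent threshold estimates. The sketch below is illustrative only: the function name is hypothetical, and the `min_gap` value of 0.3 logits is an arbitrary placeholder, since no numeric cut-off for distinctiveness is stated in the text.

```python
def check_thresholds(thresholds, min_gap=0.3):
    """Classify each pair of adjacent thresholds for one item.

    thresholds: item-category threshold estimates in logits, lowest first.
    min_gap: hypothetical minimum distance (logits) for 'distinct' categories.
    Returns a list of (lower index, upper index, status) with 1-based indices.
    """
    report = []
    for k in range(1, len(thresholds)):
        gap = thresholds[k] - thresholds[k - 1]
        if gap <= 0:
            status = "disordered"  # higher threshold sits at or below the lower one
        elif gap < min_gap:
            status = "close"  # ordered, but categories may be redundant
        else:
            status = "ok"
        report.append((k, k + 1, status))
    return report


# Hypothetical threshold estimates for two items.
print(check_thresholds([-0.2, -0.5, 0.4, 1.5]))
print(check_thresholds([-1.0, -0.9, 0.5, 0.7]))
```

The first example flags the first two thresholds as disordered, and the second flags two pairs as close, mirroring the two kinds of rating scale problems the check is meant to detect.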
Results
In this section, we present results from the PCM analysis of the IRI subscales. We present the results separately for each subscale.
Fantasy Subscale
PCM measures explained 47.38% of the variance in the Fantasy subscale responses. This value is well above Reckase’s (1979) recommended minimum value of 20% for unidimensional IRT analyses of potentially multidimensional scales. Results from the PCA of standardized residuals (Linacre, 1998) also suggested adequate adherence to the unidimensionality requirement, with eigenvalues for all of the contrasts lower than 1.90. The FS items generally adhered to the local independence assumption: the average correlation between item residuals was equal to −0.16. Two item pairs (Items 16-FS and 12-FS [reverse coded], and Items 26-FS and 7-FS [reverse coded]) had inter-item residual correlations equal to −0.36, suggesting approximately 13% of shared variability between the items in each pair. This result likely reflects the similar content of these items, all of which relate to becoming involved with a book or movie as if the person reading or watching were one of the characters. For example, Item 16-FS reads “After seeing a play or movie, I have felt as though I were one of the characters,” whereas Item 12-FS is a reversed form of the same statement: “Becoming extremely involved in a good book or movie is somewhat rare for me.”
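The figure of approximately 13% shared variability follows from squaring the residual correlation, which gives the proportion of variance that the two sets of residuals have in common. As a quick check of that arithmetic:

```latex
r^2 = (-0.36)^2 = 0.1296 \approx 13\%
```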
Partial Credit Model Results for Fantasy Subscale.
Notes. (1) Items are arranged by their overall location estimate, from low (easy to endorse) to high (difficult to endorse); (2) Asterisks (*) indicate disordered thresholds.
Given general adherence to the PCM requirements, we proceeded to examine and interpret parameter estimates for the FS items. Table 3 shows item location estimates for the FS items. For ease of interpretation, the item parameters were scaled such that the mean value was equal to zero logits. Among the FS items, Item 7-FS (reversed coded) was most easily endorsed by our participants; this item had the lowest overall location on the logit scale (δ = −0.36). Item 16-FS was the most difficult to endorse; this item had the highest overall location on the logit scale (δ = 0.59).
Looking at each item individually, items that ask participants about a general situation, such as Item 7-FS (“I am usually objective when I watch a movie or play, and I don’t often get completely caught up in it”), were easier for participants to relate to. This result may reflect the fact that these items do not require participants to recall a specific situation or moment, as does Item 16-FS (“After seeing a play or movie, I have felt as though I were one of the characters”), which was the hardest item to endorse.
The FS item rating scale thresholds were ordered as expected for all items except Item 7-FS (reversed coded): For this item, the first threshold was located higher on the logit scale compared to the second threshold (τ1 > τ2). In addition, the distance between adjacent category thresholds varied substantially across items in the FS scale. For example, the first two thresholds were quite close (≤ 0.12 logits apart) for Item 1-FS and Item 12-FS (reversed coded); this result suggests that the lowest categories in the rating scale may not provide distinct information for these items. Similarly, the third and fourth thresholds were quite close for Item 5-FS and Item 16-FS (≤ 0.28 logits apart), suggesting potential redundancy at the higher end of the rating scale for these items.
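The ordering and distinctiveness checks described above can be sketched as a simple scan over adjacent thresholds. This is an illustrative sketch only: the function name `check_thresholds`, the `min_gap` cutoff of 0.3 logits, and the threshold values are all assumptions chosen for the example and are not taken from the analysis.

```python
def check_thresholds(thresholds, min_gap=0.3):
    """Flag disordered and nearly redundant pairs of adjacent thresholds.

    Returns two lists of (lower, upper) threshold numbers, counted from 1.
    `min_gap` is a hypothetical cutoff in logits, not a published criterion.
    """
    disordered, close = [], []
    for k in range(len(thresholds) - 1):
        gap = thresholds[k + 1] - thresholds[k]
        if gap < 0:
            # The higher-numbered threshold sits lower on the logit scale.
            disordered.append((k + 1, k + 2))
        elif gap <= min_gap:
            # Ordered, but the two categories may carry little distinct information.
            close.append((k + 1, k + 2))
    return disordered, close


# Hypothetical items: one with a reversed first pair, one with two narrow gaps.
reversed_item = check_thresholds([0.4, -0.2, 0.5, 1.6])
narrow_item = check_thresholds([-1.0, -0.9, 0.5, 0.7])
```

Separating disordering from closeness mirrors the distinction drawn in the text: a reversed pair signals that categories do not function in the intended order, whereas a narrow gap signals possible redundancy.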
The distribution of person location estimates (θ) had a wider spread compared to the spread of item locations, ranging from −2.03 logits to 3.64 logits (M = 0.80, SD = 0.95). In addition, the average person location was higher than the average item location (M = 0.00), indicating that, on average, the participants readily endorsed the FS subscale items. Person standard errors (0.38 ≤ SE ≤ 1.00) indicated poor targeting between items and persons for some participants on the FS subscale. In future research, additional FS items could be examined that may better target persons with lower or higher levels of fantasy skills.
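The link between targeting and person standard errors can be illustrated with the information function of the PCM, under which the standard error of a person estimate is the reciprocal square root of the summed item information. The sketch below is a hedged illustration, not the software output reported here: all function names, the seven item locations, and the shared threshold pattern are hypothetical.

```python
import math


def pcm_probs(theta, thresholds):
    """Category probabilities for one item under the partial credit model."""
    numerators = [0.0]
    for tau in thresholds:
        numerators.append(numerators[-1] + (theta - tau))
    peak = max(numerators)
    weights = [math.exp(value - peak) for value in numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def item_information(theta, thresholds):
    """Item information at theta: the variance of the category score."""
    probs = pcm_probs(theta, thresholds)
    expected = sum(k * p for k, p in enumerate(probs))
    return sum((k - expected) ** 2 * p for k, p in enumerate(probs))


def person_standard_error(theta, items):
    """Model standard error of theta given a list of item threshold lists."""
    information = sum(item_information(theta, thresholds) for thresholds in items)
    return 1.0 / math.sqrt(information)


# Seven hypothetical items centered near 0 logits (values are illustrative only).
item_locations = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.4]
items = [[loc + step for step in (-1.5, -0.5, 0.5, 1.5)] for loc in item_locations]

se_targeted = person_standard_error(0.0, items)   # person near the item locations
se_extreme = person_standard_error(3.5, items)    # person far above the items
```

In this hypothetical scale the standard error is larger for the extreme person than for the well-targeted one, which is the pattern that motivates adding items located nearer the extremes of the person distribution.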
Person fit statistics for the FS scale indicated adequate model-data fit for most of the participants. Average values of the person MSE statistics were slightly lower than is generally expected (mean infit MSE = 0.89, mean outfit MSE = 0.90); this result indicates slightly more consistency in person responses than expected. However, there were some participants with notably high person fit statistics (≥ 4); for these persons, the item order observed in Table 3 does not hold. Additional research focused on individual persons may provide insight into the nature of FS for these participants.
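For readers unfamiliar with mean square fit statistics, the sketch below shows the conventional way infit and outfit values are formed from model residuals for one person: outfit is the unweighted mean of squared standardized residuals, and infit is the information-weighted version. It is an illustration under stated assumptions, not the computation performed by the software used in this study; the function names, item thresholds, and response patterns are hypothetical.

```python
import math


def pcm_probs(theta, thresholds):
    """Category probabilities for one item under the partial credit model."""
    numerators = [0.0]
    for tau in thresholds:
        numerators.append(numerators[-1] + (theta - tau))
    peak = max(numerators)
    weights = [math.exp(value - peak) for value in numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def person_fit(responses, theta, items):
    """Return (infit, outfit) mean square statistics for one person."""
    squared_residuals, variances, squared_z = [], [], []
    for observed, thresholds in zip(responses, items):
        probs = pcm_probs(theta, thresholds)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared_residuals.append((observed - expected) ** 2)
        variances.append(variance)
        squared_z.append((observed - expected) ** 2 / variance)
    infit = sum(squared_residuals) / sum(variances)   # information-weighted
    outfit = sum(squared_z) / len(squared_z)          # unweighted mean
    return infit, outfit


# Seven hypothetical items and two hypothetical response patterns at theta = 0.
item_locations = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.4]
items = [[loc + step for step in (-1.5, -0.5, 0.5, 1.5)] for loc in item_locations]
consistent_fit = person_fit([2, 2, 2, 2, 2, 2, 2], 0.0, items)
erratic_fit = person_fit([0, 4, 0, 4, 0, 4, 0], 0.0, items)
```

Values near 1.00 indicate responses about as variable as the model expects; values below 1.00 indicate less variation than expected, and large values, as for the erratic pattern here, indicate responses that contradict the item order.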
EC Subscale
Like the Fantasy subscale, participant responses to the Empathic Concern subscale suggested adequate adherence to the requirement for unidimensionality: PCM measures explained 46.11% of the variance in the EC subscale responses, and the standardized residual PCA resulted in eigenvalues ≤ 1.53 for all of the contrasts. In addition, the EC items generally adhered to the local independence assumption: the average correlation between item residuals was equal to −0.16. For one item pair (Items 9-EC and 4-EC [reversed coded]), the inter-item residual correlation was equal to −0.32, suggesting approximately 10% of shared variability between these items. This result likely reflects the similar content of these items, as both require participants to rate the likelihood of becoming concerned and acting altruistically in response to others’ problems. For instance, Item 9-EC is “When I see someone being taken advantage of, I feel kind of protective towards them,” while Item 4-EC is its reversed form: “Sometimes I don’t feel very sorry for other people when they are having problems.”
Partial Credit Model Results for EC Subscale.
Notes. (1) Items are arranged by their overall location estimate, from low (easy to endorse) to high (difficult to endorse); (2) Asterisks (*) indicate disordered thresholds.
Given general adherence to the PCM requirements, we proceeded to examine and interpret parameter estimates for the EC items. Table 4 shows item location estimates for the EC items, with the average item location set to 0.00 logits to facilitate interpretation. The EC items had a relatively wide spread of locations on the logit scale, ranging from −0.90 logits for Item 18-EC (reversed coded), which was the easiest item to endorse, to 0.88 logits for Item 2-EC, which was the most difficult to endorse.
The pattern found in the item order for the EC subscale is interesting and can be linked to cultural standards. For example, Item 18-EC, which was easy to endorse in our sample, is related to unjust behavior: “When I see someone being treated unfairly, I sometimes don’t feel very much pity for them.” Iran is a hierarchical society with a noticeable power imbalance between classes (Yaghoubi Jami et al., 2019). In other words, most Iranians feel they are treated unfairly and therefore people can empathize with each other due to this shared experience. They think they know how the other person is feeling because they have already felt it at least once in their life and this similarity between the empathizer and the target of empathy may boost empathy (Yaghoubi Jami et al., 2021a). Similarly, Item 2-EC, which was the hardest to endorse, requires participants to rate their empathic feelings toward someone less fortunate than themselves, which makes it hard for people to imagine another individual being less fortunate than they are.
Several items in the EC subscale exhibited category disordering. For three items (Items 9-EC, 14-EC [reversed coded], and 22-EC), the first threshold was located higher on the logit scale compared to the second threshold. In addition, for Item 18-EC (reversed coded), the second threshold was located higher on the logit scale compared to the third threshold. These results suggest that the rating scale for the IRI may function differently for EC compared to other subscales.
The distribution of person location estimates (θ) ranged from −0.90 logits to 3.89 logits (M = 1.26, SD = 0.81). The average person location was higher than the average item location (M = 0.00)—indicating that, on average, the participants readily endorsed the EC subscale items. Person standard errors (0.35 ≤ SE ≤ 1.04) indicated poor targeting between items and persons for some persons for the EC subscale. In future research, additional EC items could be examined that may better target persons with lower or higher levels of EC.
Person fit statistics for the EC scale indicated adequate model-data fit for most of the participants. Average values of the person MSE statistics were slightly lower than is generally expected (mean infit MSE = 0.89, mean outfit MSE = 0.87); this result indicates more consistency in person responses than expected. However, there were some participants with notably high person fit statistics (≥ 5); for these persons, the item order observed in Table 4 does not hold. Additional research focused on individual persons may provide insight into the nature of EC for these participants.
PD Subscale
Participant responses to the Personal Distress subscale suggested adequate adherence to the requirement for unidimensionality: PCM measures explained 54.06% of the variance in the PD subscale responses, and the standardized residual PCA resulted in eigenvalues ≤ 1.54 for all of the contrasts. In addition, the PD items generally adhered to the local independence assumption: the average correlation between item residuals was equal to −0.16. For one item pair (Items 17-PD and 13-PD [reversed coded]), the inter-item residual correlation was equal to −0.31, suggesting approximately 10% of shared variability between these items. This result likely reflects the similar content of these items, as both ask participants to rate the likelihood of staying calm in emergencies. For example, Item 17-PD is “Being in a tense emotional situation scares me,” whereas Item 13-PD is the reversed form of the same statement: “When I see someone get hurt, I tend to remain calm.”
Partial Credit Model Results for PD Subscale.
Notes. (1) Items are arranged by their overall location estimate, from low (easy to endorse) to high (difficult to endorse); (2) Asterisks (*) indicate disordered thresholds.
Given general adherence to the PCM requirements, we proceeded to examine and interpret parameter estimates for the PD items. Table 5 shows item location estimates for the PD items. The PD item locations (M = 0.00) ranged from −0.71 logits for Item 6-PD, which was the easiest item for participants to endorse, to 0.30 logits for Item 24-PD, which was the most difficult item for participants to endorse. These two items were both related to one’s ability to not lose control in emergencies. However, how easy each item was to endorse appeared to depend on the management skills and the action implied by each statement. For example, Item 6-PD, which was the easiest item to endorse, asked participants to rate how they feel in emergencies, “In emergency situations, I feel apprehensive and ill-at-ease,” whereas Item 24-PD, which was the hardest item to endorse, required participants to rate whether they can maintain control in emergencies, “I tend to lose control during emergencies.”
The rating scale category thresholds were ordered as expected for all PD items. Moreover, the distances between adjacent thresholds were distinct for all but one item: Item 24-PD. For this item, the third and fourth threshold locations were nearly identical; this result suggests that the highest rating scale categories may be redundant for this item.
The distribution of person location estimates (θ) ranged from −2.95 logits to 2.98 logits (M = 0.58, SD = 0.74). The average person location was higher than the average item location (M = 0.00)—indicating that, on average, the participants readily endorsed the PD subscale items. Person standard errors (0.38 ≤ SE ≤ 1.03) indicated poor targeting between items and persons for some persons for the PD subscale. In future research, additional PD items could be examined that may better target persons with lower or higher levels of PD.
Person fit statistics for the PD scale indicated adequate model-data fit for most of the participants. Average values of the person MSE statistics were slightly lower than is generally expected (mean infit MSE = 0.91, mean outfit MSE = 0.91); this result indicates slightly more consistency in person responses than expected. However, there were some participants with notably high person fit statistics (≥ 4); for these persons, the item order observed in Table 5 does not hold. Additional research focused on individual persons may provide insight into the nature of PD for these participants.
PT Subscale
Participant responses to the Perspective Taking subscale suggested adequate adherence to the requirement for unidimensionality: PCM measures explained 36.44% of the variance in the PT subscale responses, and the standardized residual PCA resulted in eigenvalues less than 2 for all of the contrasts. In addition, the PT items generally adhered to the local independence assumption: the average correlation between item residuals was equal to −0.16. For one item pair (Items 15-PT [reversed coded] and 21-PT), the inter-item residual correlation was equal to −0.36, suggesting approximately 13% of shared variability between these items. This result likely reflects the similar content of these items, as both ask participants to rate how likely they are to look beyond their own perspective and consider the other side when solving problems. For example, Item 21-PT is “I believe that there are two sides to every question and try to look at them both,” whereas Item 15-PT is the reversed form of the same statement: “If I’m sure I’m right about something, I don’t waste much time listening to other people’s arguments.”
Partial Credit Model Results for PT Subscale.
Notes. (1) Items are arranged by their overall location estimate, from low (easy to endorse) to high (difficult to endorse); (2) Asterisks (*) indicate disordered thresholds.
Given general adherence to the PCM requirements, we proceeded to examine and interpret parameter estimates for the PT items. Table 6 shows item location estimates for the PT items. The PT item locations (M = 0.00) ranged from −0.44 logits for Item 8-PT, which was the easiest item for participants to endorse, to 0.73 logits for Item 25-PT, which was the most difficult item for participants to endorse. Considering the nature of the statements in these items, the result was not surprising; Item 8-PT, “I try to look at everybody’s side of a disagreement before I make a decision,” requires a simple cognitive perspective that is in line with societal expectations, and generally people tend to follow social desirability. Therefore, most people believed or at least liked to think of themselves as considerate and able to think clearly before making a decision. On the other hand, Item 25-PT is a complex cognitive statement, “When I’m upset at someone, I usually try to put myself in his shoes for a while,” which requires both emotional regulation and taking others’ perspective simultaneously. Therefore, it may be hard for people to simultaneously perform two complex cognitive abilities.
The rating scale category thresholds were ordered as expected for all PT items except Item 8-PT. For this item, the first and second thresholds were disordered, such that less perspective taking was required to provide a rating in category 2 compared to category 1. The distances between adjacent thresholds were distinct for most items. However, for Item 15-PT (reversed coded), the first and second thresholds were quite close (τ2 − τ1 = 0.08 logits), indicating that the lowest rating scale categories may not provide unique information about participants’ levels of perspective taking.
The distribution of person location estimates (θ) ranged from −1.13 logits to 3.93 logits (M = 0.77, SD = 0.74). The average person location was higher than the average item location (M = 0.00)—indicating that, on average, the participants readily endorsed the PT subscale items. Person standard errors (0.39 ≤ SE ≤ 1.03) indicated poor targeting between items and persons for some persons for the PT subscale. In future research, additional PT items could be examined that may better target persons with lower or higher levels of PT.
Person fit statistics for the PT scale indicated adequate model-data fit for most of the participants. Average values of the person MSE statistics were slightly lower than is generally expected (mean infit MSE = 0.92, mean outfit MSE = 0.91); this result indicates more consistency in person responses than expected. However, there were some participants with notably high person fit statistics (≥ 4); for these persons, the item order observed in Table 6 does not hold. Additional research focused on individual persons may provide insight into the nature of PT for these participants.
Discussion and Application to Practice
We evaluated the IRI (Davis, 1983) for Farsi speakers by exploring the factor structure of the questionnaire and examining the structure of each subscale in terms of adherence to fundamental measurement properties, item order, person order, and rating scale structure using IRT. The IRI is one of the most common self-report questionnaires for measuring affective and cognitive empathy (De Corte et al., 2007). Davis (1983) argued that the FS and PT scales allow researchers to measure people’s ability to mentally understand others’ position and perspective (i.e., cognitive empathy). He also proposed that the other two subscales, EC and PD, contribute to people’s ability to experience the same feeling as other individuals, known as affective empathy. In line with previous studies (e.g., Garcia-Barrera et al., 2017), our factor analysis results supported the four-factor structure of the IRI. However, the contribution of each individual item included in the survey to the phenomenon of empathy as a whole highly depends on the nature of the statements and the way they connect to the context of the research. In other words, the results show that in the context of the current study (i.e., Iran), cognitive empathy is best measured with PT items and affective empathy is best measured with EC items. The results also suggest that neither of these two facets of empathy could be measured by combining PD and FS items with PT and EC statements.
Moving to the next objective of the current study, we examined the psychometric properties of each subscale within the IRI separately using the PC model. The PC model allowed us to evaluate the degree to which participant responses to each scale reflected fundamental measurement properties, including unidimensionality, local independence, and invariance. This model also allowed us to empirically evaluate the degree to which the IRI rating scale categories functioned as expected, such that increasing categories reflected increasing levels of empathy and each category provided distinct information about participants. Overall, the IRT analyses revealed general adherence to these properties for the IRI subscales. However, we identified several aspects related to each subscale that warrant additional consideration. For example, additional items are needed to better measure persons with relatively low or high levels of FS, EC, PD, and PT. In addition, we observed some notable person misfit that indicated that the IRI items may not function in a comparable way for all individuals, particularly in the EC, PD, and PT subscales. Additional research on these subscales in the context of Farsi speakers is needed to more fully understand the nature and potential causes for this person misfit.
With regard to rating scale structure, we observed that the IRI rating scale categories were generally ordered and distinct. With the exception of the PD scale, there were several items for which the categories did not function as expected. Additional research is needed to better understand these deviations from expectations. Perhaps the most interesting result from the IRT analyses was the item order within each scale. We observed some commonalities between the scales. In all the subscales, the emotion-triggering items were easier to endorse, while the statements requiring employment of either cognition (e.g., recalling a specific situation) or a combination of cognition and emotion (e.g., emotion regulation and taking others’ perspective) were the hardest items for participants to endorse. On the other hand, the pattern of reverse-coded statements within all four subscales yielded some differences. The results indicated that these items within the PT subscale were the hardest for participants to endorse. Conversely, the reverse-coded items in the other three subscales, especially in the EC subscale, were the easiest to endorse. This result may reflect the fact that positively and negatively oriented items are not necessarily psychometrically interchangeable (Sliter & Zickar, 2014). Additional research may provide insight into participants’ interpretation of the negative and positive items in the IRI.
Whereas previous examinations of the IRI have focused primarily on its factor structure, our study provided insight into the item-level psychometric properties of this instrument, including the order of items on the latent variable, the degree to which individual items adhered to fundamental measurement properties (item fit), the degree to which the item order held across participants (person fit), and the degree to which the rating scale categories functioned in psychometrically useful ways for each item. Our results suggest that the IRI provides interesting information about empathy within an Iranian context, and that additional research is needed to more fully understand the nature of this construct among Farsi speakers.
From a social work perspective, research on empathy and its measurement, including the current analysis, can contribute to the field by increasing awareness of empathy as a multi-dimensional concept and advocating for a more conclusive operationalization of empathy with the help of empirical and cross-cultural evidence. As Gerdes and Segal (2011) argued, empathy is the defining factor in practitioner-client success, and it could lead to desirable outcomes for both parties involved. Through emotion understanding, social work practitioners and counselors can connect more effectively with their clients, and through emotion regulation, they can prevent burnout and empathic fatigue and offer more productive treatments to their clients (Gerdes et al., 2010; Yaghoubi Jami et al., 2021a). Clients could also benefit from empathy by being more engaged and motivated in their therapeutic process and by experiencing positive outcomes such as improved family relationships (Diamond, Diamond, & Hogue, 2007), improved relationships with others (Eisenberg et al., 2005), and personal growth through interventions that increase empathy (see Gerdes et al., 2010 for a review).
However, achieving these goals and outcomes requires a mutual understanding of the conceptualization and operationalization of empathy among researchers, practitioners, and clients that is generalizable to different populations and other cultures. As pointed out by Siu and Shek (2005), social work research on empathy is largely limited to western populations, which could threaten the generalizability of findings. Moreover, considering the lack of consensus among researchers about the conceptualization and operationalization of empathy (Gerdes & Segal, 2011) as well as the lack of research on empathy in the social work field (Siu & Shek, 2005), results from studies, including the current analysis, could move the field forward and help educators and practitioners design more successful interventions. In line with Gerdes et al. (2010), we believe that “social work can and should be at the forefront of developing a consistent definition of empathy and creating measures” (p. 2339), which requires a validation process for questionnaires adapted for non-western populations (Siu & Shek, 2005).
Footnotes
Author Contributions
The authors contributed equally to this study. PJ conceptualized and designed the study and collected the data. SAW carried out the statistical analyses and interpretation of results. PJ and SAW drafted the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Ethical approval
All procedures performed in this study were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This study was approved by the Institutional Review Board (IRB #15-OR-336-R1) at the University of Alabama. Online written informed consent forms were collected from all participants prior to their participation in the current research. Participation was completely voluntary.
