Abstract
Servant leadership has been gaining attention from all types of organizations, from businesses to public schools. As studies of the servanthood characteristics of organizational leaders have increased, various servant leadership scales have been used to examine servant leadership behaviors, perceptions, and attitudes in different organizations. In line with this growing interest, the purpose of the study was to characterize the Servant Leadership (SL) scale psychometrically through Confirmatory Factor Analysis (CFA) and Rasch analysis. The data were collected from 461 teachers across several countries. The one-factor structure of the SL was confirmed in CFA and examined further with the Rasch Rating Scale model through analyses of rating scale functioning, item fit, reliability, unidimensionality, local independence, and differential item functioning (DIF). High person separation and reliability statistics supported the consistency of the SL scores. Only one item (Item 7) did not fit the Rasch model, and another item (Item 1) showed DIF in favor of females. Overall, the CFA and Rasch models provided sufficient evidence for the seven-item SL scale.
Introduction
Recently, servant leadership has been a much-discussed leadership style since the needs of employees in organizations have become a priority and an important quality indicator in the products or services rendered. This leadership is of paramount importance, particularly in educational organizations, which depend on servant leaders’ qualities, such as creating a common purpose, commitment, trust, and cooperation among the teachers (Cerit, 2010). The servant leadership approach in school settings refers to the human desire “to be known, to care, and to be cared for in pursuit of the common good” (Bowman, 2005, p. 257). Black (2010) points out – in research with a wealth of evidence – that the goal of improved academic achievement can be reached with effective leaders in schools – leaders who can meet the needs of school organizations.
One of the ways that effective schools can be characterized is by servant leadership. For that reason, the main mission of a school leader can be specified by his or her role in serving the school’s community and society at large (Al-Mahdy et al., 2016). Contrary to self-serving leaders that deal with others’ requests and needs only when desired or personally advantageous, school leaders who serve as an example of true servant leadership must serve both teachers and students if there is a legitimate need. That duty exists regardless of the leaders’ own moods, benefits, and burdens by the very nature of the role (Sendjaya & Cooper, 2011).
In contrast to the pyramid model of leadership, Washington et al. (2014) describe servant leadership as the result of an inverted pyramid. This calls for leaders to be placed at the bottom of the organizational pyramid and to focus on serving the organization. Irrespective of the predominance of authority and upper management style, servant leaders in schools can create a leadership style with worker autonomy, allowing individuals to adopt the organizational process internally (Ekinci, 2015). Thus, the assessment and evaluation of servant leadership characteristics within an educational setting can guide policymakers in shaping overall leadership practices toward a bottom-up approach and in adopting contemporary educational management practices for schools.
Servant leadership
Within the neoclassical approaches to management, there has always been a strong call to see workers as the main factor in organizations. Servant leadership, spiritual leadership, and ethical leadership have been some of the latest leadership styles that have focused on the inner world, values, and needs of employees, implying “a leader for followers” instead of the traditional lens of “followers for a leader.” Servant leadership, in this respect, refers to the focus on leaders who serve their followers, which in turn produces a shared spirit in purpose, trust, commitment, desire for wisdom, and effort in the organization. Zhou and Miao (2014, p. 381) state that leaders in the public sector need to improve their administrative and management ability as servants. A service-oriented philosophy of leadership is an exhibition of and a prerequisite for a wise organization (Barbuto & Wheeler, 2006). A wise organization, in turn, creates multi-channeled interactions in the organization based on knowledge, not hierarchy.
The concept of servant leadership made its way into the literature with the work of Greenleaf (1970) and was supported by the models of several pioneering authors. Spears (2010, pp. 27–29), after years of considering Greenleaf’s writings, has identified a set of ten characteristics of a servant leader central to servant-leaders’ development: listening, healing, empathy, awareness, conceptualization, persuasion, stewardship, foresight, commitment to people’s growth, and building community.
In the theoretical model of servant leadership, Patterson (2003) asserted the notion of leading an organization with a highlighted focus on their followers and stated that servant leaders lead and serve with agapao love, act with humility, are altruistic, are a visionary for their followers, are trusting, serve, and empower their followers. Graham (1991), on the other side, identified the characteristics of servant leadership as humility, autonomy, relational power, moral development of followers, and emulation of the leaders’ service orientation.
Sparked by Greenleaf and other models of spiritual, servant, and transformational leadership qualities, the calls for a stronger bottom-up management understanding directed researchers in the last decade toward empirical studies to assess servant leaders’ behaviors. Several researchers contributed to the literature with various studies on servant leadership in the last decade (e.g. Liden et al., 2015; Reed et al., 2011; Sendjaya & Cooper, 2011; Van Dierendonck & Nuijten, 2011) and with the implementation of new ideas – like Alpha Leadership – in the workforce, more empirical studies in this regard will be needed in the coming years.
As mentioned above, several scales have been developed to measure the servant leadership construct. An ideal scale should have excellent psychometric properties, including reliability and validity. Validity evidence for servant leadership scales has already been reported in the original studies. Given that scale validation is a continuous process (Nunnally & Bernstein, 1994), more research is needed to further validate servant leadership scales and test their generalizability. Aside from that, there seems to be a need for more studies on assessing servant leadership. Among the several validation studies that exist, none has taken an international perspective on servant leadership with multinational samples.
The goal of this study is to enlarge empirical support regarding the servant leadership theory by re-examining Liden et al.’s (2015) Servant Leadership (SL) scale under the Rasch model on a wide sample of participants from diverse countries. While doing so, we also took heed of Crawford and Kelder’s (2019) notice that there is a clear and substantial divide between the scales developed by academia and industry. Thus, as teacher trainers in academia and former teachers, we formed the sample with teachers from different countries and gave our opinions on the related items in the results, especially in misfit and DIF-related items.
Method
In writing the method part of this study, we checked the studies in our references against Crawford and Kelder’s (2019, p. 141) guide for empirical evaluation, which includes a) reporting on and factoring in the researchers’ assumptions of normality, b) testing internal reliability by using tests specifically designed for each model, c) using factor analysis properly, d) reporting on (at the minimum) SRMR, the chi-square test, RMSEA, and CFI, e) demonstrating, at the very least, predictive validity that equals or exceeds accepted leadership theories, f) considering a sample size appropriate for the tests to be used, with 150 as the minimum, and g) reporting the methods used and justifying assumptions distinctly. Thus, we presume that all results in this study and the implications based on the examined studies are of scientific rigor, along with our personal interpretations.
Sample
The SL scale was administered in English through an online survey, together with a demographic information form (age, gender, etc.). The sample consisted of 461 teachers (108 men, 353 women) with a mean age of 35.4 (SD = 9.9). The participants were from fifty countries and were mainly Turkish and European citizens. We collected data in two ways: through online teacher groups and through direct contacts with schools, teacher training organizations, and teachers’ networks in different countries. First, with the approval of the group admins, we contacted the most active teachers’ groups on social media (Facebook, etc., mostly those aimed at English language teachers) about our study. We examined the recent activity in each group and wrote privately to the most active English-language contributors, asking whether they would be willing to take part in our study. Then, we used our existing research and project contacts in other countries (EU) and in Turkey: we phoned the directors of organizations and schools and sent them online invitations, asking for their help in passing the online survey to the English language teachers in their schools or to other teachers proficient in English. We focused mostly on English teachers and employed a very concise scale in order to avoid the language-related problems that may arise in a multinational sample. We also asked school directors in the EU and Turkey and the admins of different teacher groups on social media whether they found the items understandable and clear. The respondents reported working at high schools (n = 134), middle schools (n = 187), primary schools (n = 130), and pre-primary schools (n = 10). The participants came from a variety of teaching areas, including social sciences (n = 344), life sciences (n = 30), and others (n = 87). The country information of participants is given on a regional basis in Table 1.
Distribution of participants from different regions/countries.
As can be seen in Table 1, the regional grouping was mostly based on the United Nations’ regional groups of member states (see United Nations Regional Groups of Member States, n.d.). Turkey was listed in a separate row because it had the largest share of participants.
Instrument
This study used a servant leadership scale first developed by Liden et al. (2008) with 28 items and then shortened to a 7-item version by Liden et al. (2015). We used Liden et al.’s (2015) instrument (the short form), which includes items selected from the seven dimensions reported in the 2008 article to maximize domain coverage of servant leadership. Liden et al. (2015) state that servant leadership is best represented by the aggregate model, which consists of the sum of its dimensions, and that concise global measures such as the scale used in this study are favored both theoretically and with respect to design factors.
Respondents answered each item on a scale ranging from 1 (“strongly disagree”) to 7 (“strongly agree”). Liden et al. (2015) employed the short form of the SL in three separate studies with six independent samples. Three samples separately produced acceptable fit indices (e.g., 2nd sample: CFI = .99, SRMR = .03, RMSEA = .04; χ2(14, 218) = 20.03, p<.01). Our choice of the Servant Leadership scale by Liden et al. (2015), which is among 16 measures of servant leadership in the literature (see Eva et al., 2019), is based on several factors. We wanted to examine one of the most cited servant leadership measures to help researchers in its further use, and we opted for the shorter version so that participants could better focus on items that are fairly easy and straightforward. Most importantly, we evaluated all scales in terms of a) item generation (deductive, inductive, or both), b) content adequacy assessment, c) questionnaire administration, d) factor analysis (EFA and CFA), e) internal consistency assessment, f) construct validity, and g) replication, as examined by Eva et al. (2019, p. 115). The seven-item scale of Liden et al. (2015) went through all of these steps in a rigorous process of construction and validation and was one of three measures recommended for data collection (Eva et al., 2019). Sendjaya et al. (2019) also developed a Servant Leadership scale with six items that matches the criteria mentioned in this part. However, the seven-item scale of Liden et al. (2015) concisely represents seven dimensions and, being the older of the two, has been used more widely to assess leadership qualities in different organizations and contexts.
The scale’s items are presented in Table 3. Each of them represents one dimension of Liden et al.’s (2008) scale with 28 items and seven dimensions (e.g., “My leader emphasizes the importance of giving back to the community” is intended to measure “creating value for the community”). The dimensions in Liden et al.’s (2008, p. 176) scale are “conceptual skills, empowering, helping subordinates grow and succeed, putting subordinates first, behaving ethically, emotional healing, creating value for the community.”
Data analysis
Both classical test theory and item response theory methods were employed to examine the psychometric properties of the SL scale. Applications of these two complementary methods have not been common in leadership validity studies.
We selected these two methods for two reasons. First, the confirmatory factor analysis (CFA) method was used to allow a comparison with the original scale development study by Liden et al. (2015). Second, the Rasch analysis was used to gain insight into the scale items that could not be obtained with CFA. Recent research on scale validation has focused on combining these two approaches (e.g., He et al., 2020).
A one-factor CFA model was tested to provide evidence for the single-factor structure of the seven-item scale. The CFA model was estimated using Mplus 8.3 (Muthén & Muthén, 1998–2019) software. The fit of the single-factor structure was assessed using multiple fit indices, including the chi-square (χ2) statistic, root mean square error of approximation (RMSEA), the Tucker-Lewis Index (TLI), the comparative fit index (CFI), and the standardized root mean square residual (SRMR). The fit of the empirical one-factor model was evaluated based on the following criteria for a good model fit: CFI > 0.95; TLI > 0.95; RMSEA < 0.06; SRMR < 0.08; and nonsignificant χ2. The convergent validity of the scale was also assessed by calculating the average variance extracted (AVE) value for the scale (Hair et al., 2014). The internal consistency of the scale items was assessed with Cronbach’s alpha, Omega, and construct reliability (CR) coefficients.
The Rasch modeling approach was also applied to examine whether the instrument objectively measures servant leadership and has adequate construct validity. Since the Rasch model is an item-based method, it helps to identify possible item misfit and to check whether the rating scale is functioning as intended. The fit of each item was evaluated using fit indices (infit and outfit mean squares) and category measures. The Rasch Rating Scale model (Andrich, 1978) was used to analyze the Likert-type responses via WINSTEPS 3.68.2 (Linacre, 2009) software. In this study, the WINSTEPS program was used to carry out the analyses listed in Table 2. If the data adhere to the Rasch model requirements listed in Table 2, the scale can be considered unidimensional and its sum score can be used to represent the servant leadership construct.
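To make the Rating Scale model more concrete, the sketch below computes the category probabilities it implies for one person and one item. In this model, the log-odds of responding in category k rather than k−1 is the person measure minus the item difficulty minus a step calibration shared by all items. The function name, the person and item values, and the step calibrations are illustrative assumptions and are not estimates from this study.

```python
import math

def rsm_category_probs(theta, delta, taus):
    """Category probabilities under the Andrich Rating Scale model.

    theta: person measure (logits); delta: item difficulty (logits);
    taus: step calibrations shared by all items (one per step between
    adjacent categories, so len(taus) + 1 categories).
    """
    # Cumulative sums of (theta - delta - tau_j); the lowest category has sum 0.
    cumulative = [0.0]
    for tau in taus:
        cumulative.append(cumulative[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical, symmetric step calibrations for a seven-point scale.
example_taus = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
# A person whose measure equals the item difficulty.
probs = rsm_category_probs(theta=0.0, delta=0.0, taus=example_taus)
most_likely_rating = probs.index(max(probs)) + 1  # ratings are 1..7
```

With symmetric step calibrations and a person located exactly at the item difficulty, the middle rating is the most probable one, and the probabilities fall off evenly on either side.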
The summary of Rasch analyses conducted in this study.
Results
Descriptive statistics
Table 3 presents the descriptive statistics for the scale items. As shown in Table 3, the items differed in their mean scores, ranging from 3.868 (Item 5) to 5.377 (Item 1). The skewness values ranged from −0.098 to −1.235, whereas the kurtosis values were between −0.158 and 1.150. As Table 3 shows, most of the skewness and kurtosis values were within the acceptable range of univariate normality (±1). However, the kurtosis statistics for Item 1 and the skewness statistics for Items 1 and 4 were outside of the acceptable range, indicating non-normality. Mardia's multivariate skewness and kurtosis tests were used to test the multivariate normality of the data using the MVN R package (Korkmaz et al., 2014). Significant p-values from multivariate skewness and kurtosis tests indicated the lack of multivariate normality. Table 4 presents the Pearson correlation matrix calculated between the scale items. The correlation values were found to be between .236 and .609. Small to medium correlations indicated no sign of multicollinearity.
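The univariate screening described above can be sketched as follows. This is a minimal illustration that uses population moment formulas; statistical packages often apply small-sample corrections, so their values can differ slightly. The function names and the response vector are hypothetical and are not the study data.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def within_normal_range(skew, kurt, limit=1.0):
    """Apply the +/-1 rule of thumb used in the text to both statistics."""
    return abs(skew) <= limit and abs(kurt) <= limit

# Hypothetical responses on a 1-7 scale, concentrated at the agreeing end.
responses = [7, 7, 6, 6, 6, 5, 5, 5, 5, 4, 4, 3, 2, 1]
skew, kurt = skewness_and_excess_kurtosis(responses)
item_ok = within_normal_range(skew, kurt)
```

Applied to the extreme values reported above, the rule flags a skewness of −1.235 or a kurtosis of 1.150 as outside the acceptable range.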
Descriptive statistics for scale items.
The intercorrelation matrix between variables for SL.
Confirmatory factor analysis
As the data were neither univariate nor multivariate normal, the CFA of the one-factor model was conducted using weighted least squares estimation, which is an appropriate method for non-normal data. The one-factor model demonstrated a good model fit, with the exception of a highly significant chi-square statistic, χ2(14) = 43.454 (p = .0001). The CFI of .976, TLI of .964, RMSEA of 0.068 (90% CI: 0.045–0.091), and SRMR of .028 supported the use of the one-factor Servant Leadership scale in further analysis (see Figure 1). These fit indices were within or near the recommended criteria (Hu & Bentler, 1998). Figure 1 displays the standardized estimates for each item.
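As a rough check on the reported fit, the sketch below recomputes the RMSEA point estimate from the chi-square statistic and compares the reported indices with the cutoffs listed in the Data analysis section. It uses the common N − 1 form of the RMSEA formula, which is an assumption here: software may use N instead or a scaled chi-square, so small differences from the reported value are expected. The function and variable names are illustrative.

```python
import math

def rmsea(chi2, df, n):
    """RMSEA point estimate from a chi-square statistic (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Cutoffs as stated in the Data analysis section.
CRITERIA = {
    "CFI": (">", 0.95),
    "TLI": (">", 0.95),
    "RMSEA": ("<", 0.06),
    "SRMR": ("<", 0.08),
}

def check_fit(indices):
    """Return True/False per index depending on whether it meets its cutoff."""
    flags = {}
    for name, value in indices.items():
        direction, cutoff = CRITERIA[name]
        flags[name] = value > cutoff if direction == ">" else value < cutoff
    return flags

# Values reported for the seven-item, one-factor model (N = 461).
reported = {"CFI": 0.976, "TLI": 0.964, "RMSEA": 0.068, "SRMR": 0.028}
rmsea_from_chi2 = rmsea(43.454, 14, 461)
fit_flags = check_fit(reported)
```

Under these assumptions the recomputed RMSEA is close to the reported 0.068, and RMSEA is the only index that misses its strict cutoff, which matches the description of the indices as within or near the recommended criteria.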

The path diagram of the one-factor CFA model of the Servant Leadership scale. All coefficients are standardized and significant. Standard errors are presented in the parentheses.
Factor loadings were all significant at the 0.05 alpha level, and most of them were higher than .70. Only the loadings of Item 1 (.518) and Item 7 (.463) were smaller than .70, indicating weak relationships between the scale and these items. The R-square values ranged from .214 (Item 7) to .620 (Item 2). An AVE value of .470 was calculated using the squared standardized factor loadings obtained from the one-factor CFA model. This result indicates an adequate convergence (Hair et al., 2014). Cronbach’s alpha, Omega, and the CR value were computed as .851, .858, and .857, respectively, indicating good reliability (>.70). Thus, the high alpha, Omega, and CR values show that the items are internally consistent and measure the same construct.
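The AVE and CR coefficients can be computed directly from standardized loadings, as in the sketch below. The text reports the seven-item loadings only for Item 1 (.518) and Item 7 (.463, the square root of its R-square of .214). The loadings used here for Items 2 to 6 are borrowed from the six-item re-analysis reported later in the Results, so they are an assumption and the output is only an approximate reconstruction.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)

def construct_reliability(loadings):
    """CR: (sum of loadings)^2 over itself plus the summed error variances."""
    sum_sq = sum(loadings) ** 2
    error = sum(1 - l ** 2 for l in loadings)
    return sum_sq / (sum_sq + error)

# Items 1-6: loadings from the six-item model (assumed to be close to the
# seven-item estimates); Item 7: loading implied by its reported R-square.
approx_loadings = [0.518, 0.792, 0.718, 0.741, 0.748, 0.748, 0.463]
ave = average_variance_extracted(approx_loadings)
cr = construct_reliability(approx_loadings)
```

Under this assumption, both coefficients come out close to the reported values of .470 and .857.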
Rasch analyses
We also conducted a Rasch analysis to examine the seven-item scale. The analyses, including rating scale functioning, item fit assessment, reliability analysis, unidimensionality, local independence, and differential item functioning (DIF) analysis, are presented below.
Rating scale functioning
The seven-point scale categories were examined to check whether the following criteria for the rating scale evaluation (Linacre, 2002) were met:
#1: At least ten frequencies should be observed for each category.
#2: The distribution of observations across categories should be regular.
#3: Average measures should advance monotonically with each category.
#4: Outfit mean-squares should be less than 2.0.
#5: Step calibrations should advance monotonically with each category.
#6: Ratings should imply measures, and measures should imply ratings.
#7: Step difficulties should advance by at least 1.4 logits and by less than 5.0 logits.
Table 5 summarizes the main results regarding the rating scale structure (the seven-point scale) for the whole scale. Each category had at least ten observations, and the distribution of frequencies was unimodal. The average measures increased monotonically across categories, indicating that teachers with higher SL scores tended to respond in the higher categories. Thus, no category disordering was found. However, the monotonic increase was not observed for the step calibrations, which mark the transitions between adjacent rating categories.
Analysis of the SL rating scales.
Note. Obs=Observed; MNSQ= mean-square.
The criterion of monotonic advance was not met between categories 3–4 (from −0.27 to −1.20) and categories 5–6 (from 0.40 to −0.10). Outfit mean-squares are less than 2.0, meeting the requirement for guideline 4. Overall, the response categories match the Rasch model expectations. Guideline 6 says that the rating should imply the measure and vice versa; this is determined with the coherence statistic. This guideline was mostly met and is indicated by moderate coherence statistics. None of the differences were above 5.0.
Step difficulty estimates appeared to advance by at least 1.4 only between Categories 1–2 and 5–6. However, the difference between Categories 2–3, 3–4, and 4–5 was less than 1.4. Thus, the number of categories appeared to be redundant in the middle. Categories 2, 3, and 4 may be too close in meaning for the respondents to differentiate between them as the model expects. This suggests that it would be better to merge these middle categories. Overall, the SL scale appeared to be functioning well based on Linacre’s guidelines (see Table 5).
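Guidelines 5 and 7 can be checked mechanically once the step calibrations are available. The sketch below labels each advance between adjacent step calibrations as disordered, too narrow, too wide, or acceptable, using the 1.4 and 5.0 logit bounds from the guidelines. The calibration values are hypothetical and chosen only for illustration; they are not the estimates in Table 5.

```python
def classify_step_advances(calibrations, min_advance=1.4, max_advance=5.0):
    """Label the advance between each pair of adjacent step calibrations."""
    labels = []
    for lower, upper in zip(calibrations, calibrations[1:]):
        advance = upper - lower
        if advance < 0:
            labels.append("disordered")   # violates guideline 5
        elif advance < min_advance:
            labels.append("narrow")       # violates the lower bound of guideline 7
        elif advance >= max_advance:
            labels.append("wide")         # violates the upper bound of guideline 7
        else:
            labels.append("ok")
    return labels

# Hypothetical step calibrations (logits) for a seven-point scale.
example_calibrations = [-2.5, -0.8, -1.1, 0.2, 1.9, 2.3]
step_labels = classify_step_advances(example_calibrations)
```

A run of "narrow" or "disordered" labels in the middle of the scale is the pattern that would motivate merging adjacent categories.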
Item fit statistics.
Note. MNSQ= mean-square. Infit and outfit MNSQ values outside the 0.6–1.4 range are given in bold.
Item fit to Rasch
Table 6 presents the item difficulty parameters (logit), infit and outfit mean-square values, and PTMEA correlations for each item. As Table 6 shows, the item difficulty estimates ranged from −0.57 to 0.70 logits, indicating a good amount of item spread. Item 5 appeared to be the most challenging item for respondents to endorse, and Item 1 appeared to be the easiest item for respondents.
The fit of the items to the Rasch model was investigated with outfit and infit mean-square values. As Linacre (2010, p. 444) pointed out, “High infit mean squares indicate that the items are mis-performing for the people on whom the items are targeted. This is a bigger threat to validity, but more difficult to diagnose than high outfit.” Outfit and infit mean-square values of well-fitting items should be higher than 0.6 and lower than 1.4 (Wright & Linacre, 1994). As shown in Table 6, only Item 7 has infit and outfit mean-square values above 1.4, suggesting that Item 7 may be unproductive for measuring the construct (servant leadership). None of the infit and outfit mean-square values is below 0.60. Overall, there is no serious indication of item misfit in the SL scale except for Item 7 and no sign of multidimensionality, which means that our data appear to fit the Rasch model adequately. There are no negative correlations between the items and the measure: the PTMEA (point-measure) correlations ranged from .56 to .74, so all items on the SL scale had positive and moderate-to-strong PTMEA correlations.
Reliability analyses
Reliabilities of items and persons were evaluated in terms of “separation,” defined as the ratio of the true spread of the measures to their measurement error. The Rasch item and person separation indices were found to be 8.39 and 2.19, respectively, both above the threshold of 2. The reliability coefficients were also obtained, with values greater than or equal to .8 considered acceptable. These estimates are analogous to Cronbach’s alpha, and high values indicate little measurement error in the scores.
The reliability of separation coefficients for the items (.99) and persons (.83) was good, indicating high item and person differentiation. The high item differentiation may reflect the wide spread of difficulty in the items. High reliability for the persons indicates that the instrument is sensitive enough to differentiate between high, medium, and low performers.
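The separation indices and reliability coefficients are two expressions of the same information: for a separation index G, the corresponding reliability is G² / (1 + G²). The sketch below applies this standard Rasch relationship to the reported separation indices. The strata formula, (4G + 1) / 3, is a commonly used companion statistic that is added here for illustration and is not reported in the study.

```python
def reliability_from_separation(g):
    """Rasch separation reliability implied by a separation index G."""
    return g ** 2 / (1 + g ** 2)

def strata(g):
    """Approximate number of statistically distinct levels, (4G + 1) / 3."""
    return (4 * g + 1) / 3

person_reliability = reliability_from_separation(2.19)  # reported person separation
item_reliability = reliability_from_separation(8.39)    # reported item separation
person_strata = strata(2.19)
```

The implied reliabilities round to the reported .83 and .99, and a person strata value slightly above 3 is in line with the interpretation that the scale distinguishes high, medium, and low groups.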
Unidimensionality and local independence
Principal component analysis of Rasch residuals (PCAR) was used to inspect the unidimensionality of the SL scale after the initial Rasch factor was extracted. The results of the PCAR demonstrated that the unidimensional Rasch model explained 56.4% of the variance, with an eigenvalue of 9.1. The raw variances explained by persons and items were 27.4% and 29.0%, with eigenvalues of 4.4 and 4.7, respectively. In all, the measures explained a good amount of the variation (56.4%), and 43.6% of the variance remained unexplained. The eigenvalue of the first component that emerged in the PCAR after extracting the Rasch-explained variance was 1.4 (<2.0), which indicates that there is no substantive structure in the Rasch residuals and that unidimensionality is supported (Linacre, 2009).
Local independence of the items was also assessed by standardized residual correlations obtained from the WINSTEPS output. The largest standardized residual correlations were found to be less than .40, indicating local independence. In all, the PCAR and residual correlation analysis indicated the unidimensionality of the SL scale.
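The variance figures above can be tied together with simple arithmetic. The sketch below assumes the convention, used in WINSTEPS output as far as we are aware, that the total unexplained variance in eigenvalue units equals the number of items (seven here). That assumption, and the function name, are ours rather than the study's.

```python
def pcar_shares(explained_eigen, n_items, first_contrast_eigen):
    """Convert PCAR eigenvalues into shares of the total variance.

    Assumes the unexplained variance in eigenvalue units equals n_items.
    """
    total = explained_eigen + n_items
    return {
        "explained": explained_eigen / total,
        "unexplained": n_items / total,
        "first_contrast": first_contrast_eigen / total,
    }

# Reported eigenvalues: 9.1 explained by the measures (4.4 persons + 4.7 items),
# and 1.4 for the first component of the residuals.
shares = pcar_shares(explained_eigen=4.4 + 4.7, n_items=7, first_contrast_eigen=1.4)
```

Under this assumption the explained share is within rounding of the reported 56.4%, and the first residual component accounts for less than a tenth of the total variance.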
DIF analysis
Finally, we explored DIF across gender. Table 7 shows the results of the DIF analysis: the difficulty parameters of the scale items for each gender subgroup, the standard errors of these estimates, the local difficulty contrast between the gender subgroups, the degrees of freedom, Welch t-value, and p-value for this contrast, and the Mantel-Haenszel (MH) p-value and MH size. For example, the first row of Table 7 can be interpreted as follows: the difficulty of Item 1 is −0.66 for the female subgroup and −0.30 for the male subgroup; the contrast in difficulty, −0.36, is the measure of DIF effect size (Linacre, 2010); the Welch t-value of this contrast is −3.38, and the p-value of the contrast is .0008 based on the Welch test and .0045 based on the MH test, both of which are significant at the 0.05 alpha level. This item (Item 1: “My leader can tell if something work-related is going wrong.”) was easier for females to endorse. As Table 7 shows, evidence of DIF by gender was also found for Item 3 (“I would seek help from my leader if I had a personal problem.”) based on the MH test (p < .05). This item was easier for males (−.06) than for females (.11) to endorse. The remaining items showed no significant DIF based on the p-values obtained from the Welch and MH tests (p > .05). Table 7 also shows that the magnitude of the DIF contrast was less than one logit for all items, which indicates no substantial DIF.
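The Item 1 example can be retraced numerically. The DIF contrast is the difference between the two subgroup difficulties, and the Welch t-value is that contrast divided by the joint standard error, the square root of the sum of the squared subgroup standard errors. The subgroup standard errors are not quoted in the text, so the sketch below first backs out the joint standard error implied by the reported contrast and t-value. The two standard errors passed to `welch_t` are hypothetical values chosen only to be consistent with that joint error.

```python
import math

def dif_contrast(difficulty_a, difficulty_b):
    """DIF contrast: difference in item difficulty between two subgroups."""
    return difficulty_a - difficulty_b

def welch_t(contrast, se_a, se_b):
    """Welch t-value: contrast divided by the joint standard error."""
    return contrast / math.sqrt(se_a ** 2 + se_b ** 2)

# Reported Item 1 difficulties: females -0.66, males -0.30.
contrast = dif_contrast(-0.66, -0.30)
# Joint standard error implied by the reported t-value of -3.38.
implied_joint_se = contrast / -3.38
# Hypothetical subgroup standard errors (NOT from Table 7) for illustration.
t_value = welch_t(contrast, se_a=0.06, se_b=0.088)
```

A negative contrast means the item was easier to endorse for the first group, here the female subgroup.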
DIF analysis of the items.
Note. MH= Mantel Haenszel.
We re-evaluated the psychometric properties of the SL scale after removing Item 7 because of its misfit. The Rasch results obtained with only six items did not show improvement over the seven-item scale: (1) the infit and outfit mean-square statistics for Item 6 rose above 1.4, and (2) the person reliability coefficient decreased to .80. The item reliability estimate remained the same, and the variance explained by the first factor also decreased to 59.5%. However, the six-item version of the scale (without Item 7) showed improvement in the fit of the one-factor CFA model (χ2(9) = 29.423, p < .005; CFI = .982; TLI = .970; RMSEA = .070 [90% CI = .043, .105]; SRMR = .026). Standardized factor loadings were estimated as .518, .792, .718, .741, .748, and .748 for Item 1 to Item 6, respectively. R-square values were .269, .628, .515, .548, .560, and .560 for Item 1 to Item 6, respectively.
Discussion
Although the SL scale is one of the most widely used tools for measuring perceptions about servant leadership in organizations, no studies analyzing its validation using the Rasch model have been found. The current study examined the psychometric properties of the SL scale using the CFA and Rasch models that are complementary assessments of construct validity. Thus, a more comprehensive assessment was applied to obtain more information about the scale items.
In summary, our findings provide additional evidence of the good psychometric properties of the SL scale. Both the CFA and Rasch analyses provided sufficient evidence for the one-factor structure of the scale. This finding is in line with previous studies showing that a one-factor structure fits the scale acceptably. Chughtai (2018), Huertas-Valdivia et al. (2019), and Lu et al. (2019) used the seven-item scale of Liden et al. (2015) and found the psychometric properties of the measure, as one factor with seven items, to be suitable for the intended models in their studies.
The results of the one-factor CFA model showed that Item 7 (“My leader would not compromise ethical principles to achieve success”) and Item 1 (“My leader can tell if something work-related is going wrong”) had the lowest factor loadings, below 0.70. This result indicates a weak relationship between the scale and these items and is in line with previous studies. Liden et al. (2015, p. 256), in their study on scale development of servant leadership behaviors, found the lowest factor loadings for Item 1 and Item 7 in three different sample groups. Karatepe et al. (2019, pp. 96–97) removed these two items from their model after the CFA, as the measurement model suggested deleting them because of correlated measurement errors and standardized loadings below the base values.
Aside from that, the item “My leader can tell if something work-related is going wrong” was deleted in Paesen et al.’s (2019) study while “My leader would not compromise ethical principles in order to achieve success” was deleted due to low factor loading in Riquelme et al.’s (2019) study. These results support our findings. The reasons for low factor loading for these two items can be different and may depend on the characteristics of the participants’ schools or personal culture. But for Item 7, as researchers in the field of teaching for years, we have the impression that the negative structure of the last item, apart from the order effect, could be the reason why this item has a low factor loading.
As a general rule, teachers largely refrain from using negatively worded items in assessments of any kind. Similarly, Lapointe and Vandenberghe (2018) noted that they found high reliability for Liden et al.’s (2015) SL scale. However, they also informed readers that the item “My leader would not compromise ethical principles in order to achieve success” (reflecting the dimension of behaving ethically) was in a negative format; thus, they replaced it with “My manager is always honest,” an item under the same dimension of “behaving ethically” in Liden et al.’s (2008) scale (the longer version). This observation suggests that positive wording is better for data collection and is a possible solution for the low factor loading.
The Rasch model results showed different patterns for each analysis. Several scale properties appeared to be good to excellent, whereas others were poor. The seven-point rating scale met most of the guidelines for rating scale functioning. However, some categories did not function as intended. This issue may be due to the large number of categories (the seven-point scale). The middle categories may be too close in meaning for the respondents to differentiate between them as the model expects. This suggests that it would be better to merge these middle categories in future applications. The high person (test) and item reliability coefficients are evidence of internal consistency across the scale items.
Of the seven items from the SL scale that were analyzed, only one item (i.e., Item 7: “My leader would not compromise ethical principles in order to achieve success”) was identified as a misfit to the Rasch measurement model based on the criterion used. The misfit can again be attributed largely to the item’s negative wording, as stated above, and secondly to its being the last item in the scale, i.e., an order effect. Some authors opted to remove the item from the scale (Karatepe et al., 2019) or to change the sentence structure into positive wording (Lapointe & Vandenberghe, 2018). Thus, for subsequent studies on servant leadership, a positively worded version of the same item can be considered as well.
The SL scale was mostly free of DIF. However, one item (i.e., Item 1: “My leader can tell if something work-related is going wrong”) was easier for female respondents to endorse based on the Welch test. The differential functioning of Item 1 across gender groups may be related to feminine and masculine features in the workplace, since males may find it hard to accept criticism from leaders or to ask management for help with the problems they face in class or school, while females are more relationship- and communication-oriented.
Based on our experience of school culture in our previous and current schools, we find this assumption to be logical, since female teachers tend more to talk with their colleagues about their school problems during breaks and are more likely to ask school management for help (visibly more so than their male colleagues). Moreover, the same item in Liden et al.’s (2008) scale (the longer version) refers to conceptual skills, that is, the leader’s ability to assist with and solve problems and to understand organizational goals. This may also suggest that leaders feel more at ease talking to females about problems at work and solving them, which could be why this item has a higher endorsement rate among female participants in the teacher sample.
Several limitations contribute to the need for caution in the use of the SL scale in future studies. First, the problematic items (Item 1 and Item 7) identified by the Rasch measurement model call for further examination and revision to strengthen their fit to the unidimensional scale. Second, masculine and feminine characteristics seem to play a role in Item 1; future researchers can take this into account and explore other possible reasons, depending on the sample features. Another limitation is the lack of external measures administered alongside the SL scale to the same cohort. Drost (2011) suggests the use of criterion-based validity steps for this goal. Further validation studies on the SL scale should consider this.
With its concise and to-the-point structure, Liden et al.’s (2015) SL scale is well suited to timely data collection. It is neither tedious nor time-consuming when collecting data from large groups of people. We experienced this in the present study while reaching the desired group of participants from diverse countries. The scale reflects overall servant leadership behaviors under one factor. That being said, other points should be considered as well. Servant leadership scales by different authors are numerous, and they focus on different aspects of servant leadership qualities ranging from “humility” to “building community.” Thus, researchers should know which scale they need to use for their specific goals. Researchers should also bear in mind that generations keep changing and may have different ideas on effective leadership skills that fall under the umbrella of servant leadership. With the influx of new generations (e.g., digital natives), new scales that measure the understanding of servant leadership in new generations should be developed, since the understanding of servant leadership for the “Silent Generation” is clearly different from that of Generation Z.
