Abstract
Research on school administrators for the period 1967-80 reminds one of the dictum: “The more things change, the more they remain the same.” The state-of-the-art is scarcely different from what seemed to be in place 15 years ago. Although researchers apparently show a greater interest in outcomes than was the case in the earlier period, they continue their excessive reliance on survey designs, questionnaires of dubious reliability and validity, and relatively simplistic types of statistical analysis. Moreover these researchers persist in treating research problems in an ad hoc rather than a programmatic fashion. (Bridges, 1982, pp. 24-25)
Thirty years ago, Bridges (1982) highlighted persisting problems with the quality of research in educational leadership and management. In response to this and other critiques, researchers proposed stronger, research-based conceptual frameworks that sought to describe dimensions of the principal’s instructional leadership role (e.g., Bossert, Dwyer, Rowan, & Lee, 1982; Hallinger & Murphy, 1985; Leithwood & Stager, 1989; Ogawa & Bossert, 1995; Pitner, 1988). This conceptual progress was accompanied by the development of new research instruments (e.g., Hallinger & Murphy, 1985; Leithwood & Montgomery, 1986; Leithwood & Steinbach, 1991; Villanova, Gauthier, Proctor, & Shoemaker, 1981).
These research tools subsequently aided scholars in gaining a better understanding of the nature of the relationship between leadership and learning (e.g., Dwyer, 1986; Hallinger, Bickman, & Davis, 1996; Hallinger & Heck, 1996; Hallinger & Murphy, 1986; Heck, Larson, & Marcoulides, 1990; van de Grift, 1990). As a case in point, Hallinger (1983; Hallinger & Murphy, 1985) developed the Principal Instructional Management Rating Scale (PIMRS), a research tool that has been used in over 200 empirical studies conducted in 26 different countries (Hallinger, 2011b; Hallinger & Wang, 2013). 1
A recent review of research found that the PIMRS continues to be an instrument of choice among scholars studying principal leadership (Hallinger, 2011b). The same article further suggested that the PIMRS has maintained a consistent record of yielding reliable and valid data. However, the earlier (Hallinger, 2011b) review was oriented toward a more general examination of research methodologies used in conjunction with the PIMRS and provided relatively few details concerning the reliability results obtained in this body of PIMRS studies. Given increasing global interest in instructional leadership and the continued widespread use of the PIMRS, the field will benefit from an updated and detailed assessment of its measurement properties. We note that there has not been any systematic attempt of this kind since the initial published report in the Elementary School Journal in 1985 (Hallinger & Murphy, 1985).
This report presents a meta-analysis of reliability results derived from 43 independently conducted studies that employed the PIMRS as a research tool over the past three decades. The study addresses several research questions:
Research Question 1: Does the PIMRS provide reliable data for the purposes of assessing principal instructional leadership in research and practice?
Research Question 2: How do reliability estimates differ based on the role group of the respondents (i.e., teachers or principals)?
Research Question 3: Does the PIMRS yield reliable data when used in rating principals working at different school levels (i.e., primary, middle, secondary) and cultural contexts (i.e., nations)?
The relevance of this information was inadvertently confirmed by a recent publication that critiqued the reliability of several leadership scales (Condon & Matthews, 2010). However, the information included on the PIMRS in the report was incomplete and out of date. This further highlights the need for a comprehensive, up-to-date description of the reliability of the PIMRS. This will aid scholars and practitioners in choosing among instruments used for assessing principal instructional leadership and in making methodological choices when using the PIMRS.
Theoretical Perspective
In this section of the article, we provide background information on the conceptual framework that informed development of the PIMRS. Then we discuss the development of the PIMRS instrument. This will set the stage for presentation of the methodology and results.
The PIMRS Framework
The PIMRS (Hallinger, 1982/1990) is grounded in a conceptual framework that proposes three dimensions in the instructional leadership role: Defines the School Mission, Manages the Instructional Program, and Develops a Positive School Learning Climate (Hallinger & Murphy, 1985; see Figure 1). These three dimensions are delineated into 10 instructional leadership functions. Two functions, frames the school’s goals and communicates the school’s goals, comprise the dimension Defines the School Mission. These functions concern the principal’s role in working with staff to ensure that the school has a clear mission and that the mission is focused on academic progress of its students. While this dimension does not assume that the principal defines the school’s mission alone, it does propose that the principal is responsible for ensuring that such a mission exists and for communicating it widely to staff.

Figure 1. Principal Instructional Management Rating Scale (PIMRS) conceptual framework.
The second dimension is Manages the Instructional Program. This incorporates three leadership functions: supervises and evaluates instruction, coordinates the curriculum, and monitors student progress. This dimension focuses on the role of the principal in “managing the technical core” of the school (Hallinger & Murphy, 1985). In today’s schools, it is clear that the principal is not the only person involved in monitoring and developing the school’s instructional program (Hallinger, 2011a; Leithwood, Day, Sammons, Harris, & Hopkins, 2006). At the same time, however, coordination and control of the academic program of the school remains a key leadership responsibility of the principal, even when tasks are delegated or shared.
The third dimension, Develops a Positive School Learning Climate, includes several functions: protects instructional time, promotes professional development, maintains high visibility, provides incentives for teachers, and provides incentives for learning. This dimension is broader in scope than the other dimensions and overlaps with factors generally associated with transformational leadership (e.g., Hallinger, 2003; Leithwood et al., 2006; Marks & Printy, 2003). This dimension conforms to the notion that successful schools create an “academic press” through the development of high standards and expectations and a culture that fosters and rewards capacity development and continuous learning (Hallinger & Murphy, 1985).
Development of the PIMRS Instrument
The PIMRS instrument
The original form of the PIMRS (Hallinger, 1982/1990) contained 11 subscales and 72 “behaviorally anchored” items. Subsequent revision reduced the instrument to 10 subscales and 50 items (Hallinger, 1982/1990, 1983). 2 For each item, the rater assesses the frequency with which the principal enacts a behavior or practice associated with the particular instructional leadership function. Each item is rated on a 5-point Likert scale ranging from (1) almost never to (5) almost always (see Figure 2). The instrument is scored by calculating the mean of the items that comprise each subscale. This results in a data-based profile of principal performance.
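The scoring rule described above (a mean per subscale) can be sketched as follows. This is a minimal illustration, not the instrument's official scoring routine: the function name `score_subscales` and the assumption that items are stored in subscale order, five per subscale, are hypothetical.

```python
from statistics import mean

# Assumption for illustration: items are stored in subscale order with an
# equal number of items per subscale (50 items / 10 subscales = 5 each).
ITEMS_PER_SUBSCALE = 5


def score_subscales(responses, items_per_subscale=ITEMS_PER_SUBSCALE):
    """Return one mean rating per consecutive block of items."""
    if not responses or len(responses) % items_per_subscale != 0:
        raise ValueError("response count must be a positive multiple of the block size")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("ratings must lie on the 1-5 scale")
    return [
        mean(responses[start:start + items_per_subscale])
        for start in range(0, len(responses), items_per_subscale)
    ]
```

Applied to one rater's 50 responses, the function returns 10 subscale means, which together form the profile of principal performance.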

Figure 2. Sample Principal Instructional Management Rating Scale (PIMRS) rating subscale: teacher form.
Three parallel forms of the instrument have been developed and tested for completion by the principal, teachers, or a supervisor. The items that comprise each form are identical; only the stems change to reflect the differing perspectives of the role groups. Hallinger (2011b) noted that researchers consistently report significant differences between teacher and principal perceptions of the principal’s instructional leadership. Moreover, principal self-report scores tend to be substantially higher than those obtained from teachers (e.g., Benoit, 1990; Brown, 1991; Corkill, 1994; Dennis, 2009; Haack, 1991; Haasl, 1989; Hallinger, 1983; Henderson, 2007; Krug, 1986; Mallory, 2002; Marshall, 2005; Meek, 1999; Meyer, 1990; O’Day, 1984; O’Donnell, 2002; Reid, 1989; Shatzer, 2009; S. Smith, 2007; Stevens, 1996; Vinson, 1997).
Notably these “role set” (Merton, 1957) differences in PIMRS ratings obtained from teachers and their principals extend to contexts other than the United States. Empirical comparisons have yielded a similar pattern of results in Thailand (Hallinger & Lee, in press; Poovatanikul, 1993; Ratchaneeladdajit, 1997; Taraseina, 1993), Guam (San Nicolas, 2003), the Philippines (Saavedra, 1987; Salvador, 1999; Yogere, 1996), the Maldives (Wafir, 2011), Hong Kong (Chan, 1993), and Taiwan (Chi, 1997; Tang, 1997; Yang, 1996). In addition, despite differences in the magnitude of ratings obtained by the two role groups, researchers often report a similar pattern in their ratings on the various subscales that comprise the PIMRS.
Five studies have been conducted that included extensive assessments of the reliability and validity of the PIMRS. Three were conducted in the United States (Hallinger, 1983; Howe, 1995; Jones, 1987), one in Thailand (Taraseina, 1993), and one in Cameroon (Wotany, 1999). Hallinger’s (1983) study focused on the elementary school level and the others on secondary schools. Numerous additional studies have reported the results of reliability tests (see Tables 1 and 2). In the following sections, we provide an overview of the approaches employed in assessing the reliability and validity of the PIMRS instrument.
Table 1. Principal Data Sources.
Table 2. Teacher Data Sources.
Assessing the reliability of the PIMRS
Lang and Heiss (1998) defined reliability as the consistency with which an instrument yields the same or similar responses across settings and time. Several different approaches may be employed for assessing the reliability of a test instrument: test-retest, parallel forms, and internal consistency (Kerlinger, 1966). Studies employing the PIMRS have relied exclusively on measures of internal consistency. Internal consistency refers to the degree to which items that have been grouped together conceptually as subscales correlate with each other (Kerlinger, 1966).
PIMRS studies have varied with respect to the form of the scale that was used (i.e., teacher or principal), the level of the scale on which reliability was calculated (i.e., whole scale, dimension, function), as well as the school level (i.e., primary, middle, secondary) and cultural context in which the study was conducted.
Gathering data with the PIMRS directly from principals represents a type of self-assessment. The resulting score reflects a latent trait of the individual subject (Kerlinger, 1966). This is a typical case faced in measurement for which researchers often employ Cronbach’s (1951) alpha test of internal consistency. In our data set, all PIMRS studies that reported reliability results based on principal self-report data employed Cronbach’s alpha (see Tables 1 and 3).
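For readers unfamiliar with the statistic, coefficient alpha can be computed from item-level responses as sketched below. The function name `cronbach_alpha` and the list-of-lists data layout are assumptions made for illustration; the formula is the standard one, alpha = J/(J - 1) × (1 - sum of item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of J item ratings."""
    if len(rows) < 2:
        raise ValueError("at least two respondents are required")
    n_items = len(rows[0])
    if n_items < 2 or any(len(row) != n_items for row in rows):
        raise ValueError("every respondent needs the same number (>= 2) of items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(n_items)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)
```

The same function can be applied at the whole-scale, dimension, or function level by passing only the relevant item columns.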
Table 3. Meta-Analysis of Reliability Estimates Derived From Principal Respondent Studies.
Note. These analyses include all data sets comprised of 15 or more principals. All calculations are based on Cronbach’s alpha test of internal consistency. “Extracted” refers to alpha coefficients extracted from research reports. “Raw” refers to our analysis of actual data sets obtained from the authors of the studies.
To date, researchers have employed two different statistical tests to assess the scale’s internal consistency with data collected from teachers. Although the majority have applied Cronbach’s alpha to data obtained with the PIMRS Teacher Form, several employed Ebel’s (1951) test of reliability. For present purposes, we simply note that our data set contains reliability results that were obtained through a variety of methods (see Tables 1 and 2). While this presented challenges for our attempt to synthesize findings across studies, the diversity of approaches used to formulate and express the scale’s reliability can also be viewed as a strength of the current study. We will discuss these points in greater detail in the methods section of the article.
Assessing the validity of the PIMRS
An appraisal instrument must not only provide consistent (i.e., reliable) data, but also measure the construct as conceptualized by the researcher (i.e., validity; Lang & Heiss, 1998). In the validation studies cited earlier, four criteria were used to judge the validity of the PIMRS:
Content validity: Items making up each subscale of the instrument must be relevant to the critical requirements of the job; each item assigned to a subscale must achieve a minimum average agreement of .80 among a group of raters.
Discriminant validity: The subscales should discriminate among principals; that is, the variance in principal ratings within schools must be less than the variance in ratings of principals between schools.
Construct validity (subscale intercorrelation): Groups of items within a subscale must intercorrelate more strongly with each other than with other subscales.
Construct validity (documentary support): An analysis of school documents related to the instructional management behavior of the principals should yield profiles of the principals’ instructional management performance similar to those obtained from teachers on the questionnaire.
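The discriminant validity criterion listed above compares within-school and between-school variance in ratings. One common way to operationalize such a comparison is a one-way analysis of variance with schools as groups; the sketch below assumes that approach, and the function name and dictionary layout are hypothetical rather than taken from the validation studies.

```python
from statistics import mean


def between_within_mean_squares(ratings_by_school):
    """Return (MS between schools, MS within schools) for a dict of rating lists."""
    groups = [list(g) for g in ratings_by_school.values()]
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    if n_groups < 2 or n_total <= n_groups:
        raise ValueError("need at least two schools and some within-school replication")
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-school sum of squares: school means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-school sum of squares: teacher ratings around their school mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return ss_between / (n_groups - 1), ss_within / (n_total - n_groups)
```

Under this reading, a subscale discriminates among principals when the first value clearly exceeds the second.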
The application of these methods to development of the PIMRS yielded an instrument that met common standards of internal validity in the original validation study conducted by Hallinger (1983). Studies by Jones (1987), Howe (1995), and Taraseina (1993) largely replicated the original validation study’s results at the secondary school level in studies conducted in Canada, the United States, and Thailand, respectively. We note, however, that no formal studies have assessed the external validity of the scale. Given space limitations, we do not offer further description of validation procedures or results in this article. Readers interested in such detail can consult the original research reports (Hallinger, 1983; Hallinger & Murphy, 1985; Howe, 1995; Jones, 1987; Taraseina, 1993; Wotany, 1999) as well as a PIMRS Technical Report (Hallinger & Wang, 2013) which describes our recent effort to revalidate the scale. 3
Method
In this section of the article, we clarify the nature of the data sources for this study, the types of reliability tests that have been applied to PIMRS data, and our approach to integrating reliability estimates obtained across the studies.
Data Sources
We were fortunate to gain access to secondary data contained in 43 previously conducted studies that had employed the PIMRS for data collection. The data consisted of two types. The first was statistical information contained in published research reports and doctoral dissertations. The second consisted of raw data sets obtained directly from researchers who had used the scale. We present these data sources separately since they presented different challenges and required different treatment in data analysis.
Extracted data
The research began with a literature search that sought to identify the full set of studies that had used the PIMRS. Our search found 15 master’s theses, 151 doctoral dissertations, and 11 published journal articles. 4 The 177 studies 5 were reviewed in order to identify those that had conducted reliability analyses. We located 28 studies that reported reliability results (see Tables 1 and 2).
We proceeded to extract information from these studies. The data included the author(s), year, national context, school level(s), respondent group(s), sample size(s), reliability test, and reliability results. These data were entered into a table in MS Excel.
These research reports varied widely in terms of the sample size, role group of respondents, method of calculating reliability, and levels of the scale for which reliability was reported. As in all meta-analyses, variability in the design and methodology of the component studies presented challenges in terms of our goal of quantitative integration of the data. If, for example, the reader compares the upper and lower sections of Table 5, the limitations of relying on data extracted from published reports quickly become apparent. Lack of comparable data would have impeded our attempt to develop a comprehensive assessment of the PIMRS’s reliability. For this reason, we sought to obtain original item-level data for our calculations.
Raw data sets
Given the frequent continuing use of the PIMRS instrument among researchers, 6 it appeared feasible to try to obtain raw data from recently completed studies. Raw item-level data offer several advantages for meta-analysis. First, they allow researchers to use a consistent test for calculating reliability estimates. Second, they afford the opportunity to compare the pattern of reliability estimates obtained from different methods of calculation. Third, access to item-level data enables calculation of a more comprehensive set of reliability estimates (e.g., full scale, three dimensions, and 10 functions). In contrast, the reliability estimates extracted from research reports typically focused only on one, or at most two, of the three scale levels. In sum, access to raw data enables researchers to exploit the full power of meta-analytic techniques, resulting in a more robust synthesis of results.
We subsequently contacted authors of PIMRS studies that had been completed since 2008 with the goal of gaining access to original item-level PIMRS data. We were able to obtain access to 25 original data sets comprised of PIMRS responses from 19 different researchers. 7 When these were combined with the extracted data, we had 52 data sets derived from 43 independent studies. The data sets were separated into two groups, based upon the data source, either principals (see Table 1) or teachers (see Table 2). We describe these in turn.
Principal self-report data
Principals represented the data source in 19 of the studies (see Table 1). As shown in Table 1, this consisted of 13 raw data sets and 6 data sets comprised of extracted data. We eliminated 3 studies that had surveyed fewer than 15 principals from our analyses (i.e., Carr, 2012; Gjelaj Merturi, 2010; Shafeeu, 2011) on the grounds that this was the minimum level needed in order to obtain sound results. This left us with 16 studies in the principal respondent data set. Four had collected data in East Asia and 12 in the United States (see Table 3). The studies were distributed across all school levels (i.e., primary, middle, secondary schools). 8 The sample size of the studies ranged from 15 to 1,195 principals, with a mean of 157 principals per study and a total sample of 2,508 principals.
Given the large size and uniqueness of the Hallinger and Lee (in press) data set from Thailand, elaboration on its characteristics seems appropriate. Unlike the other studies, this research employed a shortened version of the principal form of the PIMRS comprised of 20 items. This short form of the instrument was used to collect data from a nationally representative sample of 1,195 principals in Thailand. In order to prepare these data for meta-analysis, we applied a set of procedures to transform the 20-item scale so that the reliability results would approximate those of the 50-item test.
We employed the Spearman-Brown test to correlate the items included in this study with results obtained from the raw data sets that had employed the full PIMRS comprised of 50 items. This procedure resulted in revised reliability estimates for both the full PIMRS as well as its three dimensions. The distribution of items in the short form was, however, insufficient to obtain an accurate estimation of reliability for the 10 leadership functions.
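As background, the Spearman-Brown prophecy formula projects the reliability of a test whose length is changed by a factor k. The sketch below shows the textbook formula only; whether the adjustment described above was carried out in exactly this form is an assumption, and the function name and the example value of .80 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Illustrative values only: projecting a 20-item form to 50 items (k = 50 / 20).
projected = spearman_brown(0.80, 50 / 20)
```

The formula also illustrates why longer scales tend to yield higher reliability coefficients: for any reliability between 0 and 1, a length factor above 1 raises the projected value.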
Teacher data sources
The teacher respondent data set consisted of data gathered in 33 studies conducted between 1983 and 2012 (see Table 2). The 33 studies were comprised of 11 raw data sets and 22 extracted data sets (see Table 2). The studies came from six different countries: United States (25), Canada (2), China (1), Thailand (1), Malaysia (2), the Maldives (1), as well as one study that analyzed a cross-cultural data set (Hart, 2006). The sample size of these teacher data sets ranged from 95 to 1,610, with a mean of 359 teachers per study and a total sample of 10,080 teachers. Respondents included teachers in primary, middle, and secondary schools.
Data Analysis
In this section we begin by describing the procedures for analyzing the reliability of the principal and teacher data. Then we discuss the meta-analytic procedures used to mathematically integrate results obtained across the studies. Finally, we discuss the standards for assessing the results of reliability analyses.
Procedures for assessing the reliability of the PIMRS Principal Form
As noted earlier, it was standard procedure for the researchers to employ Cronbach’s alpha in testing the reliability of the PIMRS Principal Form. First, we extracted alpha reliability estimates from the research reports. Then we calculated Cronbach’s alpha coefficients for the raw data sets. Finally, we combined the full set of alpha reliability coefficients into a single MS Excel table in preparation for meta-analysis.
Procedures for assessing reliability of the PIMRS Teacher Form
For reasons suggested earlier, synthesizing reliability estimates from the data sets comprised of teacher responses was less straightforward. Researchers applied different reliability tests to these data sets (e.g., see Tables 2 and 5). Although most researchers employed Cronbach’s test, some researchers suggested that this application of coefficient alpha violates a fundamental assumption of the test (e.g., see Jones, 1987). When analyzing a PIMRS data set obtained from teacher respondents, Cronbach’s test treats each teacher’s response independently, as if each teacher were rating a different principal. In reality, however, teachers are nested within schools, with each school’s teachers rating their own principal. In this case, the reliability estimates of internal consistency should be based on the combined ratings of teachers grouped together by their schools.
Thus, several researchers (e.g., Howe, 1995; Jones, 1987; Leitner, 1989; Taraseina, 1993) employed Ebel’s (1951) test, which yields a reliability rating based on the aggregated responses of teachers within their respective schools. The reliability result is therefore treated as a feature of the school (i.e., the principal). Ebel’s formula is:
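A commonly cited form of Ebel’s (1951) intraclass formula for the reliability of averaged ratings is sketched here. It is offered as an assumption about the variant intended rather than as the exact expression used in the studies cited; the symbols MS_b (mean square between schools) and MS_w (mean square within schools) are illustrative labels.

```latex
% Assumed form: reliability of the mean rating across raters within a school.
\[
  r_{kk} = \frac{MS_{b} - MS_{w}}{MS_{b}}
\]
```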
When examining the various studies that had collected teacher data with the PIMRS, we found that 18 researchers had derived reliability estimates using Cronbach’s alpha and four from Ebel’s test. While both tests offer estimates of internal consistency, the coefficients cannot be combined through meta-analysis. Therefore, these data were grouped separately for subsequent meta-analyses (see Tables 5 and 6).
When we turned our attention to the raw data sets, we faced a decision as to which test to use. Although we initially considered Ebel’s (1951) test superior to Cronbach’s alpha for analyzing reliability of the teacher data, this approach still had two problems (Schmitt, 1996). First, Ebel’s formula assumes that teachers are randomly selected from the same population. This assumption is not, however, tenable because teachers are nested within schools and each school’s teachers are rating their particular principal. The other problem is that item-level scores are “invisible” (i.e., ignored) in the Ebel formula, which only employs the total score from a teacher on the relevant subscale.
With these limitations in mind, we employed a formula based on generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Kane, Gilmore, & Crooks, 1976) for testing the reliability of raw data sets comprised of teacher responses. This approach not only takes into account the hierarchical structure of the data but also utilizes item-level scores. Thus, we assert that a more accurate way of calculating the reliability of scores obtained from teacher respondents is represented in the following formula.
where
where MSp is the mean square of principal, MSpxt is the mean square of principal by teacher, MSpxi is the mean square of principal by item, and MSe is the mean square of error (Kane et al., 1976).
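Given the four mean squares just defined, one expression consistent with a design in which teachers are nested within principals and crossed with items can be sketched as follows. This is a derivation under stated assumptions (balanced data, teachers and items both treated as random, n_t teachers per principal, and n_i items), not necessarily the exact formula applied in the analysis.

```latex
% Assumed design: teachers nested within principals, crossed with items.
% Estimated variance component for principals:
\[
  \hat{\sigma}^{2}_{p} =
    \frac{MS_{p} - MS_{p \times t} - MS_{p \times i} + MS_{e}}{n_{t}\, n_{i}}
\]
% Generalizability coefficient for a school-level mean over teachers and items:
\[
  E\hat{\rho}^{2} =
    \frac{\hat{\sigma}^{2}_{p}}
         {\hat{\sigma}^{2}_{p}
          + \hat{\sigma}^{2}_{t:p}/n_{t}
          + \hat{\sigma}^{2}_{pi}/n_{i}
          + \hat{\sigma}^{2}_{e}/(n_{t} n_{i})}
  = \frac{MS_{p} - MS_{p \times t} - MS_{p \times i} + MS_{e}}{MS_{p}}
\]
```

The second equality follows because, under these assumptions, the denominator equals MS_p divided by n_t n_i.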
We will refer to this approach as a “generalizability theory” or “gen theory” test. 9 We used this formula to compute the reliability in studies where we had raw, item-level teacher response data. It was not, however, possible to reanalyze data extracted from published reports with this method due to lack of item-level information. Thus, we were left with three separate groups of teacher respondent studies, each based on reliability estimates obtained through different tests. We further note that although each test produced a set of reliability coefficients, the results of the three tests could not be combined in subsequent meta-analytic procedures. We report both the reliability results of individual studies and the summary statistics obtained through meta-analysis in separate portions of Tables 5 and 6.
Meta-analytic procedures
Meta-analysis refers to statistical techniques that integrate findings obtained from a body of research comprised of independent studies (Glass, 1977). Meta-analysis is frequently applied as a means of understanding the trend in substantive findings across studies (Glass, 1977). The meta-analysis of leadership effects studies in education conducted by Robinson, Lloyd, and Rowe (2008) is an example of this species of meta-analysis.
However, Rodriguez and Maeda (2006) also applied meta-analytic methods to the generalization of reliability coefficients. This involves mathematically synthesizing the reliability coefficients obtained from different studies weighted by their sample sizes. The resulting generalization reliability estimate is more accurate than any single reliability coefficient (Rodriguez & Maeda, 2006).
Take the alpha coefficient as an example. It is assumed that each of K studies (k = 1, …, K) provides an estimate of the population alpha coefficient. Let α k be the alpha coefficient, n k be sample size, and J k be the number of items in study k. The alpha coefficient should be transformed as:
with error variance:
Then, the weighted mean transformed alpha is:
with error variance of
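For reference, the cube-root transformation attributed to Hakstian and Whalen (1976), as it is commonly written in the reliability generalization literature, takes the form below. It is given as a sketch under the assumption that this is the transformation intended; the inverse-variance weights w_k and the final back-transformation are likewise assumptions, while the symbols α_k, n_k, and J_k follow the definitions in the text.

```latex
\begin{align*}
  T_k &= (1 - \alpha_k)^{1/3}
    && \text{transformed coefficient for study } k \\
  v_k &= \frac{18\, J_k\, (n_k - 1)\, (1 - \alpha_k)^{2/3}}
              {(J_k - 1)\,(9 n_k - 11)^{2}}
    && \text{error variance of } T_k \\
  \bar{T} &= \frac{\sum_{k=1}^{K} w_k T_k}{\sum_{k=1}^{K} w_k},
    \qquad w_k = \frac{1}{v_k}
    && \text{weighted mean transformed alpha} \\
  v_{\bar{T}} &= \frac{1}{\sum_{k=1}^{K} w_k}
    && \text{error variance of } \bar{T} \\
  \hat{\alpha} &= 1 - \bar{T}^{3}
    && \text{back-transformed summary estimate}
\end{align*}
```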
In sum, this approach to meta-analytic transformation of data provides a “weighted average” of the reliability estimates derived from a larger set of studies (Hakstian & Whalen, 1976). The weighted average adjusts for the sample size of the particular studies, giving greater weight to studies with larger samples. As noted previously, we categorized the data sets for the studies that contained teacher responses (see Tables 5 and 6) in order to maintain the integrity of the results of different reliability tests.
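The weighted synthesis can be illustrated with a short sketch. It assumes the cube-root transformation T = (1 - alpha)^(1/3) with inverse-variance weights, as commonly described for Hakstian and Whalen’s (1976) method; the function name `pooled_alpha` and the tuple layout are hypothetical, and the code is not the authors’ analysis script.

```python
def pooled_alpha(studies):
    """Pool alpha coefficients from (alpha, sample_size, n_items) tuples.

    Returns (pooled alpha, error variance of the mean transformed alpha).
    Assumes T = (1 - alpha)^(1/3) with inverse-variance weights.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for alpha, n, n_items in studies:
        if not alpha < 1 or n < 2 or n_items < 2:
            raise ValueError("need alpha < 1, n >= 2, and at least 2 items")
        t = (1 - alpha) ** (1 / 3)
        # Assumed sampling variance of the transformed coefficient.
        v = (18 * n_items * (n - 1) * (1 - alpha) ** (2 / 3)
             / ((n_items - 1) * (9 * n - 11) ** 2))
        weight = 1 / v
        weighted_sum += weight * t
        weight_total += weight
    if weight_total == 0:
        raise ValueError("no studies supplied")
    t_bar = weighted_sum / weight_total
    # Back-transform the weighted mean to the alpha metric.
    return 1 - t_bar ** 3, 1 / weight_total
```

Because the weights grow with sample size, a study with 500 respondents pulls the pooled estimate toward its own coefficient far more strongly than a study with 50.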
Standards for assessing scale reliability
Contrary to common belief, there is no single standard for assessing the reliability of a research instrument (Kerlinger, 1966; Latham & Wexley, 1981; Smith & Kendall, 1963). The appropriate standard should be based on the intended use of the data. In general, instruments used for research and those used for job performance assessment (e.g., principal evaluation) are evaluated according to different standards (Latham & Wexley, 1981). For example, in situations where important decisions about the employment of individuals are made on the basis of test scores, Latham and Wexley (1981, p. 66) proposed a minimum standard of .80 for tests of reliability. Nunnally (1978) suggested that performance assessment instruments should achieve a reliability standard of at least .90, or even .95, when the data will be used for personnel decisions.
A lower standard is, however, generally applied when data are employed for research purposes. For example, Hair and colleagues suggested a minimum acceptable range of .60 to .70 when data will be used for research (Hair, Black, Babin, Anderson, & Tatham, 1998). Other scholars (e.g., Fraenkel & Wallen, 1990; Kerlinger, 1966; Lang & Heiss, 1998; Nunnally, 1978) have recommended that research instruments should meet a minimum reliability standard of .70. Nunnally (1978, p. 245) even went so far as to assert that efforts to achieve reliability levels beyond .80 are a waste of resources when instruments are intended for use in basic research.
It is also useful to note Gay’s (1992) observation that reliability is influenced by the length of the instrument. Longer scales tend to yield greater reliability. Thus, the subscales that comprise an instrument typically yield lower reliability coefficients than the full scale. This is relevant for our study of the reliability of the PIMRS, which has been analyzed in terms of the full scale as well as its three dimensions and 10 functions.
These perspectives on standards of reliability are relevant to the current study. The PIMRS instrument has been used extensively as a tool for data collection in research, principal needs assessment, and principal evaluation. This discussion suggests that users should not evaluate the reliability of the instrument based on a single standard. Rather, users should select the instrument, the form of the instrument (i.e., self-report, teacher, or supervisor version), and the scale level based on the intended use of the data.
Results
This study was designed to assess the reliability of the Principal and Teacher Forms of the PIMRS under a variety of different conditions. We employed meta-analysis with a combination of original and extracted secondary data contained in 52 data sets obtained from 43 independent studies. We organize our presentation of the results first for the Principal Form and then the Teacher Form of the PIMRS.
Reliability of the PIMRS Principal Form
The results of our efforts to synthesize the reliability results for the PIMRS Principal Form are presented in Table 3. The combined sample used for the meta-analysis consisted of 2,508 principals (see Table 3). The whole scale alpha reliability estimate was .96. Reliability estimates for the three dimensions were .88 for Defines the School Mission, .91 for Manages the Instructional Program, and .93 for Develops a Positive School Learning Climate. 10 These all meet a sufficiently high standard of reliability for use as a component in systems of principal evaluation.
The summary coefficients, rho-hat, for the 10 instructional leadership functions were consistently and substantially lower than estimates for the full scale and three dimensions. Estimates ranged from a low of .74 on Creates Incentives for Teachers to a high of .85 on Frames the School’s Goals. These results suggest that the function-level subscales meet a standard of reliability sufficient for use in research and principal needs assessment, but not in principal evaluation.
Numerous scholars and practitioners have noted that features of the school context may shape or moderate the leadership behavior of principals (e.g., Belchetz & Leithwood, 2007; Goldring, Huff, May, & Camburn, 2008; Hallinger, 2011a). Differences in structural complexity and size create different challenges for principals who exercise instructional leadership in primary and secondary schools (Cuban, 1988). Similarly, the cultural context of the school shapes both the formal job description and normative expectations held by teachers and parents for the principal (Hallinger & Leithwood, 1996; Hallinger & Lee, in press).
We took advantage of the breadth of our data sources to analyze the reliability of the PIMRS instrument across different school levels and cultural contexts in which the scale had been used. This allowed us to gain insight into how the scale responds under different conditions. The detailed results of these meta-analyses are displayed in Table 4.
Table 4. Meta-Analysis of Principal Reliability Estimates by Cultural Context and School Level.
Although there are some minor variations, the pattern of reliability results did not vary significantly either across school levels or between the two cultural contexts included in this analysis (i.e., United States and East Asia). At the same time, however, the sample size and number of studies both vary widely across the cells in Table 4. Furthermore, we note that it is not possible to generalize findings from two societies to all of “East Asia.” Therefore, the results presented for East Asia should be interpreted as preliminary or indicative rather than conclusive.
Reliability of the PIMRS Teacher Form
For the PIMRS Teacher Form, the results of the meta-analysis are organized in terms of the three different reliability tests. The findings shown in Table 5 are based on the synthesis of data sets comprising 8,153 respondents, with an average of 19.6 teachers per school. The 18 data sets that employed Cronbach’s alpha comprised 6,465 teachers, with an average sample size of 22 teachers per school. The four data sets that employed Ebel’s test included 1,984 teachers, with an average sample size of 22 teachers per school. The 11 data sets containing gen theory coefficients included 3,615 teachers, with an average sample size of 11 teachers per school.
Table 5. Meta-Analysis of Teacher Reliability Estimates.
Although we were unable to mathematically combine the results of the three different reliability tests, it was of interest to understand how the magnitude of reliability coefficients differed depending on the test that was used. In order to gain insight into this issue, we proceeded to apply the Cronbach and gen theory reliability tests to our raw data. The results indicated that the gen theory formula tended to yield slightly higher coefficients than the results obtained from the Cronbach tests when applied to the same data (not tabled). We suggest that this is because of the capacity of the gen theory test to derive the reliability from teachers grouped by school and from item level rather than averaged responses. We consider the gen theory test to be the most accurate approach to representing the reliability of the data. Moreover, these results provide a useful benchmark for interpreting the reliability results from the other tests displayed in Tables 5 and 6.
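To illustrate what it means to derive reliability from teachers grouped by school, the sketch below computes a one-way ANOVA intraclass coefficient for the mean teacher rating per school. This is a deliberately simplified stand-in, not the gen theory formula used in this study: it treats each teacher’s scale score as a single rating and ignores the item facet. The function name `school_mean_reliability` and the example ratings are hypothetical, and with unequal numbers of teachers per school the coefficient should be read as an approximation.

```python
import numpy as np

def school_mean_reliability(ratings_by_school):
    """One-way ANOVA reliability of the mean teacher rating per school.

    ratings_by_school: list of sequences, one per school, each holding the
    scale scores given by that school's teachers.
    Returns (MS_between - MS_within) / MS_between.
    """
    groups = [np.asarray(g, dtype=float) for g in ratings_by_school]
    n_schools = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()

    # Variation of school means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of teachers around their own school's mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (n_schools - 1)
    ms_within = ss_within / (n_total - n_schools)
    return (ms_between - ms_within) / ms_between

# Hypothetical scale scores from four teachers in each of three schools
example = [
    [4.2, 4.0, 4.4, 4.2],
    [3.0, 3.2, 2.8, 3.0],
    [3.6, 3.4, 3.8, 3.6],
]
reliability = school_mean_reliability(example)
```

The coefficient approaches 1 when teachers within a school agree closely while schools differ from one another, which is the property that makes a school-grouped analysis informative for teacher ratings of a principal.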
Table 6. Meta-Analysis of Teacher Reliability by Cultural Context and School Level.
Our meta-analysis based on results of the gen theory test yielded a full-scale reliability of .99, with coefficients of .97 (Defines the School Mission), .98 (Manages Instruction), and .98 (Develops School Learning Climate) for the three dimensions (see Table 5). The combined reliability estimates for the 10 instructional leadership functions ranged from a low of .90 (Maintains High Visibility) to a high of .95 on several functions. Despite these high reliability coefficients, we noted considerable variability in the actual coefficients reported study by study. We further observed that these estimates are consistently higher than estimates for the Principal Form of the PIMRS, and as expected, the results obtained from Cronbach’s test were slightly lower than from the gen theory test.
We followed the main analysis of the teacher data sets with analyses of reliability across school levels and cultural contexts. These results showed patterns similar to those of the main analyses reported previously. The results of the gen theory test displayed in Table 6 indicate reliability levels consistently above .90 across both cultural contexts and all school levels. Again, the reliability estimates were somewhat lower for the 10 functions than for the three dimensions and full scale, and results from the gen theory test were somewhat higher than those from Cronbach’s test.
Discussion
This report sought to provide a comprehensive assessment of the reliability of the PIMRS, one of the most widely used survey instruments in the field of educational leadership and management (Hallinger & Heck, 1996; Leithwood & Jantzi, 2005; Robinson et al., 2008). Our research questions revolved around understanding the reliability of the PIMRS with respect to its two most commonly used forms (i.e., teacher and principal forms), its component subscales, and its use in different organizational and cultural contexts. In this final section, we interpret the results and then discuss limitations and implications of the study.
Interpretation of the Findings
Citations of instrument reliability included in published empirical reports are frequently based on a single validation study, or occasionally on several replication studies. In the current report, we were able to gather reliability data from 43 independent studies. These yielded 52 data sets for meta-analysis. The diversity of the studies allowed us to develop a comprehensive, multidimensional view of the reliability of the PIMRS instrument and its use under different conditions.
Meta-analyses of reliability results were conducted separately for the Principal and Teacher Forms of the PIMRS. In each case, we provided analyses for the whole scale as well as its component subscales. The pattern of results was consistent with Gay’s (1992) observation that even in highly reliable instruments, subscales tend to yield lower reliability than the full scale.
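The tendency for shorter subscales to yield lower reliability can be illustrated with the Spearman-Brown formula, which predicts how reliability changes when a scale is lengthened or shortened with comparable items. The sketch below is illustrative only. The function name `spearman_brown` is hypothetical, and the length factors of one tenth and one third are assumptions that treat the 10 functions and three dimensions as equal-sized shares of the full scale.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when scale length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Whole-scale Principal Form estimate reported in the results
full_scale_alpha = 0.96

# Assumption: a function-level subscale holds one tenth of the items,
# and a dimension-level subscale holds one third of the items.
predicted_function = spearman_brown(full_scale_alpha, 1 / 10)
predicted_dimension = spearman_brown(full_scale_alpha, 1 / 3)
```

Under these assumptions the predicted values are roughly .71 for a function-level subscale and .89 for a dimension-level subscale, which is broadly in line with the ordering of the observed Principal Form estimates.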
We concluded that the Principal Form of the PIMRS demonstrated moderately high to very high reliability, depending on the scale level being analyzed. More specifically, alpha coefficients were .96 for the whole scale, between .88 and .93 for the three dimension-level subscales, and between .74 and .85 for the 10 function-level subscales. Although the number of studies and sample sizes varied considerably, we found no substantial variation in the pattern of the results for the PIMRS Principal Form across school levels or between the United States and East Asia. Results of meta-analysis for the PIMRS Teacher Form demonstrated a consistently higher level of reliability for all three levels of scale measurement (i.e., > .90) as well as across the specified organizational and cultural contexts.
We earlier discussed standards for assessing instrument reliability and noted the importance of aligning the selected standard with the use of the data. Here we conclude that the PIMRS Principal Form produces data that meet or exceed the standard needed in both research and principal needs assessment (i.e., the primary purposes for which this form of the scale is employed). Principal self-report data obtained from the PIMRS could also be employed in principal evaluation, but we recommend using measurements obtained for the full scale and three instructional leadership dimensions rather than the 10 functions.
The results indicated that the PIMRS Teacher Form meets applicable standards of reliability required for use in both personnel assessment and research. This conclusion applies to all three levels of the scale and to all school levels. The pattern of results was particularly strong for the United States. However, we consider the results for East Asia tentative, due to the limited coverage of regional societies and the small number of studies included in the sample.
Limitations and Implications
This report on the reliability of the PIMRS has relevance for both researchers and practitioners. While the data suggest that researchers should be confident in using the PIMRS for collecting data on principal instructional leadership, our analyses found slightly different patterns of results for the two forms of the scales. In addition, the reliability estimates for the full scale and three dimensions were somewhat higher than for the 10 leadership functions. Users will wish to take note of these differences in order to determine their relevance in relation to the purpose for which the data will be employed. We emphasize this last point because our investigation revealed the importance of linking the standard of reliability to the use of the data. Thus, researchers and district policymakers or practitioners could draw different conclusions about which forms and scale levels of the PIMRS are best suited to their purposes.
Another implication emerged from our examination of the full body of literature in preparation for the analyses included in this report. We noted that researchers were generally assiduous in citing, in the methods sections of their reports, the reliability estimates reported in the early PIMRS studies (i.e., Hallinger, 1983; Jones, 1987; Leitner, 1989; Taraseina, 1993). Yet, surprisingly, only about one third of documented users included a reliability analysis in their own study procedures.
In our view, establishing reliability is especially important when an instrument is being used in a setting that differs in meaningful ways from the original validation site(s). For example, we earlier noted that the PIMRS has been employed in 26 different countries. Yet to date we have been able to obtain reliability estimates for data collected in only 7 of those countries. We consider this an easily remedied, though potentially important, oversight. Since most of these studies were master’s and doctoral dissertations, we recommend that supervisors be more stringent in making this a requirement in future graduate research studies (see also Hallinger, 2011b).
We also noted that several different tests have been employed by researchers to examine the reliability of the PIMRS. Although it was not an explicit goal of the study, we gained insight into both the complexity of and the criteria for testing the reliability of the two different forms of the PIMRS. Accordingly, we recommend that future users of the PIMRS Principal Form employ Cronbach’s alpha and users of the Teacher Form use the gen theory formula included in this article. Although our comparison of results obtained from the Cronbach and the gen theory tests yielded relatively small differences, the latter represents a more accurate approach when employed with teacher data. This recommendation applies not only to testing the reliability of the PIMRS but also to other principal assessment scales. In our reading of the literature, we find that most researchers are using Cronbach’s test of reliability under conditions that exceed its capacity to yield accurate results.
We limited the scope of this report to a study of the reliability of the PIMRS. Nonetheless, we acknowledge that high reliability is a necessary but insufficient feature of a valid instrument. Although we noted that several studies have examined aspects of the PIMRS instrument’s validity (Hallinger, 1983; Jones, 1987; O’Day, 1984; Taraseina, 1993; Wotany, 1999), an updated assessment of its validity is warranted. The authors are in the process of conducting a revalidation study (see Hallinger & Wang, 2013).
Moreover, as noted earlier, previous validity studies have focused exclusively on the PIMRS instrument’s internal validity. In the future, researchers should also assess the external validity of the PIMRS. Tests could, for example, directly compare the results of the PIMRS with results obtained from other instruments and in relation to key organizational conditions (Kerlinger, 1966). Indeed, while examining the full set of PIMRS studies, we came across several that compared the PIMRS with instruments measuring other leadership constructs such as transformational leadership, emotional intelligence, cognitive styles, and leader authenticity (e.g., Dale, 2010; Greb, 2011; Meyer, 1990; Munroe, 2009; Sawyer, 1997; Shatzer, 2009; Tang, 1997). Although these were not explicitly framed as validation studies, findings from these studies could be employed to increase our understanding of the PIMRS’s external validity.
We began this report with a quotation from Bridges (1982) who had implied that progress in our field would remain stunted in the absence of stronger conceptual frameworks (see also Bridges, 1967) and research instruments. We noted that the PIMRS was developed in direct response to the expressed need for research instruments that could contribute to a program of research on how leadership impacts learning. The current study was conducted in order to ascertain empirically whether 30 years after its development the PIMRS continues to warrant a place in the researcher’s toolbox for addressing these high-priority lines of inquiry. Our findings clearly indicate that both forms of the PIMRS instrument meet high standards of reliability that are consistent with current usage. As a result, we believe that both researchers and practitioners should be able to make more informed choices concerning whether and how to use this instrument in assessing principal instructional leadership.
Acknowledgements
The authors also wish to acknowledge the invaluable assistance rendered by the scholars who contributed their raw data for the purposes of this study: Philip Adam, Craig Carson, Aaron Dale, Sam Fancera, Ellen Goldring, William Greb, Cheryl Long, Brendan Lyons, Myra Munroe, Wong Yau Nyau, Premavathy Ponnusamy, Ismail Shafeeu, Ryan Shatzer, Wang Sitong, and Tara Todd. The contributions of other researchers whose data we drew upon in this study are also appreciated and cited in the text. We also wish to acknowledge the helpful comments offered by Edwin M. Bridges and Ronald H. Heck on an earlier version of this manuscript.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors wish to acknowledge the funding support of the Research Grants Council (RGC) of Hong Kong through the General Research Fund (GRF 841711).
