Abstract
Validity refers to the extent to which a given study reflects the dominant values of a research community regarding what constitutes “good” research. Methodological texts on qualitative research provide a wide range of criteria and strategies to help qualitative researchers validate their studies. Given the importance of these strategies to establish a study as trustworthy and legitimate, the objective of this study was to understand the strategies commonly used by school psychology researchers in qualitative research. We therefore reviewed qualitative research articles published in seven school psychology journals between 2006 and 2021. We found 15 strategies authors used to enhance the validity of their research. We also found that a strategy could be enacted in many different ways by different researchers depending on the context. We conclude by recommending four ways in which qualitative researchers in school psychology can improve their validation practices.
School psychology, as a branch of psychology, has traditionally been a quantitative field grounded in the postpositivist paradigm. Although methodological diversity characterized the early days of psychology, the discipline became more conservative over time, prioritizing narrow hypothetico-deductive (quantitative) research in order to position itself as a “respectable” science similar to material sciences such as physics and chemistry (Danzinger, 1979; Harre, 2004). Non-quantitative forms of research were thus marginalized within the applied branches of psychology, including school psychology.
Nevertheless, qualitative research seems to be growing in prevalence. In school psychology specifically, two systematic studies (Powell et al., 2008; Sabnis et al., 2023) have found definite growth in the use of qualitative research in school psychology journals. Powell et al. (2008) found that the number of qualitative publications per year increased from one to six between 2001 and 2005. Sabnis et al. (2023) found that the number grew from 7 qualitative articles in 2006 to 22 articles in 2021. Although these figures remain minuscule in comparison to the overall volume of articles published in these journals, the growth is nevertheless a cause for cautious optimism, given that similar trends are seen in other fields (e.g., Hays et al., 2016).
As the prevalence of qualitative research in the social sciences increases, numerous organizations have published reporting standards in order to provide journal editors and authors more guidance for evaluating the quality of manuscripts. For example, a task force appointed by the American Psychological Association (APA) published reporting standards for qualitative studies (see Levitt et al., 2018). Similar efforts have been seen in other fields such as health (Tong et al., 2007), business (Gephart, 2004), and education (Reid & Gough, 2000). These reporting standards may vary from each other in some ways, given that each discipline has its own particular topography which shapes the research methods employed by its researchers. Nevertheless, the standards all seem to emphasize an explicit discussion on the strategies used by the authors to establish the validity of their study.
What is validity?
Validity refers to the extent to which a given study reflects the dominant values of a research community about what constitutes “good” research. Validity has a specific meaning in quantitative research that was shaped by the positivist belief in the existence of an objective reality independent of the human mind as well as a desire to capture this reality accurately while diminishing the noise of confounding variables, conceptual fuzziness, and probabilistic errors (Eketu, 2017). Social scientists thus use validity to refer to the “truth value” of a study's findings (i.e., in its ability to reveal “true” things about the world; Onwuegbuzie & Leech, 2007). How could this ability be determined? In other words, what criteria could be used by the scientific community to evaluate a study's truth value? And what strategies could be used by social scientists to ensure their study met these criteria?
Quantitative researchers Campbell and Stanley (1963) proposed two main criteria for determining the truth value of a study—namely by looking at the extent to which the findings were internally true and externally true. A study finding could be deemed internally true when it was purely a result of the experimenter's intentional manipulation of experimental conditions. A finding was deemed externally true when it was collected under conditions that resembled real-world conditions and populations of people. These two criteria have come to be known as internal and external validity among quantitative researchers. Although no study could be 100% internally and externally valid, Campbell and Stanley advocated for researchers to take steps to reduce threats to the internal and external validity of their studies. Following the popularization of these two criteria, the quantitative research community came to use many different strategies to bolster the internal and external validity, such as closely controlling the conditions of the data collection to minimize confounding variables, statistically controlling for extraneous variables, random assignment, and so forth.
At one point in time, it was thought possible to import Campbell and Stanley's (1963) postpositivist criteria of validity into qualitative research (e.g., Miles & Huberman, 1984). However, the underlying onto-epistemological assumptions of these criteria about the existence of objective reality and objective knowledge were incompatible with those of qualitative researchers who rejected the concept of a single objective social reality and viewed all “truth claims” as co-constructed (e.g., Bochner, 2000; Wolcott, 1990). This incompatibility led qualitative researchers to seek an alternative to validity in qualitative research (Hammersly, 1992; Leininger, 1994), starting with substituting the term “validity” itself with words such as trustworthiness (Lincoln & Guba, 1985), legitimation (Onwuegbuzie & Johnson, 2006), or methodological integrity (Levitt et al., 2018) because they believed these terms better captured what was being evaluated in a manuscript. Nevertheless, a comprehensive look at the contemporary qualitative literature suggests these terms are often used interchangeably. For this article, we will use the term “validity” as a signifier of quality since that is the term most familiar to school psychologists and still widely used in the qualitative research community.
Validity criteria
Just as quantitative researchers use Campbell and Stanley's (1963) two criteria for judging the validity of a study, qualitative researchers have sought to develop their own criteria of validity. Lincoln and Guba (1985) proposed four criteria for judging a study's trustworthiness, namely credibility, transferability, dependability, and confirmability. Smith and Glass (1987) proposed four criteria, namely logical validity, construct validity, internal validity, and external validity. Whereas these criteria were informed by the interpretivist paradigm, Lather (1993) proposed criteria informed by poststructural feminist theory, namely catalytic validity, rhizomatic validity, ironic validity, and paralogic validity. Some methodologists have sought to establish validity criteria for specific methodologies such as action research, discourse analysis, ethnography, and poststructural research (Cho & Trent, 2006).
Whereas the quantitative research community shares a strong consensus about the two criteria to evaluate a study's validity (internal validity and external validity), no such consensus exists among qualitative researchers. As qualitative approaches to inquiry expand, evolve, and hybridize, methodologists such as Tracy (2010) and Onwuegbuzie and Leech (2007) have attempted to bring together the various criteria proposed by qualitative researchers under a single big-tent framework. For example, Tracy (2010) synthesized validity literature into eight criteria (worthy topic, rich rigor, sincerity, credibility, resonance, significant contribution, ethics, and meaningful coherence). Similarly, Onwuegbuzie and Leech (2007) subsumed the various criteria found in literature under two broad categories which they called internal credibility and external credibility, which are further divided into sub-criteria.
Validity strategies
Qualitative researchers are expected to choose the criteria by which they wish their work to be judged (criteria well aligned with their methodology and underlying philosophy of inquiry) and then take steps to ensure their study meets those criteria. Validity strategies thus refer to the specific “means, practices, and methods through which” (Tracy, 2010, p. 840) qualitative researchers live up to their chosen criteria. Here it is important to mention that no study, however well-designed, can meet a validity criterion entirely. Validity criteria act as aspirational lodestars that a researcher can use to navigate their inquiry. Once a qualitative researcher determines the criteria on which they want their study to be evaluated, they can then take steps or enact procedures to meet those criteria. For instance, a researcher who wants their findings to be judged on the realist criterion of credibility (i.e., the extent to which their account seems “true,” accurate, or that the researcher “got it right”) may conduct member checking, in which they take their findings back to the participants for feedback to ensure their account corresponds to the participants’ reality (Tracy, 2010). Onwuegbuzie and Leech (2007) surveyed qualitative literature and found 24 strategies that have been commonly used by researchers to help bolster their study's validity, such as triangulation, audit trails, member checks, and so forth.
Rationale for this study
By conducting this study, we hoped to help future school psychology researchers and graduate students gain a better understanding of validity by showing them how past researchers in school psychology have practiced it. Levitt et al. (2018) identify discussion about validity (which they refer to as methodological integrity) as an important part of a manuscript reporting on a qualitative study. Specifically, they ask authors to clarify the means by which they sought to achieve validity for their study. However, there are various difficulties associated with this mandate. First, methodological texts on qualitative research present a dazzling array of validity criteria for judging the worth of a study. Reporting standards thus refrain from specifying the validity criteria to be used given the “wide range of qualitative approaches”, and the importance of choosing or tailoring validity criteria for one's project (Levitt et al., 2018, p. 29). Standardizing the criteria for validity is also difficult given the need for qualitative researchers to balance rigor, discipline, and systematicity on one hand with creativity and subjectivity on the other, without going too far in one direction at the expense of the other, which Whittemore et al. (2001, p. 522) described as the “opposing dangers of methodological rigidity and methodological anarchy”.
Secondly, school psychology researchers typically do not receive much training in qualitative research methods (Arora et al., 2022). Powell et al. (2008) found that out of all the school psychology training programs in the US, only one program required its students to learn qualitative research methods. Thus, school psychology researchers interested in using qualitative methods for a research project and publishing the study may be unsure of how to establish the validity of their study to make it publishable. Therefore, it may be helpful to understand how previous researchers publishing in school psychology journals have gone about the task of validating their studies. To this end, we asked the following research question: In what ways do school psychology researchers establish the validity of their studies?
Process of inquiry
The data for this manuscript partially came from a larger project investigating the prevalence of qualitative research in school psychology journals. Sabnis et al. (2023) conducted a bibliometric analysis of seven school psychology journals from 2006 through 2021. The journals were selected for having a primarily school psychology audience and performance metrics used to evaluate journal quality such as an impact factor or CiteScore listed on their websites. The seven journals were:
Canadian Journal of School Psychology (CJSP)
Journal of Applied School Psychology (JASP)
Journal of School Psychology (JSP)
Psychology in the Schools (PITS)
School Psychology (SP)
School Psychology International (SPI)
School Psychology Review (SPR)
Sabnis et al. (2023) looked at all the articles in these journals between 2006 and 2021 to identify qualitative articles, defined as any article meeting the following criteria:
“(1) The data being analyzed were in the form of text or images, not numbers. (2) The text or images were analyzed and reported in terms of themes, categories, domains, narratives, or discourses. Quantification, if present, was limited to descriptive statistics (e.g., frequency of themes, number of codes, and average age of participants) not inferential statistics. (3) The data analysis procedure was clearly described in the article.” (Sabnis et al., 2023, p. 2)
For this study, the first author collaborated with a different researcher to conduct a systematic review of these 142 qualitative articles to understand various methodological aspects such as commonly used study designs, forms of data collection and analysis, methods of reflexivity, and validity. Given space constraints and the timely importance of the topic of validity for qualitative researchers, we decided to focus this manuscript solely on the findings related to validity.
Validity of the study
We had two goals for this article. We wanted to increase the understanding of validity among quantitatively-trained school psychology researchers who may be considering a qualitative study and among school psychology graduate students who may be considering a qualitative dissertation but without being sure how to proceed. We hoped to bring them more clarity by highlighting the different ways in which validity has been practiced in the past by school psychology researchers. We also wanted to highlight areas for improvement in order to help raise the overall quality of qualitative research produced in the field. Given these motives, undertaking a rigorous rather than a haphazard approach to collecting and analyzing the large quantity of published articles spanning 16 years would reduce our chances of missing important patterns in the data. Since we wanted our research to have pedagogic value in teaching people about qualitative research, we would also need to come across as credible narrators to the reader of this article. For this reason, we selected two of the eight validity criteria suggested by Tracy (2010) to help orient our study, namely rigor and credibility.
Rigor
We borrowed the first validity criterion from Tracy (2010), who defined it as “due diligence, exercising appropriate time, effort, care, and thoroughness” in following a systematic process of inquiry (p. 841). We started by creating a protocol to extract information from the articles (see Appendix A). We then read five articles separately and met to extract the information into the spreadsheet together. This step helped us to iron out differences and refine the protocol to reflect an improved understanding of what we were looking for. Specifically, we decided to use a broad definition of validity strategies: steps or measures researchers take to increase the intellectual value, usefulness, and insightfulness of their study. These steps could be taken at any stage, from study conception, data collection, and data analysis through to how the study is presented to readers. Importantly, the authors had to explicitly disclose taking these steps. We made this definition broad to account for the multiple forms of validity strategies we came across during the literature review phase. Although we approached the coding process without any a priori codes, this stage was doubtlessly influenced by our prior knowledge of common validity strategies described in qualitative literature such as interrater agreement, member checking, prolonged engagement, triangulation, and so forth.
After the protocol was finalized, we met over the next few weeks to code articles, first separately and then together. We compared our respective coding and used discussion to arrive at a consensus, which we noted in the final spreadsheet. By the 25th article, we were coding similarly with very few disagreements. At this point, we decided to split the remaining articles into two sets, with each author coding the articles assigned to them. During this phase, we met every other week for seven months to check on each other's progress, review each other's work, and ask questions to clarify the other person's coding. This process generated information about the validity procedures used in each of the articles. We reviewed this information to identify 15 different validity procedures. The first author revisited all 142 articles as well as the spreadsheets to verify the calculations.
Credibility
Credibility refers to the extent to which readers can trust a researcher to study a topic competently and arrive at reasonable results and conclusions based on the data. One of the factors that affect the credibility of a study is the competence of the authors with regard to the topic under study. We believe we are competent authors based on our extensive training in qualitative research. The first author completed a doctoral degree in school psychology and a graduate certificate in qualitative research. He has conducted and published various qualitative research studies and is also asked to provide ad hoc reviews for qualitative manuscripts submitted to various school psychology journals. The second author is a qualitative methodologist who has published over 70 articles related to qualitative research in journals such as Qualitative Inquiry and Qualitative Health Research. Overall, we believe this training provides us with the expertise needed to conduct this systematic review and arrive at credible conclusions.
Credibility of the findings is also bolstered by providing thick descriptions of a phenomenon (Tracy, 2010). To construct a thick description, a researcher must dive deeply into a phenomenon to uncover complex patterns which are then communicated to the reader through examples and anecdotes. In this report, we showcase the different ways in which school psychology researchers went about establishing the validity of their studies. We discuss each strategy not only in terms of its overall meaning but also in terms of its heterogeneous execution. In other words, we show how two teams of researchers may use the same strategy but in widely different ways to attain the goals of their manuscript. We also provide many examples to illustrate the wide range of topics that have been studied with these strategies.
In the following section, we discuss the different ways in which validity was established by qualitative researchers in school psychology. Note that when referring to one of the 142 articles, we italicize the citation to distinguish the article from general citations.
Key findings: validity strategies
Researchers used various strategies to establish validity of their studies. These strategies were used in different phases of the inquiry. Figure 1 represents the different validity strategies in order of their popularity in qualitative research in school psychology. Popularity was judged by how many of the 142 articles featured a given validity strategy. A number in the figure (e.g., 30) indicates that the validity strategy appeared in 30 out of 142 articles published in school psychology journals between 2006 and 2021.

Figure 1. Most common validity strategies in published qualitative literature in seven school psychology journals (2006-2021).
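The popularity ranking described above amounts to a simple tally. The following is a minimal sketch of that arithmetic, assuming each reviewed article has been reduced to the set of validity strategies its authors explicitly disclosed. The article identifiers, the records, and the function name `rank_strategies` are hypothetical and are not taken from the review's actual spreadsheet.

```python
from collections import Counter

# Hypothetical extraction records: article -> validity strategies it disclosed.
articles = {
    "article_01": {"intercoder agreement", "member checking"},
    "article_02": {"member checking", "triangulation"},
    "article_03": {"intercoder agreement"},
    "article_04": set(),  # an article that disclosed no validity strategy
    "article_05": {"intercoder agreement", "triangulation", "prolonged engagement"},
}

def rank_strategies(records):
    """Return (strategy, article count, rounded percent) sorted by popularity."""
    counts = Counter(
        strategy for strategies in records.values() for strategy in strategies
    )
    total = len(records)
    # Most frequent first; ties are broken alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, n, round(100 * n / total)) for name, n in ordered]

ranking = rank_strategies(articles)
for name, n, pct in ranking:
    print(f"{name}: {n} of {len(articles)} articles ({pct}%)")
```

Because an article can disclose several strategies, the percentages in such a tally are not expected to sum to 100.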
Intercoder agreement
The most commonly used strategy to establish the validity of a qualitative inquiry was intercoder agreement, which was used in 44 out of the 142 articles (31%). This strategy also went by other names such as interrater reliability. In this strategy, researchers coded the transcripts separately, then came together to compare their codes. During these meetings they monitored the rate of agreement, expressed either as a percentage or as a kappa coefficient. Frequently they went through multiple iterations of this process to get the rate of agreement above a preset threshold such as 75% (Harvey & Pearrow, 2010), 80% (Kelada et al., 2017; McMahon et al., 2020; Moy et al., 2014; Washington et al., 2020) or 90% (Proctor et al., 2018). The preset thresholds were justified by citing methodologists such as Miles and Huberman (1994) for 80% and Bakeman and Gottman (1986) for 90%.
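To make the two agreement statistics concrete, the sketch below computes percent agreement and Cohen's kappa for two coders who each assigned one code to the same excerpts, and then compares percent agreement against a preset threshold. The codes, the excerpts, and the 80% threshold are illustrative assumptions and do not reproduce the procedure of any reviewed study.

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Share of excerpts to which both coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must code the same non-empty set of excerpts.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: probability both coders pick the same code independently.
    p_chance = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both coders used one identical code throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical codes assigned by two coders to the same ten excerpts.
coder_1 = ["support", "barrier", "support", "barrier", "support",
           "support", "barrier", "support", "barrier", "support"]
coder_2 = ["support", "barrier", "support", "support", "support",
           "support", "barrier", "barrier", "barrier", "support"]

THRESHOLD = 0.80  # assumed preset threshold for percent agreement
agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
meets_threshold = agreement >= THRESHOLD
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}, meets threshold: {meets_threshold}")
```

In this toy example the coders agree on 8 of 10 excerpts, yet kappa is noticeably lower than 0.80 because part of that agreement would be expected by chance. A percentage threshold and a kappa threshold are therefore not interchangeable.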
Ideally the process of intercoder agreement involves two or more coders coding the entire dataset independently and then coming together to compare their codes with each other. However, this process can be extremely laborious and require extensive time commitment. For Newman and Ingraham's (2020) study of a classroom activity for teaching multicultural school consultation, the process took “approximately two years, with over 40 collaborative meetings typically two to three hours long” (p. 16). Many research teams may not have secondary coders available to make such time commitments.
Given the clear logistical challenges presented by this approach, qualitative researchers in our review found many creative ways to balance the need for intercoder rigor with limited time. One of the most common strategies we found was for the primary researcher to use a secondary coder to co-code part of the dataset, and then code the remaining dataset alone. For example, Justin et al. (2021) and Frels et al. (2013) assessed interrater agreement on 30% and 18% of the data, respectively, and refined the coding scheme until they reached a satisfactory level of agreement with the secondary coder. The primary researcher in both studies coded the rest of the data alone using the refined coding scheme.
In Grapin et al.’s (2021) open-ended survey of graduate students in school psychology programs, researchers randomly selected 20% of the survey items and used a predeveloped coding scheme to calculate the interrater reliability using Cohen's kappa which exceeded 0.8. Ni et al. (2013) had a graduate student code 20% of randomly selected data and reported an interrater agreement of 89%. McIntosh et al. (2013) assigned a number to each participant and used a random number generator to select 25% of the data which was then sent to the third author to code. These strategies helped the authors to uphold the rigor of their study while also respecting the time constraints of fellow researchers.
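Random selection of a subset for a second coder, as described above, can be sketched in a few lines. The example assumes participants are identified by consecutive numbers and that a fixed fraction is drawn without replacement. The function name `select_for_second_coder`, the sample size of 40, and the seed are hypothetical choices for illustration, not details reported by the cited authors.

```python
import math
import random

def select_for_second_coder(participant_ids, fraction, seed=None):
    """Randomly pick a fraction of participants whose data a second coder will code."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1].")
    rng = random.Random(seed)  # a seed makes the selection reproducible and auditable
    k = math.ceil(len(participant_ids) * fraction)  # round up so the share is never below target
    return sorted(rng.sample(participant_ids, k))

participant_ids = list(range(1, 41))  # hypothetical: 40 numbered participants
subset = select_for_second_coder(participant_ids, fraction=0.25, seed=2013)
print(f"{len(subset)} of {len(participant_ids)} participants selected: {subset}")
```

Recording the seed alongside the selected identifiers lets an auditor confirm later that the subset was not hand-picked.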
One study also calculated intrarater reliability which refers to the process of ensuring that a researcher applied the same rules consistently over time, in order to minimize researcher drift. In Rush and Wheeler’s (2011) study of early career scholars attending the School Psychology Research Collaboration Conference (SPRCC), both authors individually coded an excerpt a second time and compared it to their first individual coding of the same excerpt, leading to 99.8% and 100% agreement, respectively, indicating strong intrarater consistency of applying the codes and a low level of coder drift.
Member checking
At its core, member checking refers to researchers “checking” with their participants and providing them an opportunity to give feedback about the research process. These opportunities can come in many different forms—from modifying the data itself to modifying the researcher's interpretation of the data. Member checking helps to make the research process more democratic and also helps researchers avoid unintentionally misrepresenting or misunderstanding participants’ viewpoints. Thirty percent (n = 43) of the studies involved member checking. Researchers took many different approaches to member checking, and differed specifically in what was sent to the participants.
Interview notes
In some studies, such as Davies's (2020) exploration of the challenges faced by families of students with traumatic brain injuries (TBIs), member checking involved summarizing the “main points” of the interview to the interviewees immediately upon finishing the interview, and asking them “if the summary was accurate” (p. 282).
Transcripts
For others like Cassidy et al. (2012) and Craggs and Kelly (2018), member checking involved transcribing the interviews and then sending the raw transcripts to the participants and allowing them to make any changes or add any information they had missed.
Summary of results
Leath et al. (2021), who explored how Black girls navigate friendships in high schools, sent a summary of the results of the data analysis to participants to get their feedback and made changes to the results based on that feedback in order to “ensure that [their] findings reflected the young Black women's perspectives” (p. 39). Similarly, Newman and Ingraham (2020) sent their findings (consisting of a summary of themes and a conceptual model created by the authors based on the study results) to participants and requested their feedback in specific areas such as “(a) their perceived accuracy of the theme; (b) what, if anything would enhance the accuracy; and (c) what additional points they had for consideration regarding each theme” (p. 17).
Draft of the manuscript
Finally, we found one study investigating a distance education program for preparing rural school psychologists in Colorado, wherein Lahman et al. (2006) sent a draft of their manuscript to their participants to “respond with comments regarding inaccuracies, clarification and reflections” (p. 442) prior to journal submission.
Non-quantified consensual coding
The third most common strategy we encountered in the qualitative articles was what we called non-quantified consensual coding, appearing in 27% of the articles (n = 38). This procedure was similar to intercoder agreement except that it did not involve the researchers quantifying their agreement. Thus, the researchers coded the transcripts separately and then came together to compare their coding without specifically monitoring the rate of agreement or disagreement. The researchers discussed areas of disagreement and came to a consensus. The reason for choosing this strategy over intercoder agreement was cogently explained by Suldo et al. (2008): “As the research team's goal was to reach full consensus in the coding scheme, percentage of initial agreement between raters was not calculated—consistent with the qualitative work initiated by Harry, Sturges and Klingner (2005)” (p. 965).
External auditing or peer feedback
This strategy was found in 26% of the articles (n = 35) and involved researchers consulting with people outside the research team to aid in the research process. A closer look into the data revealed variations in terms of when the researchers sought external help.
Before the study
In some cases, external auditors were brought in at the beginning of the study and helped to audit the entire research process. For example, Sansosti and Sansosti (2012), who explored educators’ experiences working with students with “high-functioning autism spectrum disorders” (p. 917), invited a panel of colleagues with expertise in qualitative research and special education to review all the study procedures before the study began. Larsen and Samdal (2012) invited two teachers to be external reviewers. Before beginning data collection, they pilot-tested the interview protocol with the teachers and used their feedback to improve the protocol.
During data analysis
In other cases, external reviewers were brought in during data analysis to help the researchers code the transcripts. For instance, Canpolat and Atli (2021), who explored how children experience changing schools due to parents’ relocation, interviewed 25 such students, tasked external reviewers with coding their transcripts, and then used this to calculate intercoder agreement. Phasha (2008) invited an independent analyst to code 15 of the transcripts and then reviewed their coding in order to modify her own coding.
After data analysis
More often than not, the external auditor was brought in after the data analysis was completed. In Moy et al. (2014), for example, an external auditor reviewed the codebook created by the researchers based on data analysis as well as the transcripts to evaluate “whether or not statements were accurately coded… and indicating his agreement or disagreement with the coding of each statement” (p. 329). The researchers also supplied him with all the documents from the research, including the coding, and the auditor went through these and pointed out discrepancies. Similarly, researchers such as Pufpaff (2008), who studied the emergence of literacy skills in a 7-year-old student with expressive and communication difficulties, held peer debriefings in which they discussed “problems, concerns, and initial findings with colleagues who had expertise and experience in a variety of related fields” (p. 586). The colleagues pointed out potential biases, suggested new meanings the researchers may not have considered before, and identified places that needed clarification.
Most studies that used external reviewers or peer feedback did not specify the reasons for inviting a particular individual to be the reviewer. It is possible that many researchers simply turned to people in their personal networks (e.g., fellow researchers, graduate assistants, etc.) to be the external reviewer. Among the ones that indicated a reason for inviting a specific reviewer, we found four reasons.
Methodological proximity
Craggs and Kelly (2018) conducted interpretative phenomenological analysis (IPA) to understand student experiences of being removed from a regular school and sent to an alternative school. For external auditing, they invited a colleague who had used IPA in their own research projects and was thus aware of its conventions and procedures.
Expertise
Two studies invited people based on their area of expertise. For example, Diakow and Goforth (2021) invited people who were “international experts in the field of migrant and refugee youth mental health, humanitarian assistance, Islamic studies, and Middle Eastern culture” (p. 244) to serve as external auditors for their study on the wellbeing of Muslim refugee youth. Theron (2013) invited other resilience researchers for their study on the resilience of Black South African university students from disadvantaged backgrounds.
Linguistic background
Reviewers also played a role in studies carried out in languages other than English. In Sluiter et al.’s (2019) study exploring teachers’ perspectives on ADHD medications, the researchers consulted with a native speaker of Dutch and English to review the quotes the authors wanted to include in the manuscript to ensure their translation “resemble(d) the language that was used by the respondents with a focus on intended meaning rather than the proximity of words in literal translation” (p. 1264). Fulano et al. (2018) similarly invited a teacher fluent in English and Mozambican to translate the quotes from the interviews.
Lack of proximity
Finally, some researchers like Chen et al. (2016), who studied how students intervene in bullying incidents, intentionally sought researchers from different fields in order to get “different perspectives on the study's provisional theoretical framework and research results” (p. 293).
Triangulation
The term “triangulation” emerges etymologically from navigation and refers to the act of using multiple points to pinpoint the geographical location of an object (Smith, 1975). In qualitative research, triangulation refers to the researcher eschewing reliance on any one strategy and instead using multiple strategies to arrive at their conclusions. Twenty-three percent of the studies (n = 32) used this validity strategy. We found many different ways in which researchers interpreted the term. Specifically, these studies differed in what was triangulated.
Triangulating the mode of data collection
The most common form of triangulation involved using multiple modes of data collection. For example, Frels et al. (2013) used individual interviews, focus groups, and document review to arrive at their findings about the factors that seem to characterize the relationship between successful mentors and their mentees.
Theory triangulation
Fontil and Petrakos (2015) triangulated by analyzing the data from several theoretical perspectives. Specifically, they reported triangulating their understanding of the participants’ experiences by viewing those experiences through the lens of Rimm-Kaufman and Pianta's (2001) developmental model of transition and the theoretical framework underlying the systems of care (SOC) programs.
Site triangulation
Nkoma and Hay (2018) talked about site triangulation wherein they “purposefully (sampled) three provincial offices, ensuring a high level of variation of participation characteristics in terms of gender, ethnicity, and time served in school psychology services” (p. 856). The use of multiple sites aided them in achieving “the greatest diversity” (p. 856) in the cases thus enabling a well-rounded understanding of the phenomenon.
Prolonged engagement
Lincoln and Guba (1985) identified prolonged engagement with the community as an important strategy for enhancing the quality of the research. At the core of this principle is the idea that developing an intimate understanding of participants and the context in which they live can allow the researcher richer insights into their actions. Sixteen percent of the articles (n = 22) involved prolonged engagement as a way to establish the validity of the study.
In comparison to other validity strategies we have discussed so far, prolonged engagement may be the most difficult to operationalize. Given the variety of research topics, there is no one standard way to engage with a community nor is there a concrete criterion to understand when one's engagement has been “deep” or “prolonged” enough. It is thus up to each researcher to explain to the reader their attempts at the engagement, and to allow the readers to decide whether it qualifies as prolonged engagement. Given the operational fuzziness of the criteria, we found a variety of scenarios described as constituting prolonged engagement. McGeown et al. (2017) reported enacting deep engagement by using one single site for all their data collection. They believed that this yielded rich data because of the opportunities to create strong rapport and to gain people's trust.
Balagna et al. (2013) enacted deep engagement with a Spanish-speaking community through phone calls and several home visits to various families prior to the interviews. This was especially salient given that their research involved participants from a Spanish-speaking community and that the interviews were to be in Spanish as well. Following the data analysis, they visited the interviewed families at their homes to present their data interpretation and seek feedback (member checking).
Some researchers leveraged their prior relationship with a school as a form of validity. For example, Wood et al. (2017) discussed their working relationship with the two school sites for “over ten years.” They explained how this intimate familiarity with the culture of the schools aided them in the creation of the interview protocol as well as “the interpretation of the data in context” (p. 6). It also helped them during the interviews because they were able to “express a shared sense of knowledge” about the school to the students they were interviewing (p. 6). Similarly, Pillay (2014) discussed how his career as “an educational psychologist and academic involved in promoting children's rights” (p. 229) allowed him rich insights and enhanced the credibility of his conclusions.
Not discussed
Fifteen percent (n = 21) of the studies did not explicitly describe the steps they took to enhance the validity of their study.
Thick rich descriptions
Eleven percent (n = 16) of the articles mentioned thick rich descriptions as the validity-bolstering mechanism used by their authors. Thick rich description as a form of validity emerged from ethnographic research, which focuses on understanding the cultural practices of various social groups. Geertz (1973) advocated for thick rich description as a way for the ethnographers to describe various cultural practices. This could be done by providing a rich account of a cultural milieu that revealed the ethnographer's intricate understanding of culture. An ethnographer's ability to write thick rich descriptions testifies to their deep immersion in the culture. Although emerging from ethnography, thick rich description spread to other disciplines as a means of judging the trustworthiness of an account. In the studies we reviewed, authors enacted thick rich descriptions by providing the reader with many different quotes from the participants “to ensure that the participants’ voices were well-represented” (Haegele et al., 2020). In addition to using direct quotes, researchers such as Vilbas and King-Sears (2021) enacted this procedure by describing their “participants, setting, and emergent themes” in rich detail (p. 877).
Saturation
Ten percent (n = 15) of the studies cited saturation as a validity-boosting practice. Saturation refers to collecting data and simultaneously coding it until no new codes emerge. The researchers may thus not start with a rigid ceiling for the number of participants. Rather, they continue to interview participants until they reach a point at which the existing codes are sufficient to analyze the new information. There is no standard way to decide the ceiling. Castillo et al. (2016) determined the ceiling had been reached “when the last two interviews yielded no new information” (p. 647).
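To make this kind of stopping rule concrete, the following Python sketch is offered purely as an illustration and is not drawn from any of the reviewed studies. It assumes each interview has already been coded into a set of code labels, and it treats saturation as reached once a chosen number of consecutive interviews contributes no new codes. The function name `saturation_point`, the `window` parameter, and the example codes are all hypothetical.

```python
def saturation_point(interview_codes, window=2):
    """Return the 1-based position of the interview at which saturation
    is reached, or None if it is never reached.

    interview_codes: list of sets of code labels, in the order the
                     interviews were conducted and coded (assumed input).
    window: number of consecutive interviews that must add no new codes
            (hypothetical parameter; 2 mirrors a "last two interviews" rule).
    """
    seen = set()   # every code observed so far
    quiet = 0      # consecutive interviews that added nothing new
    for position, codes in enumerate(interview_codes, start=1):
        new_codes = set(codes) - seen
        seen |= set(codes)
        quiet = 0 if new_codes else quiet + 1
        if quiet >= window:
            return position
    return None


# Hypothetical coded interviews, for illustration only.
example = [
    {"trust", "workload"},
    {"workload", "training"},
    {"trust"},
    {"training", "workload"},
    {"funding"},
]
print(saturation_point(example, window=2))
```

In practice, whether an interview "yielded no new information" is an interpretive judgment made by the research team; a count of this kind can only support that judgment, not replace it.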
Audit trail
Onwuegbuzie and Leech (2007) defined an audit trail as “maintaining extensive documentation of records and data stemming from the study” (p. 240). Nine percent (n = 13) of the articles cited the use of audit trails, which Vilbas and King-Sears (2021) concisely described as “detailing the process of data collection, analysis, and interpretation of the data” (p. 877). O’Bryon and Rogers (2016) described it as documenting “the process and product of the investigation” (p. 228) including various “methodological considerations and research decisions” they took at each step of the research process. They argued the audit trail aided them in being reflexive about the research. Parker et al. (2020) similarly talked about recording “theoretical memos and notes throughout each transcript to ensure that all members of the research team had a clear understanding of data interpretation” (p. 119). Wood et al. (2017) also discussed documenting the various changes they made to the interview protocol and codebook as the study progressed.
Providing a detailed account of the research process
Six percent (n = 8) of the articles provided a detailed description of how the research was conducted from design to analysis and interpretation, and claimed this as the mechanism that validated their studies. For example, Wood et al. (2017) stated “the current article offers in depth details about the process, context, and results of the study to provide a detailed description that is designed to allow readers to assess transferability of these findings to other settings” (p. 7). Meyer and Cranmore (2020) stated “to address dependability, the research report contains detailed procedural descriptions, interview protocols, codebook, excerpts, and other resources for replication” (p. 1062).
Training in coding or interviewing
Six percent of the studies (n = 8) reported using training to enhance the rigor of their research process. In Castillo et al. (2016), one of the research team members with expertise in qualitative research trained the other team members in interviewing. The expert met with the research team and explained common tenets of interviewing, after which the team members conducted mock interviews using the study protocol. Finally, the expert provided feedback to the team members. Once data were collected, the expert also trained the team members in thematic analysis, which included “provid[ing] information on identifying thought units, coding, and thematic analysis” (p. 646). This was followed by the team members coding a partial transcript together in a meeting. Following this meeting, the team members coded the rest of that transcript separately and met again to discuss their coding. Similarly, O’Bryon and Rogers (2016) talked about “extensive training- both didactic and applied” (p. 228) for all the coders in the method of content analysis. Cheng et al. (2011) reported that all the researchers underwent a three-hour training on coding, although no details of this training were provided. Moy et al. (2014) discussed how the research team generated a list of domains. Following the creation of this list, their research team members were “trained to identify the domains as consensually defined and to code all transcripts according to these domains” (p. 328).
Negative case analysis
Three percent of the articles (n = 4) featured negative case analysis as a validation strategy. This procedure emerged from grounded theory as a way to prevent overinterpretation of the data. Barrow and Hannah (2012) described it as “scrutiniz[ing] the data for deviant cases” (p. 457). Giraldo-Garcia et al. (2021) described it as “identify[ing] concepts and incidents that did not fit the resulting thematic framework, helping to safeguard against a drift toward a priori assumption…” (p. 57). Balagna et al. (2013) used negative case analysis by scanning the transcripts for information that “contradicted the initial interpretations”. They then refined or broadened those interpretations to accommodate the diversity in the participants’ experiences presented by the negative cases. For Gross and Lo (2018), negative case analysis involved identifying the data that interrupted or went counter to an emerging pattern; they looked deeper into this data to “[analyze] the possible source of the discrepancy” (p. 384).
Threshold
Three percent of the articles (n = 4) featured the threshold validity strategy. This strategy occurs toward the end of the coding stage, where researchers are finalizing the thematic schema. Although some researchers use an a priori approach to coding (wherein the researcher predetermines a set of codes and then reads the data for instances of these codes), many others take an inductive approach, wherein the codes are allowed to emerge organically from the data. This liberal approach to coding can potentially generate many more codes or subthemes than necessary to answer a research question. Threshold provides a way to separate the signal from the noise by setting a criterion to prioritize some codes while eliminating others from the final thematic schema. For example, McCabe and Rubinson (2008) eliminated codes that occurred in fewer than three focus group interviews. Grapin et al. (2021) treated themes that occurred in at least 25% of the interviews as major themes, while themes that occurred less often were designated as minor themes. Vega et al. (2019) only reported on themes that occurred in two or more participant interviews.
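The logic of a prevalence threshold can be sketched in a few lines of Python. The example below is an illustration under stated assumptions rather than a reconstruction of any reviewed study's procedure: it assumes each interview has been reduced to the set of themes it contains, counts each theme at most once per interview, and labels a theme "major" when it appears in at least a chosen share of interviews. The function name `classify_themes`, the `major_share` parameter, and the sample themes are hypothetical; the default of 0.25 simply echoes the 25% cutoff mentioned above.

```python
from collections import Counter


def classify_themes(themes_by_interview, major_share=0.25):
    """Label each theme 'major' or 'minor' by the share of interviews
    in which it appears.

    themes_by_interview: list of collections of theme labels, one per
                         interview (assumed input format).
    major_share: minimum proportion of interviews for a 'major' theme
                 (hypothetical parameter).
    """
    total = len(themes_by_interview)
    if total == 0:
        return {}
    # set() ensures a theme is counted once per interview, however
    # many times it was coded within that interview.
    counts = Counter(
        theme for themes in themes_by_interview for theme in set(themes)
    )
    return {
        theme: "major" if count / total >= major_share else "minor"
        for theme, count in counts.items()
    }


# Eight hypothetical interviews, for illustration only.
interviews = [
    {"support", "stigma"},
    {"support"},
    {"support", "funding"},
    {"support", "stigma"},
    set(),
    set(),
    set(),
    set(),
]
print(classify_themes(interviews))
```

Other cutoffs, such as a minimum number of interviews or focus groups rather than a proportion, follow the same pattern; the choice of cutoff remains a judgment the researchers should justify.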
Translational validity
In 3% of the studies (n = 4), researchers conducted the interviews in a language other than English and had to translate the data into English, either for coding or for reporting. For example, van Schalkwyk and Sit (2013) interviewed participants in Chinese and then translated the interviews into English for analysis. In doing so, they risked changing the meaning of what the participants said given that the nuances of one language may not carry into the other language. In such cases, the researchers would typically take one of several routes. Van Schalkwyk and Sit (2013) coded the translated transcripts and then referred to the original transcripts while writing the results (“refer back to the original recordings and transcripts in Chinese to verify credibility of interpretations” [p. 158]).
Other researchers coded the transcripts in the original language and then translated the quotes when writing the manuscripts. Sluiter et al. (2019) state that the focus of their translation was “on intended meaning rather than the proximity of words in literal translation” (p. 1264). Similarly, Fulano et al. (2018) spoke about coding in the original language and then translating the quotes at the manuscript-writing stage: “Following literature recommendations, data were used in the original language…; namely, all data analysis and direct quotations in the manuscript. In the final version of the manuscript, one of the coders, who is Mozambican, translated the direct quotations with the support of an English teacher. Researchers discussed the final version of quotations in English compared to the original and reached total consensus” (p. 202).
Relational ethics
Two percent of the studies (n = 3) emphasized the researchers’ ethical conduct with their participants during data collection as a form of validity. Their actions went beyond the standard ethical mandate about non-coercion, confidentiality, and privacy, and included taking steps to weaken the power imbalance between the researcher and the participant. One of the ways to do this was by disclosing their positionality to participants. This was based on sensitivity to context, a validity strategy presented by Yardley (2000). Haegele et al. (2020) explained their positionality to participants to “expose his biases” and included an abundant number of verbatim transcript extracts in the presentation of the results to ensure that the participants’ voices were well-represented. Explaining one's positionality can help weaken the power imbalance by reversing the flow of information in a research study, wherein the participants are typically the ones revealing personal information to the researcher.
Frels et al. (2013) was one of the few articles that discussed ethics as an important part of validity. They used Patton's (2002) argument that a research study's credibility was deeply connected to the integrity of the researcher. To this end, they used Patton's Ethical Issues Checklist to inform their behavior during the data collection process. Specifically, they spoke about doing the following things to demonstrate their integrity as researchers: “[(a)] purpose of inquiry and methods, (b) recognizing promises and reciprocity, (c) assessing risks, (d) maintaining confidentiality, (e) obtaining informed consent and assent from all parties, (f) understanding data access and ownership, (g) being cognizant of interviewer mental health, (h) consulting for advice, (i) recognizing data collection boundaries, and (j) maintaining ethical or legal conduct issues” (Frels et al., 2013, pp. 621–622).
Nkoma and Hay (2018) followed Englander's (2012) advice to have a preliminary meeting with the participants as a way to establish trust, comfort, and safety without collecting any data. To this end, they met with the participants one week before the scheduled interviews and used this meeting to introduce themselves to the participants, talk about ethical considerations and informed consent, and preview the questions the participants would be asked in their interviews the following week in order to give them time to think about their answers. The authors argued that this ethical conduct enabled them to “obtain a richer description during the interviews without making the researcher ask too many questions” (p. 855).
Strengthening validity practices in school psychology research: discussion and a few recommendations
Validity is an important consideration for researchers, helping a research community to evaluate research studies and elevate the ones that potentially advance the field or enrich the discourse. Whereas quantitative researchers are subjected to well-established criteria for judging the validity of a study, qualitative researchers are expected to decide the criteria against which they want a specific study to be evaluated. Multiple reporting standards established to guide the writing of qualitative research manuscripts therefore emphasize the importance of talking about what makes the study valid, but largely leave it up to the authors to select the criteria and strategies to argue their study's validity.
Our systematic review uncovered a variety of strategies qualitative researchers in school psychology used to establish and communicate the validity of their studies. The diversity of validation strategies shows the idiosyncratic nature of each qualitative research project. While the variations in validity strategies may reflect authors’ personal preferences, they may also reflect the editors’ and reviewers’ requests for authors to clarify the process of their inquiries. Regardless of the validity strategy used, it is important to remember that a researcher can work toward validity but never fully attain it (Koro-Ljungberg, 2010). Validity is a lodestar that can guide a research team and help them navigate the research process to produce “good” findings. However, all findings coming out of a study are a product of the researcher's limited sensemaking processes as well as a certain degree of randomness that characterizes the world. Thus, qualitative researchers should not expect their strategies to yield “a dichotomous outcome (i.e., valid vs. invalid)” but understand that validity is a matter of degree (Onwuegbuzie & Leech, 2007, p. 239).
Having discussed the different ways to validate studies, we reflected upon the findings to identify how future qualitative research in school psychology might be strengthened. Below we discuss four recommendations for potentially improving the practice of validity in school psychology research.
Recognizing the distinction between criteria and strategies
We believe it is important to distinguish between the criteria for judging a manuscript's validity and the strategies used to meet the criteria. A criterion specifies a researcher's values about how they evaluate a research study or a research publication and what characteristics they expect to see in a “good” study. A strategy is a procedure or the means that help them live up to these standards. Thus, a quantitative researcher who judges experimental studies based on the criterion of external validity may employ strategies to make the experimental conditions resemble “real-world” conditions. A qualitative researcher who judges research studies based on the criterion of resonance (i.e., the extent to which it “meaningfully reverberate(s) and affect(s)” the reader; Tracy, 2010, p. 844) may use strategies such as evocative writing (i.e., writing in a way that is “vivid, engaging, and structurally complex”; Tracy, 2010, p. 845).
Although we acknowledge the distinction between criteria and strategy may be blurred in some cases, we find it pragmatically useful to distinguish between the two when teaching qualitative research. In the absence of this distinction, qualitative researchers often tend to conflate the two. In the current study, we found many cases where authors named their validation strategy (e.g., interrater agreement) without clarifying the criteria they were trying to meet with the strategy as if the criteria were self-evident. Kahn (1993) argued that this way of approaching validity leads to a “procedural charade” wherein validity procedures come to be used ritualistically simply because they are expected or have been used before. Thus, it is important for researchers to clarify the criteria they use to judge their study's validity before discussing the means or strategies they used to meet those criteria.
In this article, we illustrated what we consider to be good practice in regard to discussing validity in an article or a dissertation. We started by describing the criteria we selected to help orient our study and also explained why we used those two criteria over others. We then discussed the means or strategies by which we believe we were able to honor these criteria.
Aligning validity and the paradigmatic position
Paradigmatic position refers to the kind of knowledge researchers and research communities value. Conducting interpretivist research means valuing knowledge about the meanings people assign to events and practices around them. Inquiring under a critical paradigm means exposing the hidden workings of power in oppressive and unjust systems. Postmodern inquiries value knowledge that affirms the uncertain and fragmented nature of reality against the totalizing power of grand narratives. And postpositivist research seeks knowledge that appears to be objective and unbiased, regardless of what the knowledge is about. As such, what constitutes valid research is contingent on the paradigmatic position adopted for the study.
We recommend researchers carefully consider their reasons for selecting validity strategies and criteria in light of their study's stated or implied paradigmatic position. This careful consideration requires awareness and understanding of the various research paradigms available to qualitative researchers and explicit adoption of a paradigmatic position aligned with a study's aims, purposes, methodology, methods, and validation criteria and strategies. For example, if a researcher seeks to understand how teachers experience and make sense of a new policy implemented in their schools, then that researcher will likely choose the interpretivist paradigm to undergird their research design because they value teachers’ subjective perspectives and meaning-making. Selecting a critical paradigm for this study, for example, would be a mismatch unless the researcher makes clear their interest in how the new policy maintains oppression and/or enacts social justice. Having carefully considered and selected an interpretivist paradigm, the researcher should therefore avoid stating that they used intercoder agreement to obtain a “more accurate” or “objective” understanding of the data. This is because the interpretivist position does not hold that researchers create objective knowledge. Rather, under interpretivism, all research “findings” are interpretations. Therefore, a researcher's decision to use multiple coders in an interpretivist study would be better justified by stating that incorporating the multiple perspectives during coding helped them achieve a well-rounded, nuanced, and multifaceted understanding of the teachers’ experiences and understandings of the new policy.
Going beyond positivist and interpretivist traditions of validity
Reflecting on Figure 1, we notice the pervasive influence of two well-established research traditions—positivism and interpretivism—on school psychology researchers’ approach to validity in qualitative research. For example, the first and third most popular validity strategies in school psychology research are interrater agreement and consensual coding. In our own experience as mentors, grantwork collaborators, and members of dissertation committees we have come across the misconception that every qualitative study must include either interrater agreement or consensual coding, which is untrue. Although both are useful, they are not mandatory for every qualitative study (Morse, 1997, 2015), and believing so can disadvantage researchers who may be unable to find a co-researcher to help them code their transcripts due to the expertise or time commitment required from the co-coder. These strategies may even be detrimental in studies that involve interpretive coding (see Morse, 2015). Qualitative researchers who may be the sole researcher (e.g., a graduate student collecting and analyzing data for their dissertation) can nevertheless produce rigorous and high-quality work by employing other validity strategies discussed above as well as those discussed in Morse (2015). The remaining strategies in Figure 1 (barring relational ethics) are all traditionally associated with the interpretivist paradigm which dominates qualitative research in the social sciences.
The least commonly used validity strategy—relational ethics—is associated with the critical paradigm of research, which seeks to uncover and challenge oppressive social processes. Researchers working from this paradigm are acutely aware of researchers’ historic role in legitimizing oppressive ideologies (e.g., eugenics) and exploiting marginalized communities for data (e.g., the Tuskegee Syphilis Study). Critical researchers working toward equity and social justice thus start by challenging the unequal balance of power in the research project itself, holding that the communities should have some level of say in the inquiry that impacts them (Kemmis & McTaggart, 2000). This may be done by engaging participants as co-researchers, taking active measures to increase their agency and involvement in the project, and ensuring the project benefits the participants or the community where it was conducted. Although we found these practices in some of the studies using participatory forms of research, the authors did not explicitly discuss these practices as sources of validity for their studies.
It is possible that the researchers may not be aware that these strategies do serve as a means of validation within the critical paradigm of research. Summarizing Denzin's (2015) account of critical qualitative research, Koro-Ljungberg and Cannella (2017) note that critical qualitative inquiry “places the voices of the oppressed at the center of inquiry, reveals sites for change, promotes change in people's lives, and has the potential to affect policy. As such critical qualitative inquiry is always already ethical…” (p. 330). Validating critical qualitative research thus includes discussing the ethical aspects of the inquiry throughout the study and including the author's ethical deliberation in the research write-up.
Appreciating the role of creativity
Finally, we want to highlight the importance of creativity, which often gets overshadowed by discussion of the techniques and strategies of qualitative research. Although the postpositivist influence can be credited for strengthening qualitative research by emphasizing rigor, discipline, and systematicity in knowledge production, Tracy (2010) warned that a rigorously designed study “does not automatically result in high quality work. Qualitative methodology is as much art as it is effort, piles of data, and time in the field. And just like following a recipe does not guarantee perfect presentation… rigor does not guarantee a brilliant final product” (p. 841).
In fact, an excessive emphasis on procedural rigor can come at the expense of the creativity that is vital to a strong qualitative study (Whittemore et al., 2001). That is because procedural rigor can engender rigidity and a compliance orientation, which conflict with the open-mindedness and intuitiveness associated with creativity (Sandelowski, 1993). A research team that is purely focused on adhering to a preplanned strategy for data analysis may miss out on an important pattern in the data that would have interested the reader. A good qualitative researcher is thus tasked with knowing when to follow rigorous procedures and when to follow instincts. The former may help in the efficient completion of a study and prevent a researcher from straying off course, whereas the latter can lead to unexpected insights that push the field forward. Qualitative researchers must therefore understand that rigorous procedures alone do not make their studies valid or worthy of the reader's engagement. Their manuscripts must also demonstrate a depth of thought, a newness of insights, and richness of writing that stimulates the reader, increases interest in the topic of inquiry, and helps to broaden qualitative research's audience.
Conclusion
As interest in qualitative research grows among school psychology researchers, we will likely continue to see an increase in the publication of qualitative studies in school psychology journals and doctoral dissertations. It is therefore important to attend to issues of validity in order to ensure high quality research that advances the field. One of the ways to improve validity practices in a field is to look back at how past researchers in the field have used them, studying how they adapted validity strategies for different contexts and reflecting on how these practices could be improved for future research. We therefore conducted this systematic review of qualitative literature in school psychology journals. Our findings highlighted the diversity of strategies used by school psychology researchers to establish their study's validity, as well as the creativity that goes into adapting a validity strategy for the specific context of each study. In addition to attending to these procedural aspects of validity, we also recommend that school psychology researchers attend to the deeper issues underlying validity practices, such as the paradigmatic connections and the role of creativity in crafting a strong qualitative study.
Limitations of the study
As with all studies, ours was shaped as well as constrained by our own beliefs, worldviews, experiences, and social positionalities. Thus, we make no claims about the universality or objectivity of our findings or claims. We believe it is ultimately up to the reader to determine the usefulness of this manuscript based on the arguments, reasoning, and evidence we provide. This is known as transferability, which is analogous to what quantitative researchers call generalizability (Levitt et al., 2018). Nevertheless, we would like to identify some limitations of our study. First, our findings focused on the strategies the authors used to validate their studies. Although we were also curious to understand the criteria toward which the strategies were applied, it was difficult to code this information in the data given that researchers often deployed terms like validity, trustworthiness, dependability, credibility, and rigor interchangeably. For example, both Meyer and Cranmore (2020) and Wood et al. (2017) used the validity strategy of providing a detailed account of their research process to the reader. However, Meyer and Cranmore (2020) claimed to use this to increase their study's dependability whereas Wood et al. (2017) used it to increase their study's transferability, indicating the two criteria are not adequately differentiated from each other. Other authors simply excluded any mention of validity criteria and only discussed validity strategies. Thus, we made a strategic choice to focus our data analysis on validity strategies only, since they were more concrete, had less interchangeable use of terms, and were therefore more feasible to code. Future research studies might focus on the criteria of validity used by school psychology researchers.
Another limitation of the study is that we only looked at the strategies that authors explicitly identified as increasing their study's validity. Thus, although we found many articles containing thick rich descriptions, we did not classify an article under the thick rich description category unless the authors explicitly stated they used this strategy. We did this because we were interested in understanding what researchers believed made their studies valid. A future study interested in evaluating the quality or validity of qualitative research published in school psychology journals might use a broadened scope of coding that includes both stated and unstated ways in which authors strengthen their study's validity.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Author Biographies
Appendix A
Coding proforma
For each journal, we extracted the following information in a spreadsheet:
- Year
- Volume
- Issue
- Total Articles (does not include acknowledgement, editorials, test reviews, book reviews, corrections, and announcement)
- Number of qualitative articles
- Additional comments (optional)
A different team of researchers downloaded the qualitative articles identified in the previous stage, and extracted the following information in a spreadsheet:
- Article citation
- Journal
- Study design
- Data collection strategy (interviews, focus groups, etc.)
- Data analysis strategy
- Discussion of reflexivity
- Discussion of validity*
- Overall strengths and limitations of the article's methodology
