Abstract
Objectives:
This study examined how evaluation and intervention research (IR) studies assessed statistical change to ascertain effectiveness.
Methods:
Studies from six core social work journals (2009–2013) were reviewed (N = 1,380). Fifty-two evaluation (n = 27) and intervention (n = 25) studies met the inclusion criteria. These studies were categorized by the statistical change indices and effect sizes reported. Only intervention studies were included in the final analysis.
Results:
Multivariate analysis of variance (28%) was the most frequently used primary change statistic and χ2 (43%) was the most frequent secondary indicator. About 68% of studies reported effect sizes with Cohen’s d (28%) being the most frequent.
Conclusion:
In addition to assessing statistical significance, the extent of rigor lies in determining the size of any change or effect in evaluation and IR. Implications for social work practice and research are discussed.
In the past few decades, political influence and financial downsizing of health and human service agencies including welfare expenditures have resulted in increased calls for legitimizing programs and services intended to address our nation’s social maladies. Consequently, the era of developing intervening social programs and services largely based on needs, practice experiences, and case wisdom has steadily dissipated. Currently, those playing critical roles in understanding human problems, protecting the social safety net, and advancing our social environment such as policy makers, funders, researchers, advocates, practitioners, and even the public want to know (1) which social programs work or not? (2) in what ways do these programs work or not? and (3) how do we know if they work or not? To answer the call for more accountability, social scientists are engaging in a variety of activities ranging from intervention research (IR), studying the design and development of systematic change strategies, to evaluation research (ER), by scientifically examining processes, effectiveness, and outcomes of social programs (Fraser, 2004; Fraser & Galinsky, 2010). In addition to advancing the social work knowledge base about vulnerable populations, interventions, programs, and services, these various research and evaluation (R&E) activities provide empirical evidence and justification for important policy-making decisions and funding (Holosko, 2010). In addition, Muhlhausen (2012) noted of these activities, “perhaps the greatest contribution is the evidenced-based policy movement that seeks to inform and influence policy-makers through scientifically rigorous evaluations of the effectiveness of government program” (p. 100).
Concepts such as evidence-based or evidence-informed practice have found traction within medicine, social work, nursing, psychology, political science, and other fields, and their use and prominence have been linked to justifying interventions that are more efficient and, ideally, produce cost savings. As such, evaluation activities have become essential for satisfying funding requirements within the applied social sciences (Holosko, 2010). However, for social work, a discipline that has historically held practice experience and case wisdom in higher regard than empirical research, the pathway to becoming evidence based has emerged reluctantly due to the slow embrace of an R&E culture within the profession overall (Fong & Pomeroy, 2011; Howard, McMillen, & Pollio, 2003; Okpych & Yu, 2014). Consequently, the research that has been conducted has too often fallen into the perennial abyss between the social work academic and practice communities (Holosko & Leslie, 1998).
Literature Review
Our Shifting Paradigms
Believed to have been derived from evidence-based medicine, which “… advocated for the use of experimental outcome research to determine if treatments were effective, benign, or harmful” (Drisko, 2014, p. 124), the profession of social work endured a few iterative paradigm shifts before a more heartfelt embrace of an evidence-based culture emerged. According to Okpych and Yu (2014), social work is marked historically by three paradigms: (1) the moral-based paradigm in the late 1800s and early 1900s, which framed issues as matters of moral depravity, (2) the authority-based paradigm beginning in the 1910s through 1930s that legitimized practice based on the methods and experience of social work veterans or experts, and (3) the empirically based paradigm emerging in the 1960s, which entailed practice legitimized by scientific evidence and routine evaluations of interventions.
However, as compared to other social science disciplines, the social work profession is not totally devoid of a foundation in research principles, as the initial formation of the profession touted using scientific methods or “scientific charity.” Dating back to the 1930s, attention on the need for research on social work outcomes has increased, and by the 1950s, outcomes were being documented in published reports of individual field experiments regarding micro-level practices, then called social casework (Mullen & Shuluk, 2011). Compared to psychology, which began publishing reviews of intervention outcomes in the 1950s, reviews of social work intervention outcomes did not emerge until the 1970s.
Although evidence-based practice (EBP) began to gain momentum in the 1960s in other social science cognates, it was not fully embraced by social work until the 1990s. Basically, as a response to the demand for greater accountability, the profession acquiesced to a paradigmatic shift from authority based to empirically based, creating a “heartburn effect” within the profession, as the researcher–practitioner relationship also had to be reconfigured (Davis et al., 2013; Herie & Martin, 2002). As was becoming clear, a culture was no longer viable in which social work academics published research in journals perused only by other academics, while practitioners, bogged down in demanding real-world service delivery, relied solely on experience and practice wisdom and had no time to peruse academic journals for EBPs (Bledsoe-Mansori, 2013; Davis et al., 2013; Fong & Pomeroy, 2011; Herie & Martin, 2002; Patton, 2010; Secret, Abell, & Berlin, 2011). Therefore, this relatively new social work evidence-based culture slowly extended to the practice, pedagogy, and professionalization of the profession, while also bridging the gap between the evidence-producing arm of the profession (academia) and the evidence-application arm (practice), fostering a more partnering or collaborative researcher–practitioner subculture.
Research and Evaluation
Undeniably, R&E lay at the core of the latter evidence-based paradigm. Whether they were evidence-based policies, programs, or practices, R&E promoted the central notion that all interventions must be first grounded in empirical research and then proven to be effective, all the while emphasizing accountability to clients, workers, agencies, and funding bodies (Davis et al., 2013; Okpych & Yu, 2014). More succinctly, Davis et al. (2013) stated “… practicing EBP appropriately involves identifying research-based evidence about an intervention, appraising the validity and utility of the evidence, and applying the results of such an appraisal in an ethical fashion” (p. 19). Also, EBP is required when social work practitioners seek out research findings related to the social problems of their clients, critically analyze those findings, and share conclusions and develop next courses of action with clients based on this systematic review of the evidence (Cohen, 2011). It is through the rigor of scientific R&E methods that policies, programs, and practices are deemed effective or “evidenced based” with replicable qualities and an ability to withstand ongoing scientific scrutiny by more rigorous methods and designs (Howard et al., 2003; Muhlhausen, 2012). Therefore, it would be of added value for social work to expound on the nuanced distinctions between intervention and ER and program and practice evaluations.
According to Fraser and Galinsky (2010), IR is distinguished from ER by its emphasis on the design and development of change strategies that match delineated social problems. Although IR clearly involves some aspects of evaluation, it is defined as: … the systematic study of purposive change strategies. It is characterized by both the design and development of interventions. Design involves the specification of an intervention. This includes determining the extent to which an intervention is defined by explicit practice principles, goals, and activities. (p. 459)
On the other hand, ER is characterized as the process of systematically applying social research methods to assess processes and outcomes of social intervention programs, including the testing of hypotheses (Holosko & Thyer, 2011). Furthermore, program and practice evaluations involve the systematic collection of data and statistics about the activities, characteristics, and outcomes of clients or programs, in a manner not necessarily research oriented, to assess effectiveness, and/or inform decisions about future programming (Fraser & Galinsky, 2010; Soydan, 2002). These evaluations only seek to measure the value of that particular program for its stakeholders and typically offer no intent to make generalizations (Fraser & Galinsky, 2010). However, according to Holosko et al. (2009), practice evaluations are somewhat different in that they focus more on defining the problem and appropriate intervention to use with their clients, collecting and analyzing data on the impact of the intervention, and assessing the benefit of the intervention to the client system.
Regardless of the type of R&E one employs, for social work, it appears that empirically assessing change in intervention programs is the “bottom line” for ER, IR, or program/practice evaluations. This reality is reiterated by the National Association of Social Workers (NASW) Code of Ethics (COE) and the policies of the Council on Social Work Education (CSWE). While the NASW COE subsection 5.02 promotes a position that social workers engage in evaluation or research activities by providing 16 ethical guidelines for social workers to adhere to (Holosko et al., 2009; NASW, 2013), the CSWE Educational Policy and Accreditation Standard 2.1.6 stresses the importance of engaging in research-informed practice and practice-informed research (Council of Social Work Education, 2008; Davis et al., 2013).
The aim of this study was to address the paucity of extant literature in this area by examining how statistical change (from Time 1 to Times 2, 3, 4, etc.) was assessed in both evaluation and intervention studies. This investigation sought answers to two research questions: (1) What statistical tests or change indices are being used to measure change from preintervention to postintervention in these studies? and (2) What supplemental statistics are being used to measure the effect of change from preintervention to postintervention in these studies? The implications of this study are directed toward social work researchers, practitioners, and students who wish to effectively design and develop interventions and/or evaluate programs.
Method
Sample
A quick scope secondary analysis (see preface to this journal) was conducted using six of the top-ranked social work research journals by impact from January 2009 through December 2013. These journals included the British Journal of Social Work (BJSW), Research on Social Work Practice (RSWP), Journal of Social Service Research (JSSR), Social Work Research (SWR), Social Service Review (SSR), and Social Work (SW). According to previous empirical studies, these journals have been identified as the so-called Tier 1 empirical journals and have been used before in similar studies (Holosko, 2010; Ligon, Cobb, & Thyer, 2012; Ligon, Jackson, & Thyer, 2007; Rubin & Parrish, 2007; Shardlow & Harlow, 2010; Thyer & Bentley, 1986). For example, Holosko (2010) reviewed articles in SSR, RSWP, and SWR over two time periods to locate and describe research designs used in published social work research and evaluation studies, while Rubin and Parrish (2007) examined RSWP and SWR over two time periods in their study on problematic phrases in the conclusion of published outcome studies. Furthermore, Ligon, Cobb, and Thyer (2012) and Shardlow and Harlow (2010) in the United States and United Kingdom, respectively, replicated previous studies, appraising articles dating back to 1979 from all six journals to analyze the academic affiliations of social work journal article authors. According to the 2014 Journal Citation Reports® (Thomson Reuters, 2014), the overall impact scores of these journals ranked as follows: BJSW = 1.162 (ranked 7th), RSWP = 0.905 (ranked 13th), SW = 0.877 (ranked 15th), SSR = 0.535 (ranked 27th), and JSSR = 0.483 (ranked 36th).
Data Collection
The authors initially conducted an exhaustive search for IR and ER studies by reviewing the abstract and methods sections of all articles (N = 1,380) published in the six identified social work journals during the years 2009–2013. A set of inclusion criteria was then applied to each article. Included ER studies met the following criteria: (1) the processes and/or outcomes of the programs were specified, (2) the study involved administering an empirically based intervention, and (3) change between at least two data time points was measured. The criteria for inclusion of IR studies were as follows: (1) the study examined an intervention in the process of being designed and developed, (2) it randomly assigned study groups (treatment and control or comparison groups), and (3) it measured causal change from one time to another (pre- to postintervention). Data were collected and appraised independently by three doctoral students over a 9-month period in 2013, and the interrater reliability between the data collectors was established at r = .89. The resultant sample consisted of a total of n = 52 (3.8%) published articles, which included evaluation (n1 = 27) and intervention (n2 = 25) research studies. These research study groups were mutually exclusive, meaning there were no articles included in both groups.
All ER and IR studies meeting inclusion criteria were then assessed on variables falling into four broad categories: (1) sample characteristics, (2) intervention characteristics, (3) research design characteristics, and (4) statistical characteristics. Sample characteristics were operationalized by the categorical variables “age range,” “at-risk youth,” “gender,” and “diversity.” Intervention characteristics included the type (“existing/off the shelf” or “new/customized”), duration (in weeks), and attrition rate. Research design characteristics included design type (pre–post only, pre–post with groups, pre–post with groups and 3+ time series, and pre–post with 3+ time series) and measurement/data collection methods. The measurement/data collection methods were operationalized by the following six categories of social science instruments used to collect data for measuring change: (1) behavioral rating scales, (2) checklists and inventories, (3) psychometric instruments, (4) other rating scales, (5) survey questionnaires, and (6) tests (teacher made and standardized; see Figure 1). We also thought it was important to assess the frequency with which these published studies reported reliability and validity statistics and/or methods and whether these statistics were from a previous administration of the measurement/data collection instrument or the current study sample.

Figure 1. Categories of social science instruments.
Two dichotomous (yes/no) questions were asked: (1) did the study report reliability and validity statistics and/or methods for a previous administration of the measurement/data collection instruments? and (2) did the study test the reliability and validity of the measurement/data collection instruments with the study sample? Finally, statistical characteristics were operationalized by “both primary and secondary change statistics reported,” “effect size reported for primary and secondary statistics” (yes, no, and not applicable), “supplemental statistics,” “number of studies reporting of significant outcomes,” and “size of effect” (larger than typical, large, medium, and small).
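To illustrate how a reported effect size might be coded into the four “size of effect” categories named above, the following minimal Python sketch classifies a Cohen’s d value. The cutoffs are assumptions and are not stated in this study: 0.2, 0.5, and 0.8 are Cohen’s conventional benchmarks, and the 1.0 cutoff for “larger than typical” is purely illustrative. The function name classify_cohens_d is hypothetical.

```python
def classify_cohens_d(d: float) -> str:
    """Map |d| onto the four size-of-effect labels used in this review.

    Cutoffs are ASSUMPTIONS for illustration only:
    0.5 and 0.8 follow Cohen's conventional benchmarks for d;
    1.0 for "larger than typical" is an illustrative choice.
    """
    magnitude = abs(d)  # direction of change does not affect the size label
    if magnitude >= 1.0:
        return "larger than typical"
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    # Everything below 0.5 is grouped as "small" here, including
    # values under Cohen's 0.2 benchmark.
    return "small"


# Hypothetical effect sizes, not taken from any reviewed study.
for value in (1.2, -0.85, 0.5, 0.1):
    print(value, classify_cohens_d(value))
```

A coder working with other indices (e.g., η2 or r) would need a separate set of cutoffs, since benchmarks differ by statistic.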
Results
The final analyses were limited to the IR studies (n = 25) because their distinct content was more germane to this special edition of the journal. The frequency analysis revealed that the majority of these studies were published in RSWP (80%), followed by SWR (8%) and JSSR, SSR, and BJSW (4% each). There were no IR articles included in this study from Social Work. Thirty-two percent were published in 2009, followed by 2012 (24%), 2010 and 2013 (16% each), and 2011 (12%), and sample sizes ranged from 20 to 503 (M = 156.9, Median = 100, Mode = 67). Table 1 shows additional categorical variables examined and their frequencies. Most of the study samples were a combination of youth and adults (40%), included both males and females (84%), and were diverse or heterogeneous (68%). The interventions were primarily new or customized (80%), with a duration of 17 weeks or more (44%) and attrition rates of 20% or less (68%). Fifty-six percent of the studies had a pre–post within-group research design, including one whose time point of testing varied by participant. Of the 36% of studies with a pre–post within-group and multiple time-series research design, 8% had more than two follow-up testing time points. Also, more studies employed varying combinations of measurements or data collection methods (40%).
Table 1. Frequencies for Categorical Variables.
Note. N = 25.
a. Nine (36%) of these studies targeted at-risk youth. b. Twenty-three of twenty-five (92%) used a control/comparison group.
Table 2 presents the studies that reported reliability and validity statistics and/or methods for a previous administration of the measurement/data collection instrument and those that reported testing the psychometrics in the actual study. Most studies did not report reliability and validity statistics and/or methods for previous administrations of the instruments (52%), and of these, only five reported testing psychometrics and eight did not. Regarding the data analysis characteristics of these selected studies, Table 3 shows the frequencies of statistical indices used to measure change from pre- to postintervention. The most prominent statistical change indicator used was multivariate analysis of variance (MANOVA; 28%), followed by t-test (20%), analysis of variance (ANOVA; 16%), and analysis of covariance (ANCOVA; 16%). Over half of the studies measured change for multiple outcomes and reported secondary change statistics as well. Of the secondary change statistics used, χ2 (28%) was the most prominent, followed by t-test (12%) and ANOVA (12%).
Table 2. Cross Tabulation for Reliability and Validity Reported and Tested.
Note. N = 25.
Table 3. Frequencies of Change and Supplemental Statistics.
Note. N = 25. MANOVA = multivariate analysis of variance; ANOVA = analysis of variance; ANCOVA = analysis of covariance.
a. Seven of the eight studies that did not report effect size did report significant outcomes. b. Effect size statistics were reported for 28% of secondary change statistics reported. c. Studies with multiple outcomes reported effect sizes for each outcome.
Table 3 also illustrates the frequencies of reported supplemental statistics used to measure the effect sizes of significant outcomes. Twenty-two (88%) of these studies reported statistical significance for one or more of the outcomes measured. Seventeen (68%) of these studies also reported the effect sizes of significant outcomes. Of these reported effect sizes, Cohen’s d (23%) was the most prominent followed by 16% of the studies using a combination of supplemental statistics measuring effect (i.e., Cohen’s d and partial η2, Cohen’s d and φ/Cramer’s V, η2, and R2). Just over half (52%) of the studies that reported effect sizes indicated significant change from pre- to postintervention of medium or moderate effect and 24% reported large effect. Also, of the eight studies that did not report effect sizes, seven still reported significant outcomes.
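As a quick arithmetic check on the counts in the paragraph above, the following short Python sketch converts the reported study counts into percentages of the N = 25 intervention studies. The helper name share is hypothetical; the counts themselves come from the text.

```python
N_STUDIES = 25  # intervention studies in the final analysis


def share(count: int, total: int = N_STUDIES) -> int:
    """Percentage of studies, rounded to the nearest whole percent."""
    return round(100 * count / total)


reported_significance = 22  # studies reporting significant outcomes
reported_effect_size = 17   # studies reporting effect sizes
no_effect_size = N_STUDIES - reported_effect_size  # remaining studies

print(share(reported_significance))  # 88
print(share(reported_effect_size))   # 68
print(no_effect_size, share(no_effect_size))  # 8 32
```

The same denominator is assumed throughout; percentages computed on a subset (for example, only the studies reporting secondary statistics) would use that subset's count as the total.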
Discussion
We were initially motivated to conduct this research because social work by its very nature deals with change. Given the frequency of evaluation and intervention studies that use pre–post and time-series designs to assess change, we were interested in documenting how they empirically demonstrate change through the statistics used in their studies.
One of the most important strengths of EBP in the field of social work is that it bridges the gap between the service and science arms of the profession. In order to achieve this goal, it is important for both social work practitioners and researchers to have both a theoretical and practical knowledge base in research methods and statistical analysis that inform whether or not interventions actually facilitate change and to what extent any change is effective. The purpose of this study was to provide a useful snapshot/tool of statistical indices used to measure change from pre- to postintervention and the effect of such change for researchers, practitioners, and students. While it is important to show change, it is even more important to show the effect of such change. In addition to assessing primary and secondary statistical analysis used in IR studies, supplemental statistics (i.e., effect size indices) were analyzed.
Although the statistical change indices used depended on the parameters of the research questions, there was little standardization in the supplemental reporting of effect size statistics for significant findings. This may be due to the majority of the journals having no specific guidelines or requirements for reporting statistical indices. The two exceptions were BJSW, which directs its authors to read and follow recommendations from Conducting and Presenting Social Work Research: Some Basic Statistical Considerations (Smeeton & Goda, 2003), and RSWP, which asks authors to familiarize themselves with reporting standards found in Reporting Standards for Research in Psychology: Why do we need them? What might they be? (APA Publications and Communications Board Working Group on Journal Article Reporting Standards, 2008). In addition to reporting statistical significance, these journals, in accordance with the American Psychological Association Journal Article Reporting Standards, emphasize the importance of reporting indicators of effect sizes of statistical results. These indicators include Cohen’s d, r (correlation coefficient), R2 (regression), eta squared (η2), partial eta squared (partial η2), Jacobson reliable change index, and percentages. Definitions of these effect size indicators along with web links for additional information are provided in Figure 2.

Figure 2. Definitions of Effect Size Indicators.
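To make three of the indicators listed above concrete, the following minimal Python sketch computes Cohen’s d, η2, and the Jacobson reliable change index using common textbook formulas. These are assumptions for illustration and are not necessarily the exact formulas used in the reviewed studies; all data values, function names, and the reliability coefficient are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled_sd


def eta_squared(groups):
    """Between-group sum of squares divided by total sum of squares."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total


def reliable_change_index(pre, post, sd_pre, reliability):
    """Individual change divided by the standard error of the difference."""
    se_measurement = sd_pre * sqrt(1 - reliability)
    se_difference = sqrt(2) * se_measurement
    return (post - pre) / se_difference


# Hypothetical posttest scores for two small groups.
treatment_scores = [14, 16, 15, 18, 17]
control_scores = [12, 13, 11, 14, 15]

print(round(cohens_d(treatment_scores, control_scores), 2))       # about 1.90
print(round(eta_squared([treatment_scores, control_scores]), 2))  # about 0.53
# Hypothetical client: score drops from 30 to 22 on a scale with SD = 6
# and an assumed reliability of .84.
print(round(reliable_change_index(30, 22, 6, 0.84), 2))           # about -2.36
```

Under the usual convention, an absolute reliable change index above 1.96 is read as change unlikely to be due to measurement error alone, which is why the index complements group-level statistics such as d and η2.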
Turning to the main findings in this review, when interventions were assessed in Table 1, it was noted that the most frequently used interventions were new or customized programs, not necessarily existing or “off-the-shelf” programs. We were surprised by this finding because traditionally social workers are taught to use existing scales, instruments, and interventions that have proven efficacious in their work with clients. In this regard, this could show a trend toward using more targeted interventions created specifically for individuals with different forms of problems or vulnerabilities. This notion was echoed by Wodarski (2011) who asserted that one of the social work conundrums that we have is the use of homogeneous interventions for heterogeneous clients. These data were encouraging in regard to intervention types as the preference for taking an available off-the-shelf intervention was not as important as developing a new one for the targeted population.
It was no surprise to us that youth and adults were the targeted sample participants for interventions. What was a surprise, however, was the breakdown of the different age ranges of the participants in Table 1, from infants all the way up to adults. This again showed sensitivity to cohorts of similarly aged people receiving interventions targeted to them, rather than the “one-size-fits-all” interventions for which we have been criticized in the past. This was further reiterated by the at-risk youth population, which was targeted in 36% of the studies, indicating that this special category of targeted interventions received unique consideration. As for gender, most studies included both males and females, and only 16% of them included interventions targeted to males or females individually; here too, the relative heterogeneity of the samples reflected diversity. We were also surprised by the intervention duration times noted in Table 1. Given today’s climate of compressed treatments and limited money for services, it was notable that 44% of the interventions studied had durations of 17 or more weeks. We thought that time was quite long, given the way we currently provide interventions in our health and human social services. Ostensibly, 17 plus weeks, or at least 4 months, of an intervention would also have implications for some of the traditional threats to validity, such as maturation, mortality, attrition, testing effects, or history, that were not adequately noted in many of these studies.
The traditional design used was pre and posttest (52%), followed by pre and posttest with follow-up (36% or 42%). This finding was consistent with Holosko’s (2010) study of the types of research designs being used in social work research and evaluation. The range of data collection tools used, from behavioral rating scales to nonbehavioral rating scales, was fairly balanced, with frequencies between 8% and 20%. In regard to attrition, the “magic number” we found seemed to be 20%, in that 68% of the studies lost 20% or less of their targeted sample over the time of the study, which may in part be related to the 17 plus week intervention duration noted above. Although a substantial number of these studies met the threshold of acceptable attrition (less than 30% or 40%) for interventions, losing 20% of any sample, particularly with vulnerable populations who have very unique problems, can still significantly bias responses and therefore outcomes (Amico, 2009; Valentine & McHugh, 2007).
Table 3 summarizes the frequencies of the main change statistics used in these studies. In regard to the primary change statistic, the parametric tests MANOVA, t-test, ANOVA, and ANCOVA were used in the vast majority of these studies. Surprisingly, very little attention was given to nonparametric statistical tests. We thought there would be more, given the smaller sample sizes and the inability to generalize to the normal curve in terms of probability-based sampling of populations highly unique in their characteristics. Additionally, we were encouraged by the supplemental statistics related to the effect sizes reported herein. Indeed, 68% of the studies had effect sizes as companion statistics with the primary and secondary change statistics reported. We found that 28% of these studies also reported effect sizes for the secondary statistic in addition to the primary ones noted above. We then broke down the secondary change statistics reported in 56% of these studies. The most frequently used one was χ2 (43%), followed by t-test (21%) and ANOVA (14%). Once more, given the uniqueness of these samples and the inability to generalize the populations accordingly in terms of their own attributes, we thought there would be more nonparametric tests in this list.
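To illustrate the contrast between parametric and nonparametric change statistics discussed above, the following minimal Python sketch compares a paired t statistic with an exact sign test on the same pre- and postintervention scores. The sign test is chosen here only because it is simple to compute by hand; it is an assumption for illustration and not a test reported in the reviewed studies. All scores and function names are hypothetical.

```python
from math import comb, sqrt
from statistics import mean, stdev


def paired_t_statistic(pre, post):
    """Parametric: t = mean(diff) / (sd(diff) / sqrt(n)), with df = n - 1."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def sign_test_p_value(pre, post):
    """Nonparametric: exact two-sided sign test on the direction of change."""
    diffs = [b - a for a, b in zip(pre, post) if b != a]  # ties are dropped
    n = len(diffs)
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Probability of k or fewer changes in the rarer direction when
    # increases and decreases are equally likely (binomial with p = 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical scores for eight participants.
pre_scores = [10, 12, 9, 14, 11, 13, 10, 12]
post_scores = [13, 14, 10, 17, 12, 15, 9, 15]

print(round(paired_t_statistic(pre_scores, post_scores), 2))  # about 3.56
print(sign_test_p_value(pre_scores, post_scores))             # 0.0703125
```

The t statistic uses the size of each change and assumes approximately normal differences, whereas the sign test uses only the direction of change. With small samples the two can lead to different conclusions, which is one reason the choice of test deserves explicit justification.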
When we separated out the supplemental effect sizes from these data in Table 3, we found that 32% of these studies reported no such supplemental statistics. When reported, the most frequent measure of effect size was Cohen’s d (23%), followed by partial η2 (15%) and η2 (8%). A few studies (16%) reported a combination of effect size statistics, namely, Cohen’s d, partial η2, and η2. Finally, in regard to effect size outcomes, more than half of the studies had medium effect sizes, and when these were combined with large and larger than typical effect sizes, 84% of the studies had medium to larger than typical effect sizes in total. The importance of including effect sizes to supplement change statistics becomes apparent because these reveal that interventions are working, in terms of the actual effect of change. Everything from medium sized upward shows a significant change, and the other 24% shows a relatively smaller effect size. As Rubin and Parrish (2007) found in their review of published evaluation studies of interventions, this study further supports the increase in rigorous research of effectiveness and outcomes in the discipline of social work.
An additional observation of this study related to the inconsistencies in reporting of procedures or protocols used to ensure that data collection/measurement instruments administered, either previously or for the current study, were reliable and valid. Although ensuring that these instruments measure what they are expected to measure consistently (reliability) and collect the most accurate and trustworthy data (validity) seems commonsensical in social science research, Brink and Louw (2012) noticed, in their study of the psychometric properties of clinical tools used by health care practitioners, a trend in research literature omitting quality reporting of the validity and reliability of clinical instruments, and they advocated for researchers and practitioners to reverse this trend by becoming more educated on the subject matter. Integrity of the measurement/data collection tools is critical to deriving sound empirical conclusions (Barry, Chaney, Piazza-Gardner, & Chavarria, 2014). Given the findings of this study, the social work profession could stand to improve its testing and reporting of the reliability and validity of instrumentation as well.
In sum, the implications of these findings for future social work practice and research are an expanded knowledge base of researchers, practitioners, and students in the rigorous reporting of statistical change indices in intervention and ER, and awareness of the confirmatory value of having such empirical evidence of the good work that is being done to maximize the human condition in the social environment. Ultimately, this could lead to improved interventions, improved policies, increased funding, and better outcomes for populations of people in our society. Although the small sample size of this study limits the overt generalizability of the findings, it hopefully eases the angst of the skeptic and reluctant professional toward conducting more empirical quantitative intervention and ER and promoting evidence-based or evidence-informed practice. This study on change statistics used in IR adds to knowledge in our profession that hopes to inform both practitioners and students alike about this important topic.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
