Why organizations (do not) evaluate? Explaining evaluation activity through the lens of configurational comparative methods

Abstract

This article aims at explaining why some Flemish (Belgian) organizations evaluate policy, while others do not. The study relies on a unique combination of two configurational comparative methods: the Most Similar Different Outcome/Most Different Similar Outcome method and crisp set Qualitative Comparative Analysis. This combination of methods helps us unravel the combinations of conditions that promote or impede policy evaluation activity in a public administration that recently underwent major changes in line with New Public Management (NPM). The analysis reveals that the impact of NPM reforms on evaluation activity should not be overestimated. The results suggest that the important conditions that explain policy evaluation activity or inactivity are the anchorage of the evaluation function, the availability of skilled personnel to undertake evaluations, evaluation demand from organizational management, and the measurability of the outputs and outcomes of the organization’s activities.

Keywords

configurational comparative methods, evaluation activity, evaluation capacity building, Flanders

Introduction

Evidence-based policy is dominant in current policy discourse to such an extent that it has led some scholars to ironically ask what other types of policy could potentially exist (Gray and Jenkins, 2011). Policy evaluation, as one particular form of evidence, follows the same trend. According to Eliadis et al. (2011) it is difficult to imagine society without some sort of general use of evaluation. The wide acceptance of evaluation, though, might hide significant differences in evaluation activity between public sector organizations. Not all organizations conduct evaluations in this evidence-based era. Trying to explain variance in public organizations’ evaluation activity, it soon becomes apparent that evidence is fragmented and inconclusive. Indeed, multiple explanatory factors have been identified by the evaluation community, mainly in the evaluation capacity building (ECB) literature. Yet, common to the practical nature of the field, insights are predominantly of anecdotal nature and have seldom been systematically tested.

The present contribution aims at revealing the combinations of conditions that may explain why some Flemish organizations conduct evaluations, while others do not. To address this research challenge, we rely on two Boolean methods. First, using the Most Similar Different Outcome/Most Different Similar Outcome (MSDO/MDSO) technique, we identify the conditions that have most potential to explain evaluation activity or evaluation inactivity. Second, we use Qualitative Comparative Analysis (QCA) to detect the different combinations of conditions that lead to evaluation activity or evaluation inactivity. The choice of the Flemish administration is of particular interest as it recently underwent major reform along the lines of New Public Management (NPM). NPM is widely considered to have played a major role in the diffusion of evaluation practice (Furubo and Sandahl, 2002; Stame, 2003). The findings show to what extent NPM has influenced evaluation activity in Flanders, and reveal which other (combinations of) conditions matter.

The next section provides some background information on evaluation activity within the Flemish public sector. In the following sections, we explain the methodological choices made, and present the findings.

Policy evaluation activity in Flanders

Evaluation practice and culture have spread rather slowly and unevenly across European countries. Furubo and Sandahl (2002) identified two different waves of diffusion of evaluation in Europe, respectively taking place in the 1960s and 1970s, and towards the end of the 1990s. Whereas the first group of countries (e.g. Sweden, Germany, UK) adopted evaluation mainly following internal and domestic pressures, in the larger second group of countries, external factors are considered as the main driver of diffusion. Among the most cited reasons is NPM, which is often associated with the promotion of policy evaluation as an accountability tool (Stame, 2003). In Belgium, increased attention paid to evaluation only gained momentum when the second wave was already slowing down. The central role of the political parties, the federal state structure and the dominance of the executive power are often presented as the main factors complicating and delaying the establishment of a genuine evaluation culture in Belgium (Varone et al., 2005). Our article focuses on Flanders, to date a largely undiscovered administrative area as far as policy evaluation practice is concerned.

In 2006, the Flemish administration implemented a government-wide reform package, titled ‘Beter Bestuurlijk Beleid’ [Better Administrative Policy]. Comparing these reforms with an ideal-typical blueprint of NPM, it becomes clear that the reforms incorporate substantial NPM-inspired elements. Although policy evaluation was not a core goal of the reforms, it has nevertheless been given significant attention. Departments have been explicitly assigned the policy evaluation function, and the autonomized agencies are supposed to generate input in terms of relevant policy and managerial information for policy evaluation (Vlaams Parlement, 2003). Since the NPM-inspired reforms were implemented only recently, the Flemish public sector forms an interesting case to investigate which organizations are now active in evaluations, and which conditions foster or impede policy evaluation activity.

For the purposes of the current research, we rely on Scriven’s (1991) conceptualization of policy evaluation. We define policy evaluation as a scientific analysis of a certain policy (or part of a policy) aimed at determining the merit or worth of the evaluanda on the basis of certain criteria (effectiveness, efficiency, sustainability, etc.). The definition stresses the research-based link of evaluation. It highlights the fact that evaluation provides a systematic and transparent assessment of an object. The evaluations of our interest pertain to a large array of possible evaluanda, including evaluations of the content of the policy, the process, or the effect and impact of the policy.

Case selection and data collection

The present research focuses on the Flemish public sector. For our purposes, we draw attention exclusively to policy fields with an external societal focus: education; work and social economy; mobility and public works; well-being and public health; housing and spatial planning; economy and innovation; agriculture; environment, nature and energy. The data collection for the research occurred in several rounds. We sequentially conducted the following:

An explorative quick scan of all departments. We interviewed privileged informants who could give us in-depth insight on the state of evaluation activity within their department and within the broader policy field, and who could also provide information about possible triggers or impediments for evaluation activity.

A series of semi-structured interviews with 21 agencies active in one of the policy fields mentioned. We interviewed representatives of organizations with and without evaluation practice. We contacted the heads of the organization, asking which person(s) from their agency was (were) best placed to answer evaluation-related questions.

A survey. We deliberately sent out the questionnaire to those we had previously interviewed. The main purpose of this survey was to corroborate the research findings collected during the interviews, as well as to collect evidence on a more systematic basis.

A series of interviews with personal advisors of each of the sector ministers. They could give us a supplementary validity check on the information collected via the organizations themselves.

The case selection follows from the data collection. Only those organizations that participated both in the interview and in the survey were selected as final cases. This stringent selection strategy resulted in a total of 27 organizations (29 analytical cases).¹ The cases can be classified as follows: 18 cases are active in evaluations; nine cases are neither active in any evaluation nor have any intention of becoming active in the future; and two cases have a hybrid nature, as they are not conducting any evaluations yet, but have concrete plans to do so in the near future. We verified the evaluation activity or inactivity of our cases by looking at the actual evaluation reports.

An exploratory approach

Current explanations of organizational differences in evaluation activity are mainly restricted to normative insights, collected in a variety of empirical settings. As we lack sound evidence to make a prior selection of relevant conditions, we consider that the best way to contribute to the field is to start from an open approach, in which we scrutinize a large variety of possibly interesting factors. To come up with a list of factors of potential relevance, we screened the evaluation literature. We combined sources that are grounded in experiences from evaluation capacity builders and practitioners with more theoretical and academic sources that contain relevant indications for explaining organizational evaluation activity. Additional sources were screened, as long as new factors were found. Complementary to literature screening, other relevant conditions emerged inductively through the explorative interviews in the departments.

The list of conditions was subsequently ordered with the help of a conceptual framework: actor-centred institutionalism (ACI), developed by Scharpf (1997). Several reasons support the choice of this framework. First, ACI was specifically developed for policy analysis. Second, ACI employs definitions of actors, institutions and the relationships between the two that are workable and relatively easy to operationalize (Witte, 2006). Additionally, it offers a legitimation of the treatment of organizations as (composite) actors (Scharpf, 1997: 12). Third, Scharpf shares a belief in configurational comparative methods (Scharpf, 1997: 26) as best fitting the complex social reality. He also promotes the separate analysis of structural (remote) and actor (proximate) types of conditions (Scharpf, 1997: 764), which provides the basis for the two-step approach that we use in the QCA below.

As its name suggests, ACI proceeds from the conviction that social phenomena are to be explained as the outcome of interactions between intentional actors, but that these interactions are structured, and the outcomes shaped, by the characteristics of the institutional settings in which they occur (Scharpf, 1997: 1). The underlying assumption is that people do not act on the basis of objective reality and objective needs, but on the basis of their subjectively defined interests, preferences and capabilities, which are largely, but not entirely, shaped by the institutional environment. Broadly speaking, in ACI, we can distinguish between two types of factors, which correspond to the broad traditions in social science research: those emphasizing agency and those that stress the role of structure (Witte, 2006). Each of these types of factors can be further divided into more specific analytic categories, five in total (A–E).

Actors are characterized by specific capabilities (Category A) and specific cognitive and normative orientations (Category B).

Category A: Actor capabilities are understood as all action resources allowing an actor to influence a certain outcome (Scharpf, 1997: 43). The importance of capabilities is evident. Without them, actors will not be able to realize their orientations. To realize evaluation activities, the following capabilities are highlighted as relevant in the evaluation literature: (i) skills to conduct in-house evaluations, (ii) skills to outsource evaluations, (iii) budgetary resources to evaluate, (iv) availability of experienced staff to evaluate, (v) availability of an external evaluation community with expertise in the themes of the organization, (vi) availability of monitoring information.

Category B: Besides capabilities, actors are characterized by having certain orientations (Scharpf, 1997: 43–4). Without demand or interest for evaluation, the supply of it remains a loose end. The majority of ECB sources we screened referred to demand in their testimonials, be it in general terms or by referring to the demand coming from a particular actor. In the present research, we focus on the influence of requests for evaluation from (vii) organizational management, (viii) the sector minister and his/her advisors, (ix) parliament, (x) civil society organizations, (xi) other organizations active within the same policy field, and (xii) organization-wide support for evaluations.

As a proponent of neo-institutionalism, ACI requires a proper investigation of the structural setting in which actors operate. We clustered the conditions in the following three different categories.

Category C: We will scrutinize the relevance of eight different institutional attributes, which frame the interactions between intentional actors (Scharpf, 1997: 12): (xiii) the size of the organization, (xiv) the extent of formal autonomy that an organization enjoys, (xv) the status of the organization (department or agency), (xvi) the existence of an evaluation unit, (xvii) the engagement of staff in evaluation training and networking, and the presence of requirements for evaluation in regulations to which the organization is subjected, at Flemish level (xviii), at EU level (xix), and in the management agreement negotiated with the sector minister (xx).

Category D: The particularities of the policy an organization is dealing with will determine how it perceives the general tendency for evidence-based policy, and whether it considers a response in terms of the development of evaluation activity necessary and feasible (Scharpf, 1997: 11; Schmidt, 2003). The scrutiny of the literature combined with our interviews alerted us to four different policy issue related conditions that can exert important influence: (xxi) media and parliamentary attention paid to the organization, (xxii) competition with other organizations; (xxiii) perceived measurability of the organizational outputs, and (xxiv) outcomes.

Category E: A final category of conditions that can be assumed to determine policy evaluation activity deals with the path of the organization. As Scharpf (2000: 768) states: ‘Not everything can be changed at the same time. In any one policy area, the body of existing policy must mostly be considered an invariant environment of present policy choices’. Whether organizations currently conduct policy evaluations will likely be influenced by their experiences with this policy supporting instrument in the past. In the present article, we examine the influence of (xxv) the extent of policy evaluation practice conducted by the organization’s predecessor(s), (xxvi) organizational stability, and (xxvii) ministerial stability. The first of these conditions refers to the situation prior to the NPM-oriented reforms which the Flemish administration recently underwent.

A binary translation of the cases

The present research relies on two methods that make use of Boolean data. They require all cases to be translated into binary coded conditions and outcomes (values: 0 and 1). The choice to dichotomize the data is a deliberate choice for simplification, to handle the complexity of the evaluation reality. We do not see this as a deficiency. On the contrary, we discern certain advantages.

First, simplification allows us to make progress in our understanding of complexity. All social science research, qualitative and quantitative, does imply some kind of simplification to enable the identification of general trends. Simplification is at least made explicit here, and can be subject to critical examination (De Meur and Berg-Schlosser, 1996; De Meur et al., 2009).

Second, we prefer to focus on broad trends that we can measure in a reliable and valid way, rather than focus on nuances that cannot be robustly measured. Most of our data is perceptual in nature. We observed that respondents were not always consistent in their perception about the extent of presence or absence of the same condition, in the interviews and in the survey. Moreover, respondents of a single organization did not always share the same nuances in perception, but agreed on broad lines. Focusing on dichotomous trends was therefore helpful to compensate for these differences in nuances. For instance, no matter whether an organization considered its evaluation skills as ‘rather sufficient’ or ‘sufficient’, a score of ‘1’ was given.

Third, the use of dichotomous data is a useful tool to engage practitioners. Civil servants or evaluation capacity builders are often faced with binary decisions, whether to do (a) or (b) to achieve a particular outcome. The Boolean translation of cases can help handle the complexity that practitioners face. A more fine-grained operationalization of the data, which fuzzy set QCA would require, does not deliver the decisiveness of crisp set QCA with binary conditions (Blackman, 2012).

A final argument is that the MSDO/MDSO technique is at present only applicable for binary coded settings.

In Table 1 we provide details as to how we calibrated the conditions. Conditions or outcomes assigned a score of ‘1’ or ‘0’ should respectively be read as ‘present’ and ‘absent’. The labels ‘present’ and ‘absent’ are not to be strictly interpreted, as they can also express concepts such as high versus low, or qualitative nominal differences, such as agency versus department. Except for the conditions for which objective indicators could be used (e.g. ‘organizational status’), we relied on the data collected via the structured survey. In the few instances where we noticed inconsistencies when comparing answers for a single organization, we searched for additional evidence among our respondents or via documents. The resulting overview of binary (0/1) coded conditions of the 29 analytical cases can be consulted on the website presenting the research (http://soc.kuleuven.be/io/eng/research/bel13.htm).

Table 1.

Overview of conditions, their indicators, and dichotomization thresholds.

Conditions	Code 0 indicators	Code 1 indicators
Category A: Capabilities of the organization
i) Skills to conduct in-house evaluations	Totally insufficient, rather insufficient	Rather sufficient, fully sufficient
ii) Skills to outsource evaluations	Totally insufficient, rather insufficient	Rather sufficient, fully sufficient
iii) Budgetary resources to evaluate	Totally insufficient, rather insufficient	Rather sufficient, fully sufficient
iv) Availability of capable staff	Totally insufficient, rather insufficient	Rather sufficient, fully sufficient
v) Availability of external evaluators	Totally insufficient, rather insufficient	Rather sufficient, fully sufficient
vi) Availability of monitoring information	Totally insufficient, rather insufficient	Rather sufficient, fully sufficient
Category B: Orientations
vii) Extent of evaluation demand from organizational management	No demand, hardly any demand	Sometimes demand, frequent demand
viii) Extent of evaluation demand from sector minister	No demand, hardly any demand	Sometimes demand, frequent demand
ix) Extent of evaluation demand from parliament	No demand, hardly any demand	Sometimes demand, frequent demand
x) Extent of evaluation demand from civil society organizations	No demand, hardly any demand	Sometimes demand, frequent demand
xi) Extent of evaluation demand from other organizations	No demand, hardly any demand	Sometimes demand, frequent demand
xii) Extent of organization-wide support for evaluations	Not at all, to limited extent	To major extent, to large extent
Category C: Conditions with regard to the institutional setting
xiii) Organizational size	Very low or low material weight (*)	At least average material weight
xiv) Organizational autonomy	No legal personality	Legal personality
xv) Organizational status	Agency	Department
xvi) Anchorage of evaluation function	No evaluation unit	Formal or de facto evaluation unit
xvii) Participation in evaluation community	No engagement in evaluation trainings or networks	At least sometimes participating in evaluation trainings or networks
xviii) Evaluation requirements stipulated in regulation or decrees at Flemish level	No evaluation requirements	Evaluation requirements
xix) Evaluation requirements stipulated in legislation/regulation at EU level	No evaluation requirements	Evaluation requirements
xx) Evaluation requirements stipulated in management agreement of organization	No evaluation requirements	Evaluation requirements
Category D: Conditions concerning policy issue characteristics
xxi) Attention by media or parliament for the tasks of the organization	Not at all, limited, rather limited	Rather much, much, very much
	Note: highest score of the assessments of attention by media or parliament.
xxii) Perceived competition on tasks of the organization	Not at all, limited, rather limited	Rather much, much, very much
xxiii) Perceived measurability of outputs	Average score ≤ 3 and/or qualification: very difficult, difficult, rather difficult	Average score ≥ 3 and/or qualification: very easy, rather easy, easy
xxiv) Perceived measurability of outcomes	Average score ≤ 3 and/or qualification: very difficult, difficult, rather difficult	Average score ≥ 3 and/or qualification: very easy, rather easy, easy
	Note: average score of measurability on a scale of 1 (very difficult to measure) to 5 (very easy to measure) of the three most important outputs and outcomes of the organization.
Category E: Conditions characterizing the path of the organization
xxv) Pre-NPM evaluation experience	No/seldom evaluation practice prior to the NPM reforms	Sometimes/frequent evaluation practice prior to the NPM reforms
xxvi) Organizational stability	Organization underwent medium or large changes (**)	Organization underwent no or small changes
xxvii) Ministerial stability	≥ 1 minister changes since the reforms	No ministerial turnover

(*)

The indicator combines financial material weight (weight: 50%) and material weight with regard to personnel (weight: 50%). For financial material weight, the following scales are used [in 10,000 EUR]: (1) very low material weight: 0–10,000; (2) low material weight: 10,000–50,000; (3) average material weight: 50,000–100,000; (4) high material weight: 100,000–500,000; (5) very high material weight: >500,000. As for material weight with regard to personnel, in staff numbers per organization: (1) very low: 0–100; (2) low: 101–200; (3) average: 201–400; (4) high: 401–900; (5) very high: >900. We calculated the average for the years 2007–2008–2009 (IAVA).

(**)

Four sub-criteria constitute this indicator. Three of them relate to the impact of the NPM-oriented reforms (60% of the indicator in total): (1) changes in the form of management/steering of the organization; (2) changes with regard to the composition of the public entity; (3) changes with regard to the organization of the management support services. The remaining 40% of the indicator refers to changes independent of the NPM reforms. Based on the sum of these sub-criteria, a scale was composed ranging from 0.1 to 0.5, with 0.5 standing for those organizations that underwent a large number of changes, 0.3 for those that underwent a medium number of changes, and 0.1 for those organizations with large stability. We calculated the average for the years 2007–2008–2009 (IAVA).
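The calibration rules of Table 1 and footnote (*) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, class upper bounds are treated as inclusive, and ‘at least average material weight’ is read as a 50/50 mean class score of 3 or higher. None of these details is documented in the text above.

```python
from bisect import bisect_left

# Upper bounds of classes 1-4 as read from footnote (*); class 5 lies above the last bound.
# Assumption: upper bounds are inclusive (e.g. 100 staff members is still 'very low').
FINANCIAL_BOUNDS = [10_000, 50_000, 100_000, 500_000]  # in the unit used in footnote (*)
STAFF_BOUNDS = [100, 200, 400, 900]                    # staff numbers per organization


def weight_class(value, bounds):
    """Return the material-weight class: 1 (very low) to 5 (very high)."""
    return bisect_left(bounds, value) + 1


def code_size(financial, staff):
    """Binary code for organizational size.

    Assumption: code 1 when the equally weighted mean of both classes is >= 3.
    """
    mean_class = (0.5 * weight_class(financial, FINANCIAL_BOUNDS)
                  + 0.5 * weight_class(staff, STAFF_BOUNDS))
    return 1 if mean_class >= 3 else 0


# Survey answers for the capability conditions (i to vi in Table 1).
SUFFICIENCY_CODES = {
    "totally insufficient": 0,
    "rather insufficient": 0,
    "rather sufficient": 1,
    "fully sufficient": 1,
}


def code_sufficiency(answer):
    """Dichotomize a four-point sufficiency answer into 0 or 1."""
    return SUFFICIENCY_CODES[answer.strip().lower()]
```

In this sketch, an organization with a financial weight in class 3 and 250 staff members (class 3) would receive code 1 for size, whereas one with class 1 and class 3 would receive code 0.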

In the remaining sections we organize our material by the techniques we use. With the MSDO/MDSO technique, we reduce the list of conditions to those with most explanatory relevance for evaluation activity and evaluation inactivity respectively. With QCA, we identify the combinations of conditions that led to the same outcome.

The MSDO/MDSO technique. Conditions of relevance in explaining evaluation activity and inactivity

We deliberately chose to start from a long list of conditions of potential explanatory relevance. To detect the conditions with most explanatory power, we rely on the Most Similar Different Outcome/Most Different Similar Outcome technique. MSDO/MDSO was originally developed by G. De Meur (1996) as a systematic comparative and pairwise technique to reduce the complexity of a large data set (De Meur and Berg-Schlosser, 1994). Key objectives of the technique are to identify the factors that can explain why the most different cases correspond with the same outcome values, and vice versa, why the most similar cases can be matched to different outcome values. The idea behind it is that the most ‘extreme’ pairs of cases, in terms of degree of similarity and difference, embody most explanatory potential. Applied to our research topic, this means that when two organizations hardly share any conditions, but both conduct evaluations, we can only understand this similar evaluation behavior by searching for their limited similarities. And inversely: when two organizations are similar in many aspects, but nevertheless differ in evaluation activity, we can only understand this variety by looking for their differences. The MSDO/MDSO method helps to detect the conditions that can account for these similarities and differences. Conditions identified can be considered as likely candidates with key explanatory potential that can be further examined in subsequent analyses (De Meur and Berg-Schlosser, 1996).

The current research attempts to explain two outcomes: (1) evaluation activity and (2) evaluation inactivity. We consequently conduct two separate MSDO/MDSO analyses. The choice of two different outcomes allows us to take account of the hybrid cases which plan evaluations, but which have not yet implemented these. These cases can neither be classified as fully active in evaluations, nor as fully inactive. In other words, they have an outcome value of ‘0’ for both outcomes. The two MSDO/MDSO analyses require different comparisons. In the analysis of (1) evaluation activity, we compare the group of cases that conduct evaluations with the group of cases that have no aspirations to conduct evaluations or have evaluation intentions but have not yet implemented these. In the analysis of (2) evaluation inactivity, we compare the group of cases without evaluation intentions with the group of cases that plan evaluations or have already been conducting evaluations.
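The coding of the two outcomes can be made concrete with a small Python sketch. The status labels and function names are hypothetical; only the case counts (18 active, two hybrid, nine inactive) come from the case selection section.

```python
# Hypothetical labels for the three evaluation statuses described in the text.
ACTIVE, PLANNED, INACTIVE = "active", "planned", "inactive"


def outcome_activity(status):
    """Outcome (1): 1 only for cases that actually conduct evaluations."""
    return 1 if status == ACTIVE else 0


def outcome_inactivity(status):
    """Outcome (2): 1 only for cases with neither evaluations nor plans."""
    return 1 if status == INACTIVE else 0


# Case counts reported in the case selection section.
statuses = [ACTIVE] * 18 + [PLANNED] * 2 + [INACTIVE] * 9

activity = [outcome_activity(s) for s in statuses]
inactivity = [outcome_inactivity(s) for s in statuses]

# Group sizes compared in each analysis (outcome 1 versus outcome 0).
print(sum(activity), len(activity) - sum(activity))        # 18 11
print(sum(inactivity), len(inactivity) - sum(inactivity))  # 9 20
```

The hybrid cases thus fall in the ‘0’ group of both analyses, which is why the two comparisons are not mirror images of each other.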

Each MSDO/MDSO analysis involves several steps. Within the scope of this article, we restrict ourselves to a concise presentation of the core tenets. More technical details can be found in, for example, De Meur, 1996; De Meur and Berg-Schlosser, 1996; De Meur and Gottcheiner, 2009; and De Meur et al., 2006.

Step 1: A first step is to identify the pairs of cases that are ‘most similar’ or ‘most different’. For this purpose, we need to calculate distances for the cases that share the same outcome value, and similarities for the cases with a different outcome value. As a measure of distance, the technique relies on the ‘Boolean distance’: the number of binary (0/1) coded conditions on which two cases differ from each other. MSDO/MDSO requires the calculation of the Boolean distance per category of conditions. As such, the technique takes into account that cases can be similar in one dimension/category, but dissimilar in another. We logically proceed with the five categories (A to E) of the framework that we presented in Table 1. Respectively they comprise 6 (category A), 6 (category B), 8 (category C), 4 (category D) and 3 (category E) conditions.
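As a minimal sketch of Step 1, the Boolean distance per category can be computed as follows. The two cases are made up for illustration and are not taken from the study's data set; the function names and the flat ordering of the 27 codes (A first, E last) are assumptions.

```python
# Number of conditions per category, as stated in Step 1.
CATEGORY_SIZES = {"A": 6, "B": 6, "C": 8, "D": 4, "E": 3}


def split_by_category(case):
    """Split a flat list of 27 binary codes into the five categories A to E."""
    assert len(case) == sum(CATEGORY_SIZES.values())
    parts, start = {}, 0
    for name, size in CATEGORY_SIZES.items():
        parts[name] = case[start:start + size]
        start += size
    return parts


def boolean_distance(x, y):
    """Number of conditions on which two binary-coded cases differ."""
    return sum(a != b for a, b in zip(x, y))


def distance_per_category(case_1, case_2):
    """Boolean distance between two cases, computed category by category."""
    p1, p2 = split_by_category(case_1), split_by_category(case_2)
    return {name: boolean_distance(p1[name], p2[name]) for name in CATEGORY_SIZES}


# Two made-up cases (illustrative only).
case_x = [1] * 6 + [0] * 6 + [1] * 8 + [0] * 4 + [1] * 3
case_y = [0] * 6 + [0] * 6 + [1, 1, 1, 1, 0, 0, 0, 0] + [0, 0, 0, 1] + [1, 1, 0]

print(distance_per_category(case_x, case_y))
# {'A': 6, 'B': 0, 'C': 4, 'D': 1, 'E': 1}
```

The example shows why distances are kept per category: the two made-up cases are maximally different in category A but identical in category B.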

Step 2: In a second step, levels of distance and proximity can be calculated, again category by category. The cases that differ most from each other are said to differ at ‘level 0’ D(0). ‘Level 1’ D(1) is 1 away from ‘level 0’, whether there is a pair with this value of distance or not, and so on. The inverse reasoning should be followed to calculate the similarity levels S(0) to S(k) between cases for MSDO.
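One possible reading of Step 2, for a single category, is sketched below in Python. The pair distances are made up, the function names are hypothetical, and the MSDO/MDSO software may apply additional rules that are not described here.

```python
def distance_levels(distances):
    """Distance levels: the largest observed distance is level 0, one less is level 1, etc."""
    d_max = max(distances.values())
    return {pair: d_max - d for pair, d in distances.items()}


def similarity_levels(distances):
    """Inverse reasoning for MSDO: the smallest observed distance is level 0."""
    d_min = min(distances.values())
    return {pair: d - d_min for pair, d in distances.items()}


# Made-up Boolean distances for one category, keyed by pair of case numbers.
distances = {(1, 2): 5, (1, 3): 3, (2, 3): 2}

print(distance_levels(distances))    # {(1, 2): 0, (1, 3): 2, (2, 3): 3}
print(similarity_levels(distances))  # {(1, 2): 3, (1, 3): 1, (2, 3): 0}
```

Note that in this made-up example no pair sits at distance level 1: levels are defined by their offset from level 0, whether or not a pair with that distance exists.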

Step 3: The Boolean distances per pair of cases can be compared with those levels of (dis)similarity, on the basis of which an aggregated overview can be composed that presents the levels for the various categories altogether. Figure 1 illustrates the levels of (dis)similarity for each pair of cases across the five categories for the analysis of evaluation activity. The figure is divided into three zones, relating to the values of the outcome of the pairs compared:

Zone 1: The levels of distance of the pairs that conduct evaluations;

Zone 2: The levels of distance of the pairs that do not or not yet conduct evaluations;

Zone 3: The levels of proximity of the cases with different outcome values.

Figure 1.

MSDO/MDSO analysis of evaluation activity. Levels of (dis)similarity for each pair of cases across the five categories (see the dotted border, with 01234 standing for categories A to E). Output generated by the MSDO/MDSO software (version 8/7/2006), developed by G. De Meur (available via http://www.jchr.be/01/beta.htm).

Consider, for instance, the (underlined) pair of cases 4 and 17 (0-0-1) in zone 1. For category A and C, the pair differs at level 0, for category E, at level 1. For categories B and D the pair can be said not to differ substantially (at least half of the variables are valued the same). The latter categories are marked by a dash (-).
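The way such a five-character code can be read is sketched below in Python for a most-different (zone 1 or zone 2) pair. This is an interpretation of the description above, not the software's documented logic: the function name and the input values are hypothetical, and the dash rule is taken to mean that the pair differs on at most half of the conditions in a category.

```python
# Number of conditions per category (A to E).
CATEGORY_SIZES = {"A": 6, "B": 6, "C": 8, "D": 4, "E": 3}


def level_signature(distances, max_distances):
    """Render a pair's distance levels for categories A to E as a 5-character string."""
    chars = []
    for name, size in CATEGORY_SIZES.items():
        d = distances[name]
        if 2 * d <= size:
            # At least half of the conditions share the same value: shown as a dash.
            chars.append("-")
        else:
            # Level = offset from the largest distance observed in this category.
            chars.append(str(max_distances[name] - d))
    return "".join(chars)


# Made-up distances for one pair and made-up maxima per category.
pair_distances = {"A": 6, "B": 2, "C": 7, "D": 1, "E": 2}
max_distances = {"A": 6, "B": 5, "C": 7, "D": 4, "E": 3}

print(level_signature(pair_distances, max_distances))  # 0-0-1
```

With these made-up inputs the sketch reproduces the pattern of the example pair: level 0 for categories A and C, level 1 for category E, and a dash for categories B and D.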

Step 4: For our research purposes, we interpreted the labels ‘most similar’ and ‘most different’ rather restrictively, and therefore decided to only single out the pairs of cases that reach levels S(0)/D(0) and/or S(1)/D(1) for the highest number of categories. As such, we concentrated our analysis on the pairs that are most similar/different in depth, but also in breadth on the highest levels (De Meur and Berg-Schlosser, 1994; De Meur et al., 2006). The above mentioned pair of cases 4 and 17 can be considered one of the most remarkable pairs. For three categories, it differs at level 0 or 1.

Step 5: Once the relevant pairs of cases and categories are selected, we can look for the conditions that matter most in the categories identified. In the case of zone 1 (outcome value = 1), we look for those conditions on which the most different cases achieve the same value. For instance, for the pair of cases 4 and 17, we check the conditions on which the cases achieve the same value. We only consider the conditions in categories A, C and E. These are the categories for which the pair is different at level 0 or 1.

The same procedure should be followed for zone 2 (outcome value = 0). In the case of zone 3 (outcome values 1 versus 0), we are especially interested in those conditions for which the cases achieve a different value. Not all conditions are equally relevant. We only kept the conditions that were mentioned at least twice across several (dis)similar pairs of cases. Table 2 lists the conditions identified as having the most explanatory power for evaluation activity and evaluation inactivity.
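Step 5 and the ‘mentioned at least twice’ rule can be sketched in Python as follows. The condition names and case values are hypothetical, and counting one mention per pair is an assumption about how the rule was applied.

```python
from collections import Counter


def shared_conditions(case_1, case_2, names):
    """Zones 1 and 2 (MDSO): conditions on which two most different cases agree."""
    return [n for n in names if case_1[n] == case_2[n]]


def differing_conditions(case_1, case_2, names):
    """Zone 3 (MSDO): conditions on which two most similar cases disagree."""
    return [n for n in names if case_1[n] != case_2[n]]


def retained(candidates_per_pair, min_mentions=2):
    """Keep the conditions mentioned by at least `min_mentions` pairs."""
    counts = Counter(c for candidates in candidates_per_pair for c in candidates)
    return sorted(c for c, k in counts.items() if k >= min_mentions)


# Hypothetical condition names and made-up binary codes.
names = ["inhouse_skills", "outsourcing_skills", "budget"]
case_a = {"inhouse_skills": 1, "outsourcing_skills": 1, "budget": 0}
case_b = {"inhouse_skills": 0, "outsourcing_skills": 1, "budget": 1}
case_c = {"inhouse_skills": 1, "outsourcing_skills": 1, "budget": 1}

candidates = [
    shared_conditions(case_a, case_b, names),  # ['outsourcing_skills']
    shared_conditions(case_b, case_c, names),  # ['outsourcing_skills', 'budget']
]
print(retained(candidates))  # ['outsourcing_skills']
```

In the actual analysis, `names` would be restricted to the conditions of the categories selected for each pair in Step 4.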

Table 2.

Relevant conditions in explaining evaluation activity and evaluation inactivity.

	Category	Conditions	Analysis of evaluation activity	Analysis of evaluation inactivity
AGENCY	A. Capabilities of the organization	Skills to conduct in-house evaluations	x
		Skills to outsource evaluations	x	x
		Financial means to evaluate	x
		Availability of staff		x
		Availability of external evaluators	x
		Availability of monitoring information
	B. Actor orientations	Evaluation demand from organizational management	x	x
		Evaluation demand from sector minister and his/her advisors	x	x
		Evaluation demand from parliament
		Evaluation demand from other organizations of the policy domain	x
		Evaluation demand from civil society organizations	x	x
		Organizational support for evaluations
STRUCTURE	C. Conditions related to the institutional setting	Size
		Status	x	x
		Autonomy
		Anchorage of the evaluation function	x
		Participation in trainings and networks
		Evaluation requirements embedded in Flemish legislative documents
		Evaluation requirements stipulated in management agreement
		Evaluation requirements imposed by international organizations
	D. Conditions concerning policy issue characteristics	Media and parliamentary attention
		Extent of competition	x
		Measurability of organizational outputs	x	x
		Measurability of organizational outcomes	x
	E. Conditions concerning the path of the organization	Pre-NPM evaluation experience	x	x
		Organizational stability
		Ministerial stability	x

Comparing the conditions yielded by the analyses of evaluation activity and evaluation inactivity gives a partially, but not entirely, overlapping picture. We will discuss the different explanatory conditions in depth when presenting the findings of the QCA. Here, we draw attention to the conditions that turn out not to have strong power to explain why Flemish organizations do (not) evaluate. Overall, the institutional setting of the organization (category C) came up as the least powerful in explaining either why similar organizations display different outcomes or why most different organizations get the same outcome value. The other categories have more explanatory potential. Across all categories, the following conditions come up as least relevant in explaining evaluation activity or inactivity in Flanders.

The availability of monitoring information. Many organizations that conduct evaluations do not have a well-developed monitoring system. The evaluation field offers a wide array of approaches or techniques that do not necessarily require the existence of a lot of monitoring data.

Organizational support for policy evaluations. A small group of actors supporting evaluations within the organization is often sufficient to get them implemented. Intra-organizational resistance is not desirable, but can be overcome.

Evaluation demand from parliament. In Flanders, members of parliament only seldom demand evaluations (Speer et al., 2015). And in case they request evaluations, this request is not always followed up.

Organizational autonomy. This observation only pertains to the possession of a legal personality. We cannot make any speculations on the impact of any other forms of autonomy (e.g. policy or financial autonomy).

Compulsory evaluation requirements. In Flanders, it is not (yet) common practice to institutionalize evaluation requirements in legal or regulatory documents, as is the case in other countries, such as the Netherlands. Although some Flemish organizations are confronted with legislative evaluation requirements (stipulated in management agreements, or in Flemish or European Union legislative documents), such requirements do not belong to the strongest explanatory conditions to understand why organizations do (not) conduct evaluations.

Participation of staff in evaluation training and networks. The NPM-oriented reform was one of the main triggers for launching the ‘Flemish Evaluation Platform’ in December 2007. Yet, the participation of staff members in this platform and in other training is not key to understanding whether organizations are active in evaluations.

Media and parliamentary attention paid to the tasks of the organization. Haarich and del Castillo Hermosa (2004), for instance, have identified media and parliamentary attention paid to evaluation as a potential factor that can trigger the attention of policy makers, causing them to initiate an evaluation. In Flanders, however, this condition is not very powerful in explaining why some organizations evaluate policy while others do not.

Organizational stability. Many organizations underwent significant reorganizations as a result of the reforms. Yet, these changes have barely affected their evaluation activity.

Two-step QCA: Combinations of conditions explaining evaluation activity and inactivity

Having identified, through the MSDO/MDSO analysis, the conditions of likely explanatory relevance for evaluation activity and evaluation inactivity, we now turn to the combinations of conditions that may provide a sufficient explanation. Given the complexity of the evaluation reality, as well as the important role of context in this regard, it is highly unlikely that a given condition has a permanent, universal effect on the outcome. It seems more plausible to assume that the effect of the presence or absence of a condition differs depending on the wider context. Similarly, we expect that a combination of conditions will lead to a certain evaluation profile, and that different configurations can be associated with the same outcome (Schneider and Wagemann, 2012: 78).

QCA is rooted in these broad assumptions. In the evaluation field, the method is receiving increasing attention (see for instance: Befani and Sager, 2006; Befani et al., 2007; Marx, 2005; Sager and Andereggen, 2012; Varone et al., 2006). Basically, QCA groups cases into combinations of conditions with similar outcomes, and minimizes these combinations by logical reduction (Blackman, 2012). The logic of the minimization is strongly inspired by Mill’s (1973 [1843]) method of pairwise difference, which assumes that ‘if two configurations differ only in one condition but show the same outcome, this distinguishing condition is irrelevant and can be eliminated’ (Ragin, 1987: 93). QCA repeats the pairwise comparison of configurations until no further minimization is possible. The minimized solutions no longer comprise any redundant conditions. For ECB, this information is a most fruitful means of ascertaining possible ‘recipes’ that can account for a successful or failing outcome. For detailed information on the method, we refer to specialized works such as Ragin (1987, 2000, 2008), Rihoux and Ragin (2009) and Schneider and Wagemann (2010, 2012).
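The pairwise reduction rule quoted above can be illustrated with a minimal sketch. It is not the software routine used in this study: the three condition labels (EXP, UNIT, OUTCOME_MEAS) and the five configurations are hypothetical toy data, each configuration is written as a string of 1 (present), 0 (absent) and - (eliminated), and the sketch stops at the fully reduced terms without selecting a minimal set among them or considering unobserved configurations.

```python
from itertools import combinations


def merge(a, b):
    """Merge two configurations that differ in exactly one condition.

    Returns the reduced configuration, or None if the pair cannot be merged.
    """
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None


def minimize(configs):
    """Repeat the pairwise comparison until no further reduction is possible."""
    current = set(configs)
    reduced = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        # Configurations that could not be merged any further are kept as they are.
        reduced |= current - used
        current = merged
    return reduced


# Hypothetical configurations that all show the outcome (toy data, not the study's cases).
# Assumed condition order: EXP, UNIT, OUTCOME_MEAS.
observed = ["100", "101", "110", "111", "011"]
print(sorted(minimize(observed)))  # ['-11', '1--']
```

In this toy example the five configurations reduce to two terms: EXP present on its own, and UNIT combined with OUTCOME_MEAS.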

How do we proceed in practice? Do we apply QCA to both structure- and agency-based conditions at once or do we, as Scharpf (1997: 42) has also emphasized, consider them as different steps in our explanatory exercise? As the structural conditions constitute the setting in which actors behave, for many purposes, knowledge of these conditions will be sufficient to understand a particular situation (Scharpf, 1997: 28). Structural conditions not only delimit the range of choice of actors, but also impact the type of actor constellations involved, including their capabilities and orientations vis-à-vis policy evaluation. For evaluation capacity builders, it is valuable to know which structural conditions foster or impede evaluation activity. It does not make sense to invest resources in ECB if the wider structural setting of an organization is not beneficial for evaluation activity. In our research, therefore, as a first step, we only include the more structural conditions in our analysis. Only if these cannot provide a sufficient explanation will we include the actor-related conditions. As a matter of fact, this two-step approach has increasingly gained ground as a tool to arrive at more subtle and parsimonious explanations in QCA (Mannewitz, 2011; Schneider and Wagemann, 2003). Within the scope of this article, we only present the paths that can be associated with the conduct of evaluations and those that can be associated with the non-conduct of evaluations. We do not present the paths that include the cases with a hybrid status, i.e. those that plan evaluations but have not yet implemented them.

We proceed with the conditions identified as relevant by the MSDO/MDSO analysis. As we revealed, the set of conditions to explain evaluation activity is different from the set of conditions that are most relevant to explain evaluation inactivity.

Paths associated with evaluation activity

Applying the QCA minimization method to the relevant structural conditions in the eighteen cases that conduct evaluations results in four paths, or combinations of conditions with common outcomes (Blackman, 2012).² For evaluation capacity builders, this implies that there are multiple scenarios that can all account for the same outcome. All paths are theoretically equivalent. We nevertheless order them by empirical relevance and report the coverage scores of each path in parentheses. Raw coverage (RC) refers to the proportion of cases covered by a given path. Unique coverage (UC) refers to the proportion of cases covered by that path only, and by no other path (Schneider and Wagemann, 2012: 133).

- Configuration 1: Pre-reform evaluation experience (RC: 0.83. UC: 0.72)

- Configuration 2: Presence of an evaluation unit and outcomes that are easy to measure (RC: 0.17. UC: 0.06)

- Configuration 3: Absence of an evaluation unit and department status and competition (RC: 0.06. UC: 0.06)

- Configuration 4: Absence of an evaluation unit and department status and frequent ministerial turnover (RC: 0.06. UC: 0.06)
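The arithmetic behind these coverage scores can be sketched as follows. The case numbers and their assignment to paths are hypothetical: the sketch uses one arrangement of eighteen cases that reproduces the reported scores, and should not be read as the actual case memberships. The function name `coverage` is likewise an illustrative choice.

```python
def coverage(paths, n_cases):
    """Return {path: (raw coverage, unique coverage)} for crisp-set paths.

    paths maps a path name to the set of case ids it covers;
    n_cases is the number of cases showing the outcome.
    """
    scores = {}
    for name, cases in paths.items():
        # Cases covered by any of the other paths.
        others = set().union(*(c for n, c in paths.items() if n != name))
        raw = len(cases) / n_cases
        unique = len(cases - others) / n_cases
        scores[name] = (round(raw, 2), round(unique, 2))
    return scores


# Hypothetical assignment of 18 cases (numbered 1 to 18) to the four paths.
paths = {
    "configuration 1": set(range(1, 16)),  # 15 cases, 2 of them shared
    "configuration 2": {14, 15, 16},       # 3 cases, 1 of them unique
    "configuration 3": {17},
    "configuration 4": {18},
}
for name, (rc, uc) in coverage(paths, n_cases=18).items():
    print(f"{name}: RC {rc:.2f}, UC {uc:.2f}")
```

Under this assumed assignment, raw coverage is simply the number of cases on a path divided by eighteen (15/18 for configuration 1), and unique coverage counts only the cases that no other path covers (13/18 for configuration 1).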

Configuration 1

The most empirically relevant path, covering 83 per cent of the cases, revolves around the pre-NPM reform evaluation experience of the organization. Eighty-three per cent of our cases that conduct evaluations had already done so prior to the reforms. Moreover, all cases that conducted evaluations prior to the implementation of the framework are still doing so. In other words, having previous evaluation experience seems to be a major reason to continue with evaluations after the reforms. The implementation of the NPM framework has not had any significant impact, irrespective of the number of evaluations the organization’s predecessor conducted. This applies to both departments and agencies. In fact, many agencies have a much longer evaluation history than departments, and are hence not very keen to give up the capacity which they developed internally over the years. As a respondent remarked in this respect: ‘every organization is conservative, and strives to keep what it has, and preferably tries to expand this’. The coming into force of the reform framework has not changed this to a great extent. The discourse of the NPM framework, with a separation of responsibilities between department and agency in policy evaluation, is not rigidly implemented in practice.

Configurations 2, 3 and 4 apply to the cases that do not have this pre-reform evaluation expertise.

Configuration 2

The second configuration emphasizes the importance of the anchorage of the evaluation function. In our operationalization, anchorage ranges from having a large evaluation unit to having just a single staff member identified in the organogram of the organization whose task is the coordination of evaluations. Whatever the modalities of anchorage, we observe that an anchored evaluation function creates professional capacity for evaluation. The staff members working in these units operate as important mediators in this respect. While the reform framework puts emphasis on the departments to embody the evaluation function, it has not prevented some agencies from establishing an evaluation unit. This unit can help mobilize the resources to evaluate. Yet, an evaluation unit in itself is not a sufficient guarantee for evaluation. For the cases without evaluation experience, the evaluation unit proved fruitful only in situations in which the outcomes are easy to measure. When outcomes are not very costly to observe, an important barrier is already removed and it is easier to proceed to the conduct of evaluations.

Configurations 3 and 4

These configurations sketch two other, less frequent alternatives that lead organizations without pre-reform evaluation expertise to evaluate. The paths apply to departments without an evaluation unit. While the impact of the NPM reform was limited in the sense that it did not have a major influence on those organizations that already evaluated, it constituted an important evaluation stimulus for those organizations that acquired department status with the reforms. Competition (configuration 3) does not prevent these departments from evaluating. Competition can even trigger evaluations. Positive evaluation findings can be used to reinforce a department’s position vis-à-vis competing organizations. This observation is in line with Lindkvist’s (1996) observation of Swedish hospitals’ performance measurement systems. For hospitals that face competition from the wider environment, performance measurement systems were more easily introduced. The same can apply to policy evaluations. Configuration 4 brings ministerial rotation into the picture. Although ministerial rotation is generally not considered beneficial for the institutionalization of the evaluation function (Schwartz, 1998; Weiss, 1993), the interest in evaluation can also grow with a new minister coming into power. This was the case for the department of configuration 4.

Comparing the various paths, the conditions spring from the three different categories (C, D, E). This confirms their explanatory relevance. Together, the seven structural conditions have sufficient explanatory power to provide a consistent explanation for evaluation activity. There are no organizations with the same attributes that do not conduct evaluations. Strictly analytically, there is therefore no need to proceed to the analysis of agency-related conditions. This implies that if we know the values of cases on these structural conditions, we can to a large extent predict whether the case will conduct evaluations or not.
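The consistency check that decides whether the second step is needed can be sketched as a search for contradictory configurations, i.e. identical combinations of condition values shared by cases with different outcomes. The cases, condition labels (EXP, OUTPUT_MEAS, MGMT_DEMAND) and values below are hypothetical toy data, not the study's truth table, and the function name is an illustrative assumption.

```python
from collections import defaultdict


def contradictory_configurations(cases, conditions, outcome="EVAL"):
    """Return the configurations shared by cases with different outcome values."""
    outcomes_by_config = defaultdict(set)
    for case in cases:
        key = tuple(case[c] for c in conditions)
        outcomes_by_config[key].add(case[outcome])
    return [key for key, values in outcomes_by_config.items() if len(values) > 1]


# Hypothetical cases (1 = present, 0 = absent); EVAL is the outcome.
cases = [
    {"EXP": 1, "OUTPUT_MEAS": 0, "MGMT_DEMAND": 1, "EVAL": 1},
    {"EXP": 0, "OUTPUT_MEAS": 1, "MGMT_DEMAND": 0, "EVAL": 0},
    {"EXP": 0, "OUTPUT_MEAS": 1, "MGMT_DEMAND": 1, "EVAL": 1},
]

# Step 1: structural (remote) conditions only.
step1 = contradictory_configurations(cases, ["EXP", "OUTPUT_MEAS"])
print(step1)  # [(0, 1)]

# Step 2: add an actor-related (proximate) condition only if step 1 is inconsistent.
if step1:
    step2 = contradictory_configurations(cases, ["EXP", "OUTPUT_MEAS", "MGMT_DEMAND"])
    print(step2)  # []
```

An empty result after the first step corresponds to the situation described above for evaluation activity, in which the structural conditions alone give a consistent explanation. A non-empty result signals that actor-related conditions have to be brought in.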

Paths associated with evaluation inactivity

Nine cases do not practice any evaluation, and do not have any intention to do this in the near future. Comparing the values of the cases on the remote conditions yields a single path:

- No pre-reform evaluation experience and outputs that are easy to measure.

This path emphasizes the absence of pre-reform evaluation experience in combination with outputs that are easy to measure. Having outputs that are easy to measure usually applies to organizations providing tangible services or conducting highly perceptible tasks. Production organizations (for instance: the Flemish Social Housing Company) and procedural organizations (for instance: the Flemish Public Transport Agency) are often assumed to belong to this category (Wilson, 1989). As service provision is the primary task of such organizations, policy evaluation is not always a priority for them. Yet, this observation does not apply to all cases. There are also cases with the same characteristics that do practice evaluations. Therefore, to understand why organizations do not evaluate, the inclusion of the agency-related conditions is essential. The MSDO/MDSO analysis identified five actor-related elements likely at play in the explanation (see Table 2). Two paths can be discerned if we compare the proximate conditions in this structural setting:

- Configuration 1: No pre-reform evaluation experience and outputs that are easy to measure and absent evaluation demand from organizational management and absent evaluation demand from civil society organizations (RC: 0.67. UC: 0.67)

- Configuration 2: No pre-reform evaluation experience and outputs that are easy to measure and absent skills to outsource evaluations (RC: 0.33. UC: 0.22)

Configuration 1

Most empirically important is a combination of absent evaluation demand from organizational management along with absent evaluation demand from civil society organizations. Organizational management occupies a major role in the promotion of evaluation activity. Many of our interviewees also emphasized this. As stated by one of them: ‘[evaluation] requires a certain attitude, which should be taken into account when appointing organizational management. You need to have the support of the management. If you lack this, developing evaluation capacities is meaningless’. In addition, several organizations identified civil society organizations as an actor category that is increasingly demanding evaluations.

Configuration 2

A second explanation, reaching a raw coverage of 33 per cent, revolves around the absence of skills to outsource evaluations as a core condition. An organization will have great difficulty commissioning an evaluation when it lacks the basic skills to provide minimal guidance to external experts.

Two general observations can be made, when we compare the QCA analysis of evaluation activity with the analysis of evaluation inactivity. First, there are fewer paths for the explanation of the non-conduct of evaluation than for the actual conduct of evaluations. Second, for those cases that do conduct evaluations, a consistent explanation can be reached solely on the basis of structural conditions.

Conclusion

The present study aimed at explaining differences in evaluation activity and inactivity across public sector organizations in Flanders. In this evidence-based era, the variance in policy evaluation activity across public sector organizations is still significant. This, in turn, explains the necessity, as well as the raison d’être, of ECB exercises and attempts within the public sector. Employing two configurational comparative methods, MSDO/MDSO and crisp set QCA, we systematically compared 27 organizations, both departments and agencies. They represent about two-fifths (41.5%) of all Flemish entities that have been included in the NPM reform framework. The following findings are of interest to both policy evaluation scholars and practitioners.

Policy evaluation activity is still not widespread in Flanders. Eleven of the twenty-nine cases do not (yet) conduct evaluations. NPM is considered to have played a major catalyst role in the diffusion of policy evaluation. Yet, the extent of diffusion should be nuanced.

Several conditions that receive considerable attention in the evaluation literature do not have great explanatory power as to why Flemish organizations do (not) evaluate. This is the case for: the availability of a well-developed monitoring system; the participation of staff in evaluation networks and training; the autonomy of the organization; evaluation demand from parliament; the degree of media and parliamentary attention paid to tasks of the organization; organizational stability; and regulatory evaluation requirements.

In Flanders, the influence of the introduction of NPM is mixed. On the one hand, the implementation of the reform framework has not had a major impact on the extent of evaluation activity for those organizations previously active in evaluation. All organizations with past evaluation activity continued to evaluate. From a perspective of ECB, this is encouraging. Once an organization has ‘tasted’ evaluation practice, the appetite to continue on this track is stimulated. On the other hand, the introduction of the framework served as an important trigger for departments without that expertise. Our study shows that the NPM-inspired separation of evaluation responsibilities between departments and agencies cannot be easily forced on those agencies that undertook evaluations prior to the reforms.

Having an evaluation unit is not essential for evaluation activity, but it can help to mobilize the resources for it. Moreover, it can function as an institutional barrier against downsizing for organizations that want to retain their evaluation capacity. This especially applies to the agencies that no longer have an official evaluation role. If policy makers would like to secure continuity in policy evaluation, they should invest in the institutional anchoring of the evaluation function.

The perception of the measurability of organizational outputs and outcomes appears to be important. On the one hand, we observed that many organizations that consider their outputs as easy to measure – typically production and procedural agencies – are not often inclined to evaluate. On the other hand, we observed that outcomes that are easy to measure facilitate evaluation activity.

The extent of measurability is, however, not an absolute given. It depends strongly on the extent to which the required outputs and outcomes of the organization are clearly (SMART) formulated, and on attitudes vis-à-vis evaluation. This perception can be influenced to some extent. Organizational management can play a key role in this regard.

A lack of evaluation demand from organizational management has been demonstrated to be an important explanatory factor for evaluation inactivity. As a top ministerial advisor put it: ‘if you have a minister(ial cabinet) that is evaluation minded, you can stimulate evaluations. Yet, the constant factor in the story is the administration’. If governments want to invest in the diffusion of policy evaluation, they should pay attention to evaluation when recruiting new management or when evaluating the present management.

The organizations that conduct evaluations all have the skills to at least outsource evaluations. If policy evaluation is a genuine concern of policy makers, the necessary training should be provided, or inter-organizational pools should be established to facilitate the sharing of expertise.

Overall, we identified more scenarios explaining why organizations evaluate than why they do not evaluate. Yet, for those cases that do conduct evaluations, a fully consistent explanation could already be achieved on the basis of the structural conditions alone, which confirms their predictive power. Moreover, for each of the two outcomes, we could find a single path covering at least two-thirds of the cases (raw coverage of 0.83 for evaluation activity and 0.67 for evaluation inactivity). The complex nature of evaluation activity could thus be reduced to a limited number of configurations. Each configuration is to be read as a potential recipe for evaluation capacity builders.

The present analysis is a first attempt to develop more systematic insights on the dynamics behind organizational differences in evaluation activity on the basis of configurational comparative methods. Future studies should assess the real causal mechanisms behind the different paths that lead to evaluation activity or inactivity, and should preferably scrutinize a selected number of conditions in more depth. Similarly, the scope of external validity remains to be explored.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Notes

Valérie Pattyn is Postdoctoral Researcher at KU Leuven Public Governance Institute. The article presents part of her PhD research. Other research interests include policy advisory actors, policy capacity, policy analysis, teaching public administration, and comparative methods.

References

Befani and Sager (2006) QCA as a tool for realistic evaluations: the case of the Swiss environmental impact assessment. In: Rihoux and Grimm (eds), Innovative Comparative Methods for Policy Analysis. Beyond the Quantitative-Qualitative Divide. New York: Springer, 263–84.

Befani, Ledermann and Sager (2007) Realistic evaluation and QCA: conceptual parallels and an empirical application. Evaluation 13(2): 171–92.

Blackman (2012) Rethinking policy related research: charting a path using Qualitative Comparative Analysis and complexity theory. Contemporary Social Science: Journal of the Academy of Social Sciences. URL: https://www-tandfonline-com.web.bisu.edu.cn/doi/abs/10.1080/21582041.2012.751500

Delreux and Hesters (2010) Solving contradictory simplifying assumptions in QCA: presentation of a new best practice. COMPASSS Working Paper, 58. URL: http://www.compass.org

De Meur (1996) La comparaison des systèmes politiques: recherche des similarités et des différences. Revue Internationale de Politique Comparée 3(2): 405–37.

De Meur and Berg-Schlosser (1994) Comparing political systems: establishing similarities and dissimilarities. European Journal of Political Research 26: 193–219.

De Meur and Berg-Schlosser (1996) Conditions of authoritarianism, fascism and democracy in inter-war Europe: systematic matching and contrasting of cases for ‘small n’ analysis. Comparative Political Studies 29(4): 423–68.

De Meur and Gottcheiner (2009) The logic and assumptions of MDSO/MSDO designs. In: Byrne and Ragin (eds), The Sage Handbook of Case-Based Methods. London: SAGE, 208–21.

De Meur, Bursens and Gottcheiner (2006) MSDO/MDSO revisited for public policy analysis. In: Rihoux and Grimm (eds), Innovative Comparative Methods for Policy Analysis. Beyond the Quantitative-Qualitative Divide. New York: Springer, 67–94.

De Meur, Rihoux and Yamasaki (2009) Addressing the critiques of QCA. In: Rihoux and Ragin (eds), Configurational Comparative Methods. Qualitative Comparative Analysis (QCA) and Related Techniques. London: SAGE, 147–78.

Eliadis, Furubo and Jacob (eds) (2011) Evaluation. Seeking Truth or Power? Comparative Policy Evaluation Volume 17. New Brunswick: Transaction Publishers.

Fiss (2011) Building better causal theories: a fuzzy-set approach to typologies in organization research. The Academy of Management Journal 54(2): 393–420.

Furubo and Sandahl (2002) Introduction: a diffusion perspective on global developments in evaluation. In: Furubo, Rist and Sandahl (eds), International Atlas of Evaluation. New Brunswick, NJ: Transaction Publishers, 1–23.

Gray and Jenkins (2011) Policy and evaluation: many powers, many truths. In: Eliadis, Furubo and Jacob (eds), Evaluation. Seeking Truth or Power? Comparative Policy Evaluation Volume 17. New Brunswick, NJ: Transaction Publishers, 39–53.

Haarich and del Castillo Hermosa (2004) Development of evaluation systems – evaluation capacity building in the framework of the new challenges of EU structural policy. Paper presented at the ESRA 2004 Conference, Porto, Portugal.

IAVA (2007, 2008, 2009) Jaarverslag van het Auditcomité en het Agentschap Interne Audit van de Vlaamse Administratie. URL (consulted 10 September 2013): http://www2.vlaanderen.be/doelbewustmanagement/jaarverslagen.html

Lindkvist (1996) Performance based compensation in health care. A Swedish experience. Financial Accountability and Management 12(2): 89–105.

Mannewitz (2011) Two-level theories in QCA: a discussion of Schneider and Wagemann’s two-step approach. COMPASSS Working Paper, 64. URL (consulted 1 May 2012): http://www.compass.org

Mill (1973 [1843]) Of the four methods of experimental inquiry. In: The Collected Works of John Stuart Mill (Vol. VII – A System of Logic Ratiocinative and Inductive). London: Routledge and Kegan Paul, 388–406.

Ragin (1987) The Comparative Method. Moving Beyond Qualitative and Quantitative Strategies. London: University of California Press.

Ragin (2000) Fuzzy Set Social Science. Chicago, IL: University of Chicago Press.

Ragin (2008) Redesigning Social Inquiry: Fuzzy Sets and Beyond. Chicago, IL: University of Chicago Press.

Rihoux and Ragin (2009) Configurational Comparative Methods. Qualitative Comparative Analysis (QCA) and Related Techniques. Thousand Oaks, CA: SAGE.

Sager and Andereggen (2012) Dealing with complex causality in realist synthesis: the promise of qualitative comparative analysis. American Journal of Evaluation 33(1): 60–78.

Scharpf (1997) Games Real Actors Play: Actor Centred Institutionalism in Policy Research. Oxford: Westview Press.

Scharpf (2000) Institutions in comparative policy research. Max Planck Institut für Gesellschaftsforschung working paper 00 (3). URL (consulted 1 May 2012): http://hdl.handle.net/10419/44254

Schmidt (2003) The boundaries of ‘bounded generalizations’: discourse as the missing factor in actor-centered institutionalism. In: Mayntz and Streeck (eds), Die Reformierbarkeit der Demokratie: Innovationen und Blockaden: Festschrift für Fritz W. Scharpf. Frankfurt: Campus, 318–50.

Schneider and Wagemann (2003) Improving inference with a two-step approach: theory and limited diversity in fs/QCA. EUI Working Papers. 2003/7. European University Institute: San Domenico di Fiesole.

Schneider and Wagemann (2010) Standards of good practice in qualitative comparative analysis (QCA) and fuzzy sets. Comparative Sociology 9(3): 397–418.

Schneider and Wagemann (2012) Set-Theoretic Methods for the Social Sciences. A Guide to Qualitative Comparative Analysis. Cambridge: Cambridge University Press.

Schwartz (1998) The politics of evaluation reconsidered: a comparative study of Israeli programs. Evaluation 4(3): 294–309.

Scriven (1991) Evaluation Thesaurus, 4th edn. Newbury Park, CA: SAGE.

Speer, Pattyn and De Peuter (2015) The growing role of evaluation in parliaments: holding governments accountable? International Review of Administrative Sciences 81, forthcoming.

Stame (2003) Evaluation and the policy context: the European experience. Evaluation Journal of Australasia 3(2): 36–43.

Varone, Jacob and De Winter (2005) Polity, politics and policy evaluation in Belgium. Evaluation 11(3): 253–73.

Varone, Rihoux and Marx (2006) A new method for policy evaluation? Longstanding challenges and the possibilities of qualitative comparative analysis (QCA). In: Rihoux and Grimm (eds), Innovative Comparative Methods for Policy Analysis. Beyond the Quantitative-Qualitative Divide. New York: Springer, 213–36.

Vlaams Parlement (2003) Kaderdecreet Bestuurlijk Beleid. 18/7/2003. Brussel: Vlaams Parlement.

Weiss (1993) Where politics and evaluation research meet. American Journal of Evaluation 14(1): 93–106.

Wilson (1989) Bureaucracy. New York: Basic Books.

Witte (2006) Change of degrees and degrees of change. Comparing adaptations of European Higher Education Systems in the context of the Bologna Process. PhD Thesis, Universiteit Twente, NL.