Abstract
This work proposes a method for assessment of evaluation processes, applicable to systems in which both measurable and policy-based performance criteria coexist. It intends to provide a mapping procedure, based solely on measurable criteria, aiming at visual inspection of the outcome, which allows the introduction of two types of indicators at different phases of the process. The definition of performance-based zones is suggested: two of them clearly correspond to high/low performance, in which the overall evaluation result is essentially established on the basis of measurable indicators, i.e. on the basis of objective criteria, and an in-between area, more prone to a less deterministic outcome. The method of evaluation is applied in three successive steps: (i) build the representation and identify the performance zones, (ii) introduce additional policy criteria to complete a first draft of the outcome, and, finally (iii) assess relative positioning, correct if necessary, and finalize. Performance maps are presented and inspected for a set of Research and Development institutions, to illustrate examples of application of the methodology proposed.
Introduction
The problem of evaluating and ranking performance in groups and institutions is present in almost every organizational context, from public institutions to private companies. However, the combination of different indicators and perspectives in a single output is a difficult task, and has been the subject of many studies. More specifically, combining objective criteria, i.e. those based on specific metrics, with subjective criteria, based on peer assessment, benefits from the existence of a reference framework that allows quantification of the degree of discretion. Naturally, subjective evaluation and policy implementation often lead to the same result, which is to depart from the indications given by quantitative performance indicators. Research evaluation and peer review are deeply embedded in academic cultures and practices and are hence far from being just modern discussion topics. Various approaches for dealing with different kinds of evaluation have been proposed (De Jong et al., 2011), from peer review of journal papers and research projects to national or international evaluation initiatives (European Science Foundation, 2015). Some of these exercises compare and discuss different evaluation schemes, such as whether metrics-based evaluation systems are better than, or just a part of, peer-based evaluation (Glänzel and Willems, 2016; Mutz et al., 2016; Ramos and Sarrico, 2015). In general, there is considerable controversy as to whether metrics are better than peer review, but most recent research assessments use both. Other studies of research evaluation are rather more sociological, focusing on the complexity and shortcomings of specific examples of evaluation processes, the expected consequences of such exercises for research institutions and disciplines and also the nature of the outcome (Leydesdorff and Bornmann, 2014; Upton et al., 2014).
Performance management systems are in continuous development and are also the subject of a significant number of studies (Gibbs et al., 2004; Long et al., 2015; Shahin and Mahbod, 2004). In research institutions (e.g. Research and Development (R&D) units) performance management is a process by which the institution involves its researchers, both as individuals and members of a group, in improving organizational effectiveness towards the goals of the institution. This process is used to communicate organizational goals and objectives, reinforce individual performance and accountability for meeting those goals, and also supervise and evaluate individual and organizational performance results. The ability of metrics to represent complex information in an accessible format has been extensively explored, mainly when they are used as performance evaluation tools (Abbott et al., 2010; Derrick and Pavone, 2013; DORA, 2013; Guilak and Jacobs, 2011; Hauser and Katz, 1998; Lawrance, 2003, 2007; National Health and Medical Research Council, Australia, 2010; van Veen-Dirks, 2010).
Evaluation systems can be used to assess the efficiency of staff, quality of products and a wide range of other measurable actions and processes (Chandler, 1982; Hamilton and Chervany, 1981a, 1981b). The level and type of performance indicators used depends in part on what is being assessed (Blockman et al., 2014, 2016; Cova et al., 2015). Key performance indicators tend to share certain characteristics (Chenhall et al., 2013). Identifying those key indicators is one of the major challenges in the design of frameworks and models for performance measurement. This issue has been evaluated, for example, through rating techniques using the SMART approach (Piskurich, 2011; Yemm, 2013), which introduces relevant concepts such as measurability and specificity. This approach has also been used to design productivity strategies (Shahin and Mahbod, 2004).
Evaluation processes may be based on both quantitative and qualitative metrics (Chandler, 1982; Chenhall et al., 2013; Hamilton and Chervany, 1981a, 1981b; Moers, 2005; Woods, 2012). Both types of performance indicators must be integrated successfully to define adequate strategies and develop an efficient and fair evaluation system (Ho et al., 2014; Poston and Speier, 2005). Therefore, evaluation requires the use of various indicators, of different types and origins. Using a formulaic approach, and explicitly resorting to weights (Lawrance, 2003, 2007; Selto and Malina, 2004), provides a simple way to circumvent the problems associated with this multi-variable analysis. However, weights are arbitrary and promote dispute among evaluators (or those evaluated), so the evaluation benefits from an additional framework. This framework, used to position the elements under evaluation, should be the direct result of metrics indicators, with as little manipulation as possible, in a ‘let the numbers speak for themselves’ approach. Further criteria, of any type, can be assessed by contrasting this unsupervised positioning with the overall result of the evaluation exercise.
The existence of different types of indicators increases complexity. How, for example, can productivity and social skills indicators be combined? Or those associated with efficiency and teamwork? In this context, it is widely accepted (Bretz et al., 1992; Ittner et al., 2003) that peer review cannot (and should not) discard some discretion, and that subjective assessments and personal opinions are beneficial in any evaluation task, providing additional insight that cannot be obtained exclusively from cold facts and figures. A measure of how far an evaluator may go in terms of the level of subjectivity should, therefore, also be clearly established. Probably, the best way to address these issues is to combine information in a sophisticated way with objective, numerical data, providing an overall positioning or ranking status, while also helping to decide on the degree of discretion that should be allowed and introduced in a subsequent step. This type of approach has not, to the best of the authors’ knowledge, been proposed before. It mixes evaluation and ranking naturally, as these are usually closely interdependent concepts. As such, and in simple terms, the units placed best or worst on the basis of the ‘objective’ assessment are not likely candidates for repositioning resulting from the more subjective criteria, except in exceptional, duly justified cases.
In order to provide a concrete example of the overall process, this work resorts to the evaluation of research units, a type of organization often subjected to evaluation and peer review exercises. Currently, the most widely used methods to support evaluation of research are based on bibliometric indicators (Egghe, 2006; Hirsch, 2005; Parra et al., 2011; Zhang, 2013). The majority of these indicators have been developed to quantify both the production of researchers, using the total number of published articles and the number of articles published in a certain period of time, and their impact, gauged by the total number of citations, the average number of citations per article, the number and percentage of significant published articles, relative citation rates, or even a combination of some of these indicators (see e.g. Alonso et al., 2009; Bishop et al., 2003; Leydesdorff and Rafols, 2011; Martin, 1996; Schreiber, 2008; Serenko and Dohan, 2011; Van Leeuwen et al., 2003; Vieira and Gomes, 2010; Vinkler, 2007; Zhang et al., 2011). The intuitive feeling is generally that bibliometric indicators reflect the impact of a given scientist or institution. However, recent studies have suggested that these metrics alone are not enough to fully assess the quality of individual (or institutional) work. Several views have been discussed pointing to the need to use multiple indicators, as well as to the importance of the peer review system in the assessment of quality (European Physics Society, 2012; European Science Foundation, 2011; Institut de France, Académie des Sciences, 2011; Science and Technology Committee, 2011; Swedish Research Council, 2009). Bibliometric performance indicators should thus be applied only as a collective group and not individually, and should be used in conjunction with peer review following a clearly stated set of rules outlining proper practices, thereby supplying a basis for evaluating the past and directing the future work of the institution.
One question remains: how should evaluators use this high dimensional information combining indicators of different origins and types? The present work proposes a simple approach, based on principal component analysis, to address this question. It is clear that understanding how metrics/peer review or objective/subjective criteria are combined to form a general procedure of evaluation will contribute to the design of optimized models for performance assessment.
Evaluation
General aspects
Policymakers have sought to promote better productivity in research institutions by using performance funding, i.e. funding proportional to institutional performance, measured on the basis of specified indicators. These indicators allow the identification of institutional strengths and weaknesses, aiming at higher productivity, although how these multiple indicators influence the final result of evaluations determining funding should also be assessed. An institution is made of people, and achieving effective performance of the workforce is a primary goal of any organization (Cova et al., 2015). Performance management practice must thus rely on indicators based on the outlined targets, which should also be those guiding funding decisions.
The process of evaluating or ranking individuals or institutions is, naturally, dependent on the type of structure in which it is performed and, also, on the context and objectives which it aims to achieve. However, it is assumed that some general characteristics are a priori present. One of these is that it is based on a multivariate approach, i.e. there are different indicators to support the decision, and these indicators fall into two main categories: ‘objective’, which are numerical or ordinal and can be established directly from performance measurements, and ‘subjective’, which include concepts such as ‘integration’, ‘evolution’, ‘specificity’, etc. To the latter, criteria based on items such as ‘geographical situation’, ‘strategic area’, ‘strategic ability’, ‘reputation’, and ‘background’ may also be added. These will be denoted as ‘policy’ criteria, which also transcend performance. The examples provided above are just a guide to the necessary prior assessment of the type of criterion, and must naturally be adapted to the specific case being studied.
The objective indicators are, due to their nature, usually strongly correlated. Imagine that these indicators can be subdivided into two types: those in which the value grows with performance (e.g. counts of achievements or items produced), and those that decrease with performance (e.g. counts of reprimands or days of absence). Naturally, the overall tendency is to have several indicators correlating positively, if they are of the same subtype, or negatively, if they are of opposite subtypes. This means that the overall objective information is prone to be represented in a low-dimensional system, preferentially two-dimensional. This is the main assumption in this work. The system must be defined in a natural (and intuitive) manner, and probably one of the most direct ways is to choose coordinates emphasizing the distinction between the elements subject to evaluation. The interpretation of the coordinate system must also be straightforward and, as will be shown in the case study, it is likely that the major axis will be linked to overall performance, while the remaining axes will promote more subtle, often less relevant, distinctions. It should be noted that these axes only contain information from measurable indicators: policy criteria are included at a later stage.
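The correlation argument above can be illustrated with a short sketch. The indicator names and values below are hypothetical, chosen only to mimic two indicators that grow with performance and one that decreases with it; they are not taken from the case study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical values for five evaluated elements (illustrative only).
items_produced = [12, 30, 45, 51, 80]  # grows with performance
achievements = [1, 4, 5, 7, 9]         # grows with performance
days_absent = [20, 15, 9, 8, 2]        # decreases with performance

r_same = pearson(items_produced, achievements)    # same subtype
r_opposite = pearson(items_produced, days_absent)  # opposite subtypes
print(f"same subtype: {r_same:.2f}, opposite subtypes: {r_opposite:.2f}")
```

Strong correlations of either sign indicate redundancy among indicators, which is what makes a low-dimensional representation plausible.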
The first step in the methodological approach of evaluation is to represent the objects upon a plane or, in less favorable cases, a 3-D surface, and identify the direction and sense along which an increased performance is observed.
Figure 1 presents, schematically, an illustrative situation in which the positive sense of the horizontal axis is associated with increasing quantity/quality. The objects are located in two main areas, for which the notation ‘Potentially fundable’ and ‘Potentially not fundable’ is employed as a simplified, caricatured view. Those located in the ‘Potentially fundable’ region are clearly productive, while those in the ‘Potentially not fundable’ region are falling behind. A third area, depicted in grey, can be found between these two more extreme classes, corresponding to the region where subjective or policy criteria are mandatory to assign the subjects to either of the two large classes. In general, successfully evaluated R&D units (white markers) are located to the right, unsuccessful ones (black markers) to the left, while both coexist in the grey area. However, one of the objects, a low-performance one, is labeled as successful after evaluation (i.e. after the inclusion of subjective/policy criteria). This is a situation prone to be revised, as it could be a result of the evaluators’ perspective being excessively coloured by personal perceptions and/or strategic or policy design.

Figure 1. Schematic representation of the discrimination areas, concerning the evaluation of the objects (institutions/individuals), established using objective criteria of performance assessment.
Figure 1 is a metrics-based construction. For subjects located in the ‘Potentially not fundable’ and ‘Potentially fundable’ regions, an automatic selection is more likely to be sufficient, and less intervention from the evaluator is, a priori, required. We should not discard the idea that, in some exceptional situations, an R&D unit can switch from ‘Potentially fundable’ to ‘Potentially not fundable’ or vice versa: this should be duly justified, especially in the former case, and it should be made clear why the subjective or policy criteria dominated over performance indicators. The units depicted in the grey area should not, in principle, be subjected to a blind cut line because, ultimately, this would imply the irrelevance of the less objective assessment.
The process is, thus, made in three steps. In the first step, the system is characterized and mapped, using metrics information, to define zones. Subsequently, additional criteria, in most cases totally independent of the measurable variables, are introduced and a first draft of the outcome is produced. The question arises as to whether we may also have additional criteria based on measurable values. In some cases, the answer is yes. Consider that one focuses on one of the variables used pertaining to objective indicators, and uses it (concomitantly or not) in an independent condition. For example, research units that provide services for others receive a bonus for ranking. These services may be quantified, but the bonus criterion may be included among the policy criteria.
Finally, strong deviations between subject positioning in the map and evaluation results are assessed, and the latter corrected if necessary: the evaluator must decide if the evaluation outcome is to be maintained or if the subjective/policy assessment should be discarded because it is excessive.
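The three steps can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the two cut-off values and the unit scores are hypothetical, and the score is assumed to be the coordinate of each unit along the performance axis of the map.

```python
def assign_zone(score, low_cut, high_cut):
    """Step (i): place a unit in a performance zone from its map coordinate."""
    if score >= high_cut:
        return "potentially fundable"
    if score <= low_cut:
        return "potentially not fundable"
    return "grey"

def flag_deviations(units, low_cut, high_cut):
    """Step (iii): list units whose drafted outcome contradicts their zone.

    `units` holds (name, score, funded) tuples, where `funded` is the
    first-draft outcome obtained in step (ii) after policy criteria.
    Units in the grey zone are never flagged: there, the subjective or
    policy assessment is expected to decide.
    """
    flagged = []
    for name, score, funded in units:
        zone = assign_zone(score, low_cut, high_cut)
        if zone == "potentially fundable" and not funded:
            flagged.append((name, zone, funded))
        elif zone == "potentially not fundable" and funded:
            flagged.append((name, zone, funded))
    return flagged

# Hypothetical draft outcome for five units (illustrative values only).
draft = [
    ("U1", 2.1, True),
    ("U2", 0.3, True),
    ("U3", -0.2, False),
    ("U4", -1.8, True),   # low performance, yet drafted as funded
    ("U5", -2.5, False),
]
for name, zone, funded in flag_deviations(draft, low_cut=-1.0, high_cut=1.0):
    print(f"{name}: zone '{zone}' but funded={funded}; justify or correct")
```

Flagged units are not corrected automatically; the sketch only identifies the cases that the evaluator must either justify or revise.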
It should be noted that, although the mapping described above provides simple ingredients to help evaluation, it is clear that some latitude and discretion are necessary to establish, in most cases a priori, the proportion of ‘Potentially not fundable’ and ‘Potentially fundable’ units and the size of the grey area. Also, note that the present proposal is directed at a binary outcome: promoted/not promoted, funded/not funded, etc., but the usefulness of the mapping procedure goes clearly beyond this, and the extension of its use for ranking is straightforward.
As a case study, an example from the evaluation of R&D units is presented. This selection is based on its relevance, economic and institutional, for research and also on its relation to management structures in universities and university rankings.
Sources and methods
Dimensionality reduction process
To reduce the number of dimensions in the system, and allow the respective graphical representation, principal component analysis (PCA) is used. This technique requires a description of the objects as points in Euclidean space. In this analysis, each R&D unit corresponds to one of these points and is described on the basis of 26 components related to the measurable performance indicators. In the ‘mapping’ procedure, an overview is given of the results of implementing ‘objective’ measurable and so-called ‘subjective’ criteria in research and development institutions, resorting to PCA. PCA is a simple, non-parametric procedure for extracting relevant information from multivariate datasets (Joliffe, 2002). It computes a compact and optimal description of the data set, providing a roadmap to a lower-dimensional space that reveals the underlying structure. The most influential variables in the system are highlighted, while the most relevant factors may be identified. Relevant, in this method, is related to promoting difference, i.e., variability. It is, thus, an obvious choice in an evaluation process aimed at ranking. More specifically, this technique relies on the assumption that most of the information contained in the data is present in directions along which the variations are the largest (Almeida et al., 2009; Cova et al., 2013; Joliffe, 2002). In the present study, PCA summarizes the information residing in the data corresponding to the evaluation of institutions/individuals into a form that may be more easily inspected and interpreted. Thus, the original multi-dimensional space, defined by several performance indicators, is contracted into a few descriptive dimensions, which represent the indicators that are more discriminative in the data. Each principal component (PC) can be displayed graphically and analysed separately, and its meaning may often be established on the basis of a few indicators.
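Essentially, the procedure is carried out by a linear transformation of the m performance measures, $x_1, x_2, \ldots, x_m$, into new, uncorrelated variables, the principal components,

$$PC_i = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{im} x_m = \sum_{j=1}^{m} a_{ij} x_j, \qquad i = 1, \ldots, m,$$

where $a_{ij}$ is the coefficient (loading) of the j-th performance measure on the i-th principal component.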
Since the first principal components retain most of the variance, several indicators can be summarized by a few components, and a plot of the first two or three PCs enables the visualization of most of the information contained in the initial data set. PCA requires the solution of an eigenvalue problem, based on either the correlation or the variance/covariance matrix of the original indicators. In the proposed approach, the results are based on the correlation matrix. In either case, the components are ranked, and the percentage of explained variance λi decreases from the first PC to the second and so on, suggesting the criteria for the selection of the most relevant first p principal components. The most common one is the Pearson criterion (Joliffe, 2002), which can be used in both the variance/covariance and the correlation approaches. The value p is selected as the minimum integer that warrants

$$\sum_{i=1}^{p} \lambda_i \geq \tau,$$

where $\tau$ is the prescribed fraction of the total variance to be retained and $\lambda_i$ is expressed as a fraction of that total.
If correlation is used, the most common criterion corresponds to retaining the p components for which the associated eigenvalue of the correlation matrix is at least 1, i.e. those that explain no less variance than a single standardized indicator.
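The procedure described above can be sketched with the standard library alone. This is a minimal illustration, not the code developed for this work: the function names and the data are hypothetical, the eigenvalue problem is solved with a textbook cyclic Jacobi rotation scheme chosen only to keep the example self-contained, and the cumulative-variance threshold is left as a parameter because its value is an assumption of the user.

```python
from math import atan, cos, sin, pi, sqrt

def standardize(rows):
    """Column-wise z-scores; their covariance is the correlation matrix."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    sds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(m)]
    return [[(r[j] - means[j]) / sds[j] for j in range(m)] for r in rows]

def correlation_matrix(z):
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(m)] for i in range(m)]

def jacobi_eigen(a, sweeps=100):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    n = len(a)
    a = [row[:] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        if sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < 1e-22:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = pi / 4 if diff == 0 else 0.5 * atan(2 * a[p][q] / diff)
                c, s = cos(theta), sin(theta)
                for k in range(n):  # rotate columns p and q of A and of V
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of A
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v

def pca_correlation(rows):
    """PCA on the correlation matrix: eigenvalues, explained fractions, loadings, scores."""
    z = standardize(rows)
    eigvals, vecs = jacobi_eigen(correlation_matrix(z))
    order = sorted(range(len(eigvals)), key=lambda i: eigvals[i], reverse=True)
    eigvals = [eigvals[i] for i in order]
    loadings = [[vecs[j][i] for j in range(len(vecs))] for i in order]  # one row per PC
    scores = [[sum(a * x for a, x in zip(pc, r)) for pc in loadings] for r in z]
    explained = [ev / sum(eigvals) for ev in eigvals]
    return eigvals, explained, loadings, scores

def n_by_cumulative_variance(explained, threshold):
    """Smallest p whose cumulative explained fraction reaches `threshold`."""
    total = 0.0
    for p, fraction in enumerate(explained, start=1):
        total += fraction
        if total >= threshold - 1e-12:
            return p
    return len(explained)

def n_by_unit_eigenvalue(eigvals):
    """Number of components whose correlation-matrix eigenvalue is at least 1."""
    return sum(1 for ev in eigvals if ev >= 1.0)

# Hypothetical data: six units described by four indicators (illustrative only).
data = [
    [10, 40, 4.0, 12], [25, 130, 5.2, 20], [8, 25, 3.1, 9],
    [40, 260, 6.5, 31], [15, 70, 4.4, 15], [30, 150, 5.0, 22],
]
eigvals, explained, loadings, scores = pca_correlation(data)
print("explained variance fractions:", [round(f, 3) for f in explained])
print("components kept (eigenvalue >= 1):", n_by_unit_eigenvalue(eigvals))
```

The first two columns of `scores` give the coordinates of each unit in the representation plane, and the first two rows of `loadings` give the coefficients of the indicators on those components.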
It should be noted that, as extracted from the description above, alternatives based on supervised techniques were discarded. Specifically, regression is not an adequate approach because (i) it implies using the result of the evaluation, that is, one is using the outcome of the exercise to validate that same outcome and (ii) it does not provide the relative positioning of the objects, or a straightforward graphical representation as described in what follows.
Graphical representation
Evaluation trends can be monitored by combining PCA and the biplot representation (Gabriel, 1971; Galindo, 1986). A simple and efficient way to visualize the relations between those being evaluated and the performance indicators is by computing the biplot on the principal components. A biplot uses points to represent the scores of the evaluated research units and vectors to represent the coefficients of each indicator on the principal components, i.e. it uses the evaluated units and performance indicators to represent structure. The relative location of the units can be interpreted: R&D units that are close together correspond to observations that have similar scores on the components displayed in the representation, and thus possess similar overall performance profiles. Both the direction and length of the vectors can also be interpreted. Vectors that point in the same direction correspond to indicators that have similar response profiles, and can be interpreted as having similar meaning in the context set by the data. So, for these data, vectors represent different performance indicators, and points correspond to the evaluated R&D units. This representation, in two dimensions, allows the visual discrimination between units.
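The two readings of a biplot described above, proximity of points and alignment of vectors, can be made quantitative with a short sketch. The coordinates and labels below are hypothetical placeholders for the first two score and loading columns of a PCA; they are not results of this study.

```python
from math import dist, hypot

def cosine(u, v):
    """Cosine of the angle between two indicator vectors in the biplot plane."""
    return (u[0] * v[0] + u[1] * v[1]) / (hypot(*u) * hypot(*v))

# Hypothetical loading vectors (PC1, PC2) for three indicators.
indicator_vectors = {
    "ind_a": (0.52, 0.10),
    "ind_b": (0.50, 0.05),
    "ind_c": (0.30, -0.60),
}
# Hypothetical scores (PC1, PC2) for three units.
unit_points = {"U1": (2.4, 0.3), "U2": (2.2, 0.1), "U3": (-1.9, -0.4)}

# Vectors pointing in the same direction: similar response profiles.
print("cos(ind_a, ind_b) =", round(cosine(indicator_vectors["ind_a"], indicator_vectors["ind_b"]), 3))
print("cos(ind_a, ind_c) =", round(cosine(indicator_vectors["ind_a"], indicator_vectors["ind_c"]), 3))

# Points close together: similar overall performance profiles.
print("d(U1, U2) =", round(dist(unit_points["U1"], unit_points["U2"]), 3))
print("d(U1, U3) =", round(dist(unit_points["U1"], unit_points["U3"]), 3))
```

A cosine close to 1 corresponds to nearly parallel vectors, and a small distance to units with similar positions in the plane.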
Case study: Application to R&D units evaluation
The recognition of the merit of a research unit is highly dependent on the excellence of the science developed within its structure, which prompts regular evaluations of this type of institution. As explained above, this work aims to contribute to any process of evaluation in which a discriminating ranking is the expected outcome, by establishing a simple and straightforward methodology that uses a low-dimensional representation based on the natural variability of the system to represent the data set. Most scientific research in Portugal takes place in R&D institutions, financed and evaluated by the public national funding agency for science (FCT). Research in these R&D units encompasses all scientific fields, from life and health sciences to social sciences, arts and humanities, from engineering and exact sciences to natural and environmental sciences. There are currently 292 R&D units and 26 Associate Laboratories (larger research units that are oriented to more specific disciplines such as materials science, neurosciences and green chemistry), with more than 22,000 researchers. As for many research institutes, the funding agency evaluation is based on a peer-review process. The assessment procedures include periodic assessments by an independent panel of internationally recognized experts, based on the reports of activity and strategic plans, as well as on direct contacts with researchers and institutions, through site visits and/or interviews. All R&D units are classified with a qualitative grade, which determines the level of funding to be awarded (FCT, 2014a). Periodic evaluations of R&D units have been an established mechanism of the agency since the mid-1980s (Neave and Amaral, 2012). As a contribution to the evaluation exercise of R&D units in 2013, the agency requested a bibliometric study that considers the contributions of the permanent members pertaining to different R&D units in 47 scientific areas, in the period 2008–12.
This bibliometric study was used as a complement to peer review. For the analysis, the Open Researcher and Contributor ID (ORCID, 2014) was adopted as a unique identifier of researchers and the Scopus database was used as data source (Scopus, 2014). Other details concerning the specific procedures relevant for the analysis can be found in reference (FCT, 2014b). This new exercise differs in some key respects from the previous evaluation, held in 2007 (Ramos and Sarrico, 2015). In the latter, the evaluation process was entirely run by the agency. There was no government austerity regime, as the evaluation exercise was initiated before the Eurozone financial crisis, and public universities were allowed to hire new permanent academic staff. Also, there were more than 20 panels, allowing in-depth consideration of units in different disciplines by evaluators recognized as experts in the respective scientific disciplines. In the 2007 exercise, there was no special focus on bibliometric indicators, there was no specific relevance connected to research excellence, and the main task was simply to evaluate the quality of the research in R&D units. Also, the Associate Laboratories (very large research institutes) were excluded and were not assessed alongside the other research units. In 2013, however, the Associate Laboratories were assessed on the same basis as smaller R&D units. Some concerns were raised about the scope of the evaluation, and about its impact as a consequence of the national financial problems and of the contract with the international entity involved (European Science Foundation (ESF)). In fact, it was difficult for the funding agency to carry on as before in relation to research funding, since the country was faced with an austerity regime that cut public spending. In this context, it was suggested that the funding agencies should become more independent of government (Ferreira and Firmino, 2015).
In the selected case study, more than 300 research institutions were considered, using a set of almost 30 performance indicators (including size, production and productivity), which reflect the contributions of the permanent members pertaining to each unit, in a collection of data for a five-year period. These indicators and the final evaluation attributed to each R&D unit were extracted from the evaluation reports (FCT, 2014b). The analysis resorts mainly to methods that involve the simultaneous study of several key variables related to different performance indicators, including: (i) the number of full-time equivalent (FTE) researchers; the number of (ii) publications, (iii) publications per year, (iv) publications in the five-year period 2008–12 and (v) publications per FTE, (vi) citations, (vii) citations per publication, (viii) citations per FTE; (ix) the h-index, as defined by Hirsch (a group of papers has an h-index of 10 if 10 of these publications have each received at least 10 citations, and there are not 11 publications that have each received at least 11 citations); (x) the Field Weighted Citation Impact (FWCI), defined as the total number of citations actually received by a group of publications divided by the average number of citations that were received worldwide by publications in the same subject field(s); the number and percentage of publications in the (xi) top 1 per cent, (xii) 5 per cent, (xiii) 10 per cent and (xiv) 25 per cent percentiles (the citation thresholds that represent the top 1 per cent, 5 per cent, 10 per cent and 25 per cent publications in the data universe being used were established, and the absolute counts and the percentage of publications that lie within each threshold were calculated); and the number and percentage of publications per FTE in (xv) national and (xvi) international collaboration (FCT, 2014b).
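Two of the indicators defined above lend themselves to a short worked example. The sketch below is illustrative only: the function names and citation counts are hypothetical, and the FWCI is computed as total citations received divided by the total number of citations expected for comparable publications, which is an assumed reading of the definition given above and not necessarily the exact procedure of the data provider.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cited in enumerate(ranked, start=1):
        if cited >= rank:
            h = rank
        else:
            break
    return h

def fwci(citations, expected):
    """Citations received relative to those expected for the same field(s).

    `expected` holds, for each publication, the assumed world average
    number of citations of comparable publications (hypothetical input).
    """
    return sum(citations) / sum(expected)

# Hypothetical citation counts for the publications of one unit.
unit_citations = [25, 8, 5, 3, 3, 0]
print("h-index:", h_index(unit_citations))

# Hypothetical expected citations for three publications.
print("FWCI:", fwci([10, 4, 0], [5, 4, 5]))
```

A value of the FWCI above 1 would indicate more citations than expected for the field, and a value below 1 fewer.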
In this context, multivariate methods allow the identification of underlying patterns and graphical representation of inter-relationships between performance indicators, and provide ways of simplifying and reducing the dimensionality of the data. In the present example, the final output of the evaluation process will be a binary response: financed, not (or less) financed, and the study will be conducted in two different research areas, denoted as A and B.
The software code was developed and optimized by the authors using R (version 3.0.1) (Venables and Smith, 2013). The general procedure consists of two main steps: (i) normalization of the values for each variable comprising performance measures, (ii) data overview and variable selection by principal component analysis (PCA) using a Biplot representation.
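The normalization step can be sketched as follows. The type of normalization is not detailed here, so the example assumes autoscaling (z-scores), which is consistent with a PCA based on the correlation matrix; the function name and the values are hypothetical, and the sketch is written in Python rather than in the R code used in this work.

```python
from statistics import mean, pstdev

def autoscale(column):
    """Centre a variable on its mean and scale it to unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

# Hypothetical raw values of two indicators on very different scales.
publications = [120, 340, 95, 410]
citations_per_publication = [4.1, 6.3, 3.2, 7.0]

scaled = {
    "Pub": autoscale(publications),
    "CitPub": autoscale(citations_per_publication),
}
for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After this step every indicator has zero mean and unit variance, so that no indicator dominates the analysis merely because of its units.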
Results and discussion
A comprehensive characterization of R&D units from two different research areas is performed to explore the productivity profiles, the relations between a set of performance indicators and the outcome of an evaluation process resulting in units recommended (and not recommended) for funding. Each area is represented by a set of R&D units, characterized on the basis of multiple performance indicators, which reflect scientific productivity or quality. The mapping is solely based on ‘objective criteria’ of evaluation, consisting of the bibliometric indicators described above. At this point, it should be reiterated that an evaluation process should not be restricted (or limited) to a pure bibliometric component; however, this analysis provides a first overview of the system, and references for assessing the final result.
Note that what follows is not a full implementation of the evaluation procedure described above; rather, it focuses on two fundamental aspects: the representation based on the non-subjective indicators, and the assessment after introduction of the subjective and policy or strategic criteria. It is not easy to infer the subjective criteria used in the different areas, but one can suggest ‘reputation’ and ‘background’ (Deem, 2015), based on the testimony of an evaluator.
Despite all the limitations and distortions that can be assigned to bibliometric indicators, they, generally, offer a reasonable expression of the research activity. For example, the impact factor (IF) and the h-index (Hirsch, 2005) are widely used in evaluation exercises as a proxy for the quality and scientific prestige of a journal (and, consequently, articles published therein) and as measures to quantify/qualify the research output of scientists or institutions, respectively.
In this particular type of evaluation, circumstances, evolution trends, unit strategies and the specificity of the scientific area, although transcending the bibliometric component, should also be taken into account. In other words, the inclusion of a degree of discrimination, associated with a policy component, is fully justified. These criteria do not affect the mapping, but their effects are visible upon the map. See the example provided in Figure 1, in which a low-performance unit was, resorting to these criteria, turned into a successful one.
As previously mentioned, the analysis is conducted using vectors of indicators defining performance. Representing each unit as a point in the hyperspace of performance indicators allows us to carry out the analysis that results in the composed view depicted in Figure 2.

Figure 2. R&D units mapped onto discrimination components, according to the performance patterns and the evaluation results.
Figure 2 displays the relative positioning of the units in biplot form, considering the new orthonormal principal component system. The representation in two dimensions allows the visual discrimination between R&D units. The final result of the evaluation, taken as an example, is expressed by the white (recommended for funding) or black (not recommended for funding or recommended for very limited funding) markers in the plots. The relative location of the units can be further interpreted: units that are placed close together correspond to observations that have similar scores on the components displayed. These units also correspond to observations that have similar performance values. From the present observations, it is safe to say that the first component should be viewed as (and is associated with) a measure of quantity and quality, providing inter-group discrimination. The second component also introduces discrimination, based on other less specific, and probably less relevant, aspects. This is, in a sense, the definition of a simplified evaluation space.
Figure 2 also indicates that the new reduced system preserves c. 70 per cent of the information in the representation plane and allows us to identify how the most relevant indicators evolve in that plane. The latter include (i) the number of permanent or full-time equivalent researchers (Size), (ii) publications in a five-year period (Pub), (iii) publications per permanent researcher (PubSize), i.e. the number of publications normalized by the size of the unit, (iv) citations per publication (CitPub), and (v) the h-index (hindex), an index of quantity/quality. All these indicators evolve along the increasing ‘Quantity/Quality’ axis, substantiating the nature assigned to this axis. Note that this is a common finding in many systems: the first axis depicts an overall evolution of what is measured (Cova et al., 2013). On the other hand, the discrimination along the second axis is associated with aspects that vary from one area to another, and are probably related to the emphasis placed on quantity, quality or size of the units and other characteristics. This component was denoted ‘Other aspects’, and although it contributes to the graphical representation, no effort was made to analyse it further.
The research area denoted as A will be inspected first. The three units selected for funding are placed in the ‘recommended for funding’ group, standing out from the others, and represent the research units in that field with the best production/productivity profile. In this example, the units positioned in the grey area seem to have been excluded: the level of exclusion in this area appears to be quite high, although this naturally depends on the cutoff that was defined.
Consider now area B. There is a switch between the ‘recommended for funding’ and ‘not recommended for funding’ units along an axis on which the quantity and quality component obtained from bibliometric indicators should grow continuously. In fact, although the unit with the best performance indicators was approved, the second in order was rejected. This is a clear example in which the first-step data treatment suggested here would help the evaluator to take a decision, or at least prompt a thorough evaluation of these apparently inconsistent cases. Of the two areas analysed, only area A seems to follow what is prescribed, with a large ‘Potentially not fundable’ region.
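Apparent inconsistencies of the kind found in area B can be listed automatically once the units are ordered along the first component. The sketch below is only illustrative: the function `flag_inversions`, the scores and the funding decisions are all hypothetical, and a flagged pair is a prompt for closer inspection, not evidence of an incorrect decision.

```python
def flag_inversions(pc1_scores, funded):
    """Return (rejected, approved) index pairs in which the rejected unit
    has the higher score on the first component."""
    pairs = []
    for i, (score_i, funded_i) in enumerate(zip(pc1_scores, funded)):
        for j, (score_j, funded_j) in enumerate(zip(pc1_scores, funded)):
            if not funded_i and funded_j and score_i > score_j:
                pairs.append((i, j))
    return pairs

# Hypothetical area with five units: PC1 score and funding decision.
pc1 = [3.1, 2.4, 0.9, -0.5, -2.0]
funded = [True, False, True, False, False]

inversions = flag_inversions(pc1, funded)
print(inversions)
```

In this made-up case the best unit is approved and the second in order is rejected while a lower-ranked unit is funded, which mirrors the pattern described for area B.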
Summarizing at this stage, one can say that the difficulty inherent in an evaluation exercise, based on a large number of objective and subjective criteria, is demonstrated in this work. In addition, the proposed procedure underlines the necessity of setting up an efficient system in order to assess the impact of the different criteria.
The above methodology can fail in situations in which the PCA-recovered variability does not evolve along any recognizable low-dimensional direction: for the mapping to be useful, it must contain the ‘evaluation space’. It will also probably fail if the information cannot be contained in a small number of variables. The latter situation is unlikely to be found in a quality assessment, and the correct selection of indicators will probably make these limitations less severe.
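Whether a two-dimensional map is likely to be informative can be checked before it is drawn, by computing the fraction of variance retained by the leading components. The following sketch makes several assumptions: the decomposition is carried out on the correlation matrix, the data are synthetic, and the 0.7 threshold is an illustrative value that merely echoes the c. 70 per cent reported for Figure 2; it is not a cutoff proposed in this work.

```python
import numpy as np

def retained_variance(X, n_components=2):
    """Fraction of total variance captured by the leading components
    of the correlation matrix of X (rows: units, columns: indicators)."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigvals = np.linalg.eigvalsh(R)[::-1]  # eigenvalues in descending order
    return float(eigvals[:n_components].sum() / eigvals.sum())

rng = np.random.default_rng(0)
# Hypothetical case 1: five indicators driven by one latent factor plus small noise.
latent = rng.normal(size=(50, 1))
structured = latent @ np.ones((1, 5)) + 0.3 * rng.normal(size=(50, 5))
# Hypothetical case 2: five unrelated indicators.
unstructured = rng.normal(size=(50, 5))

threshold = 0.7  # illustrative value only
for name, data in [("structured", structured), ("unstructured", unstructured)]:
    frac = retained_variance(data)
    verdict = "map is informative" if frac >= threshold else "revisit the indicators"
    print(name, round(frac, 2), verdict)
```

When the indicators share a common underlying trend, most of the variance falls on the first component; when they are unrelated, the variance is spread over all components and the plane loses much of the information.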
Final remarks
The present work deals with an important topic in the use of multiple indicators in evaluation systems: how can ‘objective’ measurable and so-called ‘policy’ criteria be combined in an evaluation system? A straightforward method is proposed for performance assessment, relying on a dynamic three-step process involving both quantitative and subjective or policy-making components. The main idea is that the principal aspects of the system, as extracted from the objective indicators (built from measurable observations), allow the mapping of the subjects under evaluation without using weights, with less objective criteria being introduced subsequently.
The deviation between positioning in the map and evaluation outcome is, then, easily evaluated. The illustration that is presented, based on bibliometrics, clearly highlights the possibilities of application. It should be recognized, however, that bibliometrics easily gathers a set of values that can be used in the mapping. In other areas or systems, the metrics may be less clear, and even the distinction between objective and subjective/policy criteria may be fuzzier. This simply means that the information used to build the maps should be carefully considered for each case, but does not diminish the usefulness of representing evaluation (or partial evaluation) outcomes upon low-dimensional maps.
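The three-step process and the resulting deviations can be summarized in a short sketch. Everything in it is hypothetical: the function names, the two cutoffs that delimit the zones, the scores on the first component and the policy decisions. It assumes that the high and low zones set a default outcome, that units in the in-between zone have no default, and that a policy decision, when present, prevails.

```python
def assign_zone(pc1, low_cut, high_cut):
    """Step (i): place a unit in a performance zone from its PC1 score."""
    if pc1 >= high_cut:
        return "high"
    if pc1 <= low_cut:
        return "low"
    return "grey"

def evaluate(units, low_cut, high_cut, policy):
    """units: name -> PC1 score; policy: name -> funding decision (bool)
    reached with the additional policy criteria."""
    report = {}
    for name, score in units.items():
        zone = assign_zone(score, low_cut, high_cut)
        # Step (ii): the zone sets the default outcome; policy criteria may override it.
        default = {"high": True, "low": False, "grey": None}[zone]
        funded = policy.get(name, default)
        # Step (iii): a deviation is an outcome that contradicts a clear-cut zone.
        deviates = default is not None and funded != default
        report[name] = (zone, funded, deviates)
    return report

# Hypothetical units and policy decisions.
units = {"U1": 2.8, "U2": 1.9, "U3": 0.2, "U4": -0.4, "U5": -2.1}
policy = {"U3": True, "U4": False, "U5": True}

report = evaluate(units, low_cut=-1.0, high_cut=1.0, policy=policy)
for name, (zone, funded, deviates) in report.items():
    print(name, zone, funded, "deviation" if deviates else "")
```

In this made-up case only U5 is reported as a deviation: it lies in the low-performance zone but is funded on policy grounds, which is the kind of case that the final step is meant to bring to the evaluator's attention.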
It is extremely relevant, when trying to characterize productivity, to establish a clear distinction between objective and policy dimensions. The proposed procedure underlines the necessity of setting up an efficient system, in order to assess the impact of the different criteria.
It should also be noted that the method can have a broader application than that described in this article. In fact, it can be used in any situation in which relevant numerical indicators provide a basis for the mapping, and other criteria of any type (even ones also based on metrics) are subsequently introduced.
Acknowledgements
Professor Hugh Burrows is gratefully acknowledged for a critical review of the manuscript.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The Coimbra Chemistry Centre is supported by the Fundação para a Ciência e a Tecnologia (FCT), Portuguese Agency for Scientific Research, through the Project No.007630 UID/QUI/00313/2013, co-funded by COMPETE2020-UE. T.F.G.G. Cova and S.C.C. Nunes also acknowledge, respectively, the PhD and post-doctoral research Grants SFRH/BD/95459/2013 and SFRH/BPD/71683/2010, assigned by FCT.
