Abstract
Objective
To investigate whether the credibility of health economic models of screening for abdominal aortic aneurysms for health policy decisionmaking has improved since 2005 when a systematic review by Campbell et al. concluded that reporting standards were poor and there was divergence between the findings of studies that was hard to explain.
Methods
A systematic literature review was carried out following PRISMA reporting principles. Health economic models of the cost-effectiveness of screening for abdominal aortic aneurysms published between 2005-2010 were included. Key characteristics were extracted and the models were assessed for quality against guidelines for best practice by a multidisciplinary team.
Results
Seven models were identified and found to provide divergent guidance. Only three reports met 10 of the 15 quality criteria.
Conclusions
Researchers in the field seem to have benefited from general advances in health economic modelling and some improvements in reporting were noted. However, the low level of agreement between studies in model structures and assumptions, and difficulty in justifying these (convergent validity), remain a threat to the credibility of health economic models. Decision-makers should not accept the results of a modelling study if the methods are not fully transparent and justified. Modellers should, whenever relevant, supplement a primary report of results with a technical report detailing and discussing the methodological choices made.
Introduction
Health economic modelling is a branch of decision analytic modelling often employed for the economic evaluation of health care technologies. It is defined by its application of mathematical techniques to synthesize available information about health care processes and their implications, thereby providing a bridge between primary data and the decisions they inform. 1
The role of health economic modelling in informing policy decisions has evolved rapidly over the past decade. In particular, the updated guidelines on how to appraise new and existing technologies, published by the National Institute for Health and Clinical Excellence (NICE) in 2004, can be interpreted as favouring modelling over trial-based studies for economic evaluation.2,3
Concerns have been raised about the credibility of health economic models 4 due to the poor quality and poor reporting standards of modelling studies. Another factor is a misunderstanding of the relation between models and trials. The modelling approach is sometimes introduced as a solution to the problems of the trial-based approach (e.g., limited number of comparators, insufficient follow-up time and outcome measures that are too specific for decisionmaking) rather than as an analytical framework for synthesizing relevant evidence to inform decisionmaking. It is the authors' position that the two approaches should be seen as complementary rather than as substitutes.
In 2007, Campbell et al. examined the credibility of health economic models for decisionmaking using the case of screening for abdominal aortic aneurysm (AAA). 5 At that point, modelling of the cost-effectiveness of screening for AAA had been undertaken for more than 15 years, and the results were debated in several countries deciding whether to implement screening. 6 The conclusion of the review was that the models were a weak aid to decisionmaking due to substantial variations in methods and poor reporting standards.
Since 2007, AAA screening has been introduced in the UK, 7 the USA 8 and some areas of Sweden, whereas many other (European) countries continue to refrain from making a decision, perhaps due to uncertainty regarding cost-effectiveness. In parallel, the methods for health economic models have improved; in particular, attempts have been made to establish consensus statements on best practice for the development and reporting of models. 9 Nevertheless, the variance in model results has led to contradictory guidance, with a British model, among others, suggesting that screening is highly cost-effective at conventional threshold values and a Danish model suggesting that it is not.10,11
Altogether, these facts underline the importance of re-evaluating the credibility of health economic models used to inform decisionmaking. 12 The objective of this study was to update a previous systematic review assessing the credibility of health economic models for decisionmaking in the case of AAA screening.
Methods
The methods of the present literature review and quality assessment comply with the PRISMA statement for systematic reviews. 13 A systematic literature search was conducted to identify reports of health economic models of the cost-effectiveness of screening men for AAA. Reports that did not represent health economic models, that did not consider the relevant intervention or that did not report original results were excluded. The period of interest was from January 2005 to May 2010 to update the findings of the previous review by Campbell et al. 5 Accordingly, the comprehensive search strategy specified by Campbell et al. was adapted and applied, except for the searches in the Biological Abstracts database and the Health Economic Evaluations Database, which were not possible due to restricted rights. An additional search on the Web of Science was added in their place. Eleven databases were searched using the terms ‘model* AND screen*’ AND (survival OR ‘life year*’ OR ‘quality adjusted life year*’ OR ‘life expectancy’ OR qaly*) AND (‘abdominal aorta aneurysm*’ OR aaa OR aneurysm*). This was supplemented with manual searches of the Internet and of the reference lists of identified papers. Figure 1 illustrates the flow of identified publications, with a net total (excluding duplicates) of 34 references for which abstracts were reviewed. Upon the exclusion of 26 references, seven papers were obtained and included for appraisal.

Figure 1. Flow of identified publications
Costs were adjusted to 2009 prices using the general consumer price index and to the Euro (€) using purchasing power parity adjustments before between-study comparisons.
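The cost adjustment described above can be illustrated with a minimal sketch. The function name, the index values and the purchasing power parity (PPP) factor are hypothetical placeholders, not the figures used in this review. The order of operations (inflating within the original currency first, then converting with a 2009 PPP factor) is one common convention and is assumed here rather than taken from the source.

```python
def to_2009_euro(cost, cpi_price_year, cpi_2009, ppp_local_per_euro_2009):
    """Inflate a cost to 2009 prices in its own currency, then convert to euros.

    cost                     -- cost in local currency at the original price year
    cpi_price_year, cpi_2009 -- consumer price index in the original year and in 2009
    ppp_local_per_euro_2009  -- local currency units with the purchasing power of
                                one euro in 2009 (hypothetical input)
    """
    inflated = cost * (cpi_2009 / cpi_price_year)
    return inflated / ppp_local_per_euro_2009


# Hypothetical inputs: a cost of 100 local units reported in a year with CPI 80,
# a 2009 CPI of 100, and a PPP factor of 1.25 local units per euro.
example = to_2009_euro(100.0, cpi_price_year=80.0, cpi_2009=100.0,
                       ppp_local_per_euro_2009=1.25)
print(round(example, 2))
```

Keeping the price-year adjustment and the currency conversion as two explicit steps makes it easier to report which index and which conversion factor were applied to each study.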
Models were quality assessed against the best practice guidelines by Philips et al. 14 The guidelines include 15 assessment criteria for which no direction on how to summarize results is provided. A simple count of fulfilled criteria was used to indicate the overall quality although this imposed an assumption that each criterion should have equal weight. When a report referred to a previous publication for methodological details, this publication was retrieved and considered as part of the reporting of the model.
Beyond the quality assessment, special emphasis was given to extracting the key assumptions of the models (whether stated in the original reports or not) in order to assess the possible direction of bias with respect to the incremental costs per life year and/or quality-adjusted life-year (QALY) between studies.
Results
Description of the literature
Seven reports of health economic models of the cost-effectiveness of screening for AAA were identified and included for appraisal (see Table 1).10,11,15–19 The models all appeared to address the decision of whether to introduce a (national) screening programme for men aged 65 years. All models were of the Markov type and defined a lifetime time horizon (this was specified from 20 to 40 years after screening at age 65). One model was specified for a high risk population of patients referred to a vascular laboratory, 17 whereas all other models dealt with population screening by ultrasonography performed by mobile teams at a hospital or in primary care.
Table 1. Health economic models of the cost-effectiveness of population screening for abdominal aortic aneurysms published from 2005 to 2010
LY = life year, QALY = quality-adjusted life-year, ICER = incremental cost-effectiveness ratio, NR = not reported
Note: Costs are in purchasing power parity-adjusted 2009€. (Cost estimates from Silverstein et al. were assumed to be in 2005 prices as no price year was reported. Cost estimates from Kim et al. were adjusted from 2001 although they were reported to be in 2000/01 prices)
All models agreed on the definition of an AAA (aortic diameter >3 cm) and on the threshold value for surgery (aortic diameter >5.5 cm). Most models employed annual follow-ups for smaller aneurysms and biannual or quarterly follow-ups for medium-sized aneurysms. Attendance rates varied from 77% to 80%, while the diagnostic accuracy of ultrasonography was defined as 100% in all studies. The prevalence of AAA ranged from 3% to 6%, and the staging distributions among prevalent cases varied even more; for example, the proportion of large aneurysms ranged from 12% to 44%. The possibility of opportunistic detection was acknowledged in all models; however, rates varied from 2% to 13%.
The variance in the incremental cost-effectiveness ratios (ICERs) across studies was striking. At conventional threshold values for a willingness to pay of about £30,000, Kim et al. 11 concluded that the cost-effectiveness of screening was highly attractive (in the UK), whereas Ehlers et al. 10 concluded that it was not (in Denmark).
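For readers less familiar with the decision rule, the following minimal sketch shows how an ICER is formed and compared with a willingness-to-pay threshold. The increments are hypothetical and are not taken from any of the reviewed models. The net monetary benefit is used for the comparison because, unlike the ratio itself, it is unambiguous when an increment is negative.

```python
def icer(delta_cost, delta_effect):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect."""
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the incremental effect is zero")
    return delta_cost / delta_effect


def net_monetary_benefit(delta_cost, delta_effect, threshold):
    """Positive values favour screening at the given willingness-to-pay threshold."""
    return threshold * delta_effect - delta_cost


# Hypothetical increments per person invited to screening.
delta_cost = 50.0     # additional cost of screening versus no screening
delta_qaly = 0.01     # additional QALYs gained
threshold = 30000.0   # willingness to pay per QALY

print(round(icer(delta_cost, delta_qaly)))
print(net_monetary_benefit(delta_cost, delta_qaly, threshold) > 0)
```

Because the ratio divides by a small incremental effect, modest differences in assumed QALY gains between models can move the ICER across a threshold.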
Quality assessment of the literature
The quality of models was appraised according to 15 best practice criteria (see Table 2). None of the models fulfilled all of the criteria, and only three studies appeared to comply with a minimum of 10 of the 15 assessment criteria.
Table 2. Quality assessments of the identified models: extent to which a criterion is addressed
Note: The quality assessment was based on the best practice guideline by Philips et al., 14 in which detailed questions and assessment criteria are available. An assessment criterion is generally satisfied if the issue at hand is stated and discussed in the original report.
Structure of models
Decision models always reflect a compromise between appropriately reflecting real-world complexity and maintaining transparency by controlling model complexity. Maintaining such a balance usually entails a series of structural assumptions that must be clearly reported.
The literature seems to suffer from almost ubiquitous weaknesses related to the definition of the course of the disease with and without intervention. Although there is no benchmark for the assessment of whether a model appropriately reflects the underlying biological process, a minimum requirement would be that the rationale for the model structure and its consistency with scientific theory are explicitly addressed. This was observed in one study only, and even there, the structure appeared to be driven by the available data rather than by coherence with the biological evidence.11,20 The individual models provided very different suggestions for an appropriate model structure, from a relatively complex 17-health-states model to a simple four-health-states model. Most models agreed on distinguishing between no, small, medium and large aneurysms,10,11,15,18,19 and those that did not received a quality score of zero due to clear-cut biological evidence that the risk of rupture becomes exponentially larger with increasing aneurysm size.
The specified follow-up and treatment regimens of detected aneurysms varied across studies in relation to the intervals at which detected cases were followed up, whether endovascular surgery was an option, whether different outcomes were assumed for acute surgery with and without rupture, and whether some patients were allowed to decline or be unfit for surgery. While specification of treatment regimens might rightfully vary between models, models that did not distinguish between at least two postoperative states (elective and acute) and models that did not allow for some patients to be unfit for surgery were considered inappropriate.16,17 Again, this was based on clear-cut clinical evidence for significantly higher postoperative mortality rates after acute versus elective surgery and for the existence of a proportion of patients who are unfit for surgery. Also, in terms of justifying general assumptions, there seemed to be a weakness in the literature in that all reports acknowledged that several assumptions had been made to arrive at the proposed model structure, but few listed and justified these in a systematic and fully transparent manner. This issue is examined further in a later section.
Data
Once a model structure is established, the model is next populated with parameter estimates (although in practice this process is likely to be iterative with revisions of the structure upon the results of searches for parameter estimates). In a perfect world with ample evidence, a first-best approach would be systematically to identify, assess the quality of and perform a meta-analysis of the relevant literature, but models are often populated based on more scarce evidence. Transparency regarding how evidence was collected and what evidence was included is a key premise to provide an opportunity for decision-makers to judge the validity of findings.
The literature generally satisfied the assessment criteria relating to how models were populated, although an inconsistency in the methods across different types of parameters (transition probabilities, costs and outcomes) was noted. Several models seemed to be populated with transition probabilities from systematic literature searches,10,15,16,18,21 but as only limited information was given on the methods used for synthesizing evidence, credit could not be given for a first-best approach. Additionally, it seemed that major effort was put into informing some model parameters using the first-best method (typically the efficacy of screening), whereas others were informed by more indiscriminate methods (typically the resource use/costs parameters).
A consistent weakness was observed regarding the addressing of joint decision uncertainty. In the modelling context, at least three types of uncertainty exist: parameter uncertainty; methodological and structural uncertainties; and population heterogeneity. During the last decade, it has become a mandatory requirement of, for example, NICE that model parameters be specified probabilistically to provide an estimate of the joint (statistical) decision uncertainty. This was accomplished fully in four models10,11,18,19 and partly in one model, 16 whereas the remaining models provided only point estimates.
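A probabilistic specification can be sketched as follows. The distributions and their parameters are hypothetical and chosen only for illustration; none of the reviewed models is reproduced here. Each draw samples the incremental cost and incremental QALYs, and the share of draws with a positive net monetary benefit gives the probability that screening is cost-effective at a given threshold.

```python
import random


def probability_cost_effective(n_draws, threshold, seed=0):
    """Share of Monte Carlo draws in which screening has positive net benefit."""
    rng = random.Random(seed)
    favourable = 0
    for _ in range(n_draws):
        # Hypothetical distributions: gamma for the incremental cost (mean 50),
        # normal for the incremental QALYs (mean 0.01, standard deviation 0.005).
        delta_cost = rng.gammavariate(25.0, 2.0)
        delta_qaly = rng.gauss(0.01, 0.005)
        if threshold * delta_qaly - delta_cost > 0:
            favourable += 1
    return favourable / n_draws


# One point on a cost-effectiveness acceptability curve per threshold.
for threshold in (1000.0, 10000.0, 30000.0):
    print(threshold, probability_cost_effective(5000, threshold))
```

A model that reports only point estimates corresponds to running this with a single draw at the mean, which conveys nothing about decision uncertainty.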
Methodological and structural uncertainties are usually addressed in sensitivity analyses, where the impact of relaxing or imposing additional assumptions is tested. Most of the identified models tested the conventional assumptions of the discount rate, unit costs, prevalence, attendance rate and quality of life weights, whereas key assumptions for the disease process – for example, the rupture rate of large aneurysms – were only sporadically considered. This is not surprising given that when assessing criteria related to the structure of models, weaknesses in outlining a rationale for the structure were apparent in most reports.10,15–19,21
The final type of uncertainty is population heterogeneity, which refers to systematic variability within the specified population of 65-year-old men that might be associated with the efficacy of screening; for example, the proportion of smokers, the prevalence in the use of statins and other lifestyle-related factors. Population heterogeneity appeared not to have been addressed in most reports (and a post hoc assessment is difficult due to several models not providing details about the population characteristics).
Consistency
The final criteria assessed were related to the internal and external validity of the model – issues that have received increased attention in recent years. Most studies included some form of internal validation although only a few of the reports provided details regarding whether and how internal validity was assessed. There is no accepted procedure for the testing of internal validity. The available testing approaches range from quick assessments of mathematical logic (whether probabilities sum to one, for example) to careful and thorough evaluations of model predictions (how well a model's predictions of the number of events compare to the observed number of events in original data sources, for example).20,22
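The quick end of this range of checks can be sketched in a few lines. The state names and probabilities below are hypothetical; the function only verifies that each probability lies between zero and one and that the transitions out of every state sum to one.

```python
import math


def check_transition_matrix(matrix, tol=1e-9):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for state, row in matrix.items():
        for target, p in row.items():
            if not 0.0 <= p <= 1.0:
                problems.append(f"{state}->{target}: probability {p} outside [0, 1]")
        total = sum(row.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"{state}: outgoing probabilities sum to {total}")
    return problems


# Hypothetical annual transition probabilities for a three-state model.
matrix = {
    "small_aaa": {"small_aaa": 0.88, "large_aaa": 0.10, "dead": 0.02},
    "large_aaa": {"small_aaa": 0.00, "large_aaa": 0.93, "dead": 0.07},
    "dead":      {"small_aaa": 0.00, "large_aaa": 0.00, "dead": 1.00},
}
print(check_transition_matrix(matrix))
```

Checks of this kind establish only mathematical coherence; comparing predicted with observed event counts requires the original data sources.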
All reports discussed the external validity of the findings (the assessment criterion concerning external validity evaluated whether external validity was discussed rather than tested).
Potential biases in the models
The quality assessment of the literature revealed that a failure to make assumptions (and their impact on the findings) transparent was a particular weakness. This section therefore seeks to extract some of the assumptions from the models (see Table 3 and Table 4) to assess whether they might have led to a biased cost-effectiveness ratio.
Table 3. Examples of structural assumptions in models that could lead to biased results
Note: Direction of bias refers to the impact of assumptions on the incremental cost-effectiveness ratio
Table 4. Examples of assumptions relating to model populations that could lead to biased results
Note: Direction of bias refers to the impact of assumptions on the incremental cost-effectiveness ratio
NA = not applicable, as these models did not distinguish between small and large abdominal aortic aneurysms
Structural assumptions
Building a model to fit the data available despite conflicts with biological evidence yields invalid results, although on the surface they can appear to be founded on evidence. Given biological evidence that growth and rupture rates are exponential functions of aneurysm size, modelling even three aneurysm states (small, medium and large) might be too few. If so, all of the literature is biased against screening, as not detecting an aneurysm early is likely to have an even worse outcome than expected.
Another bias in most of the models was the lack of consideration of emergency procedures in cases without rupture, which were instead (mis-)classified as elective procedures or were simply left out.10,11,16–19 Such cases are well known in daily clinical work and are associated with three to four times higher rates of complications and mortality (and thus also with increased costs) compared to elective cases. They account for 20–25% of emergency cases as demonstrated by, for example, the MASS trial.23,24 Excluding these cases, or including them as elective cases, removes a significant part of the potential impact of screening, namely intervening before the disease progresses to a state requiring acute treatment.
A similar bias relates to neglecting the treatment modality of endovascular surgery, which has become standard practice and provides a treatment opportunity for individuals who are unfit for open surgery. Hence, failing to capture the fact that the proportion of detected cases that are unfit for surgery is diminishing will bias results against screening.
When individuals undergo screening, health professionals often note the presence of general cardiovascular risk factors and advise patients on compensatory actions to take. For example, many participants in a screening programme will be referred to their general practitioners for initiation of statin treatment and/or lifestyle changes (e.g., diet, exercise, smoking cessation). Such lifestyle changes might lead to an additional effect on top of the effect of screening. On the other hand, incidental detection of a need for statin treatment and/or lifestyle changes will also take place without a screening regimen, although its extent is less certain.
Parameter estimates
Essentially all models adopted transitional probabilities that were derived in study populations with ages up to 74 years (as age-specific estimates are not available). This might bias results against screening, as the attendance rate has most likely been underestimated while the mortality risk related to surgical repair has most likely been overestimated. Conversely, adopting the prevalence observed in an older age group will overestimate the probability that an aneurysm is present at screening.
It has been established that aneurysms grow faster the larger they get and that the risk of rupture increases exponentially for increasing aneurysm sizes. 25 Nevertheless, at least two studies assumed a constant growth rate and rupture risk for aneurysms ≥ 5.5 cm based on average growth for AAA.
According to the stationarity assumption of Markov models, parameters are time independent. This is an invalid assumption if, for example, any of the parameters vary with age or aneurysm size. At least four studies report having relaxed the assumption in the context of all-cause mortality,10,11,18,19 but for other parameters, this issue seems to have been neglected. It should be noted that, despite methodological challenges, it is possible to allow probabilities to vary with a covariate (e.g., age) or to add extra states to the model to allow for distinct transitional probabilities for distinct categories. 27
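Relaxing the stationarity assumption can be sketched as a cohort simulation in which the transition probabilities are recomputed at each cycle from the cohort's current age. The states, the growth probability and the mortality function below are hypothetical and deliberately simplified; they are not drawn from any of the reviewed models.

```python
def transition_matrix(age):
    """Hypothetical annual transition probabilities that depend on age."""
    # All-cause mortality rises with age (illustrative function, capped at 1).
    p_death = min(1.0, 0.02 * 1.09 ** (age - 65))
    p_grow = 0.10  # small aneurysm becomes large, conditional on survival
    alive = 1.0 - p_death
    return {
        "small": {"small": alive * (1 - p_grow), "large": alive * p_grow, "dead": p_death},
        "large": {"small": 0.0, "large": alive, "dead": p_death},
        "dead":  {"small": 0.0, "large": 0.0, "dead": 1.0},
    }


def run_cohort(start, start_age, cycles):
    """Return the state distribution at each cycle, starting from `start`."""
    dist = dict(start)
    trace = [dict(dist)]
    for cycle in range(cycles):
        matrix = transition_matrix(start_age + cycle)  # re-evaluated every cycle
        new = {state: 0.0 for state in dist}
        for state, share in dist.items():
            for target, p in matrix[state].items():
                new[target] += share * p
        dist = new
        trace.append(dict(dist))
    return trace


trace = run_cohort({"small": 1.0, "large": 0.0, "dead": 0.0}, start_age=65, cycles=20)
print({state: round(share, 3) for state, share in trace[-1].items()})
```

The same pattern extends to other covariates: any parameter that should vary, such as the rupture risk by aneurysm size, can be made an argument of the function that builds the matrix, or represented by additional states.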
Most models distinguish between non-AAA-related mortality and AAA-related mortality (usually defined by an occurrence within 30 days after rupture or aneurysmal repair). The former is typically informed by national age-and gender-matched mortality statistics (with AAA-related death subtracted), while the latter is informed by clinical databases. The mortality rate after elective surgery will be significantly lower in a regimen with screening versus without screening due to earlier detection and, accordingly, smaller and more often asymptomatic aneurysms. If this difference in mortality rates is not recognized, a major bias against screening will arise.
Discussion
The present quality assessment of health economic models for the cost-effectiveness of screening for AAA shows that the literature suffers from weaknesses in its presentation of the rationales for the chosen model structures and in justifying the assumptions made in the analyses. The point of departure for the present work was the conclusion of a previous assessment of the credibility of models published up to 2003 that they were characterized by low convergent validity; i.e., that models gave different results for unexplained reasons. 5 The authors of the previous review furthermore concluded that it was extremely difficult to attribute differences in results to one or more specific sources of uncertainty and that the modelling methods were in general poorly reported. The first of these arguments still holds true, whereas reporting has improved in that a few studies included detailed technical reports as supplemental material. It is recommended that modellers publish technical reports including an explicit and transparent list of all assumptions and their possible impacts.
It is difficult to explain the divergence of findings between studies simply in terms of their quality assessment scores because of the complexity of each model. Similarly, differences in settings and, in particular, differences in item costs are often held responsible for the divergence in findings, but this does not seem to be entirely justified in the present case. The average cost difference between screening and not screening was found to be highest in the American study and lowest in the British study; however, given the numerous other causes of variability between studies, this is not necessarily a reliable explanation for the discrepancy. The same conclusion was reached by the investigators who carried out the British trial, who tested the impact of replacing their item costs with those of the Danish model (which, in contrast to the remainder of the literature, had led to the conclusion that screening was not cost-effective) and found that this explained only a minor part of the discrepancy between the two models' results. 28
If models are to earn credibility, they must comply with their rationale of bridging original primary data and decisionmaking, even if the primary data are scarce or not tailored to the decision problem. This review demonstrates that most models rely on assumptions arising from the generalization of parameter estimates from particular study groups (e.g., older age groups, smaller aneurysms) that are made because no alternatives exist. Modellers cannot be held responsible for a lack of original estimates, but they are responsible for reporting any uncertainty in an explicit manner. Other assumptions become involved when parameter estimates are not specified probabilistically, which was generally the case for cost estimates. This essentially implies that the precision of such estimates is assumed to be 100% and that any analysis of the value of additional information is unnecessary. In contrast, it has been argued that when conducting health economic evaluations, probabilistic sensitivity analysis and the associated value of information analysis should be obligatory.29,30 Most of the reviewed models performed some form of probabilistic analysis, but few performed a value of information analysis. This is an important task for future studies.
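The link between probabilistic analysis and value of information analysis can be illustrated with a minimal sketch of the expected value of perfect information (EVPI) per person. The distributions are hypothetical, and only two strategies are compared, with no screening as the reference (net benefit of zero). EVPI is the expected net benefit when the best strategy could be chosen in every draw, minus the expected net benefit of the strategy that is best on average.

```python
import random


def evpi_per_person(n_draws, threshold, seed=0):
    """Per-person EVPI for screening versus no screening (reference, net benefit 0)."""
    rng = random.Random(seed)
    sum_nmb_screen = 0.0
    sum_best = 0.0
    for _ in range(n_draws):
        # Hypothetical distributions for the incremental cost and incremental QALYs.
        delta_cost = rng.gammavariate(25.0, 2.0)
        delta_qaly = rng.gauss(0.01, 0.005)
        nmb_screen = threshold * delta_qaly - delta_cost
        sum_nmb_screen += nmb_screen
        sum_best += max(nmb_screen, 0.0)  # best strategy chosen within this draw
    with_perfect_information = sum_best / n_draws
    with_current_information = max(sum_nmb_screen / n_draws, 0.0)
    return with_perfect_information - with_current_information


print(round(evpi_per_person(5000, 30000.0), 2))
```

When a parameter is entered as a fixed value, it contributes nothing to the spread of net benefit across draws, so an analysis of this kind implicitly assigns zero value to further research on it.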
Limitations
The best practice assessment criteria adopted for the present review were published in 2006, and some methodological advances have emerged since then. A recent study by Kim and Thompson described how comprehensive model validation should encompass some form of an assessment of internal, prospective and external validity. 22 The external validity check, in particular, has different interpretations between the guidelines used for the present assessment and the most recent methodological recommendations: conventional discussion of findings in the light of previous findings versus a quantitative comparison of model predictions with findings of external studies. The use of updated best practice criteria would probably have led to the identification of further weaknesses in the literature concerning model validation.
Systematic and exhaustive extraction of the underlying assumptions of a model based on its results is a complex task when conducted post hoc and by parties other than the original modellers. This assessment does not claim to have extracted all, or even the most important, assumptions, but only some examples. Furthermore, it should be noted that a multidisciplinary team of a clinical expert and a health economic modelling expert conducted the assessments; researchers with expertise in other disciplines might have chosen a different focus and thereby arrived at a different assessment result.
Finally, as this review was restricted to the challenging case of AAA screening, the results should be seen as illustrative rather than necessarily generalizable to the whole modelling literature.
Conclusions
Since the previous review of the credibility of health economic models of the cost-effectiveness of AAA screening, seven years have passed and a policy decision to introduce population screening for AAA in the UK has been enacted. Researchers in the field seem to have benefited from general advances in the state of the art of health economic modelling, and some improvements in the previously identified poor reporting standards were noted. However, the low level of agreement between studies in model structures and assumptions, and the difficulty in justifying these (convergent validity), remain a threat to the credibility of health economic models. Modellers might rightfully take different stands regarding methodological choices, but if the rationale for these, the associated assumptions and the impact on the results of such choices are not made fully transparent, the credibility of health economic modelling as a whole is threatened.
Decision-makers should not accept the results of a health economic modelling study if the methods are not fully transparent and justified. Modellers should, whenever relevant, supplement a primary report of results with a technical report detailing and discussing the methodological choices made. Finally, it should be understood as a given that health economic modelling requires a multidisciplinary team.
Footnotes
Acknowledgements
The authors thank the head librarian of the research library of Viborg Hospital, Hanne Margit Gronemann Christensen, who conducted the systematic literature searches, and the Health Research Fund of Central Denmark Region and the Research Fund of Viborg Hospital for funding.
