Abstract
Scotland has existing data resources which are competitive internationally and available to researchers from elsewhere. The Scottish Informatics and Linkage Collaboration (SILC) was recently launched, allowing data sets to be linked within and between sectors (e.g. health to non-health). The purpose of this review article is to introduce and define key terms in data linkage, to describe the emerging data linkage resources available in Scotland and to describe the opportunities available in Scotland to researchers internationally. The review is aimed at researchers internationally who are interested in data linkage using Scottish data resources. The review makes particular reference to longitudinal health data but emphasises that linkage to non-health data allows research questions to be considered that were previously not answerable. The review is focused on longitudinal data resources (e.g. cohort studies and repeated measures designs), since they are usually the focus of data linkage research. The review concludes that any intended data linkage for research should be driven by a clear research question. The infrastructure already available and the launch of SILC will accelerate research in Scotland and generate new research questions that previously could not be considered.
Introduction
Scotland already has data resources which are competitive internationally and available to researchers from elsewhere. The purpose of this review article is to describe the emerging data linkage resources available, so that researchers interested in data linkage opportunities become aware of what Scotland can offer. The review makes particular reference to health data (e.g. primary care, hospital records, cancer registries, mortality data) but emphasises that linkage to non-health data (e.g. education, social care and crime) may offer unique opportunities for answering certain research questions. The review is focused on longitudinal data resources (e.g. cohort studies and repeated measures designs), since these are often the focus of data linkage efforts and are usually considered better quality evidence in the ‘hierarchy of evidence’ than, for example, cross-sectional data or case reports.1 Longitudinal data can, however, be analysed as cross-sectional data by restricting the analysis to one measurement occasion if required. The review is also focused on data that have been digitised, but it is worth noting that a great deal of historic administrative data exists only in paper records.
Data linkage
Data linkage involves bringing two records together that belong to the same individual. This might involve linking records that belong to the same individual over time in the same data set (e.g. repeated hospital admissions for a patient), or linking records that belong to the same individual from two different data sets (e.g. a patient's health records to their social care records). Data linkage can result in two kinds of data linkage errors: false matches (two people are assigned the same ID) or missed matches (the same person is assigned more than one ID).2,3
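The two kinds of linkage error can be made concrete with a small sketch. It assumes, purely for illustration, that the true identity behind each record is known (as it might be in a validation sample); the names `linkage_errors` and `records` are hypothetical and not part of any Scottish linkage system.

```python
from collections import defaultdict


def linkage_errors(records):
    """Find linkage errors in (true_person, assigned_id) pairs.

    Returns the IDs shared by more than one person (false matches)
    and the people who received more than one ID (missed matches).
    """
    people_per_id = defaultdict(set)
    ids_per_person = defaultdict(set)
    for true_person, assigned_id in records:
        people_per_id[assigned_id].add(true_person)
        ids_per_person[true_person].add(assigned_id)
    false_matches = sorted(i for i, people in people_per_id.items() if len(people) > 1)
    missed_matches = sorted(p for p, ids in ids_per_person.items() if len(ids) > 1)
    return false_matches, missed_matches


# Hypothetical toy data: six records belonging to four people.
records = [
    ("anna", "ID1"), ("anna", "ID1"),  # correctly linked
    ("ben", "ID2"), ("carl", "ID2"),   # false match: two people share one ID
    ("dora", "ID3"), ("dora", "ID4"),  # missed match: one person has two IDs
]
false_matches, missed_matches = linkage_errors(records)
print(false_matches, missed_matches)
```

In practice the true identity is rarely known, so error rates are usually estimated rather than counted directly.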
In Scotland, following development of the Scotland-wide Data Linkage Framework (SDLF), the Scottish Informatics and Linkage Collaboration (hereafter, SILC) has now been launched. Subject to approval, researchers will be able to obtain linked data from different sectors. Each database retains its own identification, but these are linked by matching different identifiers together (e.g. sex, date of birth, National Health Service (NHS) number) through an index called the population spine. This population spine has a unique ID for each individual. Existing data from different sectors can then be linked using the population spine, although data linkage success will depend on the accuracy and quality of the identifiers available.3 Linked data will be available for researchers to analyse in secure settings called safe havens, several of which exist already.
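The idea of linking sector data sets through a population spine can be sketched as follows. This is a simplified illustration, not a description of how SILC implements linkage: it assumes exact (deterministic) matching on sex, date of birth and postcode, and every record, field name and ID is invented.

```python
def key(record):
    """Normalise identifiers so formatting differences do not block a match."""
    return (
        record["sex"].strip().upper(),
        record["dob"].strip(),
        record["postcode"].replace(" ", "").upper(),
    )


# Hypothetical spine: one row and one unique ID per individual.
spine = [
    {"spine_id": "S001", "sex": "F", "dob": "1950-03-01", "postcode": "EH8 9YL"},
    {"spine_id": "S002", "sex": "M", "dob": "1948-11-12", "postcode": "G12 8QQ"},
]
# Each sector keeps its own local IDs.
health = [
    {"local_id": "H17", "sex": "f", "dob": "1950-03-01", "postcode": "eh89yl"},
    {"local_id": "H42", "sex": "M", "dob": "1948-11-21", "postcode": "G12 8QQ"},  # mistyped date of birth
]
social_care = [
    {"local_id": "C03", "sex": "F", "dob": "1950-03-01", "postcode": "EH8 9YL"},
]

spine_index = {key(person): person["spine_id"] for person in spine}


def index_by_spine_id(dataset):
    """Return records keyed by spine ID, plus local IDs that failed to match."""
    linked, unlinked = {}, []
    for record in dataset:
        spine_id = spine_index.get(key(record))
        if spine_id is None:
            unlinked.append(record["local_id"])
        else:
            linked[spine_id] = record
    return linked, unlinked


health_linked, health_unlinked = index_by_spine_id(health)
care_linked, care_unlinked = index_by_spine_id(social_care)
# People present in both sectors can now be joined on the spine ID.
shared = sorted(set(health_linked) & set(care_linked))
print(shared, health_unlinked)
```

The unmatched health record shows why linkage success depends on identifier quality: a single mistyped date of birth is enough to produce a missed match under exact matching.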
The SILC project builds on the success of the previous ScottisH Informatics Programme (SHIP) infrastructure (now part of the Farr Institute at Scotland), which already has an indexing service and various secure databases. The project is expected to enhance research opportunities, improve how the public's administrative data are processed and attract investment into Scotland.4 Investment can be generated by answering novel research questions through data linkage for the first time in Scotland (owing to infrastructure already available and under development), by innovation in data linkage methodology, by attracting researchers to move to Scotland to take advantage of the data linkage resources available, and by an enhanced reputation for high quality research. Administrative data refer to data that are collected routinely by government departments, not originally for research purposes. The health informatics infrastructure already developed in Scotland has thus provided the foundation for data linkage involving data sets beyond health (e.g. education, social care and crime).
Historical context
It could be argued that attempts to pool data from different sectors in Scotland are nothing new – having been attempted four times historically. An early attempt to pool data in order to map the economy, demography and geography of Scotland (1680/1790) was followed by Sinclair's enormous ‘Statistical account of Scotland’ (1790/1799) and two sequels, the ‘Statistical account of Scotland II’ (1834/1845) and a final survey (1951/1992) which had staggered completion dates. There have also been regular Censuses which collect data on individuals, from 1801 and then every 10 years except during World War II. All these exercises, however, produced cross-sectional data (with the exception of the Scottish Longitudinal Study, linking together around 5% of each census from 1991 onwards). In epidemiology and public health, cross-sectional data can be helpful for determining prevalence of a disease, but not for determining incidence – the rate of new cases over a specified time period.1 Longitudinal data, where data on the same individuals are collected more than once, or where individuals are monitored to see what happens to them, allow calculation of incidence rates and have several additional advantages over cross-sectional data. For example, in epidemiology research it is easier to make claims about cause (exposure) and effect (outcome) when the exposure was measured before the outcome. Cross-sectional surveys make it difficult to disentangle causes, effects and confounding factors; nor is it possible (without data linkage) to evaluate key end points such as cancer, cardiovascular events and death.
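The distinction between prevalence and incidence can be illustrated with a short calculation. The figures below are invented for illustration only: prevalence needs a single survey, whereas an incidence rate needs follow-up time for each person.

```python
def point_prevalence(cases_at_survey, people_surveyed):
    """Proportion of people who have the disease at one point in time."""
    return cases_at_survey / people_surveyed


def incidence_rate(new_cases, person_years):
    """New cases per person-year of follow-up among people initially disease free."""
    return new_cases / person_years


# Cross-sectional survey (hypothetical): 30 existing cases among 1000 people.
prevalence = point_prevalence(30, 1000)

# Longitudinal follow-up (hypothetical): (years observed, became a case?)
follow_up = [(5.0, False), (2.5, True), (5.0, False), (1.0, True), (4.5, False)]
person_years = sum(years for years, _ in follow_up)
new_cases = sum(1 for _, became_case in follow_up if became_case)
rate_per_1000 = 1000 * incidence_rate(new_cases, person_years)

print(f"prevalence = {prevalence:.1%}")
print(f"incidence = {rate_per_1000:.1f} per 1000 person-years")
```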
International context
Scotland is already internationally competitive in terms of data linkage, particularly in health informatics. Data linkage systems have been created in Canada, Australia and Scandinavia and more recently in other UK countries (e.g. the SAIL databank in Wales). The Oxford Record Linkage System is an early example of data linkage in the health sector, but this has been discontinued. Health informatics in Scotland is competitive owing in part to the foresight of the creators of the Community Health Index (CHI) number, which was initially used in Tayside only but was then used routinely for all people who come into contact with healthcare providers in Scotland. This meant that many different kinds of health data sets had the same ID number (e.g. primary care, hospital records, diabetes registers). The CHI contains date of birth and is therefore more disclosive than the NHS number. CHI numbers have been extremely useful in facilitating data linkage in Scotland. Today, CHI is used by the NHS National Services Scotland indexing team for health research, whereas NHS number (from the NHS Central Register) is mainly used for socio-economic research projects by National Records of Scotland (NRS). There is, however, regular linkage between NHS and CHI numbers, with a nightly update ensuring high correspondence between the two systems. At the time of writing, the Economic and Social Research Council (ESRC) has funded Administrative Data Research Centres (ADRC) in England, Wales, Scotland and Northern Ireland. It is not yet clear whether population spines will be created in all four countries, but it is likely that the infrastructure and innovation developed in Scotland will inform what happens elsewhere.
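Why an identifier that embeds date of birth is disclosive can be shown with a minimal sketch. It assumes the commonly documented CHI layout (six digits of date of birth as DDMMYY, two further digits, a ninth digit that is odd for males and even for females, and a final check digit); that layout is not stated in this review, the function name is hypothetical, the check digit is not validated, and the example value is made up.

```python
def describe_chi(chi):
    """Read the personal details embedded in a CHI-style number.

    Assumed layout (see lead-in): DDMMYY, two digits, sex digit, check digit.
    The check digit is ignored here, so this is not a validator.
    """
    if len(chi) != 10 or not chi.isdigit():
        raise ValueError("expected a 10-digit string")
    return {
        "day": int(chi[0:2]),
        "month": int(chi[2:4]),
        "year_two_digit": int(chi[4:6]),  # century cannot be recovered from the number
        "sex": "male" if int(chi[8]) % 2 == 1 else "female",
    }


# Made-up value for illustration; not a real patient identifier.
details = describe_chi("0101501234")
print(details)
```

Anyone holding the identifier alone can therefore recover a birth date and sex, which is not possible with an identifier that carries no embedded personal information.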
United Kingdom context
Data research and informatics agencies within Scotland and the UK provide data sets and linkage resources for researchers. The Electronic Data Research and Innovation Service (eDRIS) provides support for health data linkage in Scotland, and is funded by NHS Scotland. It draws on the NHS Scotland Information Services Division (ISD) Record Linkage Service and contributes to the Farr Institute at Scotland, incorporating elements of the earlier SHIP. Agencies across the UK work together in order to address so-called ‘complex cases’ such as Scottish patients presenting in England for the first time, or vice versa. Administrative data are collected routinely for the UK as a whole, nationally in Scotland and by local authorities. Prior to the launch of the ADRCs, there was often no central body for researchers to contact in order to obtain all the administrative data needed. For some linkage projects, it may be necessary to contact local authorities separately and then pool the data, as done recently for a project linking health with social care data.5 Many authorities do, however, submit their data for national processing or to the Department for Communities and Local Government (DCLG), in which case it may be sensible to contact government departments in order to obtain it.
Policy and legal context
Recent legal and policy changes in Scotland mean that now is a particularly good time to consider data linkage opportunities. Aimed at improving record keeping in Scotland, the Public Records (Scotland) Act 2011 is a key piece of legislation,6 the first specifically concerning public records since 1937. The act was introduced to improve record keeping in order to prevent abuse and protect vulnerable people,6 allowing data linkage between sectors to achieve this aim. Subsequent policy documents and data linkage exercises have highlighted the potential for protecting vulnerable people and other benefits of data linkage such as reducing costs.4,5,7 The economic climate at the time of writing is important to consider. Scotland's capital budget is being cut by 37% in real terms. The budget for 2012/2013 was intended to ensure that investment was made in infrastructure, attracting investment and job creation. There is a move towards (and an increased investment in) so-called ‘preventative spending’, which includes adult social care, the early years and reducing re-offending. Sectors are encouraged to work together in order to reduce expenditure in the future. More frequent and better quality (fewer data linkage errors) data sharing across sectors is thought to contribute to this objective. Data linkage from existing research studies to administrative data also features in the recent review of how research is used to support the Scottish Government in its objectives.4,7
Benefits of data linkage in Scotland
The benefits of data linkage from longitudinal studies to administrative data are numerous. A recent example is the linkage of data from Scottish Health Surveys to determine the association between shorter height and dementia death.8 A key advantage of linking research to administrative data is the ability to continue monitoring participants who might otherwise have been lost due to drop-out, migration and so on. While people often drop out of longitudinal research and clinical trials, they rarely drop out of administrative data.9 If participants in a prospective cohort provide informed consent to be followed for health outcomes, their administrative data can be checked even if they no longer take part in follow-ups by the research study itself. People recruited and retained into longitudinal research tend to have better physical health, higher levels of educational attainment and higher socio-economic status.10 This can produce biased results when data are analysed, because the sample can suffer from health selection11,12 and healthy survivor effects. Health selection refers to restriction of range on study variables at recruitment, for example, if participants tend to have better physical health than the wider population.12 Healthy survivor effects refer to the tendency for unhealthier participants to die or drop out of follow-ups. This can produce biased estimates of the association between smoking and lung cancer, for example, if smokers who remain active in the study are healthier than those who have left.13 Quantifying the extent of health selection at recruitment into a study is often difficult, because the characteristics of the non-responders are not known. Using routinely collected data means that researchers can check for differences between their participants and those not included in the study, using chi-square tests for proportions expected and observed, for example.
Health selection is a particular problem for cohorts established in adulthood. 14
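A check for health selection of the kind described above might look like the following sketch. The counts are invented, and the comparison (whether participants and non-participants differ in the proportion with a prior hospital admission) is only one hypothetical choice of variable. The code computes Pearson's chi-square statistic for a 2 × 2 table without a continuity correction.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    table = [[a, b], [c, d]] holds observed counts; expected counts are
    derived from the row and column totals.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: [prior admission, no prior admission]
observed = [
    [120, 880],   # study participants
    [300, 1700],  # people not included in the study
]
statistic, p_value = chi_square_2x2(observed)
print(f"chi-square = {statistic:.2f}, p = {p_value:.3f}")
```

In this invented example 12% of participants and 15% of non-participants had a prior admission, a difference the test would flag as unlikely to be due to chance alone.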
Longitudinal data resources in Scotland
Longitudinal data resources in Scotland, sample sizes and study protocols/profiles. a
Although not included in this report, it is legitimate to follow up or link cross-sectional data to administrative data and long term outcomes. It is also legitimate to convert randomised controlled trials into longitudinal studies, as done for several studies here.
It must be emphasised that any data linkage project should be driven by a clear research question, and one which cannot be answered without the linkage. Efficient data linkage in the longitudinal sector will usually involve linking a set of predictor variables from a longitudinal study, with a set of outcomes from routinely collected administrative data – but not always. Techniques such as meta-analysis of individual participant data involve pooling data from individuals in order to increase statistical power overall.8 Statistical power, the probability of detecting an association between an exposure and an outcome when one truly exists, increases with sample size.1 Many researchers are familiar with the importance of doing sample size calculations in order to recruit a sample that has sufficient statistical power to detect a particular size of effect (or difference).1 In data linkage, however, the sample sizes available are typically much larger, and statistical power is enhanced for looking at smaller groups of individuals and rarer outcomes.
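How power grows with sample size can be sketched with a normal approximation. The example assumes a two-sided comparison of two equally sized groups and a small standardised effect of 0.1; the function name, the effect size and the sample sizes are illustrative choices, not figures from any study cited here.

```python
from statistics import NormalDist


def power_two_groups(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means.

    effect_size is the standardised mean difference; both groups are
    assumed to have n_per_group members and equal variance.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


sample_sizes = [100, 1000, 10000]
powers = [power_two_groups(0.1, n) for n in sample_sizes]
for n, p in zip(sample_sizes, powers):
    print(f"n per group = {n:>5}: power = {p:.2f}")
```

Under these assumptions a cohort-sized sample of 100 per group would usually miss the effect, while a linked administrative sample of 10,000 per group would almost always detect it.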
Researchers who want to use data linkage might want to propose a linkage that would allow them to group several research questions together, making the exercise more cost-effective. For example, they could study the association between an exposure and three different health outcomes from the same data linkage application using NHS number: cancer registration, all-cause mortality and cause-specific mortality. Those evaluating proposals, however, will need to check that proposals are not too broad and that research questions are sufficiently precise, to avoid requests for data that are not actually needed. If researchers intend to request a large dataset in order to answer several different questions, they will need to be specific from the outset about which data are needed. For example, an epidemiologist could request one data extract, in order to study 10 exposures with two or three outcomes from administrative data. This is more efficient than requesting 30 different data extracts.
Conclusions
The SILC service offers opportunities for transforming cross-sectional data into longitudinal data. Data linkage from longitudinal studies to administrative data beyond health data is a promising new development. As mentioned previously, health data have been linked to social care outcomes.5 However, a great deal of administrative data exists only in paper records. For example, the National Records of Scotland hold transcripts from Sheriff courts in paper format. Linkage to these records would not be practical in the short to medium term, because considerable investment would be needed in order to digitise the data.
Take home messages
Any intended data linkage for research should be driven by a clear research question. Efficient data linkage in the longitudinal sector will usually involve linking a set of predictor variables from a longitudinal study, with a set of outcomes from routinely collected administrative data. Researchers may want to identify studies in Scotland that have particularly detailed information on a set of predictors, repeated measurements of these predictors and data on the ‘outcomes’ they are interested in, whether health or non-health (see online Table S1). For example, a researcher interested in the role of early educational exposures, measured repeatedly from adolescence to early adulthood, on health and social care in old age, might find the 36-day sample particularly relevant.15 Researchers could group research questions together and propose a linkage that would allow them to address several questions with the same data, making the linkage more cost-effective. The infrastructure already available in Scotland and the launch of SILC will accelerate research in Scotland and generate research questions that could not even be considered until now.
Footnotes
Glossary
| Term | Definition |
| --- | --- |
| Administrative data | Data that are collected routinely by government departments, not originally for research purposes. |
| Administrative Data Research Centres (ADRC) | Centres in each of four countries (England, Scotland, Wales, Northern Ireland) that facilitate data linkage, develop data linkage methodology and provide expertise on public engagement, ethics, information governance and law. |
| Community Health Index (CHI) number | A 10-digit number used to identify patients treated by NHS Scotland, which appears in many medical data sets in Scotland. |
| Cross-sectional data | Data recorded at one point in time, with no follow-up to repeat measurements on the same participants. |
| Data linkage | Joining two records together that belong to the same individual. Internal data linkage links records from the same person over time (e.g. repeated hospital admissions). External data linkage links records to another data set (e.g. participant in a cohort study to cancer registration data). |
| Data linkage error | Following an attempt to link data, the production of false matches (two people are assigned the same ID) or missed matches (the same person is assigned more than one ID). |
| Electronic Data Research and Innovation Service (eDRIS) | A service to support researchers who want to conduct health research involving Scottish patients or Scottish health data. |
| Farr Institute at Scotland | A collaborative network of Universities and NHS services in Scotland to support health informatics research in Scotland. |
| Health informatics | The application of information technology to healthcare, epidemiology and public health. |
| Hierarchy of evidence | A hierarchy considered useful for comparing the relative quality of evidence produced by different research designs, ranging from systematic reviews of randomised controlled trials (best) to case reports and anecdote (worst). |
| Identifiers | Characteristics that identify individuals in a data set (e.g. date of birth, sex, NHS number, postcode) and which can be used to link data sets together. |
| Information Services Division (ISD), NHS Scotland | |
| Longitudinal data | Data in which the characteristics of individuals are measured more than once, or where individuals are followed over time to see what happens to them. Prospective cohort studies, epidemiological surveillance studies and repeated measures designs are all considered longitudinal. |
| National Health Service (NHS) number | A 10-digit number used to identify patients treated by the NHS across the UK. In theory, it is a unique identifier so that one NHS number applies to one patient. |
| National Records of Scotland (NRS) | Responsible for registration (e.g. births, marriages and civil partnerships, deaths) and statistical functions of the Registrar General for Scotland, including demography and the census. |
| Population spine | A new data set created by cross-referencing other data sets, theoretically allowing an accurate count of the entire population. |
| Safe haven | A secure environment in which data are linked and accessed. |
| Scottish Informatics and Linkage Collaboration (SILC) | A collaboration between many academic and public bodies to ensure that Scotland realises the benefits that can be derived through the legal, ethical and carefully controlled use of administrative, survey and other types of data. |
| ScottisH Informatics Programme (SHIP) | A precursor to the Farr Institute at Scotland. |
Acknowledgements
GH-J would like to thank Gerald Donnelly and Ian Deary for their support and guidance, and staff at the Centre for Cognitive Ageing and Cognitive Epidemiology (CCACE), University of Edinburgh for hosting the author during his visit (particularly Caroline Brett). He would also like to thank David Campbell and Susan Carsley at National Records of Scotland for helpful comments on earlier versions of this review.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: This review was originally undertaken as paid consultancy by GH-J for the Scottish Government and University of Edinburgh. It has not been published elsewhere in any form. Any views expressed are the authors' own.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
