Abstract
Under the Modernization programme Statistics Canada has recently undertaken, the Agency is to put forward data access solutions that present greater analytical value to Canadians while maintaining its core values of protecting confidentiality of respondents’ information. One avenue currently explored is Data Synthesis as a means of delivering synthetic data with high analytical value to users. At the time of writing, Statistics Canada has publicly released synthetic versions of two different datasets related to census, mortality and cancer information. In both cases, synthetic data were generated using the R package synthpop. This paper describes the use of Data Synthesis as a proof of concept for modernizing Statistics Canada’s data access solutions.
Introduction
Statistics Canada, as part of its mandate, is to disseminate relevant statistical information about the Canadian population based on the confidential data collected from its constituents. Over the years, Statistics Canada has developed several data access solutions offering various degrees of accessibility and analytical utility to users. Figure 1 shows, from an advanced user’s perspective, the main social data access solutions (footnote 2) used by the Agency to make statistical information available. At one end of the spectrum, there is a master file of collected data in its most useful form; however, the significant disclosure risk associated with this product restricts its access to authorized employees only. Next come the solutions providing data of high utility to certified users, where few modifications are made to the original data but where strict security measures (physical restrictions) are in place. For the more easily accessible products, the data access strategy is focused on making the data itself safer, using traditional statistical disclosure control methods. The idea is to protect respondents’ confidentiality by altering the original values in the dataset. Of course, altering the dataset degrades its utility. For example, near the other end of the spectrum, one finds Statistics Canada’s daily publication, The Daily, which offers all Canadians highly aggregated statistical information only. While aggregated statistical information fulfills the information needs of many users, including researchers in the exploratory phase of their work, there is demand for a more direct, hands-on experience with information in dataset form.
Figure 1. An advanced user’s view onto Statistics Canada’s current social data access solutions along the axes of utility and accessibility.
These data access solutions are located more or less on an arc connecting the upper left and lower right corners of Fig. 1. This implies that none scores high on both dimensions at once: each solution represents a different balance achieved between accessibility and utility. Furthermore, data access solutions have traditionally been developed with confidentiality protection as the main objective so that their utility has been more or less incidental to the disclosure protection measures used.
Recently, the Government of Canada has developed a Digital Agenda (footnote 3) and has asked public servants to be more innovative and responsive to help make Canada’s economy and society more data-driven.
As part of the Modernization programme it has undertaken over the past three years to embrace the new data context, Statistics Canada is adopting a more user-centric approach to service delivery. This means consulting extensively with data users to learn more about their exact needs [1].
Therefore, the Agency is looking to provide Canadians with more flexible and useful data access solutions while remaining true to its core values of protecting the confidentiality of respondents’ information. In this context, Data Synthesis is a leading-edge avenue worth considering to disseminate useful microdata in a suitably confidentialized form, which is a need many users have expressed.
The goal with Data Synthesis is to go beyond traditional statistical disclosure control methods and position confidentialized synthetic data files of high analytical value above the arc of Fig. 1. In a nutshell, Data Synthesis is a two-step process. First, the joint distribution of all variables, seen as the source of the dataset’s utility, is modeled. Then the original values contained in the dataset, which are the source of the disclosure concerns, are replaced by values randomly generated from the estimated model. In principle, the synthetic dataset obtained is confidentialized (because of random generation) and of comparable analytical value to the original one (because the random generation is carried out from an informed model). Thus, unlike other disclosure control methods used on datasets, such as random noise addition and data swapping, Data Synthesis does not seek to achieve greater accessibility at the expense of utility. Instead, by targeting the joint distribution, which is the source of both the utility and the confidentiality concerns, Data Synthesis aims at preserving utility and confidentiality simultaneously.
The method is appealing because, ideally, users would draw the same statistical conclusions from a synthetic file as they would from the original one; but, because they are generated from various statistical and machine learning models, synthetic data cannot, in principle, be directly traced back to respondents, which significantly alleviates disclosure concerns related to re-identification. Conceptually, Data Synthesis is used to extract and maintain the general characteristics that reveal patterns of the population rather than of a specific individual. These general patterns are of high interest to researchers because they establish the analytical value of a dataset.
Given their resemblance to the original data, not only in terms of structure but also in terms of analytical value, synthetic data are useful in many situations. Indeed, synthetic data of high analytical value will be most useful to researchers seeking to create statistical models that describe the many complex relationships existing in the original dataset. The advantage over the original data is that such synthetic data can freely be accessed outside of a secure environment. In fact, one of the business cases for the use of synthetic data at Statistics Canada would be to provide researchers with synthetic files while they are waiting to be granted secured access to the original ones, so that they can start exploring the data and the various relationships between the variables. However, it is important to be mindful that a synthetic dataset is the result of a modeling exercise and, as such, relationships between variables are artificially created. Although these relationships aim to mimic the real ones, and the ideal situation would be to have all of them preserved, the modeling exercise could introduce new relationships that are not present in the original data or fail to preserve some of the original relationships. Furthermore, it can naturally be anticipated that the more complex a dataset is, the more challenging it will be to create a synthetic version of high analytical value. For instance, a longitudinal dataset at the individual level with a hundred variables and millions of observations that contains information on potential dependencies between units (e.g., indicators of nuclear families, indicators of the main income earner of the household) will be more complex to synthesize than a small dataset of a dozen variables without any dependencies between units.
In addition to the methodological complexity of the models used in such cases, the procedure could easily become time consuming, in terms of both pre-processing and computing time. Consequently, the resemblance of a synthetic dataset to the original one relies mainly on the models used. Finally, users should always be mindful that even when a synthetic dataset is of high analytical value, that is, similar to the original dataset in terms of statistical properties, it remains a substitute for the real one. Therefore, it is generally recommended that statistical results and conclusions intended for publication purposes or decision-making processes be based on original observations.
It should be noted that the term “synthetic data” is not new at Statistics Canada. However, the previous editions of synthetic data presented so little analytical value as to be commonly referred to as dummy files: files in which the structure of the original (e.g., same variable names) has been preserved but its analytical value has not. The contents of a dummy file are randomly generated but, unlike in Data Synthesis, the underlying model provides a very poor fit to the original dataset. Typically, the model only encapsulates a (small) number of marginal distributions that the dummy file contents must satisfy. Dummy files are generally used for testing purposes to see if the mechanics of a production process work, without taking into consideration the interpretability of the data and the meaningfulness of the analytical results obtained. In contrast, Data Synthesis aims at preserving much of the analytical value of the original file to allow meaningful analytical exploration.
The rest of the paper is structured as follows. Section 2 gives an introduction to the theoretical concepts underlying Data Synthesis and how they can be implemented in practice. Then, Section 3 presents the two pilot projects involving Data Synthesis that were carried out at Statistics Canada to assess the feasibility of using this methodology in a production-type environment, as well as some of the research questions they have spurred. The final section concludes this paper and describes some of the limitations surrounding the use of synthetic data.
Background
Data Synthesis aims at striking a balance between confidentiality and utility, and the idea is not new. It was proposed over 25 years ago by Rubin [2] who, for the first time, suggested generating synthetic datasets using the Multiple Imputation (MI) technique [3]. However, it is only recently that Data Synthesis has reached mainstream official statistics. Using a different methodology than the one presented here, the US Census Bureau is well versed in producing synthetic data. Indeed they release the partially synthetic Longitudinal Business Database as well as On The Map, and they also release the fully synthetic Survey of Income and Program Participation Synthetic Beta [4].
This section describes the methodology used at Statistics Canada to produce synthetic data of high analytical value. Data Synthesis is the term used at Statistics Canada for the process described in detail by Drechsler [5]. Over the years, several leading-edge methodologies have been implemented at Statistics Canada, including the bootstrap method for variance estimation in a survey setting. Experience has taught us that availability of readable accounts, an established tool and experts to consult all play a key role in a successful implementation of a new methodology; in our case Data Synthesis met all of these conditions. In this paper we mainly focus on the analytical-preserving aspects of the process as this is the main advantage of the method from a user perspective. Considerations related to Statistics Canada’s release approval process are presented by Sallier and Girard [6].
Data Synthesis calls for a replacement dataset to be created which is to preserve as much as possible the analytical value of the original file while fulfilling one’s confidentiality commitments. This means that, when provided with a synthetic data set, a user would hope to obtain the same statistical conclusions that they would have if they had access to the original one.
Figure 2. The conceptual ideas behind the creation of synthetic data of high analytical value.
Figure 3. A representation of the Data Synthesis process, where D represents the original dataset and D’ the synthetic one.
Figure 2 presents the conceptual ideas underlying the creation of synthetic data using Data Synthesis. The utility of a synthetic file could be seen as a measure of how close results of statistical analyses performed on it are to the original ones. There is no known standard regarding this measure, but the idea is that the utility of a file is maximized when both the synthetic and original files allow for the same statistical conclusions to be reached for the various analyses performed by a user, without the synthesizer (person creating the synthetic file) knowing beforehand which analyses will be performed. Section 3 presents a research project conducted at Statistics Canada on a potential utility measure.
The analytical value of the original dataset could be expressed as the (complex) set of relationships that exist among all of its variables which, in statistical terms, constitutes the joint statistical distribution of the data. If the joint distribution were known, then it could be used to generate a new dataset with the same analytical value as the original one (because the two datasets would then have originated from the exact same source). Thus, central to Data Synthesis is the estimation of the joint distribution underlying the observed data.
Data Synthesis is the following two-step process. First, the original dataset is used to model the joint distribution of the data. Second, new data are generated from the estimated model. In this process, the joint distribution is both the source of the original dataset’s utility and of the confidentiality issues its custodian is facing. Indeed, because the hypothesized true joint distribution is unknown in practice, one can only analyze it through the values provided by respondents. Hence, data points generated from the joint distribution of interest and confidential information supplied by respondents become inextricably linked. By generating new data from the estimated joint distribution, Data Synthesis seeks to sever the link between data values and respondents while still creating a meaningful dataset, one that retains as much of the original data’s analytical value as possible. Figure 3 shows the conceptual representation of the Data Synthesis process.
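The two steps can be illustrated with a deliberately simple sketch. It assumes a dataset of only two continuous variables and, purely for illustration, takes a bivariate normal distribution as the model of the joint distribution; the function names `fit_bivariate_normal` and `generate` are hypothetical and are not part of any package mentioned in this paper.

```python
import math
import random
import statistics


def fit_bivariate_normal(records):
    """Step 1: estimate a model of the joint distribution from the original data.

    Assumption for this sketch only: the joint distribution of the two
    variables is approximated by a bivariate normal distribution.
    """
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in records) / (len(records) - 1)
    rho = max(-1.0, min(1.0, cov / (sx * sy)))  # guard against rounding error
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": rho}


def generate(model, n, rng):
    """Step 2: replace the original values by values drawn from the estimated model."""
    synthetic = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (
            model["rho"] * z1 + math.sqrt(1.0 - model["rho"] ** 2) * z2
        )
        synthetic.append((x, y))
    return synthetic


if __name__ == "__main__":
    rng = random.Random(2019)
    # Toy "original" file D: two related continuous variables.
    d = [(x, 2.0 * x + rng.gauss(0.0, 5.0)) for x in (rng.gauss(50.0, 10.0) for _ in range(1000))]
    d_prime = generate(fit_bivariate_normal(d), len(d), rng)
    print("original correlation :", round(fit_bivariate_normal(d)["rho"], 3))
    print("synthetic correlation:", round(fit_bivariate_normal(d_prime)["rho"], 3))
```

No synthetic record is a copy of an original one, yet the relationship between the two variables carries over because it is encoded in the fitted model. Real files rarely follow such a convenient distribution, which is why the joint model is built differently in practice.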
In practice, modeling the joint distribution in a single run is not simple. Indeed, a typical dataset is made of a large number of variables of various natures, and the overall joint distribution of these variables cannot be captured using traditional statistical distributions alone. This is why it is recommended [5] to use the Fully Conditional Specification (FCS) approach, which expresses the joint distribution as a series of conditional distributions that are simpler to approximate. More specifically, let $Y_1, \dots, Y_p$ denote the $p$ variables of the original dataset, listed in the order in which they are to be synthesized. The joint distribution can then be factored as

$$f(Y_1, \dots, Y_p) = f(Y_1)\, f(Y_2 \mid Y_1) \cdots f(Y_p \mid Y_1, \dots, Y_{p-1}). \qquad (2.2)$$

The original dataset is used to approximate each of the conditional distributions on the right-hand side of Eq. (2.2), through either parametric or nonparametric modeling approaches. This amounts to iteratively modeling each variable conditionally on those that precede it, following the order $Y_1, Y_2, \dots, Y_p$.
From an implementation perspective, FCS presents the advantage of having to deal only with conditionally univariate distributions instead of the joint distribution. Therefore, the modeling promises to be simpler to perform, and each variable can be modeled using its own separate model based on its nature and that of the other variables available.
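A minimal sketch of this sequential scheme is given below for categorical variables. It is an illustration under stated assumptions, not the implementation used at the Agency: each conditional model is simply the list of observed values grouped by the values of a chosen set of predictors, and the names `synthesize_fcs` and `predictors` are hypothetical. Predictors must appear earlier in `order` than the variable they explain.

```python
import random
from collections import defaultdict


def synthesize_fcs(records, order, predictors, n, rng):
    """Synthesize n records variable by variable, following `order`.

    records    : list of dicts (the original file)
    order      : list of variable names, in synthesis order
    predictors : dict mapping a variable to the earlier variables it depends on
                 (hypothetical argument; a variable with no entry has none)
    """
    # "Fit" one conditional model per variable: observed values grouped by
    # the values taken by that variable's predictors.
    models = {}
    for var in order:
        preds = predictors.get(var, [])
        table = defaultdict(list)
        for rec in records:
            table[tuple(rec[p] for p in preds)].append(rec[var])
        models[var] = (preds, table)

    synthetic = []
    for _ in range(n):
        new = {}
        for var in order:
            preds, table = models[var]
            key = tuple(new[p] for p in preds)  # uses already-synthesized values
            donors = table.get(key)
            if not donors:
                # Predictor combination never observed: fall back on the margin.
                donors = [rec[var] for rec in records]
            new[var] = rng.choice(donors)
        synthetic.append(new)
    return synthetic


if __name__ == "__main__":
    rng = random.Random(1)
    original = [
        {"education": e, "income": rng.choice(inc), "region": rng.choice(["east", "west"])}
        for e, inc in [("secondary", ["low", "low", "mid"]), ("university", ["mid", "high", "high"])]
        for _ in range(100)
    ]
    synthetic = synthesize_fcs(
        original, ["education", "income", "region"], {"income": ["education"]}, 5, rng
    )
    for row in synthetic:
        print(row)
```

Each variable is handled by its own univariate model, which is the practical appeal of FCS. Note that letting every variable depend on all earlier ones through such saturated tables would amount to resampling whole original records, which is one reason more parsimonious models are preferred.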
However, when there are many variables to be synthesized, models at the end of the sequence can be very large. Thus, one of the caveats of Data Synthesis is that not every relationship that exists in the real data will necessarily be preserved in the synthetic data. Also, relationships that are absent from the real data can be artificially created in the synthetic data.
The FCS approach poses two methodological questions which we discuss next: the order of the variables and the choice of the models.
There is currently no established standard for selecting the order in which variables are to be synthesized. Best practices call for the order to be determined in terms of the logical relationships that may exist among variables. For example, it is generally accepted that an individual’s education level influences their income. We would therefore want education placed before income in the synthesis order, so that the generated values of education can be used as predictors when synthesizing income. This helps preserve the relationships between variables.
During the synthesis process, sampling with replacement is used to synthesize the first variable, since it is equivalent to drawing values using the Empirical Distribution Function (EDF) of the sample. The EDF is unbiased for the true Cumulative Distribution Function (CDF) and converges uniformly to the true CDF with probability 1 by virtue of the Glivenko-Cantelli theorem. This method provides us with a non-parametric way of estimating and drawing from the distribution of the first variable. This is important since it is often the case that the distribution of population variables cannot be easily estimated using standard parametric models, especially since the first variable is modeled without the use of explanatory variables.
For each of the subsequent variables, we have the benefit of using the previous ones as predictor variables to develop a richer model than what the variable’s estimated CDF provides. To carry out this sequential modeling, from the second variable to the last, the Classification and Regression Tree (CART) machine-learning method is often seen as a standard. Indeed, it can be more easily applied than parametric modeling, especially for data with irregular distributions [7]. This is useful when performing Data Synthesis since we are generally unaware of how complex the conditional distributions of the variables are. CART can also capture non-linear relationships and interaction effects which may not have been easily revealed with parametric modeling. This is important in attempting to retain analytical value in the synthetic dataset.
Finally, it should be mentioned that the analytical value of a synthetic dataset strongly depends on the quality of the original data and on the models chosen during the process. Indeed, models that are unable to capture the relationships between the variables will generate synthetic data that do not sufficiently match the statistical properties of the real data.
In terms of implementation, based on consultation with experts such as Drechsler and Charest, Data Synthesis is implemented at the Agency with the R package synthpop developed as part of the Synthetic Data Estimation for UK Longitudinal Studies (SYLLS) project (footnote 4).
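The mechanics described above can be sketched in a few lines for two numeric variables. This is a toy regression tree written for illustration, not the CART implementation of synthpop: the first variable is drawn by sampling with replacement, and the second is obtained by dropping each synthetic value of the first down a tree grown on the original data and drawing one observed value from the leaf reached. The function names and the defaults `min_leaf=20` and `max_depth=3` are assumptions.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(pairs, min_leaf):
    """Threshold on x that most reduces the SSE of y, or None if no split helps."""
    pairs = sorted(pairs)
    best, best_sse = None, sse([y for _, y in pairs])
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        total = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if total < best_sse:
            best, best_sse = (pairs[i - 1][0] + pairs[i][0]) / 2, total
    return best


def grow(pairs, min_leaf=20, max_depth=3):
    """Grow a small regression tree; leaves keep the observed y values (donors)."""
    thr = best_split(pairs, min_leaf) if max_depth > 0 else None
    if thr is None:
        return {"donors": [y for _, y in pairs]}
    return {
        "thr": thr,
        "left": grow([p for p in pairs if p[0] <= thr], min_leaf, max_depth - 1),
        "right": grow([p for p in pairs if p[0] > thr], min_leaf, max_depth - 1),
    }


def draw(tree, x, rng):
    """Find the leaf that x falls into and draw one observed y from it."""
    while "donors" not in tree:
        tree = tree["left"] if x <= tree["thr"] else tree["right"]
    return rng.choice(tree["donors"])


def synthesize(pairs, n, rng):
    """First variable: sampling with replacement. Second variable: tree-based draw."""
    tree = grow(pairs)
    syn_x = rng.choices([x for x, _ in pairs], k=n)  # draw from the EDF of x
    return [(x, draw(tree, x, rng)) for x in syn_x]
```

Because the tree partitions the data wherever the response changes most, a threshold effect (for instance income jumping past a certain number of years of education) is picked up without having to be specified in advance.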
Synthetic data of high analytical value: Statistics Canada’s experience and undergoing research
Data Synthesis projects at Statistics Canada
As of now, Statistics Canada has released two synthetic datasets of high analytical value for public use. In both cases the mandate was to provide data of high analytical value to participants of hackathons. The first experience was in 2018, when Statistics Canada’s Health Analysis Division sponsored a hackathon as part of the 5
The second experience was in 2019: the Canadian Partnership Against Cancer (CPAC), in collaboration with Statistics Canada’s Centre for Population Health Data (CPHD), invited researchers from across Canada to participate in a hackathon taking place in early November, during the 2019 Canadian Cancer Research Conference held in Ottawa, Canada. The goal was to have participants compete in a time-limited and team-based analysis using a synthetic dataset mimicking a real linked dataset that combines statistical information on cancer incidence, treatment and sociodemographic characteristics. The dataset to be synthesized was the result of the linkage of the 2006 Census Long Form, the Canadian Cancer Registry (CCR) (diagnosis years: 1992–2015) and the Canadian Vital Statistics Death Database (CVSD) (1992–2014). There were in fact two linkages performed: the CCR was first linked to the CVSD; then, the resulting file was linked to the 2006 Census Long Form, both for individuals who appear on the CCR and for individuals who were never diagnosed with cancer (not appearing in the CCR). The file has approximately 5 million observations and 47 variables, which is 11 more variables (most of them sensitive in nature) than in the first project. In contrast to the first project, this file is rich in cancer-related information such as the year of diagnosis of the tumour, the age at diagnosis, the type of cancer (13 cancers were selected for inclusion and, while patients may have been diagnosed with one or more cancers, only the first primary cancer diagnosed is represented on the file) and the stage of cancer, along with other cancer-related variables.
Prior to synthesis, the sensitivity of the file was reduced by collapsing categorical variables so as to have at most 15 categories each and by suppressing overly detailed geographical information. However, contrary to the first hackathon project, it was decided not to aggregate continuous variables. Instead, continuous variables were generated in continuous form and the generated values were then smoothed to add a layer of confidentiality protection: Gaussian kernel smoothing was used in order to preserve conditional distributions. Thus, for example, age is presented in continuous form in the synthetic file. Then, the order of the variables was chosen with senior analysts from the CPHD. No design weights were to be taken into account; variables were therefore modeled and generated in an unweighted fashion using CART, ordered and non-ordered multinomial regressions, logistic regressions and multivariate linear regressions.
In terms of implementation of the Data Synthesis process, many improvements were made after the first hackathon project. Indeed, during the first project, many challenges were encountered due to R’s difficulty in dealing with large datasets and complex processes. Just as in the first project, the dataset was divided by sex and province (the southern part of Canada is divided from east to west into ten provinces, with three territories making up the north; province is the lowest geographical variable on the file), with each subsample being synthesized independently. However, to make the production process faster and more efficient, we used batch file programming to automate the creation of the file. The production itself took less than 4 days.
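The smoothing step can be sketched as follows. Drawing from a Gaussian kernel density estimate is equivalent to picking a data value and adding normally distributed noise whose standard deviation is the kernel bandwidth, so the sketch perturbs each generated value in that way. The bandwidth used for the released file is not documented here; Silverman's rule of thumb is an assumption made for this illustration, and the function names are hypothetical.

```python
import random
import statistics


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth for a Gaussian kernel (assumed choice)."""
    n = len(values)
    sd = statistics.stdev(values)
    quartiles = statistics.quantiles(values, n=4)
    iqr = quartiles[2] - quartiles[0]
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def smooth(values, rng, bandwidth=None):
    """Replace each generated value by a draw from a Gaussian kernel centred on it."""
    h = silverman_bandwidth(values) if bandwidth is None else bandwidth
    return [v + rng.gauss(0.0, h) for v in values]


if __name__ == "__main__":
    rng = random.Random(7)
    generated_ages = [rng.randint(20, 90) for _ in range(1000)]  # toy synthetic ages
    smoothed = smooth(generated_ages, rng)
    print("bandwidth:", round(silverman_bandwidth(generated_ages), 2))
    print("first values:", [round(v, 1) for v in smoothed[:5]])
```

After smoothing, a released value no longer coincides exactly with a value reported by a respondent, while the overall shape of the distribution is retained. In a real file, plausibility bounds (for example, non-negative ages) would also need to be enforced.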
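The divide-and-synthesize organization described above can be sketched as follows. The splitting variables are carried over unchanged and every other variable is synthesized within its own subsample. The function `resample_columns` is only a placeholder standing in for a real synthesizer, and all names are hypothetical.

```python
import random
from collections import defaultdict


def synthesize_by_stratum(records, keys, synthesize, rng):
    """Split records by the `keys` variables and synthesize each subsample independently."""
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    output = []
    for stratum, rows in sorted(strata.items()):
        # Splitting variables are not synthesized; they are re-attached afterwards.
        body = [{k: v for k, v in r.items() if k not in keys} for r in rows]
        for syn in synthesize(body, len(rows), rng):
            output.append({**dict(zip(keys, stratum)), **syn})
    return output


def resample_columns(rows, n, rng):
    """Placeholder synthesizer: draws each variable independently from its margin."""
    columns = list(rows[0].keys())
    return [{c: rng.choice(rows)[c] for c in columns} for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(3)
    original = [
        {"sex": s, "province": p, "age": rng.randint(20, 90), "stage": rng.choice("1234")}
        for s in ("F", "M") for p in ("ON", "QC") for _ in range(3)
    ]
    for row in synthesize_by_stratum(original, ("sex", "province"), resample_columns, rng):
        print(row)
```

Since the subsamples do not interact, each call can be run as a separate job, which is what makes batch automation of the production natural.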
For the evaluation of the analytical value, statistical analyses were carried out by a team led by a senior subject matter analyst to compare the conclusions one could draw from the original desensitized data file with those drawn from the synthetic file produced. This team had not been involved in the production of the file and their only role was to conduct comparative analyses, based on their knowledge of the topic, once the file was produced. None of the analyses to be performed were known by the synthesizer prior to the synthesis of the desensitized data file. First, marginal distributions of all categorical variables were compared, as well as descriptive statistics of continuous variables (range, mean, median and standard deviation); it was found that they closely agreed. In addition, all cross-tabulations of the mortality variables, as well as cross-tabulations of sex against sex-specific cancers, were produced. Again, very comparable results were found. While obtaining these positive but basic results is only a first step toward getting conclusive evidence of the utility of the synthetic dataset produced, it is a significant one. Indeed, it shows that Data Synthesis was able to capture these relationships on its own, without the synthesizer explicitly feeding them into the model. For more complex types of analyses, several logistic regressions were fitted, such as testing rural/urban classification against lung cancer or testing different cancers against immigration status, marital status, level of education and working level. In the same vein, different survival analyses, such as Cox proportional hazards models using socio-demographic variables as predictors, were also fitted. For both logistic regressions and survival analyses, results from the two files were extremely close.
On the confidentiality side, even though Data Synthesis relies on a data generation step that produces new observations, synthetic records may still have the same values as real ones. Therefore, even though the risk of re-identification has been strongly alleviated, we are in the presence of a perceived disclosure issue. Perceived disclosure is a real concern for Statistics Canada’s regulatory committee in charge of approving the public release of the synthetic file, even considering that the file produced is near-fully synthetic (all variables have been synthesized except for the ones used to split the dataset). The process for evaluating confidentiality was the same as the one used for the first hackathon project. This process is documented by Sallier and Girard [6]. For this project as well, the regulatory committee evaluated the file using the matching-based criteria developed for PUMF-type products. To address the perceived disclosure issue, it was requested that the original file and the synthetic one be compared to assess the proportion of observations that are not only the same in both files, but also unique in both files. Therefore, the desensitized file and its synthetic counterpart were systematically compared by being matched record-wise. A very low percentage of the records presented a unique combination of values with respect to all of the categorical variables existing in both the synthetic and the desensitized file. In fact, for one of the provinces, no combination of categorical variables from the synthetic file could be traced back to one from the original file.
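The first, basic level of this comparison can be sketched as follows: descriptive statistics for a continuous variable and the largest gap between category proportions for a categorical variable. The function names are hypothetical and the sketch does not cover the regression or survival analyses.

```python
import statistics
from collections import Counter


def describe(values):
    """Range, mean, median and standard deviation of a continuous variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }


def max_proportion_gap(original, synthetic):
    """Largest absolute difference in category proportions between the two files."""
    p, q = Counter(original), Counter(synthetic)
    return max(
        abs(p[c] / len(original) - q[c] / len(synthetic)) for c in set(p) | set(q)
    )


if __name__ == "__main__":
    orig_age, syn_age = [34, 51, 67, 72, 45], [36, 49, 70, 68, 47]
    print(describe(orig_age))
    print(describe(syn_age))
    print(max_proportion_gap(["lung", "lung", "breast"], ["lung", "breast", "breast"]))
```

Small gaps on such summaries are necessary but not sufficient evidence of utility, since they say nothing about relationships involving several variables at once.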
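The record-wise comparison can be sketched as below: count the synthetic records whose combination of categorical values is unique in the synthetic file and also matches a combination that is unique in the original file. This is only an illustration of the idea; the committee's actual criteria are those documented in [6], and the function name is hypothetical.

```python
from collections import Counter


def unique_match_rate(original, synthetic, variables):
    """Share of synthetic records that are unique in both files on `variables`."""
    def combo(rec):
        return tuple(rec[v] for v in variables)

    orig_counts = Counter(combo(r) for r in original)
    syn_counts = Counter(combo(r) for r in synthetic)
    matches = sum(
        1 for c, k in syn_counts.items() if k == 1 and orig_counts.get(c) == 1
    )
    return matches / len(synthetic)


if __name__ == "__main__":
    original = [
        {"sex": "F", "stage": "1"}, {"sex": "F", "stage": "1"},
        {"sex": "M", "stage": "2"}, {"sex": "M", "stage": "4"},
    ]
    synthetic = [
        {"sex": "M", "stage": "2"}, {"sex": "M", "stage": "4"},
        {"sex": "M", "stage": "4"}, {"sex": "F", "stage": "3"},
    ]
    print(unique_match_rate(original, synthetic, ["sex", "stage"]))
```

A match of this kind does not mean a respondent has been re-identified, since the synthetic record was generated from a model, but it is the situation that gives rise to perceived disclosure.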
Ongoing research
So far, at Statistics Canada, Data Synthesis has been performed on census information linked to administrative data. However, for a national statistical organization, a natural question that arises is how to include survey design features in the Data Synthesis process. Thus, a research project is currently being conducted to synthesize survey weights by synthesizing design variables and deriving synthetic weights from them. So far, it has been possible to recover the statistical results of weighted multivariate linear regressions performed on data coming from highly informative designs, using synthetic weights.
It is also of interest to explore a potential utility measure so as to provide an indicator of the quality of a synthetic dataset. As discussed previously, the joint distribution is the mechanism which, conceptually, underlies the set of all possible analyses. As an attempt to assess the analytical utility of a synthetic file, it would be interesting to determine whether the joint distributions of the original and synthetic data are the same. This hypothesis testing problem is known as the “multivariate two-sample problem.” It has been well discussed in the literature and some non-parametric test statistics have been proposed. To evaluate this approach, a research project is currently being conducted to explore the Cramer test, which uses the Baringhaus-Franz test statistic [8]. Indeed, this nonparametric two-sample test of equality of the underlying distributions can be applied to multivariate as well as univariate distributions. The Cramer test was chosen since it is easily implemented in R through the package cramer. So far the test shows good results, and research is now being conducted to evaluate its power.
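For readers who wish to see what such a test computes, the sketch below evaluates a statistic of the Baringhaus-Franz form, based on Euclidean distances between and within the two samples, and approximates a p-value by permuting the pooled records. The permutation scheme is an assumption chosen for simplicity and is not necessarily the resampling method used by the R package; the function names are hypothetical.

```python
import math
import random


def _half_distance_sum(a, b):
    """Sum of half the Euclidean distances over all pairs (one point from a, one from b)."""
    return sum(math.dist(p, q) / 2.0 for p in a for q in b)


def cramer_statistic(x, y):
    """Distance-based two-sample statistic: zero when both samples coincide."""
    m, n = len(x), len(y)
    between = 2.0 / (m * n) * _half_distance_sum(x, y)
    within_x = _half_distance_sum(x, x) / m ** 2
    within_y = _half_distance_sum(y, y) / n ** 2
    return m * n / (m + n) * (between - within_x - within_y)


def permutation_pvalue(x, y, rng, replicates=200):
    """Approximate p-value obtained by reshuffling the pooled records (assumed scheme)."""
    observed = cramer_statistic(x, y)
    pooled = list(x) + list(y)
    m = len(x)
    hits = 0
    for _ in range(replicates):
        rng.shuffle(pooled)
        if cramer_statistic(pooled[:m], pooled[m:]) >= observed:
            hits += 1
    return (hits + 1) / (replicates + 1)


if __name__ == "__main__":
    rng = random.Random(11)
    original = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
    synthetic = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
    print("statistic:", round(cramer_statistic(original, synthetic), 4))
    print("p-value  :", round(permutation_pvalue(original, synthetic, rng), 3))
```

A large p-value indicates no detected difference between the two joint distributions. The pairwise distance computations grow with the product of the sample sizes, so applying the idea to files of millions of records would require subsampling or a more efficient implementation.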
Conclusion
We presented the methodological concepts underlying Data Synthesis and its advantages from the users’ perspective. By describing Statistics Canada’s experience in producing synthetic versions of linked files, we demonstrated that Data Synthesis can be implemented successfully for social data. It presents users with a useful data access solution of a new kind while fulfilling confidentiality requirements imposed on the dissemination of data products. However, at Statistics Canada, the intent with Data Synthesis is not to replace any of the existing data access solutions, but to complement them. For instance, requests for estimates of simple finite population parameters of interest will likely remain better served by data access solutions such as Research Data Centers or Real-Time Remote Access, which provide datasets close to the original ones. Indeed, regardless of the analytical value of a synthetic file, final analyses intended for publication should be performed on original data. One of the reasons is that analyses on the two files could lead to the same statistical conclusions but not to the same estimates, whose values can be of high importance in some research fields such as medical research. To illustrate our perspective, we draw the following parallel: if one were interested in a set of variables whose values had not yet been observed, one would use forecasted values to start exploring a phenomenon (this represents the case where a researcher explores a synthetic dataset while waiting to be granted access to the original one); but once the data are observed, one would use the observed (real) data to publish or to make decisions (which is equivalent to saying that, in the synthetic data scenario, researchers would use real data to publish once access is granted).
Also, some of the challenges moving forward with Data Synthesis include taking survey data into account appropriately throughout the synthesis process and efficiently handling datasets containing hundreds of variables, not just the few dozen we have experimented with. We also anticipate extra challenges in synthesizing economic data, as the underlying distributions tend to be quite skewed and the presence of outliers remains a challenge in terms of disclosure or perceived disclosure control.
While synthetic data of high analytical value do not replace real data held in a secured environment, they allow for more direct and timely access to data that have greater utility than what is typically provided by a Public-Use Microdata File. For that reason, Data Synthesis fits well with a modern, user-centric view intended to provide users with high quality data, which involves concepts such as timeliness and accessibility in addition to accuracy.
Footnotes
See
See
See
Acknowledgments
I would like to express my very great appreciation to Mr. Claude Girard for his encouragement and guidance as I worked on data synthesis. I also wish to acknowledge the deep influence Mr. Eric Rancourt and Mrs. Andrea Leigh MacMillan have had on me as we collaborated on a special assignment. Finally, I would like to thank the IAOS for creating this competition and promoting through this prize the contribution of young statisticians to official statistics organizations.
