Abstract
In the era of the data revolution, the availability of data is a great asset that should be exploited. Instead of conducting new surveys, researchers can benefit from data that already exist. As enormous amounts of data become available, it is becoming essential to undertake research that integrates data from multiple sources in order to make the best use of them. Statistical Data Integration (SDI) is the statistical tool for addressing this issue. Depending on the input data, SDI can be used to integrate data files that have common units, and it also allows merging unrelated files that do not share any common units. The appropriate method of data integration is determined by the nature of the input data. SDI has two main methods, Record Linkage (RL) and Statistical Matching (SM). SM techniques typically aim to achieve a complete data file from different sources which do not contain the same units. This paper aims to give a complete overview of existing SM methods, both classical and recent, in order to provide a unified summary of the various SM techniques along with their drawbacks. Points for future research are suggested at the end of this paper.
Pattern of missing data originated by the statistical matching
Source: [13].
Attention to data integration has been growing in response to the increasing flood of available data. Data integration aims at integrating two or more data sources (usually data from sample surveys) sharing the same target population. Depending on the input data, Statistical Data Integration (SDI) can be used to integrate data files that have common units, and it also allows merging unrelated files that do not share any common units. The appropriate method of data integration is determined by the nature of the input data. SDI has two main methods, Record Linkage (RL) and Statistical Matching (SM). [1] provide a review of data integration techniques for combining probability samples, probability and non-probability samples, and probability and big data samples. [2] presents a review of combining data from different sources, focusing on record linkage. RL is the method of gathering information from multiple sources that relates to the same entity. If the data sources to be integrated contain a unique identifier, matching them poses no difficulty, and deterministic linkage is used. Deterministic linkage is a direct way of linking which usually requires exact agreement on a unique identifier (such as a national identity number). If there is no unique identifier, probabilistic record linkage is employed (see [3, 4, 5, 6] for details). An advantage of SM is that it can be used as a supplement to RL to reduce the potential bias of estimates obtained using RL [7]. In contrast to RL, SM techniques typically aim to achieve a complete data file from different sources which do not contain the same units. In SM, data sources just share a set of common variables and inference is required on the other variables. Besides its main goal of data integration, SM also has the objective of jointly analysing pairs of variables observed in two distinct sample surveys [8, 9, 10, 11].
Inconsistent SM can only produce meaningless data that are impossible to compare and use. Accordingly, suitable approaches should be used for SM to keep pace with rapid developments in data science.
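The deterministic linkage described above can be sketched in a few lines. This is a minimal illustration only: it assumes each record is a dictionary, that the unique identifier is stored under a hypothetical key `national_id`, and that the identifier is unique within each file. The records are invented for the example.

```python
def deterministic_link(file_a, file_b, key="national_id"):
    """Link records from two files that agree exactly on a unique identifier."""
    # Index file B by the identifier (assumed unique within the file).
    index_b = {rec[key]: rec for rec in file_b}
    linked = []
    for rec_a in file_a:
        rec_b = index_b.get(rec_a[key])
        if rec_b is not None:
            # Merge the two records; on a name clash the value from file A is kept.
            linked.append({**rec_b, **rec_a})
    return linked


# Hypothetical example data: only identifier "002" appears in both files.
file_a = [
    {"national_id": "001", "income": 2500},
    {"national_id": "002", "income": 3100},
]
file_b = [
    {"national_id": "002", "education": "secondary"},
    {"national_id": "003", "education": "tertiary"},
]

linked = deterministic_link(file_a, file_b)
print(linked)
```

Units without an exact identifier match are simply left unlinked here; probabilistic linkage and SM address the cases where such an identifier is missing or the files do not share units at all.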
SM can be viewed as a missing data problem where methods such as imputation can be used to solve it [12]. In the basic SM Structure, there exist two files A and B as sources of data. File A contains variables
The following example from [13] represents two different surveys, each with a different goal but with some variables in common. The first dataset includes data about gender, age, education and purchasing information about products such as cereals and wines. The second dataset includes data about gender, age, education and information about TV viewing behavior for programmes such as daily soaps and news (see Table 1). An appealing option for market researchers is to combine the viewing information available from the second dataset with the purchasing data available from the first. This can save a lot of time and money while providing a richer data source for analysing relationships among those variables. Variables which were not found jointly in both datasets are imputed individually to create a comprehensive and large database through SM. Once the two datasets are matched and the missing parts of the data are imputed, it becomes possible to study the relationship between TV viewing behavior and purchasing behavior.
This paper aims to provide a unified piece of work that gives a complete view of the SM techniques available in the literature. The paper is organized as follows. Section 2 introduces SM as a missing data problem. Section 3 presents classical approaches in SM and shows the drawbacks of these procedures. Methods of SM are presented within three frameworks: under the conditional independence assumption, in the presence of auxiliary information, and subject to uncertainty analysis. A number of alternative SM techniques are given in Section 4. Section 5 discusses issues related to SM, including a review of existing methods for assessing the quality of the data resulting from various SM techniques. Finally, Section 6 concludes the paper.
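The missing data pattern created by stacking the two surveys can be made concrete with a small sketch. The records and variable names below are hypothetical and only mimic the example in the text; `None` marks a value that was never observed in the corresponding survey.

```python
# Hypothetical records; the variable names loosely follow the example in the text.
survey_purchases = [
    {"gender": "F", "age": 34, "education": "tertiary", "buys_cereals": True, "buys_wine": False},
    {"gender": "M", "age": 51, "education": "secondary", "buys_cereals": False, "buys_wine": True},
]
survey_tv = [
    {"gender": "F", "age": 29, "education": "secondary", "watches_soaps": True, "watches_news": False},
    {"gender": "M", "age": 47, "education": "tertiary", "watches_soaps": False, "watches_news": True},
]


def stack_files(file_a, file_b):
    """Stack two files, filling variables not observed in a file with None."""
    columns = []
    for rec in file_a + file_b:
        for name in rec:
            if name not in columns:
                columns.append(name)
    return [{name: rec.get(name) for name in columns} for rec in file_a + file_b]


stacked = stack_files(survey_purchases, survey_tv)

# Variables observed for every unit are the common (matching) variables.
common = [c for c in stacked[0] if all(row[c] is not None for row in stacked)]
print(common)
for row in stacked:
    print(row)
```

In the stacked file the purchasing variables are missing for every unit of the second survey and the viewing variables are missing for every unit of the first, which is the block pattern that SM has to fill in.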
Statistical matching as a missing data problem
There are several methods for dealing with data with missing values. One class of procedures is based on completely recorded units, where incomplete units are discarded and only units with complete data are analysed. Another approach relies on weighting procedures, where weights are given to observed units. Imputation techniques constitute yet another common approach, where imputation can be conducted via single imputation or Multiple Imputation (MI). Single imputation replaces each missing value by a single value. Single imputation techniques include mean imputation, regression imputation, and hot-deck imputation, as well as more advanced techniques such as using an Expectation Maximization (EM) algorithm to obtain Maximum Likelihood (ML) estimates [14, 15, 16]. Rubin was the first to propose MI as a technique to address the problem of missing data (see, e.g., [17, 18, 19]). The main notion of MI is the replacement of each missing value by a set of M plausible values, where each value is drawn from the conditional distribution of the missing values given the observed values. The M completed datasets are produced by an appropriate imputation model, and each is then analysed using the method that would have been suitable had the data been complete. The resulting M analyses are then combined to give one final result. MI is most straightforward to use under ignorable missingness mechanisms, namely Missing At Random (MAR) and Missing Completely At Random (MCAR). If the missingness is ignorable, the estimates of the statistical model of interest resulting from MI often have the desirable properties of being unbiased, consistent, and asymptotically normal [20, 21]. MI can still be applied when data are Missing Not At Random (MNAR); however, the resulting estimates have not been proved to have the same properties. For more details about the types of missingness (MCAR, MAR, and MNAR), see [22, 23, 13].
Since SM can be viewed as a problem of missing data [12], many authors have used MI in the context of SM. For example, [24] performed cross-survey MI to combine regression estimates where one survey contained more covariates than the other. Another MI method used in SM and proposed by [13] was Multiple Imputation Chained Equations (MICE).
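The combination of the M analyses into one final result is commonly done with what are known as Rubin's combining rules. The text does not spell out the formula, so the sketch below rests on the assumption that this standard rule is meant: the pooled estimate is the mean of the M estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/M). The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances with Rubin's combining rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from M = 3 completed datasets.
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]

q_bar, total_var = pool_rubin(estimates, variances)
print(q_bar, total_var)
```

The between-imputation term is what distinguishes MI from single imputation: it carries the extra uncertainty due to the missing values into the final variance.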
The most important question raised when considering SM is how to match the sets of data in a right way. One of the first SM attempts is by [25]. He presented the use of Equivalence Class of the common variables in two data sources by making a random matching of
Various developments of SM techniques have been discussed in the literature, and many applications have used SM relying on the Conditional Independence Assumption (CIA) proposed by [25]. For instance, [34] assumed CIA when developing a theoretical framework for SM. [35] applied a sample likelihood approach under CIA to statistically match two independent samples. [36] proposed a mixed procedure based on multinomial logistic regression models under CIA. [37] used CIA to combine the information of two different datasets: Statistics on Income and Living Condition, and Household Budget Survey.
Recently, [38] used a mixed procedure of SM based on CIA to integrate two different surveys: Statistics on Income and Living Condition, and Household Budget Survey. In Section 3, we will present classical approaches in SM showing drawbacks of these procedures.
Traditional statistical matching methods
In their comprehensive book about SM, [11] classify SM techniques into procedures developed under CIA and SM procedures in light of auxiliary information. For a general review of CIA, see [11, 39]. In this section, we present classical SM techniques under CIA and with auxiliary information using the same structure proposed by [11]. In addition, we consider uncertainty analysis for SM methods. Within this framework of three settings, different estimation approaches are considered, such as parametric, nonparametric, mixed and Bayesian.
Statistical matching methods under conditional independence assumption
Traditionally, the framework of SM occurs when only file A and file B are available. File A contains variables
where;
To estimate Eq. (1), we need information about the marginal distribution of
Parametric SM methods are mainly based on two steps. First, a model must be specified; then its parameters have to be estimated. The method of estimation depends on the specific SM framework. There are two main approaches for parametric SM, namely the micro and macro approaches.
In the macro case, the goal is to estimate the parameter of interest, e.g. the correlation coefficient between
The goal of micro approaches for SM is to create a complete set of data for
where the estimated parameters values
Notice that when dealing with categorical variables, the variables are replaced by the indicator variable of each category. [43] showed that the conditional mean matching has the problem of leading to lack of variability in the imputations with respect to the same conditioning variables and hence leads to underestimation of the covariance between
where
where
Nonparametric approaches to SM do not explicitly refer to a model. As for the parametric case, there are two cases in nonparametric approaches; namely macro and micro.
In nonparametric macro approaches, as mentioned before, the goal is to estimate the joint distribution of
In nonparametric micro approaches, a complete dataset for files A and B can be obtained without considering any specific parametric distribution for the variables
Random hot deck is performed by selecting a donor record in file B randomly and the value of
Data of file A and file B
Source: [49, 13], Gender: 1
Distance hot deck method, which is widely used in imputation of missing data, is performed by choosing the value of
where;
Distance hot deck can be performed using a constrained or an unconstrained approach both introduced by [49]. The unconstrained approach has an objective of finding the case on the target file that is most similar to each case on the other file. Suppose we have file A that contains a set of variables: gender
Statistical matching using unconstrained approach
Reordered exploded file A and file B for constrained approach
Statistical Matching using Constrained Approach
Rank hot deck, which was first introduced by [50], is performed by separately ranking the units with respect to
In the recipient file;
In the donor file;
After that, every
Then again as in Distance Hot Deck, the value of
Mixed approaches in SM depend on predictive mean matching imputation [11]. These are two-step procedures which depend on a mixture of parametric and nonparametric methods. A SM mixed method consists of the following two steps:
1. A model is fitted and all its parameters are estimated.
2. A nonparametric method is used to create the complete dataset.
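The two steps can be sketched for the simplest continuous case. This is one possible variant under stated assumptions, not necessarily the procedure of any of the cited authors: there is a single common variable `x`, file A observes `(x, y)`, file B observes `(x, z)`, a simple linear regression of `z` on `x` is fitted in file B, and each record in A receives the observed ("live") `z` of the donor in B whose `z` is closest to the regression prediction. All names and data are hypothetical.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mixed_match(file_a, file_b):
    """Two-step mixed matching: parametric prediction, then nearest live donor."""
    # Step 1 (parametric): estimate the regression of z on x from file B.
    intercept, slope = fit_simple_ols([r["x"] for r in file_b], [r["z"] for r in file_b])
    completed = []
    for rec in file_a:
        # Intermediate (predicted) value of z for the recipient record.
        z_tilde = intercept + slope * rec["x"]
        # Step 2 (nonparametric): donate the observed z closest to the prediction.
        donor = min(file_b, key=lambda d: abs(d["z"] - z_tilde))
        completed.append({**rec, "z": donor["z"]})
    return completed


# Hypothetical data: file A lacks z, file B lacks y.
file_a = [{"x": 1.5, "y": 10.0}, {"x": 3.6, "y": 20.0}]
file_b = [{"x": 1.0, "z": 2.1}, {"x": 2.0, "z": 3.9}, {"x": 3.0, "z": 6.2}, {"x": 4.0, "z": 7.8}]

completed_a = mixed_match(file_a, file_b)
print(completed_a)
```

Because the imputed values are always values actually observed in file B, the completed file cannot contain implausible predictions, which is the usual motivation for adding the nonparametric step.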
There exist various methods for mixed SM depending on the type of variables, whether they are continuous or categorical.
In case of mixed approaches under CIA for continuous variables
Estimate the regression parameters of For each unit For each unit
In the same manner,
Using the above steps, similar procedures were proposed by [34, 51, 11, 52, 43]. The main difference between their procedures lies within the parameter estimates they use. Also, they differ in the matching step; unconstrained or constrained approach. Another difference is that Rubin’s procedure suggests concatenating the resulting statistically matched files and assigning the weight
Analogous to SM for continuous variables, two steps are applied within the mixed approach when categorical variables are considered. These are the loglinear regression model step, and the matching step. For more details about this method, see [42].
Bayesian approaches
[13] presented three procedures for SM using Bayesian approaches. First, the author followed [52] by presenting a frequentist regression imputation method with random residual, known as RIEPS. This procedure is based on drawing the missing values from their conditional predictive distribution, which is equivalent to the imputed values in Eqs (2) and (3) but after considering the residual in the imputation process. Then, [13] presented the Bayesian version of RIEPS, a Non Iterative Bayesian based Imputation Procedure (NIBAS) for univariate and multivariate cases. The format of the procedure is the same for both the univariate and multivariate cases. The only difference is in the mathematical formulation of the matrices that are used and in the posterior distributions of the variables. In this approach, instead of estimating the parameters as in the above approaches, she performed random draws for them using a Bayesian version. [13] first assumed a general linear model for both datasets, as suggested by Rubin’s procedure, and then derived different posterior distributions for the parameters. Accordingly, she proposed an MI algorithm based on a Bayesian approach with the following steps:
1. Perform a regression for each dataset and compute the OLS estimates
2. Calculate each sum of squared residual errors;
3. Choose a value for
4. Conduct random draws for the parameters from their observed data posterior distribution.
In a nutshell, this procedure depends on random draws for the parameters, instead of estimating them from the data. This algorithm is repeated
The second method presented by [13] was the Normal Approach (NORM). This method is mainly based on the data augmentation algorithm under a normal model. The normal data model is used because it is the most flexible in terms of the number of variables included in any realistic matching task. NORM uses a data model much like NIBAS. The difference between NIBAS and NORM is that NORM uses the parametric data model in case of complete data, whereas NIBAS uses the observed data posterior distribution. Since the NORM method needs complete data, an algorithm that handles the missing data is needed, namely data augmentation. The data augmentation algorithm can be viewed as a Bayesian version of the EM algorithm for dealing with missing data [53].
The third method proposed by [13] was MICE, an abbreviation for Multiple Imputation Chained Equations. MICE, a common modification in MI, is an iterative method that considers the imputation problem as a set of estimations where each variable takes its turn in being regressed on the other variables. MICE, first introduced by [54], is a flexible method that handles various types of variables, since each variable is imputed using its own imputation model [55]. This procedure is called regression switching, chained equations, or variable-by-variable Gibbs sampling [54]. [13] employed MICE within a Bayesian estimation framework as a tool for SM. For more details about MICE, see [54, 13]. [56] compared, through a simulation study, the different Bayesian approaches described above. Using another simulation study, [57] found out that the MI approaches are superior to the traditional SM procedures.
Statistical matching methods with auxiliary information
The problem with the conditional independence assumption of the variables
The existence of a third data source C that contains variables Parameter estimates of the joint distribution of
To present the different approaches for SM with auxiliary information, we follow the same structure, proposed by [11], as that used for CIA.
Parametric approaches
In case of a parametric macro approach, the auxiliary information may be an additional data source or using an external estimate. In case of having an additional data source C, containing
In case of parametric micro approach and having external information, a model is assumed for the joint distribution of
Analogous to Section 3.1.1 under CIA, we get the equations for the method of draws from the predicted distribution by adding an error term as follows:
where;
In case of a nonparametric macro setting, the only type of auxiliary information available is that from an additional data source C. There are two cases for auxiliary information available from an additional data source C. First, when all variables are available in C. The marginal
In case of a nonparametric micro setting, auxiliary information is solely represented by an additional data source C. This procedure is presented by [40] in two steps as follows:
Impute a live
The final live
In particular, if C is a sample that is large enough and the information it provides can be considered "reliable", then the SM procedure can be limited to Step 1 outlined above; that is imputing
An alternative procedure based on loglinear models, called the categorical restriction approach, was introduced by [42].
Analogous to CIA in Section 3.1.3, mixed approaches contain the same two steps in the case of auxiliary information. The first step is to estimate the model parameters, followed by the second step in which a nonparametric method is used.
The same procedure outlined in Section 3.1.3 for SM under CIA can be used here with continuous variables. The only difference here is in the way of estimation as information about parameter estimates is available, either from additional data source C or from an external estimate (i.e. information about
The mixed procedure remains unchanged, focusing on imputing primary values using the same regression models as in equations 10 and 11. Final nonparametric imputation: a live
[34, 51] proposed a similar procedure that permits the incorporation of additional information represented by the correlation coefficient
In case of categorical variables, a number of mixed methods are introduced in [42]. For more details, see [11].
The Bayesian approach is especially useful when dealing with auxiliary information. [13] considered a Bayesian approach for SM when an additional data source C is available. This additional dataset C can be used for estimating the conditional association of
Uncertainty analysis
In practice, CIA rarely holds true, and results relying on CIA may be misleading. If CIA is not valid, bias will exist among variables of interest after applying the SM process. This issue is solvable with the help of auxiliary information when available [42, 62]. When CIA is not valid and auxiliary information is not available, which is often the case, uncertainty analysis may be performed.
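To illustrate the kind of result an uncertainty analysis produces, consider the simplest case of a single common variable X and two target variables Y and Z, assumed here (for illustration only) to be jointly normal. File A identifies the correlation between X and Y, and file B identifies the correlation between X and Z. The correlation between Y and Z is then constrained only by the requirement that the 3 x 3 correlation matrix be positive semi-definite, which gives an interval centred at the value implied by CIA. The function name and the input values are hypothetical.

```python
import math


def correlation_bounds(rho_xy, rho_xz):
    """Range of corr(Y, Z) compatible with the identified corr(X, Y) and corr(X, Z).

    The centre of the interval, rho_xy * rho_xz, is the value implied by the
    conditional independence of Y and Z given X.
    """
    centre = rho_xy * rho_xz
    half_width = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return centre - half_width, centre, centre + half_width


# Hypothetical correlations estimated from file A and file B respectively.
lower, cia_value, upper = correlation_bounds(0.8, 0.6)
print(lower, cia_value, upper)
```

The stronger the common variable explains Y and Z, the narrower the interval becomes; when either correlation equals one in absolute value the interval collapses to the CIA value.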
Statistical matching is essentially related to an identification problem concerning the association of variables not found together in the same dataset. As known, CIA cannot be validated from the observed data. By depending on the explanatory power of the common variables
Uncertainty in SM can be viewed as a special case of estimation problem as there is a kind of uncertainty about the joint model of
Uncertainty analysis can be performed in two settings, parametric or nonparametric. In the parametric setting, the identification problem is handled by considering ranges of plausible values of the missing records, extracted from models fitted to the available sample information. These ranges define intervals known as uncertainty intervals [43, 52, 34, 13]. In the nonparametric setting, uncertainty for SM is considered in [69, 77, 76]. Uncertainty in a nonparametric setting is still described by a class of models, or specifically, by a class of distributions, for
Recently, [80] performed a unique analysis to choose the matching variables by searching for the common variables
Developments in statistical matching methods
In addition to the traditional SM techniques outlined in Section 3, there are a number of alternative methods that have been introduced in the literature. This section will present an overview of some of these methods.
Propensity scores method
[84, 85] were among the first to propose the use of propensity score matching. Propensity scores are predicted values from a logistic regression, where the effect of a certain group or treatment is estimated, accounting for covariates. The matching process is done based on these scores. The main advantage of using this method in SM is that it reduces the matching to a single constructed propensity score, which is a considerable advantage, especially when there are a large number of common variables
1. Data preparation and harmonization step: Harmonization step involves identification of all common variables
2. Weight adjustments step: In this step, the sum of the attached weights for records (weighted population totals) in the donor file is adjusted to make them comparable with those in the recipient file.
3. Estimation of the propensity scores step: In this step, estimation of the propensity scores for matching process is done by creating an outcome variable
the conditional probability of unit
The individual propensity scores
On the contrary, [89] opposed using propensity scores for SM justifying their opinion by stating that the procedure ignores the measurements of matching variables
[90, 91] proposed fractional imputation, a comparatively recent form of imputation for handling missing data. Fractional imputation can be handled in two settings, parametric or nonparametric. In the parametric approach, imputed values are generated using the EM algorithm, which is a popular tool for finding the maximum likelihood estimates of the model parameters. In fractional imputation, the E-step can be approximated by the weighted mean of the imputed data likelihood, where the fractional weights are computed from the current value of the parameter estimates. Using fractional imputation, numerous imputed values with fractional weights are generated for each unit containing missing data. Each fractional weight indicates the conditional probability of the imputed value given the observed data. For instance, the following two steps should be considered if we would like to generate the missing
For each Fractional weight is assigned for the
where Solve the fractionally imputed score equation for
where Repeat steps 2 and 3 till convergence.
For nonparametric setting, fractional hot deck imputation is considered; that is a mixed idea of fractional imputation and hot deck procedure. Instead of generating
where
Statistical Learning (SL) refers to a wide set of classification/regression techniques that "learn from data". These are typically algorithm-based techniques that do not assume a statistical model. SL techniques include methods such as classification and regression trees (CART) and random forests, rather than methods based on fitting stochastic models. SL techniques can be used in statistical matching to replace methods that use model predictions. Some techniques, such as nearest neighbour distance hot deck, are special cases of a well-known SL technique called k-nearest neighbours (kNN). For more details about SL techniques, see [92, 93, 94, 95]. [11] were the first to use SL techniques in SM. They showed how SL can be used to pick a subset of matching variables prior to the matching step. SL techniques can also be useful for measuring the distance between the observations in A and those in B when employing the nearest neighbour distance hot deck [96]. The use of SL in SM goes a step further by incorporating other current classification or regression approaches into SM. [97] showed how popular SL techniques can be beneficial for matching purposes. Two different ideas are examined: (i) integrating datasets; (ii) assessing the uncertainty. The characteristics of these approaches are investigated through a simulation study and an application to real survey data. The results obtained in that paper are promising, showing that certain SL approaches can be very successful in using the information given by already available survey data, allowing for a reduction in the uncertainty when using traditional SM techniques. Moreover, [80] compared SL techniques with traditional hot deck methods, with the aim of avoiding the time spent on selecting the matching variables that are typically needed in hot deck and on computing the distances between units in file A and potential donors in file B.
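The link between nearest neighbour distance hot deck and kNN can be seen in a short sketch: with k = 1, each recipient in file A simply receives the value observed for its closest donor in file B. The code below is an unconstrained version under stated assumptions (numeric matching variables on comparable scales, Euclidean distance, donors may be reused); the function and variable names are hypothetical.

```python
import math


def nn_distance_hot_deck(recipients, donors, match_vars, donate_var):
    """Unconstrained nearest neighbour distance hot deck (kNN with k = 1)."""
    completed = []
    for rec in recipients:
        rec_point = [rec[v] for v in match_vars]
        # The donor with the smallest Euclidean distance on the matching variables;
        # the same donor may be chosen for several recipients.
        donor = min(donors, key=lambda d: math.dist(rec_point, [d[v] for v in match_vars]))
        completed.append({**rec, donate_var: donor[donate_var]})
    return completed


# Hypothetical files: A observes y, B observes z, both observe age and edu_years.
file_a = [
    {"age": 30, "edu_years": 12, "y": 1},
    {"age": 60, "edu_years": 16, "y": 0},
]
file_b = [
    {"age": 28, "edu_years": 12, "z": "low"},
    {"age": 58, "edu_years": 17, "z": "high"},
    {"age": 45, "edu_years": 10, "z": "mid"},
]

matched = nn_distance_hot_deck(file_a, file_b, ["age", "edu_years"], "z")
print(matched)
```

In practice the matching variables would usually be standardised first, since otherwise the variable with the largest scale dominates the distance.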
[80] showed the superiority of SL techniques to traditional hot deck methods in two aspects; first the reduction of time consumed in selecting matching variables typically needed in hot deck, and second in calculating distances between units in file
Statistical matching using multinomial logistic regression models
[36] proposed several mixed procedures based on multinomial logistic regression models, under CIA and with auxiliary information. First, they proposed two mixed methods for SM without auxiliary information, assuming conditional independence. The first approach is a mixed method that combines multinomial logistic regression with distance hot deck imputation. It is based on the following three steps:
The probability of the unit being in For each unit in file A, the probability of being in
where Distance hot deck approach is applied in which a value of
The second mixed method, based on multinomial logistic regression models under CIA, uses a randomization mechanism, and involves the same first and second steps as the previous method. The only difference is in the third step as
They also extended their method for use in the presence of auxiliary information, introducing four versions. The first version has three steps as follows:
For file C, fit a multinomial logistic regression model in which For file A, the estimates in the first step are used to get For file A,
A variation of the above method is to replace the first step by fitting a multinomial logistic regression model, in which
The third variation of this method involves the following six steps:
For file C, fit a multinomial logistic regression model in which For file For file B, For file B, fit a multinomial logistic regression model in which For file A, the estimates in the fourth step are used to get probability For file A,
The first step can be replaced by fitting a multinomial logistic regression model in which
The proposed methods were compared to a random hot deck procedure via a simulation study. The results were very similar in the case of CIA. However, the proposed methods exhibit better performance in case of auxiliary information, especially with high levels of association between
Mixed integer linear programming procedure
A set of approaches has recently been proposed to tackle the problem of incoherence that arises when SM is considered. A number of authors, including [98, 99, 100, 101] and [102], applied a mixed integer linear programming procedure, based on efficient L1 distance minimization, in the context of SM. [102] dealt with the management of inconsistencies inside the SM framework when logical relations among the variables are present. Incoherence can arise in the probability evaluations. They used several advanced adjustment procedures to remove such incoherence. They applied these procedures to real data, and their study found some differences between these adjustment procedures.
Recently, [99, 98] recommended applying a merging approach for jointly inconsistent probabilistic assessments to the SM problem. The merging technique is based on an efficient L1 distance minimization through mixed-integer linear programming, which results in the elicitation of imprecise (lower-upper) probability assessments that are not only feasible but also meaningful. They emphasized that their method is meaningful whenever there are structural zeros among the variables. Structural zeros are an important feature of survey data that refers to the existence of impossible combinations of variables. For example, in the combinations of the variables pregnancy status and gender, there should not exist a pregnant male. For a household survey, in the combinations of the variables relationship and age, there should not exist a household where a son is older than his biological father. The presence of these structural zeros prevents the merging of estimates coming from different sources of information from being surely coherent. The importance of their approach is apparent whenever there are logical (structural) constraints with varied sources of information.
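The notion of a structural zero can be illustrated with a simple consistency check on a matched micro file. This sketch only detects records that violate a list of impossible combinations; it is not the L1 distance minimization procedure of the cited authors. The function name, the variable names and the records are hypothetical.

```python
def find_structural_zero_violations(records, forbidden):
    """Return (index, combination) pairs for records matching a forbidden combination.

    `forbidden` is a list of dictionaries, each mapping variable names to the
    values that cannot occur together.
    """
    violations = []
    for i, rec in enumerate(records):
        for combo in forbidden:
            if all(rec.get(var) == val for var, val in combo.items()):
                violations.append((i, combo))
    return violations


# Hypothetical matched file in which gender and pregnancy status came from different sources.
matched_file = [
    {"gender": "male", "pregnant": True},
    {"gender": "female", "pregnant": True},
    {"gender": "male", "pregnant": False},
]
forbidden_combinations = [{"gender": "male", "pregnant": True}]

violations = find_structural_zero_violations(matched_file, forbidden_combinations)
print(violations)
```

A matching procedure that ignores such constraints can produce records like the first one, which is why methods that account for structural zeros are of interest.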
Statistical matching using Bayesian networks
[103] described and discussed first attempts for SM of discrete data using Bayesian networks. In Bayesian networks, so-called (directed acyclic) graphs are used to model the data. Their micro matching approach has three steps: estimating and combining the (directed acyclic) graphs for datasets A and B, estimating the corresponding local parameters and combining them to the joint probability distribution, and imputing the missing values in A and B to obtain the integrated dataset.
Their first attempt at using Bayesian networks encouraged a further study into how probabilistic graphical models may be used for SM [104]. In their study, they considered not only discrete but also continuous and mixed variables for SM. Further research is also needed to see if using undirected probabilistic graphical models is more promising. Some of these new points of research for Bayesian networks are considered in [104]. [104] used log-linear Markov networks under the assumption of conditional independence. Their approach visualizes dependencies among variables using undirected graphical models and obtains a powerful factorization of their joint distribution, which is used to estimate the probability components of the joint distribution. They embedded the identification problem of SM into the theory of log-linear Markov networks and showed an exemplary implementation of their approach based on the German General Social Survey. These preliminary findings showed that their suggested SM approach can reconstruct the joint distribution reasonably well. Small differences between the sample distribution and the distribution estimated using their SM procedure are particularly encouraging because they avoided overconfidence by deliberately not selecting the specific and common variables based on previous association analysis.
Recently, [83] proposed the use of Bayesian networks to deal with the SM uncertainty for multivariate categorical variables. The motivation for using Bayesian networks in their work is because extra sample information of qualitative dependencies between the components of
Issues related to statistical matching
Methods for complex surveys
Complex surveys are surveys whose sampling designs are not based on simple random selection. These include designs such as probability proportional to size sampling, multistage sampling, cluster sampling and stratified sampling. It is also very common to combine some of these designs according to the population under study or the survey objectives.
In general, samples selected using a complex survey design differ in several ways from simple random samples. In a multistage sampling design, the secondary sampling units may have unequal inclusion probabilities, and units belonging to the same primary sampling unit are not independent and usually show an intraclass correlation. For more details about complex surveys, see [105].
In the literature, several SM methods have been proposed to deal with complex surveys. These include traditional micro approaches and SM methods that adjust for the sampling design, proposed by [52, 106, 107] and [108]. Traditional micro approaches for SM of data from complex sample surveys are usually performed using nonparametric micro methods such as nearest neighbour donor, rank or random hot deck. These methods neglect the sampling design and the weights of the units in the matching step. Once the matching step has filled the missing values in the recipient file using one of these traditional micro approaches, the sampling design of the recipient data can be used to perform the statistical analysis. [109] compared several traditional approaches for SM in the context of complex surveys. Rank and random hot deck approaches appeared to be efficient in terms of preserving the joint distribution
SM methods that explicitly consider the sampling design and the weights associated with it include Renssen’s method [106], Rubin’s file concatenation [52] and a method based on empirical likelihood suggested by [107]. A comparison of Renssen’s and Wu’s approaches with Rubin’s file concatenation procedure is found in [110]. In general, Renssen’s approach relies on a series of calibration weights that are applied to the two datasets to achieve consistency between estimators derived from files A and B separately. Calibration is a common approach in sample surveys to derive new weights (for more details about calibration, see [111]). Rubin’s procedure is the only method that takes the sampling design into consideration through the concatenation step. It gives weights for all units in the matched file. The general steps of Rubin’s procedure are mentioned in Section 3.1.3. In contrast, [107] proposed an approach based on empirical likelihood. Unlike other methods, Wu’s approach, as used in the context of SM, guarantees positive adjusted weights because it is based on maximum likelihood. His approach is also consistent and efficient compared with other methods introduced for adjusted weights, such as those proposed in [112, 113]. Table 6 summarises the advantages and disadvantages of the SM methods for complex surveys proposed by [106, 52, 107], in addition to traditional micro approaches.
Comparison between different methods of SM in complex surveys
Recently, [108] dealt with the problem of SM in case of complex sample surveys using a nonparametric setting. They proposed to use an iterative proportional fitting algorithm for estimating the distribution function of variables that are not found together, and demonstrated how to assess its reliability.
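As a rough illustration of iterative proportional fitting in general (not the specific estimator of [108]), the sketch below rescales a two-way seed table until its row and column totals match given margins. The function name `ipf`, the seed table and the margins are hypothetical and chosen only for illustration.

```python
def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Classical two-way iterative proportional fitting (illustrative sketch)."""
    table = [row[:] for row in seed]  # work on a copy of the seed table
    for _ in range(max_iter):
        # Scale each row so that it sums to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Scale each column so that it sums to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns match after the column step; stop once rows agree as well.
        if all(abs(sum(table[i]) - t) < tol for i, t in enumerate(row_targets)):
            break
    return table


# Hypothetical example: a uniform seed (no prior association) and made-up margins.
seed = [[1.0, 1.0], [1.0, 1.0]]
fitted = ipf(seed, row_targets=[0.6, 0.4], col_targets=[0.7, 0.3])
```

With a uniform seed the fitted table is simply the product of the two margins, that is, the table implied by independence. A seed that carries auxiliary information about the association would keep its odds ratios while being adjusted to the margins.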
One situation in which SM may need to be applied is the Split Questionnaire Design (SQD). The aim of SQD is to reduce the burden on respondents by shortening questionnaires. Since long questionnaires deter prospective respondents and cause high nonresponse rates, split questionnaires are used as an option to reduce the pressure on respondents [114, 115, 116]. Long questionnaires also decrease the quality of survey responses and, hence, lead to low accuracy [117]. To increase the quality of survey responses, SQD can be used [118]. [12] discussed a SQD, in which, a sample
Another very useful application of SM is combining data from sample surveys with census data. Sample surveys can include detailed data about specific issues, that may not be available via census data. [120] used SM to combine census data with Demographic and Health Survey due to the increasing demand for integrating surveys with census data in order to provide richer, more detailed sources of data. SM methods can be used to connect records from survey and census data where RL is impractical due to confidentiality constraints. [120] created a technique that uses an iterative proportional updating approach to construct a synthetic population of individuals and households that is similar to reality. Then, SM is used for imputing survey data to individuals and households in the synthetic population using the nearest neighbour approach. To assess this process, 2011 Bangladesh census data is used to produce a district-specific synthetic population of individuals and households. The wealth index for each household within the synthetic population is then estimated to impute the closest available records from the 2011 Bangladesh Demographic and Health Survey. The findings show that the method proposed by [120] achieved more representative estimates (when compared to direct survey estimates), particularly in areas with small sample sizes and a small number of population units with different socio-demographic characteristics.
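The nearest neighbour imputation step described above can be sketched as an unconstrained distance hot deck. This is a minimal illustration and not the procedure of [120]: the Euclidean distance, the function name `nearest_neighbour_match`, and the toy records and variable names are all assumptions.

```python
import math


def nearest_neighbour_match(recipients, donors, match_vars, target_var):
    """Unconstrained distance hot deck: each recipient receives the target
    value of its closest donor on the matching variables (illustrative)."""
    completed = []
    for rec in recipients:
        # Euclidean distance on the matching variables (an assumed choice).
        closest = min(
            donors,
            key=lambda d: math.dist(
                [rec[v] for v in match_vars], [d[v] for v in match_vars]
            ),
        )
        filled = dict(rec)
        filled[target_var] = closest[target_var]  # impute an observed donor value
        completed.append(filled)
    return completed


# Hypothetical toy records: the recipient file lacks "wealth", the donor file has it.
file_a = [{"age": 25, "hh_size": 2}, {"age": 60, "hh_size": 5}]
file_b = [
    {"age": 27, "hh_size": 2, "wealth": 1.5},
    {"age": 58, "hh_size": 4, "wealth": 3.0},
    {"age": 40, "hh_size": 3, "wealth": 2.2},
]
matched = nearest_neighbour_match(file_a, file_b, ["age", "hh_size"], "wealth")
```

In this sketch a donor may be used more than once and the variables are not rescaled. In practice the matching variables would usually be standardised, or a distance suited to mixed variable types would be used.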
Essential steps for applying SM in practice
Some considerations and preprocessing steps are required before utilizing SM approaches to integrate two or more data sources. Assuming that we have two data files A and B, the following steps are essential for SM:
1. Selection of the target variables.
2. Data preparation and harmonization step.
3. Selection of matching variables that will be used in the matching process for the two separate sample surveys. For the SM process, many common variables may exist in both files A and B. In practice, only the most relevant ones from this set of common variables are employed in the matching process, which are known as matching variables. These variables should be chosen using appropriate statistical procedures and with the help of subject matter experts. The appropriate statistical procedures may be descriptive or inferential. For instance, the easiest approach for identifying the optimal set of matching variables is to calculate the measures of pairwise correlation/association between
4. The matching framework should be determined with respect to the objective of SM, either micro or macro, and also with respect to the appropriate setting: parametric, nonparametric, mixed or Bayesian.
5. After deciding the SM framework using the previous steps, the suitable SM approach is performed to integrate the two separate datasets.
6. An appropriate quality assessment technique should be performed to evaluate the dataset after matching.
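As one possible way to carry out the association screening mentioned in the matching-variable step, the sketch below computes Cramér's V between each candidate common variable and a target variable within a single file. The choice of Cramér's V is an assumption (the text does not prescribe a particular measure), and the function name and toy data are hypothetical.

```python
from collections import Counter
import math


def cramers_v(x, y):
    """Cramér's V for two categorical sequences of equal length (illustrative)."""
    n = len(x)
    joint = Counter(zip(x, y))
    row = Counter(x)
    col = Counter(y)
    # Pearson chi-square statistic from observed and expected cell counts.
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Hypothetical file: two candidate common variables and one target variable.
region = ["N", "N", "S", "S", "N", "S", "N", "S"]
sex = ["f", "m", "f", "m", "f", "m", "m", "f"]
target = ["hi", "hi", "lo", "lo", "hi", "lo", "hi", "lo"]

scores = {"region": cramers_v(region, target), "sex": cramers_v(sex, target)}
```

In this toy example `region` is perfectly associated with the target and `sex` is not associated at all, so only `region` would be retained as a matching variable. With real data such a screening would be complemented by the judgement of subject matter experts, as noted above.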
Statistical matching quality assessment
Having reviewed the above traditional and recent developments for SM, perhaps the most important question to ask is: how good is the resulting integrated data? [48] stated two questions about the quality of integrated data. First, how can we measure the quality of estimates based on statistically matched data in practice, i.e. when one does not know the complete data? And second, can we develop a theoretical model for data, including relations between target variables, that enables us to measure the quality of integrated data?
Quality assessment of the joint distribution of variables that have never been jointly observed is a non-trivial task [121]. [49] suggested relatively simple measures for the quality assessment of integrated datasets through a comparison of basic statistics (mean, standard deviation, etc.) in the donor and integrated datasets. [13] proposed a more complex way of evaluating integrated data quality, called ‘integration validity’. It is a multilevel framework for the evaluation of quality in a SM procedure, based on four levels of validity for a matching procedure:
1. A reproduction of true but unknown values of
2. A joint distribution preservation where a true unknown joint distribution of
3. A covariance structure
4. The marginal as well as joint distributions of variables in the donor file are preserved in the integrated file.
The first and third levels of quality are not achievable. The only way to validate these two levels is through simulation experiments. The second level may be validated by auxiliary information. In reality, marginal and joint distributions in the integrated datasets are derived when using traditional approaches. This is a minimum criterion to assess the validity of an SM technique. This, however, does not imply the validation of the estimates for the joint distributions of the variables found separately in the two datasets.
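The minimum criterion above, namely that the distribution of a variable in the donor file is preserved in the integrated file, can be checked with a simple distance between empirical distributions. The sketch below uses the total variation distance, which is an assumed choice rather than a measure prescribed by the cited works; the data and function names are hypothetical.

```python
from collections import Counter


def empirical_dist(values):
    """Relative frequency of each category."""
    n = len(values)
    return {k: c / n for k, c in Counter(values).items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions
    (0 means identical, 1 means disjoint support)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical target variable as observed in the donor file and as imputed
# in the integrated file.
donor_z = ["a", "a", "b", "c", "b", "a", "c", "b", "a", "b"]
integrated_z = ["a", "b", "b", "c", "b", "a", "c", "b", "a", "b"]

tvd = total_variation(empirical_dist(donor_z), empirical_dist(integrated_z))
```

A value close to zero suggests that the marginal distribution has been preserved. The sketch ignores survey weights; for data from complex surveys, weighted frequencies would be the natural replacement for the raw counts.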
In practice, the most commonly used quality assessment technique for data resulting from SM is the one suggested by the German Association of Media Analysis [13], which involves:
1. Comparing the empirical distribution of target variables included in the integrated file with the one in the recipient and the donor files,
2. Comparing the joint distributions
More work still needs to focus on the assessment of the quality of integrated data, and developing reliable indicators that can be used in this context.
Conclusion and future work
In this paper, we reviewed the SM methods available in the existing literature, whether under the CIA, auxiliary information or uncertainty analysis, along with their drawbacks whenever applicable. Recent contributions and alternative techniques for SM have also been reviewed. It is noted that available SM procedures focus more on continuous variables than on categorical data. For future work, it is advised that studies focus more on categorical data, as it can be considered the most common type of data in most social surveys. An important result by [57] was that MI approaches are superior to the traditional SM procedures. Consequently, recent MI approaches for categorical data can be employed in the context of SM. Latent class models have recently been used for MI, where they are employed as a tool for estimating the density of categorical variables [122, 123]. The advantage of latent class models is that they can be used for datasets drawn from large-scale studies, where there is a large number of variables and complex relationship structures. On the other hand, it is known that assuming the CIA in cases where it does not really hold can lead to misleading results. To address this problem, [9, 57] suggest choosing the common variables carefully in a way that already establishes conditional independence, so that inference about the actually unobserved association becomes valid. In this context, another advantage of latent class models is their conditional independence property, that is, the scores of different items are independent of each other given the latent classes. A new approach for SM of categorical data is currently being developed by the authors, based on latent class models within a Bayesian framework.
Simulation studies will be performed to evaluate the performance of the proposed latent class model in SM, through an empirical comparison of several different matching procedures based on simulated data.
Acknowledgments
The authors are grateful to the reviewers for their valuable comments, which helped to improve the manuscript.
