Data integration using statistical matching techniques: A review

Abstract

In the era of the data revolution, the availability of data is a huge wealth that has to be utilized. Instead of conducting new surveys, benefit can be drawn from data that already exist. As enormous amounts of data become available, it is becoming essential to undertake research that integrates data from multiple sources in order to make the best use of them. Statistical Data Integration (SDI) is the statistical tool for addressing this issue. SDI can be used to integrate data files that have common units, and it also allows merging unrelated files that do not share any common units. The appropriate method of data integration is determined by the nature of the input data. SDI has two main methods, Record Linkage (RL) and Statistical Matching (SM). SM techniques typically aim to obtain a complete data file from different sources which do not contain the same units. This paper aims to give a complete overview of existing SM methods, both classical and recent, in order to provide a unified summary of various SM techniques along with their drawbacks. Points for future research are suggested at the end of this paper.

Keywords

Statistical matching, record linkage, parametric statistical matching, nonparametric statistical matching, mixed methods, Bayesian statistical matching

Table 1
Pattern of missing data originated by the statistical matching

File A
Unit	Gender	Age	Education	$\ldots$	Purchasing information about			View information about
					Cereals	Wines	$\ldots$	Daily soaps	News	$\ldots$
1	Female	40–45	Low		1 Kg	1		Missing data
2	Male	30–35	High		None	2		Missing data
$\ldots$	$\ldots$	$\ldots$	$\ldots$
File B
Unit	Gender	Age	Education	$\ldots$	Purchasing information about			View information about
					Cereals	Wines	$\ldots$	Daily soaps	News	$\ldots$
1	Female	40–45	Low		Missing data			Regular	No
2	Male	30–35	High		Missing data			No	Regular
$\ldots$	$\ldots$	$\ldots$	$\ldots$

Source: [13].

1. Introduction

Attention to data integration has been growing in response to the increasing flood of available data. Data integration aims at integrating two or more data sources (usually data from sample surveys) sharing the same target population. Statistical Data Integration (SDI) can be used to integrate data files that have common units, and it also allows merging unrelated files that do not share any common units. The appropriate method of data integration is determined by the nature of the input data. SDI has two main methods, Record Linkage (RL) and Statistical Matching (SM). [1] provide a review of data integration techniques for combining probability samples, probability and non-probability samples, and probability and big data samples. [2] presents a review of combining data from different sources, focusing on record linkage. RL is the method of gathering information from multiple sources that relates to the same entity. If there is a unique identifier in the data sources to be integrated, then there is no difficulty in matching them, and deterministic linkage is used. Deterministic linkage is a direct way of linking which usually requires exact agreement on a unique identifier (such as a national identity number). If there is no unique identifier, probabilistic record linkage is employed (see [3, 4, 5, 6] for details). An advantage of SM is that it can be used as a supplement to RL to reduce the potential bias when obtaining estimates using RL [7]. On the other hand, SM techniques typically aim to obtain a complete data file from different sources which do not contain the same units. In SM, the data sources share only a set of common variables, and inference is required on the other variables. Besides its main goal of data integration, SM also has the objective of jointly analysing pairs of variables observed in two distinct sample surveys [8, 9, 10, 11]. 
Inconsistent SM can only produce meaningless data that are impossible to compare and use. Accordingly, suitable approaches should be used for SM to keep pace with the rapid development of data science.

SM can be viewed as a missing data problem that can be addressed with methods such as imputation [12]. In the basic SM structure, there exist two files A and B as sources of data. File A contains variables $X$ and $Y$, while file B contains variables $X$ and $Z$. In this case, variables $Y$ and $Z$ are never observed jointly in one dataset. SM provides a complete file containing variables $X, Y$ and $Z$. SM is useful in many applications. A motivating example is given in this section to demonstrate its importance. The objective is to integrate two or more surveys that have no units in common.

The following example from [13] represents two different surveys, each of them having a different goal with some variables in common. The first dataset includes data about gender, age, education and purchasing information about some products like cereals, wines, …, etc. The second dataset includes data about gender, age, education and information about TV viewing behavior for daily soaps, news, …, etc. (see Table 1). An appealing option for market researchers is to combine viewing information available from the second dataset with purchasing data available from the first. This can save a lot of time and money while providing a richer data source for analysing relationships among those variables. Variables which were not found jointly on both datasets are imputed individually to create a comprehensive and large database through SM. On matching the two datasets and imputing the missing parts of the data, it becomes plausible to study the relationship between TV viewing behavior and purchasing behavior.

This paper aims to provide a unified piece of work that gives a complete view of the SM techniques available in the literature. The paper is organized as follows. Section 2 introduces SM as a missing data problem. Section 3 presents classical approaches to SM and shows their drawbacks. The methods are presented within three frameworks: under the conditional independence assumption, in the presence of auxiliary information, and subject to uncertainty analysis. A number of alternative SM techniques are given in Section 4. Section 5 discusses issues related to SM, including a review of existing methods for assessing the quality of the data resulting from various SM techniques. Finally, Section 6 concludes the paper.

2. Statistical matching as a missing data problem

There are several methods for dealing with data with missing values. These include procedures based on completely recorded units, where incomplete units are discarded and only units with complete data are analysed. Another approach depends on weighting procedures, where weights are given to observed units. Imputation techniques constitute yet another common approach, where imputation can be conducted via single imputation or Multiple Imputation (MI). Single imputation replaces each missing value by a single value. Single imputation techniques include mean imputation, regression imputation, and hot-deck imputation, as well as advanced techniques such as using an Expectation Maximization (EM) algorithm to obtain Maximum Likelihood (ML) estimates [14, 15, 16]. Rubin was the first to propose MI as a technique to address the problem of missing data (see, e.g., [17, 18, 19]). The main idea of MI is to replace each missing value by a set of M plausible values, each drawn from the conditional distribution of the missing values given the observed values. The M completed datasets are produced by an appropriate imputation model, and each is analysed with the method that would have been used had the data been complete. The resulting M analyses are then combined to give one final result. MI is most straightforward to use under ignorable missingness mechanisms, i.e., when data are Missing Completely At Random (MCAR) or Missing At Random (MAR). If the missingness is ignorable, the estimates of the statistical model of interest resulting from MI often have the desirable properties of being unbiased, consistent, and asymptotically normal [20, 21]. MI can also be applied when data are Missing Not At Random (MNAR); however, the resulting estimates have not been shown to have the same properties. For more details about the types of missingness (MCAR, MAR, and MNAR), see [22, 23, 13]. 
Since SM can be viewed as a problem of missing data [12], many authors have used MI in the context of SM. For example, [24] performed cross-survey MI to combine regression estimates where more covariates were found in one survey and fewer in the other. Another MI method used in SM, proposed by [13], is Multiple Imputation by Chained Equations (MICE).
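The combining step of MI is commonly carried out with what are usually called Rubin's rules. The sketch below is a minimal illustration of these rules and is not taken from the works cited here; the function name `pool_rubin` and the toy numbers are assumptions made for illustration. It pools M point estimates and their estimated variances into one estimate and a total variance.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M completed-data analyses (hypothetical helper, Rubin's rules)."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Toy inputs (made up): one estimate and one variance per imputed dataset.
toy_estimates = [1.0, 1.2, 0.8]
toy_variances = [0.04, 0.05, 0.03]
pooled_estimate, total_variance = pool_rubin(toy_estimates, toy_variances)
```

The between-imputation term is what distinguishes MI from single imputation: it carries the extra uncertainty due to the missing values into the final variance.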

The most important question raised when considering SM is how to match the datasets correctly. One of the first SM attempts is due to [25], who used equivalence classes of the common variables in two data sources, randomly matching $(X,Y)$ with $(X,Z)$ among “equivalent” $(X,Z)$’s that achieve the closest score. [26, 27] objected to Okner’s approach because it assumes that $Y$ and $Z$ are independent given $X$, and stressed that there is a need for a theory of matching. [28] supported and defended the independence assumption, whereas [29] argued that this conditional independence assumption is valid in many cases. In a second round of discussion, [30, 31, 32] showed some improvements in the method of equivalence classes. Again, [33] stressed his belief that the proposed methods will not perform well under the conditional independence assumption.

Various developments of SM techniques have been discussed in the literature, and many applications have used SM relying on the Conditional Independence Assumption (CIA) proposed by [25]. For instance, [34] assumed the CIA when developing a theoretical framework for SM. [35] applied a sample likelihood approach under the CIA to statistically match two independent samples. [36] proposed a mixed procedure based on multinomial logistic regression models under the CIA. [37] used the CIA to combine the information of two different datasets: Statistics on Income and Living Condition, and Household Budget Survey.

Recently, [38] used a mixed procedure of SM based on CIA to integrate two different surveys: Statistics on Income and Living Condition, and Household Budget Survey. In Section 3, we will present classical approaches in SM showing drawbacks of these procedures.

3. Traditional statistical matching methods

In their comprehensive book about SM, [11] classify SM techniques into procedures developed under the CIA and procedures that use auxiliary information. For a general review of the CIA, see [11, 39]. In this section, we present classical SM techniques under the CIA and with auxiliary information, using the same structure proposed by [11]. In addition, we consider uncertainty analysis for SM methods. Within these three settings, different estimation approaches are considered: parametric, nonparametric, mixed and Bayesian.

3.1 Statistical matching methods under conditional independence assumption

Traditionally, the framework of SM occurs when only file A and file B are available. File A contains variables $X$ and $Y$ , while file B contains variables $X$ and $Z$ . The match is done based on the set of common variables $X$ in file A and file B. All SM techniques throughout this section assume conditional independence of $Y$ and $Z$ given $X$ . In other words, variable $X$ is the only source of association between data in files A and B. In particular, the CIA implies that the joint probability density function for $X, Y$ and $Z$ can be factorized as follows:

$\displaystyle f_{X,Y,Z}(x,y,z)=f_{Y,Z|X}(y,z|x)f_{X}(x)=f_{Y|X}(y|x)f_{Z|X}(z|x)f_{X}(x),$ (1)

where $f_{Y|X}$ is the conditional density of $Y$ given $X$, $f_{Z|X}$ is the conditional density of $Z$ given $X$, and $f_{X}$ is the marginal density of $X$.

To estimate Eq. (1), we need information about the marginal distribution of $X$ and the partial relationship between $X$ and $Y$ on one hand and that of $X$ and $Z$ on the other hand. In fact, such information can be found in different samples A and B.

3.1.1 Parametric approaches

Parametric SM methods are based on two steps: first, a model is specified; then, its parameters are estimated. The method of estimation depends on the specific SM framework. There are two main approaches to parametric SM, namely the micro and macro approaches.

In the macro case, the goal is to estimate the parameter of interest, e.g. the correlation coefficient between $Y$ and $Z$ or the contingency table $Y\times Z$ in case of categorical data. The only parameter that cannot be estimated is the covariance between $Y$ and $Z$ , as they are not jointly found in one data source. One possibility to derive an estimate of such a covariance is provided by assuming the conditional independence of $Y$ and $Z$ given the $X$ variable. However, the estimates from different independent samples may lead to a variance covariance matrix ( $\hat{\sigma}$ ) that is not positive semi-definite. To avoid such a problem, Maximum Likelihood Estimation (MLE) can be used for partially observed data [11]. In case of categorical variables, the parameters of interest are the joint probabilities of $X$ , $Y$ , and $Z$ , which can be obtained using maximum likelihood too (see [40, 41, 42]).
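To illustrate the macro approach under the CIA, the following sketch estimates the covariance between $Y$ and $Z$ for a single continuous $X$. It assumes linear regressions of $Y$ and $Z$ on $X$, so that $\sigma_{YZ}=\beta_{YX}\beta_{ZX}\sigma_{XX}$. The helper names and the toy data are hypothetical. Each slope is estimated from the file in which the variable is observed, and the variance of $X$ from the concatenated files.

```python
from statistics import fmean

def _cov(u, v):
    """Covariance with divisor n (the ML convention)."""
    mu, mv = fmean(u), fmean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def cia_covariance(x_a, y_a, x_b, z_b):
    """Estimate Cov(Y, Z) under the CIA from file A (X, Y) and file B (X, Z)."""
    beta_yx = _cov(x_a, y_a) / _cov(x_a, x_a)   # slope of Y on X, from file A
    beta_zx = _cov(x_b, z_b) / _cov(x_b, x_b)   # slope of Z on X, from file B
    x_all = list(x_a) + list(x_b)               # X is observed in both files
    return beta_yx * beta_zx * _cov(x_all, x_all)

# Toy data (made up): Y = 2X in file A and Z = 3X in file B.
x_a, y_a = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
x_b, z_b = [1.0, 2.0, 3.0], [3.0, 6.0, 9.0]
sigma_yz = cia_covariance(x_a, y_a, x_b, z_b)
```

In this toy case $Y$ and $Z$ are exact linear functions of $X$, so the CIA estimate coincides with the true covariance $6\,\sigma_{XX}$; with real data the estimate is only as good as the CIA itself.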

The goal of micro approaches for SM is to create a complete set of data for $(X,Y,Z)$ , by completing the missing values in files A and B. More precisely, missing $Z$ in file A and missing $Y$ in file B are filled in. Accordingly, there are essentially two main parametric micro methods; conditional mean matching and draws based on a predictive distribution. The conditional mean matching is the most popular method in the case of continuous variables. This method is based on regression imputation. The values are imputed using the estimated regression functions of $Z$ on $X$ , and $Y$ on $X$ , respectively. Since $X$ is a random variable, we consider the conditional mean of $Z$ given $X=x$ ,

$\displaystyle E(Z|X=x_{a})=\hat{z}_{\text{primary}}^{(A)}=\hat{\alpha}_{Z}+\hat{\beta}_{ZX}x_{a},$ $\displaystyle a=1,2,\ldots,n_{A},$ (2)

where the estimated parameter values $\hat{\beta}_{ZX}$ and $\hat{\alpha}_{Z}$ are the parametric macro ML estimates. Thus, a complete file A is obtained after adding $\hat{z}_{\text{primary}}^{(A)}$. Similarly, file B is filled in with the following predicted values:

$\displaystyle E(Y|X=x_{b})=\hat{y}_{\text{primary}}^{(B)}=\hat{\alpha}_{Y}+\hat{\beta}_{YX}x_{b},$ $\displaystyle b=1,2,\ldots,n_{B}.$ (3)

Notice that when dealing with categorical variables, the variables are replaced by the indicator variables of each category. [43] showed that conditional mean matching leads to a lack of variability in the imputations for the same values of the conditioning variables, and hence to underestimation of the covariance between $Y$ and $Z$. To preserve variability and to avoid this underestimation, [22] suggest adding a random residual to each predicted value, an approach they refer to as stochastic regression imputation. As in conditional mean matching, the random draws are obtained after replacing the parameter values with their ML estimates. Therefore, file A is filled in with the values

$\displaystyle\hat{z}_{\text{primary}}^{(A)}=\hat{\alpha}_{Z}+\hat{\beta}_{ZX}x_{a}+\varepsilon_{a},$ $\displaystyle a=1,2,\ldots,n_{A},$ (4)

where $\varepsilon_{a}$ is a normally distributed residual, $\varepsilon_{a}\sim N(0,\hat{\sigma}_{Z|X}^{2})$, with $\hat{\sigma}_{Z|X}^{2}=S_{ZZ,B}-\hat{\beta}_{ZX}^{2}S_{XX,B}$, since $Z$ is observed only in file B. Similarly, file B is filled in with the values

$\displaystyle\hat{y}_{\text{primary}}^{(B)}=\hat{\alpha}_{Y}+\hat{\beta}_{YX}x_{b}+\varepsilon_{b},$ $\displaystyle b=1,2,\ldots,n_{B},$ (5)

where $\varepsilon_{b}$ is a normally distributed residual, $\varepsilon_{b}\sim N(0,\hat{\sigma}_{Y|X}^{2})$, with $\hat{\sigma}_{Y|X}^{2}=S_{YY,A}-\hat{\beta}_{YX}^{2}S_{XX,A}$.
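The two parametric micro methods can be sketched for a single continuous $X$ as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the function names and the toy data are hypothetical, the regression is fitted on the donor file with divisor-$n$ moments, and the residual variance is the observed variance minus the variance explained by $X$. Passing no random generator gives conditional mean matching, and passing one adds a normal residual to each prediction.

```python
import random
from statistics import fmean

def fit_simple_ols(x, v):
    """Return intercept, slope and residual variance of v regressed on x."""
    n = len(x)
    mx, mv = fmean(x), fmean(v)
    sxx = sum((xi - mx) ** 2 for xi in x) / n
    sxv = sum((xi - mx) * (vi - mv) for xi, vi in zip(x, v)) / n
    svv = sum((vi - mv) ** 2 for vi in v) / n
    beta = sxv / sxx
    alpha = mv - beta * mx
    resid_var = max(svv - beta ** 2 * sxx, 0.0)   # guard against rounding below zero
    return alpha, beta, resid_var

def regression_impute(x_recipient, x_donor, v_donor, rng=None):
    """Impute v in the recipient file; add N(0, resid_var) noise if rng is given."""
    alpha, beta, resid_var = fit_simple_ols(x_donor, v_donor)
    sd = resid_var ** 0.5
    imputed = []
    for x in x_recipient:
        value = alpha + beta * x                  # conditional mean
        if rng is not None:
            value += rng.gauss(0.0, sd)           # stochastic regression imputation
        imputed.append(value)
    return imputed

# Toy data (made up): Z is observed in file B only and is imputed in file A.
x_file_b = [1.0, 2.0, 3.0, 4.0, 5.0]
z_file_b = [2.1, 3.9, 6.2, 7.8, 10.1]
x_file_a = [1.5, 2.5, 4.5]
z_mean = regression_impute(x_file_a, x_file_b, z_file_b)
z_draw = regression_impute(x_file_a, x_file_b, z_file_b, rng=random.Random(0))
```

Filling file B with $Y$ works the same way with the roles of the files exchanged.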

3.1.2 Nonparametric approaches

Nonparametric approaches to SM do not explicitly refer to a model. As for the parametric case, there are two cases in nonparametric approaches; namely macro and micro.

In nonparametric macro approaches, as mentioned before, the goal is to estimate the joint distribution of $X, Y$ and $Z$ . There are a number of nonparametric approaches that can be used in this sense. Three approaches that are mentioned in the literature are empirical cumulative distribution function, kernel density estimator and k Nearest Neighbour (kNN). For more details, see [44, 45]. The previous three methods focus on the distribution of $(X,Y,Z)$ either jointly using the cumulative distribution function, or as a product of conditional and marginal distributions. However, one may be interested in certain characteristics of this distribution. The conditional expectations of $Z$ given $X$ , and $Y$ given $X$ are considered among the most important characteristics of interest. These can be derived using a nonparametric regression function. Nonparametric regression is a nonparametric method for describing the relation between a response variable and one or more predictors. Nonparametric regression techniques include kernel smoothing, local polynomial regression, spline based regression models, and regression trees. All these nonparametric techniques result in finding the conditional expectation of $Z$ given $X$ or $Y$ given $X$ , which is the most important characteristic of interest (for details see [46]).

In nonparametric micro approaches, a complete dataset for files A and B can be obtained without assuming any specific parametric distribution for the variables $X, Y$ and $Z$. This is done by applying nonparametric imputation procedures, known as hot deck imputation methods. Such procedures replace missing values in file A by values found in file B based on some matching criterion, and vice versa. The advantage of hot deck imputation procedures is that they do not require the specification of a family of distributions, nor the estimation of the distribution function or any of its characteristics. Most of these procedures fill in the dataset chosen as the “recipient” with the values of the variable that is available only in the other dataset, the “donor” file. In the following, we suppose that file A is the recipient and file B is the donor, and that the sample size of the recipient file, $n_{A}$, is smaller than that of the donor file, $n_{B}$; i.e., $n_{A}<n_{B}$. [42] stated three hot deck methods that are used in SM: distance hot deck, random hot deck and rank hot deck. Such methods may be regarded as the nonparametric counterparts of the micro parametric approaches presented above.

Random hot deck is performed by randomly selecting a donor record in file B and assigning its observed value $Z_{\text{observed}}$ to the missing $Z_{\text{missing}}$ of a record in recipient file A; this is done for each record in file A [42, 47, 48]. Before the random selection, the units in files A and B are grouped into donation classes, i.e., homogeneous subsets defined by the values of one or more categorical variables chosen from the set of common variables in datasets A and B (for instance gender, region, etc.). Then, for a given record in file A belonging to a given group (say males), a donor record in file B is randomly chosen from within the same group (males in B).
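Random hot deck within donation classes can be sketched as below. The record layout (dictionaries with a `gender` key and a `z` value), the function name and the toy records are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def random_hot_deck(recipients, donors, class_key, donor_value, rng):
    """Fill donor_value in each recipient from a random donor of the same class."""
    pools = defaultdict(list)
    for donor in donors:                       # group donors into donation classes
        pools[donor[class_key]].append(donor)
    completed = []
    for rec in recipients:
        # rng.choice raises IndexError if a class has no donors at all.
        donor = rng.choice(pools[rec[class_key]])
        completed.append({**rec, donor_value: donor[donor_value]})
    return completed

# Toy records (made up): file A lacks z, file B observes it.
file_a = [{"id": "a1", "gender": "F"}, {"id": "a2", "gender": "M"},
          {"id": "a3", "gender": "M"}]
file_b = [{"id": "b1", "gender": "F", "z": 10.0},
          {"id": "b2", "gender": "M", "z": 20.0},
          {"id": "b3", "gender": "M", "z": 30.0}]
completed_a = random_hot_deck(file_a, file_b, "gender", "z", random.Random(1))
```

Because donors are drawn with replacement, the same donor may be used for several recipients, which is the usual behaviour of an unconstrained hot deck.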

Table 2
Data of file A and file B

File A
Unit	Weight $w_{A}$	Gender $X_{1}^{A}$	Age $X_{2}^{A}$	Profit $Y$
A1	3	1	42	9.156
A2	3	1	35	9.149
A3	3	0	63	9.287
A4	3	1	55	9.512
A5	3	0	28	8.484
A6	3	0	53	8.891
A7	3	0	22	8.425
A8	3	1	25	8.867
File B
Unit	Weight $w_{B}$	Gender $X_{1}^{B}$	Age $X_{2}^{B}$	Income $Z$
B1	4	0	33	6.932
B2	4	1	52	5.524
B3	4	1	28	4.224
B4	4	0	59	6.147
B5	4	1	41	7.243
B6	4	0	45	3.230

Source: [49, 13]. Gender: 1 $=$ male; 0 $=$ female. Population size is 24; $w_{A}=\frac{24}{8}=3$; $w_{B}=\frac{24}{6}=4$.

The distance hot deck method, which is widely used in the imputation of missing data, assigns to a recipient record in file A the value of $Z_{\text{observed}}$ from the record in file B that is closest according to a set of common variables $X$, after splitting the units in files A and B into donation classes. For the simple case of a single continuous variable $X$, the distance hot deck method measures the absolute difference between a record in file A and another in file B as

$\displaystyle d_{ab^{*}}=|x_{a}^{A}-x_{b^{*}}^{B}|=\min_{1\leqslant b\leqslant n_{B}}|x_{a}^{A}-x_{b}^{B}|,$ (6)

where $b^{*}$ denotes the unit in file B that achieves the minimum distance between $x_{a}^{A}$ and $x_{b}^{B}$. In the case of more than one variable $X$, the distance can be measured, for example, by the Mahalanobis distance.

Distance hot deck can be performed using a constrained or an unconstrained approach, both introduced by [49]. The unconstrained approach aims to find, for each case in one file, the most similar case in the other file. Suppose file A contains the variables gender $X_{1}$, age $X_{2}$ and profit $Y$, whereas file B contains gender $X_{1}$, age $X_{2}$ and income $Z$ (see Table 2). Files A and B have no units in common, but they share the variables gender and age. A matched file A with variables $X_{1},X_{2},Y$ and $Z$ is created with the unconstrained method by finding, for each A unit, the B unit of the same gender that is closest in age (e.g., A1 is male and 42, and the best matching B unit is B5, who is male and 41; B5 is also the closest match to A2). The resulting matched file A is given in Table 3. To create a matched file B, one can use the same procedure to obtain matches for B units from file A. The constrained distance hot deck approach, on the other hand, starts by creating an exploded version of both files A and B. The exploded files depend on the sampling weights $w_{A}$ and $w_{B}$, which (assuming that datasets A and B have both been obtained by simple random sampling from the population) are calculated by dividing the population size by the number of units in each file. Every unit in files A and B is then duplicated $w_{A}$ and $w_{B}$ times, respectively, so that each exploded file contains $N$ records, where $N$ is the population size. The units in exploded file A and exploded file B are then reordered according to the common variables $X_{1}$ and $X_{2}$. This procedure is shown in detail in Table 4. Each unit in the reordered exploded file A is matched to the corresponding unit in the reordered exploded file B so as to minimize the distance in the $X$ space between the two exploded files. 
For instance, unit A7 is matched three times with unit B1, unit A5 is matched once with unit B1 and twice with unit B6, and so on. The results of SM using the constrained method are summarised in Table 5.

Table 3

Statistical matching using unconstrained approach

Unit	Gender $X_{1}^{A}$	Age $X_{2}^{A}$	Age $X_{2}^{B}$	Profit $Y$	Income $Z$
A1	1	42	41	9.156	7.243
A2	1	35	41	9.149	7.243
A3	0	63	59	9.287	6.147
A4	1	55	52	9.512	5.524
A5	0	28	33	8.484	6.932
A6	0	53	59	8.891	6.147
A7	0	22	33	8.425	6.932
A8	1	25	28	8.867	4.224

Source: [49, 13].
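The donor assignments of Table 3 can be reproduced with a few lines of code. The sketch below is only an illustration of the unconstrained rule described in the text (same gender, smallest absolute age difference) applied to the data of Table 2; the tuple layout and the function name are assumptions.

```python
# (unit, gender, age, value) taken from Table 2; value is profit Y or income Z.
FILE_A = [("A1", 1, 42, 9.156), ("A2", 1, 35, 9.149), ("A3", 0, 63, 9.287),
          ("A4", 1, 55, 9.512), ("A5", 0, 28, 8.484), ("A6", 0, 53, 8.891),
          ("A7", 0, 22, 8.425), ("A8", 1, 25, 8.867)]
FILE_B = [("B1", 0, 33, 6.932), ("B2", 1, 52, 5.524), ("B3", 1, 28, 4.224),
          ("B4", 0, 59, 6.147), ("B5", 1, 41, 7.243), ("B6", 0, 45, 3.230)]

def unconstrained_match(recipients, donors):
    """For each recipient, pick the same-gender donor closest in age."""
    matched = []
    for unit, gender, age, profit in recipients:
        same_class = [d for d in donors if d[1] == gender]   # donation class
        donor = min(same_class, key=lambda d: abs(age - d[2]))
        matched.append((unit, donor[0], gender, age, donor[2], profit, donor[3]))
    return matched

matched_a = unconstrained_match(FILE_A, FILE_B)
donor_of = {row[0]: row[1] for row in matched_a}
```

Note that donors B4, B5 and B1 are each used twice while B6 is never used, which illustrates why the unconstrained approach does not preserve the marginal distribution of $Z$ observed in file B.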

Table 4

Reordered exploded file A and file B for constrained approach

Unit i	Gender $X_{1}^{A}$	Age $X_{2}^{A}$	Unit i	Gender $X_{1}^{B}$	Age $X_{2}^{B}$
A7	0	22	B1	0	33
A7	0	22	B1	0	33
A7	0	22	B1	0	33
A5	0	28	B1	0	33
A5	0	28	B6	0	45
A5	0	28	B6	0	45
A6	0	53	B6	0	45
A6	0	53	B6	0	45
A6	0	53	B4	0	59
A3	0	63	B4	0	59
A3	0	63	B4	0	59
A3	0	63	B4	0	59
A8	1	25	B3	1	28
A8	1	25	B3	1	28
A8	1	25	B3	1	28
A2	1	35	B3	1	28
A2	1	35	B5	1	41
A2	1	35	B5	1	41
A1	1	42	B5	1	41
A1	1	42	B5	1	41
A1	1	42	B2	1	52
A4	1	55	B2	1	52
A4	1	55	B2	1	52
A4	1	55	B2	1	52

Source: [49, 13].

Table 5

Statistical matching using constrained approach

Units	Gender $X_{1}^{A}$	Age $X_{2}^{A}$	Age $X_{2}^{B}$	Profit $Y$	Income $Z$
A1, B2	1	42	52	9.156	5.524
A1, B5	1	42	41	9.156	7.243
A2, B3	1	35	28	9.149	4.224
A2, B5	1	35	41	9.149	7.243
A3, B4	0	63	59	9.287	6.147
A4, B2	1	55	52	9.512	5.524
A5, B1	0	28	33	8.484	6.932
A5, B6	0	28	45	8.484	3.230
A6, B4	0	53	59	8.891	6.147
A6, B6	0	53	45	8.891	3.230
A7, B1	0	22	33	8.425	6.932
A8, B3	1	25	28	8.867	4.224

Source: [49, 13].
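The pairing shown in Table 4 can be reproduced by exploding both files by their weights, sorting each by gender and age, and pairing records position by position. The sketch below does exactly that for the data of Table 2. It is an illustration under the assumptions stated in the text (equal weights from simple random sampling, a single ordering variable within each gender); in general, constrained matching is formulated as an optimisation problem, and the names used here are hypothetical.

```python
from collections import Counter

# (unit, gender, age) taken from Table 2.
UNITS_A = [("A1", 1, 42), ("A2", 1, 35), ("A3", 0, 63), ("A4", 1, 55),
           ("A5", 0, 28), ("A6", 0, 53), ("A7", 0, 22), ("A8", 1, 25)]
UNITS_B = [("B1", 0, 33), ("B2", 1, 52), ("B3", 1, 28),
           ("B4", 0, 59), ("B5", 1, 41), ("B6", 0, 45)]
POPULATION_SIZE = 24

def explode(units, population_size):
    """Replicate every unit w = N / n times and sort by (gender, age)."""
    weight = population_size // len(units)
    exploded = [u for u in units for _ in range(weight)]
    return sorted(exploded, key=lambda u: (u[1], u[2]))

exploded_a = explode(UNITS_A, POPULATION_SIZE)
exploded_b = explode(UNITS_B, POPULATION_SIZE)

# Pair the i-th record of exploded A with the i-th record of exploded B.
pair_counts = Counter((a[0], b[0]) for a, b in zip(exploded_a, exploded_b))
```

Each count in `pair_counts` is the weight of the corresponding matched pair, so the weights sum to the population size, and every donor is used exactly as often as its own sampling weight requires.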

Rank hot deck, which was first introduced by [50], is performed by separately ranking the units with respect to $X$ values in both files. Files A and B are matched in this method by associating the records with the same rank. If files A and B contain a different number of records, matching is done using the empirical cumulative distribution function of $X$ . This cumulative distribution is calculated for the two files as follows:

In the recipient file; $\hat{F}_{X}^{A}(x)=\frac{1}{n_{A}}\sum_{a=1}^{n_{A}}I(x_{a}\leqslant x),$ $x\in X$ , where $I$ is the indicator function taking the value 1 when $x_{a}\leqslant x$ , and 0 otherwise.

In the donor file; $\hat{F}_{X}^{B}(x)=\frac{1}{n_{B}}\sum_{b=1}^{n_{B}}I(x_{b}\leqslant x),$ $x\in X$ , where $I$ is the indicator function taking the value 1 when $x_{b}\leqslant x$ , and 0 otherwise.

After that, every $a=1,\ldots,n_{A}$ is linked with that record $b^{*}$ in file B according to the following condition:

$\displaystyle|\hat{F}_{X}^{A}(x_{a})-\hat{F}_{X}^{B}(x_{b^{*}})|=\min_{1\leqslant b\leqslant n_{B}}|\hat{F}_{X}^{A}(x_{a})-\hat{F}_{X}^{B}(x_{b})|.$ (7)

Then, as in distance hot deck, the value of $Z$ observed for the donor unit in file B is assigned to the missing $Z$ of the recipient unit in file A.
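A minimal sketch of rank hot deck for a single variable $X$ follows. It evaluates the two empirical cumulative distribution functions and applies the condition in Eq. (7); the function names and the toy data are hypothetical, donation classes are ignored for brevity, and ties are resolved in favour of the first donor found, which is an arbitrary choice.

```python
def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    n = len(sample)
    return lambda x: sum(1 for s in sample if s <= x) / n

def rank_hot_deck(x_a, x_b, z_b):
    """Impute Z in file A from the donor in file B with the closest ECDF value."""
    f_a, f_b = ecdf(x_a), ecdf(x_b)
    f_b_values = [f_b(x) for x in x_b]
    imputed = []
    for x in x_a:
        target = f_a(x)
        b_star = min(range(len(x_b)), key=lambda b: abs(target - f_b_values[b]))
        imputed.append(z_b[b_star])
    return imputed

# Toy data (made up) with equal file sizes: records of equal rank are linked.
toy_x_a = [10, 30, 20]
toy_x_b = [5, 1, 3]
toy_z_b = ["c", "a", "b"]
toy_result = rank_hot_deck(toy_x_a, toy_x_b, toy_z_b)
```

Unlike distance hot deck, only the ranks of $X$ matter here, so the method is unaffected by the two files measuring $X$ on different scales.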

3.1.3 Mixed approaches

Mixed approaches in SM depend on predictive mean matching imputation [11]. These are two-step procedures that combine parametric and nonparametric methods. An SM mixed method consists of the following two steps:

A model is fitted and all its parameters are estimated.

A nonparametric method is used to create the complete dataset.

There exist various methods for mixed SM depending on the type of variables, whether they are continuous or categorical.

In case of mixed approaches under CIA for continuous variables $X, Y$ and $Z$ , steps for SM procedure are as follows:

Estimate the regression parameters of $Z$ on $X$ from file B.

For each unit $a=1,\ldots,n_{A}$ , a primary value $\hat{z}_{\text{primary}}^{(A)}$ is imputed in file A using the estimated regression function from the first step.

For each unit $a=1,\ldots,n_{A}$ , a live value $z_{\text{live}}^{(B)}$ from file B is assigned to the $a^{\text{th}}$ record in file A using a convenient distance hot deck, considering the primary value $\hat{z}_{\text{primary}}^{(A)}$ .

In the same manner, $Y$ is estimated (using data from file A) and imputed in file B.
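The steps above can be sketched for a single continuous $X$ as follows. This is a minimal illustration, not the procedure of any particular author cited here: the function names and the toy data are hypothetical, and the distance in the third step is taken to be the absolute difference between the primary value and each observed $z$ in file B, which is only one possible choice of distance.

```python
from statistics import fmean

def fit_line(x, v):
    """Least-squares intercept and slope of v regressed on x."""
    mx, mv = fmean(x), fmean(v)
    sxv = sum((xi - mx) * (vi - mv) for xi, vi in zip(x, v))
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sxv / sxx
    return mv - beta * mx, beta

def mixed_match(x_a, x_b, z_b):
    """Return primary (predicted) and live (observed donor) values for file A."""
    alpha, beta = fit_line(x_b, z_b)                    # step 1: fit on file B
    primary = [alpha + beta * x for x in x_a]           # step 2: primary values
    live = [min(z_b, key=lambda z: abs(z - p))          # step 3: nearest live value
            for p in primary]
    return primary, live

# Toy data (made up): Z = 1 + 2X exactly in file B.
toy_x_b = [1.0, 2.0, 3.0, 4.0]
toy_z_b = [3.0, 5.0, 7.0, 9.0]
toy_x_a = [2.1, 3.9]
primary_a, live_a = mixed_match(toy_x_a, toy_x_b, toy_z_b)
```

The imputed values are always values actually observed in file B, which is the main practical appeal of the mixed approach compared with purely parametric imputation.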

Using the above steps, similar procedures were proposed by [34, 51, 11, 52, 43]. The main difference between their procedures lies within the parameter estimates they use. Also, they differ in the matching step; unconstrained or constrained approach. Another difference is that Rubin’s procedure suggests concatenating the resulting statistically matched files and assigning the weight $(w_{A}^{-1}+w_{B}^{-1})^{-1}$ to each unit, where $w_{A}$ and $w_{B}$ are the weights corresponding to files A and B, respectively.
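As a small worked illustration of this weight, using the weights of Table 2 ($w_{A}=3$, $w_{B}=4$, $n_{A}=8$, $n_{B}=6$, $N=24$), which is our own arithmetic rather than a result reported in the cited works:

```latex
\left(w_{A}^{-1}+w_{B}^{-1}\right)^{-1}
  =\left(\tfrac{1}{3}+\tfrac{1}{4}\right)^{-1}
  =\tfrac{12}{7}\approx 1.71,
\qquad
(n_{A}+n_{B})\cdot\tfrac{12}{7}=14\cdot\tfrac{12}{7}=24=N .
```

In this equal-weight case, the weights of the 14 units in the concatenated file again sum to the population size.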

Analogous to SM for continuous variables, two steps are applied within the mixed approach when categorical variables are considered. These are the loglinear regression model step, and the matching step. For more details about this method, see [42].

3.1.4 Bayesian approaches

[13] presented three procedures for SM using Bayesian approaches. First, the author followed [52] by presenting a frequentist regression imputation method with random residual, known as RIEPS. This procedure draws the missing values from their conditional predictive distribution, which corresponds to the imputed values in Eqs (2) and (3) with a residual added in the imputation process. Then, [13] presented the Bayesian version of RIEPS, a Non Iterative Bayesian based Imputation Procedure (NIBAS) for the univariate and multivariate cases. The format of the procedure is the same for both cases; the only differences are in the mathematical formulation of the matrices used and in the posterior distributions of the variables. In this approach, instead of estimating the parameters as in the approaches above, random draws of the parameters are made from their posterior distribution. [13] first assumed a general linear model for both datasets, as suggested by Rubin’s procedure, and then derived the posterior distributions of the parameters. Accordingly, she proposed an MI algorithm based on a Bayesian approach with the following steps:

Perform a regression for each dataset and compute the OLS estimates $\hat{\beta}_{YX}$ and $\hat{\beta}_{ZX}$ .

Calculate each sum of squared residuals, $(y-X_{A}\hat{\beta}_{YX})^{\prime}(y-X_{A}\hat{\beta}_{YX})$ and $(z-X_{B}\hat{\beta}_{ZX})^{\prime}(z-X_{B}\hat{\beta}_{ZX})$, where $y=(y_{1},y_{2},\ldots,y_{n_{A}})^{\prime}$ and $z=(z_{1},z_{2},\ldots,z_{n_{B}})^{\prime}$.

Choose a value for $\rho_{ZY|X}$ from its prior or set of arbitrary levels.

Conduct random draws for the parameters from their observed data posterior distribution.

In a nutshell, this procedure depends on random draws for the parameters, instead of estimating them from the data. This algorithm is repeated $m$ times which yields $m$ imputed datasets. Then, the results are combined. For more details, see [13].

The second method presented by [13] was the Normal Approach (NORM). This method is mainly based on data augmentation algorithm, in case of a normal model. The normal data model is used because it is the most flexible in terms of the number of variables included in any realistic matching task. NORM uses a data model much like NIBAS. The difference between NIBAS and NORM is that NORM uses the parametric data model in case of complete data, whereas NIBAS uses the observed data posterior distribution. Since NORM method needs complete data, an algorithm that handles the missing data is needed, namely data augmentation. The data augmentation algorithm can be viewed as a Bayesian version of the EM algorithm for dealing with missing data [53].

The third method proposed by [13] was MICE, an abbreviation for Multiple Imputation by Chained Equations. MICE, a common variant of MI, is an iterative method that treats the imputation problem as a sequence of estimations in which each variable takes its turn in being regressed on the other variables. MICE, first introduced by [54], is a flexible method that handles various types of variables, since each variable is imputed using its own imputation model [55]. This procedure is also called regression switching, chained equations, or variable-by-variable Gibbs sampling [54]. [13] employed MICE within a Bayesian estimation framework as a tool for SM. For more details about MICE, see [54, 13]. [56] compared the different Bayesian approaches described above through a simulation study. Using another simulation study, [57] found that the MI approaches are superior to the traditional SM procedures.

3.2 Statistical matching methods with auxiliary information

The problem with the conditional independence assumption of the variables $Y$ and $Z$ given $X$ , is that it may or may not hold. Moreover, it is not a testable assumption. Thus, the use of existing auxiliary information can be very helpful. Auxiliary information can take one of the following two forms:

The existence of a third data source C that contains variables $(X,Y,Z)$ or just $(Y,Z)$ .

Parameter estimates of the joint distribution of $(Y,Z)$ or the conditional distribution $(Y,Z)|X$, for example the covariance $\sigma_{YZ}$, the correlation coefficient $\rho_{YZ}$, the partial correlation coefficient $\rho_{YZ|X}$, or a contingency table of $Y\times Z$.

We will follow the same structure as that given for CIA, to present different approaches for SM with auxiliary information, which was proposed by [11].

3.2.1 Parametric approaches

In case of a parametric macro approach, the auxiliary information may be an additional data source or an external estimate. With an additional data source C containing $n_{C}$ i.i.d. observations, parameters can be estimated by concatenating all the data sources $(A\cup B\cup C)$ and exploiting all the available information. Using external estimates is more difficult, because it requires validating compatibility assumptions that are not always straightforward to check.

In case of parametric micro approach and having external information, a model is assumed for the joint distribution of $X, Y$ and $Z$ . After that, the parameters are estimated by exploiting all the available information (data sources and auxiliary information). When parameters can be estimated uniquely as in Section 3.1.1, then the estimates are used to generate a complete data source using conditional mean matching or draws from the predicted distribution [52, 42]. The additional information only affects the estimation stage, but no change happens in the next step when generating the synthetic dataset. Another difference occurs in the equation for conditional mean matching and draws from the predicted distribution. Under CIA, the Eqs (2) and (3) in Section 3.1.1 were reduced and $\hat{\beta}_{ZY}y_{a}$ , $\hat{\beta}_{YZ}z_{b}$ were omitted, respectively from the following more general case with auxiliary information:

$\displaystyle\hat{z}_{\text{Primary}}^{(A)}=\hat{\alpha}_{Z}+\hat{\beta}_{ZX}x_{a}+\hat{\beta}_{ZY}y_{a},\quad a=1,\ldots,n_{A},$ (8)

$\displaystyle\hat{y}_{\text{Primary}}^{(B)}=\hat{\alpha}_{Y}+\hat{\beta}_{YX}x_{b}+\hat{\beta}_{YZ}z_{b},\quad b=1,\ldots,n_{B}.$ (9)

Analogous to Section 3.1.1 under CIA, we get the equations for the method of draws from the predicted distribution by adding an error term as follows:

$\displaystyle\hat{z}_{\text{Primary}}^{(A)}=\hat{\alpha}_{Z}+\hat{\beta}_{ZX}x_{a}+\hat{\beta}_{ZY}y_{a}+\varepsilon_{a},\quad a=1,\ldots,n_{A},$ (10)

$\displaystyle\hat{y}_{\text{Primary}}^{(B)}=\hat{\alpha}_{Y}+\hat{\beta}_{YX}x_{b}+\hat{\beta}_{YZ}z_{b}+\varepsilon_{b},\quad b=1,\ldots,n_{B},$ (11)

where $\varepsilon_{a}$ is a random residual drawn from $N(0,\hat{\sigma}_{Z|XY}^{2})$, and $\varepsilon_{b}$ is a random residual drawn from $N(0,\hat{\sigma}_{Y|XZ}^{2})$. The regression coefficients $\hat{\beta}_{YX}$ and $\hat{\beta}_{ZX}$ are obtained from files A and B, respectively, whereas $\hat{\beta}_{ZY}$ and $\hat{\beta}_{YZ}$ are obtained from file C.
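As a minimal sketch of Eqs (8) and (10), the Python function below imputes $z$ for the records of file A. It assumes a single $X$ and a single $Y$ and takes the coefficient estimates as given. The function name, argument names and numbers are hypothetical.

```python
import random

def impute_z_in_a(x_a, y_a, alpha_z, beta_zx, beta_zy, sigma_z_xy=0.0, rng=None):
    """Eq. (8) when sigma_z_xy == 0; Eq. (10) when a residual s.d. is supplied."""
    rng = rng or random.Random(0)
    imputed = []
    for x, y in zip(x_a, y_a):
        z_hat = alpha_z + beta_zx * x + beta_zy * y    # conditional mean
        if sigma_z_xy > 0:
            z_hat += rng.gauss(0.0, sigma_z_xy)        # residual drawn from N(0, sigma^2)
        imputed.append(z_hat)
    return imputed

# Hypothetical estimates: beta_zx from file B, beta_zy from file C.
means = impute_z_in_a([1.0, 2.0], [0.5, 1.0], alpha_z=1.0, beta_zx=2.0, beta_zy=-1.0)
draws = impute_z_in_a([1.0, 2.0], [0.5, 1.0], 1.0, 2.0, -1.0, sigma_z_xy=1.0)
```

Setting `beta_zy` to zero recovers the reduced form used under CIA.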

3.2.2 Nonparametric approaches

In case of a nonparametric macro setting, the only type of auxiliary information available is an additional data source C. Two cases arise. In the first, all variables are available in C, and the marginal $X$ distribution can be estimated using either kernels or k Nearest Neighbour methods on the whole sample $A\cup B\cup C$. For more on kernel estimators, see [58, 59, 60]. In the second, only $Y$ and $Z$ are available in C and $X$ is missing. Although nonparametric methods should be used when we have a complete data source, i.e. file C containing $(X,Y,Z)$, [61] confirm the possibility of using kernel estimators of distribution functions when the data source is not complete, i.e. file C containing only $Y$ and $Z$. In this case, [40] proposes using k Nearest Neighbour in the EM algorithm for continuous variables, or its counterpart for categorical variables, the Iterative Proportional Fitting (IPF) algorithm.

In case of a nonparametric micro setting, auxiliary information is solely represented by an additional data source C. This procedure is presented by [40] in two steps as follows:

Step 1:
Impute a live $z$ value $z_{\text{live}}^{(C)}$ using file C as a donor. This live value, that is observed in file C, is then assigned to the $a^{\text{th}}$ record in file A according to the nearest distance:

$d_{ac}((x_{\text{observed}}^{(A)},y_{\text{observed}}^{(A)}),(x_{\text{observed}}^{(C)},y_{\text{observed}}^{(C)}))$, if C contains $X$, $Y$ and $Z$;

$d_{ac}(y_{\text{observed}}^{(A)},y_{\text{observed}}^{(C)}),$ if C contains $Y$ and $Z$ only.

Step 2:
The final live $z$ value $z_{\text{live}}^{(B)}$ is imputed using file B as a donor, then it is assigned to the $a^{\text{th}}$ record in file A with distance $d_{ab}((x_{\text{observed}}^{(A)},z_{\text{live}}^{(C)}),(x_{\text{observed}}^{(B)},z_{\text{observed}}^{(B)}))$.

In particular, if C is a sample that is large enough and the information it provides can be considered “reliable”, then the SM procedure can be limited to Step 1 outlined above, that is, imputing $Z$ in A using C as donor.
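The two steps can be sketched in Python as follows. This is an illustration only: it assumes a single $X$, $Y$ and $Z$, a file C containing all three variables, and the Euclidean distance (the text does not fix a particular distance). The function names and toy records are hypothetical.

```python
import math

def nearest(target, candidates):
    """Index of the candidate vector closest to target (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(target, candidates[i]))

def two_step_hot_deck(file_a, file_b, file_c):
    """file_a: (x, y) records; file_b: (x, z) records; file_c: (x, y, z) records."""
    c_keys = [(x, y) for x, y, _ in file_c]
    matched = []
    for x_a, y_a in file_a:
        # Step 1: intermediate live z from the nearest record in C on (x, y).
        z_c = file_c[nearest((x_a, y_a), c_keys)][2]
        # Step 2: final live z from the nearest record in B on (x, z).
        z_b = file_b[nearest((x_a, z_c), file_b)][1]
        matched.append((x_a, y_a, z_b))
    return matched

file_a = [(1.0, 2.0), (5.0, 6.0)]
file_b = [(0.9, 11.0), (5.1, 19.0), (3.0, 15.0)]
file_c = [(1.1, 2.1, 10.0), (5.2, 5.9, 20.0)]
synthetic = two_step_hot_deck(file_a, file_b, file_c)
```

In practice the variables would usually be standardized before computing distances, so that no single variable dominates the match.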

An alternative procedure based on loglinear models, called categorical restriction approaches, was introduced by [42].

3.2.3 Mixed approaches

Analogous to CIA in Section 3.1.3, mixed approaches with auxiliary information consist of the same two steps: the model parameters are estimated first, and a nonparametric method is then applied.

The same procedure outlined in Section 3.1.3 for SM under CIA can be used here with continuous variables. The only difference here is in the way of estimation as information about parameter estimates is available, either from additional data source C or from an external estimate (i.e. information about $\rho_{YZ|X}$ ). In this context, SM takes place using two steps:

Step 1:
The mixed procedure remains unchanged, focusing on imputing primary values using the same regression models as in Eqs (10) and (11).

Step 2:
Final nonparametric imputation: a live $z$ value $z_{\text{live}}^{(B)}$, observed in file B, is assigned to the $a^{\text{th}}$ record in file A according to the smallest Mahalanobis distance computed using observed and primary imputed values for $Y$ and $Z$, namely $d[(y_{\text{Observed}}^{(A)},\hat{z}_{\text{Primary}}^{(A)}),(\hat{y}_{\text{Primary}}^{(B)},z_{\text{Observed}}^{(B)})]$.
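A minimal Python sketch of the final matching step follows. It assumes a single $Y$ and $Z$, and it assumes that the Mahalanobis distance uses the pooled sample covariance of the $(y,z)$ pairs from both files; the text does not say which covariance is used. All names and values are hypothetical.

```python
from statistics import mean

def cov2(points):
    """Sample covariance terms (s_uu, s_uv, s_vv) of a list of (u, v) pairs."""
    mu, mv = mean(p[0] for p in points), mean(p[1] for p in points)
    n = len(points) - 1
    suu = sum((u - mu) ** 2 for u, _ in points) / n
    svv = sum((v - mv) ** 2 for _, v in points) / n
    suv = sum((u - mu) * (v - mv) for u, v in points) / n
    return suu, suv, svv

def mahalanobis_sq(p, q, cov):
    """Squared Mahalanobis distance between two (u, v) points."""
    suu, suv, svv = cov
    det = suu * svv - suv ** 2
    du, dv = p[0] - q[0], p[1] - q[1]
    # (du, dv) S^{-1} (du, dv)^T with S^{-1} = [[svv, -suv], [-suv, suu]] / det
    return (svv * du * du - 2 * suv * du * dv + suu * dv * dv) / det

def final_match(a_points, b_points):
    """a_points: (y observed, z primary) in A; b_points: (y primary, z observed) in B."""
    cov = cov2(a_points + b_points)      # pooled covariance (an assumption)
    live = []
    for p in a_points:
        j = min(range(len(b_points)), key=lambda k: mahalanobis_sq(p, b_points[k], cov))
        live.append(b_points[j][1])      # live z value observed in file B
    return live

z_live = final_match([(0.0, 0.0), (10.0, 1.0)], [(0.5, 0.2), (9.0, 1.5), (5.0, 5.0)])
```

This sketch is unconstrained, so a donor in B may be used for more than one record in A.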

[34, 51] proposed a similar procedure that permits the incorporation of additional information represented by the correlation coefficient $\rho_{YZ}^{*}$ , where parameters of the normal distribution are estimated by their sample counterparts.

In case of categorical variables, a number of mixed methods are introduced in [42]. For more details, see [11].

3.2.4 Bayesian approaches

The Bayesian approach is especially useful when dealing with auxiliary information. [13] considered a Bayesian approach for SM when an additional data source C is available. This additional dataset C can be used for estimating the conditional association of $Y$ and $Z$. That is, the value $\rho_{YZ|X}^{*}$ estimated from file C is used as the prior conditional correlation, instead of setting a prior value for $\rho_{YZ|X}$ as in Section 3.1.4. After that, the algorithm proceeds as outlined in Section 3.1.4, exploiting all the available information. Therefore, the posterior distribution of all the parameters changes according to $\rho_{YZ|X}^{*}$.

3.3 Uncertainty analysis

In practice, CIA rarely holds true, and results relying on CIA may be misleading. If CIA is not valid, bias will exist among variables of interest after applying the SM process. This issue is solvable with the help of auxiliary information when available [42, 62]. When CIA is not valid and auxiliary information is not available, which is often the case, uncertainty analysis may be performed.

Statistical matching is essentially related to an identification problem concerning the association of variables not found together in the same dataset. As known, CIA cannot be validated from the observed data. By depending on the explanatory power of the common variables $X$ , a smaller or wider range of admissible values of the unconditional association of $Y$ and $Z$ can then be estimated. In official statistics this kind of approach is named uncertainty analysis [63]. In other application domains, it takes other names, for instance it is known as partial identification in econometrics and social sciences [64, 65, 66, 67].

Uncertainty in SM can be viewed as a special case of an estimation problem, as there is uncertainty about the joint model of $(X,Y,Z)$. In case of continuous data, uncertainty is quantified by taking the range of an association parameter (e.g. the correlation coefficient) between the variables that are not jointly observed. In case of categorical data (e.g. $k$-way contingency tables), upper and lower bounds are set on cell counts. Basically, uncertainty analysis is concerned with identifying which values of the target parameters are compatible with the available data. For more on uncertainty and the methods proposed for assessing it, see [57, 68, 69, 70, 71, 72, 73, 74, 75, 76].
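Two standard bounds illustrate the idea in Python. For a single standardized $X$, the admissible range of $\rho_{YZ}$ follows from requiring the correlation matrix of $(X,Y,Z)$ to be positive semidefinite. For a cell of a $Y\times Z$ table, the Fréchet bounds follow from the two margins. The function names and input values are hypothetical, and the cell bounds are shown without conditioning on $X$ for simplicity.

```python
import math

def correlation_bounds(rho_xy, rho_xz):
    """Range of rho_yz compatible with the observed rho_xy and rho_xz."""
    centre = rho_xy * rho_xz                               # value implied by CIA
    half = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return centre - half, centre + half

def frechet_bounds(p_y, p_z):
    """Bounds on a joint cell probability P(Y=i, Z=j) given its two margins."""
    return max(0.0, p_y + p_z - 1.0), min(p_y, p_z)

low, high = correlation_bounds(0.8, 0.6)    # roughly (0.0, 0.96)
cell_low, cell_high = frechet_bounds(0.7, 0.6)
```

The interval narrows as $X$ explains more of $Y$ and $Z$, and its midpoint is the value of $\rho_{YZ}$ obtained under CIA.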

Uncertainty analysis can be performed in two settings, parametric or nonparametric. In the parametric setting, the identification problem is handled by considering ranges of plausible values of the missing records, extracted from models fitted to the available sample information. The intervals defined by these ranges are known as uncertainty intervals [43, 52, 34, 13]. In the nonparametric setting, uncertainty for SM is considered in [69, 77, 76]. Here uncertainty is still described by a class of models or, more specifically, by a class of distributions for $(X,Y,Z)$. Compared to the parametric case (either multinormal or multinomial), there are two main sources of difficulty. First, since the class of distributions for $(X,Y,Z)$ is not identified by a finite number of parameters, a technical tool is needed to describe it. Second, in order to quantify uncertainty, the class of all possible distributions for $(X,Y,Z)$ must be “summarised” by a single number. An even more important problem is the quantification of the uncertainty in the presence of auxiliary information on the model [78, 79].

Recently, [80] performed an analysis to choose the matching variables by searching for the common variables $X$ that are the most effective in reducing the uncertainty between $Y$ and $Z$, while taking care not to select too many common variables, so as to avoid a large number of parameters to estimate. Other measures of uncertainty are considered in [81, 82]. [35] discussed the uncertainty in SM under informative sampling designs considering the three-variate normal case. [83] proposed the use of graphical models to deal with the SM uncertainty for multivariate categorical variables.

4. Developments in statistical matching methods

In addition to the traditional SM techniques outlined in Section 3, there are a number of alternative methods that have been introduced in the literature. This section will present an overview of some of these methods.

4.1 Propensity scores method

[84, 85] were among the first to propose the use of propensity score matching. Propensity scores are predicted values from a logistic regression, where the effect of a certain group or treatment is estimated, accounting for covariates. The matching process is done based on these scores. The main advantage of using this method in SM is that it reduces the matching to a single constructed propensity score, which is especially valuable when there is a large number of common variables $X$. The methodology of constrained statistical matching using estimated propensity scores was developed by [86] to produce the required synthetic data sets. [87, 88] outlined three steps for propensity score matching:

Data preparation and harmonization step: Harmonization step involves identification of all common variables $X$ shared by both datasets A and B, before data integration takes place. In the harmonization step, common variables in the two datasets should have the same categories. If the two similar variables cannot be harmonized, they have to be discarded. Also, common variables $X$ should not include missing values. If $A$ and $B$ are representative samples of the same population, the common variables in the two datasets should have the same marginal/joint distribution.

Weight adjustments step: In this step, the sum of the attached weights for records (weighted population totals) in the donor file is adjusted to make them comparable with those in the recipient file.

Estimation of the propensity scores step: In this step, the propensity scores for the matching process are estimated by creating an outcome variable $R$, where $R=1$ for all records in the recipient file and $R=0$ for all records in the donor file. The two files are joined by stacking their records. A logistic regression model is then fitted, in which $R$ is the dependent variable and the independent variables are the selected common variables $X$. The common variables $X$ are selected carefully to maximize the explanatory power of the logistic regression model, because the validity of propensity score statistical matching relies on the power of the common variables to act as good predictors that can be transformed into effective propensity scores. The propensity score is defined as:

$\displaystyle e(z_{i})=P(R=1|X=x_{i})=f(x_{i}^{\prime}\beta),$ (12)

the conditional probability of unit $i$ to belong to a certain group ( $R=1$ ) given the covariates ( $X=x$ ). The estimated propensity score is defined as,

$\displaystyle\hat{e}(z_{i})=f(x_{i}^{\prime}\hat{\beta})=\frac{1}{1+e^{-x_{i}^{\prime}\hat{\beta}}}.$ (13)

The individual propensity scores $\hat{e}(z_{i})$ are the predicted values obtained from the logistic regression estimate of $\beta$. Having calculated the propensity scores, all units in each file are sorted in ascending order according to their respective propensity scores $\hat{e}(z_{i})$. For the two data sets sorted by propensity score, a cumulative weight variable is created. This cumulative weight is then sorted in descending order. Under this sorting scheme, units with larger weights in the donor file are assigned to multiple units in the recipient file until all of their weight is used up. If not all units in both files have been matched, additional estimation of propensity scores is performed.
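The weight adjustment and score estimation steps can be sketched in Python as below. The sketch assumes a single common variable, fits the logistic model by plain gradient ascent rather than a statistical package, and rescales the donor weights proportionally; the text does not prescribe any of these. All names and data are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, rs, steps=5000, lr=0.1):
    """Fit P(R=1 | x) = sigmoid(b0 + b1 * x) by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        resid = [r - sigmoid(b0 + b1 * x) for x, r in zip(xs, rs)]
        b0 += lr * sum(resid) / n
        b1 += lr * sum(e * x for e, x in zip(resid, xs)) / n
    return b0, b1

def rescale_weights(donor_w, recipient_w):
    """Scale donor weights so that their total equals the recipient total."""
    factor = sum(recipient_w) / sum(donor_w)
    return [w * factor for w in donor_w]

# Toy data: one common variable x per record.
recipient_x = [0.2, 0.9, 1.4, 2.1]      # R = 1
donor_x = [0.1, 0.5, 1.0, 1.2, 1.9]     # R = 0
xs = recipient_x + donor_x              # stacked file
rs = [1] * len(recipient_x) + [0] * len(donor_x)
b0, b1 = fit_logistic(xs, rs)

# Sort each file in ascending order of its estimated propensity score.
recipient_sorted = sorted(recipient_x, key=lambda x: sigmoid(b0 + b1 * x))
donor_sorted = sorted(donor_x, key=lambda x: sigmoid(b0 + b1 * x))
donor_w = rescale_weights([10.0] * 5, [25.0] * 4)
```

The subsequent allocation of donor weight to recipient units is not shown here.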

In contrast, [89] opposed using propensity scores for SM, arguing that the procedure ignores the measurements of the matching variables $X$, which may lead to increased bias. They claimed that the measurements of the matching variables are less transparent with propensity score matching than with other methods of matching. Therefore, they suggested a set of precautions for using propensity score matching, to limit as far as possible the imbalance that results from this method.

4.2 Statistical matching using fractional imputation method

[90, 91] proposed fractional imputation, a comparatively recent form of imputation for handling missing data. Fractional imputation can be carried out in two settings, parametric or nonparametric. In the parametric approach, imputed values are generated using the EM algorithm, a popular tool for finding the MLE of the model parameters. In fractional imputation, the E-step can be approximated by the weighted mean of the imputed data likelihood, where the fractional weights are computed from the current value of the parameter estimates. Numerous imputed values with fractional weights are generated for each unit containing missing data. Each fractional weight indicates the conditional probability of the imputed value given the observed data. For instance, the following two steps should be considered to generate the missing $z$ in file A from the conditional distribution of $z$ given the observed data, i.e. $f(z|x,y)\varpropto f(y|x,z)f(z|x)$. First, generate $z^{*}$ using $\hat{f}_{b}(z|x)$, the conditional distribution estimated from file B under the conditional independence assumption. Next, accept $z^{*}$ if $f(y|x,z^{*})$ is sufficiently large. Imputed values from $f(z|x,y)\varpropto f(y|x,z)f(z|x)$ can be generated by parametric fractional imputation via the EM algorithm as follows:

For each $i\in A$ , generate $m$ imputed values $z_{1}^{*},\ldots,z_{m}^{*}$ from $\hat{f_{b}}(z|x_{i})$ where $\hat{f}_{b}(z|x)$ is the estimated conditional distribution of $z$ given $x$ from file B.

Fractional weight is assigned for the $j^{\text{th}}$ imputed value $z_{ij}^{*}$ using

$\displaystyle w_{ij(t)}^{*}\varpropto f(y_{i}|x_{i},z_{ij}^{*};\hat{\varPsi}_{t}),$

where $\sum_{j=1}^{m}w_{ij(t)}^{*}=1$, $\varPsi$ is the parameter (or parameter vector) defining the conditional distribution of $y$ given $(x,z)$, and $\hat{\varPsi}_{t}$ is its current estimated value at iteration $t$.

Solve the fractionally imputed score equation for $\varPsi$ to obtain $\hat{\varPsi}_{t+1}$

$\displaystyle\sum_{i\in A}w_{ia}\sum_{j=1}^{m}w_{ij(t)}^{*}S(\varPsi;x_{i},z_{ij}^{*},y_{i})=0$

where $S(\varPsi;x,z,y)=\partial\log f(y|x,z;\varPsi)/\partial\varPsi$ and $w_{ia}$ is the sampling weight of unit $i$ in file A.

Repeat steps 2 and 3 until convergence.
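The weighting in step 2 can be sketched in Python for one unit $i$ of file A. The sketch assumes, purely for illustration, that $f(y|x,z;\varPsi)$ is a normal linear regression of $y$ on $x$ and $z$; the parameter layout `psi = (b0, bx, bz, sigma)` and the numbers are hypothetical.

```python
import math

def normal_pdf(v, mu, sigma):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fractional_weights(y_i, x_i, z_draws, psi):
    """w*_ij proportional to f(y_i | x_i, z*_ij; psi), normalised to sum to 1."""
    b0, bx, bz, sigma = psi
    dens = [normal_pdf(y_i, b0 + bx * x_i + bz * z, sigma) for z in z_draws]
    total = sum(dens)
    return [d / total for d in dens]

# m = 3 imputed values for one unit, with a current parameter estimate.
w = fractional_weights(y_i=2.0, x_i=1.0, z_draws=[0.0, 1.0, 2.0], psi=(0.0, 1.0, 1.0, 1.0))
```

Imputed values that make the observed $y_{i}$ more likely receive larger weights. Step 3 would then update `psi` from the weighted score equation.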

For the nonparametric setting, fractional hot deck imputation is considered, which combines fractional imputation with the hot deck procedure. Instead of generating $z_{ij}^{*}$ from $\hat{f}_{b}(z|x)$, which is a parametric approach, all observed values of $z_{i}$ in file B are used as imputed values, and the fractional weights in Step 2 are obtained by

$\displaystyle w_{ij}^{*}(\hat{\varPsi}_{t})\varpropto w_{ij0}^{*}f(y_{i}|x_{i},z_{ij}^{*};\hat{\varPsi}_{t}),$ (14)

where $w_{ij0}^{*}=\frac{\hat{f}_{b}(z_{j}|x_{i})}{\sum_{k\in A}w_{ka}\hat{f}_{b}(z_{j}|x_{k})}$.

4.3 Statistical matching based on statistical learning

Statistical Learning (SL) refers to a wide set of classification/regression techniques that “learn from data”. These are typically algorithm-based techniques that do not assume a statistical model. They include methods such as classification and regression trees (CART) and random forests, rather than methods based on fitting stochastic models. SL techniques can be used in statistical matching to replace methods that use model predictions. Some techniques, like nearest neighbour distance hot deck, are special cases of a well-known SL technique called kNN. For more details about SL techniques, see [92, 93, 94, 95]. [11] were the first to use SL techniques in SM. They showed how SL can be used to pick a subset of matching variables prior to the matching step. SL techniques can also be useful for measuring the distance between the observations in A and those in B when employing the nearest neighbour distance hot deck [96]. The use of SL in SM goes further by incorporating other current classification or regression approaches into SM. [97] showed how popular SL techniques can be beneficial for matching purposes. Two different ideas are considered: (i) integrating datasets; (ii) assessing the uncertainty. The characteristics of these approaches are investigated through a simulation study and an application to real survey data. The results are promising, showing that certain SL approaches can be very successful in using the information in already available survey data, allowing for a reduction in the uncertainty when using traditional SM techniques. Moreover, [80] compared SL techniques with traditional hot deck methods and showed the superiority of SL techniques in two respects: first, in reducing the time spent selecting the matching variables that hot deck typically requires, and second, in calculating distances between units in file $A$ and potential donors in file $B$.

4.4 Statistical matching using multinomial logistic regression models

[36] proposed several mixed procedures based on multinomial logistic regression models, both under CIA and with auxiliary information. First, they proposed two mixed methods for SM without auxiliary information, assuming conditional independence. The first approach is a mixed method that combines multinomial logistic regression with distance hot deck imputation. It is based on the following three steps:

The probability of a unit being in the $j^{\text{th}}$ category, $\hat{\varPi}_{j}^{(B)}$, is estimated from all units in file B by fitting a multinomial logistic regression model in which the common variables $X$ are the independent variables and the observed variable $Z$ in file B is the dependent variable.

For each unit in file A, the probability of being in $j^{\text{th}}$ category, $\varPi_{j}^{(A)},$ is estimated using the following equation;

$\displaystyle\hat{\varPi}_{j}^{(A)}=\frac{\exp(\hat{\beta}_{j}^{\prime}x)}{1+\sum_{i=1}^{J-1}\exp(\hat{\beta}_{i}^{\prime}x)},\quad j=1,\ldots,J-1,$

where $\hat{\beta}_{j}^{\prime}$ is obtained from the first step.

Distance hot deck approach is applied in which a value of $Z$ in file B is imputed for file A based on the nearest Euclidean distance $d_{ab}(\hat{\varPi}_{j}^{(A)},\hat{\varPi}_{j}^{(B)})$ and is denoted by $\hat{Z}$ .

The second mixed method, based on multinomial logistic regression models under CIA, uses a randomization mechanism, and involves the same first and second steps as the previous method. The only difference is in the third step as $\hat{Z}$ for file $A$ is obtained by generating a multinomial random variable with probability $\varPi_{j}^{(A)}$ .
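The second and third steps of the randomization variant can be sketched in Python as follows. The sketch assumes that category $J$ is the reference category, so its probability is one minus the sum of the others. The coefficient values and function names are hypothetical.

```python
import math
import random

def category_probabilities(x, betas):
    """Probabilities for categories 1..J from J-1 coefficient vectors (J is the reference)."""
    scores = [math.exp(sum(b * v for b, v in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

def draw_category(probs, rng):
    """Draw one category label (1-based) with the given probabilities."""
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]

betas = [[0.0, 1.0], [0.0, -1.0]]                   # estimates from file B, J = 3
probs = category_probabilities([1.0, 0.0], betas)   # x = (intercept term, covariate)
z_hat = draw_category(probs, random.Random(1))
```

For the distance hot deck variant, the vector `probs` of a record in A would instead be compared with the estimated probability vectors of the records in B.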

They also extended their method to be used in presence of auxiliary information. Four versions of this method were introduced, in presence of auxiliary information. The first version has three steps as follows:

For file C, fit a multinomial logistic regression model in which $Y$ is the independent variable and the observed variable $Z$ in file C is the dependent variable.

For file A, the estimates in the first step are used to get $\hat{\varPi}_{j}^{(A)}$ .

For file A, $\hat{Z}$ is obtained by generating a multinomial random variable with probability $\hat{\varPi}_{j}^{(A)}$ .

A variation of the above method is to replace the first step by fitting a multinomial logistic regression model, in which $X$ and $Y$ are the independent variables and $Z$ is the dependent variable using file C.

The third variation of this method involves the following six steps:

For file C, fit a multinomial logistic regression model in which $Z$ is the independent variable and the observed variable $Y$ in file C is the dependent variable.

For file $B$, the estimates in the first step are used to get the probability $\hat{\varPi}_{j}^{(B)}$.

For file B, $\hat{Y}$ is obtained by generating a multinomial random variable with probability $\hat{\varPi}_{j}^{(B)}$ .

For file B, fit a multinomial logistic regression model in which $X$ and $\hat{Y}$ are independent variables and the observed variable $Z$ in file B is the dependent variable.

For file A, the estimates in the fourth step are used to get probability $\hat{\varPi}_{j}^{(A)}$ .

For file A, $\hat{Z}$ is obtained by generating a multinomial random variable with probability $\hat{\varPi}_{j}^{(A)}$ .

The first step can be replaced by fitting a multinomial logistic regression model in which $X$ and $Z$ are independent variables and $Y$ is the dependent variable, using file C, thus giving a fourth version of the proposed method.

The proposed methods were compared to a random hot deck procedure via a simulation study. The results were very similar in the case of CIA. However, the proposed methods exhibit better performance in case of auxiliary information, especially with high levels of association between $Y$ and $Z$ . Also, simulation results showed that there is no need to have a large size of file C, therefore the expense of overcoming the CIA would not be a major worry in practice.

4.5 Mixed integer linear programming procedure

Several approaches have been proposed recently to tackle the problem of incoherence that arises in SM. A number of authors, including [98, 99, 100, 101] and [102], applied a mixed integer linear programming procedure, based on efficient L1 distance minimization, in the context of SM. [102] dealt with managing inconsistencies inside the SM framework when logical relations among the variables are present. Incoherence can arise in the probability evaluations. They used several advanced adjustment procedures to remove such incoherence, applied these procedures to real data, and found some differences between the adjustment procedures.

Recently, [99, 98] recommended applying a merging approach for jointly inconsistent probabilistic assessments to the SM problem. The merging technique is based on an efficient L1 distance minimization through mixed-integer linear programming, which results in the elicitation of imprecise (lower-upper) probability assessments that are not only feasible but also meaningful. They emphasized that their method is meaningful whenever there are structural zeros among the variables. Structural zeros are an important feature of survey data and refer to impossible combinations of variable values. For example, in the combination of pregnancy status and gender, there should not exist a pregnant male. For a household survey, in the combination of relationship and age, there should not exist a household where a son is older than his biological father. The presence of these structural zeros means that estimates coming from different sources of information cannot be guaranteed to merge coherently. Their approach is therefore particularly relevant whenever there are logical (structural) constraints together with varied sources of information.

4.6 Statistical matching using Bayesian networks

[103] described and discussed first attempts for SM of discrete data using Bayesian networks. In Bayesian networks, so-called (directed acyclic) graphs are used to model the data. Their micro matching approach has three steps: estimating and combining the (directed acyclic) graphs for datasets A and B, estimating the corresponding local parameters and combining them to the joint probability distribution, and imputing the missing values in A and B to obtain the integrated dataset.

Their first attempt at using Bayesian networks encouraged a further study into how probabilistic graphical models may be used for SM [104]. In their study, they considered not only discrete but also continuous and mixed variables for SM. Further research is also needed to see whether undirected probabilistic graphical models are more promising. Some of these new research points for Bayesian networks are considered in [104]. [104] fitted log-linear Markov networks under the assumption of conditional independence. Their approach visualizes dependencies among variables using undirected graphical models and obtains a powerful factorization of their joint distribution, which is used to estimate the probability components of the joint distribution. They embedded the identification problem of SM into the theory of log-linear Markov networks and showed an exemplary implementation of their approach based on the German General Social Survey. These preliminary findings showed that their suggested SM approach can reconstruct the joint distribution reasonably well. The small differences between the sample distribution and the distribution estimated using their SM procedure are particularly encouraging, because they avoided overconfidence by deliberately not selecting the specific and common variables based on a previous association analysis.

Recently, [83] proposed the use of Bayesian networks to deal with the SM uncertainty for multivariate categorical variables. The motivation for using Bayesian networks in their work is that extra sample information on qualitative dependencies between the components of $Y$ and $Z$ can be utilized in SM. Such information can be used to factorize the joint probability mass function based on the conditional independencies implied by the graph. Extra sample information significantly increases the quality of SM. Because a smaller number of lower dimension parameters must be estimated, the representation of the joint probability distribution using local relationships simplifies both parameter estimation and SM evaluation in a multivariate context. Furthermore, the modularity of the graphical model makes it possible to handle separately the subgraphs produced by variables related to the same sample and the subgraphs produced by variables existing in different samples. Accordingly, parameters that are influenced by uncertainty are separated from those that can be directly estimated from sample data. Computational complexity is thus limited to a subset of variables. A simulation study was conducted to evaluate the performance of their suggested approach with and without auxiliary information and to compare it with the saturated multinomial model in terms of uncertainty reduction. An application to real data demonstrated the practicality of their suggested approach in real case situations.

5. Issues related to statistical matching

5.1 Methods for complex surveys

Complex surveys refer to surveys that involve sampling designs that are not based on simple random selection. These include designs such as probability proportional to size sampling, multistage sampling, cluster sampling and stratified sampling. It is also very common to combine some of these designs according to the population under study or the survey objectives.

In general, samples selected using a complex survey design differ in many ways from simple random samples. In a multistage sampling design, the secondary sampling units may have unequal inclusion probabilities, and units belonging to the same primary sampling unit are not independent and usually show an intraclass correlation. For more details about complex surveys, see [105].

Several SM methods in the literature deal with complex surveys. These include traditional micro approaches and SM methods that adjust for the sampling design, proposed by [52, 106, 107] and [108]. Traditional micro approaches for SM of data from complex sample surveys are usually performed using nonparametric micro methods such as nearest neighbour donor, rank or random hot deck. These methods neglect the sampling design and the weights of the units in the matching step. After the matching step has filled the missing values in the recipient file using one of these traditional micro approaches, the sampling design of the recipient data can be used to perform statistical analysis. [109] compared several traditional approaches for SM in the context of complex surveys. Rank and random hot deck approaches appeared to be efficient in terms of preserving the joint distribution $X\times Z$ in the integrated file $A$ and the marginal distribution of variable $Z$ after integration. In contrast, nearest neighbour donor works well only when constrained matching is used and a design variable is considered in forming donation classes.

The SM methods that explicitly account for the sampling design and its associated weights are Renssen’s method [106], Rubin’s file concatenation [52] and a method based on empirical likelihood suggested by [107]. A comparison of Renssen’s and Wu’s approaches with Rubin’s file concatenation procedure is found in [110]. In general, Renssen’s approach relies on a series of weight calibrations, applied to the two datasets, to achieve consistency between estimators derived separately from files A and B. Calibration is a common approach in sample surveys to derive new weights (for more details about calibration, see [111]). Rubin’s procedure is the only method that takes the sampling design into consideration through the concatenation step. It gives weights for all units in the matched file. The general steps of Rubin’s procedure are given in Section 3.1.3. The approach of [107], in contrast, is based on empirical likelihood. Unlike the other methods, Wu’s approach, when used in the context of SM, guarantees positive adjusted weights because it is based on maximum likelihood. It is also consistent and efficient compared with other weight-adjustment methods such as those proposed in [112, 113]. Table 6 summarises the advantages and disadvantages of the SM methods for complex surveys proposed by [52, 106, 107], in addition to traditional micro approaches.

Table 6
Comparison between different methods of SM in complex surveys

Methods	Advantages	Disadvantages
Traditional approaches	Very simple and flexible to perform.	Selection of the matching variables is essential. Theoretical and practical implications have not been investigated.
Rubin’s file concatenation	Extremely simple in practice; only the concatenated file is used. More accurate for estimates of the $X$ variables.	For estimating characteristics relating to $Y$, $Z$ and their relationship, methods to cope with missing values must be considered. The theory is developed for concatenating the theoretical samples, but in practice the available data refer to the survey respondents (final weights incorporate corrections for unit nonresponse, coverage, …).
Renssen’s approach	After harmonization, the parameters linked to $(X,Y)$ on $A$, as well as $(X,Z)$ on $B$, can be estimated directly. Facilitates the incorporation of future auxiliary data sources (file C) into the estimation process. Offers micro imputations with a distribution that is usually coherent with the one in the donor file.	Calibration may fail. Managing continuous and categorical variables together is complex. Imputed micro values for categorical variables are estimated probabilities, which may be negative or greater than one.
Wu’s approach	More adaptable than Renssen’s approach (no negative weights; handles mixed-type variables). Provides a complete framework for inference from complex sample survey data.	The theory is more complex. Considerable additional complexity is required to deal with nonproportional stratified sampling. The methods allow combining theoretical samples, but unit nonresponse is difficult to handle.
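The harmonization idea behind calibration can be illustrated with post-stratification, which is the simplest special case of calibration. The sketch below is not Renssen’s procedure itself: the single categorical variable `x`, the weight field `w` and the choice of common targets (a plain average of the two survey estimates) are assumptions made for this example.

```python
from collections import defaultdict

def weighted_totals(units, var):
    """Weighted total of each category of `var` (weights stored under 'w')."""
    totals = defaultdict(float)
    for u in units:
        totals[u[var]] += u["w"]
    return dict(totals)

def poststratify(units, var, targets):
    """Rescale the weights within each category so that the weighted
    category totals reproduce `targets` exactly."""
    current = weighted_totals(units, var)
    return [{**u, "w": u["w"] * targets[u[var]] / current[u[var]]} for u in units]

# Hypothetical files with design weights and one common variable x.
file_a = [{"x": "low", "w": 100.0}, {"x": "low", "w": 140.0}, {"x": "high", "w": 60.0}]
file_b = [{"x": "low", "w": 180.0}, {"x": "high", "w": 50.0}, {"x": "high", "w": 50.0}]

tot_a = weighted_totals(file_a, "x")
tot_b = weighted_totals(file_b, "x")
# Assumed common targets: the plain average of the two estimates.
targets = {k: 0.5 * (tot_a[k] + tot_b[k]) for k in tot_a}

cal_a = poststratify(file_a, "x", targets)
cal_b = poststratify(file_b, "x", targets)
```

After this step both files give the same estimated totals for the common variable, which is the consistency property that the calibration steps aim for. With several common variables or continuous auxiliaries, a general calibration estimator would be needed, and, as noted in Table 6, it may fail to converge or produce negative weights.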

Recently, [108] dealt with the problem of SM for complex sample surveys in a nonparametric setting. They proposed using an iterative proportional fitting algorithm to estimate the distribution function of variables that are not jointly observed, and demonstrated how to assess its reliability.
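For readers unfamiliar with iterative proportional fitting (IPF), the following is a minimal sketch of the classical two-way version of the algorithm. It is a generic illustration and not the estimator of [108]; the function name, the seed table and the target margins are assumptions made for this example.

```python
def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Classical two-way IPF: rescale rows and columns of `seed` in turn
    until the table reproduces both sets of target margins."""
    table = [row[:] for row in seed]
    for _ in range(max_iter):
        # Row step: match the row margins.
        for i, row in enumerate(table):
            s = sum(row)
            if s > 0:
                table[i] = [v * row_targets[i] / s for v in row]
        # Column step: match the column margins.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # Columns are exact after the column step, so check the rows.
        err = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if err < tol:
            break
    return table

# A uniform seed carries no association, so the fit is the independence table.
seed = [[1.0, 1.0], [1.0, 1.0]]
fitted = ipf(seed, row_targets=[0.6, 0.4], col_targets=[0.7, 0.3])
```

The association structure of the fitted table is inherited from the seed: only the margins are forced to agree with the targets, which is why the choice of starting table matters in applications.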

5.2 Practical situations within SM context

One situation in which SM may need to be applied is the Split Questionnaire Design (SQD). The aim of an SQD is to reduce the burden on respondents by shortening questionnaires. Since long questionnaires deter prospective respondents and cause high nonresponse rates, split questionnaires are used as an option to reduce the pressure on respondents [114, 115, 116]. Long questionnaires also decrease the quality of survey responses and, hence, lead to low accuracy [117]. An SQD can be used to increase the quality of survey responses [118]. [12] discussed an SQD in which a sample $m$ is selected randomly from the target population. The sample is then split into two independent random samples $m_{a}$ and $m_{b}$ such that $m_{a}\cup m_{b}=m$ and $m_{a}\cap m_{b}=\varnothing$. Accordingly, some common variables appear in both samples, while other variables appear exclusively in one of the two samples. This is a clear situation that requires SM. The SQD yields results close to those of a complete questionnaire, with the exception that power is reduced because the number of observations for each variable is reduced [114]. [119] proposed a matching method for an SQD based on ordinary least squares. They use the macro approach of SM under the CIA to estimate the joint distribution of all variables of interest, thus obtaining a model that is identifiable given the available data. In their method, three models are fitted to obtain the production model with all covariates in common. To validate the proposed method, they compared its performance with imputation methods such as MICE, CART and random forest. They found that their SM method had the best performance among the MI methods considered in terms of prediction error, and that it does not require a specific imputation model for the missing data.
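The split described above, together with a macro estimate under the CIA, can be sketched as follows. This is not the method of [119]: it uses the standard CIA identity for a single common variable in the linear case, $\sigma_{YZ}=\sigma_{XY}\sigma_{XZ}/\sigma_{X}^{2}$, and the simulated data, coefficients and variable names are assumptions made for this example.

```python
import random

def cov(u, v):
    """Sample covariance of two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

rng = random.Random(0)
n = 4000
ids = list(range(n))

# Simulated population sample m in which the CIA holds by construction:
# Y and Z depend on X only, with independent errors.
x = [rng.gauss(0.0, 1.0) for _ in ids]
y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
z = [-1.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Split m at random into two disjoint halves m_a and m_b.
rng.shuffle(ids)
m_a, m_b = ids[: n // 2], ids[n // 2 :]

# m_a answers the (X, Y) questionnaire, m_b the (X, Z) questionnaire.
xa, ya = [x[i] for i in m_a], [y[i] for i in m_a]
xb, zb = [x[i] for i in m_b], [z[i] for i in m_b]

# X is observed in both halves, so its variance uses the full sample.
var_x = cov(x, x)
cov_yz_cia = cov(xa, ya) * cov(xb, zb) / var_x
```

In this simulation the true covariance of $Y$ and $Z$ is $-2$, and the CIA-based estimate recovers it approximately without $Y$ and $Z$ ever being observed on the same unit. When the CIA does not hold, the same formula would be biased.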

Another very useful application of SM is combining data from sample surveys with census data. Sample surveys can include detailed data about specific issues that may not be available from the census. [120] used SM to combine census data with Demographic and Health Survey data, motivated by the increasing demand for integrating surveys with census data in order to provide richer, more detailed sources of data. SM methods can be used to connect records from survey and census data where RL is impractical due to confidentiality constraints. [120] developed a technique that uses an iterative proportional updating approach to construct a synthetic population of individuals and households that resembles the real one. SM is then used to impute survey data to individuals and households in the synthetic population using the nearest neighbour approach. To assess this process, 2011 Bangladesh census data were used to produce a district-specific synthetic population of individuals and households. The wealth index of each household in the synthetic population was then estimated and used to impute the closest available records from the 2011 Bangladesh Demographic and Health Survey. The findings show that the method proposed by [120] achieved more representative estimates than direct survey estimates, particularly in areas with small sample sizes and a small number of population units with different socio-demographic characteristics.
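The nearest neighbour step can be sketched as an unconstrained distance hot deck. The code below is an illustration only and not the algorithm of [120]; the squared Euclidean distance, the field names (`wealth`, `z`) and the toy records are assumptions made for this example.

```python
def nearest_neighbour_match(recipients, donors, match_vars, target):
    """Unconstrained nearest neighbour hot deck: each recipient receives
    the `target` value of the closest donor on `match_vars`."""
    def dist(r, d):
        # Squared Euclidean distance on the matching variables.
        return sum((r[v] - d[v]) ** 2 for v in match_vars)

    fused = []
    for r in recipients:
        donor = min(donors, key=lambda d: dist(r, d))
        fused.append({**r, target: donor[target]})
    return fused

# Hypothetical synthetic households (recipients) and survey records (donors).
synthetic = [{"hh": 1, "wealth": 0.10}, {"hh": 2, "wealth": 0.80}]
survey = [
    {"wealth": 0.15, "z": "A"},
    {"wealth": 0.75, "z": "B"},
    {"wealth": 0.50, "z": "C"},
]
fused = nearest_neighbour_match(synthetic, survey, ["wealth"], "z")
```

Because the matching is unconstrained, a donor may be used several times and some donors may never be used; constrained variants limit how often each donor is selected.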

5.3 Essential steps for applying SM in practice

Some considerations and preprocessing steps are required before utilizing SM approaches to integrate two or more data sources. Assuming that we have two data files A and B, the following steps are essential for SM:

Selection of the target variables $Y$ and $Z$, i.e. variables that are not jointly observed in the two datasets.

Data preparation and harmonization step.

Selection of the matching variables to be used in the matching process for the two separate sample surveys. Many common variables may exist in both files A and B. In practice, only the most relevant ones from this set of common variables, known as matching variables, are employed in the matching process. These variables should be chosen using appropriate statistical procedures, which may be descriptive or inferential, and with the help of subject matter experts. For instance, the simplest approach for identifying a suitable set of matching variables is to calculate measures of pairwise correlation/association between $Y$ and each of the available predictors $X$ in file A, and between $Z$ and each of the available predictors $X$ in file B. If the response variable is continuous, its correlation with the predictors can be examined. Spearman’s rank correlation coefficient can also be considered to detect potential monotone nonlinear relationships.

The matching framework should be determined with respect to the objective of SM (micro or macro) and the appropriate setting (parametric, nonparametric, mixed or Bayesian).

After deciding the SM framework using the previous steps, the suitable SM approach is performed to integrate the two separate datasets.

An appropriate quality assessment technique should be performed to evaluate the dataset after matching.
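The descriptive screening of matching variables mentioned in the steps above can be sketched with Spearman’s rank correlation. The helper names, the candidate variables `x1` and `x2` and the toy values are assumptions made for this example; in practice the screening would be repeated for $Z$ in file B and complemented by expert judgement.

```python
from math import sqrt

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(u, v):
    """Pearson correlation coefficient of two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def spearman(u, v):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(u), ranks(v))

# Hypothetical file A: target Y and two candidate common variables.
y = [1.0, 8.0, 27.0, 64.0, 125.0]
candidates = {"x1": [1.0, 2.0, 3.0, 4.0, 5.0], "x2": [2.0, 1.0, 2.0, 1.0, 2.0]}

scores = {name: abs(spearman(x, y)) for name, x in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here `x1` is related to `y` through a monotone but nonlinear function, so its Spearman coefficient equals one while its Pearson coefficient is smaller; `x2` shows no monotone association and would be a weak matching variable for $Y$.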

5.4 Statistical matching quality assessment

Having reviewed the above traditional and recent developments in SM, perhaps the most important question to ask is: how good is the resulting integrated data? [48] posed two questions about the quality of integrated data. First, how can we measure the quality of estimates based on statistically matched data in practice, i.e. when the complete data are unknown? Second, can we develop a theoretical model for the data, including relations between target variables, that enables us to measure the quality of integrated data?

Assessing the quality of the joint distribution of variables that have never been jointly observed is a non-trivial task [121]. [49] suggested relatively simple measures for assessing the quality of integrated datasets, based on comparing basic statistics (mean, standard deviation, etc.) in the donor and integrated datasets. [13] proposed a more elaborate way of evaluating integrated data quality, called ‘integration validity’. It is a multilevel framework for evaluating quality in an SM procedure, based on four levels of validity for a matching procedure:

The true but unknown values of $Z$ for the recipient units are reproduced.

The true but unknown joint distribution of $(X,Y,Z)$ is preserved in the integrated dataset.

The covariance structure $cov(X,Y,Z)$ is reflected in the integrated dataset, and the marginal distributions $f_{XY}$ and $f_{XZ}$ are preserved.

The marginal and joint distributions of the variables in the donor file are preserved in the integrated file.

The first and third levels of quality are not achievable in practice; the only way to validate these two levels is through simulation experiments. The second level may be validated using auxiliary information. In reality, it is the marginal and joint distributions in the integrated dataset that are obtained when using traditional approaches. This is a minimum criterion for assessing the validity of an SM technique. It does not, however, imply the validity of the estimates of the joint distribution of the variables observed separately in the two datasets.

In practice, the most commonly used quality assessment technique for data resulting from SM is the one suggested by the German Association of Media Analysis [13], which involves:

Comparing the empirical distribution of target variables included in the integrated file with the one in the recipient and the donor files,

Comparing the joint distributions $f_{X,Z}$ and $f_{X,Y}$ observed in the recipient and donor files with their corresponding joint distribution in the integrated file.
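The two comparisons above can be sketched by computing a distance between empirical distributions in the donor and integrated files. The total variation distance used here is an assumption made for this example and is not prescribed by [13]; the function names and the toy data are likewise hypothetical.

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of hashable values."""
    n = len(values)
    return {k: c / n for k, c in Counter(values).items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical (X, Z) pairs in the donor file and in the integrated file.
donor = [("F", "no"), ("F", "no"), ("M", "regular"), ("M", "no")]
fused = [("F", "no"), ("F", "regular"), ("M", "regular"), ("M", "no")]

# Marginal distribution of the imputed variable Z.
tv_marginal = total_variation(
    distribution([z for _, z in donor]),
    distribution([z for _, z in fused]),
)
# Joint distribution of (X, Z).
tv_joint = total_variation(distribution(donor), distribution(fused))
```

Values close to zero indicate that the integrated file reproduces the donor distributions. As stressed above, this is only a minimum criterion and says nothing about the joint distribution of $Y$ and $Z$.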

More work is still needed on assessing the quality of integrated data and on developing reliable indicators that can be used in this context.

6. Conclusion and future work

In this paper, we reviewed the SM methods available in the literature, whether based on the CIA, auxiliary information or uncertainty analysis, along with their drawbacks whenever applicable. Recent contributions and alternative techniques for SM have also been reviewed. Available SM procedures focus more on continuous variables than on categorical data. For future work, studies should focus more on categorical data, as it can be considered the most common type of data in most social surveys. An important result of [57] is that MI approaches are superior to traditional SM procedures. Consequently, recent MI approaches for categorical data can be employed in the context of SM. Latent class models have recently been used for MI, where they serve as a tool for estimating the density of categorical variables [122, 123]. An advantage of latent class models is that they can be used for datasets drawn from large-scale studies, with a large number of variables and complex relationship structures. On the other hand, assuming the CIA in cases where it does not hold can lead to misleading results. To address this problem, [9, 57] suggest choosing the common variables carefully, in a way that establishes conditional independence, so that inference about the unobserved association becomes valid. In this context, another advantage of latent class models is their conditional independence property: the scores on different items are independent of each other given the latent classes. A new approach for SM of categorical data, based on latent class models within a Bayesian framework, is currently being developed by the authors. Simulation studies will be performed to evaluate the performance of the proposed latent class model in SM, through an empirical comparison of several matching procedures on simulated data.

Acknowledgments

The authors are grateful to the reviewers for their valuable comments, which helped to improve the manuscript.

References

1. Yang, Kim. Statistical data integration in survey sampling: A review. arXiv preprint arXiv:2001.03259. 2020.

2. Thompson. Combining data from new and traditional sources in population surveys. International Statistical Review. 2019; 87: S79-S89.

3. Fellegi, Sunter. A theory for record linkage. Journal of the American Statistical Association. 1969; 64(328): 1183-1210.

4. Sayers, Ben-Shlomo, Blom, Steele. Probabilistic record linkage. International Journal of Epidemiology. 2016; 45(3): 954-964.

5. Murray. Probabilistic record linkage and deduplication after indexing, blocking, and filtering. Journal of Privacy and Confidentiality. 2015; 7.

6. Hof, Ravelli, Zwinderman. A probabilistic record linkage model for survival data. Journal of the American Statistical Association. 2017; 112(520): 1504-1515. Available from: https://doi.org/10.1080/01621459.2017.1311262.

7. Gessendorfer, Beste, Drechsler, Sakshaug, et al. Statistical matching as a supplement to record linkage: A valuable method to tackle non-consent bias. Journal of Official Statistics. 2018; 34(4): 909-933.

8. Rässler, Fleischer. Aspects concerning data fusion techniques. In: International Workshop on Household Survey Nonresponse. DEU; 1998; 4: p. 317-333.

9. Rässler, Fleischer. An evaluation of data fusion techniques. In: Proceedings of Statistics Canada Symposium 99 on Combining Data from Different Sources; 1999. p. 129-136.

10. D’Orazio, Di Zio, Scanu. Statistical matching and official statistics. Rivista di Statistica Ufficiale. 2002.

11. D’Orazio, Di Zio, Scanu. Statistical matching: Theory and practice. John Wiley & Sons; 2006.

12. Kim, Berg, Park. Statistical matching using fractional imputation. arXiv preprint arXiv:1510.03782. 2016.

13. Rässler. Statistical matching: A frequentist theory, practical applications, and alternative Bayesian approaches. vol. 168. New York: Springer-Verlag; 2002.

14. Rubin. An overview of multiple imputation. In: Proceedings of the Survey Research Methods Section of the American Statistical Association. Citeseer; 1988. p. 79-84.

15. Little. Missing-data adjustments in large surveys. Journal of Business and Economic Statistics. 1988; 6(3): 287-296.

16. Song, Shepperd. Missing data imputation techniques. International Journal of Business Intelligence and Data Mining. 2007; 2(3): 261-291.

17. Rubin. Basic ideas of multiple imputation for nonresponse. Survey Methodology. 1986; 12(1): 37-47.

18. Rubin. Multiple imputations in sample surveys – a phenomenological Bayesian approach to nonresponse; 1978.

19. Rubin. Multiple imputation for nonresponse in surveys. vol. 81. John Wiley & Sons; 2004.

20. Soley-Bori. Dealing with missing data: Key assumptions and methods for applied analysis. Boston University. 2013; 4: 1-19.

21. van Ginkel, Linting, Rippe RCA, van der Voort. Rebutting existing misconceptions about multiple imputation as a method for handling missing data. Journal of Personality Assessment. 2019; 0(0): 1-12. PMID: 30657714. Available from: https://doi.org/10.1080/00223891.2018.1530680.

22. Little, Rubin. Statistical analysis with missing data. vol. 793. John Wiley & Sons; 1987.

23. Rubin. Multiple imputation for nonresponse in surveys. vol. 81. John Wiley & Sons; 1987.

24. Rendall, Ghosh-Dastidar, Weden, Baker, Nazarov. Multiple imputation for combined-survey estimation with incomplete regressors in one but not both surveys. Sociological Methods and Research. 2013; 42(4): 483-530.

25. Okner. Constructing a new data base from existing microdata sets: The 1966 merge file. 1972; 325-342.

26. Sims. Comments. Annals of Economic and Social Measurement. 1972; 1(3): 343-345.

27. Sims. Rejoinder. Annals of Economic and Social Measurement. 1972; 1(3): 355-357.

28. Peck. Comments. Annals of Economic and Social Measurement. 1972; 1(3): 347-348.

29. Okner. Reply and comments. 1972; 359-362.

30. Okner. Data matching and merging: An overview. NBER; 1974. p. 347-352. Available from: http://www.nber.org/chapters/c10114.

31. Ruggles. A strategy for merging and matching microdata sets. In: Annals of Economic and Social Measurement, NBER. 1974; 3(2): p. 353-371.

32. Alter. Creation of a synthetic data set by linking records of the Canadian survey of consumer finances with the family expenditure survey. In: Annals of Economic and Social Measurement, NBER. 1974; 3(2): p. 373-397.

33. Sims. Comments. Annals of Economic and Social Measurement. 1974; 1(3): 395-397.

34. Moriarity, Scheuren. Statistical matching: A paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics. 2001; 17(3): 407.

35. Marella, Pfeffermann. Matching information from two independent informative samples. Journal of Statistical Planning and Inference. 2019; 203: 70-81.

36. Kim, Park. Statistical micro matching using a multinomial logistic regression model for categorical data. Communications for Statistical Applications and Methods. 2019; 26(5): 507-517.

37. Donatiello, D’Orazio, Frattarola, Rizzi, Scanu, Spaziani. The role of the conditional independence assumption in statistically matching income and consumption. Statistical Journal of the IAOS. 2016; 32(4): 667-675.

38. Cutillo, Scanu. A mixed approach for data fusion of HBS and SILC. Social Indicators Research. 2020; 1-27.

39. Doretti, Geneletti, Stanghellini. Missing data: A unified taxonomy guided by conditional independence. International Statistical Review. 2018; 86(2): 189-204.

40. Paass. Statistical match: Evaluation of existing procedures and improvements by using additional information. Microanalytic Simulation Models to Support Social and Financial Policy. 1986; 401-420.

41. Singh, Lemaître, Armstrong. Statistical matching using log linear imputation. Social Survey Methods Division, Statistics Canada; 1988.

42. Singh, Mantel, Kinack, Rowe. Statistical matching: Use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology. 1993; 19(1): 59-79.

43. Kadane. Some statistical problems in merging data files. Journal of Official Statistics. 2001; 17(3): 423-433.

44. Silverman. Density estimation for statistics and data analysis. London: Chapman and Hall; 1986.

45. Wand, Jones. Kernel smoothing. Monographs on Statistics and Applied Probability. 1995.

46. Härdle, Linton. Applied nonparametric methods. vol. 9203. Center for Economic Research, Tilburg University; 1992.

47. Joenssen, Bankhofer. Hot deck methods for imputing missing data. In: International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer; 2012. p. 63-75.

48. De Waal. Statistical matching: Experimental results and future research questions. Statistics Netherlands; 2015.

49. Rodgers. An evaluation of statistical matching. Journal of Business and Economic Statistics. 1984; 2: 91-102.

50. Singh, Mantel, Kinack, Rowe. On methods of statistical matching with and without auxiliary information. Technical Report; 1990.

51. Moriarity, Scheuren. A note on Rubin’s statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics. 2003; 21(1): 65-73.

52. Rubin. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics. 1986; 4(1): 87-94. Available from: https://www-jstor-org.web.bisu.edu.cn/stable/1391390.

53. Rubin, Stern, Vehovar. Handling don’t know survey responses: The case of the Slovenian plebiscite. Journal of the American Statistical Association. 1995; 90(431): 822-828.

54. Van Buuren, Oudshoorn. Flexible multivariate imputation by MICE. Leiden: TNO; 1999.

55. Shah, Bartlett, Carpenter, Nicholas, Hemingway. Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study. American Journal of Epidemiology. 2014; 179(6): 764-774.

56. Rässler. A non-iterative Bayesian approach to statistical matching. Statistica Neerlandica. 2003; 57: 58-74.

57. Rässler. Data fusion: Identification problems, validity, and multiple imputation. Austrian Journal of Statistics. 2004; 33(1-2): 153-171.

58. Gasser, Müller. Kernel estimation of regression functions. In: Smoothing techniques for curve estimation. Springer; 1979. p. 23-68.

59. Andrews. Nonparametric kernel estimation for semiparametric models. Econometric Theory. 1995; 560-596.

60. Geenens, Charpentier, Paindaveine, et al. Probit transformation for nonparametric kernel estimation of the copula density. Bernoulli. 2017; 23(3): 1848-1873.

61. Cheng, Chu. Kernel estimation of distribution functions and quantiles with missing data. Statistica Sinica. 1996; 63-78.

62. Vantaggi. Statistical matching of multiple sources: A look through coherence. International Journal of Approximate Reasoning. 2008; 49(3): 701-711.

63. D’Orazio, Di Zio, Scanu. The use of uncertainty to choose matching variables in statistical matching. International Journal of Approximate Reasoning. 2017; 90: 433-440.

64. Molinari. Partial identification of probability distributions with misclassified data. Journal of Econometrics. 2008; 144(1): 81-117.

65. Ahfock, Pyne, Lee, McLachlan. Partial identification in the statistical matching problem. Computational Statistics and Data Analysis. 2016; 104: 79-90.

66. Fan, Park. Sharp bounds on the distribution of treatment effects and their statistical inference. Econometric Theory. 2010; 931-951.

67. Komarova, Nekipelov, Yakovlev. Identification, data combination, and the risk of disclosure. Quantitative Economics. 2018; 9(1): 395-440.

68. Manski. Identification problems in the social sciences. 1995.

69. Conti, Marella, Scanu. Uncertainty analysis in statistical matching. Journal of Official Statistics. 2012; 28(1): 69-88.

70. Endres, Fink, Augustin. Imprecise imputation: A nonparametric micro approach reflecting the natural uncertainty of statistical matching with categorical data. Journal of Official Statistics. 2019; 35(3): 599-624.

71. D’Orazio, Di Zio, Scanu. Statistical matching and the likelihood principle: Uncertainty and logical constraints. ISTAT Technical Report; 2004.

72. D’Orazio, Di Zio, Scanu. Uncertainty intervals for nonidentifiable parameters in statistical matching. Proceedings of the 57th Session of the International Statistical Institute. 2009.

73. D’Orazio. Statistical matching and imputation of survey data with StatMatch. Italian National Institute of Statistics: Rome, Italy. 2017.

74. D’Orazio, Di Zio, Scanu. Statistical matching for categorical data: Displaying uncertainty and using logical constraints. Journal of Official Statistics. 2005; 22(1): 137.

75. Zhang, Chambers. An overview on uncertainty and estimation in statistical matching. Analysis of Integrated Data. 2019; 73.

76. Conti, Marella, Scanu. How far from identifiability? A systematic overview of the statistical matching problem in a non parametric framework. Communications in Statistics-Theory and Methods. 2017; 46(2): 967-994.

77. Conti, Marella, Scanu. Uncertainty analysis for statistical matching of ordered categorical variables. Computational Statistics and Data Analysis. 2013; 68: 311-325.

78. Ridder, Moffitt. The econometrics of data combination. Handbook of Econometrics. North Holland, Amsterdam; 2007.

79. Rässler, Kiesl. How useful are uncertainty bounds? Some recent theory with an application to Rubin’s causal model. Proceedings of the 57th Session of the International Statistical Institute. 2009.

80. D’Orazio, Di Zio, Scanu. Auxiliary variable selection in a statistical matching problem. In: Zhang L-C, Chambers R. Analysis of integrated data. CRC/Chapman and Hall; 2019. p. 101-120.

81. Zhang. On proxy variables and categorical data fusion. Journal of Official Statistics. 2015; 31(4): 783-807.

82. Conti, Marella, Scanu. An overview on uncertainty and estimation in statistical matching. In: Zhang L-C, Chambers R. Analysis of integrated data. CRC/Chapman and Hall; 2019.

83. Conti, Marella, Vicard, Vitale. Multivariate statistical matching using graphical modeling. International Journal of Approximate Reasoning. 2021; 130: 150-169.

84. Rubin, Thomas. Matching using estimated propensity scores: Relating theory to practice. Biometrics. 1996; 249-264.

85. Rubin, Thomas. Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association. 2000; 95(450): 573-585.

86. Kum, Masterson. Statistical matching using propensity scores: Theory and application to the Levy Institute measure of economic well-being. 2008.

87. Kum, Masterson. Statistical matching using propensity scores: Theory and application to the analysis of the distribution of income and wealth. Journal of Economic and Social Measurement. 2010; 35(3-4): 177-196.

88. Bestehorn, Kirches. A deterministic balancing score algorithm to avoid common pitfalls of propensity score matching. arXiv preprint arXiv:1803.02704. 2018.

89. King, Nielsen. Why propensity scores should not be used for matching. Political Analysis. 2019; 27(4): 435-454.

90. Kim. Parametric fractional imputation for missing data analysis. Biometrika. 2011; 98(1): 119-132.

91. Yang, Kim, et al. Fractional imputation in survey sampling: A comparative review. Statistical Science. 2016; 31(3): 415-432.

92. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks. 1999; 10(5): 988-999.

93. Breiman, et al. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science. 2001; 16(3): 199-231.

94. Hastie, Tibshirani, Friedman. The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media; 2009.

95. James, Witten, Hastie, Tibshirani. An introduction to statistical learning. vol. 112. Springer; 2013.

96. D’Orazio. A two step non parametric procedure for statistical matching. In: 8th Scientific meeting of the CLAssification and Data Analysis Group of the Italian Statistical Society (CLADAG 2011); 2011. p. 7-9.

97. D’Orazio. Statistical learning in official statistics: The case of statistical matching. Statistical Journal of the IAOS. 2019; 35(3): 435-441.

98. Baioletti, Capotorti. A L1 based probabilistic merging algorithm and its application to statistical matching. Applied Intelligence. 2019; 49(1): 112-124.

99. Baioletti, Capotorti. An efficient probabilistic merging procedure applied to statistical matching. In: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer; 2017. p. 65-74.

100. Capotorti. A further empirical study on the Over-Performance of estimate correction in statistical matching. In: International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. Springer; 2012. p. 124-133.

101. Baioletti, Capotorti. Efficient L1-based probability assessments correction: Algorithms and applications to belief merging and revision. In: Proceedings of the 9th International Symposium on Imprecise Probability: Theories and Applications (ISIPTA 2015), Pescara (IT); 2015. p. 37-46.

102. Capotorti, Vantaggi. Incoherence correction strategies in statistical matching. In: Proceedings of ISIPTA. Citeseer; 2011. p. 109-118.

103. Endres, Augustin. Statistical matching of discrete data by Bayesian networks. In: Conference on Probabilistic Graphical Models. PMLR; 2016. p. 159-170.

104. Endres, Augustin. Utilizing log-linear Markov networks to integrate categorical data files. 2019.

105. Lumley. Complex surveys: A guide to analysis using R. John Wiley & Sons. 2011; 565.

106. Renssen. Use of statistical matching techniques in calibration estimation. Survey Methodology. 1998; 24: 171-184.

107. Wu. Combining information from multiple surveys through the empirical likelihood method. Canadian Journal of Statistics. 2004; 32(1): 15-26.

108. Conti, Marella, Scanu. Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association. 2016; 111(516): 1715-1725.

109. D’Orazio, Di Zio, Scanu. Statistical matching of data from complex sample surveys. In: Proceedings of the European Conference on Quality in Official Statistics-Q2012. 2012; 29.

110.

D’Orazio

Di Zio

Scanu

. Old and new approaches in statistical matching when samples are drawn with complex survey designs. Proceedings of the 45th “Riunione Scientifica della Societa Italiana di Statistica”, Padova. 2010; 16-18.

111.

Särndal

Lundström

. Estimation in surveys with nonresponse. John Wiley & Sons; 2005.

112.

Zieschang

. Sample weighting methods and estimation of totals in the consumer expenditure survey. Journal of the American Statistical Association. 1990; 85(412): 986-1001.

113.

Renssen

Nieuwenbroek

. Aligning estimates for common variables in two or more sample surveys. Journal of the American Statistical Association. 1997; 92(437): 368-374.

114.

Raghunathan

Grizzle

. A split questionnaire survey design. Journal of the American Statistical Association. 1995; 90(429): 54-63.

115.

Chipperfield

Steel

. Design and estimation for split questionnaire surveys. 2009.

116.

Kamgar

Navvabpour

. An efficient method for estimating population parameters using split questionnaire design. Journal of Statistical Research of Iran JSRI. 2017; 14(1): 77-99.

117.

Peytchev

Peytcheva

. Reduction of measurement error due to survey length: Evaluation of the split questionnaire design approach. in: Survey Research Methods. 2017; 11: p. 361-368.

118.

Stuart

. A computationally efficient method for selecting a split questionnaire design. Communications in Statistics-Simulation and Computation. 2019; 1-23.

119.

Ali

Kauermann

. A split questionnaire survey design in the context of statistical matching. Statistical Methods and Applications. 2021; 1-18.

120.

Namazi-Rad

Tanton

Steel

Mokhtarian

Das

. An unconstrained statistical matching algorithm for combining individual and household level geo-specific census and survey data. Computers, Environment and Urban Systems. 2017; 63: 3-14.

121.

Barr

Turner

. Quality issues and evidence in statistical file merging. Data Quality Control: Theory and Pragmatics. 1990; 112: 245.

122.

van der Palm

van der Ark

Vermunt

. A comparison of incomplete data methods for categorical data. Statistical Methods in Medical Research. 2016; 25(2): 754-774. PMID: 23166159. Available from: https://doi.org/10.1177/0962280212465502.

123.

Vidotto

Vermunt

Kaptein

. Multiple imputation of missing categorical data using latent class models: State of the art. Psychological Test and Assessment Modeling. 2015; 57(4): 542.

File A
Unit	Weight $w_{A}$	Gender $X_{1}^{A}$	Age $X_{2}^{A}$	Profit $Y$
A1	3	1	42	9.156
A2	3	1	35	9.149
A3	3	0	63	9.287
A4	3	1	55	9.512
A5	3	0	28	8.484
A6	3	0	53	8.891
A7	3	0	22	8.425
A8	3	1	25	8.867
File B
Unit	Weight $w_{B}$	Gender $X_{1}^{B}$	Age $X_{2}^{B}$	Income $Z$
B1	4	0	33	6.932
B2	4	1	52	5.524
B3	4	1	28	4.224
B4	4	0	59	6.147
B5	4	1	41	7.243
B6	4	0	45	3.230
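
As a minimal sketch of how two such files could be combined, the snippet below applies an unconstrained nearest-neighbour distance hot deck to the values in the two tables, with File A as recipient and File B as donor. Several choices here are assumptions made for illustration and are not stated alongside the tables: the use of this particular method, gender as the donation class, absolute age difference as the distance, and the helper name `nearest_donor`. The survey weights are carried along but not used, as is usual for unconstrained matching.

```python
# File A (recipient): unit, weight w_A, gender X1, age X2, profit Y
file_a = [
    ("A1", 3, 1, 42, 9.156),
    ("A2", 3, 1, 35, 9.149),
    ("A3", 3, 0, 63, 9.287),
    ("A4", 3, 1, 55, 9.512),
    ("A5", 3, 0, 28, 8.484),
    ("A6", 3, 0, 53, 8.891),
    ("A7", 3, 0, 22, 8.425),
    ("A8", 3, 1, 25, 8.867),
]

# File B (donor): unit, weight w_B, gender X1, age X2, income Z
file_b = [
    ("B1", 4, 0, 33, 6.932),
    ("B2", 4, 1, 52, 5.524),
    ("B3", 4, 1, 28, 4.224),
    ("B4", 4, 0, 59, 6.147),
    ("B5", 4, 1, 41, 7.243),
    ("B6", 4, 0, 45, 3.230),
]


def nearest_donor(recipient, donors):
    """Return the donor with the same gender and the closest age.

    Hypothetical helper: gender as donation class and |age difference|
    as distance are illustrative assumptions. Ties go to the donor
    that appears first in the file.
    """
    _, _, gender, age, _ = recipient
    same_class = [d for d in donors if d[2] == gender]
    return min(same_class, key=lambda d: abs(d[3] - age))


# Synthetic file: each A record keeps its observed Y and receives the
# Z value of its nearest donor from B.
matched = []
for rec in file_a:
    donor = nearest_donor(rec, file_b)
    matched.append((rec[0], donor[0], rec[4], donor[4]))

for unit_a, unit_b, y, z in matched:
    print(f"{unit_a} <- {unit_b}: Y={y:.3f}, imputed Z={z:.3f}")
```

Because the matching is unconstrained, a donor may be used more than once (here B5, B4 and B1 each donate twice, while B6 is never used). Both files' weights sum to 24, so a constrained or weight-preserving variant would be possible, but it would need a different algorithm from the one sketched here.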