Abstract
Matching similar records from different databases to prevent duplication in a private manner has attracted plenty of attention, which is referred to as Private Entity Matching (PEM). In spite of various approaches having been proposed to solve this problem, private linking numerical data such as integer (e.g. age), floating point (e.g. body mass index) from multiple databases is an urgent gap, which is commonly required in health domain, statistical departments and more. Hence, this paper targets at solving the problem of linking numerical data from three or more sources in an efficient and secure way. Firstly, we introduce a novel homomorphic encryption method constrained similar modul, which provides strong privacy to encrypt numerical data in the range of real numbers. Then, to avoid frequent decryptions in the homomorphic encryption schema, we draw an inference about the encryption keys. Finally, an accelerated algorithm is proposed to reduce the complexity of multi-party numerical records matching. Our approach is considered absolute safety that no party learns any sensitive information of the others in the absence of collusion. Experiments on two real-world health information databases of patient records validate our approach with regards to the efficiency improvement and at the same time, at no the sacrifice of linkage quality.
Introduction
In the era of big data, large amounts of data have been generated, collected and stored in several domains. Matching records related to the same entities [1] allows data from different databases to be integrated, which has drawn considerable attention in many areas, including healthcare, businesses, government agencies and research projects. However, in many cases, records have to be linked by private information, for example, in linking medical data of the same patients across different hospitals or in collecting the credit history of users from several sources. Due to privacy and confidentiality concerns often these organizations are not willing or allowed to reveal the sensitive or personal data to other database owners, or to any external party. Consequently, these data need be merged, using a private entity matching protocol.
Private Entity Matching (PEM) [2,3, 2,3] is to find records from two or more data sources that refer to the same or similar individuals, without revealing other information besides the matched records. It has been widely studied in many fields, for instance, an integration of medical information from different hospitals, which integrates similar medical records to reduce redundant data and generate a basic public medical library. The library containing a large number of medical records plays an important role in improving the quality of healthcare services. However, medical data [4,5, 4,5] is highly sensitive, including both medical and personal information. The privacy of sensitive data has to be considered.
There have been numerous works on PEM [6,7, 6,7], but some issues are still unsolved. One of the issues is that most of the approaches are limited to linking data from two sources [8,9, 8,9], however, linking data from more sources has become commonly required in real world. The other issue is that majority of the existing solutions can only handle records made up of strings and categorical values [10,11, 10,11], as to the numerical data, there are few approaches [12]. Actually, for many applications, the data used in entity matching includes numerical information, such as data in statistical department, homeland security, biomedical research and more. For example, Table 1 and Table 2 respectively represent the medical information of patients in database A and database B, Integrating similar medical information from different databases to reduce redundant data is necessary. As shown in database A and database B, patients A44884 and B11301 is similar. Thus, to utilize these data effectively, it is urgent to propose a method to match numerical data in privacy. Some existing methods simply treat numerical data as strings, which would induce the loss of precision. As an example, although 99 and 100 are close numerical values, the same values are found to be distant if they are cast as strings. Recently, an approach has been proposed for numerical data [12], which encodes the numerical data by Bloom Filter (BF). However, BF may be vulnerable to frequency based attacks [13] and has a certain rate of errors [14,15, 14,15], which leads to information leakage and false positives.
Patients in database A
Patients in database A
Patients in database B
To deal with the drawbacks above, we propose an approach specially designed for numerical records, which links numerical records from multiple sources in an efficient and secure way.
We introduce a novel encryption method constrained similar modul to encrypt the numerical records in real numbers, which provides stronger security and reduces the encryption time. We also draw an inference about the encryption keys to avoid frequent decryptions in multi-party matching approach. We propose an accelerated algorithm to reduce the complexity of calculating the similarity among multiple numerical records. We also prove that our accelerated algorithm does not compromise linkage quality. When the number of participants is large, the accelerated algorithm shows a good performance in efficiency. We analyze our approach and conduct an empirical study on two large real-world medical information databases. Empirical study results manifest that our solution outperforms previous techniques in privacy, efficiency and at no sacrifice of linkage quality.
The remainder of this paper is organized as follows. In the following section, we mention some previous works related to ours. We introduce definitions and background in Section 3. We describe our approach in Section 4 and analyze our approach in Section 5. In Section 6, we report our empirical study. Finally we summarize our findings in Section 7.
Various techniques have been proposed to tackle the problem of PEM. However, most of them consider linking two databases only, and few approaches have been proposed for PEM on multiple databases. The first approach to PEM [16] linked multiple databases by comparing the hash encoded values from all data sources by using a third party. However, this approach only performs exact matching. A Secure Multi-party Computation (SMC)-based approach using an oblivious transfer protocol was proposed by O’Keefe et al. [17] for PEM on multiple databases. While provably secure, the approach can only perform exact matching of masked values. Kantarcioglu et al. [18] introduced a multi-party approach to perform secure equi-join on k-anonymous databases only for exact matching. Recently, a multi-party PEM approach for categorical values based on k-anonymity and game-theoretic concepts was proposed by Mohammed et al. [11]. Another efficient multi-party approach for categorical data was recently proposed by Karapiperis et al. [19] using a Count-Min sketch data structure. Sketches are used to summarize the local set of elements which provide a global synopsis using homomorphic operations, secure summation, and symmetric noise addition privacy techniques. However, these two approaches are limited to dealing with categorical data. Lai et al. [20] developed an efficient multi-party PEM approach using Bloom filters. However, the approach performs only exact matching. This approach has been adapted by Vatsalan and Christen for approximate matching in multi-party PEM. The approximate privacy-preserving comparison functions for calculating similarities of multiple records has only recently been considered by Vatsalan and Christen [21] using the Dice coefficient similarity. Then, Vatsalan and Christen proposed the use of Counting Bloom Filter (CBF), which is a variation of BF, to improve the approach in [22].
There have been a few approaches designed for strings and categorical data, but few work has so far focused on numerical data. Vatsalan and Christen proposed an efficient BF-based masking approach for numerical data [12] and developed a solution for the privacy-preserving similar patient matching problem. However, BF is vulnerable to frequency-based attacks and has a certain rate of errors, which induces information leak and false positives. As we described next, our approach is absolute safety without information leak, and improves the efficiency at no sacrifice of linkage quality.
Definitions and preliminaries
In this section, we define the problem and introduce the related technologies in this paper.
Problem formulation
For example, there are two records r1 and r2 to be matched in the attribute of first name, the values of attribute first name are sara and sarah. Firstly, to protect the privacy of records, we encode two values by Bloom filter encode, BF(sara) = 10100011000100, BF(sarah) = 10100011100101. Then, we calculate the similarity of two encoded values, Dice_sim = 2 × 5/(7 + 5) = 0.83. By comparing with the threshold, we could decide wether two records in attribute first name is match or not.
Our PEM approach is for multi-party numerical records matching, which can combine with other PEM approaches to match the records made up of numerical, strings and categorical values. The definition of numerical record is as follows:
Similar modul
m is a real number; p is a positive integer.
Similar modul is one of homomorphic encryption schemes, which can perform the homomorphic operators addition, subtraction, multiplication and division in real number. More formally, let Enc
kpub
and Dec
kpriv
be the encryption and decryption functions with keys kpub and kpriv, m1 and m2 be plain texts, c(m1) and c(m2) be cipher texts such that c(m1) = Enc
kpub
(m1), c(m2) = Enc
kpub
(m2). Thus, homomorphic operations can be expressed by operator “±” as follows:
Numerical data is often encountered in PEM, such as age, year, price, or blood pressure. In this paper we use the absolute difference approach to calculate the similarity between the numerical values, the similarity between two numerical data n1, n2 is calculated as follows [12]:
In this part, we provide an overview of our solution, which is composed of three main phases: 1) Data Preparation, 2) Data Encryption and 3) Private Multi-Party Numerical Records Matching. We then describe each phase in detail through Section 4.1 to 4.3 and take an example to illustrate our solution in Section 4.4. The framework is presented in Figure 1.

The framework of our solution.
In the data preparation phase, without loss of generality, the parties agree on the parameters and generate encryption keys to encrypt numerical data. The symbols used in our approach are shown in Table 3.
Table of frequent symbols
Table of frequent symbols
To calculate the similarity among P records respectively from P parties, each party needs to encrypt its data with encryption keys first, then releases the encrypted data to LU to perform multi-party private entity matching.
In this phase, we introduce a lower cost homomorphic encryption scheme similar modul to encrypt numerical data in the range of real numbers. The process of encryption and decryption are described as follows:
Specifically, as discussed by Domingo-Ferrer and Herrer-Joancomart in [25], the homomorphic encryption scheme similar modul has two important properties, which provides strong privacy guarantees.
After encrypting its own data, each party sends the cipher texts C1, C2,..., C P to LU to determine whether (r1, r2,..., r P ) is match or not. Figure 2 shows the whole process of matching multi-party numerical records in privacy, from which we know that our approach is absolutely safe without any information leakage.

The process of private multi-party numerical records matching.
As to the multi-party private matching, we regard P records as a match if any two records are matched, which incurs high computation costs. Hence, in this section, we propose an accelerated algorithm for private multi-party numerical records matching approach, which can compute the similarity among P records by the cipher texts efficiently. The main idea is to obtain the minimum and the maximum values in each attribute, then we only perform similarity calculation on them instead of any two values, we can directly determine whether P records is match or not. Our accelerated algorithm is proposed based on the following proposition:
If a >b, sim(a, b) = 1 - (a - b)/d max = 1 - ((a/b) - 1)/d max ; (proposed in Eq. (4))
When a = a nmax , b = a nmin , sim(a, b) get minimum value which is equal with sim (a nmin , a nmax );
Thus, sim(a, b) > θ n ;
Evidenced by the same token, when a <b or a = b, sim(a, b) > θ n .
Thus, we prove the Proposition 1.
As the proof shows, the Proposition 1 is correct. Hence, we can conclude that the approach introducing the accelerated algorithm has the same linkage quality as the naive approach in which any two encrypted records compare.
r i : A record from party i, 1 ≤ i ≤ P;
r i (a n ): The nth attribute in record r i , 1 ≤ n ≤ d;
h: A positive integer;
C i : The encryption of r i ;
Generatekey() = {p in , q in , rand in };
p in = p1n, q in = q1n, rand1n +h q1n;
C in = Enc(r i (a n )) = {(r i (a n ) + rand in ∗ p in ), q in ∗ p in };
return C i = {Ci1, Ci2,..., C id };
Then, our Decision Rules (DR) can be described as follows:
a nmin : The minimum value of the attribute a n ; a nmax : The maximum value of the attribute a n .
Only when sim(a nmin , a nmax ) in all attributes are above the threshold, we regard (r1, r2,..., r P ) as a match. Otherwise, (r1, r2,..., r P ) is non-match. Based on the DR, the similarity calculations can be terminated early, if the similarity of P records in one attribute has been calculated below the threshold.
By this token, the most important step is to obtain the minimum and the maximum values in each attribute. However the data which LU receives is encrypted. Thus, if our approach satisfies that if m1 > m2, thenc(m1) > c(m2), we can only perform on the cipher texts without frequent decryptions. To satisfy this condition, we assume: x1, x2 are the plain texts and the encryption of x1, x2 are E(x1) = smod{(x1 + rand1 ∗ p), p ∗ q}, E(x2) = smod{(x2 + rand2 ∗ p), p ∗ q} (as calculated in Eq. (5)). Then we infer that what relation between the rand1 and rand2, it satisfies: when x1 ≥ x2 ⇔ E(x1) ≥ E(x2).
(1) x1 ≥ x2 ⇒ E(x1) ≥ E(x2);
x1 - x2 ≥ (rand2 - rand1) ∗ p + (n1 - n2) ∗ p ∗ q, if x1 ≥ x2, (rand2 - rand1) ∗ p + (n1 -n2) ∗ p ∗ q ≥ 0.
Thus, when rand2 ≥ rand1 + (n2 - n1) ∗ q, it satisfies: x1 ≥ x2 ⇒ E(x1) ≥ E(x2).
(2) E(x1) ≥ E(x2) ⇒ x1 ≥ x2;
x1 + rand1 ∗ p - n1 ∗ p ∗ q ≥ x2 + rand2 ∗ p - n2 ∗ p ∗ q, rand2 - rand1 ≤ (x1 - x2)/p + (n2 - n1) ∗ q.
When x1 - x2 = 0,
(x1 - x2)/p + (n2 - n1) ∗ q get minimum value, rand2 - rand1 ≤ (n2 - n1) ∗ q,
Thus, when rand2 ≤ rand1 + (n2 - n1) ∗ q, it satisfies: E(x1) ≥ E(x2) ⇒ x1 ≥ x2.
Above all, when rand2 = rand1 + (n2 - n1) ∗ q, it satisfies: x1 ≥ x2 ⇔ E(x1) ≥ E(x2).
Thus, we conclude the Corollary 1 as follows:
(C1, C2,..., C p ): The encryption records from P parties;
The P encryption records (C1, C2,..., C p ) are match or non-match;
flag = true;
n = 1; i = 2; j = 3;
min = C1n > C2n ? C2n: C1n;
max = C1n ≤ C2n ? C2n: C1n;
tempmax = C in ≥ C jn ? C in : C jn ;
tempmin = C in < C jn ? C in : C jn ;
max = tempmax;
min = tempmin;
i += 2; j +=2;
C a = Enc(max) - Enc(min)
flag = false;
n ++;
return flag;
The encryption scheme in our approach is constrained similar modul, in which the cipher texts retain the inequality relation in the plain texts. Therefore, as shown in Algorithm 1, our approach generates encryption keys rand i = rand1 + h ∗ q. Remarkably, the constrained similar modul still has the two theorems described in Section 4.2 to guarantee the privacy of our approach.
The process of getting the maximum and the minimum values in each attribute has been given in Algorithm 2. Specially, to reduce complexity, our approach divides the encrypted numerical records into pairs, and performs the comparison in each pair. Then we select the larger one in each pair to compare to get the maximum value, and the same way to get the minimum value, which takes less time than other comparison methods. After obtaining the maximum and the minimum values in each attribute, our approach performs the homomorphic subtraction on them, C a = Enc(a nmax ) - Enc(a nmin ), 1 ≤ n ≤ d. Then, we pass the C a to P1 to decrypt and calculate the similarity by the Eq. (4). At last, we judge the relation between sim(a nmin , a nmax ) = 1 - Dec(C a )/d max and θ n . If the sim(a nmin , a nmax ) in all attributes are above the threshold, we regard (r1, r2,..., r P ) as a match. In the whole process of private entity matching, our approach only performs on the cipher texts without frequent decryptions, which reduces the time cost.
In this part, we take an example to illustrate our approach with the process of encrypting data and private matching multi-party numerical records.
Table 4 presents the four records selected from patient database which are used to perform private numerical records matching. In Table 5, we calculate the similarity among the records in [P96, P80, P26] and [P96, P26, P37] in each attribute and determine whether the records are match or not. As to the records in [P96, P80, P26] in attribute Blood pressure, firstly, we generate encryption keys and the parameters [p = 181, q = 71, rand1 = 23, rand2 = 94, rand3 = 236, d1 = 10, θ 1 = 0.95]. Then, we encrypt data with keys. C1 = Enc(69) = smod = 4232; C2 = Enc(66) = smod[(66 + 94 * 181), 181 * 71] = 4229; C3 = Enc(69) = smod[(69 + 236 * 181), 181 * 71] = 4232. C min = 4229, C max = 4232, C sub = C max - C min = 3. Sim(P96, P80, P26) = 1 - Dec(C sub )/10 = 0.7 in attribute Blood pressure. The same way to calculate the similarity of other attributes as shown in Table 5. At last, we can decide [P96, P80, P26] is non-match and [P96, P26, P37] is match. In the whole example, there is no frequent decryptions and comparisons, and the accelerated algorithm reduces the complexity at no sacrifice of linkage quality.
Patient database D
Patient database D
Similarity calculation
We now analyze our approach in terms of privacy, complexity and linkage quality.
P1: P1 receives the encrypted results of secure subtraction from LU. After decrypting the encrypted results with private key, the real results only show the final results without revealing the specific information.
P i (2 ≤ i ≤ P): Each party does not receive any messages regarding other participants. Thus, without collusion each part cannot infer any information.
LU: This party receives the encrypted data from P i (1 ≤ i ≤ P). However, only knows the encrypted data without decryption keys, LU can hardly infer the plain texts (The security is guaranteed by two properties in Section 4.2).
Experiment
To perform the experimental analysis, two real-world medical health information databases are used for this experimental study, which include Diabetes UCI database [28] and Medical Quality Improvement Consortium (MQIC) database [12]. We evaluate our approach in terms of complexity and linkage quality. All tests are conducted on a computer server with a 64-bit, 8.0G of RAM Intel Core (3.30 GHz) CPU.
Database information
The Pima Indians Diabetes database contains 768 patient records, which is downloaded from http://archive.ics.uci.edu/ml/databases/Pima+Indians+Diabetes. Each patient is represented by a 8 dimensional vector.
The Medical Quality Improvement Consortium (MQIC) database contains a random sample of approximately 6 million patient records, which is downloaded from https://github.com/Mathuv/PPSPM. Each patient is represented by a 7 dimensional vector.
These two databases contain only categorical and numerical attributes. The blocking keys
A summary of databases
A summary of databases
We use the following measures to evaluate the performance of our private multi-party numerical records matching technique in terms of complexity and linkage quality. Complexity is evaluated by the total runtime required for our approach. We utilize Precision, Recall and F-measure to evaluate the linkage quality. Based on the number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), Precision, Recall,and F-measure can be calculated as follows[29]:
Precision: The ratio of a correct match TP to all declared duplicates. It is calculated as:
We set the parameters in our approach as the range of threshold is [0.85, 1], d max = 5 for Age, d max = 1 for Body mass index, d max = 10 for Diastolic blood pressure, and d max = 1 for Diabetes pedigree function. The parameters in BF approach are set following the work in [14].
Performance evaluation

Runtime with different database sizes.

Runtime (Blocking with age) with different database sizes.
Next, we evaluate scalability with different numbers of parties in four approaches for Database size = 5,000. As shown in Figure 5, with the participants increasing, our approach shows a greater advantage in efficiency, which owes to our accelerated algorithm.

Runtime with different number of parties.

F-measure with different database sizes.

Linkage quality with different database sizes.

Linkage quality with different number of parties.

Linkage quality with different threshold in MQIC.

Linkage quality with different threshold in Diabetes.
Above all, we can conclude that our approach performs better than previous approaches in efficiency and privacy and at the same time, at no sacrifice of linkage quality.
In this paper, we study the problem of linking numerical records from three or more sources in privacy, which is commonly required in health domain, statistical departments, homeland security and more. We encrypt numerical data by a constrained similar modul to provide stronger privacy. The constrained similar modul takes a little encryption time and has no effect on the linkage quality, which is better than the previous privacy technique BF. Then, we draw an inference about the encryption keys. Based on these encryption keys, our approach can only perform on the cipher texts without frequent decryptions, which reduces the time cost. Next, we derive an accelerated algorithm in private multi-party PEM matching, which avoids any two records calculating the similarity and reduces the complexity without loss of linkage quality. Especially, when the number of participants is large, our approach shows a good performance in efficiency. Finally, as experiments show, our approach exhibits high performance both in efficiency and privacy and at the same time, at no sacrifice of linkage quality.
Footnotes
Acknowledgments
This work is supported by the National Key R&D Program of China (2018YFB1003404) and the National Natural Science Foundation of China (61672142, 61602103, 61472070).
