SeMBlock: A semantic-aware meta-blocking approach for entity resolution

Abstract

Entity resolution refers to the process of identifying, matching, and integrating records belonging to unique entities in a data set. However, a comprehensive comparison across all pairs of records leads to quadratic matching complexity. Therefore, blocking methods are used to group similar entities into small blocks before the matching. Available blocking methods typically do not consider semantic relationships among records. In this paper, we propose a Semantic-aware Meta-Blocking approach called SeMBlock. SeMBlock considers the semantic similarity of records by applying locality-sensitive hashing (LSH) based on word embedding to achieve fast and reliable blocking in a large-scale data environment. To improve the quality of the blocks created, SeMBlock builds a weighted graph of semantically similar records and prunes the graph edges. We extensively compare SeMBlock with 16 existing blocking methods, using three real-world data sets. The experimental results show that SeMBlock significantly outperforms all 16 methods with respect to two relevant measures, F-measure and pair quality. The F-measure and pair quality of SeMBlock are approximately 7% and 27% higher, respectively, than those of recently released blocking methods.

Keywords

Data matching, entity resolution, meta-blocking, word embedding, locality-sensitive hashing, semantic similarity, big data integration

1. Introduction

Integrating two or more datasets in the absence of a unique identifier is a challenging problem that causes redundancy of data and inaccurate knowledge extraction [1, 2, 3]. Entity resolution is used to identify, match and integrate the records of an entity in different datasets [4, 5, 6]. However, there are challenges in entity resolution such as computing similarities between all pairs of records in a large dataset, which is problematic because the number of comparisons grows quadratically with the size of the dataset. Even for a small dataset, calculating the total similarity matrix using costly similarity functions can be extremely difficult [7].

To meet these challenges, blocking in entity resolution is used to group the records into a set of blocks so that a block contains only similar entities [8, 9, 10, 11] and the entities in a block are more similar to each other than to entities in other blocks [12, 13]. A number of blocking methods [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28] have already been proposed which are exclusively based on the textual similarity of records while not taking semantic information into account.

In this paper, we propose a novel blocking method called SeMBlock (Semantic-aware Meta-Blocking approach) that applies locality-sensitive hashing (LSH) to BERT word embeddings in order to discover semantic relationships among pairs of records. Text analysis techniques [29, 30, 31, 32] such as the word embedding methods Word2Vec [33, 34] and BERT (Bidirectional Encoder Representations from Transformers) [35] map words to a new space in which semantically similar words are placed adjacent to each other. Word2Vec applies a neural network to describe each word in the text with a vector [33, 34], and BERT pre-trains deep bidirectional representations from unlabeled text [35]. One of the major differences between BERT and Word2Vec is that BERT generates word embeddings based on the contexts in which the words appear. As a result, it generates a different embedding for each occurrence of a word, while in Word2Vec each instance of the word has the same representation even if it occurs in different contexts.

Locality-sensitive hashing (LSH) is a popular method for nearest neighbor search in high-dimensional spaces. The basic idea is that the more similar two records are, the higher the probability that they are hashed into the same block. Because of its probabilistic nature, LSH is a fast blocking technique for extensive data sets. Leveraging LSH techniques can reduce the time complexity of generating blocks to O(n). Additionally, the use of LSH techniques in semantic similarity space significantly improves the quality of blocking by removing record pairs that are textually similar but semantically different [12]. SeMBlock generates a graph of semantically similar records and applies a pruning method in order to improve the quality of blocking. The main contributions of SeMBlock are:

• Creating fast and reliable blocking in a large-scale data environment that considers the context of the text.
• Constructing a semantic-aware blocking graph.
• Outperforming 16 state-of-the-art blocking methods on three real-world data sets.

SeMBlock has been proposed to take advantage of word embedding for extracting semantic similarity and locality-sensitive hashing in order to significantly reduce the time complexity of generating blocks and to improve the run-time and accuracy of entity resolution. Accurate and fast entity resolution has huge practical implications in almost all modern data management tasks such as information extraction [36], big data analysis [37], and knowledge base construction [38] and helps businesses in unifying their customer data and in improving their decision making.
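The LSH idea outlined above, that more similar vectors are more likely to be hashed into the same block, can be illustrated with a minimal sketch. It uses random-hyperplane hashing, a simple LSH family for cosine similarity chosen here only for illustration; it is not claimed to be the specific LSH scheme adopted by SeMBlock. The toy vectors, the key names, and the helper functions `make_hyperplanes`, `signature`, and `lsh_blocks` are all hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, hyperplanes):
    """One bit per hyperplane: the side of the hyperplane on which vec falls."""
    return tuple(
        1 if sum(h_k * v_k for h_k, v_k in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )

def lsh_blocks(embeddings, hyperplanes):
    """Group keys whose vectors share a signature into the same block.

    Each vector is hashed once, so the cost is linear in the number of vectors.
    """
    blocks = {}
    for key, vec in embeddings.items():
        blocks.setdefault(signature(vec, hyperplanes), set()).add(key)
    return blocks

# Hypothetical 3-dimensional stand-ins for embeddings (assumption: real
# embeddings have far more dimensions and come from a language model).
toy_embeddings = {
    "r1.title": [1.0, 0.5, 0.0],
    "r3.title": [2.0, 1.0, 0.0],    # same direction as r1.title
    "r2.title": [-1.0, -0.5, 0.0],  # opposite direction to r1.title
}
planes = make_hyperplanes(dim=3, n_bits=4, seed=42)
blocks = lsh_blocks(toy_embeddings, planes)
```

Two vectors pointing in the same direction always receive the same signature, while for two vectors separated by an angle $\theta$ each bit agrees with probability $1-\theta/\pi$, so adding bits makes collisions between dissimilar vectors rarer.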

The structure of this article is as follows: Section 2 overviews available blocking methods. Section 3 provides a formal description of the entity resolution problem. SeMBlock is described in detail in Section 4. The experimental results are described and discussed in Section 5. Section 6 provides summarizing conclusions and future work.

2. Background

Entity resolution is the process of identifying different entity records that represent the same real-world object. Performing entity resolution tasks over large data sets is computationally challenging due to the quadratic complexity $O(n^{2})$ of pairwise comparisons, as every entity record has to be compared with all others. To reduce this computational cost, blocking techniques [39, 10, 40] have been proposed. Blocking attempts to identify which entity pairs are likely to match, without knowledge of the matching function, so that comparisons can be limited to those pairs. In the worst case, blocking leads to a time complexity of $O(m^{2}*|B|)$, where $m$ is the size of the largest block and $|B|$ is the number of blocks [10, 12]. In the following, we discuss five categories of blocking approaches and point to state-of-the-art representatives of these categories.
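To make the two complexity figures concrete, the following sketch counts comparisons with and without blocking. The record count 1,295 is the size of the Cora dataset used later in the paper; the block layout (259 disjoint blocks of 5 records) is a made-up assumption for illustration, and the function names are hypothetical.

```python
def all_pairs(n):
    """Number of unordered record pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2

def blocked_pairs(block_sizes):
    """Comparisons needed when records are only compared within their block."""
    return sum(all_pairs(m) for m in block_sizes)

n_records = 1295                # size of the Cora dataset
block_sizes = [5] * 259         # assumed layout: 259 disjoint blocks of 5 records

exhaustive = all_pairs(n_records)                      # 837865
with_blocking = blocked_pairs(block_sizes)             # 2590
worst_case = max(block_sizes) ** 2 * len(block_sizes)  # m^2 * |B| = 6475
```

Under this assumed layout, blocking reduces 837,865 comparisons to 2,590, which stays below the worst-case bound $m^{2}*|B|=6475$.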

Figure 1. The four main steps of SeMBlock.

Figure 2. An illustration of the semantic-aware LSH method.

Traditional schema-based blocking techniques such as standard blocking [41, 18], sorted neighborhood [42], extended sorted neighborhood [19], q-grams blocking [43], extended q-grams blocking [19], MFIBlocks [44], canopy clustering [45, 46], extended canopy clustering [19], suffix arrays [47] and extended suffix arrays [19] generate blocks based on blocking keys, which depend on the schema of the dataset [48]. The main drawback of these methods is that selecting the features for the blocking keys is a laborious and error-prone process that requires domain expert knowledge [21, 49].

In contrast to schema-based blocking techniques, schema-agnostic blocking approaches such as token blocking [50, 51, 52], attribute-clustering blocking [22], and TYPiMatch [20] do not use any schema information [48]. They place each record in multiple blocks and thus create overlapping blocks, which decreases the probability of missing a match but increases the probability of placing non-matching records in the same block (high recall at the expense of low precision) [21].

Meta-blocking approaches such as WNP and CNP meta-blocking [24, 14], BLAST [21], BLOSS [17], supervised meta-blocking [15] and multi-core meta-blocking [53] restructure an existing set of blocks to keep only the most promising comparisons [14]. To do so, the set of blocks is represented by a weighted graph, called a blocking graph [14]. In this graph, each record is represented by a node, and an edge exists between two nodes if the corresponding records appear together in at least one block. The weight of each edge is calculated to reflect the matching probability of the two records. Then, a pruning algorithm is applied based on the weights of the edges. Eventually, each pair of nodes that is still connected by an edge forms a new block [54, 24].
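The generic meta-blocking procedure described above can be sketched as follows. This is a minimal illustration, not the algorithm of any particular cited method: the edge weight is simply the number of blocks two records share, and the pruning rule (keep edges whose weight is at least the mean edge weight) is one simple choice assumed here for the example. The block contents and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_blocking_graph(blocks):
    """Return {(record_a, record_b): weight}, weight = number of shared blocks."""
    weights = Counter()
    for members in blocks.values():
        # Every pair of records co-occurring in a block gets (or strengthens) an edge.
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return dict(weights)

def prune_by_mean_weight(weights):
    """Keep edges whose weight reaches the mean edge weight (assumed rule)."""
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {edge: w for edge, w in weights.items() if w >= threshold}

# Hypothetical overlapping blocks, e.g. as produced by a schema-agnostic method.
toy_blocks = {
    "b1": {"r1", "r2"},
    "b2": {"r1", "r2", "r3"},
    "b3": {"r2", "r3", "r4"},
}
graph = build_blocking_graph(toy_blocks)
kept = prune_by_mean_weight(graph)   # each surviving edge becomes a new block
```

In this toy input the five edges have a mean weight of 1.4, so only the two pairs that share two blocks, (r1, r2) and (r2, r3), survive pruning.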

3. Problem definition

At the core of entity resolution lies the concept of the entity record, which forms a unique set of attribute name-value pairs. An individual entity record is specified by $r_{i}$, where $i$ stands for its unique identity in a record collection $R=\{r_{1},\ldots,r_{n}\}$. Each record $r_{i}\in R$ contains a set of attributes denoted by $A_{i}=\{a^{i}_{1},\ldots,a^{i}_{m}\}$. Two different records $r_{i}$ and $r_{j}$ ($\{r_{i},r_{j}\}\subseteq R$ and $i\neq j$) are called duplicates if they represent the same real-world entity. Blocking is used for scaling entity resolution to extensive data collections. It groups similar records into a set of blocks $B=\{b_{1},\ldots,b_{n^{\prime}}\}$ so that for any three records $r_{i}\in b_{u}$, $r_{j}\in b_{u}$ and $r_{k}\in b_{v}$ with $u\neq v$, the probability that $r_{i}$ and $r_{j}$ refer to the same real-world entity is higher than the probability that $r_{i}$ and $r_{k}$ refer to the same real-world entity. The problem of blocking is to discover a grouping of a given set of records so that the mentioned criteria are fulfilled and the number of required record comparisons is kept as low as possible [48].

4. SeMBlock

SeMBlock comprises four sequential main steps, as overviewed in Fig. 1. These steps are as follows:

1. For each record $r_{i}\in R$, its attribute set $A_{i}$ is considered and BERT [35] is applied to each attribute $a^{i}_{i^{\prime}}\in A_{i}$. As a result, each record $r_{i}$ with $m$ attributes $\{a^{i}_{1},\ldots,a^{i}_{m}\}$ is represented by $m$ BERT embeddings $\textit{em}(a^{i}_{1}),\ldots,\textit{em}(a^{i}_{m})$, where

$\displaystyle\textit{em}(a^{i}_{i^{\prime}})=\textit{BERT}(a^{i}_{i^{\prime}})$ (1)

Table 1. Basic statistical information on the three chosen datasets

Dataset	Usage	Number of records	Number of attributes	Attributes	Number of matches
DBLP-ACM	Record linkage	2,616 $+$ 2,294	4	Title, authors, venue and year	2,220
DBLP-Scholar	Record linkage	2,616 $+$ 64,263	4	Title, authors, venue and year	5,347
Cora	Deduplication	1,295	12	Author, volume, title, institution, venue, address, publisher, year, pages, editor, note and month	17,184

2. To avoid the quadratic complexity of calculating the pairwise similarity of attribute embeddings, SeMBlock applies locality-sensitive hashing (LSH), which hashes similar attribute embeddings into the same block with high probability and creates a set of blocks $B=\{b_{1},\ldots,b_{n^{\prime}}\}$. SeMBlock uses the LSH scheme for angular distance proposed by Andoni et al. [55] to group the attribute embeddings of the records in the record collection $R$ by cosine similarity, so that similar attribute embeddings are placed into the same block $b_{l}$. This is expressed by

$\displaystyle B=\textit{LSH(EM)}$ (2)

where $\textit{EM}=\{\textit{em}(a^{1}_{1}),\ldots,\textit{em}(a^{n}_{m})\}$ and $n$ is the number of records, $m$ is the number of attributes and the size of EM is equal to $n$ multiplied by $m$ .

Example 1. Each of the records $r_{1}$ to $r_{4}$ in Fig. 2 contains 4 attributes (e.g., $r_{4}$ contains $a^{4}_{1}$, $a^{4}_{2}$, $a^{4}_{3}$ and $a^{4}_{4}$), and an embedding is obtained for each attribute using BERT. Then, all 16 attribute embeddings, $\textit{EM}=\{\textit{em}(a^{1}_{1}),\textit{em}(a^{1}_{2}),\textit{em}(a^{1}_{3}),\textit{em}(a^{1}_{4}),\ldots,\textit{em}(a^{4}_{1}),\textit{em}(a^{4}_{2}),\textit{em}(a^{4}_{3}),\textit{em}(a^{4}_{4})\}$, are given to the LSH, which places embeddings with high cosine similarity into the same blocks. As shown in Fig. 2, LSH constructs 5 blocks, each containing similar attribute embeddings. Finally, the records belonging to each block are derived from its embeddings (e.g., if $\textit{em}(a^{1}_{1})\in b_{1}$ and $\textit{em}(a^{3}_{1})\in b_{1}$, then $r_{1}\in b_{1}$ and $r_{3}\in b_{1}$).

After building the set of blocks $B$ , SeMBlock calculates the similarity between two records $r_{i}$ and $r_{j}$ as follows:

$\displaystyle\textit{Sim}(r_{i},r_{j})=|B_{r_{i}}\cap B_{r_{j}}|$ (3)

where $B_{r_{i}}$ and $B_{r_{j}}$ are the set of blocks associated with $r_{i}$ and $r_{j}$ , respectively, and $|B_{r_{i}}\cap B_{r_{j}}|$ corresponds to the number of blocks $r_{i}$ and $r_{j}$ have in common.

Example 2. For the records $r_{1}$ and $r_{2}$ in Fig. 2, $B_{r_{1}}=\{b_{1},b_{3},b_{4},b_{5}\}$ and $B_{r_{2}}=\{b_{2},b_{3},b_{4},b_{5}\}$ . Therefore, $|B_{r_{1}}\cap B_{r_{2}}|=|\{b_{3},b_{4},b_{5}\}|=3$ .

3. SeMBlock constructs a blocking graph in which each $r_{i}\in R$ is represented by a node and an edge $e_{ij}$ exists between two nodes $r_{i}$ and $r_{j}$ if the corresponding records $r_{i}$ and $r_{j}$ are in the same block (which is the case if $|B_{r_{i}}\cap B_{r_{j}}|\neq 0$ ). The weight of the edge $w_{ij}$ is equal to the similarity of two records $r_{i}$ and $r_{j}$ :

$\displaystyle w_{ij}=\textit{Sim}(r_{i},r_{j})$ (4)

4. SeMBlock prunes the graph in order to further increase accuracy. If the weight of an edge $e_{ij}$ (i.e., $w_{ij}$ ) is higher than some predefined integer value $\alpha$ , the edge will be kept; otherwise, it will be removed:

$\displaystyle\text{Pruning}(e_{ij})=\left\{\begin{array}{ll}\text{keep}(e_{ij}),&\text{if}\ w_{ij}>\alpha\\ \text{remove}(e_{ij}),&\text{otherwise}\end{array}\right.$ (5)
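Steps 2 to 4 can be traced on a small example. The sketch below starts from attribute-level blocks (as LSH would produce them), derives the block set $B_{r_{i}}$ of each record, computes $\textit{Sim}$ as in Eq. (3), builds the weighted graph of Eq. (4), and prunes it with Eq. (5). Only $B_{r_{1}}$, $B_{r_{2}}$ and the membership of $r_{1}$ and $r_{3}$ in $b_{1}$ follow Examples 1 and 2; the remaining block assignments for $r_{3}$ and $r_{4}$, as well as all function names, are assumptions made up for illustration.

```python
from itertools import combinations

# Attribute-level blocks: block id -> {(record id, attribute index)}.
# Assignments for r3 and r4 beyond a^3_1 in b1 are hypothetical.
ATTRIBUTE_BLOCKS = {
    "b1": {("r1", 1), ("r3", 1), ("r3", 4)},
    "b2": {("r2", 1), ("r4", 1), ("r4", 2), ("r4", 3)},
    "b3": {("r1", 2), ("r2", 2), ("r3", 2)},
    "b4": {("r1", 3), ("r2", 3), ("r3", 3)},
    "b5": {("r1", 4), ("r2", 4), ("r4", 4)},
}

def record_block_sets(attribute_blocks):
    """Map each record to B_r, the set of blocks holding any of its attributes."""
    block_sets = {}
    for block_id, members in attribute_blocks.items():
        for record_id, _attribute in members:
            block_sets.setdefault(record_id, set()).add(block_id)
    return block_sets

def similarity(block_sets, ri, rj):
    """Eq. (3): number of blocks the two records have in common."""
    return len(block_sets[ri] & block_sets[rj])

def blocking_graph(block_sets):
    """Eq. (4): weighted edges between records sharing at least one block.

    The all-pairs loop is fine for a toy input; a scalable implementation
    would enumerate pairs block by block instead.
    """
    graph = {}
    for ri, rj in combinations(sorted(block_sets), 2):
        weight = similarity(block_sets, ri, rj)
        if weight != 0:
            graph[(ri, rj)] = weight
    return graph

def prune(graph, alpha):
    """Eq. (5): keep an edge only if its weight is strictly greater than alpha."""
    return {edge: w for edge, w in graph.items() if w > alpha}

block_sets = record_block_sets(ATTRIBUTE_BLOCKS)
graph = blocking_graph(block_sets)
kept = prune(graph, alpha=2)
```

With these assumed blocks, $r_{1}$ and $r_{2}$ share three blocks as in Example 2, the graph has five edges, and pruning with $\alpha=2$ keeps only the edges $(r_{1},r_{2})$ and $(r_{1},r_{3})$, both with weight 3. Setting $\alpha=0$ keeps every edge, since each edge has a weight of at least 1.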

Figure 3. PC, PQ, and FM for SeMBlock for different values of $\alpha$. For these datasets, the best pruning threshold is $\alpha=2$.

5. Experimental results

This section is divided into three subsections, describing the three real-world data sets used for evaluation, the blocking evaluation measures used to evaluate SeMBlock, and the performance of SeMBlock in comparison to the 16 blocking methods.

5.1 Data sets

We use three real-world data sets: Cora [56], DBLP-ACM [57] and DBLP-Scholar [57]. Cora contains bibliographic records of machine learning papers, DBLP-ACM contains bibliographic data from DBLP and ACM, and DBLP-Scholar contains bibliographic data from DBLP and Google Scholar. Table 1 shows details of each data set. Published results for our baseline methods are only available for these three data sets, so we use them to ensure a fair comparison.

5.2 Evaluation measures

We use three common measures [10, 9, 18, 58, 59] to evaluate blocking quality: Pair Completeness (PC), Pairs Quality (PQ), and F-Measure (FM).

PC, which is also known as recall, estimates the portion of the duplicate entities that co-occur at least once in a set of blocks $B$ [18]:

$\displaystyle\textit{PC}=\frac{D_{B}}{D_{R}}$ (6)

where $D_{B}$ is the number of duplicates appearing in $B$ and $D_{R}$ is the number of all duplicates in the record collection $R$ .

PQ, also known as precision, measures the portion of comparisons that correspond to real duplicates (see, e.g., [18]) and is given by

$\displaystyle\textit{PQ}=\frac{D_{B}}{||B||}$ (7)

where $||B||=\sum_{b_{i}\in B}||b_{i}||$ and $||b_{i}||$ is the number of comparisons implied by the block $b_{i}$ .

Finally, FM is the harmonic mean of PC and PQ (see, e.g., [9]):

$\displaystyle\textit{FM}=\frac{2*\textit{PC}*\textit{PQ}}{\textit{PC}+\textit{PQ}}$ (8)
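The three measures of Eqs. (6) to (8) can be computed as in the following sketch. It assumes that the blocks have already been reduced to a set of distinct candidate pairs, so that $||B||$ equals the number of candidate pairs; with overlapping blocks that repeat a pair, $||B||$ would be larger. The toy pairs and the names `pair` and `blocking_quality` are hypothetical.

```python
def pair(a, b):
    """Unordered record pair, so that (a, b) and (b, a) compare equal."""
    return frozenset((a, b))

def blocking_quality(candidate_pairs, true_duplicates):
    """Return (PC, PQ, FM) for a set of candidate pairs and the true duplicates."""
    found = candidate_pairs & true_duplicates          # D_B: duplicates covered by B
    pc = len(found) / len(true_duplicates) if true_duplicates else 0.0
    pq = len(found) / len(candidate_pairs) if candidate_pairs else 0.0
    fm = 2 * pc * pq / (pc + pq) if pc + pq > 0 else 0.0
    return pc, pq, fm

# Toy input: three candidate comparisons, two true duplicate pairs.
candidates = {pair("r1", "r2"), pair("r1", "r3"), pair("r2", "r4")}
duplicates = {pair("r1", "r2"), pair("r3", "r4")}

pc, pq, fm = blocking_quality(candidates, duplicates)
```

Here one of the two duplicate pairs is covered (PC = 0.5), one of the three comparisons is a real duplicate (PQ = 1/3), and their harmonic mean gives FM = 0.4.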

5.3 SeMBlock vs. Existing blocking methods

We evaluated SeMBlock with the datasets and measures described above. Following the SeMBlock procedure, for each dataset the BERT embeddings of the records are calculated, LSH is applied, a blocking graph is constructed, and graph pruning is executed. As described in Section 4, the pruning parameter $\alpha$ is crucial. Figure 3 shows the impact of this parameter on the three measures PQ, PC, and FM. Specifically, this figure shows for the three datasets how these measures vary for different values of $\alpha$ (note that $\alpha=0$ means that no pruning happens). As can be seen from Fig. 3, $\alpha=2$ is the best choice for graph pruning with respect to FM for all three datasets. As $\alpha$ increases beyond this value, the PQ of SeMBlock increases but PC decreases dramatically, and accordingly, FM decreases.

We compared SeMBlock with the following standard blocking approaches (the parameters of each method are set to the best values specified for them in the associated papers):

• Schema-based approaches: Sorted neighborhood blocking (SN) [42], Extended sorted neighborhood blocking (ESN) [19], Canopy clustering (CC) [45, 46], Extended canopy clustering (ECC) [19], Suffix arrays blocking (SA) [47], Extended suffix arrays blocking (ESA) [19], Q-grams blocking (Qg) [43], and Extended q-grams blocking (EQg) [19].
• Schema-agnostic approaches: Token blocking (TB) [50], Attribute clustering blocking (AC) [22], and Unsupervised blocking technique (BL) [60].
• Meta-blocking approaches: BLAST [21], Redefined WNP (Wnp1) [24], Reciprocal WNP (Wnp2) [24], Redefined CNP (Cnp1) [24], and Reciprocal CNP (Cnp2) [24].

We calculated the PC, PQ, and FM measures for these 16 methods on the three datasets. The results of this comparison are shown in Table 2 (where SeMBlock was executed with $\alpha=2$). As can be seen from this table, SeMBlock achieved the highest PQ on all three datasets and the highest FM on DBLP-ACM and Cora; on DBLP-Scholar its FM of 0.46 equals that of the best competitor (Cnp2). However, the PC values of SeMBlock are not the highest ones: they are 0.33, 0.87, and 0.76 on DBLP-Scholar, DBLP-ACM, and Cora, compared with best values of 0.99, 1.00, and 1.00, respectively. For cases where the PC measure is more important, this drawback can be mitigated by changing the $\alpha$ value (e.g., $\alpha=1$), which can increase this measure.

Additionally, we evaluated the efficiency of the blocking methods by the number of record pairs each method generates, as blocking aims to reduce the number of pairs to be compared in entity resolution. Table 2 shows the number of record pairs (i.e., $||B||$) generated by the different blocking methods. As shown in Table 2, SeMBlock generates the fewest record pairs on DBLP-Scholar and DBLP-ACM; on Cora, only Cnp2 generates fewer pairs ($9.64\times10^{3}$ compared with $1.36\times10^{4}$ for SeMBlock).
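The reported measures are tied together by Eqs. (6) to (8): since $\textit{PQ}=D_{B}/||B||$, the number of duplicates found is roughly $\textit{PQ}\cdot||B||$, and dividing by the number of matches in Table 1 should approximately reproduce PC. The sketch below performs this back-of-the-envelope check for the SeMBlock rows, using only the rounded values printed in Tables 1 and 2; the variable and function names are hypothetical, and small deviations are expected because the inputs are rounded.

```python
# SeMBlock values as printed in Table 2, with the number of matches from Table 1.
SEMBLOCK_ROWS = {
    "DBLP-Scholar": {"pc": 0.33, "pq": 0.78, "fm": 0.46, "pairs": 2.26e3, "matches": 5347},
    "DBLP-ACM":     {"pc": 0.87, "pq": 0.95, "fm": 0.91, "pairs": 2.03e3, "matches": 2220},
    "Cora":         {"pc": 0.76, "pq": 0.96, "fm": 0.85, "pairs": 1.36e4, "matches": 17184},
}

def estimate_pc(pq, pairs, matches):
    """PC estimated as (PQ * ||B||) / D_R, i.e. duplicates found over all duplicates."""
    return pq * pairs / matches

def f_measure(pc, pq):
    """Eq. (8): harmonic mean of PC and PQ."""
    return 2 * pc * pq / (pc + pq)

checks = {
    name: {
        "pc_estimate": estimate_pc(row["pq"], row["pairs"], row["matches"]),
        "fm_estimate": f_measure(row["pc"], row["pq"]),
    }
    for name, row in SEMBLOCK_ROWS.items()
}

for name, result in checks.items():
    print(f"{name}: PC ~ {result['pc_estimate']:.3f}, FM ~ {result['fm_estimate']:.3f}")
```

For all three datasets the estimates agree with the reported PC and FM to within 0.01, which is consistent with the rounding of the table entries.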

Table 2. PC, PQ, and FM for SeMBlock and 16 other blocking methods on the Cora, DBLP-ACM, and DBLP-Scholar datasets, together with the number of record pairs $||B||$ generated by each approach

	DBLP-Scholar				DBLP-ACM				Cora			
Methods	PC	PQ	FM	$||B||$	PC	PQ	FM	$||B||$	PC	PQ	FM	$||B||$
TB	0.43	0.00	0.00	$9.10\times10^{7}$	1.00	0.00	0.00	$6.61\times10^{6}$	1.00	0.00	0.01	$4.84\times10^{6}$
AC	0.42	0.00	0.00	$8.99\times10^{7}$	1.00	0.00	0.00	$6.42\times10^{6}$	1.00	0.00	0.01	$4.68\times10^{6}$
BL	0.99	0.00	0.00	$6.90\times10^{6}$	0.99	0.04	0.07	$6.16\times10^{4}$	0.86	0.38	0.53	$3.84\times10^{4}$
SN	0.00	0.00	0.00	$4.33\times10^{5}$	0.96	0.01	0.02	$2.43\times10^{5}$	0.65	0.06	0.10	$2.01\times10^{5}$
ESN	0.44	0.00	0.00	$1.84\times10^{8}$	1.00	0.00	0.00	$1.47\times10^{7}$	1.00	0.00	0.00	$1.02\times10^{7}$
CC	0.00	0.00	0.00	$3.39\times10^{3}$	0.98	0.84	0.90	$2.58\times10^{3}$	0.97	0.02	0.04	$8.78\times10^{5}$
ECC	0.00	0.00	0.00	$1.10\times10^{5}$	0.05	0.01	0.02	$9.51\times10^{3}$	0.36	0.36	0.36	$1.70\times10^{4}$
SA	0.00	0.00	0.00	$3.91\times10^{5}$	1.00	0.01	0.01	$3.95\times10^{5}$	0.40	0.09	0.15	$7.41\times10^{4}$
ESA	0.00	0.00	0.00	$5.34\times10^{5}$	1.00	0.00	0.01	$6.42\times10^{5}$	0.26	0.06	0.10	$7.34\times10^{4}$
Qg	0.46	0.00	0.00	$1.41\times10^{8}$	1.00	0.00	0.00	$1.26\times10^{7}$	1.00	0.00	0.00	$8.82\times10^{6}$
EQg	0.44	0.00	0.00	$1.23\times10^{8}$	1.00	0.00	0.00	$1.20\times10^{7}$	1.00	0.00	0.00	$9.30\times10^{6}$
BLAST	0.95	0.05	0.10	$4.10\times10^{4}$	0.99	0.61	0.75	$3.60\times10^{3}$	0.82	0.84	0.83	$1.68\times10^{4}$
Wnp1	0.98	0.01	0.02	$2.90\times10^{5}$	0.99	0.14	0.24	$1.70\times10^{4}$	0.90	0.54	0.68	$2.86\times10^{4}$
Wnp2	0.95	0.03	0.07	$6.30\times10^{4}$	0.98	0.24	0.38	$9.20\times10^{3}$	0.81	0.69	0.75	$2.01\times10^{4}$
Cnp1	0.94	0.02	0.04	$1.10\times10^{5}$	0.99	0.10	0.18	$2.20\times10^{4}$	0.67	0.66	0.66	$1.74\times10^{4}$
Cnp2	0.88	0.31	0.46	$6.50\times10^{4}$	0.98	0.20	0.34	$1.10\times10^{4}$	0.46	0.82	0.59	$9.64\times10^{3}$
SeMBlock	0.33	0.78	0.46	$2.26\times10^{3}$	0.87	0.95	0.91	$2.03\times10^{3}$	0.76	0.96	0.85	$1.36\times10^{4}$

Moreover, we compared SeMBlock with two recently released blocking methods, Rebo-I and Rebo-II [61], on the Cora dataset. The results showed that SeMBlock outperformed both methods with respect to FM and PQ, although not with respect to PC: Rebo-I reached PC 0.928, PQ 0.694 and FM 0.794, Rebo-II reached PC 0.935, PQ 0.656 and FM 0.771, and SeMBlock reached PC 0.76, PQ 0.96 and FM 0.85.

6. Conclusions and future work

Entity resolution is the process of identifying records, within one data set or across different data sources, that refer to the same entity. To avoid the quadratic complexity of entity resolution, many attempts have been made to group similar records into blocks, prior to record matching, using blocking techniques. Available blocking methods typically do not exploit semantic criteria for the task of blocking. We introduced a semantic-aware meta-blocking approach called SeMBlock that exploits word-embedding-based locality-sensitive hashing (LSH) for calculating semantic similarity and identifying relationships among records. The experimental results show that considering semantic relationships in the blocking process can significantly improve the quality of blocking. The size of the blocks generally gets smaller because semantic features can effectively eliminate record pairs that are textually similar but semantically different. We also compared SeMBlock with 16 available standard blocking methods. Overall, SeMBlock can be an effective blocking technique compared to these methods when the priority is to achieve high pair quality and F-measure, without giving up a high level of pair completeness.

In the future, first, we plan to extend SeMBlock by leveraging context information within a network environment for enhancing the applicability of SeMBlock to real-world ER problems. Second, we want to investigate the effect of combining several semantic features on the quality of blocking. Third, we intend to find an alternative to the LSH method that increases the accuracy of our method.

References

1. Bhattacharya, Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD). 2007; 1(1): 5.

2. Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012.

3. Lin, Wang, Gao. Efficient entity resolution on heterogeneous records. IEEE Transactions on Knowledge and Data Engineering, 2019.

4. Tauer, Date, Nagi, Sudit. An incremental graph-partitioning algorithm for entity resolution. Information Fusion. 2019; 46: 171–183.

5. Christophides, Efthymiou, Palpanas, Papadakis, Stefanidis. End-to-End Entity Resolution for Big Data: A Survey. arXiv preprint arXiv:1905.06397, 2019.

6. Kwashie, Liu, Stumptner, Yang. Certus: an effective entity resolution approach with graph differential dependencies (GDDs). Proceedings of the VLDB Endowment. 2019; 12(6): 653–666.

7. Bilenko, Kamath, Mooney. Adaptive blocking: Learning to scale up record linkage. In: Sixth International Conference on Data Mining (ICDM'06), IEEE, 2006, pp. 87–96.

8. Papadakis, Tsekouras, Thanos, Pittaras, Simonini, Skoutas, et al. JedAI3: beyond batch, blocking-based Entity Resolution. In: EDBT, 2020, pp. 603–606.

9. Wang, Cui, Liang. Semantic-aware blocking for entity resolution. IEEE Transactions on Knowledge and Data Engineering. 2016; 28(1): 166–180.

10. Papadakis, Skoutas, Thanos, Palpanas. A Survey of Blocking and Filtering Techniques for Entity Resolution. arXiv preprint arXiv:1905.06167, 2019.

11. Araújo, Pires CES, Mestre, Nóbrega TPD, Nascimento DCD, Stefanidis. A noise tolerant and schema-agnostic blocking technique for entity resolution. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. ACM, 2019, pp. 422–430.

12. Wang, Cui, Liang. Semantic-aware blocking for entity resolution. IEEE Transactions on Knowledge and Data Engineering. 2015; 28(1): 166–180.

13. De Vries, Chawla, Christen. Robust record linkage blocking using suffix arrays and Bloom filters. ACM Transactions on Knowledge Discovery from Data (TKDD). 2011; 5(2): 9.

14. Papadakis, Koutrika, Palpanas, Nejdl. Meta-blocking: Taking entity resolution to the next level. IEEE Transactions on Knowledge and Data Engineering. 2014; 26(8): 1946–1960.

15. Papadakis, Papastefanatos, Koutrika. Supervised meta-blocking. Proceedings of the VLDB Endowment. 2014; 7(14): 1929–1940.

16. Araújo, et al. Parallel blocking for entity resolution in the context of semi-structured data, 2020.

17. Dal Bianco, Gonçalves, Duarte. BLOSS: Effective meta-blocking with almost no effort. Information Systems. 2018; 75: 75–89.

18. Papadakis, Svirsky, Gal, Palpanas. Comparative analysis of approximate blocking techniques for entity resolution. Proceedings of the VLDB Endowment. 2016; 9(9): 684–695.

19. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering. 2012; 24(9): 1537–1555.

20. Tran. Typimatch: Type-specific unsupervised learning of keys and key values for heterogeneous web data integration. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013, pp. 325–334.

21. Simonini, Bergamaschi, Jagadish. BLAST: a loosely schema-aware meta-blocking approach for entity resolution. Proceedings of the VLDB Endowment. 2016; 9(12): 1173–1184.

22. Papadakis, Ioannou, Palpanas, Niederee, Nejdl. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Transactions on Knowledge and Data Engineering. 2013; 25(12): 2665–2682.

23. Fisher, Christen, Wang, Rahm. A clustering-based framework to control block sizes for entity resolution. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 279–288.

24. Papadakis, Papastefanatos, Palpanas, Koubarakis. Scaling entity resolution to large, heterogeneous data with enhanced meta-blocking. In: EDBT, 2016, pp. 221–232.

25. Whang, Menestrina, Koutrika, Theobald, Garcia-Molina. Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009, pp. 219–232.

26. Efthymiou, Stefanidis, Christophides. Benchmarking blocking algorithms for web entities. IEEE Transactions on Big Data, 2016.

27. Efthymiou, Papadakis, Papastefanatos, Stefanidis, Palpanas. Parallel meta-blocking for scaling entity resolution over big heterogeneous data. Information Systems. 2017; 65: 137–157.

28. Efthymiou, Papadakis, Papastefanatos, Stefanidis, Palpanas. Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data. In: 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2015, pp. 411–420.

29. Piryani, Gupta, Singh. Generating aspect-based extractive opinion summary: Drawing inferences from social media texts. Computación y Sistemas. 2018; 22(1): 83–91.

30. Gupta, Singh, Mukhija, Ghose. Aspect-based sentiment analysis of mobile reviews. Journal of Intelligent & Fuzzy Systems. 2019; 36(5): 4721–4730.

31. Piryani, Gupta, Singh, Ghose. A linguistic rule-based approach for aspect-level sentiment analysis of movie reviews. In: Advances in Computer and Computational Sciences. Springer, 2017, pp. 201–209.

32. Allahgholi, Rahmani, Javdani, Weiss, Módos. ADDI: Recommending alternatives for drug–drug interactions with negative health effects. Computers in Biology and Medicine. 2020; 125: 103969.

33. Mikolov, Chen, Corrado, Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

34. Zhang. Using Word2Vec to process big text data. In: 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2015, pp. 2895–2897.

35. Devlin, Chang, Lee, Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

36. Dong, Gabrilovich, Heitz, Horn, Lao, Murphy, et al. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 601–610.

37. Jagadish, Gehrke, Labrinidis, Papakonstantinou, Patel, Ramakrishnan, et al. Big data and its technical challenges. Communications of the ACM. 2014; 57(7): 86–94.

38. De Sa, Ratner, Ré, Shin, Wang, et al. Incremental knowledge base construction using DeepDive. The VLDB Journal. 2017; 26(1): 81–105.

39. Papadakis, Skoutas, Thanos, Palpanas. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput Surv. 2020 Mar; 53(2). Available from: 10.1145/3377455.

40. Vidhya, Geetha. Entity Resolution and Blocking: A Review. In: 2019 IEEE 9th International Conference on Advanced Computing (IACC). IEEE, 2019, pp. 133–140.

41. Fellegi, Sunter. A theory for record linkage. Journal of the American Statistical Association. 1969; 64(328): 1183–1210.

42. Hernández, Stolfo. The merge/purge problem for large databases. In: ACM Sigmod Record. vol. 24. ACM, 1995, pp. 127–138.

43. Gravano, Ipeirotis, Jagadish, Koudas, Muthukrishnan, Srivastava, et al. Approximate string joins in a database (almost) for free. In: VLDB. vol. 1, 2001, pp. 491–500.

44. Kenig, Gal. MFIBlocks: An effective blocking algorithm for entity resolution. Information Systems. 2013; 38(6): 908–926.

45. McCallum, Nigam, Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. Citeseer, 2000, pp. 169–178.

46. Baxter, Christen, et al. A comparison of fast blocking methods for record, 2003.

47. Aizawa, Oyama. A fast linkage detection scheme for multi-source information integration. In: International Workshop on Challenges in Web Information Retrieval and Integration. IEEE, 2005, pp. 30–39.

48. Simonini, Papadakis, Palpanas, Bergamaschi. Schema-agnostic progressive entity resolution. IEEE Transactions on Knowledge and Data Engineering. 2018; 31(6): 1208–1221.

49. Rahmani, Ranjbar-Sahraei, Weiss, Tuyls. Entity resolution in disjoint graphs: an application on genealogical data. Intelligent Data Analysis. 2016; 20(2): 455–475.

50. Papadakis, Ioannou, Niederée, Fankhauser. Efficient entity resolution for large heterogeneous information spaces. In: Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 2011, pp. 535–544.

51. Efthymiou, Papadakis, Stefanidis, Christophides. MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities. arXiv preprint arXiv:1905.06170, 2019.

52. Gagliardelli, Simonini, Beneventano, Bergamaschi. SparkER: Scaling Entity Resolution in Spark. In: EDBT 2019: 22nd International Conference on Extending Database Technology, 2019.

53. Papadakis, Bereta, Palpanas, Koubarakis. Multi-core meta-blocking for big linked data. In: Proceedings of the 13th International Conference on Semantic Systems. ACM, 2017, pp. 33–40.

54. Simonini, Papadakis, Palpanas, Bergamaschi. Schema-agnostic Progressive Entity Resolution (extended version). arXiv preprint arXiv:1905.06385, 2019.

55. Andoni, Indyk, Laarhoven, Razenshteyn, Schmidt. Practical and optimal LSH for angular distance. In: Advances in Neural Information Processing Systems, 2015, pp. 1225–1233.

56.

McCallum

. Cora Dataset. Texas Data Repository Dataverse, 2017. Available from 10.18738/T8/HUIG48.

57.

Mudgal

Rekatsinas

Doan

Park

Krishnan

, et al., Deep learning for entity matching: A design space exploration. In: Proceedings of the 2018 International Conference on Management of Data. ACM, 2018, pp. 19–34.

58.

Shao

Wang

. Active Blocking Scheme Learning for Entity Resolution. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2018, pp. 350–362.

59.

Shao

Wang

Lin

. Skyblocking: Learning Blocking Schemes on the Skyline. arXiv preprint arXiv180512319, 2018.

60.

O’Hare

Jurek-Loughrey

de Campos

. An unsupervised blocking technique for more efficient record linkage. Data & Knowledge Engineering. 2019; 122: 181–195.

61.

. Entity Resolution with Recursive Blocking. Big Data Research, 2020, p. 100134.

	DBLP-Scholar				DBLP-ACM				Cora
Methods	PC	PQ	FM	$\|B\|$	PC	PQ	FM	$\|B\|$	PC	PQ	FM	$\|B\|$
TB	0.43	0.00	0.00	$9.10 \times 10^{7}$	1.00	0.00	0.00	$6.61 \times 10^{6}$	1.00	0.00	0.01	$4.84 \times 10^{6}$
AC	0.42	0.00	0.00	$8.99 \times 10^{7}$	1.00	0.00	0.00	$6.42 \times 10^{6}$	1.00	0.00	0.01	$4.68 \times 10^{6}$
BL	0.99	0.00	0.00	$6.90 \times 10^{6}$	0.99	0.04	0.07	$6.16 \times 10^{4}$	0.86	0.38	0.53	$3.84 \times 10^{4}$
SN	0.00	0.00	0.00	$4.33 \times 10^{5}$	0.96	0.01	0.02	$2.43 \times 10^{5}$	0.65	0.06	0.10	$2.01 \times 10^{5}$
ESN	0.44	0.00	0.00	$1.84 \times 10^{8}$	1.00	0.00	0.00	$1.47 \times 10^{7}$	1.00	0.00	0.00	$1.02 \times 10^{7}$
CC	0.00	0.00	0.00	$3.39 \times 10^{3}$	0.98	0.84	0.90	$2.58 \times 10^{3}$	0.97	0.02	0.04	$8.78 \times 10^{5}$
ECC	0.00	0.00	0.00	$1.10 \times 10^{5}$	0.05	0.01	0.02	$9.51 \times 10^{3}$	0.36	0.36	0.36	$1.70 \times 10^{4}$
SA	0.00	0.00	0.00	$3.91 \times 10^{5}$	1.00	0.01	0.01	$3.95 \times 10^{5}$	0.40	0.09	0.15	$7.41 \times 10^{4}$
ESA	0.00	0.00	0.00	$5.34 \times 10^{5}$	1.00	0.00	0.01	$6.42 \times 10^{5}$	0.26	0.06	0.10	$7.34 \times 10^{4}$
Qg	0.46	0.00	0.00	$1.41 \times 10^{8}$	1.00	0.00	0.00	$1.26 \times 10^{7}$	1.00	0.00	0.00	$8.82 \times 10^{6}$
EQg	0.44	0.00	0.00	$1.23 \times 10^{8}$	1.00	0.00	0.00	$1.20 \times 10^{7}$	1.00	0.00	0.00	$9.30 \times 10^{6}$
BLAST	0.95	0.05	0.10	$4.10 \times 10^{4}$	0.99	0.61	0.75	$3.60 \times 10^{3}$	0.82	0.84	0.83	$1.68 \times 10^{4}$
Wnp1	0.98	0.01	0.02	$2.90 \times 10^{5}$	0.99	0.14	0.24	$1.70 \times 10^{4}$	0.90	0.54	0.68	$2.86 \times 10^{4}$
Wnp2	0.95	0.03	0.07	$6.30 \times 10^{4}$	0.98	0.24	0.38	$9.20 \times 10^{3}$	0.81	0.69	0.75	$2.01 \times 10^{4}$
Cnp1	0.94	0.02	0.04	$1.10 \times 10^{5}$	0.99	0.10	0.18	$2.20 \times 10^{4}$	0.67	0.66	0.66	$1.74 \times 10^{4}$
Cnp2	0.88	0.31	0.46	$6.50 \times 10^{4}$	0.98	0.20	0.34	$1.10 \times 10^{4}$	0.46	0.82	0.59	$9.64 \times 10^{3}$
SeMBlock	0.33	0.78	0.46	$2.26 \times 10^{3}$	0.87	0.95	0.91	$2.03 \times 10^{3}$	0.76	0.96	0.85	$1.36 \times 10^{4}$
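
The FM column can be cross-checked against the PC and PQ columns. The sketch below assumes the definition that is standard in the blocking literature, namely that FM is the harmonic mean of pair completeness (PC) and pair quality (PQ); this section does not restate the definition, so treat it as an assumption. The function name `f_measure` and the dictionary `semblock_rows` are illustrative only, and the input values are the rounded SeMBlock entries from the table.

```python
def f_measure(pc: float, pq: float) -> float:
    """Harmonic mean of pair completeness (PC) and pair quality (PQ).

    Assumed definition: FM = 2 * PC * PQ / (PC + PQ).
    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if pc + pq == 0:
        return 0.0
    return 2 * pc * pq / (pc + pq)


# (PC, PQ, reported FM) for the SeMBlock row, copied from the table above.
semblock_rows = {
    "DBLP-Scholar": (0.33, 0.78, 0.46),
    "DBLP-ACM": (0.87, 0.95, 0.91),
    "Cora": (0.76, 0.96, 0.85),
}

for dataset, (pc, pq, reported_fm) in semblock_rows.items():
    recomputed = round(f_measure(pc, pq), 2)
    print(f"{dataset}: recomputed FM = {recomputed:.2f}, reported FM = {reported_fm:.2f}")
```

Under this assumed definition, the recomputed values agree with the reported FM to two decimal places on all three data sets. Because the table entries are themselves rounded, small discrepancies in the last digit would not necessarily indicate an error.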