An accurate toponym-matching measure based on approximate string matching

Abstract

Approximate string matching (ASM) is a challenging problem, which aims to match different string expressions representing the same object. In this paper, detailed experimental studies were conducted on the subject of toponym matching, which is a new domain where ASM can be performed, and the creation of a single string-matching measure that can perform toponym matching process regardless of the language was attempted. For this purpose, an ASM measure called DAS, which comprises name similarity, word similarity and sentence similarity phases, was created. Considering the experimental results, the retrieval performance and system accuracy of DAS were much better than those of other well-known five measures that were compared on toponym test datasets. In addition, DAS had the best metric values of mean average precision in six languages, and precision/recall graphs confirm this result.

Keywords

Approximate string matching geographic information retrieval toponym matching

1. Introduction

The aim of approximate string matching (ASM), which is a challenging problem with a long history, is matching two or more reference names representing the same real-world object with their original name. ASM is applied in many different domains such as text processing, pattern recognition [1, 2], network security [3] and computational biology [4]. ASM is often a complex and time-consuming process that needs a huge amount of computing resources. Therefore, today there is still a need for efficient ASM measures.

Considering the literature, there are many studies that have reviewed various subjects such as entity resolution, record linkage and fuzzy matching, which are similar to approximate string matching in basic terms. Sarawagi and Bhamidipaty [5] and Tejada and Knoblock [6] employed committee-based active learning method in order to enter correct records into the record-level classifier in the training phase. Bilenko and Mooney [2] presented a machine learning-based adaptive string matching approach. Bunke et al. [7] studied rule-based fuzzy object similarity. Cohen et al. [8] evaluated many string similarity measures to solve the name-matching problem and obtained the best result with Jaro-Winkler using the combination of token-based distance functions. Bhattacharya and Getoor [9] attempted to solve the entity resolution problem using the model of latent Dirichlet allocation [10], which is a probabilistic method. Al-Ssulami [11] proposed a hybrid exact string-matching algorithm, which scans the text from left to the right and matches the pattern from right to the left. Christen [12] studied the subject of name-matching using the well-known 21 string-matching measure and concluded that there is not a single matching method that can perform all kinds of tasks.

In this paper, unlike previous studies, the subject of toponym matching, which is a new domain for ASM implementation, is discussed in detail. A toponym is defined as a geographical location name [13]. Today, it is possible to come across toponym information from digital libraries to World Wide Web content, and the amount of this type of data is increasing continuously. The biggest problem in toponym data is that toponym information regarding same location frequently differs. Incorrect or missing translations and the existence of different names for the same location depending on culture may be the cause of this problem. Toponym matching is performed using ASM criteria in order to associate official names and alternative names of these locations. Toponym matching is defined as matching similar names indicating the same location. For example, although the names of ‘Aksaagac’, ‘Ağsaağaç’ and ‘Agsaagac’ contain some different characters, they are all alternative names for the same location.

Although there are a few studies on the subject of toponym matching, considering similar studies, most of the studies have focused on the fields of geographic information retrieval (GIR) and geographic record linkage. In the study of Sehgal et al. [14], Edit-Distance, Jaccard and Jaro-Winkler string similarity measures were used to match the names of the locations in Afghanistan, and they showed that Edit-Distance measure had better performance compared with other measures. Martins [15] tried various features of the machine learning classification approach to find duplicate and non-duplicate ones among gazetteer records, and achieved some success. Hastings [16] developed a new string similarity measure to be used in the automatic matching of the gazetteer data. The only study focused on toponym matching was conducted by Recchia and Louwerse [17]. In their study, they evaluated the performance of different string similarity measures, and they showed that the performance of string similarity measures is largely dependent on language. For instance, if it is an agglutinative or root-inflected language depending on gender or conjunctions that are used frequently, then the matching results will be affected directly.

In this study, the creation of a single ASM measure, which is unique to this domain, that can perform the toponym matching process regardless of the language, was attempted. For this purpose, an ASM measure called DAS, comprising name similarity, word similarity and sentence similarity phases, was created. The matching performance of DAS was performed on the toponym dataset of six different countries (Egypt, France, Germany, Russia, Turkey and America), which were downloaded from National Geospatial-Intelligence Agency (NGA) GEOnet Names Server, and compared with five well-known ASM measures. Considering the experimental results, the retrieval performance and system accuracy of DAS are much better than those of other measures that were compared on all datasets. For example, the DAS measure found an alternative name for 289,164 toponym data out of 311,437, which is an important retrieval achievement. In addition, the DAS string-matching measure had the best MAP metric values in all languages, and precision/recall graphs confirm this result. Contributions of this paper can be listed as follows:

toponym matching is a newly introduced application domain of approximate string matching;

we developed a unique ASM measure named DAS, which is created for the task of toponym matching only;

we tested DAS on six datasets with five well-known ASM measures and we saw that DAS had the best retrieval performance and system accuracy among all others.

The rest of the paper is organized as follows. In Section 2, the ASM measures compared are introduced briefly. Section 3 presents the details of proposed ASM measure, which is named DAS. In Section 4, the experimental setup is described and the obtained experimental results are discussed in Section 5. Finally, the conclusion is given in Section 6.

2. An overview of string-matching measures

String-matching measures try to compare two string expressions through a variety of ways. These measures are usually built on a similarity function. If the similarity score obtained from ${sim}_{measure} =$ f( $s_{1}$ , $s_{2}$ ) is higher than a certain threshold, then $s_{1}$ and $s_{2}$ , which are two different string expressions, belong to the same toponym. String-matching measures are divided into categories according to the calculation method of similarity measures. The major categories are edit-based, token-based, hybrid and phonetics. In this section, an overview is given of five well-known similarity measures that were compared with DAS measure.

2.1. Jaro Winkler similarity measure

The Jaro-Winkler [18] similarity measure is based on the Jaro [19, 20] similarity measure and located in the edit-based category. It was developed by considering misspellings. When calculating similarity ratios with Jaro-Winkler, greater importance is given to the starting characters compared with other characters [12]. If the string expression contains more than one word phrase, it returns the highest value by calculating all possible permutations.

2.2. Levenshtein similarity measure

The Levenshtein similarity measure [21] is also located within the edit-based category. The similarity ratio is determined based on the least number of addition, deletion and substitution processes required for conversion of $s_{1}$ , which is a string expression, into $s_{2}$ . The lesser the number of processes required for conversion, the greater the similarity between these two string expressions. The similarity ratio should be within the range of 0–1.

2.3. Monge Elkan similarity measure

The Monge Elkan similarity measure [22, 23] is in the hybrid category, and works in recursive logic. The similarity ratio of two string expressions containing more than one token is found using an internal measure of similarity.

2.4. N-gram similarity measure

The N-gram similarity measure [24, 25], in the token-based category, determines common substring numbers with a length of n owned by two string expressions compared with each other. If the length of n is equal to 1, it is named a unigram, whereas it is named a bigram if the length is equal to 2 and a trigram if it is equal to 3. In the N-gram similarity measure, the similarity ratio is obtained by dividing the number of N-grams in the short and long expressions by the number of common N-grams for each one.

2.5. Jaccard similarity measure

The Jaccard similarity measure [24] is also located in the token-based category. The Jaccard similarity measure score for two strings $s_{1}$ and $s_{2}$ is defined as the size of the intersection set of $s_{1}$ and $s_{2}$ divided by the size of the union set of $s_{1}$ and $s_{2}$ .

3. Proposed DAS similarity measure

The DAS similarity measure that we have developed using C programming language is in the hybrid similarity category, which includes the features of the token-based and edit-based categories together. The term DAS consists of the first letters of the names of the author’s family. The DAS measure contains three different stages as follows: name similarity, word similarity and sentence similarity. In the first stage, word similarity is performed and the score of ${sim}_{DAS_WSM}$ is calculated. At this stage, in short, a similarity score is obtained by considering the importance of adjacent characters. In the second stage, sentence similarity is performed and the score of ${sim}_{DAS_SSM}$ is calculated. Word and sentence similarity stages are not interdependent and perform independently. With the completion of these two stages, two similarity scores are observed and the greater similarity score is selected for comparison. If the greater score is above a certain threshold value, name similarity is performed and the similarity score of ${sim}_{DAS_NSM}$ is calculated. Finally, the highest similarity score found among these three measures is selected as the DAS similarity measure.

3.1. Word similarity measure

The algorithm of word similarity is shown in Figure 1. First, the weight of characters of the longer string among $S_{1}$ and $S_{2}$ is computed. This weight is calculated by dividing one number into the number of the characters in the longer expression. When comparing $S_{1}$ and $S_{2}$ string expressions with the word similarity measure, the formula of weight of characters $w_{char}$ is given in eqn (1):

w_{char} = 1 / \max (| S_{1} |, | S_{2} S_{1} |)

(1)

Figure 1.

Word similarity algorithm of DAS.

For example, while computing similarity ratio of two words ‘kasa’ and ‘kiraz’ in Turkish using the word-matching measure, after trimming spaces from the longer word (kiraz), the weight of each character is assigned. The weight of the ‘k’ character is calculated by the formula w_k = 1/max(|‘kasa’|, |‘kiraz’|), which gives 1/5 = 0.20. After defining the character weight of the longer expression, words in expressions are tokenized. The similarity ratio of every word in both expressions is calculated and stored in order. The similarity ratio between two tokens, ${sim}_{token}$ is formulized as given in eqn (2):

{sim}_{token} = \sum_{i = 0}^{\max (c_{tok 1}, c_{tok 2}) | - 1} w_{char} * k

(2)

where, $c_{tok 1}$ and $c_{tok 2}$ represent the tokens compared. k is the similarity coefficient and its value is calculated as eqn (3):

k = {\begin{matrix} 1, c_{tok 1 i} = c_{tok 2 i} \\ 0.8, if (c_{tok 1 i} = c_{tok 2 (i + 1)}) or (c_{tok 1 i} = c_{tok 2 (i - 1)}) \\ 0, otherwise \end{matrix}

(3)

First, character comparison of the paired tokens is performed, and the weight of character is multiplied by the similarity coefficient k, then it is added to the token similarity ratio. If there are the same characters in the same indices, the similarity constant k is considered as 1, which gives the maximum similarity. Missing or extra letter writing is one of the most common writing errors causing no similarity to be found between two words when in fact they are similar. In order to overcome this problem, word similarity measure is designed by considering the importance of adjacent characters. If there is a match with either the previous or the next index, then k is taken as 0.8. In order to determine the appropriate value of similarity coefficient k when there is a match with either the previous or the next index, we design an experimental study that compares the performances of different k values. In this study, the value of similarity coefficient k is selected as 0.8 based on the experimental results. It should be noted that the value of k is adjustable and can vary according to dataset. Lastly, if there is no match with the same or adjacent indexes, k is selected as 0.

For example, while comparing two words ‘KNIGHT’ and ‘NIGHT’, the weight of each character is calculated as $w_{char}$ =1/6 = 0.16666667. After calculating the character weights, they are added to the token similarity ratios by comparing characters with each other. In Table 1, the character matching status of the words ‘KNIGHT’ and ‘NIGHT’ is presented. The similarity ratio between two expressions is calculated as follows: ${sim}_{token}$ = (0.16666667 * 0.8)+(0.16666667 * 0.8)+(0.16666667 * 0.8)+(0.16666667 * 0.8)+(0.16666667 * 0.8) =0.66666667%.

Table 1.

Character matching status of two words KNIGHT and NIGHT.

	K	N	I	G	H	T	Similarity coefficient
N	0	1					0.8
I	0	0	1				0.8
G		0	0	1			0.8
H			0	0	1		0.8
T				0	0	1	0.8

This process is also performed for reverse versions of the words ‘KNIGHT’ and ‘NIGHT, which are ‘THGINK’ and ‘THGIN’. In Table 2, the character-matching status of these two reversed words is given. Their character weights $w_{char}$ are equal to 0.16666667, because the number of letters is the same for both words. The similarity ratio between two expressions is calculated as follows: ${sim}_{token}$ = (0.16666667 * 1)+(0.16666667 * 1)+(0.16666667 * 1)+(0.16666667 * 1)+(0.16666667 * 1) = 0.83333335%. As a result of this calculation, the highest value between these two similarity ratios is chosen: 0.83333335%.

Table 2.

Character matching status of two words THGINK and NIGHT.

	T	H	G	I	N	K	Similarity coefficient
N					1	0	1
I				1	0	0	1
G		0	1	0	0		1
H	0	1	0	0			1
T	1	0					1

3.2. Sentence similarity measure

Sentence similarity measure is based on the logic of matching words of two string expressions by searching these words in each other. The algorithm details of this measure are given in Figure 2. First, $S_{1}$ and $S_{2}$ string expressions are tokenized. Then the average word length of these two expressions is determined. The shorter expression is searched in the longer expression. In the sentence similarity measure, the similarity ratio is defined by dividing the number of matched words by the expression with higher number of words.

Figure 2.

Sentence similarity algorithm of DAS.

For example, while comparing two toponym data $S_{1}$ = ‘Mount of Alp’ and $S_{2}$ = ‘Alp Mountain’, in the first step, $S_{1}$ and $S_{2}$ expressions are tokenized. Then, the average number of letters per word in each expression, named as w1A and w2A, are determined. w1A and w2A are calculated by dividing the total number of letters by the total number of words for $S_{1}$ and $S_{2}$ , respectively. For instance, w1A for $S_{1}$ is calculated by the formula w1A= $TotalNumberOfLetters (S_{1}) / TotalNumberOfWords (S_{1})$ , which gives 10/3 = 3.3333. w2A is calculated in the same way and has the value of 11/2 = 5.5. Since the value of w2A is greater than that of w1A, tokens of $S_{1}$ are searched and matched in $S_{2}$ . The matching process is performed by controlling the containment of each token. The weight of each matching token is calculated by the formula $1 / (MAX (S 1 words . Count, S 2 words . Count))$ , which gives $1 / (MAX (3, 2))$ = $1 / 3 =$ 0.333.

As seen in Table 3, ‘Mount’ and ‘Alp’ tokens are matched. As a result, the sentence similarity score is calculated as follows: ${sim}_{DAS_SSM} = 0.3333 + 0.3333 = 0.6666$ .

Table 3.

Token matching status of two expressions.

	Alp	Mountain	Matching weight
Mount	0	1	0.3333
Of	0	0	0
Alp	1	0	0.3333

3.3. Name similarity measure

The calculation approach of the name similarity measure is substantially different from other stages. If the greater of the similarity scores calculated at the end of the sentence and word similarity stages is above a certain threshold value, then the name similarity measure can be performed. Otherwise, this stage will not work. The mainstay of this approach can be explained as follows: if similarity scores of $S_{1}$ and $S_{2}$ string expressions are lower than a certain threshold value at the end of the sentence or word similarity stages, then these words are either unrelated or, even though they are related, they are synonyms of each other. However, if the score is above a certain threshold value, there is a possibility of there being alternative names for these two expressions; therefore, this possibility is checked by calculating the name similarity measure ${sim}_{DAS_NSM}$ . The simple algorithm of the name-matching stage is given in Figure 3. First, two string expressions $S_{1}$ and $S_{2}$ are tokenized. Then every word in both expressions is compared with each other. While performing the comparison process, if all the letters of any two expressions are the same until performing MinLength, then DAS similarity score is calculated as 1.

Figure 3.

Name similarity algorithm of DAS.

Since the threshold value directly affects the selection of name-matching stage and the matching performance of the DAS measure, it is very important to find the optimum value of this parameter. In this study, this case is considered as a mathematical optimization problem. Mathematical optimization is a search process that uses real or integer values within a set systematically for solving the problem of either maximizing or minimizing a real function. If we consider the DAS similarity measure as a real mathematical function, the optimum value of the threshold t parameter of this function would give us the maximum MAP value.

One dimensional Grid Search [26] algorithm is utilized in order to solve the optimization problem. The Grid Search algorithm has a simple straightforward logic along with a naive structure. In this study, the number of grid points, which were searched in the state space, is reduced by using the Grid Search algorithm as iterated. The toponym datasets of 10 countries were selected as test sets in order to find the optimum values of threshold parameter. In the selection process, it has been considered that structures of the language of test datasets are different from each other and it was important to create as heterogeneous a test dataset as possible. In the test set of each country, the following steps of the Grid Search algorithm were performed:

Step 1: Set the maximum iteration number k in accordance with the precision needed. In this study, this value was set to 5.

Step 2: Five grid points are evenly distributed in the control space labelled as ${x_{A}$ , $x_{B}$ , …, $x_{E}$ } after discretizing the control space C in a closed range [ $c_{\min}$ , $c_{\max}$ ]. These points are determined as {0, 25, 50, 75, 100}.

Step 3: Find the values of points A and B as $f (x_{A})$ and $f (x_{B})$ , respectively.

Step 4: If $f (x_{A})$ > $f (x_{B})$ then go back to step 2 and reset the searching space as [ $x_{A}$ , $x_{B}$ ]. Otherwise, first $f (x_{C})$ should be computed. Then, if $(x_{B})$ > $f (x_{C})$ , go back to step 2 and reset the searching space as [ $x_{A}$ , $x_{B}$ ]. If not, compute $f (x_{D}) .$ Then, go back to step 2 and reset the searching space as [ $x_{B}$ , $x_{D}$ ] if $(x_{C})$ > $f (x_{D}) .$ Otherwise, compute $f (x_{E}) .$ Once again, if $f (x_{D})$ > $f (x_{E}),$ go back to step 2 and reset the searching space as [ $x_{C}$ , $x_{E}$ ]. If it is not, then $f (x_{E})$ has the highest value among the five grid points, so go back to step 2 to reset the searching space as [ $x_{D}$ , $x_{E}$ ], where the iteration number increases by 1.

Step 5: You should keep going until you reach the maximum value of iterations. The numerical solution of the equation is the point that dominates in the last iteration.

After running the Grid Search algorithm for 10 test sets, the best grid points (threshold values) were obtained for each country. As presented in Table 4, in the first run, the point value was found as 0.25 for eight out of 10 country test sets and found as 0 for the remaining two sets. To increase the accuracy of the threshold value obtained, the Grid Search algorithm was run 2 more times with an iterative approach. In the second run, the five new grid points were determined as {5, 10, 15, 20, 25} and the Grid Search algorithm was run again. In general, the same or higher MAP values were obtained. Finally, the Grid Search algorithm was run for the third time within the range of five new grid points {15, 17, 19, 21, 23} and the iteration was terminated.

Table 4.

Calculating the best Grid Points corresponding to MAP values by running the Grid Search Algorithm iteratively.

No.	Test set	First run		Second run		Third run
No.	Test set	Grid point	MAP	Grid point	MAP	Grid point	MAP
1	Germany	25	0.9473	20	0.9482	21	0.9482
2	France	25	0.9596	20	0.9599	21	0.9601
3	America	25	0.2010	20	0.2010	21	0.2040
4	Pakistan	25	0.9604	25	0.9604	21	0.9605
5	Brazil	25	0.9651	20	0.9651	19	0.9652
6	Italy	25	0.9839	25	0.9839	19	0.9842
7	Japan	0	0.8025	15	0.8026	17	0.8027
8	Spain	0	0.5459	15	0.5460	21	0.5463
9	Taiwan	25	0.7751	15	0.7808	21	0.7798
10	Turkey	25	0.8484	20	0.8484	21	0.8484

In Table 4, the grid points obtained as a result of each of three runs and the corresponding MAP values are presented. Considering the results of the third run, 21 point values were selected in seven datasets, whereas 19 were selected in two datasets and 17 were selected in one dataset; and the threshold values were found close to each other in these datasets. In this study, the mathematical average of all grid point values found as 20.2 and this average value was determined as the justified threshold value that achieves optimum MAP values. The threshold value was calculated in the dataset-preparation step and was experimentally used in order to evaluate the DAS measure in all datasets.

4. Experimental study

4.1. Dataset preparation

First, 12 toponym files containing the toponym information belonging to the countries Egypt, France, Germany, Russia, America, Japan, Pakistan, Taiwan, Spain, Brazil, Italy and Turkey were downloaded from NGA GEOnet Names Server. Then, 12 separate datasets were created by performing parsing and tokenization processes for each file. During these processes, location names with no alternative names or the same alternative names as themselves were deleted from the datasets. Ten out of 12 datasets were employed to calculate the optimum value of the threshold parameter using the Grid Search algorithm. Six of these datasets were also used for experimental comparison of the DAS string-matching measure with other measures. The toponym information included by these datasets to be used for comparison is presented in Table 5. Accordingly, the Turkey Dataset has a total of 82,754 alternative names for a total of 23,615 toponym data. The highest number of toponym data is in the Russia Dataset, and there are 625,969 alternative names for a total of 186,027 toponym data. There are 311,437 toponym data for all languages, and 957,877 alternative names for all of these toponym data.

Table 5.

An example of toponym data in Turkey Dataset.

No.	Dataset	Original number of toponyms	Total number of alternatives
1	Turkey	23,615	82,754
2	Egypt	20,928	100,263
3	France	41,772	94,120
4	Germany	17,311	46,402
5	America	21,784	8,369
6	Russia	186,027	625,969

4.2. Experimental methodology

An information retrieval-based evaluation system was experimentally set up rather than a matching-based evaluation system in order to measure the performance of the DAS similarity measure compared with other measures. Information retrieval is a discipline dealing with bringing the relevant documents back in response to the query Q in D collection, which does not have a defined structure [27].

In this system, the toponym data in each dataset represent the queries. For example, the Turkey Dataset has a total of 23,615 units of toponym (queries) and all of these toponyms have a total of 82,754 units of correct alternative names that are already identified by NGA. A test collection containing location names was also created to run the queries and obtain the results. There are both relevant and irrelevant toponym data in this collection, which is specific to the dataset of each country. While relevant toponym data contain correct alternative names belonging to each query, irrelevant toponym data were created by selecting five toponyms starting with different letters from original datasets. In the test collection, a similarity score between each toponym and query was computed according to an approximate string-matching measure, followed by ranking all toponym results in accordance with the similarity score. Finally, evaluation metrics were calculated in order to check the performance of each measure.

4.3. Evaluation metrics

Since the experimental infrastructure of the study is created based on information retrieval, standard evaluation metrics were used to measure accuracy and effectiveness of information retrieval system. Precision and recall metrics are two important metrics of these. Recall is the fraction of actual alternative names that are identified, whereas precision is the fraction of retrieved alternative names that are correct;

Precision = \frac{number of correctly retrieved alternative names}{number of retrieved alternative names}

(4)

Recall = \frac{number of correctly retrieved alternative names}{number of true alternative names}

(5)

Precision–recall charts are used to measure the performance of the system. In this graph, precision values are given at a certain cut-off level in case of the recall [25]. One of the most optimistic metrics to measure the quality of information retrieval systems is mean average precision (MAP). This metric was used to find the mean of the average value of precisions via a set of queries [28]:

MAP (Q) = \frac{1}{| Q |} \sum_{j = 1}^{| Q |} \frac{1}{m_{j}} \sum_{k = 1}^{m_{j}} Precision (R_{k})

(6)

where Q represents a number of original toponym data. Each qj toponym data has an alternative location name relevant to itself (m_j). R_k is a series of results starting from the alternative toponym at the highest similarity ratio to the toponym data k, which is really looked for. Higher values for MAP metric mean that more effective toponym retrieval is performed.

5. Experimental results

In order to accurately evaluate the performance of the DAS string similarity measure, toponym datasets of Turkey, Germany, France, Russia, the USA and Egypt were selected in particular, since they have different language structures. The DAS string-matching measure was compared with five well-known similarity measures – Levenshtein, Jaro-Winkler, Monge Elkan, Bi-Gram and Jaccard. All experimental evaluations were performed using a justified threshold value to trigger the name similarity measure of DAS. In Table 6, string matching performances of all measures on six datasets are summarized.

Table 6.

Experimental comparison results of DAS measure.

Measures	1. Turkey		2. Germany		3. France		4. Russia		5. Egypt		6. USA
Measures	CRTN	MAP	CRTN	MAP	CRTN	MAP	CRTN	MAP	CRTN	MAP	CRTN	MAP
DAS	21,663	0.8484	16,995	0.9482	40,912	0.9601	184,187	0.9803	20,739	0.9402	4,668	0.2040
Levenshtein	21,218	0.7851	16,576	0.9084	40,297	0.8987	182,484	0.9450	19,656	0.8984	3,060	0.1226
JaroWinkler	21,052	0.7708	16,381	0.8541	39,957	0.9017	177,347	0.9109	19,095	0.8424	2,961	0.1183
MongeElkan	21,130	0.7749	16,514	0.8354	40,179	0.8685	182,110	0.9324	19,109	0.8093	4,439	0.0978
Bi-Gram	21,379	0.8113	16,635	0.8871	40,253	0.8899	181,984	0.9337	19,923	0.8606	4,241	0.1734
Jaccard	21,204	0.6713	15,528	0.6542	38,938	0.7172	180,884	0.6735	19,998	0.6306	2,131	0.1821

MAP metric results and correctly retrieved toponym names (CRTN) values were used for performance comparison. The CRTN value was calculated depending on any toponym data having at least one alternative name. The DAS string-matching measure obtained the best MAP results in datasets of Turkey, Germany, France, Russia, Egypt and the USA as follows: 0.8484, 0.9482, 0.9601, 0.9803, 0.9402 and 0.2040, respectively. Considering CRTN numbers for each language dataset, the DAS measure found 284, 360, 615, 1703, 1083 and 229 more alternative names compared with the measure having the closest performance to DAS measure.

Figure 4 presents the precision–recall curves for toponym-matching experiments of each dataset. A higher area under the curve represents both higher precision and higher recall. According to these results, we can conclude that the DAS measure not only increased the CRTN values in six datasets, but also obtained the best results for overall system accuracy by increasing the MAP values in a consistent manner. As a result, the success of DAS string-matching measure is proven.

Figure 4.

Toponym-matching results for the Turkey, Germany, France, Egypt, Russia and USA datasets.

As can be seen in Table 6, the DAS measure has the best retrieval performance and highest MAP values in the six datasets. Only the MAP results of America dataset are differentiated from the others. It can be seen from Table 5 that only the dataset of America has fewer alternative names than its original toponym data. Furthermore, in the America dataset, most of the toponyms do not have correct alternative names. Therefore, correctly matched alternative names and the MAP values are decreased for all similarity measures, including DAS. Considering that the languages selected differ from each other in terms of their structures, it has also been proven that a single string similarity measure can be developed specific to the toponym-matching task regardless of the language.

6. Conclusion

In this paper, detailed experimental studies were conducted on the subject of toponym matching, which is a new domain where ASM can be performed, and it was intended to create a single string-matching measure that can perform the toponym-matching process regardless of the language. For this purpose, a string-matching measure named DAS was developed and proposed.

An experimental information retrieval-based evaluation system was created in order to evaluate the performance of the DAS-matching measure compared with other measures. The DAS string-matching measure was continuously modified by performing hundreds of different experiments in toponym datasets. After performing all modifications, it has become a three-stage measure specific to the toponym-matching task. In particular, the calculations in the Name Similarity Measure stage and computing the optimum value of the Threshold parameter using the Grid Search algorithm positively affected the performance of DAS measure. As can be seen in Figure 4 and Table 6, according to comparisons with other similarity measures, the DAS measure had the best retrieval performance and the highest MAP values. As a result, considering the structural differences of languages selected for experimental tests, it has been proven that a single string similarity measure can be created specific to the toponym-matching task regardless of the language.

One of the future directions for the DAS measure consists of researching practical application areas. For example, GIR is one of the most appropriate application areas for DAS. We would like to integrate the DAS measure into a GIR-based application to be used in the toponym disambiguation task. In addition, the task of entity name matching might also be appropriate for the DAS measure. Specifically, patient name matching is a convenient task to evaluate and compare DAS with well-known similarity measures in the medical record linkage area. Another future work is researching the effect of structural differences for languages on the similarity score computation by considering the computational linguistics. Finally, the word similarity stage of DAS can be improved using edit-based, q-gram or pattern matching approaches.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

References

You

Park

Kim

. An efficient frequent melody indexing method to improve the performance of query-by-humming systems. Journal of Information Science 2008; 34(6): 777–798.

Bilenko

Money

. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, August, 2003, pp. 39–48.

Karen

Ramsayb

. Now what was that password again? A more flexible way of identifying and authenticating our seniors. Behaviour and Information Technology 2007; 26(4): 309–322.

Won

Park

Yoon

Kim

. An efficient approach for sequence matching in large DNA databases. Journal of Information Science 2006; 32(1): 88–104.

Sarawagi

Bhamidipaty

. Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, Alberta, Canada. New York: ACM 2002, pp. 269–278.

Tejada

Knoblock

. Learning domain independent string transformation weights for high accuracy object identification. In: Proceedings of international conference on knowledge discovery and data mining (SIGKDD), 2002, pp. 350–359.

Bunke

Fàbregas

Kandel

. Rule-based fuzzy object similarity. Mathware & Soft Computing 2001; 8(2): 113–128.

Cohen

Ravikumar

Fienberg

. A comparison of string distance metrics for name-matching tasks. In: Proceedings of IJCAI-03 Workshop on Information Integration, 2003, pp. 73–78.

Bhattacharya

Getoor

. A latent Dirichlet allocation model for entity resolution. In: Proceedings of the SIAM conference on data mining (SDM), 2006.

10.

Blei

Jordan

. Latent Dirichlet allocation. Journal of Machine Learning Research 2003; 3: 993–1022.

11.

Al-Ssulami

. Hybrid string matching algorithm with a pivot. Journal of Information Science 2015; 41(1): 82–88.

12.

Christen

. A comparison of personal name matching: Techniques and practical issues. In: Data mining workshops, sixth IEEE international conference on data mining. Hong Kong, December, 2006, pp. 290–294.

13.

Hill

. Core elements of digital gazetteers: Placenames, categories, and footprints. In: Proceedings of the of the 4th European conference on research and advanced technology for digital libraries, London, 2000, pp. 280–290.

14.

Sehgal

Getoor

Viechnicki

. Entity resolution in geospatial data integration. In: Proceedings of the 14th annual ACM international symposium on advances in geographic information systems, Arlington, VA, 2006, pp. 83–90.

15.

Martins

. A supervised machine learning approach for duplicate detection over gazetteer records. In: Proceedings of the 4th international conference on geospatial semantics. Berlin: Springer, 2011; pp. 34–51.

16.

Hastings

. Automated conflation of digital gazetteer data. International Journal of Geographical Information Science 2008; 22: 1109–1127.

17.

Recchia

Louwerse

. A comparison of string similarity measures for toponym matching. In: Proceedings of the first ACM SIGSPATIAL international workshop on computational models of place. Orlando, FL: ACM, 2013, pp. 54–61.

18.

Winkler

. The state of record linkage and current research problems. Statistics of Income Division. Internal Revenue Service Publication R99/04, http://www.census.gov/srd/www/byname.html

19.

Jaro

. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 1989; 89: 414–420.

20.

Jaro

. Probabilistic linkage of large public health data files. [discussion, pp. 687–689] Statistics in Medicine 1995; 14: 491–498.

21.

Levenshtein Vladimir

. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady February 1966; 10 (8): 707–710.

22.

Monge

Elkan

. The field-matching problem: Algorithm and applications. In: Proceedings of ACM SIGKDD. Menlo Park, CA: AAAI Press, 1996, pp. 267–270.

23.

Monge

Elkan

. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: The proceedings of the SIGMOD workshop on data mining and knowledge discovery, 1997.

24.

Lennon

Peirce

Tarry

Willett

. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science 1981; 3(4): 177–183.

25.

Sutinen

. Approximate pattern matching with the q-gram family. PhD thesis, Department of Computer Science, University of Helsinki. Finland, 1998.

26.

Thisted

. Elements of statistical computing. London: Chapman and Hall, 1988.

27.

Ricardo

Berthier

. Modern information retrieval. New York: ACM Press, 1999, pp. 189–190.

28.

Manning

Raghavan

Schütze

. An introduction to information retrieval, 1st edn. Cambridge: Cambridge University Press, 2009.