Abstract
Duplicate record detection has been an important issue in the fields of data and records management and various detection methods have been proposed. A new method, which uses an optical character recognition (OCR)-converted source of information for record matching to detect duplicates, is proposed and examined in this paper. First, the design of an experiment for examining the performance of such a duplicate detection method is discussed. The base record set with an OCR-converted title page and its verso were prepared along with two test record sets from different union catalogues, and duplicate records between the base record set and the test sets were manually identified. A duplicate detection system was developed to execute matching (1) between records, (2) between a record and an OCR-converted source of information and (3) using a combination of these. Second, matching performance at the individual data element level is examined. Third, the performance of duplicate record detection based on matching at the element level is examined through rule-based detection and machine learning-based detection. The results of the experiment show the usefulness of incorporating source of information into duplicate detection to a certain extent.
1. Introduction
Duplicate record detection has been an important issue in the data and records management fields. Various methods have been used to detect duplicate bibliographic records in library catalogues, where duplication is a crucial issue in union catalogues as well as in single catalogues. There are many large-scale union catalogues, such as WorldCat of the Online Computer Library Center (OCLC), COPAC in the UK, Libraries Australia in Australia, and Unicanet and NACSIS-CAT in Japan. Effective and efficient duplicate detection is a core issue in managing union catalogues, and specifically tailored methods to detect and consolidate duplicate records have been implemented. These methods depend on the record schemas and formats, and involve the cataloguing codes and rule interpretations adopted in the union catalogues.
All the duplicate detection methods developed thus far use only bibliographic records, that is, matching between records. In contrast, I propose the use of copies of source of information: optical character recognition (OCR)-converted title pages and versos of the title pages for book resources, as well as records themselves, to match between records. If a bibliographic record A has an OCR-converted source of information, matching between another record B and the source of information of record A can be conducted, as can matching between the two records A and B; judgment on whether the two records are duplicate can be done with greater confidence. Occasionally, records lack data values or contain erroneously recorded values. The copies of a source of information would be robust for such incompleteness and instability of records, and would therefore complement records in matching.
A source of information such as the title page of a book and its verso contains a variety of data that are not recorded in a bibliographic record, such as various place names, organization names and dates. Furthermore, these sources of information cannot cover all the data elements that comprise a bibliographic record; they actually cover only the descriptive portion of a record.
The purpose of this study is to examine the performance of duplicate detection through record matching with the combined use of an OCR-converted source of information. First, the present article reports the design of an experiment to examine the performance of such a duplicate detection method that involves preparing test record sets and developing a duplicate detection system. Second, it reports on the results of the experiment, showing performance at the individual data element level. Third, it reports on the performance of duplicate record detection based on element-level matching through rule-based detection and machine learning-based detection.
Several digitization projects for printed books have been conducted, including Google Books, Internet Archive’s Open Library and several national libraries’ digitization projects. It is possible to utilize these digitized sources of information for duplicate detection.
No studies have examined performance using an OCR-converted source of information to detect duplicate records. To this point, duplicate detection methods have been studied only to match between records themselves. In the field of library catalogues, there have been many preceding studies, including those by Cousins [1], Lazinger [2], Ridley [3], Sitas and Kapidakis [4], and Toney [5]. O’Neill et al. investigated the characteristics of duplicate bibliographic records [6], and Thornburg [7] and Thornburg and Oskins [8] provided a thorough review of issues on duplicate records and their detection. Many other studies were conducted in the late 1970s and 1980s, including those by Hickey and Rypka [9] and MacLaury [10].
For metadata other than bibliographic records in library catalogues, many studies have been conducted, including the application of machine learning. Elmagarmid et al. [11] and Winkler [12] reviewed those studies comprehensively.
Meanwhile, research on expert systems for cataloguing is related to the present study because of the use of scanned title pages and other sources. Davies [13], Jeng [14, 15] and Molto and Svenonius [16] investigated the problems of title page format recognition, including automatic identification of data strings from title pages, to design an automated cataloguing system. However, no system implementation was attempted in these studies. On the other hand, an OCLC research team implemented a system that accepts the scanned title page of a book and associates the data strings that appear in the title page with data elements such as the title proper, other title information and statements of responsibility [17, 18]. Rath and Prasad proposed a similar method for the automatic identification of data strings from title pages of books and their versos and developed a program to examine the performance of the method [19, 20]. Taniguchi proposed using an OCR-converted source of information to support recording evidence for data values in bibliographic records and developed a prototype system to support catalogers in recording evidence by identifying the appropriate strings within OCR-converted title pages of books and their verso pages [21, 22].
2. Design of the experiment
2.1. Test sets of bibliographic records
I developed the following test sets for this study: (1) a base set of bibliographic records (hereafter, base records) and their OCR-converted source of information; and (2) two other sets of records to be matched with the base records. The two other sets come from different union catalogues: WorldCat and Libraries Australia. Excerpts of records from the three distinct sets are shown in the Appendix.
2.1.1. Base records with an OCR-converted source of information
A collection of selected volumes of books was assembled. First, 255 books were selected from the collection of the Library and Information Science Library of the University of Tsukuba, in Japan. These books were all in English, were published after 1950 and were located at the first position (second from the top) in each section of the book stacks in the library. For each of these 255 books, the title page and its verso, which are the main sources of information for books prescribed in the Anglo-American Cataloguing Rules (AACR), were scanned and the scanned images were transformed into plain text files with OCR software. During this step, OCR errors were corrected, and the library stamps appearing on the title pages and their versos were excluded. Some versos contained Cataloging-in-Publication data; others did not. Other prescribed sources of information, such as other preliminary pages and series title pages, were not used.
After selection, I obtained bibliographic records corresponding to these books from the library’s catalogue database. Those records were encoded in a format adopted by NACSIS-CAT, a union catalogue system provided by the National Institute of Informatics in Japan. This encoding format is slightly different from MARC21. Some records were created by the library, while others were transferred from records created by other Japanese libraries or foreign national libraries, including the Library of Congress in the USA. Records for these books should be in line with AACR, although some different rule interpretations have been adopted, including NACSIS-CAT’s own coding manuals.
2.1.2. Record set 1
WorldCat records were obtained by searching OCLC’s FirstSearch retrieval system in 2007. A new WorldCat service began in 2007, and at that time WorldCat database search via FirstSearch terminated. A search with title information – a perfect match with title proper, other title information or a varying form of the title – was executed for every individual base record, and hit records were downloaded as candidates that might include records representing the same book as the base record. A program was used to confirm that no record was downloaded more than once across different hit record groups. Overall, 255 record groups corresponding to individual base records, 1123 records in total, were downloaded; each record was encoded in FirstSearch’s own format, which is different from MARC21. For each record of the 255 downloaded record groups, I manually determined whether the record represented the same book as the base record, that is, whether the records were duplicates. This manual judgment was conducted with reference to ALCTS’s ‘Differences between, changes within’, revised edition [23]. I judged dubious cases to be duplicates. Consequently, 400 records in total were judged to be duplicates of the base records. Table 1 presents the basic data for record set 1. Recently, OCLC reported that a considerable number of duplicate records were contained in the WorldCat database at that time; afterwards, powerful new duplicate detection and resolution software was incorporated, and consequently a large number of duplicates were merged [24].
Table 1. Outline of record sets that were matched
2.1.3. Record set 2
Libraries Australia is the national union catalogue of Australia. I searched the union catalogue database in 2008 using the title information of each base record and obtained hit records as candidates that might include records representing the same book as the base record. A program was used to confirm that no record was downloaded more than once across different hit record groups. A total of 255 record groups, 1758 records, were downloaded; each record was encoded in the system's own format rather than in MARC21. For each record of the 255 groups, in the same manner as for record set 1, I manually determined whether the record represented the same book as the base record. Consequently, 389 records in total were judged to be duplicates of the base records. Table 1 presents the basic data for record set 2. After I obtained these records from the search system, a new Libraries Australia system replaced the aforementioned system.
Record set 1 comprises a total of 1123 records and contains 400 duplicate records for the base records, as mentioned above. There are therefore 400 duplicate record pairs between the base record set and record set 1. Non-duplicate record pairs can also be made across record groups, that is, between a base record and a record from a group other than its own. This gives 285,965 non-duplicate record pairs between the two record sets, derived from the calculation (1123 records × 255 groups) − 400 pairs. Similarly, record set 2, consisting of 1758 records in total and containing 389 duplicate records for the base records, has 389 duplicate record pairs and 447,901 non-duplicate record pairs against the base record set.
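The pair counts above follow directly from the set sizes. The short check below reproduces the arithmetic; the function name `pair_counts` is illustrative and not part of the original system.

```python
def pair_counts(n_records: int, n_base: int, n_duplicates: int) -> tuple:
    """Return (duplicate pairs, non-duplicate pairs) against the base set.

    Every record of the test set can be paired with every base record;
    the pairs that are not duplicates are counted as non-duplicate pairs.
    """
    all_pairs = n_records * n_base
    return n_duplicates, all_pairs - n_duplicates


set1 = pair_counts(n_records=1123, n_base=255, n_duplicates=400)
set2 = pair_counts(n_records=1758, n_base=255, n_duplicates=389)
print(set1)  # (400, 285965)
print(set2)  # (389, 447901)
```

The imbalance between the two classes (a few hundred duplicate pairs against several hundred thousand non-duplicate pairs) is worth keeping in mind when reading the agreement measures reported later.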
2.2. System development
A system was developed to execute duplicate detection experiments. The system detects duplicate records between the base records and either of the other record sets by (1) matching both records, (2) matching the OCR-converted source of information for the base record with records of the other record sets, or (3) combining the matching of (1) and (2).
The following are details of the system’s function:
Extracting data element values in specified fields. Data values corresponding to LCCN, ISBN, title (title proper and other title information), statement of responsibility, author heading (main entry heading and added entry heading), edition statement, place of publication, publisher, year of publication, pagination, series title and series number were extracted from the records in the three different formats. When more than one value was recorded for a data element, such as other title information, statement of responsibility or place of publication, each value was considered an individual, separate value. On the other hand, series title and author heading were multiple-occurrence fields in every format, and each occurrence was dealt with as an individual value.
Normalizing extracted data values. All symbols were omitted, and all added and supplied strings (‘(…)’ and ‘[…]’) were also omitted. Uppercase letters were converted to their lowercase equivalent. In addition, abbreviated words were transformed into full spellings as much as possible: for example, ‘introd.’ was transformed into ‘introduction’, and ‘pub. co.’ was transformed into ‘publishing company’.
Matching between normalized data values. When either of the two normalized strings was contained within the other, the strings were treated as matched; in other words, both exact match and truncated match were applied to the two normalized strings. For example, publisher names ‘Prentice Hall’ and ‘PTR Prentice Hall’ matched, and author headings ‘Davis, Alan Mark’ and ‘Davis, Alan M. (Alan Michael), 1949–’ matched. On the other hand, ‘E. Horwood’ and ‘Ellis Horwood’ did not match. For pagination, only the largest portion was adopted (e.g. ‘160’ from ‘vi, 160 p.’), and two values were considered a match if they were within 10 pages of each other. Matching at the data value level was two-valued: matched or not matched. Partial matching was not adopted.
Matching at the data element level. Judgment on matching at the data element level between two records or between a record and an OCR-converted source of information was two-valued: matched or not matched. When a record had one data value for an element in question and the other record had more than one data value for the element, they were judged as matched if any data value pair between the records for the element matched. Similarly, when both records had more than one data value for the identical element, they were judged as matched if any data value pair between the records for the element matched; they were judged as not matched if none of the data value pairs between the records for the element matched. For matching with a source of information, the same manner as for matching between records was applied. In addition, matching with only the first data value for an element in question in a record of sets 1 and 2 was implemented as a choice of the experiment; all data values for the element were adopted in the base record.
Judgment on duplicates at the record level. Several rule-based judgments and machine learning-based judgments were implemented to determine whether a record pair, that is, a pair consisting of a base record and a record in either set 1 or set 2, was a duplicate. The judgment was two-valued: duplicate or not duplicate.
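The normalization and matching steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system used in the experiment: the abbreviation table holds only the two examples given in the text, the order of the normalization steps is a guess, 'largest portion' of a pagination is read as the largest Arabic number, and plain substring containment is used for the truncated match. That containment test reproduces the publisher examples; the author heading example presumably relies on handling that the text does not describe and is not modelled here.

```python
import re
from typing import Optional, Sequence

# Hypothetical abbreviation table; only these two examples appear in the text.
ABBREVIATIONS = {"introd.": "introduction", "pub. co.": "publishing company"}


def normalize(value: str) -> str:
    """Lowercase, drop (...) and [...] additions, expand abbreviations, drop symbols."""
    text = value.lower()
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text)
    for abbreviation, full in ABBREVIATIONS.items():
        text = text.replace(abbreviation, full)
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def values_match(a: str, b: str) -> bool:
    """Two-valued match: one normalized string is contained in the other."""
    na, nb = normalize(a), normalize(b)
    return bool(na) and bool(nb) and (na in nb or nb in na)


def main_pagination(value: str) -> Optional[int]:
    """Largest Arabic number in a pagination statement, e.g. 160 from 'vi, 160 p.'."""
    numbers = [int(n) for n in re.findall(r"\d+", value)]
    return max(numbers) if numbers else None


def pagination_match(a: str, b: str, tolerance: int = 10) -> bool:
    """Match when the main paginations are within `tolerance` pages of each other."""
    pa, pb = main_pagination(a), main_pagination(b)
    return pa is not None and pb is not None and abs(pa - pb) <= tolerance


def element_match(base_values: Sequence[str], test_values: Sequence[str],
                  first_only: bool = False) -> bool:
    """Element-level match: any value pair matches.

    With first_only=True only the first value of the set 1 or set 2 record is
    used, while all values of the base record are kept.
    """
    if first_only:
        test_values = test_values[:1]
    return any(values_match(b, t) for b in base_values for t in test_values)


print(values_match("Prentice Hall", "PTR Prentice Hall"))  # True
print(values_match("E. Horwood", "Ellis Horwood"))         # False
print(pagination_match("vi, 160 p.", "168 p."))            # True
```

An empty list of values yields no match in `element_match`, which corresponds to the 'missing value' case counted separately in the results.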
3. Results at the data element level
3.1. Experiment modes and their choices
In this section, I report experiment results from the matching at the individual element level. The experiment was conducted using modes A–C, which are shown in Figure 1, and Tables 2 and 3 present the results for record sets 1 and 2, respectively.

Figure 1. Experiment modes A and B.
Table 2. Matching performance in record set 1 (tp, title page).
Table 3. Matching performance in record set 2 (tp, title page).
3.1.1. Mode A: Matching between a base record and a record of either of two sets
This is conventional duplicate record detection, although records in this study are represented in different formats. This provides baseline performance. The choices used for matching in this mode are either (1) all data values of an element or (2) only the value that first appears for the element in a record of either set 1 or set 2. In Tables 2 and 3, the rows for ‘mode A’ under the individual data elements show the matching results for the choice that demonstrated better performance in this mode. ‘All’ means the use of all data values, while ‘first’ means the use of the first value only. When performance was the same between the two choices, ‘all, first’ is recorded in the rows of the tables.
The results of the experiment in this mode were summed up using individual data elements in order to determine which elements are useful to detect duplicates. For the duplicate record pairs made from the base record set and either of set 1 or 2, the numbers of matched record pairs, non-matched pairs, and missing values were counted for individual data elements. The non-matched pairs imply cases where data values were found for a given element in a record of set 1 or 2, but the data values did not match the base record or no data value was recorded for the element in the base record. In the case of the missing values, no data value was found for a given element in a record of set 1 or 2, regardless of whether a data value was recorded for the element of the base record. Similarly, for non-duplicate record pairs, the numbers of matched record pairs, non-matched pairs and missing values were counted. Among these categories, the non-matched pairs and missing values within the duplicate pairs were cases of matching failure. In contrast, the matched pairs within the non-duplicate pairs were cases of false match. Using these numbers, the kappa (k) coefficient was calculated to measure the agreement between the categories of two axes: (1) duplicates or non-duplicates and (2) matched or not matched/missing. Precision, recall ratio and F-measure were also calculated in the macro-averaging manner, that is, they were averaged over the matched results for each base record (rather than micro-averaging, which uses the simple pooled contingency table across base records).
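The measures named above can be sketched as follows. The text does not give the formulas, so this sketch assumes Cohen's kappa on the 2 × 2 table (duplicate or non-duplicate against matched or not matched/missing) and the usual definitions of precision, recall and F-measure. Treating an undefined ratio as 0 and averaging the per-group F-measures are further assumptions, and all names are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Counts:
    tp: int  # matched, within duplicate pairs
    fn: int  # not matched or missing, within duplicate pairs (matching failure)
    fp: int  # matched, within non-duplicate pairs (false match)
    tn: int  # not matched or missing, within non-duplicate pairs


def kappa(c: Counts) -> float:
    """Cohen's kappa for the 2 x 2 table; undefined when expected agreement is 1."""
    n = c.tp + c.fn + c.fp + c.tn
    observed = (c.tp + c.tn) / n
    expected = ((c.tp + c.fn) * (c.tp + c.fp)
                + (c.fp + c.tn) * (c.fn + c.tn)) / n ** 2
    return (observed - expected) / (1 - expected)


def prf(c: Counts) -> Tuple[float, float, float]:
    """Precision, recall and F-measure; 0.0 where a denominator is zero (assumption)."""
    p = c.tp / (c.tp + c.fp) if c.tp + c.fp else 0.0
    r = c.tp / (c.tp + c.fn) if c.tp + c.fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(groups: Sequence[Counts]) -> Tuple[float, ...]:
    """Average the per-base-record scores (macro-averaging)."""
    scores = [prf(g) for g in groups]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


def micro_average(groups: Sequence[Counts]) -> Tuple[float, float, float]:
    """Pool the contingency tables across base records, then score once."""
    pooled = Counts(sum(g.tp for g in groups), sum(g.fn for g in groups),
                    sum(g.fp for g in groups), sum(g.tn for g in groups))
    return prf(pooled)


# Two made-up groups, only to show that the two averages differ.
groups = [Counts(tp=1, fn=0, fp=0, tn=3), Counts(tp=1, fn=1, fp=3, tn=5)]
print(macro_average(groups))
print(micro_average(groups))
```

Macro-averaging gives each base record equal weight, so a group with many false matches affects the result no more than a small, clean group; the pooled table used for kappa does not have this property.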
3.1.2. Mode B: Matching between an OCR-converted source of information for the base record and a record in either of the two sets
Four combinations of choices are available under this mode, arising from two binary choices: (1) the use of either a title page alone or a title page plus its verso; and (2) the use of either all data values of an element or only the first value, the latter being the same choice as in mode A. The rows labelled ‘mode B’ under the individual data elements in Tables 2 and 3 show the matching results for the combination that demonstrated the best performance in this mode. The ‘tp’ in the tables indicates the use of the title page only, while ‘tp + verso’ indicates the use of the title page plus its verso. When the performance was the same between these two choices, ‘tp, tp + verso’ was recorded in the rows of the tables.
The numbers of matched record pairs, non-matched pairs and missing values were counted in a manner similar to mode A. The non-matched pairs indicate cases where data values were found for a given element in a record of set 1 or 2, but where no string being matched was found in the source of information. The missing values are cases where no data value was found for a given element in a record of the set.
3.1.3. Mode C: Matching combining mode A with B
Mode A matching between two records is executed first and, if this matching does not succeed (i.e. they do not match), then Mode B matching is executed between the source of information for the base record and a record in either of the two sets. The choices in this mode are the same as those in mode B. The rows labelled ‘mode C’ under the individual data elements in the tables show the matching results for this mode that demonstrated better performance between the choices.
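The relationship between the three modes can be sketched as below. The sketch assumes that all strings have already been normalized and that a record value matches the source of information when it occurs as a substring of the OCR text; the function names and the sample strings (including the ISBN) are made up for illustration.

```python
from typing import Sequence


def contained(a: str, b: str) -> bool:
    """True when either non-empty string is contained in the other."""
    return bool(a) and bool(b) and (a in b or b in a)


def build_source(title_page: str, verso: str = "", use_verso: bool = False) -> str:
    """OCR text used as the source: title page alone (tp) or tp + verso."""
    return f"{title_page} {verso}" if use_verso else title_page


def mode_a(base_values: Sequence[str], test_values: Sequence[str]) -> bool:
    """Record against record."""
    return any(contained(b, t) for b in base_values for t in test_values)


def mode_b(source: str, test_values: Sequence[str]) -> bool:
    """Record of set 1 or 2 against the OCR-converted source of the base record."""
    return any(bool(t) and t in source for t in test_values)


def mode_c(base_values: Sequence[str], source: str,
           test_values: Sequence[str]) -> bool:
    """Mode A first; mode B is tried only when mode A does not match."""
    return mode_a(base_values, test_values) or mode_b(source, test_values)


title_page = "software requirements alan m davis prentice hall"
verso = "isbn 0123456789 printed in the united states of america london sydney"
tp_only = build_source(title_page)
tp_verso = build_source(title_page, verso, use_verso=True)

# Base record lacks an ISBN: mode A fails, the verso rescues the match.
print(mode_a([], ["0123456789"]))            # False
print(mode_c([], tp_verso, ["0123456789"]))  # True

# A place name from an unrelated record also occurs on the verso: a false match.
print(mode_c(["englewood cliffs"], tp_verso, ["london"]))  # True
print(mode_c(["englewood cliffs"], tp_only, ["london"]))   # False
```

Because mode C is a logical OR of the two matchings, it can only add matches to those of mode A. It therefore recovers matching failures but also inherits every false match that mode B produces, which is consistent with the element-level results reported in this section.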
Among the above three modes, the individual elements with the highest k coefficients are marked with an asterisk in the tables.
3.2. Results for the individual data elements
3.2.1. LCCN
LCCN appeared in only record set 1; records in set 2 did not contain that element. There were very limited cases where more than one LCCN occurred in a record and thus the choice between using all LCCNs or only the first yielded the same results in all three modes. Further, LCCN usually appeared in the versos of title pages, and thus the choice to use verso pages in addition to title pages showed better performance in modes B and C. The k coefficient and other measures were relatively high when compared with other data elements, while both the base records and the records of set 1, to a certain extent, did not contain LCCN. Among the three modes, mode C produced the highest performance, which implies that record matching involving the source of information is effective. Mode B resulted in slightly worse performance.
3.2.2. ISBN
In record sets 1 and 2, the choice to use both the title page and its verso resulted in better performance, since ISBN usually appears on verso pages. In addition, the choice to use all values also resulted in better performance; in addition to the first ISBN, the second and third, if any, were useful. Mode C showed the highest measures, which indicates the usefulness of involving the source of information in matching. ISBN showed the highest results across the elements in the two record sets.
3.2.3. Title proper and other title information
After concatenating the title proper and other title information, such as a subtitle, matching was attempted; the condition for a match is either that two titles match or that a title string is contained within another title string. Even if neither of the records contains other title information, the matching succeeds. All modes showed better performance using the first value. Second and third values of other title information and other varying form titles were dealt with as separate individual titles, but this did not cause any improvement in performance. On the other hand, using the title page alone produced better performance in modes B and C, thereby indicating that the additional use of verso pages caused many false matches.
Many false matches – matches within non-duplicate pairs – occurred in both record sets because the sets were created by searching with title information. This was manifested in the high recall ratio. The k coefficient and other measures showed middle-range values. Among the three modes, in both record sets, mode B demonstrated the best performance because of the decrease in the number of false matches.
3.2.4. Statement of responsibility
For statement of responsibility (hereafter, SOR) in modes A and C, false matches occurred to a certain extent in both record sets and in many of the experiment choices. The use of the first value led to better performance in all modes. The second and third SORs served different roles from that of the first statement and these succeeding statements brought about many false matches within the non-duplicate record pairs. The use of the title page alone led to better performance, while mode C in set 2 performed at the same level when the verso page was also used. Many false matches led to low k coefficient values in modes A and C, whereas precision, recall ratio and F-measure in macro-averaging showed good performance. Mode B in both record sets demonstrated the best performance because of the reduced number of false matches.
3.2.5. Author heading
The use of the first author heading performed better than the use of all author headings. The results show their usefulness: all measures were relatively high. Modes B and C were not applied to author headings.
3.2.6. Edition statement
The same performance was shown regardless of whether all edition statements or only the first were used because second and third edition statements rarely occurred. Regarding the choice of using the title page or the title page and its verso page, the latter showed better performance, except in the case of mode B in set 2. Many false matches occurred in modes A and C, and there were many missing values. These led to very low performance in both sets. Mode B showed relatively better performance.
3.2.7. Place of publication
The use of the first value yielded better performance in all modes. The use of the title page showed better performance in modes B and C. The following caused a large number of false matches and thus very low performance in both sets: (a) the verso of the title page occasionally contains many place names other than the place of publication; (b) the second and third place names in a record match various place names in a verso page; and (c) a limited total number of different place names appear in the records and their source of information. In this context, the above combination of choices reduced the number of false matches. Consequently, mode A performed relatively better among the three modes.
3.2.8. Publisher
Publisher name also produced very low performance as a whole. The best combination of the choices was to adopt the first value and the title page alone. The reason for this is similar to that for publication place. Mode A was relatively better.
3.2.9. Year of publication
Publication year showed very low performance as a whole. In set 1, the best combination of the choices was to adopt all the values and the title page alone. In set 2, on the other hand, the best combination was to adopt the first value and the title page alone. Mode A was relatively better.
3.2.10. Pagination
Pagination had a lower performance. This may partially reflect the treatment that pagination within 10 pages was regarded as a match.
3.2.11. Series title
A limited number of records had a series title, and the same series title appeared in records a very limited number of times. Performance was very low. The choice to adopt the first series title was better. The choice to use the title page plus its verso was better in mode B, whereas in mode C, using the title page alone was better in both sets. Among the three modes, in set 1, mode B showed the best k coefficient and mode C the best F-measure. For set 2, mode A was best.
3.2.12. Series number
The same pattern occurred in both sets. In modes A and C, the use of all values performed better. However, in mode B, the use of the first value performed better because series numbers were often just numerical strings without prefixes, such as ‘volume’ and ‘no’, and therefore such numbers made false matches with various strings appearing in the versos of title pages. In order to avoid such false matches, the use of the first value was effective. Mode A was better in both sets among the three modes.
3.3. Discussion
In mode A, ISBN, title, SOR and author heading showed good performance in both sets, whereas edition statement, publication place, publication year and pagination showed poorer performance. Mode B showed good k coefficient and F-measure for ISBN, title and SOR. Mode C showed almost the same pattern as mode A.
When the three modes are compared, the elements for which mode A performed best were publication place, publisher, publication year and series number in both sets. Series title in set 2 was added to this element group. These elements showed lower performance in general and difficulties in mechanical matching between records and OCR-converted sources of information in modes B and C.
The elements for which mode B performed best were title, SOR and edition statement. Series title in set 1 was added to this element group. These elements, except for edition statement, showed medium or high performance in general, thus being useful for matching between records and OCR-converted sources of information.
The elements for which mode C was best were ISBN and LCCN. Comparing the performance in mode A with that in mode C, only LCCN, ISBN and edition statement showed an increase in k coefficient and F-measure in mode C. Series title in set 1 and SOR in set 2 showed an increase in F-measure. These indicate that the combination of matching between records and matching using their source of information did not necessarily increase performance, but reduced it in many cases. It is better to adopt a method that is appropriate for each element rather than to use mode C consistently for every element.
With regard to the choice between using all values or the first one only, results from all three modes indicated that all values of ISBN and series number, except the latter in mode B, were useful. Publication year in set 1 was added to this element group. On the other hand, the first value was useful in title, SOR, author heading, publication place, publisher, and series title. Publication year in set 2 and series number in mode B were added to this group.
The choice between using the title page or the title page and verso page provided clear results in modes B and C: using the title page alone was appropriate for title, SOR, publication place, publisher, publication year and series number, while using the title page plus its verso was appropriate for LCCN, ISBN and edition statement, and for series title in mode B.
4. Results at the record level
This section reports the experiment’s results for duplicate record detection based on element level matching, which was discussed in the previous section. The experiment was conducted using both rule-based duplicate detection and machine learning-based detection.
4.1. Experiment modes and their choices
Modes A, B and C, explained above, were tested with the following choices in both record sets. For mode A, matching between a base record and a record in either of the two sets, the first value of each data element was used, since this performed better as a whole at the individual element level than the use of all values, as previously discussed. This is considered the baseline performance. For mode B, matching between records and OCR-converted sources of information, the first value was used together with the title page plus its verso, as these choices showed relatively better performance at the data element level; limiting the source to the title page alone reduced performance substantially. For mode C, the combination of matching between records and matching between records and OCR-converted sources of information, the first value was used together with the title page only; these choices performed relatively better at the element level.
Mode D was newly introduced: this was the combination of each mode and its choice that showed the best performance for each data element. These are marked with an asterisk under each data element in Tables 2 and 3. Therefore, mode D adopted different combinations of modes and choices depending on record set 1 or 2. Mode D was expected to perform best among all the modes.
4.2. Rule-based detection
Five rules to detect duplicate records were created on the basis of the results from my preceding experiment with Japanese bibliographic records [25]. These rules are sufficient in scope to show the usefulness of incorporating the OCR-converted source of information, but are limited to rules that favour data available from OCR sources; other rules that incorporate data not available from OCR sources could be built.
If the following condition in a rule is satisfied, the record pair is judged to be duplicate:
Rule 1: LCCN | ISBN.
Rule 2: ISBN & title (i.e. title proper, other title information).
Rule 3: ISBN & (SOR | author heading).
Rule 4: Title & publisher & (edition | publication year).
Rule 5: Title & (SOR | author heading) & (edition | publication year).
Here, ‘&’ indicates logical conjunction and ‘|’ indicates logical disjunction. Rule 1 is applied only to record set 1 and, because no conjunct condition is added, rests on the premise that LCCN and ISBN are correctly recorded. Rules 2–5 are conjunctions of more than one data element. Other rules can be made, but the above rules cover the primary cases.
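As an illustration, Rules 1–5 can be written as Boolean predicates over element-level matching results. The following is a minimal sketch and not the detection system used in the experiment: the element names, the `apply_rule` helper and the treatment of a missing value as a non-match are assumptions made here for illustration, and title proper and other title information are collapsed into a single `title` flag.

```python
from typing import Callable, Dict, Mapping

# Element names are illustrative assumptions, not the system's identifiers.
ELEMENTS = ("lccn", "isbn", "title", "sor", "author_heading",
            "publisher", "edition", "year")


def _author(m: Mapping[str, bool]) -> bool:
    """(SOR | author heading)"""
    return m["sor"] or m["author_heading"]


def _issue(m: Mapping[str, bool]) -> bool:
    """(edition | publication year)"""
    return m["edition"] or m["year"]


RULES: Dict[int, Callable[[Mapping[str, bool]], bool]] = {
    1: lambda m: m["lccn"] or m["isbn"],
    2: lambda m: m["isbn"] and m["title"],
    3: lambda m: m["isbn"] and _author(m),
    4: lambda m: m["title"] and m["publisher"] and _issue(m),
    5: lambda m: m["title"] and _author(m) and _issue(m),
}


def apply_rule(rule_no: int, matches: Mapping[str, bool]) -> bool:
    """Judge one record pair under a single rule.

    Elements absent from `matches` are treated as not matched
    (an assumption of this sketch for missing values).
    """
    full = {name: bool(matches.get(name, False)) for name in ELEMENTS}
    return bool(RULES[rule_no](full))


# A pair whose title, publisher and publication year match.
pair = {"title": True, "publisher": True, "year": True}
print({n: apply_rule(n, pair) for n in RULES})
# {1: False, 2: False, 3: False, 4: True, 5: False}
```

Each rule is evaluated separately here, mirroring the way the rules are reported individually rather than as a single combined test.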
Rules 1–5 were applied to the two record sets in modes A–D and their results for sets 1 and 2 are presented in Tables 4 and 5, respectively. The tables show the numbers of matched pairs and non-matched/missing values within duplicate and non-duplicate pairs, as well as the k coefficient and the macro-averaged precision, recall ratio and F-measure over the 255 groups. Among modes A–C, excluding mode D, the mode that obtained the highest k coefficient under each rule is marked with an asterisk in the tables.
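The measures named above can be computed from the counts of correct and incorrect judgements. The sketch below assumes that the k coefficient is Cohen's kappa over the two-class judgement (duplicate or not) and that macro-averaging means averaging per-group precision, recall and F-measure; both readings, and all function names, are assumptions of this illustration rather than details confirmed by the text.

```python
from typing import Iterable, Tuple


def kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa for a 2x2 table of predicted vs. actual duplicates."""
    n = tp + fp + fn + tn
    if n == 0:
        return 0.0
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal totals of both judgements.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if p_chance == 1.0:
        return 0.0  # kappa is undefined here; this sketch returns 0.0
    return (p_observed - p_chance) / (1.0 - p_chance)


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F-measure, with 0.0 for empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(groups: Iterable[Tuple[int, int, int]]) -> Tuple[float, ...]:
    """Average per-group (precision, recall, F) over all groups.

    `groups` yields one (tp, fp, fn) triple per group of records.
    """
    scores = [precision_recall_f(tp, fp, fn) for tp, fp, fn in groups]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
```

For example, `kappa(20, 5, 10, 15)` gives an observed agreement of 0.7 against a chance agreement of 0.5, that is, a coefficient of 0.4.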
Table 4. Duplicate detection performance under the rule-based approach in record set 1
Table 5. Duplicate detection performance under the rule-based approach in record set 2
Rule 1 was applied to record set 1 and performed better than LCCN or ISBN used alone throughout modes A–C, as shown in Tables 2 and 3. Mode A, the baseline, demonstrated relatively high performance: a k coefficient of 0.801 and an F-measure of 0.809. Mode B demonstrated a better k coefficient than mode A, but a worse F-measure. Mode C had the same performance as mode A. Mode D, the best combination of the modes and their choices at the element level, showed the highest performance, as expected.
Rules 2 and 3 comprise ISBN and other elements: rule 2 uses titles and rule 3 uses author names. Both rules lowered performance compared with the cases using ISBN alone, because incorporating elements other than ISBN introduced matching failures. If we compare mode A’s performance with that of mode B under rule 2, mode B performed better in both record sets. In contrast, under rule 3, mode A performed better than mode B. These differences stemmed from the elements combined with ISBN, that is, titles under rule 2 and author names under rule 3. Mode B did not cause a large decrease in matching performance. Mode C showed an increase in performance from modes A and B under rule 3, while under rule 2 it showed an increase from mode A and a decrease from mode B. Mode D demonstrated the highest performance in both sets.
Rules 4 and 5 comprise titles and other elements. The difference between these two rules is that rule 4 uses the publisher, while rule 5 uses the author. Both rules demonstrated much better performance than individual elements used alone, as shown in Tables 2 and 3. Comparing mode A’s performance with that of mode B under rule 4, mode B showed better performance in both record sets, whereas under rule 5, mode A performed better. This depended on the element-level performance of the publisher or the author. Mode B did not cause a large decrease in matching performance. Mode C showed an increase in performance from modes A and B under both rules. Mode D did not necessarily show good performance under either rule.
Figure 2 presents the k coefficients of Rules 1–5 for the two record sets. It must be noted that, under each rule, mode B did not cause a large decrease in performance from mode A, and under some rules the mode showed better performance, although it had shown much lower performance for some elements at the data element level, as shown in Tables 2 and 3. Under some rules, mode C showed better performance than modes A and B, whereas it had shown lower performance in most elements than either mode A or mode B. Mode D, the best combination at the data element level, generally produced higher performance than the other modes, with some exceptions.

Figure 2. Duplicate detection performance under the rule-based approach (k coefficient).
4.3. Machine learning-based detection
Supervised machine learning was applied to duplicate detection using Weka, a widely used machine learning tool that implements many types of learning algorithms [26]. The experiment was a two-class classification, either duplicate or not, and drew on the matching results for data elements between records and between records and sources of information.
The following learning algorithms, which are commonly used, were applied to the test collection:
(1) Decision Tree – in Weka, the algorithm is J4.8;
(2) Naïve Bayes;
(3) Random Forest;
(4) SVM (Support Vector Machine), with linear kernel method;
(5) AdaBoost with Decision Stumps;
(6) Bagging with REPTree.
Algorithms 5 and 6 are meta-learning methods that use Decision Stumps and REPTree, respectively, as base-level algorithms. For all algorithms, 10-fold cross-validation was applied: the data is divided randomly into 10 parts; each in turn is used for testing and the remainder is used for training. Learning and testing were executed a total of 10 times, and performance estimates were averaged to yield an overall estimate. We can compare the values of these measures with the result from the rule-based detection. Table 6 presents the k coefficients for individual learning algorithms; their average F-measure values for the 10 executions were very similar to those of the k coefficient and therefore are not described in the table. Among modes A–C, excluding mode D, the highest k coefficient under individual algorithms is marked with an asterisk in the table.
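The 10-fold cross-validation procedure described above can be sketched as follows. This is a minimal illustration of the procedure itself and does not reproduce Weka: the function names are hypothetical, the majority-label learner is only a stand-in for the algorithms listed, and the random split is not stratified by class.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One example: (tuple of 0/1 element-match flags, 1 if duplicate else 0).
Example = Tuple[Tuple[int, ...], int]


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k near-equal parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(data: Sequence[Example],
                   train: Callable[[List[Example]], object],
                   score: Callable[[object, List[Example]], float],
                   k: int = 10, seed: int = 0) -> float:
    """Average `score` over k train/test splits (assumes len(data) >= k)."""
    estimates = []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        test_set = [data[i] for i in test_idx]
        train_set = [ex for i, ex in enumerate(data) if i not in held_out]
        model = train(train_set)            # learn on the other k-1 parts
        estimates.append(score(model, test_set))
    return sum(estimates) / len(estimates)  # overall estimate


def train_majority(rows: List[Example]) -> int:
    """Stand-in learner: always predict the most frequent training label."""
    positives = sum(label for _, label in rows)
    return int(positives * 2 >= len(rows))


def accuracy(model: object, rows: List[Example]) -> float:
    """Share of examples whose label equals the constant prediction."""
    return sum(model == label for _, label in rows) / len(rows)
```

A call such as `cross_validate(data, train_majority, accuracy)` returns the mean of the 10 per-fold scores; any other learner and any other measure, such as the k coefficient, could be passed in the same way.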
Table 6. Duplicate detection performance under the machine learning-based approach (k coefficient)
All algorithms performed well in general, regardless of their differences. On average, Decision Tree J4.8, Random Forest and Bagging showed relatively higher values than Naïve Bayes, SVM and AdaBoost. If we compare these performances with those of the rule-based detection presented in Tables 4 and 5, most machine learning-based detection performed better in all modes and in both record sets. The machine learning methods evidently developed appropriate decision criteria of their own for these record sets, even though their performance was averaged over the cross-validation runs. This also implies that duplicate detection is an issue to which machine learning can be applied effectively.
In both record sets, mode B performed more poorly than mode A under most algorithms, but the decrease in performance was small. Mode B under Naïve Bayes in both sets and under AdaBoost in set 1 showed better performance than mode A. In contrast, mode C performed better than either mode A or B in most algorithms, thereby indicating that the combination of matching between records and between records and their sources of information at the element level is effective in machine learning duplicate detection. Mode D showed better performance than, or almost the same as, modes A–C; in other words, mode D did not necessarily increase performance.
Of course, the decision trees algorithmically built from modes A–D in record sets 1 and 2 were different from each other and also different from the rules adopted above. For example, mode A in set 1 created the following decision tree rule:
if ( (title & year & page) |
(title & year & ¬ page & author-heading) |
(title & ¬ year & page & place & ISBN) |
(title & ¬ year & page & place & ¬ ISBN & LCCN) |
(¬ title & author-heading & year & place) )
then the pair is duplicate.
where ‘¬’ indicates negation and ‘¬ title’ indicates that titles do not match.
In contrast, mode B in set 1 created the following tree rule:
if ( (title & year & place & ISBN) |
(title & year & place & ¬ ISBN & publisher) |
(title & ¬ year & ¬ place & ISBN & publisher) )
then the pair is duplicate.
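For comparison with the hand-made rules, the two tree rules above can be transcribed directly into Boolean functions. This is a literal transcription for illustration only; the function names and dictionary keys are assumptions, and every flag is expected to be present in the input.

```python
from typing import Mapping


def tree_mode_a(m: Mapping[str, bool]) -> bool:
    """Tree rule reported for mode A in record set 1."""
    t, y, pg = m["title"], m["year"], m["page"]
    au, pl = m["author_heading"], m["place"]
    isbn, lccn = m["isbn"], m["lccn"]
    return bool(
        (t and y and pg)
        or (t and y and not pg and au)
        or (t and not y and pg and pl and isbn)
        or (t and not y and pg and pl and not isbn and lccn)
        or (not t and au and y and pl)
    )


def tree_mode_b(m: Mapping[str, bool]) -> bool:
    """Tree rule reported for mode B in record set 1."""
    t, y, pl = m["title"], m["year"], m["place"]
    isbn, pub = m["isbn"], m["publisher"]
    return bool(
        (t and y and pl and isbn)
        or (t and y and pl and not isbn and pub)
        or (t and not y and not pl and isbn and pub)
    )


def flags(*matched: str) -> dict:
    """Helper for the examples: named elements match, all others do not."""
    names = ("title", "year", "page", "author_heading", "place",
             "isbn", "lccn", "publisher")
    return {name: name in matched for name in names}


print(tree_mode_a(flags("title", "year", "page")))        # True
print(tree_mode_b(flags("title", "year", "page")))        # False
```

The same pair can thus be judged differently by the two trees: the mode A tree relies on page and author heading matches, whereas the mode B tree relies on place, ISBN and publisher.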
5. Conclusion
In this study, a method to incorporate matching between a record and a source of information (i.e. a scanned and OCR-converted title page and its verso) into duplicate record detection was proposed and tested. The following are the results of the experiment conducted in the study:
At the data element level, some elements, such as title, SOR and edition statement, were useful for matching between a record and a source of information. However, the combination of matching between records and between records and their sources of information did not necessarily increase performance, but rather reduced it, except for some elements. The choice between using the title page or the title page plus its verso depended on the element: the title page alone was appropriate for title, SOR, publication place, publisher, publication year and series number, whereas the title page plus its verso was appropriate for LCCN, ISBN and edition statement.
At the record level, the duplicate detection performance of the rule-based approach depended on the elements comprising the rules. Under some rules adopted in this study, matching between a record and a source of information showed good detection performance. Furthermore, the combination of matching between records and between records and their sources of information produced high detection performance under some rules.
Machine learning-based duplicate record detection performed well in general, regardless of differences in learning algorithms. On average, Decision Tree J4.8, Random Forest and Bagging showed relatively high performance. Furthermore, the combination of matching between records and between records and their sources of information at the element level is useful in machine learning methods.
Therefore, we may reasonably conclude that incorporating the source of information into duplicate detection is somewhat effective. The copies of a source of information are robust against incompleteness and instability of records, which occasionally lack data values or contain erroneously recorded values, and thus complement records in duplicate detection, provided we use such information sources properly.
A further method could be examined that (1) assigns weights according to the data elements being matched and the matching conditions, such as perfect match or truncated match (that is, assigns graded values to matched elements), and (2) regards a record as a duplicate if it receives a weight equal to or above a threshold. This method was adopted by the Melvyl Union Catalog of the University of California Libraries, and a similar algorithm has been proposed by Coyle [27]. However, such a matching scheme cannot be applied to an OCR-converted source of information in a straightforward manner, because the source of information has no delimiters and is not partitioned into elements.
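The weighting scheme outlined above might be sketched as follows for matching between records. The weights, the threshold and all names are hypothetical values chosen for illustration; they are not those of the Melvyl Union Catalog or of Coyle's algorithm.

```python
from typing import Dict, Mapping

# Hypothetical weights per element and matching condition (illustration only).
WEIGHTS: Dict[str, Dict[str, int]] = {
    "title": {"perfect": 5, "truncated": 3},
    "isbn": {"perfect": 6},
    "publisher": {"perfect": 2, "truncated": 1},
    "year": {"perfect": 2},
}
THRESHOLD = 8  # hypothetical cut-off


def pair_weight(outcomes: Mapping[str, str]) -> int:
    """Sum the weights of the matched elements.

    `outcomes` maps an element name to its matching condition; elements or
    conditions without a listed weight contribute nothing.
    """
    return sum(WEIGHTS.get(element, {}).get(condition, 0)
               for element, condition in outcomes.items())


def is_duplicate(outcomes: Mapping[str, str],
                 threshold: int = THRESHOLD) -> bool:
    """Regard the pair as duplicate when its weight reaches the threshold."""
    return pair_weight(outcomes) >= threshold


print(is_duplicate({"title": "truncated", "isbn": "perfect"}))   # True  (3 + 6)
print(is_duplicate({"title": "perfect", "year": "perfect"}))     # False (5 + 2)
```

The sketch presupposes that each outcome is attributed to a named element, which is exactly what an unpartitioned OCR-converted source of information does not provide.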
