Abstract
In order to present large amount of information on the Web to both users and machines, it is urgently needed to structure Web data. E-commerce is one of the areas where increasing data bottlenecks on the Web inhibit data access. Ontological display of the product information enables better product comparison and search applications using the semantics of the product specifications and their corresponding values. In this article, we present a framework called OPPCAT, which is used for semi-automatic ontology population from tabular data in e-commerce stores and product catalogues. As a result, OPPCAT allows tabular data to be used for mass production of ontology content. First, we present the common patterns in tabular data which obstruct semi-automatic production of ontologies. Then, we suggest solutions which automatically fix these errors. Finally, we define an algorithm to build ontology content semi-automatically.
1. Introduction
The use of ontologies has become extremely popular for representing machine readable semantic knowledge. However, building ontological content from scratch is a resource-demanding, time-consuming and error-prone task. Therefore, the automatic or semi-automatic construction of ontological content is an emerging research area. Ontology construction, enrichment and adaptation are known as ontology learning [1]. For example, Kutiyanawala et al. [2] uses ontology learning for creating a product ontology designed specifically for search and providing three methods to automatically extract product concepts for this ontology. The ontology population, which falls under the heading of ontology learning, is concerned with the task of adding new concept and relation instances to the ontology [3]. Ontology population requires an initial ontology that will be populated.
This article presents a framework, namely, OPPCAT, for extracting data out of tables in PDF product catalogues and e-commerce stores with the aim of building ontology content. We aim at building ontology content in a fast and effective way even by users who are unfamiliar with ontologies. Our work mainly deals with ontology population. It requires an initial ontology to be populated. However, it extends the ontology through the addition of new classes and properties. Therefore, we also categorise this work as ontology enrichment. Although this work is related to e-commerce domain, the proposed approach gives generic solutions for creating ontologies from Web data. Our approach has two main characteristic features. First, it is specifically targeted at populating the ontology with tabular data. Second, it especially focuses on the anomalies in tabular data that prevent building reliable ontological content. The anomalies that frequently encountered in tabular data are identified and automatically solved by the proposed algorithms.
Section 2 presents related work in this field; then, section 3 presents the most common anomalies in spreadsheet files and defines algorithms to solve these anomalies. Section 4 describes building ontology content semi-automatically using the normalised spreadsheet and uses OPPCAT in a real ontology building scenario. Our approach has been used to create ontological content about the automobiles by inexperienced users. Then, these users filled two surveys about the OPPCAT framework. Section 5 assesses the performance of OPPCAT framework based on the results of these surveys. Finally, section 6 concludes the article with a brief talk about possible future work.
2. Related work
There are some similar research works for creating ontologies from Web data. OntoGenie [4] parses the web pages to create knowledge instances for a given ontology using WordNet as a bridge for mapping between the ontologies and the web page terms. OntoSophie [5] is a system for semi-automatic population of ontologies with instances from unstructured text. It is based on supervised learning. This system learns extraction rules from annotated text and then applies those rules on new articles to populate the ontology. OntoSyphon [6] is a fully automated, unsupervised system that takes an ontology as input and uses the ontology to specify Web searches that identify possible semantic instances, relations and taxonomic information. The advantage of this approach is that the entire Web can be used as a corpus for instantiating entities in the ontology. M 2 [7] is a mapping language for converting data contained in arbitrary spreadsheets into Web Ontology Language (OWL) [8]. The language is implemented in MappingMaster, which is available as a plug-in for the Protégé ontology editor [9]. The disadvantage of the approach is that it is integrated into Protégé and requires users to be familiar with ontologies.
Our approach differs from aforementioned works in literature because it is specifically targeted at populating the ontology with tabular data. From this point, our work is most similar to the literature [10–14]. Holzinger et al. [10] extracts tabular data on the web pages and uses content spotters to derive meaning of the table, that is, to interpret the table structure in terms of product features represented as attribute name–value pairs. Nederstigt et al. [11] detailed about a framework capable of semi-automatic ontology population of tabular product information on the Web. It extracts the raw data from Web pages and then it maps the raw data to predefined OntoProduct ontology concepts. OntoProduct is fully compatible with GoodRelations [15], which is known as ‘the web vocabulary for e-commerce’. Ozacar [12] introduces a tool that produces structured interoperable data from product features, that is, attribute name–value pairs, on the Web. The tool extracts the product features using a website-specific template created by the user. The value of the extracted data is maximised using GoodRelations. The final output of the tool is GoodRelations snippets, which contain product features encoded in RDFa or Microdata.
The data.world project [13] aims to create a semantic-based publication platform for data sets, scalable to hundreds of thousands of heterogeneous users and millions of distinct data sets. When a user creates a data set, data.world generates a unique graph database instance for it. data.world handles each data set individually, giving each its own SPARQL endpoint. When users upload structured data in tabular formats that data.world supports, the data.world ingest pipeline automatically converts those data files into Resource Description Framework (RDF) using the CSVW specification, which provides a standardised model for virtualised tables modelled within an RDF graph structure.
Reasonable Ontology Templates (OTTRs) [14] are simple, but powerful, templates or macros for ontologies, represented in using a dedicated OWL vocabulary. Specifically, the implicit mapping between an OTTR’s parameters and its pattern may be exploited to generate various format descriptions and transformation specifications, for example, queries for extracting pattern instances and transformations between tabular input formats and OTTR pattern instances that may be processed by readily available desktop tools.
Unlike these approaches, our methodology especially focuses on the anomalies in tabular data that prevent building reliable ontological content. Based on our experience, we define and exemplify the anomalies we most frequently encounter in tabular data. Then, we define algorithms to automatically solve these anomalies. Finally, we build the ontological content semi-automatically using the normalised spreadsheet file. It is also good to note that the normalised spreadsheet files can be used with other approaches like Holzinger et al. [10] and Nederstigt et al. [11] for various needs of users.
3. Solving the anomalies in tabular data
This work involves building ontology content from the tabular data in e-commerce stores or product catalogues. We assume tabular data is represented in spreadsheets. Most of the product catalogues is in PDF format, therefore these catalogues should be converted to spreadsheets using programs such as Tabula [16]. Figure 1 shows parts of the technical specification table in Audi A1 brochure. Each column (except the first one) in the table corresponds to an individual of the ‘Automobile’ class. Each value in the first column corresponds to a property of the ‘Automobile’ class. Each value in column i and row j correspond to the value of property j of individual i (i and j are greater than 0). In some cases, it can be necessary to transpose the data in the table.

Audi A1 technical specifications table (partial).
The following sections involve normalising the spreadsheet tables before transforming it to an ontology. We identify the patterns that produce errors while populating the ontology from spreadsheets automatically. We list these patterns and give an example of each pattern. Then, we list potential solutions that automatically solve the problem in the spreadsheet file.
3.1. Renaming different individuals with identical names
3.1.1. Definition
If we have different individuals with identical names in the spreadsheet file, these individuals should be renamed properly.
3.1.2. Example
In Figure 2(a), two instances, named as ‘1.0 TFSI (Turbo Fuel Stratified Injection)’ have different ‘Transmission’ values. Two examples are separated due to the difference of the attribute values. Another example is from Range Rover brochure (Figure 2(b)). Two different instances, named as ‘SDV6’ have different ‘engine power’ values.

Different individuals with identical names: (a) Different individual with identical names example from Audi brochure (b) Different individual with identical names example from Range Rover brochure.
3.1.3. Solution
Two individuals should be renamed properly. Appending the different property (‘Transmission’/‘engine power’) values to the end of the individual names using Algorithm 1 is one of the possible renaming strategies. After executing the algorithm, the first individual in Figure 2(a) is renamed as ‘1.0 TFSI 5-speed manual’, while the second one becomes ‘1.0 TFSI 7-speed S tronic’. In the same manner, the first individual in Figure 2(b) is renamed as ‘SDV6 249’, while the second one becomes ‘SDV6 275’. Time complexity of the algorithm is O(n 3).
3.2. Separating different individuals located in the same column
3.2.1. Definition
In this case, we have two or more different individuals represented in one column as if they were the same. In fact, this column corresponds to more than one individual. This situation can be identified by columns having some cells that are split into n columns.
3.2.2. Example
Figure 3(a) partially shows the technical specification table in Fiat 500 brochure as an example. ‘1.2 69hp Hatchback-Convertible’ value represents two individuals, which are separated due to the difference of ‘Gearbox’ attribute values. Figure 3(b) shows another example from Audi A6 brochure. ‘A6 TFSI (132 kW)’ value represents two different individuals, which are separated due to their body types (‘saloon’ and ‘avant’).

Different individuals located in the same column: (a) Example for different individuals located in the same column from Fiat brochure (b) Example for different individuals located in the same column from Audi brochure.
3.2.3. Solution
The second column in Figure 3(a) should be separated into two and then the two individuals should be renamed as ‘1.2 69hp Hatchback-Convertible 5 forward reverse’ and ‘1.2 69hp Hatchback-Convertible dualogic robotised 5+R’. In the same manner, the individuals in second column in Figure 3(b) should be renamed as ‘A6 TFSI (132 kW) Saloon’ and ‘A6 TFSI (132 kW) Avant’. This step is completed automatically using Algorithm 2, whose time complexity is O(n 3).
3.3. Defining top properties
3.3.1. Definition
If there is a hierarchy between properties in spreadsheet file, it is required to represent this hierarchy in the knowledge base. In our experience, we observed that all the values of the top properties in the spreadsheet files are usually empty.
3.3.2. Example
Figure 4(a) illustrates an example hierarchy. All the values of the top properties (Safety, Audio) in the spreadsheet are empty. Another example (Figure 4(b)) is taken from Lexus LS brochure. In this example, the top properties are maximum output, engine and electric motor.

(a) Property hierarchy in Fiat 500 options table (partial). (b) Property hierarchy in Lexus LS technical specifications table (partial).
3.3.3. Solution
If all of the values (except the first one) in a row are empty, then the property (the first value of the row) is a top property. If i1 is the row index of a top property p and i2 is the row index of the next top property, then all properties having index values between [i1 + 1, i2 − 1] should be defined as subproperties of property p. Array L, which is the output of Algorithm 3, is used for defining top properties automatically in section 4. Time complexity of the algorithm is O(n).
3.4. Renaming different properties with identical names
3.4.1. Definition
Two different subproperties with different top properties may have identical names. In this case, we should rename these subproperties properly.
3.4.2. Example
Figure 5(a) shows an example of this situation from ‘Audi A3’ options table. The first ‘Milano leather, black’ property qualifies ‘standard seats’, while the second ‘Milano leather, black’ property qualifies the ‘sport seats’. Figure 5(b) shows another example from Lexus LS brochure. The first ‘Combined’ property qualifies the ‘CO2 Emissions’, while the second one qualifies the ‘Fuel Consumption’, and the first ‘Front’/‘Rear’ property qualifies the ‘Brakes’, while the second one qualifies the ‘Suspension’.

Different properties with identical names: (a) Example for different properties with identical names from Audi brochure (b) Example for different properties with identical names from Lexus brochure.
3.4.3. Solution
In this case, we append the name of top property to the name of its subproperty using Algorithm 4. Time complexity of the algorithm is O(n 2).
3.5. Separating different properties located in the same row
3.5.1. Definition
An attribute value can be represented as two values in different measurement units within the same cell. The secondary value is usually represented in parentheses.
3.5.2. Example
In Figure 6(a), ‘Maximum speed mph (kph)’ is a property represented by two different measurement units. In the same manner, ‘Acceleration’, ‘Maximum speed’, ‘Urban’ and ‘Extra Urban’ properties in Figure 6(b) are also represented by two different measurement units.

Different properties located in the same row: (a) Example for different properties located in the same row from Mazda, (b) Example for different properties located in the same row from Land Rover and (c) The normalized property hierarchy using Algorithm 5.
3.5.3. Solution
In this case, the property values should be defined in separate rows as ‘Maximum speed mph’ and ‘Maximum speed kph’ (Algorithm 5). In this solution, the secondary unit name must be represented in parentheses (e.g. ‘kph’). The ‘Units’ array contains all the values in the UN/CEFACT Common Codes List [17]. If the property name contains more than one unit name from the ‘Units’ array (e.g. ‘mph’ and ‘kph’) and the property values contain parentheses (e.g. ‘94(151)’), then we add a new row below. We store the primary property values in the original row and the secondary values in the new row. Figure 6(c) represents the result after processing the table in Figure 6(a). Time complexity of the Algorithm 5 is O(n 2).
3.6. Removing non-alphanumeric characters from the spreadsheet file
3.6.1. Definition
In order to comply with the naming conventions of ontology editors such as Protégé, non-alphanumeric characters should be removed from the spreadsheet file.
3.6.2. Example
Figure 7 shows two spreadsheet files which may cause ontology editor errors due to the non-alphanumeric characters (in red box) it contains.

Non-alphanumeric characters in the names of concepts and individuals: (a) Example for non-alphanumeric characters in the names of concepts and individuals from Fiat brochure and (b) Example for non-alphanumeric characters in the names of concepts and individuals from Ford brochure.
3.6.3. Solution
The removal procedure (Algorithm 6) contains the following operations in order: (1) small capitalisation; (2) replacing inch (′), euro (?), percentage (%) and degree (°) symbols with text; (3) removing non-alphanumeric characters; (4) replacing consecutive underscore characters with a single underscore; (5) removing the underscore characters which are at the beginning or at the end of the name; and (6) appending an underscore character in front of the names which starts with a digit.
Time complexity of the algorithm is O(n3).
4. Building the ontology
This process contains semi-automatic steps that gets a spreadsheet file and produces an ontology file. The aim is the mass production of the ontology content even by inexperienced users in a fast and efficient way. Figure 8 shows the interface of the application. User fills the textbox with the name of the class to which the individuals are mapped. The nodes of the tree in this figure are the property names in the first column of the spreadsheet file.

Building the ontology file semi-automatically.
User checks the object properties in the tree. Please note that the datatype properties are relations between instances of classes and RDF literals and XML (eXtensible Markup Language) Schema datatypes. However, object properties are relations between instances of two classes. All the subproperties of a property should have the same metaclass. In other words, if a top property is a datatype property, then all of its subproperties should be datatype properties. In the same manner, if a top property is an object-type property, then all of its subproperties should be object-type properties.
At the final stage all the classes, properties and individuals in the ontology are generated automatically by Algorithm 7 in Appendix 1. The value entered in the textbox is parameter class, the unchecked property names (in Figure 8) are the values of array P datatype and checked values are the values of array P objecttype. topPropertyIndices stores the indices of all top properties, which are defined by array L in Algorithm 3. Algorithm 7 executes the following steps in order:
Creates a class for the value entered by the user (e.g. ‘Automobile’ in Figure 8);
Creates a new OWL datatype property for each datatype property (unchecked properties like ‘urban cycle’ in Figure 8);
Creates a new OWL object property for each object-type property (e.g. ‘Transmission’, ‘Drive System’, ‘Clutch Control’ and ‘Gearbox, no. of gears’ in Figure 8);
For each object-type property, if this property is not a top property, then it creates a new class for the range of the property;
For each value in row 0, it creates an individual of the main class;
For each property value v (cell(m, n), where m, n > 0), if the corresponding property p(cell(m, 0)) is an object-type property, then it defines v as an individual of the class which represents the range of p and fill the property p value of i (cell(0, n)) with the value v;
For each property value v (cell(m, n), where m, n > 0), if the corresponding property p(cell(m, 0)) is an datatype property, then it fills the property p value of i (cell(0, n)) with the value v;
Defines top properties for each value in topPropertyIndices.
The following procedures in Algorithm 7 simply produce OWL code corresponding triples below (the main class is represented by variable
createClass(
createDatatypeProperty(p) →
createObjecttypeProperty(p, obj_c)→
createIndividual(
defineSubProperty(sub,
assignDataPropertyValue(
assignObjectPropertyValue(i_sub, p, i_obj) → i_sub p i_obj
OPPCAT is used in a real-life project, which involves creating ontological content about the automobiles. The tables in PDF catalogues of various automobile brands in Europe have been extracted using Tabula. Then, technical specifications and options of cars in spreadsheet files are imported into ontologies in accordance with OPPCAT methodology. In this study, the reasons for the choice of OPPCAT methodology are the challenging time constraint (1 month) and the lack of experienced project staff. Table 1 represents some statistics about the project (training hours was taken into account when calculating the ‘amount of work’).
Project statistics.
5. Evaluation
We evaluate the usability of the OPPCAT method using two questionnaires: Computer System Usability Questionnaire (CSUQ) [18] and System Usability Scale (SUS) [19]. The reason for choosing these questionnaires is that these two approaches have a higher accuracy with an increasing sample size than the other questionnaires.
Figure 9 shows the results of the study by Tullis and Stetson [20]. They analyse the effectiveness of four standard usability questionnaires: SUS, QUIS (Questionnaire for User Interaction Satisfaction) [21], CSUQ and Words (adapted from Microsoft’s Product Reaction Cards [22]). As one would expect, the accuracy of the analysis increases as the sample size gets larger. SUS and CSUQ reach asymptotes of 90%–100% at a sample size of 12.

Comparison of the significance of the different usability questionnaires.
In this work, we have 10 participants; therefore, the percentage of the reliability of the results is about 75%. All of the participants were undergraduate students with no experience on building ontology or ontology-based systems. The usability test was conducted as follows; first, the participants received a 60-min introduction of applying the method. After this introduction, each participant applied the OPPCAT methodology using catalogues of the one (or two) automobile brand. At the end, each participant filled out a CSUQ and an SUS Questionnaire. The following sections present the results of the questionnaires and discussion of these results.
5.1. CSUQ evaluation results
The CSUQ contains 19 questions and users rate them from 1 to 7, where 1 is ‘strongly agree’ and 7 is ‘strongly disagree’. The three internal subscales of CSUQ (Table 2) are ‘System Usefulness’, ‘Information Quality’ and ‘Interface Quality’. The first eight questions on CSUQ assess ‘System Usefulness’. These questions refer to the users’ perception of the ease of use, learnability, speed of performance and effectiveness in completing tasks and subjective feeling. Questions 9–15 can be used as a means of assessing the participants’ satisfaction with the quality of the information associated with the system. ‘Information Quality’ includes the users’ beliefs regarding error messages and error handling, information clarity, understandability and utility more generally. Questions 16–18 provide a score for the ‘Interface Quality’.
The three internal subscales of CSUQ.
CSUQ: Computer System Usability Questionnaire.
The results of the CSUQ are shown in Table 3, where N is the number of responses, AVG is the average value, DEV is the deviation value, MED is the median, MAX is the maximum and the MIN is the minimum. Most of the participants appreciated that the method was easy-to-apply and has a clear value and purpose. They found it to be a time-saving, fast and useful, which contributes to lowering the cost of data entry and improves productivity. They mentioned that this method provides clear and reliable results with standard PDF catalogues. Participants who had catalogues of two different brands felt comfortable about using the method for different brands without any extra knowledge. The majority of participants enjoyed to use the interface and found it clear and pleasant. In general, the participants were satisfied with the functions and capabilities of the system. The quality of the error messages and the documentation was ranked positive in average, but some participants criticised the inadequate and very generic error messages.
CSUQ evaluation results.
CSUQ: Computer System Usability Questionnaire.
We further calculated the overall score and the three factor scores for ‘System Usefulness’, ‘Information Quality’ and ‘Interface Quality’ for all participants as illustrated in Table 4. The overall assessment of the participants about the usability of the method was positive (82.33%). The results show that the ‘System Usefulness’ ranked highest in all scores (83.93%). Finally, ‘Information Quality’ and ‘Interface Quality’ ranked about 80%.
CSUQ single participant evaluation scores.
5.2. SUS evaluation results
SUS is one of the popular questionnaires, used for the assessment of usability (Table 5). It is described as ‘quick and dirty’ usability scale. The participant rates each question with a 5-point scale, where 5 is ‘strongly agree’ and 1 is ‘strongly disagree’.
Items of the SUS questionnaire.
SUS: System Usability Scale.
The score of positive questions (1, 3, 5, 7 and 9) is the rating minus 1. The score of negative questions (2, 4, 6, 8 and 10) is 5 minus the rating. To obtain the overall score on a scale of 0–100, you add up the score values and multiply the sum by 2.5. The score values of the different participants are presented in Table 6. The final average SUS score is 81.75.
SUS evaluation results.
SUS: System Usability Scale.
Table 7 provides the SUS scores with their corresponding adjective [23] and acceptability [24] ratings. The results show that the proposed method has a ‘good’ and ‘acceptable’ level of usability. Likewise, the results in Table 6 show that all participants rated usability above 62.5.
SUS scores with their corresponding adjective and acceptability ratings.
SUS: System Usability Scale.
6. Conclusion and future work
In most of the cases, the tabular product data on the Web and in the product catalogues are detailed and reliable. This work is a pragmatic approach for building ontological content from the tabular data in these sources. It provides a methodology and tool to help the inexperienced users enrich an ontology. The process contains semi- or fully automatic steps. Although there are similar studies that aim at populating ontologies using tabular data, the main difference of the approach presented in this article is that it takes anomalies in tabular data into consideration to build more reliable ontological content. The proposed anomaly detection and correction system eliminates: (1) different individuals with identical names, (2) different individuals located in the same column, (3) properties with empty values, (4) different properties with identical names, (5) different properties located in the same row and (6) non-alphanumeric characters in the tabular data.
A possible future work is to infer missing property values using data mining techniques on the existing property instances. We can also define further and more complex anomalies, and give algorithmic solutions for them. Another possibility is to build ontological content in a fully automatic fashion.
Footnotes
Appendix 1
| 1: 1: class ← ToCamelCase(class) 2: class[0] ← Character.upperCase(class[0]) 3: createClass(class) 5: P datatype[i] ← ToCamelCase(P datatype[i]) 6: createDatatypeProperty(P datatype[i]) 8: P objecttype[i] ← ToCamelCase(P objecttype[i]) 9: obj class ← P objecttype[i] 10: obj class[0] ← Character.upperCase(obj class[0] 11: createClass(obj class) 12: createObjecttypeProperty(P objecttype[i] obj class) 13: numberOf Rows ← getColumn(0).length 14: numberOf Columns ← getRow(0).length 16: i name ← ToCamelCase(getCell (0, i).value) 17: createIndividual(i name, class) 19: p ← ToCamelCase(getCell(j, 0).value) 20: value ← getCell(j, i).value 23: obj class ← p name 24: obj class[0] ← Character.upperCase(obj class(0) 25: obj value ← ToCamelCase(value) 26: createIndividual(obj value, obj class) 27: assignObjectPropertyValue(i name, p name, obj value) 30: currentIndex ← topPropertyIndices[i] 31: nextIndex ← topPropertyIndices[i + 1] 32: topProperty ← ToCamelCase(getCell(currentIndex, 0).value) 34: subProperty ← ToCamelCase(getCell(j, 0).value) 35: defineSubProperty(subProperty, topProperty) 39: name[j + 1] ← Character.upperCase(name[j + 1]) 40: name ← name.replace(′\s′,′′) |
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
