Abstract
Organisations use data in different formats: Word documents, Excel spreadsheets, databases, HTML pages and so on. It is not easy to make decisions with such data because the sources are not integrated and carry no built-in decision-making rules. Decisions can be reached with knowledge bases, which, unlike databases, make it possible to store not only objects, facts and attributes but also more sophisticated patterns such as rules and axioms. The article proposes an ontology-based method for knowledge base creation that allows for the simultaneous integration of semi-structured data sources and extendibility while remaining context independent. In the initial steps of the method, data specification is performed with the Data Source Ontology developed by the authors. This ontology provides a description of the data structure that forms a supportive knowledge graph. The graph’s schema is then mapped to the domain ontology to be populated. Finally, the data are inserted into the domain ontology according to the mapping rules. Manual input is needed during data specification and data-to-ontology schema mapping.
1. Introduction
Enterprises, especially non-profit and governmental institutions, use data in different formats: Word documents, Excel spreadsheets, databases, HTML pages and so on. Making decisions with data in such a format – a set of non-integrated data pieces – is a challenging task. Imagine, for example, an undergraduate-program administrator who needs to decide on a student’s eligibility for an award. The administrator may have a database with students’ personal information provided by the human resources department, grade information on the university website and exam assessment results in a spreadsheet. The administrator needs to combine this information and check the eligibility criteria described in the university’s guidelines. Accomplishing this task may take several days.
Similar tasks exist in governance, medicine and sports, and they share the following traits:
There is a set of rules that should be applied to the data to make a decision.
The data necessary to make a decision are distributed between several sources and provided in different formats.
The process of data integration and rule application is repeated regularly.
Such tasks are frequently performed manually, but human intervention is actually needed only to find the necessary data and formulate the rules underlying the decision. Data integration and rule application can be delegated to a machine. Indeed, it would be easier for the administrator to make sense of the data if all the sources were situated in a single database and connected with the corresponding decision-making rules. Such opportunities are provided by knowledge bases: unlike a database, a knowledge base makes it possible to store not only objects, facts and attributes but also more sophisticated patterns such as rules and axioms.
To create a knowledge base, one should populate an ontology with data. In this work, an ontological knowledge base is defined as a populated heavy-weight ontology. An ontology is a tuple of four elements: O = <C, R, A, I>, where C is a set of ontology classes (concepts), R is a set of relationships between concepts and their properties, A is a set of axioms (statements about the concepts) and I is a set of individuals (instances of the classes). If the set of axioms is not empty, the ontology is called heavy-weight. If the set of individuals is not empty, it is called a knowledge base. The process of inserting new instances into an existing ontology is called ontology population [1].
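The definition above can be illustrated with a minimal Python sketch. The class name Ontology, its method names and the string encoding of axioms and individuals are assumptions made for illustration only; they are not part of the authors' prototype.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Illustrative container for O = <C, R, A, I> (names are hypothetical)."""
    classes: set = field(default_factory=set)      # C: ontology classes
    relations: set = field(default_factory=set)    # R: relationships and properties
    axioms: set = field(default_factory=set)       # A: statements about the concepts
    individuals: set = field(default_factory=set)  # I: (individual, class) pairs

    def is_heavyweight(self) -> bool:
        # Heavy-weight: the set of axioms is not empty.
        return bool(self.axioms)

    def is_knowledge_base(self) -> bool:
        # Knowledge base: the set of individuals is not empty.
        return bool(self.individuals)

    def populate(self, individual: str, cls: str) -> None:
        # Ontology population: insert a new instance of an existing class.
        if cls not in self.classes:
            raise ValueError(f"unknown class: {cls}")
        self.individuals.add((individual, cls))


university = Ontology(classes={"Student"}, axioms={"Student subClassOf Person"})
university.populate("student1", "Student")
```

After the call to populate, the object satisfies both conditions of the definition: it is heavy-weight and it is a knowledge base.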
Although several methods of ontology population already exist, they have a number of drawbacks [1–3]: narrow scope (dedication to specific projects), data integration from homogeneous sources only, the absence of modularity and extendibility, and the absence of useful tools. The proposed method aims to eliminate these drawbacks. To overcome the narrow scope, the method is context independent and can be implemented in any domain. Unlike the majority of other methods, the developed method is designed for the simultaneous insertion of semi-structured data (RDB, XML and spreadsheets). The method is modular, and its different parts can be used separately. The method can also be extended: for example, if a company uses a proprietary data format, the ontology of data sources can be complemented by a class for its description. Finally, the method can be implemented with the help of any known ontology editor.
The novelty of the proposed method is twofold. First, it provides an opportunity to populate ontology from different data sources, whereas existing methods focus on data of a single type. Second, it is based on semantic technology itself – to integrate the data, the ontology of data sources is applied. This ontology provides a description for relational, XML and tabular data and is used to connect, describe and unify the structures of different data sources.
The article is structured as follows. In the ‘Related work’ section, the methods of ontology population from semi-structured data sources are reviewed. In the ‘Methodology’ section, the methodology used for method development is presented. In the ‘Results’ section, the method itself is described together with the architecture of the system for method support and the example. The ‘Conclusion’ section is dedicated to the current state and future directions of the method development.
2. Related work
Ontology population methods have been developed in the context of ontology engineering and relate to the insertion of new instances into the ontology, which, together with ontology enrichment, constitutes ontology learning [1]. Two streams of literature are related to the ontology population process: the architecture of an ontology population system and the handling of different data sources.
2.1. Ontology population system architecture
During ontology population system development, several design choices should be made. The design aspects to be considered can be combined into three groups: overall population system architecture, data insertion style and data-to-ontology mapping style. All the choices for the developed method are summarised in Table 1; the list of design aspects is based on the cited papers.
Design choices summary for ontology population system.
The comparison with other methods is made based on the corresponding papers.
System architecture
Five main design choices should be considered while creating an ontology population system [4]: automation level, type of input documents, domain specificity level, type of instances and relations to be extracted, and consistency/redundancy checks. The developed method is semi-automated, not domain specific and allows for the import of structured and semi-structured documents. For ontology-based systems, these choices should be complemented by the preferred style of ontology use [5]: single ontology (all data sources described with a single global ontology), multiple ontology (local ontology for each data source) or hybrid ontology (local ontologies based on a global ontology). For the developed method, the single ontology approach was applied.
Data insertion
Data can be inserted into the ontology using either the mapping on-demand or the data-materialisation (‘consolidation’) style. In the mapping on-demand approach, the data are imported into the ontology after a user query, meaning that retrieved values are always up to date. The main drawback of this approach is growth in query complexity, which demands a restriction of expressivity [6]. The data-materialisation approach overcomes this restriction by including the data in the ontology. However, this leads to data becoming outdated and to limitations on data quantity (insofar as a huge volume of data cannot be mapped). To overcome the limitations of these approaches, the authors use the ‘materialisation on-demand’ style: data are stored in the ontology and refreshed when the user needs to update them.
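One possible reading of the ‘materialisation on-demand’ style is sketched below in Python. The class MaterialisedStore, its methods and the fetch callable are hypothetical names introduced for illustration; the sketch assumes that a data source can be re-read in full whenever the user asks for a refresh.

```python
class MaterialisedStore:
    """Keeps a stored copy of source data and refreshes it only on request."""

    def __init__(self, fetch):
        self._fetch = fetch       # callable that reads the current source data
        self._values = fetch()    # materialise the data once, at creation time

    def query(self, key):
        # Queries are answered from the stored copy, so they stay simple
        # but may return values that are out of date.
        return self._values.get(key)

    def refresh(self):
        # The user decides when the stored copy is rebuilt from the source.
        self._values = self._fetch()


source = {"17204": "Davis Robert"}
store = MaterialisedStore(lambda: dict(source))
source["17204"] = "Davis Robert J."   # the source changes after materialisation
stale = store.query("17204")          # still the old value
store.refresh()
fresh = store.query("17204")          # now the updated value
```

The trade-off matches the one described above: queries do not touch the sources, and freshness is restored only when the user triggers a refresh.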
Data-to-ontology mapping
Two mapping styles can be distinguished: direct mapping and domain semantics-driven mapping [7]. In the direct mapping style, the ontology structure is inherited from the data source structure; the mapping can be performed automatically but cannot provide a full model of the subject domain. In the domain semantics-driven mapping style, the ontology structure takes precedence over the data source structure; this provides an accurate domain model, but the ontology should be constructed beforehand. The developed method is based on the domain semantics-driven mapping style.
2.2. Data sources for ontology population
To the best of the authors’ knowledge, existing methods perform the integration of data in a particular format, using a specific approach for each data type: mapping languages for RDB, schema extraction for XML and entity classification for tables. Differences in the approaches prevent the simultaneous population of an ontology from data in different formats. The developed method proposes an ontology-based approach to overcome this restriction for semi-structured data.
Methods of ontology population with relational data are based on techniques of RDB to RDF translation with languages and tools such as D2R MAP [8], R2O [9], D2RQ [10] and RDBToOnto [11]; see Michel et al. [7] for a review. The W3C recommendations on RDB to RDF translation have also been published [12]. The main idea of these methods is the translation of the relational database schema into the ontology schema using mapping languages.
Methods of ontology population with XML data are also employed in the translation of XML documents into OWL and RDF ontologies [13–16]. The existing approaches can be divided into two groups [17]: the first group encompasses ‘instance approaches’ that permit working with XML documents without an XML schema (JXML2OWL, XMLMaster); the second group encompasses ‘validation approaches’ based on XML-schema processing.
Methods of ontology population with tabular data have developed around overcoming the ‘entity-per-row’ assumption. Early approaches assumed that the table contained subjects in rows, predicates in columns and objects at their intersection. This structure, however, is rarely applicable to real-world spreadsheets [18]. Several approaches have tried to overcome this assumption by importing cells as separate entities or blocks [19–23] or by adding semantic structure to the table [3,18,24–26]. Systems for tabular data integration can be divided into four groups based on the output they produce from the table’s content [27]: (1) an RDF graph (RDF123, XLWrap, Google Refine); (2) OWL ontology instances (Anzo suite, MappingMaster); (3) any part of the Semantic Web environment, including an RDF graph or an ontology (TopBraid Composer); and (4) no conversion into RDF or OWL, but a simple spreadsheet-like user interface (Populous).
3. Methodology
Method development followed the design science paradigm devoted to the development of new artefacts [28]. According to the memorandum [29], such research should follow four principles: (1) abstraction (artefact can be used to solve a class of problems); (2) originality (artefact expands existing knowledge); (3) justification (artefact can be validated and justified); and (4) benefits of the artefact for stakeholders.
These principles were kept in mind during method development as follows:
Abstraction: The developed method can be used for ontology population from semi-structured data in any context (the educational institution data from the example provided in the ‘Introduction’, financial data or other contexts).
Originality: The authors are not familiar with other methods allowing for the simultaneous population of ontology from relational, XML and tabular data.
Justification: The method is needed for organisations storing data in different formats and can be validated using such data.
Benefits: Organisations stand to benefit from using the method as they will be able to perform reasoning with the integrated data and solve the problem of data redundancy, as well as incompleteness.
Method development included six steps in accordance with design science research methodology activities [30] as follows:
Problem identification and motivation. At this step, the authors realised that, despite the active development of methods for transforming data into a semantic format, these methods are not always useful for real-world data-integration problems: each is dedicated to a specific format, whereas the data at organisations are stored in different formats (section ‘Related work’).
Define the objectives for a solution. The appropriate solution for the aforementioned problem would be a method allowing for the integration of data in different formats (section ‘Introduction’).
Design and development. The result of this activity is artefact creation. The artefact is threefold: the method of ontology population itself, the ontology of data sources as a crucial part of the method, and prototyping of the system for method implementation (sections ‘Data Source Ontology’ to ‘Process of domain ontology population from semi-structured data sources’).
Demonstration of the method’s performance is provided with an example of its application to demonstration data sources (sections ‘Data Source Ontology’ to ‘Process of domain ontology population from semi-structured data sources’).
Evaluation of the method with real-world data is in progress.
Communication of the created artefact is presented in this article.
4. Results
A method of semi-automated semantics-driven ontology population from semi-structured data sources was developed. The idea of the method is as follows: a user wants to import the data from different sources (such as relational databases, XML documents and spreadsheets) into a single domain ontology. To do this, he or she needs to perform two actions. The first is the description of the data sources’ structure with a Data Source Ontology (section ‘Data Source Ontology’). The second is the formulation of mapping rules for the corresponding entities in the data sources and the domain ontology to be populated. Both actions should be performed with the help of a system that takes the data, extracts its structure and inserts the data into the domain ontology (section ‘Architecture of the system for ontology population’). The exact steps of the system’s operation (section ‘Process of domain ontology population from semi-structured data sources’) are hidden from the user; thus, he or she does not need to interact with any code.
4.1. Data Source Ontology
The Data Source Ontology is used for data source specification, a crucial step for data import into the domain ontology. Specification of RDB and XML data is performed relatively easily, as they are structured data sources. For tabular data, the steps for term identification from Krauthammer and Nenadic [31] have been adopted: entity recognition, classification and mapping. First, data ranges with similar data should be recognised by the user; then they should be classified according to the particular entity in the ontology to be populated; finally, the ranges should be mapped to the Data Source Ontology (the authors call this process ‘marking’). Marking of the data sources is performed with the Data Source Ontology to allow the import engine to insert them correctly into a domain ontology. Marking should be performed manually by the user with the help of ontology editing software.
The ontology of data sources describes three types of most common semi-structured data sources: spreadsheets, XML documents and relational databases (Figure 1). The DataSource class describes common properties for all the data sources: name (data source file title), path (file location) and password (in case authentication is needed to access the file). The DataSource class has three subclasses for the description of data source specific properties: Workbook for spreadsheet data, Database for RDB data and XMLdocument for XML data.
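The class hierarchy described above can be mirrored in a few Python dataclasses. This is only an illustrative sketch: the ontology itself is defined in an ontology editor, the Python class XMLDocument stands for the ontology class XMLdocument, and the list-valued attributes (sheets, tables, elements) as well as the file path in the usage line are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataSource:
    """Common properties shared by all data sources."""
    name: str                       # data source file title
    path: str                       # file location
    password: Optional[str] = None  # only needed when access requires authentication


@dataclass
class Workbook(DataSource):
    """Spreadsheet data; sheet names are kept as plain strings here (assumption)."""
    sheets: List[str] = field(default_factory=list)


@dataclass
class Database(DataSource):
    """Relational data; table names are kept as plain strings here (assumption)."""
    tables: List[str] = field(default_factory=list)


@dataclass
class XMLDocument(DataSource):
    """XML data; element names are kept as plain strings here (assumption)."""
    elements: List[str] = field(default_factory=list)


# Hypothetical path; the workbook and sheet titles come from the running example.
grading = Workbook(name="Grading workbook", path="grading.xlsx",
                   sheets=["Grading semester"])
```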

Data Source Ontology in e6Tools notation [32]: classes in squares, data properties in rounded boxes, object properties are arrows, arrows with white headings represent subClassOf relations. Properties signed with ‘(N)’ are necessary, signed with ‘(IF)’ are inverse functional.
Any database contains one or more tables (exemplars of the class DBTable), which consist of fields (exemplars of the class Field) filled with records (exemplars of the class Record). Any XMLdocument comprises elements (exemplars of the class XMLelement) and their attributes (exemplars of the class Attribute).
Spreadsheet data are specified with the concepts of the Workbook class. The workbook consists of sheets (exemplars of the WBSheet class). The data on a sheet have no built-in structure; thus, a Range class was introduced to specify data types. The Range class is the key class for the table description. Exemplars of the Range class are regions containing cells with similar values. For example, in the table in Figure 2, cells D7:E8 contain numerical values that describe properties of the values in cells B7:B8. In the marking example, these are range16 and range14, respectively. All the other values in the table can also be combined into ranges, although a range will sometimes consist of a single cell (like range1–range4 in the example). This approach was chosen in order to overcome the ‘entity-per-row’ assumption, which, as can be seen in the example in Figure 2, does not always hold for real-world tables.

Example of a tabular data source with marked ranges.
To import the values into the ontology, the characteristics of the ranges should be added: what exactly they contain, how they are connected to each other and where to find them on a sheet. The main properties of the Range class are as follows:
hasStartCell locates the first (upper left) cell of the range;
hasHorizontalShift describes the range boundary along the X-axis. For example, if the value is ‘5’, the range contains six columns (the start column plus five columns to the right). If the value is ‘0’, there is no shift in this direction and the range contains either a single column or a single cell;
hasVerticalShift does the same for the range boundary along the Y-axis;
hasConnectedRange connects different ranges. All of the previous properties describe selected ranges and their characteristics, and this property integrates the ranges in a single structure for further connection with the domain ontology.
In order to export the spreadsheet into the domain ontology, three steps should be performed: (1) table cells are marked as a set of ranges, (2) every range is assigned a unique identifier and (3) every range is described with the given properties.
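The way hasStartCell, hasHorizontalShift and hasVerticalShift locate a range on a sheet can be sketched in Python as follows. The function names are hypothetical, and the shift values used for D7:E8 are derived here from the definitions above rather than taken from the authors' marking example.

```python
import re


def column_to_index(column: str) -> int:
    """Convert a spreadsheet column label to a 1-based index (A -> 1, AA -> 27)."""
    index = 0
    for letter in column.upper():
        index = index * 26 + (ord(letter) - ord("A") + 1)
    return index


def index_to_column(index: int) -> str:
    """Convert a 1-based index back to a column label (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def range_cells(start_cell: str, horizontal_shift: int, vertical_shift: int) -> list:
    """List the cells of a range, row by row, from its start cell and shifts."""
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", start_cell)
    if match is None:
        raise ValueError(f"not a cell address: {start_cell}")
    first_column = column_to_index(match.group(1))
    first_row = int(match.group(2))
    return [
        f"{index_to_column(first_column + dx)}{first_row + dy}"
        for dy in range(vertical_shift + 1)    # a shift of 0 means a single row
        for dx in range(horizontal_shift + 1)  # a shift of 0 means a single column
    ]


# The range D7:E8 starts at D7 and extends one column right and one row down.
cells = range_cells("D7", 1, 1)   # ['D7', 'E7', 'D8', 'E8']
```

With both shifts set to 0, the function returns the start cell alone, which corresponds to the single-cell ranges mentioned in the text.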
The marking of the data to be inserted into the resulting knowledge base is semi-automatic. Relational tables and XML documents are marked with the Data Source Ontology concepts automatically, as they have an explicit structure. In this case, classes of the Data Source Ontology are populated with instances that correspond to the data sources’ structural elements (e.g. XML tags). Tabular data are marked by the user with the concepts provided by the Data Source Ontology. In both cases, the Data Source Ontology serves as a schema of the data sources’ structure.
4.2. Architecture of the system for ontology population
Once the data sources have been specified with Data Source Ontology, the data can be inserted into a domain ontology. The ontology population approach is based on extract, transform and load (ETL) technology (Figure 3). The stages are based on the study by Vassiliadis and Simitsis [33] and adapted to the knowledge bases as follows:
Extraction: the data sources to be inserted into the ontology are identified and specified, and the data extracted.
Transformation: data cleaning and conflict resolution is performed together with a mapping of the schemas extracted from the data with the domain ontology (i.e. knowledge base schema).
Loading: the data from the sources are inserted into an appropriate place in the knowledge base (i.e. domain ontology is populated).
The approach is implemented in a system with four modules, two of which demand manual input from the user (Figure 4). The supportive knowledge graph is created with the help of the Data Source Ontology. It contains the structure of the data that will be processed according to the rules prescribed by the knowledge analyst and domain expert. The knowledge base is created after domain ontology population.

Structure of data insertion in the knowledge base follows Extract, Transform, Load technology. The stream of data is marked with white filling, the stream of metadata with grey.

Architecture of the system for ontology population with semi-structured data sources. This figure has been designed using resources from Flaticon.com
Parts of the data processor are as follows:
Data specification module. This module creates a data source specification using information provided by the user about the relevant data sources (including the structure of the tabular data) and their location.
Data structure extraction module. The module accesses the data sources, extracts their structure and loads it into the Data Source Ontology together with information about the data source from which it was extracted. The result of this action is a populated Data Source Ontology, that is, a supportive knowledge graph that will be used to connect the data from relevant sources with the domain ontology.
Mapping module sets up a correspondence between entities from the supportive graph and ontology to be populated using the mapping rules and data source priority information prescribed by the user.
Population module extracts the data from the data sources according to their structure stored in the supportive knowledge graph, performs consistency and redundancy checks and inserts them in the domain ontology.
Implementation
The core modules are Data structure extraction module and Population module. For now, they are implemented as a working prototype in Python and can serve as an extension for any ontology editor software (the authors used Protégé [34]). Such architecture was chosen because Data specification module and Mapping module require manual input from the user. Thus, it is important to allow the user to utilize familiar software.
4.3. Process of domain ontology population from semi-structured data sources
The process of ontology population encompasses six steps (Figure 5). For each step, the main process is briefly described, and the input and output are provided together with an example based on the hypothetical task performed by the undergraduate-program administrator from the ‘Introduction’.

Steps of the method for an ontology population from semi-structured data sources. This figure has been designed using resources from Flaticon.com
Step 1: creation of the domain ontology
Domain ontology is the ontology used by the organisation where the semi-structured data sources will be integrated. Domain ontology development falls beyond the method’s scope.
Input: Expert knowledge, existing ontological and non-ontological knowledge sources, competency questions and so on in accordance with the chosen methodology for ontology creation.
Output: The domain ontology suitable for organisational goals.
Example: Domain ontology creation is beyond the scope of the method. The authors assume it has already been performed by the university and that the ontology is used by the administrator as it is. A fragment of such an ontology may look like the one in Figure 6.

Domain ontology fragment.
Step 2: data source identification
At this step, all of the related data sources (i.e. data sources needed for goal achievement and the answering of competency questions) should be identified. Their identification also falls beyond the method’s scope – the method does not restrict data types as it can process data in different semi-structured formats (XML documents, spreadsheets and databases).
Input: Expert knowledge, description of organisational processes, data storages.
Output: A list of data sources to be populated in the domain ontology.
Example: This step is also beyond the scope of the method and should be performed by the administrator according to his or her knowledge about the task. For example, there may be two different sources of data about students’ academic performance: the table (Figure 2) and the database (Figure 7).

Example of a database excerpt. Lower tables show database structure (‘PK’ is for a primary key, ‘FK’ is for a foreign key), upper tables show examples of the database tables filled with information.
Step 3: data source specification
The specification is a description of data sources. Data specification requires manual input by the knowledge engineer with the help of a domain expert. Lightweight Data Source Ontology (described in the previous section) is used as a template for data description and further import into the domain ontology – thus, there is no need for a mapping language. At this step, information about data source priority should also be provided (for conflict resolution during ontology population).
Input: A list of the data sources identified at Step 2; Data Source Ontology.
Output: A single file with an ontological description of identified data sources (i.e. schema of data sources is extracted).
Example: The administrator should import data from the two sources. The first is the spreadsheet ‘Grading workbook’ with the single sheet ‘Grading semester’ and 20 ranges (Figure 2). The second is the database ‘Education database’ with four tables (Figure 7). Importantly, the administrator does not have to describe all of the ranges in the spreadsheet and tables in the database, but only those containing the data to be imported into the domain ontology (i.e. the data related to the current task).
Step 4: data structure extraction
Data structure from the different data sources is extracted and integrated into a uniform structure.
Input: The file with the data description from Step 3; data structure of identified sources.
Output: Supportive knowledge graph created from the data.
Example: Once the data have been specified, the supportive knowledge graph should be created. At this step, the structure (e.g. column names) of the marked data sources is inserted into the supportive knowledge graph as individuals of the corresponding classes of the Data Source Ontology. Importantly, the data themselves remain in the initial data sources and are not imported into the supportive knowledge graph. However, connections between the structure (e.g. a column name) and the data (the column values) are preserved and will be used at the ontology population step. The result of the Data structure extraction module is the structure of the data, as shown in Figure 8. The values in the figure are the structure of the data sources (column names, range types) and not the data themselves.
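For a relational source, structure extraction of this kind could look like the following Python sketch, which uses the standard sqlite3 module. It is not the authors' prototype: the function name, the triple representation and the property names hasTable and hasField are assumptions, while the identifier pattern follows the ‘Education-Student-Id-Field’ naming that appears in the population example.

```python
import sqlite3


def extract_structure(conn: sqlite3.Connection, db_name: str) -> list:
    """Describe tables and fields as triples; no record values are copied."""
    triples = [(db_name, "rdf:type", "Database")]
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        table_id = f"{db_name}-{table}-DBTable"
        triples.append((table_id, "rdf:type", "DBTable"))
        triples.append((db_name, "hasTable", table_id))   # hypothetical property
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, *_rest in conn.execute(f'PRAGMA table_info("{table}")'):
            field_id = f"{db_name}-{table}-{column}-Field"
            triples.append((field_id, "rdf:type", "Field"))
            triples.append((table_id, "hasField", field_id))  # hypothetical property
    return triples


# Demonstration on an in-memory database with one table of the running example.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Student (Id INTEGER PRIMARY KEY, Name TEXT)")
connection.execute("INSERT INTO Student VALUES (17204, 'Davis Robert')")
graph = extract_structure(connection, "Education")
```

The resulting graph holds individuals such as Education-Student-Id-Field and Education-Student-Name-Field, while the record for ‘Davis Robert’ stays in the database, in line with the separation of structure and data described above.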

Result of specification module operation (supportive knowledge graph).
Step 5: ontology mapping
The supportive knowledge graph created at the previous step differs from the domain ontology to be populated in two respects. First, the names of classes and properties in the supportive knowledge graph are taken from the data and are different from the names in the domain ontology. Second, the structure of the supportive graph (and of the data sources) is different from the domain ontology’s structure. To import the data correctly, a mapping between the domain ontology and the supportive knowledge graph based on the Data Source Ontology should be created. The mapping is performed by a knowledge engineer together with a domain expert, who create mapping rules using a set of annotation properties that should be added to the domain ontology (see Table 2 for a detailed description).
The annotations on annotations are used. ‘Upper’ annotation property is dedicated to connecting properties with corresponding values or objects (for Data and Object properties correspondingly). ‘Nested’ annotation property maps names of properties in the domain ontology with corresponding entities reflecting data structure in supportive knowledge graph.
The examples for other ‘nested’ properties will not be provided for the sake of brevity.
Input: Domain ontology; supportive knowledge graph schema.
Output: Mapping between supportive knowledge graph and domain ontology.
Example: The mapping is performed with a set of annotation properties (Table 2). These properties should be created manually in the domain ontology to be populated. Every class in the fragment of the domain ontology (Figure 6) will be described with a set of properties (Table 3). The property values correspond to the instances in the supportive knowledge graph (i.e. the names of the data structure elements that should be loaded into the domain ontology).
Annotation properties that should be added to the domain ontology classes for data import.
Step 6: ontology population
At Step 5, the classes in the domain ontology were connected with the corresponding individuals in the supportive knowledge graph using annotation properties and their values. The individuals in the supportive knowledge graph reflect the structure of the data to be imported, but the data themselves are stored in the described data sources. To insert the values into the domain ontology, the population module processes the annotation properties (which map the data structure stored in the supportive knowledge graph to the domain ontology concepts) and imports the data into the domain ontology.
Input: Domain ontology, supportive knowledge graph, mapping information.
Output: Knowledge base; logs with information about the data chosen to insert in case of conflicts.
Example: The population process for the example provided at the previous step is executed as follows:
Processing of Education-Student-Id-Field (with the getInstancesFrom property) for the class Student of the domain ontology will create two individuals: student1 with rdfs:label 17204 and student2 with rdfs:label 17217.
Processing of Education-Student-Name-Field (with the getValuesFrom property) for student1 will add the value ‘Davis Robert’ for the hasName property, which should be entered as the value of useDataProperty. Analogously, ‘Watson Emily’ will be the value for the student2 individual.
During the processing of Discipline (with the makeReferenceTo property), the individuals student1 and student2 will be connected by the hasDiscipline relation with discipline1 and discipline2.
After the processing of the individuals of Student and Discipline (with the combineInstancesFrom property), four individuals of the Mark class will be created. Each of them will be connected with the corresponding individuals. For example, mark1 will be connected by hasStudent with student1 and by hasDiscipline with discipline1.
The described properties processing should be executed one-by-one in the sequence described in Figure 9. The result of the ontology population is depicted in Figure 10.
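The four processing steps of the example can be imitated with a small Python sketch that works on in-memory lists instead of real data sources. The function names only echo the annotation properties and are not the authors' implementation; the discipline labels are placeholders, the processing order follows the order of the example rather than Figure 9, and the sketch assumes that every student is linked to every discipline, as the four Mark individuals suggest.

```python
from itertools import product

# Hypothetical in-memory stand-ins for the values behind the marked fields.
STUDENT_IDS = [17204, 17217]
STUDENT_NAMES = ["Davis Robert", "Watson Emily"]
DISCIPLINE_NAMES = ["Discipline A", "Discipline B"]  # placeholder labels


def get_instances_from(kb, cls, labels):
    """Create one individual per source value and store the value as rdfs:label."""
    created = []
    for number, label in enumerate(labels, start=1):
        name = f"{cls.lower()}{number}"
        kb[name] = {"rdf:type": cls, "rdfs:label": str(label)}
        created.append(name)
    return created


def get_values_from(kb, individuals, data_property, values):
    """Attach one data property value to each individual, in source order."""
    for name, value in zip(individuals, values):
        kb[name][data_property] = value


def make_reference_to(kb, subjects, object_property, targets):
    """Link every subject to every target through an object property."""
    for name in subjects:
        kb[name][object_property] = list(targets)


def combine_instances_from(kb, cls, left, left_property, right, right_property):
    """Create one individual of cls per (left, right) pair and link it to both."""
    created = []
    for number, (left_name, right_name) in enumerate(product(left, right), start=1):
        name = f"{cls.lower()}{number}"
        kb[name] = {"rdf:type": cls, left_property: left_name,
                    right_property: right_name}
        created.append(name)
    return created


def populate_example():
    """Run the four steps in the order in which the example lists them."""
    kb = {}
    students = get_instances_from(kb, "Student", STUDENT_IDS)
    get_values_from(kb, students, "hasName", STUDENT_NAMES)
    disciplines = get_instances_from(kb, "Discipline", DISCIPLINE_NAMES)
    make_reference_to(kb, students, "hasDiscipline", disciplines)
    combine_instances_from(kb, "Mark", students, "hasStudent",
                           disciplines, "hasDiscipline")
    return kb


knowledge_base = populate_example()
```

In the resulting dictionary, student1 carries the label 17204 and the name ‘Davis Robert’, and mark1 refers to student1 and discipline1, which mirrors the outcome described in the example.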

Sequence of the Annotation properties processing.

Result of population module operation.
These steps can be applied to any domain thanks to two innovations. The first is the description of data sources with the Data Source Ontology (Step 3). This ontology does not include any domain-related concepts, only those describing the structure of semi-structured data sources. Thus, data in different domains can be marked in a similar way. The supportive knowledge graph (Step 4) will differ even between tasks in the same domain, as it depends on the data structure. However, the technology can still be used in different domains, as the graph is based on the Data Source Ontology, which can be applied to any semi-structured data. The second is the population of the domain ontology with a set of annotation properties (Steps 5–6). Although the domain ontology necessarily reflects the domain, the mapping technique is domain independent, as annotation properties do not address any domain-specific information but point at the corresponding concepts and instances. Annotation properties are processed irrespective of the data content; the processing depends only on their prerequisites (described in Table 2).
Redundancy and consistency checks
Redundancy is avoided using the rdfs:label property. Before an instance is created, the module checks whether an instance with the same label already exists. If it does, a new instance is not created. The consistency check is executed with the information about data source priority provided by the user at Step 3. If there is a contradiction between data values that should be imported as property values, the values from the higher-priority source are imported.
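Both checks can be sketched in Python as follows. The function names, the dictionary-based knowledge base and the convention that a lower number means a higher priority are assumptions made for illustration; the log entry reflects the logs on conflict resolution mentioned in the output of Step 6.

```python
def find_by_label(kb, label):
    """Return the name of the individual carrying this rdfs:label, if any."""
    for name, properties in kb.items():
        if properties.get("rdfs:label") == label:
            return name
    return None


def create_instance(kb, name, label):
    """Redundancy check: reuse an existing individual with the same label."""
    existing = find_by_label(kb, label)
    if existing is not None:
        return existing
    kb[name] = {"rdfs:label": label}
    return name


def resolve_conflict(candidates, priority, log):
    """Consistency check: keep the value from the highest-priority source.

    candidates maps a source name to the value it provides;
    priority maps a source name to its rank (1 = highest, an assumption).
    """
    winner = min(candidates, key=lambda source: priority[source])
    if len(set(candidates.values())) > 1:
        log.append(f"conflict: kept {candidates[winner]!r} from {winner}")
    return candidates[winner]


kb = {}
first = create_instance(kb, "student1", "17204")
second = create_instance(kb, "student3", "17204")   # same label, nothing is added

log = []
grade = resolve_conflict({"spreadsheet": 4, "database": 5},
                         {"database": 1, "spreadsheet": 2}, log)
```

Here the second call returns the existing student1, and the contradictory grade is taken from the database because it has the higher priority in this hypothetical setting.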
5. Conclusion
The developed method allows for the population of a domain ontology with semi-structured data (relational, XML and tabular). The method is best suited for recurring tasks of the same type, as in the case of the undergraduate-program administrator who needs to consolidate students’ assessments and decide whether they are eligible for certain awards. The data sources are always the same, as is the procedure. The data source description and the mapping rules are created only once. If subsequent ontology populations use the same data sources, the same specifications and mapping rules can be reused with no need for new manual input. In such cases, the proposed method would increase the speed and improve the quality of data-based decisions.
The main benefits of the method lie in its capacity to integrate different data sources simultaneously and in its context independence. Unlike the majority of RDB and XML to RDF translation methods, the developed method is not tied to a specific mapping language, so any person familiar with ontologies can apply it to their own project. Both benefits are attained through the use of ontologies. The Data Source Ontology makes it possible to define instances explicitly, while the domain ontology provides a schema for data integration. Mediating the population of a domain ontology with the Data Source Ontology allows the method to be applied irrespective of data content, thus making it domain independent. The only restriction on the method’s application is the need for an existing domain ontology. Importantly, the method can be implemented as a part of any existing ontology editing software; thus, its usage does not require additional training for the user.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author received no financial support for the research, authorship and/or publication of this article.
