Information extraction from automotive reports for ontology population

Abstract

In this paper, we present our research on the use of ontologies and information extraction for modeling damages incurred on car bodies. With the increasing use of technology in the automotive industry, it is important to have a standardized and efficient way of documenting and analyzing car damage reports. Most existing reports are unstructured, and there is a lack of standardization in describing the damage. To address this issue, we have developed a domain ontology for car damage modeling (OCD; available at industryportal.enit.fr/ontologies/OCD and github.com/OntologyCarDamage/OCD) and proposed an end-to-end system to extract information from French automotive reports. The information extraction process uses named entity recognition (NER) and relationship extraction (RE) techniques to identify and extract relevant information from the reports. The extracted information is then used to populate the OCD ontology, allowing a structured and standardized representation of the damage information. The proposed system was tested on a real dataset of automotive reports and showed promising results.

Keywords

Information extraction, ontology, named entity recognition, relationship extraction

1. Introduction

Information modeling and ontologies have become increasingly crucial in today’s data-driven world (Munir and Anjum, 2018), where the effective sharing, structuring and processing of knowledge across different domains, including healthcare (Kim and Chung, 2014), finance (Benjamin et al., 2022), and transportation (Ahaggach et al., 2023a), can be a significant challenge. The use of structured data and standardized models allows for seamless communication and exchange of information across different systems and stakeholders, which reduces ambiguity and ensures accuracy in data representation. The automotive industry, where vehicle transportation is a delicate process, can benefit from information modeling and ontologies. Cars are often subject to various forms of damage during transit, requiring a detailed quality control process to ensure their safe arrival. However, damage reports lack a consistent structure and a standard for describing the damage, which results in a manual data entry process that is time-consuming and error-prone.

To this end, we have developed $OCD$ , an ontology for damage modeling. This ontology provides a structured framework for describing and categorizing different types of damage. Along with the ontology, we have also proposed an end-to-end system for extracting information from automotive reports. This information extraction system uses natural language processing (NLP) techniques such as NER and RE. The system analyzes unstructured reports and extracts relevant information to populate our ontology. The proposed approach helps to improve the quality control process for vehicle transport and provides a standardized way to document and categorize car damage.

The rest of this article is organized as follows: In Section 2, we discuss related work in the fields of ontology and information extraction, including the use of these techniques in the automotive field. Section 3 presents our approach, which comprises two primary steps: ontology construction and information extraction for the ontology population. In the ontology construction step, we identify the relevant concepts and relationships in the domain of car damage assessment and construct a structured representation of this knowledge. In the information extraction step, we extract information from the text and map it to the relevant concepts and relationships in the ontology. Section 4 presents the results of our experimentation, including the evaluation of each step of our approach. We also demonstrate the scalability of our approach by applying it to a large dataset of car damage reports. Finally, in Section 5, we conclude our article by summarizing our approach and discussing its potential impact on the automotive field. We also provide perspectives on how to further extend this work.

2. Related work

In this section, we present an overview of the existing literature on ontology-based approaches in damage modeling and information extraction, including the use of these techniques in the automotive sector.

2.1. Ontology

In recent years, ontologies have become a popular way to express machine-readable semantic knowledge. We can consider an ontology $O$ as a 4-tuple $\langle C, P, J, A \rangle$ in which $C$ is a set of classes, $P$ is a set of properties, $J$ is a set of instances, and $A$ is a set of axioms. Classes represent real-world entities or objects; properties accompany classes (i.e., as attributes) or represent relations between classes; instances denote the individuals of a class; and axioms specify additional constraints involving classes and properties.
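As an illustration only, the 4-tuple can be mirrored by a small Python container. This is a minimal sketch and not part of the OCD implementation: the class name `Ontology`, its field names, and the `violations` check are hypothetical, and the axioms are reduced to plain (subclass, superclass) pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy container for the 4-tuple O = <C, P, J, A> (illustrative only)."""
    classes: set = field(default_factory=set)        # C
    properties: dict = field(default_factory=dict)   # P: name -> (domain, range)
    instances: dict = field(default_factory=dict)    # J: individual -> class
    axioms: list = field(default_factory=list)       # A: (subclass, superclass) pairs

    def violations(self):
        """Return instances or properties that refer to undeclared classes."""
        bad = [i for i, c in self.instances.items() if c not in self.classes]
        bad += [p for p, (d, r) in self.properties.items()
                if d not in self.classes or r not in self.classes]
        return bad


onto = Ontology()
onto.classes |= {"Car", "CarParts", "Damage"}
onto.properties["hasCarPart"] = ("Car", "CarParts")
onto.properties["hasDamage"] = ("CarParts", "Damage")
onto.instances["door_1"] = "CarParts"
onto.instances["scratch_1"] = "Scratch"  # class deliberately left undeclared
print(onto.violations())  # ['scratch_1']
```

Keeping the four components separate makes it easy to check, as above, that instances and properties only refer to declared classes.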

2.1.1. Ontology construction

Numerous studies (Browarnik and Maimon, 2015; Wong et al., 2012) have delved into ontology construction, aiming to create structured representations of knowledge for diverse applications. The process of ontology construction involves several key tasks, such as specifying the domain, identifying relevant terms and concepts, establishing rules and axioms, encoding the ontology using representation languages like Resource Description Framework (RDF), RDF Schema (RDFS), or Web Ontology Language (OWL), incorporating existing ontologies, and evaluating the constructed ontologies.

Researchers (Buitelaar et al., 2005; Mishra and Jain, 2015; Browarnik and Maimon, 2015) have explored three primary approaches to ontology construction: manual construction, cooperative construction, and (semi-) automatic construction. Manual construction involves domain experts manually creating the ontology, while in cooperative construction, experts oversee most or all tasks in the construction process. On the other hand, (semi-) automatic construction involves reducing human intervention, aiming for more automated processes while acknowledging the difficulty of achieving full automation.

Addressing challenges (Zhou, 2007; Albukhitan et al., 2017) in automatic ontology construction remains a priority. Among these challenges are achieving fully automatic construction, handling noise terms to improve pre-processing, enhancing the discovery of relations between concepts, and refining the learning of ontology axioms. Furthermore, ontology construction systems need to accommodate data from various sources, including static text collections and heterogeneous data on the World Wide Web. Additionally, there is a need to establish standardized evaluation platforms for assessing ontology construction systems.

2.1.2. Ontology-based damage modeling

The application of ontologies is widespread in various domains, serving as valuable tools for representing and organizing knowledge. In particular, ontologies have found utility in modeling damages across diverse areas, including buildings, production plants, and electrical systems. For instance, Everett et al. (2002) built an ontology to model copier damage, in order to analyze textual documents containing repair information and identify similar documents. Rachman and Chandima Ratnayake (2018) used ontologies to model damages to processing plant equipment, while Hamdan et al. (2019) introduced an ontology to represent building damages as well as relationships with building elements and affected spatial zones. In the automotive domain, ontologies have been used to model traffic accidents by describing the circumstances, location, causes, and effects of the accident, as evidenced by the previous works of Barrachina et al. (2012) and Dardailler (2012). Despite this, not enough attention has been given to modeling the damage sustained by the vehicle, creating a void in the existing literature. To the best of our knowledge, no existing ontology describes the various car damage types in the automotive domain. Although some existing ontologies focus on car information and parts, they serve different purposes and use cases. The vehicle signal specification ontology (Klotz et al., 2018) and the automotive ontology (Feld and Müller, 2011) focus on car-related information and aspects of intelligent in-car systems. The vehicle sales ontology (Hepp, 2010) primarily addresses e-commerce scenarios, facilitating the description of cars, boats, bikes, and other vehicles for commercial activities.

Table 1 offers a comprehensive comparison of various automotive ontologies mentioned in this section. It is clear that these ontologies lack a specific emphasis on detailed car damage modeling. Consequently, their underlying structures do not align with the specific requirements for a comprehensive car damage ontology. Additionally, some of these ontologies are not available online, lack multilingual support, and do not cater to our specific needs, as our ontology supports French, Arabic, and English languages. This multilingual support facilitates the modeling of car damage and populating it using diverse car damage reports.

Furthermore, it is worth noting that most of these existing ontologies lack inference capabilities (Table 1). In contrast, our ontology, as described in Section 3.2.3, incorporates reasoning, enabling it to enrich the information extracted from car damage reports and deduce new information.

Table 1
Comparison of automotive ontologies

Criteria	Barrachina et al. (2012)	Feld and Müller (2011)	Hepp (2010)	Klotz et al. (2018)	Our ontology (OCD)
Car damage modeling	×	×	×	×	✓
Car information modeling	✓	✓	✓	✓	✓
Parts modeling	×	✓	×	✓	✓
Multi-language support	×	×	×	×	✓
Public access	×	×	✓	✓	✓
Inference capabilities	×	✓	×	×	✓
Purpose	Road accident modeling	Vehicle knowledge sharing	E-commerce vehicle modeling	Vehicle signal modeling	Car damage modeling

By addressing the limitations of existing ontologies, our work is centered on developing a domain-specific ontology explicitly tailored to the car damage assessment, with a specific focus on capturing the various types and severity levels of car damages, as well as modeling the car and its parts. Our aim is to create a robust ontology that substantially enhances the understanding and management of car damage assessment.

2.2. Information extraction

Information extraction refers to the process of automatically extracting relevant information from various sources, including but not limited to text documents, websites, and social media. This is typically accomplished through the application of NLP and machine learning (ML) techniques. The extracted information can range from simple and straightforward facts such as names, dates, and locations, to more complex information such as events, relationships, and sentiments. Information extraction is applied in several domains, such as healthcare (Ayadi et al., 2020; Chandra et al., 2023), where physicians’ reports are often unstructured, making it challenging to extract and process information effectively. Applying NLP techniques to these reports makes it possible to deduce valuable knowledge about the patient’s sickness, potential treatments, and other relevant medical insights.

In the construction industry, several works (Nepal et al., 2013; Zhou and El-Gohary, 2017; Nundloll et al., 2022) propose information extraction systems to extract relevant information from unstructured reports. These systems are designed with the objective of automating the extraction of requirements, thereby assisting construction users in expeditiously extracting the requisite information and organizing it within their database.

In the automotive field, information extraction has received significant attention from researchers and practitioners because of its potential to improve various processes in the industry. Many studies have explored different NLP and ML techniques for information extraction, including NER, part-of-speech (POS) tagging, rule-based methods, and machine learning algorithms.

One of the earliest attempts was made by Rubens and Agarwal (2002). The authors proposed a combination of NLP and ML classification algorithms to extract attributes from online automotive classifieds. Their approach was designed to facilitate structured searches over unstructured data. The extracted attributes, such as brand, model, year, price, and condition of the vehicle, can help automate searching and sorting through large amounts of vehicle-related information available online, saving time and effort compared to manual data entry.

Another study by Bhatia et al. (2008) developed a system to extract structured information using natural language processing techniques. The system relied on maximum entropy classifiers for named entity recognition, and manually crafted rule-based approaches. The extracted information was used to perform structured searches for certain attributes of interest, enabling a more efficient search experience for potential buyers.

More recently, Jalal (2020) discussed the use of regular expressions, a type of pattern matching used in computer programming, to extract structured information from online automobile advertisements. The author explored how this information could improve the search experience for online automobile buyers. By using regular expressions, the author was able to extract relevant information, such as the brand and model of the car, and present it in a structured form to users.

2.2.1. Entity recognition approaches

Named-entity recognition is a subtask of information extraction that involves identifying named entities in unstructured text and placing them into predefined categories. Let T be a text (sentence) with length n and E a set of named entity types. The goal of NER is to find all named entities $NE \subseteq T$ and classify them into the set of types E. The output of the NER system is a set of tuples $(e_{i}, t_{i})$ , where $e_{i}$ is a named entity and $t_{i}$ is its corresponding type in E. In the following, we present the main NER approaches:

NER based on dictionaries. One approach to NER uses dictionaries or knowledge bases to map text phrases to synonyms for concepts. This dictionary-based approach, as described by Zhou et al. (2006), has the advantage that the dictionary can be updated with new concepts and synonyms. However, it can only recognize entities that are already in the resource, and creating a well-constructed and inclusive terminology resource can be an expensive task.

NER based on rules. Another approach to NER is rule-based, which uses regular expressions to combine information from dictionaries and entity features. Petasis et al. (2001) describe this approach, which also allows the dictionary to be updated with new concepts and synonyms. The downside of this approach is that creating the rules manually can be a tedious process.

NER based on the corpus. This approach involves using corpora annotated by domain experts and machine learning algorithms to predict entity labels. Alnazzawi et al. (2015) and Leaman et al. (2015) describe this corpus-based approach that does not rely on dictionaries or manually created rules. However, this approach requires the existence of an annotated corpus.

NER based on active learning. NER based on active learning is a type of semi-supervised learning that uses unlabeled data in the training process. Chen et al. (2015) and Tran et al. (2017) describe this approach, which requires fewer learning examples compared to supervised learning. However, it also requires constant user intervention.

NER based on deep neural networks. NER based on deep neural networks, as described by Huang et al. (2015) and Lopez and Kalita (2017), is a type of corpus-based approach that does not depend on dictionaries or manually created rules. This approach has shown better results compared to other NER methods.

Besides these approaches, there are also hybrid approaches that combine multiple techniques. Thomas and Sangeetha (2019) present a NER method that combines deep learning, clustering, and rule-based algorithms to extract clinical entities from medical reports.
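To make the dictionary-based and rule-based approaches concrete, the following minimal sketch returns $(e_{i}, t_{i})$ tuples for a sentence. It is an illustration under stated assumptions and not the NER model used in this work: the gazetteer entries, the type names `CAR_PART`, `DAMAGE` and `REGISTRATION`, and the registration pattern are all hypothetical, and the example sentence is in English for readability although the reports studied here are in French.

```python
import re

# Hypothetical gazetteer; the entries and type names are illustrative only.
GAZETTEER = {
    "CAR_PART": ["front bumper", "bumper", "door", "windshield"],
    "DAMAGE": ["scratch", "dent", "crack"],
}
# Hypothetical rule: a registration-like token such as "AB-123-CD".
RULES = {"REGISTRATION": re.compile(r"\b[A-Z]{2}-\d{3}-[A-Z]{2}\b")}


def recognize(text):
    """Return (entity, type) tuples, in text order, from dictionary lookup and rules."""
    found = []
    taken = []  # character spans already claimed by a longer dictionary match
    for etype, terms in GAZETTEER.items():
        # Try longer terms first so "front bumper" wins over "bumper".
        for term in sorted(terms, key=len, reverse=True):
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, re.IGNORECASE):
                if any(m.start() < e and s < m.end() for s, e in taken):
                    continue  # overlaps an earlier, longer match
                taken.append(m.span())
                found.append((m.start(), m.group(0), etype))
    for etype, rule in RULES.items():
        for m in rule.finditer(text):
            found.append((m.start(), m.group(0), etype))
    return [(entity, etype) for _, entity, etype in sorted(found)]


print(recognize("Deep scratch on the front bumper of vehicle AB-123-CD."))
```

The sketch also shows the limitation noted above: a part or damage term that is missing from the gazetteer is simply not recognized.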

2.2.2. Relation extraction approaches

Relation extraction is a subdomain of information extraction that identifies semantic relations between text entities. Let S be a sentence with n words, represented as a sequence of word embeddings: $S = (w_{1}, w_{2}, \dots, w_{n})$. Let E be the set of entities recognized in the sentence, represented as entity embeddings: $E = \{e_{1}, e_{2}, \dots, e_{m}\}$, where m is the number of entities in the sentence. The goal of relation extraction is to identify the existing relationships R between the entities in E. We can define the set of relationships as $R = \{r_{1}, r_{2}, \dots, r_{k}\}$. Each relationship $r_{i}$ can be represented as a function that takes as input two or more entity embeddings and outputs a result indicating the type of the relationship between the entities. We can define the relation extraction function F as follows: $F(S, E) = \{(e_{i}, e_{j}, r_{k}) \mid e_{i}, e_{j} \in E,\ r_{k} \in R\}$. The function F returns a set of tuples, where each tuple represents a relationship between two entities and the type of the relationship.

There are several methods for relation extraction, and these methods can broadly be classified into four categories: rule-based, supervised, semi-supervised (bootstrap learning), and unsupervised approaches.

Rule-based approaches. These approaches rely on manually crafted rules or patterns that define the relationship between entities. These patterns are based on syntactic and semantic features. One of the major disadvantages of rule-based approaches is that creating and updating these patterns can be time-consuming and challenging, especially for complex relationships. According to Nadeau and Sekine (2007), rule-based methods have the advantage of being interpretable and explainable, but their effectiveness depends on the availability of high-quality rules.

Supervised approaches. These approaches treat relation extraction as a classification problem and train a machine learning model to predict the relationship between two entities. The model learns to identify the relationships between entities based on features such as part-of-speech tags, dependency parsing, and word embeddings. The most commonly used algorithms for this approach are support vector machines (SVM) and neural networks. According to Zeng et al. (2014), supervised learning approaches are effective in identifying a wide range of relationships but require large annotated datasets for training.

Semi-supervised approaches. These approaches, also known as bootstrapping, start with a small set of seed pairs and use them to automatically learn additional relationships from unannotated data (Brin, 1999; Agichtein and Gravano, 2000). These methods combine labelled and unlabelled data to iteratively expand the seed set and improve the model’s accuracy. Bootstrapping can be computationally efficient and does not require a large amount of labelled data, but it is susceptible to errors in the initial seed set and can generate noisy patterns.

Unsupervised approaches. These approaches involve recognizing pairs of entities that appear together frequently within the same sentence or document. When two entities occur together with enough frequency, it is assumed that there exists some sort of relationship between them (Yates et al., 2007; Davidov and Rappoport, 2008). Additionally, some studies use large amounts of text to extract relations between entities by analyzing strings of words that appear between them. These strings of words are then clustered and simplified to generate relation-strings, as seen in works such as Shinyama and Sekine (2006) and Etzioni et al. (2008).
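As a small illustration of the rule-based category, the function F defined above can be sketched as a lookup over entity-type pairs combined with a token-distance limit. This is a toy example and not the relation extraction model evaluated in this article: the rule table `TYPE_RULES`, the entity type names, and the `max_distance` threshold are hypothetical.

```python
from itertools import permutations

# Hypothetical type-pair rules: (head type, tail type) -> relation label.
TYPE_RULES = {("CAR_PART", "DAMAGE"): "hasDamage"}


def extract_relations(entities, max_distance=6):
    """Sketch of F(S, E): return (e_i, e_j, r_k) for pairs allowed by TYPE_RULES.

    `entities` is a list of (text, type, token_index) tuples from one sentence.
    """
    triples = []
    for (head, head_type, head_pos), (tail, tail_type, tail_pos) in permutations(entities, 2):
        relation = TYPE_RULES.get((head_type, tail_type))
        # Keep the pair only if a rule exists and the entities are close together.
        if relation and abs(head_pos - tail_pos) <= max_distance:
            triples.append((head, tail, relation))
    return triples


entities = [
    ("scratch", "DAMAGE", 1),
    ("front bumper", "CAR_PART", 4),
    ("door", "CAR_PART", 12),
]
print(extract_relations(entities))  # [('front bumper', 'scratch', 'hasDamage')]
```

A supervised approach would replace the rule lookup with a trained classifier over features of each candidate pair, while keeping the same input and output shape.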

2.3. Discussion

The existing literature has given limited attention to modeling the vehicle damage. Additionally, the existing works in information extraction in the automotive field mostly rely on online automobile advertisements to extract simple entities such as car brands and models. However, these works use traditional rule-based methods that have limitations, such as their inability to handle variations in natural language and context, leading to reduced accuracy and reliability of the extracted information. Furthermore, they are not designed to extract relations between entities, which is crucial to understanding the underlying semantics of the data.

In our study, we used deep learning approaches for NER since they produce promising results, especially when dealing with large datasets of car damage reports where patterns cannot be easily captured by traditional methods. Deep learning models leverage machine learning algorithms and statistical techniques to learn patterns and associations from labeled data, enabling them to generalize well to new, previously unobserved car damage reports. We compared various deep learning-based NER models and chose the one that yielded the most accurate performance on our data. For relation extraction, we employed ML approaches and tested different algorithms and features to determine the most effective combination for our task. We also used ontology reasoning to further improve the accuracy of our relation extraction. In the next section, we will provide more detail about our approach to constructing $OCD$ and filling it with information extracted using NER and RE.

3. Approach

Our approach comprises two main steps: ontology construction and information extraction. In the ontology construction step, we identify the relevant concepts and relationships in the domain of car damage assessment and construct a structured representation of this knowledge. In the information extraction step, we extract information from the text and map it to the relevant concepts and relationships to populate our ontology.

3.1. Ontology construction

In this section, we provide a detailed account of the ontology construction process, focusing on the various steps (Fig. 1) involved in creating the ontology for car damage ( $OCD$ ) and the main objectives and competency questions guiding the construction of the ontology.

Fig. 1.

Steps in the creation of the ontology for car damage ( $OCD$ ).

3.1.1. Objectives of the ontology

The main objective of constructing the $OCD$ ontology is to create a comprehensive and semantically rich representation of the domain of car damages and related components. This entails capturing the relationships between different car parts, types of damages, severity levels, and other relevant entities. The primary motivation behind this is to enhance data interoperability and facilitate automated information extraction from unstructured car damage reports. Additionally, the ontology aims to establish semantic interoperability by defining shared concepts and relationships within the automotive domain. This fosters seamless integration and communication between diverse systems and applications used in the industry. Moreover, the ontology plays a critical role in improving the accuracy and relevance of extracted relations.

The structured data resulting from this information extraction process is instrumental in training regression models to predict the approximate cost of repairing reported damages, providing valuable insights for the automotive industry.

3.1.2. Competency questions

The OCD (Ontology for Car Damage) addresses several competency questions within the domain of car damages. Some of the key questions that the ontology helps to answer include:

Q1: What are the different types of damages that can occur to a car?

This question addresses the fundamental classification of car damages. The ontology would define various types of damages such as dents, scratches, collisions, etc., providing a structured framework for understanding and categorizing different damage scenarios.

Q2: What are the various car parts that can be affected by damages?

This question delves into the specific components of a car that could be damaged. The ontology would list and define various parts like bumpers, doors, windows, etc. It helps in understanding the granularity of damage, aiding in precise identification and assessment.

Q3: How severe is each type of damage?

This question seeks to establish a severity scale for different types of damages. The ontology should incorporate severity levels (e.g., minor, moderate, severe) associated with each damage type, allowing users to assess the impact of the damage.

Q4: What are the possible relationships between car parts and damages?

This question explores the relationships between different types of damages and the car parts they affect. The ontology defines these relationships, outlining which damages commonly impact specific parts. This information is crucial for insurers and repair shops.

Q5: How can the extracted information from unstructured reports be linked to specific concepts in the ontology?

This question addresses the practical application of the ontology. It involves extracting relevant information from unstructured sources like accident reports and linking this information to the defined concepts within the ontology. This linkage enhances data organization and retrieval.

Q6: How can the ontology be used to enhance the accuracy and relevance of extracted relations between car components and damages?

This question focuses on the ontology’s role in improving the accuracy and relevance of extracted information from automotive reports.

Q7: What is the overall structure and hierarchy of car components and damage types within the ontology?

This question deals with the ontology’s structure. It requires defining a clear hierarchy and structure for car components and damage types. For example, organizing parts hierarchically (e.g. CarBody > Trunk > TrunkLid) and categorizing damage types under appropriate categories (e.g. Damage > BodyDamage > Scratches) creates a structured ontology, aiding in efficient data management and knowledge retrieval.

These questions were formulated after a careful examination of the requirements within the car damages domain. The corresponding SPARQL queries for these questions are highlighted in Table 3.
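The hierarchy mentioned in Q7 can be illustrated with a few lines of Python. This is a minimal sketch that uses only the two example chains given above (CarBody > Trunk > TrunkLid and Damage > BodyDamage > Scratches). The dictionary `PARENT` and both helper functions are hypothetical, and the sketch does not assume whether a link denotes a subclass or a part-of relation in the actual ontology.

```python
# Child -> parent links copied from the two examples given for Q7.
PARENT = {
    "TrunkLid": "Trunk",
    "Trunk": "CarBody",
    "Scratches": "BodyDamage",
    "BodyDamage": "Damage",
}


def ancestors(name):
    """Walk up the hierarchy and return the path from `name` to its root."""
    path = [name]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def descendants(root):
    """Return every entry that sits below `root`, at any depth."""
    return sorted(c for c in PARENT if root in ancestors(c)[1:])


print(ancestors("TrunkLid"))   # ['TrunkLid', 'Trunk', 'CarBody']
print(descendants("Damage"))   # ['BodyDamage', 'Scratches']
```

Walking up the chain answers questions such as "which top-level component does this part belong to", while walking down lists everything grouped under a category.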

3.1.3. Collection of terms

To construct the ontology and gather terms and concepts describing the domain of car damages, we relied upon the expertise of insurance professionals. We analyzed their detailed description reports and consulted repair shop records, ensuring comprehensive coverage of all types of damages and car models. The $OCD$ captures pertinent information from these damage reports, which may vary in detail and content. The ontology’s flexible and expandable structure allows easy updates and modifications to accommodate new types of damages or car models, making it a valuable tool for information representation and analysis in the automotive domain. In developing the ontology, our main focus was to create a domain-specific representation tailored explicitly for the car damage assessment domain. As a result, we opted not to reuse concepts from other existing ontologies. This decision was driven by the necessity for precise car damage modeling and the specific requirements of the automotive industry. The creation of a new ontology allows us to accommodate the exact models and standards used in the region, ensuring a more accurate and relevant representation of car damages.

3.1.4. Concepts

In our $OCD$ ontology, we have three main concepts: “ $Damage$ ”, “ $Car$ ”, and “ $CarParts$ ”. The concept of “ $Damage$ ” represents any kind of harm, impairment, or loss caused to a car, while the concept of “ $Car$ ” represents any vehicle that is transported from one location to another, whether by rail, truck, or ship. This includes new cars being shipped from a manufacturer to a dealership and used cars being transported to their next owner.

The concept of damage can be further categorized into three main subclasses: body damage, mechanical damage and missing damage.

$Body Damage$ : This refers to any harm or impairment that occurs to the external components of a car. This includes scratches, dents, cracks, and paint damage, among others. These types of damage are typically repaired by a body shop specialist with expertise in repairing and restoring car bodywork. Within the subclass of $Body Damage$ , we can identify different types of damage, such as superficial damage (minor scratches or dents that do not affect the car’s structure), structural damage (major dents or impacts that affect the car’s frame), and paint damage (discoloration, fading, or chipping of the car’s paint).

$Mechanical Damage$ : This refers to any harm or impairment that occurs to the mechanical components of a car. This includes engine malfunctions, transmission issues, and brake system failures, among others. These types of damage are typically repaired by an auto mechanic who specializes in repairing and restoring the car’s mechanical systems.

Within the subclass of mechanical damage, we can identify different types of damage, such as electrical damage (faulty wiring or malfunctioning sensors), transmission damage (slipping gears or leaks), and engine damage (overheating or broken components).

$Missing Damage$ : This refers to any missing components or parts of the car. This can include items such as mirrors, headlights, or other exterior components, as well as interior components such as seats or radios. In some cases, missing damage may also affect the car’s mechanical systems if significant components such as the battery or alternator are missing.

Repairing missing damage typically involves replacing the missing components or parts with new ones, or in some cases, with used or refurbished ones.

The concept $CarParts$ can range from the basic components, such as screws and bolts, to the most complex systems, such as the engine and transmission. Furthermore, some specific $CarParts$ can be further broken down into multiple constituent parts. For instance, a car wheel is a type of $CarParts$ , and it consists of several components, including wheel bearings, wheel rims, tires, and wheel fasteners. Figure 2 provides a snapshot of the classes present in the ontology.

Fig. 2.

A snapshot of the classes present in the $OCD$ ontology.

3.1.5. Data properties

The $OCD$ ontology includes a number of data properties that can be used to describe various attributes of cars, car parts and damages. Table 2 summarizes the data properties available in the $OCD$ ontology.

Table 2
Data properties for the $OCD$ ontology

Data property	Domain	Type	Description
CarBrand	Car	String	The company that produced the car
CarModel	Car	String	The model name of the car
CarYear	Car	Integer	The year the car was manufactured
CarColor	Car	String	The exterior color of the car
CarPrice	Car	Float	The price of the car in a given currency
CarRegistration	Car	String	The registration number of the car
FuelType	Car	String	The type of fuel used by the car, such as gasoline, diesel, or electric
CarMileage	Car	Integer	The total number of miles the car has traveled
hasPartName	CarParts	String	The name of the car part
CarPartMaterial	CarParts	String	The material(s) used to make the part
IsDamaged	CarParts	Boolean	This property specifies whether the car part is damaged or not
Place	CarParts	String	This property specifies the location of a damaged car part
PartPrice	CarParts	Float	The price of the part in a given currency
DamageType	Damage	String	The type of damage sustained by the car or car part
RepairCost	Damage	Float	The estimated cost of repairing the damage
Severity	Damage	String	This property specifies the severity of the damage to the car part
RepairAction	Damage	String	Recommended repair action for the damage

3.1.6. Object properties

The $OCD$ ontology includes several object properties that define the relationships between the classes. These object properties include:

$hasCarPart$ : This object property relates the car to its constituent parts. It has a domain of the class $Car$ and a range of the class $CarParts$ . For example, a car has an engine, wheels, seats, and so on. The inverse of this object property is $isPartOf$ .

$hasDamage$ : This object property is used to define the relationship between a $CarParts$ and $Damage$ . The inverse of this object property is $inPart$ .

$hasComponent$ : This object property establishes a relationship between a $CarParts$ and its component $CarParts$ . It denotes that a specific car part is composed of or includes other individual parts. This property is used to model hierarchical structures and compositions within the ontology, allowing for a detailed representation of the relationships between various car components.

3.1.7. Ontology evaluation

Ontology evaluation is the process of assessing the quality of an ontology by measuring it against a set of established criteria, including accuracy, completeness, conciseness, adaptability, clarity, computational ability, and consistency. This helps to ensure that the ontology is reliable and can effectively support its intended applications (Raad and Cruz, 2015).

There are four common techniques used for evaluating ontologies (Hazman et al., 2011; Asim et al., 2018). The first technique, gold standard-based evaluation, compares the learned ontology to a reference ontology that represents the ideal knowledge representation for a specific domain. The second, application-based evaluation, assesses the ontology's performance in a task-specific application. The third, data-driven or corpus-based evaluation, measures the ontology's coverage of a domain using domain-specific knowledge sources. Lastly, expert-based evaluation usually involves evaluating the ontology through the experiences of users by defining indicators and assessing the ontology against each of them.

In our case, we checked the consistency of the ontology and ensured that reasoning produced no errors, using the reasoners $FaCT++$ (Tsarkov and Horrocks, 2006) and $HermiT$ (Shearer et al., 2008). We collaborated with insurance industry domain experts to gather feedback and suggestions for improving our ontology. After conducting a series of interview sessions and analyzing sample data, we made several revisions to the ontology, including adding new classes and properties and refining the definitions of existing ones, in order to improve its effectiveness in representing and analyzing data. We also validated the ontology's ability to answer the competency questions mentioned in Section 3.1.2 using SPARQL (Pérez et al., 2009) (Table 3). Additionally, we mapped the extracted entities and relations onto the ontology to ensure that the relevant concepts can be represented. An illustrative example is provided in Section 4.5.3.

The $OCD$ ontology will be used by car dealers and insurance companies to share reports on car damage information among different users. Furthermore, it can be adapted for other vehicles’ damage and reused for various applications.

Table 3
Competency questions and SPARQL queries

3.2. Information extraction for ontology population

To populate our ontology, we propose an information extraction system that comprises four critical modules (Fig. 3): the pre-processing module, the information extraction module, the ontology population module, and the ontology reasoning module. These modules work in concert to provide a complete solution for extracting relevant information from unstructured automotive reports.

3.2.1. Pre-processing module

The text describing the car damage is extracted from $PDF$ reports and analyzed using a spell checking algorithm to correct any spelling errors that may have been made during the writing of damage reports. This module is essential for increasing the efficiency and accuracy of our information extraction module, as it ensures that the extracted information is as accurate as possible.

There are several approaches to correcting spelling errors, such as dictionary-based, rule-based, statistical-based, and neural network-based methods. In this module, we use a dictionary-based approach due to its simplicity, but it has the limitation that it cannot correct composed words, which are common in the automotive domain.

Hence, we propose an algorithm based on the $Levenshtein$ distance (Levenshtein et al., 1966) and sliding window to correct misspelled words, including composed words.

Fig. 3.

The overall architecture of our methodology.

The $Levenshtein$ distance between two strings A and B, denoted as $LD (A, B)$ , is the minimum number of single-character operations (insertions, deletions, or substitutions) required to transform string A into string B. The proposed algorithm (Algorithm 1) takes three parameters: the input sentence, a list of correctly spelled words, and the distance thresholds for each window size. It returns the corrected sentence. The algorithm first initializes an empty list of suggested corrections. It then uses a sliding window to obtain the 1-word and 2-word subsets of the input sentence. For each subset, the algorithm calculates the $Levenshtein$ distance between the subset and each word in the word list. If the distance is less than or equal to the threshold defined for that window size, the correctly spelled word is added to the list of suggestions. If the list of suggestions is not empty, the algorithm corrects the misspelled word in the input sentence. Finally, the algorithm returns the corrected sentence. Here is an example of a corrected French sentence:

Input: “Pqr choc casse et port raye”

Output: “Pare-choc casse et porte raye”

Algorithm 1:

Spelling checker algorithm with $Levenshtein$ distance and sliding window
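Because the steps of Algorithm 1 are described only in prose here, the following Python sketch shows one way they could fit together. It is an illustration under assumptions rather than the authors' implementation: the function names, the rule that 2-word windows are compared only against composed vocabulary entries (those containing a hyphen or a space), the tie-breaking by vocabulary order, and the toy vocabulary are all hypothetical. The default thresholds of 2 and 4 are the values reported in Section 4.2.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest(text, candidates, threshold):
    """Return the candidate nearest to text if it is within threshold, else None."""
    best = min(candidates, key=lambda word: levenshtein(text, word), default=None)
    if best is not None and levenshtein(text, best) <= threshold:
        return best
    return None


def correct_sentence(sentence, vocabulary, threshold_1=2, threshold_2=4):
    """Correct misspelled single and composed words with a 1- and 2-word window."""
    vocab = [word.lower() for word in vocabulary]
    known = set(vocab)
    # Assumption: only composed entries are candidates for 2-word windows.
    composed = [word for word in vocab if "-" in word or " " in word]
    tokens = sentence.split()
    corrected = []
    i = 0
    while i < len(tokens):
        fix, step = None, 1
        if i + 1 < len(tokens):
            window = f"{tokens[i]} {tokens[i + 1]}".lower()
            fix = closest(window, composed, threshold_2)
            if fix is not None:
                step = 2  # the correction consumes both tokens
        if fix is None and tokens[i].lower() not in known:
            fix = closest(tokens[i].lower(), vocab, threshold_1)
        if fix is None:
            fix = tokens[i]  # nothing close enough: keep the token as written
        elif tokens[i][0].isupper():
            fix = fix.capitalize()
        corrected.append(fix)
        i += step
    return " ".join(corrected)


# Hypothetical toy vocabulary; a real one would list the domain's car parts and damages.
vocabulary = ["pare-choc", "choc", "casse", "et", "porte", "raye"]
print(correct_sentence("Pqr choc casse et port raye", vocabulary))
```

With this toy vocabulary, the 2-word window "Pqr choc" is three edits away from "pare-choc" and "port" is one edit away from "porte", so both fall within their respective thresholds and the example sentence above is reproduced.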

3.2.2. Information extraction module

The information extraction module aims to extract the car information (brand, model, color …), the damaged component entities, the damage type and the damage characteristics (severity, location…). For example, in this sentence: “Ford B-MAX with a severe dent on the back-left door and small scratches on the bumper, and a broken rearview mirror”, the extracted information would be: “Ford” as the brand of the car, “B-MAX” as the model of the car, “door”, “bumper”, and “rearview mirror” as components of the car, “dent”, “scratches”, and “broken” as damages, and “severe”, “small” are the severity of these damages. And “back-left” is the place. All of these concepts are defined in the proposed ontology $OCD$ . In addition, we also extract the relationships that exist between different entities. For instance, in the previous sentence, the relation “ $hasDamage$ ” exists between the car parts (“door”, “bumper”, “rearview mirror”) and their corresponding damages (“dent”, “scratches”, “broken”). Similarly, we can extract the relation “ $hasSeverity$ ” between the damages and their associated severity values (“severe”, “small”). By extracting these relationships, we can further enrich the structured representation of the extracted information (entities) and make it even more useful for downstream tasks.

Data labeling. To train our NER and RE models, we first needed to label our car damage reports. We used the $doccano$ tool (https://doccano.github.io/doccano) to annotate all existing entities and relations in each report. This tool made it easy to annotate our reports accurately. For entities, we selected the start and end points of each entity in the reports and then assigned the appropriate label type from the predefined set of entity labels. As for the relations, we identified the two entities involved in the relation and selected the corresponding predefined relation type. The annotated reports were then saved in a JSON file, which included entity types with their corresponding begin and end offsets, as well as relation types and the two entities that participated in each relation. Figure 4 illustrates the annotation process in the $doccano$ tool. These annotations were used to extract features for training the NER and RE models.

Fig. 4.

Example of entities and relations annotation in $doccano$ for information extraction module.

Named entities recognition. We used the labeled data generated by the $doccano$ tool to train the NER model. To extract features from the labeled data, we proposed an approach that involves capturing the word and POS context of each entity in the report. Specifically, we extract the word context, which is a window of two words on either side of the current word and the current word itself. Additionally, we extract the POS context, which is a window of two POS tags on either side of the current word and the current tag itself. We represent these features as a list of dictionaries, where each dictionary represents the features for a specific word in the report. These extracted features were then used to train and compare various NER models.
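As a rough illustration of the word and POS context features described above, the sketch below builds one feature dictionary per token using a window of two on either side. The function names, the feature keys, the `<PAD>` marker for positions outside the sentence, and the hand-assigned POS tags are assumptions made for the example; the paper does not specify them.

```python
def token_features(tokens, pos_tags, index, window=2):
    """Features for one token: the word, its POS tag, and both contexts."""
    features = {"word": tokens[index].lower(), "pos": pos_tags[index]}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        inside = 0 <= j < len(tokens)
        # Positions outside the sentence get a padding marker (an assumption).
        features[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
        features[f"pos[{offset:+d}]"] = pos_tags[j] if inside else "<PAD>"
    return features


def sentence_features(tokens, pos_tags):
    """List of dictionaries, one per token, as consumed by sequence labelers."""
    return [token_features(tokens, pos_tags, i) for i in range(len(tokens))]


# Hypothetical example; in practice the POS tags would come from a French tagger.
tokens = ["pare-choc", "cassé", "et", "porte", "rayée"]
pos_tags = ["NOUN", "ADJ", "CCONJ", "NOUN", "ADJ"]
features = sentence_features(tokens, pos_tags)
print(features[2])
```

A list of dictionaries of this shape is the input format expected by common CRF toolkits, which is one reason to keep the same representation across the compared models.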

In this study, we conducted a comparison of five different NER algorithms, each using a distinct approach to NER, namely transformer-based models, probabilistic models, and sequence models, with the addition of fine-tuning to improve performance. Specifically, we compared the performance of Conditional Random Fields ( $CRF$ ), the Bidirectional Long Short-Term Memory model with a CRF layer ( $BiLSTM - CRF$ ), $FlauBERT$ (available at https://huggingface.co/flaubert/flaubert_base_cased), $SpaCy$'s NER model (https://spacy.io/models/fr), and the Maximum Entropy Markov Model ( $MEMM$ ).

The $CRF$ (Lafferty et al., 2001) model is widely employed for NER tasks. It is a probabilistic model that uses a conditional approach and has a structural resemblance to an undirected graphical model. $CRF$ graph nodes represent sequences of class labels, in our case, $Damage$ , $CarParts$ and others.

The $BiLSTM - CRF$ (Huang et al., 2015) is a type of sequence model that combines two models: $BiLSTM$ and $CRF$ . The $BiLSTM$ network processes the input sentence in both forward and backward directions to capture the contextual information, while the $CRF$ layer applies constraints on the output sequence to ensure that it is a valid sequence of labels.

$FlauBERT$ , developed by Le et al. (2020), is a pre-trained language model based on the BERT architecture, specifically designed for French language processing. We fine-tuned $FlauBERT$ for NER tasks by adding a layer on top of the pre-trained model. The additional layer was trained on our labeled dataset.

$SpaCy$ ’s NER model uses a convolutional neural network architecture with residual connections, which allows it to analyze the context of each word in a text and use that context to make predictions about whether that word is part of a named entity and what the type of this entity is. The model is trained on a large dataset of annotated text in French language. We fine-tuned $SpaCy$ ’s NER model on our datasets to improve its performance on specific tasks.

$MEMM$ (McCallum et al., 2000) is a statistical sequence model used for NER tasks. It is an extension of the Markov Model, which assumes that the probability of a particular state depends only on the previous state. The $MEMM$ extends this by considering a set of features for each state, and it estimates the probabilities of the current state given the previous state and the observed features.

Section 4 presents a detailed description of the experimentation process and results of each NER algorithm. Finally, we select the best model to predict and extract entities from newly arrived car damage reports.

Relation extraction. The next phase in our information extraction module is relation extraction, which involves identifying relationships between different entities extracted with the NER model from a report.

To train RE models, we use the process described in Fig. 5. First, we use the annotated reports to extract three categories of features: distance features, word features, and embedding features. We describe each type of feature below:

Distance features: Rely on measurements between entities.

Word distance (word dist): the number of words between two entities.

Character distance (char dist): the number of characters between two entities.

Sentence distance (sent dist): the number of sentences between two entities.

Orientation: indicates whether entity 1 appears before or after entity 2.

Word features: Take into account various properties of context and entity words.

Bag of entities: the frequency counts of all annotation types between the entities.

Entity types: indicate the types of entities.

Embedding features: Generated from pre-trained word embedding models for each entity; here we used two models, Word2Vec and $SpaCy$ .

Word2Vec embeddings: Word2Vec (Mikolov et al., 2013) uses a neural network to generate vector representations of words based on their co-occurrence patterns in a large corpus of text data.

$SpaCy$ embeddings (https://spacy.io): Vectors generated from pre-trained word embedding models that capture the meaning and context of individual words.
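To make the distance features listed above concrete, the following sketch computes them for a pair of annotated entities identified by character offsets, as stored in the annotation JSON. The `Entity` class, the helper names, and the crude sentence count based on `.`, `!` and `?` are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def locate(text, surface, label):
    """Hypothetical helper: build an Entity from the first match of surface."""
    start = text.index(surface)
    return Entity(label, start, start + len(surface))


def distance_features(text, e1, e2):
    """Distance and orientation features between two entities of one report."""
    first, second = (e1, e2) if e1.start <= e2.start else (e2, e1)
    gap = text[first.end:second.start]  # text strictly between the entities
    return {
        "char_dist": len(gap),
        "word_dist": len(gap.split()),
        # Assumption: sentence boundaries are approximated by end punctuation.
        "sent_dist": sum(gap.count(mark) for mark in ".!?"),
        "orientation": 1 if e1.start <= e2.start else -1,
        "entity_types": (e1.label, e2.label),
    }


text = "severe dent on the back-left door"
door = locate(text, "door", "CarParts")
dent = locate(text, "dent", "Damage")
pair_features = distance_features(text, door, dent)
print(pair_features)
```

Here the car part follows the damage in the text, so the orientation is negative, and three words ("on the back-left") separate the two entities.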

Fig. 5.

Extracting relationships: a visual representation of relation extraction process.

The extracted features are then put through a feature selection using the Recursive Feature Elimination (RFE) method. This method aims to identify only the relevant features, reduce data dimensionality, and eliminate irrelevant or redundant features, leading to improved model performance, reduced overfitting, and faster training times.

To create our relation extraction models, we use four commonly used classification algorithms for relation extraction: support vector machines (SVM), K-nearest neighbors (KNN), decision trees (DT), and random forest (RF), as illustrated in Fig. 5. To determine the optimal hyperparameters for each model, we apply the grid search method, which is further elaborated on in Section 4. Ultimately, we choose the model with the highest score for predicting relations. In the process of predicting relationships between entities in a new car damage report, we first extract entities from the reports. Next, we extract the same features from these entities that were used during the model training phase. These extracted features are then employed to predict the relationships between the entities accurately.

3.2.3. Ontology reasoning module

Once our relation extraction model has processed the reports and identified relevant relationships between entities, we take our analysis one step further by incorporating $OCD$ ontology reasoning to enhance the quality of the extracted relations by reducing redundancy, resolving conflicts, and minimizing false positives and false negatives in the extracted relations. The process of ontology reasoning involves the following:

Type-based reasoning: Ontology provides a hierarchy of entity types and their properties, which can be used to guide the extraction of relations between entities of a certain type. For instance, the ontology assists in defining the possible relationships between $Carparts$ and $Damage$ , where only two potential relationships exist: $hasDamage$ or $NoRelation$ . This information was leveraged to filter out irrelevant relationships and concentrate on extracting pertinent entities and relationships.

Validation-based reasoning: Ontologies also provide rules and inference engines that can be used to infer additional relations between entities. For example, if the relation extraction model detects multiple relationships between a $Carparts$ and several entities of type $Place$ , but we know that $Carparts$ can only be placed in one location, we can use validation-based reasoning to identify the relationship that has the highest probability. This helps to eliminate incorrect relationships (false positives) and improve the quality of the extracted relations.

Implicit inference-based reasoning: Inferring implicit relationships involves identifying relationships that may not be explicitly recognized by the relation extraction model. By using the semantic relationships between entities in our ontology, we can infer additional relationships between entities. For instance, in a car damage report, all $CarParts$ entities are linked to a $Car$ entity through the $hasCarparts$ relation. Hence, we can deduce that a particular $CarParts$ is associated with a specific $Car$ even if the relation extraction model does not identify this relationship. This type of reasoning helps to enhance the extracted relations and reduce the number of false negatives.

The ontology reasoning process occurs in two steps. Initially, the raw extracted entities and relationships are populated into the ontology without applying reasoning. This step allows for the identification of incomplete or inaccurate data.

In the second step, the ontology reasoning engine applies logical inferences and deductions based on the ontology’s axioms, relationships, and rules. This enhances the information extraction process and improves the accuracy of the extracted entities and relationships.
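The three kinds of reasoning can be mimicked on plain Python data to show their effect on a set of predicted relations. This is only a sketch under assumptions: the paper applies an ontology reasoner, whereas the code below hard-codes the constraints in a lookup table, and the `Relation` class, the function names, and the probabilities are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    head: str
    head_type: str
    name: str
    tail: str
    tail_type: str
    prob: float = 1.0


# Relation permitted for each (head type, tail type) pair, following Section 3.
ALLOWED = {
    ("CarParts", "Damage"): "hasDamage",
    ("CarParts", "Place"): "PlacedIn",
    ("Damage", "Severity"): "hasSeverity",
}


def type_based(relations):
    """Drop predictions whose relation is not permitted for the entity types."""
    return [r for r in relations if ALLOWED.get((r.head_type, r.tail_type)) == r.name]


def validation_based(relations):
    """Keep only the most probable PlacedIn relation for each car part."""
    best, others = {}, []
    for r in relations:
        if r.name != "PlacedIn":
            others.append(r)
        elif r.head not in best or r.prob > best[r.head].prob:
            best[r.head] = r
    return others + list(best.values())


def implicit_inference(relations, car, parts):
    """Link every extracted car part to the car of the report."""
    linked = {r.tail for r in relations if r.name == "hasCarParts" and r.head == car}
    inferred = [Relation(car, "Car", "hasCarParts", part, "CarParts")
                for part in parts if part not in linked]
    return relations + inferred


predictions = [
    Relation("door", "CarParts", "hasDamage", "dent", "Damage", 0.9),
    Relation("door", "CarParts", "PlacedIn", "back-left", "Place", 0.8),
    Relation("door", "CarParts", "PlacedIn", "front", "Place", 0.3),
    Relation("dent", "Damage", "hasDamage", "severe", "Severity", 0.6),  # wrong type pair
]
refined = implicit_inference(validation_based(type_based(predictions)), "car_1", ["door"])
for relation in refined:
    print(relation.head, relation.name, relation.tail)
```

In this toy run, the ill-typed relation is filtered out, the less probable location is discarded, and the link between the car and its door is added even though the extraction model did not predict it.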

3.2.4. Ontology population module

This module takes the output of the information extraction and ontology reasoning modules to populate our $OCD$ ontology with instances of concepts. We use the $Owlready$ package (https://owlready2.readthedocs.io), which provides a variety of methods for working with ontologies, including inserting instances into the ontology. We map the extracted entities and relationships to the relevant properties and concepts of the ontology. For example, if we extract an entity representing a $CarModel$ , we populate the corresponding instance with the data properties of the concept $Car$ .

4. Experimentation

In this section, we present the experiments conducted in the second phase of our study, which focuses on information extraction from automotive reports describing damages to cars. Additionally, we provide a comprehensive description of the dataset used in the study and the steps taken to preprocess and label the data. Furthermore, we detail the evaluation of the spelling checker algorithm and its impact on the overall performance of our named entity recognition and relation extraction models.

4.1. Dataset description

The dataset used in our study was obtained from the company $Syartec$ (https://www.syartec.com/), and it comprises a significant number of unstructured car damage reports written in French. Specifically, the dataset encompasses 14,118 car damage reports, providing a diverse and extensive set of textual data to train and test our information extraction models.

The car damage reports encompass a wide range of entities and relationships relevant to the automobile domain. The distribution of entities and relations in the labeled dataset is depicted in Fig. 6.

4.2. Preprocessing and labeling

Before conducting the experiments, the raw dataset underwent thorough preprocessing to ensure data quality and uniformity, with special attention given to the implementation of the spelling checker algorithm (Algorithm 1).

During the labeling process, we conducted a comprehensive evaluation of the spelling checker algorithm to assess its ability to identify and correct spelling errors in the labeled data. To ensure accurate and effective performance, the algorithm was thoughtfully configured with two specific thresholds, denoted as ${threshold}_{1}$ and ${threshold}_{2}$ , with values set to 2 and 4 respectively.

The dataset was then labeled using the $doccano$ tool, a widely-used platform for annotating text data. The labeling process involved four authors of this article collaboratively identifying and annotating entities and relationships within the car damage reports.

The application of the spelling checker algorithm significantly improved the accuracy of the labeled data, providing greater confidence in the overall performance of our information extraction models. However, it is crucial to emphasize that the spelling checker was not the sole determinant of the final corrected word. After the algorithm performed its correction, a secondary verification step was carried out by human agents responsible for reviewing and validating the corrected words. This manual validation ensured that the final corrected words aligned contextually with the intended meaning in the car damage reports, adding an extra layer of scrutiny to guarantee accuracy and reliability.

4.3. Experimental setup

For evaluating the effectiveness of our information extraction approach, we divided the labeled dataset into training and testing sets for both NER and RE tasks. The training set constituted 80% of the dataset, while the testing set accounted for the remaining 20%. The experiments were conducted on a machine equipped with 16 GB RAM and an Intel Core i7-12700H processor, ensuring sufficient computational resources for training and testing our models.

Fig. 6.

Distribution of entities and relations in the labeled dataset.

4.4. Named entities recognition

Our aim is to extract entities from automotive damage reports. Specifically, we focused on extracting six types of entities: $CarBrand$ , $CarModel$ , $Carparts$ , $Damage$ , $Severity$ , and $Place$ . We excluded the other information in the reports since it was already structured and did not need to be extracted.

As discussed in Section 3, we used word context as our feature, which comprises a window of two words on either side of the current word and the current word itself. In addition to this, we also extracted the part-of-speech (POS) context feature, which includes a window of two POS tags on either side of the current word and the current tag itself. The relevance of these features was determined through an RFE method. The extracted features were subsequently organized into a list of dictionaries, with each dictionary representing the features of a specific word. To ensure consistency and fairness in our experimentation, we kept the same features for all models.

4.4.1. NER model hyperparameters

Hyperparameters are crucial settings that influence the behavior and performance of NER models. All of these hyperparameters were fine-tuned through experimentation to achieve optimal performance on our NER task.

In the $CRF$ model, we used the $LBFGS$ optimization algorithm, which is known for its efficiency and good performance on large-scale problems. To prevent overfitting and improve generalization, we applied $L1$ and $L2$ regularization with both coefficients set to 0.1. Additionally, we set the maximum number of training iterations to 100 so that the model converged to a stable solution without overfitting the training data. We applied the same $CRF$ hyperparameters to the $BiLSTM - CRF$ model, and additionally set $Hidden_Dim$ to 4, $Batch_size$ to 32, $Learning_rate$ to $5 \times 10^{- 2}$ , and $Weight_Decay$ to $1 \times 10^{- 4}$ . For the transformer-based model, we fine-tuned the pre-trained $FlauBERT$ with a $Learning_rate$ of $5 \times 10^{- 5}$ , a $batch_size$ of 32, 3 $Training_epochs$ , and the $Adam$ optimizer. We fine-tuned the $SpaCy$ French NER model for 1000 iterations with a $batch_size$ of 32, a $dropout$ rate of 0.35, and the $sgd$ optimizer. Lastly, for the $MEMM$ , we set $n_components$ to 4 and $n_iter$ to 1000; these parameters determine the number of hidden states and the maximum number of training iterations, respectively.

4.4.2. NER model evaluation

After training our models with the aforementioned hyperparameters, we evaluated their performance on a separate test set. We used ${precision}_{ner} (P_{ner})$ , ${recall}_{ner} (R_{ner})$ , and the F1-score_ner ( $F 1_{ner}$ ) to evaluate the models’ performance. The precision measures the proportion of predicted entities that are correct, recall measures the proportion of actual entities that are correctly identified, and F1-score is the harmonic mean of precision and recall.

We calculated these evaluation metrics as follows:

$\begin{matrix} P_{ner} = \frac{{TP}_{ner}}{{TP}_{ner} + {FP}_{ner}}; R_{ner} = \frac{{TP}_{ner}}{{TP}_{ner} + {FN}_{ner}}; {F1}_{ner} = \frac{2 \times {TP}_{ner}}{2 \times {TP}_{ner} + {FP}_{ner} + {FN}_{ner}} \end{matrix}$

Here, ${TP}_{ner}$ (True Positives) is the number of entities correctly recognized by the model, ${FN}_{ner}$ (False Negatives) is the number of entities missed by the model, and ${FP}_{ner}$ (False Positives) is the number of spans that the model incorrectly labeled as entities.
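As a quick worked check of these formulas, the sketch below computes the three metrics from raw counts and confirms that the count-based F1 expression equals the harmonic mean of precision and recall. The counts are hypothetical and were chosen only so that precision is 1.00 and recall is 0.95.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return precision, recall, f1


def harmonic_mean(precision, recall):
    """F1 written as the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts: 19 entities found, none spurious, 1 missed.
precision, recall, f1 = precision_recall_f1(tp=19, fp=0, fn=1)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

With a precision of 1.00 and a recall of 0.95, both expressions give an F1-score of about 0.97.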

Table 4
Comparative results of NER models. The highest precision_ner (P_ner), recall_ner (R_ner) and F1-score_ner (F1_ner) are in bold

Models	Damage P_ner	Damage R_ner	Damage F1_ner	CarParts P_ner	CarParts R_ner	CarParts F1_ner	CarBrand P_ner	CarBrand R_ner	CarBrand F1_ner	CarModel P_ner	CarModel R_ner	CarModel F1_ner	Severity P_ner	Severity R_ner	Severity F1_ner	Place P_ner	Place R_ner	Place F1_ner
BiLSTM-CRF	0.89	0.94	0.91	0.89	0.89	0.89	1.00	0.97	0.99	1.00	1.00	1.00	1.00	1.00	1.00	0.95	0.91	0.93
FlauBERT	0.69	0.55	0.61	0.69	0.77	0.73	1.00	1.00	1.00	1.00	1.00	1.00	0.33	0.67	0.44	0.73	0.78	0.76
CRF	0.98	0.91	0.94	0.97	0.95	0.96	0.97	1.00	0.99	0.97	0.95	0.96	0.88	1.00	0.93	1.00	1.00	1.00
$SpaCy$ Model	1.00	0.95	0.97	0.96	0.91	0.93	0.94	0.96	0.95	0.89	0.98	0.93	1.00	1.00	1.00	0.88	1.00	0.93
MEMM	0.78	0.93	0.85	0.59	0.93	0.72	0.82	1.00	0.90	0.84	1.00	0.91	0.74	0.80	0.77	0.98	1.00	0.99

Table 4 shows how the different models performed on each entity type ( $Damage$ , $CarParts$ , $CarBrand$ , $CarModel$ , $Severity$ , and $Place$ ). The $SpaCy$ model achieved the highest F1-score for the $Damage$ entity (0.97), with a precision of 1.00 and a recall of 0.95. The $CRF$ model achieved the highest F1-score for the $CarParts$ and $Place$ entities (0.96 and 1.00, respectively), and reached 0.99 and 0.93 for $CarBrand$ and $Severity$ . The $BiLSTM - CRF$ model reached an F1-score of 1.00 for the $CarModel$ and $Severity$ entities and 0.93 for $Place$ . The $FlauBERT$ model also reached an F1-score of 1.00 for the $CarModel$ and $CarBrand$ entities. This can be attributed to the fact that $CarModel$ and $CarBrand$ entities often have distinctive and recognizable names that are likely to have appeared in the data used to pre-train $FlauBERT$ , making them easier to recognize. The $MEMM$ model performs relatively well, with F1-scores ranging from 0.72 to 0.99. It reaches high recall for the $CarParts$ , $CarBrand$ , and $CarModel$ entities (0.93, 1.00, and 1.00) but lags in precision.

Overall, the table suggests that the $SpaCy$ model performs well across most entities, while the CRF model performs well for entities that require more precise identification, and the $BiLSTM - CRF$ model performs well for entities that require identification of longer sequences.

4.5. Relation extraction

Once we had extracted the entities, we used ML algorithms to identify the relations between them. Specifically, we aimed to extract four types of relation:

The $hasDamage$ relation exists between the entities $Carparts$ and $Damage$ .

The $hasCarParts$ relation exists between the entities $CarBrand$ and $Carparts$ .

The $PlacedIn$ relation exists between the entities $Carparts$ and $Place$ .

The $hasSeverity$ relation exists between the entities $Damage$ and $Severity$ .

To extract these relationships, we used the features described in detail in Section 3, after applying the feature selection method.

4.5.1. RE model hyperparameters

The hyperparameters of the ML models for RE were selected through a grid search. We experimented with different combinations of hyperparameters for each model to determine the values that yield the best performance on our dataset. After evaluating various options, we found the following hyperparameters to be optimal for each model. For SVM, we selected the radial basis function ( $RBF$ ) kernel and set the regularization parameter C to 0.1. For KNN, we set the number of neighbors ( $n_neighbors$ ) to 2 and the distance metric ( $metric$ ) to Euclidean. For the DT model, we set $max_depth$ to 12, $min_samples_split$ to 2, and the criterion to $gini$ . For RF, we found that setting the number of estimators to 8 and the maximum depth to 10 produced the best performance on our RE task. Table 5 presents the results of each model.

4.5.2. RE model evaluation

In relation extraction, the performance of the model is also evaluated using precision_re (P_re), recall_re (R_re), and the F1-score_re (F1_re). The formulas for calculating precision_re, recall_re, and F1-score_re are as follows:

$\begin{matrix} P_{re} = \frac{{TP}_{re}}{{TP}_{re} + {FP}_{re}}; R_{re} = \frac{{TP}_{re}}{{TP}_{re} + {FN}_{re}}; {F1}_{re} = \frac{2 \times {TP}_{re}}{2 \times {TP}_{re} + {FP}_{re} + {FN}_{re}} \end{matrix}$

Here, the true positives (TP_re) represent the number of relations correctly extracted by the model, false positives (FP_re) indicate relations incorrectly extracted, and false negatives (FN_re) represent relations that should have been extracted but were missed. True negatives (TN_re) are relations correctly not extracted; they do not enter these formulas.

Table 5 presents the results of four models, SVM, KNN, DT, and RF, for relation extraction using baseline features that include all distances and word features. The results show that the performance of the models varies based on the type of relationship.

The DT and RF models exhibit consistently good performance across all relation types. RF achieves the highest F1-score for every relation, and DT matches it on all but $hasSeverity$ (0.60 versus 0.83).

The SVM model shows good performance in terms of precision for the $hasDamage$ , $hasCarParts$ , and $PlacedIn$ relation types, but the recall values are relatively low. This indicates that the model has a high number of FN_re and has missed relevant relations.

The KNN model shows a consistent, average performance across all relation types, with an F1-score of 0.75 for the $hasDamage$ and $hasCarParts$ relations, and F1-scores of 0.96 and 0.60 for the $PlacedIn$ and $hasSeverity$ relations, respectively.

However, incorporating additional features can potentially improve the performance of the models, as demonstrated in Table 6.

Table 5

Comparative results of models for relation extraction using baseline features (all distances & word features). The highest precision_re (P_re), recall_re (R_re) and F1-score_re (F1_re) are in bold

Models	Relation type
	hasDamage			hasCarParts			PlacedIn			hasSeverity
	P_re	R_re	F1_re	P_re	R_re	F1_re	P_re	R_re	F1_re	P_re	R_re	F1_re
SVM	0.95	0.55	0.70	0.98	0.54	0.70	1.00	0.82	0.90	0.50	0.40	0.44
KNN	0.75	0.75	0.75	0.75	0.75	0.75	0.98	0.95	0.96	0.60	0.60	0.60
DT	0.94	0.92	0.93	0.97	0.97	0.97	0.98	0.98	0.98	0.60	0.60	0.60
RF	0.96	0.90	0.93	0.97	0.97	0.97	0.98	0.98	0.98	0.71	1.00	0.83

Table 6 presents a comparative analysis of the performance of different feature sets for relation extraction using the random forest algorithm. The table is divided into three main parts. The first part shows the performance of the baseline features, which include all distances and word features. The second part shows the performance of the baseline features combined with embedding features, specifically $Spacy$ 's and $word2vec$ embeddings. The third part shows the performance of the baseline features combined with ontology reasoning.

In terms of the baseline features, the table shows that the best overall results were obtained with the full baseline, which includes all distance features and word features.

The second part of the table shows that adding embedding features ( $Spacy$ 's or $word2vec$ vectors) to the baseline features did not improve performance: the F1-scores are lower than those of the baseline for all four relation types.

Finally, the third part of the table shows that the best performance was achieved when using the baseline features and ontology reasoning described in Section 3.2.3, as it achieved the highest F1-score for three out of the four relation types, indicating its superiority in enhancing the quality of the extracted relations. By leveraging the capabilities of the ontology, we were able to reduce redundancy, resolve conflicts, and minimize false positives and false negatives in the extracted relations.
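One simple check that an ontology makes possible is validating each extracted relation against the domain and range of its property and discarding duplicates. The sketch below illustrates that idea only; it is not the reasoning procedure of Section 3.2.3. The signatures are read off the examples given in this paper, while the class name `Car`, the function name and the toy data are hypothetical.

```python
# Expected (head type, tail type) for each relation, as used in the paper's examples.
# The class name "Car" is an assumption made for this illustration.
SIGNATURES = {
    "hasCarParts": ("Car", "CarParts"),
    "PlacedIn": ("CarParts", "Place"),
    "hasDamage": ("CarParts", "Damage"),
    "hasSeverity": ("Damage", "Severity"),
}


def filter_by_signature(relations, entity_types):
    """Keep relations whose argument types match the signature, without duplicates."""
    kept, seen = [], set()
    for name, head, tail in relations:
        expected = SIGNATURES.get(name)
        actual = (entity_types.get(head), entity_types.get(tail))
        if expected == actual and (name, head, tail) not in seen:
            seen.add((name, head, tail))
            kept.append((name, head, tail))
    return kept


entity_types = {
    "door_1": "CarParts",
    "rear": "Place",
    "scratched": "Damage",
    "heavily": "Severity",
}
extracted = [
    ("hasDamage", "door_1", "scratched"),
    ("hasDamage", "door_1", "scratched"),   # redundant duplicate
    ("hasSeverity", "door_1", "heavily"),   # wrong head type: CarParts, not Damage
    ("PlacedIn", "door_1", "rear"),
    ("hasSeverity", "scratched", "heavily"),
]

validated = filter_by_signature(extracted, entity_types)
print(validated)
```

A type check of this kind removes redundant and ill-typed relations, but it cannot catch a well-typed relation that links the wrong pair of entities; handling those cases requires richer rules.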

Table 6

Comparative results of the best features for relation extraction using random forest algorithm. The highest precision (P_re), recall (R_re) and F1-score_re (F1_re) are in bold

Features	Relation type

	hasDamage			hasCarParts			PlacedIn			hasSeverity

	P_re	R_re	F1_re	P_re	R_re	F1_re	P_re	R_re	F1_re	P_re	R_re	F1_re
baseline: distance & word feat-s	0.96	0.90	0.93	0.97	0.97	0.97	0.98	0.98	0.98	0.71	1.00	0.83
baseline-nb word dist	0.99	0.89	0.94	0.96	0.96	0.96	0.87	0.79	0.83	0.50	0.60	0.55
baseline-char dist	0.95	0.88	0.92	0.97	0.97	0.97	0.98	0.98	0.98	0.71	1.00	0.83
baseline sent dist	0.95	0.76	0.84	0.97	0.99	0.98	0.74	0.76	0.75	0.50	0.60	0.55
baseline + Embs: Spacy’s vector	0.94	0.75	0.83	0.97	0.92	0.94	0.93	0.88	0.91	0.43	0.60	0.50
baseline + Embs: word2vec	0.98	0.76	0.85	0.96	0.94	0.95	0.95	0.92	0.93	0.43	0.60	0.50
baseline + Ontology-reasoning	0.97	0.90	0.93	0.97	0.99	0.98	0.98	0.98	0.98	0.82	1.00	0.90

Fig. 7.

FP_re and FN_re relation extraction with and without ontology-reasoning.

Fig. 7 plots the number of false positives (incorrectly extracted relations) and false negatives (relations that should have been extracted but were missed by the model) for each relation type, with and without ontology reasoning. The right subplot, which uses ontology reasoning, shows a substantial decrease in both FP_re and FN_re compared with the left subplot, which does not. These findings suggest that integrating ontology reasoning into the model can significantly enhance its capacity to identify the correct relations, as demonstrated in Fig. 8.

The report in Fig. 8 is written in French; its English translation is: “Rear door heavily scratched, and the front right door dented”. The NER model extracts all the entities correctly: $rear$ is identified as a $Place$ , $door$ as a $CarParts$ , $heavily$ as a $Severity$ , and $scratched$ as a $Damage$ . Similarly, $front$ $right$ is recognized as a $Place$ , $door$ as a $CarParts$ , and $dented$ as a $Damage$ . However, the RE model wrongly extracts a $hasDamage$ relation between the front right $door$ and the damage $scratched$ . To address this issue and improve the overall model performance, we incorporated ontology reasoning, which helps reduce such errors in the relation extraction model.

Despite the significant improvement brought about by incorporating ontology reasoning, some false positives and false negatives remain, representing less than 5% of the total relations. These errors can be attributed to labeling errors and to the inherent complexity of natural language and of the relationships between entities. The proposed system is nevertheless intended to assist users, and it relies on user intervention to check and correct any false relations.

Fig. 8.

Comparison of entity recognition and relation extraction with and without ontology reasoning.

The experimental results demonstrate the effectiveness of our information extraction approach for automotive damage reports. We used the $Spacy$ model to recognize named entities and a random forest to extract relations between them. By combining these with $OCD$ ontology reasoning, we successfully extracted relevant information from complex automotive reports, particularly in scenarios where a single damage event is associated with multiple car parts, as illustrated in Fig. 9.

We created the $OCD$ ontology using Protégé 5.5.0 (https://protege.stanford.edu) and populated it with the extracted information by mapping the entities to the corresponding concepts and data/object properties using the Owlready library. This ontology provides a more standardized representation of the extracted information, making it easier to process and analyze.

Fig. 9.

Information extraction: entity recognition and relation extraction in complex scenarios.

4.5.3. Example and discussion

In this section, we present an illustrative example that demonstrates the step-by-step process of each module. The example covers the entire workflow, from the preprocessing of car damage reports to the final population of the $OCD$ ontology.

Consider the following car damage report (French report in Fig. 10): “Volvo XC60 damage found are:

front and rear bumpers broken and perforated

right front and right rear door with deep scratches

right front and right rear wing with deep scratches and impact to paintwork

front right fog lights broken

right-hand door mirror broken.”

The first step is the preprocessing module, in which we check for spelling errors to ensure that all entities are spelled correctly, as shown in Fig. 10. The corrected errors are underlined in red.

Fig. 10.

Spelling checker.
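A spelling check of this kind can be approximated with edit distance. The sketch below assumes a Levenshtein-based lookup against a domain lexicon; the lexicon, the `max_distance` threshold and the function names are hypothetical, and the checker actually used in the system may work differently.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def correct(token, vocabulary, max_distance=2):
    """Return the closest vocabulary term within max_distance, else the token unchanged."""
    lowered = token.lower()
    best = min(vocabulary, key=lambda term: levenshtein(lowered, term))
    return best if levenshtein(lowered, best) <= max_distance else token


# Hypothetical French domain lexicon (bumper, door, scratch, broken, mirror).
VOCABULARY = ["pare-chocs", "portière", "rayure", "cassé", "rétroviseur"]

print(correct("portiere", VOCABULARY))
print(correct("rayurre", VOCABULARY))
print(correct("Volvo", VOCABULARY))
```

Tokens that are far from every lexicon entry, such as brand names, are left untouched, which keeps the correction step from damaging entities that the lexicon does not cover.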

In the second step, the information extraction module, we identify and extract entities and relations from the report. The extracted entities include essential information such as the car brand ( $Volvo$ ) and model ( $XC60$ ), the specific car parts affected ( $Bumper$ , $Door$ , etc.), the types of damage sustained ( $Broken$ , $Perforated$ , $Scratches$ , $Impact$ $to$ $Paintwork$ ), and the locations of the damage ( $Front$ , $Rear$ , $Right$ , $Front Right$ ).

The relationships between these entities are also extracted. For instance, the $hasCarParts$ relationship associates the car entity $Volvo$ with its constituent parts ( $Bumper$ , $Door$ , etc.). Similarly, the $hasDamage$ relationship links each car part to the type of damage it has sustained (for example, $Front Bumper$ has $Broken$ and $Perforated$ damages). The result of this step is illustrated in Fig. 11.

Fig. 11.

Extraction of entities and relations.

The next step is the ontology reasoning module, which is applied in two steps within our system. First, we populate the ontology with the initial extraction results using the $Owlready2$ library in Python. At this stage, the ontology may contain incomplete or inaccurate data.

Regarding the linking of entities and the handling of possible ambiguities, the population process ensures that entities with the same name or meaning are linked to the same resource in the ontology. For example, every mention of $Ford$ is linked to the same brand resource, and every mention of $B-MAX$ is linked to the corresponding car model resource.
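The linking behaviour described here can be illustrated with a small registry that returns the same identifier for every mention of the same entity. This is a sketch under simplifying assumptions: the class names `Brand` and `Model`, the identifier scheme and the normalization rule are hypothetical, and only name variants are handled, not synonyms with different spellings.

```python
class ResourceRegistry:
    """Map surface forms to one resource identifier per (class, normalized name)."""

    def __init__(self):
        self._resources = {}

    @staticmethod
    def _normalize(name):
        # Case-insensitive key with collapsed whitespace.
        return " ".join(name.lower().split())

    def link(self, class_name, surface_form):
        """Return the identifier for this entity, creating it on first use."""
        key = (class_name, self._normalize(surface_form))
        if key not in self._resources:
            self._resources[key] = f"{class_name}_{len(self._resources) + 1}"
        return self._resources[key]


registry = ResourceRegistry()
ford_first = registry.link("Brand", "Ford")
ford_second = registry.link("Brand", "  FORD ")
b_max = registry.link("Model", "B-MAX")
print(ford_first, ford_second, b_max)
```

Both mentions of the brand resolve to the same identifier, while the car model receives its own resource.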

In scenarios where a single “ $CarParts$ ” entity is associated with multiple “ $Place$ ” entities, as illustrated in the provided example, such as the entity “ $Bumper$ ” being linked with both “ $Front$ ” and “ $Rear$ ” entities through the relation “ $PlacedIn$ ,” we address this situation by creating a new entity “ $Bumper$ ” to distinguish between the “ $Front Bumper$ ” and “ $Rear Bumper$ .” This approach ensures a clear and unambiguous representation of the relationships between entities in the ontology.
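The splitting step can be sketched as follows. The dictionary layout and the naming scheme that prefixes the place to the part name are assumptions made for illustration, chosen to match the “ $Front Bumper$ ” and “ $Rear Bumper$ ” labels used in the text.

```python
def split_by_place(part, places):
    """Create one car-part instance per place linked to the part through PlacedIn."""
    return [
        {"instance": f"{place} {part}", "type": "CarParts", "PlacedIn": place}
        for place in places
    ]


bumpers = split_by_place("Bumper", ["Front", "Rear"])
for bumper in bumpers:
    print(bumper)
```

Each resulting instance carries exactly one $PlacedIn$ value, so damage relations can then be attached to the front or the rear bumper without ambiguity.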

For n-ary relationships, we decompose the complex relationship into multiple binary relationships. As an illustration from the example above, consider the entities $Right Rear$ , $Door$ , $Deep$ , and $Scratches$ . We decompose the n-ary relationship $R$ , which involves $CarParts$ , $Place$ , $Severity$ , and $Damage$ , into three binary relationships:

$PlacedIn (Door, Right Rear)$

$hasDamage (Door, Scratches)$

$hasSeverity (Scratches, Deep)$
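The decomposition above can be expressed as a small function. The `DamageFact` record and the treatment of place and severity as optional fields are assumptions made for this sketch; the relation names and argument order follow the three binary relationships listed above.

```python
from typing import NamedTuple, Optional


class DamageFact(NamedTuple):
    """One n-ary fact involving a car part, its place, a damage and its severity."""
    car_part: str
    place: Optional[str]
    severity: Optional[str]
    damage: str


def decompose(fact):
    """Split an n-ary damage fact into binary (relation, head, tail) triples."""
    triples = []
    if fact.place is not None:
        triples.append(("PlacedIn", fact.car_part, fact.place))
    triples.append(("hasDamage", fact.car_part, fact.damage))
    if fact.severity is not None:
        triples.append(("hasSeverity", fact.damage, fact.severity))
    return triples


triples = decompose(DamageFact("Door", "Right Rear", "Deep", "Scratches"))
for triple in triples:
    print(triple)
```

When a report gives no place or severity, the corresponding triple is simply omitted, and the remaining binary relations are still valid.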

The relation “ $PlacedIn$ ” between “ $CarParts$ ” and “ $Place$ ” will be translated into a data property of the car part entity in the ontology. Similarly, the relation “ $hasSeverity$ ” between “ $Damage$ ” and “ $Severity$ ” will be translated into a data property of the car part entity in the ontology. However, it is essential to extract this relation to capture the semantic information between entities accurately.

In cases where various possible candidates can be linked to the same entity, we prioritize the entities with a higher probability of belonging to the entity based on the NER model score. Additionally, the ontology reasoning module plays a crucial role in making accurate decisions. By applying logical inferences and considering the ontology’s structure, the reasoning process ensures that the correct entities and relationships are linked together.

After the initial population of the ontology, we then leverage the power of ontology reasoning techniques in the second step. The ontology reasoning engine applies logical inferences and deductions based on the ontology’s axioms, relationships, and rules. This reasoning step helps to enhance the information extraction process and improve the accuracy of the extracted entities and relationships.

In the ontology population module, we use $Owlready2$ to update the populated ontology with the enhanced and corrected information. Any incorrect or inconsistent data from the initial population are removed, and the ontology is updated with the new and improved entities and relationships. Throughout this process, the car damage information is systematically and accurately represented in the ontology. By interconnecting individual entities through meaningful relationships, the ontology provides a comprehensive and structured representation of the car damage data. Fig. 12 illustrates the instantiation of the extracted information from the car damage report into the ontology.

Fig. 12.

Instantiation of extracted information from car damage report into the $OCD$ ontology.

5. Conclusion

In conclusion, the development of the $OCD$ ontology and the application of NER and RE techniques have provided a structured and standardized approach to damage modeling in the automotive industry. This approach allows the extraction of relevant information from unstructured and inconsistent damage reports, resulting in a comprehensive and uniform representation of the damage incurred during vehicle transportation. This has the potential to significantly improve the efficiency and accuracy of damage reporting and analysis in the automotive sector. The contribution of this work to the field of information extraction and ontology-based knowledge representation highlights the importance of leveraging such techniques in addressing real-world problems.

Future research can focus on expanding and refining the $OCD$ ontology to accommodate a wider range of damage scenarios and on improving the accuracy of the NER and RE techniques, while also using the extracted information to predict the cost of car damage repairs. To achieve this, we intend to construct a dataset using $SPARQL$ queries to retrieve data from the ontology. Subsequently, we plan to train regression models on this dataset to predict the cost of repairing a vehicle based on the severity and nature of the damage. This application can be particularly advantageous for insurance companies or repair shops that need to estimate repair costs for their customers. By leveraging this approach, they can provide more accurate and reliable estimates, thereby improving customer satisfaction and trust. Additionally, it can assist them in streamlining their repair processes and optimizing their costs.

Acknowledgements

This work is supported by both the company $Syartec$ and the $ANRT$ (National Association for Research and Technology).

Online resources

The supplementary online material supports the comprehension and reproduction of this study. It includes an OWL file representing the ontology ( $OCD$ ) and a file containing SPARQL queries, both accessible at the GitHub repository (https://github.com/OntologyCarDamage/OCD) and on the industry portal site (http://industryportal.enit.fr/ontologies/OCD).

References

Agichtein, E. & Gravano, L. (2000). Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries (pp. 85–94). doi:10.1145/336597.336644.

Ahaggach, H., Abrouk, L. & Lebon, E. (2023a). L’analyse des dommages aux voitures à l’aide de la reconnaissance d’entités nommées et d’ontologie. In European Grid Conference.

Albukhitan, S., Helmy, T. & Alnazer, A. (2017). Arabic ontology learning using deep learning. In Proceedings of the International Conference on Web Intelligence (pp. 1138–1142).

Alnazzawi, N., Thompson, P., Batista-Navarro, R. & Ananiadou, S. (2015). Using text mining techniques to extract phenotypic information from the PhenoCHF corpus. In BMC Medical Informatics and Decision Making (Vol. 15, pp. 1–10). BioMed Central.

Asim, M.N., Wasim, M., Khan, M.U.G., Mahmood, W. & Abbasi, H.M. (2018). A survey of ontology learning techniques and applications. Database, 2018, bay101. doi:10.1093/database/bay101.

Ayadi, A., Auffan, M. & Rose, J. (2020). Ontology-based NLP information extraction to enrich nanomaterial environmental exposure database. Procedia Computer Science, 176, 360–369. doi:10.1016/j.procs.2020.08.037.

Barrachina, J., Garrido, P., Fogue, M., Martinez, F.J., Cano, J.-C., Calafate, C.T. & Manzoni, P. (2012). Caova: A car accident ontology for vanets. In 2012 IEEE Wireless Communications and Networking Conference (WCNC) (pp. 1864–1869). IEEE.

Benjamin, A., Hamza, C., Yara, C., Jana, E.K., Lylia, A. & Nicolas, C. (2022). Construction d’une ontologie dans le domaine financier pour la détection de fraudes (pp. 157–162).

Bhatia, N., Kumar, R. & Senapaty, S. (2008). Extraction of structured information from online automobile advertisements. Department of Computer Science.

Brin, S. (1999). Extracting patterns and relations from the world wide web. In The World Wide Web and Databases: International Workshop WebDB’98, Valencia, Spain, March 27–28, 1998 (pp. 172–183). Springer. Selected Papers.

Browarnik, A. & Maimon, O. (2015). Ontology learning from text: Why the ontology learning layer cake is not viable. International Journal of Signs and Semiotic Systems (IJSSS), 4(2), 1–14. doi:10.4018/IJSSS.2015070101.

Buitelaar, P., Cimiano, P. & Magnini, B. (2005). Ontology Learning from Text: Methods, Evaluation and Applications (Vol. 123). IOS Press.

Chandra, R., Tiwari, S., Agarwal, S. & Singh, N. (2023). Semantic web-based diagnosis and treatment of vector-borne diseases using SWRL rules. Knowledge-Based Systems, 274, 110645. doi:10.1016/j.knosys.2023.110645.

Chen, Y., Lasko, T.A., Mei, Q., Denny, J.C. & Xu, H. (2015). A study of active learning methods for named entity recognition in clinical text. Journal of biomedical informatics, 58, 11–18. doi:10.1016/j.jbi.2015.09.010.

Dardailler, D. (2012). DRAFT Road Accident Ontology. https://www.w3.org/2012/06/rao.html.

Davidov, D. & Rappoport, A. (2008). Classification of semantic relationships between nominals using pattern clusters. In Proceedings of ACL-08: HLT (pp. 227–235).

Etzioni, O., Banko, M., Soderland, S. & Weld, D.S. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74. doi:10.1145/1409360.1409378.

Everett, J.O., Bobrow, D.G., Stolle, R., Crouch, R., de Paiva, V., Condoravdi, C., van den Berg, M. & Polanyi, L. (2002). Making ontologies work for resolving redundancies across documents. Communications of the ACM, 45(2), 55–60. doi:10.1145/503124.503149.

Feld, M. & Müller, C. (2011). The automotive ontology: Managing knowledge inside the vehicle and sharing it between cars. In Proceedings of the 3rd International Conference on Automotive User Interfaces and Interactive Vehicular Applications (pp. 79–86). doi:10.1145/2381416.2381429.

Hamdan, A.-H., Bonduel, M. & Scherer, R.J. (2019). An ontological model for the representation of damage to constructions. In CEUR Workshop Proceedings (Vol. 2389, pp. 64–77).

Hazman, M., El-Beltagy, S.R. & Rafea, A. (2011). A survey of ontology learning approaches. International Journal of Computer Applications, 22(9), 36–43. doi:10.5120/2610-3642.

Hepp, M. (2010). Vehicle sales ontology. Available at http://www.heppnetz.de/ontologies/vso/ns.

Huang, Z., Xu, W. & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint. arXiv:1508.01991.

Jalal, A.A. (2020). Text mining: Design of interactive search engine based regular expressions of online automobile advertisements. Int. J. Eng. Pedagog., 10(3), 35–48. doi:10.3991/ijep.v10i3.12419.

Kim, J. & Chung, K.-Y. (2014). Ontology-based healthcare context information model to implement ubiquitous environment. Multimedia Tools and Applications, 71, 873–888. doi:10.1007/s11042-011-0919-6.

Klotz, B., Troncy, R., Wilms, D. & Bonnet, C. (2018). VSSo: The vehicle signal and attribute ontology. In SSN@ ISWC (pp. 56–63).

Lafferty, J., McCallum, A. & Pereira, F.C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L. & Schwab, D. (2020). FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association. (pp. 2479–2490). Marseille, France. https://aclanthology.org/2020.lrec-1.302 .

Leaman, R., Wei, C.-H., Zou, C. & Lu, Z. (2015). Mining patents with tmChem, GNormPlus and an ensemble of open systems. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop (pp. 140–146).

Levenshtein, V.I., et al. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady (Vol. 10, pp. 707–710). Soviet Union.

Lopez, M.M. & Kalita, J. (2017). Deep Learning applied to NLP. arXiv preprint. arXiv:1703.03091.

McCallum, A., Freitag, D., Pereira, F.C., et al. (2000). Maximum entropy Markov models for information extraction and segmentation. Icml, 17, 591–598.

Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint. arXiv:1301.3781.

Mishra, S. & Jain, S. (2015). A study of various approaches and tools on ontology. In 2015 IEEE International Conference on Computational Intelligence & Communication Technology. (pp. 57–61). IEEE.

Munir, K. & Anjum, M.S. (2018). The use of ontologies for effective knowledge modelling and information retrieval. Applied Computing and Informatics, 14(2), 116–126. doi:10.1016/j.aci.2017.07.003.

Nadeau, D. & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30, 3–26. doi:10.1075/li.30.1.03nad.

Nepal, M.P., Staub-French, S., Pottinger, R. & Zhang, J. (2013). Ontology-based feature modeling for construction information extraction from a building information model. Journal of Computing in Civil Engineering, 27(5), 555–569. doi:10.1061/(ASCE)CP.1943-5487.0000230.

Nundloll, V., Smail, R., Stevens, C. & Blair, G. (2022). Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science. Heliyon.

Pérez, J., Arenas, M. & Gutierrez, C. (2009). Semantics and complexity of SPARQL. ACM Transactions on Database Systems (TODS), 34(3), 1–45. doi:10.1145/1567274.1567278.

Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V. & Spyropoulos, C.D. (2001). Using machine learning to maintain rule-based named-entity recognition and classification systems. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (pp. 426–433).

Raad, J. & Cruz, C. (2015). A survey on ontology evaluation methods. In Proceedings of the International Conference on Knowledge Engineering and Ontology Development, Part of the 7th International Joint Conference on Knowledge Discovery. Knowledge Engineering and Knowledge Management.

Rachman, A. & Chandima Ratnayake, R. (2018). Ontology-based semantic modeling for automated identification of damage mechanisms in process plants. In Collaborative Networks of Cognitive Systems: 19th IFIP WG 5.5 Working Conference on Virtual Enterprises, PRO-VE 2018. Proceedings Cardiff, UK, September 17–19, 2018 (Vol. 19, pp. 457–466). Springer. doi:10.1007/978-3-319-99127-6_39.

Rubens, M. & Agarwal, P. (2002). Information Extraction from Online Automotive Classifieds. Dept. Of Computer Science.

Shearer, R.D., Motik, B. & Horrocks, I. (2008). Hermit: A highly-efficient OWL reasoner. Owled, 432, 91.

Shinyama, Y. & Sekine, S. (2006). Preemptive information extraction using unrestricted relation discovery. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference (pp. 304–311).

Thomas, A. & Sangeetha, S. (2019). An innovative hybrid approach for extracting named entities from unstructured text data. Computational Intelligence, 35(4), 799–826. doi:10.1111/coin.12214.

Tran, V.C., Nguyen, N.T., Fujita, H., Hoang, D.T. & Hwang, D. (2017). A combination of active learning and self-learning for named entity recognition on Twitter using conditional random fields. Knowledge-Based Systems, 132, 179–187. doi:10.1016/j.knosys.2017.06.023.

Tsarkov, D. & Horrocks, I. (2006). FaCT++ description logic reasoner: System description. In Automated Reasoning: Third International Joint Conference, Proceedings, IJCAR 2006. Seattle, WA, USA, August 17–20, 2006 (Vol. 3, pp. 292–297). Springer. doi:10.1007/11814771_26.

Wong, W., Liu, W. & Bennamoun, M. (2012). Ontology learning from text: A look back and into the future. ACM computing surveys (CSUR), 44(4), 1–36. doi:10.1145/2333112.2333115.

Yates, A., Banko, M., Broadhead, M., Cafarella, M.J., Etzioni, O. & Soderland, S. (2007). Textrunner: Open information extraction on the web. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT) (pp. 25–26).

Zeng, D., Liu, K., Lai, S., Zhou, G. & Zhao, J. (2014). Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers (pp. 2335–2344). Dublin, Ireland: Dublin City University and Association for Computational Linguistics. https://aclanthology.org/C14-1220 .

Zhou, L. (2007). Ontology learning: State of the art and open issues. Information Technology and Management, 8, 241–252. doi:10.1007/s10799-007-0019-5.

Zhou, P. & El-Gohary, N. (2017). Ontology-based automated information extraction from building energy conservation codes. Automation in Construction, 74, 103–117. doi:10.1016/j.autcon.2016.09.004.

Zhou, X., Zhang, X. & Hu, X. (2006). MaxMatcher: Biological concept extraction using approximate dictionary lookup. In PRICAI 2006: Trends in Artificial Intelligence: 9th Pacific Rim International Conference on Artificial Intelligence, Proceedings 9, Guilin, China, August 7–11, 2006 (pp. 1145–1149). Springer.

Criteria	Works

	Barrachina et al. (2012)	Feld and Müller (2011)	Hepp (2010)	Klotz et al. (2018)	Our ontology (OCD)
Car damage modeling	×	×	×	×	✓
Car information modeling	✓	✓	✓	✓	✓
Parts modeling	×	✓	×	✓	✓
Multi-language support	×	×	×	×	✓
Public access	×	×	✓	✓	✓
Inference capabilities	×	✓	×	×	✓
Purpose	Road accident modeling	Vehicle knowledge sharing	E-commerce vehicle modeling	Vehicle signal modeling	Car damage modeling