Simplified effective method for identifying semantic relations from a knowledge graph

Abstract

Semantic relations have been adopted in many research fields, including the semantic web, information retrieval, and Q&A systems. The aim of semantic relations is to remove conceptual and terminological confusion. This is achieved by specifying a set of general concepts that characterize domains, together with their definitions and interrelationships. This research describes how to detect semantic relations, including synonyms, hyponyms, and hypernyms, based on WordNet and the entities of a knowledge graph (KG). This KG was built from two resources: the ACM Digital Library and Wikipedia. We used natural language processing and a deep learning approach to process the data before generating the KG with an effective algorithm. We chose five of 245 categories in the ACM Digital Library to evaluate the proposed method. The generated results show that our system has excellent performance.

Keywords

Semantic relations, knowledge graph, information extraction

1 Introduction

Human knowledge is rich, varied, and complex. There are many methods to represent human knowledge. A knowledge graph (KG) is a natural candidate for this. A KG includes vertices that represent entities, classes, and subclasses, and edges that represent relationships among the vertices. NELL [1, 2], Freebase [3], and YAGO [4] are examples of large KGs that include millions of entities and semantic relations. Semantic relations are expressed as triples of two entities and a binary relation. There are several kinds of semantic relations, such as IS-A, Include, Synonym, and Hyponym. A KG with semantic relations can be applied in many computing fields, such as search engines, information retrieval, and Q&A systems. However, there are challenges to building a KG related to data, methods, and tools. Therefore, a KG is either created over a long period or is focused on one domain.

The contributions of this research are as follows: (i) We crawled and categorized a large-scale dataset from Wikipedia and the ACM Digital Library, focusing on the computing domain, to build a KG. The KG approach tends to focus on the relationships/links between words rather than evaluating individual words independently. (ii) We propose an algorithm for the detection of several semantic relations, including synonyms, hyponyms, and hypernyms, based on the KG and WordNet.

This paper is organized as follows: Section 2 describes related works; section 3 discusses the detection of semantic relations based on the KG; section 4 includes the experimental results and discussion; section 5 provides conclusions and future works.

2 Related work

Kotnis and Nastase [5] observed that KGs contain only positive relation instances, which has led to the emergence of various methods for selecting negative examples. They conducted empirical research on the impact of negative sampling on the learned embeddings, assessed through link prediction. State-of-the-art KG embedding methods were applied, including RESCAL, TransE, DistMult, and ComplEx, but the results were based only on subsets of Freebase and WordNet. Dasgupta et al. [6] presented HyTE (Hyperplane-based Temporally aware KG Embedding) in 2018, a KG embedding technique that explicitly incorporates time in the entity-relation space by associating each timestamp with a hyperplane. The method can perform KG inference using temporal guidance and predict temporal scopes for relational facts with missing time annotations. However, it can only exploit temporally scoped facts of the KG to perform link prediction and to predict time scopes for unannotated temporal facts. Ding et al. [7] investigated the potential of using simple constraints to improve KG embedding, but this research focused on only two constraints, namely, the non-negativity constraints for learning compact, interpretable entity representations and the approximate entailment constraints. Wang et al. [8] proposed a new kind of information, called entity neighbors, which contain both semantic and topological features of a given entity. However, this research is limited with respect to the semantics of entity neighbors. Kutuzov et al. [9] created path2vec, a novel approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The authors plan further research on training embeddings to approximate multiple similarity metrics at once. Generally, various techniques are used to apply KGs to different fields. Research has shown approaches based on natural language processing (NLP), machine/deep learning, or hybrids of the two.
In this research, we use NLP and the deep learning approach for data training to build a KG focusing on the computing domain. Semantic relations are then detected based on this graph.

3 Heterogeneous document-based knowledge graph embedding

3.1 Building a KG from text documents of the ACM Digital Library

The process for training text documents of the ACM Digital Library involves two steps:

Data pre-processing.

Applying the Keras framework, including a word embedding model of text data.

In the first phase, all text files of the ACM Digital Library were merged according to their categories, so that each category had only one text file. These text files were sent as input to a tokenizer, which split the English sentences into words based on the whitespace character. The tokenized words were then converted to lowercase, stripped of punctuation, and filtered to remove tokens that are symbols or stop-words. The converted text files were then passed back to the extractor for the stemming process. Stemming refers to the transformation of each word to its base form. In this research, we used the Natural Language Toolkit (NLTK) [10] for data pre-processing. In the second phase, we adopted a multilayer perceptron (MLP) with a word2vec [11] representation for training the data in the Keras framework. The training process is shown in Fig. 1. The word layer includes the words that were processed in the first phase, and there are four hidden layers.
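The first-phase pipeline can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the stop-word set, the suffix list, and the function names `naive_stem` and `preprocess` are assumptions made for the example. In practice, NLTK's stop-word corpus and its `PorterStemmer` would take the place of these stand-ins.

```python
import string
from typing import List

# Hypothetical stop-word list (stand-in for NLTK's stop-word corpus).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Hypothetical suffix list for a naive stemmer (stand-in for a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> List[str]:
    """Split on whitespace, lowercase, trim punctuation, drop stop words, stem."""
    tokens = []
    for raw in text.split():
        # Lowercase and remove leading/trailing punctuation.
        word = raw.lower().strip(string.punctuation)
        # Discard empty strings, pure symbols, and stop words.
        if not word or not any(ch.isalnum() for ch in word) or word in STOP_WORDS:
            continue
        tokens.append(naive_stem(word))
    return tokens


print(preprocess("The graphs are stored in the databases, and queries are indexed."))
```

The naive stemmer produces truncated forms such as "databas", which is typical of suffix-stripping stemmers; a lemmatizer would be an alternative if dictionary forms were required.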

Fig. 1

Model using the Keras framework, including word embedding (word2vec).

The next step is to build the KG. The structure of the KG was divided into two layers, with the computing domain as the root of the KG. The first layer is known as the subject layer [12], which includes over 30 categories that were extracted from the ACM classification categories [13]. The next layer of the KG is known as the object layer, which contains the word vectors that were output from the word2vec word embedding model, e.g., Hardware, SQL Server, Java, CPU, Oracle, Data Structure. The KG representing the computing domain is shown in Fig. 2.

Fig. 2

Hierarchy of a knowledge graph.
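The two-layer hierarchy described above (root, subject layer, object layer) can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the class name `KnowledgeGraph`, the relation labels `HAS_CATEGORY` and `HAS_ENTITY`, and the assignment of the example entities to the Software category are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KnowledgeGraph:
    """Two-layer hierarchy: root domain -> subject categories -> object entities."""
    root: str
    subjects: Dict[str, List[str]] = field(default_factory=dict)

    def add_entity(self, category: str, entity: str) -> None:
        """Attach an object-layer entity to a subject-layer category (no duplicates)."""
        entities = self.subjects.setdefault(category, [])
        if entity not in entities:
            entities.append(entity)

    def triples(self) -> List[Tuple[str, str, str]]:
        """Flatten the hierarchy into (head, relation, tail) triples."""
        result = []
        for category, entities in self.subjects.items():
            # Hypothetical relation labels, chosen for illustration only.
            result.append((self.root, "HAS_CATEGORY", category))
            for entity in entities:
                result.append((category, "HAS_ENTITY", entity))
        return result


kg = KnowledgeGraph(root="Computing")
kg.add_entity("Software", "Java")        # illustrative assignment
kg.add_entity("Software", "SQL Server")  # illustrative assignment
kg.add_entity("Software", "Java")        # duplicate, ignored
for triple in kg.triples():
    print(triple)
```

Representing the hierarchy as triples keeps it compatible with the triple form of semantic relations introduced in Section 1.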

3.2 Updating KG from XML documents of Wikipedia

The process to update the KG by entities extracted from Wikipedia includes three steps:

Prepare the XML files, including entities belonging to categories of the ACM Digital Library

Pre-process the data in the XML files from the previous step

Reuse the Keras framework to train the data after pre-processing.

Additionally, to access and extract data belonging to a category from Wikipedia, the API functions provided by Wikipedia were used.
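As a rough sketch of how category data might be requested, the snippet below builds a query for the MediaWiki `categorymembers` list and extracts article titles from a response. The paper does not state which API calls were used, so the endpoint, the helper names `category_members_url` and `extract_titles`, and the example category are assumptions; the sample response is hand-written to mirror the documented response shape and is not real API output.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

# Assumed endpoint: the public MediaWiki API of the English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_members_url(category: str, limit: int = 50,
                         cmcontinue: Optional[str] = None) -> str:
    """Build a request URL listing the pages that belong to a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": str(limit),
        "format": "json",
    }
    if cmcontinue:
        # Continuation token returned by a previous, truncated response.
        params["cmcontinue"] = cmcontinue
    return f"{API_ENDPOINT}?{urlencode(params)}"


def extract_titles(response: Dict) -> List[str]:
    """Keep titles of article pages (namespace 0); skip subcategories and files."""
    members = response.get("query", {}).get("categorymembers", [])
    return [m["title"] for m in members if m.get("ns") == 0]


# Hand-written sample response, for illustration only.
sample = {
    "query": {
        "categorymembers": [
            {"pageid": 1, "ns": 0, "title": "Process management (computing)"},
            {"pageid": 2, "ns": 14, "title": "Category:Scheduling algorithms"},
        ]
    }
}
print(category_members_url("Operating systems", limit=10))
print(extract_titles(sample))
```

Separating URL construction from response parsing allows the parsing step to be checked offline, before any network request is made.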

3.3 Algorithms for the detection of the semantic relations based on the KG

This paper focuses on semantic relations, including synonyms, hyponyms, and hypernyms, which play a vital role in information retrieval. The KG and WordNet can be used to determine these semantic relations. Our proposed algorithm is as follows.

Algorithm 3.1. Algorithm for searching the semantic relations based on graph database

Table 7

Procedure Find_out_SYN_HYPO_HYPE(root)

If root is not null

Begin

Instance = root

SYN = Select WordNet.SYNONYM where WordNet.Instance = Instance

HYPO = Select WordNet.HYPONYM where WordNet.Instance = Instance

HYPE = Select WordNet.HYPERNYM where WordNet.Instance = Instance

Find_out_SYN_HYPO_HYPE(root.LEFT)

Find_out_SYN_HYPO_HYPE(root.RIGHT)

End

End Procedure
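One possible reading of the procedure above is a pre-order traversal of the KG that looks up each instance in WordNet. The sketch below follows that reading under stated assumptions: the `Node` class, the function name `find_relations`, and the in-memory `TOY_WORDNET` dictionary are hypothetical stand-ins, and the dictionary entries are abridged from Table 1. With NLTK, the lookups could instead use `wordnet.synsets(word)` together with the `lemma_names()`, `hyponyms()`, and `hypernyms()` methods of each synset.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """A KG vertex with LEFT and RIGHT children, as in the pseudocode."""
    instance: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


Relations = Dict[str, List[str]]

# Toy stand-in for WordNet; entries abridged from Table 1.
TOY_WORDNET: Dict[str, Relations] = {
    "DB": {
        "synonyms": ["Database"],
        "hyponyms": [],
        "hypernyms": ["Database system", "information processing"],
    },
    "ROM": {"synonyms": ["Read-Only Memory"], "hyponyms": [], "hypernyms": []},
}


def find_relations(root: Optional[Node],
                   lexicon: Dict[str, Relations],
                   found: Optional[Dict[str, Relations]] = None) -> Dict[str, Relations]:
    """Visit every node in pre-order and collect the relations the lexicon knows."""
    if found is None:
        found = {}
    if root is None:
        return found
    entry = lexicon.get(root.instance)
    if entry is not None:
        found[root.instance] = entry
    find_relations(root.left, lexicon, found)
    find_relations(root.right, lexicon, found)
    return found


tree = Node("Computing", left=Node("DB"), right=Node("ROM"))
for instance, relations in find_relations(tree, TOY_WORDNET).items():
    print(instance, relations["synonyms"])
```

Instances that have no entry in the lexicon, such as the root in this example, are skipped rather than reported with empty relation sets.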

After the above algorithm was applied, we extracted the semantic relations from WordNet corresponding to the entities of the KG. Some results are shown in Table 1.

Table 1

Set of Synonyms, Hyponyms, and Hypernyms corresponding with entities of the KG

Entities of KG	Synonyms	Hyponyms	Hypernyms
DB	Database		Database system, information processing
Network Monitoring		Network Services	Networks
Neural network		Machine learning	Computing Methodologies
ROM	Read-Only Memory	Core memory	Volatile storage

From Table 1, some semantic relations are evident between an instance of the KG and its synonyms, hyponyms, and hypernyms, such as the following:

DB is Database

Network Services such as Network Monitoring

Machine learning includes Neural network

ROM is Read-Only Memory

4 Experimental results and discussion

We implemented numerous experiments to study the efficiency of the proposed approach. We selected 100 papers, based on their abstracts, for each of the five categories from the ACM Digital Library:

Artificial Intelligence

Operating System

Logic Design

Software

Process Management

We use three measures for experimental evaluation: Precision (P), Recall (R), and F-measure (F₁).

$P(C_{i}) = \frac{\mathrm{Correct}(C_{i})}{\mathrm{Correct}(C_{i}) + \mathrm{Wrong}(C_{i})}$ (1)

$R(C_{i}) = \frac{\mathrm{Correct}(C_{i})}{\mathrm{Correct}(C_{i}) + \mathrm{Missing}(C_{i})}$ (2)

$F_{1}(C_{i}) = 2 \, \frac{P(C_{i}) \times R(C_{i})}{P(C_{i}) + R(C_{i})}$ (3)

where $C_{i}$ denotes a category in the KG; Correct($C_{i}$) denotes the number of semantic relations found in the KG that belong to category $C_{i}$; Wrong($C_{i}$) denotes the number of semantic relations found in the KG that do not belong to category $C_{i}$; and Missing($C_{i}$) denotes the number of semantic relations not found in the KG. The results obtained are shown in Tables 2–5.
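Equations (1)–(3) can be computed directly from the three counts. The sketch below is illustrative: the function names and the counts in the first example are hypothetical, while the second call uses the precision and recall values from the Application row of Table 2.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, Eq. (3); 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(correct: int, wrong: int, missing: int) -> Tuple[float, float, float]:
    """Return (P, R, F1) for one category, following Eqs. (1)-(3)."""
    precision = correct / (correct + wrong) if correct + wrong else 0.0
    recall = correct / (correct + missing) if correct + missing else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts for a single category.
print(precision_recall_f1(correct=80, wrong=20, missing=20))

# Precision and recall of the Application row in Table 2.
print(round(f1_score(0.7926, 0.7651), 4))
```

The guards against zero denominators are a design choice for the sketch; the paper does not say how empty categories are handled.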

Table 2

Evaluation results on instances of KG

Category	Number of instances	P	R	F₁
Application	3672	0.7926	0.7651	0.7786
Process Management	3056	0.7653	0.7251	0.7447

Table 3

Evaluation results on a set of synonym relations

Category	Number of synonyms	P	R	F₁
Application	524	0.7926	0.7651	0.7786
Process Management	517	0.9325	0.8616	0.8956

Table 4

Evaluation results on the set of hyponym relations

Category	Number of hyponyms	P	R	F₁
Application	714	0.8938	0.7651	0.8245
Process Management	728	0.8831	0.8515	0.8670

Table 5

Evaluation results on the set of hypernyms

Category	Number of Hypernyms	P	R	F₁
Application	916	0.7926	0.7651	0.7786
Process Management	834	0.8231	0.8455	0.8341

The results in Table 2 reveal that the number of instances extracted after pre-processing is related to the precision, recall, and F-measure. The Application category had a higher number of instances and, accordingly, higher precision and recall. The Process Management category had fewer instances, and its precision and recall were lower. The results of this experiment suggest that the accuracy of the semantic relations found based on the KG of a category increases with the number of instances of that category.

Table 3 shows that the Application category had more synonym relations but lower precision and recall. The Process Management category had fewer synonym relations, but its precision and recall were higher than those of the Application category. The results of this experiment show that the accuracy of synonym relations found based on the KG of a category is not directly proportional to the number of synonym relations in that category.

Similarly, Table 4 shows that the Process Management category had more hyponym relations and higher recall. The Application category had fewer hyponym relations, but its precision was higher than that of the Process Management category. The results of this experiment show that the precision and recall of hyponym relations found based on the KG of a category are not directly proportional to the number of hyponym relations in that category.

Table 5 also shows that the Application category had more hypernym relations but lower precision and recall. The Process Management category had fewer hypernym relations, but its precision and recall were higher than those of the Application category. The results of this experiment show that the precision and recall of hypernym relations found based on the KG of a category are not directly proportional to the number of hypernym relations in that category.

The quantity of semantic relations obtained from the KG is shown in Fig. 3. Out of all the categories, Application has the highest number of instances and, therefore, the highest number of synonym, hyponym, and hypernym relations.

Fig. 3

Number of instances of synonym, hyponym, and hypernym relations obtained per category.

The comparison between the precision of the different categories is shown in Fig. 4. The comparison between percent recall of the different categories is shown in Fig. 5.

Fig. 4

Precision of synonym, hyponym, and hypernym relations.

Fig. 5

Percent recall of synonym, hyponym, and hypernym relations.

To assess the precision and recall of the instances obtained from our model, we used Stanford CoreNLP as a comparative baseline. Stanford CoreNLP is a tool for extracting instances and relations from text documents, and it provides API functions for developing NLP applications. We chose two categories for the comparison; the results are shown in Table 6. The scores reported in Table 6 show that our deep learning model obtained more instances than the Stanford CoreNLP tool in both categories, and that its precision and recall were also higher. The deep learning model takes context into account when processing the words in text documents, which makes it more reliable. Overall, the proposed method outperformed the Stanford CoreNLP tool.

Table 6

Comparative evaluation method

Category	Number of instances	P	R	F₁
Application (Our model)	3672	0.7926	0.7651	0.7786
Application (CoreNLP)	3056	0.7653	0.7251	0.7447
Process Management (Our model)	3904	0.6846	0.6213	0.6514
Process Management (CoreNLP)	3271	0.6237	0.5875	0.6050

5 Conclusions

The experiment in this study detected semantic relations, including synonyms, hyponyms, and hypernyms, based on the KG and WordNet. These semantic relations play an important role in applications related to question answering and information extraction systems. The KG approach, in particular, considers the relationships/links between words rather than evaluating individual words independently, and the KG is focused only on the computing domain. The currently available KG is cumbersome, with 170 categories and one million entities. To solve the problem, we proposed an approach that has two steps: data training for building the KG, and determining the semantic relations based on the KG and WordNet. We used the Keras model with word embedding and hidden layers for data training after pre-processing the data, which were extracted from the ACM Digital Library and Wikipedia. The Neo4j graph database was used to build the KG after the data training. To detect semantic relations, we applied a search algorithm based on the KG and WordNet. Three measures, precision, recall, and F-measure, were used to evaluate the proposed approach. Connecting the WordNet ontology to the proposed method increases the time needed to identify the semantic relations. This could be improved by integrating WordNet into the KG before identifying the semantic relations, with a view to developing a question answering system.

References

1. Carlson, et al., Toward an architecture for never-ending language learning, in AAAI Conferences, 2010.

2. Mitchell, et al., Never-ending learning, Commun. ACM 61(5) (2018), 103–115. doi: 10.1145/3191513.

3. Bollacker, Evans, Paritosh, Sturge, Taylor, Freebase: A collaboratively created graph database for structuring human knowledge, in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247–1249. doi: 10.1145/1376616.1376746.

4. Suchanek F.M., Kasneci, Weikum, Yago: A core of semantic knowledge, in 16th International World Wide Web Conference, WWW 2007, 2007, pp. 697–706. doi: 10.1145/1242572.1242667.

5. Kotnis, Nastase, Analysis of the Impact of Negative Sampling on Link Prediction in Knowledge Graphs, in The Computing Research Repository (CoRR), 2017.

6. Dasgupta S.S., Ray S.N., Talukdar, HyTE: Hyperplane-based temporally aware knowledge graph embedding, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, 2018, pp. 2001–2011. doi: 10.18653/v1/d18-1225.

7. Ding, Wang, Guo, Improving knowledge graph embedding using simple constraints, in ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) 1 (2018), 110–121. doi: 10.18653/v1/p18-1011.

8. Wang, Liu, Xu, Lin, Knowledge Graph Embedding with Entity Neighbors and Deep Memory Network, in The Computing Research Repository (CoRR), 2018.

9. Kutuzov, Dorgham, Oliynyk, Biemann, Panchenko, Learning Graph Embeddings from WordNet-based Similarity Measures, in Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), 2019, pp. 125–135. doi: 10.18653/v1/s19-1014.

10. Loper, Bird, NLTK: the Natural Language Toolkit, in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - 1 (2002), 63–70. doi: 10.3115/1118108.1118117.

11. Mikolov, Chen, Corrado, Dean, Efficient Estimation of Word Representations in Vector Space, in International Conference on Learning Representations, 2013.

12. Chien T.D.C., Tuoi P.T., Building ontology based-on heterogeneous data, J. Comput. Sci. Cybern 31(2) (2015), 149–158. doi: 10.15625/1813-9663/31/2/3971.

13. Association for Computing Machinery, The ACM Computing Classification System (1998), 1998. [Online]. Available: https://www.acm.org/publications/computing-classification-system/1998/ccs98. [Accessed:]