Abstract
Domain terminology recognition and extraction is the primary task in constructing a domain knowledge graph. Traditional methods are tedious, time-consuming, and of low accuracy. This paper presents an improved Domain Term Extraction-Improvement (TDE-I) method based on relative co-occurrence rate, which automatically extracts both domain basic terminologies and domain compound terminologies. We also present a Document Classification Value (DCV) method, built on the calculation of Domain Feature Vector (DFV) values, which evaluates extraction accuracy according to the evaluation indexes. Our experimental results demonstrate that the approach is effective and accurate. The proposed method provides a new solution for the construction of domain knowledge graphs and their applications.
Introduction
With the rapid development of the Internet, the quantity of online data and knowledge has grown explosively, and knowledge expression and organization are confronted with new challenges. In 2012, Google launched its knowledge graph programme to establish a semantic knowledge base system [10,18], which brought new vitality to semantic search applications. The knowledge graph [1] is therefore widely recognized as a promising technical way to deal with these problems and has attracted sustained attention from both industry and academia [4,7,12,16,17].
Knowledge graphs are broadly divided into two categories: general knowledge graphs and domain knowledge graphs. Unlike a general knowledge graph, a domain knowledge graph has relatively narrow coverage, limited to a particular industry area, but is strongly specialized and emphasizes the relationships between pieces of knowledge. This paper focuses on domain knowledge graphs. Domain terminology recognition and extraction is the primary step and plays a very important role in the construction of a knowledge graph. Traditional methods mainly extract knowledge from a wide range of Internet resources. In general, either large numbers of Internet volunteers manually enter domain terminologies and establish knowledge relationships using tools such as Freebase and Wikidata, or knowledge is recognized and extracted from industry-specific data sources (such as professional websites and domain-related internal databases) with the help of professional tools such as Protégé, again in a manual way. Although manual results are accurate and of relatively high quality, the drawbacks are obvious. A large amount of manpower and time is required, and it is difficult to ensure that no human errors occur. As knowledge is constantly updated and grows, manual work cannot keep it up to date. More importantly, it is even more difficult to establish a domain knowledge graph when relevant professionals are not available. Thus, automatic extraction of domain terminologies by mining massive data from the Internet is essential to the success of knowledge graph construction and deployment.
To solve the problems of discovering and extracting domain terminology when establishing a domain knowledge graph, many researchers have proposed approaches and solutions. Research on domain recognition and extraction was initiated by Luhn in 1958 [9]. The solutions to date can be grouped primarily into three categories, i.e. statistics-based methods [2,8,11], language rule-based methods [14,15] and semantic rule-based methods [3,20], as well as the more recent semi-automatic rule-based methods [13]. However, these existing methods generally suffer from low accuracy and a large number of invalid words in the extracted terminology. To address these problems, this paper presents a novel solution for the automatic discovery and extraction of domain terminology. While keeping accuracy within a certain threshold, words that appear with high frequency in the extracted results but are not real domain terms are removed, and domain compound terminologies are then constructed from the extracted domain basic terminologies, so that domain terminology is discovered and extracted automatically. Experiments comparing our method with previous extraction algorithms show that it saves considerable manpower and time, improves accuracy and reduces the number of invalid words, which provides a better data source for the next stage of knowledge graph construction.
The rest of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 presents our approach to domain terminology extraction. Section 4 presents our evaluation method. Some experiment results are given in Section 5. Conclusions are drawn and future work is discussed in Section 6.
Related work
The most basic solution is the statistics-based method [11]. Sentences are parsed into words by a sentence splitter, and noise and trivial words, such as auxiliaries, adjectives and modal words, are removed. The occurrence frequency of each meaningful word is then counted and sorted, and the words that fall within a specified threshold range are treated as domain terminologies. This solution is easy to perform but has low accuracy, because the terminology set inevitably contains many high-frequency everyday words that have low relevance to the domain. To solve this problem, researchers have looked for improved methods. For example, Fukushige and Noguchi [2] proposed a TF-IDF based weight computation method that takes both word frequency and document frequency into consideration. Although accuracy increased compared with the simple statistics-based method, the extracted set still contained a high proportion of words with low domain relevance. Lopes et al. [8] proposed an advanced method called TF-DCF, which evaluates the relevance of terms extracted from domain corpora based on the absolute term frequency and the disjoint corpora frequency. This is an efficient way to extract terms that are relevant to a specific domain, but it is based on the assumption that the relevance of a term to a specific domain can only be established by comparison with corpora from other domains (contrasting corpora).
The language rule-based method makes explicit use of lexical rules. From the viewpoint of linguistic semantics, the general types of word combination patterns are limited and can be enumerated. For example, a sentence always contains a syntactic structure consisting of multiple nouns or gerunds followed by verbs, so words can be split out from sentences, marked, and then recombined to construct possible substrings that are recognized as domain terminologies. Following the principle of language combination, Shinnou et al. [15] made progress by summarizing the syntactic rules of domain terminology. Saedi et al. [14] enrich source and target multiword terms with syntactic structure and seamlessly integrate them in the tree-based transfer phase of TectoMT. However, a simple linguistic combination method still cannot avoid invalid words, which leads to low accuracy, as with the statistics-based method. Moreover, different domains have different language rules, and summarizing the regularities of all domains is a tedious job.
Semantic structure can also be useful. Homayouni et al. [3] used Latent Semantic Indexing (LSI), a popular linear algebraic indexing method that takes advantage of the implicit higher-order structure in the association of terms with documents (semantic structure) in order to improve the detection of relevant documents. This method can reach a higher accuracy rate than the simple TF-IDF method. Researchers are currently developing semantic rule-based methods that combine semantic analysis with machine learning, which help a computer program to understand the meaning of texts written in natural language and to extract domain terminology automatically. This direction is promising but remains at an early stage and still faces great difficulties.
Domain terminology extraction
Invalid terms elimination
The selection of the corpus is the fundamental step from which domain terminology extraction starts, and the documents in the corpus directly determine the accuracy of extraction. In this paper, the following guideline rules are used for the selection of documents. 1) A document should be composed of sentences with consecutive contextual relationships; 2) The documents selected into the corpus should be related to one domain theme and show obvious domain differentiation, such as articles under the various categories of a portal; 3) A document should have a suitable length and be a combination of multiple consecutive sentences with contextual relationships, such as a complete news item, paper or microblog post. For a Chinese document, the suitable length is more than 100 but less than 1,000 Chinese characters; 4) The domain knowledge provided by a document should be accurate and credible; and 5) The contents provided by the corpus should have a large variety.
Following these guideline rules, we select as corpus samples complete news items, essays, blogs and published journal articles that are related to the target domain.
(Invalid term).
In daily life and writing, there are a large number of frequently used words that carry no real meaning, or whose removal has little impact on the main meaning of an expression. These words are called invalid words. Such words appear with relatively high frequency, so they contaminate the final results and reduce their accuracy. Therefore, invalid words must be removed as a denoising step before texts are processed. The following kinds of words have to be filtered:
1) Punctuation marks; 2) Words with no practical meaning, such as auxiliary verbs and modal words; 3) Modifier and function words, such as adjectives, distinguishing words, status words, numerals, quantifiers, adverbs, prepositions, conjunctions, auxiliaries, interjections, modal words, prefixes and suffixes, which have no significant impact on the meaning of an expression and mainly play a modifying role; 4) Words of length 1, since a word whose length after word segmentation is 1 is very unlikely to be a domain terminology.
After invalid terms are eliminated, the corpus texts have the following characteristics. Each record is independent, and there is no strong correlation between records. Each record contains, as an array, the clauses of the article that the record represents, and each clause in turn contains, as an array, the word segmentation result of the sentence. To construct the corpus, we use BeautifulSoup to parse the HTML content into plain text and the NLPIR library [21] to split each sentence in the documents into words. The results compose our final corpus. The records of each document in the corpus are stored in JSON format in MongoDB, a NoSQL database. The structure of each document record is organized as shown in Fig. 1, including the id, title, content and word frequency.

The structure of a JSON document in MongoDB.
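As a rough illustration of the record layout described above, the following sketch builds one record as a Python dictionary and prints it as JSON. The field names follow the text (id, title, content, word frequency), but the exact key spellings and nesting are assumptions, and whitespace splitting is only a stand-in for a real word segmenter such as NLPIR.

```python
import json
import re
from collections import Counter


def build_record(doc_id, title, text):
    """Build one corpus record: clauses -> word arrays, plus word counts.

    Whitespace splitting stands in for a real word segmenter (assumption).
    """
    # Split the text into clauses on common end-of-sentence punctuation.
    sentences = [s.strip() for s in re.split(r"[.!?。！？]", text) if s.strip()]
    # Each clause becomes an array of words.
    content = [s.split() for s in sentences]
    frequency = Counter(w for sent in content for w in sent)
    return {
        "id": doc_id,
        "title": title,
        "content": content,
        "frequency": dict(frequency),
    }


record = build_record(1, "Traffic news", "road traffic grows. traffic signal fails.")
print(json.dumps(record, ensure_ascii=False, indent=2))
```

A dictionary of this shape can be stored directly as one document in a JSON-oriented store.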
(Domain basic terminology).
A domain basic terminology is a widely accepted term in a field of science and technology.
Traditional extraction algorithms combine the co-occurrence rate and the TF-IDF value to obtain basic terminologies.
TF-IDF is the most classical algorithm and is widely used for domain terminology extraction. It is a statistical method that evaluates the importance of a word in a single file or in all files of a corpus. It is composed of the term frequency (TF) and the inverse document frequency (IDF), and the weight is the product TF * IDF. The higher the occurrence frequency of a word in a single file, the more important the word is, that is, the greater its weight. Conversely, the more widely a word occurs across the corpus, the smaller its weight. TF-IDF is a weighting technique from the field of information retrieval. If a word appears in some articles with high frequency but its frequency in the corpus as a whole is not high, the word is probably a keyword of those articles and should be assigned a higher weight to distinguish it. The main issue with this algorithm is that it is better suited to extracting keywords from an article, removing noise words and classifying articles by the extracted keywords. In a strongly domain-specific corpus, domain terminology appears with relatively high frequency throughout the whole corpus, which results in a low TF-IDF value for domain terminology. TF-IDF is simple and feasible, but its accuracy is not very high, so it is mainly used for denoising and for weight calculation of candidate terminologies.
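The weakness described above can be seen in a minimal sketch. It uses the common textbook form, with TF as the relative frequency of a word in a document and IDF as log(N / df), which is an assumption and may differ in detail from the paper's Eq. (1). The toy corpus is hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: tf-idf} dict per document.

    corpus: list of documents, each a list of already segmented words.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in corpus for word in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return scores


corpus = [
    ["traffic", "signal", "traffic", "road"],
    ["traffic", "bank", "loan"],
    ["traffic", "road", "bridge"],
]
scores = tf_idf(corpus)
print(scores[0])
```

In this toy corpus the word "traffic" occurs in every document, so its IDF is log(1) = 0 and its TF-IDF weight vanishes even though it is the most characteristic word of the domain.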
Therefore, we improve the algorithm as follows and propose the TDE-I (Term Domain Extraction-Improvement) algorithm. On the one hand, seed terminology is added to the extraction of domain basic terminology to improve the domain correlation of the results. On the other hand, the existing co-occurrence rate is replaced with the relative co-occurrence rate, and the adjacent position of the co-occurring word is also taken into account in the weight adjustment. After a domain terminology is obtained, it takes part in the next iteration as a seed terminology. Words can be ordered by their TF-IDF values. The words with the highest TF-IDF values, calculated by Eq. (1), can be used to describe the domain feature, where
(Seed terminology).
Our TDE-I algorithm starts from several domain basic terminologies and searches for more related domain terminologies. These initial domain basic terminologies are called seed terminologies.
Extracted terminology is often closely related to the seed terminology. How well the seed terminology fits the domain, and how tightly it is related to it, directly affects the correlation between the extracted terminology and the domain. Hence, seed terminology plays a very important role in extracting domain terminology. The closer the relationship between the given seed terminology and the domain, the more accurate the result.
(Relative co-occurrence rate).
The relative co-occurrence rate is defined as the ratio of the number of co-occurrences to the number of occurrences of the co-occurring word in the corpus, calculated by Eq. (2).
The co-occurrence rate used in traditional algorithms tends to add a large number of common and ineffective words to the results, such as "as" and "of". Unlike the co-occurrence rate, the relative co-occurrence rate is able to remove noise words and invalid words that appear in the results with high frequency.
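A minimal sketch of this ratio is given below. It assumes that co-occurrence is counted per sentence (a word co-occurs with the seed when both appear in the same sentence) and that occurrences are also counted per sentence. The counting unit, the function name `relative_cooccurrence` and the toy data are illustrative assumptions rather than the paper's exact Eq. (2).

```python
from collections import Counter


def relative_cooccurrence(sentences, seed):
    """Relative co-occurrence rate of every word with respect to `seed`.

    sentences: list of segmented sentences (lists of words).
    Counts are per sentence, so each rate lies in (0, 1].
    """
    occurrences = Counter()    # sentences containing the word
    cooccurrences = Counter()  # sentences containing both the word and the seed
    for sent in sentences:
        words = set(sent)
        occurrences.update(words)
        if seed in words:
            cooccurrences.update(words - {seed})
    return {w: cooccurrences[w] / occurrences[w] for w in cooccurrences}


sentences = [
    ["traffic", "signal", "of"],
    ["traffic", "congestion", "of"],
    ["price", "of", "loan"],
    ["bank", "of", "city"],
]
rates = relative_cooccurrence(sentences, "traffic")
print(rates)
```

Here "of" has the highest raw co-occurrence count with the seed (two sentences), yet its relative rate is only 0.5 because it also occurs in sentences without the seed, whereas "signal" and "congestion" reach 1.0.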
Starting from the seed terminology, we search for words with a relatively high relative co-occurrence rate and eliminate the words below a certain threshold. We then calculate the TF-IDF values of the remaining words in the corpus, sort them, and eliminate those below a specific threshold. After these two rounds of filtering, the two kinds of invalid words mentioned above are removed, which solves the problem, present in both the traditional co-occurrence rate algorithm and the TF-IDF algorithm, of invalid words that are highly correlated with the seed terminology. Using paragraph labeling, the weight can be adjusted according to the positions of the word and the seed terminology. At the same time, words that often appear directly one after the other are recorded. When the frequency with which two words appear directly one after the other exceeds a threshold, they are combined to construct a new domain terminology.
(Candidate terminology).
Starting from the seed terminology, the first results obtained by the algorithm are called candidate terminologies. As these terminologies have not yet gone through weight calculation, part-of-speech filtering, threshold filtering and other operations, they still contain a large number of invalid words and have low accuracy. A candidate terminology is confirmed as a final domain terminology only after filtering.
Suppose the set of seed terminologies of the domain is B and the set of candidate terminologies is W. Our algorithm is described as follows.
Step 1: Take an arbitrary word x marked as unused from B and mark it as used. Traverse all words and find each word y whose relative co-occurrence rate is bigger than
Step 2: Take an arbitrary word k marked as unused from W and mark it as used. Traverse the word segmentation results sentence by sentence, and record the distance between k and x as well as the TF-IDF value of the word. Save these data. When the number of times that the distance between k and x is 0 exceeds a certain threshold, and there is no combination of
Step 3: If there are unmarked words in W, then go to Step 2; otherwise go to Step 4.
Step 4: If there are still unmarked words in B, then go to Step 1; otherwise go to Step 5.
Step 5: Calculate the weighted frequency of each word in W by Eq. (3).
Step 6: Sort the results by weighted frequency in descending order. Take the words within a threshold range as new seed terminologies and append them to the set W.
Step 7: Go to Step 1. If Step 6 yields no new seed terminology in W and B, the algorithm ends.
The set W obtained by the above algorithm is the set of domain basic terminology.
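The iterative structure of these steps can be sketched as follows. This is a deliberately simplified illustration and not the full TDE-I algorithm: it keeps only the seed expansion loop driven by a per-sentence relative co-occurrence rate, and it omits the TF-IDF filtering, the distance-based weighting and the weighted frequency of Eq. (3). All function names, the threshold value and the toy data are assumptions.

```python
from collections import Counter


def relative_cooccurrence(sentences, seed):
    """Per-sentence relative co-occurrence rate of each word with `seed`."""
    occurrences, cooccurrences = Counter(), Counter()
    for sent in sentences:
        words = set(sent)
        occurrences.update(words)
        if seed in words:
            cooccurrences.update(words - {seed})
    return {w: cooccurrences[w] / occurrences[w] for w in cooccurrences}


def expand_seeds(sentences, seeds, rate_threshold, max_rounds=10):
    """Grow a term set from `seeds` until a round adds no new term."""
    terms = set(seeds)
    frontier = set(seeds)          # seeds not yet used for searching
    for _ in range(max_rounds):
        new_terms = set()
        for seed in frontier:
            rates = relative_cooccurrence(sentences, seed)
            for word, rate in rates.items():
                if rate >= rate_threshold and word not in terms:
                    new_terms.add(word)
        if not new_terms:          # no new seed terminology: stop
            break
        terms |= new_terms
        frontier = new_terms       # new terms act as seeds in the next round
    return terms


sentences = [
    ["traffic", "signal"],
    ["signal", "timing"],
    ["bank", "loan"],
]
terms = expand_seeds(sentences, {"traffic"}, rate_threshold=0.5)
print(sorted(terms))
```

In the toy data, "signal" is reached from the seed "traffic" in the first round and "timing" from "signal" in the second, while "bank" and "loan" never co-occur with any accepted term and stay out of the result.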
Domain compound terminology extraction
(Domain compound terminology).
A domain compound terminology is defined as a combination of a domain basic terminology and other words. A complete term usually occurs frequently in an article. Based on this observation, we traverse the corpus starting from the acquired domain basic terminologies. At each location where a domain basic terminology appears, we combine it with the words before and after it and regard each combination as a candidate domain compound terminology. Combinations that obviously cannot form a compound terminology are filtered out according to their part-of-speech combination. The weights are then adjusted through statistics and iteration to obtain the domain compound terminologies.
Let the domain basic terminology set be W and the initial domain compound terminology set be M. Suppose
Step 1: Take an arbitrary domain basic terminology X marked as unused from W, and traverse the word segmentation results sentence by sentence. When X appears in the word segmentation, set the previous position of the word which appears in the word segmentation is
Step 2: If X is the first word of the word segmentation, then perform from Step 3 to Step 5. If X is the last word of the word segmentation, then perform from Step 4 to Step 5; otherwise conduct Step 3 to Step 5.
Step 3: Let the combination of the word X and the word
Step 4: Let the combination of the word X and the word
Step 5: If there are still unlabeled domain terminology in W, then go to Step 1; otherwise go to Step 6.
Step 6: Remove any unused domain compound terminology K from
Step 7: If there are still unmarked domain compound terminology in
Step 8: Sort domain compound terminology in
After several iterations of the above steps, the result is a set of domain compound terminologies, each a combination of two words. The iterative results may still contain some incomplete domain compound terminologies. As an incomplete domain terminology is usually contained in a complete domain terminology in the corpus, a complete terminology can be formed by combining the incomplete domain compound terminology with the words before and after it. To obtain complete domain compound terminologies, we take the domain compound terminologies from the previous iteration as the set B and repeat the algorithm. After multiple iterations, we obtain complete domain compound terminologies.
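The core of one pass of this procedure, pairing each domain basic terminology with its neighbouring words and keeping frequent pairs, can be sketched as below. The part-of-speech filter and the weight adjustment are omitted, and the function name, the `min_count` parameter and the toy sentences are hypothetical.

```python
from collections import Counter


def compound_candidates(sentences, basic_terms, min_count):
    """Count two-word combinations that contain a domain basic terminology.

    Returns {(left_word, right_word): count} for pairs seen at least
    `min_count` times. Part-of-speech filtering is not applied here.
    """
    counts = Counter()
    for sent in sentences:
        # Left index of every adjacent pair that touches a basic terminology.
        # A set avoids counting a pair twice when both words are basic terms.
        positions = set()
        for i, word in enumerate(sent):
            if word in basic_terms:
                if i > 0:                  # X is not the first word
                    positions.add(i - 1)
                if i < len(sent) - 1:      # X is not the last word
                    positions.add(i)
        for i in positions:
            counts[(sent[i], sent[i + 1])] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


sentences = [
    ["intelligent", "traffic", "system"],
    ["intelligent", "traffic", "control"],
    ["traffic", "system", "design"],
]
pairs = compound_candidates(sentences, {"traffic"}, min_count=2)
print(pairs)
```

With these toy sentences, the pairs ("intelligent", "traffic") and ("traffic", "system") pass the count threshold, while ("traffic", "control") is seen only once and is dropped. Repeating the pass with the accepted pairs treated as units is what would extend two-word combinations into longer terms.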
Evaluation method
Evaluation index
Currently, there is no uniform standard for evaluating terminology extraction algorithms. In this paper, we assess extraction by the accuracy rate and the unrecognized rate. The statistics of extracted terminology are shown in Table 1.
Statistics of extracted terminology
The above situations are based on the standard glossary. Suppose standard glossary be
The accuracy rate AR measures the degree of accuracy of the extraction and is calculated by Eq. (5).
(Unrecognized rate).
The unrecognized rate UR measures how comprehensive the extraction is and is calculated by Eq. (6).
Domain Feature Vector (DFV)
After corpus selection and data preprocessing, we can have a corpus of a document collection D, an individual document
Words can be ordered by their DFV values. Some words, having the highest DFV value, i.e.
We use the same corpus to test TF-IDF as that for experiments of testing DFV. The corpus marked as D, an individual document
Words in a corpus usually contain multiple kinds of noise. We traverse the whole corpus to wipe out noise words, which is one of the steps that differ from the traditional methods. After the words in the corpus have been filtered, DFV values are calculated to identify the DFV elements. The algorithm is described below.
Document Classification Value (DCV)
Text categorization is defined as assigning predefined categories to text documents [20]. Starting with a training set

The procedure of evaluation method.
The test set is composed of documents from domain A and domain B marked as A and B, respectively. After calculating DCV of DFV from domain A and domain B, i.e.
For a given document, if the DCV obtained from the DFV of domain A is higher than the DCV obtained from the DFV of domain B, the document is assigned to class A, and vice versa. The accuracy rate can then be counted. If the DCVs from the DFVs of domain A and domain B are both zero, the test result is regarded as unrecognized. The accuracy rate AR and the unrecognized rate UR together form the judgment of the DFV model.
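The decision rule above can be sketched in a few lines. Several parts are assumptions made only for illustration: the DCV of a document is taken here as the sum of the DFV weights of the words it contains, ties other than the both-zero case are also treated as unrecognized, and AR and UR are computed as fractions of the whole test set. The DFV weights and test documents are hypothetical.

```python
def dcv(doc_words, dfv):
    """Hypothetical DCV: sum of the DFV weights of the words in the document."""
    return sum(dfv.get(w, 0.0) for w in doc_words)


def classify(doc_words, dfv_a, dfv_b):
    """Return "A" or "B", or None when the document is unrecognized."""
    a, b = dcv(doc_words, dfv_a), dcv(doc_words, dfv_b)
    if a > b:
        return "A"
    if b > a:
        return "B"
    return None  # both zero (or tied): unrecognized


def evaluate(test_set, dfv_a, dfv_b):
    """test_set: list of (doc_words, true_label). Returns (AR, UR)."""
    predictions = [classify(words, dfv_a, dfv_b) for words, _ in test_set]
    correct = sum(p == label for p, (_, label) in zip(predictions, test_set))
    unrecognized = sum(p is None for p in predictions)
    total = len(test_set)
    return correct / total, unrecognized / total


dfv_a = {"traffic": 0.8, "road": 0.5}   # toy DFV of domain A
dfv_b = {"bank": 0.9, "loan": 0.6}      # toy DFV of domain B
test_set = [
    (["traffic", "road"], "A"),
    (["bank", "loan", "road"], "B"),
    (["weather", "sunny"], "A"),
    (["loan", "traffic"], "A"),
]
ar, ur = evaluate(test_set, dfv_a, dfv_b)
print(ar, ur)
```

In this toy run three of the four documents are classified correctly and one contains no DFV word from either domain, so it counts as unrecognized.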
For both domain basic terminology extraction and domain compound terminology extraction, a threshold range must be determined, and it has an important impact on the analysis. Moreover, the threshold range may differ between domains, and even between corpora of different scales in the same domain.
The main factors affecting the threshold include the following three aspects. 1) The corpus, including its scale, its degree of domain relevance and its organizational structure; 2) The domain; 3) The selected seed terminology, including its quantity and its degree of domain relevance. When these factors change, the optimal threshold range changes as well. For large-scale domain terminology extraction, we suggest roughly determining the threshold range by comparing the accuracy obtained under different threshold ranges on a small number of randomly sampled results, and then adjusting the threshold range by repeating this operation.
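One possible way to carry out this sampling-based comparison is sketched below. The function names, the candidate thresholds, the sample size and the manual judgment function `is_term` are all hypothetical. Since accuracy alone tends to favour strict thresholds that keep few terms, the sketch breaks ties in favour of the lowest threshold, which is a design choice of this example and not something the paper specifies.

```python
import random


def sample_accuracy(candidates, is_term, sample_size, rng):
    """Accuracy of a random sample of candidates, judged by `is_term`."""
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    return sum(is_term(w) for w in sample) / len(sample)


def pick_threshold(scored, thresholds, is_term, sample_size=20, seed=0):
    """Return (threshold, sampled accuracy) with the best sampled accuracy.

    scored: {word: weight}. Thresholds are tried in the given order and a
    later threshold replaces the current best only if it is strictly better.
    """
    rng = random.Random(seed)
    best = None
    for t in thresholds:
        kept = [w for w, s in scored.items() if s >= t]
        if not kept:
            continue
        acc = sample_accuracy(kept, is_term, sample_size, rng)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best


scored = {"traffic": 0.9, "signal": 0.7, "road": 0.5,
          "reform": 0.3, "development": 0.2}
gold = {"traffic", "signal", "road"}  # stand-in for a manual judgment
best = pick_threshold(scored, [0.2, 0.4, 0.6], lambda w: w in gold)
print(best)
```

The selected range can then be refined by repeating the procedure with thresholds placed around the current best value.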
Experiments
Data preprocessing
Much research has been done on word extraction, and a number of methods have been proposed that achieve satisfying results and are widely used. However, some challenges remain, especially for Chinese documents [6]. Chinese is a typical paratactic language, without an obvious separator, such as the space in English, between two words in a sentence. To extract the word set from plain Chinese text, each sentence is split into words. For this purpose, the Viterbi HMM model [5] is used in this paper. After the words are split from a sentence, each word is marked according to its part of speech, using the classification standard of the Peking University Daily Corpus POS tag set [19]. An example of this procedure is shown in Fig. 3.

Setting for document template.

Candidate terminology generated in the field of transportation after the first round of iteration.
The words obtained in the first iteration are added to the seed terminology, and the second iteration is performed. As before, taking a result value greater than or equal to 0.25 and the first 50 words as the threshold, we repeat the iteration and obtain the results shown in Fig. 5. The accuracy of the experimental results is 72.3%.

Candidate terminology generated in the field of transportation after the second round of iteration.

Candidate compound terminologies generated in transportation domain after multiple iterations.

Comparison in accuracy for transportation domain extraction between co-occurrence rate-based algorithm and TDE-I algorithm.
Domain compound terminology extraction. The first 50 words are used as the threshold range. The candidate compound terminologies generated in the transportation domain after multiple iterations are shown in Fig. 6. The result values are greater than or equal to 0.25. The accuracy of the experimental results is 78%.
As can be seen from Fig. 7, the key terms extracted by the co-occurrence rate-based algorithm contain more invalid words, such as the Chinese words for "reform" and "development". These words co-occur with the seed terminology many times, and they also appear frequently in traffic articles. Our algorithm, in contrast, reduces the weight of such words by means of new word discovery, part-of-speech filtering and the relative co-occurrence rate, and thereby improves the accuracy of the results.
To test our evaluation method, we construct the corpus for our experiments by collecting 10,377 documents as a sample set of the technology domain from the RSS data source (

The result of DFV algorithm applying to a Chinese corpus.
The distribution of the DFV values of the health domain and the technology domain is shown in Fig. 9.

DFV value distribution.
An analysis of the distribution of DFV values in both domains shows that it is similar to a logarithmic distribution, which agrees with the distribution defined in DCV.

Results of document classification from corpus.
From Fig. 10, we can conclude that the DFV extracted from one domain clearly distinguishes that domain's documents from those of another domain, which shows that the DFV is suitable for extracting domain features. In addition, we conduct an experiment that applies DFVs to classify documents from a foreign data set. The result is shown in Fig. 11.

Document classification from foreign data set.
Fig. 11 shows that the DFV indeed distinguishes between different domains, so the words in a DFV can be recognized as domain terminologies. A further experiment determines the length of the DFV, as shown in Fig. 12.

Experiment result of determination of the length of DFV.
At first, the recognition accuracy increases as the length increases, and it reaches its peak at a length of 50. According to Fig. 12, a DFV of length 50 is the most suitable one and gives the greatest domain distinction.
Comparison with documents from corpus
Comparison with documents from foreign data set
The comparison results in Table 2 and Table 3 show that the traditional TF-IDF method has an obviously high error rate and unrecognized rate, while our DFV is better on both indexes.
Extraction results of accuracy using TDE-I under Intelligent Transportation Network
Extraction results of accuracy using TDE-I under China Intelligent Transportation Network
Taking the transportation domain as an example, we use 863 articles from Intelligent Transportation Network as the corpus. Applying the TDE-I algorithm with the transportation seed terminology mentioned in the previous section but with different threshold ranges, we obtain the accuracy after the first round of iteration shown in Table 4 (the highest accuracy is marked in grey).
For China Intelligent Transportation Network, we use 1,277 articles as the corpus. Using the improved extraction algorithm and the same transportation seed terminology mentioned in the previous section, but with different threshold ranges, we obtain the accuracy after the first round of iteration shown in Table 5 (the highest accuracy is marked in grey).
These comparison results show that, with the same algorithm and seed terminology, the threshold range that gives the optimal result differs when the corpus differs. We conclude that the optimal threshold range may not be fixed even within the same domain.
Extraction results in financial domain
Extraction results in transportation domain
These comparison results are consistent with the above conclusion about the effect of different thresholds in the same domain: the range with the highest accuracy may change when both the domain and the scale of the corpus differ.
The construction of a knowledge graph is a massive systems engineering effort that requires a series of tasks, including domain terminology extraction, ontology construction and relationship construction. Domain terminology extraction is the key and primary step. However, the existing traditional methods are time-consuming and laborious, have a low accuracy rate and produce many invalid words. To deal with these problems, we propose a new domain terminology extraction method that completes the task automatically.
Our contributions can be summarized as follows.
To address low accuracy caused by invalid terms and poor adaptation to rapidly changing knowledge in domain terminology recognition and extraction, we propose the improved Domain Term Extraction-Improvement (TDE-I) algorithm based on relative co-occurrence rate. The experimental results show that our algorithm is better than the traditional algorithm.
There is not yet a uniform standard for evaluating domain terminology extraction algorithms. We present a Document Classification Value (DCV) method, built on the calculation of Domain Feature Vector (DFV) values, which evaluates extraction accuracy according to the evaluation indexes of accuracy rate and unrecognized rate. The experimental results demonstrate that the DCV method is effective.
As a small-scale corpus may contain only a very limited number of domain terminologies, it is usually necessary to consider methods that can handle a large-scale corpus for terminology extraction. However, difficulties arise when a corpus is larger than memory. A feasible solution is to divide the large corpus into several sub-corpora. We can import the divided results into memory, build an index file, and then process them. If a sub-corpus is still larger than memory after division, it is divided further until it can be imported into memory at once. This method effectively avoids the disadvantage of frequent disk I/O operations caused by reading the corpus multiple times.
Future work is as follows. The main limitation of the current study is the lack of thorough experiments with corpora in other languages. Since the objective of this paper is to propose the DFV model, testing our proposal on a statistically significant set of corpora in different languages and in more specific domains remains natural future work.
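The divide-and-process idea can be illustrated with a short sketch that consumes a corpus in fixed-size sub-corpora and merges the partial statistics, so that only one sub-corpus is held in memory at a time. The chunk size, the function names and the choice of word counting as the per-chunk processing step are assumptions; building an index file, as mentioned above, is not shown.

```python
from collections import Counter
from itertools import islice


def iter_chunks(documents, chunk_size):
    """Yield successive lists of at most `chunk_size` documents."""
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_words_chunked(documents, chunk_size):
    """Word counts over a corpus that is read one sub-corpus at a time."""
    total = Counter()
    for chunk in iter_chunks(documents, chunk_size):
        # Only the current sub-corpus is materialized in memory.
        partial = Counter(w for doc in chunk for w in doc)
        total.update(partial)
    return total


def document_stream():
    """Stand-in for a reader that yields segmented documents lazily."""
    yield ["traffic", "signal"]
    yield ["traffic", "road"]
    yield ["road", "bridge", "traffic"]


counts = count_words_chunked(document_stream(), chunk_size=2)
print(counts.most_common(2))
```

Because `documents` may be any iterable, the same function works with a generator that reads documents lazily from disk or from a database cursor.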
Acknowledgements
This work is supported by Shanghai Academy of Spaceflight Technology (no. SAST2016082) and Open Research Fund of the Academy of Satellite Application (no. 2014_CXJJ-YG_02).
