Abstract
SIGNIFICANCE:
Complex network structures such as citation networks and co-authorship networks provide rich structural information for research on subject trends, whereas deep mining of existing static structural information to obtain keyword trends has received less attention. Both the dynamic method for extracting the domain-specific common word set and the method for calculating the dynamic characteristic of keywords (DC value) proposed in this paper can serve as basic methods and parameters for subject trend research.
METHOD:
Using the mature and widely used TF-IDF algorithm as the word frequency calculation tool, this paper proposes a method for calculating the dynamic characteristic of subject keywords based on stable static structural information, namely publishing date, journal name and the domain-specific common word set.
RESULT:
The experimental results show that the DC value of a keyword is positively correlated with the trend of that keyword over time.
Introduction
Studying the research trends and development of a field is of great significance for scholars, because it provides background information, reveals cutting-edge research and offers guidelines for the R&D of enterprises in that field [1]. Scholars worldwide have produced many valuable results on this topic, mainly through two research approaches: analysis based on citation relationships and analysis based on word frequency [2]. This paper focuses on the latter.
Technically, word frequency analysis is a basic technique of text content mining. Sun [3] pointed out in his doctoral thesis that the efficiency of text content mining in specific situations can be improved by integrating more network structure information related to the text. Academic literature carries abundant structure information of this kind, e.g. publishing date, journal name, authors (co-authorship network), citation relationships, word co-occurrence networks, and so on. Many research results have come from integrating several kinds of structure information. For example, the well-known CiteSpace software integrates word frequency analysis with structure information including time order and citation relationships [4]. It is worth noting that different kinds of structure information differ greatly in complexity. Citation networks, co-authorship networks and word co-occurrence networks are complex network structures. Although researchers already have some knowledge of such structural information, many problems remain open, for example the accuracy of overlapping community detection (the collection of keywords for judging subject classification), the communication characteristics of different communities, and the pattern of information communication across multiple communities. Introducing such complex structural information into research on subject development trends therefore inevitably adds complexity and uncertainty to that research, and makes it harder to evaluate its results and to optimize its methods. In contrast, the most specific and stable structural information related to the literature is the publishing date and the journal name, which are not influenced by other factors. Current research, however, treats them only as background structure information, so their potential has not been fully tapped.
This paper proposes a calculation method for analyzing the development trend of subject keywords. Based on date, journal name and the domain-specific common word set, a TF-IDF [5]-based word frequency calculation tool is used to calculate the dynamic characteristic of keywords. The experimental results show that the dynamic values of the keywords calculated by this method are clear and reasonable and can specifically indicate the development trend of the keywords in the corresponding field. Since the dynamic values are stable, they can also be used as a basic parameter in other methods of further data analysis, such as cluster analysis.
Literature review
Two main research approaches, based on different research ideas, have gradually developed in the study of subject development trends.
The research approach based on citation relationships mainly studies the differences between basic and cutting-edge knowledge. Persson [6] holds that the citing papers represent the current research and the cited papers constitute the knowledge base. This approach emphasizes the development of cutting-edge research that builds on basic knowledge. Methods such as bibliometrics and clustering of cited papers are widely used to reconstruct the knowledge base. Using a timeline visualization, Morris et al. [7] define cutting-edge research as a body of literature that consistently cites a fixed set of base papers; similar work is provided by Small's thematic series of web services [8]. New research trends are then explored on top of the basic knowledge through approaches such as statistical analysis and text analysis. CiteSpace, the masterpiece of Chen, is an important achievement of this research approach, and Chen comprehensively reviews many ideas based on it in [4].
The research approach centered on word frequency analysis is technically based on text content mining. As an important part of data mining, text content mining has a wide range of applications, such as text classification [9, 10], public opinion monitoring [11, 12], knowledge discovery [13, 14] and information recommendation [12, 15]. Methods such as word frequency analysis, word co-occurrence analysis and clustering, which are often adopted in library and information science, are basic methods of text content mining, and they are mainly applied to tasks such as hot spot discovery and popularity calculation. In recent years, text mining models such as TF-IDF [16, 17] and word vectors [18] have also been introduced into library and information science. TF-IDF is a commonly used weighting method in information retrieval. It counts the frequency of a word in an article to indicate the importance of the word to the article, and adjusts this weight with the logarithm of the inverse of the proportion of articles that contain the word. That is, the weight of a word increases with its frequency in the article and decreases with the number of other articles in which it appears. Word frequency refers to how many times a word appears in a document; it is generally divided by the maximum word frequency for normalization, which puts word frequencies on the same scale and prevents the weights from being biased toward long documents. Used to calculate the weight of a term in a document set, TF-IDF is intuitive and effective and has been widely applied in areas such as search engines, so it is regarded as a mature word frequency analysis tool [17]. While TF-IDF focuses on structural operations on words, word vectors [19, 20] focus on the semantic expression of words and their semantic computability.
By mapping the terms in the text set into an N-dimensional space, the word vector model solves two basic problems of semantic computing: a) converting unstructured text data into values of the same length; b) converting semantic computing into the computation of space and distance. Words whose vectors are close to each other in this space are considered to be similar in meaning. The word vector model provides more possibilities for solving many problems of NLP (Natural Language Processing). For example, the Word2Vec algorithm proposed by Mikolov et al. [19, 20] in 2013 is not only computationally efficient, but also makes the vocabulary computable at the semantic level. For example, vector (‘Berlin’) – vector (‘Germany’)
These two research approaches are applicable to different situations. The method based on citation relationship analysis is more likely to be used to study development trends in terms of the knowledge base and its path, while the method based on text mining analysis is more likely to be used to discover hotspots or popularity in a particular text set. In addition, as mentioned above, the effect of text mining is closely related to the network structure information of the text. Introducing complex network structure information such as citation networks and co-authorship networks into the study of a discipline adds, to some extent, complexity and uncertainty to the research, whereas clear and stable structural information such as publishing date and journal name provides a clear and solid basis for it.
Based on two stable parameters, the publishing date and the journal name, this paper uses TF-IDF as the word frequency calculation tool to calculate the weights (TF-IDF values) of the keywords in the document set. On this basis, it proposes a method that extracts the domain-specific common word set of high-frequency keywords with a mechanism similar to a sliding window, and then studies the dynamic development of the keywords in this word set along the timeline, expressed as the keyword DC (dynamic characteristic) value.
Research goals and thoughts
The “keywords” in this paper refer to the actual words obtained after the segmentation of the title and abstract, not the keywords in the literature summary. This paper has two closely related research goals:
1. Dynamic extraction of domain-specific common word sets: extract time-independent common high-frequency keywords based on the attribute “journal name”, that is, the journal in which the literature is published. Common high-frequency keywords usually correspond to the research hotspots of a specific research field, so studying them is a basic method for studying research hotspots and development in that field. This paper addresses the problem of how to determine common high-frequency keywords dynamically, according to the characteristics of the data set itself.
2. Calculation of the dynamic characteristics of common high-frequency keywords: study how these keywords develop along the timeline. A shift of research hotspots is reflected in the change of keywords over time, so using algorithms to study the dynamic characteristics of common high-frequency keywords records the changes in the corresponding research hotspots.
Based on the research goals above, the procedure for extracting the domain-specific common word set and calculating the dynamic characteristics of keywords is as follows (experiment details in 4.2):
1. Perform data cleaning and basic NLP processing on the original data to obtain a data set that includes publishing date, journal number and NLP-processed text.
2. Divide the data set by journal name and calculate the TF-IDF value of each journal's keywords.
3. Extract the domain-specific common high-frequency keywords from the journals through a sliding window mechanism and score the keywords with a weighted average method. The scoring standard is designed to further identify how frequently a keyword is used in the entire data space: the higher the frequency, the higher the score. The time-independent common high-frequency keywords are used as the domain-specific common word set and as the reference word set for the annual selection of common high-frequency keywords in the subsequent calculation of the dynamic characteristics of keywords.
4. Divide the data set by publishing year and extract the annual common high-frequency keywords with the method adopted above.
5. Match the annual common high-frequency keywords obtained in the previous step against the domain-specific common word set as a reference to obtain the time-based change value of each keyword.
6. Calculate the dynamic characteristic (DC value) of each time-based keyword according to the calculation formula proposed in this paper.
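The first steps of this procedure (preprocessing, then slicing by journal or by year) can be illustrated with a minimal Python sketch. It is not the implementation used in the experiment: the stop-word list, the lemma table, the record field names and the helper names `preprocess` and `slice_by` are all hypothetical, and a real pipeline would rely on a full NLP toolkit for segmentation and lemmatization.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny resources; a real experiment would use
# a complete stop-word list and a proper lemmatizer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "for", "with"}
LEMMAS = {"students": "student", "games": "game", "teachers": "teacher"}


def preprocess(text):
    """Segment text into lowercase words, lemmatize them, drop stop tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [token for token in lemmas if token not in STOP_WORDS]


def slice_by(papers, field):
    """Group the keyword lists of the papers by `field` ('journal' or 'year').

    Each paper is assumed to be a dict with the keys
    'journal', 'year', 'title' and 'abstract'.
    """
    slices = defaultdict(list)
    for paper in papers:
        keywords = preprocess(paper["title"] + " " + paper["abstract"])
        slices[paper[field]].append(keywords)
    return dict(slices)
```

Calling `slice_by(papers, "journal")` gives one slice per journal, and `slice_by(papers, "year")` gives one slice per publishing year, which are the two divisions the procedure requires.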
Data set
The data in this paper were downloaded from WOS [22] (Web of Science) and consist of the abstracts of academic papers published in 12 internationally renowned journals in the field of education informatization over the 11 years from 2007 to 2017. After preliminary processing, a data set of 8131 papers was obtained, including the title, abstract, author and publishing date of each paper. The publishing date and journal number are used to divide the data set, and the title and abstract are the contents to be analyzed. The 12 journals are shown in Table 1.
Journal list of reference
Since the main purpose of this paper is to explain the method for extracting the domain-specific common word set and the corresponding method for calculating the dynamic characteristics of keywords, the requirements on journals are not as strict as in traditional bibliometric research. The journal types in Table 1 and their shares of the literature are as follows: highly cited journals (J01/J09/J10), accounting for 37.9%; common core journals (J02/J04/J05/J12), accounting for 31.5%; general journals (J03/J06/J07/J08/J11), accounting for 30.6%.
Based on the approach above, the experiment consists of three parts: data preprocessing, extraction and scoring of common high-frequency keywords, and calculation of keyword DC values.
Data preprocessing
Preprocessing includes conventional data cleaning, NLP processing, and post-slicing word frequency calculation, of which the last is the most important. Conventional data cleaning checks the data for errors and omissions and corrects them. NLP processing mainly refers to word segmentation, lemmatization and stop-token filtering of the cleaned data, after which each keyword is stored in the database together with its publishing year and journal information. The post-slicing word frequency calculation consists of two procedures: first slice the data set, and then calculate the word frequency of each slice. Two kinds of slicing are required, and the corresponding word frequency calculation is performed with the TF-IDF algorithm to obtain two data subsets for subsequent calculation:
D1: The data set is sliced according to the “publishing journal”, and each slice is processed with the TF-IDF algorithm to obtain the TF-IDF value of the keywords in each journal. The keywords of each journal are then arranged in descending order of TF-IDF value. D1 is used to extract domain-specific common word sets. Partial results are shown in Fig. 1.
D2: The data set is sliced according to the “publishing year”. Each annual subset is processed with the method used for D1, that is, the TF-IDF values of the keywords in each journal are calculated separately and arranged in descending order, so that for each year a result similar to that of Fig. 1 is obtained. D2 therefore contains 11 annual sets of journal keyword TF-IDF values in descending order. It is used for the subsequent calculation of keyword dynamic trends.
TF-IDF values of each journal’s keywords in reverse order (partial).
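A small Python sketch may make the per-slice calculation concrete. It rests on assumptions that the text does not spell out: each paper in a slice is treated as one document, term frequency is normalized by the maximum count in the paper, the inverse document frequency is the plain logarithm of (papers in slice / papers containing the word), and the journal-level value of a keyword is the mean of its per-paper TF-IDF values. The function names `tfidf_for_slice` and `tfidf_by_slice` are hypothetical.

```python
import math
from collections import Counter


def tfidf_for_slice(docs):
    """Rank the keywords of one slice (e.g. one journal) by TF-IDF.

    docs: list of keyword lists, one list per paper in the slice.
    Returns (keyword, score) pairs in descending order of score, where
    score is the mean per-paper TF-IDF over all papers of the slice
    (an assumed aggregation, not taken from the paper).
    """
    n_docs = len(docs)
    # Document frequency: in how many papers each keyword occurs.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    totals = Counter()
    for doc in docs:
        counts = Counter(doc)
        if not counts:
            continue
        max_count = max(counts.values())
        for word, count in counts.items():
            tf = count / max_count
            idf = math.log(n_docs / doc_freq[word])
            totals[word] += tf * idf

    scores = {word: total / n_docs for word, total in totals.items()}
    # Descending score; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def tfidf_by_slice(slices):
    """Apply tfidf_for_slice to every slice (journal or year) of the data."""
    return {name: tfidf_for_slice(docs) for name, docs in slices.items()}
```

Under these assumptions, D1 corresponds to calling `tfidf_by_slice` on the journal slices of the whole data set, and D2 to calling it once per year on the journal slices of that year.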
This paper uses a dynamic mechanism similar to a sliding window to determine the common high-frequency keyword set, and each keyword in the set is then scored. This section covers the processing of data set D1. The experimental steps are as follows.
4.2.2.1. Extraction method for common word set with sliding window mechanism
Let the first Assume
It must be noted that the parameters in Eq. (1), ie, Figure 2 provides a clear visual explanation for selecting common high-frequency keywords: when
Testing results of the chosen factors of common high-frequency keywords in data set D1.
4.2.2.2. Overall scores OR and ORE of common high-frequency keywords
The selected
Count the NZA (Non-zero average) and NZC (Non-zero count) of each keyword. NZA is the Normalize the NZA and NZC of each keyword separately and represent them with Na and Nc respectively, as shown in Eq. (2):
Calculate the weighted average of Na and Nc and use it as the OR (Overall Rating) for each keyword through the Eq. (3), Where
The common high-frequency keywords are extracted from data set D1 as described above, and the top 12 keywords in descending order of OR value are shown in Fig. 3. Comparing Fig. 3 with Table 2 shows that the OR value obtained through the common keyword extraction method explains more reasonably and persuasively the position of a keyword in a specific data set and its association with research hotspots. For example, in the field of educational technology, keywords such as “student/learning/teacher/course/technology/model” are the core vocabulary of concern to these journals, whereas the words listed in Table 2 are not obviously related to this field. Therefore, from the perspective of domain research, the OR value of a keyword better represents the research focus associated with the keyword.
Top 12 keyword list based on TF-IDF value in unsliced data set
Scoring of common high-frequency keywords.
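The scoring idea can be sketched in Python as follows. Because Eq. (2) and Eq. (3) are not reproduced here, the sketch is only an illustration under explicit assumptions: NZA is taken as the mean of a keyword's non-zero TF-IDF values across journals, NZC as the number of journals in which the value is non-zero, normalization as division by the maximum over all keywords, and the weight `weight_na` as a hypothetical parameter set to 0.5.

```python
def overall_rating(values_by_keyword, weight_na=0.5):
    """Compute an OR-style score for each keyword.

    values_by_keyword: keyword -> list of TF-IDF values, one per journal,
    with 0 where the keyword is absent from that journal's selection.
    weight_na: assumed weight of the normalized NZA (the rest goes to NZC).
    """
    nza, nzc = {}, {}
    for keyword, values in values_by_keyword.items():
        nonzero = [value for value in values if value != 0]
        nzc[keyword] = len(nonzero)
        nza[keyword] = sum(nonzero) / len(nonzero) if nonzero else 0.0

    # Assumed normalization: divide by the largest value among all keywords.
    max_nza = max(nza.values(), default=0.0) or 1.0
    max_nzc = max(nzc.values(), default=0) or 1

    ratings = {}
    for keyword in values_by_keyword:
        na = nza[keyword] / max_nza
        nc = nzc[keyword] / max_nzc
        ratings[keyword] = weight_na * na + (1 - weight_na) * nc
    return ratings
```

With these assumptions, a keyword that appears with moderate weight in every journal can outrank a keyword that is very prominent in a single journal, which matches the intent of rewarding keywords used across the entire data space.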
The overall score OR value calculated in the previous section is actually the OR value of each keyword in the whole statistical period (2007–2017), hereinafter referred to as OR (Global-OR). The common high-frequency keyword set
Extract and calculate annual common high frequency keywords and their ORs, which can be labeled as Calculate the annual OR value and standard deviation of each keyword. This experiment uses the common high-frequency keyword set Calculate the dynamic change of the keyword based on time, that is, calculate the DC value of keywords by Eq. (4). The formula is to expand the difference between the annual OR values of the keyword and its 11 annual OR averages in certain proportion, and then sum the difference of each year.
Plotting the data, as in Fig. 5, makes the result easier to understand intuitively. In the upper part of the graph, each curve shows the values of a keyword in each year, and the dashed line is the linear trend line corresponding to the keyword curve. As the figure shows, the DC value is directly proportional to the trend and slope of the keyword curve: the sign of the DC value shows whether the trend line is rising (positive) or falling (negative), and the absolute DC value shows the angle of inclination of the trend line. For example, keywords “web” ( The relationship between the DC value and the slope of the trend line is illustrated in the data sheet in the lower part of Fig. 5, in which the horizontal axis is the year and the vertical axis is the DC value of keywords. The DC column is the DC value of the keyword calculated according to Eq. (4), and the linest column is the slope of the best-fitting straight line of each keyword in 2007–2017 calculated by the LINEST function in Excel. The slope of the trend line, the “DC
In this paper, three dimensions of structural information, namely the publishing date, the journal name and the domain-specific common word set, are introduced into the calculation of the dynamic characteristic of keywords. The resulting DC value is logical and clearly indicates how a keyword changes in the field. The keyword-related structural information used in this paper is simple and stable in structure, and the value of each keyword obtained by TF-IDF in a given data set is also stable, which ensures the stability of the DC value. Moreover, the DC value is in a fixed ratio to the slope of the trend line calculated by the LINEST function in Excel, which further supports that the DC value can numerically describe the development trend of a keyword. It is worth noting that although DC can be obtained by multiplying the slope of the trend line by a fixed ratio, the selection of keywords and the calculation of DC values are closely related to the characteristics of the subject domain, so DC values are not merely simple numbers. It may therefore be worth considering the stable DC value as a basic attribute of a keyword and using it as a basic parameter in research on subject development trends that involves other methods, such as clustering.
Annual OR value of common high-frequency keywords (partial).
Time-based dynamic change of keyword DC value.
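The fixed ratio between the DC value and the slope of the trend line can be checked numerically with a short Python sketch. Since Eq. (4) is not reproduced here, the sketch assumes one form that is consistent with the description: each year's deviation of the OR value from the mean OR is scaled by that year's offset from the middle of the period, and the scaled deviations are summed. The function names `dc_value` and `trend_slope` are hypothetical, and `trend_slope` is an ordinary least-squares slope of the kind that a spreadsheet linear fit returns.

```python
def dc_value(annual_or):
    """Assumed DC form: sum over years of (t - mean_t) * (OR_t - mean_OR).

    annual_or: list of annual OR values in chronological order.
    """
    n = len(annual_or)
    mean_t = (n - 1) / 2
    mean_or = sum(annual_or) / n
    return sum((t - mean_t) * (value - mean_or)
               for t, value in enumerate(annual_or))


def trend_slope(annual_or):
    """Least-squares slope of OR against the year index 0, 1, ..., n-1."""
    n = len(annual_or)
    sum_t = sum(range(n))
    sum_tt = sum(t * t for t in range(n))
    sum_y = sum(annual_or)
    sum_ty = sum(t * value for t, value in enumerate(annual_or))
    return (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)
```

Under this assumed form, the DC value equals the slope multiplied by the sum of squared year offsets, which is 110 for 11 consecutive years, so a rising keyword has a positive DC value and a falling keyword a negative one.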
On the other hand, some problems in the common high-frequency keyword extraction and the keyword DC value calculation proposed in this paper remain unresolved. First, this experiment uses natural words as keywords; academic terms have not been included as keywords for calculating the corresponding DC values. In theory, a common high-frequency keyword set, or domain-specific common word set, calculated with academic terms as keywords would have clearer logic and relevance. Secondly, in this experiment all keywords are treated equally, and no other conditions are used to weight them. The results obtained on the data set of this paper can be basically verified, for example with the typical keywords “web” and “game” in Fig. 5, but whether the method is applicable to other fields remains to be verified. These problems are expected to be further explored and verified in follow-up studies. In addition, because the experiment relies on basic technologies such as NLP processing, a small number of keywords cannot be differentiated perfectly owing to the limits of word segmentation and lemmatization, so there are a few cases such as “learn” and “learning”. These issues are beyond the scope of this paper and are not discussed further.
Based on the mature TF-IDF algorithm, this paper proposes a method for calculating the dynamic characteristics of keywords in specific fields using structural information from three dimensions: publishing date, journal name and the domain-specific common word set. The keyword DC value calculated with this method has clear logic and domain relevance, and clearly indicates how a keyword changes in the field. The calculation method is logically clear, and the algorithm is simple and efficient. The DC value may be used as a basic parameter in research on the development trends of a subject.
