Similarity detection of English text and teaching evaluation based on improved TCUSS clustering algorithm

Abstract

The semantic similarity calculation task of English text has important influence on other fields of natural language processing and has high research value and application prospect. At present, research on the similarity calculation of short texts has achieved good results, but the research result on long text sets is still poor. This paper proposes a similarity calculation method that combines planar features with structured features and uses support vector regression models. Moreover, this paper uses PST and PDT to represent the syntax, semantics and other information of the text. In addition, through the two structural features suitable for text similarity calculation, this paper proposes a similarity calculation method combining structural features with Tree-LSTM model. Experiments show that this method provides a new idea for interest network extraction.

Keywords

Improved algorithm; TCUSS clustering algorithm; English text; similarity detection

1 Introduction

With the widespread use of computers and the popularity of the Internet, all kinds of information are rapidly expanding. The increase in the amount of information has brought convenience to people, but it has also brought about the problem of information overload. In the face of this vast volume of information, people increasingly hope to conduct scientific research, business decision-making and business management based on data analysis [1]. In the real world, text is the most important information carrier. In fact, research shows that 80% of information is contained in text documents. Therefore, the processing and analysis of text documents has become one of the hotspots of data mining and information retrieval technology. There are many techniques for processing and researching text documents, and one of the most important is text clustering [2]. The three problems of high-dimensional sparseness, synonyms and polysemous words in text data greatly interfere with the accuracy of the clustering learning algorithm and lead to a sharp decline in the performance of text clustering. Moreover, most of the existing text clustering algorithms do not give a description of how to perform clustering.

Currently, most text clustering algorithms are based on the Vector Space Model (VSM). This text representation is very simple, but it raises the problem of high dimensional sparsity. Moreover, it does not solve the two natural language problems specific to text data: synonyms and polysemous words. All of these problems greatly interfere with the efficiency and accuracy of the text clustering algorithm and lead to a decline in the performance of text clustering. In order to avoid these problems, this paper adopts a new idea to cluster texts, that is, using semantic similarity as a measure of similarity between texts [3]. The TCUSS algorithm does not have the problem of high dimensional sparsity. Moreover, the use of semantic similarity as a measure of similarity between texts not only has theoretical significance, but also solves the problem of polysemous words and synonyms. Finally, it is proved by experiments that the improved hybrid method has higher accuracy than the previous semantic similarity calculation method, and the TCUSS algorithm also improves the quality of clustering [4].

2 Related work

The initial research on text similarity detection mainly focused on the detection of word similarity. Methods based on digital fingerprints [5] and methods based on the vector space model [6] were the focus of research. However, with the development and use of plagiarism detection systems, many plagiarists rewrite the content marked as plagiarism by changing the sentence structure and substituting words according to the feedback from the system. After rewriting, plagiarists are often able to evade the detection system without changing the semantic information of the sentence or paragraph. Thus, sentence structure-based detection, cluster-based detection, grammar-based detection, and semantic-based detection [7] emerged. In order to better identify and combat this kind of plagiarism, more researchers began to analyze the semantic information of the text.

A group of scholars at Princeton developed a dictionary of English semantic relations, WordNet [8], according to the grammar rules of the English language. The WordNet semantic dictionary records English words and their interrelationships in a semantic arrangement, and the semantic similarity between words can be calculated using these relationships. A fuzzy similarity calculation method is proposed in [9]. The method mainly uses the Shingling method to screen out similar samples from a large sample set, then compares these potentially similar samples one by one against a minimum similarity threshold, so as to finally determine the similarity between the samples. Later, the literature [10] proposed a similarity calculation method based on sentence predicate components. The method first selects some components of the sentence, then uses an English rule corpus to calculate the similarity of those components, and uses the result in place of the similarity value of the whole sentence. However, this method focuses on only some of the components in the sentence and ignores the other components and the structure of the sentence.

In the face of plagiarism at the semantic level, such as synonym replacement, word structure adjustment, and sentence structure adjustment, we can only judge whether a text is plagiarized by semantic comparison at the content level of the sample [11]. In addition, similarity detection at home and abroad is mostly directed at academic papers and code, while there are few similarity detection studies for automatic composition scoring systems, and most of them rely on the vocabulary structure of the article [12]. If a better mixed-similarity detection method based on words and semantics can be developed and applied to an English text similarity detection system, it will be of great significance for solving the problem of English text similarity. The article [16] focuses on IoT and its major role in sophisticating human behaviours and efforts. It also deals with the collection of data from various resources that are connected to the internet. The literature [17] discusses various problems in the vehicular communication field and proposes a cooperative centralized and distributed spectrum sensing model. Owing to the implementation of the cooperative cognitive model, interference and various hidden problems are minimized. The literature [18] addresses problems such as the massive volume of big data and comes up with the concept of SmartBuddy to create an intelligent and smart environment using human behaviours and human dynamics. The literature [19] discusses the construction of a directed acyclic graph for video coding algorithms for motion estimation in parallel reconfigurable computing systems, in which the partitioning algorithm plays a major role in speeding up video processing. The article [20] deals with exploiting IoT and big data analytics using the Hadoop ecosystem in real-time environments. The implementation of an IoT-based Smart City is achieved by the above-mentioned processes [21, 22].

3 Theoretical method analysis

3.1 TF-IDF

TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical method used to quantitatively estimate how important a word is to a text set or to one of the texts in a corpus. TF-IDF is often used by search engine applications as a measure or rating of the degree of relevance between a web page and a user query. The importance of a word increases proportionally with the number of times it appears in the text, but it decreases with the frequency at which it appears in the corpus [13].

The main idea of TF-IDF is as follows: if a word appears frequently in one text and rarely appears in other texts, the word is considered to have good class-distinguishing ability, that is, it is a keyword for that text. TF stands for Term Frequency [14], which is the frequency at which a word appears in text d, as shown in Equation (1). IDF stands for Inverse Document Frequency, and its main idea is that the fewer texts contain the word t, the larger the IDF value, as shown in Equation (2). ${TF}_{i, j} = \frac{n_{i, j}}{\sum_{k} n_{k, j}}$ (1) ${IDF}_{i} = \log \frac{| D |}{| \{ j : t_{i} \in d_{j} \} |}$ (2)

In the formulas, i represents the word index and j represents the text index. In Equation (1), n_i,j represents the number of occurrences of the i-th word in text j, and ∑_k n_k,j represents the total number of word occurrences in text j. In Equation (2), |D| represents the total number of texts in the corpus, and | { j : t_i ∈ d_j } | represents the number of texts containing the word t_i. The value of TF-IDF is given by Equation (3). $TF\text{-}{IDF}_{i, j} = {TF}_{i, j} \times {IDF}_{i}$ (3)

It can be seen from Equation (3) that the high word frequency within a particular file and the low file frequency of the word in the entire file set, can produce a high weight TF-IDF. Therefore, TF-IDF tends to filter out common words and retain important words [15].
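As a minimal sketch of Equations (1)–(3), the weights can be computed for a small tokenized corpus as follows. The function name `tf_idf`, the toy corpus, and the use of the natural logarithm are assumptions made for illustration; the text does not fix the base of the logarithm.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: weight} dict per tokenized text, following Equations (1)-(3)."""
    n_docs = len(corpus)
    # Document frequency |{j : t_i in d_j}|: number of texts containing each word.
    doc_freq = Counter(word for text in corpus for word in set(text))
    weights = []
    for text in corpus:
        counts = Counter(text)   # n_{i,j}
        total = len(text)        # sum_k n_{k,j}
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy corpus, made up for illustration only.
corpus = [
    ["text", "similarity", "text"],
    ["text", "clustering"],
    ["semantic", "similarity"],
]
weights = tf_idf(corpus)
```

In the second text, "clustering" receives a higher weight than "text" because "text" also occurs in another document, which matches the filtering behaviour described above.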

3.2 Mutual information

Mutual Information MI (Mutual Information) is a measure of information in information theory. It can be used to measure the amount of information contained in one random variable about another random variable, or the uncertainty of a random variable being reduced due to another known random variable.

The use of mutual information for feature extraction is based on the assumption that a term which appears frequently in a particular category but rarely in other categories has large mutual information with that category. Mutual information is usually used as a measure between feature words and categories: if a feature word belongs to a class, its mutual information with that class is the largest. MI is often applied to natural-language texts, where it can be used to assess how likely two words are to occur together. The definition of MI is shown in Equation (4). $I (X; Y) = \sum_{x \in X} \sum_{y \in Y} p (x, y) \log \frac{p (x, y)}{p (x) p (y)}$ (4)

In Equation (4), p (x, y) is the joint distribution of the two random variables X and Y, and p (x) and p (y) are their marginal distributions. The mutual information I (X ; Y) is the relative entropy between the joint distribution p (x, y) and the product distribution p (x) p (y).
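Equation (4) can be evaluated directly from a joint probability table. The sketch below assumes the table is given as a nested dictionary `joint[x][y] = p(x, y)` and uses the natural logarithm; both choices, as well as the two example distributions, are illustrative assumptions.

```python
import math


def mutual_information(joint):
    """Compute I(X;Y) from joint[x][y] = p(x, y), as in Equation (4) (natural log)."""
    # Marginal distributions p(x) and p(y).
    p_x = {x: sum(row.values()) for x, row in joint.items()}
    p_y = {}
    for row in joint.values():
        for y, p in row.items():
            p_y[y] = p_y.get(y, 0.0) + p
    mi = 0.0
    for x, row in joint.items():
        for y, p in row.items():
            if p > 0:  # terms with p(x, y) = 0 contribute nothing
                mi += p * math.log(p / (p_x[x] * p_y[y]))
    return mi


# Two made-up joint distributions: independent vs. fully dependent variables.
independent = {"a": {"c": 0.25, "d": 0.25}, "b": {"c": 0.25, "d": 0.25}}
dependent = {"a": {"c": 0.5, "d": 0.0}, "b": {"c": 0.0, "d": 0.5}}
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

For independent variables the joint distribution equals the product distribution, so the mutual information is zero; for the fully dependent pair it equals log 2.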

3.3 Information entropy

Information Entropy IE (Information Entropy) can be used to calculate the amount of uncertainty of a random variable. It represents the average amount of information provided by source X per symbol (no matter what symbol is sent). The greater the entropy of a random variable, the greater its uncertainty, and the less likely it is to correctly estimate its value. The more uncertain the random variable, the more information is needed to determine its value.

If X is a discrete random variable with a value space of $ℝ$ and its probability distribution is $p (x) = p (X = x), x \in ℝ$ , the entropy H (X) of X is defined as Equation (5). $H (X) = - \sum_{x \in ℝ} p (x) {log}_{2} p (x)$ (5)

The probability distribution that maximizes the entropy value truly reflects the distribution of the event, because entropy defines the uncertainty of the random variable. When the entropy is maximum, the random variable is the most uncertain, and it is most difficult to accurately predict its behavior. That is to say, under the premise of knowing part of the knowledge, the most reasonable inference about the unknown distribution is the most uncertain or maximum random judgment of known knowledge.
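A short sketch of Equation (5) illustrates the statement that entropy grows with uncertainty. The function name and the example distributions are assumptions chosen for illustration.

```python
import math


def entropy(probabilities):
    """Entropy in bits of a discrete distribution, as in Equation (5)."""
    # By convention, outcomes with p(x) = 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy([0.5, 0.5])     # most uncertain two-outcome distribution
biased_coin = entropy([0.9, 0.1])   # more predictable, hence lower entropy
```

Among all distributions over two outcomes, the uniform one has the largest entropy, which is the case in which the value of the variable is hardest to predict.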

3.4 Euclidean distance

The Euclidean Distance is a definition that uses the distance representation, that is, the true distance between two points in an n-dimensional space, or the natural length of the vector (that is, the distance from the point to the origin). The Euclidean distance in two-dimensional and three-dimensional space is the actual distance between two points.

If we assume that x, y are two points in the n-dimensional space, that is, x = (x₁, x₂, ⋯ , x_n) , y = (y₁, y₂, ⋯ , y_n), the Euclidean distance between them is as shown in Equation (6). $d (x, y) = \sqrt{(\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2})}$ (6)

In order to express the similarity by quantization, when the similarity is expressed using the Euclidean distance, the similarity calculation result is regulated to [0, 1], that is, the conversion is performed using the formula (7). The smaller the Euclidean distance, the higher the similarity and the closer the calculation result is to 1. The larger the Euclidean distance, the lower the similarity, and the closer the calculation result is to 0. $sim (x, y) = \frac{1}{1 + d (x, y)}$ (7)
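Equations (6) and (7) can be sketched as follows; the function names are illustrative assumptions.

```python
import math


def euclidean_distance(x, y):
    """Equation (6): straight-line distance between two n-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def euclidean_similarity(x, y):
    """Equation (7): maps a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(x, y))


d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
s = euclidean_similarity([0.0, 0.0], [3.0, 4.0])
```

Identical points have distance 0 and therefore similarity 1, while the similarity approaches 0 as the points move apart.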

3.5 Manhattan similarity

The Manhattan distance is a geometric term used in metric spaces to denote the sum of the absolute differences between the coordinates of two points in a standard coordinate system. Like the Euclidean distance, the Manhattan distance is used to measure distances in a multi-dimensional data space. Compared with the Euclidean distance, the Manhattan distance requires less computation, so its performance is higher.

We assume that the Manhattan distance between x = (x₁, x₂, ⋯ , x_n) , y = (y₁, y₂, ⋯ , y_n) is as shown in Equation (8). The smaller the distance value, the higher the similarity. $d (x, y) = \sum_{i = 1}^{n} | x_{i} - y_{i} |$ (8)
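Equation (8) needs only additions and absolute values, which is why it is cheaper than Equation (6). A minimal sketch, with an assumed function name:

```python
def manhattan_distance(x, y):
    """Equation (8): sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


d = manhattan_distance([1, 2, 3], [4, 0, 3])
```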

3.6 Cosine similarity

Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. For the vectors x = (x₁, x₂, ⋯ , x_n) and y = (y₁, y₂, ⋯ , y_n), the cosine similarity is calculated as shown in Equation (9). $\cos θ = \frac{\sum_{i = 1}^{n} (x_{i} \cdot y_{i})}{\sqrt{\sum_{i = 1}^{n} x_{i}^{2}} \cdot \sqrt{\sum_{i = 1}^{n} y_{i}^{2}}}$ (9)
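A minimal sketch of Equation (9) is given below. Returning 0 when one of the vectors is the zero vector is an assumption made here to avoid division by zero; the equation itself is undefined in that case.

```python
import math


def cosine_similarity(x, y):
    """Equation (9): cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: similarity with a zero vector is defined as 0
    return dot / (norm_x * norm_y)


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Unlike the Euclidean and Manhattan distances, the result depends only on the direction of the vectors, so two texts with proportional term weights are treated as maximally similar.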

3.7 WordNet

WordNet is an English dictionary based on cognitive linguistics designed by psychologists, linguists and computer engineers at Princeton University. It is not just to arrange the words in alphabetical order, but to form a “network of words” according to the meaning of the words.

WordNet is a broad-based English vocabulary semantic web. Nouns, verbs, adjectives, and adverbs are each organized into a network of synonyms, each of which represents a basic semantic concept. Moreover, these sets are also connected by various relationships, and a polysemous word will appear in the synonym set of each of its meanings. In WordNet, there is no connection between the four different parts of speech networks. WordNet’s noun network was the first to develop, and as such, most scholars’ work is limited to noun networks. The backbone of the noun network is the level of the implication relationship (upper/lower relationship), which occupies nearly 80% of the relationship. The top level in the hierarchy is 11 abstract concepts called Unique Beginners, such as Entity (“live or inanimate concrete existence”), Psychological Feature (“Psychic Features of Life Organisms’’). The deepest level in the noun hierarchy is 16 nodes.

3.8 Named entity recognition

Named Entity Recognition (NER), also known as “name identification”, refers to the identification of entities with specific meaning in the text, including person names, place names, institution names, proper nouns, and so on. Because named entities are basic information-bearing units of natural language, named entity recognition belongs to the basic research field of text information processing, and it is an indispensable component of various natural language processing technologies such as information extraction, information retrieval, machine translation, and question answering.

Named entities are the subject of named entity recognition research. Named entity recognition mainly focuses on three categories (entity class, time class, number class), that is, seven subcategories (person name, place name, institution name, time, date, currency, percentage). Among them, the named entity of the numeric class and the time class can obtain better effects by rule matching. Therefore, the current research body of named entity recognition mainly focuses on person names, place names, and institution names, and extends from these simple entities to other fields, such as movie names, book names, product names, and protein names.

4 Similarity calculation regression

Machine learning is an important branch of artificial intelligence. It contains the results of disciplines such as probability statistics, computational complexity theory, information theory, and neurobiology. The main method of machine learning is to obtain a certain relationship between input and output by training the known samples so that the unknown input can be predicted. Machine learning algorithms have been widely used in the field of natural language processing, and the text similarity calculations studied in this paper can be regarded as regression problems in machine learning.

There are many machine learning regression algorithms used in natural language processing. The regression models used in this experiment are the Support Vector Regression (SVR) model and the Long Short-Term Memory over Tree Structures (Tree-LSTM) model. The principles of the two regression models are described below.

4.1 Support vector regression

Training samples $D = \{ (x_{1}, y_{1}), \dots, (x_{m}, y_{m}) \}, y_{i} \in ℝ$ are given. The goal of SVR is to learn a regression model of the form shown in Equation (10), where w and b are the model parameters to be determined such that f (x) is as close as possible to y. $f (x) = w^{T} x + b$ (10)

For a sample (x, y), a general regression model calculates the loss from the difference between f (x) and the true value y, and the loss is 0 only if f (x) is exactly equal to y. The SVR model, however, tolerates a maximum deviation of ɛ between f (x) and y. That is, the loss is calculated only if the absolute value of the difference between f (x) and y is greater than ɛ. The model can therefore be regarded as building a band of width 2ɛ centered on f (x). If a training sample falls inside this band, the prediction is considered correct, as shown in Fig. 1.

Fig. 1

Schematic diagram of support vector regression.

The area between the dashed lines is the ɛ interval band, and the samples falling into it do not calculate the loss.

The SVR can be expressed as Equation (11). $min_{w, b} \frac{1}{2} {∥ w ∥}^{2} + C \sum_{i = 1}^{m} ℓ_{ɛ} (f (x_{i}) - y_{i})$ (11)

In the formula, C is a regularization constant, and ℓ_ɛ is the ɛ-insensitive loss function shown in Fig. 2, that is, ℓ_ɛ (z) = 0 if | z | ⩽ ɛ and ℓ_ɛ (z) = | z | - ɛ otherwise.

Fig. 2

Schematic diagram of the ɛ-insensitive loss function.
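The ɛ-insensitive loss and the objective of Equation (11) can be sketched for the linear model of Equation (10) as follows. The function names, the toy samples, and the parameter values are assumptions for illustration only; the sketch evaluates the objective for given w and b and does not solve the optimization problem.

```python
def epsilon_insensitive_loss(residual, epsilon):
    """Zero inside the band |residual| <= epsilon, linear outside it."""
    return max(0.0, abs(residual) - epsilon)


def svr_objective(w, b, samples, c, epsilon):
    """Value of Equation (11) for f(x) = w^T x + b; c is the regularization constant C."""
    regularizer = 0.5 * sum(w_k * w_k for w_k in w)
    total_loss = 0.0
    for x, y in samples:
        prediction = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
        total_loss += epsilon_insensitive_loss(prediction - y, epsilon)
    return regularizer + c * total_loss


# Made-up one-dimensional samples (x, y).
samples = [([1.0], 1.05), ([2.0], 2.5)]
value = svr_objective(w=[1.0], b=0.0, samples=samples, c=1.0, epsilon=0.1)
```

Here the first sample lies inside the band and contributes no loss, while the second deviates by 0.5 and contributes 0.4.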

Equation (11) can be rewritten as Equation (12) by introducing the slack variables ξ_i and ${\hat{ξ}}_{i}$ . $\begin{matrix} \min_{w, b, ξ_{i}, {\hat{ξ}}_{i}} \frac{1}{2} {∥ w ∥}^{2} + C \sum_{i = 1}^{m} (ξ_{i} + {\hat{ξ}}_{i}) \\ \text{s.t.} \ f (x_{i}) - y_{i} ⩽ ɛ + ξ_{i} \\ y_{i} - f (x_{i}) ⩽ ɛ + {\hat{ξ}}_{i} \\ ξ_{i} ⩾ 0, {\hat{ξ}}_{i} ⩾ 0, i = 1, 2, \dots, m \end{matrix}$ (12)

By introducing the Lagrangian multipliers $μ_{i} ⩾ 0, {\hat{μ}}_{i} ⩾ 0, α_{i} ⩾ 0, {\hat{α}}_{i} ⩾ 0$ , the Lagrangian function of Equation (13) can be obtained by the Lagrangian multiplier method. $\begin{matrix} L (w, b, α, \hat{α}, ξ, \hat{ξ}, μ, \hat{μ}) = \frac{1}{2} {∥ w ∥}^{2} + C \sum_{i = 1}^{m} (ξ_{i} + {\hat{ξ}}_{i}) \\ - \sum_{i = 1}^{m} μ_{i} ξ_{i} - \sum_{i = 1}^{m} {\hat{μ}}_{i} {\hat{ξ}}_{i} \\ + \sum_{i = 1}^{m} α_{i} (f (x_{i}) - y_{i} - ɛ - ξ_{i}) \\ + \sum_{i = 1}^{m} {\hat{α}}_{i} (y_{i} - f (x_{i}) - ɛ - {\hat{ξ}}_{i}) \end{matrix}$ (13)

By substituting Equation (10) into Equation (13) and setting the partial derivatives of $L (w, b, α, \hat{α}, ξ, \hat{ξ}, μ, \hat{μ})$ with respect to w, b, ξ_i and ${\hat{ξ}}_{i}$ to 0, Equations (14)–(17) can be obtained. $w = \sum_{i = 1}^{m} ({\hat{α}}_{i} - α_{i}) x_{i}$ (14) $0 = \sum_{i = 1}^{m} ({\hat{α}}_{i} - α_{i})$ (15) $C = α_{i} + μ_{i}$ (16) $C = {\hat{α}}_{i} + {\hat{μ}}_{i}$ (17)

Substituting Equations (14)–(17) into Equation (13) gives the dual problem of SVR, as shown in Equation (18). $\begin{matrix} \max_{α, \hat{α}} \sum_{i = 1}^{m} [ y_{i} ({\hat{α}}_{i} - α_{i}) - ɛ ({\hat{α}}_{i} + α_{i}) ] \\ - \frac{1}{2} \sum_{i = 1}^{m} \sum_{j = 1}^{m} ({\hat{α}}_{i} - α_{i}) ({\hat{α}}_{j} - α_{j}) x_{i}^{T} x_{j} \\ \text{s.t.} \ \sum_{i = 1}^{m} ({\hat{α}}_{i} - α_{i}) = 0 \\ 0 ⩽ α_{i}, {\hat{α}}_{i} ⩽ C \end{matrix}$ (18)

The KKT (Karush-Kuhn-Tucker) conditions need to be satisfied in the above process, and they are shown in Equation (19). $\left\{ \begin{matrix} α_{i} (f (x_{i}) - y_{i} - ɛ - ξ_{i}) = 0 \\ {\hat{α}}_{i} (y_{i} - f (x_{i}) - ɛ - {\hat{ξ}}_{i}) = 0 \\ α_{i} {\hat{α}}_{i} = 0, ξ_{i} {\hat{ξ}}_{i} = 0 \\ (C - α_{i}) ξ_{i} = 0, (C - {\hat{α}}_{i}) {\hat{ξ}}_{i} = 0 \end{matrix} \right.$ (19)

It can be seen that α_i can take a non-zero value only if f (x_i) - y_i - ɛ - ξ_i = 0, and ${\hat{α}}_{i}$ can take a non-zero value only if $y_{i} - f (x_{i}) - ɛ - {\hat{ξ}}_{i} = 0$ . That is, the corresponding α_i and ${\hat{α}}_{i}$ can take non-zero values only when the sample (x_i, y_i) does not fall inside the ɛ-insensitive band. In addition, the constraints f (x_i) - y_i - ɛ - ξ_i = 0 and $y_{i} - f (x_{i}) - ɛ - {\hat{ξ}}_{i} = 0$ cannot hold at the same time, so at least one of α_i and ${\hat{α}}_{i}$ is zero.

Substituting Equation (14) into Equation (10), the solution of SVR is as shown in Equation (20). $f (x) = \sum_{i = 1}^{m} ({\hat{α}}_{i} - α_{i}) x_{i}^{T} x + b$ (20)

The samples for which $({\hat{α}}_{i} - α_{i}) \neq 0$ in Equation (20) are the support vectors of the SVR, and they must not fall inside the ɛ-insensitive band. The support vectors are therefore only a part of the training samples, and the solution is sparse.

As can be seen from the KKT conditions, (C - α_i) ξ_i = 0 and α_i (f (x_i) - y_i - ɛ - ξ_i) = 0 hold for every sample (x_i, y_i). After α_i is obtained, if 0 < α_i < C, then ξ_i = 0 and Equation (21) can be obtained. $b = y_{i} + ɛ - \sum_{j = 1}^{m} ({\hat{α}}_{j} - α_{j}) x_{j}^{T} x_{i}$ (21)

After α_i is obtained by solving Equation (18), any sample satisfying 0 < α_i < C can be selected to obtain b using Equation (21). In practice, a more robust method is often used: b is solved for several samples satisfying 0 < α_i < C, and the results are averaged.

When a feature mapping φ is taken into account, Equation (14) can be expressed as Equation (22). $w = \sum_{i = 1}^{m} ({\hat{α}}_{i} - α_{i}) φ (x_{i})$ (22)

Then, SVR can be expressed as formula (23). $f (x) = \sum_{i = 1}^{m} ({\hat{α}}_{i} - α_{i}) k (x, x_{i}) + b$ (23)

In the formula, $k (x_{i}, x_{j}) = φ (x_{i})^{T} φ (x_{j})$ is a kernel function. The kernel function is a way of extending a linear learner to a nonlinear learner.
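Once the dual coefficients and b are known, prediction with Equation (23) is a weighted sum of kernel evaluations over the support vectors. The sketch below uses a Gaussian (RBF) kernel, which is an assumption: the text does not state which kernel is used. The support vectors, coefficients and b are made-up values, not trained parameters.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2) (assumed choice of kernel)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svr_predict(x, support_vectors, dual_coefs, b, kernel=rbf_kernel):
    """Equation (23); dual_coefs[i] stands for (alpha_hat_i - alpha_i)."""
    return sum(
        coef * kernel(x, sv) for coef, sv in zip(dual_coefs, support_vectors)
    ) + b


# Hypothetical values; the coefficients sum to zero, as required by Equation (15).
support_vectors = [[0.0], [1.0]]
dual_coefs = [0.5, -0.5]
prediction = svr_predict([0.0], support_vectors, dual_coefs, b=0.2)
```

Only samples with a non-zero coefficient appear in the sum, which is the practical benefit of the sparsity noted above.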

4.2 Tree-LSTM model

In recent years, long short-term memory (LSTM) has become popular again. Its effectiveness has been proven in many tasks, such as speech recognition, machine translation, and image-to-text tasks. Recursion is a basic process related to many problems, and recursive processes and hierarchies are very common in different models. For example, the semantics of a sentence can generally be expressed as a hierarchical structure, and image understanding also benefits from structures that can be modeled recursively, which have achieved good performance in image processing.

Zhu extends LSTM to a tree structure, and the resulting model is called Tree-LSTM. When learning memory cells, Tree-LSTM can obtain the historical memory of multiple child units and further descendants. Compared with the recursive neural network, Tree-LSTM can avoid the vanishing gradient problem, so long-distance interactions can be modeled over the tree structure. Tree-LSTM combines the advantages of recursive neural networks and recurrent neural networks. Since the binary tree structure is simple and intuitive, this section uses the binary Tree-LSTM to explain the model.

When the LSTM is extended to Tree-LSTM, each memory unit can reflect the historical state of multiple child units and multiple descendant units. As shown in Fig. 3, the root of the tree can obtain information through long-distance interaction with the gray and light blue leaf nodes of the tree. In Fig. 3, a small circle (“.”) or a short line (“-”) in front of an arrow indicates that information is passed or blocked, respectively. Fig. 3 uses a binary tree as an example, and passing and blocking are controlled by gate vectors. The elements of a gate vector are usually computed with the logistic sigmoid function to ensure that their values lie in the range [0, 1]. By learning the gate vectors, Tree-LSTM provides a way for distant parts of the input structure to interact.

Fig. 3

Schematic diagram of Tree-LSTM.

Each node in Fig. 3 contains a Tree-LSTM memory unit. Figure 4 shows an internal schematic of the Tree-LSTM memory unit. Each memory unit includes an input gate and an output gate, and the number of forget gates depends on the number of child nodes. The binary tree structure used in this section has two child nodes per node. The hidden state vectors $h_{t - 1}^{L}$ and $h_{t - 1}^{R}$ of the two child nodes serve as input vectors for the node. The input gate i_t uses four vectors: the hidden vectors ( $h_{t - 1}^{L}$ and $h_{t - 1}^{R}$ ) and the cell vectors ( $c_{t - 1}^{L}$ and $c_{t - 1}^{R}$ ) of the two child nodes. These four vectors are also used to compute the left and right forget gate vectors $f_{t}^{L}$ and $f_{t}^{R}$ , and the weights that combine them are specific to each gate vector.

Fig. 4

Internal schematic diagram of Tree-LSTM.

As shown in Equations (24)–(29), each W is different. Unlike the general LSTM, the Tree-LSTM memory unit needs to consider the cell vectors $c_{t - 1}^{L}$ and $c_{t - 1}^{R}$ of all the child nodes at the same time, and it controls a separate forget gate, $f_{t}^{L}$ or $f_{t}^{R}$ , for each child. The output gate considers the cell vector of the node itself as well as the hidden vectors of the child nodes. At the same time, the hidden state vector and cell vector of the node are passed to the parent node, where the node acts as the left child or right child. In this way, by combining the gated cell vectors of the child nodes, the memory unit responds directly to multiple child units and indirectly to multiple descendant units, so the model can capture long-distance interactions. The forward propagation formulas of Tree-LSTM are shown in Equations (24)–(29). $f_{t}^{L} = σ (W_{hfl}^{L} h_{t - 1}^{L} + W_{hfl}^{R} h_{t - 1}^{R} + W_{cfl}^{L} c_{t - 1}^{L} + W_{cfl}^{R} c_{t - 1}^{R} + b_{fl})$ (24)

$\begin{matrix} f_{t}^{R} = σ (W_{hfr}^{L} h_{t - 1}^{L} + W_{hfr}^{R} h_{t - 1}^{R} + W_{cfr}^{L} c_{t - 1}^{L} \\ + W_{cfr}^{R} c_{t - 1}^{R} + b_{fr}) \end{matrix}$ (25) $x_{t} = W_{hx}^{L} h_{t - 1}^{L} + W_{hx}^{R} h_{t - 1}^{R} + b_{x}$ (26) $c_{t} = f_{t}^{L} \otimes c_{t - 1}^{L} + f_{t}^{R} \otimes c_{t - 1}^{R} + i_{t} \otimes tanh (x_{t})$ (27) $o_{t} = σ (W_{ho}^{L} h_{t - 1}^{L} + W_{ho}^{R} h_{t - 1}^{R} + W_{co} c_{t} + b_{o})$ (28) $h_{t} = o_{t} \otimes tanh (c_{t})$ (29)

In the formulas, σ represents the logistic sigmoid function, which keeps the elements of the gate vectors within the range [0, 1]. $f_{t}^{L}$ and $f_{t}^{R}$ are the left and right forget gates respectively, b is a bias vector, W is a network weight matrix, and ⊗ is the Hadamard product, that is, element-wise multiplication of vectors. The subscript of a weight matrix indicates the vectors it connects; for example, W_ho is a matrix that maps hidden vectors to the output gate. The internal structure of the Tree-LSTM is shown in Fig. 4.
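The forward pass of Equations (24)–(29) can be sketched in plain Python for a small binary tree. Several points are assumptions made for illustration: the input gate i_t is not written out in the equations above, so it is given here the same form as the forget gates; the left forget gate carries a bias term mirroring Equation (25); leaves are represented by a given hidden vector with a zero cell vector; and the weights are small random values rather than trained parameters.

```python
import math
import random


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]


def hadamard(u, v):
    return [a * b for a, b in zip(u, v)]


def init_params(dim, seed=0):
    """Random (untrained) weights; the key names loosely follow the W subscripts."""
    rng = random.Random(seed)
    names = ["hfl_L", "hfl_R", "cfl_L", "cfl_R", "hfr_L", "hfr_R", "cfr_L", "cfr_R",
             "hi_L", "hi_R", "ci_L", "ci_R", "hx_L", "hx_R", "ho_L", "ho_R", "co"]
    params = {n: [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
              for n in names}
    for bias in ["b_fl", "b_fr", "b_i", "b_x", "b_o"]:
        params[bias] = [0.0] * dim
    return params


def tree_lstm_cell(p, h_l, c_l, h_r, c_r):
    """One memory unit: combine the states of the left and right children."""
    def gate(name, bias):
        return sigmoid(add(matvec(p[f"h{name}_L"], h_l), matvec(p[f"h{name}_R"], h_r),
                           matvec(p[f"c{name}_L"], c_l), matvec(p[f"c{name}_R"], c_r),
                           p[bias]))
    f_l = gate("fl", "b_fl")                                   # Equation (24)
    f_r = gate("fr", "b_fr")                                   # Equation (25)
    i = gate("i", "b_i")                                       # assumed input gate
    x = add(matvec(p["hx_L"], h_l), matvec(p["hx_R"], h_r), p["b_x"])    # Equation (26)
    c = add(hadamard(f_l, c_l), hadamard(f_r, c_r), hadamard(i, tanh(x)))  # Equation (27)
    o = sigmoid(add(matvec(p["ho_L"], h_l), matvec(p["ho_R"], h_r),
                    matvec(p["co"], c), p["b_o"]))             # Equation (28)
    h = hadamard(o, tanh(c))                                   # Equation (29)
    return h, c


def encode(p, node, dim):
    """A leaf is a list (hidden vector); an internal node is a (left, right) tuple."""
    if isinstance(node, list):
        return node, [0.0] * dim   # assumption: leaves start with a zero cell vector
    left, right = node
    h_l, c_l = encode(p, left, dim)
    h_r, c_r = encode(p, right, dim)
    return tree_lstm_cell(p, h_l, c_l, h_r, c_r)


dim = 3
params = init_params(dim)
tree = (([0.1, 0.2, 0.3], [0.0, -0.1, 0.2]), [0.3, 0.1, -0.2])
h, c = encode(params, tree, dim)
```

The root state `h` depends on all three leaves, two of them only indirectly through the intermediate node, which is the long-distance interaction described above.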

During training, the gradient of the objective function with respect to each parameter can be calculated efficiently by backpropagation. Unlike in the ordinary LSTM, error propagation in Tree-LSTM needs to distinguish between the left and right children. For a topology with more than two children per node, more children need to be distinguished. Equations (30)–(36) are the backpropagation formulas of Tree-LSTM.

For each memory unit, we assume that the error passed to the hidden vector is $ɛ_{t}^{h}$ , where O denotes the objective function. The derivatives of the output gate $δ_{t}^{o}$ , the left forget gate $δ_{t}^{f_{l}}$ , the right forget gate $δ_{t}^{f_{r}}$ , and the input gate $δ_{t}^{i}$ are given in Equations (30)–(36): $ɛ_{t}^{h} = \frac{\partial O}{\partial h_{t}}$ (30) $δ_{t}^{o} = ɛ_{t}^{h} \otimes \tanh (c_{t}) \otimes σ^{'} (o_{t})$ (31) $δ_{t}^{f_{l}} = ɛ_{t}^{c} \otimes c_{t - 1}^{L} \otimes σ^{'} (f_{t}^{L})$ (32) $δ_{t}^{f_{r}} = ɛ_{t}^{c} \otimes c_{t - 1}^{R} \otimes σ^{'} (f_{t}^{R})$ (33) $δ_{t}^{i} = ɛ_{t}^{c} \otimes \tanh (x_{t}) \otimes σ^{'} (i_{t})$ (34) $\begin{matrix} ɛ_{t}^{c} = ɛ_{t}^{h} \otimes o_{t} \otimes g^{'} (c_{t}) + ɛ_{t + 1}^{c} \otimes f_{t + 1}^{L} \\ + (W_{ci}^{L})^{T} δ_{t + 1}^{i} + (W_{cfl}^{L})^{T} δ_{t + 1}^{f_{l}} \\ + (W_{cfr}^{L})^{T} δ_{t + 1}^{f_{r}} + (W_{co})^{T} δ_{t}^{o} \end{matrix}$ (35) $\begin{matrix} ɛ_{t}^{c} = ɛ_{t}^{h} \otimes o_{t} \otimes g^{'} (c_{t}) + ɛ_{t + 1}^{c} \otimes f_{t + 1}^{R} \\ + (W_{ci}^{R})^{T} δ_{t + 1}^{i} + (W_{cfl}^{R})^{T} δ_{t + 1}^{f_{l}} \\ + (W_{cfr}^{R})^{T} δ_{t + 1}^{f_{r}} + (W_{co})^{T} δ_{t}^{o} \end{matrix}$ (36)

In the formulas, σ′ (x) denotes the element-wise derivative of the logistic function with respect to the vector x, and $ɛ_{t}^{c}$ represents the derivative with respect to the cell vector. If the current node is the left child of its parent node, $ɛ_{t}^{c}$ is computed using Equation (35); otherwise it is computed using Equation (36). g′ (x) is the element-wise derivative of the hyperbolic tangent function with respect to the vector x. The superscript T of a weight matrix represents the transpose of the matrix.

Once the derivative of each gate vector has been calculated, the derivatives of the weight matrices can also be calculated.

4.3 Evaluation criteria

This section introduces the performance evaluation indicator commonly used in text similarity calculation, namely the Pearson correlation coefficient. In statistics, the Pearson correlation coefficient is used to measure the linear correlation between two variables X and Y, and its value lies between –1 and 1. Normally, the strength of the correlation is judged by the following ranges of the absolute value of the coefficient: 0.8–1.0 indicates a very strong correlation, 0.6–0.8 indicates a strong correlation, 0.4–0.6 indicates a moderate correlation, 0.2–0.4 indicates a weak correlation, and 0.0–0.2 indicates a very weak correlation or no correlation.

As shown in Equation (37), the Pearson correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations: $ρ_{X, Y} = \frac{cov (X, Y)}{σ_{X} σ_{Y}} = \frac{E \{ [X - E (X)] [Y - E (Y)] \}}{σ_{X} σ_{Y}}$ (37)

In the formula, E denotes the mathematical expectation or mean, σ is the standard deviation, cov (X, Y) is the covariance of the random variables X and Y, that is, cov (X, Y) = E { [X - E (X)] [Y - E (Y)] }, and ρ_X,Y is the Pearson correlation coefficient.

Since the benchmark system used in this paper evaluates performance with the Pearson correlation coefficient, this paper also computes the Pearson correlation coefficient between the calculated text similarity scores and the manual similarity scores, and compares the result with that of the benchmark system.
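As a minimal sketch, Equation (37) can be estimated from paired scores as follows. The function name and the example scores are hypothetical; the scores are not experimental data from this paper.

```python
import math


def pearson(xs, ys):
    """Sample estimate of Equation (37) for two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The common 1/n factors of covariance and variances cancel in the ratio.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical system scores and manual scores for four sentence pairs.
system_scores = [0.2, 0.5, 0.9, 0.4]
manual_scores = [1.0, 2.5, 4.5, 2.0]
r = pearson(system_scores, manual_scores)
```

Because the coefficient is invariant to scale and offset, the system scores and the manual scores do not need to lie in the same range. The function is undefined when either list is constant, since the corresponding standard deviation is zero.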

5 System construction

The domestic LTP (Language Technology Platform) has established modules for word segmentation, part-of-speech tagging, partial named entity recognition, syntactic analysis and semantic analysis (word sense disambiguation and semantic role tagging). LTP smoothly integrates these modules. Data is transferred between modules using XML. In addition, it provides some DLL or Web service APIs, visualization tools, and some related corpora. The construction diagram of LTP is shown in Fig. 5.

Fig. 5

LTP construction diagram.

From bottom to top, LTP consists of six components: ① corpus, ② processing modules, ③ XML-based internal data representation and processing, ④ DLL API, ⑤ Web services, and ⑥ visualization tools.

Figure 6 shows the model structure, which consists of three layers: the input layer, the projection layer, and the output layer.

Fig. 6

Model structure diagram.

Output layer: The output layer corresponds to a binary tree, and it uses the words that appear in the corpus as leaf nodes.

The Huffman tree is constructed by using the number of occurrences of each word in the corpus as its weight. There are N leaf nodes in the Huffman tree, corresponding to the N words in the dictionary, and N-1 non-leaf nodes, as shown in the figure below.
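The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the toy corpus and the names `build_huffman` and `codes` are hypothetical, and ties between equal counts are broken by insertion order.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman(word_counts):
    """Build a Huffman tree from word counts.

    Leaves are words (str); internal nodes are (left, right) tuples.
    Returns the root and the number of internal (non-leaf) nodes.
    """
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(freq, next(tie), word) for word, freq in word_counts.items()]
    heapq.heapify(heap)
    internal = 0
    while len(heap) > 1:
        # Merge the two lightest subtrees into a new internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
        internal += 1
    return heap[0][2], internal


def codes(node, prefix=""):
    """Map each leaf word to its binary path from the root."""
    if isinstance(node, str):
        return {node: prefix or "0"}
    left, right = node
    out = codes(left, prefix + "0")
    out.update(codes(right, prefix + "1"))
    return out


# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
root, n_internal = build_huffman(counts)
word_codes = codes(root)
print(len(counts), "leaves,", n_internal, "internal nodes")
```

With N distinct words the loop performs N-1 merges, which is why the tree has N-1 non-leaf nodes, and frequent words end up closer to the root than rare ones.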

In the hidden layer, there are two techniques: CBOW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model). CBOW predicts the probability of occurrence of the current word from the context words in the text. Skip-gram uses the current word to predict the surrounding words, that is, it can be used to answer questions such as "when a word is given, which words are most likely to appear around it?"
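The difference between the two techniques is easiest to see in the training examples that each one derives from a sentence. The sketch below is illustrative only: the function names, the window size, and the sample sentence are assumptions, and no model is actually trained.

```python
def cbow_examples(tokens, window=2):
    """(context words, current word) examples: the context predicts the word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """(current word, surrounding word) pairs: the word predicts its context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical sentence, for illustration only.
tokens = "semantic similarity of english text".split()
cbow = cbow_examples(tokens, window=1)
skip = skipgram_examples(tokens, window=1)
print(cbow[1])   # context on both sides of "similarity", then the word itself
print(skip[:3])  # the first few (current word, neighbour) pairs
```

CBOW produces one example per position with the whole context as input, whereas Skip-gram produces one pair per (current word, neighbour) combination.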

6 Experimental research

The experiment uses Windows as the platform and PyCharm as the experimental environment. The experimental data consists of all the microblogs of 400 users, crawled from the Sina Weibo website by a crawler program written in Python. The users are divided into four categories: military, fitness, finance, and tourism, with 100 users in each category. From each of the four categories, 80 users are selected, giving 320 users in total as the category identification set, and the remaining 80 users form the set whose category is to be determined.
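The partition of the users can be sketched as below. The paper does not state how the 80 users per category are chosen, so random selection with a fixed seed is an assumption here, and the user identifiers and the function name `split_users` are hypothetical.

```python
import random

CATEGORIES = ["military", "fitness", "finance", "tourism"]


def split_users(users_by_category, n_identify=80, seed=0):
    """Split each category into an identification set and a pending set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    identify, pending = {}, {}
    for category, users in users_by_category.items():
        shuffled = list(users)
        rng.shuffle(shuffled)  # assumption: users are picked at random
        identify[category] = shuffled[:n_identify]
        pending[category] = shuffled[n_identify:]
    return identify, pending


# Hypothetical user identifiers: 100 users per category, 400 in total.
users_by_category = {
    category: [f"{category}_{i}" for i in range(100)] for category in CATEGORIES
}
identify, pending = split_users(users_by_category)
n_identify_total = sum(len(users) for users in identify.values())
n_pending_total = sum(len(users) for users in pending.values())
print(n_identify_total, n_pending_total)
```

Splitting per category keeps the four categories equally represented in both sets, with 80 identification users and 20 pending users in each.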

The accuracy comparison between the traditional clustering method and the method of this study for all users in the four categories is shown in Fig. 8. According to Fig. 8, when determining the user category, the part-of-speech-based text similarity algorithm proposed in this study achieves higher accuracy than the traditional clustering algorithm. Accuracy is highest in the military category, because that category contains specialized military vocabulary that differs semantically from the characteristic words of the other categories. Figure 9 shows the common interest network for all users in the four categories, and Fig. 10 shows the common interest network for military category users taken from Fig. 9. The nodes M, E, F, and T represent the four categories of military, fitness, finance, and tourism, respectively, the numbered nodes represent user numbers, and the lines connecting the nodes indicate that the users have common interests. This article examines each user's interests as a whole, without distinguishing the user's interests in different time periods. Moreover, the constructed interest network can be used for friend recommendations.

Fig. 7

Technical principle of CBOW and Skip-gram.

Fig. 8

Accuracy comparison.

Fig. 9

Common interest network for all pending category users.

Fig. 10

Common interest network for military pending category users.

7 Analysis and discussion

(1) This paper proposes a method for calculating the semantic similarity of English text based on structured representation. Most existing text semantic similarity calculation methods use a large number of planar similarity features to represent the similarity of a pair of texts, and the representation is weak. Moreover, this paper further uses PST and PDT to represent the syntax, semantics and other information of the text. In addition, this paper improves the performance of text semantic similarity calculation by using these two structural features suitable for text similarity calculation and combining with planar features.

(2) This paper proposes a method for calculating semantic similarity of English text based on Tree-LSTM. To address the poor performance of previous text semantic similarity calculation methods on long text sets, this method uses the Tree-LSTM model to calculate the similarity. PDT and PST are modified to obtain the NPDT and NPST structural features suitable for the Tree-LSTM model. Planar features are no longer used in this method; instead, NPST and NPDT are combined with the appropriate Tree-LSTM model for text similarity calculation, which greatly improves performance on long text sets.

(3) This paper proposes a system based on semantic similarity calculation of English texts. In the face of massive English text data, relationship extraction can not only improve the accuracy of text classification, but also play an effective role in promoting social network construction. In view of the interest of network users, this study proposes a user similarity calculation method based on part of speech to construct an interest network. Experiments show that this method provides a new idea for interest network extraction.

8 Conclusion

Currently, most text clustering algorithms are based on the Vector Space Model (VSM). This text representation is very simple, but it raises the problem of high-dimensional sparsity. Moreover, it does not solve the two natural language problems specific to text data: synonyms and polysemous words. All of these problems greatly interfere with the efficiency and accuracy of text clustering algorithms and degrade clustering performance. In order to avoid these problems, this paper adopts a new idea to cluster texts, that is, using semantic similarity as the measure of similarity between texts. This paper proposes a similarity calculation method that combines planar features with structured features and uses support vector regression models. Then, this paper proposes a similarity calculation method that combines the structural features with the Tree-LSTM model. The experimental results indicate that the proposed algorithm is effective.

References

1. Cheng, Xu, Bai, et al., AON: Towards arbitrarily-oriented text recognition, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE 22(4) (2018), 12–23.

2. Liu, Chen and Wong, Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition, 44(12) (2018), 78–91.

3. Shadiev, Huang Y.M. and Hwang J.P., Investigating the effectiveness of speech-to-text recognition applications on learning performance, attention, and meditation, Educational Technology Research and Development 55(3) (2017), 28–32.

4. Bhunia A.K., Kumar, Roy P.P., et al., Text recognition in scene image and video frame using color channel selection, Multimedia Tools & Applications 12(2) (2017), 102–111.

5. Hicham E.M., Akram and Khalid, Using features of local densities, statistics and HMM toolkit (HTK) for offline Arabic handwriting text recognition, Journal of Electrical Systems and Information Technology 20(5) (2016), 526–530.

6. Zayene, Touj S.M., Hennebert, et al., Multi-dimensional long short-term memory networks for artificial Arabic text recognition in news video, IET Computer Vision 12(5) (2018), 710–719.

7. Shadiev, Wu T.T. and Huang Y.M., Enhancing learning performance, attention, and meditation using a speech-to-text recognition application: evidence from multiple data sources, Interactive Learning Environments 11(3) (2017), 1–13.

8. Kumar, Saini, Roy P.P., et al., 3D text segmentation and recognition using leap motion, Multimedia Tools and Applications 76(15) (2017), 16491–16510.

9. […], Han, Chen, et al., Review network for scene text recognition, Journal of Electronic Imaging 26(5) (2017), 14–22.

10. Ch'ng C.K. and Chan C.S., Total-Text: A Comprehensive Dataset for Scene Text Detection and Recognition, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 23(5) (2017), 935–942.

11. Verikas, Radeva, Nikolaev D.P., et al., A new method for text detection and recognition in indoor scene for assisting blind people, Ninth International Conference on Machine Vision (ICMV 2016), Nice, France, SPIE Proceedings 103(12) (2017), 103–123.

12. Khasnobish, Datta, Bose, et al., Analyzing text recognition from tactually evoked EEG, Cognitive Neurodynamics 45(1) (2017), 34–45.

13. Yindi and Xiaoyi, Two-stage scene text sequence recognition method based on text line local extremum region, Journal of Frontiers of Computer Science and Technology 102(23) (2018), 403–412.

14. Read J.C., A study of the usability of handwriting recognition for text entry by children, Interacting with Computers 19(1) (2018), 57–69.

15. Reyes-Ortiz J.A., Criminal event ontology population and enrichment using patterns recognition from text, International Journal of Pattern Recognition and Artificial Intelligence 56(3) (2018), 45–53.

16. Paul and Jeyaraj R., Internet of Things: A primer, Human Behavior and Emerging Technologies 1(1) (2019), 37–47.

17. Paul, Daniel, Ahmad and Rho, Cooperative cognitive intelligence for internet of vehicles, IEEE Systems Journal 11(3) (2017), 1249–1258.

18. Paul Anand, Ahmad Awais, Mazhar Rathore and Jabbar Sohail, Smartbuddy: defining human behaviors using big data analytics in social internet of things, IEEE Wireless Communications 23(5) (2016), 68–74.

19. Paul, Jiang Y.C., Wang J.F. and Yang J.F., Parallel reconfigurable computing-based mapping algorithm for motion estimation in advanced video coding, ACM Transactions on Embedded Computing Systems (TECS) 11(S2) (2012), 1–18.

20. Kenzhebaev K.K., Stanzhytskyi A.N. and Tsukanova A.O., Existence and uniqueness results, the markovian property of solution for a neutral delay stochastic reaction-diffusion equation in entire space, Dynamic Systems and Applications 28 (2019), 19–46.

21. Guan Fuyu, Cao Jie and Fang Yingjie, Research on sports intensity and energy consumption based on fractional linear regression equation, Dynamic Systems and Applications 29(3) (2020), 730–742.

22. Rathore M.M., Paul, Hong W.H., Seo H.C., Awan and Saeed, Exploiting IoT and big data analytics: Defining smart digital city using real-time urban data, Sustainable Cities and Society 40 (2018), 600–610.