Abstract
Semantic similarity calculation for English text strongly influences other areas of natural language processing and has high research value and broad application prospects. Research on similarity calculation for short texts has achieved good results, but results on long text sets remain poor. This paper proposes a similarity calculation method that combines planar features with structured features and uses a support vector regression model. It uses PST and PDT to represent the syntactic, semantic and other information of the text. In addition, building on these two structural features, this paper proposes a similarity calculation method that combines structural features with the Tree-LSTM model. Experiments show that the method provides a new idea for interest network extraction.
Introduction
With the widespread use of computers and the popularity of the Internet, all kinds of information are rapidly expanding. The increase in the amount of information has brought convenience to people, but it has also brought about the problem of information overload. Faced with vast amounts of information, people increasingly hope to conduct scientific research, business decision-making and business management based on data analysis [1]. In the real world, text is the most important information carrier. In fact, research shows that 80% of information is contained in text documents. Therefore, the processing and analysis of text documents has become one of the hotspots of data mining and information retrieval technology. There are many techniques for processing and analyzing text documents, and one of the most important is text clustering [2]. Three problems of text data, namely high-dimensional sparseness, synonyms and polysemous words, greatly interfere with the accuracy of clustering algorithms and lead to a sharp decline in the performance of text clustering. Moreover, most existing text clustering algorithms do not give a description of how to perform clustering.
Currently, most text clustering algorithms are based on the Vector Space Model (VSM). This text representation is very simple, but it raises the problem of high dimensional sparsity. Moreover, it does not solve the two natural language problems specific to text data: synonyms and polysemous words. All of these problems greatly interfere with the efficiency and accuracy of the text clustering algorithm and lead to a decline in the performance of text clustering. In order to avoid these problems, this paper adopts a new idea to cluster texts, that is, using semantic similarity as a measure of similarity between texts [3]. The TCUSS algorithm does not have the problem of high dimensional sparsity. Moreover, the use of semantic similarity as a measure of similarity between texts not only has theoretical significance, but also solves the problem of polysemous words and synonyms. Finally, it is proved by experiments that the improved hybrid method has higher accuracy than the previous semantic similarity calculation method, and the TCUSS algorithm also improves the quality of clustering [4].
Related work
Early research on text similarity detection focused mainly on word-level similarity. Methods based on digital fingerprints [5] and methods based on the vector space model [6] were the focus of research. However, with the development and use of plagiarism detection systems, many plagiarists rewrite the content marked as plagiarism by changing the sentence structure and substituting words according to the feedback returned by the system. After rewriting, plagiarists are often able to evade detection without changing the semantic information of the sentence or paragraph. Thus, sentence structure-based detection, cluster-based detection, grammar-based detection, and semantic-based detection [7] emerged. In order to better identify and combat this kind of plagiarism, more researchers began to analyze the semantic information of the text.
A group of scholars at Princeton developed WordNet [8], a dictionary of English semantic relations, according to the grammar rules of the English language. The WordNet semantic dictionary records English words and their interrelationships in a semantic arrangement, and the semantic similarity between words can be calculated from these relationships. A fuzzy similarity calculation method is proposed in [9]. The method uses Shingling to screen out potentially similar samples from a large sample set, compares these samples pairwise, and applies a minimum similarity threshold to finally determine the similarity between the samples. Later, the literature [10] proposed a similarity calculation method based on sentence predicate components. The method first selects some components of the sentence, then uses an English rule corpus to calculate the similarity of those components, and uses that value in place of the similarity of the whole sentence. However, this method focuses on only some of the components of the sentence and ignores the other components and the structure of the sentence.
Faced with semantic-level plagiarism such as synonym replacement, word structure adjustment, and sentence structure adjustment, we can only judge whether a text is plagiarized by comparing semantics at the content level of the sample [11]. In addition, similarity detection at home and abroad is mostly aimed at academic papers and source code; there are few similarity detection studies for automatic composition scoring systems, and most of them rely on the vocabulary structure of the article [12]. Developing a better hybrid similarity detection method based on both words and semantics, and applying it to an English text similarity detection system, would be of great significance for solving the problem of English text similarity. The article [16] focuses on the IoT and its major role in supporting human behaviours and efforts; it also deals with collecting data from the various resources connected to the Internet. The literature [17] discusses problems in the vehicular communication field and proposes a cooperative centralized and distributed spectrum sensing model; with the cooperative cognitive model, interference and various hidden problems are minimized. The literature [18] addresses the massive volume of big data and introduces the concept of SmartBuddy to create an intelligent, smart environment using human behaviours and human dynamics. The literature [19] discusses the construction of a directed acyclic graph for video coding algorithms for motion estimation in parallel reconfigurable computing systems, where the partitioning algorithm plays a major role in speeding up video processing. The article [20] deals with exploiting IoT and big data analytics using the Hadoop ecosystem in real-time environments. An IoT-based smart city is implemented through these processes [21, 22].
Theoretical method analysis
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method used to estimate quantitatively how important a word is to one text in a text set or corpus. Search engines often use TF-IDF as a measure or rating of the relevance between a web page and a user query. The importance of a word increases in proportion to the number of times it appears in the text, but decreases with the frequency with which it appears across the corpus [13].
The main idea of TF-IDF is: If a word appears frequently in a text and rarely appears in other texts, the word is considered to have a good class distinguishing ability, that is, it is a keyword for the text. TF stands for Term Frequency [14], which is the frequency at which words appear in text d, as shown in Equation (1). IDF stands for Inverse Document Frequency and its main idea is that if the text containing the word t is less, the IDF value is larger, as shown in Equation (2).
Equations (1) and (2) take the standard form:

$$ tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \tag{1} $$

$$ idf_{i} = \log \frac{|D|}{|\{ j : t_i \in d_j \}|} \tag{2} $$

In the formulas, $i$ is the word index and $j$ is the text index. In Equation (1), $n_{i,j}$ is the number of occurrences of the $i$-th word in text $j$, and $\sum_{k} n_{k,j}$ is the total number of words contained in text $j$. In Equation (2), $|D|$ is the total number of texts in the corpus, and $|\{ j : t_i \in d_j \}|$ is the number of texts containing the word $t_i$. The value of TF-IDF is given by Equation (3):

$$ tfidf_{i,j} = tf_{i,j} \times idf_{i} \tag{3} $$
It can be seen from Equation (3) that the high word frequency within a particular file and the low file frequency of the word in the entire file set, can produce a high weight TF-IDF. Therefore, TF-IDF tends to filter out common words and retain important words [15].
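The TF-IDF weighting described above can be illustrated with a minimal sketch. It assumes texts are already tokenized and non-empty, uses the natural logarithm, and applies no smoothing; the function name `tf_idf` and the toy corpus are hypothetical and are not taken from the experiments in this paper.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: weight} dict per tokenized, non-empty text in `corpus`."""
    num_texts = len(corpus)
    # Document frequency: the number of texts that contain each word.
    doc_freq = Counter(word for text in corpus for word in set(text))
    weights = []
    for text in corpus:
        counts = Counter(text)
        total = len(text)  # total number of words in this text
        weights.append({
            # term frequency times inverse document frequency
            word: (n / total) * math.log(num_texts / doc_freq[word])
            for word, n in counts.items()
        })
    return weights

# Hypothetical toy corpus of three tokenized texts.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
weights = tf_idf(corpus)
```

In this toy corpus, "the" appears in every text and receives weight 0, while "cat" appears in only one text and receives the largest weight in the first text, which matches the filtering behaviour described above.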
Mutual Information (MI) is a measure from information theory. It quantifies the amount of information that one random variable contains about another, or equivalently the reduction in uncertainty of one random variable once the other is known.
The use of mutual information for feature extraction rests on the assumption that a term that appears frequently in one category but rarely in other categories has large mutual information with that category. Mutual information is therefore commonly used as a measure of association between feature words and categories: a feature word has its largest mutual information with the category it belongs to. MI is also often applied to natural-language text, where it reflects how likely two words are to occur together. MI is defined in Equation (4):

$$ I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \tag{4} $$

In Equation (4), $p(x, y)$ is the joint distribution of the two random variables $X$ and $Y$, and $p(x)$ and $p(y)$ are their marginal distributions. The mutual information $I(X;Y)$ is the relative entropy between the joint distribution $p(x, y)$ and the product distribution $p(x)p(y)$.
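As an illustration of the definition of mutual information as the relative entropy between the joint distribution and the product of the marginals, the following sketch computes it for a small discrete joint distribution. The base-2 logarithm (result in bits), the function name and the two example distributions are assumptions made for illustration.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    # Marginal distributions are obtained by summing the joint distribution.
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0  # terms with zero probability contribute nothing
    )

# Two hypothetical binary variables: fully independent vs. fully dependent.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dependent = {(0, 0): 0.5, (1, 1): 0.5}
```

For the independent case the result is 0, and for the fully dependent case it is 1 bit, since knowing one variable removes all uncertainty about the other.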
Information Entropy (IE) measures the uncertainty of a random variable. It represents the average amount of information provided by source X per symbol (no matter which symbol is sent). The greater the entropy of a random variable, the greater its uncertainty and the less likely its value can be estimated correctly. The more uncertain the random variable, the more information is needed to determine its value.
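If $X$ is a discrete random variable with value space $R$ and probability distribution $p(x) = P(X = x)$, $x \in R$, its entropy is defined as in Equation (5):

$$ H(X) = -\sum_{x \in R} p(x) \log_2 p(x) \tag{5} $$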
The probability distribution that maximizes the entropy most faithfully reflects the distribution of the event, because entropy measures the uncertainty of the random variable. When the entropy is at its maximum, the random variable is most uncertain and its behavior is hardest to predict accurately. In other words, given only partial knowledge, the most reasonable inference about the unknown distribution is the one that is consistent with what is known and is otherwise as uncertain as possible.
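The relation between entropy and uncertainty can be checked numerically with a short sketch. It assumes a discrete distribution given as a list of probabilities and a base-2 logarithm; the function name and the example distributions are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over four outcomes
skewed = [0.7, 0.1, 0.1, 0.1]       # one outcome is much more likely
```

Over four outcomes the uniform distribution reaches the maximum of 2 bits, while the skewed distribution has lower entropy because its value is easier to guess.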
The Euclidean distance is the ordinary straight-line distance between two points in an n-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In two-dimensional and three-dimensional space, it is the actual distance between two points.
If $x = (x_1, x_2, \cdots, x_n)$ and $y = (y_1, y_2, \cdots, y_n)$ are two points in the n-dimensional space, the Euclidean distance between them is given by Equation (6):

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{6} $$
To express similarity quantitatively with the Euclidean distance, the result is normalized to [0, 1] using the conversion in formula (7). The smaller the Euclidean distance, the higher the similarity and the closer the result is to 1. The larger the Euclidean distance, the lower the similarity and the closer the result is to 0.
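A minimal sketch of the distance-to-similarity conversion is given below. The mapping 1 / (1 + d) is an assumption: it is one common choice that has the behaviour described above (1 for identical points, approaching 0 as the distance grows), and it is not confirmed to be the exact form of formula (7).

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def euclidean_similarity(x, y):
    """Map the distance into (0, 1]; assumed form 1 / (1 + d), not necessarily formula (7)."""
    return 1.0 / (1.0 + euclidean_distance(x, y))
```

For example, `euclidean_distance([0, 0], [3, 4])` is 5, so the assumed similarity is 1/6, while two identical vectors have similarity 1.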
Manhattan distance is a geometric term used in metric spaces to denote the sum of the absolute differences between the coordinates of two points in a standard coordinate system. Like the Euclidean distance, the Manhattan distance is used to measure distances in multi-dimensional data space. Compared with the Euclidean distance, the Manhattan distance is cheaper to compute, so performance is higher.
The Manhattan distance between $x = (x_1, x_2, \cdots, x_n)$ and $y = (y_1, y_2, \cdots, y_n)$ is given by Equation (8). The smaller the distance value, the higher the similarity.

$$ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{8} $$
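Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them. For vectors $x = (x_1, x_2, \cdots, x_n)$ and $y = (y_1, y_2, \cdots, y_n)$, the cosine similarity is calculated as shown in Equation (9):

$$ \cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{9} $$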
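The Manhattan distance and cosine similarity can be sketched in a few lines. The function names are hypothetical, and returning 0 for a zero vector is an assumption made here to avoid division by zero.

```python
import math

def manhattan_distance(x, y):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_x * norm_y)
```

Cosine similarity depends only on direction, so `[1, 2]` and `[2, 4]` have similarity 1 although their Manhattan distance is 3.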
WordNet is an English dictionary based on cognitive linguistics, designed by psychologists, linguists and computer engineers at Princeton University. Rather than simply arranging words in alphabetical order, it organizes them into a “network of words” according to their meanings.
WordNet is a broad-coverage semantic network of English vocabulary. Nouns, verbs, adjectives, and adverbs are each organized into a network of synonym sets, each of which represents a basic semantic concept. These sets are connected by various relations, and a polysemous word appears in the synonym set of each of its meanings. In WordNet, there is no connection between the networks of the four parts of speech. WordNet’s noun network was the first to be developed, and as a result most scholars’ work is limited to the noun network. The backbone of the noun network is the hierarchy of hypernym/hyponym (upper/lower) relations, which accounts for nearly 80% of the relations. The top level of the hierarchy consists of 11 abstract concepts called Unique Beginners, such as Entity (“living or inanimate concrete existence”) and Psychological Feature (“mental features of living organisms”). The maximum depth of the noun hierarchy is 16 nodes.
Named entity recognition
Named Entity Recognition (NER), also known as “name identification”, refers to identifying entities with specific meaning in text, including person names, place names, institution names, proper nouns, and so on. Because named entities are basic information-bearing units of natural language, named entity recognition belongs to the basic research of text information processing and is an indispensable component of natural language processing technologies such as information extraction, information retrieval, machine translation, and question answering.
Named entities are the subject of named entity recognition research. Named entity recognition mainly focuses on three categories (entity class, time class, number class), that is, seven subcategories (person name, place name, institution name, time, date, currency, percentage). Among them, the named entity of the numeric class and the time class can obtain better effects by rule matching. Therefore, the current research body of named entity recognition mainly focuses on person names, place names, and institution names, and extends from these simple entities to other fields, such as movie names, book names, product names, and protein names.
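The rule matching mentioned above for numeric and time entities can be illustrated with a small sketch. The regular expressions, labels and example sentence are hypothetical and deliberately simplistic; a practical system would need far broader rules.

```python
import re

# Hypothetical, deliberately narrow patterns for three of the subcategories.
PATTERNS = {
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
    "MONEY": re.compile(r"[$€£]\d+(?:,\d{3})*(?:\.\d+)?"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def tag_numeric_entities(text):
    """Return (matched_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(value, label) for _, value, label in sorted(found)]

sentence = "Revenue rose 12.5% to $1,200.50 on 2019-03-01."
entities = tag_numeric_entities(sentence)
```

Such patterns work because these entity types follow regular surface forms, whereas person, place and institution names do not, which is why research concentrates on the latter.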
Similarity calculation regression
Machine learning is an important branch of artificial intelligence. It contains the results of disciplines such as probability statistics, computational complexity theory, information theory, and neurobiology. The main method of machine learning is to obtain a certain relationship between input and output by training the known samples so that the unknown input can be predicted. Machine learning algorithms have been widely used in the field of natural language processing, and the text similarity calculations studied in this paper can be regarded as regression problems in machine learning.
There are many machine learning regression algorithms used in natural language processing. The regression models used in this experiment are the Support Vector Regression (SVR) model and the Long Short-Term Memory Over Tree Structures (Tree-LSTM) model. The principles of the two regression models are described below.
Support vector regression
Given training samples $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, the goal is to learn a regression model of the form of Equation (10) such that $f(x)$ is as close to $y$ as possible, where $w$ and $b$ are the model parameters to be determined:

$$ f(x) = w^{T} x + b \tag{10} $$

For a sample $(x, y)$, a general regression model computes the loss from the difference between $f(x)$ and the true $y$; the loss is 0 only if $f(x)$ equals $y$ exactly. The SVR model instead tolerates a deviation of at most $\epsilon$ between $f(x)$ and $y$: a loss is incurred only when the absolute difference between $f(x)$ and $y$ exceeds $\epsilon$. Equivalently, the model builds a band of width $2\epsilon$ centered on $f(x)$, and a training sample that falls inside the band is regarded as correctly predicted, as shown in Fig. 1.

Fig. 1. Schematic diagram of support vector regression. The area between the dashed lines is the ε-band; samples that fall inside it incur no loss.
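The SVR problem can be expressed as Equation (11):

$$ \min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \ell_{\epsilon}\big(f(x_i) - y_i\big) \tag{11} $$

In the formula, $C$ is a regularization constant and $\ell_{\epsilon}$ is the $\epsilon$-insensitive loss function shown in Fig. 2, which equals 0 when $|z| \le \epsilon$ and $|z| - \epsilon$ otherwise.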

Fig. 2. Schematic diagram of the ε-insensitive loss function.
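The ε-insensitive loss can be written as a one-line function. This is a minimal sketch; the function name and the default value of `epsilon` are assumptions chosen for illustration.

```python
def epsilon_insensitive_loss(prediction, target, epsilon=0.1):
    """Zero inside the epsilon band, linear in the excess deviation outside it."""
    return max(0.0, abs(prediction - target) - epsilon)
```

A prediction of 1.05 for a target of 1.0 lies inside the band and costs nothing, whereas a prediction of 1.5 is charged only for the 0.4 by which it exceeds the band.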
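Equation (11) can be rewritten as Equation (12) by introducing the slack variables $\xi_i$ and $\hat{\xi}_i$:

$$ \min_{w,b,\xi_i,\hat{\xi}_i} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \hat{\xi}_i) \quad \text{s.t.} \quad f(x_i) - y_i \le \epsilon + \xi_i, \;\; y_i - f(x_i) \le \epsilon + \hat{\xi}_i, \;\; \xi_i \ge 0, \;\; \hat{\xi}_i \ge 0, \;\; i = 1, \ldots, m \tag{12} $$

By introducing the Lagrangian multipliers $\mu_i \ge 0$, $\hat{\mu}_i \ge 0$, $\alpha_i \ge 0$ and $\hat{\alpha}_i \ge 0$, the Lagrangian function of Equation (12) is obtained as Equation (13):

$$ L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \hat{\xi}_i) - \sum_{i=1}^{m} \mu_i \xi_i - \sum_{i=1}^{m} \hat{\mu}_i \hat{\xi}_i + \sum_{i=1}^{m} \alpha_i \big(f(x_i) - y_i - \epsilon - \xi_i\big) + \sum_{i=1}^{m} \hat{\alpha}_i \big(y_i - f(x_i) - \epsilon - \hat{\xi}_i\big) \tag{13} $$

By substituting Equation (10) into Equation (13) and setting the partial derivatives of $L$ with respect to $w$, $b$, $\xi_i$ and $\hat{\xi}_i$ to zero, Equations (14)–(17) are obtained:

$$ w = \sum_{i=1}^{m} (\hat{\alpha}_i - \alpha_i)\, x_i \tag{14} $$

$$ 0 = \sum_{i=1}^{m} (\hat{\alpha}_i - \alpha_i) \tag{15} $$

$$ C = \alpha_i + \mu_i \tag{16} $$

$$ C = \hat{\alpha}_i + \hat{\mu}_i \tag{17} $$

Substituting Equations (14)–(17) into Equation (13) gives the dual problem of SVR, Equation (18):

$$ \max_{\alpha,\hat{\alpha}} \; \sum_{i=1}^{m} \big[ y_i(\hat{\alpha}_i - \alpha_i) - \epsilon(\hat{\alpha}_i + \alpha_i) \big] - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\, x_i^{T} x_j \quad \text{s.t.} \quad \sum_{i=1}^{m} (\hat{\alpha}_i - \alpha_i) = 0, \;\; 0 \le \alpha_i, \hat{\alpha}_i \le C \tag{18} $$

The KKT (Karush-Kuhn-Tucker) conditions must be satisfied in the above process; they are given in Equation (19):

$$ \begin{aligned} &\alpha_i \big(f(x_i) - y_i - \epsilon - \xi_i\big) = 0, \quad \hat{\alpha}_i \big(y_i - f(x_i) - \epsilon - \hat{\xi}_i\big) = 0, \\ &\alpha_i \hat{\alpha}_i = 0, \quad \xi_i \hat{\xi}_i = 0, \\ &(C - \alpha_i)\, \xi_i = 0, \quad (C - \hat{\alpha}_i)\, \hat{\xi}_i = 0 \end{aligned} \tag{19} $$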
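It can be seen that $\alpha_i$ can take a non-zero value only if $f(x_i) - y_i - \epsilon - \xi_i = 0$, and that $\hat{\alpha}_i$, the multiplier of the opposite constraint, can take a non-zero value only if $y_i - f(x_i) - \epsilon - \hat{\xi}_i = 0$. In other words, $\alpha_i$ and $\hat{\alpha}_i$ can be non-zero only when the sample $(x_i, y_i)$ does not fall inside the $\epsilon$-band. Moreover, the two constraints cannot hold at the same time, so at least one of $\alpha_i$ and $\hat{\alpha}_i$ is zero.
Substituting Equation (14) into Equation (10), the solution of SVR is as shown in Equation (20):

$$ f(x) = \sum_{i=1}^{m} (\hat{\alpha}_i - \alpha_i)\, x_i^{T} x + b \tag{20} $$

The samples that make $(\hat{\alpha}_i - \alpha_i) \ne 0$ are the support vectors of SVR, and they must fall outside the $\epsilon$-band.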
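As can be seen from the KKT conditions, $(C - \alpha_i)\,\xi_i = 0$ and $\alpha_i \big(f(x_i) - y_i - \epsilon - \xi_i\big) = 0$ hold for every sample $(x_i, y_i)$. Therefore, after obtaining $\alpha_i$, if $0 < \alpha_i < C$, then $\xi_i = 0$ and Equation (21) can be obtained, where $\alpha_j$ and $\hat{\alpha}_j$ are the Lagrangian multipliers of the two band constraints for sample $j$:

$$ b = y_i + \epsilon - \sum_{j=1}^{m} (\hat{\alpha}_j - \alpha_j)\, x_j^{T} x_i \tag{21} $$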
After obtaining $\alpha_i$ by solving Equation (18), any sample satisfying $0 < \alpha_i < C$ can be selected to compute $b$ from Equation (21). In practice, a more robust approach is to compute $b$ from several samples satisfying $0 < \alpha_i < C$ and average the results.
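When a feature mapping $\phi(\cdot)$ is taken into account, Equation (14) can be expressed as Equation (22), where $\alpha_i$ and $\hat{\alpha}_i$ are the Lagrangian multipliers obtained from Equation (18):

$$ w = \sum_{i=1}^{m} (\hat{\alpha}_i - \alpha_i)\, \phi(x_i) \tag{22} $$

Then, SVR can be expressed as Equation (23):

$$ f(x) = \sum_{i=1}^{m} (\hat{\alpha}_i - \alpha_i)\, \kappa(x, x_i) + b \tag{23} $$

In the formula, $\kappa(x_i, x_j) = \phi(x_i)^{T} \phi(x_j)$ is a kernel function. The kernel function makes it possible to extend a linear learner to a nonlinear learner.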
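To make the kernel form of the SVR predictor concrete, the sketch below evaluates a prediction as a weighted sum of kernel values plus a bias. The RBF kernel, the value of `gamma`, the support vectors, the coefficients (each standing for the difference of the two multipliers of a sample) and the bias are all hypothetical; in practice they would come from solving the dual problem.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel; one common kernel choice, assumed here for illustration."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svr_predict(x, support_vectors, coefficients, bias, kernel=rbf_kernel):
    """Weighted sum of kernel values between x and each support vector, plus the bias."""
    return bias + sum(
        c * kernel(x, sv) for sv, c in zip(support_vectors, coefficients)
    )

# Hypothetical trained model with two support vectors; the coefficients sum to zero.
support_vectors = [[0.0], [1.0]]
coefficients = [0.5, -0.5]
bias = 0.2
```

At the midpoint `[0.5]` the two kernel values are equal and cancel, so the prediction is just the bias; closer to `[0.0]` the first support vector dominates and the prediction rises.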
In recent years, long short-term memory (LSTM) has become popular again. Its effectiveness has been proven in many tasks, such as speech recognition, machine translation, and image-to-text tasks. Recursion is a basic process related to many problems, and recursive processes and hierarchies are very common in different models. For example, the semantics of a sentence can generally be expressed as a hierarchical structure, and image understanding benefits from a structure that can be modeled recursively; such structures have achieved good performance in image processing.
Zhu extends LSTM to a tree structure, called Tree-LSTM. When learning its memory cells, Tree-LSTM can draw on the historical memory of multiple child units and further descendants. Compared with the recurrent neural network, Tree-LSTM can avoid vanishing gradients, so a model of long-distance interactions can be built over the tree structure. Tree-LSTM combines the advantages of recursive neural networks and recurrent neural networks. Since the binary tree structure is simple and intuitive, this section uses the binary-tree LSTM to explain Tree-LSTM.
The LSTM is extended to Tree-LSTM, in which each memory unit can reflect the historical state of multiple child units and multiple descendant units. As shown in Fig. 3, the root of the tree can obtain information through long-distance interaction with the gray and light blue leaf nodes. In Fig. 3, a small circle (“.”) or a short line (“-”) in front of an arrow indicates that information is passed or blocked, respectively. Fig. 3 uses a binary tree as an example and relies on gate vectors. The elements of a gate vector are usually computed with the logistic sigmoid function to ensure that their values lie in the range [0, 1]. By learning the gate vectors, Tree-LSTM provides a way to model long-distance interactions over the input structure.

Fig. 3. Schematic diagram of Tree-LSTM.
Each node in Fig. 3 contains a Tree-LSTM memory unit. Figure 4 shows an internal schematic of the Tree-LSTM memory unit. Each memory unit includes an input gate and an output gate, and the number of forget gates depends on the number of child nodes. The binary tree structure used in this section has two child nodes per node, so each unit has two forget gates. The hidden state vectors of the two child nodes serve as the inputs of the current memory unit.

Fig. 4. Internal schematic diagram of Tree-LSTM.
In Equations (24)–(29), each weight matrix W is distinct. Unlike the general LSTM, the Tree-LSTM memory unit needs to consider the cell vectors of both the left and the right child nodes.
In the formula, σ represents the logistic sigmoid function, which keeps the elements of the gate vectors within the range [0, 1].
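The following sketch shows one way a binary Tree-LSTM memory unit can combine its two children. It is a simplified illustration and not a reproduction of Equations (24)–(29): the gates here depend only on the children's hidden vectors, the children's cell vectors enter only through the two forget gates, and the parameter layout, random initialization and leaf states are all assumptions.

```python
import math
import random

def sigmoid(values):
    """Element-wise logistic sigmoid, keeping gate values in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def affine(w_left, w_right, bias, h_left, h_right):
    """Compute w_left @ h_left + w_right @ h_right + bias using plain lists."""
    return [
        sum(wl * hl for wl, hl in zip(row_l, h_left))
        + sum(wr * hr for wr, hr in zip(row_r, h_right))
        + b
        for row_l, row_r, b in zip(w_left, w_right, bias)
    ]

def init_params(size, seed=0):
    """Separate (left matrix, right matrix, bias) for each gate and the candidate."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(size)]

    names = ("i", "f_left", "f_right", "o", "u")
    return {name: (matrix(), matrix(), [0.0] * size) for name in names}

def tree_lstm_cell(params, left, right):
    """Combine two children, each an (h, c) pair, into the parent's (h, c)."""
    (h_l, c_l), (h_r, c_r) = left, right
    i = sigmoid(affine(*params["i"], h_l, h_r))          # input gate
    f_l = sigmoid(affine(*params["f_left"], h_l, h_r))   # forget gate for the left child
    f_r = sigmoid(affine(*params["f_right"], h_l, h_r))  # forget gate for the right child
    o = sigmoid(affine(*params["o"], h_l, h_r))          # output gate
    u = [math.tanh(v) for v in affine(*params["u"], h_l, h_r)]  # candidate update
    # The parent's cell mixes both children's cells through their own forget gates.
    c = [fl * cl + fr * cr + ig * ug
         for fl, cl, fr, cr, ig, ug in zip(f_l, c_l, f_r, c_r, i, u)]
    h = [og * math.tanh(cv) for og, cv in zip(o, c)]
    return h, c

# Hypothetical three-leaf tree ((a, b), c) with made-up leaf states.
params = init_params(size=3)
leaf_a = ([0.1, -0.2, 0.3], [0.0, 0.0, 0.0])
leaf_b = ([0.4, 0.0, -0.1], [0.0, 0.0, 0.0])
leaf_c = ([-0.3, 0.2, 0.1], [0.0, 0.0, 0.0])
inner = tree_lstm_cell(params, leaf_a, leaf_b)
root_h, root_c = tree_lstm_cell(params, inner, leaf_c)
```

Because each child has its own forget gate, the unit can keep the memory of one subtree while discarding the other, which is what allows information from distant leaves to reach the root.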
During training, the gradient of the objective function with respect to each parameter can be computed efficiently by backpropagation. Unlike in an ordinary LSTM, error propagation in Tree-LSTM must distinguish between the left and right children; for a topology with more than two children, more children must be distinguished. Equations (30)–(36) are the backpropagation formulas of Tree-LSTM.
For each memory unit, we assume that an error signal is passed to its hidden vector; the derivatives of the gate vectors are then obtained from it by the chain rule.
In the formula, σ′(x) denotes the element-wise derivative of the logistic function evaluated on the vector x.
Once the derivative of each gate vector has been calculated, the derivatives of the weight matrices in Equation (36) can also be calculated.
This section introduces the performance evaluation indicator commonly used in text similarity calculation, namely the Pearson correlation coefficient. In statistics, the Pearson correlation coefficient measures the linear correlation between two variables X and Y, and its value lies between –1 and 1. The strength of the correlation is usually judged from the absolute value of the coefficient: 0.8–1.0 indicates a very strong correlation, 0.6–0.8 a strong correlation, 0.4–0.6 a moderate correlation, 0.2–0.4 a weak correlation, and 0.0–0.2 a very weak correlation or no correlation.
As shown in Equation (37), the Pearson correlation coefficient between two variables is defined as their covariance divided by the product of their standard deviations:

$$ \rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\{[X - E(X)][Y - E(Y)]\}}{\sigma_X \sigma_Y} \tag{37} $$

In the formula, $E$ denotes the mathematical expectation (mean), $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, $\operatorname{cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$ is the covariance of the random variables $X$ and $Y$, and $\rho_{X,Y}$ is the Pearson correlation coefficient.
Since the benchmark system used in this paper evaluates performance with the Pearson correlation coefficient, this paper likewise computes the Pearson correlation coefficient between the calculated text similarity scores and the manually assigned similarity scores and compares the result with the benchmark system.
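A minimal sketch of the evaluation step is shown below: it computes the Pearson correlation coefficient between predicted scores and manual scores. The function name is hypothetical, and any scores passed to it in the usage note are made-up values, not results from this paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the variances cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

For instance, `pearson([1, 2, 3, 4], [2, 4, 6, 8])` is 1 because the second list is an exact linear function of the first, while reversing one list gives –1. The function assumes neither list is constant, since the coefficient is undefined when a standard deviation is zero.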
LTP (Language Technology Platform), a platform developed in China, provides modules for word segmentation, part-of-speech tagging, partial named entity recognition, syntactic analysis and semantic analysis (word sense disambiguation and semantic role tagging). LTP integrates these modules smoothly, and data is transferred between modules using XML. In addition, it provides DLL and Web service APIs, visualization tools, and some related corpora. The construction diagram of LTP is shown in Fig. 5.

Fig. 5. LTP construction diagram.
From bottom to top, LTP consists of six components: ① corpus, ② processing modules, ③ XML-based internal data representation and processing, ④ DLL API, ⑤ Web services, and ⑥ visualization tools.
Figure 6 shows the model structure, which consists of three layers: the input layer, the projection layer, and the output layer.

Fig. 6. Model structure diagram.
Output layer: The output layer corresponds to a binary tree, and it uses the words that appear in the corpus as leaf nodes.
The Huffman tree is constructed by using the number of occurrences of each word in the corpus as its weight. The Huffman tree has N leaf nodes, one for each of the N words in the dictionary, and N-1 non-leaf nodes, as shown in the figure below.
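The construction of a Huffman tree from word counts can be sketched with a priority queue. This is a generic illustration, not the implementation used in the experiments; the word counts are made up, and the assignment of "0" to the left branch and "1" to the right branch is an arbitrary convention.

```python
import heapq
from itertools import count

def huffman_codes(word_counts):
    """Return ({word: binary code}, number of internal nodes) for the given counts."""
    tiebreak = count()  # unique index so heap entries never compare the nodes themselves
    heap = [(n, next(tiebreak), word) for word, n in word_counts.items()]
    heapq.heapify(heap)
    internal_nodes = 0
    while len(heap) > 1:
        # Repeatedly merge the two least frequent nodes into a new internal node.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))
        internal_nodes += 1
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left subtree, right subtree)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf node: a word
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes, internal_nodes

# Hypothetical word counts.
counts = {"the": 50, "similarity": 20, "text": 15, "tree": 10, "lstm": 5}
codes, internal_nodes = huffman_codes(counts)
```

With these counts the most frequent word receives a one-bit code and the rarest words receive four-bit codes, and the five leaves are joined by four internal nodes, in line with the N and N-1 relationship stated above.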
Two training architectures are available: CBOW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model). CBOW predicts the probability of the current word from the context words around it in the text. Skip-gram uses the current word to predict the surrounding words, that is, it answers questions such as “given a word, which words are most likely to appear around it?”
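The difference between the two architectures is easiest to see in the training examples each one consumes. The sketch below only generates those examples and does not train a model; the function name, the window size and the token list are assumptions.

```python
def training_pairs(tokens, window=2):
    """Build CBOW examples (context words -> target) and Skip-gram pairs (target -> context word)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the current position.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

tokens = ["semantic", "similarity", "of", "english", "text"]
cbow, skipgram = training_pairs(tokens, window=1)
```

With a window of 1, the CBOW example for "similarity" is the context `["semantic", "of"]` predicting that word, whereas Skip-gram turns the same position into two separate pairs, one for each neighbour.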
The experiment runs on Windows with PyCharm as the development environment. The experimental data consist of all the microblogs of 400 users, crawled from the Sina Weibo website by a crawler program written in Python. The users are divided into four categories (military, fitness, finance, and tourism), with 100 users in each category. From each of the four categories, 80 users are selected, giving 320 users in total as the category identification set, and the remaining 80 users form the set whose category is to be determined.
Fig. 8 compares the accuracy of the traditional clustering method and the proposed method in identifying the category of all users in the four categories. According to Fig. 8, when determining the user category, the part-of-speech-based text similarity algorithm proposed in this study achieves higher accuracy than the traditional clustering algorithm. Accuracy is highest for the military category, because its texts contain specialized military terms that differ semantically from the characteristic words of the other categories. Figure 9 shows the common interest network for all users in the four categories, and Fig. 10 shows the common interest network for military category users taken from Fig. 9. The nodes M, E, F, and T represent the four categories of military, fitness, finance, and tourism, respectively; the numbered nodes represent user numbers, and a connecting line between two nodes indicates that the users have common interests. For user interest, this article considers the overall situation of each user and does not distinguish the user’s interests in different time periods. Moreover, the constructed interest network can be used for friend recommendation.

Fig. 7. Technical principle of CBOW and Skip-gram.

Fig. 8. Accuracy comparison.

Fig. 9. Common interest network for all pending category users.

Fig. 10. Common interest network for military pending category users.
(1) This paper proposes a method for calculating the semantic similarity of English text based on structured representation. Most existing text semantic similarity calculation methods use a large number of planar similarity features to represent the similarity of a pair of texts, and this representation is weak. This paper therefore uses PST and PDT to represent the syntactic, semantic and other information of the text, and improves the performance of text semantic similarity calculation by combining these two structural features, which are suited to text similarity calculation, with planar features.
(2) This paper proposes a method for calculating the semantic similarity of English text based on Tree-LSTM. To address the poor performance of previous text semantic similarity calculation methods on long text sets, this method uses the Tree-LSTM model to calculate the similarity. PDT and PST are modified to obtain the NPST and NPDT structural features suitable for the Tree-LSTM model. In this method, planar features are no longer used; NPST and NPDT are combined with the appropriate Tree-LSTM model for text similarity calculation, which greatly improves performance on long text sets.
(3) This paper proposes a system based on semantic similarity calculation of English texts. In the face of massive English text data, relationship extraction can not only improve the accuracy of text classification but also effectively promote social network construction. With the interests of network users in view, this study proposes a user similarity calculation method based on part of speech to construct an interest network. Experiments show that this method provides a new idea for interest network extraction.
Conclusion
Currently, most text clustering algorithms are based on the Vector Space Model (VSM). This text representation is very simple, but it raises the problem of high dimensional sparsity. Moreover, it does not solve the two natural language problems specific to text data: synonyms and polysemous words. All of these problems greatly interfere with the efficiency and accuracy of the text clustering algorithm and degrade the performance of text clustering. In order to avoid these problems, this paper adopts a new idea to cluster texts, that is, using semantic similarity as a measure of similarity between texts. This paper proposes a similarity calculation method that combines planar features with structured features and uses support vector regression models. Then, this paper proposes a similarity calculation method that combines the structural features with the Tree-LSTM model. The experiments indicate that the proposed algorithm determines user categories more accurately than the traditional clustering method.
