Simulation of English part-of-speech recognition based on machine learning prediction algorithm

Abstract

English part-of-speech intelligent recognition is the scientific and technological basis for the development of intelligent speech systems. The difficulty in current English speech recognition systems lies in the recognition of English parts of speech. To improve the effect of English part-of-speech recognition, this study builds language rules and morphological models of English word forms based on machine learning algorithms. Moreover, this study proposes a stemming algorithm and a syllable division algorithm based on English characteristic rules. By studying basic phrases in English, this study analyzes the compositional structure of phrases and determines the basic structure and composition rules of English noun, verb, and adjective phrases. In addition, this research studies a rule-based recognition algorithm for basic English phrases and analyzes ambiguity resolution for basic phrases. Finally, this study designs a control experiment to analyze the performance of the proposed model and to validate the classification algorithm. The results show that the proposed algorithm has a certain practical effect.

Keywords

Machine learning; prediction algorithm; English; part-of-speech recognition; algorithm improvement

1 Introduction

Language is indispensable for people living in this world. Whether for recording knowledge or transmitting information, it is an essential tool, and it plays a vital role in the development and progress of society. Today’s society has entered the information age, and the information around people is growing at a very fast rate. To use this information effectively, it is necessary to continuously study technologies and methods that can process it. Because language is irregular and ambiguous, a corpus must be transformed into a regular representation before a computer can analyze it. The result of this transformation must be processable by the computer while retaining the features of the original text. To regularize the corpus, certain features are selected to represent it, so that the representation both preserves the characteristics of the original corpus and is convenient for computer processing.

The goal of natural language processing [1] is to be able to better parse human language, so that computers can understand language like humans, which includes some key technologies and application systems. In today’s society, the Internet is rapidly rising, people are more and more involved in online community exchanges, and the opportunities for people to express their opinions on the Internet are also rapidly increasing. As the volume of text increases, there is an urgent need for a method to quickly analyze the text to obtain useful information. If we still rely on manual analysis of such a large amount of text, it is very unrealistic. Therefore, we need to rely on computers to analyze these texts. This research field is text analysis or text mining [2].

Language is a tool for human society to exchange ideas and connect with each other, and each language consists of speech, vocabulary and grammar. Language and culture constitute the life of a nation. With the rapid development of science and technology in the world today, all countries and nations have entered the information age. As the interdisciplinary subject of computer science technology and linguistics, natural language information processing focuses on the use of computer technology to learn, understand and generate human language. With the advent of artificial intelligence, big data, robots, etc., natural language information processing has become one of the core research issues of information processing for every national language.

The English part-of-speech analysis studied in this article is, as in any other language, a basic and key issue in English information processing. Its results affect subsequent English natural language processing tasks and, in turn, the performance and development of application systems such as machine translation, text summarization, and social computing. In view of the above analysis, this study builds an English part-of-speech recognition model based on an improved algorithm to analyze English parts of speech.

2 Related work

Automatic lexical analysis research for English, Chinese and Turkish has matured. There are various morphological analysis techniques for automatic lexical analysis, and different languages place different emphasis on the content of lexical analysis. For example, English focuses on part-of-speech tagging, and Chinese focuses on word segmentation. Lexical analysis research generally adopts three kinds of methods: rule-based approaches, statistics-based approaches, and hybrid approaches that combine statistics and rules. In addition, there are deep learning methods based on neural networks. The rule-based method uses the contextual grammatical rules of a word in the text to perform lexical analysis, expressed in a specific markup language. The literature [3] proposed a series of lexical analysis methods based on linguistic rules and implemented rule-based POS tagging. This method uses an initial-state tagger to tag unlabeled corpus text, learns transformations by comparing the result with the labeled corpus, and generates an ordered set of transformation rules. Statistics-based methods require a corpus of a certain size to train the statistical models. The CLAWS (Constituent-Likelihood Automatic Word-tagging System) part-of-speech tagging system constructed in the literature [4] implements part-of-speech tagging using n-grams and a first-order Markov transition matrix. With the hidden Markov model being applied to speech recognition, statistical language models have been applied at all levels of natural language processing. The commonly used statistical language models are the n-gram model, the hidden Markov model (HMM), the maximum entropy (MaxEnt) model, conditional random fields (CRFs), neural networks (NN), the perceptron, etc.
These statistical models have been successfully applied to research problems such as part-of-speech tagging, for example hidden Markov model tagging, maximum entropy tagging, and the lexical tagger of [5]. Morphological analysis (MA) and part-of-speech (POS) research have achieved good results in information processing for different languages. The English stemmer proposed in the literature [6] has been applied to many languages. For the stemming problem in morphological analysis, most researchers use methods based on finite state automata: the morphological changes of the language are summarized as language rules, and an automaton is then constructed to analyze the morphology. After the literature [7] used finite state machines to describe Finnish morphological rules, many languages have used finite state machines for morphological analysis and achieved good results. In response to the part-of-speech tagging problem, the TAGGIT part-of-speech tagging system developed by Greene and Rubin in the United States constructed part-of-speech disambiguation rules using context and collocation words. For the phrase recognition problem, the literature [8] addressed English base noun phrase (BaseNP) recognition. Its training data is a part-of-speech-tagged corpus in which the boundaries of noun phrases have been marked beforehand. From this data, the probability of each boundary state between any pair of part-of-speech tags is estimated, and these probabilities are then used to mark boundaries between adjacent words in a sentence. The chunk description system proposed in the literature [9] has become a strategy for syntactic analysis. It uses a “divide and conquer” method to reduce the difficulty of complete syntactic analysis. Literature [10] reports on the chunking analysis for English conducted at the international conference CoNLL-2000.
The main methods of natural language processing are models based on rules and on statistics. Although the artificial neural network is one of the statistical methods, it was for a long time undervalued in NLP. The literature [11] proposed deep learning (DL), which soon became a new technology and method for machine learning (ML). At present, the rapid growth of large-scale tagged corpora for various human languages, together with new deep learning techniques and improved hardware computing capabilities, has accelerated the development of natural language processing research.

For the research of Chinese word segmentation, the rule-based research includes: the n-shortest path method proposed by the literature [12] and the error-driven mechanism adopted by the literature [13]. Moreover, the statistical-based research includes: the word-based Chinese word segmentation research proposed by the literature [14] and the conditional random field word segmentation research proposed by the literature [15], which all need to calculate a large number of features. In addition, the literature [16] proposed to use statistics and rules for hierarchical HMM research, and the literature [17] proposes to use language-like model research. For the problem of Chinese lexical analysis, the literature [18] proposed a multi-knowledge source mixed language model lexical analysis system (ELUS), and the literature [19] proposed a cascaded HMM lexical analysis system. Moreover, they all participated in the international Chinese word segmentation evaluation and achieved satisfactory results. For the part-of-speech tagging problem, the literature [20] adopted improved error-driven learning to deal with it, and the literature [21] constructed a part-of-speech recognition rule base to implement a tagging system.

With the rapid development of technology, today’s world has taken science and technology as an important indicator to judge national strength. Science and technology have always been the driving force for the advancement of history, and the progress of science is not just referring to the progress in certain areas but expanding to all fields to make progress together. What is emphasized today is the progress of multidisciplinary integration and interdisciplinarity. In the field of smart instrument design, it is not just relying on improving the instrument to improve the performance of the instrument, but to greatly improve the performance of the smart instrument by introducing knowledge such as information technology and natural science into the smart instrument.

3 English n-gram language model and its application

Language models (LMs) have an important position and many applications in statistical natural language processing, such as lexical language models for speech recognition. In human language, there is an internal connection between the words of a sentence, and the next word can be predicted by statistical calculation of probability. The n-gram model assumes that the occurrence of the nth word depends only on the previous n − 1 words and on no other words. Therefore, the n-gram model can be used to predict the next word from the previous n − 1 words [22–24].

For a word string W = w₁, w₂, w₃, ⋯ , w_n, the occurrence of word w_i (1 ⩽ i ⩽ n) is considered to depend on the entire preceding word string w₁w₂ ⋯ w_i-1. Then, the probability of the occurrence of word string W is calculated as: $p(W) = p(w_{1} w_{2} w_{3} \dots w_{n}) = p(w_{1})\, p(w_{2} | w_{1})\, p(w_{3} | w_{1} w_{2}) \dots p(w_{n} | w_{1} \dots w_{n-1}) = \prod_{i=1}^{n} p(w_{i} | w_{1} w_{2} \dots w_{i-1})$ (1)

Among them, p (w₁) is the a priori probability of w₁; p(w_n|w₁ ⋯ w_n-1) is the conditional probability, which is the probability of obtaining the word w_n under the condition that the information w₁, w₂, ⋯ , w_n-1 is known.

If it is assumed that w_i depends only on the n − 1 words that immediately precede it, the formula becomes: $p(W) = p(w_{1} w_{2} \dots w_{n}) = \prod_{i=1}^{n} p(w_{i} | w_{i-n+1} \dots w_{i-2} w_{i-1})$ (2)

The n-gram model captures the dependence of each linguistic unit in human language on the units that precede it, and thereby predicts the next word.

Building the n-gram model requires estimating its parameters, which are obtained from counts over a training corpus followed by normalization. Maximum likelihood estimation (MLE), a parameter estimation method used in statistical language models, was anticipated by the German mathematician C. F. Gauss but is generally attributed to the British statistician R. A. Fisher.

In N observations, the random variable X takes the values 1, 2, 3, ⋯ , K, and each value k is regarded as an event class. Its number of samples in the training set is N(k) and its probability is p(k). Then the distribution of the counts N(k) can be seen as a multinomial distribution, and the likelihood function is: $p(N(1), \dots, N(K)) = \frac{N!}{\prod_{k=1}^{K} N(k)!} \prod_{k=1}^{K} p(k)^{N(k)}$ (3)

Among them: $\sum_{k=1}^{K} N(k) = N, \quad \sum_{k=1}^{K} p(k) = 1$.

The MLE estimate of the parameters p(k) of the above multinomial distribution is: $p(k) = \frac{N(k)}{N}$ (4)

For the n - gram model, the estimated parameters of the model obtained by the MLE method are: $p (w_{n} | w_{1}^{n - 1}) = \frac{c (w_{1}^{n - 1} w_{n})}{\sum_{w} c (w_{1}^{n - 1} w)}$ (5)

In the formula, c(·) denotes the number of times a word string occurs in the corpus. Formula (5) gives the estimated n-gram probability: the observed frequency of the whole word sequence w₁ ⋯ w_n divided by the observed frequency of its prefix w₁ ⋯ w_n-1 followed by any word.
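As an illustration of Formula (5), the following minimal sketch estimates an n-gram probability by counting. The toy corpus and the function names `ngram_counts` and `mle_probability` are assumptions made for this example and are not part of the original experiment.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-word sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(tokens, history, word):
    """Estimate p(word | history) as c(history + word) / sum_w c(history + w)."""
    history = tuple(history)
    counts = ngram_counts(tokens, len(history) + 1)
    numerator = counts[history + (word,)]
    # Sum the counts of every n-gram that starts with the same history.
    denominator = sum(c for gram, c in counts.items() if gram[:-1] == history)
    return numerator / denominator if denominator else 0.0

# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat and the cat slept".split()
p_cat_given_the = mle_probability(corpus, ["the"], "cat")   # 2 of the 3 bigrams starting with "the"
p_dog_given_the = mle_probability(corpus, ["the"], "dog")   # unseen bigram, so the MLE is zero
```

The zero estimate for the unseen bigram is exactly the data sparseness problem that smoothing is meant to address.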

Statistical natural language processing relies on a corpus for training, but no corpus can cover all linguistic phenomena in a language. Moreover, many low-frequency words and phrases of the language never appear in the corpus and would therefore be assigned zero probability. In particular, when the training corpus has limited resources, the problem of sparse data arises.

One way to reduce data sparseness is to increase the size of the corpus so that it covers as many language phenomena as possible. The other is smoothing, which assigns non-zero probabilities to zero-probability and low-probability n-grams. Smoothing methods are divided into discounting and backoff (rollback). Discounting reduces the counts of non-zero n-grams and reassigns the freed probability mass to events with zero or very low counts in the training corpus, whereas backoff builds the n-gram estimate from the counts of the (n − 1)-gram.

The basic idea of Witten-Bell discounting is that an instance that appears in testing but not in the training corpus is treated as a new event. Let N be the number of observed tokens, T the number of observed types, and Z the number of types with zero count. The discounted probability of an event with non-zero count is: $p_{i}^{*} = \frac{C_{i}}{N + T}, \quad \text{if } C_{i} > 0$ (6)

Expressed directly as smoothed counts, this is: $C_{i}^{*} = \begin{cases} \frac{T}{Z} \frac{N}{N + T}, & \text{if } C_{i} = 0 \\ C_{i} \frac{N}{N + T}, & \text{if } C_{i} > 0 \end{cases}$ (7)

Conditioned on the preceding word w_x, the total probability mass assigned to the unseen bigrams is: $\sum_{i : C(w_{x} w_{i}) = 0} p^{*}(w_{i} | w_{x}) = \frac{T(w_{x})}{N(w_{x}) + T(w_{x})}$ (8)

This mass is redistributed equally among the Z unseen bigrams with the same preceding word: $Z(w_{x}) = \sum_{i : C(w_{x} w_{i}) = 0} 1$ (9) $p^{*}(w_{i} | w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1}) (N(w_{i-1}) + T(w_{i-1}))}, \quad \text{if } C(w_{i-1} w_{i}) = 0$ (10)

For bigrams with non-zero counts, the same parameter T is used to discount: $p^{*}(w_{i} | w_{x}) = \frac{C(w_{x} w_{i})}{N(w_{x}) + T(w_{x})}, \quad \text{if } C(w_{x} w_{i}) > 0$ (11)
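The bigram form of Witten-Bell discounting can be sketched as follows. This is only an illustration under stated assumptions: the vocabulary is taken to be the set of words in the toy corpus, a uniform fallback is used for a history that never occurs, and the names `witten_bell_bigram` and `prob` are hypothetical.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens):
    """Build a Witten-Bell smoothed bigram estimator p*(word | prev)."""
    vocab = set(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_tokens = Counter()         # N(w): number of bigram tokens starting with w
    history_types = defaultdict(set)   # distinct successors of w; T(w) is the set size
    for (prev, word), count in bigram_counts.items():
        history_tokens[prev] += count
        history_types[prev].add(word)

    def prob(word, prev):
        # `word` is assumed to belong to the vocabulary.
        n = history_tokens[prev]
        t = len(history_types.get(prev, ()))
        if n == 0:
            # Assumption of this sketch: uniform fallback for an unseen history.
            return 1.0 / len(vocab)
        count = bigram_counts[(prev, word)]
        if count > 0:
            return count / (n + t)              # seen bigram, discounted
        z = len(vocab) - t                      # Z(w): successor types never seen after prev
        return t / (z * (n + t)) if z else 0.0  # unseen bigram, equal share of reserved mass

    return prob, vocab

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat and the cat slept".split()
prob, vocab = witten_bell_bigram(tokens)
total_after_the = sum(prob(w, "the") for w in vocab)
```

For the history "the", N = 3, T = 2 and Z = 5, so the seen bigram "the cat" receives 2/5 and each unseen successor receives 2/25; the probabilities over the vocabulary still sum to one.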

The statistical language model reveals the internal rules of human language and can detect errors that do not conform to the rules of normal language, so a better language model can help people find errors in the corpus and reduce the burden of manual proofreading. This experiment performs English error detection based on the bigram language model. Given the preceding word string w₁w₂ ⋯ w_i-1, the model gives the probability that the next word is w_i. The language model is used to check errors in text: word strings are taken from the text to be checked and scored with the n-gram model. Word sequences with a high co-occurrence probability are judged to be correct, and word sequences with a low co-occurrence probability are judged to be wrong.
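The detection idea described above can be sketched in a few lines. The sketch assumes an unsmoothed bigram estimate and a hypothetical cutoff `threshold=0.05`; neither the cutoff nor the toy sentences come from the original experiment.

```python
from collections import Counter

def train_bigram(tokens):
    """Return p(word | prev) estimated as C(prev, word) / C(prev)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    return prob

def flag_suspicious(tokens, prob, threshold=0.05):
    """List (position, prev, word) for every word whose bigram probability is below the cutoff."""
    return [
        (i, prev, word)
        for i, (prev, word) in enumerate(zip(tokens, tokens[1:]), start=1)
        if prob(prev, word) < threshold
    ]

# Toy training and test text, invented for illustration only.
training = "the cat sat on the mat . the dog sat on the mat .".split()
prob = train_bigram(training)
flagged = flag_suspicious("the cat on sat the mat".split(), prob)
```

In the test sentence the swapped words "on sat" produce three low-probability bigrams, which are the ones flagged. In practice a smoothed estimate would be needed so that every unseen but valid bigram is not flagged.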

There has been much research on text error detection. Errors generally fall into two categories: non-word errors and real-word errors. A non-word error means that the word does not exist in the dictionary, while a real-word error means that the word exists in the dictionary but is inappropriate in its context. Automatic proofreading of real-word errors is achieved with statistical language models, using n-gram models of words and parts of speech to calculate co-occurrence probabilities.

The error-checking process in this experiment is as follows: the n-gram models are constructed according to the formulas below and trained on the text corpus, and each text to be checked is then passed through the models in turn. Starting from the individual words of each sentence in the corpus, the following three steps are taken:

(1) Unary (unigram) model of words:

Words are obtained from the input corpus. The number of occurrences C_i of the word w_i in the corpus is counted, and the probability of the word w_i is then calculated, where N is the total number of words in the corpus. The unary grammar formula of words is: $p(w_{i}) = C_{i} / N$ (12)

(2) Binary (bigram) model of words:

Two consecutive words are obtained from the input corpus, and the number of times C_i-1,i that the two words w_i-1w_i occur together in the corpus is counted. The probability of the word w_i given the preceding word w_i-1 is then calculated. The binary grammar formula of words is: $p(w_{i} | w_{i-1}) = \frac{p(w_{i-1} w_{i})}{p(w_{i-1})} = \frac{C_{i-1,i} / N}{C_{i-1} / N} = \frac{C_{i-1,i}}{C_{i-1}}$ (13)

Similarly, the ternary (trigram) grammar formula of words is: $p(w_{i} | w_{i-2} w_{i-1}) = \frac{p(w_{i-2} w_{i-1} w_{i})}{p(w_{i-2} w_{i-1})} = \frac{C_{i-2,i-1,i} / N}{C_{i-2,i-1} / N} = \frac{C_{i-2,i-1,i}}{C_{i-2,i-1}}$ (14)

(3) Binary (bigram) smoothing model: The discount smoothing method is used to construct the 2-gram model. The 2-gram model better reflects the local characteristics of a sentence and is therefore suited to detecting errors in the text. Accordingly, building the model with reference to Algorithm 1 reflects the laws of the language more accurately.

In order to verify the validity of the n-gram-based English error detection system, this section tests it on an English corpus. In this experiment, 150 articles are selected from the British Daily Mail as the training corpus and 10 articles as the test corpus. The correct rate (precision), recall rate and F value are used to evaluate the experimental results: $\text{Correct rate} = \frac{\text{Number of wrong words correctly detected}}{\text{Total number of words flagged as wrong}} \times 100\%$ (15) $\text{Recall rate} = \frac{\text{Number of wrong words correctly detected}}{\text{Number of wrong words in the test text}} \times 100\%$ (16) $\text{F value} = \frac{2 \times \text{Correct rate} \times \text{Recall rate}}{\text{Correct rate} + \text{Recall rate}}$ (17)
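The three indicators can be computed as in the sketch below. It reads the correct rate as precision, that is, the share of flagged words that are true errors; this reading, the function name `detection_metrics`, and the example word positions are assumptions made for illustration and are not data from the experiment.

```python
def detection_metrics(flagged, true_errors):
    """Return (correct rate, recall rate, F value) as fractions between 0 and 1."""
    flagged, true_errors = set(flagged), set(true_errors)
    correctly_detected = len(flagged & true_errors)
    precision = correctly_detected / len(flagged) if flagged else 0.0
    recall = correctly_detected / len(true_errors) if true_errors else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value

# Hypothetical word positions: four words flagged, three words actually wrong.
precision, recall, f_value = detection_metrics(flagged={2, 3, 4, 9}, true_errors={2, 3, 7})
```

Here two of the four flagged positions are true errors, and two of the three true errors are found, so the F value is the harmonic mean of 1/2 and 2/3.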

The correct rate, recall rate, and F value obtained by this system in the experiments are shown in Table 1.

4 Part of speech tagging based on maximum entropy model

In 1957, E. T. Jaynes proposed the principle of maximum entropy (MaxEnt). In 1996, it was applied to natural language processing by Berger et al. In 1997, Papineni et al. proposed a feature-based natural language understanding method. Since then, the principle of maximum entropy has become a typical criterion for learning probabilistic models, and it has been adopted by many domestic and foreign researchers. The maximum entropy model, a log-linear model, is derived from the principle used to select the optimal model, namely the maximum entropy principle. Ratnaparkhi first proposed an English part-of-speech tagging method based on maximum entropy. This method has the advantages that it is not affected by the size of the tag set, can handle multiple types of problems, and gives a good probability value for the result. A series of papers published by Ratnaparkhi shows that the method has been successfully applied to part-of-speech tagging.

If the possible values of the random variable X are {x₁, x₂, x₃, ⋯ , x_n} and its probability distribution is P (X = x_i) = p_i (i = 1, 2, ⋯ , n), the entropy of the random variable X is defined as: $H(X) = - \sum_{x \in X} p(x) \log p(x), \quad 0 ⩽ H(X) ⩽ \log |X|$ (18)
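A short numerical check of the entropy definition and its bounds is given below as a sketch; the function name `entropy` and the example distributions are chosen only for illustration, and the natural logarithm is assumed.

```python
import math

def entropy(probabilities):
    """H(X) = -sum p(x) log p(x), with the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])   # reaches the upper bound log |X|
skewed = entropy([0.7, 0.1, 0.1, 0.1])        # lies strictly between the two bounds
certain = entropy([1.0, 0.0, 0.0, 0.0])       # reaches the lower bound 0
```

The uniform distribution attains the maximum, which is why the maximum entropy principle prefers the most uniform model that is still compatible with the known constraints.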

The principle of maximum entropy states that, among all probability models that satisfy the constraints, the model with the maximum entropy is the best one. The conditional entropy of Y given X is denoted by H(Y|X), and the conditional probability distribution is p(y|x). The training data is T ={ (x₁, y₁) , (x₂, y₂) , ⋯ , (x_n, y_n) }. The best classification model is selected by the principle of maximum entropy as follows: $\max_{p \in P} H(Y | X) = - \sum_{(x, y)} \bar{p}(x)\, p(y | x) \log p(y | x)$ (19)

Among them, $\bar{p}(x)$ is the empirical distribution of x in the training data, and $P = \{ p \mid p \text{ is a probability distribution that satisfies the constraints on } X \}$

When modeling with the maximum entropy model, it is necessary to select features, which involves three notions: features, samples, and feature functions. A feature is a pair (x, y), where y is the information to be determined and x is the context information. A sample is an observed instance of a feature (x, y) in the training data, and the feature function f (x, y) describes the relationship between input x and output y: $f(x, y) = \begin{cases} 1, & \text{if } y = y_{0} \text{ and } x = x_{0} \\ 0, & \text{otherwise} \end{cases}$ (20)

The expected value of the feature function f (x, y) with respect to the empirical distribution $\bar{p} (x, y)$ is: $E_{\bar{p}} (f) = \sum_{x, y} \bar{p} (x, y) f (x, y)$ (21)

The expected value of f (x, y) with respect to the model p(y|x) and the empirical distribution $\bar{p} (x)$ is: $E_{p} (f) = \sum_{x, y} \bar{p} (x) p (y | x) f (x, y)$ (22)

If the model captures the information in the training data, the two expectations in Equations (21) and (22) can be assumed to be equal, and this equality is used as a constraint for model learning.
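The two expectations in Equations (21) and (22) can be computed directly on a small labeled sample, as in the following sketch. The four training pairs, the suffix feature, and the uniform model are all hypothetical and serve only to show when the constraint is violated.

```python
def empirical_expectation(samples, feature):
    """Equation (21): average of f(x, y) over the observed (x, y) pairs."""
    return sum(feature(x, y) for x, y in samples) / len(samples)

def model_expectation(samples, labels, model, feature):
    """Equation (22): average over observed x of sum_y p(y | x) f(x, y)."""
    return sum(
        model(y, x) * feature(x, y) for x, _ in samples for y in labels
    ) / len(samples)

def ing_verb_feature(x, y):
    """Hypothetical binary feature: the word ends in 'ing' and is tagged VERB."""
    return 1 if x.endswith("ing") and y == "VERB" else 0

# Hypothetical training pairs (word, tag), invented for illustration only.
samples = [("running", "VERB"), ("building", "NOUN"), ("cat", "NOUN"), ("singing", "VERB")]
labels = ["NOUN", "VERB"]

def uniform_model(y, x):
    """An untrained model that assigns the same probability to every tag."""
    return 1.0 / len(labels)

observed = empirical_expectation(samples, ing_verb_feature)
predicted = model_expectation(samples, labels, uniform_model, ing_verb_feature)
```

The uniform model gives 0.375 where the data give 0.5, so it does not satisfy the constraint; training adjusts the feature weights until the two values agree.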

Let x be the context of the word to be tagged and y the output tag, where y is an element of the part-of-speech tag set Y. The task of part-of-speech tagging with the maximum entropy model is to build a distribution p(y|x) that fits the context and to use it to compute the tagging result. For an unseen pair (x, y) with x ∈ X, a statistical model of the features of x and y is built in order to calculate the conditional probability p(y|x) for each y ∈ Y.

According to the principle of maximum entropy, the model should take the following form: $p (y | x) = \frac{1}{Z_{λ} (x)} \exp (\sum_{i} λ_{i} f_{i} (x, y))$ (23) $Z_{λ} (x) = \sum_{y} \exp (\sum_{i} λ_{i} f_{i} (x, y))$ (24)

Among them, f_i (x, y) is the feature function and λ_i is the feature weight. Z_λ(x) is the normalization factor that ensures ∑_y p(y|x) = 1. An iterative algorithm is then used to solve for the parameters λ_i.
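Equations (23) and (24) amount to exponentiating a weighted feature sum and normalizing over the tag set. The sketch below assumes two hypothetical binary features and hand-picked weights; in a real tagger the weights would come from the iterative training procedure.

```python
import math

def maxent_probability(x, labels, features, weights):
    """Return p(y | x) for every label y, following Equations (23) and (24)."""
    scores = {
        y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
        for y in labels
    }
    z = sum(scores.values())   # normalization factor Z(x)
    return {y: score / z for y, score in scores.items()}

def ing_suffix_is_verb(x, y):
    """Hypothetical feature: the current word ends in 'ing' and the tag is VERB."""
    return 1 if x["word"].endswith("ing") and y == "VERB" else 0

def after_the_is_noun(x, y):
    """Hypothetical feature: the previous word is 'the' and the tag is NOUN."""
    return 1 if x["prev"] == "the" and y == "NOUN" else 0

# Context and weights invented for illustration only.
context = {"word": "building", "prev": "the"}
distribution = maxent_probability(
    context,
    labels=["NOUN", "VERB", "ADJ"],
    features=[ing_suffix_is_verb, after_the_is_noun],
    weights=[1.2, 2.0],
)
best_tag = max(distribution, key=distribution.get)
```

With these weights the determiner context outweighs the suffix evidence, so the noun reading of "building" receives the highest probability.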

The part-of-speech tagging problem in English is to take each word of the input sentence, judge its grammatical category, and determine the part of speech of the current word according to its context. If the word sequence of an English sentence is W = w₁w₂w₃ ⋯ w_n and the corresponding part-of-speech sequence is T = t₁t₂t₃ ⋯ t_n, the problem of English part-of-speech tagging is to find the T that maximizes P(T|W), namely: $\hat{T} = \underset{T}{argmax}\, p (T | W) = \underset{T}{argmax}\, p (T) \times p (W | T)$ (25)

Because the maximum entropy principle fixes the form of the optimal model, the key issue when implementing English part-of-speech tagging based on maximum entropy is the selection and determination of the model features.

5 Model building

The combined model constructed in this paper is based mainly on a statistical model, supplemented by a rule base. First, the sequence to be annotated is preliminarily annotated by the statistical model, and the annotated sequence is then filtered by the rule model to judge whether the statistical annotation is reasonable. If it is unreasonable, rule processing is performed. The processing of the rule model is divided into two stages: shallow ambiguity processing and deep ambiguity processing. If the problem is a shallow ambiguity that can be resolved, it is handled according to the rules of the rule model. If it is a deep ambiguity that cannot be resolved, two results are given, and subsequent syntactic or semantic methods are used to deal with it.
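The control flow of this two-stage filtering can be sketched as follows. The two rules, the tag names, and the function `apply_rules` are hypothetical placeholders; the actual rule base of the paper is not reproduced here.

```python
def noun_after_determiner(tagged, i):
    """Hypothetical shallow rule: a VERB tag directly after a determiner becomes NOUN."""
    _, tag = tagged[i]
    if i > 0 and tag == "VERB" and tagged[i - 1][1] == "DET":
        return "NOUN"
    return None

def like_is_ambiguous(tagged, i):
    """Hypothetical deep rule: 'like' is left open between VERB and PREP."""
    word, _ = tagged[i]
    return ("VERB", "PREP") if word == "like" else None

def apply_rules(tagged, rules):
    """Filter an initial tagging; each word ends up with one tag or two candidates."""
    result = []
    for i, (word, tag) in enumerate(tagged):
        candidates = [tag]                  # default: keep the statistical tag
        for rule in rules:
            verdict = rule(tagged, i)
            if verdict is None:
                continue                    # the rule has no objection
            # A single tag resolves a shallow ambiguity; a tuple keeps a deep one open.
            candidates = list(verdict) if isinstance(verdict, tuple) else [verdict]
            break
        result.append((word, candidates))
    return result

# Initial tags as they might come from the statistical model (invented example).
initial = [("the", "DET"), ("run", "VERB"), ("looks", "VERB"), ("like", "VERB")]
final = apply_rules(initial, [noun_after_determiner, like_is_ambiguous])
```

Words with two candidates correspond to the cases reported as "two results", which are passed on to later syntactic or semantic processing.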

The main reason for choosing a statistical model in this article is its speed: it must quickly assign an initial tag sequence to an unlabeled sequence and perform well on unknown words and rare words. The statistical model selected in this paper is a part-of-speech tagging model that combines the maximum entropy model with the Viterbi algorithm. We use the maximum entropy model for feature extraction to obtain the scores of all possible part-of-speech sequences, then use the Viterbi algorithm to select the optimal sequence among them, and thus obtain the optimal tagging. The processing of part-of-speech tagging by the statistical model can be divided into three parts: the initialization module, the maximum entropy feature module, and optimal sequence selection. The overall system can be divided into the user interface, the knowledge base and the instrument module. The overall module structure is shown in Fig. 1.

Fig. 1

Model structure.
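The optimal sequence selection step of the statistical model can be illustrated with a minimal Viterbi sketch. The function `toy_local_score` stands in for the maximum entropy score of a tag given the previous tag and the current word; its numbers are unnormalized and invented for illustration only.

```python
import math

def _log(value):
    """Logarithm that maps zero to minus infinity instead of raising an error."""
    return math.log(value) if value > 0 else float("-inf")

def viterbi(words, tags, local_score):
    """Return the tag sequence with the highest product of local scores."""
    # scores[t] = (best log-score of any path ending in tag t, that path)
    scores = {t: (_log(local_score(t, None, words[0])), [t]) for t in tags}
    for word in words[1:]:
        scores = {
            t: max(
                ((prev_score + _log(local_score(t, prev, word)), path + [t])
                 for prev, (prev_score, path) in scores.items()),
                key=lambda item: item[0],
            )
            for t in tags
        }
    return max(scores.values(), key=lambda item: item[0])[1]

# Hypothetical lexical and transition preferences, invented for illustration only.
LEXICAL = {
    "the": {"DET": 0.9, "NOUN": 0.05, "VERB": 0.05},
    "can": {"DET": 0.02, "NOUN": 0.38, "VERB": 0.6},
    "rusts": {"DET": 0.02, "NOUN": 0.3, "VERB": 0.68},
}
TRANSITION = {(None, "DET"): 0.6, ("DET", "NOUN"): 0.8, ("NOUN", "VERB"): 0.7}

def toy_local_score(tag, prev_tag, word):
    return LEXICAL[word][tag] * TRANSITION.get((prev_tag, tag), 0.1)

best_path = viterbi(["the", "can", "rusts"], ["DET", "NOUN", "VERB"], toy_local_score)
```

Although "can" on its own prefers the verb reading, the path through the noun reading scores higher once the neighboring tags are taken into account, which is the benefit of decoding the whole sequence rather than tagging word by word.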

The user may not know much about the professional field, so requirements are generally described in everyday natural language. A natural language understanding module is therefore needed to translate the user’s language into a form that the computer can understand. Then, the computer extracts the key information in the user’s language through the knowledge base and selects the instrument and method that realize the function requested by the user. The basic processing method of the natural language understanding module is shown in Fig. 2:

Fig. 2

The basic processing method of the natural language understanding module.

The classification model design process is: the first step is to prepare the data set that can be input in the model; the second step is to divide the data set into two parts: the first part is used as the training set of the classification model and the second part is used as the test set of the classification model; the third step uses the training set to train the classification model and output the model during the classification model training phase. Finally, the test set is used to evaluate the model obtained after training. The classification model construction process is shown in Fig. 3 below.

Fig. 3

Construction process of classification model.
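The construction process of the classification model (prepare the data, split it, train, evaluate) can be sketched as below. A majority-class baseline is used here purely as a stand-in classifier, and the synthetic data set, the 20% test fraction, and all function names are assumptions of this example.

```python
import random
from collections import Counter

def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_baseline(train_set):
    """Stand-in training step: always predict the most frequent training label."""
    most_common_label = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda text: most_common_label

def accuracy(model, test_set):
    """Share of test samples whose predicted label equals the true label."""
    return sum(model(text) == label for text, label in test_set) / len(test_set)

# Synthetic data set, invented for illustration only.
dataset = [(f"doc{i}", "education" if i % 3 else "computer") for i in range(30)]
train_set, test_set = split_dataset(dataset)
model = train_majority_baseline(train_set)
score = accuracy(model, test_set)
```

Any classifier with the same train and predict interface could replace the baseline without changing the surrounding steps.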

6 Model performance testing

This article uses the 2019 “Daily Mail” corpus as the test corpus and uses the Stanford NLP tool as the baseline for comparison. First, 1,000 corpora containing the “from... to...” sentence pattern are extracted and manually labeled. Then, both the Stanford NLP tool and the part-of-speech disambiguation model combining rules and statistics are tested on them. The test results are shown in Table 2 and Fig. 4:

Table 1
Performance indicators of error finding method based on n–gram

Correct rate	Recall rate	F value
82.63%	83.54%	82.65%

The remaining 626 corpora are compared with the manual tagging results. The results are shown in Table 3:

Table 2

Comparison table of “from ... to ... ” tagging results

	Stanford NLP	Paper model
Mark as verb	620	375
Mark as preposition	305	345
Mark as conjunction	85	0
Annotated as two results	0	290

Table 3

Comparison table of “from... to” part-of-speech tagging errors

	Stanford NLP	Paper model
Verbs marked incorrectly	82	62
Prepositional annotation error	36	20
Conjunction errors	84	0
Overall error	202	82
Correct rate	67.70%	86.90%

It can be seen from Table 3 and Fig. 5 that the overall accuracy of the combined model is better than the Stanford model. The reason is that the combined model can solve the part-of-speech tagging errors of conjunctions and its ability to handle verb and prepositional part-of-speech errors has also been improved slightly.
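The correct rates in Table 3 can be checked from the error counts, assuming that the correct rate is the share of the 626 compared corpora that contain no tagging error. This reading and the function name `correct_rate` are assumptions of the sketch.

```python
def correct_rate(errors, compared):
    """Percentage of compared items that were tagged without error."""
    return 100.0 * (compared - errors) / compared

stanford_rate = correct_rate(errors=202, compared=626)   # overall errors of Stanford NLP
combined_rate = correct_rate(errors=82, compared=626)    # overall errors of the paper model
```

Under this reading the computed values agree with the reported 67.70% and 86.90% to within about 0.05 percentage points.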

Fig. 4

Comparison diagram of “ from ... to ... ” tagging results.

Fig. 5

Comparison diagram of “from... to” part-of-speech tagging errors.

Next, 1,000 corpora containing the “one... just...” sentence pattern are extracted and manually annotated. Then, both the Stanford NLP tool and the part-of-speech disambiguation model combining rules and statistics are tested on them. The test results are shown in Table 4 and Fig. 6:

Table 4

Comparison table of “one ... just ... ” tagging results

	Stanford NLP	Paper model
Mark as number	834	583
Mark as adverb	163	307
Mark as ordinal	13	13
Gives two results	0	107

Fig. 6

Comparison of “one... just...” tagging results.

In the extracted corpus, some sentences do not conform to the structure (263 sentences); this part of the data is meaningless and must be excluded first. In addition, there are cases in the corpus that the model cannot resolve, for which two results are given (106 sentences), and this part of the data is also invalid for calculating the correct rate. The remaining 631 corpora are compared with the results of manual tagging. The results are shown in Table 5 and Fig. 7.

Table 5

Comparison table of “one... just...” part-of-speech tagging errors

	Stanford NLP	Paper model
Numeral annotation errors	96	38
Adverbs are incorrectly marked	5	23
Overall error	101	61
Correct rate	84.00%	90.30%

Fig. 7

Comparison diagram of “one... just...” part-of-speech tagging errors.

It can be seen from Table 5 and Fig. 7 that the combined model has improved the part-of-speech tagging accuracy compared to the Stanford model.

In this study, naive Bayes classification and K-means clustering are used to classify the text. Naive Bayes, also called simple Bayes, is founded on classical probability theory. It models the probabilistic relationship between an individual’s features and its category and, based on these probabilities and the data that has been observed, determines which category the individual belongs to. The naive Bayes classifier has the advantages of relatively simple calculation, fast speed and high accuracy. The basic idea of the K-means algorithm is to select k points in the data space as the cluster centers and calculate the distance from all points to these k points. An iterative method is used to gradually update the values of the k centers until the centers no longer change or the number of iterations reaches the preset number. After that, the final clustering result is obtained.
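The K-means procedure described above can be sketched as follows. Two-dimensional points stand in for text feature vectors, and the random initialization, the seed, and the iteration limit are assumptions of this example (`math.dist` requires Python 3.8 or later).

```python
import math
import random

def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups; returns (centers, clusters)."""
    centers = random.Random(seed).sample(points, k)   # k data points as initial centers
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda j: math.dist(point, centers[j]))
            clusters[nearest].append(point)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:   # the centers no longer change
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of points, invented for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
partition = sorted(sorted(cluster) for cluster in clusters)
```

On this toy data the loop stops after a few iterations with one cluster per group; on real text features the result depends on the initial centers, so several restarts are commonly used.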

The classification accuracy and recall rate are shown in Figs. 8–10.

Fig. 8

Recall rate and accuracy rate of Bayesian classification of two types of text (%).

Fig. 9

Recall rate and accuracy rate of Bayesian classification of three types of text (%).

Fig. 10

Recall rate and accuracy rate of Bayesian classification of five types of text (%).
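The per-class rates plotted in Figs. 8–10 can be computed from true and predicted labels as sketched below. This sketch assumes that the per-class "accuracy rate" corresponds to precision (the share of texts assigned to a class that really belong to it), which matches the statement in the discussion that a 100% accuracy rate means no other types are misjudged as that class. The function name and the toy labels, including the "sports" category, are hypothetical.

```python
from collections import Counter


def precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label that occurs."""
    actual = Counter(true_labels)
    predicted = Counter(predicted_labels)
    true_positive = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    result = {}
    for label in set(actual) | set(predicted):
        precision = (true_positive[label] / predicted[label]
                     if predicted[label] else 0.0)
        recall = (true_positive[label] / actual[label]
                  if actual[label] else 0.0)
        result[label] = (precision, recall)
    return result


# Toy labels: one computer text is misclassified as education.
true_labels = ["education", "education", "computer", "computer", "sports"]
pred_labels = ["education", "education", "computer", "education", "sports"]
scores = precision_recall(true_labels, pred_labels)
```

In this toy case the education class has 100% recall but lower precision, because another type is misclassified as education, while the computer class has 100% precision but lower recall.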

From the recall and accuracy of the experiments, it can be seen that selecting 98 parts of speech to extract text features gives a significantly better classification effect than selecting 22 parts of speech as features. When the classification is based on the 22 first-class parts of speech, the classification accuracy and recall rate decrease significantly as the number of text categories increases. The reason is that when only the first-class parts of speech are selected, some features of the text are lost. Therefore, as the text categories increase, the categories interfere with each other and the texts become harder to distinguish. However, when 98 parts of speech are selected as features, all the part-of-speech features of the text are retained, and increasing the number of text categories has little effect on the classification accuracy. The five-category classification still reaches more than 97% accuracy. If the problem of classification efficiency is ignored, the accuracy of this result far exceeds that of most other feature extraction methods. In terms of recall, the education class reached 100% in the two-category and three-category experiments and 99.33% in the five-category experiment, which is the highest among all categories. That is to say, education texts are very rarely classified into other categories, but texts of other types may be misclassified as education. It can be seen that the features of educational texts are broader than those of other types. In contrast, the accuracy rate of the computer class reached 100% in all three experiments, and no texts of other types were misjudged as the computer class. The reason is that the articles in the computer category are more specialized, and their features are more distinctive than those of other articles. From the perspective of part-of-speech analysis, computer articles contain more proper nouns, which makes their transition probability matrix differ noticeably from that of other articles. This leads to a very high classification accuracy for computer articles.
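The part-of-speech transition probability matrix mentioned above can be estimated from a tagged text as in the following sketch. It is an assumption that the matrix is built from counts of adjacent tag pairs normalized by row; the tag names are illustrative Penn Treebank-style labels, not the 22-tag or 98-tag sets of this study, and `transition_matrix` is a hypothetical helper.

```python
from collections import Counter


def transition_matrix(tag_sequence, tagset):
    """Estimate P(next tag | previous tag) from adjacent tag pairs."""
    pair_counts = Counter(zip(tag_sequence, tag_sequence[1:]))
    matrix = {}
    for prev in tagset:
        row_total = sum(pair_counts[(prev, nxt)] for nxt in tagset)
        # A tag that never precedes another tag gets an all-zero row.
        matrix[prev] = {
            nxt: (pair_counts[(prev, nxt)] / row_total if row_total else 0.0)
            for nxt in tagset
        }
    return matrix


# Toy tag sequence with many proper nouns (NNP), as in a computer article.
tags = ["NNP", "NNP", "VB", "NN", "NNP", "NNP"]
matrix = transition_matrix(tags, tagset=["NN", "NNP", "VB"])
```

In a text rich in proper nouns, the rows and columns for the proper-noun tag carry most of the probability mass, which is the kind of difference the classifier can exploit.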

7 Conclusion

Based on the characteristics of English itself, this article takes the construction of a basic corpus for English information processing, lexical analysis and basic phrase recognition as the starting point for studying information processing methods that combine rules and statistics. In addition, this research focuses on the morphological analysis of English words, the segmentation of additional components and the morphological reduction of words, letter and word frequency statistics, stem extraction, part-of-speech tagging, text corpus construction, etc., which lay the foundation for further research in English information processing.

By studying basic English phrases, this study analyzes the compositional structure of phrases and determines the basic phrase structure and composition rules of English noun, verb and adjective phrases. The rule-based English basic phrase recognition algorithm and the analysis of basic phrase ambiguity resolution are also studied. In addition, this study proposes methods for realizing basic phrase recognition with different statistical models that take English words, parts of speech and additional components as feature elements. Specifically, this study realizes the recognition of basic English noun, verb and adjective phrases with the maximum entropy model and with the conditional random field model, and evaluates the experimental results in order to build a phrase library.

Acknowledgment

This research was financed by the 2020 Government Social Science Award Fund Project of Yancheng City, “Research on the improvement of the medical service for foreigners in Yancheng city under the strategy of One Belt And One Road” (No. 20szfsk12).
