Abstract
At present, recognition methods based on character segmentation are not effective for English text, and traditional methods rely on the structural features and statistical characteristics of strokes. To improve the recognition of English text, this study takes a machine learning perspective and introduces multiple features to compensate for the lack of information caused by the small Chinese data set. Moreover, this study decomposes the character recognition problem into a text matching problem between question and answer and a textual entailment problem between answer and standard answer, and continues training on a short-text scoring data set. The final result shows a certain improvement, which demonstrates the usability of the mechanism designed in this paper. To study its performance, the proposed model is compared with a neural network recognition model in terms of recognition accuracy and recognition speed. The results show that the proposed algorithm is effective to a certain extent.
Introduction
Humans can learn, identify objects, reason and anticipate the future, and making machines imitate the way humans learn and think is the central difficulty of artificial intelligence. With the accumulation of neural network theory and the rapid advancement of computer technology, artificial intelligence has become one of the hot spots in the scientific community. Handwritten text recognition is an important research direction of human-computer interaction and one of the classic topics in the field of artificial intelligence. Writing is an important means of human information transmission. There are thousands of languages around the world, and English, Chinese, Japanese, Korean, Greek, and Arabic numerals are all forms of text. English and Chinese are the two most widely used languages. The structure of Chinese characters is far more complicated than that of other scripts such as English: the number of strokes is large and the combination rules are complex. Considering the universality of text recognition technology, this paper analyzes and studies English text recognition. English text recognition includes printed text recognition and handwritten text recognition. Printed text recognition can be traced back to the 1960s and is mainly used in industries such as banking and newspapers. Such text is written to a standard, with substantially the same spacing and a uniform character size, which meets the requirements of optical character recognition (OCR). OCR technology is widely used in printed text recognition. Its principle is to use the light and dark patterns of the text to detect the stroke shapes, and then translate the character shapes into computer text through a stroke recognition method [1].
OCR research has produced rich results, mainly for printed text recognition, and these have been applied in real life, for example in the ABBYY reader, scanning Almighty King, the Tesseract character recognition engine, and the text recognition software that comes with the CAJViewer reader. However, OCR technology has poor robustness when correcting and recognizing text affected by uneven lighting, blur, font deformation, or slant, and cannot meet the requirements of handwritten text recognition with diverse styles.
From the 1980s to the present, from standard continuous stroke writing to free continuous stroke writing, and from single Chinese character and phrase recognition to text line and paragraph recognition, offline handwriting has been the key research content of text recognition [2]. At present, single-character recognition technology has achieved fruitful results. For the recognition of entire text lines and passages, however, methods based on character segmentation are not ideal. The traditional methods are based on the structural features of English and the statistical characteristics of strokes [3]. However, it is very difficult to extract the basic units of characters, the extracted features are limited and do not adapt to changes in writing style, and the problem of extracting a universal and stable feature from characters has not been solved. With the breakthrough of chip technology and the mass production of NVIDIA's new generation of high-performance GPUs, deep learning technology has emerged, and models such as convolutional neural networks, recurrent neural networks, and long short-term memory networks have been applied. Moreover, segmentation-free text line recognition is gradually gaining popularity and attention, which provides new ideas for solving the problem of handwritten text recognition.
Related work
The character recognition method is very limited and cannot meet the daily needs [4]. With the development of image processing technology and the explosion of information technology, text recognition technology has made new progress. Random forest, support vector machine, BP neural network and other methods have been successfully applied to handwritten digit recognition. However, there are few texture features of handwritten digits, and feature extraction is difficult. Moreover, the recognition accuracy of these traditional algorithms on the public handwritten digit set MNIST is not very high, and it is still a difficult task to correctly recognize a large number of handwritten digits [5]. With the development of concepts such as local receptive field, weight update sharing, and sample feature extraction, convolutional neural networks have been greatly developed and applied. The literature [6] uses convolutional neural networks to perform handwritten digit recognition. However, it does not perform network parameter learning optimization, so the recognition accuracy rate still needs to be further improved.
For the character segmentation algorithm, the literature [7] first obtained the glued text block data set by direct horizontal global projection, small-angle skew correction and local projection, and then segmented the glued text block data according to the degree of difficulty using the optimal segmentation path evaluation function. This method takes a long time and cannot handle overlapping characters. The literature [8] first used image morphology operations to locate text lines, and then used the projection method, the connected domain method and the statistical histogram method to perform character segmentation. However, this method is only suitable for rough segmentation and can only generate straight-line segmentation paths, and it easily splits a single character into multiple character parts, which causes erroneous segmentation. The literature [9] divides the text line based on the idea of mathematical transformation, divides the text line into several areas according to the width of the character stroke, and finally uses the Viterbi algorithm to search all the cutting paths. This method can generate curved split paths, but it also generates a large number of redundant split paths. For character recognition, the literature [10] uses image preprocessing technology to eliminate the noise generated when electronic equipment extracts media text and converts it into a two-dimensional image. It then normalizes the size, position and angle of the characters, uses operations such as point density equalization and line interval projection interpolation to compensate for the local deformation of the character structure caused by the normalization, and extracts the dot, line and radical features of the strokes. Finally, a Bayesian distance classifier is used to transform the feature vector into category label probabilities.
The literature [11] uses the statistical characteristics of Chinese characters to record the differences between different Chinese characters, such as the distribution characteristics of strokes, directional gradient characteristics, moment characteristics and elastic grid characteristics. Moreover, it retains key information features through linear decision analysis and other dimensionality reduction methods, and finally uses classifiers such as nearest neighbor classifier and support vector machine to obtain class probability distribution through feature learning.
The current common text line detection methods include maximally stable extremal regions (MSER) [12], the TextBoxes algorithm and the stroke width transform (SWT) [13]. The MSER algorithm is mainly based on the idea of the watershed to detect the difference between the text area and the background color, and the regional stability is calculated by dynamic threshold partitioning to examine the characteristics of the text area. In the literature [14], ellipse fitting of the MSER components was used, and the fitted eccentricity was used as the component filtering criterion to realize text area extraction. However, the MSER algorithm is relatively sensitive to background noise and is not ideal for detecting blurred images. The SWT algorithm extracts text candidate regions based on the idea that stroke width and color are approximately constant, and then recognizes the text regions through heuristic filtering rules. The literature [15] improved the SWT algorithm, used edge detectors to enhance the edges of text, and detected text areas under different noise backgrounds. However, the SWT algorithm relies heavily on the edge detection results and cannot detect irregular text with distortion or deformation. The TextBoxes algorithm adds an aspect ratio option for the detection box on the basis of the SSD (Single Shot MultiBox Detector) detection network so that it can be applied to text of various lengths. The literature [16] used the TextBoxes algorithm to detect text of different scales, but it cannot handle excessively wide text spacing and cannot detect slanted text lines. For the recognition of unsegmented handwritten text lines, the research results in recent years are also quite remarkable.
The basic principle is to use sliding windows to extract features from text lines, then match the feature sequence against the character model, combine the evaluation criteria to find the best matching character sequence, and finally obtain the text line candidate characters after post-processing operations [17]. The character sequence of the text line is transformed into a state transition diagram. The character state $c_t$ at each moment can be expressed by a character model, and it depends on the character state of the previous moment and the features of the text line at the current moment [18]. The character state at each moment is transferred to the character state $c_{t+1}$ at the next moment according to the probability model. Because the state transition diagram is built from the single-character model, the single-character model has become a key link in the unsegmented text line recognition system, and the selection of an appropriate character model directly affects the accuracy of text recognition [19]. The commonly used character models include the Hidden Markov Model (HMM), Long Short-Term Memory (LSTM) and the Recurrent Neural Network (RNN). The hidden Markov model is the earliest algorithm applied in the field of unsegmented text recognition. It uses the state transition probability between nodes to represent the relationship between sequences and uses fuzzy inference rules to adjust the feedback. The literature [20] proposed a segmentation-free method based on a hidden Markov model with a convolutional network (HMM-CNN) and successfully applied it to handwritten Chinese text recognition. However, the HMM relies too much on local state information, suffers from a serious data sparsity problem, and is not sensitive to long-distance context information, so the recognition effect is not very good.
The literature [21–23] used convolutional shift-invariant filters to generate discriminative features and combined them with a long short-term memory recurrent neural network (LSTM-RNN) to achieve handwritten text recognition. However, the system does not implement end-to-end training, late-stage semantic learning is not fed back to early-stage feature extraction, and the recognition accuracy has not reached the expected level.
WordNet relationship vectorization combined with TransR and word representation
The TransE model was the first to propose an effective construction of knowledge graph vector representations. For a triplet (head, link, tail) in the knowledge graph, head represents the head entity, link represents the relationship between the pair of entities, and tail represents the tail entity. TransE's motivation is to embed entities and relationships in one vector space. Inspired by the translation invariance observed in Word2vec, the goal is to make the sum of the head entity vector and the relationship vector as close as possible to the tail entity vector, and to construct negative examples through negative sampling so that the distance of the negative examples is sufficiently large. The loss function is shown in formula (1). Through this loss function, the entity and relationship vectors are learned. In practice, however, TransE models one-to-one relationships well but cannot model well the one-to-many or many-to-one relationships that are common in WordNet. There are two evaluation indicators. One is Mean Rank: the head entity is kept unchanged, the tail entity is replaced in turn by every entity in the entity table, the resulting candidate triplets are scored and sorted by loss, and the average rank of the correct tail entities is used as the indicator. The other is Hits@10, which represents the probability that the correct entity is ranked in the top 10.
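Formula (1) is not reproduced in this text, so the following is only a minimal sketch of the commonly stated TransE margin-based ranking loss and of the two indicators described above. The L2 distance, the margin of 1.0, the ascending ranking by distance, and the toy entity and relation vectors are all assumptions made for illustration.

```python
import math

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def transe_distance(head, rel, tail):
    """d(head + rel, tail): small when the translated head lands near the tail."""
    return l2([h + r for h, r in zip(head, rel)], tail)

def margin_loss(pos, neg, margin=1.0):
    """Hinge loss for one (positive triplet, negative triplet) pair."""
    return max(0.0, margin + transe_distance(*pos) - transe_distance(*neg))

def tail_rank(head, rel, true_tail, entities):
    """1-based rank of the true tail among all candidate tails (smaller distance first)."""
    true_d = transe_distance(head, rel, entities[true_tail])
    return 1 + sum(transe_distance(head, rel, e) < true_d for e in entities.values())

def mean_rank_and_hits(test_triplets, entities, relations, k=10):
    """Mean Rank and Hits@k over (head, relation, tail) name triplets."""
    ranks = [tail_rank(entities[h], relations[r], t, entities)
             for h, r, t in test_triplets]
    return sum(ranks) / len(ranks), sum(r <= k for r in ranks) / len(ranks)

# Hypothetical toy embeddings.
entities = {"cat": [0.0, 0.0], "animal": [1.0, 0.0], "car": [5.0, 5.0]}
relations = {"hypernym": [1.0, 0.0]}
mr, hits = mean_rank_and_hits([("cat", "hypernym", "animal")], entities, relations)
loss = margin_loss((entities["cat"], relations["hypernym"], entities["animal"]),
                   (entities["cat"], relations["hypernym"], entities["car"]))
```

In this toy setting the correct tail is ranked first and the negative example is already farther away than the margin, so the hinge loss is zero.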
TransR is selected as an alternative. Based on TransE, TransR holds that entities should have different representations under different relationships. A matrix $M_r$ is introduced for each relationship, and the entity vector is mapped into the relationship space, as shown in formula (2).
As a general knowledge base vectorization tool, TransR is adapted as shown in formula (3). Pre-trained word vectors are used, and the semantic information and root information of the words are introduced as far as possible; training then proceeds according to the previous TransR model to obtain the entity vectors. In the evaluation, with the help of formula (4), all relationships are traversed and the relationship with the smallest loss is taken as the prediction of the word-pair relationship. This vector is also the word-pair relationship vector we obtain. In this way, the algorithm can not only learn the entity relationship information but also use the word meaning information of Word2vec.
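Formulas (2) and (4) are not reproduced in this text. As a hedged illustration, the sketch below uses the usual TransR score, in which head and tail are projected by the relation matrix before the translation is applied, and then picks the relation with the smallest score for a word pair. The function names, the squared L2 score, and the two toy relations are assumptions.

```python
def matvec(vec, mat):
    """Row vector times matrix: project an entity into a relation space."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]

def transr_score(head, tail, rel_vec, rel_mat):
    """Squared L2 norm of (head M_r + r - tail M_r); lower means a better fit."""
    h_r, t_r = matvec(head, rel_mat), matvec(tail, rel_mat)
    return sum((h + r - t) ** 2 for h, r, t in zip(h_r, rel_vec, t_r))

def predict_relation(head, tail, relations):
    """relations maps a name to (relation vector, projection matrix)."""
    return min(relations,
               key=lambda name: transr_score(head, tail, *relations[name]))

# Hypothetical 2-dimensional example with two relations.
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
relations = {
    "hypernym": ([1.0, 0.0], identity),
    "antonym": ([0.0, -2.0], swap),
}
hot, cold = [1.0, 0.0], [-1.0, 0.0]
best = predict_relation(hot, cold, relations)
```

Traversing every relation is affordable here because the relation inventory is small (18 relationships are mentioned later in the text).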
Through supervised classifiers and knowledge base vectorization tools, the multi-relational data is modeled, and the vector angle is used for calculation. As shown in Table 1 below, it is found that the similarities of common antonyms are relatively large.
Word2vec word pair similarity
In order to improve the reasoning ability over the entire sentence, the loss function in formula (5) is defined. Here, $V$ represents the set of all words, $S_w$ represents the set of synonyms of the word $w$, $A_w$ represents the set of antonyms of the word $w$, and $\mathrm{sim}(w, a)$ represents the similarity between the word $w$ and the word $a$. The ultimate goal is to maximize this objective function. The first term of the formula is the sum of the log probabilities of each word $w$ paired with its synonyms, and the second term is the sum of the log negative probabilities of each word $w$ paired with the words in its antonym set. The goal is to model antonyms while modeling synonyms.
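Formula (5) is missing from this text. One plausible reading of the description, offered here only as a sketch, is an objective of the form sum over synonyms of log sigmoid(sim(w, s)) plus sum over antonyms of log sigmoid(-sim(w, a)). The sigmoid, the plain dot product used as the default similarity, and the toy vectors are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(vectors, synonyms, antonyms, sim=None):
    """Objective to maximize: reward similar synonyms and dissimilar antonyms."""
    sim = sim or (lambda w, other: dot(vectors[w], vectors[other]))
    total = 0.0
    for w in vectors:
        # First term: log probability of w with each of its synonyms.
        total += sum(math.log(sigmoid(sim(w, s))) for s in synonyms.get(w, ()))
        # Second term: log negative probability of w with each of its antonyms.
        total += sum(math.log(sigmoid(-sim(w, a))) for a in antonyms.get(w, ()))
    return total

synonyms = {"good": ["great"]}
antonyms = {"good": ["bad"]}
separated = {"good": [1.0, 0.0], "great": [1.0, 0.0], "bad": [-1.0, 0.0]}
collapsed = {"good": [1.0, 0.0], "great": [1.0, 0.0], "bad": [1.0, 0.0]}
score_separated = objective(separated, synonyms, antonyms)
score_collapsed = objective(collapsed, synonyms, antonyms)
```

The embedding that pushes the antonym away scores higher than the one that places it next to the synonym, which is the behavior the paragraph describes.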
For the design of the similarity function between word vectors, an asymmetric similarity evaluation function is introduced, which uses the pointwise mutual information (PMI) of the vectors of the two words. In addition, to account for the input order of the two words, a bias variable $b_{w_1}$ is added, as shown in formula (6). Because we also use the difference between the two vectors in the comparison operation, and then take the maximum value as the difference variable, we update the similarity expression as shown in formula (7). This helps to represent antonyms better in the later word pair comparison.
In the following formula (8),
For a more refined representation, such as formula (9), each vector $q_t$ in $Q$ is taken as a dot product with each vector $k_s$ in $K$ to obtain a weight vector. After the softmax operation, the weights are multiplied by the matrix $V$, which is equivalent to encoding the vector $q_t$ from dimension $d_k$ to dimension $d_v$.
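The description above matches dot-product attention. The sketch below is a plain-Python illustration in which each query is scored against every key, the scores are normalized with softmax, and the result weights the rows of V. The division by the square root of d_k is the common scaled variant and is an assumption here, as are the toy matrices.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    """Encode each query (dimension d_k) as a weighted sum of rows of V (dimension d_v)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(weights[s] * V[s][j] for s in range(len(V)))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                          # one query, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0]]              # two keys
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # two values, d_v = 3
encoded = attention(Q, K, V)
```

The query is closer to the first key, so the encoded vector lies nearer to the first row of V, and its length is d_v rather than d_k.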
Intuitively, a word in a sentence should be more relevant to the words around it, and those words should receive a greater weight. Therefore, the distance $d_{t-s}$ between the word and the other words is introduced into the calculation of the weight, as shown in formula (10) and formula (11).
The use of attention mechanism can effectively improve the effect of long sentence translation and word alignment, as shown in Fig. 1.

Application of attention mechanism in machine translation.
The soft alignment relationship between the words in the premise sentence and the words in the hypothetical sentence is found, and these words are linearly operated on the vectors, and finally spliced together for classification. This model achieves better results with fewer parameters, and the process is divided into the following four steps:
(1) Input representation: In the most basic model, the sequence of word vectors is used directly to represent this sentence.
(2) Attention interaction layer: In this step, the Attention weight of the word pair is obtained through the F function, and the weight between sentences is calculated. It can be seen from the following formulas (12) (13) that park in (park, outside) is the word in the premise sentence, and outside is the word in the hypothesis sentence.
(3) Comparison layer: The word pairs obtained in this step of the attention interaction layer are stitched together. G (a, β) is to directly splice the two word vectors together and pass a multi-layer neural network. In the subsequent improvement, the interaction of the two word vectors is also added. For example, the difference between the word vector and the inner product of the word vector are shown in the following formula.
(4) Aggregation layer: This step is to add the word pair vector G of the same sentence, as shown in formula (16). In this way, two sentence vectors v1, v2 are obtained, the two vectors are spliced, and then sent to the classifier H for classification.
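The four steps above can be sketched as follows. Formulas (12) to (16) are not reproduced in this text, so this is only an illustration of the attend, compare and aggregate pattern: F and G are replaced by identity functions instead of learned networks, and the classifier H is left out. All names and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    return [sum(w * v[j] for w, v in zip(weights, vectors))
            for j in range(len(vectors[0]))]

def attend(a, b, F=lambda x: x):
    """Soft-align every premise word with the hypothesis and vice versa."""
    e = [[dot(F(ai), F(bj)) for bj in b] for ai in a]
    beta = [weighted_sum(softmax(row), b) for row in e]
    alpha = [weighted_sum(softmax([e[i][j] for i in range(len(a))]), a)
             for j in range(len(b))]
    return beta, alpha

def compare(words, aligned, G=lambda x: x):
    """Concatenate each word vector with its aligned vector, then apply G."""
    return [G(w + al) for w, al in zip(words, aligned)]

def aggregate(compared):
    """Sum the compared vectors of one sentence into a single sentence vector."""
    return [sum(col) for col in zip(*compared)]

premise = [[1.0, 0.0], [0.0, 1.0]]
hypothesis = [[1.0, 0.0]]
beta, alpha = attend(premise, hypothesis)
v1 = aggregate(compare(premise, beta))
v2 = aggregate(compare(hypothesis, alpha))
features = v1 + v2  # concatenated input for a classifier H
```

In a trained model F, G and H would be feed-forward networks, and the interaction terms mentioned in step (3), such as the difference and the element-wise product, would be appended inside `compare`.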
At the input layer, the hidden layer of a Bi-LSTM is used to output the encoding of the sentence. At the interactive attention layer, the attention vector between the two sentences is obtained, which is the weighted representation of each word in the first sentence relative to each word in the second sentence. In this way, a soft alignment like that used in machine translation is obtained. At the local inference layer, Chen et al. add more explicit relationships to compare the differences between word pairs. Among them, a and b are the word vector sequences of the premise sentence p and the hypothesis sentence h, as shown in formula (17) and formula (18).
Unlike the previous dimensionality reduction directly through MLP, this layer hopes to model the sequence information of the local inference layer.
At the aggregation layer, Chen et al. argue that with variable-length input the summed value varies with the sentence length, which makes the model insufficiently robust, so no summation is performed. Instead, the average and the maximum of the word pair vectors in the same sentence are taken, concatenated, and fed into the multi-layer perceptron classifier, as shown in formula (20).
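The motivation for replacing the sum with average and maximum pooling can be seen in a small sketch. Formula (20) is not reproduced here; the function below only illustrates the pooling step, and its name and the toy inputs are hypothetical.

```python
def avg_max_pool(vectors):
    """Concatenate the column-wise average and maximum of a sequence of vectors."""
    cols = list(zip(*vectors))
    avg = [sum(c) / len(c) for c in cols]
    mx = [max(c) for c in cols]
    return avg + mx

def sum_pool(vectors):
    """Column-wise sum, the aggregation that grows with sentence length."""
    return [sum(c) for c in zip(*vectors)]

short_sentence = [[1.0, 2.0]] * 2
long_sentence = [[1.0, 2.0]] * 10
pooled_short = avg_max_pool(short_sentence)
pooled_long = avg_max_pool(long_sentence)
```

With identical word pair vectors, the summed representation of the long sentence is five times larger than that of the short one, while the pooled representations coincide.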
Compared with the previous method of using context-free Word2vec word vectors as input, context-sensitive word vectors bring in more pre-training information. Peters et al. proposed the deep contextualized word representation ELMo. The steps are mainly divided into four parts: character-based input, a bidirectional RNN language model coding unit, a deep bidirectional RNN, and linear splicing of the output vectors.
(1) Character-based input: By CNN, each character in the word is convolved to obtain word features. This is more effective when there are many unregistered words in the thesaurus in a specific task.
(2) Bidirectional RNN language model: Similar to Peters et al., the word representation is used as input and fed into the LSTM structure in word order, as shown in formula (21). Each output layer has k outputs.
(3) Deep: By stacking L Bi-LSTM coding layers, we obtain the final representation $R_k$ of the k-th word, as shown in formula (22).
(4) Linear stitching: The outputs of all layers are linearly combined to form the output for the k-th token. γ represents the task-related scaling factor.
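The linear combination in step (4) can be illustrated with the form usually given for ELMo, in which the layer outputs for a token are mixed with softmax-normalized weights and scaled by γ. The formula itself is missing from this text, so the softmax normalization, the function name and the toy layer outputs are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def elmo_combine(layer_outputs, layer_logits, gamma=1.0):
    """Mix the per-layer vectors of one token: gamma * sum_j s_j * h_j."""
    s = softmax(layer_logits)
    dim = len(layer_outputs[0])
    return [gamma * sum(s[j] * layer_outputs[j][d] for j in range(len(s)))
            for d in range(dim)]

# Hypothetical outputs of three layers for a single token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_vec = elmo_combine(layers, [0.0, 0.0, 0.0], gamma=3.0)
```

With equal logits every layer receives the same weight, so the result is the layer average scaled by γ. In a downstream task the logits and γ would be learned.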
We use the pre-trained ELMo model. Previously, in the input layer of the downstream NLP task, the context-free word sequence $[x_1, x_2, \cdots, x_k]$ is obtained through lookup. There are two ways to introduce ELMo. One is stitching at the input layer,
Previously, the premise of obtaining word-pair vectors is often to know the relationship between two word pairs and then determine the relationship between the two. In actual use, all two pairs of words in the two sentences are compared, so that there will be a large number of unrelated word pairs, which will bring a lot of noise.
(1) Attention interaction layer: When calculating the attention, we take two corresponding words a and b in the sentence pair. If a and b satisfy one of the specified relationships, a weight λ is added to the attention value computed by the neural network. λ is a hyperparameter that represents the strength of the introduced knowledge vector, and the candidate values used in the experiment are 0.001, 0.01, 0.1, 1, 10, 20 and 50. $r_{ab}$ represents the external knowledge vector of word a and word b. In the KIM proposed by Chen et al., if the feature vector is non-zero, the previous attention matrix is updated as $\alpha_{a,b} = \alpha_{a,b} + \lambda$, where $\alpha_{a,b}$ represents the attention weight of word a and word b originally calculated by dot product. We then made the following improvements using the knowledge vectors we obtained.
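The update described above can be sketched in a few lines: the dot-product attention score of a word pair receives a bonus λ whenever the pair is related in the external knowledge base. The set `related_pairs` stands in for the check that the knowledge vector is non-zero, and it, the function name and the toy vectors are hypothetical.

```python
LAMBDA_GRID = [0.001, 0.01, 0.1, 1, 10, 20, 50]  # candidate values listed in the text

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def knowledge_attention(premise, hypothesis, vectors, related_pairs, lam=1.0):
    """Unnormalized attention scores with a knowledge bonus for related word pairs."""
    scores = {}
    for a in premise:
        for b in hypothesis:
            base = dot(vectors[a], vectors[b])            # original dot-product score
            bonus = lam if (a, b) in related_pairs else 0.0
            scores[(a, b)] = base + bonus
    return scores

vectors = {"hot": [1.0, 0.0], "cold": [0.0, 1.0], "table": [0.0, 1.0]}
related_pairs = {("hot", "cold")}  # e.g. an antonym pair found in WordNet
scores = knowledge_attention(["hot"], ["cold", "table"], vectors, related_pairs)
```

Although "cold" and "table" share the same toy vector, only the related pair is boosted, which is how the external knowledge steers the alignment.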
For the supervised relationship vector, the introduction method is to input two words into the word pair relationship prediction model. Because multi-label classification is used, each position represents the probability of the corresponding category, and the weight of inference for each category needs to be learned, so we set a variable W in formula (24). The dimension of each variable is the same as the dimension of the relation vector, and each position represents the weight of this dimension in inference.
For the entity vector of TransR, according to the previous description, the difference between the head entity and the tail entity is the entity vector. Because at the attention level two words are input to obtain the relevance or matching weight of these two words, formula (25) is used. According to the previous TransR algorithm, the entity vector is first projected into each relationship space. With 18 relationships, the projection is a matrix of shape [18, 100], which represents the projection of the vector under the 18 relationships. The matrix M has shape [18, 100, 100] and holds the entity projection matrix corresponding to each relationship. After $v_a - v_b$ is computed, the relationship vector is obtained under the corresponding relationship projection matrix. After that, we multiply this relationship vector by each relationship vector, so that we get the scalar of the corresponding relationship, that is, the predicted relationship vector [18, 100]. Each position indicates the probability that the word pair has that relationship and is then multiplied by the category weight vector W. Finally, we get an attention scalar.
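Formula (25) is not reproduced in this text. The sketch below is one hedged reading of the procedure: the difference of the two word vectors is projected into every relationship space, each projection is compared with that relationship's own vector by a dot product, and the per-relationship scores are combined with the weight vector W. The dimensions are reduced to 2 relationships of size 2 for readability (the text uses 18 relationships of dimension 100), and all names and values are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(vec, mat):
    """Row vector times matrix: project a vector into a relationship space."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]

def relation_attention(v_a, v_b, rel_mats, rel_vecs, W):
    """Attention scalar for a word pair from TransR projections."""
    diff = [x - y for x, y in zip(v_a, v_b)]                    # entity vector v_a - v_b
    projected = [matvec(diff, M) for M in rel_mats]             # shape [R, d]
    per_relation = [dot(p, r) for p, r in zip(projected, rel_vecs)]  # shape [R]
    return dot(per_relation, W)                                 # single scalar

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
rel_mats = [identity, swap]
rel_vecs = [[1.0, -1.0], [1.0, 1.0]]
W = [0.5, 0.5]
scalar = relation_attention([1.0, 0.0], [0.0, 1.0], rel_mats, rel_vecs, W)
```

The resulting scalar can be added to the dot-product attention score of the word pair in the same way as the weight λ.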
For the antonym vector, we construct formula (26). According to the vector product distribution of antonyms in Chapter 2, the dot product of two synonym vectors is close to 1, the dot products of antonym vectors are close to each other, and the dot product of unrelated words is close to 0. Therefore, the absolute value of the product of the two vectors is added as an additional offset.
(2) Local inference layer: Chen et al. directly add the $\alpha_{ai}$-weighted sum of the related entity vectors here. This introduces more weight information from the alignment vector, which is equivalent to introducing a soft-alignment weighted sum of feature vectors. In fact, the purpose is to find the relationship vector of the word pair after finding the word pair. The method in the KIM proposed by Chen et al. is formula (27). Among them, $c_a$ represents the representation of the word a in the hypothetical sentence relative to the entire premise sentence, and $v_{ai}$ represents the word pair vector we obtained earlier. Usually, the two words of a pair are known in advance and only the relationship between them is judged.
However, in actual use, all two pairs of words in the two sentences will be compared, so that a large number of unrelated word pairs will appear, and it will bring a lot of noise. Therefore, we try to directly take the vector of the word pair with the largest attention weight as the feature vector. The word a in the premise sentence is relative to the feature vectors of all words in the hypothesis sentence, as shown in formula (28).
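The contrast between the soft-alignment weighted sum and the proposed selection of the word pair with the largest attention weight can be sketched as follows. Formulas (27) and (28) are not reproduced in this text, so both functions are illustrative readings and the weights and pair vectors are made up.

```python
def weighted_pair_feature(alphas, pair_vectors):
    """Soft alignment: attention-weighted sum over all word pair vectors."""
    return [sum(a * v[j] for a, v in zip(alphas, pair_vectors))
            for j in range(len(pair_vectors[0]))]

def max_pair_feature(alphas, pair_vectors):
    """Hard selection: keep only the pair vector with the largest attention weight."""
    best = max(range(len(alphas)), key=alphas.__getitem__)
    return pair_vectors[best]

# Attention of one premise word over three hypothesis words (hypothetical values).
alphas = [0.1, 0.7, 0.2]
pair_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
soft = weighted_pair_feature(alphas, pair_vectors)
hard = max_pair_feature(alphas, pair_vectors)
```

The hard selection discards the contribution of weakly aligned and mostly unrelated pairs, which is the noise the paragraph above aims to remove.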
(3) Aggregation layer: The input of this layer is the result G(a - b, a + b, a * b) of the previous comparison layer, whose shape follows the length of the string. This layer compresses the two-dimensional matrix (composition) into a one-dimensional vector. The methods used previously are direct summation or averaging. The WordNet information is used here.
On this basis, the English text recognition effect of the model in this research is analyzed. The database used in this research consists of handwritten text images collected from the internet for recognition, as shown in Fig. 4.

Model diagram of textual entailment based on decomposition of attention.

Textual entailment model fused knowledge vector.

English character recognition database.
The offline English handwriting database CASIA-HWDB is used to train and evaluate models. All data is segmented and annotated at the character level, and each data set is divided into standard training and test subsets.
The network is implemented using the Caffe framework. The experimental platform is an Intel Core i7-7100k 3.6 GHz CPU and an NVIDIA GeForce GTX 1070 GPU (8GB). A stochastic gradient descent algorithm combined with learning rate annealing is used to train the network. For all text line images, the image is padded to a height of 64 if its height is less than this value, or scaled to a height of 64 if its height is greater than this value. The initial learning rate is 0.1, the weight decay coefficient is $10^{-7}$, the mini-batch size is 64 images, and convergence is achieved after 50 hours of training. The results are shown in Table 2 and Fig. 5.
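The height normalization rule can be illustrated with a small function that works on an image stored as a list of pixel rows. The centered padding, the white background value of 255, the nearest-neighbor scaling and the preservation of the aspect ratio are assumptions, since the text only states the target height of 64.

```python
TARGET_H = 64

def normalize_height(image, background=255):
    """Pad short images and downscale tall ones so that the height becomes 64."""
    h, w = len(image), len(image[0])
    if h < TARGET_H:
        # Pad with background rows, split between top and bottom.
        top = (TARGET_H - h) // 2
        bottom = TARGET_H - h - top
        blank = [background] * w
        return ([blank[:] for _ in range(top)]
                + [row[:] for row in image]
                + [blank[:] for _ in range(bottom)])
    if h > TARGET_H:
        # Nearest-neighbor downscaling that keeps the aspect ratio.
        new_w = max(1, round(w * TARGET_H / h))
        return [[image[y * h // TARGET_H][x * w // new_w] for x in range(new_w)]
                for y in range(TARGET_H)]
    return [row[:] for row in image]

short_line = [[0] * 100 for _ in range(32)]    # 32 x 100 image
tall_line = [[0] * 200 for _ in range(128)]    # 128 x 200 image
padded = normalize_height(short_line)
scaled = normalize_height(tall_line)
```

After this step every text line image has the same height, so lines can be batched by padding along the width only.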
Network training test data
As training proceeded, the training loss began to decrease significantly. After 17,000 iterations, the validation accuracy began to exceed the training accuracy, which showed that the model did not overfit. After 24,000 iterations, the loss value and accuracy tended to be stable, and the final test accuracy reaches 92.25%. Character recognition accuracy (AR) and line recognition accuracy (LRA) are used to evaluate network performance. The character recognition accuracy is defined as the proportion of correctly recognized Chinese characters to the total number of Chinese characters, and a text line is counted as successfully recognized only if every character in the line is correctly recognized. Line recognition accuracy is defined as LRA = |C| / |T|, where C and T are the number of correctly recognized text lines and the total number of text lines, respectively. In addition, in order to illustrate the effectiveness of the method in this paper, the results are compared and shown in Table 3.
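The two metrics can be computed as in the following sketch. The position-by-position character comparison is a simplifying assumption (an edit-distance alignment is a common alternative when predictions and references differ in length), and the example strings are made up.

```python
def evaluate(predictions, references):
    """Return (AR, LRA): character accuracy and line accuracy."""
    correct_chars = total_chars = correct_lines = 0
    for pred, ref in zip(predictions, references):
        total_chars += len(ref)
        # Count characters that match at the same position.
        correct_chars += sum(p == r for p, r in zip(pred, ref))
        # A line counts only if every character is right: LRA = |C| / |T|.
        correct_lines += pred == ref
    return correct_chars / total_chars, correct_lines / len(references)

ar, lra = evaluate(["hello world", "gocd day"], ["hello world", "good day"])
```

One wrong character lowers AR only slightly but makes the whole line count as a failure, which is why LRA is the stricter of the two metrics.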
Accuracy of handwritten English character recognition
In order to study the performance of the model in this paper, the model proposed in this study is compared with the neural network recognition model. The neural network recognition model is defined as model A, and the model proposed in this study is defined as model B. A total of 40 groups of pictures are recognized, and each group contains 100 text pictures. The comparison is made in terms of recognition accuracy and recognition speed. The comparison results are shown in Tables 4 and 5 and in Figs. 6 and 7.
Comparison table of algorithm recognition accuracy
Comparison table of algorithm recognition speed

Curve of network training test.

Comparison chart of recognition accuracy.

Comparison diagram of recognition speed.
It can be seen from Figs. 6 and 7 that in the recognition accuracy rate, this research algorithm is better than the neural network recognition model, and in recognition speed, this research algorithm takes less time than the neural network algorithm. Therefore, the recognition speed of the algorithm in this study is faster. In general, the performance of the model in this study is better than that of the neural network.
This article takes handwritten English as an entry point, elaborates and analyzes the difficulties of text recognition and the deficiencies of current research, and puts forward an effective solution, which is verified through experiments. In the application of textual entailment, because textual entailment has its own knowledge verification characteristics, we try to apply it to the short text scoring task of online MOOCs. In practice, we decompose the problem into a text matching problem between the question and the answer and a textual entailment problem between the answer and the standard answer. Starting from models pre-trained on text matching and on SNLI, this study continues training on short text scoring data sets, and the final result shows a certain improvement, which proves the usability of the designed mechanism. At the same time, the results prove the value of the approach. In order to study the performance of the model in this paper, it is compared with the neural network recognition model. The results show that the performance of the model in this study is better.
Acknowledgments
This paper was supported by (1) 2015 “Comparative Study on Horqin Culture and Buryatia Culture” (Participants), Research Project of Higher Education Institutions of Inner Mongolia Autonomous Region in 2015 (NJZC187); (2) 2016 “The Status of Horqin Culture English Translation and the Solutions” (Participants) for Inner Mongolia Higher Education Institutions 2016 (NJZC16174).
