Chinese long text similarity calculation of semantic progressive fusion based on Bert

Abstract

Text similarity is an important index for measuring how alike two or more texts are, and it is widely used in natural language processing tasks. With the maturity of deep learning technology, a large number of neural network models have been used to calculate text similarity and have achieved good results on sentences and short texts. Among them, the Bert model has become a research hotspot in this field due to its excellent performance. However, existing similarity algorithms perform poorly on long texts, because they cannot extract the richer semantic information hidden in the structure of long documents. This paper takes Chinese long text as the research object, proposes a long text similarity calculation method that uses sentence sequences instead of word level sequences, and constructs a long text semantic representation model with semantic progressive fusion. The aim is to solve the practical problems faced by applications and natural language processing tasks related to long text semantics, and to break through the bottleneck of long text similarity calculation.

Keywords

Natural language processing, long text similarity, Bert model, transformer

1. Introduction

Text similarity computation is both a basic research topic and a hot topic in the field of natural language processing, and it is widely used in intelligent search, text mining, automatic question answering, recommendation systems and other fields. The initial techniques were based on statistical information of text, which were simple in principle and easy to implement. However, these early methods were unable to accurately understand the semantic information of text. With the maturity of deep learning technology, a large number of neural network models have been used to calculate text similarity, and have achieved good results in sentence or short text similarity calculation tasks. In particular, the pre-trained language model Bert was proposed in 2018 and greatly improved the accuracy of NLP tasks through pre-training and fine-tuning. However, these methods have not performed well on Chinese long texts, mainly because the composition structure of a long text is more complex than that of a sentence or short text, so existing methods cannot extract the richer semantic information hidden in that structure. Therefore, calculating semantic similarity for long texts is more difficult, and the need for it in applications is more urgent.

This article takes Chinese long texts as the research object and adopts a three-level progressive structure from “word to sentence to text” to maximize the preservation of the true semantics of long texts based on the characteristics of their grammatical composition structure. It proposes a semantic progressive fusion method for long texts based on Bert model and a model algorithm suitable for calculating the similarity of long texts, so as to solve the applications related to long text semantics or the practical problems faced by natural language processing tasks, break the bottleneck of long text similarity calculation, and promote the development of NLP applications.

2. Related theories and technologies

2.1 Text similarity calculation

Text similarity calculation is a basic research topic in the field of natural language processing. It quantifies the similarity between two or more texts using different algorithms and models in order to support various text processing tasks. Common text similarity calculation methods include the N-gram model, edit distance, the Jaccard similarity coefficient, etc. The principle of this kind of algorithm is relatively simple, but literal distance alone cannot capture the deep semantic information of the text. Therefore, it is necessary to introduce a model that digitizes the text, representing it as a series of vectors capable of expressing text semantics, so that the correlation between texts can be captured in this high-dimensional space and the similarity between text semantics can be measured [1].
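As a minimal sketch of two of the literal-distance methods named above, the following computes a Jaccard coefficient over character bigrams and a Levenshtein edit distance. The function names and the choice of character bigrams are illustrative assumptions, not part of the method proposed in this paper.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard coefficient |A & B| / |A | B| over character n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two texts too short to have n-grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def edit_distance(a, b):
    """Levenshtein distance via dynamic programming with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Two sentences built from the same characters but with different meanings
# ("I like you" / "you like me") still differ only at the literal level.
print(jaccard_similarity("我喜欢你", "你喜欢我"))
print(edit_distance("我喜欢你", "你喜欢我"))
```

Both measures only compare surface strings, which is why they cannot tell whether two differently worded texts express the same meaning.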

Earlier similarity calculation models mostly built text vectors from word vectors, generally by taking a weighted average of the word embeddings. The biggest drawback of this method is that it ignores important information such as the grammar and word order of the text, and cannot accurately represent the semantics of the text. Therefore, the calculation effect of these models is not ideal. In recent years, with the maturity of deep learning technology, more and more neural network models have been used for text semantic similarity calculation. Among them, the more active models include the twin tower model DSSM, the twin model Siamese LSTM, the interactive model BIMPM, etc., which are basically improvements on original models such as CNN, RNN, and LSTM. In 2018, Google officially launched the Bert pre-trained language model, which achieved state-of-the-art results in this field with its excellent performance and became a watershed in the field of text similarity calculation. Many subsequent studies mainly focus on optimizing and improving the Bert model.
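The loss of word order under averaging can be seen in a small sketch. The three-dimensional vectors below are made-up toy values, not real embeddings; the point is only that averaging is invariant to the order of the words.

```python
def average_embedding(tokens, table):
    """Average the word vectors of a token sequence (unweighted mean)."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for token in tokens:
        for k, value in enumerate(table[token]):
            total[k] += value
    return [x / len(tokens) for x in total]


# Hypothetical toy vectors for "dog", "bites", "person".
toy_vectors = {
    "狗": [1.0, 0.0, 0.5],
    "咬": [0.0, 1.0, 0.0],
    "人": [0.5, 0.5, 1.0],
}

dog_bites_person = average_embedding(["狗", "咬", "人"], toy_vectors)
person_bites_dog = average_embedding(["人", "咬", "狗"], toy_vectors)

# The two sentences mean different things but receive the same vector.
print(dog_bites_person == person_bites_dog)
```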

2.2 Bert model

As a pre-trained language model, Bert (Bidirectional Encoder Representations from Transformers) [2] is considered to be one of the most advanced and popular models in the field of natural language processing. Compared to other pre-trained models, Bert uses a bidirectional Transformer encoder to capture the bidirectional relationships in sentences more thoroughly, which makes the model perform well on various NLP tasks.

The use of the Bert model can be divided into two stages, namely the pre-training stage and the fine-tuning stage [3]. In the pre-training stage, the Bert model uses a bidirectional Transformer structure to train on a large amount of unlabeled text data, obtaining universal word representations and contextual representations of sentences, which are stored as the model’s pre-trained weights. In the fine-tuning stage, the model is initialized with these pre-trained weights and then fine-tuned on the data of a specific downstream task, so that it better adapts to the task requirements. Therefore, fine-tuning the pre-trained model can effectively achieve the task of text similarity calculation.

2.3 Transformer – bidirectional encoder

The structure that plays an important role in the Bert model is the Transformer encoder. Compared to the LSTM and GRU models that previously dominated the field, the Transformer encoder relies on the attention mechanism, which has two significant advantages: firstly, it improves training efficiency, because the computation can be parallelized across distributed GPUs; secondly, it captures semantic associations over longer intervals more effectively when analyzing and predicting long texts [4]. The Transformer is the first model that relies entirely on the self attention mechanism to compute representations of its input and output, and it is superior to the previous RNN when accuracy, performance, and model complexity are considered together [5].

From the article “Attention is all you need” published by Google, it can be seen that Transformer is an Encoder-Decoder model, with the encoder on the left and the decoder on the right [6]. Both the encoder and decoder are stacked from N identical layers. The encoder is used to extract features from the input and provide effective semantic information for the decoding process, and the decoder outputs the next result of the sequence based on the results of the encoder and the previous prediction.

In natural language processing, the Transformer model has been proven to achieve excellent performance, and the Bert model uses only the encoder part of the Transformer [7]. In the Bert model, the input sequence is first converted by the embedding layer into vectors that serve as the input of the encoder, and these are then transformed into the representation of the output sequence through multiple layers of attention and feed-forward neural networks.

2.4 Self attention mechanism

The self attention mechanism is an important attention mechanism in neural networks and a core part of the Transformer. It generates contextual feature representations by capturing the internal relationships of an input sequence, and it has good parallelism and scalability. Traditional attention mechanisms mainly represent the connections between one text and another, whereas self attention focuses on the associations between the elements inside a single input or output text [8]. In Chinese natural language processing tasks, the same word fuses different semantic information in different contexts and therefore shows different meanings. Following this principle, self attention connects the embeddings of all words in the entire text, uses the context to distinguish the senses of polysemous words, and weakens the influence of the distance between words, enabling the model to capture key information even in long texts.
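The core computation can be sketched as scaled dot-product self attention. To keep the sketch short it is a simplification under stated assumptions: a single head, no learned query, key, or value projections (queries, keys, and values are all the raw input vectors), and toy two-dimensional inputs. It is not the full Bert encoder layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self attention with Q = K = V = x (no learned projections).

    x is a list of token vectors. Returns (outputs, attention_weights).
    """
    dim = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this token to every token in the sequence, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted sum of all token vectors.
        outputs.append([sum(row[j] * x[j][c] for j in range(len(x))) for c in range(dim)])
    return outputs, weights


# Tokens 0 and 2 carry the same content; token 1 is different.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print(weights[0])
```

In this sketch token 0 attends to token 2 exactly as strongly as to itself, even though token 1 sits between them: without positional information, the weight between two positions depends on their content and not on how far apart they are.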

In summary, the Bert model adopts the Transformer structure, which performs better on long sequence data than traditional models such as CNN, RNN, and LSTM. In addition, because it is based on the self attention mechanism, Bert achieves a truly bidirectional representation of text: the model considers the context on both the left and the right during calculation, which makes the word vectors more comprehensive. Although Bert has achieved significant success in NLP tasks, it mainly targets short text processing, and it has two obvious shortcomings on long texts. Firstly, Bert’s input length is limited to 512 tokens. When a text exceeds this limit, the model truncates it and retains only part of the text, which can easily cause loss of semantic information. Secondly, Bert processes the input long text as a whole and understands its meaning through global contextual information. It therefore measures the semantics of the text using word level information only, without considering complex information such as the sentence structure of the long text, which can easily lead to deviations in semantic extraction.

Thus, this paper proposes a long text similarity calculation model based on Bert that replaces the token level sequence with a sentence sequence. On the one hand, the length of the input text is no longer restricted by Bert’s limit of 512; on the other hand, the model takes into account the sentence level structural information of long texts, which enhances the semantic expression of the text vectors and ultimately improves the accuracy of long text similarity calculation.

3. Chinese long text similarity calculation model of semantic progressive fusion based on Bert

When the semantic information of a text is represented as a vector, the similarity calculation between two texts can be transformed into the similarity calculation of two vectors. Therefore, the Chinese long text similarity calculation model proposed in this article mainly consists of two parts. The first part is to obtain the semantic vector representation of long text, and the second part is to calculate the similarity of text. How to accurately obtain the semantic representation of long text is the focus of this research.

3.1 Ideas of the model framework

The accuracy and expressive power of text vectors directly affect the performance of downstream tasks, so high-quality text vectors are particularly important. Considering the grammatical composition structure of Chinese long text, a three-layer structure from “word to sentence to text” is adopted to progressively generate the semantic representation of long text, with each level generating a representation that serves as the input of the next level. That is, the sentence semantic representation is generated from word vectors first, and then the text semantic representation is generated from the sentence semantic representations. Finally, based on the spatial mapping of long text semantic vectors, the similarity between two long texts is calculated using the cosine of the included angle. The algorithm flow of the long text similarity calculation model is shown in Fig. 1.

Figure 1.

Algorithm flow of long text similarity calculation model.

Start: enter two long texts and divide each text into sentences, which are sequentially stored in two lists.

Repeat: call the Bert model in a loop to obtain the semantic representation of each sentence in each list.

In this process, firstly, dynamic word vectors which can accurately fuse contextual semantic information are obtained through the feature extraction of Bert. At the same time, the previous approach of generating sentence representations through simple linear combinations is abandoned, and a self-attention mechanism is adopted to better capture bidirectional semantic dependencies. On this basis, the final sentence vector representation is obtained through pooling strategy.

Then: convert long text into sentence level sequences to obtain the semantic representation of long text.

A model can be trained on the word sequence of a sentence, so by analogy with the structural relationship between a long text and a sentence, it is natural to train on the sentence sequence of a text. Each sentence is therefore treated as a word level token, which turns a long text into a “sentence” at a higher level, and the Transformer model is then used again to train on the sentence sequence of the long text. The result is a long text representation with semantic progressive fusion.

Finally: after obtaining the semantic representation of two long texts, use the cosine of the included angle in vector space to calculate the similarity between the two texts.
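The first three steps of this flow can be sketched as follows. This is a structural sketch only: `split_sentences`, `encode_long_text`, and the two toy stand-ins are hypothetical names introduced here, the set of sentence-ending punctuation is an assumption, and the stand-ins merely take the place of the Bert sentence encoder and the sentence level Transformer so that the sketch runs without a trained model.

```python
import re

# Assumed sentence-ending symbols; the ASCII period is left out so decimals are not split.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text):
    """Split a long text into sentences, keeping the ending punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def encode_long_text(text, encode_sentence, fuse_sentences):
    """Word -> sentence -> text: encode each sentence, then fuse the sentence sequence."""
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text contains no sentences")
    sentence_vectors = [encode_sentence(s) for s in sentences]  # one vector per sentence, in order
    return fuse_sentences(sentence_vectors)                     # one vector per text


# Toy stand-ins (NOT the encoders used in this paper).
def toy_encode_sentence(sentence):
    return [float(len(sentence)), float(sentence.count("的"))]


def toy_fuse(sentence_vectors):
    count = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    return [sum(v[k] for v in sentence_vectors) / count for k in range(dim)]


print(split_sentences("今天下雨。你带伞了吗？带了！"))
print(encode_long_text("今天下雨。你带伞了吗？带了！", toy_encode_sentence, toy_fuse))
```

Passing the two encoders in as arguments keeps the three-level structure visible while leaving the choice of sentence encoder and fusion model open.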

3.2 Dynamic word vector acquisition based on Bert model

Directly using the output vector of [CLS] in Bert as the sentence representation is not very effective. This project therefore uses Bert to obtain dynamic word vectors rich in contextual information, and then uses a more suitable network model to obtain sentence vectors containing richer information. In addition, Bert performs a mask operation on the corpus when pre-training the language model, so that the model learns to predict the masked tokens, which are not visible to it. In the official BERT-base (Chinese) released by Google, Chinese text is segmented at character granularity, without considering Chinese word segmentation. The Joint Laboratory of HIT and iFLYTEK Research (abbreviated as HFL) has applied the full word mask method (whole word masking) to Chinese text, performing [MASK] on all Chinese characters that make up the same word [9]. Compared with the original BERT-Chinese model and Baidu’s open-source Chinese pre-trained model ERNIE, this model has achieved better performance in multiple Chinese tasks. Full word mask requires first segmenting the Chinese corpus into words, and then masking every token that belongs to the same word. The effect of the mask is shown in Table 1.

Table 1
Example of Generating a Full Word Mask in Chinese

Illustration	Sample
Original text^a	前臂屈伸是一个极为普通易做的动作。(Forearm flexion and extension is an extremely common and easy action to perform.)
Segmentation text	前臂屈伸是一个极为普通易做的动作。
Raw mask input	前臂 [MASK] 伸是一个极为 [MASK] 通易做的动 [MASK]。
Full word mask input	前臂 [MASK] [MASK] 是一个极为 [MASK] [MASK] 易做的 [MASK] [MASK]。

^aIn order to preserve the segmentation structure and masking features of Chinese text, only the original text has been translated into English here.

Therefore, using the full word mask method of HFL, the given long text is first processed by Chinese word segmentation and then fed into the Bert model to obtain a dynamic word vector for each word.
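The difference between the two masking schemes in Table 1 can be sketched as follows. This is a simplified illustration: the word segmentation of the example sentence is an assumption inferred from the mask rows of Table 1, the masking probability is a free parameter, and the replacement rules used in real Bert pre-training (random or unchanged tokens for part of the selected positions) are left out.

```python
import random


def char_level_mask(words, mask_prob, rng):
    """Original-style masking: every character is masked independently."""
    output = []
    for word in words:
        for char in word:
            output.append("[MASK]" if rng.random() < mask_prob else char)
    return output


def whole_word_mask(words, mask_prob, rng):
    """Full word mask: if a word is selected, all of its characters are masked."""
    output = []
    for word in words:
        if rng.random() < mask_prob:
            output.extend(["[MASK]"] * len(word))
        else:
            output.extend(list(word))
    return output


# Assumed segmentation of the sentence in Table 1.
words = ["前臂", "屈伸", "是", "一个", "极为", "普通", "易做", "的", "动作", "。"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(" ".join(char_level_mask(words, 0.3, rng)))
print(" ".join(whole_word_mask(words, 0.3, rng)))
```

With character level masking a word such as 屈伸 can be half visible, which lets the model guess the masked character from its neighbor; with full word mask the model has to recover the whole word from the surrounding context.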

3.3 Sentence representation with word sequence information and enhanced semantic features

The acquisition of sentence vectors based on Bert involves two stages: the input stage and the Transformer encoding stage. Firstly, in the input stage, three vectors are extracted to represent the information of the input sentence: the word vector, the segmentation vector, and the position vector. The word vector is the vector corresponding to each token in a sentence. It is pre-trained by the Bert model, contains the semantic and contextual information of each word in the input sequence, and can be adaptively adjusted during subsequent encoding and training. The segmentation vector assigns an identifier to each sentence to distinguish between two different sentences in the input text. The position vector supplements the position information that the Transformer model lacks, adding vectors that represent positions so that the model can understand the positional relationship of each word in the text and thereby identify the semantic correlation between words. These three vectors are added together element by element, forming the initial input vector of Bert. The input layer settings of the Bert model are shown in Fig. 2.
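The composition of the input vector can be sketched as a simple lookup and sum. The embedding tables below are tiny made-up examples with dimension 2 rather than 768, and the layer normalization and dropout that Bert applies after the sum are omitted.

```python
def build_input_vectors(token_ids, segment_ids, token_emb, segment_emb, position_emb):
    """Sum word, segmentation, and position vectors for every input position."""
    vectors = []
    for position, (token, segment) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(token_emb[token], segment_emb[segment], position_emb[position])
        ])
    return vectors


# Hypothetical toy tables: 2 tokens, 1 segment, 2 positions, dimension 2.
token_emb = {0: [1.0, 2.0], 1: [3.0, 4.0]}
segment_emb = {0: [0.5, 0.5]}
position_emb = [[0.0, 1.0], [1.0, 0.0]]

print(build_input_vectors([1, 0], [0, 0], token_emb, segment_emb, position_emb))
```

Because the position vector differs at every position, the same token receives a different input vector depending on where it appears in the sentence.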

Figure 2.

Input layer settings for Bert.

After the input stage, Bert model uses a multi-layer Transformer encoder to process the input vectors, capturing various semantic relationships in the text and generating corresponding output vectors. In addition, this article changes the self attention mechanism of encoder to a multi head attention mechanism to better capture the multiple relationships and subtle differences between each word, thereby obtaining a semantic representation of sentences with more implicit information.

3.4 Semantic representation of Chinese long texts

Although dynamic word vectors are generated based on Bert, its fatal limitation is that the word sequence length can be at most 512 (including [CLS] and [SEP]), which is obviously not sufficient for long texts. Therefore, by analogy with the structural relationship between a long text and a sentence, this article uses sentence sequences instead of word sequences and trains again to obtain the semantic representation of a long text.

Specifically, this article takes the sentence vectors obtained earlier as the input of a Transformer encoder and, mirroring the way token level vectors are generated, trains on the sentence vectors to capture the complex information between sentences in long texts. Figure 3 shows the semantic generation process of a long text. A long text is divided into N sentences according to the sentence ending symbols. After processing with the Bert model, the N sentence vectors are stored in a list in order. The positional encoding of the Transformer model is then applied to these vectors, so as to better capture the structural relationships and semantic information of each sentence in the entire sequence and to improve how well the output vector expresses the semantics of the text.

Figure 3.

Semantic generation process of long text.

The maximum number of words in a sentence is set to 512, so the word level output for one sentence has shape (512, 768). Each text sample is set to 80 sentences: if there are more than 80 sentences, only the first 80 are kept; if there are fewer than 80, the missing sentence vectors are filled with zeros. The resulting input therefore has shape (N, 80, 768), where “N” is the number of text samples, “80” is the maximum number of sentences per text, and “768” is the vector dimension. Finally, a vector representation that integrates the semantic information of the long text is obtained through the Transformer model.
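The truncation and zero filling described above can be sketched as follows. The function name is hypothetical, and the returned mask (1 for a real sentence, 0 for padding) is an added assumption about how padded positions might be marked for the encoder; the text itself only specifies the limit of 80 sentences and the zero filling.

```python
def pad_or_truncate(sentence_vectors, max_sentences=80, dim=768):
    """Bring a list of sentence vectors to exactly max_sentences rows.

    Returns (matrix, mask): extra sentences are dropped, missing ones are
    filled with all-zero vectors, and mask marks which rows are real.
    """
    matrix = [list(v) for v in sentence_vectors[:max_sentences]]
    mask = [1] * len(matrix)
    while len(matrix) < max_sentences:
        matrix.append([0.0] * dim)
        mask.append(0)
    return matrix, mask


# A short text with 3 sentences, shown with small sizes for readability.
short_text = [[0.1, 0.2, 0.3, 0.4]] * 3
matrix, mask = pad_or_truncate(short_text, max_sentences=5, dim=4)
print(len(matrix), mask)
```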

3.5 Similarity calculation of long texts

After obtaining the text vector, the cosine similarity calculation method is used to calculate the similarity of the text vector, and the cosine similarity calculation is shown in the following formula.

$\displaystyle \mathit{Sim}(A,B)=\frac{A\cdot B}{\|A\|\times\|B\|}=\frac{\sum\limits_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum\limits_{i=1}^{n}A_{i}^{2}}\times\sqrt{\sum\limits_{i=1}^{n}B_{i}^{2}}}$

Cosine similarity uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. In general its value lies in [-1, 1], and it lies in [0, 1] when all vector components are non-negative. The closer the cosine value is to one, the closer the angle is to zero degrees, which means that the two vectors are more similar [10].
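A direct implementation of the formula is sketched below. Returning 0.0 for an all-zero vector is a convention chosen here, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; 0.0 is a convention, not a result
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # perpendicular
```

Only the direction of the vectors matters: scaling either vector by a positive factor leaves the result unchanged.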

4. Model experiment and result analysis

4.1 Sentence vector generation and optimization based on Bert

The test data for this experiment come from some datasets collected online, and experts provide similarity judgment labels for these test data. During the experiment, the similarity calculation of sentence vectors generated by Bert clearly deviated from expert judgment, so the Bert model was optimized and improved.

4.1.1 Sentence vectors obtained from Bert model

In the experiment of generating sentence vectors through Bert, the hidden layer output of the last layer is used as the vector representation of the sentence. The similarity between the two texts, calculated through the cosine formula, is then compared with the binary classification labels provided by experts (1 indicates that the two texts are similar, 0 indicates that they are not). The “Expert label” and “Similarity (Bert)” columns of Table 2 show part of the experimental data.

Table 2
Comparison of three experimental data for sentence similarity calculation

ID	Sentence A (The experimental text is in Chinese, which is translated into English here)	Sentence B (The experimental text is in Chinese, which is translated into English here)	Expert label	Similarity (Bert)	Similarity (Siamese Bert)
1	The transaction password has been forgotten and the phone card that was bound to the password has also been lost.	Why am I constantly being asked to change my security password recently? It’s so troublesome.	0	0.4112	0.5541
2	Why haven’t you called me back after borrowing?	Why didn’t you call me after applying for a loan!	1	0.8913	0.8299
3	In 1993, Rafael Puente played for the A-level Cann County Puma.	In 1993, Antonio Muhammad played for the A-level Cann Cougar.	0	0.9898	0.6893
4	The Mississippi River is the fourth longest river in the world, and its basin is one of the three largest black soil regions in the world.	The fourth longest river in the world is the Mississippi River, which is one of the three largest black soil regions in the world.	1	0.9812	0.9888
5	The county system is the most stable first level organizational system in China’s Administrative division system, which has a history of more than 2000 years.	The treatment of mining residues is an option to address pollution issues related to mining area development.	0	0.7449	0.1565
6	A history of China is a history of all ethnic groups in China jointly creating Chinese civilization and their mutual relations	The first seminar on Cantonese culture focused on the historical status and essential characteristics of Cantonese culture.	0	0.8520	0.2719

The experimental results show that sentence similarity calculated from the vectors of Bert’s hidden layer is not ideal. When two sentences share many high-frequency or similar words, the model cannot accurately express their semantics and tends to judge them as similar. In groups 3, 5, and 6 of the experiment, for example, the expert label is 0, yet the similarity scores calculated by Bert are high (0.9898, 0.7449, and 0.8520), indicating that the model considers the two sentences relatively similar. This contradicts the semantic information of the text, so the predictions are not very accurate.

There are two reasons for the above situation: (1) The vector space generated by Bert is anisotropic and the word embeddings present a conical distribution [11]: high frequency words gather at the head of the cone, closer to the origin, while low frequency words are scattered at the tail and are relatively sparse, so high frequency words dominate the vector representation of some sentences. (2) Cosine similarity calculation presupposes that the vectors are expressed in an orthonormal basis, but the high-dimensional vectors output from Bert’s hidden layer cannot guarantee this, so the calculation result is not accurate.
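The effect of a shared dominant direction on cosine similarity can be illustrated with a toy sketch. The three vectors are made-up values that share a large common component, as a stand-in for anisotropic embeddings. Subtracting the mean vector is shown only to expose the effect; it is a simple generic remedy and not the method adopted in this paper.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def center(vectors):
    """Subtract the mean vector of the set from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return [[v[k] - mean[k] for k in range(dim)] for v in vectors]


# Toy "sentence vectors": a large shared component plus a small individual part.
vectors = [
    [11.0, 10.0, 10.0],
    [10.0, 11.0, 10.0],
    [10.0, 10.0, 11.0],
]

raw_similarity = cosine(vectors[0], vectors[1])
centered = center(vectors)
centered_similarity = cosine(centered[0], centered[1])

print(raw_similarity, centered_similarity)
```

Before centering, the shared component makes every pair look almost identical; once it is removed, the individual differences decide the score.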

4.1.2 Improvement and optimization of Bert algorithm

This article improved the structure of Bert to obtain sentence vectors that more fully express sentence semantics. Each sentence pair was fed through a Siamese network [12], that is, into two identical Bert models, to obtain the corresponding embedding vectors, and a contrastive loss was then calculated on these embedding vectors. By minimizing the distance between similar texts and maximizing the distance between different texts, the semantic relationship and similarity between sentences are better represented.
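The objective described above can be sketched with one widely used form of pairwise contrastive loss, L = y·d² + (1 − y)·max(0, m − d)², where d is the Euclidean distance between the two embeddings, y is the label, and m is a margin. The exact loss, distance, and margin used in the experiments are not specified here, so this form and the default margin of 1.0 are assumptions.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(vec_a, vec_b, label, margin=1.0):
    """Pairwise contrastive loss (assumed form, see lead-in).

    label == 1: similar pair, penalize any distance between the embeddings.
    label == 0: dissimilar pair, penalize only if closer than the margin.
    """
    distance = euclidean_distance(vec_a, vec_b)
    if label == 1:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


print(contrastive_loss([0.0, 0.0], [0.3, 0.4], label=1))  # similar pair, small distance
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], label=0))  # dissimilar pair inside the margin
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], label=0))  # dissimilar pair beyond the margin
```

Minimizing this loss pulls the embeddings of similar pairs together and pushes dissimilar pairs apart until they are at least the margin away from each other.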

The “Similarity (Siamese Bert)” column of Table 2 shows the sentence vector similarity results for the improved Bert with the Siamese network. The table shows that the fine-tuned Siamese Bert produces better sentence vectors than the original Bert. Taking the problematic 3rd group from the above experiment as an example, the similarity drops from 0.9898 to 0.6893, which is much closer to the expert label of 0; for groups 5 and 6 it drops to 0.1565 and 0.2719. These results indicate that the model performs well in sentence similarity tasks and can effectively distinguish semantic relationships between sentences.
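One way to read the six rows of Table 2 is to ask whether any single threshold could separate the similar pairs from the dissimilar ones. The sketch below uses only the scores and labels listed in Table 2; the helper name and the notion of a "separation gap" are introduced here for illustration, and six rows are far too few to serve as an evaluation.

```python
def separation_gap(scores, labels):
    """Lowest score of a similar pair minus highest score of a dissimilar pair.

    A positive gap means some threshold separates the two classes perfectly;
    a negative gap means no such threshold exists.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return min(positives) - max(negatives)


# Values copied from Table 2 (IDs 1 to 6).
expert_labels = [0, 1, 0, 1, 0, 0]
bert_scores = [0.4112, 0.8913, 0.9898, 0.9812, 0.7449, 0.8520]
siamese_scores = [0.5541, 0.8299, 0.6893, 0.9888, 0.1565, 0.2719]

print(separation_gap(bert_scores, expert_labels))
print(separation_gap(siamese_scores, expert_labels))
```

For the original Bert the gap is negative, because the dissimilar pair in group 3 scores higher than the similar pair in group 2. For Siamese Bert the gap is positive, so on these rows the two classes no longer overlap.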

4.2 Experimental comparison of long text similarity calculation

4.2.1 Experimental data

This experiment used part of Tan Songbo’s text classification corpus from Fudan University. The model proposed in this article and the original Bert model were both tested on it, and their results were compared. The dataset contains three columns: Text A, Text B, and a binary classification label given by experts.

The experimental parameters are as follows: the maximum length of a single sentence is 512, the maximum number of sentences is 80, and the learning rate of the model is 2e-5. The pre-training model involved in the experiment was Chinese-Bert-WWM released by HFL.

4.2.2 Experimental setting

The long text similarity calculation model in this article replaces token level sequences with sentence sequences. The long text is first split into sentences, and sentence vectors with rich semantics are then output by the pre-trained twin network model. These vectors are saved in a list in order of appearance and converted into a vector matrix, which is position encoded and semantically trained through the Transformer model to obtain a long text vector with semantic progressive fusion. Finally, the similarity of the output text vectors is calculated. The following is the experimental code for obtaining long text vectors with this model:

x1 = create_matrix(embedding1)  # Create input matrix
arr_reshaped1 = reshape_matrix(x1, 1, s_length1, 384)  # Reshape matrix dimensions
end1 = convert_to_variable(arr_reshaped1)  # Convert to torch variable
pe = create_positional_encoding(d_model, dropout, max_len)  # Create positional encoding
pe_result1 = apply_positional_encoding(pe, end1)  # Apply positional encoding
x1 = pe_result1
mask1 = create_mask(1, s_length1, s_length1)  # Create mask
layer = create_encoder_layer(size, create_attn(c), create_ff(c), dropout)  # Create EncoderLayer
en = create_encoder(layer, N)  # Create Encoder
en_result1 = apply_encoder(en, x1, mask1)  # Apply Encoder
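The positional encoding step in the code above is not spelled out. A minimal sketch, assuming it follows the standard sinusoidal formulation from “Attention is all you need” [6], is given below. The function names are hypothetical, dropout is omitted, and the sentence vectors are tiny toy values.

```python
import math


def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)), PE[pos][2i+1] = cos(same angle)."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            table[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                table[pos][even + 1] = math.cos(angle)
    return table


def add_positional_encoding(sentence_vectors, table):
    """Add the encoding of position k to the k-th sentence vector."""
    return [
        [value + offset for value, offset in zip(vector, table[pos])]
        for pos, vector in enumerate(sentence_vectors)
    ]


# Three toy sentence vectors of dimension 4 (real vectors would be far larger).
sentence_vectors = [[0.5, 0.5, 0.5, 0.5]] * 3
pe_table = sinusoidal_positional_encoding(max_len=80, d_model=4)
encoded = add_positional_encoding(sentence_vectors, pe_table)
print(encoded[0], encoded[1])
```

After this step, identical sentence vectors at different positions in the text are no longer identical, which is what allows the encoder to take sentence order into account.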

Figure 4 shows the Chinese long text data used for this experiment; each row is a pair of long texts to be compared. Table 3 lists the similarity results calculated by the model proposed in this article and by the original Bert model on the texts marked in blue. The output of the original Bert is a two-class probability, and the probability of class 1 (similar) is taken to represent the semantic similarity calculated by the original Bert. In the table, the “Similarity” column is the cosine similarity calculated by the proposed model, and the “Probability” column is the semantic similarity value of the original Bert.

Figure 4.

Chinese long text data used in the experiment.

Figure 5.

Vector representation and similarity of group 5.

Table 3

Long text similarity comparison between two models

ID	Text A/Text B	Label	Similarity (our model)	Probability (original Bert)
1:A	Kunju, also known as Kunqiang or Kunqu Opera, was originally named “Kunshan Opera” and originated in Kunshan County, Suzhou Prefecture, in the late Yuan and early Ming dynasties. It is one of the genres of Chinese opera and a treasure of traditional Chinese Han culture and art, especially a treasure in opera art, known as the “ancestor of all kinds of operas”. Kunqu mainly uses Zhongzhou Mandarin as the language of singing and speaking, with drums and boards controlling the rhythm of singing. Qudi, Sanxian, and other accompanying instruments are the main accompaniment instruments, combining singing, recitation, playing, dance, and martial arts.	0	0.6756	0.0014
1:B	Jingju, also known as Pingju, Peking Opera, etc., is one of China’s national treasures and the most influential type of opera in China. It is distributed around Beijing and covers various parts of the country. The Peking Opera performance system named after Mei Lanfang is regarded as the representative of the oriental drama performance system and one of the three major performance systems in the world. Peking Opera is an important form of expression of the traditional Chinese culture, in which a variety of elements of art are used as symbols of traditional Chinese culture.
2:A	In recent years, Tibetan medical personnel have made new progress in research on the prevention and control of mountain diseases. Gaoshan disease is a series of diseases caused by hypoxia in high mountains. In the past, the prevention and research of mountain sickness in Tibet was almost blank, and the mortality rate of patients was very high.	1	0.8095	0.9984
2:B	In recent decades, research on prevention and treatment of mountain sickness has always been the focus and center of medical science and technology research in the Tibet Autonomous Region, and has made great progress in basic theories, clinical treatment, new drug development and new technology utilization of plateau medicine. This not only has guiding significance for the treatment of mountain sickness, but also for aerospace, diving, inland hypoxia, and military medicine
3:A	Forearm flexion and extension is an extremely common and easy action to perform. The accuracy of this movement mainly reflects the proprioception of human muscle movement and the accuracy of movement. It can be measured by an intuitive semicircle (an extremely simple and easy to operate measuring instrument). Therefore, the accuracy of forearm flexion and extension is a good and easy to obtain index to measure the proprioception of muscle movement and movement accuracy.	0	0.0347	0.0003
3:B	Thank you so much for this rain. I could have traveled by car, but I had to walk and get wet. Although the night was as dark as ink, and there were wolves and ghosts everywhere, some light flashed by from afar, but I saw countless drops of rain falling in front and behind me, a different kind of flower blooming. Two years ago, I took a boat through Longmen and the Yellow River surface like copper was filled with vortex like peony. I was so happy to say that I stepped on the luxurious carpet. Is tonight’s flower blooming the realm of carpet spreading?
4:A	Boil water in a pot, add a little salt to the water, and blanch the cut tofu in cold water to remove the beany smell. Cut the scallions, ginger, and garlic and set aside. Stir fry minced pork in the pot, add Doubanjiang to stir fry until fragrant, add soy sauce and cooking wine to stir well. Then perform three rounds of stirring, add salt, chicken essence, and sugar, and shake evenly to remove from the pot.	1	0.9186	0.9991

4:B	Soak tofu slices in light salt water for 10 minutes, then drain off the water. Wash and chop chopped scallions and ginger. In the pot, add hot oil into the onion and ginger, stir fry the minced meat until the color changes, add Pixian Doubanjiang, stir fry until the flavor changes. Add tofu, water, soy sauce, and sugar, then add pepper powder, sprinkle with water starch and thicken. Simmer for a while, then remove from the pot and serve on a plate.
5:A	With the development of global industrialization, energy scarcity will be a persistent problem, so many architectural designers have begun to develop and construct a new generation of “ecological buildings” tailored to local conditions. Sanzawa Residential Company in Japan has recently successfully developed a new type of solar residential building that can meet 85% of the energy needs of households. The new residence has good insulation and sealing performance, and the roof is equipped with solar cells. When the power supply is insufficient, it can be supplemented by cheap late night power supply. When the solar power is surplus, it can also be sold to the power company.	0	0.5224	0.0017
5:B	Ecological architecture is based on the local natural ecological environment, utilizing the basic principles of ecology, architectural technology, and modern scientific and technological means to reasonably arrange and organize the relationship between architecture and other related factors, making the architecture and environment an organic combination. At the same time, it has good indoor climate conditions and strong biological climate regulation ability to meet the comfortable living environment of people, forming a virtuous cycle system between people, buildings, and the natural ecological environment. We can regard the ecological building as an ecological system. Through the design of the architectural designer, the material and energy can be recycled and converted in an orderly manner within the building ecosystem, so as to obtain a low energy consumption, pollution-free, efficient, and ecologically balanced building environment.

Taking group five as an example, Fig. 5 shows the long text vectors obtained by the proposed model and the final similarity calculation result.
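As a rough illustration of the last step, the following sketch compares two long text vectors. It assumes that the final score is a cosine similarity, which this section does not state explicitly, and the two vectors are hypothetical placeholders, not the vectors shown in Fig. 5.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors standing in for two long text vectors.
text_a_vector = [0.2, 0.7, 0.1]
text_b_vector = [0.3, 0.6, 0.2]
print(round(cosine_similarity(text_a_vector, text_b_vector), 4))
```

Cosine similarity depends only on the direction of the vectors, so texts of different lengths can be compared once each has been reduced to a fixed-size vector.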

4.3 Experimental results and analysis

This experiment verifies the effectiveness of the long text similarity calculation model proposed in this article. With the original Bert model as the reference group, the semantic similarity results generated by the two models are compared with the labels to evaluate their ability to handle text similarity.

Table 3 shows that the long text similarity calculation model proposed in this article achieves good classification performance on the long text dataset: it represents the semantic information of long texts well and improves the accuracy of long text semantic similarity. In the second, third, and fourth groups, the experimental results match the actual labels, and the similarity produced by the proposed model fits the text content better and is more discriminative. Compared with the original Bert model, the proposed model captures the semantic information of long texts more accurately. In the first and fifth groups, the two texts are semantically related to some degree, but not enough to be classified as similar, and here the proposed model describes the semantic relationship between the texts better. In the first group, the Bert model gives a score of 0.0014 and the proposed model gives 0.6756; in the fifth group, the Bert model gives 0.0017 and the proposed model gives 0.5224. In both groups, the scores of the proposed model reflect that the two long texts are similar in some respects but still differ in others. In summary, the proposed model has clear advantages in the long text semantic similarity task, with higher accuracy, stronger representation ability, and better adaptability.
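The comparison above can be reproduced with a small sketch that lists the groups in which the two models diverge most. The scores for groups two to five are read from Table 3 and those for group one are the values quoted in the text; the function name and the 0.5 margin are illustrative assumptions, not part of the experiment.

```python
# (group, similarity from the proposed model, probability from the original Bert)
SCORES = [
    (1, 0.6756, 0.0014),
    (2, 0.8095, 0.9984),
    (3, 0.0347, 0.0003),
    (4, 0.9186, 0.9991),
    (5, 0.5224, 0.0017),
]


def divergent_groups(scores, margin=0.5):
    """Return the groups whose two scores differ by more than `margin`."""
    return [group for group, ours, bert in scores if abs(ours - bert) > margin]


print(divergent_groups(SCORES))  # [1, 5]
```

The two groups returned are the ones discussed above, where the texts are partly related and the original Bert probability is close to zero while the proposed model gives an intermediate score.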

5. Conclusion

Taking full account of the structural features of Chinese long texts, this paper proposes a Chinese long text similarity calculation model with semantic progressive fusion based on Bert. The model first converts the long text into a sentence sequence, which replaces the original word sequence for training. It then adds Transformer and multi-head attention feature training at the level of bidirectional semantics and internal relevance, obtaining a text semantic representation that combines context information with the hidden internal features of the text and ultimately improving the accuracy of text similarity calculation. The experimental results show that the proposed model effectively addresses the input length limit of the original Bert model and its loss of sentence-level structural information, and that it significantly improves the accuracy of similarity calculation.
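The idea of replacing the word sequence with a sentence sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not the model itself: `toy_encode` is a hypothetical stand-in for a Bert sentence vector, the attention step is a single head without learned projections instead of the Transformer and multi-head attention layers, and mean pooling is swapped in as the final fusion step.

```python
import math
import re


def split_sentences(text):
    """Split text into sentences, keeping the sentence-ending punctuation."""
    pattern = r"[^。！？!?]+[。！？!?]*"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


def toy_encode(sentence, dim=8):
    """Hypothetical stand-in for a Bert sentence vector: character bucket counts."""
    vec = [0.0] * dim
    for ch in sentence:
        vec[ord(ch) % dim] += 1.0
    return vec


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention without learned projections."""
    d = len(vectors[0])
    fused = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        fused.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return fused


def mean_pool(vectors):
    """Average the sentence vectors into one long text vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def encode_long_text(text, dim=8):
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text contains no sentences")
    sentence_vectors = [toy_encode(s, dim) for s in sentences]
    return mean_pool(self_attention(sentence_vectors))


doc_vector = encode_long_text("今天下雨。我走路回家！你呢？")
print(len(doc_vector))  # 8
```

Because each sentence is encoded separately, the length of the whole text is no longer bounded by the input limit of a single encoder call; only each individual sentence has to fit.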

Acknowledgments

This work was supported by the 2022 Henan Province Key R&D and Promotion Special Project (Science and Technology Research): 222102210259, and by the Innovation Team of the Ministry of Education in Oracle Bone Inscription Information Processing: 2017PT35.
