Siamese capsule network with position correlation and integrating articles of law for Chinese similar case matching

Abstract

The Chinese similar case matching task compares two candidate case texts with a given anchor text and determines which candidate is more similar to the anchor. The task plays an important role in the legal domain and has attracted many researchers. Previous approaches compare legal texts only at the semantic level and do not incorporate information from articles of law. In addition, the position correlation of words in case texts is often important but has not been considered. This paper proposes a method that extracts features at the semantic similarity level and at the level of related articles of law to compare the similarity of legal case texts. At the semantic similarity level, a novel capsule network based on a siamese structure is proposed; it introduces position correlation and improves the routing mechanism of the capsule network so that deep text features between case pairs can be learned. At the level of related articles of law, related articles are selected, encoded, and made to interact with the case text features to generate legal features. Experiments are conducted on a real-world legal text dataset, and the proposed model outperforms all baseline models, demonstrating its effectiveness. Further, to confirm the generality of the improved capsule network on long texts, experiments are also carried out on two long text datasets, and the results demonstrate the effectiveness of the improved capsule network.

1 Introduction

In a standard legal system, the most similar past cases can provide adjudicatory guidance for current cases that share similar elements. To arrive at a just verdict, legal practitioners often devote considerable time and effort to searching for precedents that resemble the cases they are currently handling. Finding similar cases automatically from a large collection of case texts has therefore become a problem that needs to be addressed. Solving it not only reduces the workload of legal professionals but also benefits the rule of law. With the rapid progress of deep learning and the wide availability of legal case data in the public domain, the legal field has recently seen a surge in the use of artificial intelligence (AI) technologies. Legal case matching has likewise been performed with neural networks and has achieved decent performance. Several previous works [1, 24] encode fact descriptions as continuous vectors and then use fully connected layers to compute the similarity of case texts. These approaches have difficulty identifying the most similar case text because the differences between candidate texts are often subtle. Siamese networks and capsule networks have shown promising results in NLP [4, 33]. A capsule network represents an entity by a vector whose direction encodes the properties of the entity and whose length reflects the probability that the entity exists. These characteristics are useful for extracting the multi-element features of legal cases. The works in [11, 28] build siamese networks on capsule networks [34] to automatically learn element features of legal texts.

Table 1 shows an example of a legal text. A legal text usually consists of three parts, which are marked with three separate colors. The first part records the litigant's request, the second records the plaintiff's description of the facts and reasons, and the third records the facts found at the hearing. An analysis of the legal texts indicates that the second part is more important than the first, and the last part is more important than the second. It is therefore feasible to introduce position correlation to improve the performance of the model, yet previous studies have not done so. Intuitively, cases that cite the same or similar articles of law are highly similar, so article-of-law information is also applied in the model. Previous approaches do not incorporate this information either.

Table 1
An example of legal text

This paper proposes a method that extracts features at the semantic similarity level and at the similarity level of related articles of law for the Chinese legal similar case matching task. At the semantic similarity level, a novel capsule network based on a siamese structure is proposed; it introduces position correlation and improves the routing mechanism within the capsule network. After encoding, the proposed capsule network lets the texts to be compared interact, so that the model can focus on the information shared by the two texts and the features of more similar texts become more similar after the interaction. Position correlation helps the model focus on features at positions with high position correlation and reduces the effect of noisy data in the text. Together, position correlation and the improved routing strengthen the feature extraction ability of the model, which helps it distinguish the degree of similarity of text pairs. At the similarity level of related articles of law, relevant articles of law are selected, encoded, and compared with the text features to generate legal features that are applied in subsequent calculations.

To be specific, the proposed method includes the fact encoder layer, the capsule network with position layer, the feature fusion and comparison layer, and the output layer. The fact encoder applies a convolutional neural network to generate features of the fact description and produces the primary capsules of the capsule network. A capsule network with position correlation layer then generates element capsules from the primary capsules. Because some cases do not list the relevant articles of law at the end of the case, relevant articles are first selected from the book of articles of law; the articles are then encoded, and the attention mechanism is applied to let them interact with the text features and generate features of the related articles of law. A fully connected layer computes the similarity between the law-related features of two case texts. A feature fusion and comparison layer computes vector distances between the features output by the capsule layers and the article of law layer. The output layer produces the final similarity result. In the proposed method, the siamese [33] structure is applied to the case triplets, so the three cases of each triplet are processed with the same parameters.

The contributions of the paper are summarized as follows:

(1) A method that extracts features at the semantic level and at the level of similar articles of law is proposed to capture text similarity for Chinese similar case matching.

(2) An approach for fusing position correlation into capsule networks and improving their dynamic routing is proposed. The improved capsule network improves the feature correlation of information between text pairs.

(3) Knowledge from law articles is applied to legal case similarity matching at the level of related articles of law to improve the performance of the model.

(4) Experiments are conducted on a real-world legal text dataset, and the proposed model outperforms all baseline models, demonstrating that it is effective. Further, the generality of the improved capsule network is validated on two long text datasets. The experimental results indicate that the improved capsule network is suitable for determining the similarity between lengthy documents.

The paper is structured as follows: Section 2 gives a concise overview of related work. Section 3 presents the proposed method. Section 4 reports the experimental results and analysis. Section 5 reports the results of experiments conducted on two Chinese long text similarity datasets to evaluate the improved capsule network. Section 6 is the discussion section. Section 7 concludes the paper.

2 Related Work

2.1 Text similarity matching

Text similarity matching computes the degree of semantic relevance between text pairs [6–9]. Text similarity underlies many NLP tasks, such as information extraction [10], human-machine dialogue answering [12] and knowledge inference [18]. The matching of semantic relevance between texts has therefore received considerable attention. Traditional methods calculate similarity from manually designed feature rules, statistics of words shared between texts, edit distance, and similar measures. Other work maps text to word embeddings [19, 49] and computes text similarity from them. Common representation models include Term Frequency-Inverse Document Frequency (TF-IDF) [25] and Bag of Words. In recent years, rapid advances in natural language processing have improved text encoding techniques, which has alleviated the sparsity of lexical vectors and greatly improved the ability to extract semantic representations of text. As a result, the performance of models for text similarity matching has also improved considerably. The work in [46] proposed an approach that measures text similarity by combining structure similarity and word-based similarity. Hu et al. [23] introduced CNNs, which are often applied to images, to the sentence similarity matching task. Wan et al. [26] introduced a deep learning model that applies sentence representations at multiple positions to compute the similarity of two sentences. Wang et al. [27] introduced a bilateral multi-view matching model that encodes sentence vectors from two directions to achieve sentence similarity matching.

Paheli et al. [50] proposed to augment PCNet, a precedent citation network among case documents, with the hierarchy of legal statutes, forming the heterogeneous network Hier-SPCNet for computing the similarity between two legal case documents. Recent transformer models, such as BERT and sentence-BERT, are purely feed-forward (dispensing with recurrence) but use position embeddings and self-attention layers to allow an order-sensitive representation of the document to emerge. These networks can be used in different ways to predict document similarity [52]. Wei et al. [55] improved the Word Mover's Distance (WMD) method with the syntactic parse tree; the resulting Syntax-aware Word Mover's Distance (SynWMD) incorporates word importance and takes the inherent contextual and structural information of a sentence into account.

Over the years, models based on the siamese network [33] have achieved good performance in many tasks. A siamese network processes the two input sentences with the same encoder by sharing weights, and it is widely applied in text similarity comparison tasks. Saedi et al. [47] proposed a deep siamese Bi-LSTM model that takes the embedding vectors produced by BERT as input and predicts the similarity of text pairs. Viji et al. [48] proposed a hybridized approach that combines weighted fine-tuned BERT feature extraction with a siamese Bi-LSTM model for text similarity. Han et al. [51] introduced the Siamese Attention-augmented Recurrent Convolutional Neural Network (S-ARCNN), which combines multiple neural network architectures for measuring document similarity. In each subnetwork of S-ARCNN, a document passes through a bidirectional Long Short-Term Memory (bi-LSTM) layer, which sends representations to local and global document modules. Revathy et al. [56] proposed a hybrid approach that integrates a deep siamese bidirectional long short-term memory (Bi-LSTM) network and a gated recurrent unit (GRU) network for semantic text similarity. Juan et al. [53] applied semantic methods, the cosine similarity algorithm, and fuzzy logic to improve the text similarity matching of documents. The model proposed in this paper applies the siamese network to encode the input legal texts and extract their text features.

2.2 Similar Case Matching

Ashley et al. [29] proposed a method that extracts features and properties from a case document and evaluates the obtained features rather than all sentences. Saravanan et al. [30] built a case ontology that helps the model extract text features. Kumar et al. [31] proposed an approach that calculates similarity with the cosine function over the embeddings of the terms found in legal case documents. Raghav et al. [32] fused quote information into document similarity in order to find the most similar document in the candidate dataset. Hong et al. [45] proposed an approach that incorporates legal features into a model for similar text matching in the legal domain. In this paper, a novel model based on a siamese capsule network with position correlation is proposed to learn features of pairs of legal documents for the semantic matching of similar cases.

Fig. 1

Overview of the proposed method

3 Method

Fig. 1 gives an overview of the model, which processes a triplet of factual descriptions presented as text. These descriptions are referred to as Case A, Case B, and Case C. If B is more similar to A, the model outputs 1; if C is more similar to A, the model outputs 0. The proposed approach consists of several layers:

Fact encoder. The input sequence is first processed by a pre-trained BERT model to obtain embeddings, which are then further processed by CNN and BiLSTM layers to extract local and contextual features from the text.

Capsule network with position layer. This layer applies a capsule network to obtain the feature information of the input text sequence by applying capsule vectors instead of scalars. In this layer, a position structure in capsule network is proposed for learning position information, document embedding and extracting interaction information between the capsules of two text pairs.

Texts feature interaction layer. The text feature interaction layer applies the features obtained from the capsule network layer to interact with other text feature representations, resulting in an interaction matrix.

Related articles of law selection and encoder layer. This layer selects the articles of law that are similar to the case text, encodes them, and lets them interact with the features of the case text to generate legal text features.

Output layer. This layer applies a multi-layer perceptron (MLP) classifier to generate final results.

3.1 Fact Document Encoder

As input, the model receives a sequence of text characters. The serialized inputs of cases A, B and C are D_A = (c₁, c₂, . . . , c_m), D_B = (c₁, c₂, . . . , c_n) and D_C = (c₁, c₂, . . . , c_l), where m, n and l denote the lengths of cases A, B and C. The raw legal case data contains a lot of information that is unimportant for the subsequent prediction algorithm, such as modal particles and stop words in utterances. This information often interferes with the prediction algorithm and seriously reduces its accuracy. Therefore, a stop-word dictionary and a dictionary of terms specific to the legal domain are constructed. The stop-word dictionary filters out modal particles and stop words that are not useful to the prediction algorithm. Using the two dictionaries, each utterance in the text is segmented into words and its stop words are removed. Because meaningless words are removed while legal terminology is retained, the constructed dictionaries avoid word segmentation errors as well as the loss of key words. Stop-word removal thus eliminates information that is irrelevant to the prediction algorithm and retains important textual information. After this preprocessing, the text is handled paragraph by paragraph; each paragraph contains several sentences and each sentence contains several characters. The processed vocabulary is fed into the pre-trained model.

The pre-trained BERT model is applied to generate the embedding e_i ∈ R^h of the i-th character in the input case texts A, B, C. A CNN is applied to learn local features of the input case texts. e_{i:i+j} denotes the concatenation of the BERT embeddings (e_i, e_{i+1}, . . . , e_{i+j}) in case text D, D ∈ {A, B, C}. Multiple kernels with different window sizes (h₁, . . . , h_t, . . . , h_r) are adopted. To generate a convolution vector representing the case text, the text is convolved with a window of size h_t, as described by Eq.(1). For each kernel W^t, the convolution is applied over the whole input sequence with padding at both ends of the sequence. Local features are extracted by applying a fully-connected layer after concatenating the results of the convolution operations of the different kernels. The following equations are applied to the calculation: $g_{i}^{t} = f (W^{t} e_{i : i + h_{t} - 1} + b^{t})$ (1) $Q_{i} = concat (g_{i}^{1}, . . ., g_{i}^{t}, . . ., g_{i}^{r})$ (2) $L_{i} = ℵ (W_{i} Q_{i} + b_{i})$ (3) where f and ℵ are activation functions, W^t and W_i are weight parameters to be learned, b^t and b_i are bias parameters to be learned, and $g_{i}^{t}$ denotes the representation of the i-th input generated by the t-th convolution kernel.

A BiLSTM is applied to extract contextual information from the input case texts A, B, C. The BiLSTM architecture uses two LSTM networks: one reads the sequence forward and the other reads it backward, so that information from both past and future contexts is captured. The LSTM at step u is computed with the following equations: $i_{u} = ℵ (W_{i} [h_{u - 1}, e_{u}] + b_{i})$ (4) $f_{u} = ℵ (W_{f} [h_{u - 1}, e_{u}] + b_{f})$ (5) $\hat{c_{u}} = \tanh (W_{c} [h_{u - 1}, e_{u}] + b_{c})$ (6) $c_{u} = f_{u} * c_{u - 1} + i_{u} * \hat{c_{u}}$ (7) $o_{u} = ℵ (W_{o} [h_{u - 1}, e_{u}] + b_{o})$ (8) $h_{u} = o_{u} * \tanh (c_{u})$ (9) where e_u represents the embedding at input step u, h_{u-1} is the output of the network at step u - 1, and [] represents the vector concatenation operation; W_i, W_f, W_c, W_o are weight parameters to be learned, and b_i, b_f, b_c, b_o are bias parameters to be learned; i_u, o_u and f_u are the outputs of the input gate, the output gate and the forget gate, and c_u and h_u are the cell state and the hidden output at step u; ℵ represents the activation function. $H_{i} = [\vec{h_{i}}, \overset{\leftarrow}{h_{i}}]$ (10) Equation (10) gives the output feature of the Bi-LSTM model, where $\vec{h_{i}}$ is the output of the forward LSTM, $\overset{\leftarrow}{h_{i}}$ is the output of the backward LSTM, and [] concatenates the two.
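The recurrence of Eqs. (4)-(10) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the gate activation ℵ is assumed to be the logistic sigmoid (the usual LSTM choice), and the helper names `init_params`, `lstm_step`, `run_lstm` and `bilstm`, the dimensions, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    # Assumed gate activation; the text only calls it "the activation function".
    return 1.0 / (1.0 + np.exp(-x))

def init_params(emb_dim, hid_dim, rng):
    """Random weights for the four gates (illustrative initialisation only)."""
    params = {}
    for gate in ("i", "f", "c", "o"):
        params["W_" + gate] = rng.normal(scale=0.1, size=(hid_dim, hid_dim + emb_dim))
        params["b_" + gate] = np.zeros(hid_dim)
    return params

def lstm_step(e_u, h_prev, c_prev, params):
    """One LSTM step following Eqs. (4)-(9)."""
    x = np.concatenate([h_prev, e_u])                        # [h_{u-1}, e_u]
    i_u = sigmoid(params["W_i"] @ x + params["b_i"])         # Eq. (4)
    f_u = sigmoid(params["W_f"] @ x + params["b_f"])         # Eq. (5)
    c_hat = np.tanh(params["W_c"] @ x + params["b_c"])       # Eq. (6)
    c_u = f_u * c_prev + i_u * c_hat                         # Eq. (7)
    o_u = sigmoid(params["W_o"] @ x + params["b_o"])         # Eq. (8)
    h_u = o_u * np.tanh(c_u)                                 # Eq. (9)
    return h_u, c_u

def run_lstm(seq, params, hid_dim):
    """Run the LSTM over a sequence of embeddings, returning all hidden states."""
    h, c = np.zeros(hid_dim), np.zeros(hid_dim)
    outputs = []
    for e_u in seq:
        h, c = lstm_step(e_u, h, c, params)
        outputs.append(h)
    return np.stack(outputs)

def bilstm(seq, fwd_params, bwd_params, hid_dim):
    """Concatenate forward and backward hidden states, as in Eq. (10)."""
    fwd = run_lstm(seq, fwd_params, hid_dim)
    bwd = run_lstm(seq[::-1], bwd_params, hid_dim)[::-1]     # re-align to input order
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))            # 5 characters, 8-dimensional embeddings
H = bilstm(E, init_params(8, 4, rng), init_params(8, 4, rng), hid_dim=4)
print(H.shape)                         # (5, 8)
```

Each row of `H` plays the role of H_i, the contextual feature that is later fused with the local CNN feature L_i.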

Once the local feature L_i and the contextual feature H_i of the i-th word in the text have been obtained, an MLP layer is applied to fuse them as the input of the next layer. $E_{i} = MLP (L_{i}, H_{i})$ (11) where L_i is the output of the CNN layer at step i, H_i is the output of the BiLSTM layer at step i, the MLP layer consists of several fully connected layers and activation functions, and E_i denotes the feature embedding of the i-th word after the MLP layer. To capture numerical features of the case, regular methods are applied to the factual part of the case to extract the corresponding numerical information (e.g., the amount of money involved in the case and the interest rate of the loan) and to generate a vector Ψ of numerical features, which is spliced together with the representation E of the case text in the subsequent computation. $E = Concat (E, Ψ)$ (12) Legal documents A, B, C are fed separately into the fact encoder, which follows the siamese structure and shares all of its parameters across the three inputs. In the following sections, $E_{i}^{a}, E_{i}^{b}, E_{i}^{c}$ denote the feature embedding E_i in Eq.(12) of the i-th character in A, B, C respectively.
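As an illustration of the numerical feature vector Ψ and the concatenation in Eq. (12), the sketch below assumes that the "regular methods" are regular expressions. The two patterns, the choice of the largest amount and largest rate as features, and the function names are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical patterns: an amount in yuan (optionally in units of 万 = 10,000)
# and a percentage such as an interest rate.
AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(万)?元")
RATE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")

def numeric_features(text):
    """Return an illustrative Psi = [largest amount in yuan, largest rate]."""
    amounts = []
    for match in AMOUNT_RE.finditer(text):
        value = float(match.group(1))
        if match.group(2):                 # amount was written in units of 万
            value *= 10000
        amounts.append(value)
    rates = [float(m.group(1)) / 100 for m in RATE_RE.finditer(text)]
    return [max(amounts, default=0.0), max(rates, default=0.0)]

def concat_features(E, psi):
    """Eq. (12): splice the numerical features onto the text representation."""
    return list(E) + list(psi)

# "The defendant borrowed 50,000 yuan from the plaintiff at a monthly rate of 2%."
psi = numeric_features("被告向原告借款5万元，约定月利率2%。")
print(concat_features([0.1, 0.2], psi))
```

In practice the raw amounts would likely need scaling before being concatenated with learned embeddings; the paper does not specify this step.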

3.2 Interactive information fusion of text pairs

In this layer, the principle of internal attention is applied to enable the interaction and fusion of information between the two texts being compared. Text A and text B are taken as an example to introduce the information interaction process; the interaction between text A and text C is calculated in the same way. An attention weight, denoted ɛ_ij, is calculated for each pair of characters in text A and text B as follows: $ɛ_{ij}^{a} = E_{i}^{a} \cdot E_{j}^{b}$ (13) $ɛ_{ij}^{b} = E_{i}^{b} \cdot E_{j}^{a}$ (14) Here $E_{i}^{a}, E_{j}^{a}$ are the embeddings of the i-th and j-th characters in text A, and $E_{i}^{b}$ , $E_{j}^{b}$ are the embeddings of the i-th and j-th characters in text B. The weights are normalized as follows, where the sums run over the characters of text B and text A, respectively: $T_{ij}^{a} = \frac{\exp (ɛ_{ij}^{a})}{\sum_{k = 0}^{ς_{b}} \exp (ɛ_{ik}^{a})}$ (15) $T_{ij}^{b} = \frac{\exp (ɛ_{ij}^{b})}{\sum_{k = 0}^{ς_{a}} \exp (ɛ_{ik}^{b})}$ (16) Then, the relevant semantics of the i-th character of text A and of text B are composed using $T_{ij}^{a}$ and $T_{ij}^{b}$ correspondingly: $E_{{\hat{ab}}_{i}} = \sum_{j = 0}^{ς_{b}} T_{ij}^{a} E_{j}^{b}$ (17) $E_{{\hat{ba}}_{i}} = \sum_{j = 0}^{ς_{a}} T_{ij}^{b} E_{j}^{a}$ (18) Next, an MLP layer is applied to generate new text feature information from the original feature information and the interactive feature information. $E_{i}^{ab} = MLP ([E_{i}^{a}; E_{{\hat{ab}}_{i}}])$ (19) $E_{i}^{ba} = MLP ([E_{i}^{b}; E_{{\hat{ba}}_{i}}])$ (20) Here $E_{{\hat{ab}}_{i}}$ and $E_{{\hat{ba}}_{i}}$ are the interaction embeddings of the i-th character in text A and text B obtained via Eq.(17) and Eq.(18), and [] represents the embedding concatenation operation.
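A minimal NumPy sketch of the interaction in Eqs. (13)-(18) follows. The MLP of Eqs. (19)-(20) is omitted, so the function returns the concatenated vectors that would be fed into it; the function name `interact` and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interact(E_a, E_b):
    """Cross-text attention between text A (m x d) and text B (n x d)."""
    eps_a = E_a @ E_b.T                  # Eq. (13), shape (m, n)
    eps_b = E_b @ E_a.T                  # Eq. (14), shape (n, m)
    T_a = softmax(eps_a, axis=1)         # Eq. (15): normalise over characters of B
    T_b = softmax(eps_b, axis=1)         # Eq. (16): normalise over characters of A
    E_ab_hat = T_a @ E_b                 # Eq. (17), shape (m, d)
    E_ba_hat = T_b @ E_a                 # Eq. (18), shape (n, d)
    # Inputs of the MLP in Eqs. (19)-(20): [E_i^a ; E_ab_hat_i] and [E_i^b ; E_ba_hat_i]
    mlp_in_ab = np.concatenate([E_a, E_ab_hat], axis=1)
    mlp_in_ba = np.concatenate([E_b, E_ba_hat], axis=1)
    return mlp_in_ab, mlp_in_ba, T_a, T_b

rng = np.random.default_rng(0)
E_a = rng.normal(size=(4, 3))            # text A: 4 characters
E_b = rng.normal(size=(6, 3))            # text B: 6 characters
mlp_in_ab, mlp_in_ba, T_a, T_b = interact(E_a, E_b)
print(mlp_in_ab.shape, mlp_in_ba.shape)  # (4, 6) (6, 6)
```

Each row of `T_a` is a distribution over the characters of text B, so every character of A receives a summary of the parts of B it aligns with most strongly.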

3.3 The capsule network with position correlation layer

This layer takes the document pair A and B as an input example to introduce the calculation process of the capsule network. The pseudocode of the improved capsule network is given in Algorithm 1. The capsule inputs of text A and text B are $E^{ab} = \{E_{1}^{ab}, E_{2}^{ab}, E_{3}^{ab}, . . ., E_{m}^{ab}\}$ and $E^{ba} = \{E_{1}^{ba}, E_{2}^{ba}, E_{3}^{ba}, . . ., E_{n}^{ba}\}$.

3.3.1 Position correlation and document-level embedding

In the proposed method, the position parameter p is adjusted dynamically over multiple iterations in order to obtain more accurate position importance features. Specifically, a position importance parameter p is set and initialised based on a statistical analysis of the text as follows: $p_{i} = \begin{cases} 0.11, & i \in \text{the first paragraph} \\ 0.26, & i \in \text{the second paragraph} \\ 0.63, & \text{otherwise} \end{cases}$ (21) To obtain and update the document embedding and p, the initialised parameter p is applied to calculate the feature vector of the whole document by Eq.(22). After the feature vector of the whole document has been obtained, Eq.(23) is applied to calculate the similarity between the vocabulary at each position and the feature vector of the whole document, and p is updated by Eq.(24). $D_{ab} = \frac{\sum_{i = 1}^{m} E_{i}^{ab} * p_{i}^{ab}}{| \sum_{i = 1}^{m} E_{i}^{ab} * p_{i}^{ab} |}$ (22) $w_{i}^{ab} = Θ (E_{i}^{ab} W_{i}^{ab} {D_{ab}}^{T})$ (23) $p_{i}^{ab} = p_{i}^{ab} + η (w_{i}^{ab} - p_{i}^{ab})$ (24) Here, $E_{i}^{ab}$ denotes the feature embedding at the i-th position after the information fusion of text A and text B, $p_{i}^{ab}$ denotes the correlation of the i-th position after the information fusion of text A and text B, $W_{i}^{ab}$ is a weight parameter to be learned, η is the update step, | | denotes the modulus of a vector, and Θ denotes the min-max normalization operation.
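The iteration of Eqs. (21)-(24) can be sketched as follows. This is an illustration under assumptions: a single weight matrix `W` shared by all positions stands in for $W_{i}^{ab}$, paragraph membership is supplied as a list of hypothetical paragraph indices, and the values of η and τ are arbitrary.

```python
import numpy as np

def init_position_weights(paragraph_ids):
    """Eq. (21): 0.11 for the first paragraph, 0.26 for the second, 0.63 otherwise."""
    table = {0: 0.11, 1: 0.26}
    return np.array([table.get(k, 0.63) for k in paragraph_ids])

def min_max(x, eps=1e-12):
    """Theta: min-max normalisation to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def update_position_weights(E, p, W, eta=0.1, tau=3):
    """Alternate between the document embedding D and the position weights p."""
    D = None
    for _ in range(tau):
        s = (E * p[:, None]).sum(axis=0)
        D = s / np.linalg.norm(s)          # Eq. (22): unit-length document vector
        w = min_max(E @ W @ D)             # Eq. (23): similarity of each position to D
        p = p + eta * (w - p)              # Eq. (24): move p towards w
    return p, D

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))                # 6 positions, 4-dimensional features
p0 = init_position_weights([0, 0, 1, 1, 2, 2])
p, D = update_position_weights(E, p0, W=np.eye(4))
print(p.round(3), np.linalg.norm(D))
```

Because each update is a convex combination of the old weight and a value in [0, 1], the weights stay in [0, 1] for any step 0 < η < 1.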

3.3.2 Improved dynamic routing

The features generated by fusing local and contextual features are applied as input to the capsule network. The original capsule network contains a primary capsule layer and a digit capsule layer. Because the local features extracted by the convolution layers are already fused into the input features of the network, the digit capsule layer is applied directly in the proposed capsule structure.

Algorithm 1 Capsule network with position correlation for extracting features of Case A and Case B.
Input:
$E^{ab} = \{E_{1}^{ab}, E_{2}^{ab}, E_{3}^{ab}, . . ., E_{m}^{ab}\}$;
The position importance parameter p;
The number of dynamic routing iterations: κ;
The number of document embedding iterations: τ;
Update step: η
Output:
v^ab, D_ab
1: p^ab = p
2: for iterations in τ do
3:   $D_{ab} = \frac{\sum_{i = 1}^{m} E_{i}^{ab} * p_{i}^{ab}}{| \sum_{i = 1}^{m} E_{i}^{ab} * p_{i}^{ab} |}$
4:   $w_{i}^{ab} = Θ (E_{i}^{ab} W_{i}^{ab} {D_{ab}}^{T})$
5:   $p_{i}^{ab} = p_{i}^{ab} + η (w_{i}^{ab} - p_{i}^{ab})$
6: end for
7: Initialize $b_{ij}^{ab} = 0$
8: for iterations in κ do
9:   for i in [1, m] do
10:     for j in [1, n] do
11:       $u_{j | i}^{ab} = W_{ij} E_{i}^{ab}$
12:       $γ_{ij}^{ab} = softmax (b_{ij}^{ab})$
13:       $v_{j}^{ab} = \sum_{i} p_{i}^{ab} γ_{ij}^{ab} u_{j | i}^{ab}$
14:       $v_{j}^{ab} = squash (v_{j}^{ab})$
15:       $h_{1}^{ab} = (u_{j | i}^{ab})^{T} v_{j}^{ab}$
16:       $h_{2}^{ab} = (u_{j | i}^{ba})^{T} v_{j}^{ab}$
17:       $h_{3}^{ab} = (u_{j | i}^{ab})^{T} v_{j}^{ba}$
18:       $H_{ij}^{ab} = W [h_{1}^{ab}, h_{2}^{ab}, h_{3}^{ab}]$
19:       $b_{ij}^{ab} = b_{ij}^{ab} + H_{ij}^{ab}$
20:     end for
21:   end for
22: end for
23: return v^ab, D_ab

The general dynamic routing calculation process of the capsule network is as follows: $u_{j | i}^{ab} = W_{ij} E_{i}^{ab}$ (25) $v_{j}^{ab} = \sum_{i} α_{ij}^{ab} u_{j | i}^{ab}$ (26) $v_{j}^{ab} = squash (v_{j}^{ab})$ (27) $squash (v_{j}^{ab}) = \frac{\| v_{j}^{ab} \|^{2}}{1 + \| v_{j}^{ab} \|^{2}} \cdot \frac{v_{j}^{ab}}{\| v_{j}^{ab} \|}$ (28) $α_{ij}^{ab} = α_{ij}^{ab} + (u_{j | i}^{ab})^{T} v_{j}^{ab}$ (29)
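The squashing function of Eq. (28) keeps the direction of a capsule vector and maps its length into [0, 1). A short NumPy sketch with a worked value is given below; the small constant `eps` is an addition for numerical safety and does not appear in Eq. (28).

```python
import numpy as np

def squash(v, eps=1e-9):
    """Eq. (28): scale v to length |v|^2 / (1 + |v|^2) without changing its direction."""
    sq_norm = np.sum(v * v, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

long_vec = squash(np.array([3.0, 0.0]))    # |v| = 3  ->  length 9 / 10
short_vec = squash(np.array([0.1, 0.0]))   # |v| = 0.1 -> length 0.01 / 1.01
print(np.linalg.norm(long_vec), np.linalg.norm(short_vec))
```

Long vectors are pushed towards unit length and short vectors towards zero, which is what allows the length of an output capsule to be read as a probability.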

To take position correlation into account in the network structure, an improved dynamic routing of the capsule network is proposed. Eq.(26) is improved as follows: $v_{j}^{ab} = \sum_{i} p_{i}^{ab} α_{ij}^{ab} u_{j | i}^{ab}$ (30) In addition, it is assumed that the information of the other text in a text pair can mediate the parameter updates in the dynamic routing of the current text and thereby improve the effectiveness of the model. Eq.(29) is therefore also improved. The improved calculation procedure is as follows: $h_{1} = (u_{j | i}^{ab})^{T} v_{j}^{ab}$ (31) $h_{2} = (u_{j | i}^{ba})^{T} v_{j}^{ab}$ (32) $h_{3} = (u_{j | i}^{ab})^{T} v_{j}^{ba}$ (33) $H_{ij} = W_{ij} [h_{1}, h_{2}, h_{3}]$ (34) $α_{ij}^{ab} = α_{ij}^{ab} + H_{ij}$ (35) Here, W_ij is a weight parameter. Similarly, $v_{j}^{ba}$ , $v_{j}^{ac}$ and $v_{j}^{ca}$ are calculated in the same way as $v_{j}^{ab}$ .
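The improved routing of Eqs. (30)-(35) can be sketched in NumPy as below. Several details are assumptions rather than statements of the paper: both texts are taken to have the same number of positions m (for example after padding) so that $u_{j | i}^{ba}$ is defined for every i, the routing logits are normalised with a softmax over the output capsules as in line 12 of Algorithm 1, the weight of Eq. (34) is a tensor of shape (m, J, 3) that maps the three agreement terms to a scalar, and the prediction vectors are random placeholders.

```python
import numpy as np

def squash(v, eps=1e-9):
    sq_norm = np.sum(v * v, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agreement(u, v):
    """(u_{j|i})^T v_j for every position i and output capsule j."""
    return np.einsum("ijd,jd->ij", u, v)

def improved_routing(u_ab, u_ba, p_ab, p_ba, W_ab, W_ba, kappa=3):
    """Route both directions of a text pair in lockstep.

    u_ab, u_ba: prediction vectors of shape (m, J, d)
    p_ab, p_ba: position weights of shape (m,)
    W_ab, W_ba: mixing weights of shape (m, J, 3)
    """
    m, J, _ = u_ab.shape
    b_ab, b_ba = np.zeros((m, J)), np.zeros((m, J))
    for _ in range(kappa):
        a_ab, a_ba = softmax(b_ab, axis=1), softmax(b_ba, axis=1)
        # Eq. (30): position-weighted sum over input positions, then squash
        v_ab = squash(np.einsum("i,ij,ijd->jd", p_ab, a_ab, u_ab))
        v_ba = squash(np.einsum("i,ij,ijd->jd", p_ba, a_ba, u_ba))
        # Eqs. (31)-(33) for the A-B direction and their mirror for B-A
        h_ab = np.stack([agreement(u_ab, v_ab), agreement(u_ba, v_ab), agreement(u_ab, v_ba)], axis=-1)
        h_ba = np.stack([agreement(u_ba, v_ba), agreement(u_ab, v_ba), agreement(u_ba, v_ab)], axis=-1)
        # Eqs. (34)-(35): mix the three agreement terms and update the logits
        b_ab = b_ab + np.einsum("ijk,ijk->ij", W_ab, h_ab)
        b_ba = b_ba + np.einsum("ijk,ijk->ij", W_ba, h_ba)
    return v_ab, v_ba

rng = np.random.default_rng(0)
m, J, d = 6, 3, 4
u_ab, u_ba = rng.normal(size=(m, J, d)), rng.normal(size=(m, J, d))
p = np.full(m, 1.0 / m)
W_ab, W_ba = rng.normal(scale=0.1, size=(m, J, 3)), rng.normal(scale=0.1, size=(m, J, 3))
v_ab, v_ba = improved_routing(u_ab, u_ba, p, p, W_ab, W_ba)
print(v_ab.shape, v_ba.shape)              # (3, 4) (3, 4)
```

Compared with the general routing of Eq. (29), the logit update here also depends on how well the current text's predictions agree with the other text's output capsules, and vice versa.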

3.4 Feature fusion layer

The output element capsules of Section 3.3 are v^ab, v^ac, v^ba and v^ca. Average pooling is applied to obtain the pooled features of v^ab, v^ac, v^ba and v^ca: ${\overset{ˇ}{v}}^{ab} = \sum_{j = 1}^{n} \frac{v_{j}^{ab}}{n}, {\overset{ˇ}{v}}^{ac} = \sum_{j = 1}^{n} \frac{v_{j}^{ac}}{n}$ (36) ${\overset{ˇ}{v}}^{ba} = \sum_{j = 1}^{n} \frac{v_{j}^{ba}}{n}, {\overset{ˇ}{v}}^{ca} = \sum_{j = 1}^{n} \frac{v_{j}^{ca}}{n}$ (37) Here, n is the number of element capsules, and $v_{j}^{ab}, v_{j}^{ba}, v_{j}^{ac}, v_{j}^{ca}$ are the outputs of the j-th capsule after fusing text A and B, fusing text B and A, fusing text A and C, and fusing text C and A, respectively. The fine-grained interaction features are calculated as follows: $f^{ab} = W_{f}^{ab} ({\overset{ˇ}{v}}^{ab}, D_{ab}, {\overset{ˇ}{v}}^{ba}, D_{ba})$ (38) $f^{ac} = W_{f}^{ac} ({\overset{ˇ}{v}}^{ac}, D_{ac}, {\overset{ˇ}{v}}^{ca}, D_{ca})$ (39) Here, $W_{f}^{ab}$ and $W_{f}^{ac}$ are weight parameters to be learned.
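A brief NumPy sketch of the pooling and fusion in Eqs. (36)-(38) is given below. It assumes that the tuple in Eq. (38) denotes concatenation before the linear map; the function name `fuse` and all shapes and random values are illustrative only.

```python
import numpy as np

def fuse(v_ab, D_ab, v_ba, D_ba, W_f):
    """Average-pool the element capsules and combine them with the document embeddings."""
    pooled_ab = v_ab.mean(axis=0)          # Eq. (36): mean over the n element capsules
    pooled_ba = v_ba.mean(axis=0)          # Eq. (37)
    x = np.concatenate([pooled_ab, D_ab, pooled_ba, D_ba])
    return W_f @ x                         # Eq. (38), assuming concatenation

rng = np.random.default_rng(0)
n_caps, cap_dim, doc_dim, out_dim = 3, 4, 5, 8
v_ab, v_ba = rng.normal(size=(n_caps, cap_dim)), rng.normal(size=(n_caps, cap_dim))
D_ab, D_ba = rng.normal(size=doc_dim), rng.normal(size=doc_dim)
W_f = rng.normal(size=(out_dim, 2 * cap_dim + 2 * doc_dim))
f_ab = fuse(v_ab, D_ab, v_ba, D_ba, W_f)
print(f_ab.shape)                          # (8,)
```

The feature f^ac of Eq. (39) would be obtained in the same way from the A-C capsules with its own weight matrix.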

3.5 Related articles of law selection and encoder

Real legal cases are decided according to the relevant applicable legal provisions. We collected a large number of legal articles from law books and applied them to the study of legal text similarity. Some legal case texts give the cited law at the end of the text, and some do not mention it. For the case texts in which no cited law is given, the articles of law that are similar to the text of the case need to be identified. Here, Case A is taken as an example to obtain its similar articles of law. First, a broad set of related articles of law is selected according to the type of case. For example, if Case A is a civil case, the articles of civil law are chosen, forming a collection of N law articles L = (l₁, l₂, . . . , l_N). The doc2vector method [39] is used to encode the collection of N law articles and generate a document vector for each article. The doc2vector method is also used to encode legal case A and generate its document embedding. The cosine similarity algorithm is applied to calculate the similarity between legal case A and each article of law, the similarities are sorted from the highest to the lowest, and the top 3 articles of law are selected as the relevant articles of law for that legal text.

Assume that the selected relevant articles of law are L = (l₁, l₂, l₃), that their doc2vector embeddings are $L^{e} = (l_{1}^{e}, l_{2}^{e}, l_{3}^{e})$, and that the doc2vector embedding of case text A is $e_{d}^{a}$ . The embedding of the laws related to case text A is then computed as follows: $β_{i}^{a} = W_{l} [e_{d}^{a}, l_{i}^{e}] + b_{l}, i \in \{1, 2, 3\}$ (40) $β_{i}^{l} = \frac{e^{β_{i}^{a}}}{\sum_{j = 1}^{3} e^{β_{j}^{a}}}$ (41) $r^{a} = \sum_{i = 1}^{3} β_{i}^{l} \cdot l_{i}^{e}$ (42) Here W_l and b_l are parameters to be learnt. Case B and Case C generate the embeddings r^b, r^c of their related articles of law in the same way as Case A. The following equations are applied to calculate the similarity between Case A and Case B, and between Case A and Case C, with respect to the related articles of law. $r^{ab} = W_{r} (r^{a} - r^{b}) + b_{r}$ (43) $r^{ac} = W_{r} (r^{a} - r^{c}) + b_{r}$ (44) Here W_r and b_r are parameters to be learnt.
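The selection step and the attention of Eqs. (40)-(42) can be sketched as follows. The document vectors are assumed to have been computed beforehand (by doc2vector in the paper); here they are small hand-made arrays, and the function names and parameter values are hypothetical.

```python
import numpy as np

def select_top_k(case_vec, law_vecs, k=3):
    """Rank law articles by cosine similarity to the case and keep the top k."""
    sims = law_vecs @ case_vec / (np.linalg.norm(law_vecs, axis=1) * np.linalg.norm(case_vec))
    top = np.argsort(-sims)[:k]            # indices sorted from highest to lowest similarity
    return top, sims[top]

def law_feature(case_vec, selected, W_l, b_l):
    """Attention over the selected articles, following Eqs. (40)-(42)."""
    beta = np.array([W_l @ np.concatenate([case_vec, l]) + b_l for l in selected])  # Eq. (40)
    weights = np.exp(beta - beta.max())
    weights = weights / weights.sum()                                               # Eq. (41)
    return weights @ selected, weights                                              # Eq. (42)

case_vec = np.array([1.0, 0.0])
law_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
top, top_sims = select_top_k(case_vec, law_vecs)
r_a, weights = law_feature(case_vec, law_vecs[top], W_l=np.full(4, 0.5), b_l=0.0)
print(top, weights.round(3), r_a.round(3))
```

The law-level comparison of Eqs. (43)-(44) is then a learned linear map applied to the difference r^a - r^b (or r^a - r^c).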

3.6 Output layer

To predict whether text B or text C is more similar to text A for a given triplet (A, B, C), the similarity matching task is defined as a binary classification task. Specifically, the feature distance between f^ab and f^ac is calculated, and the result is fed into a multilayer perceptron network to output the prediction. An MLP layer takes the output of the element embedding comparison layer and generates probabilities as follows: $z = MLP (f^{ab} - f^{ac}, r^{ab} - r^{ac})$ (45) To optimize the model, the binary cross-entropy loss function is employed: $loss = - \frac{1}{L} \sum_{i = 1}^{L} (\log (z_{i}) \cdot y_{i} + \log (1 - z_{i}) \cdot (1 - y_{i}))$ (46) where L is the number of training instances, y_i ∈ {0, 1} denotes the ground-truth label of instance i, and z_i is the predicted value for instance i.
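The loss of Eq. (46) can be checked numerically with a short NumPy sketch. The clipping constant `eps` and the 0.5 decision threshold in `predict_label` are assumptions added for illustration and are not stated in the paper.

```python
import numpy as np

def bce_loss(z, y, eps=1e-12):
    """Eq. (46): mean binary cross-entropy over L training instances."""
    z = np.clip(z, eps, 1.0 - eps)         # avoid log(0)
    return -np.mean(y * np.log(z) + (1.0 - y) * np.log(1.0 - z))

def predict_label(z, threshold=0.5):
    """1 means B is more similar to A, 0 means C is more similar to A."""
    return (np.asarray(z) > threshold).astype(int)

z = np.array([0.9, 0.2, 0.5])
y = np.array([1.0, 0.0, 1.0])
print(bce_loss(z, y), predict_label(z))
```

An uninformative prediction of 0.5 for every instance gives a loss of ln 2, which is a convenient reference value when monitoring training.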

4 Experiments

4.1 Dataset

We adopt the similar case matching dataset used in the China Artificial Intelligence and Law 2019 competition [1] to evaluate the performance of the proposed model. All legal cases in the dataset are related to lending, and the degree of similarity between the texts in the dataset has been judged by legal professionals. A few statistics for the Chinese legal dataset are presented in Table 2. The symbol maxl denotes the number of words in the longest text, and the symbol avgl denotes the average number of words per text. The dataset is made up of a training set, a validation set, and a test set. Every element in a triplet is an actual legal case document, and every legal case document contains a description of the facts.

Table 2
The statistics of Chinese Legal dataset

Datasets	Triplets	Fact descriptions	maxl	avgl
Train dataset	5102	15306	542	386.7
Validation dataset	1500	4500	542	384.6
Test dataset	1536	4608	542	384.9

4.2 Baselines

For the purpose of comparison with the proposed model, the following models are selected as baselines:

Term matching methods (TF-IDF) [25]: TF-IDF is a statistical method used to calculate the importance of a word to a document in a corpus. The TF-IDF technique has found extensive application in the field of natural language processing.

Siamese framework based method BERT [15] (SiaBERT). BERT stands for Bidirectional Encoder Representations from Transformers; it is a pretrained model that learns from a large-scale unlabeled corpus.

Siamese framework based method CNN [13] (SiaCNN). A classic convolutional network structure for the text similarity task.

Siamese framework based method LSTM (SiaLSTM). A classic recurrent neural network structure for the text similarity task.

Bidirectional Attention Flow (BiDAF) [14]: A multi-stage hierarchical framework that models text at character-level, word-level, and context-level granularity.

ABCNN [16]. ABCNN improves on BCNN by introducing attention to model the relationship between two sentences, adding inter-word contextual information and weighting the extracted information.

SMASH-RNN [17]. A siamese hierarchical recurrent network model with multi-depth attention that is applied to semantic text matching of long documents. It exploits the structure of a document to learn its long-range semantics.

IACN [35]. An attention-based capsule network model for textual interaction that applies a dynamic routing mechanism to extract information about the interactions between cases.

LFESM [45]. A method for Chinese similar case matching that applies a legal feature vector. For a fair comparison, the features extracted with regular expressions are removed and only the rest of the network structure of the model is kept.

4.3 Implementation parameters

Similar to [1], accuracy is used as the evaluation metric in this paper. The output dimensions of the BERT and BiLSTM networks are both 768. The convolution kernels are set to [1, 5]. The dynamic routing parameter κ is set to 3, the capsule dimension capsule_dim is set to 100, and the number of capsules capsule_num is set to 30. Moreover, dropout with a rate of 0.1 is applied between layers.

Fig. 2

Train loss.

4.4 Experiment results

In this section, the proposed method is compared with the baseline methods on the Chinese legal dataset. The experimental results are shown in Table 3, and the training loss of the proposed model and some baselines over training epochs is shown in Fig.2. The validation accuracy and the test accuracy of the proposed model over training epochs are shown in Fig.3 and Fig.4, respectively. From Table 3, the following observations can be made: (1) The traditional statistical method (TF-IDF) performs worse than the models based on deep learning. The reason is that the representations generated by the TF-IDF model are sparse and lack rich semantic information. (2) SiaBERT has the lowest validation accuracy among the deep learning models and trails most of them on the test set. A likely reason is that BERT alone cannot extract deep features of the text and lacks the ability to extract rich semantic information. (3) Compared with the traditional model, the deep learning models SiaCNN, BiDAF, SiaLSTM, IACN, and LFESM can extract deeper semantic information from legal text. (4) The proposed model outperforms all baseline models. Specifically, it improves accuracy by 2.7 and 3.14 percentage points on the validation and test sets, respectively, over LFESM, the best baseline model. As shown in Fig.2, the loss of the proposed model on the training set decreases relatively quickly as the number of training batches increases, and it is consistently lower than that of the other models in the figure after the 8-th batch. This suggests that the proposed model fits the training data better than these baselines. As depicted in Fig.3, the accuracy of the proposed model on the validation dataset increases as the number of training batches increases, and the model eventually surpasses the baseline model in validation accuracy.

Table 3
Experiment results of comparing with baseline models on Chinese legal dataset

Models	Valid	Test
TF-IDF	52.9	53.3
SiaBERT	61.9	67.3
SiaCNN	62.5	68.9
SiaLSTM	62.0	68.0
BiDAF	63.3	68.6
SMASH-RNN	64.1	65.7
ABCNN	62.6	67.1
IACN	63.4	68.9
LFESM	67.87	72.87
The proposed method	70.57	76.01
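
The improvements quoted for the proposed method are absolute differences in accuracy (percentage points) relative to the strongest baseline. The short check below recomputes them from the values in Table 3; the variable names are illustrative only.

```python
# Accuracy (%) of the baselines, copied from Table 3: (validation, test).
table3 = {
    "TF-IDF": (52.9, 53.3),
    "SiaBERT": (61.9, 67.3),
    "SiaCNN": (62.5, 68.9),
    "SiaLSTM": (62.0, 68.0),
    "BiDAF": (63.3, 68.6),
    "SMASH-RNN": (64.1, 65.7),
    "ABCNN": (62.6, 67.1),
    "IACN": (63.4, 68.9),
    "LFESM": (67.87, 72.87),
}
proposed = (70.57, 76.01)

# The strongest baseline is the one with the highest test accuracy.
best_name = max(table3, key=lambda name: table3[name][1])

# Absolute gains in percentage points, rounded to two decimals.
valid_gain = round(proposed[0] - table3[best_name][0], 2)
test_gain = round(proposed[1] - table3[best_name][1], 2)
print(best_name, valid_gain, test_gain)
```
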

Fig. 3

Eval accuracy.

Fig. 4

Test accuracy.

Fig. 5

Distribution of position correlation. p_ab, p_ba, p_ac, p_ca axes are represented as text A interacting with text B, text B interacting with text A, text A interacting with text C, and text C interacting with text A, respectively. The vertical axis represents the positions of words in the document.

Based on our analysis, the change in model loss is consistent with the change in accuracy on the validation and test datasets. We attribute this to the following: the larger the change in the loss, the more the feature extraction ability of the model is enhanced, and the difference between the probabilities of the two classes produced by the output layer is continuously amplified as training proceeds, while the model chooses the class with the highest probability when making predictions. Thus the change in the loss value is consistent with the change in accuracy. An example illustrates this. Suppose the model outputs two groups of class probabilities: P (label = 0) = 0.8, P (label = 1) = 0.2, and P (label = 0) = 0.6, P (label = 1) = 0.4. The difference between the two class probabilities is 0.6 in the first group and 0.2 in the second group. For both groups, the predicted class is label = 0, so the model obtains the same accuracy on both. Assuming that label = 0 is the correct result, the loss is smaller for the first group than for the second group. In back propagation using the first group, the model can update the parameters better. As training proceeds, a well-performing model increases the difference between the probabilities of the output labels, which in this example are P (label = 0) and P (label = 1).
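
For concreteness, the per-instance cross-entropy of the two groups in this example can be worked out as follows, assuming the natural logarithm and that label = 0 is the correct class:

```latex
\begin{aligned}
\text{group 1:}\quad & -\log P(\text{label}=0) = -\log 0.8 \approx 0.223,\\
\text{group 2:}\quad & -\log P(\text{label}=0) = -\log 0.6 \approx 0.511.
\end{aligned}
```

Both groups yield the same prediction and therefore the same accuracy, yet the loss of the second group is more than twice that of the first, which is the gap that further training reduces.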

Fig.5 illustrates the position correlation after interaction between text pairs in the dataset. Specifically, the value in the i-th row of the p_ab column indicates the correlation between the feature embedding generated by the interaction and fusion of text A and text B and the feature embedding at the i-th position. From Fig.5, it can be seen that the later the position of a feature embedding, the higher its correlation with the feature embedding of the whole text. The correlation in the last part of the p_ba and p_ca columns is low because text B and text C are shorter than the maximum text length set in the model.

The proposed model achieves these results for several reasons. It first extracts the contextual semantic information of the text using the BERT model. Then CNN and LSTM networks are applied to extract both local semantic and contextual feature information of the text. A multilayer MLP network is used to fuse the local and contextual feature information, which has the advantage of maintaining the rotational invariance of the feature information. We introduce the position correlation of the text into the capsule network to enhance the feature learning ability of the model. Furthermore, the dynamic routing in the capsule network is improved so that, when extracting text feature information, it focuses not only on the features of the current text but also on the relevant features of the text with which it is compared; the improved capsule network can therefore extract relevant feature information between text pairs. In addition, even when the similarity of two cases at the semantic level is low, the cases are usually more similar if their related articles of law are similar. Therefore, the information of the legal texts related to the cases is also used in the proposed method to help the model determine the similarity between cases in terms of legal texts. To validate the effectiveness of the proposed modules, ablation experiments are performed, and their results are given in the next section.

4.5 Ablation study

The results of the ablation experiments are reported in this section. A series of experiments utilizing various variant models is performed on a Chinese legal dataset. The objective of these experiments is to showcase the performance of each module included in the proposed model.

4.5.1 Impact of module structure

Capsule. The improved capsule network of the proposed method is replaced with the original capsule network, which does not introduce position correlation.

Cap_Atten. The position correlation structure is removed. When updating the routing information in the capsule network, only the attention information between two text pairs is considered.

w/o law. Structures related to information about legal articles are removed from the model.

Fig. 6

Accuracy of ablation study on validation datasets.

Fig. 7

Accuracy of ablation study on test datasets.

Fig. 8

Accuracy of the proposed method with different numbers of element capsules on the validation and test datasets

Fig. 9

Accuracy of the proposed method with different dimensions of element capsules on the validation and test datasets

The accuracy of the above three variants and of the proposed method on the validation and test datasets is shown in Fig.6 and Fig.7, respectively. On the validation dataset, the Cap_Atten model improves the accuracy over the Capsule model by 1.72%, the proposed capsule network improves the accuracy over the Cap_Atten model by 1.9%, and the proposed model improves the accuracy over the w/o law model by 0.04%. On the test dataset, the Cap_Atten model improves the accuracy over the Capsule model by 2.94%, the proposed capsule network improves the accuracy over the Cap_Atten model by 1.89%, and the proposed model improves the accuracy over the w/o law model by 0.16%.

4.5.2 Impact of capsule_num and capsule_dim

Two basic hyperparameters are used in the proposed model: the number of capsules capsule_num and the dimension of each capsule capsule_dim. The effect of these two hyperparameters on the performance of the proposed model is investigated. In the experiments, capsule_num is varied over [10, 50] and capsule_dim takes the values [10, 20, 50, 100, 150, 200, 300]. The results on the Chinese legal dataset are depicted in Fig.8 and Fig.9. Fig.8 shows that the performance of the model improves as the number of capsules increases, provided the number of element capsules is less than 30. The accuracy of the proposed model is highest when the number of element capsules is 30. When capsule_num is greater than 30, the performance of the model becomes worse as the number of capsules increases. Fig.9 shows that when capsule_dim is less than 100, the accuracy of the model improves as the capsule dimension increases. The best accuracy is obtained when capsule_dim is 100. When capsule_dim is greater than 100, the accuracy of the model deteriorates as the capsule dimension increases.

4.6 Analysis of sample

Table 4 shows two documents, document B and document C, which come from the same triplet as the document in Table 1. The document in Table 1 is called document A. Document B and document C are each compared with document A to determine which of them is more similar to A.

In this example, document A and document C are more similar than document A and document B. The factual descriptions of document A and document C are similar: both concern business mismanagement and lending disputes, and in both the required amount was borrowed in a single loan, whereas in document B the borrowing was divided into several loans. As for the proof of borrowing, loan notes were issued in document A and document C, whereas in document B a relatively formal "Loan Agreement" was signed partway through. As for interest, in both document A and document C interest runs from the date of the first loan, whereas in document B it runs from the date the agreement was formally signed. Document A and document C also both involve an adjustment of the interest rate: in document A, the rate was eventually adjusted upward because the defendant did not repay the principal in a timely manner and made no repayment during the recovery period, and in document C the plaintiff also adjusted the interest rate at a later stage. In document B, the parties did not disagree on the interest rate, which remained at 2%.

An analysis of documents A, B, and C reveals that the similarities described above appear in the "Facts and Reasons" and "Facts Found by the Court" sections. This indicates that the words at different positions in the text do not have the same relevance to the text as a whole. The "Facts Found by the Court" section is the most important section when comparing texts, followed by the "Facts and Reasons" section, and finally the "Description of Plaintiff’s Claim" section. In other words, the content in the "Facts Found by the Court" section has the highest position correlation, the content in the "Facts and Reasons" section the second highest, and the content in the "Description of Plaintiff’s Claim" section the lowest. This result matches the position correlation results shown in Fig.5, indicating that position correlation information can be introduced into the network to improve model performance. The "Description of Plaintiff’s Claim" and "Facts and Reasons" sections contain some noisy data, which hinders the ability of the model to learn features of the whole text. In addition, by fusing the features of text pairs through information interaction, the model can obtain a better feature representation of the text. Since document A and document C have more points in common than document A and document B, after the operations of Eq.(17) and Eq.(18) the fused document embeddings $E_{{\hat{ac}}_{i}}$ and $E_{{\hat{ca}}_{i}}$ are more similar to each other than $E_{{\hat{ab}}_{i}}$ and $E_{{\hat{ba}}_{i}}$ are. We have also improved the capsule network so that it takes the interactions between texts into account when updating the dynamic routing parameters. Since document A and document C are more similar, the ${\overset{ˇ}{v}}^{ac}$ and ${\overset{ˇ}{v}}^{ca}$ learned by this network are also more similar. These advantages help the model determine more easily that document A and document C are more similar, and they improve the performance of the model.

Table 4
Document B and Document C

5 Additional test

Additional tests are carried out in this section to assess the generality of the improved capsule network on two publicly available long document datasets, CNSE and CNSS, introduced by Liu et al. [36].

5.1 Datasets

The CNSE and CNSS datasets consist of long-form Chinese news articles collected from mainstream Chinese Internet news providers and cover many topics from various fields. The CNSE dataset consists of 29,063 pairs of news articles that are labeled based on whether they agree on the topic or not. The CNSS dataset contains 33,503 pairs of news articles that are labeled in the same way. As in a prior study [36], the data is split into training, development, and test sets in fixed proportions: 60% of the data is used for training, 20% for development, and the remaining 20% for testing. Table 5 displays the statistical information for CNSE and CNSS.

Table 5
The statistics information of CNSE dataset and CNSS dataset

Datasets	Max number of words	Average number of words	Train	Dev	Test
CNSE	8897	594	17438	5813	5812
CNSS	8897	602	20102	6701	6700
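
The 60/20/20 split can be related to the set sizes in Table 5 with a short calculation. The rounding rule used below (round the training and development sizes, give the remainder to the test set) is an assumption made for this sketch; it happens to reproduce the sizes reported in the table, but the exact procedure of the original split is not stated.

```python
def split_sizes(n_pairs, train_frac=0.6, dev_frac=0.2):
    """Sizes of the train/dev/test sets for a 60/20/20 split of n_pairs items."""
    n_train = round(n_pairs * train_frac)
    n_dev = round(n_pairs * dev_frac)
    # Whatever is left after rounding goes to the test set.
    n_test = n_pairs - n_train - n_dev
    return n_train, n_dev, n_test

cnse_sizes = split_sizes(29063)  # CNSE: 29,063 article pairs
cnss_sizes = split_sizes(33503)  # CNSS: 33,503 article pairs
```
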

5.2 Baseline methods and experimental setups

To test the performance of the proposed model, this paper has compared it with a range of competing baseline methods. These methods can be classified into two classes. One class is similarity matching via deep neural network models with a representational or interaction focus, including DSSM [37], CDSSM [38], DUET [40], MatchPyramid [41], ARC-II [42] and ARC-I [42]. Another class is similarity matching based on term similarity, including BM25 [43], LDA [44] and SimNet [36].

The proposed method takes three documents as input, whereas each instance in these datasets consists of two long documents, and no additional legal text information is required as input. Therefore, the proposed model needs to be modified before it can be applied to these datasets. Structures related to information about legal articles are removed from the model. The fact encoder is modified to encode two long documents instead of three documents. The interaction layer then models the interaction between the two documents. With this modification, the proposed model can be applied directly to the similarity matching task for two documents. The other hyperparameters remain the same as the experimental parameters set in Section 4.3.

5.3 Experimental results and analysis

The experimental results are shown in Table 6. It can be seen that the proposed method outperforms the baseline models on both datasets. The performance improvement can be attributed to two factors. First, the proposed method applies a capsule network with position correlation to transform a long text into multiple semantic units, and it captures the contextual positional relationships between these units. Second, the improved routing in the capsule network considers the interaction between texts when updating the parameters. Both factors help to improve the performance of the model.

Table 6
Experimental results of different methods on CNSE and CNSS datasets

Methods	CNSE	CNSS
ARC-I	53.8	50.1
ARC-II	54.4	52.0
DUET	55.6	52.3
DSSM	58.1	61.1
C-DSSM	60.2	53.0
MatchPyramid	66.4	62.5
BM25	69.6	67.8
LDA	63.8	63.0
The proposed method	72.73	72.16

6 Discussion

Currently, there are some difficulties in the task of legal text similarity matching. First, during model training, the similarity distance between text A and text B may in some instances be equal to that between text A and text C, and the model is prone to errors when predicting such data. Therefore, extracting the deep features of legal texts more effectively and increasing the differences between the features of text pairs can help improve the accuracy of the model. Second, the texts are relatively long, and in this paper some words have to be discarded when BERT is used to encode the text, which loses some information and hinders the model from learning document-level information. To address this problem, we will consider introducing hierarchical encoding networks to encode the text in future work. In this paper, cross entropy is applied as the loss function, converting the text similarity matching task into a classification task. To improve the accuracy of legal text similarity matching, we will consider proposing a new loss function for similarity comparison of triplet data in future studies. Topic models are currently applied to various text tasks, and we will also explore how to extract the topics and keywords of legal texts and apply them to the legal text matching task. Finally, there is still considerable room for improvement in the method of calculating the similarity between a case text and the relevant legal text, and how to better integrate the information of legal texts into the model requires further research.

7 Conclusion

This paper proposes a novel approach for the Chinese SCM task that integrates articles of law and uses a siamese capsule network incorporating position correlation. Capsule networks are applied because they are inherently better able to focus on the position of the text than other networks, and the position correlation of the text is introduced into the network to further enhance the ability of the capsule network to extract text features. At the same time, the position correlation is continuously updated as the network is trained, and the updated information can be used to compute feature embeddings for the entire document. The generated document embeddings can in turn improve the performance of the model. This paper also enhances the method for updating the dynamic routing parameters in the capsule network when comparing the similarity of text pairs. Unlike previous approaches, which only consider the information of the current text, the updated approach also takes into account the information of the text being compared. Law article information is also applied in the model, which helps the model determine the similarity of cases in terms of laws and regulations. The proposed model is tested on a real-world legal case similarity matching dataset, and the results demonstrate its superior performance compared to current baselines, achieving state-of-the-art accuracy. To further validate the generality of the improved capsule network, experiments are also conducted on two long text datasets, and the results indicate that the improved capsule network generalizes to long text matching.

Conflict of interest

The authors declare that they have no conflict of interest.

Authors’ contributions

Methodology, material preparation, data collection, and analysis were performed by Zhe Chen. Zhe Chen wrote the draft of the manuscript, Lin Ye and Hongli Zhang supervised and reviewed the work, and Yunting Zhang commented on versions of the manuscript.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant 61872111.

References

Xiao

, Zhong

H.Z

et al. Cail2019-SCM: A dataset of similar case matching in legal domain, arXiv preprint arXiv:1911.08962 (2019).

Devlin

et al. Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018),

Mikolov

et al. Distributed representations of words and phrases and their compositionality, In: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

et al., Siamese capsule networks with global and local features for text classification, Neurocomputing 390 (2020), 88–98.

Pennington

, Socher

, Manning

C.D.

Glove: Global vectors for word representation, In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

Agirre

et al. SemEval-2014 task 10: Multilingual semantic textual similarity, SemEval@ COLING (2014).

Agirre

et al. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability, In: Proceedings of the 9th InternationalWorkshop on Semantic Evaluation (SemEval 2015), 2015.

Agirre

et al. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In: SemEval-2016 10th International Workshop on Semantic Evaluation; San Diego, CA Stroudsburg (PA): ACL; ACL (Association for Computational Linguistics), 2016, pp. 497–511.

et al. ACV-tree:Anewmethod for sentence similarity modeling, IJCAI (2018).

10.

Schütze

, Manning

C.D.

, Raghavan

Introduction to information retrieval. Cambridge: Cambridge University Press, Vol. 39, 2008.

11.

et al. Learning to predict charges for legal judgment via self-attentive capsule network, ECAI (2020).

12.

Hirschman

, Gaizauskas

Natural language question answering: The view from here, Natural Language Engineering (2001), 275–300.

13.

Kim

Convolutional neural networks for sentence classification, In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746–1751.

14.

Seo

et al. Bidirectional attention flowfor 906 machine comprehension, arXiv preprint arXiv:1611.01603 (2016).

15.

Devlin

et al. Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

16.

Yin

et al., Abcnn: Attention-based convolutional neural network for modeling sentence pairs, Transactions of the Association for Computational Linguistics 4 (2016).

17.

Jiang

J.Y.

, Zhang

, Li et al. Semantic text matching for long-form documents, In: TheWorldWideWeb Conference, 2019, pp. 795–806.

18.

Conneau

et al. Supervised learning of universal sentence representations from natural language inference data, arXiv preprint arXiv:1705.02364 (2017).

19.

Lee

D.L.

, Chuang

and Seamons

, Document ranking and the vector-space model, IEEE Software 14(2) (1997).

20.

Neculoiu

, Versteegh

, Rotaru

Learning text similarity with siamese recurrent networks, In: Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.

21.

Mueller

and Thyagarajan

, Siamese recurrent architectures for learning sentence similarity, Proceedings of the AAAI Conference on Artificial Intelligence 30(1) (2016).

22.

Salton

, Wong

and Yang

C.-S.

, A vector space model for automatic indexing, Communications of the ACM 18(11) (1975).

23.

et al. Convolutional neural network architectures for matching natural language sentences, In: Advances in Neural Information Processing Systems, 2014.

24.

Zhong

, Xiao

et al. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence, In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.

25.

Salton

, Term-weighting approaches in automatic text retrieval, Information Processing & Management 24 (1988).

26.

Wan

, Lan

, Guo

, Xu

, Pang

, Cheng

Deep Architecture for Semantic Matching with Multiple position Sentence Representations, In: AAAI Conference on Artificial Intelligence, 2016.

27.

Wang

, Hamza

, Florian

Bilateral multiperspective matching for natural language sentences, arXiv preprint arXiv:1702.03814 (2017).

28.

et al. SECaps: A sequence enhanced capsule model for charge prediction, In: International Conference on Artificial Neural Networks, 2019.

29.

Ashley

K.D.

Improving the representation of legal case texts with information extraction methods, Proceedings ofthe Eigths International Conference on Artificial Intelligence and Law, ICAIL, 2001.

30.

Saravanan

et al. Improving legal information retrieval using an ontological framework, Artif Intell Law (2009).

31.

Kumar

, Krishna Reddy

et al. Similarity analysis of legal judgments, Proceedings of the 4th Bangalore Annual Compute Conference, 2011.

32.

Raghav

, Krishna Reddy

et al. Analyzing the extraction of relevant legal judgments using paragraph-level and citation information, AI4JCArtificial Intelligence for Justice (2016).

33.

Bromley

et al., Signature verification using a” siamese” time delay neural network, Advances in Neural Information Processing Systems (6) (1993).

34.

Sabour

, Frosst

and Hinton

G.E.

, Dynamic routing between capsules, Advances in Neural Information Processing Systems 30 (2017).

35.

et al., IACN: Interactive attention capsule network for similar case matching, Intelligent Data Analysis 26(2) (2022), 525–541.

36.

Liu

, Niu

et al. Matching Article Pairs with Graphical Decomposition and Convolutions, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL, 2019.

37.

Huang

et al. Learning deep structured semantic models for web search using clickthrough data, 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, 2013.

38.

Shen

, He

et al. Learning semantic representations using convolutional neural networks for web search, in: Proceedings of the 23rd International Conference onWorld Wide Web, 2014.

39.

, Mikolov

Distributed representations of sentences and documents, International Conference on Machine Learning PMLR, 2014.

40.

Mitra

, Diaz

et al. Learning to match using local and distributed representations of text for web search, Proceedings of the 26th International Conference on World Wide Web, 2017.

41.

Pang

, Lan

, Guo

, Xu

, Wan

, Cheng

Text matching as image recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

42.

, Lu

, Li

, Chen

Convolutional neural network architectures for matching natural language sentences, arXiv preprint arXiv:1503.03244 (2015).

43.

Robertson

, Zaragoza

The probabilistic relevance framework: BM25 and beyond, Now Publishers Inc, 2009.

44.

Blei

D.M.

, Ng

A.Y.

, Jordan

M.I.

Latent dirichlet allocation, The Journal of Machine Learning Research (2003).

45.

Hong

, Zhou

et al. Legal Feature Enhanced Semantic Matching Network for Similar Case Matching, 2020 International Joint Conference on Neural Networks, IJCNN, 2020.

46.

Farouk

, Measuring text similarity based on structure and word embedding, Cognitive Systems Research 63 (2020), 1–10. https://doi.org/10.1016/j.cogsys.2020.04.002

47.

Saedi

and Dras

, Siamese networks for large-scale author identification[J], Computer Speech & Language 70 (2021), 101241.

48.

Viji

and Revathy

, A hybrid approach of Weighted Fine-Tuned BERT extraction with deep Siamese BiLSTM model for semantic text similarity identification, Multimed Tools Appl 81 (2022), 6131–6157. https://doi.org/10.1007/s11042-021-11771-6

49. Kenter, Rijke M.D., Short text similarity with word embeddings, Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, 2015.

50. Bhattacharya, Ghosh, Pal, Ghosh, Hier-SPCNet: A Legal Statute Hierarchy-based Heterogeneous Network for Computing Legal Case Document Similarity, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), Association for Computing Machinery, pp. 1657–1660.

51. Han, Shi, Richie and Tsui F.R., Building siamese attention-augmented recurrent convolutional neural networks for document similarity scoring, Information Sciences 615 (2022), 90–102.

52. Khattab, Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020.

53. Figueroa J.H., Pérez-Téllez and Pinto, Measuring semantic similarity of documents with weighted cosine and fuzzy logic, J Intell Fuzzy Syst 39 (2020).

54. Peinelt, Nguyen, Liakata, tBERT: Topic models and BERT joining forces for semantic similarity detection, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.

55. Wei, Wang and Jay Kuo C.-C., SynWMD: Syntax-aware word mover’s distance for sentence similarity evaluation, Pattern Recognition Letters 170 (2023), 48–55.

56. Viji, Revathy, A hybrid approach of Poisson distribution LDA with deep Siamese Bi-LSTM and GRU model for semantic similarity prediction for text data, Multimed Tools Appl (2023).


Table 3
Experimental results compared with baseline models on the Chinese legal dataset

| Models | Valid | Test |
| --- | --- | --- |
| TF-IDF | 52.9 | 53.3 |
| SiaBERT | 61.9 | 67.3 |
| SiaCNN | 62.5 | 68.9 |
| SiaLSTM | 62.0 | 68.0 |
| BiDAF | 63.3 | 68.6 |
| SMASH-RNN | 64.1 | 65.7 |
| ABCNN | 62.6 | 67.1 |
| IACN | 63.4 | 68.9 |
| LFESM | 67.87 | 72.87 |
| The proposed method | 70.57 | 76.01 |


Table 1
An example of legal text

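Table 2
Statistics of the Chinese legal dataset

| Datasets | Triplets | Fact descriptions | Max length (maxl) | Average length (avgl) |
| --- | --- | --- | --- | --- |
| Train dataset | 5102 | 15306 | 542 | 386.7 |
| Validation dataset | 1500 | 4500 | 542 | 384.6 |
| Test dataset | 1536 | 4608 | 542 | 384.9 |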

Table 4
Document B and Document C

Table 5
Statistics of the CNSE and CNSS datasets

| Datasets | Max number of words | Average number of words | Train | Dev | Test |
| --- | --- | --- | --- | --- | --- |
| CNSE | 8897 | 594 | 17438 | 5813 | 5812 |
| CNSS | 8897 | 602 | 20102 | 6701 | 6700 |

Table 6
Experimental results of different methods on the CNSE and CNSS datasets

| Methods | CNSE | CNSS |
| --- | --- | --- |
| ARC-I | 53.8 | 50.1 |
| ARC-II | 54.4 | 52.0 |
| DUET | 55.6 | 52.3 |
| DSSM | 58.1 | 61.1 |
| C-DSSM | 60.2 | 53.0 |
| MatchPyramid | 66.4 | 62.5 |
| BM25 | 69.6 | 67.8 |
| LDA | 63.8 | 63.0 |
| The proposed method | 72.73 | 72.16 |