Abstract
The similar case matching task aims to determine which two cases in a given triplet are more similar. It plays a significant role in the legal industry and has thus gained much attention. Owing to the rapid development of natural language processing technology, various deep learning techniques have been applied to the similar case matching task and have achieved attractive performance. Most existing studies focus on encoding legal documents into a continuous vector. However, a single unified vector can hardly model the multiple elements of a case. In the real world, cases contain numerous elements, which are the basis on which legal practitioners judge the similarity among cases. Legal experts usually focus on whether two cases have similar legal elements, which makes this task especially challenging. In this paper, we propose a novel model, namely
Keywords
Introduction
With the rapid development of deep learning technology, more and more traditional industries have benefited from Artificial Intelligence (AI). Legal Artificial Intelligence (LegalAI) [55] has become a very popular research area and has attracted much attention from both legal professionals and AI researchers. The task of similar case matching (SCM) [50] aims to determine which two cases in a given triplet are more similar. Legal case matching plays a significant role in many legal applications. In standard law systems, the judgment results of cases are affected by the most similar cases in the past. Therefore, automatically finding similar cases can reduce heavy and redundant work for legal professionals and benefit the law system.
Early approaches usually treat legal case matching as a textual semantic matching task. Owing to the success of neural networks [34, 38, 13], popular natural language processing technology has been applied to the legal case matching task and has achieved attractive performance. A large number of prior works [50, 55] focus on encoding the fact description into a continuous vector and then computing the similarity scores with a linear layer. However, a unified document embedding cannot model the multiple elements of a case. In the real world, the various elements contained in cases are the basis on which legal practitioners judge similarity. Legal experts usually focus on whether two cases have similar legal elements [55]. These methods are therefore inconsistent with the approach of legal professionals, which makes this task non-trivial. Moreover, the difference between the cases in a given triplet may be small, so it is difficult to determine which two cases are more similar.
Some researchers have begun to explore multiple features from fact descriptions, which shows benefits for many tasks in the legal domain. For example, Hu et al. [22] manually summarized ten kinds of typical attributes of charges. Since both summarizing and annotating attributes require a lot of manual work, some works attempt to summarize the features of legal documents automatically. These works [28, 18] make use of the characteristics of the capsule network [41], which automatically learns high-level generalization features. A capsule is a group of neurons that uses vectors to represent an object or an object part. The orientation of the vector encodes properties of the object (like the shape/color of a face), while the length of the vector reflects its probability of existence (how likely a face with specific properties exists). These capsule network properties are appealing for learning the representation of the elements of cases.
Inspired by the above observations, this paper proposes an Interaction Attention Capsule Network, dubbed IACN, for the similar case matching task. IACN can capture fine-grained similarity by learning element representations from fact descriptions. Specifically, IACN includes the Fact Encoder, the Interaction Dynamic Routing layer, the Elements Interaction layer and the Output layer. The Fact Encoder utilizes Bi-LSTM to extract the contextual features of the fact description and produces primary capsules. The primary capsules are then fed into the Interactive Dynamic Routing layer to generate element capsules. We design the Elements Interaction layer to capture fine-grained similarity at the element level instead of the document level. Finally, the Output layer produces the final similarity score. For the cases of a triplet, IACN adopts the siamese framework in the Fact Encoder layer and the Interaction Dynamic Routing layer, which share parameters across all cases of a triplet in the learning process. To summarize, the main contributions of this paper are as follows:
We propose an Interaction Attention Capsule Network (IACN) for the similar case matching task. Our ablation experiment supports the design choices of the IACN model. Our visualization experiment shows that the IACN model can effectively learn element representations from cases. The case study reveals that IACN is closer to the decision process of legal experts, who judge similarity at the element level. We conduct extensive experiments on a real-world dataset. The experimental results consistently demonstrate the superiority and competitiveness of our proposed model. We also adjust IACN to conduct experiments on two long-document matching datasets. The experimental results indicate that our method can be applied to various tasks that face challenges similar to those of the SCM task.
Semantic text similarity
The task of similar case matching (SCM) focuses on determining the similarity between legal case documents. The SCM task is related to semantic text matching [5, 4, 3, 2, 1, 29, 46, 17, 23, 35, 37, 21], which lies at the core of many natural language processing tasks, such as information retrieval [33, 7], question answering [19], and natural language inference [12, 9]. Therefore, it has received much attention from researchers recently.
Traditional works aim to utilize word similarity through the vector space model [43, 30], e.g. term frequency-inverse document frequency (TF-IDF) [6] and bag-of-words. Recently, owing to the success of deep learning, much attention has been devoted to semantic text matching via encoding text into distributed representations [16], which has obtained strong performance. For example, Hu et al. [20] designed a CNN model, which adapts the convolutional strategy used in vision, for matching two sentences. Wan et al. [48] presented a deep model that calculates the similarity of two sentences through multiple positional sentence representations. Wang et al. [49] constructed the bilateral multi-perspective matching model (BiMPM), which encodes sentences in two directions. Shen et al. [45] introduced the convolutional deep structured semantic model (CDSSM) to learn low-dimensional semantic vectors of queries and Web documents. Another branch of deep learning methods is based on the siamese framework [10, 15] and has achieved great success. In the siamese framework, the same encoder is applied to both sentences. Because the parameters are shared, the model encodes the two sentences into the same embedding space. Our work shares several features with this line of work: (1) our work also builds on the development of deep learning, and (2) our work also utilizes the advantage of the siamese structure. Nevertheless, our work differs from it in at least two respects: (1) most of these works focus on encoding a legal document into a single continuous vector, while our method models a case as multiple element vectors and captures fine-grained element similarity to make an interpretable judgment, and (2) capsule networks are not covered in these works.
Legal artificial intelligence
Legal Artificial Intelligence (LegalAI) [55, 32, 54, 51, 11, 47] aims to exploit artificial intelligence technology to tackle legal problems. LegalAI plays a significant role in the legal domain, as it can reduce redundant work and save a lot of time for legal professionals. Therefore, it has attracted much attention from both AI researchers and legal professionals. Previous studies focus on exploiting mathematical methods [27, 36, 25] to analyze legal documents. Owing to the success of deep learning, more and more researchers utilize neural network methods [32] in legal tasks, which include the legal judgment prediction task [50, 18], the legal question answering task [56] and the similar case matching task [50, 47]. For example, Hu et al. [22] incorporated several discriminative legal attributes to help predict charges. Luo et al. [32] constructed an attention-based neural network method to jointly model charge prediction and relevant article extraction in a unified framework. This paper focuses on the similar case matching task. In this line of work, Zhong et al. [55] and Xiao et al. [50] implemented various deep neural networks, which have achieved attractive performance. Our work shares several features with this line of work: (1) our work also benefits from the development of deep neural networks, and (2) our work also addresses issues related to the similar case matching task. Nevertheless, our work differs from it in at least two respects: (1) most of these works ignore the similarity of fine-grained elements, and (2) capsule networks are not covered in these works. In this paper, we attempt to simulate the judgment process of legal experts, which captures fine-grained element similarity.
Model
This section first gives the definition of the similar case matching task (Section 3.1). Then, we provide an overview of the proposed IACN (Section 3.2). Finally, we introduce each module of IACN in detail (Sections 3.3–3.6).
Problem definition
Following the works [55, 50], the task of similar case matching is defined as follows. The input of the SCM task is a triplet (A, B, C), where A, B, and C are the fact descriptions of the three cases in the triplet. The purpose of this task is to predict whether
Overview of IACN
The architectural overview of IACN is introduced in this section. As shown in Fig. 1, it takes a triplet of fact descriptions (e.g., A, B and C) as input. The output determines which of B and C is more similar to A.
Most existing methods focus on encoding fact descriptions into a vector, while a unified vector can hardly model the multiple elements of a case. Recall that the central idea of IACN is to compute fine-grained element similarity by using the newly designed interaction dynamic routing and elements interaction layers. Specifically, IACN contains the following modules:
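The input/output contract described here can be illustrated with a minimal sketch. It is not the IACN model itself: `word_overlap` is a hypothetical stand-in for whatever similarity function a model provides, and the labels "B" and "C" are an assumed encoding of the two possible answers. The sketch also shows how accuracy over a set of triplets would be computed under that encoding.

```python
from typing import Callable, Sequence, Tuple

Triplet = Tuple[str, str, str]  # (A, B, C) fact descriptions


def predict_label(sim_ab: float, sim_ac: float) -> str:
    """Return "B" if A is at least as similar to B as to C, else "C"."""
    return "B" if sim_ab >= sim_ac else "C"


def accuracy(
    triplets: Sequence[Triplet],
    gold: Sequence[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Fraction of triplets whose predicted label matches the gold label."""
    correct = 0
    for (a, b, c), label in zip(triplets, gold):
        if predict_label(similarity(a, b), similarity(a, c)) == label:
            correct += 1
    return correct / len(triplets)


def word_overlap(x: str, y: str) -> float:
    """Hypothetical stand-in similarity: Jaccard overlap of whitespace tokens."""
    xs, ys = set(x.split()), set(y.split())
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


if __name__ == "__main__":
    data = [
        ("loan contract dispute", "loan contract default", "traffic accident injury"),
        ("a b", "c d", "a b"),
    ]
    print(accuracy(data, ["B", "C"], word_overlap))
```

Any learned model can be plugged in through the `similarity` argument; only the comparison of the two scores is fixed by the task definition.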
Overview of IACN.
The fact encoder represents the fact description at the semantic level and generates the primary capsules. Specifically, the fact encoder performs the following steps: (1) it first transforms each word of the fact description into a distributed representation via pre-trained word embeddings; (2) it then uses Bi-directional Long Short-Term Memory (Bi-LSTM) [53, 50] to extract the contextual information of the words and generates primary capsules. In this layer, each of the fact descriptions A, B, and C of the triplet goes through steps (1) and (2).
Given a triplet (A, B, C), where A, B, C are fact descriptions of three cases, we suppose that the fact description A consists of
The
where
The siamese framework is applied to the fact encoder layer, which shares parameters across all cases in a triplet. The fact descriptions A, B, and C are fed into the fact encoder and respectively produce primary capsules (i.e.,
The capsule network, first presented by Hinton [41], aims to alleviate the limitations of convolutional neural networks in the vision field. A capsule is a group of neurons that uses vectors to represent an object or an object part. The orientation of the vector encodes properties of the object, while the length of the vector reflects its probability of existence. Capsule networks have achieved attractive performance in the legal domain [18, 28]. Accordingly, the element capsules can be activated by primary capsules through a dynamic routing mechanism.
To better capture interaction information among fact descriptions, we devise an interactive dynamic routing mechanism. Taking a pair of fact descriptions A and B as an example, we describe the detailed calculation of interactive dynamic routing. Suppose the primary capsules of A and B are
Then the element capsules can be activated by the primary capsules via the interactive dynamic routing mechanism. Let us consider the case in which the element capsules of A interact with B (i.e.
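As background, the two ingredients mentioned here can be sketched in plain Python: a squashing function that maps a vector's length into [0, 1) so that it can be read as an existence probability, and routing-by-agreement in its commonly described original form. This is a sketch under the assumption that the standard formulation applies. It does not reproduce the interactive variant proposed in this paper, and the names `squash`, `dynamic_routing`, and `u_hat` are illustrative.

```python
import math
from typing import List

Vector = List[float]


def squash(s: Vector) -> Vector:
    """Shrink s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat: List[List[Vector]], iterations: int = 3) -> List[Vector]:
    """Route primary capsules to higher-level capsules by agreement.

    u_hat[i][j] is the prediction vector from primary capsule i for
    higher-level capsule j. Returns one output vector per higher-level capsule.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start at zero
    v: List[Vector] = []
    for _ in range(iterations):
        # Coupling coefficients: each primary capsule distributes its vote
        # over the higher-level capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s_j = [
                sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                for d in range(dim)
            ]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the current output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v


if __name__ == "__main__":
    votes = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
    print(dynamic_routing(votes))
```

Because the output length is bounded below 1, a long output vector can be interpreted as strong evidence that the corresponding higher-level capsule is present.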
where
Interactive Dynamic Routing Algorithm. The primary capsules of A:
The output of the Interactive Dynamic Routing layer is the set of element capsules, which are used to learn the element representation from fact descriptions. Each element capsule represents certain element information of the case. The Element Interaction layer is designed to capture the fine-grained similarity between cases at the element level.
Suppose the element capsules of fact description A are
where the
The triplet of fact descriptions (A, B, C) is encoded via the fact encoder, the interaction dynamic routing layer, and the element interaction layer. The output layer then computes the similarity scores between A and B and between A and C with a linear layer, similar to [55].
It is assumed that (
where
Experiments
This section covers our experimental results. First, we describe the experimental settings (Section 4.1). Second, we compare IACN with competitive methods (Section 4.2). Third, we conduct an ablation study to investigate the effectiveness of IACN (Section 4.3). Fourth, we study the impact of hyper-parameters (Section 4.4). Fifth, we conduct further experiments to analyze our model, including the case study (Section 4.5) and the visualization of legal elements. Sixth, in order to test the universality of our model, we adjust IACN to conduct experiments on two long-document matching datasets (Section 4.7).
Experimental setup
Datasets
The statistics of CAIL2019-SCM dataset
The #w denotes the number of words (maximum and average in all fact descriptions).
Similar to prior works [55, 50], the Chinese AI and Law 2019 Similar Case Matching (CAIL2019-SCM) dataset is selected as the experimental dataset in this paper to evaluate the effectiveness of our proposed model. CAIL2019-SCM is a real-world dataset collected from China Judgments Online.
Following prior works [55, 50], we compare IACN with competitive current baselines, which can be divided into three types: (1) the term matching method TF-IDF [42]; (2) siamese framework based methods: LSTM, TextCNN [26], BiDAF [44], and BERT [14]; (3) semantic matching models: ABCNN [52] and SMASH-RNN [24].
Term frequency-inverse document frequency (TF-IDF) is a statistical method that reflects how much a word contributes to a document in a corpus. TF-IDF is a popular weighting factor and has been applied widely in information retrieval. Bi-Directional Attention Flow (BiDAF) is a hierarchical multi-stage architecture that models the context at character-level, word-level, and contextual-level granularity. Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language model that has achieved massive success in the natural language processing community. The Attention-based Convolutional Neural Network (ABCNN) is a convolutional neural network that includes attention mechanisms for modeling sentence pairs. The Siamese Multi-depth Attention based Hierarchical Recurrent Neural Network (SMASH-RNN) focuses on semantic matching of long documents and learns long-form semantics by using the document structure. TextCNN and LSTM are popular methods. The experimental results of these methods are taken from [55], except for LSTM, which is implemented in this paper.
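To make the term matching baseline concrete, the following is a minimal TF-IDF sketch in plain Python. The exact weighting used by the baseline in [55] is not stated here, so the length-normalized term frequency and the smoothed inverse document frequency below are assumptions, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute sparse TF-IDF vectors for tokenized documents.

    Assumed weighting: term frequency normalized by document length, times
    a smoothed inverse document frequency log((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    a = ["loan", "contract", "interest"]
    b = ["loan", "contract", "default"]
    c = ["traffic", "accident", "injury"]
    vec_a, vec_b, vec_c = tfidf_vectors([a, b, c])
    print(cosine(vec_a, vec_b) > cosine(vec_a, vec_c))
```

For a triplet, the baseline decision would compare the cosine similarity of A with B against that of A with C. Such a method only sees shared terms, which is one plausible reason it trails the neural baselines.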
Implementation details
The detailed training settings of IACN
Similar to [55, 50], accuracy is employed as the evaluation metric in this paper. We use an embedding layer with 100 dimensions to represent the word sequence. Following the experimental setup of [50], all the fact descriptions of cases are processed by THULAC
Experimental results of different methods on CAIL2019-SCM dataset
In this section, we compare IACN with these methods on the CAIL2019-SCM dataset to examine its competitiveness. Table 3 shows the comparison results. From this table, we can draw the following conclusions: (1) TF-IDF performs worse than the deep learning-based models, because deep learning-based approaches can capture richer semantic information from legal documents than TF-IDF. (2) Surprisingly, BERT performs poorly. The reason may be that legal texts are often very long (see Section 4.1.1), while BERT is limited to an input of 512 tokens. (3) The results of most deep learning-based baselines are very close. For example, the gap between TextCNN, BiDAF, LSTM, and ABCNN is within 1.3%. The reason could be that legal texts are relatively standardized, so all these methods can learn rich semantic information. (4) IACN outperforms all previous baselines. Specifically, our model gains improvements of 0.9% and 0.6% on the dev and test sets, respectively. This is an encouraging result and shows the competitiveness of IACN. It is worth noting that IACN judges the similarity of case pairs based on the representation of legal elements, a process that is consistent with how legal experts judge similarity.
Experimental results of different ablate methods on CAIL2019-SCM dataset
This subsection evaluates several ablated variants of IACN on the CAIL2019-SCM dataset to illustrate the effectiveness of each component. We compare the following variant models with IACN using the default experimental setup.
IACN w/o Bi-LSTM: It employs the word embeddings directly as the low-level capsules, without Bi-LSTM in the fact encoder.
IACN w/o interactive dynamic routing: It is the IACN model without interactive dynamic routing. It uses the original dynamic routing instead.
IACN w/o elements interaction layer: It removes the elements interaction layer and calculates the similarity through the linear layer.
Table 4 shows the results of the ablation study. From this table, we observe that: (1) compared with
(a): Performance of IACN with different numbers of element capsules (fixing the dimension of the element capsule to 20). (b): Performance of IACN with different dimensions of the element capsule (fixing the number of element capsules to 10).
Our model has two essential hyper-parameters: the number of element capsules and the dimension of the element capsule. Here we study the impact of these two hyper-parameters on the performance of our model. Figure 2 shows the detailed results, which are obtained on the CAIL2019-SCM dataset.
Figure 2a shows that our model achieves the best accuracy when the number of element capsules is 10. Figure 2b shows that the performance of the model is best when the dimension of the element capsule is set to 20. We conjecture that a larger dimension gives the model more capacity, but the computational complexity grows correspondingly. Therefore, we set the number of element capsules to 10 and the element capsule dimension to 20 in our experiments to balance performance and training cost.
Case study on three examples. A and B are more similar in examples 1 and 2, while A and C are more similar in example 3.
As mentioned before, existing models for the similar case matching task focus on modeling the legal document as a single vector and often ignore the information about legal elements. This work instead simulates the process by which legal professionals judge similar cases. Specifically, IACN learns the element representation of legal documents and then determines the final similarity based on the element similarity. To study the effectiveness of this central idea, we present three examples from the test set, as shown in Fig. 3. A and B are more similar in examples 1 and 2, while A and C are more similar in example 3.
Figure 3a, c and e show the element similarity matrices of fact descriptions A and B for the three examples, respectively. Similarly, Fig. 3b, d and f show the element similarity matrices of fact descriptions A and C. The on-diagonal entries measure the degree of association between corresponding elements of the two cases, while the off-diagonal entries reflect how related non-corresponding elements of the two cases are. From the figure, we find that: (i) all of the matrices have high on-diagonal entries, which indicates the quality of the element representations learned by IACN; (ii) all of them have low off-diagonal entries, which shows that the model can effectively learn mutually exclusive element representations.
Specifically, in example 1, Fig. 3a has more highlighted on-diagonal values than Fig. 3b. In example 2, Fig. 3c has more highlighted on-diagonal values than Fig. 3d. This indicates that, in examples 1 and 2, A and B have more similar legal elements than A and C. Therefore, A and B are more similar. In example 3, Fig. 3f has more highlighted on-diagonal values than Fig. 3e. Therefore, A and C are more similar, and this process resembles the way legal experts judge similarity based on similar legal elements. It is worth noting that the diagonals of Fig. 3b and d also have many highlighted values, which indicates that A and C also have high element-based similarity in examples 1 and 2. In fact, A, B, and C all come from loan-related data, so they have a high degree of similarity with each other, which makes this task non-trivial. In a real-world scenario, legal experts often judge the similarity of cases based on the criterion that cases with similar legal elements are more similar. IACN simulates this judging process of legal experts, which can effectively improve performance.
The visualization of legal element representations (element capsules) using t-SNE.
As mentioned before, existing models for SCM focus on modeling the legal document as a single vector and often ignore the information about legal elements. This work instead simulates the process by which legal professionals judge similar cases. Specifically, IACN learns the element representation of legal documents and then judges the final similarity based on the element similarity. To verify that the model can effectively learn representations of mutually exclusive legal elements, we conduct a visualization experiment on the test set. Since the element representations are high-dimensional vectors, we use t-SNE to visualize them in a 2D space in Fig. 4. Each color represents a different element, and the point clouds are the element representations (element capsules) from the dataset. As can be seen, the data is clustered into ten element clusters, and the clusters are mutually exclusive, which again shows that the model can learn mutually exclusive element representations.
In this subsection, we test the universality of our model. We argue that our method can be applied to many scenarios, as long as the scenario faces challenges similar to those of the SCM task. To illustrate this point, we additionally conduct experiments on two public long-document datasets, named CNSE and CNSS [31]. Since these two public datasets also face the challenge of long texts, they are relatively suitable as experimental subjects.
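The on-diagonal versus off-diagonal reading of such a matrix can be sketched as follows. The source does not state how the plotted similarity is computed, so cosine similarity between element capsules is an assumption here, as are the function names. The sketch also assumes both cases have the same number of element capsules, so the matrix is square.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(caps_a: List[Vector], caps_b: List[Vector]) -> List[List[float]]:
    """Entry [i][j] compares element capsule i of one case with capsule j of the other."""
    return [[cosine(a, b) for b in caps_b] for a in caps_a]


def diagonal_summary(matrix: List[List[float]]) -> Tuple[float, float]:
    """Return (mean on-diagonal value, mean off-diagonal value) of a square matrix."""
    n = len(matrix)
    on = [matrix[i][i] for i in range(n)]
    off = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(on) / len(on), (sum(off) / len(off) if off else 0.0)


if __name__ == "__main__":
    caps_a = [[1.0, 0.0], [0.0, 1.0]]
    caps_b = [[2.0, 0.0], [0.0, 3.0]]
    print(diagonal_summary(similarity_matrix(caps_a, caps_b)))
```

A pair of cases with a higher mean on-diagonal value has more agreement between corresponding elements, which is the pattern the case study reads off the heat maps.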
Datasets
The statistics of CNSE and CNSS datasets
The #w denotes the number of words (maximum and average in all documents).
Similar to prior works [31], the Chinese News Same Event dataset (CNSE) and the Chinese News Same Story dataset (CNSS) are selected as experimental datasets in this subsection to evaluate the universality of our IACN model. CNSE and CNSS contain long Chinese news articles collected from major Internet news providers in China, covering diverse topics in the open domain. The CNSE dataset contains 29063 pairs of news articles with labels representing whether two documents report the same news event. Similarly, the CNSS dataset contains 33503 pairs of articles with labels representing whether two documents fall into the same news story. Following [31], for both datasets we use 60% of all the samples as the training set, 20% as the development (validation) set, and the remaining 20% as the test set. The statistics of CNSE and CNSS are reported in Table 5.
To verify the effectiveness of our method, we compare it with a series of competitive models. These methods can be roughly divided into two categories. The first category is matching by representation-focused or interaction-focused deep neural network models, which includes DSSM [23], CDSSM [45], DUET [35], MatchPyramid [37], ARC-I [21] and ARC-II [21]. The second category is matching by term-based similarities, which includes BM25 [39], LDA [8] and SimNet [31]. The experimental results of these methods are taken from [31].
The input of IACN is three documents, whereas the input of this task is two long documents. Therefore, we need to make some adjustments to the IACN model in order to transfer it to this task. We adjust the fact encoder to encode two long documents (say A and B) instead of documents A, B, and C. The interaction layer then models the interaction between A and B. With this simple adjustment, our IACN model can be applied to document matching tasks. As for other settings, we use fastText embeddings to represent the word sequence. For the sake of simplicity, the maximum word length of all sequences is set to the average length of the CNSS and CNSE datasets, respectively. The other hyper-parameters are kept consistent with the main experiments in Section 4.1.3.
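The 60/20/20 partition can be sketched as below. Whether the original split was random, and with which seed, is not stated, so the shuffling and the `seed` parameter are assumptions. Integer arithmetic is used so that the three parts always add up to the full dataset.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_60_20_20(
    samples: Sequence[T], seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle samples and split them into train (60%), dev (20%), test (rest)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = len(items) * 6 // 10
    n_dev = len(items) * 2 // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder, so no sample is dropped
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_60_20_20(range(100))
    print(len(train), len(dev), len(test))
```

When the dataset size is not divisible by ten, the leftover samples land in the test set under this rounding choice.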
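The length setting described here can be sketched as a small preprocessing step: compute the average token count of a dataset, then truncate or pad every document to that length. The padding token `<pad>` and the rounding down of the average are assumptions, and the function names are illustrative.

```python
from typing import List, Sequence


def average_length(docs: Sequence[Sequence[str]]) -> int:
    """Average number of tokens per document, rounded down (at least 1)."""
    return max(1, sum(len(d) for d in docs) // len(docs))


def fit_length(tokens: Sequence[str], max_len: int, pad: str = "<pad>") -> List[str]:
    """Truncate tokens to max_len, or right-pad them with the pad token."""
    clipped = list(tokens[:max_len])
    return clipped + [pad] * (max_len - len(clipped))


if __name__ == "__main__":
    docs = [["a", "b", "c", "d"], ["e", "f"]]
    max_len = average_length(docs)
    print([fit_length(d, max_len) for d in docs])
```

Using the average length as the cap keeps batches small, at the cost of discarding the tail of documents that are longer than average.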
Experimental results
Experimental results of different methods on CNSE and CNSS datasets
We adjust the structure of IACN to enable it to serve the task of matching long documents. We conduct experiments on two publicly available long-text datasets (CNSE and CNSS) and compare against a series of competitive classic text matching methods. The experimental results are reported in Table 6. From this table, we can see that our method achieves competitive results on both datasets. The reason may be that, compared to the classic text matching methods, our method first models the long text as multiple semantic units (high-level capsules). Each semantic unit can be regarded as a concept abstracted from a long document. The interaction layer then captures the fine-grained conceptual relationship between the two long documents, which is beneficial for long document matching.
This paper explores the task of similar case matching in the legal domain. We make the first attempt to exploit the advantage of capsule networks in the similar case matching task. In particular, we propose the Interactive Attention Capsule Network (dubbed IACN), which can learn high-level generalized features between fact descriptions through interaction dynamic routing. The Interactive Attention Capsule Network adopts the siamese framework for encoding legal documents. The experimental results on real-world datasets demonstrate that IACN outperforms the current baselines and achieves new state-of-the-art performance. We also visualize the element capsules and the fine-grained interaction matrix to demonstrate the interpretability of our model. To verify the universality of the model, we also adjust IACN to conduct experiments on two long-document matching datasets. The experimental results indicate that our method can be applied to various tasks that face challenges similar to those of the SCM task. In the future, we will explore integrating external legal domain knowledge to facilitate this task.
