Legal information extraction and classification using BERT, Bi-LSTM, and CRF models

Abstract

This article combines BERT (Bidirectional Encoder Representations from Transformers), Bi-LSTM (Bidirectional Long Short-Term Memory), and CRF (Conditional Random Field) models to transform unstructured legal text into structured data through information extraction, improving the effectiveness of legal information extraction. The BERT model produces deep semantic embeddings of legal texts, generating a context-sensitive representation for each word. The Bi-LSTM network captures long-distance dependencies in the text and extracts sequence features, and a CRF layer globally optimizes the label sequence to ensure accurate annotation of entity boundaries and relationships. On the dataset for extracting legal entity relationships related to prostitution constructed in this article, the accuracy, precision, recall, and F1 score of entity classification reached 93.6%, 92.7%, 92.1%, and 92.4%, respectively. All 153 test samples of the Engage_in_prostitution relationship were correctly classified. To analyze the stability of legal information extraction and classification, the proposed model was tested on five datasets: CAIL2019, CJRC (Chinese Judicial Reading Comprehension), LexGLUE (Legal General Language Understanding Evaluation), COLIEE (Competition on Legal Information Extraction/Entailment), and ECHR (European Court of Human Rights). The accuracy of the model fluctuated by only 1.2% across datasets, the precision remained stable, and the recall fluctuated by 0.7%. By combining BERT, Bi-LSTM, and CRF to accurately extract and classify legal information, this article provides reliable technical support for legal intelligence research.

Keywords

legal information extraction, relation classification, bidirectional encoder representations from transformers, bidirectional long short-term memory, conditional random field, prostitution laws

Introduction

The digitalization of legal documents is accelerating, opening broad prospects for efficient management and analysis in the legal industry. Although far more legal text is now available in digital form, writing standards and formats differ significantly among types of legal documents,^1,2 which poses a major challenge to automated processing. Legal text information extraction aims to automatically identify and extract key information, such as legal entities, cases, and judgments, from large numbers of complex legal documents. Because legal language is specialized, syntactically complex, and semantically dependent on context, information extraction technology is very important for improving legal document processing. Legal texts^3,4 have complex language structures and specific terminology, and rule-based methods struggle to handle all types of legal documents. These issues make extracting valuable information from unstructured legal texts and transforming it into structured data an important research topic in legal information processing. This article improves the accuracy and efficiency of legal text information extraction and classification by combining BERT, Bi-LSTM, and CRF models, provides a more advanced method for intelligent legal document processing, and helps promote the informatization and intelligent development of the legal industry.

The named entity recognition example of the pipeline relationship extraction method is shown in Figure 1.

Figure 1.

Named entity recognition.

The traditional pipeline relationship extraction method usually includes two main steps: (1) named entity recognition is performed to identify relevant entities in the input statement; (2) the recognized entities are combined into ordered pairs without repetition, and each pair is concatenated with the original statement for relationship classification, which determines the relationship triplets between entities. One major drawback of this method is its tendency to accumulate errors. Because the two steps are closely coupled, the results of named entity recognition are used directly as inputs for relationship classification. If named entity recognition makes an error, the erroneous result is passed to the relationship classification stage, where it is further amplified. The pipeline method also suffers from computational redundancy. All recognized entities are paired and passed on to the relationship classification step, yet many of these pairs are unnecessary because they have no meaningful relationship. Such redundant calculations not only increase computational complexity and time costs but may also reduce the efficiency of the model when it processes large amounts of invalid data. In short, the stages of existing pipeline methods are strongly independent of one another and pass too little information between them, so errors from earlier tasks accumulate and degrade the accuracy of subsequent tasks. In addition, the inability to effectively capture contextual information limits the overall performance of relationship extraction.

The process of relationship extraction is shown in Table 1.

Table 1.

Relationship extraction process.

Element 1	Element 2	Pipeline information	Relation
Zhang	Li	(Element 1, Element 2)	Provide_shelter_for
Zhang	Wang		Provide_shelter_for
Li	Zhang		None
Li	Wang		None
Wang	Zhang		None
Wang	Li		None
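The pairing step behind Table 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `candidate_pairs` and the `gold` dictionary are hypothetical, and the gold relations are simply copied from Table 1.

```python
from itertools import permutations

def candidate_pairs(entities):
    """Return every ordered (head, tail) pair of distinct entities."""
    return list(permutations(entities, 2))

# Hypothetical gold relations, copied from the two non-"None" rows of Table 1.
gold = {
    ("Zhang", "Li"): "Provide_shelter_for",
    ("Zhang", "Wang"): "Provide_shelter_for",
}

pairs = candidate_pairs(["Zhang", "Li", "Wang"])

# Every pair is sent to the relation classifier, even those with no relation.
labelled = [(head, tail, gold.get((head, tail), "None")) for head, tail in pairs]
wasted = sum(1 for _, _, relation in labelled if relation == "None")

for row in labelled:
    print(row)
print(f"{wasted} of {len(pairs)} candidate pairs carry no relation")
```

For n recognized entities the pipeline classifies n(n-1) ordered pairs, so the share of uninformative candidates grows quickly as sentences mention more people.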

In the task of processing legal texts, existing methods face the problems of low processing efficiency and poor adaptability. In recent years, the application of deep learning technology in legal information extraction has gradually received attention. Bidirectional long short-term memory networks are widely used for feature extraction of text sequences due to their superior performance in capturing long-range dependencies of sequence data. The conditional random field model,^5,6 due to its optimization ability in sequence annotation tasks, can effectively integrate contextual information and improve the coherence of annotation results. By combining these two models and introducing the BERT model to generate vector representations of text, the semantic understanding of legal texts can be further improved, and the model’s extraction and classification capabilities can be enhanced. BERT’s pre-trained language model provides Bi-LSTM with richer semantic features, while CRF optimizes the output of Bi-LSTM to improve the automation level of legal information processing through model combination, thereby promoting the further development of intelligent legal service systems.

This study combines BERT, Bi-LSTM, and CRF models and performs entity recognition and relationship extraction simultaneously through an end-to-end method, which improves the accuracy and efficiency of information extraction. The innovation lies in this hybrid use of the three models, each contributing its own strength: BERT captures deep semantics in legal texts, Bi-LSTM models bidirectional dependencies, and CRF ensures accurate tagging of entities and relationships. The integrated model enhances classification accuracy and addresses context and sequence prediction challenges. Tested on datasets such as CAIL2019 and ECHR, it proves adaptable and effective for legal information extraction.

Related work

Recent research in legal information extraction and classification focuses on leveraging natural language processing and machine learning to automate the processing of large unstructured legal texts. Legal information extraction^7,8 involves identifying entities, relationships, events, or facts from documents like judgments and contracts. Traditional rule-based methods, though effective for specific formats, struggle with complex legal language and diverse text styles. As a result, data-driven approaches using deep learning models,^9,10 such as convolutional neural networks, long short-term memory networks,^11,12 and attention mechanisms, have become popular. These models, trained on extensive legal corpora, capture complex patterns and contextual relationships. Pre-trained language models like BERT^13 significantly enhance legal entity recognition and text classification. Combinations like BERT-LSTM (Long Short-Term Memory),^14 BERT-GRU (Gated Recurrent Unit), and BERT-CNN (Convolutional Neural Network) further improve accuracy and generalization. Hierarchical attention networks and multi-task learning methods provide strong results in multi-level classification tasks, handling long texts and complex contexts. In legal extraction and classification,^15,16 challenges such as uncertainty, data imbalance, and noise demand more robust model architectures and data augmentation techniques for better real-world performance. This shift from rule-based processing to deep semantic understanding, powered by deep learning, marks a new era in legal information processing.

The Bi-LSTM-CRF model excels in sequence annotation tasks and is highly applicable to legal information extraction and classification. Bi-LSTM^17,18 captures contextual features by reading the input sequence in both directions, improving semantic understanding in complex texts. CRF,^19,20 a probabilistic graphical model, optimizes sequence labeling by handling dependencies between adjacent labels. The combination of Bi-LSTM for feature extraction and CRF^21,22 for global label optimization enhances entity recognition and relationship extraction in legal texts. Research shows this model^23,24 outperforms traditional methods, especially in tasks like named entity recognition across fields such as legal, medical, and bioinformatics. When paired with BERT for embedding, the Bi-LSTM-CRF architecture significantly improves accuracy and F1 scores, offering superior performance in extracting key information from complex legal documents. Previous studies often lack an in-depth understanding of complex legal texts and fail to effectively capture the dependencies between contextual features and labels. In addition, the limitations of a single model leave the accuracy and robustness of legal information extraction insufficient and make it difficult to adapt to the diversity of legal datasets.

Methods for extracting and classifying legal information

Building a legal corpus

In this article, the original corpus of training data is sourced from publicly available legal documents related to prostitution on the China Judgments Online platform. These documents cover a variety of case types, including criminal judgments and rulings, and provide a wealth of legal clauses and judgment bases. The openness and diversity of the corpus ensure the representativeness of the data and help improve the performance of the model in legal information extraction and classification tasks. To ensure the effectiveness and accuracy of the constructed dataset, the fact description part of each case, which contains the specific case circumstances and the facts on which the legal judgment is based, is extracted from the document through regular expressions. Based on the preliminarily extracted text fragments, a “rule-based coarse labeling, manual recall” annotation mode is adopted to gradually construct the training dataset: the text is first roughly annotated using preset rules and then manually checked and corrected by professional annotators to ensure annotation quality and accuracy. This dual annotation mechanism reduces annotation errors and inconsistencies and improves the overall quality and robustness of the dataset.

When constructing the dataset for extracting legal entity relationships related to prostitution, the annotation process is similar to that of traditional relationship extraction, except that the location of each entity and the type of relationship between entities are annotated at the same time. This joint annotation approach handles both entity recognition and relationship classification within a single annotation framework, which lets the two tasks share information and context, reduces the error accumulation caused by task decomposition, and captures complex semantic relationships between entities in text more effectively. Because the model considers the relationships between entities while identifying them, overall performance and accuracy improve. In addition, joint learning reduces the time and resources required for model training and improves efficiency.

This article constructs a dataset for extracting legal entity relationships related to prostitution, including providing shelter, arranging prostitution, engaging in prostitution, patronizing prostitution, facilitating prostitution, and managing brothels. Examples of relationships related to prostitution are shown in Table 2.

Table 2.

Display of relationships related to prostitution.

Relation	Description of the crime
Provide_shelter_for	From March 2021 to May 2022, the defendant Zhang rented a three-bedroom apartment in a residential area in Haidian District, Beijing, and provided the room to prostitutes Li and Wang for prostitution activities on many occasions, charging rent and additional fees. Zhang knew that others used the room for illegal activities, but still provided shelter and obtained illegal profits.
Arrange_prostitution_for	From November 2020 to August 2021, the defendant Liu contacted clients through WeChat and other communication tools, and arranged for clients to meet with prostitutes in hotels or rental houses arranged by him. During this period, Liu arranged Yang and Wu to engage in prostitution in a hotel in Tianhe District, Guangzhou, many times, and collected agency fees from each transaction.
Engage_in_prostitution	On April 12, 2021, the defendant Li provided sexual services to clients for 500 yuan in a hotel room in Xihu District, Hangzhou. Li received cash remuneration afterward and paid a certain proportion of “venue fees” and “protection fees” to the “organizer” Wang through WeChat transfer. On the day of the incident, Li was arrested on the spot by the police.
Patronize_prostitution	On June 17, 2022, client Zhao obtained prostitution information through a pornographic website, contacted prostitute Qian, and engaged in prostitution in an apartment in Xuanwu District, Nanjing, that evening. Zhao paid 1000 yuan in cash as transaction fees. Based on clues, the police raided the apartment and arrested Zhao and Qian on the spot.
Facilitate_prostitution	Since August 2021, the defendant Wang assisted prostitutes Li and Chen to travel between multiple hotels in Hangzhou and Ningbo by providing a special car pick-up service, charging high “pick-up fees.” In many transactions, Wang also assisted prostitutes to avoid police inspections and provided accommodation arrangements.
Manage_brothel	From October 2020 to January 2022, the defendant He rented a storefront on a commercial street in Shanghai and operated a massage shop called “Taohuayuan,” which was actually a den for prostitution. He hired several women to engage in prostitution and provided illegal sexual services to customers in the name of “massage,” collecting commissions from each transaction.

Provide_shelter_for refers to the act of providing a venue for prostitution activities. Arrange_prostitution_for refers to the act of organizing or arranging others to engage in prostitution activities. Engage_in_prostitution refers to individuals who directly engage in prostitution activities. Patronize_prostitution refers to the act of a client paying money or other forms of compensation in exchange for sexual services. Facilitate_prostitution refers to the act of assisting or promoting prostitution activities through various means, including providing transportation and communication tools. Manage_brothel refers to the operation of venues specifically designed for prostitution activities, including brothels or massage parlors.

The distribution of the constructed dataset for extracting legal entity relationships related to prostitution is shown in Table 3.

Table 3.

Dataset distribution.

Relation	Training set	Test set	Total
Provide_shelter_for	844	211	1055
Arrange_prostitution_for	644	161	805
Engage_in_prostitution	612	153	765
Patronize_prostitution	588	147	735
Facilitate_prostitution	448	112	560
Manage_brothel	412	103	515

In the dataset of legal entity relationships related to prostitution, the distribution of training and test sets for each type of relationship is as follows. The “Provide_shelter_for” relationship has 844 training samples and 211 test samples, for a total of 1055. The “Arrange_prostitution_for” relationship has 644 training samples and 161 test samples, for a total of 805. The “Engage_in_prostitution” relationship has 612 training samples and 153 test samples, for a total of 765. The “Patronize_prostitution” relationship has 588 training samples and 147 test samples, for a total of 735. The “Facilitate_prostitution” relationship has 448 training samples and 112 test samples, for a total of 560. The “Manage_brothel” relationship has 412 training samples and 103 test samples, for a total of 515. Each relationship category has sufficient data for learning the characteristics of entities and relationships, which helps the model identify them in legal document information extraction tasks. Among all relationships, Provide_shelter_for has the most training data (844 samples) and Manage_brothel has the least (412 samples).
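The counts in Table 3 can be checked with a short script. The sketch below only re-derives totals and split ratios from the published numbers; the variable and function names are illustrative. Run on these counts, it shows that every relationship is divided in the same 80/20 train/test proportion.

```python
# (training samples, test samples) per relationship, copied from Table 3.
counts = {
    "Provide_shelter_for": (844, 211),
    "Arrange_prostitution_for": (644, 161),
    "Engage_in_prostitution": (612, 153),
    "Patronize_prostitution": (588, 147),
    "Facilitate_prostitution": (448, 112),
    "Manage_brothel": (412, 103),
}

def summarize(split_counts):
    """Map each relationship to (total samples, training share)."""
    summary = {}
    for relation, (train, test) in split_counts.items():
        total = train + test
        summary[relation] = (total, train / total)
    return summary

summary = summarize(counts)
grand_total = sum(total for total, _ in summary.values())

for relation, (total, train_share) in summary.items():
    print(f"{relation:26s} total={total:5d} train share={train_share:.2f}")
print("all relationships:", grand_total)
```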

The sentence length distribution of entity relation triplets in the dataset is shown in Figure 2.

Figure 2.

Distribution of sentence length.

The distribution of sentence length for entity relationship triplets differs significantly across intervals. There are 250 sentences with a length less than 40, which is a relatively small proportion. There are 543 sentences between 41 and 80 in length, and 689 between 81 and 120. As sentence length increases, the number of sentences rises to 789 in the 121 to 160 interval and to 921 in the 161 to 200 interval, which contains the most sentences. There are 826 sentences with a length between 201 and 240, while the number of sentences longer than 240 drops to 417. The distribution therefore peaks between 161 and 200 words, indicating that entity relationship triplets in legal documents are expressed in complex and lengthy sentences, mainly because of the professionalism and rigor of legal texts. Legal clauses, case descriptions, and reasons for judgments often need to be elaborated in detail to ensure accurate communication of information and clarity of legal application. Although this lengthy expression is necessary in law, it also brings greater challenges to information extraction and classification.
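The interval counts quoted above can be turned into shares with a few lines of arithmetic. This is only a restatement of the reported numbers; the bin labels are illustrative names for the intervals in Figure 2. The seven counts add up to 4435, the same total as the samples in Table 3.

```python
# Sentence counts per length interval, as reported for Figure 2.
bins = {
    "<40": 250,
    "41-80": 543,
    "81-120": 689,
    "121-160": 789,
    "161-200": 921,
    "201-240": 826,
    ">240": 417,
}

total = sum(bins.values())
shares = {label: count / total for label, count in bins.items()}
modal_bin = max(bins, key=bins.get)  # interval with the most sentences

for label, share in shares.items():
    print(f"{label:8s} {bins[label]:4d}  {share:6.1%}")
print("total:", total, "| modal interval:", modal_bin)
```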

To accurately identify and classify entities and their relationships in legal documents, this article uses a combined labeling strategy to annotate legal entity relationships related to prostitution, so that entities and relationships are extracted together. Each label records whether a token is at the start position of an entity, at a continuation position, or at a non-entity position, together with the relationship type and the position of the entity in the triplet (1 represents the head entity and 2 represents the tail entity). Table 4 shows the labeling strategies for each relationship.

Table 4.

Label strategy.

Label	Meaning
Provide_shelter_for-1	Person providing the venue
Provide_shelter_for-2	Person given shelter
Arrange_prostitution_for-1	Organizer who arranges prostitution
Arrange_prostitution_for-2	Prostitute or client
Engage_in_prostitution-1	Prostitute
Engage_in_prostitution-2	Client who transacts with her or him
Patronize_prostitution-1	Client
Patronize_prostitution-2	Prostitute
Facilitate_prostitution-1	Person providing tools to assist prostitution
Facilitate_prostitution-2	Person involved in prostitution
Manage_brothel-1	Brothel operator
Manage_brothel-2	Prostitute or client

An entity whose tag contains “1” (a head entity) can only be matched with the nearest entity whose tag contains “2” (a tail entity) in the backward direction; conversely, an entity whose tag contains “2” can only be matched with the nearest “1” entity in the forward direction. This matching rule effectively avoids confusion in relation extraction.
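The matching rule can be illustrated with a small decoder. This is a sketch under stated assumptions, not the authors' implementation: the tag format `B-<relation>-<role>` / `I-<relation>-<role>` / `O` is assumed from the description of start, continuation, and non-entity positions, and the rule is read as "each tail entity pairs with the nearest head entity of the same relation that precedes it in the sentence", which reproduces the two triplets of Table 1. The function names are hypothetical.

```python
def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (start_index, text, relation, role) spans."""
    entities, current = [], None
    for index, (token, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            if current:
                entities.append(tuple(current))
            relation, role = tag[2:].rsplit("-", 1)
            current = [index, token, relation, role]
        elif tag.startswith("I-") and current and tag[2:] == f"{current[2]}-{current[3]}":
            current[1] += " " + token  # continuation of the open entity
        else:
            if current:
                entities.append(tuple(current))
            current = None
    if current:
        entities.append(tuple(current))
    return entities

def pair_entities(entities):
    """Pair each tail (role '2') with the nearest preceding head (role '1')."""
    triplets = []
    for start, text, relation, role in entities:
        if role != "2":
            continue
        heads = [e for e in entities
                 if e[3] == "1" and e[2] == relation and e[0] < start]
        if heads:
            nearest = max(heads, key=lambda e: e[0])
            triplets.append((nearest[1], relation, text))
    return triplets

# Hypothetical tagged sentence matching the example in Table 1.
tokens = ["Zhang", "provided", "rooms", "to", "Li", "and", "Wang"]
tags = ["B-Provide_shelter_for-1", "O", "O", "O",
        "B-Provide_shelter_for-2", "O", "B-Provide_shelter_for-2"]
triplets = pair_entities(extract_entities(tokens, tags))
print(triplets)
```

Because relation type and role are carried by the tag itself, no separate pass over all entity pairs is needed, which is the redundancy the joint scheme is meant to avoid.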

Construction of BERT-Bi-LSTM-CRF model

This article combines BERT,^25,26 Bi-LSTM, and CRF to extract and classify legal information. The BERT-Bi-LSTM-CRF model structure is shown in Figure 3.

Figure 3.

BERT-Bi-LSTM-CRF model structure.

The word vectors generated by BERT are input into a bidirectional long short-term memory network. Bi-LSTM captures contextual features of sequence data through two LSTM layers, one forward and one backward, and can remember long-range dependencies in the sequence. By learning how each word relates to the words before and after it, this process enhances the model’s understanding of legal terms and entity relationships and extracts semantically rich feature representations. The contextual feature vectors output by Bi-LSTM are input into the conditional random field layer for sequence annotation. CRF models the global dependencies and constraints of label sequences, further improving the accuracy of sequence prediction. By combining the Bi-LSTM output with the CRF decoding algorithm, the model outputs the optimal label sequence and avoids the inconsistent labeling that arises in independent classification models, thus achieving legal information extraction and classification.

The foundation of BERT is the Transformer architecture, which is composed of multiple stacked encoders. Each encoder consists of a self-attention mechanism and a feedforward neural network. Each encoder layer of BERT maps the input sequence to a new vector space, generating higher-level semantic features layer by layer. This gives BERT a strong ability to understand context, and its bidirectional encoding lets the model consider both the left and right context of each word, improving its ability to capture complex semantics.

The self-attention mechanism is calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \tag{1}$$

$Q$, $K$, and $V$ are matrices obtained by linear transformation of the input sequence, and $d_{k}$ is the dimension of the key vector. The purpose of the self-attention mechanism is to recalculate the representation of each word in the input sequence based on the correlation weights between words.
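Equation (1) can be traced by hand on a toy input. The sketch below uses plain Python lists so that each step is visible; it covers a single attention head only, and the small matrices are made-up values rather than anything produced by BERT.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy example: two tokens with two-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(q, k, v)
print(weights)
print(output)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors, with more weight on the token whose key best matches the query.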

BERT uses two main pre-training tasks: masked language modeling and next sentence prediction. The masked language model lets the model learn vocabulary relationships in context, while next sentence prediction helps capture the logical relationship between sentences. This pre-training method enables BERT to handle a variety of natural language processing tasks effectively, significantly improving performance and generalization. The masked language model randomly masks some words in the input sequence and then trains the model to predict them. 15% of the words are randomly selected for masking. Each selected word has an 80% probability of being replaced with the special [MASK] tag, a 10% probability of being replaced with a random other word, and a 10% probability of remaining unchanged. The input of the model is the sentence with the selected words masked, and the model must predict the original word at each selected position.
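The 15% selection and the 80/10/10 replacement rule can be sketched as follows. This is an illustrative word-level version, not BERT's actual tokenizer pipeline: the function name `mask_tokens`, the toy vocabulary, and the assumption that the vocabulary holds at least two distinct words are all choices made for the example.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15, rng=None):
    """Return (corrupted tokens, {position: original word}) for MLM training."""
    rng = rng or random.Random()
    corrupted, targets = list(tokens), {}
    for position, word in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # about 85% of positions are left alone and not predicted
        targets[position] = word  # the model must recover the original word
        roll = rng.random()
        if roll < 0.8:
            corrupted[position] = mask_token  # 80%: replace with [MASK]
        elif roll < 0.9:
            # 10%: replace with a random other word (vocab needs >= 2 words)
            corrupted[position] = rng.choice([w for w in vocab if w != word])
        # remaining 10%: keep the word unchanged
    return corrupted, targets

# Hypothetical toy vocabulary and sentence.
vocab = ["court", "defendant", "hotel", "police", "rented", "the"]
sentence = ["the", "defendant", "rented", "the", "hotel"]
corrupted, targets = mask_tokens(sentence, vocab, rng=random.Random(7))
print(corrupted, targets)
```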

The masked language model uses cross-entropy loss to measure how well the model predicts the masked words, and the formula is as follows:

$$L_{MLM} = - \sum_{i = 1}^{N} \log P(w_{i} \mid w_{1}, \dots, w_{i - 1}, w_{i + 1}, \dots, w_{N}) \tag{2}$$

$w_{i}$ is the masked word, $N$ is the length of the sentence, and $P(w_{i} \mid \cdot)$ is the probability predicted by the model.

The goal of the next sentence prediction task is to predict whether two pieces of text are adjacent in the original corpus. The input is two pieces of text, A and B, arranged as [CLS] + A + [SEP] + B + [SEP]. There is a 50% probability that B is the real next sentence after A and a 50% probability that B is a sentence randomly selected from the corpus. The loss function of next sentence prediction measures how well the model predicts the next-sentence relationship. The formula is as follows:

$$L_{NSP} = - \sum_{i = 1}^{M} \left[ y_{i} \log P(y_{i} \mid x_{i}) + (1 - y_{i}) \log (1 - P(y_{i} \mid x_{i})) \right] \tag{3}$$

In formula (3), $M$ is the number of training samples, and $y_{i}$ and $x_{i}$ are the target labels and model inputs, respectively.
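The input layout and the loss in formula (3) can be written out directly. In this sketch the predicted quantity is taken to be the probability that B really follows A (label 1), which is one reading of $P(y_{i} \mid x_{i})$; the function names and the toy probabilities are illustrative.

```python
import math

def build_nsp_input(sentence_a, sentence_b):
    """Arrange two token lists as [CLS] + A + [SEP] + B + [SEP]."""
    return ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]

def nsp_loss(labels, probs):
    """Binary cross-entropy summed over samples.

    labels[i] is 1 if B is the true next sentence, else 0.
    probs[i] is the predicted probability of label 1, strictly between 0 and 1.
    """
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))

tokens = build_nsp_input(["the", "court", "ruled"], ["the", "defendant", "appealed"])
print(tokens)

labels = [1, 0]
print(nsp_loss(labels, [0.9, 0.1]))  # confident and correct: small loss
print(nsp_loss(labels, [0.1, 0.9]))  # confident and wrong: large loss
```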

The input of the BERT model consists of three parts: input embedding, position embedding, and segment embedding. Input embedding converts each word into a vector representation through a word embedding matrix. Position embedding encodes the position of each word in the sentence so that the model can use word order in the sequence. Segment embedding distinguishes the different sentences in the input and helps the model understand the relationship between them. Together, these three embeddings enable BERT to capture the semantic and structural information of the text. BERT’s output is a sequence in which each word’s representation is computed through a multi-layer Transformer, and each layer’s output captures both contextual and deep semantic information.

Bi-LSTM^27,28 is a bidirectional long short-term memory network that excels in sequence annotation by capturing contextual features from both the forward and backward directions. By combining two LSTM layers, one reading the input sequence forward and one reading it backward, Bi-LSTM handles long dependencies in natural language effectively and captures contextual relationships more comprehensively. This improves the model’s understanding of complex language phenomena and makes it more accurate in natural language processing tasks.

The formula for the forget gate is as follows:

$$f_{t} = \sigma(W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f}) \tag{4}$$

$W_{f}$ is the weight matrix of the forget gate, $h_{t - 1}$ is the hidden state at the previous time step, and $x_{t}$ is the input at the current time step.

The formulas for the input gate and the candidate cell state are as follows:

$$i_{t} = \sigma(W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i}) \tag{5}$$

$$\tilde{C}_{t} = \tanh(W_{C} \cdot [h_{t - 1}, x_{t}] + b_{C}) \tag{6}$$

$\tilde{C}_{t}$ is the candidate cell state, and $W_{i}$ and $W_{C}$ are the weight matrices for the input gate and candidate cell state, respectively.

The cell state is updated as follows:

$$C_{t} = f_{t} \cdot C_{t - 1} + i_{t} \cdot \tilde{C}_{t} \tag{7}$$

The output gate formula is as follows:

$$o_{t} = \sigma(W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o}) \tag{8}$$

$$h_{t} = o_{t} \cdot \tanh(C_{t}) \tag{9}$$

$o_{t}$ is the activation value of the output gate, $h_{t}$ is the hidden state at the current time step, $W_{o}$ is the weight matrix of the output gate, and $b_{o}$ is the bias.

If the input sequence is set to $E = \{x_{1}, x_{2}, \dots, x_{T}\}$, then the forward and backward hidden states at time step $t$ are $\overrightarrow{h_{t}}$ and $\overleftarrow{h_{t}}$. The output of Bi-LSTM is represented as follows:

$$h_{t} = [\overrightarrow{h_{t}}; \overleftarrow{h_{t}}] \tag{10}$$
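Equations (4) to (10) can be followed step by step in a small pure-Python sketch. It is meant only to make the data flow concrete: the constant weights from `toy_params` are hypothetical placeholders rather than trained values, products between gate vectors are taken element-wise, and the hidden size of 2 is chosen for readability instead of the 128 used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vector):
    """Compute W·v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def lstm_step(params, h_prev, c_prev, x):
    """One LSTM time step following equations (4)-(9)."""
    z = h_prev + x  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]          # (4)
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]          # (5)
    c_tilde = [math.tanh(a) for a in affine(params["Wc"], params["bc"], z)]  # (6)
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)] # (7)
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]          # (8)
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]                         # (9)
    return h, c

def run_lstm(params, inputs, hidden):
    """Run one direction over the sequence, starting from zero states."""
    h, c, outputs = [0.0] * hidden, [0.0] * hidden, []
    for x in inputs:
        h, c = lstm_step(params, h, c, x)
        outputs.append(h)
    return outputs

def bilstm(forward_params, backward_params, inputs, hidden):
    """Concatenate forward and backward hidden states per step, equation (10)."""
    forward = run_lstm(forward_params, inputs, hidden)
    backward = run_lstm(backward_params, inputs[::-1], hidden)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

def toy_params(hidden, input_size, value):
    """Hypothetical constant weights, only so that the example runs."""
    params = {}
    for gate in ("f", "i", "c", "o"):
        params["W" + gate] = [[value] * (hidden + input_size) for _ in range(hidden)]
        params["b" + gate] = [0.0] * hidden
    return params

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = bilstm(toy_params(2, 2, 0.1), toy_params(2, 2, -0.1), sequence, hidden=2)
print(len(states), len(states[0]))  # one vector per step, twice the hidden size
```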

CRF^29,30 is a conditional probability model that measures the transition probability of each pair of adjacent labels in a sequence and the strength of the association between each label and the features of the input sequence by defining a set of feature functions, thereby learning the complex relationship between input and output. In this article, conditional random fields are used to optimize sequence labeling tasks, combined with the feature extraction capabilities of the Bi-LSTM model. By learning the transition probabilities of adjacent labels and the associations between input features, CRF can enhance the accuracy of entity recognition and relationship extraction in legal texts, and improve the model’s ability to understand and process complex legal relationships. Given the input sequence X, the conditional probability of the output label sequence Y can be expressed as follows:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp \left( \sum_{t = 1}^{T} \sum_{k = 1}^{K} \lambda_{k} f_{k}(y_{t - 1}, y_{t}, X, t) \right) \tag{11}$$

$Z(X)$ is a normalization factor that ensures that the probabilities of all possible label sequences sum to 1. The formula is as follows:

$$Z(X) = \sum_{Y'} \exp \left( \sum_{t = 1}^{T} \sum_{k = 1}^{K} \lambda_{k} f_{k}(y'_{t - 1}, y'_{t}, X, t) \right) \tag{12}$$

$f_{k}(y'_{t - 1}, y'_{t}, X, t)$ is a feature function that represents the transition from the previous label $y'_{t - 1}$ to the current label $y'_{t}$, combined with the information of the input sequence $X$ at time $t$. $\lambda_{k}$ is the weight parameter of the feature function, which is learned from the training data.
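Equations (11) and (12) can be evaluated exactly on a tiny example. The sketch below makes a simplifying assumption that is common when a CRF sits on top of a neural encoder: the weighted feature sum is split into an emission score per position and label plus a transition score per label pair, and no transition is scored before the first label. All numbers are made up, and $Z(X)$ is computed by brute-force enumeration, which is only feasible for very short sequences.

```python
import math
from itertools import product

def sequence_score(emissions, transitions, labels):
    """Sum of emission and transition scores along one label sequence."""
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score

def partition(emissions, transitions):
    """Z(X): sum of exp(score) over every possible label sequence."""
    n_labels = len(transitions)
    return sum(math.exp(sequence_score(emissions, transitions, labels))
               for labels in product(range(n_labels), repeat=len(emissions)))

def probability(emissions, transitions, labels):
    """P(Y | X) as in equation (11)."""
    score = sequence_score(emissions, transitions, labels)
    return math.exp(score) / partition(emissions, transitions)

# Hypothetical scores: 3 positions, 2 labels.
emissions = [[1.0, 0.6], [0.3, 0.8], [0.5, 0.4]]
transitions = [[0.5, -0.5], [-1.0, 0.7]]

for labels in product(range(2), repeat=3):
    print(labels, round(probability(emissions, transitions, labels), 4))
```

In practice the partition function is computed with the forward algorithm in time linear in the sequence length; enumeration is used here only because it mirrors equation (12) literally.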

CRF is used in the sequence decoding part of legal information extraction models. In legal entity extraction tasks, adjacent labels often have strong dependency relationships. By using CRF, the model can learn the dependency relationships between these labels and output the most likely label sequence for a given input sequence based on these relationships.
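Finding the most likely label sequence is typically done with the Viterbi algorithm. The article does not name its decoding algorithm, so the following is a sketch of the standard choice for a linear-chain CRF, using the same assumed split into emission and transition scores (with no transition scored before the first label) and made-up numbers.

```python
def viterbi(emissions, transitions):
    """Return (best label path, best score) for a linear-chain CRF."""
    n_labels = len(transitions)
    best = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for current in range(n_labels):
            # Best previous label for arriving at `current` at step t.
            previous = max(range(n_labels),
                           key=lambda p: best[p] + transitions[p][current])
            pointers.append(previous)
            new_best.append(best[previous] + transitions[previous][current]
                            + emissions[t][current])
        best = new_best
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda label: best[label])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])  # follow pointers back to the start
    return path[::-1], best[last]

# Hypothetical scores: 3 positions, 2 labels.
emissions = [[1.0, 0.6], [0.3, 0.8], [0.5, 0.4]]
transitions = [[0.5, -0.5], [-1.0, 0.7]]
path, score = viterbi(emissions, transitions)
print(path, score)
```

Looking at emissions alone, label 0 would be chosen at the first and last positions; the transition scores make the consistent sequence of label 1 throughout the better overall choice, which is the kind of global correction the CRF layer contributes.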

Model training and evaluation

The BERT-Bi-LSTM-CRF model for legal information extraction and classification is trained in a Python environment. The BERT model uses bert-base-uncased; the Bi-LSTM is bidirectional with a hidden layer dimension of 128; the learning rate is set to 0.001 and the batch size to 32; the CRF uses a linear-chain structure; and the number of iterations is set to 20. The pre-trained BERT model encodes the legal texts and generates a contextual representation for each word. These representations are processed by the Bi-LSTM layer to extract sequence features and capture the bidirectional dependencies of the text. The output of the Bi-LSTM is passed to the CRF layer for sequence annotation, where the global view of the CRF layer is used to optimize the label sequence. The model is trained by minimizing the loss function, with the Adam optimizer used for parameter updates. The model is iteratively optimized on the training set, and its performance is evaluated on the test set. The hyperparameters are tuned to improve precision, recall, and F1 score.

The experimental environment and parameter setting information are shown in Table 5.

Table 5.

Experimental environment and parameter setting information.

Experimental environment	Configuration	Parameter name	Value
Operating system	Windows 10	Learning rate	0.001
Central processing unit	Intel Core i9-13900K	Optimizer	Adam
Programming language	Python	Epoch	100
Memory	16 GB	Batch size	64
Storage space	1 TB	Dropout	0.5

In the task of combining BERT, Bi-LSTM, and CRF models for legal information extraction and classification, the evaluation of the model considers multiple evaluation metrics, including accuracy, precision, recall, and F1 score. These evaluation indicators are analyzed separately at the levels of entity classification and relationship classification, and detailed evaluations are conducted for different datasets to comprehensively measure the performance of the model.
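The four metrics can be computed from gold and predicted labels as in the sketch below. The article does not state how per-class scores are averaged, so the code only reports overall accuracy and per-label precision, recall, and F1; the function name, the label names, and the toy data are hypothetical.

```python
from collections import Counter


def classification_report(gold, predicted):
    """Return overall accuracy and per-label precision, recall, and F1."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    per_label = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(tp.values()) / len(gold)
    return accuracy, per_label


# Hypothetical toy labels, not taken from the article's dataset.
gold = ["PARTY", "PARTY", "PARTY", "CASE", "CASE", "CLAUSE"]
predicted = ["PARTY", "PARTY", "CASE", "CASE", "CASE", "CLAUSE"]

accuracy, per_label = classification_report(gold, predicted)
for label, scores in per_label.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```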

Because datasets differ in their text features and legal clauses, testing on several of them evaluates the adaptability and robustness of the model in diverse scenarios. Such testing verifies the generalization ability of the model and supports its reliability in practical applications. This article selects publicly available legal datasets to analyze the stability of the model; their details are shown in Table 6.

Table 6.

Public datasets.

Dataset	Content	Number of cases
CAIL2019	Used for tasks such as legal question answering, crime prediction, and legal article recommendation.	26000+
CJRC	Contains a corpus of Chinese judicial cases covering a wide range of legal fields.	16000+
LexGLUE	Used for classification and understanding evaluation of legal documents.	7000+
COLIEE	Contains case data for legal reasoning and precedent retrieval.	7000+
ECHR	Covers case and judgment documents of the European Court of Human Rights.	11000+

In order to comprehensively analyze the effectiveness of the article’s model in legal information extraction and classification, BERT-Bi-LSTM-CRF was compared with various models, and the structural information of different models is shown in Table 7.

Table 7.

Information on different model structures.

Model	Encoder	Decoder
Pipeline-BERT	—	—
Lawformer	—	—
Legal-BERT	—	—
CNN-LSTM	CNN	LSTM
CNN-GRU	CNN	GRU
CNN-RNN	CNN	RNN
BERT-Bi-LSTM	BERT	Bi-LSTM
This article model	BERT	Bi-LSTM-CRF

Results

Results of legal information extraction

The combined BERT, Bi-LSTM, and CRF model automatically identifies and extracts key information from large volumes of unstructured legal documents and organizes it into a structured data format. This reduces manual review time and improves work efficiency, and the structured format makes subsequent data analysis and legal research easier. The extracted information provides basic data support for legal analysis and intelligent trials, helping legal practitioners obtain the information they need quickly and promoting automation and intelligent technology in the legal field. The legal information extraction results of this method are shown in Figure 4.

Figure 4.

Results of legal information extraction.

In the task of legal information extraction, this article uses the BERT-Bi-LSTM-CRF model to extract information and classify relationships from complex legal texts. BERT generates context-rich word vector representations, Bi-LSTM captures the bidirectional features of the input text, and the CRF layer decodes and optimizes the label sequence to ensure the accuracy of entity relationships. The effect of the model can be seen through two examples. In the “Provide_shelter_for” example, “Zhang” is accurately extracted as the head entity providing shelter, “Li” and “Wang” are extracted as the tail entities, and the relationships “Zhang Provide_shelter_for Li” and “Zhang Provide_shelter_for Wang” are established. This type of extraction accurately reveals the subject-object relationship of the criminal activity in the text, namely that Zhang provided a place for Li and Wang to engage in illegal activities.
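One way to hold the extracted relationships as structured data is a list of (head, relation, tail) records. The sketch below rebuilds the “Provide_shelter_for” example in that form; the record layout, the class name, and the helper function are assumptions for illustration, since the article does not specify its output schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RelationTriple:
    """Hypothetical record for one extracted relationship."""
    head: str
    relation: str
    tail: str


def build_triples(head, relation, tails):
    """Pair one head entity with each tail entity under the same relation."""
    return [RelationTriple(head, relation, tail) for tail in tails]


# Entities and relation taken from the example in the text.
triples = build_triples("Zhang", "Provide_shelter_for", ["Li", "Wang"])
records = [asdict(triple) for triple in triples]
print(json.dumps(records, indent=2))
```

Records in this shape can be loaded directly into a table or a graph store for later analysis.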

Entity classification results

Entity classification identifies and distinguishes entity types in legal texts, such as parties, cases, and legal clauses, which aids in understanding the semantic structure of the text and improves the accuracy and efficiency of legal information retrieval. It supports information extraction, relationship classification, and legal analysis, enhancing the automation of legal document processing. The results are shown in Table 8.

Table 8.

Results of entity classification.

Model	Accuracy	Precision	Recall	F1 value
Pipeline-BERT	89.5%	87.8%	88.2%	88.0%
Lawformer	91.3%	89.7%	90.2%	89.9%
Legal-BERT	90.8%	90.1%	89.5%	89.8%
CNN-LSTM	85.7%	84.5%	83.8%	84.1%
CNN-GRU	86.2%	85.0%	84.3%	84.6%
CNN-RNN	84.8%	83.4%	82.7%	83.0%
BERT-Bi-LSTM	92.1%	91.4%	90.8%	91.1%
This article model	93.6%	92.7%	92.1%	92.4%

The model achieved the highest accuracy of 93.6% in legal text entity classification, outperforming BERT-Bi-LSTM (92.1%), Lawformer (91.3%), and Legal-BERT (90.8%). Traditional models like CNN-LSTM and CNN-GRU scored lower (85.7%–86.2%), indicating limitations in capturing complex legal features. The article’s model leads in precision (92.7%) and recall (92.1%), surpassing BERT-Bi-LSTM and Lawformer, demonstrating superior sensitivity, while CNN models missed important information with lower recall rates.

The F1 score, reflecting precision and recall, shows this model at 92.4%, outperforming others and indicating strong precision and coverage. BERT-Bi-LSTM follows at 91.1%, while Lawformer and Legal-BERT score 89.9% and 89.8%, respectively. Traditional CNN models fall below 85%, struggling with complex legal semantics. BERT-based models excel in legal tasks due to their contextual understanding. This model’s performance across all metrics confirms its effectiveness in legal information extraction.
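Since F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), the F1 column of Table 8 can be recomputed from the other two columns. The short check below does this for every row; the variable and function names are illustrative, and the figures are copied from Table 8.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in Table 8.
table_8 = {
    "Pipeline-BERT": (87.8, 88.2, 88.0),
    "Lawformer": (89.7, 90.2, 89.9),
    "Legal-BERT": (90.1, 89.5, 89.8),
    "CNN-LSTM": (84.5, 83.8, 84.1),
    "CNN-GRU": (85.0, 84.3, 84.6),
    "CNN-RNN": (83.4, 82.7, 83.0),
    "BERT-Bi-LSTM": (91.4, 90.8, 91.1),
    "This article model": (92.7, 92.1, 92.4),
}

recomputed = {
    name: round(f1_score(precision, recall), 1)
    for name, (precision, recall, _) in table_8.items()
}

for name, (_, _, reported) in table_8.items():
    print(f"{name}: reported {reported}, recomputed {recomputed[name]}")
```

Every recomputed value agrees with the reported F1 to one decimal place.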

Relationship classification results

Relationship classification identifies semantic links between entities, enhancing the accuracy of legal information extraction. The confusion matrix for relationship classification is shown in Figure 5.

Figure 5.

Confusion matrix for relationship classification.

This confusion matrix shows the model’s performance on the different relationship classes in the legal information extraction task. The model classifies most relationships accurately: all 153 samples in the “Engage_in_prostitution” class were correctly classified, with no errors. “Arrange_prostitution_for” also showed high classification accuracy, with only 1 sample misclassified as “Provide_shelter_for.” The “Patronize_prostitution” class also performed well: 145 samples were correctly classified and only 2 were misclassified as “Provide_shelter_for.” For the “Facilitate_prostitution” class, 110 samples were correctly classified, 1 was misclassified as “Patronize_prostitution,” and 1 was misclassified as “Manage_brothel.” The “Manage_brothel” class had slightly more errors, with 3 samples misclassified as “Facilitate_prostitution.” Overall, the model is relatively stable in relation classification and is especially accurate in distinguishing relationships such as “Engage_in_prostitution” and “Arrange_prostitution_for.” However, there is some confusion involving “Provide_shelter_for,” which receives misclassified samples from other classes.
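The counts quoted above are enough to work out per-class recall for three of the classes, as sketched below. The calculation assumes that the misclassifications listed in the text are the only errors for each class, which Figure 5 would need to confirm; the variable names are illustrative.

```python
def recall(correct, missed):
    """Share of a class's true samples that were labeled correctly."""
    return correct / (correct + missed)


# (correctly classified, misclassified) counts as stated in the text.
stated_counts = {
    "Engage_in_prostitution": (153, 0),
    "Patronize_prostitution": (145, 2),
    "Facilitate_prostitution": (110, 2),
}

per_class_recall = {
    label: recall(correct, missed)
    for label, (correct, missed) in stated_counts.items()
}

for label, value in per_class_recall.items():
    print(f"{label}: {value:.1%}")
```

Under that assumption the three classes come out at 100.0%, 98.6%, and 98.2% recall, respectively.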

The relationship classification performance of different models is shown in Figure 6.

Figure 6.

Relationship classification performance of different models.

The model presented in this article outperforms the other models with an average accuracy of 93.6%, and it reaches 94.2% and 94.1% on the “Engage_in_prostitution” and “Provide_shelter_for” relations, respectively. BERT-Bi-LSTM also performs well, with an average accuracy of 92.1% and an accuracy of 93.0% on “Engage_in_prostitution.” These results indicate that the improved BERT-based model better captures complex semantic and contextual information in legal texts, thereby improving the accuracy of relationship classification. Pipeline-BERT and Lawformer also performed well, with average accuracies of 89.5% and 91.3%, respectively. The advantage of these models stems from the rich semantic representations that their pre-trained language models learn from large-scale text data and from their ability to capture fine-grained features during task fine-tuning. The traditional deep learning models CNN-LSTM, CNN-GRU, and CNN-RNN have lower average accuracies of 85.7%, 86.2%, and 84.8%, respectively, and show clear shortcomings on relations with complex semantic structure. On the “Manage_brothel” relation, CNN-RNN reaches only 85.2%, which reflects its limitations in handling highly complex, context-dependent relationships. In legal information extraction and classification, the model in this article therefore improves the accuracy of relationship classification and provides technical support for future automated legal text processing and intelligent legal analysis systems.

Stability analysis

To analyze the stability of the model in this article, different publicly available legal datasets were selected for validation. The results of the stability analysis are shown in Figure 7.

Figure 7.

Results of stability analysis.

The stability analysis across the legal datasets shows consistent performance: accuracy fluctuates by only 1.2 percentage points (from 94.6% on CAIL2019 to 93.4% on ECHR), precision remains stable at 92.7%, and recall varies by only 0.7 percentage points. The higher accuracy on CAIL2019 is attributed to its richer training samples, while the lower accuracy on ECHR reflects the complexity of international legal provisions. Overall, the model shows consistent performance and strong adaptability across different legal texts, which supports its robustness in legal information extraction and classification.

Conclusions

This article introduced a deep learning method for legal information extraction and classification that combines BERT, Bi-LSTM, and CRF models. Evaluated on public legal datasets, the model performs well in entity and relationship classification, particularly in complex scenarios. By leveraging BERT’s semantic representation, Bi-LSTM’s feature extraction, and CRF’s sequence-level optimization, it provides an efficient and versatile solution for legal intelligence research. This article provides technical support for applications such as automatic analysis of legal documents, case retrieval, and legal risk assessment in the judicial field, helping legal practitioners improve their work efficiency. A limitation of the research is that the model adapts poorly to out-of-domain data, and its extraction performance drops on complex sentences or non-standardized legal texts, which may affect its generalization ability. Future work will optimize the domain adaptation ability of the model, further improve the processing of complex legal texts, and explore more efficient deep learning methods that extend to more legal application scenarios.

Statements and declarations

Footnotes

Conflicting interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Jingwei Li
