Abstract
Computers can complete a task correctly and within minimal computational time only when they are given structured input data, so programmers must supply such data for a program to execute successfully. Hard-copy documents therefore need to be scanned to produce computer-readable soft copies. This process is also a budget-friendly way to relieve human staff of record maintenance, and the resulting automation significantly reduces the workload of existing personnel. The concept may benefit the delivery of any type of service to the ultimate beneficiary (i.e., the citizen) in a minimal time frame. Because a large population seeks legal help, the administration must deal with a large number of citizens’ issues, which has led to a large backlog of pending cases in the country’s courts. To help deliver justice promptly and to reduce the workload of legal professionals, this paper proposes a machine learning based automated legal model that enhances the efficiency of the legal support system, achieving an accuracy of 94%.
Introduction
Digital technologies have encouraged the incorporation of human-like intelligence into machines through artificial intelligence, enabling them to think and act in a human-like manner. Machine learning algorithms for text recognition have helped machines read [1] and summarize digitized documents [2], which can be accessed remotely through a virtual medium such as the internet, thereby reducing the need for physical communication. In the current pandemic, with people maintaining social distancing to break the transmission chain of coronavirus (COVID-19), various professional firms face challenges in maintaining computer-readable records so that they can continue operating over a virtual medium such as the internet. Even the safe storage of hard-copy documents in a way that allows easy access from any location has become a problem. To resolve this issue, hard-copy documents may be converted into computer-readable formats such as JPEG, PNG, or PDF so that they can be easily accessed through the internet. In developing countries like India, service sectors such as the judiciary, education, public healthcare and sanitation, and public transport are already overburdened due to a lack of skilled human resources, budget, and infrastructure. According to a legal journal (2019) [3], more than 4.3 million cases are pending across 25 high courts in India, of which more than 0.8 million have been pending for over a decade. This is because there are only 19 judges for every one million cases [4, 5], which reflects the scarcity of legal professionals. Moreover, the current pandemic has worsened the situation: legal professionals are forced to maintain social distancing, which hampers the regular proceedings of the judicial system. This shows the need for an automated legal support system that can operate efficiently over a virtual medium such as the internet.
For that purpose, an intelligent system has to be developed that converts argument-based legal hard-copy documents into their equivalent soft-copy form so that they can be transmitted through the internet. Such a legal support system should help legal professionals share their documents online so that court proceedings are not stalled by natural calamities, global pandemics, and similar events. To be precise, legal documents such as case histories, legal judgments, and precedents should be managed appropriately to facilitate remote access. This issue is most evident in the numerous trial courts across our nation. To achieve this objective, the authors propose a legal support system that uses machine learning to make the conventional legal system more efficient and robust. The proposed model relies on document image processing, text recognition, and text summarization. Document image processing refers to the entire life cycle of document image [6] conversion from start to end [7]; its output is an electronic file that allows quicker access to the document. A text summarization algorithm condenses long texts into shorter pieces [8, 9]. This approach generates a concise summary consisting of the significant portions of a legal text document, i.e., a summary that contains the critical information about a case history. The summarizer used in this paper identifies the significant sentences in the input, which can be either a single text document or a cluster of related text documents, as explained in detail in this paper.
The relevant literature is surveyed in Section 2. The proposed machine learning based legal support system is described in Section 3. The results generated by the proposed system are presented in Section 4. The conclusions drawn from this work are given in Section 5.
Literature survey
In today’s digital era, machine learning is applied in diverse domains such as medicine, geology, agriculture, and text analysis [10–17]. Researchers have been developing techniques to improve the accuracy of character recognition with optical character recognition (OCR). For example, Phangtriastu et al. [18] combined feature extraction techniques such as the histogram of oriented gradients (HOG), the zoning algorithm, and the projection profile with two standard classifiers, namely the support vector machine (SVM) and the artificial neural network (ANN). Likewise, Mithe, Indalkar, and Divekar [19] proposed a character recognition method using Tesseract (an OCR engine for text recognition) on the Android system. Pawar et al. [20] also implemented the Tesseract OCR engine to extract textual data from scanned documents or images.
Similarly, Chanda et al. [21] proposed a system that addresses the problem of recognizing handwritten scripts (Chinese, Japanese, Korean, and Roman) using directional chain-code histogram-based features together with a Gaussian kernel-based support vector machine. Dewa, Fadhilah, and Afiahayati [22] approached OCR by implementing a CNN to distinguish Javanese script, and Khadijah et al. [23] used a CNN and a DNN to classify Javanese script. The above literature indicates that very little work has been done on automated legal text recognition using machine learning methods. At present, the most common way of entering data into computers in legal forums is the keyboard, which is perhaps the most time-consuming and effort-intensive option. To overcome these drawbacks, this paper proposes an intelligent system for argument-based legal text recognition and summarization using machine learning. In this work, a convolutional neural network (CNN) model built with TensorFlow and Keras is proposed to perform OCR with better accuracy and less processing time.
Material and methods
This section presents our machine learning based legal support system, which uses document image analysis, text recognition, and summarization. It focuses on the textual components of the document image, i.e., the textual processing part of document image analysis. Fig. 1 shows the schematic diagram of our machine learning based automated legal text recognition and summarization system.

Machine learning architecture for legal text recognition and summarization.
The hard-copy legal dataset was collected from various trial courts of West Bengal, India, and converted to Portable Document Format (PDF) files using open-source document scanning software. These PDF files are then converted into image files at 600 dpi for image preprocessing [24, 25], which is discussed in the next section.
Image Preprocessing
Image preprocessing is essential for obtaining images of sufficient quality for text recognition. The images are preprocessed using image rescaling, noise removal, and thresholding. Image rescaling is the process of resizing a digital image [26]. Because images of old legal case documents are mostly of low quality, their resolution is set within 450–600 dpi (dots per inch) to enhance legibility. Noise visible in these legal documents is removed using median (nonlinear) and Gaussian (linear) filters, which smooth the image by blurring edges and reducing contrast [27]. Thresholding is most effective on high-contrast images [28]. Here, Otsu’s binarization method [29] is used: it iterates over all possible threshold values, assigns each pixel to either the foreground or the background at every candidate threshold, and selects the threshold that best separates the two classes [30].
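The thresholding step can be illustrated with a minimal sketch of Otsu’s method in plain Python. This is not the authors’ implementation; the function names `otsu_threshold` and `binarize` are hypothetical, and the sketch assumes 8-bit grayscale values with dark ink on a light page. It scans every candidate threshold and keeps the one that maximizes the between-class variance, which is equivalent to minimizing the within-class variance.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, the remaining pixels the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of intensities of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break             # every pixel is already in the lower class
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image, t):
    """Mark dark pixels (value <= t) as foreground 1 and the rest as 0.

    Assumption: text is darker than the page background.
    """
    return [[1 if p <= t else 0 for p in row] for row in image]


# Toy 2x4 "scan": dark ink (20) on light paper (220).
image = [[220, 20, 20, 220],
         [220, 220, 20, 220]]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
```

In practice an image library would typically perform this step; the pure-Python form is used here only to make the threshold search explicit.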
Line segmentation
This section describes the line segmentation [30] of an image file. For better performance, we propose a two-stage binarization for the line-segmentation process: in the first stage, Otsu’s binarization is applied to the document image, and in the second stage, text lines are constructed and binarized again. A kernel is then created for segmenting the lines, the dilation function is applied, contours (i.e., text-line regions) are identified, and the extracted lines are stored in a specific folder. Text detection [31] helps identify disturbances, such as patterns or watermarks in the background, that could otherwise cause difficulties. Such background pictures and trademarks are usually classified as foreground in an image file. After these foreground regions are detected, any noise is filtered out to a significant degree to improve performance, and the foreground is labeled using the connected-component (CC) method. Users sometimes zoom in with the camera to capture as much text as possible, which may exclude the margins of the document; without margin information, rotating the document image becomes technically difficult. Moreover, images captured by cameras are often oblique, so text lines may not be straight. To resolve these issues, we propose a bottom-up CC-based method for constructing text lines.
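The dilate-then-find-regions idea can be sketched in plain Python as follows. This is an illustrative approximation, not the authors’ code: the helper names and the `half_width` kernel parameter are assumptions, the image is assumed to be already binarized (1 = ink), and 4-connected component labeling stands in for contour detection. A wide horizontal dilation merges the characters of one text line into a single blob, whose bounding box then delimits the line.

```python
from collections import deque


def dilate_horizontal(binary, half_width):
    """Set a pixel to 1 if any pixel within half_width columns on its row is 1."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - half_width), min(w, x + half_width + 1)
            if any(binary[y][lo:hi]):
                out[y][x] = 1
    return out


def component_boxes(binary):
    """Return sorted bounding boxes (top, left, bottom, right) of 4-connected components."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                top, left, bottom, right = y, x, y, x
                queue = deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    top, bottom = min(top, cy), max(bottom, cy)
                    left, right = min(left, cx), max(right, cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return sorted(boxes)


def segment_lines(binary, half_width=3):
    """Merge characters on the same row band, then box each merged blob."""
    return component_boxes(dilate_horizontal(binary, half_width))


# Toy page with two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
line_boxes = segment_lines(page)
```

Without the dilation, the same page yields one component per character fragment rather than one per line, which is why the kernel width matters for this step.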
Character Segmentation
Character segmentation [32] is a decision process in which an image of a sequence of characters is decomposed into individual symbols. It is performed in the following steps: (i) conversion of the line image into grayscale; (ii) removal of noise from the grayscale image; (iii) binarization; (iv) setting a kernel for character segmentation; (v) dilation to find contours, with which each character region is located; and (vi) extraction of the characters from the line and storage in a specific folder.
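As a simplified illustration of locating character regions in a binarized line image, the sketch below splits the line wherever a column contains no ink. Note that this vertical-projection approach is a stand-in chosen for brevity; the paper’s pipeline uses dilation and contours instead. The function names are hypothetical, and the input is assumed to be a binary grid with 1 for ink.

```python
def split_characters(line):
    """Return (start_col, end_col) spans of consecutive columns that contain ink."""
    width = len(line[0])
    ink = [any(row[x] for row in line) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a character begins
        elif not has_ink and start is not None:
            spans.append((start, x - 1))   # a character ends at the previous column
            start = None
    if start is not None:
        spans.append((start, width - 1))   # character touching the right edge
    return spans


def crop(line, span):
    """Cut the columns of one span out of the line image."""
    lo, hi = span
    return [row[lo:hi + 1] for row in line]


# Toy line image containing three glyph fragments.
line = [[1, 1, 0, 0, 1, 0, 1, 1],
        [1, 0, 0, 0, 1, 0, 0, 1]]
spans = split_characters(line)
glyphs = [crop(line, s) for s in spans]
```

A projection split fails on touching or overlapping characters, which is one reason a contour-based method may be preferred on real scans.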
Classification
Within the image analysis pipeline, classification is performed after line segmentation and character segmentation of the image file are complete. In character recognition, images captured by regular cameras and other devices pose challenges that must be handled accurately. To overcome them, a model is proposed that reads text from typewritten legal documents by training a Convolutional Neural Network (CNN) [33]. The model is trained on a sample dataset of images of English letters and digits. In this supervised learning setting, the model’s predicted outputs are compared with the actual labels to measure the error. Because many images are needed to obtain an accurate model, we use the ‘Chars74K dataset’ [34], which provides 62 classes (0-9, A-Z, a-z); five more classes (’.’, ’,’, ’-’, ’(’, ’)’) are added for a total of 67 classes. TensorFlow with Keras is used to build the proposed model [35, 36].
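The 67-class label set can be written out explicitly, as in the short sketch below. The ordering of the classes (digits, then uppercase, then lowercase, then punctuation) is an assumption made for illustration; the paper does not state how class indices are assigned.

```python
import string

# Assumed ordering: 10 digits, 26 uppercase, 26 lowercase, 5 punctuation marks.
CLASS_LABELS = (
    list(string.digits)
    + list(string.ascii_uppercase)
    + list(string.ascii_lowercase)
    + ['.', ',', '-', '(', ')']
)


def class_to_char(index):
    """Map a predicted class index to its character."""
    return CLASS_LABELS[index]


def class_to_ascii(index):
    """Map a predicted class index to the ASCII code of its character."""
    return ord(CLASS_LABELS[index])


num_classes = len(CLASS_LABELS)  # 10 + 26 + 26 + 5
```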
Proposed Machine Learning model
This section describes the proposed machine learning based legal text recognition model. The model is trained on the available input dataset to classify each character image into one of the specified classes, identified by the ASCII values of ‘0-9’, ‘A-Z’, ‘a-z’ and ‘.’, ‘,’, ‘-’, ‘(’, ‘)’. To train the model, each image in the input dataset is labeled with its class. The labeled data are then passed to a Convolutional Neural Network (CNN), implemented as a sequential model with the following layers: (i) an input layer matching the input shape; (ii) a convolution layer with kernel size 3 × 3, padding such that the output size equals the input size, and the ‘ReLU’ activation function; (iii) a max-pooling layer with pool size 2 × 2, stride 2, and no padding; (iv) a convolution layer with kernel size 5 × 5, padding such that the output size equals the input size, and the ‘ReLU’ activation function; (v) a flatten layer; and (vi) a dense (fully connected) output layer with the ‘softmax’ activation function.
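To make the layer stack concrete, the following sketch traces the tensor shape and the number of trainable parameters through the layers listed above, using plain Python arithmetic rather than a deep learning framework. The input shape (32 × 32 × 1) and the filter counts (32 and 64) are assumptions chosen only for illustration; the paper does not report them. Only the kernel sizes, padding modes, pooling settings, and the 67 output classes come from the text.

```python
def conv_same(shape, filters):
    """'Same' padding keeps height and width; channels become the filter count."""
    h, w, _ = shape
    return (h, w, filters)


def max_pool_valid(shape, pool=2, stride=2):
    """Pooling without padding: floor((size - pool) / stride) + 1 per dimension."""
    h, w, c = shape
    return ((h - pool) // stride + 1, (w - pool) // stride + 1, c)


def conv_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def trace(input_shape=(32, 32, 1), filters1=32, filters2=64, num_classes=67):
    """Return a list of (layer name, output shape, trainable parameters)."""
    layers = [("input", input_shape, 0)]
    shape = input_shape

    params = conv_params(3, shape[2], filters1)
    shape = conv_same(shape, filters1)
    layers.append(("conv 3x3, same, relu", shape, params))

    shape = max_pool_valid(shape)
    layers.append(("max-pool 2x2, stride 2", shape, 0))

    params = conv_params(5, shape[2], filters2)
    shape = conv_same(shape, filters2)
    layers.append(("conv 5x5, same, relu", shape, params))

    flat = shape[0] * shape[1] * shape[2]
    layers.append(("flatten", (flat,), 0))
    layers.append(("dense, softmax", (num_classes,), flat * num_classes + num_classes))
    return layers


for name, shape, params in trace():
    print(f"{name:26s} {str(shape):16s} {params}")
```

Under these assumed sizes the dense layer holds most of the parameters, because no pooling follows the second convolution.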
The loss function, optimizer, number of epochs, and batch size must be specified explicitly before the model is trained on the labeled dataset. The following settings are used: (i) the Adam optimizer; (ii) sparse categorical cross-entropy as the loss function; (iii) ten epochs, at which better accuracy is achieved; and (iv) a batch size, which may vary.
At this stage, the dataset is labeled for training. k-fold cross-validation is used to guard against overfitting of the proposed model: the model is trained and evaluated five times (i.e., k = 5), and the average accuracy of the classifier is reported. For classification, all character images extracted from the lines are passed to the model. Each predicted class is converted to its character via the ASCII value, and the characters are written one by one to a text file.
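The 5-fold procedure can be sketched as below. The helper names are hypothetical, and the `evaluate` callback is a placeholder for training the CNN on the training indices and returning its accuracy on the validation indices; how the authors shuffled or stratified the data is not stated, so this sketch uses simple contiguous folds.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Distribute any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def cross_validate(n_samples, evaluate, k=5):
    """Run evaluate(train, val) on each fold; return (mean score, per-fold scores)."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores


# Placeholder evaluation: a real run would train the model and return accuracy.
def dummy_evaluate(train, val):
    return len(val) / (len(train) + len(val))


mean_score, fold_scores = cross_validate(10, dummy_evaluate, k=5)
```

In a real experiment the sample order would normally be shuffled before the folds are formed, so that each fold contains a mix of all classes.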
Word segmentation
Finally, word segmentation is performed after classification to generate a text document. At this stage, the proposed model can classify the character images and output characters, but it does not yet handle word boundaries, spaces within a line, and so on, which hinders the analysis. To resolve these issues, NLTK [37] and a word segmentation library are used to split the character string into words separated by spaces [38], which are then appended to a text file.
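The idea of recovering word boundaries from an unspaced character string can be sketched with a small dictionary-based dynamic program. This is a stand-in for illustration only and does not reproduce the API of NLTK or of any word segmentation library; the function name, the toy vocabulary, and the cost values are all assumptions. Each known word costs 1, and an unknown single character costs 10, so the cheapest split prefers a few dictionary words.

```python
def segment_words(text, vocabulary, max_len=20):
    """Split text into the lowest-cost sequence of tokens.

    Known words cost 1; an unknown single character is kept as its own
    token at cost 10 so that the search never gets stuck.
    """
    n = len(text)
    # best[i] holds (cost, tokens) for the prefix text[:i].
    best = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in vocabulary:
                cost = 1
            elif len(piece) == 1:
                cost = 10
            else:
                continue
            candidates.append((best[j][0] + cost, best[j][1] + [piece]))
        best[i] = min(candidates, key=lambda c: c[0])
    return best[n][1]


# Toy vocabulary; a real system would use a large word list.
vocab = {"the", "court", "held", "that", "he"}
words = segment_words("thecourtheldthat", vocab)
sentence = " ".join(words)
```

Practical segmenters usually weight words by corpus frequency instead of a flat cost, which resolves ambiguous splits more reliably.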
Text summarization
After the text is extracted from the document image, its content is separated into words and sentences using the NLTK word and sentence tokenizers. Tokenization breaks a string into words, sentences, phrases, keywords, and other components known as tokens. The score of each sentence is then calculated from the words it contains, and the highest-scoring sentences are selected for the summary. The summary is produced through the following steps: (i) Word tokenization: a dictionary is created to hold the frequency of each word in the legal text document, from which the score of each sentence is computed. (ii) Sentence tokenization: the text document is split into individual sentences using NLTK. (iii) Term frequency: the frequency of each word within the document is computed and used to score the sentences. (iv) Summary generation: the summary is created in a text file, and its words are tokenized and stored in an MS Excel file, with each word in an individual cell.
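A frequency-based extractive summarizer of the kind outlined above can be sketched as follows. The sketch makes several assumptions: simple regular expressions replace the NLTK tokenizers so that the example has no external dependencies, the stop-word list is a toy one, and a sentence’s score is taken to be the sum of the document-wide frequencies of its words, which is one common reading of the scoring step rather than the authors’ exact formula.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (stand-in for a sentence tokenizer)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens (stand-in for a word tokenizer)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def summarize(text, num_sentences=2, stopwords=frozenset()):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Word-frequency dictionary over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s) if w not in stopwords)
    # Score each sentence by the summed frequency of its non-stop-words.
    scores = []
    for idx, s in enumerate(sentences):
        score = sum(freq[w] for w in tokenize(s) if w not in stopwords)
        scores.append((score, idx))
    top = sorted(scores, key=lambda t: (-t[0], t[1]))[:num_sentences]
    keep = sorted(idx for _, idx in top)
    return [sentences[i] for i in keep]


text = ("The court heard the appeal. The weather was mild. "
        "The court dismissed the appeal with costs.")
summary = summarize(text, num_sentences=2, stopwords={"the", "was", "with"})
```

Because the score is an unnormalized sum, longer sentences tend to rank higher; dividing by sentence length is a common adjustment when that bias is undesirable.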
Automated text recognition accuracy
This section shows the output of each step of the proposed automated legal text recognition system. Experiments were performed using Python 3.7 on Windows 10, on a machine with an i5 processor and 8 GB of RAM. First, hard copies of argument-based legal case documents were collected and scanned to obtain their image versions. Figure 3 shows a legal document captured in PDF format after scanning and its conversion into digital images.

Convolutional Neural Network Model.

Collected legal document: (a) pdf, (b) converted image.
To improve image quality and to find lines, the images are preprocessed. Figure 4 shows the preprocessing of a digital document image, which includes image scaling, noise removal, thresholding/binarization, and dilation.

Pre-processing of image: (a) original image, (b) grayscale, (c) noise removal, (d) binarization, (e) dilated.
After the document image is preprocessed, Figs. 5 and 6 show line detection and line segmentation, respectively.

Line detection.

Line segmentation:(a) input file, (b) segmented lines.
This line segmentation method is followed by character detection from the segmented lines, after which character segmentation is performed. In Fig. 7, we have shown the character segmentation process.

Character segmentation process.
Text extraction is done using a TensorFlow machine learning model, followed by classification. Table 1 shows the classification accuracy of the proposed automated legal text recognition system using the 5-fold cross-validation method.
Fig. 8 shows the output of the last epoch after training the model, i.e., the loss and accuracy.

(a) Model Accuracy, (b) Model Loss.
In Fig. 9, we have shown the stored predicted result in the form of a string.

Characters in strings form: (a) characters, (b) strings.
In Fig. 10, we have shown the word segmentation in the form of the final text document.

Word segmentation: (a) strings, (b) segmented words.
Now, we can create a text file and append the output string result line by line. In Fig. 11, we have shown the entire conversion process of a legal text document as an extracted string stored in a text file in append mode.

Text file conversion: (a) pdf file, (b) converted text file.
Figure 12 shows how text summarization is performed within our proposed system.

Text summarization: (a) input text, (b) summarized text.
Finally, Fig. 13 shows the text document converted into an Excel file, with each word stored in an individual cell.

Text in excel file.
From the above results, the proposed model successfully automates the traditional legal text recognition method with an accuracy of 94%. Therefore, the proposed intelligent system for automated legal text recognition may be recommended to industry and may serve as a baseline for future research on legal text analysis.
Conclusion
In this paper, an automated argument-based legal text recognition model has been proposed using machine learning. The developed system is built around a Convolutional Neural Network (CNN). Argument-based text documents are scanned to obtain images, and characters are separated using the segmentation method. In addition, the recognized text is summarized. The automated recognition system provides better accuracy and is also cost-effective and user-friendly.
Moreover, the program is flexible and can be modified easily as required. Therefore, the proposed system may be recommended for legal text recognition to generate soft-copy versions of hard-copy legal documents, which can then be accessed easily through the internet to facilitate uninterrupted operation of the judiciary. Future work includes the classification of accused individuals based on argument-based legal texts using machine learning models.
