Abstract
Traditional file management relies mainly on manual processing. With the development of artificial intelligence, convolutional neural network (CNN) models have been widely used in the acquisition, processing, recognition, and conversion of handwritten text, improving efficiency and enabling fast, accurate information processing. To achieve high-precision handwritten digit recognition, this study proposes a CNN-based model that extracts features through convolutional layers, downsamples with max pooling layers to retain the most important features, further processes and classifies the features through fully connected layers, and outputs a class probability distribution using the softmax function. We compare three optimization algorithms. Experimental results show that the RMSprop optimizer reaches a training accuracy of more than 99% on the MNIST dataset with a loss close to zero, whereas SGD and Adam, although slightly inferior in performance, still achieve good recognition results. This indicates the advantage of combining CNNs with the RMSprop optimizer in the handwritten digit recognition task.
Introduction
As deep learning continues to evolve, it has found extensive application in text and image recognition and classification. At present, handwritten digit recognition technology (handwriting recognition technology) 1 is relatively mature both domestically and internationally. Compared to traditional optical character recognition (OCR) technology, deep learning-based convolutional neural network (CNN) algorithms can rapidly, accurately, and effectively capture and recognize text in complex scenarios. The CNN model, a core component of deep learning, has successfully met the challenges of handwritten digit recognition because of its ability to extract information efficiently and identify content accurately in complex visual environments.
In recent years, the field of records management has experienced an unprecedented wave of digital transformation, significantly enhancing the efficiency of information storage, retrieval, and utilization. The transition from cumbersome paper documents to convenient digital copies, and from manual cataloging to intelligent indexing, has led archivists to continuously explore new ways to leverage technology. Under the framework of deep learning, CNN-based handwriting recognition is gradually becoming a new tool for archival analysis and research. Through the powerful feature extraction capability of CNNs, even handwritten text of very different styles and ages in historical archives can be accurately recognized and analyzed, which creates new opportunities for the detailed exploration of archival content, the advancement of historical research, and the preservation of cultural heritage. Intelligent archive management has increasingly become a focal point for both domestic and international archival communities. 1 At this stage, the main research results on the digital transformation of China’s archival work focus on the changes and difficulties in the management of the archival industry, especially how the digital conversion strategy of archival work should be implemented after the promulgation of the new version of the Archives Law. In addition, the theoretical framework of archival management in enterprises and its practical application are discussed in depth, which provides solid theoretical support for understanding the digital transformation of archival work. However, we still need to look at the opportunities and challenges encountered in archival work from a higher perspective.
Therefore, this paper constructs a CNN-based handwriting recognition model for paper archives, and gradually improves the accuracy of handwritten digit recognition by refining abstract features from raw image data through hierarchical learning. The digital transformation of archival work also mines the value of complex traditional archival resources, and realizes the efficient extraction and intelligent management of information through multi-level technological interventions and process reshaping. In this process, we have identified and examined the bottlenecks and obstacles in the key aspects of data standardization, intelligent classification, secure storage, and convenient access, combined with the logic of continuous iteration and optimization of the deep learning model, and put forward countermeasures for archive informatization from the construction of archive management system, archive informatization management planning, and other aspects.
Related work
CNN
A convolutional neural network is a multi-level neural network model. This architecture finds broad application across numerous visual tasks, such as image classification, 2 object detection, 3 and instance segmentation. It primarily comprises an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. The convolutional layers are essential because convolution operations require far fewer parameters than traditional multi-layer networks, which boosts computational efficiency. In this process, the original input data is first convolved with one or more learnable convolution kernels to extract features. The pooling layer is then applied to reduce the dimensions of the feature map, which not only reduces the complexity of subsequent calculations but also enhances the model’s ability to adapt to variations in the input, that is, its generalization. As depicted in Figure 1, the entire training cycle consists of two main phases: forward propagation and backpropagation. During forward propagation, the input samples pass through the network structure and the model generates predicted outputs. During backpropagation, the discrepancy between the predicted outputs and the desired targets is quantified, and the gradients are propagated backwards to iteratively adjust the network parameters so as to minimize the prediction errors.
Figure 1. Core diagram of the CNN model.
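The dimension reduction performed by a pooling layer can be illustrated with a minimal sketch in plain Python. The function name `max_pool2d`, the 2×2 window, and the sample values are assumptions chosen for illustration, not details taken from the model in this paper.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling over a 2D list (stride equals window size)."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect the values inside the current window and keep the largest.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map; pooling halves each spatial dimension.
fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 7, 8],
]
pooled = max_pool2d(fmap)
print(pooled)  # [[6, 2], [2, 9]]
```

Each output value summarizes one 2×2 region, so small shifts of a stroke inside a region leave the pooled value unchanged, which is one reason pooling helps the model tolerate variation in the input.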
The core of the model is built on a multi-level structure, starting with three successive convolutional layers. To mine and extract the key information in the input image, each convolutional layer is configured with a specific number of convolution kernels of appropriate size. The first layer is equipped with 16 filters of size 3×3, which lays the foundation for the preliminary feature mapping. The second layer intensifies the process with 32 convolution kernels of the same size, designed to capture more complex feature structures. The third layer advances the feature extraction process by applying 64 filters of size 3×3 to capture higher-level abstract features. Two pooling layers are interleaved with the convolutional layers to reduce the dimensionality while retaining crucial information. The first pooling layer applies average pooling, which effectively reduces computational complexity while maintaining feature validity. After the output of the third convolutional layer, a flattening operation converts the multidimensional feature maps into a one-dimensional vector so that they can be processed by the fully connected layer. This transformation is essential for the model’s classification process. The subsequent stage consists of an initial fully connected layer with 84 neurons followed by the ReLU activation function, which introduces nonlinearity and enhances the model’s capacity for representation. The final output layer is configured with 10 neurons, each corresponding to one of the handwritten digits from 0 to 9. A softmax activation function is applied to ensure that the output probabilities sum to 1, yielding a probability distribution over the categories.
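To make the parameter savings of convolution concrete, the following sketch counts the learnable parameters of the three convolutional layers described above. It assumes a single-channel input and one bias per filter, neither of which is stated explicitly here; the helper names `conv_params` and `dense_params` are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per filter for a square-kernel convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + out_channels


def dense_params(in_features, out_features):
    """Weights plus one bias per output neuron for a fully connected layer."""
    return in_features * out_features + out_features


# (in_channels, out_channels, kernel_size) for the 16-, 32-, and 64-filter layers.
# The single input channel is an assumption (grayscale image).
conv_layers = [(1, 16, 3), (16, 32, 3), (32, 64, 3)]
conv_counts = [conv_params(*layer) for layer in conv_layers]
print(conv_counts)       # [160, 4640, 18496]
print(sum(conv_counts))  # 23296

# For comparison: one dense layer mapping a flattened 28x28 image to 84 neurons.
print(dense_params(28 * 28, 84))  # 65940
```

Under these assumptions, all three convolutional layers together use fewer parameters than a single dense layer applied directly to the raw pixels, because each kernel is shared across every image position.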
For model training, the RMSprop optimization algorithm is selected, which can effectively find a local optimum along the complicated surface of the loss function and promote efficient updating and learning of the model parameters. The specific model unit composition is shown in Figure 2.
Figure 2. Schematic diagram of the CNN model unit.
Handwriting recognition technology
Handwriting recognition technology converts handwritten Chinese characters, letters, and symbols into editable text information or instructions. At the heart of the technology is the development of an algorithmic model that can accurately identify handwritten content. Effective features are extracted by analyzing the morphological features and writing rules of handwriting, and a large amount of data is used for training and algorithm optimization, so that handwritten content can ultimately be recognized accurately.
Currently, the predominant approach in this field involves employing neural networks for feature extraction. 4 LeCun et al. 5 introduced the first complete convolutional neural network model, and LeNet-5 6 was the pioneering convolutional neural network model that successfully accomplished handwritten digit image recognition. Deep learning then developed gradually. The AlexNet 7 network model stood out in the ImageNet competition and won first place. This network incorporated a substantial number of convolutional and pooling layers, along with optimized activation functions, significantly increasing the model’s depth. In 2014, GoogLeNet 8 was proposed, with a test error of only 6.66%. The network introduced Inception units, increased network depth and width, and reduced the number of model parameters. ResNet 9 and Xception 10 were proposed subsequently. Research on handwritten digit recognition has progressed alongside these developments, and the growing body of work on handwritten character recognition has laid a solid foundation for it. Ru et al. 11 employed a deformable convolutional neural network for classifying and recognizing handwritten samples, achieving a recognition rate of 99.48%. Ma et al. 12 introduced a pulsed neural network-based algorithm for handwritten digit recognition, capable of effectively extracting the structural features of handwritten digits and enhancing recognition accuracy, focusing on the algorithm optimization of the network model during the training process. Yu et al. 13 designed a high-precision handwritten digit classification network to address the problem of recognition rates failing to exceed 95% accuracy.
Fu 14 proposed a scheme based on an improved Inception structure and DenseNet, which combined to reduce the recognition time and improve the operation speed of deep learning. Compared to traditional methods, while convolutional neural networks can achieve higher recognition accuracy, they typically enhance network performance by increasing the depth of network layers and expanding the network scale, which easily leads to problems such as parameter explosion, gradient explosion, and overfitting.
This technology can convert handwritten text documents into electronic text files, 15 and convert handwritten text in archives into editable, searchable, and annotatable text formats. Specifically, with the help of handwriting recognition technology, various types of paper files, such as manuscripts of academic papers, conference notes, and records of important events, can be converted into file formats that computers can parse. This not only makes it easy to use electronic tools for editing and management, 16 but also ensures that these documents are properly preserved. For example, in higher education institutions, the archived handwritten data of various departments can be digitized through this technology and transformed into electronic documents that are easy to manage and maintain. At the same time, the digitized documents can be transmitted and used through email, network sharing, and similar channels, which makes it convenient for multiple people to collaborate on editing and viewing and greatly improves the efficiency of information sharing.
Optimizer classification
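Optimizers play a crucial role in the deep learning model training process. Their core task is to search the high-dimensional space of the loss function for a solution that minimizes the model error. Traditional stochastic gradient descent (SGD) is a simple and intuitive method that gradually approaches a minimum by updating the weights along the negative direction of the gradient of the loss function. Its update formula is:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$
where $\theta_t$ denotes the model parameters at step $t$, $\eta$ is the learning rate, and $\nabla_\theta L(\theta_t)$ is the gradient of the loss function $L$ with respect to the parameters.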
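The gradient descent update described above can be sketched in a few lines of plain Python. The toy loss $L(\theta) = \sum_i \theta_i^2$, the learning rate of 0.1, and the function names are assumptions made only for illustration; in actual SGD the gradient would be estimated from a mini-batch of training samples rather than computed exactly.

```python
def sgd_step(theta, grad, lr):
    """One update: move each parameter against its gradient, scaled by lr."""
    return [t - lr * g for t, g in zip(theta, grad)]


def grad_quadratic(theta):
    """Exact gradient of the toy loss L(theta) = sum(theta_i ** 2)."""
    return [2.0 * t for t in theta]


theta = [1.0, -2.0]  # hypothetical starting parameters
for _ in range(50):
    theta = sgd_step(theta, grad_quadratic(theta), lr=0.1)

# Each step multiplies every parameter by 0.8, so theta shrinks toward [0, 0].
print(theta)
```

With this loss every step scales the parameters by a constant factor, which makes the effect of the learning rate easy to see: a larger value converges in fewer steps until it becomes large enough to overshoot the minimum.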
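In this paper, we compare the effects of three optimizers, namely SGD, RMSprop, and Adam, with the aim of exploring the impact of different optimization strategies on model performance. The Adam optimizer combines the strengths of Momentum and RMSprop and achieves adaptive learning rate tuning on different parameter dimensions by simultaneously maintaining first-order and second-order moment estimates of the gradient. The update formula of Adam is as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $g_t = \nabla_\theta L(\theta_t)$ is the gradient at step $t$, $m_t$ and $v_t$ are the first-order and second-order moment estimates, $\beta_1$ and $\beta_2$ are their decay rates, $\eta$ is the learning rate, and $\epsilon$ is a small constant that prevents division by zero.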
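As a minimal sketch of how Adam maintains its two moment estimates, the following plain-Python function performs one update for a single scalar parameter. The default hyperparameters (learning rate 0.001, decay rates 0.9 and 0.999, epsilon 1e-8) are the commonly used defaults and are an assumption here, since the settings used in the experiments are not stated; the toy loss is likewise hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy loss L(theta) = theta ** 2 with gradient 2 * theta (illustrative only).
theta, m, v = 1.0, 0.0, 0.0
for step in range(1, 6):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, step)
print(theta)
```

Because the step is divided by the root of the second-moment estimate, the first update has a magnitude close to the learning rate regardless of how large the raw gradient is, which is the per-parameter adaptivity the text refers to.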
Information about archives
The recognition of document-related archives, as a specialized type of handwritten text recognition, is characterized by its historical age and the differences in handwriting styles compared to modern methods. Traditional optical character recognition (OCR) techniques, while effective for printed text, often fall short when dealing with the variability and nuances of handwritten documents, especially those found in historical archives. The intricacies of older handwriting styles, including variations in penmanship, ink quality, and paper degradation over time, present significant obstacles for conventional OCR systems. Therefore, it is necessary to train a deep learning model based on methods discussed above to address the complexity and variability in archival handwriting recognition, achieving high-precision text recognition and analysis.
Handwritten file recognition model based on CNN
To address the problems described above, a handwritten file image recognition model based on CNN is proposed, as shown in Figure 3, and its components are described in detail below.
Figure 3. CNN model frame diagram.
Convolution and activation layer
In this model, the input data consists of two-dimensional images represented as matrices, where the rows and columns correspond to the height and width of the image, respectively, and the values indicate pixel intensities ranging from 0 to 255 for grayscale images. For instance, as illustrated in Figure 4, a typical input image is a single-channel grayscale image of size 28×28 pixels. To build an effective classification model, the input image undergoes a series of convolution operations. In the first convolutional layer, this 28×28 input image is transformed into a set of feature maps by sliding a convolution kernel (filter) over it. Denoting the input image matrix as I and the convolution kernel as K, the convolution operation is mathematically defined (refer to formula (1)): the kernel performs element-wise multiplications with local regions of the image and sums the results, producing a new matrix, the feature map, that captures essential patterns and features for accurate classification.
Figure 4. Schematic diagram of feature extraction.
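The sliding multiply-and-sum operation described above can be sketched directly in plain Python. This is an illustrative implementation without padding or stride; like the convolution layers in common deep learning frameworks, it does not flip the kernel (strictly speaking a cross-correlation). The image and kernel values are made up for the example.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for m in range(kh):
                for n in range(kw):
                    acc += image[i + m][j + n] * kernel[m][n]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Hypothetical 3x4 image containing a vertical edge between columns 1 and 2.
I = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to left-to-right intensity changes.
K = [
    [1, -1],
    [1, -1],
]
feature_map = conv2d_valid(I, K)
print(feature_map)  # [[0, -2, 0], [0, -2, 0]]
```

The output is nonzero only where the window straddles the edge, which shows how a single kernel turns a local pattern in the image into a localized response in the feature map.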
In the first convolutional layer, the input tensor of size [batch_size, 1, 28, 28] is processed by the convolution kernels and, after pooling, yields a feature map of size [batch_size, 10, 12, 12]. The ReLU activation function sets all negative values to zero. Similarly, the second convolutional layer takes this [batch_size, 10, 12, 12] tensor as input and produces an output feature map of size [batch_size, 20, 10, 10], after which ReLU again sets all negative values to zero. This layer-by-layer ReLU activation effectively introduces nonlinearity and improves the feature learning ability and classification performance of the model.
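The tensor sizes quoted for this model can be checked with the standard output-size formula for convolution, floor((n + 2p − k) / s) + 1. The kernel and pooling sizes below (a 5×5 first kernel, 2×2 pooling, and a 3×3 second kernel) are assumptions, since they are not stated here; they were chosen because they reproduce the [10, 12, 12] input to the second layer, the [20, 10, 10] output, and the flattened length of 2000.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Spatial output size of non-overlapping pooling along one dimension."""
    return size // window


side = 28
after_conv1 = conv_out(side, kernel=5)        # assumed 5x5 kernel
after_pool = pool_out(after_conv1, window=2)  # assumed 2x2 pooling
after_conv2 = conv_out(after_pool, kernel=3)  # assumed 3x3 kernel

print(after_conv1, after_pool, after_conv2)   # 24 12 10
flattened = 20 * after_conv2 * after_conv2    # 20 output channels
print(flattened)                              # 2000
```

Other kernel and stride combinations could produce the same sizes, so this is a consistency check on the reported shapes rather than a statement of the configuration actually used.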
Full connection and activation layer
After obtaining the high-dimensional features from the input image, the next step involves extracting and classifying these features. During the feature extraction stage, the tensors produced by the convolutional and pooling layers are reshaped into a one-dimensional vector, which is subsequently processed by the fully connected layer. Specifically, in the initial fully connected layer, the dimensions of the input vector are transformed from [batch_size, 2000] to [batch_size, 500], as detailed in formula (3). This transformation enables the model to consolidate and refine the extracted features, facilitating a more effective classification process.
$$Y = W_1 X + b_1 \tag{3}$$
where $Y$ is the output of the first fully connected layer, $W_1$ and $X$ are the weight matrix and the input tensor, respectively, and $b_1$ is the bias vector.
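The fully connected transformation, together with the ReLU step applied to its output, can be sketched in plain Python for a single sample. The tiny dimensions (3 inputs, 2 outputs) and all numeric values are hypothetical stand-ins for the 2000-to-500 layer described in the text.

```python
def dense(x, weights, bias):
    """Compute W x + b for one input vector x; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Set every negative entry to zero and keep the rest unchanged."""
    return [max(0.0, v) for v in values]


X = [1.0, 2.0, 3.0]              # flattened feature vector (illustrative)
W1 = [[0.5, -1.0, 0.0],          # one row of weights per output neuron
      [1.0, 1.0, -1.0]]
b1 = [0.5, 1.0]

Y = dense(X, W1, b1)
print(Y)        # [-1.0, 1.0]
print(relu(Y))  # [0.0, 1.0]
```

Each output neuron is a weighted sum of all inputs plus a bias, so the layer mixes information from every position of the flattened feature maps before the nonlinearity is applied.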
Subsequently, the ReLU activation function is applied to the output vector, setting all negative values to zero. The processed tensor is then transformed into an output vector of size [batch_size, 10] by the second fully connected layer, and finally, the probability of each class is calculated using the softmax function, as shown in formula (4):
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \tag{4}$$
where $z_i$ denotes the $i$-th element of the output $Z$ of the second fully connected layer, the sum runs over all classes $j$, and $\mathrm{softmax}(z_i)$ is the probability assigned to category $i$.
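A minimal sketch of the softmax computation, written in plain Python, is given below. Subtracting the largest input before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result. The example scores are hypothetical.

```python
import math


def softmax(z):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative outputs of the last layer
probs = softmax(logits)
# Index of the largest probability, used as the predicted class.
predicted = max(range(len(probs)), key=probs.__getitem__)

print(probs)
print(predicted)
```

The ordering of the scores is preserved, so the class with the largest raw output is also the class with the highest probability; softmax only rescales the outputs into a distribution.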
In this process, the combination of the fully connected layer and the ReLU activation function enables the model to effectively capture the high-dimensional features of the input data, thus improving the accuracy of the classification.
Output layer
After processing by the first fully connected layer and the activation function, the second fully connected layer converts the input tensor of size [batch_size, 500] into an output vector of size [batch_size, 10], as shown in formula (5). This transformation is essential because it maps the high-level features extracted by the earlier layers to the final output classes, which correspond to the different categories in the classification task. The output vector represents the model’s predictions, with each element indicating the likelihood of the input belonging to a specific class.
Finally, the softmax function maps each element of the output vector into the interval (0, 1) so that all elements sum to 1. Through this series of operations, the model outputs a tensor of size [batch_size, 10] in which each element is the predicted probability of one class. We choose the category with the highest probability as the final prediction result. Through the fully connected layer and the softmax function, the previously extracted features are converted into the final classification result, so that the model can effectively classify the input data. The detailed process is illustrated in Figure 5.
Figure 5. Diagrammatic representation of the CNN model.
Experimental design and results discussion
Dataset
The dataset in this paper is MNIST, a widely used handwritten digits dataset whose training and test sets consist of digits handwritten by 250 different people. The MNIST dataset contains images of the Arabic numerals 0 to 9, stored as 28×28 grayscale images that have been size-normalized and labeled, as shown in Figure 6.
Figure 6. Sample MNIST dataset diagram.
Analysis of experimental results
The experiments in this paper were conducted using Python 3.8, PyTorch 1.11.0, CUDA 11.3, and an RTX 3090 GPU. The batch size for training was set to 128, the loss function was cross-entropy loss, and ReLU was used as the activation function. The optimizer employed was Adam, and training ran for 100 rounds. The experiments used the MNIST dataset of handwritten digit images.
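For reference, the cross-entropy loss used in training can be sketched in plain Python as the negative log of the probability assigned to the correct class, averaged over a batch. The function names and sample probabilities are hypothetical. Note that PyTorch's `CrossEntropyLoss` expects raw scores and applies log-softmax internally, whereas this sketch takes probabilities directly.

```python
import math


def cross_entropy(probs, target):
    """Loss for one sample: negative log-probability of the true class."""
    return -math.log(probs[target])


def batch_cross_entropy(batch_probs, targets):
    """Mean cross-entropy over a batch of probability vectors."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, targets)]
    return sum(losses) / len(losses)


# Two illustrative predictions over three classes, with true labels 0 and 2.
batch = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
labels = [0, 2]
loss = batch_cross_entropy(batch, labels)
print(loss)
```

A model that guesses uniformly over the 10 digit classes has a loss of ln(10), roughly 2.3, which is a useful baseline when reading loss curves.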
Data preprocessing steps included normalizing pixel values to a range of 0–1 and applying data augmentation techniques such as rotation and scaling to enhance model robustness. The training and testing datasets were divided in a 6:4 ratio, and the entire training process took approximately 1 hour to complete. This comprehensive approach ensures accurate representation and effective processing of the datasets used in this study.
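The normalization and splitting steps can be sketched in plain Python as follows. The helper names, the tiny 2×2 "image", and the sequential (unshuffled) split are assumptions for illustration; the augmentation steps (rotation and scaling) are not shown.

```python
def normalize(image):
    """Scale 8-bit pixel intensities (0 to 255) into the range 0 to 1."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def split(samples, train_parts=6, test_parts=4):
    """Split a list into training and test portions in the given ratio."""
    cut = len(samples) * train_parts // (train_parts + test_parts)
    return samples[:cut], samples[cut:]


tiny_image = [[0, 255], [51, 102]]  # illustrative 2x2 grayscale patch
print(normalize(tiny_image))        # [[0.0, 1.0], [0.2, 0.4]]

samples = list(range(10))           # stand-ins for 10 labeled images
train, test = split(samples)
print(len(train), len(test))        # 6 4
```

In practice the samples would usually be shuffled before splitting so that both portions contain a similar mix of digits and writers.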
Figures 7 to 10 show the accuracy and loss curves of the three optimizers, SGD, RMSprop, and Adam, over 100 rounds of training and testing.
Plot of accuracy variation of different optimizers on the training set.
Plot of the variation of loss function for different optimizers on the training set.
Plot of variation in accuracy of different optimizers on the test set.
Variation of loss function for different optimizers on the test set.
In Figure 7, we observe the training loss over time when the model is trained with three different optimizers. The horizontal axis represents the number of training rounds and the vertical axis shows the training loss. The losses for all three optimizers decrease rapidly within the first few rounds of training, indicating that the models are actively learning to fit the training data. Particularly striking is the fact that RMSprop shows the most significant drop in loss in the first few rounds, followed by Adam and then SGD. This observation implies that RMSprop learns relatively fast in the initial phase under the current experimental setting. As training progresses, the loss curves of all optimizers flatten out, indicating that the model is close to convergence.
Figure 8 shows the variation in training accuracy over multiple epochs for the three optimizers, Adam, SGD, and RMSprop. The horizontal axis represents the training epochs, while the vertical axis shows the corresponding training accuracy in percent for each optimizer. Within the first few epochs of training, there is a significant difference between the three optimizers in the rate at which accuracy increases. Specifically, RMSprop rises the fastest at the very beginning, followed closely by Adam, while SGD improves more slowly. Over time, all three curves level off and eventually reach similar maximums, suggesting that they all eventually lead the model to almost the same training accuracy. This finding implies that, given enough training cycles, the different optimization methods can all converge to similar results. It is also worth noting that the Adam curve shows slight fluctuations throughout the training process, especially in the early stages.
In summary, Figures 7 and 8 together reveal the behavioral characteristics of different optimizers during the training process and their differences in terms of speed and stability in improving model accuracy, which is an important guide for selecting suitable optimization techniques for machine learning model training.
Figure 9 shows the test loss over time for the three optimizers, Adam, SGD, and RMSprop. The test losses for all three optimizers fluctuate strongly within the first few rounds of training, indicating that the model is still adjusting in the early stages. Particularly striking is that RMSprop’s test loss exhibits high volatility in the initial rounds, and its overall trend is upward. Adam’s test loss is relatively low in the early rounds but gradually increases as training progresses. By comparison, SGD’s test loss stays the lowest throughout the training process and fluctuates the least, showing better stability. As training continues, the test loss curves of the optimizers follow different long-term trends: the test loss of RMSprop rises clearly in the later stages, the test loss of Adam also rises but at a slower rate, and the test loss of SGD remains low and hardly rises.
Figure 10 exhibits the variation of test accuracy for three different optimizers, Adam, SGD, and RMSprop, over multiple epochs. The horizontal axis represents the training epochs, while the vertical axis shows the corresponding percentage of test accuracy for each optimizer. It is clear from the figure that within the first few epochs of training, there is a significant difference between the three optimizers in terms of the rate of increase in accuracy. Specifically, RMSprop seems to rise the fastest at the very beginning, closely followed by Adam, while SGD appears to improve more slowly.
Comprehensively analyzing Figures 9 and 10, we can conclude that although different optimizers exhibit their own unique dynamic change patterns in terms of test loss and test accuracy during the training process, under prolonged training, all of these optimizers are eventually able to drive the model to achieve a fairly high test accuracy. However, their performance in terms of test loss is quite different, with SGD showing the best stability, while RMSprop and Adam show varying degrees of fluctuation and upward trend at different stages, respectively.
Conclusions and prospects
The application of the convolutional neural network (CNN) model to handwritten recognition technology in archives management is revolutionizing archival digitization. By leveraging the CNN model 17 for the collection, processing, identification, and transformation of handwritten texts, archival files can be digitized more efficiently, 18 enabling accurate information storage, 19 convenient retrieval, and quick sorting. 20 This technology significantly enhances the efficiency of archives management, reduces reliance on human resources, and improves the accuracy and integrity of archival information, ensuring the long-term preservation and reliable utilization of data.
As CNN models and related deep learning technologies continue to evolve, the prospects for handwritten recognition in archival digitization are expanding. 21 Future advancements are expected to lead to more intelligent, precise, and automated solutions, enhancing the quality and efficiency of archives management and accelerating the digitization process across the entire archival industry. Importantly, advancements in handwriting recognition will provide richer data support, assisting decision-makers in making scientifically informed and efficient decisions regarding the protection, utilization, and management of archives.
In conclusion, this study underscores the transformative impact of CNN models on handwriting recognition in archives management, highlighting their role in improving operational efficiency and data integrity. Future research should focus on refining model accuracy, exploring unsupervised learning techniques to reduce labeled data dependency, and integrating multi-modal approaches to enhance recognition capabilities in diverse handwriting styles. By pursuing these directions, the application of CNNs in handwriting recognition will become an indispensable tool in archives management, leading to a more intelligent and modern future.
Statements and declarations
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
