Abstract
The writer identification task infers the writer by analyzing the texture, structure, and other representative features of handwriting. Inspired by the attention mechanism, this paper proposes an end-to-end writer identification model that combines global and local features. The Vision Transformer is used as the backbone network, and the Convolutional Block Attention Module (CBAM) is introduced to enhance the model's awareness of global features. The proposed method is evaluated on two public datasets, IAM and CVL. In word-level writer identification, the accuracy rates on the two datasets are 90.1% and 92.3%, respectively. In page-level writer identification, the accuracy rates are 98.6% and 99.5%, respectively, which is state-of-the-art performance.
Introduction
Writer identification refers to identifying the writer of a piece of handwriting by analyzing its writing style. It has potential applications in historical manuscript analysis and file protection. Handwriting style varies with the writer's education, age, writing tools, and other factors, so handwriting is regarded as a biometric feature. Handwriting data in daily life are widely stored in the form of images. Compared with voiceprint, gait, iris, and other biometric features, handwriting data are easier to collect and analyze. Writer identification has long been a research hotspot in the field of pattern recognition.
Writer identification can be divided into word-level and page-level identification according to the number of characters contained in the document images. The number of characters in a document image affects how well the writer's handwriting style can be modeled. As the number of characters increases, the image information becomes more abundant. Therefore, most studies focus on writer identification using page-level document images, which contain several paragraphs or sentences, and high identification rates are reported on public datasets. However, writer identification on word images is still a challenging problem because of the lack of image information; methods based on handcrafted features in particular fail completely at word-level writer identification.
Writer identification models based on handcrafted features determine the writer by extracting features and calculating the similarity between them. These methods use frequency-domain transforms and codebooks, and mainly rely on the shape or texture features of handwriting images to identify the writer. They require that each writer sample contains enough image information to obtain a reliable feature sequence. Literature [1] pointed out that writer identification methods based on handcrafted features need at least 150 characters in a handwriting image to capture a relatively stable handwriting style. Therefore, such methods only apply to page-level handwriting documents with a relatively large number of characters. Most existing writer identification models based on deep learning use a convolutional neural network (CNN), which extracts deep abstract features of handwriting images by stacking several convolutional layers. In a typical CNN model, linear convolution filters generate feature maps, followed by a global average pooling (GAP) layer that decreases the granularity of the features. The features extracted by a CNN are helpful for identifying the target writer. However, because the model attends unevenly to local and global features, local details in the handwriting image are ignored. Therefore, there is still a big gap between word-level and page-level handwriting images in writer identification based on deep learning. Such a model cannot extract enough handwriting style information from word-level handwriting images. Handwriting identification is still a challenging problem when the number of characters is small.
The attention mechanism enables the model to obtain essential information from key positions in handwriting images for writer identification. It can be used to address the model's unbalanced ability to capture local features and global information. The Vision Transformer (ViT) [2] replaces the CNN with a transformer for the first time and achieves better results in image classification. Inspired by this, this paper uses the Vision Transformer as the backbone network to enhance the model's attention to the global information of handwriting features. At the same time, the Convolutional Block Attention Module (CBAM) [3] is combined with the backbone network to increase the model's local attention to handwriting images.
The contributions of this paper are as follows:
(1) This paper extends the application scope of the ViT model to the task of writer identification. The effect of the self-attention mechanism on word-level writer identification is studied, and the problem of local information loss is addressed through the information interaction between patches in the self-attention mechanism.
(2) For the writer identification task, the Vision Transformer network is improved by introducing the Convolutional Block Attention Module so that both the local and global features of handwriting are considered. By combining traditional attention and self-attention, a writer identification model with a stronger ability to extract deep abstract features is constructed. The model can extract more effective features even from word-level images with few characters.
(3) The experimental results on the IAM and CVL datasets show that the proposed method achieves competitive recognition accuracy.
Related work
This section briefly summarizes the classic methods of writer identification. Depending on the features used, we divide the methods into two groups: those based on handcrafted features and those based on deep learning.
Handcrafted features for writer identification
Writer identification based on handcrafted features regards handwriting as a texture or structural feature. One or more representative features of handwriting are chosen according to professional knowledge, a feature extraction algorithm is designed, and the writer is identified by calculating the similarity between handwriting features. Handcrafted features are usually divided into texture-based features and grapheme-based features.
Generally, texture-based features are designed from joint feature distributions. Brink et al. [4] found that the width of handwriting plays an important role in writer identification and proposed the Quill feature, which combines the direction and width of the handwriting. Bulacu et al. [5] segmented the contour image of the handwriting into small segments and used Hinge features to extract joint features from different angles of the handwriting. Schomaker [6], based on the run-length of general patterns, designed the curvature-free COLD feature, which is based on the joint distribution of line fragment length and direction. The direction of each line fragment was extracted, and polygon approximation was used to find the dominant points on the contour to determine the curvature-free feature of the handwriting. Siddiqi and Vincent [7] used local structural features of handwriting and a codebook to perform writer identification. They determined the small fragments that frequently appear in handwriting, extracted the orientation and curvature features of sub-images, and combined contour texture features with graphic features to determine the writer.
Grapheme-based features are usually encoded to achieve better results. Khalifa et al. [8] segmented connected components based on contours as graphemes. These graphemes were normalized to build the codebook, and multiple codebooks of different sizes were used to build the global feature descriptor. Wu et al. [9] extracted SIFT features and trained a codebook with a self-organizing map. In [10], the RootSIFT descriptors extracted from the handwriting images were used to identify the writer, and GMM supervectors were used as the encoding method to describe the characteristic handwriting.
Deep learning for writer identification
With the development of neural networks and deep learning, the convolutional neural network has made outstanding achievements in image feature extraction. More and more researchers are trying to use features extracted by CNNs to replace handcrafted features. Using a neural network to automatically extract and classify handwriting features has become a new approach to writer identification. Writer identification methods based on deep learning can be divided into two types: methods that use a neural network only for feature extraction, and end-to-end methods that use a neural network for both feature extraction and classification. Fiel and Sablatnig [11] used a deep convolutional neural network for writer identification for the first time. They used CaffeNet as a handwriting feature extractor. Because of the insufficient ability of CaffeNet to extract features, the identification results were not ideal. Christlein et al. [12] applied a convolutional neural network to the feature extraction step of writer identification, while the remaining steps were the same as in the handcrafted feature methods. The authors of [13] extracted deep abstract features of handwriting images by stacking multiple convolutional layers. Convolutional neural networks are widely used in the task of writer identification. Nguyen et al. [14] built a model containing a three-layer convolutional neural network and used the activation function layer as a classifier for the first time, so that the feature extractor, the feature aggregator, and the feature classifier could all be trained by backpropagation. To study whether the feature information extracted in the character identification task affects the writer identification task, He and Schomaker [15] proposed a multitask learning method for writer identification, which applies the features extracted in the auxiliary task to writer identification.
Writer identification researchers have realized that fusing handwriting features of different levels can improve identification accuracy. He and Schomaker [16] proposed the FragNet model for writer identification. The model extracts feature maps of different levels from the feature map pyramid through a deep fragment network, uses a convolutional neural network to calculate the features of the different levels, and obtains more representative features after fusion. He et al. [17] combined the context information in the handwriting image with the local information of the handwriting. Their model uses a convolutional neural network to extract features. The global features of the handwriting are extracted at the middle layer of the convolutional neural network. The global features, which contain the context information, are used as the hidden state of a residual recurrent neural network, and the local features are used as its input for classification to determine the writer. To reduce the human and material resources spent on data annotation, Chen [18] proposed a semi-supervised model for writer identification. This model uses two deep learning models to extract features and uses VLAD to encode the extracted features.
In recent years, the attention mechanism has been widely used in various tasks. In writer identification, it is often combined with a convolutional neural network or a recurrent neural network to enhance the model's ability to extract deep features. Chen et al. [19] used the letter style adapter (LSA) to encode different letters, using a CNN and an LSTM. They also proposed the Hierarchical Attention Pool (HAP) for feature aggregation. In the Hierarchical Attention Pool, a CNN and an attention mechanism are used to aggregate the features of a single character, and the importance of each time step is then calculated by an LSTM and an attention mechanism combined with the global context of the whole image. Abhishek et al. [20] used a spatial attention mechanism, multi-scale feature fusion, and a patch-based CNN. Ngo et al. [21] proposed the A-VLAD model by combining attention with key point-based writer identification. In this model, a CNN is used to extract the features of the handwriting, and a key point-based attention filter is used to calculate the features. Finally, a generalized deep neural VLAD model is used to aggregate the features into a representative feature sequence. Shaikh et al. [22] used cross attention and soft attention to enhance the pixel areas that are highly correlated with the writer identification task. The existing attention-based handwriting identification networks all use CNN or RNN structures, which makes the model's attention to global and local feature information unbalanced. Therefore, this paper proposes the A-ViT model to solve these problems.
Proposed method
In this section, the network structure of the proposed Attention Vision Transformer (A-ViT) method is introduced in detail. First, the backbone network, the Vision Transformer, is introduced. Then the details of the Convolutional Block Attention Module are given.
The model consists of a CBAM module and a ViT module. To reduce computational complexity, the ViT model cuts the image into small patches, and multi-head attention lets the patches interact with each other to obtain local feature information. However, global feature information is lost in this way. To balance the ability to obtain global and local information, we introduce the Convolutional Block Attention Module. CBAM computes attention over the feature map in the channel dimension and the spatial dimension and strengthens the model's ability to model global features. In short, CBAM obtains the global information of the handwriting image, and the ViT captures the local information.
The overall architecture of the proposed A-ViT model is shown in Fig. 1. The image is first input into the Convolutional Block Attention Module to construct the global information. In this module, channel attention and spatial attention are applied to the input image in turn. The resulting feature maps are input into the embedding layer for serialization. The serialized feature sequence is processed by 12 encoders and then classified.

The structure of the model.
The relationship between strokes can be regarded as the key point information of handwriting, which is a kind of local feature. The accuracy with which key points are captured greatly affects the accuracy of the identification results. An excellent writer identification model should have a solid ability to capture local features. To better capture the local information of the handwriting image, the Vision Transformer is used as the backbone network.
The central part of the Vision Transformer is the Encoder block. The input of the encoder is a feature sequence, so images need to be serialized. The Embedding block preprocesses the image before the encoder calculates the features. The preprocessing includes image serialization, adding position coding, and adding a class token. Specifically, the image is first segmented into patches by a convolution with a 16×16 window. The height and width dimensions of the patches are then flattened to obtain a two-dimensional feature matrix. After a class token is added to the feature matrix and the position coding parameters are added, the input of the encoder is obtained.
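The serialization steps described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `embed_patches`, the embedding dimension of 768 (taken from the MLP description later in this section), the single-channel 224 × 224 input, and the random weights standing in for learned parameters are all assumptions.

```python
import numpy as np

def embed_patches(image, patch=16, dim=768, seed=0):
    """Serialize an (H, W, C) image into a ViT input sequence.

    All weights are random placeholders for parameters that would be learned.
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    gh, gw = h // patch, w // patch  # number of patches along height / width
    # Cut the image into non-overlapping patches: (gh, gw, patch, patch, c).
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    # Flatten the patch grid into one axis: (gh * gw, patch * patch * c).
    x = x.reshape(gh * gw, patch * patch * c)
    # A 16x16 convolution with stride 16 is equivalent to this linear projection.
    w_proj = rng.normal(0.0, 0.02, size=(patch * patch * c, dim))
    tokens = x @ w_proj
    # Prepend the class token (zeros here; learnable in practice).
    cls_token = np.zeros((1, dim))
    tokens = np.concatenate([cls_token, tokens], axis=0)
    # Add the position coding (random here; learnable in practice).
    pos_embed = rng.normal(0.0, 0.02, size=tokens.shape)
    return tokens + pos_embed

seq = embed_patches(np.zeros((224, 224, 1)))
print(seq.shape)  # (197, 768)
```

A 224 × 224 image yields 14 × 14 = 196 patches, so the encoder input has 197 rows once the class token is added.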
The Encoder block is composed of 12 stacked encoders. Each encoder consists of multi-head attention and a multi-layer perceptron, and a residual connection is used around each of the two parts. The overall structure is shown in Fig. 2.

The structure of encoder.
The specific calculation process is shown in Formulas 1 and 2, where X_in and X_out are the input and output of the encoder, respectively; LN represents Layer Normalization; MH represents multi-head attention; and DP represents the Dropout operation.
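Assuming the standard pre-norm Vision Transformer encoder, which matches the symbols defined here, the two formulas can be written as follows. The intermediate variable X' and the operator MLP (the multi-layer perceptron) are notation introduced for this sketch.

```latex
\begin{align}
X' &= X_{in} + \mathrm{DP}\big(\mathrm{MH}(\mathrm{LN}(X_{in}))\big) \tag{1}\\
X_{out} &= X' + \mathrm{DP}\big(\mathrm{MLP}(\mathrm{LN}(X'))\big) \tag{2}
\end{align}
```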
The feature sequence is processed by Layer Normalization before the calculation of multi head attention and multi-layer perceptron. The calculation process of Layer Normalization is shown in Formula 3.
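A standard form of Layer Normalization is given below as an assumption, since only E[x] is defined in the text. The variance Var[x], the small constant ε, and the learnable scale γ and shift β are symbols introduced for this sketch.

```latex
\begin{equation}
\mathrm{LN}(x) = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta \tag{3}
\end{equation}
```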
where E[x] is the mean value of the feature sequence.
The multi-head attention mechanism splits the computation on the input feature matrix according to the number of attention heads. The calculation is shown in Formula 4:
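In the usual transformer formulation, which is assumed here, the outputs of the heads are concatenated and projected. The number of heads h and the output projection matrix W^O are symbols introduced for this sketch.

```latex
\begin{equation}
\mathrm{MH}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \tag{4}
\end{equation}
```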
The attention head calculation is shown in Formula 5, where head_i refers to the i-th attention head. The formula for calculating Attention(·) is shown in Formula 6:
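Assuming the standard scaled dot-product formulation, Formulas 5 and 6 take the form below. The per-head projection matrices W_i^Q, W_i^K, W_i^V and the key dimension d_k are symbols introduced for this sketch.

```latex
\begin{align}
\mathrm{head}_i &= \mathrm{Attention}\big(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\big) \tag{5}\\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{6}
\end{align}
```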
The class token is extracted from the multi-head attention layer as the input of the multi-layer perceptron (MLP) layer. The MLP consists of fully connected layers, a dropout layer, and an activation function layer that uses GELU. The calculation of GELU is shown in Formula 7.
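The exact definition of GELU is as follows, where Φ denotes the cumulative distribution function of the standard normal distribution and erf is the error function.

```latex
\begin{equation}
\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] \tag{7}
\end{equation}
```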
The first fully connected layer has 3072 neurons, 4 times the dimension of the feature sequence. The second fully connected layer has 768 neurons.
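To make the data flow through one encoder concrete, the following NumPy sketch chains Layer Normalization, multi-head attention, and the 768 → 3072 → 768 MLP with residual connections. It is an inference-time illustration under several assumptions: 12 attention heads (the ViT-Base setting, not stated in the text), no biases, no learnable normalization parameters, dropout omitted, and random weights. All function names are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature axis (learnable scale/shift omitted).
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi expressed through the error function.
    return 0.5 * x * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def multi_head_attention(x, p, heads):
    n, d = x.shape
    dk = d // heads
    # Project the tokens, then split the feature axis into heads: (heads, n, dk).
    q, k, v = ((x @ p[name]).reshape(n, heads, dk).transpose(1, 0, 2)
               for name in ("wq", "wk", "wv"))
    # Scaled dot-product attention, computed for every head at once.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dk))  # (heads, n, n)
    # Concatenate the heads back into (n, d) and apply the output projection.
    out = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ p["wo"]

def encoder(x, p, heads):
    # Residual connection around multi-head attention.
    x = x + multi_head_attention(layer_norm(x), p, heads)
    # Residual connection around the two-layer MLP with GELU.
    hidden = gelu(layer_norm(x) @ p["w1"])
    return x + hidden @ p["w2"]

def init_params(d, hidden, rng):
    shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
              "w1": (d, hidden), "w2": (hidden, d)}
    return {name: rng.normal(0.0, 0.02, shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
d = 768
params = init_params(d, 4 * d, rng)   # hidden width 3072
x = rng.normal(size=(197, d))         # 196 patch tokens plus one class token
y = encoder(x, params, heads=12)
print(y.shape)  # (197, 768)
```

The output has the same shape as the input, which is what allows 12 such encoders to be stacked.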
The Vision Transformer network has an excellent ability to capture local feature information. Because the computational complexity of its multi-head attention module grows with the square of the input sequence length, the backbone network divides images into local patches to reduce the length of the sequence into which an image is converted. This structure makes the modeling scope of the whole model relatively small and its attention to global information weak. The Convolutional Block Attention Module (CBAM) is introduced to enhance the model's attention to global information. CBAM applies channel attention and spatial attention to the input in turn and multiplies the resulting feature weights with the original feature map to obtain the image after attention is applied. The structure of CBAM is shown in Fig. 3.

The structure of CBAM.
Specifically, CBAM takes the image as the input, that is, F ∈ R^(H×W×C). First, channel attention is applied to the input features, denoted by M_c(F) ∈ R^C. The calculation of M_c(F) is shown in Formula 8.
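Following the formulation in the original CBAM paper [3], and assuming it is used unchanged here, channel attention can be written as below. σ denotes the sigmoid function, MLP is the shared multi-layer perceptron, and ⊗ denotes element-wise multiplication with broadcasting over the spatial dimensions.

```latex
\begin{align}
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{GAP}(F)) + \mathrm{MLP}(\mathrm{GMP}(F))\big) \tag{8}\\
F_c &= M_c(F) \otimes F \notag
\end{align}
```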
In detail, the average and maximum values of each channel are obtained by applying Global Average Pooling (GAP) and Global Max Pooling (GMP) to the input feature map. Each pooled vector is passed through the multi-layer perceptron, the two outputs are added, and the sigmoid activation function is applied to the sum to obtain the channel weights.
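The channel attention step can be sketched in NumPy as follows. This is an illustration, not the authors' code: the function name, the ReLU inside the shared MLP, and the channel reduction ratio of 4 follow common CBAM practice and are assumptions, as are the random weights and the toy feature map size.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, w1, w2):
    """Apply channel attention to a feature map f of shape (H, W, C).

    w1 (C, C // r) and w2 (C // r, C) are the weights of the shared MLP.
    Returns the reweighted feature map and the channel weights.
    """
    avg = f.mean(axis=(0, 1))  # GAP: average of each channel
    mx = f.max(axis=(0, 1))    # GMP: maximum of each channel

    def shared_mlp(v):
        # Two-layer MLP with a ReLU in between (ReLU is an assumption).
        return np.maximum(v @ w1, 0.0) @ w2

    # Add the two MLP outputs and squash to (0, 1) to get the channel weights.
    weights = sigmoid(shared_mlp(avg) + shared_mlp(mx))
    # Broadcast the per-channel weights over height and width.
    return f * weights, weights

rng = np.random.default_rng(0)
f = rng.normal(size=(8, 8, 16))        # toy feature map with 16 channels
w1 = rng.normal(0.0, 0.1, size=(16, 4))
w2 = rng.normal(0.0, 0.1, size=(4, 16))
f_c, channel_weights = channel_attention(f, w1, w2)
print(f_c.shape, channel_weights.shape)  # (8, 8, 16) (16,)
```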
The output feature map of channel attention is F_c. Spatial attention is then applied to F_c, represented by M_s(F_c). Spatial attention first applies Global Average Pooling (GAP) and Global Max Pooling (GMP) along the channel axis of F_c. Then, a convolution layer adjusts the number of channels. Finally, the sigmoid activation function is used to obtain the spatial weights. The calculation process of spatial attention is shown in Formula 9.
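Following the original CBAM paper [3], and assuming that formulation is used unchanged, spatial attention can be written as below. Here [· ; ·] denotes concatenation along the channel axis, σ is the sigmoid function, and f^{k×k} is a convolution with a k × k kernel; the original CBAM paper uses k = 7, but the kernel size used in this work is not stated.

```latex
\begin{equation}
M_s(F_c) = \sigma\Big(f^{k \times k}\big([\mathrm{GAP}(F_c);\, \mathrm{GMP}(F_c)]\big)\Big) \tag{9}
\end{equation}
```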
Dataset settings
In the experiments, the English handwriting datasets IAM and CVL and the Chinese handwriting dataset CASIA were selected to evaluate the performance of A-ViT.
CVL [23]: CVL is a public dataset used for writer retrieval, writer identification, and word identification. The dataset provides each writer's handwriting as page-level, line-level, and word-level images. A total of 310 writers contributed handwriting, written in English and German. Of these, 27 writers provided seven handwriting documents each, and 283 writers provided five each. There are 99890 handwriting images in total. The training set contains 59934 handwriting images, and the test set contains 39956 handwriting images. The data samples are shown in Fig. 4(a).

Three publicly available data sets.
IAM [24]: The IAM dataset contains handwriting images from 657 writers, and the handwriting content is in English. The dataset contains line-level and word-level handwriting images. There are 109227 handwriting images at the word level. The training set contains 75375 handwriting images, and the test set contains 33852 handwriting images. The data samples are shown in Fig. 4(b).
CASIA [25]: CASIA is a handwritten Chinese dataset published by the Institute of Automation, Chinese Academy of Sciences. It is divided into online Chinese handwritten data (CASIA-OLHWDB) and offline Chinese handwritten data (CASIA-HWDB). This experiment uses the offline text dataset (HWDB2.0), which contains Chinese handwriting provided by 419 writers, all stored as text lines. To meet the requirements of word-level writer identification, this experiment uses sliding-window segmentation on the line-level handwriting images so that the input of the network is as close as possible to a word-level Chinese handwriting image. The resulting dataset consists of 110916 handwriting images provided by 419 writers. The training set contains 88989 handwriting images, and the test set contains 21927 handwriting images. The data samples are shown in Fig. 4(c).
To evaluate our method, we used the trained network to conduct the writer verification test on the word images. Each image in the test set is used as a query sample. The model must select the most likely real writer from all the candidate writers according to the extracted features.
We use two common evaluation metrics to assess the quality of the model. The model ranks the candidate writers by probability. Soft-Top-N examines the top N writers in this ranking: if they include the correct writer, the test result is considered correct. Hard-Top-N requires that the top N predictions in the ranking are all correct for the test result to be considered correct, so its requirements are more stringent. Since Hard-Top-N and Soft-Top-N are equivalent for N = 1, we record these scores only once as Top-1. We selected Top-1 and Soft-Top-5 as the evaluation metrics of the experiments.
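The Soft-Top-N criterion described above can be sketched in plain Python as follows. The function name and the toy probabilities are hypothetical; the example only illustrates how a query counts as correct when the true writer appears among the N highest-ranked candidates.

```python
def soft_top_n_accuracy(probabilities, true_writers, n):
    """Fraction of queries whose true writer is among the n most probable writers.

    probabilities: one list of per-writer probabilities per query image.
    true_writers: the index of the correct writer for each query image.
    """
    hits = 0
    for probs, truth in zip(probabilities, true_writers):
        # Rank the writer indices from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda w: probs[w], reverse=True)
        if truth in ranked[:n]:
            hits += 1
    return hits / len(true_writers)

# Toy example with three query images and three candidate writers.
probs = [
    [0.7, 0.2, 0.1],  # true writer 0: ranked first
    [0.1, 0.3, 0.6],  # true writer 1: ranked second
    [0.2, 0.5, 0.3],  # true writer 0: ranked third
]
truth = [0, 1, 0]

top1 = soft_top_n_accuracy(probs, truth, 1)
top2 = soft_top_n_accuracy(probs, truth, 2)
print(round(top1, 3), round(top2, 3))  # 0.333 0.667
```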
Implementation details
The proposed model is built on the Windows 10 operating system using Python 3.8 and the TensorFlow 2.4 framework. The neural network is trained with the Adam optimizer [26]; the momentum attenuation is set to 0.0001, the batch size is set to 16, and the initial learning rate is set to 0.0001 and decays dynamically with the training cycle. The model is trained for 100 epochs. All word-level images are resized to a fixed size (224 × 224) while keeping the aspect ratio unchanged, and all handwriting images are converted to grayscale before training to avoid the negative impact of handwriting color on model training.
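One way to bring a word image to 224 × 224 without changing its aspect ratio is to scale the longer side to 224 and pad the shorter side. The sketch below computes the scaled size and the padding; the use of symmetric padding is an assumption, since the text does not say how the remaining area is filled, and the function name is hypothetical.

```python
def fit_with_aspect_ratio(width, height, target=224):
    """Return the scaled (width, height) and the (left, top, right, bottom) padding
    needed to place an image on a target x target canvas without distortion."""
    # Scale so that the longer side exactly matches the target size.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Split the leftover space as evenly as possible between the two sides.
    pad_w = target - new_w
    pad_h = target - new_h
    padding = (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2)
    return (new_w, new_h), padding

size, padding = fit_with_aspect_ratio(448, 112)
print(size, padding)  # (224, 56) (0, 84, 0, 84)
```

A wide word image of 448 × 112 pixels is scaled to 224 × 56 and padded with 84 rows above and below.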
Experimental results
The A-ViT network took the word-level handwriting image as the input. In the page-level handwriting image experiment, the page-level document images were cut to the word-level, and the voting method was used to select the most likely writer for page-level handwriting images.
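The page-level decision can be illustrated with a simple majority vote over the word-level predictions of one page. Majority (hard) voting is an assumption here, since the text does not specify the voting rule, and the function name and writer labels are hypothetical.

```python
from collections import Counter

def vote_page_writer(word_predictions):
    """Return the writer predicted most often among the word images of one page."""
    counts = Counter(word_predictions)
    # most_common(1) returns the (writer, count) pair with the highest count.
    writer, _ = counts.most_common(1)[0]
    return writer

# Word-level predictions for the words cut from a single page.
page_words = ["w3", "w7", "w3", "w1", "w3"]
print(vote_page_writer(page_words))  # w3
```

Because individual word-level errors are outvoted by the remaining words on the page, page-level accuracy can exceed word-level accuracy.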
First, we carried out word-level experiments on the IAM, CVL, and CASIA datasets. Table 1 compares the results of A-ViT with those of existing models on word-level images of the IAM and CVL datasets. Since few relevant studies are available for comparison on the CASIA dataset, only the comparison results on the IAM and CVL datasets are shown in Table 1. It can be seen from the table that the proposed model effectively improves the accuracy of word-level recognition. Models based on handcrafted feature extraction cannot accurately capture the writing style when the number of characters is small. End-to-end models based on deep learning achieve high identification results in word-level writer identification. The experimental results show that the self-attention mechanism can effectively improve the word-level identification accuracy of end-to-end models.
Comparison of the results of writer identification task at word level with existing methods
Performance comparison of existing methods at page level
Building on the word-level results, we studied page-level writer identification. First, the page-level images were segmented into word-level images; then the word-level images were recognized; finally, the best-matching writer was selected by voting. The comparison results are shown in Table 2. The proposed method achieves the best results on both datasets.
Comparison of image identification performance at different levels
The experimental results at word level and page level are shown in Table 3. Although the accuracy of CNN-based methods is greatly improved compared with methods based on manual feature extraction, there is still a big gap in recognition accuracy between word-level and page-level images. At the same time, the results show that the proposed method performs well in both word-level and page-level tasks.
It can be concluded from the experimental results that the recognition accuracy of methods based on deep learning is higher than that of methods based on manual feature extraction. However, existing end-to-end models rely too much on page-level images and cannot achieve ideal identification results on word-level images. One reason for the unsatisfactory results is that these models do not pay enough attention to the structural information between strokes. The proposed model enhances the ability to capture local features by making the patches in a word-level image interact in pairs. At the same time, the Convolutional Block Attention Module is used to construct the global information of word-level images. This method achieves state-of-the-art performance on word-level images. The method is also comparable with handcrafted feature methods in page-level identification. In page-level identification, handcrafted features usually give better results than end-to-end neural network models, but they are limited by the number of characters in the handwriting image. The proposed method still reaches the identification accuracy of handcrafted feature methods while using only word-level handwriting images. This shows that the A-ViT model performs better in writer identification tasks with fewer characters.
To show the identification performance of the A-ViT model intuitively, t-SNE was used to visualize the features, as shown in Fig. 5. The implementation details are as follows: 50 writers were randomly selected from the CVL test set, and 18 word-level handwriting images were taken from each writer as the input of A-ViT. We extract the class token from the output features of the last encoder in the encoder layer as the classification vector. After the extracted classification vectors are reduced to two-dimensional space by PCA, their distribution is visualized. The visualization shows that, for most writers, the handwriting features are clustered, and most writers' feature clusters are isolated from one another. This shows that the proposed A-ViT can correctly distinguish writers without the handwriting of different writers interfering with each other, highlighting the high identification ability of the model in the word-level scenario.

t-SNE visualization. Each cluster represents the words written by one writer.
In this paper, the method was also evaluated on the Chinese handwriting dataset CASIA. The handwriting material in this dataset is at line level. To obtain word-level material, we cut the original handwriting images with a sliding window so that each image after cutting retains two to three characters. The proposed method achieved 88.9% accuracy on the CASIA dataset. Compared with the IAM and CVL datasets, the identification accuracy on the CASIA dataset is lower, which may be due to the segmentation method and the different complexity of character structure in different languages.
To verify the effectiveness of each module in A-ViT, ablation experiments were conducted on the IAM and CVL datasets. The results are shown in Table 4. With all other settings identical, we compare the recognition results of the model with and without the attention module. With the attention module enabled, the identification accuracy on IAM and CVL was 0.88% and 1.23% higher, respectively, than that of the baseline network. This ablation experiment shows that the A-ViT model achieves the best identification rate because it enhances attention to both the local features and the global features of handwriting images.
Ablation experiment
In order to show the performance of the improved model more intuitively, we used Grad-CAM [29] to visualize the model results. Grad-CAM can highlight the important structural information in the process of predicting categories in the image by calculating the gradient. The visualization results of the baseline and A-ViT model are shown in Fig. 6.

Visual comparison between the baseline model and the model in this paper on the CVL dataset.
The dark parts are the areas to which the model pays more attention. It can be seen from the Grad-CAM images that the A-ViT model focuses better on the structural information of the handwriting and attends to local and global information at the same time. Therefore, better recognition accuracy is obtained.
In this paper, we propose the Attention Vision Transformer method for the writer identification task, which maintains global feature awareness while capturing local features. The advantage of the proposed method is that the Vision Transformer backbone extracts local information from handwritten images, and the CBAM module helps the model build global information of handwritten images. Therefore, the model can build a stable writing style from word-level handwriting images with a small number of characters, and it achieves state-of-the-art performance on two public datasets. In the word-level task, the accuracy rates are 90.0% and 92.3% on the IAM and CVL datasets, respectively, and in the page-level task the results are 98.6% and 99.5%, respectively. The A-ViT model improves the identification accuracy in word-level writer identification experiments. In the page-level writer identification experiment, the method is not limited by the number of characters: words with only a few characters suffice to achieve high identification accuracy, comparable to that of manual feature extraction. In future work, we will further study the interpretability of the recognition process.
Acknowledgments
The research was funded by the National Natural Science Foundation of China, grant number 62166036; the Science Foundation of Gansu Province, grant number 20JR10RA335; the Innovation Foundation of Gansu Provincial Department of Education, grant numbers 2022B-152 and 2023B-118; and the Research Project of Gansu University of Political Science and Law, grant number GZF2022XZD09.
