Abstract
Traditional Chinese painting (TCP) is culturally significant and reflects China’s rich history and aesthetics. In recent years, TCP classification has shown impressive performance, but obtaining accurate annotations for these tasks is time-consuming and expensive because it requires professional art experts. To address this challenge, we present TCP-RBA, a semi-supervised learning (SSL) method for traditional Chinese painting classification that achieves strong results even with a limited number of labels. To improve global representation learning, we employ the self-attention-based MobileViT model as the backbone network. Furthermore, we present a data augmentation strategy, Random Brushwork Augment (RBA), which integrates brushwork information to enhance performance. Comparative experiments confirm the effectiveness of TCP-RBA in Chinese painting classification: it reaches an accuracy of 88.27% on the test dataset with only 10 labels, one per class.
Introduction
With the development of digital technology, the digitization of traditional paintings has become an important part of the dissemination of cultural heritage. This has also led to a large number of methods that combine deep learning and painting [23, 42]. For example, ChipGAN is a generative adversarial network that transfers photos into Chinese ink paintings [12]. [38] used the serial multi-expert framework with semi-supervised clustering methods to perform the annotation of brushwork patterns. By leveraging an integration of edge detection and clustering-based segmentation, [21] developed an extraction method for distinguishing the works of Van Gogh from those of his contemporaries. Fan [11] compared oil paintings and Chinese ink paintings using average distance calculations and rule-based features based on the rule of thirds, the golden section rule and the golden triangle rule. Sheng [32] proposed an algorithm for merging art target regions with maximum similarity, utilizing a deep convolutional neural network for feature extraction and a support vector machine (SVM) for fusion and classification. However, its effectiveness is limited when handling abstract traditional paintings with expansive target regions.
Existing methods rely heavily on supervised learning, and the high cost of annotating traditional paintings limits their practicality. Consequently, we present a method that integrates semi-supervised learning (SSL) with traditional Chinese painting classification, named TCP-RBA. Recognizing the importance of brushwork in traditional Chinese painting classification, we introduce a targeted data augmentation strategy called Random Brushwork Augment (RBA). Moreover, we use the MobileViT network [26] to learn global representations, which addresses the challenge posed by paintings with expansive target areas.
In this study, we investigate semi-supervised classification methods designed to make full use of both labeled and unlabeled data, while also taking into account factors such as diverse painting styles and brushwork. Our principal contributions can be summarized as follows. First, we employ semi-supervised learning to improve results under limited labeling conditions, and we introduce the MobileViT network, which is adept at learning global representations and therefore helps capture abstract traditional painting features. Even in the extreme case of one label per class, the results deviate little from those of the supervised method. Second, each painter uses unique lines, shapes, and strokes that make up the brushwork of his or her paintings, and these are the most distinctive features for differentiating painters. We therefore combine brushwork with a data augmentation strategy to propose Random Brushwork Augment (RBA), designed mainly for traditional Chinese paintings. Third, we construct a new dataset of traditional Chinese paintings, which includes 1200 labeled paintings by 10 famous artists.
Related work
Semi-supervised image classification
Semi-supervised learning (SSL) methods have witnessed significant growth, leveraging a limited number of labeled samples to achieve performance on par with supervised learning. SSL can be classified into generative methods, consistency regularization methods, graph-based methods, pseudo-labeling methods, and hybrid methods.
Two important aspects of semi-supervised learning are consistency regularization [2, 19, 28] and pseudo-labeling [20]. The main idea of consistency regularization is that the prediction for an input should remain consistent even if the input is slightly perturbed [5]. This method has been extensively researched. For instance, Temporal Ensembling [19] employs an exponential moving average of the training model to produce an additional input. Pseudo-labeling leverages the model itself to obtain artificial labels for unlabeled data [25, 31].
The current trend involves the integration of these two popular methods. MixMatch [4] improves performance through data augmentation, mixing strategies, and a distributional consistency loss. ReMixMatch [3], an extension of MixMatch, introduces a novel data augmentation technique called Remixing, enhancing the model’s robustness to diverse transformations. FixMatch [33] uses weak augmentation to create pseudo-labels that supervise the output for strong augmentation; it has shown good experimental results in recent years and is easy to implement.
The FixMatch algorithm has been extensively developed across various domains. For instance, [37] improves FixMatch with weighted nuclear norm regularization for few-shot remote sensing scene classification. Additionally, [35] incorporates FixMatch’s confidence filtering mechanism into 3D object detection. Moreover, FixMatch exhibits strong generalization capabilities and can handle diverse scenarios and datasets. FlexMatch [39] filters pseudo-labels with reliable confidence during training using flexible thresholding.
Recent years have witnessed significant progress in semi-supervised learning methods leveraging graph representations. Thomas [18] introduced a method for semi-supervised classification on graph-structured data. The model uses an efficient layer-wise propagation rule that is based on a first-order approximation of spectral convolutions on graphs. Jiang [17] proposed graph learning convolutional networks (GLCN) for unknown label estimation, refining the construction of the graph by incorporating both given and estimated labels to facilitate graph convolution operations. Huang [14] leverages the correlation within the label structure, employing both “error correlation” to rectify residual errors in test data through training data and “prediction correlation” to smooth predictions on the test data.
Classification of traditional Chinese paintings
In the current research landscape, scholars have extensively explored supervised methods for the classification of paintings, primarily adopting three distinct types.
The first method focuses on classifying paintings based on different painting techniques, brushwork, and writing. Jiang [17] combined the discrete cosine transform (DCT) and convolutional neural network (CNN). They extracted features from a portion of the DCT coefficients of the image using CNN, reducing redundancy in the original image and achieving a notable result of 95%. Liu [24] created a similarity matrix by combining similarity coefficients between images and utilized AdaBoost [30] to classify the styles of Chinese paintings.
The second method entails painting classification based on content. Agarwal [1] utilized different feature extraction methods (SIFT, color, GLCM, etc.) and explored their effectiveness in classifying genres and styles of paintings. Qian [36] proposed a style painting classification algorithm based on information entropy. They extracted style features of painting images, including color entropy, block entropy, and contour entropy, combining them to form the information entropy of different painting styles.
The third classification method involves grouping paintings by their creators. Sun [34] proposed a sparse hybrid CNN framework for extracting brushwork features to classify paintings by six famous artists, but the method is computationally complex. Li [22] used a fine-tuned pre-trained VGG-F model to extract image features of Chinese paintings. They proposed the Mutual Information-based Embedded Classification algorithm (MIDEC) by incorporating the DEFC [27] algorithm and mutual information theory. A multi-task feature fusion architecture (MTFFNet) is proposed to achieve better performance by fusing semantic and stroke features [16]. Jiang [15] extracted low, medium, and high-level features using CNN for fusion, assigning different weights to optimize the model and achieving 89.2% accuracy.
While these established methods have demonstrated effectiveness, their reliance on supervised learning is hampered by the substantial cost associated with labeling required for Chinese painting classification. Additionally, the use of CNN models for feature extraction may be suboptimal, given the abstract nature inherent in Chinese paintings.
Data augmentation
Data augmentation serves as a powerful tool in semi-supervised learning. In the early stages, data augmentation strategies focused on fundamental image operations such as rotation, flipping, scaling, cropping, and color transformations. As research has advanced, scholars have sought to integrate various transformations to devise more comprehensive data augmentation strategies.
AutoAugment [6] introduces an automated method for searching and selecting optimal data augmentation policies. Instead of using a complex search mechanism like AutoAugment, RandAugment [7] randomly selects a set of augmentation operations and their strengths for each image. MixUp [41] involves linearly interpolating pairs of training samples (both input images and their corresponding labels). This blending of samples helps the model generalize better and reduces overfitting. AugMix [13] focuses on creating diverse augmented versions of an input image. It employs a combination of three operations: color-based transformations, geometric transformations, and image mixing.
Building on the aforementioned insights, we integrated RBA with select strategies mentioned above. Subsequently, we conducted comparative experiments to identify the most effective strategy tailored to our method.
Method
In this section, we provide a detailed description of our method. To classify traditional Chinese paintings, we use a semi-supervised method that incorporates a brushwork data augmentation strategy. To our knowledge, this is the first time a semi-supervised method has been combined with Chinese painting classification. In the method, we apply RandAugment and RBA to the input data and use MobileViT as the backbone network for training. The detailed explanation of MobileViT as the backbone network is in section 3.2.1. The flow of the proposed method is shown in Fig. 1.

Diagram of the method. A weakly-augmented image and a random brushwork image (top) are fed into the network to obtain predictions (orange box). When the network assigns a probability to any class which is above a threshold (dotted line), the prediction is converted to a one-hot pseudo-label. Then, we compute the network’s prediction for a strong augmentation of the same image (bottom). The network is trained to make its prediction on the strongly-augmented version match the pseudo-label via a cross-entropy loss.
Brushwork texture characteristics
Lines, shapes, and brushwork comprise the texture of a painting, which reflects the relative smoothness and roughness of the artwork. Objective measures of texture [40] are fundamental to understanding the nature of painting. The unique lines, shapes, and brushwork used by each painter make up the brushwork information of a painting and serve as the distinguishing feature used to differentiate one painter from another. Brushwork can be captured by texture features, and texture is a significant element in distinguishing painting styles. The gray level co-occurrence matrix (GLCM) is a statistically based method for extracting texture features. It is also important for representing brushwork information and for classifying artworks. RBA uses data augmentation based on GLCM-extracted brushwork information to optimize traditional painting classification.
Gray level co-occurrence matrix
Gray level co-occurrence matrix (GLCM) is a statistical method for extracting texture features and is also a significant representation of the brushwork information, which allows for further classification of various art works. Agarwal [1] used GLCM as a feature to classify fine art painting images. Daec [9] used edge, GLCM, Gabor, and color among others to assess artwork authenticity. They analyzed the paintings of the two artists and determined that the texture of brushwork is the most distinctive marker for distinguishing between the two artists.
The GLCM provides information on the orientation, intervals, and magnitude of gray levels in an image. The texture information of the image can be reflected by calculating the corresponding eigenvalues from the co-occurrence matrix. In [29], 14 texture feature parameters based on GLCM were proposed. In our experiments, we selected four of these parameters for extracting texture features.
Contrast. Contrast reflects the sharpness and texture depth of the image. The contrast (CON) increases with an increase in the number of pixel pairs with high gray scale differences.
Energy. The energy, also called the angular second moment (ASM), is the sum of the squares of all the elements in the GLCM and reflects the uniformity of the gray scale distribution and the thickness of the texture. The value of ASM is much larger when the elements in the co-occurrence matrix are concentrated in a few entries.
Homogeneity. Homogeneity measures how similar the gray levels of an image are in the row or column direction. The value reflects the local gray scale correlation, and larger values indicate greater correlation.
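For reference, the four statistics used in this section (contrast, energy, entropy, and homogeneity) are commonly written in the Haralick-style forms below. These are the standard textbook definitions and are given here as an assumption about the forms used; the abbreviations ENT and HOM are introduced only for this illustration.

```latex
\begin{align}
\mathrm{CON} &= \sum_{i}\sum_{j} (i-j)^{2}\, P(i,j) \\
\mathrm{ASM} &= \sum_{i}\sum_{j} P(i,j)^{2} \\
\mathrm{ENT} &= -\sum_{i}\sum_{j} P(i,j)\,\log P(i,j) \\
\mathrm{HOM} &= \sum_{i}\sum_{j} \frac{P(i,j)}{1+(i-j)^{2}}
\end{align}
```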
These four feature values encapsulate both the distribution of grayscale levels and information regarding the texture thickness within the image, offering multifaceted insights into its texture characteristics. In these features, i and j denote grayscale levels, and P (i, j) is the entry at position (i, j) of the GLCM, that is, the probability that a pixel with level i and a pixel with level j co-occur in a specific direction. We first compute these four feature values to generate four distinct texture feature images. We then form a linear combination of these four images, assigning a different weight to each, to create a single fused image. These combined representations are depicted in Fig. 2.
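The computation described above can be sketched for a single image patch as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `glcm` and `glcm_features`, the quantization to `levels` gray levels, and the single offset `(dx, dy)` are all hypothetical choices, and the feature formulas are the common Haralick-style definitions.

```python
import numpy as np

def glcm(patch, levels=8, dx=1, dy=0):
    """Normalized gray level co-occurrence matrix for one pixel offset.

    patch  : 2-D integer array with values in [0, levels) (assumed pre-quantized).
    dx, dy : non-negative column/row offset between the two pixels of a pair.
    """
    patch = np.asarray(patch)
    h, w = patch.shape
    src = patch[: h - dy, : w - dx]   # first pixel of each pair
    dst = patch[dy:, dx:]             # second pixel, shifted by (dy, dx)
    counts = np.zeros((levels, levels), dtype=np.float64)
    # Accumulate one count per (src level, dst level) pair.
    np.add.at(counts, (src.ravel(), dst.ravel()), 1.0)
    return counts / counts.sum()

def glcm_features(p):
    """Contrast, energy (ASM), entropy, and homogeneity of a normalized GLCM."""
    i, j = np.indices(p.shape)
    nz = p > 0                        # skip empty cells so log() is defined
    return {
        "contrast": float(np.sum((i - j) ** 2 * p)),
        "energy": float(np.sum(p ** 2)),
        "entropy": float(-np.sum(p[nz] * np.log(p[nz]))),
        "homogeneity": float(np.sum(p / (1.0 + (i - j) ** 2))),
    }

# Small 4-level example patch with a horizontal offset of one pixel.
example = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 1],
                    [0, 2, 2, 2],
                    [2, 2, 3, 3]])
features = glcm_features(glcm(example, levels=4))
```

A feature image, as opposed to a single value, could be obtained by evaluating these functions over a sliding window centered on each pixel; the window size and the weights of the final linear combination are not specified here and would have to be chosen experimentally.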

Paintings and their corresponding GLCM feature images. On the left are the original image and the fused GLCM feature image. On the right are the feature images extracted with the parameters contrast, energy, entropy, and homogeneity.
In semi-supervised learning, the use of powerful data augmentation has been shown to produce better results [8]. Our method uses two types of augmentations: “weak” and “strong”.
Weak augmentation is a standard flip-and-shift augmentation strategy. Strong augmentation includes RandAugment or CTAugment, followed by Cutout [10]. Weak augmentation is applied to labeled examples, and strong augmentation is applied to unlabeled examples.
The styles of Chinese paintings are complex and diverse. Data augmentation strategies in previous studies focused on real images and may not generalize as well to domain-specific datasets.
To enhance the method’s generalization capability and flexibility, we introduce the RBA data augmentation strategy, aiming for more accurate painting classification. Our experiments reveal a notable improvement in the results achieved by incorporating RBA.
This strategy randomly selects a certain proportion of the strongly augmented and weakly augmented examples. Brushwork information is extracted from the selected examples through GLCM, and the resulting images enter network learning together with the strongly and weakly augmented examples. Details of brushwork information extraction are shown in Section 3.1.2. The specific processing of the data augmentation strategy is shown in Equations (5) and (6):
Initially, we perform weak augmentation on the batch X to generate w (X). Subsequently, we randomly select a proportion α of w (X) for GLCM texture feature extraction. We call this process RBA, represented by R (·).
Similarly, for the batch U, we apply strong augmentation to create S (U) and then perform GLCM texture feature extraction through RBA. wA is the data augmentation for labeled data, and SA is the data augmentation for unlabeled data.
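The selection step R (·) can be sketched as below. This is a hedged illustration rather than the authors' code: the function name `random_brushwork_augment`, the argument `texture_fn` (a stand-in for the GLCM-based brushwork extraction), and the choice to append the textured copies to the batch are assumptions made to match the description that selected examples enter training together with the augmented ones.

```python
import numpy as np

def random_brushwork_augment(batch, alpha, texture_fn, rng):
    """Append texture versions of a random alpha-fraction of an augmented batch.

    batch      : array of shape (N, H, W), already weakly or strongly augmented.
    alpha      : proportion of examples to select, in [0, 1].
    texture_fn : hypothetical callable mapping one (H, W) image to a
                 brushwork texture image of the same shape.
    rng        : numpy.random.Generator used for the random selection.

    Returns the extended batch and the indices of the selected examples,
    so that labels (if any) can be duplicated for the appended images.
    """
    n = batch.shape[0]
    k = int(round(alpha * n))
    if k == 0:
        return batch, np.empty(0, dtype=int)
    idx = rng.choice(n, size=k, replace=False)      # sample without repeats
    textured = np.stack([texture_fn(batch[i]) for i in idx])
    return np.concatenate([batch, textured], axis=0), idx

# Placeholder texture function (NOT a GLCM feature; illustration only).
def toy_texture(img):
    return np.abs(img - img.mean())

rng = np.random.default_rng(0)
weak_batch = rng.random((10, 4, 4))
extended, chosen = random_brushwork_augment(weak_batch, 0.2, toy_texture, rng)
```

With α = 0.2 and a batch of 10 images, two images are selected and the extended batch holds 12 images. Replacing the selected images in place instead of appending them would be an equally plausible reading of the text.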
Backbone network
The traditional Chinese painting dataset includes various types of paintings, including brushwork paintings, ink paintings, abstract paintings, simple brushwork, and more. Unlike real pictures, traditional Chinese paintings are more abstract in their presentation and target a wider range of features. For example, the target feature of an abstract artwork may be the whole painting.
Most semi-supervised learning models use WideResNet-28-2, a widened residual network that relies primarily on convolutional operations, as the backbone network. However, convolutional neural networks (CNNs) are spatially local and may struggle to capture extensive global information, so making the model capture more global representations becomes a key issue. Self-attention-based vision transformers (ViTs) capture and handle global representations more effectively than CNNs. However, ViTs require more computational resources than CNNs and rely on large datasets for pre-training. Ultimately, we selected MobileViT, a lightweight network that combines the strengths of CNNs and ViTs.
MobileViT introduces MobileViT blocks that encode both local and global information in a tensor effectively. As shown in Fig. 3, the network structure consists of the MV2 module and the MobileViT module, where the MV2 block is the block of MobileNetv2.

MobileViT.
Standard convolution involves three operations: unfolding, local processing, and folding. The MobileViT block substitutes local convolutional processing with global processing through transformers, combining CNNs and ViTs characteristics. This enables the MobileViT block to learn more effective representations with fewer parameters and straightforward training. MobileViT block structure is shown in Fig. 4.
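The unfolding and folding operations mentioned above can be illustrated with a small NumPy sketch. It assumes the layout used in the MobileViT paper, where an H × W × d feature map is split into N non-overlapping patches of P pixels each and the transformer attends across the N patches for every pixel position; the function names `unfold` and `fold` and the patch sizes are illustrative only, and the transformer itself is omitted.

```python
import numpy as np

def unfold(x, ph, pw):
    """Split an (H, W, d) feature map into non-overlapping ph x pw patches.

    Returns an array of shape (P, N, d), with P = ph * pw pixels per patch
    and N = (H // ph) * (W // pw) patches. H and W must be divisible by
    ph and pw respectively.
    """
    h, w, d = x.shape
    x = x.reshape(h // ph, ph, w // pw, pw, d)    # (nh, ph, nw, pw, d)
    x = x.transpose(1, 3, 0, 2, 4)                # (ph, pw, nh, nw, d)
    return x.reshape(ph * pw, (h // ph) * (w // pw), d)

def fold(x_u, h, w, ph, pw):
    """Inverse of unfold: rebuild the (H, W, d) map from (P, N, d)."""
    d = x_u.shape[-1]
    x = x_u.reshape(ph, pw, h // ph, w // pw, d)  # (ph, pw, nh, nw, d)
    x = x.transpose(2, 0, 3, 1, 4)                # (nh, ph, nw, pw, d)
    return x.reshape(h, w, d)

feature_map = np.arange(4 * 4 * 2).reshape(4, 4, 2)
patches = unfold(feature_map, 2, 2)               # shape (4, 4, 2)
restored = fold(patches, 4, 4, 2, 2)
```

Because each row of `patches` collects the same pixel position from every patch, a sequence model applied along the second axis relates distant regions of the image, which is the global processing step that replaces the local convolution.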

MobileViT block.
The authors of MobileViT proposed three different configurations, namely MobileViT-S (small), MobileViT-XS (extra small), and MobileViT-XXS (extra extra small). The main difference between the three is the number of channels of the feature map. Considering the size of the experimental dataset, MobileViT-XXS is used as the backbone network for training.
MobileViT exhibits superior global information processing capabilities and is more suitable for traditional Chinese painting classification. We confirm these conclusions through comparison experiments with WideResNet, which can be found in section 4.3.
Our model applies consistency regularization to weakly and strongly augmented images. First, we use the weakly augmented data and the random brushwork augmented data to generate pseudo-labels. We then compute the cross-entropy loss between the predictions for the strongly augmented and random brushwork augmented images and the pseudo-labels generated from their weakly augmented and random brushwork augmented counterparts.
A pseudo-label is obtained through argmax when the predicted probability for the weakly augmented input exceeds the experimentally set threshold of 0.95.
The loss function of our model contains two cross-entropy loss terms, a supervised loss $l_s$ applied to labeled data and an unsupervised loss $l_u$ applied to unlabeled data.
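The thresholded pseudo-labeling step can be sketched as follows. This is a minimal NumPy illustration that assumes the FixMatch-style rule the method builds on; the function names are hypothetical, the inputs are assumed to be softmax probabilities, and averaging over all unlabeled examples (not only the confident ones) is an assumption borrowed from FixMatch.

```python
import numpy as np

def pseudo_labels(weak_probs, threshold=0.95):
    """Hard pseudo-labels and a confidence mask from weak-view predictions.

    weak_probs : (N, C) array of class probabilities for weakly augmented inputs.
    Returns the argmax labels and a boolean mask that is True where the
    maximum probability exceeds the threshold.
    """
    labels = weak_probs.argmax(axis=1)
    mask = weak_probs.max(axis=1) > threshold
    return labels, mask

def masked_cross_entropy(strong_probs, labels, mask, eps=1e-12):
    """Cross-entropy of strong-view predictions against pseudo-labels.

    Examples whose mask is False contribute zero; the sum is divided by the
    total number of examples (assumption, as in FixMatch).
    """
    picked = strong_probs[np.arange(len(labels)), labels]
    ce = -np.log(picked + eps)        # eps guards against log(0)
    return float(np.sum(ce * mask) / len(labels))

weak = np.array([[0.97, 0.02, 0.01],    # confident: kept
                 [0.50, 0.30, 0.20]])   # below threshold: ignored
strong = np.array([[0.50, 0.25, 0.25],
                   [0.10, 0.80, 0.10]])
labels, mask = pseudo_labels(weak)
loss_u = masked_cross_entropy(strong, labels, mask)
```

Only the first example passes the 0.95 threshold, so the unsupervised loss here depends solely on how much probability the strong view assigns to class 0 for that example.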
Let $\chi = \{(x_b, p_b) : b \in (1, \ldots, B)\}$ be a batch of $B$ labeled examples, where $x_b$ are the training examples and $p_b$ are the labels of $x_b$.
Pseudo-labels are generated through argmax over the weakly augmented prediction distribution $p_m(y \mid w_A)$.
The loss function $l$ of our model is a combination of the supervised loss $l_s$ and the unsupervised loss $l_u$.
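Assuming that TCP-RBA follows the FixMatch formulation it builds on, the pseudo-label and the two loss terms can be written as below. The unlabeled examples $u_b$, the batch ratio $\mu$, the threshold $\tau$ (0.95 in the text), the weight $\lambda_u$, and the cross-entropy $H$ are notational assumptions borrowed from FixMatch and are not confirmed by this section.

```latex
\begin{align}
\hat{q}_b &= \arg\max_{y}\; p_m\big(y \mid w_A(u_b)\big) \\
l_s &= \frac{1}{B}\sum_{b=1}^{B} H\big(p_b,\; p_m(y \mid w_A(x_b))\big) \\
l_u &= \frac{1}{\mu B}\sum_{b=1}^{\mu B}
       \mathbb{1}\Big(\max_{y}\, p_m\big(y \mid w_A(u_b)\big) > \tau\Big)\,
       H\big(\hat{q}_b,\; p_m(y \mid S_A(u_b))\big) \\
l &= l_s + \lambda_u\, l_u
\end{align}
```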
Dataset
To demonstrate the performance of our method on diverse traditional Chinese painting classification, we collected 1200 Chinese paintings of different artistic styles by 10 famous Chinese painters: Bai Xueshi, Chen Zhifo, Fan Zeng, Feng Zikai, Lin Fengmian, Wu Guanzhong, Xu Beihong, Zhang Daqian, Zao Wou-ki, and Zhou Sicong. These paintings vary in artistic style and were manually labeled with class-level annotations to align with our classification task. We use the combined data augmentation strategy to expand the dataset and divide it into training and test sets in an 8:2 ratio. Some images of the dataset are shown in Fig. 5.

Schematic of a randomly selected sample of paintings from the datasets.
Based on the previous discussion, we propose a framework that combines strong and weak augmentation with RBA for semi-supervised learning to classify traditional Chinese paintings. There are two main points at the core of the method. The first point is to use MobileViT as the backbone, which is better at processing global representations. The second point is to propose RBA based on the characteristics of traditional Chinese paintings.
The method was implemented and trained on an NVIDIA A5000 GPU using PyTorch. In the backbone network comparison experiment, data was consistently resized to 64 × 64 for WideResNet and 224 × 224 for MobileViT. The optimizer for WideResNet was SGD with a learning rate of 0.03, while for MobileViT, it was Adam with a learning rate of 0.0003. Experiments were conducted with epochs set at 64 and 100, and with label quantities of 10 and 100, as detailed in Table 3.
Random Brushwork Augment (RBA) uses brushwork features extracted with the gray level co-occurrence matrix (GLCM) as a data augmentation strategy. By incorporating strong and weak augmentations during training, the model better captures features across diverse samples, enhancing its robustness. To assess the optimal combination method for classification benefits, we conducted comparative experiments by integrating RBA with commonly employed data augmentation strategies in semi-supervised learning. The parameter α in RBA dictates the proportion of randomly selected samples, ranging from 0 to 1. Following a dichotomous method, we tested values of α between 0 and 1. The best performance was observed when α was set to 0.2. Experiments on data augmentation were performed with 10 labels, 100 epochs, a learning rate of 0.0002, and the Adam optimizer. Table 4 shows the results of combining RBA with diverse augmentation strategies, while Table 5 shows the experiments on the parameter α.
Experimental results
In this section, we assess our proposed TCP-RBA model and compare it with other semi-supervised methods on a traditional Chinese painting dataset. Table 1 presents the comparative experimental results for the four selected methods. In the table, FixMatch (RA) uses RandAugment and FixMatch (CTA) uses CTAugment for strong augmentation. Notably, the TCP-RBA model outperforms the selected methods, achieving a classification accuracy of 88.27% with only 1 label per class. This performance is comparable to Chinese painting classification achieved under full supervision, highlighting the effectiveness of TCP-RBA.
Different networks serve as backbone networks for supervised learning classification of TCP
Comparison of the classification accuracy of different methods
Our method is experimentally compared using WideResNet and MobileViT-XXS as backbone networks. Specifically, we conducted experiments involving different numbers of labels and epochs. In these experiments, we deliberately employed very few labeled data. Remarkably, an accuracy of 84.69% was achieved when only 10 labeled data points were used, with each class having just 1 label. The results are presented in Table 3, while the changes in accuracy and loss are illustrated in Figs. 6 and 7. These experiments indicate that the MobileViT-XXS backbone network attains superior results even with minimal labeled data.
Comparison of experimental results between WideResNet and MobileViT-XXS

Comparison of experimental results of different backbone networks when epoch = 64. (a) Shows the changes in ACC and ACC results, (b) shows the changes in Loss.

Comparison of experimental results of different backbone networks when epoch = 100. (c) Shows the changes in ACC and ACC results, (d) shows the changes in Loss.
Building on the earlier discussion, we recognize brushwork as a crucial feature in painting classification. In light of this understanding, we introduce Random Brushwork Augment (RBA) for traditional Chinese painting classification. RBA operates by integrating with other data augmentation strategies. We opted to combine RBA with three data augmentation strategies, namely RandAugment, CTAugment, and AugMix.
Table 4 presents the three data augmentation strategies along with the experimental results following the incorporation of RBA. Even with just 10 labels, the classification accuracy reaches 88.26%, a marked improvement in the experimental outcomes.
Classification results using different data augmentation
A parameter, denoted as α, is defined in RBA to regulate the proportion of randomly selected data for brushwork augmentation. Following the principle of dichotomy, we selected values ranging from 0 to 1 for experimentation. Table 5 presents the results for the parameter α when using MobileViT as the backbone network with 10 labels. The most notable improvement is observed when α is set to 0.2.
Accuracy for different threshold cases for α
In Fig. 8, it is evident that when the value of parameter α hovers around 0.2, the improvement in experimental results is most pronounced.

The impact of different data augmentation strategies on experimental results.
To validate the effectiveness of our method, we conducted supervised classification experiments using four networks: ResNet50, MobileNet, WideResNet, and MobileViT. Notably, WideResNet is a variant derived from ResNet, while MobileViT incorporates a variant network based on the MobileNetV2 block. The experiments were carried out on the traditional Chinese painting dataset we constructed.
The accuracy (ACC) column in Table 1 illustrates that, in supervised learning, MobileViT outperforms WideResNet. To further assess the role of RBA in traditional Chinese painting (TCP), we integrated RBA into the supervised classification of these networks.
The ACC(RBA) column in Table 1 reveals that the inclusion of Random Brushwork Augment significantly improves the classification results across different networks. This highlights the efficacy of RBA, even in supervised learning for traditional Chinese painting classification.
Conclusion
Currently, the classification of traditional Chinese paintings relies on supervised learning methods. However, the high cost associated with annotation poses a significant barrier to the widespread digitization and advancement of traditional Chinese paintings. To address this challenge, we adopt a semi-supervised learning method to classify TCPs, combining RBA and ViTs. The model is better at capturing global representations and brushwork features. We obtain experimental results comparable to supervised experiments with only one label per class. At the same time, combining the RBA data augmentation strategy in the supervised experiment also improved the accuracy, confirming the effectiveness of the RBA strategy in Chinese painting classification.
While our semi-supervised method proves effective in reducing the cost of classifying Chinese paintings, the overall computation method of the model impacts its running time. Therefore, investigating methods to minimize the algorithm’s runtime emerges as a crucial avenue for future research. In our upcoming endeavors, we plan to expand both the dataset and artistic style categories. Additionally, our goal is to incorporate diverse features, such as contrast and saturation, to bolster the accuracy of the algorithmic classification.
Acknowledgments
This work was supported by Beijing digital education research topic (Grant No. BDEC2023619056, BDEC2022619027), Funding for the 2023 Project of BIGC, Beijing Municipal Education Commission & Beijing Natural Science Foundation Co-financing Project (Grant No. KZ202210015019), Project of Construction and Support for high-level Innovative Teams of Beijing Municipal Institutions (Grant No. BPHR20220107), BIGC Project (Grant No. Ed202205, Eb202306, 21090124016, 21090124004, 20190122019, Ec202303, Ea202301, E6202405, 21090122012, 21090323009, 22150223036), Scientific Research Project of Beijing Municipal Education Commission (No. KM201810015011) and 2023 Project Proposal of Beijing Higher Education Association (No. MS2023168).
