Abstract
Efficient feature representation is the key to improving crowd counting performance. CNN and Transformer are the two commonly used feature extraction frameworks in the field of crowd counting. CNN excels at hierarchically extracting local features to obtain a multi-scale feature representation of the image, but it struggles with capturing global features. Transformer, on the other hand, could capture global feature representation by utilizing cascaded self-attention to capture remote dependency relationships, but it often overlooks local detail information. Therefore, relying solely on CNN or Transformer for crowd counting has certain limitations. In this paper, we propose the TCHNet crowd counting model by combining the CNN and Transformer frameworks. The model employs the CMT (CNNs Meet Vision Transformers) backbone network as the Feature Extraction Module (FEM) to hierarchically extract local and global features of the crowd using a combination of convolution and self-attention mechanisms. To obtain more comprehensive spatial local information, an improved Progressive Multi-scale Learning Process (PMLP) is introduced into the FEM, guiding the network to learn at different granularity levels. The features from these three different granularity levels are then fed into the Multi-scale Feature Aggregation Module (MFAM) for fusion. Finally, a Multi-Scale Regression Module (MSRM) is designed to handle the multi-scale fused features, resulting in crowd features rich in high-level semantics and low-level detail. Experimental results on five benchmark datasets demonstrate that TCHNet achieves highly competitive performance compared to some popular crowd counting methods.
Introduction
Crowd counting is one of the hotspots in big data analysis [1], especially in the context of the global post-epidemic era [2]. High-density crowd flow is not only a great challenge to public health issues [3], but also a great potential hazard to public safety [4], such as frequent trampling accidents. Therefore, how to better monitor crowd flow, especially in complex crowded environments, has now become an urgent issue.
The existing crowd counting methods implemented using deep learning techniques can be divided into two main categories: CNN-based methods and Transformer-based methods. CNN-based methods have achieved excellent results due to the CNN’s excellent ability to extract local information [5] and its flexible network structure [6–10]. However, CNN-based methods suffer from the inherent problem of a limited receptive field, which makes it challenging to handle the case of uneven crowd density distribution and complex background environment. One accepted solution to this problem is to design branches with different receptive fields to enhance the global context modeling capability of CNN models [11, 12]. However, these approaches can be structurally inefficient and prone to redundant blocks. Another solution is to introduce auxiliary tasks [13, 14], but this would result in greater computing costs. Although these methods achieve some results, they do not fundamentally address the problem that CNN is not expert at capturing global features
Transformer was originally proposed to improve the performance of various natural language processing (NLP) tasks, then researchers introduced it into the field of computer vision, proposing the Vision Transformers (ViT) for image classification. Due to its excellent performance, ViT was generalized to other computer vision tasks, including crowd counting [15, 16]. However, it is not appropriate to directly employ ViT for crowd counting without improvement. First, ViT splits the image into non-overlapping patches, each of which contains spatially localized information about the 2D structure. When the head in an image is segmented into different patches, as shown in Fig. 1(a), the counting performance would inevitably suffer due to the loss of spatially local information. Second, compared to CNN, the structure of ViT is relatively fixed [17], making it difficult to explicitly extract multi-scale information. However, head size variation is extremely common in crowd counting datasets, as shown in Fig. 1(b), which poses a significant challenge to crowd counting.

(a) Cutting the images into non-overlapping patches may result in the same head being split into different patches, which will seriously lead to count bias. (b) The diversity of head sizes in the crowd counting dataset is very common, which poses a great challenge for accurate counting.
According to the above analysis, CNN has a flexible network structure, but its perceptual field is limited, and only expert at extracting local features. ViT, on the other hand, could provide a global perceptual field, but its network structure is relatively fixed and tends to ignore spatially local information. Therefore, there are limitations in crowd counting based on CNN or Transformer alone. Is it possible to combine CNN and Transformer to complement each other’s strengths to achieve better counting accuracy? The answer is yes! In this paper, a Transformer-CNN Hybrid Network (TCHNet) is proposed to address the problems of spatial local information loss and head size variation in crowd counting. To address the spatial local information loss problem, CMT (CNNs meet vision transformers) [18] proposed by Wang et al. is selected as the Feature Extraction Module (FEM), which employs a Conv stem structure [19] to better preserve spatial local information during the image processing phase. Next, a modified Progressive Multi-granularity Learning Process (PMLP) is introduced in the Feature Extraction Module (FEM) to guide the network to learn at different levels of granularity. Finally, features at three different levels of granularity are selected to be fed into the Multi-granularity Feature Aggregation Module (MFAM) for fusion to fully utilize the spatial local information to help the network locate crowds and improve counting accuracy. To address the head size variation problem, a Multi-Scale Regress Module (MSRM) is designed that adopts the strategy of combining multiple sets of dilated convolutions [20] with different dilatation rates and standard convolutions with different convolution kernels to effectively enhance the multi-scale information extraction capability.
In summary, our contribution can be summarized as follows. The proposed TCHNet could address the problems of spatial local information loss and head size variation more effectively in crowd counting. A Multi-granularity Feature Aggregation Module is designed to aggregate three features from different granularity levels, allowing spatial local information to be fully utilized by fusing fine-grained and coarse-grained features of the network. MFAM helps the network to localize the crowd regions more effectively for better counting accuracy. We design a Multi-Scale Regression Module that employs a strategy of combining dilated convolutions with standard convolutions with a carefully chosen dilation rate to prevent the network from getting trapped in the gridding effect. The MSRM significantly enhances the multi-scale information extraction capability of the network. To further enhance our method, we are the first to introduce a progressive multi-granularity learning strategy to crowd counting. We analyze in Section 3.2 why progressive learning allows the network to focus more on the crowd region, and hope that this work will encourage further research on progressive learning methods for crowd counting. TCHNet has conducted experiments on five mainstream datasets. These extensive experiments demonstrate the excellent performance of our method, which even achieves state-of-the-art results.
CNN-based crowd counting
Despite its strong representation capabilities, CNN has become a recognized solution to enhance the global context modeling capability of CNN due to its fixed-size receptive field. A simple idea is to design branches with different receptive fields. MCNN [11] proposed a multi-column convolutional neural network, while Switch-CNN [12] introduced a density classifier. The optimal regressions from the three different branches are chosen adaptively. These models exhibit inefficiencies and redundant structures.
Another kind of method is to introduce auxiliary tasks [13, 21], the result is greater computing costs. For example, Zhang et al. [13] used an attention mechanism to refine perception and local feature information, while retaining global context information. These methods only consider pixel-level or single-dimension dependencies, resulting in a loss of dependencies between different dimensions of contextual information, which prevents the model from achieving optimal performance.
Transformer-based crowd counting
TransCrowd [15] was the first work to introduce the Transformer to crowd counting. It uses only count-level weakly-supervised training to train models. BCCT [16] designed a ViT-based model with various attention mechanisms to fully-supervised crowd counting. Tian et al. [22] proposed using Transformer to simplify and improve crowd counting, namely CCTrans, which covers both fully-supervised and weakly-supervised training methods. However, the above methods all use the conventional ViT (such as ViT-B-16 [23], Twins transformer [24]) as the backbone network, although achieved a certain effect, there are two problems: First, most models based on Transformer will use a large convolution (such as 16x16 convolution kernel in ViT) to directly cut the input image into non-overlapping patches, which will directly lose 2D space features in the patch as well as numerous details and edges information. The second is the huge amount of parameter computation in the Transformer when calculating self-attention.
Therefore, we use CMT [18] as our backbone network. The main advantages of CMT compared to other backbone networks are as follows: First, CMT adopts the Conv stem structure in the input image processing process and uses multiple 3 × 3 convolutions stacked structures to better preserve the 2D spatial structure and edge and detail information in the patches. Secondly, the excellent design of the CMT component enables the network to achieve better results with smaller parameters.
Progressive multi-granularity learning
As a data augmentation method, progressive multi-granularity training was first applied in the field of fine-grained visual classification. Du et al. [25] proposed a progressive multi-granularity training framework, where the whole training process starts from a more stable fine granularity and gradually transitions to a coarser granularity, by learning complementary information between different image granularities, and achieves outstanding results. Specifically, first, intermediate feature maps at different grain sizes are extracted during network training. Then, the classification loss is computed based on the results obtained in this stage. Finally, the loss is passed back for a timely parameter update. This approach to training has two distinct advantages. First, the entire training framework is step-by-step, which has the advantage of ensuring that the network is more focused on the information at the current granularity at the corresponding stage. Second, the training process starts from a more stable fine granularity and gradually filters down to a coarse granularity, thus avoiding misclassifications due to intra-class differences in larger regions.
Proposed approaches
The overall structure of TCHNet is shown in Fig. 2. First, the images are fed into the Conv stem structure to be divided into multiple patches and then fed into the FEM to extract rich global context information. In FEM, we add a PMLP to guide the network to learn at multiple levels of granularity. Then, features at three different granularity levels are fed into MFAM, which efficiently aggregates coarse-grained features and fine-grained features to obtain multi-granularity features. Finally, the multi-granularity feature is fed into the MSRM to further extract the multi-scale information, obtain the predicted density map, and monitor the final output using OT Loss [26].

The proposed TCHNet overview. First, the images are fed into the Conv stem structure to be divided into multiple patches and then fed into the FEM to extract rich global context information. In FEM, we add a PMLP to guide the network to learn at multiple levels of granularity. Then, features at three different granularity levels are fed into MFAM, which efficiently aggregates coarse-grained features and fine-grained features to obtain multi-granularity features. Finally, the multi-granularity feature is fed into the MSRM to further extract the multi-scale information, obtain the predicted density map, and monitor the final output using OT Loss.
We use CMT-B [18] as the backbone network in FEM, whose features can be used to improve the performance by combining the advantages of CNN into a Transformer. In the image input process, CMT uses a Conv stem structure to process the output image. Specifically, the Conv stem structure uses a 3 × 3 convolution with a step size of 2 and an output channel of 38 to reduce the size of the input image, followed by two 3 × 3 convolutions with a step size of 1 to better extract local information.
The CMT block consists of a local perception unit (LPU), a lightweight multi-head self-attention (LMHSA) module, and an inverted residual feed-forward network (IRFFN), as illustrated in Fig. 3.

Example of the CMT block architecture.
The LPU in the CMT block is designed to extract local information from the input image, which is important for capturing fine-grained details and spatial relationships between nearby pixels. It uses a depth-wise convolutional layer to extract local features and adds a shortcut connection to the input to promote gradient flow and improve training stability. The LPU is defined as:
The LMHSA module is similar to the self-attention mechanism used in ViT, but is designed to be more efficient and lightweight. It uses multiple attention heads to capture global dependencies between different parts of the image and applies layer normalization to improve convergence and stability. The LMHSA is defined as:
The IRFFN in the CMT block is a variant of the feed-forward network used in ViT, but is designed to be more efficient and flexible. It uses inverted residual blocks to reduce the number of parameters and increase the depth of the network and applies layer normalization and dropout to improve generalization and robustness. The LMHSA is defined as:
With the aforementioned three components, the CMT block can be formulated as:
The CMT block can be stacked multiple times to form a deep neural network and can be trained end-to-end using standard backpropagation algorithms.
In addition, a stage-wise structure is introduced to extract multi-scale features and reduce computational consumption. For an RGB image of H × W × 3, the output of each stage is
In the FEM, the stage-wise design allows the network to obtain four feature maps of different sizes with some essential correlations between them. Due to the limited receptive field and representational capabilities of the network in the initial stage, the first information learned is local detail information, which is fine-grained information. Additional semantic information, namely coarse-grained information, is learned in the deeper stages of the network. Using a progressive approach encourages the network to focus more on the information at the current granularity at the corresponding stage than training the entire network directly. Unlike the work of Du et al. [25], our PMLP has three improvements. First, because the transformer backbone at stage 1 has a shallow network depth and cannot obtain accurate semantic information and generate more accurate density maps, we directly start from stage 2 and name the three stages of the learning process L
i
as L1, L2 and L3, respectively, in fine-grained to coarse-grained order. Second, we use binary cross-entropy loss in the PMLP:
The coarse-grained features in the deeper layers of the network contain abstract semantic information but lack detailed information to effectively distinguish the boundaries of the heads, which makes it difficult for models to learn the exact locations of the crowds. Therefore, we construct an MFAM to aggregate the feature information at different granularities. Specifically, we first upsample Fs4 by a factor of 2, and then apply a 3 × 3 convolution to reduce its dimension, making the feature map size the same as Fs3. We then add it to Fs3 to obtain an intermediate feature map Fs3
MFAM
. Next, we fuse Fs3
MFAM
with Fs2 using the same operation as before. Finally, we obtain a feature map Fs2
MFAM
, which completes the aggregation process of multi-granularity features. The entire process can be described as follows:
Due to the large number of wide-angle views used during dataset acquisition, the scale of the heads in the images can vary drastically. Although the CMT with the wise-stage design that we used in FEM can initially extract multi-scale features, it is insufficient to handle the continuous variation of head scales. Therefore, we design an MSAM to more effectively address the scale variation problem. The MSRM model is composed of four branches, which are labeled as M1, M2, M3, and M4, as depicted in Fig. 2. M1, M2, and M3 use a combination strategy of dilated convolutions with different dilation rates and standard convolutions with varying kernel sizes. In contrast, the M4 branch only contains a shortcut connection path of 3 × 3 convolution. To obtain the multi-scale feature F
MSRM
for a given feature map F, we first concatenate the outputs obtained by passing F through the M1, M2, and M3 branches to form F
α
. Next, pass F through the M4 branch to obtain F
β
. Finally, F
α
and F
β
are added to obtain the final multi-scale feature F
MSRM
. The whole process can be described as follows:
To better optimize the network, we use the OT loss in DM-Count [26] to supervise the network training, which is formulated by the weighted sum of Counting loss, Optimal Transport (OT) loss, and Total Variation (TV) loss. The entire loss function is defined as follows:
The training process can be made more stable by adding TV loss.
We evaluate our method across five benchmarks, including ShanghaiTech Part A (ST_Part A) [11], ShanghaiTech Part B (ST_Part B) [11], UCF-QNRF [27], UCF_CC_50 [28] and NWPU-Crowd [29].
Traning details
To make fair comparisons, all training and evaluation were performed on the Pytorch platform using RTX 1080 Ti GPU. Our FEM is based on the official CMT-Base [18] model, which is pre-trained on the ImageNet 1K [30]. Additionally, we perform random cropping and random level flipping during training as a way to augment the training data for experiments. For the UCF-QNRF and NWPU-Crowd datasets, we pre-processed the crop size to ensure that the longer size is less than 2048, and we use the AdamW [31] optimizer with batch size 4 due to GPU memory concerns. For all datasets, our initial learning rate is uniformly set to 1e-5.
Evaluation metrics
We used the MAE, the MSE, and the NAE as the evaluation metrics of the experiment. They are defined by:
In this section, we first provide a brief introduction to each of these datasets. Next, we analyze the experimental results of TCHNet on different datasets and compare them with current state-of-the-art methods. The experimental results are presented in Tables 1 and 2, while some visualization results are shown in Figs. 4 and 5. Finally, we analyze the reasons for the leading position of our method.
Experimental results on ShanghaiTech Part A, ShanghaiTech Part B, UCF-QNRF, and UCF_CC_50. *Represents the weakly-supervised method
Experimental results on ShanghaiTech Part A, ShanghaiTech Part B, UCF-QNRF, and UCF_CC_50. *Represents the weakly-supervised method
Experimental results on NWPU-Crowd. * represents the weakly-supervised method

Visualization results on different datasets, where the first column is the true image, the second column is the number of ground-truth true annotations for that image, and the third column is the number of heads predicted by our network.

Visualization results on the NWPU-Crowd dataset. Due to the specificity of this dataset, we select four different types of test images, namely dense sample, sparse sample, negative sample, and dark sample.
The ShanghaiTech dataset is divided into ST_Part A and ST_Part B. ST_Part A contains a total of 241, 677 head annotations, with an average of 501 annotations per image. Most of the images are sourced from the Internet and have varying scene information.
TCHNet achieved an MAE of 54.9 and an MSE of 92.3 for ST_Part A. Our MAE outperformed the first-place LibraNet [36] by 1.8% and the second-place FIDTM [41] by 3.7%. For MSE, our method also outperforms the state-of-the-art GL [39] by 3.2% and the second-place SFCN [32] by 3.6%. These results indicate that our network design is excellent, and the Transformer-based backbone network has great potential.
ST_Part B
All images in ST_Part B were taken from surveillance cameras in Shanghai. This dataset has a single source, a relatively small average number of people, and a relatively sparse crowd scene, with an average of 124 people per image.
For ST_Part B, our method achieves state-of-the-art results in both MAE and MSE evaluation metrics, where MAE achieves an excellent score of 6.6, which is still 1.4% ahead of the current state-of-the-art FIDTM [41] and 5.7% better than the second-place AMRNet [35]. Additionally, our MSE is 0.9% better than AMRNet. These results suggest that our approach still allows for great accuracy in low-density regions and that our network captures head information at different sizes through the MSRM, maintaining a tiny counting error even when the scale variation is large.
UCF-QNRF
The UCF-QNRF dataset is an extremely challenging crowd counting dataset with rich scenes and diverse viewpoints, densities, and lighting conditions. 1, 535 images were annotated with a total of 1, 251, 642 head-annotated points.
Compared with all current state-of-the-art methods, our MAE is still 2.1% ahead of the current stage optimal method AMRNet [35], indicating that our network can effectively perceive crowd density information in high-density regions and effectively localize crowd targets in low-density regions by aggregating coarse-grained features and fine-grained features, resulting in excellent counting results. However, there is a 12.3% gap between the MSE of our method and LibraNet [36]. This is because LibraNet adopts a sequential decision-making approach to implement scale-weighting. The advantage of this approach is that it can handle extremely dense scenarios of the crowd more effectively, and thus it is only more robust than our method on the UCF-QNRF dataset.
UCF_CC_50
UCF_CC_50 was collected from the Internet in scenes including concerts, demonstrations, and other crowded settings with a total of 50 very dense images taken at different resolutions and from different angles. The average number of people per image is 1, 280, which exceeds the number of additional demographic datasets. Given the small number of UCF_CC_50 images, the publisher of this dataset defined a cross-validation protocol to increase the sample size. We also employ a 5-fold cross-validation strategy, wherein the complete dataset is divided into 5 samples, with 4 of them being utilized as the training set and the remaining 1 as the test set for each training session. Ultimately, the mean MAE and MSE values from the five experiments are computed as the test results.
The MAE and MSE of our method reach 169.5 and 238.5, respectively. Our method is 1.1% ahead of FIDTM [41] in MAE and only 1.9% behind in MSE, indicating that TCHNet can demonstrate excellent accuracy and robustness even in the presence of insufficient RGB information.
NWPU-Crowd
NWPU-Crowd is an extremely challenging and large dataset that, unlike the previous datasets, consists of 5109 images and 213, 375 head annotations including points and boxes. This dataset has significant advantages over the previous ones. First, the dataset contains a heavy set of image scenes, including a large number of negative samples, dense samples, sparse samples, and outage samples. Second, the dataset improves a fair online testing platform. Specifically, the network is loaded with the optimal model trained on the validation set to predict the test set, and the prediction results are uploaded to the official website to obtain the MAE, MSE, and NAE results for the test set.
Our experimental results are shown in Table 2 and visualized in Fig. 5. On the validation set, our MAE achieves the most advanced result of 41.92, which is 18.4% ahead of the most advanced FIDTM [41]. However, for the MSE, our performance is slightly lower than the current state-of-the-art algorithms, which may be explained by the fact that the CNN-based backbone network is more powerful than the Transformer-based backbone network in capturing spatial local information. Moreover, on the test set, our MAE, MSE, and NAE reach 87.82, 444.18, and 0.149 respectively. Our method MAE and MSE in the test set are somewhat behind FIDTM [41] and P2PNet [42], with our MAE being 11.8% behind P2PNet and our MSE being 29.7% behind FIDTM. We analyze the possible reasons for the gap in the following three points. First, both P2PNet and FIDTM introduce head localization as an auxiliary task into the counting effort, and this strategy can make the network more robust, especially in the unknown domain where the performance is better. Second, the average number of NWPU-Crowd is 418, which is moderately dense. The backbone networks of both P2PNet and FIDTM are CNN structures, which can more effectively localize crowd areas in sparse scenes. In contrast, BCCT [16], TransCrowd [15], and our method uses Transformer as the backbone network, which is better at sensing the crowd regions in high-density regions compared to low-density regions. Third, in terms of data preprocessing, we scale this dataset so that the longest edge does not exceed 2048 due to hardware limitations, which reduces the computational overhead but inevitably leads to a reduction in accuracy.
Ablation study
To verify the effectiveness of each part of the TCHNet, we perform a series of ablation studies on ST_Part A and ST_Part B.
Ablation study of MFAM
To evaluate the performance of our MFAM, we use the feature output from each of the three stages of the transformer backbone network separately for validation. For simplicity and in order not to alter the overall network structure, we resized Fs2, Fs3, and Fs4 to the same size (1/8 of the original image) when studying the MFAM, and then verified the effect of the feature maps of each of the three stages on the counting results. The experimental results are shown in Table 3.
Ablation study of MFAM
Ablation study of MFAM
It can be concluded that the feature maps obtained in Fs2 lack deep semantic information, which gravely affects the counting accuracy. While satisfactory results can be achieved using Fs3 and Fs4, our designed MFAM can improve the counting accuracy even more. Since the MFAM makes full use of both coarse-grained and fine-grained information, it enables the network to retain some location information at deeper stages, which is essential for crowd counting efforts.
To evaluate the performance of our MSRM, we label the four branches as M1, M2, M3, and M4, respectively. We first tried to remove the branches with dilated convolution (M1, M2 and M3) and directly used a 3 × 3 Conv layer (M4) to adapt the input feature map to the next layer. This is because we believe that the main role in MSRM may be played by dilated convolution, as it can bring different sizes of the perceptual domain and thus help the network to capture information about different human head sizes. We also confirmed our conjecture in our experiments, the results of which are shown in Table 4.
Ablation study of MSRM
Ablation study of MSRM
It can be seen that the fourth set of MSRM settings on ST_Part A and ST_Part B datasets have some gap with the optimal results of TCHNet, since the M4 branch alone cannot capture sufficient multi-scale information. When dilated convolution is added, although the counting accuracy is slightly improved, neither of them can achieve optimal results, which is because the fixed dilation rate can only capture the information at a specific scale. By superimposing multiple dilation rates, the multi-scale perception capability of the network can be effectively enhanced while avoiding the gridding effect [43].
In addition, different combinations of M1, M2, and M3 with M4 also affect the counting accuracy, and we introduce seven different sets of combinations in MSRM to explore the optimal scheme. The experimental results are shown in Table 5.
Different combination strategies in MSRM
Based on the comparison results, group 7 achieves the best performance, hence we choose this scheme to extract multi-scale information.
To evaluate the performance of our PMLP, we set the structure without PMLP to Baseline, and then conduct experiments on ST_Part A and ST_Part B. In addition, we investigate the effect of λ on the experimental results, which are presented in Table 6.
Ablation study of PMLP
Ablation study of PMLP
According to the experimental results, TCHNet achieves better experimental results when PMLP is added, indicating that through PMLP, the network can learn at multiple granularity levels, completely explore the complementary relationships between different granularities, enhance the network’s ability to extract multi-granularity features and improve the counting accuracy. In addition, we note that a weaker advantage can be achieved when λ is 1e-2 compared to 1e-3 and 5e-2, which also indicates that our method is robust to the choice of λ.
In this paper, we propose a hybrid Transformer and CNN network to address the problems of spatial local information loss and scale variation in crowd counting. The pipeline consists of five components: a FEM for better solving local spatial information loss, a PMLP for learning multi-granularity information correlation, an MFAM for multi-granularity information fusion, an MSRM for providing multi-scale receptive fields, and an optimal transfer loss function for supervised network training. The loss of spatial local information due to patch-dividing is largely reduced by using the more advanced Conv stem structure of CMT. The problem of head-scale variation is addressed by designing an MSRM to provide the network with a multi-scale receptive field. To further enhance our approach and improve the stability of the network during training, we innovatively introduce the PMLP, which enables the network to learn at the multi-granularity level, focusing on the complementary relationships between different granularity levels. Our method drives state-of-the-art counting with a more significant advantage on five mainstream datasets. We hope that our method can be further ported to additional counting tasks for additional contributions.
Footnotes
Acknowledgment
This work was supported by the National Natural Science Foundation of China (No. 62163016, 62066014, 62202165), the open project of State Key Laboratory of Performance Monitoring and Protecting of Rail Transit Infrastructure, East China Jiaotong University (Grant No. HJGZ2023203) and the Natural Science Foundation of Jiangxi Province (No. 20212ACB202001, 20202BABL202018, 20224BAB202016).
