MHANet: Multi-scale hybrid attention network for crowd counting

Abstract

Crowd counting aims to estimate the number, density, and distribution of crowds in an image. The current mainstream approach, based on CNNs, has been highly successful, but it has flaws: the limited receptive field of a CNN hampers the modeling of global contextual information, and CNNs struggle to handle scale variation and background complexity. In this paper, we propose a Multi-scale Hybrid Attention Network, MHANet, to address these challenges. To handle scale variation, we develop a Multi-scale Aware Module (MAM) that incorporates multiple sets of dilated convolutions with varying dilation rates, which improves the network’s ability to extract information at multiple scales. To handle background complexity, we introduce a Hybrid Attention Module (HAM) that combines spatial attention and channel attention, directing attention to the crowd region while suppressing background interference. We evaluate MHANet on four benchmark datasets against state-of-the-art algorithms. In terms of MAE, MHANet outperforms the current state-of-the-art methods by margins of 1.9%, 5.4%, 0.4%, and 0.8% on the ShanghaiTech Part_A, ShanghaiTech Part_B, UCF-QNRF, and UCF_CC_50 datasets, respectively. Ablation experiments on MAM and HAM indicate that both modules help address scale variation and background complexity, improving the accuracy and robustness of the network.

Keywords

CNN, crowd counting, multi-scale, hybrid attention

1 Introduction

Crowd counting aims to estimate the number, density, and distribution of a crowd in an image. In recent years, large-scale gatherings of people have posed a huge challenge to public health and safety; examples include the prevention and control of COVID-19 and the prevention of stampedes [1]. Crowd counting also plays an important role in traffic control [2] and smart city construction [3].

Early researchers used detection-based [4–6] and regression-based [7, 8] methods for crowd counting. Detection-based methods are effective in low-density crowd areas but vulnerable to target occlusion. Regression-based methods were later proposed to partially address the occlusion problem, but they remain ineffective at obtaining the location of the crowd and at counting in extremely dense crowds. With the development of deep learning, Convolutional Neural Networks (CNNs) have become popular among researchers due to their powerful local feature extraction ability and flexible network structure, and they have shown remarkable results in computer vision, including crowd counting. However, the limited receptive field of CNNs makes it difficult to model global contextual information. In addition, crowd counting datasets commonly exhibit large variations in target scale and complex backgrounds, both of which make accurate counting difficult. Because wide-angle lenses are prevalent in these datasets, people close to the camera appear larger in the picture while people far away appear smaller, as shown in Fig. 1(a). This continuous change in target scale can cause the largest or smallest targets to be missed, reducing counting accuracy. Moreover, complicated backgrounds, such as people distributed under different lighting conditions as shown in Fig. 1(b), can cause foreground and background to be confused, leading to counting errors.

Fig. 1

The two major challenges in crowd counting. (a) The problem of scale variation. (b) The problem of background complexity.

Researchers have therefore attempted to enhance the global context modeling capability of CNNs. One approach is to use a multi-column structure to enhance the multi-scale feature extraction capability of the network. For example, Zhang et al. [9] proposed MCNN, a three-branch multi-column convolutional neural network with convolutional kernels of different sizes. The method extracts information at three different scales and then fuses it to enhance the multi-scale information extraction capability of the network. Another approach is to add an attention module to the network, which generates an attention feature map that guides the network to better locate the crowd regions. For example, Jiang et al. [10] used the first 13 layers of VGG-16 to construct attention branches, but this significantly increased the computational burden of the model. Although these methods achieve some success, there is still considerable room for improvement.

In this paper, we propose a Multi-scale Hybrid Attention Network called MHANet to address the problems of scale variation and background complexity in crowd counting. For the scale variation problem, we propose a Multi-scale Aware Module (MAM), inspired by dilated convolution [11], which combines different dilation rates to avoid the gridding effect [12]. For the background complexity problem, we propose a Hybrid Attention Module (HAM), which combines spatial attention and channel attention to enhance the global context modeling ability of MHANet.

Our contributions are fourfold:

We propose a Multi-scale Hybrid Attention Network (MHANet) for crowd counting, which deals better with the challenges of scale variation and background complexity.

We design a Multi-scale Aware Module (MAM) that uses dilated convolutions with different dilation rates to obtain receptive fields of different sizes. The MAM yields more accurate counting results.

We design a Hybrid Attention Module (HAM) that combines channel attention and spatial attention and enhances the global modeling capability of MHANet.

We evaluate MHANet on several challenging datasets, where it achieves competitive results.

2 Related work

2.1 Traditional crowd counting methods

Detection-based methods [4–6] achieve reasonable results for low-density crowds, but their counting performance drops considerably in dense scenes. Some researchers therefore proposed part-based detection, in which the object is detected from localized areas of the body. Compared with whole-body detection, part-based detection is more robust, but the results are still unsatisfactory in high-density scenarios.

Researchers have attempted to address the limitations of the detection method by exploring regression methods [7, 13]. Regression methods offer improved accuracy compared to detection methods, which are constrained by high-density scenarios. However, regression methods involve complex computational procedures. Furthermore, most conventional counting methods rely on global count regressions, disregarding spatial information. Traditional crowd counting methods heavily rely on manual feature extraction and exhibit bias towards sparse scenes. Consequently, their performance suffers in complex situations involving crowd occlusion, foreground perspective, multi-scale variations, and cross-scene scenarios.

2.2 Deep learning-based crowd counting methods

CNN-based methods have been widely studied and have made fruitful progress. Three lines of related work are briefly noted here: the image-pyramid-based approach [14, 15], the multi-column approach [9, 16–19], and the multi-level approach [10, 20–23]. However, they are limited by the CNN’s restricted receptive field, which leads to poor global context modeling, so counting is ineffective in the face of drastic changes in head size and extremely complex backgrounds.

In recent years, the Vision Transformer (ViT) has emerged as a prominent technique in computer vision tasks. An example of its successful application in crowd counting is TransCrowd [24], which utilizes weakly supervised training to achieve excellent results. Despite these impressive advancements, ViT currently cannot entirely replace the dominant position of CNN. First, ViT divides images into patches, and each patch in the image lacks the original spatial structure, resulting in information loss. This breaks down the inherent spatial relationships present in the image. Second, since the patches have a fixed size, it becomes challenging for the ViT to accurately extract multi-scale features, which are crucial for handling objects of different sizes. Third, the computation becomes significantly expensive when the ViT performs multi-head attention, leading to a performance gap compared to CNN, especially when the computational resources are similar. Considering the aforementioned reasons, CNN currently holds certain advantages in terms of efficiency and computational economy over ViT.

2.3 Dilated convolution

Dilated convolution [12] was initially proposed to address the challenge of image segmentation. Traditional methods for image segmentation commonly utilized pooling layers and convolutional layers to expand the network’s receptive field and reduce the resolution of feature maps. Subsequently, upsampling techniques were employed to restore the original size of the feature maps. However, this conventional approach inevitably resulted in a decrease in accuracy. In contrast, dilated convolution introduces gaps within the kernel, allowing for an expanded receptive field without altering the dimensions of the feature map. The spacing between kernel elements is controlled by the dilation rate. The effect of dilated convolution can be visualized as if the kernel expands and covers a larger area without increasing its size. Specifically, the size of the convolution kernel after dilation rate scaling can be viewed as D_k: $D_{k} = d (k - 1) + 1$ (1) where d is the dilation rate and k is the original convolution kernel size. The relationship between the size of the input feature map and the output feature map can be expressed as: $W_{output} = \frac{W_{input} + 2 p - D_{k}}{s} + 1$ (2) where W_input denotes input size, W_output denotes output size, p denotes padding, and s denotes stride.
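Equations (1) and (2) can be checked numerically with a minimal sketch. The function names below are hypothetical, and the floor division is an assumption (the usual convention when the division in Eq. (2) is not exact).

```python
def dilated_kernel_size(k: int, d: int) -> int:
    """Effective kernel size D_k = d * (k - 1) + 1 (Eq. 1)."""
    return d * (k - 1) + 1


def conv_output_size(w_in: int, k: int, d: int = 1, p: int = 0, s: int = 1) -> int:
    """Output width from Eq. 2; floor division is assumed for inexact cases."""
    return (w_in + 2 * p - dilated_kernel_size(k, d)) // s + 1


# For a 3x3 kernel with stride 1, choosing padding p = d keeps the width unchanged.
for d in (1, 3, 5, 7):
    print(d, dilated_kernel_size(3, d), conv_output_size(64, k=3, d=d, p=d))
```

With k = 3 the effective kernel size is 2d + 1, so a padding of d on each side cancels the growth of the kernel and the feature map keeps its size.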

3 Our method

MHANet employs an encoder-decoder architecture, as shown in Fig. 2. First, the image is passed through the first 13 layers of VGG-16 to extract features at different scale levels. These features pass through the MAM and the HAM, and the resulting feature maps are fed into the decoder. The decoder uses multiple convolution operations and upsampling to output the final density map. This section describes the encoder, decoder, multi-scale aware module, hybrid attention module, and loss function of MHANet in turn.

Fig. 2

The overall structure of MHANet. The image is passed through the first 13 layers of VGG-16 to extract features at different scale levels; these features pass through the MAM and the HAM, and the resulting feature maps are fed into the decoder, which uses multiple convolution operations and upsampling to output the final density map.

3.1 Encoder

The encoder phase utilizes the first five stacks of VGG-16. Originally designed for classification tasks, VGG-16 networks quickly found application in various computer vision tasks due to their exceptional representational capacity and simple network structure. To alleviate overfitting, pre-trained networks are commonly employed.

In the encoder, the input image is fed to extract shallow features initially. This is because the shallow stage of the network has smaller receptive fields, enabling the capture of fine details such as color, texture, edges, and angles. As the network depth increases, the feature maps undergo multiple convolution operations and subsampling, resulting in larger receptive fields. At this stage, each pixel represents feature information for a region or neighboring regions, providing semantic context for comparison.

Three representative features, namely Conv3_3, Conv4_3, and Conv5_3, are extracted at different scale levels. Specifically, for an RGB image of size H × W, their resolutions are $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, and $\frac{H}{16} \times \frac{W}{16}$, respectively. These features subsequently pass through both the MAM and HAM modules before being fused with their corresponding scale levels in the decoding phase.
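As a quick illustration of these resolutions, the sketch below computes the feature-map sizes for a given input. The function name is hypothetical, and rounding down for sizes that are not multiples of 16 is an assumption about how pooling handles odd dimensions.

```python
def encoder_feature_sizes(h: int, w: int) -> dict:
    """Spatial size of the three tapped VGG-16 features for an H x W input."""
    # Downsampling factor at each tapped layer, as stated in the text.
    strides = {"Conv3_3": 4, "Conv4_3": 8, "Conv5_3": 16}
    return {name: (h // s, w // s) for name, s in strides.items()}


# Example: a 768 x 1024 (H x W) input.
for name, size in encoder_feature_sizes(768, 1024).items():
    print(name, size)
```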

3.2 Decoder

The decoding process is a sequential procedure aimed at restoring the size of the feature map. Because four 2× downsampling operations are applied during the encoding stage, four 2× upsampling operations are used in the decoding stage to gradually restore the size of the feature maps.

Upsampling alone cannot recover the fine detail lost during encoding, so we introduce two skip links that integrate the preserved features Conv3_3 and Conv4_3 from the encoding phase into the decoding process. Specifically, the feature maps of Conv3_3 and Conv4_3 are each concatenated with the decoder feature map of the same size. Consequently, MHANet can access more comprehensive semantic and detailed information, resulting in improved accuracy in localizing crowd areas.

3.3 Multi-scale aware module

Due to the wide-angle view prevalent in the crowd counting dataset, the scale of the crowd close to the lens is larger, while the scale of the crowd farther from the lens is smaller. This drastic scale variation presents challenges to the counting. Therefore, this paper proposes a Multi-scale Aware Module (MAM), as shown in Fig. 2, to address the scale variation problem more effectively.

The MAM employs dilated convolutions as filters to capture richer multi-scale information. However, a specific dilation rate can only capture information at a particular scale, and overlapping of the same dilation rate can result in a gridding effect [12]. Therefore, we concatenate four sets of dilated convolutions with dilation rates of 1, 3, 5, and 7 to handle continuous scale variations more effectively. The whole procedure can be described as: $F_{a} = DConv_{1} (F)$ (3) $F_{b} = DConv_{3} (F)$ (4) $F_{c} = DConv_{5} (F)$ (5) $F_{d} = DConv_{7} (F)$ (6) $F^{'} = Conv_{3 \times 3} (concat (F_{a}, F_{b}, F_{c}, F_{d}))$ (7) where concat (·) denotes the concatenate operation, Conv_3×3 denotes a 3 × 3 convolution, and DConv_i denotes a dilated convolution with dilation rate i.
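The gridding effect mentioned above can be illustrated in one dimension. The sketch below lists which input offsets contribute to one output position after stacking 3-tap dilated convolutions. It illustrates the general phenomenon only, not the MAM itself (whose branches run in parallel), and the function names are hypothetical.

```python
def sampled_offsets(rates, k=3):
    """Input offsets (1-D) that reach one output after stacking k-tap dilated convs."""
    half = (k - 1) // 2
    offsets = {0}
    for d in rates:
        taps = [d * t for t in range(-half, half + 1)]
        # Each new layer shifts every reachable offset by each of its taps.
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)


def coverage_gaps(rates, k=3):
    """Offsets inside the receptive field that are never sampled (the 'grid holes')."""
    offs = sampled_offsets(rates, k)
    return [o for o in range(offs[0], offs[-1] + 1) if o not in offs]


print(sampled_offsets((2, 2)), coverage_gaps((2, 2)))  # same rate repeated
print(sampled_offsets((1, 3)), coverage_gaps((1, 3)))  # mixed rates
```

Repeating the same rate leaves holes in the receptive field, while mixing rates can fill them, which is the motivation for combining different dilation rates.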

3.4 Hybrid attention module

Spatial attention transforms the spatial information of an input feature into another space while retaining its key information. Channel attention allows the input feature map to learn dependencies in the channel dimensions and then adaptively rescale the features for each channel, allowing the network to focus on the more available channels while enhancing the learning ability.

Inspired by the work of Park et al. [25], a novel Hybrid Attention Module (HAM) is proposed that combines spatial attention and channel attention, as shown in Fig. 2, to address the problem of background complexity more effectively. Specifically, HAM is composed of an upper and a lower branch connected in parallel. In the upper branch, the feature map F′ first undergoes channel attention for channel-domain refinement, resulting in $F_{up}^{'c}$. It is then passed through spatial attention for spatial-domain refinement, resulting in $F_{up}^{'cs}$. In the lower branch, the feature map F′ goes through spatial attention to obtain $F_{down}^{'s}$, followed by channel attention to obtain $F_{down}^{'sc}$. Finally, a 3 × 3 convolution is used to fuse the outputs, resulting in F″. The entire process can be formalized as follows: $F_{up}^{'c} = CA_{1} (F^{'})$ (8)

$F_{up}^{'cs} = SA_{1} (F_{up}^{'c})$ (9)

$F_{down}^{'s} = SA_{2} (F^{'})$ (10)

$F_{down}^{'sc} = CA_{2} (F_{down}^{'s})$ (11)

$F^{''} = Conv_{3 \times 3} (concat (F_{up}^{'cs}, F_{down}^{'sc}))$ (12) where CA_i (·) denotes the channel attention operation, SA_i (·) denotes the spatial attention operation, concat (·) denotes the concatenate operation, and Conv_3×3 denotes the 3 × 3 convolution operation. HAM combines spatial attention and channel attention, allowing for attention enhancement in both spatial and channel dimensions simultaneously. In addition, HAM can comprehensively exploit the information within the feature map and achieve more refined modeling of features across different positions and channels. Compared to using spatial attention or channel attention alone, HAM provides stronger feature representation and semantic modeling capabilities, leading to improved performance in crowd counting tasks.
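The two-branch ordering of Eqs. (8)–(12) can be sketched with plain nested lists. The attention operators here are parameter-free stand-ins chosen for illustration (a sigmoid gate on the global channel mean and on the per-position channel mean); the learned CA and SA operators of the paper are not specified in this section, and the fusing 3 × 3 convolution is omitted. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Stand-in CA: scale each channel by the sigmoid of its global average."""
    out = []
    for ch in feat:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gate = sigmoid(mean)
        out.append([[v * gate for v in row] for row in ch])
    return out


def spatial_attention(feat):
    """Stand-in SA: scale each position by the sigmoid of its mean over channels."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    gate = [[sigmoid(sum(feat[k][i][j] for k in range(c)) / c) for j in range(w)]
            for i in range(h)]
    return [[[feat[k][i][j] * gate[i][j] for j in range(w)] for i in range(h)]
            for k in range(c)]


def hybrid_attention(feat):
    """Upper branch CA -> SA, lower branch SA -> CA, then channel-wise concatenation."""
    up = spatial_attention(channel_attention(feat))    # Eqs. 8-9
    down = channel_attention(spatial_attention(feat))  # Eqs. 10-11
    return up + down  # concat only; the 3x3 fusion conv of Eq. 12 is left out


# A toy feature map with 2 channels of size 2 x 2, laid out as [C][H][W].
feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
fused = hybrid_attention(feat)
print(len(fused), "channels after concatenation")
```

Because the two branches apply the gates in opposite order, they generally produce different responses for the same input, which is what the concatenation then exposes to the fusion layer.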

3.5 Loss function

The loss function of MHANet has two parts. First, we adopt the MSE to calculate the loss L_density for the final prediction result: $L_{density} = \| D^{gt} - D^{pred} \|_{2}^{2}$ (13) where D^gt denotes the ground-truth density map and D^pred denotes the predicted density map. Second, the binary cross-entropy loss L_i is computed separately for each of the three levels of feature maps: $L_{i} = - \left[ D_{i}^{GT} \cdot \log (D_{i}) + (1 - D_{i}^{GT}) \cdot \log (1 - D_{i}) \right]$ (14) where D_i^GT is the GT density map at the i-th level and D_i is the predicted density map at the i-th level. In addition, to better fine-tune the whole network, we multiply the sum of the L_i by the weight coefficient λ and add it to L_density to obtain the final loss function L: $L = L_{density} + \lambda \sum_{i = 1}^{3} L_{i}$ (15) where λ = 1e-3. The effect of this combined loss on counting accuracy is examined in Section 4.4.3.
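A minimal sketch of the combined loss on flattened maps follows. It uses the standard binary cross-entropy form for the per-level term; summing over pixels (rather than averaging), the clamping constant eps, and all function names are assumptions made for illustration.

```python
import math


def density_loss(gt, pred):
    """Squared L2 distance between flattened density maps (Eq. 13)."""
    return sum((g - p) ** 2 for g, p in zip(gt, pred))


def bce_loss(gt, pred, eps=1e-7):
    """Standard binary cross-entropy, summed over pixels; eps guards log(0)."""
    total = 0.0
    for g, p in zip(gt, pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total


def total_loss(gt, pred, level_gts, level_preds, lam=1e-3):
    """L = L_density + lambda * sum of the per-level losses (Eq. 15)."""
    aux = sum(bce_loss(g, p) for g, p in zip(level_gts, level_preds))
    return density_loss(gt, pred) + lam * aux


# Toy example: a perfect final prediction, with three uncertain per-level maps.
print(total_loss([1.0], [1.0], [[1.0]] * 3, [[0.5]] * 3))
```

With λ = 1e-3 the per-level terms act as a small auxiliary signal next to the density loss, which matches their role as multi-level supervision.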

4 Experiment analysis

4.1 Experimental environment

The experiments were run on the Windows 10 operating system with an NVIDIA 3060 GPU. A series of data augmentation operations were applied to the images in the dataset to obtain more diverse training samples. For images with an edge length of less than 400, bilinear interpolation was applied to increase the edge length to 400. Subsequently, local image patches were randomly flipped, and contrast and grayscale were adjusted to expand the data volume and enrich the training data. Images with extremely large resolutions were uniformly resized to 1024×768 by bilinear interpolation before data augmentation.

4.2 Evaluation metrics

We use the MAE and the MSE as the evaluation metrics of the experiment. They are defined by: $MAE = \frac{1}{N} \sum_{i = 1}^{N} \left| C_{i}^{GT} - C_{i}^{Pre} \right|$ (16)

$MSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} \left( C_{i}^{GT} - C_{i}^{Pre} \right)^{2}}$ (17) where N is the number of test images, $C_{i}^{Pre}$ denotes the predicted number of heads for the i-th image, and $C_{i}^{GT}$ denotes the ground-truth number of heads for the i-th image.
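The two metrics can be computed from per-image counts as in the sketch below. The function names are hypothetical; note that the quantity called MSE here includes a square root, following Eq. (17).

```python
import math


def mae(gt_counts, pred_counts):
    """Mean absolute count error over the test images (Eq. 16)."""
    n = len(gt_counts)
    return sum(abs(g - p) for g, p in zip(gt_counts, pred_counts)) / n


def mse(gt_counts, pred_counts):
    """Root of the mean squared count error, called MSE in the text (Eq. 17)."""
    n = len(gt_counts)
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gt_counts, pred_counts)) / n)


# Toy example with two test images.
gt = [100, 200]
pred = [110, 180]
print(mae(gt, pred), mse(gt, pred))
```

Because the errors are squared before averaging, MSE penalizes large per-image errors more heavily than MAE and is commonly read as a measure of robustness.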

4.3 Dataset and experimental analysis

We evaluate our method across four benchmarks, including ShanghaiTech Part A (ST Part_A) [9], ShanghaiTech Part B (ST Part_B) [9], UCF-QNRF [34], and UCF_CC_50 [35]. The experimental results are presented in Table 1, while some visualization results are shown in Fig. 3.

Table 1
Experimental results on ST Part_A, ST Part_B, UCF-QNRF and UCF_CC_50

Year Conference/Journal	Method	ST Part_A MAE	ST Part_A MSE	ST Part_B MAE	ST Part_B MSE	UCF-QNRF MAE	UCF-QNRF MSE	UCF_CC_50 MAE	UCF_CC_50 MSE
2019 CVPR	CAN [26]	62.3	100	7.8	12.2	107	183	212.2	243.7
2019 CVPR	SFCN [27]	59.7	95.7	7.4	11.8	102	171	214.2	318.2
2019 ICCV	BL [28]	62.8	101.8	7.7	12.7	88.7	154.8	229.3	308.2
2020 CVPR	RPNet [29]	61.2	96.9	8.1	11.6	-	-	-	-
2020 ECCV	AMRNet [30]	61.5	98.3	7.0	11.0	86.6	152.2	184.0	265.8
2021 ICCV	SUA-Fully [31]	66.9	125.6	12.3	17.9	119.2	213.3	-	-
2021 AAAI	TopoCount [8]	61.2	104.6	7.8	13.7	89.0	159.0	184.1	258.3
2021 PR	MATT [32]	80.1	192.4	11.7	17.5	-	-	355.0	550.2
2022 SCIS	TransCrowd [24]	66.1	105.1	9.3	16.1	97.2	168.5	-	-
2023 WACV	DMCNet [33]	58.5	84.6	8.6	13.7	96.5	164.0	-	-
-	MHANet (Ours)	57.4	94.2	7.0	10.8	86.2	171.9	182.5	240.4

Fig. 3

Visualization results on different datasets, where the three columns of pictures from left to right are the input image, the ground-truth density map, and the predicted density map.

4.3.1 ST Part_A

The ShanghaiTech dataset is divided into ST Part_A and ST Part_B. ST Part_A contains a total of 241,677 head annotations, with an average of 501 annotations per image. Most of the images are sourced from the Internet and have varying scene information.

Table 1 shows that MHANet achieves an MAE of 57.4 and an MSE of 94.2 on ST Part_A. Our MAE improves on the previous best, DMCNet [33], by 1.9%, which suggests that our network focuses effectively on crowd areas and suppresses complex background interference.

4.3.2 ST Part_B

All images in ST Part_B were taken from surveillance cameras in Shanghai. This dataset has a single source, a relatively small average number of people, and a relatively sparse crowd scene, with an average of 124 people per image.

Table 1 shows the counting results of MHANet on ST Part_B, with an MAE of 7.0 and an MSE of 10.8. Our MAE improves on SFCN [27] by 5.4% and matches that of AMRNet [30] (7.0). For MSE, our method outperforms the state-of-the-art AMRNet [30] by 1.8%. These results indicate that our method achieves high accuracy even in low-density regions. Our network incorporates a multi-scale aware module, allowing it to capture head information at different sizes. This capability enables accurate counting even when there is significant variation in head size.

4.3.3 UCF-QNRF

The UCF-QNRF dataset is an extremely challenging crowd counting dataset with rich scenes and diverse viewpoints, densities, and lighting conditions. It contains 1,535 images annotated with a total of 1,251,642 head points.

Table 1 shows that MHANet adapts well to extremely complex scenarios: it achieves the lowest MAE (86.2) among the compared methods, ahead of newer methods such as TransCrowd [24] and DMCNet [33]. However, our MSE is 13% higher than that of AMRNet [30], which has the best MSE in Table 1, indicating room for further improvement in the robustness of our network. This gap may be attributed to the limited receptive field of the CNN.

4.3.4 UCF_CC_50

The UCF_CC_50 images are collected from the Internet, and the scenes include concerts, demonstrations, and other highly crowded occasions, with a total of 50 extremely dense images taken at different resolutions and from different perspectives. The average head count per image is 1,280, which exceeds that of the other crowd counting datasets. Considering the limited number of UCF_CC_50 images, the publishers of this dataset defined a cross-validation protocol to make better use of the available samples.

As shown in Table 1, the results indicate that our method achieves an MAE of 182.5 and an MSE of 240.4. Comparing our method to AMRNet [30], we observe a 0.8% improvement in MAE, and when compared to CAN [26], there is a 1.3% improvement in MSE. These findings demonstrate that MHANet exhibits remarkable accuracy and robustness, even when dealing with limited RGB information.
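The MAE margins quoted in this section can be reproduced from Table 1, assuming they are relative reductions against the comparison method named in the text (for UCF-QNRF, where no method is named for MAE, the lowest prior MAE in Table 1 is assumed). The function name and the pairing of methods are illustrative.

```python
def relative_gain(baseline: float, ours: float) -> float:
    """Relative error reduction in percent: 100 * (baseline - ours) / baseline."""
    return 100.0 * (baseline - ours) / baseline


# (comparison MAE, MHANet MAE), copied from Table 1.
mae_pairs = {
    "ST Part_A (vs DMCNet)": (58.5, 57.4),
    "ST Part_B (vs SFCN)": (7.4, 7.0),
    "UCF-QNRF (vs AMRNet)": (86.6, 86.2),
    "UCF_CC_50 (vs AMRNet)": (184.0, 182.5),
}

for name, (base, ours) in mae_pairs.items():
    print(f"{name}: {relative_gain(base, ours):.2f}%")
```

Under this assumption the computed margins are roughly 1.88%, 5.41%, 0.46%, and 0.82%, in line with the reported 1.9%, 5.4%, 0.4%, and 0.8% up to rounding or truncation.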

4.4 Ablation study

We conduct a series of ablation studies on the ST Part_A. In the implementation, the basic backbone network structure is used as a baseline and the corresponding modules are added step by step to verify their effectiveness.

4.4.1 Ablation study for multi-scale aware module

The experimental comparison results show that the backbone network baseline already has a decent counting effect, but when MAM is added, the counting performance of the network is further improved, as shown in Table 2 and visualized in Fig. 4(a).

Table 2
Ablation study for multi-scale aware module

Sets in Multi-scale Aware Module	ST Part_A MAE	ST Part_A MSE
Baseline	63.8	103.4
Baseline+MAM	61.2	96.6
Baseline+MAM+HAM (MHANet)	57.4	94.2

Fig. 4

Visualization results of ablation study. (a) The visualization results of the ablation study of MAM. (b) The visualization results of the ablation study of HAM.

According to the experimental results, the accuracy of the count is significantly improved when the MAM is added to the backbone network, indicating that the MAM plays a key role in dealing with the rapid change of the human head scale.

To validate the effectiveness of dilated convolutions in MAM, we extract multi-scale features using dilated convolutions with different dilation rates to compare their performance. For the setting of the dilation rate, four schemes "1, 1, 1, 1", "1, 2, 3, 4", "2, 4, 6, 8", and "1, 3, 5, 7" are tried as shown in Table 3.

Table 3

Ablation study for multi-scale aware module

Dilation rate combination	ST Part_A MAE	ST Part_A MSE
Dilation rates=1,1,1,1	62.3	101.8
Dilation rates=1,2,3,4	61.2	99.3
Dilation rates=2,4,6,8	60.9	99.5
Dilation rates=1,3,5,7	57.4	94.2

According to the experimental results, using dilated convolutions in MAM improves the counting accuracy of the network, and the best counting performance is achieved with dilation rates of 1, 3, 5, and 7.

4.4.2 Ablation study for hybrid attention module

The comparison results of the experiments show that the counting effect of the network is significantly improved after the addition of the HAM, as shown in Table 4 and visualized in Fig. 4(b).

Table 4
Ablation study for hybrid attention module

Sets in Hybrid Attention Module	ST Part_A MAE	ST Part_A MSE
Baseline	63.8	103.4
Baseline+HAM	61.7	100.4
Baseline+HAM+MAM (MHANet)	57.4	94.2

From the ablation studies, we can see that the backbone network is already able to obtain relatively accurate density maps, but the foreground information is further enhanced after incorporating the attention module. At the same time, the background noise interference is suppressed so that the foreground part of the density map is more significant and the background error is relatively smaller, further improving the counting accuracy.

4.4.3 Ablation study of loss function

We design two types of total loss schemes, L_density loss function alone and L_density plus weighted L_i loss. Moreover, for the coefficient λ, we design three schemes such that λ is 1e-2, 1e-3, and 1e-4, and the results are given in Table 5.

Table 5
Ablation study of the loss function

Sets in Loss Function	ST Part_A MAE	ST Part_A MSE
L = L_density	58.1	98.4
L = L_density + λ (L₁ + L₂ + L₃) (λ = 1e-2)	58.8	99.3
L = L_density + λ (L₁ + L₂ + L₃) (λ = 1e-4)	57.9	97.1
L = L_density + λ (L₁ + L₂ + L₃) (λ = 1e-3)	57.4	94.2

The second scheme gives better results than L_density alone for λ = 1e-3 and λ = 1e-4, whereas λ = 1e-2 is slightly worse. The result for λ = 1e-4 is close to that for λ = 1e-3, which indicates that our loss function has some robustness to the value of λ in this range.

5 Conclusion

In this paper, a Multi-scale Hybrid Attention Network is proposed to address the problem of scale variation and background complexity more effectively in crowd counting. The proposed MHANet incorporates MAM and HAM. The MAM consists of four sets of dilated convolutions with different dilation rates, effectively enhancing the network’s multi-scale information extraction capability. The HAM consists of two sets of spatial attention and channel attention in parallel, effectively increasing the network’s attention to the crowd region and suppressing background interference. Additionally, in the decoding phase, a multi-level supervised loss function is used to obtain more accurate counting results. Compared to other existing methods, the proposed MHANet achieves MAE evaluation scores of 57.4, 7.0, 86.2, and 182.5 on ST Part_A, ST Part_B, UCF-QNRF, and UCF_CC_50 datasets, respectively, surpassing the current state-of-the-art methods by 1.9%, 5.4%, 0.4%, and 0.8%, respectively. However, the MHANet still has some limitations compared to other excellent methods. For instance, when dealing with uneven crowd density distribution, the counting accuracy would decrease due to CNN’s limitation in modeling global contextual information. Furthermore, the complex design structure of MAM and HAM leads to increased computational complexity and longer training time for the network.

There are still many tasks awaiting further exploration by researchers, such as the application of ViT in crowd counting and the study of cross-modal issues in crowd counting. In the future, we plan to explore a lightweight algorithm to achieve crowd counting and extend our approach to other counting domains to make further contributions.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. 62163016, 62066014), the Natural Science Foundation of Jiangxi Province (20212ACB202001, 20224BAB202016).

References

1.

Yuan Zhou, Jianxing Yang, Hongru Li, Tao Cao and Sun-Yuan Kung, Adversarial learning for multiscale crowd counting under complex scenes, IEEE Transactions on Cybernetics 51(11) (2020), 5423–5432.

2.

Arian Bakhtiarnia, Błażej Leporowski, Lukas Esterle and Alexandros Iosifidis, Analysis of the effect of low-overhead lossy image compression on the performance of visual crowd counting for smart city applications, arXiv preprint arXiv:2207.10155, 2022.

3.

Teng Li, Huan Chang, Meng Wang, Bingbing Ni, Richang Hong and Shuicheng Yan, Crowded scene analysis: A survey, IEEE Transactions on Circuits and Systems for Video Technology 25(3) (2014), 367–386.

4.

Upendra Singh and Puja Gupta, Activity detection and counting people using Mask-RCNN with bidirectional ConvLSTM, Journal of Intelligent & Fuzzy Systems (Preprint) (2022), 1–16.

5.

Issam Laradji, Negar Rostamzadeh, Pedro Pinheiro, David Vazquez and Mark Schmidt, Where are the blobs: Counting by localization with point supervision, In Proceedings of the European conference on computer vision (ECCV), pages 547–562, 2018.

6.

Yuting Liu, Miaojing Shi, Qijun Zhao and Xiaofang Wang, Point in, box out: Beyond counting persons in crowds, In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6469–6478, 2019.

7.

Nima Safari, Joar Petter Tanem and Roste, A block-based predistortion for high power-amplifier linearization, IEEE Transactions on Microwave Theory and Techniques 54(6) (2006), 2813–2820.

8.

Shahira Abousamra, Minh Hoai, Dimitris Samaras and Chao Chen, Localization in the crowd with topological constraints, In Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021), 872–881.

9.

Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao and Yi Ma, Single-image crowd counting via multi-column convolutional neural network, In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 589–597, 2016.

10.

Xiaolong Jiang, Zehao Xiao, Baochang Zhang, Xiantong Zhen, Xianbin Cao, David Doermann and Ling Shao, Crowd counting and density estimation by trellis encoder-decoder networks, In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6133–6142, 2019.

11.

Fisher Yu and Vladlen Koltun, Multi-scale context aggregation by dilated convolutions, arXiv preprint arXiv:1511.07122, 2015.

12.

Yuchun Fang, Yifan Li, Xiaokang Tu, Taifeng Tan and Xin Wang, Face completion with hybrid dilated convolution, Signal Processing: Image Communication 80 (2020), 115664.

13.

Tsung-Yi Lin, Yen-Yu Lin, Ming-Fang Weng, Yu-Chiang Wang, Yu-Feng Hsu and Hong-Yuan Mark Liao, Cross camera people counting with perspective estimation and occlusion handling, In 2011 IEEE International Workshop on Information Forensics and Security, pages 1–6. IEEE, 2011.

14.

Daniel Onoro-Rubio and Roberto Lopez-Sastre, Towards perspective-free object counting with deep learning, In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 615–629. Springer, 2016.

15.

Lokesh Boominathan, Srinivas Kruthiventi S.S. and Venkatesh Babu, CrowdNet: A deep convolutional network for dense crowd counting, In Proceedings of the 24th ACM international conference on Multimedia, pages 640–644, 2016.

16.

Vishwanath Sindagi and Vishal Patel, Generating high-quality crowd density maps using contextual pyramid CNNs, In Proceedings of the IEEE international conference on computer vision, pages 1861–1870, 2017.

17.

Anran Zhang, Jiayi Shen, Zehao Xiao, Fan Zhu, Xiantong Zhen, Xianbin Cao and Ling Shao, Relational attention network for crowd counting, In Proceedings of the IEEE/CVF international conference on computer vision, pages 6788–6797, 2019.

18.

Dan Guo, Kun Li, Zheng-Jun Zha and Meng Wang, DADNet: Dilated-attention-deformable convnet for crowd counting, In Proceedings of the 27th ACM international conference on multimedia, pages 1823–1832, 2019.

19.

Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Jun-Yan He and Alexander Hauptmann, Improving the learning of multi-column convolutional neural network for crowd counting, In Proceedings of the 27th ACM international conference on multimedia, pages 1897–1906, 2019.

20.

Lu Zhang, Miaojing Shi and Qiaobo Chen, Crowd counting via scale-adaptive convolutional neural network, In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1113–1121. IEEE, 2018.

21.

Vishwanath Sindagi and Vishal Patel, HA-CCN: Hierarchical attention-based crowd counting network, IEEE Transactions on Image Processing 29 (2019), 323–335.

22.

Liangzi Rong and Chunping Li, Coarse- and fine-grained attention network with background-aware loss for crowd density map estimation, In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3675–3684, 2021.

23.

Xin Wang, Rongrong Lv, Yang Zhao, Tangwen Yang and Qiuqi Ruan, Multi-scale context aggregation network with attention-guided for crowd counting, In 2020 15th IEEE International Conference on Signal Processing (ICSP), volume 1, pages 240–245. IEEE, 2020.

24.

Dingkang Liang, Xiwu Chen, Wei Xu, Yu Zhou and Xiang Bai, TransCrowd: Weakly-supervised crowd counting with transformers, Science China Information Sciences 65(6) (2022), 160104.

25.

Sanghyun Woo, Jongchan Park, Joon-Young Lee and In So Kweon, CBAM: Convolutional block attention module, In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.

26.

Weizhe Liu, Mathieu Salzmann and Pascal Fua, Context-aware crowd counting, In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5099–5108, 2019.

27.

Qi Wang, Junyu Gao, Wei Lin and Yuan Yuan, Learning from synthetic data for crowd counting in the wild, In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8198–8207, 2019.

28.

Zhiheng Ma, Xing Wei, Xiaopeng Hong and Yihong Gong, Bayesian loss for crowd count estimation with point supervision, In Proceedings of the IEEE/CVF international conference on computer vision, pages 6142–6151, 2019.

29.

Yifan Yang, Guorong Li, Zhe Wu, Li Su, Qingming Huang and Nicu Sebe, Reverse perspective network for perspective-aware object counting, In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4374–4383, 2020.

30.

Xiyang Liu, Jie Yang, Wenrui Ding, Tieqiang Wang, Zhijin Wang and Junjun Xiong, Adaptive mixture regression network with local counting map for crowd counting, In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pages 241–257. Springer, 2020.

31.

Yanda Meng, Hongrun Zhang, Yitian Zhao, Xiaoyun Yang, Xuesheng Qian, Xiaowei Huang and Yalin Zheng, Spatial uncertainty-aware semi-supervised crowd counting, In Proceedings of the IEEE/CVF international conference on computer vision, pages 15549–15559, 2021.

32.

Yinjie Lei, Yan Liu, Pingping Zhang and Lingqiao Liu, Towards using count-level weak supervision for crowd counting, Pattern Recognition 109 (2021), 107616.

33.

Mingjie Wang, Hao Cai, Yong Dai and Minglun Gong, Dynamic mixture of counter network for location-agnostic crowd counting, In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 167–177, 2023.

34.

Haroon Idrees, Muhmmad Tayyab, Kishan Athrey, Dong Zhang, Somaya Al-Maadeed, Nasir Rajpoot and Mubarak Shah, Composition loss for counting, density map estimation and localization in dense crowds, In Proceedings of the European conference on computer vision (ECCV), pages 532–546, 2018.

35.

Haroon Idrees, Imran Saleemi, Cody Seibert and Mubarak Shah, Multi-source multi-scale counting in extremely dense crowd images, In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2547–2554, 2013.