Transformer-CNN hybrid network for crowd counting

Abstract

Efficient feature representation is the key to improving crowd counting performance. CNN and Transformer are the two commonly used feature extraction frameworks in the field of crowd counting. CNN excels at hierarchically extracting local features to obtain a multi-scale feature representation of the image, but it struggles with capturing global features. Transformer, on the other hand, could capture global feature representation by utilizing cascaded self-attention to capture remote dependency relationships, but it often overlooks local detail information. Therefore, relying solely on CNN or Transformer for crowd counting has certain limitations. In this paper, we propose the TCHNet crowd counting model by combining the CNN and Transformer frameworks. The model employs the CMT (CNNs Meet Vision Transformers) backbone network as the Feature Extraction Module (FEM) to hierarchically extract local and global features of the crowd using a combination of convolution and self-attention mechanisms. To obtain more comprehensive spatial local information, an improved Progressive Multi-scale Learning Process (PMLP) is introduced into the FEM, guiding the network to learn at different granularity levels. The features from these three different granularity levels are then fed into the Multi-scale Feature Aggregation Module (MFAM) for fusion. Finally, a Multi-Scale Regression Module (MSRM) is designed to handle the multi-scale fused features, resulting in crowd features rich in high-level semantics and low-level detail. Experimental results on five benchmark datasets demonstrate that TCHNet achieves highly competitive performance compared to some popular crowd counting methods.

Keywords

Crowd counting Transformer CNN multi-granularity progressive learning

1 Introduction

Crowd counting is one of the hotspots in big data analysis [1], especially in the context of the global post-epidemic era [2]. High-density crowd flow is not only a great challenge to public health issues [3], but also a great potential hazard to public safety [4], such as frequent trampling accidents. Therefore, how to better monitor crowd flow, especially in complex crowded environments, has now become an urgent issue.

The existing crowd counting methods implemented using deep learning techniques can be divided into two main categories: CNN-based methods and Transformer-based methods. CNN-based methods have achieved excellent results due to the CNN’s excellent ability to extract local information [5] and its flexible network structure [6 –10]. However, CNN-based methods suffer from the inherent problem of a limited receptive field, which makes it challenging to handle the case of uneven crowd density distribution and complex background environment. One accepted solution to this problem is to design branches with different receptive fields to enhance the global context modeling capability of CNN models [11, 12]. However, these approaches can be structurally inefficient and prone to redundant blocks. Another solution is to introduce auxiliary tasks [13, 14], but this would result in greater computing costs. Although these methods achieve some results, they do not fundamentally address the problem that CNN is not expert at capturing global features

Transformer was originally proposed to improve the performance of various natural language processing (NLP) tasks, then researchers introduced it into the field of computer vision, proposing the Vision Transformers (ViT) for image classification. Due to its excellent performance, ViT was generalized to other computer vision tasks, including crowd counting [15, 16]. However, it is not appropriate to directly employ ViT for crowd counting without improvement. First, ViT splits the image into non-overlapping patches, each of which contains spatially localized information about the 2D structure. When the head in an image is segmented into different patches, as shown in Fig. 1(a), the counting performance would inevitably suffer due to the loss of spatially local information. Second, compared to CNN, the structure of ViT is relatively fixed [17], making it difficult to explicitly extract multi-scale information. However, head size variation is extremely common in crowd counting datasets, as shown in Fig. 1(b), which poses a significant challenge to crowd counting.

Fig. 1

(a) Cutting the images into non-overlapping patches may result in the same head being split into different patches, which will seriously lead to count bias. (b) The diversity of head sizes in the crowd counting dataset is very common, which poses a great challenge for accurate counting.

According to the above analysis, CNN has a flexible network structure, but its perceptual field is limited, and only expert at extracting local features. ViT, on the other hand, could provide a global perceptual field, but its network structure is relatively fixed and tends to ignore spatially local information. Therefore, there are limitations in crowd counting based on CNN or Transformer alone. Is it possible to combine CNN and Transformer to complement each other’s strengths to achieve better counting accuracy? The answer is yes! In this paper, a Transformer-CNN Hybrid Network (TCHNet) is proposed to address the problems of spatial local information loss and head size variation in crowd counting. To address the spatial local information loss problem, CMT (CNNs meet vision transformers) [18] proposed by Wang et al. is selected as the Feature Extraction Module (FEM), which employs a Conv stem structure [19] to better preserve spatial local information during the image processing phase. Next, a modified Progressive Multi-granularity Learning Process (PMLP) is introduced in the Feature Extraction Module (FEM) to guide the network to learn at different levels of granularity. Finally, features at three different levels of granularity are selected to be fed into the Multi-granularity Feature Aggregation Module (MFAM) for fusion to fully utilize the spatial local information to help the network locate crowds and improve counting accuracy. To address the head size variation problem, a Multi-Scale Regress Module (MSRM) is designed that adopts the strategy of combining multiple sets of dilated convolutions [20] with different dilatation rates and standard convolutions with different convolution kernels to effectively enhance the multi-scale information extraction capability.

In summary, our contribution can be summarized as follows.

The proposed TCHNet could address the problems of spatial local information loss and head size variation more effectively in crowd counting.

A Multi-granularity Feature Aggregation Module is designed to aggregate three features from different granularity levels, allowing spatial local information to be fully utilized by fusing fine-grained and coarse-grained features of the network. MFAM helps the network to localize the crowd regions more effectively for better counting accuracy.

We design a Multi-Scale Regression Module that employs a strategy of combining dilated convolutions with standard convolutions with a carefully chosen dilation rate to prevent the network from getting trapped in the gridding effect. The MSRM significantly enhances the multi-scale information extraction capability of the network.

To further enhance our method, we are the first to introduce a progressive multi-granularity learning strategy to crowd counting. We analyze in Section 3.2 why progressive learning allows the network to focus more on the crowd region, and hope that this work will encourage further research on progressive learning methods for crowd counting.

TCHNet has conducted experiments on five mainstream datasets. These extensive experiments demonstrate the excellent performance of our method, which even achieves state-of-the-art results.

2 Related work

2.1 CNN-based crowd counting

Despite its strong representation capabilities, CNN has become a recognized solution to enhance the global context modeling capability of CNN due to its fixed-size receptive field. A simple idea is to design branches with different receptive fields. MCNN [11] proposed a multi-column convolutional neural network, while Switch-CNN [12] introduced a density classifier. The optimal regressions from the three different branches are chosen adaptively. These models exhibit inefficiencies and redundant structures.

Another kind of method is to introduce auxiliary tasks [13 , 21], the result is greater computing costs. For example, Zhang et al. [13] used an attention mechanism to refine perception and local feature information, while retaining global context information. These methods only consider pixel-level or single-dimension dependencies, resulting in a loss of dependencies between different dimensions of contextual information, which prevents the model from achieving optimal performance.

2.2 Transformer-based crowd counting

TransCrowd [15] was the first work to introduce the Transformer to crowd counting. It uses only count-level weakly-supervised training to train models. BCCT [16] designed a ViT-based model with various attention mechanisms to fully-supervised crowd counting. Tian et al. [22] proposed using Transformer to simplify and improve crowd counting, namely CCTrans, which covers both fully-supervised and weakly-supervised training methods. However, the above methods all use the conventional ViT (such as ViT-B-16 [23], Twins transformer [24]) as the backbone network, although achieved a certain effect, there are two problems: First, most models based on Transformer will use a large convolution (such as 16x16 convolution kernel in ViT) to directly cut the input image into non-overlapping patches, which will directly lose 2D space features in the patch as well as numerous details and edges information. The second is the huge amount of parameter computation in the Transformer when calculating self-attention.

Therefore, we use CMT [18] as our backbone network. The main advantages of CMT compared to other backbone networks are as follows: First, CMT adopts the Conv stem structure in the input image processing process and uses multiple 3 × 3 convolutions stacked structures to better preserve the 2D spatial structure and edge and detail information in the patches. Secondly, the excellent design of the CMT component enables the network to achieve better results with smaller parameters.

2.3 Progressive multi-granularity learning

As a data augmentation method, progressive multi-granularity training was first applied in the field of fine-grained visual classification. Du et al. [25] proposed a progressive multi-granularity training framework, where the whole training process starts from a more stable fine granularity and gradually transitions to a coarser granularity, by learning complementary information between different image granularities, and achieves outstanding results. Specifically, first, intermediate feature maps at different grain sizes are extracted during network training. Then, the classification loss is computed based on the results obtained in this stage. Finally, the loss is passed back for a timely parameter update. This approach to training has two distinct advantages. First, the entire training framework is step-by-step, which has the advantage of ensuring that the network is more focused on the information at the current granularity at the corresponding stage. Second, the training process starts from a more stable fine granularity and gradually filters down to a coarse granularity, thus avoiding misclassifications due to intra-class differences in larger regions.

3 Proposed approaches

The overall structure of TCHNet is shown in Fig. 2. First, the images are fed into the Conv stem structure to be divided into multiple patches and then fed into the FEM to extract rich global context information. In FEM, we add a PMLP to guide the network to learn at multiple levels of granularity. Then, features at three different granularity levels are fed into MFAM, which efficiently aggregates coarse-grained features and fine-grained features to obtain multi-granularity features. Finally, the multi-granularity feature is fed into the MSRM to further extract the multi-scale information, obtain the predicted density map, and monitor the final output using OT Loss [26].

Fig. 2

The proposed TCHNet overview. First, the images are fed into the Conv stem structure to be divided into multiple patches and then fed into the FEM to extract rich global context information. In FEM, we add a PMLP to guide the network to learn at multiple levels of granularity. Then, features at three different granularity levels are fed into MFAM, which efficiently aggregates coarse-grained features and fine-grained features to obtain multi-granularity features. Finally, the multi-granularity feature is fed into the MSRM to further extract the multi-scale information, obtain the predicted density map, and monitor the final output using OT Loss.

3.1 Feature extraction module

We use CMT-B [18] as the backbone network in FEM, whose features can be used to improve the performance by combining the advantages of CNN into a Transformer. In the image input process, CMT uses a Conv stem structure to process the output image. Specifically, the Conv stem structure uses a 3 × 3 convolution with a step size of 2 and an output channel of 38 to reduce the size of the input image, followed by two 3 × 3 convolutions with a step size of 1 to better extract local information.

The CMT block consists of a local perception unit (LPU), a lightweight multi-head self-attention (LMHSA) module, and an inverted residual feed-forward network (IRFFN), as illustrated in Fig. 3.

Fig. 3

Example of the CMT block architecture.

The LPU in the CMT block is designed to extract local information from the input image, which is important for capturing fine-grained details and spatial relationships between nearby pixels. It uses a depth-wise convolutional layer to extract local features and adds a shortcut connection to the input to promote gradient flow and improve training stability. The LPU is defined as: $LPU (X) = DWConv (X) + X$ (1) where, H × W is the resolution of the input of current stage, d indicates the dimension of features. DWConv (·) denotes the depth-wise convolution.

The LMHSA module is similar to the self-attention mechanism used in ViT, but is designed to be more efficient and lightweight. It uses multiple attention heads to capture global dependencies between different parts of the image and applies layer normalization to improve convergence and stability. The LMHSA is defined as: $LMHSA (Q, K, V) = Soft max (\frac{Q {K^{'}}^{T}}{\sqrt{d_{k}}} + B) V^{'}$ (2) where Q, K, and V denote query, key, and value, respectively. k′ and Q′ denote K and Q, respectively, after DWConv (·). B denotes a relative position bias, which is randomly initialized and learnable.

The IRFFN in the CMT block is a variant of the feed-forward network used in ViT, but is designed to be more efficient and flexible. It uses inverted residual blocks to reduce the number of parameters and increase the depth of the network and applies layer normalization and dropout to improve generalization and robustness. The LMHSA is defined as: $IRFFN (X) = Conv (F (Conv (X)))$ (3) $F (X) = DWConv (X) + X$ (4)

With the aforementioned three components, the CMT block can be formulated as: $Y_{i} = LPU (X_{i - 1})$ (5) $Z_{i} = LMHSA (LN (Y_{i})) + Y_{i}$ (6) $X_{i} = IRFFN (LN (Z_{i})) + Z_{i}$ (7)

The CMT block can be stacked multiple times to form a deep neural network and can be trained end-to-end using standard backpropagation algorithms.

In addition, a stage-wise structure is introduced to extract multi-scale features and reduce computational consumption. For an RGB image of H × W × 3, the output of each stage is $\frac{H}{4} \times \frac{W}{4}$ , $\frac{H}{8} \times \frac{W}{8}$ , $\frac{H}{16} \times \frac{W}{16}$ and $\frac{H}{32} \times \frac{W}{32}$ , which is crucial for dense prediction tasks including crowd counting.

3.2 Progressive multi-granularity learning process

In the FEM, the stage-wise design allows the network to obtain four feature maps of different sizes with some essential correlations between them. Due to the limited receptive field and representational capabilities of the network in the initial stage, the first information learned is local detail information, which is fine-grained information. Additional semantic information, namely coarse-grained information, is learned in the deeper stages of the network. Using a progressive approach encourages the network to focus more on the information at the current granularity at the corresponding stage than training the entire network directly. Unlike the work of Du et al. [25], our PMLP has three improvements. First, because the transformer backbone at stage 1 has a shallow network depth and cannot obtain accurate semantic information and generate more accurate density maps, we directly start from stage 2 and name the three stages of the learning process L_i as L₁, L₂ and L₃, respectively, in fine-grained to coarse-grained order. Second, we use binary cross-entropy loss in the PMLP: $\begin{matrix} L_{i} = & {D_{i}}^{GT} \cdot log (D_{i}) + (1 - {D_{i}}^{GT}) \\ \cdot (1 - log (D_{i})) \end{matrix}$ (8) where D_i^GT is the ground-truth density map in the ith level and D_i is the predicted density map in the ith level, respectively. The degree of approximation between the predicted and target values can be efficiently determined by computing the binary cross-entropy loss. Third, we consider that this learning process is in the preliminary stage of the overall network, and to better fine-tune the overall network, we multiply L_i by the weight λ to obtain the final learning quantity as C_i: $C_{i} = λ \cdot L_{i}$ (9) where λ=1e-2. The loss C_i is then passed back and the parameters are updated in time, and at the end of each training step, the parameters trained at the current step are passed to the next training step as its parameter initialization. This learning approach has two distinct advantages. First, the entire training framework is performed in steps, which has the advantage of ensuring that the network focuses more on the information at the current granularity at the corresponding stage. Second, the learning process starts from a more stable fine-grained state and gradually transitions to a coarse-grained one, thus reducing the misclassification problem during crowd detection.

3.3 Multi-granularity feature aggregation

The coarse-grained features in the deeper layers of the network contain abstract semantic information but lack detailed information to effectively distinguish the boundaries of the heads, which makes it difficult for models to learn the exact locations of the crowds. Therefore, we construct an MFAM to aggregate the feature information at different granularities. Specifically, we first upsample F_s4 by a factor of 2, and then apply a 3 × 3 convolution to reduce its dimension, making the feature map size the same as F_s3. We then add it to F_s3 to obtain an intermediate feature map F_s3^MFAM. Next, we fuse F_s3^MFAM with F_s2 using the same operation as before. Finally, we obtain a feature map F_s2^MFAM, which completes the aggregation process of multi-granularity features. The entire process can be described as follows: ${F_{s 3}}^{MFAM} = Con v_{3 \times 3} (up (F_{s 4})) \oplus F_{s 3}$ (10) ${F_{s 2}}^{MFAM} = Con v_{3 \times 3} (up ({F_{s 3}}^{MFAM})) \oplus F_{s 2}$ (11) where up (·) denotes a 2-fold upsampling operation, Conv_3×3 (·) denotes a 3 × 3 convolution operation, and ⊕ denotes the element-wise addition operation.

3.4 Multi-scale regression module

Due to the large number of wide-angle views used during dataset acquisition, the scale of the heads in the images can vary drastically. Although the CMT with the wise-stage design that we used in FEM can initially extract multi-scale features, it is insufficient to handle the continuous variation of head scales. Therefore, we design an MSAM to more effectively address the scale variation problem. The MSRM model is composed of four branches, which are labeled as M₁, M₂, M₃, and M₄, as depicted in Fig. 2. M₁, M₂, and M₃ use a combination strategy of dilated convolutions with different dilation rates and standard convolutions with varying kernel sizes. In contrast, the M₄ branch only contains a shortcut connection path of 3 × 3 convolution. To obtain the multi-scale feature F^MSRM for a given feature map F, we first concatenate the outputs obtained by passing F through the M₁, M₂, and M₃ branches to form F^α. Next, pass F through the M₄ branch to obtain F^β. Finally, F^α and F^β are added to obtain the final multi-scale feature F^MSRM. The whole process can be described as follows: $F^{α} = Concat (M_{1} (F), M_{2} (F), M_{3} (F))$ (12) $F^{β} = M_{4} (F)$ (13)

$F^{MSRM} = F^{α} \oplus F^{β}$ (14) where M_i (F) denotes the result of F after M_i branch, Concat (·) denotes the concat operation, and ⊕ denotes the element-wise addition operation.

3.5 Loss function

To better optimize the network, we use the OT loss in DM-Count [26] to supervise the network training, which is formulated by the weighted sum of Counting loss, Optimal Transport (OT) loss, and Total Variation (TV) loss. The entire loss function is defined as follows: $\begin{matrix} l (D, \hat{D}) = & l_{C} (D, \hat{D}) + λ_{1} l_{OT} (D, \hat{D}) \\ + λ_{2} {∥ D ∥}_{1} l_{TV} (D, \hat{D}) \end{matrix}$ (15) where D denote ground-truth density map, $\hat{D}$ is the predicted density map and ∥D∥ denote the vectorized dot-annotation map. $l_{C} (D, \hat{D})$ denote count loss, ∥ · ∥ ₁ is the L1 norm of a vector, λ₁ = 0.1 and λ₂ = 0.01. $l_{OT} (D, \hat{D}) \overset{def .}{=} min_{T \in U (D, \hat{D})} < C, T > \overset{def .}{=} \sum C_{i, j} T_{i, j}$ (16) where C is the transmission cost matrix, where $C_{ij} = c (D, \hat{D})$ denotes the cost of the probability mass from pixel D to $\hat{D}$ . T is the transmission matrix that assigns the probability mass of each location D to the $\hat{D}$ used to measure the cost. U denotes the set of all possible routes of all possible transmission masses from D to $\hat{D}$ . $l_{TV} (D, \hat{D}) = \frac{1}{2} {∥ \frac{D}{{∥ D ∥}_{1}} - \frac{\hat{D}}{{∥ \hat{D} ∥}_{1}} ∥}_{1}$ (17)

The training process can be made more stable by adding TV loss.

4 Experiments

We evaluate our method across five benchmarks, including ShanghaiTech Part A (ST_Part A) [11], ShanghaiTech Part B (ST_Part B) [11], UCF-QNRF [27], UCF_CC_50 [28] and NWPU-Crowd [29].

4.1 Traning details

To make fair comparisons, all training and evaluation were performed on the Pytorch platform using RTX 1080 Ti GPU. Our FEM is based on the official CMT-Base [18] model, which is pre-trained on the ImageNet 1K [30]. Additionally, we perform random cropping and random level flipping during training as a way to augment the training data for experiments. For the UCF-QNRF and NWPU-Crowd datasets, we pre-processed the crop size to ensure that the longer size is less than 2048, and we use the AdamW [31] optimizer with batch size 4 due to GPU memory concerns. For all datasets, our initial learning rate is uniformly set to 1e-5.

4.2 Evaluation metrics

We used the MAE, the MSE, and the NAE as the evaluation metrics of the experiment. They are defined by: $MAE = \frac{1}{N} \sum_{i = 1}^{N} || C_{i}^{GT} - C_{i}^{Pr e} ||$ (18) $MSE = {(\frac{{\sum_{i = 1}^{N} || C_{i}^{GT} - C_{i}^{Pr e} ||}^{2}}{N})}^{\frac{1}{2}}$ (19) $NAE = \frac{1}{N} \sum_{i = 1}^{N} \frac{|| C_{i}^{GT} - C_{i}^{Pr e} ||}{C_{i}^{GT}}$ (20) where N is the number of test images, $C_{i}^{Pr e}$ denotes the number of heads predicted by the ith image, and $C_{i}^{GT}$ denotes the number of heads for the ith image ground-truth.

4.3 Comparisons with other methods

In this section, we first provide a brief introduction to each of these datasets. Next, we analyze the experimental results of TCHNet on different datasets and compare them with current state-of-the-art methods. The experimental results are presented in Tables 1 and 2, while some visualization results are shown in Figs. 4 and 5. Finally, we analyze the reasons for the leading position of our method.

Table 1
Experimental results on ShanghaiTech Part A, ShanghaiTech Part B, UCF-QNRF, and UCF_CC_50. Represents the weakly-supervised method

Year Conference/Journal Method Backbone ST Part_A ST Part_B UCF-QNRF UCF_CC_50

MAE MSE MAE MSE MAE MSE MAE MSE

2019 CVPR CAN [21] VGG16 62.3 100 7.8 12.2 107 183 212.2 243.7

2019 CVPR SFCN [32] ResNet101 59.7 95.7 7.4 11.8 102 171 214.2 318.2

2019 ICCV BL [33] VGG19 62.8 101.8 7.7 12.7 88.7 154.8 229.3 308.2

2020 CVPR RPNet [34] VGG16 61.2 96.9 8.1 11.6 - - - -

2020 ECCV AMRNet [35] VGG16 61.5 98.3 7.0 11.0 86.6 152.2 184 265.8

2020 ECCV LibraNet [36] VGG16 55.9 97.1 7.3 11.3 88.1 143.7 181.2 262.2

2021 ICCV SUA-Fully [37] VGG16 66.9 125.6 12.3 17.9 119.2 213.3 - -

2021 AAAI TopoCount [38] VGG16 61.2 104.6 7.8 13.7 89.0 159.0 184.1 258.3

2021 CVPR GL [39] VGG19 61.3 95.4 7.3 11.7 84.3 147.5 - -

2021 arXiv CCTrans [15] Twins 64.4 95.4 7.0 11.5 92.1 158.9 245.0 343.6

2021 PR MATT* [40] VGG16 80.1 192.4 11.7 17.5 - - 355.0 550.2

2022 SCIS TransCrowd* [22] ViT 66.1 105.1 9.3 16.1 97.2 168.5 - -

2022 TMM FIDTM [41] HRNet 57.0 103.4 6.9 11.8 89.0 153.5 171.4 233.1

- TCHNet (Ours) CMT 54.9 92.3 6.6 10.9 84.8 164.0 169.5 238.5

Year Conference/Journal	Method	Backbone	ST Part_A	ST Part_B	UCF-QNRF	UCF_CC_50
2019 CVPR	CAN [21]	VGG16	62.3	100	7.8	12.2	107	183	212.2	243.7
2019 CVPR	SFCN [32]	ResNet101	59.7	95.7	7.4	11.8	102	171	214.2	318.2
2019 ICCV	BL [33]	VGG19	62.8	101.8	7.7	12.7	88.7	154.8	229.3	308.2
2020 CVPR	RPNet [34]	VGG16	61.2	96.9	8.1	11.6	-	-	-	-
2020 ECCV	AMRNet [35]	VGG16	61.5	98.3	7.0	11.0	86.6	152.2	184	265.8
2020 ECCV	LibraNet [36]	VGG16	55.9	97.1	7.3	11.3	88.1	143.7	181.2	262.2
2021 ICCV	SUA-Fully [37]	VGG16	66.9	125.6	12.3	17.9	119.2	213.3	-	-
2021 AAAI	TopoCount [38]	VGG16	61.2	104.6	7.8	13.7	89.0	159.0	184.1	258.3
2021 CVPR	GL [39]	VGG19	61.3	95.4	7.3	11.7	84.3	147.5	-	-
2021 arXiv	CCTrans* [15]	Twins	64.4	95.4	7.0	11.5	92.1	158.9	245.0	343.6
2021 PR	MATT* [40]	VGG16	80.1	192.4	11.7	17.5	-	-	355.0	550.2
2022 SCIS	TransCrowd* [22]	ViT	66.1	105.1	9.3	16.1	97.2	168.5	-	-
2022 TMM	FIDTM [41]	HRNet	57.0	103.4	6.9	11.8	89.0	153.5	171.4	233.1
-	TCHNet (Ours)	CMT	54.9	92.3	6.6	10.9	84.8	164.0	169.5	238.5

Table 2

Experimental results on NWPU-Crowd. * represents the weakly-supervised method

Year Conference/Journal	Method	Backbone	Val		Test
			MAE	MSE	MAE	MSE	NAE
2016 CVPR	MCNN [11]	-	218.5	700.6	232.5	714.6	1.063
2018 CVPR	CSRNet [20]	VGG16	104.9	433.5	121.3	387.8	0.604
2019 CVPR	CAN [21]	VGG16	93.5	489.9	106.3	386.5	0.295
2020 NeurIPS	DM-Count [26]	VGG19	70.5	357.6	88.4	388.6	0.169
2021 CVPR	GL [39]	VGG19	-	-	79.3	346.1	0.180
2021 ICCV	P2PNet [42]	VGG16	-	-	77.4	362.0	-
2021 arXiv	CCTrans* [22]	Twins	48.6	121.1	79.8	344.4	0.157
2022 CVPR	BCCT [16]	ViT	53.0	170.3	82.0	366.9	0.164
2022 SCIS	TransCrowd* [15]	ViT	88.34	400.5	117.7	451.0	0.244
2022 TMM	FIDTM [41]	HRNet	51.4	107.6	86.0	312.5	-
-	TCH-Net (Ours)	CMT	41.92	112.13	87.82	444.18	0.149

Fig. 4

Visualization results on different datasets, where the first column is the true image, the second column is the number of ground-truth true annotations for that image, and the third column is the number of heads predicted by our network.

Fig. 5

Visualization results on the NWPU-Crowd dataset. Due to the specificity of this dataset, we select four different types of test images, namely dense sample, sparse sample, negative sample, and dark sample.

4.3.1 ST_Part A

The ShanghaiTech dataset is divided into ST_Part A and ST_Part B. ST_Part A contains a total of 241, 677 head annotations, with an average of 501 annotations per image. Most of the images are sourced from the Internet and have varying scene information.

TCHNet achieved an MAE of 54.9 and an MSE of 92.3 for ST_Part A. Our MAE outperformed the first-place LibraNet [36] by 1.8% and the second-place FIDTM [41] by 3.7%. For MSE, our method also outperforms the state-of-the-art GL [39] by 3.2% and the second-place SFCN [32] by 3.6%. These results indicate that our network design is excellent, and the Transformer-based backbone network has great potential.

4.3.2 ST_Part B

All images in ST_Part B were taken from surveillance cameras in Shanghai. This dataset has a single source, a relatively small average number of people, and a relatively sparse crowd scene, with an average of 124 people per image.

For ST_Part B, our method achieves state-of-the-art results in both MAE and MSE evaluation metrics, where MAE achieves an excellent score of 6.6, which is still 1.4% ahead of the current state-of-the-art FIDTM [41] and 5.7% better than the second-place AMRNet [35]. Additionally, our MSE is 0.9% better than AMRNet. These results suggest that our approach still allows for great accuracy in low-density regions and that our network captures head information at different sizes through the MSRM, maintaining a tiny counting error even when the scale variation is large.

4.3.3 UCF-QNRF

The UCF-QNRF dataset is an extremely challenging crowd counting dataset with rich scenes and diverse viewpoints, densities, and lighting conditions. 1, 535 images were annotated with a total of 1, 251, 642 head-annotated points.

Compared with all current state-of-the-art methods, our MAE is still 2.1% ahead of the current stage optimal method AMRNet [35], indicating that our network can effectively perceive crowd density information in high-density regions and effectively localize crowd targets in low-density regions by aggregating coarse-grained features and fine-grained features, resulting in excellent counting results. However, there is a 12.3% gap between the MSE of our method and LibraNet [36]. This is because LibraNet adopts a sequential decision-making approach to implement scale-weighting. The advantage of this approach is that it can handle extremely dense scenarios of the crowd more effectively, and thus it is only more robust than our method on the UCF-QNRF dataset.

4.3.4 UCF_CC_50

UCF_CC_50 was collected from the Internet in scenes including concerts, demonstrations, and other crowded settings with a total of 50 very dense images taken at different resolutions and from different angles. The average number of people per image is 1, 280, which exceeds the number of additional demographic datasets. Given the small number of UCF_CC_50 images, the publisher of this dataset defined a cross-validation protocol to increase the sample size. We also employ a 5-fold cross-validation strategy, wherein the complete dataset is divided into 5 samples, with 4 of them being utilized as the training set and the remaining 1 as the test set for each training session. Ultimately, the mean MAE and MSE values from the five experiments are computed as the test results.

The MAE and MSE of our method reach 169.5 and 238.5, respectively. Our method is 1.1% ahead of FIDTM [41] in MAE and only 1.9% behind in MSE, indicating that TCHNet can demonstrate excellent accuracy and robustness even in the presence of insufficient RGB information.

4.3.5 NWPU-Crowd

NWPU-Crowd is an extremely challenging and large dataset that, unlike the previous datasets, consists of 5109 images and 213, 375 head annotations including points and boxes. This dataset has significant advantages over the previous ones. First, the dataset contains a heavy set of image scenes, including a large number of negative samples, dense samples, sparse samples, and outage samples. Second, the dataset improves a fair online testing platform. Specifically, the network is loaded with the optimal model trained on the validation set to predict the test set, and the prediction results are uploaded to the official website to obtain the MAE, MSE, and NAE results for the test set.

Our experimental results are shown in Table 2 and visualized in Fig. 5. On the validation set, our MAE achieves the most advanced result of 41.92, which is 18.4% ahead of the most advanced FIDTM [41]. However, for the MSE, our performance is slightly lower than the current state-of-the-art algorithms, which may be explained by the fact that the CNN-based backbone network is more powerful than the Transformer-based backbone network in capturing spatial local information. Moreover, on the test set, our MAE, MSE, and NAE reach 87.82, 444.18, and 0.149 respectively. Our method MAE and MSE in the test set are somewhat behind FIDTM [41] and P2PNet [42], with our MAE being 11.8% behind P2PNet and our MSE being 29.7% behind FIDTM. We analyze the possible reasons for the gap in the following three points. First, both P2PNet and FIDTM introduce head localization as an auxiliary task into the counting effort, and this strategy can make the network more robust, especially in the unknown domain where the performance is better. Second, the average number of NWPU-Crowd is 418, which is moderately dense. The backbone networks of both P2PNet and FIDTM are CNN structures, which can more effectively localize crowd areas in sparse scenes. In contrast, BCCT [16], TransCrowd [15], and our method uses Transformer as the backbone network, which is better at sensing the crowd regions in high-density regions compared to low-density regions. Third, in terms of data preprocessing, we scale this dataset so that the longest edge does not exceed 2048 due to hardware limitations, which reduces the computational overhead but inevitably leads to a reduction in accuracy.

4.4 Ablation study

To verify the effectiveness of each part of the TCHNet, we perform a series of ablation studies on ST_Part A and ST_Part B.

4.4.1 Ablation study of MFAM

To evaluate the performance of our MFAM, we use the feature output from each of the three stages of the transformer backbone network separately for validation. For simplicity and in order not to alter the overall network structure, we resized F_s2, F_s3, and F_s4 to the same size (1/8 of the original image) when studying the MFAM, and then verified the effect of the feature maps of each of the three stages on the counting results. The experimental results are shown in Table 3.

Table 3
Ablation study of MFAM

Features in MFAM ST_Part A ST_Part B

MAE MSE MAE MSE

F _s2 79.1 117.1 11.6 21.0

F _s3 60.3 102.8 7.2 12.5

F _s4 56.5 95.2 6.9 11.8

F_s2+F_s3+F_s4 (MFAM) 54.9 92.3 6.6 10.9

Features in MFAM	ST_Part A	ST_Part B
F _s2	79.1	117.1	11.6	21.0
F _s3	60.3	102.8	7.2	12.5
F _s4	56.5	95.2	6.9	11.8
F_s2+F_s3+F_s4 (MFAM)	54.9	92.3	6.6	10.9

It can be concluded that the feature maps obtained in F_s2 lack deep semantic information, which gravely affects the counting accuracy. While satisfactory results can be achieved using F_s3 and F_s4, our designed MFAM can improve the counting accuracy even more. Since the MFAM makes full use of both coarse-grained and fine-grained information, it enables the network to retain some location information at deeper stages, which is essential for crowd counting efforts.

4.4.2 Ablation study of MSRM

To evaluate the performance of our MSRM, we label the four branches as M₁, M₂, M₃, and M₄, respectively. We first tried to remove the branches with dilated convolution (M₁, M₂ and M₃) and directly used a 3 × 3 Conv layer (M₄) to adapt the input feature map to the next layer. This is because we believe that the main role in MSRM may be played by dilated convolution, as it can bring different sizes of the perceptual domain and thus help the network to capture information about different human head sizes. We also confirmed our conjecture in our experiments, the results of which are shown in Table 4.

Table 4
Ablation study of MSRM

Sets in MSRM ST_PartA ST_Part B

MAE MSE MAE MSE

M ₁ 56.6 95.9 7.4 12.6

M ₂ 57.0 97.1 7.2 11.8

M ₃ 56.9 96.3 7.0 12.1

M ₄ 58.8 99.7 7.5 11.8

M₁+M₂+M₃+M₄ 54.9 92.3 6.6 10.9

Sets in MSRM	ST_PartA	ST_Part B
M ₁	56.6	95.9	7.4	12.6
M ₂	57.0	97.1	7.2	11.8
M ₃	56.9	96.3	7.0	12.1
M ₄	58.8	99.7	7.5	11.8
M₁+M₂+M₃+M₄	54.9	92.3	6.6	10.9

It can be seen that the fourth set of MSRM settings on ST_Part A and ST_Part B datasets have some gap with the optimal results of TCHNet, since the M₄ branch alone cannot capture sufficient multi-scale information. When dilated convolution is added, although the counting accuracy is slightly improved, neither of them can achieve optimal results, which is because the fixed dilation rate can only capture the information at a specific scale. By superimposing multiple dilation rates, the multi-scale perception capability of the network can be effectively enhanced while avoiding the gridding effect [43].

In addition, different combinations of M₁, M₂, and M₃ with M₄ also affect the counting accuracy, and we introduce seven different sets of combinations in MSRM to explore the optimal scheme. The experimental results are shown in Table 5.

Table 5

Different combination strategies in MSRM

Sets in MSRM	ST_PartA		ST_Part B
	MAE	MSE	MAE	MSE
M₁+M₄	56.4	97.2	7.3	12.0
M₂+M₄	57.6	98.9	6.9	11.4
M₃+M₄	56.9	97.6	7.1	11.1
M₁+M₂+M₄	56.5	99.1	7.2	12.7
M₂+M₃+M₄	57.0	95.5	6.9	10.9
M₁+M₃+M₄	56.0	94.7	7.2	11.3
M₁+M₂+M₃+M₄	54.9	92.3	6.6	10.9

Based on the comparison results, group 7 achieves the best performance, hence we choose this scheme to extract multi-scale information.

4.4.3 Ablation study of PMLP

To evaluate the performance of our PMLP, we set the structure without PMLP to Baseline, and then conduct experiments on ST_Part A and ST_Part B. In addition, we investigate the effect of λ on the experimental results, which are presented in Table 6.

Table 6
Ablation study of PMLP

Strategy ST_PartA ST_Part B

MAE MSE MAE MSE

Without PMLP 56.9 96.5 7.4 12.9

PMLP (λ=1e-1) 56.2 95.9 7.0 12.4

PMLP (λ=5e-2) 55.8 94.8 6.9 11.8

PMLP (λ=1e-3) 55.7 94.1 7.0 12.3

PMLP (λ=1e-2) 54.9 92.3 6.6 10.9

Strategy	ST_PartA	ST_Part B
Without PMLP	56.9	96.5	7.4	12.9
PMLP (λ=1e-1)	56.2	95.9	7.0	12.4
PMLP (λ=5e-2)	55.8	94.8	6.9	11.8
PMLP (λ=1e-3)	55.7	94.1	7.0	12.3
PMLP (λ=1e-2)	54.9	92.3	6.6	10.9

According to the experimental results, TCHNet achieves better experimental results when PMLP is added, indicating that through PMLP, the network can learn at multiple granularity levels, completely explore the complementary relationships between different granularities, enhance the network’s ability to extract multi-granularity features and improve the counting accuracy. In addition, we note that a weaker advantage can be achieved when λ is 1e-2 compared to 1e-3 and 5e-2, which also indicates that our method is robust to the choice of λ.

5 Conclusion

In this paper, we propose a hybrid Transformer and CNN network to address the problems of spatial local information loss and scale variation in crowd counting. The pipeline consists of five components: a FEM for better solving local spatial information loss, a PMLP for learning multi-granularity information correlation, an MFAM for multi-granularity information fusion, an MSRM for providing multi-scale receptive fields, and an optimal transfer loss function for supervised network training. The loss of spatial local information due to patch-dividing is largely reduced by using the more advanced Conv stem structure of CMT. The problem of head-scale variation is addressed by designing an MSRM to provide the network with a multi-scale receptive field. To further enhance our approach and improve the stability of the network during training, we innovatively introduce the PMLP, which enables the network to learn at the multi-granularity level, focusing on the complementary relationships between different granularity levels. Our method drives state-of-the-art counting with a more significant advantage on five mainstream datasets. We hope that our method can be further ported to additional counting tasks for additional contributions.

Footnotes

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. 62163016, 62066014, 62202165), the open project of State Key Laboratory of Performance Monitoring and Protecting of Rail Transit Infrastructure, East China Jiaotong University (Grant No. HJGZ2023203) and the Natural Science Foundation of Jiangxi Province (No. 20212ACB202001, 20202BABL202018, 20224BAB202016).

References

Aldhaheri

, Alotaibi

, Alzahrani

, Hadi

, Mahmood

, Alhothali

and Barnawi

, Macc net: Multi-task attention crowd counting network, Applied Intelligence 53(8) (2023), 9285–9297.

Liang

, Zhao

, Zhou

, Zhang

, Song

and Shi

, Sc2net: scale-aware crowd counting network with pyramid dilated convolution, Applied Intelligence 53(5) (2023), 5146–5159.

Xie

, Gu

, Li

and Lyu

, Hranet: Hierarchical regionaware network for crowd counting, Applied Intelligence 52(11) (2022), 12191–12205.

Liu

Y.-B.

, Jia

R.-S.

, Liu

Q.-M.

, Zhang

X.-L.

and Sun

H.-M.

, Crowd counting method based on the self-attention residual network, Applied Intelligence 51 (2021), 427–440.

Sharif Razavian

, Azizpour

, Sullivan

and Carlsson

, Cnn features off-the-shelf: an astounding baseline for recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2014, 806–813.

Krizhevsky

, Sutskever

and Hinton

G.E.

, ImageNet classification with deep convolutional neural networks, Communications of the ACM 60(6) (2017), 84–90.

Simonyan

and Zisserman

, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.

Szegedy

, Liu

, Jia

, Sermanet

, Reed

, Anguelov

, Erhan

, Vanhoucke

and Rabinovich

, Going deeper with convolutions, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, 1–9.

Szegedy

, Vanhoucke

, Ioffe

, Shlens

and Wojna

, Rethinking the inception architecture for computer vision, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, 2818–2826.

10.

, Zhang

, Ren

and Sun

, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, 770–778.

11.

Zhang

, Zhou

, Chen

, Gao

and Ma

, Singleimage crowd counting via multi-column convolutional neural network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, 589–597.

12.

Babu Sam

, Surya

and Venkatesh Babu

, Switching convolutional neural network for crowd counting, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, 5744–5752.

13.

Zhang

, Shen

, Xiao

, Zhu

, Zhen

, Cao

and Shao

, Relational attention network for crowd counting, in Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, 6788–6797.

14.

Sindagi

V.A.

and Patel

V.M.

, HA-CCN: Hierarchical attention-based crowd counting network, IEEE Transactions on Image Processing 29 (2019), 323–335.

15.

Liang

, Chen

, Xu

, Zhou

and Bai

, Transcrowd: weakly-supervised crowd counting with transformers, Science China Information Sciences 65(6) (2022), 160104.

16.

Sun

, Liu

, Probst

, Paudel

D.P.

, Popovic

and Van

, Gool, Boosting crowd counting with transformers, arXiv preprint arXiv:2105.10926, 2021.

17.

Wang

, Wu

, Ouyang

, Han

, Chen

, Jiang

Y.-G.

and Li

S.-N.

, M2TR: Multi-modal multi-scale transformers for deepfake detection, in Proceedings of the 2022 International Conference on Multimedia Retrieval 2022, 615–623.

18.

Guo

, Han

, Wu

, Tang

, Chen

, Wang

and Xu

, CMT: Convolutional neural networks meet vision transformers, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, 12175–12185.

19.

Arevalo

, Gonzalez

F.A.

, Ramos-Pollan

, Oliveira

J.L.

and Lopez

M.A.G.

, Representation learning for mammography mass lesion classification with convolutional neural networks, Computer methods and programs in biomedicine 127 (2016), 248–257.

20.

, Zhang

and Chen

, CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, 1091–1100.

21.

Liu

, Salzmann

and Fua

, Context-aware crowd counting, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, 5099–5108.

22.

Tian

, Chu

and Wang

, CCTrans: Simplifying and improving crowd counting with transformer, arXiv preprint arXiv:2109.14483, 2021.

23.

Dosovitskiy

, Beyer

, Kolesnikov

, Weissenborn

, Zhai

, Unterthiner

, Dehghani

, Minderer

, Heigold

, Gelly

, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929, 2020.

24.

Chu

, Tian

, Wang

, Zhang

, Ren

, Wei

, Xia

and Shen

, Twins: Revisiting the design of spatial attention in vision transformers, Advances in Neural Information Processing Systems 34 (2021), 9355–9366.

25.

, Chang

, Bhunia

A.K.

, Xie

, Ma

, Song

Y.-Z.

and Guo

, Fine-grained visual classification via progressive multi-granularity training of jigsaw patches, in Computer Vision–ECCV: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX, pp. 153–168, Springer, 2020.

26.

Wang

, Liu

, Samaras

and Nguyen

M.H.

, Distribution matching for crowd counting, Advances in Neural Information Processing Systems 33 (2020), 1595–1607.

27.

Idrees

, Tayyab

, Athrey

, Zhang

, Al-Maadeed

, Rajpoot

and Shah

, Composition loss for counting, density map estimation and localization in dense crowds, in Proceedings of the European Conference on Computer Vision (ECCV) 2018, 532–546.

28.

Idrees

, Saleemi

, Seibert

and Shah

, Multisource multi-scale counting in extremely dense crowd images, in 2013 IEEE Conference on Computer Vision and Pattern Recognition 2013, 2547–2554.

29.

Wang

, Gao

, Lin

and Li

, NWPU-Crowd: A largescale benchmark for crowd counting and localization, IEEE Transactions on Pattern Analysis and Machine Intelligence 43(6) (2020), 2141–2149.

30.

Deng

, Dong

, Socher

, Li

L.-J.

, Li

and Fei-Fei

, ImageNet: A large-scale hierarchical image database, in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.

31.

Loshchilov

and Hutter

, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101, 2017.

32.

Wang

, Gao

, Lin

and Yuan

, Learning from synthetic data for crowd counting in the wild, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, 8198–8207.

33.

, Wei

, Hong

and Gong

, Bayesian loss for crowd count estimation with point supervision, in Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), 6142–6151.

34.

Yang

, Li

, Wu

, Su

, Huang

and Sebe

, Reverse perspective network for perspective-aware object counting, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 4374–4383.

35.

Liu

, Yang

, Ding

, Wang

and Xiong

, Adaptive mixture regression network with local counting map for crowd counting, in Computer Vision–ECCV: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pp. 241–257, Springer, 2020.

36.

Liu

, Lu

, Zou

, Xiong

, Cao

and Shen

, Weighing counts: Sequential crowd counting by reinforcement learning, in Computer Vision–ECCV: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pp. 164–181, Springer, 2020.

37.

Meng

, Zhang

, Zhao

, Yang

, Qian

, Huang

and Zheng

, Spatial uncertainty-aware semi-supervised crowd counting, in Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, 15549–15559.

38.

Abousamra

, Hoai

, Samaras

and Chen

, Localization in the crowd with topological constraints, in Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021), 872–881.

39.

Wan

, Liu

and Chan

A.B.

, A generalized loss function for crowd counting and localization, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, 1974–1983.

40.

Lei

, Liu

, Zhang

and Liu

, Towards using countlevel weak supervision for crowd counting, Pattern Recognition 109 (2021), 107616.

41.

Liang

, Xu

, Zhu

and Zhou

, Focal inverse distance transform maps for crowd localization, IEEE Transactions on Multimedia, 2022.

42.

Song

, Wang

, Jiang

, Wang

, Tai

, Wang

, Li

, Huang

and Wu

, Rethinking counting and localization in crowds: A purely point-based framework, in Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, 3365–3374.

43.

Fang

, Li

, Tu

, Tan

and Wang

, Face completion with hybrid dilated convolution, Signal Processing: Image Communication 80 (2020), 115664.

Transformer-CNN hybrid network for crowd counting

Abstract

Keywords

1 Introduction

2.1 CNN-based crowd counting

2.2 Transformer-based crowd counting

2.3 Progressive multi-granularity learning

3 Proposed approaches

4.1 Traning details

4.2 Evaluation metrics

4.3.2 ST_Part B

4.3.3 UCF-QNRF

4.3.4 UCF_CC_50

4.3.5 NWPU-Crowd

4.4 Ablation study

4.4.1 Ablation study of MFAM

Table 3 Ablation study of MFAM Features in MFAM ST_Part A ST_Part B MAE MSE MAE MSE F s2 79.1 117.1 11.6 21.0 F s3 60.3 102.8 7.2 12.5 F s4 56.5 95.2 6.9 11.8 Fs2+Fs3+Fs4 (MFAM) 54.9 92.3 6.6 10.9

Table 4 Ablation study of MSRM Sets in MSRM ST_PartA ST_Part B MAE MSE MAE MSE M 1 56.6 95.9 7.4 12.6 M 2 57.0 97.1 7.2 11.8 M 3 56.9 96.3 7.0 12.1 M 4 58.8 99.7 7.5 11.8 M1+M2+M3+M4 54.9 92.3 6.6 10.9

Table 6 Ablation study of PMLP Strategy ST_PartA ST_Part B MAE MSE MAE MSE Without PMLP 56.9 96.5 7.4 12.9 PMLP (λ=1e-1) 56.2 95.9 7.0 12.4 PMLP (λ=5e-2) 55.8 94.8 6.9 11.8 PMLP (λ=1e-3) 55.7 94.1 7.0 12.3 PMLP (λ=1e-2) 54.9 92.3 6.6 10.9

Footnotes

Acknowledgment

References

Table 3
Ablation study of MFAM

Features in MFAM ST_Part A ST_Part B

MAE MSE MAE MSE

F _s2 79.1 117.1 11.6 21.0

F _s3 60.3 102.8 7.2 12.5

F _s4 56.5 95.2 6.9 11.8

F_s2+F_s3+F_s4 (MFAM) 54.9 92.3 6.6 10.9

Table 4
Ablation study of MSRM

Sets in MSRM ST_PartA ST_Part B

MAE MSE MAE MSE

M ₁ 56.6 95.9 7.4 12.6

M ₂ 57.0 97.1 7.2 11.8

M ₃ 56.9 96.3 7.0 12.1

M ₄ 58.8 99.7 7.5 11.8

M₁+M₂+M₃+M₄ 54.9 92.3 6.6 10.9

Table 6
Ablation study of PMLP

Strategy ST_PartA ST_Part B

MAE MSE MAE MSE

Without PMLP 56.9 96.5 7.4 12.9

PMLP (λ=1e-1) 56.2 95.9 7.0 12.4

PMLP (λ=5e-2) 55.8 94.8 6.9 11.8

PMLP (λ=1e-3) 55.7 94.1 7.0 12.3

PMLP (λ=1e-2) 54.9 92.3 6.6 10.9