MfvPose: A multi-scale hybrid framework for human pose estimation

Abstract

Human pose estimation is a challenging visual task that relies on spatial location information. To improve its performance, it is important to accurately capture the constraint relationships among keypoints. To address this, we propose MfvPose, a novel hybrid model that leverages rich multi-scale information. The proposed model incorporates the HRFOV module, which uses cascaded atrous convolution to maintain the high-resolution representations of the backbone extractor and enrich the multi-scale information. In addition, we introduce learnable scalar weights to the Transformer encoder: the output of each residual block is multiplied by a diagonal matrix of learnable scalar weights, which improves the dynamics of model training and enhances the accuracy of human pose estimation. Experiments show that the proposed MfvPose achieves promising results on the COCO and MPII benchmarks.

Keywords

Receptive field multi-head self-attention atrous convolution human pose estimation

1 Introduction

The goal of the 2D human pose estimation [1–4] task is to locate the position of human anatomical keypoints. It is a fundamental task in the field of computer vision and plays an important role in vision tasks such as motion recognition, human-computer interaction, and 3D human pose estimation [5, 6]. Currently, the existing solutions can be divided into heatmap-based methods [7, 8] and regression-based methods [1]. The heatmap-based approach locates keypoints by identifying the maximum response positions. Compared with regression-based methods, it can obtain better results by virtue of its spatial generalization capability.

In recent years, with the rapid development of deep learning, convolutional neural networks have achieved excellent results in human pose estimation tasks through their powerful visual characterization capabilities. To obtain a high-resolution representation, most existing methods either recover high-resolution representations from low-resolution representations [2, 9–11] or maintain a high-resolution representation throughout the process. HRNet [4] achieves high accuracy using parallel multi-resolution subnetworks, but it maintains high resolution throughout, which leads to high computational cost. Therefore, we introduce an atrous convolution module and appropriately reduce the network depth, which increases the model’s receptive field and yields a multi-scale information fusion framework.

Recently, the Vision Transformer [12, 13] has demonstrated good performance in classification tasks. Its ability to model global dependencies is more powerful than that of convolutional neural networks. As a result, it performs well on many vision tasks, such as object detection, semantic segmentation, and pose estimation. However, the design of the Vision Transformer lacks the ability to utilize spatial information from visual signals, and therefore still needs position embeddings to compensate for the loss of spatial information. To address this issue, we use a hybrid architecture and add learnable scalar weights [14] to the output of each residual block in the Transformer encoder. This further helps the model converge and improves the accuracy of the feature information.

We refer to the design of the atrous convolution module and the attention mechanism module to propose a new multi-scale network architecture for human pose estimation, named MfvPose. Examples of pose estimation obtained with the MfvPose framework are shown in Fig. 1.

Fig. 1

Examples of pose estimation obtained using the MfvPose framework.

In general, our work contributions are summarized as follows:

We propose the novel MfvPose framework, a hybrid model with rich multi-scale information for human pose estimation.

We adopt the HRFOV module to improve the receptive field of the backbone and enrich the multi-scale information.

We add learnable scalar weights to the Transformer encoder, which helps to make the feature information more precise.

The remainder of this paper is structured as follows: We discuss related work in Section 2, before providing a detailed description of the MfvPose framework in Section 3. In Section 4, we introduce the two datasets used in the experiments, followed by a demonstration of the effectiveness of our approach in Section 5. Finally, Section 6 concludes our work.

2 Related work

Early human pose estimation tasks were performed by graphical structural models [15], which did not work well. As research continued, it was argued that some complex gestures, such as walking, running and jumping, could be estimated simply by locating the spatial positions of major joint points [16]. This human keypoint-based detection method sets the basic course for current human pose estimation tasks.

In recent years, deep learning methods based on convolutional neural networks [1, 18] have achieved better results than earlier work in human pose estimation. CPM [3] found the maximum response position in the heatmap by learning image features and spatial information. Hourglass [2] proposed a stacked hourglass module to learn multi-scale features that improve the quality of the heatmap. SimpleBaseline [11] designed a simple architecture based on ResNet [19] that stacks transposed convolutional layers and achieves good results. HRNet [4] used parallel multi-resolution subnetworks to maintain high-resolution feature information. Therefore, how to enhance high-resolution features is important for 2D human pose estimation.

Atrous Convolution. Atrous convolution has a wide range of applications in tasks such as semantic segmentation and object detection. It increases the receptive field and reduces the computational effort without reducing the resolution. For a one-dimensional signal, atrous convolution is defined as: $y [i] = \sum_{l = 1}^{L} x [i + r \cdot l] \cdot w [l]$ (1) where y [i] is the output signal, x [i] is the input signal, w [l] represents the filter of length L, and r corresponds to the dilation rate. When r = 1, atrous convolution reduces to standard convolution. By changing the rate value, atrous convolution enables us to adaptively adjust the field-of-view of the filter.
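To make Eq. (1) concrete, the following minimal sketch evaluates a one-dimensional atrous convolution in plain Python. It assumes zero-based filter indices and no padding (only positions where every tap falls inside the signal are computed); the function name `atrous_conv1d` and the toy signal are illustrative and not part of the original work.

```python
def atrous_conv1d(x, w, rate=1):
    """Compute y[i] = sum_l x[i + rate * l] * w[l] at valid positions only."""
    span = rate * (len(w) - 1)      # distance between the first and last filter tap
    out_len = len(x) - span         # number of positions where all taps fit
    if out_len <= 0:
        return []
    return [
        sum(x[i + rate * l] * w[l] for l in range(len(w)))
        for i in range(out_len)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

print(atrous_conv1d(signal, kernel, rate=1))  # [-2, -2, -2, -2, -2]
print(atrous_conv1d(signal, kernel, rate=2))  # [-4, -4, -4]
```

With the same three filter weights, rate 2 compares samples four positions apart instead of two, which illustrates how the dilation rate widens the field-of-view without adding parameters.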

A large receptive field is conducive to the detection of large targets, while high resolution enables accurate target localization. However, blindly stacking atrous convolutions with the same dilation rate can create the gridding problem, which causes a lack of correlation between long-distance information and results in local information loss. HDC [20] set different dilation rates to gather information from a wider range of pixels and avoid the gridding problem. In DeepLab [21], atrous convolution was used to increase the size of the receptive field in the network and avoid downsampling. The Atrous Spatial Pyramid Pooling (ASPP) [21] approach assembled atrous convolutions in four parallel branches with different rates and achieved good results in semantic segmentation tasks. Similarly, the increased resolution and receptive field provided by the ASPP module can facilitate the contextual detection of body parts.

Vision Transformer. Transformer-based architectures have achieved great success in the vision field by utilizing a multi-head self-attention (MSA) module. Recently, various techniques have been proposed to improve their performance. Swin Transformer [22] incorporated a distinctive multi-scale architecture into the traditional Transformer, enabling the network to effectively combine features across different scales. XCiT [23] proposed cross-covariance attention (XCA) to reduce time complexity, which allowed it to efficiently process long sequences as well as high-resolution images; it also added a local patch interaction module to increase the information interaction among local patches. To reduce latency, MobileViT [24] combined the advantages of CNNs and ViT to achieve low latency while preserving accuracy as much as possible. It has been shown that human pose estimation is a visual task that is sensitive to spatial location information, and it is difficult for convolutional neural networks to capture the constraint relationships among keypoints.

To address the above issues, we propose a hybrid architecture design. Based on the ASPP approach, the HRFOV module enhances the feature maps of both the ResNet block and HRModule outputs, thereby increasing the receptive field. Additionally, we add a self-attention module to learn the spatial relationship among keypoints, which improves the model’s sensitivity to spatial information.

3 Method

The proposed MfvPose framework, illustrated in Fig. 2, is a hybrid framework that leverages rich multi-scale information for human pose estimation. The MfvPose framework combines multi-scale methods [4], Transformer modules [12], and our proposed HRFOV module to further improve the feature representation of human keypoints.

Fig. 2

Schematic diagram of the proposed MfvPose. The feature maps extracted by the CNN backbone are fed into the HRFOV module for processing and then uniformly split into patches. After each patch is flattened into a one-dimensional vector, it is mapped into a 1D embedding by a linear projection function. Subsequently, the Transformer encoder learns the spatial relationships among keypoints.

The processing flow of the MfvPose architecture is shown in Fig. 2. The structure uses HRNet [4] as the backbone network for feature extraction. The input image is first passed through the backbone. Afterwards, the high-resolution feature maps and low-level features are simultaneously processed by the HRFOV module. The integrated HRFOV module further increases the receptive field of the backbone while maintaining high resolution. Finally, the stacked Transformer encoders are used to learn the constraint relationships among keypoints. In the following, we describe the application of the HRFOV and Transformer modules in detail.

3.1 HRFOV module

High-resolution convolution algorithms have obtained remarkable results in pose estimation. It has been shown that increasing the receptive field and fusing multi-scale information are beneficial for human pose estimation. In this paper, to enhance the ability of the backbone network to extract multi-scale information, we combine the Atrous Spatial Pyramid Pooling (ASPP) module [21] with high-resolution convolutional networks. We call the resulting module the HRFOV module; it is shown in Fig. 3.

Fig. 3

A schematic diagram of HRFOV is presented, which is designed to enhance the feature maps of both the ResNet Block and HRModule outputs, thereby increasing the receptive field.

The proposed method extends feature extraction with a multi-layer architecture in the HRFOV module. This module performs consistent high-resolution processing of feature maps in each branch to improve the model’s ability to obtain multi-scale information. By using HRNet-stage3 as a feature extractor instead of implementing the entire HRNet architecture, we significantly reduce the network parameters. The HRFOV module uniformly processes the feature maps output by the ResNet block and HRModule to increase their receptive field. In detail, using different dilation rates in the atrous convolutions allows the model to acquire multi-scale feature information without requiring a pooling operation. The four branches in the HRFOV module have different receptive fields, and the dilation rates in the branches are arranged in increasing order. The HRFOV module makes full use of the ASPP structure based on atrous convolution to maintain a large receptive field. Additionally, the module combines the advantages of ASPP and the WASP [25, 26] module to achieve promising results.

The output F_HRFOV of the HRFOV module is described as follows: $X^{out} = K_{1} * [\sum_{i = 1}^{4} (K_{r_{i}} * X_{i - 1}^{in}) + AP (X_{0}^{in})]$ (2) $F_{HRFOV} = K_{1} * (X_{H}^{out} + X_{R}^{out})$ (3) where * denotes the convolution operation, $X_{0}^{in}$ is the input feature map, $X_{i}^{in}$ is the feature map resulting from the i-th atrous convolution, AP is the average pooling operation, $K_{1}$ and $K_{r_{i}}$ denote convolutions of kernel size 1 × 1 and 3 × 3, the latter with dilation rates r_i ∈ {1, 6, 12, 18}, and $X_{H}^{out}$ and $X_{R}^{out}$ are the feature maps output by the HRModule and the ResNet Block, respectively, after processing by the atrous convolution module. It is worth noting that the output of the ResNet Block undergoes dimension reduction before being input to the atrous convolution module. The processed feature maps are concatenated along the channel dimension, and finally the number of channels is adjusted by a 1 × 1 convolution to obtain a richer multi-scale representation.
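As a rough illustration of why increasing dilation rates enlarge the field-of-view, the sketch below computes two quantities for 3 × 3 kernels with rates 1, 6, 12 and 18: the effective size of each dilated kernel on its own, and the accumulated receptive field when the four convolutions are applied one after another. It assumes stride 1 and reads Eq. (2) as a cascade in which the i-th atrous convolution consumes the output of the previous one; the helper names are hypothetical.

```python
def effective_kernel_size(kernel_size, rate):
    """Side length covered by a single dilated kernel: k + (k - 1) * (rate - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def cascaded_receptive_fields(kernel_size, rates):
    """Receptive field after each layer of a stride-1 cascade of dilated convolutions."""
    field = 1
    fields = []
    for rate in rates:
        field += (kernel_size - 1) * rate   # each layer adds (k - 1) * rate pixels
        fields.append(field)
    return fields


rates = [1, 6, 12, 18]
print([effective_kernel_size(3, r) for r in rates])  # [3, 13, 25, 37]
print(cascaded_receptive_fields(3, rates))           # [3, 15, 39, 75]
```

Under these assumptions, the last branch of the cascade sees a 75 × 75 window of its input while every convolution still has only nine weights per channel pair.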

3.2 Transformer encoder with learnable scalar weights

CNN-based methods have inherent inductive biases, such as translation equivariance and locality, which allow CNN-based methods to have good generalization performance. However, the CNN-based approach lacks the ability to model global dependencies, which makes it difficult to capture the constraint relationships among keypoints. We are inspired by Vision Transformer [12] and TokenPose [27] to apply a multi-head self-attention module to capture the spatial relationships among keypoints.

To process 2D images, the Transformer takes a sequence of vector embeddings as input. A feature map $x \in R^{H \times W \times C}$ is divided into a grid of P_n patches, where $P_{n} = \frac{H}{P_{h}} \times \frac{W}{P_{w}}$, C is the number of channels output by the HRFOV module, and the size of each patch is P_h × P_w. Each patch P_i is then flattened into a 1D vector of size P_h × P_w × C and mapped into a d-dimensional embedding by a linear projection function. The spatial location information E_i is embedded into each vector, since spatial location information is beneficial to the localization of keypoints.
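The patch-splitting step can be sketched in plain Python as follows. The toy sizes (H = 8, W = 6, C = 2, patches of 4 × 3) are chosen only for illustration and are not the settings used in the experiments; `split_into_patches` is a hypothetical helper, and the linear projection and position embedding are omitted.

```python
def split_into_patches(feature_map, patch_h, patch_w):
    """Split an H x W x C nested list into flattened patches of length patch_h * patch_w * C."""
    height, width = len(feature_map), len(feature_map[0])
    if height % patch_h or width % patch_w:
        raise ValueError("patch size must divide the feature map size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            flat = []
            for row in range(top, top + patch_h):
                for col in range(left, left + patch_w):
                    flat.extend(feature_map[row][col])   # append the C channel values
            patches.append(flat)
    return patches


H, W, C = 8, 6, 2
fmap = [[[float(r * W + c)] * C for c in range(W)] for r in range(H)]
patches = split_into_patches(fmap, patch_h=4, patch_w=3)
print(len(patches), len(patches[0]))  # 4 24
```

Here P_n = (8 / 4) × (6 / 3) = 4 patches, and each flattened vector has 4 × 3 × 2 = 24 entries, matching the expressions in the text.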

Following TokenPose, we add K learnable d-dimensional vectors to the Transformer encoder to represent K human keypoints. The Transformer encoder takes the vectors of feature maps and keypoints as input to learn the constraint relationships among keypoints. As in the Vision Transformer, each Transformer encoder consists of a multi-head self-attention module (MSA) and a feed-forward network (FFN). In addition, LayerNorm (LN) is applied before each module. The formulation of multi-head self-attention is given as: $MSA (T) = Concat [SA {(T)}_{1}, . . ., SA {(T)}_{h}] W_{o}$ (4) $SA {(T)}_{i} = Softmax [\frac{({TW}_{q}^{i}) ({TW}_{k}^{i})^{\top}}{\sqrt{d_{h}}}] ({TW}_{v}^{i})$ (5) where $W_{q}^{i} \in R^{d_{h} \times d}, W_{k}^{i} \in R^{d_{h} \times d}, W_{v}^{i} \in R^{d_{h} \times d}$ and $W_{o} \in R^{d \times d}$ are the learnable parameters of linear projection layers, d is the dimension of the vectors, d_h is set to $\frac{d}{h}$, and h represents the number of heads.
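For readers who prefer code to notation, a single attention head can be sketched in plain Python as scaled dot-product attention. This sketch assumes that tokens are stored as rows of T and that each projection matrix has shape d × d_h so that the products are defined; these layout choices and all helper names are assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head attention: softmax(Q K^T / sqrt(d_h)) V."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    d_h = len(w_q[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_h) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(tokens, identity, identity, identity)
print(len(output), len(output[0]))  # 3 2
```

Each row of `weights` sums to one and describes how strongly one token attends to every other token, which is the mechanism the encoder uses to relate keypoint tokens to image patches.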

In the Transformer encoder, the handling of the keypoint vectors is important. Therefore, we add learnable scalar weights to the output of each residual block, as shown in Fig. 4. In other words, the output of each layer of the Transformer encoder can be expressed as: $T_{l}^{'} = T_{l} + diag (\lambda_{l, 1}, . . ., \lambda_{l, d}) \times MSA (LN (T_{l}))$ (6) $T_{l + 1} = T_{l}^{'} + diag (\lambda_{l, 1}^{'}, . . ., \lambda_{l, d}^{'}) \times FFN (LN (T_{l}^{'}))$ (7) where the parameters $\lambda_{l, i}$ and $\lambda_{l, i}^{'}$ are learnable weights, T_l is the output of the l-th layer, and $T_{l}^{'}$ is the intermediate result after the MSA block. These learnable per-channel weights rescale the outputs of the MSA and FFN modules before they are added back to the residual path, which improves the dynamics of model training and makes the resulting features more precise.
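The effect of the diagonal scaling can be seen in a small plain-Python sketch of one encoder layer. The MSA, FFN and LayerNorm callables passed in below are trivial stand-ins chosen so the arithmetic is easy to follow; they are assumptions for illustration, not the modules used in MfvPose.

```python
def layer_scale_residual(tokens, branch, scale):
    """Return tokens + diag(scale) * branch(tokens), applied channel by channel."""
    update = branch(tokens)
    return [
        [t + s * u for t, s, u in zip(token_row, scale, update_row)]
        for token_row, update_row in zip(tokens, update)
    ]


def encoder_layer(tokens, msa, ffn, layer_norm, scale_msa, scale_ffn):
    """One pre-norm encoder layer with per-channel scaling on both residual branches."""
    mid = layer_scale_residual(tokens, lambda t: msa(layer_norm(t)), scale_msa)
    return layer_scale_residual(mid, lambda t: ffn(layer_norm(t)), scale_ffn)


# Stand-in modules (hypothetical): identity norm, doubling "attention", constant "FFN".
identity_norm = lambda t: t
toy_msa = lambda t: [[2.0 * v for v in row] for row in t]
toy_ffn = lambda t: [[1.0 for _ in row] for row in t]

out = encoder_layer([[1.0, 2.0]], toy_msa, toy_ffn, identity_norm,
                    scale_msa=[0.5, 0.25], scale_ffn=[0.5, 0.0])
print(out)  # [[2.5, 3.0]]
```

Setting a scale entry to zero, as for the second channel of `scale_ffn`, switches that channel of the branch off entirely, so the layer can learn how much each channel of each branch should contribute.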

The model outputs K heatmaps, {H₁, H₂, . . . , H_K}, where H_i is the heatmap of the i-th human keypoint with size H′ × W′. To obtain them, the K d-dimensional vectors produced by the Transformer encoder are mapped by a linear projection into (H′ · W′)-dimensional vectors. The resulting 1D vectors are then reshaped into 2D heatmaps to obtain the final K heatmaps.
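The reshaping step, together with the maximum-response decoding used by heatmap-based methods, can be sketched as follows. The row-major layout of the projected vector is an assumption, and the helper names and toy values are illustrative only.

```python
def vector_to_heatmap(vector, height, width):
    """Reshape a 1D vector of length height * width into a row-major 2D heatmap."""
    if len(vector) != height * width:
        raise ValueError("vector length must equal height * width")
    return [vector[r * width:(r + 1) * width] for r in range(height)]


def peak_location(heatmap):
    """Return (row, col) of the first maximum response in the heatmap."""
    best_value, best_pos = None, None
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos


projected = [0.1, 0.2, 0.0, 0.9, 0.3, 0.1]   # toy output of the linear projection
heatmap = vector_to_heatmap(projected, height=2, width=3)
print(heatmap)                 # [[0.1, 0.2, 0.0], [0.9, 0.3, 0.1]]
print(peak_location(heatmap))  # (1, 0)
```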

Fig. 4

Transformer encoder with learnable scalar weights. We introduce a per-channel weighting method to improve the dynamics of model training.

4 Datasets

Experiments are conducted on two datasets, which are the COCO [34] and MPII [35] datasets.

The COCO dataset, provided by Microsoft, is widely used for tasks such as object detection, human keypoints detection, and semantic segmentation. The COCO dataset is divided into three parts: training, validation, and testing, with a total of over 200,000 images and 250,000 person instances. Each image is annotated with information on the object class, bounding box location and visibility, and for the person instance, 17 keypoints are annotated. The COCO dataset is known for its complexity. There are many challenges such as object occlusion, complex backgrounds, and crowded environments among a large number of human instances in this dataset. These factors make it difficult to accurately extract human keypoints.

The MPII dataset is a widely-used resource for assessing the accuracy of human pose estimation. The dataset comprises approximately 25,000 images and over 40,000 instances of human body annotations, organized by human activity category. Each image is labeled with its corresponding activity, encompassing more than 400 human activities. Notably, these images are sourced from YouTube videos and provide a highly realistic depiction of everyday scenarios.

Table 2
The performance of the MfvPose framework is compared with other approaches on the COCO test-dev set

Method	Input size	Params (M)	GFLOPs	AP	AP⁵⁰	AP⁷⁵	AP^M	AP^L	AR
Mask-RCNN [29]	-	-	-	63.1	87.3	68.7	57.8	71.4	-
G-RMI [30]	353 × 257	42.6	57.0	64.9	85.5	71.3	62.3	70.0	69.7
Integral Pose Regression [31]	256 × 256	45.0	11.0	67.8	88.2	74.8	63.9	74.0	-
CPN [9]	384 × 288	-	-	72.1	91.4	80.0	68.7	77.2	78.5
RMPE [32]	320 × 256	28.1	26.7	72.3	89.2	79.1	68.0	78.6	-
CFN [33]	-	-	-	72.6	86.1	69.7	78.3	64.1	-
SimpleBaseline-Res152 [11]	384 × 288	68.6	35.6	73.7	91.9	81.1	70.3	80.0	79.0
HRNet-W32 [4]	256 × 192	28.5	7.1	73.5	92.2	81.9	70.2	79.2	79.0
HRNet-W48 [4]	256 × 192	63.6	14.6	74.2	92.4	82.4	70.9	79.7	79.5
TokenPose-B [27]	256 × 192	13.5	5.7	73.9	91.4	81.4	70.5	79.9	79.1
TokenPose-L/D24 [27]	256 × 192	27.5	11.0	74.9	91.9	82.4	71.5	80.9	80.0
MfvPose-D24	256 × 192	27.6	11.3	75.0	92.2	82.4	71.6	80.7	80.0

5 Experiments

The experiments are conducted on the Ubuntu 20.04 operating system, using the PyTorch deep learning framework with Python as the development language. The CPU is an Intel E5-2680v3 at 2.50 GHz, and the GPU is an NVIDIA TITAN X (Pascal). We follow previous studies [25, 36] and set different dilation rates to avoid the gridding problem. Unlike those studies, we also process the low-level features through the HRFOV module, which results in better prediction performance. For model training, we use the Adam optimizer. We use the original settings of HRNet [4] and SimpleBaseline [11], and adopt a top-down approach for human pose estimation [4, 11]: we first detect human instances using a dedicated detector, and then use our model to generate heatmaps for the keypoints of each instance. The learning rate follows a step schedule: it starts at 1e-3 and is decreased at the 200th and 260th epochs.
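A step schedule of this kind can be written in a few lines. The text gives the initial learning rate and the two milestone epochs but not the decay factor, so the factor of 0.1 below is an assumption, as is the convention that the drop takes effect from the milestone epoch onward.

```python
def step_learning_rate(epoch, base_lr=1e-3, milestones=(200, 260), gamma=0.1):
    """Learning rate at a given epoch; gamma = 0.1 is an assumed decay factor."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


for epoch in (0, 199, 200, 259, 260):
    print(epoch, step_learning_rate(epoch))
```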

5.1 COCO human pose estimation

In the COCO dataset, we evaluate MfvPose based on Object Keypoint Similarity (OKS).

$OKS = \frac{\sum_{i} e^{- d_{i}^{2} / 2 s^{2} k_{i}^{2}} \delta (v_{i} > 0)}{\sum_{i} \delta (v_{i} > 0)}$ (8) where d_i is the Euclidean distance between the detected keypoint and the corresponding ground truth, v_i indicates the visibility of this keypoint, s is the scale of the corresponding object, δ(·) is the indicator function, which equals 1 when its condition holds and 0 otherwise, and k_i is the per-keypoint constant that controls the falloff and reflects the standard deviation of the annotation process for the i-th keypoint. The evaluation metrics are Average Precision (AP) and Average Recall (AR).
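As an illustrative sketch, OKS for one person instance can be computed directly from its definition: each labelled keypoint contributes a Gaussian-shaped similarity based on its squared distance to the ground truth, and the result is averaged over labelled keypoints only. The function name, argument layout and toy values below are assumptions for illustration; an official evaluation would use the COCO toolkit and its per-keypoint constants.

```python
import math


def object_keypoint_similarity(pred, gt, visibility, scale, k):
    """Average of exp(-d_i^2 / (2 s^2 k_i^2)) over keypoints with visibility v_i > 0."""
    total, labelled = 0.0, 0
    for (px, py), (gx, gy), v, k_i in zip(pred, gt, visibility, k):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2 * scale ** 2 * k_i ** 2))
            labelled += 1
    return total / labelled if labelled else 0.0


gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred = [(10.0, 10.0), (21.0, 21.0), (0.0, 0.0)]
visibility = [2, 1, 0]            # the third keypoint is unlabelled and ignored
print(object_keypoint_similarity(pred, gt, visibility, scale=1.0, k=[1.0, 1.0, 1.0]))
```

In this toy case the first keypoint is exact (similarity 1), the second is off by a squared distance of 2 (similarity e⁻¹), and the third is skipped, so the result is (1 + e⁻¹) / 2.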

MfvPose is compared with state-of-the-art methods on the validation set of the COCO dataset. As shown in Table 1, the proposed MfvPose obtains competitive performance compared to previous methods. D8 and D24 denote Transformer encoders with 8 and 24 stacked layers, respectively. At the same input resolution, MfvPose-D24 with the HRFOV module reaches an AP of 75.7. Compared to the original HRNet-W32 and HRNet-W48, this is an improvement of 1.3 and 0.6 AP points, respectively, with fewer parameters than both and fewer GFLOPs than HRNet-W48. Compared to SimpleBaseline with ResNet-50, ResNet-101, and ResNet-152 as the backbone, MfvPose-D24 achieves improvements of 5.3, 4.3, and 3.7 AP points, respectively. In this study, we introduce learnable scalar weights to the Transformer encoder, enhancing the model’s ability to capture spatial information. We follow the TokenPose [27] configuration and stack the same number of Transformer encoder layers. The proposed method achieves a 0.2-point improvement over TokenPose-B and a 0.2-point improvement over TokenPose-L/D24, while the increase in parameters and GFLOPs is insignificant. Figure 5 shows visualization results on the COCO set obtained using MfvPose.

Fig. 5

The visualization results of COCO set obtained using the MfvPose framework.

Table 1

The performance of the MfvPose framework is compared with other approaches on the COCO validation set

Method	Input size	Params (M)	GFLOPs	AP	AP⁵⁰	AP⁷⁵	AP^M	AP^L	AR
EvoPose2D [28]	256 × 192	2.5	1.0	70.2	88.9	77.8	66.5	76.8	76.9
SimpleBaseline-Res50 [11]	256 × 192	34.0	8.9	70.4	88.6	78.3	67.1	77.2	76.3
SimpleBaseline-Res101 [11]	256 × 192	53.0	12.4	71.4	89.3	79.3	68.1	78.1	77.1
SimpleBaseline-Res152 [11]	256 × 192	68.6	15.7	72.0	89.3	79.8	68.7	78.9	77.8
HRNet-W32 [4]	256 × 192	28.5	7.1	74.4	90.5	81.9	70.8	81.0	79.8
HRNet-W48 [4]	256 × 192	63.6	14.6	75.1	90.6	82.2	71.5	81.8	80.4
TokenPose-B [27]	256 × 192	13.5	5.7	74.4	89.6	81.0	70.7	81.6	79.7
TokenPose-L/D24 [27]	256 × 192	27.5	11.0	75.5	89.9	82.1	71.9	82.3	80.5
LGPose-E4 [37]	256 × 192	16.9	8.9	75.6	90.3	82.2	71.9	82.4	80.4
MfvPose-D8	256 × 192	13.5	5.8	74.6	89.9	81.2	71.1	81.2	79.8
MfvPose-D24	256 × 192	27.6	11.3	75.7	90.0	82.2	71.9	82.2	80.6

5.2 MPII human pose estimation

The MPII dataset contains images sourced from YouTube videos, offering a realistic portrayal of daily life scenes. On the MPII dataset, we evaluate MfvPose based on the Percentage of Correct Keypoints (PCK) [35]. This metric counts a keypoint prediction as correct if the detected joint is within a specified threshold distance of the ground truth. The MPII dataset commonly uses PCKh@0.5, where the threshold is 50% of the head segment length. The training strategy and data augmentation methods are similar to those used for the COCO dataset, except that the input images are cropped to 256 × 256 to enable a fair comparison with other techniques.
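A minimal sketch of this metric for a single person is given below: a joint counts as correct when its distance to the ground truth is at most the threshold fraction of a head-based reference length. The function name, the way the reference length is passed in, and the toy coordinates are assumptions for illustration rather than the official MPII evaluation code.

```python
import math


def pckh(pred, gt, head_size, threshold=0.5):
    """Fraction of joints whose error is within threshold * head_size."""
    limit = threshold * head_size
    correct = sum(
        1 for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= limit
    )
    return correct / len(gt)


gt = [(0.0, 3.0), (10.0, 6.0)]
pred = [(0.0, 0.0), (10.0, 0.0)]
print(pckh(pred, gt, head_size=10.0))  # 0.5
```

With a reference length of 10, the limit is 5 pixels: the first joint (error 3) is counted as correct and the second (error 6) is not.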

The PCKh@0.5 results are shown in Table 3, and we follow the testing configuration of TokenPose. MfvPose reaches the highest mean PCKh@0.5 in Table 3 (90.3) with 21.5M parameters, which is fewer than TokenPose-L/D12 and TokenPose-L/D24 and comparable to TokenPose-L/D6. Figure 6 shows visualization results on the MPII set obtained using MfvPose.

Fig. 6

The visualization results of MPII set obtained using the MfvPose framework.

Table 3

The results of MfvPose on the MPII validation set (PCKh@0.5)

Model	Hea	Sho	Elb	Wri	Hip	Kne	Ank	Mean	Params(M)
SimpleBaseline-Res50 [11]	96.4	95.3	89.0	83.2	88.4	84.0	79.6	88.5	34.0
SimpleBaseline-Res101 [11]	96.9	95.9	89.5	84.4	88.4	84.5	80.7	89.1	53.0
SimpleBaseline-Res152 [11]	97.0	95.9	90.0	85.0	89.2	85.3	81.3	89.6	68.6
LGPose-E4 [37]	96.7	95.8	89.6	84.5	88.8	84.9	81.8	89.3	4.2
HRNet-W32 [4]	96.9	96.0	90.6	85.8	88.7	86.6	82.6	90.1	28.5
TokenPose-L/D6 [27]	97.2	95.8	90.4	85.8	89.3	86.2	81.6	89.9	21.4
TokenPose-L/D12 [27]	97.2	95.8	90.7	85.9	89.2	86.2	82.3	90.1	23.5
TokenPose-L/D24 [27]	97.1	95.9	90.4	86.0	89.3	87.1	82.5	90.2	28.1
ours	97.0	96.0	91.0	86.2	89.4	86.8	82.6	90.3	21.5

5.3 Ablation study

We conduct a series of ablation studies on the MPII dataset to analyze the improvements contributed by each part of our approach. Table 4 shows the results of TokenPose, followed by the results obtained by combining the backbone with the proposed HRFOV module. LSW denotes the learnable scalar weights, and we use TokenPose-L/D6 as the baseline model. The performance of the proposed method improves step by step as the modules are added, resulting in a gain of 0.41 percentage points over TokenPose-L/D6. Moreover, our proposed method has a wider receptive field, which enables it to extract multi-scale features more effectively and to predict keypoints more accurately, even in complex backgrounds. MfvPose achieves this improvement with only a negligible increase in parameters and GFLOPs.

Table 4
Ablations of various architectural design choices on the MPII set

Method	HRFOV	LSW	PCKh@0.5
TokenPose-L/D6	×	×	89.94
Ours	×	√	90.01
Ours	√	×	90.13
Ours	√	√	90.35

6 Conclusion

In this work, we present a new MfvPose architecture for human pose estimation. Specifically, our HRFOV module processes both the feature maps from the lower layers and the backbone. The HRFOV module stacks atrous convolutions with different dilation rates to improve the multi-scale information extraction capability of the backbone network. The cascaded atrous convolution structure increases the receptive field of the model, which helps to train a superior pose estimator. In addition, human pose estimation is sensitive to spatial information. Therefore, MfvPose includes an attention mechanism module to learn the spatial relationships among keypoints. With this design, MfvPose achieves encouraging results on various benchmarks.

The self-attention module gives the model the ability to capture global dependencies. However, it also incurs additional computational and memory costs, which are already the main drawbacks of the Transformer. In future research, we will focus on solving this problem while also conducting further exploration to improve detection speed.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (No. 62173285 and 62103345), the Fujian Provincial Natural Science Foundation of China (No. 2021J011181, 2020J02160 and 2022J011234) and Xiamen Youth Innovation Fund Project (No. 3502Z20206072 and 3502Z20206076).

References

1. Toshev, Szegedy, Deeppose: Human pose estimation via deep neural networks, Proceedings of the IEEE conference on computer vision and pattern recognition, (2014), 1653–1660.

2. Newell, Yang, Deng, Stacked hourglass networks for human pose estimation, Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII 14, (2016), 483–499.

3. Wei S.-E., Ramakrishna, Kanade, Sheikh, Convolutional pose machines, Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), 4724–4732.

4. Sun, Xiao, Liu, Wang, Deep high-resolution representation learning for human pose estimation, 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), (2019), 5686–5696.

5. Rogez, Weinzaepfel, Schmid, LCR-Net: Localization-Classification-Regression for Human Pose, 2017 IEEE conference on computer vision and pattern recognition (CVPR), (2017), 1216–1224.

6. Güler R.A., Neverova, Kokkinos, DensePose: Dense Human Pose Estimation in the Wild, 2018 IEEE/CVF conference on computer vision and pattern recognition, (2018), 7297–7306.

7. Chu, Yang, Ouyang, Ma, Yuille A.L., Wang, Multi-context Attention for Human Pose Estimation, 2017 IEEE conference on computer vision and pattern recognition (CVPR), (2017), 5669–5678.

8. Chu, Ouyang, Li, Wang, Structured feature learning for pose estimation, Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), 4715–4723.

9. Chen, Wang, Peng, Zhang, Yu, Sun, Cascaded Pyramid Network for Multi-person Pose Estimation, 2018 IEEE/CVF conference on computer vision and pattern recognition, (2018), 7103–7112.

10. Ramanan, Bottom-Up and Top-Down Reasoning with Hierarchical Rectified Gaussians, 2016 IEEE conference on computer vision and pattern recognition (CVPR), (2016), 5600–5609.

11. Xiao, Wu, Wei, Simple baselines for human pose estimation and tracking, Proceedings of the European conference on computer vision (ECCV), (2018), 466–481.

12. Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR 2021 - 9th international conference on learning representations, (2021).

13. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser Ł., Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

14. Touvron, Cord, Sablayrolles, Synnaeve, Jégou, Going deeper with Image Transformers, 2021 IEEE/CVF international conference on computer vision (ICCV), (2021), 32–42.

15. Yang, Ramanan, Articulated pose estimation with flexible mixtures-of-parts, Proceedings of the IEEE computer society conference on computer vision and pattern recognition, (2011), 1385–1392.

16. Johnson, Everingham, Learning effective human pose estimation from inaccurate annotation, Proceedings of the IEEE computer society conference on computer vision and pattern recognition, (2011), 1465–1472.

17. Cao, Simon, Wei S.-E., Sheikh, Realtime Multiperson 2D Pose Estimation Using Part Affinity Fields, 2017 IEEE conference on computer vision and pattern recognition (CVPR), (2017), 1302–1310.

18. Zhang, Yang, Li, Jian, Human pose estimation based on parallel atrous convolution and body structure constraints, Journal of Intelligent & Fuzzy Systems 42 (2022), 5553–5563.

19. Zhang, Ren, Sun, Deep Residual Learning for Image Recognition, 2016 IEEE conference on computer vision and pattern recognition (CVPR), (2016), 770–778.

20. Wang, Chen, Yuan, Liu, Huang, Hou, Cottrell, Understanding Convolution for Semantic Segmentation, 2018 IEEE winter conference on applications of computer vision (WACV), (2018), 1451–1460.

21. Chen L.-C., Papandreou, Kokkinos, Murphy, Yuille A.L., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2018), 834–848.

22. Liu, Lin, Cao, Hu, Wei, Zhang, Lin, Guo, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021 IEEE/CVF international conference on computer vision (ICCV), (2021), 9992–10002.

23. Ali, Touvron, Caron, Bojanowski, Douze, Joulin, Laptev, Neverova, Synnaeve, Verbeek et al., XCiT: Cross-Covariance Image Transformers, Advances in Neural Information Processing Systems 34 (2021), 20014–20027.

24. Mehta, Rastegari, Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer, ICLR 2022 - 10th international conference on learning representations, (2022).

25. Artacho, Savakis, UniPose: Unified Human Pose Estimation in Single Images and Videos, 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), (2020), 7033–7042.

26. Artacho, Savakis, Omnipose: A multi-scale framework for multi-person pose estimation, arXiv preprint arXiv:2103.10180, (2021).

27. Zhang, Wang, Yang, Xia S.-T., Zhou, TokenPose: Learning Keypoint Tokens for Human Pose Estimation, 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 11293–11302.

28. McNally, Vats, Wong, McPhee, EvoPose2D: Pushing the Boundaries of 2D Human Pose Estimation Using Accelerated Neuroevolution With Weight Transfer, IEEE Access 9 (2021), 139403–139414.

29. Gkioxari, Dollár, Girshick, Mask RCNN, 2017 IEEE International Conference on Computer Vision (ICCV), (2017), 2980–2988.

30. Papandreou, Zhu, Kanazawa, Toshev, Tompson, Bregler, Murphy, Towards Accurate Multi-person Pose Estimation in the Wild, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017), 3711–3719.

31. Sun, Xiao, Wei, Liang, Wei, Integral human pose regression, Proceedings of the European conference on computer vision (ECCV), (2018), 529–545.

32. Fang H.-S., Xie, Tai Y.-W., Lu, Regional Multi-person Pose Estimation, 2017 IEEE International Conference on Computer Vision (ICCV), (2017), 2353–2362.

33. Huang, Gong, Tao, A coarse-fine network for keypoint localization, Proceedings of the IEEE international conference on computer vision, (2017), 3028–3037.

34. Lin T.-Y., Maire, Belongie, Hays, Perona, Ramanan, Dollár, Zitnick C.L., Microsoft coco: Common objects in context, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), (2014), 740–755.

35. Andriluka, Pishchulin, Gehler, Schiele, Human Pose Estimation: New Benchmark and State of the Art Analysis, 2014 IEEE Conference on Computer Vision and Pattern Recognition, (2014), 3686–3693.

36. Chen L.-C., Papandreou, Schroff, Adam, Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587, (2017).

37. Wu, Zhang, Zhang, A Local–Global Estimator Based on Large Kernel CNN and Transformer for Human Pose Estimation and Running Pose Measurement, IEEE Transactions on Instrumentation and Measurement 71 (2022), 1–12.
