A combined local and global structure module for human pose estimation

Abstract

Human pose estimation is used in action recognition, video surveillance, and other fields, and has received considerable attention. Because the flexibility of human joints and environmental factors strongly affect estimation accuracy, the task remains challenging. In this paper, we incorporate pyramidal convolution and an attention mechanism into the residual block, and introduce a hybrid structure model that combines the local and global information of the image for keypoint detection. In addition, our model adopts grouped convolution and a lightweight attention module, which reduces the computational cost of the network. Experiments on the MS COCO human keypoint detection dataset show that, compared with the Simple Baseline model, our model has a similar number of parameters and similar GFLOPs (giga floating-point operations), but achieves higher detection accuracy in multi-person scenes.

Keywords

Human pose estimation, keypoints detection, context information

1. Introduction

Human pose estimation, also called human skeleton keypoint detection, refers to detecting human joints (keypoints such as wrists, ankles, and knees) in a single image or a video, and then connecting the keypoints according to body structure to represent the human skeleton. Because the skeleton carries most of the pose information while discarding irrelevant image content, pose estimation is widely used in downstream fields such as action recognition, video surveillance, and human trajectory tracking [3, 4, 5].

Traditional pose estimation methods are based on graphical structure frameworks, deformable part models, and hand-crafted features. Although these methods are fast, they require a specific model to be constructed for each kind of human pose. Because human postures are highly diverse, these algorithms are difficult to extend to general cases [6, 7]. DeepPose [2] was the first model to use deep learning for two-dimensional human pose estimation. It learns to recognize different human poses from data and generalizes well. As a result, deep learning has gradually become the mainstream approach to two-dimensional human pose estimation [10, 20, 23].

Two-dimensional human pose estimation can be divided into single-person and multi-person pose estimation. Single-person pose estimation applies only to the few scenarios in which exactly one person is present, and it runs into difficulties in complex environments [2, 20, 21]. In contrast, multi-person pose estimation algorithms can describe the behavior of several people simultaneously, and have therefore attracted more attention. There are two mainstream approaches to multi-person pose estimation: top-down [8, 9, 10, 11, 12] and bottom-up [13, 14, 15, 23]. A bottom-up model first detects all human keypoints in the image and then uses a grouping algorithm to assign the detected keypoints to individuals. Since all keypoints are found at once, it is relatively fast, but its hardware requirements are higher. In addition, as the number of human instances in the image increases, grouping the keypoints becomes harder. The top-down approach, by contrast, uses an object detection algorithm to detect all human instances in the image and then detects the keypoints of each instance one by one. Typical top-down methods use a series of multi-resolution networks to extract effective features and enhance the resolution of the feature map to facilitate keypoint localization [8, 11, 12]. This type of algorithm achieves high accuracy with lower hardware requirements. For example, Simple Baseline [10] extracts features with ResNet [22] and employs deconvolution to obtain heatmaps with higher resolution.

The human body is flexible, and changes in the relative positions of the joints produce different postures. Localization is especially hard when the body is occluded, so locating invisible keypoints is an important problem for correct pose estimation. In recent years, attention models have achieved good results in computer vision tasks such as classification, object detection, and segmentation [26, 28, 29]. Their advantage is that they can jointly exploit local and global information. Building on this property, Chu et al. proposed multi-context attention, which partially solves the problem of pose estimation in complicated environments [30]. However, the multi-context attention algorithm uses stacked hourglass networks (SHN) [20] as its base network, so the overall structure is complex and difficult to work with.

This article studies top-down two-dimensional multi-person pose estimation. Inspired by Pyramidal Convolution [24], we propose a new module based on Simple Baseline [10]. It uses small convolution kernels to extract the features of small targets and local details, and large convolution kernels to obtain the features of large targets as well as context information about the human body structure. We also introduce the attention model GC-Net [28] to further strengthen the connection between global and local information. As a result, the model can better locate keypoints that are difficult to detect by inferring the human body pose, as shown in Fig. 1.

Figure 1.

Detection results of pose estimation with our model. The images are randomly selected from COCO. Note that our model successfully detects some overlapped or invisible keypoints.

The remainder of this article is arranged as follows. Section 2 reviews related work on two-dimensional multi-person pose estimation, and Section 3 describes the principle and structure of our model. The experimental procedure, the comparison experiments on the MS COCO dataset, and the corresponding analysis are presented in Section 4. The conclusion is given in Section 5.

2. Related work

One typical bottom-up approach was proposed by Cao et al. [13]; it encodes the position and direction of the limbs and combines this encoding with the keypoint heatmaps for joint learning and prediction. However, keypoint detection and grouping in bottom-up methods depend on each other, so the two steps can be merged and carried out simultaneously [15]. In [14], a short-range offset is predicted together with the heatmaps to locate the keypoints of the human body, while a mid-range offset is used for grouping. Cheng et al. [23] adopted the parallel structure of HR-Net [12] as the backbone network and proposed a bottom-up pose estimation method. The authors found that the model's capability is reduced by the change in resolution, and therefore added a module containing deconvolution and residual blocks to obtain higher-resolution heatmaps.

Unlike bottom-up methods, top-down methods are popular for their high accuracy. Fang et al. [8] proposed RMPE, a multi-person pose estimation network built on the single-person pose estimation network Stacked Hourglass Networks [20]. It provides a symmetric spatial transformation module to handle keypoint localization errors caused by inaccurate human detection boxes. To cope with keypoints of varying difficulty, Chen et al. proposed the CPN algorithm, which detects easy and hard keypoints separately [11]. Easy-to-detect joints such as eyes and elbows are detected by GlobalNet, while keypoints that are difficult to detect (knees, hips, etc.) are handled by RefineNet using higher-level semantic information. CPN also employs online hard keypoints mining (OHKM), computing the loss only on the 8 most difficult keypoints (those with the largest loss). Motivated by networks such as CPN and Stacked Hourglass Networks, Xiao et al. proposed the simple and effective single-stage Simple Baseline [10], which achieves good results by appending several deconvolution layers to the ResNet framework [22]. However, the prediction ability of existing multi-stage human pose estimation methods does not improve as the number of stages increases. In response to this problem, MSPN [9] provides an improved Hourglass module and uses different Gaussian kernels to generate the heatmaps for different modules, so that the intermediate supervision proceeds from coarse to fine. HR-Net [12] designs a parallel structure of high-resolution and low-resolution branches. At each downsampling step, a new parallel low-resolution branch is derived from the original high-resolution branch. Meanwhile, the high-resolution branch exchanges information with the low-resolution branches, which benefits the fusion of global and local information.

The work of Chu et al. [30] shows that attention models used in computer vision tasks such as object detection are also suitable for pose estimation. The squeeze-and-excitation (SE) block in SE-Net, proposed by Hu et al. [26], employs global average pooling to obtain global information, and uses hidden layers followed by a sigmoid function to learn inter-channel relationships. Wang et al. [29] proposed the Non-local Network, which queries the pairwise relationship between each location and all other locations to form an attention map. It is effective but computationally expensive. Based on the Non-local Network, Cao et al. presented GC-Net [28], a more lightweight attention model that is combined with SE-Net. Because GC-Net is lightweight and can still capture global context information, it is widely used in pose estimation models.

3. Mixture model for pose estimation

In this article, we adopt the top-down pose estimation method. To detect partially occluded or invisible human keypoints correctly and quickly, we construct a combined local and global structure model (CLGS) by introducing pyramidal convolution and an attention module. First, we use an object detection algorithm to obtain the bounding boxes of all human bodies in the input image, and then apply the pose estimation algorithm to find the keypoints in each box. Contextual information helps the model learn about the human body structure and thereby facilitates keypoint localization. CLGS extracts such contextual information through convolution kernels of different sizes.

3.1 Principle of the fundamental module

The Simple Baseline method uses ResNet as the backbone network to extract effective features, and the Bottleneck module is the basic component of ResNet. Its structure is shown in Fig. 2a. The input feature map is convolved with 1 $\times$ 1, 3 $\times$ 3 and 1 $\times$ 1 kernels in sequence, and the result is added to the original input through a shortcut connection to produce the output. The process can be expressed as:

$\displaystyle\textit{output}=\textit{conv3(conv2(conv1(input)))}+\textit{input}$ (1)

where input and output are the input and output of the Bottleneck, and conv1, conv2, and conv3 are its three convolutions of size 1 $\times$ 1, 3 $\times$ 3, and 1 $\times$ 1, respectively.

Figure 2.

Illustration of two typical modules of current classification networks. (a) is the basic module Bottleneck in ResNet, (b) is the basic module Pyconv Bottleneck in PyConvHGResNet.

Pyramidal convolution (Pyconv) [24] builds on the Bottleneck in ResNet, and its structure is shown in Fig. 2b. In the Pyconv Bottleneck, the input features are first convolved with a 1 $\times$ 1 kernel and then passed in parallel through convolutions with kernel sizes 3 $\times$ 3, 5 $\times$ 5, 7 $\times$ 7, and 9 $\times$ 9. The outputs of these branches are concatenated and passed through another 1 $\times$ 1 convolution. The final output is obtained by adding the input of the Pyconv Bottleneck to this result through the shortcut connection. In addition, the four convolutions are grouped with $G=$ 1, $G=$ 8, $G=$ 16, and $G=$ 32, respectively: the larger the kernel, the more groups it is assigned. As a result, the module needs fewer parameters and fewer giga floating-point operations (GFLOPs). The process is given by:

$\displaystyle\textit{conv2}^{\prime}(\cdot)=\textit{conv2}^{\prime}_{1}(\cdot)+\textit{conv2}^{\prime}_{2}(\cdot)+\textit{conv2}^{\prime}_{3}(\cdot)+\textit{conv2}^{\prime}_{4}(\cdot)$ (2)

$\displaystyle\textit{output}=\textit{conv3}(\textit{conv2}^{\prime}(\textit{conv1}(\textit{input})))+\textit{input}$ (3)

where $\textit{conv2}^{\prime}_{1}(\cdot)$, $\textit{conv2}^{\prime}_{2}(\cdot)$, $\textit{conv2}^{\prime}_{3}(\cdot)$, and $\textit{conv2}^{\prime}_{4}(\cdot)$ denote the convolutions of size 3 $\times$ 3, 5 $\times$ 5, 7 $\times$ 7, and 9 $\times$ 9, respectively, and $+$ in Eq. (2) stands for combining the four branch outputs by concatenation, as described above.
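
To make the effect of grouping concrete, the following sketch counts the weights of the four parallel branches with and without grouping. The kernel sizes and group numbers are those stated above; the channel widths (128 input channels and 32 output channels per branch) are illustrative assumptions and are not taken from the paper.

```python
def conv_params(kernel, c_in, c_out, groups=1):
    """Number of weights in a k x k convolution with `groups` groups (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the group number")
    # With grouping, each output channel only sees c_in / groups input channels.
    return kernel * kernel * (c_in // groups) * c_out


# (kernel size, group number) for the four branches, as stated in the text.
BRANCHES = [(3, 1), (5, 8), (7, 16), (9, 32)]

# Hypothetical channel widths, chosen only for illustration.
C_IN, C_OUT_PER_BRANCH = 128, 32

grouped = sum(conv_params(k, C_IN, C_OUT_PER_BRANCH, g) for k, g in BRANCHES)
ungrouped = sum(conv_params(k, C_IN, C_OUT_PER_BRANCH) for k, _ in BRANCHES)

print(f"grouped pyramid:   {grouped} weights")
print(f"ungrouped pyramid: {ungrouped} weights")
```

Under these assumed widths, the grouped pyramid needs roughly a ninth of the weights of its ungrouped counterpart, which illustrates why the large kernels remain affordable.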

Since a model based on the attention mechanism can fuse the local and global information of the visual scene and thus facilitate keypoint localization, we combine a self-attention module, the global context (GC) block [28], with the Pyconv Bottleneck. The GC block mainly contains two parts: context modeling and transform. The former uses a 1 $\times$ 1 convolution and a softmax function to compute attention weights, and then applies attention pooling to acquire global context features. The latter employs a bottleneck transform consisting of a pair of 1 $\times$ 1 convolutions for feature conversion. Finally, element-wise multiplication is used to aggregate the global context features into the features at each location. The combination of GC Block and Pyconv Bottleneck is detailed in Fig. 3b. Since the GC Block is lightweight and does not greatly increase the number of model parameters, it can be added to every Pyconv Bottleneck without impairing the efficiency of the whole algorithm. Our experiments show that this further improves model performance.
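
The two parts of the GC block can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature map is flattened to C channels by N positions, each 1 $\times$ 1 convolution is written as a weight vector or matrix without bias, any normalization inside the transform is omitted, and all weights in the usage lines are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested list by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gc_block(x, w_key, w_reduce, w_expand):
    """Simplified global context block.

    x        : features as C lists of N values (N = H * W flattened positions)
    w_key    : C weights of the 1x1 convolution that produces attention logits
    w_reduce : (C/r x C) weights of the first 1x1 convolution of the transform
    w_expand : (C x C/r) weights of the second 1x1 convolution of the transform
    """
    channels, positions = len(x), len(x[0])

    # Context modeling: one logit per position, softmax over all positions.
    logits = [sum(w_key[c] * x[c][n] for c in range(channels))
              for n in range(positions)]
    attention = softmax(logits)

    # Attention pooling: attention-weighted sum of the features over positions.
    context = [sum(a * v for a, v in zip(attention, x[c]))
               for c in range(channels)]

    # Transform: 1x1 convolution, ReLU, 1x1 convolution on the pooled vector.
    hidden = [max(0.0, h) for h in matvec(w_reduce, context)]
    scale = matvec(w_expand, hidden)

    # Fusion: broadcast the per-channel result to every position and combine
    # it with the input by element-wise multiplication, as the text describes.
    fused = [[x[c][n] * scale[c] for n in range(positions)]
             for c in range(channels)]
    return fused, attention


# Hypothetical toy input: 2 channels, 4 positions.
features = [[1.0, 2.0, 3.0, 4.0],
            [0.5, 0.5, 0.5, 0.5]]
fused, attention = gc_block(features,
                            w_key=[0.0, 0.0],
                            w_reduce=[[1.0, 1.0]],
                            w_expand=[[1.0], [2.0]])
```

Because the pooled context is a single vector per image, the cost of the transform does not depend on the spatial size, which is consistent with the block being described as lightweight.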

Figure 3.

Comparison of GC block and our improved mixture module. (a) is the attention module global context block (GC Block); (b) is the improved mixture module, which is formed by combining GC Block and Pyconv Bottleneck.

3.2 Local and global structure model

In recent years, many pose estimation algorithms have used a series of high-resolution to low-resolution networks to extract effective features and then increased the resolution of the feature maps, which facilitates keypoint localization and reduces errors. Among them, Simple Baseline [10] is regarded as one of the fundamental methods in pose estimation. Although its structure is simple, it achieves relatively good results. Our network is inspired by this concise and effective baseline, and improves on it in the following two aspects:

(1) Following pyramidal convolution [24], we adopt the Pyconv Bottleneck module proposed in PyConvHGResNet as the basic block and then use deconvolution to enhance the resolution of the feature map.
(2) In order to further improve the performance of the model, we add the attention module GC Block to the Pyconv Bottleneck.

Figure 4.
Local and global structure in the model. C2-C5 are units that contain 3, 4, 6, and 3 structures consisting of Pyconv bottleneck and GC block, respectively.

The specific structure of our model is shown in Fig. 4. C1-C5 form the backbone network for extracting effective features (PyConvHGResNet50), where C1 is a 7 $\times$ 7 convolution, and C2-C5 are units that contain 3, 4, 6, and 3 copies of module (b) shown in Fig. 3, respectively. Each such module is built from a Pyconv Bottleneck and a GC Block. After the backbone, we apply 3 deconvolutions in series and then use a 1 $\times$ 1 convolution to output the predicted heatmaps. ReLU and batch normalization operations are applied after each convolution and deconvolution.
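
The flow of spatial resolution through this structure can be traced with a few lines of Python. The block counts per stage come from the text; the overall backbone stride of 32 and the factor-2 upsampling per deconvolution are assumptions based on the usual ResNet and Simple Baseline configurations and are not stated in this section.

```python
# Number of (Pyconv Bottleneck + GC Block) modules per stage, from the text.
BLOCKS_PER_STAGE = {"C2": 3, "C3": 4, "C4": 6, "C5": 3}


def heatmap_size(height, width, backbone_stride=32, num_deconvs=3, upsample=2):
    """Spatial size of the predicted heatmaps for a given input crop.

    backbone_stride and upsample are assumed values (see the lead-in).
    """
    h, w = height // backbone_stride, width // backbone_stride
    for _ in range(num_deconvs):
        h, w = h * upsample, w * upsample
    return h, w


total_blocks = sum(BLOCKS_PER_STAGE.values())
print(f"modules in C2-C5: {total_blocks}")
print(f"heatmap size for a 256 x 192 crop: {heatmap_size(256, 192)}")
```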

In this paper, we adopt mean square error as the loss function to calculate the error of the model, and the mathematical expression is as follows:

$\displaystyle\textit{Loss}=\frac{1}{\left|{H\times W\times C}\right|}\sum\limits_{i,j,c}^{\left|{H\times W\times C}\right|}{\left\|{h(i,j,c)-g(i,j,c)}\right\|^{2}}$ (4)

where $H$, $W$, and $C$ are the height and width of the output heatmaps and the number of channels, respectively (since the MS COCO dataset has 17 keypoints, the model has 17 output channels); $h$ and $g$ represent the predicted heatmaps and the ground truth heatmaps, respectively. The ground truth heatmaps are generated with a 2D Gaussian function centered on the pixel at the labeled position of each keypoint. $i, j$ are the pixel coordinates and $c$ denotes the channel index on the heatmaps.
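
As an illustration of Eq. (4), the sketch below builds a Gaussian ground truth heatmap and evaluates the mean square error in plain Python. The standard deviation `sigma` and the toy heatmap sizes are assumptions made for the example; the paper does not state the value it uses.

```python
import math


def gaussian_heatmap(height, width, center_x, center_y, sigma=2.0):
    """2D Gaussian with peak value 1 centered on the labeled pixel.

    sigma is a hypothetical value chosen for illustration.
    """
    return [[math.exp(-((x - center_x) ** 2 + (y - center_y) ** 2)
                      / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def mse_loss(pred, target):
    """Mean square error over all channels and pixels, as in Eq. (4).

    pred and target are lists of C heatmaps, each a list of H rows of W values.
    """
    total, count = 0.0, 0
    for pred_map, target_map in zip(pred, target):
        for pred_row, target_row in zip(pred_map, target_map):
            for p, g in zip(pred_row, target_row):
                total += (p - g) ** 2
                count += 1
    return total / count


# One-channel toy example: keypoint labeled at x=20, y=30 on a 64 x 48 heatmap.
target = [gaussian_heatmap(64, 48, center_x=20, center_y=30)]
blank = [[[0.0] * 48 for _ in range(64)]]
print(mse_loss(blank, target))
```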

3.3 Analysis of the model

Keypoints are located from local information, but they are also intrinsically connected. For two-dimensional multi-person pose estimation, global information such as the body structure of the limbs helps to establish the connections between adjacent joints.

The pyramidal convolution used in our model extracts the features and local details of small targets with small convolution kernels, while large convolution kernels extract contextual information about the human body structure. In addition, the lightweight attention module (global context module) further correlates global and local information, so that our improved model obtains not only accurate keypoint positions but also the relations among keypoints. The improved model therefore has a coherent understanding of the human body structure and can find keypoints that are otherwise hard to detect. The use of grouped convolution also reduces the number of model parameters and the computational cost.

4. Experiment

4.1 Dataset

The MS COCO dataset [27] is widely used in a variety of computer vision tasks. It is available in three versions: COCO2014, COCO2015, and COCO2017. We use COCO2017, which contains 80 categories, 200,000 images, and 5 types of annotation, including object detection and human keypoint detection. It includes 250,000 human instances annotated with keypoints, each with 17 labeled keypoints. COCO2017 is divided into train2017, val2017, and test-dev2017. Of these, train2017 contains 57,000 images with 150,000 keypoint-annotated human instances, val2017 contains 5000 images, and test-dev2017 contains 20,000 images. We use train2017 for training and val2017 for the validation and evaluation of our model.

4.2 Evaluation metric

We apply object keypoint similarity (OKS) to measure the similarity between predicted keypoints and ground truth label keypoints. The calculation formula is given by:

$\displaystyle\textit{OKS}=\frac{\sum\nolimits_{i}{\exp\left(\frac{-d_{i}^{2}}{2s^{2}k_{i}^{2}}\right)\delta(v_{i}>0)}}{\sum\nolimits_{i}{\delta(v_{i}>0)}}$ (5)

where $d_{i}$ is the Euclidean distance between the $i$-th predicted keypoint and its ground truth, $v_{i}$ is the visibility flag of the ground truth keypoint, $s$ denotes the scale of the ground truth object, and $k_{i}$ is a per-keypoint constant that controls the falloff. We use average precision (AP) and average recall (AR) as the final scores of the model. AP ${}^{50}$ is the average precision at an OKS threshold of 0.5, and AP ${}^{75}$ is the average precision at an OKS threshold of 0.75. AP is the mean of the average precision over the 10 OKS thresholds 0.50:0.05:0.95. AP ${}^{M}$ is the average precision for medium objects, and AP ${}^{L}$ is the average precision for large objects. AR is the mean of the average recall over the same 10 OKS thresholds 0.50:0.05:0.95.
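
Eq. (5) translates directly into code. The following sketch computes OKS for one person instance and lists the 10 thresholds used for AP and AR; the coordinates, scale, and per-keypoint constants in the usage lines are made-up values for illustration only.

```python
import math


def oks(pred, gt, visibility, scale, k):
    """Object keypoint similarity following Eq. (5).

    pred, gt   : lists of (x, y) keypoint coordinates
    visibility : visibility flags v_i; only keypoints with v_i > 0 are counted
    scale      : object scale s
    k          : per-keypoint constants k_i
    """
    numerator, labeled = 0.0, 0
    for (px, py), (gx, gy), v, k_i in zip(pred, gt, visibility, k):
        if v > 0:
            d_squared = (px - gx) ** 2 + (py - gy) ** 2
            numerator += math.exp(-d_squared / (2.0 * scale ** 2 * k_i ** 2))
            labeled += 1
    return numerator / labeled if labeled else 0.0


# The 10 OKS thresholds 0.50:0.05:0.95 used for AP and AR.
OKS_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical two-keypoint example; the second keypoint is unlabeled (v = 0).
ground_truth = [(10.0, 10.0), (20.0, 20.0)]
prediction = [(11.0, 10.0), (90.0, 90.0)]
score = oks(prediction, ground_truth, visibility=[2, 0], scale=1.0, k=[1.0, 1.0])
print(score, [t for t in OKS_THRESHOLDS if score >= t])
```

In this toy case the unlabeled keypoint is ignored, so the score depends only on the one-pixel error of the first keypoint.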

4.3 Training

Our experiments are carried out on a 2 $\times$ GPU server and use the PyConvHGResNet50 classification model from Pyramidal Convolution [24], trained on ImageNet, as the pre-trained model. We use the Adam optimizer with a batch size of 32 and the mean square error as the loss function. Training runs for 150 epochs with an initial learning rate of 0.001, which drops to 0.0001 at the 90th epoch and to 0.00001 at the 120th epoch. For the training set train2017 of the MS COCO dataset, we adjust the aspect ratio of each human detection box to 4:3, crop the box from the image, and resize it to 256 $\times$ 192. Random rotation, scaling, flipping, and other data augmentation methods are also used to expand the training data. To verify the effectiveness of combining the local and global structure with GC Block, we conducted an ablation experiment: under the same conditions, we trained two models, one with GC Block and one without. Both models are evaluated on val2017 and test-dev2017 of the MS COCO dataset.
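
The learning rate schedule and the box adjustment described above can be sketched as follows. Two points are assumptions rather than facts from the paper: epochs are counted so that the drop takes effect from epoch 90 and epoch 120 onward, and the 4:3 ratio is read as height:width (which matches the 256 $\times$ 192 input) and reached by enlarging the box around its center.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(90, 120), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` at each milestone epoch."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


def fix_aspect_ratio(x, y, w, h, ratio_h=4, ratio_w=3):
    """Enlarge a box (top-left x, y, width w, height h) around its center
    until height:width equals ratio_h:ratio_w."""
    center_x, center_y = x + w / 2, y + h / 2
    if h * ratio_w > w * ratio_h:
        w = h * ratio_w / ratio_h   # box too narrow: widen it
    else:
        h = w * ratio_h / ratio_w   # box too flat: make it taller
    return center_x - w / 2, center_y - h / 2, w, h


for epoch in (0, 89, 90, 119, 120, 149):
    print(epoch, learning_rate(epoch))
print(fix_aspect_ratio(0, 0, 30, 80))
```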

Table 1
Comparison of the test results of Simple Baseline and our model on val2017. “CPN+OHKM” refers to CPN with online hard keypoints mining (OHKM) [11], “pretrain” indicates whether the algorithm uses a pre-trained model or not, and “ $+$ GC” indicates that GC Block is added to Pyconv Bottleneck

Val2017
Method	Backbone	Pretrain	Input size	Params	GFLOPs	AP	AP ${}^{50}$	AP ${}^{75}$	AP ${}^{M}$	AP ${}^{L}$	AR
CPN	ResNet-50	Y	256 $\times$ 192	27.0M	6.2	68.6	–	–	–	–	–
CPN( $+$ OHKM)	ResNet-50	Y	256 $\times$ 192	27.0M	6.2	69.4	–	–	–	–	–
Simple Baseline	ResNet-50	Y	256 $\times$ 192	34.0M	8.9	70.4	88.6	78.3	67.1	77.2	76.3
Simple Baseline	ResNet-101	Y	256 $\times$ 192	53.0M	12.4	71.4	89.3	79.3	68.1	78.1	77.1
Ours	PyConvHGResNet50	Y	256 $\times$ 192	33.7M	9.4	71.2	88.9	78.8	67.7	78.4	77.1
Ours ( $+$ GC)	PyConvHGResNet50 ( $+$ GC)	Y	256 $\times$ 192	36.2M	9.4	71.7	89.2	79.2	68.2	78.7	77.2

4.4 Testing

This paper proposes a two-stage, top-down method for two-dimensional multi-person pose estimation. Human instances are first detected by an object detection algorithm, and their keypoints are then predicted by our model. On the validation set val2017 and on test-dev2017 of the MS COCO dataset, we use Faster R-CNN [25] as the object detector to obtain the human detection boxes. For the predicted heatmaps, we average the heatmaps of the original image and of the flipped image to obtain the final output. The predicted keypoint location is the position of the highest response on the final heatmap, shifted by a quarter offset ( $\frac{1}{4}$ ) in the direction from the highest response to the second highest response.
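
A minimal sketch of this decoding step is given below. It uses the neighbor-comparison form of the quarter offset found in many public implementations (shifting by a quarter of a pixel toward the larger neighboring response on each axis); whether the authors' code takes exactly this form is an assumption. The flip averaging only mirrors the heatmap back horizontally; swapping left and right joint channels, which a full implementation needs, is omitted here.

```python
def average_with_flip(heatmap, flipped_heatmap):
    """Average a heatmap with the one predicted from the mirrored image.

    flipped_heatmap is mirrored back horizontally before averaging.
    """
    return [[(a + b) / 2.0 for a, b in zip(row, reversed(flipped_row))]
            for row, flipped_row in zip(heatmap, flipped_heatmap)]


def decode_keypoint(heatmap):
    """Return (x, y) of the strongest response, refined by a quarter offset."""
    height, width = len(heatmap), len(heatmap[0])
    # Location of the highest response.
    peak_y, peak_x = max(((y, x) for y in range(height) for x in range(width)),
                         key=lambda pos: heatmap[pos[0]][pos[1]])
    x, y = float(peak_x), float(peak_y)
    # Shift a quarter pixel toward the larger neighbor on each axis; peaks on
    # the border are left unchanged because a neighbor is missing there.
    if 0 < peak_x < width - 1:
        diff = heatmap[peak_y][peak_x + 1] - heatmap[peak_y][peak_x - 1]
        x += 0.25 * ((diff > 0) - (diff < 0))
    if 0 < peak_y < height - 1:
        diff = heatmap[peak_y + 1][peak_x] - heatmap[peak_y - 1][peak_x]
        y += 0.25 * ((diff > 0) - (diff < 0))
    return x, y


# Hypothetical 5 x 5 heatmap: peak at (x=2, y=2), stronger response to the right.
toy = [[0.0, 0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.2, 0.0, 0.0],
       [0.0, 0.1, 1.0, 0.5, 0.0],
       [0.0, 0.0, 0.2, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0, 0.0]]
print(decode_keypoint(toy))
```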

4.5 Results on the validation set

Table 1 shows the test results of our network on COCO val2017. Our model without GC Block has 33.7M parameters and 8.9 GFLOPs. Compared with Simple Baseline (ResNet50), it has slightly fewer parameters and comparable GFLOPs, and it improves on the main indicators: AP, AP ${}^{L}$ and AR are 71.2, 78.4 and 77.1, which are higher by 0.8, 1.2 and 0.8, respectively. We also run the same test on our model with GC Block. As Table 1 shows, this model has 36.2M parameters, slightly more than Simple Baseline (ResNet50), but its performance improves further: AP, AP ${}^{L}$ , and AR are 71.7, 78.7 and 77.2, which are higher by 1.3, 1.5 and 0.9, respectively. Even compared with Simple Baseline (ResNet101), our model reaches similar or higher performance indicators with far fewer parameters and GFLOPs.
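
The gains quoted above follow directly from the entries of Table 1, as the short check below shows. The dictionary simply restates the AP, AP ${}^{L}$ and AR values from the table; the key names are chosen here for illustration.

```python
# AP, AP^L and AR on val2017, copied from Table 1.
TABLE1 = {
    "Simple Baseline (ResNet-50)": {"AP": 70.4, "AP_L": 77.2, "AR": 76.3},
    "Ours": {"AP": 71.2, "AP_L": 78.4, "AR": 77.1},
    "Ours (+GC)": {"AP": 71.7, "AP_L": 78.7, "AR": 77.2},
}


def gains(model, baseline="Simple Baseline (ResNet-50)"):
    """Difference between a model and the baseline, rounded to one decimal."""
    return {metric: round(TABLE1[model][metric] - TABLE1[baseline][metric], 1)
            for metric in TABLE1[model]}


print(gains("Ours"))
print(gains("Ours (+GC)"))
```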

4.6 Results on the test-dev set

Table 2 shows the test results of our network on COCO test-dev2017. Our model without GC Block performs better than Simple Baseline (ResNet50): the two main indicators AP and AR are 70.7 and 76.4, which are higher by 0.7 and 0.8, respectively. Our model with GC Block further improves both precision and recall on test-dev2017. Compared with Simple Baseline (ResNet50), its AP and AR are 71.1 and 76.6, which are higher by 1.1 and 1.0, respectively. Our improved model therefore outperforms Simple Baseline (ResNet50) on the main indicators.

Table 2
Comparison of the test results of Simple Baseline and our model on test-dev2017. “pretrain” indicates whether the algorithm uses a pre-trained model or not, and “ $+$ GC” indicates that GC Block is added to Pyconv Bottleneck

test-dev2017
Method	Backbone	Pretrain	Input size	Params	GFLOPs	AP	AP ${}^{50}$	AP ${}^{75}$	AP ${}^{M}$	AP ${}^{L}$	AR
OpenPose	–	–	–	–	–	61.8	84.9	67.5	57.1	68.2	66.5
PersonLab	–	–	–	–	–	68.7	89.0	75.4	64.1	75.5	75.4
Simple Baseline	ResNet-50	Y	256 $\times$ 192	34.0M	8.9	70.0	90.9	77.9	66.8	75.8	75.6
Ours	PyConvHGResNet50	Y	256 $\times$ 192	33.7M	9.4	70.7	91.0	78.9	67.5	76.7	76.4
Ours ( $+$ GC)	PyConvHGResNet50 ( $+$ GC)	Y	256 $\times$ 192	36.2M	9.4	71.1	91.0	79.3	67.9	76.9	76.6

Figure 5.

The visualization results of our final model and Simple Baseline (ResNet50).

4.7 Visualization

Our model can accurately detect some human keypoints that Simple Baseline (ResNet50) has difficulty detecting. Fig. 5 visualizes the predictions of our final model and of Simple Baseline (ResNet50) on val2017. The results are given in pairs: in each pair, the left image shows the prediction of our model and the right image shows that of Simple Baseline (ResNet50). Red circles mark the differences between the two models.

5. Conclusion

In this article, we propose a top-down two-dimensional multi-person pose estimation model based on the Pyconv Bottleneck and the lightweight attention module GC Block. The new algorithm effectively uses both the local information of the image and the global structure information to capture the structure of the human body. Based on these improvements, it can infer some invisible or overlapped keypoints that are difficult to locate in the pose estimation task. At the same time, since grouped convolution is applied and GC Block is lightweight, the number of model parameters does not increase significantly and remains comparable to that of Simple Baseline. Experiments show that, after training on the standard MS COCO human keypoint detection dataset, our model outperforms Simple Baseline on both val2017 and test-dev2017. Due to the limitations of the backbone network used in our model, the computational cost is still relatively large.

Acknowledgments

This work is supported by the Science and Technology Program of Beijing (No. Z181100009218012).

References

1. Krizhevsky, Sutskever and Hinton G.E., ImageNet classification with deep convolutional neural networks, Communications of the ACM 60(6) (2017), 84–90.

2. Toshev and Szegedy, DeepPose: Human pose estimation via deep neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1653–1660.

3. Liu, Shahroudy and Xu, Skeleton-based action recognition using spatio-temporal LSTM network with trust gates, IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12) (2018), 3007–3021.

4. Varadarajan, Subramanian and Buló S.R., Joint estimation of human pose and conversational groups from social scenes, International Journal of Computer Vision 126(2–4) (2018), 410–429.

5. Fast pedestrian detection based on feature of local model, Journal of Computational Methods in Sciences and Engineering 15(3) (2015), 387–393.

6. Cherian, Mairal and Alahari, Mixing body-part sequences for human pose estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2353–2360.

7. Chang M.C. and Wang, Fast Online Upper Body Pose Estimation from Video, 2015, pp. 104.1–104.12.

8. Fang, Xie, Tai and Lu, RMPE: Regional Multi-person Pose Estimation, 2017 IEEE International Conference on Computer Vision, 2017, pp. 2353–2362.

9. Wang and Yin, Rethinking on multi-stage networks for human pose estimation, 2019.

10. Bin, Haiping and Yichen, Simple baselines for human pose estimation and tracking, European Conference on Computer Vision, 2018.

11. Chen, Wang and Peng, Cascaded pyramid network for multi-person pose estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7103–7112.

12. Sun, Xiao and Liu, Deep High-Resolution Representation Learning for Human Pose Estimation, Conference on Computer Vision and Pattern Recognition, 2019.

13. Cao and Simon, Realtime multi-person 2D pose estimation using part affinity fields, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.

14. Papandreou, Zhu and Chen L.C., PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model, Proceedings of the European Conference on Computer Vision, 2018, pp. 269–286.

15. Newell, Huang and Deng, Associative embedding: End-to-end learning for joint detection and grouping, Advances in Neural Information Processing Systems, 2017, pp. 2277–2287.

16. Pavlakos, Zhou and Derpanis K.G., Coarse-to-fine volumetric prediction for single-image 3D human pose, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7025–7034.

17. Pavllo, Feichtenhofer and Grangier, 3D human pose estimation in video with temporal convolutions and semi-supervised training, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7753–7762.

18. Wandt and Rosenhahn, RepNet: Weakly supervised training of an adversarial reprojection network for 3D human pose estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7782–7791.

19. Mehta, Sotnychenko and Mueller, XNect: Real-time multi-person 3D motion capture with a single RGB camera, ACM Transactions on Graphics 39(2) (2020), 82:1–82:17.

20. Alejandro, Kaiyu and Jia, Stacked hourglass networks for human pose estimation, European Conference on Computer Vision, Springer International Publishing, 2016.

21. and Zhang, Cascade feature aggregation for human pose estimation, 2019.

22. Zhang, Ren and Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

23. Cheng, Xiao and Wang, HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5386–5395.

24. Duta I.C., Liu and Zhu, Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition, 2020.

25. Ren and Girshick, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems, 2015, pp. 91–99.

26. Shen and Sun, Squeeze-and-excitation networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.

27. Lin, Maire and Belongie S.J., Microsoft COCO: common objects in context, European Conference on Computer Vision, 2014, pp. 740–755.

28. Cao and Lin, GCNet: Non-local networks meet squeeze-excitation networks and beyond, Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.

29. Wang, Girshick and Gupta, Non-local neural networks, IEEE Conference on Computer Vision and Pattern Recognition, 2018.

30. Chu, Yang and Ouyang, Multi-context attention for human pose estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831–1840.
