Human pose estimation based on parallel atrous convolution and body structure constraints

Abstract

Human pose estimation remains a challenging task in computer vision, and it becomes harder still under camera view changes, joint occlusion and overlapping body parts. Most existing methods pass the input through a network that typically consists of high-to-low resolution sub-networks connected in series, so spatial relationships and details may be lost during the up-sampling process. This paper designs a parallel atrous convolutional network with body structure constraints (PAC-BCNet) to address the problem. The parallel atrous convolution (PAC) deals with scale changes by connecting multiple atrous convolution sub-networks with different rates in parallel, which extracts features at different scales without reducing the resolution. The body structure constraints (BC), which enhance the correlation between keypoints, capture better spatial relationships of the body through designed keypoint constraint sets and an improved loss function. A comparative experiment between serial and parallel atrous convolution and an ablation study with and without body structure constraints are conducted, and both support the effectiveness of the approach. The model is evaluated on two widely used human pose estimation benchmarks (MPII and LSP). The method achieves strong performance on both datasets, reaching 92.5% PCKh@0.5 on MPII and 93.7% PCK@0.2 on LSP.

Keywords

Computer vision, human pose estimation, parallel atrous convolution, body structure constraints

1 Introduction

Human pose estimation is one of the most fundamental tasks in computer vision. Current approaches mainly use deep learning to map an input image to a set of body keypoints that are geometrically constrained and interdependent. Achieving a better understanding of human posture and limb articulation is crucial for high-level vision tasks, such as motion capture, human-computer interaction, and activity recognition.

After many years of research, significant progress has been made and numerous practical methods have been proposed. Earlier work broadly combined local detectors with structural constraints to model the spatial relationships among body parts. One such model of human pose explicitly captures a variety of pose modes; unlike other multimodal models, it includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. Later, a hierarchical spatial model was proposed that can capture an exponential number of poses with a compact mixture representation on each part. Recently, with the development of hardware and datasets, deep models have achieved state-of-the-art performance in human pose estimation [11–14, 16]. Tompson et al. [11, 13] estimated keypoint heatmaps and then selected the locations with the highest heat values as the keypoints. In [14], a top-down approach was adopted to estimate keypoints for multiple persons, which first detected the human bodies and then determined to whom the keypoints belong.

To cope with scale changes, most existing methods pass the input through a network that typically consists of high-to-low resolution sub-networks connected in series. Still, spatial relationships and details may be lost during the up-sampling process. For instance, Hourglass [16] and PRMs [4] include several hourglass sub-networks, each of which reduces the resolution through a down-sampling process and recovers it through a symmetric up-sampling process. In the recurrent stacking network [17], there is continuous down-sampling between sub-modules and continuous up-sampling within sub-modules. However, these operations cause keypoint estimation errors when joints are occluded or the background is indistinguishable from the human body.

Prior knowledge of body structure constraints can provide key information for human pose estimation, from which the location of an occluded keypoint can be inferred, especially in real-world situations with complex multi-person activities and cluttered backgrounds. For instance, when the head is invisible or indistinct, its position can be roughly inferred from the keypoints of the neck and thorax because these three keypoints usually lie on the same line. The structure-aware loss [49] was proposed to improve the matching of the keypoints and the relationships between them. However, it only describes adjacent keypoints and ignores the overall constraints. Moreover, the adjacent keypoints are also incompletely considered, which causes a loss of accuracy.

PAC-BCNet is proposed in this paper to solve the above problems. First, the features are processed down to a very low resolution through convolutional and max pooling layers. After the lowest resolution is reached, multiple atrous convolution branches learn features at different scales while keeping the resolution unchanged, and the features of all branches are concatenated at the end. In addition, human pose estimation is an associative task, so the correlation between keypoints is very important. Taking this as the starting point, this paper improves the calculation of the loss function by designing new keypoint association rules, which consist of single keypoint sets and combined keypoint sets, to further improve accuracy. The resulting network is illustrated in Fig. 1.

Fig.1

Illustrating the architecture of the proposed PAC-BCNet. It mainly consists of parallel atrous convolution sub-networks and body structure constraints module.

Compared with other current networks, the contributions of this paper are as follows: (1) Connecting multiple different atrous convolution branches in parallel not only learns features at different scales but also maintains high-resolution feature maps throughout, so that high-quality feature maps with more detailed information are obtained. (2) The model takes the correlation between keypoints into account by improving the calculation of the loss function. (3) Experiments on two popular human pose estimation datasets demonstrate the effectiveness of the method.

2 Related work

Traditional human pose estimation mostly relies on graph structures and loopy structures, and it has recently been improved significantly by exploiting deep learning in computer vision. Starting from deep learning, many effective solutions have been proposed [21–28]. G. Gkioxari et al. [21] transformed a CNN into a recurrent network to solve the problem of human pose estimation in individual images and videos. K. Sun et al. [27] presented a two-stage normalization scheme, human body normalization and limb normalization, to make the distribution of the relative joint locations compact, resulting in easier learning of convolutional spatial models and more accurate pose estimation. In summary, there are two mainstream methods: one regresses the positions of keypoints directly, and the other estimates keypoint heatmaps and then selects the locations with the highest heat values as keypoints.
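The second mainstream method can be illustrated with a minimal sketch. The function name `heatmap_to_keypoint` and the toy heatmap below are assumptions made for illustration and are not taken from any of the cited methods; the sketch only shows the step of selecting the location with the highest heat value.

```python
def heatmap_to_keypoint(heatmap):
    """Return (row, col, score) of the highest value in a 2D heatmap.

    `heatmap` is a list of equal-length rows of floats (a hypothetical
    format chosen for this sketch).
    """
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best[2]:
                best = (r, c, value)
    return best


# Toy 3x4 heatmap for one keypoint; the peak sits at row 1, column 2.
toy_heatmap = [
    [0.0, 0.1, 0.2, 0.0],
    [0.1, 0.3, 0.9, 0.2],
    [0.0, 0.2, 0.4, 0.1],
]
print(heatmap_to_keypoint(toy_heatmap))
```

In a full pipeline one such heatmap would be predicted per keypoint, and the selected location would be mapped back to the coordinates of the input image.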

For methods that directly regress the locations of keypoints, the datasets are generally normalized first, and the regression is trained with the L2 norm as the loss function. In contrast, most convolutional neural networks that estimate keypoint heatmaps first reduce the resolution of the input, then send it to the main body of the network, which obtains the keypoint locations from the keypoint heatmaps, and finally recover the resolution. The main body mainly adopts a high-to-low and low-to-high framework, from which features at different scales can be learned. The high-to-low process mainly produces low-resolution, high-quality features, while the low-to-high process produces high-resolution features to recover the resolution. Repeating these two processes can achieve better performance. Taking the Hourglass [16] as an example, the main contribution of this network is adopting multi-scale features to estimate human pose. In earlier network structures for human pose estimation, each layer loses some information as the network deepens, so the last layer loses the most; yet most networks use only the features of the last layer, which wastes the features of earlier layers. Feature fusion addresses this problem: its basic idea is to add the feature map of a previous layer to the feature map of the current layer before the convolution operation. This matters especially for pose estimation, where different keypoints of the body are not recognized best on the same feature map. For example, the arm may be easily recognized on the feature map of the third layer while the head is easier to recognize on the fifth layer. Therefore, the Hourglass combines the features of each layer with each other. Unlike common feature fusion, it adopts skip connections. The full module is illustrated in Fig. 2.

Fig.2

An illustration of a single “hourglass” module. During the upsampling process, each feature output will cascade the features of the symmetric layer.

To capture better multi-scale features, atrous convolution was first proposed in [7], which uses it to systematically aggregate multi-scale context information without losing resolution. This architecture is based on the fact that atrous convolution supports the exponential expansion of the receptive field without losing resolution or coverage. The experiments show that atrous convolution is particularly beneficial for capturing long-distance features, and human pose estimation needs to extract such long-distance features, for example along the arms.

Due to the successful application of atrous convolution in image segmentation, it has been used in the field of human pose estimation in recent years. HRNet, proposed in [6], maintains high-resolution representations throughout the network by connecting high-resolution and low-resolution sub-networks in parallel. Based on [6], Bao Y et al. [1] proposed a double attention mechanism that is added to the forward transmission of the parallel sub-networks. Its purpose is to assign weights to the transmitted information without changing the number of channels, and to treat the information with large weights as useful information in order to reduce the interference caused by irrelevant information. Liu Z et al. [2] proposed a novel multi-frame human pose estimation framework that leverages abundant temporal cues between video frames to facilitate keypoint detection.

3 Method

This paper proposes a human pose estimation method based on parallel atrous convolution (PAC) and a body structure constraint loss (BC). This section first discusses how atrous convolution is applied to extract features for human pose estimation, and then explains the construction of the PAC module and the BC module separately.

3.1 Atrous convolution for feature extraction

Atrous convolution, taken literally, injects holes into a standard convolution kernel and thereby increases the receptive field. Compared with the standard convolutional layer, atrous convolution has an additional hyperparameter called the expansion rate (also known as the dilation rate), which refers to the spacing between the elements of the convolution kernel (for standard convolution, the rate is 1). Using atrous convolution avoids reducing the resolution and captures more context. This structure provides a larger receptive field without using a pooling layer, and the amount of computation is the same as for a standard convolution with the same kernel size.

Consider the two-dimensional case. For each location i on the output y and a filter k, atrous convolution is applied over the input feature map x: $y[i] = \sum_{d} x[i + r \cdot d] \, k[d]$ (1) where the atrous rate r is the expansion rate of the atrous convolution kernel, that is, the spacing between its elements. Standard convolution is the special case r = 1. Setting different values of r yields receptive fields of different sizes, as shown in Fig. 3.
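Eq. (1) can be made concrete with a small one-dimensional sketch. The function name `atrous_conv1d`, the centred kernel index and the zero padding at the borders are assumptions made here for illustration; the two-dimensional case applies the same rule along both axes.

```python
def atrous_conv1d(x, k, r):
    """Apply y[i] = sum_d x[i + r*d] * k[d] with zero padding.

    The kernel index d is centred, so a 3-tap kernel uses d in {-1, 0, 1}.
    Positions that fall outside `x` are treated as zeros, which keeps
    len(y) == len(x).
    """
    centre = len(k) // 2
    y = []
    for i in range(len(x)):
        total = 0
        for j, weight in enumerate(k):
            src = i + r * (j - centre)  # input index sampled by this tap
            if 0 <= src < len(x):
                total += x[src] * weight
        y.append(total)
    return y


x = [1, 2, 4, 8, 16, 32, 64]
k = [1, 1, 1]
print(atrous_conv1d(x, k, r=1))  # standard convolution (r = 1)
print(atrous_conv1d(x, k, r=2))  # same kernel, taps two samples apart
```

With r = 2 the same three weights span five input samples instead of three, while the output keeps the length of the input.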

Fig.3

Schematic diagram of atrous convolution.

It can be seen from Fig. 3 that the output feature size is unchanged for both the standard convolution kernel in the first row and the atrous convolution kernel in the second row: both are 6×6. To keep the feature map resolution constant for a given expansion rate, the feature map must be padded first. Generally, zeros are padded along the edge, and for a 3×3 kernel the number of padded layers equals the expansion rate. In this way, the resolution after convolution does not decrease. Assume the original feature is f0. First, an atrous convolution with an expansion rate of 1 (standard convolution) is applied to f0 to generate f1. The receptive field of a point on f1 relative to f0 is 3×3, that is, one point on f1 integrates the information of a 3×3 area on f0. Then an atrous convolution kernel with an expansion rate of 2 is applied to f1 to obtain f2. Each of its taps covers a 3×3 area of f0, and the taps are spaced two pixels apart, so the receptive field of f2 relative to f0 is 7×7. If both convolutions used an ordinary 3×3 kernel, the receptive field of f2 would be only 5×5. Thus atrous convolution increases the receptive field without changing the output feature size.
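The receptive-field and padding arithmetic above can be checked with a short sketch. The helper names `receptive_field` and `output_size` are hypothetical, and a stride of 1 is assumed throughout.

```python
def receptive_field(kernel_size, rates):
    """Receptive field (per axis) of stacked stride-1 atrous convolutions."""
    rf = 1
    for r in rates:
        # Each layer widens the field by r * (kernel_size - 1) pixels.
        rf += r * (kernel_size - 1)
    return rf


def output_size(input_size, kernel_size, rate, padding):
    """Output size (per axis) of one stride-1 atrous convolution."""
    effective_kernel = rate * (kernel_size - 1) + 1
    return input_size + 2 * padding - effective_kernel + 1


print(receptive_field(3, [1, 2]))  # rate 1 followed by rate 2
print(receptive_field(3, [1, 1]))  # two ordinary 3x3 convolutions
print(output_size(6, 3, rate=2, padding=2))  # padding equal to the rate
```

The first two calls reproduce the 7×7 and 5×5 receptive fields discussed in the text, and the last call confirms that padding by the expansion rate keeps a 6×6 feature map at 6×6 for a 3×3 kernel.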

3.2 Parallel atrous convolutional network

PACNet refers to a network whose main body is composed of multiple atrous convolution branches. Each branch applies atrous convolution kernels with a different atrous rate to the feature map, so features at different scales can be learned, thereby improving accuracy. However, as the atrous rate becomes larger, the number of valid filter weights becomes smaller. For example, when a 3×3 filter is applied to a 64×64 feature map with different atrous rates and the rate is close to the size of the feature map, the operation degenerates to a simple 1×1 convolution, because only the center weight of the kernel falls on valid (non-padded) positions. To solve this problem, this paper applies global average pooling to the last feature map of the model, feeds the resulting image-level features to a 1×1 convolution with 256 filters (and batch normalization), and then bilinearly up-samples the result to the desired spatial dimension. Finally, the features of all branches are concatenated and passed to another 1×1 convolution (also with 256 filters and batch normalization), and the result is obtained through the final 1×1 convolution. The PACNet is shown in Fig. 4.
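The shrinking number of valid filter weights can be counted directly. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes a 3×3 kernel, zero padding, and counts a weight as valid when its tap lands inside the feature map. The helper names are hypothetical.

```python
def valid_taps(size, rate, row, col):
    """Count the 3x3 kernel taps that land inside a size x size feature map."""
    count = 0
    for dy in (-rate, 0, rate):
        for dx in (-rate, 0, rate):
            if 0 <= row + dy < size and 0 <= col + dx < size:
                count += 1
    return count


def mean_valid_taps(size, rate):
    """Average number of valid taps over all output positions."""
    total = sum(
        valid_taps(size, rate, r, c)
        for r in range(size)
        for c in range(size)
    )
    return total / (size * size)


# Average number of effective weights on a 64x64 map for growing rates.
for rate in (1, 16, 32, 48, 64):
    print(rate, round(mean_valid_taps(64, rate), 2))
```

When the rate reaches the feature map size, every off-center tap falls into the padding, so the average drops to exactly one effective weight per position, which is the 1×1 behavior described above.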

Fig.4

Structure diagram of PACNet.

3.3 Setting the expansion rate of the atrous convolution

The effective size of the atrous convolution kernel increases with the expansion rate, but not all combinations of expansion rates achieve better results. If a 3×3 convolution kernel with an expansion rate of 2 is simply stacked multiple times, problems arise. As shown in Fig. 5, the atrous convolution kernel (the non-white part in the figure) is not continuous, which means that not every pixel of the feature map is used in the calculation, so the continuity of the information is lost and the prediction may be wrong. Hybrid atrous convolution was proposed to solve this discontinuity problem: the expansion rates of the stacked atrous convolutions must not have a common divisor greater than 1, and they are arranged in a sawtooth structure, such as [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]. This arrangement retains a complete and continuous 3×3 area from the beginning and ensures the continuity of the receptive field.
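The discontinuity can be reproduced in one dimension by enumerating which input offsets a stack of 3-tap atrous convolutions can reach. This is a minimal sketch with hypothetical helper names, assuming stride 1 and the same behavior along each image axis.

```python
from itertools import product


def covered_offsets(rates):
    """Input offsets reachable from one output position (1D, 3-tap kernels)."""
    offsets = set()
    for taps in product((-1, 0, 1), repeat=len(rates)):
        offsets.add(sum(r * d for r, d in zip(rates, taps)))
    return offsets


def has_holes(rates):
    """True if some offset inside the receptive field is never sampled."""
    offsets = covered_offsets(rates)
    span = range(min(offsets), max(offsets) + 1)
    return any(o not in offsets for o in span)


print(sorted(covered_offsets([2, 2, 2])))  # only even offsets are reached
print(has_holes([2, 2, 2]))
print(has_holes([1, 2, 3]))
```

Stacking rate 2 three times reaches only even offsets, whereas the sawtooth rates [1, 2, 3] cover every offset in the same span of 13 pixels.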

Fig.5

The effect diagram of the 3×3 convolution kernel with multiple stacking expansion rate of 2.

3.4 Body structure constraints loss

In the past, many methods calculated only the error between the predicted value and the real value of each keypoint in turn when computing the loss function, and then summed these errors as the final loss. However, this is not rigorous for human pose estimation, because the task also involves the correlation between keypoints and the physical structure constraints. Inspired by [49], this paper designs a body structure constraint loss to model the human keypoints, as shown in Fig. 6. Let the set S_t represent the t-th combination of keypoints, as shown in Table 1; each combination carries a physical structure constraint. Compared with [49], the combinations of keypoints in this paper are more comprehensive and more reasonable, and they relate the keypoints to each other more closely. In addition, in previous work the accuracy of the hips, knees and ankles is lower than that of other keypoints, so this paper adds the combination of these three keypoints to the collection separately. The i-th scale body structure constraint loss is defined as follows: $L_{i} = \frac{1}{K} \sum_{k=1}^{K} \| P_{k}^{i} - G_{k}^{i} \|_{2} + \alpha \sum_{t=1}^{T} \| P_{S_{t}}^{i} - G_{S_{t}}^{i} \|_{2}$ (2)

Fig.6

Visual structure of the body structure constraint model.

Table 1

Single keypoint and combined keypoints

Single keypoint	Combined keypoints
Head	Head, Neck
Neck	Head, Neck, Thorax
Shoulder	Head, Neck, Thorax, Pelvis
Thorax	Head, Neck, Thorax, Pelvis, Hip
Elbow	Head, Neck, Thorax, Pelvis, Knee
Wrist	Head, Neck, Thorax, Pelvis, Knee, Ankle
Hip	Head, Neck, shoulder
Pelvis	Head, Neck, Shoulder, Elbow
Knee	Head, Neck, Shoulder, Elbow, Wrist
Ankle	Hip, Knee, Ankle

The first term on the right side of the equation is the individual loss of each keypoint, that is, the error between the predicted confidence map P and the ground-truth confidence map G at the keypoint (x, y). The second term is the body structure constraint loss, where α is the weight coefficient, K is the total number of keypoints and T is the number of keypoint combinations. Fig. 6 visualizes the body structure constraint loss: the top row shows the loss of each single keypoint, and the bottom row shows the body constraint loss designed in this paper. It is important to point out that the collection S does not simply add one arbitrary keypoint at a time. For example, the first element of S is the combination of head and neck, not the combination of head and ankle or some other keypoint set. Reasonable physical connections are considered in this process, which allows the model to capture better correlation between keypoints.
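Eq. (2) can be sketched for a single scale as follows. This is not the authors' implementation: it assumes that each confidence map is flattened to a list of floats and that the term for a combination S_t is the L2 norm over the concatenated maps of all keypoints in that combination. The function names and the toy values are hypothetical.

```python
import math


def l2_distance(pred_maps, true_maps):
    """L2 norm of the difference over a group of flattened confidence maps."""
    squared = 0.0
    for p_map, g_map in zip(pred_maps, true_maps):
        for p, g in zip(p_map, g_map):
            squared += (p - g) ** 2
    return math.sqrt(squared)


def body_constraint_loss(pred, true, groups, alpha):
    """Single-keypoint term averaged over K plus alpha times the group term."""
    names = list(pred)
    single = sum(l2_distance([pred[k]], [true[k]]) for k in names) / len(names)
    combined = sum(
        l2_distance([pred[k] for k in group], [true[k] for k in group])
        for group in groups
    )
    return single + alpha * combined


# Toy example: two keypoints with two-pixel maps and one combination.
pred = {"head": [1.0, 0.0], "neck": [0.0, 0.0]}
true = {"head": [0.0, 0.0], "neck": [0.0, 0.0]}
groups = [("head", "neck")]  # first combination listed in Table 1
print(body_constraint_loss(pred, true, groups, alpha=0.5))
```

In the toy example the single-keypoint term is 0.5 and the combination term is 1.0, so with α = 0.5 the loss evaluates to 1.0. Because an error on the head also raises the loss of every combination that contains the head, keypoints that appear in many combinations are penalized more strongly.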

4 Experiment

Datasets. The proposed method is evaluated on two popular datasets, i.e., the extended Leeds Sports Pose (LSP) dataset and the MPII Human Pose dataset. The LSP dataset includes 11 k training images and 1 k test images from sports activities. MPII includes about 25 k images containing 40 k person examples (about 28 k for training and 11 k for testing). The people in the images are captured from various challenging viewpoints, and 16 keypoints are annotated on the whole body. The model is evaluated on a validation set of approximately 3 k images because the MPII dataset does not provide annotations for the test set.

Experimental Settings. Each image is first cropped around the center of the target person in the scene, and the image patch is then warped to a size of 256×256 pixels. Following the data augmentation in [16], the image is rotated (±30 degrees) and scaled (0.75–1.25), making the network more robust to different orientations and scales. The initial learning rate is set to 1e-3 and is reduced to 1e-4 at epoch 170 and to 1e-5 at epoch 200; the entire training process is terminated at epoch 250.
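The learning-rate schedule can be written as a simple step function. The sketch assumes zero-based epoch indices and that each reduction takes effect from the stated epoch onward; neither detail is specified in the text, and the function name is hypothetical.

```python
def learning_rate(epoch):
    """Step schedule: 1e-3, then 1e-4 from epoch 170, then 1e-5 from epoch 200."""
    if epoch < 170:
        return 1e-3
    if epoch < 200:
        return 1e-4
    return 1e-5


TOTAL_EPOCHS = 250  # training stops after this many epochs

for epoch in (0, 169, 170, 199, 200, TOTAL_EPOCHS - 1):
    print(epoch, learning_rate(epoch))
```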

Evaluation Criteria. The experimental results are evaluated with the standard Percentage of Correct Keypoints (PCK). For the LSP dataset, PCK@0.2 is used, that is, a predicted joint is counted as correct if its distance to the true joint is less than 0.2 × the torso diameter. For the MPII dataset, the PCKh (head-normalized percentage of correct keypoints) score is used. A joint is considered correct if it falls within αl pixels of the ground-truth location, where α is a constant and l is the head size, which corresponds to 60% of the diagonal length of the ground-truth head bounding box. The PCKh@0.5 (α = 0.5) score is used on the MPII dataset.
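The PCKh criterion can be sketched for a single person as follows. The function names, the `(x1, y1, x2, y2)` box format and the toy coordinates are assumptions for illustration; this is not the official MPII evaluation code.

```python
import math


def head_size(head_box):
    """Head size l: 60% of the diagonal of the ground-truth head box."""
    x1, y1, x2, y2 = head_box
    return 0.6 * math.hypot(x2 - x1, y2 - y1)


def pckh(pred, truth, head_box, alpha=0.5):
    """Fraction of joints predicted within alpha * l pixels of the truth."""
    threshold = alpha * head_size(head_box)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(truth)


# Toy example: a 30 x 40 head box has a 50-pixel diagonal, so l = 30 and
# the PCKh@0.5 threshold is 15 pixels.
truth = [(0, 0), (0, 0)]
head_box = (0, 0, 30, 40)
print(pckh([(0, 0), (10, 0)], truth, head_box))  # both joints within 15 px
print(pckh([(0, 0), (20, 0)], truth, head_box))  # second joint is 20 px away
```

PCK@0.2 on LSP follows the same pattern with the torso diameter in place of the head size and a factor of 0.2.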

5 Results

5.1 Evaluation

Results on the MPII dataset. The performance on the MPII dataset is reported in Table 2. The approach achieves a 92.5% PCKh score at the threshold of 0.5, which is better than the Hourglass [16] and most of its extensions. Qualitative results are shown in Fig. 12.

Table 2
Performance comparisons on the MPII test set (PCKh@0.5)

Method	Hea	Sho	Elb	Wri	Hip	Kne	Ank	Total
Insafutdinov et al. [50]	96.8	95.2	89.3	84.4	88.4	83.4	78.0	88.5
Wei et al. [51]	97.8	95.0	88.7	84.0	88.4	82.8	79.4	88.5
Bulat et al. [40]	97.9	95.1	89.9	85.3	89.4	85.7	81.7	89.7
Newell et al. [16]	98.2	96.3	91.2	87.1	90.1	87.4	83.6	90.9
Sun et al. [27]	98.1	96.2	91.2	87.2	89.8	87.4	84.1	91.0
Tang et al. [41]	97.4	96.4	92.1	87.7	90.2	87.7	84.3	91.2
Ning et al. [42]	98.1	96.3	92.2	87.8	90.2	87.7	84.3	91.2
Luvizon et al. [18]	98.1	96.6	92.0	87.5	90.6	88.0	82.7	91.2
Chu et al. [32]	98.5	96.3	91.9	88.1	90.6	88.0	85.0	91.5
Chou et al. [19]	98.2	96.8	92.2	88.0	91.3	89.1	84.9	91.8
Chen et al. [20]	98.1	96.5	92.5	88.5	90.2	89.6	86.0	91.9
Yang et al. [4]	98.5	96.7	92.5	88.7	91.1	88.6	86.0	92.0
Ke et al. [5]	98.5	96.8	92.7	88.4	90.6	89.3	86.3	92.1
Tang et al. [23]	98.4	96.9	92.6	88.7	91.8	89.4	86.2	92.3
8-Stack HG [8]	98.6	97.0	92.8	88.8	91.7	89.8	86.6	92.5
UniPose [9]	/	/	/	/	/	/	/	92.7
Ours	98.6	96.9	92.8	88.5	92.7	89.8	86.7	92.5

Complexity. The complexity of the model increases with the improvement of accuracy. Relative to an eight-stack hourglass network, the model increases the number of parameters by 8.4%, from 25.1 M to 27.2 M. The method needs 9.5 GFLOPs for a 256×256 RGB image, which is less than the hourglass network (19.1 GFLOPs), as shown in Table 3.

Table 3

#Params and GFLOPs of some top-performing methods. The GFLOPs are computed with an input size of 256×256

Method	#Params	GFLOPs	PCKh@0.5
Insafutdinov et al. [50]	42.6 M	41.2	88.5
Newell et al. [16]	25.1 M	19.1	90.9
Yang et al. [4]	28.1 M	21.3	92.0
Tang et al. [23]	15.5 M	15.6	92.3
Ours	27.2 M	9.5	92.5
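The relative figures quoted for the complexity comparison follow from the Table 3 values by simple arithmetic, sketched below. The variable names are hypothetical; the numbers are taken from Table 3.

```python
# Values from Table 3 (parameters in millions, cost in GFLOPs).
params_hourglass, params_ours = 25.1, 27.2
gflops_hourglass, gflops_ours = 19.1, 9.5

# Relative growth in parameters, in percent.
param_increase = (params_ours - params_hourglass) / params_hourglass * 100

# Fraction of the hourglass computation that the proposed model needs.
gflops_ratio = gflops_ours / gflops_hourglass

print(round(param_increase, 1))
print(round(gflops_ratio, 2))
```

The parameter count grows by about 8.4%, while the computation is roughly half that of the hourglass network.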

Results on the LSP dataset. The performance on the LSP dataset is reported in Table 4. The approach achieves a 93.7% PCK score at the threshold of 0.2. Following previous methods [14, 50], the model is trained by adding the MPII training set to the LSP training set and its extended training set. Compared with earlier methods such as [50, 51], the results are greatly improved, which this paper attributes to the proposed PACNet avoiding the loss of detailed information. Qualitative results are shown in Fig. 12.

Table 4

Comparisons of PCK@0.2 score on the LSP dataset

Method	Hea	Sho	Elb	Wri	Hip	Kne	Ank	Total
Belagiannis&Zisserman [17]	95.2	89.0	81.5	77.0	83.7	87.0	82.8	85.2
Lifshitz et al. [22]	96.8	89.0	82.7	79.1	90.9	86.0	82.5	86.7
Pishchulin et al. [14]	97.0	91.0	83.8	78.1	91.0	86.7	82.0	87.1
Insafutdinov et al. [50]	97.4	92.7	87.5	84.4	91.5	89.9	87.2	90.1
Wei et al. [51]	97.8	92.5	87.0	83.9	91.5	90.8	89.9	90.5
Bulat &Tzimiropoulos [34]	97.2	92.1	88.1	85.2	92.2	91.4	88.7	90.7
Wei Yang et al. [4]	98.3	94.5	92.2	88.9	94.4	95.0	93.7	93.9
8-Stack HG [8]	98.4	94.8	92.0	89.4	94.4	94.8	93.8	94.0
UniPose [9]	/	/	/	/	/	/	/	94.5
Ours	98.5	95.3	91.5	89.4	93.6	94.8	92.8	93.7

5.2 Ablation study

Using the 8-stack hourglass network as a benchmark model, this paper conducts an ablation study on the MPII validation set.

Structure of atrous convolution. This paper first evaluates two different structures of atrous convolution networks, namely serial atrous convolution and parallel atrous convolution. When standard convolution and pooling operations are applied to an image, the resolution of the feature map becomes very small at the later stages, and some details are lost. With atrous convolution, the receptive field continues to increase in the later stages while the resolution remains the same. As shown in Fig. 7, the feature information extracted by the serial structure in the red box is not as good as that extracted using PAC. Therefore, the features obtained by the atrous convolution operations in the last steps are concatenated to obtain the final features, which reduces information loss. The sub-modules of the 8-stack hourglass network are replaced with serial and parallel atrous convolutional network modules respectively, and a comparative experiment is conducted on the MPII validation set. The experimental results show that the parallel atrous convolutional network performs much better than the serial one, as demonstrated in Fig. 8.

Fig.7

Structure of serial atrous convolutional network.

Fig.8

Comparison results of serial atrous convolution and PAC. A represents the Hourglass model (benchmark model), B represents serial atrous convolutional model (replace the hourglass sub-networks in A with serial atrous convolution), C represents PAC (replace the hourglass sub-networks in A with PAC).

Body structure constraint loss module. Similarly, this paper uses the 8-stack hourglass network as the benchmark model and replaces its keypoint loss function with the body structure constraint loss function, which includes the loss of single keypoints and the loss of combined keypoints. A comparative experiment between the two models is shown in Fig. 9. The benchmark model performs worse than the model with the body structure constraint loss, which enhances the correlation between keypoints.

Fig.9

Comparison results of the benchmark model and the model with body structure constraint loss. D represents the model with body structure constraint loss, which includes the loss of single keypoints and the loss of combined keypoints.

Because both the BC loss and the PACNet individually obtain better results than the benchmark model, it is reasonable to add both of them to the benchmark model, that is, to use the PACNet together with the BC loss. The result of this comparative experiment is shown in Fig. 10. In addition, the comparison between the benchmark model and the final model is shown in Fig. 11.

Fig.10

Comparison results of the PAC model, the model with body structure constraint loss and the combined model. E is equivalent to C + D.

Fig.11

Comparison results of the benchmark model and the final model.

Fig.12

Results on the MPII (top) and the LSP dataset (bottom).

6 Conclusion

This paper presented a parallel atrous convolutional network with body structure constraints to address the challenging problem of human pose estimation in complex scenes. The success stems from three aspects: (i) Connecting multiple different atrous convolution branches in parallel not only learns features at different scales but also maintains high-resolution feature maps throughout, so that high-quality feature maps with more detailed information are obtained. (ii) The body structure constraints are taken into account when calculating the loss, and the correlation between keypoints is used to improve the accuracy. (iii) Ablation studies support the effectiveness of both the parallel atrous convolution and the body structure constraints.

7 Limitation and further work

Although the accuracy of the proposed model exceeds that of most previous methods, and even exceeds the state of the art on some keypoints, the complexity of the model is somewhat high due to the multiple parallel atrous convolution branches and the many cases considered when constructing the body constraint loss. Simplifying the model without affecting its accuracy is worth studying in future work.

Data availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by Zhejiang Provincial Technical Plan Project (No. 2020C03105).

References

Bao

, Zhang

and Guo

, Human pose estimation based on Improved High Resolution Network[J], Journal of Physics: Conference Series 1961(1) (2021), 012060(6pp).

Liu

, Chen

, Feng

, et al., Deep Dual Consecutive Network for Human Pose Estimation[J], (2021).

Yan

, Xiong

and Lin

, Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition, In CVPR (2018).

Yang

, Li

, Ouyang

, Li

and Wang

, Learning feature pyramids for human pose estimation, In ICCV (2017), 1290–1299.

, Chang

, Qi

and Lyu

, Multi-scale structure-aware network for human pose estimation, CoRR, abs/1803.09894, (2018).

Sun

, Xiao

Bin

, Liu

Dong

and Wang

Jingdong

, Deep High-Resolution Representation Learning for Human Pose Estimation. arXiv:2v1 [cs.CV], (2019).

and Koltun

, Multi-Scale Context Aggregation by Dilation Convolutions, arXiv:2v3 [cs.CV], (2016).

Zhang

, Ouyang

, Liu

, Qi

, Shen

, Yang

and Jia

, Human pose estimation with spatial contextual information, arXiv preprint arXiv:0, (2019).

Artacho

and Savakis

, Unipose: Unified Human Pose Estimation in Single Images and Videos, arXiv:5v1 [cs.CV], (2020).

10.

Toshev

and Szegedy

, Deeppose: Human pose estimation via deep neural networks, In CVPR, (2014).

11.

Tompson

, Jain

, LeCun

and Bregler

, Joint training of a convolutional network and a graphical model for human pose estimation, In NIPS (2014).

12.

Chen

and Yuille

A.L.

, Articulated pose estimation by a graphical model with image dependent pairwise relations, In NIPS (2014).

13.

Tompson

, Goroshin

, Jain

, LeCun

and Bregler

, Efficient object localization using convolutional networks, In CVPR (2015).

14.

Pishchulin

, Insafutdinov

, Tang

, Andres

, Andriluka

, Gehler

P.V.

and Schiele

, Deepcut: Joint subset partition and labeling for multi-person pose estimation, In CVPR (2016).

15.

Wei

S.-E.

, Ramakrishna

, Kanade

and Sheikh

, Convolutional pose machines, In CVPR (2016).

16.

Newell

, Yang

and Deng

, Stacked hourglass networks for human pose estimation. In ECCV. Springer, (2016).

17. Belagiannis and Zisserman, Recurrent Human Pose Estimation, In CVPR (2017).

18. Luvizon D.C., Tabia and Picard, Human pose regression by combining indirect part detection and contextual information, CoRR, abs/1710.02322, (2017).

19. Chou, Chien and Chen, Self adversarial training for human pose estimation, CoRR, abs/1707.02439, (2017).

20. Chen, Shen, Wei, Liu and Yang, Adversarial posenet: A structure-aware convolutional network for human pose estimation, In ICCV (2017), 1221–1230.

21. Gkioxari, Toshev and Jaitly, Chained predictions using convolutional neural networks, In ECCV (2016), 728–743.

22. Lifshitz, Fetaya and Ullman, Human pose estimation using deep consensus voting, In ECCV (2016), 246–260.

23. Tang, Yu and Wu, Deeply learned compositional models for human pose estimation, In ECCV (2018).

24. Nie, Feng and Yan, Mutual learning to adapt for joint human parsing and pose estimation, In ECCV (2018).

25. Nie, Feng, Zuo and Yan, Human pose estimation with parsing induced learner, In CVPR (2018).

26. Peng, Tang, Yang, Feris R.S. and Metaxas, Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation, In CVPR (2018).

27. Sun, Lan, Xing, Zeng, Liu and Wang, Human pose estimation using global and local normalization, In ICCV (2017), 5600–5608.

28. Fan, Zheng, Lin and Wang, Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation, In CVPR (2015), 1347–1355.

29. Carreira, Agrawal, Fragkiadaki and Malik, Human pose estimation with iterative error feedback, In CVPR (2016), 4733–4742.

30. Toshev and Szegedy, Deeppose: Human pose estimation via deep neural networks, In CVPR (2014), 1653–1660.

31. Chu, Ouyang, Li and Wang, Structured feature learning for pose estimation, In CVPR (2016), 4715–4723.

32. Chu, Yang, Ouyang, Ma, Yuille A.L. and Wang, Multi-context attention for human pose estimation, In CVPR (2017), 5669–5678.

33. Yang, Ouyang, Li and Wang, End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation, In CVPR (2016), 3073–3082.

34. Bulat and Tzimiropoulos, Human pose estimation via convolutional part heatmap regression, In ECCV, volume 9911 of Lecture Notes in Computer Science, pages 717–732, Springer (2016).

35. Chen, Wang, Peng, Zhang, Yu and Sun, Cascaded pyramid network for multi-person pose estimation, CoRR, abs/1711.07319, (2017).

36. Hu and Ramanan, Bottom-up and top-down reasoning with hierarchical rectified gaussians, In CVPR (2016), 5600–5609.

37. Xiao, Wu and Wei, Simple baselines for human pose estimation and tracking, In ECCV (2018), 472–487.

38. Newell, Yang and Deng, Stacked hourglass networks for human pose estimation, In ECCV (2016), 483–499.

39. Tang, Yu and Wu, Deeply learned compositional models for human pose estimation, In ECCV (2018).

40. Bulat and Tzimiropoulos, Human pose estimation via convolutional part heatmap regression, In ECCV, volume 9911 of Lecture Notes in Computer Science, pages 717–732, Springer (2016).

41. Tang, Peng, Geng, Wu, Zhang and Metaxas D.N., Quantized densely connected u-nets for efficient landmark localization, In ECCV (2018), 348–364.

42. Ning, Zhang and He, Knowledge-guided deep fractal neural networks for human pose estimation, IEEE Trans. Multimedia 20(5) (2018), 1246–1259.

43. Papandreou, Kokkinos and Savalle P.-A., Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection, In CVPR (2015).

44. Cai, Fan, Feris R.S. and Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, In ECCV (2016), 354–370.

45. Chen, Papandreou, Kokkinos, Murphy and Yuille A.L., Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE Trans. Pattern Anal. Mach. Intell. 40(4) (2018), 834–848.

46. Xie and Tu, Holistically-nested edge detection, In ICCV (2015), 1395–1403.

47. Zhao, Shi, Qi, Wang and Jia, Pyramid scene parsing network, In CVPR (2017), 6230–6239.

48. Kanazawa, Sharma and Jacobs D.W., Locally scale-invariant convolutional neural networks, CoRR, abs/1412.5104, (2014).

49. Ke, Chang M.-C., Qi and Lyu, Multi-Scale Structure-Aware Network for Human Pose Estimation, In ECCV (2018).

50. Insafutdinov, Pishchulin, Andres, Andriluka and Schiele, Deepercut: A deeper, stronger, and faster multi person pose estimation model, In ECCV (2016), 34–50.

51. Wei, Ramakrishna, Kanade and Sheikh, Convolutional pose machines, In CVPR (2016), 4724–4732.