Abstract
Human pose estimation remains a challenging task in computer vision, and it becomes even harder under camera view changes and when joints are occluded or overlap. Most existing methods pass the input through a network that typically consists of high-to-low resolution sub-networks connected in series. During the up-sampling process, however, spatial relationships and details may be lost. This paper designs a parallel atrous convolutional network with body structure constraints (PAC-BCNet) to address the problem. The parallel atrous convolution (PAC) module handles scale changes by connecting multiple different atrous convolution sub-networks in parallel, extracting features at different scales without reducing the resolution. The body structure constraints (BC), which strengthen the correlation between keypoints, capture better spatial relationships of the body through keypoint constraint sets and an improved loss function. Comparative experiments on serial versus parallel atrous convolution, together with an ablation study with and without body structure constraints, support the effectiveness of the approach. The model is evaluated on two widely used human pose estimation benchmarks (MPII and LSP) and achieves better performance on both datasets.
Introduction
Human pose estimation is one of the most fundamental tasks in computer vision. It mainly uses deep learning to map an input image to multiple body keypoints that are geometrically constrained and interdependent. A better understanding of human posture and limb articulation is crucial for high-level vision tasks such as motion capture, human-computer interaction, and activity recognition.
After many years of research, significant progress has been made and numerous practical methods have been proposed. Earlier work broadly combined local detectors with structural constraints to model the spatial relationships among body parts. One model of human pose explicitly captures a variety of pose modes; unlike other multimodal models, it includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. A hierarchical spatial model was then proposed that can capture an exponential number of poses with a compact mixture representation on each part. Recently, with the development of hardware and datasets, deep models have achieved state-of-the-art performance in human pose estimation [11–14, 16]. Tompson et al. [11, 13] estimated keypoint heatmaps and then selected the locations with the highest heat values as the keypoints. In [14], a top-down approach was adopted to estimate keypoints for multiple persons, which first detected the human bodies and then determined to whom the keypoints belong.
To cope with scale changes, most existing methods pass the input through a network that typically consists of high-to-low resolution sub-networks connected in series. During the up-sampling process, however, spatial relationships and details may be lost. For instance, Hourglass [16] and PRMs [4] include several hourglass sub-networks, each of which reduces the resolution through down-sampling and recovers it through a symmetric up-sampling process. In the recurrent stacking network [17], there is continuous down-sampling between sub-modules and continuous up-sampling within sub-modules. These operations, however, cause keypoint estimation errors when joints are occluded or the background is hard to distinguish from the human body.
A priori knowledge of body structure constraints provides key information for human pose estimation, from which the location of an occluded keypoint can be inferred, especially in real-world situations with complex multi-person activities and cluttered backgrounds. For instance, when the head is invisible or indistinct, its location can be roughly inferred from the keypoints of the neck and thorax, because these three keypoints usually lie on the same line. The structure-aware loss [49] was proposed to improve the matching of keypoints and the relationships between them. However, it only describes adjacent keypoints and ignores the overall constraints. Moreover, even the adjacent keypoints are incompletely considered, which causes a loss of accuracy.
PAC-BCNet is proposed in this paper to solve the above problems. First, the features are processed down to a very low resolution through convolutional and max pooling layers. After the lowest resolution is reached, multiple atrous convolution branches learn features at different scales while keeping the resolution unchanged, and the features of all branches are combined at the end. In addition, human pose estimation is an associative task, so the correlation between keypoints is very important. Taking this as the starting point, the paper improves the calculation of the loss function by designing new keypoint association rules, which consist of single keypoint sets and combined keypoint sets, to further improve accuracy. The resulting network is illustrated in Fig. 1.

Architecture of the proposed PAC-BCNet. It mainly consists of parallel atrous convolution sub-networks and a body structure constraints module.
The contributions of this paper relative to existing networks are as follows: (1) Connecting multiple different atrous convolution branches in parallel not only learns features at different scales but also maintains high-resolution feature maps throughout, yielding high-quality feature maps and more detailed information. (2) The model takes the correlation between keypoints into account by improving the calculation of the loss function. (3) Experiments on two popular human pose estimation datasets demonstrate the advantage of the method.
Traditional human pose estimation mostly relies on graph structures and loopy structures, and it has recently been significantly improved by exploiting deep learning. Many effective deep learning solutions have since been proposed [21–28]. G. Gkioxari et al. [21] transformed a CNN into a recurrent network to solve human pose estimation in individual images and videos. K. Sun et al. [27] presented a two-stage normalization scheme, human body normalization and limb normalization, to make the distribution of the relative joint locations compact, resulting in easier learning of convolutional spatial models and more accurate pose estimation. In summary, there are two mainstream methods: one regresses the positions of keypoints directly, and the other estimates keypoint heatmaps and then selects the locations with the highest heat values as keypoints.
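The heatmap-based decoding step can be illustrated with a minimal sketch. It assumes one 2D heatmap per keypoint stored as a plain list of lists; the function names `heatmap_argmax` and `decode_keypoints` and the toy values are hypothetical and only show the "pick the hottest location" rule, not any specific network's post-processing.

```python
def heatmap_argmax(heatmap):
    """Return (row, col, value) of the highest value in a 2D heatmap."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best[2]:
                best = (r, c, v)
    return best


def decode_keypoints(heatmaps):
    """Decode one (row, col) location per keypoint heatmap."""
    return [heatmap_argmax(h)[:2] for h in heatmaps]


# Toy 3x4 heatmaps for two hypothetical keypoints.
toy = [
    [[0.0, 0.1, 0.0, 0.0],
     [0.2, 0.9, 0.3, 0.0],
     [0.0, 0.1, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.1, 0.2],
     [0.0, 0.0, 0.3, 0.8]],
]
print(decode_keypoints(toy))  # [(1, 1), (2, 3)]
```

Because the decoded location is limited to the heatmap grid, the resolution of the heatmap directly bounds the localization precision, which is one reason resolution matters in this setting.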
Methods that directly regress keypoint locations generally normalize the dataset first and then perform regression with the L2 norm as the loss function. In contrast, most convolutional neural networks that estimate keypoint heatmaps first reduce the resolution of the input, send it to the main body of the network, which obtains keypoint locations from the keypoint heatmaps, and recover the resolution at the end. The main body mostly adopts a high-to-low and low-to-high framework, which can learn features at different scales. The high-to-low process produces low-resolution, high-quality features, while the low-to-high process produces high-resolution features to recover the resolution. Repeating these two processes can achieve better performance. Taking Hourglass [16] as an example, the main contribution of this network is adopting multi-scale features to estimate human pose. In earlier network structures for human pose estimation, each layer loses some information as the network deepens, and the last layer loses the most; yet most networks use only the features of the last layer, which wastes features. Feature fusion addresses this problem: its basic idea is to add the feature map of the previous layer to the feature map of the current layer before the convolution operation. This matters especially for pose estimation, where different keypoints of the body are not best recognized on the same feature map. For example, the arm may be easily recognized on the feature map of the third layer, while the head is easier to recognize on the fifth layer. Therefore, Hourglass combines the features of the layers with one another. Unlike common feature fusion, it adopts skip connections. The full module is illustrated in Fig. 2.
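The high-to-low and low-to-high pattern with skip connections can be sketched on a 1D toy signal. This is only an illustration under simplifying assumptions: there are no learned convolutions, down-sampling is max pooling by 2, up-sampling is nearest-neighbour repetition, and the skip branch is merged by element-wise addition. The function names are hypothetical.

```python
def max_pool2(x):
    """Halve a 1D signal by taking the max of each adjacent pair."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]


def upsample2(x):
    """Double a 1D signal by nearest-neighbour repetition."""
    return [v for v in x for _ in range(2)]


def hourglass_1d(x, depth):
    """Toy hourglass: down-sample `depth` times, then up-sample back,
    adding the skip branch of the same resolution at every level.
    len(x) must be divisible by 2 ** depth."""
    if depth == 0:
        return list(x)
    skip = list(x)                                  # kept at the current resolution
    inner = hourglass_1d(max_pool2(x), depth - 1)   # processed at half resolution
    up = upsample2(inner)                           # back to the current resolution
    return [s + u for s, u in zip(skip, up)]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
out = hourglass_1d(signal, depth=2)
print(out)  # [7, 8, 11, 12, 19, 20, 23, 24]
```

The output has the same length as the input, and every output element mixes information from each resolution level, which is the property the skip connections are meant to provide.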

An illustration of a single “hourglass” module. During up-sampling, each feature output is combined with the features of the symmetric layer.
To better capture multi-scale features, atrous convolution was first proposed in [7], which uses it to systematically aggregate multi-scale context information without losing resolution. This architecture is based on the fact that atrous convolution supports exponential expansion of the receptive field without loss of resolution or coverage. Experiments show that atrous convolution is particularly beneficial for capturing long-distance features, and human pose estimation needs to extract such long-distance features, for example along the arms.
Following its successful application in image segmentation, atrous convolution has been used in human pose estimation in recent years. The HRNet proposed in [6] maintains high-resolution features throughout the network by running high-resolution and low-resolution sub-networks in parallel. Based on [6], Bao Y. et al. [1] proposed a double attention mechanism that is added to the forward pass of the parallel sub-networks. Its purpose is to assign weights to the transmitted information without changing the number of channels, treating information with large weights as useful and thereby reducing the interference caused by irrelevant information. Liu Z. et al. [2] proposed a multi-frame human pose estimation framework that leverages abundant temporal cues between video frames to facilitate keypoint detection.
This paper proposes a human pose estimation method based on parallel atrous convolution (PAC) and a body structure constraint (BC) loss. This section first discusses how atrous convolution is applied to extract features for human pose estimation, and then explains the construction of the PAC module and the BC module separately.
Atrous convolution for feature extraction
Atrous convolution, taken literally, injects holes into a standard convolution kernel and thereby enlarges the receptive field. Compared with a standard convolutional layer, atrous convolution has an additional hyperparameter called the expansion rate, which refers to the spacing between the elements of the convolution kernel (for standard convolution, the rate is 1). Atrous convolution avoids reducing the resolution while extracting features over a wider context. It provides a larger receptive field without a pooling layer, and the amount of computation is the same as for a standard convolution with the same kernel size.
Consider the two-dimensional case: for each location i on the output y and a filter k, atrous convolution is applied over the input feature map x:
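In its standard form, written with the symbols above and assuming r for the expansion rate and j for the index over the positions of filter k (neither symbol is defined in the surrounding text), the operation reads:

```latex
y[i] = \sum_{j} x[i + r \cdot j] \, k[j]
```

With r = 1 this reduces to standard convolution; a larger r samples the input x at every r-th position, which is equivalent to inserting r − 1 zeros between adjacent filter weights.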

Schematic diagram of atrous convolution.
As Fig. 3 shows, the output feature size is the same for the standard convolution kernel in the first row and the atrous convolution kernel in the second row: both are 6×6. To keep the feature map resolution constant for a given expansion rate, the feature map must be padded first. Generally, zeros are padded along the border, and for a 3×3 kernel the number of padded layers equals the expansion rate, so that the resolution does not decrease after convolution. Let the original feature be f0. First, an atrous convolution with an expansion rate of 1 (a standard convolution) is applied to f0 to generate f1. The receptive field of a point on f1 relative to f0 is then 3×3; that is, one point on f1 integrates the information of a 3×3 area on f0. Next, an atrous convolution kernel with an expansion rate of 2 is applied to f1 to obtain f2. Each tap of this second kernel covers one point of f1, whose receptive field equals the size of the first (standard) kernel, so the receptive field of f2 is 7×7. If both convolutions used an ordinary 3×3 kernel, the receptive field of f2 would be only 5×5. Atrous convolution therefore enlarges the receptive field without changing the output feature size.
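The arithmetic in this paragraph can be checked with a short sketch. It assumes stride-1 convolutions with a square kernel; the helper names `receptive_field` and `output_size` are hypothetical.

```python
def receptive_field(rates, kernel=3):
    """Side length of the receptive field after stacking stride-1 atrous convolutions."""
    rf = 1
    for r in rates:
        rf += (kernel - 1) * r   # each layer widens the field by (kernel - 1) * rate
    return rf


def output_size(size, rate, kernel=3, padding=None):
    """Output side length of a stride-1 atrous convolution."""
    if padding is None:
        padding = rate           # padding equal to the rate, as described for a 3x3 kernel
    effective = kernel + (kernel - 1) * (rate - 1)   # kernel extent including the holes
    return size + 2 * padding - effective + 1


print(receptive_field([1]))      # 3  (f1 relative to f0)
print(receptive_field([1, 2]))   # 7  (rate 1 followed by rate 2)
print(receptive_field([1, 1]))   # 5  (two ordinary 3x3 convolutions)
print(output_size(6, rate=1), output_size(6, rate=2))   # 6 6
```

The same two-layer stack reaches a 7×7 receptive field instead of 5×5 at the same number of weights, and padding by the rate keeps the 6×6 feature size in both cases.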
PACNet refers to a network whose main body is composed of multiple atrous convolution branches. Each branch is generated by applying atrous convolution kernels with a different atrous rate to the feature map, so features at different scales can be learned, which improves accuracy. However, as the sampling rate grows, the number of valid filter weights shrinks. For example, when a 3×3 filter is applied to a 64×64 feature map and the atrous rate approaches the size of the feature map, the convolution degenerates to a simple 1×1 convolution, because only the center weight of the kernel remains effective. To solve this problem, this paper applies global average pooling to the last feature map of the model, feeds the resulting image-level features to a 1×1 convolution with 256 filters (and batch normalization), and then bilinearly up-samples the feature to the desired spatial dimension. Finally, the features of all branches are combined and fed to another 1×1 convolution (also with 256 filters and batch normalization), and the result is obtained through a final 1×1 convolution. The PACNet is shown in Fig. 4.
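The shrinking number of valid weights can be made concrete by counting, for a 3×3 atrous kernel on a 64×64 map, how many kernel taps land on real input rather than on zero padding. This is a minimal sketch; the helper names and the particular rates in the loop are illustrative choices, not values taken from the paper.

```python
def valid_taps(size, rate, row, col):
    """Number of 3x3 kernel taps that land inside a size x size map at (row, col)."""
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr * rate, col + dc * rate
            if 0 <= r < size and 0 <= c < size:
                count += 1
    return count


def share_with_all_taps(size, rate):
    """Fraction of output positions where all nine weights see non-padded input."""
    full = sum(valid_taps(size, rate, r, c) == 9
               for r in range(size) for c in range(size))
    return full / (size * size)


for rate in (1, 6, 12, 18, 32, 64):
    print(rate, round(share_with_all_taps(64, rate), 3))

# At rate 64 every output position sees only the center weight,
# so the 3x3 kernel behaves like a 1x1 kernel.
print(all(valid_taps(64, 64, r, c) == 1 for r in range(64) for c in range(64)))  # True
```

The fraction of positions that use the full kernel falls as the rate grows and reaches zero once the rate is at least half the map size, which motivates adding the image-level pooling branch.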

Structure diagram of PACNet.
The effective size of the atrous convolution kernel increases with the expansion rate, but not every combination of expansion rates achieves better results. Problems arise if a 3×3 convolution kernel with an expansion rate of 2 is simply stacked multiple times. As shown in Fig. 5, the atrous convolution kernel (the non-white part in the figure) is not continuous, which means that not every pixel of the feature map is used in the calculation; the continuity of the information is lost, and the prediction degrades. Hybrid atrous convolution was proposed to solve this discontinuity problem: the expansion rates of the stacked atrous convolutions must not share a common divisor greater than 1, and they are arranged in a sawtooth pattern such as [1,2,3,1,2,3,1,2,3,1,2,3]. This arrangement retains a complete and continuous 3×3 area from the beginning and ensures the continuity of the receptive field.
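The discontinuity can be reproduced in one dimension by tracking which input offsets contribute to a single output after stacking 3-tap atrous layers. This is a sketch for illustration; `covered_offsets` and `has_holes` are hypothetical helper names.

```python
def covered_offsets(rates):
    """1D input offsets that contribute to one output after stacking 3-tap atrous layers."""
    offsets = {0}
    for r in rates:
        offsets = {o + d * r for o in offsets for d in (-1, 0, 1)}
    return offsets


def has_holes(rates):
    """True if some offset inside the receptive field is never read."""
    cov = covered_offsets(rates)
    return cov != set(range(min(cov), max(cov) + 1))


print(sorted(covered_offsets([2, 2, 2])))  # [-6, -4, -2, 0, 2, 4, 6]
print(has_holes([2, 2, 2]))                # True
print(has_holes([1, 2, 3]))                # False
```

Both stacks span offsets from -6 to 6, but the rate-2 stack only ever reads even offsets, while the sawtooth rates [1, 2, 3] read every offset in the span.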

Effect of stacking a 3×3 convolution kernel with an expansion rate of 2 multiple times.
Many previous methods compute the loss by measuring the error between the predicted and ground-truth values of each keypoint in turn and summing the errors. This is not rigorous for human pose estimation, because the task also involves the correlation between keypoints and physical structure constraints. Inspired by [49], this paper designs a body structure constraint loss to model the human keypoints, as shown in Fig. 6. Let the set St denote the t-th combination of keypoints; as shown in Table 1, each combination carries a physical structure constraint. Compared with [49], the combinations of keypoints in this paper are more comprehensive and more reasonable, and they tie the keypoints more closely together. In addition, in previous work the accuracy of hips, knees and ankles is lower than that of other keypoints, so this paper adds the combination of these three keypoints to the collection separately. The i-th scale body structure constraint loss is defined as follows:

Visual structure of the body structure constraint model.
Single keypoint and combined keypoints
The first term on the right side of the equation is the individual loss of each keypoint, that is, the error between the predicted confidence map P and the ground-truth confidence map G at the keypoint (x, y). The second term is the body structure constraint loss, where α is the weight coefficient and K is the total number of keypoints. Figure 6 visualizes the body structure constraint loss: the top row shows the loss of each keypoint, and the bottom row shows the body constraint loss designed in this paper. Note that the collection S is not built by simply adding one keypoint at a time. For example, the first element of S is the combination of head and neck, not the combination of head and ankle or some other keypoint set. The combinations follow reasonable physical connections, which is what allows the model to capture better correlation between keypoints.
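The two-term structure described above can be sketched as follows. This is an illustration under explicit assumptions, not the paper's exact formula: the error is taken to be the mean squared error between confidence maps, the loss of a combination is the average error over its member keypoints, and the default value of `alpha` is arbitrary. The function names, the toy 2×2 maps, and the single head-neck combination are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equally shaped 2D confidence maps."""
    total, n = 0.0, 0
    for p_row, g_row in zip(pred, target):
        for p, g in zip(p_row, g_row):
            total += (p - g) ** 2
            n += 1
    return total / n


def bc_loss(P, G, groups, alpha=0.01):
    """Per-keypoint loss plus alpha times a loss over each keypoint combination.

    P, G   : dict mapping keypoint name -> 2D confidence map
    groups : list of tuples of keypoint names (the combinations S_t)
    alpha  : weight of the structure term (assumed value)
    """
    K = len(P)
    single = sum(mse(P[k], G[k]) for k in P) / K          # individual keypoint term
    combined = sum(sum(mse(P[k], G[k]) for k in group) / len(group)
                   for group in groups)                    # structure term
    return single + alpha * combined


G = {"head": [[0, 1], [0, 0]], "neck": [[0, 0], [1, 0]], "ankle": [[0, 0], [0, 1]]}
P = {"head": [[0, 1], [0, 0]], "neck": [[0, 0], [0, 0]], "ankle": [[0, 0], [0, 1]]}
groups = [("head", "neck")]     # physically connected keypoints only
print(bc_loss(P, G, groups, alpha=0.5))
```

In this toy case the missed neck is penalized twice, once on its own and once through the head-neck combination, so an error on a keypoint also raises the loss of the structure it belongs to.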
Results
Evaluation
Performance comparisons on the MPII test set (PCKh@0.5)
#Params and GFLOPs of some top-performing methods. GFLOPs are computed with an input size of 256×256.
Comparisons of PCK@0.2 score on the LSP dataset
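The PCKh@0.5 and PCK@0.2 scores in these tables count a predicted keypoint as correct when its distance to the ground truth is within a fraction of a per-person reference length. The sketch below follows the common definitions, assuming the reference length is the head size for PCKh and the torso size for PCK; the function name `pck`, the data layout, and the toy coordinates are hypothetical.

```python
import math


def pck(preds, gts, ref_lengths, threshold):
    """Fraction of keypoints whose error is within threshold * reference length.

    preds, gts  : list (per person) of lists of (x, y) keypoints
    ref_lengths : per-person reference length (head size for PCKh, torso size for PCK)
    """
    correct = total = 0
    for person_pred, person_gt, ref in zip(preds, gts, ref_lengths):
        for (px, py), (gx, gy) in zip(person_pred, person_gt):
            dist = math.hypot(px - gx, py - gy)
            correct += dist <= threshold * ref
            total += 1
    return correct / total


gts = [[(10, 10), (20, 20), (30, 30)]]
preds = [[(10, 12), (20, 27), (30, 30)]]
# Reference length 10 and threshold 0.5 allow an error of up to 5 pixels,
# so the keypoints with errors 2 and 0 count and the one with error 7 does not.
score = pck(preds, gts, ref_lengths=[10], threshold=0.5)
print(score)
```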
Using the 8-stack hourglass network as the benchmark model, this paper conducts an ablation study on the MPII validation set.

Structure of serial atrous convolutional network.

Comparison results of serial atrous convolution and PAC. A represents the Hourglass model (benchmark model), B represents serial atrous convolutional model (replace the hourglass sub-networks in A with serial atrous convolution), C represents PAC (replace the hourglass sub-networks in A with PAC).

Comparison results of the benchmark model and the model with body structure constraint loss. D represents the model with body structure constraint loss, which includes the loss of single keypoints and the loss of combined keypoints.
This paper proposes parallel atrous convolution together with a body structure constraint loss. Because the BC loss and the PACNet each improve on the benchmark model individually, it is reasonable to add both to the benchmark model, that is, to use the PACNet with the BC loss. The result of this comparative experiment is shown in Fig. 10, and the comparison between the benchmark model and the final model is shown in Fig. 11.

Comparison results of the PAC model, the model with body structure constraint loss and the combined model. E is equivalent to C + D.

Comparison results of the benchmark model and the final model.

Results on the MPII (top) and the LSP dataset (bottom).
This paper presented a parallel atrous convolutional network with body structure constraints to address the challenging problem of human pose estimation in complex scenes. Its success stems from three aspects: (i) Connecting multiple different atrous convolution branches in parallel not only learns features at different scales but also maintains high-resolution feature maps throughout, yielding high-quality feature maps and more detailed information. (ii) The body structure constraints are taken into account when calculating the loss, and the correlation between keypoints is used to improve accuracy. (iii) Extensive ablation studies support the effectiveness of parallel atrous convolution and body structure constraints.
Limitation and further work
Although the accuracy of the proposed model exceeds that of most published methods, and even exceeds the state of the art on some keypoints, the model is relatively complex because of the multiple parallel atrous convolution branches and the many cases considered when constructing the body constraint loss. Simplifying the model without affecting accuracy is worth studying in future work.
Data availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported by Zhejiang Provincial Technical Plan Project (No. 2020C03105).
