Abstract
Pose estimation plays a crucial role in human-centered vision applications and has advanced significantly in recent years. However, prevailing approaches rely on extremely complex structural designs to obtain high scores on benchmark datasets, which hampers deployment on edge devices. In this study, the problem of efficient and lightweight human pose estimation is investigated. Enhancements are made to the context enhancement module of the U-shaped structure to improve its multi-scale local modeling capability. Based on the transformer structure, a lightweight transformer block is designed to enhance local feature extraction and global modeling ability. Finally, a lightweight pose estimation network, the U-shaped Hybrid Vision Transformer (UViT), is developed. The smallest network, UViT-T, achieves a 3.9% improvement in AP score on the COCO validation set with fewer model parameters and lower computational complexity compared with the best-performing V2 version of the MobileNet series. Specifically, with an input size of 384×288, UViT-T achieves an AP score of 70.2 on the COCO test-dev set with only 1.52 M parameters and 2.32 GFLOPs. The inference speed is approximately twice that of general-purpose networks. This study provides an efficient and lightweight design idea and method for the human pose estimation task and provides theoretical support for its deployment on edge devices.
Introduction
Human pose estimation is a fundamental task in computer vision that involves localizing and identifying anatomical keypoints on the human body, such as wrists, elbows, and knees. It has wide-ranging applications in areas such as behavior recognition [1, 2], human-computer interaction [3, 4], and tracking [5, 6]. In recent years, significant advancements have been made in pose estimation, particularly with the development of Convolutional Neural Networks (CNNs) [7–11] and the introduction of the transformer architecture [12–16].
To improve performance, existing human pose estimation models often employ wider and deeper network architectures, resulting in a significant increase in parameters and Floating-point Operations (FLOPs). This increased computational complexity leads to slower inference speeds, making it challenging to deploy on mobile devices. In the meantime, there is a growing demand for efficient and lightweight human pose estimation networks. Although various techniques have been proposed to improve efficiency and lightweight design, they often suffer a significant drop in performance. Therefore, further research is still needed in the field of efficient and lightweight human pose estimation.
The hybrid architecture design of CNN and Transformer [17–21] motivates the construction of a lightweight and low-latency network for mobile visual tasks. This design combines the spatial inductive bias and insensitivity to data augmentation of CNNs with the adaptive weighting and global processing advantages of ViTs. Inspired by this, this work focuses on efficient and lightweight human pose estimation. To that end, the U2-Net [22] architecture is first redesigned to make it lightweight and to verify its efficient use of contextual information in human pose estimation tasks. Thereafter, a mixed pooling attention mechanism is proposed to enhance the feature extraction capability of the contextual feature extraction module. Furthermore, based on this contextual feature extraction module, the UViT-Residual Block, a hybrid architecture feature extraction module, is introduced to enhance local and global modeling capability. To further demonstrate the effectiveness of the lightweight design, a lightweight pose estimation network, the U-shaped Hybrid Vision Transformer (UViT), is developed, as shown in Fig. 1.

Fig. 1 UViT network architecture. (a) U_Context Block. (b) UViT Stem Block. (c) UViT-Residual Block. (d) The proposed UViT network, described in Sec. 3. More details and other variants are shown in Table 3.
To validate the proposed UViT network, this work first defines general-purpose networks and lightweight networks for performance comparison, as shown in Table 1. Then, extensive experiments are conducted on the widely used COCO [23] and MPII [24] datasets. The proposed UViT model has a smaller size and fewer FLOPs than the general-purpose networks while offering higher accuracy than the lightweight networks. For example, compared with MobileNetV2 (an AP score of 66.8 on the COCO test-dev set with 9.6 M parameters and 3.33 GFLOPs), the smallest network UViT-T achieves an AP of 70.2 with only 1.52 M parameters and 2.32 GFLOPs. Compared with the general-purpose networks, although there is a performance gap, UViT networks have significantly lower GFLOPs and fewer parameters, and achieve nearly twice the inference speed. The fusion of contextual information and the modeling of global information play a potentially significant role in lightweight networks.
An efficient pose-estimation network called UViT is proposed. The key to this algorithm is the fusion strategy for contextual information and the ability to model local and global features. Extensive experiments on two benchmark datasets, COCO and MPII, demonstrate the effectiveness of the proposed approach: on the COCO dataset, UViT reduces the number of parameters and GFLOPs by factors of 6.6 and 4.5, respectively, compared with the state-of-the-art HRNet; in addition, compared with the best network in the MobileNet series, the accuracy is improved by more than 5% with half the parameters. Validation experiments on the information fusion strategy were designed to demonstrate the effectiveness of fusing multi-scale contextual information in local modeling, thereby exhibiting the advantages of combining CNNs and lightweight transformers for feature extraction.
Definition of General-Purpose Networks and Lightweight Networks for performance comparison, including parameter count ranges, model names, and corresponding references
The remainder of this paper is organized as follows. Section 2 provides an overview of the work related to human pose estimation, attention mechanisms, and hybrid architecture designs. Section 3 details the mixed pooling-enhanced attention mechanism, context-enhanced module, UViT-Residual block, and proposed lightweight pose estimation network. Section 4 provides extensive results and a comprehensive analysis on the challenging COCO and MPII datasets, a comparison with other state-of-the-art and real-time methods, and a qualitative assessment and comparison of the inference speed of the UViT network. Finally, Section 5 provides the concluding remarks.
Lightweight pose estimation
Since the introduction of DeepPose [34], methods based on deep convolutional neural networks have made significant progress in the field of pose estimation. Novel network structures and methods have achieved better performance on the MPII [24] and COCO [23] benchmarks, especially the high-resolution representation network, HRNet [11]. However, these studies rely on more complex architectures and high computational complexity to improve the accuracy of pose estimation, which limits their effective deployment on edge devices. In this work, the main focus is on the design of efficient and lightweight networks for human pose estimation.
To make pose estimation networks more lightweight, Zhang et al. [35] use a model-training method for fast pose distillation using the hourglass network [25] as a teacher model to train lightweight student networks. Bulat et al. [36] binarized the model to speed up the inference time while compressing the model parameters; however, the performance was degraded. Zhang et al. [32] use lightweight convolutional modules to build the network and propose an iterative training strategy with a reduced number of model parameters; however, its computational complexity remains high. Yu et al. [33] use a conditional channel-weighting operation instead of a 1×1 convolution in a shuffle block, which improves the computational efficiency of the network. Wang et al. [37] propose a novel human pose estimation method that utilizes an unbiased data processing technique, introduces lightweight bottleneck blocks with re-parameterized structures, and utilizes multi-branch and single-branch structures to decouple training and deployment phases; this results in improved accuracy and faster inference speed. Zhang et al. [38] propose an ultra-lightweight end-to-end pose distillation network that can perform real-time pose estimation in resource-constrained environments and demonstrates good generalization capabilities. Although lightweight models for pose estimation have been proposed, they often face the issue of sacrificing accuracy and performance in pursuit of lightweight design. Therefore, developing a lightweight pose estimation network that is both efficient and accurate remains an unresolved problem.
The U-Net [39] architecture employs an encoder-decoder design and effectively integrates multi-scale contextual information, resulting in improved feature extraction capability and excellent performance in pixel-level visual tasks. First, U-Net achieves parameter sharing between the encoder and decoder, significantly reducing the number of trainable parameters; this reduces computational and storage overheads and mitigates the risk of overfitting. Second, U-Net adopts the design of local connections and small receptive fields, thus reducing the computational load at each layer and the range of contextual information to be processed. The utilization of small receptive fields restricts the focus of each convolutional kernel to local regions, thereby enhancing the ability for local perception of features. Finally, U-Net employs downsampling and upsampling operations to adjust image size and extract contextual information at different scales. U-Net thus maintains high feature extraction capability while reducing computational complexity.
Following its design principles, a U-shaped Hybrid Vision Transformer network (UViT) is presented, which offers advantages in terms of inference speed, computational complexity, and model size. In contrast to U-Net, considering the lightweight design of the network, a combination of 3×3 group convolution and Inverted Residual block [29] is adopted to downsample by a factor of 4 in the feature extraction stem phase. Additionally, in the En_1 stage of the encoder, contextual features are effectively extracted using a context enhancement module. Furthermore, in the other encoding stages, the UViT-Residual module is employed to perform both local and global feature modeling. Finally, in the decoding stage, the contextual features are integrated with the corresponding outputs from each encoding stage, achieving comprehensive context feature fusion. Its efficiency and effectiveness are validated on the COCO [23] and MPII [24] benchmarks.
Attention mechanism
The attention mechanism, inspired by the way the human cognitive system selectively attends to specific information, has been proven to be an effective method for improving the performance of deep learning systems, especially for tasks such as object detection [14, 40], semantic segmentation [41, 42], production remanufacturing [43, 44], and pose estimation [45, 46].
Hu et al. [47] proposed the Squeeze-and-Excitation Network (SENet), a representative work in computer vision that applies the attention mechanism to the channel dimension; numerous subsequent channel-domain works are modifications of it. Because SENet uses Global Average Pooling (GAP) to model the global context, which leads to the loss of spatial information, SPA-Net [48] instead uses a spatial pyramid structure composed of multiple adaptive average pooling (AAP) operations to model local and global contextual semantic information, so that spatial semantic information is utilized more fully. To address the considerably limited receptive field obtained by the convolution operation at each layer, Wang et al. [46] proposed the nonlocal block, which directly captures the dependency between two positions regardless of their distance; however, its computational cost is very high, and GCNet [49] and CCNet [50] have effectively reduced it. Attention mechanisms contribute to performance improvement in deep neural networks; however, when the model is small, the computational cost associated with attention mechanisms cannot be ignored, which often leads to significantly lower performance compared to larger networks. To enhance the accuracy and performance of pose estimation while maintaining a lightweight design, new methods are needed to improve existing attention mechanisms.
Coordinate Attention [51] decomposes channel attention into two one-dimensional feature-coding processes to capture long-range dependencies and retain accurate location information. The obtained feature maps are then separately encoded into direction-aware and position-sensitive attention maps to enhance the representation of the objects of interest. Consequently, it can be applied to the proposed lightweight network and can effectively improve network performance without adding excessive computation. However, Coordinate Attention falls short of simultaneously capturing both local detailed information and global semantic information in the image. To address this limitation and obtain a balance between lightweight design and improved performance, a Mixed Pooling Coordinate Attention mechanism is proposed. This mechanism significantly enhances the accuracy of pose estimation tasks while requiring minimal additional computational cost.
Hybrid architecture of CNN and transformer
Since the emergence of the ViT [52] model in 2020, researchers have discovered the great potential of transformer architectures in the visual field. However, the transformer model has numerous parameters and requires high computational power, which makes it difficult to deploy the model on a mobile terminal. Although there are many excellent works on transformer architecture in the visual field, such as the Swin-Transformer [12] in 2021, which is better and lighter than ViT, a large gap remains in model parameters and inference speed compared with CNN-based lightweight models (such as MobileNet series [29, 54]). Vision Transformer (ViT) is limited because of the lack of spatial inductive biases and the requirement for large-scale datasets along with appropriate data augmentation, regularization, and distillation techniques. To address these limitations and achieve performances comparable to those of CNNs, recent research has introduced hybrid architectures combining CNNs and Transformers. This approach aims to efficiently capture both local and global information while balancing network performance and efficiency. By incorporating CNNs, these hybrid architectures alleviate the issues of massive parameters and computational complexity associated with Transformers.
Early research on ViT introduced convolutions [52, 55], replaced the patchify stem with convolutional operations, or used window-based attention [12, 56] to implicitly integrate CNN and Transformer; these approaches effectively reduced the computational complexity and parameters of ViTs while maintaining certain spatial inductive biases. Recent research has focused on hybrid architecture designs that enhance feature representation through better information exchange between tokens [57, 58]. Furthermore, according to the latency analysis of different components in ViT-based models presented in RIFormer [27], the token mixer accounts for 46.3% of the total latency of the backbone network. Therefore, the latency introduced by the token mixer cannot be ignored. By replacing [59, 60] or even removing [27, 61] inefficient token mixers, better performance can be achieved in lightweight network designs [62, 63].
In conclusion, although various lightweight techniques and methods based on deep learning have been proposed, their performance in specific downstream tasks may vary depending on the specific task and dataset characteristics. The pure transformer model is large in size and requires a large amount of data for training to outperform the CNN-based model, and the CNN is the mainstream approach on downstream tasks but lacks a global view, as shown in Table 2. To address these limitations, a hybrid approach is proposed that combines the strengths of both CNN and Transformer models. In this study, the primary motivation is to develop an efficient and lightweight network for human pose estimation. By employing a U-shaped structure with a context feature extraction module for local modeling and incorporating lightweight transformers for global modeling, UViT achieves an efficient and compact model with improved feature extraction capacity.
Comparison of different neural network architectures
Mixed pooling coordinate attention mechanism
The attention mechanism of SE [47] only considers channel coding, neglecting the importance of location information. Woo et al. [64] introduce spatial information coding through a large-scale convolution kernel; however, it can only capture local relationships and cannot model long-range dependencies. GCNet [49], A2-Nets [65], SCNet [66], and CCNet [50] use nonlocal mechanisms to capture spatial information; however, their computational cost is too large for lightweight networks. Coordinate Attention [51] can effectively obtain a global receptive field and encode precise location information. Inspired by this, the coordinate attention mechanism was improved to suit pose estimation tasks.
Consider an input feature map that can be viewed as a two-dimensional space where each position represents a pixel in the feature map. This feature map can represent an image or any other two-dimensional structured data. The coordinate attention mechanism determines the weights for each position based on its spatial coordinates and importance in the computation. To ensure the weights add up to one, the weights are normalized and subsequently applied to the corresponding pixel values in the input feature map, resulting in a weighted representation of each feature at each position. This process transforms the spatial coordinates into weights, which are then used to weight the corresponding positions in the feature map according to their importance. The coordinate attention mechanism enables the network to better focus on regions of interest and extract spatially relevant features.
In the initial feature aggregation stage, a mixed pooling strategy is adopted to better retain the texture and background features of the image and effectively improve the performance of visual tasks; its results are concatenated with the feature aggregation results in the X and Y directions to enhance the feature representation. Therefore, given the input X, mixed pooling is used to encode each channel along the horizontal and vertical directions over the (H, 1) and (1, W) spatial ranges, and the output of the c-th channel in the height h and width w directions can be expressed as
$$z_c^h = \mathrm{Concat}\big(\mathrm{AAP}_{(H,1)}(x_c),\ \mathrm{AMP}_{(H,1)}(x_c)\big), \qquad z_c^w = \mathrm{Concat}\big(\mathrm{AAP}_{(1,W)}(x_c),\ \mathrm{AMP}_{(1,W)}(x_c)\big),$$
where $\mathrm{Concat}(\cdot)$ represents the concatenation operation along the spatial dimension and $\mathrm{AAP}$ and $\mathrm{AMP}$ denote the adaptive average pooling and adaptive maximum pooling operations, respectively. A concatenation operation is then performed on the two directional outputs $z^h$ and $z^w$.
Average pooling weakens the most distinctive features, whereas max pooling ignores some effective features. Therefore, based on the coordinate attention mechanism, using the mixed pooling strategy along the horizontal and vertical directions can maximize the retention of image texture features and obtain global context information, which helps locate the region of interest more accurately.
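The directional mixed pooling described above can be illustrated for a single channel with a minimal sketch. It assumes that the average-pooled and max-pooled vectors are simply concatenated along the pooled axis, averages first; the function names and the ordering are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]  # one channel of a feature map, shape H x W


def mixed_pool_height(x: Matrix) -> List[float]:
    """Pool every row over the width with both average and max pooling.

    Returns 2*H values: the H row averages followed by the H row maxima
    (the concatenation order is an assumption made for this sketch).
    """
    avg = [sum(row) / len(row) for row in x]
    mx = [max(row) for row in x]
    return avg + mx


def mixed_pool_width(x: Matrix) -> List[float]:
    """Pool every column over the height; returns 2*W values."""
    cols = list(zip(*x))  # transpose so that each item is one column
    avg = [sum(c) / len(c) for c in cols]
    mx = [max(c) for c in cols]
    return avg + mx


feature = [
    [0.0, 1.0, 2.0],
    [3.0, 3.0, 3.0],
]
z_h = mixed_pool_height(feature)  # [1.0, 3.0, 2.0, 3.0]
z_w = mixed_pool_width(feature)   # [1.5, 2.0, 2.5, 3.0, 3.0, 3.0]
```

In the first row the average (1.0) smooths over the peak that the maximum (2.0) preserves, which is the complementary behavior that motivates mixing the two pooling operations.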
First, in the U2-Net [22] network structure, the first two encoding stages downsample to one-quarter resolution using a convolution and an Inverted Residual block [29] to reduce computation, and the decoder outputs at one-quarter resolution, with the number of intermediate channels set to 32. Keypoint detection is then performed on the COCO val set. For input resolutions of 256×192 and 384×288, the AP scores were 64.8 and 67.5, respectively, which are superior to MobileNetV2 in terms of accuracy, number of parameters, and FLOPs, demonstrating the powerful contextual feature extraction capability of the U-shaped structure. Inspired by this, a U-shaped network structure was redesigned to further enhance local modeling capability.
For a network with an encoder-decoder structure, the downsampling factor is often 32, which leads to the loss of a large amount of detailed information and results in unsatisfactory performance on downstream tasks. Dilated convolution [67] can effectively enlarge the receptive field while maintaining the height and width of the original input feature map; however, it suffers from a gridding effect. According to the design criterion for the dilation rates in Hybrid Dilated Convolution (HDC) [67], dilation rates of [1, 5] were set to obtain a large receptive field and a more uniform distribution of pixel utilization than the settings in U2-Net [22], effectively avoiding the gridding effect [68].
Based on the application of the mixed pooling coordinate attention mechanism and dilated convolution, a U-shaped structure was constructed with a large receptive field and enhanced representation of objects of interest. The improved attention mechanism is placed before the summing operation with the shortcut branch. When the downsampling rate reached 32, the pooling operation was removed, and the dilation rates were designed according to the HDC criterion to enlarge the receptive field.
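How dilation rates affect the receptive field and the gridding effect can be explored with a small one-dimensional sketch. It assumes stride-1 3×3 kernels with one layer per dilation rate; the helper names are hypothetical, and the rate lists [2, 2, 2] and [1, 2, 5] are generic illustrations rather than the settings used in UViT.

```python
from typing import Sequence, Set


def receptive_field(rates: Sequence[int], kernel: int = 3) -> int:
    """Receptive field (per axis) of stacked stride-1 dilated convolutions."""
    rf = 1
    for r in rates:
        rf += (kernel - 1) * r
    return rf


def touched_offsets(rates: Sequence[int], kernel: int = 3) -> Set[int]:
    """Input offsets (1-D) that can influence a single output position."""
    half = kernel // 2
    offsets = {0}
    for r in rates:
        taps = [k * r for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets


def coverage(rates: Sequence[int], kernel: int = 3) -> float:
    """Fraction of positions inside the receptive field that are actually used."""
    return len(touched_offsets(rates, kernel)) / receptive_field(rates, kernel)


print(receptive_field([1, 5]))  # 13, assuming two stacked 3x3 layers
print(coverage([2, 2, 2]))      # 7/13: only even offsets are used (gridding)
print(coverage([1, 2, 5]))      # 1.0: every position in the field is used
```

A coverage below 1.0 means that some pixels inside the receptive field never contribute to the output, which is the gridding effect that the choice of dilation rates is intended to limit.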
UViT-Residual block
The U-Net [39] network, proposed in 2015, is salient in the field of image segmentation. Its U-shaped structure addresses the disadvantage that an FCN [69] cannot capture context and location information. It adopts a completely symmetrical U-shaped structure to strengthen feature fusion, which retains the spatial information of the bottom features and exploits the rich semantic information of the high level.
Standard convolutional operations can be seen as a stack of three sequential operations: (1) unfolding, where the input image is decomposed into a matrix form, (2) matrix multiplication, which learns local features, and (3) folding, which recombines the feature maps into tensor form for further processing. Inspired by MobileViT [17], the UViT-Residual block replaces matrix multiplication with a stack of transformer layers to enable global modeling while retaining convolution-like properties such as spatial bias. This modeling approach provides insights for designing shallow and narrow models, resulting in lightweight structures. For example, in the UViT-T model, the transformer dimensions and feed-forward network (FFN) dimensions are set to [64, 80, 96] and [128, 160, 192], respectively, in the three stages of the encoder. This configuration requires only 0.257 GFLOPs, accounting for only 25% of the total FLOPs of the entire model. This indicates that the UViT model achieves a lightweight design by setting the dimension parameters appropriately, while still maintaining sufficient expressiveness and accuracy.
Inspired by this, the context enhancement module was combined with the transformer to build a UViT-Residual block for local and global modeling. Based on the MobileViT [17] network structure and ablation experiments, the patch size was set to 2 to better capture details, the number of heads of the multi-head attention mechanism was set to 4, the model sizes were divided according to the input dimensions of the transformer, and ablation experiments were conducted on the feature fusion mode. The use of the UViT-Residual block enhanced the feature extraction capability of the network without increasing the computational burden. Experiments showed that the UViT-Residual block is well suited to lightweight networks.
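The unfold and fold steps around the transformer layers can be sketched for a single channel as follows. This is a simplified illustration of the MobileViT-style rearrangement, with the transformer replaced by an identity placeholder; the function names are hypothetical and real implementations operate on batched multi-channel tensors.

```python
from typing import List

Matrix = List[List[float]]


def unfold(x: Matrix, p: int) -> List[List[float]]:
    """Split an H x W map into non-overlapping p x p patches.

    Returns p*p sequences. Sequence k collects, over all patches in row-major
    order, the pixel at position k inside each patch, so a transformer applied
    to one sequence relates the same relative pixel across all patches.
    """
    h, w = len(x), len(x[0])
    assert h % p == 0 and w % p == 0, "H and W must be divisible by the patch size"
    return [
        [x[py + dy][px + dx] for py in range(0, h, p) for px in range(0, w, p)]
        for dy in range(p)
        for dx in range(p)
    ]


def fold(seqs: List[List[float]], h: int, w: int, p: int) -> Matrix:
    """Inverse of unfold: scatter the sequences back into an H x W map."""
    x = [[0.0] * w for _ in range(h)]
    k = 0
    for dy in range(p):
        for dx in range(p):
            n = 0
            for py in range(0, h, p):
                for px in range(0, w, p):
                    x[py + dy][px + dx] = seqs[k][n]
                    n += 1
            k += 1
    return x


x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = unfold(x, p=2)                 # 4 sequences of 4 tokens each
processed = [list(s) for s in tokens]   # identity stand-in for transformer layers
restored = fold(processed, 4, 4, p=2)   # equals x because the stand-in is identity
```

With a patch size of 2, as used in the text, each of the four sequences has one token per patch, so the attention cost depends on the number of patches rather than on the number of pixels.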
Lightweight pose network
Multibranch networks are widely used in human pose estimation, and they can better integrate multiresolution features to address the problem of scale variation. U-Net [39] is widely used in the field of image segmentation to effectively obtain contextual and location information. Based on U-Net [39], U2-Net [22] adopts a two-layer nested U-shaped structure, which better fuses features with different scales and receptive fields and captures global information from different scales. The transformer network architecture has significant potential in the visual field; however, owing to the lack of spatial inductive bias and large model parameters, ... . To better meet the needs of mobile deployment, the combination of CNN and Transformer gives full play to their respective advantages and provides ideas for building a lightweight and low-latency network for mobile vision tasks.
Based on the U2-Net design principles, a lightweight pose-estimation network (UViT) is proposed. In contrast to U2-Net, a 2D convolution and an Inverted Residual block [29] are used to downsample to one-quarter resolution in the stem stage, and the decoding stage outputs at one-quarter resolution, significantly reducing the number of parameters and FLOPs. Considering the high resolution, increased computational complexity, and redundant information at the first stage of the encoder (En_1), the enhanced U-shaped structure is used there for contextual feature extraction. Additionally, dilated convolution is used for the low-resolution processing of this module to effectively expand the receptive field. The UViT-Residual block is used in encoder stages En_2, En_3, and En_4 to extract effective feature information through a joint transformer structure for local and global modeling. Because the U-shaped structure already has a powerful contextual feature extraction capability, the model size was divided according to the input dimension of the transformer block to improve the global modeling capability. More details and other variants are presented in Table 3. The main body adopts the Encoder-Decoder structure, and the hybrid structure of CNN and Transformer is used in the encoding stage. Following upsampling in the decoding stage, the context information is fused through the skip connection structure. C_mid represents the number of intermediate channels, and Concat represents the concatenation operation.
Architectures for UViT. The stem contains a stride 2 3×3 convolution and an Inverted Residual block. The main body adopts an encoder-decoder architecture
The proposed model was evaluated on the COCO [23] and MPII [24] datasets, ablation experiments were performed on related techniques, and the results were compared with those of other state-of-the-art methods.
COCO and MPII keypoint detection
Dataset and evaluation metric
The COCO dataset is a fairly large general-purpose dataset for computer vision tasks. It contains more than 200,000 images and 250,000 person instances labeled with 17 human keypoints, with an average of two and a maximum of 13 people per image. The UViT network is trained on the train2017 set, which contains 57,000 images and 150,000 person instances, and the performance is evaluated on the val2017 and test-dev2017 sets containing 5,000 and 20,000 images, respectively. The evaluation metrics are based on object keypoint similarity (OKS) [23].
The MPII dataset contains approximately 25,000 images and more than 40,000 human targets annotated with 16 keypoints, together with head direction, 3D torso, and body part occlusion annotations. To evaluate the performance on the MPII dataset, the standard metric known as PCKh@0.5 [24] (head-normalized probability of correct keypoint) is utilized. This metric calculates the percentage of correctly localized keypoints by comparing the normalized distance between the detected keypoints and their corresponding ground truth values with a predetermined threshold. More details are shown in Table 4.
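For reference, the OKS between one predicted pose and its ground truth can be computed as in the following minimal sketch, which follows the standard COCO definition (a Gaussian of the keypoint distance, normalized by the object area and a per-keypoint constant). The sigma values in the example are hypothetical placeholders, not the official COCO constants; AP is then obtained by thresholding OKS at values from 0.50 to 0.95.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def oks(pred: Sequence[Point], gt: Sequence[Point], visible: Sequence[int],
        area: float, sigmas: Sequence[float]) -> float:
    """Object keypoint similarity, averaged over labeled keypoints (v > 0)."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma  # per-keypoint constant used by the COCO evaluation
        total += math.exp(-d2 / (2.0 * area * k * k))
        count += 1
    return total / count if count else 0.0


gt = [(10.0, 10.0), (20.0, 10.0), (15.0, 30.0)]
visible = [2, 1, 0]        # the third keypoint is unlabeled and ignored
sigmas = [0.1, 0.1, 0.1]   # hypothetical values for illustration only
perfect = oks(gt, gt, visible, area=400.0, sigmas=sigmas)  # 1.0
shifted = oks([(12.0, 10.0), (20.0, 10.0), (0.0, 0.0)], gt, visible,
              area=400.0, sigmas=sigmas)  # between 0 and 1
```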
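The PCKh@0.5 computation for a single person can be sketched as follows. The sketch assumes that all joints are annotated and that the head size used for normalization is supplied by the caller; the function name and the example coordinates are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pckh(pred: Sequence[Point], gt: Sequence[Point],
         head_size: float, thr: float = 0.5) -> float:
    """Percentage of keypoints whose error is within thr * head_size."""
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= thr * head_size
    )
    return 100.0 * correct / len(gt)


gt = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
pred = [(3.0, 0.0), (10.0, 4.0), (26.0, 0.0), (30.0, 0.0)]
score = pckh(pred, gt, head_size=10.0)  # errors 3, 4, 6, 0 against a limit of 5
```

Here three of the four keypoints fall within half of the head size, so the score is 75.0.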
Human keypoint detection dataset
The network was trained on an NVIDIA 3080 Ti GPU with a minibatch size of 32, using supervised learning with annotated keypoint data. The Adam optimizer with an initial learning rate of 3e-4 was adopted. To address the training instability caused by an overly large initial learning rate, a linear warm-up strategy was used, with the warm-up learning rate set to 0.001. A fixed aspect ratio of 4 : 3 was applied to the human detection boxes, and the boxes were subsequently cropped from the images. For COCO, the cropped images were resized to 256×192 or 384×288, and for MPII, to 256×256. The data augmentation strategy includes random rotation ([-60°, 60°]), random scaling ([0.75, 1.25]), and random flipping with a probability of 0.5. For the COCO dataset, a half-body transformation was additionally used for data augmentation; the transformation threshold was set to 8, and the probability of performing a flip was set to 0.3.
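Fixing the aspect ratio of a detection box can be sketched as follows. The sketch assumes that 4 : 3 refers to height : width (matching the 256×192 input), that boxes are given as top-left corner plus size, and that the box is expanded about its center rather than cropped; these conventions are common in top-down pipelines but are assumptions here.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner format


def fix_aspect_ratio(box: Box, ratio_h: float = 4.0, ratio_w: float = 3.0) -> Box:
    """Expand a box about its center until height : width = ratio_h : ratio_w."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    target = ratio_w / ratio_h  # desired width / height
    if w > target * h:
        h = w / target  # box is too wide: grow the height
    else:
        w = h * target  # box is too tall: grow the width
    return cx - w / 2.0, cy - h / 2.0, w, h


wide = fix_aspect_ratio((10.0, 20.0, 60.0, 40.0))  # (10.0, 0.0, 60.0, 80.0)
tall = fix_aspect_ratio((0.0, 0.0, 30.0, 80.0))    # (-15.0, 0.0, 60.0, 80.0)
```

Expanding instead of shrinking keeps the whole person inside the crop; coordinates that fall outside the image would need padding in a full pipeline.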
Testing
For the COCO dataset, a top-down method was used: keypoints were detected within the person boxes provided by the COCO val2017 human detection results. For the MPII dataset, manually labeled target-bounding boxes were used for pose estimation. The heat maps were estimated, post-processed with a Gaussian filter, and the heat maps of the original and flipped images were averaged. Each keypoint position was obtained by shifting the location of the highest response by a quarter offset in the direction from the highest response to the second-highest response.
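The flip averaging and quarter-offset decoding can be sketched for a single heat map as follows. The function names are hypothetical; the sketch omits the Gaussian filtering step, and a full pipeline would also swap left and right joint channels when flipping back.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def average_with_flip(hm: Heatmap, hm_from_flipped: Heatmap) -> Heatmap:
    """Mirror the flipped-image heat map back horizontally and average."""
    return [
        [(a + b) / 2.0 for a, b in zip(row, reversed(frow))]
        for row, frow in zip(hm, hm_from_flipped)
    ]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def decode_keypoint(hm: Heatmap) -> Tuple[float, float]:
    """Argmax location shifted by 0.25 px toward the larger neighbor on each axis."""
    h, w = len(hm), len(hm[0])
    by, bx = 0, 0
    for y in range(h):
        for x in range(w):
            if hm[y][x] > hm[by][bx]:
                by, bx = y, x
    kx, ky = float(bx), float(by)
    if 0 < bx < w - 1:
        kx += 0.25 * sign(hm[by][bx + 1] - hm[by][bx - 1])
    if 0 < by < h - 1:
        ky += 0.25 * sign(hm[by + 1][bx] - hm[by - 1][bx])
    return kx, ky


heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.5, 0.0],
    [0.0, 0.3, 0.0, 0.0],
]
keypoint = decode_keypoint(heatmap)  # (1.25, 1.25)
```

The peak is at (1, 1); the right and lower neighbors are larger than the left and upper ones, so the estimate moves a quarter pixel in both positive directions. The result is in heat-map coordinates and still has to be mapped back to the original image.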
Results
Results on the validation set
Depending on the size of the transformer block input dimensions, three different versions of the lightweight pose estimation network (UViT-T, UViT-S, and UViT-B) were designed, as presented in Table 3. Table 5 presents the experimental results, where the smallest network outperforms other lightweight methods with an AP score of 68.5 after being trained from scratch with a 256×192 input size, with only 1.52 M parameters and 1.03 GFLOPs. (i) The overall performance of the UViT networks is better than that of the lightweight pose estimation networks and close to that of the general-purpose networks, while their parameters and FLOPs are smaller. (ii) Compared with SimpleBaseline, there is a small gap (0.6 points) between UViT-B and the corresponding ResNet-50 variant; however, the parameters and FLOPs of UViT-B are only 12.7% and 17.7%, respectively, of those of SimpleBaseline. (iii) Although HRNet has achieved good results, owing to its extensive use of parallel convolutions, its inference speed is significantly lower than that of the UViT lightweight network, and it is difficult to deploy on mobile devices, as shown in Fig. 3.
Comparisons on the COCO val set. Parameters and GFLOPs are calculated solely for the pose estimation network, excluding the calculations for human detection and keypoint grouping tasks
Owing to the combination of CNN and Transformer, and the effective ability of context feature extraction and global modeling, the UViT network achieved a good balance between accuracy and computational complexity. The AP scores of the UViT-S and UViT-B reached 71.6 and 72.3, respectively.
Table 6 presents the pose estimation performance of UViT networks and other state-of-the-art methods on the COCO test-dev set. With an input size of 384×288, the proposed network accuracy significantly exceeds that of small networks, and it is superior in terms of network parameters and computational complexity. Compared with the general-purpose network, although the performance in terms of accuracy is slightly inferior, it is superior in terms of model size and FLOPs.
Comparisons on the COCO test-dev set. Parameters and GFLOPs are calculated solely for the pose estimation network, excluding the calculations for human detection and keypoint grouping tasks
Table 7 shows the comparison results between the UViT networks and lightweight pose estimation networks. Compared with MobileNetV2 and ShuffleNetV2, UViT-T achieved higher accuracy with fewer FLOPs. As the model scale increases, UViT-B reached a PCKh of 88.3, which is 1.3, 2.9, 5.5, and 6.1 points higher than Small HRNet-W16, ShuffleNetV2, MobileNetV2, and LiteHRNet-30, respectively.
Comparisons on the MPII val set. Parameters and GFLOPs are calculated with an input size of 256×256
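PCKh counts a predicted keypoint as correct when its distance to the ground truth is within a fraction of the head segment length; MPII results are commonly reported at a fraction of 0.5. A minimal sketch follows, assuming the per-person head sizes are already available. The function name, argument layout, and default threshold are illustrative assumptions rather than the evaluation code used for Table 7.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pckh(
    pred: Sequence[Sequence[Point]],
    gt: Sequence[Sequence[Point]],
    head_sizes: Sequence[float],
    visible: Sequence[Sequence[bool]],
    alpha: float = 0.5,
) -> float:
    """Percentage of annotated keypoints within alpha * head size of the ground truth."""
    correct, total = 0, 0
    for p_pose, g_pose, head, v_pose in zip(pred, gt, head_sizes, visible):
        for (px, py), (gx, gy), vis in zip(p_pose, g_pose, v_pose):
            if not vis:
                continue  # keypoints without annotation are ignored
            total += 1
            if math.hypot(px - gx, py - gy) <= alpha * head:
                correct += 1
    return 100.0 * correct / total if total else 0.0


# One person, two keypoints, head size 10: errors of 3 px and 6 px.
score = pckh(
    pred=[[(0.0, 0.0), (10.0, 0.0)]],
    gt=[[(0.0, 3.0), (10.0, 6.0)]],
    head_sizes=[10.0],
    visible=[[True, True]],
)
print(score)  # 50.0, since only the 3 px error is within 0.5 * 10 px
```

Normalizing by head size makes the metric independent of how large the person appears in the image.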
Ablation experiments were performed with the UViT-T network on the MPII and COCO datasets, and the results were verified on the respective validation sets. The experiments covered the feature fusion approach of the UViT block and the design of the multi-scale contextual information module, with input resolutions of 256×256 for MPII and 256×192 for COCO.
Feature fusion
For the UViT block shown in Fig. 2, three feature fusion methods were designed; the results are listed in Table 8. F(in+out) indicates that the input and output are concatenated in the channel direction and then fused by a convolutional layer with a 3×3 kernel; F(out) indicates that the feature map is output directly by the UViT block; F(local+global) indicates that the contextual local information at different scales and the global information extracted by the transformer block are concatenated in the channel direction and then summed with the input features. At a similar number of parameters and computational complexity, the first fusion method works best, with AP scores 1.4 and 1.6 points higher than those of the second and third methods, respectively, and PCKh scores 1.2 and 1.5 points higher, respectively. This indicates that preserving the detailed information of shallow features benefits pixel-level tasks, and that concatenation with the input features is more conducive to obtaining high-quality features.
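The difference between the fusion variants can be made concrete with a small sketch. The code below uses plain nested lists (channels × height × width) instead of a deep learning framework so that it stays self-contained. The helper names, the zero padding, and the assumption that the local and global branches together have as many channels as the input are illustrative choices, not details confirmed by the paper.

```python
from typing import List

Tensor = List[List[List[float]]]          # channels x height x width
Kernel = List[List[List[List[float]]]]    # out_channels x in_channels x 3 x 3


def concat_channels(a: Tensor, b: Tensor) -> Tensor:
    """Concatenate two feature maps along the channel axis."""
    return a + b


def add(a: Tensor, b: Tensor) -> Tensor:
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def conv3x3(x: Tensor, weight: Kernel) -> Tensor:
    """3x3 convolution with stride 1 and zero padding 1 (no bias)."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for per_input_kernels in weight:
        plane = [[0.0] * w for _ in range(h)]
        for c, kernel in enumerate(per_input_kernels):
            for i in range(h):
                for j in range(w):
                    for di in (-1, 0, 1):
                        for dj in (-1, 0, 1):
                            ii, jj = i + di, j + dj
                            if 0 <= ii < h and 0 <= jj < w:
                                plane[i][j] += kernel[di + 1][dj + 1] * x[c][ii][jj]
        out.append(plane)
    return out


def fuse_in_out(block_in: Tensor, block_out: Tensor, weight: Kernel) -> Tensor:
    """F(in+out): concatenate input and output, then fuse with a 3x3 convolution."""
    return conv3x3(concat_channels(block_in, block_out), weight)


def fuse_local_global(block_in: Tensor, local: Tensor, global_feat: Tensor) -> Tensor:
    """F(local+global): concatenate local and global features, then add the input."""
    return add(concat_channels(local, global_feat), block_in)
```

In F(in+out) the shallow input features pass through a learned 3×3 fusion rather than a fixed sum, which is one plausible reading of why that variant preserves detail better.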

Structure of UViT block. Take the UViT block of En_2 as an example on the MPII val set.
Ablation of the feature fusion strategy on the COCO val and MPII val sets. Params and GFLOPs are calculated for the different feature fusion methods. The input resolution is 256×192 for COCO and 256×256 for MPII
The U-shaped structural design has significant advantages in extracting contextual information. To further strengthen the attention paid to important features, an improved attention mechanism was introduced before the summation operation of the context extraction module to enhance the local modeling capability. Table 9 shows that using the coordinate attention mechanism improves the AP score by 0.8 points compared with not applying it, and that using the improved coordinate attention mechanism improves the AP score by 1.3 points. These ablation experiments show that a hybrid pooling operation can effectively improve performance on pixel-level tasks such as pose estimation.
Ablation of the attention mechanism on the COCO val and MPII val sets. Params and GFLOPs are calculated for the different attention mechanisms. The input resolution is 256×192 for COCO and 256×256 for MPII
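This section does not spell out how the hybrid pooling is implemented. As a rough illustration only, the sketch below assumes that it mixes average and max pooling along each spatial direction with equal weights, and it omits the 1×1 convolutions and normalization used in the original coordinate attention design. All names and the mixing rule are assumptions.

```python
import math
from typing import List

Plane = List[List[float]]  # a single channel, height x width


def hybrid_pool(values: List[float]) -> float:
    """Equal-weight mix of average and max pooling (assumed mixing rule)."""
    return 0.5 * (sum(values) / len(values) + max(values))


def pool_rows(x: Plane) -> List[float]:
    """Pool along the width: one descriptor per vertical position."""
    return [hybrid_pool(row) for row in x]


def pool_cols(x: Plane) -> List[float]:
    """Pool along the height: one descriptor per horizontal position."""
    return [hybrid_pool(list(col)) for col in zip(*x)]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_gate(x: Plane) -> Plane:
    """Reweight each position by its row gate and its column gate."""
    a_h = [sigmoid(v) for v in pool_rows(x)]
    a_w = [sigmoid(v) for v in pool_cols(x)]
    return [[x[i][j] * a_h[i] * a_w[j] for j in range(len(a_w))]
            for i in range(len(a_h))]
```

Pooling each direction separately keeps positional information along the other direction, which is what distinguishes coordinate attention from global-pooling channel attention; adding a max-pooled term makes the descriptor more sensitive to strong local responses such as keypoint peaks.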
ShuffleNetV2 [30] points out that computational complexity should not be judged by FLOPs alone, which is only an indirect measure of model complexity, and that inference speed is the most direct evaluation metric. In this section, the inference speed of UViT is compared with that of mainstream networks under the same hardware conditions. Figure 3 shows the AP scores, inference speed, and FLOPs measured on the COCO dataset.

Measurement of AP score, speed, and GFLOPs on COCO val2017 for UViT and SOTA methods. The two colors denote the two input sizes, 384×288 and 256×192. The size of each bubble indicates the FLOPs.
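A speed comparison of this kind is typically obtained with a small timing harness that discards warm-up runs and reports a robust statistic. The sketch below is one way to do this using only the Python standard library. The `benchmark` function, its parameters, and the use of the median are illustrative assumptions and do not describe the measurement protocol used for Fig. 3.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object], warmup: int = 10, iters: int = 50) -> Dict[str, float]:
    """Time a single-inference callable and report median latency and throughput."""
    # Warm-up runs absorb one-time costs such as lazy initialization and caching.
    for _ in range(warmup):
        run_once()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)

    median = statistics.median(timings)
    fps = 1.0 / median if median > 0 else float("inf")
    return {"median_ms": median * 1e3, "fps": fps}


# Stand-in workload; in practice run_once would wrap one forward pass of a model.
result = benchmark(lambda: sum(i * i for i in range(10_000)), warmup=2, iters=5)
print(sorted(result))  # ['fps', 'median_ms']
```

When the model runs on an accelerator with asynchronous execution, `run_once` has to block until the result is ready; otherwise only the launch overhead is timed. Using the same input size, batch size, and device for every network keeps the comparison fair.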
The high-resolution network HRNet [11] has a slow inference speed owing to its numerous parallel convolutions, and SwinT [12] remains computationally intensive even though it uses window-based multi-head self-attention (W-MSA). The inference speed of the minimal network, UViT-T, was approximately twice that of Simple152 [5], HRNet-W32 [11], and SwinT [12]. Therefore, this highly efficient and lightweight network could be deployed on edge devices.
To address the lack of a global view in CNNs, as well as the lack of spatial location information and the high demand for memory and computational resources in transformers, a hybrid architecture combining a CNN and a transformer was adopted in this study. The proposed U-shaped context module enhances the local feature extraction capability, and the UViT-Residual block enhances the global modeling capability. Compared with general-purpose networks, the proposed UViT network has fewer parameters and FLOPs and faster inference on human pose estimation tasks, at the cost of slightly lower accuracy; compared with lightweight networks, it achieves significantly higher accuracy with fewer parameters and FLOPs. This is presumably because the proposed architecture is designed to capture contextual information accurately at different scales while modeling global dependencies effectively, which is advantageous for vision tasks that, like pose estimation, require large receptive fields. Although UViT balances accuracy and efficiency in human pose estimation, its generalizability to other visual tasks requires further validation. Additionally, the universality and robustness of UViT need to be assessed on more diverse datasets to fully understand the limitations of the network.
Conclusion
In this study, a lightweight hybrid architecture of a CNN and a transformer is proposed for the effective deployment of pose estimation on edge devices. Considering the importance of contextual information, the U-shaped network structure was improved and an enhanced attention mechanism was incorporated to model contextual information at different scales. To further expand the receptive field and enhance the global modeling capability, a transformer structure was combined with this design, and a lightweight UViT-Residual block was designed to capture global information. Compared with better-performing methods such as the HRNet series and SwinT, the proposed method achieved good results on the COCO dataset while being more efficient in terms of the number of parameters, computational complexity, and inference speed. However, a lightweight design often leads to a decrease in accuracy. Reducing the number of network parameters and FLOPs for effective deployment on edge devices while maintaining accuracy is a key issue to be addressed in the future.
Acknowledgments
This work was supported by the National Key R&D Program of China (2017YFF0210600), Anhui Provincial Department of Education University Scientific Research Project (2023AH052233, 2022AH051361, 2022AH051368), Suzhou University Scientific Research Project (2021fzjj20, 2023yzd14), and Suzhou university center for design and research magnetic information materials (2021XJPT15).
Conflicts of interest
The authors declare no conflict of interest.
