A lightweight method of pose estimation for indoor objects

Abstract

Indoor scenes contain many types of objects whose geometric structures and scales are uncertain, so position and pose estimation from point clouds of indoor objects by mobile robots suffers from domain gap, high learning cost, and high computing cost. This paper proposes a lightweight 6D pose estimation method that decomposes the pose into a viewpoint and the in-plane rotation around the optical axis of that viewpoint. An improved PointNet++ network structure and two lightweight modules are used to construct a codebook, and the 6D pose of the point cloud of an indoor object is estimated by building and querying the codebook. The model was trained on the ShapeNetV2 dataset and validated with the ADD-S metric on the YCB-Video and LineMOD datasets, reaching 97.0% and 94.6% respectively. The experiments show that the trained model can estimate the 6D position and pose of point clouds of unknown objects with lower computation and storage cost, and that it is superior to other high-precision methods in parameter count and real-time performance.

Keywords

Domain adaptation, 6D pose estimation, lightweight neural network, indoor scene, mobile robot

1. Introduction

An intelligent mobile robot in an indoor scene has many functional requirements, such as grasping [1], obstacle avoidance [2], and virtual reality [3], all of which depend on estimating object pose. With the development of mobile robots and 5G technology, robots increasingly transfer data to a central cloud server for processing. This robot-cloud collaboration architecture has drawbacks. On the one hand, transmitting a large amount of raw data puts heavy pressure on the network and introduces non-negligible delay in actual operation [4]. On the other hand, processing point clouds locally is subject to resource constraints, including memory capacity and computing power, so lightweight neural networks are better suited to edge computing platforms such as mobile robots.

For a mobile robot to recognize the rigid-body pose of an object in an indoor scene, it must estimate, from the input point cloud, the rotation and displacement of the object relative to the camera or a viewpoint. In 3D space, these two properties together have six degrees of freedom, so the task is called 6D pose estimation [5]. In pose estimation, the point cloud to be estimated is the target domain, while the training point cloud is the source domain. The domain gap is the distance between these domains. Because indoor objects differ in geometric structure and size, the domain gap becomes a problem that cannot be ignored in pose estimation for mobile robots.

To solve the above problems, a generalized and lightweight method for pose estimation from point clouds is proposed. The main contributions of this paper are as follows:

• An unsupervised learning method is proposed to reduce the domain gap that arises when a robot is deployed for pose estimation of indoor objects. The network does not regress the pose directly from the input point cloud; instead, the pose is decomposed into a viewpoint and an in-plane rotation around the optical axis of the viewpoint.
• To reduce the computing cost on mobile robots, a lightweight end-to-end framework is proposed, in which a backbone based on an improved PointNet++ is cascaded with two lightweight modules that perform viewpoint prediction and regression of the in-plane rotation around the optical axis of the viewpoint.
• To reduce the cost of learning point cloud models of unknown objects, a codebook construction-and-query method provides the basis for viewpoint prediction in object pose estimation, while also reducing redundant computation during inference.

2. Related work

Scholars have conducted extensive research on 6D pose estimation of indoor objects. In general, the methods fall into three types: template matching, feature matching, and Hough voting.

Feature matching finds corresponding features between the input data and the complete 3D point cloud of a known object, and then recovers the 6D position and pose of the object from the correspondences. If the input is a 2D image, key points must be matched between the image and the point cloud. If the input is a point cloud, features must be extracted from the target point cloud, and the model is optimized by reducing the feature distance to decrease the loss. Fusing multidimensional information is also suitable for estimating the pose of severely occluded objects. Chen proposed a hybrid representation method [6] that integrates multidimensional features such as geometric information, key points, and edge vectors to reduce inaccurate representation of raw data, and it demonstrates better performance and accuracy than methods using single-dimensional information. Similarly, Shugurov proposed a method that combines a 2D object detector with a dense correspondence estimation network to generate predictions from views produced by various imaging modalities; poses are then refined based on predicted and rendered correspondences [7]. Feature matching usually handles occlusion well. Huang proposed a network that learns 2D-3D correspondences from RGB input and the camera frustum and performs robust 3D-3D matching, which attenuates the effect of occlusion [8]. These methods are suitable for objects with rich textures or geometric details, but they are easily affected by lighting and texture.

For objects with weak texture, template matching is more suitable for pose estimation. This method selects the most similar template from a set of templates annotated with complete 6D poses, and the pose of that template is taken as the target pose. Aoki used PointNet as a learnable imaging function and modified the Lucas-Kanade algorithm to work with PointNet, improving the generalization and efficiency of the estimation [9]. Unlike traditional methods that use images as the pose estimation input, Gao proposed using a point cloud as the depth input, with a separate network used to learn the regression of rotation and translation; the network is based on the axis-angle representation and is optimized by reducing a geodesic loss [10]. Later, to deal with the domain gap, Gao proposed an autoencoder that learns the pose information of point clouds as a code, from which the rotation and translation are inferred [11]. To reduce the impact of occlusion, Hua proposed a pose estimator based on RGBD information that assigns weights to outliers in a differentiable manner to improve the confidence of the regression results [12]. This type of method performs poorly under severe occlusion; especially when part of the geometric structure of the point cloud is missing, stable output is hard to obtain.

Hough voting methods learn a mapping function that maps each point or region of the input data into a voting space, obtain candidate poses from the voting results with a clustering algorithm, and finally select an optimal pose by other means. Wang proposed a method for pixel-level dense fusion of heterogeneous RGB and depth data, in which the pose is estimated from the fused features with high precision [13]. Alternatively, He designed a network that generates key points for voting from the two modalities separately and calculates the pose with a least-squares fitting algorithm [14]. To fully utilize both types of information, He conducted sparse fusion on the two streams, improving both the performance and accuracy of pose estimation [15]. This type of method performs well with multimodal input but still suffers from severe occlusion with single-modal input.

These methods achieve high accuracy in 6D pose estimation but do not demonstrate the ability of domain adaptation. Their networks mostly need to be optimized for each individual model in the training dataset, so accuracy decreases significantly on objects for which they have not been optimized. This is unsuitable for indoor environments, where objects are complex and ever-changing. When a mobile robot needs to estimate the pose of additional objects, the network must be retrained, which brings expensive learning costs in both time and computation.

In reverse engineering, a point cloud is a collection of points on the surface of an object obtained by measuring instruments. Because point clouds are unordered, discrete, and unstructured, extracting features from them is a challenge. PointNet pioneered deep learning on point clouds, using MLPs to raise the feature dimension and a max pooling layer to extract the global information of the point cloud [16]. PointNet++ followed, using grouping to extract local features with PointNet and then aggregating them into global features [17]. Scholars in various fields have optimized and improved methods based on PointNet and PointNet++ for different tasks and achieved satisfactory results in 3D point cloud object detection [18], semantic segmentation [19], classification [20], and so on.

To address the domain gap, Cai proposed OVE6D [21], which uses a viewpoint encoder to estimate the pose of unseen objects from depth maps after training on synthetic data. Inspired by OVE6D, this paper proposes a pose estimation method for indoor scenes represented by point clouds, in which the pose of an object is decomposed into a virtual viewpoint relative to the model and an in-plane rotation around the optical axis of the viewpoint. Unlike previous methods, this method does not regress the object pose directly but takes the above decomposition as the target, and adapts to the pose estimation of different objects in an unsupervised manner.

3. Method

The method in this paper is a template matching method. To estimate the pose of the point cloud of an indoor object, a pre-segmented mask must be provided. The model then infers the rigid transformation from the standard model to the target model, which can be represented by a rotation matrix R and a displacement vector t.

The method consists of three stages. During training, the ShapeNetV2 dataset is used to train the model, and a mapping is computed between a random viewpoint and the object’s complete point cloud; this mapping is represented as a mask over the object’s point cloud, and the paper calls it the map of the viewpoint. Before inference, the point cloud of the training dataset is first standardized and a uniform sphere of viewpoints is generated; the map of each viewpoint is calculated, encoded, and stored in a codebook file. In the inference stage, the map of the viewpoint calculated from the target object relative to the virtual camera is encoded, and the code is compared with the codebook to deduce the candidate viewpoints and the in-plane rotation matrices around the optical axes of those viewpoints. Finally, the best matching template is selected from this information.
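The uniform sphere of viewpoints can be illustrated with a Fibonacci lattice, which the description of the inference process names as the sampling scheme. The following is a minimal sketch, not the authors' implementation; the function name `fibonacci_sphere` and the unit radius are assumptions made for illustration.

```python
import numpy as np

def fibonacci_sphere(n: int, radius: float = 1.0) -> np.ndarray:
    """Return n roughly uniform points on a sphere (Fibonacci lattice)."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    i = np.arange(n)
    z = 1.0 - (2.0 * i + 1.0) / n      # evenly spaced heights in (-1, 1)
    r = np.sqrt(1.0 - z * z)           # radius of the ring at height z
    phi = golden_angle * i             # azimuth advances by the golden angle
    pts = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
    return radius * pts

# 2,100 viewpoints, the count used when building the codebook.
viewpoints = fibonacci_sphere(2100)
```

Each row of `viewpoints` is a candidate camera direction for which a map would be computed, encoded, and stored.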

Figure 1.

The inference process of pose estimation.

As shown in Fig. 1, the inference process for pose estimation consists of four steps. Step A encodes the point cloud of a standard object: 2,100 uniform viewpoints are generated on a sphere with a Fibonacci grid, the map of each viewpoint is calculated and encoded by the encoder, and the codes, the standard model, the scaling factor, and other information are saved as a codebook file. In Step B, during inference, the point cloud visible from the observation viewpoint is encoded by the encoder, and the cosine similarity between this code and each code in the codebook is computed to select several candidate viewpoints, which are shown in the diagram as blue points. In Step C, the map of each candidate viewpoint and the map of the observation viewpoint are almost identical except for a rotation around the optical axis of the observation viewpoint; this rotation is predicted and applied to obtain the rotated map of each candidate viewpoint. Step D builds kd-trees from the maps of the candidate viewpoints and matches the 32 points of interest extracted by the backbone in Step B against each kd-tree. The viewpoint with the smallest distance loss is the target viewpoint, and the pose is calculated from this viewpoint and the rotation matrix around its optical axis.
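Steps B and D can be sketched as a cosine-similarity lookup followed by a nearest-neighbor distance score. This is an illustrative sketch under stated assumptions: the 64-dimensional code size, the random stand-in codebook, and the function names are hypothetical, and a brute-force nearest-neighbor search stands in for the kd-tree to keep the example dependent on NumPy only.

```python
import numpy as np

def top_k_viewpoints(query_code, codebook, k=16):
    """Indices and scores of the k codebook codes most similar to query_code."""
    q = query_code / np.linalg.norm(query_code)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = c @ q                        # cosine similarity per stored viewpoint
    order = np.argsort(-sims)[:k]       # best match first
    return order, sims[order]

def mean_nn_distance(interest_points, candidate_map):
    """Mean distance from each interest point to its nearest map point.

    Brute-force stand-in for the kd-tree query of Step D.
    """
    d = np.linalg.norm(
        interest_points[:, None, :] - candidate_map[None, :, :], axis=2
    )
    return d.min(axis=1).mean()

rng = np.random.default_rng(0)
codebook = rng.normal(size=(2100, 64))  # assumed: one 64-D code per viewpoint
query = codebook[123] + 0.01 * rng.normal(size=64)
idx, scores = top_k_viewpoints(query, codebook, k=16)
```

In a full pipeline, `mean_nn_distance` would be evaluated once per candidate in `idx`, and the candidate with the smallest value would be kept as the target viewpoint.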

3.1 The point cloud viewpoint encoder

After the ShapeNetV2 training dataset is pre-processed, the surface point clouds of the indoor objects are obtained. The pose estimation method extracts features from the point cloud and carries out unsupervised learning. The network model consists of a backbone network and two lightweight heads, which perform viewpoint selection and point cloud rotation respectively.

Figure 2.

The structure of the network.

Fig. 2 shows the structure of the whole network. A triplet of viewpoint maps { $\dot{P},\dot{P}_{\theta},\dot{P}_{\gamma}$ } is calculated from the point cloud of an object, and features are extracted from the maps to obtain the aggregated features { $z,z_{\theta},z_{\gamma}$ }. These are fed into the two heads, which produce the corresponding codes { $v,v_{\theta},v_{\gamma}$ } and the predicted rotation matrix $\theta_{pd}$ between the maps $\dot{P}$ and $\dot{P}_{\theta}$.

3.2 Feature extraction of point cloud

For feature extraction, as shown in part A of Fig. 3, a network structure based on PointNet++ is used, which includes three set abstraction (SA) layers. Each SA layer selects interest points with the farthest point sampling (FPS) algorithm and performs local clustering with the k-nearest-neighbor (KNN) algorithm. For each cluster, a shared MLP and max pooling are used for feature extraction. Inspired by PointNeXt [22], some improvements were made to the SA layers and hyperparameters of PointNet++, which reduced the number of parameters.
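The sampling, grouping, and pooling steps of one SA layer can be sketched in NumPy as follows. This is a simplified illustration rather than the network used in the paper: the shared MLP is omitted, so the features are pooled as given, and the function names and array sizes are assumptions.

```python
import numpy as np

def farthest_point_sampling(points, m, start=0):
    """Pick m point indices so that each new point is farthest from those chosen."""
    chosen = np.empty(m, dtype=int)
    chosen[0] = start
    dist = np.linalg.norm(points - points[start], axis=1)
    for j in range(1, m):
        chosen[j] = int(np.argmax(dist))
        new_dist = np.linalg.norm(points - points[chosen[j]], axis=1)
        dist = np.minimum(dist, new_dist)   # distance to the nearest chosen point
    return chosen

def knn_group(points, center_idx, k):
    """For each center, return the indices of its k nearest points."""
    centers = points[center_idx]
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=2)
    return np.argsort(d, axis=1)[:, :k]

def sa_layer_pool(points, features, m, k):
    """FPS + KNN grouping + per-cluster max pooling (shared MLP omitted)."""
    idx = farthest_point_sampling(points, m)
    groups = knn_group(points, idx, k)       # (m, k) neighbor indices
    pooled = features[groups].max(axis=1)    # (m, C) max over each cluster
    return points[idx], pooled

rng = np.random.default_rng(0)
cloud = rng.normal(size=(256, 3))
feats = rng.normal(size=(256, 8))
centers, pooled = sa_layer_pool(cloud, feats, m=32, k=16)
```

Stacking several such layers with decreasing `m` gives the hierarchical structure of the backbone; in a trained network, a shared MLP would transform each neighbor's features before the max pooling.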

Figure 3.

Models of the network.

In the training stage, unsupervised learning of the subsequent Code Head and IRR Head requires different viewpoint maps of the point cloud of an object. For this purpose, a number of random viewpoints are generated, and the corresponding triplet maps { $\dot{P},\dot{P}_{\theta},\dot{P}_{\gamma}$ } are generated with the method described in the data pre-processing. For each viewpoint, the three maps are, respectively, the map of the viewpoint, the map rotated in-plane by an arbitrary angle $\theta$ around the optical axis of the viewpoint, and the map of a viewpoint that differs from the original one by an out-of-plane angle $\gamma$. The three maps are fed into the backbone network for feature extraction, yielding the triplet of aggregated feature vectors { $z,z_{\theta},z_{\gamma}$ }.
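The geometric relation between the three maps can be sketched as follows. The sketch assumes that the optical axis is the z axis and that the out-of-plane change is a rotation about the x axis; both choices, and the function names, are illustrative assumptions. It only rotates the points, whereas the actual map of the $\gamma$ viewpoint would also require recomputing which surface points are visible.

```python
import numpy as np

def rot_z(theta):
    """Rotation by theta about the z axis (assumed optical axis)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(gamma):
    """Rotation by gamma about the x axis (assumed out-of-plane direction)."""
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def make_triplet(cloud, theta, gamma):
    """Return the original map, its in-plane rotation, and an out-of-plane view."""
    p = cloud
    p_theta = cloud @ rot_z(theta).T    # same viewpoint, rotated in-plane
    p_gamma = cloud @ rot_x(gamma).T    # different viewpoint
    return p, p_theta, p_gamma

rng = np.random.default_rng(1)
cloud = rng.normal(size=(50, 3))
p, p_theta, p_gamma = make_triplet(cloud, theta=0.7, gamma=0.4)
```

The in-plane rotation leaves the depth along the optical axis unchanged, which is why the codes of `p` and `p_theta` are expected to stay close while the code of `p_gamma` should differ.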

3.3 Code head and IRR head

The objective of the Code Head is to be insensitive to the in-plane rotation of a viewpoint and sensitive to the selection of viewpoints. The code must also be a lower-dimensional feature, derived from the global features of the map, that is suitable for storage. As shown in part B of Fig. 3, the encoder is designed as an SA-ALL layer followed by three FC layers, where the SA-ALL module extracts global features from the backbone output. After coding, the triplet code { $v,v_{\theta},v_{\gamma}$ } is obtained from the triplet of aggregated feature vectors { $z,z_{\theta},z_{\gamma}$ }.

To make the head sensitive to different viewpoints and insensitive to the rotation of maps from the same viewpoint, the optimization goal of the Code Head is described as $S(v,v_{\theta})<S(v,v_{\gamma})$, where S is the cosine similarity function, which measures the similarity of two vectors. Because the cosine distance between codes from different viewpoints should be as large as possible, the loss is defined as:

$\displaystyle\textit{Loss}_{v}=\max\{S(v,v_{\theta})-S(v,v_{\gamma})+\textit{margin},0\}$ (1)

In Eq. (1), the margin is the ranking margin, which represents the gap between the codes of maps rotated in-plane around the optical axis of a viewpoint and the codes of maps from different viewpoints. Because the preset viewpoints in the codebook are discrete points, a margin is needed to control the generalization ability in map selection. The margin is set to 0.1 in this paper.
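Eq. (1) can be evaluated numerically as in the sketch below. The sketch assumes that S in Eq. (1) is computed as a cosine distance (1 minus the cosine similarity), since that is the reading under which minimizing the loss keeps the codes of in-plane rotations close and the codes of other viewpoints far apart. This reading, the helper names, and the toy vectors are assumptions for illustration.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; 0 for parallel vectors, 2 for opposite ones."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def code_loss(v, v_theta, v_gamma, margin=0.1):
    """Eq. (1) with S taken as the cosine distance (assumption)."""
    s_theta = cosine_distance(v, v_theta)
    s_gamma = cosine_distance(v, v_gamma)
    return max(s_theta - s_gamma + margin, 0.0)

v = np.array([1.0, 0.0])
v_theta = np.array([1.0, 0.01])   # code of the in-plane rotated map: nearly identical
v_gamma = np.array([0.0, 1.0])    # code of another viewpoint: very different

good = code_loss(v, v_theta, v_gamma)   # ranking already satisfied
bad = code_loss(v, v_gamma, v_theta)    # ranking violated
```

With the margin of 0.1, the loss vanishes only once the two distances differ by at least 0.1, which leaves some tolerance around each discrete viewpoint in the codebook.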

The aggregated features extracted by the backbone network already describe the global structure of the maps. To obtain the rotation angle $\theta$ between two similar maps around the optical axis of the viewpoint, the triplet of aggregated feature vectors { $z,z_{\theta},z_{\gamma}$ } needs to be upsampled, as shown in part C of Fig. 3. The upsampling module is designed as two feature propagation (FP) layers; each FP layer performs point cloud upsampling and feature propagation, after which features are extracted by a shared MLP, and the rotation matrix is obtained through three FC layers.

The goal of the IRR Head is to infer a rotation matrix for the angle $\theta$ such that, after rotation by $\theta$ around the optical axis of the viewpoint, the map coincides as closely as possible with the target map. The degree of overlap between the two point clouds can also be described by the cosine similarity, so the loss is defined as follows:

$\displaystyle\textit{Loss}_{\theta}=-\log\left(\frac{S(\hat{P},\hat{P}_{\theta_{pd}})+1}{2}\right)$ (2)

Here $\hat{P}$ is a map sampled from the verified point cloud of model P relative to the viewpoint. Since sampling $\hat{P}$ does not alter the order of the points, the loss can be taken as the negative logarithm of the cosine similarity of the two point clouds, rescaled to [0, 1], and the IRR Head is optimized through the cosine similarity of the two maps. The whole network consists of a shared backbone and two head branches and is trained end to end with the combined loss:

$\displaystyle\textit{Loss}=\lambda_{1}\textit{Loss}_{v}+\lambda_{2}\textit{Loss}_{\theta}$ (3)

Here $\lambda_{1}$ and $\lambda_{2}$ are weighting parameters, set to 100 and 10 respectively in this paper.
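Eqs. (2) and (3) can be sketched as follows. The sketch assumes that the cosine similarity of two point clouds is computed on their flattened coordinate arrays, which is possible because the point order is preserved; this flattening and the function names are assumptions for illustration.

```python
import numpy as np

def cloud_cosine(p, q):
    """Cosine similarity of two equally ordered point clouds (flattened)."""
    a, b = p.ravel(), q.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def irr_loss(p_target, p_rotated):
    """Eq. (2): negative log of the cosine similarity rescaled to [0, 1]."""
    s = cloud_cosine(p_target, p_rotated)
    return -np.log((s + 1.0) / 2.0)

def total_loss(loss_v, loss_theta, lam1=100.0, lam2=10.0):
    """Eq. (3) with the weights 100 and 10 used in the paper."""
    return lam1 * loss_v + lam2 * loss_theta

rng = np.random.default_rng(2)
cloud = rng.normal(size=(100, 3))
quarter_turn = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

aligned = irr_loss(cloud, cloud)                       # perfect overlap
misaligned = irr_loss(cloud, cloud @ quarter_turn.T)   # wrong by 90 degrees
```

The loss is close to zero when the predicted rotation brings the two maps into coincidence and grows as the overlap degrades.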

4. Experiments

The model is trained on the ShapeNetV2 dataset [23], and two experiments are designed to verify the validity of the method in addressing the domain gap and its accuracy in pose estimation. The first estimates the pose of indoor objects from the ModelNet40 dataset [24] under random translation and rotation. The second compares the pose estimation performance with other methods on the public datasets YCB-Video [25] and LineMOD [26].

4.1 Training dataset preprocessing

The point cloud data of an indoor object comes from the surface of the object, whereas the models provided by the ShapeNetV2 dataset are composite mesh models. To meet the input requirements for training, surface point clouds must therefore be sampled from the composite models through reverse engineering.

Algorithm 1. Surface point cloud sampling

Input: point cloud of an object P: N×3
Output: point cloud after mask $\dot{P}$ : M×3

$\overline{P}\leftarrow$ Normalize point cloud P
sample_number $\leftarrow$ 6
threshold $\leftarrow$ 0.05
z_axis $\leftarrow$ [0, 0, 1]
z_viewpoint $\leftarrow$ [0, 0, 3]
masks $\leftarrow$ []
for i in sample_number do
    v $\leftarrow$ Generate Random Viewpoint
    R $\leftarrow$ Rodrigues (v, z_axis)
    $\widetilde{P}\leftarrow\overline{P}\ast R$
    vecs $\leftarrow$ []
    mask $\leftarrow$ []
    for point p in $\widetilde{P}$ do
        vec $\leftarrow$ z_viewpoint – p
        $\overline{\textit{vec}}\leftarrow$ Normalized Vector vec
        Add $\overline{\textit{vec}}$ in vecs
    end for
    kdtree $\leftarrow$ Get KDTree From vecs
    for vec $\overline{v}$ in kdtree do
        $\overline{vs}\leftarrow$ Query Ball $\overline{v}$ in kdtree
        ps $\leftarrow$ from p by index of $\overline{vs}$
        max_i, max_z $\leftarrow$ Select from ps.z
        Add max_i in mask
        Remove ps from p Except max_z – ps.z $<$ threshold
    end for
    Add mask in masks
    i $\leftarrow$ i $+$ 1
end for
masks $\leftarrow$ Unique masks
$\mathbf{return}$ P [masks]

First, the PCL library for C++ is used to sample the surface of each component of the mesh model to obtain the full point cloud of the object. Afterward, the points lying inside the object must be removed. The surface point cloud of the object is obtained as described in Algorithm 1. Specifically, the map of a viewpoint is computed in rows 7–24 of the algorithm. The time complexity of the algorithm is O(N²), where N is the number of points in the point cloud.
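The visibility test at the core of Algorithm 1 can be sketched in Python as follows. This is a simplified, brute-force variant rather than the authors' implementation: it compares all ray directions pairwise instead of querying a kd-tree, and it marks a point as visible when no point on nearly the same ray is more than the threshold closer to the camera. The ray radius of 0.02 and the function names are assumptions; the threshold 0.05 and the camera position [0, 0, 3] follow the algorithm.

```python
import numpy as np

def rotation_to_z(v):
    """Rodrigues rotation that takes the direction v onto the +z axis."""
    v = v / np.linalg.norm(v)
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(v, z)
    s, c = np.linalg.norm(axis), float(v @ z)   # sin and cos of the angle
    if s < 1e-12:                               # v is parallel or antiparallel to z
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = axis / s
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

def visible_mask(cloud, viewpoint, ray_radius=0.02, threshold=0.05, cam_z=3.0):
    """Boolean mask of the points visible from the given viewpoint direction."""
    p = cloud @ rotation_to_z(viewpoint).T      # viewpoint now lies on +z
    rays = np.array([0.0, 0.0, cam_z]) - p
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    # Pairwise distances between unit ray directions: O(N^2).
    d = np.linalg.norm(rays[:, None, :] - rays[None, :, :], axis=2)
    keep = np.zeros(len(p), dtype=bool)
    for i in range(len(p)):
        nbrs = np.flatnonzero(d[i] < ray_radius)   # points on nearly the same ray
        max_z = p[nbrs, 2].max()                   # closest of them to the camera
        keep[i] = max_z - p[i, 2] < threshold
    return keep

# Two points on one line of sight plus one point off to the side.
cloud = np.array([[0.0, 0.0, 0.5], [0.0, 0.0, -0.5], [0.5, 0.0, 0.0]])
front_view = visible_mask(cloud, np.array([0.0, 0.0, 1.0]))
back_view = visible_mask(cloud, np.array([0.0, 0.0, -1.0]))
```

Seen from +z the point at z = −0.5 is hidden behind the point at z = 0.5, and seen from −z the roles are swapped; the side point is visible in both cases.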

4.2 Introduction to validation datasets

The YCB-Video dataset consists of 21 objects selected from the YCB dataset and includes real scenes and rendered scenes. Each real scene is made up of 3–9 objects and was shot with an RGBD camera; 92 videos were recorded, containing a total of 133,827 frames. The rendered scenes were produced with BlenderProc4BOP, with the composite images and pose annotations generated automatically; they show the 21 objects under different backgrounds, lighting, and viewing angles, for a total of 80,000 images.

The LineMOD dataset contains 15 texture-less or low-texture household objects, each with its own set of test images. Each image set shows instances of the object amid heavy clutter and slight occlusion.

4.3 Evaluation metrics

The common metrics for 6D pose estimation of point clouds are ADD and ADD-S, which evaluate the matching degree of point clouds for asymmetric and symmetric objects respectively. The formulas are as follows:

$\displaystyle\textit{ADD}=\frac{1}{m}\sum_{x_{1}\in{P},x_{2}\in{\dot{P}_{pd}}}\|(Rx_{1}+t)-(R_{pd}x_{2}+t_{pd})\|$ (4)

$\displaystyle\textit{ADD-S}=\frac{1}{m}\sum_{\widetilde{x}_{1}\in{P},\widetilde{x}_{2}\in{\dot{P}_{pd}}}\|(R\widetilde{x}_{1}+t)-(R_{pd}\widetilde{x}_{2}+t_{pd})\|$ (5)

Here $x_{1}$ and $x_{2}$ are corresponding points in the actual and predicted point clouds respectively, while $\widetilde{x}_{1}$ and $\widetilde{x}_{2}$ are the closest matched points in the actual and predicted point clouds respectively. R and t are the ground-truth rotation matrix and displacement vector, $R_{pd}$ and $t_{pd}$ are the predicted ones, and m is the number of points. An estimate is considered accurate when the metric is below a threshold of 0.1 times the diameter of the object [27]. The ADD metric is commonly used for asymmetric objects, and the ADD-S metric for symmetric ones.
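The two metrics and the acceptance threshold can be sketched as follows. The sketch follows the common formulation in which the same model points are transformed by the ground-truth and the predicted pose; treating both point sets as the same model, and the function names, are assumptions for illustration.

```python
import numpy as np

def add_metric(model, R, t, R_pd, t_pd):
    """ADD: mean distance between corresponding transformed points."""
    gt = model @ R.T + t
    pd = model @ R_pd.T + t_pd
    return np.linalg.norm(gt - pd, axis=1).mean()

def add_s_metric(model, R, t, R_pd, t_pd):
    """ADD-S: mean distance from each ground-truth point to its closest predicted point."""
    gt = model @ R.T + t
    pd = model @ R_pd.T + t_pd
    d = np.linalg.norm(gt[:, None, :] - pd[None, :, :], axis=2)
    return d.min(axis=1).mean()

def is_correct(error, diameter, ratio=0.1):
    """Accept the pose when the error is below ratio times the object diameter."""
    return error < ratio * diameter

# A square is symmetric under a 90-degree turn about z.
square = np.array([[1.0, 0, 0], [0, 1.0, 0], [-1.0, 0, 0], [0, -1.0, 0]])
quarter_turn = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
zero = np.zeros(3)

add = add_metric(square, np.eye(3), zero, quarter_turn, zero)
add_s = add_s_metric(square, np.eye(3), zero, quarter_turn, zero)
```

For this symmetric shape, a prediction that is off by a quarter turn is indistinguishable from the ground truth, so ADD-S accepts it while ADD rejects it, which is why ADD-S is the metric reported for symmetric objects.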

4.4 Experimental results

The experimental platform is an ACER B36H4-AM2 (DCH) with an i5-9500 CPU and an NVIDIA RTX 2080 Super GPU, running Ubuntu 18.04. The model is trained on high-performance cloud servers with the preprocessed training dataset, which includes 45,645 object point clouds. Before validation, a codebook must be generated for each object; this takes about 400 seconds per object and occupies about 9.3 MB.

First, 100 models are selected from the ModelNet40 dataset, and a rotation in $[0, 2\pi]$ and a displacement in $[-1, 1]$ are generated for each model. The poses of these objects are estimated with the method of this paper. In addition, a number of chair objects are selected to verify the validity of pose estimation for similar objects.

Figure 4.

Visualization of ModelNet40.

In Fig. 4, the target point cloud is shown in red and the predicted point cloud in green. Group A shows the pose estimates for objects selected from the ModelNet40 dataset without refinement by the ICP algorithm; the ADD and ADD-S scores are 81.1% and 99.9% respectively, which indicates that the method has low prediction accuracy for symmetric objects and high prediction accuracy for asymmetric objects. Group B shows strong generalization to objects of the same type, again without ICP refinement, with an ADD-S score of 86.4%.

To demonstrate the encoder’s sensitivity to viewpoint selection, an aircraft model is encoded and a random viewpoint is generated; the candidate viewpoints recommended by the Code Head for the aircraft model are then visualized.

Figure 5.

Visualization of viewpoint selection.

The visualization results are shown in part A of Fig. 5. The red circle marks the virtual viewpoint. The higher the probability of a candidate viewpoint, the farther the point is from the center and the bluer its color; conversely, the lower the probability, the closer the point is to the center and the redder its color. The results show that the encoder produces a correct probability distribution over viewpoints. However, the symmetric part of the object also receives a high probability, indicating that the encoder has low robustness to object symmetry.

In addition, the maps of the two viewpoints were rotated in-plane over the range [ $-$ 1, 1] $\pi$ with a step size of 0.5, and the cosine similarities of the codes of these maps were compared.

As shown in part B of Fig. 5, group A is the cosine similarity between maps rotated around the optical axis of the same viewpoint, and group B is the cosine similarity between the maps of other viewpoints and the map in group A. The figure shows that the codes of maps from the same viewpoint differ little, whereas the codes of maps from different viewpoints differ considerably.

In another experiment, the trained model was validated on pose estimation datasets not used for training, namely YCB-Video and LineMOD, which include both synthetic and real data. Pose estimation on these datasets is challenging because of low texture, severe occlusion, and incomplete point clouds. Partial visualization results are shown in Fig. 6.

Table 1

ADD-S (%) on the YCB-Video dataset

Method	Input	ADD-S (synthetic)	ADD-S (synthetic + real)
BaseNet [10]	D	–	91.3
BaseNet* [10]	D	–	94.7
CloudAAE* [11]	D	93.5	94.0
DenseFusion [13]	RGBD	–	93.1
PVN3D* [14]	RGBD	95.5	96.1
FFB6D* [15]	RGBD	96.6	97.0
Ours	D	96.3	97.0

Table 1 shows the ADD-S metrics for real and synthetic data in the YCB-Video dataset. For the real scenes, the key frames of the validation set are used; for the synthetic data, part of the training set is selected as the validation set. Methods marked with * are evaluated after refinement with the ICP algorithm. The experimental results show that, at a threshold of 0.1 times the diameter, the prediction accuracy of the proposed method, even without ICP refinement, is on par with or only slightly lower than that of current advanced methods.

Table 2

ADD-S (%) on the LineMOD dataset

Method	Input	ADD-S (synthetic)	ADD-S (synthetic + real)
CloudAAE [11]	D	82.1	82.6
CloudAAE* [11]	D	92.1	95.5
DenseFusion* [13]	RGBD	–	94.3
PVN3D* [14]	RGBD	–	99.4
FFB6D* [15]	RGBD	–	99.7
Ours	D	–	94.6

Table 3

Comparison of model size, inference time, and memory usage

Method	DenseFusion	PVN3D	FFB6D	Ours
Parameters	23,382,131	48,653,680	33,850,392	2,505,187
Time	90.9 ms	316.0 ms	124.7 ms	37.2 ms
VRAM	2047 MB	2218 MB	1996 MB	1732 MB

Figure 6.

Visualization of pose estimates on YCB-Video data set.

Table 2 shows the ADD-S metrics for the LineMOD dataset, where the model trained on synthetic data is validated with the real-scene training set as the validation set. Since the point clouds in the LineMOD dataset are more severely fragmented, at a threshold of 0.1 times the diameter the method of this paper is slightly inferior to the mainstream methods refined with the ICP algorithm. In particular, when CloudAAE and FFB6D are validated on this dataset with models that have not been trained on these objects, the resulting ADD-S metric is 0.

To verify how lightweight the model is, the number of parameters, the time taken to complete one pose estimation, and the consumption of computing resources are compared on the experimental platform described above.

Table 3 compares DenseFusion, PVN3D, FFB6D, and the method of this paper. First, the number of model parameters was counted with the method provided by PyTorch; the model in this paper has far fewer parameters than the other methods. Second, the time taken for pose estimation was measured from point cloud input to completion of the pose estimate; the method in this paper is faster than the other methods. Finally, GPU memory usage was measured as the consumption of computing resources. Because the memory consumption of the method depends on the number of recommended viewpoints, with 16 recommended viewpoints its graphics memory usage is only slightly lower than that of the other methods.

The experimental results show that the model can accurately estimate the 6D position and pose of object point clouds for unseen models. Its average accuracy is slightly lower than that of models trained on the specific objects, but the model is more lightweight and requires only an untextured point cloud as input, which makes it suitable for deployment on mobile robots.

5. Conclusion

This article proposes a lightweight pose estimation method for point clouds of indoor objects. A viewpoint encoder is used to accurately estimate the 6D pose of known or similar object point clouds by building a codebook from untextured point cloud data. The entire process has a low cost in both computation and storage, and it demonstrates strong generalization ability. To estimate the pose of a new object, only a codebook needs to be constructed, which takes much less time than model training. This allows more flexible deployment of target pose estimation systems in mobile robot applications. At present, the framework still has some deficiencies. When a codebook is constructed, the map of a viewpoint is calculated sequentially, so the algorithm can only run on the CPU, resulting in a long codebook construction time. In addition, a segmented point cloud or a mask of the object must be obtained before inference, so the method relies on other tasks such as object detection or even semantic segmentation.

Acknowledgments

This work was supported by the National Key Research and Development Project of China (Grant Number 2022YFC3601400).

References

1. Huang, Liu, Cheng, Estimating 6d object poses with temporal motion reasoning for robot grasping in cluttered scenes, IEEE Robotics and Automation Letters, 2022.
2. Nguyen, A.-T., C.-T., Obstacle Avoidance for Autonomous Mobile Robots Based on Mapping Method, in: Proceedings of the International Conference on Advanced Mechanical Engineering, Automation, and Sustainable Development 2021 (AMAS2021), Springer, 2022, pp. 810–816.
3. Wang, Chen, Dou, Category-Level 6D Object Pose Estimation via Cascaded Relation and Recurrent Reconstruction Networks, in: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 4807–4814. doi: 10.1109/IROS51168.2021.9636212.
4. Huang, Zeng, Chen, Luo, Zhou, Edge robotics: Edge-computing-accelerated multirobot simultaneous localization and mapping, IEEE Internet of Things Journal 9(15) (2022), 14087–14102.
5. Gorschlüter, Rojtberg, Pöllabauer, A survey of 6d object detection based on 3d models for industrial applications, Journal of Imaging 8(3) (2022), 53.
6. Song, Huang, Hybridpose: 6d object pose estimation under hybrid representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 431–440.
7. Shugurov, Zakharov, Ilic, Dpodv2: Dense correspondence-based 6 dof pose estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 44(11) (2021), 7417–7435.
8. Huang, Hodan, Zhang, Tran, Twigg, P.-C., Yuan, Keskin, Wang, Neural correspondence field for object pose estimation, in: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X, Springer, 2022, pp. 585–603.
9. Aoki, Goforth, Srivatsan, R.A., Lucey, Pointnetlk: Robust & efficient point cloud registration using pointnet, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7163–7172.
10. Gao, Lauri, Wang, Zhang, Frintrop, 6d object pose regression via supervised learning on point clouds, in: 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2020, pp. 3643–3649.
11. Gao, Lauri, Zhang, Frintrop, Cloudaae: Learning 6d object pose regression with on-line data synthesis on point clouds, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021, pp. 11081–11087.
12. Hua, Zhou, Huang, Wang, Xiong, Rede: End-to-end object 6d pose robust estimation using differentiable outliers elimination, IEEE Robotics and Automation Letters 6(2) (2021), 2886–2893.
13. Wang, Zhu, Martín-Martín, Fei-Fei, Savarese, Densefusion: 6d object pose estimation by iterative dense fusion, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3343–3352.
14. Sun, Huang, Liu, Fan, Sun, Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11632–11641.
15. Huang, Fan, Chen, Sun, Ffb6d: A full flow bidirectional fusion network for 6d pose estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3003–3013.
16. C.R., Guibas, L.J., Pointnet: Deep learning on point sets for 3d classification and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
17. C.R., Guibas, L.J., Pointnet++: Deep hierarchical feature learning on point sets in a metric space, Advances in Neural Information Processing Systems 30 (2017).

18.

Kang

Zhou

Wang

Chen

, Real-time fruit recognition and grasping estimation for robotic apple harvesting, Sensors 20(19) (2020), 5670.

19.

Zhang

Müller

Stephan

Gross

H.-M.

Notni

, Point cloud hand-object segmentation using multimodal imaging with thermal and color data for safe robotic object handover, Sensors 21(16) (2021), 5676.

20.

Nong

Bai

Liu

, Airborne LiDAR point cloud classification using PointNet+⁣+ network with full neighborhood features, Plos One 18(2) (2023), e0280346.

21.

Cai

Heikkilä

Rahtu

, OVE6D: Object viewpoint encoding for depth-based 6D object pose estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6803–6813.

22.

Qian

Peng

Mai

Hammoud

Elhoseiny

Ghanem

, Pointnext: Revisiting pointnet+⁣+ with improved training and scaling strategies, Advances in Neural Information Processing Systems 35 (2022), 23192–23204.

23.

Chang

A.X.

Funkhouser

Guibas

Hanrahan

Huang

Savarese

Savva

Song

et al., Shapenet: An information-rich 3d model repository, arXiv preprint arXiv:1512.03012, 2015.

24.

Song

Khosla

Zhang

Tang

Xiao

, 3d shapenets: A deep representation for volumetric shapes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920.

25.

Xiang

Schmidt

Narayanan

Fox

, PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, arXiv preprint arXiv:1711.00199, 2017.

26.

Hinterstoisser

Holzer

Cagniart

Ilic

Lepetit

, Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes, in: IEEE International Conference on Computer Vision, 2012.

27.

Hinterstoisser

Lepetit

Ilic

Holzer

Bradski

Konolige

Navab

, Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes, in: Computer Vision – ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5–9, 2012, Revised Selected Papers, Part I 11, Springer, 2013, pp. 548–562.