AaDR-PointCloud: An integrated point cloud processing network using attention and deep residual

Abstract

3D point clouds are irregular and unordered, which poses challenges for point cloud analysis. Earlier methods based on projection or voxelization were often insufficient in accuracy and speed. In recent years, methods that use the Transformer from the NLP field or ResNet from deep learning have shown promising results. This paper extends these ideas and designs AaDR-PointCloud, a model that combines self-attention blocks with deep residual point blocks and applies them iteratively to extract point cloud information. The self-attention blocks are particularly suitable for point cloud processing because they are order independent, and the deep residual point blocks provide the expression of deep features. The model is evaluated on two shape classification datasets and one object part segmentation dataset, reaching 93.2% overall accuracy on ModelNet40, 83.6% overall accuracy on ScanObjectNN, and 85.5% instance mIoU on ShapeNetPart.

Keywords

Point cloud, Transformer, ResNet, Point cloud classification, Point cloud part segmentation

1 Introduction

3D data addresses the spatial information gap in 2D images and is urgently needed in various applications, including robotics, autonomous vehicles, and augmented reality. In contrast to photographs, which are organized on regular pixel grids, 3D point clouds are sets embedded in continuous space [1]. Owing to the unordered and unstructured nature of point clouds, image processing methods cannot be applied to them directly. Performance is further constrained by their inherent sparsity and by the presence of noise.

Point cloud analysis has improved significantly in recent years thanks to neural networks. This advancement is evident across various applications, including 3D shape classification [2] and semantic segmentation [3]. The success of convolutional neural networks (CNNs) in image processing has inspired new ideas. For example, [4–6] develop convolution operators for point clouds that aggregate local data; to obtain a normalized feature domain for convolution, they reorder the input point sequence. These studies primarily build upon the concepts introduced by PointNet [2], but real-time performance remains a significant challenge. At the same time, the Transformer series [7–9] is also favored by practitioners in point cloud processing. The self-attention operator at the core of the Transformer is invariant to the permutation and cardinality of the input elements. Because point clouds are essentially sets embedded in 3D space [1], this operator is highly suitable for processing them. Set Transformer [10] targets set data and reduces the algorithmic complexity of the Transformer, but its multi-head attention still demands significant computational resources. Inspired by these works, this paper develops a self-attention module for 3D point cloud processing. The module investigates how self-attention is applied to the local neighborhood of each point and how positional information is encoded in the network. The resulting networks employ pointwise operations and self-attention. In addition, inspired by classical image analysis networks [11, 12], this paper considers the memory consumption problem that may arise when only a loop of attention blocks is used to extract point cloud features. Therefore, a simple deep residual MLP is designed to extract deep aggregation features.

Building on the above methods, this paper designs a new combined network model for point cloud processing. The model benefits from the inherent order invariance of the Transformer, does not need a specified ordering of the point cloud data, completes feature learning through the attention mechanism, and takes advantage of the efficiency of highly optimized MLPs. This new network architecture is called AaDR-PointCloud, and its logical design is shown in Fig. 1.

Fig. 1

Our network, AaDR-PointCloud, takes the original point cloud as input and projects it onto an embedding layer using a simple MLP layer. The processed features are then obtained through an internally looped encoder, which is outlined by the dotted frame and operates once as indicated by the blue arrow. This series of processing steps leads to the final outputs for classification and part segmentation.

This paper demonstrates the significant effect of AaDR-PointCloud in 3D point cloud learning tasks and compares it with previous work. Among them, AaDR-PointCloud performs shape classification tests on the dataset ModelNet40 [13] (overall accuracy 93.2%) as well as ScanObjectNN [14] (overall accuracy 83.6%), and object part segmentation tests on ShapeNetPart [15] (85.5% instance mIoU). In addition, the results of ablation experiments also prove that the design of each module is meaningful.

Our main contributions are as follows:

This paper designs a new combined network for point cloud processing. The order independence of its attention module is well suited to point cloud tasks, and its deep residual point blocks express the deep features produced by the cyclic structure. The model offers an approach to point cloud processing that cyclically combines deep residual MLPs with attention.

This paper tests the model on multiple datasets and conducts ablation studies to explain the necessity and rationality of the module design. Experimental results show that the model achieves good performance in shape classification and part segmentation.

2 Related work

Pixels in a 2D image are arranged in regular grids and may be analyzed using classical convolution, while the point cloud is disorganized and dispersed in 3D space, essentially a set. Learning-based methods for processing 3D point clouds can be divided into the following types: voxel-based, projection-based, point-based, and networks using Transformer and self-attention.

Method based on Voxel. The essence of the voxel-based method is to voxelize the point cloud into a 3D grid and then apply a CNN for 3D convolution. VoxNet [16], a voxel network developed by Maturana et al., recognizes 3D objects with high accuracy. However, simply adding voxels to a dense 3D region significantly increases computational and memory cost. Later, researchers observed that the majority of voxels are empty and that this sparsity can be exploited to address the issue. For example, OctNet [17] represents the scene with a hierarchically partitioned hybrid grid-octree and indexes the feature vector of each voxel using an algorithm. On high-resolution point clouds, OctNet’s sparse convolution requires much less memory and runtime than dense networks. However, voxelization still struggles to avoid the loss of detail.

Method based on Projection. The projection-based approach turns an irregular representation into a regular one by projecting the unstructured point cloud onto a 2D plane. MVCNN [18] is a groundbreaking method that max-pools multi-view features into a global descriptor; in the process, however, most of the information in non-maximum elements is lost. To create a tangent image that can be used in 2D convolution, Tangent Conv [19] projects the local surface geometry onto the tangent plane of each point. Additionally, [18, 20–23] project 3D point clouds onto various image planes using multi-view projection, extract feature representations for these projections using 2D convolution, and then combine the views to create the final output representation. The projection approach, however, has serious flaws: recognition performance depends heavily on the choice of projection surface, and 3D occlusion reduces recognition accuracy [1].

Method based on Point. As a pioneering work, Qi et al.’s PointNet [2] takes the point cloud directly as input, achieves permutation invariance through symmetric functions, and uses a max pooling layer to extract global features. Because PointNet cannot capture the local structure between points, the authors then proposed the hierarchical network PointNet++ [24] to capture fine geometric structures from the neighborhood of each point. These models offer a range of sampling techniques [3, 24–27] and can benefit from effective point set sampling.

Graph-based networks can learn spatial attributes by treating each point in the point cloud as a vertex of a graph and creating directed edges based on the neighbors of each vertex [28]. The Edge Conditioned Convolution (ECC) approach, devised by Simonovsky et al. [29], treats each point as a vertex of the graph and connects it to all of its neighbors by directed edges. The point cloud is then represented by a filter generation network (such as an MLP). DGCNN [30] performs graph convolution on kNN graphs. LDGCNN [31] removes the transformation network of DGCNN and links different levels of hierarchical features to improve its performance. To exploit the local geometric structure, KCNet [32] learns features using kernel correlation and graph pooling.

Without quantization, the network built on continuous convolution is applied directly to the set of points in three dimensions. Compared with a convolution kernel defined on a 2D grid, a convolution kernel for 3D point clouds is difficult to design because of the irregularity of the point cloud [28]. PointConv [26] describes convolution as a Monte Carlo estimate of continuous 3D convolution. Under the same parameter settings, its convolution kernel further reduces 3D convolution to two operations: matrix multiplication and 2D convolution. This increases memory and computational efficiency. The RIConv operator proposed by Zhang et al. [33] transforms convolution into 1D with low-level rotation-invariant geometric features as input. SFCNN [34] convolves the projection of a point cloud on an icosahedron, processing the features attached to the vertices of the polyhedron and their adjacent nodes through a convolution, max pooling, convolution structure. PointCNN [5] standardizes the potential order of input points through an MLP and then convolves the converted features. The interpolation convolution operator InterpConv proposed by Mao et al. [35] can measure the geometric relationship between the input point cloud and the kernel weight coordinates.

The above point-based methods are extensively influenced by residual networks. Reference [11] is one of the earliest papers on ResNet, introducing the concept of residual blocks with skip connections that allow information to propagate by adding the input to the output. He et al. [36] introduce the concept of identity mapping, further simplifying the design of residual blocks and making the network easier to train. They also explore variations of residual blocks, some of which incorporate batch normalization, thus offering greater design flexibility. Together, these works lay the foundation for ResNet, enabling it to address the training challenges of deep networks, although such deep networks may not be well suited to resource-constrained environments.

Transformer and self-attention. Transformer and attention methods have also made new developments in 2D image recognition following their success in the fields of natural language processing and image processing. Scalar dot product self-attention was used by Hu et al. [37] and Ramachandran et al. [38] within local image blocks. PointGMM shape interpolation combined with multi-layer perceptron (MLP) segmentation and attention segmentation was proposed by Hertz et al. [39].

Inspired by the above methods, our method combines the advantages of the self-attention (SA) module and the deep residual MLP module and arranges them sequentially to obtain better results. Previous work applied global attention to point clouds, which leads to complex computation and rapid memory growth, while applying self-attention (SA) locally avoids this problem and broadens the applicability of point cloud models.

3 Methods

This section first reviews the general formulas of Transformer and self-attention (SA) operator, as well as the source of ideas for deep residual MLP module design. Next, this paper proposes the AaDR-PointCloud framework for point cloud learning, explaining the outline and detailed design of the encoder and the hierarchical aggregation module in turn. It shows how to apply the point cloud representation learned by the model to various tasks of point cloud processing, including point cloud classification and segmentation.

3.1 Background

Transformer has achieved great success in the field of natural language processing, and the attention mechanism proposed in [7] is increasingly being used. The attention mechanism was originally developed for an encoder-decoder structure in neural machine translation (NMT) and was then rapidly applied to tasks of a similar nature. It is now widely used in deep learning models, not just those with an encoder-decoder hierarchy. It is worth mentioning that the attention mechanism can be applied on the encoder alone to solve tasks such as text classification or representation learning. This application of the attention mechanism is called self-attention or intra-attention, and self-attention is the most common form.

[7] use the QKV model to explain the Self-Attention mechanism. In this model, the input is represented as Q (Query), and memory stores context information in the form of key-value pairs (K, V). The attention mechanism can be viewed as a mapping function that maps the Query to a series of key-value pairs (Key, Value). The formula is as follows:

$AttentionValue = Q \cdot K^{T} \cdot V$ (1)

$AttentionValue (Q, K, V) = softmax \left(\frac{Q \cdot K^{T}}{\sqrt{d_{K}}}\right) \cdot V$ (2)

The essence of Attention is to assign a weight coefficient to each element in the sequence. Q · K^T can be regarded as the weight coefficient of Value, and d_K is the dimension of the query or key vector.
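As a minimal sketch of formula (2), the following NumPy snippet computes scaled dot-product attention on random data. The function names `softmax` and `attention` and the toy shapes are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention as in formula (2)."""
    d_k = K.shape[-1]                          # dimension of query/key vectors
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_q, n_k), each row sums to 1
    return weights @ V, weights

# Toy input: 5 elements, 8-dim queries/keys, 16-dim values (assumed sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))
out, w = attention(Q, K, V)
```

Each row of `w` holds the weight coefficients that one query assigns to all values, so every output row is a convex combination of the rows of `V`.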

In addition, the residual network [11] proposed by He Kaiming is a creative work. Its idea is to transform the mapping learned by the network from X → Y into X → Y − X, and then add the learned residual back to the original input X. This design alleviates problems that arise with increasing depth in deep learning, such as vanishing gradients, exploding gradients, and training saturation. The SE-Net [12] proposed by Hu Jie et al. learns to use global information to selectively emphasize informative features and suppress less useful ones, bringing significant performance improvements to existing state-of-the-art networks.
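The residual idea can be illustrated with a small sketch. The two-layer form, the name `residual_block`, and the layer sizes are assumptions chosen for illustration; they are not the block used in [11] or in this paper.

```python
import numpy as np

def residual_block(x, W1, W2):
    """Return x + F(x), where F is a small two-layer MLP (hypothetical form)."""
    h = np.maximum(x @ W1, 0.0)  # first layer followed by ReLU
    return x + h @ W2            # skip connection adds the input back

x = np.arange(12, dtype=float).reshape(3, 4)
W1 = np.zeros((4, 8))            # with zero weights the residual F(x) is zero
W2 = np.zeros((8, 4))
y = residual_block(x, W1, W2)    # so the block reduces to the identity mapping
```

With zero weights the block returns its input unchanged, which shows why adding layers in residual form does not have to degrade the signal passed through them.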

3.2 Framework design

According to the above design concept, this article designs a new, effective combined network for point cloud tasks. The detailed framework of our method is shown in Fig. 2.

Fig. 2

The network structure of AaDR-PointCloud consists of an input embedding module followed by a cyclic stacked attention-residual combination module. The resulting feature map is used for classification or segmentation tasks, which are performed through multiple linear or convolutional layers. The number below each layer indicates the number of output channels. In segmentation tasks, ’CLM’ stands for class label mapping, and ’GMPM’ represents global max pooling mapping, each comprising two convolutional layers. The red dotted box signifies the combination module design.

Framework design description. To establish semantic affinity between points as the foundation for various point cloud processing tasks, AaDR-PointCloud encodes the input points into a new, higher-dimensional feature space. The initial step of AaDR-PointCloud is to embed the input coordinates into this feature space. The embedded features are first fed into the attention module to learn the shared weights of local regions. Then, after feature aggregation, deep residual feature extraction is performed on the aggregated features, and finally the output features are generated through a linear layer. The process consists of four stages, each of which carries out the aforementioned operations and progressively expands the receptive field. The goal is to model the geometric information of point clouds by repeating these stages.

Specifically, the attention module of AaDR-PointCloud follows almost the same design concept as the original Transformer. The design details and formulas of this module and of the deep residual feature extraction module are as follows. Given an input point cloud P ∈ R^{N×d}, where each of the N points has a d-dimensional feature description, the input points are first mapped to a 32-dimensional space through an MLP operation (a convolution layer, a batch normalization layer, and an activation function) to learn an embedded feature F_e ∈ R^{N×d_e} (d_e = 32). Then, through the 4-stage feature learning process (attention module, feature aggregation operation, and deep residual point block), the rich semantic information of each point is learned through step-by-step down-sampling and dimension-raising operations. After a linear transformation, the point-by-point d-dimensional feature representation output by AaDR-PointCloud is:

$G_{i} = 4 \cdot Φ_{DeepRes} \left(M \left(Φ_{Attention} (f_{i, j}) \mid j = 1, \dots, K\right)\right)$ (3)

where Φ_Attention (·) and Φ_DeepRes (·) are the self-attention block and the deep residual point MLP block respectively, and the aggregation function M (·) is the max pooling operation. f_i,j is the feature of the j-th neighbor of the i-th sampled point. Formula (3) describes one stage of AaDR-PointCloud; the model repeats this process over 4 stages to form a hierarchical deep network. The neighborhood is chosen using the k-nearest neighbor method (kNN), with K set to 24. Ablation experiments on the choice of K are reported in Section 4.
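The grouping and aggregation inside one stage can be sketched as follows. This is a simplified illustration: the helper names are hypothetical, the sampled centers are drawn at random (the sampling strategy is not specified here, so this is a placeholder assumption), and the attention and residual blocks are left out so that only the kNN neighborhood with K = 24 and the max pooling M (·) are shown.

```python
import numpy as np

def knn_indices(points, centers, k):
    # Pairwise squared distances between centers (S, 3) and points (N, 3).
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]  # (S, k) nearest neighbours per center

def group_and_pool(points, feats, center_idx, k):
    idx = knn_indices(points, points[center_idx], k)
    grouped = feats[idx]        # (S, k, C): neighbourhood features f_{i,j}
    return grouped.max(axis=1)  # (S, C): M(.) realised as max pooling over j

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))  # N = 1024 input points
feats = rng.normal(size=(1024, 32))  # 32-dim embedded features
# Placeholder sampling: 512 random centers (assumption, halves the point set).
center_idx = rng.choice(1024, size=512, replace=False)
pooled = group_and_pool(points, feats, center_idx, k=24)
```

Because every center is its own nearest neighbor, the pooled feature of a center is never smaller than the center's own feature in any channel.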

Using the above model, this paper performs shape classification and object part segmentation tasks on the input point cloud.

Classification. The details of the classification network using AaDR-PointCloud are shown in Fig. 3 (top). When the model is used for classification, the input points are first mapped to a 32-dimensional space, and then the encoder gradually performs down-sampling and channel dimension-increasing operations in four stages. The down-sampling rate of each stage is [2, 2], and the channel dimension-increasing rate is [2, 1]. Therefore, the cardinality of the point set generated by each stage is [N, N/2, N/4, N/8, N/16], and the dimension of the output channel d_o is [D, 2D, 4D, 8D, 8D], where N is the number of input points, preset to 1024, and D is the input channel dimension, preset to 32 after the embedding layer. After that, a max pooling operation over the resulting features yields the global feature, and the final classification score Output1 ∈ R^{N_c} (c = classification) is predicted by a linear layer.
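The stage-by-stage sizes quoted above can be reproduced with a few lines of arithmetic. The per-stage rate lists passed below (a factor of 2 for down-sampling in all four stages, and channel factors 2, 2, 2, 1) are an assumption inferred from the sequences [N, N/2, N/4, N/8, N/16] and [D, 2D, 4D, 8D, 8D]; the helper name `stage_shapes` is hypothetical.

```python
def stage_shapes(n_points, dim, down_rates, dim_rates):
    """Track point count and channel width through the encoder stages."""
    sizes, dims = [n_points], [dim]
    for r, c in zip(down_rates, dim_rates):
        sizes.append(sizes[-1] // r)  # down-sample the point set by factor r
        dims.append(dims[-1] * c)     # widen the channels by factor c
    return sizes, dims

# N = 1024 and D = 32 as stated for the classification network.
sizes, dims = stage_shapes(1024, 32, [2, 2, 2, 2], [2, 2, 2, 1])
print(sizes)  # [1024, 512, 256, 128, 64]
print(dims)   # [32, 64, 128, 256, 256]
```

Under these assumed rates, the last stage leaves 64 points with 256 channels each, which are then max-pooled into a single global feature.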

Fig. 3

The process of classification (top) and segmentation (bottom) is realized. In brackets, the former represents the number of points, and the latter represents the dimension of points.

Segmentation. The details of the segmentation network using AaDR-PointCloud are shown in Fig. 3 (bottom). The encoder of the segmentation network is almost the same as that of the classification model, except that N is preset to 2048, the down-sampling rate is changed to [4, 4], and the channel dimension-increasing rate is [2, 2]. The encoder is coupled with a symmetric decoder [24, 40] that maps the features from the down-sampled input point set to the output point set. Between stages, the features of the same sampled point at different dimensions are combined through skip connections. Furthermore, in Fig. 2, “CLM” is responsible for feature mapping transformation, while “GMPM” captures global features. They serve as heads that receive the results computed by the model’s preceding stages, ensuring the model’s effective output. The entire segmentation process therefore aggregates the point features acquired during the encoder stage with the global features obtained through max pooling. This is followed by a series of dimensionality reduction and linear operations, ultimately producing the segmentation of the output point set, Output2 ∈ R^{N_s} (s = segmentation).

3.3 Detailed design of combination module

Attention Module. The attention module of this paper implements coordinate-based point embedding using the self-attention layer introduced in Reference [7]. Point embedding aims to place semantically similar points closer together in the embedding space. Our method does not use naive point embedding because it ignores the relationship between points.

For a simple implementation of AaDR-PointCloud, this article redesigns the self-attention of the original Transformer into a process whose input data stream consists of points; its architecture is shown in Fig. 4. Following [7] and Section 3.1, Q, K, and V are obtained by linear transformation of the input feature F_in ∈ R^{N×d_e}:

$\begin{matrix} (Q, K, V) = F_{in} \cdot (W_{q}, W_{k}, W_{v}) \\ Q, K \in R^{N \times d_{a}}, \quad V \in R^{N \times d_{v}} \\ W_{q}, W_{k} \in R^{d_{v} \times d_{a}}, \quad W_{v} \in R^{d_{v} \times d_{v}} \end{matrix}$ (4)

where W_q, W_k, and W_v are the weights obtained by training, d_a is the dimension of Q and K, and d_v is the dimension of V (d_a = d_v/4).

Firstly, according to formula (2), the attention weights are computed with the matrix dot product and then normalized:

$\begin{matrix} \tilde{A} = {\tilde{α}}_{i, j} = Q \cdot K^{T} \\ {\bar{α}}_{i, j} = \frac{{\tilde{α}}_{i, j}}{\sqrt{d_{a}}} \\ A = α_{i, j} = softmax ({\bar{α}}_{i, j}) = \frac{exp ({\bar{α}}_{i, j})}{\sum_{k} exp ({\bar{α}}_{i, k})} \end{matrix}$ (5)

Then, the self-attention output feature F_SA is obtained by weighting the value vectors with the corresponding attention weights:

$F_{SA} = A \cdot V$ (6)

In the above formulas, Q, K, and V are determined only by the corresponding linear transformation matrices and the input feature F_in, so they do not depend on the order of the input points, and the softmax and the weighted sum in F_SA are likewise order-independent operations. The whole Self-Attention (SA) block is therefore order independent, which suits the unordered and irregular nature of point cloud input.

Finally, another layer of MLP provides the output feature F_out ∈ R^{N×d_o} for the entire SA layer:

$F_{out} = MLP (F_{SA}) = MLP (SA (F_{in}))$ (7)
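The order independence of the SA block can be checked numerically. The sketch below follows formulas (4) to (6) with random weights and assumes that the input width equals d_v, with d_a = d_v/4; the final MLP of formula (7) is omitted, and the function name `self_attention` is illustrative.

```python
import numpy as np

def self_attention(F_in, W_q, W_k, W_v):
    Q, K, V = F_in @ W_q, F_in @ W_k, F_in @ W_v  # formula (4)
    scores = Q @ K.T / np.sqrt(Q.shape[1])        # scaled dot product, formula (5)
    scores -= scores.max(axis=1, keepdims=True)   # stabilise the softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # row-wise softmax
    return A @ V                                  # formula (6)

rng = np.random.default_rng(0)
N, d_v = 6, 32
d_a = d_v // 4
F_in = rng.normal(size=(N, d_v))
W_q = 0.1 * rng.normal(size=(d_v, d_a))
W_k = 0.1 * rng.normal(size=(d_v, d_a))
W_v = 0.1 * rng.normal(size=(d_v, d_v))

perm = rng.permutation(N)                         # shuffle the point order
out = self_attention(F_in, W_q, W_k, W_v)
out_perm = self_attention(F_in[perm], W_q, W_k, W_v)

# Shuffling the input only shuffles the output rows in the same way.
same_rows = np.allclose(out[perm], out_perm)
# A symmetric pooling over points then removes the order entirely.
same_pooled = np.allclose(out.max(axis=0), out_perm.max(axis=0))
```

Both `same_rows` and `same_pooled` evaluate to `True`: the block is permutation equivariant, and the max pooling that follows it makes the aggregated feature permutation invariant.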

Fig. 4

Detailed design of combination module. Left: Self-Attention(SA) block; M: Maximum Pooling operation; Right: Deep Residual Point MLP block.

Deep Residual Point Block. The design of the deep residual point block is mainly derived from He et al. [11], He et al. [36], and Hu et al. [12]. Its role is to extract deep aggregation features after max pooling has aggregated the local features at each stage, which supports the hierarchy of the deep network and yields a deep feature representation. This paper designs the following formula:

$\begin{matrix} Φ (f_{i, j}) = MLP ([{∥ x_{i, j} - x_{i} ∥}_{2}, x_{i, j} - x_{i}, x_{i, j}, x_{i}]) * f_{i, j}, \\ \forall j \in \{1, \dots, K\} \end{matrix}$ (8)

where the MLP is made up of a fully connected (FC) layer, a batch normalization layer, and an activation function, and [·] is the concatenation operation. The ablation experiment in Section 4 shows that this module has a measurable effect.
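A rough sketch of formula (8) for a single neighborhood is given below. It makes several assumptions that the text does not fix: the `*` is read as an element-wise product, the MLP is reduced to one fully connected layer with ReLU (batch normalization is omitted), and the function name, weights, and sizes are illustrative only.

```python
import numpy as np

def geometric_weighting(x_i, x_ij, f_ij, W, b):
    """Formula (8) for one center x_i (3,) and its K neighbours x_ij (K, 3)."""
    rel = x_ij - x_i[None, :]                          # x_ij - x_i, shape (K, 3)
    dist = np.linalg.norm(rel, axis=1, keepdims=True)  # ||x_ij - x_i||_2, (K, 1)
    center = np.broadcast_to(x_i, rel.shape)           # x_i repeated, (K, 3)
    geo = np.concatenate([dist, rel, x_ij, center], axis=1)  # [.] -> (K, 10)
    gate = np.maximum(geo @ W + b, 0.0)                # FC + ReLU, BN omitted
    return gate * f_ij                                 # element-wise product

rng = np.random.default_rng(0)
K, C = 24, 32                       # K neighbours, C feature channels
x_i = rng.normal(size=3)
x_ij = x_i + 0.1 * rng.normal(size=(K, 3))
f_ij = rng.normal(size=(K, C))
W = 0.1 * rng.normal(size=(10, C))  # 1 + 3 + 3 + 3 = 10 geometric inputs
b = np.zeros(C)
out = geometric_weighting(x_i, x_ij, f_ij, W, b)
```

The ten-dimensional geometric descriptor is mapped to one weight per feature channel, so each neighbor feature is rescaled according to where the neighbor sits relative to its center.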

Position Encoding. The attention module also requires position encoding, which enables the operator to adapt to the local structure in the data [41]. 3D point coordinates are obvious candidates for position encoding in 3D point cloud analysis. By simply restructuring the deep residual block, the two-layer MLP operation is reduced to one layer, and the position encoding is defined as follows:

$δ = θ (P_{i} - P_{j})$ (9)

where P_i and P_j represent the 3D coordinates of points i and j. The encoding function θ is an MLP consisting of a fully connected layer, a batch normalization layer, and a ReLU nonlinearity. This is the same operation as position embedding, but its effect is different. This paper adds trainable position encoding both at the front end of the model and in the attention module. The former preprocesses the input point cloud data together with the position embedding, and the latter makes the self-attention module better suited to the input point cloud.

4 Experiments

In this paper, AaDR-PointCloud is trained and tested on public shape classification and part segmentation datasets. Implementation details are given in each part. For shape classification, this paper uses two datasets, ModelNet40 and ScanObjectNN; for part segmentation, the ShapeNet dataset is selected. In addition, ablation studies are carried out to verify the actual effect of the model.

4.1 Shape classification

4.1.1 Datasets

The ModelNet40 dataset is one of the most commonly used datasets for point cloud classification. The dataset comprises 12,311 CAD models distributed across 40 object categories, with 9,843 models allocated for training and 2,468 for testing, a ratio of approximately 4:1. Each object is uniformly sampled to 1024 points for comparison with previous work. Compared with ModelNet40, which was released in 2015, ScanObjectNN is a more recent point cloud benchmark. This dataset introduces real-world interference such as background, noise, and occlusion, which presents a greater challenge for point cloud analysis and strengthens the credibility of test results. It contains 15,000 objects, divided into 15 classes and 2,902 unique object instances.

4.1.2 Setup

For the ModelNet40 and ScanObjectNN datasets, this paper sets the batch size to 8 and the initial learning rate to 0.001. A cosine annealing schedule is used to adjust the learning rate at each epoch. The only difference is that the former dataset is trained for 300 epochs, while the latter is trained for 200 epochs. All training is run on an NVIDIA GeForce GTX 1070 (GP104) GPU. For the other comparison methods, this paper takes the best results reported in the original papers; the model in this paper remains competitive in most cases.
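For reference, a cosine annealing schedule of the kind described above can be written in closed form. The sketch assumes the learning rate decays to a minimum of 0 without restarts, which is not stated in the setup; `cosine_lr` is a hypothetical helper, not the training code.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=1e-3, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing (no restarts)."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)

# ModelNet40 setting: initial rate 0.001 over 300 epochs.
schedule = [cosine_lr(e, 300) for e in range(301)]
```

The rate starts at 0.001, passes through half of that value at epoch 150, and reaches the assumed minimum at epoch 300.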

4.1.3 Evaluation metrics

On the evaluation metrics, this paper uses the class average accuracy (mAcc) and the overall accuracy (OA) on the test set.

The mean classification accuracy is the average of the classification accuracy for each shape category. It can be mathematically represented as:

$\begin{matrix} {Acc}_{i} = \frac{TP_{i}}{TP_{i} + FP_{i}} \\ mAcc = \frac{1}{N} \sum_{i = 1}^{N} {Acc}_{i} \end{matrix}$ (10)

where N is the total number of shape categories, Acc_i is the classification accuracy for the i-th category, TP_i represents the number of correctly classified samples for the i-th category, FP_i represents the number of misclassified samples for the i-th category, and their sum represents the total number of samples for the i-th category.

The overall accuracy refers to the ratio of the number of correctly classified samples to the total number of samples. It can be mathematically represented as:

$OA = \frac{TP}{TP + FP}$ (11)

where TP represents the number of correctly classified samples, FP represents the number of misclassified samples, and their sum represents the total number of samples.

These two metrics can be used to evaluate the performance of 3D shape classification models. The mAcc measures the average classification accuracy among different categories, while OA measures the overall classification accuracy.
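The difference between the two metrics shows up on imbalanced data. The following sketch computes both from label lists according to formulas (10) and (11); the function name and the toy labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return (mAcc, OA) following formulas (10) and (11)."""
    classes = sorted(set(y_true))
    correct = {c: 0 for c in classes}  # TP_i per category
    total = {c: 0 for c in classes}    # TP_i + FP_i per category
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    oa = sum(correct.values()) / len(y_true)
    macc = sum(correct[c] / total[c] for c in classes) / len(classes)
    return macc, oa

# Toy test set: six samples of class 0, two of class 1, one error in class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]
macc, oa = classification_metrics(y_true, y_pred)
print(macc, oa)  # 0.75 0.875
```

A single mistake in the small class costs mAcc much more than OA, which is why both values are reported.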

4.1.4 Comparison results

The experimental results are shown in Table 1 and Table 2. After training, the model reaches an overall accuracy of 93.2% on ModelNet40, which is 4.0, 3.1, and 1.3 percentage points higher than the voxel-based model Subvolume [42], the multi-view-based model MVCNN, and the point-based model PointNet++ respectively. The model also achieves an overall accuracy of 83.6% on ScanObjectNN, 5.7 and 0.8 percentage points higher than PointNet++ and MVTN [50] respectively.

Table 1
Classification results on the ModelNet40 dataset

Method	Input	mAcc (%)	OA (%)	Param.	Train speed	Test speed (samples/s)
VoxNet [16]	voxel	83.0	85.9
Subvolume [42]	voxel	86.0	89.2
MVCNN [18]	image	–	90.1
PointNet [2]	point	86.2	89.2	1.41M	157.4	217
Kd-Net [43]	point	–	91.8
PointNet++ [24]	point	–	91.9
Set Transformer [10]	point	–	90.4
PointCNN [5]	point	88.1	92.2
DGCNN [30]	point	90.2	92.2
SpiderCNN [44]	point	–	92.4
PointConv [26]	point	–	92.5	18.6M	12.6	7.2
Point2Sequence [41]	point	90.4	92.6
PointASNL [45]	point	–	92.9
InterpCNN [35]	point	–	93.0
Ours	point	90.7	93.2	13.7M	33.1	78.8

Table 2

Classification results on the ScanObjectNN dataset

Method	mAcc(%)	OA(%)
PointNet [2]	63.4	68.2
SpiderCNN [44]	69.8	73.7
PointNet++ [24]	75.4	77.9
DGCNN [30]	73.6	78.1
PointCNN [5]	75.1	78.5
BGA-DGCNN [46]	75.7	79.7
BGA-PN++ [46]	77.5	80.2
DRNet [47]	78.0	80.3
GBNet [48]	77.8	80.5
Simple View [14]	–	80.5±0.3
PRANet [49]	79.1	82.1
MVTN [50]	–	82.8
Ours	80.7	83.6

Additionally, Table 1 provides partial comparisons of model parameters, training speed, and testing speed. The classical PointNet model, despite its lower accuracy, has a small number of parameters and achieves an inference speed of 217 samples per second. From this perspective, it indeed represents a milestone in the field of point cloud processing. PointConv delivers strong results, but it comes with a large number of parameters and a high inference cost (7.2 samples per second). In comparison with PointConv, our model has fewer parameters (13.7M versus 18.6M) and a lower inference cost (78.8 samples per second), resulting in competitive performance.

4.2 Object part segmentation

4.2.1 Datasets

Our model uses ShapeNet as the test dataset for object part segmentation. The ShapeNet dataset is a shape repository of 3D CAD models of objects. It contains 16,881 object instances from 16 shape categories with 50 part labels, and each instance includes 2–6 parts. Following the classic PointNet, our model randomly selects 2048 points as sampling points for comparison with other work.

4.2.2 Setup

Except for a training length of 350 epochs, all other settings are the same as those in Section 4.1.2 for the shape classification datasets.

4.2.3 Evaluation metrics

In terms of evaluation metrics, this paper employs Intersection over Union (IoU) on the test set to assess segmentation performance, including Class mIoU and Instance mIoU.

Class mIoU represents the average IoU across all classes. Assuming there are N classes, for each class i, its IoU (IoU_i) can be computed, and then the average IoU across all classes is taken. The formula is represented as:

$\text{class mIoU} = \frac{1}{N} \sum_{i=1}^{N} \text{IoU}_{i}$ (12)

where IoU_i represents the IoU for class i, and N is the total number of classes.

Similarly, Instance mIoU is the average IoU across all instances. In point cloud segmentation tasks, each point can be assigned to a specific instance of a class. For each instance, its IoU (IoU_{inst_j}) can be calculated, and then the average IoU across all instances is computed. The formula is represented as:

$\text{instance mIoU} = \frac{1}{M} \sum_{j=1}^{M} \text{IoU}_{\text{inst}_{j}}$ (13)

where IoU_{inst_j} represents the IoU for instance j, and M is the total number of instances.
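The two averages in Eqs. (12) and (13) can be illustrated with a small pure-Python sketch. It follows a convention commonly used for ShapeNet part segmentation, which is an assumption here and not confirmed by the text: the IoU of one shape is the mean IoU over the parts of its category, a part absent from both prediction and ground truth counts as IoU = 1, and the classes in Eq. (12) are taken to be the shape categories. The function names are hypothetical.

```python
from collections import defaultdict


def shape_iou(pred, gt, part_ids):
    """Mean IoU over the parts of one shape.

    pred, gt: per-point part labels (lists of ints of equal length).
    part_ids: part labels that belong to this shape's category.
    """
    ious = []
    for part in part_ids:
        inter = sum(1 for p, g in zip(pred, gt) if p == part and g == part)
        union = sum(1 for p, g in zip(pred, gt) if p == part or g == part)
        ious.append(1.0 if union == 0 else inter / union)  # absent part -> 1
    return sum(ious) / len(ious)


def mean_ious(shapes, parts_of_category):
    """Return (class_mIoU, instance_mIoU).

    shapes: list of (category, pred_labels, gt_labels), one per shape.
    parts_of_category: dict mapping category -> list of part labels.
    """
    per_category = defaultdict(list)
    for category, pred, gt in shapes:
        parts = parts_of_category[category]
        per_category[category].append(shape_iou(pred, gt, parts))
    all_ious = [iou for ious in per_category.values() for iou in ious]
    # Eq. (13): average over all M instances.
    instance_miou = sum(all_ious) / len(all_ious)
    # Eq. (12): average over N classes of the per-class mean IoU.
    class_means = [sum(v) / len(v) for v in per_category.values()]
    class_miou = sum(class_means) / len(class_means)
    return class_miou, instance_miou


parts = {"mug": [0, 1], "cap": [2, 3]}
shapes = [
    ("mug", [0, 0, 1, 1], [0, 0, 1, 1]),
    ("mug", [0, 0, 0, 1], [0, 0, 1, 1]),
    ("cap", [2, 2, 2, 2], [2, 2, 3, 3]),
]
class_miou, instance_miou = mean_ious(shapes, parts)
print(class_miou, instance_miou)
```

Because categories differ in size, the two numbers generally diverge: instance mIoU is dominated by categories with many shapes, while class mIoU weights every category equally.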

4.2.4 Comparison results

The segmentation results are shown in Table 3. Our model achieves competitive results: compared with PointNet and DGCNN, AaDR-PointCloud improves Cls. mIoU (class mIoU) by 3.4 and 1.5 percentage points, respectively, and Inst. mIoU (instance mIoU) by 1.8 and 0.3 percentage points.

Table 3
Segmentation results on ShapeNet dataset

Method	Cls. mIoU	Inst. mIoU	aero	bag	cap	car	chair	earphone	guitar	knife	lamp	laptop	motorbike	mug	pistol	rocket	skateboard	table
Kd-net [43]	–	82.3	80.1	74.6	74.3	70.3	88.6	73.5	90.2	87.2	81.0	94.9	57.4	86.7	78.1	51.8	69.9	80.3
PointNet [2]	80.4	83.7	83.4	78.7	82.5	74.9	89.6	73.0	91.5	85.9	80.8	95.3	65.2	93.0	81.2	57.9	72.8	80.6
A-SCN [51]	–	84.6	83.8	80.8	83.5	79.3	90.5	69.8	91.7	86.5	82.9	96.0	69.2	93.8	82.5	62.9	74.4	80.8
SO-Net [52]	–	84.9	82.8	77.8	88.0	77.3	90.6	73.5	90.7	83.9	82.8	94.8	69.1	94.2	80.9	53.1	72.9	83.0
PointNet++ [24]	81.9	85.1	82.4	79.0	87.7	77.3	90.8	71.8	91.0	85.9	83.7	95.3	71.6	94.1	81.3	58.7	76.4	82.6
PCNN [6]	81.8	85.1	82.4	80.1	85.5	79.5	90.8	73.2	91.3	86.0	85.0	95.7	73.2	94.8	83.3	51.0	75.0	81.8
DGCNN [30]	82.3	85.2	84.0	83.4	86.7	77.8	90.6	74.7	91.2	87.5	82.8	95.7	66.3	94.9	81.1	63.5	74.5	82.6
P2Sequence [41]	–	85.2	82.6	81.8	87.5	77.3	90.8	77.1	91.1	86.9	83.9	95.7	70.8	94.6	79.3	58.1	75.2	82.8
SpiderCNN [44]	82.4	85.3	83.5	81.0	87.2	77.5	90.7	76.8	91.1	87.3	83.3	95.8	70.2	93.5	82.7	59.7	75.8	82.8
PointASNL [45]	–	86.1	84.1	84.7	87.9	79.7	92.2	73.7	91.0	87.2	84.2	95.8	74.4	95.2	81.0	63.0	76.3	83.2
Ours	83.8	85.5	83.3	83.0	88.2	79.9	88.3	78.4	91.7	88.4	82.2	96.1	75.7	94.7	84.3	61.4	82.3	83.2

4.2.5 Visualization

In addition, Fig. 5 visualizes the ground truth alongside our part segmentation predictions, showing that our results are close to the ground truth.

Fig. 5

Part segmentation results on ShapeNetPart. The left is ground truth and the right is our prediction.

4.3 Comparison with the SOTA methods

In the model presented in this paper, the Transformer is a prerequisite for high-precision point cloud recognition. However, it increases the model's parameter count, resulting in longer training and testing times. The challenge addressed in this paper is therefore how to reduce model complexity while still achieving higher accuracy. Compared with state-of-the-art methods, the advantage of this approach is that it extends the idea of "introducing Transformers or Residual Networks into the point cloud domain" by combining the two in a single model. The ablation experiments explain the rationale and necessity of the module design, and the model achieves competitive results on multiple datasets.

4.4 Ablation study

Number of neighbors k. Section 3.2 sets the number of neighbors k of the K-nearest neighbor algorithm to 24. This section evaluates other values of k on the ScanObjectNN dataset; the results are shown in Table 4. The overall accuracy (OA) is best when k = 24. When k = 8 or 16, the points in the model may not have enough neighbor points to predict. When k = 32 or 64, the number of sampled neighbor points increases significantly, yet mAcc improves only marginally and OA remains slightly below that of k = 24, while the time cost of training the model increases significantly, resulting in a waste of resources.

Table 4
Ablation experiment: classification results on the ScanObjectNN shape classification dataset for different numbers of neighbors k

k	mAcc(%)	OA(%)
8	78.2	80.6
16	79.8	82.6
24	80.7	83.6
32	81.1	83.4
64	81.2	83.5
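For illustration, the neighbor query whose size k is varied in this ablation can be sketched as a brute-force search. This is a minimal standard-library sketch and not the implementation used in the experiments; the function name `knn_indices`, the Euclidean metric, and the exclusion of the query point from its own neighbor list are assumptions.

```python
import heapq
import math


def knn_indices(points, k=24):
    """For every point, return the indices of its k nearest neighbors.

    Brute-force O(n^2) search with Euclidean distance; neighbors are
    returned from nearest to farthest, excluding the point itself.
    """
    neighbors = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nearest = heapq.nsmallest(
            k, others, key=lambda j: math.dist(p, points[j])
        )
        neighbors.append(nearest)
    return neighbors


# Four points on a line, queried with k = 2.
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
print(knn_indices(line, k=2))
```

Each point gathers features from k neighbors, so the grouped feature tensor grows linearly with k, which is consistent with the higher training cost observed for k = 32 and k = 64.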

Ablation study of module necessity. Table 5 shows the classification results on the ScanObjectNN dataset after removing the Attention module and the Deep Residual (DeepRes) module, respectively. When the deep residual point blocks are removed from the combined model, the overall accuracy drops by 0.7 percentage points, which demonstrates the contribution of the deep residual point blocks. When the attention module is removed, the overall accuracy drops more sharply, by 3.8 percentage points. It is worth mentioning that the attention module can also replace the deep residual block to extract deep features. However, this direct replacement doubles the training cost and is not conducive to extending the model to greater depth, which also supports the simplicity and effectiveness of the deep residual block. In addition, following the idea of SE-Net, this paper tried adding an SE-style attention module to the deep residual point block, but the accuracy declined, which suggests that the SE module is not applicable here.

Table 5

Research on component ablation on the ScanObjectNN test set

Φ_Attention	Φ_DeepRes	mAcc(%)	OA(%)
✓	×	80.7	82.9
×	✓	76.3	79.8
✓	✓	80.7	83.6

5 Conclusion

Inspired by the successful application of Transformer and ResNet in NLP and 2D image analysis, this paper introduces the self-attention (SA) module and the deep residual network into point cloud analysis and designs a combined model to extract point cloud information. The design takes into account both the effectiveness of feature extraction and the simplicity of the network. The model achieves good results in point cloud shape classification and object part segmentation, which shows that it performs well in point cloud processing.

However, the Transformer is by design better suited to large-scale input data. In point cloud analysis, where datasets are scarce, its application is still limited. Our future work is to optimize the model with more training data and to explore the use of Transformers in other point cloud tasks, such as point cloud object detection.

Acknowledgement

The authors would like to thank all the people who have contributed to this paper for their selfless work. This work is supported by the 14th Graduate Education Innovation Fund of Wuhan Institute of Technology (CX2022352), the Hubei Technology Innovation Project (2019AAA045), and the National Natural Science Foundation of China (62171327, 62171328, 62072350).

References

1. Zhao, Jiang, Jia, Torr, Koltun, Point transformer, Proceedings of the IEEE/CVF international conference on computer vision, 2021:16259–16268.
2. Qi C.R., Su, Mo, Guibas L.J., Pointnet: Deep learning on point sets for 3d classification and segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, 2017:652–660.
3. Hu, Yang, Xie, Rosa, Guo, Wang, Trigoni, Markham, Randla-net: Efficient semantic segmentation of large-scale point clouds, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020:11108–11117.
4. Tchapmi, Choy, Armeni, Gwak, Savarese, Segcloud: Semantic segmentation of 3d point clouds, In 2017 international conference on 3D vision (3DV). IEEE, 2017:537–547.
5. Li, Bu, Sun, Wu, Di, Chen, Pointcnn: Convolution on x-transformed points, Advances in Neural Information Processing Systems (2018), 31.
6. Atzmon, Maron and Lipman, Point convolutional neural networks by extension operators, ACM Transactions on Graphics 37(4CD) (2018), 71.1–71.12.
7. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017), 30.
8. Wu, Fan, Baevski, Dauphin Y.N., Auli, Pay less attention with lightweight and dynamic convolutions, arXiv preprint arXiv:1901.10430, 2019.
9. Devlin, Chang M.W., Lee, Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.
10. Lee, Lee, Kim, Kosiorek, Choi, Teh Y.W., Set transformer: A framework for attention-based permutation-invariant neural networks, International conference on machine learning, PMLR, 2019:3744–3753.
11. He, Zhang, Ren, Sun, Deep residual learning for image recognition, In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016:770–778.
12. Hu, Shen, Sun, Squeeze-and-excitation networks, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018:7132–7141.
13. Wu, Song, Khosla, Yu, Zhang, Tang, Xiao, 3d shapenets: A deep representation for volumetric shapes, In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015:1912–1920.
14. Goyal, Law, Liu, Newell, Deng, Revisiting point cloud shape classification with a simple and effective baseline, International Conference on Machine Learning, PMLR, 2021:3809–3820.
15. Yi, Kim V.G., Ceylan, Shen I.C., Yan, Su... and Guibas, A scalable active framework for region annotation in 3d shape collections, ACM Transactions on Graphics (TOG) 35(6cd) (2016), 210.1–210.12.
16. Maturana, Scherer, Voxnet: A 3d convolutional neural network for real-time object recognition, In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015:922–928.
17. Riegler, Ulusoy A.O., Geiger, Octnet: Learning deep 3d representations at high resolutions, In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017:3577–3586.
18. Su, Maji, Kalogerakis, Learned-Miller, Multiview convolutional neural networks for 3d shape recognition, Proceedings of the IEEE international conference on computer vision, 2015:945–953.
19. Tatarchenko, Park, Koltun, Zhou Q.Y., Tangent convolutions for dense prediction in 3d, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018:3887–3896.
20. Li, Zhang, Xia, Vehicle detection from 3d lidar using fully convolutional network, arXiv preprint arXiv:1608.07916, 2016.
21. Chen, Ma, Wan, Li, Xia, Multi-view 3d object detection network for autonomous driving, Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017:1907–1915.
22. Kanezaki, Matsushita, Nishida, Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints, In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018:5010–5019.
23. Lang A.H., Vora, Caesar, Zhou, Yang, Beijbom, Pointpillars: Fast encoders for object detection from point clouds, In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019:12697–12705.
24. Qi C.R., Yi, Su, Guibas L.J., Pointnet++: Deep hierarchical feature learning on point sets in a metric space, Advances in neural information processing systems, 2017, 30.
25. Dovrat, Lang, Avidan, Learning to sample, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019:2760–2769.
26. Wu, Qi, Li, Pointconv: Deep convolutional networks on 3d point clouds, In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019:9621–9630.
27. Yang, Zhang, Ni, Li, Liu, Zhou, Tian, Modeling point clouds with self-attention and gumbel subset sampling, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019:3323–3332.
28. Guo, Wang, Hu, Liu and Bennamoun, Deep learning for 3d point clouds: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence PP(99) (2020), 1–1.
29. Simonovsky, Komodakis, Dynamic edge-conditioned filters in convolutional neural networks on graphs, Proceedings of the IEEE conference on computer vision and pattern recognition, 2017:3693–3702.
30. Wang, Sun, Liu, Sarma S.E., Bronstein M.M. and Solomon J.M., Dynamic graph cnn for learning on point clouds, ACM Transactions on Graphics 38(5), 2018.
31. Zhang, Hao, Wang, De Silva C.W., Fu, Linked dynamic graph cnn: Learning on point cloud via linking hierarchical features, arXiv preprint arXiv:1904.10014, 2019.
32. Shen, Feng, Yang, Tian, Mining point cloud local structures by kernel correlation and graph pooling, In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018:4548–4557.
33. Zhang, Hua B.S., Rosen D.W., Yeung S.K., Rotation invariant convolutions for 3d point clouds deep learning, In 2019 International conference on 3d vision (3DV). IEEE, 2019:204–213.
34. Rao, Lu, Zhou, Spherical fractal convolutional neural networks for point cloud recognition, In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019:452–460.
35. Mao, Wang, Li, Interpolated convolutional networks for 3d point cloud understanding, Proceedings of the IEEE/CVF international conference on computer vision, 2019:1578–1587.
36. He, Zhang, Ren, Sun, Identity mappings in deep residual networks, arXiv preprint arXiv:1603.05027, 2016:630–645.
37. Hu, Zhang, Xie, Lin, Local relation networks for image recognition, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019:3464–3473.
38. Ramachandran, Parmar, Vaswani, Bello, Levskaya, Shlens, Stand-alone self-attention in vision models, 2019, 32.
39. Hertz, Hanocka, Giryes, Cohen-Or, Pointgmm: a neural gmm network for point clouds, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020:12054–12063.
40. Choy, Gwak J.Y., Savarese, 4d spatio-temporal convnets: Minkowski convolutional neural networks, In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019:3075–3084.
41. Liu, Han, Liu Y.S. and Zwicker, Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network, Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 8778–8785.
42. Qi C.R., Su, Nießner, Dai, Yan, Guibas L.J., Volumetric and multi-view cnns for object classification on 3d data, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016:5648–5656.
43. Klokov, Lempitsky, Escape from cells: Deep kd-networks for the recognition of 3d point cloud models, Proceedings of the IEEE international conference on computer vision, 2017:863–872.
44. Xu, Fan, Xu, Zeng, Qiao, Spidercnn: Deep learning on point sets with parameterized convolutional filters, Proceedings of the European conference on computer vision (ECCV), 2018:87–102.
45. Yan, Zheng, Li, Wang, Cui, Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020:5589–5598.
46. Uy M.A., Pham Q.H., Hua B.S., Nguyen, Yeung S.K., Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data, Proceedings of the IEEE/CVF international conference on computer vision, 2019:1588–1597.
47. Qiu, Anwar, Barnes, Dense-resolution network for point cloud classification and segmentation, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021.
48. Qiu, Anwar and Barnes, Geometric back-projection network for point cloud classification, IEEE Transactions on Multimedia 24 (2021), 1943–1955.
49. Cheng, Chen, He, Liu and Bai, Pra-net: Point relation-aware network for 3d point cloud analysis, IEEE Transactions on Image Processing PP(99) (2021).
50. Hamdi, Giancola, Li, Thabet, Ghanem, Mvtn: Multi-view transformation network for 3d shape recognition, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021:1–11.
51. Xie, Liu, Chen, Tu, Attentional shapecontextnet for point cloud recognition, In IEEE/CVF Conference on Computer Vision & Pattern Recognition, 2018:4606–4615.
52. Li, Chen B.M., Lee G.H., So-net: Self-organizing network for point cloud analysis, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018:9397–9406.
