DeepGCNs-Att: Point cloud semantic segmentation with contextual point representations

Abstract

Compared with traditional Convolutional Neural Networks, Graph Convolutional Networks can characterize non-Euclidean spaces effectively and extract the local features of a point cloud with deep neural networks, but they cannot make full use of the global features of the point cloud for semantic segmentation. To solve this problem, this paper proposes a novel network structure called DeepGCNs-Att that enables a deep Graph Convolutional Network to aggregate global context features efficiently. Moreover, to speed up the computation, we add an Attention layer after the Graph Convolutional Network Backbone Block to mutually enhance the connection between distant points of the non-Euclidean space. Our model is tested on the standard benchmark S3DIS. Compared with other deep Graph Convolutional Networks with the same number of Graph Convolutional Network layers, our DeepGCNs-Att achieves an mIoU about two percentage points higher and also shows lower space complexity and computational complexity.

Keywords

Point cloud processing; semantic segmentation; graph convolutional network; attention module; deep learning

1 Introduction

Deep learning has recently been applied to various image processing tasks, such as image classification [4, 17], object detection [15, 27], and semantic segmentation [8, 28–32]. However, because two-dimensional images collected by a single-shot camera lack depth information, they cannot fully capture the surrounding environment, which has driven the rapid development of three-dimensional sensors [21, 47]. By transforming real scenes into three-dimensional point clouds, researchers use each point to represent 3D geometric coordinates, RGB color, normal vectors and other information. However, the irregular format of point clouds makes it difficult for convolutional neural networks to process them efficiently, which is the most obvious difference between 2D and 3D data processing. Some classic machine learning methods based on manual feature extraction, such as support vector machine (SVM) and random forest (RF), have also achieved relatively successful results in a series of 3D model segmentation tasks [44, 45]. However, these methods require a lengthy manual feature extraction step before scene analysis can be performed, which is impractical. To address this problem, 3D data processing methods have developed in four mainstream directions: 3D convolution [6, 26], multi-view projection onto images followed by 2D convolution [12], 1D/2D convolution or Multi-layer Perceptron (MLP) on point clouds [25], and Graph Convolutional Network (GCN) [34]. PointNet [25], which is concise and effective due to its MLP and max-pooling layer, is widely used for feature extraction in point cloud detection. However, because PointNet ignores the influence of local features, its accuracy is limited in more complex scenes.
Soon after, PointNet++ [3] improved on this by constructing local and global features through sampling and grouping layers. It introduced multi-scale grouping and multi-resolution grouping to capture details in densely sampled regions, but its reliance on empirically chosen settings also limits its results. Because graphs can represent point clouds effectively, more and more networks use GCNs for point cloud processing. However, as the number of GCN layers increases, the problems of vanishing gradients and over-smoothing appear. DeepGCNs [11], inspired by dilated convolutions [9] and DGCNN [34], construct a dilated graph to solve these problems. Nevertheless, DeepGCNs only use one max-pooling layer at the end to aggregate global features, without considering the connection between each point and the global information. In this paper, we propose a novel neural network structure that uses ResGCN [11] as the GCN Backbone Block and a Multi-layer Perceptron (MLP) for dimensionality reduction in the output layer of the network, and utilizes the dual attention module [13], including Spatial-wise Attention and Channel-wise Attention, to adaptively aggregate global features. Our contributions can be summarized as follows:

We propose a novel neural network model with higher accuracy and faster computational speed than others under the same number of GCN layers.

In the output layer of ResGCNs, dimensionality reduction is performed through MLP, and the attention layer is used to directly output the classification of each point instead of the max-pooling layer.

The experiments on point cloud segmentation indicate that our model is robust to sampling density variation and achieves high accuracy, with better OA and mIoU than DeepGCNs under the same number of network layers.

A shorter conference version of this paper appeared in [30], which did not analyze the effect of hyperparameters on model training results. This manuscript describes the structure of the proposed neural network model in more detail and provides additional ablation experiments to address this issue. The second section introduces related work on point cloud semantic segmentation. The third section systematically presents the theoretical basis of the designed network model. The fourth section demonstrates the advantages of our model through experiments, and uses ablation experiments to discuss the influence of different hyperparameters on model performance. Finally, the fifth section summarizes the work done in this paper.

2 Related work

Because a three-dimensional point cloud belongs to non-Euclidean space, that is, the points are unordered and the input size is not fixed, it is difficult for a traditional two-dimensional convolutional neural network to perform convolution operations on it. The methods that address the poor performance of two-dimensional convolution in point cloud segmentation can be roughly divided into three categories: pointwise MLP methods, point convolution methods and graph-based methods.

2.1 Pointwise MLP methods

PointNet was the first to apply a deep learning model directly to point cloud data. However, this model only has one max-pooling layer to integrate the features of all sampled points, leading to a weak ability to extract local information. VoxelNet [35] improves on the single max-pooling layer and proposes Voxel Feature Encoding (VFE), which divides the point cloud into equally spaced three-dimensional voxels and converts the set of points in each voxel into a unified feature through the VFE layer. Inspired by CNNs, PointNet++ exploits local structure through hierarchical feature extraction and proposes two strategies, multi-scale grouping and multi-resolution grouping, to ensure better feature extraction. In [36], the absolute coordinates of each point and the relative coordinates of its neighbors are used to represent the point cloud, and Group Shuffle Attention is used to obtain the relationship between points. Then, a Gumbel Subset Sampling layer that is permutation-invariant, differentiable and end-to-end trainable is used to learn hierarchical features. PointWeb [37] builds on PointNet++ and uses Adaptive Feature Adjustment and the relationship between local neighbors to improve the features of points. [38] uses a Structural Relational Network to learn the structural relationships between different local features.

2.2 Graph-based methods

Superpoint Graph (SPG) [19] represents the point cloud as a collection of interconnected simple shapes, which reduces the size of the input point cloud, and finally uses PointNet to classify each point. DGCNN proposes EdgeConv [34], which integrates the information of local neighbors. Each layer builds a new k-NN graph to learn global shape attributes gradually, and captures potentially similar global features after multiple iterations. Inspired by deep CNNs, DeepGCNs propose dilated convolutions to realize deep GCNs, as well as GCNs with residual connections [16] inspired by ResNet. [39] regards point clouds as a collection of connected simple shapes and superpoints, and uses directed graphs to obtain structural and contextual information. [40] proposes a supervised framework to oversegment a point cloud into pure superpoints. To better capture local geometric relations in high-dimensional space, [191] proposes PyramNet, a network based on the Graph Embedding Module (GEM) and the Pyramid Attention Network (PAN). The GEM module expresses the point cloud as a directed acyclic graph and uses the covariance matrix in place of Euclidean distance when constructing the similarity matrix. In the PAN module, four convolution kernels of different sizes are used to extract features.

2.3 Attention-based methods

Attention modules address the problem that long-range information is easily missed, and capture the key points from a large amount of data without losing features. The work [2] uses self-attention in machine translation for the first time, so that each word carries global semantic information and the long-distance dependencies between features are captured. [5] proposes the attention-based score refinement (ASR) module, which combines the scores of neighboring points using weights learned by attention modules. [13] proposes a Dual Attention Network (DANet) to integrate global and local features, which includes the Position Attention Module and the Channel Attention Module. The results of the two attention modules are fused through an addition operation, and finally a convolutional layer outputs the classification of each point. To better capture the spatial distribution of the point cloud, [42] proposes the Local Spatial Aware (LSA) layer to learn spatial-awareness weights. [43] proposes the attention-based score refinement (ASR) module to post-process the segmentation results and refine the initial segmentation by pooling.

3 Methodology

Point cloud semantic segmentation is the problem of assigning a category to each three-dimensional point in the point cloud. Our proposed model consists of two components, the GCN Backbone Block and the Prediction Block, which together form an end-to-end point cloud semantic segmentation network based on GCN. The network pipeline is shown in Fig. 1.

Fig. 1

Our proposed model is divided into two components. The GCN Backbone Block is composed of ResGCNs and is used to learn the correlation of adjacent points in the point cloud. For points that are far apart in the point cloud, we use an MLP to fuse high-dimensional features in the Prediction Block and a dual attention module to extract global features along both the spatial and channel dimensions, and finally classify each point.

3.1 GCN attention backbone

Traditional convolutional neural networks are not suitable for extracting point cloud features, which are spatially unordered. ResGCNs therefore use GCNs and represent the point cloud as an undirected graph $g = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are the sets of n nodes and e edges of each layer, respectively. Each layer of graph convolution performs the following update and aggregation operations,

$g_{l+1} = F(g_{l}, w_{l}) = \rho(\varphi(g_{l}, w_{l}^{agg}), w_{l}^{update})$ (1)

where $F(\cdot)$ is a fixed function that performs the convolution operation in the l-th layer, $g_{l}$ and $g_{l+1}$ are the input and output of the layer, and $\varphi$ and $\rho$ are the aggregation function and update function. The aggregation function aggregates the features of the neighborhood of each vertex in the point cloud graph, and the update function then updates the representation of each node through a nonlinear transformation to learn new features,

$h_{v_{l+1}} = \varphi(h_{v_{l}}, \rho(\{h_{u_{l}} \mid u_{l} \in N(v_{l})\}, h_{v_{l}}, w_{\rho}), w_{\varphi})$ (2)

where $h_{v_{l}} \in \mathbb{R}^{D}$ represents the features of the graph's vertex $v$ in the l-th layer, D is the dimension of the feature and $h_{v_{l+1}}$ is the feature vector in the (l+1)-th layer. $N(v_{l})$ is the set of neighbor vertices within a certain distance in the l-th layer. $w_{\rho}$ parametrizes the neighbor features $h_{u_{l}}$, and $w_{\varphi}$ contains the learnable parameters of the aggregation functions. During training, the algorithm only updates the features of the vertices of the point cloud graph and does not update the features of the edges, so the vertices only learn the features of their neighbors. When distant vertices have similar features, the traditional GCN algorithm is therefore likely to lose this key information during training. ResGCNs subtract the vertex feature $h_{v_{l}}$ from its neighbor features $h_{u_{l}}$ to distinguish different features and use max-pooling for feature aggregation:

$\dot{\rho} = \max(h_{u_{l}} - h_{v_{l}} \mid u_{l} \in N(v_{l}))$ (3)

where $N(v_{l})$ is the neighbor set of the vertex $v_{l}$. As the number of network layers increases, dilated convolution helps deep GCNs expand the receptive field of the convolution kernel without increasing the amount of computation, so that each vertex can learn the features of more distant vertices and thereby obtain global feature information. As shown in Fig. 2, our GCN Attention Backbone sets the number of neighboring points searched for each node to k = 16, each ResGCN layer has f = 64 filters (hidden channels), and the dilation rate d increases layer by layer. ResGCNs then use an MLP with batch normalization as the update function to speed up gradient descent, and ReLU as the activation function for the nonlinear transformation. Because the algorithm uses ResGCN as the GCN Backbone Block, many layers can be stacked through residual connections, which addresses the vanishing gradient and over-smoothing problems of graph convolutional neural networks.
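The backbone layer described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the single linear layer standing in for the MLP, the concatenation of the vertex feature with the aggregated feature, and the omission of batch normalization are all assumptions made to keep the example short. It shows dilated neighbor selection (every d-th of the k·d nearest points), the max-of-differences aggregation of Eq. (3), and a residual connection.

```python
import numpy as np

def dilated_knn(points, k=16, d=1):
    """Return (N, k) neighbor indices: every d-th of the k*d nearest points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)        # squared distances, shape (N, N)
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbor
    order = np.argsort(dist, axis=1)         # nearest first
    return order[:, : k * d : d]             # skip d-1 neighbors between picks

def max_relative_aggregate(h, idx):
    """Eq. (3): channel-wise max over neighbors u of (h_u - h_v)."""
    rel = h[idx] - h[:, None, :]             # (N, k, D)
    return rel.max(axis=1)                   # (N, D)

def res_gcn_layer(h, idx, w, b):
    """Aggregate, update with one linear layer + ReLU (assumed), add the skip."""
    agg = max_relative_aggregate(h, idx)
    update = np.maximum(np.concatenate([h, agg], axis=1) @ w + b, 0.0)
    return h + update                        # residual connection

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))              # toy point cloud
h = rng.normal(size=(128, 64))               # f = 64 features per vertex
w = rng.normal(scale=0.1, size=(128, 64))    # hypothetical update weights
b = np.zeros(64)
idx = dilated_knn(pts, k=16, d=2)
out = res_gcn_layer(h, idx, w, b)
print(idx.shape, out.shape)                  # (128, 16) (128, 64)
```

With d = 2 the 16 selected neighbors are drawn from the 32 nearest points, so the neighborhood covers a wider region at the same aggregation cost.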

Fig. 2

An example of the ResGCN model, which utilizes residual connections to alleviate the vanishing gradient problem.

3.2 Prediction block

At the prediction block layer, our algorithm uses the fusion of global features and local features to semantically classify the category of each point in the point cloud. As shown in Fig. 1, the prediction block first uses MLP to reduce the dimensionality of the high-dimensional features of the vertices, then uses the attention module to extract the global features, and finally obtains the semantic classification results of each point.

3.2.1 MLP feature aggregation

Our MLP module is inspired by PointNet. It concatenates the features from each layer of the GCN Backbone Block, reduces the dimensionality of the high-dimensional vertex features, and uses a 1 × 1 convolution kernel to aggregate global and local features.
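As a rough illustration of this fusion step, the sketch below concatenates per-layer vertex features and applies one shared linear map with ReLU to every point, which is what a 1 × 1 convolution over points computes. The function name, the number of layers, and the output width of 32 are hypothetical choices for the example, not values taken from the paper.

```python
import numpy as np

def fuse_layer_features(layer_feats, w, b):
    """Concatenate per-layer vertex features and apply a shared linear map + ReLU.

    layer_feats: list of (N, f) arrays, one per backbone layer.
    w: (L * f, out_dim) weights, b: (out_dim,) bias.
    """
    stacked = np.concatenate(layer_feats, axis=1)   # (N, L * f)
    return np.maximum(stacked @ w + b, 0.0)         # (N, out_dim)

rng = np.random.default_rng(0)
feats = [rng.normal(size=(100, 64)) for _ in range(3)]   # 3 layers, f = 64
w = rng.normal(scale=0.05, size=(3 * 64, 32))            # hypothetical output width
b = np.zeros(32)
fused = fuse_layer_features(feats, w, b)
print(fused.shape)                                       # (100, 32)
```

Because the same weights are applied to every point independently, reordering the input points simply reorders the output rows.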

3.2.2 Attention prediction

Compared with the traditional method of directly applying softmax to the network output layer to classify each vertex, our algorithm uses the dual attention module at the network output layer to enhance the global context information of each vertex. We compared the prediction part with that of DeepGCNs and found that our model has fewer parameters, which allows faster inference. In addition, we enrich the output vertex information of the graph to obtain better classification results.

Fig. 3

The architecture of our proposed Prediction Block, which stacks an MLP and a dual attention module to perform feature fusion and global feature aggregation. The input feature is the N × 1024 feature vector obtained by ResGCN feature extraction of the N three-dimensional points in the point cloud.

3.2.3 Spatial-wise Attention

The role of Spatial-wise Attention is to adaptively aggregate global spatial feature information in the non-Euclidean space. It takes the feature map $H \in \mathbb{R}^{C \times N}$ of the MLP output layer as input, where C is the number of channels in the feature map and N is the number of feature vectors. Two feature maps $A_{i}$ and $B_{j}$ are obtained through two identical MLPs $\rho_{i}$ and $\rho_{j}$, respectively, where $\{A_{i}, B_{j}\} \in \mathbb{R}^{C \times N}$. The algorithm then applies multiplication and softmax to obtain the normalized spatial attention weight $U_{ij}$ as follows:

$U_{ij} = \mathrm{softmax}_{j}(A_{i}, B_{j}) = \mathrm{softmax}_{j}(\rho_{i}(H_{i}), \rho_{j}(H_{j}))$ (4)

After that, our model applies an MLP to the feature vectors H to get a new feature map $C_{j} \in \mathbb{R}^{N \times D}$, multiplies $C_{j}$ by $U_{ij}$ to aggregate the global spatial features, and adds the original feature vectors $H_{i}$ to obtain a new feature map $\dot{H}$ as follows:

$\dot{H} = \sum_{j=1}^{N} (U_{ij} C_{j}) + H_{i}$ (5)

where $\dot{H}$ represents the weighted sum of the spatial feature dependencies between the vertices plus the original feature, so that the Spatial-wise Attention module can adaptively learn from distant vertices with similar features in the non-Euclidean space.
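A minimal NumPy sketch of Eqs. (4) and (5) follows. It assumes that each MLP is a single linear map (the hypothetical matrices Wa, Wb and Wc), that the "multiplication" in Eq. (4) is the inner product between the feature vectors of points i and j, and that the value map keeps the C × N layout so the residual addition is well defined. These are simplifications for illustration rather than details confirmed by the paper.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(H, Wa, Wb, Wc):
    """H: (C, N) feature map. Returns (H_dot, U) following Eqs. (4) and (5)."""
    A = Wa @ H                        # (C, N), stands in for rho_i
    B = Wb @ H                        # (C, N), stands in for rho_j
    U = softmax(A.T @ B, axis=1)      # (N, N); row i holds the weights over j
    Cmap = Wc @ H                     # (C, N) value features
    H_dot = Cmap @ U.T + H            # column i: sum_j U[i, j] * Cmap[:, j] + H[:, i]
    return H_dot, U

rng = np.random.default_rng(0)
C, N = 8, 5
H = rng.normal(size=(C, N))
Wa, Wb, Wc = (rng.normal(scale=0.3, size=(C, C)) for _ in range(3))
H_dot, U = spatial_attention(H, Wa, Wb, Wc)
print(H_dot.shape, U.shape)           # (8, 5) (5, 5)
```

Each row of U sums to one, so every point receives a convex combination of the value features of all N points, regardless of how far apart they are.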

3.2.4 Channel-wise Attention

Channel-wise Attention is similar in form to Spatial-wise Attention. Each channel of the high-dimensional feature map output by a ResGCN layer can be regarded as a class-specific response, so the self-attention module uses the interdependence between channels to learn a global feature representation. Given the feature map $H \in \mathbb{R}^{C \times N}$ of the MLP output layer, the Channel-wise Attention module takes these feature vectors as input. Unlike Spatial-wise Attention, Channel-wise Attention directly performs matrix multiplication on $H_{j}$ and the transpose of $H_{i}$, and then applies softmax to obtain the channel attention map $V_{ji}$:

$V_{ji} = \mathrm{softmax}_{i}(H_{j}, H_{i})$ (6)

where $V_{ji}$ is the impact exerted by each vertex $h_{i_{v}}$ in the i-th channel on the vertex $h_{j_{v}}$ in the j-th channel. The output of this module is $\ddot{H} \in \mathbb{R}^{C \times N}$, which aggregates the global features. The algorithm multiplies the transpose of $V_{ji}$ by $H_{i}$ and adds the result to $H_{j}$ to get $\ddot{H}$:

$\ddot{H} = \sum_{i=1}^{N} (V_{ji} H_{i}) + H_{j}$ (7)

where $\ddot{H}$ is the weighted sum of the dependencies between the channels of the high-dimensional feature map plus the original feature, so that each vertex contains both the local features learned by ResGCNs and the features shared with distant vertices. Finally, the algorithm adds $\dot{H}$ and $\ddot{H}$ and obtains the final classification result through a nonlinear transformation. In this way, the proposed model makes full use of the rich local features in the GCN Backbone Block while considering the spatial and channel correlation between short-distance and long-distance vertex features and the interdependence between feature maps, which allows our algorithm to perform point cloud semantic classification more accurately in complex three-dimensional scenes.
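Eqs. (6) and (7) and the final fusion can be sketched in NumPy as below. The code reads $V_{ji}$ as a C × C map whose rows are normalized over i, and it treats the final "nonlinear transformation" as a hypothetical linear classifier `W_cls` followed by an arg-max; both are assumptions for the sake of a short, runnable example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(H):
    """H: (C, N). V[j, i] weights the influence of channel i on channel j."""
    V = softmax(H @ H.T, axis=1)      # (C, C), each row sums to one
    H_ddot = V @ H + H                # row j: sum_i V[j, i] * H[i] + H[j]
    return H_ddot, V

def fuse_and_classify(H_dot, H_ddot, W_cls):
    """Add both attention outputs and pick a class per point (assumed head)."""
    logits = W_cls @ (H_dot + H_ddot)  # (num_classes, N)
    return logits.argmax(axis=0)       # (N,) predicted class per point

rng = np.random.default_rng(0)
C, N, num_classes = 8, 5, 13
H = rng.normal(size=(C, N))
H_ddot, V = channel_attention(H)
H_dot = H.copy()                       # placeholder for the spatial branch output
W_cls = rng.normal(size=(num_classes, C))
labels = fuse_and_classify(H_dot, H_ddot, W_cls)
print(H_ddot.shape, labels.shape)      # (8, 5) (5,)
```

Unlike the spatial branch, no extra projection weights are needed here: the attention map is computed directly from H, which is one reason the module adds few parameters.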

4 Experiment

4.1 Experiment setting

4.1.1 Dataset

The S3DIS [1] dataset covers six different areas obtained by scanning indoor environments, including 271 rooms, 13 object classes (ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, clutter) and 11 scene types (offices, meeting rooms, corridors, auditoriums, open spaces, lobbies, lounges, pantries, copy rooms, storage rooms and toilets), which gives it a rich three-dimensional indoor structure. Compared with other datasets, the S3DIS dataset has more complex spatial semantic information, which makes semantic segmentation more challenging.

4.1.2 Hardware configuration

We use the TensorFlow framework to build, train and run inference with our DeepGCNs-Att model. Our computing device is equipped with an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz and two NVIDIA RTX 2080 Ti GPUs.

4.1.3 Implementation

To distinguish the DeepGCNs-Att models trained with different numbers of layers, we name the 7-layer and 14-layer models ResGCN-Att-7 and ResGCN-Att-14, respectively. The Adam optimizer [33] is used for network training, with an initial learning rate of 0.01. The batch sizes of ResGCN-Att-7 and ResGCN-Att-14 are set to 8 and 6, respectively, as determined by the computing power of our hardware. On Area 5 and over 6 folds of S3DIS, we compared our model with PointNet [25], SEGCloud [20], RSNet [22], MS+CU [8], G+RCU [8], PointNet++ [3], and 3DRNN+CF [35]. Due to the limitation of our hardware, the batch size of ResGCN-Att-28 could only be set to 2, which caused serious gradient oscillation and made training hard to converge. We therefore did not test ResGCN-Att-28, but the analysis of ResGCN-Att-7 and ResGCN-Att-14 shows that the proposed architecture can achieve excellent point cloud semantic segmentation results with deep graph convolutional neural networks. The overall accuracy (OA) and mean intersection over union (mIoU) are used to evaluate the semantic segmentation performance of the network model. For each class, the predictions can be divided into four categories: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The IoU of a class measures the ratio of the intersection to the union of the sets of true and predicted points, that is, IoU = TP/(TP + FP + FN), and mIoU is the mean of the IoU over all classes.
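The two metrics can be computed from a confusion matrix as in the following sketch. The helper names are illustrative, and skipping classes that never occur in either the labels or the predictions is an assumption of this example, not something the paper specifies.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t, p] counts points with true class t predicted as class p."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def oa_and_miou(cm):
    """Return overall accuracy, mean IoU and the per-class IoU vector."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp          # belongs to the class but missed
    denom = tp + fp + fn
    # Classes with an empty union get NaN and are left out of the mean (assumed).
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return tp.sum() / cm.sum(), np.nanmean(iou), iou

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
oa, miou, iou = oa_and_miou(confusion_matrix(y_true, y_pred, 3))
print(round(float(oa), 4), round(float(miou), 4))   # 0.6667 0.5
```

In this toy case the per-class IoU values are 1/3, 2/3 and 1/2, so the mIoU of 0.5 is lower than the OA of 0.67, which illustrates why the two numbers are reported side by side.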

4.1.4 Training and testing process

First, we trained and tested the ResGCN-Att-7 and ResGCN-Att-14 models on Area 5 of the S3DIS dataset. The training curves of ResGCN-Att-7 and ResGCN-Att-14 are shown in Fig. 5. The final accuracies of the two models are both about 0.84. At around epoch 40, the test accuracy begins to oscillate, which suggests that the model has converged. We therefore tested all models from epoch 40 to 60 and selected the best point cloud segmentation model.

Fig. 5

Visualized results of semantic segmentation.

4.1.5 Flops and Params

The number of Params reflects the memory usage of the model, and the number of Flops is an essential indicator of its computational cost; they correspond to the algorithm's space complexity and computational complexity, respectively. As shown in Table 1, our ResGCN-Att model has fewer Params and Flops than the ResGCN model with the same number of layers. This means that ResGCN-Att requires fewer computing resources and computes faster.

Table 1
Params and Flops of the two network models under the same number of layers

Method	Flops	Params
ResGCN-Att-7	405670670811	1239503
ResGCN-Att-14	688299896242	1756943
ResGCN-7	436747801923	1403853
ResGCN-14	764491157786	2150669
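To make the savings in Table 1 easier to read, the short script below computes the relative reduction of the attention variant against the plain ResGCN at each depth. The dictionary layout and the helper name are illustrative; the numbers are copied from Table 1.

```python
# Flops and Params copied from Table 1.
costs = {
    "ResGCN-7": {"flops": 436747801923, "params": 1403853},
    "ResGCN-14": {"flops": 764491157786, "params": 2150669},
    "ResGCN-Att-7": {"flops": 405670670811, "params": 1239503},
    "ResGCN-Att-14": {"flops": 688299896242, "params": 1756943},
}

def relative_reduction(baseline, ours):
    """Fraction of the baseline cost saved by the compared model."""
    return (baseline - ours) / baseline

for depth in (7, 14):
    base = costs[f"ResGCN-{depth}"]
    att = costs[f"ResGCN-Att-{depth}"]
    for key in ("flops", "params"):
        saving = relative_reduction(base[key], att[key])
        print(f"{depth} layers, {key}: {100 * saving:.1f}% fewer")
```

With these figures the attention variant needs roughly 7% fewer Flops and 12% fewer Params at 7 layers, and roughly 10% fewer Flops and 18% fewer Params at 14 layers.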

4.1.6 Effect of network depth

We analyzed the overall performance of our DeepGCNs-Att model at different depths by testing ResGCN-Att-7 and ResGCN-Att-14. Results in Table 2 show that the mIoU of ResGCN-Att-7 and ResGCN-Att-14 is approximately two percentage points higher than that of the corresponding ResGCN models (50.74 vs. 48.95 and 52.38 vs. 49.90), and that mIoU increases with the number of layers. Although we did not test ResGCN-Att-28 due to computational limitations, these results suggest that ResGCN-Att-28 would likely outperform ResGCN-28.

Table 2
Results of S3DIS dataset on Area 5 in different layers of GCN Backbone Block

Test Area	Method	mIoU
Area 5	ResGCN-Att-7	50.74
	ResGCN-Att-14	52.38
	ResGCN-7	48.95
	ResGCN-14	49.9
	PointNet	41.09
	SEGCloud	48.92
	RSNet	51.93

4.1.7 Comparison with state-of-the-art networks

To test the overall performance of the model on the entire S3DIS dataset, we performed 6-fold cross-validation and compared our model with state-of-the-art models on Area 5 and over 6 folds. The results of OA and mIoU are shown in Table 3. On Area 5, the mIoU of our model is 2.48 percentage points higher than that of ResGCN-14 and very close to that of ResGCN-28 (52.38 vs. 52.49), and it surpasses the other listed models. Over 6 folds, our model achieves accuracy close to that of the original ResGCN-28 and surpasses the other listed models on mIoU, with fewer parameters and faster computation.
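The 6-fold protocol can be sketched as a leave-one-area-out split, assuming (as is common for S3DIS, though not spelled out here) that each fold holds out one of the six areas for testing and trains on the other five. The function name is illustrative.

```python
def leave_one_area_out(areas):
    """Yield (train_areas, test_area) pairs, one fold per area."""
    for test_area in areas:
        yield [a for a in areas if a != test_area], test_area

folds = list(leave_one_area_out([1, 2, 3, 4, 5, 6]))
for train_areas, test_area in folds:
    print(f"train on areas {train_areas}, test on Area {test_area}")
```

Under this reading, the Area 5 results correspond to the single fold that tests on Area 5.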

Table 3
OA and mIoU segmentation results on the S3DIS dataset in Area 5 and 6-fold

Test Area	Method	OA	mIoU
Area 5	PointNet	-	41.09
	SEGCloud	-	48.92
	RSNet	-	51.93
	PointNet++	83.43	51.98
	ResGCN-14	-	49.90
	ResGCN-28	-	52.49
	ResGCN-Att-14(Ours)	83.96	52.38
6-fold	PointNet	78.5	47.6
	MS+CU	79.2	47.8
	G+RCU	81.1	49.7
	SGPN	47.6	50.4
	A-SCN	81.6	52.7
	DGCNN	84.3	56.1
	PointNet++	-	53.2
	ResGCN-28	85.9	60.0
	ResGCN-Att-14(Ours)	83.96	57.7

We summarize each category's IoU in Table 4. Our proposed model has an IoU similar to that of ResGCN-28 in most categories, with only a small decline due to the smaller number of GCN layers. We replaced the fusion module of the original network with the dual attention module, which gives the prediction module fewer parameters. Even with fewer layers in the GCN Backbone Block, the network model still achieves relatively good semantic segmentation results. Besides, on both Area 5 and 6-fold, the semantic segmentation performance of our model is better than that of the state-of-the-art models in most categories.

Table 4

IoU segmentation results of each category of the S3DIS dataset in Area 5 and 6-fold

Test Area	Method	ceiling	floor	wall	beam	column	window	door	table	chair	sofa	bookcase	board	clutter
Area 5	PointNet	88.8	97.33	69.8	0.05	3.92	46.26	10.76	52.61	58.93	40.28	5.85	26.38	33.22
	SEGCloud	90.06	96.05	69.86	0.00	18.37	38.35	23.12	75.89	70.40	58.42	40.88	12.96	41.0
	RSNet	93.34	98.36	79.18	0.00	15.75	45.37	50.10	65.52	67.87	22.45	52.45	41.02	43.64
	ResGCN-Att-14(Ours)	91.44	97.62	75.85	0.00	15.38	50.47	42.48	67.64	75.39	14.46	51.32	56.66	42.23
6fold	PointNet	88.0	97.3	69.3	42.4	23.1	47.5	51.6	42.0	54.1	9.6	38.2	29.4	35.2
	MS+CU	88.6	95.8	67.3	36.9	24.9	48.6	52.3	51.9	45.1	10.6	36.8	24.7	37.5
	G+RCU	90.3	92.1	67.9	44.7	24.2	52.3	51.2	58.1	47.4	6.9	39.0	30.0	41.9
	PointNet++	90.2	91.7	73.1	42.7	21.2	49.7	42.3	62.7	59.0	19.6	45.8	36.7	51.6
	3DRNN+CF	92.9	93.8	73.1	42.5	25.9	47.6	59.2	60.4	66.7	24.8	57.0	36.7	51.6
	ResGCN-28	93.1	95.3	78.2	33.9	37.4	56.1	68.2	64.9	61.0	34.6	51.5	51.1	54.4
	ResGCN-Att-14(Ours)	93.0	92.3	76.9	32.7	33.1	56.6	67.0	61.6	63.5	21.9	51.3	47.0	53.3

As shown in Fig. 4, the classification results of our ResGCN-Att-14 are almost identical to those of ResGCN-28, and only a few parts of the point cloud are misclassified. This is because the algorithm adopts the dual attention module, which enables the model to learn global context information autonomously in the prediction part. We therefore obtain classification results similar to those of the 28-layer model with fewer GCN layers.

Fig. 4

The training process of ResGCN-Att-7 and ResGCN-Att-14.

4.2 Ablation experiment

We conducted ablation experiments on our proposed model using the control variable method, changing dilation, neighbors, depth, width, fixed k-NN, and connection one at a time to verify the influence of each on the model results, as shown in Table 5.

Effect of dilation. Results in Table 5 show that dilated graph convolutions account for a 2.46% improvement in mean IoU, which we attribute primarily to the expansion of the network's receptive field. Increasing the stochastic dilation parameter from 0.2 to 0.4 raises the mIoU slightly (+0.43), but further increases bring no stable gain: at 0.6 and 0.8 the mIoU falls below the reference (-0.57 and -0.93). Interestingly, our results in Table 5 also indicate that dilation especially helps deep networks when combined with residual graph connections. Without such connections, performance can actually decrease when dilated graph convolutions are used. One possible reason is that the varying neighbors produce worse gradients, which further hinders convergence when residual graph connections are not used.

Effect of neighbors. Results in Table 5 show that a larger number of neighbors helps in general. When the number of neighbors is reduced by a factor of 2 and 4, the performance drops by 2.53% and 3.25%, respectively. However, a larger number of neighbors only improves performance if the network capacity is large enough. This becomes obvious when we double the number of neighbors and halve the number of filters (-3.31).

Effect of depth. Results in Table 5 show that increasing the number of layers can improve network performance, provided that residual graph connections and dilated graph convolutions are used.

Effect of width. Results in Table 5 show that increasing the number of filters leads to performance improvements similar to those from increasing the number of layers. Generally speaking, a network with higher capacity can learn the nuances needed to succeed in corner cases.

Effect of fixed k-NN. Results in Table 5 show that using a fixed k-NN graph lowers the result of the model (-5.02), whereas the dynamic k-NN requires more computation.

Effect of connection. Results in Table 5 show that residual graph connections play an essential role in training deeper networks, as they tend to result in more stable gradients. When the residual graph connections between layers are removed, performance degrades dramatically (-13.04% mIoU). As the depth of the network increases, skip connections become key to convergence.
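The stochastic dilation parameter (the "sto. eps." column of Table 5) is not defined in this section. The sketch below follows our understanding of stochastic dilation as used in DeepGCNs [11]: with probability ε the k neighbors are sampled uniformly from the k·d nearest candidates, and otherwise every d-th candidate is taken. Treat this reading, the function name, and the per-call coin flip as assumptions rather than a description of the authors' code.

```python
import numpy as np

def stochastic_dilated_neighbors(sorted_idx, k, d, eps, rng):
    """Pick k neighbors per point from the k*d nearest candidates.

    sorted_idx: (N, >= k*d) neighbor indices sorted nearest first.
    eps: probability of random sampling instead of regular dilation.
    """
    candidates = sorted_idx[:, : k * d]
    if rng.random() < eps:
        # Random aggregation: the same k random columns for all points.
        cols = np.sort(rng.choice(k * d, size=k, replace=False))
        return candidates[:, cols]
    # Regular dilation: every d-th candidate.
    return candidates[:, ::d]

rng = np.random.default_rng(0)
sorted_idx = np.tile(np.arange(32), (4, 1))   # toy: 4 points, 32 sorted candidates
regular = stochastic_dilated_neighbors(sorted_idx, k=16, d=2, eps=0.0, rng=rng)
random_pick = stochastic_dilated_neighbors(sorted_idx, k=16, d=2, eps=1.0, rng=rng)
print(regular.shape, random_pick.shape)       # (4, 16) (4, 16)
```

Setting `eps=0.0` reproduces plain dilation, which corresponds to the 0.0 row of the dilation ablation in Table 5.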

Table 5
Ablation study on Area 5 of S3DIS. We compare our reference network (ResGCN-Att-14) with 14 layers, residual graph connections, and dilated graph convolutions to several ablated variants. We denote residual connections with the ⊕ symbol.

Ablation	Model	Operator	mIoU	ΔmIoU	dynamic	connection	dilation	sto. eps.	NNs	filters	layers
Reference	ResGCN-Att-14	EdgeConv	52.38	0.00	✓	⊕	✓	0.2	16	64	14
Dilation	ResGCN-Att-14	EdgeConv	51.89	-0.49	✓	⊕	✓	0.0	16	64	14
		EdgeConv	52.81	0.43	✓	⊕	✓	0.4	16	64	14
		EdgeConv	51.81	-0.57	✓	⊕	✓	0.6	16	64	14
		EdgeConv	51.45	-0.93	✓	⊕	✓	0.8	16	64	14
		EdgeConv	49.92	-2.46	✓	⊕		0.2	16	64	14
Neighbors	ResGCN-Att-14	EdgeConv	49.13	-3.25	✓	⊕	✓	0.2	4	64	14
		EdgeConv	49.58	-2.53	✓	⊕	✓	0.2	8	64	14
		EdgeConv	49.07	-3.31	✓	⊕	✓	0.2	32	32	14
Depth	ResGCN-Att-7	EdgeConv	47.24	-5.14	✓	⊕	✓	0.2	4	64	7
Width	ResGCN-Att-14	EdgeConv	45.60	-6.78	✓	⊕	✓	0.2	16	16	14
		EdgeConv	49.07	-3.31	✓	⊕	✓	0.2	16	32	14
		EdgeConv	53.49	1.11	✓	⊕	✓	0.2	8	128	14
Fixed k-NN	ResGCN-Att-14	EdgeConv	47.36	-5.02		⊕			16	64	14
Connection	ResGCN-Att-14	EdgeConv	39.34	-13.04	✓		✓	0.2	16	64	14
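The ΔmIoU column is the difference between each variant's mIoU and the reference value of 52.38. As a quick sanity check, the sketch below recomputes that difference from the values transcribed in Table 5 and lists any row where the reported and recomputed numbers disagree. The row labels and the `check_deltas` helper are illustrative names, not part of the original work.

```python
REFERENCE_MIOU = 52.38

# (label, mIoU, reported delta), copied from Table 5
ROWS = [
    ("Dilation, sto. eps. 0.0", 51.89, -0.49),
    ("Dilation, sto. eps. 0.4", 52.81, 0.43),
    ("Dilation, sto. eps. 0.6", 51.81, -0.57),
    ("Dilation, sto. eps. 0.8", 51.45, -0.93),
    ("Dilation, no dilation", 49.92, -2.46),
    ("Neighbors, NNs 4", 49.13, -3.25),
    ("Neighbors, NNs 8", 49.58, -2.53),
    ("Neighbors, NNs 32", 49.07, -3.31),
    ("Depth, 7 layers", 47.24, -5.14),
    ("Width, 16 filters", 45.60, -6.78),
    ("Width, 32 filters", 49.07, -3.31),
    ("Width, 128 filters", 53.49, 1.11),
    ("Fixed k-NN", 47.36, -5.02),
    ("Connection, no residual", 39.34, -13.04),
]


def check_deltas(rows, reference=REFERENCE_MIOU, tol=0.005):
    """Return (label, reported, recomputed) for rows whose delta does not match."""
    mismatches = []
    for label, miou, reported in rows:
        recomputed = round(miou - reference, 2)
        if abs(recomputed - reported) > tol:
            mismatches.append((label, reported, recomputed))
    return mismatches


for label, reported, recomputed in check_deltas(ROWS):
    print(f"{label}: reported {reported:+.2f}, recomputed {recomputed:+.2f}")
```

On the values as transcribed here, only the row with 8 neighbors is flagged (reported -2.53, recomputed -2.80). This may be a transcription slip in either the mIoU or the ΔmIoU column, and the table alone does not indicate which.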

5 Conclusion

In this paper, we proposed a novel neural network model that improves the performance of DeepGCNs. After the GCN Backbone Block, we used an MLP for feature fusion and a dual attention module, in place of max-pooling, to actively learn the global features of the point cloud. Extensive experimental comparisons show that the proposed model achieves better point cloud semantic segmentation results than the existing models we compared against, while using fewer parameters, which helps accelerate inference. We also conducted ablation experiments to explore how different hyperparameters influence the segmentation results. In future work, we will further improve the network framework. Because the farthest point sampling (FPS) algorithm used for point cloud downsampling has high time complexity, we plan to replace it with random sampling, combined with further aggregation of local point cloud features, to improve the efficiency of the network so that it can be applied to larger-scale point cloud data.
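To make the cost argument about downsampling concrete, the following is a minimal pure-Python sketch of the two strategies. It is a generic illustration, not the code used in this work, and the function names and the `seed` parameter are assumptions. Selecting m of N points with farthest point sampling needs on the order of N·m distance evaluations, because every new pick requires a pass over all points, whereas random sampling simply draws m indices.

```python
import math
import random


def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the selected set."""
    if not 0 < m <= len(points):
        raise ValueError("m must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < m:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        # One full pass over all points per new sample: O(N) work each time.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


def random_sampling(points, m, seed=0):
    """Uniform random downsampling: draw m distinct indices, no distance work."""
    return random.Random(seed).sample(range(len(points)), m)


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(farthest_point_sampling(points, 3))
print(random_sampling(points, 3))
```

FPS spreads the samples evenly over the scene, which random sampling does not guarantee. This is presumably why the random variant is meant to be paired with additional local feature aggregation.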

Acknowledgment

Our project was supported by the National Natural Science Foundation of China (Grant No. 61974073).

References

1. Armeni, Sax, Zamir A.R. and Savarese, Joint 2D-3D-Semantic Data for Indoor Scene Understanding, arXiv preprint (2017), arXiv:1702.01105.
2. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention Is All You Need, Advances in Neural Information Processing Systems, 2017, 5998–6008.
3. Qi C.R., Yi, Su and Guibas L.J., PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, Advances in Neural Information Processing Systems, 2017, 5099–5108.
4. Szegedy, Wei, Yangqing, Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition, 2015, 1–9.
5. Zhao, Zhou, Lu and Zhao, Pooling Scores of Neighboring Points for Improved 3D Point Cloud Segmentation, IEEE International Conference on Image Processing, 2019, 1475–1479.
6. Maturana and Scherer, VoxNet: A 3D Convolutional Neural Network for real-time object recognition, IEEE/RSJ International Conference on Intelligent Robots and Systems, 2015, 922–928.
7. Lu and Weng, A survey of image classification methods and techniques for improving classification performance, International Journal of Remote Sensing 28 (2007), 823–870.
8. Engelmann, Kontogianni, Hermans and Leibe, Exploring Spatial Context for 3D Semantic Segmentation of Point Clouds, IEEE International Conference on Computer Vision Workshops, 2017, 716–724.
9. Yu and Koltun, Multi-Scale Context Aggregation by Dilated Convolutions, arXiv preprint (2015), arXiv:1511.07122.
10. Lateef and Ruichek, Survey on semantic segmentation using deep learning techniques, Neurocomputing 338 (2019), 321–348.
11. Li, Muller, Thabet and Ghanem, DeepGCNs: Can GCNs go as deep as CNNs?, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, 9267–9276.
12. Su, Maji, Kalogerakis and Learned-Miller, Multi-view Convolutional Neural Networks for 3D Shape Recognition, IEEE International Conference on Computer Vision, 2015, 945–953.
13. Fu, Liu, Tian, Li, Bao, Fang and Lu, Dual Attention Network for Scene Segmentation, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, 3141–3149.
14. Long, Shelhamer and Darrell, Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision, 2015, 3431–3440.
15. He, Gkioxari, Dollár and Girshick, Mask R-CNN, Proceedings of the IEEE International Conference on Computer Vision, 2017, 2961–2969.
16. He, Zhang, Ren and Sun, Deep Residual Learning for Image Recognition, IEEE Conference on Computer Vision and Pattern Recognition, 2016, 770–778.
17. Simonyan and Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint (2014), arXiv:1409.1556.
18. Liu, Ouyang, Wang, Fieguth, Chen, Liu and Pietikäinen, Deep learning for generic object detection: A survey, International Journal of Computer Vision 128 (2020), 261–318.
19. Landrieu and Simonovsky, Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 4558–4567.
20. Tchapmi, Choy, Armeni, Gwak and Savarese, SEGCloud: Semantic Segmentation of 3D Point Clouds, International Conference on 3D Vision, 2017, 537–547.
21. Hamid M.S., Manap N.A., Hamzah R.A. and Kadmin A.F., Stereo matching algorithm based on deep learning: A survey, Journal of King Saud University – Computer and Information Sciences (2020).
22. Huang, Wang and Neumann, Recurrent Slice Networks for 3D Segmentation of Point Clouds, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 2626–2635.
23. Girshick, Donahue, Darrell and Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, IEEE Conference on Computer Vision and Pattern Recognition, 2014, 580–587.
24. Girshick, Fast R-CNN, IEEE International Conference on Computer Vision, 2015, 1440–1448.
25. Charles R.Q., Su, Kaichun and Guibas L.J., PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, IEEE Conference on Computer Vision and Pattern Recognition, 2017, 77–85.
26. Ji, Xu, Yang and Yu, 3D Convolutional Neural Networks for Human Action Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2013), 221–231.
27. Xie, Liu, Chen and Tu, Attentional ShapeContextNet for Point Cloud Recognition, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 4606–4615.
28. Badrinarayanan, Kendall and Cipolla, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017), 2481–2495.
29. Wang, Yu, Huang and Neumann, SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation, IEEE Conference on Computer Vision and Pattern Recognition, 2018, 2569–2578.
30. Wang, Jiang, Zhang, Tong, Me, Xiao and Tong, DeepGCNs Att for point cloud semantic segmentation, International Conference on Artificial Intelligence and Computer Science, 2021.
31. Wang, He and Ma, Exploiting local and global structure for point cloud semantic segmentation with contextual point representations, arXiv preprint (2019), arXiv:1911.05277.
32. Ye, Li, Huang, Du and Zhang, 3D recurrent neural networks with context fusion for point cloud semantic segmentation, Proceedings of the European Conference on Computer Vision, 2018, 403–417.
33. LeCun, Bottou, Bengio and Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998), 2278–2324.
34. Wang, Sun, Liu, Sarma S.E., Bronstein M.M. and Solomon J.M., Dynamic graph CNN for learning on point clouds, ACM Transactions on Graphics 38 (2019), 1–12.
35. Zhou and Tuzel, VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 4490–4499.
36. Pham Q.H., Sevestre, Pahwa R.S., Zhan, Pang C.H., Chen, Mustafa, Chandrasekhar and Lin, A*3D dataset: Towards autonomous driving in challenging environments, International Conference on Robotics and Automation, 2020.
37. Caesar, Bankiti, Lang A.H., Vora, Liong V.E., Xu, Krishnan, Pan, Baldan and Beijbom, nuScenes: A multimodal dataset for autonomous driving, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
38. Munoz, Bagnell J.A., Vandapel and Hebert, Contextual classification with functional max-margin Markov networks, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009, 975–982.
39. Boulch, Le Saux and Audebert, Unstructured point cloud semantic labeling using deep segmentation networks, Eurographics Workshop on 3D Object Retrieval, 2017.
40. Wu, Wan, Yue and Keutzer, SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D lidar point cloud, International Conference on Robotics and Automation, 2018.
41. Jaritz, Gu and Su, Multi-view PointNet for 3D scene understanding, IEEE International Conference on Computer Vision Workshop, 2019.
42. Mueller, Smith and Ghanem, Context-aware correlation filter tracking, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017.
43. Zarzar, Giancola and Ghanem, Efficient tracking proposals using 2D-3D siamese networks on lidar, arXiv preprint (2019), arXiv:1903.10168.
44. Limin and Greenspan, Real-time object recognition in sparse range images using error surface embedding, International Journal of Computer Vision 89 (2010), 211–228.
45. Yulan, et al., Rotational projection statistics for 3D local surface description and object recognition, International Journal of Computer Vision 105 (2013), 63–86.
46. Wang, Zheng and Feng, Color constancy enhancement for multi-spectral remote sensing images, 2013 IEEE International Geoscience and Remote Sensing Symposium, 2013.
47. Li, Shen, Zhang, Yuan and Yang, Recovering Quantitative Remote Sensing Products Contaminated by Thick Clouds and Shadows Using Multitemporal Dictionary Learning, IEEE Transactions on Geoscience and Remote Sensing 52 (2014), 7086–7098.

Table 1
Params and Flops of the two network models under the same number of layers

Method	Flops	Params
ResGCN-Att-7	405670670811	1239503
ResGCN-Att-14	688299896242	1756943
ResGCN-7	436747801923	1403853
ResGCN-14	764491157786	2150669
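The raw counts in Table 1 are easier to compare as relative savings. The short sketch below computes, for each depth, how much the attention variant saves over plain ResGCN in FLOPs and parameters. The `TABLE_1` dictionary and `relative_saving` helper are illustrative names, and the numbers are copied from the table as given.

```python
# (FLOPs, parameters) per model, copied from Table 1
TABLE_1 = {
    "ResGCN-Att-7": (405670670811, 1239503),
    "ResGCN-Att-14": (688299896242, 1756943),
    "ResGCN-7": (436747801923, 1403853),
    "ResGCN-14": (764491157786, 2150669),
}


def relative_saving(baseline, ours):
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


for depth in (7, 14):
    flops_att, params_att = TABLE_1[f"ResGCN-Att-{depth}"]
    flops_base, params_base = TABLE_1[f"ResGCN-{depth}"]
    print(
        f"{depth} layers: "
        f"FLOPs saving {relative_saving(flops_base, flops_att):.1f}%, "
        f"parameter saving {relative_saving(params_base, params_att):.1f}%"
    )
```

On these figures the savings come out at roughly 7% (FLOPs) and 12% (parameters) for 7 layers, and roughly 10% and 18% for 14 layers.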

Table 2
Results of S3DIS dataset on Area 5 in different layers of GCN Backbone Block

Test Area	Method	mIoU
Area 5	ResGCN-Att-7	50.74
	ResGCN-Att-14	52.38
	ResGCN-7	48.95
	ResGCN-14	49.9
	PointNet	41.09
	SEGCloud	48.92
	RSNet	51.93