Abstract
3D shape recognition is a critical research topic in the field of computer vision, attracting substantial attention. Existing approaches mainly focus on extracting distinctive 3D shape features; however, they often neglect the model’s robustness and lack refinement in deep features. To address these limitations, we propose the point-view fusion attention network that aims to extract a concise, informative, and robust 3D shape descriptor. Initially, our approach combines multi-view features with point cloud features to obtain accurate and distinguishable fusion features. To effectively handle these fusion features, we design a dual-attention convolutional network which consists of a channel attention module and a spatial attention module. This dual-attention mechanism greatly enhances the generalization ability and robustness of 3D recognition models. Notably, we introduce a strip-pooling layer in the channel attention module to refine the features, resulting in improved fusion features that are more compact. Finally, a classification process is performed on the refined features to assign appropriate 3D shape labels. Our extensive experiments on the ModelNet10 and ModelNet40 datasets for 3D shape recognition and retrieval demonstrate the remarkable accuracy and robustness of the proposed method.
Introduction
Three-dimensional (3D) shape recognition plays a critical role in engineering and scientific fields, including object modeling, three-dimensional measurement, and realistic mapping [1–4]. As the demand for 3D shape recognition technology increases across various domains, the requirements for it are also growing [5–7]. Nowadays, the development of 3D acquisition technology has further driven the progress of 3D shape recognition tasks. More and more research efforts are focused on feature fusion and refinement.
With the rapid advancement of deep learning, numerous methods based on convolutional neural networks (CNNs) have been proposed for 3D shape recognition and have achieved remarkable performance [8]. In view-based approaches, the input data usually consist of multiple views captured from different camera viewpoints, which can be easily processed by well-established CNN models such as VGG [9], AlexNet [10], GoogLeNet [11], ResNet [12], and DenseNet [13]. However, the multi-view representation tends to ignore local structural information, which is inevitably discarded because of the camera angle. Hence, feature integration has become a research focus in this area because integrated features offer global characteristics [14, 15]. For example, Bu et al. [16] fused geometry-based features with view-based low-level shape descriptors to create a joint representation of 3D shapes. Although this method achieved relatively promising results, it still lost local information of 3D shapes. Jiang et al. [17] further improved the multimodal fusion architecture by introducing a hierarchical network structure, which better explored the inherent hierarchical associations among views. Moreover, Liu et al. [18] modeled the multi-view context using a grouping module to extract features from view-level and group-level context, and then used a group-level fusion module to obtain compact 3D object descriptors. Nie et al. [19] carried out information fusion at the view level and the feature level to explore image morphological information more deeply. Liu et al. [20] used a voting-based view filtering strategy to filter images and integrated multi-view information to extract efficient and robust feature descriptors. Furthermore, Bai et al. [21] extracted depth features at different scales and then fused these multi-scale features to consider their deep-level image relationships. This method demonstrated strong robustness. Zhu et al. [22] fused multiple sets of local image information and explored the correlation between features, which provided a reference for more refined image recognition.
The above methods demonstrate the advantages of integrated features in multi-view recognition. However, these multi-feature aggregation methods are primarily based on view-level and geometric-level descriptors, which have inherent deficiencies in local features. Therefore, point cloud-based methods have gained considerable attention because they better preserve 3D spatial information and internal local structure. For example, You et al. [23, 24] fused point cloud features with multi-view features through feature repetition and extracted fusion features to achieve a consistent representation of 3D shapes. Liang et al. [25] scored input images by integrating multi-modal data, such as multi-view, point cloud, and PANORAMA-view data, which increased the data dimensionality. Sun et al. [26] separately extracted visual features and structural features from multi-view and point cloud datasets, and then used a multimodal fusion strategy to achieve a unified representation. Nevertheless, these methods lack deep mining and refined representation of multimodal information.
Fused features tend to be high-dimensional, and to support better fusion across modalities they often contain repetitive or redundant information, which may interfere with the subsequent classification. To capture the most identifiable depth feature information, Zhang et al. [27] embedded a self-attention mechanism into generative adversarial networks to achieve attention-driven long-range dependency modeling, and Woo et al. [28] proposed the convolutional block attention module, which is simple and effective and can be integrated into any feedforward convolutional neural network. The attention mechanism has since been widely used in 3D shape recognition and retrieval tasks across different data modalities. For example, Nie et al. [29] used multiple attention networks to integrate local image information and evaluated the contribution of each channel to better retain valid information. Furthermore, Nie et al. [30] proposed a deep attention network to process multi-view features and improved image retrieval accuracy by integrating multi-attention features; that work tested the effectiveness of the attention mechanism on multiple datasets. Subsequently, Ma et al. [31] used a double-channel attentional residual network to process voxel-based input data. These works demonstrate the strengthening effect of the attention mechanism on 3D shape recognition and provide inspiration for subsequent research. However, these methods usually operate on global images and have difficulty capturing refined image information. They also lack deep information mining for multimodal data, which could help improve the robustness of 3D image recognition.
Consequently, we present a novel point-view fusion attention network, PVFAN, for 3D shape recognition that integrates point cloud and multi-view features through an attention network. PVFAN is designed to capture compact, robust feature descriptors and to give the model good generalization ability. The attention module derives attention maps along two independent dimensions, space and channel, from an intermediate feature map, and then multiplies these attention maps with the input feature map to perform adaptive feature refinement. This lightweight attention module has been widely used in image processing tasks because of its efficiency and convenience [32–34], which inspired us to introduce it into 3D model recognition to obtain more refined features. PVFAN uses a group-view convolutional neural network (GVCNN) and a CNN to learn multi-view features from images projected at different angles and point cloud features, respectively. These features from different modalities are then fused using a simple and effective replication cascade approach. Moreover, a dual-attention convolutional network module is constructed for feature refinement. We embed the Strip-pooling layer [33] into the channel self-attention mechanism, which captures more detailed view feature information and enhances the generalization ability of the model. We then build a spatial attention module to capture regional features and enhance the robustness of the model. Finally, we use a fully connected layer to generate category labels for the input data. To further assess PVFAN, we performed shape classification experiments on the widely used 3D shape dataset ModelNet [35] and compared the results with state-of-the-art techniques.
The experimental results indicate that the proposed PVFAN approach demonstrates remarkable performance for 3D shape recognition, thus confirming the effectiveness of PVFAN.
The main contributions are summarized as follows:
1) We combine multi-modal features, i.e., the point cloud features and multi-view features, to obtain accurately discriminated fusion features.
2) A dual-attention convolution network module is developed to deal with fusion features, which can effectively improve the generalization ability and robustness of the recognition model.
3) We embed a Strip-pooling layer in the channel attention module for feature refinement, which improves the fusion features and makes them more compact.
4) Our network attains outstanding performance for 3D shape recognition on the widely used public ModelNet datasets.
The remainder of this article is organized as follows. Section 2 provides a discussion of related works. Section 3 presents the PVFAN framework in detail. Section 4 offers a detailed account of experiments and discussions, while Section 5 concludes this article.
Related work
In this section, we will briefly review three categories of 3D shape recognition methods: multi-view-based methods, point cloud-based methods, and volumetric-based methods.
Multi-view-based methods
The multi-view method represents a 3D shape by multiple projections of the object from different viewpoints, and performs convolution-based feature extraction and fusion on the rendered images to classify the shape. For instance, Su et al. [36] extracted individual features for each view using a convolutional neural network and constructed a multi-view convolutional neural network (MVCNN) that fused the views by a pooling operation. MVCNN has demonstrated excellent performance on ModelNet40. To address the challenge of 3D shape identification from a small number of viewpoint images, Yu et al. [37] developed a multi-view neural network, Latent-MVCNN, to recognize shapes based on multiple predefined or random viewpoints. This method exhibits promising performance when only a few view images are available. Additionally, Feng et al. [38] constructed a group-view CNN framework for hierarchical correlation modeling that focuses on discriminative 3D shape descriptions. This method effectively utilized the inherent hierarchical connection and discriminability among different views, achieving satisfactory results in 3D shape recognition tasks. Xu et al. [39] used an encoder-decoder to aggregate multi-view features and retrieve images through 3D attribute prediction, which gave strong recognition ability for occluded views. Xu et al. [40] used shared weights for multi-view feature extraction and introduced a bidirectional LSTM network for feature integration. Similarly, Yang et al. [41] used an extreme learning machine-based autoencoder for multi-view feature integration. Subsequently, Ding et al. [42] reduced the feature dimension by selecting the most representative view and extracted the depth features of that single view for image recognition. This method has high efficiency but average accuracy.
To improve training efficiency, Wang et al. [43] improved MVCNN by embedding a view clustering module and a pooling process based on dominant sets, which pooled information from similar views and enhanced the performance of 3D object identification. In addition, Han et al. [44] established 3D to sequential views (3D2SeqViews) to further enhance the discriminability of learned features. This method uses a CNN and a hierarchical attention aggregation module to aggregate sequential views more efficiently, integrating abundant spatial and content information between the views. To preserve the information of multiple views during feature fusion, Liang et al. [45] proposed a multi-image hierarchical fusion method for 3D pattern recognition, which combines hierarchical features from multiple images into a concise descriptor. These works show that deep learning-based multi-view methods have achieved significant performance advances in 3D shape identification tasks.
Point cloud-based methods
Point cloud methods are designed to directly classify point cloud data captured by external devices such as RGB-D cameras, 3D scanners, and LiDAR. Qi et al. [46] pioneered this field with a novel neural network that directly consumes point clouds, called PointNet, which respects the permutation invariance of the input points. Although simple, PointNet is highly efficient. To better capture the local structure induced by irregular points in the metric space, PointNet++ [47] adaptively combined features from multiple scales, which enhanced the model’s ability to recognize detailed patterns and generalize to complex scenes. Additionally, Ma et al. [48] constructed a deep neural network framework to perform semantic segmentation of raw point clouds. This method introduced a multi-scale feature learning module to obtain informative contextual features in 3D point clouds, and then integrated the local and global features to enhance model performance. To avoid unnecessary scaling behavior, Klokov et al. [49] established a deep learning architecture capable of effectively dealing with unstructured point clouds and capturing features via point subdivision on K-dimensional trees. Hu et al. [50] developed a lightweight and efficient neural architecture that introduces a local feature integration module to capture key features. This method effectively preserves geometric details by gradually increasing the receptive field of each 3D point. Furthermore, to improve the representation capability of point cloud data, a method called dynamic graph CNN (DGCNN), based on the topology restoration model, was designed by Wang et al. [51]. This technique employs the neural network module edge convolution to construct dynamically calculated graphs in each layer of the network, effectively capturing local neighborhood information. In addition, Xu et al. [52] proposed geometric shared networks to better capture the geometric information of point clouds.
This method utilizes global context to learn point descriptors and enhance robustness to geometric transformations. With this design, global and local geometric features can be effectively captured and point features can be aggregated more comprehensively. However, on account of the irregular and unstructured nature of point cloud data, traditional 2D CNN is unsuitable for use, which somewhat limits the model’s performance.
Volumetric-based methods
Volumetric modelling techniques first convert point clouds into voxels and then use 3D CNNs to extract features for recognition tasks. For example, the 3D ShapeNets model proposed by Wu et al. [35] used a deep convolutional network to represent 3D geometric shapes as probability distributions of binary variables on a voxel grid. Maturana et al. [53] constructed a network architecture for point cloud data processing that integrates volumetric occupancy grid representations with supervised 3D CNNs. Other works [54, 55] also use 3D CNNs to handle voxelized data. However, because of the complexity of 3D convolution, these methods have drawbacks such as low efficiency, low classification accuracy, and ordinary performance. To address such concerns, Riegler et al. [56] established a deep learning representation model based on sparse 3D data that enables memory allocation and computation to focus on dense relevant areas, allowing deeper networks without compromising resolution. In addition, Le et al. [57] proposed a deep learning architecture that feeds the point cloud data into a 3D grid via a simple and effective sampling strategy, and then directly captures global features from the original coordinates. Overall, these methods effectively tackle the unstructured 3D point cloud problem; however, balancing model accuracy and computational cost remains a challenge for volumetric-based approaches.
Methodology
Framework of the PVFAN
In this section, we describe the network structure of our 3D shape recognition model in detail. The overall framework of our Point-View Fusion Attention Network (PVFAN) is depicted in Fig. 1. The model takes multi-view images and point cloud data as input, and the PVFAN, denoted as κ, outputs a category prediction vector over N_classes classes. The PVFAN is composed of four sequential modules, which are summarized as follows:
•(κ1) Point-view feature extraction
•(κ2) Point-view feature fusion
•(κ3) Refinement feature extraction
•(κ4) Feature aggregation and classification

Overall framework of PVFAN.
The details of the individual modules can be found in the following subsections.
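As a compact summary of how the four modules chain together, the mappings given in the following subsections can be collected into one composition. This is a notational sketch only: the symbol $\hat{y}$ for the prediction vector is introduced here for illustration and does not appear in the original text.

```latex
\begin{aligned}
\kappa   &= \kappa_4 \circ \kappa_3 \circ \kappa_2 \circ \kappa_1, \\
\kappa_1 &: \{X^{(pc)}, X^{(mv)}\} \rightarrow \{F^{(pc)}, F^{(mv)}\}, \\
\kappa_2 &: \{F^{(pc)}, F^{(mv)}\} \rightarrow \vartheta, \\
\kappa_3 &: \vartheta \rightarrow R_f, \\
\kappa_4 &: R_f \rightarrow \hat{y} \in \mathbb{R}^{N_{classes}}.
\end{aligned}
```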
1) Input data: The proposed PVFAN framework takes as input multi-view images of 3D objects and their corresponding point cloud data. To generate the multi-view images, the mesh or point cloud is rendered by a virtual camera using perspective projection, with optional changes to the viewpoints and the number of camera positions. The input view size is set to 224×224 pixels. The size of the point cloud input is determined by the number of selected points; in this study, we use a total of 2,048 points for analysis.
2) Point cloud data processing: In this branch, the input data is a point cloud. We denote the D-dimensional input coordinates of n points as $X^{(pc)} \in \mathbb{R}^{n \times D}$, and extract its features as

$$F^{(pc)} = \mathrm{Conv1}\left(X^{(pc)}\right)$$

where $F^{(pc)}$ represents the feature vectors extracted from the input data $X^{(pc)}$, and Conv1 refers to the convolutional neural network employed for deep point cloud feature extraction.
3) Multi-view data processing: In the multi-view feature extraction branch, once the rendered view representation is obtained using a set of predefined camera arrays, the multi-view data is fed into the GVCNN model [38] for multi-view feature extraction. To preserve the per-view descriptors, the view-pooling and fully connected layers are discarded. Similarly, we have

$$F^{(mv)} = \mathrm{Conv2}\left(X^{(mv)}\right)$$

where $F^{(mv)}$ represents the feature vectors extracted from the multi-view input $X^{(mv)}$, and Conv2 represents the deep neural network model employed for multi-view feature extraction.
After the feature extraction operation, we obtain the point-view features, and we denote this module as $\kappa_1: \{X^{(pc)}, X^{(mv)}\} \rightarrow \{F^{(pc)}, F^{(mv)}\}$. Next, we combine the multi-view features and the point cloud features. Specifically, we note that the point cloud features of the same object remain constant across different views. Therefore, we perform multi-feature fusion using the replication cascade method. The process is as follows:

$$\vartheta = \mathrm{Concat}\left(g_r\left(F^{(pc)}\right),\, F^{(mv)}\right)$$

where $g_r(\cdot)$ refers to the feature replication process, whose number of repetitions equals the number of input views, and $\mathrm{Concat}(\cdot)$ denotes the cascade (concatenation) of the replicated point cloud features with the multi-view features. Then, the point-view feature fusion module can be denoted as $\kappa_2: \{F^{(pc)}, F^{(mv)}\} \rightarrow \vartheta$, where $\vartheta$ refers to the output fusion feature.
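The replication cascade described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy feature sizes, and the concatenation order (point cloud feature first, view feature second) are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def replicate(point_feature: Vector, num_views: int) -> List[Vector]:
    """Sketch of g_r: copy the single point cloud feature once per input view."""
    return [list(point_feature) for _ in range(num_views)]


def fuse_point_view(point_feature: Vector, view_features: List[Vector]) -> List[Vector]:
    """Cascade (concatenate) the replicated point cloud feature with each view feature."""
    copies = replicate(point_feature, len(view_features))
    # Assumed order: point cloud feature first, then the view feature.
    return [pc + mv for pc, mv in zip(copies, view_features)]


# Toy data: one 2-dimensional point cloud feature and 12 view features of dimension 3.
f_pc = [0.5, 1.0]
f_mv = [[float(v), float(v) + 0.1, float(v) + 0.2] for v in range(12)]
fused = fuse_point_view(f_pc, f_mv)
```

Each of the 12 fused vectors shares the same point cloud part and differs only in its view part, which reflects the observation that the point cloud feature of an object is constant across views.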
After acquiring the fusion features, we proceed with feature refinement. From the 12 views $V = \{v_1, v_2, \ldots, v_{12}\}$ generated by rendering the 3D model, we obtain the corresponding fusion features $\xi = \{\vartheta_1, \vartheta_2, \ldots, \vartheta_{12}\}$ through the $\kappa_2$ process. To meet the input requirements of the subsequent attention convolution, we pass the fusion features through adaptive pooling for feature conversion. Adaptive pooling is used because it generates output of a given size from the input without altering the number of features. The process is as follows:

$$H = \mathrm{AdaP}(\xi)$$

where $H = \{h_1, h_2, \ldots, h_{12}\}$, and $\mathrm{AdaP}(\cdot)$ refers to the adaptive pooling process. However, the features H obtained through adaptive pooling contain some redundant information that is unnecessary for the final compact global descriptor. To refine the feature extraction process, we developed a dual-attention convolution network module to process the fusion features. This module consists of two parts, i.e., a channel attention module and a spatial attention module. In the channel attention module, we embed the Strip-pooling attention convolution process based on the attention mechanism [27].
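To make the role of adaptive pooling concrete, the sketch below pools a one-dimensional feature to a requested output length using a common windowing rule. The text does not state whether average or max pooling is used, nor the target size, so the averaging and the toy sizes here are assumptions.

```python
from typing import List


def adaptive_avg_pool_1d(values: List[float], output_size: int) -> List[float]:
    """Average-pool `values` so that the result has exactly `output_size` entries."""
    n = len(values)
    pooled = []
    for i in range(output_size):
        start = (i * n) // output_size                          # floor(i * n / out)
        end = ((i + 1) * n + output_size - 1) // output_size    # ceil((i + 1) * n / out)
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


even_case = adaptive_avg_pool_1d([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)
uneven_case = adaptive_avg_pool_1d([1.0, 2.0, 3.0, 4.0, 5.0], 2)
```

The output length is fixed by `output_size` regardless of the input length, which is the property that lets features of different sizes be passed to a fixed-size attention module.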
This attention mechanism can effectively process feature information from the entire input view, resulting in powerful global awareness and robust global feature extraction. However, it is less effective in extracting refined local features due to its lack of inductive bias properties and poor generalization. To overcome these limitations, we devised the Strip-pooling attention convolution network to achieve fine-grained extraction of fusion features, as shown in Fig. 2.

The dual-attention convolution network. The ⊗ denotes matrix multiplication.
The Strip-pooling technique aggregates both global and local information, which distinguishes it from traditional spatial pooling methods that only gather contextual information from fixed square regions. This approach effectively compensates for the limitations of max and average pooling. Therefore, we propose a Strip-pooling self-attention convolution method that maximizes the end-to-end capture capacity and maintains detailed information about the fusion features. In this method, the Strip-pooling output computed from H serves as the Q value of the attention mechanism, which refines the subsampling of the fusion features while retaining more detailed feature information. This effectively overcomes the limitation of the attention mechanism and improves the generalization ability of the model.
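The core idea of strip pooling can be sketched on a single 2D feature map: average along full-width horizontal strips and full-height vertical strips, then broadcast and add the two strip descriptors. This is a simplified illustration that omits the convolutions and gating of the full Strip-pooling module [33]; the function name and toy values are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def strip_pool(feature_map: Matrix) -> Matrix:
    """Combine horizontal and vertical strip averages of one 2D feature map."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    # Horizontal strips: average each row over its full width.
    row_means = [sum(row) / cols for row in feature_map]
    # Vertical strips: average each column over its full height.
    col_means = [sum(feature_map[r][c] for r in range(rows)) / rows for c in range(cols)]
    # Broadcast both strip descriptors back to the input size and add them.
    return [[row_means[r] + col_means[c] for c in range(cols)] for r in range(rows)]


toy_map = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
pooled = strip_pool(toy_map)
```

Because every output position mixes one long horizontal strip and one long vertical strip, each value sees context well beyond a fixed square window, which is the contrast with max and average pooling drawn above.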
As displayed in Fig. 2, we input the feature H into Strip-pooling [33] to generate the feature map Q. We then use two 1×1 convolutions to obtain the feature maps S and T, as follows:
Then, we can generate the attention map U by scaling and normalization using the SoftMax function, as follows:
where Tr(•) is the transpose operation. The output of the attention layer is then multiplied by a scaling parameter and added back to the input feature map. The result is as follows:
where γ is a learnable scalar, and Y refers to the refined features after channel attention module.
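For orientation, one plausible form of the channel attention computation is sketched below, following the self-attention formulation of [27]. Only the final residual relation (scaling by γ and adding the input H) is stated directly in the text. The operator names $\mathrm{SP}$, $W_S$, $W_T$, the intermediate $O$, the operand order inside the SoftMax, and the omission of an explicit scaling factor are all assumptions made for illustration.

```latex
\begin{aligned}
Q &= \mathrm{SP}(H), \qquad S = W_S * H, \qquad T = W_T * H, \\
U &= \mathrm{SoftMax}\left(\mathrm{Tr}(Q)\, S\right), \\
O &= T\, U, \\
Y &= \gamma\, O + H .
\end{aligned}
```

Here $\mathrm{SP}(\cdot)$ stands for the Strip-pooling operation and $W_S$, $W_T$ for the two 1×1 convolutions. With $\gamma$ close to zero the module passes H through almost unchanged, so the attention branch can be weighted in gradually as $\gamma$ is learned.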
Furthermore, inspired by the convolutional block attention module for image classification [28], we construct a spatial attention module to improve the robustness of the model. The spatial attention module supplements the channel attention module. The feature Y is fed into a max-pooling layer and an average-pooling layer, as shown below.
Then, we use a 1×1 convolution to capture the final refined features, as follows:
where Sigmoid(•) refers to the element-wise sigmoid function. Therefore, the refinement feature extraction module can be defined as $\kappa_3: \vartheta \rightarrow R_f$.
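A minimal sketch of a spatial attention step in the style of the convolutional block attention module [28] is given below. It assumes that max and average pooling are taken across channels at each spatial location, that the 1×1 convolution reduces to one weight per pooled descriptor plus a bias, and that the resulting gate rescales the input feature. The weights are hand-set placeholders rather than learned values.

```python
import math
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [channel][row][col]


def spatial_attention(features: Tensor3D, w_max: float, w_avg: float, bias: float) -> Tensor3D:
    """Gate every spatial location by a sigmoid of its pooled channel descriptors."""
    channels = len(features)
    rows = len(features[0])
    cols = len(features[0][0])
    refined = [[[0.0] * cols for _ in range(rows)] for _ in range(channels)]
    for r in range(rows):
        for c in range(cols):
            values = [features[ch][r][c] for ch in range(channels)]
            max_desc = max(values)              # max pooling across channels
            avg_desc = sum(values) / channels   # average pooling across channels
            # A 1x1 convolution over the two pooled maps is a weighted sum per location.
            logit = w_max * max_desc + w_avg * avg_desc + bias
            gate = 1.0 / (1.0 + math.exp(-logit))
            for ch in range(channels):
                refined[ch][r][c] = gate * features[ch][r][c]
    return refined


toy_features = [[[2.0, 4.0]], [[6.0, 8.0]]]   # 2 channels, 1 row, 2 columns
neutral = spatial_attention(toy_features, w_max=0.0, w_avg=0.0, bias=0.0)
refined = spatial_attention(toy_features, w_max=0.5, w_avg=0.5, bias=0.0)
```

With all weights at zero the gate is exactly 0.5 everywhere, so `neutral` is half the input. Non-zero weights let locations with stronger pooled responses keep more of their signal.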
Once the refined features are extracted, we execute the feature aggregation and classification module, denoted as $\kappa_4$, as shown on the right side of Fig. 1. This module contains a max-pooling layer and a fully connected layer. For the refined features $R_f$ obtained by the $\kappa_3$ process, a max-pooling operation integrates the captured features into a concise global descriptor. The global descriptor is then fed into the fully connected layer to achieve 3D shape recognition.
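The aggregation and classification step can be sketched as an element-wise maximum over the per-view refined features followed by a single linear layer. The feature sizes, weights, and function names below are toy placeholders chosen for illustration, not values from the paper.

```python
from typing import List

Vector = List[float]


def max_pool_views(view_features: List[Vector]) -> Vector:
    """Element-wise maximum across views, giving one global descriptor."""
    return [max(column) for column in zip(*view_features)]


def fully_connected(x: Vector, weights: List[Vector], biases: Vector) -> Vector:
    """One linear layer: a score per class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def predict(view_features: List[Vector], weights: List[Vector], biases: Vector) -> int:
    """Return the index of the highest-scoring class."""
    scores = fully_connected(max_pool_views(view_features), weights, biases)
    return max(range(len(scores)), key=lambda i: scores[i])


toy_views = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]   # 3 views, 2 features each
toy_weights = [[1.0, 0.0], [0.0, 1.0]]             # 2 classes
toy_biases = [0.0, 0.0]
descriptor = max_pool_views(toy_views)
label = predict(toy_views, toy_weights, toy_biases)
```

Max pooling makes the descriptor independent of the order in which the views are presented, since only the strongest response per feature dimension is kept.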
Experiments and discussions
Datasets
In this section, we present the performance of PVFAN and compare it to state-of-the-art 3D shape recognition methods on the Princeton ModelNet dataset [35]. The dataset comprises 127,912 3D CAD models from 662 categories, where ModelNet40 is a more widely used subset containing 12,311 3D CAD models across 40 popular categories, and ModelNet10 contains 4,899 3D CAD models in 10 categories. In our experiments, we evaluate the proposed algorithm and other state-of-the-art methods on both ModelNet10 and ModelNet40, both of which are publicly accessible on the Princeton ModelNet website. Furthermore, we employ the same dataset split configuration provided by Ref. [35], which consists of 9,843 3D shapes in the training data and 2,468 3D shapes in the testing data of ModelNet40, and 3,991 in the training data and 908 in the testing data of ModelNet10. In addition, the point cloud data are sampled from the surface of each CAD model, as shown in Ref. [46], while multi-view data are captured by camera, following Ref. [38].
Implementation details
In this paper, we conducted our experiments using the PyTorch platform. Specifically, we employed pre-trained GVCNN and DGCNN models in our PVFAN to capture view and point cloud features, respectively. It is worth noting that any view-based or point cloud-based model could be used to extract the global view features and point cloud features. PVFAN was trained end-to-end. Our experiments were run on a Hewlett-Packard workstation equipped with an Intel Core i7-9700 CPU, 32 GB RAM, and an NVIDIA GeForce RTX 2080 GPU. Training was accelerated with CUDA on the GPU, using an initial learning rate of 0.0001 and a batch size of 32. We chose the best-performing result for comparison with current state-of-the-art methods.
Experiments on 3D shape retrieval
In this section, we conducted 3D shape retrieval experiments on the ModelNet10 and ModelNet40 datasets to evaluate the model performance. For each class, we used 120 shapes, of which 100 were used for training and the remaining 20 for testing. We compare our method with popular 3D shape retrieval algorithms, i.e., SPH [58], LFD [59], PANORAMA [60], 3D ShapeNets [35], MVCNN [36], RVCNN [31], and GIFT [61], as shown in Table 1. Table 1 shows that the proposed PVFAN achieves excellent performance on both the ModelNet10 and ModelNet40 datasets. On ModelNet10, the mean average precision (mAP) of PVFAN is 42.29% and 1.19% higher than that of LFD and GIFT, respectively. On ModelNet40, the mAP of PVFAN is 46.25% and 5.22% higher than that of LFD and GIFT, respectively.
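For readers unfamiliar with the metric, the sketch below computes mAP in its common form from ranked relevance lists. The paper does not spell out its exact evaluation protocol, so normalizing by the number of relevant items found in each ranked list is an assumption, and the rankings are toy data rather than experimental results.

```python
from typing import List


def average_precision(relevance: List[int]) -> float:
    """AP for one query: mean of precision@k taken at each relevant rank k."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[List[int]]) -> float:
    """mAP: the average of the per-query AP values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Toy rankings: 1 marks a retrieved shape of the query's class, 0 any other class.
toy_rankings = [[1, 0, 1, 0], [0, 1]]
toy_score = mean_average_precision(toy_rankings)
```

In this toy case the first query has AP (1/1 + 2/3)/2 and the second has AP 1/2, so the mAP is their mean, 2/3.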
Experimental results for 3D shape retrieval
To provide a more comprehensive assessment of the model’s performance, we present the precision-recall curves for the proposed method and other 3D shape retrieval methods on ModelNet40, as depicted in Fig. 3. As evident from the graph, the proposed method demonstrates higher precision overall compared to the other comparison methods. Notably, PVFAN consistently outperforms the others for nearly every fixed recall value, showcasing the exceptional precision and robustness of our approach.

Precision-recall curves for proposed method and compared methods of 3D shape retrieval on ModelNet40.
Comparison with state-of-the-art methods
In this section, various advanced methods based on different modalities are compared with PVFAN, including volumetric-based models (i.e., 3D ShapeNets [35], VoxNet [53], VRN [62], and MVCNN-MultiRes [63]), multi-view-based methods (i.e., GIFT [61], MVCNN [36], GVCNN [38], PGNet [19], LIMAN [29], DAN [30], and 3D2SeqViews [44]), point cloud-based methods (i.e., G3DNet [64], PointNet [46], PointNet++ [47], and DGCNN [51]), and multimodal methods (i.e., FusionNet [65], PVNet [23], and PVRNet [24]). The comparison results are displayed in Table 2. We use the overall classification accuracy of the 3D shapes as the evaluation metric.
Recognition Performance comparison on ModelNet datasets. The methods are grouped according to the type of their input
The experimental results in Table 2 show that PVFAN performs well in terms of classification accuracy. PVFAN achieves 95.2% accuracy on the ModelNet40 dataset and 96.4% on the ModelNet10 dataset. Note that some papers, such as those on MVCNN and PointNet, do not report results on the ModelNet10 dataset. Compared with the classic MVCNN using GoogLeNet, PVFAN improves the overall accuracy on ModelNet40 by 3.0%. Moreover, PVFAN surpasses other advanced techniques that use different data representations, such as point cloud-based and multimodal approaches. In addition, we make the following observations:
•PVFAN outperforms traditional volumetric-based methods such as VoxNet and 3D ShapeNets in 3D shape identification capability. The main reason is that we use CNNs to extract point cloud and view features separately. In essence, deep learning models outperform traditional methods in feature learning when trained on large-scale data.
•Among the view-based approaches, PGNet and LIMAN show the best performance, mainly because they use more input views, which contain more comprehensive target information. However, PVFAN stands out among the methods based on 12 views, achieving the best classification results on both datasets. Compared with conventional view-based techniques such as 3D2SeqViews and GVCNN, our approach emphasizes cross-modal interaction and integrates point cloud information. We consider the distinctive traits of the different modal features when combining multi-view and point cloud features, allowing PVFAN to integrate both local geometric structure and global information, which leads to improved robustness and accuracy.
•In terms of point cloud-based methods, PVFAN outperforms other advanced techniques, including PointNet++ and DGCNN, in recognition accuracy. Our approach deeply delves into multimodal information and incorporates attention mechanisms to enhance model performance. The feature refinement process eliminates redundant information and prevents the loss of useful information.
•Lastly, multimodal-based methods excel over single modality methods by combining the advantages of multiple recognition models. Our method attains the highest classification accuracy on both datasets among all multimodal-based approaches. The primary reason behind this is that our dual attention network effectively mines the information in fusion features, preserving essential local information for accurate shape recognition, thereby reducing the model’s dependence on viewpoints.
In this section, we carried out ablation experiments to evaluate the impact of different modules on PVFAN. The following are the ablation studies we conducted:
•Case 1: We only leveraged multi-view features for 3D shape identification, and implemented the other modules except for the feature fusion module.
•Case 2: We only used point cloud features for 3D shape recognition and did not implement the feature fusion module.
•Case 3: We excluded the spatial attention module to validate the performance. We integrated the point cloud features and multi-view features and fed them to the channel attention module for further processing.
•Case 4: We excluded the proposed channel attention module. We directly fed the fusion feature to the spatial attention module for further processing.
•Case 5: We directly classified the integrated features without implementing the feature refinement process, to validate the performance of the dual-attention network.
The results of our ablation experiments are displayed in Table 3.
Comparison of the overall accuracy from the ablation study on ModelNet
From the experimental results, we can see that, compared to Case 5, Case 3 and Case 4 improve the overall accuracy by 4.1% and 3.7% on ModelNet40, and by 3.3% and 1.4% on ModelNet10, respectively. These results indicate that each module in our method is effective. Moreover, when the feature fusion and feature refinement modules are integrated simultaneously, the overall accuracy improves significantly, supporting the rationality and effectiveness of our proposed method. Additionally, we noticed that the cases using multiple features exhibited significantly higher accuracy than Case 1 and Case 2. This demonstrates that 3D shape recognition can benefit considerably from multimodal information.
We conducted an analysis of the effect of multiple input variables on the performance of our 3D shape recognition network, PVFAN. Our analysis included the following aspects:
1) Number of Views: We considered four camera viewpoint setups for shape rendering, with 4, 8, 10, and 12 viewpoints. The virtual cameras are elevated 30 degrees above the ground plane and point towards the model’s centroid. The interval angle between adjacent virtual cameras is set to 90°, 45°, 36°, and 30°, generating 4, 8, 10, and 12 views for each 3D shape, respectively. We used these four viewpoint setups to test our network’s classification performance on the ModelNet40 dataset, as displayed in Table 4. Note that we kept the number of points constant while changing the number of views for a fair comparison. To evaluate the performance rigorously, we conducted 50 repeated experiments for each number of views and report the average classification accuracy as the final result. Additionally, we compared the effect of the number of views on classification accuracy by testing each number of views from 1 to 12 on the ModelNet dataset, and the results are displayed in Fig. 4 for better clarity.
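The camera setups described above can be illustrated with a short sketch. It assumes a z-up coordinate system, a model centroid at the origin, and a unit camera distance; the function names `interval_angle` and `camera_positions` and the `radius` parameter are hypothetical and are not taken from our implementation.

```python
import math

def interval_angle(num_views):
    """Azimuth step in degrees between adjacent cameras on a full circle."""
    return 360.0 / num_views

def camera_positions(num_views, elevation_deg=30.0, radius=1.0):
    """Return (x, y, z) positions of cameras evenly spaced in azimuth.

    Assumptions (not from the paper): z is the up axis, the model
    centroid sits at the origin, and every camera looks at the origin.
    """
    elev = math.radians(elevation_deg)
    step = interval_angle(num_views)
    positions = []
    for i in range(num_views):
        azim = math.radians(i * step)
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)  # constant height for a fixed elevation
        positions.append((x, y, z))
    return positions

# The four setups used in the experiments: 4, 8, 10, and 12 views.
for n in (4, 8, 10, 12):
    cams = camera_positions(n)
    print(f"{n} views, interval angle {interval_angle(n):.0f} degrees, "
          f"{len(cams)} cameras")
```

Because every camera shares the same elevation, only the azimuth changes between views, which is why the interval angle alone determines the number of rendered views.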

Classification accuracy achieved by PVFAN on the ModelNet dataset with respect to different numbers of views.
Performance on ModelNet40 with different input view numbers
Based on the data presented in Table 4 and Fig. 4, the classification accuracy of PVFAN increases with the number of views. Notably, PVFAN performs well even with a small number of views. For instance, with a single view, PVFAN achieved recognition accuracies of 82.4% and 88.7% on the ModelNet40 and ModelNet10 datasets, respectively. With four views, our network achieved a recognition accuracy of approximately 93.3% on the ModelNet10 dataset. However, once the number of views reaches eight, the recognition accuracy on the ModelNet dataset remains relatively constant. This is because eight views supply sufficient information to capture the entire structure of the 3D object.
2) Number of Points: As depicted in Fig. 5, the number of points in a point cloud is a critical factor in determining the amount of effective information and structural detail it carries. Therefore, we conducted experiments to observe how altering the number of points influences the model’s discrimination ability. Keeping the number of views constant, we set the number of points to 128, 256, 512, 768, 1024, 1280, 1536, 1792, and 2048. Table 5 presents the experimental results obtained with PVFAN.

Examples of point clouds with different numbers of points.
Performance on ModelNet40 with different input point numbers
From the results presented in Table 5, we can observe that increasing the number of points improves the recognition accuracy of PVFAN. When the number of points is 128, the accuracy is a mere 61.1%. However, when the number of points increases to 512, the recognition rate of the proposed model jumps to 91.2%. This improvement occurs because point clouds with more points contain substantial local information that increases the discriminability of the feature descriptors.
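As a minimal sketch of how inputs with different point counts could be prepared, the snippet below subsamples one point cloud to each size in the list above. The text does not state which sampling strategy was used, so uniform random sampling without replacement is an assumption here; `make_synthetic_cloud` and `subsample_points` are hypothetical helper names, and the synthetic cloud merely stands in for a real ModelNet shape.

```python
import random

POINT_COUNTS = [128, 256, 512, 768, 1024, 1280, 1536, 1792, 2048]

def make_synthetic_cloud(num_points, seed=0):
    """Build a placeholder cloud of random (x, y, z) tuples in the unit cube."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.random()) for _ in range(num_points)]

def subsample_points(points, num_points, seed=0):
    """Uniformly sample `num_points` points without replacement.

    Assumption: random sampling is used only for illustration; other
    strategies (for example farthest point sampling) are equally possible.
    """
    if num_points > len(points):
        raise ValueError("cannot sample more points than the cloud contains")
    rng = random.Random(seed)
    return rng.sample(points, num_points)

cloud = make_synthetic_cloud(2048)
for n in POINT_COUNTS:
    subset = subsample_points(cloud, n)
    print(f"requested {n} points, got {len(subset)}")
```

Fixing the seed keeps the subsets reproducible, so that differences in accuracy across point counts are not confounded by a different random draw in each run.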
The confusion matrix visualization demonstrates the performance of PVFAN in 3D shape recognition. Notably, PVFAN achieves excellent recognition accuracy even when some view features exhibit high similarity. The confusion matrices for our final results on Princeton ModelNet40 and ModelNet10 are presented in Fig. 6 and Fig. 7, respectively. The diagonal elements of the matrices indicate classification accuracy, while the off-diagonal elements correspond to misclassification proportions. As illustrated in Fig. 6 and Fig. 7, most 3D shapes are recognized correctly, except for a few categories with a similar appearance, such as table and desk. Specific misclassification instances are shown in Fig. 8. The dominant reason for these errors is the high appearance similarity between these categories, which makes the feature information of the 3D models hard to distinguish and results in erroneous classification.
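The reading of the matrices described above can be made concrete with a small sketch. It assumes each row is normalized by the number of samples of the true class, so that the diagonal holds per-class accuracy and the off-diagonal entries hold misclassification proportions; the two-class labels and predictions are invented toy data, not results from our experiments.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count how often true class t (row) is predicted as class p (column)."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts

def row_normalize(counts):
    """Divide each row by its sum so that every row holds proportions."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Toy example with two visually similar classes: 0 = table, 1 = desk.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

matrix = row_normalize(confusion_counts(y_true, y_pred, num_classes=2))
for name, row in zip(("table", "desk"), matrix):
    print(name, row)
# Diagonal entries (0.75 each) are per-class accuracies; the off-diagonal
# entries (0.25 each) are the proportions confused with the other class.
```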

Confusion matrix visualization of PVFAN on ModelNet40.

Confusion matrix visualization of PVFAN on ModelNet10.

Some misclassification instances of PVFAN.
3D shape recognition is a crucial visual task. In this paper, we propose a novel method for this task that combines multi-view features and point cloud features. The method, named PVFAN, extracts multi-view and point cloud features and then performs multimodal feature fusion. We then construct a dual-attention convolutional network consisting of a channel attention module and a spatial attention module. Specifically, we employ a strip-pooling-based fusion attention network to refine the fusion features and finally perform feature aggregation and classification to obtain the final category labels. Comprehensive experiments conducted on the ModelNet10 and ModelNet40 datasets demonstrate the efficacy of our approach. Compared to existing state-of-the-art methods, PVFAN obtains more compact, descriptive, and robust feature descriptors that exhibit superior performance. In the future, we intend to continue our research on 3D shape recognition and multimodal feature fusion.
