Abstract
The rack-and-pinion drive mechanism (RPD) is a critical transmission component in the battery swap system (BSS) of electric heavy trucks (EHTs). The RPD typically operates at low speed, with speed fluctuations and short sampling windows, all of which make accurate fault diagnosis difficult. To address these issues, this study proposes a fault diagnosis method based on the Heterospectral-Symmetric-Derived Point Cloud Feature Tree (HS-PCFT). First, multiple symmetrized dot pattern (SDP) variants are generated from hybrid-dimensional vibration signals to construct multi-perspective symmetric patterns. These images are then converted into coordinate points, and a structured point cloud feature tree is constructed to capture geometric differences across fault modes. Subsequently, the point cloud is voxelized and fed into a lightweight 3D convolutional neural network (3D CNN) for classification. The method was validated on the BSS platform, achieving an accuracy of 97.83%. Comparative studies indicate that it outperforms traditional 2D/3D networks and conventional machine learning methods. This research establishes a representation path from signals to structured geometry, providing an effective solution for fault diagnosis in low-speed and complex industrial environments.
Introduction
In recent years, electric heavy trucks (EHTs) have been widely used in high-frequency, high-intensity transportation scenarios, where their core advantages are green energy drive and efficient operation. Compared with the traditional charging mode, the battery swap system (BSS) serves as a quick replenishment solution that keeps vehicles in continuous, efficient operation. Within the BSS, the drive mechanism widely adopts a rack-and-pinion drive (RPD) to move and position the battery box precisely. The drive gear is the core of this transmission, and its operating status directly affects the efficiency of the swapping task and the stability of the system.
However, the RPD in the BSS operates for long periods under complex working conditions such as high-frequency reciprocating loads and electrical coupling disturbances, and is therefore prone to mechanical wear. The active drive gear in particular is prone to typical failures such as tooth-face abrasion, crack extension, and tooth breakage. These failures usually originate from micro-discharge induced by short-term potential fluctuations in the system, 1 as well as from fatigue accumulated under long-term loading, so that tooth damage gradually evolves into functional failure. 2 Statistical data show that gear damage has become an important cause of frequent BSS failures; without early and accurate identification, it seriously affects battery replacement efficiency and vehicle operation safety. Beyond reliability and safety, early identification of drivetrain degradation is also relevant to industrial noise and vibration management in swapping stations, as incipient defects may elevate vibration levels, promote structural transmission to surrounding assemblies, and potentially increase radiated noise and on-site vibration exposure during operation and maintenance.
Frequent gear failures not only affect the stability of system operation but also place higher demands on fault diagnosis. Under typical BSS working conditions, diagnosing the driving gear faces several challenges. Low rotational speed makes the fault characteristics weak and difficult to separate from noise. Control lag and load perturbation cause speed fluctuations, which lead to aliasing and drift of the characteristic frequencies. In addition, owing to the structure and actual working conditions of the battery swap device, the collected vibration signals are often short and suffer from local information loss, so the raw data do not fully reflect the fault characteristics. Moreover, for some traditional feature extraction methods and single-scale model training strategies, feature discrimination and generalization performance still need improvement under low-speed, weak-signal, or speed-perturbed conditions. For this reason, researchers have in recent years explored multiple paths around time-frequency modelling of vibration signals, map conversion and deep learning classification.
For the identification of low rotational speed signals, commonly used methods include acoustic emission techniques, 3 local mean decomposition (LMD) 4 and variational modal decomposition (VMD), 5 which are used to enhance the local separability of weak signals; however, these methods are sensitive to ambient noise and parameter configurations and have large computational overheads, which limits their real-time applicability. For discontinuous and short-time sampling, transfer learning, 6 sample expansion 7 and sliding window strategies 8 have been introduced to improve diagnostic performance under small-sample conditions, but they remain susceptible to boundary effects and speed perturbations, and carry a risk of pseudo-frequency interference. In terms of noise suppression, although decomposition and reconstruction methods (e.g., EEMD 9
For the problem of disordered spectral distribution under speed perturbations, order analysis 12 and time-frequency transformation 13 provide possible solutions; however, their effectiveness at low speeds is constrained by limited frequency resolution and strong dependence on accurate speed-synchronization signals, which compromises practicality. To address these limitations, optimization-driven adaptivity has been increasingly explored to reduce parameter sensitivity and improve robustness under noisy and speed-perturbed conditions. Specifically, adaptive feature mode decomposition can be achieved via metaheuristic optimization guided by a health indicator, 14 adaptive CNN performance can be improved through automatic hyperparameter tuning using amended gorilla troop optimization with quantum-gate mutation, 15 and early weak-fault signatures can be enhanced by flow-direction-optimized spectral-kurtosis-based filtering. 16 However, these studies mainly enhance adaptivity at the preprocessing or network-configuration stage and are still predominantly based on 1D/2D representations, leaving room for a more structured representation pathway to improve geometric separability under complex conditions. From this representation perspective, visual mapping methods such as Symmetrized Dot Pattern (SDP) have been introduced into fault mapping construction, which transforms one-dimensional vibration signals into two-dimensional structural images via polar-coordinate mapping and strengthens inter-class differences while preserving key temporal characteristics.
However, conventional SDP is still typically confined to single-channel, single-scale maps, which limits the diversity and robustness of feature representations across multiple operating conditions. For this reason, some studies have introduced modal decomposition (e.g., VMD 17 ) or statistical fusion mechanisms (e.g., Chebyshev distance combined with EMD 18 ) for multi-graph optimization, but overall stability remains insufficient because it depends on accurate parameter selection and consistent fusion strategies. In addition, although parallel SDP, 19 multi-channel fusion, 20,21 and holographic spectrum construction 22 enhance the expression of multi-source information, they are prone to local information loss or overfitting under high-dimensional structures.
Meanwhile, with the development of 3D structural modelling technology in the field of target recognition, 3D deep learning methods based on point cloud, view and voxel modelling have gradually been introduced into the intelligent diagnosis of mechanical structures. PointNet 23 directly processes unstructured 3D point sets with efficient geometric modelling capability; MVCNN 24 improves the model’s adaptability to attitude changes through multi-view image fusion; and VoxNet 25 uses a 3D voxel grid combined with 3D convolutions to model local features of spatial structures. Although these methods perform well in target classification, their feature characterization ability is still limited in mechanical fault scenarios with complex working conditions.
To address the weak signal features of the rack-and-pinion structure under low-speed working conditions, this paper integrates multiple SDP images into a point cloud structure and enhances feature expression through differences in petal density. To cope with the feature instability caused by rotational speed fluctuation, this paper designs multiple SDP image layers with different parameters to express different feature dimensions, improving the stability and robustness of the features. To handle short sampling and noise interference, difference map construction is combined with a 3D convolutional extraction strategy to extract structural features effectively and suppress interference. In summary, this paper proposes a fault identification method based on a point cloud feature tree and a 3D convolutional network, which shows good adaptability and classification performance in enhancing the saliency of weak features, preserving feature stability under speed perturbation, and suppressing noise interference.
The contributions and innovations of this paper are as follows: (1) Aiming at the problem of weak fault signals of the rack and pinion structure of the battery switching system under low-speed conditions, this paper fuses multiple SDP images to construct a three-dimensional point cloud structure, preserves the diversity of petal densities, and realizes the diversified expression and reinforcement of fault features, which improves the recognition ability of fault modes under low-speed conditions. (2) Aiming at the problem of feature stability degradation caused by rotational speed fluctuation in the battery exchange process, this paper constructs a three-dimensional point cloud structure for the joint expression of multi-scale SDP maps, which fuses the geometric distribution information at different scales and enhances the stability and robustness of the feature expression under the speed fluctuation. (3) For the weakening of effective features caused by complex noise interference and short sampling, this paper realizes the fusion expression of multi-map information on the unified point cloud structure through differentiated multiple representations and 3D convolution module, which improves the anti-interference ability of the model.
The remainder of this article is organized as follows. The Fault diagnosis scheme section describes the fault diagnosis method in detail, including the point cloud feature tree constructed from SDP variants, point cloud voxelization preprocessing, and the 3D CNN structure design. The Case study section uses vibration data from the driving gears of the rack-and-pinion transmission in the battery swap system under four modes, presents the training and diagnosis results, and verifies the effectiveness of the proposed method. The Comparative case studies section verifies the performance advantages of the proposed method under complex working conditions. The Conclusion is presented at the end.
Fault diagnosis scheme
In this study, a 3D point cloud feature tree fault diagnosis scheme based on a multi-source mapping structure is constructed for the typical operating states and fault modes of driving gears in the BSS, as shown in Figure 1. The overall process consists of three parts: signal atlas construction, spatial structure modelling and 3D feature recognition. Signal atlas construction uses a variety of improved SDP methods to encode the gear vibration signals as symmetric graphs, aiming to reinforce the differences in geometric distribution under different fault states. In the map construction, single-channel and multi-channel strategies are combined. For methods based on unidirectional response modelling, the vertical (V-direction) vibration signal is uniformly used as input to ensure consistency of the signal source and sensitivity to fault features. For SDP variants with multi-channel fusion mechanisms, such as Parallel-SDP, M-SDP and multi-source map fusion strategies, the vertical, horizontal and axial (V+H+A) three-channel signals are used jointly to enhance the completeness of spatial features and structural discrimination. This hybrid strategy reflects the intrinsic differences in how each method composes its map, and aims to extract discriminative features that combine structural significance with pattern diversity. Subsequently, all generated images are uniformly converted to polar coordinate point sets and stacked along the Z-axis in the order of methods to construct a hierarchical point cloud feature tree. On this basis, the point cloud is mapped into a 3D grid by voxelization, and a lightweight 3D convolutional neural network (3D CNN) then performs multi-layer convolution and feature extraction to achieve robust classification of the gear operating state.
This method effectively improves the diagnosis accuracy and stability under low-speed disturbance and complex background.
Figure 1. Flowchart of fault diagnosis method.
Point cloud feature tree construction based on SDP variants
In order to realize high-precision fault diagnosis of gears under low-speed and non-stationary interference conditions, this paper proposes a 3D point cloud feature tree method that integrates multiple Symmetrized Dot Pattern (SDP) variants. The method takes the original vibration signal as input, uses single-channel (vertical V-direction) or multi-channel (A/H/V) strategy to construct multiple SDP maps, obtains the structural coordinate points through polar coordinate mapping, and stacks them along the Z-axis according to a unified hierarchical strategy to form a 3D spatial point cloud
The input signal consists of a single-channel vibration signal
1. Original SDP graph (single channel). This approach provides the mapping basis for all SDP variants, generating rotationally symmetric 2D polar coordinate maps from single-channel signals using the following mapping equation 26:
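As a minimal sketch of this mapping, the following Python code implements the SDP transform as it is commonly formulated in the literature, which may differ in detail from the equation used here. The function name `sdp_points` and the parameter values (`lag`, `gain_deg`, `n_mirrors`) are illustrative assumptions, and the input is a synthetic signal rather than experimental data.

```python
import numpy as np


def sdp_points(x, lag=1, gain_deg=30.0, n_mirrors=6):
    """Map a 1D signal to SDP points, returned as Cartesian (x, y) pairs.

    Commonly used SDP formulation (an assumption here):
        r(i)     = (x[i] - x_min) / (x_max - x_min)
        theta(i) = theta_m + gain * (x[i + lag] - x_min) / (x_max - x_min)
        phi(i)   = theta_m - gain * (x[i + lag] - x_min) / (x_max - x_min)
    with mirror angles theta_m = 360 * m / n_mirrors degrees.
    """
    x = np.asarray(x, dtype=float)
    if lag < 1 or lag >= x.size:
        raise ValueError("lag must satisfy 1 <= lag < len(x)")
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("constant signal: normalisation is undefined")
    norm = (x - x.min()) / span
    radius = norm[:-lag]                        # r(i)
    swing = np.deg2rad(gain_deg) * norm[lag:]   # angular deflection from x[i + lag]
    arms = []
    for m in range(n_mirrors):
        theta_m = 2.0 * np.pi * m / n_mirrors
        # Each mirror plane carries a counter-clockwise and a clockwise arm.
        for angle in (theta_m + swing, theta_m - swing):
            arms.append(np.column_stack((radius * np.cos(angle),
                                         radius * np.sin(angle))))
    return np.vstack(arms)


# Synthetic two-tone test signal (illustrative only, not experimental data).
t = np.arange(2000) / 20000.0
signal = np.sin(2 * np.pi * 75.0 * t) + 0.1 * np.sin(2 * np.pi * 900.0 * t)
points = sdp_points(signal)
print(points.shape)
```

Returning Cartesian pairs rather than polar angles keeps the output directly usable as a 2D point set for later stacking.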
2. VMD-SDP graph (single channel). The method first performs a variational modal decomposition (VMD) on a single-channel signal
Applying this criterion to the dataset yields a concentrated distribution of selected modes (median
Each modal component is then treated separately using the original SDP mapping:
All IMF modal points are plotted together in a single SDP diagram, forming multiple “petals”.
3. VMD-MultiSDP graph (single channel). With the help of each modal point set
A separate SDP map is generated for each IMF modality, which corresponds to multiple images in total for multi-scale feature separation.
4. Parallel-SDP graph (multi-channel). Apply the base SDP mapping to each of the three channel signals
5. MVMD-SDP graph (multi-channel). The MVMD decomposition is performed for each channel
6. MCSDP graph (multi-channel). The three-channel SDP mapping point sets
Fusion of channel mapping by:
Finally, the fused non-zero pixel locations are mapped to a 2D point set:
7. Holographic SDP map (multi-channel). For each channel
Hierarchical determination of SDP variants
In order to avoid arbitrary stacking of different SDP variants along the Z-axis, a data-driven hierarchical determination strategy is adopted.
For the
The spatial expansion of this variant on that sample is quantified as:
To ensure robustness against sample-specific fluctuations and outliers, the final expansion score of the
All SDP variants are sorted in ascending order of
Integrate the 2D point sets generated by all variants and uniformly assign spatial hierarchies, and finally construct the 3D point cloud feature tree
This construction ensures that all variant mapping outputs are fused and form a continuous, unified spatial structure that provides a reliable basis for subsequent voxelized coding with 3D convolutional network inputs.
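The stacking step can be sketched as follows. The ascending sort follows the description above, but the spread measure in `expansion_score` (median over samples of the mean radial distance), the layer spacing value, and all function names are hypothetical stand-ins, since the scoring formula is not reproduced in this text.

```python
import numpy as np


def expansion_score(point_sets):
    """Hypothetical spread measure: median over samples of the mean radius."""
    per_sample = [np.mean(np.hypot(p[:, 0], p[:, 1])) for p in point_sets]
    return float(np.median(per_sample))


def build_feature_tree(variant_points, scores, layer_spacing=0.1):
    """Stack 2D point sets along Z in ascending order of their scores."""
    order = sorted(variant_points, key=lambda name: scores[name])
    layers = []
    for level, name in enumerate(order):
        pts = np.asarray(variant_points[name], dtype=float)
        z = np.full((len(pts), 1), level * layer_spacing)  # one Z level per variant
        layers.append(np.hstack((pts, z)))
    return order, np.vstack(layers)


# Synthetic point sets with different spreads (illustrative only).
rng = np.random.default_rng(0)
scales = {"SDP": 0.9, "VMD-SDP": 0.3, "Parallel-SDP": 0.6}
samples = {name: [s * rng.uniform(-1, 1, size=(200, 2)) for _ in range(5)]
           for name, s in scales.items()}
scores = {name: expansion_score(sets) for name, sets in samples.items()}
first_sample = {name: sets[0] for name, sets in samples.items()}
order, cloud = build_feature_tree(first_sample, scores)
print(order, cloud.shape)
```

Computing the order once from aggregate scores and reusing it for every sample keeps the layer assignment identical across the dataset.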
Voxel-based feature encoding and 3D CNN classification
In order to further extract the spatial structure information in the point cloud feature tree and realize the recognition of gear multi-category patterns, this paper adopts the voxelization method to transform the point cloud feature tree into a regular tensor structure, and constructs a three-dimensional convolutional neural network (3D CNN) model to carry out the in-depth feature extraction and classification discrimination on the voxel data. The overall process mainly includes the following two steps:
Voxelization preprocessing method for point clouds
Since point cloud data is essentially an unstructured representation, if the original dense point set is directly used as the input to the 3D convolutional neural network, it will lead to high data dimensionality and serious structural redundancy, which is not conducive to the training and inference of the deep model. The voxel structure, on the other hand, can effectively portray the distribution pattern of 3D objects in space, with the advantages of strong regularity and easy to analyse and store. Therefore, the point cloud data need to be converted into regularized voxel grid representation before feeding into the 3D CNN network.
In this paper, we adopt a binary voxel occupancy grid to discretize the point cloud into a sparse regular tensor. The voxel state is defined in binary form: if a voxel cell contains at least one point, it is considered as a “non-empty voxel” with a value of 1; otherwise, it is considered as an “empty voxel” with a value of 0. This encoding facilitates the mapping of the original point cloud structure into a sparse regular tensor. To further examine the influence of voxel filling strategies on feature representation, alternative density-based and intensity-based voxel encodings are also implemented and evaluated in the experimental section, while keeping the voxel resolution and network configuration unchanged.
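A minimal sketch of the binary occupancy encoding, with the density-based alternative obtained from the same per-voxel counts, is given below. The min-max normalisation of the cloud to the grid and the function name `voxel_counts` are assumptions made for illustration.

```python
import numpy as np


def voxel_counts(points, grid=32):
    """Count the points falling in each cell of a grid x grid x grid lattice."""
    pts = np.asarray(points, dtype=float)
    lo = pts.min(axis=0)
    span = pts.max(axis=0) - lo
    span[span == 0] = 1.0                       # avoid division by zero on a flat axis
    idx = np.floor((pts - lo) / span * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)             # upper-boundary points go in the last cell
    counts = np.zeros((grid, grid, grid), dtype=np.int32)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)  # accumulates repeated indices
    return counts


# Toy cloud: two points share a voxel, two occupy voxels of their own.
cloud = np.array([[0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0],
                  [0.5, 0.5, 0.5]])
counts = voxel_counts(cloud, grid=4)             # density-style encoding
binary = (counts > 0).astype(np.uint8)           # binary occupancy encoding
occupancy_ratio = binary.sum() / binary.size
print(int(binary.sum()), occupancy_ratio)
```

`np.add.at` is used instead of fancy-index assignment because the latter would count a voxel only once when several points fall in it.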
After the construction of the 3D point cloud feature tree
3D CNN structural design
Aiming at the 3D tensor input after point cloud feature tree voxelization, this paper designs a 3D CNN based on the improved structure of VoxNet for accomplishing deep feature extraction and classification discrimination in different modes. As shown in Figure 2, the overall structure of the model consists of three sets of convolutional pooling modules, a global average pooling layer, and two layers of fully connected networks, forming an end-to-end discriminative process from feature extraction to classification output.
Figure 2. 3D CNN network architecture.
The three-layer convolution module in the front-end of the network is used to extract local structure, regional features and global semantic information, respectively. Among them, the first convolutional module captures the local edges and symmetric distributions in the point cloud, the second module further refines the structural correlations between spatial regions, and the third module focuses on the higher-order features after multi-scale fusion. Each convolutional layer adopts a 3×3×3 kernel size, and is coupled with batch normalization and ReLU nonlinear activation function to enhance the expressive capability. The pooling operation adopts the maximum pooling strategy to gradually compress the spatial dimension and improve the discriminative and anti-interference properties of the feature map.
After convolutional extraction, each channel is compressed into a single statistic using global average pooling (GAP) to reduce the number of parameters and enhance feature stability. This process outputs fixed-length vectors as inputs to the classification subnetwork. The fully connected module consists of two linear layers, where the output dimension of the first layer is set to 128 and a Dropout mechanism is introduced to alleviate the overfitting problem under small sample conditions. The final classification layer uses a Softmax activation function and outputs four-dimensional vectors corresponding to the four states of the rack and pinion structure: normal, unilateral tooth wear, bilateral tooth wear and tooth broken. The overall network has strong spatial feature modelling capability while maintaining a lightweight structure, providing a stable deep representation base for subsequent classification training.
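To illustrate why global average pooling keeps the classifier small, the sketch below traces feature-map sizes and learnable parameter counts through such a network. The channel widths (16, 32, 64), size-preserving padding and 2×2×2 pooling are assumptions; only the 3×3×3 kernels, the 128-unit hidden layer and the four output classes come from the description above.

```python
def conv_block_params(c_in, c_out, k=3):
    """3D convolution weights and biases plus batch-norm scale and shift."""
    return (k ** 3 * c_in + 1) * c_out + 2 * c_out


def trace_network(grid=32, channels=(16, 32, 64), hidden=128, n_classes=4):
    """Return per-block output shapes and the total learnable parameter count."""
    size, c_in, total = grid, 1, 0
    shapes = []
    for c_out in channels:
        total += conv_block_params(c_in, c_out)
        size //= 2            # padded conv keeps the size; 2x2x2 max pooling halves it
        shapes.append((c_out, size, size, size))
        c_in = c_out
    total += (c_in + 1) * hidden         # first fully connected layer, fed by GAP
    total += (hidden + 1) * n_classes    # classification layer
    return shapes, total


shapes, n_params = trace_network()
last_c, last_d = shapes[-1][0], shapes[-1][1]
gap_fc1 = (last_c + 1) * 128                     # GAP yields one value per channel
flatten_fc1 = (last_c * last_d ** 3 + 1) * 128   # flattening keeps every voxel
print(shapes, n_params, gap_fc1, flatten_fc1)
```

Under these assumed widths, the first fully connected layer needs 8,320 parameters after global average pooling, compared with 524,416 if the final feature map were flattened instead.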
In the model training phase, the loss function is used to guide the gradual update of network parameters. In this paper, the cross-entropy loss function
In order to achieve an efficient gradient update process, this paper adopts the Adam optimization algorithm to iteratively optimize the network parameters. The algorithm is able to adaptively adjust the learning rate of different parameters and performs stably when dealing with non-smooth targets and sparse gradients, which helps to improve the convergence speed and overall performance of network training.
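The interaction between the cross-entropy loss and the Adam update can be sketched in plain Python. The example optimizes four logits directly rather than network weights, and the learning rate of 0.1 and the default Adam constants are illustrative choices, not the training configuration of this study.

```python
import math


def softmax(logits):
    m = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target])


def adam_step(params, grads, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated


logits, target = [0.0, 0.0, 0.0, 0.0], 2
state = {"t": 0, "m": [0.0] * 4, "v": [0.0] * 4}
initial_loss = cross_entropy(logits, target)
for _ in range(100):
    probs = softmax(logits)
    # Gradient of the loss with respect to the logits: softmax minus one-hot.
    grads = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    logits = adam_step(logits, grads, state, lr=0.1)
final_loss = cross_entropy(logits, target)
print(initial_loss, final_loss)
```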
Case study
Experiment and sample description
To verify the effectiveness of the proposed fault diagnosis method and to simulate the actual operating conditions of the rack-and-pinion drive (RPD) in the battery swap system of electric heavy trucks (EHTs), a dedicated experimental platform was built in this study, as shown in Figure 3. The platform is driven by a servomotor running at 3000 RPM through a 16:1 reducer; the main gear (24 teeth) drives the pinion gear (20 teeth), producing reciprocating linear motion of the platform along a 2100 mm track. The complete travel cycle is about 6.3 s, including acceleration, quasi-uniform speed and deceleration phases, of which the quasi-uniform speed phase lasts about 1 s, with the speed of the active gear stabilized at 170-190 RPM.
Figure 3. Structure of the test bench.
The condition monitoring system consists of triaxial vibration accelerometers, mounted on the support structure to measure the vertical, horizontal and axial directions, and Hall effect speed sensors located above the active gears, so that speed disturbances and vibration characteristics are acquired simultaneously. The sampling frequency is set to 20 kHz, and the signals are extracted from the quasi-uniform speed phase with a single sample length of 20,000 points, giving four channels of data in total: the vibration signals in the three directions and the speed signal.
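The stated drive and acquisition parameters can be cross-checked with a short calculation. The sketch assumes that the reducer output turns the active gear directly, which is an interpretation of the description above.

```python
motor_rpm = 3000.0
reduction_ratio = 16.0
fs_hz = 20_000        # sampling frequency
n_points = 20_000     # points per sample

gear_rpm = motor_rpm / reduction_ratio          # nominal speed after the reducer
sample_duration_s = n_points / fs_hz            # duration of one sample
gear_rev_per_s = gear_rpm / 60.0
revs_per_sample = gear_rev_per_s * sample_duration_s
points_per_rev = fs_hz / gear_rev_per_s

print(gear_rpm, sample_duration_s, revs_per_sample, points_per_rev)
```

The nominal 187.5 RPM falls inside the reported 170-190 RPM band, and a 1 s sample covers only about three gear revolutions, which illustrates the short-sampling constraint discussed in the Introduction.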
Description of training and test samples.
Geometric feature representation and point cloud tree construction
On the basis of clarifying the construction mechanism of the point cloud feature tree, this paper further verifies the geometric performance differences of the structure under different failure modes with experimental data to support the feasibility and interpretability of its classification and identification. In order to characterize the geometric distribution of signals under different gear states, this paper fuses 11 layers of image features generated by different SDP mapping methods, and performs unified coordinate extraction and spatial fusion processing for each sample, and finally constructs a 3D point cloud feature tree with a hierarchical structure.
Multi-SDP feature mapping and visualization
To comprehensively capture the geometric differences of vibration signals under different failure modes, this paper constructs a collection of graph methods composed of 11 layers of SDP methods, and performs multi-view mapping and unified coordinate extraction operations on each sample. Specifically, based on the method framework defined by the Fault diagnosis scheme, salient point extraction is performed on the symmetrical images generated for each SDP variant respectively, and they are uniformly converted into normalized two-dimensional Cartesian coordinates. This processing procedure ensures that the coordinate points from different graph sources have geometric consistency and center alignment characteristics, providing a foundation for the subsequent construction of multi-level point cloud structures.
To further enhance the interpretability of the intermediate process and to visualize the differences in mapping structure of each SDP method across modes, five representative SDP methods (SDP, VMD-SDP, VMD-MultiSDP, MVMD-SDP, and Holographic-SDP) are selected, and their mapping morphology is compared under the four typical modes (NM, UTW, BTW, TB). The morphology of each map contains geometric features such as the number of petals, density distribution, symmetry structure and edge contour, and its changes clearly reveal how the different modes perturb the structure and distribution characteristics derived from the original vibration signals.
Comparison of SDP feature-based visualization for different failure modes.
Layered fusion and point cloud tree formation
In order to achieve the unified fusion of multi-view mapping, based on the construction logic of the Fault diagnosis scheme in this paper, hierarchical mapping operations are performed on all samples: the 2D point set
To validate the expression differences of the constructed point cloud feature tree under different modes, Figure 4 shows the point cloud structure visualization results of four typical operating conditions. It can be observed that there are significant differences among the modes in terms of point cloud density distribution, spatial hierarchical structure, and symmetry morphology: the normal mode exhibits a symmetrical and balanced morphology overall, with compact and clear petal layers; the unilateral tooth wear mode exhibits non-uniform density changes in the upper region, showing slight shifts and interlayer fractures; bilateral tooth wear further exacerbates these differences, with the point cloud exhibiting a bimodal distribution trend; and the tooth broken mode exhibits significant asymmetry and high-density concentrated distribution, particularly in the middle and upper layers, where numerous dense cluster structures are evident.
Figure 4. Comparison of point cloud feature tree structures under different modes.
These structural differences reveal that different patterns exhibit distinct spectral response characteristics under a multi-source SDP perspective, validating the effectiveness of point cloud feature trees in structurally expressing geometric information. This provides a stable and interpretable geometric foundation for subsequent voxelization and 3D convolution processing.
Voxelization mapping
To feed the constructed point cloud feature tree into the deep convolutional model, its unstructured point set representation must be converted into a regular 3D tensor. This paper directly adopts the voxelization process defined in the Fault diagnosis scheme: the normalized point set is converted into a sparse binary (0-1) voxel tensor that provides structured input suitable for three-dimensional convolutional neural networks. In the subsequent experiments, two commonly used voxel resolutions, 32³ and 64³, are selected as input configurations; the sensitivity to these parameters is analyzed in detail in the subsequent parameter selection and sensitivity analysis.
Training and diagnostic results
Training settings
To verify the performance of the designed three-dimensional convolutional neural network (3D CNN) model in the fault identification task, this paper conducts systematic training and verification of the network. The model structure has been described in detail in 3D CNN structural design. This section mainly describes the specific parameter settings and implementation process in the training stage.
The Adam optimizer was used during training, with an initial learning rate of 1×10⁻⁴, a maximum number of iterations of 50, and a batch size of 16. For sample division, the samples of each category were randomly divided into a training set and a validation set at a ratio of 7:3, and category weights were introduced to mitigate the bias caused by class imbalance. To enhance training stability and avoid overfitting, the Dropout mechanism was applied during training. Unless otherwise specified, results are reported from a single run under the fixed split; the robustness to stochastic training effects is further examined in the subsequent ablation and robustness analyses.
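The per-category 7:3 split and the category weights can be sketched as follows. The inverse-frequency weighting rule and the sample counts are hypothetical; the actual weighting scheme and dataset sizes are not specified in this text.

```python
import random
from collections import Counter


def stratified_split(labels, train_ratio=0.7, seed=0):
    """Split indices class by class so each class keeps the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, val = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab][:]
        rng.shuffle(idxs)
        cut = round(train_ratio * len(idxs))
        train.extend(idxs[:cut])
        val.extend(idxs[cut:])
    return train, val


def inverse_frequency_weights(labels):
    """Hypothetical weighting: rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {lab: n / (k * c) for lab, c in counts.items()}


# Hypothetical, deliberately imbalanced label list (not the dataset of this study).
labels = ["NM"] * 40 + ["UTW"] * 30 + ["BTW"] * 20 + ["TB"] * 10
train_idx, val_idx = stratified_split(labels)
weights = inverse_frequency_weights([labels[i] for i in train_idx])
print(len(train_idx), len(val_idx), weights)
```

Fixing the seed reproduces the same split on every run, which matches the fixed-split protocol used for the reported results.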
Evaluation indicators
In order to comprehensively measure the classification performance of the model, this paper introduces four commonly used metrics: Accuracy, Precision, Recall and F1-score. The definitions are as follows:
Among them,
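As an illustration of how the four metrics relate to a confusion matrix, the following sketch computes accuracy together with macro-averaged precision, recall and F1-score. Macro averaging is an assumption, and the matrix holds toy counts rather than results from this study.

```python
def classification_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(k)) / total
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(k)) - tp   # predicted c, truly another class
        fn = sum(confusion[c]) - tp                        # truly c, predicted another class
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


# Toy confusion matrix with one off-diagonal error (illustrative counts only).
toy = [[9, 0, 0, 1],
       [0, 10, 0, 0],
       [0, 0, 10, 0],
       [0, 0, 0, 10]]
acc, macro_p, macro_r, macro_f1 = classification_metrics(toy)
print(acc, macro_p, macro_r, macro_f1)
```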
Ablation study on multi-channel fusion
To evaluate the effectiveness of the fusion strategy in equation (20), we compared the original pixel-wise max fusion with channel-wise mean fusion and energy-based weighted average fusion. All experiments were conducted under the same fixed data split, identical HS-PCFT construction, and identical 3D CNN training configuration. Since the differences among fusion operators primarily manifest at the response magnitude level prior to discretization, the comparison was performed using the intensity-based voxel representation to preserve amplitude variations introduced by different fusion mechanisms. As verified in the voxelization analysis, binary and intensity representations yield comparable classification performance; therefore, this setting does not alter the overall methodological conclusion.
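The three fusion operators being compared can be sketched on a stack of per-channel SDP images. The energy-based weights are taken here as each channel's share of the total squared intensity, which is an assumed definition, and the 2×2 arrays are toy inputs.

```python
import numpy as np


def fuse_max(images):
    """Pixel-wise maximum across channels."""
    return np.max(images, axis=0)


def fuse_mean(images):
    """Channel-wise mean."""
    return np.mean(images, axis=0)


def fuse_energy_weighted(images):
    """Weighted average with weights proportional to per-channel energy (assumed rule)."""
    energy = np.sum(np.square(images), axis=(1, 2))
    total = energy.sum()
    if total == 0:
        return np.zeros(images.shape[1:], dtype=float)
    weights = energy / total
    return np.tensordot(weights, images, axes=1)   # contracts the channel axis


# Toy V, H and A channel images stacked as (channels, height, width).
images = np.array([[[1.0, 0.0], [0.0, 0.0]],
                   [[0.0, 2.0], [0.0, 0.0]],
                   [[0.0, 0.0], [1.0, 0.0]]])
fused_max = fuse_max(images)
fused_mean = fuse_mean(images)
fused_energy = fuse_energy_weighted(images)
print(fused_max, fused_mean, fused_energy, sep="\n")
```

The toy output shows why the comparison is made before discretization: the three operators give different magnitudes at the same pixels, a difference that a binary occupancy grid would erase.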
Comparison of different multi-channel fusion strategies.
Stacking-order sensitivity analysis
In addition to voxel size and inter-layer distance
Stacking-order sensitivity analysis under fixed 32³ voxelization and
Effect of voxel filling strategies
To investigate whether richer voxel encoding can improve the discriminative capability of the proposed HS-PCFT representation, we compare three voxel filling strategies, including the binary occupancy grid adopted in this study, a density-based encoding, and an intensity-based encoding. The voxel resolution, inter-layer spacing, fixed train/validation split, network architecture, and training hyperparameters are kept identical for a fair comparison. To account for stochastic effects, each configuration is trained for
Comparison of voxel filling strategies (grid size 32³,
Effect of order-tracking preprocessing
To further examine whether low-speed conditions may cause spectral smearing, we introduce a conventional order-tracking (OT) preprocessing stage as a front-end ablation. Specifically, the tachometer-measured instantaneous speed
Influence of OT preprocessing (fixed split,
Influence of network depth
To verify whether the relatively shallow 3D CNN architecture with global average pooling is sufficient to capture high-level semantic features, we conduct a depth ablation study. Three variants are evaluated under identical settings (same HS-PCFT representation, voxelization configuration, fixed train/validation split, and
Depth ablation of the 3D CNN backbone (
Parameter selection and sensitivity analysis
When constructing point cloud feature trees and encoding 3D voxels, the choice of key parameters has a significant impact on model performance. To assess the robustness and sensitivity of the proposed diagnostic model to these feature-representation parameters, this paper analyses the influence of two of them on classification performance: the voxel grid size and the interlayer spacing parameter
Comparison of model classification accuracy for different combinations of voxel size and interlayer spacing.
As can be seen from Table 8, taking accuracy and computational efficiency into account, the model performs optimally under the combination of
It is worth noting that although a finer resolution such as 64³ can theoretically preserve more geometric details, it also substantially increases the input dimensionality (from 32,768 to 262,144 voxels). For each sample, the non-zero voxel ratio is defined as
Model training results and visualization analysis
Figure 5 shows the accuracy curves of the model on the training and validation sets. Accuracy rises steadily with the number of training iterations and eventually converges, with the validation accuracy reaching 97.83%, indicating that the model generalizes well.
Training progress.
In addition, Figure 6 shows the confusion matrix for the validation set. The four modes are distinguished almost completely; the only error is a single misclassification between the normal mode and the tooth broken mode. A likely cause is that some normal samples contain local sudden disturbances or short-term shock components, so their spatial structure resembles that of the tooth broken class.
Confusion matrix obtained by 3D CNN.
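As a reading aid, accuracy and macro-averaged F1 can be recovered from a confusion matrix as sketched below. The matrix in the example is hypothetical: it uses 46 samples with a single normal-to-tooth-broken error, chosen only because 45 correct out of 46 corresponds to 97.83%; the per-class counts are not taken from Figure 6, and macro averaging is an assumption about how the F1-score is aggregated.

```python
import numpy as np

def accuracy_and_macro_f1(cm):
    """Accuracy and macro F1 from a confusion matrix (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)    # true samples per class
    predicted = cm.sum(axis=0)  # predicted samples per class
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return tp.sum() / cm.sum(), f1.mean()

# Hypothetical counts for illustration only (class order: normal, tooth broken, two others).
example_cm = [
    [11, 1, 0, 0],
    [0, 11, 0, 0],
    [0, 0, 11, 0],
    [0, 0, 0, 12],
]
acc, macro_f1 = accuracy_and_macro_f1(example_cm)
print(f"accuracy={acc:.4f}, macro F1={macro_f1:.4f}")
```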
To further verify the discriminative ability of the proposed model in the embedding space, principal component analysis (PCA) is applied to the deep features extracted by the 3D CNN, and the features are projected into 3D space for visualization. As shown in Figure 7, the four fault modes form well-clustered and well-separated groups in the 3D principal component space, which reflects the effectiveness of the deep features for identifying pronounced faults. However, a small number of normal samples lie close to the edge of the tooth broken cluster, suggesting that some feature ambiguity remains near the boundary between normal and abnormal states. Improving the model’s discriminative robustness for such critical-state samples is left to follow-up work.
Three-dimensional visualization of 3D CNN features.
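The projection used for this kind of visualization can be sketched with a plain SVD-based PCA. The sketch assumes the deep features are available as an (N, D) array; the function name `pca_project` is hypothetical, and the layer from which features are taken is not specified here.

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project (N, D) features onto their first principal components."""
    x = np.asarray(features, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    total = (s ** 2).sum()
    ratio = s ** 2 / total if total > 0 else np.zeros_like(s)
    return scores, ratio[:n_components]
```

The returned explained-variance ratios indicate how much of the feature spread the 3D view retains, which is worth checking before reading cluster separation off the plot.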
Comparative case studies
Comparison of classification accuracy, F1 score and training time of different methods on rack and pinion structure dataset (single run, fixed 7:3 split).
In terms of specific configurations, the VoxNet method uses a shallow network with two 3D convolutional layers and two fully connected layers, takes 32³ voxel input, and has the number of training epochs set to 20. The PointNet method samples 1024 points from each point cloud, generates a 1024-dimensional feature vector through the PointNet encoder, and classifies it with two fully connected (FC) layers. The multi-view CNN method projects the point cloud onto three planes (XY, XZ, and YZ) to form a 64 × 64 three-channel grayscale image, which is classified by a 2D CNN with three convolution and pooling stages. For the traditional shallow methods, the time-domain features are mean, skewness, kurtosis, and variance; the frequency-domain features are spectral skewness and total energy; and the wavelet packet features are the energy distribution and entropy values of four decomposition layer nodes. All features are standardized with Z-score normalization before being fed to an SVM classifier with a linear or RBF kernel.
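The time-domain part of the shallow baseline can be sketched as follows. This is a minimal illustration, assuming population (biased) moments and non-excess kurtosis; the helper names are hypothetical, and the frequency-domain features, wavelet packet features, and the SVM itself are omitted.

```python
import numpy as np

def time_domain_features(segment):
    """Return [mean, skewness, kurtosis, variance] of a 1D segment."""
    x = np.asarray(segment, dtype=float)
    mu = x.mean()
    var = x.var()  # population variance (assumed convention)
    if var == 0:
        return np.array([mu, 0.0, 0.0, 0.0])
    z = (x - mu) / np.sqrt(var)
    skew = (z ** 3).mean()
    kurt = (z ** 4).mean()  # non-excess kurtosis (3 for a Gaussian)
    return np.array([mu, skew, kurt, var])

def fit_zscore(train_features):
    """Column-wise mean and standard deviation estimated on the training set."""
    f = np.asarray(train_features, dtype=float)
    std = f.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return f.mean(axis=0), std

def apply_zscore(features, mean, std):
    """Standardize features with statistics from fit_zscore."""
    return (np.asarray(features, dtype=float) - mean) / std
```

Fitting the Z-score statistics on the training portion only and reusing them for the validation portion keeps the fixed 7:3 split free of information leakage.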
The experimental results are shown in Table 9. Under the fixed split, the proposed 3D CNN achieves 97.83% accuracy and an F1-score of 0.978 on the validation set with a training time under 20 seconds, indicating strong classification capability at a practical computational cost. VoxNet, although lightweight, still reaches 91.30% accuracy with a training time of around 22 seconds, which makes it suitable for embedded deployment. PointNet processes the global point cloud directly; its structure is simple, but it is sensitive to small sample sizes and local changes, and its final accuracy of 73.91% lags clearly behind. The 2D CNN based on three-view data does not process 3D data directly, yet it achieves 88.89% accuracy and an F1-score of 0.889 after integrating multi-view information, which is better than the traditional method with single-channel feature input. The feature engineering and SVM method combines time-domain, frequency-domain, and wavelet packet features; its best configuration reaches 92.00% accuracy and an F1-score of 0.920, showing that traditional methods remain competitive when effective features are sufficiently extracted, although they depend heavily on feature selection. In summary, the proposed 3D CNN exploits the spatial features of the point cloud under a multi-angle modelling strategy and achieves high classification accuracy at a reasonable computational cost, demonstrating good practicality and engineering adaptability.
Discussion
Discussion on robustness and deployment
The proposed HS-PCFT framework was validated on a laboratory-scale test bench; the platform was constructed to follow a 1:1 structural configuration of practical battery swapping systems in terms of transmission structure, sensor placement, and operating conditions, thereby improving representativeness for real-world scenarios.
Robustness evaluation under AWGN perturbations at different SNR levels.
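An AWGN perturbation at a prescribed SNR can be generated as in the sketch below, which scales the noise power relative to the measured signal power. The function name `add_awgn` and the optional `rng` argument are illustrative; the SNR levels and the stage at which noise was injected in the experiments are not specified here.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the result has the requested SNR in dB."""
    x = np.asarray(signal, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    p_signal = np.mean(x ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  ->  solve for P_noise.
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
    return x + noise
```

Passing a seeded generator, for example `np.random.default_rng(0)`, makes the perturbed test set reproducible across runs.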
In practical deployment, station-specific adaptation can be achieved via transfer learning or fine-tuning using historical operational data, allowing compensation for installation variations, sensor gain differences, and speed distribution shifts. These strategies support reliable model migration from laboratory conditions to field environments.
Implications for industrial noise and vibration management in battery swapping stations
Battery swapping stations operate under high-throughput, repetitive duty cycles, where rack-and-pinion drivetrains can gradually develop wear and meshing defects. Under low rotational speed, speed fluctuations, and short-duration measurements, early fault signatures are weak and easily masked by background noise. If undetected, such defects can increase vibration levels, promote structural transmission to surrounding assemblies, and potentially elevate radiated noise and on-site vibration exposure during operation and maintenance. These issues align with the journal’s emphasis on industrial noise/vibration consequences and occupational exposure concerns.
From a practical maintenance viewpoint, HS-PCFT provides a representation-oriented route to improve weak-feature separability: multi-variant SDP mappings are organized into a structured 3D point-cloud hierarchy and learned by a lightweight voxel-based 3D model. This complements optimization-driven studies on adaptive preprocessing and model tuning by focusing on preserving discriminative geometric cues in mapped structures, enabling earlier identification of incipient gear defects and supporting condition-based interventions before vibration/noise escalation.
For deployment, the pipeline can be integrated into existing vibration monitoring chains (segment → SDP variants → HS-PCFT → voxelization → 3D inference) to output fault states and confidence. Key considerations include latency under station cycle-time constraints, sensor placement repeatability, and domain shift across stations; these can be mitigated via periodic re-calibration and additional multi-site field data.
Conclusion
This paper addresses the fault diagnosis requirements for drive gears in electric heavy truck battery swap systems operating under low-speed, speed-fluctuating, and weak-feature conditions. It proposes a point cloud feature tree construction method based on multi-spectrum fusion symmetric patterns, mapping one-dimensional vibration signals to three-dimensional structured spectral expressions. The method begins with spectral transformation, generates multi-source symmetric images using various SDP variants, converts them into polar coordinate points, and constructs a hierarchical three-dimensional point cloud feature tree, which makes the geometric features of different fault modes easier to distinguish.
The main advantages of this method are as follows. First, multiple SDP spectrum construction mechanisms capture detailed features of signals at different structural levels from the perspectives of modal decomposition, channel fusion, and scale perturbation, effectively enhancing feature significance under low-speed fault conditions. Second, the point cloud feature tree structure forms a clear spatial hierarchical representation through the normalized stacking of spectrum polar coordinates, maintaining structural consistency in the fusion of multi-source spectra. Third, mapping point clouds to 3D voxel tensors facilitates unified modelling, providing a standardized input framework for subsequent spatial analysis networks of any type. The lightweight 3D CNN adopted in this paper serves as an example to validate the usability and high-precision performance of this structure.
This study introduces three new modelling concepts: spatial mapping of signal features, structural fusion of mapping data, and unified representation of point-cloud morphology. Together they form a graph-to-structure approach for diagnosing mechanical faults under complex conditions. Future work will examine its adaptability and real-time deployability across multi-device, multi-state industrial sensing scenarios.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This work was supported by Henan Provincial Science and Technology Research Project (Grant No. 262102221057), Henan Provincial Technology Deputy Directors Program (Enterprise Appointment Program), and Innovation Funds Plan of Henan University of Technology (Grant No. 2021ZKCJ07).
