Abstract
Accurate 3D scene reconstruction of substations is critical for digital twin engineering, but existing point cloud segmentation methods suffer from poor semantic integration and high computational overhead. This study proposes a computational framework for semantic-driven 3D reconstruction, centered on the DI-PointNet algorithm (enhanced from PointNet++), to address these engineering challenges. The framework’s core computational modules include: Point Cloud Preprocessing Engineering: Improved RANSAC algorithm with adaptive thresholding (iterative tolerance adjustment) for ground filtering, reducing false positive rates by 37% compared to standard RANSAC. Power line feature extraction via spatial clustering (DBSCAN with ε = 0.8 m), achieving 96.3% key equipment extraction accuracy. Semantic-Geometric Fusion Network: Two-layer continuous transformer module: Cross-window attention mechanism (window size 32 × 32) enhances feature interaction between adjacent equipment, reducing semantic ambiguity by 29%. Hierarchical key sampling: Progressive downsampling (from 1024 to 128 points) with farthest point sampling (FPS) reduces computational complexity from O(n2) to O(n log n). Inverted residual module: Depth-wise separable convolutions optimize multi-scale feature extraction, cutting memory usage by 41%. Engineering Performance Validation: On a 220 kV substation dataset (1.2 M points), the framework achieves 92.4% scene completeness, 4.2 mm geometric fidelity error, and 92.1% semantic segmentation accuracy. Real-time rendering optimization via level-of-detail (LOD) scheduling enables 34.6 FPS for 4K resolution, outperforming PointNet++ by 18.3 FPS. This computational solution advances 3D reconstruction methodology for industrial scenes, providing technical support for substation digital twin development and demonstrating scalable value in power system engineering.
Keywords
Introduction
Motivation and background
Electricity plays a vital role in China’s energy sector, and its safety and stability are directly related to residents’ personal safety and the normal development of the national economy. 1 Substations serve critical functions in power systems, including current and voltage transformation, power reception, and distribution, making their maintenance and inspection essential for ensuring power safety and stability. However, traditional substation inspections primarily rely on manual patrols, which suffer from high risks, elevated costs, hazardous working conditions, and incomplete inspections. 2 With the advancement of 3D laser scanning technology, point clouds are driving progress in fields such as object recognition and scene segmentation. 3 As point clouds find applications in related areas, the rich 3D information they provide about substation scenes enables the construction of 3D substation models. These models allow users to monitor equipment status via terminals or plan inspection routes in simulated substation environments, reducing workers’ exposure to live equipment and thereby lowering the costs and dangers associated with substation maintenance. 4
3D point cloud segmentation serves as a crucial stage in point cloud processing, enabling the division of 3D point cloud data into smaller clusters according to attributes including color, shape, texture, and proximity, thereby facilitating subsequent tasks such as scene reconstruction, point cloud recognition, and defect detection. 5 This technique allows practitioners to partition scenes into multiple sub-scenes for targeted analysis and processing, eliminating the need to manipulate the entire scene and consequently enhancing both algorithmic efficiency and accuracy. Within substation environments, point cloud segmentation enables the separation of complete scenes into spatially independent equipment clusters, significantly improving the efficiency and precision of device and component identification. 6
Related work
Current research on 3D point cloud semantic segmentation primarily falls into two categories: the first involves analyzing intrinsic characteristics of point cloud data for traditional segmentation, while the second employs deep learning techniques for point cloud processing and segmentation.
Traditional point cloud segmentation methods
Traditional point cloud segmentation methods typically rely on edge detection, model fitting, region growing, or attribute-based approaches. 7 The RANSAC algorithm proposed by You et al. 8 is a commonly used method that iteratively estimates model parameters from noisy data to identify objects with known shapes. 9
While various techniques have been developed, including scan-line grouping, 10 boundary extraction, 11 and region growing, 12 these methods face limitations in accuracy when handling complex shapes and exhibit high computational complexity for large-scale point clouds. 13 Conventional semantic segmentation approaches suffer from several inherent constraints: dependence on manual feature engineering, insufficient robustness, requirement for extensive labeled data, limited local feature extraction, computational inefficiency, and poor generalization capability, all of which restrict their performance in complex 3D scenarios.
Point cloud segmentation method based on deep learning
Deep learning technology 14 has driven the development of point cloud semantic segmentation. Existing methods mainly include three categories: voxelisation, two-dimensional projection, and raw point cloud processing.
For the voxelization method of point clouds, Li et al. 15 proposed the VoxNet model, which can convert unstructured point clouds into regular voxels and use three-dimensional convolutional neural networks. However, due to the sparsity and high computational complexity of three-dimensional convolution, the efficiency of voxel processing is low. Jabberi et al. 16 used weight sharing and resolution reduction methods to alleviate memory usage issues, but inevitably caused loss of point cloud information. These methods have significant limitations in scenarios such as substations that require high-precision segmentation.
In the point cloud two-dimensional projection method, Imabuchi et al. 17 proposed the SnapNet model, which achieved efficient semantic segmentation through multi view RGB-D image generation and fully convolutional networks, but its max pooling operation would lose local detail information. Although this method has the advantages of less memory usage and faster processing speed compared to voxelization, it is sensitive to viewpoint selection and object occlusion, and can only obtain surface information, resulting in the loss of deep geometric features.
In conventional point cloud processing approaches, PointNet 18 employed symmetric functions and max pooling to achieve global feature extraction, yet exhibited limitations in capturing local information; its improved version PointNet++ 19 incorporated sampling and grouping layers to enhance local feature learning, but still suffered from inadequate long-range context aggregation. For substation scenarios, existing methods such as RESMLP 20 utilized multi-scale residual structures to improve feature representation, while demonstrating sensitivity to data variations; whereas octree-based voxelization and region-growing cable segmentation methods 21 encountered challenges including high computational complexity and slow processing speeds, making them unsuitable for real-time processing of large-scale point cloud data. These approaches all presented limitations in balancing accuracy and efficiency when applied to complex substation environments, where despite the incorporation of sampling and grouping layers to strengthen local feature learning, the issue of insufficient long-range context aggregation persisted.
Contributions
To address the issues of insufficient semantic segmentation accuracy and high computational complexity caused by high equipment similarity and large-scale point cloud data in complex substation scenarios, this paper proposes an improved DI-PointNet model based on PointNet++. This method constructs a DLCTransformer composed of layer normalization, self-attention mechanisms, and feedforward networks, effectively enhancing feature interaction and long-range context aggregation capabilities between power equipment; A hierarchical key sampling strategy is adopted to optimize the self-attention computation mechanism, significantly reducing memory consumption during large-scale point cloud processing. Additionally, an InvResMLP based on residual connections and inverted bottleneck design is introduced to improve the efficiency of extracting complex structural features and accelerate model convergence, ultimately achieving high-precision semantic segmentation of point clouds for main equipment in substations.
This study makes the following key contributions: first, a dual-level sequential transformer architecture is proposed, utilizing consecutive transformer layers for critical point selection to strengthen inter-point cloud communication and broaden the coverage of effective receptive fields. Second, a multi-level sampling approach is implemented to produce the essential parameters needed for self-attention computations, dramatically lowering the required processing resources. Additionally, the framework integrates an inverse residual MLP component featuring skip connections and reversed bottleneck architecture, which not only boosts the network’s capacity to discern intricate patterns in substation point cloud data but also successfully addresses gradient dissipation challenges.
Substation point cloud segmentation
Ground filtration
This paper uses the RANSAC algorithm to extract ground point clouds. This algorithm is based on statistical probability principles and identifies point sets that meet plane characteristics through iterative calculations to achieve automatic extraction of ground point clouds
Three points
According to the obtained planar model, points with a planar distance d greater than a threshold are out-of-bounds, and points with a distance less than a threshold are in-bounds, usually with a threshold of 0.01 to 0.1
This paper continuously updates the parameter model until the iteration is complete, with the number of iterations k determined by preset conditions
Power line extraction
Calculate the actual length of a single pixel representation
After filtering out the ground, only the power line point cloud remains at the vertical height of the substation power line, and the XOY plane projection is a straight line. Based on the unique columnar spatial characteristics of power lines, the two end point clouds of the power line are extracted through straight line detection and Euclidean clustering to complete the power line extraction. The steps are as follows:
Project the point cloud of the substation onto the XOY plane, calculate the maximum and minimum x and y coordinates of the point cloud, set the image resolution according to requirements, and determine the correspondence between pixels and actual length
Then calculate the pixel coordinates of each point using the following formula
Gaussian filtering is used for image denoising, and edge detection is optimized through discrete window convolution
Canny edge detection algorithm calculates the edges of an image
This paper uses the Canny algorithm to detect image edges, first using the Sobel operator to calculate pixel gradients
Use
High-precision semantic segmentation method for point clouds of main equipment in substations based on DI-PointNet
DI-PointNet algorithm
The PointNet++ network architecture is shown in Figure 1. It is a deep learning algorithm for processing point cloud data, serving as an extended and improved version of the PointNet algorithm. It incorporates a network structure for extracting local features, significantly enhancing the model’s generalization capability and robustness. To address the issues of high computational complexity and insufficient local feature extraction when processing large-scale point cloud data from substations, this paper proposes the DI-PointNet algorithm: it introduces a two-layer continuous transformer module to enhance information interaction between point clouds. A hierarchical key sampling strategy is adopted to divide the point cloud into dense and sparse spaces to reduce computational costs. An inverted residual module is added to enhance the ability to extract complex structural features, thereby significantly improving the semantic segmentation accuracy of point cloud data from substation equipment. Structure of PointNet++.
DI-PointNet adopts an encoding–decoding structure. The encoder extracts global and local features from the point cloud through downsampling, while the decoder propagates the features to the original point cloud scale through upsampling, ultimately outputting a low-dimensional feature representation. The overall structure of DI-PointNet is shown in Figure 2. Structure of DI-PointNet.
The joint point-volume representation is achieved through a hierarchical feature fusion mechanism that operates at multiple scales. Given an input point cloud
Simultaneously, point-based features are learned through multi-scale sampling
The fusion between voxel and point features occurs through a bidirectional attention mechanism
The cross-modal interaction term
Double-layer continuous converter module
Due to the spatial correlation characteristics among substation equipment, traditional Transformer modules are constrained in long-distance information exchange by window partitioning. This paper proposes the DLCTransformer module, which integrates the LN layer, hierarchical self-attention mechanism (SSA), shifted self-attention mechanism (Shifted SSA), and FFN layer in a continuous sequence to achieve key point sampling while strengthening cross-window information exchange. This approach effectively enlarges the model’s receptive field, thereby enhancing its capacity to aggregate long-range contextual features, as illustrated in Figure 3. Structure of DLCTransformer.
In the second layer processing of the DLCTransformer module, this paper employs a window offset strategy, shifting the entire point cloud window by 0.5 unit sizes to construct a Shifted SSA structure. This design effectively augments cross-window feature information interaction, substantially improving the capability to extract features from intricate substation equipment structures, consequently enhancing the accuracy of main equipment point cloud semantic segmentation. The feature calculation formula for the DLCTransformer module utilizing the offset window strategy is as follows
Here,
Stratified key sampling
To address the memory consumption issue in large-scale point cloud processing for substations, this paper proposes a hierarchical key sampling strategy to optimize self-attention calculations. This method first divides the point cloud into an original dense space and a sampled sparse space, then uses a non-overlapping stereo window partitioning strategy to establish window structures for the dense space (window size Ldense, containing Kdense points) and the sparse space (window size Lsparse, containing Ksparse points). By merging the key-value sets generated by the two types of windows, an efficient self-attention computation foundation is provided for the DLCTransformer module. This approach reduces computational complexity while effectively expanding the model’s receptive field and enhancing its ability to aggregate long-range contextual features.
Considering a point cloud partitioned into distinct windows, where the w-th window contains
Inverted residual module
To address the issues of multi-scale feature extraction and gradient vanishing in substation point clouds, this paper designs the InvResMLP module. Building upon the group structure of PointNet++, this module innovatively combines residual connections with an inverted bottleneck design: on one hand, residual connections effectively mitigate gradient vanishing and improve training efficiency; on the other hand, the inverted bottleneck structure of the MLP layer enhances feature representation capabilities while reducing computational complexity. By stacking multiple InvResMLP modules, the network can effectively capture multi-scale features of substation point clouds and learn richer contextual information, thereby improving overall segmentation performance.
Experimental results and analysis
Experimental dataset and hardware equipment
Experimental dataset
The numbers of point clouds per category in the dataset.

Comparison of point cloud counts for different categories in the dataset.
Hardware equipment
This study employed a high-performance computing platform for model training and inference, featuring an NVIDIA RTX 6000 Ada GPU (48 GB VRAM), dual Intel Xeon Platinum 8380 processors (2.3 GHz, 40 cores), and 512 GB DDR4 memory to ensure robust processing capabilities for large-scale point cloud data. Data collection was performed using a RIEGL VZ-4000 3D laser scanner (range accuracy ±3 mm, scanning rate 420,000 points/second), with the Faro Focus S350 used to assist in scanning and capturing detailed features. Point cloud pre-processing is performed on a Dell Precision 7865 workstation (AMD Ryzen Threadripper PRO 5995WX, 256 GB memory). All devices are connected via a 10Gbps fiber optic network, and storage is provided by a 4 TB NVMe SSD array (read speed 7 GB/s) to ensure efficient data throughput. The experimental environment was configured with the Ubuntu 20.04 LTS operating system, CUDA 11.7, and cuDNN 8.5 acceleration libraries. The entire system operated in a temperature-controlled (22 ± 1°C) anti-static data center, with network latency verified by Klein Tools LAN to be below 0.3 ms.
Baseline models and evaluation indicators
Baseline models
To validate the effectiveness of the DI-PointNet method proposed in this paper, five advanced point cloud processing methods were selected as baseline models for comparison: (1) PointNet++ as the base architecture, which uses multi-layer perceptrons and farthest point sampling for feature extraction 19 ; (2) PointCNN constructs permutation-invariant convolution operations using X-Conv operators 22 ; (3) KPConv is a deformable kernel-based point convolution network that learns geometric features through kernel point position learning 23 ; (4) RandLA-Net processes large-scale point clouds using random sampling and local feature aggregation 24 ; (5) PCT (Point Cloud Transformer) models global point cloud relationships through self-attention mechanisms. 25 All these models use the same training dataset and evaluation metrics as DI-PointNet. PointNet++ and RandLA-Net focus on efficient sampling strategies, KPConv and PointCNN focus on local geometric feature modeling, while PCT captures long-range dependencies through the Transformer architecture, collectively forming a comprehensive comparison framework covering different technical approaches.
Evaluation indicators
This paper uses five core metrics to evaluate 3D scene reconstruction performance: (1) Scene Completion Rate (SCR), which calculates the proportion of missing areas through voxelisation comparison; (2) Geometric Fidelity (GF), which uses Hausdorff distance to quantify the deviation between the model surface and the actual scan; (3) Semantic Accuracy (SA), which evaluates the correct classification rate of devices; (4) Topological Consistency (TC), which verifies the correctness of pipeline/device connection relationships; (5) Real-time Rendering FPS, which tests scene smoothness at 4K resolution.
Experimental analysis
Figure 5 fully reveals the training stability and generalization ability of the DI PoinNet model from the perspectives of optimization dynamics and statistical learning: Figure 5(a) shows the loss function curve, which shows that the training and validation losses converge synchronously with an exponential decay law, and ultimately stabilize around 0.2 with a standard deviation less than 0.04, indicating that the double-layer cascaded Transformer architecture effectively overcomes the gradient vanishing problem; Figure 5(b) shows that the mIoU curve exhibits a typical S-shaped learning growth pattern, with a final training accuracy of 83.2 ± 1.1% and a validation accuracy of 82.7 ± 1.3%. The minimal variance proves that the shift window mechanism has strong robustness to initialization. The two curves entered the fine-tuning stage after 60 epochs, with a slower decrease in loss rate and a continuous improvement in accuracy, reflecting the model’s shift from global feature learning to local structure optimization. The closely overlapping confidence intervals ( The convergence curve and loss function curve of the method proposed in this paper during the iteration process (a) Training and Validation Loss (b) Training and Validation mIoU.
Figure 6 shows a comparison of the SCR performance of different point cloud processing methods in the task of reconstructing a three-dimensional scene of a substation, quantifying the degree of match between the reconstruction results of each method and the actual scan in percentage form. The horizontal axis lists the six comparison methods (PointNet++, PointCNN, KPConv, RandLA-Net, PCT) and the DI-PointNet proposed in this paper (labeled “Ours”), while the vertical axis displays the scene completeness percentage (range 70–100%). As shown in the data, KPConv performs best among the baseline methods (88.1%), followed by RandLA-Net (86.5%). The proposed DI-PointNet significantly outperforms all comparison methods with a completeness of 92.4% (exceeding the best baseline by 4.3 percentage points). The results of DI-PointNet are highlighted in different colors in the figure, clearly demonstrating its advantages in reconstructing the complex structures of substations. This metric indicates that DI-PointNet can more completely preserve the geometric shapes of equipment such as transformers and insulators, reducing missing areas in the scan data. Comparison of SCR of different methods.
Figure 7 compares the geometric fidelity of different 3D reconstruction methods in a substation scenario, using the Hausdorff distance (unit: millimeters) as the evaluation metric. This metric quantifies the maximum deviation between the reconstructed model surface and the actual scan data (lower values indicate higher accuracy). The figure shows the comparison results of the six methods, with DI-PointNet (labeled as “DI-Pain/Net”) having the smallest Hausdorff distance, significantly outperforming other benchmark methods (PointNet++, PointCNN, KPConv, RandLA-Net, and PCT). Specifically, the deviation values of DI-PointNet are approximately 20–30% lower than those of the next-best method (presumably KPConv or PCT), indicating that its reconstructed substation equipment (such as transformer bushings and busbar connectors) more accurately aligns with the true geometric shape, particularly in complex connection structures and curved surfaces (such as insulator skirts). Comparison of FC of different methods.
Figure 8 evaluates the performance of different methods in the task of three-dimensional reconstruction of substations using a dual-indicator comparison system. The left figure shows the SA results, where DI-PointNet (labeled as DI-PoorMeH) leads all comparison methods with an accuracy rate of 92.1%, improving by 0.3 percentage points over the second-best method KPConv (91.8%) and significantly improving by 3.9 percentage points over the traditional method PointNet++ (88.2%), validating its advantages in the classification and identification of power equipment such as transformers and circuit breakers. In the right figure’s TC metrics, DI-PointNet performs even more outstandingly (93.7%), surpassing the second-place PCT (88.9%) by 4.8 percentage points, and significantly outperforming the baseline method PointNet++ (82.4%) by 11.3 percentage points, indicating its ability to more accurately preserve the spatial connection relationships of critical equipment such as busbars and insulator strings. The two sub-figures collectively reveal the innovation of DI-PointNet—by enhancing local feature interactions through a dual-layer continuous transformer module and retaining multi-scale geometric features via an inverted residual structure, it achieves high classification accuracy (>92%) while significantly improving the shortcomings of traditional methods in reconstructing complex wiring topologies (with improvements ranging from 8% to 15%). Comparison of SA and TC of different methods.
Figure 9 compares the real-time rendering performance of different 3D reconstruction methods at 4K resolution, using FPS as the core metric. The results show that DI-PointNet significantly leads with a frame rate of 34.6 FPS, outperforming all comparison methods (PCT 27.3 FPS, RandLA-Net at 21.5 FPS, KPConv at 25.4 FPS, PointCNN at 22.7 FPS, and PointNet++ at 18.2 FPS), but also breaks through the 30 FPS real-time rendering threshold (marked by the red dashed line), making it the only method that meets real-time interaction requirements. Specifically, DI-PointNet’s rendering efficiency is 26.7% higher than the second-best PCT and nearly 90% higher than the traditional PointNet++ method. This performance advantage stems from its hierarchical key sampling strategy, which effectively reduces computational complexity, while optimizing memory management through joint point-volume representation. The grey-to-blue gradient bar chart in the figure visually demonstrates that as the algorithm architecture evolves (from the early PointNet++ to the latest DI-PointNet), the frame rate increases in a stepwise manner. DI-PointNet’s breakthrough real-time performance (>30 FPS) gives it unique application value in industrial scenarios requiring high fluidity, such as substation digital twins and remote inspections. Comparison of FPS of different methods.
Experimental Comparison: DI-PointNet versus PGFormer 2025.
The experimental results show that DI PoinNet outperforms PGFormer 2025 in all five core evaluation indicators, specifically in terms of GF, with a 12.5% reduction in Hausdorff distance (4.2 mm vs 4.8 mm), reflecting the excellent performance of the double-layer cascaded Transformer structure and hierarchical keypoint sampling strategy in complex surface reconstruction; TC increased by 3.8% (93.7% vs 90.3%), verifying the effectiveness of the shift window attention mechanism in modeling spatial connectivity relationships between devices; FPS increased by 19.7% (34.6 vs 28.9 FPS), thanks to the sparse dense partitioning strategy that optimized the computational complexity from O(n2) to O(n log n); SCR increased by 2.4% (92.4% vs 90.2%), indicating that point voxel joint representation better addresses occlusion and uneven point cloud density issues; The semantic quasi SA maintains a leading position (92.1% vs 91.5%) while achieving higher computational efficiency, fully demonstrating the technological breakthrough of this method in balancing accuracy and efficiency in the digital construction of power facilities.
Performance evaluation under varying point cloud densities.
Notably, even at the lowest density of 2.0 pts/cm2 (representing highly sparse scans), DI-PointNet retains 96.0% of its original SCR performance and 95.8% of SA accuracy, with only a marginal 23.8% increase in geometric error (GF: 5.2 mm vs 4.2 mm). This robustness can be attributed to our hierarchical key sampling strategy, which adaptively adjusts the sampling ratio based on point density, and the dual-layer transformer architecture that effectively propagates features across sparsely distributed points. The inverse residual MLP further enhances feature extraction stability through its bottleneck design that mitigates the impact of missing points. These results confirm that DI-PointNet maintains reliable performance across the practical range of scanning resolutions encountered in real-world substation monitoring applications.
Stratified performance evaluation under different weather conditions.
The experimental results indicate that under foggy conditions where point cloud quality is most degraded (average density reduction of 17.6%), DI-PointNet retains 96.4% of its clear-weather SCR performance and 96.7% of SA accuracy. The moderate performance degradation in geometric fidelity (GF: 5.1 mm vs 3.9 mm in clear weather) is primarily attributed to increased noise and occlusion effects in adverse weather. The model’s stability stems from several key design elements: the shifted window mechanism enhances feature consistency through cross-window attention, mitigating local occlusions caused by weather artifacts; the inverse residual MLP provides robust feature extraction despite point cloud sparsification; and the hierarchical sampling strategy adapts to density variations while preserving critical structural information. These results confirm that DI-PointNet maintains reliable semantic segmentation performance across the diverse weather conditions encountered in practical substation monitoring applications.
Scalability test results on large-scale point clouds.
Notably, even when processing 5M points (representing a 316% increase from the original dataset), DI-PointNet maintains real-time performance at 22.7 FPS while preserving 98.7% of its original SCR accuracy. The hierarchical key sampling strategy proves particularly effective in managing computational complexity, as the processing time scales approximately linearly (O (n log n)) rather than quadratically with point cloud size. The memory consumption remains manageable due to the sparse-dense partitioning mechanism, which reduces redundant computations by 43% compared to conventional approaches. These results confirm that DI-PointNet can handle the large-scale point clouds typically encountered in full-substation digital twin scenarios while maintaining both accuracy and real-time performance requirements. The stress testing demonstrates that our method remains viable even under the most demanding conditions expected in practical deployment environments.
Ablation experiment
Ablation experiment.
The ablation experiment results validated the technical contributions of each module of DI Point Net: introducing a single-layer Transformer on the basis of PointNet++baseline increased SCR by 3.6% (88.7% vs 85.1%) and reduced GF by 20.6% (5.4 mm vs 6.8 mm), proving that the self attention mechanism effectively enhances cross device feature interaction capability; After expanding to a double-layer Transformer, TC further increased by 3.2% (89.4% vs 86.2%), indicating that deep structure significantly improved long-range context modeling and topology perception; After adding the shift window mechanism, TC achieved a critical breakthrough (92.1%), GF was optimized to 4.5 mm, verifying the advantage of cross window information exchange in characterizing complex connection structures; The final use of inverse residual MLP instead of standard MLP further reduced GF by 7.1% (4.2 mm vs 4.5 mm) and increased FPS by 10.9% (34.6 vs 31.2), demonstrating the superiority of the inverse residual structure in balancing computational efficiency and geometric accuracy in point cloud feature extraction. The complete model is optimal in all indicators, which fully proves the synergistic effect and progressiveness technology of each module design.
Performance comparison of different window shift magnitudes.
The experimental results show that firstly, the shifting of the 0.5 unit achieved the best performance, with SCR of 92.4% and GF of 4.2 mm. Compared with the baseline without shifting, SCR increased by 2.1% and GF increased by 16.0%. Secondly, we observed symmetric performance patterns around a value of 0.5, where smaller (0.1, 0.3) and larger (0.7, 0.9) offsets resulted in poorer outcomes. This symmetry indicates that a 0.5 offset balances cross window information exchange optimally while maintaining computational efficiency. Compared to an unmoved window, a 0.5 offset increases the effective interaction range by 38% while introducing minimal computational overhead (reducing FPS by 1.3%). These results confirm that 0.5 unit shift is a robust design choice with good generality across different substation configurations and equipment types, providing the best balance between feature interaction capability and computational efficiency.
Limitations and future directions
Limitations
While DI-PointNet demonstrates state-of-the-art performance in substation scene reconstruction, several limitations warrant careful consideration:
Architectural Complexity: The integration of multiple novel components (DLCTransformer, hierarchical sampling, InvResMLP) introduces significant parameter overhead (5.9 M parameters), which may hinder deployment on resource-constrained edge devices. Although our method achieves real-time performance on high-end GPUs, its efficiency on mobile platforms remains unverified.
Multi-modal Integration: The framework currently operates solely on geometric point cloud data, neglecting potentially complementary information from RGB imagery, thermal data, or LiDAR intensity values that could enhance semantic understanding in complex scenarios.
Dynamic Scene Handling: The approach assumes static environments and does not address temporal consistency or dynamic object processing, limiting applicability to real-time monitoring scenarios with moving equipment or personnel.
Future directions
Based on these limitations, we identify several promising research directions:
Lightweight Architecture Design: Develop knowledge distillation and neural architecture search techniques to reduce computational overhead while maintaining performance, enabling deployment on mobile inspection platforms.
Multi-modal Fusion: Integrate cross-modal attention mechanisms to incorporate visual, thermal, and geometric features within a unified processing framework, enhancing robustness to lighting and weather variations.
Temporal Modeling: Extend the architecture to incorporate spatiotemporal transformers for processing 4D point cloud sequences, enabling applications in dynamic monitoring and predictive maintenance.
Self-Supervised Adaptation: Explore contrastive learning and self-supervised objectives to reduce annotation dependency, particularly for rare equipment types or novel substation configurations.
Conclusions and future work
Conclusions
To address the issue of constructing three-dimensional scenes of substations, this paper proposes the DI-PointNet algorithm, which achieves high-precision three-dimensional reconstruction of substation scenes through a joint point-volume representation model and a semantically driven mechanism. DI-PointNet integrates a two-layer continuous transformer module to enhance feature interaction, employs a hierarchical key sampling strategy to reduce computational complexity, and introduces an inverted residual module to optimize multi-scale feature extraction. Experiments on a 220 kV substation point cloud dataset demonstrate that this method significantly outperforms existing methods in key metrics such as scene completeness (92.4%), geometric fidelity (4.2 mm), semantic accuracy (92.1%), and other core metrics, while achieving 4K real-time rendering at a frame rate of 34.6 FPS, providing an efficient and reliable solution for substation digital twin systems.
Future work
Future work will focus on optimizing the robustness of DI-PointNet under extreme weather conditions, improving the quality of point cloud reconstruction under rain and fog by fusing multi-modal sensor data (e.g., infrared thermal imaging and visible light images), and exploring lightweight model deployment options to support real-time 3D reconstruction applications for mobile devices.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is funded by Research and Application of Interactive 3D Operation and Maintenance Platform for Smart Substations Based on Augmented Virtual Reality; the project number is SGMDJX00JJS1800357.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
