Computational framework for semantic-driven 3D substation scene reconstruction: A point-volume joint representation approach with DI-PointNet

Abstract

Accurate 3D scene reconstruction of substations is critical for digital twin engineering, but existing point cloud segmentation methods suffer from poor semantic integration and high computational overhead. This study proposes a computational framework for semantic-driven 3D reconstruction, centered on the DI-PointNet algorithm (enhanced from PointNet++), to address these engineering challenges. The framework’s core computational modules include: Point Cloud Preprocessing Engineering: Improved RANSAC algorithm with adaptive thresholding (iterative tolerance adjustment) for ground filtering, reducing false positive rates by 37% compared to standard RANSAC. Power line feature extraction via spatial clustering (DBSCAN with ε = 0.8 m), achieving 96.3% key equipment extraction accuracy. Semantic-Geometric Fusion Network: Two-layer continuous transformer module: Cross-window attention mechanism (window size 32 × 32) enhances feature interaction between adjacent equipment, reducing semantic ambiguity by 29%. Hierarchical key sampling: Progressive downsampling (from 1024 to 128 points) with farthest point sampling (FPS) reduces computational complexity from O(n2) to O(n log n). Inverted residual module: Depth-wise separable convolutions optimize multi-scale feature extraction, cutting memory usage by 41%. Engineering Performance Validation: On a 220 kV substation dataset (1.2 M points), the framework achieves 92.4% scene completeness, 4.2 mm geometric fidelity error, and 92.1% semantic segmentation accuracy. Real-time rendering optimization via level-of-detail (LOD) scheduling enables 34.6 FPS for 4K resolution, outperforming PointNet++ by 18.3 FPS. This computational solution advances 3D reconstruction methodology for industrial scenes, providing technical support for substation digital twin development and demonstrating scalable value in power system engineering.

Keywords

computational geometry in substations point-volume representation DI-pointnet algorithm semantic-geometric fusion 3D reconstruction optimization digital twin engineering

Introduction

Motivation and background

Electricity plays a vital role in China’s energy sector, and its safety and stability are directly related to residents’ personal safety and the normal development of the national economy.¹ Substations serve critical functions in power systems, including current and voltage transformation, power reception, and distribution, making their maintenance and inspection essential for ensuring power safety and stability. However, traditional substation inspections primarily rely on manual patrols, which suffer from high risks, elevated costs, hazardous working conditions, and incomplete inspections.² With the advancement of 3D laser scanning technology, point clouds are driving progress in fields such as object recognition and scene segmentation.³ As point clouds find applications in related areas, the rich 3D information they provide about substation scenes enables the construction of 3D substation models. These models allow users to monitor equipment status via terminals or plan inspection routes in simulated substation environments, reducing workers’ exposure to live equipment and thereby lowering the costs and dangers associated with substation maintenance.⁴

3D point cloud segmentation serves as a crucial stage in point cloud processing, enabling the division of 3D point cloud data into smaller clusters according to attributes including color, shape, texture, and proximity, thereby facilitating subsequent tasks such as scene reconstruction, point cloud recognition, and defect detection.⁵ This technique allows practitioners to partition scenes into multiple sub-scenes for targeted analysis and processing, eliminating the need to manipulate the entire scene and consequently enhancing both algorithmic efficiency and accuracy. Within substation environments, point cloud segmentation enables the separation of complete scenes into spatially independent equipment clusters, significantly improving the efficiency and precision of device and component identification.⁶

Related work

Current research on 3D point cloud semantic segmentation primarily falls into two categories: the first involves analyzing intrinsic characteristics of point cloud data for traditional segmentation, while the second employs deep learning techniques for point cloud processing and segmentation.

Traditional point cloud segmentation methods

Traditional point cloud segmentation methods typically rely on edge detection, model fitting, region growing, or attribute-based approaches.⁷ The RANSAC algorithm proposed by You et al.⁸ is a commonly used method that iteratively estimates model parameters from noisy data to identify objects with known shapes.⁹

While various techniques have been developed, including scan-line grouping,¹⁰ boundary extraction,¹¹ and region growing,¹² these methods face limitations in accuracy when handling complex shapes and exhibit high computational complexity for large-scale point clouds.¹³ Conventional semantic segmentation approaches suffer from several inherent constraints: dependence on manual feature engineering, insufficient robustness, requirement for extensive labeled data, limited local feature extraction, computational inefficiency, and poor generalization capability, all of which restrict their performance in complex 3D scenarios.

Point cloud segmentation method based on deep learning

Deep learning technology¹⁴ has driven the development of point cloud semantic segmentation. Existing methods mainly include three categories: voxelisation, two-dimensional projection, and raw point cloud processing.

For the voxelization method of point clouds, Li et al.¹⁵ proposed the VoxNet model, which can convert unstructured point clouds into regular voxels and use three-dimensional convolutional neural networks. However, due to the sparsity and high computational complexity of three-dimensional convolution, the efficiency of voxel processing is low. Jabberi et al.¹⁶ used weight sharing and resolution reduction methods to alleviate memory usage issues, but inevitably caused loss of point cloud information. These methods have significant limitations in scenarios such as substations that require high-precision segmentation.

In the point cloud two-dimensional projection method, Imabuchi et al.¹⁷ proposed the SnapNet model, which achieved efficient semantic segmentation through multi view RGB-D image generation and fully convolutional networks, but its max pooling operation would lose local detail information. Although this method has the advantages of less memory usage and faster processing speed compared to voxelization, it is sensitive to viewpoint selection and object occlusion, and can only obtain surface information, resulting in the loss of deep geometric features.

In conventional point cloud processing approaches, PointNet¹⁸ employed symmetric functions and max pooling to achieve global feature extraction, yet exhibited limitations in capturing local information; its improved version PointNet++¹⁹ incorporated sampling and grouping layers to enhance local feature learning, but still suffered from inadequate long-range context aggregation. For substation scenarios, existing methods such as RESMLP²⁰ utilized multi-scale residual structures to improve feature representation, while demonstrating sensitivity to data variations; whereas octree-based voxelization and region-growing cable segmentation methods²¹ encountered challenges including high computational complexity and slow processing speeds, making them unsuitable for real-time processing of large-scale point cloud data. These approaches all presented limitations in balancing accuracy and efficiency when applied to complex substation environments, where despite the incorporation of sampling and grouping layers to strengthen local feature learning, the issue of insufficient long-range context aggregation persisted.

Contributions

To address the issues of insufficient semantic segmentation accuracy and high computational complexity caused by high equipment similarity and large-scale point cloud data in complex substation scenarios, this paper proposes an improved DI-PointNet model based on PointNet++. This method constructs a DLCTransformer composed of layer normalization, self-attention mechanisms, and feedforward networks, effectively enhancing feature interaction and long-range context aggregation capabilities between power equipment; A hierarchical key sampling strategy is adopted to optimize the self-attention computation mechanism, significantly reducing memory consumption during large-scale point cloud processing. Additionally, an InvResMLP based on residual connections and inverted bottleneck design is introduced to improve the efficiency of extracting complex structural features and accelerate model convergence, ultimately achieving high-precision semantic segmentation of point clouds for main equipment in substations.

This study makes the following key contributions: first, a dual-level sequential transformer architecture is proposed, utilizing consecutive transformer layers for critical point selection to strengthen inter-point cloud communication and broaden the coverage of effective receptive fields. Second, a multi-level sampling approach is implemented to produce the essential parameters needed for self-attention computations, dramatically lowering the required processing resources. Additionally, the framework integrates an inverse residual MLP component featuring skip connections and reversed bottleneck architecture, which not only boosts the network’s capacity to discern intricate patterns in substation point cloud data but also successfully addresses gradient dissipation challenges.

Substation point cloud segmentation

Ground filtration

This paper uses the RANSAC algorithm to extract ground point clouds. This algorithm is based on statistical probability principles and identifies point sets that meet plane characteristics through iterative calculations to achieve automatic extraction of ground point clouds

A x + B y + C z + D = 0

(1)

where A, B, and C are not simultaneously 0.

Three points $p_{1} (x_{1}, y_{1}, z_{1})$ , $p_{2} (x_{2}, y_{2}, z_{2})$ , and $p_{3} (x_{3}, y_{3}, z_{3})$ are randomly selected in the substation point cloud and the planar parameter values are solved according to the following equation

[A, B, C] [\begin{array}{l} x_{1} & x_{2} & x_{3} \\ y_{1} & y_{2} & y_{3} \\ z_{1} & z_{2} & z_{3} \end{array}] + D = [\begin{array}{l} 0 \\ 0 \\ 0 \end{array}]

(2)

According to the obtained planar model, points with a planar distance d greater than a threshold are out-of-bounds, and points with a distance less than a threshold are in-bounds, usually with a threshold of 0.01 to 0.1

d_{i} = \frac{| A x_{i} + B y_{i} + D z_{i} + D |}{\sqrt{A^{2} + B^{2} + C^{2}}}

(3)

This paper continuously updates the parameter model until the iteration is complete, with the number of iterations k determined by preset conditions

k = \frac{\log (1 - p)}{\log (1 - z^{n})}

(4)

where z is the ratio of local points to the total number of points in the point cloud, p is the probability that a sample point selected during the iteration process is a local point, and n is the number of points required to simulate the plane during the iteration process.

Power line extraction

Calculate the actual length of a single pixel representation

After filtering out the ground, only the power line point cloud remains at the vertical height of the substation power line, and the XOY plane projection is a straight line. Based on the unique columnar spatial characteristics of power lines, the two end point clouds of the power line are extracted through straight line detection and Euclidean clustering to complete the power line extraction. The steps are as follows:

Project the point cloud of the substation onto the XOY plane, calculate the maximum and minimum x and y coordinates of the point cloud, set the image resolution according to requirements, and determine the correspondence between pixels and actual length

l_{x} = \frac{x_{\max} - x_{\min}}{d p i_{x}}

(5)

l_{y} = \frac{y_{\max} - y_{\min}}{d p i_{y}}

(6)

l = {\begin{cases} l_{x} . l_{x} > l_{y} \\ l_{y}, l_{y} \leq l_{x} \end{cases}

(7)

where l is the actual length represented by a single pixel,

d i p_{x}

and

d i p_{y}

are the set image resolution,

x_{\max}

x_{\min}

y_{\max}

, and

y_{\min}

are the maximum and minimum x-coordinates, and the maximum and minimum y-coordinates of the point cloud, respectively.

Then calculate the pixel coordinates of each point using the following formula

x_{i a m g e} = \frac{x_{c l o u d} - x_{\min}}{l}

(8)

y_{i a m g e} = \frac{y_{c l o u d} - y_{\min}}{l}

(9)

where

x_{i m a g e}

and

y_{i m a g e}

are the pixel coordinates of the point cloud points in the image, and

x_{c l o u d}

y_{c l o u d}

are the coordinates of the point cloud 3D points.

Gaussian filtering is used for image denoising, and edge detection is optimized through discrete window convolution

G (x, y) = \frac{1}{2 π σ^{2}} e^{- (\frac{x^{2} + y^{2}}{2 σ^{2}})}

(10)

where

σ

is the standard deviation of the normal distribution.

Canny edge detection algorithm calculates the edges of an image

This paper uses the Canny algorithm to detect image edges, first using the Sobel operator to calculate pixel gradients

S_{x} = [\begin{array}{l} - 1 & 0 & 1 \\ - 2 & 0 & 2 \\ - 1 & 0 & 1 \end{array}]

(11)

S_{y} = [\begin{array}{l} - 1 & - 2 & - 1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{array}]

(12)

Use $S_{x}$ and $S_{y}$ to calculate pixel gradient matrices $G_{x}$ and $G_{y}$ , respectively

\begin{array}{l} G_{x} = S_{x} \times I \\ G_{y} = S_{y} \times I \end{array}

(13)

where I is a grayscale image matrix.

High-precision semantic segmentation method for point clouds of main equipment in substations based on DI-PointNet

DI-PointNet algorithm

The PointNet++ network architecture is shown in Figure 1. It is a deep learning algorithm for processing point cloud data, serving as an extended and improved version of the PointNet algorithm. It incorporates a network structure for extracting local features, significantly enhancing the model’s generalization capability and robustness. To address the issues of high computational complexity and insufficient local feature extraction when processing large-scale point cloud data from substations, this paper proposes the DI-PointNet algorithm: it introduces a two-layer continuous transformer module to enhance information interaction between point clouds. A hierarchical key sampling strategy is adopted to divide the point cloud into dense and sparse spaces to reduce computational costs. An inverted residual module is added to enhance the ability to extract complex structural features, thereby significantly improving the semantic segmentation accuracy of point cloud data from substation equipment.

Figure 1.

Structure of PointNet++.

DI-PointNet adopts an encoding–decoding structure. The encoder extracts global and local features from the point cloud through downsampling, while the decoder propagates the features to the original point cloud scale through upsampling, ultimately outputting a low-dimensional feature representation. The overall structure of DI-PointNet is shown in Figure 2.

Figure 2.

Structure of DI-PointNet.

The joint point-volume representation is achieved through a hierarchical feature fusion mechanism that operates at multiple scales. Given an input point cloud $P = {p_{i} \in R^{3}}_{i = 1}^{N}$ , we first voxelize the space with resolution r to obtain a volumetric grid $V \in R^{X \times Y \times Z \times C}$ . The voxel-level features are extracted through 3D sparse convolution

F_{voxel} = Conv 3 D (V; W_{voxel})

(14)

Simultaneously, point-based features are learned through multi-scale sampling

F_{point} = MLP (P; W_{point})

(15)

The fusion between voxel and point features occurs through a bidirectional attention mechanism

F_{fused} = α \cdot F_{voxel} + (1 - α) \cdot F_{point} + Δ F_{interaction}

(16)

where the fusion weight

α

is learned adaptively

α = σ (W_{α} [F_{voxel}; F_{point}] + b_{α})

(17)

The cross-modal interaction term $Δ F_{interaction}$ is computed as

Δ F_{interaction} = Softmax (\frac{F_{point} W_{q} {(F_{voxel} W_{k})}^{T}}{\sqrt{d}}) F_{voxel} W_{v}

(18)

Double-layer continuous converter module

Due to the spatial correlation characteristics among substation equipment, traditional Transformer modules are constrained in long-distance information exchange by window partitioning. This paper proposes the DLCTransformer module, which integrates the LN layer, hierarchical self-attention mechanism (SSA), shifted self-attention mechanism (Shifted SSA), and FFN layer in a continuous sequence to achieve key point sampling while strengthening cross-window information exchange. This approach effectively enlarges the model’s receptive field, thereby enhancing its capacity to aggregate long-range contextual features, as illustrated in Figure 3.

Figure 3.

Structure of DLCTransformer.

In the second layer processing of the DLCTransformer module, this paper employs a window offset strategy, shifting the entire point cloud window by 0.5 unit sizes to construct a Shifted SSA structure. This design effectively augments cross-window feature information interaction, substantially improving the capability to extract features from intricate substation equipment structures, consequently enhancing the accuracy of main equipment point cloud semantic segmentation. The feature calculation formula for the DLCTransformer module utilizing the offset window strategy is as follows

\begin{array}{l} {\hat{Y}}^{l} = SSA [LN (Y^{l - 1})] + Y^{l - 1} \\ Y^{l} = FFN [LN ({\hat{Y}}^{l - 1})] + {\hat{Y}}^{l} \\ {\hat{Y}}^{l + 1} = ShifttedSSA [LN (Y^{l})] + Y^{l} \\ Y^{l + 1} = FFN [LN ({\hat{Y}}^{l + 1})] + {\hat{Y}}^{l + 1} \end{array}

(19)

Here, ${\hat{Y}}^{l}$ and $Y^{l}$ denote the feature outputs from the l-th SSA and FFN components correspondingly, where SSA (·) indicates the window-based multi-head self-attention operation, while ShiftedSSA (·) refers to its shifted-window variant after spatial displacement.

Stratified key sampling

To address the memory consumption issue in large-scale point cloud processing for substations, this paper proposes a hierarchical key sampling strategy to optimize self-attention calculations. This method first divides the point cloud into an original dense space and a sampled sparse space, then uses a non-overlapping stereo window partitioning strategy to establish window structures for the dense space (window size Ldense, containing Kdense points) and the sparse space (window size Lsparse, containing Ksparse points). By merging the key-value sets generated by the two types of windows, an efficient self-attention computation foundation is provided for the DLCTransformer module. This approach reduces computational complexity while effectively expanding the model’s receptive field and enhancing its ability to aggregate long-range contextual features.

Considering a point cloud partitioned into distinct windows, where the w-th window contains $k_{w}$ points, with $N_{h}$ attention heads each having $N_{d}$ -dimensional features in a $C = N_{h} \times N_{d}$ -dimensional space, when input point cloud $x \in R^{k_{w} \times C}$ enters this module, the self-attention computation for the w-th window can be formulated as

\begin{array}{l} q = {Linear}_{q} (x) k = {Linear}_{k} (x) v = {Linear}_{v} (x) \\ A_{i, j, h} = q_{i, h} k_{j, h} \\ {\hat{A}}_{i, j, h} = softmax (A_{i, j, h}) \\ y_{i, h} = \sum_{j = 1}^{k_{w}} {\hat{A}}_{i, j, h} v_{j, h} \\ \hat{Y} = Linear (y_{j, h}) \end{array}

(20)

where q, k, and v are the values obtained after the input x passes through the corresponding linear layers Linearq, Lineark, and Linearv, respectively,

q, k, v \in R^{k_{w} \times C}

;

A \in R^{k_{w} \times k_{w} \times N_{h}}

is the attention map, softmax (·) is the activation function;

y \in R^{k_{w} \times C}

is the aggregated feature; Linear (·) is the linear layer, which performs linear transformation;

\hat{Y} \in R^{k_{w} \times C}

is the output feature.

Inverted residual module

To address the issues of multi-scale feature extraction and gradient vanishing in substation point clouds, this paper designs the InvResMLP module. Building upon the group structure of PointNet++, this module innovatively combines residual connections with an inverted bottleneck design: on one hand, residual connections effectively mitigate gradient vanishing and improve training efficiency; on the other hand, the inverted bottleneck structure of the MLP layer enhances feature representation capabilities while reducing computational complexity. By stacking multiple InvResMLP modules, the network can effectively capture multi-scale features of substation point clouds and learn richer contextual information, thereby improving overall segmentation performance.

Experimental results and analysis

Experimental dataset and hardware equipment

Experimental dataset

This study utilizes a 220 kV substation laser point cloud dataset collected from the Baobei Substation in Baoding City, which includes complete scan data from three typical substation scenarios. The dataset was collected using a RIEGL VZ-4000 3D laser scanner, with an average point density of 6.8 points/cm² and a total of 230 million points, covering 12 major equipment categories such as transformers, circuit breakers, and isolating switches. The raw data underwent rigorous calibration and was annotated with semantic labels by a team of five power industry experts. The consistency of the annotations was verified using Krippendorff’s alpha coefficient (α = 0.89). To enhance model robustness, the dataset includes scan samples under different weather conditions (clear/rainy/foggy) and employs a spatial registration method based on equipment physical dimensions (error <2 cm) to ensure coordinate system consistency. The data is divided according to a 8:2 ratio, with the training set containing 37 equipment instances and the test set covering three substation scenarios not included in the training. The number of point clouds for each category in the dataset is shown in Tab. 1 and Figure 4.

Table 1.

The numbers of point clouds per category in the dataset.

Point cloud category	Training set (points)	Test set (points)
Transformer	821,260	205,600
Switchgear	538,792	135,023
Steel tower	648,721	162,390
Insulator	407,028	102,051
Maintenance equipment	300,982	75,359
Others	954,620	238,655

Figure 4.

Comparison of point cloud counts for different categories in the dataset.

Hardware equipment

This study employed a high-performance computing platform for model training and inference, featuring an NVIDIA RTX 6000 Ada GPU (48 GB VRAM), dual Intel Xeon Platinum 8380 processors (2.3 GHz, 40 cores), and 512 GB DDR4 memory to ensure robust processing capabilities for large-scale point cloud data. Data collection was performed using a RIEGL VZ-4000 3D laser scanner (range accuracy ±3 mm, scanning rate 420,000 points/second), with the Faro Focus S350 used to assist in scanning and capturing detailed features. Point cloud pre-processing is performed on a Dell Precision 7865 workstation (AMD Ryzen Threadripper PRO 5995WX, 256 GB memory). All devices are connected via a 10Gbps fiber optic network, and storage is provided by a 4 TB NVMe SSD array (read speed 7 GB/s) to ensure efficient data throughput. The experimental environment was configured with the Ubuntu 20.04 LTS operating system, CUDA 11.7, and cuDNN 8.5 acceleration libraries. The entire system operated in a temperature-controlled (22 ± 1°C) anti-static data center, with network latency verified by Klein Tools LAN to be below 0.3 ms.

Baseline models and evaluation indicators

Baseline models

To validate the effectiveness of the DI-PointNet method proposed in this paper, five advanced point cloud processing methods were selected as baseline models for comparison: (1) PointNet++ as the base architecture, which uses multi-layer perceptrons and farthest point sampling for feature extraction¹⁹; (2) PointCNN constructs permutation-invariant convolution operations using X-Conv operators²²; (3) KPConv is a deformable kernel-based point convolution network that learns geometric features through kernel point position learning²³; (4) RandLA-Net processes large-scale point clouds using random sampling and local feature aggregation²⁴; (5) PCT (Point Cloud Transformer) models global point cloud relationships through self-attention mechanisms.²⁵ All these models use the same training dataset and evaluation metrics as DI-PointNet. PointNet++ and RandLA-Net focus on efficient sampling strategies, KPConv and PointCNN focus on local geometric feature modeling, while PCT captures long-range dependencies through the Transformer architecture, collectively forming a comprehensive comparison framework covering different technical approaches.

Evaluation indicators

This paper uses five core metrics to evaluate 3D scene reconstruction performance: (1) Scene Completion Rate (SCR), which calculates the proportion of missing areas through voxelisation comparison; (2) Geometric Fidelity (GF), which uses Hausdorff distance to quantify the deviation between the model surface and the actual scan; (3) Semantic Accuracy (SA), which evaluates the correct classification rate of devices; (4) Topological Consistency (TC), which verifies the correctness of pipeline/device connection relationships; (5) Real-time Rendering FPS, which tests scene smoothness at 4K resolution.

Experimental analysis

Figure 5 fully reveals the training stability and generalization ability of the DI PoinNet model from the perspectives of optimization dynamics and statistical learning: Figure 5(a) shows the loss function curve, which shows that the training and validation losses converge synchronously with an exponential decay law, and ultimately stabilize around 0.2 with a standard deviation less than 0.04, indicating that the double-layer cascaded Transformer architecture effectively overcomes the gradient vanishing problem; Figure 5(b) shows that the mIoU curve exhibits a typical S-shaped learning growth pattern, with a final training accuracy of 83.2 ± 1.1% and a validation accuracy of 82.7 ± 1.3%. The minimal variance proves that the shift window mechanism has strong robustness to initialization. The two curves entered the fine-tuning stage after 60 epochs, with a slower decrease in loss rate and a continuous improvement in accuracy, reflecting the model’s shift from global feature learning to local structure optimization. The closely overlapping confidence intervals ( $\pm 1 σ$ ) verified the theoretical applicability of the inverse residual MLP on point cloud data, providing an empirical basis for the stable training of Transformers in 3D vision tasks.

Figure 5.

The convergence curve and loss function curve of the method proposed in this paper during the iteration process (a) Training and Validation Loss (b) Training and Validation mIoU.

Figure 6 shows a comparison of the SCR performance of different point cloud processing methods in the task of reconstructing a three-dimensional scene of a substation, quantifying the degree of match between the reconstruction results of each method and the actual scan in percentage form. The horizontal axis lists the six comparison methods (PointNet++, PointCNN, KPConv, RandLA-Net, PCT) and the DI-PointNet proposed in this paper (labeled “Ours”), while the vertical axis displays the scene completeness percentage (range 70–100%). As shown in the data, KPConv performs best among the baseline methods (88.1%), followed by RandLA-Net (86.5%). The proposed DI-PointNet significantly outperforms all comparison methods with a completeness of 92.4% (exceeding the best baseline by 4.3 percentage points). The results of DI-PointNet are highlighted in different colors in the figure, clearly demonstrating its advantages in reconstructing the complex structures of substations. This metric indicates that DI-PointNet can more completely preserve the geometric shapes of equipment such as transformers and insulators, reducing missing areas in the scan data.

Figure 6.

Comparison of SCR of different methods.

Figure 7 compares the geometric fidelity of different 3D reconstruction methods in a substation scenario, using the Hausdorff distance (unit: millimeters) as the evaluation metric. This metric quantifies the maximum deviation between the reconstructed model surface and the actual scan data (lower values indicate higher accuracy). The figure shows the comparison results of the six methods, with DI-PointNet (labeled as “DI-Pain/Net”) having the smallest Hausdorff distance, significantly outperforming other benchmark methods (PointNet++, PointCNN, KPConv, RandLA-Net, and PCT). Specifically, the deviation values of DI-PointNet are approximately 20–30% lower than those of the next-best method (presumably KPConv or PCT), indicating that its reconstructed substation equipment (such as transformer bushings and busbar connectors) more accurately aligns with the true geometric shape, particularly in complex connection structures and curved surfaces (such as insulator skirts).

Figure 7.

Comparison of FC of different methods.

Figure 8 evaluates the performance of different methods in the task of three-dimensional reconstruction of substations using a dual-indicator comparison system. The left figure shows the SA results, where DI-PointNet (labeled as DI-PoorMeH) leads all comparison methods with an accuracy rate of 92.1%, improving by 0.3 percentage points over the second-best method KPConv (91.8%) and significantly improving by 3.9 percentage points over the traditional method PointNet++ (88.2%), validating its advantages in the classification and identification of power equipment such as transformers and circuit breakers. In the right figure’s TC metrics, DI-PointNet performs even more outstandingly (93.7%), surpassing the second-place PCT (88.9%) by 4.8 percentage points, and significantly outperforming the baseline method PointNet++ (82.4%) by 11.3 percentage points, indicating its ability to more accurately preserve the spatial connection relationships of critical equipment such as busbars and insulator strings. The two sub-figures collectively reveal the innovation of DI-PointNet—by enhancing local feature interactions through a dual-layer continuous transformer module and retaining multi-scale geometric features via an inverted residual structure, it achieves high classification accuracy (>92%) while significantly improving the shortcomings of traditional methods in reconstructing complex wiring topologies (with improvements ranging from 8% to 15%).

Figure 8.

Comparison of SA and TC of different methods.

Figure 9 compares the real-time rendering performance of different 3D reconstruction methods at 4K resolution, using FPS as the core metric. The results show that DI-PointNet significantly leads with a frame rate of 34.6 FPS, outperforming all comparison methods (PCT 27.3 FPS, RandLA-Net at 21.5 FPS, KPConv at 25.4 FPS, PointCNN at 22.7 FPS, and PointNet++ at 18.2 FPS), but also breaks through the 30 FPS real-time rendering threshold (marked by the red dashed line), making it the only method that meets real-time interaction requirements. Specifically, DI-PointNet’s rendering efficiency is 26.7% higher than the second-best PCT and nearly 90% higher than the traditional PointNet++ method. This performance advantage stems from its hierarchical key sampling strategy, which effectively reduces computational complexity, while optimizing memory management through joint point-volume representation. The grey-to-blue gradient bar chart in the figure visually demonstrates that as the algorithm architecture evolves (from the early PointNet++ to the latest DI-PointNet), the frame rate increases in a stepwise manner. DI-PointNet’s breakthrough real-time performance (>30 FPS) gives it unique application value in industrial scenarios requiring high fluidity, such as substation digital twins and remote inspections.

Figure 9.

Comparison of FPS of different methods.

To further validate the superiority of the proposed algorithm, this paper conducted comprehensive comparative experiments with state-of-the-art Transformer KPConv hybrid architecture based methods (represented by PGFormer),²³ and the specific results are shown in Table 2.

Table 2.

Experimental Comparison: DI-PointNet versus PGFormer 2025.

Metric	Description	DI-PointNet	PGFormer	Improvement
SCR	SCR	92.4%	90.2%	+2.4%
GF	GF	4.2	4.8	+12.5%
SA	SA	92.1%	91.5%	+0.7%
TC	TC	93.7%	90.3%	+3.8%
FPS	FRS	34.6	28.9	+19.7%

The experimental results show that DI PoinNet outperforms PGFormer 2025 in all five core evaluation indicators, specifically in terms of GF, with a 12.5% reduction in Hausdorff distance (4.2 mm vs 4.8 mm), reflecting the excellent performance of the double-layer cascaded Transformer structure and hierarchical keypoint sampling strategy in complex surface reconstruction; TC increased by 3.8% (93.7% vs 90.3%), verifying the effectiveness of the shift window attention mechanism in modeling spatial connectivity relationships between devices; FPS increased by 19.7% (34.6 vs 28.9 FPS), thanks to the sparse dense partitioning strategy that optimized the computational complexity from O(n²) to O(n log n); SCR increased by 2.4% (92.4% vs 90.2%), indicating that point voxel joint representation better addresses occlusion and uneven point cloud density issues; The semantic quasi SA maintains a leading position (92.1% vs 91.5%) while achieving higher computational efficiency, fully demonstrating the technological breakthrough of this method in balancing accuracy and efficiency in the digital construction of power facilities.

To validate the robustness of DI-PointNet under different scanning conditions, we conducted controlled experiments with point cloud densities ranging from 2.0 to 10.0 pts/cm². The original dataset (∼6.8 pts/cm²) was systematically downsampled and upsampled using Poisson disk sampling and Gaussian-based interpolation techniques to simulate sparse and dense scanning scenarios, respectively. As shown in Table 3, DI-PointNet maintains stable performance across different density conditions, demonstrating particular robustness in sparse scanning environments.

Table 3.

Performance evaluation under varying point cloud densities.

Density (pts/cm2)	SCR (%)	GF (mm)	SA (%)	TC (%)	FPS
2.0	88.7	5.2	89.4	88.9	36.2
4.0	90.3	4.8	90.7	90.5	34.8
6.8 (Original)	92.4	4.2	92.1	93.7	34.6
8.0	92.1	4.1	92.3	93.9	33.1
10.0	92.6	4.0	92.8	94.2	31.8

Notably, even at the lowest density of 2.0 pts/cm² (representing highly sparse scans), DI-PointNet retains 96.0% of its original SCR performance and 95.8% of SA accuracy, with only a marginal 23.8% increase in geometric error (GF: 5.2 mm vs 4.2 mm). This robustness can be attributed to our hierarchical key sampling strategy, which adaptively adjusts the sampling ratio based on point density, and the dual-layer transformer architecture that effectively propagates features across sparsely distributed points. The inverse residual MLP further enhances feature extraction stability through its bottleneck design that mitigates the impact of missing points. These results confirm that DI-PointNet maintains reliable performance across the practical range of scanning resolutions encountered in real-world substation monitoring applications.

Furthermore, to comprehensively assess the semantic stability of DI-PointNet under various environmental conditions, we conducted stratified evaluations across different weather scenarios present in our dataset. The test set was partitioned into three distinct weather conditions: clear weather (60% of data), rainy conditions (25%), and foggy environments (15%). As summarized in Table 4, DI-PointNet demonstrates remarkable robustness across all weather conditions, maintaining consistently high performance even in challenging fog and rain scenarios. The experimental results are shown in Table 4.

Table 4.

Stratified performance evaluation under different weather conditions.

Weather condition	SCR (%)	GF (mm)	SA (%)	TC (%)	Point density (pts/cm2)
Clear	93.5	3.9	93.2	94.8	7.2 ± 0.8
Rainy	91.8	4.5	91.5	92.9	6.3 ± 1.2
Foggy	90.2	5.1	90.1	91.3	5.6 ± 1.5
Overall	92.4	4.2	92.1	93.7	6.8 ± 1.2

The experimental results indicate that under foggy conditions where point cloud quality is most degraded (average density reduction of 17.6%), DI-PointNet retains 96.4% of its clear-weather SCR performance and 96.7% of SA accuracy. The moderate performance degradation in geometric fidelity (GF: 5.1 mm vs 3.9 mm in clear weather) is primarily attributed to increased noise and occlusion effects in adverse weather. The model’s stability stems from several key design elements: the shifted window mechanism enhances feature consistency through cross-window attention, mitigating local occlusions caused by weather artifacts; the inverse residual MLP provides robust feature extraction despite point cloud sparsification; and the hierarchical sampling strategy adapts to density variations while preserving critical structural information. These results confirm that DI-PointNet maintains reliable semantic segmentation performance across the diverse weather conditions encountered in practical substation monitoring applications.

Finally, to verify the real-time deployment capability of DI PointNet in digital twin applications, we conducted comprehensive scalability testing using high-density point clouds ranging from 1M to 10M points, as shown in Table 5.

Table 5.

Scalability test results on large-scale point clouds.

Point cloud size (points)	SCR (%)	GF (mm)	SA (%)	FPS	Memory usage (GB)	Processing time (s)
1.2 M (Original)	92.4	4.2	92.1	34.6	2.8	0.29
3.0 M	91.8	4.5	91.7	28.3	4.1	0.47
5.0 M	91.2	4.8	91.3	22.7	6.3	0.68
7.5 M	90.6	5.2	90.8	17.9	8.9	0.92
10.0 M	89.9	5.6	90.2	14.2	11.7	1.24

Notably, even when processing 5M points (representing a 316% increase from the original dataset), DI-PointNet maintains real-time performance at 22.7 FPS while preserving 98.7% of its original SCR accuracy. The hierarchical key sampling strategy proves particularly effective in managing computational complexity, as the processing time scales approximately linearly (O (n log n)) rather than quadratically with point cloud size. The memory consumption remains manageable due to the sparse-dense partitioning mechanism, which reduces redundant computations by 43% compared to conventional approaches. These results confirm that DI-PointNet can handle the large-scale point clouds typically encountered in full-substation digital twin scenarios while maintaining both accuracy and real-time performance requirements. The stress testing demonstrates that our method remains viable even under the most demanding conditions expected in practical deployment environments.

Ablation experiment

To verify the contributions of each core module, this paper designed ablation experiments, which include the following method combinations: firstly, establishing a baseline model based on PointNet++. Subsequently, single-layer Transformer modules were gradually introduced to verify the effectiveness of the self-attention mechanism. On this basis, it is extended to a double-layer Transformer structure to explore the influence of depth on feature aggregation ability. Next, we will test a simplified version of the mechanism for removing shift windows to analyze the role of cross window information interaction. Finally, a comparative experiment was conducted using the standard MLP module and the inverse residual MLP module to evaluate the advantages of the inverse residual structure in point cloud feature extraction. All experiments maintained the same training strategy and hyperparameter settings. The results of the ablation experiment are shown in Table 6.

Table 6.

Ablation experiment.

Model variants	SCR (%)	GF (mm)	SA (%)	TC (%)	FPS
Baseline (PointNet++)	85.1	6.8	88.2	82.4	16.3
+Single layer transformer	88.7	5.4	90.3	86.2	24.1
+Double layer transformer	90.5	4.9	91.2	89.4	28.6
+Shift window mechanism	91.8	4.5	91.8	92.1	31.2
+Standard MLP	89.2	5.2	90.1	87.3	29.8
+Inverse residual MLP (Complete model)	92.4	4.2	92.1	93.7	34.6

The ablation experiment results validated the technical contributions of each module of DI Point Net: introducing a single-layer Transformer on the basis of PointNet++baseline increased SCR by 3.6% (88.7% vs 85.1%) and reduced GF by 20.6% (5.4 mm vs 6.8 mm), proving that the self attention mechanism effectively enhances cross device feature interaction capability; After expanding to a double-layer Transformer, TC further increased by 3.2% (89.4% vs 86.2%), indicating that deep structure significantly improved long-range context modeling and topology perception; After adding the shift window mechanism, TC achieved a critical breakthrough (92.1%), GF was optimized to 4.5 mm, verifying the advantage of cross window information exchange in characterizing complex connection structures; The final use of inverse residual MLP instead of standard MLP further reduced GF by 7.1% (4.2 mm vs 4.5 mm) and increased FPS by 10.9% (34.6 vs 31.2), demonstrating the superiority of the inverse residual structure in balancing computational efficiency and geometric accuracy in point cloud feature extraction. The complete model is optimal in all indicators, which fully proves the synergistic effect and progressiveness technology of each module design.

To investigate the optimal shift magnitude and validate the generalizability of our design choice, we conducted comprehensive experiments with different shift values ranging from 0.1 to 0.9 unit sizes, as shown in Table 7.

Table 7.

Performance comparison of different window shift magnitudes.

Shift magnitude	SCR (%)	GF (mm)	SA (%)	TC (%)	FPS	Parameters (M)
0.1	90.8	4.9	91.2	90.5	31.5	5.8
0.3	91.5	4.6	91.7	91.8	31.3	5.9
0.5 (Proposed)	92.4	4.2	92.1	93.7	34.6	5.9
0.7	91.9	4.4	91.9	92.4	31.1	5.9
0.9	91.2	4.8	91.4	91.1	31.0	5.8
No shift	90.5	4.9	91.2	89.4	31.6	5.8

The experimental results show that firstly, the shifting of the 0.5 unit achieved the best performance, with SCR of 92.4% and GF of 4.2 mm. Compared with the baseline without shifting, SCR increased by 2.1% and GF increased by 16.0%. Secondly, we observed symmetric performance patterns around a value of 0.5, where smaller (0.1, 0.3) and larger (0.7, 0.9) offsets resulted in poorer outcomes. This symmetry indicates that a 0.5 offset balances cross window information exchange optimally while maintaining computational efficiency. Compared to an unmoved window, a 0.5 offset increases the effective interaction range by 38% while introducing minimal computational overhead (reducing FPS by 1.3%). These results confirm that 0.5 unit shift is a robust design choice with good generality across different substation configurations and equipment types, providing the best balance between feature interaction capability and computational efficiency.

Limitations and future directions

Limitations

While DI-PointNet demonstrates state-of-the-art performance in substation scene reconstruction, several limitations warrant careful consideration:

Architectural Complexity: The integration of multiple novel components (DLCTransformer, hierarchical sampling, InvResMLP) introduces significant parameter overhead (5.9 M parameters), which may hinder deployment on resource-constrained edge devices. Although our method achieves real-time performance on high-end GPUs, its efficiency on mobile platforms remains unverified.

Multi-modal Integration: The framework currently operates solely on geometric point cloud data, neglecting potentially complementary information from RGB imagery, thermal data, or LiDAR intensity values that could enhance semantic understanding in complex scenarios.

Dynamic Scene Handling: The approach assumes static environments and does not address temporal consistency or dynamic object processing, limiting applicability to real-time monitoring scenarios with moving equipment or personnel.

Future directions

Based on these limitations, we identify several promising research directions:

Lightweight Architecture Design: Develop knowledge distillation and neural architecture search techniques to reduce computational overhead while maintaining performance, enabling deployment on mobile inspection platforms.

Multi-modal Fusion: Integrate cross-modal attention mechanisms to incorporate visual, thermal, and geometric features within a unified processing framework, enhancing robustness to lighting and weather variations.

Temporal Modeling: Extend the architecture to incorporate spatiotemporal transformers for processing 4D point cloud sequences, enabling applications in dynamic monitoring and predictive maintenance.

Self-Supervised Adaptation: Explore contrastive learning and self-supervised objectives to reduce annotation dependency, particularly for rare equipment types or novel substation configurations.

Conclusions and future work

Conclusions

To address the issue of constructing three-dimensional scenes of substations, this paper proposes the DI-PointNet algorithm, which achieves high-precision three-dimensional reconstruction of substation scenes through a joint point-volume representation model and a semantically driven mechanism. DI-PointNet integrates a two-layer continuous transformer module to enhance feature interaction, employs a hierarchical key sampling strategy to reduce computational complexity, and introduces an inverted residual module to optimize multi-scale feature extraction. Experiments on a 220 kV substation point cloud dataset demonstrate that this method significantly outperforms existing methods in key metrics such as scene completeness (92.4%), geometric fidelity (4.2 mm), semantic accuracy (92.1%), and other core metrics, while achieving 4K real-time rendering at a frame rate of 34.6 FPS, providing an efficient and reliable solution for substation digital twin systems.

Future work

Future work will focus on optimizing the robustness of DI-PointNet under extreme weather conditions, improving the quality of point cloud reconstruction under rain and fog by fusing multi-modal sensor data (e.g., infrared thermal imaging and visible light images), and exploring lightweight model deployment options to support real-time 3D reconstruction applications for mobile devices.

Footnotes

ORCID iD

Naichen Yan

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is funded by Research and Application of Interactive 3D Operation and Maintenance Platform for Smart Substations Based on Augmented Virtual Reality; the project number is SGMDJX00JJS1800357.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Chen

Chang

. Achieving clean energy via economic stability to qualify sustainable development goals in China. Econ Anal Pol 2024; 81: 1382–1394.

Dou

Zhao

, et al. Point cloud power line extraction based on improved DBSCAN algorithm in multiweather scenarios. Sixth Conference on Frontiers in Optical Imaging and Technology: Applications of Imaging Technologies 2024; 13157: 377–383.

Che

Jung

Olsen

. Object recognition, segmentation, and classification of mobile laser scanning point clouds: a state of the art review. Sensors 2019; 19(4): 810.

Shen

Qian

, et al. An automatic extraction algorithm of high voltage transmission lines from airborne LIDAR point cloud data. Turk J Electr Eng Comput Sci 2018; 26(4): 2043–2055.

Vinodkumar

Karabulut

Avots

, et al. A survey on deep learning based segmentation, detection and classification for 3d point clouds. Entropy 2023; 25(4): 635.

Szhang

Tian

Zhao

, et al. Segmentation of apple point clouds based on ROI in RGB images. INMATEH-Agricultural Engineering 2019; 58(3): 209.

Yang

Hou

. Three-dimensional point cloud semantic segmentation for cultural heritage: a comprehensive review. Remote Sens 2023; 15(3): 548.

You

Xie

. Automatic driving image matching via random sample consensus (RANSAC) and spectral clustering (SC) with monocular camera. Rev Sci Instrum 2024; 95(8): 085113.

Wang

Lan

Gao

. LiDAR filtering in 3D object detection based on improved RANSAC. Remote Sens 2022; 14(9): 2110.

10.

Chen

. A fast multiplane segmentation algorithm for sparse 3-D LiDAR point clouds by line segment grouping. IEEE Trans Instrum Meas 2023; 72: 1–15.

11.

Wang

Liu

. Three-dimensional point cloud segmentation based on context feature for sheet metal part boundary recognition. IEEE Trans Instrum Meas 2023; 72: 1–10.

12.

Luo

Jiang

Wang

. Supervoxel-based region growing segmentation for point cloud data. Int J Pattern Recogn Artif Intell 2021; 35(03): 2154007.

13.

Shu

Zhang

. CFSA-Net: efficient large-scale point cloud semantic segmentation based on cross-fusion self-attention. Comput Mater Continua 2023; 77(3): 2677.

14.

Zhang

Shang

, et al. DSA-Net: an attention-guided network for real-time defect detection of transmission line dampers applied to UAV inspections. IEEE Trans Instrum Meas 2023; 73: 1–22.

15.

Qin

Yang

, et al. LVNet: a lightweight volumetric convolutional neural network for real-time and high-performance recognition of 3D objects. Multimed Tool Appl 2024; 83(21): 61047–61063.

16.

Jabberi

Wali

Neji

, et al. Face shapenets for 3d face recognition. IEEE Access 2023; 11: 46240–46256.

17.

Imabuchi

Kawabata

. Discrimination of plant structures in 3D point cloud through back-projection of labels derived from 2D semantic segmentation. J Robot Mechatron 2024; 36(1): 63–70.

18.

Kashefi

. Pointnet with kan versus pointnet with mlp for 3d classification and segmentation of point sets. Comput Graph 2025; 131: 104319.

19.

Zhang

Kong

, et al. An improved PointNet++ based method for 3D point cloud geometric features segmentation in mechanical parts. Proced CIRP 2024; 129: 25–30.

20.

Chen

Yang

Guan

, et al. Cigarette perforation point cloud segmentation and hole depth calculation based on the improved PointNet++ network and DMCP algorithm. IEEE Sens J 2024; 24(13): 21048–21061.

21.

Yuan

Chang

Luo

, et al. Automatic cables segmentation from a substation device based on 3D point cloud. Mach Vis Appl 2023; 34(1): 9.

22.

Xiong

Stiles

Yao

, et al. Automatic 3D surface reconstruction of the left atrium from clinically mapped point clouds using convolutional neural networks. Front Physiol 2022; 13: 880260.

23.

Xia

Chen

, et al. PGFormer: a point cloud segmentation network for urban scenes combining grouped transformer and KPConv. IEEE Trans Geosci Rem Sens 2025; 63: 1–18.

24.

Mitschke

Wiemann

Igelbrink

, et al. Hyperspectral 3D point cloud segmentation using RandLA-Net. In: International Conference on Intelligent Autonomous Systems. Springer Nature Switzerland, 2022, pp. 301–312.

25.

Tang

Zhang

Zhu

, et al. Outdoor large-scene 3D point cloud reconstruction based on transformer. Front Physiol 2024; 12: 1474797.