Abstract
Aiming at the problem of automatic classification of point cloud in the investigation of vegetation resources in the straw checkerboard barriers region, an improved random forest point cloud classification algorithm was proposed. According to the problems of decision tree redundancy and absolute majority voting in the existing random forest algorithm, first the similarity of the decision tree was calculated based on the tree edit distance, further clustered reduction based on the maximum and minimum distance algorithm, and then introduced classification accuracy of decision tree to construct weight matrix to implement weighted voting at the voting stage. Before random forest classification, based on the characteristics of point cloud data, a total of 20 point cloud single-point features and multi-point statistical features were selected to participate in point cloud classification, based on the point cloud data spatial distribution characteristics, three different scales for selecting point cloud neighborhoods were set based on the point cloud density, point cloud classification feature sets at different scales were constructed, optimizing important features of point cloud to participate in point cloud classification calculation after variable importance scored. The experimental results showed that the point cloud classification based on the optimized random forest algorithm in this paper achieved a total classification accuracy of 94.15% in dataset 1 acquired by lidar, the overall accuracy of classification on dataset 2 obtained by dense matching reaches 92.03%, both were higher than the unoptimized random forest algorithm and MRF, SVM point cloud classification method, and dimensionality reduction through feature optimization can greatly improve the efficiency of the algorithm.
Keywords
Introduction
The grass square sand control technology is a measure that uses wheat grass squares to change the structure of the desert surface wind and sand flow and fix the surface of the sand to control the movement of sand particles [1, 2]. Monitoring the changes in the total biomass of vegetation in the straw checkerboard barriers area is the basis for further analysis of how the straw checkerboard barriers affects the ecological restoration of desert areas, furthermore, the sequence of ecological restoration under desert environment can be obtained, This is of great significance to the development of the ecological environment of the desert area after sand consolidation towards a virtuous circle [3, 4].
In recent years, the remote sensing data has also been more widely used with the development of software and hardware technologies in the field of remote sensing, and the technical cost of remote sensing data acquisition rapidly decreased. In addition to the traditional DEM, DOM, and DLG production, remote sensing data has added applications such as fast image mapping, 3D data modeling, and forestry resource survey, the basis for the further application of remote sensing data is the rapid and accurate acquisition of the 3D point cloud of the target area [5]. The acquisition of 3D point cloud data mainly uses the following two methods at this stage: One is to use lidar scanning to obtain, the second is obtained by dense matching algorithm through photogrammetry [6]. LiDAR point cloud has high acquisition efficiency and good accuracy as an active remote sensing technology, however, there are problems such as high equipment cost, uneven density distribution of the point cloud data collected, difficult to remove interference points, and regular targets such as the boundary of the building cannot be clearly judged, so it brings great difficulties to the subsequent point cloud data processing. The point cloud obtained through dense matching not only can achieve the accuracy close to that of lidar, but also the three-dimensional points carry spectral information directly, and by controlling the image acquisition height, resolution, overlap and other parameters, it can get a 3D point cloud with more detailed features than the lidar point cloud [7, 8]. Dense matching point cloud has been widely used in the fields of 3D modeling of buildings, classification of surface coverage, forest resource survey, biomass estimation and so on at present. In this paper, the way to obtain the three-dimensional point cloud of the straw checkerboard barriers area is the dense matching method of the sequence images taken by the drone [9].
There are two basic characteristics of point cloud data: First, the point cloud data contains spatial three-dimensional coordinate information, and some point clouds obtained by different acquisition methods also contain echo information, spectral information and so on. The second is that although point cloud data has a large amount of data, it is unrelated to each other and disordered. Fine division of point cloud data is the basis for further processing of point cloud data, there are mainly two methods for implementing automatic classification of point clouds at present:
(1) Unsupervised method. There are many unsupervised classification methods to construct connections between 3D points by extracting various features of point cloud data for the characteristics of 3D point cloud data at present. Scholars such as Gerke (2014) extracted the characteristics of line segment length, normalized height, Z component of normal vector, color and texture on point cloud data, combined with graph cut algorithm, divided LiDAR point cloud into ground, buildings and low vegetation And trees 4 types [10]. Tian Qinghua (2017) improved the European clustering algorithm, realized the search radius adaptation, and solved the edge point recognition problem of the point cloud subset based on the distance constraint, thereby achieving point cloud segmentation [11]. Zhang (2016) first flipped the 3D point cloud data, and then simulated the relationship between the cloth and the inverted point cloud to achieve the segmentation between ground point and feature point [12]. Unsupervised point cloud classification algorithms are mostly for specific types of point cloud data, with a single application scenario and poor portability.
(2) Supervised method. Traditional unsupervised point cloud classification method determines the category of point by setting decision rules, supervised classification method is to classify point by training classifier after extracting classification features. Zhang (2013) and others used support vector machines to classify LiDAR data [13]. Niemeyer (2012) and others used a context classification method based on conditional random field to achieve LiDAR point cloud classification [14]. Xu (2014) extracted three types of features: point, plane, and mean drift to train the classifier. also compared the classification effects of various classifiers such as Adaboost, Random Forest, ANN MLP(Artificial Neural Networks Multiple Layer Perceptrons), Support Vector Machine, SVM and Rule-based systematically [15]. Guan (2013) used spectral features, geometric features and echo intensity features to train a random forest classifier to classify LiDAR point cloud into buildings, trees, ground point[16].The above-mentioned machine learning-based supervised point cloud classification method used all extracted point cloud features for classification and division, and failure to fully analyze the relationship between point cloud features and classification accuracy during the classification process, the algorithm classification efficiency is low because of the high feature dimension.
In this paper, an improved random forest point cloud classification algorithm is proposed based on existing research combined the characteristics of the point cloud. The algorithm first reduces the decision tree with high similarity of random forest by clustering, then the classification accuracy is introduced to perform weighted voting on the decision tree classification results.
Point cloud feature calculation
Some of the characteristics of point cloud are single-point features, some are the statistical characteristics of the current point and its neighborhood. Assuming that there is a point cloud c, its point set can be expressed as:
The current calculation point p is:
If the radius of the neighborhood is R, the point set formed by the current point p and its neighborhood can be expressed as:
The point cloud statistical characteristics can be further calculated accordingly.
Height is a more commonly used feature in 3D point cloud classification, it can distinguish ground point and non-ground point effectively in 3D point cloud data. However, height values cannot be directly used as features for point cloud classification due to different ground fluctuations in the collection area. This paper used five features related to height: Normalized height of point cloud, point cloud average height, height variance of point cloud, point cloud height difference, point cloud height change rate, the normalized height is a single point feature, and the remaining four are neighborhood statistical features.
(1) Normalized height characteristics NH. This paper adopts the method provided in Ref. [12] to obtain the (Digital Terrain Model, DTM) of the target point cloud, then the normalized height was the height value obtained by subtracting DTM at each point.
(2) Average height H
av
. The average height is defined as the average of the absolute height of all points in the current point and its neighborhood. The formula is as follows:
n is the total number of neighborhood point taken, H k is the height value of each in the neighborhood.
(3) Height variance H
v
. Height variance is defined as the variance of the height of all points in the neighborhood. The formula is as follows:
n is the total number of points in the neighborhood range taken, H k is the elevation of each point in the neighborhood range taken, and H av is the average height.
(4) Height difference H
diff
. The height difference is defined as the difference between the height of the maximum point and the minimum point in the neighborhood. The formula is as follows:
H h is the maximum value of the height of all points in the neighborhood, and H l is the minimum value of the height of all points in the neighborhood.
(5) Height change rate H
n
. The rate of height change is defined as the ratio of average height to height difference between the current point and the highest point in the neighborhood taken. The formula is as follows:
H av is the average height, H h is the maximum height value, and H is the height of the current point.
The relevant features of the fitted plane mainly describe the relationship between the points and the fitted plane (the plane obtained by fitting the current point and the neighboring point in the least square method). The relevant features of the point cloud fitting plane used in this study were as follows:
(1) Plane roughness (N). The value is the distance between the current point and the fitting plane;
(2) Range of the surface (S r ). The value is the maximum value of the distance from all points in the current neighborhood to the fitted plane;
(3) Surface standard deviation (S STD ). The value is the standard deviation of the distance from all points in the current neighborhood to the fitted plane;
(4) Slope of the surface (S n ). The slope of the plane is the angle between the normal vector of the fitted plane and the vertical direction.
Fitting plane-related features can play a greater role in the distinction between ground point, vegetation point, and building point.
Correlation feature of covariance matrix eigenvalue
The covariance matrix of point cloud data was a 3×3 order symmetric matrix. Solved the covariance matrix to get three eigenvalues, denoted as λ1, λ2 and λ3, according to the relationship between the size of the characteristic value, there were the following three cases:
(1) If λ1 ≈ λ2 ≈ λ3, the current point set has divergent characteristics, vegetation or shrubs have this characteristic usually.
(2) If λ1 ≥ λ2 ≈ λ, the current point set has a linear characteristic, tree branches and building edges have this characteristic usually.
(3) If λ1 ≈ λ2 ≥ λ3, then the current point sets have planar characteristics, the ground and building facades exhibit such characteristics usually.
On this basis, in order to make the above characteristics easy to quantify, four indexes such as curvature λ
c
, λ
s
, λ
l
, λ
p
are introduced, the expression of curvature λ
c
is:
The expressions of the divergence index λ
s
, linear index λ
l
and planeness index λ
p
of the segment are as follows:
The selection of spectral features mainly includes RGB features, Disible-Band Difference Vegetation Index, VDVI, Normalized Green-Blue Difference Index, NGBDI and split G r .
(1) The RGB feature is composed of the DN values of the three channels of visible light red light R, green light G, and blue light B, can be obtained from point cloud data containing RGB information directly.
(2) Vegetation index VDVI and NGBDI can divide vegetation point cloud well, the formula is:
(3) Split G
r
distinguishes vegetation by calculating the ratio of the G channel DN value to the total RGB DN value of the three channels
(1) Point density n refers to the number of all points in the neighborhood of the point;
(2) Normal vectors n x , n y and n z . Defined as the three components of the normal vector of the plane obtained by least-square fitting the points in the neighborhood of the current point.
Multi-scale feature construction
Different point clouds have different data densities, therefore, in the point cloud classification, the selection of neighborhoods for the calculation of point cloud statistical features cannot simply be based on the same scale.
First, use the octree to organize the point cloud and find the average point distance of the point cloud d
m
, then determine the point cloud neighborhood scale, L
i
means scale level, the scale calculation formula is as follows:
s is the scale factor and i is the level, use the amplification factor s to get other levels after determining the minimum scale level L1.
For example, if the extracted maximum scale level is 3 and the amplification factor s = 2, then:
The method of neighborhood division is to draw the sphere with the current point as the center of the sphere and the current scale as the radius, the point set in the sphere is the current neighborhood point set.
The random forest algorithm was proposed by Breiman in 2001 and is essentially a combination of random subspace algorithm and Bagging algorithm. Random forest consists of decision trees as base classifiers. The process of random forest algorithm is [17, 18]:
(1) Use Bootstrap to extract n (n≤N) samples from N training samples with replacement
(2) Repeat K times to generate K decision trees, first extract m attributes(m =⌊ log 2M + 1 ⌋) from all M attributes, then apply the CART algorithm to select the optimal attribute from m attributes as the node splitting attribute, the CART algorithm is as follows:
Suppose the i-th sample in the current set D accounts for p
i
(i = 1, 2, ⋯ , |y|)
|y| is the total number of classification categories, the Gini value of D can be expressed as:
The value range of Gini (D) is [0,1].
Select a test attribute a and use it to divide the random variable D into two categories: D1, D2 then the Gini index of a is:
In the candidate attribute set A = a1, a2, …, a
m
, the attribute that meets the split requirement is the attribute with the smallest Gini index:
(3) Count and vote on the classification results of all K decision trees generated, the classification result with the highest number of votes was determined as the final result.
It can be seen from the classification process of random forest, The node candidate splitting attributes of the random forest are randomly selected, which determines that a large number of “weak decision trees” will be generated for a large number of low-correlation variables and redundant variables in high-dimensional space, and these “weak decision trees” will greatly increase the calculation of random forest classification, reduce classification efficiency, “weak decision tree” will also reduce the accuracy of the classification results at the same time. Therefore, how to remove the “weak decision tree” under the condition of ensuring the diversity of the decision tree is a direction of random forest algorithm optimization. The algorithm gives all decision trees the same voting weight without considering the accuracy weight of the decision tree, which is also a major factor affecting the evaluation accuracy of the algorithm.
The algorithm in this paper is based on the random forest algorithm. First, according to the VIM (Variable Importance Measure, VIM) score, the participating classification features were optimized to remove redundant features, for the decision tree in the algorithm, the tree edit distance is used to calculate the similarity between the decision trees, further used the maximum and minimum distance algorithm to cluster and reduce the decision tree, the simplified decision tree was formed into a sub-forest, and the voting weight matrix is then constructed according to the accuracy. Finally, the classification evaluation accuracy and algorithm operation efficiency of the random forest classifier were improved.
Similarity of decision tree
It can be seen from the construction process of the random forest algorithm that its evaluation accuracy is limited by two conditions: The first is the classification performance of each decision tree, the higher the classification accuracy of a single decision tree, the better the classification performance of the composed sub-forest. If there were many decision trees with low classification performance, the classification performance of the composed random forest classifier will be greatly affected. The second is the correlation between decision trees, the lower the correlation between the decision trees that make up the sub-forest, the better the classification decision performance of the sub-forest. Decision tree with high correlation has high similarity of decision results, if the accuracy of the simultaneous decision tree is low, the effect of the sub-forest classification is poor [19, 20].
It can be known that the decision tree is a binary tree structure from the nature of random forest. The similarity of binary tree structure can be calculated by Zhang & Shasha algorithm, the algorithm measures the similarity between two trees by calculating the step (edit distance) of one tree to another [21].
There are three ways to define the operation of the tree: delete, insert and modify. Let T1 and T2 be two ordered decision trees, of which T2 is the target tree. Define a special label ∧, ∧ → a means adding node a, b→ ∧ means delete node b, S represents an operation sequence, record as s1, ⋯ , s
k
, define a cost function γ to represents node conversion editing operations of a → b, the returned value is a nonnegative real number γ (a → b). The return value of the cost function is different for node operations between different trees, that means different weights can be obtained through different node weights. Expanding γ into the sequence S, the following formula can be obtained:
The edit distance between two trees T1 to T2 can be expressed as:
S is the sequence of editing operations to convert T1 to T2.
As shown in Fig. 1, T[i] represents the i-th node from the left in the T-tree, l(i) means the first node on the left in the subtree with the root node T[i], let the parent node of T[i] be a(i), d(i) the set of T[i] ’s child node is expressed as d (i) ={ l (i) , l (i1) , ⋯ , l (i n ) }, as shown in the dotted nodes in Fig. 1, t(i) represents a subtree composed of T[i] and its child nodes d(i), the parent-child relationship in the subtree is less than or equal to one layer. As shown in the nodes in the solid line in Fig. 1, T [i′ ⋯ i] represents all ordered child forests from i′ to i in tree T, the father-son relationship in the child forest is greater than or equal to the second floor.

Tree structure.
Use forestdist (T1 [l (i) … i1] , T2 [l (j) … j1]) for the distance between two forests, use treedist (i1, j1) for the distance between two trees, the formula for calculating treedist (i, j) is as follows:
(1) if l (i) = l (i1) and l (j) = l (j1)
(2) if l (i) ! = l (i1) or l (j) ! = l (j1)
Combined dynamic programming algorithms, calculate the editing distance according to the above formula. It is not difficult to see that in the process of calculating treedist (i, j), all the values of treedist (i1, j1) need to be calculated first. When i1 is a node in the process from l (i) to i, j1 is a node in the process from l (j) to j, treedist (i1, j1) does not need to be calculated separately. The distance of these subtrees will be automatically generated during the calculation of treedist (i, j). A temporary array can be defined to store the distance of the forest, treedist have been calculated, then release them. Define a permanent array to store the treedist data at the same time, through iterative calculation, finally get treedist (i, j).
After solving the edit distance between the key nodes of the tree structure, by iterating to the root node, it can finally calculate the edit distance between the two decision trees, then the clustering operation of the decision tree can be realized. The simplified clustering process by calculating the edit distance of the tree is as follows:
(1) Used the decision tree generated by the unoptimized algorithm as the candidate data set, select Q decision trees as initial clustering centers randomly.
(2) Used the maximum and minimum distance method to determine Q cluster centers, details as follows:
1) Selected a data as the first cluster center z1 randomly;
2) Calculated the distance from the remaining data to z1, the largest distance is the second cluster center z2;
3) Calculated the distance between the remaining data and the existing cluster center, taked the maximum value in the minimum distance set as the new clustering center;
4) Repeated the above process until Q cluster centers are determined.
(3) Traversed all decision trees and calculate their edit distance from Q cluster centers, clustered decision trees into Q clusters according to the closest distance principle.
(4) Repeat steps (2)-(3), stopped until all decision trees were clustered.
(5) Sorted the evaluation accuracy among the obtained Q decision tree clusters, selected decision tree with high evaluation accuracy to reconstruct random forest.
The majority voting method in the random forest algorithm is used after generating the decision tree, the evaluation accuracy of the decision tree was not considered, ignored the influence of weight on evaluation accuracy. This paper used the weighted voting method as the final classification basis in order to introduce decision tree evaluation accuracy weights in the random forest voting stage. The method is to convert the evaluation accuracy of the decision tree into the voting weight of the algorithm. After simplifying the decision tree clustering in the previous section, the new random forest decision tree evaluation accuracy matrix ACC can be expressed as [22]:
acc ij represented the classification accuracy of the j-th decision tree to the i-th kind of point cloud, j = 1, 2, … J, J is the number of decision trees after clustering; i = 1,2, ... ,I, I is the evaluation accuracy level, it is obtained by calculating the accuracy of each point cloud classification corresponding to each decision tree in the random forest after the cluster is reduced in the verification set.
According to the evaluation accuracy matrix, the weight matrix W is defined as:
w
qp
represents the voting weight of the j-th decision tree corresponding to the i-th point cloud, the calculation method is:
The improved algorithm flow chart is shown in Fig. 2.

Algorithm flow chart.
There were redundant features for point cloud classification, the classification feature set formed has high dimensions, and the feature selection in this paper is at three different scales, so the method used in this paper first optimizes high-dimensional features before point cloud classification, identified the features that can play an important role in the classification process to improve the training and classification efficiency of the algorithm. The random forest algorithm can use the VIM scoring strategy to evaluate the importance of classification features. This evaluation strategy has been widely used in the screening of high-dimensional features [23]. There are two calculation methods for VIM scores currently, one is a substitution method based on OOB error rate; the other is a method based on Gini index. Since the Gini index method has poorer unbiased performance than the replacement method, this study used the replacement method for feature optimization dimensionality reduction. The principle of this method is to perform a substitution operation on the variable value of the data outside the bag in a random manner, and then determine the VIM value of the variable according to the change of the OOB error rate that before and after the replacement, set the variable to X, then the VIM calculation formula of the variable is:
N
tree
is the total amount of decision tree, OOB
ij
is the OOB error rate corresponding to the decision tree i before the variable X
j
substitution operation,
The first test data set used in this article is lidar data provided by the Graphics and Media Laboratory of Lomonosov National University, point cloud data was acquired by UAV equipped with Lidar, the collection area mainly contains trees, low vegetation, buildings and cars. The radar point cloud density is 6 points/m2, a total of 1002668 points, Point cloud contains x,y,z coordinate information [24]. Used MATLABR2017a software to write random forest classification program, the operating system is 64-bit Windows 10, CPU is Inteli9900k, memory 64 G. This paper extracted 10% of ground point, tree point, low vegetation point, building point, and car point in the data set as training samples.
Feature selection
Since the point cloud data of the first test data set does not carry color information, selected the features listed in this article other than color to participate in the classification. The initial feature set size is 47 at 3 scales. This paper took 10% of each type of point cloud as training data, and the rest as test data. The number of Ntree in the random forest was set to 200. The randomly selected feature number m is calculated according to the formula log2M+1. The experimental results show that, as features with poor classification importance scores were successively removed, the classification accuracy does not fluctuate significantly. This is because there were many redundant features of the point cloud classification in high-dimensional features, and a small amount of culling did not have a large impact on the overall result. When the removed subset reached the value of 12, there was a degree of decline in classification accuracy, from 94.79% to 91.63%, as shown in Fig. 3. On this basis, further feature removal resulted in the removal of a large number of features that play a more important role in the classification process, and the overall accuracy of the random forest classifier continues to decrease. Therefore, this paper finally selected 12 features in the first data set to participate in the classification calculation.

Change in accuracy of feature selection process.
Further used variable importance scoring strategy for feature importance evaluation. Figure 4 is the global variable importance score, represents the VIM value of each feature in the feature set during the entire classification process, the higher the VIM score, the greater the weight of the feature in the point cloud classification. As can be seen from the above figure, in this point cloud classification experiment, the most important features that play a role in classification were in turn NH, HnL3, HnL2, HdiffL3, HdiffL2, SnL3, SnL2 (Subscript L i means scale). This shows that these features are very important for point cloud classification. In point cloud data without color information, NH played a crucial role in the separation of ground and non-ground point, H n and H diff played a more important role in the statistical analysis of the distinction between low vegetation and high vegetation. Surface-related features such as S n and S STD have a relatively large effect on the classification process of buildings and automobiles. It can be seen that the important role played in this experiment is the point cloud characteristics at large scale, which is related to the sparseness of the point cloud itself and the calculation method of statistical characteristics.

Variable importance score.
In this experiment, based on the verification data, a confusion matrix was established to evaluate the accuracy of the point cloud classification of the algorithm in this paper. At the same time, in order to verify the effectiveness of the algorithm in this paper, they were compared with the classification results of traditional random forest (RF) algorithm, Markov random field (MRF) algorithm, and support vector machine (SVM) algorithm. The point cloud classification confusion matrix is shown in Table 1.The algorithm of this paper is superior to other algorithms in the classification of ground point, low vegetation point and car point. But overall, all algorithms have a low classification accuracy for building point, car point, and low vegetation point. The classification accuracy of the algorithm in this paper is 94.15%, the traditional RF algorithm is 92.89%, the MRF algorithm is 90.77%, and the SVM algorithm is 90.98%. The confusion matrix is shown in Table 1.
Confusion matrix of point cloud classification
The visualization effect of the point cloud classification of the algorithm in this paper and in other machine learning algorithms is shown in Fig. 5.

Visualization of each algorithm’s point cloud classification.
The model training time is shown in Table 2. Traditional RF algorithm was significantly more time-consuming than MRF algorithm and SVM algorithm, however, after performing feature selection to remove redundant features, the training time of the point cloud classification model was reduced from 168.3 s to 126.5 s, the classification time was reduced from 23.2 s to 15.6 s, and the classification efficiency was improved significantly. While using MRF algorithm and SVM algorithm, all features participated in the calculation.
Classification time of point cloud unit: s
Further apply the algorithm of this paper to classify the vegetation point cloud (data set 2) of straw checkerboard barriers area. The point cloud was acquired at the ALashan Zuoqi (38.790063, 105.503970) by the dense matching method of UAV sequence images. Point cloud includes ground point, Haloxylon ammodendron point and Artemisia point, the original point cloud is first denoised by KD-Tree, and then down-sampled by 0.5 times, a total of 13204475 points were obtained, the point cloud density was 2213 points/m2. Because the densely matched point cloud contained spectral information, this experiment used 65 point cloud features, according to the feature screening process, the final participating classification features were 15, the five features that contribute the most to the classification were: VDVI, NH, SnL2, λsL2, λlL3, Also select 10% of ground point, Haloxylon ammodendron point, and Artemisia point as training samples to obtained the segmentation visualization effect as shown in Fig. 6.

Visualization effect of point cloud classification in straw checkerboard barriers area.
In order to quantitatively analyze the classification effect of this algorithm on densely matching vegetation point clouds in the straw checkerboard barriers area, firstly, the vegetation point clouds in the straw checkerboard barriers were classified by manual interpretation, and used as reference data to compare with the classification results obtained by the remaining three algorithms. The overall accuracy of the algorithm classification in this paper is 92.03%, the traditional RF algorithm is 91.56%, the MRF algorithm is 89.24%, and the SVM algorithm is 88.45%. The error statistics obtained are shown in Table 3, it can be seen from Table 3 that the classification results of the algorithm after feature selection are obviously superior to the RF algorithm without classification selection. The miss rate of the Haloxylon ammodendron point and the wormwood point of the MRF algorithm and the SVM algorithm is significantly higher than that of the algorithm in this paper, but the point misclassification rate of the MRF algorithm and the SVM algorithm for Haloxylon ammodendron point and Artemisia point is equivalent to the algorithm of this paper.
Classification result error statistics
This study focuses on point cloud data, analyzed and summarized the characteristics of point cloud data, and proposed 20 features suitable for point cloud classification, at the same time, according to the point cloud density and the characteristics of the neighborhood, three different hierarchical scales that can adapt to the point cloud’s own density were constructed, improved random forest decision tree construction and voting process, used random forest classification algorithm to analyze the importance of different scales of feature variables, all the feature variables and the most important feature variables were selected respectively to classify the point cloud data of the two experiments and evaluate the accuracy, the conclusion is as follows:
(1) The point cloud data in dataset 1 is the high-altitude airborne radar point cloud, point cloud density is only 6 points/m2, especially without color features, because the overall feature composition of target point clouds is not obvious such as cars, low vegetation, buildings and so on, these types of points had a lower recall rate. Statistical features of point clouds at large scales such as HnL3, HnL2, HdiffL3, HdiffL2, SnL3, SnL2 played a larger role in the point cloud classification of dataset 1. NH was the primary feature that distinguishes ground point from non-ground point. Compared with traditional machine learning algorithms, it is found that the overall accuracy of the algorithm in this paper is better than the unoptimized RF algorithm, also because of the features selected in this paper for the random forest method, the overall accuracy is also higher than the MRF and SVM algorithms. It proved that in the case of high-dimensional features, the random forest algorithm has a greater advantage in the classification problem.
(2) The point cloud data in dataset 2 comes from the dense matching method of UAV sequence images, the point cloud density was much greater than the density in dataset 1. It can be seen that the small-scale statistical features in data set 2 increased compared with data set 1 from the features selected by the VIM score, which also depends on the calculation method of each statistical feature. Due to the mixed occurrence of Haloxylon ammodendron and Artemisia in dataset 2, the overall accuracy of the classification of Haloxylon ammodendron and Artemisia point was lower than that of ground point. Due to the difference in the morphology of Haloxylon ammodendron and Artemisia, the covariance statistical features play a larger role in the point cloud classification in dataset 2. At the same time, it can also be seen that the classification time increased sharply with the increase of the point cloud density and the number of point cloud.
(3) It optimized the random forest algorithm for bagging sampling technology and decision tree decision characteristics in this paper, it can be seen that the classification results of the optimized algorithm in this paper were significantly better than the random forest algorithm from the experimental results of the two point cloud experimental data, although a part of the efficiency of the algorithm was lost due to the decision tree clustering reduction algorithm using the tree edit distance, the model training time of this algorithm was still better than the traditional random forest algorithm due to the reduction in the number of decision trees.
This algorithm can improve efficiency and classification accuracy in the process of point cloud classification calculation. There is also a high classification accuracy in the vegetation classification of the straw checkerboard barriers area obtained by dense matching. Although a large number of features are selected to participate in the classification calculation, there is no adaptive calculation method for the statistical characteristics of point cloud with different densities. At the same time, there are many spectral features of densely matched point cloud, and further analysis is needed. Future research will focus on the above two aspects to carry out point cloud classification and recognition.
