Abstract
The unbalanced development strategy makes the regional development unbalanced. Therefore, in the development process, resources must be effectively utilized according to the level and characteristics of each region. Considering the resource and environmental constraints, this paper measures and analyzes China’s green economic efficiency and green total factor productivity. Moreover, by expounding the characteristics of high-dimensional data, this paper points out the problems of traditional clustering algorithms in high-dimensional data clustering. This paper proposes a density peak clustering algorithm based on sampling and residual squares, which is suitable for high-dimensional large data sets. The algorithm finds abnormal points and boundary points by identifying halo points, and finally determines clusters. In addition, from the experimental comparison on the data set, it can be seen that the improved algorithm is better than the DPC algorithm in both time complexity and clustering results. Finally, this article analyzes data based on actual cases. The research results show that the method proposed in this paper is effective.
Introduction
Since the middle of the 20th century, with the rapid increase of population and rapid economic development, the emergence of ecological problems such as global warming, environmental pollution, and waste of resources, countries today are gradually aware of the serious imbalance between economic development and natural ecological environment in the process of economic development. At this stage, in the context of the era when the tension between man and nature has become a bottleneck for the sustainable development of human civilization or a key point of detonating many social problems, it is the inevitable trend and basic law of the evolution and development of human civilization to promote the excessive transformation of industrial civilization to ecological civilization by reflecting on the excessive economic development in the process of industrial civilization and destroying the balance between man and nature [1].
The global financial crisis that broke out in 2008 directly led to a global economic depression. In addition, various problems are intertwined and superimposed, which seriously threatens the sustainable development of the world. In order to solve the lingering stagnation of the financial crisis and change the current unsustainable development track of the global ecosystem, major developed countries advocate a new round of revolutionary economic reforms. On December 11, 2008, UN Secretary-General Ban Ki-moon proposed the “Green New Deal” concept at the UN Climate Change Conference and called on all countries to focus on investment in environmental projects and implement environmentally friendly policies to promote green economic growth, actively alleviate employment pressure, and restore natural ecosystems to continue to support global economic development. Beginning in 2009, important international organizations such as the United Nations Programme and the OECD have successively advocated green growth. In 2011, the OECD “Towards Green Growth” report was released. The 2012 “Rio+20” United Nations Conference on Sustainable Development proposed a green economy based on sustainable development dedicated to poverty eradication in the sense of calling for changes in the economic growth paradigm and determined the institutional framework for sustainable development. So far, the low-carbon green economic development model reform with the theme of green growth has quickly gained global recognition and put it into practice. This economic model has been the general consensus of political leaders of various countries, and the theory of green growth has also become a hot topic in academic circles for a while [2].
Since the reform and opening up, China’s economy has continued to grow at a high speed. The rise of the green growth theory and the wave of international green new policies have accelerated the secondary adjustment and transformation of the Chinese economy. In the face of severe domestic resource and environmental problems, the Chinese government is firmly committed to the adjustment and transformation of economic green growth, and proposes a series of strategies related to green development to accelerate the transformation of economic growth models and promote the green and ecological transformation of economic development. These studies show that the transformation and development of China’s economy is facing significant constraints on resources and the environment, as well as a major need to transform economic growth methods to achieve sustainable development [3].
Related work
Multi-view clustering (MVC) is to make full use of multiple views of data objects in order to obtain more effective and accurate grouping than using only one data view. With the increasing number of multi-view data in daily life, it is costly and impractical to manually label categories in many big data scenarios. At this time, unsupervised learning emerged [4]. Clustering is a basic unsupervised learning task, a useful tool for exploring data structures, and has been used in many disciplines. Multi-view clustering is an emerging field that has received high attention. Moreover, multi-view clustering is used to process data collected from multiple sources or “views” to take advantage of the supplementary information. In addition, in the case of multi-view clustering, data comes from different domains or from various feature extraction techniques [5]. In order to fuse the data characteristics of different views into one framework, the early methods are mainly aimed at the situation of two views. Theliterature [6] proposed to construct a bipartite graph to connect two types of features and used standard spectral clustering algorithms to obtain the final clustering results. The method in theliterature [7] extended the k-means algorithm to process the data of two conditionally independent views. These initial methods rely on the assumption that there are only two views, so it is difficult to deal with three views or multiple views. The literature [8] used incidence matrix decomposition to fuse the information of multiple matrices. In order to be able to take advantage of the complementarity of multiple views, a large number of optimization solutions for multi-view clustering are proposed. The literature [9] proposed to extend the idea of collaborative training that was originally used in semi-supervised learning to the multi-view spectral clustering algorithm in unsupervised learning. Moreover, it used the spectrum of one view to constrain the similar graphs of another view and made the clusters of the two views tend to each other through an iterative method. The literature [10] proposed a multi-view spectral clustering method based on collaborative regularization. This method uses collaborative regularization on clustering assumptions, and finally obtains consistent clusters between views. Due to the high-dimensional characteristics of multi-view data, many scholars consider reducing the data dimensionality or matrix decomposition before clustering. The literature [11] proposed multi-view canonical correlation analysis, which first maps high-dimensional multi-view data, and then applies traditional clustering methods to the mapped low-dimensional subspace to obtain clustering results. Theliterature [12] proposed kernel canonical correlation analysis and extended it to clustering in the case of incomplete data. Theliterature [13] proposed an analysis of the convergence rate of KCCA. The literature [14] first proposed a multi-view clustering algorithm based on non-negative matrix factorization. The algorithm extends the traditional single-view non-negative matrix method to multi-view non-negative matrix factorization. Moreover, the main idea is to decompose each view into two factor matrices through a non-negative matrix, and different views share a coefficient matrix. This coefficient matrix is considered as the potential clustering structure of different view mapping. Since then, a large number of NMF-based multi-view clustering methods have been proposed. The literature [15] proposed a multi-view spectral clustering algorithm based on low-rank sparse decomposition. The algorithm constructs a transition probability matrix for each view, and then uses these matrices to construct a low-rank transition probability matrix and applies it to standard Markov chain spectrum clustering to obtain the clustering result. Although these view clustering algorithms have greatly improved the clustering performance compared to traditional multi-view clustering algorithms, most algorithms do not fully consider the differences between multi-view data. Therefore, view weight distribution is also a research hotspot. The literature [16] proposed a weighted multi-view k-means clustering algorithm, which dynamically weights views by introducing parameters. The literature [17] proposed a weighted multi-view spectral clustering algorithm based on spectral perturbation, which transformed the weight distribution problem into a standard quadratic programming problem. The literature [18] proposed a self-weighted multi-view clustering algorithm based on soft-capped norms and constructed a model that can automatically learn the optimal weights of the corresponding views for multi-view clustering.
The literature [19] proposed combining sparse and low-rank representations and proposed a low-rank sparse subspace clustering method, which combines the advantages of low-rank and sparsity, and simultaneously performs sparse and low-rank constraints on the representation coefficients. One of the disadvantages of methods based on low-rank and sparse representations is that finding sparse or low-rank representations requires computational time, especially when the dimensionality is high. In order to solve this problem, the data is usually reduced in dimensionality before applying such algorithms. Theliterature [20] proposed an effective algorithm that can learn mapping at the same time and find sparse or low-rank coefficients in the low-dimensional latent space, and finally achieve data clustering through spectral clustering of the representation matrix. The article [23] implementated IoT-based Smart City is achieved by exploiting IoT and BigData Analytics using Hadoop ecosystem in real time environments. The article [24] reflects on IoT and its main role in the development of human behaviors and actions. The paper also deals with the compilation of various data from different databases connected to the Internet. The literature [25] addresses the numerous issues in the field of vehicle communication with the suggestion for a mutual unified and dispersed spectrum sensing model. The introduction of a mutual cognitive paradigm minimizes conflict and multiple unknown problems. The literature [26] discusses the issue, such as large amount of bigdata, and introduces the SmartBuddy framework for creating smart and adaptive ecosystems using human behaviors and human dynamics. The article [27] talks around the development of coordinated non-cyclic chart for video coding calculations for movement estimation in parallel reconfigurable computing frameworks [28]. The partitioning algorithm moreover plays a key part in optimizing the encoding of images [29].
Cluster analysis
Clustering is a process of dividing data objects into one or more clusters according to specific rules. The ultimate goal of cluster analysis is to classify the sample data objects according to the rules, so that the data within the classes are similar and the data between classes are as different as possible. The rules in cluster analysis are usually determined by the similarity of each object to each other. The similarity is measured according to the attribute values of the data points, and then the data objects with high similarity are classified into the same class.
Before introducing the clustering algorithm, let us briefly understand the data types commonly used in cluster analysis. The clustering data set contains multiple data objects, and there are some classic data structures to choose from [21].
Data matrix
The n objects are represented by multiple variables. For example, attributes such as school, major, grade, class, ethnicity, and grades can be used to represent the object “student’’. A data structure like this can be represented in the form of a relational table or matrix, for example, the data matrix of n × p (n objects × p variables) shown below [22]:
The dissimilarity matrix is a n × n -dimensional matrix, and it has the function of storing dissimilarity between data objects, as shown below:
Among them, d (i, j) measures the dissimilarity between objects i and j, and is usually a non-negative value. When i and j are more similar, the d (i, j) value is smaller. However, the greater the difference between i and j, the greater the value of d (i, j). Before clustering, we need to check the data representation. If the data is expressed in the form of a data matrix, it must be converted into a dissimilarity matrix.
For n objects, we use interval variable values
d (i, j) represents the degree of dissimilarity between object i and evil, that is, it uses the distance between two objects to measure the degree of dissimilarity. The smaller the d (i, j) value, the more similar the objects i and j, that is, the less difference between the two. Meanwhile, the larger the d (i, j) value, the less similar the objects i and j, that is, the more the two are different. There are three main methods for measuring dissimilarity of interval numerical attributes: Euclidean distance, Manhattan distance and Minkowski distance.
(1) Euclidean distance:
Euclidean distance (Euclidean distance) is a commonly used distance measurement method. Among them,
(2) Manhattan distance:
Manhattan distance (Manhattan distance) is based on absolute distance to calculate dissimilarity. Among them,
(3) Minkowski distance:
Minkowski distance combines Euclidean distance and Manhattan distance. Among them, q is a positive integer. It can be seen from the above formula that when the value of q is 2, it is the Euclidean distance, and when the value of g is 1, it is the Manhattan distance.
Binary variables are variables with two values, that is, the variable is 0 or 1, and there is no third value. Among them, 0 means false or state does not exist, and 1 means true or state exists. Clustering of binary variables cannot be carried out according to the method of interval value change, because the results obtained in this way are very likely to be wrong.
(1) Nominal attributes
The nominal type is derived from the binary type, which has more values than the binary type, that is, it has more than two values. In nominal variables, the degree of dissimilarity between objects i and j can be calculated in the following way:
Among them, m is the number of variables with the same attribute in the two objects, and P is the number of all variables.
(2) Ordinal type
There are two types of ordinal variables, one is discrete and the other is continuous. Discrete ordinal variables are represented as a set of ordered state values, which are similar to nominal variables. Continuous ordinal variable is a collection of continuous data, and its focus is on the order of the data. The dissimilarity of ordinal variables is calculated as follows:
The value of the ordinal variable f of the i-th object is x if . Among them, f is the ordinal variable describing the data object. At this time, there are M f ordered states 1, ⋯ , M f . Then, we replace x if with r if .
The data values are standardized.
(3) Proportional scale attribute
The proportional scale variable takes a positive measurement value on a non-linear scale, and it approximately follows the following formula:
The calculation method of dissimilarity is generally the same as that of interval scaled variables.
The above several types of attribute methods are based on the following assumption: in the same data set, the attributes existing in the data are all the same data type, and all have the same status (except for asymmetric binary variables). However, this is often not the case in practice. In most cases, there will be multiple data types coexist, their importance is different, and the weights cannot be designed to be the same size. At this time, different algorithms need to be designed according to different situations.
Each application field has its own requirements. Cluster analysis is one of the most challenging fields and has its own strict requirements. It needs to work on data sets of any size, have good scalability, and be able to handle high-dimensional data. Moreover, it needs to be able to discover clusters of arbitrary geometric shapes, identify noise points, eliminate abnormal points, and reduce the priori of the algorithm. In addition, the required results are valuable and can be applied in practice. Therefore, in this regard, we still need to continue to explore.
Basic theory
The concept of information entropy is as follows:
Among them, x is a random variable, S (X) is the set of possible values of x, and p (x) is a probability function.
Entropy becomes a measure of the value of information, and its value reflects the distribution of data. Relative entropy is a concept derived from entropy. This article quotes the concept of relative entropy and uses the method of constructing histograms. Relative entropy reflects how close the distribution of data points is to a uniform distribution. The more uniform the data distribution, the greater the relative entropy. However, when the data distribution becomes more concentrated, the relative entropy value becomes smaller. Because the CLIQUE algorithm divides the data space according to the input parameters, it is difficult for users to determine the input parameters. Moreover, because the sparse area is considered a non-dense unit, some data points that should belong to this cluster are deleted or divided into adjacent intervals following the sparse area, which destroys the integrity of the cluster. Therefore, the REG-CLIQUE algorithm in this article uses relative entropy to distinguish the dense and sparse areas in the molecular spatial data and applies it to the grid division in the first step of the CLIQUE algorithm to achieve adaptive grid division.
Among them, x ij represents the jth histogram grid (bin) on the dimension i, d (x ij ) represents the proportion of the data amount of the histogram grid in the entire data space data set, and T i represents the number of the histogram grid on the dimension i.
Among them, X i represents the set of histogram grids on dimension i.
Gaussian process is also called normal random process, it is a kind of ubiquitous and very important random process. Moreover, any n-dimensional distribution of it is a normal distribution. The probability density function is expressed as follows:
Among them: b jk is the normalized covariance function, and |B| jk is the algebraic cofactor of the element b jk in the determinant |B|.
In the clustering process, the Gaussian process values on the n-dimensional data set D at different times are irrelevant. That is, for all j ≠ k, there is b
jk
= 0. At this time, there are the following results:
This shows that if Gaussian processes are uncorrelated at different times, they are statistically independent and can be used for clustering.
The idea of the density peak clustering algorithm is simple and novel. In short, the DPC algorithm generates clusters by assigning data points to cluster centers with higher density of nearest neighbors. First, two variables for each point are calculated: (1) local density ρ i , (2) high-density nearest neighbor distance δ i , which refers to the distance from data point x i to the data point whose local density is greater than it and the closest to it. In addition, the DPC algorithm uses the decision graph method to identify cluster centers, and its selection criteria are based on the following two basic assumptions: (1) The density of cluster centers is higher than the density of its adjacent data points; (2) The high-density nearest neighbor distance of cluster centers is relatively large. In this way, the number of cluster centers can be selected intuitively.
The decision graph is derived based on the following two basic attributes of each data point x
i
: We assume that the data set consists of XP×M = [x1, x2, ⋯ , x
P
]
T
. Among them, X
i
= [x1i, x2i, ⋯ , x
Mi
] is a vector with M attributes, and p is the total number of data points. The distance between the two data points x
i
and x
j
is calculated as follows:
The local density of data point x
i
is denoted by ρ
i
, which is called the hard threshold. It is defined as:
Among them, if x < 0, there is χ (x) = 1, and C d is the cutoff distance parameter. The parameter C d determination is actually the distribution of the average number of neighbors each data point has. Specifically, ρ i is defined as the number of data points that are shorter than C d and adjacent to x i .
Another type of local density is called soft threshold, and it is defined as follows:
The high-density nearest neighbor distance δ
i
represents the distance value from the nearest sample among all samples with a greater density value than sample x
i
. If x
i
has the highest density value, δ
i
is defined as the furthest distance from other data points. Specifically, δ
i
is calculated as follows:
The DPC algorithm finds a boundary area for each cluster. This area represents the cluster of all data points that can be assigned to the cluster, and is a certain distance away from the data point cluster of another cluster (that is, the cutoff distance C d ). Then, the DPC algorithm finds the data point with the highest density in the boundary area of the cluster and expresses its density as ρ b . Data points in the cluster with a density higher than ρ b are considered part of the cluster core, and other points are considered part of the noise.
Identifying cluster centers is the most important factor affecting the performance of DPC algorithms. The cluster centers with high local density ρ and high density nearest neighbor distance δ can be easily identified in the decision graph. Nevertheless, it is difficult for DPC to identify cluster centers with low local density ρ and high density closest distance δ, or high local density ρ and low density closest distance δ.
SREDPC clustering algorithm is proposed. The algorithm is based on sampling and residual squared, and inherits the advantages of DPC cluster center detection. It uses a sampling method for large data sets to determine the cutoff distance C d , and introduces distance measurement based on residual theory and the density connectivity of DBSCAN.
The entire SREDPC algorithm consists of the following four steps.
Step 1. Preprocessing: The residual squares between the data points and the high-density nearest neighbor distance δ are calculated.
Step 2. The cut-off distance is determined by the sampling process: according to the size of the high-dimensional data set, it is determined whether or not to perform the sampling process, and then the cut-off distance value C d is determined according to the formula.
Step 3. Initial allocation: The decision graph is generated based on the residuals, which identifies cluster centers and assigns respective cluster labels to data points.
Step 4. Halo point recognition: Halo points (composed of boundary points and abnormal points) are identified, the abnormal points in the halo points are detected and isolated, and the final clustering results are output.
Unlike the DPC algorithm, SREDPC introduces a residual method instead of relying on the local density between data points. The reason is that the residuals can construct a more informative decision graph in the later stage, which is helpful to improve the clustering performance. Specifically, the residual square (
Among them, ∥x i - x j ∥ represents the Euclidean distance between x i and x j , N is a predefined parameter that defines the size of the neighborhood, and |N i | represents the number of data points in N i .
δ
i
represents the minimum distance from data point x
i
to another data point with a lower residual error. The calculation formula of δ
i
value is as follows:
The calculation steps of C d in the DPC algorithm are as follows:
(1) The algorithm calculates the distance d ij between the point and the point, and sets d ji = d ij , i < j, i, j ∈ I s ;
(2) The distances of
Among them, f (Mt) = P % * M is rounded to the nearest whole number.
In large data sets and high-dimensional data sets, the distance between each point may not be obtained, and it is not necessary to know all the distances between points, because some points in large data sets have no value contribution to clustering and will greatly increase Algorithm complexity. Therefore, this algorithm decides to use sampling to solve this problem, and then the two factors of sampling ratio and data size should be determined next.
(1) Appropriate sampling ratio is selected: Different sampling ratios in the same one hundred thousand data set are selected, the corresponding cutoff distance C d is calculated respectively, and the C d trend chart shown in Figs. 1 and 2 is obtained.

The relationship between the break distance and the percentage of sampled data.

Comparison of cutoff distance between sample data and original data.
Among them, the X axis represents the sampling ratio, and the Y axis represents the cutoff distance value. It is not difficult to find that when the sampling ratio is about 10%, C d tends to stabilize.
(2) The sampling data scale is determined: a three-dimensional Gaussian distribution artificial data set with a data scale of 1,000 to 10,000 is selected, and the cutoff distance C d is calculated on the data set and the data set sampled by 10% of the data set. The changes and comparison are shown in Fig. 1(b).
Among them, the X axis represents the size of the data set, and the Y axis represents the cutoff distance value. It can be seen that when the data set exceeds 7000, the cutoff distance obtained by the sampled data set is basically the same as that of the original data set.
In summary, when the data set size is less than 7000, the cutoff distance calculation method in the original DPC algorithm is used. At the same time, the sampling method is used when the data set size is greater than 7000. The specific steps are as follows:
Step 1. The sample is drawn,
The parameter X s represents the extracted data set, x represents the original data set, and f sample represents the sampling function.
Step 2. The cutoff distance C d is calculated according to formula (25).
In order to better deal with halo points and improve clustering performance, this algorithm uses cutoff distance bar to identify halo points. During anomaly detection, halo points with high residuals and high occupancy are identified as abnormal points. All abnormal points are collected in anosel. Then, the final clustering result is output as a graph, and the identified abnormal points are highlighted. As shown in Figs. 3 and 4, the halo points identified in the two data sets are represented by red circles.

Halo point recognition of the algorithm in the Flame data set.

Halo point recognition in the algorithm Aggregation dataset.
This paper uses the BBC model of input-oriented variable returns to scale to analyze the green economic efficiency of 31 regions from 2012 to 2019 and obtains the efficiency value and input slack of each province. At the same time, in order to comprehensively analyze the difference between the economic efficiency of considering the resource and environment and the economic efficiency of not considering the resource and environment, this paper also calculates that the resource and environment factors are not considered. After that, this article compares and analyzes the model constructed in this article with traditional economic efficiency using traditional GDP as an output indicator. The results are shown in Table 1, and the scatter plot is shown in Fig. 5.
Statistical table of DEA in the first stage
Statistical table of DEA in the first stage

Statistical diagram of DEA in the first stage.
In order to more clearly examine the green economic efficiency under resource and environmental constraints, the calculation results of traditional economic efficiency are also given in the third stage for comparative analysis. The results are shown in Table 2 and Fig. 6.
Statistical table of the third stage of DEA

Statistical table of the third stage of DEA.
It can be seen from the above that after removing the influence of the external environment and random interference in the second stage, the average comprehensive technical efficiency in the traditional economic efficiency is 0.479, the average pure technical efficiency is 0.967, and the average scale efficiency is 0.496. In the green economy efficiency, the average comprehensive technical efficiency is 0.461, the average pure technical efficiency is 0.962, the average scale efficiency is 0.479, and efficiency losses are widespread. On national average, the efficiency values of the third stage are significantly lower than those of the first stage, indicating that ignoring the influence of the external environment and random interference will overestimate the economic efficiency. Although the efficiency of the green economy in the third stage is still lower than that of the traditional economy, compared with the first stage, the efficiency gap is significantly reduced. In addition, in terms of coefficient of variation, the difference in efficiency between regions is still greater in green economy. It shows that in the case of considering resources and environmental factors, the difference in the degree of waste of resources and environmental damage in development between regions shows greater efficiency differences.
In the current process of my country’s economic development, micro-level behavior is difficult to control, and waste is extremely common. Moreover, many backward enterprises and departments have low technical and management levels, and their production processes consume large amounts of energy and resources, but their utilization efficiency is very low. Therefore, sufficient attention must be paid to the effective use of resources.
This paper introduces the cluster analysis in data mining, analyzes and studies various clustering algorithms, and points out the problems of traditional clustering algorithms in high-dimensional data clustering by explaining the characteristics of high-dimensional data. As the dimensionality and size of the data set increase, the complexity of the clustering algorithm increases, and the distance of the data in the high-dimensional space is almost equal, so the traditional clustering algorithm will lose its function in the high-dimensional spatial clustering. Based on the above problems, this paper proposes a density peak clustering algorithm. In addition, this article proves that the spatial clustering algorithm has a certain effect in the development of urban green economy through actual verification and analysis.
Footnotes
Acknowledgment
The research in this paper was supported by Tibet Soft science research project: Research on the Capability of Tibet Regional Innovation and High-Tech Industry Development (NO. RK20170013).
