Incomplete high dimensional data streams clustering

Abstract

Many recent applications, such as sensor networks, generate continuous and time varying data streams that are often gathered from multiple data sources with some incompleteness and high dimensionality. Clustering such incomplete high dimensional streaming data faces four constraints: 1) data incompleteness, 2) high dimensionality of data, 3) data distribution, and 4) the continuous nature of data streams. In this paper, we propose the Subspace clustering for Incomplete High dimensional Data streams (SIHD) framework, which addresses these clustering issues. SIHD provides continuous missing values imputation for incomplete streams based on the corresponding nearest-neighbors’ intervals. An adaptive subspace clustering mechanism is proposed to deal with such incomplete high dimensional data streams. Our experimental results on two data sets show that SIHD clusters such incomplete high dimensional data streams more effectively, in terms of accuracy, precision, sensitivity, specificity, and F-score, than five algorithms: GFCM, GBDC-P2P, DS, Ensemble, and DMSC. On average, relative to these five algorithms in the order listed, SIHD improved: 1) the accuracy by 11.3%, 10.8%, 6.5%, 4.1%, and 3.6%, 2) the precision by 15%, 10.6%, 6.4%, 4%, and 3.5%, 3) the sensitivity by 16.6%, 10.6%, 5.8%, 4.2%, and 3.6%, 4) the specificity by 16.8%, 10.9%, 6.5%, 4%, and 3.5%, and 5) the F-score by 16.6%, 10.7%, 6.6%, 4.1%, and 3.6%.

Keywords

Data streams, incomplete data imputation, high dimensional data, subspace clustering

1 Introduction

Clustering is the process of categorizing a set of objects, where elements that are closer to each other according to the distance measure are placed in the same cluster, but elements that are further from each other are placed in different clusters [1, 2]. Many applications generate incomplete data. The recent incomplete data clustering approaches are divided into two categories: 1) ignorance-based that ignores the missing values in the clustering process and depends on complete values, 2) imputation-based that estimates the missing values and applies the clustering approach to the complete version of the data. Adopting the first strategy with high missing rates decreases the clustering performance.

In many applications, objects are described by many dimensions. Applying traditional (full dimensional space) clustering over such data objects will lead to useless clustering results. Therefore, subspace clustering of such high dimensional data improves the clustering performance [3].

Most recent approaches have studied the incompleteness and high dimensionality problems in depth for static data using different strategies, but these approaches have limitations. Recently, many applications such as sensor applications generate continuous, dynamic, rapid and time varying data streams [4–6].

In practice, data streams suffer from the incompleteness and high dimensionality problems. Data stream processing presents new issues that are not addressed by traditional data management techniques, and clustering data streams of this nature raises additional challenges. Unlike static data, continuous and time varying data streams require continuous clustering over time. The clustering algorithm must therefore be applied continuously to the most recent data elements, and the clustering results are updated as data streams continue to arrive.

In this paper, we propose the Subspace clustering for Incomplete High dimensional Data streams (SIHD) framework to overcome the above clustering challenges facing incomplete high dimensional data streams that are gathered from multiple sources.

Firstly, the proposed SIHD produces the subspaces of the dimensions of data elements per data source to deal efficiently with high dimensional data. Each subspace involves a subset of the total dimensions of that source. Then, per data source, the incomplete elements are processed. For each incomplete element per data source, SIHD determines the subspaces that contain missing values. Then, the k-nearest neighbors of each incomplete element per data source are determined over the subspaces of the missing values. The missing values of incomplete elements are imputed based on these k-nearest neighbors. Then, all data elements are clustered over the subspaces per data source. Finally, local results over all data sources are combined to produce the final clustering results.

The main contributions of this paper are summarized as follows. 1) An improved missing values’ imputation strategy is introduced for accurate imputation even under the high dimensionality constraint. 2) An adaptive subspace clustering mechanism is proposed to efficiently process incomplete high dimensional data streams.

The rest of this paper is organized as follows. Section 2 introduces the work related to the proposed framework. Section 3 formulates the problem. Section 4 explains the proposed SIHD framework. Section 5 introduces the proposed algorithm. Section 6 presents the experimental evaluation. Section 7 discusses the results and analysis. Section 8 concludes the paper.

2 Related work

In this section, we present the related works of our solution from three areas: 1) incomplete data processing, 2) high dimensional data clustering, 3) data streams processing and clustering.

2.1 Incomplete data processing

The multiple imputation of discrete combinations (MIDC) in [7] introduced missing values random imputations for incomplete data. The Grey based Fuzzy c-Means (GFCM) algorithm in [8] provided missing values imputation and fuzzy-based data clustering. The Model based Missing value Imputation using Correlation (MMIC) method in [9] recovered the missing values based on the average K-nearest neighbors’ values.

Algorithms in [10, 11] estimated missing values using the nearest neighbors’ intervals. A dominance-based algorithm proposed in [12] addressed incomplete interval-valued data. In [13] a model was proposed to impute incomplete data over limited data environments. An incomplete data imputation algorithm was proposed in [14] based on neural networks. The fuzzy-based algorithm proposed in [15] addressed incomplete data and uncertain linguistic preference relations.

The above algorithms handled the incompleteness problem for static data using different methods. However, these algorithms have some limitations. Recently, many applications have produced continuous data streams that require continuous processing. Also, most recent applications generate high dimensional data, whereas the above algorithms processed data over the full dimensional space. Incomplete high dimensional data stream processing was not considered by the above algorithms.

2.2 High dimensional data clustering

The algorithm proposed in [16] handled high dimensional data based on feature selection. The Information Gain based Semi-supervised subspace Clustering (IGSC) algorithm in [17] provided semi-supervised learning of high dimensional data based on subspace clustering. In [18] the SUBSCALE algorithm introduced subspace clustering by assigning signatures to the one-dimensional clusters, which helps to identify the maximal subspace clusters.

The ensemble approach in [19] handled high dimensional incomplete data by providing features partitioning and missing values imputations. In [20] the discriminative multi-view subspace clustering (DMSC) introduced subspace clustering and distributed learning for distributed high dimensional data. The Dense Segments (DS) algorithm in [21] provided subspace clustering based on dense regions over the lower subspaces. In [22] the Classification based Text-Frequent-Pattern-tree (CTFP) method was proposed for dealing with high-dimensional data in text classification.

The above algorithms dealt with static data. High dimensional data stream processing was not sufficiently addressed by these algorithms.

2.3 Data streams processing

The gossip-based distributed clustering for P2P networks (GBDC-P2P) algorithm in [23] proposed a distributed learning of data streams that were gathered from distributed and dynamic data sources. The fuzzy-based clustering algorithm in [24] handled multi-sensor probabilistic data. The online batch-based active learning algorithm (OBAL) in [25] was proposed for addressing social media streaming data learning.

The stream-based active learning algorithm (SAL) [26] considered both concept drift and concept evolution by adapting the classifier based on continuous variations of streaming data over time. In [27] the distributed stream processing was addressed in different cluster platform engines (SPEs) such as Storm, Spark Streaming, Google Dataflow, and Azure Stream Analytics.

The incompleteness and high dimensionality problems were deeply studied by most of the algorithms reviewed in Section 2. However, most of these algorithms were applied to static data. In practice, recent applications generate continuous and time varying data streams that are gathered from multiple sources. Data streams often suffer from incompleteness and high dimensionality. Additional constraints appear when dealing with such incomplete high dimensional distributed data streams. In contrast to static data, data streams require continuous processing of the most recent elements that are generated over time. Also, data stream distribution should be considered.

Therefore, in this paper, we propose the Subspace clustering for Incomplete High dimensional Data streams (SIHD) framework to solve the above issues. The main objective of the proposed SIHD is the clustering of data streams in the presence of the incompleteness problem, the dimensionality increase, the data continuity, and the data distribution issue.

Efficient clustering of such data requires an integrative and adaptive process that covers the following questions:

How to process the high dimensional data? What is the strategy to produce the subspaces (subsets of dimensions)?

How to efficiently impute the missing values over the high dimensional data?

How to deal with the distribution of data?

How to address the changeable nature of data streams?

So, SIHD introduces an adaptive subspace clustering that is applicable with incomplete high dimensional data streams that are gathered from multiple sources.

3 Problem formulation

We are given high dimensional incomplete streaming data of n elements $\tilde{E} = \{\tilde{E}_{1}, \tilde{E}_{2}, \dots, \tilde{E}_{n}\}$ in an m-dimensional space, gathered from multiple sources S = {S₁, S₂, ⋯ , S_i}. This data set includes incomplete elements, each with some (but not all) values missing. Element $\tilde{E}_{a}$ can be represented as $\tilde{E}_{a} = \{\tilde{E}_{1a}, \tilde{E}_{2a}, \dots, \tilde{E}_{ma}\}$. For clustering such a data set, four issues should be taken into consideration: 1) the continuous and changeable nature of data streams, 2) data stream distribution, 3) data incompleteness, and 4) high dimensionality of data.

3.1 Incomplete data processing

Assume an m-dimensional incomplete data set of n objects $\tilde{E} = \{\tilde{E}_{1}, \tilde{E}_{2}, \dots, \tilde{E}_{n}\}$ that contains some missing values. Recent incomplete data processing approaches fall into two categories: ignoring the dimensions that contain missing values (ignorance-based) or estimating the missing values to form a complete data set (imputation-based). The ignorance-based strategy reduces the accuracy of data clustering.

Most of the recent imputation-based algorithms solve data incompleteness by estimating the missing values based on the nearest-neighbor rule. Firstly, the partial distance between each incomplete object and all complete elements is calculated. The partial distance computes the similarity between the incomplete object $\tilde{E}_{a}$ and an object $\tilde{E}_{b}$ by formula (1) [10].

$D_{ab} = \sqrt{\frac{1}{\sum_{j = 1}^{m} I_{j}} \sum_{j = 1}^{m} (\tilde{E}_{ja} - \tilde{E}_{jb})^{2} I_{j}}, \quad a, b = 1, 2, \dots, n; \; j = 1, 2, \dots, m$ (1)

Where $\tilde{E}_{ja}$ , $\tilde{E}_{jb}$ are the j^th dimension values of the elements $\tilde{E}_{a}$ , $\tilde{E}_{b}$ , and $I_{j} = \begin{cases} 1, & \text{if both } \tilde{E}_{ja} \text{ and } \tilde{E}_{jb} \text{ are non-missing} \\ 0, & \text{otherwise} \end{cases}$
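The partial distance of formula (1) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `partial_distance` and the use of `None` to mark a missing value are assumptions made for the example.

```python
import math

def partial_distance(a, b):
    """Partial distance of formula (1); None marks a missing value (assumed encoding)."""
    # Keep only the dimensions observed in both elements (I_j = 1).
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        raise ValueError("no dimension is observed in both elements")
    squared = sum((x - y) ** 2 for x, y in shared)
    # Divide by the number of shared dimensions, i.e. the sum of I_j.
    return math.sqrt(squared / len(shared))

# The second dimension is skipped because it is missing in the first element.
d = partial_distance([1.0, None, 3.0], [1.0, 5.0, 7.0])
```

Dividing by the number of shared dimensions keeps distances comparable between pairs that share different numbers of observed attributes.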

Then, the k nearest neighbors of each incomplete object are obtained. Because the missing values are uncertain, an interval for each missing dimension value is derived from the nearest neighbors. This interval takes full advantage of the attribute distribution information. The interval of a missing value of an incomplete object is determined by the minimum and the maximum of the nearest neighbors’ values in that dimension [10]. For an incomplete object ${\tilde{E}}_{b}$ with a missing value at the j^th attribute, let $E_{bj}^{-}$ and $E_{bj}^{+}$ be the minimum and the maximum of the nearest neighbors’ values of the j^th attribute. The nearest neighbors’ interval of ${\tilde{E}}_{jb}$ can be represented as [ $E_{bj}^{-}, E_{bj}^{+}$ ]. Finally, the missing values are imputed based on the assigned intervals.

3.2 High dimensional data clustering

Traditional clustering algorithms generate clusters in the full-dimensional space based on the similarity between the data points using all dimensions of the data set. But with the increase of data dimensionality, some of the dimensions become irrelevant to some of the clusters. Therefore, these full space (full dimensional) clustering algorithms are not effective to produce meaningful clusters for the high-dimensional data. Alternatively, it becomes imperative to get clusters over the relevant subspaces (subsets) of dimensions of the data. The subspace clustering is the process of generating clusters in the subspaces of the data set [18].

For a high dimensional data set E = {E₁, E₂, ⋯ , E_a, ⋯ , E_n} of n elements, each point is an m-dimensional vector E_a = {E_1a, E_2a, ⋯ , E_ma}. A subspace S is a subset of the original dimension set D = {d₁, d₂, ... , d_m} and is represented as S = {d₁, d₂, ... , d_k}, where d_i ∈ D and 1≤k≤m.

In the traditional bottom up subspace clustering algorithms, dense (very close) points in the one-dimensional subspaces (the subspaces of all single dimensions {d₁}, {d₂}, ... , {d_m}) are combined to compute the two-dimensional clusters, which are then combined to compute three-dimensional clusters, and so on. Each one-dimensional subspace includes clusters of points in one dimension from the original dimension set of the data set. A point is dense in a subspace if it has at least τ neighbors within ɛ distance. These traditional algorithms generate all possible clusters in all subspaces, but they fail to scale as the number of dimensions increases. If a cluster c in a d-dimensional subspace does not exist in any of the higher subspaces of (d + 1) dimensions, then it is called a maximal subspace cluster. The non-maximal clusters should not be generated because they are trivial, but most of the current subspace clustering algorithms implicitly or explicitly find such redundant and trivial clusters.

In Fig. 1, an example shows how the bottom up subspace clustering works. The example presents only the first two attributes (dimensions) d1, d2 from a data set of 10 points. Firstly, for each dimension, the points are clustered as in Fig. 1. (a). We observe in Fig. 1. (b) that c1 in the two-dimensional space is the intersection of points in d1 [c1] (first cluster in the first dimension) and d2 [c2] (second cluster in the second dimension). Also, c2 in the two-dimensional space is the intersection of points in d1 [c2] and d2 [c3]. In addition, we can note the absence of two-dimensional clusters containing points p1, p3, p5 in the 2-D space d1, d2.

Fig. 1

Example of producing the two-dimensional (2-D) space from the one-dimensional (1-D) space.

3.3 Data streams clustering

Traditional clustering, which finds clusters over a whole data set of fixed size, is not suited to the continuous nature of streaming data, which motivated continuous clustering. The clustering process is applied continuously over time with the continuous arrival of streaming data. For each time interval, the most recent stream elements are clustered, and the results are updated continuously over time [28].

4 The proposed framework

The proposed Subspace clustering for Incomplete High dimensional Data streams (SIHD) framework is presented in Fig. 2. SIHD introduces a continuous subspace clustering of incomplete high dimensional data streams.

Fig. 2

The proposed SIHD framework.

4.1 The data streams pre-processing sub-system

This sub-system includes two blocks, the data streams distributor and the complete / incomplete streams divider.

Data streams distributor

For each time interval, this block divides the most recent data streams into multiple sub-sets based on their sources.

Complete/incomplete streams divider

This block separates data streams per each data source to complete and incomplete elements.
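The two pre-processing blocks can be sketched together as follows. This is only an illustration under stated assumptions: the function name `split_streams`, the `(source_id, values)` tuple layout, and the use of `None` for a missing value are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def split_streams(window):
    """Group one window of stream elements by source, then by completeness.

    window: list of (source_id, values) pairs; None marks a missing value.
    """
    by_source = defaultdict(lambda: {"complete": [], "incomplete": []})
    for source, values in window:
        # An element is incomplete as soon as one attribute value is missing.
        kind = "incomplete" if any(v is None for v in values) else "complete"
        by_source[source][kind].append(values)
    return dict(by_source)

window = [("s1", [1.0, 2.0]), ("s1", [None, 2.0]), ("s2", [3.0, 4.0])]
parts = split_streams(window)
```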

4.2 The subspaces preparation sub-system

This sub-system generates the subspaces of the input data streams with high dimensionality. It consists of two main blocks: the 1-D subspaces constructor and the maximal space generator.

1-D subspaces constructor

For each data source, the complete stream objects are clustered per dimension. This block outputs the one-dimensional (1-D) subspaces’ clusters per data source. A 1-D subspace cluster consists of a set of dense points (very close points). A dense point in a subspace is a point that has at least τ neighbors within ɛ distance.
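The density test described above can be illustrated for a single dimension. The sketch below is a brute-force check and not the authors' implementation; the names `dense_points`, `eps`, and `tau` are chosen for the example, and the absolute difference is used as the 1-D distance.

```python
def dense_points(values, eps, tau):
    """Return indices of points that have at least tau neighbors within eps (1-D)."""
    dense = []
    for i, v in enumerate(values):
        # Count the other points whose absolute difference from v is at most eps.
        neighbors = sum(
            1 for j, w in enumerate(values) if j != i and abs(v - w) <= eps
        )
        if neighbors >= tau:
            dense.append(i)
    return dense

values = [0.10, 0.11, 0.12, 0.13, 0.50, 0.90]
dense = dense_points(values, eps=0.035, tau=3)
```

In this toy input the first four values lie close together and are reported as dense, while the two isolated values are not. A sorted sweep would avoid the quadratic scan on larger windows.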

Maximal space generator

This block is responsible for producing the maximal space based on the 1-D subspaces’ clusters. A maximal subspace cluster is a cluster that does not exist in any of the higher subspaces of (d + 1) dimensions.

4.3 The incomplete high dimensional streams clustering sub-system

The incomplete high dimensional streams per data subset are clustered in this sub-system. This sub-system consists of four main blocks, the partial distance calculator, the nearest neighbors’ producer, the complete streams generator, and the complete streams subspace and distributed clustering.

Partial distance calculator

For each incomplete stream object per data source, this block determines in which subspaces (from the maximal space) the missing values of the object are found. Then, the partial distance is calculated between the incomplete object and all complete objects over the subspaces of missing values.

Nearest neighbors producer

This block finds the k-nearest neighbors for each incomplete object per data source over the subspaces of missing values. The nearest neighbors of an object are obtained from the calculated partial distances.

Complete streams generator

The missing values are imputed per data source in this block. The missing values of an incomplete object are estimated based on the nearest neighbors of the object. This block forms a complete version of the input data.

Complete streams subspace and distributed clustering

This block is responsible for clustering the generated complete data streams (after missing values imputation). Continuous subspace clustering is applied to cluster such high dimensional streams that are gathered from multiple data sources. The final clusters are produced based on the sub-results of all sources.

5 The proposed algorithm

The traditional clustering is mainly applied to complete non-continuous data, and also generates clusters in the full-dimensional space. However, most recent applications produce continuous data streams that suffer from high dimensionality and incompleteness. Thus, an adaptive subspace clustering algorithm is proposed to overcome these clustering problems.

Given incomplete high dimensional streaming data of n elements, each incomplete stream element $\tilde{E}_{a}$ has some missing values (but not all). These data streams are generated within a specific time interval from multiple data sources DS = {ds₁, ds₂, ⋯ , ds_N}. Owing to the continuous nature of data streams, the proposed SIHD is applied continuously over time.

Firstly, the proposed SIHD divides the stream elements into multiple sub-sets based on their data sources DS. For each data source in DS, SIHD splits elements into incomplete and complete elements. Then, for each data source, we generate the subspaces of data. A subspace S over a data source ds_j is a subset of the total dimensions D_j = {d₁, d₂, ... , d_m} of the sub-set of data in ds_j, and is represented as S = {d₁, d₂, ... , d_i, ... , d_k}, where d_i ∈ D_j and 1≤k≤m [18].

There are shortcomings of the bottom up subspace clustering that negatively affect its performance. The bottom up strategy produces intermediate redundant and trivial clusters before reaching the maximal space. A cluster C is called a maximal subspace cluster, if it does not exist in any of the higher subspaces of (d + 1) – dimensions. There is no need to produce the intermediate non-maximal clusters. Also, the bottom up strategy requires an excessive number of database scans during the process of combining lower dimensional candidate clusters. To overcome the expensive steps of the traditional bottom up subspace clustering, we follow the signature strategy that was introduced in [18].

Initially, for the complete elements of each data source, SIHD generates the clusters over the one-dimensional (1-D) subspaces (the subspaces of all single dimensions). Then, SIHD produces the maximal space per data source based on the clusters of the 1-D subspaces.

Definition 1 (Density-Reachability):

If an element E_j is a neighbor of a dense element E_i, then E_j is called directly density-reachable from E_i. Direct density-reachability is not symmetric unless both elements are dense. Two elements E_i, E_j are called density-reachable from each other if there is a chain of directly density-reachable elements between them, i.e., in the chain {E_i, E₁, . . . , E_m, E_j}, each element E_d is directly density-reachable from its predecessor E_r.

Definition 2 (Density-Connected):

Two points E_i, E_j are called density-connected with each other if there is an element E_r such that both E_i and E_j are density-reachable from E_r. Both density reachability and density connectivity are defined based on the ɛ and τ parameters.

5.1 1-D subspaces generation

SIHD generates the 1-D subspaces’ clusters for the complete elements of each data source. A 1-D subspace cluster includes a group of dense (very close) points over the dimension of this subspace. According to the DBSCAN algorithm [18], a point that has at least τ neighbors within ɛ distance is called a dense point. The data were normalized to the interval [0, 1] before measuring distances, and the L1 metric was used as the distance measure. Each generated 1-D subspace cluster includes the smallest possible number of dense points and is called a dense unit.

Therefore, each 1-D subspace consists of a group of dense units (clusters). The size of a dense unit is fixed in all dimensions (1-D subspaces). To make points easy to refer to, each point is assigned a unique large positive integer and is represented by this label instead of its set of dimension values. For example, an element E_a of a data source ds_j is represented as E_a = l instead of {E_1a, E_2a, ⋯ , E_ma}, where l is a very large positive integer. A well tested label size is 12 digits [18]. A dense unit (cluster) of a 1-D subspace is assigned a signature (Sig), which is the sum of the labels of the points in that dense unit. If two dense units U1 and U2 have the same signature (a very large positive integer), then U1 and U2 contain the same points with very high probability.
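The labeling and signature idea can be sketched as follows. This is an illustration rather than the authors' code: the helper names `assign_labels` and `signature`, the fixed random seed, and the representation of a dense unit as a list of point indices are assumptions made for the example.

```python
import random

def assign_labels(n_points, digits=12, seed=0):
    """Draw one distinct random label with the given number of digits per point."""
    rng = random.Random(seed)
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    # random.sample draws without replacement, so all labels are distinct.
    return rng.sample(range(low, high + 1), n_points)

def signature(unit, labels):
    """Signature of a dense unit: the sum of the labels of its points."""
    return sum(labels[i] for i in unit)

labels = assign_labels(5)
u1 = [0, 2, 3, 4]  # a dense unit found in one dimension (point indices)
u2 = [4, 3, 2, 0]  # the same points found as a dense unit in another dimension
u3 = [0, 1, 2, 3]  # a different dense unit
same = signature(u1, labels) == signature(u2, labels)
different = signature(u1, labels) != signature(u3, labels)
```

Comparing one integer per dense unit stands in for comparing the point sets themselves, which is what makes matching units across dimensions cheap.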

If L≥1 is a positive integer, then a set {l₁, l₂,..., l_δ} is called its partition if $L = \sum_{i = 1}^{\delta} l_{i}$ for some δ≥1, and each l_i > 0 is called a summand. Also, let p_δ(L) be the total number of such partitions when each partition has at most δ summands. These integer partitions were studied in [18] by probabilistic methods, and an asymptotic formula was derived for δ = o(L^{1/3}):

$p_{\delta}(L) \sim \frac{\binom{L - 1}{\delta - 1}}{\delta!}$ (2)

Let K be a set of random large integers with δ ≪ |K| ≪ p_δ(L). Let U1 and U2 be two sets of integers from K such that |U1| = |U2| = δ, and δ = o(L^{1/3}). The sums of the integers in these two sets are sum(U1) and sum(U2). If sum(U1) = sum(U2) = L, then U1 and U2 are the same with an extremely high probability, provided L is very large.

Proof:

Based on Equation 2, for a very large positive integer L, if a relatively very small partition size δ is used, then the number of unique fixed-size partitions will be astronomically large. The probability of getting a particular partition set of size δ is:

$\frac{\binom{L - 1}{\delta - 1}}{\delta!} \Big/ \binom{L}{\delta} = \frac{(L - 1)! \, \delta! \, (L - \delta)!}{(\delta - 1)! \, (L - \delta)! \, \delta! \, L!} = \frac{1}{L (\delta - 1)!}$ (3)

This means the probability of randomly choosing the same partition again is extremely low. Also, this probability can be made very small by choosing a large value of L and relatively very small δ. Since L is the sum of the labels of δ points in a dense unit U, L can be made very large if we choose very large integers as the individual labels. Therefore, with δ=τ+ 1, the two dense units U1 and U2 will contain the same points with very high probability, if sum (U1) = sum (U2), provided this sum is very large.
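The simplification in formula (3) can be checked numerically with exact rational arithmetic. The sketch below is only a sanity check of the algebra; the function name `partition_ratio` and the sample values of L and δ are chosen for the example.

```python
from fractions import Fraction
from math import comb, factorial

def partition_ratio(L, delta):
    """Left-hand side of formula (3), evaluated exactly."""
    partitions = Fraction(comb(L - 1, delta - 1), factorial(delta))
    return partitions / comb(L, delta)

def closed_form(L, delta):
    """Right-hand side of formula (3): 1 / (L * (delta - 1)!)."""
    return Fraction(1, L * factorial(delta - 1))

ratio = partition_ratio(1000, 4)  # equals Fraction(1, 6000)
# With four 12-digit labels the sum L is of the order of 4 * 10**12.
tiny = closed_form(4 * 10 ** 12, 4)
```

For δ = τ + 1 = 4 and 12-digit labels, the closed form gives a value of about 4 × 10⁻¹⁴, which illustrates why equal signatures are treated as equal dense units.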

Definition 3 (Maximal Subspace Cluster):

A subspace cluster C = (E, S) is a set of dense elements E in a subspace S, such that ∀E_i, E_j ∈ E, E_i and E_j are density-connected with each other in S based on ɛ and τ, and there is no other element E_r ∈ E such that E_r is density-reachable from some E_q ∉ E in the subspace S. A cluster C_M = (E, S) is called a maximal subspace cluster if there is no other cluster C_j = (E, S′) such that S′ ⊃ S.

Observation 1:

If at least τ + 1 density-connected elements from a dimension d_i also exist as density-connected elements in the 1-D dimensions (single dimensions) d_j,..., d_r, then these elements will produce a group of dense elements in the maximal subspace, S_M = d_i, d_j,... d_r.

5.2 Maximal subspace generation

SIHD compares all the signatures over all 1-D subspaces (all single dimensions). In this way, the dense units that are common to several 1-D subspaces are obtained. Based on these common dense units, SIHD determines the higher level of subspaces (the maximal space). If the dense units do not exist in any of the higher subspaces of (d + 1) dimensions, then the maximal space is reached.
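One way to picture the signature comparison is to collect, for every signature, the set of dimensions in which it occurs. The sketch below is a simplified illustration and not the authors' procedure: the function name `maximal_subspaces`, the dictionary layout, and the choice to report only signatures shared by more than one dimension are assumptions.

```python
from collections import defaultdict

def maximal_subspaces(signatures_by_dim):
    """Map each shared signature to the set of dimensions where it appears.

    signatures_by_dim: dict of dimension name -> set of dense-unit signatures.
    """
    dims_by_sig = defaultdict(set)
    for dim, sigs in signatures_by_dim.items():
        for sig in sigs:
            dims_by_sig[sig].add(dim)
    # A signature seen in several dimensions marks a dense unit of their joint subspace.
    return {
        sig: frozenset(dims) for sig, dims in dims_by_sig.items() if len(dims) > 1
    }

signatures_by_dim = {"d1": {101, 202}, "d2": {101, 303}, "d3": {101, 202}}
shared = maximal_subspaces(signatures_by_dim)
```

Here the toy signature 101 occurs in all three dimensions, so its dense unit is placed in the subspace {d1, d2, d3}, while 303 occurs only in d2 and is left out.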

5.3 Incomplete stream elements imputation

After generating the highest level of subspaces based on the complete stream elements, the proposed SIHD handles the incomplete elements. For each incomplete stream element per data source, the SIHD determines the subspaces (from the maximal space) that contain the missing values of that incomplete element. Then, SIHD computes the partial distances between that incomplete element and all complete elements over the subspaces of missing values using formula (1). Then, for each incomplete element per data source, the SIHD generates the q nearest neighbors (from complete elements) over the subspaces of missing values.

Then, the intervals of the missing values of the incomplete elements are obtained from the nearest neighbors over the subspaces of missing values. For a missing value of attribute b of element $\tilde{E}_{a}$ over a specific subspace S in the data source ds_j, the q nearest neighbors’ interval can be represented as $[ds_{j}[\tilde{E}_{ab}^{-}], ds_{j}[\tilde{E}_{ab}^{+}]]$, where $ds_{j}[\tilde{E}_{ab}^{-}]$ and $ds_{j}[\tilde{E}_{ab}^{+}]$ are the minimum and the maximum values of attribute b among the nearest neighbors of element $\tilde{E}_{a}$ over the data source ds_j. Then, the missing values are imputed based on the generated nearest neighbor intervals over the subspaces of missing values using formula (4) [10].

$ds_{j}[\tilde{E}_{ab}] = r \cdot (ds_{j}[\tilde{E}_{ab}^{+}] - ds_{j}[\tilde{E}_{ab}^{-}]) + ds_{j}[\tilde{E}_{ab}^{-}], \quad r \in [0, 1]$ (4)

Where r is a random number uniformly distributed between 0 and 1.
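The imputation step can be sketched end to end for one incomplete element. This is a simplified illustration, not the authors' implementation: it computes partial distances over all observed dimensions rather than only over the subspaces of the missing values, and the names `impute` and `partial_distance`, the `None` encoding of missing values, and the seeded generator are assumptions.

```python
import math
import random

def partial_distance(a, b):
    """Partial distance of formula (1); assumes at least one shared observed dimension."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def impute(incomplete, complete, q, rng):
    """Fill each missing value from the interval spanned by the q nearest neighbors."""
    neighbors = sorted(complete, key=lambda c: partial_distance(incomplete, c))[:q]
    filled = list(incomplete)
    for j, value in enumerate(incomplete):
        if value is None:
            low = min(n[j] for n in neighbors)
            high = max(n[j] for n in neighbors)
            # Formula (4): r * (max - min) + min, with r drawn uniformly at random.
            filled[j] = rng.random() * (high - low) + low
    return filled

complete = [[0.1, 0.2, 0.30], [0.1, 0.2, 0.40], [0.9, 0.8, 0.95]]
filled = impute([0.1, 0.2, None], complete, q=2, rng=random.Random(0))
```

In this toy input the two nearest neighbors have third-attribute values 0.30 and 0.40, so the imputed value falls inside that interval.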

5.4 Complete data streams clustering

After imputing the missing values of incomplete objects, we have a complete version of streaming data. Finally, the generated complete data streams are clustered. The proposed SIHD applies subspace clustering on the generated complete data streams. The clusters of complete data streams over the highest level of subspaces (maximal space) are generated per each data source. We used the DBSCAN algorithm [18] to generate clusters per each subspace in the highest level of subspaces (maximal space). According to DBSCAN, a cluster is defined as a set of dense points. By combining the local results over all data sources, the final clusters are produced. The following pseudo code illustrates the proposed algorithm.

Algorithm: SIHD
Input: DS[$\tilde{E}$], X, q
Output: FinalClusters
for each ds in DS
    DS[Com], DS[Inc] ← separate complete/incomplete elements(DS[$\tilde{E}$])
end for
for each ds in DS
    for each d in D_ds
        U_d ← generate dense units(DS[Com])
    end for
end for
for each ds in DS
    DS[Com] ← assign labels(DS[Com], X)
    for each d in D_ds
        Sig_d ← assign signature of dense units(U_d)
    end for
    MS ← generate maximal space(Sig_d)
    DS[CE] ← impute missing values(DS[Inc], DS[Com], q)
    MSclusters ← generate maximal space clusters(DS[CE], MS)
end for
FinalClusters ← integrate local results(MSclusters)

6 Experimental evaluation

Our experiments test the performance of the proposed SIHD algorithm using two high dimensional data sets with predefined data clusters that contain distributed sensors’ streaming data, namely the Smartphone-Based Recognition of Human Activities and Postural Transitions data set [29, 30] and the Epileptic Seizure Recognition data set [31, 32]. Table 1 presents the description of the data sets. For simplicity, we refer to them as data set 1 and data set 2. Data streams were processed over a tumbling window that divided the data into non-overlapping consecutive windows. Therefore, each streaming data element was processed only once. We used 3000 elements from data set 1 and 1500 elements from data set 2, covering three consecutive time intervals, each with a size of 1000 elements for data set 1 and 500 elements for data set 2.
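A tumbling window of the kind described above can be sketched with a small generator. The function name `tumbling_windows` and the use of a plain range as a stand-in for the stream are assumptions made for illustration.

```python
from itertools import islice

def tumbling_windows(stream, size):
    """Yield consecutive, non-overlapping windows of at most `size` elements."""
    it = iter(stream)
    while True:
        window = list(islice(it, size))
        if not window:
            return
        # Each element is consumed from the iterator once, so it is processed once.
        yield window

# 3000 elements split into three windows of 1000, mirroring the first data set.
windows = list(tumbling_windows(range(3000), 1000))
```

If the stream length is not a multiple of the window size, this sketch yields a shorter final window instead of dropping the remaining elements.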

Table 1
Data sets description

Data set number	Number of data objects	Number of attributes	Arrival rate	Number of categories
1	3000	561	2.5 s	6
2	1500	179	23.5 s	5

The experiments were run three independent times. Each run was performed on the objects of one time interval. The results are averaged over the three time intervals in each data set. From the above complete data sets, we produced incomplete data with missing data rates of 5%, 10%, 15% and 20%. The missing values were generated uniformly (in a random manner) over the two data sets, and a non-uniform missing values’ distribution was applied only to data set 1. To generate the non-uniform distribution, missing values were concentrated in fewer columns instead of being placed completely at random. An incomplete element must have at least one missing attribute value. The XLSTAT sampling tool was used to sample our data from the data sets [30].

The performance of the proposed SIHD algorithm is compared with five algorithms, which are 1) GFCM [8] which imputed the missing values of incomplete elements, 2) GBDC-P2P [23] that introduced a distributed clustering of data streams which were captured from distributed and dynamic data sources, 3) DS [21] that provided subspace clustering for high dimensional data, 4) Ensemble [19] that addressed the incomplete high dimensional data clustering, 5) DMSC [20] that provided features partitioning and distributed clustering of high dimensional distributed data. Missing data rates were taken as 0% (complete data), 5%, 10%, 15% and 20%. The nearest neighbor size that was used in the experiments is 6, τ = 3, and ɛ = 0.001. The values of these parameters have been proposed and chosen in [10, 18] as these values are more suitable and give more accurate results.

7 Results and analysis

The measurements of our experiments are: 1) the misclassification, 2) the accuracy, 3) the precision, 4) the sensitivity, 5) the specificity, 6) the F-score. The misclassification measure is the number of objects that were incorrectly classified. The accuracy is the proportion of correct predictions among all predictions made. The precision is the number of true positive predictions divided by the total number of positive predictions. The sensitivity is the proportion of positives that are correctly identified. The specificity is the proportion of negatives that are correctly identified. The F-score is the harmonic mean of precision and sensitivity. The best results over all experiments are marked in bold and the next best results are underlined. In addition, two statistical hypothesis tests, ANOVA and Friedman, were applied to show the significance of the results.
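The six measurements defined above can be computed from true and predicted labels as in the following sketch. Since the data sets have several categories, the sketch treats each category one-vs-rest and macro-averages the per-category values; that averaging choice is an assumption, as the text does not state how the multi-class values were aggregated. The function name and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def one_vs_rest_metrics(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Dict[str, float]:
    """Misclassification count, accuracy, and macro-averaged
    precision, sensitivity, specificity and F-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    totals = {"precision": 0.0, "sensitivity": 0.0,
              "specificity": 0.0, "f_score": 0.0}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        tn = n - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        # Harmonic mean of precision and sensitivity.
        f_score = (2 * precision * sensitivity / (precision + sensitivity)
                   if precision + sensitivity else 0.0)
        totals["precision"] += precision
        totals["sensitivity"] += sensitivity
        totals["specificity"] += specificity
        totals["f_score"] += f_score
    misclassified = sum(t != p for t, p in pairs)
    result = {name: value / len(labels) for name, value in totals.items()}
    result["misclassification"] = float(misclassified)
    result["accuracy"] = (n - misclassified) / n
    return result


# Toy labels (assumption, not taken from the data sets).
metrics = one_vs_rest_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```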

7.1 The misclassification results

The average misclassifications were calculated for the proposed SIHD algorithm and the compared algorithms GFCM, GBDC-P2P, DS, Ensemble, and DMSC using: 1) data set1 and data set2 with uniform missing data rates of 0% (complete data), 5%, 10%, 15% and 20%, 2) data set1 with a non-uniform missing data rate of 20%. We also calculated the average misclassifications of the proposed SIHD algorithm over data set1 using two different distributions of the data (two and three sub-sets) in each time interval. The elements of data set1, which were generated from three distributed sources, were divided into two or three distributed portions (sub-sets) to show the effect of distributed clustering.

The ANOVA and Friedman tests were applied to the misclassification results over the subsets of data over the three time intervals using: 1) data set1 and data set2 with uniform missing data rates of 0% (complete data), 5%, 10%, 15% and 20%, 2) data set1 with a non-uniform missing data rate of 20%. The tests were performed using Social Science Statistics [35].
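For readers who want to see what the ANOVA F-ratio measures, the following sketch computes the one-way F-ratio from first principles. It is an illustration only: the grouping used by the authors is not described in detail, the function name is hypothetical, and the sample values are invented. Converting the F-ratio to a p-value requires the F distribution, which is left to a statistics package.

```python
from typing import Sequence


def anova_f_ratio(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F-ratio: between-group mean square
    divided by within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Invented values: three groups of three observations each.
f_ratio = anova_f_ratio([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(f_ratio)
```

A large F-ratio means the group means differ by much more than the spread inside each group, which is what the small p-values in the result tables reflect.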

7.1.1 The misclassification results over uniform missing values distribution

The average misclassifications and the standard deviation of the misclassifications of the proposed SIHD algorithm and the compared methods (GFCM, GBDC-P2P, DS, Ensemble, and DMSC) over data set1 and data set2 with uniform missing values’ distribution of 0%, 5%, 10%, 15% and 20% are shown in Tables 2 and 3.

Table 2
Average misclassifications and standard deviation of the misclassifications over data set1 (uniform)

	Average number of misclassifications±Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
0	70±1.6	289±2.2	158.7±2.5	109.3±2.1	109.3±2.1	70±1.6
5	73.3±1.2	294.3±2.1	172.3±1.7	122.3±1.2	111±2.9	90.7±1.7
10	77±2.2	298.7±2.9	186±2.2	139.3±1.7	114.7±0.9	103±0.8
15	77.7±1.9	303±1.6	197.3±2.4	143±2.2	116.7±2.6	112.3±1.7
20	78.7±2.1	304.3±1.7	206±2.2	158±2.4	117±1.6	134.3±2.1

Table 3

Average misclassifications and standard deviation of the misclassifications over data set2 (uniform)

	Average number of misclassifications±Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
0	28 ± 0.8	83.3±3.1	61.7±2.9	49±1.4	49±1.4	28 ± 0.8
5	28.7 ± 1.7	84±2.2	75.3±2.4	59.7±2.6	50.7±2.6	44.3±1.2
10	29 ± 2.4	84.7±1.7	84.3±2.1	64±2.2	51.3±2.9	51.7±1.9
15	31.3 ± 2.1	85.3±2.5	89.7±1.7	72.3±1.9	51.7±1.7	59±2.9
20	33.7 ± 0.9	86.7±2.6	99±2.8	81±2.4	53±2.2	66.7±2.5

For all missing rates from 5% to 20%, the proposed SIHD produces fewer misclassifications than every compared algorithm on data set1 and data set2; on complete data (0%) it ties with DMSC. The effectiveness of SIHD rests on three factors: 1) imputing missing values rather than ignoring them, 2) distributed processing of streaming data gathered from multiple distributed sources, 3) subspace clustering of high dimensional data.

SIHD processed the data elements of each data source separately, whereas the GFCM, DS, and Ensemble algorithms did not apply distributed clustering and ignored the different characteristics of the data across the multiple distributed sources. SIHD also efficiently produced the subspaces of the high dimensional data and processed the data elements over these subspaces, which contributed to its clustering results. In contrast, the GFCM and GBDC-P2P algorithms clustered the data elements over the full dimensional space without partitioning, which degraded their clustering results because, as the dimensionality increases, some dimensions become irrelevant to some clusters. In addition, SIHD imputed missing values based on nearest neighbors computed only over the subspaces that contain the missing data. These neighbors were therefore produced accurately, the missing values were replaced by accurate estimates, and this efficient estimation supported the clustering process of SIHD. SIHD ignored no dimensions during clustering, whereas the GBDC-P2P, DS, and DMSC algorithms ignored the dimensions that included missing values: only dimensions with values for all data elements were involved in their clustering, which decreased their performance. Finally, the GFCM algorithm relied on the full dimensional space to find the nearest neighbors of incomplete elements for imputation, which led to inefficient clusters.
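The idea of imputing a missing value from the interval spanned by its nearest neighbors can be sketched as follows. This is a simplified illustration and not the SIHD procedure itself: it uses a commonly used partial distance over the attributes observed in both elements, searches neighbors among complete elements over all shared attributes rather than over selected subspaces, and fills each gap with the midpoint of the neighbors' interval. All names and the toy data are assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]


def partial_distance(a: Row, b: Row) -> float:
    """Distance over attributes observed in both rows,
    rescaled to the full number of attributes."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(len(a) / len(shared) * squared)


def impute_from_neighbors(
    target: Row, complete_rows: Sequence[Sequence[float]], k: int
) -> Tuple[List[float], List[Tuple[float, float]]]:
    """Fill each missing attribute of `target` with the midpoint of the
    [min, max] interval of that attribute over the k nearest neighbors."""
    neighbors = sorted(
        complete_rows, key=lambda row: partial_distance(target, row))[:k]
    imputed = list(target)
    intervals = []
    for j, value in enumerate(target):
        if value is None:
            column = [row[j] for row in neighbors]
            low, high = min(column), max(column)
            intervals.append((low, high))
            imputed[j] = (low + high) / 2  # midpoint is an assumption
    return imputed, intervals


# Toy data (assumption): the third attribute of the target is missing.
rows = [[1.0, 1.0, 1.0], [1.2, 0.9, 3.0], [9.0, 9.0, 9.0]]
imputed, intervals = impute_from_neighbors([1.1, 1.0, None], rows, k=2)
print(imputed, intervals)
```

Restricting the neighbor search to the relevant subspaces, as the text describes for SIHD, would change only which attributes enter `partial_distance`.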

The standard deviations show that, for missing rates of 5% to 20%, the worst results of the proposed SIHD (over the three runs) are better than the best results of the other algorithms on data set1 and data set2. For example, at a missing rate of 5% on data set1 (Table 2), the worst misclassification count of SIHD is 73.3 + 1.2 = 74.5, which is less than the best misclassification count of DMSC, 90.7 − 1.7 = 89.0.

The ANOVA and Friedman statistical tests were performed on the misclassification results over data set1 and data set2 with uniform missing data rates; the results are shown in Tables 4 and 5. The p-values of both tests are below 0.05 in every case, so the null hypothesis is rejected and the results are significant at the 0.05 significance level.

Table 4

Statistical test of hypotheses of misclassification results over data set1 (uniform)

Missing %	ANOVA		Friedman
	F-ratio	P-Value	X²_r statistic	P-Value
0	139.78	0.00001	12	0.00248
5	31.29	0.00005	7	0.0302
10	18.76	0.000413	11.08	0.00392
15	85.76	0.00001	10.33	0.0057
20	71.82	0.00001	12	0.00248

Table 5

Statistical test of hypotheses of misclassification results over data set2 (uniform)

Missing %	ANOVA		Friedman
	F-ratio	P-Value	X²_r statistic	P-Value
0	15	0.000977	10.33	0.0057
5	35.42	0.000029	11.08	0.00392
10	48.2	0.00001	11.08	0.00392
15	51.47	0.00001	11.08	0.00392
20	35	0.000031	11.08	0.00392

7.1.2 The misclassification results over non-uniform missing values distribution

Table 6 presents the average misclassifications and their standard deviation for the proposed SIHD and the compared algorithms on data set1 with a 20% non-uniform missing values’ distribution. To generate the non-uniform distribution, missing values were concentrated in fewer columns instead of being placed completely at random. The results show that the worst misclassifications of the proposed SIHD are better than the best misclassifications of all compared algorithms. The results for data set1 in Tables 2 and 6 support the effectiveness of the proposed SIHD relative to the compared algorithms under both uniform and non-uniform missing values’ distributions. This is because SIHD applies missing values imputation and subspace clustering to the incomplete high dimensional streaming data gathered from multiple sources.

Table 6
Average misclassifications and standard deviation of the misclassifications over data set1 (non-uniform)

	Average number of misclassifications±Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
20	80.7 ± 1.7	309±1.4	212.3±2.9	161±2.8	122.3±2.5	139±2.4

Table 7 shows the results of the ANOVA and Friedman tests on the misclassification results over data set1 with the 20% non-uniform missing values’ distribution. These results provide evidence that the misclassification results are significant at the 0.05 significance level.

Table 7

Statistical test of hypotheses of misclassification results over data set1 (non-uniform)

Missing %	ANOVA		Friedman
	F-ratio	P-Value	X²_r statistic	P-Value
20	20.56	0.000286	9.33	0.0094

7.1.3 The misclassification results over different distributed portions

We divided the elements of data set1, which were gathered from three distributed sources, in each time interval into two and three distributed portions (sub-sets), and applied the SIHD algorithm to each division. The average misclassifications and their standard deviation over data set1 with a uniform missing rate of 5% are shown in Table 8. SIHD produces fewer misclassifications with three sub-sets than with two. Based on the standard deviations, the worst result with three sub-sets (73.3 + 1.2 = 74.5) is still better than the best result with two sub-sets (86 − 1.4 = 84.6), which indicates the effect of the distributed clustering introduced by the proposed SIHD algorithm.

Table 8
Average misclassifications and standard deviation of the misclassifications of SIHD over data set1 using different distributions

Number of sub-sets	Average misclassifications± Standard deviation
2	86±1.4
3	73.3 ± 1.2

7.2 The confusion matrix results

The clustering performance was measured using the confusion matrix measurements [34] which are accuracy, precision, sensitivity (Recall), specificity, and F-score for: 1) the proposed SIHD and the compared algorithms on data set1 and data set2 over uniform missing data rates of 0%, 5%, 10%, 15% and 20%, 2) the proposed SIHD and the compared algorithms on data set1 over non-uniform missing data rate of 20%, 3) the proposed SIHD on data set1 with two and three distributed portions (sub-sets) in each time interval to illustrate the impact of distributed clustering.

The ANOVA and Friedman tests were applied to the confusion matrix results over data set1 with uniform and non-uniform missing data rates and over data set2 with uniform missing data rates.

7.2.1 The confusion matrix measurements over uniform missing values rate

The accuracy, precision, sensitivity, specificity, and F-score of the proposed SIHD and the compared algorithms over data set1 and data set2 with uniform missing rates are shown in Tables 9 and 10. For missing rates of 5% to 20%, the proposed SIHD outperforms all compared algorithms (on complete data it ties with DMSC) because it efficiently provides missing values imputation and subspace clustering of incomplete high dimensional data streams.

Table 9
Average confusion matrix measurements over data set1 (uniform)

	Average accuracy % ± Standard deviation						Average precision % ± Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
0	93 ± 0.2	71.1±0.1	84.1±0.2	89.1±0.1	89.1±0.1	93 ± 0.2	92.6 ± 0.1	70.6±0.2	83.8±0.2	88.6±0.1	88.6±0.1	92.6 ± 0.1
5	92.7 ± 0.1	70.6±0.4	82.8±0.1	87.8±0.4	88.9±0.4	90.9±0.4	92.4 ± 0.2	70.1±0.3	82.3±0.3	88.2±0.2	88.2±0.3	90.4±0.3
10	92.3 ± 0.2	70.1±0.6	81.4±0.2	87.1±0.3	88.5±0.4	89.7±0.2	91.8 ± 0.2	69.7±0.2	81±0.4	86.6±0.4	87.8±0.2	89.3±0.2
15	92.2 ± 0.2	69.7±0.2	80.3±0.2	85.7±0.2	88.3±0.3	88.8±0.2	91.6 ± 0.1	69.3±0.2	79.9±0.4	85±0.3	87.5±0.2	88.2±0.2
20	92.1 ± 0.3	69.6±0.2	79.4±0.4	84.2±0.2	88.3±0.2	86.6±0.1	91.5 ± 0.2	68.9±0.3	78.8±0.2	83.7±0.2	87.5±0.2	86.2±0.4
	Average sensitivity % ± Standard deviation						Average specificity % ± Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
0	92.3 ± 0.1	70.2±0.4	83.4±0.1	88.4±0.1	88.4±0.1	92.3 ± 0.1	93.4 ± 0.2	71.5±0.3	84.4±0.3	89.5±0.2	89.5±0.2	93.4 ± 0.2
5	92.2 ± 0.2	69.8±0.2	82±0.4	87.9±0.3	87.9±0.4	90.1±0.4	93 ± 0.1	70.9±0.3	83.2±0.3	88±0.4	89.4±0.2	91.3±0.3
10	91.5 ± 0.1	69.3±0.4	80.6±0.2	86.3±0.3	87.4±0.3	89±0.4	92.6 ± 0.2	70.3±0.3	81.7±0.2	87.5±0.3	88.9±0.3	89.9±0.2
15	91.4 ± 0.1	69±0.2	79.7±0.1	84.6±0.3	87.2±0.3	87.8±0.2	92.5 ± 0.2	70±0.3	80.5±0.3	86.1±0.3	88.8±0.2	89.1±0.3
20	91.2 ± 0.2	68.6±0.1	78±0.3	83.4±0.2	87.2±0.2	85.9±0.2	92.4 ± 0.2	69.9±0.3	79.8±0.2	84.6±0.2	88.8±0.2	87±0.3
	Average F-score % ± Standard deviation
% missing				Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
0				92.4 ± 0.2	70.4±0.2	83.6±0.2	88.5±0.2	88.5±0.2	92.4 ± 0.2
5				92.3 ± 0.2	69.9±0.2	82.1±0.1	88±0.2	88±0.2	90.2±0.1
10				91.6 ± 0.2	69.5±0.2	80.8±0.2	86.4±0.2	87.6±0.2	89.1±0.2
15				91.5 ± 0.1	69.1±0.1	79.8±0.2	84.8±0.3	87.3±0.1	88±0.4
20				91.3 ± 0.2	68.7±0.3	78.4±0.2	83.5±0.2	87.3±0.1	86±0.2

Table 10

Average confusion matrix measurements over data set2 (uniform)

	Average accuracy % ±Standard deviation							Average precision % ± Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
0	94.4 ± 0.2	83.3±0.3	87.7±0.2	90.2±0.3	90.2±0.3	94.4 ± 0.2	93.9 ± 0.1	82.9±0.3	87.2±0.2	89.6±0.2	89.6±0.2	93.9±0.1
5	94.3 ± 0.2	83.2±0.2	84.9±0.5	88.1±0.2	89.9±0.2	91.1±0.1	93.7 ± 0.2	82.7±0.2	82.7±0.2	87.6±0.2	89.5±0.2	90.6±0.2
10	94.2 ± 0.3	83.1±0.2	83.1±0.2	87.2±0.2	89.7±0.2	89.7±0.2	93.6 ± 0.2	82.6±0.2	82.6±0.2	86.5±0.2	89.4±0.1	89.2±0.3
15	93.7 ± 0.2	82.9±0.2	82.1±0.3	85.8±0.2	89.7±0.4	88.2±0.3	93.2 ± 0.2	82.4±0.2	82.3±0.3	85.2±0.3	89.4±0.1	87.5±0.3
20	93.3 ± 0.2	82.7±0.4	80.2±0.2	83.8±0.2	89.4±0.3	86.7±0.2	92.8 ± 0.2	82.1±0.3	82±0.3	83.3±0.2	88.8±0.2	86.1±0.2
	Average sensitivity % ±Standard deviation							Average specificity % ±Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
0	93.6 ± 0.2	82.5±0.3	86.9±0.3	89.1±0.2	89.1±0.2	93.6 ± 0.2	94.8 ± 0.1	83.6±0.2	88.1±0.3	90.5±0.2	90.5±0.2	94.8 ± 0.1
5	93.3 ± 0.1	82.3±0.3	82.5±0.1	87.2±0.2	89±0.4	90.1±0.2	94.6 ± 0.2	83.4±0.2	85.2±0.3	88.6±0.2	90.3±0.3	91.5±0.3
10	93.2 ± 0.1	82.2±0.3	82.3±0.3	86±0.3	88.8±0.4	88.8±0.1	94.5 ± 0.1	83.3±0.3	83.5±0.3	87.6±0.2	90.1±0.2	90±0.3
15	92.9 ± 0.2	81.9±0.1	81.8±0.2	84.6±0.2	88.8±0.2	87±0.2	94 ± 0.2	83.1±0.3	82.4±0.2	86.1±0.3	90.1±0.3	88.7±0.3
20	92.3 ± 0.5	81.7±0.2	81.6±0.2	82.7±0.2	88.1±0.3	85.7±0.3	93.8 ± 0.1	83±0.2	80.6±0.3	84.2±0.3	89.8±0.3	87.1±0.3
	Average F-score % ± Standard deviation
% missing				Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
0				93.7 ± 0.1	82.7±0.2	87±0.2	89.3±0.2	89.3±0.2	93.7 ± 0.1
5				93.5 ± 0.2	82.4±0.2	82.6±0.1	87.4±0.2	89.2±0.1	90.3±0.2
10				93.4 ± 0.2	82.4±0.3	82.4±0.3	86.2±0.2	89.1±0.3	89±0.2
15				93 ± 0.2	82.1±0.3	82±0.2	84.9±0.3	89.1±0.2	87.2±0.3
20				92.5 ± 0.1	81.9±0.2	81.8±0.2	83±0.2	88.4±0.2	85.9±0.2
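The accuracy values in Tables 9 and 10 are consistent with the misclassification counts in Tables 2 and 3 if accuracy is computed as the share of correctly classified elements in a window of 1000 elements (data set1) or 500 elements (data set2). The short check below illustrates this relationship for a few SIHD entries; the function name is hypothetical, and the interpretation of the window size as the denominator is an inference from the reported numbers rather than a statement from the text.

```python
def accuracy_percent(misclassified: float, window_size: int) -> float:
    """Accuracy in percent from an average misclassification count."""
    return 100.0 * (1.0 - misclassified / window_size)


# (average misclassifications, window size, reported accuracy %)
# SIHD at 0% and 5% on data set1, and at 0% and 20% on data set2.
checks = [
    (70.0, 1000, 93.0),
    (73.3, 1000, 92.7),
    (28.0, 500, 94.4),
    (33.7, 500, 93.3),
]
for misclassified, size, reported in checks:
    computed = round(accuracy_percent(misclassified, size), 1)
    print(misclassified, size, computed, reported)
```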

The ANOVA and Friedman statistical tests were performed on the confusion matrix results over data set1 and data set2 with uniform missing data rates; the results are shown in Tables 11 and 12. All p-values are below the 0.05 significance level, which is strong evidence against the null hypothesis: differences this large would be unlikely if the null hypothesis were true. Therefore, we reject the null hypothesis and accept the alternative hypothesis.

Table 11

Statistical test of hypotheses of confusion matrix results over data set1 (uniform)

Missing %	ANOVA
Accuracy		Precision		Sensitivity		Specificity		F-score
	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value
0	20	0.00032	21.66	0.000232	11.05	0.002928	45.62	0.00001	52.39	0.00001
5	11.25	0.002758	81.08	0.00001	23.25	0.000174	24.15	0.000148	33.07	0.000039
10	15.53	0.000855	24.07	0.00015	21.16	0.000255	54	0.00001	63.33	0.00001
15	51.53	0.00001	26.07	0.000108	20.47	0.000291	34.35	0.000033	18.38	0.000446
20	20.64	0.000282	56.03	0.00001	27.14	0.000091	52.93	0.00001	27.14	0.000091
Friedman
Accuracy		Precision		Sensitivity		Specificity		F-score
	X²_r statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value
0	9.75	0.00764	10.33	0.0057	11.08	0.00392	11.08	0.00392	12	0.00248
5	12	0.00248	12	0.00248	12	0.00248	10.08	0.00646	10.33	0.0057
10	12	0.00248	12	0.00248	12	0.00248	12	0.00248	12	0.00248
15	6.08	0.04776	9.25	0.0098	9.75	0.00764	11.08	0.00392	11.08	0.00392
20	6.08	0.04776	12	0.00248	10.08	0.00646	12	0.00248	8.33	0.0155

Table 12

Statistical test of hypotheses of confusion matrix results over data set2 (uniform)

Missing %	ANOVA
Accuracy		Precision		Sensitivity		Specificity		F-score
	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value
0	70	0.00001	19.14	0.000381	75	0.00001	25.73	0.000114	22.27	0.000207
5	15.86	0.000789	76.17	0.00001	17.2	0.00058	49	0.00001	53.23	0.00001
10	26.33	0.000103	63.33	0.00001	23.30	0.000172	33.18	0.000039	56.87	0.00001
15	37.12	0.000024	44.65	0.00001	33.88	0.000035	36.33	0.000026	56.87	0.00001
20	32.65	0.000041	43.88	0.000011	38	0.000021	28.96	0.000069	29.13	0.000067
Friedman
Accuracy		Precision		Sensitivity		Specificity		F-score
	X²_r statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value
0	12	0.00248	12	0.00248	12	0.00248	12	0.00248	12	0.00248
5	6.08	0.04776	12	0.00248	10.0833	0.00646	12	0.00248	10.33	0.0057
10	10.33	0.0057	12	0.00248	11.0833	0.00392	12	0.00248	12	0.00248
15	6	0.01431	11.08	0.00392	9.25	0.0098	11.08	0.00392	12	0.00248
20	12	0.00248	11.08	0.00392	12	0.00248	12	0.00248	12	0.00248

7.2.2 Confusion matrix measurements using non-uniform missing values distribution

The confusion matrix measurements of the proposed SIHD and the compared algorithms over data set1 with a 20% non-uniform missing values’ distribution are shown in Table 13. To generate the non-uniform distribution, missing values were concentrated in fewer columns. The proposed SIHD achieves better results than the other algorithms on every measure. The reason is that SIHD applies three improvements, namely missing values imputation, distributed clustering, and subspace clustering, instead of ignoring the missing values and applying non-distributed, full dimensional space clustering as in the compared algorithms. The clustering results for data set1 in Tables 9 and 13 support the effectiveness of SIHD under both uniform and non-uniform missing values’ distributions.

Table 13
Average confusion matrix measurements over data set1 (non-uniform)

	Average accuracy % ± Standard deviation						Average precision % ± Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
20	91.9 ± 0.1	69.1±0.2	78.8±0.3	83.9±0.3	87.8±0.2	86.1±0.2	90.2 ± 0.2	68.6±0.3	78.2±0.4	83.1±0.2	87.1±0.3	85.5±0.3
	Average sensitivity % ± Standard deviation						Average specificity % ± Standard deviation
% missing	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC	Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
20	90 ± 0.2	68.3±0.2	77.8±0.3	82.7±0.1	86.8±0.2	85.2±0.2	92.3 ± 0.2	69.5±0.3	79.3±0.3	84.2±0.2	88.1±0.2	86.5±0.3
	Average F-score % ± Standard deviation
% missing				Proposed (SIHD)	GFCM	GBDC-P2P	DS	Ensemble	DMSC
20				90.1 ± 0.3	68.4±0.3	78±0.2	82.9±0.2	86.9±0.3	85.3±0.2

Table 14 presents the results of the ANOVA and Friedman tests on the confusion matrix results over data set1 with the 20% non-uniform missing values’ distribution. The results in Table 14 provide evidence that the confusion matrix results are significant at the 0.05 significance level.

Table 14

Statistical test of hypotheses of confusion matrix results over data set1 (non-uniform)

Missing %	ANOVA
Accuracy		Precision		Sensitivity		Specificity		F-score
	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value	F-ratio	P-Value
20	34.42	0.000033	32.05	0.000045	35.90	0.000027	37.43	0.000023	31.58	0.000048
Friedman
Accuracy		Precision		Sensitivity		Specificity		F-score
	X²_r Statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value	X²_r statistic	P-Value
20	12	0.00248	12	0.00248	11.08	0.00392	11.08	0.00392	12	0.00248

7.2.3 The average confusion matrix results over different data distributions

The confusion matrix measurements were calculated for the proposed SIHD algorithm on the elements of data set1, which were gathered from three distributed sources. The elements in each time interval were divided into two and three distributed sub-sets, and the SIHD algorithm was applied to each division. Table 15 presents the confusion matrix measurements over data set1 with a uniform missing values’ distribution of 5%. The results with three sub-sets outperform the results with two sub-sets on every measure, which indicates the importance of the distributed strategy introduced by the proposed SIHD for the clustering performance.

Table 15
Average confusion matrix measurements over data set1 using different data distribution

Number of sub-sets	Average accuracy% ± Standard deviation	Average precision % ± Standard deviation	Average sensitivity % ± Standard deviation	Average specificity % ± Standard deviation	Average F-score % ± Standard deviation
2	90.9±0.2	90.3±0.3	89.8±0.2	91.4±0.3	90±0.2
3	92.7 ± 0.1	92.4 ± 0.2	92.2 ± 0.2	93 ± 0.1	92.3 ± 0.2

8 Conclusion

Data streams generated by many modern applications, such as sensor networks, are continuous and are often gathered from multiple data sources with some incompleteness and high dimensionality. Recent studies have addressed the incompleteness and high dimensionality problems in depth for static data using different methodologies. However, incomplete high dimensional data streams impose additional constraints: unlike static data, data streams require continuous processing of the most recent elements over time.

To address these constraints, we propose the Subspace clustering for Incomplete High dimensional Data streams (SIHD) framework. To match the continuous variation of data streams over time, the proposed SIHD works in a continuous manner. SIHD adaptively imputes the missing values of incomplete high dimensional data, and an improved clustering strategy is proposed to process such data accurately. The experimental results demonstrated the efficiency of the proposed SIHD over five compared algorithms using two different data sets in terms of accuracy, precision, sensitivity, specificity, and F-score. SIHD finds better clustering results for incomplete high dimensional data streams than the compared algorithms under both uniform and non-uniform missing values’ distributions. In future work, it would be interesting to deal with mixed-type and interval-valued data. We are also interested in working with incomplete big text data and in query processing over such incomplete high dimensional data streams.

References

Jiang

, Tao

and Li

, Dfc: density fragment clustering without peaks, Journal of Intelligent & Fuzzy Systems 34(1) (2018), 525–536.

Ren

and Bedini

T.H.

, A fuzzy clustering algorithm for Internet customer group behavior data, Journal of Intelligent & Fuzzy Systems 35(4) (2018), 4235–4243.

Liu

, Liu

, Zhang

, Wang

and Xiao

, Entropy-based active sparse subspace clustering, Multimedia Tools and Applications 77(17) (2018), 1–7.

Najib

F.M.

, Ismail

R.M.

, Badr

N.L.

and Gharib

, Clustering based approach for incomplete data streams processing, Journal of Intelligent & Fuzzy Systems (Preprint) (2020), 1—15.

Xue

, Lin

, Yuan

and Cai

, Early warning classification of cluster supply chain emergency based on cloud model and datastream clustering algorithm, Journal of Intelligent & Fuzzy Systems 35(1) (2018), 393–403.

Priya

and Umamaheswari

, Aspect-based summarisation using distributed clustering and single-objective optimisation, Journal of Information Science (2019), 21:0165551519827896.

Ensor

, Deeks

J.J.

, Martin

E.C.

and Riley

R.D.

, Meta-analysis of test accuracy studies using imputation for partial reporting of multiple thresholds, Research synthesis methods 9(1) (2018), 100–115.

Sefidian

A.M.

and Daneshpour

, Missing value imputation using a novel grey based fuzzy c-means, mutual information based feature selection, and regression model, Expert Systems with Applications 115 (2019), 68–94.

Zahin

S.A.

, Ahmed

C.F.

and Alam

, An effective method for classification with missing values, Applied Intelligence 48(10) (2018), 3209–3230.

10.

Zhang

, Lu

, Liu

, Pedrycz

, Zhong

and Wang

, A Global Clustering Approach Using Hybrid Optimization for Incomplete Data Based on Interval Reconstruction of Missing Value, International Journal of Intelligent Systems 31(4) (2016), 297–313.

11.

, Zhang

, Lu

, Hou

, Liu

, Pedrycz

and Zhong

, Interval kernel fuzzy c-means clustering of incomplete data, Neurocomputing 237 (2017), 316–331.

12.

Dai

, Yan

, Li

and Liao

, Dominance-based fuzzy rough set approach for incomplete interval-valued data, Journal of Intelligent & Fuzzy Systems 34(1) (2018), 423–436.

13.

Sokat

K.Y.

, Dolinskaya

I.S.

, Smilowitz

and Bank

, Incomplete information imputation in limited data environments with application to disaster response, Journal of Operational Research 269(2) (2018), 466–485.

14.

Liu

, Incomplete big data imputation mining algorithm based on BP neural network, Journal of Intelligent & Fuzzy Systems, (Preprint), 1–10.

15.

, Wen

and Zhang

, Missing values estimation for incomplete uncertain linguistic preference relations and its application in group decision making, Journal of Intelligent & Fuzzy Systems, (Preprint) (2019), 1–14.

16.

Nagpal

and Singh

, Feature selection from high dimensional data based on iterative qualitative mutual information, Journal of Intelligent & Fuzzy Systems, (Preprint), 1–12.

17.

Harikumar

and Akhil

A.S.

, Semi supervised approach towards subspace clustering, Journal of Intelligent & Fuzzy Systems 34(3) (2018), 1619–1629.

18.

Kaur

and Datta

, A novel algorithm for fast and scalable subspace clustering of high-dimensional data, Journal of Big Data 2(1) (2015), 17.

19.

Tran

C.T.

, Zhang

, Andreae

, Xue

and Bui

L.T.

, An effective and efficient approach to classification with incomplete data, Knowledge-Based Systems 154 (2018), 1–6.

20.

Zeng

, Wang

, Yan

, Chen

and Hong

, Robust Discriminative multi-view K-means clustering with feature selection and group sparsity learning, Multimedia Tools and Applications 77(17) (2018), 22433–22453.

21.

Jain

and Murthy

C.A.

, Connectedness-based subspace clustering, Knowledge and Information Systems 58(1) (2019), 9–34.

22.

Liu

, An effective dimensionality reduction method for text classification based on TFP-tree, Journal of Intelligent & Fuzzy Systems 34(3) (2018), 1893–1905.

23.

Azimi

and Sajedi

, Peer sampling gossip-based distributed clustering algorithm for unstructured P2P networks, Neural Computing and Applications 29(2) (2018), 593–612.

24.

and Kim

K.F.

, Fuzzy clustering algorithm of interactive multi-sensor probabilistic data, Journal of Intelligent & Fuzzy Systems 35(4) (2018), 4267–4275.

25.

Pohl

, Bouchachia

and Hellwagner

, Batch-based active learning: Application to social media data for crisis management, Expert Systems with Applications 93 (2018), 232–244.

26.

Mohamad

, Sayed-Mouchaweh

and Bouchachia

, Active learning for classifying data streams with unknown number of classes, Neural Networks 98 (2018), 1–5.

27.

Affetti

, Tommasini

, Margara

, Cugola

and Della Valle

, Defining the execution semantics of stream processing engines, Journal of Big Data 4(1) (2017), 12.

28.

Najib

F.M.

, Ismail

R.M.

, Badr

N.L.

and Tolba

M.F.

, Cloud-based data streams optimization, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(3) (2018), e1247.

29.

Reyes-Ortiz

J.L.

, Oneto

, Samà

, Parra

and Anguita

, Transition-aware human activity recognition using smartphones, Neurocomputing 171 (2016), 754–767.

30.

Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set, https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions.

31.

Epileptic Seizure Recognition Data Set, https://archive.ics.uci.edu/ml/datasets/Epileptic+Seizure+Recognition.

32.

Andrzejak

R.G.

, Lehnertz

, Mormann

, Rieke

, David

and Elger

C.E.

, Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state, Physical Review 64(6) (2001), 061907.

33.

XLSTAT, http://www.xlstat.com/en/productssolutions/feature/data-sampling.html.

34.

Gupta

, Rawal

, Narasimhan

V.L.

and Shiwani

, Accuracy, sensitivity and specificity measurement of various classification techniques on healthcare data, IOSR J Comput Eng 11(5) (2013).

35. Social Science Statistics, https://www.socscistatistics.com.

Table 1
Data sets description

Data set number   Number of data objects   Number of attributes   Arrival rate   Number of categories
1                 3000                     561                    2.5 s          6
2                 1500                     179                    23.5 s         5

Table 6
Average misclassifications and standard deviation of the misclassifications over data set 1 (non-uniform)

Average number of misclassifications ± standard deviation

% missing   Proposed (SIHD)   GFCM        GBDC-P2P      DS          Ensemble      DMSC
20          80.7 ± 1.7        309 ± 1.4   212.3 ± 2.9   161 ± 2.8   122.3 ± 2.5   139 ± 2.4

Table 8
Average misclassifications and standard deviation of the misclassifications of SIHD over data set 1 using different distributions

Number of sub-sets   Average misclassifications ± standard deviation
2                    86 ± 1.4
3                    73.3 ± 1.2