Performance analysis of clustering algorithm under two kinds of big data architecture

Abstract

To compare the performance of a clustering algorithm on two data processing architectures, this paper first gives the implementations of the k-means clustering algorithm on two big data architectures, Hadoop MapReduce and Spark. We then analyze the differences in the theoretical performance of k-means on the two architectures from a mathematical point of view. The theoretical analysis shows that Spark is superior to Hadoop in terms of average execution time and I/O time. Finally, a text data set of user behavior on a social networking site is used in experiments. The results show that, for the k-means algorithm, the execution time and I/O time of Spark are significantly lower than those of MapReduce. The theoretical analysis and the implementation techniques presented in this paper provide a useful reference for applications of big data technology.

Keywords

Hadoop, MapReduce, Spark, clustering algorithm, big data, k-means

1. Introduction

With the coming of the Internet+ era, massive data is produced in all aspects of social life. How to extract its enormous hidden value has become a focus of the community and has risen to the level of national strategy. In March 2012, the Obama administration announced a plan to invest $200 million to start a “big data research and development program”, another major technology initiative following the “information superhighway” plan [15] announced in 1993. Figures from McKinsey's 2012 Big Data Report showed that the big data industry brought $300 billion in annual revenue to the US health care system, €250 billion in annual revenue to European public administration and 60% pure profit to the retail industry, and reduced product development costs in manufacturing by 50%. However, Canner thought that by 2015 more than 85% of Fortune 500 companies would lose their strengths in the big data competition [11]. The market research firm IDC predicted that the market for big data technology and services would rise from $3.2 billion in 2010 to $16.9 billion in 2015, an annual growth rate of 40% [9]. These statistics show that big data is widely applied and of great value. In terms of the concept and research status of big data, the core force that promotes big data development is big data processing technology. Whether we can extract the enormous scientific and economic value hidden in massive data depends on that technology. Big data technology has therefore become a research hot spot [13]. The traditional data processing model is limited in memory and processing capability and cannot meet actual demands. With the development of science and technology, parallel processing mechanisms such as MPI, PVM and MapReduce have been widely used in recent years.
However, as research on machine learning deepens, a large number of applications require iterative algorithms, and traditional data processing architectures handle them unsatisfactorily. Spark, an open source general-purpose parallel cloud computing platform developed by the UC Berkeley AMP Lab, meets this need [24]. Spark is the latest parallel distributed computing framework in the big data technology chain and is based mainly on in-memory computing. Some issues related to in-memory computing are supported by the National Natural Science Foundation, and related research has started. In-memory computing is also backed by many companies, such as Alibaba, Baidu and NetEase. Researchers are increasingly concerned with the performance of data processing platforms. At home and abroad, most research has focused on the differences between MapReduce [14] and Spark, the integration of in-memory computing and data mining algorithms on Spark [20], and improved clustering algorithms [19] on the Spark platform. The decision tree study on the two architectures in [24] shows that Spark is more suitable for iterative algorithms, but it does not examine the performance differences between the two architectures in depth. Meanwhile, [21] compares the performance of the two architectures using the k-means algorithm. The latest studies of the performance differences between the two architectures analyze only experimental results; theoretical analyses from a mathematical point of view are rare.

In this paper we first give the implementations of the k-means clustering algorithm on MapReduce and Spark. Then we focus on the theoretical performance differences between the two architectures from a mathematical point of view. Finally, we use experiments to verify the theoretical analysis.

2. Two implementations of k-means algorithm

2.1. Overview of k-means algorithm

K-means is a distance-based, unsupervised clustering algorithm. It has been widely used in science, industry, business and other fields [17]. Its similarity criterion is the distance between data objects: data in the same cluster are similar, and data in different clusters are dissimilar. The clustering criterion is the sum of squared deviations, defined as: $\begin{matrix} G_{c} = \sum_{j = 1}^{c} \sum_{k = 1}^{n_{j}} {| x_{k}^{(j)} - m_{j} |}^{2} \end{matrix}$ where $x_{k}^{(j)}$ is the k-th data object of cluster j and $n_{j}$ is the number of data objects in that cluster. For each data object $x_{i}$ , the cluster it belongs to is computed as: $\begin{matrix} c^{i} = {argmin}_{j} | x_{i} - m_{j} |^{2} \end{matrix}$ where $m_{j}$ is the center of cluster j.

The new center of cluster j is computed as: $\begin{matrix} m_{new_j} = \frac{\sum_{i = 1}^{n} w_{i j} x_{i}}{\sum_{i = 1}^{n} w_{i j}} \end{matrix}$ where $x_{i}$ is a data object and $w_{i j}$ indicates whether $x_{i}$ belongs to cluster j: $w_{i j} = 1$ if it does, and $w_{i j} = 0$ otherwise. The k-means algorithm [30] is as follows:

Input: data set D, the number of clusters k;

Output: the k clusters;

Select k data objects from data set D as the initial centers;

Repeat

For each data object $x_{i}$ in data set D

compute the distance from $x_{i}$ to each cluster center

assign $x_{i}$ to the nearest cluster

End For

Recompute each cluster center as the mean of the data objects assigned to it

Until the cluster centers no longer change [17,30].
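
The steps above can be sketched in plain Python. This is a minimal serial illustration, not the implementation used in the paper: the function names, the `max_iter` cap, the fixed random seed and the rule of keeping the old center for an empty cluster are all assumptions added here.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index j that minimises |x_i - m_j|^2 (the assignment rule)."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))


def kmeans(data, k, max_iter=100, seed=0):
    """Serial k-means over a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Select k data objects from the data set as the initial centers.
    centers = [tuple(p) for p in rng.sample(data, k)]
    labels = []
    for _ in range(max_iter):  # max_iter is a safety cap (assumption)
        # Assign every data object to its nearest cluster.
        labels = [nearest_center(p, centers) for p in data]
        # Recompute each center as the mean of its assigned objects.
        new_centers = []
        for j in range(k):
            members = [p for p, c in zip(data, labels) if c == j]
            if not members:
                new_centers.append(centers[j])  # empty cluster: keep old center (assumption)
                continue
            dim = len(members[0])
            new_centers.append(tuple(sum(p[d] for p in members) / len(members) for d in range(dim)))
        if new_centers == centers:  # centers no longer change
            break
        centers = new_centers
    return centers, labels


def objective(data, centers, labels):
    """Sum of squared deviations G_c for a given assignment."""
    return sum(squared_distance(p, centers[c]) for p, c in zip(data, labels))


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centers, final_labels = kmeans(points, k=2)
print(sorted(final_centers), objective(points, final_centers, final_labels))
```

Every pass over `data` in the loop is independent per data object, which is the property the parallel versions exploit.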

From the implementation we can see that the algorithm becomes inefficient when many iterations over massive data are required, and it then cannot meet the needs of practical applications. A parallel implementation of k-means solves this problem [17]. The following sections introduce the parallel implementations of k-means on MapReduce and Spark.

2.2. Parallel implementation of k-means based on MapReduce

As discussed in Section 2.1, the key to a parallel implementation of the algorithm is to assign different samples to their nearest clusters independently. The Map and Reduce operations are the same in every iteration of the parallel k-means algorithm [2]. First, we randomly select k samples as the centers and store them in an HDFS file as a global variable. Each iteration then consists of three parts:

Map Function [25]: The input is the default $< key, value >$ pair. The ‘key’ is the offset of the current sample relative to the start of the input file. The ‘value’ is a string made up of the coordinate values of the current sample in each dimension. We first parse the coordinate values of the current sample from the value and calculate the distance from the sample to each of the k cluster centers. We then obtain the index of the nearest cluster and output $< key 1, value 1 >$ , where key1 is the index of the nearest cluster and value1 is the string made up of the coordinate values of the current sample.

Combine Function: The input is $< key, V >$ . The ‘key’ is the cluster index. ‘V’ is the list of strings holding the coordinate values of the samples whose cluster index is key. First, we parse the coordinate values of each sample from the list. Second, we sum the values dimension by dimension and record the total number of samples in the list. The output is $< key 1, value 1 >$ , where key1 is the cluster index and value1 is a string made up of the sample count and the per-dimension coordinate sums.

Reduce Function: The input is $< key 1, value 1 >$ . First, we collect the intermediate results. Second, we compute the new cluster centers from them and update the HDFS file. The next iteration then starts, and the process continues until the results converge. The implementation process is shown in Fig. 1.

Fig. 1.

K-means algorithm implementation based on MapReduce.
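
The three functions can be illustrated with a small plain-Python simulation of one iteration. It is a sketch under stated assumptions rather than Hadoop code: the comma-separated sample format, the tuple used for the combiner output (instead of a string) and all function names are hypothetical, and the centers are passed as an argument instead of being read from an HDFS file.

```python
from collections import defaultdict


def parse_point(value):
    """Parse a comma-separated coordinate string (assumed format) into floats."""
    return [float(v) for v in value.split(",")]


def map_fn(key, value, centers):
    """Map: emit <nearest cluster index, sample string>. The offset `key` is unused."""
    point = parse_point(value)
    dists = [sum((x - c) ** 2 for x, c in zip(point, center)) for center in centers]
    return dists.index(min(dists)), value


def combine_fn(key, values):
    """Combine: emit <cluster index, (sample count, per-dimension sums)> for one split."""
    points = [parse_point(v) for v in values]
    sums = [sum(col) for col in zip(*points)]
    return key, (len(points), sums)


def reduce_fn(key, partials):
    """Reduce: merge the partial sums of one cluster and emit its new center."""
    total = sum(count for count, _ in partials)
    sums = [sum(col) for col in zip(*(s for _, s in partials))]
    return key, [s / total for s in sums]


def run_iteration(splits, centers):
    """Simulate one MapReduce round over `splits` (lists of sample strings)."""
    shuffled = defaultdict(list)
    for split in splits:
        grouped = defaultdict(list)
        for offset, line in enumerate(split):
            k1, v1 = map_fn(offset, line, centers)
            grouped[k1].append(v1)
        # The combiner runs on each split's local map output before the shuffle.
        for k1, values in grouped.items():
            key, partial = combine_fn(k1, values)
            shuffled[key].append(partial)
    new_centers = [list(c) for c in centers]  # clusters with no samples keep their center
    for key, partials in shuffled.items():
        new_centers[key] = reduce_fn(key, partials)[1]
    return new_centers


splits = [["0,0", "10,10"], ["0,2", "10,12"]]
print(run_iteration(splits, [[0.0, 0.0], [10.0, 10.0]]))
```

Because the combiner forwards only one count and one sum vector per cluster and split, the data shuffled to the reducers is independent of the number of samples in a split.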

2.3. Parallel implementation of k-means based on Spark

The implementation of the k-means algorithm on Spark consists of two parts [26]: partitioning the data points into clusters, and computing the cluster centers over multiple iterations until the results converge. The implementation is achieved mainly by the Driver, Mapper, Combiner and Reducer classes [29].

Driver: The underlying driver class that starts the program and processes the data set through the related functions.

Mapper: The class that determines the initial cluster centers and partitions the initial data set. It calculates the distance from each data object in the RDD to the cluster centers, assigns the object to the nearest cluster and then recomputes the cluster centers. The intermediate results generated by each iteration are transformed into a new RDD [4].

Combiner: The class that combines the intermediate RDD data. Because the Map process produces a large number of intermediate RDD results, combining them reduces traffic and avoids network congestion on the Spark platform.

Reducer: The class that reduces the local results produced by the Combiner into the global results. It judges the convergence of the cluster centers against a threshold [6]. The implementation process is shown in Fig. 2.

Fig. 2.

K-means algorithm implementation based on Spark.
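
The division of work among the four classes can be sketched with plain Python functions standing in for them. This is not Spark API code: the in-memory lists of partitions only imitate a cached RDD, and the function names, the default `threshold` and the `max_iter` cap are assumptions made for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest(point, centers):
    """Mapper step: index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda j: distance(point, centers[j]))


def combine(partition, centers):
    """Combiner step: per-partition {cluster index: (count, coordinate sums)}."""
    partial = {}
    for point in partition:
        j = closest(point, centers)
        count, sums = partial.get(j, (0, [0.0] * len(point)))
        partial[j] = (count + 1, [s + x for s, x in zip(sums, point)])
    return partial


def reduce_partials(partials, centers):
    """Reducer step: merge the partition results into the global centers."""
    merged = {}
    for partial in partials:
        for j, (count, sums) in partial.items():
            c0, s0 = merged.get(j, (0, [0.0] * len(sums)))
            merged[j] = (c0 + count, [a + b for a, b in zip(s0, sums)])
    return [
        [s / merged[j][0] for s in merged[j][1]] if j in merged else list(centers[j])
        for j in range(len(centers))
    ]


def drive(partitions, centers, threshold=1e-6, max_iter=50):
    """Driver: iterate over the in-memory partitions until no center moves
    by more than `threshold`. Returns (centers, iterations used)."""
    for iteration in range(1, max_iter + 1):
        partials = [combine(p, centers) for p in partitions]
        new_centers = reduce_partials(partials, centers)
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < threshold:
            break
    return centers, iteration


partitions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
print(drive(partitions, [[1.0, 1.0], [9.0, 9.0]]))
```

The point of the sketch is that `partitions` is built once and reused by every iteration, which is the role the cached RDD plays on Spark.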

3. Theoretical analysis of algorithm performance on two architectures

As discussed in Section 2, the k-means implementations on both architectures are based on Map and Reduce. The main reason for the performance differences between the two architectures is that Spark [16] computes on in-memory RDDs [4] and does not need to interact with the disk, while Hadoop computes on external storage and must interact with the disk. We therefore analyze the theoretical performance of the two architectures in terms of execution time, which is one of the standard measures of platform performance.

Algorithm execution time consists of computing time, communication time and system execution time. The computing time complexity of the two architectures is similar. Communication time depends on the communication volume and the communication mechanism. Hadoop and Spark are both based on an RPC mechanism, so the time difference due to the mechanism can be ignored. In terms of communication volume, Hadoop cannot reuse the data set during iteration, while Spark supports a data set caching policy. Whether the data set is reused directly affects the number of iterations. We can merge this difference into the system execution time, which includes Map, Reduce and I/O operation time. Therefore, the difference in time consumption between the two architectures lies mainly in the system execution time. The detailed analysis is as follows:

In the first iteration, both architectures read data from HDFS. The start and end heartbeat mechanisms of Hadoop are negligible relative to the total time. The second and subsequent iterations are processed in the same way on both architectures; the main difference is the I/O time consumption. To simplify the analysis of the performance differences, we assume that the cluster is homogeneous, that jobs are evenly distributed across all nodes and that no node fails during execution. We need the following auxiliary definitions [10]:

Definition 1.
We assume that k-means data processing requires $(k + 1)$ iterations on both architectures. For one full iteration on Hadoop and on Spark, the I/O time required by a complete MapReduce process is $T_{h}$ and $T_{s}$ , respectively. The Map time on Hadoop and Spark is $t_{1}$ and $T_{1}$ , respectively. The Reduce time on Hadoop and Spark is $t_{2}$ and $T_{2}$ , respectively.
Definition 2.
We define the main parameters of a MapReduce process: the input data set S, the intermediate output data set $S_{1}$ and the final output data set $S_{2}$ . When the data size is x, the running time of each Map on Hadoop is $f (x)$ and that of each Reduce on Hadoop is $g (x)$ ; the running time of each Map on Spark is $F (x)$ and that of each Reduce on Spark is $G (x)$ . All four are directly proportional to x, with proportionality coefficients α, β, γ and μ, respectively.
Definition 3.
The maximum available numbers of Map and Reduce tasks in the MapReduce computing system are M and R, respectively. During execution, the input is divided into X Map tasks and the system starts Y Reduce tasks. The rate at which data is read from HDFS is $v_{i}$ , and the rate at which data is written back to disk is $v_{o}$ . The rate at which data is read from memory is $V_{i}$ , and the rate at which data is written back to memory is $V_{o}$ . The network transmission rate is $v_{n}$ . On Hadoop, the Map initialization overhead is $C_{1}$ and the Reduce initialization overhead is $C_{2}$ . On Spark, the corresponding overheads are $C_{3}$ and $C_{4}$ . The number of nodes in the cluster is N, and the number of CPU cores per node is p, so that $R = N p$ and $M = 2 N p$ .

Hadoop Map Time: The process includes reading data from HDFS, executing the Map calculation and writing the intermediate Map results back to disk. The input of each Map is $\frac{S}{X}$ . The time consumed in this process is:

$\begin{matrix} (1) & t_{1} = \frac{S}{X v_{i}} + f (\frac{S}{X}) + \frac{S_{1}}{X v_{o}} + C_{1} \end{matrix}$

Hadoop Reduce Time: The process takes the sorted intermediate results output by Map as input, executes the Reduce calculation and outputs the results. The input of each Reduce is $\frac{S_{1}}{Y}$ . The time consumed in this process is:

$\begin{matrix} (2) & t_{2} = \frac{S_{1}}{Y v_{i}} + g (\frac{S_{1}}{Y}) + \frac{S_{2}}{Y v_{o}} + C_{2} \end{matrix}$

The I/O time of a full MapReduce process is therefore:

$\begin{matrix} (3) & T_{h} = \frac{S}{X v_{i}} + \frac{S_{1}}{X v_{o}} + \frac{S_{1}}{Y v_{i}} + \frac{S_{2}}{Y v_{o}} \end{matrix}$

Similarly, the time consumed in each stage on Spark is:

$\begin{array}{ll} (4) & T_{1} = \frac{S}{X V_{i}} + F (\frac{S}{X}) + \frac{S_{1}}{X V_{o}} + C_{3} \\ (5) & T_{2} = \frac{S_{1}}{Y V_{i}} + G (\frac{S_{1}}{Y}) + \frac{S_{2}}{Y V_{o}} + C_{4} \\ (6) & T_{s} = \frac{S}{X V_{i}} + \frac{S_{1}}{X V_{o}} + \frac{S_{1}}{Y V_{i}} + \frac{S_{2}}{Y V_{o}} \end{array}$

For both Hadoop and Spark, we assume that each Map transmits $\frac{S_{1}}{X Y}$ of data to each Reduce. The network transmission time of one iteration is therefore:

$\begin{matrix} (7) & t_{n} = \frac{S_{1}}{X Y v_{n}} \end{matrix}$

The derivation above does not take task scheduling into account. In practice scheduling is unavoidable because data sets are large. The numbers of scheduling rounds for Map and Reduce are:

$\begin{array}{ll} (8) & λ_{m} = \frac{X}{M} \\ (9) & λ_{r} = \frac{Y}{R} \end{array}$

When $t_{1} ⩽ M t_{n}$ , there is no need to wait for Reduce execution.

In practical applications, the times to complete one iteration on Hadoop and Spark are t and $t^{'}$ , respectively:

$\begin{array}{ll} (10) & t = λ_{m} t_{1} + M t_{n} + (λ_{r} - 1) X t_{n} + λ_{r} t_{2} \\ (11) & t^{'} = λ_{m} T_{1} + M t_{n} + (λ_{r} - 1) X t_{n} + λ_{r} T_{2} \end{array}$

For the $(k + 1)$ iterations, counting the k iterations after the first, which reads from HDFS on both architectures, the times required under the two architectures are, respectively:

$\begin{array}{ll} (12) & T_{hadoop} = k t = k λ_{m} t_{1} + k M t_{n} + k (λ_{r} - 1) X t_{n} + k λ_{r} t_{2} \\ (13) & T_{spark} = k t^{'} = k λ_{m} T_{1} + k M t_{n} + k (λ_{r} - 1) X t_{n} + k λ_{r} T_{2} \end{array}$

The corresponding I/O times are, respectively:

$\begin{matrix} (14) & T_{h}^{'} = k T_{h}; T_{s}^{'} = k T_{s} \end{matrix}$

Therefore, the performance differences between the two architectures can be illustrated by execution time and I/O consumption. To describe the problem more intuitively, we assign specific values to the parameters. Based on experience, we set $Y = 1.75 * N * p$ ; since $R = N p$ , this gives $\frac{Y}{R} = 1.75$ , which rounds up to $λ_{r} = 2$ . To simplify the calculation and reduce workload imbalance, we assume that $S = S_{1} = S_{2}$ , $v_{i} = v_{o} = 100 Mb/s$ , $V_{i} = V_{o} = 10 Gb/s$ and $v_{n} = 1 Gb/s$ , so we get $\frac{T_{h}}{T_{s}} = 100$ .
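
The ratio of 100 follows directly from Eqs. (3) and (6). As a short worked check, write $v$ for the common disk rate $v_{i} = v_{o}$ and $V$ for the common memory rate $V_{i} = V_{o}$ (shorthand introduced here only for this check), and take 1 Gb = 1000 Mb:

$\begin{array}{l} T_{h} = \frac{2 S}{v} ( \frac{1}{X} + \frac{1}{Y} ) , \quad T_{s} = \frac{2 S}{V} ( \frac{1}{X} + \frac{1}{Y} ) \\ \frac{T_{h}}{T_{s}} = \frac{V}{v} = \frac{10 Gb/s}{100 Mb/s} = 100 \end{array}$

Under these assumptions the I/O ratio depends only on the two transfer rates, not on S, X or Y.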

We keep $v_{i}$ , $v_{o}$ , $V_{i}$ , $V_{o}$ and $v_{n}$ the same as above and set the other parameters to $M = 12$ , $R = 6$ , $N = 3$ , $p = 2$ , $C_{1} = C_{3} = 0.3 s$ , $C_{2} = C_{4} = 0.2 s$ , $α = 0.8 s/M$ , $β = 0.9 s/M$ , $γ = 0.1 s/M$ , $μ = 0.2 s/M$ . We get $\frac{T_{hadoop}}{T_{spark}} = 18$ . If instead $C_{1} = 3 s$ , $C_{2} = 2 s$ , $C_{3} = 0.2 s$ , $C_{4} = 0.1 s$ , $α = 1.8 s/M$ , $β = 2 s/M$ , $γ = 0.04 s/M$ , $μ = 0.05 s/M$ , we get $\frac{T_{hadoop}}{T_{spark}} = 41$ .
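
The cost model of Eqs. (1) to (11) is easy to evaluate numerically. The sketch below is one possible encoding, with several assumptions that are not in the text: the data size `S` and the Map count `X` are hypothetical values (the text does not state them), rates are expressed in M/s by treating Mb and M as the same unit, and the scheduling rounds are rounded up with `ceil`, as the value $λ_{r} = 2$ suggests. The execution-time ratio it prints depends on those assumed values and is not expected to match the figures quoted above.

```python
import math
from dataclasses import dataclass


@dataclass
class Platform:
    read_rate: float        # v_i (Hadoop) or V_i (Spark), in M/s
    write_rate: float       # v_o or V_o, in M/s
    map_coeff: float        # alpha or gamma, in s/M
    reduce_coeff: float     # beta or mu, in s/M
    map_overhead: float     # C_1 or C_3, in s
    reduce_overhead: float  # C_2 or C_4, in s


def map_time(p, S, S1, X):
    """Eq. (1) / Eq. (4): read input, run Map, write intermediate results."""
    return S / (X * p.read_rate) + p.map_coeff * S / X + S1 / (X * p.write_rate) + p.map_overhead


def reduce_time(p, S1, S2, Y):
    """Eq. (2) / Eq. (5): read intermediate results, run Reduce, write output."""
    return S1 / (Y * p.read_rate) + p.reduce_coeff * S1 / Y + S2 / (Y * p.write_rate) + p.reduce_overhead


def io_time(p, S, S1, S2, X, Y):
    """Eq. (3) / Eq. (6): I/O part of one full MapReduce process."""
    return (S / (X * p.read_rate) + S1 / (X * p.write_rate)
            + S1 / (Y * p.read_rate) + S2 / (Y * p.write_rate))


def iteration_time(p, S, S1, S2, X, Y, M, R, v_n):
    """Eq. (10) / Eq. (11), with t_n from Eq. (7) and rounded-up Eqs. (8)-(9)."""
    t_n = S1 / (X * Y * v_n)
    lam_m = math.ceil(X / M)  # rounding up is an assumption
    lam_r = math.ceil(Y / R)
    return (lam_m * map_time(p, S, S1, X) + M * t_n
            + (lam_r - 1) * X * t_n + lam_r * reduce_time(p, S1, S2, Y))


N, P = 3, 2
M, R = 2 * N * P, N * P   # M = 12, R = 6
Y = 1.75 * N * P          # as set in the text
S = 1200.0                # hypothetical data size in M, with S = S1 = S2
X = 24                    # hypothetical number of Map tasks
V_N = 1000.0              # 1 Gb/s expressed in M/s

hadoop = Platform(100.0, 100.0, 0.8, 0.9, 0.3, 0.2)
spark = Platform(10000.0, 10000.0, 0.1, 0.2, 0.3, 0.2)

io_ratio = io_time(hadoop, S, S, S, X, Y) / io_time(spark, S, S, S, X, Y)
time_ratio = (iteration_time(hadoop, S, S, S, X, Y, M, R, V_N)
              / iteration_time(spark, S, S, S, X, Y, M, R, V_N))
print(f"I/O ratio: {io_ratio:.1f}, execution-time ratio: {time_ratio:.1f}")
```

Varying `S`, `X` and the overheads in this model is a quick way to see how strongly the execution-time ratio depends on them, whereas the I/O ratio stays fixed by the transfer rates.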

The results show that the overhead and execution rate of each stage have a great influence on architecture performance. In practical applications, the data volume may reach the terabyte level or more, and the difference in execution time between the two architectures becomes much more obvious. Bandwidth may also become a bottleneck for both architectures. Finally, the results show that Hadoop takes longer than Spark in both I/O time and total time. Using execution time to measure the performance of the two architectures, we conclude that Spark is superior to Hadoop.

4. Experiments and results

Many clustering algorithms exist. Their implementation steps differ because they are based on different ideas, so their clustering results differ, and their effectiveness varies across practical applications. This paper, however, analyzes implementations of the same clustering algorithm on two architectures from a mathematical point of view, and the experiments compare the two architectures on a text data set used to test clustering algorithms while varying the number of iterations. The performance differences between the two architectures therefore cannot be illustrated by clustering quality; they are measured by time consumption instead.

4.1. Experimental environment

In the experiment we used one server and three virtual hosts created with VMware Workstation. We used CDH5 as the Hadoop and Spark platform and CentOS 6.5 x64 as the node operating system. The software versions were Hadoop 2.5.0-cdh5.3.2, Spark 1.2.0 [22] and JDK 1.7.

4.2. Experimental data

The experiments used a text data set of user behavior on a social networking site [1]. All data are provided in .csv file format and are packaged separately in multiple tar.gz files.

User information format: [user id]\t[user text], for example: 369319 zzzop. User relationship network format: [user id]\t[crawled page count]\t[friend count]\t[friend id list]\t[fans count]\t[fans list], for example: 1.2.3..htm 1 14215 6 hamas jkaneko caol_ila manwomanfilm public_design_center Kaminogoya 4 hamas lawmn shamroy tkwshnsk.

4.3. Results and analysis

The experiment used a standard plain-text data set for testing the k-means algorithm. By changing the number of iterations and comparing the average execution time and I/O time of the two architectures, we can illustrate their performance differences. Fig. 3 shows the average execution time of the k-means algorithm on the two architectures. As Fig. 3 shows, the processing time of MapReduce increases with the number of iterations, while the processing time of Spark is relatively stable. For the same number of iterations, the processing time of MapReduce is longer than that of Spark, and the average execution time of MapReduce is 50 times that of Spark. This conclusion is consistent with the theoretical analysis. The deviation from the theoretical ratios is caused by the experimental environment and by parameter values that differ from those of the theoretical analysis.

Fig. 4 shows the I/O time of the k-means algorithm on the two architectures. As Fig. 4 shows, the ratio of the I/O time of MapReduce to that of Spark increases with the number of iterations. For the same number of iterations, the I/O time of MapReduce is longer than that of Spark. The iterative processing time is therefore mainly I/O time, and the I/O time of MapReduce is 60 times that of Spark. This conclusion is consistent with the theoretical analysis. The deviation from the theoretical ratio is caused by the experimental environment, by parameter values that differ from those of the theoretical analysis and by the smaller experimental data set.

Fig. 3.

The average execution time comparison under two kinds of architecture.

Fig. 4.

The I/O time comparison under two kinds of architecture.

In summary, the experimental results show that the execution time and I/O time of Spark are significantly lower than those of MapReduce, so Spark is superior to MapReduce in terms of time consumption. Moreover, the experimental results are consistent with the theoretical analysis in Section 3, which supports the validity of that analysis.

5. Conclusions and future work

In this paper we have introduced the steps of the k-means algorithm and its implementations on MapReduce and Spark. We then focused on the theoretical performance differences between the two architectures for this clustering algorithm from a mathematical point of view. Finally, the experiments show that as the number of iterations increases, the execution time of MapReduce increases significantly while that of Spark changes little. That is to say, the performance of Spark is superior to that of MapReduce.

In future work, we plan to further analyze the performance differences between the two architectures in terms of scalability. MapReduce is based on external-storage computation [7,12] and Spark is based on in-memory computation, so the memory consumed in processing [18,23] data can also affect architecture performance [3,27,28]. Memory optimization is thus one of the most important directions for future research [5,8].

Acknowledgements

We sincerely thank the anonymous reviewers for their helpful comments and suggestions. This work is partially supported by the National Natural Science Foundation of China (Grant No. 61402183), Guangdong Natural Science Foundation (Grant No. S2012030006242), Guangdong Provincial Scientific and Technological Projects (Grant Nos. 2016A010101007, 2016B090918021, 2014B010117001, 2014A010103022, 2014A010103008, 2013B010202001 and 2013B010401021), Guangzhou Civic Science and Technology Project (Grant Nos. 201607010048 and 201604010040) and Fundamental Research Funds for the Central Universities, SCUT (No. 2015ZZ0098).

References

1. http://www.datatang.com/data/list.

2. Apache Hadoop, available at: http://hadoop.apache.org/.

3. Apache Spark documentation, 2014, available at: https://spark.apache.org/documentation.html.

4. Apache Spark Research, 2014, available at: https://spark.apache.org/research.html.

5. Byun, A reliable data delivery scheme with delay constraints for wireless sensor networks, Journal of High Speed Networks 21(3) (2015), 195–203. doi:10.3233/JHS-150520.

6. Feng and Ma, A distributed frequent itemset mining algorithm based on Spark, in: Computer Supported Cooperative Work in Design (CSCWD), 2015 IEEE 19th International Conference on, 6–8 May 2015, 2015, pp. 271–275.

7. Feng, Research and implementation of memory optimization in cluster computer engine Spark, Tsinghua University, 2013.

8. Fiore, Palmieri, Castiglione and De Santis, A cluster-based data-centric model for network-aware task scheduling in distributed systems, International Journal of Parallel Programming 42(5) (2014), 755–775. doi:10.1007/s10766-013-0289-y.

9. Gantz and Reinsel, 2011 digital universe study: Extracting value from chaos, available at: http://www.b-eye-network.com/blogs/devlin/archives/2011/071.

10. Gao, Zhou and Han, An evaluation model on key technologies of large-scale graph data processing, Journal of Computer Research and Development 51(1) (2014), 1–16. doi:10.2190/EC.51.1.a.

11. Gu, Liu and Zuo, Study on carriers’ mobile Internet development strategy in the context of big data, Designing Techniques of Posts and Telecommunications 8 (2012), 21–24.

12. Guo, Liu and Lin, Research on performance of big data computing and query processing based on Impala, Application Research of Computer 32(5) (2015), 1331–1334.

13. Huang, A study on the analysis of the research hotspots and development trends of big data overseas, Journal of Intelligence 33(6) (2014), 99–104.

14. Hui and Wu, Sequence-growth: A scalable and effective frequent itemset mining algorithm for big data based on MapReduce framework, in: Big Data (Big Data Congress) 2015 IEEE International Congress on, June 27 2015–July 2 2015, IEEE, 2015, pp. 393–400.

15. Li, Scientific value of big data research, Communications of the China Computer Federation 8(9) (2012), 8–15.

16. Li, Research on spark for big data processing, Modern Computer 3 (2015), 55–60.

17. Liang, Research on parallelization of data mining algorithm based on distribute platforms Spark and YARN, Sun Yat-sen University, 2014.

18. Lin, An improved data placement strategy for hadoop, Journal of South China University of Technology (Natural Science Edition) 40(1) (2012), 153–158.

19. Qiu, The parallel design and application of the CURE algorithm based on Spark platform, South China University of Technology, 2014.

20. Rathee, Kaul and Kashyap, R-apriori: An efficient apriori based algorithm on Spark, in: PIKM ’15 Proceedings of the 8th Workshop on Ph.D. Workshop in Information and Knowledge Management, ACM, New York, NY, USA, 2015, pp. 27–34.

21. Satish and Rohan, Comparing Apache Spark and Map Reduce with performance analysis using K-means, International Journal of Computer Applications 113(1) (2015), 8–11.

22. Scala, available at: http://www.scala-lang.org.

23. Tu, Liu and Lin, Survey of big data, Application Research of Computers 31(6) (2014), 1613–1623.

24. Wang, Wu, Yang and Yang, Research of decision tree on YARN using MapReduce and Spark, in: World Congress in Computer Science, Computer Engineering, and Applied Computing, 2014, available at: http://www.world-academy-of-science.org/.

25. Wang, Clustering in the cloud: Clustering algorithms to Hadoop Map/Reduce framework, Department of Computer Science, Texas State University, 2010.

26. Yang, The research of recommendation system based on Spark platform, University of Science and Technology of China, 2015.

27. Zhang, Yang and Zhao, Load balancing and data aggregation tree routing algorithm in wireless sensor networks, Journal of High Speed Networks 21(2) (2015), 121–129. doi:10.3233/JHS-150515.

28. Zhao, Xia and Jia, Research and analysis on spatial adaptive strategy of End-hopping system, Journal of High Speed Networks 21(2) (2015), 95–106. doi:10.3233/JHS-150514.

29. Zhao, Ma and Fu, Research on parallel k-means algorithm design based on Hadoop platform, Computer Science 38(10) (2011), 166–168.

30. Zhou, Zhang and Luo, Realization of K-means clustering algorithm based on Hadoop, Computer Technology and Development 23(7) (2013), 17–21.