Recursive Block LU Decomposition based ELM in Apache Spark

Abstract

Due to the massive memory and computational resources required to build complex machine learning models on large datasets, many researchers are employing distributed environments for training the models on large datasets. The parallel implementations of Extreme Learning Machine (ELM) with many variants have been developed using MapReduce and Spark frameworks in the recent years. However, these approaches have severe limitations in terms of Input-Output (I/O) cost, memory, etc. From the literature, it is known that the complexity of ELM is directly propositional to the computation of Moore-Penrose pseudo inverse of hidden layer matrix in ELM. Most of the ELM variants developed on Spark framework have employed Singular Value Decomposition (SVD) to compute the Moore-Penrose pseudo inverse. But, SVD has severe memory limitations when experimenting with large datasets. In this paper, a method that uses Recursive Block LU Decomposition to compute the Moore-Penrose generalized inverse over the Spark cluster has been proposed to reduce the computational complexity. This method enhances the ELM algorithm to be efficient in handling the scalability and also having faster execution of the model. The experimental results have shown that the proposed method is efficient than the existing algorithms available in the literature.

Keywords

Recursive Block LU Decomposition Extreme Learning Machine Apache Spark Moore-Penrose pseudoinverse Big Data

1 Introduction

Nowadays, humongous data is being generated through diversified data generating sources at an ever-increasing rate [2]. The size and dimensionality of these datasets are increasing day by day. So, it is a challenging task to develop efficient and effective machine learning algorithms that are useful in analyzing and extracting valuable knowledge from the massive amount of data. Over the past few decades, many machine learning algorithms have been developed for processing complex and massive datasets to provide valuable insights [6]. In addition to that, recent big data frameworks provided efficient mechanisms to process and model a massive amount of data to get actionable insights. At this juncture, it is indispensable to exploit distributed computing environment efficiently and effectively to model such humongous data [23]. Distributed learning based on the divide-and-conquer approach is used significantly in many frameworks for processing the humongous data, which includes three primary steps [23]

Partitioning the original dataset into sub-datasets.

distribute the sub-datasets to local machines in a network and learn a predictor from each sub-dataset.

Synthesize the global predictor from these local predictors

Developing a classification problem in distributed computing environment is one of the most challenging problems in this big data era. Over the past few decades, machine learning algorithms have been developed for processing complex and massive datasets to get valuable insights. However, the primary requirement is that they should model the complex decision boundaries with less computational resources and time. One of the most popular machine learning algorithms is an Artificial Neural Network (ANN), which has been widely used in various fields. Neural network architectures have gone through many generations from the perceptron to the most recent deep network architectures. In recent years, Extreme Learning Machine (ELM) became most popular since it provides fast training time and having a lesser number of hyper-parameters to be tuned when compared to other neural networks [4]. However, there are many shortcomings of ELM that have been addressed in the literature to improve the performance of ELM. Nevertheless, a major problem i.e. memory consumption while dealing with large datasets has not been addressed extensively in the literature. This problem arises because ELM is a batch processing algorithm that expects all the data in memory to compute the pseudo inverse. So, memory limitation problem became a major bottleneck to use ELM for large datasets.

Heeswijk et al. [14] proposed GPU based parallel ELM ensemble method for large scale regression problems. The author focused on the training phase of ELM only. Sun et al. [26] proposed a parallel ensemble ELM classifier based on OS-ELM. These approaches are developed to process large data by selecting proper ELM model network architecture. Qing He et al. [16] proposed the first parallel ELM (PELM) for regression on MapReduce to process large datasets. In this approach, the author proposes a method that uses a chain of mapper and reducer jobs. So, this approach consumes more time, because it stores data in the disk for every phase. Xin et al. [8, 9] proposed an enhanced approach to PELM that uses single MapReduce job and reduces transfer time. Another variant of PELM is DELM [3], which uses blocks instead of specifying the rows and columns of a hidden layer output matrix. Due to this, the intermediate transmission of data during the shuffling phase has reduced. Some more variants of MapReduce based ELM are PEOS-ELM [22], BPOS-ELM [21], where all of them require many copies of each task and have more disk I/O overhead. To avoid the overhead of disk I/O, ELM based on Apache Spark distributed framework has been proposed by a few researchers. Duan et al. [11] proposed an ELM model on Apache Spark distribution without using the SVD method. The author used the standard matrix multiplication procedure to compute the Moore-Penrose pseudoinverse, whereas matrix multiplication is one of the heavy computational tasks in Apache Spark. Kozik, Rafal [17] developed ELM on Apache Spark using SVD decomposition. In this approach, the author computed the Gramian matrix for hidden layer output matrix and right singular values and then multiplied these right singular vectors with the target vector. The author used the Gramian matrix to avoid the computational cost of SVD for the tall and skinny behaviour of the hidden layer output matrix.

SVD based ELM on Apache Spark has memory limitations [1, 17]. In this paper, an efficient Moore-Penrose pseudo inverse computation using Block Recursive LU Decomposition [7] method is discussed to provide scalable ELM and solve the memory limitation. To avoid the numerical instability problem in LU factorization, a partial pivoting approach is used, and the discussion of this method is available in the methodology section.

This paper is organized as follows: Section 2 outlines the basic details of ELM and Apache Spark; Section 3 provides the methodology employed in this work; Section 4 describes the experimental results; finally, Section 5 provides the result analysis and conclusion.

2 Preview of ELM and Apache Spark

This section presents the ELM procedure along with mathematical representations and Distributed matrix representation with its computational challenges in Apache Spark Framework.

2.1 Mathematical Description of ELM

ELM is a variant of Single-hidden Layer Feedforward Neural networks (SLFNs) based on the notion that the classifier can achieve excellent generalization performance at extremely faster learning speed even if the input-to-hidden layer weights and biases are randomly chosen and not updated during the training process [5]. In addition to this, the hidden-to-output layer weights are determined analytically. G.B. Huang et al. [5] have shown the theoretical results in such a way that ELM can be a universal approximator with a wide variety of activation functions. The main concern of ELM is to avoid gradient-based iterative learning and laborious hyper-parameter tuning, which is a major bottleneck for general SLFN. Fig. 1 shows the architecture of ELM with one input layer, one hidden layer, and output layer.

Fig. 1

Basic architecture of ELM.

The training procedure of the ELM algorithm is described with the help of the flowchart in Fig. 2. The Moore-Penrose generalized inverse can be calculated using the following equations. $G^{†} = (G^{T} G)^{- 1} G^{T}$ (1) when G^TG is non-singular $G^{†} = G^{T} ({GG}^{T})^{- 1}$ (2) when GG^T is a non-singular. Equation (1) is used in many cases where the input dataset has many more instances than the attributes. Among all these approaches, SVD (matrix - decomposition) is the most frequently used method in the literature to compute pseudoinverse even though it is computationally expensive and consumes more memory.

Fig. 2

The Flowchart of the ELM.

2.2 Brief description of Apache Spark

MapReduce framework offers scalability, fault tolerance, load-balancing using distributed computing in both homogeneous and heterogeneous environments to handle large datasets. Even though it is one of the popular frameworks that use data parallelism to handle large datasets, the problems such as data replication, disk I/O communication cost, serialization, etc., have not been dealt in an efficient way. The MapReduce framework does not have elegant ways to control the iterations in iterative algorithm implementation. During the process of iteration-to-iteration, unnecessary scheduling overhead, data reloading, and reprocessing for every iteration cause wastage of resources. It uses a coarse-grained procedure to tackle such tasks, which are too heavy-weight for many of the machine learning tasks. Also, the MapReduce framework does not have control over the total pipeline of the map and reduce steps, so it cannot cache the intermediate results for faster performance. For example, the training of the neural network is an iterative procedure that involves matrix multiplications. Yang Liu et al. [25] studied the MapReduce implementation of Neural Network and concluded that it incurs a high overhead of computation due to the continuous starting and stopping of mappers and reducers in Hadoop environment. Hence, Apache Spark distributed framework for the implementation of ELM is used in this proposed methodology. The Apache spark has the following potential features to support matrix operations and iterative algorithms:

The Apache spark operates on storage abstraction called Resilient Distributed Dataset (RDD) [13]. It consists of the number of deterministic coarse-grained operations and also provides an easy-to-use interface to users.

RDDs permit user-defined data partitioning and hence, the execution engine can exploit this to co-partition RDDs and co-schedule tasks to avoid data movement [13].

To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory based on coarse-grained transformations rather than fine-grained updates to a shared state [13].

Spark has robust high-level operators in the form of actions and transformations for processing of the data in an easy and simple way, unlike MapReduce, which has very few operators and required low-level API to process.

2.2.1 Distributed matrix representation and computational challenges

Since ELM involves matrix operations to compute pseudo-inverse, it requires effective approaches to operate on distributed matrix representation to deploy in the cluster environment. The distributed matrix data representation, hardware-specific acceleration for matrix computation, and interaction of local matrix with distributed matrix in matrix computation are the typical challenges in a distributed computing environment [19]. In Apache Spark, the distributed matrix is partitioned into fixed-size blocks, which are represented by RDD. Usually, the fixed-size blocks are small and square. Fixed-size square blocks are useful to simplify many operations like transpose, etc.

Fig. 3(A), and Fig. 3(B) shows an example of the logical representation of a distributed matrix and physical RDD partitions with hash partitioning [12]. RowMatrix, IndexedRowMatrix CoordinateMatrix and BlockMatrix are matrix layouts used in Apache spark to represent the data of the distributed matrix. The total time required for basic computational operations on these layouts and I/O cost on distributed data-parallel jobs are the key factors to specify the large-scale matrix multiplication execution [18]. For example, computational operations such as small matrix multiplying with large matrix and multiplying two large matrices are two cases in the distributed matrix. In general, the BroadcastMM approach along with the RowMatrix layout is used to perform the large matrix multiplications in an efficient way. The RowMatrix is used to store row elements of Matrix in RDD form. The rows are represented by local vectors [27]. Hence, there is a limitation in the number of columns and also, it is not supported for large matrices. Due to this reason, BlockMatrix representation in Apache Spark is useful to represent the large matrices, whereas it stores the sub-matrix blocks in RDD form [27]. In addition to this, the execution strategy for large-scale matrix multiplication depends upon the scheduled execution workflow of the blocked sub-matrices on Apache Spark Environment [18]. In these execution strategies, the Replication-based Matrix Multiplication (RMM) and Cross-Product Matrix Multiplication (CPMM) are the most important on parallel computing platforms. The RMM execution strategy for matrix multiplication has been used in Apache Spark MLlib. So, R. Gu et al. [18] have adapted the RMM based execution strategy already available in Spark MLlib. Fig. 4 describes the RMM based execution strategy of matrix multiplication on Spark [18].

Fig. 3

Distributed Matrix Representation: (A) Logical Representation (B) Physical Representation.

Fig. 4

RMM based execution strategy of Matrix Multiplication on distributed environment.

3 Proposed methodology

In this work, Block Recursive LU Decomposition technique has been proposed to compute the Moore-Penrose pseudo-inverse of the large matrix in Apache spark distributed environment. Y. Kutlu et al. [24] proposed LU decomposition-based ELM on conventional environment and produced efficient results compared with basic ELM. The author experimented with Gaussian elimination with standard backward substitution to compute L (Lower triangular matrix) and U (Upper triangular matrix) in ELM. But this methodology is prone to numerical instability [10 , 29] because this method breaks down if the elements in the principle diagonal matrix is zero. So it needs a systematic method to select the proper non-zero pivots during each iteration of the Gauss elimination process. The elements which are at the super-diagonal of a Lower triangular matrix are called pivots. LU Decomposition with partial pivoting method ensures that non-zero pivots in gauss elimination procedure and produce the results with less complexity compared to other methods such as small pivoting based LU Decomposition, complete pivoting based LU, etc [28]. By taking this into account, Recursive Block LU Decomposition with partial pivoting is employed in the present work.

The Moore-Penrose pseudo-inverse of the hidden layer activation matrix (G) is specified as G^† = (G^TG) ^-1G^T. Here, G^TG is a symmetric and square matrix. In the previous works, SVD is performed using eigenvalue decomposition-based approach to find the inverse, which takes more computational time and resources. In addition to this limitation, SVD has memory limitation if the dimensionality of the matrix is crossing 10⁴ [20]. In this regard, block-recursive LU decomposition is an efficient strategy to compute (G^TG) ^-1 than the SVD based decomposition used in the literature. The inverse found using block-recursive LU decomposition is then multiplied with G^T to get the Moore-Penrose pseudo-inverse of the hidden layer activation matrix. Here, the G matrix is distributed over the cluster using the BlockMatrix method.

Every node in the cluster is used to compute the basic LU Decomposition of the distributed matrix. The implementation of Recursive Block LU Decomposition depends on two major tasks. The first task is to divide the matrix into sub-matrices, which are suitable for the distributed matrix. The second task is to collect the results of the small distributed matrix from all nodes in the cluster. However, the communication overhead is small in the present approach because Spark uses in-memory computation. The space complexity of recursive block LU decomposition takes $O (n^{3})$ , and the time complexity of basic LU decomposition is $O (2 / 3 t^{3})$ , where N is the dimension of a square matrix, and t is the number of flips that are used to swap the rows in the matrix while computing partial pivoting elements. The total time complexity of this approach is $O (2 / 3 {kt}^{3} + r)$ where k is the number of nodes in a cluster, which are used to compute basic LU decomposition with partial pivoting method, and r is the communication overhead in a cluster.

In the present work, BlockMatrix is used to store a large matrix in terms of sub-matrices (block) in a distributed fashion which is known to be efficient for large datasets [27]. In order to show the superiority of BlockMatrix in comparison with RowMatrix representation a detailed experiment analysis has been conducted. In the experimental study, the computational time required to perform the matrix multiplication is considered as the key metric. The time required to compute a matrix multiplication of large dimensional square matrices on pseudo-mode and multi-node cluster mode has been observed experimentally. The large matrices used in these experiments have been generated randomly for the comparison purpose. The data elements of each matrix are generated randomly in a range [1,100].

Fig. 5 shows the comparative results of RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix data structure based matrix multiplication in pseudo-mode Apache spark. The pseudo-mode was setup on Lenovo ThinkPad with I5 processor and 8GB RAM. In the Fig. 5, the dot product of the matrix is also considered. There was a fatal memory error when performing the standard dot product method for the matrix multiplication when the matrix dimension crosses 4000. But, the data structure provided by Apache Spark has been able to compute matrix product even the matrix dimension crosses 4000. The experimental results showed that the computational time required for matrix multiplication based on RowMatrix, IndexedRowMatrix, and CoordinateMatrix has been increasing, whereas the BlockMatrix based matrix multiplication is consuming less time when compared to them. Another important observation from these experiments is that the heap memory is playing a vital role because the intermediate results of matrix multiplication need a lot of memory space. In this experimental setup, the primary heap memory size 512M is considered. Few experiments have also been carried out by increasing heap memory. The observation from these experiments showed that the computational time of matrix multiplication decreases as the heap memory size increases.

Fig. 5

Average computational time of matrix multiplication operation using distributed matrix representations in pseudo-mode cluster of Apache Spark.

Fig. 6 shows the average computational time needed to perform the RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix based matrix multiplication in a cluster. The extensive experiments have conducted on a multinode cluster. The multinode cluster has three slaves and one master node. Each node in the multinode cluster has 32GB RAM, 1TB hard disk, and 6.5GB heap memory. In this experimental setup, the average time required to compute the matrix multiplication is considered after performing 15 trails on the corresponding dataset. The time required to perform the matrix multiplication is included the communication overhead of the cluster. Even in the multinode cluster, BlockMatrix based matrix multiplication performed well compared to other data structures. In accordance with the results, BlockMatrix representation is used to represent the large dimension matrices in the proposed work.

Fig. 6

Average computational time of matrix multiplication operation using distributed matrix representations in multinode cluster of Apache Spark.

3.1 Recursive block LU decomposition workflow

Let us assume the Gramian matrix (G^TG) is B with dimension as n × n. The LU factorization of matrix B produces L and U, where L is [l_ij] , _1≤i,j≤n and U is [u_ij] , _1≤i,j≤n. Hence B = LU where L is Lower triangular matrix i.e l_ij = 0 for 1 ≤ i < n and U is Upper triangular matrix i.e u_ij = 0 for 1 ≤ j < i ≤ n. To provide numerical stability in LU decomposition, the pre-multiplication of pivot matrix P with B Matrix is used. Hence, the LU decomposition will be like PB = LU. Without loss of generality, we assume the order of the square matrix B is 2^kb, where k an integer and b is the order of the block matrix that can be decomposed on a single server. The large matrices are decomposed the matrix into blocks that has been shown in the following equations. $[\begin{matrix} P_{1} & 0 \\ 0 & P_{2} \end{matrix}] [\begin{matrix} B_{1} & B_{2} \\ B_{3} & B_{4} \end{matrix}] = [\begin{matrix} L_{1} & 0 \\ L_{2} & L_{3} \end{matrix}] [\begin{matrix} U_{1} & U_{2} \\ 0 & U_{3} \end{matrix}]$ (3)

$[\begin{matrix} P_{1} B_{1} & P_{1} B_{2} \\ P_{2} B_{3} & P_{2} B_{4} \end{matrix}] = [\begin{matrix} L_{1} U_{1} & L_{1} U_{2} \\ L_{2} U_{1} & L_{2} U_{2} + L_{3} U_{3} \end{matrix}]$ (4)

$P_{1} B_{1} = L_{1} U_{1}$ (5) $P_{1} B_{2} = L_{1} U_{2}$ (6) $P_{2} B_{3} = L_{2} U_{1}$ (7) $P_{2} B_{4} = L_{2} U_{2} + L_{3} U_{3}$ (8)

Equation (9) is derived from Equation (6) which is as follows $U_{2} = L_{1}^{- 1} P_{1} B_{1}$ (9)

Equation (10) is derived Equation (7) as follows $L_{2} = P_{2} B_{3} U_{1}^{- 1}$ (10) Equation (11) is derived from Equation (8) and Equation (10) $L_{3} U_{3} = P_{2} (B_{4} - B_{3} U_{1}^{- 1} U_{2})$ (11) The partition of the matrix is taking place recursively till B₁ is small and able to compute on LU decomposition of that on a single machine. Hence, the LU decomposition of B₁ produces L₁, U₁ and P₁ easily. With these terms the remaining terms such as L₂, L₃, U₂, U₃ and P₂ can be computer because the fraction terms $(B_{4} - B_{3} U_{1}^{- 1} U_{2})$ that is also small and able to compute on single machine. The procedure of computing these terms are described in algorithms clearly. After computing all these terms L^-1, U^-1 and P will be computed by merging the corresponding results. The L^-1, U^-1 and P are shown in the Equation (12), Equation (13) and Equation (14) respectively. $L^{- 1} = [\begin{matrix} L_{1}^{- 1} & 0 \\ - L_{1} L_{2} L_{1}^{- 1} & L_{3}^{- 1} \end{matrix}]$ (12)

$U^{- 1} = [\begin{matrix} U_{1}^{- 1} & - U_{1}^{- 1} U_{2} U_{3}^{- 1} \\ 0 & U_{3}^{- 1} \end{matrix}]$ (13) $P = [\begin{matrix} P_{1} & 0 \\ 0 & P_{2} \end{matrix}]$ (14)

The workflow of the Recursive Block LU Decomposition Algorithm in Spark is described in Fig. 7. The algorithmic representation of this proposed approach is described with the help of three algorithms. Algorithm 1 is used to describe the computational procedure of partial pivoting LU decomposition, which will be executed on a single node. Algorithm 2 is used to describe the procedure of partition (blocks) and the recursive method of LU decomposition on a large matrix. Once the matrix size becomes small and able to get the LU decomposition directly, then Algorithm 2 will call the Algorithm 1 for getting actual Lower triangular (L), Upper triangular (U), and pivot matrix (P).

Fig. 7

A workflow of Recursive Block LU Decomposition in Apache Spark.

The matrix-B is partitioned into four square sub-matrices. This partition process takes place recursively till the small sub-matrix, where the node can compute basic LU decomposition directly. After computation of LU decomposition on small sub-matrix, the inverse operation on L and U is performed. These results are merged to get the inverse of (G^TG). The work flow of the Recursive block LU decomposition is described in Fig. 7. Finally, Algorithm 3 is used to take the original input and calls the Algorithm 2 for finding the results.

Algorithm 1: Paritial Pivoting LU Decomposition

Require: B is a matrix with dimension as n by n

Ensure: L,U, P

Procedure: BasicLUD(B)

Number of rows in B is stored in n

for k ← 1 to n do

(j,i) ← argmax(∣B_k,k∣, ∣B_k+1,k∣,..., ∣B_n,k∣)

add j to P

swap i^th row with j^th row

for i ← k+1 to n do

B_i,k ← B_i,k / B_k,k

for j ← k+1 to n do

B_i,j ← B_i,j - B_i,k B_k,j

end for

return(B,P) which is L, U and P

Algorithm 2:Recursive Block-level LU Decomposition

Require: B is a matrix with dimension as n by n

Ensure: L^-1, U^-1, P

Procedure: BLUD(B)

if B is small

(L, U, P) ← BasicLUD (B)

L^-1 ← Inverse (L)

U^-1 ← Inverse (U)

else

Partitioned B into four sub-blocks such as $(\begin{matrix} B_{1} & B_{2} \\ B_{3} & B_{4} \end{matrix}) \leftarrow B$

$(L_{1}^{- 1}, U_{1}^{- 1}, P_{1}) \leftarrow BLUD (B_{1})$

Compute Equation (9)

$(L_{3}^{- 1}, U_{3}^{- 1}, P_{2}) \leftarrow BLUD (B_{4} - B_{3} U_{1}^{- 1} U_{2})$

Compute Equation (10)

Compute Equation (12)

Compute Equation (13)

Compute Equation (14)

end if

return (L^-1, U^-1, P)

Algorithm 3 Main Algorithm that calls the Recursive Block LU Decomposition

Require: B is a matrix with dimension as n by n

Ensure: B ^-1

Procedure: BLU(B)

(L^-1, U^-1, P) ← BLUD(B)

B^-1 ← U^-1L^-1P

return(B^-1)

4 Experimental results

In this section, initially, the experimental results on a variant of the MNIST handwritten image dataset have been used to benchmark the proposed work performance with the existing work in the literature. The dataset consists of 5000 samples with 400 features of a handwritten digit image. Initially, the Recursive - Block LU based decomposition has been tested with a master and one single slave node on a standalone cluster. The standalone cluster node has 8GB RAM, 1TB hard disk and Intel i5 processor with four cores as a cluster configuration. The result analysis is shown in Table 1. The numbers of hidden nodes have chosen to support the block size of distributed BlockMatrix in Apache Spark. It is identified that the proposed approach with appropriate distributed block size along with the corresponding hidden layer neurons can produce high training accuracy for classification problems. Due to the limitation in memory for computational tasks, the SVD based ELM with 4095 hidden neurons got a memory exception error. With the help of partitions in RDD, the proposed approach can process the data with a 4095 number of hidden nodes in a cluster.

Table 1
Comparison of SVD based ELM and Block Recursive LU Based ELM in Apache Spark with varying hidden nodes and different distributed block sizes (Block size in KBs)

Algorithm Number of Hidden nodes Block size (KB) Training Accuracy (%) Testing Accuracy (%) Training Time in Seconds

Recursive Block LU Decomposition based 511 256 97.49 97.70 58.88

ELM in Apache Spark

1023 256 98.83 97.87 105.98

1023 512 98.82 97.87 77.82

2047 512 98.84 97.89 102.94

2047 1024 98.86 98.01 79.39

4095 1024 98.86 98.10 99.81

4095 2048 98.88 98.0 87.34

SVD based ELM in Apache Spark 511 – 98.19 97.94 48.39

1023 – 98.83 97.99 65.22

2047 – 98.01 96.51 99.39

2047 – 98.03 97.50 105.80

4095 – Failure Case

M.Duan et al. 3000 – 96.29 95.12 69.99

Algorithm	Number of Hidden nodes	Block size (KB)	Training Accuracy (%)	Testing Accuracy (%)	Training Time in Seconds
Recursive Block LU Decomposition based	511	256	97.49	97.70	58.88
ELM in Apache Spark
	1023	256	98.83	97.87	105.98
	1023	512	98.82	97.87	77.82
	2047	512	98.84	97.89	102.94
	2047	1024	98.86	98.01	79.39
	4095	1024	98.86	98.10	99.81
	4095	2048	98.88	98.0	87.34
SVD based ELM in Apache Spark	511	–	98.19	97.94	48.39
	1023	–	98.83	97.99	65.22
	2047	–	98.01	96.51	99.39
	2047	–	98.03	97.50	105.80
	4095	–	Failure Case
M.Duan et al.	3000	–	96.29	95.12	69.99

M.Duan et al. have implemented the parallel ELM in Spark that used direct matrix multiplication formulae to calculate Moore-Penrose pseudo-inverse. The author used RDD partitions to represent the data with RDD partitions. However, Apache Spark has a severe limitation on the number of partitions used in RDD. Hence, this approach not able to give an efficient performance if the number of partitions in the RDD is more than 5000 [15].

The same dataset is used for analysis over the multi-node cluster. The multi-node cluster has four nodes, out of which one node is a master node, and the remaining three nodes are slave nodes. The configuration of each node in a cluster has 32GB RAM, 1 TB hard disk with four cores. It is observed that there is a slight increase in the training time due to communication overhead. However, the added advantage here is that the present approach can be extended to large datasets. Fig. 8 shows the training time comparative study of Recursive Block LU decomposed based ELM and SVD based ELM in a multi-node spark cluster. The proposed approach empirically showed that it would take less training time for a large number of hidden nodes in a Spark cluster. Fig. 9 shows the training phase accuracy of Recursive Block LU decomposed based ELM and SVD based ELM in a multi-node Apache Spark cluster. From Fig. 9, it can concluded that the training accuracy of both approaches is almost the same.

Fig. 8

Training time comparison of Recursive Block LU Decomposition based ELM with SVD based ELM in Cluster.

Fig. 9

Training accuracy comparison of Recursive Block LU Decomposition based ELM with SVD based ELM in Cluster.

The proposed approach has also been tested on forest-cover dataset to give greater clarity regarding the performance of the proposed approach. Forest cover dataset contains 30 × 30m patches of forest in the US, collected for the task of predicting each patch’s cover type. The forest cover dataset consists of 581012 instances with 54 features. There are seven cover types (classes), which makes the task as a complex multiclass classification problem. Here, one can observe that the number of instances is vast, which is a challenging task for SVD based or standard ELM based implementations. It can be observed that the proposed algorithm can scale up for such a large dataset and produced the results. Table 2 shows the performance analysis of the proposed approach on the forest-cover dataset.

Table 2

Performance Analysis of Recursive Block LU Decomposition on forest cover dataset with different Distributed Block Sizes

Number of Hidden Nodes	Block size (KB)	Training Time in Seconds	Training Accuracy	Testing Accuracy
63	64	285.34	98.15	98.16
255	256	189.26	99.07	98.37
511	256	143.48	99.18	98.59
511	512	110.76	99.18	98.56
1023	1024	80.78	99.28	98.66

5 Result analysis and conclusion

From the result analysis, it can be observed that SVD based ELM in Apache Spark requires large driver memory for computation of the Eigenvalues and Eigenvectors on large blocks of the matrix. But, the Recursive Block LU decomposition-based ELM method in Apache Spark cluster has attained the same training accuracy with less computational time when compared to SVD based ELM in a cluster on moderately large datasets. Also, Recursive Block LU Decomposition based ELM method is scalable when the block size of the matrix is large, which is necessary for training large datasets such as Forest cover dataset. Hence, the present work confidently concludes that the proposed approach is efficient for large datasets where the previous works in the literature have severe limitations that is evident from the experimental results.

References

Nagelli

, Ragala

and Saleena

, Automotive Rank Based ELM Using Iterative Decomposition, International Journal of Intelligent Engineering and Systems 12 (2019), 287–298.

Kumar

, An encyclopedic overview of ’big data’ analytics, International Journal of Applied Engineering Research 10(3) (2015), 5681–5705.

Han

D.H.

, Zhang

and Wang

G.R.

, Classifying uncertain and evolving data streams with distributed extreme learning machine chine, Journal of Computer Science and Technology 30(4) (2015), 874–887.

Huang

, Huang

G.B.

, Song

and You

, Trends in extreme learning machines: A review, Neural Networks 61 (2015), 32–48.

Huang

G.B.

, Zhu

Q.Y.

and Siew

C.K.

, Extreme learning machine: theory and applications, Neurocomputing 70(1-3) (2006), 489–501.

Qiu

, Wu

, Ding

, Xu

and Feng

, 2016. A survey of machine learning for big data processing, EURASIP Journal on Advances in Signal Processing 1 (2016), 67.

Liu

, Liang

and Ansari

, Spark-based large-scale matrix inversion for big data processing, IEEE Access 4 (2016), 2166–2176.

Xin

, Wang

, Qu

and Wang

, Elastic extreme learning machine for big data classification, Neurocomputing 149 (2015), 464–471.

Xin

, Wang

, Chen

, Ding

, Wang

and Zhao

, ELM: distributed extreme learning machine with MapReduce, World Wide Web 17(5) (2014), 1189–1204.

10.

Kurzak

, Luszczek

, Faverge

and Dongarra

, LU factorization with partial pivoting for a multicore system with ac celerators, IEEE Transactions on Parallel and Distributed systems 24(8) (2012), 1613–1621.

11.

Duan

, Li

, Liao

and Li

, A parallel multiclassification algorithm for big data using an extreme learning machine, IEEE Transactions on Neural Networks and Learning Systems 29(6) (2017), 2337–2351.

12.

Boehm

, Dusenberry

M.W.

, Eriksson

, Evfimievski

A.V.

, Makari Manshadi

, Pansare

, Reinwald

, et al., Systemml: Declarative machine learning on spark, Proceedings of the VLDB Endowment 9(13) (2016), 1425–1436.

13.

Zaharia

, Chowdhury

, Das

, Dave

, Ma

, Mc-Cauly

, Franklin

M.J.

, Shenker

and Stoica

, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, In, Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation NSDI 12 (2012), 15–28.

14.

Heeswijk

M.V.

, Miche

, Oja

and Lendasse

, GPUaccelerated and parallelized ELM ensembles for large-scale regression, Neurocomputing 74(16) (2011), 2430–2437.

15.

Zaharia

, Chowdhury

, Franklin

M.J.

, Shenker

and Stoica

, Spark: Cluster computing with working sets, HotCloud 10, 2010.

16.

, Shang

, Zhuang

and Shi

, Parallel extreme learning machine for regression based on MapReduce, Neurocomputing 102 (2013), 52–58.

17.

Kozik

, Distributing extreme learning machines with Apache Spark for NetFlow-based malware activity detection, Pattern Recognition Letters 101 (2018), 14–20.

18.

, Tang

, Tian

, Zhou

, Li

, Zheng

and Huang

, Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms, IEEE Transactions on Parallel and Distributed Systems 28(9) (2017), 2539–2552.

19.

Bosagh Zadeh

, Meng

, Ulanov

, Yavuz

, Pu

, Venkataraman

, Sparks

, Staple

and Zaharia

, Matrix computations and optimization in apache spark, In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), 31–38.

20.

Bosagh Zadeh

, Meng

, Ulanov

, Yavuz

, Pu

, Venkataraman

, Sparks

, Staple

and Zaharia

, linalg: Matrix computations in apache spark, arXiv 13, (2015), 1509.02256.

21.

Huang

, Wang

, Chen

, Wang

and Yu

, An efficient parallel method for batched OS-ELM training using MapReduce, Memetic Computing 9(3) (2017), 183–197.

22.

Huang

, Wang

, Qiu

, Yao

, Wang

and Yu

, Parallel ensemble of online sequential extreme learning machine based on MapReduce, Neurocomputing 174 (2016), 352–367.

23.

, Zhang

and Wang

F.Y.

, Stability-Based Generalization Analysis of Distributed Learning Algorithms for Big Data, IEEE Transactions on Neural Networks and Learning Systems (2019).

24.

Kutlu

, YayÄśk

, Yildirim

and Yildirim

, LU triangularization extreme learning machine in EEG cognitive task classification, Neural Computing and Applications 31(4) (2019), 1117–1126.

25.

Liu

, Jing

and Xu

, Parallelizing backpropagation neural network using MapReduce and cascading model, Computational Intelligence and Neuroscience (2016).

26.

Sun

, Yuan

and Wang

, An OS-ELM based distributed ensemble classification framework in P2P networks, Neurocomputing 74(16) (2011), 2438–2443.

27.

https://spark.apache.org/docs/2.2.0/mllib-data-types.html

28.

Trefethen

L.N.

and Bau

III. , Numerical linear algebra, 50, Siam, 1997.

29.

Reid

M.W.

, Pivoting for LU Factorization, (2014).