Distributed matrix computing system for big data

Abstract

To address the low computing efficiency of big data analysis and model construction, this paper examines big data analysis programming models and DAG (Directed Acyclic Graph) optimization, and on this basis adopts Octopus, a distributed matrix computing system for big data analysis. Octopus is a general-purpose matrix programming framework that provides a programming model based on matrix operations, so that large-scale data can be analyzed and processed conveniently. With Octopus, users can draw on functions and data from multiple platforms and operate on them through a unified matrix interface. The distributed matrix representation and storage layer defines the data storage format used in the distributed file system. Each computing platform provides its own matrix library; OctMatrix wraps these libraries and exposes them to upper-layer users as a matrix library written in the R language. SymbolMatrix provides a matrix interface consistent with that of OctMatrix. In addition, SymbolMatrix retains the flow graph of the matrix operations in a program, which is a DAG, and supports logical and physical optimization of that graph. The optimization of the DAG computational flow graph generated by SymbolMatrix is divided into two parts: logical optimization and physical optimization. The system stores matrices in the distributed file system in a row-matrix format, and each platform obtains its own matrix representation by reading the row-matrix files. In the evaluation of system performance, the distributed matrix computing system showed high computing efficiency, with an average CPU (central processing unit) usage of 70%. The system can make full use of computing resources and realize efficient parallel computing.

Keywords

Big data analysis, distributed matrix computing system, data management, matrix segmentation, historical data

1. Introduction

Existing big data distributed systems generally suffer from low computing efficiency and poor ease of use, and each distributed computing platform has its own underlying programming model and application scenarios. An ideal big data distributed matrix computing system therefore needs to support both existing distributed computing platforms and better ones that appear in the future. Massive information has become an important resource in the information age. Individuals, enterprises, and government agencies alike constantly face huge volumes of data and the problem of processing them [1]. Traditional data processing methods cannot meet the needs of mass data processing, analysis, and rapid response, so how to manage and analyze massive data efficiently is a common problem across many fields. Matrix computation is an effective data processing method that has attracted increasing attention. This paper proposes a distributed matrix computing system designed for the matrix computing needs of big data environments. The system can process large-scale matrix data efficiently and uses distributed computing and storage resources to achieve fast matrix operations.

Big data often contains deep value and knowledge, and its analysis and mining can produce large social and economic benefits. Because traditional single-machine processing can hardly meet the demands of massive data, research on distributed matrix computing for massive data has developed rapidly. Zhao Jiayi proposed that a satellite cluster can be used as a distributed system to complete distributed matrix calculation tasks. He used compression coding techniques and matrix-vector multiplication, and considered the distributed storage of data matrices divided by columns [2]. In a large-scale distributed system consisting of a set of working nodes, Yu, Qian considered large-scale matrix multiplication problems, which are the basis of many data analysis applications [3]. Zhang Jinhua first analyzed the influence of distributed photovoltaic access on the distribution network, then derived the voltage constraint with a custom matrix, and adopted a matrix-based method for calculating access capacity [4]. Das, Anindya Bijoy identified several limitations of distributed matrix computing and used a convolutional coding approach to eliminate them [5]. Ben-Nun Tal reviewed and modeled different types of concurrency in deep neural networks, discussing asynchronous stochastic optimization, distributed matrix system architectures, and the corresponding communication schemes [6]. These matrix calculation schemes do not address the specifics of data mining, and a comprehensive evaluation of system performance also needs to consider calculation speed, data transmission, and other factors.

To meet the new requirements that deep learning places on matrix computation, existing distributed matrix computing systems focus on optimizing simple batch operations such as matrix multiplication, alongside other common matrix operations such as addition and subtraction. Their support for the complex operations of emerging deep learning applications (such as convolution) is still imperfect. Reddy, G. Sirichandana adopted a hierarchical distributed data matrix and provided a runtime system to support its execution [7]. Lopes, Paulo A. C noted that the Hungarian algorithm solves the linear assignment problem in polynomial time, and adopted a simplified and fast distributed matrix compression scheme for calculating the initial relaxation matrix [8]. The graph tensor operation adopted by Zhang Tao supports big data processing of scalars, vectors, and distributed matrices on each graph node [9]. Gao Yan regarded distributed stochastic gradient descent as one of the most popular distributed matrix decomposition algorithms for parallel big data, and adopted a time-delay reduction control scheme to ensure non-negative matrix decomposition [10]. These matrix calculation approaches do not take execution efficiency into account.

This paper adopts a design method for a distributed matrix computing system for big data and describes its implementation in detail, in order to provide a practical solution for researchers and practitioners. The matrix segmentation module overcomes load imbalance between nodes and thus achieves better load balancing. Fault tolerance is introduced into the distributed system to ensure its stability, which can serve as a reference for the design and implementation of distributed systems. The matrix multiplication module selects more effective algorithms to improve operation efficiency, providing a basis for further optimization of the system [11]. Finally, Octopus, a general big data machine learning system for multiple computing platforms, is adopted and integrated with different computing platforms, providing users with the corresponding data processing interfaces.

2. Distributed matrix computing system

2.1 Conceptual framework of matrix calculation

Figure 1.

Conceptual framework of distributed matrix computing for big data analysis.

The conceptual framework of distributed matrix computing for big data analysis is shown in Fig. 1. Using the unified matrix interface, the user designs machine learning algorithms in a familiar development environment and builds higher-level applications on top of them. From the system's distributed, parallel point of view, an application written against the matrix interface is executed on the underlying system: the logical execution plan is physically optimized to obtain a physical execution plan. Finally, according to the physical execution plan, the corresponding underlying platform is selected to perform the matrix operations and produce the final result.

2.2 Distributed matrix computing system framework for big data

Figure 2.

Distributed matrix computing system framework for big data analysis.

On this basis, this paper adopts Octopus, a system oriented toward big data. The distributed matrix representation and storage layer mainly defines the storage format of data in the distributed file system. Each computing platform provides its own matrix library; OctMatrix wraps the matrix library of each platform and offers upper-layer users a matrix library written in the R language. SymbolMatrix provides a matrix interface consistent with that of OctMatrix. It also maintains the flow graph of matrix operations and supports logical and physical optimization of that graph. Using the matrix interface provided by the system, users can develop higher-level distributed matrix and data analysis programs in the R environment. The distributed matrix computing system framework for big data analysis is shown in Fig. 2.

2.3 Cross-platform big data machine learning system: Octopus

2.3.1 Programming model and programming interface

Octopus offers two matrix programming interfaces: a declarative matrix, SymbolMatrix, and an imperative matrix, OctMatrix. To enable upper-layer programs to be executed on multiple underlying computing platforms (such as Spark and MPI (Message Passing Interface)), the SymbolMatrix interface stores the type of each matrix operation and the matrices it depends on. From this dependency information a matrix computation flow graph, that is, a directed acyclic graph (DAG) whose edges are the dependencies, is generated, and the graph is then optimized logically and physically. In this way, users can quickly complete machine learning, data analysis, and algorithm modeling work with the OctMatrix API (application programming interface), and then convert the program to the SymbolMatrix API to achieve better performance.

Matrix multiplication [12, 13]:

$\displaystyle{C}={A*B}$ (1)

In the formula, $A$ , $B$ , and $C$ are matrices.

Matrix addition [14]:

$\displaystyle{C}={A}+{B}$ (2)

Matrix norm calculation [15, 16]:

$\displaystyle\|A\|_{F}=\sqrt{\sum_{i}\sum_{j}\left|a_{ij}\right|^{2}}$ (3)

Matrix determinant [17]:

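$\displaystyle\det(A)=\sum_{j}(-1)^{i+j}\,a_{ij}\,N_{ij}$ (4)

$N_{ij}$ is the minor of element $a_{ij}$, that is, the determinant of the submatrix obtained by deleting row $i$ and column $j$ of $A$; the signed quantity $(-1)^{i+j}N_{ij}$ is the cofactor of $a_{ij}$. The expansion can be taken along any fixed row $i$.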
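The four operations in Eqs. (1)-(4) can be illustrated on small in-memory matrices. The following is a minimal single-machine sketch, not the Octopus implementation: matrices are assumed to be plain lists of rows, the function names are hypothetical, and the determinant uses a recursive expansion along the first row with the sign factor $(-1)^{i+j}$, which is only practical for very small matrices.

```python
import math


def mat_mul(a, b):
    """Eq. (1): C = A * B for matrices stored as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must agree"
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for row in a]


def mat_add(a, b):
    """Eq. (2): C = A + B, element by element."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def frobenius_norm(a):
    """Eq. (3): square root of the sum of squared absolute entries."""
    return math.sqrt(sum(abs(x) ** 2 for row in a for x in row))


def determinant(a):
    """Eq. (4): expansion along the first row (exponential cost)."""
    n = len(a)
    if n == 1:
        return a[0][0]
    total = 0
    for j in range(n):
        # Submatrix with row 0 and column j removed.
        minor = [row[:j] + row[j + 1:] for row in a[1:]]
        total += (-1) ** j * a[0][j] * determinant(minor)
    return total
```

A distributed system would replace these loops with block-wise operations on many nodes; the sketch only fixes the meaning of each formula.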

To meet the needs of both fast modeling and efficient computing, Octopus offers two programming models: an imperative model and a declarative model. Octopus also divides matrix operations into two types, continuous operations and batch operations, which differ in the size of the processed data blocks and in the processing method. Continuous operations target real-time scenarios, where the data blocks are usually small and results can be returned in real time. Batch operations suit batch-processing scenarios, where the data blocks are large and are calculated and output in a single pass. The difference between the imperative and declarative models can be illustrated with the code $a<-b*c$. If both $b$ and $c$ are OctMatrix objects, the multiplication is executed immediately on the underlying platform. If both $b$ and $c$ are SymbolMatrix objects, however, the multiplication does not call the underlying multiplication operation; instead, it records the call relationship of the multiplication, thereby extending the matrix computation flow graph, and assigns the resulting SymbolMatrix to $a$. The actual computation of a SymbolMatrix must be triggered by some function, such as the valuate function (similar to lazy evaluation in Spark). Because the declarative matrix holds the complete operational logic of the application, the program can be optimized logically and physically before it runs. The symbol matrix in Spark is implemented through GraphX using a similar lazy approach: calculation tasks are expressed and optimized as a series of transformation operations, and the actual calculation is performed only when the final result is needed. This feature enables Spark to handle large-scale symbol matrix computation tasks efficiently.
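The contrast between the imperative and declarative matrices can be sketched in a few lines. This is an illustrative toy, not the Octopus API: the two class names are borrowed from the text, but their constructors, the `leaf` helper, and the `evaluate` trigger are assumptions made for this example.

```python
class OctMatrix:
    """Imperative matrix: every operation is executed immediately."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __mul__(self, other):
        inner = len(other.rows)
        cols = len(other.rows[0])
        return OctMatrix([[sum(r[k] * other.rows[k][j] for k in range(inner))
                           for j in range(cols)] for r in self.rows])


class SymbolMatrix:
    """Declarative matrix: an operation only adds a node to the flow graph."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op                  # operation type stored in the DAG node
        self.inputs = tuple(inputs)   # the matrices this node depends on
        self.value = value            # concrete data, present only on leaves

    @classmethod
    def leaf(cls, rows):
        return cls("load", value=OctMatrix(rows))

    def __mul__(self, other):
        # Nothing is computed here; only the dependency is recorded.
        return SymbolMatrix("multiply", (self, other))

    def evaluate(self):
        """Hypothetical trigger: walk the graph and run the real operations."""
        if self.op == "load":
            return self.value
        left, right = (node.evaluate() for node in self.inputs)
        return left * right
```

With `b` and `c` built by `SymbolMatrix.leaf`, the expression `a = b * c` returns at once with `a.op == "multiply"`, and the product is computed only when `a.evaluate()` is called. An optimizer would rewrite the graph between those two moments.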

Octopus realizes the integration of big data platform by introducing matrix interface. By modifying the basic computing platform, Octopus supports the collaborative work of multiple big data platforms, enabling them to work seamlessly, achieving more flexible and efficient data processing and improving the overall performance of the system.

The running logic of a complete program refers to the entire execution process of the program, including steps such as loading input data, preprocessing data, calculating operations, and outputting results. The logic and physics of optimizing a program refer to the methods and strategies used to optimize the program, including optimizing data distribution, computing resources, execution order, and other aspects.

Octopus provides a series of APIs and modules for matrix operations, including computation libraries for SymbolMatrix and OctMatrix, a distributed file system, a matrix loader, a distributed computing engine, parallel matrix decomposition, and a graph computing framework. For the matrix operations required by big data, it offers two APIs, OctMatrix and SymbolMatrix, together with a matrix generation module, a matrix output module, and a matrix operation module. Users can easily switch between the two APIs to meet different requirements. For example, OctMatrix can be used for rapid modeling, and SymbolMatrix can be used to improve performance.

The Octopus system uses the data access mechanism of the distributed file system Alluxio to transfer matrices efficiently across platforms. Users store matrix data in Alluxio, which provides high-performance data access and cross-platform file sharing, so that the Octopus system can adapt to the needs of different computing platforms.

The Octopus system integrates with Alluxio through its API. Using the operation interface provided by Octopus, users can upload matrix data to, download it from, and share it through Alluxio’s distributed storage. This design allows the Octopus system to transmit and access matrix data across different computing platforms and gives the data higher portability.

Alluxio’s advantage lies in its flexible and extensible architecture, which helps the Octopus system meet the challenges of large-scale data processing. By integrating with Alluxio, the Octopus system not only processes large-scale data efficiently, but also provides users with more flexible and reliable data management and access.

2.3.2 Logical optimization and physical optimization of DAG computational flow graph

The optimization of the DAG (directed acyclic graph) computational flow graph generated by SymbolMatrix is divided into two parts: logical optimization and physical optimization. Logical optimization improves the computational logic of the symbol matrix DAG, including data flow analysis and algorithm selection, and does so by changing the structure of the DAG itself. Physical optimization exploits the characteristics of the computing resources to improve the physical execution of the computation. Together they improve the speed and efficiency of symbol matrix calculation. A typical logical optimization concerns chained matrix multiplication: because the computational cost differs greatly with the order in which the multiplications are performed, this paper adopts an algorithm that optimizes the multiplication order of the chain. Different multiplication orders lead to different data access patterns and computation paths. Some orders increase data dependencies and hence computation and data transmission costs, and different orders also change the locality of data access, which affects cache utilization. The result of logical optimization is still a DAG.
Physical optimization refers to optimizing the basic computations according to the characteristics of the platform. For example, when a program runs on the Spark platform, the logically optimized DAG is analyzed to improve execution performance, and matrix operations on the other platforms benefit in the same way. The optimized graph is then mapped to operations in OctMatrix and executed on the corresponding computing platform to obtain the calculation result. Processing data in matrix form has the following advantages. Parallel computing: matrix operations parallelize well, and in a distributed environment matrices can be divided into block matrices. Data locality: matrix operations exhibit data locality. Algorithm efficiency: compared with other data structures, matrix data structures usually offer higher algorithmic efficiency and numerical stability. Feature extraction and statistical analysis: matrix data is widely used in both. Data exchange and sharing: with a unified matrix representation, data can be transmitted and processed across different platforms and distributed environments, enabling cross-system exchange and sharing of data.
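The effect of multiplication order on a chained product can be made concrete with the classical dynamic-programming algorithm for matrix-chain ordering. The sketch below is a textbook version and is not claimed to be the algorithm used in Octopus; it assumes that multiplying a $p*q$ matrix by a $q*r$ matrix costs $p*q*r$ scalar multiplications, and the function name and the letter labels for the matrices are hypothetical.

```python
def chain_order(dims, names="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Return (minimum scalar multiplications, parenthesization).

    Matrix number t in the chain has shape dims[t] x dims[t + 1].
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):            # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):              # last multiplication splits at k
                candidate = (cost[i][k] + cost[k + 1][j]
                             + dims[i] * dims[k + 1] * dims[j + 1])
                if candidate < cost[i][j]:
                    cost[i][j] = candidate
                    split[i][j] = k

    def plan(i, j):
        if i == j:
            return names[i]
        k = split[i][j]
        return "(" + plan(i, k) + " " + plan(k + 1, j) + ")"

    return cost[0][n - 1], plan(0, n - 1)
```

For example, `chain_order([10, 30, 5, 60])` selects `((A B) C)` at a cost of 4500 multiplications, whereas the other order would cost 27000. In a flow-graph optimizer, the chosen parenthesization would be written back into the DAG as a restructured sequence of multiply nodes.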

Distributed singular value decomposition:

$\displaystyle{A}={U}\Sigma{V}^{T}$ (5)

The computational theory can be expressed based on the following formula:

Data loading:

$\displaystyle\textit{F}_{\textit{input}}=\textit{loadData}\left(\textit{data\_path}\right)$ (6)

Data partition:

$\displaystyle\textit{F}_{\textit{matrix}}=\textit{splitMatrix}\left(\textit{F}_{\textit{input}},\textit{n}\right)$ (7)

In the formulas, data_path is the path of the original data; $\textit{F}_{\textit{input}}$ is the original data set; n is the dimension of the target sub-matrices; and $\textit{F}_{\textit{matrix}}$ is the matrix data set obtained by dividing $\textit{F}_{\textit{input}}$ according to the dimension n.

Iterate over each block ( $i, j$ ) in the matrix data set $\textit{F}_{\textit{matrix}}$ and distribute these blocks to the corresponding compute nodes.

The formula involved is as follows:

$\displaystyle\textit{F}_{\textit{i},\textit{j}}=\textit{distribute}\left(\textit{F}_{\textit{matrix}}\left[\textit{i},\textit{j}\right],\textit{node\_list}\right)$ (8)

$\textit{F}_{\textit{i},\textit{j}}$ represents the submatrix data assigned to a node for block ( $i, j$ ).
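The loading, partitioning, and distribution steps of Eqs. (6)-(8) can be sketched as three small functions. This is an illustration under stated assumptions rather than the system's code: the input is taken to be whitespace-separated text lines (an open file object would also work), `n` is interpreted as the side length of each sub-block, and blocks are placed on nodes round-robin, which is only one possible placement policy. The function and node names are hypothetical.

```python
def load_data(lines):
    """Eq. (6): parse text lines into a dense matrix (list of rows)."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]


def split_matrix(matrix, n):
    """Eq. (7): cut the matrix into n x n sub-blocks keyed by (i, j)."""
    blocks = {}
    for r0 in range(0, len(matrix), n):
        for c0 in range(0, len(matrix[0]), n):
            blocks[(r0 // n, c0 // n)] = [row[c0:c0 + n]
                                          for row in matrix[r0:r0 + n]]
    return blocks


def distribute(blocks, node_list):
    """Eq. (8): assign each block (i, j) to a node, round-robin."""
    placement = {node: {} for node in node_list}
    for index, key in enumerate(sorted(blocks)):
        node = node_list[index % len(node_list)]
        placement[node][key] = blocks[key]
    return placement
```

For a 4 x 4 input with `n = 2` and two nodes, each node receives two of the four blocks. A production system would typically also weigh block sizes and data locality when placing blocks, which round-robin ignores.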

The matrix multiplication operation can be expressed by the following formula:

$\displaystyle{C}={A*B}$ (9)

Note that distributed computing requires communication and synchronization between nodes. For each block ( $i, j$ ), once the matrix multiplication has finished, the calculated submatrix of C is sent to the result summary module. The system reads matrix data from distributed files and writes the resulting matrices back to them in the following steps: read the data, load the data, build the matrix, operate on the data, and write the data.
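The block-wise multiplication and the final aggregation step can be sketched as follows. The example is a single-process simulation, assuming square grids of equally sized blocks: a thread pool stands in for the compute nodes, and the `assemble` function plays the role of the result summary module. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_block(a, b):
    return [[sum(row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for row in a]


def add_block(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def compute_result_block(a_blocks, b_blocks, i, j, grid):
    """One task: C[i, j] is the sum over k of A[i, k] * B[k, j]."""
    total = multiply_block(a_blocks[(i, 0)], b_blocks[(0, j)])
    for k in range(1, grid):
        total = add_block(total,
                          multiply_block(a_blocks[(i, k)], b_blocks[(k, j)]))
    return (i, j), total


def assemble(results, grid):
    """Result summary step: stitch the C[i, j] blocks into one matrix."""
    rows = []
    for i in range(grid):
        block_row = [results[(i, j)] for j in range(grid)]
        for r in range(len(block_row[0])):
            rows.append([value for block in block_row for value in block[r]])
    return rows


def distributed_multiply(a_blocks, b_blocks, grid, workers=4):
    tasks = [(i, j) for i in range(grid) for j in range(grid)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compute_result_block, a_blocks, b_blocks,
                               i, j, grid) for i, j in tasks]
        # Waiting on every future is the synchronization point.
        results = dict(f.result() for f in futures)
    return assemble(results, grid)
```

In a real cluster each task would run on the node that holds the required blocks, and collecting the results would involve network transfer rather than an in-memory dictionary.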

2.3.3 Representation and storage of distributed matrix data

Figure 3.

Representation model of the underlying matrix in Octopus (stand-alone matrix, distributed row matrix, distributed block matrix).

The most basic and central part of the Octopus architecture is a general matrix programming model. To achieve cross-platform functionality, different computing platforms must be able to obtain data from the same distributed files. The representation and storage management of a data set must therefore be unified, so that the data set has a single representation and storage format in the distributed file system. On different computing platforms, however, the in-memory representation of a matrix is not the same, and it generally falls into three categories. The matrix of the single-machine R platform is built from row vectors, while distributed platforms have two modes of representation: row matrix and block matrix. On distributed platforms, dividing the matrix into block matrices can improve data parallelism and the efficiency of distributed computing. Different partitioning strategies can be used, such as dividing matrices into fixed-size blocks or adjusting block sizes dynamically to suit different computing tasks. To support matrix operation and transfer between multiple platforms, this paper stores matrices in the distributed file system in a row-matrix format. Distributed file systems typically provide such a unified data representation, for example a unified row-matrix or column-matrix format, and different data files and computing nodes map their data to this format. Each platform obtains its own matrix by reading the row-matrix files, and writes its results back to the distributed file system in the same format. The representation model of the underlying matrix in Octopus (stand-alone matrix, distributed row matrix, distributed block matrix) is shown in Fig. 3.
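A unified row-matrix file format can be sketched as follows. The concrete layout used here, one text line per row consisting of the row index, a tab, and comma-separated values, is an assumption made for illustration; the text does not specify the on-disk format used by Octopus, and the function names are hypothetical.

```python
def to_row_lines(matrix):
    """Serialize a matrix as one 'index<TAB>v1,v2,...' line per row."""
    return [f"{i}\t" + ",".join(repr(v) for v in row)
            for i, row in enumerate(matrix)]


def partition_rows(lines, parts):
    """Spread the row records over several file partitions."""
    return [lines[p::parts] for p in range(parts)]


def from_row_lines(lines):
    """Rebuild the matrix; the records may arrive in any order."""
    rows = {}
    for line in lines:
        index, values = line.split("\t")
        rows[int(index)] = [float(v) for v in values.split(",")]
    return [rows[i] for i in sorted(rows)]
```

Because each record carries its own row index, a platform can read the partitions in any order and still rebuild the same matrix, which is what allows a single-machine reader and a distributed reader to share the same files.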

The matrix data is held in a distributed file system, as shown in Fig. 3. In addition, Octopus builds a layer of memory-based matrix data storage and access on top of HDFS. This layer is Alluxio, a memory-based distributed file system, so matrix data can be transferred across platforms by reading from and writing to Alluxio. The Octopus system abstracts matrix programs at the system level, separating algorithm design at the upper level from distributed parallelization at the lower level.

Octopus realizes cross-platform characteristics, and decouples the upper-level algorithm design from the bottom-level platform through OctMatrix framework. Octopus provides advanced interface, which enables the same algorithm to run on different underlying platforms. Hierarchical abstraction allows algorithm designers to focus on logic by hiding the underlying details. Scalability adapts to platforms with different scales and characteristics through the scheduling and optimization of the underlying distributed parallel computing. This ensures that Octopus maintains a high degree of flexibility and performance in various environments.

With OctMatrix, a program can be executed on any underlying computing platform without modification, which gives the system its cross-platform character. The Octopus architecture also has good hierarchical abstraction, which keeps its design and implementation simple and gives it good scalability. On this basis, the system-defined matrix interface allows existing and upcoming big data platforms to be integrated as plug-ins: once an underlying computing platform has been adapted to the interface, programs can run on it. The performance of matrix operations may vary from platform to platform, because platforms differ in CPU architecture, memory specification, and computing resource configuration. Generally speaking, when matrix operations are performed on high-performance computing platforms, such as supercomputers or distributed systems with large-scale clusters, the OctMatrix framework can fully utilize the platform’s parallel computing and high-performance storage resources, thereby achieving higher computing speed and performance.

Octopus realizes the separation of upper-level algorithm design and lower-level distributed parallelism through the abstraction of system-level matrix program. In this architecture, the upper layer algorithm design is carried out through the advanced interface of Octopus, while the lower layer distributed parallel computing is handled by the lower layer of Octopus system. This separation enables algorithm designers to focus on algorithm logic and advanced optimization without paying close attention to the details of the underlying distributed computing.

The advanced interface provided by Octopus abstracts the complexity of the underlying distributed parallelism, simplifies the attention of algorithm designers to the underlying details, and makes the algorithm design more intuitive and easy to understand. Because the bottom implementation of Octopus is decoupled from the specific distributed computing platform, the same high-level algorithm design can run on different bottom platforms, which realizes cross-platform flexibility.

2.4 System programming operation use and programming examples

2.4.1 Basic programming operation

This paper uses RStudio Server, an R language development environment, as the program development environment of the Octopus system, and describes the basic programming operations of the system. To map an algorithm to a computing platform within the OctMatrix framework, the following steps can be followed: understand the OctMatrix framework; understand the computational model of the algorithm; select a suitable computing platform based on the framework’s support and the algorithm’s requirements; implement the algorithm mapping using the framework’s interface and programming model; optimize performance; and run and test.

The Octopus system connects to multiple big data processing platforms at the lower level in order to support big data applications at the upper level. The upper layer is the R language development environment RStudio Server, the middle layer is Octopus (an R package), and the lower layer consists of the various big data processing platforms. Upper-layer users first debug their programs on small data sets on R’s stand-alone platform, which speeds up development. After debugging, the user selects a big data processing platform by switching the underlying computing platform. In this way, without any change to the code logic, the upper-layer program runs on the chosen big data platform and can process massive data, achieving the cross-platform feature of “Write once, run anywhere”. Because Octopus is compatible with the $R$ environment, users can use it both interactively and in batch mode.
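The idea of switching the underlying platform without changing the program logic can be sketched with interchangeable backends. The two backend classes below are stand-ins invented for this example (they are not Spark or MPI bindings), and the `BACKENDS` registry and `gram_matrix` user program are likewise hypothetical.

```python
import math


def _multiply(a, b):
    return [[sum(row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for row in a]


class SingleMachineBackend:
    """Stand-in for the stand-alone platform used while debugging."""

    def multiply(self, a, b):
        return _multiply(a, b)


class PartitionedBackend:
    """Stand-in for a distributed platform: A is processed in row partitions."""

    def __init__(self, partitions=2):
        self.partitions = partitions

    def multiply(self, a, b):
        chunk = max(1, math.ceil(len(a) / self.partitions))
        result = []
        for start in range(0, len(a), chunk):
            result.extend(_multiply(a[start:start + chunk], b))
        return result


BACKENDS = {"local": SingleMachineBackend(), "partitioned": PartitionedBackend()}


def gram_matrix(x, platform="local"):
    """User program: t(X) %*% X. Only the platform name changes per run."""
    backend = BACKENDS[platform]
    x_t = [list(col) for col in zip(*x)]
    return backend.multiply(x_t, x)
```

Calling `gram_matrix(x, "local")` during debugging and `gram_matrix(x, "partitioned")` afterwards gives the same result from the same user code, which is the property the layered design aims for.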

A demonstration of using Octopus system for text preprocessing and word segmentation in $R$ language is as follows:

2.4.2 Programming examples

A simple example of Generalized Non-negative Matrix Factorization (GNMF) using the NMF (Non-negative Matrix Factorization) package is shown in Fig. 4.

Step 1. Install and load the NMF package;

Step 2. Create a matrix object, assuming a matrix named $X$ ;

Step 3. Run the GNMF decomposition;

In the above example, the matrix $X$ is decomposed into a GNMF model of rank 2. The rank parameter specifies the rank of the decomposition, and the method parameter specifies the decomposition method.

Step 4. View the decomposition result.

3. Experiment and results of distributed matrix computing system

Table 1
Hardware and software of each device

Item	Configuration information
CPU	Intel Xeon E5-2620, 2.10 GHz $\times$ 2
Memory	200 GB (gigabyte)
Disk	6 TB SAS (Serial Attached SCSI (Small Computer System Interface)), 100 GB SSD (solid state drive)
File system	Ext4 file system
Network bandwidth	9 Gbps
OS (operating system)	Red Hat Enterprise Linux Server 8.0
ATLAS version	3.2.3
JVM version	2.7
R version	3.1.1

In this paper, the cluster used for the two distributed computing platforms, Spark and MPI, consists of 9 physical nodes: 1 primary node and 8 secondary nodes. Table 1 lists the hardware and software of each node. In Spark, the number of executor partitions is 200, and the same value of 200 is used for MPI. In addition, for a matrix multiplication A %*% B in which A has dimensions $m*k$ and B has dimensions $k*n$ , the test case is denoted by $m*k*n$ .

Logical execution optimization:

Table 2

Comparison before and after matrix multiplication optimization

Computing platform	Number of columns of matrix C	Without multiplication chain optimization	With multiplication chain optimization	Speed-up ratio
R	1000	200	40	5
R	100	100	10	10
R	10	75	5	15
Spark	1000	30	5	6
Spark	100	27	1.5	18
Spark	10	24	1	24
MPI	1000	28	7	4
MPI	100	24	2	12
MPI	10	16	1	16

Figure 4.

A simple example of GNMF decomposition using an NMF package.

In the experiment on matrix multiplication chain optimization, the product of three matrices (A %*% B %*% C) is used to show the influence of the optimization on computational performance. The row and column dimensions of A, B, and C are 10000, except that the number of columns of C is reduced from 1000 to 100 and then to 10. Table 2 gives the matrix multiplication results before and after optimization together with the speed-up ratio. As the number of columns of C is reduced from 1000 to 10, the speed-up ratio increases from 5 to 15 on the R platform, from 6 to 24 on the Spark platform, and from 4 to 16 on the MPI platform. The test results show that multiplication chain optimization significantly shortens the running time.
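As a rough plausibility check on this trend, the number of scalar multiplications can be counted for the two possible evaluation orders. This is a sketch under assumptions that the text does not state explicitly: A and B are taken to be $n*n$ with $n=10000$, C is taken to be $n*c$, and multiplying a $p*q$ matrix by a $q*r$ matrix is taken to cost $p*q*r$ multiplications.

$\displaystyle\textit{cost}\left((AB)C\right)=n^{3}+n^{2}c,\qquad\textit{cost}\left(A(BC)\right)=2n^{2}c$

$\displaystyle\frac{\textit{cost}\left((AB)C\right)}{\textit{cost}\left(A(BC)\right)}=\frac{n+c}{2c}$

For $c=1000$, $100$, and $10$ this ratio is 5.5, 50.5, and 500.5 respectively. The idealized ratio grows as C becomes narrower, which is the same direction as the measured speed-up ratios in Table 2. The measured values would be expected to be smaller than the idealized ones, since running time also includes costs that this count ignores, such as data loading, communication, and scheduling.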

Table 3

The impact of common subexpression elimination techniques on system performance

Computing platform	Without common subexpression elimination	With common subexpression elimination	Performance improvement (%)
R	190	80	57.9
Spark	28	20	28.6
MPI	30	15	50

The impact of common subexpression elimination on system performance is shown in Table 3. With this optimization, performance improves by 57.9% on the R platform and by 50% on the MPI platform, while the improvement on the Spark platform is 28.6%.
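Common subexpression elimination on a computation flow graph can be sketched as follows. The representation is an assumption made for illustration: an expression is a nested tuple `(operation, argument, ...)` with matrix names as strings, and two subtrees are treated as the same node when they are structurally identical. The function names are hypothetical.

```python
def count_operations(expr):
    """Operator nodes evaluated when every subtree is recomputed."""
    if isinstance(expr, str):          # a leaf: a named input matrix
        return 0
    return 1 + sum(count_operations(arg) for arg in expr[1:])


def eliminate_common_subexpressions(expr):
    """Return the set of distinct operator nodes after sharing subtrees."""
    unique = set()

    def visit(node):
        if isinstance(node, str) or node in unique:
            return                     # leaf, or already scheduled once
        unique.add(node)
        for arg in node[1:]:
            visit(arg)

    visit(expr)
    return unique
```

For example, with `gram = ("multiply", ("transpose", "A"), "A")`, the expression `("add", ("multiply", gram, "B"), ("multiply", gram, "C"))` contains 7 operator nodes when evaluated naively, but only 5 distinct ones, because `t(A) %*% A` needs to be computed only once.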

In addition, the logical optimization of the DAG is evaluated on a single platform using Gaussian non-negative matrix factorization (GNMF). The algorithm decomposes a matrix V (M*N) into a matrix W (M*K) and a matrix H (K*N). GNMF is mainly used in document clustering, topic modeling, and computer vision. For the chained product W %*% H %*% t(H), multiplication chain optimization computes H %*% t(H) first and then multiplies W by the result. In the experiment, 5 iterations were used, K was fixed at 1000, and M and N were set to the same value. The GNMF code implemented with Octopus is shown in Fig. 5.
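For reference, a single-machine sketch of GNMF is given below. It uses the standard multiplicative update rules for the squared-error objective and is not a reproduction of the code in Fig. 5; the function names, the random initialization, and the small constant added to the denominators are assumptions of this sketch.

```python
import random


def matmul(a, b):
    return [[sum(row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def multiplicative_update(x, numerator, denominator, eps=1e-9):
    """Element-wise x * numerator / denominator (eps avoids division by 0)."""
    return [[x[i][j] * numerator[i][j] / (denominator[i][j] + eps)
             for j in range(len(x[0]))] for i in range(len(x))]


def gnmf(v, k, iterations=5, seed=0):
    """Factor V (M x N) into non-negative W (M x K) and H (K x N)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        w_t = transpose(w)
        # H <- H * (t(W) V) / ((t(W) W) H)
        h = multiplicative_update(h, matmul(w_t, v), matmul(matmul(w_t, w), h))
        h_t = transpose(h)
        # W <- W * (V t(H)) / (W (H t(H))): the small K x K product
        # H t(H) is formed first, as in the chain optimization above.
        w = multiplicative_update(w, matmul(v, h_t), matmul(w, matmul(h, h_t)))
    return w, h


def frobenius_error(v, w, h):
    approx = matmul(w, h)
    return sum((v[i][j] - approx[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5
```

Grouping the denominator as `W (H t(H))` avoids forming the M x N intermediate `W H`, which is where the chain optimization saves work when K is much smaller than N.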

System performance evaluation:

Table 4

Results of system performance evaluation

Algorithm task	Calculation time (s)	CPU (Central processing unit) usage rate	Memory usage (GB)	Network bandwidth (Mbps)
Task 1	100	80%	10	500
Task 2	150	70%	9	450
Task 3	170	60%	8	400

Figure 5.

GNMF code implemented with Octopus.

During the experiment, the calculation time and resource consumption of each algorithm task were recorded and used to evaluate performance. The system performance evaluation results are shown in Table 4.

Calculation time and resource consumption differ across algorithm tasks. Overall, the computing efficiency of the system is relatively high: CPU usage averages 70% across the three tasks, and the network bandwidth meets the needs of data transmission.

Analysis of matrix calculation results:

Table 5

Matrix calculation results

Algorithm task	Data scale	Iterations	Calculation results	Result correctness
Task 1	1000*1000	10	19.5	Correct
Task 2	2000*1000	20	40.2	Correct
Task 3	3000*1000	30	60.2	Correct
Task 4	4000*1000	40	60.8	Correct

In the course of the experiment, the calculation results of the different matrix calculation tasks were recorded and analyzed. The matrix calculation results are shown in Table 5.

For matrix calculation tasks of different scales, the system outputs the correct calculation results. The convergence speed and stability of the algorithm can be further analyzed from the number of iterations and other parameters.

Algorithm effect analysis:

Figure 6.

Algorithm effect analysis

In the experiment, the strengths and weaknesses of each algorithm can be evaluated by comparing the calculation results and performance indicators of the different algorithms. The algorithm effect analysis is shown in Fig. 6.

Calculation time and resource consumption differ between algorithms, but the accuracy of the output must also be considered. The accuracy of the algorithms' output is relatively high (88–95%). After comprehensive evaluation, the optimal calculation scheme can be determined.

Data balance analysis:

In distributed computing, in order to make full use of the computing resources of each computing node, it is usually necessary to store data in blocks on different nodes. In this case, the balance of the data needs to be evaluated. Data balance analysis is shown in Table 6.

Table 6 shows that the amount of data stored per node varies between 14 GB and 22 GB, but the distribution is relatively balanced and no node is overloaded.
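The balance of the data can be quantified with a few summary statistics. The sketch below is a minimal example using the per-node volumes from Table 6. The choice of metrics (ratio of the largest and smallest node to the mean, and coefficient of variation) and the name `balance_report` are assumptions for illustration, not measures defined in this paper.

```python
from statistics import mean, pstdev

# Data volume per node in GB, copied from Table 6.
data_gb = {
    "Node1": 20, "Node2": 15, "Node3": 18, "Node4": 17, "Node5": 19,
    "Node6": 21, "Node7": 16, "Node8": 14, "Node9": 22, "Node10": 20,
}


def balance_report(volumes):
    """Summarize how evenly data is spread across nodes."""
    values = list(volumes.values())
    avg = mean(values)
    return {
        "mean_gb": avg,
        "max_over_mean": max(values) / avg,   # 1.0 means the largest node is average
        "min_over_mean": min(values) / avg,
        "coefficient_of_variation": pstdev(values) / avg,
    }


report = balance_report(data_gb)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

For the values in Table 6, the mean is 18.2 GB and the largest node holds about 21% more than the mean.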

Network delay analysis:

Table 6

Data balance analysis

Node number	Data volume (GB)
Node1	20
Node2	15
Node3	18
Node4	17
Node5	19
Node6	21
Node7	16
Node8	14
Node9	22
Node10	20

Table 7

Network delay analysis

Node pair	Average delay (ms)
Node1-Node2	5
Node1-Node3	8
Node1-Node4	6
Node2-Node3	7
Node2-Node4	10
Node2-Node5	11
Node3-Node5	9
Node3-Node6	12
Node4-Node6	7
Node5-Node7	3

In distributed computing, communication between nodes is very important. The network latency between nodes needs to be analyzed in order to optimize data transfer and computation scheduling strategies. The network delay analysis is shown in Table 7.

The network delay differs between node pairs, ranging from 3 ms to a maximum of 12 ms. This paper analyzes the reasons for these differences and considers how to optimize the transfer of data between remote nodes to improve system efficiency.
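A first step in such an analysis is to locate the slowest link and the nodes with the highest average delay. The sketch below is a minimal example over the measurements in Table 7; the function names `slowest_link` and `mean_delay_per_node` are illustrative assumptions.

```python
# Average delay in ms per node pair, copied from Table 7.
delays_ms = {
    ("Node1", "Node2"): 5, ("Node1", "Node3"): 8, ("Node1", "Node4"): 6,
    ("Node2", "Node3"): 7, ("Node2", "Node4"): 10, ("Node2", "Node5"): 11,
    ("Node3", "Node5"): 9, ("Node3", "Node6"): 12, ("Node4", "Node6"): 7,
    ("Node5", "Node7"): 3,
}


def slowest_link(delays):
    """Return the (node pair, delay) entry with the largest delay."""
    return max(delays.items(), key=lambda item: item[1])


def mean_delay_per_node(delays):
    """Average the delay of every measured link that touches each node."""
    totals = {}
    for (a, b), delay in delays.items():
        for node in (a, b):
            total, count = totals.get(node, (0, 0))
            totals[node] = (total + delay, count + 1)
    return {node: total / count for node, (total, count) in totals.items()}


print(slowest_link(delays_ms))
for node, avg in sorted(mean_delay_per_node(delays_ms).items()):
    print(node, round(avg, 2))
```

Links that involve nodes with a high average delay are natural candidates when deciding where to reduce cross-node data transfer.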

Distributed computing process monitoring:

The key parameters of a distributed computing system fall into two groups: data transmission efficiency and computation scheduling efficiency. Both need to be monitored, with the key parameters of the system recorded and analyzed in real time. Distributed computing process monitoring is shown in Fig. 7.

Figure 7 records the key parameters of data transmission between nodes and of calculation progress. By analyzing these parameters, the bottleneck of system efficiency can be determined and the calculation process can be optimized.

Big data storage and backup:

Figure 7.

Distributed computing process monitoring.

Figure 8.

Relationship between storage media and capacity.

Distributed computing involves a large amount of data storage and backup work, which needs to be managed and analyzed. In online deployment especially, storage resources must be flexibly expanded and reduced to meet different data volume requirements. Figure 8 shows the relationship between capacity and the storage media: Solid State Drive (SSD), Network Attached Storage (NAS), and Hard Disk Drive (HDD).

The total capacity of SSD1 is 20 TB. Analyzing the data in Fig. 8 gives a clear picture of how each storage medium is used, so that the storage architecture can be planned and adjusted according to the actual situation to meet the needs of the service. Although a number of distributed matrix computing systems with practical value have appeared, there are still many problems to be solved. Future research needs to face these problems and give better solutions in order to better support data processing and analysis.

Table 8

Test RMSE over time on the Netflix dataset for the distributed matrix computing method, the distributed random alternating direction method of multipliers, and the heuristic iterative search algorithm

Time (s)	Distributed matrix computing method (RMSE)	Distributed random alternating direction method of multipliers (RMSE)	Heuristic iterative search algorithm (RMSE)
100	1.21	1.37	1.48
200	1.10	1.35	1.45
300	1.09	1.22	1.44
400	1.07	1.22	1.40

Recognized test datasets are publicly accessible, widely used datasets, usually provided by researchers or organizations, for evaluating and comparing the performance of different algorithms, systems, or models. Such collections provide commonly used datasets for tasks such as classification, regression, and clustering. This paper uses the Netflix and Yahoo! Music R1 datasets. Table 8 shows the test RMSE over time on the Netflix dataset for the distributed matrix computing method, the distributed random alternating direction method of multipliers, and the heuristic iterative search algorithm. The distributed matrix computing method has the smallest RMSE at every time point, which indicates better performance: at 400 s it reaches 1.07, while the distributed random alternating direction method of multipliers reaches 1.22.
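For reference, the RMSE reported in these comparisons is the square root of the mean squared difference between predicted and observed values. The sketch below is a minimal implementation of that standard definition; the function name `rmse` and the example ratings are illustrative and are not taken from the Netflix or Yahoo! Music R1 data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical predicted and observed ratings, for illustration only.
print(rmse([3.0, 4.0, 5.0], [4.0, 4.0, 3.0]))
```

A lower value means the predicted ratings are closer to the observed ones, which is why the smallest RMSE in each row indicates the best method.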

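Table 9

Test RMSE over time on the Yahoo! Music R1 dataset for the distributed matrix computing method, the distributed random alternating direction method of multipliers, and the heuristic iterative search algorithm

Time (s)	Distributed matrix computing method (RMSE)	Distributed random alternating direction method of multipliers (RMSE)	Heuristic iterative search algorithm (RMSE)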
100	1.29	1.48	1.93
200	1.27	1.45	1.88
300	1.19	1.44	1.85
400	1.17	1.34	1.71

Table 9 shows the test RMSE over time on the Yahoo! Music R1 dataset for the distributed matrix computing method, the distributed random alternating direction method of multipliers, and the heuristic iterative search algorithm. The distributed matrix computing method again has the smallest RMSE at every time point, with a minimum of 1.17 at 400 s.

4. Discussions

Compiler optimization technology plays a very important role in improving system performance. In data analysis especially, algorithms often contain complex operation logic, and it is difficult for users to choose the best execution order or scheme when writing scripts. An execution plan generated directly from such a script is therefore generally sub-optimal, and may even crash the system because of problems such as insufficient storage space. The system then has to rely on compilation optimization to address execution efficiency. Unlike query optimization in traditional databases, however, complex linear algebra algorithms produce execution plans with many operators of many types and a complex plan structure. The optimization potential of the execution plan is therefore greater, and the related technology is more important.

In the distributed matrix computing system, this paper adopts neural-network-based deep learning technology, which can train on and recognize large-scale image data efficiently. Large-scale image recognition and classification tasks usually require the processing of massive image data, and distributed matrix computing enables fast and efficient operations such as image feature extraction and image prediction.

Using distributed matrix computing system to process natural language can improve the processing speed and accuracy, and can support more natural language processing applications.

In physics, astronomy, chemistry and other fields, scientists need to perform a series of large-scale calculations, many of which involve matrix operations. Distributed matrix computing can be used to complete relevant calculations in a relatively short time, helping scientists better analyze problems and explore unknown areas.

Challenges and trends of distributed matrix computing systems:

Performance bottleneck: in distributed matrix computing systems, network communication and data transmission are important factors affecting performance. How to improve the performance by optimizing the network architecture, increasing the network bandwidth and reducing the network delay is one of the important directions of future research.

Difficult to debug: because a distributed matrix computing system has a large number of computing nodes and operation tasks, it is difficult to debug and locate errors. How to use visualization technology, distributed log analysis and other means to strengthen the monitoring and debugging capability of the system is an urgent problem to be solved.

Improve the ease of use of the system: distributed matrix computing systems require professional deployment and maintenance, which can be difficult for ordinary users to use. Therefore, how to design a simple and easy-to-use user interface and application programming interface to facilitate users to process and analyze data is also a problem that needs attention.

Intuitive interface design: the interface must be simple, so that users can quickly understand the interface’s function and operation process. It can use clear pictures and tags, avoid complicated words and technical details, and help users get started quickly.

Scenario operation flow: common user operation flows are packaged as scenarios, which simplifies complicated operations. Users are given preset templates and workflows that can be used directly or modified as needed.

Visual data processing: the system provides charts, visual job interfaces and other direct data processing and analysis functions. Without writing any complex code, the user completes the data processing work by dragging, selecting, and configuring.

Intelligent prompt and automatic control tools: the system can provide intelligent prompts and automatic control tools according to the user’s information and requirements. For example, depending on the data type and the processing object, it recommends different algorithms, parameter settings, and optimization strategies to improve the user’s work efficiency.

5. Conclusions

As an efficient data processing and analysis method, the distributed matrix computing system will play an increasingly important role in the era of big data. This paper presented the design and implementation of a distributed matrix computing system based on big data technology, which has theoretical guiding significance and practical reference value. In practical applications, the system should be adjusted and optimized for specific scenarios. For example, the matrix segmentation module can adopt other segmentation strategies, such as column segmentation, to obtain better load balancing. The matrix multiplication module can use more efficient algorithms to improve computational efficiency. The results summary module can use different merging methods to reduce network traffic. In addition, to ensure the correctness and reliability of the calculation, a fault tolerance mechanism, such as data backup and data consistency management, is needed to keep the system running stably. In summary, this paper introduced a basic design scheme for a distributed matrix computing system; the scheme has wide room for application and extension, and needs further research and exploration.

Footnotes

Funding

This work was supported by Jiangsu Safety & Environment Technology and Equipment for Planting and Breeding Industry Engineering (Project No. JSZY-2021-06).

