Fault-tolerant technology for big data cluster in distributed flow processing system

Abstract

To address the poor recovery and stability of traditional big data fault-tolerant technology, a fault-tolerant technology for big data clusters in a distributed flow processing system is proposed. The Target protocol is used to build the cache data fault-tolerant mechanism of the distributed flow processing system, and a disk data fault-tolerant mechanism is built alongside it; together they form the internal data fault-tolerant mechanism of the system. Using the Spark application framework, a fault-tolerant model of the big data cluster is built to realize fault tolerance of the big data cluster in the distributed flow processing system. The experimental results show that, compared with traditional methods, the proposed technology achieves a higher recovery rate (up to 99.83%) and stronger stability, and is better suited to big data fault-tolerant processing in distributed flow processing systems.

Keywords

Distributed flow processing system, big data cluster, fault-tolerant technology, disk data fault-tolerant mechanism

1. Introduction

With the continuous expansion of the scale and capacity of data processing systems, energy flow and information flow interact closely, which lays the foundation for intelligent data processing [6]. As a special kind of data processing system, a distributed flow processing system is characterized by high-precision monitoring data, massive volumes of monitoring information, concentrated measurement points and a wide fluctuation range of operating parameters, all of which demand higher fault tolerance and stability from monitoring operations [14]. To enhance the data coordination and interaction capability of distributed flow processing systems, a variety of intelligent and new types of monitoring equipment have been deployed. As a result, the number of monitoring points is growing explosively and the system structure is becoming more complex and diverse, which poses a more severe challenge to the operational stability and data throughput of the system [3]. At present, distributed flow processing systems still cannot cope with large fluctuations in monitoring data and frequent load changes. If the data fault tolerance of the system is insufficient, unexpected situations such as monitoring data loss or data processing delay are likely to cause false alarms, missed alarms or delayed alarms, or even errors in judging serious faults and in decision-making, which seriously threaten the operational safety of the system. Therefore, it is necessary to study data fault tolerance technology for distributed flow processing systems [5,7,20].

Reference [10] applies cloud theory to big data fault-tolerant technology. The method analyzes the application of data control in depth, establishes a cloud-theory-based fault-tolerant model for big data, and realizes data fault tolerance in the distributed flow processing system by combining the quasi-synchronous, synchronous and independent checkpoint algorithms. However, its implementation is complex, so it is difficult to apply widely. Reference [17] proposes an interactive two-stage evaluation evolutionary strategy for the big data fault-tolerant mechanism. The method designs an architecture platform for a three-module, heterogeneous and redundant fault-tolerant system, which builds a pool of successful individuals through an evolutionary mechanism so that the three redundant modules can be repaired in real time when the system fails. To further improve the efficiency of generating evolutionary individuals in this pool, an interactive two-stage evolutionary evaluation strategy is introduced, which improves system performance in two respects: the fitness of evolutionary individuals and the heterogeneity between individuals. However, the fault-tolerant stability of this method needs further improvement. Reference [9] proposes a fault-tolerant technology for big data based on length filtering. Using the length ratio of two records and the absence of attributes, it first excludes data that cannot constitute similar duplicate records, which reduces the number of comparisons and improves detection efficiency. A dynamic fault-tolerant method is then proposed to calibrate the results of field similarity evaluation, which solves the problem of misjudgement caused by missing attributes. However, this method suffers from a low recovery rate.

In order to solve the problems in the above methods, the fault tolerance technology of big data cluster in distributed flow processing system is proposed. The overall implementation scheme of the technology is as follows:

Establish the internal data fault tolerance mechanism of the system. The Target protocol is used to build the cache data fault tolerance mechanism of the distributed flow processing system, and RAID5 is then used to provide data redundancy for the system, which completes the disk data fault tolerance mechanism.

Fig. 1.

Specific composition of the target protocol.

Establish a fault tolerance mechanism for data replication outside the system, using data replication technology.

Realize big data cluster fault tolerance. The Spark application framework is used to build the fault-tolerant model of the big data cluster and to realize fault tolerance of the big data cluster in the distributed flow processing system.

Experimental verification. Taking recovery rate and fault-tolerant stability as the comparative indexes, the proposed method is compared with the methods of references [10], [17] and [9], and the experimental results are analyzed.

Conclusion.

Through the above scheme, efficient and stable big data cluster processing is achieved.

Table 1

Functions that can be achieved through interaction

Serial number	Events/operations	IO caching module function
1	Kernel/System startup	From the perspective of BBU support, it may be necessary to provide parameters for setting IO cache size and physical segment to adjust memory allocation during startup
2	Loading ICM module	Detecting the validity of BBU data
3	Get BBU data status, activate block device, add to ICM module, and brush dirty data	BBU supports check and reuse of IO caches after power failure and restart
4	Setting parameters for ICM module	Setting the local controller ID
5	Adding master block devices to icm modules	Enter remote or all controller information
6	Adding Non-master Block to ICM module	Setting backup mode
7	prepare	Add local block devices as master LU parameters according to configuration
8	Start and configure the scsi-target service, add LUN	Add non-master LU
9	Scsi-initiator gets LUN information, reads and writes LUN	ICM module needs to provide LU and its reference to scsi-target layer
10	Administrator sets volume group master control in system configuration and use	Providing LU information and LU read-write interface to scsi-target layer
11	Administrator cancels LUN mapping	Switch the master control of all volumes under the specified volume group
12	Single controller failure	Delete LUN
13	Administrator restarts or stops a single controller	Controller takeover
14	Administrator restarts or stops two controllers	Controller stop and take-over
15	Double-control accidental power failure, restart after restoration of power supply	Brush dirty data, delete LU, uninstall ICM module
16	Administrator’s Online Expansion of LUN	Considering BBU support, caching re-recognition and brushing dirty data
17	JBOD single link failure	Update LU size online to support notification of scsi-target and initiator

Fig. 2.

Data processing flow of reading function.

Fig. 3.

Data processing of writing function in data reading and writing function.

2. Research on fault-tolerant technology for big data cluster in distributed flow processing system

2.1. Establish the data fault tolerance mechanism in the system

2.1.1. Establishment of fault-tolerant mechanism for cached data

The Target protocol is used to build a fault-tolerant mechanism for cached data in the distributed stream processing system. The Target protocol includes the controller compass. It operates on data according to the user configuration and the application data mode. The data operation modes are remote read-write mode, write-mirror mode and write-back mode. The Target protocol selects the specific data operation mode according to the attributes of the system tier, the cache mode and the state of the controller. The structure of the Target protocol is shown in Fig. 1.

At the internal control level of the distributed flow processing system, the Target protocol must interact with the cache subsystem, the HA subsystem, the management subsystem and the block device driver in order to achieve fault tolerance of cached data. The functions that can be achieved through this interaction are shown in Table 1.

The IO caching function of the cached data fault-tolerant mechanism is realized by the caching function module in the Target protocol. This module can be logically divided into the following sub-modules: the remote backup and IO module, the memory recovery and memory pool module, the dirty page write-back module, the pre-reading module, the management configuration module and the data read-write module. The data read-write function mainly covers remote data read-write and local data read-write, while cached data mainly refers to cardinal structure. Reading data is divided into remote reading and local (proximal) reading. The data processing flow is shown in Fig. 2.

Written data must first be saved on the mirror and then passed to the master controller. After the dirty data has been flushed ("brushed" in Tables 1 and 2), the mirror must be notified to release the backup data. Because dirty data flushing and write operations are asynchronous, a wrcount variable is defined for each cached page unit to record the data version of the page, so that newly backed-up data that has not yet been flushed is not released by mistake [19]. The data processing flow of the write function is shown in Fig. 3.
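The role of wrcount can be illustrated with a minimal sketch. It assumes that the counter is incremented on every write and that the mirror releases a backup only when its version equals the version that was actually flushed. Apart from wrcount, every name below (CachedPage, Mirror, write_page, begin_flush, finish_flush) is hypothetical and is not taken from the ICM module.

```python
from dataclasses import dataclass


@dataclass
class CachedPage:
    """One cached page unit; wrcount is the per-page data version."""
    data: bytes = b""
    wrcount: int = 0      # assumed to increase by one on every write
    dirty: bool = False


class Mirror:
    """Hypothetical mirror-side store of backup copies, keyed by page id."""

    def __init__(self):
        self.backups = {}  # page_id -> (wrcount, data)

    def save(self, page_id, wrcount, data):
        self.backups[page_id] = (wrcount, data)

    def release(self, page_id, flushed_wrcount):
        # Release only if the stored backup is the version that was flushed.
        entry = self.backups.get(page_id)
        if entry is not None and entry[0] == flushed_wrcount:
            del self.backups[page_id]
            return True
        return False


def write_page(page, page_id, data, mirror):
    """Save the new version on the mirror first, then update the master cache."""
    page.wrcount += 1
    mirror.save(page_id, page.wrcount, data)
    page.data = data
    page.dirty = True


def begin_flush(page):
    """Snapshot the version and data that the flusher is about to write out."""
    return page.wrcount, page.data


def finish_flush(page, page_id, snapshot, mirror, disk):
    """Write the snapshot to disk, then ask the mirror to release that version."""
    version, data = snapshot
    disk[page_id] = data
    if page.wrcount == version:   # no newer write arrived during the flush
        page.dirty = False
    return mirror.release(page_id, version)
```

If a write lands between begin_flush and finish_flush, the versions differ, the release is refused, and the newer backup stays on the mirror until a later flush covers it.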

The function of dirty page compass of cached data fault-tolerant mechanism needs to be realized by the controller compass module in Target protocol. The specific function of the controller compass module is shown in Table 2 [2].

The specific execution flow of the _icm_pdflushn() function is shown in Fig. 4.

2.1.2. Establish internal disk data fault tolerance mechanism

The data redundancy of the distributed stream processing system is realized with RAID5, on which the disk data fault-tolerant mechanism is built [12]. To realize system data redundancy, roles are first assigned to the two controllers of the distributed flow processing system: they are treated as client and server in constructing the disk data fault-tolerant mechanism. The specific structure is shown in Fig. 5.
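The redundancy that RAID5 provides rests on XOR parity: each stripe stores one parity block, so any single lost block can be rebuilt from the surviving ones. The sketch below illustrates that principle only. The function names and the simple rotation rule for the parity position are assumptions for illustration and do not describe the controller implementation used in this paper.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks, stripe_index):
    """Build one stripe: the data blocks plus one parity block.

    The parity position rotates with the stripe index (a simplified rule;
    real RAID5 layouts differ in the exact rotation).
    """
    n_disks = len(data_blocks) + 1
    parity_disk = stripe_index % n_disks
    stripe = list(data_blocks)
    stripe.insert(parity_disk, xor_blocks(data_blocks))
    return stripe, parity_disk


def rebuild(stripe, failed_disk):
    """Recover the block of a single failed disk from the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_disk]
    return xor_blocks(survivors)


# Usage: three data blocks on a four-disk array, then lose disk 2.
stripe, parity_disk = make_stripe([b"abcd", b"efgh", b"ijkl"], stripe_index=1)
recovered = rebuild(stripe, failed_disk=2)
```

Because the XOR of all blocks in a stripe is zero, the same rebuild routine works whether the failed disk held data or parity. A second simultaneous failure in the same stripe cannot be recovered this way.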

Then, RAID processing and backup processing are carried out on the client and server. Both patrol the system disk, with the patrol period, patrol rate and patrol scope configured as parameters. The Sync algorithm records the data elements of the system disk, and fault-tolerant technology processes those data elements, which completes the disk data fault-tolerant mechanism [15]. The Sync algorithm is calculated as follows:

$$\text{Data source group} = \frac{\text{Position of array} - \text{Position of array head}}{\text{Time of array tail} - \text{Time of array head}} \qquad (1)$$

Table 2
Specific functions of controller compass module

Serial number	Function	Specific content/description
1	Synchronized refresh of specified range of data	The trigger is writing pinch
2	Asynchronously brush a specified number of dirty data	Triggers include writing process, coincidence applicant, timer
3	Management Configuration Function	The proportion of dirty pages that the writer triggers to brush, and the proportion of dirty pages that the back-end brush keeps
4	Supporting multiple threads to brush	Set timing brush time, equipment dirty time
5	Brush trigger	Initially create MIN-PDFLUSH-THREADS idle icm-pdflush threads and attach them to the linked list icm-pdflush-list
6	Execute the _icm_pdflushn() function	An idle icm-pdflush thread information is obtained from the pdflush-list list through icm-pdflush-operation. The FN and arg0 of the icm-pdflush-work structure are set up and the thread is awakened
7	Triggering asynchronous dirty page brushing	The execution process is shown in Fig. 4
8	Cache Page Counter Definition	Dirty page brush trigger obtains an idle icm-pdflush thread by calling icm-pdflush-operation and passes the specific brush function pointer FN

2.2. Establishment of fault-tolerant mechanism for external data replication

After the internal data fault-tolerant mechanism has been established, the fault-tolerant mechanism for data replication outside the system is built with data replication technology. A data replication architecture is first established, mainly through remote data replication technology, and its remote replication mode is then used to replicate the external data of the system [13]. Remote data replication protects the data of the distributed stream processing system: system data can be copied by remote replication over IP SAN and FC. When the main site of the system faces a failure, the data can be backed up and reverse-copied quickly, and restored after the main site has actually failed. Remote data replication is divided into asynchronous and synchronous remote replication. Asynchronous remote replication usually copies data over IP and provides cross-regional data protection for the distributed stream processing system, but its backup rate is usually slow. Synchronous remote replication backs up data in real time, mainly over FC, and can quickly restore interrupted services, but it is limited in distance [4]. Therefore, the data replication architecture combines asynchronous and synchronous remote replication, as shown in Fig. 6.
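The difference between the two replication modes can be sketched as follows. This is an illustration under stated assumptions, not the architecture of Fig. 6: it assumes one primary site, one synchronous replica that must hold each write before the write is acknowledged, and one asynchronous replica that is fed from a queue. All class and method names are hypothetical.

```python
from collections import deque


class Site:
    """A storage site holding blocks keyed by logical block address."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def apply(self, lba, data):
        self.blocks[lba] = data


class ReplicatedVolume:
    """Primary site with one synchronous and one asynchronous replica."""

    def __init__(self, primary, sync_replica, async_replica):
        self.primary = primary
        self.sync_replica = sync_replica
        self.async_replica = async_replica
        self.pending = deque()  # writes not yet shipped to the async replica

    def write(self, lba, data):
        self.primary.apply(lba, data)
        # Synchronous mode: the write is acknowledged only after the
        # replica holds it, so no acknowledged data is lost on failover.
        self.sync_replica.apply(lba, data)
        # Asynchronous mode: the write is queued and shipped later, so the
        # remote copy may lag behind the primary.
        self.pending.append((lba, data))
        return "ack"

    def drain(self, max_items=None):
        """Ship queued writes to the asynchronous replica; return the count."""
        shipped = 0
        while self.pending and (max_items is None or shipped < max_items):
            lba, data = self.pending.popleft()
            self.async_replica.apply(lba, data)
            shipped += 1
        return shipped

    def lag(self):
        """Number of acknowledged writes the asynchronous replica is missing."""
        return len(self.pending)
```

The synchronous path adds a round trip to every write, which is one way to read the distance limit mentioned above; the asynchronous path avoids that cost, but a failure can lose up to lag() writes at the remote site.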

Fig. 4.

Execution flow of the _icm_pdflushn() function.

Fig. 5.

Specific structure of controller.

Fig. 6.

Architecture of established data replication system.

The remote replication mode in the data replication architecture is used to construct the fault-tolerant mechanism of external data replication. The remote replication mode includes asynchronous replication mode and synchronous replication mode. The fault-tolerant mechanism of external data replication has the functions of data replication and external data fault-tolerance [1].

2.3. Fault tolerance of big data cluster in distributed flow processing system

Based on the fault-tolerant mechanisms for internal data and external data replication, the Spark application framework is used to build a fault-tolerant model for the big data cluster, which realizes big data cluster fault tolerance in the distributed flow processing system. The Spark application framework contains four application frameworks: the SQL interactive query framework, the Spark Streaming real-time flow processing framework, the MLlib machine learning framework and the GraphX graph computing framework. These are combined to construct the fault-tolerant model for big data clusters. Because the abstract data set RDD in Spark is shared by all of these applications, the resource consumption of daily management and data conversion in operating and maintaining the distributed flow processing system is reduced [18]. The specific architecture of the fault-tolerant model for big data clusters is shown in Fig. 7.

The functions of the upper layer of the fault-tolerant model are as follows. The SQL interactive query framework is used for fault-tolerant queries over Hadoop data, with high query efficiency. The Spark Streaming real-time flow processing framework is used for fault-tolerant queries over large-scale flow monitoring data, running on the underlying Spark engine. The MLlib machine learning framework is a machine learning library for data fault tolerance. The GraphX graph computing framework is a graph computing module for data fault tolerance that supports the Pregel and GraphLab computing models [8].

The fault-tolerant model for big data clusters supports three fault-tolerant modes in the distributed flow processing system: the Spark on Mesos cluster mode, the Spark Standalone cluster mode and the Spark on Yarn cluster mode [11]. All three modes use the abstract data set RDD for operator operations; the specific operator operations on RDDs are shown in Table 3 [16].
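The semantics of the RDD operators listed in Table 3 can be illustrated without a Spark installation. In the sketch below, plain Python lists stand in for RDDs, and the rdd_* function names are hypothetical helpers, not the Spark API. The optional numTasks argument is omitted because it only controls parallelism. Real RDDs are lazy and partitioned, and lost partitions are recomputed from lineage, none of which is modelled here.

```python
def rdd_filter(rdd, func):
    """filter: keep the elements for which func returns a true value."""
    return [x for x in rdd if func(x)]


def rdd_map(rdd, func):
    """map: apply func to every element of the original data set."""
    return [func(x) for x in rdd]


def rdd_union(rdd, other):
    """union: all elements of both data sets (duplicates are kept)."""
    return list(rdd) + list(other)


def rdd_sort_by_key(rdd, ascending=True):
    """sortByKey: sort (K, V) pairs by key."""
    return sorted(rdd, key=lambda kv: kv[0], reverse=not ascending)


def rdd_join(rdd, other):
    """join: for (K, V) and (K, W) pairs with the same K, return (K, (V, W))."""
    index = {}
    for k, w in other:
        index.setdefault(k, []).append(w)
    return [(k, (v, w)) for k, v in rdd for w in index.get(k, [])]


# Usage: join monitoring readings with site names, then sort by key.
readings = [("p2", 0.7), ("p1", 0.4)]
sites = [("p1", "north"), ("p2", "south")]
joined = rdd_sort_by_key(rdd_join(readings, sites))
```

Each operator returns a new data set and leaves its input unchanged, mirroring the immutability that allows Spark to rebuild a lost result by replaying the same operators on the source data.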

3. Experimental verification

To test the performance of the fault-tolerant technology for big data clusters in the distributed flow processing system designed in this paper, a comparative experiment is carried out.

The experimental environment is simulated in the laboratory. The specific test environment is shown in Fig. 8. It contains two network structures: a storage network and a management network. The storage network mainly uses an FC data network, while the management network mainly uses TCP/IP network switching. The management network console can monitor and configure the storage network through a GUI. The fault-tolerant technology for big data clusters in the distributed flow processing system is tested in this environment.

In the test environment shown in Fig. 8, a RAID with a 128 KB stripe is created, and the RAID and a storage pool are installed. To ensure the validity of the experiment, the experimental scheme is set as follows: 200 GB of data is selected from a NoSQL database, recovery rate and fault-tolerant stability are taken as the comparative indexes, and the proposed method is compared with the methods of references [10], [17] and [9].

3.1. Recovery rate comparison

Five experiments are carried out to compare the recovery rates of the four big data fault-tolerant methods. The experimental results are shown in Table 4.

According to Table 4, the average recovery rate is 63.03% for the method of reference [10], 67.69% for the method of reference [17], 65.47% for the method of reference [9], and 96.84% for the proposed fault-tolerant technology for big data clusters in the distributed flow processing system. The recovery rate of the proposed technology is therefore far higher than that of the comparison methods; that is, its recoverability is better than that of traditional big data fault-tolerant technology in the distributed flow processing system. This is because the paper constructs both internal and external data fault-tolerant mechanisms, which effectively improves the recovery rate.
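The averages can be recomputed directly from the five per-experiment values listed in Table 4. The short script below does so; the dictionary keys are labels chosen here for illustration. Recomputed this way, the values for references [17] and [9] and for the proposed method match the reported averages, while the five values listed for reference [10] average to about 63.01% rather than 63.03%, a small difference that may stem from rounding or transcription in the source table.

```python
# Per-experiment recovery rates (%) copied from Table 4.
recovery_rates = {
    "Reference [10]": [66.23, 53.45, 55.29, 68.09, 72.01],
    "Reference [17]": [73.21, 66.34, 65.21, 70.19, 63.52],
    "Reference [9]": [55.29, 66.89, 69.43, 79.82, 55.94],
    "Proposed method": [91.29, 99.83, 95.64, 98.32, 99.12],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


averages = {name: round(mean(rates), 2) for name, rates in recovery_rates.items()}

# Margin of the proposed method over the best comparison method,
# in percentage points.
best_baseline = max(v for k, v in averages.items() if k != "Proposed method")
margin = round(averages["Proposed method"] - best_baseline, 2)

for name, value in averages.items():
    print(f"{name}: {value:.2f}%")
print(f"Margin over best comparison method: {margin:.2f} percentage points")
```

The maximum of the proposed method's five values is 99.83%, the figure quoted in the abstract.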

3.2. Fault-tolerant stability comparison

In order to further prove the fault-tolerant performance of the proposed technology, the experimental verification is carried out with the fault-tolerant stability as the comparison indicator. The comparison results of the four methods are shown in Fig. 9.

Fig. 7.

Specific architecture of fault-tolerant model for big data clusters.

Table 3

Specific operator operations using RDD

Serial number	RDD operator	Operator operation
1	filter (func)	The new RDD consists of the elements of the original RDD for which the filter function returns true
2	map (func)	Each element of the new RDD is obtained by applying the map function to an element of the original RDD
3	union (otherDataset)	The new RDD is the union of the data in the original RDD and in otherDataset
4	sortByKey ([ascending], [numTasks])	The original RDD and new RDD data are of type (K, V), and the new RDD data are sorted by key in ascending order
5	join (otherDataset, [numTasks])	The original RDD data type is (K, V), the otherDataset data type is (K, W); for the same K, return (K, (V, W))

Fig. 8.

Specific test environment.

Table 4

Comparison of experimental results of recovery rate

Experiment	Reference [10] method	Reference [17] method	Reference [9] method	Proposed method
Experiment 1	66.23%	73.21%	55.29%	91.29%
Experiment 2	53.45%	66.34%	66.89%	99.83%
Experiment 3	55.29%	65.21%	69.43%	95.64%
Experiment 4	68.09%	70.19%	79.82%	98.32%
Experiment 5	72.01%	63.52%	55.94%	99.12%
Average recovery rate	63.03%	67.69%	65.47%	96.84%

Fig. 9.

Comparison of fault-tolerant stability.

Fig. 9 shows that, as the experimental time increases, the fault-tolerant stability of the proposed method remains higher than that of the three comparison methods. This indicates that the proposed fault-tolerant technology for big data clusters in the distributed flow processing system has high fault-tolerant stability and can be widely applied. The reason is that the paper uses the Spark framework to build the fault-tolerant model of the big data cluster, which supports three fault-tolerant modes and thereby greatly improves fault-tolerant stability.

4. Conclusions

The proposed fault-tolerant technology for big data clusters in the distributed flow processing system establishes internal and external data fault-tolerant mechanisms and, on that basis, constructs a fault-tolerant model of the big data cluster, thereby realizing fault tolerance of the big data cluster in the distributed flow processing system. It shows better recoverability than traditional big data fault-tolerant technology, with an average recovery rate of 96.84%. This is of great significance for big data processing in distributed stream processing systems.

Acknowledgements

This work was supported by Gansu Natural Science Foundation “Research on Information Mining Mechanism of Anti-terrorism and Stability Maintenance in Gansu Province” under grant no 18JR3RA191.

References

1. Blumekohout, J.K. Gamble, Nielsen et al., Demonstration of qubit operations below a rigorous fault-tolerant threshold with gate set tomography, Nature Communications 8(5) (2017), 145–149.

2. Chen, Ma and Dong, Private data aggregation with integrity assurance and fault-tolerant for mobile crowd-sensing, Wireless Networks 23(1) (2017), 131–144. doi:10.1007/s11276-015-1120-z.

3. S.H. Chen, M.H. Shen, Y.C. Chang et al., Utilizing multi-level data fault-tolerant to converse energy on software-defined storage, in: IEEE International Conference on Smart Cloud, 2017.

4. R.P. Duquia, González-Chica, Alejandro Bastos, Luiz et al., Describing numerical variables: Which are the most appropriate parameters to describe the data, Anais Brasileiros De Dermatologia 92(106) (2017), 21841–21843.

5. Fang, Chen and Xiong, A multi-factor monitoring fault-tolerant model based on a GPU cluster for big data processing, Information Sciences 496 (2019), 300–316. doi:10.1016/j.ins.2018.04.053.

6. Furquim, Filho, Jalali et al., How to improve fault-tolerant in disaster predictions: A case study about flash floods using IoT, ML and real data, Sensors 18(3) (2018), 907–911.

7. Huang, Zhang, He et al., Fault-tolerant in data gathering wireless sensor networks, Computer Journal 54(6) (2018), 976–987. doi:10.1093/comjnl/bxr027.

8. Ji, Yiyue, Xujin et al., Study on the tolerance and adaptation of rats to Angiostrongylus cantonensis infection, Parasitology Research 116(7) (2017), 1937–1945.

9. S.Y. Liu, Cheng and Li, Improved SNM algorithm based on length filtering and dynamic fault-tolerance, Application Research of Computers 34(1) (2017), 147–150.

10. Meng, Research on fault-tolerant control of ship track based on cloud theory, Ship Science and Technology 25(4) (2018), 22–24.

11. Saberi, Yassaghi and Djamour, Application of geodetic leveling data on recent fault activity in Central Alborz, Iran, Geophysical Journal International 211(2) (2017), 751–765. doi:10.1093/gji/ggx311.

12. Sengupta and Kachave, Low cost fault-tolerant against $k_{c}$-cycle and $k_{m}$-unit transient for loop based control data flow graphs during physically aware high level synthesis, Microelectronics Reliability 74(33) (2017), 88–99. doi:10.1016/j.microrel.2017.05.023.

13. Shi, Zhang, Chakrabarty et al., Towards predictive fault-tolerant in a core-router system: Anomaly detection using correlation-based time-series analysis, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(10) (2018), 2111–2124.

14. N.G. Tsoutsos and Maniatakos, Lightweight fault-tolerant for secure aggregation of homomorphic data, 2019.

15. M.A. Van, B.R. Haverkort and I.G. Niemegeers, A method for analysing the performance aspects of the fault-tolerance mechanisms in FDDI, in: [Proceedings] IEEE INFOCOM ’92: The Conference on Computer Communications, Vol. 1, IEEE, 2017, pp. 372–381.

16. Wei and Rong, Fault-tolerant detection simulation of node transmission path under cluster virus attacks, Computer Simulation 35(49) (2018), 11305–11308.

17. X.Y. Yang and Cheng, Application of interactive two-stage assessment evolutionary strategy in fault-tolerant system, Laser Journal 38(7) (2017), 126–129.

18. Zhang, Bauer, M.A. Kochte et al., Aging resilience and fault-tolerant in runtime reconfigurable architectures, IEEE Transactions on Computers 66(6) (2017), 957–970. doi:10.1109/TC.2016.2616405.

19. J.N. Zhang, S.G. Wang, Q.B. Sun et al., Fast and reliable fault-tolerance approach for service composition in integration networks, Journal of Software 28(4) (2017), 940–958.

20. Zhou, Wang, C.H. Hsu et al., Virtual machine placement with $(m, n)$-fault-tolerant in cloud data center, Cluster Computing 22(5) (2019), 11619–11631. doi:10.1007/s10586-017-1426-y.