A comprehensive study and review of tuning the performance on database scalability in big data analytics

Abstract

In the modern era, processing huge volumes of digital data from a repository is challenging because of the variety of data formats and extraction techniques available. Modern tools are limited in the accuracy and speed with which they can process data on larger networks and return quick results. The major problems of data extraction from a repository are finding the location of the data and handling dynamic changes in the existing data. Although many researchers have created tools and algorithms for processing data from the warehouse, these have not given accurate results and suffer from high latency, which is a consequence of batch processing over a larger network. The performance of database scalability has to be tuned with a powerful distributed framework and programming languages so that the latest real-time applications can process huge datasets over the network. In big data analytics, data processing is done effectively with the modern tools HADOOP and SPARK. Moreover, a recent programming language such as Python provides solutions based on the concepts of map reduce and erasure coding, but these have some challenges and limitations on huge datasets in network clusters. This review paper deals with the features of Hadoop and Spark and with their challenges and limitations across criteria such as file size, file formats, and scheduling techniques. It presents a detailed survey of the challenges and limitations that occur during the processing phase in big data analytics and provides solutions through the selection of languages and techniques using modern tools. The paper offers researchers working in big data analytics ways to improve the speed of data processing with a proper algorithm over digital data in huge repositories.

Keywords

HADOOP; SPARK; scalability; batch processing; big-data

1 Introduction

In this digital world, people generate a huge volume of data for their real-world applications and needs. Every day, large amounts of data are created in domains such as healthcare, retail, banking, and industry [1]. Data warehouses were built to store this data, but the time taken to retrieve it remains unsatisfactory. Multiple mining methods and algorithms are used in a data warehouse, but they are not suited to many situations. Later, data analytics spread through the market as a way of managing large amounts of data from various repositories [2]. The data are accessed with modern tools such as Hadoop and Spark [3], after which data mining algorithms are applied for analytics. Modified data are stored in various places according to user requirements. The major problem occurs during the extraction phase and is caused by the location of the data and the volume of the repository over the network [4]. Figure 1 shows the main V’s used in big data and how the data is processed.

Fig. 1

Main V’s used in Big Data.

To provide a solution for high-volume data storage, classic scale-in database storage was developed, but it was not sufficient. Later, Hadoop introduced the scale-out concept on commodity hardware [5]. Hadoop therefore provides better solutions for handling data storage in mega repositories in the form of clusters. Hadoop supports single-node and multi-node clusters with the Hadoop Distributed File System (HDFS) [6] for storing huge datasets over a network. Once the data is stored in HDFS clusters, it is accessed through many algorithms and classifications that extract it for real-world applications. The problem faced by most companies while extracting the data is the waiting time and access time of the repository [7]. Hadoop addresses this problem with the Map Reduce concept. Map Reduce is a paradigm introduced by Google whose purpose is to handle a high volume of data by using map and reduce functions [8]. The entire input file is split into small pieces, the related data are mapped as keys, and the keys are sent to the reducers. The reducers collect those keys and combine them into the appropriate output. This entire process is carried out by data mining algorithms and various techniques, but the time taken to complete the task is hidden from the user. Because Hadoop takes its input through batch processing, it is not suitable for real-time applications [9].

1.1 Literature survey

The Hadoop framework is used for parallel processing of data over networks. Data processing between nodes depends on the location of the nodes and on how their data migrates. Aligning and arranging all individual nodes to perform distributed data processing at the same time is therefore complicated [10] on ordinary client-server or peer-to-peer networks. Distributed frameworks were used to disseminate data from the repositories, but accuracy and latency suffer in large networks when huge datasets are accessed. To overcome this problem, the Hadoop framework was introduced; it manages these critical situations easily on commodity hardware. It is a scale-out storage and data processing system that recovers data elements quickly even from a large dataset or huge repository.

1.2 History of HADOOP

In earlier days, files were sent between distributed network nodes through a client-server architecture with a limited file size. When huge files were sent across the network, throughput and transfer speed were very low and latency was high [11], and files could also be lost or corrupted. As the volume of stored data grew, processing it within a stipulated period was not possible. So, instead of the scale-in concept used in the RAID methodology [12], storage was extended to a different level, but many difficulties still occurred in distributed database processing. The same problem appeared online: searching for an element in a huge database and returning it to the user did not succeed on time. In 2002, Nutch was created by Doug Cutting and Mike Cafarella as a web crawler to identify the most frequently searched elements on the internet [13]. Searching recent data was quite fast, whereas old data had to be extracted from the database and could not be retrieved quickly. Google then introduced the Google File System (GFS) [14], with a file access index table for referencing the files in a network. With this index searching method in place of web crawlers, the entire distributed network finishes the search task within optimal time. After that, Google introduced the Map Reduce programming concept to optimize the searched elements from a huge database with map() and reduce() functions. A Distributed File System (DFS) was introduced to store a large volume of data on commodity hardware nodes using Hadoop, and it is therefore called the Hadoop Distributed File System (HDFS). Yahoo ran a cluster of 1000 individual nodes to distribute the database in parallel. When Hadoop came into this scenario, all nodes were connected into commodity clusters by the scale-out process, so that data distribution happened smoothly.
Finally, Hadoop was released by Apache as an open-source framework with Java as its core programming language, under the name Apache Hadoop and in basic versions. Hadoop supports parallel distributed databases on the Shared Nothing Architecture (SNA) [15] principle, and it also supports a set of modern tools for data processing. This set is called the Hadoop Eco System, and it supports all data processing and analytics work. Figure 2 gives a detailed history of Hadoop and its limitations.

Fig. 2

History of Hadoop.

1.3 HADOOP ecosystem

Hadoop supports many data mining algorithms and methods for accessing data in a huge dataset, with modern tools acting as a supporting system. Data that is collected from different sources and stored in a warehouse has to be controlled and monitored as it flows. This helps to find a minimal or optimal solution to the time consumption issues of the Hadoop framework. Data generation and extraction therefore have to be monitored with one of the tools in the Hadoop ecosystem, so that the required data reaches the user on time across the clusters. Researchers normally find it difficult to optimize the network time of the ETL (Extraction, Transformation, and Loading) process because of the CAP (Consistency, Availability, Partition Tolerance) theorem concepts [16]. If any node fails, the Hadoop Eco-System tools quickly reflect the data alterations in the cluster. This system therefore covers the entire big data analytics concept through its various tools. Table 1 shows the entire Hadoop Eco-System structure.

Table 1
Hadoop Eco System

Name	Sqoop	FLUME	HIVE	HBASE	PIG	R
Functions	Structured data	Collect the logs	SQL query	Store data in the data warehouse	Pig Latin programming	Refine data from the warehouse
Language	DML	JAVA	JAVA	JAVA	Pig Latin	R
Database Model	RDBMS	NoSQL	JSON	JSON	NoSQL	RDBMS, NoSQL
Consistency Concepts	Yes	Yes	Yes	Yes	Yes	Yes
Concurrency	Yes	Yes	Yes	Yes	Yes	Yes
Durability	Yes	Yes	Yes	Yes	Yes	Yes
Replication	default	default	No	No	No	No
Storage Method	LOCAL	HDFS	HDFS	HDFS	HDFS	HDFS

1.4 HDFS architecture

The Hadoop Distributed File System (HDFS) consists of a Name Node (NN) and Data Nodes (DN) in a single-node or multi-node cluster setup. Classic Hadoop runs a Job Tracker on the name node and a Task Tracker on each data node to follow the flow of data access. To overcome the limitations of this design, the architecture was extended with a concept called replication: for each and every input, once the job completes, the output data is stored on 3 data nodes as replicas [17]. The metadata of the output data is also stored, to guard against software or hardware faults during transmission. If any node in a cluster fails, the other nodes take over and the data is transferred without delay. In later versions of Hadoop, a Secondary Name Node (SNN) was introduced to guard against failure of the name node, whose data is copied to it as an FsImage. Figure 3 shows the architecture of HDFS and its replication principles.

Fig. 3

HDFS Architecture.

1.5 HADOOP versions

The Hadoop framework provides parallel access to a distributed database through a basic Java programming paradigm [18]. The work is simplified by the Map-reduce concept running across the clusters. Hadoop was developed by Apache, and its basic version was released with several features for processing data within a short time. Initially, the Hadoop framework was designed only for performing data processing tasks on a distributed database in parallel. The entire framework runs as a cluster-based network.

•Hadoop 1.X

Hadoop 1.X is the basic version, built on two major components: Map-reduce and HDFS storage. Map-reduce is a programming model in which the input file is divided into a number of map tasks and converted into key-value pairs. Combiners take the map outputs as input and reduce them according to the keys produced by the mappers. Finally, the reduced data is stored in HDFS, which is a reliable and redundant storage system for a distributed database. It uses a replication factor of 3 by default in a master-slave architecture. Data nodes store input data in HDFS in blocks of 64 MB.

•Hadoop 2.X

Hadoop is a master-slave architecture by nature, controlled by the Name Node (NN) as the master. The remaining nodes connected to this Name Node are called Data Nodes (DN) and act as slaves. If the NN fails or is disconnected from the cluster, the entire system collapses. To guard against this, the Name Node takes a copy of its data and stores it on a different node over the network, called the Secondary Name Node (SNN), in line with the CAP theorem concepts. This additional feature is available in Hadoop 2.X, together with YARN (Yet Another Resource Negotiator). Here too the replication factor is 3, but the block size for input data storage is 128 MB [17, 19]. Table 2 lists the technical differences between these versions.

Table 2
Hadoop versions differences

Hadoop 1.X	Hadoop 2.X
4000 nodes per cluster	10,000 nodes per cluster
Job Tracker is the bottleneck	YARN cluster is used
One namespace	Multiple namespaces in HDFS
Static maps and reducers	Not restricted
Only Map-reduce jobs	Any application integrated with Hadoop
Works based on the number of tasks in a cluster	Works based on cluster size
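As a rough illustration of the block sizes quoted above (64 MB in Hadoop 1.X, 128 MB in Hadoop 2.X, replication factor 3), the following Python sketch counts how many blocks a file occupies and how many block replicas the cluster stores. The function name `hdfs_block_usage` and the 1 GB example file are assumptions made only for this illustration; they are not part of Hadoop.

```python
import math

MB = 1024 * 1024  # bytes in one megabyte


def hdfs_block_usage(file_size_bytes, block_size_mb, replication=3):
    """Return (logical blocks, stored block replicas) for a single file.

    Hypothetical helper: it only reproduces the arithmetic of fixed-size
    blocks and a replication factor, not any real HDFS API.
    """
    blocks = math.ceil(file_size_bytes / (block_size_mb * MB))
    return blocks, blocks * replication


one_gb = 1024 * MB  # assumed example file size

hadoop1 = hdfs_block_usage(one_gb, block_size_mb=64)   # Hadoop 1.X block size
hadoop2 = hdfs_block_usage(one_gb, block_size_mb=128)  # Hadoop 2.X block size

print("Hadoop 1.X:", hadoop1)  # (16, 48)
print("Hadoop 2.X:", hadoop2)  # (8, 24)
```

Under these assumptions, doubling the block size halves the number of blocks that the Name Node has to track for the same file, while the replication factor still triples the stored data.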

•Hadoop 3.X

Hadoop 3.X is the latest version of Apache Hadoop, developed to overcome the problems of previous versions. The main problem in previous versions lies in the number of blocks allocated for input data. For example, if 6 blocks are needed to store the input data, replication requires 6 × 3 = 18 blocks. The storage overhead is the number of extra blocks divided by the number of original blocks, multiplied by 100, that is (18 - 6) / 6 × 100 = 200 percent. This extra space raises costs for businesses. Hadoop 3.x therefore uses erasure coding [20, 21] to reduce the overhead to 50 percent. Figures 4 and 5 illustrate this.

Fig. 4

Replication.

Fig. 5

Erasure Coding.

The above diagram (Fig. 5) describes the erasure coding feature of Hadoop 3.x. Instead of keeping three full replicas, two data blocks are combined using the XOR function and the result is stored as a parity block. For the same 6 blocks of input data, only 9 blocks are allocated instead of 18, which means 3 blocks of extra storage. The storage overhead is therefore 3 divided by 6, multiplied by 100, which gives only 50%. Here the storage is referred to as a Data Lake [22]. Because erasure coding reduces the number of blocks assigned to incoming data, much less storage is used in HDFS. Moreover, erasure coding helps to deliver accurate data with low latency, because fewer blocks are used in HDFS.
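The overhead arithmetic and the XOR parity idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names and the two sample blocks are hypothetical, and the parity scheme is plain XOR over two equal-sized blocks as described in the text, not the actual HDFS erasure coding implementation (production policies commonly rely on Reed-Solomon codes).

```python
def storage_overhead_percent(data_blocks, total_blocks):
    """Extra blocks divided by original blocks, times 100."""
    return (total_blocks - data_blocks) / data_blocks * 100


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Overhead for the 6-block example used in the text.
replication_overhead = storage_overhead_percent(6, 6 * 3)  # 18 blocks stored
erasure_overhead = storage_overhead_percent(6, 9)          # 9 blocks stored
print(replication_overhead, erasure_overhead)  # 200.0 50.0

# XOR parity: one parity block protects two data blocks (assumed sample data).
block_a = b"HADOOP"
block_b = b"SPARK!"
parity = xor_blocks(block_a, block_b)

# If block_b is lost, it can be rebuilt from block_a and the parity block.
recovered_b = xor_blocks(block_a, parity)
print(recovered_b == block_b)  # True
```

The recovery step works because XOR is its own inverse: combining the surviving block with the parity block cancels it out and leaves the lost block.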

Another change in Hadoop 3.x is that the YARN architecture has been slightly modified to adapt to the reduced number of data blocks in HDFS. The Resource Manager allocates jobs to the Node Managers, and the jobs are monitored by the Application Master. A container is a new feature that passes the request of each node to the Application Master [23], which then sends the request to the name node. If a failure occurs between nodes, the Application Master monitors the status and the container holds the status of the nodes, so latency is lower and throughput is higher when using Hadoop 3.x. Figure 6 shows the YARN architecture.

Fig. 6

Hadoop 3. X YARN Architecture.

•Differences in Hadoop 2.X vs. 3.X

Many technical features change from one version of Hadoop to the next, and these changes improve data processing speed in big data analytics. Table 3 lists the technical features of the two versions.

Table 3

Hadoop latest and previous version differences

Features	Hadoop 2. X	Hadoop 3. X
Java Version	JDK 7	JDK 8
Fault Tolerance	Replication	Erasure Coding
Data Balancing	HDFS Balancer	CLI disk Balancer
Storage Overhead	200%	50%
Data Storage	Data skew	Data lake
YARN services	Scalability issues	V2 improves
Container	Delay due to feedback	Queue
Nodes per cluster	10000	More than 10000
Speed	Low	High
Single point of Failure	Overcome automatically	No manual intervention
Heap size memory	Configured	Auto tuning
Job Monitoring	Resource Manager	Node Manager
Task Monitoring	Resource Manager	Application Manager
Secondary Name Node support	Only one	More than 2

1.6 Schedulers used in HADOOP

In Hadoop, many clients submit jobs at the same time. These jobs are handled by the Job Tracker or the Resource Manager. Two versions of Hadoop are considered here, Hadoop 1.X and Hadoop 2.X, where X denotes the release or update number. If Hadoop 1.X is used in the cluster, the tasks are controlled by the Job Tracker/Resource Manager. With Hadoop 2.X, a secondary node may be used as a replica of the Name Node and for copying metadata [24] from the cluster. Three main schedulers are available in Hadoop.

FIFO

Capacity

FAIR

Table 4 [25] summarizes the schedulers and their drawbacks.

Table 4
Schedulers’ drawbacks

Type of Scheduler	Pros	Cons	Remarks
FIFO	Effective implementation	Poor data locality	Static allocation
FAIR	Short response time	Unbalanced workload	Homogeneous System
CAPACITY	Unused capacity jobs	Complex implementation	Homogeneous System, Non-preemptive
Delay	Simple scheduling	Does not work in all situations	Homogeneous System, Static
Matchmaking	Good data locality	More response time	Homogeneous System, Static
LATE	Heterogeneity	Lack of reliability	Homogeneous System & Heterogeneity
Deadline Constraint	Optimizing timing	Cost is high	Homogeneous System, Heterogeneity, Dynamic
Resource Aware	Cluster nodes monitoring	Extra time for monitoring	Homogeneous System, Heterogeneity, Dynamic
HPCA	High hit rate and redundancy	Cluster change state	Homogeneous System, Heterogeneity, Dynamic
Round Robin	Proper work completion	No priority is given	Homogeneous System, Heterogeneity, Dynamic
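The difference in response time between FIFO and fair scheduling can be seen in a toy model. The Python sketch below is only an illustration under simplifying assumptions: a single execution slot, jobs described by a hypothetical `(name, duration)` pair, and fair sharing approximated by equal time slices handed out in turn. It does not reproduce the actual Hadoop FIFO or Fair Scheduler implementations.

```python
from collections import deque


def fifo_completion_times(jobs):
    """Run each job to completion in arrival order."""
    clock, done = 0, {}
    for name, duration in jobs:
        clock += duration
        done[name] = clock
    return done


def fair_completion_times(jobs, time_slice=1):
    """Give every unfinished job an equal time slice in turn."""
    queue = deque(jobs)
    clock, done = 0, {}
    while queue:
        name, remaining = queue.popleft()
        run = min(time_slice, remaining)
        clock += run
        if remaining > run:
            queue.append((name, remaining - run))  # not finished, requeue
        else:
            done[name] = clock
    return done


# Assumed workload: a long job arrives just before a short one.
jobs = [("long", 8), ("short", 2)]

fifo = fifo_completion_times(jobs)
fair = fair_completion_times(jobs)
print(fifo)  # {'long': 8, 'short': 10}
print(fair)  # {'short': 4, 'long': 10}
```

In this toy workload the short job finishes at time 4 instead of 10 under fair sharing, which matches the "short response time" entry for FAIR in Table 4, while the total time to finish all work is the same.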

2 Reasons for using HADOOP

The Hadoop framework improves data processing speed because of its features [26], and it offers many advantages over the network. Table 5 lists the Hadoop features.

Table 5
Hadoop Features

Features	Usage
Various Data Sources	Multiple networks
Availability	Hadoop has a replication feature, which means that data stored on one node is replicated on three different nodes, so availability is not a problem.
Scalable	Many nodes can be connected to a cluster, as a single-node or multi-node setup, at any time and from anywhere.
Cost-Effective	Hadoop is an open-source framework that can be used by any company that generates a huge volume of data dynamically.
Low Network Traffic	Network traffic does not affect the data processing task because of the connectivity among cluster nodes.
High Throughput	The Map-Reduce programming paradigm provides high throughput between the nodes connected in Hadoop through its divide-and-conquer job processing.
Compatibility	Hadoop is a framework that works with all operating system platforms, programming languages, and modern tools of the Hadoop ecosystem.
Multiple Language Support	Hadoop is suitable for object-oriented programming languages such as Java, Python, and Scala. Moreover, it integrates effectively with the Hadoop ecosystem tools.

2.1 Problem identification

Although Hadoop has many features for processing huge volumes of data in clusters, it has some drawbacks while executing tasks, because these features have limitations when distributed data processing runs inside the clusters [27]. Multiple factors affect the Hadoop features and reduce the performance of Hadoop in distributed data processing scenarios. Some of them are discussed below with their major parameters.

When small files are accessed [28], the default block size makes processing slow and the memory allocated is large. To avoid this, small files can be merged, or HAR files (Hadoop Archives) and HBase tools can be used.

When big files are handled, retrieval is slow; such files can be processed with the SPARK framework.

Unstructured data processing suffers from high latency because of the different file formats; this can be handled by SPARK and FLINK, with RDD (Resilient Distributed Dataset) used for storage.

High-level data storage and network-level problems arise from security concerns [29] in a larger network; these can be solved using HDFS ACLs for authentication and YARN (Yet Another Resource Negotiator) as the Application Manager.

Batch-wise input processing works, but real-time data access does not. Tools such as SPARK and FLINK are used to handle this.

Programs with many lines of code (1,20,000) [30] cannot be handled, but with SPARK and FLINK this is possible.

Hadoop does not support repetitive computations or delta iterations, whereas the SPARK tool supports both with its in-memory analytics technique.

No caching or abstraction features are available in the Hadoop framework, whereas SPARK provides them.

2.2 Tuning Hadoop performance

Hadoop is used to perform parallel distributed data processing in different clusters, but parallel processing among nodes raises many problems. Several bottlenecks affect the performance of Hadoop processing over the network. They are [31]:

CPU: all the key resources of the CPU must be utilized properly by the Map and Reduce processes.

Main memory (RAM): used by the master-slave architecture running on the data nodes.

Network and bandwidth traffic caused by access to huge files.

Throughput of the input-output devices used for data storage over the network.

Hadoop tuning problems in data processing are discussed below with solutions.

A large volume of source data, which causes heavy I/O at the map stage [32], can be tuned by compressing the input with the LZO or LZ4 codecs.

Spilled records in the Partition and Sort phases, which use a circular memory buffer, can be tuned with the formula

Sort Size = (16 + R) * N / 1,048,576

where

R – number of Map

N – the number of Map output records divided by the number of map tasks, with mapred.local.dir = 100MB
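The sort size formula is simple enough to evaluate directly. The Python sketch below is an assumption-laden illustration: `sort_size_mb` is a hypothetical helper, `r` and `n` stand for the quantities R and N defined in the text, the constant 1,048,576 converts bytes to megabytes, and the sample values are chosen only so that the arithmetic is easy to check by hand.

```python
BYTES_PER_MB = 1_048_576  # divisor used in the formula


def sort_size_mb(r, n):
    """Evaluate Sort Size = (16 + R) * N / 1,048,576.

    r and n are the quantities R and N from the text; this helper only
    performs the arithmetic and makes no claim about how they are measured.
    """
    return (16 + r) * n / BYTES_PER_MB


# Assumed sample values, picked so the result is a round number.
example = sort_size_mb(r=84, n=1_048_576)
print(example)  # 100.0
```

With these sample inputs the formula yields 100.0, the same magnitude as the 100 MB default quoted in the text; larger results would suggest raising the buffer size to avoid spilled records.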

Network traffic on the Map and Reduce sides can be tuned by writing small snippets in the map-reduce program that enable or disable the relevant options, and by setting the default replication factor to 1, 3, 5, or 7 nodes in the single-node and multi-node cluster configuration.

Insufficient parallel tasks [33], which leave resources idle, are handled by adjusting the number of Map and Reduce tasks and their memory. The default configuration allocates 2 map and 2 reduce tasks, 1 CPU vcore, and 1024 MB of memory. For example, with 8 CPU cores and 16 GB of RAM on the Node Managers, 4 Map and 2 Reduce tasks can be run with 1024 MB of memory allocated to each task, which leaves 2 CPU cores in reserve for other work.
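The allocation example above can be checked with a small capacity calculation. In the Python sketch below, `plan_tasks` and its parameter names are hypothetical and exist only for this illustration; the sketch assumes one vcore and 1024 MB per task, as in the text, and is not tied to any real YARN configuration property.

```python
def plan_tasks(total_cores, total_mem_mb, map_tasks, reduce_tasks,
               mem_per_task_mb=1024, vcores_per_task=1):
    """Return the spare cores and memory left after placing the tasks."""
    tasks = map_tasks + reduce_tasks
    used_cores = tasks * vcores_per_task
    used_mem_mb = tasks * mem_per_task_mb
    if used_cores > total_cores or used_mem_mb > total_mem_mb:
        raise ValueError("requested tasks exceed node capacity")
    return {
        "used_cores": used_cores,
        "spare_cores": total_cores - used_cores,
        "used_mem_mb": used_mem_mb,
        "spare_mem_mb": total_mem_mb - used_mem_mb,
    }


# Node from the text: 8 CPU cores, 16 GB RAM, 4 Map and 2 Reduce tasks.
plan = plan_tasks(total_cores=8, total_mem_mb=16 * 1024,
                  map_tasks=4, reduce_tasks=2)
print(plan["spare_cores"])   # 2
print(plan["spare_mem_mb"])  # 10240
```

The two spare cores match the reserve mentioned in the text; the sketch also shows that, at 1024 MB per task, memory is far from being the limiting resource on such a node.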

The Hadoop framework runs on the Java programming language and uses the map-reduce model to process data from a huge dataset warehouse for real-time applications. Complicated real-world analysis problems can be solved easily with Hadoop, which is open-source and low-cost. Although data warehouse engines work effectively, the speed of data retrieval is the major problem [34] in analytics. To improve the speed of data processing in big data analytics, the tuning parameters of Hadoop described above can be combined with recent algorithms and concepts such as Deep Learning, Machine Learning, Artificial Intelligence, Genetic Algorithms, Data Mining, Data Warehouse algorithms, and block-chain [35, 36]. The huge datasets of big data are what many companies must handle in real-world scenarios. Their main concern is to maintain these datasets with a low-cost server configuration while keeping consistency under control on time. Data retrieval from the data warehouse can thus be improved with the Hadoop framework once high throughput is achieved.

3 Map reduce programming model

Map Reduce is an important programming model used in the Hadoop framework. It accesses a high volume of data in parallel by dividing the whole job into individual tasks. The input file is processed by the map and reduce functions so that the output is smaller than the input, with compression [37]. After this process, the user or client gets exactly the files they expected from the large volume of datasets.

3.1 Importance of map-reduce

Map-reduce is used to access, in parallel, a huge dataset stored in HDFS. It plays a major role in increasing the velocity and reliability of the cluster. The latency and throughput of the entire system improve because the time taken to complete a job is reduced.

3.2 Phases of map-reduce

The map-reduce programming model works in multiple phases, because huge files are divided into independent tasks that run in parallel. Separate work is done at every stage of map reduce.

The Map-Reduce model works only on data stored in HDFS, because all operations in a Hadoop cluster are based on HDFS storage. The first step of Map Reduce is therefore to supply the input from the various sources through HDFS. According to the data size, a splitter divides the entire file into individual tasks. The record reader function converts the input text format into key-value pairs. The combiner takes care of matching keys and makes partitions on the HDFS disk based on the file size. The partitions are stored as intermediate data of the mapper function and passed as output to the next phase. Alignment, however, is a major problem that causes latency or throughput issues, so the key-value pairs of each partition are shuffled on the HDFS disk. The next important step in Map Reduce is sorting [40, 41] based on the keys from HDFS. The sorted values are generated for the next phase using index searching techniques. The reducer is the important final step, which combines all the values into an appropriate format.
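The partition, shuffle, and sort steps described above can be sketched in plain Python. This is a simplified, in-memory illustration, not Hadoop code: the function names are hypothetical, the sample pairs are invented, and a CRC32 checksum is used as a stand-in for the key hash so that the assignment of keys to reducers is repeatable between runs.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick a reducer for a key using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapped_pairs, num_reducers):
    """Group values by key inside each partition, then sort each by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        partitions[partition_for(key, num_reducers)][key].append(value)
    return [sorted(p.items()) for p in partitions]


# Assumed mapper output: (key, value) pairs in arbitrary order.
mapped = [("river", 1), ("car", 1), ("bear", 1), ("car", 1), ("deer", 1)]

for reducer_id, groups in enumerate(shuffle_and_sort(mapped, num_reducers=2)):
    print(reducer_id, groups)
```

Every occurrence of a key lands in the same partition, so a single reducer sees all values for that key, and within each partition the keys arrive in sorted order.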

3.3 Problems of map-reduce

Map Reduce is written in Java and runs on a Hadoop cluster. The cluster may be a single-node or a multi-node cluster with a master-slave architecture. The main task of Map Reduce is to extract data from a huge dataset within a stipulated time, but this is not achieved because of the size of the input files read from HDFS. The challenge in map-reduce is to reduce the whole volume of data to low-volume data in a compressed format, but the time needed to complete that process is very high. In other words, latency is high and throughput is low. Ordinary data extraction from the data warehouse is somewhat slower because of the patterns and algorithms used for processing [42].

3.4 Read/write operations in map reduce

Map Reduce runs with batch processing on the input of a Hadoop cluster, which means that once one input has been taken, the next input waits for the previous task to complete. This is the most important problem in Map Reduce, and it becomes visible with iterations [43]: once data has been read from HDFS, it is processed by the Map-Reduce phases and the output is written back to HDFS [44]. The next iteration takes its input from these previous writes on the HDFS disk. Likewise, if many iterations are run in Map-Reduce [45], the results are stored permanently in HDFS. If users require particular data from these results, they have to write queries in a Data Manipulation Language (DML). In this scenario, Map Reduce is not suited to many iterative operations, because batch processing takes its input only once. If many iterative operations (loops) [46–48] are run, Map Reduce is not apt for low-latency data processing, because each time the map-reduce model runs repetitive functions it may not complete the task in time. Latency is therefore high during data processing. Figure 7 shows the read and write operations of data sharing in Map Reduce.

Fig. 7

Data Sharing in Map Reduce.

3.5 Map reduce word count example

The best example of Map Reduce is a Java-based Word Count program on a Hadoop cluster. Initially, three sentences are taken as input and split into individual tasks (input splits). The mapping phase then takes each input split and converts it into keys and values, which means that each occurrence of a word is counted. The keys are shuffled and sorted alphabetically as the output of the mapper. The reducer collects those outputs and gives them as input to the combiner, which aligns the key-value pairs. Finally, the number of occurrences of each word across the three sentences is collected and returned to the client or user. The final output is a compressed form of the input data, which leads to data processing with low latency and high throughput. If the input file is small, on the order of KB, map-reduce finishes within a few seconds. If it is in the MB or GB range, more maps and reduces are needed to carry out the Map-reduce function [49, 50]. Based on this word count pattern, all files are handled by batch processing and Map Reduce operations. Figure 8 summarizes the word count example with three sentences.

Fig. 8

Word Count Example for Map Reduce.
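The word count flow can be imitated in a few lines of Python to make the phases concrete. This is an in-memory sketch of the idea, not the Java program that runs on a Hadoop cluster: the three sentences are assumed sample input (the sentences of Fig. 8 are not reproduced here), and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_sort(pairs):
    """Group the values of identical keys and order the keys alphabetically."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped}


# Assumed input: three sentences, one per input split.
sentences = ["deer bear river", "car car river", "deer car bear"]

mapped = [pair for sentence in sentences for pair in map_phase(sentence)]
counts = reduce_phase(shuffle_sort(mapped))
print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

Nine input words are reduced to four key-value pairs, which is the "compressed" output the text refers to; on a real cluster each sentence would be mapped on a different node and the grouped keys would travel over the network to the reducers.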

4 Map reduce versions (MRV)

In a Hadoop cluster, the Map Reduce function is carried out by the job tracker and the task trackers. The classic version, Map Reduce v1 (MRV1), works with these trackers, whereas the latest version, MRV2, runs on the YARN architecture, because YARN tracks the Map Reduce job at every stage [51]. Schedulers and queues are used to report the status of a given task. MRV1 deals only with the output, whereas MRV2 gives the status of the entire job. Figures 9 and 10 illustrate the advantages and disadvantages of the MR versions.

Fig. 9

Map Reduce Version1.

Fig. 10

Map Reduce Version2.

4.1 HADOOP map reduces performance tuning

Map Reduce performance depends on several factors of the Hadoop framework and its features. It is measured in terms of speed, latency, throughput, and the time taken to complete a task. Several other factors that arise during data transmission in the Hadoop cluster also affect Map Reduce [52]. They are:

Performance

Programming model & Domain

Configuration and automation

Trends

Memory

4.1 Performance

The way Hadoop and Map Reduce are initialized affects performance because of the techniques used throughout the data processing system. Hadoop 1.x returns only the output and cannot report the time needed to complete a task. Hadoop 2.x overcomes this issue and tracks the status of the job throughout the task. The latest version, Hadoop 3.x, provides the advanced MRV2 process, which gives a quick response over the network on the Hadoop cluster through its erasure coding techniques [52, 53]. The installed version of the Hadoop framework and Map Reduce is therefore a major factor in Map Reduce performance. Figure 11 shows these performance issues in Map Reduce.

Fig. 11

Performance issue 1.

Job scheduling is an important concept in the Hadoop cluster. Clients continuously submit jobs to the Hadoop framework, and deciding the order in which they are taken up for Map Reduce is a non-trivial process. Schedulers perform this work with the help of queues. Three main schedulers are available in Hadoop, namely FIFO, Capacity, and FAIR [54]. Coordination between the nodes of the Hadoop cluster, which disseminates the details of all nodes, is another main factor in tuning the Map Reduce function. While the resource manager handles a variety of jobs sequentially, the status of each job is tracked and sent to YARN for monitoring. Because of this coordination, any job that has to be killed or deleted during processing is controlled by YARN. Most data processing models take a single input for processing, whereas here both inputs are merged by a tagging method for easy access to huge data sets. Figure 12 shows further performance issues in Map Reduce.

Fig. 12

Performance issue.
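The difference between the FIFO and FAIR schedulers named above can be illustrated with a toy simulation. This is a deliberately simplified sketch that assumes a single processing slot and equal shares; it does not model the actual Hadoop schedulers or the Capacity scheduler, and the job names and work units are hypothetical.

```python
def fifo_completion(jobs):
    """Run jobs one after another in arrival order; return finish time per job."""
    clock, finish = 0, {}
    for name, work in jobs:
        clock += work
        finish[name] = clock
    return finish

def fair_completion(jobs):
    """Give every unfinished job one unit of work per round (equal share)."""
    remaining = dict(jobs)
    clock, finish = 0, {}
    while remaining:
        for name in list(remaining):
            clock += 1
            remaining[name] -= 1
            if remaining[name] == 0:
                finish[name] = clock
                del remaining[name]
    return finish

# Hypothetical workload: a long job arrives just before a short one.
jobs = [("long_job", 6), ("short_job", 2)]

fifo = fifo_completion(jobs)
fair = fair_completion(jobs)
print("FIFO:", fifo)
print("FAIR:", fair)
```

Under FIFO the short job waits behind the long one, whereas under equal sharing it finishes earlier while the long job finishes at the same time in this particular workload.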

4.2 Programming model and domain

Writing good map and reduce functions is essential for users. Hadoop supports various programming languages for Map Reduce operations, and each language is platform dependent or platform independent according to its characteristics. Some of the languages and formats supported by the Hadoop ecosystem are SQL, NoSQL, Java, Python, Scala, and JSON [55]. Each has its own set of properties for operations such as join and cross operations on the dataset. These languages support iterative and incremental computations among the Hadoop nodes for accessing distributed databases in parallel. However, a large number of iterations will degrade Map Reduce performance. Figures 13 and 14 show the issues of programming models.

Fig. 13

Programming Model issue 1.

Fig. 14

Programming Model issue 2.

4.3 Configuration and automation

Balancing the workload between the nodes automatically with a load balancer, and adequately controlling the data flow shared among the nodes, is a big challenge. If this self-tuning fails, Map Reduce gives poor output for the task. Disk I/O is a major drawback of Hadoop MR when data is accessed regularly, so it should be minimized. Performance varies with the size of the input data and the method used for splitting it. Using fewer reducers may increase MR performance. Code written in a specific language that supports static code generation [56], together with index creation on Map Reduce, increases performance. Sometimes the chosen language does not adapt to changes made by the client in the system. The entire system should follow data optimization principles to provide better Map Reduce performance.

4.4 Trends

Data warehouse data are accessed by a database engine on Map Reduce. Because the data size is very large, extracting a small amount of data through that engine is difficult, and the time taken to complete the process is very high. Processing in memory instead of on disk reduces disk I/O and improves MR performance. Indexing [57] is a traditional database technique used to search for elements in the databases or files held on the nodes. It returns the extracted data to the user very quickly and does not depend on file size, since the same technique is applied to each file. Memory caching [58] between the nodes is very important for improving MR performance. The cache records the status of every job as well as the level of the previous computation. Caching also helps to identify the location of data on a node through the memory allocated by the jobs. Even if nodes or jobs are cancelled because of some issue, the next job or node becomes active and starts processing over the network without waiting for manual intervention. The resources required for the MR process can be verified by the resource manager before job allocation starts.

4.5 Memory

The Map Reduce function depends entirely on the number of maps and reducers used for each task in the Hadoop cluster. If this number increases suddenly, the system slows down considerably in terms of the time taken to complete the task.

Calculation of number of maps

The number of maps assigned to each job submitted by a client is calculated from the size of the input file [59] and the blocks allocated for accessing the data. The following formula gives the number of maps required for a Map Reduce operation.

$\text{Number of maps} = \dfrac{\text{Total size of the input file}}{\text{Block size}}$ (1)

By default, between 10 and 100 maps per node are assigned to a job, and a maximum of 300 maps can be allocated for a Map Reduce job. For example, with an input file of 10 TB and the 128 MB block size used by Hadoop 2.x, approximately 10 TB / 128 MB = 82,000 maps are assigned to complete the job.
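The arithmetic of this example can be checked with a short sketch. It assumes binary units (1 TB = 1024^4 bytes, 1 MB = 1024^2 bytes) and one map task per block; the function name is hypothetical.

```python
def number_of_maps(input_size_bytes, block_size_bytes):
    """One map task per block: input size divided by block size, rounded up."""
    return (input_size_bytes + block_size_bytes - 1) // block_size_bytes

TB = 1024 ** 4
MB = 1024 ** 2

maps = number_of_maps(10 * TB, 128 * MB)
print(maps)  # 81920, which the text rounds to about 82,000
```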

Calculation of number of reduces

By default, one reducer is allocated to every Map Reduce job. If more reducers are needed for large processes, the number can be changed in the configuration file during or after installation, or by using speculative tasks. The following formula gives the number of reducers normally required for a Map Reduce operation.

$\text{Number of reducers} = (0.95 \text{ or } 1.35) \times \text{number of nodes}$ (2)

Skipping bad records

Bad records created during the Map Reduce process can be skipped through settings in the configuration files, by setting the corresponding option to true or false. For example, if the word count Map Reduce program written in Java requires case-sensitive output, passing –DwordCount.case.sensitive=true/false at run time gives better performance than the previous run [59], because the bad records are eliminated by such options.

Task execution & environment

The task tracker in each data node keeps track of all information about its jobs and sends it to the YARN Resource Manager. These operations are limited by the memory allocated to map and reduce tasks for execution. The command –Djava.library.path=< -Xm512M/-Xm1024M executes the Map Reduce environment [60] within that memory limit. Tables 6 and 7 provide details of Map Reduce implementation methods and their applications.

Table 6

Map Reduce Implementations

Map Reduce Implement Methods	Advantages	Disadvantages
Google Map Reduce	Multiple data blocks on different nodes provide fault tolerance	Batch processing-based architecture is not suitable for real-time applications
Hadoop	High scalability	Cluster maintenance is difficult.
Grid Grain	Subtask distribution and load balancing	Does not support non-java applications
Mars	Massive thread parallelism on GPU	Not suited to atomic operations because they are expensive
Tiled-Map Reduce	Convergence and Generalization	Cost is high
Phoenix	Multicore CPU	Scalability is less
Twister	Tools are used effectively	Not possible to break huge data set

Table 7

Map Reduce Applications

Map Reduce Applications	Pros	Cons
Distributed Grep	Data analysis is generic	Less response time
Word Count	Massive document collection of occurrences	Limited only
Tera Sort	Load balancing	transparency
Inverted Index	Collection of unique posting list	Lots of pairs in shuffling &sorting
Term Vector	Host analysis search	Sequential tasks
Random Forest	Scalability is high	Low
Extreme Learning Machine	union and simplification	Uncertainty
Spark	Data fit in memory	Huge memory needed
Algorithms	Data exhaustive applications	Time uncontrollable
DNA Fragment	Parallel Algorithm	Large memory
Mobile sensor data	Extracting data is easy	Difficult to implement
Social Networks	Quick response	Need more techniques for analysis

5 Map reduce job optimization techniques

A Map Reduce job allocated by the Hadoop resource manager processes data faster and more accurately when the cluster is configured properly and the map and reduce tasks are allocated according to the type of input data. LZO compression helps to reduce the input file size, but a combiner between the mapper and the reducer is also necessary to optimize Map Reduce job performance [61]. Much of the code and data can be reused to avoid the time spent searching for data locations across the cluster.
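The benefit of a combiner between the mapper and the reducer can be seen by counting the key-value pairs that would have to be shuffled. The following is a minimal plain-Python sketch, not Hadoop code; the input splits and function names are assumptions.

```python
from collections import Counter

def map_words(split):
    """Mapper: one (word, 1) pair per word in the split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Combiner: pre-aggregate the pairs of one mapper before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def reduce_counts(pairs):
    """Reducer: sum all values per key after the shuffle."""
    return dict(combine(pairs))

# Hypothetical input splits, one per mapper.
splits = ["car car car river", "car river river deer"]

without_combiner = [pair for s in splits for pair in map_words(s)]
with_combiner = [pair for s in splits for pair in combine(map_words(s))]

print(len(without_combiner), len(with_combiner))
print(reduce_counts(with_combiner))
```

Fewer pairs cross the network when the combiner is used, while the reducer produces the same final counts.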

Several other aspects of the Map Reduce programming model also help to improve job performance in the Hadoop framework. These factors describe the flow of jobs from the resource manager to the data nodes and how data can deviate from that flow at run time. Once these factors are addressed, even a big job running on the Hadoop cluster produces output with low latency. Figure 15 lists the factors for job optimization.

Fig. 15

Data colocation.

•Operator pipelining

Operator pipelining is mainly used in Map Reduce for aggregating databases: it filters the data and performs operations such as grouping, sorting, and converting the output of one operator into the form required by another [62, 63]. Pipelining connects two jobs so that they run simultaneously and complete within the allotted time. The drawback is that the database may stay locked or tied up for longer while reading or writing in response to a user request, so iterative operations are used at that point to improve performance during pipelined execution.

•Approximate results

The result of a Map Reduce job is approximate in terms of size, time, and accuracy. Even when performance is increased at run time, the output cannot be predicted. A file in any input format can be taken as input and will produce Map Reduce output, but in such cases the output may not be accurate or reliable.

•Indexing and sorting

Since Map Reduce works with key-value pairs, ordering the jobs is complicated for the resource manager, and allocating tasks to data nodes can quickly cause conflicts [57, 64]. Indexing techniques are therefore used during job execution to search for elements by the key values stored in an index table. The table contains the key values of every independent mapper task and passes the exact data to the combiner for merging. Merging is itself complicated because the values can arrive in any order, so a sorting step is applied in between, which lets the reducer produce its output effectively.

•Work sharing

Map Reduce is designed to handle multiple jobs in parallel. If multiple jobs run simultaneously, it is recommended to share the work among individual maps [65]. This is done by the splitter in the Map Reduce function, and sharing the work in this way decreases the time taken to complete the job.

•Data reuse

Data read from HDFS storage for a Map Reduce function can be reused for subsequent changes to the same input file. Reusability [66], in the form of inheritance, also reduces the number of lines of code in a program.

•Skew mitigation

Skew is a major issue in Map Reduce, and different mitigation techniques are used to avoid unnecessary data transmission. Classical skew problems were solved with skew-resilient operators. In a big data environment, skew can be mitigated by repartitioning, following three main principles. First, minimizing the number of times any task is repartitioned reduces the repartitioning overhead. Second, minimizing the side effects of repartitioning while handling straggling tasks keeps the mitigation transparent. Third, avoiding unnecessary recomputations minimizes the total overhead of skew mitigation [27, 67].

•Data colocation

Data colocation is a newer concept in which related files are placed on the same set of nodes, based on a locator attribute among the file characteristics. When a new file is created, its locator, the list of data nodes, and the number of files with the same locator are identified, and all such input files are stored on the same set of nodes automatically [17, 68]. This improves Map Reduce performance by avoiding duplication and repetition [69, 70] of files in the Hadoop cluster. Figure 15 shows an example of data colocation in the Hadoop cluster.
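To make the skew problem from the list above concrete, the sketch below shows how one frequent key overloads a single reducer partition, and how the load spreads once the key is split. Note that it uses key salting, a commonly used technique swapped in here for illustration; it is not the repartitioning method of [27, 67]. The CRC32-based partitioner, the data, and all names are assumptions.

```python
import zlib
from collections import Counter

def partition_of(key, num_partitions):
    """Deterministic hash partitioner (CRC32 rather than Python's randomized hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def salted_key(key, record_index, num_salts):
    """Append a rotating salt so that one hot key spreads over several partitions."""
    return f"{key}#{record_index % num_salts}"

def merge_salted(partial_counts):
    """Second aggregation stage: strip the salt and add up the partial counts."""
    totals = Counter()
    for salted, count in partial_counts.items():
        totals[salted.rsplit("#", 1)[0]] += count
    return dict(totals)

# Hypothetical skewed input: 90 of 100 records share the key "hot".
keys = ["hot"] * 90 + ["a", "b", "c", "d", "e"] * 2
num_partitions = 4

plain_load = Counter(partition_of(k, num_partitions) for k in keys)
salted = [salted_key(k, i, 8) for i, k in enumerate(keys)]
salted_load = Counter(partition_of(k, num_partitions) for k in salted)
merged = merge_salted(Counter(salted))

print("records per partition without salting:", dict(plain_load))
print("records per partition with salting:   ", dict(salted_load))
```

The price of salting is the extra merge step, since each reducer now holds only a partial count for the hot key.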

6 Map reduction using java and python

The Map Reduce function can be written in Java or in other high-level languages, and performance varies with the features of the selected language. Table 8 lists the differences between Java and Python for writing Map Reduce programs.

Table 8
Map Reduce written in java and python differences

Features	Java	Python
File size Handling	<1 GB is easy	>1 GB is easy
Library Files	All in JAR format	Separate library files
File Extension	.java	.py
Method of calling	Main	No main method
Data collection	Arrays, Index	List, set, dictionary, tuples
Object oriented	Required	Optional
Case Sensitive	Required	Optional
Compilation	Easy in all platform	Easy in Linux
Productivity	Less	More
Applications	Desktop, Mobile, Web	Analytics, Mathematical,Calculations
Type of files	Batch processing, embedded application	Real time processing files also
Functions	Return 0 &1 is used	Dict is used for return
Programming concepts	Dynamic less	Cannot push threads of single processor to another
Syntax	Specific types	Simple only
Basic programming	C,c++ basics(oops)	Higher end concepts like ML
Number of codes	High	Less code size
Input data format	Streaming with STDIN,STDOUT by binary not text	Both binary and text
Areas Working	Architecture, tester, developer, administrator	Analytics, manipulation, retrieval, visual reports, AI, Neural Networks
Speed	25 times greater than python	Low due to interpreter
Execution Time	High because of code length	Easy
Typing	Static	Dynamic
Verbose Syntax	Low	Normal
Frameworks	Spring, Blade	Django, Flask
Gaming	jMonkeyEngine	Panda3D, Cocos
Ml Libraries	Weka, Mallet	Tensorflow, pytorch

7 Spark framework

Apache Spark is an open-source framework for distributed cluster computing. It works as a data processing engine and is faster than Hadoop Map Reduce for data analytics. Although Hadoop provides big data analytics effectively, it has some drawbacks [70] and limiting factors, which were discussed in Section 4.

I. Spark features

In-memory Processing: Data and intermediate results are kept in memory rather than being moved to and from disk, which saves time. Spark therefore works faster than Hadoop, approximately 100 times faster than Map Reduce on Hadoop, because of its use of memory.

Stream Processing: Spark supports stream processing, in which input and output data are accessed continuously. It is mainly used for data processing in real-time applications.

Latency: The Resilient Distributed Dataset (RDD) caches data in the memory of the nodes on the cluster. RDD manages logical partitions for distributed data processing and conversion of data formats, and it is where Spark performs most operations, such as transforming and managing the data. These logical partitions [71] can be manipulated on the Hadoop cluster.

Lazy Evaluation: Computation is carried out only when a result is actually needed by the application; otherwise it remains idle.

Fewer Lines of Code: SPARK uses the SCALA language, which processes data with fewer lines of code than Hadoop.
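Lazy evaluation, one of the features listed above, can be imitated in plain Python. The class below is a toy stand-in written for this illustration and is not the Spark API: transformations are only recorded, and nothing runs until an action (here `collect`) asks for the result. All names are hypothetical.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def collect(self):
        """Action: only now is the recorded pipeline actually run."""
        data = iter(self.source)
        for kind, fn in self.steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

calls = []

def square(x):
    calls.append(x)  # record when the function really runs
    return x * x

pipeline = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # nothing has executed yet
result = pipeline.collect()
print(calls_before_action, result, len(calls))
```

Building the pipeline costs nothing; the work is done once, when the result is requested.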

Figures 16 and 17 explain the working principles of Hadoop Map Reduce and the Spark engine.

Fig. 16

Working of Hadoop Map Reduce.

Fig. 17

Working of Spark.

II. Real world scenarios of SPARK

Many companies generate terabytes of data through human- and machine-driven applications. Apache Spark is used to improve a company's business insights [72]. A few examples of companies using SPARK in real-world applications follow.

E-commerce: To improve consumer satisfaction in a competitive market, several companies have implemented SPARK. They are:

A. eBay: SPARK uses real-time data to generate discounts and offers for online purchases and other transactions. It keeps the data updated and consistent every second, which strengthens customer relationships based on their feedback.

B. Alibaba: Alibaba uses SPARK as an implementation tool for big data analysis and image data extraction. The jobs are run on a large graph to derive results.

Healthcare: MyFitnessPal, which promotes a healthier lifestyle through diet, uses SPARK in-memory processing to scan the food calorie data of about 100 million users and assess the quality of the food system.

Media and Entertainment: Netflix uses Apache Spark for video streaming, to monitor its users and make comparisons with the shows they have watched earlier.

III. HADOOP AND SPARK SIMILARITIES

Spark can run with Hadoop in stand-alone mode, on Mesos, and in the cloud.

Machine Learning algorithms can be executed faster in memory using Spark's MLlib, providing solutions that Hadoop Map Reduce cannot deliver easily [73].

Cluster administration and data management can be done by combining SPARK and Hadoop, because SPARK does not have its own Distributed File System (DFS).

Hadoop provides enhanced security for workloads, while Spark can be deployed on the available resources anywhere in a cluster, so there is no manual allocation and tracking of individual tasks. Because of these features, SPARK is widely used by big companies and industries that work on real-world applications.

IV. WORKING ON SPARK VS. HADOOP

The Hadoop framework follows a master-slave architecture, in which the name node and data nodes work on the replication principle. The output of each step in Hadoop is continuously stored in the HDFS cluster. If a client needs to retrieve data, it is therefore easy to extract in Hadoop, because the framework replicates the output data of every job on the HDFS cluster disks.

Spark is a distributed cluster framework whose processing engine handles data in the memory of the nodes. Because SPARK uses in-memory analytics, the output of each step is kept in the memory of the nodes for clients, which consumes a lot of memory. One big advantage of SPARK is that it serves real-time applications. It is mainly used for streaming, that is, for processing data generated online, where plenty of data is produced every second. Maintaining this heavy storage requires a capable access engine or machine, so SPARK uses many memory units across the nodes on the network path. The time to complete a job is also much shorter with SPARK [74]. Figure 18 shows the difference between the working of HDFS and SPARK.

Fig. 18

Working difference between SPARK and Hadoop.

7.1 Spark architecture

In general, the SPARK framework is used to access real-time data with its in-memory analytics over a large network without delay or congestion. A typical SPARK architecture consists of a driver program, written in the SCALA language [75], that controls all the worker nodes.

The cluster manager monitors all this work and sits between the worker nodes and the driver program. The Spark context is a small program written only for processing data on the nodes; the main difference lies in the memory storage part. Each worker node holds the tasks assigned by the cluster manager, together with an executor module. Once the program is launched by the cluster manager, the executor module in the worker node reads the input data from HDFS and immediately stores the output in memory. If the client wants the intermediate data at any step of the execution, it can be retrieved from that memory. Figure 19 shows the architecture of SPARK.

Fig. 19

Architecture of SPARK.

7.2 Designing of spark file system

In the Hadoop Distributed File System, Map Reduce processes data with mapper functions running on the nodes of the cluster. The splitter divides the input file into a number of tasks, and each task works independently in the Map Reduce operation. Every mapper output is collected as key-value pairs and stored in a circular buffer [76–78] for alignment; the whole files are then stored in HDFS by partition. The partitions R1, R2, and R3 hold different mapper outputs from the circular buffer, arranged like an array according to index value, so when multiple jobs arrive, HDFS stores the output continuously without any drawback. In its initial design, SPARK reads the input data, performs the mapper function, and immediately stores the output in a separate partition, like a queue. When the client requires the output of a particular step, it can be collected directly from that partition. In Fig. 20, R1, R2, and R3 are the partitions that collect and store the mapper output. In the shuffle section, c1 is a core that denotes mapper 1, and c2, c3, and c4 denote the other mappers. When a shuffle happens in SPARK, the output of a particular mapper is stored on core 1. Since an individual CPU node contains 4 cores [79] by default, the outputs of the other mappers are stored on the free cores assigned to those mappers. Figure 20 explains the shuffles used in Hadoop and SPARK, respectively.

Fig. 20

SPARK file system.
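How mapper output ends up in the partitions R1, R2, and R3 can be sketched with a simple hash partitioner. This is an illustrative assumption about the partitioning rule (a CRC32 hash of the key modulo the number of partitions), not the exact scheme of Hadoop or SPARK; the sample pairs and names are hypothetical.

```python
import zlib

def partition_for(key, num_partitions):
    """Pick a partition from a deterministic hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_output(pairs, num_partitions):
    """Distribute mapper key-value pairs over the reducer partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions

# Hypothetical mapper output.
mapper_output = [("car", 1), ("deer", 1), ("car", 1), ("river", 1), ("deer", 1)]

r1, r2, r3 = partition_output(mapper_output, 3)
for name, part in (("R1", r1), ("R2", r2), ("R3", r3)):
    print(name, part)
```

Because the partition depends only on the key, all pairs with the same key land in the same partition, which is what allows a single reducer to aggregate them.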

The features of the big data ecosystem tools are listed in Table 9. There are plenty of differences between Hadoop and SPARK. The experimental results for multi-node clusters are shown in Table 10.

Table 9

Specifications of all tools

Features	Hadoop	SPARK	Flink	Storm	Kafka	Samza
Performance	Slower	100 times Faster than Hadoop	Closed loop Iteration.	Fast	Fast	Fast
Language Support	Java, Python	Scala, Python and R	Java and provides API in Scala, Python	All languages	Best with Java and works with all languages	JVM languages
Processing	Batch	Stream &Batch	Single Stream	Native Stream	Native Stream	Native Stream
Latency	High(min)	Low(sec)	Low(sub sec)	Very low(ms)	Low (1-2 sec)	Low (less than sec)
Security	Kerberos and ACL	Low secured using only passwords	Kerberos	Kerberos	TLS, ACL, Kerberos, SASL	No security
Fault Tolerance	High	Less	snap shot method	High	High	High
Scalability	Large 14000 nodes	High 8000 nodes	High 1000 nodes	High	Average	Average

Table 10

Experimental results of multi-node cluster

Parameters	Hadoop Records	SPARK Records	FLINK Records
Data Size	102.5TB	100TB	>100TB
Elapsed Time	72mins	23 min	>23mins
Nodes	2100	206	190
Cores	50400 physical	6592 virtualized	6080 virtualized
Throughput in cluster	3150GB/sec	618 GB/sec	570 GB/sec
Network	10Gbps	EC2	>10Gbps
Sort Rate	1.42TB/min	4.27 TB/min	4.27 TB/min
Sort rate/node	0.67 GB/min	20.7 GB/min	22.5 GB/min
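The per-node figures in the last row of Table 10 follow from the cluster sort rate and the number of nodes. The sketch below recomputes them from the table values, assuming decimal units (1 TB = 1000 GB); the function and variable names are hypothetical.

```python
def sort_rate_per_node(sort_rate_tb_per_min, nodes):
    """Convert a cluster-wide rate in TB/min to GB/min per node (1 TB = 1000 GB)."""
    return sort_rate_tb_per_min * 1000 / nodes

# (sort rate in TB/min, number of nodes), taken from Table 10.
records = {"Hadoop": (1.42, 2100), "SPARK": (4.27, 206), "FLINK": (4.27, 190)}

per_node = {
    name: round(sort_rate_per_node(rate, nodes), 2)
    for name, (rate, nodes) in records.items()
}
print(per_node)
```

The recomputed values agree with the table to within rounding, which shows that the much higher per-node rates of SPARK and FLINK come from achieving a higher cluster rate on roughly a tenth of the nodes.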

7.3 HADOOP VS. SPARK

There are plenty of technical differences between Hadoop and SPARK. Based on these differences, anyone can decide which framework is better suited to processing their big data. These technical differences are also useful to people who plan to launch a start-up company and need to select a framework that meets their requirements in all aspects. Table 11 summarizes the features of both Hadoop and SPARK.

Table 11
Hadoop Vs SPARK

Features	Hadoop	SPARK
File Processing Method	Batch processing	Batch/Real Time/iterative/graph Processing
Programming Language	Java, Python	Scala
Data Storage type	Scale-out	Data Lake or Pool
Programming Model	Map reduce	In Memory processing
Job Scheduler	Externally	Not required
Cost	Low	High
RAM Usage	Less	Lot of RAMs
Memory Type	Single memory	Execution &Storage memory Separately
Data Size	Up to GB is fine	PB is fine
Latency	High	Low latency
Data taken as input	Text, images, videos	RDD(Resilient Distributed Dataset)
Disk Type	HDD (Hard Disk Drive)	SSD (Solid State Drive)
N/w Performance	Low	High
Speed rate	<3x	<3x. 1/10 nodes
Algorithm by default	Divide and conquer	ALS (Alternating Least Squares)
Data Location details	Index Table	Abstraction using MLlib
Data Hiding	Low	High using function calls
Dataset size	Small set	Huge set > TB
Shuffle speed	Low	High
Storage memory of mapper output	Directly in Disk	RAM to Disk
Containers Usage	Releases after every map	Release only after the entire job completion
Dynamic Allocation	Not possible	Possible but hectic
Replications	1,3,5 nodes	Pipelines
Delay	High due to assign JVM for each task	Low due to quick launch
Mechanism for message passing	Parsing and JAR files	Remote Procedure Call (RPC)
Time Taken to complete job	Minutes because of small data set	Hours for big data set.
Allocating Memory	Erasure Coding	DAG (Directed Acyclic Graph)
Data Input method	Hadoop Streaming	SPARK Streaming
Data conversion formats	Text to binary	All forms
Job Memory	Large	Low
Input Memory	Less	High
Processing type	Parallel and distributed	Parallel and distributed
Data Extraction	Disk Based	Memory Based
I/O Processing	Disk	RAM
Resources Usage	More	Less
Data status	Stateless	State
Iterative Process	Not Taken	Taken
Caching	Doesn’t support	Support in RAM
R/W to HDFS	YARN Cluster	SPARK Engine
Tools supported	Pig,Hive,HBase	ALL in one
Accessibility	Command User Interface (CUI)	Graphical User Interface(GUI)
Traceability	Easy by YARN	Not possible
Fault Tolerance	High	Low
Security	High(tracking)	Low (no tracking)
Storage Architecture	Distributed	Not distributed
Data taken slot from resources	Only one slot	Any slots(real time)
Time Lag	Yes	No
Program Written	Map Reduce	Driver Program
Controller	YARN	Cluster Manager
Partition Type	Single partition for all map outputs	Separate partition for every map output
Companies Used	Industries and companies that do not need real-time data analytics: Cloudera, Hortonworks, IBM, British Airways, Facebook, Twitter, LinkedIn	Companies that need real-time data processing: Yahoo, eBay, Alibaba, Netflix, Oracle, Cisco, Verizon, Microsoft, Databricks and Amazon

8 Author’s contribution

To summarize the contribution for this paper, the authors are explained the challenges and limitations faced in modern tools like Hadoop and SPARK for data processing as following points:

I. Authors were taken various techniques from many research papers on the topic of tuning the performance of the databases while scalability is increased, and all papers are discussed about the data extraction techniques from the huge repositories with low latency and high accuracy over large networks.

II. The authors reviewed the Hadoop versions and their features for extracting data from repositories, as well as the features of SPARK with its latest techniques. The paper gives a detailed review of the advantages and disadvantages to weigh when selecting a tool for extraction.

III. The authors have suggested ways to improve the performance of data extraction from repositories and have discussed the difficulties faced by previous methods. Even though modern tools are used for data extraction, writing a MapReduce program in Hadoop with a recent algorithm is a challenging task. SPARK is an advanced tool, but the cost of using it is prohibitive for small-scale companies. The authors have given suggestions to improve performance in both tools.
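
To make the structure of a MapReduce program concrete, the following is a minimal single-process Python sketch of the three stages (map, shuffle, reduce) applied to word counting. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, between the map and reduce stages."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big tools"]))
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network, which is where much of the tuning effort discussed in this paper applies.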

9 Conclusion and future scope

Big data analytics is an important technology in this era, used to access huge datasets in parallel in a distributed cluster environment. Every software company decides how to deploy its software and hardware frameworks based on the requirements of the client or user. Many start-up companies are also unsure which infrastructure to build. This paper helps companies and researchers select a framework for data processing quickly. The basic factors of data processing projects, such as speed and cost, are considered in all situations. The technologies and examples discussed above give a transparent view of the infrastructure that big companies use to deal with real-world problems effectively. A million-dollar question in the software industry is whether real-world problems can be solved only by big industries or by those who are ready to invest more money. However, different industries also weigh other factors in the same situation. The main challenge is deriving data from large datasets with fewer resources. This paper covers the points that improve the data processing velocity of big data analytics in the well-known frameworks Hadoop and SPARK. The data generated day by day in the real world can be handled by various recent analytics algorithms, and processing such huge volumes becomes possible by tuning already existing methods or trends. Proper analysis and the capacity to identify research problems are needed to implement innovative solutions for real-world problems. Finally, when users want to solve a problem with big data analytics, Hadoop and SPARK are the main frameworks available, but they have to choose the better one according to their requirements.
For example, if a client wants to start a company with low investment that must solve a complex big data problem, Hadoop is the best choice because of the cost and type of data. If the same company needs to handle real-world application data and is ready to invest heavily, SPARK is the best tool for them. On technical aspects such as algorithm and methodology, both tools use some common techniques, but the final decision should be based on cost and the type of data handled. In summary, for those handling big data analytics, Hadoop MapReduce is suitable for low-cost batch processing, whereas SPARK is a higher-cost tool apt for real-time processing.

There are plenty of tools available for handling big data in the IT world, but only a few are popular among companies and industries, because of their user-friendliness or cost. Hadoop and SPARK are widely used for high-speed data processing for various reasons. How long will these tools continue to lead with their updated versions and techniques? Newer Apache tools such as FLUME, FLINK, and Kafka [80] are also available for both batch and real-time processing in big data analytics; only the techniques differ from tool to tool. FLUME is used to collect logs and events from different sources and store them in HDFS with high throughput and low latency. Apache FLINK processes huge datasets in a single runtime, for both streaming and batch workloads, with closed-loop operations, so the time to complete tasks is very low and identifying a corrupted part of the data is easy. Apache Kafka is a modern tool used to handle feeds with high throughput and low latency, for example in social media. In short, plenty of tools are used in big data analytics to handle huge volumes of data with different mechanisms and approaches. Users have to decide very carefully how to access and protect their data in the big data analytics world. This paper has covered the challenges and limitations of big data analytics tools and provides solutions to handle those problems in a systematic way.

References

1. Katal, Wazid and Goudar R.H., Big data: issues, challenges, tools and good practices. In 2013 Sixth International Conference on Contemporary Computing (IC3) (2013, August), (pp. 404–409). IEEE.

2. Khan, Yaqoob, Hashem I.A.T., Inayat, Ali, Kamaleldin... and Gani, Big data: survey, technologies, opportunities, and challenges, The Scientific World Journal 2014 (2014).

3. Elgendy and Elragal, Big data analytics: a literature review paper. In Industrial Conference on Data Mining (2014, July), (pp. 214–227). Springer, Cham.

4. Tsai C.W., Lai C.F., Chao H.C. and Vasilakos A.V., Big data analytics: a survey, Journal of Big Data 2(1) (2015), 21.

5. Weets J.F., Kakhani M.K. and Kumar, Limitations and challenges of HDFS and MapReduce. In 2015 International Conference on Green Computing and Internet of Things (ICGCIoT) (2015, October), (pp. 545–549). IEEE.

6. …, Wang, Que and Xu, Virtual shuffling for efficient data movement in MapReduce, IEEE Transactions on Computers 64(2) (2013), 556–568.

7. Acharjya D.P. and Ahmed, A survey on big data analytics: challenges, open research issues and tools, International Journal of Advanced Computer Science and Applications 7(2) (2016), 511–518.

8. …, Big privacy: Challenges and opportunities of privacy study in the age of big data, IEEE Access 4 (2016), 2751–2763.

9. Wani M.A. and Jabin, Big data: issues, challenges, and techniques in business intelligence. In Big Data Analytics (2018), (pp. 613–628). Springer, Singapore.

10. Oussous, Benjelloun F.Z., Lahcen A.A. and Belfkih, Big Data technologies: A survey, Journal of King Saud University-Computer and Information Sciences 30(4) (2018), 431–448.

11. Khan, Alsaqer, Shah, Badsha, Abbasi A.A. and Salehian, The 10 Vs, issues and challenges of big data. In Proceedings of the 2018 International Conference on Big Data and Education (2018, March), (pp. 52–56).

12. Kaisler, Armour, Espinosa J.A. and Money, Big data: Issues and challenges moving forward. In 2013 46th Hawaii International Conference on System Sciences (2013, January), (pp. 995–1004). IEEE.

13. Kaisler, Armour, Money and Espinosa J.A., Big data issues and challenges. In Encyclopedia of Information Science and Technology, Third Edition (2015), (pp. 363–370). IGI Global.

14. Che, Safran and Peng, From big data to big data mining: challenges, issues, and opportunities. In International Conference on Database Systems for Advanced Applications (2013, April), (pp. 1–15). Springer, Berlin, Heidelberg.

15. O’Driscoll, Daugelaite and Sleator R.D., ‘Big data’, Hadoop and cloud computing in genomics, Journal of Biomedical Informatics 46(5) (2013), 774–781.

16. Demchenko, Ngo and Membrey, Architecture framework and components for the big data ecosystem, Journal of System and Network Engineering 4(7) (2013), 1–31.

17. Fahad, Alshatri, Tari, Alamri, Khalil I.Z. and Bouras, A survey of clustering algorithms for big data: Taxonomy and empirical analysis, IEEE Transactions on Emerging Topics in Computing 2(3) (2014), 267–279.

18. Arfat, Usman, Mehmood and Katib, Big Data for Smart Infrastructure Design: Opportunities and Challenges. In Smart Infrastructure and Applications (2020), (pp. 491–518). Springer, Cham.

19. Ahsaan S.U., Kaur and Naaz, An Empirical Study of Big Data: Opportunities, Challenges and Technologies. In New Paradigm in Decision Science and Management (2020), (pp. 49–65). Springer, Singapore.

20. Mohamed, Najafabadi M.K., Wah Y.B., Zaman E.A.K. and Maskat, The state of the art and taxonomy of big data analytics: view from new big data framework, Artificial Intelligence Review 53(2) (2020), 989–1037.

21. Arfat, Usman, Mehmood and Katib, Big Data Tools, Technologies, and Applications: A Survey. In Smart Infrastructure and Applications (2020), (pp. 453–490). Springer, Cham.

22. Kumari P.L.S., Big Data: Challenges and Solutions. In Security, Privacy, and Forensics Issues in Big Data (2020), (pp. 24–65). IGI Global.

23. Jaiswal, Dwivedi V.K. and Yadav O.P., Big Data and its Analyzing Tools: A Perspective. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (2020, March), (pp. 560–565). IEEE.

24. Sharma, Singh and Rehman, A Review of Big Data Challenges and Preserving Privacy in Big Data. In Advances in Data and Information Sciences (2020), (pp. 57–65). Springer, Singapore.

25. Riaz, Ashraf M.U. and Siddiq, A Comparative Study of Big Data Tools and Deployment Platforms. In 2020 International Conference on Engineering and Emerging Technologies (ICEET) (2020, February), (pp. 1–6). IEEE.

26. Gupta N.K. and Rohil M.K., Big Data Security Challenges and Preventive Solutions. In Data Management, Analytics and Innovation (2020), (pp. 285–299). Springer, Singapore.

27. Tayal D.K. and Meena, A new MapReduce solution for associative classification to handle scalability and skewness in vertical data structure, Future Generation Computer Systems 103 (2020), 44–57.

28. Abimbola, Sanga and Mongia, Hadoop Framework Ecosystem: Ant Solution to an Elephantic Data (2019). Available at SSRN 3463635.

29. Kashyap, Big Data Analytics Challenges and Solutions. In Big Data Analytics for Intelligent Healthcare Management (2019), (pp. 19–41). Academic Press.

30. Ahsaan S.U., Kaur and Naaz, An Empirical Study of Big Data: Opportunities, Challenges and Technologies. In New Paradigm in Decision Science and Management (2020), (pp. 49–65). Springer, Singapore.

31. Kaur P.L.S.K. and Bharti, A Survey on Big Data—Its Challenges and Solution from Vendors. In Big Data Processing Using Spark in Cloud (2019), (pp. 1–22). Springer, Singapore.

32. Kumari P.L.S., Big Data: Challenges and Solutions. In Security, Privacy, and Forensics Issues in Big Data (2020), (pp. 24–65). IGI Global.

33. Wani M.A. and Jabin, Big data: issues, challenges, and techniques in business intelligence. In Big Data Analytics (2018), (pp. 613–628). Springer, Singapore.

34. Anagnostopoulos, Zeadally and Exposito, Handling big data: research challenges and future directions, The Journal of Supercomputing 72(4) (2016), 1494–1516.

35. Kapil, Agrawal and Khan R.A., Big Data Security challenges: Hadoop Perspective, International Journal of Pure and Applied Mathematics 120(6) (2018), 11767–11784.

36. …, Liu, Shi and Jin, ATCS: Auto-Tuning Configurations of Big Data Frameworks Based on Generative Adversarial Nets, IEEE Access 8 (2020), 50485–50496.

37. Mohamed and Hong, Hadoop-MapReduce job scheduling algorithms survey. In 2016 7th International Conference on Cloud Computing and Big Data (CCBD) (2016, November), (pp. 237–242). IEEE.

38. Wang, Zhang, Yin, Wang, Wu and Han, Speed up big data analytics by unveiling the storage distribution of sub-datasets, IEEE Transactions on Big Data 4(2) (2016), 231–244.

39. Nabavinejad S.M., Goudarzi and Mozaffari, The memory challenge in reduce phase of MapReduce applications, IEEE Transactions on Big Data 2(4) (2016), 380–386.

40. Sivarajah, Kamal M.M., Irani and Weerakkody, Critical analysis of Big Data challenges and analytical methods, Journal of Business Research 70 (2017), 263–286.

41. Dolev, Florissi, Gudes, Sharma and Singer, A survey on geographically distributed big-data processing using MapReduce, IEEE Transactions on Big Data 5(1) (2017), 60–80.

42. Guo, Rao, Cheng and Zhou, iShuffle: Improving Hadoop performance with shuffle-on-write, IEEE Transactions on Parallel and Distributed Systems 28(6) (2016), 1649–1662.

43. Wang, Zhou, Guo and Luo, Cluster frameworks for efficient scheduling and resource allocation in data center networks: A survey, IEEE Communications Surveys & Tutorials 20(4) (2018), 3560–3580.

44. Time S.S.R.P., Cluster Frameworks for Efficient Scheduling and Resource Allocation in Data Center Networks: A Survey.

45. Hajeer and Dasgupta, Handling big data using a data-aware HDFS and evolutionary clustering technique, IEEE Transactions on Big Data 5(2) (2017), 134–147.

46. Dey N.S. and Gunasekhar, A comprehensive survey of load balancing strategies using Hadoop queue scheduling and virtual machine migration, IEEE Access 7 (2019), 92259–92284.

47. Chen, Yao, Li and Xiao, PISCES: Optimizing Multi-Job Application Execution in MapReduce, IEEE Transactions on Cloud Computing 7(1) (2016), 273–286.

48. Hariri R.H., Fredericks E.M. and Bowers K.M., Uncertainty in big data analytics: survey, opportunities, and challenges, Journal of Big Data 6(1) (2019), 44.

49. …, Wang, Yu, Ni, Zhang J.A. and Liu R.P., Survey: Sharding in blockchains, IEEE Access 8 (2020), 14155–14181.

50. Luengo, García-Gil, Ramírez-Gallego, García and Herrera, Dimensionality Reduction for Big Data. In Big Data Preprocessing (2020), (pp. 53–79). Springer, Cham.

51. Luengo, García-Gil, Ramírez-Gallego, García and Herrera, Imbalanced Data Preprocessing for Big Data. In Big Data Preprocessing (2020), (pp. 147–160). Springer, Cham.

52. Chugh, Sharma V.K. and Jain, Big Data and Query Optimization Techniques. In Advances in Computing and Intelligent Systems (2020), (pp. 337–345). Springer, Singapore.

53. Vengadeswaran and Balasundaram S.R., CLUST: Grouping Aware Data Placement for Improving the Performance of Large-Scale Data Management System. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD (2020), (pp. 1–9).

54. Naisuty, Hidayanto A.N., Harahap N.C., Rosyiq, Suhanto and Hartono G.M.S., Data protection on Hadoop distributed file system by using encryption algorithms: a systematic literature review. In Journal of Physics: Conference Series (2020, January), (Vol. 1444, No. 1, p. 012012). IOP Publishing.

55. Mohamed M.H., Khafagy M.H. and Ibrahim M.H., Recommender Systems Challenges and Solutions Survey. In 2019 International Conference on Innovative Trends in Computer Engineering (ITCE) (2019, February), (pp. 149–155). IEEE.

56. Hashem I.A.T., Anuar N.B., Gani, Yaqoob, Xia and Khan S.U., MapReduce: Review and open challenges, Scientometrics 109(1) (2016), 389–422.

57. Elzein N.M., Majid M.A., Hashem I.A.T., Yaqoob, Alaba F.A. and Imran, Managing big RDF data in clouds: Challenges, opportunities, and solutions, Sustainable Cities and Society 39 (2018), 375–386.

58. Pouyanfar, Yang, Chen S.C., Shyu M.L. and Iyengar S.S., Multimedia big data analytics: A survey, ACM Computing Surveys (CSUR) 51(1) (2018), 1–34.

59. Al-kahtani M.S. and Karim, Designing an Efficient Distributed Algorithm for Big Data Analytics: Issues and Challenges, International Journal of Computer Science and Information Security (IJCSIS) 15(11) (2017).

60. …, Song, Basanta-Val, Steed and Jo, Next-generation big data analytics: State of the art, challenges, and future research topics, IEEE Transactions on Industrial Informatics 13(4) (2017), 1891–1899.

61. Basanta-Val and García-Valls, A distributed real-time Java-centric architecture for industrial systems, IEEE Transactions on Industrial Informatics 10(1) (2013), 27–34.

62. Basanta-Val, Audsley N.C., Wellings A.J., Gray and Fernández-García, Architecting time-critical big-data systems, IEEE Transactions on Big Data 2(4) (2016), 310–324.

63. Liu, Cai, Shen, Liu and Linge, An adaptive approach to better load balancing in a consumer-centric cloud environment, IEEE Transactions on Consumer Electronics 62(3) (2016), 243–250.

64. Montazerolghaem, Yaghmaee M.H., Leon-Garcia, Naghibzadeh and Tashtarian, A load-balanced call admission controller for IMS cloud computing, IEEE Transactions on Network and Service Management 13(4) (2016), 806–822.

65. Zhao, Yang, Wei, Ding, Hu and Xu, A heuristic clustering-based task deployment approach for load balancing using Bayes theorem in a cloud environment, IEEE Transactions on Parallel and Distributed Systems 27(2) (2015), 305–316.

66. Singh A.K. and Kumar, Secure and energy-aware load balancing framework for cloud data center networks, Electronics Letters 55(9) (2019), 540–541.

67. Shen, Luo, Dong and Zhang, Virto: joint coflow scheduling and virtual machine placement in cloud data centers, Tsinghua Science and Technology 24(5) (2019), 630–644.

68. Bhattacharya, Islam and Abawajy, Evolutionary optimization: a big data perspective, Journal of Network and Computer Applications 59 (2016), 416–426.

69. Chen, Liu and Xiao, Improving MapReduce performance using a smart speculative execution strategy, IEEE Transactions on Computers 63(4) (2013), 954–967.

70. Wang, Xu, Fujita and Liu, Towards felicitous decision making: An overview on challenges and trends of Big Data, Information Sciences 367 (2016), 747–765.

71. Shoro A.G. and Soomro T.R., Big data analysis: Apache Spark perspective, Global Journal of Computer Science and Technology (2015).

72. Zaharia, Xin R.S., Wendell, Das, Armbrust, Dave... and Ghodsi, Apache Spark: a unified engine for big data processing, Communications of the ACM 59(11) (2016), 56–65.

73. Salloum, Dautov, Chen, Peng P.X. and Huang J.Z., Big data analytics on Apache Spark, International Journal of Data Science and Analytics 1(3–4) (2016), 145–164.

74. Kumar M.P. and Pattern, Security Issues in Hadoop Associated With Big Data.

75. Inoubli, Aridhi, Mezni, Maddouri and Nguifo, A comparative study on streaming frameworks for big data (2018, August).

76. …, Bei and Qian, Datasize-aware high dimensional configurations auto-tuning of in-memory cluster computing. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (2018, March), (pp. 564–577).

77. Zaharia, Chowdhury, Das, Dave, Ma, McCauly... and Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12) (2012), (pp. 15–28).

78. …, Tan, Wang, Zhang and Salapura, Spark-bench: a comprehensive benchmarking suite for in-memory data analytic platform Spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers (2015, May), (pp. 1–8).

79. Agrawal, Butt, Doshi, Larriba-Pey J.L., Li, Reiss F.R. and Xia, Spark Bench–a Spark performance testing suite. In Technology Conference on Performance Evaluation and Benchmarking (2015, August), (pp. 26–44). Springer, Cham.

80. Jaiswal, Dwivedi V.K. and Yadav O.P., Big Data and its Analyzing Tools: A Perspective. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (2020, March), (pp. 560–565). IEEE.