Online Analytical Processing for Business Intelligence in Big Data

Abstract

The online analytical processing (OLAP) approach has been widely used in business intelligence for decades to serve multidimensional queries. In this era of cutting-edge technology and the internet, data generation rates have been rising exponentially. Internet of things sensors and social media platforms are among the major contributors to this data boom. Storage and speed are the crucial parameters, and the pressing issues, in efficient data handling. The key idea here is to address these two challenges of big data computing in OLAP. In this article, the authors propose and implement OLAP on Hadoop by Indexing (OOHI). OOHI offers a simplified multidimensional model that stores dimensions in the schema server and measures on the Hadoop cluster. The overall setup is divided into the following modules: data storage module (DSM), dimension encoding module (DEM), cube segmentation module, segment selection module (SSM), and block selection and process (BSAP) module. DSM applies serialization and deserialization to store and retrieve data with efficient space utilization. DEM adopts integer encoding of the dimension hierarchy to avoid the sparsity problem in multidimensional big data. SSM reduces the search space by selecting only the chunks of the cube that the query touches. A map reduce-based indexing approach and a series of seek operations in the BSAP module are integrated to achieve parallelism and fault tolerance. Real-time oceanography data and supermarket data sets are used to demonstrate that the OOHI model is data independent. Various test cases are designed to cover the scope of each dimension and the volume of each data set. Comparative results and performance analytics show that OOHI outperforms Hadoop-based OLAP in data storage, dice, slice, and roll-up operations.

Introduction

Rapid expansion and digitization in almost every sector have increased the opportunities for innovation in tools and technologies. This advancement has resulted in massive data growth, giving birth to big data. Social media, the internet, manufacturing, finance, sensors, health care, astronomy, bioinformatics, oceanography, retail, and education have all witnessed enormous amounts of data.¹,² Storage, management, and processing of this huge volume of data have become crucial. These challenges have opened new avenues of research and invention for “big data” computation and techniques. Data warehouses and big data are the keys to business intelligence insight.³ Capturing, storing, processing, and analyzing big data have turned out to be very expensive. Academicians and industry experts have stepped up to make this process less expensive and more efficient.⁴ Direct beneficiaries of improved data management are the supermarket, banking, health care, engineering, and communications sectors. Data originating from these sources demand efficient storage methodologies.¹ Data generated from these heterogeneous sources tend to have different formats and sizes.⁵ To deliver a clear view to the end user, data warehousing works on the concept of extracting, transforming, and loading a variety of data, which differ in type and volume, into a central standard storage block, typically on a nonvolatile platform.

The manifold increase in the volume of data has led to ever-increasing expectations of faster processing speeds. This has motivated the authors to attempt to improve the online analytical processing (OLAP) structure for the big data environment. Parallelism and a distributed processing environment play a vital role in dealing with the volume of big data.⁶ OLAP over big data is an active topic among academicians and industry experts. A typical OLAP engine follows the relational OLAP (ROLAP) architecture, which is highly dependent on the relational database management system approach. Multidimensional OLAP (MOLAP) is the most suitable OLAP configuration for big data, as it operates on the concept of a multidimensional array.⁴ The map reduce programming model is used to address the performance challenges of a distributed environment and is widely recommended for very large-scale data handling algorithms. The alternative solutions for achieving MOLAP have used cloud-based Hadoop data warehouse systems such as HBase and HadoopDB. In contrast, this study presents a new MOLAP system for big data called OLAP on Hadoop using Indexing (OOHI).

This article aims to provide a new approach to an existing problem: delivering faster query responses for OLAP in a multidimensional environment. A distributed environment and parallelism are used to store a comparatively massive volume of data. A new system architecture is proposed to run OLAP queries quickly on large amounts of data. The Hadoop platform with the map reduce programming model is deployed to ensure scalability and fault tolerance. The data loading module, data storage module (DSM), data hierarchy encoding module, cube segmentation module, segment selection module (SSM), and block selection and process (BSAP) module are the major building blocks contributing to the enhanced OLAP performance.

The rest of the article is organized as follows. The Literature Survey section reviews related work; the Proposed OOHI Model section describes the proposed OOHI model, including the DSM, dimension encoding module, cube segmentation module, SSM, and BSAP module. The detailed system architecture is presented in the System Architecture section. The Implementations section presents the map reduce job for the OLAP process, the experimental setup, data set details, and test cases with the achieved results.

Literature Survey

Research articles and related literature were studied to achieve the following objectives, each of which is discussed in detail below.

1. Identification of OLAP approaches on a variety of big data, with applications and methodologies.

2. Methods used for precomputation of the OLAP cube and application of special data structures to boost OLAP performance.

3. Existing methodologies to optimize data storage.

4. Implementation techniques to speed up the OLAP query.

5. Impact of the implementation platform on the performance of the OLAP query.

Literature from the past 15 years on data sets, methodologies, and applications of OLAP was studied rigorously to achieve objective 1. Popular research trends include big data warehouse augmentation, OLAP on big data, OLAP with uncertain data, and OLAP on geospatial data. Based on the studied literature, the variety of big data is classified as shown in Figure 1.

FIG. 1.

OLAP on variety of big data. OLAP, online analytical processing.

An exhaustive study of relevant technical articles was carried out to collect these details. Application-oriented literature, with objectives and methodologies, is summarized in Table 1, which lists the most appropriate techniques and applications for structured, semi-structured, and unstructured data sets.

Table 1.

Applications and methodologies for online analytical processing on variety of big data

Sr. No	Title	Data set/application	Objectives/methodology	Variety of data
1	OLAP over uncertain and imprecise data⁷	Theoretical and probabilistic data modeling	Generalized OLAP for imprecise and uncertain data set. Scalability and ambiguity are tested through data set	Unstructured
2	Text cube: computing IR measures for multidimensional text database analysis⁸	Document and text database analysis: real Dell customer review data sets	1. Novel text cube model with pull-up and push-down operations	Semi-structured
			2. Efficient storage and aggregation using information retrieval techniques
			3. Optimal query processing partially materialized cube
3	Evaluating statistical tests on OLAP cubes to compare degree of disease⁹	655 patients data set with 25 attributes from Emory University Hospital	Series of statistical tests with OLAP cube, model optimization also achieved	Structured
4	Predictive analytics with surveillance big data¹⁰	ASDI live data from Boeing AATM	Use of SPSS model	Structured
5	OLAPing social media: the case of twitter¹¹	Twitter data set	Limitation of x-DFM further extension to reliable, dynamic, fuzzy dimensional hierarchies of OLAP. Natural language processing methods are developed	Semi-structured
6	A map-reduce-enabled SOLAP cube for large-scale remotely sensed data aggregation¹²	Already in use, Ministry of Water Resources, China, for drought monitoring	Combination of SOLAP and the map reduce programming model to answer long-running spatial queries faster	Semi-structured
7	An advanced data warehouse for integrating large sets of GPS data¹³	GPS data with fuel consumption data and CAN bus data	Design and implementation of a new model for spatial data set processing, storage, and analytics	Structured
8	Toward OLAP analysis of multidimensional tweet streams¹⁴	Twitter streaming data for election campaign	Fuzzy extension of the formal concept analysis method for extraction of objects and attributes from Twitter streaming data; multidimensional tweet stream and microblog summarization is achieved through OLAPing	Semi-structured
9	Data warehouse on Hadoop platform for decision support systems in education¹⁵	IAAU relational MySQL database	Storage of big data using advanced big data platforms; usage of Hive and HBase	Structured
10	The impact of driving styles on fuel consumption: a data-warehouse-and-data-mining-based discovery process¹⁶	12 parameters of CAN bus data	Application of classification methods followed by star flake schema of data warehouse model	Structured
11	Homogenizing data and metadata retrieval in scientific applications¹⁷	Earth, space, and life science case study data	ASQL model is proposed, which integrates with RDBMS for semi-structured data processing in big data	Unstructured
12	Unified OLAP/OLTP big data processing framework in telecom industry¹⁸	Telecom data set	Mixed use of OLTP and OLAP in addition to columnar database MPP. Proposed framework for mixed mode application in big data	Structured
13	Data warehouse with big data technology for higher education¹⁹	Petra Christian University, Indonesia	Use of Hadoop platform for data ingestion and staging in OLAP for institutional big data processing problem	Structured
14	OLAP analysis of water formation data²⁰	Raw data of formation water in OIL and Gas industries	Use of Pentaho tool and Weka for ETL and OLAPing	Structured
15	EXODuS²¹	Tested on three types of document stores.	New multidimensional and interactive schema-on-read approach is proposed. A new framework and multidimensional hierarchy model is introduced. Data mining followed by data warehousing for efficient analytics	Semi-structured
		1. In gaming, example data set gathered from Sports Reference LLC
		2. 2 Million documents from DBLP in XML format
		3. Twitter data set

AATM, Advanced Air Traffic Management; ASDI, Aircraft Situation Display to Industry; ASQL, Array SQL; CAN, Controller Area Network; DBLP, Digital Bibliography and Library Project; ETL, extract-transform-load; EXODuS, Exploratory OLAP over Document Stores; GPS, Global Positioning System; IAAU, International Ataturk Alatoo University; IR, Information Retrieval; LLC, limited liability company; MPP, massively parallel processing; OIL; OLAP, online analytical processing; OLTP, online transactional processing; RDBMS, relational database management system; SOLAP, spatial OLAP; SPSS, Statistical Package for the Social Sciences; x-DFM, extended dimensional fact model; XML, extensible markup language.

Precomputation and data compression help optimize OLAP performance. Performance can be improved through precomputation combined with a proposed map reduce framework for the query algorithm.²² Retrieval of data cubes from Bigtable, with a comparison of different approaches, has also been described; that work uses the map reduce framework and an HBase implementation environment for Bigtable to handle the scalability issue.²³ A genetic algorithm has been used to improve data selection. To store the data effectively, only the “happening” cubes among all data cubes are selected, which reduces query cost. A greedy approach has been applied to optimize the storage cost, which directly affects OLAP performance.²⁴ These studies address objective 2.

Besides precomputation, data structures also help optimize data storage. A data warehouse was built on a cloud platform with support for both online transactional processing and OLAP.²² Indexing and partitioning were the core techniques used to handle online transactions and to provide data analytical functionality. Pioneering work on combining OLAP and data mining is presented by Han et al.,²² who integrate OLAP and mining functions into a single entity under a theoretical framework, which closely matches today's requirements. A new data structure has been designed based on the concept of short message service. An abstract model has been defined that is able to manage multidimensional tables and data using query algebra and an advanced fact constellation schema.¹⁴ The above literature illustrates the impact of special data structures on OLAP, which addresses objective 3.

Objective 4 concerns speeding up OLAP queries by exploring various platforms. The Hadoop Distributed File System (HDFS) can store large volumes of data and can be combined with Structured Query Language (SQL)-based skills for analytics.²⁵ The combination of a DBMS and map reduce has been successfully demonstrated as HadoopDB for analytical queries on OLAP.²⁶ MOLAP systems are well defined in OLAP4cloud and HBaseLattice. An indexing approach improves OLAP performance; OLAP4cloud works on this principle to enhance OLAP performance and reduce storage cost. Avatara is a Hadoop-based OLAP engine that offers online analytics with low latency and high throughput for high-traffic websites.⁴

A hybrid OLAP architecture is one of the key ideas for boosting OLAP performance, enabling adaptive computation, and observing the impact of the platform. Hybrid OLAP has been designed across CPU, GPU, and memory subsystems to optimize query response times.²⁷ Lee et al. proposed an efficient database design based on partitioning with replication.²⁸ The literature suggests that big data and OLAP are key to providing better insight. A new OLAP framework should support both SQL and NoSQL. Hadoop based OLAP (HaOLAP) is a well-known Hadoop-based method for OLAP operations such as slice, dice, roll-up, and drill-down.⁴

Security and data quality management are also popular areas of data warehouse and OLAP research. Although considerable research effort has been made in this area, only a few approaches were found to be efficient and fruitful. Security is applied layer by layer, at each stage in the protocol, with encryption techniques.²³

There are still open problems and scope for enhancement in the following areas:

1. Multidimensional, heterogeneous big data require effective data storage support.

2. Existing methodologies can be adapted to Hadoop and cloud computing platforms to process data in a distributed environment.

3. Better data analytics tools can be implemented through the convergence of artificial intelligence and machine learning.

Proposed OOHI Model

This section describes the proposed OOHI model in detail, including how dimensions and measures are stored, retrieved, encoded, and decoded. The OLAP cube is divided into chunks of fixed size. The chunks are processed in a distributed manner using the map reduce framework. Figure 2 shows the data in a multidimensional view.²⁵ The division of the cube into chunks to achieve parallelization is shown in Figure 3. Chunking is used in OOHI to exploit the distributed data environment.⁴

FIG. 2.

Multidimensional data cube.

FIG. 3.

Visualization of partitioned cube.

A series of modules and algorithms is used to compute the OLAP cube. DSM offers a compact storage technique for big data. The dimension encoding module adopts an integer encoding technique to encode each dimension at each level of the concept hierarchy.⁸ The cube segmentation module and SSM describe how only the specific chunks needed to answer a user query are selected from the whole cube. Once the set of blocks has been identified by segment selection, it is processed further by the BSAP module for slice and dice operations.

Data storage module

Serializing the data is essential to reduce the storage costs of OLAP in big data. In MOLAP, cube storage demands comparatively more space because it uses a multidimensional array. In OOHI, data are extracted directly from the database server and stored in serialized form as a key and a value, instead of storing the n-dimensional coordinates together with the value.²⁶ The chunk file and the cells of each block are serialized for storage and deserialized to answer a user query. A chunk file is fundamentally the map file given to the responsible mapper. Logically, cube cells and chunk files correspond to the values of the multidimensional array itself; physically, they are HDFS map files.

Let X be the multidimensional array of n dimensions {D₁, D₂, D₃, …, D_n}, and let the coordinates of an array value be denoted as {P₁, P₂, P₃, …, P_n}. Their serialization results in the index given by Equation 1. The computed index is used for further processing. $\mathrm{pcIndex}(X) = D_{1} + (D_{2} \times P_{1}) + (D_{3} \times P_{2}) + \dots + (D_{n} \times P_{n-1})$ (1)

Figure 4 illustrates the paging concept. Each dimension is encoded as a page. Instead of storing the values in a multidimensional array, each cell position is serialized into a key. Conceptually, data are stored through serialization and retrieved through deserialization. The core function of the deserialization process is to calculate the coordinates along each dimension. The deserialization procedure is described in Equation 2:

FIG. 4.

Serialization.

$\begin{aligned} T_{1} &= \mathrm{index} \\ P_{1} &= T_{1} \bmod D_{1}, & T_{2} &= T_{1} / P_{1} \\ P_{2} &= T_{2} \bmod D_{2}, & T_{3} &= T_{2} / P_{2} \\ &\;\;\vdots \\ P_{n} &= T_{n} \bmod D_{n} \end{aligned}$ (2)
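A minimal Python sketch of the serialization and deserialization idea follows. It is an illustration under stated assumptions, not the authors' implementation: it uses the textbook row-major linearization (last dimension varying fastest, dividing by the dimension extent at each step), which reproduces the chunk indices in the worked example of the segment selection section but differs in form from Equations 1 and 2 as printed. The function names `serialize` and `deserialize` are hypothetical.

```python
from math import prod


def serialize(coords, dims):
    """Return the row-major index of `coords` in an array with extents `dims`."""
    if len(coords) != len(dims):
        raise ValueError("coords and dims must have the same length")
    index = 0
    for p, d in zip(coords, dims):
        if not 0 <= p < d:
            raise ValueError(f"coordinate {p} outside [0, {d})")
        index = index * d + p  # Horner-style accumulation, last dimension fastest
    return index


def deserialize(index, dims):
    """Invert `serialize` by repeated modulo and integer division."""
    if not 0 <= index < prod(dims):
        raise ValueError("index out of range")
    coords = []
    for d in reversed(dims):
        coords.append(index % d)  # coordinate along this dimension
        index //= d               # remaining index for the slower dimensions
    return tuple(reversed(coords))


if __name__ == "__main__":
    grid = (6, 4, 7)  # assumed chunk grid, taken from the later worked example
    print(serialize((1, 0, 0), grid))  # 28
    print(deserialize(79, grid))       # (2, 3, 2)
```

Storing one integer key per cell instead of an n-tuple is what gives the space saving described above; the coordinates are recomputed only when a query needs them.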

Dimension encoding module

Dimension hierarchy encoding is broadly classified into two categories: the integer encoding method and the binary encoding method. In the binary method, each level of the hierarchy is represented by its own binary value.²⁷ As shown in Figure 5, the concept hierarchy of the time dimension includes year, month, and day, and each level is represented by binary digits. The binary encoding of 31.1.1968 is 001000111111 (1968 is encoded as 001, January as 0001, and day 31 as 11111). The next day, 1.2.1968, is encoded as 001001000001, so the two codes differ by two (2) even though the dates are consecutive, and the difference may grow when more years are covered. This disjointedness in the binary encoding leads to the sparsity problem in the multidimensional array.

FIG. 5.

Binary encoding of dimension hierarchy.

The integer encoding method does not split the code by level. As shown in Figure 6, an integer value is assigned to each year, month, and day in the time dimension. Although storing and retrieving level-wise information is somewhat easier with binary encoding, the authors adopted integer encoding to avoid the sparsity problem in multidimensional big data.

FIG. 6.

Integer encoding of dimension hierarchy.
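The difference between the two encodings can be checked with a short Python sketch. The bit widths (3, 4, and 5) follow the example in the text; the base year 1967, chosen so that 1968 maps to 001, and the use of a running day count for the integer encoding are assumptions made for illustration, since the exact assignment is only shown in Figures 5 and 6.

```python
from datetime import date

YEAR_BITS, MONTH_BITS, DAY_BITS = 3, 4, 5
BASE_YEAR = 1967  # assumption: makes 1968 map to year code 001


def binary_encode(year, month, day):
    """Concatenate fixed-width binary codes for year, month, and day."""
    year_code = year - BASE_YEAR
    if not 0 <= year_code < 2 ** YEAR_BITS:
        raise ValueError("year outside the 3-bit range")
    return (year_code << (MONTH_BITS + DAY_BITS)) | (month << DAY_BITS) | day


def integer_encode(year, month, day):
    """Assumed integer encoding: consecutive days get consecutive integers."""
    origin = date(BASE_YEAR + 1, 1, 1).toordinal()
    return date(year, month, day).toordinal() - origin


if __name__ == "__main__":
    a = binary_encode(1968, 1, 31)
    b = binary_encode(1968, 2, 1)
    print(format(a, "012b"))  # 001000111111
    print(format(b, "012b"))  # 001001000001
    print(b - a)              # 2: a gap between consecutive days
    print(integer_encode(1968, 2, 1) - integer_encode(1968, 1, 31))  # 1: no gap
```

Every unused code between two consecutive dates corresponds to an empty position along the time axis of the array, which is the sparsity the integer encoding avoids.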

Segment selection module

The cube must be divided into segments so that it can be processed in parallel. In OOHI, the authors apply a cube segmentation method to divide the whole cube into small segments. The resulting list of chunks is the input to the map reduce program. The choice of chunk size is critical: larger chunks reduce parallelism, whereas smaller chunks increase scheduling cost. The chunk size can be derived by considering all the query conditions (support) and the likelihood of occurrence of each condition (confidence). The model assumes a simple random sample of all possible query conditions and uses map reduce characteristics such as the number of map tasks processed per second, the file addressing time, and the scheduling time. The symbols used are listed in Table 2.

Table 2.

Symbols used in cube segmentation

Symbol	Description
AVGt	Average OLAP query execution time
C	Average number of selected chunks
MRt	Time taken by the map reduce process
n	Total number of OLAP query conditions
d	Total number of dimensions of the OLAP cube
Q	Intersection of dimension value d and OLAP query condition q
β	Confidence count of OLAP query condition q
γ	Size of chunk
t1	Map reduce file addressing time
t2	Time required to schedule a task
T	Mapper tasks processed per second

$AVG_{t} = C \times MR_{t}$ (3)

$C = \sum_{q=1}^{n} \left( \prod_{i=1}^{d} \frac{Q_{iq}}{\gamma_{i}} \times \beta_{q} \right)$ (4)

$MR_{t} = t_{1} + t_{2} + \frac{\prod_{i=1}^{d} \gamma_{i}}{T}$ (5)

According to Equation 3, the average OLAP query execution time is the product of the average number of chunks selected across the query conditions and the time taken to execute a map reduce task on one chunk. According to Equation 4, the average number of chunks is obtained by summing over the query conditions: for each condition, the intersection of the query with each dimension is divided by the chunk size in that dimension, and the result is weighted by the confidence of the condition. According to Equation 5, the time taken by the map reduce process combines the file addressing time, the task scheduling time, and the time to process the cells of a chunk at the mapper rate. The chunk size that minimizes AVGt is the optimal chunk size.
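Equations 3 to 5 can be evaluated directly, as in the Python sketch below. The function names, the representation of a query condition as a pair (extents per dimension, confidence), and all numeric values are hypothetical; the sketch also assumes that the confidence multiplies the product once per query condition, which is one reading of Equation 4.

```python
from math import prod


def average_chunks(queries, gamma):
    """Equation 4: sum over conditions of prod(Q_iq / gamma_i) * beta_q."""
    return sum(
        prod(q_i / g_i for q_i, g_i in zip(extents, gamma)) * beta
        for extents, beta in queries
    )


def mapreduce_time(gamma, t1, t2, tasks_per_second):
    """Equation 5: addressing time + scheduling time + cells per chunk / T."""
    return t1 + t2 + prod(gamma) / tasks_per_second


def average_query_time(queries, gamma, t1, t2, tasks_per_second):
    """Equation 3: AVGt = C * MRt."""
    return average_chunks(queries, gamma) * mapreduce_time(
        gamma, t1, t2, tasks_per_second
    )


def best_chunk_size(candidates, queries, t1, t2, tasks_per_second):
    """Pick the candidate chunk size with the smallest modelled AVGt."""
    return min(
        candidates,
        key=lambda g: average_query_time(queries, g, t1, t2, tasks_per_second),
    )


if __name__ == "__main__":
    # Hypothetical workload: two query conditions over a two-dimensional cube.
    queries = [((100, 160), 0.5), ((200, 80), 0.5)]
    print(average_query_time(queries, (50, 40), t1=0.01, t2=0.5,
                             tasks_per_second=1000))
```

In practice the timing parameters would be measured on the target cluster and the candidate list restricted to chunk sizes that the storage layout allows.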

Once the whole cube is divided into chunks, it is important to select only the segments targeted by the given query to reduce the search space. In OOHI, the SSM selects the targeted segments, which are then passed to BSAP. The main OLAP operations are slice, dice, roll-up, drill-down, and pivot. Roll-up and drill-down are treated as aggregation combined with a query, whereas slice and dice are treated as query range operations.²⁸ The pivot operation presents the cube from a different view.²¹ All of these operations must deal with the query range. In OLAP, the dimensions to be queried are identified by the query condition. A range specifies the cells to be queried in the target dimensions and is a multidimensional tuple.

Let {D₁, D₂, D₃, …, D_n} be the dimensions involved in the target query, and let the ordered pair <pi, qi> (where pi < qi) be the query range on dimension Di.

Then the range is R = {<pi, qi> | i ∈ [1, n]}, and a segment is selected when each of its coordinates ci satisfies $\left\lfloor \frac{p_{i}}{y_{i}} \right\rfloor \leq c_{i} \leq \left\lfloor \frac{q_{i}}{y_{i}} \right\rfloor$ (6)

Let the query range run from (p1, p2, p3, …, pn) to (q1, q2, q3, …, qn), let the coordinates of a segment be (c1, c2, c3, …, cn), and let the segment size be (y1, y2, …, yn). According to Equation 6, the segments whose coordinates fall within the range are processed by the map reduce job.

Exemplification

Cube C = [372, 256, 500], with segment size [62, 64, 72].

Let the input query range be [70, 40, 50] to [170, 200, 150].

The start point is [70/62, 40/64, 50/72], which gives [1, 0, 0] after rounding down, and the end point is [170/62, 200/64, 150/72], which gives [2, 3, 2].

According to Equation 6, start point $\leq$ segment $\leq$ end point.

As a result, the selected chunks are {[1, 0, 0], [1, 0, 1], [1, 0, 2], [1, 1, 0], [1, 1, 1], [1, 1, 2], …, [2, 3, 2]}, which are serialized as {28, 29, 30, 35, 36, 37, …, 79}.
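The selection in this example can be reproduced with a short Python sketch. The function name `select_chunks` is hypothetical, and two details are assumptions: the number of chunks per dimension is taken as the ceiling of cube size over segment size, and chunk coordinates are serialized in row-major order (last dimension fastest), which matches the indices listed in the example.

```python
from itertools import product
from math import ceil


def select_chunks(cube, seg_size, q_start, q_end):
    """Return (start, end, ids) of the chunks touched by an inclusive query range."""
    grid = [ceil(c / s) for c, s in zip(cube, seg_size)]  # chunks per dimension
    lo = [p // s for p, s in zip(q_start, seg_size)]       # floor(p_i / y_i)
    hi = [q // s for q, s in zip(q_end, seg_size)]         # floor(q_i / y_i)
    ids = []
    for coord in product(*(range(a, b + 1) for a, b in zip(lo, hi))):
        index = 0
        for c, n in zip(coord, grid):
            index = index * n + c  # row-major serialization of the chunk coordinate
        ids.append(index)
    return lo, hi, ids


if __name__ == "__main__":
    lo, hi, ids = select_chunks(
        cube=(372, 256, 500),
        seg_size=(62, 64, 72),
        q_start=(70, 40, 50),
        q_end=(170, 200, 150),
    )
    print(lo, hi)            # [1, 0, 0] [2, 3, 2]
    print(ids[:6], ids[-1])  # [28, 29, 30, 35, 36, 37] 79
    print(len(ids))          # 24
```

Only these 24 chunk files need to be handed to the map reduce job; the remaining chunks of the 6 × 4 × 7 grid are never opened.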

BSAP module

HaOLAP and other existing algorithms scan each selected chunk to match the query and process the whole chunk,⁴ whereas the OOHI model reads only the required cell values rather than the whole chunk. Figure 7 shows the cube after chunking, ready to be processed by map reduce to achieve parallelization. A query range selects all the affected chunks of the cube, as shown in Figure 8. The BSAP module contributes at this stage by processing only the required cells and not the whole chunk. In Figure 8, 16 chunks are selected, but only some of them need to be processed in full; others need only a part, such as a half or a quarter. A seek operation is performed on each selected chunk, which is then processed according to the BSAP module. The seek operations needed to reach a cell value were derived from the layout of the OLAP cube.

FIG. 7.

Partitioned cube.

FIG. 8.

Selected chunks to process query.

The BSAP algorithm passes the selected chunk data to the mapper as a list of selected cell values. The iterative process provides the selected input to the record reader. Existing approaches, including HaOLAP, check each selected chunk line by line, that is, they apply a brute force method to find the query match. BSAP first calculates the buffer size and then applies the corresponding seek operations, using the names listed in Table 3. The pointer is moved to the first cell of the chunk required by the user query, as determined by the previously mentioned modules. Once the pointer has reached the start coordinate, the program reads values until the pointer reaches the end coordinate. The equations and variables derived from the position of each cell in the chunk and in the block are presented in Tables 3 and 4.

Table 3.

Naming conventions used in block selection and process

Symbol	Description
Fseek	First seek
Nseek	Seek to Next slice
Bseek	Before seek
Aseek	After seek
Dseek	Dice seek
QSi	Query start in i direction where i = x, y, z, …, n
QEi	Query end in i direction where i = x, y, z, …, n
SSi	Segment start in i direction where i = x, y, z, …, n
SEi	Segment end in i direction where i = x, y, z, …, n

Algorithm: BSAP

Input: user query range QSi to QEi; segment range SSi to SEi (from the chunk file)

1. seek_Block(Fseek)

2. For i = 1 to Z′

3. For j = 1 to Y′

a. seek_Block(Bseek)

b. Block_start = max(QSx, SSx), Block_end = min(QEx, SEx)

c. read_Block(Block_data), where the size of Block_data = (Block_end – Block_start + 1) × element size

d. skip_Block(Aseek)

4. End for (j)

5. skip_Block(Nseek)

6. End for (i)

For a multidimensional query, the query start and query end tuples are denoted by QSi and QEi. After segmentation, the cube is divided into fixed-size segments, and the start and end tuples of each segment are denoted by SSi and SEi, as generated by the cube segmentation module. Here, i ranges over the dimensions used in the multidimensional query. When the query range covers a segment completely along a dimension, the whole extent of the segment in that dimension is selected. The preconditions examined for the selected chunks of the cube are listed in Table 4.

Table 4.

Symbols with equations used in block selection and process

Operation name	Conditions
Fseek = Dseek + Sseek
Bseek = QSx – SSx	SSx < QSx
Aseek = SEx – QEx	SEx > QEx
Dseek = (QSz – SSz) × (P′ × P″)	SSz < QSz
Sseek = (QSy – SSy) × P′	SSy < QSy
Nseek = (SEy – QEy) × (QSy – SSy) + P′ × element	SEy > QEy && SSy < QSy
Z′ = min(SEz, QEz) – max(SSz, QSz) + 1	SSz < QSz
Y′ = min(SEy, QEy) – max(SSy, QSy) + 1	SSy < QSy

The first seek operation skips all the cells before the first cell requested by the user. Here, a slice denotes a two-dimensional operation. Seek to next slice moves to the next slice, and before seek is used to reach the target block. Once the required cells have been gathered and sent to the mapper for aggregation, after seek skips the remaining elements of the block. The names of the operations used in the BSAP algorithm are listed in Table 3. The BSAP algorithm is general to n dimensions; the algorithm and the example here are given for three dimensions.

The seek formulas use two quantities: (1) P′, the total number of elements in the x direction, and (2) P″, the total number of elements in the y direction.
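The seek-and-read pattern behind BSAP can be sketched in Python for the three-dimensional case. This is a simplified illustration under assumptions, not the authors' code: the chunk is assumed to be a flat binary file of 8-byte floats laid out with x varying fastest, then y, then z, and byte offsets are computed directly from the clamped query range instead of accumulating the Fseek, Bseek, Aseek, and Nseek values of Table 4. The function name `read_query_cells` and the cell format are hypothetical.

```python
import io
import struct

CELL = struct.Struct("<d")  # assumed cell format: one little-endian float64


def read_query_cells(chunk, seg_start, seg_end, q_start, q_end):
    """Read only the cells of one chunk that fall inside the query range.

    `chunk` is a seekable binary file object; all coordinates are inclusive
    (x, y, z) tuples in cube coordinates.
    """
    # Clamp the query to the segment: Block_start = max(QS, SS), Block_end = min(QE, SE).
    lo = [max(q, s) for q, s in zip(q_start, seg_start)]
    hi = [min(q, s) for q, s in zip(q_end, seg_end)]
    if any(a > b for a, b in zip(lo, hi)):
        return []  # the query does not touch this segment

    p1 = seg_end[0] - seg_start[0] + 1  # P': elements in the x direction
    p2 = seg_end[1] - seg_start[1] + 1  # P'': elements in the y direction
    run = hi[0] - lo[0] + 1             # contiguous cells to read per row

    values = []
    for z in range(lo[2], hi[2] + 1):
        for y in range(lo[1], hi[1] + 1):
            # Cells skipped before this run: whole slices, whole rows, then row prefix.
            offset = ((z - seg_start[2]) * p2 + (y - seg_start[1])) * p1
            offset += lo[0] - seg_start[0]
            chunk.seek(offset * CELL.size)
            data = chunk.read(run * CELL.size)
            values.extend(v for (v,) in CELL.iter_unpack(data))
    return values


if __name__ == "__main__":
    # Hypothetical 4 x 3 x 2 chunk whose cell values equal their linear position.
    raw = b"".join(CELL.pack(float(i)) for i in range(24))
    cells = read_query_cells(io.BytesIO(raw), (0, 0, 0), (3, 2, 1),
                             (1, 1, 0), (2, 5, 9))
    print(cells)  # [5.0, 6.0, 9.0, 10.0, 17.0, 18.0, 21.0, 22.0]
```

Of the 24 cells in this toy chunk, only 8 are read, which is the saving that distinguishes the approach from scanning each chunk in full.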

System Architecture

In this section, the authors describe the role and configuration of the OOHI system architecture. Shared disk, shared memory, and shared nothing architectures are the main options for developing a big data solution. To provide good scalability on commodity machines, the authors selected the shared nothing architecture.²⁹ To ensure parallelism in a distributed environment, the Hadoop framework with the map reduce programming model was chosen. OOHI has three main components: the job server, the schema server, and the Hadoop cluster.

As shown in Figure 9, the OOHI architecture contains a separate processing module for data collection. The extract-transform-load process is performed on the data sources. Separate code was written to extract the raw data and to store the metadata. Input data are in a parseable format. The cube schema, dimensions, and cube metadata are stored in the schema server to verify the validity of the queried data and to avoid the cost of searching for unavailable data. All the measures associated with the dimensions are stored in the Hadoop cluster. When an end user or OLAP client fires a query, for example a slice operation, the application node runs a map reduce job to find the cells matching the criteria and to perform the requested operation. The detailed role of each component is as follows:

FIG. 9.

System architecture. ETL, extract-transform-load.

Job server: The application node is also known as the job server. It is the heart of the whole system: it processes the OLAP commands fired by the user, checks the validity of each command and the available resources, calls a suitable node of the Hadoop cluster for processing, and sends the processed output to the client. The job server first validates the input with the support of the schema server by checking the available metadata and dimensions. After confirmation from the schema server, the job server invokes a command to start the map reduce process on the Hadoop cluster. While the map reduce job is running on the Hadoop cluster, the job server monitors its progress. Once the map reduce job has completed, the job server gathers the metadata of the cube to generate the resultant cube.

Schema server: The schema server is responsible for storing and maintaining metadata. Metadata for both the cube and the dimensions are stored in XML format. Dimension metadata include the number of dimensions and the levels of each dimension. Cube metadata include the path, file structures, and identifiers. The schema server answers all the queries coming from the job server.

Hadoop cluster: All the measures are stored in the Hadoop cluster. It manages all the map reduce instances and the distributed environment provided by HDFS. The queries submitted by the job server are run by the Hadoop cluster. Initially, a data loading map reduce job is run to load the measures into the Hadoop environment and to store the metadata in the schema server. Data loading is a one-time job; thereafter, whenever a job request comes from the job server, the Hadoop cluster processes it according to the dimension encoding technique, dimension traversal technique, cube segmentation, chunk selection, and BSAP methodology described in the previous sections.
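The validation step that the job server performs against the schema server can be sketched as follows. The XML layout, the element and attribute names, the dimension names, and both function names are hypothetical; the article states only that cube and dimension metadata are stored in XML. The dimension sizes reuse the cube from the segment selection example.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document; the real schema is not given in the article.
SCHEMA_XML = """
<cube name="sales">
  <dimension name="time" size="372"/>
  <dimension name="store" size="256"/>
  <dimension name="product" size="500"/>
</cube>
"""


def load_dimensions(xml_text):
    """Parse dimension names and sizes from the metadata document."""
    root = ET.fromstring(xml_text)
    return {d.get("name"): int(d.get("size")) for d in root.iter("dimension")}


def validate_query(dimensions, query):
    """Return a list of problems; an empty list means the query can be submitted.

    `query` maps a dimension name to an inclusive (start, end) range.
    """
    errors = []
    for name, (start, end) in query.items():
        if name not in dimensions:
            errors.append(f"unknown dimension: {name}")
        elif not 0 <= start <= end < dimensions[name]:
            errors.append(f"range out of bounds for {name}: {start}..{end}")
    return errors


if __name__ == "__main__":
    dims = load_dimensions(SCHEMA_XML)
    print(validate_query(dims, {"time": (70, 170), "store": (40, 200)}))  # []
    print(validate_query(dims, {"region": (0, 1)}))
```

Rejecting an invalid query at this point avoids launching a map reduce job that would search for data that do not exist.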

Implementations

In this section, the authors describe how the aforementioned algorithms are executed. This includes an explanation of the map reduce job with a comprehensive flowchart, the experimental setup, and the test cases with detailed results, as follows:

Detailed job for OLAP process (see the Map Reduce Job for OLAP Process section)

Master–slave configuration, experimental setup, and data set description (see the Experimental Setup section)

Design of test cases including data storage, dice, slice, and roll-up operation queries and results (see the Test Cases and Results section)

Map reduce job for OLAP process

The data loading, data storage, and OLAP processes described in earlier sections are designed according to the map reduce framework. Figure 10 depicts the map reduce job for the OLAP process, described as follows:

FIG. 10.

Map reduce job to perform OLAP process.

The data loading implementation involves two phases. The first phase loads data to HDFS, and the second phase generates chunk files through a map reduce process. Every map reduce job consists of four components, namely input formatter, mapper, reducer, and output formatter.³⁰ The input formatter takes the raw data and applies the parsing logic to separate the measures of dimensions, the levels of dimensions, and the metadata. Note that the original data files loaded into HDFS are in XML format. The mapper works on each line, taking the line number and the value as a <key, value> pair. The obtained <key, value> pair is then serialized according to DSM; a new key is generated as an index and passed to the reducer. The reducer is responsible for sorting and shuffling the available data.
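The serialization step can be pictured with the following minimal sketch, which folds integer-encoded coordinates into a single index and recovers them again. It assumes a plain row-major linearization; the mapping actually used by DSM is the one defined by the article's own equations, which are not reproduced here. The function names and the choice of the S1 dimension sizes from Table 5 are illustrative only.

```python
def serialize(coords, sizes):
    """Fold coordinates (d1, ..., dn) into one integer index (row-major order)."""
    index = 0
    for coord, size in zip(coords, sizes):
        if not 0 <= coord < size:
            raise ValueError(f"coordinate {coord} outside [0, {size})")
        index = index * size + coord
    return index


def deserialize(index, sizes):
    """Recover the coordinates (d1, ..., dn) from a serialized index."""
    coords = []
    for size in reversed(sizes):
        index, coord = divmod(index, size)
        coords.append(coord)
    return tuple(reversed(coords))


# Illustrative sizes: (Time, Area, Depth) of subset S1 in Table 5.
SIZES = (186, 128, 250)

cell = (42, 17, 3)
key = serialize(cell, SIZES)
print(key, deserialize(key, SIZES))
```

Because the key is a single integer, a cell can be stored as one <key, value> pair instead of one entry in a dense multidimensional array.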

After the data loading process completes, the OLAP process begins. The generated chunk files are processed by a map reduce job to perform the OLAP process. First, the query quadruple is submitted by the client and then checked for deterministic failures. As shown in Figure 10, the input formatter takes data from SSM as a chunk file. Each coordinate of a cell is deserialized according to Equation 2. After deserialization, a value is associated with each coordinate and checked against each query condition. If all coordinates of the cell match the query conditions, the cell is serialized again and passed to the mapper. Cells whose coordinates do not match are discarded from further processing. <M_Key_in, M_Value_in> is the input key value pair for the mapper, and <M_Key_out, M_Value_out> is its output key value pair. Similarly, the reducer takes <R_Key_in, R_Value_in> as its input key value pair and produces <R_Key_out, R_Value_out> as its output key value pair. As part of the input formatter, the record reader processes the set of serialized cells and values. Once the BSAP module finalizes the list of chunks, it passes the array of chunks to the mapper. Unlike the HaOLAP algorithm, OOHI uses BSAP to seek the intended position of each cell within the array of chunks rather than searching whole chunks to identify the cell position. The main objective of the mapper in the whole process is to change the level according to the concept hierarchy of the dimensions.
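The condition check performed on each deserialized cell can be sketched as follows. This is an illustration under stated assumptions rather than OOHI's input formatter: cells are taken to be already deserialized into coordinate tuples, and a query is represented as a hypothetical mapping from dimension position to the set or range of allowed coordinate values. The example conditions loosely mirror a dice query on two dimensions (one coordinate in (2, 3), another between 3 and 6).

```python
def matches(coords, conditions):
    """True if every constrained coordinate lies in its allowed set or range."""
    return all(coords[pos] in allowed for pos, allowed in conditions.items())


def select_cells(cells, conditions):
    """Keep only the (coords, value) pairs that satisfy all query conditions."""
    return [(coords, value) for coords, value in cells if matches(coords, conditions)]


# Hypothetical query: dimension 0 in (2, 3) and dimension 1 between 3 and 6 inclusive.
conditions = {0: {2, 3}, 1: range(3, 7)}

cells = [
    ((2, 4, 0), 11.5),   # matches both conditions
    ((3, 7, 1), 9.0),    # dimension 1 out of range
    ((5, 3, 2), 7.25),   # dimension 0 not allowed
    ((3, 6, 2), 8.0),    # matches both conditions
]

selected = select_cells(cells, conditions)
print(selected)
```

Unconstrained dimensions (dimension 2 here) are simply absent from the mapping, so any coordinate value is accepted for them.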

Changing the level of a dimension requires the coordinates of the cells, so the mapper deserializes each obtained <M_Key_in, M_Value_in> pair into its original coordinate values, written as <d₁, d₂, …, d_n, value>. The level of the dimension is then changed to match the submitted query, and the changed coordinates are serialized again so that the reducer can perform aggregation. The mapper thus generates a new key value pair <M_Key_out, M_Value_out>, in which the new key holds the serialized coordinates at the changed level. The reducer starts processing with the key value pair <R_Key_in, R_Value_in>. It applies aggregation to the values obtained for each key and assigns the aggregated value to that key. <R_Key_out, R_Value_out> is the output of the reducer, and it is stored as a new chunk file. R_Key_out remains unchanged throughout the reducer process.²⁸
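The mapper and reducer roles described above can be sketched in plain Python as follows. This is a simplified, single-process illustration, not the Hadoop job itself: it assumes a uniform fan-out between two adjacent levels (so that integer division moves a coordinate to the coarser level), which real hierarchies such as day to month do not satisfy, and it uses a sum as the aggregation function. The names `map_cell` and `reduce_cells` are hypothetical.

```python
from collections import defaultdict


def map_cell(coords, value, dim, fan_out):
    """Mapper role: rewrite one coordinate to its parent at the coarser level."""
    rolled = list(coords)
    rolled[dim] = coords[dim] // fan_out  # assumes uniform fan-out per parent
    return tuple(rolled), value


def reduce_cells(pairs, aggregate=sum):
    """Reducer role: group values by key and aggregate each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: aggregate(values) for key, values in groups.items()}


cells = [((0, 1), 2.0), ((1, 1), 3.0), ((2, 1), 4.0), ((3, 0), 1.0)]

# Roll dimension 0 up by a factor of 2, then aggregate cells sharing a key.
mapped = [map_cell(coords, value, dim=0, fan_out=2) for coords, value in cells]
result = reduce_cells(mapped)
print(result)
```

In the actual job the keys would be serialized indices rather than tuples, and the grouping would be done by the shuffle phase between mapper and reducer.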

Experimental setup

OOHI was executed on a three-node setup of commodity computers, each with a 64-bit operating system (Windows 10), 8 GB RAM, a Core i5 CPU, a built-in 120 GB hard disk, and an extra 1 TB hard disk. One node in the cluster worked as the Hadoop master, and the other two nodes worked as Hadoop slaves. The competitor for comparison with the OOHI model was selected on the basis of parallel processing and the map reduce programming framework, which is the most suitable architecture for big data processing.¹¹ OLAP implementations include either ROLAP or MOLAP; in the proposed model, we adopted MOLAP. Hence, MOLAP, Hadoop, and the map reduce framework motivated us to select HaOLAP as the competitor of OOHI. HaOLAP is a well-known approach that aims to improve storage cost and query performance. It has already been compared with advanced data warehouse solutions such as Hive, HBaseLattice, OLAP4cloud, and HadoopDB, and has been shown to be the better approach.⁴ The source code of HaOLAP is available at https://github.com/MarcGuo/HaoLap, and the source code of the OOHI implementation is available at https://github.com/Jigna-Nirma/OOHI.

We used oceanography data from the Intergovernmental Oceanographic Commission of UNESCO (IODE, International Oceanographic Data and Information Exchange). We downloaded 11 years of data, a total of 900 files; the total size of the database is 10 GB. From the XML files, we extracted three-dimensional data as follows:

Three dimensions (T, A, D) where T represents Time, A represents Area, and D represents Depth.

T has 5 levels, which are Year, Season, Month, Day, and Slot. Slot refers to morning, afternoon, and evening of a day.

A has 7 levels, which are 1°, 1/2°, 1/4°, 1/8°, 1/16°, 1/32°, and 1/64°. A 1° quadrangle is the area whose side length is 1° of longitude and of latitude. The earth can be divided into 360 × 180 squares of 1°, 4 × 360 × 180 squares of 1/2°, and so on.

D has 3 levels, which are 100, 50, and 10 m. The 100 m level divides the depth of the ocean into layers of 100 m each.
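The growth of the Area dimension across its levels follows directly from the description above: each halving of the side length quadruples the number of squares. The short calculation below works this out for all seven levels; the function name is illustrative.

```python
BASE_SQUARES = 360 * 180  # number of 1° squares covering the earth


def squares_at_level(k):
    """Number of squares whose side is 1/2**k of a degree (k = 0 means 1°)."""
    return BASE_SQUARES * 4 ** k


for k in range(7):
    label = "1°" if k == 0 else f"1/{2 ** k}°"
    print(f"{label:>6}: {squares_at_level(k):,} squares")
```

At the finest level (1/64°) this gives 265,420,800 squares, which illustrates why a dense multidimensional array over all dimensions would be very sparse.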

Although the type of application does not affect the performance of OOHI, the model has been tested on both oceanographic data and online shopping data. The results for the two data sets are similar, and hence, we have not included the online shopping results. Applications from different domains differ mainly in the volume of dimension values and the sparsity of measures. The reasons we consider that the application domain does not affect the performance of OOHI are as follows:

OOHI has been implemented on the map reduce programming framework, which can handle a massive amount and variety of dimensions.

In OOHI, the mapping between measures and dimensions is obtained by computation, and the complexity of traversing a dimension is O(1). Hence, the volume of dimension values does not affect the performance of roll-up, dice, and slice in OOHI.

OOHI does not store measures whose value is NULL. Thus, OOHI avoids the sparsity of measures.
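The last two points can be illustrated with a small sketch of an index-keyed store that skips NULL measures. This is only an in-memory analogy, assuming a hash map keyed by the serialized index; OOHI itself stores binary key value pairs in chunk files on HDFS, and the class name `SparseCube` is hypothetical.

```python
class SparseCube:
    """Index-keyed measure store that never materializes NULL cells."""

    def __init__(self):
        self._cells = {}

    def put(self, index, measure):
        # NULL measures are skipped, so empty regions cost no storage.
        if measure is None:
            return
        self._cells[index] = measure

    def get(self, index):
        # Average O(1) lookup by serialized index; None means "no measure".
        return self._cells.get(index)

    def __len__(self):
        return len(self._cells)


cube = SparseCube()
for index, measure in [(0, 14.2), (1, None), (2, None), (3, 13.9)]:
    cube.put(index, measure)

print(len(cube), cube.get(3), cube.get(1))
```

Only two of the four cells are stored, and a lookup for a missing cell returns `None` without scanning anything.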

Test cases and results

Test cases are designed to cover the scope of all three dimensions of our data set. The size of each data subset is denoted S_i (1 ≤ i ≤ 3), as shown in Table 5. The experiments are carried out on three different subsets of the original data set. Dimension selection plays a critical role in the experiment. To visualize the impact of volume on OOHI, we worked with data sets on the order of 10⁶ to 10⁸ elements, that is, from roughly 6 million to roughly 160 million data elements, as shown in Table 5.

Table 5.

Oceanography data subset description

Data set subset	Time	Area	Depth	Total no. of elements	Size of elements
S1	186	128	250	5,952,000	10⁶
S2	372	256	500	47,616,000	10⁷
S3	558	384	750	160,704,000	10⁸
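The element totals in Table 5 are the product of the three dimension sizes, which the following short check reproduces from the table's own values.

```python
# (Time, Area, Depth) sizes of each subset, copied from Table 5.
SUBSETS = {
    "S1": (186, 128, 250),
    "S2": (372, 256, 500),
    "S3": (558, 384, 750),
}

totals = {name: t * a * d for name, (t, a, d) in SUBSETS.items()}

for name, total in totals.items():
    print(f"{name}: {total:,} elements")
```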

The data storage, dice, slice, and roll-up experiments are described below, with the queries shown in Table 6. For the dice experiment, we designed C1, C2, and C3 as cube operations. Each dice operation is performed on the data subsets described in Table 5. The OLAP operations are expressed as SQL-type queries. We performed the experiment on each S_iC_j (1 ≤ i, j ≤ 3) and observed the query performance time in seconds. The selection of queries is important in the test cases, and hence, all the dimensions are covered in the cube queries: C1, C2, and C3 combine (time, area), (time, depth), and (area, depth), respectively. Thus, nine test cases are designed for each experiment. For the slice operation, we designed SL1, SL2, and SL3. As with the dice operation, the slice operations are executed as S_iSL_j. Logically, a slice operation acts on one dimension, and the other two dimensions are kept as they are. Table 6 shows the slice queries on time, area, and depth as SL1, SL2, and SL3, respectively. Roll-up operations R1, R2, and R3 are likewise executed on data sets S1, S2, and S3. Roll-up depends heavily on the aggregation function, so we kept the roll-up queries similar to the dice queries; only the level in the dimension hierarchy is changed. Each operation is described in the following sections.

Table 6.

Data cube/data slice/data roll-up parameters and query

Operation	Query label	Query
Dice	C1	SELECT * from Cube where Time.month in (2,3) and Area.oneForth between 3 and 6
	C2	SELECT * from Cube where Time.month=4 and Depth.10 m in (2,3)
	C3	SELECT * from Cube where Area.oneForth between 4 and 6 and Depth.10 m = 3
Slice	SL1	SELECT * from Cube where Time.day = 3
	SL2	SELECT * from Cube where Area.oneSixteenth = 10
	SL3	SELECT * from Cube where Depth.1 m = 17
Roll-up	R1	SELECT Time.Month, * where Time.month in (2,3) and Area.oneForth between 3 and 6
	R2	SELECT Area.oneForth, * where Time.month=4 and Depth.10 m in (2,3)
	R3	SELECT Depth.10 m, * where Area.oneForth between 4 and 6 and Depth.10 m = 3

By default, [SELECT *] selects the lowest level of each dimension, that is, [time.day], [depth.1m], or [area.OneSixtyfour].

Data storage

The concept of MOLAP requires storing a multidimensional array together with its values. In the proposed model, for big data, we adopted MOLAP, which would ordinarily require storing elements in multidimensional arrays. Following the technique discussed in the Data Storage Module section, we instead used an indexing approach rather than multidimensional arrays to store dimensions and measures. Figure 11 and Table 7 show that the dimension-level storage of HaOLAP requires about double the storage space of the indexing used in OOHI. This shows that the data storage cost of the proposed model is about half that of the baseline approach, achieved by applying the dimension encoding method, serialization, and deserialization. In addition, to improve data storage in the OOHI model, we used a binary format to store the index as the key and the element as the value. HaOLAP, in contrast, stores metadata and the dimension hierarchy with level-wise information used in dimension traversal for further processing.

FIG. 11.

Data storage comparison. HaOLAP, Hadoop based OLAP.

Table 7.

Data subset size description

Data label	Data size (no. of elements)	HaOLAP	OOHI
S₁	5,952,000	750	397
S₂	47,616,000	1400	732.4
S₃	160,704,000	2375	1050

HaOLAP, Hadoop based OLAP; OOHI, OLAP on Hadoop by Indexing.
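The "about double" storage claim can be checked directly against Table 7 by dividing the HaOLAP figure by the OOHI figure for each subset. The values below are copied from the table; the table does not state a unit, so none is assumed here.

```python
# (HaOLAP, OOHI) storage figures per subset, copied from Table 7.
STORAGE = {
    "S1": (750, 397),
    "S2": (1400, 732.4),
    "S3": (2375, 1050),
}

ratios = {name: haolap / oohi for name, (haolap, oohi) in STORAGE.items()}

for name, ratio in ratios.items():
    print(f"{name}: HaOLAP uses {ratio:.2f}x the storage of OOHI")
```

The ratio lies between roughly 1.9 and 2.3 for the three subsets and is largest for S3.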

Dice operation

In the dice operation, the proposed model uses the segmentation equations and the segment selection method described in the System Architecture section. In addition, the BSAP algorithm is applied to reduce the search space within the selected chunks. Segmentation is the most effective way to work with a distributed environment and parallel processing, but the choice of segment/chunk size is critical: larger chunks reduce parallelism, whereas smaller chunks raise the scheduling cost. As per Equation 3, the chunk size is decided by considering the given query conditions, the average map reduce execution time, and the file addressing time. We calculated chunk sizes of 82, 168, and 252 for queries C1, C2, and C3, respectively. OOHI works on the targeted elements of a chunk rather than on the whole chunk, whereas HaOLAP works on the whole chunk. The brute force technique of HaOLAP results in more time consumption in the dice operation, as shown in Figure 12a–d. Table 8 lists the results of HaOLAP and OOHI for cube queries C1, C2, and C3 on data subsets S1, S2, and S3. Figure 12a–c compares queries C1, C2, and C3, which involve different dimensions, levels, and hierarchies, on data subsets S1, S2, and S3, and Figure 12d summarizes the overall dice performance of HaOLAP and OOHI. As these figures show, the advantage of OOHI grows as the number of elements and the volume of data increase; Figure 12d shows that OOHI is about 2× faster than HaOLAP as the volume increases. A reduction in query processing time corresponds to a proportional gain in performance. The performance comparison of both methods shows that OOHI performs ∼2.6 times better than HaOLAP.^†
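The difference between scanning a whole chunk and seeking only the targeted elements can be sketched as follows. This is not the BSAP algorithm itself, whose seek operations are defined earlier in the article; it is a generic illustration that assumes a chunk is held as two parallel lists sorted by serialized key, and it substitutes a standard binary search for the seek step. All names are hypothetical.

```python
from bisect import bisect_left


def scan_lookup(keys, values, targets):
    """Brute force: inspect every cell of the chunk."""
    wanted = set(targets)
    return [(k, v) for k, v in zip(keys, values) if k in wanted]


def seek_lookup(keys, values, targets):
    """Seek: jump to each target position in a key-sorted chunk."""
    found = []
    for target in sorted(targets):
        pos = bisect_left(keys, target)
        if pos < len(keys) and keys[pos] == target:
            found.append((target, values[pos]))
    return found


# A toy chunk: serialized keys in ascending order with their measures.
keys = [3, 8, 15, 21, 34]
values = [1.5, 2.0, 0.5, 4.0, 3.25]
targets = [21, 8, 99]  # 99 is not present in the chunk

print(scan_lookup(keys, values, targets))
print(seek_lookup(keys, values, targets))
```

Both functions return the same cells, but the scan touches every element of the chunk, whereas the seek does work proportional to the number of targets times the logarithm of the chunk size.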

FIG. 12.

(a) Dice comparison on set S1. (b) Dice comparison on set S2. (c) Dice comparison on set S3. (d) Overall dice comparison.

Table 8.

Dice performance

Cube operation	HaOLAP			OOHI
Cube operation	S1	S2	S3	S1	S2	S3
C1	67.75	130.5	195.75	26.867	49.75	72.345
C2	31.875	61.75	92.625	16.45	21.75	29.675
C3	36.76	71.75	107.625	21	39.5	57

Slice operation

In the slice operation, OOHI performs better than HaOLAP, as shown in Figure 13d, with the resultant values listed in Table 9. The slice operation works with two dimensions, as per the slice queries in Table 6. All the dimensions are covered across the three queries, which are applied on each data subset S_i. Figure 13a shows the query along the X direction, which fetches the data using the block seek operation of the BSAP algorithm. Similarly, Figure 13b shows the query along the Y direction, where the slice seek operation of BSAP is used to fetch elements from the selected chunk. Likewise, Figure 13c shows the query along the Z direction, for which the corresponding BSAP operation is dice seek. Figure 13d shows the overall performance of the slice operation as the volume increases and compares OOHI with HaOLAP on big data. HaOLAP is weak in chunk processing, whereas the reduction of the search space, the seek operations of BSAP, and the chunk selection criteria are the keys to the better performance of OOHI. OOHI performs ∼1.36 times better than HaOLAP.^†

FIG. 13.

(a) Slice comparison on set S1. (b) Slice comparison on set S2. (c) Slice comparison on set S3. (d) Overall slice comparison.

Table 9.

Slice performance

Slice operation	HaOLAP			OOHI
Slice operation	S1	S2	S3	S1	S2	S3
SL1	161.5	324	484.5	121.5	243	354
SL2	231	465	693	185	369	545.456
SL3	137	271	411	103	204	301

Roll-up operation

The roll-up queries apply an aggregation operation to the dice queries. The time consumption of the roll-up operation over all the cases is presented in Figure 14d. The trends of both methods are displayed in Figure 14a–c, with the corresponding values in Table 10. A roll-up operation essentially aggregates the resultant data obtained at the required level of the dimension hierarchy. HaOLAP performs a local operation on the resultant data, whereas OOHI performs the aggregation function in the reducer of the BSAP module, which improves the query performance. The overall performance of the roll-up operation is summarized in Figure 14d, which clearly shows that OOHI performs ∼2.8 times better than HaOLAP.^†

FIG. 14.

(a) Roll-up comparison on set S1. (b) Roll-up comparison on set S2. (c) Roll-up comparison on set S3. (d) Overall roll-up comparison.

Table 10.

Roll-up performance

Roll-up operation	HaOLAP			OOHI
Roll-up operation	S1	S2	S3	S1	S2	S3
R1	309.5	619	921	107.5	215	321
R2	109	218	345	30	60	92
R3	138	276	412	75	150	221
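One way to relate Table 10 to the reported ∼2.8 figure is to average the per-test-case ratios of HaOLAP time to OOHI time over the nine roll-up test cases. The article does not state how its summary ratio was computed, so the use of a mean of ratios here is an assumption; the timings are copied from Table 10.

```python
from statistics import mean

# Roll-up times in seconds on (S1, S2, S3), copied from Table 10.
HAOLAP = {"R1": (309.5, 619, 921), "R2": (109, 218, 345), "R3": (138, 276, 412)}
OOHI = {"R1": (107.5, 215, 321), "R2": (30, 60, 92), "R3": (75, 150, 221)}

# One speedup ratio per (query, subset) test case.
speedups = [
    haolap / oohi
    for query in HAOLAP
    for haolap, oohi in zip(HAOLAP[query], OOHI[query])
]

mean_speedup = mean(speedups)
print(f"{len(speedups)} test cases, mean speedup {mean_speedup:.2f}x")
```

Under this assumption the mean speedup comes out at about 2.80, with individual ratios ranging from about 1.8 (R3) to about 3.8 (R2 on S3).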

The overall performance of both methods is displayed in Figure 15, which shows the average query performance on each data subset. Roll-up and dice queries are about 2× faster in OOHI. The queries are designed to include all the dimensions and different combinations of chunk size. The dimension encoding and chunking algorithms of HaOLAP are better than those of OLAP4cloud, HBaseLattice, and HadoopDB, but HaOLAP falls short because it relies on brute force methods to get elements from the chunks. In contrast, the simplified dimension model, indexing, chunking, and BSAP are the significant factors behind the slice performance of OOHI.

FIG. 15.

Overall comparison. OOHI, OLAP on Hadoop by Indexing.

The results show that our proposed model OOHI significantly outperforms the state-of-the-art baseline method HaOLAP, and the improvement is statistically significant at p ≤ 0.05 under a t-test. Here, the p-value is obtained from a paired two-sample t-test for means, as shown in Table 11. This supports the efficient performance of our proposed model.

Table 11.

p-Value presentation

Method	p (T ≤ t) one-tail
Dice	0.039
Slice	0.008
Roll-up	0.05
Data storage	0.05

Conclusions

We have presented the OOHI model to design, implement, and evaluate OLAP over multidimensional data. To achieve efficient storage utilization and reduced query response time, the authors have implemented (1) an integer encoding method for the dimensions of the cube in the concept hierarchy, (2) serialization/deserialization to store and retrieve measures, (3) segmentation and segment selection methods to choose an appropriate chunk size, and (4) BSAP to reduce the search space for queries. All algorithms and techniques have been implemented through the map reduce programming paradigm combined with the Hadoop framework. The design of a series of test cases helped us compare the results with the baseline approach. Each method has been evaluated through nine test cases covering every scenario. In the test cases and experiment sections, we compared the performance of data storage, slice, dice, and roll-up on the proposed model and the baseline model. We found that the data storage cost of OOHI is about half that of HaOLAP. In addition, in slice (∼1.34 times), dice (∼2.6 times), and roll-up (∼2.8 times), our proposed model achieved better performance than HaOLAP. This study demonstrates a MOLAP implementation on big data. Inclusion of semi-structured and unstructured data in OLAP is future scope for the proposed model.³¹ In the future, more operations can be explored with the same model, and the existing results can be optimized. Furthermore, a more recent distributed platform could be applied in place of the map reduce framework to deal with the volume of big data.³²

Footnotes

Biographies

J.A.P. is working as an Assistant Professor in the Computer Science and Engineering Department at the Institute of Technology, Nirma University. She has completed a PhD in the area of big data warehousing. Her research interests are big data warehousing, theory of computation, big data analytics, and data mining. P.S. is currently working as Professor (IT) and Director (Research and Development) at Raksha Shakti University. She has a total of 20+ years of experience in teaching, administration, and research at the PG level. Her areas of interest are cyber security and machine learning.

Author Disclosure Statement

No competing financial interests exist.

Funding Information

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

1. Goss, Veeramuthu. Heading towards big data building a better data warehouse for more data, more speed, and more users. In: Advanced Semiconductor Manufacturing Conference (ASMC), 2013 24th Annual SEMI, 2013, pp. 220–225. Available online at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6552808 (last accessed April 28, 2016).

2. Patel, Sharma. Big Data Harmonization–Challenges and Applications. Int J Recent Innov Trends Comput Commun, 2017; 5:206–208.

3. Garofalakis. Querying big, dynamic, distributed data. In: DOLAP '14: Proceedings of the 17th International Workshop on Data Warehousing and OLAP, November 2014, pp. 1–1.

4. Song, Guo, Wang, et al. HaoLap: A Hadoop based OLAP system for big data. J Syst Softw, 2015; 102:167–181.

5. Cuzzocrea, Bellatreche, Song. Data warehousing and OLAP over big data: Current challenges and future research directions. In: Proceedings of the sixteenth international workshop on Data warehousing and OLAP, 2013, pp. 67–70. Available online at http://dl.acm.org/citation.cfm?id=2517828 (last accessed April 28, 2016).

6. Cuzzocrea. Analytics over big data: Exploring the convergence of datawarehousing, OLAP and data-intensive cloud infrastructures. In: Computer Software and Applications Conference (COMPSAC), 2013 IEEE 37th Annual, 2013, pp. 481–483.

7. Cuzzocrea. OLAP Over Uncertain and Imprecise Data: Fundamental Issues and Novel Research Perspectives. In: Proceedings of 21st International Workshop on Database and Expert Systems Applications, DEXA, Bilbao, Spain, 2010, pp. 331–336. IEEE.

8. Lin, Ding, Han, et al. Text cube: Computing IR measures for multidimensional text database analysis. In: 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, December 15–19, 2008, pp. 905–910. IEEE.

9. Ordonez, Chen. Evaluating Statistical Tests on OLAP Cubes to Compare Degree of Disease. IEEE Transactions on Information Technology in Biomedicine, 2009; 13:756–765.

10. Ayhan, Pesce, Comitz, et al. Predictive analytics with surveillance big data. In: Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, 2012, pp. 81–90. Available online at http://dl.acm.org/citation.cfm?id=2447491 (last accessed April 28, 2016).

11. Rehman, Weiler, Scholl. OLAPing social media: The case of Twitter. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2013, pp. 1139–1146. Available online at http://dl.acm.org/citation.cfm?id=2500273 (last accessed April 28, 2016).

12. […], Meng, Wang, et al. A Map-Reduce-enabled SOLAP cube for large-scale remotely sensed data aggregation. Comput Geosci, 2014; 70:110–119.

13. Andersen, Krogh, Thomsen, Torp. An Advanced Data Warehouse for Integrating Large Sets of GPS Data. In: Proceedings of the 17th International Workshop on Data Warehousing and OLAP (DOLAP'14). New York, NY: Association for Computing Machinery, 2014, pp. 13–22.

14. Ben Kraiem, Feki, Khrouf, et al. OLAP of the tweets: From modeling toward exploitation. In: Research Challenges in Information Science (RCIS), 2014 IEEE Eighth International Conference on, 2014, pp. 1–10. Available online at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6861029 (last accessed April 28, 2016).

15. Bondarev, Zakirov. Data warehouse on Hadoop platform for decision support systems in education. In: 2015 Twelve International Conference on Electronics Computer and Computation (ICECCO), Almaty, Kazakhstan, September 27–30, 2015, pp. 1–4. IEEE.

16. Ferreira, de Almeida, da Silva. The impact of driving styles on fuel consumption: A data-warehouse-and-data-mining-based discovery process. IEEE Trans Intell Transp Syst, 2015; 16:2653–2662.

17. Misev, Baumann. Homogenizing Data and Metadata Retrieval in Scientific Applications. In: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP (DOLAP'15). New York, NY: Association for Computing Machinery, 2015, pp. 25–34.

18. […], Su, Liu, et al. A unified OLAP/OLTP big data processing framework in telecom industry. In: 2016 16th International Symposium on Communications and Information Technologies (ISCIT), 2016, pp. 290–295.

19. Santoso LW. Data warehouse with big data technology for higher education. Proced Comput Sci, 2017; 124:93–99.

20. Soewito, Isa, Gunawan. OLAP analysis of water formation data. In: 2018 International Conference on Information Management and Technology (ICIMTech), 2018, pp. 125–130.

21. Chouder, Rizzi, Chalal. EXODuS: Exploratory OLAP over Document Stores. Inf Syst, 2019; 79:44–57.

22. Han, Lee, Choi, et al. Cloud-aware processing of mapreduce-based olap applications. In: Proceedings of the Eleventh Australasian Symposium on Parallel and Distributed Computing, Volume 140, 2013, pp. 31–38. Available online at http://dl.acm.org/citation.cfm?id=2525516 (last accessed April 28, 2016).

23. Navuluri, Mukkamala, Ahmad. Privacy-aware big data warehouse architecture. In: 2016 IEEE International Congress on Big Data (BigData Congress), June 27–July 2, 2016, pp. 341–344.

24. Mansmann, Rehman, Weiler, Scholl. Discovering OLAP dimensions in semi-structured data. Inf Syst, 2014; 44:120–133.

25. Karayannidis, Sellis. SISYPHUS: The implementation of a chunk-based storage manager for OLAP data cubes. Data Knowl Eng, 2003; 45:155–180.

26. Ballard, Herreman, Schau, et al. Data modeling techniques for data warehousing. IBM, 1998.

27. Goil, Choudhary. High performance data mining using data cubes on parallel computers. In: Parallel Processing Symposium, 1998. IPPS/SPDP 1998. Proceedings of the First Merged International… and Symposium on Parallel and Distributed Processing 1998, 1998, pp. 548–555. Available online at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=669979 (last accessed January 13, 2017).

28. Lee, Jo, Kim. MRDataCube: Data cube computation using MapReduce. In: 2015 International Conference on Big Data and Smart Computing (BigComp), 2015, pp. 95–102. Available online at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7072817 (last accessed April 28, 2016).

29. Singh, Reddy. A survey on platforms for big data analytics. J Big Data, 2015; 2:1–20.

30. Triguero, Peralta, Bacardit, et al. MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing, 2015; 150:331–345.

31. […], Xu, Wang, Ni. Hierarchical topic modeling of twitter data for online analytical processing. IEEE Access, 2019; 7:12373–12385.

32. Yang, Tang, Yu, et al. SHC: Distributed query processing for non-relational data store. In: 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, 2018, pp. 1465–1476.