Distributed and parallel construction method for equi-width histogram in cloud database

Abstract

Data distribution summaries are commonly used in databases to support query optimization, and histograms are of particular interest. A significant issue in histogram estimation is the large amount of data transmission. This paper presents a distributed and parallel construction method for equi-width histograms in cloud databases (called DPHCD). Unlike previous methods, DPHCD does not require the transfer of any table detail during histogram construction. Only a small amount of bucket information and a few necessary parameters need to be transmitted over the network, so the data transmission of DPHCD is independent of table size. DPHCD divides the histogram task into small tasks that can be executed simultaneously in a distributed cluster. It uses an innovative tablet-level sampling method to reduce the computing overhead in each cluster node. DPHCD is implemented in the Xugu cloud database management system. Experimental results demonstrate that DPHCD achieves small data transmission and speeds up histogram construction.

Keywords

Cloud database, equi-width histogram, distributed and parallel computing, data transmission

1. Introduction

Data acquisition methods and sources have become increasingly complicated with the rapid development of cloud computing, the Internet of Things, and 4G communication technology. NoSQL (Not only SQL) databases based on key/value pairs, such as BigTable [1], HBase [2], Cassandra [3], MongoDB [4], and Redis [5], have developed rapidly. However, these NoSQL databases do not support the transaction and SQL (Structured Query Language) interfaces of relational databases and are not fully compatible with existing business intelligence tools. Thus, they have many limitations [6]. Google’s Spanner [7], Oracle’s Exadata [8], and other relational cloud data management systems based on the shared-nothing architecture, which have the extensibility of NoSQL and the efficiency of relational databases, still dominate their field.

Histograms efficiently summarize data distribution and statistical information, and are therefore important for improving the performance of data access in the cloud. The accuracy of data distribution assessment directly affects the execution sequence of basic algebra operations, such as join and selection [9]. Histograms can be classified into equi-width, equi-depth, V-optimal, compressed, maxdiff, and other types according to their construction methods. One or more histograms are maintained in most commercial database systems. However, the popular relational database management system (RDBMS) Oracle does not open its source code to users, and little literature exists on how to build a histogram inside a distributed RDBMS. By contrast, many papers on building different types of histograms on the distributed and parallel computing architecture MapReduce have been published in the top-level annual database conferences (i.e., Special Interest Group on Management of Data, International Conference on Very Large Data Bases, and International Conference on Data Engineering).

Histogram construction in single-node RDBMS has been studied extensively, whereas its implementation in the cloud has received limited attention [10]. Estimating histograms of data in cloud databases is challenging, and a simple extension of the traditional solution is insufficient. A cloud database is typically a distributed environment in which parallel processing, data distribution, data transmission cost, and other problems should be considered during histogram estimation. The issues studied in this work include making maximum use of the computing resources in distributed working nodes to construct histograms and directly combining the sub-histograms that are built in distributed nodes. Network saturation should be avoided when processing large amounts of data in cloud databases.

Based on the study of the internal structure of NoSQL databases and the MapReduce framework, a distributed and parallel construction method for equi-width histograms in cloud databases (called DPHCD) is proposed in this paper. In this algorithm, the histogram construction task is divided into several sub-histogram tasks in the application request node. The working nodes in the cluster are responsible for the actual estimation of each sub-histogram. DPHCD first obtains the global maximum and minimum values of the entire distributed cluster. Then, all the working nodes in the cluster estimate sub-histograms according to the global maximum and minimum values, so that the sub-histograms can be directly accumulated to get the global histogram. Only local histogram information, which is very small compared to the table size, is transmitted across the network. To fully utilize the generated data in the sampling phase, tablet-level sampling is adopted for estimating sub-histograms in the working nodes. The main contributions of our work are as follows:

We present a novel algorithm that reduces data transmission over the network in a distributed cluster during histogram estimation, which eases network congestion in the cloud database.

We adopt tablet-level sampling, which is significantly faster than the tuple-level random sampling method, to fully utilize the generated data in the sampling phase.

We implement the DPHCD in the Xugu cloud database management system and conduct comprehensive experiments. The results suggest that DPHCD is efficient in histogram estimation and scalable.

The rest of the paper is organized as follows. We briefly introduce the storage architecture of the cloud database and the overall framework of the proposed algorithm in Section 2. The design and implementation details of the DPHCD approach are discussed in Section 3. Using this method, we explain how the reduction of data transmissions is achieved in Section 4. Section 5 shows representative experimental results, Section 6 discusses related works, and Section 7 concludes the paper.

2. Cloud database

With the recent development of cloud computing, the importance of cloud databases has been widely acknowledged. A cloud database is a collection of structured or unstructured content that resides on a private, public, or hybrid cloud computing infrastructure platform. Many cloud databases for managing data in the cloud exist. Each cloud database product is implemented differently and often attempts to address different kinds of data management requirements and priorities. To provide details of cloud databases, we briefly introduce one of the most useful cloud databases, HBase.

2.1 Storage architecture of HBase

HBase is a key-value store that supports a single data abstraction known as the table-structure (popularly referred to as column family), which is based on the Google Big Table design. HBase is designed to work on top of the Hadoop Distributed File System (HDFS). It accesses HDFS storage blocks directly and stores a natively managed file type.

HBase uses partitioned/shared data and master-slave distributed architecture, where data is hashed and sent to a set of external master processes known as “region servers”, which are responsible for managing subsets of the key space. Region servers write data (through several layers of indirection) to HDFS, which ensures data availability through file system replication. The architecture of HBase is shown in Fig. 1.

The HBase architecture has two main services: HMaster, which is responsible for coordinating Regions in the cluster and executing administrative operations, and the HRegionServer, which is responsible for handling a subset of the table’s data. Each HRegionServer serves a set of HRegions, and one HRegion can be served by only one HRegionServer. HRegion is the basic element of availability and distribution for tables and is composed of a Store per column family. A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family in a table for a given HRegion. The MemStore holds in-memory modifications to the Store. Modifications are KeyValues. When a flush is requested, a snapshot of the current MemStore is taken before it is cleared. HBase continues to serve edits out of the new MemStore and the snapshot until the flusher reports that the flush succeeded. At this point, the snapshot is released. HFiles are where data reside and are composed of blocks [11].

Figure 1.

Storage architecture of HBase.

2.2 Storage architecture of Xugu cloud database

The file of the Xugu cloud database is organized based on the relational model of the data. It stores a table in a sequence of rows (row-based). The database is stored as a collection of files. Each file is a sequence of records, and one record is a sequence of fields. A file can contain fixed- and variable-length records. Fixed-length records are easy to implement. Thus, we briefly introduce variable-length records. The organization of variable-length records using a slotted page structure is shown in Fig. 2.

Figure 2 shows that a certain number of bytes are allocated to the file header. The page header contains the number of record entries, the end of free space in the block (EFS), size of each entry (ES), and the pointer of each entry (EP). Intuitively, EFS points to the end of the free space, while EP points to the starting address of a variable-length record.
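The slotted page layout described above can be illustrated with a minimal Python sketch. It is not the Xugu on-disk format: the class name `SlottedPage`, the 16-bit field widths, and the default page size are assumptions chosen only for illustration. The header stores the number of entries and the end of free space (EFS), each slot stores an entry size (ES) and an entry pointer (EP), and records are written backward from the end of the page.

```python
import struct


class SlottedPage:
    """Toy slotted page (hypothetical layout, not the Xugu format)."""

    HEADER = struct.Struct("<HH")  # (number of entries, end of free space EFS)
    SLOT = struct.Struct("<HH")    # (entry size ES, entry pointer EP)

    def __init__(self, size=4096):
        self.buf = bytearray(size)
        # An empty page has no entries and its free space ends at the page end.
        self.HEADER.pack_into(self.buf, 0, 0, size)

    def _header(self):
        return self.HEADER.unpack_from(self.buf, 0)

    def insert(self, record: bytes) -> int:
        """Store a variable-length record and return its slot number."""
        count, efs = self._header()
        slot_dir_end = self.HEADER.size + (count + 1) * self.SLOT.size
        start = efs - len(record)
        if start < slot_dir_end:
            raise ValueError("page full")
        self.buf[start:efs] = record  # records grow backward from the page end
        slot_offset = self.HEADER.size + count * self.SLOT.size
        self.SLOT.pack_into(self.buf, slot_offset, len(record), start)
        self.HEADER.pack_into(self.buf, 0, count + 1, start)
        return count

    def read(self, slot: int) -> bytes:
        """Follow the slot's entry pointer to fetch the record bytes."""
        count, _ = self._header()
        if not 0 <= slot < count:
            raise IndexError(slot)
        slot_offset = self.HEADER.size + slot * self.SLOT.size
        size, ptr = self.SLOT.unpack_from(self.buf, slot_offset)
        return bytes(self.buf[ptr:ptr + size])
```

Because each slot records both the size and the starting address, records of different lengths can share one page without any fixed stride.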

Figure 2.

Slotted page structure for variable-length records.

Figure 3.

General storage architecture of Xugu cloud database.

The file system of the Xugu cloud database is similar to the HDFS. The table files in the Xugu cloud database are broken into 64 MB tablets and distributed among storage nodes. Each tablet is replicated on a different distributed storage node to ensure fault tolerance. The general storage architecture of the Xugu cloud database is shown in Fig. 3.

Each tablet has two replicas that are randomly stored in the storage cluster. In Fig. 3, the table has 6 tablets labeled A, B, C, D, E, and F. These tablets are assigned to storage nodes through polling, so tablets A–F are stored in storage nodes 1–6, respectively. The storage node of each replica is selected randomly. For example, the first replica of tablet A is assigned to storage node 2, while the second replica is assigned to storage node 4. Tablets transfer data over an interconnected network. Figure 3 shows that different tablets of the same version of table $T$ have the following property:

$\textit{Tablet}_{i}\cap\textit{Tablet}_{j}=\emptyset;\quad\textit{Tablet}_{1}\cup\textit{Tablet}_{2}\cup\ldots\cup\textit{Tablet}_{n}=T;\quad 1\leqslant i,j\leqslant n,\ i\neq j.$ (1)
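The placement policy described above (primary tablets assigned by polling, replicas placed on randomly chosen nodes) can be sketched as follows. The function name `place_tablets`, the seeded generator, and the rule that replicas avoid the primary node and each other are assumptions made for this illustration; the text does not specify how collisions are handled.

```python
import random


def place_tablets(tablets, nodes, replicas=2, seed=0):
    """Map each tablet to [primary node, replica node, ...].

    Primaries follow round-robin polling; replica nodes are drawn at random
    from the remaining nodes (assumed to be distinct from the primary).
    """
    rng = random.Random(seed)
    placement = {}
    for idx, tablet in enumerate(tablets):
        primary = nodes[idx % len(nodes)]
        candidates = [n for n in nodes if n != primary]
        placement[tablet] = [primary] + rng.sample(candidates, replicas)
    return placement


layout = place_tablets(list("ABCDEF"), [1, 2, 3, 4, 5, 6])
```

With six tablets and six nodes, the primaries of tablets A to F land on nodes 1 to 6 in order, which matches the assignment shown in Fig. 3; the replica nodes depend on the random seed.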

2.3 Overview of the distributed and parallel construction method for equi-width histogram in cloud database (DPHCD)

According to the storage architecture of the Xugu cloud database, a file is split into one or more tablets, which are stored in a set of distributed storage nodes. These storage nodes serve read and write requests from the file system’s clients.

Figure 4.

Overall framework of DPHCD.

To construct an equi-width histogram in the cloud database, the original method used in the Xugu database requires all the tablet data in the distributed nodes to be collected. These data are then scanned in the histogram request node to obtain the equi-width histogram. The original method requires a full scan of the entire table, and all the tablet data must be transmitted to one node, which is expensive and may result in network saturation in the cloud.

From Fig. 3, we can see that a table is split into several tablets that are distributed across different storage nodes in the cluster. The number of tablets and the data ranges stored on different nodes may differ. Therefore, sub-histograms that are built simultaneously in distributed nodes from local data alone cannot be merged directly to obtain the global histogram. The overall framework of the DPHCD for constructing an equi-width histogram in the Xugu cloud database is shown in Fig. 4. In DPHCD, the histogram task is broken into small subtasks that can be built in distributed nodes, which exploits parallel computing to reduce execution time. Before estimating sub-histograms in distributed nodes, every storage node scans and sorts the local data of its tablets to get the local maximum and minimum values. Then, the maximum and minimum values of the distributed nodes are transferred to the histogram request node to obtain the global maximum and minimum values. After that, all the sub-histograms are estimated in parallel according to the global maximum and minimum values. Finally, the histogram request node combines the corresponding buckets of these sub-histograms directly to obtain the global histogram. The detailed steps of the DPHCD algorithm are described in Section 3.3.

Generating the exact histogram requires every node in the distributed cluster to scan all the tablets, which involves a significant amount of time to complete. DPHCD constructs the approximate histogram with desired accuracy to reduce the construction time. A tablet-level parallel sampling algorithm is designed for approximate histogram. In Fig. 4, a certain percentage of tablets is sampled in parallel before estimating approximate sub-histograms in each distributed node. Tablet-level parallel sampling is discussed in Section 3.2.

As can be seen from the procedure above, the details of the tablets need not be transferred from the distributed storage nodes to the application node. The amount of data transmission over the network is significantly reduced. Data transmission over the network is discussed in Section 4.

3. Equi-width histogram building in the cloud database

3.1 Definitions

Without loss of generality, consider a table $T$ with $t$ tuples whose data are organized into $N$ storage nodes in the cloud database. The attribute of interest $A$ is distributed over domain $D$ . Then, the histogram problem is defined as follows:

Definition 1. Let $V$ represent the value set of $T$ on attribute $A$ ( $V\subseteq D$ ). The frequency $f(x)$ of $x\in V$ is the number of tuples $t$ with $t.A=x$ . Then, a bucket in the histogram is denoted by $B=\langle B_{L},B_{R},f(B)\rangle$ with $B_{L},B_{R}\in V$ . $B_{L}$ is the left boundary value of bucket $B$ , $B_{R}$ is the right boundary value of bucket $B$ , and $f(B)$ represents the bucket size $|B|$ , where $f(B)=\sum{f(x)},B_{L}<x\leqslant B_{R}$ .

Definition 2. The histogram on attribute $A$ in table $T$ contains a series of buckets denoted by $H=\{\langle B_{\textit{iL}},B_{\textit{iR}},f(B_{i})\rangle,i=1,\ldots,m\}$ , where $B_{i}\cap B_{j}=\emptyset,\forall i\neq j$ , and $f(B_{1})+f(B_{2})+\ldots+f(B_{m})=t$ .

Definition 3. Table $T$ is organized into $N$ storage nodes. The maximum and minimum values of attribute $A$ in each distributed storage node are denoted by $\textit{Max}_{L}$ and $\textit{Min}_{L}$ , respectively. The maximum and minimum values of attribute $A$ in table $T$ across the entire cluster are denoted by $\textit{Max}_{G}$ and $\textit{Min}_{G}$ , respectively.
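As a small illustration of Definitions 1 and 2, the following Python sketch represents a bucket as a triple and computes its size from the value frequencies. The names `Bucket`, `bucket_frequency`, and `is_valid_histogram` are hypothetical and introduced only for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Bucket:
    left: float    # B_L, exclusive left boundary
    right: float   # B_R, inclusive right boundary
    freq: int = 0  # f(B), number of tuples with B_L < x <= B_R


def bucket_frequency(values, left, right):
    """f(B): sum of f(x) over all distinct values x with left < x <= right."""
    frequency = Counter(values)  # f(x) for every distinct x
    return sum(count for x, count in frequency.items() if left < x <= right)


def is_valid_histogram(buckets, total_tuples):
    """Check Definition 2: buckets do not overlap and their sizes sum to t."""
    ordered = sorted(buckets, key=lambda b: b.left)
    disjoint = all(a.right <= b.left for a, b in zip(ordered, ordered[1:]))
    return disjoint and sum(b.freq for b in buckets) == total_tuples
```

Note that, with half-open intervals of the form $(B_{L},B_{R}]$ , the left boundary of the first bucket must lie below the smallest value for every tuple to be counted.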

3.2 Tablet-level parallel sampling

The amount of data in a cloud database is very large. Thus, generating the exact histogram requires a full scan of the whole table, which is expensive and takes a significant amount of time to complete. Sampling is the process of selecting a representative sample from a target population and collecting data from that sample to recognize the statistical properties of the underlying data. The sample is usually a subset of manageable size. Sampling has been established as an effective tool for reducing the size of data and avoiding huge costs in subsequent processing. Constructing approximate histograms based on sample data to reflect the data distribution and summarize the contents of large tables is an efficient approach. Histogram construction has been extensively studied in single-node RDBMS, but has received limited attention in the cloud. Tablet-level sampling is adopted here for estimating histograms in a distributed cloud database. The data in the cloud database are organized into tablets, and the tablet is the unit of data stored in the cluster. To reduce the computational complexity of building a sub-histogram in each node, the tablet is used as the unit of random sampling.

The tablet is selected uniformly and randomly with probability $p=1/{\varepsilon^{2}N},\varepsilon\in({0,1})$ , where $\varepsilon$ controls the sample size. The bucket size of the sub-histogram based on tablet-level sampling is the unbiased estimate of the original file in the cloud database. The frequency value of the $i$ th bucket is estimated by Eq. (2).

$f(B_{i})^{\prime}=\frac{\sum\nolimits_{k=1}^{n}H_{\textit{Lk}}.B_{i}.f(B_{i})}{p},\quad 1\leqslant k\leqslant n.$ (2)

Theorem 1. $f({B_{i}})^{\prime}$ is an unbiased estimator of $\widetilde{f({B_{i}})}$ , where $\widetilde{f({B_{i}})}$ is the $i$ th bucket size of the histogram constructed on the entire data.

Proof. $H_{\textit{Lk}}$ indicates the approximate sub-histogram estimated on the sample tablets of the $k$ th node in the distributed cluster. $H_{\textit{Lk}}.B_{i}.f({B_{i}})$ represents the $i$ th bucket size of $H_{\textit{Lk}}$ . Let $p_{i}$ represent the proportion of tuples belonging to $B_{i}$ . The mean of $f({B_{i}})^{\prime}$ is

$E[f(B_{i})^{\prime}]=E\left(\left(\sum\nolimits_{k=1}^{n}H_{\textit{Lk}}.B_{i}.f(B_{i})\right)\Big/p\right)=E\left(\left(\left(\sum\nolimits_{k=1}^{n}H_{\textit{Lk}}\right)\ast p_{i}\right)\Big/p\right)=E\left(\left(\sum\nolimits_{k=1}^{n}(H_{\textit{Lk}}/p)\right)\ast p_{i}\right).$ (3)

$H_{\textit{Lk}}$ is the sub-histogram constructed on sample data; in Eq. (3), the same symbol also stands for the number of sampled tuples it covers. The sampling probability $p$ is the same for all nodes and is transferred from the application request node. The expected total number of tuples in the $k$ th storage node is therefore $H_{\textit{Lk}}/p$ . Thus, we can obtain

$E\left(\left(\sum\nolimits_{k=1}^{n}(H_{\textit{Lk}}/p)\right)\ast p_{i}\right)=E(T\ast p_{i})=T\ast p_{i}=\widetilde{f(B_{i})}.$ (4)

Although the estimate may deviate from the frequency value of the $i$ th bucket of the global histogram, a large sample size makes significant deviations extremely unlikely. The tablet-level random sampling algorithm can be executed simultaneously in distributed nodes. Shi et al. [21] proved that the variance $\textit{VAR}(H_{ib})$ based on block sampling is equal to or less than the variance $\textit{VAR}(H_{it})$ based on tuple sampling using MapReduce. A tablet in the cloud database is similar to a block in HDFS. Thus, tablet-level sampling is adopted to estimate the approximate histogram in this method.
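The estimator of Eq. (2) can be illustrated with a minimal simulation. The sketch assumes that each tablet is kept independently with probability $p$ (Bernoulli sampling) and that a tablet is simply a list of attribute values; the function names are hypothetical and not part of the Xugu implementation.

```python
import random


def sample_tablets(tablets, p, rng):
    """Keep each tablet independently with probability p."""
    return [tablet for tablet in tablets if rng.random() < p]


def count_in_bucket(tablet, left, right):
    """Number of values in one tablet that fall into the bucket (left, right]."""
    return sum(1 for v in tablet if left < v <= right)


def estimate_bucket(tablets, left, right, p, rng):
    """Eq. (2): sum the bucket counts of the sampled tablets and divide by p."""
    sampled = sample_tablets(tablets, p, rng)
    return sum(count_in_bucket(t, left, right) for t in sampled) / p


rng = random.Random(7)
tablets = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(50)]
exact = sum(count_in_bucket(t, 0.0, 1.0) for t in tablets)
trials = [estimate_bucket(tablets, 0.0, 1.0, 0.4, rng) for _ in range(500)]
mean_estimate = sum(trials) / len(trials)
```

A single estimate can differ noticeably from the exact bucket size, but averaged over repeated runs `mean_estimate` stays close to `exact`, which is what unbiasedness in Theorem 1 means in practice.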

3.3 Distributed and parallel construction method for equi-width histogram in Xugu cloud database (DPHCD)

To enable a histogram construction task for large-scale data in a cloud database to be performed simultaneously in a distributed cluster, the histogram task should be divided into small sub-histogram tasks that can be estimated in parallel. As shown in Fig. 3, tablets with the same version are stored on different storage nodes in the cluster. The size of the data files, the range of the data, and other attributes in different nodes are independent of each other. Thus, sub-histograms constructed in distributed nodes from local information alone cannot be merged directly.

An equi-width histogram is easy to construct. Its bucket boundaries are fully determined by the maximum and minimum values and the number of buckets contained in the histogram.
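The following sketch shows how the boundaries follow from the three quantities named above, using the same boundary formula as Algorithm 1. The function names are hypothetical, and mapping a value to its bucket with `bisect_left` on the right boundaries is one possible design choice that realizes the $(B_{L},B_{R}]$ convention of Definition 1 while still placing the minimum value in the first bucket.

```python
from bisect import bisect_left


def equi_width_boundaries(min_g, max_g, num_buckets):
    """Return [(B_iL, B_iR)] for i = 0 .. num_buckets - 1."""
    width = (max_g - min_g) / num_buckets
    return [(min_g + i * width, min_g + (i + 1) * width)
            for i in range(num_buckets)]


def bucket_index(value, boundaries):
    """Index i of the bucket with B_iL < value <= B_iR."""
    if not boundaries[0][0] <= value <= boundaries[-1][1]:
        raise ValueError("value outside [min, max]")
    rights = [right for _, right in boundaries]
    # bisect_left returns the first bucket whose right boundary is >= value;
    # the min() guards against floating-point rounding at the upper end.
    return min(bisect_left(rights, value), len(boundaries) - 1)


bounds = equi_width_boundaries(0, 10, 5)
```

For a minimum of 0, a maximum of 10, and 5 buckets, every bucket has width 2, and a value of 2.5 falls into the second bucket (index 1).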

The cloud database storage architecture in Fig. 3 and the definitions in Section 3.1 show that $\textit{Max}_{\textit{Li}}\leqslant\textit{Max}_{G}$ and $\textit{Min}_{\textit{Li}}\geqslant\textit{Min}_{G}$ for $1\leqslant i\leqslant N$ . If the sub-histograms constructed in the distributed cluster are built with the same maximum and minimum values and the same bucket number, the frequency values of all sub-histograms can be directly accumulated at the application request node to obtain the global histogram of table $T$ . The detailed data of the table need not be transmitted from the distributed nodes to the histogram application request node. Because interconnect bandwidth is the main factor that affects the performance of data processing in a big data environment, it is significant that this approach only needs to transmit the bucket information of the sub-histograms across the network.

The proposed DPHCD algorithm constructs equi-width histograms in the Xugu cloud database using the following steps:

Step 1.
When a database client receives a histogram construction task through a SQL statement, the application request is sent to the database engine for execution. In the Xugu cloud database, any node in the cluster can be used as an application request node. Histogram construction task in the Xugu RDBMS is implemented by a stored procedure in the DBMS_STAT package, and the name of the stored procedure is ANALYZE_TABLE.
Step 2.
The application request node transfers the parameters needed for estimating the histogram, namely table name $T$ , attribute column $A$ , bucket number $B$ , and sampling probability $P$ , to the other storage nodes of the cluster over the network using the remote procedure call protocol.
Step 3.
Each storage node that stores tablets of table $T$ samples, scans, and sorts the data to obtain its local maximum and minimum values, and transmits these values to the histogram application request node. Assuming $N$ nodes send their $\textit{Max}_{L}$ and $\textit{Min}_{L}$ values to one node, the application request node then holds the dataset $S=\{\textit{Max}_{L1},\textit{Min}_{L1},\textit{Max}_{L2},\textit{Min}_{L2},\ldots,\textit{Max}_{\textit{LN}},\textit{Min}_{\textit{LN}}\}$ .
Step 4.
These $N$ pairs of maximum and minimum values are compared to obtain the global maximum and minimum values ( $\textit{Max}_{G}$ and $\textit{Min}_{G}$ ) of the entire distributed cluster. Then, $\textit{Max}_{G}$ and $\textit{Min}_{G}$ are sent to the $N$ storage nodes for building sub-histograms.
Step 5.
Every node in the cluster has the same $\textit{Max}_{G}$ and $\textit{Min}_{G}$ values and bucket number $B$ . Thus, the equi-width histograms constructed in different nodes have the same bucket boundaries. Finally, the sub-histogram information is transmitted to the application request node. The global histogram on attribute $A$ of table $T$ is obtained by directly accumulating the frequency values of the sub-histograms, as follows:

$H_{G}.B_{i}.B_{\textit{iL}}=H_{L}.B_{i}.B_{\textit{iL}},$
$H_{G}.B_{i}.B_{\textit{iR}}=H_{L}.B_{i}.B_{\textit{iR}},$
$H_{G}.B_{i}.f(B_{i})=H_{L1}.B_{i}.f(B_{i})+\ldots+H_{\textit{Lk}}.B_{i}.f(B_{i})+\ldots+H_{\textit{LN}}.B_{i}.f(B_{i}),\quad i=1,\ldots,n.$ (5)

In Eq. (5), $H_{G}$ represents the global histogram and $H_{L}$ represents the estimated sub-histograms in distributed storage nodes. $B_{i}$ is the $i$ th bucket of the histogram, $B_{\textit{iL}}$ represents the left boundary of $B_{i}$ , $B_{\textit{iR}}$ corresponds to the right boundary of $B_{i}$ , and $f(B_{i})$ is the frequency of $B_{i}$ .
Step 6.
Finally, the global histogram is shown to users.

The detailed workflow of the DPHCD algorithm in the Xugu cloud database, which implements a shared-nothing architecture, is shown in Fig. 4. The algorithm utilizes the advantages of parallel computing and does not need to transmit the detailed data of the table over the network. DPHCD is suitable for constructing histograms for large-scale data. The pseudo-code of the DPHCD is shown in Algorithm 1.

Algorithm 1: DPHCD(T, A, B, P)

Input: Table T, attribute A, bucket number B, sampling probability P

Output: Equi-width histogram

begin ANALYZE_TABLE
    Exec SYSDBA.DBMS_STAT.ANALYZE_TABLE;
    ANALYZE_RPC_PIPE(T, A, B, P);
    // on each storage node: sample tablets and find the local extrema
    Sample(P);
    P_SORT(T, A);
    RPCToMainNode( $\textit{Max}_{L}$ , $\textit{Min}_{L}$ );
    // on the application request node: derive and broadcast the global extrema
    $\textit{Max}_{G}=\max_{1\leqslant i\leqslant N}\textit{Max}_{\textit{Li}}$ ; $\textit{Min}_{G}=\min_{1\leqslant j\leqslant N}\textit{Min}_{\textit{Lj}}$ ;
    for $i=1$ to $N$ do
        RPCToSubNode( $\textit{Max}_{G}$ , $\textit{Min}_{G}$ );
    endfor
    // on each storage node: build the local sub-histogram
    for $i=0$ to $B-1$ do
        $B_{\textit{iL}}=\textit{Min}_{G}+i(\textit{Max}_{G}-\textit{Min}_{G})/B$ ;
        $B_{\textit{iR}}=\textit{Min}_{G}+(i+1)(\textit{Max}_{G}-\textit{Min}_{G})/B$ ;
    endfor
    for $j=0$ to sample.size $-1$ do
        for $i=0$ to $B-1$ do
            if $B_{\textit{iL}}<V(x_{j})\leqslant B_{\textit{iR}}$ then $B_{i}.\textit{freq}$ ++;
        endfor
    endfor
    RPCToMainNode(Histogram ${}_{L}$ );
    // on the application request node: accumulate the sub-histograms
    for $i=0$ to $B-1$ do
        for $j=1$ to $N$ do
            $B_{i}.\textit{freq}\mathrel{+}=H_{j}.B_{i}.\textit{freq}$ ;
        endfor
    endfor
end ANALYZE_TABLE
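The two communication rounds of Algorithm 1 can be simulated on a single machine with the following sketch. It is an illustration under simplifying assumptions rather than the Xugu implementation: each node is modeled as a list of attribute values, sampling and RPC calls are omitted, and the minimum value is placed in the first bucket so that no tuple is lost under the $(B_{\textit{iL}},B_{\textit{iR}}]$ convention. All function names are hypothetical.

```python
import math


def local_min_max(values):
    """Round 1, on a storage node: report the local extrema."""
    return min(values), max(values)


def build_sub_histogram(values, min_g, max_g, num_buckets):
    """Round 2, on a storage node: count values per global bucket."""
    counts = [0] * num_buckets
    width = (max_g - min_g) / num_buckets
    for v in values:
        if width == 0 or v == min_g:
            idx = 0  # assumption: the global minimum joins the first bucket
        else:
            idx = min(math.ceil((v - min_g) / width) - 1, num_buckets - 1)
        counts[idx] += 1
    return counts


def dphcd(node_data, num_buckets):
    """Request node: combine extrema, then sum the sub-histograms bucket-wise."""
    extrema = [local_min_max(values) for values in node_data if values]
    min_g = min(lo for lo, _ in extrema)
    max_g = max(hi for _, hi in extrema)
    subs = [build_sub_histogram(values, min_g, max_g, num_buckets)
            for values in node_data]
    merged = [sum(column) for column in zip(*subs)]
    return min_g, max_g, merged


nodes = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10], [0]]
min_g, max_g, histogram = dphcd(nodes, 5)
```

Because every node uses the same global boundaries, the merged result equals the histogram that a single node would compute over all values, while only two extrema and one list of bucket counts per node cross the simulated network.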

4. Data transmissions analysis

4.1 Data transmissions reduction

In this section, we explain how data transmission is reduced using an intuitive example. We assume that a table $T$ , which contains 1,000,000 tuples, is split into 10 tablets. These 10 tablets are stored in a 4-node cluster, and each tablet contains 100,000 records. All machines are directly connected to the network switch. The number of tablets and the data ranges of different nodes may differ. In the original method, to construct the equi-width histogram, these 10 tablets, which are stored in distributed nodes, must be transmitted to the histogram request node. In this node, each of the 1,000,000 records is scanned to determine the bucket to which it belongs. The data flow of the original histogram construction method is shown in Fig. 5. All 1,000,000 records are transmitted over the network because the 10 tablets must be transferred to the histogram request node.

Figure 5.

Data transmission of the original histogram construction method.

Figure 6.

Data transmission of DPHCD.

In DPHCD, when the request node receives a histogram construction task, table name $T$ , attribute $A$ , bucket number $B$ , and sampling probability $P$ are first sent to the four storage nodes. Then, each storage node samples, scans, and sorts the data to get its local maximum and minimum values ( $\textit{Max}_{\textit{Li}}$ and $\textit{Min}_{\textit{Li}}$ ) and transmits the two values to the histogram application request node. In the histogram request node, the four pairs of local maximum and minimum values are compared to obtain the global maximum and minimum values ( $\textit{Max}_{G}$ and $\textit{Min}_{G}$ ) of the entire distributed cluster. After $\textit{Max}_{G}$ and $\textit{Min}_{G}$ are transmitted to the four storage nodes, each node can build its sub-histogram according to the same global maximum and minimum values and bucket number. Then, the four sub-histograms built in the distributed nodes can be directly combined to obtain the global histogram. Figure 6 illustrates the data transmission during histogram construction based on the DPHCD. Only the necessary information ( $T$ , $A$ , $B$ , and $P$ ) and ( $\textit{Max}_{G}$ , $\textit{Min}_{G}$ ) is sent from the histogram request node to the four storage nodes. Four pairs of ( $\textit{Max}_{\textit{Li}}$ , $\textit{Min}_{\textit{Li}}$ ) and four sub-histograms Histogram ${}_{\textit{Li}}$ are transmitted to the histogram request node. Compared with the original histogram construction method in the Xugu database, the amount of data transmission is significantly reduced.

4.2 Comparison of data transmissions

The maximum and minimum values, bucket information, and sampling probability should be transmitted over the network during the estimation process. The total data transmission of the DPHCD algorithm is $Q=8N+3BN$ values. First, the histogram application request node sends the table name $T$ , attribute column $A$ , bucket number $B$ , and sampling probability $P$ to each of the $N$ working nodes, which amounts to $4N$ values. Second, the $N$ working nodes send their $\textit{Max}_{L}$ and $\textit{Min}_{L}$ to the application request node, which amounts to $2N$ values. Third, after obtaining the $\textit{Max}_{G}$ and $\textit{Min}_{G}$ of the entire file, the request node transfers these two values to the $N$ working nodes, which amounts to another $2N$ values. Finally, each sub-histogram contains $B$ buckets, and a bucket is a compound value that contains the left boundary, the right boundary, and the frequency value, so transmitting $N$ sub-histograms amounts to $3BN$ values. Therefore, the total amount of data transmission of DPHCD is $Q=4N+2N+2N+3BN=8N+3BN$ .
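The accounting above can be checked with a few lines of Python. The function name `dphcd_transmission` and the dictionary keys are hypothetical; the example values $N=4$ and $B=10$ are chosen to match the 4-node example in Section 4.1 and one of the bucket counts used in the experiments.

```python
def dphcd_transmission(num_nodes, num_buckets):
    """Break Q = 8N + 3BN into the four message types, counted in values."""
    parts = {
        "parameters (T, A, B, P)": 4 * num_nodes,
        "local extrema": 2 * num_nodes,
        "global extrema": 2 * num_nodes,
        "sub-histograms": 3 * num_buckets * num_nodes,
    }
    parts["total"] = sum(parts.values())
    return parts


cost = dphcd_transmission(num_nodes=4, num_buckets=10)
```

For $N=4$ and $B=10$ the total is $8\cdot 4+3\cdot 10\cdot 4=152$ values, regardless of whether the table holds one thousand or one million tuples.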

The HEDC++ algorithm proposed in [21] supports the equi-width histogram for data in the cloud through an extended MapReduce framework. All input data or sampled data should be converted into $\langle$key-value$\rangle$ pairs, and these intermediate $\langle$key-value$\rangle$ pairs should be transferred from the map phase to the reduce phase. This requirement may result in inefficient performance due to the large amount of data shuffled across clustered machines. In the worst case, all records from the local node should be sent to the other nodes of the cluster by applying the hash partitioning function, which may lead to high network congestion, especially when processing large amounts of data. Estimating the approximate histogram based on sampled data can reduce data transmission over the network. However, the sampled data still consume a proportion of network bandwidth. By contrast, the data transmission of the DPHCD is independent of the table size. The comparison of HEDC++ and DPHCD in terms of data transmission is shown in Table 1.

Table 1
Comparison of data transmission

Algorithm	Optimal data transmission	Worst data transmission	Average data transmission
Exact histogram on HEDC++	$3BN$	$t+3BN$	$t/2+3BN$
Approximate histogram on HEDC++	$3BN$	$pt+3BN$	$pt/2+3BN$
Exact histogram on DPHCD	$8N+3BN$	$8N+3BN$	$8N+3BN$
Approximate histogram on DPHCD	$8N+3BN$	$8N+3BN$	$8N+3BN$

Table 1 shows that during histogram estimation, the data transmission of HEDC++ is related to table size $t$ . By contrast, the optimal, worst, and average values of constructing exact and approximate histograms using the DPHCD algorithm are the same and independent of table size $t$ . The data transmission of the DPHCD algorithm depends only on the number of nodes $N$ in the cluster and the number of buckets $B$ in the histogram. Compared with $t$ , $N$ and $B$ are very small and negligible. The algorithm achieves a significant improvement in data transmission over the network. It utilizes the advantages of distributed computing and avoids network saturation when processing large amounts of data in the cloud database.
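To make the gap concrete, the formulas of Table 1 can be evaluated for the example of Section 4.1. The sketch below assumes $t=1{,}000{,}000$ , $N=4$ , $B=10$ , and $p=0.4$ (the table size and node count of the example, and a bucket number and sampling probability that appear in the experimental design); the function name `transmission_table` is hypothetical.

```python
def transmission_table(t, p, num_nodes, num_buckets):
    """Evaluate the (optimal, worst, average) formulas of Table 1."""
    hist = 3 * num_buckets * num_nodes           # 3BN, sub-histogram buckets
    dphcd = 8 * num_nodes + hist                 # 8N + 3BN
    return {
        "Exact histogram on HEDC++": (hist, t + hist, t / 2 + hist),
        "Approximate histogram on HEDC++": (hist, p * t + hist, p * t / 2 + hist),
        "Exact histogram on DPHCD": (dphcd, dphcd, dphcd),
        "Approximate histogram on DPHCD": (dphcd, dphcd, dphcd),
    }


rows = transmission_table(t=1_000_000, p=0.4, num_nodes=4, num_buckets=10)
for name, (best, worst, average) in rows.items():
    print(f"{name}: optimal={best}, worst={worst:.0f}, average={average:.0f}")
```

Under these assumptions the worst case of the exact HEDC++ histogram moves about one million values, whereas both DPHCD variants move 152 values in every case.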

Figure 7.

Line chart of Gaussian dataset.

5. Experimental evaluation

5.1 Experimental design

To verify the effectiveness of the DPHCD algorithm in constructing equi-width histograms, three different experiments are designed as follows:

Effect of the number of buckets. Exact histograms that contain different numbers of buckets are constructed for synthetic data sets to demonstrate how the number of buckets can affect the data distribution with desired accuracy. The numbers of buckets in the histograms are 10, 20, and 100. For the real dataset, the exact and approximate histograms are estimated to verify the effect of the tablet-level sampling mechanisms. The histogram built on real dataset is set to contain 5 buckets, and the sampling probability is set to 0.4.

Scalability evaluation. Experiments are conducted to evaluate the scalability by varying the node number of the testbed cluster for equi-width histogram estimation. The running time of the algorithm is observed by varying the number of machines in the cluster from 1 to 4.

Comparison of running time. Experiments are designed to compare the DPHCD algorithm against the HEDC $++$ algorithm based on MapReduce by evaluating the running time and data transmission during histogram construction.

Figure 8.

Effect of the number of buckets on the histogram. (a) 10-bucket histogram. (b) 20-bucket histogram. (c) 100-bucket histogram.

Figure 9.

Comparison between exact and approximate histograms.

5.2 Experimental environment and datasets

All experiments are performed on a cluster running the Xugu cloud database, which is independently developed by a Chinese company in Chengdu, Sichuan. The cluster consists of 3 DELL PowerEdge R730 rack servers. Each machine is equipped with one Xeon E5-2603 v3 processor, 8 GB of memory, and 1.2 TB of disk, and the machines are connected by Gigabit Ethernet. The Xugu cloud database can run on Linux, Windows, and various Unix platforms. In this experiment, it runs on Windows Server 2008. The DPHCD algorithm is implemented using the C $++$ programming language in Microsoft Visual Studio 2012.

We conducted experiments on synthetic and real datasets to evaluate the performance gains achieved by the DPHCD algorithm. The synthetic dataset contains 100,000 records generated from a Gaussian distribution. Every tuple in the dataset includes three columns: primary key, data value, and data description. The maximum and minimum values are 4.2891 and $-$ 4.2486, respectively. The line chart of the Gaussian dataset is shown in Fig. 7. The real dataset was collected by the social computing research group at the University of Minnesota for recommendation systems. It contains 24,000,000 ratings and 670,000 tag applications on 40,000 movies by 260,000 users.

5.3 Experimental results and analysis

5.3.1 Effect of the number of buckets

In this experiment, each record in the synthetic dataset is inserted into a table in the cloud database. Three histograms, which contain 10, 20, and 100 buckets, are estimated using the DPHCD algorithm, and the results are shown in Fig. 8.

As shown in Fig. 8, the equi-width histogram with 100 buckets describes the data distribution more accurately than the histograms containing 10 and 20 buckets. The more buckets a histogram contains, the more detailed its description of the data distribution. However, increasing the number of buckets inevitably raises storage and computing overhead. When the size of the dataset is fixed, the accuracy of the data distribution cannot be improved by increasing the number of buckets excessively. The number of buckets should therefore be determined by the data size, the data characteristics, and the accuracy desired under the specific conditions.
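The construction evaluated here can be illustrated with a minimal Python sketch in which each node builds a sub-histogram over shared bucket boundaries and the sub-histograms are summed. It assumes that each node's rows are available as an in-memory list; the function names and the left-closed bucket convention are illustrative assumptions, not the Xugu implementation.

```python
from typing import List, Sequence


def bucket_index(value: float, lo: float, hi: float, num_buckets: int) -> int:
    """Map a value to its equi-width bucket; the global maximum goes to the last bucket."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * num_buckets)
    return min(idx, num_buckets - 1)


def sub_histogram(values: Sequence[float], lo: float, hi: float, num_buckets: int) -> List[int]:
    """Count one node's values against the shared (global) bucket boundaries."""
    counts = [0] * num_buckets
    for v in values:
        counts[bucket_index(v, lo, hi, num_buckets)] += 1
    return counts


def global_histogram(partitions: Sequence[Sequence[float]], num_buckets: int) -> List[int]:
    """Build the global histogram from per-node partitions."""
    # Step 1: combine the local extremes into the global minimum and maximum.
    lo = min(min(p) for p in partitions if p)
    hi = max(max(p) for p in partitions if p)
    # Step 2: every node builds its sub-histogram with the same boundaries.
    subs = [sub_histogram(p, lo, hi, num_buckets) for p in partitions]
    # Step 3: accumulate the sub-histograms bucket by bucket.
    return [sum(column) for column in zip(*subs)]
```

Calling `global_histogram` with 10, 20, and 100 buckets on the same partitions changes only the length of the count lists exchanged between nodes, not the number of rows that leave a node.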

Approximate and exact histograms are estimated on the real dataset, as shown in Fig. 9. The real dataset contains 24,000,000 ratings applied to 40,000 movies by 260,000 users. Figure 9 shows that ratings in the interval (3, 4] are the most frequent and account for nearly 40% of all the data. The comparison between the exact histogram constructed on the entire data and the approximate histogram constructed on the samples verifies that tablet-level sampling also provides accurate estimates for equi-width histograms. Histograms provide data distribution information that is valuable for rating prediction, user analysis, and so on.
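The comparison between exact and approximate histograms can be sketched in Python as follows. The sketch assumes that a tablet is a contiguous chunk of rows and that tablet-level sampling keeps each whole tablet with probability $p$; this sampling scheme, the function names, and the error measure are illustrative assumptions rather than the procedure used in the experiment.

```python
import random
from typing import List, Sequence


def relative_frequencies(values: Sequence[float], lo: float, hi: float, num_buckets: int) -> List[float]:
    """Equi-width histogram over [lo, hi], normalised so the buckets sum to 1."""
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * num_buckets), num_buckets - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * num_buckets


def sample_tablets(tablets: Sequence[Sequence[float]], prob: float, rng: random.Random) -> List[Sequence[float]]:
    """Keep each whole tablet with probability `prob` (assumed sampling scheme)."""
    return [t for t in tablets if rng.random() < prob]


def max_abs_error(exact: Sequence[float], approx: Sequence[float]) -> float:
    """Largest per-bucket gap between two normalised histograms."""
    return max(abs(e - a) for e, a in zip(exact, approx))
```

Normalising both histograms to relative frequencies makes them comparable even though the approximate one is built on fewer rows.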

Figure 10.

Scalability experiments. (a) Scale-up experiment on synthetic dataset. (b) Scale-up experiment on real dataset.

5.3.2 Scalability evaluation

Scale-up experiments are conducted to evaluate the scalability of the DPHCD algorithm. Ideally, as machines are added to or removed from the cluster, the running time of the histogram task should decrease or increase by the same factor. Two scale-up tests are conducted, one for the synthetic dataset and one for the real dataset. For the synthetic dataset, the running times of estimating an exact histogram are compared while varying the number of machines from 1 to 4. The approximate histogram is constructed 10 times on the real dataset under the same conditions as for the synthetic dataset. The results are shown in Fig. 10. For both datasets, the running time of DPHCD decreases as the number of nodes in the cluster increases. This is because when machines are added to the cluster, more nodes can process the histogram task after the workload is balanced across nodes. The results demonstrate that the DPHCD algorithm reduces histogram construction time through parallel computing in a distributed cluster and scales with cluster size.

5.3.3 Comparisons of running time and data transmission

5.3.3.1 Execution time

In [21], HEDC $++$ , which is based on MapReduce, provides efficient approaches for equi-width and equi-depth histograms. Its equi-width histogram construction method is similar to the DPHCD algorithm. In this section, experiments are conducted to compare the execution time of DPHCD with that of HEDC $++$ . Exact equi-width histograms are generated on the synthetic and real datasets. The DPHCD algorithm is implemented in the cloud database, whereas HEDC $++$ is implemented over MapReduce. The construction times of the histograms are shown in Fig. 11. DPHCD constructs the histograms more than 20 times faster than HEDC $++$ . This result does not mean that the proposed algorithm itself is 20 times better than HEDC $++$ ; part of the gap arises because MapReduce performs worse than a relational parallel database under the same environmental conditions. Structured data are stored in the relational database, whereas the data stored in HDFS do not have a pre-defined data model or are not organized in a pre-defined manner. Nevertheless, the DPHCD algorithm uses parallel computing to divide the histogram task into small sub-histogram tasks that can be executed simultaneously, and it achieves higher performance than a single-node RDBMS.

5.3.3.2 Data transmission over the network

In Section 4, we showed that the data transmission over the network using the DPHCD algorithm is $Q=8N+3BN$ , which is independent of table size $t$ . Estimating the histogram using HEDC $++$ based on MapReduce requires the transfer of sample data from the map phase to the reduce phase. The data transmission of the proposed algorithm is compared with that of HEDC $++$ when constructing histograms on the synthetic and real datasets. The results are shown in Table 2. Each histogram contains 10 buckets, the experiment is conducted on a three-node cluster, and the sampling probability is $p$ . Table 2 shows that HEDC $++$ requires the transfer of the file’s detail data. Although constructing an approximate histogram from sampled data reduces the data transmitted over the network, the sampled data still need to be converted into $<$ key, value $>$ pairs and transferred from the map phase to the reduce phase. The optimal data transmission of HEDC $++$ is lower than that of DPHCD. However, this optimum is reached only if hash partitioning sends all data of each node to the local reducer, which is practically impossible. The average and worst values of DPHCD are much lower than those of HEDC $++$ and are independent of table size $t$ , because the proposed algorithm transmits only small histogram information. Table 2 also shows that the data transmissions on the synthetic and real datasets are identical when the number of nodes in the cluster and the number of histogram buckets are fixed; that is, the data transmission of DPHCD is independent of file size. For example, if a histogram containing 20 buckets is constructed using the DPHCD algorithm in a three-node cluster, the data transmission of the cluster is $Q=8\times 3+3\times 20\times 3=204$ .

Table 2
Comparison with HEDC $++$ on data transmission

Algorithm	Data transmission
	Synthetic dataset			Real dataset
	Optimal	Worst	Average	Optimal	Worst	Average
Exact hist on HEDC $++$	90	100090	50090	90	$2.4*10^{7}+90$	$1.2*10^{7}+90$
Approximate hist on HEDC $++$	90	$p*10^{5}+90$	$p*5*10^{4}+90$	90	$p*2.4*10^{7}+90$	$p*1.2*10^{7}+90$
Exact hist on DPHCD	114	114	114	114	114	114
Approximate hist on DPHCD	114	114	114	114	114	114
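
As a check on the entries of Table 2, the formulas of Table 1 can be evaluated for the experimental setting $N=3$ and $B=10$ , with $t=10^{5}$ for the synthetic dataset:

$$Q_{\mathrm{DPHCD}} = 8N + 3BN = 8\times 3 + 3\times 10\times 3 = 114$$

$$Q_{\mathrm{HEDC++}}^{\mathrm{worst}} = t + 3BN = 10^{5} + 3\times 10\times 3 = 100090$$

The labels $Q_{\mathrm{DPHCD}}$ and $Q_{\mathrm{HEDC++}}^{\mathrm{worst}}$ are introduced here only for illustration. Replacing $t$ by $2.4\times 10^{7}$ gives the worst-case entry for the real dataset, whereas the DPHCD value is unchanged.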

Figure 11.

Comparison with HEDC $++$ on execution time.

6. Related work

Histograms play an important role in cost-based query optimization, approximate query processing, and load balancing in databases. In the Oracle database, histograms are used to evaluate data distribution to optimize query plans. Histogram estimation has been studied extensively for single-node RDBMSs. However, these methods have many limitations in the big data environment.

Ioannidis [12] surveyed the history of histograms and their comprehensive applications in data management systems. Poosala et al. [13] analyzed different types of histograms and their properties. Chaudhuri et al. [14] proposed an approximate histogram construction method based on sampled data and provided the exact relation between the size of the sampled data and the histogram. Luo et al. [15] developed an adaptive histogram construction method for compressed databases. The method tracks hot data in compressed databases by scheduling batched queries and uses the feedback in query results to accelerate the convergence of the constructed adaptive histogram, which can be maintained incrementally. Bruno et al. [16] introduced a “workload-aware” histogram called STHoles that allows bucket nesting to capture data regions with reasonably uniform tuple density. Kanne and Moerkotte [17] designed new bucket types that do not store the number of distinct values and the average frequency in the buckets of the histogram. All the above methods focus on histogram estimation in a single-node database management system. Adapting them to the cloud environment requires sophisticated considerations. The proposed method is concerned with building histograms in parallel in the cloud environment.

With the arrival of the big data era, several scholars have begun to study parallel histogram construction methods based on the MapReduce framework. Jestes et al. [18] proposed a wavelet histogram construction algorithm based on MapReduce using a tuple-level sampling method. Tang [19] presented a comprehensive study on scalable histograms for large probabilistic datasets in MapReduce, focusing on the V-optimal histogram based on the expectation-based semantics for the value and tuple models. The MaxDiff histogram construction method based on MapReduce is similar to the V-optimal histogram estimation method, with the histogram type as the only difference [20]. Shi et al. [21] extended the original MapReduce framework by adding a sampling phase and a statistical computing phase, focusing on estimating the equi-width and equi-depth histograms for data in the cloud. To fully utilize the data generated in the sampling phase, they adopted the HDFS block as the sampling level. In the MapReduce programming model, each job is divided into two phases: a map phase and a reduce phase. The above techniques based on MapReduce require the transfer of data over the network from the map phase to the reduce phase. They retrieve tuples from the block randomly and send the outputs of mappers with a certain probability, which aims to provide unbiased estimated histograms and reduce communication. However, the sampling method has a limitation: the sampled data must still be transferred over the network, which consumes a certain percentage of network bandwidth. Yıldız et al. [10] proposed a merge-based histogram construction method with a histogram processing framework that constructs an equi-depth histogram for a given time interval. This method, which requires transferring precomputed equi-depth histograms of data partitions, is similar to our proposed algorithm. They implemented the method on Hadoop, whereas our method is applied in a relational cloud database.

The histogram computational task can be broken into several subtasks that can be processed independently in a distributed cluster based on the MapReduce architecture. The performance is improved significantly compared with the traditional database. However, the above approaches require the transfer of the converted intermediate key-value pairs from the map phase to the reduce phase, which results in high data transmission over the network. The proposed algorithm utilizes the advantages of parallel computing and reduces the amount of data transmission over the network for large-scale data in the cloud database. Only a small amount of information and the sub-histograms built in the distributed nodes need to be transferred over the network, so data transmission is significantly reduced. The distribution of data may be skewed and lead to highly varying execution times in distributed nodes: a node with low load has to wait for a node with high load, which reduces the overall performance of the algorithm. The load balance between different nodes during sub-histogram estimation is not considered in this paper.

7. Conclusions

In this study, we address the problem of constructing equi-width histograms in the cloud database. We propose a novel solution that builds several sub-histograms in a distributed cluster and requires the transmission of only a small amount of information to obtain the summary of data distribution. Each node in the cluster estimates a sub-histogram according to the global maximum and minimum values of the entire cluster. Then, all the sub-histograms are directly accumulated to obtain the global histogram. During histogram estimation, only a small amount of histogram information needs to be transferred over the network. The approximate histogram construction using tablet-level sampling achieves the desired accuracy. The algorithm is applied to the Xugu cloud database management system. Experimental results show that DPHCD performs better than HEDC $++$ in terms of running time and data transmission.

Acknowledgments

This work is partly supported by the National Science Foundation of China under Grant No. 61640209, the Science & Technology Project of Sichuan under Grant No. SCMZ2006012, and the Science & Technology Project of Guizhou under Grant Nos. [2014] 2004, [2014] 2001, [2016] 7433 and [2015] 13.

Authors’ Bios

Yang Wang received his bachelor's degree in computer science from Xuchang University in 2011 and his MSc degree from Guizhou University in 2014. He joined the Chengdu Institute of Computer Application, Chinese Academy of Sciences, as a PhD student in 2014. His research interests include big data analytics and distributed and parallel computing.

Yong Zhong received his MSc degree from Chengdu University of Technology in 1994 and his PhD degree from the University of Chinese Academy of Sciences in 2002. He is a research fellow at the Chengdu Institute of Computer Application, Chinese Academy of Sciences. His research interests include big data analytics and data mining.

Qingshan Ma received his bachelor's degree in computer science from Shanxi University in 2014. He joined the Chengdu Institute of Computer Application, Chinese Academy of Sciences, as an MSc student in 2015. His research interests include big data analytics and join algorithms based on the MapReduce model.

Guanci Yang received his MSc degree from Guizhou University in 2009 and his PhD degree from the University of Chinese Academy of Sciences in 2012. He is an associate professor at Guizhou University. His research interests include computational intelligence and social robots, multi-objective optimization of complex systems, and manufacturing big data. He is a member of the Institute of Electrical and Electronics Engineers and the China Computer Federation.

References

Chang, Dean, Ghemawat et al., Bigtable: A distributed storage system for structured data, Proceedings of the 7th Symposium on Operating System Design and Implementation, Seattle, Washington, USA (6–8 November 2006), 205–218.

Apache Software Foundation. https://en.wikipedia.org/wiki/Apache_HBase.

Lakshman and Malik, Cassandra: A decentralized structured storage system, ACM SIGOPS Operating Systems Review 44(2) (2010), 35–40.

Mongodb Inc. https://www.mongodb.com.

Redislabs. http://www.redis.io.

Corbett J.C., Dean, Epstein et al., Spanner: Google’s globally distributed database, ACM Transactions on Computer Systems (TOCS) 31(3) (2013), 1–22.

Pokorný, Database technologies in the world of big data, Proceedings of the 16th International Conference on Computer Systems and Technologies, Dublin, Ireland (25–26 June 2015), 1–12.

Weiss, A technical overview of the Oracle Exadata database machine and exadata storage server, Oracle White Paper, Oracle Corporation, Redwood Shores (2012).

Cheng C.X., Song X.M. and Zhou C.H., Generic cumulative annular bucket histogram for spatial selectivity estimation of spatial database management system, International Journal of Geographical Information Science 27(2) (2013), 339–362.

10. Yıldız, Büyüktanır and Emekci, Equi-depth histogram construction for big data with quality guarantees, arXiv preprint, arXiv:1606.05633, 2016.

11. MapR Technologies. https://mapr.com/blog/in-depth-look-hbase-architecture/.

12. Ioannidis, The history of histograms (abridged), Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, Germany (9–12 September 2003), 19–30.

13. Poosala, Haas P.J., Ioannidis et al., Improved histograms for selectivity estimation of range predicates, SIGMOD ’96 Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada (4–6 June 1996), 294–305.

14. Chaudhuri, Narasayya and Motwani, Random sampling for histogram construction: how much is enough? SIGMOD ’98 Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA (1–4 June 1998), 436–447.

15. Luo J.Z. and Wang H.Z., Construction of an adaptive histogram in compressed database, Journal of Software 20(7) (2009), 1785–1799.

16. Bruno, Chaudhuri, Gravano et al., STHoles: A multidimensional workload-aware histogram, SIGMOD ’01 Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California, USA (21–24 May 2001), 211–222.

17. Kanne C.C. and Moerkotte, Histograms reloaded: The merits of bucket diversity, SIGMOD ’10 Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, USA (6–10 June 2010), 663–674.

18. Jestes and Li, Building wavelet histograms on large data in MapReduce, Proceedings of the VLDB Endowment 5(2) (2011), 109–120.

19. Tang M.W., Efficient and scalable monitoring and summarization of large probabilistic data, SIGMOD ’13 PhD Symposium Proceedings of the 2013 SIGMOD/PODS PhD Symposium, New York, USA (23 June 2013), 61–66.

20.

Zhang