A scalable approach to model big and interacted queries for materialized view through data mining

Abstract

With big data, the volume of manipulated data is growing rapidly. Several business sectors contribute to this expansion, including (i) social networks, (ii) sensors, and (iii) commercial transactions. To cope with this reality, small, medium, and large businesses need analytical applications, such as scalable data warehouses, that can effectively answer big and interacted queries. This interaction is exploited during the physical design of the data warehouse through optimization structures such as materialized views. Selecting an appropriate set of views to materialize under resource constraints is known as the view selection problem (VSP). In this paper, we propose an approach that solves the VSP by drawing on multi-query optimization to generate a global execution plan that integrates the new dimensions of big and interacted queries. To ensure scalability, we use K-means clustering and refinement operations to capture the volume of interacted queries without enumerating all the logical plans of the queries, and we use the resulting plan to materialize views. Finally, experiments are conducted to show the scalability of our approach.

Keywords

Data warehouse, query processing, view selection problem

1. Introduction

Big data concerns every business and every life on the planet, owing to the huge amount of data generated on social network platforms, the deployment of many kinds of sensors, and the explosion of Internet usage. According to an IBM report from 2018, every minute 72 hours of footage are uploaded to YouTube, 216,000 Instagram posts are published, and 204,000,000 emails are sent. In addition, global Internet traffic is estimated at 50,000 GB per second. Business leaders have to analyze this volume of data in order to make decisions, so new analytical applications are needed that meet these new scalability requirements. Among existing analytical applications, we are interested in the development of a scalable data warehouse (DW). A data warehouse can be made scalable through query optimization and physical design, so that it quickly and effectively answers a large number of complex Online Analytical Processing (OLAP) queries, known as the “big queries phenomenon” [30]. This phenomenon reflects the repetitive nature of the queries, which are in “big interaction” because they share several algebraic operations such as joins and aggregations. The interaction stems from modeling the data warehouse as a star schema, or one of its variants, in which queries systematically pass through the fact table to perform joins. The database administrator exploits this interaction during the physical design of the data warehouse in order to optimize the whole set of queries using optimization structures such as materialized views.

A materialized view is a redundant optimization structure that reduces query response time. It consists of pre-computing the most expensive operations, such as joins and aggregations, and storing their results in the database [1]. Queries are then processed efficiently using the data stored in the views, without accessing the original data. Although materialized views improve query performance, the database administrator cannot materialize all possible views, because doing so raises the following problems: (i) the base tables of the data warehouse evolve through updates, and these changes must be reflected in the materialized views; (ii) the queries must be rewritten in terms of these views [26], which is a difficult task because an equivalent and better rewriting of the original query must be found; and (iii) storing all these views requires more disk space. The problem is therefore to select the most beneficial views to materialize in order to minimize the total execution cost of the workload while respecting constraints such as the storage and maintenance costs. This problem is known as the “View Selection Problem (VSP)”.

The VSP can be formalized as follows: given a workload of queries $Q={\{}Q_{1},Q_{2},\ldots,Q_{n}{\}}$ , a set of constraints $C$ (storage cost, maintenance cost, etc.), and a set of Non-Functional Requirements (NFR), e.g., query processing cost, monetary cost, or energy consumption, the VSP consists in selecting a set of materialized views that satisfies the NFR and respects the constraints $C$ . Several kinds of algorithms have been developed to solve this problem, such as deterministic, randomized, evolutionary, and hybrid algorithms [3]. However, these works do not consider the interaction between queries, and they explore a large search space to find the views to materialize, which can lead to a high computing time. In addition, the quality of their solutions can be unsatisfactory because it depends on the random choice of the algorithm’s parameters, for example the choice of the first generation and the crossover and mutation probabilities in evolutionary algorithms.

To deal with these weaknesses, some works have relied on the interaction between queries [7, 15]. This concept was first identified by Timos Sellis and gave rise to another problem known as “Multi-Query Optimization (MQO)”, which consists in optimizing the total performance of the workload by generating a global execution plan [4]. Obtaining this plan is an NP-hard problem. The work of Yang et al. [7] was the first to combine the generation of this plan with the selection of materialized views. The main problem with these works is their scalability, as they do not take into account the big queries phenomenon, in particular the data volume, the query volume, and the big interaction between queries. Traditional VSP methods should therefore integrate new aspects to deal with these big queries issues. To the best of our knowledge, only Boukorca et al. have proposed a scalable solution, by borrowing a graph partitioning tool used in the electronic design automation domain [24]. However, this tool suffers from a data preprocessing cost and does not ensure any dynamicity.

In this study, we focus on a data mining technique, namely the K-means clustering method, which supports scalability by handling extremely large databases and has been implemented on the Hadoop MapReduce framework [5]. K-means exploits the workload (i.e., the big queries) to capture the interaction between queries. We first propose to group into clusters the queries that are similar with respect to the most expensive operator, which is the join. To maximize this interaction, we then apply refinement operations that reduce the number of clusters produced by K-means. This refinement uses an intra-cluster function that merges the clusters sharing join operators in order to increase the reuse of these operators. K-means and the intra-cluster function are used to model the big interaction between queries without going through the logical plans of each query, in order to generate the global execution plan that guides the selection of the views to materialize.

The paper is organized as follows: Section 2 presents the related work. In Section 3, we explain our contribution by modeling the interaction of the queries in order to select the views to be materialized. Section 4 presents our results to select materialized views. Finally, we conclude the paper by discussing the obtained results.

2. Literature review

Several works have addressed the view selection problem in various contexts, such as semantic databases [19, 20], distributed databases [21], and the physical design of data warehouses for optimizing OLAP queries [7]. Mani and Bellahsene [3] provided a classification of VSP studies along the following dimensions:

In the dimension of constraints (with or without), the first studies on the VSP assumed the absence of constraints [7, 8]. A first problem with these works is that storing the selected views may take more than the available space, so storage space became the main constraint integrated into the selection process [29]. A second problem is that, when the base tables are updated, the selected views must be updated as well, so the maintenance cost became an important constraint [1].

In the studies that follow the non-functional requirements dimension, the selected views must satisfy certain objectives, such as minimizing query response time [9] or maintenance cost [1], or meeting energy consumption requirements [10]. These objectives can be combined to give a multi-objective formalization of the VSP, as in [7], where response time and maintenance cost were considered.

In the resolution algorithm dimension, simple and advanced algorithms have been proposed to solve the VSP, such as deterministic algorithms [2, 9, 22], randomized algorithms [18, 23, 26, 27], evolutionary algorithms [28, 29], and hybrid algorithms [7, 17].

In the dimension of the nature of the view selection, static approaches suppose that the query workload is fixed [7, 11, 24], whereas dynamic approaches are applied as queries arrive, so that the workload is built incrementally and changes over time [12, 13, 14].

Each work on the VSP combines one or more of the dimensions presented above. In this paper, we review studies on the VSP along the multi-query optimization (MQO) dimension. Most of these studies detect the common sub-expressions between the queries of the workload by merging their execution plans into a single global plan. This plan is used as a data structure for the view selection problem. The most used data structures in the literature are:

The AND/OR graph data structure is a directed acyclic graph (DAG) that represents the union of all possible execution plans of each query. This plan was defined by Roy et al. [16] and is composed of two types of nodes: operation nodes (selection, join, projection, etc.) and equivalence nodes. In the work of Gupta [22], the authors used this AND-OR data structure to develop a theoretical framework for the VSP under the storage space constraint. They proposed a greedy algorithm that, at each iteration, selects the view with the maximal benefit per unit of space. The execution time of this algorithm is $O(In^{2})$ , where $I$ is the number of iterations and $n$ the number of nodes in the plan; a greedy strategy is used because an algorithm based on exhaustive search cannot compute the optimal solution in a reasonable time, owing to the complexity of the problem. To avoid this kind of search, they also proposed a greedy-interchange algorithm, which starts with the solution generated by the first algorithm and improves it by interchanging already selected views with views not yet selected, repeating such interchanges until no further improvement is obtained. This second algorithm is unsatisfactory in terms of solution quality, since the initial solution considerably influences the solution found.

Lee and Hammer [23] used an OR DAG for the VSP under the maintenance cost constraint, in which each node of the graph represents a view. They applied a genetic algorithm in which a chromosome represents the set of views to materialize, coded as a string ( $b_{1}$ , $b_{2}$ , $b_{3}$ , $\ldots$ , $b_{m}$ ), where $b_{i}=1$ if the view $v_{i}$ is selected for materialization and $b_{i}=0$ otherwise. The algorithm starts with a generation of 30 chromosomes. The popular roulette wheel method is used to select the parent chromosomes for the next generation, the crossover and mutation operators are then applied, and a fitness function that takes the maintenance cost into account evaluates each solution. The process is repeated for 400 generations. The experiments showed that their genetic algorithm outperforms the existing solutions to the view selection problem under the maintenance cost constraint within the OR DAG global plan framework. However, we note that their results depend on the random choice of the first generation and on the crossover and mutation probabilities.

The Data Cube Lattice data structure is a plan used in the context of multi-dimensional data warehousing, in which the nodes represent views that contain grouping operations and the edges define the relations between the views. Harinarayan et al. [2] proposed a greedy algorithm that uses this data cube lattice under the storage constraint, and showed that its performance is very close to the optimal. However, it traverses the space of possible solutions at a high level of granularity and may therefore miss a good solution. The authors of [26] found that their implementation of the algorithm of Harinarayan et al. [2] ran for over six hours to select a set of views fitting in 1% of the data cube space for a flat fact table with 14 dimensions, so it is not effective for a large number of dimensions. They adapted randomized search methods to the VSP under the storage and maintenance cost constraints, and proposed transformation rules that help the algorithms move through the search space of valid view selections in order to identify sets of views that minimize the query cost. In their experiments, the randomized algorithms provided solutions very close to the optimal in a limited time compared with the work in [2]. We note that the results depend on the proposed transformation rules.

The Multi-View Processing Plan (MVPP) data structure is a directed acyclic graph in which the root nodes represent the queries, the leaf nodes are the base tables, and the intermediate nodes are algebraic operators such as selection, join, and projection. In this paper, we use the MVPP. This data structure was described by Yang et al. [7], who proposed two algorithms for constructing it. The first algorithm, called “A feasible Solution”, starts by identifying the optimal plan of each query and then orders these plans according to the query frequency multiplied by the query processing cost. Once the order of the optimal plans is fixed, the algorithm takes the first optimal plan and merges it with the second using their common sub-expressions, then merges the result with the third, and so on, until all the plans have been considered. For each generated MVPP, the unary operators (selection and projection) are pushed down as far as possible, the view selection algorithm is applied under the maintenance cost constraint, and the cost of the MVPP is calculated. The algorithm chooses the MVPP that offers the best combination of view maintenance cost and query processing cost. This algorithm is very expensive in terms of computation and does not guarantee the optimal MVPP. To overcome these limitations, a second algorithm based on 0–1 integer programming was proposed, which follows these steps [7]:

Step 1:
Generate, for each query $q_{i}$ , all the possible join plans $P={\{}p_{1},\ldots,p_{k}{\}}$ .
Step 2:
Identify all the possible join sub-trees of each generated plan $p_{i}\in P$ . Let $S={\{}s_{1},\ldots,s_{m}{\}}$ be the set of join patterns derived from $P$ .
Step 3:
Build the usage matrix A, where the coefficient $a_{ij}$ indicates whether or not the query $q_{i}$ can be executed by the plan $p_{j}$ .
Step 4:
Build the usage matrix B, where the coefficient $b_{ij}$ takes the value 1 if the join pattern $s_{j}$ is contained in the join tree $p_{i}$ .

Finally, the problem of selecting the optimal MVPP is reduced to selecting a set of plans ${\{}p_{1},\ldots,p_{k}{\}}$ that minimizes the total query processing cost of the workload:

$\displaystyle x_{0}=\sum^{m}_{i=1}\textit{Ecost}(s_{i})\left(\sum^{i}_{j=1}b_{ij}x_{j}\right)$ (1)

where $\textit{Ecost}(s_{i})$ is the cost of the join pattern $s_{i}$ . The same view selection algorithm used in “A feasible Solution” is then applied to the optimal MVPP in order to select the views that have a positive benefit when the query processing cost is weighed against the maintenance cost. This algorithm suffers from a high complexity, $O(2^{n})$ where $n$ is the number of queries, because it enumerates all the possible plans of each query.

Most works that study the impact of the Multi-View Processing Plan (MVPP) on solving the VSP build the MVPP by enumerating all the logical plans of the queries. These plans may or may not be optimal, which is why these works suffer from poor scalability. To overcome this problem, Boukorca et al. proposed a hypergraph structure that groups the queries into several connected components [24]. Each component contains queries that have a high interaction. This approach generates the MVPP in the following stages:

Stage 1:
Constructing the hypergraph of joins: in graph theory, a hypergraph H consists of a set of vertices V and a set of hyperedges E. In their case, V represents the set of join nodes and E represents the query workload Q. A hyperedge $e_{j}$ connects the set of vertices corresponding to the join nodes that participate in the execution of the query $q_{j}$ .
Stage 2:
Partitioning the hypergraph: the authors adapted an existing graph-theory algorithm, HMETIS, to divide a hypergraph into $k$ partitions (new hypergraphs) so that the number of hyperedges cut is minimal. Each of these new hypergraphs is then treated separately when constructing the MVPP of the queries.
Stage 3:
Transforming a hypergraph into a directed MVPP graph: this consists of finding an order between the join nodes of each small hypergraph. For that, they used three algorithms. The first finds the pivot node, which corresponds to the first join operation in the MVPP. The second describes the steps for transforming a hypergraph into an MVPP using the pivot node of maximum benefit. The third splits a hypergraph into two hypergraphs around a pivot node: the first contains the hyperedges that use the pivot, and the second contains the other hyperedges. Finally, they selected the views to materialize under the storage and maintenance cost constraints for each component.

The experimental results show the effectiveness of this approach and its scalability on a large query workload when selecting materialized views. However, the authors used a graph partitioning tool, HMETIS, which requires a preprocessing phase that prepares the data (the hypergraph) beforehand as files passed to the partitioning phase; it therefore suffers from a preprocessing cost, as shown by the authors of [6]. In addition, this approach does not ensure any dynamicity: the resulting partitions are static, so when the query workload changes, the preprocessing and partitioning phases must be redone to find new partitions.

To generate the multi-view processing plan used to select the views to materialize, we propose an approach that follows the divide-and-conquer principle. We use K-means clustering and refinement operations to divide the search space into disjoint sub-spaces that can be processed in parallel. We define similarity and dissimilarity measures in order to quickly find a desirable clustering in a single execution of K-means. The objective is to ensure dynamicity by using these similarity and dissimilarity measures to adapt the resulting clusters to changes in the workload.

3. Our approach: Generation of the multi-view processing plan for materialized view selection

Our analysis of state-of-the-art studies on the VSP that use the multi-query optimization (MQO) dimension shows that these works are not scalable in terms of the volume of queries processed. They build a Multi-View Processing Plan (MVPP) by enumerating all the logical plans of each query, which makes the search space very large when looking for the best MVPP from which to select a suitable set of views to materialize. Consequently, they cannot handle the big queries phenomenon. To ensure scalability, we propose to build an MVPP without going through the logical plans of each query, which avoids a large search space and reduces the complexity of generating the MVPP in the presence of big queries. Our approach follows the divide-and-conquer principle: it divides the search space into several small disjoint sub-spaces that can be processed in parallel, where each sub-space contains queries that interact strongly with one another. To achieve this scalability, we assume that all the joins of the workload have the same priority as candidates for the first join of the global execution plan of the MVPP. There is no predefined order between the joins of each query; the resulting MVPP determines the optimal execution plans of the queries.

We represent every query in the workload by a set of sub-plans, each with a single join and two selection predicates; the union of these sets gives the candidate sub-plans from which we build our MVPP. The idea of the approach is to group in the same cluster the queries that are similar with respect to the most expensive sub-plan with a single join. This motivates the use of K-means clustering, with the proposed similarity and dissimilarity measures and the intra-cluster function, to deal with a large volume of interacted queries, and then to find independently the MVPP associated with each cluster, which is used for the VSP. The stages of our approach, illustrated in Fig. 1, are detailed below.

Figure 1.

Multi-view processing plan (MVPP) generation approach.

3.1 Construction of the training dataset

The first stage analyzes the query workload in order to build the clustering context. The workload we consider is a set of SJP (Selection, Join, Projection) queries, each composed of selection, join, and projection operators. We first build all the sub-plans with a single join and two selection operators. To do so, we parse each query from its SQL code $<$ select from where $>$ to extract its nodes, where each node represents an algebraic operator (selection, join, projection) and is described by an identifier, a name, and a predicate. The following nodes are extracted:

Base table nodes. From the $<$ from $>$ clause, we extract the base tables referenced by the queries. Following the star schema of the data warehouse, these are the fact table and the dimension tables, and we record them in a set T without duplicates.

Selection nodes. From the $<$ where $>$ clause, we extract the selection nodes, each determined by a selection predicate, and place them in a set of selections S without duplicates.

Sub-plans with a single join. From the $<$ where $>$ clause, we extract for each query $Q_{i}$ all the possible sub-plans with a single join, $P={\{}p_{1},\ldots,p_{n}{\}}$ . Each generated sub-plan $p_{j}$ is determined by a join predicate and two nodes, a left node and a right node, taken from the set $S$ or $T$ . The sub-plans are placed in a set $P$ without duplicates.

From this knowledge of the query workload, we build the training dataset, represented by a matrix QP whose rows are the queries $Q_{i}$ and whose columns are the sub-plans with a single join $p_{j}$ . Each cell QP[ $i, j$ ] takes a binary value that indicates whether the query $Q_{i}$ can be executed by the sub-plan $p_{j}$ (value 1) or not (value 0), that is, whether $p_{j}$ is a candidate to be the first join of this query. We illustrate this step with the following example:

Example: Consider a workload of 10 OLAP (Online Analytical Processing) queries generated from the Star Schema Benchmark (SSB) [25], which contains a fact table, Lineorder, and four dimension tables: Customer, Supplier, Part, and Dates.

Step 1: Extract the sets T and S with their identifiers and predicates.

S
$S_{4}$ : D_YEAR $=$ 1918
$S_{5}$ : LO_DISCOUNT>=1 AND LO_DISCOUNT<=3
$S_{6}$ : LO_QUANTITY<25
$S_{7}$ : P_BRAND=’MFGR#2221’
$S_{8}$ : S_REGION=’ASIA’
$S_{9}$ : S_REGION=’EUROPE’

T
$T_{0}$ : LINEORDER
$T_{1}$ : DATES
$T_{2}$ : PART
$T_{3}$ : SUPPLIER

Step 2: Generate for each query ( $Q_{i}$ ) its possible candidate sub-plans ( $P_{i}$ ), each defined by a join predicate ( $\theta_{i}$ ) and a left and a right node, as illustrated in Fig. 2.

Figure 2.

Possible candidate sub-plans for each query.

From these sub-plans, defined for each query, we obtain the set of candidate sub-plans $P$ .

P
10: LO_ORDERDATE = D_DATEKEY, Nodeleft = $S_{6}$ and Noderight = $S_{4}$
11: LO_ORDERDATE = D_DATEKEY, Nodeleft = $T_{0}$ and Noderight = $S_{4}$
12: LO_ORDERDATE = D_DATEKEY, Nodeleft = $T_{0}$ and Noderight = $T_{1}$
13: LO_PARTKEY = P_PARTKEY, Nodeleft = $T_{0}$ and Noderight = $S_{7}$
14: LO_SUPPKEY = S_SUPPKEY, Nodeleft = $T_{0}$ and Noderight = $S_{8}$
15: LO_SUPPKEY = S_SUPPKEY, Nodeleft = $T_{0}$ and Noderight = $S_{9}$

Step 3: Build the training dataset, i.e., the matrix QP, from the set P (Table 1).

Table 1

Training dataset

	10	11	12	13	14	15
${Q}_{0}$	1	0	0	0	0	0
${Q}_{1}$	1	0	0	0	0	0
${Q}_{2}$	0	1	0	0	0	0
${Q}_{3}$	0	1	0	0	0	0
${Q}_{4}$	0	0	1	1	1	0
${Q}_{5}$	0	0	0	1	0	0
${Q}_{6}$	0	0	1	1	1	0
${Q}_{7}$	0	0	1	1	0	1
${Q}_{8}$	0	0	0	1	0	1
${Q}_{9}$	0	0	1	1	0	1
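
As a minimal sketch of Step 3, the matrix QP of Table 1 can be derived from the set of candidate sub-plans of each query. The names `SUB_PLANS`, `QUERY_SUB_PLANS`, and `build_qp` are illustrative and not part of the original approach; the per-query sets are transcribed from Table 1.

```python
# Candidate sub-plan identifiers (the columns of QP), as in the example.
SUB_PLANS = [10, 11, 12, 13, 14, 15]

# Sub-plans that can execute each query, transcribed from Table 1.
QUERY_SUB_PLANS = {
    "Q0": {10}, "Q1": {10}, "Q2": {11}, "Q3": {11},
    "Q4": {12, 13, 14}, "Q5": {13}, "Q6": {12, 13, 14},
    "Q7": {12, 13, 15}, "Q8": {13, 15}, "Q9": {12, 13, 15},
}


def build_qp(query_sub_plans, sub_plans):
    """Return {query: binary row}, with one column per candidate sub-plan."""
    return {
        query: [1 if p in used else 0 for p in sub_plans]
        for query, used in query_sub_plans.items()
    }


QP = build_qp(QUERY_SUB_PLANS, SUB_PLANS)
print(QP["Q4"])  # row of Q4 in Table 1
```

Storing each row as a plain list keeps the sketch dependency-free; a sparse representation would likely be preferable for a large workload.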

3.2 K-means clustering

In this step, we apply the K-means algorithm to the training dataset above in order to partition the big queries workload according to the set P. We group in the same cluster the queries that are totally similar, i.e., queries having an identical binary representation in the matrix QP. To that end, we define a similarity measure between two queries, sim( $Q_{a}$ , $Q_{b}$ ). In the matrix QP, every query is represented by a vector of length $v$ , with $Q_{a}=$ [ $Q_{a1},Q_{a2},\ldots,Q_{av}$ ] and $Q_{b}=$ [ $Q_{b1},Q_{b2},\ldots,Q_{bv}$ ], whose values indicate whether the query $Q_{i}$ can be executed by the sub-plan $p_{j}$ ( $Q_{ij}=1$ ) or not ( $Q_{ij}=0$ ). Note that no query has an all-zero vector: $\nexists Q_{i}$ with vector [0, 0, $\ldots$ , 0]. We say that $Q_{a}$ and $Q_{b}$ are similar if the vectors [ $Q_{a1},Q_{a2},\ldots,Q_{av}$ ] and [ $Q_{b1},Q_{b2},\ldots,Q_{bv}$ ] are equal. For example, $Q_{0}$ and $Q_{1}$ are similar, both having the vector [1, 0, 0, 0, 0, 0]. From this description, we define the similarity function as follows:

$\displaystyle\textit{sim}(Q_{a},Q_{b})=\left\{\begin{array}{ll}1&\text{if}\ (Q_{ap}=Q_{bp}=1\ \text{or}\ Q_{ap}=Q_{bp}=0)\ \text{for each}\ p=1..v\\ 0&\text{otherwise}\end{array}\right.$ (2)

We define the dissimilarity measure between two queries with respect to the set P by the following function diss( $Q_{a}$ , $Q_{b}$ ):

$\displaystyle\textit{diss}(Q_{a},Q_{b})=\left\{\begin{array}{ll}1&\text{if}\ \exists p=1..v\ \text{where}\ Q_{ap}\neq Q_{bp}\\ 0&\text{otherwise}\end{array}\right.$ (3)

For example, diss( $Q_{8}$ , $Q_{9}$ ) $=1$ because there exists at least one sub-plan with a single join in the set P, namely 12, that executes only one of the two queries: $Q_{9}$ can be executed by this sub-plan, whereas $Q_{8}$ cannot. The notions of similarity and dissimilarity are fully complementary: queries that are not similar according to the set P are necessarily dissimilar according to it. We use these measures to determine the number of clusters K required to obtain a desirable clustering.
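
The two measures in Eqs. (2) and (3) can be sketched as follows, assuming each query is given as its binary row of the matrix QP. The function names mirror the text; the rows are copied from Table 1.

```python
def sim(qa, qb):
    """Eq. (2): 1 if the two binary rows agree on every sub-plan, else 0."""
    return 1 if all(a == b for a, b in zip(qa, qb)) else 0


def diss(qa, qb):
    """Eq. (3): 1 if the rows differ on at least one sub-plan, else 0."""
    return 1 if any(a != b for a, b in zip(qa, qb)) else 0


# Rows of Table 1 (columns are the sub-plans 10..15).
Q0 = [1, 0, 0, 0, 0, 0]
Q1 = [1, 0, 0, 0, 0, 0]
Q8 = [0, 0, 0, 1, 0, 1]
Q9 = [0, 0, 1, 1, 0, 1]

print(sim(Q0, Q1), diss(Q8, Q9))
```

For any pair of rows of equal length, the two functions are complementary, so `sim(a, b) + diss(a, b)` is always 1.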

We apply K-means clustering with the proposed similarity and dissimilarity measures to the previous example. We initially obtain the following clusters:

$\displaystyle C_{0}={\{}Q_{2},Q_{3}{\}},\ \text{with}\ P={\{}11{\}}.$
$\displaystyle C_{1}={\{}Q_{4},Q_{6}{\}},\ \text{with}\ P={\{}12,13,14{\}}.$
$\displaystyle C_{2}={\{}Q_{0},Q_{1}{\}},\ \text{with}\ P={\{}10{\}}.$
$\displaystyle C_{3}={\{}Q_{7},Q_{9}{\}},\ \text{with}\ P={\{}12,13,15{\}}.$
$\displaystyle C_{4}={\{}Q_{5}{\}},\ \text{with}\ P={\{}13{\}}.$
$\displaystyle C_{5}={\{}Q_{8}{\}},\ \text{with}\ P={\{}13,15{\}}.$
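
Because totally similar queries have identical rows in QP, the six initial clusters of the example can be reproduced by grouping identical rows. The sketch below is a simplification that stands in for the K-means run (it is not K-means itself); the number of groups it returns corresponds to the number of clusters K suggested by the similarity measure. The name `group_identical` is hypothetical.

```python
from collections import defaultdict

# Rows of Table 1 (columns are the sub-plans 10..15).
QP = {
    "Q0": [1, 0, 0, 0, 0, 0], "Q1": [1, 0, 0, 0, 0, 0],
    "Q2": [0, 1, 0, 0, 0, 0], "Q3": [0, 1, 0, 0, 0, 0],
    "Q4": [0, 0, 1, 1, 1, 0], "Q5": [0, 0, 0, 1, 0, 0],
    "Q6": [0, 0, 1, 1, 1, 0], "Q7": [0, 0, 1, 1, 0, 1],
    "Q8": [0, 0, 0, 1, 0, 1], "Q9": [0, 0, 1, 1, 0, 1],
}


def group_identical(qp):
    """Group the queries whose binary rows are equal (sim = 1)."""
    groups = defaultdict(list)
    for query, row in qp.items():
        groups[tuple(row)].append(query)
    return sorted(groups.values())


clusters = group_identical(QP)
print(len(clusters), clusters)
```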

3.3 Application of the function intra_cluster

In this stage, we apply a refinement operation to reduce the number of clusters initially found by K-means, since some of these clusters share sub-plans with a single join. These interactions between clusters must be captured in order to maximize the reuse of the sub-plans by the various queries. We therefore merge clusters according to the sub-plan that is the most expensive and the most used by the clusters, i.e., the one that groups a maximal number of queries; we call it the score sub-plan. We use the following algorithm to find this score sub-plan S:

Algorithm 1: findScoreplan()
1.	Input: The clusters $C_{i}$
2.	Output: score sub-plan S
3.	Costmax $=$ 0
4.	For each sub-plan $p\in P$ do
5.	Nbr $=$ nbruse(p) (the number of clusters that use p)
6.	Cost $=$ costExecution(p) * Nbr (costExecution(p) is the processing cost of the sub-plan p)
7.	if (Cost $>$ Costmax) then S $=$ p; Costmax $=$ Cost
8.	End if
9.	End for

In Algorithm 1, for each sub-plan ( $p\in P$ ) we compute the number of clusters that use p and multiply it by the processing cost of p, in order to select the most expensive sub-plan with a single join: the score sub-plan S. We then merge the clusters that share this sub-plan S by applying Algorithm 2 iteratively. After each merge, we delete the merged clusters from the initial set of clusters found by K-means and update the variable nbruse(p), which represents the number of not yet merged clusters that use p.
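
A possible Python reading of Algorithm 1 is sketched below. The processing costs are not given in the example, so the uniform costs in `COSTS` are a hypothetical assumption made only to run the sketch; `CLUSTERS` holds the sub-plan sets of the six initial clusters.

```python
def find_score_plan(clusters, cost_execution):
    """Return the sub-plan maximizing cost_execution[p] * nbruse(p).

    clusters: {cluster name: set of sub-plan ids used by the cluster}
    cost_execution: {sub-plan id: processing cost} (assumed positive)
    """
    score_plan, cost_max = None, 0
    for p in sorted(set().union(*clusters.values())):
        nbr = sum(1 for used in clusters.values() if p in used)  # nbruse(p)
        cost = cost_execution[p] * nbr
        if cost > cost_max:
            score_plan, cost_max = p, cost
    return score_plan


# Initial clusters of the example and their sub-plans.
CLUSTERS = {
    "C0": {11}, "C1": {12, 13, 14}, "C2": {10},
    "C3": {12, 13, 15}, "C4": {13}, "C5": {13, 15},
}
# Hypothetical uniform costs: the ranking then depends only on nbruse(p).
COSTS = {p: 1.0 for p in range(10, 16)}

print(find_score_plan(CLUSTERS, COSTS))
```

With these assumed costs the function selects sub-plan 13, which is used by four clusters, in line with the score sub-plan reported for the example.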

Applying Algorithm 2 to the example above, we obtain new clusters after merging $C_{1}$ , $C_{3}$ , $C_{4}$ and $C_{5}$ , where S $=$ 13 groups 4 clusters containing 6 queries.

$\displaystyle C_{1345}={\{}Q_{4},Q_{5},Q_{6},Q_{7},Q_{8},Q_{9}{\}},\ \text{with}\ P={\{}13,15,12,14{\}}.$
$\displaystyle C_{0}={\{}Q_{2},Q_{3}{\}},\ \text{with}\ P={\{}11{\}}.$
$\displaystyle C_{2}={\{}Q_{0},Q_{1}{\}},\ \text{with}\ P={\{}10{\}}.$

3.4 Checking the independence between the clusters

At this level, we ensure that the final clusters have no sub-plan in common. When a cluster uses a sub-plan already used by another cluster, this sub-plan is duplicated and given a new identifier. For instance, if a cluster $C_{0}={\{}Q_{2},Q_{3}{\}}$ with $P={\{}11,15{\}}$ uses the sub-plan 15, then $p=15$ becomes 16 in $C_{0}$ .
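
The independence check can be sketched as a renaming pass over the final clusters. The function `make_independent` and the rule for choosing a fresh identifier (one more than the largest identifier in use) are assumptions made for illustration; the input reproduces the hypothetical case of the text in which $C_{0}$ also uses the sub-plan 15.

```python
def make_independent(clusters):
    """Give a fresh identifier to any sub-plan already used by an earlier cluster.

    clusters: {cluster name: set of sub-plan ids}, visited in insertion order.
    """
    seen = set()
    next_id = max(p for plans in clusters.values() for p in plans) + 1
    result = {}
    for name, plans in clusters.items():
        renamed = set()
        for p in sorted(plans):
            if p in seen:
                renamed.add(next_id)  # duplicate the shared sub-plan
                next_id += 1
            else:
                renamed.add(p)
                seen.add(p)
        result[name] = renamed
    return result


clusters = {"C1345": {12, 13, 14, 15}, "C0": {11, 15}, "C2": {10}}
independent = make_independent(clusters)
print(independent["C0"])
```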

Algorithm 2: intra_cluster()
1.	For all sub-plans $p\in P$ do
2.	While nbruse(p) $\neq$ 0 do
3.	$C=\emptyset$ (the set of merged clusters)
4.	$L=$ 0 (the number of merged clusters)
5.	S $=$ findScoreplan() (Algorithm 1)
6.	For ( $i=0;\ i<k;\ i++$ ) (k is the number of clusters)
7.	if ( $S\in c_{i}$ )
8.	$C=C\cup c_{i}$
9.	DeleteCluster( $c_{i}$ )
10.	L++
11.	End if
12.	End for
13.	For all $p\in C$ do
14.	nbruse(p) $=$ nbruse(p) - L (update the number of clusters that use the sub-plan p)
15.	End for
16.	End while
17.	End for
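
The merging loop can be sketched in Python as follows. This is a simplified variant rather than a line-by-line transcription of Algorithm 2: it recomputes nbruse(p) from the clusters that remain instead of decrementing a counter, and it assumes hypothetical uniform, positive processing costs, since none are given in the example.

```python
def find_score_plan(clusters, cost_execution):
    """Sub-plan maximizing cost_execution[p] * (number of clusters using p)."""
    score_plan, cost_max = None, 0
    for p in sorted({p for _, plans in clusters for p in plans}):
        nbr = sum(1 for _, plans in clusters if p in plans)
        cost = cost_execution[p] * nbr
        if cost > cost_max:
            score_plan, cost_max = p, cost
    return score_plan


def intra_cluster(clusters, cost_execution):
    """Repeatedly merge all remaining clusters that share the score sub-plan.

    clusters: list of (queries, sub-plan ids) pairs.
    Returns a list of (score sub-plan, merged queries, merged sub-plans).
    """
    remaining = list(clusters)
    merged = []
    while remaining:
        s = find_score_plan(remaining, cost_execution)
        if s is None:  # only possible with non-positive costs
            break
        group = [c for c in remaining if s in c[1]]
        remaining = [c for c in remaining if s not in c[1]]
        queries = sorted(q for qs, _ in group for q in qs)
        plans = set().union(*(ps for _, ps in group))
        merged.append((s, queries, plans))
    return merged


# Initial clusters C0..C5 of the example.
CLUSTERS = [
    (["Q2", "Q3"], {11}), (["Q4", "Q6"], {12, 13, 14}),
    (["Q0", "Q1"], {10}), (["Q7", "Q9"], {12, 13, 15}),
    (["Q5"], {13}), (["Q8"], {13, 15}),
]
COSTS = {p: 1.0 for p in range(10, 16)}  # hypothetical uniform costs

final_clusters = intra_cluster(CLUSTERS, COSTS)
for score, queries, plans in final_clusters:
    print(score, queries, sorted(plans))
```

Under these assumptions the first merge is driven by sub-plan 13 and groups the six queries $Q_{4}$ to $Q_{9}$ , leaving the clusters of sub-plans 10 and 11 unchanged.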

3.5 Transformation of a cluster into a directed graph MVPP

The generation of the MVPP at this step consists of a simple transformation of each cluster into a directed MVPP graph. We assume that each query of our workload is represented by a left-deep tree and that the first join of each query is the score sub-plan found. Starting from these score sub-plans, we arrange the sub-plans p of each cluster in descending order using the following formula:

$\displaystyle\textit{Cost}(p)=\textit{costexecution}(p)\times\textit{nbruse}(p)$ (4)

Eq. (4) identifies the sub-plans that offer the greatest benefit, measured as the number of times the result of a sub-plan is reused multiplied by its processing cost. Once the sub-plans of each cluster are ordered, we represent the cluster as a directed graph MVPP by updating the left node of each sub-plan. If a conflict arises because the same sub-plan would have two different left nodes, this sub-plan is duplicated.

We build the MVPP associated with the cluster $C_{1345}={\{}Q_{4},Q_{5},Q_{6},Q_{7},Q_{8},Q_{9}{\}}$ , with $P={\{}13,15,12,14{\}}$ . Applying Eq. (4) to the cluster $C_{1345}$ gives the ordered set $P={\{}13,12,15,14{\}}$ ; the transformation steps are shown in Fig. 3.
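
The ordering by Eq. (4) can be illustrated with a few lines of Python. The processing costs and reuse counts below are hypothetical; they were chosen only so that the resulting order matches the ordered set given in the text, and the function name `order_subplans` is likewise an assumption.

```python
def order_subplans(plans, costexecution, nbruse):
    """Sort sub-plans by Cost(p) = costexecution(p) * nbruse(p), descending."""
    return sorted(plans, key=lambda p: costexecution[p] * nbruse[p], reverse=True)


# Hypothetical values for the sub-plans of cluster C_1345.
costexecution = {12: 90, 13: 100, 14: 60, 15: 80}
nbruse = {12: 1, 13: 4, 14: 1, 15: 1}
print(order_subplans([13, 15, 12, 14], costexecution, nbruse))  # [13, 12, 15, 14]
```

Under these assumed values the products are 400, 90, 80 and 60 for sub-plans 13, 12, 15 and 14 respectively, which yields the order 13, 12, 15, 14.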

3.6 Views selection algorithm

We propose an algorithm (Algorithm 3) that handles the clusters individually, selecting the views that offer the maximum benefit and satisfy the storage space constraint. In this way, storage space is allocated to each cluster, and together the selected views optimize the big queries workload. Algorithm 3 describes the steps of the selection process, which cycles over the clusters as long as the storage constraint is satisfied. For each cluster $c_{i}$ , we identify its set V of sub-plans that are candidates for materialization.

Algorithm 3: Selection_Vue ()
1.	Input: $C_{\textit{MVPP}}$ , S, Disquespace {the set of final clusters C with their MVPP, the score sub-plans S and the available disk space}
2.	Output: $L_{\textit{MV}}=\emptyset$ {list of the materialized views}
3.	For all ( $c_{i}\in C$ ) do {initialize $L_{\textit{MV}}$ with the score sub-plan of each cluster $c_{i}$ , $i=0..k$ }
4.	  $L_{i}=$ getallview( $c_{i}$ ) {get the set of views of cluster $c_{i}$ }
5.	  If (sizeof( $s_{i}$ ) $<$ Disquespace) { $s_{i}$ is the score sub-plan of cluster $c_{i}$ }
6.	    Add( $s_{i}$ , $L_{\textit{MV}}$ )
7.	    Disquespace $=$ Disquespace $-$ sizeof( $s_{i}$ )
8.	    Delete( $s_{i}$ , $L_{i}$ )
9.	  End if
10.	  For all ( $v\in L_{i}$ ) do
11.	    cost(v) $=$ costexecution(v)*nbruse(v)
12.	  End for
13.	  descendingorder( $L_{i}$ ) {sort the candidate sub-plans in descending order of cost}
14.	End for
15.	While (Disquespace $>$ 0)
16.	  For all ( $c_{i}\in C$ ) do
17.	    $L_{i}=$ getallview( $c_{i}$ )
18.	    If (sizeof( $L_{i}$ (0)) $<$ Disquespace)
19.	      Add( $L_{i}$ (0), $L_{\textit{MV}}$ )
20.	      Disquespace $=$ Disquespace $-$ sizeof( $L_{i}$ (0))
21.	      Delete( $L_{i}$ (0), $L_{i}$ )
22.	    End if
23.	  End for
24.	End while
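
The selection process of Algorithm 3 can be sketched in Python as follows. This is one possible reading under stated assumptions, not the authors' implementation: the ordered candidate lists are kept between rounds, a `progress` flag is added so that the loop stops when no further view fits, and all sizes, costs and reuse counts are hypothetical.

```python
def select_views(clusters, size, cost, nbruse, disk_space):
    """Greedy view selection under a storage constraint.

    clusters: dict mapping a cluster name to (score_subplan, candidate views).
    Returns the list of selected views and the remaining disk space.
    """
    selected = []
    candidates = {}
    # Phase 1: materialize the score sub-plan of each cluster if it fits.
    for name, (score, views) in clusters.items():
        views = list(views)
        if size[score] < disk_space:
            selected.append(score)
            disk_space -= size[score]
            views.remove(score)
        # Order the remaining candidates by Eq. (4), descending.
        views.sort(key=lambda v: cost[v] * nbruse[v], reverse=True)
        candidates[name] = views
    # Phase 2: cycle over the clusters, taking the best remaining candidate.
    progress = True
    while disk_space > 0 and progress:
        progress = False  # assumption: stop once a full round selects nothing
        for views in candidates.values():
            if views and size[views[0]] < disk_space:
                v = views.pop(0)
                selected.append(v)
                disk_space -= size[v]
                progress = True
    return selected, disk_space


# Hypothetical inputs (assumptions, not taken from the paper).
clusters = {"C1345": (13, [13, 15, 12, 14]), "C0": (11, [11]), "C2": (10, [10])}
size = {10: 30, 11: 40, 12: 25, 13: 50, 14: 20, 15: 35}
cost = {10: 120, 11: 150, 12: 90, 13: 100, 14: 60, 15: 80}
nbruse = {10: 1, 11: 1, 12: 1, 13: 4, 14: 1, 15: 1}
print(select_views(clusters, size, cost, nbruse, disk_space=200))
# ([13, 11, 10, 12, 15], 20)
```

In this example, view 14 is not selected because its size (20) is not strictly smaller than the remaining space (20), which mirrors the strict comparison used in the listing.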

Figure 3.
Transformation steps of a cluster into a directed graph MVPP.

4. Experiments

We developed a simulator tool in the Java environment that extracts all the knowledge on the workload (sub-plans, selections, projections and aggregations) and builds the training dataset from this knowledge. These data are stored in a CSV file. We imported the Weka package (weka.jar) [31] into our tool in order to implement K-means clustering using the methods and classes offered by this package, such as:

• buildClusterer(Data) method, which generates a clusterer from the training data read from the prepared CSV file;
• ClusterEvaluation (package weka.clusterers), a class for evaluating clustering models;
• setNumClusters(int K) method, where the variable K is the desired number of clusters; to determine its value, we programmed similarity and dissimilarity measures that guarantee the best clustering in a single execution of K-means.

Once the clusters have been found, the clusters that interact through a score sub-plan with a single join are merged. Our tool then generates the MVPP adapted to each cluster, following a function that maximizes the benefit of sub-plan reuse by the queries. Finally, the tool includes the implementation of the view selection algorithm. We also developed a cost model that estimates the cost of query processing in terms of the number of input/output (I/O) pages required to execute a given query. This cost model uses database parameters (number of tuples of a relation, number of pages on which the relation is stored, etc.) and query parameters (number of distinct values of an attribute, selectivity factor of an algebraic operator, etc.) to estimate the cardinality of intermediate results.
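
To make the role of these parameters concrete, the following Python sketch shows how such a cost model might combine them. The formulas are common textbook estimates and are an assumption here, since the exact formulas of the cost model are not given in this section; the function names, the tuple width and the page size are hypothetical.

```python
import math


def selection_cardinality(n_tuples, selectivity):
    """Estimated number of tuples that pass a selection predicate."""
    return n_tuples * selectivity


def join_cardinality(n_r, n_s, distinct_r, distinct_s):
    """Textbook equi-join estimate: |R| * |S| / max(V(R, a), V(S, b))."""
    return n_r * n_s / max(distinct_r, distinct_s)


def pages(n_tuples, tuple_bytes, page_bytes=4096):
    """Number of pages needed to store n_tuples of the given width."""
    return math.ceil(n_tuples * tuple_bytes / page_bytes)


# Hypothetical relation sizes: a fact table joined with a dimension table.
card = join_cardinality(n_r=1000, n_s=50, distinct_r=50, distinct_s=50)
print(card)                               # 1000.0
print(pages(card, tuple_bytes=100))       # 25
print(selection_cardinality(1000, 0.25))  # 250.0
```

The estimated cardinality of an intermediate result, converted into pages, gives the I/O cost of reading or writing that result.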

We used the dataset of the Star Schema Benchmark (SSB) [25], with a size of 1 GB, to evaluate the approach. This SSB data warehouse is deployed in Oracle 11g; it contains a fact table Lineorder of 600 million tuples and four dimension tables: Customer, Supplier, Part and Dates, as illustrated in Fig. 4. We used the dbgen tool provided by SSB to generate the dataset; once the dataset is generated, scripts populate the tables using the SQL*Loader tool provided by Oracle.

SSB contains 13 decision support queries. From these queries, we randomly generated workloads of varying sizes, from 30 to 10000 queries, using the SSB query generator. Our queries cover most types of OLAP queries.

Figure 4.
Star Schema Benchmark (SSB).

In the first experiment, we evaluated the time required to generate an MVPP compared with the classical approach of Yang et al. [7] and with the approach of Boukorca et al. [24], which has proved its scalability. In each test, we change the number of input queries to evaluate each approach. Figure 5 summarizes the obtained results.

The results show that our approach requires less execution time than the approach of Yang et al., whose execution time increases exponentially with the number of queries, and than the competing approach of Boukorca et al., which supports the scalability of our approach.

Figure 5.
The execution time to generate multi-view processing plan (MVPP).

To study the effect of the workload size on the performance of our approach, we ran a series of tests that evaluate the optimization rate for each workload, estimating the cost of the workload with and without materialized views (MV): $1-\text{cost}_{\text{with}}/\text{cost}_{\text{without}}$ . The results in Fig. 6 show that our approach becomes more successful as the workload increases.
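
As a small worked example of the optimization rate, the following Python function evaluates the expression above. The cost values are hypothetical and the function name `optimization_rate` is an assumption.

```python
def optimization_rate(cost_with, cost_without):
    """Return 1 - cost_with / cost_without for a workload."""
    if cost_without <= 0:
        raise ValueError("cost_without must be positive")
    return 1 - cost_with / cost_without


# Hypothetical costs, expressed in I/O pages.
print(optimization_rate(cost_with=25, cost_without=100))  # 0.75
```

A rate of 0.75 means that, under these assumed costs, the materialized views remove three quarters of the workload processing cost; a rate of 0 means that the views bring no gain.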

Figure 6.
Optimization rate of query processing cost.

In order to compare the impact of our MVPP on the VSP with the existing approaches of Boukorca et al. and Yang et al., we calculate, with our simulator, the processing cost of the queries with materialized views. The obtained results are shown in Fig. 7.

Our approach outperforms the approach of Yang et al., which does not scale when many queries are used, and also outperforms the approach of Boukorca et al. This means that our approach selects better views than the existing approaches.

Figure 7.
Query processing cost with materialized view.

To validate the theoretical results obtained with our simulator, the materialized views obtained by our approach and by the other approaches are then created in the Oracle 11g DBMS, and a workload of 30 OLAP queries is evaluated using these sets of views.

The obtained results are given in Fig. 8, which compares the total execution costs with materialized views and the materialization costs. The latter represent the maintenance costs of the views, where the maintenance cost equals the sum of the creation cost and the processing cost of a view. This test confirms that our approach is better than that of Yang et al. and achieves a slightly better processing cost with materialized views than the approach of Boukorca et al., which confirms the effectiveness of our MVPP for selecting the best views.

Figure 8.
Query processing cost using materialized view (MV) in Oracle.

5. Discussion

Our experimental results have demonstrated the scalability of our approach by stressing it in terms of workload size (10000 queries) and comparing it with the most important state-of-the-art studies. We evaluated the time required to generate our multi-view processing plan (MVPP) against the approaches of Yang et al. [7] and Boukorca et al. [24]. The results showed that our approach is more effective in terms of execution time, for two reasons: we benefit from the efficiency of the K-means algorithm in clustering large datasets, and we generate the MVPP without passing through the logical plans of each query, which makes our approach scalable. We also tested the impact of our MVPP on the View Selection Problem (VSP). The results show that our approach selects better views than existing approaches, because we group similar big queries in the same cluster according to the most important single-join sub-plan, called the score sub-plan. The disadvantage we found in our approach is that even coupling the two worlds of multi-query optimization and physical design through materialized views remains insufficient to evaluate these big queries effectively over a centralized data warehouse. In this situation, moving to distributed and parallel platforms is one of the scalable solutions suited to evaluating big queries; we therefore plan to extend our approach and test our MVPP for designing a parallel data warehouse.

6. Conclusion

In this paper, we proposed an approach for modeling big and interacted queries for materialized views through data mining, which addresses the scalability issue raised by the Big Queries phenomenon. This approach is based on the divide-and-conquer principle: it models the interaction between queries by using K-means clustering and refinement operations in order to generate the global plan MVPP, and the materialized views are derived from this MVPP. We compared our approach with the most important state-of-the-art studies, and the experimental results show its effectiveness and scalability even with a large workload of queries.

In our future work, we will consider the use of the generated MVPP to identify the selection predicates that participate in the horizontal data partitioning of the data warehouse. In addition, our approach currently applies to a static context and does not take changes of the workload into account. We will therefore implement our approach in a dynamic context, using similarity and dissimilarity measures to study the effect of workload updates on the obtained results.

Authors’ Bios

	Fatiha Betouati received the engineering degree in computer science from the University of Science and Technology Mohamed Boudiaf (USTO-MB), Oran, Algeria, in 2007 and pursued a postgraduate degree in computer science at the same university in 2011. She is presently a Ph.D. student at the University of Algeria. Her research interests include database design, data warehouses and data mining.
	Sid Ahmed Rahal obtained a doctorate in computer science in 1989 from Pau University, France. He is a member of the Signal, Systems and Data Laboratory (LSSD). Currently, he is a professor at the University of Science and Technology Mohamed Boudiaf (USTO-MB), Oran, Algeria. His research interests include object-oriented systems, data mining, agents and expert systems.

References

1. Gupta and Mumick, Selection of views to materialize under a maintenance cost constraint, In Proceedings of the International Conference on Database Theory (ICDT), Springer, 1999, pp. 453–470.
2. Harinarayan, Rajaraman and Ullman, Implementing data cubes efficiently, ACM SIGMOD Record 25(2) (1996), 205–216.
3. Mami and Bellahsene, A survey of view selection methods, ACM SIGMOD Record 41(1) (2012), 20–29.
4. Sellis, Multiple-query optimization, ACM Transactions on Database Systems (TODS) 13(1) (1988 March), 23–52.
5. and Peng, An Efficient K-means Clustering Algorithm on MapReduce, In International Conference on Database Systems for Advanced Applications DASFAA, Springer, 2014, pp. 357–371.
6. Razen, Ibrahim, Panos and Nikos, Evaluating SPARQL Queries on Massive RDF Datasets, In Proceedings of the VLDB Endowment 8(12) (2015), 1848–1851.
7. Yang, Karlapalem and Li, Algorithms for materialized view design in data warehousing environment, In Proceedings of the International Conference on Very Large Databases (VLDB), Morgan Kaufmann Publishers Inc., 1997, pp. 136–145.
8. Yang, Karlapalem and Li, A framework for designing materialized views in data warehousing environment, In Proceedings of IEEE International Conference on Distributed Computing Systems (ICDCS), 1997, pp. 458–465.
9. Ross K.A., Srivastava and Sudarshan, Materialized view maintenance and integrity constraint checking: Trading space for time, ACM SIGMOD Record 25 (1996), 447–458.
10. Roukh, Bellatreche, Boukorca and Bouarar, Eco-dmw: Eco-design methodology for data warehouses, ACM DOLAP (2015), 1–10.
11. Gupta H.I. and Mumick, Selection of views to materialize in a data warehouse, IEEE Transactions on Knowledge and Data Engineering 17(1) (2005), 24–43.
12. Daneshpour and Barforoush, Dynamic view management system for query prediction to view materialization, International Journal of Data Warehousing and Mining (IJDWM) 7(2) (2011), 67–96.
13. Kotidis and Roussopoulos, Dynamat: a dynamic view management system for data warehouses, ACM SIGMOD Record 28(2) (1999), 371–382.
14. Kehua and Diasse, A dynamic materialized view Selection in a Cloud-based Data Warehouse, IJCSI International Journal of Computer Science 11(2) (2014), 1694–0814.
15. Mistry, Roy, Sudarshan and Ramamritham, Materialized view selection and maintenance using multi-query optimization, ACM SIGMOD Record 30 (2001), 307–318.
16. Roy, Seshadri, Sudarshan and Bhobe, Efficient and extensible algorithms for multi query optimization, In Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM, 2000, pp. 249–260.
17. Mami, Coletta and Bellahsene, Modeling view selection as a constraint satisfaction problem, In Proceedings of International Conference on Database and Expert Systems Applications (DEXA), Springer, 2011, pp. 396–410.
18. Derakhshan, Stantic, Korn and Dehne, Parallel simulated annealing for materialized view selection in data warehousing environments, In Algorithms and Architectures for Parallel Processing, Springer, 2008, pp. 11–132.
19. Kementsietsidis, Duan and Li, Scalable multi-query optimization for sparql, In Proceedings of the International Conference on Data Engineering (ICDE), IEEE, 2012, pp. 666–677.
20. Goasdoué, Karanasos, Leblay and Manolescu, View selection in semantic web databases, In Proceedings of the International Conference on Very Large DataBases (VLDB), 5(2) (2011), pp. 97–108.
21. Kementsietsidis, Craen F.N. and Vansummeren, Scalable multi-query optimization for exploratory queries over federated scientific databases, VLDB (2008), pp. 16–27.
22. Gupta, Selection of views to materialize in a data warehouse, In ICDT, 1997, pp. 98–112.
23. Lee and Hammer, Speeding up materialized view selection in data warehouses using a randomized algorithm, Int. J. Cooperative Inf. Syst. 10(3) (2001), 327–353.
24. Boukorca, Bellatreche, Senouci S.A.B. and Faget, SONIC: scalable multi-query optimization through integrated circuits, In: Decker, Lhotsk’a, Link, Basl, Tjoa A.M. (eds), DEXA 2013, Part I. LNCS, vol. 8055, Springer, Heidelberg, 2013, pp. 278–292.
25. O’Neil and O’Neil, X.C.: Star schema benchmark, 2009.
26. Kalnis, Mamoulis and Papadias, View selection using randomized search, Data Knowl. Eng. 42(1) (2002), 89–111.
27. Horng, Chang, Liu and Kao, Materialized view selection using genetic algorithms in a data warehouse system, In Proceedings of the Congress on Evolutionary Computation (CEC), volume 3, IEEE, 1999.
28. Zhang and Yang, Genetic algorithm for materialized view selection in data warehouse environments, In Data Warehousing and Knowledge Discovery, Springer, 1999, pp. 116–125.
29. Zhang, Yao and Yang, An evolutionary approach to materialized views selection in a data warehouse environment, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 31(3) (2001), 282–294.
30. Boukorca, Bellatreche and Benkrid, HYPAD: Hyper-Graph-Driven Approach for Parallel Data Warehouse Design, Springer International Publishing Switzerland 2015, G. Wang et al. Eds., ICA3PP 2015, Part IV, LNCS 9531, 2015, pp. 770–783.
31. Remocor, Eibe, Mark, Richard, Peter, Alex and David, WEKA Manual for Version 3-6-15, University Waikato Hamilton, New Zealand, Edition, 2016.