An efficient attribute reduction algorithm using MapReduce

Abstract

Classical attribute reduction algorithms based on attribute significance initiate too many jobs (O(|C|²)) when they run in MapReduce. To improve the efficiencies of these algorithms, we proposed a novel reduction algorithm. Instead of focusing on attribute significance, the notion of a core attribute was applied to construct a new heuristic reduction algorithm, and only |C| jobs were considered to obtain a reduct. The algorithm only included two basic operations: compare and sort. The latter was optimised using the shuffle mechanism in MapReduce, which provided an efficient sorting ability for big data. In particular, we connected jobs in an iterative form to transfer the processing result of the former job to the latter job. Finally, experimental results demonstrated that the proposed attribute reduction algorithm was efficient and significantly improved upon the classical algorithms in runtime and number of jobs.

Keywords

Attribute reduction; MapReduce; rough set; shuffle mechanism; sort technology

1. Introduction

Rapid data analysis is in strong demand in the military, Internet, finance and other industries. Large amounts of data are often accompanied by significant data redundancy [1,2], which wastes storage space and reduces the performance of algorithms based on data modelling and decision-making. As an important preprocessing technique for data mining, attribute reduction efficiently reduces the storage space and improves the classification performance by removing redundant and irrelevant attributes [3].

Attribute reduction is a key notion and one of the most important aspects of rough set theory [4–8], and it has been successfully developed to identify a reduct or multiple reducts [3]. Many classical attribute reduction methods have been proposed in recent years, but reduction methods based on attribute significance are the most popular and widely accepted in real-world applications [9–15].

The development of big data technology, such as Hadoop [16] and MapReduce [17–20], has brought research on attribute reduction using big data technology into focus. Yang et al. [14] initially calculated reductions of block data sets in MapReduce and integrated them into a final reduction. Similarly, Qian et al. [3] realised an attribute reduction algorithm based on attribute significance using MapReduce and accomplished a reduct on tens of millions of samples using 16 nodes. Lv et al. [21] proposed an incremental attribute reduction algorithm and reused the former results to speed up the computation of equivalence classes. Moreover, Miao et al. [10] calculated the relative reduction of decision tables using the angles of particles. This research showed that some classical reduction methods in rough set theory, especially those based on attribute significance, can be realised and correctly executed in MapReduce.

However, existing reduction algorithms based on attribute significance, when running in MapReduce, have a significant shortcoming. Specifically, classical attribute reduction algorithms based on attribute significance initiate too many jobs (O(|C|²)). For example, Qian et al. [3] detailed that MapReduce must initiate multiple jobs to calculate the attribute significance of all candidate attributes in each heuristic process. In a hypothetical scenario involving |C| attributes in decision tables, if only data parallelism was considered, the algorithm would run |C| jobs in the first heuristic process, |C| − 1 jobs in the second heuristic process, …, and |C| − |R| + 1 jobs in the |R|th heuristic process. In summary, (2|C| − |R| + 1)*|R|/2 jobs would be implemented to calculate a reduct. Each job would require a fixed amount of time to begin and assign tasks. As a result, an algorithm that initiates too many jobs would not be efficient [22,23].
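The job total quoted above is the sum of an arithmetic series over the |R| heuristic processes. A short derivation, assuming exactly one job per candidate attribute in each heuristic process, is:

```latex
\sum_{i=1}^{|R|} \bigl(|C| - i + 1\bigr)
  = |R|\,(|C| + 1) - \frac{|R|\,(|R| + 1)}{2}
  = \frac{(2|C| - |R| + 1)\,|R|}{2}
```

Since |R| can be as large as |C|, the total grows quadratically in |C| in the worst case, which agrees with the O(|C|²) bound stated above.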

In general, the features of MapReduce do not optimise these algorithms. Instead, they treat MapReduce as a tool to calculate related parameters, such as the attribute significance or equivalence class [16,24]. No matter how simple these parameters are, considerable amounts of time are required to perform various jobs sequentially [3,10,14,25]. To improve the efficiency, an optimised algorithm must be designed to match the features of MapReduce and reduce the runtime.

For this study, we proposed a novel attribute reduction algorithm using the MapReduce computing framework. First, the notion of a core attribute was suggested to replace the classical attribute significance. Second, each attribute was checked only once, and only |C| jobs were implemented in the proposed algorithm. Finally, each job only included two basic operations: compare and sort. The latter was optimised through the shuffle mechanism that is inherent in MapReduce.

To realise the proposed algorithm using MapReduce, we suggested a novel <key,value> pair. At the map stage, we set all the conditional data as the key. Next, the shuffle mechanism was used to automatically sort the whole decision table in the ascending order to reduce the engineering complexity. Finally, we built a connection between multiple jobs. These jobs were executed iteratively because each attribute had to be judged based on the results of previous attributes.

The rest of this article is organised as follows. The basic concepts of the rough set and the MapReduce framework are reviewed in Section 2. A reduction algorithm based on sorting technology is proposed in Section 3. Furthermore, we discuss how to select core attributes without computing the significance and how novel <key,value> pairs and iterative jobs were applied to successfully run the algorithm in the MapReduce programme. Experimental results are given in Section 4, in which the efficiency of the proposed algorithm is also discussed. Finally, our conclusions are provided in Section 5.

2. Preliminaries

In this section, we review the fundamentals of the rough set model [6] and the computation framework of MapReduce [26] reported previously.

2.1. Pawlak rough set model

In the rough set model, data are represented by an information table, and a set of objects is described using a finite set of attributes.

2.1.1. Definition 1

A decision table S is represented as follows

S = < U, At = C \cup D, {V_{a} | a \in At}, {I_{a} | a \in At} >

(1)

where U = {x₁, x₂, x₃, …, x_n} is a finite non-empty set of objects, At is a finite non-empty set of attributes, C = {c₁, c₂, c₃, …, c_m} is a set of conditional attributes that describes the objects and D is a set of decision attributes that identifies the class to which the object belongs. V_a is a non-empty set of values of a ∈ At, and I_a is an information function that maps an object x in U to exactly one value v in V_a. The decision table is considered to be inconsistent if two objects with the same condition values have different decision values.

Given a subset of attributes $B \subseteq C$ , an indiscernibility relation IND(B) is defined as follows

IND (B) = {(x, x') \in U^{2} : \forall a \in B, a (x) = a (x')}

(2)

The equivalence class of an object x with respect to B is defined as follows

[x]_{B} = {y \in U : (x, y) \in IND (B)}

(3)
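As an illustration of Equations (2) and (3), the following minimal Python sketch partitions the objects of a small table into equivalence classes. The function name `equivalence_classes`, the tuple-based table layout and the toy values are assumptions made for this example only.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Return U/IND(B): lists of object indices that agree on every attribute in attrs."""
    classes = defaultdict(list)
    for index, row in enumerate(table):
        # Objects with the same value vector on B are indiscernible.
        classes[tuple(row[a] for a in attrs)].append(index)
    return list(classes.values())


# Toy table with made-up values; the columns are c1, c2 and d.
toy = [(1, 1, 0), (1, 2, 1), (1, 1, 0), (2, 2, 1)]
by_c1_c2 = equivalence_classes(toy, attrs=[0, 1])
by_c1 = equivalence_classes(toy, attrs=[0])
```

Using fewer attributes yields a coarser partition: with B = {c₁} the first three objects fall into one class, whereas B = {c₁, c₂} separates the second object from the first and third.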

2.1.2. Definition 2

U is split into a partition $π_{D} = {D_{1}, D_{2}, D_{3}, . . ., D_{k}}$ with respect to the decision attribute D and another partition $π_{A} = {A_{1}, A_{2}, A_{3}, . . ., A_{r}}$ with respect to a set of conditional attributes A. For a decision class $D_{i} \in π_{D}$ , the lower and upper approximations of D_i with respect to a partition $π_{A}$ are defined as follows

\underline{apr}_{A} (D_{i}) = {x \in U | [x]_{A} \subseteq D_{i}}

(4)

\overline{apr}_{A} (D_{i}) = {x \in U | [x]_{A} \cap D_{i} \neq \emptyset}

(5)

2.1.3. Definition 3

For a decision table S, the positive region and boundary region of partition $π_{D}$ with respect to partition $π_{A}$ are defined as follows

POS_{A} (D) = \bigcup_{1 \leq i \leq k} \underline{apr}_{A} (D_{i})

(6)

BND_{A} (D) = \bigcup_{1 \leq i \leq k} (\overline{apr}_{A} (D_{i}) - \underline{apr}_{A} (D_{i}))

(7)

Generally, POS_C(D) denotes the positive region in a decision table. All the objects in POS_C(D) are called consistent objects. The others are called inconsistent objects.
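Definitions 2 and 3 can be checked on a small example. The sketch below is only an illustration: the helper names (`partition`, `approximations`, `positive_region`) and the toy table are assumptions of this example, and objects are identified by their row index.

```python
from collections import defaultdict


def partition(table, attrs):
    """Blocks of U/IND(attrs) as sets of row indices."""
    blocks = defaultdict(set)
    for index, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].add(index)
    return list(blocks.values())


def approximations(table, cond_attrs, dec_attr, dec_value):
    """Lower and upper approximations of the decision class d = dec_value."""
    target = {i for i, row in enumerate(table) if row[dec_attr] == dec_value}
    lower, upper = set(), set()
    for block in partition(table, cond_attrs):
        if block <= target:      # block lies entirely inside the decision class
            lower |= block
        if block & target:       # block overlaps the decision class
            upper |= block
    return lower, upper


def positive_region(table, cond_attrs, dec_attr):
    """Union of the lower approximations of all decision classes."""
    pos = set()
    for value in {row[dec_attr] for row in table}:
        pos |= approximations(table, cond_attrs, dec_attr, value)[0]
    return pos


# Toy table with made-up values; columns are c1, c2, d. Rows 0 and 1 conflict.
toy = [(1, 1, 0), (1, 1, 1), (1, 2, 1), (2, 2, 0)]
lower_1, upper_1 = approximations(toy, [0, 1], 2, dec_value=1)
pos = positive_region(toy, [0, 1], 2)
```

Rows 0 and 1 share the same condition values but carry different decisions, so they appear in the upper approximation of the class d = 1 but not in the positive region.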

2.1.4. Definition 4

For a decision table S, where $A \subseteq C$ and $c \in C - A$ , the significance of attribute c relative to A is defined as follows

sig (c, A, D) = γ_{A \cup {c}} (D) - γ_{A} (D)

(8)

where $γ_{A \cup {c}} (D) = | POS_{A \cup {c}} (D) | / | U |$.
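A worked instance of Definition 4 may help. The following sketch computes γ and sig on a made-up four-object table; the function names `gamma` and `significance`, the column layout (decision in the last column) and the data are assumptions of this example.

```python
from collections import defaultdict


def gamma(table, attrs, dec=-1):
    """Dependency degree |POS_attrs(D)| / |U|."""
    decisions = defaultdict(set)
    sizes = defaultdict(int)
    for row in table:
        key = tuple(row[a] for a in attrs)
        decisions[key].add(row[dec])
        sizes[key] += 1
    # A block belongs to the positive region when it carries a single decision value.
    positive = sum(sizes[k] for k, ds in decisions.items() if len(ds) == 1)
    return positive / len(table)


def significance(table, c, selected, dec=-1):
    """sig(c, A, D) = gamma_{A ∪ {c}}(D) − gamma_A(D)."""
    selected = list(selected)
    return gamma(table, selected + [c], dec) - gamma(table, selected, dec)


# Toy table with made-up values; columns are c1, c2, d.
toy = [(1, 1, 0), (1, 2, 1), (2, 1, 1), (2, 2, 1)]
sig_c1 = significance(toy, c=0, selected=[])
sig_c2_given_c1 = significance(toy, c=1, selected=[0])
```

On this table each attribute alone places half of the objects in the positive region, and adding c₂ to {c₁} raises the dependency degree from 0.5 to 1.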

2.1.5. Definition 5

Given an information table S, the attribute set $R \subseteq At$ is called a reduct if R satisfies the following two conditions:

$IND (R) = IND (At)$

For any $a \in R, IND (R - {a}) \neq IND (At)$

The set of reducts is referred to as RED(S), and the intersection of all reducts is referred to as the core set, which is described as Core(S) = ∩RED(S).

2.2. MapReduce programming model

MapReduce provides an efficient architecture for dealing with big data. It eases the burden on software developers by handling distributed data storage, data communication [27–29] and fault tolerance. MapReduce uses two functions as high-level parallel programming abstractions and interfaces: map and reduce. The input and output of the algorithms take the form of <key,value> pairs, as follows

\begin{matrix} map : < K_{1}, V_{1} > \to [< K_{2}, V_{2} >] \\ reduce : < K_{2}, [V_{2}] > \to [< K_{3}, V_{3} >] \end{matrix}

where K_i, V_i (i = 1, …, 3) are data types defined by the user and the notation […] indicates a list. The corresponding processing logic is described below.

A data record is transmitted to the map function in the form of <K₁,V₁> pairs. The map function processes these <key,value> pairs and outputs intermediate results in another form of <K₂,V₂> pairs. The intermediate result <K₂,[V₂]> is recalculated in the reduce function, which outputs the final result in the form [<K₃,V₃>].

In addition to the map and reduce processes, the shuffle process shown in Figure 1 is always treated as the heart of MapReduce and the place where metaphorical miracles occur [30]. This is because when the map function begins to generate output, it does not simply write the output to a disc, but rather it buffers the output to memory and sorts it for efficiency. The shuffle process saves a large number of manual operations and also simplifies the algorithms because the data have been ordered.

Figure 1.

Shuffle process in the map and reduce tasks.
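The map, shuffle and reduce stages can be imitated in a few lines of single-process Python. This is only a toy model for following the data flow, not Hadoop code: the function `run_job` and the counting example are assumptions of this sketch, and a real cluster distributes each stage across nodes.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Toy single-process model of one MapReduce job."""
    # Map: every record may emit several <K2, V2> pairs.
    intermediate = [pair for record in records for pair in mapper(record)]
    # Shuffle: sort by key so that equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))
    output = []
    # Reduce: each key is processed once with the list of its values.
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def count_map(record):
    yield record, 1


def count_reduce(key, values):
    yield key, sum(values)


counts = run_job(["b", "a", "b"], count_map, count_reduce)
```

The reducer receives its keys in ascending order without any explicit sorting code in the user functions, which is the property that the proposed algorithm relies on.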

3. Attribute reduction algorithm using MapReduce

In this section, we analyse the disadvantages of classical reduction algorithms, after which we present a novel algorithm. During this process, the shuffle mechanism played a crucial role in reducing the workload. We also connected every job in the iterative form to transfer the processing results of the former job to the latter job. The related map and reduce functions are also given in this section in the form of pseudo code.

3.1. Classical attribute reduction algorithm based on significance

Reduction algorithms based on attribute significance are greedy. There are two classical heuristic strategies: adding forward or deleting backward. The former strategy gradually adds attributes with the greatest significance to a reduct, while the latter gradually removes the attributes with the lowest attribute significance from the set of candidate attributes until the reduction conditions are satisfied. The two strategies have similar numbers of jobs, but the strategy of adding forward is more popular, as Algorithms 1, 2 and 3 demonstrate [3].

Algorithm 1. Map (key,value)

Input: a decision table DT = (U, C∪D, V_a, I_a).
Output: <Ac_EquivalenceClass, <d(x), 1>>.
// Ac_EquivalenceClass is an equivalence class derived from A∪{c}. A is the set of selected attributes, and c ∈ C − A is a candidate attribute.
1: For x ∈ U do
2: For c ∈ C − A do
3: Emit <Ac_EquivalenceClass, <d(x), 1>>
4: End for
5: End for

Algorithm 2. Reduce (key,value)

Input: Ac_EquivalenceClass and the list of corresponding decision value pairs.
Output: <c_EquivalenceClass, Sig_△^c>
// Sig_△^c is the importance of the attribute c in this equivalence class.
1: For <d, n> ∈ [<d₁, n₁>, <d₂, n₂>, …] do
2: {Count the number of objects with each decision value, $(n_{p}^{1}, n_{p}^{2}, \dots, n_{p}^{k})$;}
3: End for
4: Calculate Sig_△^c according to Definition 4.
5: Emit <c_EquivalenceClass, Sig_△^c>

Algorithm 3. Main

Input: A decision table, S.
Output: A reduction, Red.
1: $R = \emptyset$;
2: Calculate IND(C);
3: Initiate a MapReduce job and carry out Algorithms 1 and 2 for every candidate attribute $c \in C - R$. Select $c_{1} = {c | \underset{c \in C - R}{Best} ({Sig}_{Δ}^{c})}$ based on the calculated Sig_△^c (if c₁ is not unique, then choose one of them arbitrarily);
4: R = R∪{c₁};
5: Repeat steps 3 and 4 until IND(R) = IND(C);
6: Emit R.

Algorithm 1 calculates the equivalence classes in a data set. Algorithm 2 counts the attribute significance of a candidate attribute according to Definition 4. Algorithm 3 selects the optimal candidate attribute based on the significance of each heuristic process and repeats the heuristic process until the target reduction is completed.

In Algorithms 1–3, the significance of every candidate attribute is calculated, but only the attribute with the maximum significance is selected in each heuristic process. This means that an attribute may be evaluated many times. In the worst case, if only data parallelism was considered, a redundant attribute would be evaluated |R| times, where |R| is the cardinality of the reduct R. In that case, there would be (2|C| − |R| + 1)*|R|/2 jobs.

These algorithms are inefficient in the sense that they launch too many jobs for a reduction algorithm. In MapReduce, each job requires a fixed amount of time to begin and assign tasks, so an algorithm that begins too many jobs accumulates this overhead and its total runtime remains high.

3.2. Novel attribute reduction algorithm

Below are definitions to help describe the novel algorithm that we proposed.

3.2.1. Definition 6

A decision table S = <U, C∪D, V_a, I_a> was referred to as a sort ascending decision table (SADT) if and only if it satisfied the following conditions:

For i < j, c₁(x_i) ≤ c₁(x_j)

For B_m = {c₁, c₂, …, c_m}, if B_m(x_i) = B_m(x_{i+1}), then c_{m+1}(x_i) ≤ c_{m+1}(x_{i+1})

For i < |U|, if c(x_i) = c(x_{i+1}) for all $c \in C$, then d(x_i) ≤ d(x_{i+1})

where |·| denotes the cardinality of a set and m < |C|.
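Taken together, the three conditions state that the rows are in non-decreasing lexicographic order on (c₁, …, c_n, d). Assuming each object is stored as a tuple with the decision value last, Python's built-in tuple comparison expresses this directly; the name `is_sadt` and the toy rows are assumptions of this sketch.

```python
def is_sadt(table):
    """True if adjacent rows (c1, ..., cn, d) are in non-decreasing lexicographic order."""
    return all(x <= y for x, y in zip(table, table[1:]))


# Toy rows with made-up values; columns are c1, c2, d.
unsorted_rows = [(2, 1, 0), (1, 2, 1), (1, 1, 0)]
sorted_rows = sorted(unsorted_rows)
```

Sorting the rows as tuples is therefore enough to turn any decision table into an SADT.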

A positive region sort ascending decision table (PR-SADT) was also defined to replace the SADT if the original decision table was inconsistent.

3.2.2. Definition 7

We let S = <U, C∪D, V_a, I_a> be an SADT, and the partition was represented as U/C = {X₁, X₂, …, X_K}. The PR-SADT S_p = <U_p, C∪D, V_a, I_a> was defined as follows

d (X_{i}') = \begin{cases} d_{new}, & | d (X_{i}) | > 1 \\ d (X_{i}), & \text{otherwise} \end{cases}

where $U_{p} / C = {X_{1}', X_{2}', \dots, X_{K}'}$ and for $\forall c \in C$, c(X_i) = c($X_{i}'$). Moreover, d_new is a new decision value that satisfies the condition $d_{new} \notin d (U)$. For convenience, we set d_new = max(d(U)) + 1.

Definition 7 shows that the PR-SADT changed all the rough granules in an SADT into exact granules. Thus, the PR-SADT was considered to be a consistent decision table. Based on Definitions 6 and 7, we proposed a convenient method for judging core attributes as follows.

3.2.3. Theorem 1

We let S be a PR-SADT and |d(U)| > 1. The last condition attribute c_n was considered a core attribute if there existed objects x_i and x_{i+1} that satisfied the following conditions:

$c_{n} (x_{i}) < c_{n} (x_{i + 1}), d (x_{i}) \neq d (x_{i + 1})$

For B_{n−1} = {c₁, c₂, …, c_{n−1}}, B_{n−1}(x_i) = B_{n−1}(x_{i+1})

3.2.4. Proof

We supposed that U/B_{n−1} = {Y₁, Y₂, …, Y_K}. If c_n was a core attribute, then $\exists Y_{q}$, and $| d (Y_{q}) | > 1$ because |d(U)| > 1 and S was a consistent decision table. Furthermore, a granule Y_q could be divided into more granules by the attribute c_n, with $Y_{q} / {c_{n}} = {X_{1}, X_{2}, \dots, X_{K}}$ and K > 1. There were two adjacent subgranules, X_p and X_{p+1}, that satisfied d(X_p) ≠ d(X_{p+1}), 1 ≤ p < K because S was a consistent decision table and $| d (Y_{q}) | > 1$. We let x_k be the last object in X_p, and x_{k+1} was the first object of X_{p+1} because S was a PR-SADT, with B_{n−1}(x_k) = B_{n−1}(x_{k+1}), c_n(x_k) < c_n(x_{k+1}) and d(x_k) ≠ d(x_{k+1}).

Meanwhile, the conditions B_{n−1}(x_k) = B_{n−1}(x_{k+1}) and c_n(x_k) < c_n(x_{k+1}) meant that the attribute c_n was the unique attribute that could discern the object pair (x_k, x_{k+1}) in a consistent decision table. This meant that attribute c_n was a core attribute.

Based on Theorem 1, the last attribute of the data set was quickly identified. If it was a core attribute, it was required to be preserved; otherwise, it was regarded as a redundant attribute.
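The check in Theorem 1 needs only one pass over adjacent rows. The sketch below assumes that each row is a tuple (c₁, …, c_n, d) and that the table is already a PR-SADT; the function name `last_attribute_is_core` is an assumption of this example, and the rows are those of Table 2 in Example 1.

```python
def last_attribute_is_core(table):
    """Apply the adjacent-row test of Theorem 1 to the last condition attribute."""
    for x, y in zip(table, table[1:]):
        same_prefix = x[:-2] == y[:-2]   # B_{n-1}(x_i) = B_{n-1}(x_{i+1})
        if same_prefix and x[-2] < y[-2] and x[-1] != y[-1]:
            return True
    return False


# Rows of Table 2 (columns c1, c2, c3, c4, d).
table2 = [
    (1, 1, 1, 1, 0), (1, 2, 3, 2, 4), (2, 2, 2, 1, 1), (2, 3, 1, 2, 3),
    (2, 3, 2, 3, 0), (3, 1, 2, 1, 4), (4, 3, 4, 2, 4),
]
# The same rows with the c4 column removed, so that c3 becomes the last attribute.
without_c4 = [row[:3] + row[4:] for row in table2]

c4_is_core = last_attribute_is_core(table2)
c3_is_core = last_attribute_is_core(without_c4)
```

The two results agree with Example 1: c₄ is not a core attribute, while c₃ is identified as a core attribute by the fourth and fifth rows.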

Next, a novel attribute reduction algorithm was proposed as Algorithm 4.

Algorithm 4. The novel attribute reduction algorithm.

Step 1. $R = \emptyset$; rearrange the original data set to a PR-SADT.
Sort the decision table according to Definition 7 and compare adjacent samples;
For i = 1 to |U| − 1
If C(x_i) = C(x_{i+1}) // where x_i and x_{i+1} are two adjacent samples;
If d(x_i) ≠ d(x_{i+1})
d(x_i) = d_new; // based on Definition 7, d_new = max(d(x)) + 1.
Delete sample x_{i+1};
End
End
End
Step 2. Judge whether the last condition attribute is a core attribute (compare adjacent samples according to Theorem 1). Add it to the reduction set R and jump to Step 4 if it is a core attribute; otherwise, jump to Step 3.
Step 3. Delete the last column of data and return to Step 2;
Step 4. Place the column data corresponding to the selected attributes of Step 2 in the first column.
Step 5. If all the conditional attributes have been checked, output the reduction set R; otherwise, sort the decision table in ascending order and jump to Step 2.

In terms of preprocessing, the purpose of Step 1 was to delete redundant samples and make the data set a consistent PR-SADT.

Steps 2–5 outline how to select a core attribute according to Theorem 1. Each candidate attribute was checked only once, and the result was specific: yes or no. Furthermore, sorting the decision table was the unique operation in every cycle (called a heuristic process), and sorting was automatically carried out through the shuffle mechanism of MapReduce.

The proposed algorithm satisfied two key features. First, it adopted the reduct construction by deletion. Second, each attribute in R was a core attribute with respect to the related heuristic steps. Thus, R was considered to be a complete reduct. The detailed proof is as follows.

3.2.5. Proof

Considering any attribute c_i ∈ R, there was an object pair (x_k, x_{k+1}) that satisfied the conditions in Theorem 1: B(x_k) = B(x_{k+1}), c_i(x_k) < c_i(x_{k+1}) and d(x_k) ≠ d(x_{k+1}), where B = R_i∪{c₁, c₂, …, c_{i−1}}, $R_{i} = {c_{j} \in R | j > i}$. This meant that the object pair (x_k, x_{k+1}) could not be discerned by B. At the same time, as a result of $R - {c_{i}} \subseteq B$, we concluded that the object pair (x_k, x_{k+1}) could not be discerned by R − {c_i} either. However, the object pair could be discerned by R because the PR-SADT was a consistent decision table. Thus, attribute c_i was essential for attribute set R. In conclusion, the attributes of R were jointly sufficient and individually necessary for the original data set. R was a reduct.

Figure 2 shows the structure of the proposed algorithm.

Figure 2.

Main process of the proposed algorithm.

The proposed algorithm was shown to be simple and efficient. First, each attribute was checked only once, which meant that only |C| jobs were considered. Second, during the heuristic process, redundant column data were deleted to reduce the time and space complexity of the subsequent heuristic process. Third, the algorithm only included two basic operations: compare and sort, and the latter was optimised using the ‘shuffle’ mechanism in MapReduce, which provided an efficient sorting ability for big data.

3.2.6. Example 1

Table 1 shows the original data set that describes the complete data reduction process.

Table 1.

An original decision table.

c ₁	c ₂	c ₃	c ₄	d
1	1	1	1	0
2	2	2	1	1
2	3	2	3	0
2	2	2	1	1
3	1	2	1	0
1	2	3	2	2
2	3	1	2	3
3	1	2	1	1
1	2	3	2	2
3	1	2	1	1
4	3	4	2	1
1	2	3	2	3
4	3	4	2	2
1	1	1	1	0

In Step 1, we sorted the data set in the ascending order as per the preprocessing steps and removed repetition. In addition, the decision values of all inconsistent objects were changed to 4 (aside from {0, 1, 2, 3}, this value was arbitrary; we set it as 4 for convenience). Table 2 shows the new data set.

Table 2.

PR-SADT after preprocessing.

c ₁	c ₂	c ₃	c ₄	d
1	1	1	1	0
1	2	3	2	4
2	2	2	1	1
2	3	1	2	3
2	3	2	3	0
3	1	2	1	4
4	3	4	2	4

PR-SADT: positive region sort ascending decision table.
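The preprocessing step can be reproduced with a short single-machine sketch that sorts the rows, merges objects with equal condition values and relabels conflicting groups with d_new = max(d(U)) + 1. The function name `to_pr_sadt` and the tuple layout are assumptions of this example; the data are the rows of Table 1.

```python
from itertools import groupby


def to_pr_sadt(rows):
    """Sort rows (c1, ..., cn, d), drop repeats and relabel inconsistent groups."""
    d_new = max(row[-1] for row in rows) + 1
    result = []
    # Sorting makes rows with equal condition values adjacent.
    for cond, group in groupby(sorted(rows), key=lambda row: row[:-1]):
        decisions = {row[-1] for row in group}
        decision = decisions.pop() if len(decisions) == 1 else d_new
        result.append(cond + (decision,))
    return result


# Rows of Table 1 (columns c1, c2, c3, c4, d).
table1 = [
    (1, 1, 1, 1, 0), (2, 2, 2, 1, 1), (2, 3, 2, 3, 0), (2, 2, 2, 1, 1),
    (3, 1, 2, 1, 0), (1, 2, 3, 2, 2), (2, 3, 1, 2, 3), (3, 1, 2, 1, 1),
    (1, 2, 3, 2, 2), (3, 1, 2, 1, 1), (4, 3, 4, 2, 1), (1, 2, 3, 2, 3),
    (4, 3, 4, 2, 2), (1, 1, 1, 1, 0),
]
table2 = to_pr_sadt(table1)
```

The 14 original objects collapse to the seven rows of Table 2, and the three groups with conflicting decisions receive the new decision value 4.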

In Step 2, we checked the last condition attribute c₄. First, we compared the first and second samples. The values of c₂ and c₃ were not equal. This meant that these two objects did not satisfy the condition. We subsequently compared the second object with the third; the values of c₁ and c₃ did not satisfy the mentioned condition. By analogy, we concluded that no two adjacent samples satisfied the conditions in Theorem 1. Therefore, the attribute c₄ was not a core attribute.

In Step 3, the column data on attribute c₄ was deleted. Next, we returned to Step 2 of the algorithm and checked the new last condition attribute c₃. While comparing the fourth and fifth objects, we found that the attribute values of c₁ and c₂ were the same, but the attribute values of c₃ and the corresponding decision values were different. As such, c₃ was considered a core attribute and R = {c₃}. In Step 4, the column corresponding to attribute c₃ was placed in the first column, and the new decision table was sorted in the ascending order in Step 5, as shown in Table 3.

Table 3.

Decision table after sorting.

c ₃	c ₁	c ₂	d
1	1	1	0
1	2	3	3
2	2	2	1
2	2	3	0
2	3	1	4
3	1	2	4
4	4	3	4

Next, we regarded attribute c₂ as the last condition attribute and checked it in Step 2. While comparing the adjacent objects, we found that the values of c₃ and c₁ of the third and fourth objects were the same, while the decision values were different. This made attribute c₂ a core attribute, with R = {c₂, c₃}. We subsequently repeated the above steps and concluded that c₁ was not a core attribute. The final reduction result was R = {c₂, c₃}.
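The whole of Example 1 can be replayed with a single-machine sketch of Algorithm 4. It is an illustration under stated assumptions rather than the MapReduce implementation: the names `to_pr_sadt` and `attribute_reduct` are hypothetical, columns are addressed by index instead of being physically moved or deleted, and Python's `sorted` stands in for the shuffle.

```python
from itertools import groupby


def to_pr_sadt(rows):
    """Step 1: sort, drop repeats, relabel inconsistent groups with max(d) + 1."""
    d_new = max(row[-1] for row in rows) + 1
    result = []
    for cond, group in groupby(sorted(rows), key=lambda row: row[:-1]):
        decisions = {row[-1] for row in group}
        result.append(cond + (decisions.pop() if len(decisions) == 1 else d_new,))
    return result


def attribute_reduct(rows, names):
    """Steps 2 to 5: test each attribute once, from the last column to the first."""
    table = to_pr_sadt(rows)
    front = []                          # core attributes, kept as leading columns
    pending = list(range(len(names)))   # unchecked attributes in original order
    while pending:
        cand = pending.pop()            # current last condition attribute
        prefix = front + pending        # columns placed before the candidate

        def sort_key(row):
            return tuple(row[a] for a in prefix) + (row[cand], row[-1])

        ordered = sorted(table, key=sort_key)
        is_core = any(
            all(x[a] == y[a] for a in prefix)
            and x[cand] < y[cand]
            and x[-1] != y[-1]
            for x, y in zip(ordered, ordered[1:])
        )
        if is_core:
            front.insert(0, cand)       # Step 4: move the column to the front
        # Otherwise the column is simply ignored from now on (Step 3).
    return [names[a] for a in sorted(front)]


table1 = [
    (1, 1, 1, 1, 0), (2, 2, 2, 1, 1), (2, 3, 2, 3, 0), (2, 2, 2, 1, 1),
    (3, 1, 2, 1, 0), (1, 2, 3, 2, 2), (2, 3, 1, 2, 3), (3, 1, 2, 1, 1),
    (1, 2, 3, 2, 2), (3, 1, 2, 1, 1), (4, 3, 4, 2, 1), (1, 2, 3, 2, 3),
    (4, 3, 4, 2, 2), (1, 1, 1, 1, 0),
]
reduct = attribute_reduct(table1, ["c1", "c2", "c3", "c4"])
```

The loop body runs once per conditional attribute, and on Table 1 it returns the same reduct as the worked example.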

3.3. Algorithm realisation using MapReduce

In this section, we discuss how we implemented and optimised the proposed algorithm using MapReduce.

First, a novel <key,value> pair was designed to sort the data set through the shuffle mechanism of MapReduce. Next, the job of checking core attributes was presented. Finally, the attribute reduction algorithm was realised using iterative jobs.

3.3.1. Design of preprocessing job

This preprocessing job completed three tasks: sorting, removing repetition and modifying the decision values of inconsistent objects. In this study, sorting and removing repetition were automatically performed using the shuffle mechanism.

Algorithms 5 and 6 detail the realisation of the preprocessing job.

Algorithm 5. Preprocessing-Map (key,value)

Input: a decision table S.
Output: <key′,value′>, where key′ is the ordered values of the condition attributes, and value′ is the decision value.
1: for each object x in S do
2: key′ ← c₁(x), c₂(x), …, c_n(x);
3: value′ ← d(x)
4: Emit <key′,value′>
5: end for

Algorithm 6. Preprocessing-Reduce (key, V)

Input: the ordered values of condition attributes and decision values
Output: <key′,value′>, where key′ is the ordered values of condition attributes, and value′ is the modified decision value.
1: ite ← values; // use an iterator to traverse every element of the value-list.
2: a ← ite.next();
3: result ← a; // if the value-list has no further element, emit the original value.
4: while ite.hasNext()
5: b ← ite.next();
6: if b ≠ a
7: {result ← x₀ = max(d(x)) + 1;}
8: a = b;
9: value′ ← result;
10: Emit <key′,value′>

We designed a novel <key,value> pair to make the algorithm more convenient than the traditional method, which uses the line flag as the key, as shown in Figure 3. We set all the condition attribute values as the key and the decision value as the value.

Figure 3.

Data sorting based on the shuffle and novel <key,value>.

Based on the output of Algorithm 5, the efficient shuffle mechanism sorted the original data set and deleted repeated objects without extra task scheduling. Specifically, after a round of map, the system automatically sorted the whole decision table because of the special key design. This also improved the speed of operation and reduced the amount of code. In addition, the shuffle process removed repetition automatically.

Figure 4 shows the reduce task of the preprocessing job that changed the original decision table to a consistent PR-SADT. We used the conditional attribute values as the key of the reduce function. If the value-list corresponding to a key contained more than one distinct decision value, the related objects were inconsistent, and their decision values were changed to max(d(x)) + 1. Using values.hasNext(), we ensured that the reduce task checked all the objects automatically and subsequently output a PR-SADT.

Figure 4.

Reduce task of the preprocessing job.
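Algorithms 5 and 6 can be traced with a toy single-process model of one job. Everything outside the two algorithm bodies is an assumption of this sketch: `run_job` merely imitates map, shuffle and reduce in memory, and d_new is computed beforehand and passed to the reducer through the hypothetical factory `make_preprocessing_reduce`, because the source does not state how this value reaches the reduce tasks.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Toy model of one job: map, sort by key (shuffle), then reduce per key."""
    pairs = sorted((p for r in records for p in mapper(r)), key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def preprocessing_map(row):
    """Algorithm 5: all condition values form the key, the decision is the value."""
    *cond, decision = row
    yield tuple(cond), decision


def make_preprocessing_reduce(d_new):
    def preprocessing_reduce(key, values):
        """Algorithm 6: emit one row per key; conflicting decisions become d_new."""
        iterator = iter(values)
        a = next(iterator)
        result = a
        for b in iterator:
            if b != a:
                result = d_new
            a = b
        yield key, result
    return preprocessing_reduce


table1 = [
    (1, 1, 1, 1, 0), (2, 2, 2, 1, 1), (2, 3, 2, 3, 0), (2, 2, 2, 1, 1),
    (3, 1, 2, 1, 0), (1, 2, 3, 2, 2), (2, 3, 1, 2, 3), (3, 1, 2, 1, 1),
    (1, 2, 3, 2, 2), (3, 1, 2, 1, 1), (4, 3, 4, 2, 1), (1, 2, 3, 2, 3),
    (4, 3, 4, 2, 2), (1, 1, 1, 1, 0),
]
d_new = max(row[-1] for row in table1) + 1
job_output = run_job(table1, preprocessing_map, make_preprocessing_reduce(d_new))
pr_sadt = [key + (value,) for key, value in job_output]
```

Neither user function sorts or deduplicates anything explicitly; both effects come from grouping by a key that contains every condition value.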

3.3.2. Design of core attribute judgement jobs

Based on the PR-SADT output by the preprocessing job, all the conditional attributes were checked sequentially. Figure 5 shows the related job of the core attribute judgement.

Figure 5.

Core attribute judgement job.

According to Theorem 1, it was necessary to compare two adjacent samples. In the map function, the columns corresponding to c₁, …, c_{n−1} were set as the key, so samples with the same values were automatically merged under one key and their decision values were stored in the corresponding value-list. Since the decision table emitted from preprocessing was consistent, was sorted and did not have any repeated objects, if c₁(x_i) = c₁(x_{i+1}), c₂(x_i) = c₂(x_{i+1}), …, and c_{n−1}(x_i) = c_{n−1}(x_{i+1}), then c_n(x_i) < c_n(x_{i+1}). Therefore, if the value-list <k₂,[v₂]> contained different decision values, then c_n was essential and treated as a core attribute.

The self-circulation structure in reduce automatically compared the next pair of adjacent samples, x_{i+1} and x_{i+2}, until all samples in the decision table had been compared. The map and reduce functions are detailed in Algorithms 7 and 8.

In general, the output of the reduce function was a new data set composed of key and value. However, only the judgement results of the last conditional attribute c_n were required, so we set key = null and the output as a single line of values consisting of 0 or 1. A value of 1 indicated that the parameter changed during the cycle and the candidate attribute was a core.

In Algorithms 7 and 8, only the adjacent samples were compared because the input was a PR-SADT. It was much simpler and more efficient than the traditional algorithm shown in Algorithms 1 and 2. In addition, the proposed algorithm took full advantage of the MapReduce architecture for code simplicity and high efficiency. The sort processing was optimised using the inherent shuffle mechanism of MapReduce.

Algorithm 7. Core attribute judgement job – Map (key,value)

Input: the ordered values of the condition attributes, decision values and the candidate attribute c_n.
Output: <key′,value′>, where key′ is the columns corresponding to c₁, c₂, …, c_{n−1}, and value′ is the decision value.
1: for each object x in S do
2: key′ ← c₁(x), c₂(x), …, c_{n−1}(x);
3: value′ ← d(x)
4: Emit <key′,value′>
5: end for

Algorithm 8. Core attribute judgement job – Reduce (key,value)

Input: the columns corresponding to c₁, c₂, …, c_{n−1} and decision values
Output: <null,value′>, where value′ is ‘0’ or ‘1’.
1: ite ← values; // use an iterator to traverse every element of the value-list.
2: a ← ite.next();
3: result ← 0;
4: while ite.hasNext()
5: b ← ite.next();
6: if b ≠ a
7: {result ← 1;} // ‘1’ means the candidate attribute is core.
8: a = b;
9: value′ ← result;
10: Emit <null,value′>
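The judgement job can be traced in the same toy setting. The sketch below is an assumption-laden illustration: `run_job` imitates one job in memory, `core_map` and `core_reduce` follow Algorithms 7 and 8, and the helper `candidate_is_core` (a name introduced here) treats the candidate as core when any reducer emits 1. The rows are those of Table 2.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Toy model of one job: map, sort by key (shuffle), then reduce per key."""
    pairs = sorted((p for r in records for p in mapper(r)), key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def core_map(row):
    """Algorithm 7: key on c1..c_{n-1}, dropping the candidate column c_n."""
    *prefix, _candidate, decision = row
    yield tuple(prefix), decision


def core_reduce(key, values):
    """Algorithm 8: emit 1 if the decisions under one key are not all equal."""
    iterator = iter(values)
    a = next(iterator)
    result = 0
    for b in iterator:
        if b != a:
            result = 1
        a = b
    yield None, result


def candidate_is_core(table):
    return any(flag == 1 for _, flag in run_job(table, core_map, core_reduce))


table2 = [
    (1, 1, 1, 1, 0), (1, 2, 3, 2, 4), (2, 2, 2, 1, 1), (2, 3, 1, 2, 3),
    (2, 3, 2, 3, 0), (3, 1, 2, 1, 4), (4, 3, 4, 2, 4),
]
without_c4 = [row[:3] + row[4:] for row in table2]
c4_flag = candidate_is_core(table2)
c3_flag = candidate_is_core(without_c4)
```

With c₄ as the candidate every key holds a single decision value, so all reducers emit 0; once the c₄ column is removed, the key (2, 3) collects the decisions 3 and 0 and the job reports c₃ as a core attribute.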

3.3.3. Algorithm realisation based on iterative jobs

According to Algorithm 4, all the conditional attributes had to be checked iteratively using the core attribute judgement job. The output of each job influenced the input of the next job. Thus, we set up a global variable to record the output of the core attribute judgement job so that the programme knew whether the column of the candidate attribute was to be preserved in the next job. Figure 6 shows the complete iterative relations between the jobs. Figure 7 indicates how the judgement result of c(s) was passed on to the next core attribute judgement job on c(s + 1).

Figure 6. Design of iterative jobs.

Figure 7. <key,value> pairs of adjacent jobs.
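
The iteration described in this subsection might be sketched as a driver loop in which the result of each judgement job decides which columns the next job receives. The names `splits_decisions` and `reduce_attributes` and the toy table `ROWS` are hypothetical. Each judgement job is simulated here with an in-memory dictionary rather than a MapReduce job, and the sketch counts only the judgement jobs.

```python
# Hypothetical toy table: columns 0 and 2 are identical, and d follows column 1.
ROWS = [
    (1, 1, 1, 0),
    (1, 2, 1, 1),
    (2, 1, 2, 0),
    (2, 2, 2, 1),
]

def splits_decisions(rows, kept):
    """One judgement job: True if objects that agree on the kept columns disagree on d."""
    seen = {}
    for *conditions, decision in rows:
        key = tuple(conditions[i] for i in kept)
        # setdefault stores the first decision seen for this key and returns it afterwards
        if seen.setdefault(key, decision) != decision:
            return True
    return False

def reduce_attributes(rows, n_conditions):
    """Check every condition attribute once and drop those that are not core."""
    kept = list(range(n_conditions))
    jobs = 0
    for candidate in range(n_conditions):
        trial = [i for i in kept if i != candidate]
        jobs += 1  # one judgement job per candidate attribute
        if not splits_decisions(rows, trial):
            kept = trial  # candidate is redundant, so later jobs no longer see its column
    return kept, jobs
```

For `ROWS`, `reduce_attributes(ROWS, 3)` keeps only column 1 after three judgement jobs, one per condition attribute.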

4. Experimental evaluation

This section presents the experimental results of the attribute reduction algorithm using MapReduce. We examine the performance of the proposed algorithm and verify its superiority in terms of the number of jobs, sizeup and runtime.

4.1. Experimental setup

We ran the proposed algorithm on a cluster of four nodes: one master node and three slave nodes. Each node was equipped with 7.8 GB of memory and four processors (Intel® Core™ i5-4440 CPU @ 3.10 GHz). On each node, we installed CentOS 6.4 as the operating system and Eclipse 3.6.1 as the development tool. Hadoop 1.1.2 was installed as the platform for processing large data.

We conducted an extensive series of experiments on the data sets Mushroom and KDD CUP 99 from the UCI Machine Learning Repository and on four large synthetic data sets (DS3–DS6). We built DS1 by duplicating Mushroom 5000 times. DS2 was the KDD CUP 99 data set. Each data set had only one decision attribute, and the attribute values of each synthetic data set were random integers from 1 to 10. Table 4 summarises all the data sets.

Table 4. Description of six data sets.

No.	Data sets	Objects	Attributes	Classes
1	DS1	41,594,800	22	2
2	DS2	4,898,431	40	3
3	DS3	10,000,000	30	10
4	DS4	20,000,000	30	10
5	DS5	40,000,000	30	10
6	DS6	5,000,000	30	10
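
A synthetic data set of the kind described above could be generated along the following lines. This is only a sketch: the function name `synthetic_table`, the fixed seed and the choice to draw the decision uniformly from `n_classes` labels are assumptions, because the text specifies only that attribute values are random integers from 1 to 10.

```python
import random

def synthetic_table(n_objects, n_conditions, n_classes, seed=0):
    """Yield rows of random condition values in 1..10 followed by one decision value."""
    rng = random.Random(seed)  # fixed seed so that repeated runs give the same table
    for _ in range(n_objects):
        conditions = [rng.randint(1, 10) for _ in range(n_conditions)]
        decision = rng.randint(1, n_classes)  # assumed: decision labels drawn uniformly
        yield conditions + [decision]
```

Because the function is a generator, rows can be streamed to a file without holding millions of objects in memory.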

4.2. Comparison of the number of jobs

The number of jobs was the most significant advantage of the proposed algorithm. The number of jobs was J = |C| + 1, which did not depend on the output set R. Thus, it was easy to estimate the real runtime for each data set. By comparison, if only data parallelism was considered, the number of jobs in attribute reduction algorithms based on significance calculations depended on the number of attributes in the reduct R. For these algorithms, J = (2|C| − |R| + 1)*|R|/2, where |C| is the total number of conditional attributes and |R| is the number of attributes in the reduct. The traditional algorithms therefore launched more jobs, and their runtime was hard to estimate in advance because it depended on the reduct R. Figures 8–10 show the detailed comparisons for DS1–DS6.

Figure 8. Number of jobs managing DS1.

Figure 9. Number of jobs managing DS2.

Figure 10. Number of jobs managing DS3–DS6.
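
The two job-count formulas in this subsection can be compared numerically with a short sketch. The function names and the sample values of |C| and |R| are illustrative assumptions and are not taken from the experiments. The helper `jobs_by_rounds` reflects one possible reading of the closed form, namely one job per remaining candidate attribute in each of |R| selection rounds; the text does not state this reading.

```python
def jobs_proposed(n_conditions):
    """J = |C| + 1 for the proposed algorithm."""
    return n_conditions + 1

def jobs_significance(n_conditions, n_reduct):
    """J = (2|C| - |R| + 1) * |R| / 2 for significance-based algorithms."""
    # The product is always even, so integer division is exact.
    return (2 * n_conditions - n_reduct + 1) * n_reduct // 2

def jobs_by_rounds(n_conditions, n_reduct):
    """Assumed reading: round k evaluates the |C| - k attributes not yet selected."""
    return sum(n_conditions - k for k in range(n_reduct))

if __name__ == "__main__":
    for n_reduct in (5, 10, 30):
        print(n_reduct, jobs_proposed(30), jobs_significance(30, n_reduct))
```

With |C| = 30, the proposed algorithm needs 31 jobs regardless of |R|, whereas the significance-based count is 140, 255 and 465 for |R| = 5, 10 and 30.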

As shown in Figures 8–10, the proposed algorithm ran far fewer jobs than the traditional algorithms. Taking DS2 as an example and supposing that the reduct included 10 conditional attributes, the number of jobs in the classic algorithms was 5.8 times greater than that of the proposed algorithm. Because each job requires a fixed amount of time to start and to assign tasks, the traditional algorithms could not be expected to run faster.

4.3. Performance analysis on different data sets

4.3.1. Runtime on different data sets

First, the relationship between performance and the number of attributes was analysed, and the results are shown in Figure 11.

Figure 11. Runtime with respect to |C|.

Based on Figure 11, the runtime of the proposed algorithm increased linearly with an increase in the number of attributes. Therefore, the total runtime was in a controllable range for each data set. In theory, the linear runtime feature arose from the linear quantity of jobs J = |C| + 1 and was not affected by reduct R.

Second, the relationship between the performance and the number of samples was analysed, and the results are shown in Figure 12.

Figure 12. Relationship between the performance and number of samples.

Figure 12 shows the runtimes for 12 data sets. Half of the data sets had 10,000,000 objects, and the others had 20,000,000 objects. For 5, 10, 15, 20, 25 and 30 attributes, the ratios between the runtimes of the two sizes were 1.82, 1.81, 1.93, 1.93, 2.03 and 2.01, respectively. The average ratio was 1.93, which was close to the ratio of the numbers of objects. Therefore, the runtime also increased linearly with the number of objects.

4.3.2. Sizeup

In traditional attribute reduction algorithms based on Hadoop, such as that reported by Qian et al. [3], the sizeup is usually analysed by copying the original data two, four and eight times. In the proposed algorithm, however, such replicas were filtered out by the shuffle process. We therefore synthesised test data sets with 5,000,000; 10,000,000; 20,000,000; and 40,000,000 objects. Each test data set had the same attribute set and decision type. Figure 13 illustrates the sizeup feature. The sizeup curve of the proposed algorithm was nearly linear.

Figure 13. Sizeup of the algorithm.

In conclusion, the performance and sizeup of the proposed algorithm were nearly linear in the numbers of attributes and objects and were negligibly affected by the reduct set. Based on these results, we concluded that the real runtime was in a controllable range for each data set and could easily be predicted by users. By comparison, the performance of traditional algorithms based on significance depended on the output reducts, with the number of jobs growing as O(|C|²). Thus, the proposed algorithm would be more effective and valuable in real-world applications.

4.4. Comparison with different algorithms

The proposed algorithm was compared with two traditional algorithms: one based on the positive region and another based on the boundary region [3]. The comparison is shown in Figure 14.

Figure 14. The runtime comparison of the three algorithms.

As Figure 14 indicates, the proposed algorithm required significantly less time to run than the two traditional algorithms. On average, the total runtime of the algorithm based on the positive region was 4.9 times greater than that of the proposed algorithm, and the algorithm based on the boundary region required 3.5 times longer than the proposed algorithm did. The ratios for different data sets were influenced by the actual reducts: the more attributes in the reduct, the higher the ratio, because the traditional algorithms required more jobs and longer runtimes.

5. Conclusion

In this study, we proposed an efficient attribute reduction method based on MapReduce. The method included only two simple operations: sort and compare. To realise the algorithm, we designed new <key,value> pairs, which enabled us to sort the overall decision tables automatically through the shuffle mechanism of MapReduce. Theoretical analysis and experimental results showed that the proposed algorithm was efficient because only |C| + 1 jobs were required, and that its runtime was predictable because it grew linearly with the numbers of attributes and samples.

Footnotes

Acknowledgements

The authors thank the anonymous reviewers for their thoughtful comments and suggestions.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research was supported by the National Natural Science Foundation of P.R. China (Nos: 61502538, 61773406).

ORCID iD

Jing Li

References

1. Han, Liew, Hemert, et al. A generic parallel processing model for facilitating data mining and integration. Parallel Comput 2011; 37: 157–171.
2. Srinivasan, Faruquie, Joshi. Data and task parallelism in ILP using MapReduce. Mach Learn 2012; 86(1): 141–168.
3. Qian, Miao, Zhang, et al. Parallel attribute reduction algorithms using MapReduce. Inform Sciences 2014; 279: 671–690.
4. Guyon, Elisseeff. An introduction to variable and feature selection. J Mach Learn Res 2003; 3: 1157–1182.
5. Cercone. Learning in relational databases: a rough set approach. Comput Intell 1995; 11: 323–338.
6. Pawlak. Rough sets. Int J Comput Inf Sci 1982; 11: 341–356.
7. Pawlak. Rough sets: theoretical aspects of reasoning about data. Boston, MA: Kluwer Academic Publishers, 1991.
8. Yin, Gui, Yang, et al. Core set analysis in inconsistent decision tables. Inform Sciences 2013; 241: 138–147.
9. Shang, Feng, et al. Quick attribute reduction in inconsistent decision tables. Inform Sciences 2014; 254: 155–180.
10. Miao, Zhao, Yao, et al. Relative reducts in consistent and inconsistent decision tables of the Pawlak rough set model. Inform Sciences 2009; 179: 4140–4150.
11. Wang, Yang DC. Decision table reduction based on parallel symbiotic evolution. Chin J Comput 2003; 26(5): 630–635 (in Chinese).
12. Liu, Yang, et al. A quick attribute reduction algorithm with complexity of max(O(|C||U|), O(|C|²|U/C|)). Chin J Comput 2006; 29(3): 611–615 (in Chinese).
13. Yang. Algorithms based on general discernibility matrix for computation of a core and attribute reduction. Control Decis 2008; 23: 1049–1054.
14. Yang, Chen, Liang, et al. Attribute reduction for massive data based on rough set theory and MapReduce. In: Yu, Greco, Lingras, et al. (eds) Rough set and knowledge technology (Lecture notes in computer science), vol. 6401. Berlin; Heidelberg: Springer, 2010, pp. 672–678.
15. Wang, Lan. Solving the attribute reduction problem with ant colony optimization. In: Peters, Skowron, Chan C-C, et al. (eds) Transactions on rough sets XIII (Lecture notes in computer science), vol. 6499. Berlin; Heidelberg: Springer, 2011, pp. 240–259.
16. Liang, Wang, Dang, et al. An efficient rough feature selection algorithm with a multi-granulation view. Int J Approx Reason 2012; 53: 912–926.
17. Ryu KH. MapReduce-based web mining for prediction of web-user navigation. J Inf Sci 2014; 40(5): 557–567.
18. Onan. Classifier and feature set ensembles for web page classification. J Inf Sci 2016; 42(2): 150–165.
19. Shon, Han, Kim, et al. Proposal reviewer recommendation system based on big data for a national research management institute. J Inf Sci 2017; 43(2): 147–158.
20. Yin, Gao. A flexible aggregation framework on large-scale heterogeneous information networks. J Inf Sci 2017; 43(2): 186–203.
21. Qian, Yue XD. Incremental attribute reduction algorithm for big data using MapReduce. J Comput Method Sci Eng 2016; 16: 641–652.
22. Shang. Fast approximate attribute reduction with MapReduce. In: Lingras, Wolski, Cornelis, et al. (eds) Rough sets and knowledge technology (Lecture notes in computer science), vol. 8171. Berlin; Heidelberg: Springer, 2013, pp. 271–278.
23. Wang. Quick knowledge reduction based on divide and conquer method in huge data sets. In: Ghosh, Pal (eds) Pattern recognition and machine intelligence: second international conference (Lecture notes in computer science), vol. 4815. Berlin; Heidelberg: Springer, 2007, pp. 312–315.
24. Qian, Yue, et al. Hierarchical attribute reduction algorithms for big data using MapReduce. Knowl-Based Syst 2015; 73: 18–31.
25. Qian, Miao, Zhang, et al. Hybrid approaches to attribute reduction based on indiscernibility and discernibility relation. Int J Approx Reason 2011; 52(2): 212–230.
26. Apache Hadoop, https://hadoop.apache.org/
27. Baskaya, Keskustalo, Järvelin. Effectiveness of search result classification based on relevance feedback. J Inf Sci 2013; 39(6): 764–772.
28. Carmagnola, Osborne, Torre. Escaping the Big Brother: an empirical study on factors influencing identification and information leakage on the Web. J Inf Sci 2014; 40(2): 180–197.
29. Corbellini, Mateos, Godoy, et al. An architecture and platform for developing distributed recommendation algorithms on large-scale social networks. J Inf Sci 2015; 41(5): 686–704.
30. White. Hadoop: the definitive guide. Beijing, China: Tsinghua University Press, 2015.

c₁	c₂	c₃	c₄	d
1	1	1	1	0
2	2	2	1	1
2	3	2	3	0
2	2	2	1	1
3	1	2	1	0
1	2	3	2	2
2	3	1	2	3
3	1	2	1	1
1	2	3	2	2
3	1	2	1	1
4	3	4	2	1
1	2	3	2	3
4	3	4	2	2
1	1	1	1	0
