Abstract
Classical attribute reduction algorithms based on attribute significance initiate too many jobs (O(|C|2)) when they run in MapReduce. To improve the efficiencies of these algorithms, we proposed a novel reduction algorithm. Instead of focusing on attribute significance, the notion of a core attribute was applied to construct a new heuristic reduction algorithm, and only |C| jobs were considered to obtain a reduct. The algorithm only included two basic operations: compare and sort. The latter was optimised using the shuffle mechanism in MapReduce, which provided an efficient sorting ability for big data. In particular, we connected jobs in an iterative form to transfer the processing result of the former job to the latter job. Finally, experimental results demonstrated that the proposed attribute reduction algorithm was efficient and significantly improved upon the classical algorithms in runtime and number of jobs.
1. Introduction
The rapid analysis of data has become strictly demanded in military, Internet, finance and other industries. Large amounts of data are often accompanied by significant data redundancy [1,2], which wastes storage space and reduces the performance of algorithms based on data modelling and decision-making. As an important preprocessing technique for data mining, attribute reduction efficiently reduces the storage space and improves the classification performance by removing redundant and irrelevant attributes [3].
Attribute reduction is a key notions and one of the most important aspects of rough set theory [4–8], and it has been successfully developed to identify a reduct or multiple reducts [3]. Many classical attribute reduction methods have been proposed in recent years, but reduction methods based on attribute significance are the most popular and widely accepted in real-world applications [9–15].
The development of big data technology, such as Hadoop [16] and MapReduce, [17–20], has brought research on attribute reduction using big data technology into focus. Yang et al. [14] initially calculated reductions of block data sets in MapReduce and integrated them into a final reduction. Similarly, Qian et al. [3] realised an attribute reduction algorithm based on attribute significance using MapReduce and accomplished a reduct on tens of millions of samples using 16 nodes. Lv et al. [21] proposed an incremental attribute reduction algorithm and reused the former results to speed up the computation of equivalence classes. Moreover, Miao et al. [10] calculated the relative reduction of decision tables using the angles of particles. This research showed that some classical reduction methods in rough set theory, especially those based on attribute significance, can be realised and correctly executed in MapReduce.
However, existing reduction algorithms based on attribute significance, when running in MapReduce, have a significant shortcoming. Specifically, classical attribute reduction algorithms based on attribute significance initiate too many jobs (O(|C|2)). For example, Qian et al. [3] detailed that MapReduce must initiate multiple jobs to calculate the attribute significance of all candidate attributes in each heuristic process. In a hypothetical scenario involving |C| attributes in decision tables, if only data parallel was considered, the algorithm would run |C| jobs in the first heuristic process, |C| − 1 jobs in the second heuristic process, …, and |C| − |R| + 1 jobs in the |R| heuristic process. In summary, (2|C| − |R| + 1)*|R|/2 jobs would be implemented to calculate a reduct. Each job would require a fixed amount of time to begin and assign tasks. As a result, an algorithm that has initiated too many jobs would not be efficient [22,23].
In general, the features of MapReduce do not optimise these algorithms. Instead, they treat MapReduce as a tool to calculate related parameters, such as the attribute significance or equivalence class [16,24]. No matter how simple these parameters are, considerable amounts of time are required to perform various jobs sequentially [3,10,14,25]. To improve the efficiency, an optimised algorithm must be designed to match the features of MapReduce and reduce the runtime.
For this study, we proposed a novel attribute reduction algorithm using the MapReduce computing framework. First, the notion of a core attribute was suggested to replace the classical attribute significance. Second, each attribute was checked only once, and only |C| jobs were implemented in the proposed algorithm. Finally, each job only included two basic operations: compare and sort. The latter was optimised through the shuffle mechanism that is inherent in MapReduce.
To realise the proposed algorithm using MapReduce, we suggested a novel <key,value> pair. At the map stage, we set all the conditional data as the key. Next, the shuffle mechanism was used to automatically sort the whole decision table in the ascending order to reduce the engineering complexity. Finally, we built a connection between multiple jobs. These jobs were executed iteratively because each attribute had to be judged based on the results of previous attributes.
The rest of this article is organised as follows. The basic concepts of the rough set and the MapReduce framework are reviewed in Section 2. A reduction algorithm based on sorting technology is proposed in Section 3. Furthermore, we discuss how to select core attributes without computing the significance and how novel <key,value> pairs and iterative jobs were applied to successfully run the algorithm in the MapReduce programme. Experimental results are given in Section 4, in which the efficiency of the proposed algorithm is also discussed. Finally, our conclusions are provided in Section 5.
2. Preliminaries
In this section, we review the fundamentals of the rough set model [6] and the computation framework of MapReduce [26] reported previously.
2.1. Pawlak rough set model
In the rough set model, data are represented by an information table, and a set of objects is described using a finite set of attributes.
2.1.1. Definition 1
A decision table S is represented as follows
where U = {x1, x2, x3, …, xn} is a finite non-empty set of objects, At is a finite non-empty set of attributes, C = {c1, c2, c3, …, cm} is a set of conditional attributes that describes the objects and D is a set of decision attributes that identifies the class to which the object belongs. Va is a non-empty set of values of a ∈ At, and Ia is an information function that maps an object x in U to exactly one value v in Va. The decision table is considered to be inconsistent if two objects with the same condition values have different decision values.
Given a subset of attributes
The equivalence class of an object x with respect to B is defined as follows
2.1.2. Definition 2
U is split into a partition
2.1.3. Definition 3
For a decision table S, a positive region and boundary region of a partition
Generally, POSc(D) donates the positive region in a decision table. All the objects in POSC(D) are called consistent objects. The others are called inconsistent objects.
2.1.4. Definition 4
For a decision table S, where
where
2.1.5. Definition 5
Given an information table S, the attribute set
For any
The set of reducts is referred to as RED(S), and the intersection of all reducts is referred to as the core set, which is described as Core(S) = ∩RED(S).
2.2. MapReduce programming model
MapReduce provides an efficient architecture for dealing with big data. It eases the burden of software developers by handling the data distribution storage, data communication [27–29] and fault tolerance processing of the system. MapReduce uses two functions as high-level parallel programming abstract models and interfaces: map and reduce. The input and output of the algorithms are defined as the shape of the <key,value> pairs, which take the following form
where Ki, Vi (i = 1, …, 3) are data types defined by the user and the notation […] indicates a list. The corresponding processing logic is described below.
A data record is transmitted to the map function in the form of <K1,V1> pairs. The map function processes these <key,value> pairs and outputs intermediate results in another form of <K2,V2> pairs. The intermediate result <K2,[V2]> is recalculated in the reduce function, which outputs the final result in the form [<K3,V3>].
In addition to the map and reduce processes, the shuffle process shown in Figure 1 is always treated as the heart of MapReduce and the place where metaphorical miracles occur [30]. This is because when the map function begins to generate output, it does not simply write the output to a disc, but rather it buffers the output to memory and sorts it for efficiency. The shuffle process saves a large number of manual operations and also simplifies the algorithms because the data have been ordered.

Shuffle process in the map and reduce tasks.
3. Attribute reduction algorithm using MapReduce
In this section, we analyse the disadvantages of classical reduction algorithms, after which we present a novel algorithm. During this process, the shuffle mechanism played a crucial role in reducing the workload. We also connected every job in the iterative form to transfer the processing results of the former job to the latter job. The related map and reduce functions are also given in this section in the form of pseudo code.
3.1. Classical attribute reduction algorithm based on significance
Reduction algorithms based on attribute significance are greedy. There are two classical heuristic strategies: adding forward or deleting backward. The former strategy gradually adds attributes with the greatest significance to a reduct, while the latter gradually removes the attributes with the lowest attribute significance from the set of candidate attributes until the reduction conditions are satisfied. The two strategies have similar numbers of jobs, but the strategy of adding forward is more popular, as Algorithms 1, 2 and 3 demonstrate [3].
Algorithm 1 calculates the equivalence classes in a data set. Algorithm 2 counts the attribute significance of a candidate attribute according to Definition 4. Algorithm 3 selects the optimal candidate attribute based on the significance of each heuristic process and repeats the heuristic process until the target reduction is completed.
In Algorithms 1–3, the significance of all the candidate attributes should be calculated, but only the attribute with the maximum significance should be selected in each heuristic process. This means that an attribute should be calculated many times. In the worst case, if only data parallel was considered, a redundant attribute would be calculated |R| times |R|, which equates to the cardinality of a reduct R. In that case, there would be (2|C| − |R| + 1)*|R|/2 jobs.
These algorithms are inefficient in the sense that there are too many jobs for a reduction algorithm. In MapReduce, each job requires a fixed amount of time to begin and assign tasks, and the total runtime is not reduced when an algorithm begins too many jobs.
3.2. Novel attribute reduction algorithm
Below are definitions to help describe the novel algorithm that we proposed.
3.2.1. Definition 6
A decision table S = <U, C∪D, Va, Ia> was referred to as a sort ascending decision table (SADT) if and only if it satisfied the following conditions:
For i < j, c1(xi) ≤ c1(xj)
For Bm = {c1, c2, …, cm}, if Bm(xi) = Bm(xi+1), then cm+1(xi) ≤ cm+1(xi+1)
For i < |U|, if
where |.| denotes the cardinality of a set, m < |C|.
A positive region sort ascending decision table (PR-SADT) was also defined to replace the SADT if the original decision table was inconsistent.
3.2.2. Definition 7
We let an SADT S = <U, C∪D, Va, I, Ia>, and the partition was represented as U/C = {X1, X2, …, XK}. The PR-SADT Sp = <Up, C∪D, Va, Ia> was defined as follows
where
Definition 7 shows that the PR-SADT changed all the rough granules in an SADT into exact granules. Thus, the PR-SADT was considered to be a consistent decision table. Based on Definitions 6 and 7, we proposed a convenient method for judging core attributes as follows.
3.2.3. Theorem 1
We let S be a PR-SADT and |d(U)| > 1. The last condition attribute cn was considered a core attribute if
For Bn−1 = {c1, c2, …, cn−1}, Bn−1(xi) = Bn−1(xi+1)
3.2.4. Proof
We supposed that U/Bn−1 = {Y1, Y2, …, YK}. If cn was a core attribute, then
Meanwhile, the conditions Bn−1(xk) = Bn−1(xk+1) and cn(xk) < cn(xk+1) meant that the attribute cn was a unique attribute that could discern the object pair (xk, xk+1) in a consistent decision table. This meant that attribute cn was a core attribute.
Based on Theorem 1, the last attribute of the data set was quickly identified. If it was a core attribute, it was required to be preserved; otherwise, it was regarded as a redundant attribute.
Next, a novel attribute reduction algorithm was proposed as Algorithm 4.
In terms of preprocessing, the purpose of Step 1 was to delete redundant samples and make the data set a consistent PR-SADT.
Steps 2–5 outline how to select a core attribute according to Theorem 1. Each candidate attribute was checked only once, and the result was specific: yes or no. Furthermore, sorting the decision table was the unique operation in every cycle (called a heuristic process), and sorting was automatically carried out through the shuffle mechanism of MapReduce.
The proposed algorithm satisfied two key features. First, it adopted the reduct construction by deletion. Second, each attribute in R was a core attribute with respect to the related heuristic steps. Thus, R was considered to be a complete reduct. The detailed proof is as follows.
3.2.5. Proof
Considering any attribute ci ∈ R, there was an object pair (xk, xk+1) that satisfied the conditions in Theorem 1: B(xk) = B(xk+1), c|i|(xk) < ci(xk+1) and d(xk) ≠ d(xk+1), where B = Ri∪{c1, c2, …, ci−1},
Figure 2 shows the structure of the proposed algorithm.

Main process of the proposed algorithm.
The proposed algorithm was shown to be simple and efficient. First, each attribute was checked only once, which meant that only |C| jobs were considered. Second, during the heuristic process, redundant column data were deleted to reduce the time and space complexity of the subsequent heuristic process. Third, the algorithm only included two basic operations: compare and sort, and the latter was optimised using the ‘shuffle’ mechanism in MapReduce, which provided an efficient sorting ability for big data.
3.2.6. Example 1
Table 1 shows the original data set that describes the complete data reduction process.
An original decision table.
In Step 1, we sorted the data set in the ascending order as per the preprocessing steps and removed repetition. In addition, the decision values of all inconsistent objects were changed to 4 (aside from {0, 1, 2, 3}, this value was arbitrary; we set it as 4 for convenience). Table 2 shows the new data set.
PR-SADT after preprocessing.
PR-SADT: positive region sort ascending decision table.
In Step 2, we checked the last condition attribute c4. First, we compared the first and second samples. The values of c2 and c3 were not equal. This meant that these two objects did not satisfy the condition. We subsequently compared the second object with the third; the values of c1 and c3 did not satisfy the mentioned condition. By analogy, we concluded that there was a lack of two adjacent samples to satisfy the conditions in Theorem 1. Therefore, the attribute c4 was not a core attribute.
In Step 3, the column data on attribute c4 was deleted. Next, we returned to Step 2 of the algorithm and checked the new last condition attribute c3. While comparing the fourth and fifth objects, we found that the attribute values of c1 and c2 were the same, but the attribute values of c3 and the corresponding decision values were different. As such, c3 was considered a core attribute and R = {c3}. In Step 4, the column corresponding to attribute c3 was placed in the first column, and the new decision table was sorted in the ascending order in Step 5, as shown in Table 3.
Decision table after sorting.
Next, we regarded attribute c2 as the last condition attribute and checked it in Step 2. While comparing the adjacent objects, we found that the value of c3 and c1 of the third and fourth objects were the same, while the decision values were different. This made attribute c2 a core attribute, with R = {c2, c3}. We subsequently repeated the above steps and concluded that c1 was not a core attribute. The final reduction result was R = {c2, c3}.
3.3. Algorithm realisation using MapReduce
In this section, we discuss how we implemented and optimised the proposed algorithm using MapReduce.
First, a novel <key,value> pair was designed to sort the data set through the shuffle mechanism of MapReduce. Next, the job of checking core attributes was presented. Finally, the attribute reduction algorithm was realised using iterative jobs.
3.3.1. Design of preprocessing job
This preprocessing job completed three tasks: sorting, removing repetition and modifying the decision values of inconsistent objects. In this study, sorting and removing repetition were automatically performed using the shuffle mechanism.
Algorithms 5 and 6 detail the realisation of the preprocessing job.
We designed a novel <key,value> to make the algorithm more convenient than the traditional method that uses the lines flag as key, as shown in Figure 3. We set all the condition attribute values as key and the decision values as value.

Data sorting based on the shuffle and novel <key,value>.
Based on the output of Algorithm 5, the efficient shuffle mechanic sorted the original data set and deleted repeated objects without extra task scheduling. Specifically, after a round of map, the system automatically sorted the whole decision table because of the special key design. This also improved the speed of operation and reduced the amount of code. In addition, the shuffle process removed repetition automatically.
Figure 4 shows the reduce task of the preprocessing job that changed the original decision table to a consistent PR-SADT. We used the conditional attribute values as the key of the reduce function. If the value-lists corresponding to the key had more than one value, the related objects were inconsistent, and their decision values were changed to max(d(x)) + 1. Using values.hasNext(), we ensured that the reduce task checked all the objects automatically and subsequently outputs a PR-SADT.

Reduce task of the preprocessing job.
3.3.2. Design of core attribute judgement jobs
Based on the PR-SADT output by the preprocessing job, all the conditional attributes were checked sequentially. Figure 5 shows the related job of the core attribute judgement.

Core attribute judgement job.
According to Theorem 1, it was necessary to compare two adjacent samples. In the map function, the columns corresponding to c1, …, cn−1 were set to key, the same samples were automatically merged into one key and their decision values were stored in the corresponding value-list. Since the decision table emitted from preprocessing was consistent, was sorted and did not have any repeated objects, if c1(xi) = c1(xi+1), c2(xi) = c2(xi+1), …, and cn−1(xi) = cn−1(xi+1), then cn(xi) < cn(xi+1). Therefore, if there were multiple elements in the value-list <k2,[v2]>, then cn was essential and treated as a core attribute.
The self-circulation structure in reduce automatically compared the next set of adjacent samples, xi+1 and xi+2, until all samples in the decision table were compared. The detailed attribute reduction algorithm was as follows:
In general, the output of the reduce function was a new data set composed of key and value. However, only the judgement results of the last conditional attribute cn were required, so we set key = null and the output as a single line of values consisting of 0 or 1. A value of 1 indicated that the parameter changed during the cycle and the candidate attribute was a core.
In Algorithms 7 and 8, only the adjacent samples were compared because the input was a PR-SADT. It was much simpler and more efficient than the traditional algorithm shown in Algorithms 1 and 2. In addition, the proposed algorithm took full advantage of the MapReduce architecture for code simplicity and high efficiency. The sort processing was optimised using the inherent shuffle mechanism of MapReduce.
3.3.3. Algorithm realisation based on iterative jobs
According to Algorithm 4, all the conditional attributes had to be checked using the core attribute judgement job in an iteration model. The output of the pre-job would have influenced the input of the next job. Thus, we set up a global variable to mark the output of the core attribute judgement job so that the programme would respond, regardless of whether the column with respect to the candidate attribute was preserved in the next job. Figure 6 shows the complete iterative relations between the jobs. Figure 7 indicates how the judgement result of c(s) was passed on to the next core attribute judgement job on c(s + 1).

Design of iterative jobs.

<key,value> pairs of adjacent jobs.
4. Experimental evaluation
The experimental results of the attribute reduction algorithm using MapReduce are presented in this section. Furthermore, how we examined the performance of the proposed algorithm and verified its superiority in terms of the number of jobs, sizeup and runtime is discussed in detail.
4.1. Experimental setup
We ran the proposed algorithm on a cluster of four nodes, including one master node and three slave nodes. Each node was equipped with 7.8 GB of memory and used four processors (Intel® Core™ i5-4440 CPU @ 3.10 GHz). For each node, we installed CentOS 6.4 as the operating environment and Eclipse 3.6.1 as the compiling tool. Hadoop 1.1.2 was also installed on the appropriate running platform for large data.
We conducted an extensive series of experiments on the data sets Mushroom and KDD CUP 99 from the UCI Machine Learning Repository and four synthetic large data sets (DS3–DS6). We duplicated 5000 copies of Mushroom as data set DS1. DS2 was the KDD CUP 99 data set. Each data set had only one decision attribute, and the attribute values of each synthetic data set were random integers from 1 to 10. Table 4 summarises all the data sets.
Description of six data sets.
4.2. Comparison of the number of jobs
The number of jobs represented the most significant advantage of the proposed algorithm. The formula for the number of jobs was J = |C| + 1, which was not dependent on the output set R. Thus, it was easy to evaluate the real runtime for each data set. As a comparison, if only data parallel was considered, jobs of the attribute reduction algorithms based on significance calculations were influenced by the number of attributes in reduct R. The related formula for jobs of these algorithms was J = (2|C| − |R| + 1)*|R|/2, where |C| represents the total number of conditional attributes and |R| represents the number of attributes in the reduct. While the traditional algorithms applied more jobs, the runtime was hardly evaluated because of the influence of reduct R. Figures 8–10 show the detailed comparisons of DS1–DS6.

Number of jobs managing DS1.

Number of jobs managing DS2.

Number of jobs managing DS3–DS6.
As shown in Figures 8–10, the number of jobs carried out by the proposed algorithm was far less than that carried out by the traditional algorithms. DS2 was considered as an example. It was supposed that 10 conditional attributes were included in the reduct and that the number of jobs in the classic algorithms was 5.8 times greater than that of the proposed algorithm. Considering that each job would require a fixed amount of time to begin and assign tasks, the runtime of the traditional algorithm would not be less.
4.3. Performance analysis on different data sets
4.3.1. Runtime on different data sets
First, the relationship between performance and the number of attributes was analysed, and the results are shown in Figure 11.

Runtime with respect to |C|.
Based on Figure 11, the runtime of the proposed algorithm increased linearly with an increase in the number of attributes. Therefore, the total runtime was in a controllable range for each data set. In theory, the linear runtime feature arose from the linear quantity of jobs J = |C| + 1 and was not affected by reduct R.
Second, the relationship between the performance and the number of samples was analysed, and the results are shown in Figure 12.

Relationship between the performance and number of samples.
Figure 12 shows the runtime of 12 data sets. Half the data sets had 10,000,000 objects, and the others had 20,000,000 objects. The attribute numbers were 5, 10, 15, 20, 25 and 30, and the ratios with the runtime were 1.82, 1.81, 1.93, 1.93, 2.03 and 2.01, respectively. The average ratio was 1.93, which was similar to the ratios of the sample numbers. Therefore, there was also a linear relationship between the runtimes and the number of objects.
4.3.2. Sizeup
In traditional attribute reduction algorithms based on Hadoop, such as that reported by Qian et al. [3], the sizeup is usually analysed by copying the original data two, four and eight times. However, these replicas were filtered out by the shuffle process in the proposed algorithm. Thus, through artificial synthesis, we provided test data sets that had 5,000,000; 10,000,000; 20,000,000; and 40,000,000 objects. Each test data set had the same attribute set and decision type. Figure 13 illustrates the sizeup feature. The sizeup curve of the proposed algorithm was nearly linear.

Sizeup of the algorithm.
In conclusion, the performance and sizeup of the proposed algorithm was nearly linear in terms of the numbers of attributes and objects, and it was negligibly affected by the reduct set. Based on these results, we concluded that the real runtime was in a controllable range for each data set, and it was easily predicted for the users. As a comparison, the performance of traditional algorithms based on significance was similarly exponential and dependent on the output reducts. Thus, the proposed algorithm would be more effective and valuable in real-world applications.
4.4. Comparison with different algorithms
The proposed algorithm was compared with two traditional algorithms: one based on the positive region and another based on the boundary region [3]. The comparison is shown in Figure 14.

The runtime comparison of the three algorithms.
As Figure 14 indicates, the proposed algorithm required significantly less time to run than the two traditional algorithms. The total runtime of the algorithms based on a positive region was 4.9 times greater than that of the proposed algorithms on average. The algorithm based on a boundary region required 3.5 times longer than the proposed algorithms did. The ratios with different data sets were influenced by the real reducts. The more the attributes in the reducts, the higher the ratios, which meant that a greater number of jobs and longer runtimes were required for the traditional algorithm.
5. Conclusion
For this study, we proposed an efficient attribute reduction method based on MapReduce. This method only included two simple operations: sort and compare. To realise this algorithm, we designed new <key,value> pairs, which enabled us to automatically sort the overall decision tables through the shuffle mechanism of MapReduce. Theoretical analysis and experimental results showed that the proposed algorithm was efficient because |C| + 1 jobs were required and were time controllable because of the linear runtime features relating to attribute numbers and sample numbers.
Footnotes
Acknowledgements
The authors thank the anonymous reviewers for their thoughtful comments and suggestions.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research was supported by the National Natural Science Foundation of P.R. China (Nos: 61502538, 61773406).
