Scalable Quick Reduct Algorithm - Iterative MapReduce Approach

Praveen Kumar Singh
Tata Consultancy Services India, Hyderabad, Telangana, India
[email protected]

Dr. P.S.V.S Sai Prasad
School of Computer and Information Sciences, University of Hyderabad, Telangana, India
[email protected]
ABSTRACT
Feature selection by reduct computation is the key technique for knowledge acquisition using rough set theory. Existing MapReduce based reduct algorithms use the Hadoop MapReduce framework, which is not well suited to iterative algorithms. This paper aims at the design and implementation of an iterative MapReduce based Quick Reduct algorithm using the Twister framework. The proposed In_MRQRA algorithm performs partial granular level computations at the mappers and granular computations at the reducers. Experimental analysis on the KDDCup99 dataset empirically establishes the relevance of the proposed approach.
Keywords
Rough sets, Reduct, Parallel/Distributed Algorithms, Granular Computing, Twister, In_MRQRA, Iterative MapReduce.
1. INTRODUCTION
Rough Set theory [4] is an emerging soft computing paradigm with several applications in data mining and intelligent system design. Reduct computation (feature subset selection) in Rough Set theory can be used as a tool to discover dependencies in data and to remove the redundancy contained in a dataset. Reduct computation forms the basis for knowledge discovery from databases using Rough Set theory. Standalone reduct computation algorithms do not scale to feature selection in very large decision systems because they require loading the entire data into main memory. The emergence of the MapReduce framework resulted in several distributed/parallel reduct computation algorithms. The existing MapReduce based reduct computation approaches are primarily built on the Hadoop MapReduce framework. The iterative MapReduce framework is an extension of Hadoop MapReduce with efficient support for iterative algorithms. Existing approaches [5] lack granular computations in the mappers and also create a separate MapReduce job for each attribute evaluation.
ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
CODS ’16, March 13-16, 2016, Pune, India © 2016 ACM. ISBN 978-1-4503-4217-9/16/03. . . $15.00 DOI: http://dx.doi.org/10.1145/2888451.2888476
This paper aims at improving the scalability of reduct computation using iterative MapReduce in the Twister framework. We develop the Inplace MapReduce based QRA (In_MRQRA) as an iterative MapReduce version of the sequential forward selection (SFS) based Quick Reduct algorithm (QRA) [1]. The choice of the Quick Reduct algorithm is arbitrary; the methodology developed in this paper is easily adaptable to other SFS based algorithms.
2. BACKGROUND AND RELATED WORK
The relevant basics of Rough Sets are available in [3]. Classical Rough Sets are applicable to complete symbolic decision systems represented by DT = (U, C ∪ {d}), where U is the set of objects described by symbolic conditional attributes C and decision attribute d. Computations in Rough Sets, based on B ⊆ C, use the partition of U (granules) resulting from the equivalence relation IND(B) (indiscernibility relation). A granule goes into the lower approximation of a decision concept if all objects in the granule correspond to that decision concept. The positive region is the union of the lower approximations of all decision concepts, and the gamma measure γB({d}) is the proportion of positive region objects in U. A reduct R is a minimal subset of the conditional attribute set C such that γR({d}) = γC({d}). Among the methods proposed for reduct computation in complete symbolic decision systems [3], dependency function based SFS algorithms are computationally efficient and suitable for scalable reduct computation. The Quick Reduct Algorithm (QRA) was proposed by A. Chouchoulas et al. [1]. In QRA, following the SFS strategy, the reduct R is initialized to the empty set. In each iteration, the attribute from C − R giving the best gamma gain γR∪{a}({d}) − γR({d}) is included in R. The process continues until the end condition γR({d}) = γC({d}) is reached.
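For concreteness, the gamma measure and the SFS loop of QRA can be sketched as a minimal sequential Python implementation (the function names and the dict-per-object data layout are our own illustrative choices, not from [1]):

```python
from collections import defaultdict

def gamma(objects, attrs, d):
    """Proportion of positive-region objects: objects whose IND(attrs)
    granule is consistent, i.e., maps to a single decision value."""
    stats = defaultdict(lambda: [0, set()])  # signature -> [count, decisions]
    for row in objects:
        sig = tuple(row[a] for a in attrs)
        stats[sig][0] += 1
        stats[sig][1].add(row[d])
    consistent = sum(c for c, decs in stats.values() if len(decs) == 1)
    return consistent / len(objects)

def quick_reduct(objects, C, d):
    """Sequential forward selection: repeatedly add the attribute with
    the best gamma gain until gamma_R reaches gamma_C."""
    target = gamma(objects, C, d)
    R = []
    while gamma(objects, R, d) < target:
        best = max((a for a in C if a not in R),
                   key=lambda a: gamma(objects, R + [a], d))
        R.append(best)
    return R
```

It is this per-iteration evaluation of γR∪{a}({d}) for every candidate attribute that the proposed approach distributes across mappers and reducers.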
2.1 Twister and its Programming Model
Indiana University's Twister is an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently [2]. Twister supports cacheable map/reduce tasks with an in-built broadcasting mechanism for communicating iteration state information (dynamic data) to the mappers. The programming model of Twister comprises a Driver (main program), Mappers, Reducers and a Combiner. The data is horizontally partitioned across the individual mappers available at distributed locations. In an iteration, the Driver communicates the state information to all the mappers. Each Mapper, working with its static partition data and the dynamic iteration state data, constructs intermediate pairs and communicates them to the Reducer(s). Each Reduce invocation gets a particular key along with the list of associated values from all the mappers. The Reducer aggregates the list of values and communicates one or more key-value pairs to the Combiner. The Combiner acts as a global Reducer, computes the result of the iteration and communicates it to the Driver. The Driver repeats the iteration until the end condition is reached.
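The control flow above can be sketched as a small in-process Python simulation of a Twister-style iteration; this is an illustration of the programming model only (the function names and the dictionary-based shuffle are our own assumptions, not the Twister API):

```python
from collections import defaultdict

def run_iteration(partitions, state, map_fn, reduce_fn, combine_fn):
    """One iteration: broadcast dynamic state to mappers over cached
    static partitions, shuffle intermediate pairs by key, reduce per
    key, then combine globally into the iteration result."""
    shuffle = defaultdict(list)
    for part in partitions:                  # static, cached data
        for k, v in map_fn(part, state):     # state = dynamic data
            shuffle[k].append(v)
    reduced = [reduce_fn(k, vs) for k, vs in shuffle.items()]
    return combine_fn(reduced)

def drive(partitions, init_state, map_fn, reduce_fn, combine_fn, done):
    """Driver loop: repeat iterations until the end condition holds."""
    state = init_state
    while not done(state):
        state = run_iteration(partitions, state, map_fn, reduce_fn, combine_fn)
    return state
```

In real Twister the partitions stay resident in long-lived map tasks across iterations, which is what makes iterative algorithms efficient compared to re-reading input in every Hadoop job.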
3. PROPOSED IN_MRQRA ALGORITHM
The given decision table is horizontally partitioned and loaded into the mapper processes. In the Driver, the reduct R is initialized to the empty set. An iteration of QRA aims at including the best attribute from C − R, i.e., the one giving the maximum gamma measure. In an iteration, the Driver communicates the current R to the Mappers. Each Mapper constructs partial granules using IND(R ∪ {a}) for all contesting attributes a ∈ C − R. The computation of γR∪{a}({d}) requires the cardinality of the consistent granules (those belonging to a single decision concept) of IND(R ∪ {a}). Hence, the Mapper constructs a pair for each partial granule where the key contains the contesting attribute and the granule signature (the array of domain values satisfied by the objects of the partial granule). The value portion comprises the cardinality of the partial granule and the associated decision concept in case of consistency. If the partial granule is inconsistent, the cardinality is set to zero and the decision concept is set to a flag indicating inconsistency.

A Reduce invocation gets the input corresponding to one granule, since the list of values pertaining to a contesting attribute with the same granule signature is received together. The Reducer checks for consistency by verifying whether the decision concepts from all values are the same. If the granule is consistent, its cardinality is computed as the summation of the cardinalities of the partial granules in the list of values. A pair is communicated to the Combiner where the key corresponds to the contesting attribute and the value to the cardinality of the granule (zero in case of inconsistency).

The Combiner receives the list of pairs emitted by all reducers. By summing the cardinalities of the granules corresponding to each contesting attribute, γR∪{a}({d}) is computed ∀a ∈ C − R. The Combiner selects the next best attribute a∗ having the maximum gamma and communicates it to the Driver. The Driver includes a∗ into R and continues iterating until the end condition γR({d}) = γC({d}) is reached.
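As an illustration, the key-value flow of one In_MRQRA iteration can be sketched as a minimal in-process Python simulation (the function names, the dict-per-object rows, and the explicit shuffle are our own assumptions for exposition, not the actual Twister implementation):

```python
from collections import defaultdict

INCONSISTENT = None  # flag for partial granules spanning several decisions

def mapper(partition, R, candidates, d):
    """Emit ((attribute, granule signature), (cardinality, decision))
    for partial granules of IND(R U {a}), per contesting attribute a."""
    acc = defaultdict(lambda: [0, set()])
    for row in partition:
        for a in candidates:
            key = (a, tuple(row[x] for x in R + [a]))
            acc[key][0] += 1
            acc[key][1].add(row[d])
    pairs = []
    for key, (card, decs) in acc.items():
        if len(decs) == 1:
            pairs.append((key, (card, decs.pop())))
        else:  # inconsistent partial granule: zero cardinality + flag
            pairs.append((key, (0, INCONSISTENT)))
    return pairs

def reducer(key, values):
    """Sum partial cardinalities if all partial granules agree on one
    decision concept; otherwise the whole granule is inconsistent."""
    decisions = {dec for _, dec in values}
    if len(decisions) == 1 and INCONSISTENT not in decisions:
        return key[0], sum(card for card, _ in values)
    return key[0], 0

def combiner(pairs, n_objects):
    """Per-attribute gamma from granule cardinalities; pick the best."""
    totals = defaultdict(int)
    for attr, card in pairs:
        totals[attr] += card
    best = max(totals, key=totals.get)
    return best, totals[best] / n_objects
```

One driver iteration then amounts to running the mappers over the partitions with the current R, grouping the emitted pairs by key, reducing, and letting the combiner return a∗ with its gamma value.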
Compared to existing approaches [5], the partial granular level pair construction significantly reduces the amount of data communicated from the mappers to the reducers. Another improvement in the proposed approach is that an iteration of In_MRQRA is managed in a single MapReduce-Combiner job, whereas existing approaches involve |C − R| MapReduce jobs.
4. EXPERIMENTAL ANALYSIS
The performance of the proposed In_MRQRA is evaluated on the KDDCup99 dataset (4898432 objects with 41 attributes). The performance analysis uses standard measures for distributed algorithms such as speedup and sizeup. The experimental environment is a cluster of 4 nodes, each with an Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz, 4GB RAM and OpenSuse-12.2 (Linux 3.4 kernel). All nodes run Twister 0.9, and the master node additionally runs the ActiveMQ pub-sub message broker. In_MRQRA is executed on the cluster using 20%, 40%, 60%, 80% and 100% of the KDDCup99 dataset. Taking the computational time for 20% of the data as the base size, the sizeup measures obtained for 40%, 60%, 80% and 100% are 1.96, 2.91, 3.52 and 4.38 respectively. Using the full KDDCup99 dataset, In_MRQRA is executed on cluster sizes of 1, 2 and 4 nodes. With reference to the computational time on the 1-node cluster, speedups of 1.22 and 1.39 are obtained using the 2-node and 4-node clusters respectively.
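The two measures can be stated concretely; a small sketch with our own helper names (not taken from the paper):

```python
def speedup(t_one_node, t_p_nodes):
    """Speedup(p) = T(1 node) / T(p nodes); the ideal value is p."""
    return t_one_node / t_p_nodes

def sizeup(t_base, t_scaled):
    """Sizeup(m) = T(m x base data) / T(base data); growth that is
    roughly linear in m indicates good data scalability."""
    return t_scaled / t_base
```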
4.1 Comparative Results

Table 1: Comparative Results with PAR

Percentage of Data | Objects  | PAR 5 slaves (sec) | PAR 10 slaves (sec) | Proposed In_MRQRA 4 slaves (sec)
20                 | 979686   | 1196               | 939                 | 444.298
60                 | 2939059  | 3944               | 3301                | 1294.646
100                | 4898432  | 6120               | 5050                | 1947.338
The In_MRQRA algorithm is compared with the Parallel Attribute Reduction (PAR) algorithm based on Hadoop MapReduce [6]. In [6], experiments on PAR are performed using Hadoop-0.20.2 on 5-node and 10-node clusters with a configuration of 2.4 GHz CPU and 1 GB RAM per node, using the KDDCup99 dataset. The results of In_MRQRA using the 4-node cluster are compared with the results reported in [6] in Table 1.
5. CONCLUSIONS AND FUTURE WORK
The scalability of In_MRQRA is established empirically by the linearly increasing sizeup measure. The speedup aspect of scalability requires further improvement and will be investigated in future work. In comparison to the PAR algorithm, on average, around 50% computational gain is obtained by In_MRQRA. The result is significant because, even though our cluster configuration is better than that of [6], our cluster size is much smaller than the cluster sizes in [6]. The comparative results establish the relevance of iterative MapReduce for reduct computation and of the partial granular level computations at the Mappers in In_MRQRA. The In_MRQRA algorithm will be extended to the several improvements of QRA available in the literature and also to other dependency measures.
6. REFERENCES
[1] A. Chouchoulas and Q. Shen. Rough set-aided keyword reduction for text categorization. Applied Artificial Intelligence, 15(9):843–873, 2001.
[2] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative MapReduce. In S. Hariri and K. Keahey, editors, HPDC, pages 810–818. ACM, 2010.
[3] R. Jensen. Rough set-based feature selection: A review. In Rough Computing: Theories, Technologies and Applications, pages 70–107. IGI Global, 2008.
[4] Z. Pawlak. Rough sets. International Journal of Parallel Programming, 11(5):341–356, 1982.
[5] J. Qian, D. Miao, Z. Zhang, and X. Yue. Parallel attribute reduction algorithms using MapReduce. Inf. Sci., 279:671–690, 2014.
[6] Y. Yang, Z. Chen, Z. Liang, and G. Wang. Attribute reduction for massive data based on rough set theory and MapReduce. In RSKT, volume 6401 of Lecture Notes in Computer Science, pages 672–678. Springer, 2010.