2016 Intl. Conference on Advances in Computing, Communications and Informatics (ICACCI), Sept. 21-24, 2016, Jaipur, India
Exploiting Apache Flink's Iteration Capabilities for Distributed Apriori: Community Detection Problem as an Example
Sanjay Rathee
School of Computing and Electrical Engineering, I.I.T. Mandi, Himachal Pradesh, India
E-Mail: [email protected]
Arti Kashyap
School of Computing and Electrical Engineering, I.I.T. Mandi, Himachal Pradesh, India
E-Mail: [email protected]
one of the most important techniques for mining useful information from large datasets. It is used to find interesting relations among variables in large datasets. To create association rules, we first have to find frequent patterns in the dataset, which makes frequent itemset mining the base process for association rule mining.
Abstract—Extraction of useful information from large datasets is one of the most important research problems. Association rule mining is one of the best methods for this purpose. Finding possible associations between items in large transaction-based datasets (finding frequent patterns) is the most important part of association rule mining. Many algorithms exist to find frequent patterns, but the Apriori algorithm remains a preferred choice due to its ease of implementation and natural tendency to be parallelized. Many single-machine Apriori variants exist, but the massive amount of data available these days is beyond the capacity of a single machine. Therefore, to meet the demands of this ever-growing data, there is a need for a multi-machine Apriori algorithm. For these types of distributed applications, MapReduce is a popular fault-tolerant framework. Hadoop is one of the best open-source software frameworks with a MapReduce approach for distributed storage and distributed processing of huge datasets using clusters built from commodity hardware. However, the heavy disk I/O at each iteration of a highly iterative algorithm like Apriori makes Hadoop inefficient.
Association rule mining has applications in many fields. Historically, it was used in market basket analysis to find relations between products. We can find products which are frequently sold together, and these relations can be used to make better business plans for arranging and selling the products. Association rules can also be very useful in graph mining techniques such as community detection. Communities in a graph can be detected by finding the most frequent paths occurring in the graph. A clique of the required length represents a community in the graph, and association rule techniques can find these cliques quite easily. Association rules can also be used for crime detection and prevention [21, 1]. We can analyze large datasets of crime events that happened in various cities or states and find the frequent patterns occurring in those events. These frequent patterns can be used to predict the most crime-sensitive areas or the persons involved in the most crimes. Cyber security is another important application of association rule mining [11]. Large datasets containing information about attacks from various ports and IP addresses can be analyzed to find the addresses or ports that occur most frequently in attacks. This information can be used to block requests from these addresses or ports.
A number of MapReduce-based platforms have been developed for parallel computing in recent years. Among them, two platforms, namely Spark and Flink, have attracted a lot of attention because of their built-in support for distributed computations. Earlier we proposed a reduced-Apriori algorithm on the Spark platform which outperforms parallel Apriori, first because of the use of Spark and second because of the improvement we proposed to standard Apriori. The present work is therefore a natural sequel to our earlier work: it targets implementing, testing and benchmarking Apriori on Apache Flink and compares it with the Spark implementation. We conduct in-depth experiments to gain insight into the effectiveness, efficiency and scalability of the Apriori algorithm on Flink. We also use the community detection graph mining problem as a test case to demonstrate our implementation.
In recent years, association rules have been widely used for crowd mining [24, 25]. Crowd mining is the process of extracting information from a huge dataset which contains answers to various questions asked of the crowd. Based on the answers given by the crowd to particular questions, we can find patterns in those answers and predict the behavior of the crowd. This information can be used to devise business strategies that benefit from the crowd available in that area.
Keywords— Apriori; Apache Flink; Mapreduce; Spark; Hadoop; R-Apriori; Frequent itemset mining.
I. INTRODUCTION
Data mining techniques such as clustering, classification and association rule mining are used to extract useful information from large datasets. Association rule mining is
For association rule mining, various algorithms such as Apriori, Eclat and FP-Growth have been proposed. Most of these algorithms scan the dataset to find frequent patterns, which
tackles these problems by using its RDD architecture, which stores results at the end of an iteration and provides them to the next iteration. An Apriori implementation on the Spark platform gives many times faster results on standard datasets, which makes Spark a strong choice for implementing the Apriori algorithm to mine frequent patterns and generate association rules later. Recently, Qiu et al. [5] reported an 18 times speedup on average across various benchmarks for the Yet Another Frequent Itemset Mining (YAFIM) algorithm based on the Spark RDD framework. Their results with real-world data for a medical application were observed to be many times faster than the MapReduce framework. Apache Spark also has a limitation: a new iteration can be started only when all results of the earlier iterations have been produced. Apache Flink tackles this problem by using a pipelined architecture, that is, a new iteration can be started as soon as partial results are available. This pipelined architecture of Apache Flink sparked our interest in implementing on Flink both the Apriori algorithm and our R-Apriori, which is already faster than standard Apriori on Spark [20].
can be used for rule generation. The Apriori algorithm, first proposed by R. Agrawal et al. [18], works on the theorem that an itemset can be frequent only if all its non-empty subsets are also frequent. It uses an iterative approach in which the results of the previous iteration are used to find the frequent itemsets of the current iteration. It starts by finding singleton frequent items, where an item occurring more than the minimum support count is called frequent. Then the frequent K-itemsets are found using the frequent (K-1)-itemsets, following the Apriori approach. After the singleton frequent items are found, a candidate set is generated. The candidate set for the Kth iteration contains all combinations of K items whose subsets were all frequent in iteration K-1. To find the frequent K-itemsets, we only have to check itemsets whose subsets are all present in the frequent (K-1)-itemsets. The frequent itemsets of iteration K are then used to generate the candidate set for iteration K+1, and the process iterates until all frequent patterns are found. The Apriori algorithm has some limitations: the candidate set for the 2nd phase is huge if the number of items is large, and the dataset has to be scanned again for every iteration. The Eclat algorithm [16] was proposed to tackle the problem of repeated dataset scans. Eclat uses a vertical data layout. A new list is created in which every row contains two columns: the 1st column holds an item and the 2nd holds all the transactions containing that item. Singleton frequent items are found by counting the number of transactions in the 2nd column for each item. After the candidate set for iteration K is created, the support of an itemset can be found by intersecting the 2nd columns of all items in the itemset. Therefore, there is no need to scan the dataset again for every iteration. Eclat scans the dataset only once, which makes it faster than Apriori, but it still suffers from a huge candidate set in the 2nd phase if the number of items is large.
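The item-to-transactions layout described for Eclat can be illustrated with a minimal Python sketch. The toy transactions, the function names and the use of "count >= min_support" as the frequency test are assumptions made for illustration; this is not the implementation from [16].

```python
from itertools import combinations

def build_tidlists(transactions):
    """Vertical layout: map each item to the set of transaction IDs containing it."""
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def support(itemset, tidlists):
    """Support of an itemset is the size of the intersection of its items' TID sets."""
    return len(set.intersection(*(tidlists[i] for i in itemset)))

# Hypothetical toy dataset (not taken from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
min_support = 2  # assumption: an itemset is frequent when its count >= min_support

tidlists = build_tidlists(transactions)  # the single scan of the dataset
frequent_items = sorted(i for i, tids in tidlists.items() if len(tids) >= min_support)
frequent_pairs = {
    pair: support(pair, tidlists)
    for pair in combinations(frequent_items, 2)
    if support(pair, tidlists) >= min_support
}
```

After the single pass in build_tidlists, every later support count is computed from the TID sets alone, which is why no further scan of the original transactions is needed.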
To resolve the huge candidate set problem, the FP-Growth algorithm [8] was proposed by J. Han et al. In FP-Growth, the candidate generation step is removed. Initially, the singleton frequent items are found and an FP-Tree is created. This FP-Tree is used to find frequent patterns of all sizes. The algorithm is faster than Apriori because it avoids the candidate generation step.
The paper is organized as follows. After the subject and the motivation are introduced in Section I, earlier work on frequent itemset mining, mainly the Apriori algorithm, is reported in Section II. Section III introduces the Flink based Apriori algorithm in detail. Section IV evaluates the performance of F-Apriori. Section V shows the use of the Flink based Apriori algorithm for community detection, and Section VI concludes the paper.
II. EARLIER WORK
The Apriori algorithm by R. Agrawal et al. [18] is the most popular and effective algorithm for association rule mining. It is based on the theorem that an itemset can be frequent only if all its non-empty subsets are frequent. This theorem removes a lot of the computation needed to find frequent itemsets in transactional data. Many improved versions of Apriori have been proposed. All these conventional and improved Apriori algorithms were capable of handling datasets of limited size only. As the size of datasets grew, these algorithms became less efficient. Therefore, MapReduce-based parallel implementations of Apriori were proposed to handle such huge datasets. A MapReduce-based parallel implementation of the Apriori algorithm (MR-Apriori) was proposed by Ning Li et al. [17], who evaluated the size-up, scale-up and speed-up performance of parallel Apriori to show that their algorithm can handle huge datasets efficiently. A MapReduce-based Apriori algorithm with a better pruning technique, called IMRApriori, was proposed by Zahra Farzanyar et al. [26]. They claimed that their algorithm is more time efficient than MR-Apriori due to the better pruning technique, and they used it to find frequent itemsets in huge social network datasets. Another improved MapReduce-based Apriori algorithm, called CMR-Apriori, by Jian Guo et al. [9], claimed some coding improvements over earlier MapReduce-based algorithms. They found that their algorithm outperforms the conventional MR-Apriori algorithm for a book recommendation service model.
Many cloud computing and cluster-based Apriori algorithms have also been proposed. Lingjuan Li et al. proposed an efficient and faster Apriori implementation using cloud computing in 2011. All these MapReduce-based Apriori
All the basic algorithms work sequentially, and they were efficient as long as datasets were small. As the size of datasets increased, their efficiency decreased. Therefore, parallel algorithms were introduced to handle large datasets [17]. Many cluster-based algorithms were capable of handling large datasets, but they were complex and had many issues such as synchronization and replication of data. Therefore, the parallel approach was replaced by the MapReduce approach. The MapReduce approach makes association rule mining very fast because algorithms like Apriori offer a high degree of parallelism, and key-value pairs can be generated easily in the case of Apriori. Many MapReduce-based implementations of the Apriori algorithm [13, 12, 7] were proposed which show a high performance gain over the conventional Apriori algorithm. Hadoop [2] is one of the best platforms for implementing the Apriori algorithm as a MapReduce model. Still, a Hadoop based implementation of Apriori has some limitations. On the Hadoop platform, results are stored to HDFS after every iteration and input is read from HDFS for the next iteration, which decreases performance due to input-output time. The Spark [15] platform, however,
natively because 1) its runtime is aware of iterative execution, 2) there is no scheduling overhead between iterations, and 3) state maintenance and caching are handled automatically. Flink provides simple (bulk) and delta iterations for iterative computations. Delta iteration (Figure 2) can speed up iterative computations many times compared with classical iterative computing (Figure 1).
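The difference between the two iteration styles can be illustrated with a conceptual sketch in plain Python. It does not use the Flink API. The example problem (propagating the minimum label through a graph), the function names and the count of processed vertices are assumptions chosen for illustration, and the delta version updates labels in place within a round, which is a simplification of Flink's superstep model.

```python
def neighbors_map(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bulk_iteration(adj):
    """Classical iteration: recompute every vertex in every round until nothing changes."""
    labels = {v: v for v in adj}
    processed = 0
    while True:
        new_labels = {}
        for v in adj:
            processed += 1
            new_labels[v] = min([labels[v]] + [labels[n] for n in adj[v]])
        if new_labels == labels:
            return labels, processed
        labels = new_labels

def delta_iteration(adj):
    """Delta iteration: keep a solution set and a workset of recently changed vertices."""
    solution = {v: v for v in adj}
    workset = set(adj)
    processed = 0
    while workset:
        next_workset = set()
        for v in sorted(workset):  # sorted only to make the run deterministic
            processed += 1
            for n in adj[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]  # update the solution set
                    next_workset.add(n)        # only changed vertices are revisited
        workset = next_workset
    return solution, processed

# Hypothetical graph with two components: 1-2-3-4 and 5-6.
adj = neighbors_map([(1, 2), (2, 3), (3, 4), (5, 6)])
bulk_labels, bulk_work = bulk_iteration(adj)
delta_labels, delta_work = delta_iteration(adj)
```

Both functions reach the same labels, but the delta version touches fewer vertices in total because each round only revisits vertices whose label changed in the previous round.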
implementations have nearly the same speed and efficiency as MR-Apriori. All of them suffer from high I/O time because the dataset is loaded from the Hadoop distributed file system (HDFS) for every iteration and results are stored back to HDFS after the iteration completes. A MapReduce-based implementation of the Apriori algorithm on the Spark platform (YAFIM) by Hongjian Qiu et al. [5] tackled the problem of high I/O time because Spark uses RDDs to store results between iterations of Apriori. Therefore, YAFIM outperforms the MR-Apriori algorithm many times over. Spark also has a limitation: the next iteration cannot be started until the current iteration is completed. Apache Flink [22] supports iterative computations natively, and Apriori is a highly iterative algorithm. Flink's native support for iterative computation therefore sparked our interest in testing the performance of Apriori on Flink and comparing it with the Spark implementation.
Fig 2. Apache Flink Delta iteration[22]
III. APRIORI ON FLINK
Apriori uses an iterative approach in which the results of the earlier iteration are used to find the frequent itemsets of the next iteration. Apriori works on the theorem that an itemset can be frequent only if all its non-empty subsets are frequent. Therefore, the itemsets found frequent in the current iteration are used to generate the candidate set for the next iteration. Candidate itemsets are then validated as frequent or infrequent. Frequent itemsets are those occurring more than the minimum support. In the next step, these frequent itemsets are used to generate the candidate set for the following iteration. The same procedure is repeated until no frequent itemset is found. This is why Apriori is called a highly iterative algorithm. Apache Flink supports iterations natively and hence provides a highly parallel environment for Apriori. Apriori is implemented on Flink in two phases. The 1st phase generates all singleton frequent items. The 2nd phase uses the iteration function of Flink and iterates until the size of the output (the frequent itemsets) is zero. Iteration k generates the frequent itemsets of length k. Both phases are described below:
Fig 1. Apache Flink Iteration Support.[22]
A. Apache Flink
Apache Flink [22] is one of the best open-source platforms for scalable batch, distributed and stream data processing. Its core is a streaming dataflow engine which provides data distribution, communication and fault tolerance for distributed computations over large data streams. Its pipelined architecture lets it support iterative computations natively. It supports iterative computations
Fig 3. Phase 1 of Flink based Apriori Algorithm.
A. Phase 1
Phase 1 of F-Apriori produces all singleton frequent items. Data from the Hadoop distributed file system (HDFS) is taken as input and distributed to the mappers. Each mapper emits an (item, 1) pair for every item in a transaction, with the item as the key. The reducer combines all values for a key and produces the total count for that key. If the count for a key is more than the minimum support, the item is frequent; otherwise it is infrequent.
Fig 4. Apriori on Flink Phase 1 lineage graph [20]: Input File -> flatMap(_.split("\n")) -> Transactions -> flatMap(_.split(" ")) -> Items -> map(item => (item, 1)) -> (Item, 1) -> reduceByKey(_+_) -> (Item, Count).
All the frequent items are stored in the Flink cache to build the candidate set for the 2nd phase. Figure 4 shows the lineage graph for phase 1 of Apriori on Flink. Figure 3 represents the MapReduce-based model for phase 1.
B. Phase 2
The second phase uses iterative computation. Each iteration consists of candidate generation followed by mapper and reducer operations. First, a candidate set is generated from the singleton frequent items using Flink's join operation.
Fig 5. Apriori on Flink Phase 2 lineage graph: Singleton Frequent (Itemset, Count) -> flatMap(_.getCandidateItemset()) -> Candidate-set -> collect().buildHashTree() -> HashTree; Transactions -> flatMap(_.findItemsetinHashTree()) -> Itemsets -> map(itemset => (Itemset, 1)) -> (Itemset, 1) -> reduceByKey(_+_) -> (Itemset, Count) -> Partial results -> Final Results.
Fig 6. Phase 2 of Flink based Apriori Algorithm.
Singleton frequent items are stored in the Flink cache, so the join operation is very fast. The dataset from HDFS and the candidate set from the Flink cache are given as input to every mapper. Every mapper checks whether each itemset in the candidate set is present in the given transactions and produces an (itemset, 1) pair as output if the itemset is present in a transaction. The reducer combines all values for every itemset and produces the total count for that itemset. If the count for an itemset is more than the minimum support, the itemset is sent on to generate the candidate set for the next iteration, until the given number of iterations is completed. Final results are produced when the required number of iterations is complete or when an iteration yields no frequent itemset. Figure 5 shows the lineage graph for phase 2 of Apriori on Flink. Figure 6 shows the processing model of Flink for the Apriori algorithm.
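The two phases described above can be sketched on a single machine in plain Python. This is a minimal illustration and not the authors' Flink code: the function names, the max_size parameter, the toy input and the use of "count >= min_support" as the frequency test are all assumptions, and a simple containment check stands in for the hash tree lookup.

```python
from collections import Counter
from itertools import combinations

def phase1(lines, min_support):
    """Phase 1: split lines into items, sum (item, 1) pairs per key, keep frequent items."""
    transactions = [frozenset(line.split()) for line in lines]
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    return transactions, frequent

def generate_candidates(prev_frequent, k):
    """Join frequent (k-1)-itemsets; keep k-itemsets whose (k-1)-subsets are all frequent."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

def phase2(transactions, singletons, min_support, max_size=10):
    """Phase 2: iterate until an iteration yields no frequent itemset."""
    all_frequent = dict(singletons)
    current = singletons
    for k in range(2, max_size + 1):
        candidates = generate_candidates(current, k)
        # Emit (itemset, 1) for each candidate contained in a transaction, then sum per key.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        current = {c: n for c, n in counts.items() if n >= min_support}
        if not current:
            break
        all_frequent.update(current)
    return all_frequent

# Hypothetical input: one transaction per line, items separated by spaces.
lines = ["a b c", "a b", "a c", "b c", "a b c"]
transactions, singletons = phase1(lines, min_support=3)
result = phase2(transactions, singletons, min_support=3)
```

With this toy input the loop stops in the third iteration, because the only candidate of size 3 occurs in just two transactions.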
• Flink based Apriori is nearly 2 times faster than Hadoop based Apriori for the BMS-WebView-2 dataset with minimum support 0.3% for 3 iterations (Figure 7.3). It takes 11 seconds, whereas Spark and Hadoop based Apriori take 15 and 20 seconds respectively.
• Flink based Apriori is more than 3 times faster than Hadoop based Apriori and more than 2 times faster than Spark based Apriori for the T25I10D10K dataset for 3 iterations with 10% minimum support (Figure 7.4).
For every dataset, Flink based Apriori outperforms standard Apriori on Hadoop and standard Apriori on Spark. Flink based Apriori works well when the minimum support is high, but we observed that as the minimum support decreases, Flink may crash.
IV. PERFORMANCE EVALUATION
In this section, we present a performance evaluation of Apriori on Flink in comparison with other MapReduce-based Apriori implementations on Hadoop (MR-Apriori [20]). Most MapReduce implementations of Apriori on Hadoop give nearly the same performance, as each of them reads data from HDFS and writes it back after each iteration. Many benchmark datasets were used in our experiments. All experiments were executed four times and the average was taken as the final result. F-Apriori was implemented on Flink-0.9.1 and MR-Apriori was implemented on Hadoop-2.0. All datasets are available on the same HDFS cluster. All experiments were done on a cluster of 2 nodes, each having 24 cores and 180GB memory. All nodes were running CentOS 6 and Java 8.
A. Datasets
Experiments were done with four large datasets having different characteristics.
• T10I4D100K [4] is an artificial dataset generated by IBM's data generator; it has 10^5 transactions with 870 items.
• The Retail dataset [4] was used for the market-basket model; it contains transactions made by customers in a shopping mall.
• BMSWebView2 [3] is a dataset used for the KDD Cup 2000 competition; it has an average transaction length of 4.62 with a standard deviation of 6.07.
• T25I10D10K [3] is a synthetic dataset generated by a random transaction database generator.
Properties of these datasets are shown in Table 1:
Fig 7.1 T10I4D100K min sup=1%
TABLE 1. DATASETS CHARACTERISTICS
Dataset        Number of Items    Number of Transactions
T10I4D100K     870                100,000
Retail         16470              88,163
BMSWebView2    3340               77,512
T25I10D10K     990                4,900
Fig 7.2 Retail Dataset min sup=0.50%
B. Speed Performance Analysis
The performance of both algorithms was evaluated on the different datasets. For all four datasets, a comparison is made with standard Apriori on Hadoop.
• For the T10I4D100K dataset, Flink based Apriori takes only 34 seconds for 3 iterations, whereas Hadoop and Spark based Apriori take 58 and 41 seconds respectively (Figure 7.1).
• For the Retail dataset, Flink based Apriori takes 11 seconds for 3 iterations, whereas Spark and Hadoop based Apriori take 20 and 22 seconds respectively (Figure 7.2).
Fig 7.3 BMSWebView-2 Dataset min sup=0.30%
challenging problems in graph mining. A community can be detected by finding the possible cliques of various lengths in a given graph. Apriori is one of the best algorithms for finding cliques in such huge graphs. To find a clique of length n, we have to find frequent patterns of length n with minimum support n. For example, suppose we have to find the communities in Graph 1 given below. We can easily build a transactional dataset for an undirected graph like Graph 1 by using its matrix representation. Figure 8(a) shows the transactional dataset for Graph 1. Every node in the graph is taken as a transaction ID, and the node together with all its adjacent nodes forms the itemset for that node.
Fig 7.4 T25I10D10K Dataset min sup=10%
V. APRIORI ON FLINK
In this section we use the Apriori implementation on the Flink platform to solve the community detection problem. Finding communities in a graph is one of the most important problems in graph mining. Graphs are growing so fast these days that single-machine algorithms are not capable of handling such huge graph datasets. For example, Facebook has over one million users and is still growing at an impressive rate. Finding communities in a graph with one million nodes is not efficient with a single-machine algorithm. Highly parallel distributed computing algorithms are needed to handle such huge graphs.
Fig 9. (a) Frequent pairs after iteration 2 (b) Result after iteration 3 (c) Result after iteration 4.
Apriori takes this transactional data as input and, after combining all occurrences of every item, produces the results shown in Figure 8(b). Figure 8(c) shows the singleton frequent items, which have a total count more than the minimum support. Phase 1 of Flink based Apriori completes at Figure 8(c).
Graph 1. Social media user graph (nodes A, B, C, D, E, F, G).
Fig 10. Using Flink based Apriori to find communities in large Dataset of BMS.
Fig 8. (a) Transactional Dataset for Graph 1 (b) Singleton items with total count (c) Singleton frequent items after Phase1.
A candidate set is generated from the singleton frequent items, and the total count for every itemset in the candidate set is evaluated in the 2nd iteration of Flink based Apriori. Itemsets with a total count more than the minimum support are marked as frequent (Figure 9(a)). The same procedure is repeated for iteration 3 and iteration 4. Figure 9(b) and Figure 9(c) show the results for
We used Apache Flink to provide a highly parallel computing environment for mining graph datasets for community detection. Flink supports graph and iterative computing natively, so it is well suited to an algorithm like Apriori. Finding communities in huge graphs is one of the most
iteration 3 and iteration 4 respectively. From these results we can easily find the different possible cliques. An itemset of size 3 is frequent only if all three of its nodes have an edge between each other, an itemset of size 4 is frequent only if all four nodes have an edge between each other, and so on. Therefore, the maximal clique possible for a set of nodes can be classified as a community. For example, Graph 1 has 2 communities, ABCD and EFG.
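The procedure of this section can be sketched in plain Python as follows. The edge list is hypothetical, since the edges of Graph 1 are not given in the text; it was chosen so that the result matches the two communities named above. The function names and the min_size parameter are also assumptions. The sketch applies the support criterion exactly as stated (a size-n itemset needs support at least n) and does not separately verify adjacency.

```python
from itertools import combinations

# Hypothetical edges: two cliques, ABCD and EFG, joined by the edge D-E.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G")]

def to_transactions(edges):
    """One transaction per node: the node itself plus all its adjacent nodes."""
    closed = {}
    for u, v in edges:
        closed.setdefault(u, {u}).add(v)
        closed.setdefault(v, {v}).add(u)
    return closed

def frequent_by_size(transactions):
    """Level-wise search in which an itemset of size n needs support >= n."""
    levels = {1: {frozenset([node]) for node in transactions}}
    for n in range(2, len(transactions) + 1):
        prev = levels[n - 1]
        # Join step, then prune candidates that have an infrequent (n-1)-subset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == n}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, n - 1))}
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions.values()) >= n}
        if not frequent:
            break
        levels[n] = frequent
    return levels

def communities(levels, min_size=3):
    """Report frequent itemsets of at least min_size that have no frequent superset."""
    found = [s for n, sets in levels.items() if n >= min_size for s in sets]
    return sorted("".join(sorted(s)) for s in found
                  if not any(s < other for other in found))

transactions = to_transactions(edges)
levels = frequent_by_size(transactions)
result = communities(levels)
```

The min_size parameter is set to 3 here so that the single bridging edge D-E, which is frequent as a pair, is not reported as a community of its own.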
A large BMS-WebView dataset [3] is analyzed using Flink based Apriori to find communities (Figure 10). This dataset contains several months of click-stream data from two well-known e-commerce web sites. Each transaction in the dataset is a web session consisting of all the product detail pages viewed in that session. Here every product page can be seen as a node of a graph, and a community represents product pages which are traversed together. Flink based Apriori takes a few minutes to analyze this dataset of 77,512 transactions. The 2nd iteration is one of the most time- and space-consuming iterations of the distributed Apriori algorithm because a huge candidate set is generated from the singleton frequent items. For example, for 10^4 frequent singletons, it will generate nearly 10^8 candidate pairs. Matching such a huge number of candidate pairs against every transaction is very time consuming. To gain further insight, it will be interesting to explore more datasets.
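The size of the second-iteration candidate set can be checked with a short calculation. The sketch below is illustrative only; the function name is an assumption. It counts unordered pairs, n(n-1)/2, which for n = 10^4 is about 5 x 10^7; counting ordered pairs, n(n-1), gives about 10^8.

```python
from math import comb

def candidate_pairs(num_frequent_singletons):
    """Number of unordered 2-itemsets that can be formed from the frequent singletons."""
    return comb(num_frequent_singletons, 2)

n = 10**4
unordered_pairs = candidate_pairs(n)  # n * (n - 1) / 2
ordered_pairs = n * (n - 1)           # the same pairs counted in both orders
```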
VI. CONCLUSION
A Flink based distributed Apriori is implemented to mine frequent patterns from huge graphs and large datasets. It uses the basic Apriori theorem that an itemset is frequent only if all its non-empty subsets are frequent. It is implemented on the Apache Flink platform, which provides a highly parallel and distributed computing environment. Flink is well suited to Apriori because Apache Flink has native support for iterative computation and Apriori is an iterative algorithm. Flink's pipelined architecture allows a new iteration of Apriori to start as soon as partial results of the earlier iteration are available. The delta iteration functionality of Flink makes Apriori a highly parallel and effective algorithm for huge datasets. Apriori is used to find communities in huge graphs. In summary, we have presented an implementation of Apriori on Flink and tested it with different datasets. We showed that Flink based Apriori is capable of handling huge graphs and large transactional datasets.
REFERENCES
[1] Anna L. Buczak and Christopher M. Gifford, “Fuzzy Association Rule Mining for Community Crime Pattern Discovery,” In ISI-KDD 2010, ACM, USA, 2010.
[2] Apache hadoop. http://hadoop.apache.org/2013.
[3] Datasets. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.
[4] FIMI Datasets. http://fimi.ua.ac.be/data/
[5] Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang, “YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark,” In 2014 IEEE 28th International Parallel & Distributed Processing Symposium Workshops, 2014.
[6] Honglie Yu, Jun Wen and Hongmei Wang, “An Improved Apriori Algorithm Based On the Boolean Matrix and Hadoop,” In Int. Conf. on Advanced in Control Engineering and Information Science (CEIS), 2011, pp. 1827-1831.
[7] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” In Proc. OSDI. USENIX Association, 2004.
[8] J. Han, J. Pei and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” In Proc. Conf. on the Management of Data (SIGMOD’00, Dallas, TX), ACM Press, New York, NY, USA 2000.
[9] Jian Guo, “Research on Improved Apriori Algorithm Based on Coding and MapReduce,” In 10th Web Information System and Application Conference, 2013, pp. 294-299.
[10] Lan Vu and Gita Alaghband, “Novel Parallel Method for Mining Frequent Patterns on Multi-core Shared Memory Systems,” In ACM conf., Denver USA, 2013, pp. 49-54.
[11] Latifur Khan, Mamoun Awad and Bhavani Thuraisingham, “A new intrusion detection system using support vector machines and hierarchical clustering,” In VLDB Journal 2007, 2007, pp. 507-521.
[12] Li N., Zeng L., He Q. & Shi Z, “Parallel Implementation of Apriori Algorithm Based on MapReduce,” In Proc. 13th ACIS Int. Conf. Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD ‘12), Kyoto, IEEE, 2012, pp. 236 – 241.
[13] Lin M., Lee P. & Hsueh S, “Apriori-based Frequent Itemset Mining Algorithms on MapReduce,” In Proc. 16th International Conference on Ubiquitous Information Management and Communication (ICUIMC ‘12), New York, NY, USA, ACM: Article No. 76, 2012.
[14] Lingjuan Li, Min Zhang, “The Strategy of Mining Association Rule Based on Cloud Computing,” In Proc. 2011 Int. Conf. Business Computing and Global Informatization (BCGIN ‘11). DC, USA, IEEE, 2011, pp. 475-478.
[15] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, “Spark: Cluster Computing with Working Sets,” In Proc. 2nd USENIX conf. on Hot topics in cloud computing, USENIX Association Berkeley, CA, USA, 2010.
[16] Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara and Wei Li, “New algorithms for fast discovery of association rules,” Technical Report 651, Computer Science Department, University of Rochester, Rochester, NY 14627. 1997.
[17] Ning Li, Li Zang, Qing He and Zhongzhi Shi, “Parallel Implementation of Apriori Algorithm Based on MapReduce,” In Int. Journal of Networked and Distributed Computing, Vol. 1, No. 2 (April 2013), pp. 89-96.
[18] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” Research Report RJ9839, IBM Almaden Research Center, San Jose, California, June 1994.
[19] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” In Proc. VLDB, Santiago, Chile, 1994, pp. 487–499.
[20] Sanjay Rathee, Manohar Kaul and Arti Kashyap, “R-Apriori: An efficient Apriori based Algorithm on Spark,” In Proc. 24th Conference on Information Retrieval and Knowledge Management (CIKM 2015), PIKM'15, Oct. 2015, ACM. doi: 10.1145/2809890.2809893
[21] Tong Wang, Cynthia Rudin, Daniel Wagner and Rich Sevieri, “Learning to Detect Patterns of Crime,” In Springer, MIT, USA, 2013.
[22] Apache Flink: https://flink.apache.org
[23] Yang X.Y., Liu Z. & Fu Y., “MapReduce as a Programming Model for Association Rules Algorithm on Hadoop,” In Proc. 3rd Int. Conf. on Information Sciences and Interaction Sciences (ICIS ‘10), Chengdu, China, IEEE, 2010, pp. 99 – 102.
[24] Yael Amsterdamer, Yael Grossman, Tova Milo and Pierre Senellart, “Crowd Mining,” In SIGMOD'13, USA, 2013.
[25] Yael Amsterdamer, Yael Grossman, Tova Milo and Pierre Senellart, “CrowdMiner: Mining association Rules from the crowd,” In Proceedings of VLDB Endowment, 2013.
[26] Zahra Farzanyar and Nick Cercone, “Efficient Mining of Frequent Itemsets in Social Network Data based on Mapreduce Framework,” In 2013 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, 2013, pp. 1183-1188.