2014 International Conference on Information Technology and Multimedia (ICIMU), November 18 – 20, 2014, Putrajaya, Malaysia
Frequent Pattern Mining in Mobile Devices (A Feasibility Study) Muhammad Habib ur Rehman, Chee Sun Liew, Teh Ying Wah Faculty of Computer Science and Information Technology University of Malaya 50603, Kuala Lumpur, Malaysia
Abstract— The availability of computational power in mobile devices is a key enabler for Mobile Data Mining (MDM) at the user's premises. However, resource constraints such as limited energy, narrow bandwidth, and small screens hinder the adoption of MDM. Currently, MDM relies on lightweight algorithms that adapt to resource-constrained environments, but a study evaluating the performance of general algorithms is still lacking in the literature. To this end, we studied six Frequent Pattern Mining (FPM) algorithms and deployed them on a mobile device to evaluate their feasibility and to highlight the associated challenges. The experiments were performed on real and synthetic data sets entirely on an Android-based mobile device and compared with a PC-based setup. The experimental results show that FPM algorithms can support MDM after tuning some basic parameters.
Keywords— data mining, frequent pattern mining, mobile computing
I. INTRODUCTION
The increasing computational power of mobile devices has opened new research avenues for data mining algorithms. The mobility and portability of these devices implicitly impose computational constraints due to limitations [1] in energy, bandwidth, screen real estate, computational power and storage. Numerous data mining algorithms have been successfully adapted to smartphones so that they comply with computational and resource constraints while still performing well. For example, classification algorithms are used for activity recognition [2], energy efficiency [3], physiological data analysis [4], personalization, privacy and adaptation [5], intelligent distributed classification [6], fall detection [7], injury rehabilitation [8], discrimination between stress and cognitive load [9], and application usage prediction in mobile phones [10], among others. Such wide-scale adoption of data mining algorithms has stimulated researchers to explore new opportunities for mobile data mining inside smartphones. On the other hand, multiple data sources bring heterogeneity in data. A huge variety of data is generated by smartphones using the accelerometer, compass, GPS locator, microphone, cameras, on-screen keyboard, web logs, application logs, device logs, Bluetooth and Wi-Fi scans, contacts list and SMS data, among many others. Moreover, rapid developments in wearable and on-body sensing devices are creating more opportunities to collect environmental and physiological data and to leverage smartphones as an analytical platform. To harness the heterogeneity and variety in data, we are conducting a feasibility study on exploiting FPM algorithms in mobile devices. Recently, a few generic data mining systems have been introduced that use mobile phones as a computational platform. The most popular of these are Pocket Data Mining (PDM) [11], Open Mobile Miner (OMM) [12] and Mobile WEKA [13]. These tools support a variety of classification and clustering algorithms, but research on frequent pattern mining algorithms is still at an early stage. This paper contributes by studying the feasibility of exploiting FPM algorithms in mobile devices and by identifying the core requirements and arising issues of the studied algorithms. The rest of the paper is organized as follows: Section II highlights the related work, and FPM in mobile devices is discussed in Section III. The experimental setup and the details of the datasets are presented in Section IV. Results and discussion are presented in Section V. Finally, the conclusion and future work are discussed in Section VI.
978-1-4799-5423-0/14/$31.00 ©2014 IEEE
II. RELATED WORK
Overall, research in frequent pattern mining ranges from basic patterns to multilevel and multidimensional patterns to extended patterns for data sets and streams. An extensive study of the literature shows that research on FPM for mobile devices is still at an early stage. To the best of our knowledge, only two studies [14], [15] in the recent literature exploit FPM, in mobile commerce and activity recognition respectively. These successful applications, together with the scarcity of literature on FPM in mobile devices, are evidence of the research opportunities in this important data mining area. The two studies are presented below. In the first study, Lu et al. [14] proposed the Personal Mobile Commerce Pattern mining algorithm (PMCP-Mine), as part of the Mobile Commerce Explorer (MCE) framework, to discover personal shopping patterns of mobile users in m-commerce environments. PMCP-Mine first mines frequent mobile transactions from the user's local purchase data and then updates the local transaction database by removing infrequent transactions. Finally, PMCP-Mine predicts new transaction patterns on the basis of the updated local transaction database. The
until no further large itemsets are found. Because every subset of a large itemset is also large, candidate itemsets with k items are formed by joining large itemsets with k-1 items and deleting those candidates that have a subset which is not large. AprioriTid differs from Apriori in that it does not access the database to count support after the first pass. Instead, the candidate itemsets are encoded after the first pass to reduce the size of the data and the reading effort in later passes. The Apriori algorithm starts by counting items to determine the large 1-itemsets. Each subsequent pass works in two phases. In pass k, the Apriori-gen function is first used to generate the candidate itemsets Ck from the large itemsets of the (k-1)th pass, i.e. Lk-1. Next, the support of Ck is counted in a database scan; here, the candidates in Ck that are contained in a given transaction must be determined efficiently for fast counting. A subset function based on a hash tree is used for this purpose: Ck is stored in a hash tree whose leaf nodes contain lists of itemsets and whose internal nodes contain hash tables. The number of database scans can be reduced by also counting C′k+1 in the kth pass. This strategy pays off when the cost of scanning the database is more than the cost of keeping in memory and counting the additional C′k+1 – Ck+1 candidates. Similarly, AprioriTid generates Ck using the Apriori-gen function before the pass begins, but the support is counted without accessing the database. Instead, a set C̄k is used whose members are of the form ⟨Tid, {Ik}⟩, where each Ik represents a potentially large k-itemset present in the transaction with transaction identifier Tid.
B. FP-Growth
The increasing computational cost of candidate generation for large itemsets, especially with long patterns, led to the development of the FP-Growth algorithm. The algorithm is based on the frequent-pattern (FP) tree, which stores the important information about frequent patterns in compressed form.
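To illustrate how an FP-tree stores transactions in compressed form, the following is a minimal Python sketch; it is not the SPMF implementation used in this study. The names FPNode, build_fp_tree and count_nodes are hypothetical and chosen only for illustration, min_count is assumed to be an absolute support count, and the header table and node links of the full FP-tree are omitted for brevity.

```python
from collections import Counter


class FPNode:
    """One node of a simplified FP-tree (hypothetical helper class)."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count item frequencies and keep only the frequent items.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None)
    # Scan 2: insert each transaction with its frequent items sorted by
    # descending frequency, so that shared prefixes share tree nodes.
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq


def count_nodes(node):
    """Number of item nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    tree, freq = build_fp_tree(data, min_count=3)
    print(count_nodes(tree), "nodes for", sum(len(t) for t in data), "item occurrences")
```

In this toy example the five transactions share prefixes, so the tree needs fewer nodes than there are item occurrences in the database, which is the source of the compression the algorithm relies on.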
The FP-Growth algorithm performs efficiently due to a three-part strategy: 1) the database is compressed and stored in an FP-tree to avoid multiple database scans in later passes, 2) a pattern-fragment growth scheme is used to reduce the cost of candidate generation for long patterns, and 3) a divide-and-conquer approach is used to confine the search space and reduce the cost of the level-wise search used in the Apriori algorithms.
C. Relim
The Relim (Recursive Elimination) algorithm was inspired by FP-Growth but works without prefix trees. First, the items are counted and compared with the minsup value given by the user; the items are then arranged in ascending order of frequency for fast processing. The core of Relim is a recursive function that has a lower computational cost than the Eclat [19] and Apriori algorithms. The recursive function works by eliminating infrequent items from the transactions and selecting the transactions that contain the least frequent item. Subsequently, the least frequent item is deleted from the selected transactions, and the procedure starts again until only frequent itemsets remain.
D. Eclat and dEclat
The limitation of multiple database scans and the complexity of internal data structures, such as the trees in Apriori and FP-Growth, led to the development of the Eclat/dEclat algorithms. The
performance analysis of PMCP-Mine shows that the execution time increases as the support threshold decreases. In the second study, Wang et al. [15] presented an Emerging Patterns (EP) based data mining technique in a complex activity recognition system that works at two layers. In the first layer, the data is processed at the BSN nodes and then transmitted to the mobile device for further processing. At the node level, lightweight algorithms are used for gesture recognition. Additionally, pattern-based real-time recognition algorithms are used at the central portable device. An EP is a set of items that is frequent in one class but infrequent in the other classes. The idea behind the EP-based technique is that instances containing an EP most likely belong to the corresponding class. The complexity analysis of the proposed algorithm shows that the time complexity of matching EP items with the items stored in the class is O((m·l + k)·n), where n is the length of the input vector, k is the total number of activities, m is the number of EPs and l is the average number of items in an EP. The space complexity is O(m·l) to hold the EPs. The performance analysis reports an average recognition accuracy of 82.87%, an average recognition delay of 5.7 sensing periods and an average utility of 0.81 on a (0-1) scale. Despite high miss-detection and false-detection rates, the proposed algorithm performs better than single-layer and HMM-based algorithms. The absence of literature on FPM-based data mining techniques in mobile phones motivated us to conduct this feasibility study for future research in this important area.
III. FPM IN MOBILE DEVICES
FPM is defined over a set of items I = {i1, ..., in} and a set of transactions D = {t1, ..., tm}, where each transaction T ⊆ I. A transaction ID (TID) is used to uniquely identify a transaction in the database. A transaction T contains a set of items A iff A ⊆ T. An association rule A → B over two itemsets A and B exists iff A ⊂ I, B ⊂ I and A ∩ B = ∅.
The rule A → B holds in the transaction set D with support s% if s% of the transactions in D contain A ∪ B, and with confidence c% if c% of the transactions in D that contain A also contain B. For a given set of transactions D, the minimum confidence (minconf) and minimum support (minsup) are specified by the user, and all rules that satisfy minconf and minsup are generated for D. The feasibility analysis covers six basic FPM algorithms used for Market Basket Analysis: Apriori and AprioriTid, both proposed in [16], FP-Growth [17], Relim [18], Eclat [19], and dEclat [20]. The details of these algorithms are presented in the following paragraphs.
A. Apriori and AprioriTid
The Apriori and AprioriTid algorithms are used for association rule mining, which is a two-step process: 1) finding all itemsets whose support is at least minsup, called large itemsets, and 2) using the large itemsets to generate association rules. The rules are generated as follows: for every large itemset a, find all non-empty subsets of a, and for every such subset b, output the rule b → (a − b) if the ratio support(a)/support(b) is at least minconf. Apriori and AprioriTid are multi-pass algorithms in which the individual items satisfying minsup are found first and candidate itemsets are then generated iteratively from the itemsets found in the previous pass. This iterative procedure continues
accommodated by all studied algorithms, along with the maximum tree size, the number of candidate patterns and the itemset counts, is also presented in Table 1.
algorithms work with fewer database scans and use an efficient technique for exploring the search space. The Eclat/dEclat algorithms are based on a vertical tid-list database format in which each itemset is associated with the transactions that contain it. This tid-list strategy helps in enumerating all frequent itemsets. In addition, a lattice-based approach is used to decompose the search space (the lattice) into smaller sub-lattices that can be processed entirely in main memory. Furthermore, Eclat/dEclat use bottom-up traversal as the search strategy for sub-lattice enumeration. These efficient search schemes consume fewer computational resources and require fewer DB scans than Apriori, which is why Eclat/dEclat gained significant recognition. The tid-list scheme is also useful for handling data skew and retains linear scalability in the number of transactions in the database. The difference between Eclat and dEclat lies in how they handle large vertical tid-lists while generating intermediate results. dEclat uses diffsets, which keep track only of the differences in the transaction IDs (Tids) of candidate patterns. The small diffsets make intersection very fast and improve the overall performance of the algorithm compared with Eclat. Storing differences of tidsets, rather than whole itemset classes and their tid-lists, helps to counter the scalability issue associated with Eclat. Having given an overview of the selected algorithms, we describe the experimental setup in the next section.
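The contrast between tid-list intersection and diffsets described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SPMF code evaluated in this paper; the function names vertical, eclat_support and declat_support are hypothetical, and only the support of a 2-itemset is computed.

```python
def vertical(transactions):
    """Convert horizontal transactions into the vertical format: item -> set of Tids."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat_support(tidsets, x, y):
    # Eclat: the support of {x, y} is the size of the intersection of the tid-lists.
    return len(tidsets[x] & tidsets[y])


def declat_support(tidsets, x, y):
    # dEclat: keep only the diffset d(xy) = t(x) - t(y);
    # then support(xy) = support(x) - |d(xy)|.
    diffset = tidsets[x] - tidsets[y]
    return len(tidsets[x]) - len(diffset), diffset


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    t = vertical(data)
    print(eclat_support(t, "a", "b"), declat_support(t, "a", "b"))
```

Both functions return the same support count; the diffset variant only has to store the Tids in which the two items differ, which tends to be small when the items co-occur often.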
TABLE 1. DATASETS AND THEIR CHARACTERISTICS
Property                   CP99     Mushroom   Retail
Size                       small    medium     large
Number of transactions     5        8416       88163
Sparsity                   low      low        high
minsup (PC)                0        0.35       0.05
Candidate count (PC)       31       1382       16
Itemset count (PC)         31       1121       16
Tree size (PC)             6        8          4
minsup (phone)             0        0.35       0.05
Candidate count (phone)    31       1382       16
Itemset count (phone)      31       1121       16
Next, we discuss the space and time complexity of all algorithms on these datasets. The results of the PC-based and phone-based analyses are merged, and the related graphs were generated by running every algorithm 5 times so that the average time and space complexity could be measured. The algorithms were tested by varying the minsup value from 0 to 1 in steps of 0.05. The details of these values are presented in subsequent sections. The terms SCP, SCM, MCP, and MCM denote the space complexity and time complexity of the PC and mobile phone analyses, respectively. In addition, the first three letters of each algorithm's name are used as a subscript to differentiate the associated complexities.
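The measurement procedure described above (five runs per setting, minsup varied from 0 to 1 in steps of 0.05) could be scripted roughly as follows. This is a sketch under stated assumptions rather than the harness used in the experiments: the study ran Java code from SPMF, whereas this sketch uses Python's time and tracemalloc modules, and the functions benchmark and frequent_items are hypothetical stand-ins for a real FPM algorithm.

```python
import time
import tracemalloc


def frequent_items(transactions, minsup):
    """Toy stand-in for an FPM algorithm: items with relative support >= minsup."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c / n >= minsup}


def benchmark(algorithm, transactions, runs=5, step=0.05):
    """Return (minsup, mean seconds, mean peak bytes) for each minsup value."""
    results = []
    steps = int(round(1 / step))
    for i in range(steps + 1):
        minsup = round(i * step, 2)
        times, peaks = [], []
        for _ in range(runs):
            tracemalloc.start()
            start = time.perf_counter()
            algorithm(transactions, minsup)
            times.append(time.perf_counter() - start)
            # get_traced_memory() returns (current, peak) in bytes.
            peaks.append(tracemalloc.get_traced_memory()[1])
            tracemalloc.stop()
        results.append((minsup, sum(times) / runs, sum(peaks) / runs))
    return results


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
    for minsup, seconds, peak in benchmark(frequent_items, data):
        print(f"minsup={minsup:.2f} time={seconds:.6f}s peak={peak:.0f}B")
```

Note that tracemalloc only tracks Python allocations, so the memory figures from such a script are not comparable with the JVM or Android measurements reported in this paper.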
IV. EXPERIMENTAL SETUP
To compare the performance of all six algorithms in desktop and mobile environments, we selected two hardware setups: 1) a desktop environment and 2) a mobile environment. First, the desktop environment is an Intel Core i5 based PC with 1GB RAM and the Windows 7 operating system. We downloaded and installed the desktop version of the Java-based data mining library Sequential Pattern Mining Framework (SPMF) [21] to test the datasets for performance analysis. Second, a mobile app was developed and deployed on a Samsung Galaxy Duos GT-18552. The smartphone has a 1.2 GHz quad-core Cortex-A5 application processor with 1GB RAM and 8GB internal storage, runs Android 4.1 (Jelly Bean), and is powered by a 2,000 mAh battery. The mobile application works similarly to the GUI-based desktop version of SPMF: it takes a data input file and a minsup value for the evaluation of the selected algorithms and returns the resulting statistics, which are discussed in the next section.
A. ContextPasquier99
It should be noted that ContextPasquier99 contains only five transactions and is not sparse. A closer analysis of the results presented in Fig. 1 reveals that all algorithms exhibited stable performance across all minsup variations. The average space complexity of all algorithms running on the PC is 7±1 MB, while the same algorithms consumed 12±2 MB of memory on the mobile phone.
V. RESULTS AND DISCUSSIONS
Three datasets were downloaded from the SPMF website, and testing was performed using the same parameters in the PC-based and mobile-based experiments. These datasets were chosen for their variety in database size and overall data sparsity. The datasets are ContextPasquier99, Mushroom and Retail. The detailed characteristics of these data sets are presented in Table 1. In addition, the minimum minsup value that can be
Fig. 1. Memory consumption of ContextPasquier99 dataset.
Moreover, all algorithms except Apriori still consumed memory even when no candidate itemsets were found. The
minsup = 0.05 (5%), which is normally not the best case for Market Basket Analysis algorithms.
reason for this consumption is the multiple database scans of Apriori, which are apparently a constraint in mobile phones. Rather, Apriori refrains from extra DB scans in this case. Conversely, the time complexity analysis presented in Fig. 2 shows that all algorithms ran in under one millisecond on the PC, but their performance on the phone varied significantly. The algorithms took more time with lower minsup values, and the time decreased gradually as the threshold increased.
Fig. 3. Memory consumption of Mushroom dataset.
Moreover, in the mobile-based analysis most of the algorithms started performing best at minsup = 0.4 and mined the data efficiently up to the 100% minsup threshold, which is empirical evidence that FPM algorithms can be deployed to harness mobile phones as a data mining platform.
Fig. 2. Time complexity of ContextPasquier99 dataset.
The overall performance on the phone is satisfactory because most of the algorithms took 10±5 milliseconds on average, which is not a significant time period. It should be noted from Fig. 2 that FP-Growth took more time on the mobile phone than the other algorithms.
B. Mushroom
The low sparsity of the Mushroom data set made it possible to run all algorithms except AprioriTid with a minimum minsup of 0.05 in both the PC-based and phone-based analyses, which indicates the feasibility of the mobile phone as a data mining platform. The need for a one-time scan and for keeping all data in memory is the reason for the poor performance of AprioriTid, which consumed 215 MB of memory on the PC with a minimum minsup of 0.25. The same algorithm performed even worse on the phone due to resource constraints, consuming 28.11 MB with a minimum minsup of 0.5. The comparison of both analytical platforms presented in Fig. 3 shows that Apriori, FP-Growth and Relim exhibited the same performance, but the results of the Eclat and dEclat algorithms degraded significantly. Finally, the average space complexity of all algorithms except AprioriTid lies in the range of 30±15 MB. The time complexity analysis of Mushroom exhibits variability in time consumption. Note that we have converted the unit of measurement from milliseconds to seconds for interpreting the results. The overall analysis of Fig. 4 shows that in most cases all algorithms took 2±1 seconds to process the data in both environments, which is very satisfactory for a medium-size dataset. However, some algorithms occasionally took more time because of candidate generation and database scan constraints, which degraded the performance. For example, Eclat and dEclat took 190 seconds on average to process the data with
Fig. 4. Time complexity of Mushroom dataset.
C. Retail
The Retail data set, based on real-world data obtained from a Belgian superstore, was chosen because of its large number of transactions, i.e. 88163. Another reason is to evaluate the mobile phone as a data mining platform when a large amount of data has to be processed at a given time. The performance analysis presented in Fig. 5 shows that the Eclat and dEclat algorithms could not run on either the PC or the mobile phone. Moreover, AprioriTid could only be run on the PC, with a huge constant memory consumption of 173.65 MB, which is not even a worst case compared with other similar algorithms. For the other algorithms, the average space complexity remained at 15±10 MB, which is suitable for mobile data mining. The time complexity analysis with Retail shown in Fig. 6 provides evidence that Apriori, FP-Growth, and Relim took more time on the mobile phone than in the PC-based analysis. The average time taken on the phone is 21.37
phones can be adopted as data mining platforms for general FPM algorithms. The results show that most of the algorithms worked efficiently in mobile environments, but performance degraded with exceptionally large candidate sets and large itemsets. These performance-related issues can be mitigated by tuning some basic parameters, as recommended in this paper. Moreover, FPM algorithms have many potential application areas involving mobile computing, the Internet of Things (IoT), and the big data ecosystem.
seconds, which is almost 46 times the average time on the PC (i.e. 0.46 seconds) and is not the best case for Market Basket Analysis.
ACKNOWLEDGEMENT
The authors would like to thank Bright Spark Unit, University of Malaya for the financial support under grant no. BSP/APP/1634/2013.
REFERENCES
[1] Krishnaswamy, S., Gama, J., and Gaber, M.M.: ‘Mobile data stream mining: from algorithms to applications’, In Mobile Data Management (MDM), 2012 IEEE 13th International Conference on, pp. 360-363.
[2] Martín, H., Bernardos, A.M., Iglesias, J., and Casar, J.R.: ‘Activity logging using lightweight classification techniques in mobile devices’, Personal and ubiquitous computing, 2013, 17, (4), pp. 675-695.
[3] Liang, Y., Zhou, X., Yu, Z., and Guo, B.: ‘Energy-Efficient Motion Related Activity Recognition on Mobile Devices for Pervasive Healthcare’, Mobile Networks and Applications, 2013, pp. 1-15.
[4] Solar, H., Fernández, E., Tartarisco, G., Pioggia, G., Cvetković, B., Kozina, S., Luštrek, M., and Lampe, J.: ‘A non invasive, wearable sensor platform for multi-parametric remote monitoring in CHF patients’, in ‘Impact Analysis of Solutions for Chronic Disease Prevention and Management’ (Springer, 2012), pp. 140-147.
[5] Gomes, J.B., Krishnaswamy, S., Gaber, M.M., Sousa, P.A., and Menasalvas, E.: ‘Mobile activity recognition using ubiquitous data stream mining’ (Springer, 2012), pp. 130-141.
[6] Stahl, F., Gaber, M.M., Aldridge, P., May, D., Liu, H., Bramer, M., and Philip, S.Y.: ‘Homogeneous and heterogeneous distributed classification for pocket data mining’: ‘Transactions on Large-Scale Data-and Knowledge-Centered Systems V’ (Springer, 2012), pp. 183-205.
[7] Sherchan, W., Jayaraman, P.P., Krishnaswamy, S., Zaslavsky, A., Loke, S., and Sinha, A.: ‘Using on-the-move mining for mobile crowdsensing’, In Mobile Data Management (MDM), 2012 IEEE 13th International Conference on, pp. 115-124.
[8] Pan, J.-I., Chung, H.-W., and Huang, J.-J.: ‘Intelligent Shoulder Joint Home-Based Self-Rehabilitation Monitoring System’, International Journal of Smart Home, 2013, 7, (5)
[9] Setz, C., Arnrich, B., Schumm, J., La Marca, R., Troster, G., and Ehlert, U.: ‘Discriminating stress from cognitive load using a wearable EDA device’, Information Technology in Biomedicine, IEEE Transactions on, 2010, 14, (2), pp. 410-417.
[10] Liao, Z.-X., Li, S.-C., Peng, W.-C., Yu, P.S., and Liu, T.-C.: ‘On the Feature Discovery for App Usage Prediction in Smartphones’, In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 1127-1132.
[11] Gaber, M.M., Stahl, F., and Gomes, J.B.: ‘Pocket Data Mining Framework’: ‘Pocket Data Mining’ (Springer, 2014), pp. 23-40.
[12] Krishnaswamy, S., Gaber, M., Harbach, M., Hugues, C., Sinha, A., Gillick, B., Haghighi, P., and Zaslavsky, A.: ‘Open mobile miner: a toolkit for mobile data stream mining’, in ACM KDD’09, 2009
Fig. 5. Memory consumption of Retail dataset.
Finally, after a detailed analysis of three datasets of different sizes, we reiterate that mobile phones can be used as a data mining platform after tuning some basic parameters. These parameters include performing chunked data analysis, adopting a periodic data mining approach, or using contextual information to make use of the mobile phone's available resources. For chunked data analysis, sliding-window-based modelling can be used to mine the data in small and manageable time windows so that as much data as possible can be mined in real time. Furthermore, data mining tasks can be initiated after a fixed time period to restrict the size of the datasets. Similarly, contextual information about battery status and sleep time can be used to schedule data mining tasks. We are hopeful that these recommendations will be useful for further research in mobile data mining.
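The chunked and context-aware recommendations above could take roughly the following shape. This is a hypothetical Python sketch, not part of the evaluated system: the class SlidingWindowCounter, the function should_mine and the 30% battery threshold are all assumptions made for illustration, and only frequent single items are tracked to keep the example short.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keeps item counts for the most recent window_size transactions only."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, transaction):
        items = frozenset(transaction)
        self.window.append(items)
        self.counts.update(items)
        # Evict the oldest transaction once the window is full.
        if len(self.window) > self.window_size:
            self.counts.subtract(self.window.popleft())

    def frequent_items(self, minsup):
        n = len(self.window)
        return {i for i, c in self.counts.items() if c > 0 and c / n >= minsup}


def should_mine(battery_level, is_charging, min_battery=0.3):
    """Hypothetical scheduling rule: mine only when charging or battery is high enough."""
    return is_charging or battery_level >= min_battery


if __name__ == "__main__":
    window = SlidingWindowCounter(window_size=3)
    for t in [["a", "b"], ["a"], ["b", "c"], ["a", "c"]]:
        window.add(t)
    if should_mine(battery_level=0.8, is_charging=False):
        print(sorted(window.frequent_items(0.6)))
```

Bounding the window bounds the memory needed for counting, which is the property that makes this style of analysis attractive on a device with limited resources.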
Fig. 6. Time complexity of Retail dataset.
VI. CONCLUSION
The availability of lightweight algorithms and the lack of literature on the use of general FPM algorithms in resource-constrained environments motivated us to perform this feasibility analysis. The study, performed on small, medium and large data sets, reveals that mobile
[13] Liu, P., Chen, Y., Tang, W., and Yue, Q.: ‘Mobile weka as data mining tool on android’: ‘Advances in Electrical Engineering and Automation’ (Springer, 2012), pp. 75-80.
[14] Lu, E.-C., Lee, W.-C., and Tseng, V.S.: ‘A framework for personal mobile commerce pattern mining and prediction’, Knowledge and Data Engineering, IEEE Transactions on, 2012, 24, (5), pp. 769-782.
[15] Wang, L., Gu, T., Tao, X., and Lu, J.: ‘A hierarchical approach to real-time activity recognition in body sensor networks’, Pervasive and Mobile Computing, 2012, 8, (1), pp. 115-130.
[16] Agrawal, R., and Srikant, R.: ‘Fast algorithms for mining association rules’, in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Vol. 1215, pp. 487-499.
[17] Han, J., Pei, J., Yin, Y., and Mao, R.: ‘Mining frequent patterns without candidate generation: A frequent-pattern tree approach’, Data mining and knowledge discovery, 2004, 8, (1), pp. 53-87.
[18] Borgelt, C.: ‘Keeping things simple: Finding frequent item sets by recursive elimination’, In Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, pp. 66-70.
[19] Zaki, M.J.: ‘Scalable algorithms for association mining’, Knowledge and Data Engineering, IEEE Transactions on, 2000, 12, (3), pp. 372-390.
[20] Zaki, M.J., and Gouda, K.: ‘Fast vertical mining using diffsets’, In Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 326-335.
[21] Fournier-Viger, P.: ‘Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information’, 2014