Using Weighted Based Feature Selection Technique for Android Malware Detection
Nurul Hidayah Mazlan, Isredza Rahmi A Hamid
Soft Computing and Multimedia Centre, Faculty Computer Science & Information Technology, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia
[email protected],
[email protected]
Abstract. The popularity of mobile devices has risen sharply in recent years as their functionality has grown. This growth brings a large number of security challenges that demand attention. Android malware detection methods fall into two types: static analysis and dynamic analysis. Static techniques are fast and efficient, but they are prone to high false negative rates because of evolution in the code base and code repacking. Dynamic, behavior-based analysis instead aims to extract the unique patterns of each malware family effectively and efficiently from its behavior. To address some of these shortcomings, this study uses permission-based Android malware features as the basis for malware detection with a weighted-based technique. Several combinations of feature selection techniques have been evaluated.
Keywords: Android malware, Detection, Weighted-based, Term Frequency-Inverse Document Frequency
1
Introduction
At present, most users prefer Android-based mobile devices because they give easy access to many applications such as e-mail, transactions, maps, social networks and online games. Of a total global population of 7.395 billion, 3.79 billion are reported to be mobile device users [6]. In Malaysia, there are 43.43 million mobile devices for a total population of 30.54 million. This shows that a person may have more than one mobile device, and 59% are mobile Internet users. This phenomenon raises a large number of security challenges that must be addressed, such as leaks of sensitive data and exposure of the network or file system. The rising number of mobile device users motivates attackers to mine users' personal information such as names, contacts, bank account passwords, credit card numbers, online banking usernames and passwords, or memorable and private pictures. Besides stealing data, an attacker could gain access in order to damage the device or intrude on the user's privacy [1]. The attacker illegally takes advantage of a device's vulnerability by installing a malicious application and gaining unauthorized remote access.

Smart mobile devices have an operating system just like a computer, with additional capabilities and capacities. Moreover, a mobile device combines network connectivity with high-speed data networking capabilities and geo-location services [2]. Users can easily be forced into subscribing to messages or calls, or fall victim to remotely controlled money transfers and extortion by ransomware. G Data reported 440,000 new Android malware strains in the first quarter of 2015, which means that a new mobile malware strain for Android is discovered every 18 seconds [3]. As a result, mobile devices have become a main target for malware attackers seeking confidential information.

Malware threats on mobile devices come in various forms, such as viruses, Trojans, worms and mobile botnets, and they have been growing tremendously [4]. Malware uses various infection strategies [5] such as entry point obfuscation, code integration, code insertion, register renaming, memory access reordering and session hijacking. In entry point obfuscation, the virus hijacks program control after the program has been launched by overwriting program import table addresses and function call instructions. In code integration, the virus merges its code with a legitimate program, which requires disassembly of the target. Infection can also happen when virus code modifies the entry point of a legitimate program or is injected into unused sections of the program code.

This paper uses permission-based analysis of Android malware as the basis for malware detection. The objectives of this study are:
1. to propose an Android malware detection model,
2. to develop a weighted-based feature selection approach to detect Android malware, and
3. to analyze the proposed feature selection approach and existing approaches by using performance metrics to classify Android malware.
The remainder of this paper is organized as follows. Section 2 describes related research on Android malware detection approaches. Section 3 examines the Android malware feature selection approach, including the data and feature set used in the experiment and the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. Section 4 presents the performance analysis results and the effectiveness of the proposed weighted-based feature selection. Section 5 concludes the work and discusses directions for future work.
2
Related Work
This section examines two feature selection approaches for Android malware, namely static analysis and dynamic analysis.
2.1
Static Analysis
Static analysis inspects a mobile application statically and disassembles its code [7]. The two main techniques in static analysis are decompiling and data flow tracking. Although this analysis is fast and fairly easy, it requires regular updates of threat databases and may be evaded by complicated techniques.

V. Varsha et al. [8] proposed a broad static analysis system to classify Android malware applications. To generate the vector space model, hardware components, permissions, application components, filtered intents, opcodes and the number of smali files per application are used as features, which are selected using Entropy based Category Coverage Difference. Support Vector Machine (SVM), Rotation Forest and Random Forest are used to evaluate system performance. They achieved 98.14% accuracy using the Random Forest classifier with a feature length of 198. However, opcodes are prone to obfuscation, which could be handled by implementing a normalizer.

Another static analysis approach is DeDroid, which investigated botnet-specific properties used to detect mobile applications with botnet intentions [9]. Command and control features associated with four well-known malware families, DroidKungFu, Plankton, GoldDream and Geinimi, were examined. Static analysis was performed by reverse engineering applications, taking five samples from each malware family. The first evaluation was run on 5064 malware binaries belonging to 20 malware families. The results show that 1795 malware samples were detected as having command and control features. The six malware families with the highest detection ratio were taken to validate the results, where FakeRun achieved the highest accuracy at 100%. In the second evaluation, tested on 14864 benign binaries, DeDroid detected 1196 samples having command and control features, and the results were taken from the top seven malware families. Here, Geinimi achieved the highest accuracy at 93%.
2.2
Dynamic Analysis
Dynamic analysis provides new methods for extracting malware patterns effectively. These methods focus on the time when the Android application is being executed, whether it is accessing private data or using Application Program Interface (API) calls [7]. Dynamic analysis is carried out offline because of its large computational overhead.

Li et al. [7] integrated risky permission combinations and vulnerable API calls as features in a Support Vector Machine (SVM) algorithm. The smali files are analyzed and the weight of every dangerous API in the feature vector is calculated using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. The SVM-based malware detection that uses only dangerous API calls achieved 81% accuracy, while using both dangerous API calls and risky permission combinations achieved 86% accuracy.

DroidDolphin [10] extracted useful static and dynamic features to detect malicious applications. The analysis involved GUI-based testing, big data analysis and machine learning. The DroidDolphin architecture collects the runtime logs of an Android application and decides whether it is malware using machine learning techniques. The preliminary experiment used 32,000 benign and 32,000 malicious applications as training data, and 1,000 benign and 1,000 malicious applications as testing data. The results showed that the prediction accuracy reaches 86.1%.
Table 1. Comparisons of related works

| Work | Approach | Technique | Features | Sample | Accuracy |
|---|---|---|---|---|---|
| V. Varsha [8] | Static analysis | Entropy based Category Coverage Difference | Hardware components, requested permissions, application components, filtered intents, opcodes | Drebin, Google Play Store | 98.14% |
| A. Karim [9] | Static analysis | Reverse engineer | Permissions, API calls | Drebin, Google Play Store | First evaluation: FakeRun 100%, FakeDoc 98%, DroidKungFu 78%, Plankton 84%, Geinimi 90%, GoldDream 80%. Second evaluation: Geinimi 93%, GoldDream 89%, Plankton 82%, Kungfu 78%, DroidKungfu 69% |
| W. Li et al. [7] | SVM based (machine learning) | Weight calculation | Permissions, API calls | Google Play Store | API calls: 81%. API calls and risky permission combinations: 86% |
| W. Wu et al. [10] | Dynamic analysis | Reverse engineer | API calls | Drebin | 86.1% |
Table 1 shows that existing approaches detect Android malware with clear success, yet few methods at present are designed specifically around a dynamic approach. Our study differs from previous work on feature selection in several ways. We propose dynamic feature selection using a weighted-based technique on permission-based features. We analyze permission-based Android malware features in order to evaluate malware behavior. We then use the Random Forest algorithm as our classifier.
3
Android Malware Feature Selection Approach
In this section, we discuss the proposed weighted-based technique for Android malware detection. We first introduce the model and then the feature selection and extraction process.
3.1
Android Malware Detection Model
Figure 1 shows the Android malware detection model and its general processing steps. The processing phases include preprocessing of the Android malware dataset, feature extraction and selection, dynamic feature selection, classification using a machine learning algorithm, and finally evaluation of the detection result. The model follows a general data mining approach that aims to build a classifier for Android malware. The classifier should be able to classify every sample correctly as either malware or benign.

[Figure 1: block diagram. INPUT: Android Malware Dataset. ACTIVITIES: Preprocessing, Dynamic Feature Selection, Machine Learning. OUTPUT: Intermediate Output; Final Output: Result and Evaluation.]
Fig. 1. Android Malware Detection Model
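As a minimal sketch of the feature extraction stage in this model, the Python snippet below turns one sample's XML into a binary vector over the 18 dangerous permissions listed in Table 2. It assumes, for illustration only, that the extracted .xml follows the AndroidManifest layout with `uses-permission` elements. The names `DANGEROUS_PERMISSIONS` and `permission_vector` and the sample manifest are hypothetical and are not taken from the authors' implementation.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace prefix that ElementTree attaches to android:* attributes.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# The 18 dangerous permissions of Table 2, in table order.
DANGEROUS_PERMISSIONS = [
    "READ_CALENDAR", "WRITE_CALENDAR", "CAMERA",
    "READ_CONTACTS", "WRITE_CONTACTS", "GET_ACCOUNTS",
    "ACCESS_FINE_LOCATION", "ACCESS_COARSE_LOCATION", "RECORD_AUDIO",
    "READ_PHONE_STATE", "CALL_PHONE", "PROCESS_OUTGOING_CALLS",
    "SEND_SMS", "RECEIVE_SMS", "READ_SMS", "RECEIVE_MMS",
    "READ_EXTERNAL_STORAGE", "WRITE_EXTERNAL_STORAGE",
]


def permission_vector(manifest_xml: str) -> List[int]:
    """Return 1/0 flags marking which dangerous permissions are requested."""
    root = ET.fromstring(manifest_xml)
    requested = set()
    for node in root.iter("uses-permission"):
        full_name = node.get(ANDROID_NS + "name", "")
        # Keep only the last component, e.g. android.permission.SEND_SMS -> SEND_SMS.
        requested.add(full_name.rsplit(".", 1)[-1].upper())
    return [1 if perm in requested else 0 for perm in DANGEROUS_PERMISSIONS]


# Hypothetical sample manifest (assumed layout, not taken from DREBIN).
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.CAMERA"/>
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>"""

vector = permission_vector(SAMPLE_MANIFEST)
```

Permissions outside the dangerous set, such as INTERNET in the sample, are ignored, so every sample maps to a vector of the same length regardless of how many permissions it requests.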
The proposed model is evaluated using the DREBIN [11] dataset. We consider samples from both benign and malicious applications. There are various dangerous Android features such as record_audio, read_phone_state, access_fine_location, receive_sms and read_external_storage. All samples are extracted into a human-readable format (.xml). The samples then undergo preprocessing, in which all data are cleaned up and normalized. After that, the data go through feature selection, which implements the dynamic feature selection approach using the weighted-based feature selection technique. In the machine learning phase, we use Random Forest [19] as the classification algorithm. The models are generated and trained using the Waikato Environment for Knowledge Analysis, WEKA [12]. Random Forest is selected because it works on a combination of tree predictors: each individual tree votes for a class, and the class with the most votes is taken as the output. Moreover, Random Forest is good for complex classification tasks, and it has methods for balancing error when class populations are unbalanced. Finally, two datasets are constructed, denoted Experiment 1 and Experiment 2. We use the same set of data for both because the main focus of this paper is to propose the weighted-based feature selection approach.
3.2
Feature Selection and Extraction
Samples of Android data that have been extracted into .xml files undergo the feature selection process. The next step is to generate the components of a feature vector by analyzing the database. In feature selection, irrelevant and redundant features are removed [8]. Dangerous permissions are selected as features because each application requests permission at installation to use certain system data and features [13]. The functionality of an Android application is exposed to other applications through requested permissions. Moreover, an application with a dangerous permission can access the user's private information and affect the stored data or the operation of other applications. Table 2 describes the 18 permission-based Android malware features used in the experiment. The permission-based features are then selected according to the weight values calculated using TF-IDF and the proposed feature selection algorithm.

Table 2. Types of Dangerous Permission-based Feature [13]

| Type | Permission | Description |
|---|---|---|
| Calendar | READ_CALENDAR | Allows an application to read the user's calendar data. |
| Calendar | WRITE_CALENDAR | Allows an application to write the user's calendar data. |
| Camera | CAMERA | Required to be able to access the camera device. |
| Contacts | READ_CONTACTS | Allows an application to read the user's contacts data. |
| Contacts | WRITE_CONTACTS | Allows an application to write the user's contacts data. |
| Contacts | GET_ACCOUNTS | Allows access to the list of accounts in the Accounts Service. |
| Location | ACCESS_FINE_LOCATION | Allows an application to access precise location. |
| Location | ACCESS_COARSE_LOCATION | Allows an application to access approximate location. |
| Microphone | RECORD_AUDIO | Allows an application to record audio. |
| Phone | READ_PHONE_STATE | Allows read only access to phone state, including the phone number of the device, current cellular network information, the status of any ongoing calls, and a list of any PhoneAccounts registered on the device. |
| Phone | CALL_PHONE | Allows an application to initiate a phone call without going through the Dialer user interface for the user to confirm the call. |
| Phone | PROCESS_OUTGOING_CALLS | Allows an application to see the number being dialed during an outgoing call with the option to redirect the call to a different number or abort the call altogether. |
| SMS | SEND_SMS | Allows an application to send SMS messages. |
| SMS | RECEIVE_SMS | Allows an application to receive SMS messages. |
| SMS | READ_SMS | Allows an application to read SMS messages. |
| SMS | RECEIVE_MMS | Allows an application to monitor incoming MMS messages. |
| Storage | READ_EXTERNAL_STORAGE | Allows an application to read from external storage. |
| Storage | WRITE_EXTERNAL_STORAGE | Allows an application to write to external storage. |

3.3
Term Frequency Inverse Document Frequency (TF-IDF) Algorithm
Term Frequency (TF) and Document Frequency (DF) are two important terms in the TF-IDF algorithm [14]. TF represents each document as a vector of the terms that match the vocabulary. Each cell in the vector represents the TF of a term in the corresponding document, as shown in Eq. 1. DF characterizes each term in the vocabulary by its frequency in the entire collection and is used for term weighting. Each term may appear zero or more times in a given document and at least once in one of the documents.

$$TF = \frac{\text{term frequency}}{\max(\text{term frequency in document})} \qquad (1)$$

$$TFIDF = TF \times \log\left(\frac{N}{DF}\right) \qquad (2)$$

Next, TF is extended into Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF combines a term's frequency in the document (TF) with its frequency in the document collection, called Document Frequency (DF), as shown in Eq. 2. The normalized TF value is multiplied by $\log(N/DF)$, where $N$ is the number of documents in the entire file collection and $DF$ is the number of files in which the term appears. Using TF-IDF, the weight of a specific behavior is calculated. The SVM-based approach [7] altered the TF-IDF formula to analyze and calculate the weight of dangerous API calls. $TF_{i,j}$ is defined in Eq. 3, where $n_{i,j}$ is the number of times that Android application $j$ calls the specific dangerous API $i$, and $\sum_k n_{k,j}$ is the total number of times application $j$ calls all the different dangerous APIs.

$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (3)$$

$|D|$ in Eq. 4 represents the total number of malware samples in the training dataset, and $n_i$ is the number of times that a certain dangerous API $i$ is called.

$$IDF_i = \log\left(\frac{|D|}{n_i}\right) \qquad (4)$$
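The plain TF-IDF weighting of Eqs. 1 and 2 can be sketched in Python as follows. The sketch assumes that TF is a term's raw count divided by the largest count in the same document and that the multiplier is log(N/DF). The base-10 logarithm, the function names and the toy permission lists are assumptions made for illustration, not details stated in the source.

```python
import math
from collections import Counter
from typing import Dict, List


def term_frequency(doc: List[str]) -> Dict[str, float]:
    """Normalized TF: count of each term over the largest count in the document."""
    counts = Counter(doc)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


def document_frequency(docs: List[List[str]]) -> Counter:
    """DF: number of documents in which each term appears at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """TF-IDF weight of every term in every document (base-10 log assumed)."""
    n_docs = len(docs)
    df = document_frequency(docs)
    return [
        {term: tf * math.log10(n_docs / df[term])
         for term, tf in term_frequency(doc).items()}
        for doc in docs
    ]


# Toy collection: each "document" is the list of terms seen in one sample.
docs = [
    ["SEND_SMS", "READ_SMS", "SEND_SMS"],
    ["CAMERA"],
    ["SEND_SMS", "CAMERA"],
]
weights = tf_idf(docs)
```

In this toy collection READ_SMS appears in only one of three documents, so it receives a larger IDF factor than SEND_SMS, which appears in two. A term that occurs in every document gets a weight of zero because log(N/DF) vanishes.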
3.4
Weighted Based Feature Selection Algorithm
The proposed weighted-based feature selection technique is a modification of the TF-IDF algorithm in which our approach focuses on both the sample and the feature. The following equations explain the modified TF-IDF. In Eq. 5, $f_{i,j}$ is the value of Android malware feature $i$ in sample $j$, while $\max_k f_{i,k}$ is the maximum value of Android malware feature $i$ over all samples.

$$TF_{i,j} = \frac{f_{i,j}}{\max_k f_{i,k}} \qquad (5)$$

In Eq. 6, $F$ represents the total number of Android malware features in the dataset, and $n_{i,j}$ is the number of occurrences of Android malware feature $i$ in sample $j$.

$$IDF_{i,j} = \log\left(\frac{F}{n_{i,j}}\right) \qquad (6)$$

The weight is therefore defined in Eq. 7, where $W_{i,j}$ is the weight calculated by multiplying TF and IDF.

$$W_{i,j} = TF_{i,j} \times IDF_{i,j} \qquad (7)$$
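A minimal Python sketch of the weighted-based calculation in Eqs. 5 to 7 is given below, under one reading of the definitions above: TF is the feature's value in a sample divided by its maximum over all samples, IDF is the logarithm of the total number of features divided by the feature's occurrence count in the sample, and the weight is their product. The base-10 logarithm, the function names and the example values are assumptions for illustration and are not confirmed by the source.

```python
import math
from typing import List


def weighted_tf(feature_values: List[float], sample_index: int) -> float:
    """Eq. 5 as read here: value in one sample over the maximum across samples."""
    peak = max(feature_values)
    return feature_values[sample_index] / peak if peak else 0.0


def weighted_idf(total_features: int, occurrences: int) -> float:
    """Eq. 6 as read here: log10(total features / occurrences in the sample)."""
    if occurrences <= 0:
        return 0.0
    return math.log10(total_features / occurrences)


def weight(feature_values: List[float], sample_index: int,
           total_features: int, occurrences: int) -> float:
    """Eq. 7: product of the TF and IDF terms."""
    return (weighted_tf(feature_values, sample_index)
            * weighted_idf(total_features, occurrences))


# Hypothetical values of one permission across four samples (1 = requested).
values = [1, 1, 0, 1]
w_single = weight(values, sample_index=0, total_features=18, occurrences=1)
w_double = weight(values, sample_index=0, total_features=18, occurrences=2)
w_absent = weight(values, sample_index=2, total_features=18, occurrences=0)
```

With 18 features, a TF of 1 and a single occurrence, the weight is log10(18) ≈ 1.2553, and with two occurrences it is log10(9) ≈ 0.9542. These figures coincide with the weighted-based values listed in Table 3, which suggests, but does not confirm, that a base-10 logarithm was used.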
4
Performance Analysis
This section explains the experimental setup for Android malware detection using feature selection methods. Experiments were conducted to validate the methods.
4.1
Experimental Setup
In our study, detection was performed using WEKA. We used 500 permission-based samples, consisting of malware and benign applications from Drebin [11], for the experiments. Each permission-based sample has 18 features and is tested in two experiments, denoted Experiment 1 and Experiment 2. Experiment 1 tests permission-based features selected using the TF-IDF algorithm, while Experiment 2 uses the weighted-based feature selection algorithm as the feature selection approach.
4.2
Performance Metric
To measure the effectiveness of the detection approach, we use four metrics: accuracy, precision, sensitivity (recall) and F-score. Accuracy, given in Eq. 8, is the proportion of samples whose class label is predicted correctly and assesses the overall effectiveness of the algorithm. To assess the predictive power of the algorithm, precision estimates the predictive value of a label for a given class. The main evaluation parameter, F-score, is a composite measure that benefits algorithms with higher sensitivity. It is a weighted (harmonic) average of precision (P) in Eq. 10 and recall (R) in Eq. 11, calculated using Eq. 9. During the result analysis, True Positive Rate (TPR) and False Positive Rate (FPR) are also measured. A true positive (TP) is malware classified as malware, while a false positive (FP) is a benign sample misclassified as malware. A true negative (TN) is a benign sample classified as benign, while a false negative (FN) is malware misclassified as benign.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$

$$\text{F-score} = \frac{2 \times P \times R}{P + R} \qquad (9)$$

$$P = \frac{TP}{TP + FP} \qquad (10)$$

$$R = \frac{TP}{TP + FN} \qquad (11)$$

4.3
Result and Discussion
The experiment is run on a dataset of 500 samples, from which two experiment sets are constructed. Data for Experiment 1 consist of the 10 features with the highest TF-IDF values. Experiment 2 contains the 10 features with the highest weighted-based feature selection values. The Random Forest classifier with 10-fold cross-validation is used for both experiments. Table 3 lists the permission-based features selected by the TF-IDF algorithm and by the weighted-based feature selection algorithm. Table 4 shows the evaluation performance for Experiment 1 and Experiment 2, with accuracy values of 99.4% and 99.8% respectively. Our proposed feature selection algorithm therefore gives a slight increase in Android malware detection accuracy. Furthermore, Experiment 2 achieved a 100% TP rate in classifying malware, compared to 88% for Experiment 1. This shows that Experiment 2 has the better feature selection, which classified the malware correctly. We use the F-measure because it is a robust measure: larger F-measure values correspond to better predictability of the classes. Experiment 2 achieved a higher F-measure than Experiment 1 for both the benign and malware classes. For the benign class, Experiment 1 and Experiment 2 achieved 99.7% and 99.9% respectively. For the malware class, Experiment 1 and Experiment 2 achieved 93.6% and 99.4% respectively. As a result, the feature selection for Experiment 2 generalizes better and detects benign and malware samples accurately.

Table 3. Permission-based features selection approaches

| Rank | TF-IDF | Value | Rank | Weighted-based Feature Selection | Value |
|---|---|---|---|---|---|
| 1 | process_outgoing_calls | 2.096910013 | 1 | access_coarse_location | 1.255272505 |
| 2 | receive_mms | 2.000000000 | 2 | access_fine_location | 1.255272505 |
| 3 | read_external_storage | 1.657577319 | 3 | call_phone | 1.255272505 |
| 4 | write_calendar | 1.657577319 | 4 | camera | 1.255272505 |
| 5 | read_calendar | 1.619788758 | 5 | get_accounts | 1.255272505 |
| 6 | record_audio | 1.619788758 | 6 | read_phone_state | 1.255272505 |
| 7 | get_accounts | 1.585026652 | 7 | record_audio | 1.255272505 |
| 8 | write_contacts | 1.468521083 | 8 | send_sms | 1.255272505 |
| 9 | read_sms | 1.397940009 | 9 | write_external_storage | 1.255272505 |
| 10 | receive_sms | 1.376750710 | 10 | process_outgoing_calls | 0.954242509 |
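The metrics of Eqs. 8 to 11 can be computed directly from confusion-matrix counts. The Python sketch below is illustrative only: the function names and the counts are hypothetical, and the formulas assume the standard definitions, with the F-score taken as the harmonic mean of precision and recall. The last lines recompute the F-measure from the precision and recall values reported in Table 4.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of samples flagged as malware that really are malware."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of malware samples that were detected (sensitivity, TP rate)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical confusion-matrix counts, not taken from the paper.
tp, tn, fp, fn = 45, 48, 2, 5
acc = accuracy(tp, tn, fp, fn)
f_example = f_score(precision(tp, fp), recall(tp, fn))

# (precision, recall) pairs as reported in Table 4.
table4 = {
    ("Experiment 1", "Benign"): (0.994, 1.0),
    ("Experiment 1", "Malware"): (1.0, 0.88),
    ("Experiment 2", "Benign"): (1.0, 0.998),
    ("Experiment 2", "Malware"): (0.989, 1.0),
}
recomputed = {key: round(f_score(p, r), 3) for key, (p, r) in table4.items()}
```

Rounded to three decimals, the recomputed values are 0.997, 0.936, 0.999 and 0.994, which agree with the F-Measure column of Table 4.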
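Table 4. Evaluation performance using Random Forest classifier

| Dataset | Class | TP Rate | FP Rate | Precision | Recall | F-Measure | Accuracy |
|---|---|---|---|---|---|---|---|
| Experiment 1 | Benign | 1 | 0.12 | 0.994 | 1 | 0.997 | 99.4% |
| Experiment 1 | Malware | 0.88 | 0 | 1 | 0.88 | 0.936 | |
| Experiment 2 | Benign | 0.998 | 0 | 1 | 0.998 | 0.999 | 99.8% |
| Experiment 2 | Malware | 1 | 0.002 | 0.989 | 1 | 0.994 | |

5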
Conclusion
This paper proposed dynamic feature selection using a weighted-based feature selection algorithm to detect Android malware. The importance of feature selection methods such as TF-IDF was justified experimentally using a machine learning algorithm. The proposed feature selection approach and existing approaches were analyzed using performance metrics. We used permission-based features for the experiments. For future research, other Android malware features need to be analyzed to expand the Android malware detection research area.
6
Acknowledgement
The authors express appreciation to the Universiti Tun Hussein Onn Malaysia (UTHM). This research is supported by Postgraduate Research Grant vot number U610, Short Term Grant vot number U653 and Gates IT Solution Sdn. Bhd. under its publication scheme.
References
1. A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A Survey of Mobile Malware in the Wild,” in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, 2011, pp. 3–14.
2. T. Vidas, D. Votipka, and N. Christin, “All Your Droid Are Belong to Us: A Survey of Current Android Attacks,” in WOOT, 2011, pp. 81–90.
3. G Data, “G Data Releases Mobile Malware Report For The Fourth Quarter Of 2015,” 2016. [Online]. Available: https://www.gdata-software.com/g-data/newsroom/news/article/gdata-releases-mobile-malware-report-for-the-fourth-quarter-of-2015 (Search Date 19/4/2016). [Accessed: 05-Nov-2016].
4. G. Suarez-Tangil, J. E. Tapiador, P. Peris-Lopez, and A. Ribagorda, “Evolution, Detection and Analysis of Malware for Smart Devices,” IEEE Commun. Surv. Tutorials, vol. 16, no. 2, pp. 961–987, 2014.
5. V. B. Mohata, D. M. Dakhane, and R. L. Pardhi, “Mobile Malware Detection Techniques,” Int. J. Comput. Sci. Eng. Technol., vol. 4, no. 4, pp. 2229–3345, 2013.
6. S. Kemp and We Are Social, “Digital in 2016,” www.wearesocial.com, 2016. [Online]. Available: http://wearesocial.com/sg/special-reports/digital-2016. [Accessed: 05-Nov-2016].
7. W. Li, J. Ge, and G. Dai, “Detecting Malware for Android Platform: An SVM-Based Approach,” Proceedings - 2nd IEEE International Conference on Cyber Security and Cloud Computing, CSCloud 2015 - IEEE International Symposium of Smart Cloud, IEEE SSC 2015, pp. 464–469, 2016.
8. V. M. V, P. Vinod, and D. K. A, “Heterogeneous feature space for Android malware detection,” Eighth International Conference on Contemporary Computing, IC3 2015, Noida, India, August 20-22, 2015, pp. 383–388, 2015.
9. A. Karim, “On the Analysis and Detection of Mobile Botnet,” Journal of Universal Computer Science, vol. 22, no. 4, pp. 567–588, 2016.
10. W.-C. Wu and S.-H. Hung, “DroidDolphin: A Dynamic Android Malware Detection Framework Using Big Data and Machine Learning,” Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems, pp. 247–252, 2014.
11. D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, “DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket,” in NDSS, 2014.
12. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA Data Mining Software: An Update,” ACM SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, 2009.
13. Android Developers, “Permissions,” https://developer.Android.com/index.html. [Online]. Available: https://developer.Android.com/guide/topics/permissions/index.html. [Accessed: 22-Dec-2016].
14. A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, “Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey,” Inf. Secur. Tech. Rep., vol. 14, no. 1, pp. 16–29, 2009.