Talos App: On-Device Machine Learning Using TensorFlow to Detect

2018 Fifth International Conference on Internet of Things: Systems, Management and Security (IoTSMS)

Talos App: On-device Machine Learning Using TensorFlow to Detect Android Malware

Harshvardhan C Takawale (1) and Abhishek Thakur (2)

(1) Department of Electrical and Electronics Engineering, (2) Department of Computer Science
BITS Pilani, Hyderabad Campus, Hyderabad, India
Email: {f20160258,abhishek}@hyderabad.bits-pilani.ac.in

Abstract— In recent years, mobile devices have surpassed computers to become the device of choice for multiple applications and services. The major credit for this growth goes to Android OS. In a little over a decade of its existence, Android now has a market share that is almost four times that of its closest competitor, iOS. But with the increased share, the risk of malware has also increased. In this paper, we propose a lightweight method of malware analysis, the Talos application, that uses on-device machine learning and TensorFlow. It aims to solve the problem of malware detection using ‘Requested Permissions’ as the input parameters. The entire detection process takes place on the mobile device and does not require an Internet connection. The machine learning model is created using TensorFlow. The model’s graph is frozen in the protocol buffer format and then exported for deployment on the mobile device. In our experiments, Talos demonstrated an accuracy of 93.2%. It could analyze hundreds of apps within a second, even on low-end Android devices.

analysis provides results much faster and does not require running the application code; only the APK (Application package) file is needed. The Talos application uses static analysis and on-device machine learning to detect malicious apps. The application does not require internet connectivity to perform the analysis, as the entire model is present on the Android phone inside the application. We have used TensorFlow to create the machine learning model. The model uses Android’s ‘requested permissions’ as the features. The major advantage is that the analysis is fast, because it is static, and that it works offline. The final structure of the project is shown in Fig. 1.

Keywords—Android, malware analysis, TensorFlow, On-device machine learning, static-analysis, Adam optimization algorithm

I. INTRODUCTION

Data from June 2018 suggests that Android OS has the biggest mobile operating system market share worldwide, with 79.9 percent [1]. Major credit for the growth goes to the huge number of applications that a user can download from the official “Play Store”. As of March 2018, the total number of applications available to the user is 3.3 million [2]. Android also lets its users download applications from unknown sources, which has led to many smaller regional app stores like SlideME, Mobango and GetJar. APKs found anywhere online can be installed directly. This can result in security problems, as these applications are not thoroughly tested for malware. In 2017, Kaspersky Lab detected 5,730,916 malicious installation packages [3]. Malware detection is therefore urgently needed, and it should be easily accessible to all users. There are various methods for detecting malware, such as static analysis and dynamic analysis, but dynamic analysis consumes a lot of resources and is time-consuming [4]. It is thus difficult to perform dynamic analysis directly on a mobile device. On the other hand, static-

978-1-5386-9585-2/18/$31.00 ©2018 IEEE

Fig. 1: The structure of the project. Model Training: dataset, create and train model (TensorFlow), export. Model Deployment: Model.pb, interface, application.

A. Problem Statement

This work aims to deal with Android malware and provide a fast method that is easy for the user to deploy directly on the device, and that does not require internet connectivity to perform the analysis. We wish to create an application that detects malware using a machine learning approach.

B. Contribution and Outline

The contributions of this paper are -


● Extend the already existing permission based method of malware detection to include the new on-device machine learning techniques.

● Present a light-weight Android application that can detect malicious Android applications without a need to communicate to outside sources over the internet.

The rest of this paper is organized as follows: section II covers related work on static and dynamic analysis; section III describes our approach; section IV presents the experimental results; subsequently, we discuss the limitations of Talos and future extensions.

II. RELATED WORK

As mentioned in the report [5], mobile malware is among the fastest growing types of malicious software. The Android platform, due to its open nature [6] and huge user base, is a comparatively easy target. Also, because mobile devices are low-energy devices, they try to maximize battery life by compromising on other aspects such as security. Continuously running malware scans, as is done on computers, is not feasible on mobile devices. A lot of day-to-day user data, such as location, internet searches and online purchases, can be accessed through mobile phones, which makes them an attractive target for attackers. Because of the sheer number of malicious applications, a lot of research has been done in this field, and multiple implementations exist to deal with such malware. We discuss the past research related to this paper and how our work tries to take it a step forward.

A. Malware detection using static and dynamic analysis

Static analysis and dynamic analysis are two techniques that have long been used for malware detection. Both techniques have their advantages and disadvantages, which we discuss in this section along with existing projects that have used them.

Static analysis is one of the oldest approaches. In this method, we analyze the application code without actually executing it. Static analysis is therefore much faster than dynamic analysis, and it is more suitable for on-mobile analysis because it requires fewer resources. Many studies have used static analysis, such as Apposcopy [7], which performs semantics-based detection, and Droidmat [8], which uses API calls in its detection method. There have also been studies that use Android’s requested permissions for detection, such as [9]. Our method differs in that we employ on-device machine learning. This lets us create an application that works by itself, without requiring internet access for the analysis.

Dynamic analysis runs the code to detect whether or not the application is malicious. It mainly observes changes in memory, the functioning of the device and/or the change in the device's

performance, none of which can be detected using static analysis. Generally, dynamic analysis is slow, but it is capable of detecting advanced malware, especially malware that changes its behavior during code execution. It has high accuracy in detecting new families of malware. Note that dynamic analysis demands a lot of resources, and the resources on a mobile device may not be sufficient to perform it. Some projects that have employed dynamic analysis in their detection process are DroidScope [10], Riskranker [11] and IntelliDroid [12]. We have employed static analysis as it is in line with our goal of standalone on-device deployment, especially on the low to mid-end devices used in developing countries.

B. Malware detection using Machine learning

In the recent past, machine learning has become an integral part of malware analysis on all platforms. This is because it is capable of detecting malware that has never been seen before, which gives it an advantage over traditional signature-based detection. Many studies, such as [13] and [14], use machine learning and have shown that this approach is effective. We use TensorFlow in our work because of its lightweight variants, TensorFlow Android and TensorFlow Lite, which enable on-device machine learning and applications such as malware detection.

III. RESEARCH METHODOLOGY

The Talos application can be broken down into a few sequential, fundamental steps:
A. Extracting permissions
B. Creating Dataset
C. Creating a Model and freezing its variables
D. Creating the Android app
We now describe each of these steps in detail.

A. Extracting permissions

We use Android’s requested permissions as features for the machine learning model.
● Permissions - In Android, every application has its own limited-access sandbox. To get information and resources from outside the sandbox, the application needs to declare permissions in its AndroidManifest.xml file [15].
● Features - A feature is an individual measurable property or characteristic of a phenomenon being observed, used to classify data in machine learning [16].

Android applications are delivered in the APK file format (Application Package). An APK contains all the libraries, .dex files, manifest file, assets and resources that an application requires. In this approach, the training part of the machine learning takes place on a separate machine and the prediction part takes place on the mobile device, so we need two separate approaches for permission extraction, one for the computer and another for the mobile device. On the computer, we use Androguard [17] to reverse engineer the APKs, and a Python script extracts the permissions from all the APKs in the dataset and stores them in a .csv file. To create the .csv file, we compare the extracted


permissions with a list of 324 Android permissions [18]. On the application side, we use the same list of permissions for the application being tested. Talos uses Android’s PackageManager abstract class to get all the installed applications and the list of requested permissions for a particular application.

B. Dataset Creation

We collected 2333 different malicious APKs for our dataset. Most of them belong to the Canadian Institute of Cybersecurity’s Android Botnet dataset [19] and Android Validation dataset [20], and a few others are taken from various open-source malware sets. We also need benign APKs, for which 602 APKs were downloaded from the Google Play Store. The APKs are divided into two sets, a training set and a testing set. The training set contains 2375 applications and is a mix of malicious and benign applications. The testing set consists of 560 applications and, like the training set, contains both benign and malicious applications. The permissions are extracted from the dataset and a .csv file is created. The labels, “malware” or “benign”, are also extracted. As this is categorical data, we use one hot encoding to represent it [21].
● One hot encoding - Machine learning models generally deal with two kinds of data, numeric data and categorical data. As these models are mathematical, they cannot work directly with data that is not numeric, like the labels “malware” and “benign” in our model. One hot encoding converts this categorical data into an array of 1’s and 0’s, so that all the mathematical operations can be performed on the resulting matrix.

C. Model creation using TensorFlow

For on-device machine learning, the training part is done on a computer, as training a model is very compute intensive and mobile devices are too low-powered for this task. Therefore, a model is trained on the computer and then its weights and biases are frozen. This frozen model is exported to the device.
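The dataset preparation described in subsection B can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the three-entry `PERMISSION_LIST` is a hypothetical stand-in for the 324-entry list [18], the column order of the one-hot label is an assumption, and the requested permissions are hard-coded here rather than extracted from APKs with Androguard.

```python
import csv
import io

# Hypothetical three-entry stand-in for the 324-entry permission list [18].
PERMISSION_LIST = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
]
# Column order of the one-hot label is an assumption made for this sketch.
LABELS = ["benign", "malware"]


def permission_vector(requested, permission_list=PERMISSION_LIST):
    """Return a 0/1 vector: 1 where the app requests that known permission."""
    requested = set(requested)
    return [1 if p in requested else 0 for p in permission_list]


def one_hot(label, labels=LABELS):
    """Encode a categorical label as a list of 0s with a single 1."""
    if label not in labels:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if label == name else 0 for name in labels]


def build_rows(apps):
    """apps: iterable of (requested_permissions, label) pairs."""
    return [permission_vector(perms) + one_hot(label) for perms, label in apps]


def to_csv(rows):
    """Serialize the rows to CSV text (feature columns first, label columns last)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hard-coded sample apps so the sketch has no third-party dependency.
sample_apps = [
    (["android.permission.INTERNET"], "benign"),
    (["android.permission.INTERNET", "android.permission.SEND_SMS"], "malware"),
]
rows = build_rows(sample_apps)
csv_text = to_csv(rows)
```

With the full permission list, each row would have 324 feature columns followed by two label columns. The counts reported above are consistent with each other: 2333 malicious plus 602 benign APKs give 2935 in total, which matches the 2375 training and 560 testing applications.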
The exported model is then used on the mobile device to deliver predictions. We use TensorFlow because of its libraries that help in interfacing the model with the Android application [22]. The weight and bias variables are declared, and they are frozen after the training is completed. Their final values after training are used for prediction. During training, we load the CSV file and separate the columns into input feature columns and input label columns. For our model, we use the Adam optimizer algorithm [23], which combines the advantages of two other stochastic gradient descent algorithms, AdaGrad [24] and RMSProp [25].

1) Adam Optimizer Algorithm

Adam stands for Adaptive Moment Estimation. The Adam optimizer algorithm calculates an exponential moving average of the gradient and of the squared gradient, and the parameters β1 and β2 control the decay rates of these moving averages. We have used Adam because it is widely used in machine learning and, as reported in many studies, it works well with large data and data that has many

parameters. This algorithm has four hyper-parameters: α (alpha), β1 (beta1), β2 (beta2) and ε (epsilon) [23]. In a machine learning model, hyper-parameters are the variables that we fix before starting the training; they are the controls for fine-tuning and optimizing it. Each hyper-parameter has its own role:
● α - It is the learning rate of the model.
● β1 - It is the rate of exponential decay for the first moment estimates.
● β2 - It is the rate of exponential decay for the second moment estimates.
● ε - It is a small constant for numerical stability that avoids division by zero.

Another reason for the wide acceptance of the Adam optimizer is that the hyper-parameters rarely need to be changed, as the defaults work well in most cases. The defaults used by Talos are α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^-8. In the model, the optimizer aims to minimize the cross-entropy. Let y be the predicted probability distribution and y’ be the true distribution; then [26] defines the cross-entropy as in Eq. 1.

$$H_{y'}(y) = -\sum_i y'_i \log(y_i) \qquad \text{(Eq. 1)}$$

● Cross-entropy (H) - Cross-entropy loss, or simply cross-entropy, is a measure of how poorly a model performs in predicting the correct label. A perfect model has a loss value of zero. The loss gives an idea of how well the trained model describes the data in the dataset. As observed in Fig. 2, the second model provides predictions that are closer to the correct values than those of the first model. Therefore the second model’s loss (depicted by blue arrows) is lower than that of the first model, and the second model is better suited for describing the data.
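As a worked instance of the cross-entropy definition above, the following sketch evaluates the loss for one example with two label classes. The clipping constant `eps` is an assumption added here to avoid taking log(0); it is not part of the definition in [26], and the probability values are illustrative only.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one example: -sum_i y'_i * log(y_i).

    y_true is the true (one-hot) distribution y', y_pred the predicted
    distribution y. eps clips predictions away from zero (an assumption
    of this sketch, used only to keep log() defined).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("distributions must have the same length")
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


true_label = [0.0, 1.0]  # one-hot encoding of the correct class

confident = cross_entropy(true_label, [0.1, 0.9])  # close to the truth
unsure = cross_entropy(true_label, [0.6, 0.4])     # further from the truth
perfect = cross_entropy(true_label, [0.0, 1.0])    # exact match
```

The prediction that puts more probability on the correct class gets the lower loss, and a prediction that matches the true distribution exactly gets a loss of zero, which is the behaviour the bullet above describes.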

Fig. 2: Loss of a machine learning model
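To make the role of the four hyper-parameters concrete, here is a minimal sketch of a single-parameter Adam update with the defaults quoted in this subsection. It follows the update rule of [23], including the bias-correction step, which the text above does not spell out. The objective f(θ) = θ² is a toy example chosen for illustration and has nothing to do with the Talos model.

```python
import math


def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction, as in [23]
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
history = [theta]
for t in range(1, 101):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
    history.append(theta)
```

On the first step the bias-corrected ratio is almost exactly 1, so the parameter moves by roughly the learning rate α regardless of the gradient's scale. This scale-invariance is one reason the default hyper-parameters often need little tuning.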

2) The Model

In this section, we discuss how a basic model is created and the specifics of the model used in the Talos application. The initial steps for creating a model in TensorFlow involve ‘Placeholders’ and ‘Variables’ [27]. ● Placeholders are used to assign a place in memory where values can be stored. They are used primarily to feed data into the graph.


● Variables are used to store values that adapt after every step of training to improve accuracy. They are assigned random values during initialization and change as the model tries to fit the correct predictions.

In our model, we declare two Placeholders, x and y, to feed our feature matrix and label matrix, x_input and y_input respectively. Weights (W) and biases (b) are the variables used in training. Talos uses softmax regression and a neural network consisting of a single hidden layer.
● Softmax regression is used when we wish to get the probability that an input belongs to a certain label class. It is generally used when we have many labels, but it can also be used for two labels, as is the case for Talos. Eq. 2, based on [26], defines the softmax function.

$$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \qquad \text{(Eq. 2)}$$

● Hidden Layers - Hidden layers in a neural network are the layers that lie between the input layer and the output layer, and thus we do not directly interact with them [28]. The number of nodes in the hidden layer is declared beforehand. Multiple experiments were done and, based on the accuracy, 10 nodes are used in the hidden layer for Talos. Fig. 3 provides a pictorial representation of this.
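The forward pass of such a network can be sketched in plain Python as follows, using 324 permission inputs, 10 hidden nodes and 2 output classes. This is an illustration under stated assumptions rather than the Talos model itself: the weights are random instead of trained, and the hidden-layer activation (ReLU here) is an assumption, since the paper does not name one.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 324, 10, 2


def softmax(logits):
    """softmax(x)_i = exp(x_i) / sum_j exp(x_j), shifted by max for stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Compute inputs @ weights + biases; weights is an n_in x n_out nested list."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]


def relu(values):
    """Hidden activation; ReLU is an assumption of this sketch."""
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def predict(x, params):
    """Input layer -> one hidden layer -> softmax over the two label classes."""
    (w1, b1), (w2, b2) = params
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))


rng = random.Random(0)
params = [init_layer(N_INPUT, N_HIDDEN, rng), init_layer(N_HIDDEN, N_OUTPUT, rng)]

x = [0] * N_INPUT   # permission vector of a hypothetical app
x[0] = x[5] = 1     # it requests two of the 324 permissions
probs = predict(x, params)
```

The output is a pair of probabilities that sum to one, one per label class. The two weight matrices and two bias vectors in `params` correspond to the two (W, b) sets that connect the layers.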

the total number of variables we require to describe the model to four. We finally freeze all these variables after the training iterations. This process converts the variables to constants. We now discuss the TensorFlow functions that help us accomplish this, but first give a brief overview of the different structures that are used to save a graph.
● NodeDef - A NodeDef defines one single operation in the model. It is also what holds the information about a constant.
● Checkpoint - Checkpoints store the values of the variables. As the variables change regularly, checkpoints are written to files periodically.
● GraphDef - A GraphDef contains the list of all the NodeDefs and thus defines the entire executable graph.

When executing a graph, if a node requires variables, their values are taken from the checkpoint files. Therefore, we need both the GraphDef and the checkpoint file for complete deployment of our trained model on an Android mobile device. Protocol buffers are used to serialize these data structures [29], as shown in Fig. 4.
● Protocol Buffers (pb) are used for storing and interchanging all types of structured data. In this format, we first declare the schema, and the data is then serialized based on that schema. Protobuf is used when we have high volumes of data whose elements have a similar structure. For high volumes, its performance is better than that of JSON.

Fig. 4: Freezing trained model graph. Train Model produces Graph Def (.pb) and Checkpoints (.ckpt), which are combined into the Frozen Graph (.pb).

Fig. 3: Example of a model having one hidden layer
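The idea behind freezing, combining a GraphDef with checkpoint values so that every variable becomes a constant, can be illustrated with a toy example. The sketch below uses plain Python dictionaries as hypothetical stand-ins for NodeDef and GraphDef. It does not use TensorFlow's protobuf classes or its freeze_graph tool, and the node names and values are made up.

```python
import copy


def freeze(graph_def, checkpoint):
    """Return a new node list in which every Variable node is a Const node.

    graph_def:  list of dicts, each a toy stand-in for a NodeDef.
    checkpoint: dict mapping variable names to their trained values.
    """
    frozen = []
    for node in graph_def:
        if node["op"] == "Variable":
            if node["name"] not in checkpoint:
                raise KeyError(f"no checkpoint value for {node['name']!r}")
            # Replace the variable with a constant holding its trained value.
            node = {
                "name": node["name"],
                "op": "Const",
                "value": copy.deepcopy(checkpoint[node["name"]]),
            }
        else:
            node = copy.deepcopy(node)
        frozen.append(node)
    return frozen


# Toy graph with the four variables of a one-hidden-layer model.
graph_def = [
    {"name": "x", "op": "Placeholder"},
    {"name": "W1", "op": "Variable"},
    {"name": "b1", "op": "Variable"},
    {"name": "W2", "op": "Variable"},
    {"name": "b2", "op": "Variable"},
    {"name": "output", "op": "Softmax", "inputs": ["x", "W1", "b1", "W2", "b2"]},
]
# Made-up values standing in for the contents of a checkpoint file.
checkpoint = {"W1": [[0.1]], "b1": [0.0], "W2": [[0.2]], "b2": [0.0]}

frozen_graph = freeze(graph_def, checkpoint)
```

After freezing, the graph is self-contained: it no longer needs the checkpoint at execution time, which is why a single frozen file is enough for deployment on the device.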

The input to the model is the matrix of requested permissions out of the total of 324 permissions that we consider. The output is the numeric probability that the application belongs to the first label class or to the second label class. For Talos, only two outputs are present. Hence, in the context of Fig. 3, N = 324, M = 2 and K = 10.

3) Freezing the variables and exporting

The next step in the project is saving the trained model in such a way that it can be used by our Android application. For this, we need to freeze the values of the variables (W, b between the layers), which have been altered after every step of the training process so as to give the best possible prediction model. As discussed earlier, we have used one hidden layer in our model. Thus we have two sets of variables, each set connecting one layer to the next. Each set contains a ‘weight’ variable and a ‘bias’ variable. This brings

The ‘Saver’ class of TensorFlow is used to save the variables in a ‘.ckpt’ file. The ‘freeze_graph’ tool takes the .ckpt file and the GraphDef as input. It converts the variables in the checkpoint file to constants and outputs a new GraphDef created using these constants.

D. Android Application

Talos analyzes the “requested permissions” of an application and classifies the app as benign or malicious.

1) Android Application Sandbox and Permissions

The Android platform uses isolation as a means of security. Each application runs on a separate virtual machine. Android allocates completely separate resources to each application (based on the publisher of the app), similar to different users in a Linux environment. As every application has its own space in memory, one application cannot communicate with another or access its resources. This is called sandboxing. But most applications are not standalone; they require resources


from other applications, including the Android platform, to function. Applications must explicitly request the corresponding permissions.
● Requested permissions - If an application requires additional capabilities, it needs to explicitly declare them in its manifest file. According to developer best practices, an application should request as few permissions as possible. There are various types of permissions, namely Normal, Signature and Dangerous permissions, based on the security threat they present. Normal and Signature permissions are automatically granted during installation. For Dangerous permissions, the user needs to explicitly grant the permission to the application. If the user does not approve the requested permission, the application cannot provide the functionality based on that permission.

V. LIMITATIONS

In this section, we discuss the malware detection problems that the application is not yet able to solve. As Talos uses a static analysis approach for malware detection, it cannot detect a new family of malware that is completely different from the existing malware. It is also unable to detect advanced malware that uses code obfuscation or polymorphism.

2) PackageManager

PackageManager is an Android class that is used to get information about all the applications installed on the device. It is used in the Talos application to get a list of all the installed APKs, and then again to get the list of requested permissions for each package. The basic task of the package manager is to query the data from the packages.xml file, which stores the information about all the installed applications.

Through experiments, we observed that a single hidden layer gives the best results. Table I captures the time taken to analyze the APK files across multiple devices.

3) TensorFlow Interface

We save the protobuf (.pb) file of the frozen graph in our application. We get the array of requested permissions for the application to be tested from the package manager. We need to feed this array into the input node of the frozen graph and read the prediction from the output node. This cannot be done directly; it requires an interface. For this, the TensorFlow library provides the TensorFlowInferenceInterface class. The data is fed to the model using the feed function of TensorFlowInferenceInterface, and the output is obtained using the fetch function. The output is then displayed on the device.

IV. EXPERIMENTAL RESULTS

In this section, we discuss the results obtained from our machine learning model. Earlier, we divided our dataset into training data and testing data. Here, we run our model on the testing data to find the accuracy and the loss (cross-entropy loss) of the model. During training, we divide our training data into batches of 50 and run the training for 500 epochs. The output can be seen in Fig. 5. When we run the model on our test data, the accuracy is 93.2% and the loss is 0.381.

TABLE I. PERFORMANCE OF TALOS

Device Name         Android Version   Number of Apps   Time taken (milliseconds)
Lenovo k8 note      8.0.0             95               566
Redmi note 5        8.1.0             101              368
Samsung SM-G610F    7.0.0             69               388
Moto G4 Plus        7.1.2             88               471
OnePlus A3003       8.0.0             82               305
OnePlus A6000       8.1.0             100              178
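The per-app analysis time implied by Table I can be worked out with a few lines of Python. The figures below are copied from the table; the helper names are illustrative, and the throughput values are simple extrapolations from each measured run rather than separately measured results.

```python
# (device name, number of apps, time taken in milliseconds), from Table I.
TABLE_I = [
    ("Lenovo k8 note", 95, 566),
    ("Redmi note 5", 101, 368),
    ("Samsung SM-G610F", 69, 388),
    ("Moto G4 Plus", 88, 471),
    ("OnePlus A3003", 82, 305),
    ("OnePlus A6000", 100, 178),
]

# Average time per analyzed app, in milliseconds, rounded to two decimals.
ms_per_app = {name: round(ms / apps, 2) for name, apps, ms in TABLE_I}

# Extrapolated throughput: how many apps would fit into one second.
apps_per_second = {name: round(1000 * apps / ms) for name, apps, ms in TABLE_I}

slowest = max(ms_per_app, key=ms_per_app.get)
fastest = min(ms_per_app, key=ms_per_app.get)

for name, value in ms_per_app.items():
    print(f"{name}: {value} ms per app, about {apps_per_second[name]} apps per second")
```

On these numbers, every device completed its full set of installed apps in well under a second, at an average of roughly 2 to 6 ms per app.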

Fig. 5: The loss and accuracy of the model

Another limitation of the application is that it can only work on applications that are installed on the device and cannot work directly on the APK files of applications.

VI. CONCLUSION AND FUTURE WORK

The aim of the work was to employ on-device machine learning and static permission-based analysis to detect malicious Android applications. The proposed Talos application performs reasonably well on the dataset available to us. The overall project has two basic phases: the training phase, which takes place on the computer, and the prediction phase, which takes place on any Android mobile device. Multiple refinements are planned as future work. We are currently working on making the Talos application ready for the market and publishing it on the Google Play Store. We are also working on different ways to optimize and improve the current machine learning model. We are trying to incorporate other information that is available to us, such as the receiver, service and provider information in the manifest, to improve accuracy. Access to a larger APK dataset, especially of benign applications and polymorphic malicious applications, can further improve the accuracy of Talos. The final aim is to add some aspects of dynamic analysis to the Talos app, so as to make it a full-fledged standalone application for on-device APK analysis.

ACKNOWLEDGMENT

The work would not have been possible without the help of Dr. A. H. Lashkari, Faculty, University of New Brunswick, who


allowed us access to the Canadian Institute of Cybersecurity's datasets. We would also like to thank Mr. Ruchik Mishra, student, Birla Institute of Technology and Science, Pilani - Hyderabad campus, for inputs to early drafts of our work.

REFERENCES

[1] “Smartphone sales by OS worldwide 2009-2018 | Statistic,” Statista. [Online]. Available: https://www.statista.com/statistics/266219/globalsmartphone-sales-since-1st-quarter-2009-by-operating-system/.
[2] “Number of Google Play Store apps 2018 | Statistic,” Statista. [Online]. Available: https://www.statista.com/statistics/266210/numberof-available-applications-in-the-google-play-store/.
[3] Roman Unuchek, “Mobile malware evolution 2017,” Securelist - Kaspersky Lab's cyberthreat research and reports, 07-Mar-2018. [Online]. Available: https://securelist.com/mobile-malware-review2017/84139/.
[4] Shaerpour, Kaveh, Ali Dehghantanha, and Ramlan Mahmod. "Trends in android malware detection." Journal of Digital Forensics, Security and Law 8.3 (2013): 2.
[5] “Mobile Malware Shows Rapid Growth in Volume and Sophistication,” Information Security News, IT Security News and Cybersecurity Insights: SecurityWeek. [Online]. Available: https://www.securityweek.com/mobile-malware-shows-rapid-growthvolume-and-sophistication.
[6] Android-review.googlesource.com. [Online]. Available: https://android-review.googlesource.com/q/status:open.
[7] Feng, Yu, et al. "Apposcopy: Semantics-based detection of android malware through static analysis." Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2014.
[8] Wu, Dong-Jie, et al. "Droidmat: Android malware detection through manifest and api calls tracing." Information Security (Asia JCIS), 2012 Seventh Asia Joint Conference on. IEEE, 2012.
[9] Aung, Zarni, and Win Zaw. "Permission-based android malware detection." International Journal of Scientific & Technology Research 2.3 (2013): 228-234.
[10] Yan, Lok-Kwong, and Heng Yin. "DroidScope: Seamlessly Reconstructing the OS and Dalvik Semantic Views for Dynamic Android Malware Analysis." USENIX security symposium. 2012.
[11] Grace, Michael, et al. "Riskranker: scalable and accurate zero-day android malware detection." Proceedings of the 10th international conference on Mobile systems, applications, and services. ACM, 2012.
[12] Wong, Michelle Y., and David Lie. "IntelliDroid: A Targeted Input Generator for the Dynamic Analysis of Android Malware." NDSS. Vol. 16. 2016.
[13] Amos, Brandon, Hamilton Turner, and Jules White. "Applying machine learning classifiers to dynamic android malware detection at scale." Wireless communications and mobile computing conference (iwcmc), 2013 9th international. IEEE, 2013.
[14] Peiravian, Naser, and Xingquan Zhu. "Machine learning for android malware detection using permission and api calls." Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on. IEEE, 2013.
[15] “Request App Permissions | Android Developers,” Android Developers. [Online]. Available: https://developer.android.com/training/permissions/requesting.
[16] “Feature (machine learning),” Wikipedia, 03-Jul-2018. [Online]. Available: https://en.wikipedia.org/wiki/Feature_(machine_learning).
[17] Androguard, “androguard/androguard,” GitHub, 19-Jul-2018. [Online]. Available: http://github.com/androguard/androguard.
[18] Arinerron, “A list of all Android permissions...,” Gist. [Online]. Available: https://gist.github.com/Arinerron/1bcaadc7b1cbeae77de0263f4e15156f.
[19] Kadir, Andi Fitriah Abdul, Natalia Stakhanova, and Ali Akbar Ghorbani. "Android botnets: What urls are telling us." International Conference on Network and System Security. Springer, Cham, 2015.
[20] Gonzalez, Hugo, Natalia Stakhanova, and Ali A. Ghorbani. "Droidkin: Lightweight detection of android apps similarity." International Conference on Security and Privacy in Communication Systems. Springer, Cham, 2014.
[21] K. Potdar, T. S., and C. D., “A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers,” International Journal of Computer Applications, vol. 175, no. 4, pp. 7–9, 2017.
[22] “Building TensorFlow on Android | TensorFlow,” TensorFlow. [Online]. Available: https://www.TensorFlow.org/mobile/android_build.
[23] Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” 3rd International Conference for Learning Representations, San Diego, 2015.
[24] Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12.Jul (2011): 2121-2159.
[25] Tieleman, Tijmen and Hinton, Geoffrey (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
[26] “MNIST For ML Beginners,” API Mirror. [Online]. Available: https://apimirror.com/tensorflow~guide/get_started/mnist/beginners.
[27] “Module: tf | TensorFlow,” TensorFlow. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf.
[28] Panchal, G., and Mahesh Panchal. "Review on methods of selecting number of hidden nodes in artificial neural network." International Journal of Computer Science and Mobile Computing 3.11 (2014): 455-464.
[29] Protocolbuffers, “protocolbuffers/protobuf,” GitHub. [Online]. Available: https://github.com/protocolbuffers/protobuf.
