Compromised User Credentials Detection Using Temporal Features: A Prudent Based Approach

Adnan Amin and Sajid Anwar
Center of Excellence in Information Technology, Institute of Management Sciences, Peshawar 25000, Pakistan.
{adnan.amin, sajid.anwar}@imsciences.edu.pk

Babar Shah and Asad Masood Khattak
College of Technological Innovation, Zayed University, Abu Dhabi 144534, UAE.
{babar.shah, asad.khattak}@zu.ac.ae

ABSTRACT
This study addresses a serious and rapidly growing cyber threat: compromised legitimate user credentials, which are very effective for cyber-criminals who want to gain the trusted relationships of the account owners. Such compromised credentials ultimately result in large-scale damage inflicted by the attacker. Moreover, detecting the activities of compromised legitimate users is crucial in competitive and sensitive organizations, because wrong data are difficult to clean from the database. This study presents a novel approach to detect compromised users’ activity in a live database. Our approach uses a composition of prudence analysis, ripple down rules (RDR) and simulated experts (SE) to detect and identify accounts that experience a sudden change in behavior. We collected data from a sensitive running database over a period of six months and evaluated the proposed technique. The results show that this combined model can fully detect outlier user activity and can provide useful information for the concerned decision maker.

CCS Concepts
Information systems → Data mining

Keywords
Prudence analysis; simulated experts; compromised user credentials; outlier detection.

1. INTRODUCTION
A compromised user credential is a legitimate user account that has been taken over by a criminal or attacker [1]. This is a growing issue for many organizations, particularly those that have implemented database management systems with multi-user logins to maintain their important data. User credentials are compromised in various ways [1], [2], e.g., phishing scams that steal users’ login information, bots that harvest credential details, server-side scripting vulnerabilities, and Sybil accounts. This is a crucial issue and has attracted attention in social media (Facebook and Twitter), reviews on Amazon & Yelp, the credit card business, heat rate monitoring, abnormal changes in stock prices, computer security and military surveillance [2], [3], [4].

To address this rapidly growing and critical problem, researchers have developed several mitigation and detection approaches [1]. Initially, both academic and industry researchers focused on the detection of fake accounts (i.e., self-created accounts for sharing malicious content) [5]. Unfortunately, these studies do not discriminate between legitimate and compromised user credentials [1]. These solutions are therefore incapable of detecting compromised credentials, which have significantly different properties from fake ones. Compromised user credentials allow cyber-criminals to leverage the history of the associated user’s credential and trusted network to misuse legitimate rights more effectively [6]. As a result, cyber-criminals can easily abuse compromised accounts to perform unusual activities (i.e., compromised activities) on the important data of an organization. Compromised user credentials are therefore more valuable to cyber-attackers than fake accounts.

The problem of compromised credentials can be tackled by analyzing the user’s behavior using machine learning (ML) techniques. The area of ML is highly motivated by pattern analysis and detection techniques for numerous pattern/behavior recognition problems [7]. Framed as an ML task, the compromised credential problem can be tackled efficiently through outlier analysis and detection. Hawkins [8] defined the outlier as follows: “An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism”.

Outlier detection has been used in various important domains, including streaming data, network data, time series data, uncertain data, temporal outlier analysis (TOA) and high dimensional data [3], [9]. TOA is related to the time series nature of data, which requires a dedicated technique to examine outlier or anomalous behavior in the temporal aspects of the data [10]. TOA can help to capture unusual changes in a user’s behavior. Such changes often occur when user credentials are compromised [1]: there is some noticeable change (i.e., unusual activity) in the user’s routine behavior or activities. To model temporal outliers, this study focuses on temporal continuity, which plays a key role in the formulation of infrequent changes, sequences or temporal patterns in data [10]. Several traditional concepts from diverse fields have been used to address this problem, including data mining, information theory, statistical modeling and specific problem formulations [3]. However, prudence and knowledge base (KB) supervised approaches have not been widely studied in the context of this specific problem. Therefore, this study presents a novel self-evolved approach to detect and address the compromised user credential problem by using a rule-based approach. The proposed approach is based on prudence and a knowledge base system (KBS). The RDR learner is used as the base classifier because RDR can update the existing rules by adding new cases according to the new situation [11]. The proposed technique is constituted by the following steps: (i) RDR is used as the base classifier, which also helps in the constitution of the KBS; (ii) a prudence system is incorporated, which produces a warning prompt whenever a new change or change pattern occurs (i.e., temporal outlier identification); and (iii) a procedure is built that updates and maintains the knowledge database just like a virtual human expert.

The rest of the paper is organized as follows. Section 2 provides details about compromised user credentials and the literature survey. Section 3 presents the preliminaries of RDR. The proposed approach is presented in Section 4. Section 5 discusses the evaluation results, and Section 6 concludes the paper along with future directions.

2. COMPROMISED USER’S CREDENTIAL AND TEMPORAL OUTLIER ANALYSIS
A compromised user’s credential is an account that is used by a criminal posing as the genuine account owner. Compromising a legitimate user’s credential is very effective for cyber-attackers because it leverages the trust relationships that the login owner has established in the past [1]. According to study [12], the detection of irregular user behavior encompasses not only violations by an attacker but also the misuse of rights by a legitimate user. Compromised user behavior can be viewed as a two-class classification problem, i.e., normal behavior or abnormal behavior (i.e., outlier). Outliers are patterns or sequences in data that do not conform to a well-defined notion of usual behavior. Outliers can be classified into the following important categories [3]:
• Point outliers: when a single data instance is observed to be anomalous compared to the rest of the data patterns, that single point is known as a point outlier.
• Collective outliers: when a collection of related data instances is anomalous with respect to the entire data set, the collection is known as a collective outlier. For example, a sequence of computer actions such as ssh (Secure Shell), FTP (File Transfer Protocol) and buffer-overflow may form an outlier as a sequence, while each single event is not an outlier.
• Contextual outliers: when a data instance is observed to be anomalous only in a specific context or condition, it is called a contextual or conditional outlier. For example, if a person’s normal weekly shopping expenditure is about $100 except around some specific events (e.g., Christmas, Diwali, Eid), then a purchase that reaches $500 in another week will be considered a contextual outlier.
Point outliers can occur in any dataset, while collective outliers occur only in geospatial data, where values represent some geolocation information [13].
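To make the contextual case concrete, the shopping example above can be written as a small check in which the same amount is judged differently depending on the context. This is only an illustrative sketch: the function name, the `festive_week` flag and the `tolerance` factor are assumptions introduced here, not part of the cited taxonomy.

```python
def is_contextual_outlier(amount, festive_week, usual_spend=100.0, tolerance=2.0):
    """Flag a weekly spend that is far above the usual level outside a festive context.

    `tolerance` is a hypothetical factor: spends above tolerance * usual_spend
    are treated as unusual in an ordinary week.
    """
    if festive_week:
        # High spending is expected in this context, so it is not flagged.
        return False
    return amount > tolerance * usual_spend
```

With these assumed defaults, a $500 week is flagged in an ordinary week but not in a festive one, which is exactly what distinguishes a contextual outlier from a point outlier.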
Both point and collective outliers can, however, be transformed into a contextual outlier problem by incorporating the context feature into the dataset [3], [4].

Bashir et al. [2] proposed an unsupervised anomaly detection technique based on Principal Component Analysis to distinguish potentially bad behavior from normal behavior. The model was based on historical data obtained from users’ activities, and a user account was considered compromised when an anomalous activity was observed. Another study [14] developed a system based on a clustering technique for the detection of compromised user credentials in large-scale attacks on Twitter. The authors analyzed 14 million Twitter victims of compromised credentials and tracked how criminals had hijacked the accounts. The advantage of unsupervised or clustering methods is that no prior labeled data are required. However, for the outlier detection problem these methods often suffer from high false alarm rates compared to semi-supervised and supervised methods [3], [4].

Xue, Shang, & Feng [15] presented a semi-supervised learning technique for outlier detection that combines fuzzy and rough set theory. These semi-supervised techniques were employed on positive and negative instances in the training data. Labeling the training data in such techniques is a time-consuming process whenever instances of a new nature occur in the data set. Similarly, our proposed model uses the RDR technique to train the classifier at the initial stage. However, this study overcomes the time-consuming process of training on new changes through the RDR approach, which can add new rules without re-training. Gupta et al. [10] proposed an approach for detecting hijacked accounts in social network content based on clustering social network text and URLs into spam campaigns. However, they failed to distinguish between compromised accounts and fraudulent accounts. Several studies [16], [17], [18], [19] employed a window-based time series approach for outlier detection, where normal sequences are divided into sub-sequences of a given window size and kept in a database with frequency information.
Temporal outlier analysis (TOA) is also used as an outlier detection technique to examine the user’s behavior from the temporal nature of data [10]. The temporal feature can be a time series of observed values. A wide range of machine learning techniques, such as SVMs [20], [21], neural networks [19], [22], decision trees [20] and Naïve Bayes [20], has been used to address the outlier detection problem in temporal data. In summary, most current studies in the literature have not widely addressed prudence and knowledge-based approaches for detecting outliers caused by compromised user credentials in temporal data. This study is another positive attempt to propose a benchmarking and empirical model that contributes further to this domain. It introduces a novel procedure for the efficient utilization of prudence analysis and SE. The prudence approach produces a warning alert every time it detects outlier behavior, while the SE maps test cases of a new nature to investigate the KBS being built and update it accordingly. The next section presents the preliminaries of RDR.

3. RIPPLE DOWN RULES (RDR)
RDR was originally introduced by Compton and Jansen [23] in 1990. They proposed the RDR technique as a suitable methodology for knowledge acquisition (KA) as well as for the maintenance of large-scale rule-based systems [11]. RDR was developed to handle the maintenance problem associated with conventional rule-based systems [24] and to deal with the contextual nature of expert knowledge [23], [25]. Basically, an RDR is a list of rules in which each rule can be linked to another list of rules called exceptions. If a general rule applies and one of its exceptions is also applicable, then the exception is applied [26]. The RDR approach enables domain experts to change the KB without the need for a knowledge engineer. There are two major types of RDR, namely Single Classification Ripple Down Rules (SCRDR) and Multiple Classification Ripple Down Rules (MCRDR).
• SCRDR is a binary classification technique that creates a binary tree with two distinct paths: the first path is known as EXCEPT and the second one is referred to as IF-NOT.
• MCRDR, also known as RDR-sets [27], differs from SCRDR in that SCRDR is based on binary classification while MCRDR is a multi-way classification. In addition, in MCRDR the rules from the list of successors may fire simultaneously, whereas in SCRDR each rule suppresses its successors.
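The EXCEPT / IF-NOT tree of an SCRDR can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this study: the `Rule` class, the `evaluate` and `classify` functions and the boolean features `a`, `b`, `d`, `f`, `g` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Case = Dict[str, bool]


@dataclass
class Rule:
    condition: Callable[[Case], bool]
    conclusion: str
    except_branch: Optional["Rule"] = None  # followed when the condition holds
    if_not_branch: Optional["Rule"] = None  # followed when the condition fails


def evaluate(rule: Optional[Rule], case: Case, inherited: Optional[str] = None) -> Optional[str]:
    """Return the conclusion of the last rule that fired on the path through the tree."""
    if rule is None:
        return inherited
    if rule.condition(case):
        # The rule fires; an applicable exception may still override its conclusion.
        return evaluate(rule.except_branch, case, rule.conclusion)
    # The rule does not fire; keep the inherited conclusion and try the if-not branch.
    return evaluate(rule.if_not_branch, case, inherited)


# Illustrative knowledge base: a default rule with one exception chain.
rule3 = Rule(lambda c: c["f"] and c["g"], "h")
rule2 = Rule(lambda c: c["d"], "e")
rule1 = Rule(lambda c: c["a"] and c["b"], "c", except_branch=rule2, if_not_branch=rule3)
root = Rule(lambda c: True, "default", except_branch=rule1)


def classify(case: Case) -> Optional[str]:
    return evaluate(root, case)
```

Because each node carries the conclusion of the last rule that fired, a new exception can be attached to a single node without touching any other rule, which is the maintenance property that motivates RDR.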

3.1 Rules learning with exceptions
Gaines and Compton [11] proposed the Induct RDR algorithm for learning RDR rules; it is strongly related to the strategy of finding interesting rules. The steps of Induct RDR are as follows [26]:
• Given a conclusion, find the rule that is least likely to predict that conclusion by chance.
• Learn the exceptions recursively (except branch).
• Learn the remaining samples recursively (if-not branch).

An RDR is executed like a normal if-statement, with one addition: when a condition evaluates to true, the exception is checked first, and the rule’s own conclusion is returned only if the exception fails. This step-by-step process is also known as exception overriding [28]. The root node contains the default rule, which has only a true branch and is always satisfied:

IF true then default conclusion because default case

In the above example, in the absence of any other particular information, the RDR technique recommends taking the default action. When a condition succeeds, an exception can be added in the form of a nested if-statement. Thus, the default condition is always satisfied, and if the default conclusion is not appropriate, an exception is added [29].

IF a and b then c case 1
    except IF d then e case 2
ELSE IF f and g then h case 3

The above rules can be interpreted as follows: if conditions a and b are true, then we conclude c unless d is true, in which case we conclude e. Here, each if-then pair is a single rule, while the entire if-then-except-else structure is the RDR structure.

4. EVALUATION SETUP
4.1 Dataset
In this study, data were collected from a highly sensitive database server of an organization in Pakistan over a period of six months. Two DML (Data Manipulation Language) operations were selected for the study, i.e., the insertion and update of records by a user. In the collected data, 2981 records were inserted and 425 records were updated by the user during this period. The study focuses on two major attributes (i.e., user id and timestamp) of each record. These two attributes are related to time-series data, and the proposed approach observes the behavior of the users based on them. The attribute id represents the unique identity of the user, while the attribute timestamp reflects the actual date/time of the activity performed by the user.

4.2 Preparation of Auxiliary Tables
In the proposed study, we have implemented three auxiliary tables: (i) the range table, (ii) the helping table and (iii) the derived table. The range table is constructed from the historical records of a user and has the following attributes: UserID; Day (the name of the day of the week); DayLimit (the number of activities performed by a user on a specific day); Max_DayLimit (the maximum number of operations that can be performed by a specific user in a day); HourLimit (the number of operations that can be performed by a user in an hour on each day); Max_HourLimit (the maximum number of operations that can be performed by a user in an hour); and QueryType (a categorical value for each operation performed by a specific user, i.e., INS for insert and UPD for update). Table 1 reflects the structure and descriptive statistical information of the range table. The rule-based procedure built later in this section is fully based on the boundaries specified in the range table for each user and every operation (i.e., INS and UPD). The UserID attribute represents the user, and when this user performs an operation, the QueryType attribute holds the corresponding categorical value (i.e., INS or UPD).

Table 1. Structure of Range Table

| User ID | Day | Day Limit | Max. Day Limit | Hour Limit | Max. Hour Limit | Query Type |
|---|---|---|---|---|---|---|
| A | Monday | 125 | 155 | 38 | 41 | INS |
| A | Tuesday | 97 | 155 | 41 | 41 | INS |
| A | Wednesday | 155 | 155 | 40 | 41 | INS |
| A | Thursday | 124 | 155 | 38 | 41 | INS |
| A | Friday | 64 | 155 | 19 | 41 | INS |
| A | Monday | 16 | 16 | 8 | 8 | UPD |
| A | Tuesday | 16 | 16 | 4 | 8 | UPD |
| A | Wednesday | 16 | 16 | 7 | 8 | UPD |
| A | Thursday | 9 | 16 | 6 | 8 | UPD |
| A | Friday | 9 | 16 | 5 | 8 | UPD |
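One possible way to derive a range table from historical records is sketched below. The text does not state how the limits are computed, so this sketch assumes that DayLimit and HourLimit are the largest daily and hourly counts observed for a user on a given weekday, and that the Max_ values are the largest of these across all weekdays (which is consistent with the values in Table 1). The function name `build_range_table` and the record layout are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def build_range_table(records):
    """records: iterable of (user_id, timestamp: datetime, query_type) tuples."""
    per_day = Counter()   # (user, query_type, date) -> operations on that date
    per_hour = Counter()  # (user, query_type, date, hour) -> operations in that hour
    for user_id, ts, query_type in records:
        per_day[(user_id, query_type, ts.date())] += 1
        per_hour[(user_id, query_type, ts.date(), ts.hour)] += 1

    # Assumption: the per-weekday limit is the historical maximum for that weekday.
    day_limit = defaultdict(int)
    hour_limit = defaultdict(int)
    for (user_id, query_type, date), n in per_day.items():
        key = (user_id, query_type, date.weekday())
        day_limit[key] = max(day_limit[key], n)
    for (user_id, query_type, date, _hour), n in per_hour.items():
        key = (user_id, query_type, date.weekday())
        hour_limit[key] = max(hour_limit[key], n)

    # Assumption: the Max_ values are the largest limits over all weekdays.
    max_day = defaultdict(int)
    max_hour = defaultdict(int)
    for (user_id, query_type, _wd), n in day_limit.items():
        max_day[(user_id, query_type)] = max(max_day[(user_id, query_type)], n)
    for (user_id, query_type, _wd), n in hour_limit.items():
        max_hour[(user_id, query_type)] = max(max_hour[(user_id, query_type)], n)

    rows = []
    for user_id, query_type, wd in sorted(day_limit):
        rows.append({
            "UserID": user_id,
            "Day": DAYS[wd],
            "DayLimit": day_limit[(user_id, query_type, wd)],
            "Max_DayLimit": max_day[(user_id, query_type)],
            "HourLimit": hour_limit[(user_id, query_type, wd)],
            "Max_HourLimit": max_hour[(user_id, query_type)],
            "QueryType": query_type,
        })
    return rows
```

Other choices, such as a percentile of the historical counts instead of the maximum, would fit the same table layout; only the two aggregation steps would change.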

The helping table is used to keep track of the activities performed by an active user in a specific hour. It stores the users’ records on a temporary basis, only for the current day. The helping table includes the attributes UserID, CurrentHour, Records and QueryType. The UserID and QueryType attributes keep track of the user and of the type of operation performed by the user, respectively. The CurrentHour attribute holds a value from 01 to 24 that represents the hour of the user’s activity, and the count of records inserted or updated in that hour is stored in the Records attribute. Table 2 represents the basic structure of the helping table.

Table 2. Structure of helping table

| Attributes | Description |
|---|---|
| UserID (Ui) | Unique identification of the user. |
| CurrentHour (Cur_Hr) | Holds the current hour as value (i.e., possible values 01 to 24). |
| Records | Tracks the number of records inserted and updated in the current hour. |
| QueryType (QT) | Keeps track of the user’s operation. Two possible values: INS or UPD. |

Finally, the derived table is used to store the aggregate sum of the helping table attribute “Records” for a day. It is fully based on the helping table. Table 3 reflects the basic structure of the derived table.

Table 3. Structure of derived table

| Attributes | Description |
|---|---|
| UserID | Unique identification of the user. |
| Day | Holds the current day (i.e., possible values Monday to Sunday). |
| DayRecords | Tracks the number of records inserted and updated in the current day. |
| Operation | Keeps track of the user’s operation. Two possible values: INS or UPD. |
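The relationship between the helping and derived tables can be sketched with in-memory counters. This is a simplified illustration under stated assumptions: the class name `ActivityTables` and its methods are hypothetical, and the tables are modeled as dictionaries rather than database tables.

```python
from collections import defaultdict


class ActivityTables:
    """In-memory stand-in for the helping table and the derived table."""

    def __init__(self):
        # Helping table: (user_id, hour, query_type) -> records in that hour.
        self.helping = defaultdict(int)
        self.current_day = None

    def record(self, user_id, day, hour, query_type):
        """Count one insert or update for the given user, day and hour."""
        if day != self.current_day:
            # The helping table only holds data for the current day.
            self.helping.clear()
            self.current_day = day
        self.helping[(user_id, hour, query_type)] += 1

    def derived(self):
        """Derived table: per-day sum of the hourly Records values."""
        totals = defaultdict(int)
        for (user_id, _hour, query_type), count in self.helping.items():
            totals[(user_id, self.current_day, query_type)] += count
        return dict(totals)
```

Computing the derived table on demand keeps the two tables consistent by construction; a database implementation would more likely maintain both counters incrementally, as the detection procedure in this section does.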

4.3 Knowledge Acquisition and Classification
RDR is used for knowledge acquisition and classification. The primary reason for using RDR in the proposed study is that it offers the following advantages [30]:
• Dealing with the situated nature of knowledge provided by experts.
• A new rule can be added easily when the system has given a wrong conclusion.
• Providing easy maintainability of the KBS.
• Case-based validation of KA from the experts.
• Prudence analysis is an RDR technique.
A manual RDR method is adopted (the steps are described in Method 2 in Section 4.4) to extract the rules with exceptions and cases for building a very important component of the proposed technique, the Rule-based Procedure (RBP). This component is responsible for making decisions. Rules and exceptions are transformed into a decision rule list for easy interpretation and understanding. Based on these rules, the classification and prediction of the class can be performed easily. The RDR learner is therefore used as the base classifier in the proposed study, because RDR can update the existing rules by adding new cases according to the new situation [11]. Furthermore, it also helps in the constitution of a knowledge database in the proposed approach.

4.4 Prudent Alert on Compromised Activity
The term prudence refers to recognizing that a case is in some way unusual and firing a warning about it [30]. We have incorporated prudence analysis into the RDR technique as an alternative way of dealing with the brittleness of KBS. Different mechanisms for prudence-based KBS have been developed in the literature [30], [31], [32]. Compton and Preston [30] used a technique in which a set of seen attribute values associated with each rule and conclusion is maintained in a list. If an attribute value of a case does not already exist in the prepared list, an alert is generated so that the case can be reviewed. Another study [31] presented a ripple down model (RDM) with two main functions for detecting the knowledge boundaries based on range probabilities. One function observes the range values of continuous attributes, and the other observes the already maintained values of categorical attributes. In this study, the proposed prudent alert on the compromised activity of a user using SE is based on a manual RDR based procedure, which is supported by three auxiliary tables (i.e., range, helping and derived). Given an input stream of user operations processed by PCAD (Method 1), PCAD first verifies the input operation against the user’s behavior using the information about the concerned user in the range table. A prudence prompt is generated if the user account is considered to be compromised; otherwise, the user is allowed to perform the operation. The classification of the user’s account as compromised or non-compromised is carried out by the RBP procedure (Method 2). The whole procedure for PCAD and RBP consists of the following steps.

Method 1: Procedure for Compromise Activity Detection (PCAD)

Set Ui = Current_UserID
A1. START
A2. Load Range Table for Ui
A3. Take input from Ui; consider Ui’s operation (i.e., the query might be an insert or an update)
A4. Retrieve values of attributes from Range Table for user Ui
    Set Dy = Day, HrLimit = HourLimit, DyLimit = DayLimit, Max_HrLimit = Max_HourLimit, Max_DyLimit = Max_DayLimit, QT = QueryType
    // Call to Method 2: RDR based procedure
A5. Result = RBP(Ui, Dy, HrLimit, DyLimit, Max_HrLimit, Max_DyLimit, QT)
A6. IF Result = NC THEN    // NC: Not Compromised
    Allow the user to perform the operation and update the Helping Table and Derived Table as follows:
    A6.1. Get record of user from Helping Table where Hour = Current_Hour for UserID = Ui
    A6.2. IF record is found THEN
        /* Keep track of the value of Records (attribute) in the Helping Table by incrementing counter Rec */
        Rec += 1 and modify value of attribute “Records” by Rec
    A6.3. ELSE
        /* Add new row for user Ui into Helping Table (HT) in case of first-time entry */
        Add row into HT for User = Ui, QueryType = QT, Day = Current_Day and Records = 1
        /* Search record of user Ui in Derived Table (DT) to update the Derived Table accordingly */
        A6.3.1. Get record of user from Derived Table where Day = Current_Day for UserID = Ui
        A6.3.2. IF record is found THEN
            // Keep track of the value of Records (attribute) in DT
            Rec += 1 and modify value of attribute Records by Rec
        A6.3.3. ELSE
            Add row into DT for User = Ui, QueryType = QT, Day = Current_Day and Records = 1
A7. Repeat Step A3
A8. ELSE    // Prudence check point based on the case detected in Step A5
    B8.1 IF result = C1 THEN Prudence_Prompt “Crossed Hourly Limit”
    B8.2 ELSE IF result = C2 THEN
        B8.2.1 Prudence_Prompt “Crossed Maximum Hourly Limit”
        B8.2.2 Lock user’s credential UNTIL a new rule is added into RBP with the help of the decision expert (i.e., decision maker)
    B8.3 ELSE IF result = C3 THEN
        B8.3.1 Prudence_Prompt “Crossed the Day Limit”
    B8.4 ELSE IF result = C4 THEN
        B8.4.1 Prudence_Prompt “Crossed Maximum Daily Limit”
        B8.4.2 Lock user’s credential UNTIL a new rule is added into RBP with the help of the domain expert
A9. END
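The decision part of Method 1 can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the mapping of conditions to the codes C1 to C4 follows the prompt texts in Method 1, the order in which the limits are checked is an assumption, and the names `Limits`, `rbp` and `pcad_decision` are hypothetical. The example limits are the Monday INS row of Table 1.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    hour_limit: int
    max_hour_limit: int
    day_limit: int
    max_day_limit: int


# Prompt texts taken from steps B8.1 to B8.4 of Method 1.
PROMPTS = {
    "C1": "Crossed Hourly Limit",
    "C2": "Crossed Maximum Hourly Limit",
    "C3": "Crossed the Day Limit",
    "C4": "Crossed Maximum Daily Limit",
}
LOCKING_RESULTS = {"C2", "C4"}  # cases in which Method 1 locks the credential


def rbp(limits, cur_hr_rec, cur_dy_rec):
    """Default conclusion is NC (not compromised) unless a limit is crossed.

    Assumption: the maximum limits are checked first, so the most severe
    violation determines the result.
    """
    if cur_hr_rec > limits.max_hour_limit:
        return "C2"
    if cur_dy_rec > limits.max_day_limit:
        return "C4"
    if cur_hr_rec > limits.hour_limit:
        return "C1"
    if cur_dy_rec > limits.day_limit:
        return "C3"
    return "NC"


def pcad_decision(limits, cur_hr_rec, cur_dy_rec):
    """Return the RBP result, the prudence prompt (if any) and the lock flag."""
    result = rbp(limits, cur_hr_rec, cur_dy_rec)
    return {
        "result": result,
        "prompt": PROMPTS.get(result),
        "locked": result in LOCKING_RESULTS,
    }


monday_ins = Limits(hour_limit=38, max_hour_limit=41, day_limit=125, max_day_limit=155)
```

In this sketch the caller supplies the current hourly and daily counts; in Method 1 these come from the helping and derived tables.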

Figure 1. Analysis of insert operation (week days) of specific user

The dotted line indicates the highest limits for the user’s insert operations on the specified days of the week. When it is crossed (i.e., the bar lies between the dotted line and the dashed line), the prudence warning prompt is fired (see Step B8 in Method 1). If the dashed-line limit is crossed by user “A” during insert operations, the outlier or compromised user activity is reported in the form of a prudence alert and, simultaneously, the account is locked until it is verified by the concerned authority (i.e., Step B8.4 in Method 1). If the activity is allowed, a new rule is added into RBP (i.e., Method 2) with the help of the domain expert for future use.

Method 2: RDR Based Procedure (RBP)
Receiving parameters: RBP(Ui, Dy, HrLimit, DyLimit, Max_HrLimit, Max_DyLimit, QT)

C1. START
C2. Retrieve value of Records attribute from Helping Table (HT) where Day=Current_Day and Hour=Current_Hour for UserID=Ui and Operation=QT
    // Set value of variable Cur_Hr_Rec, which holds the value of Records in the current hour
    Set Cur_Hr_Rec = Records
C3. Retrieve value of the current day’s DayRecords attribute from Derived Table (DT) where Day=Current_Day and UserID=Ui and Operation=QT
    // Set value of variable Cur_Dy_Rec, which holds the aggregate day record
    Set Cur_Dy_Rec = DayRecords
C4. IF true THEN result=NC    /* default case: by default every activity is Not Compromised as per the RDR procedure, except */
C5. EXCEPT IF UserID=Ui & Day=Dy & Max_HourLimit < Cur_Hr_Rec THEN result=C1
C6. EXCEPT IF UserID=Ui & Day=Dy & HourLimit < Cur_Hr_Rec THEN result=C2
C7. EXCEPT IF UserID=Ui & Day=Dy & Max_DayLimit < Cur_Dy_Rec THEN result=C3
C8. EXCEPT IF UserID=Ui & Day=Dy & DyLimit