
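National Taiwan University of Science and Technology
Department of Computer Science and Information Engineering

Master's Thesis

A Surveillance Spyware Detection System Based on Data Mining Methods

Graduate student: 王子彥 (M9215017)

Advisor: Dr. 洪西進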

Year 94 of the Republic of China (2005), day 25



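Chinese Abstract

Faced with the aggressive rise of spyware, the major antivirus vendors have invested in research and development, and even Microsoft and Yahoo have begun to develop protection software. In academic research, however, only one paper on spyware [29] had been published as of 2004. This thesis therefore studies surveillance spyware, the more harmful kind of spyware, and uses detection techniques that differ from those of typical antivirus software, so that our system can detect not only existing spyware but also new, unknown spyware. The main contribution of this thesis is to use static and dynamic analysis to collect the features of spyware, and then to combine two data mining techniques, information gain and support vector machine, into a Spyware Detection System (SDS) together with an overall operating architecture. Our system achieves a detection rate of 98% for known surveillance spyware and 96% for new, unknown spyware. Moreover, under our architecture the system can automatically collect new spyware features and retrain the detection module, so that spyware can still be detected effectively, and its harm minimized, even as new variants keep appearing.

Keywords: surveillance spyware, information gain, support vector machine, data mining.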


Abstract

The spyware problem has become very serious. Well-known antivirus vendors such as Norton and Trend Micro entered the spyware-detection field last year, and even Microsoft and Yahoo have joined the battle against spyware. The research community, however, has made little effort to understand spyware: at present there is only one study [29] on the subject, published in 2004. In this thesis, we propose an integrated architecture to defend against surveillance spyware. To overcome the shortcomings of common anti-spyware products, we combine static analysis and dynamic analysis to extract the features of spyware. By adopting machine learning and data mining, we construct a spyware detection system (SDS) that achieves a 98% detection rate for known spyware and a 96% detection rate for unknown or novel spyware.

Keywords: surveillance spyware, information gain, support vector machine, data mining.


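Acknowledgement

For the successful completion of this thesis, I first thank my advisor, Professor 洪西進, for his teaching over the past two years. His rigorous scholarship and pursuit of excellence benefited me greatly as a student from outside the discipline, and his attitude toward people and life is a model for me.

I also thank the members of the oral defense committee, Professors 范國清, 唐永新, 胡俊之, and 蘇民揚, for taking time out of their busy schedules to offer valuable comments and guidance that made this thesis richer and more complete.

I further thank my lab partners 家得, 居呈, 智原, 立祥, 冠錡, 建良, 婉淑, and Shiva, as well as the junior students, for encouraging one another in our studies and for the cheerful atmosphere in which this research was completed.

Finally, I thank my parents, Mr. 王義芳 and Ms. 鄭貴卿, for raising and supporting me without complaint or regret. Thanks to their care and sacrifice, I was able to concentrate on my studies without worry and complete this thesis. My gratitude is beyond words. I dedicate this thesis to my family and hope that they are proud of me.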


Table of Contents

Chinese Abstract
Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables

Chapter 1. Introduction
  1.1 Background
  1.2 Contributions
  1.3 Synopsis

Chapter 2. Related Works
  2.1 Difference between Spyware and Virus
  2.2 Classes of Spyware
  2.3 Some Common Trojans
  2.4 Spyware Installation Methods
  2.5 Traditional Detection Methods

Chapter 3. Support Vector Machine & Information Gain
  3.1 Data Mining
  3.2 Information Gain
  3.3 Support Vector Machine

Chapter 4. Spyware Detection System (SDS)
  4.1 Conception of SDS
  4.2 Detect Module
  4.3 Data Mining Module

Chapter 5. Experiments & Results
  5.1 Experiment Data Set & Experiment Environment
  5.2 Experiment Method
  5.3 Notations & Evaluation Measures
  5.4 Experiment Results

Chapter 6. Conclusions & Future Works

References

Appendix
  1. Content of Experiment Data Set
  2. List of Selected Features


List of Figures

Figure 1. Some statistics of non-viral threats
Figure 2. Interface of NetBus
Figure 3. Spyware is bundled with video codec
Figure 4. Warning interface
Figure 5. Hyperplane classifier
Figure 6. Examples of Bad Decision Boundaries
Figure 7. Optimal hyperplane and margin
Figure 8. Training errors of SVM
Figure 9. Mapping function Φ to project into a feature space F
Figure 10. Architecture of SDS
Figure 11. Detecting module
Figure 12. Detection workflow
Figure 13. PE format & some analyze results
Figure 14. Network Monitor Interface
Figure 15. Report region & invoke remote detection region
Figure 16. Phase of information collect and extract
Figure 17. Workflow of data mining module
Figure 18. Interface of DLL & API Analyzer
Figure 19. Format of SVM vector
Figure 20. Interface of spyware detector
Figure 21. Relation between information gain & detecting rate
Figure 22. FP of static analysis
Figure 23. FP of dynamic analysis
Figure 24. FP of SDS
Figure 25. CV rate of different training sets
Figure 26. Comparing the performance of different detection systems


List of Tables

Table 1. Report of spyware scans from EarthLink Inc.
Table 2. Part of the modified system files
Table 3. Directory of the hosts file
Table 4. Content of the hosts file
Table 5. Ratio of executable size in our experimental data set
Table 6. Performance of static analysis
Table 7. Performance of dynamic analysis
Table 8. Performance of SDS
Table 9. Compare with other anti-spyware software


Chapter 1. Introduction

1.1 Background

Spyware is on the rise and poses a serious threat to every computer. Although less visible than spam and virus attacks, it is invasive, destructive, and potentially costly. According to a report from EarthLink Inc. [1], a major Internet service provider, scans of about 1.5 million computers between Jan. 1 and April 30, 2004 found approximately 41 million traces of spyware programs or components. That is an average of about 28 spyware instances per computer, and spyware is estimated to infect 90% of all Internet-connected computers. The report also found that nearly one in three computers was infected with a Trojan horse or system monitor planted by spyware. The detailed figures are listed in Table 1. The overwhelming number of spyware traces shows that spyware has become a problem as serious as viruses or spam.

But what is spyware? Although there is no precise definition, spyware is usually described as software that is placed on a user's machine and transmits information to a third party without the user's knowledge or permission. Spyware can do many things, such as downloading files, running programs in the background, and changing system settings. It also collects and transmits information such as keystrokes, web surfing habits, passwords, and email addresses, and even more sensitive data such as personal bank account numbers, credit card numbers, or a company's trade secrets, technical documents, and financial data. All of this can cause enormous damage if the collected information is used by someone with malicious intent.

Moreover, while spyware runs in the background, it consumes CPU, RAM, other system resources, and bandwidth as it tracks and transmits information. It also causes stability problems on most operating systems. Because spyware authors rarely care how sloppy their code is, an infected computer tends to slow down or crash easily. Spyware may also introduce new security vulnerabilities that malicious hackers can exploit to intrude and to launch other kinds of attacks, such as DDoS (Distributed Denial of Service).

Spyware differs from viruses in many ways; one difference is the financial motive behind spyware. Consider some statistics [2] [3]:

• The Federal Trade Commission released a survey in September 2003 showing that 27.3 million Americans had been victims of identity theft in the previous five years, including 9.9 million people in the last year alone. The one-year figure equates to approximately 4.6% of the U.S. adult population. According to the FTC survey, identity theft losses to businesses and financial institutions in 2002 totaled nearly $48 billion, and consumer victims reported $5 billion in out-of-pocket expenses. This is a growing problem, and spyware is clearly one of the latest and most dangerous threats to privacy in the digital age.

• Using data from an April 2004 survey of 5,000 U.S. adults who use the Internet and e-mail, Gartner Inc. estimated that nearly 2 million Americans had fallen victim to checking account fraud in the previous 12 months. The cost to banks and consumers was a staggering $2.4 billion in direct losses, or an average of $1,200 per victim.

Because the spyware problem is so serious, well-known antivirus vendors such as Kaspersky [4], McAfee [5], Trend Micro [6], and Symantec (Norton) [7] entered the spyware-detection field last year. Even Microsoft and Yahoo have joined the battle against spyware. We discuss spyware detection methods in more detail in Chapter 2.


1.2 Contributions

In this thesis, we propose an integrated architecture to defend against surveillance spyware. Common anti-spyware products have shortcomings; for example, most new malicious programs cannot be detected accurately when no signature exists for them. To remedy this, we extract the features of spyware by analyzing its behavior and then apply machine learning and data mining classification to detect both known and unknown spyware and to improve the detection rate of our spyware detection system (SDS) continuously. The SDS achieves a 98% detection rate for known spyware and a 96% detection rate for unknown or novel spyware. We therefore consider the SDS a suitable solution to the spyware problem.

1.3 Synopsis

This thesis is organized as follows. Chapter 2 gives a brief overview of the classes of spyware and of current anti-spyware methods. Chapter 3 introduces information gain and the support vector machine, the two core technologies of our system. Chapter 4 presents the architecture of the SDS in detail. Chapter 5 reports the experimental results of the proposed system, and Chapter 6 concludes the thesis.


Table 1. Report of spyware scans from EarthLink Inc.

Overall Results                        Month of March   Month of April   Total (Jan. 1 to April 30, 2004)
Scans                                         237,199          420,761        1,483,517
Spyware Instances Found                     7,086,770       11,305,471       40,846,089
Instances of spyware per scanned PC              29.9             26.9             27.5

Spyware Installations on Scanned PCs by Category
Adware                                      1,262,078        2,298,201        7,642,556
Adware Cookie                               5,750,392        8,873,555       32,700,340
System Monitor                                 35,915           60,873          245,432
Trojans                                        38,385           72,842          257,761
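The averages quoted in Section 1.1 can be checked directly against the figures in Table 1. The short sketch below does only that arithmetic; the variable and function names are illustrative, and the "one in three" check assumes, for a rough estimate only, at most one system monitor or Trojan finding per scanned PC.

```python
# Figures copied from Table 1 (EarthLink scans, Jan. 1 to April 30, 2004).
scans = {"March": 237_199, "April": 420_761, "Total": 1_483_517}
instances = {"March": 7_086_770, "April": 11_305_471, "Total": 40_846_089}
system_monitors_total = 245_432
trojans_total = 257_761


def per_pc(period: str) -> float:
    """Average number of spyware instances found per scanned PC."""
    return instances[period] / scans[period]


# Share of scans that turned up a system monitor or a Trojan, assuming
# (for this rough check only) at most one such finding per scanned PC.
surveillance_share = (system_monitors_total + trojans_total) / scans["Total"]

for period in scans:
    print(f"{period}: {per_pc(period):.1f} instances per scanned PC")
print(f"System monitor or Trojan: {surveillance_share:.1%} of scans")
```

The per-PC values reproduce the third row of the table, and the last line is consistent with the "nearly one in three" statement in the text.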


Chapter 2. Related Work

2.1 Difference between spyware and virus

Before we look at spyware in detail, we should make one important point: spyware and viruses are completely different threats. Spyware is designed to collect demographic and personal information, display pop-up advertisements, track shopping and surfing habits, or control your computer remotely. Viruses rarely have any purpose other than to annoy users or to carry out malicious instructions that crash the system.

Virus code is designed to propagate itself as quickly as possible. Although it may try to hide inside another application, the viral code is responsible for its own replication. Spyware may also hide inside other applications, but it is not designed to propagate itself. Instead, it relies on tricking the computer user into installing a legitimate application, or something that pretends to be innocent. This is the fundamental difference between spyware and viruses.

Another difference is the size of the code. Virus code is usually quite small and easy to detect once the viral code has been identified. Spyware is often quite large in comparison. Many spyware applications bring with them hundreds of files and additional traces, which makes it extremely difficult for spyware-detection software to clean everything off the system. Viruses usually exploit system vulnerabilities and crash the system, so a virus is much harder to create than spyware, which behaves like ordinary application software. Moreover, we found that it is not difficult to collect spyware from the Internet, whereas few viruses can be obtained there.

Because of the wide gap between viruses and spyware, virus-detection and spyware-detection applications must be different too. You will not typically find a single application performing both tasks. This is partly due to the complex nature of the applications and partly so that vendors can generate multiple revenue streams.

2.2. Classes of spyware

Although there is no official definition of spyware, the most common types can be listed as follows.

• Adware monitors the pages fetched by a user's web browser, or other material on the consumer's computer, and when it sees particular pages or terms it displays other pages containing advertisements paid for by the spyware's sponsors. In some cases, the adware rewrites the web pages displayed by the browser, substituting ads from the adware vendor for the ads originally in the page.

• Browser hijackers, so-called "Browser Helper Objects", install themselves as part of the Internet Explorer web browser and change the way it works. The changes can be as simple as switching to a different home page, or as complex as redirecting web searches to the spyware vendor's search system, modifying search results so that they link only to the spyware company's garbage, or adding toolbars or "click here" buttons that lead to sponsors' advertisements.

• Key loggers record every key pressed by the computer's user and send the stream of keystrokes back to the spyware's author. More generally, "Activity Monitors" can log and report on any type of computer usage, such as e-mail sent and received, web pages visited, and instant messages exchanged. The data can be used for anything from consumer preference statistics to identity theft.

• Trojan horses allow the spyware author or vendor to control the consumer's computer remotely for the author's purposes. At present, the most common purpose is probably to send spam. But if the spyware author obtains administrator rights on the victim's computer, he can do whatever he wants.

• Porn dialers are a special type of program used by unscrupulous pornography vendors to permanently change the dial-up settings on a computer so that the modem connects to a remote location, resulting in expensive long-distance charges and exposure to other spyware programs.

We can therefore roughly divide spyware into two categories, advertising spyware and surveillance spyware.

I. Advertising spyware is often installed alongside (or "bundled" with) other software, by clicking on banner advertisements on web pages, or sometimes automatically via ActiveX controls on the Internet, often without the user's knowledge. This is usually done without full disclosure that it will be used for gathering personal information and/or showing the user ads. Advertising spyware logs information about the user, possibly including passwords, email addresses, web browsing history, online buying habits, the computer's hardware and software configuration, and the name, age, sex, etc. of the user. Advertising spyware causes the bulk of the spyware-related problems users face today.

II. Surveillance spyware includes key loggers and Trojan horses that are used by hackers or by anyone who wants to gain covert access to your computer. Most trojans are not viruses, meaning that they cannot reproduce themselves; they rely on people being deceived into propagating them in the wild.

Since this thesis focuses on surveillance spyware, let us look at it in more detail. There are about five main types of trojans, and various subsets of these.

• The most common type of trojan is the remote administration type, which includes SubSeven, NetBus, Back Orifice, etc. This type of trojan gives the hacker more power over the victim's computer than the victim may originally have had. Its functions include the ability to steal all passwords, cached or not (this is done using key logging technology), to modify the victim's registry, to upload, download, and execute (run) files, and various other things such as turning on a web cam and spying on the victim.

• The second type is the file server trojan. These trojans create a file server, usually an FTP server, on the victim's computer, which allows a hacker to upload or download files; this is commonly used to upload a more powerful remote administration trojan. Because some of these file server trojans are small (some are just 8 KB), they are easily bound to other files without a significant change in size. They are most commonly found in games and funny programs that people send around the Internet to amuse each other, not realizing that they are infecting themselves and their friends with trojans.

• The third type is the password sending trojan. These trojans have one purpose: to steal passwords from the victim's computer and send them back to the hacker. The most common way for these trojans to communicate with the hacker is by email. It is disturbing to think that your computer may be secretly emailing all your passwords to a hacker.

• The fourth type is the key logger trojan. These trojans log everything the victim types and either send the information to the hacker by email or store it in a hidden file on the victim's computer, which the hacker then downloads using the client part of the trojan.

• The fifth type, the distributed denial of service (DDoS) trojan, is probably one of the most disturbing types developed recently. A hacker infects a large number of victims with a DDoS trojan; then, using the client part of the trojan, he either connects to all of them at once or sends his commands to a drone (a master server) that relays the commands to all the victims, which then attack a single website or personal computer. This type of trojan has recently been used to bring down big sites such as yahoo.com.

Although surveillance spyware is not as prevalent as advertising spyware [8] (shown in Figure 1), it is the most dangerous kind of spyware: according to the statistics cited in Chapter 1, the losses it causes can be far larger than those caused by advertising spyware. We therefore focus on surveillance spyware and propose an integrated solution for it in this thesis. In the following chapters, we use the term "spyware" to mean surveillance spyware.


Figure 1. Some statistics of non-viral threats

2.3. Some Common Trojans

Back Orifice (BO) was developed by a community of hackers known as the "Cult of the Dead Cow" (www.cultdeadcow.com). This Trojan can be downloaded from www.BO2K.com and numerous other websites. Back Orifice consists of two parts, a client application and a server application (approximately 122 KB). The client application, running on the hacker's computer, can be used to monitor and control the victim's computer (which runs the server application). The main characteristics of BO are the following:

i. BO can only be used on victim computers that run the Windows 95 or Windows 98 operating system.
ii. The server part of the program has to be installed on the victim computer. The victim is usually fooled into installing the server part by being sent the Trojan fused with another program (e.g. an electronic Diwali card fused with the Trojan program).
iii. The hacker needs to know the IP address of the victim computer.
iv. If the victim computer is behind a firewall, BO will not work.

NetBus was developed by a Swedish citizen named Carl-Fredrik Neikter, who claimed that he developed it "purely for fun". NetBus can be downloaded from hundreds of websites. NetBus consists of two parts, a client application and a server application (named "patch.exe", with a size of 470 KB). The client application, running on the hacker's computer, can be used to monitor and control the victim's computer (which runs the server application). The main characteristics of NetBus are the following:

i. Once it is installed on the victim computer, it runs every time the computer is started.
ii. NetBus can be used on victim computers that run the Windows 95, Windows 98, or Windows NT operating system.

Below is a snapshot of the client interface of NetBus:


Figure 2. Interface of NetBus

2.4. Spyware Installation Methods

First, here are some tricks that spyware often uses to deceive you into installing it.

• Bundling with popular or free programs such as P2P clients, freeware, and shareware. Some of the biggest culprits in spreading spyware are the popular peer-to-peer programs available today. Kazaa, Imesh, and Limewire all install multiple advertising spyware applications. Advertising and marketing firms approach struggling software developers who are hard up for money and offer to sponsor them financially if they agree to bundle these adware features with the programs they are developing. Figure 3 shows an example of spyware bundled with a legitimate video codec.

Figure 3. Spyware is bundled with video codec

• Deceptive functionality. Spyware often pretends to be something other than what it really is. For example, it may offer to synchronize your PC's clock or keep track of forms, while also doing other hidden things as you browse. If you see any strange-looking icons in your system tray, you should verify that they are something you really want there.

• The "Keep asking until you say Yes" approach. This is particularly common with drive-by downloads. Some spyware is delivered by an ActiveX control that tries to load each time you visit a web page where the spyware is present. As a security measure, the browser asks whether you want to install it (shown in Figure 4). If you say No, the answer only holds until the next web page you load, where you are asked the question again. After a few pages of this, some people give up and say Yes.

Figure 4. Warning interface

• Confusing or blanketing legalese. The license agreements don't just come out and say "we're going to collect information and screw up your browsing". Instead, the terms are written in vague, confusing prose or hidden deep inside the End User License Agreement (EULA). It is necessary to read the license carefully before you agree to it.

• A false pretense for needing the software. You receive an email message from a friend with an invitation to browse a website or to read an attachment. When you reach the website, it asks you to install a "greeting card viewer", or to download and run an attachment that turns out to be spyware.

• Looking essential, or being invisible. Some spyware uses an official-looking icon, such as a folder icon or a text file icon, or a fake file extension to hide the fact that it is a malicious executable, so that you run it carelessly. Some spyware also uses an official-sounding name in the task list or startup list so that you hesitate to disable it even if you see it running. To further mask its existence and reduce your awareness of it, many spyware packages even install software updates without your knowledge.

How can we tell whether we have spyware? Not all symptoms are easy to diagnose, but the symptoms that are easy to recognize include:

• Your computer slows down to a crawl.
• Porn sites pop up in your browser while you are surfing the net.
• Your computer mysteriously dials phone numbers in the middle of the night, normally expensive porn chat lines, leaving you with a huge bill.
• When you enter a search into your search bar, a new and unfamiliar site handles the search.
• You notice a new toolbar in your browser that you did not want, and you find it difficult to get rid of.
• New sites are added to your favorites list without you adding them.
• Your homepage has been hijacked, and even though you remove the new site it keeps coming back.
• You get pop-up adverts that address you by name, even when your computer is not connected to the Internet.
• A strange and unknown Windows message box appears on your screen, asking you personal questions.
• Your Windows settings change by themselves: for example, the screensaver text, date/time, or sound volume changes, or the CD-ROM drawer opens and closes.
• Your I/O devices behave strangely. For example, the mouse buttons are swapped or the pointer moves out of control, the keyboard is locked, or the printer begins printing by itself.
• Programs appear that you do not remember installing, and they run automatically.
• The computer shuts down or logs off by itself.
• Your network is unusually busy downloading and uploading files when no known task is being run by the user.

Please note that most advanced attackers will just spy on you and use your infected machine for some specific purpose, and will not perform any of the above "tricks", so as not to cause any suspicious activity on the target system (as this would probably mean they could be detected easily). Someone who just wants to have fun with you is more likely to perform these actions.

2.5. Traditional Detection Methods

1. Filename matching

The simplest form of spyware detection is filename matching. As the name suggests, this method scans the drive for specific filenames of known spyware. This form of detection works, but it has a considerable flaw. To subvert the detection software, spyware companies either change the filenames or employ a random naming strategy, so that the detection software is unable to recognize the spyware.

2. File properties

Another method of spyware detection compares the properties of a file with those of known spyware. Once the detection software matches a filename, file properties such as the size, publisher, and version are compared with the known values in the spyware definition database. Combining filename and file property matching makes the detection software more robust. However, spyware authors can get around this form of detection easily by renaming the files, changing the publisher, slightly modifying the file size, or updating the revision.

3. File signatures

Filename and file property detection methods basically look at the wrapper around the program code. File signatures let the detection software look inside the spyware: when searching for spyware, the software looks inside files for certain signatures, or patterns. This catches spyware more precisely. Although spyware authors can easily change a filename or file properties, modifying the program itself is a much more involved process. File-signature detection is still a reliable method, and many popular detection applications use it.

4. Heuristics

Heuristic detection is similar to file-signature scanning, except that the detection software searches for certain instructions or commands that are not part of normal applications, such as a command to delete everything on the hard drive. Heuristic methods are generally used to detect malware and other malicious types of applications. Unlike the signature-based method, which searches only for known viruses, anti-virus software uses heuristics to analyze code sequences in an effort to detect unknown viruses. This does not always work, but the attempt occasionally thwarts a virus outbreak before it becomes an epidemic. The more effective spyware-detection applications usually combine heuristics with other methods, such as file signatures and filename matching.

5. Registry scanning

Like all applications, spyware modifies the system registry during installation. Over time, these values can clutter the registry and slow down the computer. The registry may also become corrupted. Virtually all detection applications scan the system registry for traces of spyware by matching the values of known spyware applications with those in the application's definition database.

Antivirus vendors use many other methods to detect viruses, such as checksums, rule-based detection, virtual machines, and real-time I/O scanning, and a considerable amount of academic research [9] [10] [11] [12] [13] [14] [15] [16] [17] is devoted to counteracting viruses. But few efforts address the problem of spyware. Moreover, spyware authors constantly change their applications to avoid detection. In fact, many spyware authors use spyware-detection software to determine whether their changes will be caught. They work with the various detection packages and tweak the code until their application is no longer found.

Furthermore, according to a report [18] from April 2005, "We analyzed all the viruses we received during the past six months, and found that 70 percent contained some sort of spyware module or component. Writers have definitely moved from creating simple viruses to sophisticated 'machines' designed to hijack computers and the information on them," said Shimon Gruper, the vice president of technologies in Aladdin's eSafe unit. This tells us that traditional virus-detection methods will be nearly useless against sophisticated spyware, and that the combination of virus and spyware will be a brand new threat to information security in the future.

In this thesis, we propose an integrated system named SDS (Spyware Detection System), which combines data mining and machine learning with static analysis and dynamic analysis. The SDS is a solution that helps us defend against the threat of spyware. The details of the SDS are presented in the following chapters.
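The first three methods listed in Section 2.5 (filename matching, file properties, file signatures) can be illustrated with a minimal sketch. It is not the scanner of any real product: the `Definition` and `Sample` classes, the file names, and the byte pattern are all made up for illustration, and only the file size stands in for the file properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    """One entry of a hypothetical spyware definition database."""
    name: str
    filename: str      # known filename, compared case-insensitively
    size: int          # known file size in bytes
    signature: bytes   # byte pattern expected somewhere inside the file


@dataclass(frozen=True)
class Sample:
    """A file reduced to the fields the three checks need."""
    filename: str
    content: bytes


def match_filename(sample: Sample, definition: Definition) -> bool:
    return sample.filename.lower() == definition.filename.lower()


def match_properties(sample: Sample, definition: Definition) -> bool:
    # Filename plus one file property (size); publisher and version
    # would be compared in the same way.
    return match_filename(sample, definition) and len(sample.content) == definition.size


def match_signature(sample: Sample, definition: Definition) -> bool:
    # Looks inside the file instead of at its wrapper.
    return definition.signature in sample.content


# Made-up payload and definition, for illustration only.
PAYLOAD = b"\x90\x90HOOK_KEYBOARD\x00send_log"
DEFINITION = Definition("DemoLogger", "demologger.exe", len(PAYLOAD), b"HOOK_KEYBOARD")

original = Sample("demologger.exe", PAYLOAD)
renamed = Sample("update.exe", PAYLOAD)                # filename changed
padded = Sample("demologger.exe", PAYLOAD + b"\x00")   # size changed by one byte

for label, sample in [("original", original), ("renamed", renamed), ("padded", padded)]:
    print(label,
          match_filename(sample, DEFINITION),
          match_properties(sample, DEFINITION),
          match_signature(sample, DEFINITION))
```

In this toy setting, renaming the file defeats the filename check and padding it defeats the property check, while the signature check still matches both variants, which mirrors the relative robustness described in the text.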


Chapter 3. Information Gain and Support Vector Machine

3.1 Data Mining

Data mining is the process of extracting hidden and potentially useful information from a large quantity of data. The algorithms developed for data mining can be divided into two classes. One class is based on statistics and recognizes distributions by counting the occurrences of different kinds of data. The other class is based on clustering, classification, and similarity. Today data mining is used in many fields, such as business management, marketing, finance, and biotechnology. In this thesis, we design a framework that combines two data mining algorithms, information gain and support vector machine, to train a classifier on a set of spyware and benign executables so that it can detect known and unknown spyware.

3.2 Information Gain

J.R. Quinlan [19] proposed a classification algorithm called ID3, which introduces the concept of information gain: "The information conveyed by a message depends on its probability and can be measured in bits".

• Information: Given a training set S with k classes,

\[ \mathrm{Info}(S) = -\sum_{i=1}^{k} p_i \log_2 (p_i), \]

where $p_i$ is the proportion of examples in S that belong to class i.

Info(S) is clearly the entropy of S. High entropy means that S is drawn from a nearly uniform distribution, and low entropy means that S is drawn from a varied (peaked) distribution. In our case there are two classes, so Info(S) equals

\[ \mathrm{Info}(S) = -\left( \frac{|spyware|}{|S|}\log_2\frac{|spyware|}{|S|} + \frac{|benign|}{|S|}\log_2\frac{|benign|}{|S|} \right), \quad \text{where } |S| = |spyware| + |benign|. \]
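The two-class form of Info(S) is easy to check numerically. The following is a minimal sketch, not part of the original system; the function name `info_two_class` is introduced here for illustration, and the counts 407 and 740 are the data set sizes reported in Chapter 5.

```python
import math

def info_two_class(n_spyware, n_benign):
    """Entropy (in bits) of a set containing spyware and benign samples."""
    total = n_spyware + n_benign
    value = 0.0
    for count in (n_spyware, n_benign):
        if count > 0:  # an empty class contributes nothing to the entropy
            p = count / total
            value -= p * math.log2(p)
    return value

# A balanced set carries exactly 1 bit; an unbalanced set carries less.
balanced = info_two_class(500, 500)
thesis_data_set = info_two_class(407, 740)
```

Because the data set is not perfectly balanced, `thesis_data_set` comes out somewhat below 1 bit.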


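• Information Gain: The information gain of an attribute A is the expected reduction in entropy caused by partitioning on this attribute.

\[ \mathrm{Info}_A(S) = \sum_{i \in Values(A)} \frac{|S_i|}{|S|} \times \mathrm{Info}(S_i), \]

where $S_i$ is the subset of S for which attribute A has value i. In our case each attribute is either present or absent, so

\[
\mathrm{Info}_A(S) = -\left[ \frac{s^{+}+s^{-}}{|S|}\left( \frac{s^{+}}{s^{+}+s^{-}}\log_2\frac{s^{+}}{s^{+}+s^{-}} + \frac{s^{-}}{s^{+}+s^{-}}\log_2\frac{s^{-}}{s^{+}+s^{-}} \right) + \frac{\bar{s}^{+}+\bar{s}^{-}}{|S|}\left( \frac{\bar{s}^{+}}{\bar{s}^{+}+\bar{s}^{-}}\log_2\frac{\bar{s}^{+}}{\bar{s}^{+}+\bar{s}^{-}} + \frac{\bar{s}^{-}}{\bar{s}^{+}+\bar{s}^{-}}\log_2\frac{\bar{s}^{-}}{\bar{s}^{+}+\bar{s}^{-}} \right) \right],
\]

where $s^{+}$ is the number of spyware samples that have attribute A, $s^{-}$ is the number of benign samples that have attribute A, and $\bar{s}^{+}$ and $\bar{s}^{-}$ are the corresponding counts for the samples that do not have attribute A.

\[ \mathrm{Gain}(S, A) = \mathrm{Info}(S) - \mathrm{Info}_A(S). \]

By the definition of information gain, we should choose the attributes A that give the largest information gain.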
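The definition of Gain(S, A) can be illustrated with a short sketch that partitions a set of records on one attribute. This is an assumption-laden example rather than the thesis implementation: the record layout (dictionaries with a "class" key) and the attribute name `imports_hook` are hypothetical.

```python
import math
from collections import Counter

def info(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(records, attribute):
    """Gain(S, A) = Info(S) - sum over values i of |S_i|/|S| * Info(S_i)."""
    labels = [r["class"] for r in records]
    partitions = {}
    for r in records:
        # Group the class labels by the value the attribute takes.
        partitions.setdefault(r[attribute], []).append(r["class"])
    conditional = sum(len(p) / len(records) * info(p)
                      for p in partitions.values())
    return info(labels) - conditional

# Hypothetical records: the attribute separates the two classes perfectly.
records = [
    {"class": "spyware", "imports_hook": True},
    {"class": "spyware", "imports_hook": True},
    {"class": "benign", "imports_hook": False},
    {"class": "benign", "imports_hook": False},
]
gain = info_gain(records, "imports_hook")
```

In this toy case the partition leaves no uncertainty, so the gain equals the full entropy of the set, which is 1 bit.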

3.3 Support Vector Machine

The Support Vector Machine (SVM) is a machine learning method that is used extensively in pattern classification and regression. Because SVM can reduce the training error (the empirical risk) and the testing error (the risk) at the same time, it has become one of the most popular algorithms in machine learning. SVM was introduced by Vapnik [20]; we describe the classification model below [21]. For a typical classification problem, we define the following notation:

• $x_i$: a vector describing the attributes of a record, $x_i \in R^N$, $i = 1, \ldots, l$
• $y_i$: the class to which $x_i$ belongs, $y_i \in \{\pm 1\}$, $i = 1, \ldots, l$
• S: the data set of $x_i$
• L: the data set of $y_i$
• f: the decision function $f(x_i) = \mathrm{sign}(w \cdot x_i + b)$, where $f: R^N \to \{\pm 1\}$

Using the decision function, we can determine which class (+1 or -1) a given data point $x_j$ belongs to.


For a linear classifier, the main goal is to use the training data to find a hyperplane that separates the data into the classes they belong to (see Figure 5). Hence, we need to find the vector w and the coefficient b of the hyperplane $w \cdot x + b = 0$ that separates the training data. A new test point $x_j$ can then be classified by evaluating $f(x_j) = (w \cdot x_j) + b$: if $f(x_j) > 0$, then $x_j$ belongs to class +1; if $f(x_j) < 0$, then $x_j$ belongs to class -1.
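The classification rule for a linear classifier amounts to one inner product and a sign test. The sketch below assumes w and b are already known; the function names and the numeric values are illustrative only.

```python
def decision_value(w, b, x):
    """Return (w . x) + b for one sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 if the decision value is positive, otherwise -1.

    A point lying exactly on the hyperplane (value 0) is assigned to -1
    here; that tie-breaking choice is an assumption of this sketch.
    """
    return 1 if decision_value(w, b, x) > 0 else -1

# Illustrative hyperplane, not taken from the thesis.
w = [1.0, -1.0]
b = -0.5
label_a = classify(w, b, [2.0, 0.0])   # decision value 1.5
label_b = classify(w, b, [0.0, 2.0])   # decision value -2.5
```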


Figure 5. Hyperplane classifier

Figure 6. Examples of Bad Decision Boundaries

In addition to separating the two classes, if the hyperplane has the largest possible margin between them (see Figure 7), the test error (risk) can be reduced. Because more than one hyperplane may separate the two classes, Vapnik and Chervonenkis used the idea of margin to define the Optimal Hyperplane (OH): the hyperplane that separates the two classes with the largest margin. After scaling w and b so that the training points closest to the hyperplane satisfy $|(w \cdot x_i) + b| = 1$, every correctly classified point satisfies $y_i \cdot ((w \cdot x_i) + b) \ge 1$, $i = 1, \ldots, l$. For the optimal hyperplane, the width of the margin is then $2 / \|w\|$ (see Figure 7). Hence, to obtain the optimal hyperplane with the widest margin, we must satisfy the constraints $y_i \cdot ((w \cdot x_i) + b) \ge 1$, $i = 1, \ldots, l$, while minimizing

\[ \tau(w) = \frac{1}{2}\|w\|^2. \]

Figure 7. Optimal hyperplane and margin
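The margin width $2 / \|w\|$ follows from a short calculation. As a sketch, let $x_{+}$ and $x_{-}$ (symbols introduced here for illustration) be points lying on the two margin hyperplanes; subtracting their defining equations and projecting onto the unit normal $w / \|w\|$ gives the distance between the hyperplanes:

```latex
\begin{align*}
w \cdot x_{+} + b &= +1, \qquad w \cdot x_{-} + b = -1 \\
w \cdot (x_{+} - x_{-}) &= 2 \\
\text{margin} &= \frac{w}{\|w\|} \cdot (x_{+} - x_{-}) = \frac{2}{\|w\|}
\end{align*}
```

Maximizing $2 / \|w\|$ is therefore equivalent to minimizing $\frac{1}{2}\|w\|^2$, which is the form used for $\tau(w)$.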

Finding the optimal hyperplane is a quadratic programming problem, which we can solve using Lagrangian theory and Lagrange multipliers:

Lagrangian:

\[ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i \cdot ((w \cdot x_i) + b) - 1 \right) \]

Lagrange multipliers: $\alpha_i \ge 0$

Setting the partial derivatives of L with respect to b and w to zero gives

\[ \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad w - \sum_{i=1}^{l} \alpha_i y_i x_i = 0, \]

from which we get

\[ w = \sum_{i=1}^{l} \alpha_i y_i x_i, \quad \alpha_i \ge 0. \]

Substituting w into the Lagrangian gives a dual optimization problem. It is a maximization problem, and it defines the SVM used to solve linearly separable problems:

\[ \text{maximize } L_D(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \]

\[ \text{subject to } \alpha_i \ge 0, \ i = 1, \ldots, l, \ \text{and} \ \sum_{i=1}^{l} \alpha_i y_i = 0, \]

and the decision function is

\[ f(x) = (w \cdot x) + b = \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot x) + b. \]

For each test point x, the following function predicts which class it belongs to:

\[ \mathrm{sign}\left( \sum_{i=1}^{l} \alpha_i y_i k(x_i, x) + b \right), \]

where $k(x_i, x) = (x_i \cdot x)$ in the linear case.
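The relation between the multipliers and the decision function can be checked on a toy problem. The sketch below uses two hand-picked training points for which the optimal multipliers can be worked out on paper (both equal 0.25, with b = 0); the points, the function names, and the absence of a QP solver are all simplifications made for illustration.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def weight_vector(alphas, labels, points):
    """Recover w = sum_i alpha_i * y_i * x_i."""
    w = [0.0] * len(points[0])
    for a, y, x in zip(alphas, labels, points):
        for d in range(len(w)):
            w[d] += a * y * x[d]
    return w

def dual_objective(alphas, labels, points):
    """L_D = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(points)
    quad = sum(alphas[i] * alphas[j] * labels[i] * labels[j]
               * dot(points[i], points[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad

def predict(alphas, labels, points, b, x):
    """sign(sum_i alpha_i y_i (x_i . x) + b), with 0 mapped to -1."""
    value = sum(a * y * dot(xi, x)
                for a, y, xi in zip(alphas, labels, points)) + b
    return 1 if value > 0 else -1

# Toy training set: one point per class, symmetric about the origin.
points = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.25, 0.25]   # satisfies sum_i alpha_i * y_i = 0
b = 0.0

w = weight_vector(alphas, labels, points)
```

For this toy set `w` is `[0.5, 0.5]`, both training points satisfy the constraint with equality, and the dual objective equals $\frac{1}{2}\|w\|^2$, as expected at the optimum.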

However, in most cases the input data of a classification algorithm are not linearly separable, so we must allow some training errors (see Figure 8).

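Figure 8. Training errors of SVM

To extend the separable case, we introduce slack variables $\xi_i$, $i = 1, \ldots, l$, and relax the inequalities $x_i \cdot w + b \ge +1$ and $x_i \cdot w + b \le -1$ to

\[ x_i \cdot w + b \ge +1 - \xi_i \quad \forall y_i \in \{+1\}, \]

\[ x_i \cdot w + b \le -1 + \xi_i \quad \forall y_i \in \{-1\}, \qquad \xi_i \ge 0 \ \forall i. \]

Combining the two inequalities above, we obtain

\[ y_i (x_i \cdot w + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \forall i. \]

To handle non-separable data, we take the training error into account, and the optimization problem becomes

\[ \text{minimize } \frac{1}{2}\|w\|_2^2 + \frac{C}{2} \sum_{i=1}^{l} \xi_i^2 \]

\[ \text{subject to } y_i (x_i \cdot w + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l. \]

The dual problem is as follows:

\[ \text{maximize } L_D(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \left( (x_i \cdot x_j) + \frac{1}{C}\delta_{ij} \right) \]

\[ \text{subject to } \sum_{i=1}^{l} y_i \alpha_i = 0, \quad \alpha_i \ge 0, \ i = 1, \ldots, l, \]

where $\delta_{ij}$ is the Kronecker delta.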
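Compared with the separable case, the only change in this dual problem is the term $\frac{1}{C}\delta_{ij}$, which adds 1/C to each diagonal entry of the matrix of inner products. A minimal sketch of that matrix follows; the function name `soft_margin_gram` and the sample points are illustrative assumptions.

```python
def soft_margin_gram(points, C):
    """Return the matrix with entries (x_i . x_j) + delta_ij / C."""
    n = len(points)
    gram = []
    for i in range(n):
        row = []
        for j in range(n):
            inner = sum(a * b for a, b in zip(points[i], points[j]))
            # delta_ij is 1 on the diagonal and 0 elsewhere.
            row.append(inner + (1.0 / C if i == j else 0.0))
        gram.append(row)
    return gram

matrix = soft_margin_gram([[1.0, 0.0], [0.0, 1.0]], C=2.0)
```

A large C makes the added diagonal term small, so the problem approaches the separable formulation; a small C tolerates more training error.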

Figure 9. Mapping function Φ to project into a feature space F

Another way to solve the non-separable problem is to use a non-linear mapping function Φ to project the input vectors into a feature space F, $\Phi: R^N \to F$ (Figure 9), and then to apply linear classification in F. In the dual optimization problem, to avoid calculating the inner product $(\Phi(x) \cdot \Phi(y))$ directly, we use an implicit kernel function $k(x, y) = (\Phi(x) \cdot \Phi(y))$ instead. Because the kernel function computes the inner product of the data in feature space, we do not have to project the data into feature space explicitly. Four kernel functions are in common use:

• Linear: $K(x_i, x_j) = x_i^T x_j$.
• Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$.
• Radial basis function (RBF): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$.
• Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$.

Here γ, r, and d are kernel parameters.

In this thesis, we use the Support Vector Machine (SVM) as the core of our detecting module to judge whether an executable is spyware. Although SVM has been applied to other security issues [22] [23], to our knowledge we are the first to apply it to spyware detection. Using SVM brings several advantages. First, the features of spyware change easily and constantly at the author's will. Because signatures cannot be updated fast enough, traditional signature-based detection is useless when new spyware appears, and during the update period systems protected by signature-based algorithms are vulnerable to attack. SVM can handle large and changeable data efficiently, so it is a suitable algorithm for detecting spyware. Second, compared with other traditional classifiers such as neural networks, which focus on empirical risk minimization, SVM reduces the training error and the testing error at the same time (structural risk minimization), so it is more efficient and robust. It also consumes less time. Speed is one of the most important factors in building a real-time spyware detection system, and SVM has a clear speed advantage over neural networks: a neural network must train a large number of weights when the training data is huge, whereas SVM is hardly affected by the size of the training data set.
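The four kernel functions listed in this chapter can be written directly from their formulas. The following sketch uses plain Python lists as vectors; the default parameter values are arbitrary choices made for illustration, not values used in the thesis.

```python
import math

def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(xi, xj) + r)

u, v = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(u, v)                      # 1*3 + 2*4
k_poly = polynomial_kernel(u, v, gamma=1.0, r=1.0, d=2)
k_rbf_same = rbf_kernel(u, u)                    # identical points
```

Any of these functions can stand in for $k(x_i, x)$ in the decision function, which is what allows the same dual formulation to produce non-linear decision boundaries.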


Chapter4. Spyware Detection System (SDS)

4.1 Conceptions of SDS

Our system architecture is composed of two main components: the detecting module and the data mining module. The detecting module consists of a local detecting part and a remote detecting part. The local detection system gives the client a simple, real-time way to identify whether an incoming executable is benign. The remote detection system holds the latest spyware-detection module and protects any client that invokes it via the Internet. It also collects information about each executable that the client-side detecting module marks as spyware and logs it into the database. The data mining module is a server-side system. It extracts useful features from the data in the log database to train a new SVM classifier. If the new SVM classifier performs better than the old one, we record its features and SVM parameters in the knowledge base. This feedback procedure continuously upgrades our ability to detect spyware. Figure 10 shows the concept of our architecture, and we describe the system in more detail below.


Figure 10. Architecture of SDS (labelled components: Client, Detecting Module, Remote Detecting System, Logs Database, Mining System, Training System, Mining Module, Knowledge Base, Experts; the diagram spans the Intranet and the Internet)

4.2 Detecting Module

This module consists of the local detecting and remote detecting parts. Figure 11 shows the concept of this module, and Figure 12 shows the workflow of the local detecting component.

Figure 11. Detecting module


Figure 12. Detection workflow

Traditionally, there are two main approaches to the detection of malicious code: static analysis and dynamic analysis. Static analysis examines the code of a program to determine properties of its dynamic execution without running it. Dynamic analysis monitors the execution of a program to detect malicious behavior. The two approaches are therefore complementary, so we adopt both and merge them to obtain the best detection result. First, we collect information about the incoming executable in three categories: its behaviors, its modifications to system files, and its network activity. Unlike static analysis methods that trace the content of the code, the code structure, or the control flow, we identify the behaviors of spyware by analyzing which Dynamically Linked Libraries (DLLs) and Win32 API functions are listed in the import section (.idata section) of the PE format [25][39], which is much quicker and more useful. The major structure of the PE format is illustrated in Figure 13. With this static analysis, we can infer the behaviors of the executable before it is executed. After running the executable, we record all the modifications that it makes to the system. The purpose of this step is to recognize the effects of the executable's behaviors by monitoring modifications to system files and registry entries (some of the monitored entries are listed in Table 2, and all of them are listed in the appendix), which compensates for the information missing from the static analysis step. Since spyware, unlike a virus, is not immediately destructive to the user's computer, we can remove it according to the logs of this step whenever the SVM reports that the executable is indeed spyware.
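One way to turn the import information gathered by static analysis into classifier input is a binary vector with one position per selected feature. The sketch below assumes the DLL and API names have already been read from the import section; the feature list is a hypothetical example built from well-known Win32 names, not the feature set selected in this thesis.

```python
# Hypothetical vocabulary of selected features (DLL and Win32 API names).
FEATURES = [
    "wsock32.dll",
    "wininet.dll",
    "SetWindowsHookExA",
    "GetAsyncKeyState",
    "RegSetValueExA",
]

def feature_vector(imported_names, features=FEATURES):
    """Return a 0/1 vector marking which selected features are imported.

    Names are compared case-insensitively, which is a simplification
    made in this sketch.
    """
    present = {name.lower() for name in imported_names}
    return [1 if f.lower() in present else 0 for f in features]

# Names that might be found in one executable's import section.
vector = feature_vector(["WSOCK32.dll", "GetAsyncKeyState", "CreateFileA"])
```

Imports that are not in the vocabulary, such as `CreateFileA` here, are simply ignored, so every executable maps to a vector of the same length.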

Figure 13. PE format and some analysis results

In addition, network activity is characteristic of spyware, which may need to transmit the victim's private information, redirect the user's connection to a malicious website, or grant full access to the victim's computer to a person with malicious intent. Hence, if we observe abnormal network services, it is likely that spyware is working stealthily. Figure 14 shows the network monitor interface of our system.


Table 2. Part of the modified system entries

Modified Registry Entries
    HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Run
    HKEY_CURRENT_USER\Software\Microsoft\WindowsNT\CurrentVersion\Windows "load"
    HKEY_CLASSES_ROOT\txtfile\shell\open\command
    HKEY_LOCAL_MACHINE\SYSTEM\ControlSet\Control\SessionManager\KnownDLLs

Modified System Folder Entries
    C:\Windows\
    C:\Windows\system32\
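Recording the modifications made by an executable can be thought of as comparing a snapshot of the monitored entries taken before execution with one taken afterwards. The sketch below works on plain dictionaries and does not read the real registry; the snapshot contents and the file names in them are made up for illustration.

```python
def diff_snapshots(before, after):
    """Compare two {entry: value} snapshots and report what changed."""
    added = {k: after[k] for k in after if k not in before}
    removed = {k: before[k] for k in before if k not in after}
    changed = {k: (before[k], after[k])
               for k in before if k in after and before[k] != after[k]}
    return {"added": added, "removed": removed, "changed": changed}

RUN_KEY = r"HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Run"
TXT_KEY = r"HKEY_CLASSES_ROOT\txtfile\shell\open\command"

# Hypothetical snapshots taken before and after running a sample.
before = {TXT_KEY: "notepad.exe %1"}
after = {TXT_KEY: "sample.exe %1", RUN_KEY: "sample.exe"}

report = diff_snapshots(before, after)
```

The resulting report can serve both as dynamic-analysis features and as the log used to undo the changes if the executable is later judged to be spyware.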

Figure 14. Network Monitor Interface

With all the signatures we gather, we can generate the corresponding feature vector and use the SVM decision function to detect spyware. In addition, a basic but effective way of blocking spyware-infected servers is to add their DNS names to the client's hosts file, which is located in the directory shown in Table 3, and to remap them to a warning web page or a local address, which renders the spyware useless. Although the list of spyware-infected servers' addresses must be maintained continually, it is the most direct way to avoid the traps of malicious websites. The hosts file can be downloaded from a website [24], and we should update it whenever a new spyware-infected server appears.

Table 3. Directory of the hosts file

Win XP       C:\WINDOWS\system32\drivers\etc\
Win 2K       C:\WINNT\system32\drivers\etc\
Win 98/ME    C:\WINDOWS\

Table 4. Partial content of the hosts file

Remap        Spyware-infected server    Annotation
127.0.0.1    abcsearch.com              #[IE-SpyAd]
127.0.0.1    admin.abcsearch.com
127.0.0.1    www3.abcsearch.com         #[Browseraid]
127.0.0.1    www.abcsearch.com
127.0.0.1    absoluagency.com           #[Trojan.StartPage.H]
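Entries in the form shown in Table 4 are simple to generate from a list of server names. The following sketch only builds the text lines and does not write to the real hosts file; the function name `hosts_lines` and the default remap address are assumptions made for illustration.

```python
def hosts_lines(servers, remap_address="127.0.0.1"):
    """Build hosts-file lines that remap each server to remap_address.

    servers is a list of (dns_name, annotation) pairs; the annotation
    may be None when there is nothing to record.
    """
    lines = []
    for name, annotation in servers:
        line = f"{remap_address} {name}"
        if annotation:
            line += f" #[{annotation}]"
        lines.append(line)
    return lines

entries = hosts_lines([
    ("abcsearch.com", "IE-SpyAd"),
    ("admin.abcsearch.com", None),
    ("absoluagency.com", "Trojan.StartPage.H"),
])
```

Passing a different `remap_address` would point the blocked names at a warning page instead of the local machine.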

Our detecting module has two particular capabilities: information reporting and remote detection. To improve our detection ability, we need to extract more characteristic features of spyware by continuously studying it. Therefore, each client reports information about an executable, including its assembly code, the three categories of features mentioned above, its feature vector, and some other information, if the executable is judged to be spyware or is judged to be benign but its feature vector falls within the margin on the positive side. We call this area the "report region", which is illustrated in Figure 15. SVM is a sensitive classifier that may make a wrong decision when a point is close to the hyperplane. The reporting mechanism helps reduce the false negative and false positive rates as we learn more about such executables. Through an XML Web Service [26], the user can request a remote decision by pressing a button on the client-side detection interface, which invokes the SVM implemented on the remote server. With the continuous upgrading policy described above, the server-side SVM should be able to detect spyware more precisely. This mechanism solves the problem that people often forget to update the virus definitions of their anti-virus software, and it provides up-to-date protection at all times.
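The reporting rule can be expressed as a test on the SVM decision value. The sketch below rests on two assumptions that the text does not fix: spyware is taken to be class +1, and "within the margin" is read as a decision value whose magnitude is below 1. The function name `should_report` is likewise illustrative.

```python
def should_report(decision_value, margin=1.0):
    """Decide whether a client should report an executable.

    Assumes spyware is class +1. An executable is reported when it is
    judged to be spyware, or when it is judged benign but lies inside
    the margin, where the classifier is least reliable.
    """
    judged_spyware = decision_value > 0
    inside_margin = abs(decision_value) < margin
    return judged_spyware or inside_margin

clearly_spyware = should_report(2.3)    # reported
borderline_benign = should_report(-0.4) # reported: close to the hyperplane
clearly_benign = should_report(-1.7)    # not reported
```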

Figure 15. Report region & invoke remote detection region

Based on XML Web Service technology, which works via SOAP (Simple Object Access Protocol), we implement the information reporting and remote detection capabilities as remote functions on the server side that local users can invoke directly through the Internet (Figure 10). To address the security of the remote server and the integrity of the transmitted data, we authenticate the user's ID and password and encrypt the transmitted data using SSL (Secure Sockets Layer), which protects both access to the server-side functions and the integrity of the reported data.


4.3 Data Mining Module

As with all security issues, no single method can solve a specific problem effectively. Although we have seen the benefits of SVM described in Chapter 3 and chose it as the core of our detecting module, it is still important to upgrade our SVM efficiently so that the system achieves a high detection rate. SVM, however, is a machine learning mechanism that must be trained in advance, and it is not flexible in producing the decision hyperplane. Feature selection is the main problem in training an SVM: we cannot expect good training results from an arbitrary selection of features. To overcome this problem, we construct the data mining module to extract the features of spyware. In the preprocessing phase, the remote detection system collects all the relevant reports from clients, parses them, and saves them into the log database. In the feature extraction phase, we use Quinlan's concept of information gain to extract the features that have sufficient volume (illustrated in Figure 16). The threshold of volume and the meaning of the features are decided by experts. After that, we store all the meaningful features that will be used in the retraining phase in the knowledge base. Finally, the retraining phase gives us a new SVM with better spyware detection ability. Figure 17 shows the entire workflow of our data mining module.
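The feature extraction phase can be sketched as ranking candidate features by information gain and keeping those at or above an expert-chosen threshold. This sketch assumes that "sufficient volume" refers to the information gain value; the feature names, counts, and threshold below are invented for illustration.

```python
import math

def entropy(n_spyware, n_benign):
    """Two-class entropy in bits; empty classes contribute nothing."""
    total = n_spyware + n_benign
    result = 0.0
    for count in (n_spyware, n_benign):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def gain(n_spy, n_ben, spy_with, ben_with):
    """Information gain of a present/absent attribute from raw counts."""
    total = n_spy + n_ben
    with_a = spy_with + ben_with
    without_a = total - with_a
    conditional = ((with_a / total) * entropy(spy_with, ben_with)
                   + (without_a / total)
                   * entropy(n_spy - spy_with, n_ben - ben_with))
    return entropy(n_spy, n_ben) - conditional

def select_features(counts, n_spy, n_ben, threshold):
    """Return feature names with gain >= threshold, best first."""
    scored = [(gain(n_spy, n_ben, s, b), name)
              for name, (s, b) in counts.items()]
    return [name for g, name in sorted(scored, reverse=True) if g >= threshold]

# Hypothetical counts: feature -> (spyware with it, benign with it).
counts = {"feature_a": (4, 0), "feature_b": (2, 2)}
selected = select_features(counts, n_spy=4, n_ben=4, threshold=0.5)
```

Here `feature_a` separates the classes perfectly and is kept, while `feature_b` occurs equally often in both classes, has zero gain, and is dropped.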


Figure 16. Phases of information collection and extraction

Figure 17. Workflow of data mining module

This information feedback and system improvement mechanism allows us to upgrade the server-side and client-side detection modules in a timely manner.


Chapter5. Experiments and Results

5.1 Experiment Data Set and Experiment Environment

As there is no standard experimental data set, we built a custom data set collected from the Internet between January 2005 and June 2005. It consists of 740 benign programs and 407 spyware programs (the content of the data set is listed in the appendix). To reflect typical user experience, we chose benign programs similar in size to the spyware (see Table 5) from five categories: system tools, business programs, Internet applications, multimedia, and drafting software. They were downloaded from http://toget.pchome.com.tw, a well-known software sharing website. We collected the spyware from the websites [30]~[38], most of which are hacker web forums in China. To analyze the influence of spyware on the user's computer, we use VMware [28], a virtual machine program, to compare the system state before and after executing a spyware program. We constructed our virtual environment with the Microsoft Windows XP operating system on a platform with an Intel Pentium 4 3.0 GHz CPU and 1 GB of RAM. Our system is implemented in Visual C#.

Table 5. Ratio of executable size in our experimental data set

Benign programs

Spyware