Detection and Confirmation of Web Robot Requests for Cleaning the Voluminous Web Log Data

Tanvir Habib Sardar
Department of Information Science and Engineering
P.A. College of Engineering, Mangalore, Karnataka, India
[email protected]

Zahid Ansari
Department of Computer Science and Engineering
P.A. College of Engineering, Mangalore, Karnataka, India
[email protected]

Abstract - Web robots are software applications that run automated tasks over the Internet. They traverse the hyperlink structure of the World Wide Web in order to retrieve information. There are several reasons to distinguish web robot requests from user requests, since some web robot activity can be harmful to the web. Firstly, web robots are employed to assemble business intelligence at e-commerce sites; in such situations, the e-commerce site may need to detect robots. Secondly, many e-commerce sites carry out web traffic analysis to deduce the way their customers have accessed the site; unfortunately, such analysis can be rendered erroneous by the presence of web robots. Thirdly, web robots often consume considerable network bandwidth and server resources at the expense of other users. A web log file is a file automatically created and maintained by a web server to record the activity it performs; it maintains a history of the page requests made to the site. In this paper we use four methods together to detect and finally confirm requests as robot requests. Experiments have been performed on the log file generated by the server of an operational web site, vtulife.com, containing data from March 2013. The results of web robot detection using the various techniques are compared, and an integrated approach is proposed for the confirmation of robot requests.

Index Terms - web usage mining, web robot detection, web log file.

I. INTRODUCTION
Web robots are programs that traverse the Web autonomously, starting from a "seed" list of Web pages and recursively visiting documents accessible from that list [1]. They are also known as Web Wanderers, Crawlers, or Spiders [2]. Web robots are sent out for various purposes. For instance, they can gather information about the organization of the World Wide Web and its usage; the index databases of search engines such as Google and AltaVista are built by Web robots. Business organizations deploy web robots to collect email IDs and online resumes, and to scrutinize product prices, business news, etc. Web robots are used by Web administrators to carry out site maintenance operations such as mirroring and checking for broken hyperlinks [3]. They can perform a variety of tasks such as link checking, page indexing and performing vulnerability assessments of targets [3]. However, some robots are specifically designed and employed to perform spamming tasks, i.e. to spread spam content on Web 2.0 platforms [4]. They are able to perform

human-user tasks on the web such as registering user accounts, searching for and submitting content, and navigating through websites [5]. Though the typical jobs of web robots are generally simple and structurally repetitive [1], some of their activity can be harmful to the web: it wastes resources, misleads people and can trick search engine algorithms to gain unfair search result rankings [6]. That is why it is necessary to recognize incoming requests from Web robots and to distinguish them from those of other users. Firstly, Web robots are employed to assemble business intelligence at e-commerce sites. In such situations, the e-commerce site may need to detect all HTTP requests coming from unauthorized robots [7]. Secondly, many e-commerce sites carry out Web traffic analysis to deduce the way their customers have accessed the site. Unfortunately, such analysis can be rendered erroneous by the presence of Web robots. Tan and Kumar [2002] discovered that Web robots can account for 5% to 85% of the total HTML page requests [3]. These web robot requests should be identified and removed to obtain accurate information about the site's visitors. Thirdly, Web robots often consume considerable network bandwidth and server resources at the expense of other users. In addition, badly designed robots may overload the Web server by consuming its resources. Along with these, malicious users also use robots for a variety of tasks, including (4) distributing requests with fake referrer headers to repeatedly generate trackback links that artificially raise a site's search engine ranking, (5) producing automatic mouse clicks on online ads: since there are many click-through payment programs on the Web, in which an advertiser pays the referring web site owner for promoting his online ad by click-through, web robots are sent to generate such clicks and increase profit, (6) retrieving email IDs for spamming in the future, and (7) testing vulnerabilities in servers, CGI scripts, etc. [8]. In one word, Web sites are routinely visited by automated agents known as Web robots that perform acts ranging from the beneficial, such as indexing for search engines, to the malicious, such as searching for vulnerabilities, attempting to crack passwords, or spamming bulletin boards [9]. To deal with the concerns listed above, we should be able to separate the requests of robots from those of general Web users. Distinguishing Web robots from humans will help marketing companies derive more accurate statistics about the

impact of online advertising and the interaction that real customers have with e-business sites. It will also help Web administrators estimate the real side effects of robot activity on Web server performance. Finally, it can provide a basis for developing intelligent admission control systems that will protect Web sites from aggressive or unwanted robots. However, the openness, the lack of central control, the sheer size, and the dynamic nature of the Internet render the identification of active crawlers and operational search engines a very difficult challenge. A web log file is a file automatically created and maintained by a web server to record the activity it performs. This log file records which pages are being accessed, by whom, and when, as it maintains a history of the page requests made to the site. These files are only accessible to the webmaster or other administrative personnel. An analysis of the web log may be used to observe traffic patterns by time, date, referrer, or user agent [10]. Different log file formats are available [11]. A few of these formats are: NCSA Common Log Format, the Extended Log File Format by W3C [12], Sun ONE Web Server (iPlanet) [13], IBM Tivoli Access Manager WebSEAL [14], WebSphere Application Server logs [15], etc. Log files contain data fields to store server access information. All log file formats contain fields that hold common values such as the address of the client, the page requested by the client, the request method, the size of the data returned, the HTTP status code, and the date, time, and time zone when the server finished processing the request.

The contribution of this paper is that requests from robots and from users are separated using the proposed methodology. In our proposed methodology we have used four methods to detect web robot requests, and to confirm a request as a robot request we have integrated the four methods using intersection and union operations. Section II explains the methodology of our proposed technique. Section III shows the results obtained by executing each of the methods, and then the results obtained by integrating all four methods through union and intersection.

II. METHODOLOGY
In this research the following methods have been utilized for the detection and confirmation of web robot requests: (1) Check robots.txt accesses, (2) User-agent check, (3) IP address check, (4) Count of HEAD requests and HTTP requests with unassigned referrers. Each of these is a web robot detection method in its own right and can also be used individually. The reason for choosing all four methods together is to confirm a request as a robot request by integrating the methodologies. For example, IP addresses of the same host are sometimes used both to deploy robots and for general user requests, so if only one method that checks robot requests against known robot IP addresses is used, the result may not be accurate. To get an accurate result, the methods should be combined so that only robot requests are detected. The following methods are used in this process:

A. Check robots.txt Accesses: robots.txt is a file kept in the top-level directory of a web server. When a robot looks for the "/robots.txt" file of a URL, it removes the path section from the URL and puts "/robots.txt" in its place. For example, for "http://www.anysite.com/searchany/index.html", it removes "searchany/index.html", replaces it with "robots.txt", and ends up with "http://www.anysite.com/robots.txt". A sample of the content of robots.txt is given in Table II. In robots.txt, the line "User-agent: *" means that the following rules apply to all robots, and "Disallow: /A.html" forbids all robots from accessing the file http://www.anysite.com/A.html. The Robot Exclusion Standard [16] [17] was proposed in this way to allow web administrators to specify which components of their website are forbidden to visiting robots. This suggests that Web robots should be easily detected from robots.txt access requests: if the requested-page field of a log entry refers to robots.txt, the request can be confirmed as a web robot request even if the same request cannot be identified by the other methods described in this paper. A small sketch of this check is given after Table II.

TABLE II Sample Content of robots.txt

User-agent: *
Disallow: /A.html
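The following is a minimal Python sketch, not part of the original paper, of the two operations described above: deriving the /robots.txt location from an arbitrary page URL, and flagging a log entry whose requested path is robots.txt. The function and variable names are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Replace the path/query of a page URL with /robots.txt."""
    scheme, netloc, _path, _query, _fragment = urlsplit(page_url)
    return urlunsplit((scheme, netloc, "/robots.txt", "", ""))

def is_robots_txt_request(requested_path: str) -> bool:
    """Method A: true if a logged request targets robots.txt."""
    return requested_path.split("?", 1)[0].rstrip() == "/robots.txt"

if __name__ == "__main__":
    print(robots_txt_url("http://www.anysite.com/searchany/index.html"))
    # -> http://www.anysite.com/robots.txt
    print(is_robots_txt_request("/robots.txt"))  # True
    print(is_robots_txt_request("/A.html"))      # False
```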

B. User-Agent Check: Web robots are becoming increasingly important as the size of the Web grows [18]. Poor implementation of Web robots can lead to severe network and server overload problems. Thus, a guideline is required to guarantee that both the Web robot and the Web server can work together in a way that is advantageous to both parties. Under the proposed ethical guidelines for robot designers [19] [20] [21], a cooperative robot must state its identity to a Web server via its user-agent field. For instance, the user-agent field of a Web robot should contain the name of the robot, unlike the user-agent field of a Web browser, which often contains the name Mozilla, as shown in Table III. In this method the user-agent field of each log entry is examined; if the user-agent field value matches a known robot name, the request is treated as a web robot request. Figure 1 shows an example where an Internet Explorer browser, identified by its user-agent field, Mozilla/4.0 (compatible; MSIE 5.01), was used to request the HTML page http://www.anysite.com/A.html.

Browser to Web Server (HTTP Request Header):

    GET /A.html HTTP/1.1
    Host: www.anysite.com
    Referrer: /
    Accept: image/gif, */*
    Accept-Language: en-us
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)
    Connection: Keep-Alive

Web Server to Browser (HTTP Response Header):

    HTTP/1.1 200 OK
    Date: Mon, 11 March 2013 14:41:20 GMT
    Server: Apache/1.3.8 (UNIX)
    Last-Modified: Fri, 14 Jan 2012 10:34:45 GMT
    ETag: "1e5cd-964-381e1bd6"
    Accept-Ranges: bytes
    Content-Length: 327
    Connection: close
    Content-Type: text/html

Figure 1 Communication between Web browser and Web server via HTTP protocol.

TABLE III Sample of user-agents

Client Type | IP Address | User Agent
Browser (Netscape) | 160.94.178.152 | Mozilla/4.7 (X11; I; Linux 2.2.14-5.0 i686)
Browser (IE) | 160.94.178.205 | Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)
Browser (Opera) | 160.94.103.248 | Opera/5.01 (Windows NT & Opera 5.0; U)
Search Engine | 199.172.149.184 | ArchitextSpider
Email Harvester | 4.41.77.204 | EmailSiphon
Link Checker | 130.237.234.90 | LinkChecker/1.0
Search Engine | 207.138.42.10 | Mozilla/4.5 (Win95; I)
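As a companion to Table III, here is a small Python sketch, assumed for illustration rather than taken from the paper, of the user-agent check: a log entry is flagged as a probable robot request when its user-agent string contains any known robot name. The list of robot names shown is only an example.

```python
# Illustrative list; a real deployment would load a maintained robot database.
KNOWN_ROBOT_AGENTS = ["googlebot", "bingbot", "architextspider",
                      "emailsiphon", "linkchecker", "slurp"]

def is_robot_user_agent(user_agent: str) -> bool:
    """Method B: flag the request if the user-agent names a known robot."""
    ua = user_agent.lower()
    return any(name in ua for name in KNOWN_ROBOT_AGENTS)

print(is_robot_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"))            # True
print(is_robot_user_agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"))    # False
```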

C. IP Address Check: Another way to detect robots is to match the IP address of a client against those of known robots. Many web sites provide up-to-date lists of IP addresses of known web robot clients. In this method the downloaded IP addresses are compared with the IP address of each log entry; if a match is found, the request is probably a web robot request. Since the same IP address could be used both by Web users surfing the Web and by robots automatically downloading files from a Web site, some other method should also be used along with this one before a request is confirmed as a robot request. This is achieved by selecting only the intersection of the resulting request sets (a small sketch of the IP match follows).
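A minimal Python sketch of this IP-address check, assuming a set of known robot IP addresses has already been downloaded (the addresses shown are illustrative, taken from the sample tables later in the paper):

```python
# Illustrative set; in practice this would be loaded from a published robot IP list.
KNOWN_ROBOT_IPS = {"66.249.73.168", "66.249.73.238", "65.55.24.245"}

def is_known_robot_ip(client_ip: str) -> bool:
    """Method C: flag the request if its source IP is a known robot address."""
    return client_ip in KNOWN_ROBOT_IPS

print(is_known_robot_ip("66.249.73.168"))    # True (listed as Googlebot in the sample tables)
print(is_known_robot_ip("117.198.105.210"))  # False
```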

D. Count of HEAD requests and HTTP requests with unassigned referrers: The guidelines for Web robot designers also suggest that ethical robots should use the HEAD request method whenever possible. The request method (e.g. GET, HEAD or POST) of an HTTP request message decides what type of job the Web server should execute on the resource requested by the Web client. For example, a Web server responds to a GET request by sending a message that consists of some header information along with a message body containing the requested file. In contrast, the response to a HEAD request contains only the message header and thus involves less communication overhead. This is the reason why Web robots are encouraged to use the HEAD request method. In principle, one can examine requests made with the HEAD method to find Web robot requests. In addition, it is also important to look for requests that have an unassigned referrer field. The referrer field is provided by the HTTP protocol to permit a Web client (particularly, a Web browser) to indicate the address of the Web page that contains the link the client followed in order to reach the current requested page. For example, whenever a user requests the page http://www.xyz.com/A.html by clicking on a hyperlink found at http://www.xyz.com, the user's browser will generate an HTTP request message with its referrer field set to http://www.xyz.com. Since most robots do not assign any value to their referrer field, these values appear as "-" in the Web server logs. In this method, HEAD requests and HTTP requests with unassigned referrers are identified (a small sketch follows).
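A short Python sketch of this check, assuming the request method and referrer field have already been extracted from a log entry (an unassigned referrer is logged as "-"):

```python
def is_head_or_no_referrer(method: str, referrer: str) -> bool:
    """Method D: flag HEAD requests and requests with an unassigned referrer."""
    return method.upper() == "HEAD" or referrer.strip() in ("-", "")

print(is_head_or_no_referrer("HEAD", "http://www.xyz.com"))  # True
print(is_head_or_no_referrer("GET", "-"))                    # True
print(is_head_or_no_referrer("GET", "http://www.xyz.com"))   # False
```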

E. Integration and Confirmation: For the robot request confirmation process we integrate all four methods. The result set obtained from the first method (robots.txt checking) is directly confirmed as robot requests and is assigned the label 2. The web requests obtained by integrating the other three methods carry label values ranging from one to three, based on the number of methods that identified the entry as a web robot request. Only the intersection of the requests detected by the last three methods (i.e. all methods except robots.txt checking) gets label value 3. The requests having label values 3 and 2 are then combined with a union operation and confirmed as web robot requests. Figure 2 is the diagrammatic representation of our proposed methodology. We start with the log file as the source of our experiment. Next the log is pre-processed: image access requests are deleted and query strings are eliminated. Image access requests from web robots are very rare, and query strings are eliminated to shorten the job of searching the voluminous log file entries. After this, using the pre-processed log file, each of our four methods is executed separately; they are named M1, M2, M3 and M4 in the figure, for the robots.txt check, user-agent check, IP address check, and count of HEAD requests and HTTP requests with unassigned referrers, respectively. The output request set obtained by the robots.txt check is confirmed as web robot requests and named Confirmed Robot Requests Set 1 (CRRS 1). The outputs of the other three methods are Probable Robot Requests Set 1 (PRRS 1), PRRS 2 and PRRS 3, respectively; these sets carry label value 1. In the next stage we gather the requests from PRRS 1, PRRS 2 and PRRS 3 and use an intersection operation to confirm the requests having the highest label (i.e. 3) as robot requests, forming Confirmed Robot Requests Set 2 (CRRS 2). In the last step we integrate CRRS 1 and CRRS 2 with a union operation and confirm the resulting set as the confirmed robot requests.


Figure 3 provides the algorithm of our Integrated Approach.

    type logEntry {
        ip       : string
        request  : URI
        agent    : string
        method   : string
        time     : seconds
        status   : string
        referrer : URI
        protocol : string
    }

    Let F denote the Log File.
    Let Set Img denote the set of image file formats.
    Let Qry denote the query symbol (?).
    Let Set IPAddr denote the set of known robot IP addresses.
    Let Set Agnt denote the set of known robot user-agent names.
    Let Sets CRRS1, PRRS1, PRRS2, PRRS3 and CRRS2 be empty sets.

    Input: F
    for each fi in F
        if (request is robots.txt) then CRRS1 = CRRS1 + fi
    for each fi in F {
        if (ip is in IPAddr)    then PRRS1 = PRRS1 + fi
        if (agent is in Agnt)   then PRRS2 = PRRS2 + fi
        if (method is 'HEAD')   then PRRS3 = PRRS3 + fi
    }
    CRRS2 = PRRS1 ∩ PRRS2 ∩ PRRS3
    TCRRS = CRRS1 ∪ CRRS2

Figure 3 Algorithm for the Integrated Robot Detection Methodology
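To make the integration step concrete, the following is a hedged Python rendering of Figure 3, assuming each log entry has already been parsed into a small record whose fields mirror the logEntry type above. The known-robot IP and agent lists are placeholders, the agent match uses substring containment, and the PRRS3 test also includes the unassigned-referrer condition from Section II-D, which Figure 3 abbreviates to the HEAD check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    ip: str
    request: str   # requested URI path
    agent: str
    method: str
    referrer: str

# Placeholder knowledge bases; real ones would be downloaded and maintained.
IP_ADDR = {"66.249.73.168", "65.55.24.245"}
AGNT = ("googlebot", "bingbot", "architextspider")

def detect_robots(F):
    """Return (CRRS1, CRRS2, TCRRS) following the integrated methodology."""
    CRRS1, PRRS1, PRRS2, PRRS3 = set(), set(), set(), set()
    for fi in F:
        if fi.request.split("?", 1)[0] == "/robots.txt":          # robots.txt check
            CRRS1.add(fi)
        if fi.ip in IP_ADDR:                                       # IP address check
            PRRS1.add(fi)
        if any(a in fi.agent.lower() for a in AGNT):               # user-agent check
            PRRS2.add(fi)
        if fi.method.upper() == "HEAD" or fi.referrer == "-":      # HEAD / no referrer
            PRRS3.add(fi)
    CRRS2 = PRRS1 & PRRS2 & PRRS3      # intersection: label 3
    TCRRS = CRRS1 | CRRS2              # union: total confirmed robot requests
    return CRRS1, CRRS2, TCRRS
```

Given a list of parsed entries, the size of the returned TCRRS set corresponds to the count of confirmed robot requests reported in Section III.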

Figure 2 Proposed Methodology Flow Chart. (Flow: Web Server Log File -> Pre-processing -> Pre-processed Log -> M1, M2, M3, M4; M1 -> CRRS1 (Label 2); M2, M3, M4 -> PRRS1, PRRS2, PRRS3 (Label 1); Intersection of PRRS1, PRRS2, PRRS3 -> CRRS2 (Label 3); Union of CRRS1 and CRRS2 -> Total CRRS (TCRRS).)

III. EXPERIMENTAL RESULTS
A. Input dataset: Our input data is a log file in the NCSA Common Log Format (CLF). The log is from vtulife.com and contains data from March 2013. The total number of request entries in the log is 186743. The log file contains the following field values:
• Remotehost: Remote hostname (or IP address, if the hostname is not available from the Domain Name Server (DNS), or if DNS lookup is off). Example: cm9165.cass.net or 216.239.38.136
• Rfc931: The remote log-in name of the user. Example: - (normally omitted)
• Authuser: The username as which the user has authenticated himself. Example: - (normally omitted)
• Timestamp: Date, time and time zone (relative to GMT) of the request. Example: [04/May/2013:16:32:50 -0500]
• Request: The request line exactly as it came from the client, which contains the request method, the target Uniform Resource Locator (URL) (relative to the domain), and the HTTP version. Example: "GET /jobs/ HTTP/1.1"
• Status: The HTTP status code returned to the client. Example: 404 (page not found)
• Bytes: The content length of the document transferred. Example: 15140 (bytes; can also be "-" if the status code indicates the page was not modified)
• Referrer: The referrer field is provided by the HTTP protocol to allow a Web client (particularly, a Web browser) to specify the URL of the Web page that contains the link the client followed in order to reach the current requested page; it is "-" in the case of a direct request (e.g. type-in, bookmark). Example: http://www.google.com/search?hl=en&q=technology+social
• User-Agent: A string sent to the server by the Web client in order to identify itself. It normally contains the user's Web browser name, version and operating system information; the content differs when sent by a Web robot. Example: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461)
A minimal parsing sketch for this format is given below.
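The following Python sketch, assumed for illustration and not part of the paper, parses one log line into the fields listed above. Since the sample entries in this paper also carry referrer and user-agent values, the pattern accepts the combined variant of the Common Log Format; the regular expression and field names are ours.

```python
import re

# Common Log Format with optional referrer and user-agent (combined variant).
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line: str):
    """Return a dict of CLF fields, or None if the line does not match."""
    m = CLF_PATTERN.match(line.strip())
    return m.groupdict() if m else None

sample = ('216.239.38.136 - - [04/May/2013:16:32:50 -0500] '
          '"GET /jobs/ HTTP/1.1" 200 15140 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"')
print(parse_log_line(sample)["request"])  # GET /jobs/ HTTP/1.1
```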

Figure 4 shows an example of a request recorded in the log file.

66.249.73.168 [26/Feb/2013:20:06:15 -0500] "GET /resource.php?dir=VTU%20Previous%20Year%20Question%20Papers/V%20and%20VI%20Sem%20Eng%20Question%20papers%202006-%2007%20Scheme/100%20%20Design%20of%20Machine%20Elements-II/&sort=name&order=asc HTTP/1.1" 200 15819 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Figure 4 A Sample Log Request Entry

Table IV explains the corresponding field values of the entry shown in Figure 4.

TABLE IV Field and their Values of an Entry

Field Name | Value
Remotehost | 66.249.73.168
Rfc931 | -
Authuser | -
Timestamp | [26/Feb/2013:20:06:15 -0500]
Request | GET /resource.php?dir=VTU%20Previous%20Year%20Question%20Papers/V%20and%20VI%20Sem%20Eng%20Question%20papers%202006-%2007%20Scheme/100%20%20Design%20of%20Machine%20Elements-II/&sort=name&order=asc HTTP/1.1
Status | 200
Bytes | 15819
Referrer | -
User-Agent | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

B. Pre-processing: In this stage, image access requests are deleted and query strings are eliminated. Image access requests from web robots are very rare, and query strings are eliminated to shorten the job of searching the voluminous log file entries. A total of 100565 entries remain after eliminating image access logs and query strings.

C. Results: The results of the individual methods are as follows:

1) Robots.txt Check: After executing the first method, a total of 1654 web robot entries (requests) are found. The result derived from this method alone can be confirmed as robot requests. We have labelled the resulting set entries with the value 4, along with their IP address and user-agent field values. Table V shows a sample output of the robots.txt check. (These requests are confirmed as robot requests.)

TABLE V Sample Result of robots.txt Check

IP Address | User-Agent Name | Label
157.55.32.96 | Mozilla/5.0 (compatible; bingbot/2.0; http://www.bing.com/bingbot.htm) | 4
180.76.5.195 | Mozilla/5.0 (WindowsNT.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2 | 4
66.249.73.168 | Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) | 4
66.249.73.238 | Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) | 4
38.99.82.191 | Mozilla/4.0 (compatible; MSIE.0b; Windows NT 6.0) | 4

2) User-Agent Check: After executing this method, a total of 2967 web robot entries (requests) are found. (These are probable robot requests.) A few output entries are shown in Table VI.

TABLE VI Sample Result of User-agent Check

IP Address | User-Agent Name | Label
117.198.105.210 | Mozilla/5.0 (WindowsNT.1) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.152 Safari/537.22 | 1
66.249.73.238 | Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) | 1
65.55.24.245 | Mozilla/5.0 (compatible; bingbot/2.0; http://www.bing.com/bingbot.htm) | 1
66.249.73.168 | Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) | 1

3) IP Address Check: After executing this method, a total of 4324 entries are selected. (These are also probable robot requests.) An example set is given in Table VII.

TABLE VII Sample Result of IP Address Check

IP Address | User-Agent Name | Label
66.249.73.168 | Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) | 1
66.249.73.238 | Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) | 1
66.249.73.168 | SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 P.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html) | 1
65.55.24.245 | Mozilla/5.0 (compatible; bingbot/2.0; http://www.bing.com/bingbot.htm) | 1
216.39.48.231 | Mozilla/4.0 (compatible; MSIE.01; Windows NT 5.0) RPT-HTTPClient/0.33E | 1

4) Count of HEAD requests and HTTP requests with unassigned referrers: After executing this method, a total of 5643 web robot entries (requests) are found. (These requests are also probable robot requests.) An example set is given in Table VIII.

TABLE VIII Sample Result of the HEAD Request and Unassigned Referrer Check

IP Address | User-Agent Name | Label
194.72.238.241 | Mozilla/4.0 (compatible; Netcraft Web Server Survey) | 1
117.198.105.210 | Mozilla/5.0 (WindowsNT.1) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.152 Safari/537.22 | 1
209.66.70.253 | Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20 | 1
115.242.190.161 | Microsoft Office Existence Discovery | 1
62.67.184.93 | Mozilla/4.0 (compatible; MSIE.1; Windows XP) | 1

We have used labelling for each entry. The rule is that if an entry is detected as a web robot request by a single method then the entry is assigned label 1, if by two methods then label 2, and so on. A detected entry with a higher label value is considered more likely to be a web robot request, and an entry labelled with the highest value is confirmed to be a robot request. So, by using all four methods individually, a total of 198 requests are selected.

5) Integration and Confirmation: For the robot request confirmation process we have integrated all four methods. Firstly, from the robots.txt check a total of 1654 requests are found and confirmed as robot requests. Next, we take the intersection of the output result sets derived from the user-agent check (Method 2), the IP address check (Method 3) and the count of HEAD requests and HTTP requests with unassigned referrers (Method 4), and then take the union with the result obtained by the robots.txt check (Method 1). The number of web requests obtained from the intersection of the results of Methods 2, 3 and 4 is 2325. These 2325 requests are confirmed to be robot requests and are added to the requests obtained from Method 1; the union operation removes duplicate requests. Finally we obtain 3483 confirmed requests. Figure 5 shows the number of requests found by implementing each of the four methods individually and then by using the proposed methodology.

Figure 5 No. of Requests Found by Executing Our Methodology

IV. CONCLUSION AND FUTURE WORK
In this research work we have explored a number of techniques to identify robot-generated requests for accessing web pages, with the objective of eliminating them in order to clean the voluminous web log data. We have presented the details of our proposed methodology, which utilizes four different robot request detection methods and integrates them together to detect and finally verify requests as confirmed robot requests. The integration is achieved by utilizing set union and intersection operations. Experiments have been performed on the log file generated by the server of an operational web site, vtulife.com, which contains data from March 2013. Since robots.txt is visited only by web robots, the output set of Method 1 contains only confirmed web robot requests. For the confirmation of robot requests detected through the other three methods, a set intersection operation is applied. The reasons for requiring this confirmation are (1) the same IP address could be used by Web users for surfing the Web and by robots to automatically download files from a Web site, so such requests cannot be directly confirmed as web robot requests, and (2) some anonymous surfing sites, such as SilentSurf and Turing Machine, allow Web users to hide their accesses by changing the user-agent fields of their browsers into robot-like values. In the present work, detection of robot requests is performed on a log file in the NCSA Common Log Format (CLF). As future work this methodology may be generalized to deal with log files in various other formats as well. Moreover, the robot detection and confirmation technique presented in this paper is implemented in offline mode; an online mode of operation may be useful for blocking requests from confirmed robots.

REFERENCES
[1] Athena Stassopoulou, Marios D. Dikaiakos, "Web robot detection: A probabilistic reasoning approach", Computer Networks, Volume 53, Issue 3, 27 February 2009, Pages 265-278.
[2] The Web Robots Pages: www.robotstxt.org
[3] P.-N. Tan and V. Kumar, "Discovery of Web Robot Sessions Based on their Navigational Patterns", Data Mining and Knowledge Discovery, vol. 6, pp. 9-35, 2002.

[4] P. Hayati, K. Chai, V. Potdar, and A. Talevski, "HoneySpam 2.0: Profiling Web Spambot Behaviour", in 12th International Conference on Principles of Practice in Multi-Agent Systems, Nagoya, Japan, 2009, pp. 335-344.
[5] Pedram Hayati, Kevin Chai, Vidyasagar Potdar, Alex Talevski, "Behaviour-Based Web Spambot Detection by Utilising Action Time and Action Frequency".
[6] Z. Gyongyi and H. Garcia-Molina, "Web spam taxonomy", in Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, 2005.
[7] Graham, L. 2000. Keep your bots to yourself. IEEE Software, 17(6):106-107.
[8] KyoungSoo Park, Vivek S. Pai, Kang-Won Lee, and Seraphin Calo, "Securing Web Service by Automatic Robot Detection".
[9] https://www.usenix.org/legacy/event/usenix06/tech/full_papers/park/park_html/paper.html
[10] http://en.wikipedia.org/wiki/Server_log
[11] http://publib.boulder.ibm.com/tividd/td/ITWSA/ITWSA_info45/en_US/HTML/guide/c-logs.html#common
[12] http://www.w3.org/TR/WD-logfile.html
[13] http://docs.oracle.com/cd/E19146-01/821-1827/gdsxf/
[14] http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/a3ca6f3a-7fc3-4514-9b61-f586d41bd483.mspx?mfr=true
[15] http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=%2Frzamy%2F50%2Ftrb%2Ftrblogs.htm
[16] Koster, M. 1994b. A standard for robot exclusion. http://info.webcrawler.com/mak/projects/robots/norobots.html
[17] Kolar, C., Leavitt, J., and Mauldin, M. 1996. Robot exclusion standard revisited. http://www.kollar.com/robots.html
[18] Ahmed Patel, Nikita Schmidt, "Application of structured document parsing to focused web crawling", Computer Standards & Interfaces, Volume 33, Issue 3, March 2011, Pages 325-331.
[19] Eichmann, D. 1995. Ethical web agents. Computer Networks and ISDN Systems, 28(1):127-136.
[20] Koster, M. 1994a. Guidelines for robot writers. http://info.webcrawler.com/mak/projects/robots/guidelines.html
[21] Koster, M. 1995. Robots in the web: Threat or treat. ConneXions, 9(4):2-12.
