Web robot detection in the scholarly information environment
Paul Huntington, David Nicholas and Hamid R. Jamali, CIBER,1 School of Library, Archive and Information Studies, University College London
Abstract. An increasing number of robots harvest information on the world wide web for a wide variety of purposes. Protocols developed at the inception of the web laid out voluntary procedures in order to identify robot behaviour and exclude it if necessary. Few robots now follow this protocol and it is increasingly difficult to filter for this activity in reports of on-site activity. This paper seeks to demonstrate the issues involved in identifying robots and assessing their impact on usage in regard to a project which sought to establish the relative usage patterns of open access and non-open access articles in the Oxford University Press published journal Glycobiology, which offers articles in both forms in a single issue. A number of methods for identifying robots are compared and together these methods found that 40% of the raw logs of this journal could be attributed to robots.
Keywords: electronic journals; robot detection; web crawlers; web log analysis
1. Introduction
Mechanical agents, crawlers, wanderers and spiders are programs that navigate site pages on the world wide web searching for links, keywords, emails, documents and content. The internet is populated by thousands of these ‘robots’. Some are specialist, for instance: checking for copyright/trademark violations (Cyveillance),2 gathering information for research projects (e-SocietyRobot),3 collecting email (Indy Library),4 data mining (Intelliseek)5 and checking for plagiarism (SlySearch). Search engines are especially big users of robots, sending them out to log and index pages to databases. Google uses Googlebot and the cataloguing specialist Inktomi6 is used by Yahoo. New robots are being launched all the time and it is not inconceivable that in the near future a large number of users will have their own personal spiders, raising even bigger questions regarding how robot use is identified and treated.

A protocol drawn up at the inception of the web laid out voluntary procedures that robots should follow. The Robots Exclusion Protocol enables web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot. In effect the protocol defines access.
Correspondence to: Hamid R. Jamali, School of Library, Archive and Information Studies (SLAIS), University College London, Henry Morley Building, Gower Street, London WC1E 6BT, UK. Email: [email protected]
Journal of Information Science, 34 (5) 2008, pp. 726–741 © CILIP, DOI: 10.1177/0165551507087237
A robots.txt file that can be placed on each web site’s server implements the protocol. This instructs the robot as to what areas of the site it can or cannot index. Unhelpfully, few robots now follow this protocol and much robot activity is now undeclared. In part this is a result of competition between search engines, whereby each seeks to offer greater indexing of the web: if one organization flouts the protocol in an attempt to gain a competitive indexing advantage then others will follow suit. An alternative explanation is that organizations running a number of robots visit the robots.txt document just once and update this for all their robots. However, there have always been organizations that have not followed the protocol because they seek to keep their robot activity secret, as in the case of organizations caching pages for third party viewing, an activity that is still regarded as illegal in some parts of the world. They do this by hiding the lookup identity of their IP (internet protocol) address. Organizations may additionally employ methods to mask robotic behaviour, including the mimicry of human activity: staging a longer time gap between views, employing multiple IP identities and limiting the number of views attributed to any single IP address. What this means is that it is now increasingly difficult to filter for this activity in reports of site activity.

All this has much significance for today’s log analysts, who face something that the original pre-internet, OPAC (Online Public Access Catalogue) log analysts could never have imagined: an information environment where robots constitute a major group of users. Typically robot activity boosts usage of all web sites [1]. This inevitably raises questions as to whether robot usage should be included or excluded from usage counts, and if the former, should it be treated with the same significance as human use? The established view is that robot activity should be excluded from the metrics generated from the server record of client views [2] because it does not constitute actual human use. While it might not constitute actual use it does lead on to greater use by humans; the very act of indexing brings greater traffic in its wake. Hence robot activity is in part a positive metric, an indication of site success.

Robots do have an impact on analyses of a particular kind, especially comparisons between web sites when one site attracts greater robot activity than the other. This in part depends on how robots conduct their indexing, but in general a site with a greater number of links to it will have more robot activity. More pertinently in regard to this paper, it also impacts upon comparisons made between different types of content: for example, between freely available content, as in the case of Open Access (OA) content, and restricted (subscriber only) content, as robots will be able to view the one but not the other. Whatever the overall view is of the significance of robot activity, there is a need to successfully identify and account for it.

This article is the product of CIBER’s Virtual Scholar research programme7 and is the third to emanate from an Oxford University Press (OUP) funded research investigation into the impact on usage and users of open access publishing. The previous two articles [1, 3] concerned the journal Nucleic Acids Research and examined all the drivers that impact on usage, including OA, search engines and the ending of embargoes, but not robots.
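To make the voluntary nature of the protocol concrete, the following short Python sketch (an editorial illustration, not part of the original study) shows how a compliant robot might consult a site's robots.txt before requesting a page, using the standard library's urllib.robotparser; the site URL and user agent name are purely illustrative.

# Illustrative sketch: how a well-behaved robot consults robots.txt before
# fetching a page. The journal URL below is a placeholder, not the study site.
from urllib.robotparser import RobotFileParser

robots_url = "http://glycob.example.org/robots.txt"   # hypothetical location
parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()                                          # fetches and parses robots.txt

user_agent = "ExampleBot"                              # hypothetical robot name
page = "http://glycob.example.org/cgi/search"

if parser.can_fetch(user_agent, page):
    print("Allowed: a compliant robot may request", page)
else:
    print("Disallowed: a compliant robot should skip", page)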
2. Aims
This paper seeks to establish the levels of robot use in scholarly information environments and does this through a usage study of the OUP published journal, Glycobiology, which offers both open access (OA) and non-open access articles; the study sought to measure the relative take-up of the two kinds of articles. It was necessary to establish levels of robot use because open access full-text articles are easily accessible to robots but non-open access ones are not, so in order to ensure the fairness of the comparison it was necessary to correctly identify robot activity. This was not straightforward, as the paper demonstrates. In this regard this paper evaluates five ways of identifying robots and compares them in terms of the usage identified. These are:

• Those robots that declare themselves. Declared robots are robots that report to the Robots.txt document. This is recorded in the log file and all accesses made by IP numbers that have viewed the Robots.txt document are classified as views by declared robots. It should be pointed out that robots from the same search engine may share Robots.txt information.
• Those robots that have an identifiable robot DNS name. Robots like Googlebot and Inktomisearch are domain named (DNS) and a reverse DNS lookup will reveal their identity.
• Lists of undeclared robots. Organizations such as robotstxt.org (www.robotstxt.org/) have been set up to monitor robot activity. These organizations publish lists of robots and their known IP addresses. Usage by these IP numbers can be picked out of the logs.

Robots can also be identified:

• Through browser details contained in the logs. Particular robot names will appear in the browser details contained in the logs. For example Inktomisearch, used by Yahoo, has the word Slurp in its browser details. In some cases a robot can be identified purely on the browser details.
• Through the behavioural traits they evidence in the log files. In theory, robots are not human so should not act as humans. They are programs and hence should be characterized by their mechanical behaviour. They might be expected to return to a site at a particular time each day, every day; they might view pages 24 hours a day, seven days a week, view thousands of pages in one session or view pages sequentially.
3. Background
Glycobiology provides a unique forum dedicated to research into the biological functions of glycans, including glycoproteins, glycolipids, proteoglycans and free oligosaccharides, and on proteins that specifically interact with glycans (including lectins, glycosyltransferases, and glycosidases). Glycobiology is currently (since 1998, Vol. 8) published monthly. The online archive starts with Vol. 1, No. 1, 1990. Content six months and older is available free. In the six month embargoed period OA published articles are available free while non-OA articles are only available to subscribers.
4. Related studies
Tan and Kumar [4] noted that there are very few published papers in the area of web robot detection although it is a widely recognized problem. Lourenço and Belo [5] also considered the detection and analysis of web crawler activities to be a challenging, under-reported problem. The most common method of detection is to check the IP address against known robots [6, 7]. This method is widely used among web administrators. However, the problem with this method is that, due to the ease of web robot construction and deployment, it is impossible to keep a comprehensive list of IP addresses and user agents for all robots. Moreover, some robots attempt to disguise their identities by using user agents that are very similar to those of conventional browsers.

Eichmann [8] proposed ethical guidelines for robot designers. Typically robots should declare their identities to a web server via the user agent field. Tan and Kumar [4] mentioned a few pitfalls of this method, including that some robots (and browsers) use multiple user agent fields within the same session. Koster [9], in an earlier paper, suggested that ethical robots should use the HEAD request method whenever possible. In 1996 the Robots Exclusion Standard [10] was released, outlining a code of behaviour for robots. Robots, according to this standard, should identify themselves by reporting to a text file named robots.txt whenever visiting a web site. The file outlines what parts of the site are not open to robot activity. None of these ethical guidelines, protocols and codes of behaviour are mandatory requirements, and robots seeking to hide their identities are unlikely to declare themselves in this way. It is evident that all of the above-mentioned methods are applicable only for detecting ethical web robots or robots that follow the standard guidelines, and hence do not work for detecting
malignant and previously unknown robots. This is a critical issue as the technology has made it easy to create robots and they are used for a wide range of purposes (e.g. collecting email addresses). As a result researchers have tried to propose more sophisticated methods for detecting robots.

Tan and Kumar [4] demonstrated that robots can be detected by the navigational path adopted, as this is inherently different to that adopted by human users. The authors proposed a robust session identification technique to pre-process the web server logs that could identify sessions with multiple IP addresses and user agents. This would be followed by a procedure for labelling the training and test data sets, and a technique for identifying the mislabelled samples. They maintained that their technique was able to uncover many camouflaged and previously unknown robots and that highly accurate robot classification models could be induced from the access features of the web sessions. Lourenço and Belo [5], in a second generation implementation of navigational robot identification, used click stream data mining to perform robot detection in real time. Their platform, named ClickTips, sustained a site-specific, updateable detection model that tagged robot traverses based on incremental web session inspection and a decision model that assessed eventual containment. However, this procedure is invasive and requires real time cooperation with site administrators.

Dikaiakos et al. [11] analysed the web logs of five academic sites to examine the crawling behaviour of four general-purpose search engines (Google, AltaVista, Inktomi, and FastSearch) and one major digital library and search engine for scientific literature (CiteSeer). One of the aims of their study was to use the findings as a basis for generating automatic robot detection methods. Their study revealed a few characteristics of robot crawling behaviour: for example, that robots processed views at a rate much higher than the general population of web clients. They also showed that robots had a higher incidence of viewing error pages, that is, pages with mistakes in the HTML or pages that have been moved. However, most of the behavioural pattern detection methods are not sophisticated enough to capture some of the robots and could easily produce false positives/negatives. For example, many sites automatically refresh the page (e.g. the front pages of most news sites), so a duration-based method would not make much sense if a human user leaves the browser on overnight. In addition, it is hard to believe that human browsing patterns are all similar in terms of traffic volume, viewing speed and regularity, so an arbitrary threshold to determine ‘very large’, ‘too fast’ and ‘regular’ may either undercount or overcount robots. Pai et al. [12], who discussed their experiences with undesirable traffic, showed that even electronic journals are open to abuse and robots will try hard to imitate human browsing patterns.

Some researchers have approached the problem of web robot detection from a security angle. For example, Park et al. [13] approached the problem as a special form of the Turing test, defending the system by inferring whether the traffic source is human or robot. They experimented with a system called CoDeeN. By extracting the implicit patterns of human web browsing, they developed simple algorithms to detect human users.
Their experiments with the CoDeeN content distribution network showed that 95% of human users were detected within the first 57 requests, and 80% could be identified in only 20 requests, with a maximum false positive rate of 2.4%. In the time that the system was deployed on CoDeeN, robot related abuse complaints dropped by a factor of 10. However, they admitted that their proposed detection mechanism is not completely immune to possible countermeasures by attackers: a serious hacker could implement a bot that generates mouse or keystroke events if he or she knows that a human activity detection mechanism has been implemented by a site.

In brief, although different strategies have been developed for robot detection, more effective methods are yet to be developed. Meanwhile, web administrators have adopted their own strategies for defending their sites against unwanted robots. One of the popular methods is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). This is a test consisting of distorted images or sounds, sometimes with an instructive description, designed to be difficult for robots to decipher [14]. These tests are frequently used by commercial sites which allow only human access or limit the number of accesses.
Table 1 Robots.txt file related to Glycobiology web site

User-agent: *
Noarchive: /
crawl-delay: 10
Disallow: /cgi/folders
Disallow: /cgi/citemap
Disallow: /cgi/eletter-submit
Disallow: /accesslogs
Disallow: /conf/
Disallow: /math/
Disallow: /cgi/login
Disallow: /cgi/alerts
Disallow: /cgi/ctmain
Disallow: /cgi/ctalert
Disallow: /cgi/external_ref
Disallow: /cgi/etoc
Disallow: /cgi/searchresults
Disallow: /cgi/searchhistory
Disallow: /cgi/savedsearch
Disallow: /cgi/markedcitation
Disallow: /cgi/topics
Disallow: /cgi/search
Disallow: /cgi/citmgr
Disallow: /cgi/reprintsidebar
Disallow: /help
Disallow: /apps/
Disallow: /backtocs/
Disallow: /browse/
Disallow: /careerfocus/
Disallow: /classifieds/
Disallow: /guides/
Disallow: /honeypot/
Disallow: /misc/press/
Disallow: /cgi/myjs
Disallow: /cgi/changeuserinfo

User-agent: msnbot
Noarchive: /
crawl-delay: 10
Disallow: /cgi/folders
Disallow: /cgi/citemap
Disallow: /cgi/eletter-submit
Disallow: /accesslogs
Disallow: /conf/
Disallow: /math/
Disallow: /cgi/login
Disallow: /cgi/alerts
Disallow: /cgi/ctmain
Disallow: /cgi/ctalert
Disallow: /cgi/etoc
Disallow: /cgi/external_ref
Disallow: /cgi/searchresults
Disallow: /cgi/searchhistory
Disallow: /cgi/savedsearch
Disallow: /cgi/markedcitation
Disallow: /cgi/topics
Disallow: /cgi/search
Disallow: /cgi/citmgr
Disallow: /cgi/reprintsidebar
Disallow: /help
Disallow: /apps/
Disallow: /backtocs/
Disallow: /browse/
Disallow: /careerfocus/
Disallow: /classifieds/
Disallow: /guides/
Disallow: /misc/press/
Disallow: /cgi/myjs
Disallow: /cgi/changeuserinfo

User-agent: Fasterfox
Disallow: /
5. Methods
Of course the only means of investigating robot behaviour is by examining transactional server log files; for obvious reasons, surveys and focus groups, the methods of choice for most researchers, are not available to robot researchers. Two years' worth (January 2005 to January 2007) of transactional log data was made available by Oxford University Press for the journal. The data was loaded into SPSS and all analysis was completed using SPSS. Lines not related to document and menu views, such as images, were stripped from the file. Furthermore, for this analysis, sessions were not defined and use analysis was completed at the IP level. Use is defined as client requests or views to menus, abstracts and articles. Table 1 is the Robots.txt document related to the Glycobiology web site. The document uses a wildcard (User-agent: *) to specify that the Robots.txt file applies to all robots, but also specifically addresses MSNbot (User-agent: msnbot) and Fasterfox (User-agent: Fasterfox). The document specifies that robots cannot archive content (Noarchive: /); it further disallows access to various directories.
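As an illustration of the cleaning step just described (the study itself used SPSS), the following Python sketch parses a log in the common combined format and strips lines not related to document and menu views; the field layout and file name are assumptions rather than details of the OUP logs.

# Minimal sketch of the cleaning step: drop image and similar requests and keep
# only the fields needed for IP-level analysis. Assumes the common combined log
# format; the real OUP logs may differ.
import re

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)
NON_CONTENT = ('.gif', '.jpg', '.jpeg', '.png', '.css', '.js', '.ico')

def clean(path):
    """Yield (ip, time, url, agent) for document and menu views only."""
    with open(path, encoding='utf-8', errors='replace') as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            if match is None:
                continue                      # malformed or non-standard line
            if match.group('url').lower().endswith(NON_CONTENT):
                continue                      # image and stylesheet lines are stripped
            yield (match.group('ip'), match.group('time'),
                   match.group('url'), match.group('agent'))

views = list(clean('glycobiology_access.log'))    # hypothetical file name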
6. Results
Robots were identified in the usage logs by two means:

1. by searching for robot identities in the log file;
2. by examining the behaviour of IP identities for tell-tale signs of robot type activity.

Robot identities were established through a multi-step process:

a. checking what IP numbers had visited the Robots.txt document;
b. looking for robot names in the DNS name;
c. selecting IP numbers based on existing lists of undeclared robots; and, finally,
d. checking browser details for robot identities.

In all, identity recognition methods revealed that a third (32.6%) of usage could be attributed to robots, approximately 1,405,507 views in the case of Glycobiology over the two years. The exact breakdown is given in Table 2. For the table the attribution of usage to method is undertaken in the following order – declared robot, DNS name, undeclared robot list and browser details – so as to present a table with no overlaps. Robots, of course, can be identified by more than one method and this is further discussed below. Significantly, declared robots accounted for less than 0.5% of use, robots identified within the DNS look-up accounted for 16.1% of usage, those identified by the undeclared robot list accounted for 6.5% and robot identities found within browser details accounted for a further 2.6% of usage. LOCKSS,8 which is a particular type of automated procedure and reflects archive usage by organizations, accounted for 7.4% of total usage. The LOCKSS system uses a crawler to collect e-journal content from publisher web sites. Both written and machine-readable permissions from the publishers are required for this. Publishers are encouraged to grant libraries legal permission to cache and archive their content by means of the wording in licences or terms and conditions. LOCKSS usage does not represent the caching of material with the intention of delivering that material to third parties. The ‘Others’ category covers human or non-robot use and robot use not identified.
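The non-overlapping attribution used for Table 2 amounts to a first-match rule: each IP is credited to the first identity method that recognises it. The Python sketch below illustrates this; all lookup data is invented for illustration, and LOCKSS, which the study reports separately, is omitted.

# Sketch of the ordered, non-overlapping attribution behind Table 2. Example
# data only; the predicate order mirrors the order given in the text.
declared_ips = {'203.0.113.5'}                       # IPs that requested Robots.txt
dns_name = {'198.51.100.7': 'crawl7.example-search.net'}
undeclared_list = {'192.0.2.44'}                     # from a published robot IP list
user_agent = {'192.0.2.99': 'Mozilla/5.0 (compatible; Yahoo! Slurp)'}

ROBOT_NAME_WORDS = ('robot', 'bot', 'crawl', 'spider', 'search', 'slurp')

def classify_ip(ip):
    if ip in declared_ips:
        return 'Declared'
    if any(word in dns_name.get(ip, '').lower() for word in ROBOT_NAME_WORDS):
        return 'DNS'
    if ip in undeclared_list:
        return 'Undeclared list'
    if any(word in user_agent.get(ip, '').lower() for word in ROBOT_NAME_WORDS):
        return 'Browser'
    return 'Other'

for ip in ('203.0.113.5', '198.51.100.7', '192.0.2.44', '192.0.2.99', '192.0.2.1'):
    print(ip, '->', classify_ip(ip))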
Table 2 Distribution of usage by method of finding robot identity

Robot identity     Usage %
DNS                16.1
LOCKSS             7.4
Undeclared list    6.5
Browser            2.6
Declared           0
Others             67.4

6.1. Declared robot identities

Declared robots are robots that on entering the site report to the Robots.txt document. These robots identify themselves by having read this document and are recorded in the logs as having viewed it. All accesses made by IP numbers that have viewed the Robots.txt document are classified as views by declared robots. In all there were just 36 declared robots and their use amounted to 1422 views, or less than 0.1% of use, which of course means that most robot activity goes unreported. A DNS name could not be found for five (half) of the top 10 declared robots using a reverse DNS lookup.9 Table 3 gives the top 10 declared robots.
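A minimal Python sketch of this step follows (an editorial illustration, not the study's SPSS procedure): it collects the IPs that requested the Robots.txt document and counts all views made by those IPs. The view tuples are invented and follow the (IP, time, URL, agent) layout assumed in the Methods sketch above.

# Sketch: flag as 'declared robots' every IP that requested /robots.txt, then
# count all views made by those IPs.
from collections import Counter

def declared_robot_usage(views):
    declared_ips = {ip for ip, _, url, _ in views
                    if url.split('?')[0] == '/robots.txt'}
    usage = Counter(ip for ip, _, _, _ in views if ip in declared_ips)
    return declared_ips, usage

sample_views = [                                            # invented records
    ('203.0.113.5', 't1', '/robots.txt', 'ExampleBot/1.0'),
    ('203.0.113.5', 't2', '/cgi/content/abstract/17/1/1', 'ExampleBot/1.0'),
    ('192.0.2.7',   't3', '/cgi/content/full/17/1/1', 'Mozilla/5.0'),
]
ips, usage = declared_robot_usage(sample_views)
print(ips)     # {'203.0.113.5'}
print(usage)   # views attributable to declared robots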
Table 3 Top 10 declared robots by use

DNS or IP details             %
svlproxy10.hmdevlab.com       26.7
216.52.28.145                 14.8
203.184.138.131               12.1
tide516.microsoft.com         8.9
tide121.microsoft.com         6.5
tide18.microsoft.com          6.4
203.184.138.132               4.8
tide166.microsoft.com         4.4
220.165.193.139               3.3
222.189.238.140               3.0
As % of declared robot use    90.9%

N.B. 1% accounts for about 14 views.
6.2. Robots’ identities derived by DNS name

IP numbers were converted into their Domain Name Server (DNS) equivalent identity via a process of reverse DNS lookup. Initially all DNS names containing the terms robot, bot, search, spider and crawler were selected; however, this group was found also to include some non-robot identities. Eight DNS names were identified as generating robots: Googlebot, MSNbot, Inktomisearch, Cosmixcorp, Charlotte.search, Fastsearch, Exabot and Search.live. In total this procedure identified 4140 robots and these robots accounted for 692,293 views or approximately 16.1% of usage. Out of the 4140 robots just two had visited the Robots.txt document. In all 3444 different Inktomisearch robots were found to have visited the site; one of these, lj612516.inktomisearch.com, had visited the Robots.txt document. Further, one of the 118 Search.live robots, livebot-65-55-209-97.search.live.com, had visited the Robots.txt file. DNS identification added a further 4138 robots to the list we had for declared robots and accounted for an additional 692,291 (16.1%) views. The top 10 new robots found by this procedure accounted for 40.1% of all DNS identified robot activity (Table 4). As can be seen from the table, robots use more than one IP or DNS identity. Thus, for example, livebot search.live appeared five times. In fact, for this study, livebot search.live activity made up over a third (34%) of DNS identified robot activity and Googlebot a quarter (23%).

Table 4 Top 10 DNS derived robots by use

DNS or IP details                       %
crawl2.cosmixcorp.com                   12.6
charlotte.searchme.com                  3.9
crawler-gw-02.bos3.fastsearch.net       3.8
livebot-65-54-188-83.search.live.com    3.4
msnbot.msn.com                          3.3
livebot-65-54-188-80.search.live.com    3.1
livebot-207-46-98-67.search.live.com    2.9
crawler-gw-01.bos3.fastsearch.net       2.5
livebot-65-54-188-82.search.live.com    2.4
livebot-207-46-98-68.search.live.com    2.2
As % of DNS identified robots           40%

N.B. 1% accounts for 6923 views.
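The reverse DNS step can be sketched in Python as below (an illustration only; as described above, the study then narrowed the keyword matches to eight robot DNS names). socket.gethostbyaddr() performs the reverse lookup; the addresses shown are documentation placeholders, not IPs from the study's logs.

# Sketch of the reverse DNS step: resolve each IP to a host name and keep those
# whose name contains a robot-like term. A failed lookup (no PTR record) is
# itself a hint of an owner trying to stay anonymous.
import socket

ROBOT_TERMS = ('robot', 'bot', 'search', 'spider', 'crawl')

def robot_dns_ips(ip_addresses):
    found = {}
    for ip in ip_addresses:
        try:
            host = socket.gethostbyaddr(ip)[0].lower()
        except OSError:
            continue          # no DNS identity; candidate for behavioural checks
        if any(term in host for term in ROBOT_TERMS):
            found[ip] = host
    return found

# Usage (placeholder addresses):
print(robot_dns_ips(['192.0.2.10', '198.51.100.20']))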
6.3. Undeclared robot list

A list of robots and their known IP addresses was downloaded (www.robotstxt.org/) and usage by these IP numbers was then picked out from the log files. In total this procedure identified 1775 robots, and robots identified from the list accounted for 635,668 views or approximately 14.8% of usage. Many of the robots found had previously been found by those viewing the Robots.txt document and by searching through the DNS name. The undeclared list method picked up 1314 additional robots and accounted for an additional 280,962 or 6.5% of views. The top 10 new robots found by this method are reported in Table 5 and accounted for just under a quarter of the usage identified by this method.

Table 5 Top 10 by use of undeclared list robots

DNS or IP details                     %
64.124.85.78.become.com               6.4
egspd42222.teoma.com                  3.3
egspd42245.ask.com                    2.8
66-194-55-242.static.twtelecom.net    2.6
64.242.88.50                          2.1
usstls-23.savvis.net                  1.7
207.46.98.63                          1.6
egspd42222.ask.com                    1.0
sv-fw.looksmart.com                   0.9
8.0/25.61.241.63.in-addr.arpa         0.6
As % of undeclared robot use          23.2%

N.B. 1% accounts for 2810 views.

6.4. Robots’ identities within browser details
A detailed review of the logs showed that the client browser details gave additional information as to the identity of the user. For example, use by Inktomisearch was found to include the following information in the browser details: ‘Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)’. That is, Inktomisearch could also be identified by the word Slurp in the browser details. Further, MSNbot included the following in its browser details: ‘msnbot/1.0 (+http://search.msn.com/msnbot.htm)’. These two examples, though, do not add any further information in that these two robots were successfully identified by IP and reverse DNS lookup anyway. There were, however, instances where new robots were found, for example in the case of the following browser record: ‘FAST Enterprise Crawler 6/Scirus [email protected]; http://www.scirus.com/srsapp/contactus/’. This appears to be a crawler operated by Fast on behalf of Scirus. Also Gaisbot, a personal robot, appeared in the browser details of 4827 use records.

It is not robust to classify all IP numbers with a robot identity based on browser details, as the use of proxy connections means that many connections, including but not exclusively the robot connection, might use the same IP number. Given this, we can only report on usage and not the absolute number of IP identities. In total this procedure identified 469,022 views or approximately 11.0% of usage. Many of the robots found had previously been found by one of the previous methods: viewing the Robots.txt document, the DNS name and the undeclared robot lists. This method picked up a further 112,182 (2.6%) views from additional robots. The top 10 new robots found in this grouping accounted for a little under three quarters (72.8%) of browser identified use (Table 6).
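The browser-details check can be illustrated with the Python sketch below (an editorial illustration): it counts views whose user agent string contains a robot marker such as Slurp or Gaisbot, reporting usage rather than classifying IPs, in line with the proxy caveat above. The sample records and the exact marker list are assumptions.

# Sketch of identification from browser (user agent) details: look for robot
# markers in the agent string and count the views they account for.
from collections import Counter

AGENT_MARKERS = ('slurp', 'msnbot', 'crawler', 'gaisbot', 'scirus')

def browser_identified_views(views):
    """views: iterable of (ip, time, url, agent) tuples."""
    hits = Counter()
    for ip, _, _, agent in views:
        lowered = agent.lower()
        if any(marker in lowered for marker in AGENT_MARKERS):
            hits[ip] += 1
    return hits

sample = [                                                  # shortened examples
    ('192.0.2.30', 't1', '/cgi/content/abstract/17/1/1',
     'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)'),
    ('192.0.2.9', 't2', '/cgi/content/full/17/1/1', 'Mozilla/5.0 (Windows; U)'),
]
print(browser_identified_views(sample))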
Table 6 Top 10 robots identified only by means of browser information

DNS or IP details                           %
70.42.51.20                                 23.8
136.187.19.99                               22.4
tpiol.tpiol.com                             7.3
69.25.71.12                                 4.8
dns-tester.irl.cs.XXXXX.edu                 4.1
72.5.173.21                                 3.0
24-177-134-6.static.ncr.charter.com         2.3
207.68.157.51                               1.9
EV-ESR1-72-49-246-181.fuse.net              1.7
71-13-115-117.static.mdsn.wi.charter.com    1.5
As % of browser identified robot use        72.8%

1% accounts for 1122 views.
6.5. LOCKSS

LOCKSS is a particular type of automated procedure and reflects archiving by library organizations. The LOCKSS system uses a robot to collect e-journal content from publishers’ web sites. Publishers are encouraged to grant libraries legal permission to cache and archive their content. A web page on the publisher’s web site called the publisher manifest, which is recognizable by the crawler, contains a specific permission statement and permits the robot to cache, collect and preserve the content. This is a voluntary code and there will be undeclared LOCKSS users. For this study declared LOCKSS users totalled 97 and views amounted to 318,569 (7.4% of views). As shown in Table 7, the top 10 users identified by use of LOCKSS accounted for just under a third (30.1%) of usage by this group. The names in the logs refer to universities archiving content. University names have been made anonymous.

Table 7 Top 10 by use of LOCKSS

DNS or IP details                           %
sul-lockss21.XXXXX.EDU                      4.0
sul-lockss-floater2.XXXXX.EDU               3.8
sul-lockss-floater1.XXXXX.EDU               3.3
lockss.XXXXX.edu                            3.1
sul-lockss-floater5.XXXXX.EDU               2.9
lockss.cul.XXXXX.edu                        2.6
lockss.library.XXXXX.edu                    2.6
cchem-lat-169-229-198-179.LIPS.XXXXX.EDU    2.6
lockss.lib.XXXXX.ac.uk                      2.6
sul-lockss-floater6.XXXXX.EDU               2.6
As % of LOCKSS use                          30.1%

1% accounts for 3186 views.

6.6. Identifying robots through behavioural patterns

As explained earlier, identifying robots through IP, DNS, industry lists and browser details is not a very effective way of identifying robots, as many robots set out to obscure their identity. An alternative and more productive method is to attempt to classify robots by behaviour. Robots are mechanical, hence they do not sleep, can jump from page to page at high speed and can consume a high volume of pages relatively quickly. Hence we might expect robots to variously generate a high usage count, view pages very rapidly, and view over a long period of time.

The following is probably a good example of what looks like robot activity. User 12.188.44.116 systematically viewed every article in every issue of Glycobiology. They viewed 4093 pages between 23 and 27 February 2007. Viewing began at about five minutes past one in the morning on the 23rd and proceeded without interruption for four days. This appears to be robot activity; however, the IP or browser details did not classify it as such. Also, although viewing went on for 24 hours a day over a four-day period, suggesting automated use, other metrics were not typical of a robot. For example, typical robot use is characterized by page view times of less than three seconds. After all, robots do not ‘read’ and, being mechanical, can scoop up pages at a very fast rate. However, this user took on average10 just over a minute to digest each page. This begs the question: is this a robot pretending to be a human user, or a human user who decided to stay up four days to read every abstract and document on this site; or perhaps an aggregator using a temporary IP number offering worldwide round the clock service? However, this would not explain the systematic viewing of all content: volume by volume and issue by issue.

6.6.1. High viewing activity

Table 8 identifies the top 15 users by IP identity ranked by the maximum views on any one day. The argument here is that we would expect robots to view large amounts of information from a site in a day. The top ranked user is 4.79.217.19, who made a total of 7390 views on a single day and in total viewed 15,629 pages over the survey period (January 2005 to January 2007). This user was not identified as a robot via the procedures previously outlined. Out of the 15 IPs identified (Table 8) 10 had previously been found by identity methods; that is, the maximum, or heavy, daily download method identified a further five potential robots. That is, identity methods were only two thirds successful in identifying these robots. Four of the 15 users did not have a DNS identity obtainable via a reverse DNS lookup and the owners of these IP numbers have made some effort to remain anonymous.

Of course, at issue is what we define as heavy or robotic use. Table 9 gives the number of IPs identified as robots at different levels of maximum daily views. The table also gives for each level the percentage of robots recognized through identity methods. Taking a threshold value of 2000 would furnish 24 IP numbers as robots; identity methods would have spotted 83% of them. Further, these 24 IP numbers accounted for 214,875 (5%) views. Moving the threshold value to 1000 and above maximum daily views classifies 120 IPs as robots; 86% of these were previously identified as robots and these 120 IPs accounted for 807,993 views or 20% of all views.

Table 8 Top 15 by maximum daily views

DNS or IP identity                               Max daily number of views    Total views    Class by identity methods
4.79.217.19                                      7390                         15,629         N/C
136.187.19.99                                    6328                         27,147         Browser
ps3.pbgc.gov                                     4757                         5567           N/C
charlotte.searchme.com                           4658                         26,675         DNS
195.149.117.2                                    4536                         6737           N/C
gw.ptr-80-238-227-5.customer.ch.netstream.com    3138                         4544           N/C
crawler-gw-01.bos3.fastsearch.net                3077                         17,612         DNS
sul-lockss21.XXXXX.EDU                           2832                         12,791         LOCKSS
egspd42222.teoma.com                             2578                         9383           Undeclared
c06.ba.accelovation.com                          2542                         2542           Browser
lockss0.umdl.XXXXX.edu                           2502                         5685           LOCKSS
mckeldin-411.XXXXX.edu                           2432                         5038           LOCKSS
129.79.35.196                                    2427                         9806           N/C
sul-lockss-floater6.XXXXX.EDU                    2427                         8193           LOCKSS
lockss.library.XXXXX.edu                         2427                         7663           LOCKSS
Total views: 165,015

N/C is not classified.
Table 9 The number of IPs identified as robots for different levels of maximum daily page views

Levels of maximum daily views    No. of IP numbers    % by identity methods    Usage
Greater than 2000                24                   83%                      214,875
Greater than 1000                120                  86%                      807,993
Greater than 500                 152                  79%                      1,049,077
Greater than 250                 183                  68%                      1,160,499
Greater than 100                 575                  47%                      1,376,849
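For illustration, the maximum daily views metric underlying Tables 8 and 9 can be computed as in the following Python sketch (not the study's SPSS procedure); the example data is invented, although the first IP address is taken from Table 8.

# Sketch of the 'maximum daily views' metric: count views per IP per day, take
# each IP's busiest day, and flag IPs above a chosen threshold. Input is
# assumed to be (ip, date) pairs already extracted from the log.
from collections import Counter, defaultdict

def max_daily_views(ip_date_pairs):
    daily = Counter(ip_date_pairs)              # (ip, date) -> views on that day
    peaks = defaultdict(int)
    for (ip, _), count in daily.items():
        peaks[ip] = max(peaks[ip], count)
    return peaks

def flag_by_threshold(peaks, threshold):
    return {ip for ip, peak in peaks.items() if peak > threshold}

# Invented example: one IP makes 2500 requests in a single day.
pairs = [('4.79.217.19', '2006-03-01')] * 2500 + [('192.0.2.7', '2006-03-01')] * 5
peaks = max_daily_views(pairs)
print(flag_by_threshold(peaks, 1000))           # {'4.79.217.19'}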
Table 10 Metrics comparing robot (by identity) and non-robot activity

                       Average number of views per IP    Average maximum daily views    Time difference between views (seconds)
Non-robots             4.6                               4.1                            12.5
Robot (by identity)    43.8                              6.6                            51.1
We have been pursuing the idea that robots conform to a particular type of behaviour because they are essentially computer program code. That is, for example, they view a vast number of web site pages in a relatively short period of time. Hence their metric footprints, such as maximum daily views, will be large. But there are no set rules as to how to identify the threshold above which we are sure only to include robots. Using the data for robots identified via an IP, DNS or browser identity, Table 10 sets out some broad characteristics of robot use and compares this with non-robot use. The average number of page views (Huber's M-estimator), over the whole period, for non-robot IPs was about 5 while for robots this was higher by about 6 times and was estimated at 28 views per IP address. Furthermore, the average maximum daily views were higher: 6.6 compared to 4.1. This confirmed that, in general, robots do view a higher number of pages. However, very much against expectation, robots also recorded a higher time difference between views compared to non-robot users: 51 seconds compared to 13 seconds. Table 10 then does not really provide us with a threshold for maximum daily views above which we can be sure of just selecting robots. Even a relatively high threshold value, such as a maximum daily count of 100 page views, risks including proxy server use or cases where a single computer is used by multiple users.

6.6.2. Rapid viewing

Table 11 lists the top 15 IP/DNS identities by the time difference between views, with those with the shortest times first. Time differences of one, two and three seconds are considered; the five ranked highest by total views from each group are given in Table 11. Out of the 15 just five had previously been found by other methods. A question remains, and this is true of all behaviour methods: can we be sure that the IPs found are robots and not human users? The logic of classifying robots by rapid viewing is that humans are unlikely to take one, two or three seconds to access, read and move on to the next page. Furthermore, five of the 15 IPs did not have a DNS identity and were thus anonymous. This method did pick up relatively small robots such as Crossref.org, which was perhaps checking for reference links. Table 10 above found that robots as a group recorded a higher time difference between views compared to non-robot users. Being programs, robots can be instructed to behave in a variety of different ways, for example to access a new page every six seconds, or once every hour.

Table 11 Top 15 by time difference between views

DNS or IP identity                                 Time difference between requests    Total views    Class by identity methods
usstls-23.savvis.net                               1                                   4745           Undeclared
134.243.90.12                                      1                                   3675           N/C
fwext1.uhbs.ch                                     1                                   2655           N/C
81.222.64.10                                       1                                   2127           N/C
indri4.cs.XXXXX.edu                                1                                   1930           N/C
57.67.25.190                                       2                                   1523           Browser
61.135.131.238                                     2                                   1447           Undeclared
blackberry.XXXXX.edu                               2                                   875            Browser
cr4.crossref.org                                   2                                   600            N/C
robot5.rambler.ru                                  2                                   564            N/C
sul-lockss21.XXXXX.EDU                             3                                   12,791         LOCKSS
client.uk.hub.XXXXX.com                            3                                   3256           N/C
82-41-152-104.cable.ubr01.linl.blueyonder.co.uk    3                                   1484           N/C
rb1-gw-199-broad-127.XXXXX.edu                     3                                   1166           N/C
61.186.190.72                                      3                                   829            N/C
Total views: 33,899

N/C is not classified.

6.6.3. Pattern of viewing activity

An alternative to identifying robots by their high volume viewing and request time differences is to identify them by their pattern of accesses. Here we consider a simple method to illustrate the technique and identify IP numbers by whether the following view has the same time difference as the previous one; IPs are then ranked by how many times this happens. The argument here is that robots will be much more regular or patterned in their behaviour. Table 12 gives the top 15 IP numbers by this criterion. It was found that LOCKSS tended to have a six second time difference between views; six out of the seven identified did, while other robots recorded an 11 second or one minute pattern. A closer examination of the data indicates that, in many cases, the time difference between views might vary by a second, or that where a robot has accessed a cached page the time difference is cumulated, so a typical 6 second difference is recorded as a 12 second difference. Both of these suggest that the number of views sharing the same time difference will be understated. All but one of the 15 IP numbers identified had previously been identified as a robot by other methods. This is a strong indication that the digital footprint of robots will be patterned.

Table 12 Top 15 by frequency of pattern between views

DNS or IP identity                   Frequency of view pattern    Seconds between views    Total views    Class by identity methods
crawl2.cosmixcorp.com                7839                         Mixed                    87,467         DNS
70.42.51.20                          6127                         60                       42,743         Browser
136.187.19.99                        5425                         11                       27,147         Browser
fds01.ent.XXXXX.edu                  4127                         11                       8264           N/C
crawler-gw-02.bos3.fastsearch.net    3580                         60                       26,191         DNS
charlotte.searchme.com               3396                         30                       26,675         DNS
sul-lockss-floater1.XXXXX.EDU        2758                         6                        10,529         LOCKSS
69.25.71.12                          2473                         60                       17,155         Browser
lockss.lib.XXXXX.ac.uk               2442                         6                        8199           LOCKSS
lockss.library.XXXXX.edu             1971                         6                        8375           LOCKSS
tesla.cbi.cnptia.embrapa.br          1854                         6                        7096           LOCKSS
lockss.cul.XXXXX.edu                 1853                         11                       8401           LOCKSS
tpiol.tpiol.com                      1768                         60                       4125           Browser
lockss0.umdl.XXXXX.edu               1704                         6                        5685           LOCKSS
sul-lockss-floater2.XXXXX.EDU        1662                         6                        12,176         LOCKSS
Total views: 300,228
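The patterned-viewing metric of Tables 12 and 13 can be sketched as follows (an editorial illustration): for each IP the gaps between consecutive requests are compared and a count is kept of how often a gap repeats the previous one. The host names and timings in the example are invented.

# Sketch of the 'patterned viewing' metric: count how often the gap between
# consecutive requests equals the previous gap. A high count suggests a
# fixed-interval robot (e.g. the six-second LOCKSS pattern noted above).
from collections import defaultdict

def repeated_interval_counts(requests):
    """requests: iterable of (ip, timestamp_in_seconds) pairs."""
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)
    counts = {}
    for ip, times in by_ip.items():
        times.sort()
        gaps = [b - a for a, b in zip(times, times[1:])]
        counts[ip] = sum(1 for g1, g2 in zip(gaps, gaps[1:]) if g1 == g2)
    return counts

# Invented example: a robot requesting a page every 6 seconds versus a human.
robot = [('lockss.example.edu', 6 * i) for i in range(20)]
human = [('192.0.2.7', t) for t in (0, 40, 95, 180, 400)]
print(repeated_interval_counts(robot + human))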
Table 13 Number of IPs identified as robots for different levels of frequency of time patterned views

Frequency of patterned views    No. of IP numbers    % by identity methods    Usage
More than 500                   88                   78%                      691,406
More than 250                   153                  81%                      899,687
More than 100                   222                  70%                      1,101,612
More than 50                    336                  58%                      1,285,084
Table 13 considers different threshold levels of frequency of time patterned views, where the following view records the same time difference as the previous view. Taking a threshold value of 500 would select 88 IP numbers as robots; identity methods would have selected 78% of these as robots. Further, these 88 IP numbers accounted for 691,406 (16%) views. Halving the threshold value to 250 classifies 153 IPs as robots; 81% of these were previously identified as robots and these 153 IPs accounted for 899,687 views or 21% of all views.

6.6.4. Duration of viewing activity

Table 14 lists the top 15 IP or DNS identities by average (median) number of hours a day when the identity was active. Robots do not sleep, so we might expect robots, unlike humans, to operate over a 24 hour period. Accepting this as a means of robot identification would indicate that identifying robots via IP, DNS and browser identities succeeds about two thirds (67%) of the time. The metric did pick up some large usage robots, for example crawler-gw-02.bos3.fastsearch.net, which was previously identified by IP/DNS identity. Five of the IP identities listed were not found via a reverse DNS lookup. This method also picked up two users who were unlikely to be robots: xd-22-132-a8.bta.net.cn and h96n1-vj-d4.ias.bredband.telia.com. Both of them made a relatively small number of views, under 100, and seemed to be connecting via an ISP. Table 15 considers different threshold levels of average (median) number of hours a day that the IP was active and where total views exceeded 150 over the research period. This limit was selected to exclude potential human users. Taking a threshold value of 15 hours would classify 20 IP numbers as being robots; identification methods would have selected 65% of these. Further, these 20 IP numbers accounted for 268,570 (6%) views. Halving the threshold value to about 7 hours classifies 71 IPs as robots; 73% of these were previously identified as robots and these 71 IPs accounted for 515,297 (12%) views.
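The hours-active metric behind Tables 14 and 15 can be illustrated with the short Python sketch below (not part of the original analysis); it counts the distinct hours in which an IP was seen on each day and takes the median across days. The example identities and timestamps are invented.

# Sketch of the 'hours active per day' metric: number of distinct hours with at
# least one request on each day, then the median across days per IP.
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

def median_active_hours(requests):
    """requests: iterable of (ip, datetime) pairs."""
    hours = defaultdict(lambda: defaultdict(set))       # ip -> date -> {hours}
    for ip, when in requests:
        hours[ip][when.date()].add(when.hour)
    return {ip: median(len(h) for h in days.values()) for ip, days in hours.items()}

# Invented example: a robot active round the clock versus a daytime human user.
start = datetime(2006, 3, 1)
robot = [('crawl.example.net', start + timedelta(hours=h)) for h in range(48)]
human = [('192.0.2.7', start + timedelta(hours=h)) for h in (9, 10, 11, 14)]
print(median_active_hours(robot + human))               # robot: 24, human: 4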
Table 14 Top 15 IP numbers by average (median) number of hours a day that the IP was active

DNS or IP identity                       Average no. of hours a day    No. of views    Class by identity methods
70.42.51.10                              24                            7449            N/C
12.188.44.116                            24                            4093            N/C
xd-22-132-a8.bta.net.cn                  24                            128             N/C
crawler-gw-02.bos3.fastsearch.net        23                            26,191          DNS
207.46.98.63                             23                            4606            Undeclared
66.160.159.221                           22                            1758            N/C
70.42.51.20                              21.50                         42,743          Browser
livebot-207-46-98-67.search.live.com     21.50                         20,417          DNS
crawl4.topix.net                         21.50                         10,588          N/C
tpiol.tpiol.com                          20                            4125            Browser
livebot-65-55-248-148.search.live.com    20                            1007            DNS
livebot-65-55-248-147.search.live.com    20                            780             DNS
livebot-207-46-98-61.search.live.com     19.50                         9714            DNS
h96n1-vj-d4.ias.bredband.telia.com       18.50                         80              N/C
crawl2.cosmixcorp.com                    18                            87,467          DNS
Total views: 221,146
Table 15 Number of IPs identified as robots for different average (median) number of hours a day that the IP was active and where total views exceeded 150

Number of active hours    No. of IP numbers    % by identity methods    Usage
Over 15                   20                   65%                      268,570
Over 10                   39                   72%                      411,067
Over 7                    71                   73%                      515,297
Over 5                    139                  67%                      631,752
Over 3                    290                  69%                      819,301
Over 2                    427                  66%                      1,034,633
Over 1                    698                  60%                      1,277,839

6.7. Metric summary
If the three behaviour metrics are taken together – IPs requesting over 500 maximum daily views (Table 9); IPs with a frequency patterned time sequence greater than 100 (Table 13); and IP numbers active for an average of 7 hours a day and over (Table 15) – then just 286 IP numbers would be defined as potential robots. These 286 IP numbers, identified as robots by their behaviour, made up less than 0.1% of all IP numbers but accounted for a massive 1,262,386 page views, or 29% of usage. There was considerable duplication of IP numbers between the three behaviour methods and more than three quarters of this usage (79%) was identified by two or more of the behavioural procedures.

Taking the behavioural and identification methods together found 1,754,846, or 40.7%, of page views to be robotic or mechanical. Surprisingly, just half (52%) of robot usage was identified by both identification and behavioural methods; 28% was uniquely identified by identification methods and 20% uniquely identified by behavioural methods. It had been expected that a higher proportion of robot activity would be picked up by both methods, but not all robots behave in the same way and many robots are in fact quite small. Behavioural methods are poor at selecting small robots and perform better at identifying large robots, a result of setting relatively high threshold values: these values are set high so as not to include genuine users, but at these levels small robots are not included.

Table 16 looks at metrics across different types of robot use identified by IP/DNS identity. Use by the LOCKSS archive was most characteristic of postulated robot activity, recording a relatively high average maximum daily views figure of over 1000 and a relatively fast download time of, on average (Huber's M-estimator), 8 seconds between views. Fastsearch.net also recorded a high average maximum daily figure but this was matched by a greater, and typically more human, time difference between downloads of 32 seconds. There is, however, little to differentiate the metric fingerprints of Googlebot and Inktomisearch from typical human users; that is, these robots recorded a relatively small average maximum daily view figure of less than 20 downloads and this was coupled with an average time difference between views in excess of a couple of minutes: metrics that suggest an authentic and scholarly use.
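The combination of methods described above can be expressed as simple set arithmetic, as in the following Python sketch (an editorial illustration); the threshold defaults follow the values quoted in this section, and the input dictionaries stand in for the outputs of the earlier behavioural sketches.

# Sketch of the metric summary: combine the three behavioural flags and compare
# the result with the identity-based classification. Inputs are placeholders.
def summarise(identity_ips, peaks, pattern_counts, active_hours,
              peak_threshold=500, pattern_threshold=100, hours_threshold=7):
    behavioural = (
        {ip for ip, v in peaks.items() if v > peak_threshold}
        | {ip for ip, v in pattern_counts.items() if v > pattern_threshold}
        | {ip for ip, v in active_hours.items() if v > hours_threshold}
    )
    return {
        'identity only':  identity_ips - behavioural,
        'behaviour only': behavioural - identity_ips,
        'both':           identity_ips & behavioural,
    }

# Invented example inputs.
result = summarise(
    identity_ips={'a', 'b'},
    peaks={'a': 900, 'c': 2500},
    pattern_counts={'b': 40},
    active_hours={'c': 20},
)
print({k: sorted(v) for k, v in result.items()})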
Table 16 Use metrics of six IP/DNS identified robot groupings

Robot grouping    Number of robot IP identities    Total use    Average use per IP address    Average max daily views    Time difference between views (seconds)
LOCKSS            97                               318,569      2,268                         1,260.5                    7.7
Fastsearch.net    3                                44,559       16,137                        1,603.9                    32.4
Search.live       103                              235,827      442                           92.8                       77.2
MSNbot            17                               23,325       233                           60.5                       73.5
Googlebot         449                              162,067      152                           17.7                       196.8
Inktomisearch     3167                             86,485       23                            2.9                        232.2
7. Conclusion
This paper estimates that two-fifths of all web activity on a site can be attributed to robots, and this proportion is likely to become greater as the web is populated by an increasing number of robots and mechanical agents. Identifying robots by identity within the logs reveals that about a third of all use is a result of robot activity; however, this method is only successful about 80% of the time in locating robots. Adding the behaviour of IP identities suggests that the amount of use that can be attributed to robots is as much as 40%.

The paper set out to verify robots via identity and behaviour methods. Robot identities were researched by checking what IP numbers had visited the Robots.txt document, looking for robot names in the DNS name, selecting IP numbers based on existing lists of undeclared robots and, finally, checking browser details for robot identities. Behavioural methods looked specifically for the metric footprint of robots. It was hypothesized that robots would be likely to generate a high usage count and a high daily count of usage, that their usage would be patterned, and that they would operate over a 24 hour period and access content at a rapid rate. In all, the determining of robots by identification indicated that just about a third (32.6%) of usage was robotic. Further, it was found that the identification of robots, particularly large ones, by behaviour was also successful. Four behaviour metrics were used: the maximum views made in a day; the time difference between views; a patterned download; and the average number of hours active in a day. It was found that using behaviour methods identified an additional 20% of robot activity.

It was apparent from looking at overall metrics that robot use, especially that of small robots, can appear quite human. Robots do not all behave in the same way; being programs, robots can be instructed to behave in a variety of different ways. This includes recording a longer time gap between views, employing multiple IP identities and limiting the number of views attributed to any single IP address. There may well be random ‘walk’ or random access time robots. In such cases it is difficult to distinguish mechanical users from human users. Thus it was found that there was little to differentiate the metric fingerprints of Googlebot and Inktomisearch from typical human use; that is, these robots recorded a relatively small average maximum daily figure of less than 20 downloads and this was coupled with an average time difference between downloads in excess of a couple of minutes: metrics that suggest an authentic and scholarly user. Behavioural methods of identifying robots require the setting of careful threshold levels and these levels have to be set high to avoid including human use; however, these levels also mean that small use robots are not identified.

Robot activity probably does not impact negatively on COUNTER compliant statistics produced by publishers for academic libraries to enable them to assess the usage of material they pay for. This is largely because robots are not university based, although it is thought that some academics run their own robots; these are unlikely to be very big, but academics could set up a robot to read their own papers. However, LOCKSS may impact on COUNTER.

This paper forms part of an analysis to establish usage patterns for the OUP published journal Glycobiology, which offers articles in both open access and non-open access forms. Robots will have an impact here, as a comparison will be made between the two types of content: open access versus content requiring a user name and password. Robots will have access to OA content but not restricted content and hence would inflate the usage of the OA material compared with the restricted content. This paper has shown that identifying robots is a complex procedure and that using either identification or behavioural methods alone is unlikely to reveal the full extent of robot activity. If the full extent of robot use cannot be identified, perhaps instead we can do the reverse, that is, identify use that we know is derived from human activity. This would result in a three-part use classification: robot use, human use and use that could be attributed to either robots or humans. The extent of OA use can then be estimated and compared to restricted content use for the three groupings.
Endnotes

1. Centre for Information Behaviour and the Evaluation of Research (www.ucl.ac.uk/ciber/).
2. www.cyveillance.com/
3. http://cif.iis.u-tokyo.ac.jp/e-society/
4. www.indyproject.org/
5. A robot used by Nielsen BuzzMetrics (www.nielsenbuzzmetrics.com/).
6. A search engine that was acquired by Yahoo in 2002 (http://en.wikipedia.org/wiki/Inktomi).
7. www.ucl.ac.uk/slais/research/ciber/virtualscholar/
8. www.lockss.org/lockss/Home
9. A Whois lookup may reveal additional information. This is a manual rather than an automated procedure and hence was not included in this analysis. If the owner of an IP number has hidden the DNS identity then this is part of a process to remain anonymous.
10. Median = 63 seconds, mean = 89.6 seconds.
References

[1] D. Nicholas, P. Huntington and H.R. Jamali, The impact of open access publishing (and other access initiatives) on use and users of digital scholarly journals, Learned Publishing 20(1) (2007) 11–15.
[2] A. Gutzman, Analysing Traffic on Your E-commerce Site (1999). Available at: www.ecommerceguide.com/solutions/technology/article.php/9561_186011 (accessed 21 October 2007).
[3] D. Nicholas, P. Huntington and H.R. Jamali, Open access in context: a user study, Journal of Documentation 63(6) (2007) 853–78.
[4] P. Tan and V. Kumar, Discovery of web robot sessions based on their navigational patterns, Data Mining and Knowledge Discovery 6(1) (2002) 9–35.
[5] A.G. Lourenço and O.O. Belo, Catching web crawlers in the act. In: D. Wolber et al. (eds), Proceedings of the 6th International Conference on Web Engineering, Palo Alto, California, USA (ACM, New York, 2006) 265–72.
[6] M. Yoon, Web Robot Detection (2000). Available at: http://photo.net/doc/robot-detection.html (accessed 20 June 2007).
[7] S. Jackson, Building a Better Spider Trap (1998). Available at: www.spiderhunter.com/spidertrap (accessed 21 October 2007).
[8] D. Eichmann, Ethical web agents, Computer Networks and ISDN Systems 28(1) (1995) 127–36.
[9] M. Koster, Guidelines for Robots Writers (1993). Available at: www.robotstxt.org/wc/guidelines.html (accessed 21 October 2007).
[10] The Robots Exclusion Standard (1996). Available at: www.robotstxt.org/wc/exclusion.html#robotstxt (accessed 21 October 2007).
[11] M.D. Dikaiakos, A. Stassopoulou and L. Papageorgiou, An investigation of web crawler behavior: characterization and metrics, Computer Communications 28(8) (2005) 880–97.
[12] V.S. Pai, L. Wang, K. Park, R. Pang and L. Peterson, The dark side of the web: an open proxy's view, ACM SIGCOMM Computer Communications Review 34(1) (2004) 57–62.
[13] K. Park, V.S. Pai, K.W. Lee and S. Calo, Securing web service by automatic robot detection. In: Proceedings of the USENIX Annual Technical Conference, 30 May to 3 June 2006, Boston, MA, USA (USENIX, Boston, MA, 2006) 225–60.
[14] L. von Ahn, M. Blum, N. Hopper and J. Langford, CAPTCHA: using hard AI problems for security. In: E. Biham (ed.), Advances in Cryptology: Proceedings of EUROCRYPT, International Conference on the Theory and Applications of Cryptographic Techniques, Warsaw, Poland, May 4–8, 2003 (Springer, Berlin, 2003) 294–311.