Anti-Scraping Application Development
Afzal Haque and Sanjay Singh
Abstract—Scraping is the activity of retrieving data from a website, often in an automated manner and without the permission of the owner. The scraped data can then be used by the scraper in whatever way they desire. The activity is deemed illegal, but its legal status has not stopped people from doing it. Anti-scraping solutions are offered as rather expensive services which, although effective, are also slow. This paper lists the challenges in developing a Software as a Product (SaaP) anti-scraping application for small to medium scale websites and proposes mitigation techniques.
Sanjay Singh is with the Department of Information & Communication Technology, Manipal Institute of Technology, Manipal University, India. E-mail: [email protected]

Copyright Notice: ©2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Cite This As: H. Afzal and S. Sanjay, "Anti-scraping application development," in Proceedings of the Fourth International Conference on Advances in Computing, Communications and Informatics (ICACCI'15), Kochi, India, August 2015, pp. 869-874.

I. INTRODUCTION

Website scraping is the activity of retrieving data from a particular website at some interval (regular, totally random, or partially random within a range), and is generally performed by bots. Human involvement is usually limited to setting the time interval and, occasionally, designing counter-measures against bot detection and web-analysis techniques in case the target itself is security conscious. Small and medium size websites are the most susceptible to scraping, since building a scraping bot is easy, but anti-scraping services are expensive to employ. Scraping in this domain is mostly used to retrieve competitors' data in order to step up one's own data volume and, subsequently, increase the visitor footfall on one's own website or web service. Various judicial decisions and precedents have been set against scrapers, regardless of whether the scrapers were using the data for their own good [1], pointing to the stringency against the practice, especially when it is mentioned in the terms of service of the scraped site. Another example is eBay vs. Bidder's Edge [2]. However, this has barely made a dent in scraping; on the contrary, it is very much on the rise and seems unaffected at best.
The major challenges faced by a Software as a Product (SaaP) anti-scraping application are mostly related to its environment: it shares the same resources as the website itself. The database server is the same for the application and the website, and so is the hosting server. Consequently, the techniques used by the SaaP model are subject to the user's server capabilities. Added to this is the fact that the website is a data-rich one, so the database is already under a lot of load. Given that the anti-scraping application is limited by the user's server capabilities, we must assume these capabilities are rather modest compared with the number of CPU-intensive techniques to be implemented; all the restrictions that follow stem from this. Even though page loading times will increase, the increase should be kept as low as possible: the greater the page loading time, the lower the visitor footfall and traffic, so keeping it small must be a priority. This paper aims to shed light on the restrictions faced by anti-scraping applications and offers some ways to mitigate their effect.

The rest of the paper is organized as follows. Section II discusses various anti-scraping techniques and mitigation measures against scraping. Section III describes an example of the proposed anti-scraping application. Section IV discusses various issues related to anti-scraping application development, and finally Section V concludes the paper.

II. ANTI-SCRAPING TECHNIQUES AND MITIGATION

The aim of an anti-scraping application is to keep the increase in page-loading time to a minimum without compromising security. Therefore, all activities that require database transactions and are not essential to the immediate working of the application are performed via cron jobs [3] at times when the site does not record many hits. Another important concept required for the following techniques is the use of three lists:
1) Black list: list of IPs to block.
2) Gray list: IPs which might be scrapers and have shown suspicious activity. IPs in this list have
to, for some days, repeatedly solve CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) [4][5] every time they start a session on the site.
3) White list: normal IPs/IP sources with legitimate traffic.

Another technique is to use different IP tables for IPs from different locations. This places only a small load on the hosting server: that of selecting the table, based on the IP, in which to search whether the IP is black-listed, gray-listed, or unclassified. The location area can be subdivided further if the number of visits from a region is too high. This method significantly reduces the page loading time by reducing the number of records the query has to go through to find the required IP address.

In order to prevent the blocking of search-engine spider bots, we take the help of the robots.txt file [6][7]. Any visitor that has already been recognized as a bot and that ignores the robots.txt file can immediately be black-listed. This is both a precautionary measure and a technique to prevent scraping, mainly because if search-engine bots are blocked, the site stops receiving almost 90% of its new visitors, which would be a major loss.
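As an illustration of this lookup, the following Python sketch checks an incoming IP against per-location black and gray list tables before the page is rendered. It is a minimal sketch only: the MySQL Connector/Python driver, the table naming scheme (blacklist_<region>, graylist_<region>) and the prefix-to-region mapping are assumptions, not part of the proposed application.

    import ipaddress
    import mysql.connector

    # Assumed mapping from IP prefixes to regions; a real deployment would use
    # a GeoIP database or the address ranges of the regions it actually serves.
    REGION_PREFIXES = {
        "10.": "apac",
        "192.168.": "emea",
    }

    def region_for(ip):
        for prefix, region in REGION_PREFIXES.items():
            if ip.startswith(prefix):
                return region
        return "default"

    def classify_ip(conn, ip):
        """Return 'black', 'gray' or 'unclassified' for the given IP."""
        ipaddress.ip_address(ip)            # raises ValueError on malformed input
        region = region_for(ip)
        cur = conn.cursor()
        for list_name in ("blacklist", "graylist"):
            # One indexed lookup per list; the per-region split keeps each table small.
            cur.execute("SELECT 1 FROM {}_{} WHERE ip = %s LIMIT 1".format(list_name, region), (ip,))
            if cur.fetchone():
                return "black" if list_name == "blacklist" else "gray"
        return "unclassified"

    if __name__ == "__main__":
        conn = mysql.connector.connect(user="app", password="secret", database="antiscrape")
        print(classify_ip(conn, "10.12.34.56"))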
A. Browser Information

One of the most basic filters that can be utilized is checking the information provided by the browser. Information such as system fonts and browser plugin details is generally not provided by bots, and most browsers' fingerprints are highly unique [8]. Any traffic source that offers very little of this information can be placed on the gray list, after which it must solve a CAPTCHA to re-enter the white list.

B. Regular Markup Changes

The markup of a site can be regularly changed in an automated manner, although not in a very major way. Most web scrapers rely on the class names and ids of various tags to identify the sections to scrape. Automated, regular changing of just the class and id names of the various sections is therefore very effective in breaking some scraping bots. However, changing the position of the sections themselves might prove rather annoying to actual users who are used to a particular layout, and this might result in a decrease of genuine traffic on the site. Pseudocode for the markup randomizer is given in Listing 1, while the entire process is shown in Fig. 1.

    do until all classes and ids changed
        get a class/id from the database
        create a random name
        replace in database and template
    end do
Listing 1. Pseudo-code for Markup Randomizer.

Fig. 1. Flowchart of Markup Randomizer
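Listing 1 can be realized in several ways; the following Python sketch is one possible interpretation, under the assumption of a table markup_names(id, name) holding every class/id in use and a directory of HTML templates (both hypothetical names), so that the randomizer renames the identifiers in the database and in the templates in one pass.

    import secrets
    import sqlite3                      # stand-in for the site's real database
    from pathlib import Path

    def random_name():
        # Random, CSS-safe identifier; prefixed so it never starts with a digit.
        return "c" + secrets.token_hex(6)

    def randomize_markup(db_path, template_dir):
        conn = sqlite3.connect(db_path)
        cur = conn.cursor()
        cur.execute("SELECT id, name FROM markup_names")     # all known classes/ids
        renames = {}
        for row_id, old_name in cur.fetchall():
            new_name = random_name()
            renames[old_name] = new_name
            cur.execute("UPDATE markup_names SET name = ? WHERE id = ?", (new_name, row_id))
        # Apply the same renames to every template so the pages keep rendering.
        # (A real implementation would match whole tokens, not raw substrings.)
        for template in Path(template_dir).glob("*.html"):
            text = template.read_text(encoding="utf-8")
            for old_name, new_name in renames.items():
                text = text.replace(old_name, new_name)
            template.write_text(text, encoding="utf-8")
        conn.commit()
        conn.close()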
C. Information as an Image

Another method to prevent scraping is to present information as part of images rather than as raw text. Normal text information can be converted to images on the server side and then presented to the user. This forces scrapers to use Optical Character Recognition (OCR) software to actually get the text out of the images. However, this method is not very effective on its own, as OCR libraries have nowadays become very advanced. Deliberate smudging of the text in the image will lead to annoyance among genuine users and might affect the user experience on the site. Data can also be presented as Flash snippets or HTML5 canvas elements instead of images.
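As a hedged illustration of the server-side conversion (the example application in Section III uses the GD library; Pillow is used here only because the sketches in this paper are written in Python, and the font path and padding are assumptions):

    from io import BytesIO
    from PIL import Image, ImageDraw, ImageFont

    def text_to_png(text, font_path="DejaVuSans.ttf", size=16):
        """Render sensitive text as a PNG so scrapers must resort to OCR."""
        font = ImageFont.truetype(font_path, size)
        # Measure the text, then draw it on a transparent (alpha) background,
        # matching the "image with an alpha background" described in Section IV.
        dummy = ImageDraw.Draw(Image.new("RGBA", (1, 1)))
        left, top, right, bottom = dummy.textbbox((0, 0), text, font=font)
        img = Image.new("RGBA", (right - left + 8, bottom - top + 8), (0, 0, 0, 0))
        ImageDraw.Draw(img).text((4, 4), text, font=font, fill=(0, 0, 0, 255))
        buf = BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()       # bytes to serve with Content-Type: image/png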
D. Frequency Analysis
One of the most important levels of defense against scraping traffic is checking the frequency of visits. This is a rather medium-term technique, as the average hits per user have to be determined first (or have to be supplied to the anti-scraping application periodically). Some anti-scraping applications use bell curves to determine hits that are over the top. Again, multiple lists such as the gray list, black list and white list are used.
Before any kind of analysis, hits need to be recorded for some time, around four to seven days, depending on the user. Multiple variations of this technique exist. One is to gray-list traffic whose hits per day are substantially higher than the average normal traffic for a day. The benchmarks can be designed to vary, and so can the period: one could take the average number of hits per day over a month, or refer to different averages on different weekdays, in order to reduce the number of false positives. It also follows that we should check for substantial differences from the mode of the hits data in the range in question, rather than from the average, because a traffic source that records a large number of hits in one particular time interval but less traffic in the other equivalent intervals is likely not a scraper bot.
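A minimal sketch of this frequency check is given below, assuming the daily hit counts per IP have already been aggregated; the comparison against the site-wide average, the use of the mode, and the factor of 5 are illustrative choices rather than values prescribed by the paper.

    # Gray-list sources whose *typical* (modal) daily hit count is far above the
    # site-wide average hits per visitor. Using the mode of each source's data,
    # rather than its mean, prevents a single legitimate burst from flagging a user.
    from statistics import multimode

    def frequency_outliers(daily_hits, site_avg_hits_per_visitor, factor=5.0):
        """daily_hits maps an IP to its daily hit counts over the observation window."""
        flagged = []
        for ip, counts in daily_hits.items():
            if not counts:
                continue
            typical = max(multimode(counts))        # the source's modal daily volume
            if typical > factor * site_avg_hits_per_visitor:
                flagged.append(ip)                  # candidate for the gray list
        return flagged

    # A visitor with one heavy day is left alone; a consistently heavy source is flagged.
    print(frequency_outliers({"198.51.100.7": [12, 11, 300, 12],
                              "203.0.113.9": [480, 500, 480]}, 30.0))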
E. Interval Analysis

Another method is interval analysis, which is a long-term strategy. If the intervals between visits are very similar, the traffic is gray-listed. As noted earlier, transition from the gray list to any other list takes place through bot-differentiating techniques. Popular examples include the various forms of CAPTCHA [9]. CAPTCHA tests are based on open and hard problems in AI [4], which include recognition of highly distorted images and answering problems that require semantic knowledge [10], without giving a learning set to the bot. CAPTCHA itself has been developed into many variants by various individuals and organizations such as Google [11] and Microsoft [12]; reCAPTCHA [11], for example, aims to convert images of old books and manuscripts to text alongside a normal CAPTCHA image.
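One possible realization of the interval check, assuming request timestamps per IP are logged, is sketched below; the coefficient-of-variation threshold and the minimum number of requests are assumed parameters.

    # Flag sources whose inter-request intervals are suspiciously regular.
    # A very low coefficient of variation (stddev/mean) suggests a timed bot.
    from statistics import mean, pstdev

    def is_interval_regular(timestamps, cv_threshold=0.1, min_requests=20):
        """timestamps are request times (in seconds) for one source, in order."""
        if len(timestamps) < min_requests:
            return False                            # not enough data to judge
        intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
        avg = mean(intervals)
        if avg <= 0:
            return False
        return pstdev(intervals) / avg < cv_threshold   # near-constant spacing

    # A source hitting the site every 60 +/- 1 seconds would be gray-listed.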
F. Traffic Analysis

This is a very significant method that has gained even more importance in recent years, considering that modern scrapers also use techniques like IP rotation to evade IP-based detection and identification. Here, botnet detection algorithms are used to detect and identify a particular source not by its IP but by its traffic. Yu et al. [13] propose a data-mining approach which generates attributes from captured network traffic flows, first clustering the data and then using the discrete Fourier transform [14] to reduce its size. Since traffic flows from the same source will be very similar, this method has a high probability of flagging bots. The use of Fourier analysis reduces the size of the problem quite extensively, thus reducing the processing requirement. However, for any statistical method to be effective there must be enough samples, and on a small or medium website the traffic flow has to be observed for a rather long time.
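The flow-clustering approach described above is considerably more involved than can be shown here; the sketch below only illustrates the Fourier-compression step, assuming each source's traffic has already been turned into a fixed-length time series of requests per interval (an assumption of this sketch). The clustering and attribute-generation steps are omitted.

    # Compress per-source traffic time series with the discrete Fourier transform
    # and compare sources by the distance between their low-frequency coefficients.
    import numpy as np

    def dft_signature(series, k=8):
        """Keep the first k DFT coefficients as a compact signature of the flow."""
        return np.fft.rfft(np.asarray(series, dtype=float))[:k]

    def flow_distance(a, b, k=8):
        return float(np.linalg.norm(dft_signature(a, k) - dft_signature(b, k)))

    # Two IPs of a rotating scraper produce nearly identical signatures, so their
    # distance is small even though their addresses differ.
    bot_a = [10, 0, 10, 0, 10, 0, 10, 0] * 4
    bot_b = [10, 1, 10, 0, 9, 0, 10, 1] * 4
    human = [3, 7, 0, 1, 12, 2, 0, 5] * 4
    print(flow_distance(bot_a, bot_b), flow_distance(bot_a, human))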
G. URL Analysis

The analysis of URLs visited could also be used, wherein the application checks which pages the traffic visits. If a source visits only the data-rich pages, whose URLs are provided by the user (i.e., the person in charge of the website), the application takes action once a benchmark defined by the user is crossed. Of course, the ratio of the number of non-data-rich pages to data-rich pages on the website needs to be substantially high for this method to work properly; if the ratio is low, the method will throw up a whole lot of false positives. As is evident, this requires a lot of database storage space and is generally CPU intensive, depending on the average number of hits per user per time interval. This, again, is a medium-term (usually bimonthly) technique. An analysis of the order in which pages are visited is a possible modification of this method: when the same order is repeated over a long period of time, it may also point to the existence of a scraper bot. Modified data-mining algorithms such as the Apriori algorithm [15][16], and other frequent-itemset and association-rule mining algorithms such as ECLAT [17], can be implemented here, provided that the temporal data (time) of the requests is also stored.
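A minimal sketch of the basic URL-ratio check is shown below; the list of data-rich URLs and the benchmark are supplied by the website owner, and the 0.9 ratio and minimum visit count are illustrative values only.

    # Gray-list a source if almost all of its page views hit the user-supplied
    # data-rich URLs. The ordered-sequence (Apriori/ECLAT) refinement mentioned
    # above would additionally use the request timestamps and is not shown here.
    def hits_mostly_data_pages(visited_urls, data_rich_urls, benchmark=0.9, min_visits=50):
        if len(visited_urls) < min_visits:
            return False
        data_hits = sum(1 for url in visited_urls if url in data_rich_urls)
        return data_hits / len(visited_urls) >= benchmark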
H. Honeypots and Honeynets

Some companies that were already in a networking-related business, for example Amazon CloudFront [18] and CloudFlare [19], use a network of honeypots [20], also known as honeynets [21] or honeyfarms, around the world to capture bot information and relay it to their applications via regular updates or other methods. Such honeypots traditionally differ from normal ones in that they mostly comprise some seemingly highly data-rich pages. However, they are effective mostly when the scraper is not limited to a niche. Small honeypot pages can also be deployed on the website itself to detect completely automated crawler-scrapers.
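A small honeypot page can be as simple as a link that is invisible to humans but present in the markup; any client that requests it is almost certainly an automated crawler. The sketch below uses Flask purely for illustration, and the route name and in-memory gray list are hypothetical stand-ins for the components described earlier.

    # A honeypot route: the link to /specials-archive is hidden from human
    # visitors (e.g. via CSS) and disallowed in robots.txt, so well-behaved
    # crawlers skip it; anything that fetches it is gray-listed immediately.
    from flask import Flask, request

    app = Flask(__name__)
    graylist = set()                       # stand-in for the real gray-list table

    @app.route("/specials-archive")        # hypothetical honeypot URL
    def honeypot():
        graylist.add(request.remote_addr)
        # Serve plausible-looking but worthless data so the scraper keeps going.
        return "<html><body><table><tr><td>No current specials</td></tr></table></body></html>"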
III. EXAMPLE OF AN ANTI-SCRAPING APPLICATION

Figure 2 depicts the software architecture of an example anti-scraping application. When a user requests a web page from a website, a filter first checks whether the user belongs to the black list or the gray list for the IP range. If the user belongs to the black list, the request is denied. If the user is on the gray list, he is presented with a CAPTCHA, on solving which his session is established. If the user is present in neither list, his session is established directly. The website then responds to the user, but the important data is first converted to an image using the GD library [13], and finally the user is served the page. The markup is randomized at regular intervals using the automated randomizer shown in Listing 1. The pages the user visits, along with the request times, are logged into the database, and the lower-level network traffic data is also logged for traffic analysis.

Fig. 2. Software architecture of an example anti-scraping application

As per Fig. 3, there are two cron jobs in this example application. Cron job 1 is executed daily (whenever the site records the least traffic) and performs frequency analysis, interval analysis and URL analysis; anomalous IPs are inserted into the gray list or black list depending on the thresholds. The second cron job is executed weekly and performs traffic-flow analysis; it is the prime defense against bots that perform IP rotation.

Fig. 3. Classification of cron jobs
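The per-request filter of Fig. 2 can be summarized in the following hedged sketch, again Flask-flavoured for illustration; classify_ip, serve_captcha and log_visit are stubs standing in for the database lookups, CAPTCHA service and logging described above, not parts of a published implementation.

    # Deny black-listed IPs, challenge gray-listed IPs with a CAPTCHA, let
    # everyone else through, and log the visit for the nightly cron jobs.
    import time
    from flask import Flask, request, session, abort

    app = Flask(__name__)
    app.secret_key = "change-me"          # needed for the session cookie

    def classify_ip(ip):                  # stub: would query the black/gray tables
        return "unclassified"

    def serve_captcha():                  # stub: would render a CAPTCHA challenge page
        return "Please solve the CAPTCHA to continue", 403

    def log_visit(ip, path, ts):          # stub: would insert into the hits table
        pass

    @app.before_request
    def anti_scraping_filter():
        ip = request.remote_addr
        status = classify_ip(ip)
        if status == "black":
            abort(403)                                 # request denied outright
        if status == "gray" and not session.get("captcha_passed"):
            return serve_captcha()                     # must be solved before a session starts
        log_visit(ip, request.path, time.time())       # URL + request time for the cron jobs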
IV. DISCUSSION

The actions taken when a web page is loaded for the first time are a search for the IP in two tables (black list and gray list) and the logging of the time and the IP. Assuming the threshold for the maximum number of records is set to 1000, the time to search through two such properly indexed tables is very small: such a query to a MySQL database (v5.5.27 with InnoDB as the engine, on the same machine) takes less than 0.024 seconds. The URLs visited are also logged along with the request times (insertion takes very little time), and the data itself is converted to an image with an alpha background, which is also fast. The other methods are performed mostly when the site is not receiving a large number of hits, so they do not contribute to the page loading time. The example also deletes data for normal IPs every month to ensure that the records do not grow too large and the site does not reach its memory limit. The timings of the various cron jobs can be adjusted to make sure that memory limits are not exceeded and to provide a more stringent security policy.

A SaaP anti-scraping solution ensures that the user has great freedom in modifying the software according to specific needs, while ensuring that the increase in page load time is nominal. However, this study makes some assumptions, especially regarding the size of the website. If the website is too big, the complexity, and further, the maintenance of the product increase, and the standard advantages of services over products come into play. In such cases, using a service would be much preferred over a stand-alone product.
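The monthly pruning of records for normal IPs can be a small script run from cron; the following sketch assumes tables named hits, graylist and blacklist, each with an ip column and a ts timestamp column, which are illustrative names only.

    # Monthly cron job that prunes old log rows for IPs that are neither
    # gray-listed nor black-listed, keeping the hit tables within the size threshold.
    # Schedule for example with:  0 4 1 * *  /usr/bin/python3 /opt/antiscrape/cleanup.py
    import mysql.connector

    def prune_normal_ips():
        conn = mysql.connector.connect(user="app", password="secret", database="antiscrape")
        cur = conn.cursor()
        cur.execute(
            "DELETE FROM hits WHERE ip NOT IN (SELECT ip FROM graylist) "
            "AND ip NOT IN (SELECT ip FROM blacklist) "
            "AND ts < DATE_SUB(NOW(), INTERVAL 1 MONTH)"
        )
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        prune_normal_ips()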
V. CONCLUSION

Data theft has become one of the most serious threats for online businesses today, and website scraping tools have become the weapon of choice to perform it. From highly publicized scraping attacks on popular sites to innumerable unpublished incidents, scraping has become a widespread menace. Many anti-scraping solutions are available from many vendors. However, small to medium scale websites have smaller and less valuable data sets, and can therefore implement effective anti-scraping measures locally without buying any professional anti-scraping solution. On such sites the scraper realizes that the extra effort required for data theft outweighs the value of the scraped data, which leads to a reduction in scraping attacks on those sites. A fundamental working principle of the security world is at work here: the cost of getting a piece of data, or of getting access to a certain environment, should be higher than the perceived value of the prize. The proposed anti-scraping solution ensures that the user (website owner) has great freedom in modifying the software according to specific needs, while keeping the increase in page load time nominal. However, further research is possible in some of the techniques presented in this paper, including the analysis of network traffic to determine the singularity of a source.
REFERENCES

[1] "Facebook, Inc. vs. Power Ventures, Inc.," Scribd digital library. [Online]. Available: http://www.scribd.com/doc/15827848/Facebook-v-Power-05-11-09 [Accessed 10-March-2015].
[2] T. W. Bell, "Internet law," http://www.tomwbell.com/NetLaw/Ch06/eBay.html [Online; accessed 15-Feb-2015].
[3] P. Vixie, "cron - daemon to execute scheduled commands (Vixie cron)," Unix and Linux Forums, 2010. [Online]. Available: http://www.unix.com/man-page/linux/8/cron/ [Accessed 15-Feb-2015].
[4] L. von Ahn, M. Blum, and J. Langford, "Telling humans and computers apart automatically," Commun. ACM, vol. 47, no. 2, pp. 56–60, Feb. 2004.
[5] L. von Ahn, M. Blum, N. J. Hopper, and J. Langford, "CAPTCHA: Using hard AI problems for security," in Proceedings of the 22nd International Conference on Theory and Applications of Cryptographic Techniques (EUROCRYPT '03). Berlin, Heidelberg: Springer-Verlag, 2003, pp. 294–311. [Online]. Available: http://dl.acm.org/citation.cfm?id=1766171.1766196
[6] M. Koster, "A method for web robots control," Network Working Group Draft. [Online]. Available: http://www.robotstxt.org/norobots-rfc.txt [Accessed 10-Jan-2015].
[7] M. Koster, "A standard for robot exclusion," Network Working Group Draft. [Online]. Available: http://www.robotstxt.org/orig.html [Accessed 10-Feb-2015].
[8] P. Eckersley, "A primer on information theory and privacy," https://www.eff.org/deeplinks/2010/01/primer-information-theory-and-privacy, 2015. [Online; accessed 10-March-2015].
[9] M. Moradi and M. Keyvanpour, "CAPTCHA and its alternatives: A review," Security and Communication Networks, 2014. [Online]. Available: http://dx.doi.org/10.1002/sec.1157
[10] Y. Park, Q. Zhang, D. Reeves, and V. Mulukutla, "AntiBot: Clustering common semantic patterns for bot detection," in Computer Software and Applications Conference (COMPSAC), 2010 IEEE 34th Annual, July 2010, pp. 262–272.
[11] L. von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, "reCAPTCHA: Human-based character recognition via web security measures," Science, vol. 321, no. 5895, pp. 1465–1468, 2008. [Online]. Available: http://www.sciencemag.org/content/321/5895/1465.abstract
[12] J. Elson, J. R. Douceur, J. Howell, and J. Saul, "Asirra: A CAPTCHA that exploits interest-aligned manual image categorization," in Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS '07). New York, NY, USA: ACM, 2007, pp. 366–374. [Online]. Available: http://doi.acm.org/10.1145/1315245.1315291
[13] "libgd/libgd.github.io," GitHub repository. [Online]. Available: https://github.com/libgd/libgd.github.io/blob/master/archives.html [Accessed 10-March-2015].
[14] Wikipedia, "Discrete Fourier transform," http://en.wikipedia.org/w/index.php?title=Discrete_Fourier_transform&oldid=650317831, 2015. [Online; accessed 15-March-2015].
[15] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93). New York, NY, USA: ACM, 1993, pp. 207–216. [Online]. Available: http://doi.acm.org/10.1145/170035.170072
[16] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park, CA, USA: American Association for Artificial Intelligence, 1996, pp. 307–328. [Online]. Available: http://dl.acm.org/citation.cfm?id=257938.257975
[17] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New algorithms for fast discovery of association rules," University of Rochester, Rochester, NY, USA, Tech. Rep., 1997.
[18] Amazon, "Amazon CloudFront CDN - content delivery network," http://aws.amazon.com/cloudfront/ [Online; accessed 18-March-2015].
[19] CloudFlare, "CloudFlare - free global CDN and DNS provider," http://www.cloudflare.com/ [Online; accessed 18-March-2015].
[20] M. Gibbens and H. V. Rajendran, "Honeypots," https://www.cs.arizona.edu/~collberg/Teaching/466-566/2014/Resources/presentations/2012/topic12-final/report.pdf [Online; accessed 20-Feb-2015].
[21] L. Spitzner, "The Honeynet Project: trapping the hackers," IEEE Security & Privacy, vol. 1, no. 2, pp. 15–23, Mar. 2003.