Journal of Mobile, Embedded and Distributed Systems, vol. IV, no. 2, 2012 ISSN 2067 – 4074
Techniques for Securing Web Content

Cristian URSU
IT&C Security Master, Faculty of Cybernetics and Economic Informatics, Bucharest University of Economic Studies, ROMANIA
[email protected], [email protected]

Abstract: This paper analyzes the dangers to which web content is exposed, demonstrates how data from the Internet can be gathered and used to obtain profit, how an application can stealthily make use of complicated processing without implementing it, and how a website can be cloned in real time. For the considered attacks and vulnerabilities I present, explain, group and evaluate the existing solutions for securing content and, last but not least, I suggest a new solution: a library for link encryption.

Key-Words: web, content, data, information, knowledge, data mining, crawler, protection
1. Introduction
The Internet is just content linked together. Everyone is searching, creating, uploading, consuming and reviewing content. The connections between pieces of data are what make browsing the web possible: you start from one page but you never know where you'll end up. Search engines guide you to what's relevant, and they do it by constantly crawling the web. But they are not the only ones reading and monitoring it. The issue of web content protection is a very sensitive topic. Nowadays, if you're not on the Web you don't exist. And if you're online, everything you put on the Internet is public and susceptible to being accessed in ways you may not want. The content you provide, the articles you write, the pictures you upload, the functionalities you implement online can be easily copied or accessed directly without you knowing about it. And with the Internet being such a huge source of accessible data, it's no wonder that mining algorithms are continuously being used to obtain highly valuable and money-generating knowledge. The exploits I will showcase, the protection measures that already exist and those suggested in this work all come to show that web content plays a central role in the Internet and our daily lives.

For individuals, the possible consequences of posting personal content on social networks are a very real and dangerous threat, especially in the hands of someone with harmful intentions trying to obtain info about them, but also considering that corporations tend to use aggregated social web data to create consumer profiles. The tendency to move data into cloud systems also comes with its risks, as personal files are handled by a third party and thus content control becomes harder and harder. With a wave of counter-measures against piracy, file sharing is currently carefully monitored by authorities, and this has its impact on the dynamics of web content. Channels that helped transfer content with or without author rights infringement, like file-sharing websites and torrents, are disappearing at a constant rate, and this will make file hot linking more active, with one instance of a file being referred to from multiple places.
2. So what is the problem?
It has never been enough to be innovative, brilliant or hard-working; you also have to protect your intellectual property and your data. And if you're thinking that's what digital rights management (DRM) is for, you are only partially right. Copyright protection tries to limit the distribution and availability of intellectual property found in digital format by attaching or encapsulating to the file a set of rules that limit the use of that file.
DRM is great if you have media content and you want [1] - for example - your visitors to be able to download a song and only listen to it once unless they buy it. We'll call that "hit and run" or "static" protection, because without DRM it would be enough for the person trying to get the file or data to download it once in order to use it forever. Another approach is to digitally timestamp a document so that if its contents get copied, you could prove that you in fact created the original.

But in today's distributed environment, where the focus is on web services, things are more dynamic. Usually, when someone explicitly wants to offer relevant content, they deploy a web service to which others can subscribe and receive the content it sends. Because web services are designed to provide data to their clients, they have become an important source of content in web pages, whether it's a financial report, a weather widget or a news feed. And if you have a website that displays valuable information and you didn't create a service for providing that data automatically to your public (at regular intervals or on specific request), you may unwillingly become the source for a service. If your site lacks what we will call "dynamic content protection", someone else may build a service based on your data or may gain other advantages with the help of content-stealing robots. That valuable data could be generated or created by you or your users, or may be the result of some processing initiated by the client on your page.

2.1. How is that affecting me?
Depending on the nature of your content, there can be different scenarios:

a) You benefit from the fact that your content can be copied
Let's say you manage an online shop with unprotected content: you could benefit from shopbot agents (applications with high autonomy and learning capabilities that search the Internet for price listings and create a comparative catalogue of products for their users), like price.ro or pricegrabber.com, that automatically browse your catalog and copy information about your products. In this way, if you have good prices, these comparison shopping agents can bring you extra clients, but you must be aware that your rivals can also get a full list of your products and prices.

b) Your strong point is your content and you don't want to lose that advantage
If you own a website that displays car sale ads and someone is very interested in receiving notifications from you when an ad is posted - and you don't have such a service - they can make their own alerts by automating the steps they normally take to obtain that information from your page (access the ad listing page, order the ads by date with the newest first and compare them with the known ones in order to find new listings). And this may affect you if that notification is used to automatically populate a rival website with ads, or if it's so successful that it gets people to pay for it, and you're not the one cashing the check. If you protected your content and deployed a notification service with terms and conditions, then your data would be safe and you would control the access to notifications.

c) You offer processing on your website
There are a great number of services offered via the Internet, such as file conversions, barcode generation, image processing, text translation, etc. If your web page is just an interface for some processing that takes place on your server (file conversions, automated translations, etc.), you want everyone who needs to use your mechanism to enter your site. If you don't protect the result of processing requests and an attacker manages to gain access to your underlying mechanism without using
your interface, it may be game over for you. He may build a competing service based on your efforts and your server's resources - thus gaining traffic while you lose yours, although you process even more requests than before. I have exemplified this weakness on a website that provides image-generating services. Even more, I have gone one step further: because the web page interface limits the input parameters while the underlying mechanism can process a lot more options, the processing I provide is better than the original, although it uses exactly the same mechanism from its server. To gain control of a remote processing mechanism my script has to interact with it, and this means sending parameters through the POST method. All web-based processing sites have two components: the interface and the underlying mechanism; the relation between them is described in Figure 1.
Figure 1. Obtaining access and control of an external underlying mechanism

The interface receives input from the user through different controls and sends it to the processing script via the POST or GET method. After the processing is done, the result is output back to the interface. For someone to be able to send input directly to the underlying mechanism, the data that gets sent when normal input is transmitted must be known. If the communication with the server is not secured, network analyzers like Wireshark or Fiddler display the full list of parameters that get sent via POST or GET to the server. To gain control of a remote processing mechanism, an attacker needs to:
- know the URL of the underlying mechanism;
- know what the input parameters are and what each parameter does;
- automate sending valid parameters to the underlying mechanism;
- obtain the HTML of the page containing the result of the processing;
- parse that HTML to extract the result (in this case the image).
A sketch of these steps is given below.
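A minimal sketch of these steps in PHP with cURL; the endpoint URL, the parameter names and the result pattern are only assumptions for illustration:

//send parameters directly to a hypothetical processing script and extract the result
$ch = curl_init('http://example.com/generate.php'); //assumed URL of the underlying mechanism
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'text=Hello&size=300'); //assumed input parameters
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //get the resulting HTML as a string
$html = curl_exec($ch);
curl_close($ch);
if (preg_match('/<img[^>]+src="([^"]+)"/i', $html, $match)) { //parse the result page
    echo 'Generated image: ' . $match[1];
}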
d) Based on your content, people make decisions which involve money transfers
Here's where it gets really tricky. If you manage a website that provides data (usually indicators) based on which people make decisions about investing their money, you know that the stakes are high. Your clients want to maximize their profit, and if you provide them with enough access to data they can deploy mining techniques and obtain exactly that. For this kind of page it is critical to protect your content (your indicators) from being accessed automatically. A business like an online exchange office or an online sports betting website aims to generate profit by making sure that, overall, it rewards its clients with less than what they invest. If a script can access and collect those indicators from your site, it can do the same with similar data from other websites. And when enough data is gathered, it can start processing it with a specific algorithm and find profitable situations. For example, if Exchange Office A gives you 12.5$ for 10€ and Exchange Office B gives you more than 0.8€ for one dollar, then by making transactions at these two offices you get more than 10€ for your initial 10€.
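The arithmetic behind that example as a small sketch (the rates are the dummy values from the text, with 0.81 standing in for "more than 0.8"):

//round trip through the two exchange offices
$eur = 10.0;
$usd = $eur * 1.25; //Office A: 10 EUR -> 12.5 USD
$eurBack = $usd * 0.81; //Office B: 0.81 EUR per USD
echo $eurBack; //10.125 EUR, more than the initial 10 EUR, so the cycle is profitable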
In sports betting, with so many sports events and so many betting websites, it's predictable that, although each booker makes sure that his odds leave no room for guaranteed wins, over the whole web opportunities to get certain wins do appear. The sports booking business works by attributing odds to each possible outcome in a sports event. These odds show how unlikely the booker considers a result to be and can be expressed in different formats: fractional, decimal, US format, etc. Considering a football match between Team A and Team B, there are three possible outcomes: Team A wins, the game ends in a draw or Team B wins. For this event, where the booker considers a victory for Team A to be the most likely result, decimal odds of 1.6, 2.4 and 3.2 mean, for example, that if you bet 100$ on Team A and Team A wins you get 160$, otherwise you lose your 100$. If you try to place equal amounts of money on each result, thus playing a total of 300$ (100$ on each result), in the worst case you will win just 160$ (having an overall loss of 140$) and in the best scenario you win 320$, which represents only 20$ extra to your investment. It is pointless to try and develop a method of winning money from just one booker, because he will always do the calculations and keep his set of odds free of profitable betting schemes. It is therefore a matter of finding situations where different bookers evaluate the probability of outcomes differently. It becomes certainly profitable to bet when, for each possible outcome, there is a booker offering a decimal odd greater than the number of possible outcomes. Considering the situation presented in Figure 2, profit is guaranteed because for all 3 results at least one booker offers an odd greater than 3 (Booker 2 for Team A - 3.3, Booker 4 for Draw - 3.4 and Booker 1 for Team B - 3.2).

Figure 2. Dummy odds from bookers
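A minimal sketch of the guaranteed-win check over such collected odds (the odds array holds dummy values consistent with the ones quoted above; a real system would first gather these odds automatically, as described in the following sections):

//for each outcome keep the best odd over all bookers and test the sure-win condition
$odds = array(
    'Team A wins' => array('Booker 1' => 1.6, 'Booker 2' => 3.3),
    'Draw'        => array('Booker 3' => 2.5, 'Booker 4' => 3.4),
    'Team B wins' => array('Booker 1' => 3.2, 'Booker 4' => 1.9),
);
$sumOfInverses = 0;
foreach ($odds as $outcome => $byBooker) {
    $best = max($byBooker); //the highest odd offered for this outcome
    $sumOfInverses += 1 / $best;
    echo $outcome . ': best odd ' . $best . "\n";
}
//if the sum of 1/odd over the best odds is below 1, splitting the stake proportionally to 1/odd
//guarantees a profit; the simpler condition in the text (every best odd greater than the number
//of possible outcomes) implies this
echo ($sumOfInverses < 1) ? "Guaranteed win possible\n" : "No sure bet here\n";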
These differences in the way bookers evaluate competitors' chances come mainly from the behavior of their customers - who may prefer one team or another - and from each booker's understanding of that sport (their knowledge and their intuition matter, because it's very likely that the odds will change during the game itself, depending on the evolution of the score and how well the teams play). All that someone needs in order to make a profit from your website is to be alerted when such opportunities appear and to take action. By automatically and systematically gathering indicators from your site and from other pages and by comparing them, attackers gain the knowledge of when and where they need to make electronic payments in order to obtain a profit. This knowledge extends to calculating in advance how much is won for an invested amount, and data mining can even give predictions, based on statistics, for when and where similar opportunities will appear. The way in which situations of guaranteed wins are obtained resembles the Data-Information-Knowledge-Wisdom model.
Figure 3. DIKW (Data-Information-Knowledge-Wisdom) System [2]

All the steps that need to be taken in order to build a data mining system that alerts when winning is guaranteed are presented in the following diagram of the exploit, where Remote site X is a third-party website that aggregates odds from bookers, is not intended for distributing them, but also doesn't implement content protection:
Figure 4. Data mining mechanism
If the exploitation of this kind of opportunity on your site is not bad enough, you may want to know that, if your vulnerabilities are more severe, attackers can build actual money-making machines that automate the placing of money (electronic payments), thus generating profit without any other human intervention. And if you're just a simple Internet user, keep in mind that everything you put online might be accessible if the site hosting your content doesn't have reliable security mechanisms implemented.
2.2. Why is this possible?
Automated access to content is made possible by the client-server paradigm on which the Web works: servers process the requests they receive from clients (usually browsers) and respond with content [3, p 38] (HTML, XML, etc.). The problem is that requests can be simulated through a script and the server behaves exactly the same, providing it with the desired response. For example, a little script written in PHP that uses the cURL library (a recursive acronym for Curl URL Request Library) is enough to get the HTML of a page, in this case the landing page for cristiursu.ro. The default request type is GET, but POST requests can also be easily simulated.
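A minimal sketch of such a request, assuming the PHP cURL extension is enabled:

$ch = curl_init('http://cristiursu.ro');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //return the HTML as a string instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //follow redirects to the landing page
$html = curl_exec($ch);
curl_close($ch);
echo $html; //the same HTML a browser would receive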
A client from the web doesn't usually initiate communication directly with the server. Instead, a DNS (Domain Name Server) first receives a request to associate the address entered in the browser with an IP. After the DNS responds with the IP of the machine which is hosting that website, a route from the client to the server is established. The communication consists of an initial three-way handshake, a series of HTTP requests generated by the client browser when clicking links, submitting forms, etc., and a final step closing the connection [3, p 45-46]. So another thing that can be done, if content can be automatically obtained from a website, is to simply copy its pages. This means first simulating a request to obtain the HTML of a page, then, for each resource referenced in that HTML (CSS, JavaScript files, images), downloading it to the server and modifying the HTML to point to the downloaded version found on the attacker's machine. Then, using the inner links found in the HTML and the same algorithm, a script can automatically visit (and make copies of) all the pages and files referenced inside that website, in a process of covering content that is called crawling.
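As a rough illustration of the copying step, a PHP sketch that clones one page and its images (only an outline: the target URL and the mirror/ folder are assumptions, it ignores CSS and script files, and it assumes allow_url_fopen is enabled):

//fetch the page, download every referenced image and rewrite the HTML to use the local copies
$base = 'http://example.com/'; //hypothetical target site
$html = file_get_contents($base); //simulate the page request
if (!is_dir('mirror')) { mkdir('mirror'); }
preg_match_all('/<img[^>]+src="([^"]+)"/i', $html, $m); //find referenced images
foreach (array_unique($m[1]) as $src) {
    $url = parse_url($src, PHP_URL_SCHEME) ? $src : rtrim($base, '/') . '/' . ltrim($src, '/');
    $file = 'mirror/' . basename(parse_url($url, PHP_URL_PATH));
    file_put_contents($file, file_get_contents($url)); //store the resource locally
    $html = str_replace($src, $file, $html); //point the HTML to the local copy
}
file_put_contents('mirror/index.html', $html); //the cloned page

Repeating the same procedure for every inner link found in the HTML turns this into the crawling process described above.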
But why would someone do that, considering it's just like "Save Page As" from a browser? Well, if you can't access an address from your browser because it has been blocked with a special tool (web filter, firewall, etc.) but you can access the address of another website where you have an FTP account, you can make that one work for you like a proxy site [4]. In this way you visit a legitimate website that copies other web pages in real time, so you browse content from your website which is in fact a copy of the forbidden one. This is, of course, limited to the browsing experience, as interactions with the forbidden site are harder - but not impossible - to replicate (especially if the communication with the server is not encrypted with https and the variables are visible in sniffing tools). Efforts to implement user interactions are usually deployed if someone clones a site in order to impersonate it. By pretending that their copy is the original page, an attacker can obtain passwords, credit card numbers and other sensitive information. Users of the original site
input data in the clone's forms and send it to the attacker's server while thinking that the interface they see belongs to the original site and their data is received by the legitimate system [5]. These types of attacks built on masquerading are called spoofing. They are based on the fact that, if requested, users provide confidential information if they think that the site they visit is legitimate. In order for users to send data to the attacker it's not enough to have an identical interface (as a result of content stealing), because users also have to be provided with the fake page instead of the original. Attackers need to change the pages that are sent to the victim's machine, but also to make sure that the original site receives the user data in order to keep the attack undetected [6]. To break the communication between the client browser and the server and to put himself in between (for a man-in-the-middle type of attack), an attacker can use:
- DNS spoofing, altering the server's list and how it resolves requests for the victim URL - which is very difficult - or simply intercepting a DNS request and responding quicker than the Domain Name Server, thus giving the client his own address to connect to [7];
- Name similarity, registering domain names similar to those they impersonate (e.g. harods.com for harrods.com, nasa.com for nasa.gov, etc.). Most of these situations are short-lived, easy to detect and lead to trademark infringement and domain name disputes. Gaining control in this case is based on the victim typing a wrong web address [8];
- IP address changing, by configuring his packets to have the IP address of the original site;
- Link alteration, by having a clone on his server and adding his domain in front of the web address (example: http://cristiursu.ro/http://ism.ase.ro for http://ism.ase.ro).
To gain control it is enough to modify the address accessed by the victim just once. After that, the victim browses the attacker's pages (which are a real-time clone of the original), where the attacker can add or modify the code as he pleases, altering form actions in order to obtain user input [7]. Whatever technique is used, in the end the system can be exploited with spoofing just like the following schema describes (note that content stealing is used for step 3):
Figure 5. Spoofing attack schema [9]

3. What can be done?
It's quite difficult to say what information from your website should be protected. You want people to be able to read it and use it from your site, and you also want them to find it (the automatic scripts of the search engines should be able to access it), but you don't want someone else stealing it or using your functionalities. This means that you need to find a balance between accepting the risk of having your content stolen and taking actions to prevent this from happening.

3.1. Legal aspects
After you identify the sensitive data that you will secure, the solution is to use a combination of techniques and counter-measures that are suited for the content you want to protect and the threat you want to protect it from. But first of all, before you start protecting your content you should state the ownership you have over it and specify inside the Terms and Conditions that the use of content stealing robots is forbidden (just like Facebook.com does in the Safety section of its Terms and Conditions, stating that: "You will not collect users' content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our permission" [10]). In many countries (US, EU, UK) the content of your site is protected by default, but it is good practice to use the copyright symbol © along with the "All Rights Reserved" specification. It is also recommended to use the HTML copyright metatag (for example <meta name="copyright" content="…">). But if you are really serious about protecting your website's content, then you should register it at your National Copyright Office (the Romanian equivalent institution is ORDA - http://www.orda.ro), because if it becomes necessary to defend it in court this is how you can prove plagiarism [11].

3.2. Find out who steals your content. Plagiarism assessment
If you want to start actions in court against those who steal your content or you simply want to know who is taking advantage of your work, you need ways to identify them. This is possible only if those who copy your content repost it online in their own web context. If the attackers just use your data corroborated with other information in order to deploy mining techniques and gain some knowledge, it will be very hard to find them. But for cases of reused content there are advanced plagiarism assessment tools that work with a database of copyrighted materials; if your content is included in that database, the software has the ability to determine if a reviewed material is based on yours. Such a paid service, which uses Internet and database checking while taking translations and synonyms into consideration, is accessible at checkforplagiarism.net, and they claim to "scan documents for Plagiarism using the latest and most in-depth technology available to identify and highlight even the most subtle attempt at either intentional or un-intentional plagiarism" [12]. Free of charge, you could use web search engines or sites like CopyScape (http://copyscape.com) to search for exact phrases extracted from your content and see if among the results there are other pages besides yours [13]. Although this is not so advanced, you can still find out if someone is using your materials, and if so you can then find details about the owner of the website by verifying their domain on whois.net and contact their hosting company to report them [14].

3.3. Existing security mechanisms for web content
We have seen that an attacker can write a script to simulate page requests and obtain the HTML result from the server even though he doesn't use a browser, doesn't click the submit button of the form, doesn't fill the form inputs, and doesn't click the links. But to obtain web content he can also program a machine to navigate through the page, doing exactly what a human agent would do (mouse events, pressing buttons, filling textboxes, etc.). Most protection measures check that he didn't just access the link or the action of the form by providing valid and controlled parameters to that page. In other words, most protection measures verify that a browser has been used and that the user actually filled the form found in the referrer page and clicked the Submit button [3, p 95]. Some techniques can identify simulated page requests, but against machines that replicate human actions inside a browser there is little or nothing to do.
Software programs that record and replicate mouse actions like Auto Mouse Click (http://www.murgee.com/auto-mouse-click), Auto Hot Key (http://www.autohotkey.com), AutoMe (http://www.asoftech.com/autome) or even Firefox add-ons such as iMacros (which states that "Whatever you do with Firefox, iMacros can automate it" [15]) provide examples that mouse actions can be easily automated. The downside of using this type of software is that input parameters on web pages are harder to provide dynamically than they are with scripting technology.

Knowing what the dangers are, let's see what can be done to protect content. In most cases, keeping the records from your database safe from SQL injection attacks is a first step in web content protection, because much of the content displayed on web pages is stored in databases. But the consequences of SQL injection are far more severe than content stealing, and such attacks usually target other, more important information (user credentials, credit card data, etc.). To prevent these attacks it is recommended to implement validations on the input strings received from the interface, to escape special characters when sending the string to the database engine (PHP has a special function for this, mysql_real_escape_string($string)) and to use stored procedures as much as possible [16]. But because the subjects of SQL injection and content protection only intersect marginally, there is much more to securing content than preventing it from being obtained directly from the database.
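As an illustration of the escaping step, a minimal sketch using the mysql_* functions mentioned above (credentials, table and column names are purely illustrative; these functions were removed in later PHP versions, where mysqli or PDO prepared statements take their place):

//escape user input before it is placed inside the SQL string
$conn = mysql_connect('localhost', 'user', 'pass'); //hypothetical credentials
mysql_select_db('shop', $conn);
$category = mysql_real_escape_string($_GET['category'], $conn);
$result = mysql_query("SELECT name, price FROM products WHERE category = '$category'", $conn);
while ($row = mysql_fetch_assoc($result)) {
    echo htmlspecialchars($row['name']) . ' - ' . $row['price'] . "\n";
}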
Dynamic websites offer different content to different visitors. Displaying such differentiated and custom information depends on identifying who the visitor is and implementing access rights that dictate what content he is allowed to see. These kinds of web applications have an authentication system usually based on login with credentials, certificates (sometimes both), or even biometrics or tokens [17]. Although scripts can automate the login authentication, they do it with certain credentials, and therefore the access level of the script is that of the account having those credentials. Automatic login can even be made possible with just a few lines of code in PHP with cURL:

curl_setopt($ch, CURLOPT_USERPWD, "$username:$password");
//for HTTP authentication

or

curl_setopt($c1, CURLOPT_POST, 1);
curl_setopt($c1, CURLOPT_POSTFIELDS, 'User='.$username.'&Password='.$password.'&button=Login');
//for submitting a form with input controls named User and Password and a button named Login
If authentication requires extra information that cannot be known in advance (device-generated tokens), the script cannot automate the login. Having the communication between the client and the server in plaintext might also be a problem, because an attacker from the network can see the list of parameters used to access a page by using a sniffing tool (e.g. Wireshark, Fiddler) and may later deploy replay attacks. But even when communication with a site is secured through SSL, if the site doesn't redirect normal http requests to https, all the security is pointless (e.g. http(s)://portal.onrc.ro). And even if https is always used, the form parameters (even hidden ones) are always visible in the source of the page, making automated form submission possible. If JavaScript isn't deployed to dynamically change or add parameters on mouse events, https may not be of great help. Another problem is direct browser access to files on the server. This doesn't concern resources that are referenced in the HTML source, because they are already visible. But if user A
sees on his page two pictures placed in the /images folder that are named Picture3.jpg and Picture8.jpg, he might figure out that by accessing the folder /images he will see the whole list of pictures, or that by typing http://sitename.com/images/Picture5.jpg in a browser he will obtain a picture belonging to another user. To prevent this from happening, .htaccess file rules should be implemented to block direct access to certain file types. It is also good practice to avoid common and obvious names for files and folders (like the ones from this example). Even so, some of the file and folder names can be found from the source of the page, while others could be obtained by trying different file names with brute force. Rules stated in the .htaccess file can also prevent image hotlinking, known as bandwidth theft [11] (when another site embeds media content stored on your servers, thus using your bandwidth to load it whenever its page gets accessed).

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?cristiursu\.ro/ [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteRule .*\.(jpe?g|gif|bmp|png)$ http://cristiursu.ro/failedHotLinking.gif [L]

If implemented, these rules stated in .htaccess would make the server respond with the failedHotLinking.gif picture to any request coming from another server that tries to embed a picture from this domain. Preventing some other site from including your page in an iframe can be achieved by using frame-buster scripts. Using this piece of JavaScript code in your page is supposed to make it impossible to embed it into an iframe [13]:

if (top !== self) {
    alert('This content cannot be embedded in an iframe');
    top.location.href = self.location.href;
}

In reality, frame-buster busters have been created. Another solution involving iframes is to use them in your page and to make the content you are protecting (usually pictures) dependent on its frame, so that if someone else wants to use it, they have to embed your frame. This doesn't however prevent the use of your content, it just makes the original source visible in the HTML code. An effective way of making sure everyone knows that you own a media resource is to watermark it by permanently attaching your logo to it. In this way, even if it gets used in some other context, your copyright is very clear. The main downside is that watermarking alters the media content [1]. Captcha systems are often used to differentiate between humans and machines. They are based on asking the visitor to do something in order to prove that he is human - usually a numeric computation or reading symbols from a picture containing sometimes distorted text. This type of verification is not popular among users, and solutions to solve captcha challenges are quite accessible. Some are automatic (character recognition software) and become inefficient when the picture has a lot of background noise, while others involve human agents that are paid to remotely solve captcha requests; attackers provide the picture (assuming they can parse the HTML and obtain it) to the decaptcha server through a service, and the response they receive is a set of characters that represents a valid solution for that captcha. Human decaptcha services are more accurate than OCR (character recognition) based software, having a success rate of over 90% compared to only 40% for difficult captchas. Human decaptchas are, however, slower than OCR, having response times of 10-30 seconds, while character recognition software gives results in
just over a second [18]. It therefore begs the question whether decaptcha services are legal. They have to be, considering that it's almost like asking someone else to come and type the captcha code for you. Even more, some legal issues have been raised against those who implement captcha systems on their websites, because people with disabilities (particularly the visually impaired) claim them to be discriminatory. Google is now providing a new version of captcha based on an API called reCaptcha that has an extra audio version of the code you need to input. They claim it to be dynamic and based on texts from books that couldn't be solved by OCR software [19]. This is in fact a weakness, because it means that the most complicated part of the captcha image can be left blank or filled with dummy text, since Google cannot verify it. reCaptcha does however come with some advantages, because the picture is not visible in the source of the page, and in order to obtain it and send it to a decaptcha service, attackers need to parse the JSON to get the picture id, download the picture from Google and do all this while having a consistent session within the API. Considering the captcha solution, if you have a piece of text that you want to keep away from crawlers, it is better to put it online as a picture or as a PDF. Even if the picture or the PDF gets downloaded, it would take a character recognition tool or a decaptcha service to read the text inside them. Another way to stop crawling is to track the order in which certain pages are visited. Getting to some areas of the site might not make sense unless you have already been to certain pages. In this case, access to those pages can be allowed only if the log shows that the visitor has previously made the necessary steps. Because crawlers usually visit the links in the default order, some dummy links can be included in the source before any other references and a specific restriction can be
imposed that if they are visited, the connection will be reset. Web applications should only respond to requests that originate from pages that could have generated them, and shouldn't provide content in response to requests without checking whether the user agent is a valid browser. In reality they rarely do so, because both the referrer page and the user agent are very easy for an attacker to impersonate [20]. Using the cURL library in PHP, this is done in one line of code each:

//set referrer
curl_setopt($ch, CURLOPT_REFERER, 'http://cristiursu.ro/dissertation/index.php');
//set user agent
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
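By contrast, a server-side check of these headers might look like the sketch below - a weak measure for exactly the reason just given, since both values can be forged (the allowed host name is only an assumption):

//reject requests whose referrer or user agent do not look like a normal visit from our own pages
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$fromOwnSite = (parse_url($referer, PHP_URL_HOST) === 'cristiursu.ro'); //hypothetical host
$looksLikeBrowser = (stripos($agent, 'Mozilla') !== false);
if (!$fromOwnSite || !$looksLikeBrowser) {
    header('HTTP/1.1 403 Forbidden');
    exit('Direct access is not allowed.');
}
//...otherwise continue and output the protected content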
The robots.txt file has the role of allowing or denying crawling robots to visit your site. It is a very inefficient security measure because it only prevents well-behaved scripts from accessing your content. Obviously, a malicious robot will not stop if you just ask it not to scrape your pages [11]. Code scrambling only works against specialized crawlers that repeatedly scrape your site and look for certain data in certain places. Automated scripts of this kind use pieces of the HTML source as references (ex: after