Building blocks of a scalable web crawler - Marc's Blog

14 downloads 158 Views 820KB Size Report
Sep 15, 2010 - 3.2.2 R-tree-based . . . . . . . . . . . . . . . . . . . . . . ..... vs. native threads), their general p
Building blocks of a scalable web crawler Marc Seeger Computer Science and Media Stuttgart Media University September 15, 2010

A Thesis Submitted in Fulfilment of the Requirements for a Degree of Master of Science in Computer Science and Media Primary thesis advisor: Prof. Walter Kriha Secondary thesis advisor: Dr. Dries Buytaert

I

I

Abstract The purpose of this thesis was the investigation and implementation of a good architecture for collecting, analysing and managing website >example.com drupal 7.0 While this interface can be accessed from basically any language that has an HTTP library, Solr offers a second possibility that is more suited to the crawlers infrastructure: a JDBC import. By specifying the connection to the content="WordPress 2.9.2" /> While security through obscurity is not a proper way to keep an application safe, it might be a good idea to remove this tag just in case a vulnerability of a certain version of the CMS would make it easily detectable as a possible target.

6.1.2

Included files

Since the upcoming of the term "AJAX", most bigger content management systems use Javascript to allow their users to edit forms or interact with the website without reloading the complete page. For this, external Javascript files are usually required. In the case of Drupal, a file called "drupal.js" is included using the script tag: Please pay attention to the problems section (6.1.9) of this chapter for further comments on this.

6.1. CMS DETECTION

6.1.3

111

Javascript variables

After a javascript library such as jQuery has been loaded, it is sometimes necessary to set certain CMS specific parameters. In the case of Drupal, for example, it is often the case that the HTML code includes this call:

6.1.7

HTTP headers

Some CMS are detectable by looking at the HTTP headers of the package being returned by the web server. One of the more prominent examples is Drupal which sets the "EXPIRES" header to "Sun, 19 Nov 1978 05:00:00 GMT". In this case, November 19th 1978 is the birthday of the original creator and project lead for Drupal. Before actually adding this to the

6.1. CMS DETECTION

113

fingerprinting engine, it is always a good idea to check the source code (if available) to be sure that it was not a load balancer or the web server that added a certain HTTP header.

6.1.8

Cookies

Another interesting part of the HTTP headers are cookies. A lot of content management systems use session-cookies to track user behaviour or restrict access using role based access control systems. An example of these cookies can be found in headers created by the WebGUI CMS: Set-Cookie: wgSession=qmDE6izoIcokPhm-PjTNgA; Another example would be the blogtronix plattform which also sets a cookie: Set-Cookie: BTX_CLN=CDD92FE[...]5F6E43E; path=/; HttpOnly

6.1.9

Problems

While the techniques that have been showed so far are able to detect a large percentage of CMS systems, there are as always exceptions to those rules. When it comes to the detection of included Javascript/CSS files, "CSS/Javascript aggregation" can circumvent detection of specific files. When aggregating Javascript or CSS files, a single file containing all of the initially included files is generated and linked to. This will generally improve site load time and might look like this in HTML: The filename is usually just generated, and without inspection of the file, it is hard to say which CMS (or other preprocessor) created the file. Another problem arises when it comes to using CMS-specific paths in the detection process. A good example of CMS paths are the "wp-" prefixed

114

CHAPTER 6. FINGERPRINTING

paths that Wordpress uses to store content. Most images posted by a user will usually have "/wp-content/" in the path-name. The problem is that "hotlinking", meaning the linking to images that are on another person’s webspace is pretty common on the internet. By simply relying on paths, a lot of false positives could arise. Selecting paths carefully might help with this. In the case of Wordpress, the "wp-includes" path tends to be a good candidate for detection, as most people will not hotlink another person’s Javascript/CSS.

6.2

Web servers

When it comes to detecting web servers, there are a lot of sophisticated methods to do so. Dustin Willioam Lee describes this in detail in his Master’s Degree Thesis "HMAP: A Technique and Tool For Remote Identification of HTTP Servers" [[16]] Since the fingerprinting of web servers is not the main goal of the current project, I decided to simply rely on the "server" field in the HTTP header. A typical HTTP response including the "Server" field looks like this: $ curl -I marc-seeger.de HTTP/1.1 200 OK Content-Type: text/html Connection: keep-alive Status: 200 X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 2.2.15 Content-Length: 1398 Server: nginx/0.7.67 + Phusion Passenger 2.2.15 (mod_rails/mod_rack) Using this response, we can deduct that the website in question is running on the nginx webserver or is at least load-balanced by it. While some sites try to keep their output to a minimum, the simple check for the "Server" header is enough to detect a large majority of web servers.

6.3. DRUPAL SPECIFIC media="all">@import "/modules/poll/poll.css"; [...]