CHAPTER 2
WEB CRAWLER

2.1 INTRODUCTION
A "Web Crawler" (also known as a "Web Spider", "Web Robot" or simply "Bot") is software for downloading pages from the Web automatically. Unlike what the name may suggest, a Web crawler does not actually move around the computers connected to the Internet, as viruses or intelligent agents do; it only sends requests for documents to Web servers from a set of pre-defined locations. The input to this software is a set of starting or seed pages; once these pages are downloaded, they are parsed and scanned for links. The links pointing to pages that have not yet been downloaded are added to a queue, from which a new subset of pages is chosen according to some policy. This subset is then downloaded and the process is repeated until a stop criterion is met. Web crawlers can be used for a variety of purposes; the most relevant one is to index a large set of pages and allow other people to search this index.
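To make this loop concrete, the following is a minimal sketch of the crawl cycle in Java. The class and helper names (SimpleCrawler, fetch, extractLinks) are illustrative placeholders rather than parts of any existing library, and politeness and error handling are omitted.

import java.util.*;

// Minimal sketch of the generic crawl loop: seeds -> fetch -> parse -> enqueue.
// fetch() and extractLinks() are placeholders for real HTTP and HTML-parsing code.
public class SimpleCrawler {
    public static void crawl(List<String> seeds, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(seeds); // pages still to visit
        Set<String> seen = new HashSet<>(seeds);          // avoid re-downloading
        int fetched = 0;
        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.removeFirst();
            String html = fetch(url);                     // download the page
            fetched++;
            for (String link : extractLinks(html)) {      // scan it for links
                if (seen.add(link)) {                     // only enqueue new URLs
                    frontier.addLast(link);
                }
            }
        }
    }

    // Placeholder: a real implementation would issue an HTTP request here.
    static String fetch(String url) { return ""; }

    // Placeholder: a real implementation would parse anchor tags here.
    static List<String> extractLinks(String html) { return new ArrayList<>(); }
}

The rest of this chapter refines exactly these pieces: which URL to pick next (selection), when to come back to a page (re-visiting), and how to avoid overloading servers (politeness).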
2.2 A BRIEF HISTORY
In June 1993, Matthew Gray, an undergraduate student at MIT, sent the following message to the www-talk mailing list [1]:
“I have written a perl script that wanders the WWW collecting URLs, keeping tracking of where it’s been and new hosts that it finds. Eventually, after hacking up the code to return some slightly more useful information (currently it just returns URLs), I will produce a searchable index of this.”
The project was called WWWW (World Wide Web Wanderer) and it was the first Web crawler. It was used mostly for Web characterization studies [2]. Currently, all major search engines employ Web crawlers, and their referrals account for over 10% of the visits that Web sites receive [3].
Effective Web crawlers are key to the success of modern search engines, and given the enormous size of the Web, it cannot possibly be explored entirely by humans. To a large extent, the Web remains intelligible to its human users precisely because it is constantly analyzed and monitored by these automatic agents [4].
This chapter is organized as follows: the next section motivates the design of a Web crawler by discussing its possible applications. A later section discusses the general algorithm used by a crawler and how to decompose it into a series of policies.
2.3 APPLICATIONS OF A WEB CRAWLER
A Web crawler is a tool for exploring a subset of the Web. This exploration may serve different goals. The key application of a Web crawler is to create an index covering broad topics (general Web search) or specific topics (vertical Web search). Web crawlers may also be used to analyze Web sites automatically, for extracting aggregate statistics (Web characterization), for improving the sites (Web site analysis), or for keeping a copy of a set of pages (Web mirroring).
2.3.1 General Web search
Web search is the application that has driven Web crawling development during the last years. It can be roughly divided into general Web search, of the type done by large search engines, and vertical Web search, in which the target set is delimited by a topic.

A crawler for a general Web search engine must carefully balance coverage and quality. By coverage we mean that it must scan pages that allow it to create an index that can be used to answer many different queries. By quality we mean that the pages should have high quality. As the crawler has finite resources, these two goals may conflict.

A crawler for vertical Web search is interested in a particular subset of the Web. This subset may be defined geographically, linguistically, topically, etc. Popular applications for vertical crawlers include the creation of vertical portals (also known as "vortals" [5]), that is, Web sites that provide answers for the needs of specific types of users.
2.3.2 Web characterization
Web characterization is a prerequisite for building effective Web search engines, and includes all attempts to derive statistical properties of Web pages. The most obvious (and difficult) question is what constitutes a representative sample of the Web. This is a particularly difficult issue if we consider that, even though the Web contains a finite amount of information, it may contain infinitely many pages. When crawlers are used for Web characterization, both the starting set of pages and the parameters of the selection policy affect the properties of the obtained collection. Of course, some properties are more affected than others. Page-centered characterization, measuring page sizes, technologies, markup and other properties, is affected less than link-centered characterization. For link-centered characterization, the choice of the starting URLs for the crawl is critical. A strong bias may be imposed on the observed macroscopic structure of the Web [6] if the seed URLs are chosen poorly (for instance, the IN component may be too small).
2.3.4 Mirroring
Mirroring on the Web is the act of keeping a partial or complete copy of a Web site. The intent of mirroring is typically to distribute server load and provide faster access to users in different network locations. The copies are usually called "mirrors".
Crawlers used for mirroring are usually very simple, operate in a controlled environment, and thus can use both compression and delta encoding (for instance, by using rsync) to update the mirrors. Also, there is a mirroring policy indicating the appropriate refreshing period, typically daily or weekly, and the best time of the day to do the mirroring. Web archiving is the act of keeping a mirror of a large set of pages without discarding the outdated copies; that is, the whole history of each page is kept. The largest Web archiving project is the Internet Archive [7], which aims at archiving all of the publicly available information on the Web, including both text and images. The Internet Archive collection as of March 2006 consisted of 55 billion page-versions comprising several hundred terabytes of data. The purpose of a Web archiving crawler is to preserve the state of the Web in a given year, before pages are deleted or updated.
2.4 TAXONOMY OF CRAWLERS
While all types of crawlers usually aim to have coverage that is as complete as possible, there are other goals in which they may differ. Some of the Web crawling applications we have described can be classified according to the relative importance they assign to the following issues:
Freshness: in some cases, it is critical that the copy obtained by the crawler is up-to-date, while in other cases, an older copy is just as good.
Quality: some crawlers aim for a particular, high-quality portion of the Web, while others are more interested in having broad coverage at different quality levels.
Volume: some crawlers are interested in keeping a large fraction of the Web, while others may sacrifice quantity to have more quality and/or freshness.
Some crawlers classified according to these three axes are shown in Figure 2.1. For instance, a mirror system is designed to keep a very accurate and complete copy of a subset of pages, while a vertical search engine or a personal crawler prioritizes the collection of high-quality resources over these factors and only cares about portions of the content.
Figure 2.1
2.4.1 Crawling algorithms
A Web crawler needs to achieve several goals at the same time, and some of these goals contradict each other. In general, a Web crawler seeks "fresh" copies of Web pages, so it must re-visit pages to detect changes, but at the same time it has to discover new pages. A Web crawler needs to use the available resources efficiently, including network bandwidth, so it must exhibit a high degree of parallelism without overloading the Web servers it visits. Also, it would be desirable to download "good" pages to avoid wasting resources, but the crawler cannot know in advance which pages are "good" without first downloading them. Moreover, the target is dynamic: there is a huge number of pages being added, changed and removed every day. Also, network speed has improved less than current processing speeds and storage capacities. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a collection, it is very likely that new pages have appeared, or that pages have already been updated or even deleted.
Crawling the Web, in a certain way, resembles watching the sky on a clear night: what we see reflects the state of the stars at different times, as the light travels different distances. What a Web crawler gets is not a "snapshot" of the Web, because it does not represent the Web at any given instant of time. The last pages being crawled are probably very accurately represented, but the first pages that were downloaded have a high probability of having been changed. This idea is depicted in Figure 2.2.
Figure 2.2
In Web crawlers operating over large collections, such as the crawlers of general, global-scale search engines, it is useful to separate the concerns when analyzing crawling algorithms. A useful modularization is to consider the crawler’s behavior as the combination of a series of independent policies. Here we divide a crawling algorithm into three policies with three different goals:
Selection policy: visit the most important pages.
Re-visit policy: update the index when pages change.
Politeness policy: avoid disrupting Web sites.
2.4.2 Selection policy
Given the current size of the Web, even large search engines cover only a portion of the publicly available content. An early study in 2000 by Lawrence and Giles [8] showed that no search engine indexed more than 16% of the Web, while another study in 2005 [9] showed that the coverage of large search engines is between 58% and 76%. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages and not just a random sample of the Web. As Edwards et al. [10] noted:
“Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained.” The selection of which pages to crawl can be divided into two types of restrictions: off-line limits that are set beforehand, and on-line limits that are computed as the crawl goes by.
Off-line limits
Because of storage limitations, the crawler operator has to decide beforehand on some limits for the crawling process. These limits can be expressed in terms of overall maximum quantities, but because of the power-law distribution of pages to hosts, it is also useful to express them as per-host or per-domain limits. The off-line limits used most frequently by Web crawlers are the following:
A maximum number of hosts in the collection, in the case of vertical crawling in a large domain.
A maximum exploration depth in terms of number of links from the home page or from the starting set of pages.
A maximum overall number of pages or bytes in the collection, depending on the space available to store the data.
Per-host or per-domain limits including maximum number of pages, or bytes downloaded from each server.
A list of accepted mime-types for downloading (e.g.: text/html and text/plain)
Also, per-page limits may include the maximum size of the pages (for instance, some search engines may index only the first few words of some pages) or the maximum number of out-links to process per page. A small sketch of how such limits might be grouped together follows.
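Purely as an illustration, these off-line limits could be collected into a configuration object along the following lines. The CrawlLimits class, its field names and the default values are hypothetical and are not part of the crawler implemented later in this chapter.

import java.util.*;

// Hypothetical container for the off-line limits decided before a crawl starts.
public class CrawlLimits {
    int maxHosts = 100000;          // maximum number of hosts in the collection
    int maxDepth = 5;               // maximum link depth from the seed pages
    long maxTotalBytes = 1L << 40;  // overall size budget for the collection
    int maxPagesPerHost = 5000;     // per-host page limit
    int maxPageSizeBytes = 1 << 20; // per-page size limit
    int maxOutLinksPerPage = 300;   // out-links processed per page
    Set<String> acceptedMimeTypes =
        new HashSet<>(Arrays.asList("text/html", "text/plain"));
}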
On-line limits
On-line selection of the pages to be crawled requires a metric of importance for prioritizing Web pages. The importance of a page may be a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL. Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.
2.4.3 Re-visit policy
The Web has a very dynamic nature, and crawling a fraction of the Web can take a long time, usually measured in weeks or months. By the time a Web crawler has finished its crawl, many events could have happened. We characterize these events as creations, updates and deletions [11]:
Creations
When a page is created, it will not be visible on the public Web space until it is linked, so at least one page update (adding a link to the new Web page) must occur for a Web page creation to be visible. A Web crawler starts with a set of starting URLs, usually a list of domain names, so registering a domain name can be seen as the act of creating a URL.
Updates
Page changes are difficult to characterize, but basically an update can be either minor or major. An update is minor if it occurs at the paragraph or sentence level, so the page is semantically almost the same and references to its content are still valid. On the contrary, in the case of a major update, all references to its content are no longer valid.
Deletions
A page is deleted if it is no longer accessible. Note that even if all the links to a page are removed and the page is no longer visible in the Web site, it may still be visible to the Web crawler if the exact URL is known. It is almost impossible to detect that a page has lost all its links, as the Web crawler can never tell whether links to the target page are absent altogether, or whether they are only present in pages that have not been crawled.
2.4.4 Politeness policy
As noted by Koster [12], the use of Web robots, while useful for a number of tasks, comes with a price for the general community. Web crawlers require considerable bandwidth and operate with a high degree of parallelism during long periods of time, so they may consume an excessive amount of bandwidth. This creates server overload, especially if the frequency of access to a given server is high and/or if the robot is poorly written. This is not the only concern, as privacy is also an issue with Web crawlers: they may, for instance, access parts of a Web site that were not meant to be public, index them, and then offer them to a broader audience than was intended. Finally, if the search engine that operates the robot keeps a "cache" of downloaded pages, copyright issues may arise. A set of guidelines is also important for the continued operation of a Web crawler. If we consider that several Web sites may be hosted by a few physical servers, then being impolite with one Web site may result in being banned from all the Web sites hosted by the same hosting provider. The three basic rules for Web crawler operation that have reached community consensus, and are currently obeyed by all major search engines, are:
A Web crawler must identify itself as such, and must not pretend to be a regular Web user. This serves many purposes, including correctly counting the number of visitors to a Web site, but also reducing the bandwidth allocated to robots in some cases.
A Web crawler must obey the robots exclusion protocol, that specifies a standard for administrators to indicate which parts of their Web servers should not be accessed by robots.
A Web crawler must keep its bandwidth usage low. This means that it should not download more than one page simultaneously from the same site, and must wait some time between two consecutive downloads.
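As a rough illustration of the last rule, a crawler can remember when it last contacted each host and wait before issuing the next request. This is only a sketch under the assumption of a fixed delay per host; the class name PolitenessThrottle and the delay value are illustrative, not part of any standard.

import java.util.*;

// Sketch: enforce a minimum delay between two requests to the same host.
public class PolitenessThrottle {
    private final long delayMillis;                 // e.g. a few seconds per host
    private final Map<String, Long> lastAccess = new HashMap<>();

    public PolitenessThrottle(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Blocks until at least delayMillis have passed since the last request
    // to this host, then records the new access time.
    public synchronized void waitForHost(String host) throws InterruptedException {
        long now = System.currentTimeMillis();
        Long last = lastAccess.get(host);
        if (last != null && now - last < delayMillis) {
            Thread.sleep(delayMillis - (now - last));
        }
        lastAccess.put(host, System.currentTimeMillis());
    }
}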
The Web crawlers of the four largest search engines follow these rules, according to their published documentation:
Google (Googlebot) http://www.google.com/webmasters/bot.html
Yahoo! Search (Slurp!) http://help.yahoo.com/help/us/ysearch/slurp/
MSN Search http://www.msnsearch.com/msnbot.htm
Ask search http://about.ask.com/en/docs/about/webmasters.shtml
2.4.5 Example Crawling Algorithms
We now discuss a number of crawling algorithms that have been suggested in the literature. Note that many of these algorithms are variations of the best-first scheme. The difference is in the heuristics they use to score the unvisited URLs, with some algorithms adapting and tuning their parameters before or during the crawl.
2.4.5.1 Naive Best-First Crawler
A naive best-first crawler was one of the crawlers detailed and evaluated by the authors in an extensive study of crawler evaluation [13]. This crawler represents a fetched Web page as a vector of words weighted by occurrence frequency. The crawler then computes the cosine similarity of the page to the query or description provided by the user, and scores the unvisited URLs on the page by this similarity value. The URLs are then added to a frontier that is maintained as a priority queue based on these scores. In the next iteration each crawler thread picks the best URL in the frontier to crawl, and returns with new unvisited URLs that are again inserted into the priority queue after being scored based on the cosine similarity of the parent page. The cosine similarity between a page p and a query q is computed as:
sim(p, q) = (v_q · v_p) / (||v_q|| · ||v_p||)

where v_q and v_p are term frequency (TF) based vector representations of the query and the page respectively, v_q · v_p is the dot (inner) product of the two vectors, and ||v|| is the Euclidean norm of the vector v. More sophisticated vector representations of pages, such as the TF-IDF [14] weighting scheme often used in information retrieval, are problematic in crawling applications because there is no a priori knowledge of the distribution of terms across crawled pages.
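The following is a small sketch of how sim(p, q) could be computed from term-frequency maps; the class and method names are illustrative and unrelated to the crawler code presented later in this chapter.

import java.util.*;

// Sketch: cosine similarity between two term-frequency vectors,
// sim(p, q) = (v_q . v_p) / (||v_q|| * ||v_p||).
public class Cosine {
    public static double similarity(Map<String, Integer> vq, Map<String, Integer> vp) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : vq.entrySet()) {
            Integer f = vp.get(e.getKey());
            if (f != null) dot += e.getValue() * f;   // accumulate the inner product
        }
        double nq = norm(vq), np = norm(vp);
        return (nq == 0 || np == 0) ? 0.0 : dot / (nq * np);
    }

    static double norm(Map<String, Integer> v) {
        double s = 0.0;
        for (int f : v.values()) s += (double) f * f; // sum of squared frequencies
        return Math.sqrt(s);
    }
}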
In a multiple-thread implementation the crawler acts like a best-N-first crawler, where N is a function of the number of simultaneously running threads. Thus best-N-first is a generalized version of the best-first crawler that picks the N best URLs to crawl at a time. In my research I have found the best-N-first crawler (with N = 256) to be a strong competitor [15], showing clear superiority on the retrieval of relevant pages. Note that the best-first crawler keeps the frontier size within its upper bound by retaining only the best URLs based on the assigned similarity scores.
2.4.5.2 SharkSearch
SharkSearch [16] is a version of FishSearch [17] with some improvements. It uses a similarity measure like the one used in the naive best-first crawler for estimating the relevance of an unvisited URL. However, SharkSearch has a more refined notion of potential scores for the links in the crawl frontier. The anchor text, the text surrounding the links (link-context), and an inherited score from ancestors influence the potential scores of links. The ancestors of a URL are the pages that appeared on the crawl path to the URL. SharkSearch, like its predecessor FishSearch, maintains a depth bound: if the crawler finds unimportant pages on a crawl path, it stops crawling further along that path. To be able to track all the information, each URL in the frontier is associated with a depth and a potential score. The depth bound (d) is provided by the user, while the potential score of an unvisited URL is computed as:
score(url) = γ · inherited(url) + (1 − γ) · neighborhood(url)

where γ < 1 is a parameter, the neighborhood score signifies the contextual evidence found on the page that contains the hyperlink URL, and the inherited score is obtained from the scores of the ancestors of the URL. More precisely, the inherited score is computed as:

inherited(url) = δ · sim(q, p)        if sim(q, p) > 0
inherited(url) = δ · inherited(p)     otherwise

where δ < 1 is again a parameter, q is the query, and p is the page from which the URL was extracted.
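A minimal sketch of how these two formulas fit together is given below, assuming the neighborhood score (computed from anchor text and link context) and the parent's similarity and inherited score are already available; the parameter values and names are illustrative, not taken from the SharkSearch implementation.

// Sketch of the SharkSearch potential score:
// score(url) = gamma * inherited(url) + (1 - gamma) * neighborhood(url)
public class SharkScore {
    static double gamma = 0.8;   // illustrative parameter values, gamma < 1
    static double delta = 0.5;   // delta < 1

    // inherited(url): damped similarity of the parent page p to the query q,
    // falling back to the parent's own inherited score when sim(q, p) == 0.
    static double inherited(double simQP, double inheritedParent) {
        return (simQP > 0) ? delta * simQP : delta * inheritedParent;
    }

    // Combine the inherited score with the contextual (neighborhood) evidence.
    static double score(double inheritedUrl, double neighborhoodUrl) {
        return gamma * inheritedUrl + (1 - gamma) * neighborhoodUrl;
    }
}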
2.4.5.3 Focused Crawler
A focused crawler based on a hypertext classifier was developed by Chakrabarti. The basic idea of the crawler is to classify crawled pages into categories of a topic taxonomy. To begin, the crawler requires a topic taxonomy such as Yahoo! or the ODP. In addition, the user provides example URLs of interest (such as those in a bookmark file). The example URLs are automatically classified onto various categories of the taxonomy. Through an interactive process, the user can correct the automatic classification, add new categories to the taxonomy and mark some of the categories as "good" (i.e., of interest to the user). The crawler uses the example URLs to build a Bayesian classifier that can find the probability Pr(c|p) that a crawled page p belongs to a category c in the taxonomy. Note that by definition Pr(r|p) = 1, where r is the root category of the taxonomy. A relevance score is associated with each crawled page and is computed as:

R(p) = Σ Pr(c|p)

where the sum runs over the categories marked as "good". When the crawler is in a "soft" focused mode, it uses the relevance score of the crawled page to score the unvisited URLs extracted from it. The scored URLs are then added to the frontier. Then, in a manner similar to the naive best-first crawler, it picks the best URL to crawl next. In the "hard" focused mode, for a crawled page p, the classifier first finds the leaf node c (in the taxonomy) with maximum probability of including p. If any of the parents (in the taxonomy) of c are marked as "good" by the user, then the URLs from the crawled page p are extracted and added to the frontier. Another interesting element of the focused crawler is the use of a distiller. The distiller applies a modified version of Kleinberg's algorithm to find topical hubs. The hubs provide links to many authoritative sources on the topic. The distiller is activated at various times during the crawl and some of the top hubs are added to the frontier.
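As a small illustration of the soft focused mode, the relevance score can be pictured as a sum of classifier probabilities over the categories marked as good. The names below are hypothetical and the classifier itself is assumed to exist elsewhere.

import java.util.*;

// Sketch: relevance of a crawled page in the "soft" focused mode,
// R(p) = sum of Pr(c|p) over the categories the user marked as "good".
public class FocusedScore {
    static double relevance(Map<String, Double> probByCategory, Set<String> goodCategories) {
        double r = 0.0;
        for (String c : goodCategories) {
            Double p = probByCategory.get(c);
            if (p != null) r += p;   // add Pr(c|p) for each good category
        }
        return r;                    // used to score the URLs extracted from p
    }
}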
2.4.5.4 InfoSpiders
In InfoSpiders, an adaptive population of agents searches for pages relevant to the topic. Each agent essentially follows the crawling loop while using an adaptive query list and a neural net to decide which links to follow. The algorithm provides an exclusive frontier for each agent. In a multi-threaded implementation of InfoSpiders, each agent corresponds to a thread of execution. Hence, each thread has non-contentious access to its own frontier. Note that any of the algorithms described in this chapter may be implemented similarly (one frontier per thread). In the original algorithm each agent kept its frontier limited to the links on the page that was last fetched by the agent. Due to this limited-memory approach the crawler was limited to following the links on the current page, and it was outperformed by the naive best-first crawler on a number of evaluation criteria. Since then a number of improvements (inspired by naive best-first) to the original algorithm have been designed, while retaining its capability to learn link estimates via neural nets and to focus its search toward more promising areas by selective reproduction. In fact, the redesigned version of the algorithm has been found to outperform various versions of naive best-first crawlers on specific crawling tasks with crawls that are longer than ten thousand pages.

The adaptive representation of each agent consists of a list of keywords (initialized with a query or description) and a neural net used to evaluate new links. Each input unit of the neural net receives a count of the frequency with which the keyword occurs in the vicinity of each link to be traversed, weighted to give more importance to keywords occurring near the link (and maximum in the anchor text). There is a single output unit. The output of the neural net is used as a numerical quality estimate for each link considered as input. These estimates are then combined with estimates based on the cosine similarity (the sim(p, q) measure defined earlier) between the agent's keyword vector and the page containing the links. A parameter α, 0 ≤ α ≤ 1, regulates the relative importance given to the estimates based on the neural net versus the parent page. Based on the combined score, the agent uses a stochastic selector to pick one of the links in the frontier with probability

Pr(λ) = e^(βσ(λ)) / Σλ'∈φ e^(βσ(λ'))

where λ is a URL in the local frontier φ and σ(λ) is its combined score. The β parameter regulates the greediness of the link selector. After a new page has been fetched, the agent receives "energy" in proportion to the similarity between its keyword vector and the new page. The agent's neural net can be trained to improve the link estimates by predicting the similarity of the new page, given the inputs from the page that contained the link leading to it. A back-propagation algorithm is used for learning. Such a learning technique provides InfoSpiders with the unique capability to adapt the link-following behavior in the course of a crawl by associating relevance estimates with particular patterns of keyword frequencies around links.
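The stochastic selector described above is essentially a softmax over the combined link scores σ(λ). The following is a small sketch of that selection step only, under the assumption that the combined scores are already available; the class and method names are illustrative, and the surrounding agent machinery (keyword vector, neural net, energy) is omitted.

import java.util.*;

// Sketch: pick a link lambda from the local frontier with probability
// Pr(lambda) = exp(beta * sigma(lambda)) / sum over lambda' of exp(beta * sigma(lambda')).
public class SoftmaxSelector {
    static String pick(Map<String, Double> scoreByUrl, double beta, Random rnd) {
        double total = 0.0;
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scoreByUrl.entrySet()) {
            double w = Math.exp(beta * e.getValue());  // greediness controlled by beta
            weights.put(e.getKey(), w);
            total += w;
        }
        double r = rnd.nextDouble() * total;           // roulette-wheel selection
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            last = e.getKey();
            r -= e.getValue();
            if (r <= 0) return last;
        }
        return last;  // fallback for rounding; null only if the frontier is empty
    }
}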
2.5 THE DESIGNED CRAWLER
I have chosen the naive best-first crawler for the design and implementation in my project and optimized it to the best-N-first variant, because, as mentioned in the previous text, the naive best-first crawler operated as best-N-first can be a good competitor for relevant page extraction. The following text describes the basic design methodology and the code.
2.5.1 Processing items in a queue
Web crawling can be regarded as processing items in a queue. When the crawler visits a Web page, it extracts links to other Web pages. The crawler puts these URLs at the end of a queue, and continues crawling with a URL that it removes from the front of the queue.
It is obvious that every algorithm that just works by processing items that are independent of each other can easily be parallelized. Therefore, it is desirable to write a few classes that handle multithreading and that can be reused. In fact, the classes that I originally wrote for Web crawling were reused exactly as they are for a machine learning program.
Java provides easy-to-use classes for both multithreading and the handling of lists. (A queue can be regarded as a special form of a linked list.) For multithreaded Web crawling, we just need to enhance the functionality of Java's classes a little. In the Web crawling setting, it is desirable that one and the same Web page is not crawled multiple times. We therefore use not only a queue, but also a set that contains all URLs that have been gathered so far. Only if a new URL is not in this set is it added to the queue.
We may also want to limit either the number of Web pages we visit or the link depth. The methods in the Queue interface reflect the desired functionality. The following code shows the interface of the queue.

package ie.moguntia.threads;
import java.util.Set;
/**
 * Interface Queue
 *
 * This interface defines a Queue for a multithreaded environment, where
 * threads can insert (push) Objects (tasks) at the end of the queue and
 * retrieve (pop) them from the beginning. This interface also supports a
 * level argument. This is useful if the threads have to obey certain
 * restrictions, e.g. read tasks only from level n and write tasks only to
 * level n+1. A thread controller must handle this. Note that this interface
 * makes no assumption about how many levels are supported by the
 * implementing class, nor whether levels are supported at all. An
 * implementing class could decide just to ignore the level argument.
 */
public interface Queue {
    public Set getGatheredElements();
    public Set getProcessedElements();
    public int getQueueSize(int level);
    public int getProcessedSize();
    public int getGatheredSize();
    public void setMaxElements(int elements);
    public Object pop(int level);
    public boolean push(Object task, int level);
    public void clear();
}
Note that if we limit the depth level, we need more than one queue. If we only had one queue, the crawler could not easily determine the link depth of the URL it is currently visiting. But regardless of the link depth we allow, two queues are sufficient. In this case, we allow the crawler to only fetch URLs from queue 1 and add URLs to queue 2. When all URLs in queue 1 are processed, we switch the queues; Figure 2.2 elaborates the concept.
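The two-queue idea can be sketched as follows, independently of the Queue interface shown above; the class and field names here are illustrative only.

import java.util.*;

// Sketch: crawl level by level with two queues. URLs are read from the
// current queue and newly found URLs go into the next queue; when the
// current queue is empty, the queues are swapped and the depth increases.
public class TwoQueueFrontier {
    private LinkedList<String> current = new LinkedList<>();
    private LinkedList<String> next = new LinkedList<>();
    private int depth = 0;

    public void addSeed(String url) { current.add(url); }
    public void addDiscovered(String url) { next.add(url); }

    // Returns the next URL to crawl, or null when the depth limit is reached.
    public String pop(int maxDepth) {
        if (current.isEmpty()) {
            if (next.isEmpty() || depth >= maxDepth) return null;
            LinkedList<String> tmp = current;  // swap the two queues
            current = next;
            next = tmp;
            depth++;
        }
        return current.removeFirst();
    }
}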
Because of the generality of the problem, we can allow general Java Objects to be stored in the actual implementation of the queue. The main queue implementation is as follows:
import java.util.*;
/** * Simple "reference" implementation of the queue interface * In addition to the interface, an object "data" is supported * for carrying additional data over threads. * By default, the queue supports 4 levels and an unlimited * number of elements. The number of elements is unlimited when * set to -1. A limited number of elements and a different * number of levels can be specified in the appropriate constructors. */
public class ObjectQueue implements Queue {
    Object data;
    // Initialize the sets here so that the size and element accessors below
    // do not throw a NullPointerException.
    Set gatheredElements = new HashSet();
    Set processedElements = new HashSet();
    LinkedList queues[];
    int mx;
    int nq;
public synchronized void setData(Object o) { data = o; }
public synchronized Object getData() { return data; }
    public ObjectQueue(int _nq, int _mx) {
        nq = _nq;
        mx = _mx;
        queues = new LinkedList[nq];
        for (int n = 0; n < nq; n++) {
            queues[n] = new LinkedList();
        }
    }
    public ObjectQueue(int _nq) {
        nq = _nq;
        mx = -1;
        queues = new LinkedList[nq];
        for (int n = 0; n < nq; n++) {
            queues[n] = new LinkedList();
        }
    }
    public ObjectQueue() {
        nq = 4;
        mx = -1;
        queues = new LinkedList[nq];
        for (int n = 0; n < nq; n++) {
            queues[n] = new LinkedList();
        }
    }
public Set getGatheredElements() { return gatheredElements; }
    public Set getProcessedElements() { return processedElements; }
    public int getQueueSize(int level) {
        if (level < 0 || level >= nq) return 0;
        else return queues[level].size();
    }
public int getProcessedSize() { return processedElements.size(); }
public int getGatheredSize() { return gatheredElements.size(); }
public void setMaxElements(int elements) { mx = elements; }
    public synchronized Object pop(int level) {
        if (level < 0 || level >= nq) return null;
        else if (queues[level].size() == 0) return null;
        else return queues[level].removeFirst();
    }
    public synchronized boolean push(Object task, int level) {
        // Refuse the task if the element limit has been reached
        // or if the level is out of range.
        if (mx != -1 && getGatheredSize() >= mx) return false;
        if (level < 0 || level >= nq) return false;
        queues[level].addLast(task);
        return true;
    }
    public synchronized void clear() {
        for (int n = 0; n < nq; n++) queues[n].clear();
    }
}
2.5.2 Implementation of the Thread Controller
As mentioned above, Java has a good interface for handling threads. However, in our special case, we can add a little more generic functionality. We want to make use of a number of threads that process items from the queue. A 'controller' should create new threads if and only if there are still items in the queue to process and the total number of threads does not exceed an upper bound.
In our implementation of such a thread controller, we provide the controller class on construction with, among other parameters, the class object for the thread class and the queue. The queue should be pre-filled with at least one item. We require from our thread class that it implements the run method it inherits from the Thread class. The following code shows the implementation in Java.
/**
 * A simple controller class for a multithreaded environment, where threads
 * may insert and process 'tasks' from/into a queue. Multiple 'depth levels'
 * are supported. 'Tasks' are not to be confused with OS tasks, but just
 * denote elements in the queue, i.e. a task for the thread to perform.
 *
 * Note that the depth level in this class is just to make sure that where
 * appropriate a thread may only read tasks from level n and write tasks to
 * level n+1, i.e. only two levels at a time are supported. The actual number
 * is in this class only used as a halting criterion, but a thread may use
 * the information.
 *
 * For more information on what the depth levels are good for see the comments
 * for interface Queue.
 */
public class ThreadController {
    /**
     * current level (see interface Queue for details on levels)
     */
    int level;

    /**
     * maximum depth level allowed, -1 if unlimited
     */
    int maxLevel;

    /**
     * maximum number of parallel threads, -1 if unlimited
     */
    int maxThreads;

    /**
     * the task queue
     */
    Queue tasks;

    /**
     * An object that is notified about what a thread does.
     * See comments for interface MessageReceiver for details.
     */
    MessageReceiver receiver;

    /**
     * The class of the threads created by this ThreadController.
     * This class is expected to be a subtype of ControllableThread.
     */
    Class threadClass;

    /**
     * A unique synchronized counter
     */
    int counter;

    /**
     * Number of currently running threads.
     * This value is handed to the threads as an id, so note that the thread
     * id is not unique, but is always in the range 0...maxThreads-1
     */
    int nThreads;
    /**
     * Constructor that initializes the instance variables.
     * The queue may already contain some tasks.
     * If _maxThreads > 0, _maxThreads threads are started immediately.
     * If the queue contains fewer than _maxThreads tasks at _level, or if
     * _maxThreads == -1, then only as many threads as there are tasks are
     * started. Note that this includes the case where _maxThreads == -1;
     * therefore even if the number of allowed threads is unlimited, only a
     * finite number of threads are started.
     */
    public ThreadController(Class _threadClass, int _maxThreads, int _maxLevel,
                            Queue _tasks, int _level, MessageReceiver _receiver)
            throws InstantiationException, IllegalAccessException {
        threadClass = _threadClass;
        maxThreads = _maxThreads;
        maxLevel = _maxLevel;
        tasks = _tasks;
        level = _level;
        receiver = _receiver;
        counter = 0;
        nThreads = 0;
        startThreads();
    }
    /**
     * Get a unique number from a counter
     */
    public synchronized int getUniqueNumber() {
        return counter++;
    }
    /**
     * Adjust the number of allowed threads and start new threads if possible
     */
    public synchronized void setMaxThreads(int _maxThreads)
            throws InstantiationException, IllegalAccessException {
        maxThreads = _maxThreads;
        startThreads();
    }
    /**
     * Get the maximum number of allowed threads
     */
    public int getMaxThreads() {
        return maxThreads;
    }
    /**
     * Get the maximum level
     */
    public int getMaxLevel() {
        return maxLevel;
    }
    /**
     * Get the number of currently running threads
     */
    public int getRunningThreads() {
        return nThreads;
    }
    /**
     * Called by a thread to tell the controller that it is about to stop.
     * The threadId is handed over to the MessageReceiver.
     * If this was the last running thread it means that one level of the
     * queue has been completed. In this case, increment the level (if
     * allowed) and start new threads.
     */
    public synchronized void finished(int threadId) {
        nThreads--;
        receiver.finished(threadId);
        if (nThreads == 0) {
            level++;
            if (level > maxLevel) {
                receiver.finishedAll();
                return;
            }
            // debug
            // System.err.println("new level " + level);
            // if there are no tasks in the queue we're done
            if (tasks.getQueueSize(level) == 0) {
                receiver.finishedAll();
                return;
            }
            try {
                startThreads();
            } catch (InstantiationException e) {
                // Something has gone wrong on the way, because if it hadn't
                // worked at all we wouldn't be here. Anyway, we can do
                // nothing about it, so we just quit instead of moving to
                // a new level.
            } catch (IllegalAccessException e) {
                // Something has gone wrong on the way, because if it hadn't
                // worked at all we wouldn't be here. Anyway, we can do
                // nothing about it, so we just quit instead of moving to
                // a new level.
            }
        }
    }
    /**
     * Start the maximum number of allowed threads
     */
    public synchronized void startThreads()
            throws InstantiationException, IllegalAccessException {
        // Start m threads
        // For more information on where m comes from see the comment on
        // the constructor.
        int m = maxThreads - nThreads;
        int ts = tasks.getQueueSize(level);
        if (ts < m || maxThreads == -1) {
            m = ts;
        }
        // debug
        // System.err.println(m + " " + maxThreads + " " + nThreads + " " + ts);
        // Create the threads
        for (int n = 0; n < m; n++) {
            ControllableThread thread =
                (ControllableThread) threadClass.newInstance();
            thread.setThreadController(this);
            thread.setMessageReceiver(receiver);
            thread.setLevel(level);
            thread.setQueue(tasks);
            thread.setId(nThreads++);
            thread.start();
        }
    }
}
2.5.3 Implementation of the Thread
We expect from the thread's run method that it fetches new items from the queue, and that it ends itself if there are no items left. This is common to all our possible threads; therefore we can implement it in a superclass for 'controllable' threads.
If there are no more items to process, the ControllableThread can terminate itself, but it has to inform the ThreadController about this. In the Web crawler scenario, this is important when all URLs for a link depth of 2 are processed and the next deeper level is reached. The ThreadController is then responsible for shifting the queues and starting new threads as needed.
/**
 * Abstract class that denotes a thread that can cooperate with a
 * ThreadController and has a Queue, a depth level and a MessageReceiver.
 */
abstract public class ControllableThread extends Thread {
    protected int level;
    protected int id;
    protected Queue queue;
    protected ThreadController tc;
    protected MessageReceiver mr;
    public void setId(int _id) { id = _id; }
    public void setLevel(int _level) { level = _level; }
    public void setQueue(Queue _queue) { queue = _queue; }
    public void setThreadController(ThreadController _tc) { tc = _tc; }
    public void setMessageReceiver(MessageReceiver _mr) { mr = _mr; }

    public ControllableThread() {
    }

    public void run() {
        // pop new tasks from the queue until the queue is empty
        for (Object newTask = queue.pop(level); newTask != null;
                newTask = queue.pop(level)) {
            // Tell the message receiver what we're doing now
            mr.receiveMessage(newTask, id);
            // Process the newTask
            process(newTask);
            // If fewer threads are running than allowed, try
            // starting more threads
            if (tc.getMaxThreads() > tc.getRunningThreads()) {
                try {
                    tc.startThreads();
                } catch (Exception e) {
                    System.err.println("[" + id + "] " + e.toString());
                }
            }
        }
        // Notify the ThreadController that we're done
        tc.finished(id);
    }
    /**
     * The thread invokes the process method for each object in the queue
     */
    public abstract void process(Object o);
}
2.5.4 Extracting Hyperlinks
The two main tasks of a Web crawler are, of course, saving data from a URL to a file and extracting hyperlinks. Because of Java's URL class, both tasks are rather simple. Although in this case this might not be the proper object-oriented approach, I decided to implement these tasks as static methods in the SaveURL class. This class contains only static methods. A main method is supplied, so that the class can be used as a stand-alone program to save a file from a given URL. All methods for saving data from a URL into a file are fairly short and self-explanatory. I provide several variations of the implementation for writing the contents of the URL to a String, an OutputStream or a Writer.
Note that it is not necessary for each method to 're-invent the wheel'. For example, the getURL method, which returns the URL's contents as a String, creates a StringWriter and calls the saveURL method to write the URL's contents to a Writer.
When doing input/output operations, it is almost always a good idea to use buffering, because transferring data 'en bloc' is more efficient than transferring data byte by byte. Java provides buffering off the shelf. In the saveURL(URL, Writer) method, Java's BufferedInputStream is used for reading from the URL. In the saveURL(URL, OutputStream) method, buffering is done through a byte array.
The methods that we will actually use in our Web crawler are the writeURLtoFile method and the link extraction methods described below.
Using Java's String class, extracting hyperlinks from a web page is almost trivial. We use the getURL method mentioned above to retrieve a String object containing the complete HTML page. The basic idea is to use the indexOf method of the String class to find the beginning of a link tag, '<a href'. A couple of other indexOf calls and a StringTokenizer extract the URL and the link text from the HTML tag. However, for this approach to work correctly, we first have to convert the 'raw' HTML page from its mixed-case form into a form that is only lower case, to allow easy extraction. We use the toLowerCase method of the String class to do this. We also convert all whitespace characters (i.e. tabs, line breaks etc.) into simple spaces using the replaceAll method. This is necessary because the extraction might get confused if, for example, the '<a' in the HTML page is followed by a line break instead of a blank.
The extractLinksWithText method in the SaveURL class stores URL/text pairs in a Map; extractLinks stores the URLs only in a Vector.
Note that we could also have used more elaborate regular expressions instead of a bunch of indexOf calls. The interested reader who is familiar with regular expressions should feel free to regard rewriting the SaveURL class using the String.split method as an exercise. Note also that the extractLink methods at the moment do not know about HTML comments. If an HTML page contains a link that is commented out, our methods extract it just as if it were an ordinary link. Whether you regard this as a bug or a feature is up to you.
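For completeness, this is a rough sketch of what such a regular-expression-based extractor might look like, using java.util.regex instead of repeated indexOf calls; the pattern is simplified and, like the indexOf version, it does not handle HTML comments.

import java.util.*;
import java.util.regex.*;

// Sketch: extract href values from anchor tags with a (simplified) regular
// expression instead of repeated indexOf calls.
public class RegexLinkExtractor {
    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*href\\s*=\\s*[\"']([^\"'>]+)[\"']",
        Pattern.CASE_INSENSITIVE);

    public static Vector<String> extractLinks(String page) {
        Vector<String> links = new Vector<>();
        Matcher m = HREF.matcher(page);
        while (m.find()) {
            links.add(m.group(1));   // group 1 is the URL inside the quotes
        }
        return links;
    }
}

The actual SaveURL class used in this project follows.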
import java.net.*;
import java.io.*;
import java.util.*;
/**
 * Class with static methods that can save URLs and extract links.
 * Works also as a standalone program; usage: java SaveURL <url> [<file>]
 * If a file is not specified, all links on the specified URL are printed to
 * the standard console.
 */
public class SaveURL {
    /**
     * Opens a buffered stream on the url and copies the contents to a writer
     */
    public static void saveURL(URL url, Writer writer) throws IOException {
        BufferedInputStream in = new BufferedInputStream(url.openStream());
        for (int c = in.read(); c != -1; c = in.read()) {
            writer.write(c);
        }
    }

    /**
     * Opens a stream on the url and copies the contents to an OutputStream.
     * Buffering is done through a byte array.
     */
    public static void saveURL(URL url, OutputStream os) throws IOException {
        InputStream is = url.openStream();
        byte[] buf = new byte[1048576];
        int n = is.read(buf);
        while (n != -1) {
            os.write(buf, 0, n);
            n = is.read(buf);
        }
    }
    /**
     * Writes the contents of the url to a string by calling saveURL with a
     * string writer as argument
     */
    public static String getURL(URL url) throws IOException {
        StringWriter sw = new StringWriter();
        saveURL(url, sw);
        return sw.toString();
    }
    /**
     * Writes the contents of the url to a new file by calling saveURL with
     * a file output stream as argument
     */
    public static void writeURLtoFile(URL url, String filename)
            throws IOException {
        // FileWriter writer = new FileWriter(filename);
        // saveURL(url, writer);
        // writer.close();
        FileOutputStream os = new FileOutputStream(filename);
        saveURL(url, os);
        os.close();
    }
    /**
     * Extract links directly from a URL by calling extractLinks(getURL())
     */
    public static Vector extractLinks(URL url) throws IOException {
        return extractLinks(getURL(url));
    }
    public static Map extractLinksWithText(URL url) throws IOException {
        return extractLinksWithText(getURL(url));
    }
    /**
     * Extract links from an html page given as a raw and a lower case string.
     * In order to avoid a possible double conversion from mixed to lower case,
     * a second method is provided, where the conversion is done externally.
     */
    public static Vector extractLinks(String rawPage, String page) {
        int index = 0;
        Vector links = new Vector();
        while ((index = page.indexOf("