Data Mining using Web Spiders

Carol D. Harrison
George F. Luger

Contact author: Carol Harrison, 725-15 Tramway Vista Dr. NE, Albuquerque, NM 87122. Phone: 505-856-7483 (evenings), 505-284-4934 (day). Fax: none. Email: [email protected]


1. Introduction and Background

As the volume of available information has grown, the field of data mining has become more important for turning data into usable knowledge. Mitchell (1999) sees the growth of the data mining industry as the result of three influences: the falling cost of storing large data sets and collecting data through networks; the development of efficient machine learning algorithms; and the reduction in the cost of computation, allowing more complex data analysis. Data mining is defined as the extraction of trends or patterns from large bodies of data, typically by means of a machine learning algorithm. In this project, AQ11 and ID3 were chosen as classic data mining algorithms representative of the inductive learning tradition; their results were compared to those of the Support Vector Machine, a more recent data mining algorithm, on the task of extracting information from Web pages.

As every Web surfer knows, the Internet is a rich source of information about a multitude of topics, and it is becoming more mainstream all the time; users need not be very computer literate to use the Web's resources. The amount of information on the Web is growing at an exponential rate, and there are no common organizing mechanisms to allow easy access to a specific piece of information. Interest groups find domain-specific Web sites through word of mouth or by following links from a known site to other pages.

A typical means of entry into the Web is through search engines. Many search engines exist, some of which incorporate data mining algorithms to improve efficiency (Google, for example). Search engine data "freshness" depends on the frequency with which the database is updated by its spiders. Retrieval results are based on algorithms that determine relevancy using clues such as the location and frequency of keywords (including the contents of the meta tags, which are header information such as title, keywords, abstract, and page description). The number of page links is some indication of the page's usefulness as a hub (Kosala and Blockeel, 2000); even more important is whether an 'important' page such as a well-known search engine links to this page, thereby indicating its status as a recognized authority on the subject. The appearance of the search term in the URL is another relevance indicator. An ongoing cat-and-mouse game exists in which search engines try to find relevant data, while producers of Web pages try to get their pages prominently featured in the retrievals of popular search engines. Search engine algorithms are constantly improved to defend against 'spam', spurious meta information on a Web page designed to trick the search engine into choosing that page; data mining is a key defense because it allows comparison of meta information to actual page content to ensure congruence and to assess relevance.

Some speculate that search engines are only scratching the surface of Web resources: estimates indicate that major search engines index only 16% of the Web, but because of the size and dynamic nature of the Web, attempts to categorize, classify, or even count resources are inconclusive. Intelliseek estimates the size of the 'invisible Web' as 500 times that of the indexed Web typically trawled by traditional search engines (statistic found at http://bots.internet.com/news/news030901.html). Because search engines rely on linkages to choose Web pages, pages on the periphery of the Web with few or no linkages may not be indexed at all. Therefore, users needing esoteric or specialized information may be unsatisfied by the mainstream page links typically found by large search engines; perceived goodness of results is extremely dependent on personal need and resists objective measurement. BrightPlanet contends that what it terms the 'deep Web' is completely overlooked by search engines, given that they only reference static Web pages. BrightPlanet uses a directed query engine that probes searchable databases which "…only produce results dynamically in response to a direct query" (BrightPlanet.com source, page iii). ProFusion by Intelliseek is another bot that searches the 'deep Web,' providing adaptive querying to further refine the search term, ongoing monitoring of sites of interest, and rerunning of queries at a future date. Many domain-specific search engines also exist to fill the gaps between specific domain interests and the general interests served by search engines.

Traditional search engines are unable to keep up with the information explosion. This has created demand for additional or alternative search strategies that augment or supplant conventional search engine techniques. For example, niche engines such as Fact City concentrate on database searches limited to circumscribed domains such as sports and movie information. Numerous surveys of experts and users indicate that search engines vary greatly in their capabilities, approach, and usefulness for a given task.

A survey of the literature concerning data mining and the Web reveals that most topics of study fall into one of three focal areas (Kosala and Blockeel, 2000; Madria et al., 1999):

• User behavior analysis typically uses some combination of "Web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls…" to follow user navigation paths and analyze their behavior (Kosala and Blockeel, 2000, page 4).

• Structural analysis can be used "…to discover authority sites for the subjects… and overview sites for the subjects that point to many authorities (hubs)" (Kosala and Blockeel, 2000, page 4), using graph analysis to draw conclusions about the popularity of Web pages based on their in-degree and out-degree.

• Content analysis is used to find, categorize, and present useful information from Web content pages.

Web mining research covers a number of computer science areas, including information retrieval, information extraction, natural language parsing, database approaches, and machine learning. Information retrieval (IR) methods are used to retrieve data from the source: indexing text and classifying and categorizing documents. Information extraction (IE) techniques have proven useful for pulling relevant facts from documents into structured data objects suitable for automated processing by data mining algorithms or other applications. Kosala and Blockeel (2000, page 3) distinguish between 'classical IE,' which uses linguistic preprocessing "such as syntactic analysis, semantic analysis, and discourse analysis" (natural language processing techniques), and 'structural IE,' which uses the meta information found on Web pages (such as title, keywords, abstract, description, and body) to help classify pages through wrapper induction. The focus of information extraction is to fill template slots with words or phrases from the document using either natural language processing techniques or structured language processing based on metadata cues and patterns. Natural language parsing is a field of study in its own right, but in this context it is a tool used in information extraction to parse text and provide word-sense disambiguation. Database approaches to Web mining store hypertext and meta information (facts or summaries) about the text page in data layers such that raw information is stored at the lowest level and successive levels of metadata or summarizations are stored at higher levels (Cooley, Srivastava and Mobasher, 1997). Machine learning attempts to find patterns or trends in data and to draw conclusions or classify and categorize the data for further analysis. As Kosala and Blockeel (2000) point out, the line quickly blurs between IR, IE, and data mining because these techniques are so interrelated: IE algorithms can contain data mining components that assist in classification by learning extraction rules from labeled training data, and IR techniques can incorporate machine learning algorithms to improve the retrieval process.

2. Similar Projects

Recent projects have focused on the use of machine learning techniques to improve the process of extracting information from the Web:

• The InfoSpiders project (Degeratu and Menczer, 2000) used neural nets with Web spiders that run multiple threads, using the results of Excite and AltaVista search engine queries as their starting point. An evolutionary algorithm that mutates the query over time is used to improve query capability based on learning. The focus of this project was to find as many recent relevant pages as possible rather than to find the single best page. Two common measures of goodness in Web retrieval are precision and recall: precision is the proportion of the retrieved set that is relevant, whereas recall is the proportion of all relevant documents that were retrieved (see the sketch after this list). In their lessons learned, Degeratu and Menczer pointed out the problem of defining a relevant set against Web data, which requires the use of approximate sets in calculating precision-recall statistics. A fundamental problem in determining the relevance of a set of returns remains the subjectivity associated with "goodness": user satisfaction with a set of returned URLs depends on the sophistication of the query, the availability of the desired data, and the congruence between user intent and the actual query submitted. It is difficult to identify an independent mechanism for evaluating results. A sample query produced twenty-two citations for only five distinct sites, because sublinks within the same main site were listed. The five Web sites chosen by InfoSpiders were a subset of the top ten Google recommendations, which lends credence to their selection. However, since InfoSpiders performs a breadth-first search launched from a search engine recommendation, the spider threads are rewarded for getting 'stuck' at a good site; the focus is on finding as many relevant sites as possible rather than the best sites. This may not be effective given typical user surfing behavior: a user who finds a good site will explore links within that site without requiring a list of subordinate sites, so a broader listing of relevant sites would be more useful.



• Glover, Flake, et al. (2001) used Support Vector Machines to improve category-specific search (the categories used were personal homepages and calls for papers) by looking at words and their location in the document. They also used query modification to successively make a query more representative of the user's intent, based on knowledge of the database contents and the information found in meta tags; the query modifications were recommended by the algorithm to tailor the query to three search engines. The results were incorporated into a metasearch engine called Inquirus 2 (used for CiteSeer, a scientific literature database developed at the NEC Research Institute containing over four million citations and over 300,000 documents). HTML structural cues and page text were used to improve the category-specific search; for example, the study considers a document with the search term in bold in the title more relevant than a document with the search term only in anchor text or in the last few sentences of the text body. The present project likewise uses HTML structural cues (for example, finding the search term in the title or keywords is probably more significant than finding it in the text body alone) as well as text (finding the search term in the page content).
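Precision and recall, as defined in the InfoSpiders discussion above, can be computed directly once a (possibly approximate) relevant set is fixed. A minimal Python sketch, with invented URL sets used purely for illustration:

    def precision_recall(retrieved, relevant):
        """Precision: fraction of retrieved documents that are relevant.
        Recall: fraction of relevant documents that were retrieved."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical example: 5 URLs returned, 3 of them in the (approximate) relevant set.
    retrieved = ["u1", "u2", "u3", "u4", "u5"]
    relevant = ["u2", "u3", "u5", "u7", "u8", "u9"]
    print(precision_recall(retrieved, relevant))   # (0.6, 0.5)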

3. Procedure

This project's contribution to the field is an analysis of the efficacy of data mining algorithms as applied to categorizing Web pages with respect to a search term. Search engines were used as a springboard, following the links suggested by the search engine to traverse the Web and reach pages that the search engine may not find. The intent is to provide a more selective result, based on data mining, through second-level verification and the application of more stringent requirements for internal page consistency. Also, by using the search engine as a starting point but not relying entirely on its results, the problem of caching is avoided: because of the large amount of churn in Web content, search engine caches may not be completely up-to-date and may miss good sites. This project extends the approaches taken by the similar projects discussed above: it builds on the conceptual design of InfoSpiders but does a broader search, and it uses multiple data mining algorithms rather than one neural net algorithm. The project uses structural cues and query modification as does the Glover et al. project, but works in a broad domain rather than just examining home pages. This project differs from a search engine, a metasearch engine, or other user-interactive software because it is not focused on divining the original intent of the user from the keyword or phrase submitted. The project focused on Web content analysis: specifically, making the process of retrieving information from the Web more useful through the application of data mining techniques to Web spiders. The product of this research is a Web spider that incorporates data mining to improve its output, which is a list of Web sites relevant to the user-specified search term together with a confidence level for each recommended URL.

4. Methodology

The spider used in this project started with demonstration code (Blum et al., 1998, and http://developer.java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler) as a basis from which to add the search, scanning, and data mining capability. Search terms were selected from the Lycos Top Fifty Searches for the year 2000 (http://50.lycos.com/122000.html), with the addition of a few more generic terms pulled from the Lycos list for January 16-22, 2000. The most popular search terms from Lycos were used to get a representative sample of popular queries; they included Britney Spears, dragonball, Pokemon, WWF, tattoos, Napster, Las Vegas, Bible, and golf. The additional set of more generic terms was: music, jokes, jobs, movies, travel, and dogs.

Upon input of a query term to the spider, the query is scrubbed via stemming and dictionary search to add or remove "s" endings (rudimentary stemming: for example, "dogs" becomes "dogs OR dog" and vice versa) and to correct spelling errors. The modified query is then submitted to Google, which returns a specified number of URLs as seeds (starting points) for the spider search. Each subURL opened increments a 'links found' variable that prevents the search from becoming 'stuck' in a particular site: after a specified number of links are found on a page, the spider opens the next page in the array. On each page visited, the spider collects statistics on the number of times the search term appears in the title, keywords, description, and body sections. If the search term is found, a goodness test is done based on a heuristic (for example, finding the search term in the title is more interesting than finding it in the body alone). Negative examples were pages containing few or no occurrences of the search term. The goodness criteria were subsequently used as a hint mechanism for the Support Vector Machine algorithm. If the new page is deemed a good page, its sublinks are opened and the process continues until a specified number of URLs have been returned or the URL search is exhausted. The purpose of this is to find good pages and linked subpages that conform to a "profile" of what a good page looks like, which is not necessarily what Google returns. A maximum of ten keywords from the keyword meta section of each page, coupled with a running count of each keyword's occurrences, is stored for future use. A spam detector ensures that multiple occurrences of a keyword on the same page do not increment the counter. A refinement to the search involves query modification through the insertion of keywords: at a trigger point, specified in terms of the number of good URLs collected, the top three keywords are selected and the query is modified and resubmitted to Google. The purpose of this query modification is to provide synonym or proxy assistance to the original search term. As a post-processing step, the page statistics are submitted to the ID3 and AQ11 algorithms to compare their performance against the SVM. The purpose of using these data mining algorithms is to test the overlap between their good pages and Google's, to provide a mechanism for rating which URL sublinks to follow, and to understand how the data mining algorithm determines page goodness.
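A minimal sketch of the per-page statistics and goodness test described in this section; the regular expressions, section weights, and threshold are illustrative assumptions rather than the project's exact values:

    import re

    def term_counts(html, term):
        """Count occurrences of the search term in title, keywords, description, and body."""
        def count_in(pattern):
            m = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
            return len(re.findall(re.escape(term), m.group(1), re.IGNORECASE)) if m else 0
        return {
            "title": count_in(r"<title>(.*?)</title>"),
            "keywords": count_in(r'<meta\s+name="keywords"\s+content="(.*?)"'),
            "description": count_in(r'<meta\s+name="description"\s+content="(.*?)"'),
            "body": count_in(r"<body.*?>(.*)</body>"),
        }

    def test_goodness(counts, threshold=5):
        """Heuristic: hits in the meta sections count for more than hits in the body alone.
        The weights and threshold here are illustrative, not the project's actual values."""
        weights = {"title": 5, "keywords": 4, "description": 3, "body": 1}
        score = sum(weights[s] * c for s, c in counts.items())
        return score, score >= threshold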

5. Results

Design explorations dealing with cleanup and preprocessing of the input query included spelling correction and stemming. Rudimentary spelling correction was achieved by submitting the query term to www.dictionary.com. If there was no entry for the query, the dictionary function returned a suggested spelling correction. If the suggestion was a close match for the query, the suggested correction and the original search term were OR'd together; one reason for using both terms was to gauge the effect and pervasiveness of deliberately misspelled keywords in Web pages. If the suggestion was not a close match, the original query term was used. A difficulty in spelling correction particular to the Web is the use of slang terms and proper names that are not found in dictionaries.

Stemming was explored through the use of the Porter stemming algorithm (http://www.tartarus.org/~martin/PorterStemmer/), which produced mixed results. For example, the input word 'chicken' became 'chick' with the Porter stemmer, which changed the returns considerably. 'Britney Spears' became 'Britney Spear,' which had the effect of putting more weight on URLs containing both spelling versions of the term. 'Data mining' became 'data minin,' which had the effect of pulling only the pages that contained both terms (perhaps because Google perceived this as a misspelled word). The best results were obtained using a simpler +s/-s algorithm.

Design explorations involving synonyms for query modification used WordNet and WordSmyth to develop synonyms for the original query. This did not produce useful results, probably because of the informality of Web terms for which standard dictionaries do not have entries; instead, query modification focused on using keywords, which was considerably more useful. The point at which keyword selection was done was varied to determine whether the timing affected the choice of the top three keywords; that is, to determine when a sufficient number of keywords has been collected for the keyword choice to be meaningful. A sort was done to find the three keywords with the largest counts in the 'keywords' meta sections of the Web pages processed (the original search term(s) are ineligible for selection). The search array is reinitialized and a new query consisting of the original search term plus the three keywords, separated by ORs, is submitted to Google. This leverages an interesting feature of Google: it supports logical AND and logical OR but appears to do some probabilistic calculations behind the scenes. The submission of OR'd terms serves to mitigate the effect of a bad keyword or original search term submitted by itself, and does not appear to hinder the returns for the correct keyword or search term. Because the keyword search is based on a cumulative count of keywords seen in the pages containing the search term, the effect of any one page's keyword contributions on the overall selection of the top three keywords is limited.

The success of keyword choice at 10% of the desired number of URLs tends to depend heavily on the topic: sometimes the results are comparable to keyword choice at 25% and 50% of the requested URLs, and sometimes the resultant keywords were misleading. Keyword choice at 25% of the desired total URLs generally performs well; on some queries, it is similar to keyword choice at 50% of the requested URLs. Keyword choice at 75% was not a significant improvement over the 50% keyword choice. It seems that if good keywords are to be found, they are found in the first several URLs, probably because this project uses Google as a seed point: the Google returns are ranked by its algorithms, so the first URLs tend to have more appropriate keywords. Further keyword search only muddies the result.
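A minimal sketch of the +s/-s scrub with a spelling-correction fallback, as described above; the suggest_spelling hook is a stand-in for the www.dictionary.com lookup:

    def scrub_query(term, suggest_spelling=lambda t: None):
        """Rudimentary stemming: OR together the singular and plural forms of the term.
        suggest_spelling stands in for the dictionary lookup; it returns a suggested
        correction or None."""
        variants = [term]
        if term.endswith("s"):
            variants.append(term[:-1])      # "dogs" -> "dogs OR dog"
        else:
            variants.append(term + "s")     # "dog" -> "dog OR dogs"
        suggestion = suggest_spelling(term)
        if suggestion and suggestion != term:
            variants.append(suggestion)     # OR in a close-match spelling correction
        return " OR ".join(variants)

    print(scrub_query("dogs"))   # dogs OR dog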

[Conceptual diagram: Keyword choice at 25%. Purpose: amass enough keywords to ensure a good choice, then augment the original search term with keywords to find the remaining good URLs requested. Initial query = dog OR dogs. With 100 URLs requested, 25 good URLs are found after 87 URLs are searched, and n keywords have been collected from all pages searched. Example keyword counts: puppies 500, canine 450, puppy 300, spaniel 200, agility 150, show dogs 50, ... The top 3 keywords are selected by occurrence count, VectorToSearch is cleared, and a new Google query is constructed from the original search term plus the keywords for the remaining 75 pages requested: query = dog OR dogs OR puppies OR canine OR puppy.]
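A minimal sketch of the keyword tally and query modification illustrated in the diagram above; the spam guard and trigger point are simplified, and the counts below are invented for illustration:

    from collections import Counter

    def update_keyword_counts(counts, page_keywords, max_per_page=10):
        """Tally keywords from a page's keyword meta section; a crude spam guard counts
        each keyword at most once per page and caps the number taken from one page."""
        for kw in list(dict.fromkeys(page_keywords))[:max_per_page]:
            counts[kw.lower()] += 1

    def modified_query(original_terms, counts, top_n=3):
        """Original search term(s) plus the top-N keywords, OR'd together."""
        eligible = [(kw, c) for kw, c in counts.items() if kw not in original_terms]
        top = [kw for kw, _ in sorted(eligible, key=lambda kc: -kc[1])[:top_n]]
        return " OR ".join(list(original_terms) + top)

    counts = Counter()
    update_keyword_counts(counts, ["puppies", "canine", "puppy", "spaniel", "puppies"])
    update_keyword_counts(counts, ["puppies", "canine", "agility"])
    print(modified_query(["dog", "dogs"], counts))
    # dog OR dogs OR puppies OR canine OR puppy  (given these illustrative tallies)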

A test involving the direct insertion of keywords, chosen at 25% of the desired good URLs, was used to simulate projects with a limited domain and hence the ability to use keywords already chosen from prior domain knowledge. In this test, the keywords found at the 25% point (as described above) are submitted to Google as the search term itself, rather than concatenated with the original search term as before.

[Conceptual diagram: Direct insertion of keywords. Purpose: compare the effectiveness of using only a keyword search with the results obtained using the original search term(s). Setup step: 100 URLs requested, keyword choice at 25%; for the query "dog, dogs" the top 3 keywords found are puppies, canine, puppy. A new spider session is then started with a new Google query built from the keywords alone ("+s" stemming applied) for 100 URLs requested: query = puppies OR canine OR puppy OR canines.]

The heuristic used to judge the success of the direct insertion search versus the original term search was the number of URLs searched to gather 100 good pages. The results were mixed: for three queries, the original search term required approximately half the number of page searches required by the keyword search to find 100 good URLs. Only in one case was the page count for the keyword search less than for the original term search, by about 100 pages searched. It is possible to speculate about the cause of these results. For the search term "dogs," although the keywords (puppies, puppy, canines, canine) are good synonyms, they are less common than the term "dogs." In the case of "tattoos," some of the keyword search terms (tattooed, body art, art, arts) were misleading: "arts" returned some unrelated pages. For "data mining," the keywords (knowledge discovery, data warehouse, software, softwares) were related to data mining only indirectly. In general, the original search term usually did better than the augmented search term, probably because the original term was more common than the keywords.

Google seed size (defined as the number of Google URLs used) was varied to determine its effect on the returned URLs. Starting seed values were 10, 25, 50, 100, and 125; the number of desired returns varied among 100, 500, and 1000. A 50% seed value (e.g., 100 requested/50 seed) appears to be the required minimum for most queries; a 100% seed was always successful in returning 100 URLs. This is because a combination of seeded values and first- or second-order linked pages generates a sufficient number of hits for small requests. For large requests, increasing the seed value does not help. There was also an interaction between the interconnectedness of pages and seed value: more popular queries successfully returned 100 pages for a 25 seed. No search term was successful at a 10% seed. For requests of 500 and 1000 returns, an average of 140 "good" URLs was returned; increasing the seed size did not guarantee a sufficient number of returns, and many queries did not support 1,000 good returns under any conditions. It is interesting to contemplate why the top search term queries find fewer than 200 good pages. It would seem there are many pages devoted to hot topics, but a large proportion of them are junk (as defined by the goodness algorithm).

The data mining algorithms ID3 and AQ11 were run as a post-processing step to compare the performance of classic algorithms against the Support Vector Machine (which ran in real time). Training examples were evaluated against a heuristic and gathered as part of the spider run; when the specified number of good and bad examples had been collected, the SVM was trained and subsequent examples became test data. Training set size was varied for the SVM to determine its effect, and performance was compared against a hand-labeled set of good and bad examples. The SVM code (Chang and Lin, http://www.csie.ntu.edu.tw/~cjlin/libsvm/) was configured as follows: the type of SVM used was C-SVC, which returns either +1 for positive examples or -1 for negative examples; the kernel type was the radial basis function (RBF); other parameters such as degree, coefficient, and nu were left at their default settings. Ray Mooney's LISP code (Mooney, http://www.cs.utexas.edu/users/ml/ml-progs.html) was used for the ID3 and AQ11 implementations. The AQ11 code had a LEF function (as discussed in the literature review, this controls the number of 'best' child nodes that are allowed to propagate children for further consideration) allowing a beam search width to be specified; this defaulted to 1.
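The following sketch reproduces the reported configuration (C-SVC with an RBF kernel and default parameters) using scikit-learn's SVC, which wraps the same LIBSVM library; the feature layout and training rows are invented for illustration only:

    from sklearn.svm import SVC

    # Each example is the vector of search-term counts in
    # (title, keywords, description, body); labels are +1 (good page) / -1 (bad page).
    # These rows are illustrative, not data from the project.
    X_train = [
        [2, 3, 1, 14],   # term prominent in the meta sections -> good
        [1, 2, 2, 9],
        [0, 0, 0, 21],   # high body count only -> bad
        [0, 0, 1, 2],
    ]
    y_train = [1, 1, -1, -1]

    clf = SVC(C=1.0, kernel="rbf", gamma="scale")   # C-SVC with an RBF kernel
    clf.fit(X_train, y_train)

    print(clf.predict([[1, 1, 0, 6]]))            # predicted label, +1 or -1
    print(clf.decision_function([[1, 1, 0, 6]]))  # signed distance, used later as a rough confidence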

In autotrain mode, the algorithm develops its own training set by collecting a balanced set of good and bad examples (the first x good examples and the first y bad examples encountered). Once a specified number of good and bad subURLs have been collected through running the spider, the support vector machine algorithm is called to train against the data (see the sketch below). This training set was compared against a hand-labeled training set that provided a more exhaustive list of the possible combinations of search term finds, together with the points assigned to indicate the relative goodness of each combination. The hand-labeled training set was taught to give low marks to data containing high counts only in the body (i.e., without values greater than zero in the title, keywords, and/or description). After training is completed, data points are submitted to the test SVM algorithm on subsequent passes, and sites rated as good by the SVM are added to the search array for further investigation.

The behavior of the support vector machine against the hand-labeled training set was as follows. Of the ninety-four training examples, sixty-one became support vectors; the other data points were not needed because they did not constrain the separating hyperplane (the SVM keeps only those examples that define the decision boundary). Because the SVM is not invariant to the scaling of its inputs, scaling the dimensions could in principle affect the outcome; however, extensive testing did not detect such a skew. In addition, the SVM created some points by interpolating between the given training examples. The automatically generated training set size was varied to compare performance: the default was 25 good and 25 bad training examples for 200 returns (25% training sample size); the comparators were 50 good and 50 bad examples for 200 returns (50% training sample) and 12 good/12 bad examples for 200 returns (12.5% training sample). As expected, a larger training set gave better results in terms of fewer outliers and better correlation between high test_goodness scores (y axis) and clustering between .5 and 1.0 on the x axis, but autotraining at any level did not perform as well as the hand-labeled examples. This is evident in the figure below, in which trends and clustering occur but no clear advantage is seen among the autotrain set sizes. Note that points on the y axis are training data points.

[Figure: Comparison of autotrain at 12/12, 25/25, and 50/50 with test_goodness for the search term "jokes." X axis: SVM sum; y axis: test_goodness.]
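A minimal sketch of the autotrain flow described above, assuming a judge function (the test_goodness heuristic) and a make_svm trainer such as the SVC configuration sketched earlier; the names and set sizes are illustrative:

    def autotrain_spider(pages, judge, make_svm, n_good=25, n_bad=25):
        """pages yields (counts, url) pairs; judge(counts) returns +1/-1 by heuristic;
        make_svm(X, y) returns a trained classifier with a predict() method."""
        X, y, model, rated = [], [], None, []
        for counts, url in pages:
            if model is None:
                label = judge(counts)
                # keep the first n_good positive and n_bad negative examples
                if (label == 1 and y.count(1) < n_good) or (label == -1 and y.count(-1) < n_bad):
                    X.append(counts)
                    y.append(label)
                if y.count(1) == n_good and y.count(-1) == n_bad:
                    model = make_svm(X, y)   # balanced training set collected: train once
            else:
                rated.append((url, model.predict([counts])[0]))   # later pages become test data
        return rated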

To compare the ID3 and AQ11 data mining algorithms to the SVM, the same autotrain and hand-labeled training sets were used. Since ID3 requires discrete labels rather than continuous ranges, the data were binned. Some descriptiveness (for example, extremely high counts in the body) was lost using this method, but it was sufficient to describe a reasonably sized tree; as the ID3 algorithm attempts to account for inconsistent examples, the branching behavior becomes more complex. Bins were chosen through experimentation and adjustment to best discriminate between positive and negative examples (see the sketch below).

For the autotrain examples, the tree drawn depended on the data in the file. Some bad conclusions were drawn, probably because autotrain did not give the ID3 engine a comprehensive example set. For the hand-labeled examples, the tree generated was very similar to the SVM trainer: keywords were weighted most heavily, with title and description as the second binary discriminators. The tree indicated suspicion of high body counts without correspondingly large description, title, and/or keyword counts, which is consistent with the training examples. It also gave less weight to high description counts than to high keyword and title counts, also consistent with the training data. Deliberate omission of two positive examples caused a significant change in the tree. The implication of this ability to manipulate sections of the tree is that the ID3 algorithm is very sensitive to noise; extraneous or erroneous data creates unwieldy, unreadable trees. Note that more sophisticated tree methods could be implemented in extensions to this study: the Expectation-Maximization (EM) algorithm as implemented by Jordan and Jacobs (1994) in a tree-structured hierarchical mixture of experts is much less susceptible to noise, because it allows data to belong to multiple classifications ("soft splits" rather than the "hard splits" used by ID3). A comparison of ID3 and SVM results showed that the ID3 tree was less able to distinguish border cases; for example, zero in keywords and zero in description netted a negative rating from ID3 regardless of the counts in title and body. Overall there was good consistency between the ID3 and SVM ratings for the sampled data points.

Using the AQ11 algorithm, a comparison of four autotrain data sets against the hand-labeled training set reveals that the output for the hand-labeled training set contains the same elements identified by the autotrain sets. The addition of the two examples did not seem to help the result significantly (as it helped the ID3 tree); one additional complex was found and some complexes changed, reflecting the additional information provided by the two added training examples about description counts of two or greater. AQ11 seemed particularly susceptible to noise in the training data. The support vectors identified are considerably more complex and, in general, contain more specific information. In general, the SVM had the best generalization, while ID3 did well at providing insight into the attributes used to discriminate categorizations; AQ11 was inferior to both SVM and ID3 in both regards.
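A minimal sketch of the binning step described above; the bin edges and labels are illustrative, since the actual bins were chosen by experimentation:

    def bin_count(count, edges=(0, 1, 3, 8)):
        """Map a raw occurrence count onto a small set of discrete labels for ID3."""
        labels = ["none", "low", "medium", "high", "very_high"]
        for i, edge in enumerate(edges):
            if count <= edge:
                return labels[i]
        return labels[-1]

    def bin_example(counts):
        """counts = (title, keywords, description, body) occurrence counts."""
        names = ("title", "keywords", "description", "body")
        return {n: bin_count(c) for n, c in zip(names, counts)}

    print(bin_example((2, 0, 1, 17)))
    # {'title': 'medium', 'keywords': 'none', 'description': 'low', 'body': 'very_high'}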

A confidence factor was explored as a possible measure of the "goodness" of a URL identified by the SVM. A confidence factor could comprise a number of items: rank in the Google seed data, whether the SVM and the test_goodness heuristic agree, and/or a factor from the SVM. This project's confidence factor was created as a composite: Google rank is implicitly considered, given that the spider takes the first n Google returns; if the algorithm's test of goodness disagrees with the SVM, the confidence factor is reduced; and the factor from the SVM is used as the basis for the calculation. The SVM's factor works as follows: the SVM calculates a distance from the support vectors as the sum of the support vector coefficients multiplied by the kernel value for the point under consideration; rho (the threshold for the decision function) is then subtracted from the sum, and the resulting value is used to label the point as positive or negative. The SVM code was modified to return the calculated value as well as the label, as a rough measure of goodness based on distance from the support vectors: points that are close to the hyperplane described by the support vectors are less solidly in their category than points that are farther away. However, outlier points far from the support vectors are also suspect. There appears to be no direct correlation between the margin of a point and its goodness, although broad trends exist. This suggests that a bucketing scheme is appropriate, since the distance calculation cannot be used directly. As a further precaution, if the test_goodness measure contradicted the SVM goodness label, the confidence factor was reduced by 50%; for example, an SVM sum between .45 and .8 with a -1 test_goodness became 45%. The correlation between the sum returned by the SVM and the test_goodness algorithm's rating is shown in the figures below (plotted as the point value if positive, or -1 if negative). Note that there is general agreement between positive and negative labels, with a few reversals (points greater than zero on the x axis and less than zero on the y axis, and vice versa). There is clustering of the positive examples in each case, but not a tight correlation between the test_goodness rating and the SVM sum of distances.

[Figures: SVM sum vs. test_goodness for the queries "Britney Spears," "jobs," "Las Vegas," and "movies." X axis: SVM sum; y axis: test_goodness.]
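A minimal sketch of the composite confidence factor described above; the bucket boundaries are assumptions, apart from the quoted example of an SVM sum between .45 and .8 being halved to 45% when test_goodness disagrees:

    def confidence_factor(svm_sum, test_goodness_positive):
        """Bucket the SVM distance into a rough confidence, then halve it when the
        test_goodness heuristic disagrees with the SVM's positive/negative label."""
        buckets = [            # (low, high, confidence); boundaries are illustrative
            (0.80, float("inf"), 0.95),
            (0.45, 0.80, 0.90),            # per the text: 0.90 halved to 0.45 on disagreement
            (0.00, 0.45, 0.70),
            (float("-inf"), 0.00, 0.30),
        ]
        conf = next(c for lo, hi, c in buckets if lo <= svm_sum < hi)
        svm_positive = svm_sum >= 0
        if svm_positive != test_goodness_positive:
            conf *= 0.5        # disagreement between SVM and heuristic
        return conf

    print(confidence_factor(0.6, test_goodness_positive=False))   # 0.45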

To get some sense of the relative contributions of the title, description, keyword, and body occurrence counts to the SVM coefficient, an analysis was done. In general, a title count greater than 1, keyword counts greater than zero, description counts greater than zero or one, and high body counts are correlated with positive pages. A more elaborate confidence factor could consider the ranked results of multiple search engines and additional features of Web pages that may indicate relevance.

6. Conclusions and Future Work

Cho, Garcia-Molina, and Page (1998) describe a method for ordering a spider's visits to URLs based on an algorithm that calculates the importance of a page from its textual similarity to the user's query, the number of in-degree links the page has, a backlink metric that weighs the importance of a link, a forward link count that measures out-degree links, and a location metric (whether the page is a .com or a .home, and other factors). It should be noted that this work became part of the Google search engine. Determining the centrality of a Web page would provide valuable information about page goodness. It would also be interesting to compare the results from different search engines and metasearch engines. An extension for a limited domain could be the use of a user-created synonym database. Exploring the use of an edit distance algorithm, which determines "the minimum number of insertions, deletions, and substitutions required to transform one string into the other" (Ristad and Yianilos, 1998), could be fruitful in providing more robust spelling correction of the original search term (see the sketch below). Correlation of results with other search engines could assist in choosing positive examples for training the SVM. Many information extraction systems use a document classification matrix to categorize word sense based on context; in this application, for example, it would have been handy to detect and correctly classify dogs (canines) vs. "dogs of the Dow" (bad stocks). The addition of a matrix classification system would improve the internal consistency of the topics returned. This project used supervised learning methods (ID3, AQ11, SVM), in which the data mining algorithm must be trained, because of the ready availability of algorithms and associated datasets for benchmarking. The Web, however, may lend itself better to unsupervised algorithms because the data tend to be unlabeled, and hand labeling is expensive and subject to error. More sophisticated tree approaches incorporating EM could improve the performance of tree algorithms against the SVM. A larger training set could help extend and validate the findings shown in this project.

In conclusion, we found it helpful to use data mining algorithms to leverage and further refine a search engine's returns. The use of search term counts allows an empirical assessment of the probability that a page is a good one. The spider also returns relevant results that may not be found by the search engine alone, because it explores links in subpages and uses keywords to find additional related pages. It presents for consideration some pages that Google may have deeply buried in its results; it can be argued that a good sublink found by the spider corresponding to the 30,000th reference in the Google returns will probably not be considered by a human user. Additionally, an analysis of the top 100 Google pages returned for a query shows that many of them are not very good pages.
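The edit distance mentioned above can be computed with the standard dynamic-programming recurrence; this minimal sketch uses uniform costs rather than the learned costs of Ristad and Yianilos:

    def edit_distance(a, b):
        """Classic Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution (or match)
            prev = curr
        return prev[-1]

    print(edit_distance("britny", "britney"))   # 1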

The use of data mining algorithms, particularly Support Vector Machines, allows pages to be classified as positive or negative, although it does not lend itself to a strict determination of goodness. The spider research tool shows promise in the examination of additional meta information to determine whether a page is good. Additional study in this area may find more foolproof ways to winnow out undesirable pages and return only the information desired, but the Web remains an unorganized treasure hunt rather than a fully mapped territory.


References

Blum, T., Keislar, D., Wheaton, J., and Wold, E. Writing a web crawler in the Java programming language. Muscle Fish, LLC, 1998. http://developer.java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler

BrightPlanet.com. The Deep Web: Surfacing Hidden Value. White Paper. http://www.completeplanet.com/Tutorials/DeepWeb/index.asp

Chang, C. and Lin, C. Support Vector Machine algorithm. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Cho, J., Garcia-Molina, H., and Page, L. Efficient crawling through URL ordering. In Proceedings of the 7th International WWW Conference, 1998.

Cooley, R., Mobasher, B., and Srivastava, J. Web mining: Information and pattern discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.

Degeratu, M. and Menczer, F. Complementing search engines with online Web mining agents. Submitted to Decision Support Systems, Special Issue on Web Data Mining, 2000.

Glover, E.J., Flake, G.W., Lawrence, S., Birmingham, W.P., Kruger, A., Giles, C.L., and Pennock, D. Improving category-specific web search by learning query modifications. In Symposium on Applications and the Internet (SAINT), San Diego, CA, January 8-12, 2001.

Harrison, C. D. Data mining using Web spiders. Master's thesis, to be published December 2001.

Jordan, M. I. and Jacobs, R. A. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214, 1994.

Kosala, R. and Blockeel, H. Web mining research: a survey. SIGKDD Explorations, ACM SIGKDD, Volume 2, Issue 1, pages 1-14, July 2000.

Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim, E.P. Research issues in web data mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International Conference (DaWaK '99), pages 303-312, 1999.

Mitchell, T.M. Machine learning and data mining. Communications of the ACM, Volume 42, No. 11, November 1999.

Mooney, R.J. Algorithms for AQ11, ID3, and a Universal Tester. http://www.cs.utexas.edu/users/ml/ml-progs.html

Porter stemming algorithm. http://www.tartarus.org/~martin/PorterStemmer/

Ristad, E. S. and Yianilos, P. N. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 522-532, 1998.
