Web Information Resource Discovery: Past, Present, and Future

Gultekin Ozsoyoglu, Abdullah Al-Hamdani
Dept. of Electrical Engineering and Computer Science
Case Western Reserve University, Cleveland, Ohio 44106
(tekin, abd)@eecs.cwru.edu
1 Introduction
In a time span of twelve years, the World Wide Web--only a computer and an internet connection away from anybody anywhere, and with abundant, diverse, and sometimes incorrect, redundant, spam, or simply bad information--has become the major information repository for the masses and the world. The web is becoming all things to all people, totally oblivious to nation/country/continent boundaries, promising mostly free information to all, and quickly growing into a repository in all languages and all cultures. With large digital libraries and increasingly significant educational resources, the web is becoming an equalizer, a balancing force, and an opportunity for all, especially for underdeveloped and developing countries. The web is both exciting and overwhelming, changing the way the world communicates, from the way businesses are conducted to the way masses are educated, from the way research is performed to the way research results are disseminated. It is fair to say that the web will only get more diverse, larger, and more chaotic in the near future.

As it is, the web is a repository of text, multimedia, and hypertext documents, which are mostly display-only (HTML) documents, or information-exchange (XML) documents created for the consumption of web-based applications. The web continually grows and changes, with an estimated size of 5B(illion) to 8B pages (the largest web index, OpenFind (www.openfind.com.tw), covers 3.5B pages; these figures do not include the hidden web, intranets, and database-enabled pages). Two strengths of the web are that it grows incrementally (and is thus scalable), and that each individual in each nation with a connection to the internet can contribute to content generation on the web, leading to the dissemination of facts (and propaganda), and (sometimes incorrect) ideas and opinions, in an independent and very democratic manner.

As valuable and rich as the web is, presently there are few ways to search and locate information on it: one can use (i) the existing search engines to reach a select set of ranked sites, (ii) meta search engines that in turn employ multiple search engines, and aggregate and rank the search results, or (iii) question-answering systems (e.g., AskJeeves (www.ask.com)) that allow users to pose questions and return answers; or one can (iv) follow links and browse web pages.

In this paper, we review the underlying technologies for collecting information (metadata) about the web, and for employing these technologies to search and query the web. The full paper, with a larger reference list, is at art.cwru.edu/WebSearchQuerying.pdf. In Section 2, we summarize the history and capabilities of web search engines. Sections 3 and 4 are devoted to the automated and manual ways of adding semantics to the web, to help understand and, thus, search the web better. Section 5 offers our predictions on what the near future holds for improving web search and querying.
2 Web Search Engines
First, a brief note on the history of the web: in 1980, Tim Berners-Lee at CERN (the European Organization for Nuclear Research) wrote a program that used bidirectional links to browse documents in different directories. By 1990, Berners-Lee had designed a graphical user interface to hypertext, named the "World Wide Web". By 1993, CERN had developed the three pillars of the present web: the HTML markup language, HTTP (the hypertext transfer protocol), and the very first HTTP server. Mosaic, created by Marc Andreessen and Eric Bina at NCSA, was launched in early 1993 and became the first widely used graphical web browser; Andreessen later co-founded Netscape. And the rest is history. See www.archive.org for an archive of the web (10 billion pages).
2.1 History
The earliest search engines, launched around 1990, were for searching FTP and Gopher directories. 1994 was a busy year for the launch of a number of significant search engines. In January 1994, MCC Research introduced EINet Galaxy (www.galaxy.com), with a hand-organized directory of submitted URLs, search features for Telnet and Gopher, and limited web search features. In April 1994, the Yahoo! directory was created by David Filo and Jerry Yang, two Stanford PhD students, to help Stanford students locate useful web pages. Yahoo manually created and maintained hierarchically organized topic directories until October 2002, when it started to incorporate Google's crawler-based listings, "enhanced by" the Yahoo directories. Yahoo (www.yahoo.com) became a company in 1995. Also in early 1994, WebCrawler (bought, in order, by Excite, AOL, and InfoSpace) from the University of Washington was released. The distinctive feature of WebCrawler was its full-text indexing (of the first 200 words in each document). In June 1994, Lycos from Carnegie Mellon University was released; the Lycos search engine has been powered by HotBot since 1999. In late 1995, the AltaVista search engine (now 1B pages), running on clusters of Alpha workstations and with main-memory indexing, was launched by the Digital Equipment Corporation. Inktomi (now 2B pages), which powers HotBot (bought by Lycos, then Yahoo), was released in 1996. Another major search engine, Google (www.google.com) (now 2.4B pages), was launched in September 1998 by Sergey Brin and Larry Page, two PhD students at Stanford. By 2000, Google's coverage had reached about 50% of the web, with a crawl refresh rate of about once a month. Google and Yahoo now each handle about 30% of web search requests.

In the last five years, thousands of web search engines, some powered by the above-listed search providers and others using their own directories or crawler-based indexes, have been launched. Search Engine Watch [37] lists, in nine categories, more than a thousand search engines. Search Engine Colossus [34] lists search engines from 195 countries and 39 territories (with 18 search providers from/about Turkey). Search Engine Watch lists as "major" (well-known or well-used) search engines [35]: Google, Yahoo, AllTheWeb.com (powered by a crawler-based search engine; now 2.1 billion pages), and MSN Search (powered by its own directory, the LookSmart directory, and Inktomi's crawler-generated URLs). In the "strongly consider" category, Search Engine Watch lists AOL Search (powered by Google), AskJeeves (a question-answering system powered by the crawler-based Teoma search provider),
HotBot (owned by Yahoo; powered by the search providers Google, Inktomi, Teoma, and AllTheWeb.com), Lycos, and Teoma.

Starting around 1996, three years after the launch of web search engines, it occurred to people that employing a number of search engines for a query and then aggregating their results would provide more useful information; hence the notion of metasearch engines, or metacrawlers, came about. Presently, there are about 50 metacrawlers. See www.searchenginewatch.com/links/article.php/2156241 for a review and evaluation of metasearch engines. There are also a large number of specialized search engines, such as those that find people (www.whowhere.com), universities (www.campusregistry.com), publications (CiteSeer), etc. See http://cui.unige.ch/1 for a list.
2.2 Techniques and Capabilities
The requirements for web search engines are to (a) locate and rank web documents effectively and efficiently, (b) provide unbiased and up-to-date access to the web, with expressive and useful web search results, and (c) adapt to user queries. Search engines can be classified according to how they index the web: (1) those that use web crawling and automatically create web listings, such as Google; (2) those that manually maintain directories, such as Yahoo (until recently), by dividing manually submitted information about web sites into categories; and (3) hybrid search engines that maintain both manually generated directories and crawler-generated indexes.

A search engine is composed of three main parts: a crawler (spider), indexing software, and search and ranking software. The crawler scans web documents and collects information about them. The indexing software constructs a data structure that can be searched quickly. The search and ranking software analyzes a given query, and compares it against the existing indexes to find relevant items and URLs.

2.2.1 Crawling

Crawlers crawl the web by starting from a given set of URLs, iteratively fetching pages and scanning them for new URLs, which are cached as "URLs to be evaluated". Building a basic crawler is easy; building an industrial-strength crawler is not. Crawlers built for research purposes fetch up to hundreds of pages a second; commercial crawlers are much more scalable and fetch hundreds of thousands of pages a second. Understandably, crawler details of commercial search engines are not available. However, the basic design issues involve concurrent page fetches, the design of a powerful storage manager, concurrent address resolution, avoiding spider traps and spam, and detecting duplicate pages/URLs, which we very briefly review next. For more details, read chapter 2 of Chakrabarti's excellent book [14].

Links extracted from web pages need to be processed and normalized so that pages are not fetched multiple times--a difficult task due to virtual hosting (many hostnames served from one IP address, e.g., via a proxy pass) and load balancing (one hostname mapping to multiple IP addresses). Extracted URLs may be absolute or relative; regardless, a canonical URL needs to be formed by the crawler. Crawlers must also control the number of simultaneous service requests to any one HTTP server, since commercial servers have safeguards against denial-of-service attacks and limit the service given to any one client at a time.
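As an illustration only (not any production crawler's design), the following minimal sketch shows the fetch-extract-canonicalize loop and a simple per-host politeness delay; the seed URL, page limit, and delay values are hypothetical.

    # Minimal single-threaded crawler sketch: fetch a page, extract links,
    # canonicalize them, and throttle requests per host. An industrial-strength
    # crawler adds concurrency, a storage manager, duplicate detection, etc.
    import time
    import urllib.request
    from urllib.parse import urljoin, urldefrag, urlparse
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def canonicalize(base_url, href):
        # Resolve relative URLs and drop fragments so a page is fetched only once.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        return absolute

    def crawl(seeds, max_pages=50, per_host_delay=1.0):
        frontier = list(seeds)              # "URLs to be evaluated"
        seen = set(seeds)
        last_fetch = {}                     # host -> time of the last request
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.pop(0)
            host = urlparse(url).netloc
            wait = per_host_delay - (time.time() - last_fetch.get(host, 0.0))
            if wait > 0:
                time.sleep(wait)            # do not hammer any one server
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    page = response.read().decode("utf-8", errors="replace")
            except Exception:
                continue                    # skip unreachable or non-HTML pages
            last_fetch[host] = time.time()
            fetched += 1
            extractor = LinkExtractor()
            extractor.feed(page)
            for href in extractor.links:
                link = canonicalize(url, href)
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
            yield url, page                 # hand the page to the storage manager/indexer

    # Hypothetical usage:
    # for url, page in crawl(["http://example.com/"]):
    #     print(url, len(page))

A production crawler would replace the in-memory frontier and seen-set with the storage manager discussed below, and would parallelize fetching and DNS resolution.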
The crawler repository, managed by a storage manager, contains the HTML pages, and quickly becomes very large. For research purposes, single-disk-based storage managers are freely available (www.sleepycat.com); commercial-strength storage managers manage disk farms over a fast local area network, and are much more complex to build. Refreshing crawled pages is another issue with large crawlers. See Cho et al [13] for the design of an incremental crawler.

2.2.2 Document Model and Query Language

Web search engines preprocess (usually the first 100 to 200 words of) documents by removing/replacing stopwords (e.g., the, a, in, of, etc.), stemming the text (i.e., reducing words to their root forms, such as changing "walking" to "walk"), and replacing documents and words with identifiers. Then, various indexing structures and/or index compression techniques are applied to the preprocessed documents. As for modeling the documents themselves (in order to compare them), the commonly accepted model is the vector-space model from Information Retrieval (IR). Using a vocabulary (the set of terms that appear across all documents), a document is represented as a vector of real numbers, where each vector element contains the weight of a term, usually computed with the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme [33].

Most web search engines allow "Boolean queries" (from IR) that are conjunctions and/or disjunctions of words and phrases. Such a language, compared to existing database query languages, is very limited in expressive power, and is barely exploited by users--the average web search request is 1.3 words long.

2.2.3 Retrieval Relevance

To judge the relevance (in terms of precision and recall, two basic measures from IR) of a web page to a given query (i.e., a set of terms), the query and all existing documents (already retrieved by the crawler) are represented as document vectors using the TF-IDF scheme. Then, the question of the similarity of a query and a web page becomes the question of the similarity between the associated vectors (usually under the L1 or L2 norm). The most commonly used similarity measure for this purpose is the cosine similarity, which computes the cosine of the angle between the two vectors. For the case where users are not satisfied with the response of a search engine and are willing to provide feedback to refine the search, relevance feedback models have been developed, including Rocchio's method and pseudo-relevance feedback [14]. More advanced models include probabilistic relevance feedback models and Bayesian inference networks. As it turns out, users rarely provide feedback, and most search engines do not incorporate relevance feedback features.

2.2.4 Hyperlink Analysis

In web retrieval, the linkage information among web pages has been found to be more informative than IR-based document similarities. To employ linkage information, concepts from bibliometrics, a field that studies the citation patterns of scientific papers, are employed: (i) the prestige of an article, a normalized real number reflecting the number of citations to it and, iteratively, the prestige of the articles that cite it, and so on; and (ii) the co-citation of two documents (the number of documents that cite both, normalized by the total number of documents).
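To make the vector-space scoring of Sections 2.2.2 and 2.2.3 concrete before turning to the link-analysis algorithms themselves, the following minimal sketch (our own illustration over a made-up three-document corpus, not any engine's implementation) builds TF-IDF vectors and ranks the documents by cosine similarity to a query; preprocessing is reduced to lowercasing and whitespace splitting.

    # Toy TF-IDF vector-space ranking with cosine similarity.
    import math
    from collections import Counter

    docs = {                                # hypothetical mini-corpus
        "d1": "web search engines crawl and index the web",
        "d2": "databases index structured data",
        "d3": "search engines rank web pages by relevance",
    }

    def tokenize(text):
        return text.lower().split()

    def tfidf_vectors(corpus):
        n = len(corpus)
        tokenized = {d: tokenize(t) for d, t in corpus.items()}
        df = Counter()                      # document frequency of each term
        for terms in tokenized.values():
            df.update(set(terms))
        vectors = {}
        for d, terms in tokenized.items():
            tf = Counter(terms)
            # TF-IDF weight: term frequency times inverse document frequency
            vectors[d] = {t: tf[t] * math.log(n / df[t]) for t in tf}
        return vectors, df, n

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    vectors, df, n = tfidf_vectors(docs)
    query_terms = Counter(tokenize("web search"))
    query_vector = {t: query_terms[t] * math.log(n / df[t])
                    for t in query_terms if t in df}

    for d, score in sorted(((d, cosine(query_vector, v)) for d, v in vectors.items()),
                           key=lambda pair: -pair[1]):
        print(d, round(score, 3))

In practice, such weights live in compressed inverted indexes and are combined with the link-based scores discussed next, rather than being recomputed per query.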
PageRank and HITS (Hyperlink-Induced Topic Search) are two very influential algorithms for ranking pages, developed in the late 1990s, respectively, by Sergey Brin and Larry Page at Stanford [17, 32], and by Jon Kleinberg at IBM Almaden [24]. The PageRank of a page is a normalized sum of the prestiges of the pages linking to it. In essence, the PageRank definition is based on the intuition that a page has a high rank (i.e., is important) if the sum of the ranks of its incoming links, recursively, is high [32]. The PageRank computation converges rapidly for the web.

In HITS, first, a set of documents is selected from the web by using the query and, perhaps, IR techniques. One can view the selected documents as nodes and the hyperlinks among them as edges, resulting in a "subgraph" of the web. In this subgraph, authoritative pages are nodes/pages to which many pages link (incoming edges), and hub pages are nodes with many links (outgoing edges) to authoritative pages; the system reports the highest-ranking authorities and hubs in the subgraph of documents. Intuitively, in bibliometric terms, authorities are articles with definitive, high-quality information, and hubs are high-quality survey articles.

PageRank and HITS are similar in that they are both recursive, and both amount to computing principal eigenvectors of matrices derived from the adjacency matrix of a subgraph of the web, using power iterations. It is known that the utilization of such linkage information in retrieval leads to higher retrieval effectiveness, as exemplified by Google [9]. Note, however, that Google is more than the PageRank mechanism; it also employs phrase matches, anchor text information, and match proximity. The basic criticism of PageRank [14] is that the prestige scores are query-independent, leading to a disconnect between the ranking and a given query. Several variants of PageRank and HITS have been studied in the literature, e.g., the stochastic approach for link-structure analysis (SALSA) [26], a more stable HITS algorithm [31], eliminating the effects of two-way self-referencing (nepotistic links) [5], eliminating the effects of self-created cliques [14], outlier elimination [5], mixed hubs [14], eliminating the effects of topic contamination and topic drift [14], and exploiting anchor text [11].
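The recursive PageRank definition can be illustrated with a short power iteration over a toy link graph (our own sketch; the four-page graph and the damping factor of 0.85 are illustrative, and real implementations operate on billions of pages with sparse-matrix techniques and handle dangling pages explicitly).

    # PageRank by power iteration on a toy web graph.
    # graph[p] lists the pages that p links to (its out-links).
    graph = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    def pagerank(graph, damping=0.85, iterations=50):
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}       # start from the uniform distribution
        for _ in range(iterations):
            new_rank = {}
            for p in pages:
                # A page's rank is a damped sum of the ranks of the pages linking
                # to it, each divided by that linking page's out-degree.
                incoming = sum(rank[q] / len(graph[q]) for q in pages if p in graph[q])
                new_rank[p] = (1 - damping) / n + damping * incoming
            rank = new_rank
        return rank

    for page, score in sorted(pagerank(graph).items(), key=lambda item: -item[1]):
        print(page, round(score, 3))

HITS proceeds analogously over the query-specific subgraph, alternating updates of hub and authority scores and normalizing after each pass.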
2.2.5 Focused Crawlers

General-purpose crawlers do not attempt to crawl only pages satisfying certain criteria. It does make sense, however, to crawl pages that are likely to have high "importance", where importance may be defined in terms of PageRank values, high in-degree/out-degree link values of pages, or any other user-defined function. Cho et al [13] study various URL prioritization and keyword-sensitive web search strategies. It has also been found [30] that breadth-first crawling quickly locates pages with high PageRank. Others have proposed IR-based heuristics (e.g., pages containing specific words, or containing phrases that match a given regular expression), such as FishSearch [8] and its follow-up SharkSearch [22]. Focused crawlers are systems with a crawler and a classifier, where the crawler bases its crawling strategy on the judgments of the classifier, possibly a supervised classifier trained a priori and updated as documents are found. Thus, focused crawlers search only a subset of the web (not necessarily only a specific web resource) that pertains to a specific relevant topic. A classifier-guided frontier is sketched at the end of this subsection.

A number of focused crawlers have been proposed in the literature [12, 13, 15, 29], which differ in the properties of the employed classifier, the heuristics used to judge the importance of pages to be
fetched or to identify and exploit hubs, and the ability to learn the structure of paths leading to important pages [15].
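As a minimal illustration of the classifier-guided strategy above (the keyword scorer stands in for a trained classifier, and the topic terms and URLs are hypothetical), the FIFO frontier of a general-purpose crawler can be replaced by a priority queue ordered by estimated topical relevance:

    # Focused-crawler frontier sketch: URLs are crawled in order of an
    # estimated relevance score rather than first-in-first-out.
    import heapq

    def relevance(text, topic_terms=("database", "query", "index")):
        # Stand-in for a trained classifier: the fraction of topic terms present.
        text = text.lower()
        return sum(term in text for term in topic_terms) / len(topic_terms)

    class FocusedFrontier:
        def __init__(self):
            self._heap = []                 # max-priority queue via negated scores
            self._seen = set()
        def push(self, url, score):
            if url not in self._seen:
                self._seen.add(url)
                heapq.heappush(self._heap, (-score, url))
        def pop(self):
            neg_score, url = heapq.heappop(self._heap)
            return url, -neg_score

    # After fetching a page, score it (or its context) and enqueue its out-links
    # with that score, so that promising regions of the web are crawled first.
    frontier = FocusedFrontier()
    frontier.push("http://example.com/db-tutorial", relevance("database index and query tutorial"))
    frontier.push("http://example.com/sports", relevance("football scores"))
    print(frontier.pop())                   # the database-related URL comes out first

A real focused crawler would use a trained supervised classifier over the page text, and may additionally learn which paths tend to lead to relevant pages, as in the context-graph approach [15].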
3 Automated Metadata Extraction from Web
If it were possible to extract entities and relationships among entities from web documents, such metadata could then be used to define more powerful queries over the set of web documents. Thus, the field of data extraction is important to web querying. In this section, we briefly list a number of recent works in the field of data extraction, with an eye toward using the extracted metadata for web search and querying.

DIPRE [10] employs a handful of training tuples of a structured relation R (which represents a specific meta-relationship among entities in the data) to extract all the tuples of R from a set of HTML documents. Consider the relation R(Organization, Location) with the tuple (Microsoft, Redmond). Assume that DIPRE encounters the text "Microsoft's headquarters in Redmond", which it changes into the pattern p: "(Organization)'s headquarters in (Location)". DIPRE then searches the HTML documents for phrases matching p. Assume that it encounters the string "Boeing's headquarters in Seattle", which results in the new tuple (Boeing, Seattle) being added into R. That is, DIPRE uses the new tuples to generate more patterns, and uses the newly generated patterns to extract more tuples, and so on (a simplified sketch of this bootstrapping loop is given at the end of this section). Snowball [1, 2], an extension of DIPRE, improves the quality of the extracted data by including automatic pattern and tuple evaluation. QXtract [3] uses automated query-based techniques to retrieve documents that are useful for extracting a target relation from a large collection of text documents.

The Proteus information extraction system [20, 21] uses finite-state patterns to recognize names, nouns, verbs, and other special forms; scenario pattern matching to extract events and relationships for a given relation; and an inference process to locate implicit information and make it explicit. Then, Proteus combines all the information about a single event using event merging rules.

The field of (meta)data extraction from the web has a long way to go at this stage. However, we believe that it provides an alternative to the manual, content-generator-dependent ways of adding semantics to the web, which are discussed next.
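The DIPRE-style bootstrapping loop mentioned above can be sketched in a few lines (a deliberately simplified illustration over a made-up three-sentence corpus; the real system works over large HTML crawls, uses richer pattern representations, and evaluates pattern and tuple quality):

    # DIPRE-style bootstrapping: seed tuples -> patterns -> more tuples.
    import re

    corpus = [
        "Microsoft's headquarters in Redmond employs thousands.",
        "Boeing's headquarters in Seattle moved in 2001.",
        "Exxon's headquarters in Irving is large.",
    ]

    seed_tuples = {("Microsoft", "Redmond")}        # seeds for R(Organization, Location)

    def find_patterns(tuples, texts):
        # Turn each occurrence of a known (org, loc) pair into a regex pattern
        # with the two entities replaced by capture groups.
        patterns = set()
        for org, loc in tuples:
            for text in texts:
                if org in text and loc in text and text.index(org) < text.index(loc):
                    middle = text[text.index(org) + len(org):text.index(loc)]
                    patterns.add(r"(\w+)" + re.escape(middle) + r"(\w+)")
        return patterns

    def apply_patterns(patterns, texts):
        found = set()
        for pattern in patterns:
            for text in texts:
                found.update(re.findall(pattern, text))
        return found

    tuples = set(seed_tuples)
    for _ in range(3):              # alternate: tuples -> patterns -> more tuples
        patterns = find_patterns(tuples, corpus)
        tuples |= apply_patterns(patterns, corpus)
    print(sorted(tuples))

Snowball extends this loop by assigning confidence scores to patterns and tuples, so that low-quality patterns do not pollute the extracted relation.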
4 Adding Manually-Supplied Semantics to Web
Next, we briefly summarize the leading web information representation models with extensive research and standardization efforts, namely, the Resource Description Framework (RDF), the Semantic Web, and ontologies.
4.1 Resource Description Framework
The Resource Description Framework (RDF) [27] is designed to describe web information sources by attaching metadata specified in XML. RDF identifies resources using Uniform Resource Identifiers (URIs), and describes them in terms of properties and their values [28]. RDF is a graph-based information model, and consists of a set of statements represented as triples. A triple denotes an edge between two nodes: it has a property name (the edge), a resource (a node), and a value (a node). A resource can be anything from a web document to an abstract notion. A value can be a resource or a literal (an atomic type). RDF Schema [4] defines a type system for RDF, similar to the type systems of object-oriented programming languages such as Java.
RDF Schema allows the definition of classes for resources and of property types. The resource Class is used to type resources, and the resource Type is used to type properties. Various properties, such as subClassOf, subPropertyOf, isDefinedBy, seeAlso, and type, are available, and various constraints on resources and on properties are defined. It remains to be seen how widely RDF will be adopted by web content generators. Initial results are not encouraging: Eberhart [16] investigated the amount and the type of RDF data found on the web, gathered in 2001-2002. The results indicate that (i) RDF is not widely used on the web, (ii) the RDF data that is on the web is not easily reachable, and (iii) it is not very sophisticated.
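For concreteness, the triple model can be illustrated with a few statements written as plain tuples (the URIs, prefixes, and property names below are hypothetical examples in the spirit of RDF and RDF Schema, not taken from any particular deployed vocabulary):

    # RDF statements as (subject, property, value) triples; a value may be a
    # resource (URI) or a literal. All identifiers below are made up.
    triples = {
        # instance-level metadata about one web document
        ("http://example.org/doc/42", "dc:creator", '"G. Ozsoyoglu"'),
        ("http://example.org/doc/42", "rdf:type", "ex:TechnicalReport"),
        # schema-level statements using RDF Schema vocabulary
        ("ex:TechnicalReport", "rdfs:subClassOf", "ex:Document"),
        ("ex:Document", "rdf:type", "rdfs:Class"),
    }

    def describe(resource, triples):
        # Collect the (property, value) pairs attached to one resource (node).
        return sorted((p, v) for s, p, v in triples if s == resource)

    print(describe("http://example.org/doc/42", triples))

In deployed systems, such triples are serialized in RDF/XML (or managed with an RDF library) rather than written as language-level tuples.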
4.2 Semantic Web
The Semantic Web [7, 36] is an RDF Schema-based effort to define an architecture for the web, with a schema layer, a logical layer, and a query language. The overall goal of the Semantic Web is to support richer information discovery, data integration, task automation, and navigation--by providing standards and techniques for web content generators to add more semantics to their web data [25]. A complex set of layers of web technologies and standards is defined to implement the Semantic Web [25]. The Unicode and URI layers are used to identify objects in the Semantic Web, and to make sure that international character sets are used. The XML layer is used to integrate the Semantic Web definitions with other XML-based standards. The RDF and RDF Schema layers are used to define statements about objects and vocabularies that can be referred to using URIs. The Ontology layer is used to evaluate vocabularies, and to define relationships between different concepts. The Digital Signature layer detects alterations to documents. The Logic layer is used to write rules that are executed by the Proof layer, while the Trust layer is used to decide whether or not to trust a given proof. It remains to be seen whether the concepts and standards defined by the Semantic Web effort will be adopted. One major problem is the complexity of the Semantic Web as defined now. The Semantic Web is an active, industry-led research area.

4.2.1 Ontologies

An ontology is a specification of a conceptualization (i.e., meta-information) [18]. It is used to describe the semantics of the data, with a role similar to that of a database schema [23]. Ontologies establish a joint terminology between members of a community of interest. An example is the Gene Ontology (www.geneontology.org), for geneticists and biologists. To represent a conceptualization, a representation language can be (though, usually, is not) used; there are several representation languages [19], RDF and RDF Schema being among them.

Horrocks et al. [23] proposed the Ontology Inference Layer (OIL), a standard for a web-based representation and inference layer to express ontologies based on RDF and XML schemas. OIL provides rich modeling primitives from frame-based languages, a well-defined semantics based on Description Logic, and automated reasoning support. Ontologies are described in OIL using three different layers: the object level, the first meta-level (ontology definition), and the second meta-level (ontology container). The object level is used to describe concrete instances for a given ontology. The first meta-level provides a structured vocabulary and well-defined
semantics by defining the terminology that can be used at the object level. The second meta-level describes the features of a given ontology, such as its author, name, and subject. OIL is compatible with RDF Schema; [6] uses RDF modeling primitives to map OIL specifications to their corresponding RDF serializations. Ontologies form another effort to add community-supported and manually generated semantics to the web; it remains to be seen how broadly they will be adopted.
5 What Next?
Major search engines have come a long way in recent years in crawler coverage of the web, fast search over very large indexed data, and providing users with very good responses to one- or two-word queries. Research on general-purpose web search technology has also started to mature, with well-developed techniques; surely, in the near future, effective keyword-based web search in most languages will be provided by the major search engines. However, the next natural step of providing more informative access to web information resources (not to the whole web) is yet to come. Consider the query: "Find, from the ACM SIGMOD Anthology, the five most important prerequisite papers of the paper 'Predicate Migration' by Hellerstein and Stonebraker". Presently, no tools exist to answer such a query.

The next enabling step for effective web search and querying will come when metadata about the web becomes widely available. It is not clear that the RDF and Semantic Web efforts will succeed in adding semantics to a significant portion of the web, due to (a) the complexity of the Semantic Web architecture with its numerous layers, and (b) the additional manual effort needed to define and add semantics to web data. The alternative direction of automated metadata extraction from the web is as yet immature. We think that, when it matures, automated metadata extraction will coexist with, if not take over from, manually generated metadata. Regardless, in the future, web information resources, though not the whole web, will have metadata available, allowing users to search and query web information resources using highly powerful, first- or higher-order logic-based query languages. Such languages will be unique and different from database query languages.
6 References
1. E. Agichtein, E. Eskin, L. Gravano, "Combining Strategies for Extracting Relations from Text Collections", ACM SIGMOD Conf., 2000.
2. E. Agichtein, L. Gravano, "Snowball: Extracting relations from large plain-text collections", 5th ACM International Conference on Digital Libraries, June 2000.
3. E. Agichtein, L. Gravano, "Querying Text Databases for Efficient Information Extraction", Proc. of the 19th IEEE Intl. Conference on Data Engineering (ICDE), 2003.
4. D. Brickley, R.V. Guha, "Resource Description Framework Schema (RDFS)", W3C Proposed Recommendation, 1999, available at http://www.w3.org/TR/PR-rdf-schema.
5. K. Bharat, M.R. Henzinger, "Improved algorithms for topic distillation in a hyperlinked environment", ACM SIGIR Conf., 1998.
6. J. Broekstra, M. Klein, D. Fensel, I. Horrocks, "Adding formal semantics to the Web: building on top of RDF Schema", Proc. of the ECDL, 2000.
7. T. Berners-Lee, "Semantic Web Roadmap", W3C draft, Jan. 2000, available at http://www.w3.org/DesignIssues/Semantic.html.
8. P.M.E. De Bra, R.D.J. Post, "Searching for arbitrary information in the WWW: making client-based searching feasible", WWW Conf., 1994.
9. S. Brin, L. Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, Brisbane, Australia, 1998.
10. S. Brin, "Extracting patterns and relations from the world wide web", WebDB Workshop at EDBT, 1998. http://citeseer.nj.nec.com/brin98extracting.html.
11. S. Chakrabarti et al., "Mining the web's link structure", IEEE Computer, Aug. 1999.
12. S. Chakrabarti, M. van den Berg, B. Dom, "Focused crawling: a new approach to topic-specific web resource discovery", Proc. of the WWW 8 Conf., 1999.
13. J. Cho, H. Garcia-Molina, L. Page, "Efficient crawling through URL ordering", Proc. of the Seventh International World-Wide Web Conference, 1998.
14. S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2003.
15. M. Diligenti, F. Coetzee, S. Lawrence, C.L. Giles, M. Gori, "Focused crawling using context graphs", VLDB 2000.
16. A. Eberhart, "Survey of RDF data on the web", Proc. of the 6th World Multi-conference on Systemics, Cybernetics and Informatics (SCI), 2002.
17. Google History, at http://www.google.com/corporate/history.html.
18. T. Gruber, "A translation approach to portable ontologies", Knowledge Acquisition, 1993.
19. N. Guarino, "Formal ontology and information systems", in N. Guarino (ed.), Formal Ontology in Information Systems, Proc. of the 1st International Conference, 1998.
20. R. Grishman, S. Huttunen, R. Yangarber, "Real-time event extraction for infectious disease outbreaks", Proc. of the Human Language Technology Conference, 2002.
21. R. Grishman, "Information extraction: techniques and challenges", in M.T. Pazienza (ed.), Information Extraction, Springer-Verlag, LNAI, 1997.
22. M. Hersovici et al., "The shark-search algorithm - an application: tailored web site mapping", WWW 7 Conf., 1998.
23. I. Horrocks et al., "The Ontology Inference Layer OIL", Technical report, Free University of Amsterdam, 2000. http://www.ontoknowledge.org/oil/.
24. J. Kleinberg, "Authoritative sources in a hyperlinked environment", Proc. of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
25. M. Koivunen, E. Miller, "W3C Semantic Web Activity", Proc. of the Semantic Web Kick-off Seminar, Finland, Nov. 2, 2001.
26. R. Lempel, S. Moran, "SALSA: the stochastic approach for link-structure analysis", ACM TOIS, April 2001.
27. O. Lassila, R. Swick, "Resource Description Framework (RDF) Model and Syntax Specification", W3C Recommendation, 22 February 1999.
28. F. Manola, E. Miller, "RDF Primer", W3C Working Draft, 23 January 2003.
29. F. Menczer, G. Pant, M. Ruiz, P. Srinivasan, "Evaluating topic-driven Web crawlers", Proc. of the 24th Intl. ACM SIGIR Conf., 2001.
30. M. Najork, J. Wiener, "Breadth-first crawling yields high-quality pages", WWW Conf., 2001.
31. A. Ng, A. Zheng, M. Jordan, "Stable algorithms for link analysis", ACM SIGIR, 2001.
32. L. Page, S. Brin, R. Motwani, T. Winograd, "The PageRank citation ranking: bringing order to the web", Stanford Digital Libraries Working Paper, 1998.
33. G. Salton, Automatic Text Processing, Addison-Wesley, 1989.
34. International Directory of Search Engines, Search Engine Colossus, 2003, available at http://www.searchenginecolossus.com.
35. The Major Search Engines and Directories, Search Engine Watch Report, Danny Sullivan, 2003, available at searchenginewatch.com/links/article.php/2156221.
36. The Semantic Web Community Portal, at http://www.semanticweb.org.
37. Search Links, available at http://searchenginewatch.com/links/index.php.