Concepts across the Interspace: Information Infrastructure for Community Knowledge

Bruce R. Schatz
CANIS (Community Architectures for Network Information Systems) Laboratory
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
[email protected], www.canis.uiuc.edu
Abstract

A global information infrastructure for knowledge manipulation must support effective analysis to correlate related objects. The Interspace is the coming global network, where knowledge manipulation is supported by concept navigation across community spaces. We have produced a working Interspace Prototype, an analysis environment supporting semantic indexing on community repositories. Scalable technologies have been implemented for concept extraction and concept spaces, which use semantic indexing to facilitate concept navigation. These technologies have been tested on discipline-scale, real-world document collections. The technologies use statistical clustering on the contextual frequency of document phrases within a collection. Computer trends show that semantic indexing technologies will be practical for everyday use on community knowledge in the foreseeable future. Thus, concept navigation across community repositories will become a routine operation.

Keywords: Interspace, semantic indexing, scalable semantics, concept spaces, concept navigation, concept switching
The most popular service in the Net has always been community sharing – at whatever level of functionality the current technology supports. Technologies such as electronic mail, bulletin boards, moderated newsgroups, bibliographic databases, preprint services, and Web sites are successive steps along this path. The closer the technology for sharing results in documents comes to the technology for composing the documents themselves, the more heavily the community sharing mechanisms will be used.

The waves of the Net illustrate the increasing levels of functionality. As shown in Figure 1, each wave builds on the previous, then establishes a new, higher level of standard infrastructure. In the upswing of a wave, the fundamental research is being done for a new level of functionality. Prototype research systems begin in the trough and evolve into mass commercial systems at the peak. The functionality of the current wave is polished commercially in the downswing period for mass propagation before the start of the next wave.
Figure 1. Waves of the Net.
To users living in the Net when a new wave is cresting, the environment feels completely different. This has already occurred during the transition from packets, which are raw bits and files, to objects, which contain display and interaction software for groups of packets. Electronic mail in the ARPAnet was a transformational experience for users in the First Wave, as has been document browsing in the Internet for users in the Second Wave. The transition about to occur will involve concepts, which contain indexing and meaning for groups of objects. Concepts are useful for analysis of the content rather than search of the form. Concept navigation in the Interspace will be the transformational experience for users in the Third Wave of the Net.

The First Wave was the level of access, of transmission of data. It began with the coming of the ARPAnet and evolved through large-scale distributed file systems, roughly 10 years on the upswing and 10 on the down. The focus was on packets of bits, transparently transferring them from one machine to another.

The Second Wave is the level of organization, of retrieval of information. It began with distributed multimedia network information systems, such as the Telesophy system [1]. Telesophy was featured in my invited talk at the 20th Anniversary Symposium for the ARPAnet in 1989, as the example of future technology for worldwide information spaces. That same year saw the initiation of the World-Wide Web project at CERN, which, when coupled with the NCSA Mosaic interface, became the technology that brought worldwide information spaces into everyday reality. The information wave took about 10 years to peak; we are just finishing the 5 years of the consolidation phase. The focus is now on documents of objects, pushing towards Internet-wide operating systems.
The Third Wave will be the level of analysis, of correlation of knowledge. It will focus on paths of searches. It will move past search of individual repositories, to analysis of information across sources and subjects. The standard protocols for this information infrastructure will support collections residing directly on users’ machines. The beginnings of these peer-peer protocols are already apparent in the popularity of music swapping services, such as Napster. To support analysis, they must evolve to provide semantic indexing as standard infrastructure.

The technology for the Third Wave currently exists in large-scale prototypes in research laboratories. This wave, the Interspace, will have distributed services to transfer concepts across domains, just as the ARPAnet had distributed services to transfer files across machines and the Internet has distributed services to transfer objects across repositories. The Interspace provides protocols to interconnect logical spaces, just as the Internet provides protocols to interconnect physical machines.

Telesophy became the internal inspiration for Mosaic, through my service as scientific advisor for information systems at NCSA. This paper describes the Interspace Prototype, which was at roughly the same state in 1999 as the Telesophy Prototype was in 1989. There is a full-fledged research system running in the laboratory, which has semantically indexed large-scale real-world collections and supports concept navigation across multiple sources. The technology is ready for widespread deployment that will catalyze the worldwide Interspace for concept navigation, much as Mosaic catalyzed the worldwide Internet for document browsing.
Towards the Interspace

In 1989, the Telesophy Prototype had made it clear that universal interlinked objects were technically feasible. That is, it was possible to create a worldwide information space of objects, which were interlinked and could be transparently navigated. Five years later, in 1994, NCSA Mosaic made it clear that this paradigm could be implemented efficiently enough that it would become the mass standard for information infrastructure in the Net.

From trends in network infrastructure, it was clear that communities of interest would quickly form, with personal web sites dominating archival web sites. The same phenomenon had happened earlier with electronic bulletin boards versus electronic file archives. From trends in information retrieval, it was clear that soon the resulting volume of documents would cause web search to break down. The same phenomenon had happened when bibliographic databases exceeded a certain size relative to their conceptual density (about a million items for a scientific discipline).

A decade ago, these trends indicated that the Net would need to evolve beyond the Web, into an infrastructure that directly supported community repositories with semantic indexing. When there were billions of documents on-line, navigating across fixed links would no longer suffice for effective navigation. There would be so many relevant documents for any particular situation that fine-grained links to related documents would need to be created dynamically during user sessions. This would require automatically identifying documents containing related concepts. Thus, the basic network infrastructure would need to support universal interlinked concepts (in the Interspace), just as the current network infrastructure supports universal interlinked objects (in the Internet).

The Net of the 21st Century will radically transform the interaction with knowledge. Online information has always been dominated by data centers with large collections indexed by trained professionals. The rise of the Web has rapidly developed the technologies for collections of independent communities. In the future, online information will be dominated by small
collections maintained and indexed by the community members themselves. The great mass of objects will be stored in these community repositories.

Building the Interspace requires generating semantic indexes for community repositories with interactive support adequate for amateur classifiers, then correlating these indexes across multiple sources with interactive support adequate for amateur navigators. Since there will be so many sources indexed by non-professional indexers, the infrastructure itself must provide substantial support for semantic indexing. Since the sources in the Net will be dominated by small community repositories, the typical interaction will be navigating through many sources – retrieving and correlating objects relevant to the particular session. The infrastructure itself must accordingly provide substantial support for information analysis.

The Interspace will be the first generation of the Net to support analysis. For the first time, the standard infrastructure of the Net will support direct interaction with abstraction. The Internet supports search of objects, e.g. matching phrases within documents. The Interspace, in contrast, supports correlation of concepts, e.g. comparing related phrases in one repository to related phrases in another repository. There will be a quantum jump in functionality across the waves, from syntactic search to semantic correlation. Users will navigate within spaces of concepts, to identify relevant phrases, before they navigate within networks of objects, as at present.

The information infrastructure must explicitly support correlation across communities, by concept switching from specialty to specialty. Concept switching involves navigating from the repository of one community into the repository of another, by traversing bridges across related concepts. This will enable a distributed group of persons to form a specialized community living on the Net, yet be able to communicate effectively with related groups, via translation of concepts across specialties. The infrastructure would relate terminology from community to community, enabling navigation at the level of concepts.
The Interspace Prototype

A decade-long program of research, 1990-2000, has produced a working Interspace Prototype. This is an analysis environment with new protocols for information infrastructure, supporting semantic indexing on community collections.

The concept of the Interspace grew out of my experience with the Telesophy Prototype and was explicitly mentioned in the conclusions of my 1990 Ph.D. dissertation, evaluating the wide-area network performance of information spaces [2]. The Worm Community System, 1990-1994, was a complete implementation of an analysis environment in molecular biology, with custom technology pre-Web [3]. The algorithms for semantic indexing were developed, 1994-1998, as part of the Illinois Digital Library project [4]. Finally, the flagship contract in the DARPA Information Management program, 1997-2000, was specifically for the Interspace Prototype [5]. The two-fold goal was to develop a complete analysis environment, then test it on large real collections to demonstrate that the concept technology was indeed generic, independent of subject domain. These goals were successfully achieved, as described in subsequent sections.

The Interspace Prototype is composed of a suite of indexing services, which supports semantic indexing for community collections, and an analysis environment, which utilizes these indexes to navigate within and across collections at abstract levels. Our suite of components reproduces automatically, for any collection, equivalents to standard physical library indexes.
Some of these indexes represent abstract spaces for concepts and categories above concrete collections of units and objects. A “concept space” records the co-occurrence between units within objects, such as words within documents or textures within images. Much like a subject thesaurus, it is useful for suggesting other words while searching (if your specified word doesn’t retrieve desired documents, try another word which appears together with it in another context). A “category map” records the co-occurrence between objects within concepts, such as two documents with significant overlap of concept words. Much like a subject classification, it is useful for identifying clusters of similar objects for browsing (to locate which sub-collection should be searched for desired items).

The information infrastructure for uncovering emergent patterns from federated repositories relies on “scalable semantics”. This technology can index arbitrary collections in a semantic fashion. Scalable semantics attempts to be the golden mean for information retrieval -- semantics pulls towards deep parsing for small collections, while scalable pulls towards shallow parsing for large collections. The parsing extracts generic units from the objects, while the indexing statistically correlates these uniformly across sources. For text documents, the generic units are noun phrases, while the statistical indexes record the co-occurrence frequency: how often each phrase occurs with each other phrase within a document within the collection.

We believe that concepts are the generic level for semantic protocols, given the state of technology in the foreseeable future. Concepts provide some semantics with automatic indexing, and noun phrase extraction appears computationally feasible for large collections of diverse materials. Concept spaces actually provide semi-automatic categorization. They are useful for interactive retrieval, with suggestion by machine but selection by human. All our experience indicates that augmentation, not automation, of human performance is what is technologically feasible for semantic interoperability in digital libraries.
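To make the two index types concrete, the following Python sketch builds a toy concept space and category map. It is illustrative only, not the Prototype's code: the documents, phrases, and the simple shared-phrase grouping rule for the category map are invented for the example, and real category maps use statistical clustering rather than shared-phrase grouping.

from collections import defaultdict
from itertools import combinations

# Each document reduced to its set of noun phrases (invented toy data).
docs = {
    "d1": {"gastrointestinal bleeding", "simple analgesic", "maintenance therapy"},
    "d2": {"simple analgesic", "maintenance therapy", "rheumatoid arthritis"},
    "d3": {"mismatch repair genes", "hereditary cancer"},
}

# Concept space: co-occurrence between units (phrases) within objects (documents).
concept_space = defaultdict(lambda: defaultdict(int))
for phrases in docs.values():
    for a, b in combinations(sorted(phrases), 2):
        concept_space[a][b] += 1
        concept_space[b][a] += 1

# Like a thesaurus: suggest other phrases that occur together with a given phrase.
suggestions = concept_space["simple analgesic"]
print(sorted(suggestions, key=suggestions.get, reverse=True))

# Category map, crudely approximated: group documents that share a phrase,
# yielding clusters of similar objects to browse.
clusters = defaultdict(set)
for doc_id, phrases in docs.items():
    for phrase in phrases:
        clusters[phrase].add(doc_id)
print({p: sorted(ids) for p, ids in clusters.items() if len(ids) > 1})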
Implementing the Prototype

The Interspace Prototype demonstrates that it is technologically feasible to support concept navigation utilizing scalable semantics. The services in the Interspace Prototype generate the semantic indexes for all collections within the spaces. A common set of concepts is extracted across all services and these concepts are used for all indexes, such as concept spaces and category maps. The user sees an integrated analysis environment in which the individual indexes on the individual collections can be transparently navigated. For example, one can easily move from a category map to a concept space, to concepts, to documents, then back up again to higher levels of abstraction from the concepts mentioned in the document. See Sidebar 1 for examples.

The Interspace Prototype comprises an analysis environment across multiple indexes of multiple sources. Each source is separately processed, but all are available within the environment. When a source is processed, the concepts are uniformly extracted from each object. With documents, every noun phrase is parsed out and normalized, with its position being recorded. These noun phrases are then used for a series of indexes at different levels of abstraction. Having a common set of phrases implies that a single phrase can be referenced uniformly within multiple indexes. As the phrases represent concepts, this enables a concept to be transparently navigated across indexes and across sources.

Our current concept extractor was developed using standard components for noun phrase extraction over general text documents. We experimented with several research and commercial systems, before developing an effective parser from public domain source code. The noun phrase extractor is based upon the Brill tagger [6] and the noun phrase identification rules of
NPtool [7]. The parser itself has three major parts: tokenization, part-of-speech tagging, and noun-phrase identification [8]. This software was chosen since it is generic -- the trained lexicon was derived from several different sources, including the Wall Street Journal and Brown corpora, hence the lexicon has fairly general coverage of the English language. It can be applied across subject domains without further domain customization while maintaining comparable parsing quality. According to our studies, the noun phraser enhanced with the UMLS lexicon performed slightly better than the generic version on a collection of 630K MEDLINE abstracts, but the difference is not statistically significant. A similar parser also works well on certain classes of grayscale images, specifically aerial photographs, using texture density as the extracted units. The generic nature and ease of customization enable the parser to cover the full range of noun phrase parsing.

We carefully evaluated the parser with these general rules on biomedical literature. This research experiment parsed all the biomedical literature, 45M (million) unique noun phrases from 10M MEDLINE abstracts, and its description won Best Paper at the 1999 annual meeting of the American Medical Informatics Association [9].

We have used concept space algorithms in numerous experiments to generate and integrate multiple semantic indexes. The space consists of the interrelationships between the concepts in the collection. Interactive navigation of the concept space is useful for locating related terms relevant to a particular search strategy. To create a concept space, first find the context of terms within documents using a noun phrase parser as above, then compute term (noun phrase) relationships using co-occurrence analysis. The co-occurrence analysis computes the contextual relationships between the concepts (noun phrases) within the collections. The documents in the collections are processed one by one, with two concepts related whenever they occur together within the same document. Multiple-word terms are assigned heavier weights than single-word terms, because multiple-word terms usually convey more precise semantic meaning than single-word terms. The relationships between noun phrases reflect the strengths of their context associations within a collection. Co-occurring concepts are ranked in decreasing order of similarity.
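The following sketch illustrates the pipeline just described: noun phrases are chunked from part-of-speech tagged text, then related by weighted co-occurrence within documents. It is a minimal approximation, assuming a toy tagged corpus and a greedy adjective/noun chunking rule in place of the Brill tagger and NPtool rules, and a simple length-based weighting in place of the Prototype's actual weighting scheme.

from collections import defaultdict

# Assume a tiny part-of-speech tagged corpus: (word, tag) pairs per document.
# In the Prototype, the tags would come from a trained statistical tagger.
tagged_docs = [
    [("simple", "JJ"), ("analgesic", "NN"), ("reduces", "VBZ"),
     ("gastrointestinal", "JJ"), ("bleeding", "NN")],
    [("simple", "JJ"), ("analgesic", "NN"), ("supports", "VBZ"),
     ("maintenance", "NN"), ("therapy", "NN")],
]

def noun_phrases(tagged):
    # Greedy chunking: maximal runs of adjectives/nouns form one noun phrase.
    phrases, current = [], []
    for word, tag in tagged:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append(word.lower())
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

# Co-occurrence analysis: two concepts are related whenever they occur together
# within the same document; the weight grows with the length of the related term,
# so multi-word terms count more heavily than single-word terms.
weights = defaultdict(float)
for doc in tagged_docs:
    terms = set(noun_phrases(doc))
    for a in terms:
        for b in terms:
            if a != b:
                weights[(a, b)] += len(b.split())

def related(term, top_n=5):
    # Co-occurring concepts, ranked in decreasing order of association strength.
    pairs = [(b, w) for (a, b), w in weights.items() if a == term]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:top_n]

print(related("simple analgesic"))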
Simulating the Interspace

We have performed several hero experiments, using high-end supercomputers as time machines to simulate the world of the future, ten years hence. These experiments took a large existing collection and partitioned it into many community repositories, to simulate typical future situations. Semantic indexing was then performed on each community repository, to investigate strategies for concept navigation within and across repositories.

On our NSF/DARPA/NASA Digital Library Initiative project in 1996, we demonstrated the feasibility of this approach to generating a large-scale testbed for semantic federation of community repositories [4]. The bibliographic database COMPENDEX was used to supply broad coverage across all of engineering, with 800K abstracts chosen from 600 categories from the hierarchical subject classification, while INSPEC was used to supply deep coverage for our core domains of physics, electrical engineering, and computer science, with 400K abstracts chosen from 300 categories. This generated 4M bibliographic abstracts across 1000 repositories (each abstract was classified by indexer subject assignment into roughly 3 categories, so there was overlap across repositories). The NCSA 32-node HP/Convex Exemplar was used for 10 days of CPU time to
compute concept spaces for each community repository. The final production run of the spaces, after all testing and debugging, took about 2 days of supercomputer time.

Our project in the DARPA Information Management program enabled us to carry out discipline-scale experiments in semantic indexes for community repositories. As one example, in 1998, we generated semantic indexes for all of MEDLINE. MEDSPACE [10] was an Interspace composed of concept spaces across all of MEDLINE. The backfiles comprise 10M abstracts, in a database both broad and deep. Using the MeSH subject classification, we partitioned the collection into approximately 10K community repositories and computed concept spaces for each. The multiple classification for each abstract caused an expansion factor of about four from raw abstracts to repository abstracts. The complete MEDSPACE involved 400M phrase occurrences within 40M abstracts.

The Medicine computation was an order of magnitude bigger than the Engineering computation (40M versus 4M abstracts). The computing time required was about the same scale – 10 days for test debugging and 2 days for final production. This was possible because the 2-year period had made the high-end NCSA supercomputer an order of magnitude better. The 128-node, 64GB SGI/Cray Origin 2000 had 4 times more processors for this highly parallel computation, and the faster processors with bigger memories combined with optimized parallel algorithms to further improve the performance.

This computation demonstrates the feasibility of generating semantic indexes for entire disciplines, in a form deployable within a large-scale testbed. Some of the semantic indexes were used for experiments on concept switching. See examples in Sidebar 1. The experimental users included physicians for MEDLINE indexes and engineers for INSPEC indexes.
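A small sketch of the partitioning step described above, assuming invented record fields and category names: abstracts assigned to multiple subject categories are replicated into one community repository per category, which is why raw abstracts expand into a larger number of repository abstracts.

from collections import defaultdict

abstracts = [
    {"id": "pmid_1", "categories": ["Arthritis, Rheumatoid", "Analgesics"]},
    {"id": "pmid_2", "categories": ["Genes, Regulator"]},
    {"id": "pmid_3", "categories": ["Colorectal Neoplasms", "Genes, Regulator", "DNA Repair"]},
]

repositories = defaultdict(list)   # one community repository per subject category
for record in abstracts:
    for category in record["categories"]:
        repositories[category].append(record["id"])

total_raw = len(abstracts)
total_repo = sum(len(ids) for ids in repositories.values())
print(f"{len(repositories)} repositories, expansion factor {total_repo / total_raw:.1f}")
# Each repository would then be semantically indexed on its own
# (concept space, category map), independently of the others.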
Concept Switching

Correlating across communities is where the power of global analysis lies. Mapping the concepts of their community into the concepts of related communities would enable users to locate relevant items from related research. The difficulty is how to locate related research, within the twin explosions of community terminologies and distributed repositories. Each community needs to identify the concepts within their repository and index these concepts in a generic fashion that can be compared to those in other repositories from other communities.

The principal function of the Interspace is thus concept switching. The concept switches of the Interspace serve much the same function as the packet switches of the Internet – they effectively map concepts in one repository of one community to concepts in another repository of another community, much as switching gateways in the Internet reliably transmit packets from one machine in one location to another machine in another location.

The technologies for concept switching are still immature. A specialized form called vocabulary switching has existed since the 1970s [11]. This form is largely manual – the concepts are taken from a human-generated subject thesaurus and the mapping across thesauri is performed by human subject experts. The Unified Medical Language System (UMLS) developed at the National Library of Medicine contains a modern example of this manual switching across subject thesauri, by relating biomedical vocabulary from multiple thesauri with the Metathesaurus [12]. Vocabulary switching is expensive to maintain, since it requires human tracking of the concepts in the thesauri by experts knowledgeable about both sides of the vocabulary map.

Scalable semantics could potentially support full concept switching by parsing all concepts and
computing all relationships. The promise of automatic methods is concept mapping at a viable cost for community-scale collections.

Future concept switching will rely on cluster-to-cluster mapping, rather than term-to-term. Then each concept will have an equivalence class of related concepts generated for it in the particular situation, and the equivalence class from one space will be mapped into the most relevant classes in other spaces. The simple example in Sidebar 1 uses the related terms in the concept space as the equivalence class for mapping a particular term. Full cluster-to-cluster mapping will likely use neural net technology, such as spreading activation on self-organizing maps of related terms in related documents.

Concept switching supports a new and radical paradigm for information retrieval. Users rarely issue searches. Instead, they navigate from concept to concept, transparently within and across repositories, examining relevant objects and the contained concepts. If they can recognize relevant concepts when viewed during an interactive session, they need not know specialized terminology beforehand.
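As a concrete illustration of the term-based form of switching used in Sidebar 1, the sketch below maps a term from a source concept space into a target concept space by intersecting its equivalence class with the target vocabulary. The concept spaces are toy data, not the MEDLINE indexes, and the bridging rule is a simplification of the Prototype's switching.

source_space = {   # e.g. "Colorectal Neoplasms, Hereditary Nonpolyposis"
    "mismatch repair genes": ["hereditary nonpolyposis colorectal cancer",
                              "microsatellite instability",
                              "polymerase chain reaction"],
}
target_space = {   # e.g. "Genes, Regulator"
    "polymerase chain reaction": ["gene expression", "leukaemia inhibitory factor"],
    "gene expression": ["transcription factor", "polymerase chain reaction"],
}

def concept_switch(term, source, target):
    # Equivalence class: the term plus its related terms in the source space.
    equivalence_class = {term, *source.get(term, [])}
    # Bridge concepts are those shared between the equivalence class and the
    # target space's vocabulary; their related terms are the switched concepts.
    bridges = equivalence_class & set(target)
    switched = {related for bridge in bridges for related in target[bridge]}
    return bridges, switched

bridges, switched = concept_switch("mismatch repair genes", source_space, target_space)
print("bridges:", bridges)      # {'polymerase chain reaction'}
print("switched:", switched)    # concepts reachable in the target space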
Scalable Semantics

The success of the Interspace revolves around the technologies of Scalable Semantics, which attempt to be the golden mean for information retrieval, in-between scalable (broad, since it works in any subject domain) and semantics (deep, since it captures the underlying meaning). Traditional technologies have been either broad but shallow (e.g. full-text search) or deep but narrow (e.g. expert systems). The new wave of semantic indexing relies on statistical frequencies of the context of units within objects, such as words within documents, and is thus fully automatic.

The technology curves for computer power indicate that semantic indexing of any scale collection will shortly become routine. This observation is largely independent of which style of indexing is considered. Within 5 years, discipline-scale collections will be processed within hours on desktop computers. Within 10 years, the largest collections will be processed in real time on desktop computers and within minutes on palmtops. In the near future, users will routinely perform semantic indexing on their personal collections using their personal computers.

This availability of computing power will push feasible levels of semantic indexing to deeper levels of abstraction. The technologies for handling concepts and categories seem already well understood. Concepts rely on phrases within documents as the units within objects. Categories rely on documents within concepts, clustering the concepts for the next more abstract level. Even higher potential levels would involve perspectives and situations. Perspectives rely on concepts within categories, while Situations rely on categories within collections. These move towards Path matching, where the patterns of all the user’s searches are placed within the contexts of all their available knowledge. These more abstract semantic levels push towards closer matching of the meanings in users’ minds to the meanings in the world’s objects. Each higher level groups larger units together across multiple relationships. The algorithmic implications are that many more objects must be correlated to generate the indexing, requiring increasing levels of computing power.

The early years of the new millennium will see the infrastructure of the Net evolve from the Internet to the Interspace. Each specialized community will maintain their own knowledge collections and semantically index these collections on their own machines. Pattern discovery across community sources will become routine, with concept navigation across the Interspace. Problem Solving in the Net will be an everyday experience in the 21st century.
Acknowledgements

Thanks go to the members of the Interspace team, who have prototyped the Third Wave of the Net. The DARPA Information Management program provided financial support through contract N66001-97-C-8535, entitled “The Interspace Prototype: An Analysis Environment for Semantic Interoperability”, with program manager Ron Larsen. DARPA supported the Interspace Prototype project from 1997 to 2000, with Principal Investigator Bruce Schatz, and co-Principal Investigators Charles Herring (at the University of Illinois at Urbana-Champaign) and Hsinchun Chen (at the University of Arizona at Tucson).

The systems research was performed at Illinois in the CANIS Laboratory (Community Architectures for Network Information Systems) under Schatz. The technical leads were Herring, Bill Pottenger, and Kevin Powell (who served as technical architect after co-writing the original architecture document with Schatz). The primary programmers were Conrad Chang, Les Tyrrell, Yiming Chung, Dan Pape, Qin He, and Nuala Bennett. The algorithms research was performed in the Artificial Intelligence Laboratory at Arizona under Chen. The technical leads were Dorbin Ng, Dmitri Roussinov, Marshall Ramsey, and Kris Tolle. At Illinois, Duncan Lawrie and Bob McGrath evaluated computer power for semantic indexing.
References

1. B. Schatz, “Telesophy: A System for Manipulating the Knowledge of a Community”, Proc. IEEE Globecom '87, Tokyo, Nov. 1987, pp. 1181-1186.
2. B. Schatz, Interactive Retrieval in Information Spaces Distributed across a Wide-Area Network, Ph.D. Dissertation, Technical Report 90-35, Department of Computer Science, University of Arizona, Tucson, Dec. 1990, 95 pp.
3. B. Schatz, “Building an Electronic Community System”, J. Management Information Systems, Vol. 8, Winter 1991-92, pp. 87-107. Reprinted in R. Baecker (ed), Readings in Groupware and Computer Supported Cooperative Work, Morgan Kaufmann, 1993, pp. 550-560 in Chapter 9.
4. B. Schatz, et al., “Federated Search of Scientific Literature”, Computer, Vol. 32, Feb. 1999, pp. 51-59.
5. B. Schatz, “High-Performance Distributed Digital Libraries: Building the Interspace on the Grid”, Proc. 7th IEEE Int’l Symp. High-Performance Distributed Computing, Chicago, Jul. 1998, pp. 224-234.
6. E. Brill, “Transformation-Based Error-Driven Learning and Natural Language Processing”, Computational Linguistics, Vol. 21, 1995, pp. 543-565.
7. A. Voutilainen, “NPtool: A Detector of English Noun Phrases”, Proc. Workshop on Very Large Corpora, Columbus, OH, June 22, 1993.
8. K. Tolle and H. Chen, “Comparing Noun Phrasing Techniques for Use with Medical Digital Library Tools”, J. Amer. Soc. Information Science, Vol. 51, Mar. 2000, pp. 380-393.
9. N. Bennett, et al., “Extracting Noun Phrases for all of MEDLINE”, Proc. 1999 Annual Meeting American Medical Informatics Assoc., Nov. 1999, pp. 681-688.
10. Y. Chung, et al., “Semantic Indexing for a Complete Subject Discipline”, Proc. 4th Int’l ACM Conf. Digital Libraries, Berkeley, CA, Aug. 1999, pp. 39-48.
11. R. Niehoff, “Development of an Integrated Energy Vocabulary and the Possibilities for On-line Subject Switching”, J. Amer. Soc. Information Science, Vol. 27, Jan.-Feb. 1976, pp. 3-17.
12. D. Lindberg, et al., “The Unified Medical Language System”, Methods of Information in Medicine, Vol. 32, 1993, pp. 281-291.
Sidebar 1: User Sessions performing Concept Navigation

The Interspace Prototype enables navigation across different levels of spaces: for documents, for concepts, for categories. The spaces can be navigated from concept to concept without the need for searching. The production interface is invocable from within a web browser and can be found at www.canis.uiuc.edu under Interspace, under Demonstrations. The interface is implemented in Smalltalk, but the user interaction takes place via an emulator called ClassicBlend, which dynamically transforms Smalltalk graphics into Java graphics.

Figure 2 is a composite screendump of an illustrative session with the Interspace Remote Access (IRA) interface for the Interspace Prototype. This illustrates concept navigation within a community repository from MEDLINE. The user is a clinic physician who wants to find a drug for arthritis that reduces the pain (analgesic) but does not cause stomach (gastrointestinal) bleeding. In the upper left window, the community collections (subject domains) with semantic indexes are described. In the lower left, the user selects the domain for “Rheumatoid Arthritis”, then searches for all concepts (noun phrases) mentioning the word “bleeding”. They then navigate in concept space to find a related term that might be relevant for their current need. From “gastrointestinal bleeding”, the related concepts include those that are general (“drug”) or artifacts (“ameliorating effect”). But the related concepts also include those of appropriate specificity for locating relevant items, such as names (“trang l”) and detailed concepts (“simple analgesic”).

Following further in the concept space from “simple analgesic” yields “maintenance therapy”, where the first document is displayed in the lower right. This document discusses a new drug, “proglumetacin”; when it was used for treatment, the patient's “haematology and blood chemistry were not adversely affected”. Thus this drug does not cause bleeding. This document, however, would have been difficult to retrieve by a standard text search on MEDLINE, due to the difficulty of guessing beforehand the terminology actually used. The upper right lists another document on this drug, which was located by navigating from the concepts (noun phrases) in the current document via the selected concept.

Figure 3 gives an example of concept switching in the Interspace Prototype, where the relationships within the concept spaces are used to guide the navigation across community repositories for MEDLINE. The subject domains “Colorectal Neoplasms, Hereditary Nonpolyposis” and “Genes, Regulator” were chosen and their concept spaces were displayed in the middle and right windows respectively. “Hereditary cancer” was entered as a search term in the first concept space and all concepts that are lexical permutations are returned. Indented levels in the display indicate the hierarchy of the co-occurrence list. Navigating in the concept space moves from “hereditary nonpolyposis colorectal cancer” to the related “mismatch repair genes”. The user then tries to search for this desired term in another domain repository, “Genes, Regulator”. A straight text search at top right returns no hits. So Concept Switching is invoked to switch concepts from one domain to another across their respective concept spaces.
The concept switch takes the term “mismatch repair genes” and all related terms from its indented co-occurrence list in the source concept space for “Colorectal Neoplasms” and intersects this set into the target concept space for “Genes, Regulator”.
After syntactic transformations, the concept switch produces the list (in the right-most window panel) of concepts computed to be semantically equivalent to “mismatch repair genes” within “Genes, Regulator”. Switching occurs by bridging across community repositories on the term “Polymerase Chain Reaction”, an experimental method common to both subject domains. Navigating the concept space down to the object (document) level locates the article displayed at the bottom. This article discusses a leukaemia inhibitory factor that is related to colon cancer. Note that this article was located without doing a search, by concept switching across repositories starting with the broad term “hereditary cancer” and using common terms as bridges.
Figure 2. Concept Navigation in the Interspace Prototype.
Figure 3. Concept Switching in the Interspace Prototype.
Sidebar 2: Technology Trends underlying Concept Navigation

Information Infrastructure evolves, as better technology becomes available to support basic needs. For technology to be mature enough to be incorporated into standard infrastructure, it must be sufficiently generic. That is, the technology must be robust and readily adaptable to many different applications and purposes. For Information Infrastructure to support Concept Navigation in a fundamental way, a number of new technologies must be incorporated into the standard support. The body of this article discusses the Interspace Prototype, an early system made possible because these technologies are currently mature enough for a complete research system. The Interspace itself will become widespread when these underlying technologies further mature into commercial components. This sidebar tries to make explicit the major technologies that the Interspace Prototype (and eventually the Interspace) critically but implicitly relies on.

The rise of four technologies is critical, in particular: document protocols for information retrieval, extraction parsers for noun phrases, statistical indexers for context computations, and communications protocols for peer-to-peer retrieval. Together, these generic technologies support semantic indexing of community repositories. A document can be stored in a standard representation. Concepts can be extracted from a document with some level of semantics. These concepts can be utilized to transform a document collection into a searchable repository, by indexing the documents with some level of semantics. Finally, the resultant indexing can be utilized to semantically federate the knowledge of a community, by concept navigation across distributed repositories that comprise relevant sources.

The Rise of the World-Wide Web has made it possible to store documents in a standard representation. Prior to the worldwide adoption of a single format to represent documents, collections were limited to those that could be administered by a single central organization. Prime examples were Dialog, for bibliographic databases consisting of journal abstracts, and Lexis/Nexis, for full-text databases consisting of magazine articles. The widespread adoption of WWW protocols enabled global information retrieval, which in turn increased the volume to the point that semantic indexing has become necessary to enable effective retrieval. In particular, the current situation was caused by the universal distribution of servers that store documents in HTML and retrieve documents using HTTP. Many more organizations could now maintain their own collections, since the information retrieval technology was now standard enough to enable information providers to directly store their own collections, rather than transferring them to a central repository archive.

Standard protocols implied that a single program could retrieve documents from multiple sources. Thus the WWW protocols enabled the implementation of Web browsers. In particular, Mosaic proved to be the right combination of streamlined standards and flexible interfaces to attract millions of users to information retrieval for the first time [1]. As the number of documents increased, identifying the initial document to hypertext browse from became a major problem. Then, web searchers began to dominate web browsers as the primary interface to the global information space.
These searches across so many documents with such variance showed the weakness of syntactic search, such as the word matching used within Dialog, and increased the demand for semantic indexing embedded within the infrastructure [2].
The Web at present is fundamentally a client-server model, with few large servers and many small clients. The clients are typically user workstations, which prepare queries to be processed at archival servers. The infrastructure has made the transition from files to documents. The primary functionality has made the transition from access, where a browser is used for directly fetching, to organization, where a searcher is used for initially selecting relevant documents.

As the number of servers increases and the size of collections decreases, the infrastructure will evolve into a peer-peer model, where user machines exchange data directly. In this model, each machine is both a client and a server at different times. This model is already popular, with services for music swapping such as Napster estimated to use 20% of present traffic in the Net. However, the functionality is still access to files, rather than organization of documents. This functionality will change as the technology for semantic indexing becomes mature.

Document standards eliminate the need for format converters for each collection. Extracting words becomes universally possible with a syntactic parser. But extracting concepts requires a semantic parser, which extracts the appropriate units from documents of any subject domain. Many years of research into information retrieval have shown that the most discriminating units for retrieval in text documents are multi-word noun phrases. Thus, the best concepts in document collections are noun phrases.

The Rise of Generic Parsing has made it possible to automatically extract concepts from arbitrary documents. The key to context-based semantic indexing is identifying the “right size” unit to extract from the objects in the collections. These units represent the “concepts” in the collection. The document collection is then processed statistically to compute the co-occurrence frequency of the units within each document. Over the years, the feasible technology for concept extraction has become increasingly precise. Initially, there were heuristic rules that used stop words and verb phrases to approximate noun phrase extraction. Then, there were simple noun phrase grammars for particular subject domains. Finally, the statistical parsing technology became good enough that extraction was computable without explicit grammars. These statistical parsers can extract noun phrases quite accurately for general texts, after being trained on sample collections [3].

This technology trend approximates meaning by statistical versions of context. This trend in information retrieval reflects a global trend in recent years towards pattern recognition in many areas. Computers have now become powerful enough that rules can be practically replaced by statistics in many cases. Global statistics on local context has replaced deterministic parsing. For example, in computational linguistics, the best noun phrase extractors no longer have an underlying definite grammar, but instead rely on neural nets trained on typical cases. The initial phases of the DARPA TIPSTER program, a $100M effort to extract facts from newspaper articles for intelligence purposes, were based upon grammars, but the final phases were based upon statistical parsers. Once the neural nets are trained on a range of collections, they can parse arbitrary texts with high accuracy. It is even possible to determine the type of the noun phrases, such as person or place, with high precision [4].
Once the units, such as noun phrases, are extracted, they can be used to approximate meaning. This is done by computing the frequency with which the units occur within each document across the collection. In the same sense that the noun phrases represent concepts, the contextual frequencies represent meanings. These frequencies for each phrase form a space for the collection, where each concept is related to each other concept by co-occurrence. The concept space is used to generate related
concepts for a given concept, which can be used to retrieve documents containing the related concepts. The space consists of the interrelationships between the concepts in the collection. Concept navigation is enabled by a concept space computed from a document collection. The technology operates generically, independent of subject domain. The goal is to enable users to navigate spaces of concepts, instead of documents of words. Interactive navigation of the concept space is useful for locating related terms relevant to a particular search strategy.

The Rise of Statistical Indexing has made it possible to compute relationships between concepts within a collection. Algorithms for computing statistical co-occurrence have been studied within information retrieval since the 1960s [5]. But it is only in the last few years that the statistics involved for effective retrieval have been computationally feasible for real collections. These concept space computations combine artificial intelligence for the concept extraction, via noun phrase parsing, with information retrieval for the concept relationship, via statistical co-occurrence.

The technology curves of computer power are making statistical indexing feasible. The coming period is the decade in which scalable semantics will become a practical reality. For the 40-year period from the dawn of modern information retrieval in 1960 to the present worldwide Internet search of 2000, statistical indexing has been an academic curiosity. Techniques such as co-occurrence frequency were well-known, but confined to collections of only a few hundred documents. The practical information retrieval on large-scale real-world collections of millions of documents relied instead on exact match of text phrases, such as embodied in full-text search.

The speed of machines is changing all this rapidly. The next 10 years, 2000-2010, will see the fall of indexing barriers for all real-world collections [6]. For many years, the largest computer could not semantically index the smallest collection. After the coming decade, even the smallest computer will be able to semantically index the largest collection. The body of this article describes the hero experiment in the late 1990s, of semantically indexing the largest scientific discipline on the largest public supercomputer. Experiments of this scale will be routinely carried out by ordinary people on their watches (palmtop computers) less than 10 years later, in the late 2000s.

The TREC (Text REtrieval Conference) competition [7] is organized by the National Institute of Standards and Technology (NIST). It grew out of the DARPA TIPSTER evaluation program, starting in 1992, and is now a public indexing competition entered annually by international teams. Each team generates semantic indexes for gigabyte document collections using their statistical software.

Currently, semantic indexing can be computed by the appropriate community machine, but in batch mode. For example, a concept space for 1K documents is appropriate for a laboratory of 10 people and takes an hour to compute on a small laboratory server. Similarly, a community space of 10K documents for 100 people takes 3 hours on a large departmental server. Each community repository can be processed on the appropriate-scale server for that community. As the speed of machines increases, the time of indexing will decrease from batch to interactive, and semantic indexing will become feasible on dynamically specified collections.
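The following sketch shows, on toy data, how a concept space can expand a single query concept into its related concepts and then retrieve documents that mention any of them; the collection, concept space, and matching rule are all invented for illustration.

concept_space = {
    "simple analgesic": ["maintenance therapy", "gastrointestinal bleeding"],
    "maintenance therapy": ["proglumetacin", "simple analgesic"],
}
documents = {
    "d1": {"proglumetacin", "maintenance therapy"},
    "d2": {"gastrointestinal bleeding", "drug"},
    "d3": {"transcription factor"},
}

def retrieve(query_concept, space, docs):
    # Return documents containing the query concept or any related concept.
    expansion = {query_concept, *space.get(query_concept, [])}
    return sorted(doc_id for doc_id, concepts in docs.items() if concepts & expansion)

print(retrieve("simple analgesic", concept_space, documents))  # ['d1', 'd2']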
When the technology for semantic indexing becomes routinely available, it will be possible to incorporate this indexing directly into the infrastructure. At present, the Web protocols make it easy to develop a collection that can be accessed as a set of documents. Typically, the collection is available for fetching but not for searching, except by being incorporated into web portals, which
gather documents via crawlers for central indexing. Software is not commonly available for groups to maintain and index their own collection for web-wide search.

The Rise of Peer-Peer Protocols is making it possible to support distributed repositories for small communities. This trend is following the same pattern in the 2000s as email in the ARPAnet did in the 1970s, where person-person communications became the dominant service in infrastructure designed for station-station computations. Today, there are many personal web sites, even though traffic is dominated by central archives, such as home shopping and scientific databases, which drive the market.

There are already significant beginnings of peer-peer, where simpler protocols enable users to directly share their datasets. These are driven by the desires of specialized communities to directly share with each other, without the intervention of central authorities. The most famous example is Napster for music sharing, where files on a personal machine in a specified format can be made accessible to other peer machines, via a local program that supports the sharing protocol. The Napster service has now become so popular that the technology is breaking down, due to the lack of searching capability that can filter out copyrighted songs.

There are many examples in more scientific situations of successful peer-peer protocols. Typically, these programs implement a simple service on an individual user's machine, which performs some small computation on small data that can be combined across many machines into a large computation on large data [8]. For example, the SETI software is running on a million machines across the world, each computing the results of a radio telescope survey from a different sky region. Computed results are sent to a central repository for a database seeking intelligent life across the entire universe. Similar net-wide distributed computation, with volunteer downloads of software onto personal machines, has computed large primes and broken encryption schemes. For-profit corporations have used peer-to-peer computing for public-service medical computations [9].

Generalized software to handle documents or databases currently exists at a primitive level for peer-peer protocols. A canned program can be run, which processes local data in a simple way. Functionality is still at the level of files rather than documents. The infrastructure supports access rather than organization. Internet infrastructure, such as the Open Directory project [10], enables distributed subject curators to index web sites within assigned categories, with the entries being entire collections. In contrast, Interspace infrastructure, such as automatic subject assignment [11], will enable distributed community curators to index the documents themselves within the collections.

Increasing scale of community databases will force evolution of peer-peer protocols. Semantic indexing will mature and become infrastructure at whatever level technology will support generically. Community repositories will be automatically indexed, then aggregated to provide global indexes. Concept navigation will become a standard function of global infrastructure in 2010, much as document browsing has become in 2000. Then the Internet will have evolved into the Interspace.

References
1. B. Schatz and J. Hardin, “NCSA Mosaic and the World-Wide Web: Global Hypermedia Protocols for the Internet”, Science, Vol. 265, 12 Aug. 1994, pp. 895-901.
2. T. Berners-Lee, et al., “The Semantic Web”, Scientific American, Vol. 284, May 2001, pp. 35-43.
3. T. Strzalkowski, “Natural Language Information Retrieval”, Information Processing & Management, Vol. 31, 1995, pp. 397-417.
4. D. Bikel, et al., “NYMBLE: A High-Performance Learning Name Finder”, Proc. 5th Conf. Applied Natural Language Processing, Mar. 1998, pp. 194-201.
5. P. Kantor, “Information Retrieval Techniques”, Annual Review of Information Science & Technology, Vol. 29, 1994, pp. 53-90.
6. B. Schatz, “Information Retrieval in Digital Libraries: Bringing Search to the Net”, Science, Vol. 275, 17 Jan. 1997, pp. 327-334.
7. D. Harman (ed), Text Retrieval Conferences (TREC), National Institute of Standards & Technology (NIST), http://trec.nist.gov
8. B. Hayes, “Collective Wisdom”, American Scientist, Vol. 86, Mar-Apr 1998, pp. 118-122.
9. Intel Philanthropic Peer-to-Peer Program, www.intel.com/cure
10. Open Directory Project, www.dmoz.org
11. Y. Chung, et al., “Automatic Subject Indexing Using an Associative Neural Network”, Proc. 3rd Int’l ACM Conf. Digital Libraries, Pittsburgh, Jun. 1998, pp. 59-68.