Published in: Library and Information Service, 2002, 8, p. 39-49
Some ideas concerning the Semantic Web

Brendan Rousseau (1) and Ronald Rousseau (2)

(1) Department of Mathematics and Computer Sciences, Eindhoven University of Technology, Postbus 513, 5600 MB Eindhoven, The Netherlands
e-mail: [email protected]

(2) KHBO, Department of Industrial Sciences and Technology, Zeedijk 101, B-8400 Oostende, Belgium
e-mail: [email protected]
Abstract

In 1998 the World-Wide Web Consortium (W3C) inaugurated a research initiative centred on the idea of providing semantics for and facilitating the extraction of knowledge from the WWW. The Semantic Web is a vision of Tim Berners-Lee, the creator of the WWW. With the help of XML, RDF, OIL and other emergent standards it will be possible to give more structure and meaning to existing web data. This will lead to a universal network where all available information can effectively be found: the semantic web. Clearly, the realisation of the semantic web will have a huge influence on the way digital libraries will be conceived.
Introduction

The Internet as we know it is a fast-growing medium: the number of sites is still growing exponentially. The Net owes its popularity for a large part to the ease and liberty with which almost everyone can add new data and web pages. It does not matter whether you work for an international concern or are just a high school student: everyone can put his or her 'knowledge' on the Internet. This freedom clearly has many advantages, but it also entails a number of disadvantages. There is, indeed, little or no control on the content of web pages. This leads to problems concerning correctness of data, legality (we have here mainly 'copyright issues' in mind), aesthetics or just 'good taste'. On a higher level we deplore the lack of an all-embracing (web-wide) indexing system. This makes it sometimes impossible to find information, even if it is publicly available on the Web.

Over the years software engineers have developed several web information retrieval systems. Among these, search engines such as AltaVista or Google, and directories such as Yahoo!, are the best known. Although these tools are astonishingly good, they do not have the ambition (any more) to be complete. Their main goal is to provide the most interesting search results on the first page. Another problem with search engines as they function nowadays is the fact that it is impossible for them to connect related facts shown on different web pages. The Internet and the library system have a tremendous amount of data and facts available, yet because search engines have no real clue about the contents of a page, they cannot turn these data into real information for the user. They can only compare symbols, i.e. keywords introduced by someone searching for information, with symbols displayed on web pages.

The Internet as we know it is made for people (flesh-and-blood beings), not for machines. Machines need extra information in order to relate content (semantics) to data. A human immediately recognizes a string such as '11 September 2001' as a date (and a very special one at that). A robot, being just a computer program, cannot draw this conclusion unless provided with extra information tools.

The WWW is becoming more and more integrated into the daily life of many individuals. Most businesses have a website announcing and promoting their activities, and many people nowadays write more e-mails than letters. Similarly, digital libraries, as part of the Web, are replacing many parts of the traditional 'printed book' libraries. The Internet certainly has contributed, in a positive sense, to the globalisation of the world. As more and more people get used to these new technologies their demands also increase. The 'Semantic Web', the second-generation net, is Tim Berners-Lee's answer to this (Berners-Lee, 1998). The semantic web does not exist (yet) but is a grand vision encompassing a multitude of possibilities and opportunities (Berners-Lee et al., 2001). Its main aim is to create an environment where all information can be understood by the tools (such as softbots) sent out by their users. Once data are made available in such a way that a computer can 'understand' them, these data become real information for the user. A computer program can sift through data thousands of times faster than a human being, and then link related data in order to answer specific questions. If the semantic web were a reality, a question such as "I want to spend the summer holidays in San Francisco (California, USA). Please, give me the necessary information" would return a complete itinerary, available hotels and information about tourist attractions, all this fitted into the user's personal schedule and corresponding to his or her personal preferences.

In this article we will study the following questions: To what extent is this technology already available? Does there exist a standard for the future semantic web? What is necessary to make the existing web into a semantic web? In the following sections we will answer some of these questions.
The Internet as it exists nowadays is based on HTML

At the moment HTML (hypertext markup language) is the most successful language used to publish data on the Internet. HTML is a computer language that makes it possible to add layout information to text. Thanks to this extra information a browser can display the text in the way intended by the writer. Over the years this language has evolved from a very simple language to a more sophisticated one, making the Net into what it is and looks like nowadays. Tools have been developed so that users do not have to learn all HTML codes. The best known among these are FrontPage and NetObjects, based on the WYSIWYG principle. The FrontPage engine, being a part of Microsoft's Office suite, allows users to save their Word documents as HTML and publish them on any web server. This feature makes the threshold for Net publishing very low.

HTML is a W3C (World Wide Web Consortium) standard that has gone through different versions (updates), the latest (maybe the last?) one being version 4.01. [Since then replaced by XHTML, now at version 1.0, second edition (4 October 2001).] Such a standard fixes the tags that may be used, their meaning and their attributes. A user may not invent his or her own HTML tags, but has to follow the standard. This, however, has not deterred the makers of web browsers such as Internet Explorer and Netscape from inventing extra tags. The W3C committee followed these developments and, if the new tags served a useful purpose, incorporated them into the next version of the standard.

What is a markup language? The term 'markup' refers to the fact that document elements that have a special meaning are marked. The meaning of these marked elements must be understood by others. A markup language defines the markup symbols that will be used, and fixes their meaning. One may divide markup symbols into three classes: layout (stylistic) markup, structural markup and semantic markup.
Stylistic markup is concerned with basic layout elements, such as <i> (italics), <b> (bold) and <font>. Structural markup relates to the document structure. Examples are <h1> (begin a first-level heading), <ol> (begin an ordered list), <p> (begin a paragraph), etc. Finally, semantic markup informs the reader about the content of the document.

The problem with HTML is that it has almost no features to add 'meaning' to a text. HTML tags are basically designed to support only the display of a web page: text, images and push-buttons. A construction such as <center>Brendan Rousseau</center> says that the text must be put in the centre of the line, but tells us nothing about the meaning of the text between the two HTML symbols. In all honesty we must admit that some users have made creative use of metatags, comments, headings and links in order to indicate subject matter within an HTML context (Hodgson, 2001). For a basic introduction to HTML we refer the reader to e.g. McMurdo (1996) or Castro (1999).
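As a hedged illustration of such a workaround, the following sketch (the page and its keyword values are invented, inspired by the holiday example in the introduction) shows how meta tags in the head of an HTML page can carry hints about subject matter, even though a browser never displays them:

  <head>
    <title>Holidays in San Francisco</title>
    <!-- meta tags give crude hints about the subject of the page -->
    <meta name="keywords" content="San Francisco, holidays, hotels, tourist attractions">
    <meta name="description" content="Tourist information about San Francisco, California">
  </head>

Such hints remain informal, however: nothing tells a machine how the keywords relate to each other or to the rest of the page.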
Real knowledge representation via XML

A first step in the direction of a truly semantic web is to give users the ability to make their own tags. These tags must be made up in such a way that meaning can be given to the data between them (Bosak & Bray, 1999). HTML does not offer this feature, but XML does. XML (eXtensible Markup Language) is also a W3C standard, allowing the coding of text as well as data. XML data are application-, platform- and hardware-independent. This makes XML a suitable format for data interchange on the Internet. Actually, XML is a meta-markup language, that is, a language designed to make markup languages. The development of XML began in 1996 and it became a W3C standard in 1998. On the W3C website it is described as follows: "The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web."

XML is based on the existing standards SGML (Standard Generalized Markup Language) and HTML. SGML was developed in the early eighties and has been used mainly for large projects needing large amounts of well-structured documentation, such as the technical manual for the Boeing 747. SGML has also been adopted for the exploitation of bibliographical data in a digital library context (Corthouts & Philips, 1996). Developers have taken the best parts of SGML and rewritten them as a simpler yet powerful language: XML. In contrast to SGML, which is mainly suited for technical documentation and not for other sorts of data, XML is well suited for all kinds of data. It is, moreover, possible to extend the language to adapt it to the needs of businesses as well as individual users.
XML looks somewhat similar to HTML but there are essential differences. XML keeps meaning (semantics) and layout separated. Stylesheets are used for the general layout of similar documents. Another difference is the fact that W3C regulates the introduction of new tags in HTML, while in XML everyone may define tags. This makes XML much more flexible than HTML. Finally, structural requirements for XML are stricter than for HTML.

Nowadays most data are stored in relational databases. Yet, this data storage model is not really suited for data formats such as bibliographic data, film and sound, as provided by digital libraries. More complex data structures such as nested tables are also difficult to describe within a structure consisting only of rows and columns.

Now that we have some idea about the advantages of XML, we will have a brief look at its basic technical elements. An XML document is said to be well-formed if it is syntactically correct. It is, moreover, valid if it corresponds to a document type declaration (DTD). An XML document is well-formed, that is syntactically correct, if it is non-empty; it has a unique special start-tag, the root (e.g. <root>), and a unique special end-tag (e.g. </root>) encompassing the whole document; and all other tags are properly nested. This means: with every start-tag corresponds an end-tag, and tags do not overlap. XML allows three types of tags: start-tags, such as <song>, end-tags, such as </song>, and empty tags (not having an end-tag), such as <br/>. An XML document may further contain a prolog, which is optional, such as <?xml version="1.0"?>, comments, such as <!-- this is a comment -->, processing instructions to an external application, such as a printer, and a document type declaration (DTD). The DTD defines the syntactic rules for the elements and attributes of a document, as well as the entities and notations. A DTD begins with <!DOCTYPE string [ ... ]>, where the term 'string' refers to the root element. All syntactic rules related to the document appear between the [...]. All elements, attributes and entities must be declared in a DTD.

XML is much more flexible than traditional data systems: it does not need relational schemas, external data type definitions, etc. A set of XML data contains all the necessary information itself. In this way XML documents may contain any type of data structure, from plain text to complex, dynamic structures such as Java Applets. Data presentation in XML is, moreover, independent of the given data, hence the representation can be altered without any influence on the data itself. This is an important advantage with respect to traditional data structures, where another representation often requires a lot of reprogramming. Another advantage of XML, with respect to relational databases, is that not all data have to be placed on the same server or in the same physical location. XML can treat the whole World Wide Web as one big database. Among other advantages, the flexibility of XML makes it the ideal format for e-commerce. XML, being a W3C standard, is supported by most international corporations, such as the software giants IBM, Sun, SAP, Netscape and Microsoft. These companies have already incorporated XML in several of their products. Net browsers such as Netscape, Mozilla, and Microsoft's Internet Explorer already offer XML support.

Let us have a look at the following example. In HTML a song can be described using the tags <dt> ('definition title'), <dd> ('definition data'), <ul> ('unordered list') and <li> ('list items'). These elements have nothing to do with the fact that we are describing a song. Here is such an example:
  <dt>Angel
  <dd>by Shaggy
  <ul>
    <li>Publisher: Mercury
    <li>Length: 4:21
    <li>Written: 2001
  </ul>
The same data can be described in XML as follows:

  <song>
    <title>Angel</title>
    <artist>Shaggy</artist>
    <publisher>Mercury</publisher>
    <length>4:21</length>
    <year>2001</year>
  </song>

Instead of the 'general purpose' HTML tags we use more meaningful XML tags. Clearly, a layperson can more easily understand the XML code than the corresponding HTML code. XML code, moreover, makes it easy for a machine to find all songs in a large document. This is an essential feature in view of the semantic web. Bots searching through HTML documents cannot possibly understand the different meanings of the <li> tag.
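To connect this with the DTD mechanism described earlier, a minimal document type definition for such song records might look as follows; this is only a sketch, and the element names simply mirror the example above:

  <?xml version="1.0"?>
  <!DOCTYPE song [
    <!-- 'song' is the root element and must contain exactly these five sub-elements -->
    <!ELEMENT song (title, artist, publisher, length, year)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT artist (#PCDATA)>
    <!ELEMENT publisher (#PCDATA)>
    <!ELEMENT length (#PCDATA)>
    <!ELEMENT year (#PCDATA)>
  ]>

A song document that starts with this declaration is valid only if each <song> element indeed contains these five sub-elements, in this order, each holding parsed character data (#PCDATA).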
XML allows different fields to design their own specific representation language. Examples of such fields are chemistry, mathematics and music. These specific languages make it possible to exchange notes, data and information between persons, in such a way that the sender does not have to worry whether the receiver has the necessary software to read the message. Nowadays all too often the receiver needs extra software or plug-ins in order to read (or even just 'see') the message sent by the source. Now that the Internet breaks down borders between countries and languages, the enormous number of different data formats becomes fully apparent. The fact that XML data can be stored independently from the required representation is another bonus in handling international data flows. The same data can now easily be represented in different languages, making XML documents into intermediary media (in particular for documents written in different languages). The fact that XML supports the Unicode character set is an important point here (Bosak & Bray, 1999).

The Unicode Standard is the universal character encoding scheme for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software. As the default encoding of HTML and XML, the Unicode Standard provides a sound underpinning for the World Wide Web. While modelled on the ASCII character set, the Unicode Standard goes far beyond ASCII's limited abilities. It provides the capacity to encode all characters used for the written languages of the world: more than 1 million characters can in principle be encoded. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility. The Unicode Standard, Version 3.0, contains 49,194 characters from the world's scripts. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and scripts of Asia. The unified Han subset contains 27,484 ideographic characters defined by national and industry standards of China, Japan, Korea, Taiwan, Vietnam, and Singapore.

XML tags can be defined and documented by the user. This happens in a DTD (Document Type Definition). There already exist DTDs for special fields, such as CML (Chemical Markup Language), a DTD for chemistry. CML provides a vocabulary and a syntax for the description of molecules and contains tags for atoms, molecular bindings, etc. A browser does not have to know the specific tags that may occur in an XML document. When a browser has to show an XML document it first retrieves the corresponding DTD, after which the browser knows how to interpret the tags used. In order to know precisely how to show the document the browser also needs the corresponding style sheet (XSL document). Scientists benefit most from these features. Mathematical or chemical texts contain many scientific formulae but at the moment HTML-based browsers
cannot represent these. The fact that XML is able to do this implies that authors are no longer obliged to apply all kinds of tricks, such as converting formulae into pictures, in order to force browsers to render the required representation. For this reason mathematicians now turn to MathML. DTDs and style sheets offer the author the possibility to invent tags and to determine how the browser must render them.

We will now have a closer look at MathML. Mathematical notations are constantly changing. They guide the eye and make mathematical expressions easier to read and understand. Indeed, a mathematical expression often consists of well-designed symbols, forming a highly stylized two-dimensional pictogram. MathML is influenced by another markup language for mathematics, namely TeX (designed by Donald Knuth; see Knuth, 1984). In HTML mathematical equations are never properly displayed. For instance, centre alignment of images is handled in slightly different ways by different browsers. Image-based equations are generally harder to see, read and comprehend than the surrounding text in the browser window. These problems become worse when the document is printed. Moreover, mathematical equations as displayed in HTML cannot be searched for. These are the problems addressed, and largely solved, by MathML. MathML is designed to provide the encoding of mathematical data in the bottom layer of a two-layered structure. It is, moreover, not intended for direct use by authors, who will use equation editors, conversion programs and other specialized software tools to generate MathML.
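As a small, hedged illustration of what such generated markup can look like, the following sketch encodes the expression x^2 + 4x + 4 in MathML presentation markup (the namespace is the one defined in the MathML recommendation):

  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <mrow>
      <msup><mi>x</mi><mn>2</mn></msup>     <!-- x squared -->
      <mo>+</mo>
      <mn>4</mn><mo>&#x2062;</mo><mi>x</mi> <!-- 4 times x; &#x2062; is the invisible 'times' operator -->
      <mo>+</mo>
      <mn>4</mn>
    </mrow>
  </math>

Because the structure of the expression is made explicit (identifiers, numbers and operators are distinguished), a browser can render it properly and a search tool can, in principle, find it again.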
Why XML alone does not suffice: Metadata, PICS and RDF

The key to the semantic web is metadata. The term 'metadata' literally means 'data about data'. Metadata contain a description of the data, not the data itself. Metadata form an important instrument for keeping an overview of the available data. They provide computer-understandable information. This feature is the one that makes metadata essential for the implementation of the semantic web. Since metadata provide a full description of the data, they make it easier to recover lost data, or to find data documented in an unusual way. This feature is also important for applications in digital libraries.

The most important metadata standard at the moment is the Dublin Core. It is the result of a series of international workshops where a broad group of professionals gathered, including colleagues representing the library and information sciences. The mission of the Dublin Core Metadata Initiative (DCMI) is to make it easier to find resources using the Internet through the following activities:
- Developing metadata standards for discovery across domains.
- Defining frameworks for the interoperation of metadata sets.
- Facilitating the development of community- or discipline-specific metadata sets.

One of the recent activities of the DCMI is defining a mechanism to record bibliographic citation information for journal articles in Dublin Core (Dekkers & Weibel, 2002).

The PICS (Platform for Internet Content Selection) specification connects labels (metadata) to Internet content. It was initially designed to deny young children or students access to certain web pages. PICS is the basis of much content-control and filtering software, such as 'Cyber Control' or 'Cybersitter'. Search engines such as Google use PICS to remove 'adult content' from their search results. The PICS specification, however, also makes other uses of labels possible, such as code signing. A code-signing certificate serves to identify to the client which entity is responsible for the code wishing to install itself as a trusted program on the client's computer. Different modes were developed in order to make PICS efficient in all possible situations. Self-rating enables content providers to voluntarily label the content they create and distribute, while third-party rating enables multiple, independent labelling services to associate additional labels with content created and distributed by others. Services may devise their own labelling systems, and the same content may receive different labels from different services. Finally, users, parents and teachers may use ratings and labels from a diversity of sources to control the information that they or children under their supervision receive. Some members of the medical profession have already developed a prototype core vocabulary, med-PICS, for possible use with medical information (Eysenbach & Diepgen, 1998). Of course, we should point out the danger of misuse of such a labelling system. PICS makes it possible to filter Internet content on different levels. The filtering software may be part of the browser, but may also be incorporated in a company's proxy server, if the management wishes to deny access to certain web content.

XML addresses only document structure. The Resource Description Framework (RDF) better facilitates interoperation between web applications (Decker et al., 2000). RDF is an XML application, i.e. its syntax is defined in XML, which is an important difference with respect to a DTD. It is customized for adding meta-information to web documents and is currently under development as a W3C standard for content description of web sources. One could say that RDF is the language in which semantic web metadata statements are expressed. RDF will be described somewhat more precisely in the next section.
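Returning for a moment to the Dublin Core: as a hedged sketch of how such metadata are often embedded in an ordinary HTML page (the element names follow the Dublin Core convention; the values here simply describe this article), one could write:

  <head>
    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
    <!-- Dublin Core elements expressed as HTML meta tags -->
    <meta name="DC.title" content="Some ideas concerning the Semantic Web">
    <meta name="DC.creator" content="Rousseau, Brendan">
    <meta name="DC.creator" content="Rousseau, Ronald">
    <meta name="DC.date" content="2002">
    <meta name="DC.subject" content="semantic web; metadata; XML; RDF">
  </head>

RDF, described next, provides a more general and more rigorous way of making such statements.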
Resource Description Framework (RDF)

RDF labels may contain PICS labels, but also much more, such as strings and structured data. The RDF specification offers a simple (ontological) system for data exchange over the Internet. Ever since companies started using metadata they have invented (different!) ways of describing resources. These different versions were usually incompatible, which led to a serious problem. Being the leading Internet standards organization, the W3C tried to solve this problem by introducing RDF. RDF solves the incompatibility problem by providing a syntax and schema specification. The final aim of RDF is to enable users to use the same metadata in different applications and interfaces. With RDF one can describe web resources (any object with a URI, a Uniform Resource Identifier) in a machine-understandable form. In this way the semantics of an object become understandable and available on the Internet. Once RDF is in general use, all Internet agents and services will be able to use these descriptions.

RDF extends the basic XML model and syntax with the aim of describing resources. For this, RDF uses the 'namespace' functionality of XML. Documents containing multiple markup vocabularies pose problems of recognition and collision. Software modules need to be able to recognize the tags and attributes which they are designed to process, even in the face of "collisions" occurring when markup intended for some other software package uses the same element type or attribute name. These considerations require that document constructs should have universal names, whose scope extends beyond their containing document. XML namespaces accomplish this. Formally, a namespace is a collection of terms managed according to a policy or algorithm (Duval et al., 2002). An XML namespace, in particular, is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names. Namespaces allow RDF to define a uniquely identifiable set of properties. This set is called a schema. It can be accessed via the URI defined in the namespace. Because RDF is defined within XML it inherits all XML properties, such as support for rendering data in several different languages. This is another difference between RDF and DTDs.

An example of RDF syntax:
  <?xml version="1.0"?>
  <RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:agls="http://www.naa.gov.au/recordkeeping/gov_online/agls/1.2">
    <Description about="http://www.picsunited.com">
      <dc:description>Pics United for the best soccer photography on the web</dc:description>
      <dc:publisher>XPECT</dc:publisher>
      <dc:date>2001-01-01</dc:date>
      <dc:subject>soccer, photo</dc:subject>
      <agls:availability>Pics United</agls:availability>
    </Description>
  </RDF>
The first line states that it is an XML document. The opening RDF tag then declares that three namespaces will be used: RDF, AGLS and Dublin Core, where RDF is the default namespace. All properties that will be used come from one of these three namespaces. The main part of this example is contained within the two Description tags. Five properties of the resource specified in 'about' (specified by its URI) are given. These properties are taken from the Dublin Core (DC) and the Australian Government Locator Service (AGLS). They define certain properties of the URI. The main advantage of RDF lies in the fact that resource description groups can concentrate on semantic problems, instead of on the syntax and structure of metadata. RDF schemas are easily extensible, reusable and available in machine-readable form.
Ontologies

Ontologies providing shared and common domain 'theories' will be a key asset for the semantic web. An ontology determines the classes of a field and organizes them within a class structure (a taxonomy). Such a taxonomy not only defines classes and their relations, but also contains a set of inference rules making it possible to reason about these relations. Each class has certain properties, shared by all members of the class (Vickery, 1997; Fensel, 2000; Ding, 2001).

The best known definition of an ontology is Gruber's (Gruber, 1993). He states that an ontology is a formal, explicit specification of a shared conceptualisation. The term 'conceptualisation' refers to an abstract model of phenomena in the world. This model comes into existence by identifying the relevant concepts related to these phenomena. "Explicit" means that the type of concepts used, and the constraints on their use, are explicitly defined. "Formal" refers to the fact that the ontology should be machine-readable. The term "shared", finally, reflects the fact that an ontology should capture consensual knowledge, accepted by the relevant communities.

Generally speaking, one may say that the terms 'ontology' and 'thesaurus' are almost synonyms. The term 'thesaurus' is a typical library science term, while the term 'ontology' is a computer science (or artificial intelligence) term, borrowed
(and adapted) from philosophy. Indeed, 'ontology' is actually a philosophical term meaning 'the theory of objects and their ties'. The unfolding of ontology provides criteria for distinguishing various types of objects (concrete and abstract, existent and non-existent, real and ideal, independent and dependent) and their ties (relations, dependences and predication).

An ontology has to be represented by a special, predefined language: an ontology language. Currently existing ontology representation languages are either logic-based, frame-based or web (XML)-based. OIL (Ontology Interchange Language), proposed by the OnToKnowledge project, brings these three aspects together. OIL is compatible with RDF schemas augmented with precise semantics (Fensel et al., 2001). It has been successfully applied in areas such as knowledge management and e-commerce. It is very unlikely that one ontology language would satisfy the needs of all users. For this reason OIL is designed in layers. Each layer adds functionality to the underlying one. In this way agents designed for a lower-level layer can partially understand descriptions made on a higher level. Core OIL is largely the same as RDF schemas. This property guarantees the compatibility of OIL with RDF. There further exist tools to help scientists draw and manipulate ontologies. Ontolingua is a set of tools for analysing and translating ontologies. WebOnto was designed to support collaborative browsing, creation and editing of ontologies. It also contains an ontology discussion tool.
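As a small, hedged illustration of what such a class structure looks like in practice, here is a sketch in RDF Schema, which Core OIL closely resembles; the class and property names (Vehicle, PrivateCar, numberOfTyres) are invented for the example:

  <?xml version="1.0"?>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <!-- a small taxonomy: every PrivateCar is also a Vehicle -->
    <rdfs:Class rdf:ID="Vehicle"/>
    <rdfs:Class rdf:ID="PrivateCar">
      <rdfs:subClassOf rdf:resource="#Vehicle"/>
    </rdfs:Class>
    <!-- a property that applies to all members of the class Vehicle -->
    <rdf:Property rdf:ID="numberOfTyres">
      <rdfs:domain rdf:resource="#Vehicle"/>
    </rdf:Property>
  </rdf:RDF>

The subClassOf relation embodies exactly the kind of inheritance rule mentioned in the next section: an agent that knows that some object is a PrivateCar may infer that it is also a Vehicle, and hence that the property numberOfTyres applies to it.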
Intelligent Agents

We are constantly being called upon to make decisions without having enough information or experience to make the 'best' or even an 'intelligent' choice. The amount of information made available via networks and databases is, moreover, still increasing at a high rate. Search engines cannot cope with this enormous amount of data and offer only limited support in locating the information sought by the user. Intelligent agents can be of help here because they transform passive machines into active personal assistants and counselors (Maes, 1994). The semantic web structures the contents of web pages and hence creates an environment where software agents, roaming from page to page, can, almost autonomously, perform complex tasks on behalf of users.

Intelligent agents are small software programs scouring the Internet in order to find information that answers the queries of their owners. Agents are semi-autonomous computer programs assisting the user in handling computer applications of all kinds. Agents do not only use the available semantic infrastructure, but also create and maintain this infrastructure. Good agents help people find the information they need, allowing them to spend less time in the search process, and more on
actually analysing the information they have found. A good Internet agent is communicative, capable, autonomous and adaptive (Hendler, 2001).

A 'good' agent must be able to communicate. This is only possible if it 'speaks' the same language as its user. An agent who is not able to understand what you want from him (her/it?) is not very helpful. One of the main problems with existing search engines is the fact that, although they are based on language (or at least linguistic symbols), they have no knowledge of the domain where they have to be active. Ontologies can bring a solution here. They contain the formal definitions of knowledge domains. Examples are facts such as "If X is a private car, then X has four tyres" and inference rules such as "When an object belongs to a set, and this set is a subset of a larger set, then the object also belongs to the larger set". Note that this inference rule is actually an example of inheritance.

An agent must not only be able to act, but also to make suggestions to its user. In other words: an agent offers advice and services. Good agents can do things on the web without their users knowing all the details. In this way users can delegate tasks to their agents. Examples of such tasks are: searching, classifying and storing information, but also reading e-mail, making appointments, keeping a diary and scheduling a trip abroad (Maes, 1994). The less supervision required the better. Users do not want a simple 'slave' but an 'intelligent assistant'. Such an assistant does its utmost to perform well for its owner. Consequently, an agent must act autonomously within the parameters set by the owner.

Finally, good agents must acquire experience and use this experience to help their users. An agent must be able to adapt and change its behaviour based on a combination of user feedback and environmental factors. This also means that intelligent agents take their users' level of expertise into account, and, for instance, lower some thresholds for neophytes. The term "adaptability" implies that the agent has learning capabilities. It not only has domain knowledge, but also knows what the user would do, or would want to do, under the circumstances.

Such software agents do not yet exist outside laboratory conditions. They are not yet robust enough to be used on the Internet. Ontologies form a key issue here. Indeed, interaction with an agent requires a new (ontology) language for communication between the agent and the user.
The semantic web and digital libraries

As events develop, it seems that the influence of the semantic web on digital libraries will be of a fundamental nature. Christine Borgman aptly points out that "digital libraries are (still) hard to use" (Borgman, 2000, Chapter 5). The fact that
machines lack semantic knowledge certainly is a factor leading to this observation. Digital libraries need a flexible environment in order to develop effective strategies for selecting, organizing, managing and delivering content to their users. Perhaps XML is just a first step, but it is clear that the layered structure of the semantic web forms the basis of the transport syntax and the information representation framework needed for the design and further development of a digital library. Schemas, ontologies and agents provide the further apparatus for the intelligent processing and distribution of information (Nilsson, 2001).

Yet, questions and problems remain. For instance: are creators of information sources for digital libraries willing to semantically enrich the information they create and maintain? Are users willing to make use of metadata and other forms of annotations (e.g. PICS rating categories), and are they willing to add their own annotations? To what extent is a digital library part of an open (free-for-all?) semantic web? We further foresee an incorporation of the dynamic hypertext idea as part of the integration of digital libraries and the semantic web. The term 'dynamic hypertext' refers to a framework for self-modifying hyperdocuments. Using the dynamic hypertext feature, an online information source, e.g. a textbook, transforms itself into a reference book during the reading and learning process (Calvi & De Bra, 1997; De Bra et al., 2002).
Conclusion

Today's search engines are mainly based on keyword searches. For many purposes this approach is not suitable, because it is not precise enough. Scientists in particular want more detailed and precise search results. Using XML, RDF, Dublin Core, OIL and other standards will make it possible to add structure and meaning to raw data (Lassila, 1998). The final goal is an authentic and dynamic web where all information is available and efficiently retrievable. Even more is required: pieces of information found in different locations should be brought together in order to derive logical conclusions. This web is the Semantic Web. It is a new web structure where there is room for personalised applications, where link structure is known, and where software agents find their natural habitat.

The dream of a semantic web has not yet been realised, but many of its building blocks exist already. HTML, nowadays the de facto standard, is not suitable for representing semantic information. XML, an important step in the right direction, is an open standard giving the user, for example, the possibility of including metadata. XML offers individual fields the opportunity to design their own specific sublanguages. This is a good thing, but it makes data interchange more complicated. Namespaces, as used by RDF, offer a solution here. This is, however, only a partial solution: the semantic structure of the data must also be known. Here structured languages such as OIL can play a prominent role.
Once all data existing on the Net are available together with the correct semantic information, users' queries can be answered correctly and completely. An essential step towards the semantic web is the joining of subcultures, linguistic as well as corporate. Often two groups of people independently develop very similar concepts, and describing the relation between them can bring great benefits to both parties. This can be compared to an English-Chinese interpreter enabling communication and collaboration between individuals and groups (Berners-Lee et al., 2001). In this way a completely new network at the service of individuals, companies and researchers will grow. Let us hope that the "Semantic Web" idea lives up to its promises for the benefit of all mankind.
Acknowledgement

R.R. thanks Profs. Jin Bihui, Liang Liming and their students for interesting discussions related to the semantic web and its influence on the future of digital libraries.
References

T. Berners-Lee (1998). Semantic web road map. http://www.w3.org/DesignIssues/Semantic.html
T. Berners-Lee, J. Hendler and O. Lassila (2001). The semantic web. Scientific American, 284(5), 29-37.
C.L. Borgman (2000). From Gutenberg to the global information infrastructure. Access to information in the networked world. Cambridge (MA): MIT Press.
J. Bosak and T. Bray (1999). XML and the second-generation web. Scientific American, 280(5), 79-83.
L. Calvi and P. De Bra (1997). Using dynamic hypertext to create multi-purpose textbooks. Proceedings of the AACE-ED-MEDIA '97 Conference, Toronto (Canada), 130-135.
E. Castro (1999). HTML 4 for the World Wide Web: Visual QuickStart Guide, 4th ed. Berkeley (CA): Peachpit Press.
J. Corthouts and R. Philips (1996). SGML: a librarian's perception. The Electronic Library, 14, 101-110.
P. De Bra, A. Aerts, D. Smits and N. Stash (2002). AHA! The next generation. Proceedings ACM Conference on Hypertext and Hypermedia (to appear).
S. Decker, S. Melnik, F. Van Harmelen, D. Fensel, M. Klein, J. Broekstra, M. Erdmann and I. Horrocks (2000). The semantic web: the roles of XML and RDF. IEEE Internet Computing, 4(5), 63-74.
M. Dekkers and S.L. Weibel (2002). Dublin core metadata initiative progress report and workplan for 2002. D-Lib Magazine, 8(2). http://www.dlib.org/dlib/february02/weibel/02weibel.html
Y. Ding (2001). A review of ontologies with the semantic web in view. Journal of Information Science, 27, 377-384.
E. Duval, W. Hodgins, S. Sutton and S.L. Weibel (2002). Metadata principles and practicalities. D-Lib Magazine, 8(4). http://www.dlib.org/april02/weibel/04weibel.html
G. Eysenbach and T.L. Diepgen (1998). Towards quality management of medical information on the Internet: evaluation, labelling, and filtering of information. British Medical Journal, 317, 1496-1500.
D. Fensel (2000). Ontologies: Silver bullet for knowledge management and electronic commerce. Berlin: Springer-Verlag.
D. Fensel, F. van Harmelen, I. Horrocks, D.L. McGuinness and P.F. Patel-Schneider (2001). OIL: an ontology infrastructure for the semantic web. IEEE Intelligent Systems, 16(2), 38-45.
T.R. Gruber (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5, 199-220.
J. Hendler (2001). Agents and the semantic web. IEEE Intelligent Systems, 16(2), 30-37.
J. Hodgson (2001). Do HTML tags flag semantic content? IEEE Internet Computing, 5(1), 20-25.
D.E. Knuth (1984). The TeXbook. Reading (MA): Addison-Wesley.
O. Lassila (1998). Web metadata: a matter of semantics. IEEE Internet Computing, 2(4), 30-37.
P. Maes (1994). Agents that reduce work and information overload. Communications of the ACM, 37(7), 31-40 + 146.
G. McMurdo (1996). HTML for the lazy. Journal of Information Science, 22, 198-212.
M. Nilsson (2001). The semantic web: how RDF will change learning technology standards. http://www.cetis.ac.uk/content/20010927172953/
B.C. Vickery (1997). Ontologies. Journal of Information Science, 23, 277-286.
Other useful addresses:
MathML [http://www.w3.org/TR/REC-MathML]
OIL [http://www.ontoknowledge.org/oil/]
PICS [http://www.w3.org/TR/NOTE-PICS-Statement]
W3C [http://www.w3.org/2001/sw/]
Appendix: some abbreviations

AGLS: Australian Government Locator Service
CML: Chemical Markup Language
DC: Dublin Core
DTD: Document Type Definition
HTML: Hypertext Markup Language
IBM: International Business Machines
OIL: Ontology Inference Layer
PICS: Platform for Internet Content Selection
RDF: Resource Description Framework
SGML: Standard Generalized Markup Language
URI: Uniform Resource Identifier
W3C: World Wide Web Consortium
WYSIWYG: What you see is what you get
XML: eXtensible Markup Language