
Multilingual Information Retrieval in World Wide Web

Xiaoda Zhang and James N. K. Liu
Department of Computing, Hong Kong Polytechnic University, Hong Kong

[email protected]

Eric Atwell
Department of Computer Studies, University of Leeds, UK

Abstract

This article addresses two topics. (1) The design of an information retrieval (IR) tool, the Multilingual Information Retrieval Tool Hierarchy (MIRTH), which works with virtual corpora on the World Wide Web (also known as the Web or WWW). It is motivated by the desire to create a search engine that retrieves information by accessing a virtual corpus. (2) The implementation of a general model of multilingual retrieval for Web searching, which covers both English and Chinese information retrieval techniques. The paper first addresses some problems of the World Wide Web relating to information retrieval and then introduces some existing information retrieval tools on the Web. The need to create a multilingual search engine is discussed. Next, a general hierarchy of the MIRTH search engine is illustrated. Furthermore, techniques for setting up a MIRTH search engine are explored. These include building up data files, the structure of the search engine [Gilster, 1996], and constraints on query syntax. In addition, the means of creating a MIRTH multilingual search engine for Chinese (and English) information retrieval are dealt with, and some examples of using the MIRTH search engine are given.

1. Introduction

In a large distributed hypertext system like the World Wide Web, users find information by following hypertext links. As the size of the system increases, users must traverse increasingly more links to find what they are looking for. This task is very demanding. One comprehensive way to cope with this is to develop a computer program that helps people explore the Web: a search engine. All search engines provide users with a query frame in which the user can key in a search requirement in the form of keywords, phrases, or a regular expression. The engine then goes through the Web pages, locates documents within the entire Web, and returns selected hits in the format of WWW documents. Examples of this kind of search engine are Infoseek, Yahoo, WebCrawler, Excite, ALIWEB, CUSI, and W3 Catalogue [Graham, 1995].

1.1 Advantages of Search Engines

Helping users to retrieve information from the WWW is the basic function of search engines. Their two most impressive features are the time they save in searching the Web and their simplicity of use. For example, in Netscape Navigator, once a user keys a keyword, phrase, or regular expression into the frame of a search engine, he/she just needs to press 'Enter' on the keyboard, and the desired information will be located in a very short time in the same Netscape browser.

1.2 Disadvantages of Existing Search Engines

Each search engine has its weaknesses. The common shortcomings of all of them can be summarized as follows:

• Most of them can only search in English. This would prevent, for instance, a linguist from retrieving materials about the Chinese language or theory written in Chinese. Hence the need exists to establish a multilingual search engine.

• Most of them are general search tools focusing on general information retrieval, which might not be efficient for searching in a particular research area. Although some can search a very wide database, they cannot satisfy users who need very detailed information on professional topics, for example Chinese grammar.

1.3 Objectives of Research

This paper was therefore motivated by the desire to create a search engine that retrieves information from the World Wide Web (WWW) using a linguistics virtual corpus. Specific attention has been paid to multilingual search facilities. The study includes a discussion of the use of English and Chinese language tools. It is intended to create a multilingual information retrieval tool to support searching in specified areas, such as the fields of linguistics and literature. As a search engine, it retrieves information in both English and Chinese. This research comprises two major tasks:

• To organize a "virtual corpus" of computer-based text-training materials for linguistics and literature available on the World Wide Web. The corpus contains multilingual information about human language learning and training, grammar, and language modeling research. Users can find extensive materials within the corpus, such as novels, poetry, prose, and various on-line electronic magazines.

• To set up a multilingual search tool which can handle queries in both English and Chinese. This tool can be used to scan the contents of the above corpus on line.

2. Methodologies of Creating an Information Retrieval Tool

Creating an information retrieval tool as a system involves several methodologies, for example building up special corpora, creating a unique data file, and defining special query syntax rules. These are stated briefly in the following sections.

2.1 Build up Special Corpora

The WWW is a huge storeroom where a variety of information materials have been placed. As most WWW documents are written in plain text using HTML (the Hypertext Markup Language), the Web can be seen as a corpus [Atwell, 1993], [Liu and Lee, 1997]. There are already billions of documents on the WWW, and the number grows rapidly. As all of these hypertext documents are dispersed over the Web, finding materials in a particular research area is a time-consuming task. It is therefore wise to set up an access point for relevant materials, such as a Computing corpus.

2.2 Set up Web Data File

For Web searching, the common search object is a database [Ford, 1995]. However, this requires huge storage space for information, and the search process takes a long time to complete. A special approach based on MIRTH is devised for organizing the database. Within MIRTH there are two kinds of search objects: virtual corpora [Butler, 1992] and data files. A virtual corpus is dynamic. It differs from a traditional corpus in that its contents change from time to time [Butler, 1992]. A virtual corpus might not be stored on the user's computer and, furthermore, is potentially subject to unforeseen changes as remote sources are modified without central control. Hypertext resources on the Web are seen as collections of virtual corpora. A Web data file, also called a data set, is a collection of the Uniform Resource Locators (URLs) of Web pages. It supports the whole process of information retrieval from the Web. To set up Web data files, three issues are discussed:

• Setting up a data file to save space
• Specifying the structure of a data file
• Defining applications for a data file

A Principle of Setting up the Data File

The purpose of creating a data file, instead of using the contents of the hypertext pages, is to save storage space. The resources available on the Web might clearly be too large to be saved on one machine. One way to serve this purpose is to store only keywords in the data file, together with a few lines explaining the content of each document. An example of the data file is given in Fig. 1 as follows:

  • Intelligent Computer program Solutions WWW Site

    Introduction of Computer programs

  • M.S. and Ph.D Computer Graphics Theses

    This directory contains the ASCII text files for all of the Computer Graphics Thesis and Dissertation Abstracts Compendiums published in Computer

  • Computer & Associates

    Computer Based Training Internet Services Web Design Touch Screen Kiosks Electronic Catalogs Java Shock wave Multimedia

Figure 1: A data file example with keywords and explanations

When MIRTH searches the data file for the keyword 'computer', the underlined part of the text (keywords and main explanation) is shown on the screen as the result of the search, as can be seen in Fig. 2.
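The keyword search over a data file described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the MIRTH implementation: the `Entry` fields, the `example.org` URLs, the shortened descriptions, the extra grammar entry, and the ranking criterion (number of keyword occurrences) are all hypothetical, since the paper does not specify the file layout or how hits are ranked.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    title: str        # keyword line of the entry, as in Fig. 1
    url: str          # hypothetical URL, for illustration only
    description: str  # a few lines explaining the document


# A tiny data file modelled on Fig. 1 (descriptions shortened, URLs invented).
DATA_FILE = [
    Entry("Intelligent Computer program Solutions WWW Site",
          "http://example.org/solutions",
          "Introduction of Computer programs"),
    Entry("M.S. and Ph.D Computer Graphics Theses",
          "http://example.org/theses",
          "ASCII text files of thesis and dissertation abstracts"),
    Entry("Computer & Associates",
          "http://example.org/associates",
          "Computer Based Training, Internet Services, Web Design"),
    Entry("Chinese Grammar Notes",  # invented entry, not from the paper
          "http://example.org/grammar",
          "Introductory notes on Chinese grammar"),
]


def search(keyword, entries):
    """Return entries containing the keyword, most occurrences first."""
    needle = keyword.lower()
    hits = []
    for entry in entries:
        text = f"{entry.title} {entry.description}".lower()
        count = text.count(needle)
        if count > 0:
            hits.append((count, entry))
    # Assumed ranking: sort by occurrence count; ties keep data-file order.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in hits]


if __name__ == "__main__":
    for hit in search("computer", DATA_FILE):
        print(hit.title, "-", hit.description)
```

Only the short keyword and explanation lines are stored and scanned, which is the space-saving idea behind the data file: the full pages stay on the remote servers and are fetched through the URL only when a user follows a hit.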

Figure 2: Search Results on the Data File

Data files are application-oriented: their contents depend on what users require of a search engine in terms of information retrieval, so they are not always the same. The resulting data files are inevitably unique, as they reflect their creators' personal ideas while taking into account those of their users. Generally speaking, however, there are some common features in designing a data file. For example, the design is influenced by its objectives. The purpose of the MIRTH search engine is to help computer specialists, linguists and people studying literature. The authors therefore approach the design of data files by collecting all resources related to the aims of this group of special users, and have built up as large a computing, linguistics, and literature file as was possible in the time available. In other words, the contents of the MIRTH data file introduce both general computing knowledge and linguistics theory, as well as literature materials such as novels, poetry, and literature journals (in both Chinese and English). It can also be extended to other topics, for example civil engineering and chemistry, if necessary, but this would involve further manual creation of entries in the data file and would dilute the subject specificity, thus risking more erroneous 'hits'.

Construct a Data file

The rule of thumb in constructing a data file is to save computer memory. The best way to do this is to choose keywords from an HTML document and convert them into a data file entry (see the example in Fig. 1). Many automatic search engines rely on data files to deal with queries. It is important to choose comprehensive keywords to improve the chance of retrieving relevant documents. For the data files, keywords are the words or phrases that reflect the subject of the corresponding home page. The theory behind a data file is simple: users are interested in a particular document if it contains keywords related to their interests. Keywords to be placed in the data file were selected manually from introductory textbooks in linguistics, computing and literature, guided by the authors' experience in Chinese linguistics and literature.

Application of a Data File

Search results are controlled by data files. In the search procedure, the first step is to key a search item into the query box. Then the engine runs: the search query is passed as a string to the search program by the external program. When the program runs, it starts pattern matching in the data file. If the search finds anywhere from one to ten or more "hits" in the data file, the matched items (hits) are ranked by the program automatically before being sent back to the user. If the search item is not found in the data file, nothing is matched and the user gets no results from the search. This shows the importance of a data file: the data file restricts the application of a search. A comprehensive, high-quality data file is essential to efficient searching. Its structure is influenced by the kinds of searches users wish to make. This was a major issue considered during the development of MIRTH.

2.3 Define Query Syntax Rules

Although MIRTH searches pre-computed data files instead of the Web pages themselves, any relevant home pages will eventually be downloaded by users if necessary, so those home pages are of interest as well. Intelligent query support is included in the system to help users search for information more accurately and efficiently [Ford, 1995]. All developed search engines in the literature have their own syntax rules for making queries. Two problems exist with them: (1) in practice, only a few search tools explain their rules explicitly; (2) most common query syntax rules depend on concepts of natural language, such as words and expressions (several words). MIRTH differs from them in that it focuses on linguistics and literature research, so syntax rules that support specific search functions are needed. The MIRTH syntax rule definitions focus on special usages such as affix search and root search. Details are given below.

Motivation of Affix and Root Search

As a general definition, syntax addresses the structure of sentences, but technically it has further meanings. The term syntax is widely used in computing. Any computer language requires certain syntax rules for its commands and code, ranging from the simple structure of words to that of an entire program. Moreover, syntax has different definitions and different content in different situations. In computer programming there are many syntax rules that programmers have to follow. For example, in hypertext writing, HTML (HyperText Mark-up Language) requires all commands to be enclosed in angle brackets (< >). The tags usually appear in pairs: an opening tag is used at the beginning and a closing tag at the end of the same piece of text.

It has been mentioned that the main function of the MIRTH search is linguistic search. Analyzing the special structure of words (and phrases) is a very important issue for linguists and language learners [Graham, 1995]. For example, in English most words have a root, and a root can form many derivations: adding a suffix or a prefix to a root creates a new word. Consider 'think', a verb that we can regard as a root. When we add a prefix and a suffix to it, it becomes a new word, such as 'unthinkable'. For language searching, these special functions of prefix, suffix and root search have been considered in the query syntax.

Example of Affix and Root Search

Prefix + (*)

If a user wants to find words that start with the same prefix, he/she should enter the item in the search box following this syntax rule: prefix* (without a space before the '*'). The search engine will then go through all words in its database and provide the information being looked for. Suppose you want to search for words beginning with 'dis': add an asterisk (*) after the search item without a space, giving "dis*", and you would get search results such as dislike, display, discrete, dismember, discomfort, discredit, discover, discolor, disclose, and disloyal. Fig. 3 illustrates this idea.

Figure 3: Search Prefix "Com"

Fig. 3 shows a search for the prefix "Com". In total, over thirty hits were matched. A Chinese search is shown in Fig. 4.

Figure 4: Chinese Engine Search

In this Chinese prefix search, the search item is "中" (central, or middle) + "*". When run, the search engine picked out over fifteen hits containing the prefix "中", such as 中文 (Chinese), 中国 and 中华 (China), 中国文学 (Chinese literature), 中东 (Central Eastern), 中西 (Central Western), and so on.

(*) + Suffix

This syntax rule defines a search that matches all words in the data file having an identical suffix. The required input is *suffix, with no space between the asterisk and the suffix. The user only needs to type the query into the item box with an asterisk (*) in front of the suffix, for example "*ing"; the search engine then seeks out the words that include the suffix and automatically picks them up (see Fig. 5, displayed by the Netscape Web browser).

Figure 5: MIRTH Search for "*ing"

(*) + Root + (*)

This query syntax means that the root part of a word is placed between two asterisks, with no spaces among them. The search engine then matches all words in the index file that have an identical root.

3. Multilingual Issues in MIRTH

3.1 Chinese Computing Environment

For multilingual information retrieval, most systems, including UNIX and PC systems, run in a standard English environment, and WWW documents work in this format [Christian, 1988]. For example, Netscape's Web browser (named Navigator) handles standard HTML in plain English. When a Chinese document is on the Web, Netscape Navigator shows the Chinese characters as strange symbols unless certain software is installed to support displaying Chinese, and these symbols cannot be understood by any user (including Chinese people).

Reading Chinese with a Web browser and setting up a search system for Chinese information retrieval are still topics of debate on the Web [Zhou and Liu, 1997]. In this paper we present one solution: how to set up a Chinese environment, how to deal with Chinese characters, and how to retrieve information from a Chinese virtual corpus. Chinese GB and BIG5 codes are displayed in the default Netscape font. Some Chinese software can cope with this problem by converting the English computing environment into one that supports both English and Chinese. Once this software is installed, the Chinese codes are converted to readable Chinese characters, as shown in Figs. 6 and 7, which show the same document file. Fig. 6 is displayed with a Chinese environment (supported by UnionWay). Fig. 7 is displayed without a Chinese environment.

Figure 6: Display with Chinese environment

Figure 7: Display without Chinese environment

3.2 Dealing with Symbolic Chinese Characters

MIRTH provides access to Chinese virtual corpora as well. To understand this procedure, we need to know how the computer deals with symbolic characters. We now discuss how to convert a symbolic character into digital information, how to store a set of digital characters (as bitmaps), and how to represent particular Chinese characters in the GB protocol.

Chinese Code Protocols

We discuss the GB, HZ, and BIG5 Chinese code protocols used by MIRTH. A Chinese data file searched in MIRTH is normally written in some kind of code; in other words, the Chinese characters are written in specific codes instead of as graphic characters.

Line-Column Code

One simple protocol is called Line-Column code, which uses a character's line number and column number as its code [Huabei, 1981]. For example, to search for the word 'big', we key in 2083, where 20 is the line number of 'big' in the Chinese character library and 83 is the column number. This method is not widely used, because keying in a single Chinese character requires typing four digits instead of the two letters used by GB code.

GB Code (Guo Biao)

GB means national standard (read Guo Biao in Chinese) and stands for the Chinese Standard for Information Interchange, which is defined by the People's Republic of China and is widely used in Chinese societies around the world. It is defined by adding 32 to both the line and column numbers of the Line-Column code. Taking the word 'big' as an example, adding 32 to 20 (line number) and to 83 (column number) gives 52 and 115. In the ASCII code table, 52 represents '4' and 115 represents 's', so the GB code for 'big' is '4s'. As the minimum line and column number is 1, the minimum value in a GB code is 32+1=33, and the maximum is 32+94=126. GB code therefore lies within the range of ASCII codes that represent the 94 printable symbols. This means GB code can be used as a standard information interchange code set like ASCII.

3.3 Chinese Information Retrieval System

The structure of the Chinese character set is different from that of English. It has its special characteristics, and it can be displayed by MIRTH on the Web. The whole procedure is described below.

Chinese WWW Servers

How can we get access to Chinese documents on the WWW? The first step is to approach a Chinese Web server. Recently, dozens of Chinese servers have appeared on different platforms on the Web. These include:

• Chinese WEB server (URL: http://darwin.technet.sg/cweb/cstart.html)
• Wen Zhai, also known as Chinese News Digest (URL: http://www.cnd.org)
• The Xian Chinese World Wide Web (http://www.ncb.gov.sg/chinese-web/)
• Chinese Gopher Menu (gopher://sunrise.cc.mcgill.ca/)
• The Chinese Web page (http://agora.leeds.ac.uk/xiaoda/Dcorpus.htm)

and so on.

Retrieving Information from Chinese Data Files

To give MIRTH users a wider choice of virtual Chinese corpora, we need to create Chinese data files for linguistics research. The Chinese corpus is collected via a Chinese data file. Chinese search in MIRTH is similar to English search. The difference is that Chinese codes use two bytes per character, so when setting up the Chinese data file we have to consider this property in choosing an appropriate way to retrieve information.

MIRTH allows the user to input a query in Chinese in terms of keywords or a subject in the linguistics area; the search engine then returns a ranked list of documents in order of relevance. Users can read the documents first, then identify their interests and refine the search by marking the documents that have been highlighted. When users find what they want, they can click on it to search further or to capture the information they are looking for at once. To achieve this aim, we have built up our Chinese data files in the GB and HZ Chinese codes. People who have a Chinese environment can access them easily.

Special Syntax Definition for Chinese Search

Chinese has its own properties of phrasal structure, and it is much more complicated than English. Primarily, the structure of a Chinese word is very different from that of an English word. We can subdivide an English word into parts (prefix, root and suffix) within one unit. For example, 'display' is one word, but it comprises two parts: a prefix (dis) and a root (play), giving the structure prefix + root. When this word is translated into Chinese, it is expressed by two independent units, and its structure is shown in Figure 8 below:

Figure 8: English "Display" is shown in Chinese

There is another way to see how Chinese differs from English. In English, adding "ing" to a verb changes the nature of the word and transforms the verb into a noun. For example, 'take' is a verb, while "taking" is a verbal noun that can be the object in a sentence. Chinese does not have this kind of rule: there is no such way of changing a verb into a noun in Chinese. From the analysis above, it is easy to see that Chinese grammar is different from English grammar. Chinese query syntax is therefore considered separately in MIRTH. Here are a few examples of the query syntax for a Chinese search.

The syntax rules

The first syntax rule is that if the query item is a keyword in Chinese, its general structure should be Root + Root or Root + suffix, because that is the major structure of Chinese words and phrases. The search item should be a phrase or a few words with this structure. This rule is quite useful for linguistics, particularly for learning Chinese grammar, such as the structure of Chinese words and phrases. For instance, if a user's search item is 中 (in English, middle or central), it can be used as a root in the derivation of new phrases, as can be seen in Fig. 9.
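The GB arithmetic described in Section 3.2 and the two-byte property of Chinese codes noted in Section 3.3 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the use of Python's built-in `hz` and `gb2312` codecs is an assumption about tooling, and the 8-bit form of GB (each byte of the 7-bit code with its high bit set) is an addition not described in the paper.

```python
def line_column_to_gb(code):
    """Convert a 4-digit Line-Column code such as '2083' to its 2-letter GB code."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected four digits, e.g. '2083'")
    line, column = int(code[:2]), int(code[2:])
    if not (1 <= line <= 94 and 1 <= column <= 94):
        raise ValueError("line and column must be in 1..94")
    # Adding 32 moves both numbers into the printable ASCII range 33..126.
    return chr(line + 32) + chr(column + 32)


def gb_to_text(gb_code):
    """Decode a 7-bit GB pair by wrapping it in the HZ escapes ~{ and ~}."""
    return ("~{" + gb_code + "~}").encode("ascii").decode("hz")


def find_aligned(data, pattern):
    """Offsets where `pattern` starts on a character boundary.

    `data` is assumed to be 8-bit GB text: bytes >= 0x80 begin a two-byte
    Chinese character, all other bytes are single-byte ASCII.
    """
    hits = []
    i = 0
    while i < len(data):
        if data.startswith(pattern, i):
            hits.append(i)
        i += 2 if data[i] >= 0x80 else 1
    return hits


if __name__ == "__main__":
    gb = line_column_to_gb("2083")   # '4s', as in the worked example
    print(gb, gb_to_text(gb))

    data = "中国".encode("gb2312")
    straddling = data[1:3]           # second byte of 中 + first byte of 国
    print(data.find(straddling))     # naive byte search reports a false hit
    print(find_aligned(data, straddling))
    print(find_aligned(data, "国".encode("gb2312")))
```

The last three lines show why the two-byte property matters for retrieval: a plain byte search can match a pattern that straddles two neighbouring characters, whereas stepping through the data one character at a time only reports genuine matches.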

Figure 9: Chinese Phrasal Words

The second syntax rule is that "the keywords must include an object which is a noun (or nouns) in the search item". For instance, searching for 'Chinese' is a standard search pattern: when the search engine gets the query, it searches around the object 'Chinese' and matches results around 'Chinese'.

The Chinese data file is searched using a special rule of pattern search. You define your query as a set of keywords and other qualifiers, for example 'Human Language' or 'Chinese Grammar'. A Boolean-type search would then match the subject 'Language' and the object 'Grammar', and find a number of items you are interested in from the linguistics data file [Atwell, 1993].

4. Conclusion

In this paper, we have discussed the main hierarchy of MIRTH, a multilingual information retrieval tool (also called a search engine). For this purpose, three pieces of work were done. Firstly, particular sources were "linked" together as its "virtual corpora", covering separate topics such as computing, linguistics and language, and Chinese literature. Secondly, an example of a multilingual environment was created with the help of UnionWay. Some of these corpora are multilingual sources, which mainly present materials in English and Chinese. Thirdly, an automatic tool to retrieve information from the multilingual corpora was set up. Further work to improve MIRTH includes:

(1) As already stated, this research has set up its own data files, but they are not big enough to hold the complete information resources of a particular research area. Installing more data materials is therefore an essential task before MIRTH can be improved for use in a real situation.

(2) As a complete linguistics tool, more techniques should be added, such as tagging, parsing, and analyzing the structure of a sentence [Zhou et al., 1998], [Zhou and Liu, 1997].

(3) Maintenance is an issue for the system [Liu, 1996]. A data file management system needs to be created to manipulate data files more efficiently and accurately. This management system will perform tasks such as adding, inserting, deleting, updating, replacing and sorting links with their keywords.

References

Atwell, Eric 1993. Knowledge at Work in Universities, Leeds University Press.

Butler, Christopher S. 1992. Computers and Written Texts, Biddles Ltd, Guildford, Surrey.

Christian, Kaare 1988. The Unix Operating System, John Wiley & Sons, Inc.

Ford, Andrew 1995. Spinning the Web, International Thomson Publishing.

Graham, Ian 1995. The HTML Sourcebook, John Wiley & Sons.

Gilster, Paul 1996. Finding it on the Internet, John Wiley & Sons, Inc.

Huabei 1981. Huabei Computing Institute, Chinese National Standard: A Collection of Chinese Character Codes for Information Exchanging, China Standard Press House.

Liu, N.K. 1996. Formal verification of some potential contradictions in knowledge base using a High Level Net approach, Applied Intelligence, 6(4):325-344.

Liu, J. and Lee, YK. 1997. Development of a Chinese Extraction System. In Proceedings of International Conference on Computer Processing of Oriental Languages, April 2-4, 1997, Hong Kong.

Zhou, L. and Liu, J. 1997. An efficient algorithm for bilingual word translation acquisition. In the 2nd Workshop on Multilinguality in Software Industry: The AI Contribution (MULSAIC'97) of the International Joint Conference on Artificial Intelligence (IJCAI-97), August 23-29, 1997, Nagoya, Japan.

Zhou, L., Liu, J. and Yu, S.E. 1998. Study and implementation of combined techniques for automatic extraction of word translation pairs: An analysis of the contributions of word heuristics to a statistical method, to appear in International Journal on Computer Processing of Oriental Languages.