Multilingual Information Retrieval in World Wide Web

Xiaoda Zhang and James N. K. Liu
Department of Computing, Hong Kong Polytechnic University, Hong Kong

Eric Atwell
Department of Computer Studies, University of Leeds, UK

Abstract

The article addresses: (1) The design of an information retrieval (IR) tool, the Multilingual Information Retrieval Tool Hierarchy (MIRTH), which works with virtual corpora on the World Wide Web, also known as the Web or WWW. It is motivated by the desire to create a search engine that retrieves information by accessing a virtual corpus. (2) The implementation of a general model of multilingual retrieval for Web searching. It copes with both English and Chinese information retrieval techniques. This paper starts by addressing some problems of the World Wide Web relating to information retrieval. Then it introduces some existing information retrieval tools on the Web. The need to create a multilingual search engine is discussed. Next, a general hierarchy of the MIRTH search engine is illustrated. Furthermore, techniques to set up a MIRTH search engine are explored. These include building up data files, the structure of the search engine [Gilster, 1996], and constraints on query syntax. In addition, the means of creating a MIRTH multilingual search engine for Chinese (and English) information retrieval are dealt with, and some examples of using the MIRTH search engine are given.

1. Introduction

In a large distributed hypertext system like the World Wide Web, users find information by following hypertext links. As the size of the system increases, users must traverse increasingly more links to find what they are looking for. This task is very demanding. One comprehensive way to cope with this is to develop a computer program to help people explore the Web. This is a search engine. All search engines provide users with a query frame in which the user can key in a search requirement in the form of keywords, phrases, or a regular expression. Then they go through the Web pages, locate documents within the entire Web and return selected hits in the format of WWW documents. Examples of this kind of search engine are Infoseek, Yahoo, WebCrawler, Excite, ALIWEB, CUSI, and W3 Catalogue [Graham, 1995].

1.1 Advantages of Search Engines

Helping users to retrieve information from the WWW is the basic function of search engines. The two most impressive features of search engines are the time they save when searching the Web and their simplicity of use. For example, via Netscape Navigator, once a user keys a keyword, phrase, or regular expression into the frame of a search engine, he/she just needs to press 'Enter' on the keyboard, and the desired information will be located in a very short time in the same Netscape browser.

1.2 Disadvantages of Existing Search Engines

Each search engine has its weaknesses. The shortcomings common to all of them can be summarized as follows:

• Most of them can only search in English. This would prevent, for instance, a linguist from retrieving materials about the Chinese language or theory in Chinese. Hence the need exists to establish a multilingual search engine.

• Most of them are general search tools focusing on general information retrieval, which might not be efficient for searching in a research area. Although some can search a very wide database, they cannot satisfy users who need very detailed information on professional topics, for example, Chinese grammar.

1.3 Objectives of Research

As such, the present paper was motivated by the desire to create a search engine which retrieves information from the World Wide Web (WWW) using a linguistics virtual corpus. Specific attention has been paid to multilingual search facilities. The study includes a discussion of the use of English and Chinese language tools. It is intended to create a multilingual information retrieval tool to support searching in specified areas, such as the fields of linguistics and literature. As a search engine, it retrieves information in both English and Chinese. This research comprises two major tasks:

• To organize a "virtual corpus" of computer-based text-training materials for linguistics and literature available on the World Wide Web. The corpus contains multilingual information about human language learning and training, grammar and language modeling research. Users can find extensive materials within the corpus, such as novels, poetry, prose and various on-line electronic magazines.

• To set up a multilingual search tool which can handle queries in both English and Chinese. This tool can be used to scan the contents of the mentioned corpus on line.

2. Methodologies of Creating an Information Retrieval Tool

Creating an information retrieval tool involves several methodologies, for example, building up special corpora, creating a unique data file, and defining special query syntax rules. These are stated briefly in the following sections.

2.1 Build up Special Corpora

The WWW is a huge storeroom where a variety of information materials have been placed. As most WWW documents have been written using HTML in plain text, the WWW can be seen as a corpus [Atwell, 1993], [Liu and Lee 1997]. There are already billions of documents on the WWW, and the number grows rapidly. As all of these hypertext documents are dispersed over the Web, it is a time-consuming task to find materials in a particular research area. So it is wise to set up an access point for relevant materials, such as a Computing corpus. (Most documents on the Web are written in the Hypertext Markup Language, HTML for short.)

2.2 Set up Web Data File

It is noted that for Web searching, the common search object is a database [Ford, 1995]. However, this requires a huge storage space for information, and it also takes a long time for the search process to complete. A special approach based on MIRTH is devised for organizing the database. Within MIRTH, there are two kinds of search objects: virtual corpora [Butler, 1992] and data files. The virtual corpus is dynamic. It differs from traditional corpora in that the contents of a virtual corpus change from time to time [Butler, 1992]. A virtual corpus might not be stored on a user's computer and, furthermore, is potentially subject to unforeseen changes as remote sources are modified without central control. Hypertext resources on the Web are seen as collections of virtual corpora. A Web data file, also called a data set, is a collection of Web pages' Uniform Resource Locators (URLs). It supports the whole process of information retrieval from the Web. To set up Web data files, three issues are discussed:

• Setting up a data file to save space
• Specifying a structure of a data file
• Defining applications for a data file

A Principle of Setting up the Data File

The purpose of creating a data file instead of using the contents of hypertext pages is to save storage space. It is clear that the resources available on the Web might be too large to be saved on one machine. One way to serve this purpose is to store only keywords in a data file and give a few lines of explanation of the content of the documents in the file. An example of the data file is given in Fig. 1 as follows:

Intelligent Computer program Solutions WWW Site
Introduction of Computer programs

M.S. and Ph.D Computer Graphics Theses
This directory contains the ASCII text files for all of the Computer Graphics Thesis and Dissertation Abstracts Compendiums published in Computer

Computer & Associates
Computer Based Training Internet Services Web Design Touch Screen Kiosks Electronic Catalogs Java Shock wave Multimedia
Figure 1: A data file example with key words and explanations

When the data file is searched by MIRTH for the key word 'computer', the underlined part of the text (the keywords and main explanation) is shown on the screen as the result of the search, as can be seen in Fig. 2.
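As a rough illustration of how a keyword search over such a data file might work, the sketch below parses entries like those in Fig. 1 and returns the ones that mention a search term. It is not MIRTH's actual implementation: the blank-line-separated layout, the `Entry` class and the function names are all assumptions made for the example, and the sample text is abridged from Fig. 1.

```python
from dataclasses import dataclass

# Sample entries abridged from Fig. 1. The layout (one entry per
# blank-line-separated block, keywords on the first line) is an assumption.
DATA_FILE = """\
Intelligent Computer program Solutions WWW Site
Introduction of Computer programs

M.S. and Ph.D Computer Graphics Theses
This directory contains the ASCII text files for all of the Computer Graphics Thesis and Dissertation Abstracts Compendiums

Computer & Associates
Computer Based Training Internet Services Web Design
"""


@dataclass
class Entry:
    keywords: str     # first line of an entry (the keyword line in Fig. 1)
    explanation: str  # remaining lines, joined into one string


def parse_data_file(text):
    """Split the data file into entries, one per blank-line-separated block."""
    entries = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        entries.append(Entry(lines[0], " ".join(lines[1:])))
    return entries


def search(entries, term):
    """Return entries whose keywords or explanation contain the term (case-insensitive)."""
    needle = term.lower()
    return [
        e for e in entries
        if needle in e.keywords.lower() or needle in e.explanation.lower()
    ]


hits = search(parse_data_file(DATA_FILE), "computer")
for hit in hits:
    print(hit.keywords)
    print("    " + hit.explanation)
```

Searching for 'computer' returns all three sample entries, while a narrower term such as 'graphics' returns only the second one. How MIRTH orders its hits is not specified here, so the sketch simply keeps file order.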

Figure 2: Search Results on the Data File

Data files are application-oriented: their contents depend on users' requirements of a search engine in terms of information retrieval, so they are not always the same. The resulting data files will inevitably be unique, as they reflect their creators' personal ideas as well as those of their users. Generally speaking, however, there are some common features in designing a data file. For example, the design is influenced by its objectives. The purpose of the MIRTH search engine is to help computer specialists, linguists and people studying literature. So the authors approached the problem of designing data files by collecting all resources related to the aims of this group of special users, and built up as large a computing, linguistics, and literature file as possible in the time available. In other words, the contents of the MIRTH data file introduce general computing knowledge, linguistics theory and literature materials such as novels, poetry and literature journals (both Chinese and English). It can also be extended to other topics, for example civil engineering and chemistry, if necessary, but this will involve further manual creation of entries in the data file, and will dilute the subject specificity, thus risking more erroneous 'hits'.

Construct a Data File

The rule of thumb in constructing a data file is to save computer memory. The best way to do this is to choose keywords from an HTML document and convert them into a data file (see the example in Fig. 1). Many automatic search engines rely on a data file to deal with queries. It is important to choose comprehensive keywords to improve the chance of retrieving relevant documents. For the data files, keywords can be seen as the words or phrases which reflect the subject of the corresponding home page. The theory behind a data file is simple: users are interested in a particular document if it contains some keywords related to their interests. Keywords to be placed in the data file were selected manually from introductory textbooks in linguistics, computing and literature, guided by the authors' experience in Chinese linguistics and literature.

Application of a Data File

Search results are controlled by data files. In the search procedure, the first thing to be done is to key a search item into the query box. Then the engine runs. In fact, the search query is passed as a string to the search program by the external program. When the program is run, it starts pattern matching in the data file. If the search finds anything from one to ten or more "hits" in the data file, the matched items (hits) are ranked by the program automatically before being sent back to the user. If the search item is not found in the data file, nothing has matched, and the user gets no results from the search. This shows the importance of a data file. The data file restricts the application of a search. A comprehensive, high quality data file is essential to efficient searching. Its structure will be influenced by the kind of searches users wish to make. This is a major issue that was considered during the development of MIRTH.

2.3 Define Query Syntax Rules

Although MIRTH searches pre-computed data files instead of the Web pages themselves, any relevant home pages will eventually be downloaded by users if necessary. Our interest is in those home pages as well. Intelligent query support is included in the system to help users search for information more accurately and efficiently [Ford, 1995]. All search engines developed in the literature have their own syntax rules for making queries. Two problems exist with them: (1) in practice, only a few search tools explain their rules explicitly; (2) most common query syntax rules depend on concepts of natural language, such as words and expressions (several words). MIRTH differs from them: as it focuses on linguistics and literature research, it needs some syntax rules that support specific searching functions. The MIRTH syntax rule definitions focus on special usages, such as affix search and root search. Details are addressed below.

Motivation of Affix and Root Search

As a general definition, syntax addresses the structure of sentences, but technically it has more meanings. In computing, the term syntax is used widely. Any computer language requires certain syntax rules for its commands and code, ranging from the simple structure of words to an entire program. Moreover, in different situations syntax has a different definition and different content. In computer programming, there are many syntax rules that programmers have to follow. For example, in hypertext writing, HTML (HyperText Mark-up Language) requires all commands to be enclosed in angle brackets (< >). The symbols usually appear in pairs: an opening tag is used at the beginning and a closing tag at the end of the same sentence.

It has been mentioned that the main function of MIRTH is linguistic search. Analyzing the special structure of words (and phrases) is a very important issue for linguists and language learners [Graham, 1995]. For example, in English most words have a root, and a root can form many derivations: adding a suffix or a prefix to a root creates a new word. Consider 'think': it is a verb that we can regard as a root, and when we add a prefix and a suffix to it, it becomes a new word such as 'unthinkable'. For language searching, these special functions of prefix, suffix and root search have been considered in the query syntax.

Example of Affix and Root Search

Prefix + (*)

If a user wants to find words which start with the same prefix, he/she should follow this syntax rule when entering the item in the search box: prefix* (without a space before '*').

For a prefix search, the search engine scans all words in its database and provides the information you are looking for. Suppose you want to find words that begin with 'dis': add an asterisk (*) after the search item without a space, i.e. "dis*". The search results would then include: dislike, display, discrete, dismember, discomfort, discredit, discover, discolor, disclose, and disloyal. Fig. 3 illustrates this idea.

Figure 3: Search Prefix "Com"

Fig. 3 shows a search for the prefix "Com". In total, over thirty hits were matched. A Chinese prefix search can be seen in Fig. 4.

Figure 4: Chinese Engine Search

In this Chinese prefix search, the search item is "中" (central, or middle) + "*". When the search engine ran, it picked out over fifteen hits containing the prefix "中", such as 中文 (Chinese), 中国 and 中华 (China), 中国文学 (Chinese literature), 中东 (Central Eastern), 中西 (Central Western) and so on.

(*) + Suffix

This syntax rule defines a search that matches all words in the data file having an identical suffix. The required input is *suffix, with no space between the asterisk and the suffix. The user just needs to type the search query into the item box with an asterisk (*) in front of the suffix, for example "*ing"; the search engine then seeks out the words that include the suffix and automatically picks them out (see Fig. 5, displayed by the Netscape Web browser).

Figure 5: MIRTH Search for "*ing"

(*) + Root + (*)

This query syntax means that the root part of a word is placed between two asterisks, with no spaces among them. The search engine then matches all words in the index file that have an identical root.
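The three query forms (prefix*, *suffix and *root*) can be sketched in a few lines. This is only an illustration of the syntax rules as described, not MIRTH's code: the function name `affix_search`, the sample word list and the case-insensitive matching are assumptions.

```python
def affix_search(pattern, words):
    """Match words against a MIRTH-style query: 'prefix*', '*suffix' or '*root*'.

    A pattern without any asterisk is treated as an exact match
    (an assumption; that case is not described in the text).
    """
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*")
    core = pattern.strip("*").casefold()
    if not core:
        return []  # a bare '*' would match everything; treat it as an empty query

    def matches(word):
        w = word.casefold()
        if leading and trailing:   # *root*
            return core in w
        if trailing:               # prefix*
            return w.startswith(core)
        if leading:                # *suffix
            return w.endswith(core)
        return w == core

    return [w for w in words if matches(w)]


# Hypothetical word list standing in for the keywords of a data file.
WORDS = ["dislike", "display", "thinking", "unthinkable", "taking",
         "中文", "中国", "中国文学", "文学"]

print(affix_search("dis*", WORDS))     # ['dislike', 'display']
print(affix_search("*ing", WORDS))     # ['thinking', 'taking']
print(affix_search("*think*", WORDS))  # ['thinking', 'unthinkable']
print(affix_search("中*", WORDS))       # ['中文', '中国', '中国文学']
```

The sketch works on decoded text strings, so a Chinese character is always compared as a whole unit. A search that works directly on two-byte GB codes would have to take care not to match across character boundaries.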

3. Multilingual Issues in MIRTH

3.1 Chinese Computing Environment

For multilingual information retrieval, most systems, including UNIX and PC systems, run in the standard English environment, and WWW documents work in this format [Christian, 1988]. For example, Netscape has a Web browser (named Navigator) which displays standard HTML in plain English. When a document in Chinese is on the Web, Netscape Navigator will show the Chinese characters as strange symbols unless certain software is installed to support displaying Chinese; the strange symbols displayed cannot be understood by any user (including Chinese people). Chinese GB and BIG5 codes are displayed in the default Netscape font.

Some Chinese software can cope with this problem by converting the English computing environment into one which supports both English and Chinese. Once this software is installed, the Chinese code is converted to readable Chinese characters, as shown in Figs 6 and 7, which show the same document file. Fig. 6 is displayed with a Chinese environment (supported by UnionWay). Fig. 7 is displayed without a Chinese environment.

Figure 6: Display with Chinese environment

Figure 7: Display without Chinese environment

Reading Chinese with a Web browser, and setting up a search system for Chinese information retrieval, are still topics of debate on the Web [Zhou and Liu, 1997]. But in this paper, we have presented one solution. We now understand how to set up a Chinese environment, how to deal with Chinese characters, and how to retrieve information from a Chinese virtual corpus.
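The "strange symbols" described in this section can be reproduced with a short sketch. It assumes that the document is stored in 8-bit GB (Python's `gb2312` codec) and that a browser without Chinese support interprets each byte as a Latin-1 (ISO 8859-1) character; the second assumption is made for illustration only and is not a statement about how Netscape Navigator was configured.

```python
# A Chinese word as it would be written in a GB-coded document.
text = "中文"  # "Chinese (language)"

# Each Chinese character becomes two bytes in 8-bit GB.
gb_bytes = text.encode("gb2312")
print(gb_bytes)   # b'\xd6\xd0\xce\xc4'

# Without a Chinese environment, every byte is shown as a separate
# Western character (assumed here to be Latin-1).
garbled = gb_bytes.decode("latin-1")
print(garbled)    # ÖÐÎÄ

# With a Chinese environment, the same bytes are read as GB again.
restored = garbled.encode("latin-1").decode("gb2312")
print(restored)   # 中文
```

No information is lost in the garbled display: the bytes are intact and only their interpretation differs, which is why installing a Chinese environment is enough to make the same document readable.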

3.2 Dealing with Symbolic Chinese Characters

MIRTH provides access to Chinese virtual corpora as well. To understand this procedure, we need to know how the computer deals with symbolic characters. We now discuss how to convert a symbolic character into digital information, how to store a set of digital characters (as bitmaps) and how to represent particular Chinese characters in the GB protocol.

Chinese Code Protocols

We discuss terms like GB, HZ, and BIG5, the Chinese code protocols used by MIRTH. When we search a Chinese data file in MIRTH, this file is normally written in a kind of code; in other words, the Chinese characters are written in specific codes instead of as graphic characters.

Line-Column Code

One simple protocol is called Line-Column code, which uses a character's line number and column number as its code [Huabei, 1981]. For example, when we search for the word 'big', we key in 2083, where 20 is the line number of 'big' in the Chinese character library, and 83 is the column number. This method is not widely used because, to key in a single Chinese character, we need to type 4 digits instead of the 2 letters used by GB code.

GB Code (Guo Biao)

GB means national standard; it stands for the Chinese Standard for Information Interchange (read Guo Biao in Chinese), which is defined by the People's Republic of China and is widely used in Chinese societies around the world. It is defined by adding 32 to both the line and column numbers of the Line-Column code. Taking the word 'big' as an example, adding 32 to 20 (line number) and 83 (column number) gives 52 and 115; in the ASCII code table, 52 represents '4' and 115 represents 's', so the GB code for 'big' is '4s'. As the minimum line or column number is 1, the minimum GB code value is 32+1=33, and the maximum is 32+94=126. GB code therefore lies within the range of the ASCII codes that represent the 94 printable symbols. This means we could use GB code as a standard information interchange code set like ASCII.

3.3 Chinese Information Retrieval System

The structure of the Chinese character set is different from English. It has its special characteristics, and it can be displayed by MIRTH on the Web. The whole procedure can be seen below.

Chinese WWW Servers

How can we get access to Chinese documents on the WWW? The first step is to approach a Chinese Web server. Recently, dozens of Chinese servers have appeared on different platforms on the Web. These include, among others:

• Chinese WEB server (URL: http://darwin.technet.sg/cweb/cstart.html)
• Wen Zhai (also known as Chinese News Digest; URL: http://www.cnd.org)
• The Xian Chinese World Wide Web (http://www.ncb.gov.sg/chinese-web/)
• Chinese Gopher Menu (gopher://sunrise.cc.mcgill.ca/)
• The Chinese Web page (http://agora.leeds.ac.uk/xiaoda/Dcorpus.htm)
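Documents fetched from such servers arrive as GB- or HZ-coded bytes, so it may help to check the Line-Column to GB arithmetic of Section 3.2 with a short sketch. The function name `line_column_to_gb` is hypothetical. The last two steps go beyond the text and are assumptions: that 8-bit systems store the same two bytes with the high bit set (the form read by Python's `gb2312` codec), and that HZ wraps the 7-bit bytes between `~{` and `~}` (the form read by Python's `hz` codec).

```python
def line_column_to_gb(line, column):
    """Convert a Line-Column code (each part 1..94) to its two GB code bytes.

    GB adds 32 to the line number and to the column number, so both
    bytes fall in the printable ASCII range 33..126.
    """
    if not (1 <= line <= 94 and 1 <= column <= 94):
        raise ValueError("line and column must each be in 1..94")
    return bytes([line + 32, column + 32])


# The paper's example: 'big' has Line-Column code 2083.
gb = line_column_to_gb(20, 83)
print(gb.decode("ascii"))  # 4s

# Assumption: setting the high bit of both bytes gives the 8-bit form
# that Python's "gb2312" codec decodes.
char = bytes(b | 0x80 for b in gb).decode("gb2312")
print(char)                # 大

# Assumption: HZ marks a run of 7-bit GB codes with "~{" and "~}".
hz_char = b"~{" + gb + b"~}"
print(hz_char.decode("hz"))
```

The range limits quoted in Section 3.2 follow directly: (1, 1) maps to bytes 33 and 33, and (94, 94) maps to 126 and 126.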

Retrieving Information from Chinese Data Files

To allow users of MIRTH a wider choice of virtual Chinese corpora, we need to create Chinese data files for linguistics research. The Chinese corpus is collected via a Chinese data file. Chinese search is similar to English search in MIRTH. The difference is that Chinese codes use two bytes, so when we set up the Chinese data file, we have to consider this special property in choosing the appropriate way to retrieve information.

MIRTH allows the user to input a query in Chinese in terms of keywords or a subject in the linguistics area; the search engine then returns a ranked list of documents in order of relevance. Users can read the documents first, then find what interests them and refine the search by marking the documents that have been highlighted. When users find what they want, they can click on it to search further or to capture the information they are looking for at once. To achieve this aim, we have built up our Chinese data files in GB and HZ Chinese codes. People who have the Chinese environment can get access to them easily.

Special Syntax Definition for Chinese Search

Chinese has its own properties of phrasal structure, and it is much more complicated than English. Primarily, the structure of a Chinese word is very different from that of an English word. We can subdivide English words into parts: suffix, prefix and root in one unit. For example, the word 'display' is one word, but it comprises two parts: a prefix (dis) and a root (play); the structure is prefix + root. When this word is translated into Chinese, it is shown by two independent units, and its structure is shown in Figure 8 below:

Figure 8: English "Display" is shown in Chinese

There is another way to differentiate between English and Chinese. We know that adding "ing" to a verb can change the nature of a word, transforming a verb into a noun in English. For example, 'take' is a verb; 'taking' is a verbal noun and can be an object in a sentence. But Chinese does not have this kind of rule. There is no way of changing a verb to a noun in Chinese. From the analysis above, it is easy to see that Chinese grammar is different from English grammar. So a Chinese query syntax is considered in MIRTH. Here are a few examples of the query syntax for a Chinese search.

The syntax rules

The first syntax rule is that if the query item is a keyword in Chinese, its general structure should be Root + Root or Root + suffix, because that is a major structure of Chinese words and phrases; the search item should be a phrase or a few words within this structure:

Root + Root, or Root + suffix.

This rule is quite useful for linguistics, particularly for learning Chinese grammar, such as the structure of Chinese words and phrases. For instance, if a user's search item is 中 (in English, middle or central), it can be used as a root in the derivation of new phrases. This can be seen in Fig. 9.

Figure 9: Chinese Phrasal Words

The second syntax rule is that "the keywords must include an object which is a noun (or nouns) in the search item". For instance, a search for 'Chinese' is a standard search pattern: when the search engine gets the query, it searches around the object 'Chinese' and matches results around 'Chinese'.

The Chinese data file is searched for a query using a special rule of pattern search. You define your query as a set of keywords and other qualifiers, such as 'Human Language' or 'Chinese Grammar'. A Boolean-type search would match the subject 'Language' and the object 'Grammar', and find a number of items you are interested in from the linguistics data file [Atwell, 1993].

4. Conclusion

In this paper, we have discussed the main hierarchy of MIRTH, a multilingual information retrieval tool (also called a search engine). For this purpose, three pieces of work were done. Firstly, particular sources were "linked" together as its "Virtual Corpora", containing separate topics such as computing, linguistics and language, and Chinese literature. Some of these corpora are multilingual sources, which mainly demonstrate materials in English and Chinese. Secondly, an example of a multilingual environment was created with the help of UnionWay. Thirdly, an automatic tool to retrieve information from the multilingual corpora was set up. Some improvements to MIRTH for further work are:

(1) As already stated, this research has set up its own data files. But they are not big enough to hold complete information resources in a particular research area. So installing more data materials is an essential task before MIRTH can be improved in a real situation.

(2) As a complete linguistics tool, more techniques should be added, such as tagging, parsing, and analyzing the structure of a sentence [Zhou et al., 1998], [Zhou and Liu, 1997].

(3) Maintenance is an issue for the system [Liu, 1996]. A data file management system needs to be developed to manipulate data files more efficiently and accurately. This management system will perform tasks such as adding, inserting, deleting, updating, replacing and sorting links with their key words.

References

Atwell, Eric 1993. Knowledge at Work in Universities, Leeds University Press.

Butler, Christopher S. 1992. Computers and Written Texts, Biddles Ltd, Guildford, Surrey.

Christian, Kaare 1988. The Unix Operating System, John Wiley & Sons, Inc.

Ford, Andrew 1995. Spinning the Web, International Thomson Publishing.

Graham, Ian 1995. The HTML Sourcebook, John Wiley & Sons.

Gilster, Paul 1996. Finding it on the Internet, John Wiley & Sons, Inc.

Huabei 1981. Huabei Computing Institute, Chinese National Standard: A Collection of Chinese Character Codes for Information Exchanging, China Standard Press House.

Liu, N.K. 1996. Formal verification of some potential contradictions in knowledge base using a High Level Net approach, Applied Intelligence, 6(4):325-344.

Liu, J. and Lee, YK. 1997. Development of a Chinese Extraction System. In Proceedings of International Conference on Computer Processing of Oriental Languages, April 2-4, 1997, Hong Kong.

Zhou, L. and Liu, J. 1997. An efficient algorithm for bilingual word translation acquisition, in the 2nd Workshop on Multilinguality in Software Industry: The AI Contribution (MULSAIC'97) of the International Joint Conference on Artificial Intelligence (IJCAI-97), August 23-29, 1997, Nagoya, Japan.

Zhou, L., Liu, J. and Yu, S.E. 1998. Study and implementation of combined techniques for automatic extraction of word translation pairs: An analysis of the contributions of word heuristics to a statistical method, to appear in International Journal on Computer Processing of Oriental Languages.