Multilingual Information Retrieval in World Wide Web
Xiaoda Zhang and James N. K. Liu
Department of Computing, Hong Kong Polytechnic University, Hong Kong
Eric Atwell
Department of Computer Studies, University of Leeds, UK
Abstract
This article addresses: (1) The design of an information retrieval (IR) tool, the Multilingual Information Retrieval Tool Hierarchy (MIRTH), which works with virtual corpora on the World Wide Web, also known as the Web or WWW. It is motivated by the desire to create a search engine that retrieves information by accessing a virtual corpus. (2) The implementation of a general model of multilingual retrieval for Web searching, which copes with both English and Chinese information retrieval techniques. The paper first addresses some problems of the World Wide Web relating to information retrieval. It then introduces some existing information retrieval tools on the Web and discusses the need to create a multilingual search engine. Next, a general hierarchy of the MIRTH search engine is illustrated. Furthermore, techniques to set up a MIRTH search engine are explored. These include building up data files, a structure of the search engine [Gilster, 1996], and constraints on query syntax. In addition, the means of creating the MIRTH multilingual search engine for Chinese (and English) information retrieval are dealt with, and some examples of using the MIRTH search engine are given.

1. Introduction
In a large distributed hypertext system like the World Wide Web, users find information by following hypertext links. As the size of the system increases, users must traverse increasingly more links to find what they are looking for. This task is very demanding. One comprehensive way to cope with this is to develop a computer program to help people explore the Web. Such a program is a search engine. All search engines provide users with a query frame in which the user can key in a search requirement in the form of keywords, phrases, or a regular expression. The engines then go through Web pages, locate documents within the entire Web, and return selected hits in the format of WWW documents. Examples of this kind of search engine are Infoseek, Yahoo, WebCrawler, Excite, ALIWEB, CUSI, and W3 Catalogue [Graham, 1995].

1.1 Advantages of Search Engines
Helping users retrieve information from the WWW is the basic function of search engines. Their two most impressive features are the time they save when searching the Web and their simplicity of use. For example, via Netscape Navigator, once a user keys in a keyword, phrase, or regular expression in the frame of a search engine, he/she just needs to press 'Enter' on the keyboard, and the desired information will be located in a very short time in the same Netscape browser.

1.2 Disadvantages of Existing Search Engines
Each search engine has its weaknesses. The common shortcomings of all of them can be summarized as follows:
• Most of them can only search in English. This would prevent, for instance, a linguist from retrieving materials about the Chinese language or theory written in Chinese. Hence the need exists to establish a multilingual search engine.
• Most of them are general search tools focusing on general information retrieval, which might not be efficient for searching in a particular research area. Although some can search a very wide database, they cannot satisfy users who need very detailed information on professional topics, for example, Chinese grammar.

1.3 Objectives of Research
As such, the present paper was motivated by the desire to create a search engine which retrieves information from the World Wide Web (WWW) using a linguistics virtual corpus. Specific attention has been paid to multilingual search facilities. The study includes a discussion of the use of English and Chinese language tools. It is intended to create a multilingual information retrieval tool to support searching in specified areas, such as the fields of linguistics and literature. As a search engine, it retrieves information in both English and Chinese. This research comprises two major tasks:
• To organize a "virtual corpus" of computer-based text-training materials for linguistics and literature available on the World Wide Web. The corpus contains multilingual information about human language learning and training, grammar and language modeling research. Users can find extensive materials within the corpus such as novels, poetry, prose and various on-line electronic magazines.
• To set up a multilingual search tool which can handle queries in both English and Chinese. This tool can be used to scan the contents of the above corpus on line.

2. Methodologies of Creating an Information Retrieval Tool
Creating an information retrieval tool as a system involves several methodologies, for example, building up special corpora, creating unique data files, and defining special query syntax rules. These are stated briefly in the following sections.

2.1 Build up Special Corpora
The WWW is a huge storeroom where a variety of information materials have been placed. As most WWW documents have been written mainly using HTML in plain text, the WWW can be seen as a corpus [Atwell, 1993], [Liu and Lee, 1997]. There are already billions of documents on the WWW, and the number grows rapidly. As all of these hypertext documents are dispersed over the Web, it is a time-consuming task to find materials in a particular research area. So it is wise to set up an access point for relevant materials, such as a Computing corpus. (Most documents on the Web are written in the Hypertext Markup Language, HTML for short.)

2.2 Set up Web Data File
It is noted that for Web searching, the common search object is a database [Ford, 1995]. However, this requires a huge storage space for information, and the search process also takes a long time to complete. A special approach based on MIRTH is devised for organizing the database. Within MIRTH, there are two kinds of search objects: virtual corpora [Butler, 1992] and data files. The virtual corpus is dynamic. It differs from traditional corpora in that the contents of a virtual corpus change from time to time [Butler, 1992]. A virtual corpus might not be stored on a user's computer and, furthermore, is potentially subject to unforeseen changes as remote sources are modified without central control. Hypertext resources on the Web are seen as collections of virtual corpora. A Web data file, also called a data set, is a collection of Web pages' Uniform Resource Locators (URLs). It supports the whole process of information retrieval from the Web. To set up Web data files, three issues are discussed:
• Setting up a data file to save space
• Specifying a structure of a data file
• Defining applications for a data file

A Principle of Setting up the Data File
The purpose of creating a data file instead of using the contents of hypertext pages is to save storage space. It is clear that the resources widely available on the Web might be too large to be saved on one machine. One way to serve this purpose is to store only keywords in a data file, together with a few lines of explanation of the content of each document. An example of the data file is given in Fig. 1 as follows:
Intelligent Computer program Solutions WWW Site Introduction of Computer programs
M.S. and Ph.D Computer Graphics Theses This directory contains the ASCII text files for all of the Computer Graphics Thesis and Dissertation Abstracts Compendiums published in Computer
Computer & Associates Computer Based Training Internet Services Web Design Touch Screen Kiosks Electronic Catalogs Java Shock wave Multimedia
Figure 1: A data file example with key words and explanations
When the data file is searched by MIRTH for the key word 'computer', the underlined part of the text (keywords and main explanation) is shown on the screen as the result of the search, as can be seen in Fig. 2.
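The lookup described here can be illustrated with a minimal sketch. It assumes a hypothetical record layout (URL, keywords, explanation) and hypothetical example.org addresses; the paper does not specify MIRTH's actual file format, and the function name `search_data_file` is likewise an assumption. The first two records reuse the text of Fig. 1; the third is invented to show a non-matching entry.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One data-file record (hypothetical layout, not MIRTH's real format)."""
    url: str
    keywords: str
    explanation: str


def search_data_file(entries, query):
    """Return the records whose keywords or explanation contain the query.

    Matching is a case-insensitive substring test; hits are returned in
    data-file order (no ranking is attempted in this sketch).
    """
    q = query.lower()
    return [e for e in entries
            if q in e.keywords.lower() or q in e.explanation.lower()]


# Hypothetical data file: URLs are placeholders, text follows Fig. 1.
DATA_FILE = [
    Entry("http://example.org/ics",
          "Intelligent Computer program Solutions WWW Site",
          "Introduction of Computer programs"),
    Entry("http://example.org/theses",
          "M.S. and Ph.D Computer Graphics Theses",
          "ASCII text files for Computer Graphics Thesis and Dissertation Abstracts"),
    Entry("http://example.org/grammar",
          "Chinese Grammar",
          "Notes on the structure of Chinese words and phrases"),
]

hits = search_data_file(DATA_FILE, "computer")
for hit in hits:
    # Only the keywords and the short explanation are shown to the user.
    print(hit.keywords, "|", hit.explanation)
```

Storing only keywords and a short explanation per URL keeps each record to a line or two, which is the space-saving motivation given above.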
Figure 2: Search Results on the Data File

Data files are application-oriented: their contents depend on users' requirements of a search engine in terms of information retrieval, so they are not always the same. The resulting data files are inevitably unique, as they reflect their creators' personal ideas as well as consideration of their users. Generally speaking, however, there are some common features in designing a data file. For example, the design is influenced by its objectives. The purpose of the MIRTH search engine is to help computer specialists, linguists and people studying literature. So the authors approach the problem of designing data files by collecting resources related to the aims of this group of special users, and have built up as large a computing, linguistics and literature file as was possible in the time available. In other words, the contents of the MIRTH data file introduce general computing knowledge, linguistics theory, and literature materials such as novels, poetry and literature journals (both Chinese and English). The file can also be extended to other topics, for example civil engineering and chemistry, if necessary, but this would involve further manual creation of entries in the data file, and would dilute the subject specificity, thus risking more erroneous 'hits'.

Construct a Data File
The rule of thumb in constructing a data file is to save computer memory. The best way to do this is to choose keywords from an HTML document and convert them into a data file entry, as in the example above. Many automatic search engines rely on a data file to deal with queries. It is important to choose comprehensive keywords to improve the chance of retrieving relevant documents. For the data files, keywords are the words or phrases which reflect the subject of the corresponding home page. The theory behind a data file is simple: users are interested in a particular document if it contains keywords related to their interests. The keywords placed in the data file were selected manually from introductory textbooks in linguistics, computing and literature, guided by the authors' experience in Chinese linguistics and literature.

Application of a Data File
Search results are controlled by data files. In the search procedure, the first step is to key a search item into a query box. Then the engine runs: the search query is passed as a string to the search program by an external program. When the program runs, it starts pattern matching in the data file. If the search finds anywhere from one to ten or more "hits" in the data file, the matched items are ranked by the program automatically before being sent back to the user. If the search item is not found in the data file, nothing is matched, and the user gets no results from the search. This shows the importance of a data file: the data file restricts the application of a search. A comprehensive, high-quality data file is essential to efficient searching. Its structure is influenced by the kind of searches users wish to make. This was a major issue considered during the development of MIRTH.

2.3 Define Query Syntax Rules
Although MIRTH searches pre-computed data files instead of the Web pages themselves, any relevant home pages will eventually be downloaded by users if necessary, so those home pages are of interest as well. Intelligent query support is included in the system to help users search for information more accurately and efficiently [Ford, 1995]. All search engines developed in the literature have their own syntax rules for making queries. Two problems exist with them: (1) in practice, only a few search tools explain their rules explicitly; (2) most common query syntax rules depend on concepts of natural language, such as words, expressions (several words), etc. MIRTH differs from them in that it focuses on linguistics and literature research, so syntax rules supporting specific searching functions are needed. The MIRTH syntax rule definition focuses on some special usages, such as affix search and root search. Details are addressed below.

Motivation of Affix and Root Search
As a general definition, syntax addresses the structure of sentences, but technically it has more meanings. In computing, the term syntax is used widely. Any computer language requires certain syntax rules for its commands and code, ranging from the simple structure of words to the entire program. Moreover, in different situations, syntax has different definitions and different content. In computer program writing, there are many syntax rules that programmers have to follow. For example, in hypertext writing, HTML (HyperText Mark-up Language) requires that all commands be enclosed in angle brackets. The symbols usually appear in pairs: an opening tag is used at the beginning and a closing tag at the end of the same sentence.
It has been mentioned that the main function of MIRTH is linguistic search, and analyzing the special structure of words (phrases) is a very important issue for linguists and language learners [Graham, 1995]. For example, in English most words have a root, and the root can form many derivations: adding a suffix or a prefix to a root creates a new word. Consider 'think', a verb which we can regard as a root. When we add a prefix and a suffix to it, it becomes a new word, such as 'unthinkable'. For language searching, these special functions of prefix, suffix and root search have been considered in the query syntax.

Example of Affix and Root Search

Prefix + (*)
If a user wants to find words which start with the same prefix, he/she should follow this syntax rule when entering the item in the search box: prefix* (without a space before '*'). For a prefix search, the search engine matches all words in its database and provides the information the user is looking for. Suppose you want to search for words beginning with 'dis'. Add an asterisk (*) after the search item without a space, giving dis*, and you would get search results such as: dislike, display, discrete, dismember, discomfort, discredit, discover, discolor, disclose, and disloyal. Fig. 3 illustrates this idea.

Figure 3: Search Prefix "Com"
Fig. 3 shows a search for the prefix "Com". In total, over thirty hits were matched. The corresponding Chinese search can be seen in Fig. 4.

Figure 4: Chinese Engine Search
In this Chinese prefix search, the search item is "中" (central, or middle) + "*". When run, the search engine picked out over fifteen hits containing the prefix "中", such as: 中文 (Chinese), 中国, 中华 (China), 中国文学 (Chinese literature), 中东 (Central Eastern), 中西 (Central Western) and so on.

(*) + Suffix
This syntax rule defines a search to match all words having an identical suffix in the data file. The required input is *suffix, with no space between the asterisk and the suffix. Users only need to enter the search query in the item box with an asterisk (*) in front of the suffix, for example "*ing". The search engine then seeks out the words that include the suffix and automatically picks them up (see Fig. 5, displayed by the Web browser Netscape).

Figure 5: MIRTH Search for "*ing"

(*) + Root + (*)
This query syntax means that the root part of a word is placed between two asterisks, without spaces between them. The search engine then matches all words in the index file that have an identical root.
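The three affix query forms (prefix*, *suffix and *root*) can be sketched as a translation into regular expressions. This is only an illustration under stated assumptions: the function names `query_to_regex` and `affix_search` and the sample word list are hypothetical, and the paper does not say how MIRTH implements its pattern matching.

```python
import re


def query_to_regex(query):
    """Translate an affix query into a compiled regular expression.

    'dis*'    -> words starting with 'dis'  (prefix search)
    '*ing'    -> words ending with 'ing'    (suffix search)
    '*think*' -> words containing 'think'   (root search)
    A query without '*' has to match the whole word.
    """
    leading = query.startswith("*")
    trailing = query.endswith("*")
    core = re.escape(query.strip("*"))
    if not core:
        raise ValueError("query must contain text besides '*'")
    # Anchor the side of the word that has no asterisk.
    pattern = ("" if leading else r"\A") + core + ("" if trailing else r"\Z")
    return re.compile(pattern, re.IGNORECASE)


def affix_search(words, query):
    """Return the words that satisfy the affix query, in their given order."""
    regex = query_to_regex(query)
    return [w for w in words if regex.search(w)]


# Hypothetical index of words; Chinese entries work the same way.
WORDS = ["dislike", "display", "thinking", "unthinkable",
         "compute", "taking", "中文", "中国", "文学"]

print(affix_search(WORDS, "dis*"))
print(affix_search(WORDS, "*ing"))
print(affix_search(WORDS, "*think*"))
print(affix_search(WORDS, "中*"))
```

Because a root query has an asterisk on both sides, it also matches words in which the root happens to come first or last, such as 'thinking' here.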
3. Multilingual Issues in MIRTH

3.1 Chinese Computing Environment
For multilingual information retrieval, most systems, including UNIX and PC systems, operate in the standard English environment, and WWW documents work in this format [Christian, 1988]. For example, Netscape has a Web browser (named Navigator) which displays standard HTML in plain English. When a document in Chinese is on the Web, Netscape Navigator shows the Chinese characters as strange symbols. Without software to support the display of Chinese, these symbols cannot be understood by any user (including Chinese people).
Reading Chinese with a Web browser and setting up a search system for Chinese information retrieval are still topics of debate on the Web [Zhou and Liu, 1997]. In this paper, we present one solution: how to set up a Chinese environment, how to deal with Chinese characters, and how to retrieve information from a Chinese virtual corpus. By default, Chinese GB and BIG5 codes are displayed in the default Netscape font. Some Chinese software can cope with this problem by converting the English computing environment into one which supports both English and Chinese. Once this software is installed, the Chinese codes are converted to readable Chinese characters. Figs. 6 and 7 show the same document file: Fig. 6 is displayed with the Chinese environment (supported by UnionWay), and Fig. 7 is displayed without the Chinese environment.

Figure 6: Display with Chinese environment
Figure 7: Display without Chinese environment
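The "strange symbols" effect can be reproduced by decoding GB-encoded bytes as a Western character set. The sketch below assumes that the `gb2312` and `latin-1` codecs of Python's standard library stand in for the GB code and for the default English environment respectively; it is not a description of how Netscape or UnionWay work internally.

```python
text = "中文"                          # the word "Chinese"
raw = text.encode("gb2312")            # GB code: two bytes per character

# Without a Chinese environment, each byte is shown as one Western symbol.
without_env = raw.decode("latin-1")

# With a Chinese environment, byte pairs are read as Chinese characters.
with_env = raw.decode("gb2312")

print(len(raw), without_env, with_env)
```

The four bytes are identical in both cases; only the interpretation applied by the display software differs.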
3.2 Dealing with Symbolic Chinese Characters
MIRTH provides access to Chinese virtual corpora as well. To understand this procedure, we need to know how the computer deals with symbolic characters. We now discuss issues such as how to convert a symbolic character into digital information, how to store a set of digital characters (as bitmaps), and how to represent a particular Chinese character in the GB protocol.

Chinese Code Protocols
We discuss terms like the GB, HZ, and BIG5 Chinese code protocols used by MIRTH. When we search a Chinese data file in MIRTH, this file is normally written in a kind of code; in other words, the Chinese characters are written in specific codes instead of as graphic characters.

Line-Column Code
One simple protocol is called Line-Column code, which uses a character's line number and column number as its code [Huabei, 1981]. For example, when we search for the word 'big', we key in 2083, where 20 is the line number of 'big' in the Chinese character library and 83 is the column number. This method is not widely used, because keying in a single Chinese character requires typing 4 digits instead of the 2 letters used by GB code.

GB Code (Guo Biao)
GB means national standard, which stands for Chinese Standard for Information Interchange (read Guo Biao in Chinese). It is defined by the People's Republic of China and is widely used in Chinese societies around the world. It is defined by adding 32 to both the line and column numbers of the Line-Column code. Taking the word 'big' as an example, adding 32 to 20 (line number) and 83 (column number) gives 52 and 115. In the ASCII code table, 52 represents '4' and 115 represents 's', so the GB code for 'big' is '4s'. As the minimum line and column number is 1, the minimum value in a GB code is 32+1=33, and the maximum is 32+94=126. GB code therefore lies within the range of the ASCII codes that represent 94 symbols. This means GB code can be used as a standard information interchange code set like ASCII.

3.3 Chinese Information Retrieval System
The structure of the Chinese character set is different from that of English. It has its special characteristics, and it can be displayed by MIRTH on the Web. The whole procedure is described below.

Chinese WWW Servers
How can Chinese documents on the WWW be accessed? The first step is to approach a Chinese Web server. Recently, dozens of Chinese servers have appeared on different platforms on the Web. These include, among others:
• Chinese WEB server (URL: http://darwin.technet.sg/cweb/cstart.html)
• Wen Zhai (also known as Chinese News Digest; URL: http://www.cnd.org)
• The Xian Chinese World Wide Web (http://www.ncb.gov.sg/chinese-web/)
• Chinese Gopher Menu (gopher://sunrise.cc.mcgill.ca/)
• The Chinese Web page (http://agora.leeds.ac.uk/xiaoda/Dcorpus.htm)
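Documents on such servers are stored as GB codes, so it is worth checking mechanically the arithmetic from Section 3.2 that relates Line-Column code to GB code (add 32 to the line number and to the column number). The following is a minimal sketch; the function names are hypothetical and the validation limits simply follow the 1 to 94 range stated in that section.

```python
def line_column_to_gb(code):
    """Convert a 4-digit Line-Column code such as '2083' to a 2-character GB code."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError("Line-Column code must be exactly four digits")
    line, column = int(code[:2]), int(code[2:])
    if not (1 <= line <= 94 and 1 <= column <= 94):
        raise ValueError("line and column numbers must lie in 1..94")
    # Adding 32 moves both numbers into the printable ASCII range 33..126.
    return chr(line + 32) + chr(column + 32)


def gb_to_line_column(gb):
    """Inverse mapping: a 2-character GB code back to its 4-digit Line-Column code."""
    if len(gb) != 2 or not all(33 <= ord(ch) <= 126 for ch in gb):
        raise ValueError("GB code must be two characters in the range 33..126")
    return f"{ord(gb[0]) - 32:02d}{ord(gb[1]) - 32:02d}"


print(line_column_to_gb("2083"))   # the word 'big' from Section 3.2
print(gb_to_line_column("4s"))
```

Every character therefore occupies exactly two code positions, which is the two-byte property that has to be respected when a Chinese data file is searched.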
Retrieving Information from Chinese Data Files
To allow users of MIRTH a wider choice of virtual Chinese corpora, we need to create Chinese data files for linguistics research. The Chinese corpus is collected via a Chinese data file. Chinese search is similar to the English one in MIRTH. The difference is that Chinese codes use two bytes, so when we set up the Chinese data file, we have to consider this special property in choosing the appropriate way to retrieve information.
MIRTH allows the user to input a query in Chinese in terms of keywords or a subject in the linguistics area; the search engine then returns a ranked list of documents in order of relevance. Users can read the documents first, then find their interests and refine the search by marking the documents that have been highlighted. When users find what they want, they can click on it to search further or to capture the information they are looking for at once. To achieve this aim, we have built up our Chinese data files in GB and HZ Chinese codes. People who have the Chinese environment can access them easily.

Special Syntax Definition for Chinese Search
Chinese has its own properties of phrasal structure, and it is much more complicated than English. Primarily, the structure of a Chinese word is very different from that of an English word. We can subdivide English words into parts: suffix, prefix and root in one unit. For example, 'display' is one word, but it comprises two parts: prefix (dis) and root (play); the structure is prefix + root. When this word is translated into Chinese, it is shown as two independent units, and its structure is shown in Figure 8 below:

Figure 8: English "Display" is shown in Chinese

There is another way to differentiate between English and Chinese. Adding "ing" after a verb changes the nature of the word and transforms a verb into a noun in English. For example, 'take' is a verb, while "taking" is a verbal noun and can be an object in a sentence. Chinese does not have this kind of rule; there is no comparable way of changing a verb into a noun in Chinese. From the analysis above, it is easy to see that Chinese grammar is different from English grammar. So a Chinese query syntax is considered in MIRTH. Here are a few examples of the query syntax for a Chinese search.

The syntax rules
The first syntax rule applies when the query item is a keyword in Chinese. Because the major structure of Chinese words and phrases is Root + Root or Root + suffix, the search item should be a phrase or a few words within this structure:
Root + Root, or Root + suffix. This rule is quite useful for linguistics, particularly for learning Chinese grammar, such as the structure of Chinese words and phrases. For instance, if a user's search item is 中 (in English, middle or central), it can be used as a root in the derivation of new phrases, as can be seen in Fig. 9.

Figure 9: Chinese Phrasal Words

The second syntax rule is that "the keywords must include an object, which is a noun (or nouns), in the search item". For instance, searching for 'Chinese' is a standard search pattern: when the search engine gets the query, it searches around the object 'Chinese' and matches results around 'Chinese'.
The Chinese data file is searched for a query using the special rule of pattern search. You define your query as a set of keywords and other qualifiers, such as 'Human Language' or 'Chinese Grammar'. A Boolean-type search then matches the subject 'Language' and the object 'Grammar', and finds a number of items you are interested in from the linguistics data file [Atwell, 1993].

4. Conclusion
In this paper, we have discussed the main hierarchy of MIRTH, a multilingual information retrieval tool (also called a search engine). For this purpose, three pieces of work were done. Firstly, particular sources were "linked" together as its "Virtual Corpora", containing separate topics such as computing, linguistics and language, and Chinese literature. Secondly, an example of a multilingual environment was created with the help of UnionWay. Some of these corpora are multilingual sources, which mainly present materials in English and Chinese. Thirdly, an automatic tool to retrieve information from the multilingual corpora was set up. Some improvements for further work on MIRTH are:
(1) As already stated, this research has set up its own data files, but they are not big enough to hold complete information resources in a particular research area. Installing more data materials is therefore an essential task before MIRTH can be improved for use in a real situation.
(2) As a complete linguistics tool, more techniques should be added, such as tagging, parsing, and analyzing the structure of a sentence [Zhou et al., 1998], [Zhou and Liu, 1997].
(3) Maintenance is an issue for the system [Liu, 1996]. A data file management system needs to be created to manipulate data files more efficiently and accurately. This management system will perform tasks such as adding, inserting, deleting, updating, replacing and sorting links with their key words.

References
Atwell, Eric 1993. Knowledge at Work in Universities, Leeds University Press.
Butler, Christopher S. 1992. Computers and Written Texts, Biddles Ltd, Guildford, Surrey.
Christian, Kaare 1988. The Unix Operating System, John Wiley & Sons, Inc.
Ford, Andrew 1995. Spinning the Web, International Thomson Publishing.
Graham, Ian 1995. The HTML Sourcebook, John Wiley & Sons.
Gilster, Paul 1996. Finding it on the Internet, John Wiley & Sons, Inc.
Huabei 1981. Huabei Computing Institute, Chinese National Standard: A Collection of Chinese Character Codes for Information Exchanging, China Standard Press House.
Liu, N.K. 1996. Formal verification of some potential contradictions in knowledge base using a High Level Net approach, Applied Intelligence, 6(4):325-344.
Liu, J. and Lee, YK. 1997. Development of a Chinese Extraction System. In Proceedings of International Conference on Computer Processing of Oriental Languages, April 2-4, 1997, Hong Kong.
Zhou, L. and Liu, J. 1997. "An efficient algorithm for bilingual word translation acquisition", in the 2nd Workshop on Multilinguality in Software Industry: The AI Contribution (MULSAIC'97) of the International Joint Conference on Artificial Intelligence (IJCAI-97), August 23-29, 1997, Nagoya, Japan.
Zhou, L., Liu, J. and Yu, S.E. 1998. "Study and implementation of combined techniques for automatic extraction of word translation pairs: An analysis of the contributions of word heuristics to a statistical method", to appear in International Journal on Computer Processing of Oriental Languages.