
Proceedings of ISCIT 2005

Study and Design of Chinese Concept-Based Search Engine

Hong Zhang 1, Yanhong Ma 2, Qiuyu Zhang, Pengshou Xie 1
1. College of Computer and Communication, Lanzhou University of Technology, Lanzhou, China.
2. Gansu Electric Power Information & Communication Company, Lanzhou, China.
Tel: +86-0931-2976093 E-mail: [email protected]

transport protocol, the Robot traverses the WWW space, following every hyperlink in each Webpage, to collect Webpage information and store it in the Webpage database, where it can then be analyzed and processed. Compared with other Robots, this Robot can detect dead links and find newly added links, so it stays synchronized with Internet resources. The preliminary URLs are obtained in two ways: the Robot collects them itself at regular intervals, or users submit them.

B. Indexer
The indexer's goal is to build the Web index database from which the search module retrieves results. The index database is the soul of a search engine¹: in many respects it is the index database that determines the quality of the search engine. The design of the indexer is therefore pivotal. To realize this principle, two measures are introduced: a knowledge database technique and a new weighting algorithm. Because the knowledge database holds a large amount of syntactic, semantic and lexical knowledge, common sense, language material, a word database, a statistical table for reverse words frequency (STRWF) and so on, sentences and words are segmented more exactly than before and the segmented words are more expressive, so they better represent the meaning of the Webpage. Generally, an indexer builds a Webpage index record by automatically extracting characteristic information or labels that express the Webpage's theme, such as the Webpage title, Web address, hyperlinks, personal names, organization names, place names and the leading words of the Webpage.
For example, the weighting scheme adopted by the AltaVista Website is shown in Table 1. Table 1 shows that AltaVista considers only the 'title' tag and ignores all other HTML tags. Obviously, this weighting scheme is one-sided. Compared with AltaVista,

Abstract- This paper proposes a new kind of Chinese concept-based search engine and presents its theoretical model, working mechanism and design procedure. Its kernel is a knowledge database and a new weighting algorithm for computing the weights of HTML tags. Using these two techniques improves both the exactness of the index database and the accuracy of users' queries, so the precision ratio and recall ratio of the search engine are improved substantially.
Keywords- search engine; concept; weighting; intelligence; knowledge database.

I. INTRODUCTION
The rapid development and wide popularization of the Internet are driving search engines to evolve rapidly. But most search engines are based on keywords; that is, they cannot distinguish homographs and cannot associate keywords with their synonyms. Search engines, especially Chinese search engines, have therefore become a new field of research and development because of the complexity of Chinese semantics. In this context, the paper puts forward a new kind of Chinese intelligent search engine based on concepts [1] and a knowledge database [2]. It can distinguish homographs, associate keywords with their synonyms, and discard high-frequency words that occur often but carry little meaning and would waste a lot of storage space. Moreover, it uses a new algorithm for computing the weights of HTML tags. This algorithm considers all kinds of tags, for instance TITLE, H, P and B, and the weight of each tag: the more important a tag, the higher its weight. These weights were obtained from many experiments and from theoretical foundations. The performance of the search engine is therefore greatly enhanced.

II. THEORY MODEL
In theory, this search engine can be divided into three parts: the Robot, the indexer and the searcher. Each part is shown in Fig. 1.

A. Robot
Internet search software is usually called a Web spider, Crawler or Robot; in this paper we call it the Robot. Starting with a preliminary URL table and utilizing the standard

0-7803-9538-7/05/$20.00©2005 IEEE

1 This work is supported by the National Key Technologies R&D Program, China (No. 2001BA201A32) and the scientific research fund project of Lanzhou University of Technology (No. SB20200405).


this system adopts a different weighting algorithm that takes both tag type and tag location into account. Different tags have different weights: for instance, the weight of TITLE is 8, the weight of H1 is 6, and so on. Furthermore, this system takes the weight in the STRWF into account. Thus a word's weight genuinely reflects its significance in the Webpage. In these two ways, the quality of the index database is raised and, consequently, the search engine's performance is improved.
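The tag-type part of this weighting can be sketched in Java as follows. This is a minimal illustration, not the system's actual implementation: only the weights TITLE = 8 and H1 = 6 come from the text, while the weights for B and P, the class name TagWeighting and the method weightOf are assumptions made for the example. The sketch ignores tag location and the STRWF weight, and it only handles simple, non-nested elements.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagWeighting {
    // Only TITLE = 8 and H1 = 6 are taken from the paper; B and P are assumed placeholders.
    static final Map<String, Integer> TAG_WEIGHTS = new HashMap<>();
    static {
        TAG_WEIGHTS.put("TITLE", 8);
        TAG_WEIGHTS.put("H1", 6);
        TAG_WEIGHTS.put("B", 2); // assumption
        TAG_WEIGHTS.put("P", 1); // assumption
    }

    // Matches <TAG ...>text</TAG> for simple elements that contain no nested tags.
    static final Pattern ELEMENT =
            Pattern.compile("<(\\w+)[^>]*>([^<]*)</\\1>", Pattern.CASE_INSENSITIVE);

    // Sums (tag weight x occurrences of the term inside that tag) over all matched elements.
    static int weightOf(String html, String term) {
        int total = 0;
        Matcher m = ELEMENT.matcher(html);
        while (m.find()) {
            String tag = m.group(1).toUpperCase();
            int tagWeight = TAG_WEIGHTS.getOrDefault(tag, 0);
            total += tagWeight * countOccurrences(m.group(2), term);
        }
        return total;
    }

    // Counts non-overlapping occurrences of term in text.
    static int countOccurrences(String text, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + term.length())) {
            count++;
        }
        return count;
    }
}
```

With these assumed weights, a word that appears once in the title and once in an H1 heading scores 8 + 6 = 14, whereas a scheme that looks only at the title would score it the same as a word that never appears in a heading.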

TABLE 1 ALTAVISTA WEIGHTING STRATEGY

Fig. 1. Theory Model of Chinese Concept-based Search Engine

Because hardware resources were limited, Webpage files are not stored in the Webpage database; only their URLs are stored. This saves a great deal of storage space and makes it possible to develop the project in a resource-limited environment.

B. Indexer
The design procedure of the indexer is shown in Fig. 2. Each step is detailed below.

C. Searcher
The searcher retrieves from the index database the Webpages that satisfy the user. Speed and accuracy are the searcher's guiding goals; a friendly interface is another requirement. Like the indexer, it requires lexical and syntactical analysis, sentence and word segmentation, and so on. In addition, it performs synonym expansion: when the user inputs a word, the searcher returns Webpages that include not only that word but also its synonyms. This further improves the completeness and precision of the search engine.

III. SYSTEM DESIGN [3]
This system uses Java as its design tool. Java is a widely used network programming language that is portable, secure and stable. The detailed design of each part is as follows.

A. Robot [4]
The Robot was designed with the BOT package in Java. The main classes that implement this function are Spider, SpiderInternalWorkload, SpiderWorker, SpiderDone and SpiderSQLWorkload. The main interfaces are SpiderReportable and IWorkloadStorable.
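The Robot's traversal (start from a preliminary URL table, follow hyperlinks, report dead links) can be sketched independently of the BOT package. The following is a hedged illustration only: it does not use or reproduce the BOT package's API, and the names RobotSketch, CrawlResult and crawl are hypothetical. To stay self-contained it replaces network access with an in-memory map from each URL to its outgoing links, and treats a URL missing from the map as a dead link.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RobotSketch {
    /** Outcome of one crawl: pages that were reached and link targets that could not be fetched. */
    static final class CrawlResult {
        final Set<String> visited = new LinkedHashSet<>();
        final Set<String> deadLinks = new LinkedHashSet<>();
    }

    // Breadth-first traversal from the seed URLs.
    // 'web' stands in for the network (an assumption of this sketch):
    // it maps a URL to the hyperlinks found on that page; an absent URL counts as dead.
    static CrawlResult crawl(List<String> seeds, Map<String, List<String>> web) {
        CrawlResult result = new CrawlResult();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> seen = new LinkedHashSet<>(seeds);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            List<String> links = web.get(url);
            if (links == null) {
                result.deadLinks.add(url); // target does not exist
                continue;
            }
            result.visited.add(url);
            for (String link : links) {
                if (seen.add(link)) {      // enqueue only newly discovered links
                    frontier.add(link);
                }
            }
        }
        return result;
    }
}
```

Re-running crawl over an updated map picks up newly added links and reports targets that have disappeared, which corresponds to the Robot's two tasks of finding new links and discovering dead ones.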

Fig. 2. The structure of the indexer

1). Download Webpage file
This part mainly uses the net package and URL communication in Java. It gets URLs from the Webpage database. For every URL, it creates objects of the classes URL,


which is different from any other Website's weighting method. The details are as follows. It is based on the most influential weighting formula, tf * idf, where tf is short for term frequency and idf for inverse document frequency. Furthermore, word position and tag type are considered; that is, the weight is calculated according to Table 2. Then the weight of word k_i in Webpage file d_j is

URLConnection and HttpURLConnection. Then, through the connect method of the URLConnection object, it connects to the Web server, fetches the Webpage file and reads it into a buffer for the next step.

2). Preprocess Webpage file
The HTML tags in a Webpage file must be identified, but many Webpage files are nonstandard: for example, some tags are written in capital letters and others in lowercase letters. Because Java uses the 16-bit Unicode encoding and is case sensitive, this step converts all tags to capital letters and removes the spaces inside HTML tags such as < B>. The main objects in this part are String and StringBuffer.

3). Segment sentences
Segmenting sentences means picking out Chinese strings from the preprocessed Webpage files according to HTML tag types and Chinese punctuation. These Chinese strings do not include Chinese punctuation. Latin characters and Chinese punctuation lie in the following ranges, given in hexadecimal:
Basic Latin: 0000...007F;
Latin supplement: 0080...024F;
Chinese punctuation: 3000...303F and FF00...FFFF.
So only by the following logical expression, chineseInt > 0x024F && !((chineseInt >= 0x3000 && chineseInt <= 0x303F) || (chineseInt >= 0xFF00 && chineseInt … = 0x2000 && chineseInt