Taxonomy System for Intelligent Library Search System

Marcelino Ponty, Lukas
Engineering Faculty, Department of Electrical Engineering
Atma Jaya Catholic University of Indonesia
Jl. Jenderal Sudirman 51, Jakarta, INA, 12930
tel +62-21-5708826

Today, most libraries use a computerized system to make searching easier, as does the library of Atma Jaya Catholic University of Indonesia, whose library information system, AtmaLib, has a search engine as one of its features. Observation of how the AtmaLib search engine is used shows that its search results are sometimes inaccurate or irrelevant, most often because the user gives the search engine insufficient information. To address this problem, a taxonomy system based on the Dewey Decimal Classification is implemented in the library application to classify each library collection into a certain knowledge field. User actions on library collections are recorded and used to build a personal user taxonomy that represents each user's interest in certain knowledge fields. This research concludes that combining the user's taxonomy information with the collections' knowledge field information can increase the relevance of the search results.

Keywords: Taxonomy, personalized library information system, Dewey Decimal Classification, intelligent search

I. INTRODUCTION

A search system has always been one of the main features of a library. Today, most libraries use a computerized system to make searching easier, as does Atma Jaya Catholic University of Indonesia, whose library information system, AtmaLib, has a search engine as one of its features. Although replacing the librarian with a search engine can make searching easier, a search engine can be puzzling for the user if it lacks a good information processing engine. A search engine has to understand what users need, despite their different manners and querying skills. Most simple search engines provide only one-way communication, so they must infer what the user needs from limited input alone. To overcome this problem, a search engine can be given the following capabilities to understand the user better [2]:
1. Able to build a knowledge base about each user that is continually updated according to the user's actions.
2. Able to process the information about each user to understand that user's needs.
3. Able to process text more flexibly, for example by handling mistyped text and filtering common words.

The main purpose of the system built in this research is to give the search engine the first two capabilities by implementing taxonomy system concepts. The taxonomy system used in this research is the Dewey Decimal Classification (DDC) system, a knowledge field classification system that is widely used to determine a collection's call number, as it is in the AtmaLib library information system. The taxonomy system is used to build a knowledge base of each user's interest in certain knowledge fields, based on his/her history of actions on library collections. By processing this interest information, the system can determine which documents a given user is likely to prefer, based on their knowledge fields, so that those documents can be ranked higher than others. The system is further supported by the text processing features provided by Ferret, a search library for Ruby.

This research has two limitations. First, the library must use the DDC system to determine its collections' call numbers. Second, every process that uses call numbers is relevant only for call numbers that follow the DDC system.

II. RELATED WORKS

A. Taxonomy

Taxonomy is the science of classification. A taxonomy system classifies things and represents them in a hierarchical structure. Relations between things are described as parent-child relationships. In an even wider sense, the term taxonomy can also be applied to relationship schemes other than parent-child hierarchies, such as network structures with other types of relationships. Taxonomies may then include single children with multiple parents; for example, "Car" might appear under both parents "Vehicle" and "Steel Mechanisms". To some, however, this merely means that "Car" is a part of several different taxonomies [3].

B. Dewey Decimal Classification

Of modern library classification schemes, the Dewey Decimal Classification (DDC) is both the oldest and the most widely used in the United States [7]. The DDC system arranges knowledge into ten main classes; each class is divided into ten divisions, and each division is divided into ten sections [5]. These classes are used to determine a numeric representation of a knowledge field. However, the notation can be expanded indefinitely to describe knowledge more specifically. The more specific the work being classified, the longer the number combination grows [7]. Fig. 1 shows an example of the DDC number of a book.

600 Technology
620 Engineering and Applied Operations
621 Applied Physics
621.2 Hydraulic Power Technology

621.2 NEK

Fig. 1. An example of DDC number

C. Ferret

Ferret is a library for the Ruby programming language that allows developers to integrate a search engine into a Ruby application. It is inspired by the Apache Lucene Java project [1]. Ferret uses full-text indexed searching, and one of its features is document weighting with the TF-IDF algorithm to find the documents most relevant to the input. Another feature of Ferret is its use of the Levenshtein algorithm, which gives a customizable tolerance to mistyped input.

D. TF-IDF Algorithm

The TF-IDF algorithm calculates the relevance of each document in a corpus to the query input. TF-IDF works by determining the relative frequency of words in a specific document compared to the inverse proportion of that word over the entire document corpus. Intuitively, this calculation determines how relevant a given word is in a particular document. Words that are common in a single document or a small group of documents tend to have higher TF-IDF numbers than common words such as articles and prepositions [6]. The formal procedure for implementing TF-IDF differs slightly between applications, but the overall approach works as follows. Given a document collection D, a word w, and an individual document d ∈ D, we calculate

$$w_d = f_{w,d} \times \log\left(\frac{|D|}{f_{w,D}}\right) \qquad (1)$$

where $f_{w,d}$ equals the number of times w appears in d, |D| is the size of the corpus, and $f_{w,D}$ equals the number of documents in D in which w appears [6]. With this calculation, the score of a document tends to be higher the more often the query word occurs in it. However, the amount that each occurrence adds to the score depends on the number of documents in the corpus that contain the query word: the more documents contain the query word, the smaller the addition.

III. SYSTEM DESIGN

The intelligent library search system consists of three main components: the main taxonomy system, the user taxonomy system, and the intelligent search system. These components are related to each other. The main taxonomy system defines the taxonomy concept used in the user taxonomy system and in the collection knowledge field classification system. The intelligent search system uses the user taxonomy system and the collection knowledge field information to perform the search.

A. Main Taxonomy System

The main taxonomy system is built according to the Dewey Decimal Classification (DDC) system. A table containing a four-digit DDC list [4] is imported into the database, along with the description of each number. An additional record that does not exist in DDC, 'root', is added to the database table so that every class, including each main class, has a parent node. A tree structure is constructed to form the taxonomy system, with every DDC record in the database table represented as a node. To build the tree, several tree functions are defined. Basic tree functions, such as finding the depth of a node, the parent of a node, the sibling(s) of a node, and the child(ren) of a node, are used to form the relationships between nodes. By processing the collections' call numbers, which represent DDC knowledge fields, the taxonomy system can be used to classify collections into certain knowledge fields.

B. User Taxonomy System

The purpose of the user taxonomy system is to provide the search engine with information about the user's interest in certain knowledge fields. The user taxonomy is built from the history of the user's actions on the collections, so that every user has a unique taxonomy profile. So that the search engine can use it compactly, the information about a user's interests is also represented as a tree with DDC knowledge fields as its nodes. A weight is associated with each node, and this weight determines the score of documents in the corresponding knowledge field.
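The basic tree functions named for the main taxonomy (parent, ancestors, depth) can be derived from the DDC number itself. The following Ruby sketch shows one possible way to do this. It is an illustration and not the AtmaLib implementation: the function names are hypothetical, nodes are assumed to be three-digit DDC strings plus 'root', and the decimal part and author code of a call number are simply discarded.

```ruby
# Minimal sketch of tree helpers over three-digit DDC numbers.
# Assumption: nodes are strings such as "621"; "root" is the extra top node.
ROOT = "root"

# Reduce a call number such as "621.2 NEK" to its three-digit section "621".
# Returns nil when the call number does not start with three digits.
def ddc_section(call_number)
  match = call_number.strip.match(/\A(\d{3})/)
  match && match[1]
end

# Parent of a node: section -> division -> main class -> root.
def ddc_parent(node)
  return nil if node == ROOT
  return ROOT if node.end_with?("00")            # main class, e.g. "600"
  return node[0] + "00" if node.end_with?("0")   # division, e.g. "620"
  node[0, 2] + "0"                               # section, e.g. "621"
end

# All ancestors of a node, closest first, ending with "root".
def ddc_ancestors(node)
  result = []
  current = ddc_parent(node)
  while current
    result << current
    current = ddc_parent(current)
  end
  result
end

# Depth of a node: "root" is 0, a main class is 1, and so on.
def ddc_depth(node)
  ddc_ancestors(node).length
end

p ddc_ancestors(ddc_section("621.2 NEK"))  # => ["620", "600", "root"]
```

One limitation of this simplification is that a number such as "100" denotes both a main class and its first division in DDC; the sketch treats the two as a single node.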
The user taxonomy is stored as a hash, one of the object types of the Ruby language, in the following format:

User taxonomy = {node1 => {:hit => hit1, :life => life1}, node2 => {:hit => hit2, :life => life2}, ..., noden => {:hit => hitn, :life => lifen}}

where noden represents a knowledge field in the form of a DDC number, hitn is the basic weight of noden, and lifen is the lifetime of noden.

The existence of nodes follows the rules of the DDC taxonomy system and of the tree data structure. If a node exists in the user taxonomy, then all of its ancestors must also exist in the user taxonomy. The only node that has no ancestors is 'root'. Each node is unique and may appear only once. A node can be added or deleted under certain conditions. Accessing a node also means accessing its ancestors.

The hit parameter represents the basic weight of a node. The hit value can be increased or reduced, automatically or manually, according to the user's actions. The hit of a node, combined with the hits of all of its ancestors, determines the weight of every collection that has the same knowledge field as the node. The hit value of a node is always equal to or greater than the total hit value of all of its child nodes.

The life parameter, which represents the lifetime of a node, is used in the automatic pruning process. Pruning is provided to keep the ever-growing user taxonomy relevant. The life value of a node is reduced step by step every time the user taxonomy is accessed, and when it reaches zero the node is pruned. The life parameter used in this research has an initial and maximum value of 3, and every reduction decreases the value by one.

The manipulation of the user taxonomy is one of the main concerns of this research. It can be divided into two kinds of process: the addition process and the pruning process. Both processes are normally carried out automatically. However, for the user's convenience, the user is given the option to customize his/her taxonomy tree manually.

The addition process refers to the placement of a new node into the user taxonomy, or to adding hit value to an existing node in the user taxonomy. The node that is added and the hit value that is added depend on the action of the user. Every addition on a node resets the life of the node to its initial value. An addition also affects the ancestors of the node: increasing the hit value of a node increases the hit values of its ancestor nodes as well. Likewise, accessing a node means accessing all of its ancestor nodes, which resets the life values of all of those ancestors to their initial value.

In automatic addition, the addition of a node is triggered when a user performs certain actions on a collection. If a user views the details of a collection, the node that represents the knowledge field of the collection is added to the user taxonomy, and the hit value of the node is increased by one. If a user borrows a collection, the same process takes place.
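The addition rules can be sketched in Ruby on top of the hash format given for the user taxonomy. This is a minimal sketch under stated assumptions, not the code used in this research: the names `path_to`, `add_node!`, `INITIAL_LIFE` and `ACTION_HITS` are hypothetical, nodes are assumed to be three-digit DDC strings plus 'root', and the hit increments are the ones used in this research (one for viewing, five for borrowing).

```ruby
# Sketch of the user taxonomy hash and the automatic addition process.
# Assumptions: nodes are three-digit DDC strings plus "root"; all names
# below are illustrative and do not come from the paper.
INITIAL_LIFE = 3                        # initial and maximum life value
ACTION_HITS  = { view: 1, borrow: 5 }   # hit value added per user action

# Path from "root" down to the node, e.g. "121" -> ["root", "100", "120", "121"].
def path_to(node)
  return ["root"] if node == "root"
  # uniq removes repeated levels for numbers such as "100" or "120"
  ["root", node[0] + "00", node[0, 2] + "0", node].uniq
end

# Add a node (or more hit value to an existing node) and propagate the
# change to every ancestor, so that ancestors always exist in the hash.
def add_node!(taxonomy, node, action)
  amount = ACTION_HITS.fetch(action)
  path_to(node).each do |n|
    entry = (taxonomy[n] ||= { hit: 0, life: INITIAL_LIFE })
    entry[:hit] += amount          # ancestors gain the same hit value
    entry[:life] = INITIAL_LIFE    # accessing a node resets its life
  end
  taxonomy
end

taxonomy = {}
add_node!(taxonomy, "121", :borrow)   # borrow a collection in field 121
add_node!(taxonomy, "150", :view)     # view the details of one in field 150
p taxonomy
```

Because every addition is applied along the whole path from 'root', the hit value of a node in this sketch can never fall below the total hit value of its children, which is the invariant stated for the hit parameter.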
For a borrowed collection, the hit value is increased by five. The addition process runs in the background while the user browses the library application. An example of the automatic addition process is shown in Fig. 1.

The pruning process reduces the hit value of an existing node in the user taxonomy, or deletes a node from the user taxonomy. In automatic pruning, the life parameter of a node determines whether the node needs to be pruned. When the life value of a node reaches zero, its hit value is reduced by a certain amount. In this research, the reduction is half of the current hit value of the node. The hit value of a node in turn determines whether the node needs to be deleted: if the hit of a node is below a given limit, the node is deleted from the user taxonomy. The time at which automatic pruning is carried out can be set according to the needs of the system. In this research, automatic pruning is triggered the first time in a day that a user performs an action that causes an automatic addition of a node. This restriction keeps the pruning process from happening too often. An example of the automatic pruning process is shown in Fig. 2.

Manual manipulation may be added to the application to give users the freedom to set their own taxonomy, including the values in the tree. A manual manipulation feature may also be designed to let users customize the manipulation process while still maintaining the basic behaviour of the tree and of the weighting algorithm. One example of such a feature is an option that decides whether a manual change to a node affects its ancestor nodes.

Fig. 1. Example of automatic node addition into user taxonomy (action: borrowing a collection; collection call number: 121.04 BUR)

Fig. 2. Example of automatic pruning of user taxonomy

C. Intelligent Search System

The search system employs the Ferret search engine. The TF-IDF value produced by Ferret is combined with the knowledge field score of the document to give the total value of each document. The search method that was devised has the following search parameters:
1. Terms: the term(s) to be searched for in the index. This parameter is used to calculate the relevance between a document and the user's input term(s) using the TF-IDF formula.
2. Terms must exist: term(s) that must exist within the document. A document that does not contain any of the terms in this parameter is not displayed in the search results.
3. Terms must not exist: term(s) that must not exist in the document. A document that contains at least one of the terms in this parameter is not displayed in the search results.
4. Terms match, terms must exist match, terms must not exist match: the matching tolerance, defined with the Levenshtein algorithm, between the terms of the corresponding parameter and the terms in the index. An index term that does not exceed the tolerance limit when compared with a term from the parameter is considered relevant to that parameter.
5. Knowledge field: defines a certain knowledge field as the search domain.
6. Use taxonomy: decides whether the user taxonomy is used in the search.

The first four parameters are considered ordinary search parameters. The last two, the knowledge field and the taxonomy option, are considered smart search parameters.

The use of the taxonomy affects the ranking of documents in the search. Without the taxonomy, the rank of a document is affected only by the TF-IDF score generated by the Ferret search engine. With the taxonomy, the TF-IDF score is combined with the knowledge field score of the document, which affects the rank of the document.

In the search process, the knowledge field score of a document is obtained from a further calculation on the user taxonomy. As mentioned before, the hit value of a node in the user taxonomy, together with the hit values of its ancestors, affects the weight of every collection that has the same knowledge field as the node. Equation (2) is used to calculate the score of the knowledge field represented by a node, based on the user taxonomy.

$$WT_{\text{node}} = \text{Hit}_{\text{node}} + \sum \text{Hit}_{\text{ancestors}} - \text{Hit}_{\text{root}} \qquad (2)$$

$WT_{\text{node}}$ is the score of the knowledge field represented by the node, $\text{Hit}_{\text{node}}$ is the hit value of the node, $\text{Hit}_{\text{ancestors}}$ is the hit value of each ancestor of the node, and $\text{Hit}_{\text{root}}$ is the hit value of the root. After the knowledge field scores of all nodes have been calculated, each score is normalized by dividing it by the maximum knowledge field score in the user taxonomy. The purpose of the normalization is to give the knowledge field score the same range as the TF-IDF score. The knowledge field score obtained for each node is then used to determine the knowledge field score of a document. A document whose knowledge field exists in the user taxonomy has the same score as the node that represents that knowledge field. If the knowledge field of the document does not exist in the user taxonomy, its score is taken from the closest ancestor node that does exist in the user taxonomy. The knowledge field score is then added to the TF-IDF value of the document from the search to obtain the total value, as shown in Equation (3).

$$\text{score}_{\text{total}} = \text{score}_{\text{TF-IDF}} + \left(n \times \text{score}_{\text{knowledgefield}}\right) \qquad (3)$$

where $\text{score}_{\text{total}}$ is the total score of the document, $\text{score}_{\text{TF-IDF}}$ is the TF-IDF score of the document, $\text{score}_{\text{knowledgefield}}$ is the knowledge field score of the document, and n is the taxonomy coefficient that determines the strength of the taxonomy. The larger the taxonomy coefficient, the more likely it is that documents relevant to the user taxonomy are ranked higher.
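Equations (2) and (3) can be traced on the hit values of Table II. The Ruby sketch below is an illustration under stated assumptions rather than the code used in this research: the function names are hypothetical, nodes are assumed to be three-digit DDC strings plus 'root', and the two sample documents, their TF-IDF scores and the coefficient n = 0.5 are invented for the example.

```ruby
# Sketch of Equations (2) and (3). Hit values are those of Table II;
# function names, sample documents and n = 0.5 are assumptions.
TABLE_II_HITS = {
  "root" => 9.0, "100" => 5.0, "150" => 5.0,
  "600" => 4.0, "610" => 4.0, "619" => 1.0
}.freeze

# The node followed by its ancestors, closest first, ending with "root".
def chain(node)
  return ["root"] if node == "root"
  [node, node[0, 2] + "0", node[0] + "00", "root"].uniq
end

# Equation (2): Hit_node + sum of Hit_ancestors - Hit_root.
# The chain contains the node and all ancestors (root included).
def raw_score(hits, node)
  chain(node).sum { |n| hits.fetch(n) } - hits.fetch("root")
end

# Normalize every node score by the maximum score in the taxonomy.
def knowledge_field_scores(hits)
  raw = hits.keys.map { |n| [n, raw_score(hits, n)] }.to_h
  max = raw.values.max
  raw.transform_values { |v| v / max }
end

# Score of a document's field: its own node if present in the taxonomy,
# otherwise the closest ancestor that exists ("root" always exists).
def field_score(scores, ddc)
  scores.fetch(chain(ddc).find { |n| scores.key?(n) })
end

# Equation (3): total = TF-IDF score + n * knowledge field score.
def total_score(tfidf, scores, ddc, n)
  tfidf + n * field_score(scores, ddc)
end

scores = knowledge_field_scores(TABLE_II_HITS)
docs = [
  { title: "A", ddc: "420", tfidf: 0.6 },   # hypothetical document
  { title: "B", ddc: "152", tfidf: 0.3 }    # hypothetical document
]
ranked = docs.sort_by { |d| -total_score(d[:tfidf], scores, d[:ddc], 0.5) }
p ranked.map { |d| d[:title] }  # => ["B", "A"]
```

With the psychology taxonomy of Table II, document B (field 152, scored through its closest existing ancestor 150) overtakes document A even though A has the higher TF-IDF score. Setting n to 0 reproduces the plain TF-IDF ranking.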

After the total value of every document has been calculated, the documents are sorted by total value.

IV. RESULTS

The intelligent search system is tested using manually built user taxonomies. Two taxonomies with different knowledge field tendencies are used, and the search is carried out with the same input parameter for each. The search results obtained with the first taxonomy, with the second taxonomy, and without a taxonomy are then compared. The data are taken from the AtmaLib collection database. The input parameter used in the test is 'test', and the taxonomies used are shown in Table I and Table II. Table I shows an example of a user taxonomy with interests in the computer and exact science knowledge fields, while Table II shows an example of a user taxonomy with interests in the psychology knowledge field.

TABLE I
USER TAXONOMY TREE WITH COMPUTER AND EXACT SCIENCE DOMAIN

DDC number – knowledge field                  Hit value   Knowledge field score
Root                                          19.0        0.00
000 – Generalities                            10.0        0.40
004 – Data processing, computer science        5.0        1.00
005 – Computer programming, program, data      5.0        1.00
600 – Technology                               9.0        0.36
620 – Engineering and applied operation        9.0        0.72

TABLE II
USER TAXONOMY TREE WITH PSYCHOLOGY DOMAIN

DDC number – knowledge field                  Hit value   Knowledge field score
Root                                           9.0        0.00
100 – Philosophy and psychology                5.0        0.50
150 – Psychology                               5.0        1.00
600 – Technology                               4.0        0.40
610 – Medical                                  4.0        0.80
619 – Experimental medicine                    1.0        0.90

The results of the searches are presented as radar charts in Fig. 3. The charts present the frequency of documents in each knowledge field in the search results. The knowledge fields presented in the radar charts are limited to the main divisions of the DDC. Only the knowledge fields that appear in the top fifty search results of the three searches are shown as axes. The knowledge fields shown are 000 – Generalities, 150 – Psychology, 310 – General statistics, 370 – Education, 420 – English and Old English, 510 – Mathematics, 610 – Medical science, 620 – Engineering and applied operation, and 650 – Management and auxiliary service. An additional axis, 'others', is added because the search also returned some documents that do not use the DDC number system. The left charts in Fig. 3 show the knowledge field frequency of the documents in the top twenty search results, while the right charts show the knowledge field frequency in the top fifty search results. The two top charts are obtained from the search without a taxonomy. The two middle charts are obtained from the search using the user taxonomy shown in Table I, while the two bottom charts are obtained from the search using the user taxonomy shown in Table II.

Fig. 3. Radar charts of search results documents knowledge field frequency

The top charts show that, when only the TF-IDF score is used to sort the search results, documents in knowledge field 420 – English and Old English and documents without a DDC number are ranked higher in the search results. Documents in knowledge field 150 – Psychology also appear, but their frequency is not significant. This means that a user looking for documents in other domains would likely need to go through more than fifty collections to find them in the search results.

The middle charts show the results of the search using the user taxonomy tree whose domain is computer and exact science. The results show that, given the same input query 'test', using the user taxonomy in the search ranks documents in the knowledge fields of the user taxonomy higher. In the top twenty search results, half of the results are documents in the 000 – Generalities knowledge field, which includes the computer science knowledge fields (004, 005). In the top fifty search results, documents from knowledge field 420 – English and Old English and documents that are not numbered using the DDC system, which have high TF-IDF scores, also appear. This shows that documents in the knowledge fields of the user taxonomy get more priority in the top results.

The bottom charts show the results of the search using the user taxonomy tree whose domain is psychology. Documents from knowledge field 150 – Psychology dominate the search results, at least up to the top fifty. This should be the search result wanted by users whose domain is psychology.

V. CONCLUSIONS AND FUTURE WORKS

This research concludes that by implementing a taxonomy system on the document collections, combined with user taxonomy information that represents the user's knowledge field interests, the spread of knowledge fields in the search results can be narrowed. The system demonstrates a search process that is affected by user taxonomy information. With good documentation and the ability for users to customize their own taxonomy, users have more options to control the search results than textual input alone. Future work on this system may include developing smarter, multiple knowledge field document classification, and adding the ability to translate numbering systems other than DDC into the taxonomy system that has been built. Other systems that use the taxonomy can also be built, for example a new collection alerting system based on the user's interests, or a data mining system with knowledge field taxonomy information as its subject.

REFERENCES

[1] Balmain, David. 2008. Ferret: Indexed Searching for Ruby Application. California: O'Reilly Media, Inc.
[2] Chowdhury, G. G. 2004. Introduction to Modern Information Retrieval, 2nd ed. London: Facet Publishing.
[3] Jackson, Joab. 2004. Taxonomy's not just design, it's an art. Government Computer News, Vol. 23, Issue 3 (http://gcn.com/Articles/2004/02/03/Taxonomys-not-just-design-its-an-art.aspx?Page=1, accessed 8 January 2009).
[4] OCLC Online Computer Library Center, Inc. 1989. Three and Four Digit Headings from the Abridged Dewey Decimal Classification (http://library.tedankara.k12.tr/dewey/4-digit_DDC.html, accessed 8 January 2009).
[5] OCLC Online Computer Library Center, Inc. 2003. Summaries: DDC Dewey Decimal Classification. Ohio: Online Computer Library Center.
[6] Ramos, Juan. 2001. Using TF-IDF to Determine Word Relevance in Document Queries. New Jersey: Department of Computer Science, Rutgers University.
[7] Taylor, Arlene G. 2006. Introduction to Cataloging and Classification, 10th ed. Libraries Unlimited.
