Application of Genetic Algorithm to Determine A ...

3 downloads 82389 Views 223KB Size Report
Jul 20, 2005 - The First Malaysian Software Engineering Conference (MySEC'05). 167. Application of Genetic ... a set of keywords which best described the documents. Genetic .... Interruption, interaction. 2. A Tracking Framework for.
The First Malaysian Software Engineering Conference (MySEC’05)

Application of Genetic Algorithm to Determine A Document Similarity Level in IRS Poltak Sihombing, Abdullah Embong, Putra Sumari School of Computer Sciences Universiti Sains Malaysia Penang, Malaysia [email protected], [email protected], [email protected]

ABSTRACT

The most important problem in IRS (Information Retrieval System) is to get the most relevant document from a database. In this paper we propose a technique to determine a document similarity level in IRS using Genetic Algorithm (GA). The goal of GA was to find a set of documents which best matched the searcher's needs. An evaluation function for the fitness of each chromosome was selected based on Dice's score. The Dice's score is a common measure of association in information retrieval. To initialize a population, we need first to decide the number of genes for each individual and the total number of chromosomes (popsize) in the initial population. GA is basically based on natural biological evolution. The parent solution (chromosome) with the higher level of fitness has a bigger similarity percentage of documents, while those with lower level of fitness have less similarity percentage of documents. By the similarity percentage of documents, the user can choose the most relevant document from a database.

KEYWORDS Information retrieval, similarity, Dice, genetic algorithm

1. Introduction

In this study we are interested to investigate the use of GA to determine the level of similarity by using Dice's fitness measure and to implement this technique in a prototype system called Journal Browser.

The goal of any IRS is to help a user to locate the right document or those documents that have the potential to satisfy the user information needs. Most of the time users are unable to retrieve the required document. To solve this problem, researchers have implemented several methods such as inverted index, Boolean querying, knowledge-based, neural network, probabilistic retrieval and machine learning approach [1]. A newer paradigm, generally considered to be the machine learning approach has attracted the attention of researchers in artificial intelligence, computer science, and other functional disciplines such as engineering, medicine, and business. In contrast to performance systems which acquire knowledge from human experts, machine learning systems acquire knowledge automatically from examples, i.e., from data source. The most frequently used techniques include symbolic, inductive learning algorithms [2], multiple-layered, feed-forward neural networks such as back propagation networks, and evolution-based genetic algorithm (GA) [3] [4].

2. GA and IRS The literature study revealed several implementations of genetic algorithms in information retrieval. In [5], Gordon presented a genetic algorithms based approach for document indexing. Competing document descriptions (keywords) are associated with a document and altered over time by using genetic mutation and crossover operators. In his design, a keyword represents a gene (a bit pattern), a document's list of keywords represents individuals (a bit string), and a collection of documents initially judged relevant by a user represents the initial population. Based on a Jaccard's score matching function (fitness measure), the initial population evolved through generations and eventually converged to an optimal (improved) population a set of keywords which best described the documents.

The GA was used because of its ability to determine document similarity by keywords competition. A keyword represents a gene (a bit pattern), a document's list of keywords represents individuals (a bit string), and a collection of documents initially judged relevant by a user represents the initial population.

Genetic Algorithm (GA) is a part of evolutionary computing, which is rapidly growing area of artificial intelligence. GA is inspired by Darwin's theory of evolution. Simply said, problems are solved by an evolutionary process resulting in a best (fittest) solution [6].

167

The First Malaysian Software Engineering Conference (MySEC’05)

represent the underlying bit strings for the initial population. A chromosome is formed by gene which represents bit (0 and 1). Each bit represents the same unique keyword throughout the complete GA process. When a keyword is present in a document, the bit is set to 1, otherwise it is set to 0. Each document could then be represented in terms of a sequence of 0s and 1s which is called a chromosome model, for example: 01101010110101101. At the beginning of a run of a GA a large population of random chromosomes is created. Chromosome models depend on the case, for example, given the following equation as a case:

2.1 Outline of the Basic GA Let's say there are N chromosomes in the initial population. Then, the following steps are repeated until a solution is found. (i)

(ii)

(iii)

(iv) (v)

Test each chromosome to see how good it is at solving the problem at hand and assign a fitness score accordingly. The fitness score is a measure of how good that chromosome is at solving the problem. Select two members from the current population. The chance of being selected is proportional to the chromosomes fitness. Roulette wheel selection is a commonly used method. Dependent on the crossover rate, crossover the bits from each chosen chromosome at a randomly chosen point. Step through the chosen chromosomes bits and flip dependent on the mutation rate. Repeat step 2, 3, 4 until a new population of N members has been created [6].

a + b * c - d * e = 100 In the equation searches a variable score a, b, c, d, and e so that the total score is 100. Supposing maximum score to each variable is 15, meaning in binary representation there are four bits to each score. Because there are 5 variables, hence chromosome length is 20 bits. Taking example at one particular solution (chromosome), score of variable a is 10(10102), variable b is 5(01012), variable c is 14(11102), variable d is 2(00102), and variable e is 9(10012), hence chromosome at the solution is:

2.2 Representation of GA for IRS A GA maintains a population of individuals, P (t) = x1,…,xn at iteration t. Each individual represents a potential solution to the problem at hand and is implemented as some data structure S. Each solution xi is evaluated to give some measure of fitness. Then a new population at iteration t+1 is formed by selecting the fitter individuals. Some members of the new population undergo transformation by means of genetic operators to form new solutions. There are unary transformations mi (mutation type), which create new individuals by a small change in a single individual and higher order transformations cj (crossover type), which create new individuals by combining parts from several (two or more) individuals. The crossover and mutation are the most important parts of the genetic algorithm. The performance is influenced mainly by these two operators. The genetic algorithm was executed in the following steps:

10100101111000101001 (ii) Crossover This is simply the chance that two chromosomes will swap their bits. Crossover is performed by selecting a random gene along the length of the chromosomes and swapping all the genes after that point [7]. E.g. Given two chromosomes A: 10001001110010010 B: 01010001001000011 Choose a random bit along the length, say at position 9, and swap all the bits after that point, so the above become: A': 10001001101000011 B': 01010001010010010

(i) Encoding of a Chromosome Algorithm begins with a set of solutions (represented by chromosomes) called population. Solutions from one population are taken and used to form a new population. This is motivated by the expectation that the new population will be better than the old one. Solutions which are then selected to form new solutions (offspring) are selected according to their fitness [8].

(iii) Mutation After a crossover is performed, mutation takes place. Mutation is intended to prevent falling of all solutions in the population into a local optimum of the solved problem. Mutation operation randomly changes the offspring resulted from crossover. In case of binary encoding we can switch a few randomly chosen bits from 1 to 0 or from 0 to 1. Mutation can be illustrated as follows [7]:

In IRS the keywords used in the set of userselected documents were first identified to

168

The First Malaysian Software Engineering Conference (MySEC’05)

of the required document from one or more databases based on the similarity level.

Original offspring 1 1101111000011110 Original offspring 2 1101100100110110 Mutated offspring 1 1100111000011110

4. The Result

Mutated offspring 2 1101101100110110

Some of the example result can be seen below: The technique of mutation (as well as crossover) depends mainly on the encoding of chromosomes. For example when we are encoding permutations, mutation could be performed as an exchange of two genes.

======================================

Information Retrieval Report 20/7/2005 08:03:47 AM ====================================== Classification: HCI Query : 1st ; 2 Documents as Queries

(iv) Determination of Population 1. Human-Computer Interaction for Alert Warning and Attention Allocation Systems of the Multi-Modal Watch station Keywords: HCI, Alert, Warning, Interruption, interaction 2. A Tracking Framework for Collaborative Human Computer Interaction Keywords: HCI, 3D, interaction, human

Determination of population on next generation is based on the fitness' score. The higher level of fitness has a bigger probability to reproduce, while those with lower level of fitness have less probability to reproduce. Usually, this probability is selected by Roulette wheel selection method. In the next generation, all population is evaluated to determine whether they have reached the expected solution [7].

=============================== Last Generation (9th Generation) =============================== Solution: 1000110 Keywords: HCI, interaction, 3D

3. Implementation As mentioned before, this technique has been implemented in a form of a prototype called Journal Browser as shown in Figure 1.

================================== Similarity of Documents Retrieval: ==================================

A similarity approach of document was adopted by Dice's score to compute the ``fitness'' of subject descriptions for information retrieval. Dice's score is formulated below:

[40.00%] Human-Computer Interaction Standards [40.00%] Vision-based sensor fusion for Human-Computer Interaction [33.33%] Human-Computer Interaction Through computer vision

Classification: HCI Query : 2nd ; 3 Documents as Queries

Dice's score take a slice from two chromosomes, then multiplied by 2 and divided by amount of length of both chromosomes. Dice's score represents common measurement in genetic algorithm. From this formula it can be seen that if X is equal to Y then Dice's score is equal to 1, meaning that document X is precisely the same as document Y, although this case is very difficult to be found in the database.

1. Human-Computer Interaction for Alert Warning and Attention Allocation Systems of the Multi-Modal Watch station Keywords: HCI, Alert, Warning, Interruption, interaction 2. A Tracking Framework for Collaborative Human Computer Interaction Keywords: HCI, 3D, interaction, human 3. New Human-Computer Interaction Techniques Keywords: HCI, interaction, movement, gesture, dialogue

If a keyword is present in a document, the bit is set to 1, otherwise it is set to 0. Each document could then be represented in terms of a sequence of 0s and 1s. We have computed the fitness of each document based on its relevance to the documents in the user-selected set. Document with a higher Dice's score reflects a higher probability of similarity. Application of this technique will facilitate searching and retrieval

================================ Last Generation (11th Generation) ================================ Solution: 1111100010 Keywords: HCI, interaction, movement, gesture, dialogue, Warning

169

The First Malaysian Software Engineering Conference (MySEC’05)

Figure 1: Journal browser ================================= Similarity of Documents Retrieval: ================================= ================================ Last Generation (24th Generation) ================================ Solution: 11011001111101 Keywords: HCI, Alert, Interruption, interaction, movement, gesture, dialogue, Human, Modeling, COGNITION

[35.57%] Towards Conversational Human-Computer Interaction [33.35%] Human-Computer Interaction Standards [25.00%] Vision-based sensor fusion for Human-Computer Interaction

================================= Similarity of Documents Retrieval: =================================

====================================== Classification: HCI Query : 3rd ; 4 Documents as Queries 1. Human-Computer Interaction for Alert Warning and Attention Allocation Systems of the Multi-Modal Watch station Keywords: HCI, Alert, Warning, Interruption, interaction 2. A Tracking Framework for Collaborative Human Computer Interaction Keywords: HCI, 3D, interaction, human 3. New Human-Computer Interaction Techniques Keywords: HCI, interaction, movement, gesture, dialogue 4. Analyzing Human-Computer Interaction as Distributed Cognition: The Resources Model Keywords: HCI, Human, interaction, Modeling, DISTRIBUTED, COGNITION

[33.33%] Towards Conversational HumanComputer Interaction [33.08%] Implicit Human Computer Interaction through Context [21.43%] Designing and Testing Human Computer Interaction: A Case

4.1 Analysis of the Experimental Result Several queries have been given with different number of document request. The result shows that documents' similarity levels are consistent even though the percentage of similarity may change. This means that the order of the document resulted from the queries changes but the documents residing at the top level remain to be at the top level.

170

The First Malaysian Software Engineering Conference (MySEC’05)

Symbolic Learning, and Genetic Algorithms, Artificial Intelligence Lab, Eller College of Management, The University of Arizona, 1995. http://ai.bpa.arizona.edu/papers/mlir93/mlir9 3.html

It is observed that the parent solution (chromosome) with the higher level of fitness has a bigger probability to reproduce, while those with lower level of fitness have less probability to reproduce. When more documents were used as the source of the queries, the document similarity level may decrease because the Dice's fitness value decreases. For example, for 2 documents as queries, the similarity value was found to be 40.00%, for 3 documents as queries, the similarity value was found to be 35.57 %, for 4 documents as queries, the similarity value was found to be 33.33%. Documents with a higher Dice's score reflect a higher probability of similarity.

[2] Quinlan, J.R. Discovering rules by induction from large collections of examples. In Expert Systems in the Micro-electronic Age, Pages 168-201, Michie, D., Editor, Edinburgh University Press, Edinburgh, Scotland, 1979. [3] Rumelhart, D.E., Hinton, G.E. and Williams, R.J. Learning internal representations by error propagation. In Parallel Distributed Processing, pages 318-362, D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, Editors, The MIT Press, Cambridge, MA, 1986. [4] Belew, R.K. Adaptive information retrieval. In Proceedings of the Twelfth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 11-20, Cambridge, MA, June 25-28, 1989. [5] Gordon, M. Probabilistic and genetic algorithms for document tretrieval. Communications of the ACM, 31(10):12081218, October 1988. [6] Hsiung, Sam, An Introduction to Genetic Algorithms, generation5, 2000. http://www.generation5.org/content/2000/ ga.asp

5. Conclusion and Future Work What we have learned from this is that we can draw some conclusions about overall trends in the technique of probability in document similarity. It is observed that the higher the number of documents given as a query, the higher is the probability to do crossover, mutation and generation. The difference between crossover and mutation is often small, and more often statistically insignificant. We hope the system can be developed further to improve the similarity level of the document retrieval. In a future study, we are going to improve the system performing and to compare it with the method of Horng and Cosine for IRS.

[7] Anonym, Genetic Algorithm Tutorial, aijunkie, 2000. http://www.ai-junkie.com/ga/intro/gat1.html

6. References

[8] Marek Obitko, Genetic Algorithms, 2005. http://cs.felk.cvut.ct/~xobitko/ga/main.html

[1] Chen, Hinchey. Machine Learning for Information Retrieval: Neural Networks,

171

Suggest Documents