2009 Second International Conference on Developments in eSystems Engineering
Genetic Algorithm Based to Improve HTML Document Retrieval

Ammar Al-Dallal
School of Information Systems, Computing and Mathematics, Brunel University, U.K.
[email protected]

Rasha S. Abdul-Wahab
College of Information Technology, Ahlia University, Manama, Bahrain
[email protected]

Abstract
This paper describes GAHWM, a new evolutionary algorithm that integrates the genetic algorithm paradigm with an inverted index model to mine the content of HTML documents for effective web document retrieval. Experiments on real-life data indicate that the method compares favorably in terms of recall and precision.
978-0-7695-3912-6/09 $26.00 © 2009 IEEE. DOI 10.1109/DeSE.2009.57

1. Introduction

The World-Wide Web provides citizens with access to an abundance of information, but it is becoming increasingly difficult to identify the relevant information. This has given rise to Web mining, the extraction of useful knowledge from the information scattered across the Web. Research in Web mining attempts to address this problem by applying techniques from data mining and machine learning to Web data, documents and services [6]. Web mining is commonly divided into the following three sub-areas:

• Web Content Mining: extracting useful information from page content (text, images, etc.); it is used by search engines, agents and recommendation engines to help users find what they are looking for.
• Web Structure Mining: using the hyperlink structure of the Web as an information source.
• Web Usage Mining: analyzing user interactions with a Web server.

Recently, the information retrieval (IR) problem has gained importance, and most studies argue that IR can be seen as a standard optimization problem, as stated in [8]. A genetic algorithm (GA), thanks to its simplicity and its capability as a powerful search mechanism, can therefore make important contributions to the field of IR. In this paper, information retrieval and a genetic algorithm are combined in what we call the Genetic Algorithm for HTML Web Content Mining (GAHWM).

A GA is a probabilistic algorithm that simulates the mechanism of natural selection in living organisms. GAs are often used to solve problems whose solutions are expensive to find. A GA uses the principles of selection and evolution to produce several solutions to a given problem. Its search space is composed of candidate solutions to the problem (chromosomes). Each chromosome has an objective function value known as its fitness value, which is used to favor the selection of successful parents for new offspring. Offspring solutions are produced from parent solutions by applying crossover and mutation operators [1, 2].

An important characteristic of GAs is that they perform a global search: evolutionary algorithms work with a population of candidate solutions rather than with a single candidate solution at a time. This, together with their use of stochastic operators, reduces the probability that they get stuck in local maxima and increases the probability that they find the global maximum.

GAs have been quite successful in information retrieval, because the core problem of information retrieval can be seen as an optimization problem. Marghny [8] presents a steady-state GA that evolves a population of pages. The method defines an evaluation function that represents a mathematical formulation of the user request. Individuals are created by querying a standard search engine. The crossover operator, applied with probability Pc, selects two parent individuals (web pages) from the population, randomly chooses one crossover position within the page, and exchanges the links after that position between the two individuals.

Picarougne [5] proposes Geniminer, a GA-based technique. In Geniminer, the fitness function used to select a web page depends on the number of keywords, the number of words from the list of words that should be present and from the list of words that should not be present, and the number of links that might lead to relevant pages. Asllani [4] proposed multiple-criteria web-site optimization using a GA. In this approach the algorithm uses several factors to evaluate a web page and formulate the fitness function: the number of web objects, the download time of the objects in the page, a visualization coefficient, and a factor for the probability of selling product j when customer i visited immediately before. Kim and Zhang [10] proposed a text mining approach to web document retrieval. They apply a GA to find the significant HTML tags and to obtain optimal tag weights. Their method was evaluated on artificial text sets.

A brief review of document preprocessing is presented in the next section. The implementation of GAHWM is presented in Section 3. Experimental results on a web dataset are provided in Section 4. Finally, Section 5 concludes and gives directions for further work.
2. Preprocessing Documents

Before the GA is applied, the documents must go through a set of preprocessing steps. GAHWM uses a new tool for this purpose, which we call the Weighted Web Tool Inverted index (WWT).

The main purpose of web searching and traditional IR is to find all documents that contain the terms in the user query. The simplest option for a given query is to scan the document database sequentially to find the documents that contain the query terms. Several models have been developed to represent the documents, which form the search space of the problem. They include [1] the Boolean model, the vector space model, and the probabilistic model. However, some of these methods are clearly impractical for a large collection, and therefore the inverted index is used as the basis of our tool.

WWT builds data structures known as indices from the collected documents in order to speed up retrieval. The inverted index has proven superior to most other indexing schemes and is the most important indexing method used in search engines [7]. Not only does such an index allow efficient retrieval of the documents containing the query terms, it is also very fast to build. WWT is basically a data structure that associates each distinct term with a list of all documents containing it, together with the frequency of the term and its weight in each document. The term weight is an integer indicating the significance of the term in a document. It is estimated from the weight assigned to the HTML tag within which the word appears. Table 1 lists the HTML tags and their weights. The pseudocode of WWT is presented in Algorithm 1.

Algorithm 1 WWT Tool
while there are more documents to be processed do
    while there is a word in the document do
        Read the document one word at a time until the whole document is read. Split the string into tokens.
        Remove stop words
        Evaluate the word weight
        Get the related document node, or create it if it does not exist
        if there is a link to a document node for the same document then
            Increment the frequency in that node and its weight
        else
            Create a new node for this document and set its frequency to 1
            Add this document to the list of documents
        end if
        Increment the word count for that document, since it is needed for the word density per document
    end while
end while

WWT starts by reading, word by word, a document taken from the set of collected documents. The tool ignores stop words, the HTML tags themselves, and all words outside a tag. WWT then classifies each word according to the tag in which it appears and assigns it a weight as shown in Table 1. Finally, WWT produces a list of all words, sorted alphabetically for easy searching. Two kinds of structure are created by WWT:

1. Terms structure: includes the word to be indexed, the frequency of this word in the whole set of documents, and the number of documents containing this word.
2. Document structure: includes
• the document name,
• the total number of words in that document (used later to find the percentage of the word within this document),
• the frequency of the word in that document, and
• the weight of the word in that document.
Table 1. HTML Tags and their Weights

HTML Tag Name       Weight
Title               6
Head, h1, h2, h3    5
A: Anchor           4
B: Bold             3
body                1
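As a rough illustration of how an index of this kind could be built, the following Python sketch applies the tag weights of Table 1 while parsing HTML with the standard library. It is not the authors' implementation. The names `build_index`, `TAG_WEIGHTS` and `STOP_WORDS`, the tiny stop-word list, the choice of the innermost weighted enclosing tag, and the accumulation of weights by summation are all assumptions made for the example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Tag weights taken from Table 1 (tag names lower-cased as the parser reports them).
TAG_WEIGHTS = {"title": 6, "head": 5, "h1": 5, "h2": 5, "h3": 5,
               "a": 4, "b": 3, "body": 1}
# Illustrative stop-word list only; the paper does not give its list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


class WeightedWordParser(HTMLParser):
    """Collects (word, weight) pairs for words that appear inside a weighted tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.words = []       # (word, tag weight) in reading order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumption: a word takes the weight of its innermost weighted tag.
        weight = next((TAG_WEIGHTS[t] for t in reversed(self.open_tags)
                       if t in TAG_WEIGHTS), None)
        if weight is None:
            return            # words outside any weighted tag are ignored
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            if word not in STOP_WORDS:
                self.words.append((word, weight))


def build_index(documents):
    """documents maps a document name to its HTML text.

    Returns (terms, doc_sizes): terms maps word -> {doc name: [frequency, weight]}
    with words sorted alphabetically; doc_sizes maps doc name -> indexed word count.
    """
    terms = defaultdict(dict)
    doc_sizes = {}
    for name, html in documents.items():
        parser = WeightedWordParser()
        parser.feed(html)
        parser.close()
        doc_sizes[name] = len(parser.words)
        for word, weight in parser.words:
            node = terms[word].setdefault(name, [0, 0])
            node[0] += 1        # frequency of the word in this document
            node[1] += weight   # assumption: weights accumulate by summation
    return dict(sorted(terms.items())), doc_sizes
```

Calling `build_index({"d1": "<html>...</html>"})` returns the term structure and the per-document word counts; the number of documents containing a word is then `len(terms[word])`.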
[Figure 3. The Constructed Data by WWT. Word nodes (e.g. "Data", "Mining", "Web") each link to a chain of document nodes (d1, d2, d3, d5), and each chain ends with Null.]

Figures 1 and 2 show the structures created by WWT, whereas Figure 3 shows how the created structures are linked together. For example, after scanning the dataset, the word "Data" in Figure 3 appeared 9 times in three documents.

[Figure 1. Word structure. Fields: Word | Freq | No. of Doc. | link to doc. node]
[Figure 2. Document structure. Fields: D1 (document name) | File Size | Word Freq. | Word Weight | Doc. node]

Algorithm 2 GAHWM algorithm
Load the index into memory.
while the number of generations is less than 50 AND the difference between the fitness values of two successive generations is at least 0.025 do
    Generate a population
    Apply the crossover operator with a probability of 0.8 to generate the next generation of the population
    Calculate the fitness function for each chromosome based on the keywords entered by the user
    Apply the modifier operator
    Apply mutation to each newly generated chromosome with a probability of 0.3
end while
3. The GAHWM Algorithm

This section presents the GAHWM method, which uses the WWT tool to produce the required solution. The GAHWM procedure for generating the optimal solution relies on five operators: selecting the population, crossing over parents to generate offspring, evaluating the fitness of the selected offspring, processing a modifier operator, and applying mutation to these offspring with a certain probability to produce modified offspring. GAHWM runs for a fixed number of iterations or until the optimal solution is found. More details are given in the following subsections. Algorithm 2 shows the pseudocode of the GAHWM algorithm.
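The selection, crossover and mutation operators named above (their parameters are detailed in Sections 3.2 to 3.4) can be sketched in Python as follows. A chromosome is taken to be a list of unique document reference numbers. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, returning the less fit individual when r ≥ k is the usual tournament convention and is assumed here, and the way the two crossover points are applied is one reading that reproduces the worked example of Section 3.3.

```python
import random

CROSSOVER_PROB = 0.8   # probability stated for crossover; applied by the caller
MUTATION_PROB = 0.3    # probability stated for mutation; applied by the caller


def tournament_select(population, fitness, k=0.75, rng=random):
    """Pick two individuals at random; return the fitter one if r < k."""
    first, second = rng.sample(population, 2)
    if fitness(second) > fitness(first):
        first, second = second, first          # first is now the fitter one
    # Assumption: the less fit individual is returned when r >= k.
    return first if rng.random() < k else second


def crossover(parent_x, parent_y, cp1, cp2):
    """Two-point crossover for parents of possibly different lengths.

    Assumed reading: the first offspring keeps parent_x up to cp1 and takes the
    tail of parent_y from cp1 onward; the second keeps parent_y up to cp1 and
    takes the tail of parent_x from cp2 onward. Duplicate genes are omitted.
    """
    child_x = list(dict.fromkeys(parent_x[:cp1] + parent_y[cp1:]))
    child_y = list(dict.fromkeys(parent_y[:cp1] + parent_x[cp2:]))
    return child_x, child_y


def mutate(chromosome, max_doc, rng=random):
    """Replace one randomly chosen gene with a new gene not yet in the chromosome."""
    candidates = [d for d in range(1, max_doc + 1) if d not in chromosome]
    if not chromosome or not candidates:
        return list(chromosome)                # nothing can be replaced
    mutated = list(chromosome)
    mutated[rng.randrange(len(mutated))] = rng.choice(candidates)
    return mutated
```

With parents (1, 2, 12, 20, 3) and (1, 2, 16, 18, 20) and crossover points 2 and 3, `crossover` returns the document numbers of the two offspring given in the example of Section 3.3.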
3.1 Chromosome Representation

When a user enters a list of keywords, a new list of documents is created containing each document's name and its frequency value for those keywords. Each document has a reference number, which is used in constructing the chromosome's genes. A GAHWM chromosome is composed of a set of genes g_i, where each g_i ∈ [d_1..d_n] and d_n ∈ [1..max number of documents] represents a document reference number. The genes are selected randomly, and their number does not exceed the total number of documents. In GAHWM, the solutions (chromosomes) have variable lengths, which are generated randomly. The size and contents of the solutions in the population can change over successive generations through the genetic operations. Each solution in the initial population consists of at least five genes, and the genes are unique within a chromosome, so each solution in the created population is syntactically correct. The pseudocode for generating solutions in GAHWM is presented in Algorithm 3.
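The generation procedure described in this subsection can be sketched in Python as follows. The function and parameter names (`generate_chromosome`, `num_documents`, `max_length`) are hypothetical, and capping the length at the number of documents is an assumption made so that the uniqueness loop always terminates.

```python
import random


def generate_chromosome(num_documents, max_length, min_length=5, rng=random):
    """Return a list of unique document reference numbers in 1..num_documents.

    The length is drawn at random from [min_length, max_length]; it is capped
    at num_documents (assumption) so that enough distinct genes exist.
    """
    upper = min(max_length, num_documents)
    if upper < min_length:
        raise ValueError("not enough documents for the minimum chromosome length")
    length = rng.randint(min_length, upper)
    genes, seen = [], set()
    while len(genes) < length:
        gene = rng.randint(1, num_documents)   # candidate document reference
        if gene not in seen:                   # keep genes unique
            seen.add(gene)
            genes.append(gene)
    return genes


def generate_population(size, num_documents, max_length, rng=random):
    """Create a first generation with a fixed number of chromosomes."""
    return [generate_chromosome(num_documents, max_length, rng=rng)
            for _ in range(size)]
```

For a collection of 1200 documents, `generate_population(100, 1200, 20)` would give 100 chromosomes of between 5 and 20 unique genes each; the population size and maximum length here are illustrative values, not ones stated in the paper.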
Algorithm 3 The process of generating solutions in GAHWM
generate the length of the solution randomly, where the length is in [5, Maxlength]
generate the first gene g_i randomly
S = S ∪ {g_i}
while the number of genes has not reached the specified length do
    generate the next gene g_i randomly
    if g_i ∉ S then
        S = S ∪ {g_i}
    end if
end while

3.2 Selecting Population

The population of GAHWM comprises sets of documents. The first generation, which is created randomly, has a fixed number of chromosomes. Each chromosome holds a set of document references. Subsequent generations are created by selecting two individuals at random and picking the one with the higher fitness value, and then selecting another two individuals at random from the population in the same way. In this paper tournament selection is used: two individuals are chosen at random from the population, and a random number r is then generated between 0 and 1. If r < k (where k is a parameter, for example 0.75), the fitter individual is selected [2].

3.3 Crossover

GAHWM uses a crossover operator to generate offspring from the existing population. The crossover used is the two-point crossover method, performed with a probability of 0.8. The two parents, which may have different lengths, are aligned with each other and two crossover points are chosen at random. The first point Cp1 is selected according to the length of the first parent, whereas the second point Cp2 is selected according to the length of the second parent. The tail of the second parent from point Cp1 onward is swapped in to create the first offspring, while the tail of the first parent from point Cp2 onward is swapped in to create the second offspring. After crossover, it is important to confirm that all genes of the new offspring are unique; if a duplicate gene is found, it is omitted from the new offspring. For example, let $J^x$ and $J^y$ be two selected parent chromosomes, represented respectively as follows:

$$J^x = (j_1^x, j_2^x, j_{12}^x, j_{20}^x, j_3^x)$$
$$J^y = (j_1^y, j_2^y, j_{16}^y, j_{18}^y, j_{20}^y)$$

Given crossover points Cp1 = 2 and Cp2 = 3, the operator produces the following two offspring:

$$O^x = (j_1^x, j_2^x, j_{16}^y, j_{18}^y, j_{20}^x)$$
$$O^y = (j_1^y, j_2^y, j_{20}^x, j_3^x)$$

where each $j_i \in [1..\text{max number of documents}]$ represents a document reference number.

3.4 Mutation

Mutation is the fourth operator of GAHWM and is performed with a probability of 0.3. It is applied to a new chromosome and changes the individual's genetic representation according to some probabilistic rule [13] [14]. In mutation, a position g_i within the chromosome is selected at random and replaced with a new, randomly generated gene after checking its uniqueness.

3.5 Fitness Function

The fitness function measures the performance of the retrieved results for each chromosome. GAHWM's fitness function consists of several terminals [11] [9]: some are local, influenced by the document, whereas others are global, influenced by the space of documents. The user query is considered a global terminal. For each document, the fitness value is calculated as follows (the parameters are described in Table 2):

$$f(d_j) = \sum_{i=1}^{K} w_i \qquad (1)$$

$$w_i = \frac{f_{ij}}{K} \times \frac{K_i}{F_j} \times \frac{1}{t_j} \times h_{ij} \times \log\left(\frac{N}{df_i}\right) \times \log\left(\frac{T}{T_i}\right) \qquad (2)$$

Table 2. Terminal Meanings

Terminal   Description                                          Domain
i          Term i in the doc.                                   local
j          Doc. j in space                                      local
w_i        Weight of term i in doc. j                           local
K_i        Frequency of term i in user query Q                  global
K          Total number of terms in Q                           global
f_ij       Frequency of term i in doc. j                        local
F_j        Size of doc. j (total number of words in doc. j)     local
t_j        Number of unique terms in doc. j                     local
h_ij       HTML tag weight of term i in doc. j                  local
N          Total number of docs in space                        global
df_i       Total number of docs having term i                   global
T          Total number of all terms in space                   global
T_i        Total number of occurrences of term i in space       global

For each chromosome in GAHWM's population, the following function is used to calculate the fitness value from its documents:

$$F(c_i) = \sum_{j=0}^{L} f(d_j) \qquad (3)$$

where L represents the maximum length of chromosome $c_i$.

3.6 Modifier Operator

As mentioned before, each solution holds a set of documents represented as numbers. Some of the documents within a chromosome, however, are not related to the user query. This impairs the ability of GAHWM to find optimal solutions, since their low fitness values are included in the fitness value of the chromosome and lead to wrong results. This paper therefore presents a modifier operator, which is applied after the fitness values are calculated. To discard irrelevant documents from each chromosome, a threshold value is introduced. Based on the fitness value calculated for each document, the threshold decides whether the document is added to the modified chromosome: if the fitness value of the document is higher than the threshold, the modifier operator adds the document to the modified chromosome; otherwise the document is discarded. The weight value of the modified chromosome is calculated after the modifier operator has been applied. It is measured as the ratio of the number of relevant documents within the modified chromosome to the total number of documents within the original one, so it lies between 0 and 1. The modifier operator helps GAHWM find accurate solutions by balancing the number of relevant documents against the fitness value of the chromosome, so that chromosomes with a large number of irrelevant documents are penalized.
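To make Equations (1) to (3) and the modifier operator concrete, the following Python sketch computes a term weight, a document fitness, a chromosome fitness, and the modified chromosome with its weight value. It is an illustration under assumptions, not the authors' code: the function names are hypothetical, the logarithm base is not stated in the paper and natural logarithms are assumed, the threshold is left as a parameter because no value is given, and how the weight value is combined with the chromosome fitness is not specified, so both are simply returned.

```python
import math


def term_weight(f_ij, F_j, t_j, h_ij, K_i, K, N, df_i, T, T_i):
    """Equation (2): weight of query term i in document j (natural log assumed)."""
    return ((f_ij / K) * (K_i / F_j) * (1 / t_j) * h_ij
            * math.log(N / df_i) * math.log(T / T_i))


def document_fitness(term_weights):
    """Equation (1): sum of the weights of the query terms in one document."""
    return sum(term_weights)


def chromosome_fitness(chromosome, doc_fitness):
    """Equation (3): sum of f(d_j) over the documents referenced by the chromosome.

    doc_fitness maps a document reference number to f(d_j); documents without
    an entry are treated as having fitness 0 (assumption).
    """
    return sum(doc_fitness.get(doc, 0.0) for doc in chromosome)


def apply_modifier(chromosome, doc_fitness, threshold):
    """Keep documents whose fitness exceeds the threshold.

    Returns (modified chromosome, weight value), where the weight value is the
    share of the original documents that were kept, between 0 and 1.
    """
    modified = [doc for doc in chromosome
                if doc_fitness.get(doc, 0.0) > threshold]
    weight = len(modified) / len(chromosome) if chromosome else 0.0
    return modified, weight
```

For example, with per-document fitness values {1: 0.9, 2: 0.0, 3: 0.4, 4: 0.05} and a hypothetical threshold of 0.1, the chromosome [1, 2, 3, 4] is reduced to [1, 3] with a weight value of 0.5.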
4 Experimental Results

This section presents the experimental work carried out to analyze the GAHWM method. GAHWM is tested on a set of 1200 HTML documents, using 3 sets of keywords.

4.1 Dataset Description

A dataset of 1200 HTML documents is used to test the efficiency of GAHWM. It is a subset of the 4 Universities Data Set [12], which contains WWW pages collected from computer science departments of various universities in January 1997 by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group.

4.2 The Performance of GAHWM

The performance of the GAHWM method is examined in terms of recall and precision. Recall is the number of relevant retrieved documents divided by the number of all documents in the collection that are relevant to the user query:

$$\text{Recall} = \frac{\text{Relevant retrieved}}{\text{Relevant}} \qquad (4)$$

Precision is the number of relevant retrieved documents divided by the number of retrieved documents:

$$\text{Precision} = \frac{\text{Relevant retrieved}}{\text{Retrieved}} \qquad (5)$$

The GAHWM approach uses 3 sets of keywords. Each set is composed of 3 lists: a list of main keywords, a list of must-exist keywords, and a list of must-not-exist keywords. To test the performance of GAHWM, its results are compared with those of GAWS [3]. The fitness value of GAWS for each document is calculated using the following fitness function:

$$f(d_i) = \frac{\left(0.5 \cdot \frac{K + K_{in}}{F_s}\right) \cdot 3 \cdot N}{\alpha} \qquad (6)$$

where, in this function, N is

$$N = (K + K_{in} - K_{ex}) \qquad (7)$$

The same dataset is used for both GAHWM and the normal GAWS [3]. Table 3 shows the recall and precision produced by both methods. Although the results of the two methods are similar in several cases, GAHWM achieves a higher average recall, reaching a maximum of 0.99.

Table 3. GAHWM & GAWS comparative study

List No.   Size   Recall             Precision
                  GAHWM    GAWS      GAHWM    GAWS
L1         100    0.37     0.4285    0.435    0.3225
L1         200    0.6045   0.65      0.3755   0.256
L1         300    0.7815   0.757     0.355    0.2175
L1         400    0.83     0.843     0.2885   0.1935
L1         500    0.889    0.9285    0.265    0.1765
L2         100    0.3935   0.364     0.349    0.391
L2         200    0.684    0.633     0.3505   0.3885
L2         300    0.7585   0.754     0.275    0.312
L2         400    0.826    0.823     0.225    0.2695
L2         500    0.934    0.9045    0.219    0.2525
L3         100    0.785    0.392     0.14     0.3815
L3         200    0.985    0.6695    0.0945   0.3495
L3         300    0.99     0.7565    0.0605   0.2725
L3         400    0.98     0.8785    0.0525   0.256
L3         500    0.985    0.891     0.0445   0.207

5 Conclusion

This paper proposes a text mining approach to web document retrieval that uses the tag information of HTML documents. A genetic algorithm is applied to find significant documents. Our experimental results show that GAHWM is a promising method. A comparative study between GAWS [3] and GAHWM was carried out, and the results show that GAHWM is more effective than GAWS at retrieving the required documents. The modified version of the fitness function proposed in GAHWM improved its ability to find more promising results and its accuracy in doing so.
References

[1] Ahmed A. A. Radwan, Bahgat A. Abdel Latef, Abdel Mgeid A. Ali, and Osman A. Sadek. Using genetic algorithm to improve information retrieval systems. Proceedings of World Academy of Science, Engineering and Technology, 17:6–12, 2006.
[2] Rasha S. Abdul-Wahab. New Development in Genetic Algorithm for Developing Software. Thesis, University of Technology, Baghdad, Iraq, 2000.
[3] Ammar Al-Dallal and Rasha S. Abdul-Wahab. Genetic algorithm in web search using inverted index representation. 5th IEEE-GCC Conference, 2009.
[4] Arben Asllani and Alireza Lari. Using genetic algorithm for dynamic and multiple criteria web-site optimizations. European Journal of Operational Research, 176(1):1767–1777, 2007.
[5] F. Picarougne, N. Monmarché, A. Oliver, and G. Venturini. Geniminer: web mining with a genetic-based algorithm. Proceedings of the IADIS International Conference WWW/Internet, 2002.
[6] J. Fürnkranz. Web mining. In O. Maimon and L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, pages 899–920, 2005.
[7] Bing Liu. Web Data Mining. Springer-Verlag New York, LLC, Dec. 2006.
[8] M. H. Marghny and A. F. Ali. Web mining based on genetic algorithm. AIML 05 Conference, pages 19–21, 2005.
[9] S. Kim and B. Zhang. Genetic mining of HTML structures for effective web-document retrieval. Applied Intelligence, 18(3):243–256, 2003.
[10] Sun Kim and B. Zhang. Genetic mining of HTML structures for effective web document retrieval. Applied Intelligence, 18:243–256, 2003.
[11] Terry Noreault, Michael McGill, and Matthew B. Koll. A performance evaluation of similarity measures, document term weighting schemes and representations in a Boolean environment. Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, Cambridge, England, pages 57–76, 1980.
[12] URL: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo20/www/data/ (31/1/2009).
[13] D. Vrajitoru. Crossover improvement for the genetic algorithm in information retrieval. Information Processing and Management, 34(4):405–415, 1998.
[14] D. Vrajitoru. Large population or many generations for genetic algorithms? Implications in information retrieval. In F. Crestani and G. Pasi (Eds.), Soft Computing in Information Retrieval: Techniques and Applications, Physica-Verlag, pages 199–222, 2000.