Fuzzy Adaptive Resonance Theory for Content-Based Data Retrieval

Amin Milani Fard, Hoda Akbari, Mohammad–R. Akbarzadeh–T.1

Department of Computer Eng., Ferdowsi University, Mashhad, Iran, IEEE Student Member
Faculty of Computer Eng., Sharif University of Technology, Tehran, Iran
Department of Electrical Eng., Ferdowsi University, Mashhad, Iran, IEEE Senior Member

Abstract
In this paper we propose a content-based text and image retrieval architecture using a Fuzzy Adaptive Resonance Theory neural network. The method is equipped with an unsupervised mechanism for dynamic data clustering, so it can deal with incremental information that carries no metadata, as in the web environment. Results show noticeable average precision and recall over search results.

Index terms: Content-based information retrieval, Neural networks, Fuzzy-ART, Web search engines.

1. Introduction
Content-based information retrieval is a problem of growing importance given the billions of indexed web pages. Web mining methods involve techniques to discover and analyze useful information on the web in order to retrieve a desired document [1]. Various researchers have recently focused on soft computing as a suitable paradigm for web data mining. The basic idea in content-based retrieval is to find similar informational items without first labeling or annotating them [2]. In such systems, relevant features are extracted from the items, stored in a database, and indexed to speed up the retrieval process. The most common way to do content-based retrieval is to calculate various features of an informational item; the similarity search can then be done using these feature vectors.

2. Related works
Indexing is an important phase of any search engine, in which a data structure is built for quick searching [3]. The concept of index-based searching is to construct a list of significant words to be used upon each search request instead of accessing the entire body of search data, thus speeding up the search process. In vector space models, each document is modeled as a vector representing attributes of the document [4]. Document ranking with respect to a query is then determined by its distance to the query vector. Unsupervised self-organizing map (SOM) based neural networks [5] are mainly used for dataset clustering. The structure of a SOM neural network changes during the learning process based on the observed data; however, these methods are usually not efficient for dynamic group clustering, such as for web documents. The Adaptive Resonance Theory (ART) neural network [6], in contrast, is an unsupervised incremental clustering neural network especially designed to satisfy this demand.
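The ranking step in a vector space model can be illustrated with a short sketch. It assumes cosine similarity as the closeness measure (the text only says "distance", so this choice is an assumption), and the document vectors, the three-word vocabulary, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query, documents):
    """Return document ids sorted from most to least similar to the query vector."""
    scores = {doc_id: cosine_similarity(query, vec) for doc_id, vec in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical term counts over the vocabulary ("data", "mining", "robot").
documents = {
    "doc_a": [10, 5, 0],
    "doc_b": [0, 1, 8],
    "doc_c": [3, 3, 3],
}
query = [1, 1, 0]  # a query mentioning "data" and "mining"
ranking = rank_documents(query, documents)
print(ranking)  # ['doc_a', 'doc_c', 'doc_b']
```

Cosine similarity ignores vector length, so a long and a short document with the same word proportions rank equally; a Euclidean distance would behave differently.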

Author is currently a visiting scholar at Berkeley Initiative on Soft Computing (BISC), UC Berkeley

1-4244-0674-9/06/$20.00 ©2006 IEEE.

3. Fuzzy Adaptive Resonance Theory
Adaptive Resonance Theory was introduced by Grossberg [7]-[8] in 1976. Some derivatives of ART are ART-1 (binary version) [9], ART-2 (analog version) [10], ART-2A (fast version of ART-2) [11], ART-3 (contains chemical transmitters to control the search process in a hierarchical ART structure) [12], and ARTMAP (supervised version) [13]. Some hybrid variants such as Fuzzy-ART [14] (for analog input patterns) and Fuzzy-ARTMAP (supervised Fuzzy-ART) [15] have also been developed. ART-based text clustering methods have been used in several works; however, the quality of the results has not been clearly evaluated. The Fuzzy-ART neural network has two advantages: it can handle both binary and analog vectors, and it has a faster implementation. Considering these features, and the imprecision of both data and queries, in this work we use Fuzzy-ART for HTML document clustering.

Fuzzy-ART generalizes ART-1, which is restricted to binary data, to continuous data in the interval [0,1]. This approach, as discussed by Moore [16], is basically similar to many iterative clustering algorithms in which each case is processed by first finding the "nearest" cluster seed (known as a prototype or template) and then updating that cluster seed to be "closer" to the case. In Fuzzy-ART, however, the framework is changed slightly by introducing the concept of "resonance": each case is processed by first finding the "nearest" cluster seed that "resonates" with the case, and then updating that cluster seed. Resonance is just a matter of being within a certain threshold of a second similarity measure.

Fuzzy-ART takes three input parameters: the choice parameter (β > 0), the vigilance parameter, a.k.a. similarity (0 ≤ ρ ≤ 1), and the learning rate (0 ≤ λ ≤ 1). The Fuzzy-ART algorithm has the following steps:

1. Initializing: Initialize all the parameters.

2. Applying input pattern: Let I be the next input vector and P be the set of candidate prototype vectors.

3. Selecting category: Find the closest prototype vector P_i ∈ P, that is, the one which maximizes

$$\frac{|\vec{I} \wedge \vec{P}_i|}{\beta + |\vec{P}_i|} \tag{1}$$

where ∧ is the fuzzy AND (component-wise minimum) and |·| is the L1 norm.

4. Vigilance testing: Compare the similarity between the winning prototype and the current input pattern against the user-defined vigilance parameter:

$$\frac{|\vec{I} \wedge \vec{P}_i|}{|\vec{I}|} \ge \rho \tag{2}$$

If the prototype passes the vigilance test, the algorithm goes to step 5, where the prototype is adapted to the input pattern. Otherwise, the prototype is deactivated for the current input pattern and the next-closest prototype is tested. If no prototype passes the test, a new prototype is created for the current input pattern, and the algorithm continues from step 2 with the next input.

5. Updating matched prototype: The matched prototype is moved closer to the current input pattern according to equation 3. If the learning rate λ is 1, this is called fast learning. The algorithm then continues with the next input (step 2).

$$\vec{P}_i \leftarrow \lambda\,(\vec{I} \wedge \vec{P}_i) + (1-\lambda)\,\vec{P}_i \tag{3}$$
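The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, the parameter defaults are arbitrary, and complement coding (commonly paired with Fuzzy-ART but not mentioned in the text) is omitted.

```python
def fuzzy_and(a, b):
    """Component-wise minimum of two vectors (the fuzzy AND operator)."""
    return [min(x, y) for x, y in zip(a, b)]

def l1_norm(v):
    """Sum of absolute component values."""
    return sum(abs(x) for x in v)

class FuzzyART:
    """Hypothetical minimal Fuzzy-ART clusterer for vectors with components in [0, 1]."""

    def __init__(self, beta=0.01, rho=0.75, lam=1.0):
        self.beta = beta      # choice parameter, beta > 0
        self.rho = rho        # vigilance parameter, 0 <= rho <= 1
        self.lam = lam        # learning rate, 0 <= lam <= 1
        self.prototypes = []  # step 1: start with no prototypes

    def _choice(self, pattern, prototype):
        # Equation (1): |I ^ P_i| / (beta + |P_i|)
        return l1_norm(fuzzy_and(pattern, prototype)) / (self.beta + l1_norm(prototype))

    def train_one(self, pattern):
        """Present one input vector (step 2) and return the index of its cluster."""
        # Step 3: try prototypes from highest to lowest choice value.
        order = sorted(
            range(len(self.prototypes)),
            key=lambda i: self._choice(pattern, self.prototypes[i]),
            reverse=True,
        )
        for i in order:
            match = fuzzy_and(pattern, self.prototypes[i])
            # Step 4: vigilance test, equation (2), written without a division.
            if l1_norm(match) >= self.rho * l1_norm(pattern):
                # Step 5: update the matched prototype, equation (3).
                self.prototypes[i] = [
                    self.lam * m + (1 - self.lam) * p
                    for m, p in zip(match, self.prototypes[i])
                ]
                return i
        # No prototype resonated: create a new one from the input itself.
        self.prototypes.append(list(pattern))
        return len(self.prototypes) - 1

net = FuzzyART(beta=0.01, rho=0.75, lam=1.0)
inputs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9]]
labels = [net.train_one(x) for x in inputs]
print(labels)  # [0, 0, 1]
```

With fast learning (λ = 1) the first prototype shrinks to the component-wise minimum of the two inputs that joined it, while the third input fails the vigilance test and founds a new cluster. Raising ρ makes clusters tighter and more numerous.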

4. The proposed approach
In this work, we build a search engine for information retrieval using Fuzzy-ART. This is a good approach for estimating the resemblance of multimedia documents and classifying them; however, only text and image mining are investigated here. Feature vectors of each document are first extracted and then passed to the Fuzzy-ART clustering subsystem. The resulting clusters are then inserted into the database tables. Two agents have been designed for these tasks. The first, called the crawler agent, performs the mining process by indexing documents. The second, called the search agent, processes queries and returns the search results.

4.1. Text feature extraction
A word-frequency-based vector is the standard way to represent textual documents. For example, in a collection of 100 documents, there might be 1000 distinct words. A document can then be encoded as a vector in 1000-dimensional space in which each dimension corresponds to a distinct word. For example, the vector (10, 5, 4, …) indicates that the document contains the word 'data' 10 times, the word 'mining' 5 times, the word 'paper' 4 times, and so on. In this work, the vector is normalized between 0 and 1 and used as the input vector of the Fuzzy-ART network.

The most common frequency-vector-based techniques are term frequency (TF), document frequency (DF), and term frequency × inverse document frequency (TFIDF). Each encodes a document as a list of word-frequency-related numbers; they differ only in the way the numbers are calculated. In the TFIDF model [17], each document D is represented by a feature vector of the form (d1, d2, …, dn), where di is a word in D. The order of the di is based on the weight of each word. Equation 4 is the common weight formula in TFIDF, where w_ij is the weight of term t_j in web page P_i, tf_ij is the number of occurrences of t_j in P_i, idf_j is the inverse document frequency, df_j is the number of web pages in which t_j occurs, and n is the total number of pages in the database.

$$w_{ij} = tf_{ij} \times idf_j, \qquad idf_j = \log_2\!\left(\frac{n}{df_j}\right) \tag{4}$$

This formula indicates that the weight of each term is based on its inverse document frequency in the document collection and on the number of occurrences of the term in the document.
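As a worked illustration of equation 4, the sketch below computes TFIDF weights for a tiny hypothetical collection and scales each vector into [0, 1]. The text does not say how the normalization is done, so dividing by the largest weight is an assumption, and the function names and sample pages are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(pages):
    """pages: list of token lists. Returns (vocabulary, one weight vector per page)."""
    n = len(pages)
    vocabulary = sorted({term for page in pages for term in page})
    # df[t]: number of pages in which term t occurs at least once.
    df = Counter(term for page in pages for term in set(page))
    vectors = []
    for page in pages:
        tf = Counter(page)  # tf[t]: occurrences of t in this page (0 if absent)
        vectors.append([tf[t] * math.log2(n / df[t]) for t in vocabulary])
    return vocabulary, vectors

def normalize(vector):
    """Scale weights into [0, 1] by dividing by the largest weight (assumed scheme)."""
    peak = max(vector, default=0)
    return [w / peak if peak > 0 else 0.0 for w in vector]

# Hypothetical four-page collection, already tokenized.
pages = [
    ["data", "data", "mining"],
    ["mining", "paper"],
    ["paper", "paper", "robot"],
    ["robot"],
]
vocabulary, vectors = tfidf_vectors(pages)
print(vocabulary)             # ['data', 'mining', 'paper', 'robot']
print(vectors[0])             # [4.0, 1.0, 0.0, 0.0]
print(normalize(vectors[0]))  # [1.0, 0.25, 0.0, 0.0]
```

In the first page, 'data' occurs twice and appears in only one of four pages, so its weight is 2 × log2(4/1) = 4, whereas 'mining' occurs once and appears in two pages, giving 1 × log2(4/2) = 1.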

4.2. Image feature extraction
To extract image features, we used the Fourier-Mellin transform (FMT). This method was proposed in the late 1970s for pattern recognition [18] and was later used for signal and image processing. For image retrieval, rotation- and scale-invariant features based on the FMT can be used to measure image similarity. Let f denote the gray-scale level of a 2D image in polar coordinates (r, θ); the FMT of f is defined as:

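$$\forall (k,v) \in \mathbb{Z} \times \mathbb{R}, \quad M_f(k,v) = \frac{1}{2\pi} \int_0^{\infty} \int_0^{2\pi} f(r,\theta)\, r^{-iv}\, e^{-ik\theta}\, d\theta\, \frac{dr}{r} \tag{5}$$

Ghorbel [19] suggested that $f_\sigma(r,\theta) = r^{\sigma} f(r,\theta)$, in which σ is a positive real constant, can be used in place of $f(r,\theta)$ in the standard FMT. Applying equation 5 to $f_\sigma$ gives the analytic Fourier-Mellin transform, which is a unique transform of an image:

$$\forall (k,v) \in \mathbb{Z} \times \mathbb{R}, \quad M_{f_\sigma}(k,v) = \frac{1}{2\pi} \int_0^{\infty} \int_0^{2\pi} f(r,\theta)\, r^{\sigma - iv}\, e^{-ik\theta}\, d\theta\, \frac{dr}{r} \tag{6}$$

$f(r,\theta)$ can be retrieved using the inverse analytic Fourier-Mellin transform, as in equation 7:

$$\forall (r,\theta) \in \mathbb{R}^{*}_{+} \times S^{1}, \quad f(r,\theta) = \int_{-\infty}^{+\infty} \sum_{k \in \mathbb{Z}} M_{f_\sigma}(k,v)\, r^{-\sigma + iv}\, e^{ik\theta}\, dv \tag{7}$$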
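To make the analytic Fourier-Mellin transform concrete, the following sketch approximates the double integral of equation 6 with a midpoint Riemann sum. It rests on several assumptions that are not in the text: the image is given as a Python function of polar coordinates supported on r ≤ r_max, σ = 0.5 and the grid sizes are arbitrary, and the function names are hypothetical. It illustrates the rotation property only and is not the feature extractor used in the system.

```python
import cmath
import math

def afmt(f, k, v, sigma=0.5, r_max=1.0, n_r=200, n_theta=64):
    """Riemann-sum approximation of M_{f_sigma}(k, v) for f(r, theta) with r <= r_max."""
    dr = r_max / n_r
    dtheta = 2 * math.pi / n_theta
    total = 0j
    for a in range(n_r):
        r = (a + 0.5) * dr  # midpoint of the radial cell, which avoids r = 0
        # r^(sigma - iv) * dr / r  ==  r^(sigma - 1) * exp(-i v ln r) * dr
        radial = r ** (sigma - 1) * cmath.exp(-1j * v * math.log(r))
        for b in range(n_theta):
            theta = b * dtheta
            total += f(r, theta) * radial * cmath.exp(-1j * k * theta)
    return total * dr * dtheta / (2 * math.pi)

def pattern(r, theta):
    """Hypothetical gray-level pattern on the unit disc."""
    return 1.0 + math.cos(theta)

alpha = 2 * math.pi * 8 / 64  # rotation by a whole number of angular grid steps

def rotated(r, theta):
    """The same pattern rotated by alpha."""
    return pattern(r, theta - alpha)

m_original = afmt(pattern, k=1, v=2.0)
m_rotated = afmt(rotated, k=1, v=2.0)
# Rotation only multiplies the coefficient by exp(-i k alpha), so magnitudes agree.
print(abs(m_original), abs(m_rotated))
```

Because a rotation changes only the phase of each coefficient, the magnitudes |M(k, v)| can serve as rotation-invariant features; a scaling likewise changes a coefficient by a factor whose modulus depends only on σ and the scale.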

5. Evaluating search results
The project was implemented in PHP and MySQL and deployed on the Apache web server. It builds on the open source TSEP (The Search Engine Project) [20], a simple yet powerful and fast PHP website search engine. To evaluate search performance, the basic measures of precision and recall were used. Recall is the ability of a retrieval system to obtain all or most of the relevant documents in the collection [21]. It is defined as the ratio of the number of relevant records retrieved to the total number of relevant records in the database and is usually expressed as a percentage. High recall indicates that most of the available relevant records in the database have been retrieved.

$$\text{Recall} = \frac{\text{Relevant Records Retrieved}}{\text{Relevant Records in Database}}$$

According to [21], precision is the ratio of the number of relevant records retrieved to the total number of records retrieved, relevant and irrelevant. High precision indicates that most of the retrieved items are relevant.

$$\text{Precision} = \frac{\text{Relevant Records Retrieved}}{\text{All Records Retrieved}}$$
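The two ratios can be checked against one row of Table 1. The sketch below uses the counts reported there for the query "artificial intelligence" (7 documents retrieved, 5 of them relevant, 12 relevant documents in total); the function names are illustrative only.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved records that are relevant (0.0 if nothing was retrieved)."""
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0

def recall(relevant_retrieved, total_relevant):
    """Fraction of relevant records that were retrieved (0.0 if none exist)."""
    return relevant_retrieved / total_relevant if total_relevant else 0.0

# Counts for the query "artificial intelligence" in Table 1.
p = precision(relevant_retrieved=5, total_retrieved=7)
r = recall(relevant_retrieved=5, total_relevant=12)
print(f"precision = {p:.2%}, recall = {r:.2%}")  # precision = 71.43%, recall = 41.67%
```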

To evaluate the proposed search engine, 72 pages were randomly selected from www.dmoz.com, the Open Directory Project (ODP), which powers the core directory services for many web search engines and portals such as Netscape, AOL, Google, and AltaVista. These 72 pages were chosen from 9 categories (Algorithm complexity, Artificial intelligence, Computational geometry, Database, Data mining, Operating systems, Processors, Robotics, and Wireless LAN), each containing 8 sample pages. Apart from these sample pages, some personal homepages were also included to evaluate results in a noisier test bed. Fifteen queries were sent to the engines, and the top 10 retrieved documents were evaluated. Results show an average precision of 69.93% and an average recall of 50.09%. For comparison, Ljosland [22] reported that when counting only the relevant documents, not the partly relevant ones, he obtained precision rates of 40% for AltaVista, 70% for Google, and 40% for AllTheWeb; counting the partly relevant documents as well, the results were 50%, 90%, and 50%. The average precision/recall of the sample queries plotted in Fig. 1 indicates the superiority of the proposed system (Kavosh) over the TSEP system.

The precision and recall test, however, does not take into account the order of the answers or their degree of relevance. We therefore also use an alternative measure, the F-measure, which is the harmonic mean of precision and recall. The traditional F-measure, or balanced F-score, is calculated as:

$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

F is high only when both precision and recall are high. The closer it is to 1, the more of the retrieved documents are relevant; the closer it is to 0, the fewer are relevant. Fig. 2 depicts the F-measure of the Kavosh and TSEP systems.

Another test was done over a set of 50 images from the SUNET picture archive [23]. The database had been pre-classified into 10 different classes. All the images were gray-scale images with 256 colors. Since FMT is invariant to scaling, the size of the images varied considerably. Most of the classes were found with high precision values. The content-based image retrieval system shows an average precision of 54.71% and an average recall of 39.58% over 10 queries. Fig. 3 and Fig. 4 show how the system works.
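A short worked example of the balanced F-score, again using the Table 1 counts for the query "artificial intelligence" (precision 5/7, recall 5/12). The function name is illustrative, and the value computed here is for that single query, not one of the values plotted in the figures.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 5/7 and recall 5/12 for the query "artificial intelligence".
f_ai = f_measure(5 / 7, 5 / 12)
print(f"F = {f_ai:.4f}")  # F = 0.5263

# The harmonic mean is pulled toward the smaller of the two values.
print(f"{f_measure(1.0, 0.125):.4f}")  # 0.2222
```

The second call mirrors a query with perfect precision but 12.5% recall: the arithmetic mean would be about 0.56, while the F-measure stays near 0.22, which is why it penalizes a system that is strong on only one of the two measures.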

6. Conclusion
The proposed mining-retrieval architecture, based on the Fuzzy-ART neural network, deals well with the taxonomy of dynamic web content. Measurements were done over 72 pages of computer and electrical documents. Results show an average precision of 69.93% and an average recall of 50.09% over 15 queries. The content-based image retrieval system shows an average precision of 54.71% and an average recall of 39.58% over 10 queries. We expect the approach to be helpful in finding multimedia content that has no description, which is what we are going to investigate in the next stage of this project.

Table 1. Precision and recall measurement for 15 sample queries (counts are numbers of documents)

Test query                Total retrieved  Relevant retrieved  Total relevant  Precision  Recall
Intel                     5                1                   3               20%        33.33%
artificial intelligence   7                5                   12              71.43%     41.67%
Search engine             22               12                  16              54.55%     75.00%
"search engine"           10               10                  12              100%       83.33%
wireless technology       6                1                   8               16.67%     12.50%
intelligent system        26               8                   20              30.77%     40%
"open source"             4                4                   12              100%       33.33%
Geometry                  1                1                   8               100%       12.50%
Graphic                   8                6                   8               75%        75%
ODBC                      9                2                   2               22.22%     100%
Agent                     4                4                   11              100%       36.36%
"data mining"             2                2                   6               100%       33.33%
Linux                     6                5                   8               83.33%     62.50%
Network                   4                3                   8               75%        37.50%
Robocup                   3                3                   4               100%       75%

Fig 1. The average recall-precision diagram [plot not reproduced; axes: average recall, average precision; series: Kavosh, TSEP]

Fig 2. The harmonic mean of recall-precision [plot not reproduced; axes: query number, harmonic mean; series: Kavosh, TSEP]

Fig 4. The average recall-precision diagram for the image retrieval system [plot not reproduced; axes: average recall, average precision]

Fig 5. The harmonic mean of recall-precision for the image retrieval system [plot not reproduced; axes: query number, harmonic mean]

7. Acknowledgement
The authors wish to acknowledge the support of the Young Iranian Elites Association NGO.

8. References
[1] S. K. Pal, V. Talwar, and P. Mitra, Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions, IEEE Transactions on Neural Networks, Volume: 13, Issue: 5, pp. 1163-1177, 2002.
[2] Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Current techniques, promising directions, and open issues. Journal of Visual Communication and Image Representation, 10(1):39–62, 1999.
[3] Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, Addison-Wesley, (1999)
[4] Salton, "The SMART retrieval system; experiments in automatic document processing", Prentice-Hall, Englewood Cliffs, N. J., 556 pp., 1971
[5] Kohonen, T. (1991). Self-Organizing Maps, Proc. IEEE 78, 1464-1480.
[6] Gordon, A.D. (1999). Classification, 2nd Edition. Chapman & Hall/CRC, Boca Raton.
[7] Grossberg, S. (1988). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23: 1976. Reprinted in Anderson and Rosenfeld, 121-134.
[8] Grossberg, S. (1976). Adaptive pattern recognition and universal recoding: II. Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23: 187-202.
[9] Carpenter G.A., & Grossberg, S., "Invariant pattern recognition and recall by an attentive self-organizing art architecture in a nonstationary world", In Proceedings of the IEEE First International Conference on Neural Networks, June 1987, 737-745.
[10] Carpenter, G. A., & Grossberg, S. (1987). Art 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26: 4919-4930.
[11] Carpenter, G. A., Grossberg, S., Rosen, D.B. (1991). Art2-a: An adaptive resonance algorithm for rapid category learning and recognition. Neural Networks, 4: 493-504.
[12] Carpenter, G. A., & Grossberg, S. (1990). Art3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks, 3: 129-152.
[13] Carpenter, G. A., Grossberg, S., Reynolds, J.H. (1991). Artmap: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4: 565-588.
[14] Carpenter, G. A., Grossberg, S., Rosen, D.B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4: 759-771.
[15] Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J.H., Rosen, D.B., "Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps", IEEE Transactions on Neural Networks, 3(5): September 1992, 698-713.
[16] Moore, B., "ART 1 and Pattern Clustering", in Touretzky, D., Hinton, G. and Sejnowski, T., eds., Proceedings of the 1988 Connectionist Models Summer School, 174-185, San Mateo, CA: Morgan Kaufmann.
[17] Salton, G. and Buckley, C. 1988 Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5): 513–523
[18] T. Yatagay, K. Choji, H. Saito, "Pattern classification using optical Mellin transform and circular photodiode array", Optical Communication, Vol. 38, no. 3, pp. 162-165, Aug. 1981.
[19] F. Ghorbel, "A complete invariant description for gray-level images by the harmonic analysis approach", Pattern Recognition Letters, Vol. 15, pp. 1043-1051, Oct. 1994.
[20] The Search Engine Project, open source code at: http://tsep.sourceforge.net/
[21] Clarke, S., Willett, P., "Estimating the recall performance of search engines", ASLIB Proceedings, 1997. 49 (7), 184-189.
[22] Ljosland M., "Evaluation of Web search engines and the search for better ranking algorithms", SIGIR99 Workshop on Evaluation of Web Retrieval, Padova, Italy, August 19, 1999
[23] Dataset available at: ftp://ftp.sunet.se/pub/pictures/
