Visualizing Ontology Components through Self-Organizing Maps

D Elliman and JRG Pulido∗
Image Processing and Interpretation Research Group
Computer Science and Information Technology School
University of Nottingham, United Kingdom
[dge|jrp]@cs.nott.ac.uk

Abstract This paper describes a method for identifying ontology components by using Self-Organizing Maps. Our system represents the knowledge contained in a particular digital archive by assembling and displaying the ontology's components. This novel approach provides an alternative solution to the problem of classifying and retrieving on-line information, supports mechanisms that explore domains, and allows knowledge to be displayed in a browsable manner.

1. Introduction Many web sites contain a large number of pages which have grown, often in a rather uncontrolled manner, over time. Some sites are neat and tidy, like a well-kept garden, whilst others are an overgrown wilderness in which it is difficult to find anything of interest, even if it is known to be hidden there somewhere. It would be useful to be able to construct some representation of the information in the site which would support intelligent and interactive searching, and could also be used as an index. In this paper we describe an approach for constructing an ontology for such web sites. The remainder of this paper is organized as follows. In section 2 some related work is introduced. Our methods are outlined in section 3. Results are presented in section 4, and conclusions and further work in section 5.

2. Related Work There is a growing need for methods of systematic explorative information retrieval where the exact keywords, which could guide the user to relevant and interesting information, may not be known in advance. Data browsing capabilities are essential for the effective use of on-line archives as they facilitate the discovery of information. One of the most important challenges that the semantic web poses in dealing with large amounts of on-line knowledge is the mapping of unstructured information, suitable for humans, to a formal representation of knowledge [3]. In the next subsections we have a brief look at some work done on Ontologies as well as Browsing Systems involving Self-Organizing Maps.

∗ Corresponding author

2.1. Constructing Ontologies A representation that brings order and structure to a web site can be referred to as an Ontology. Representing knowledge about a domain as an ontology is a challenging process which is difficult to achieve in a consistent and rigorous way. It is easy to lose consistency and to introduce ambiguity and confusion. The most famous example of this was the ambiguity of the meaning of the is-a relationship as discussed by Brachman in [2]. Web ontologies can take rather different forms to traditional ones. In [5] the use of the so-called Simple HTML Ontology Extension (SHOE) in a real world internet application is described. This approach allows authors to add semantic content to web pages, relating the context to common ontologies that provide contextual information about the domain. A similar approach is presented in [1]. Most tag-annotated web pages tend to categorize concepts; therefore there is no need for complex inference rules to perform automatic classification. Ontoseek [4] is a content-based information retrieval system that uses a large ontology based on WordNet and is constructed from on-line yellow pages. It adopts simple conceptual graphs to represent queries and resource descriptions.

2.2. Browsing Systems An interesting project is presented in [6], which describes the results of applying WEBSOM2, a document organization, searching and browsing system, to a set of about 7 million electronic patent abstracts. In this case, a document map is presented as a series of HTML pages facilitating exploration. A specified number of best-matching points are marked with a symbol and can then be used as starting points for browsing. In [10] a distributed architecture for the extraction of meta-data from WWW documents is proposed that is particularly suited for repositories of historical publications. This information extraction system is based on semi-structured data analysis. The system output is a meta-data object containing a concise representation of the corresponding publication and its components. Here gatherers have been designed as a combination of a parser, based on a context-free grammar, and a web robot, which navigates the links contained in the basic document type to infer the document structure of the entire site. These meta-data objects can be interchanged with other web agents, then classified and organized.

3. Methods

Our software is written in Java, which offers robust and easy-to-use networking functionality as well as object-oriented features. The Java language and its various APIs are powerful enough for constructing software systems that build ontologies. The collections framework provides efficient hash table structures for use as word lists, and support for sockets and for parsing HTML and XML has been available since version 1.2 of the SDK. Our system will organize the knowledge extracted (figure 1) from a digital archive, e.g. a digital library, or the web itself, as follows:

1. Set of objects (entities)
2. Set of functions (is-a)
3. Set of relations (has, part-of)

Figure 1. After a SOM process, a randomly initialized SOM (Left) becomes a categorised SOM (Right).

3.1. The Algorithm

The major steps of our approach are as follows: 1) Retrieve hyperlinks from a predefined digital archive, 2) Preprocess each hyperlink, 3) Produce a document space, 4) Construct the SOM [8], 5) Create the Ontology [3]. Let us now describe these steps in detail.

1. Retrieve hyperlinks from a predefined digital archive. For our analysis we have only retrieved local ones, e.g. cs.nott.ac.uk.
2. Preprocess each hyperlink. For each file, the following is done:
   (a) Remove HTML tags.
   (b) Remove words by using a stoplist, e.g. the, by, but, and the like. Common words that carry little information are pruned from the files.
   (c) For each remaining word, the following is done:
       i. Its stem is obtained, e.g. play is the stem of the words plays, playing, played.
       ii. A weighted value is given to it by using the tf x idf [9] weighting scheme.
       iii. A vector space is created for each file.
3. Produce a document space. By using the vector spaces of the previous step, a document space is created.
4. Construct the SOM. By using a suitable number¹ of cells, the map processes the docuspace.

¹ A square map is used, therefore the number of cells (n) is calculated as n = 2*y and the side of the map as √n. If y = 538 then n = 2*538 = 1076 and √n = 32.8 ≈ 30 (cf. subsection 4.2).
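As an illustration only, the stoplist filtering of step 2(b), the tf x idf weighting of step 2(c) and the map sizing of footnote 1 might be sketched in Java as follows. The stoplist contents, the tokenizer, the idf variant ln(N/df) and all class and method names are assumptions made for this sketch and are not taken from our system; stemming is omitted for brevity. Reading y in the footnote as the number of documents is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DocSpaceSketch {
    // Hypothetical stoplist; the list actually used is not given in the text.
    static final Set<String> STOPLIST =
            new HashSet<>(Arrays.asList("the", "by", "but", "and", "a", "of"));

    // Step 2(b): lower-case the text, split on non-letters, drop stoplist words.
    static List<String> tokens(String text) {
        List<String> kept = new ArrayList<>();
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty() && !STOPLIST.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    // Step 2(c): weight = tf * ln(N / df), one common tf x idf variant (assumed).
    static List<Map<String, Double>> weigh(List<List<String>> docs) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term frequency within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                vector.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    // Footnote 1: n = 2*y cells, so a square map has side sqrt(n).
    static double mapSide(int y) {
        return Math.sqrt(2.0 * y);
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokens("The image and the vision group"));
        docs.add(tokens("Image recognition by the group"));
        System.out.println(weigh(docs));
        System.out.println("map side for y = 538: " + mapSide(538));
    }
}
```

With this idf variant a term that occurs in every document receives weight zero, which is one reason such terms contribute little to the document space.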

Attribute dove  hen   duck  goose owl   hawk  eagle fox   dog   wolf  cat   tiger lion  horse zebra cow
small     1     1     1     1     1     1     0     0     0     0     1     0     0     0     0     0
medium    0     0     0     0     0     0     1     1     1     1     0     0     0     0     0     0
big       0     0     0     0     0     0     0     0     0     0     0     1     1     1     1     1
two legs  1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0
four legs 0     0     0     0     0     0     0     1     1     1     1     1     1     1     1     1
hair      0     0     0     0     0     0     0     1     1     1     1     1     1     1     1     1
hooves    0     0     0     0     0     0     0     0     0     0     0     0     0     1     1     1
mane      0     0     0     0     0     0     0     0     0     1     0     0     1     1     1     0
feathers  1     1     1     1     1     1     1     0     0     0     0     0     0     0     0     0
hunt      0     0     0     0     1     1     1     1     0     1     1     1     1     0     0     0
run       0     0     0     0     0     0     0     0     1     1     0     1     1     1     1     0
fly       1     0     0     1     1     1     1     0     0     0     0     0     0     0     0     0
swim      0     0     1     1     0     0     0     0     0     0     0     0     0     0     0     0

Table 1. Animals (Entities) and Attributes in the so-called transpose way (13x16). In the front way (16x13), Entities are rows and Attributes columns.
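To make the use of Table 1 concrete, the following is a minimal sketch of a generic sequential Kohonen SOM trained on the sixteen entity vectors (the front way). It is not our actual implementation: the class and method names, the number of epochs, the learning rate, the Gaussian neighbourhood, the linear decay schedules and the random seed are all assumptions chosen for illustration. Only the 4x4 map size and the data are taken from the paper.

```java
import java.util.Random;

class AnimalSomSketch {
    static final String[] NAMES = {"dove", "hen", "duck", "goose", "owl", "hawk", "eagle", "fox",
            "dog", "wolf", "cat", "tiger", "lion", "horse", "zebra", "cow"};
    // One string per animal; attribute order as in Table 1 (small ... swim).
    static final String[] BITS = {
            "1001000010010", "1001000010000", "1001000010001", "1001000010011",
            "1001000011010", "1001000011010", "0101000011010", "0100110001000",
            "0100110000100", "0100110101100", "1000110001000", "0010110001100",
            "0010110101100", "0010111100100", "0010111100100", "0010111000000"};
    static final int SIDE = 4;   // 4x4 map, as in subsection 4.1
    static final int DIM = 13;   // number of attributes

    static double[] vector(String bits) {
        double[] v = new double[bits.length()];
        for (int i = 0; i < v.length; i++) {
            v[i] = bits.charAt(i) - '0';
        }
        return v;
    }

    // Index of the best-matching unit (smallest squared Euclidean distance).
    static int bmu(double[][] weights, double[] x) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < weights.length; c++) {
            double dist = 0;
            for (int i = 0; i < x.length; i++) {
                double diff = weights[c][i] - x[i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Sequential training with a learning rate and radius that decay linearly (assumed schedule).
    static double[][] train(int epochs, long seed) {
        Random rnd = new Random(seed);
        double[][] w = new double[SIDE * SIDE][DIM];
        for (double[] cell : w) {
            for (int i = 0; i < DIM; i++) {
                cell[i] = rnd.nextDouble();   // random initialization
            }
        }
        for (int t = 0; t < epochs; t++) {
            double frac = 1.0 - (double) t / epochs;
            double rate = 0.5 * frac;
            double radius = 0.5 + (SIDE / 2.0) * frac;
            for (String bits : BITS) {
                double[] x = vector(bits);
                int b = bmu(w, x);
                for (int c = 0; c < w.length; c++) {
                    int dr = c / SIDE - b / SIDE;
                    int dc = c % SIDE - b % SIDE;
                    // Gaussian neighbourhood around the best-matching unit.
                    double h = Math.exp(-(dr * dr + dc * dc) / (2 * radius * radius));
                    for (int i = 0; i < DIM; i++) {
                        w[c][i] += rate * h * (x[i] - w[c][i]);
                    }
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] w = train(200, 42L);
        for (int a = 0; a < NAMES.length; a++) {
            int cell = bmu(w, vector(BITS[a]));
            System.out.println(NAMES[a] + " -> row " + cell / SIDE + ", col " + cell % SIDE);
        }
    }
}
```

Because the best-matching unit depends only on the input vector, entities with identical rows in Table 1 (owl and hawk, horse and zebra) necessarily land in the same cell, whatever the training parameters.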

5. Create the Ontology. Once the SOM is done, ontology components can be visualized. This produces results that are often surprisingly close to the user's intuitive expectation.

Figure 2. Ontology components for the Entity University. [Diagram: the entities University, School, Research group, Person, First name, Researcher, Publication, Book and Article, linked by part-of, has and is-a relations.]

4. Results

This section presents two experiments that we have carried out. First we present, though in a different way, some results that have been published before, cf. [8, 7]. Then in subsection 4.2, the results from applying our system to a digital archive, for identifying ontology components, are shown. Both subsections show the data in two ways, the front way (Attributes – e.g. part-of) and the transpose way (Entities – e.g. has).

4.1. Classifying Animals

In [7] the animal dataset (table 1) is presented by means of an HTML page. Our approach uses a 4x4 SOM and presents the same data by using colored areas. In our experiment we found that one dominant characteristic amongst the animals is their size, e.g. birds are small, mammals come in two sizes. On the other hand, birds of prey and hunting mammals, small animals with feathers, big animals with hooves, and the ones with four legs and hair are also clustered together. This is consistent with earlier tests carried out on the dataset. Both SOMs are shown in figure 3. Note that the vector spaces for zebra and horse, and for owl and hawk, are equal. The ones for hen and duck are approximately equal. Similarly, the vector spaces for the Attributes feathers and two legs, and for hair and four legs, are equal. That is why some areas overlap and produce a combination of colorings.

4.2. Classifying Digital Archives

For our second analysis a set of web pages of the Computer Science Department² at Nottingham University has been used. They were gathered and processed as described in section 3. We started off with 954 hyperlinks and a lexicon size of 13683. After reducing the docuspace dimensionality, by disregarding documents with fewer than 20 terms and terms appearing in fewer than 20 documents, we ended up with a 538x923 dataset. A 96.2% reduction in the docuspace was achieved. Then a 30x30 SOM and a 40x40 SOM were used for these analyses. It took just over an hour and a half³ on a Sun Enterprise 250 (2*400MHz UltraSPARC CPU, 2Gb RAM) computer. Browsing the SOMs gives us a clear idea of, and helps us understand, what the domain is all about. For instance we can readily identify people within the domain, their roles⁴, modules⁵ that are taught, and research⁶ interests of the members of the school. Terms like ieee, confer(ence), proceed(ings), workshop, journal, spring(er)⁷, and even the location of the school (wollaton, jubil(e), campu(s), nottingham), and how to reach it (driv(e), rout(e), map, direc(tion), guid(e)) are clustered together. Further subcategories are also visualized, for instance, within the Image Processing and Interpretation Research Group we found terms like text, vision, ai, colour, recognition, image grouped (figure 4).

² http://www.cs.nott.ac.uk
³ Each dataset, front and transpose way, took about half an hour.
⁴ Professor, student, head, tutor, assistant, lecturer.
⁵ Databases, java, data structures, artificial intelligence.
⁶ Scheduling, software agents, functional programming.
⁷ Terms truncated by the stemmer.

Figure 3. Left. Attributes: Upper area on the right: small, two legs; middle area on the left: feathers, fly, swim; rest (darker area): medium, hair, four legs, big, hooves, mane, run, hunt. Right. Entities: Upper area on the left: dove, hen, duck, goose, owl, hawk, eagle; lower area on the left: eagle, dog, fox, wolf; on the right: tiger, lion, zebra, cow, horse.

Figure 4. Left. Attributes to Entities. Right. Entities to Attributes. Both describing University, Department, School, People, Roles people play, Modules people teach, and several Research interests.

5. Conclusions

An ontology is a form of knowledge representation that can be used to give a sense of order to unstructured digital sources. It also provides a common vocabulary of concepts and relationships which may be used to inform a viewer, a search engine, or other software entities. When computer agents are used as the intermediaries in collaborative work, the need for a common ontology representational structure becomes very important. A common ontology would enable collaborators to work together with a minimal risk of misunderstanding. The acquisition and representation of knowledge needs to take into account the complexity that is often present in domains as well as the needs of users or agents carrying out the search. Principled techniques that allow the ontological engineer to deal with the problems caused by such complexity need to be developed, and the ideas in this paper have shown promise as avenues of investigation.

HTML tags such as heading levels and emphasis, for example text size and boldface, provide heuristics as to the importance of words. Sometimes the words in the first sentence of a paragraph may be considered more important than other words. These heuristics can be used to weight the selection of words after principal component analysis, or simply to select sufficiently important words that may then earn a place in the feature vector by being placed in important points, rather than for their discriminatory power. Synonyms and hypernyms can make this more powerful, and can provide a strong hint of the structure of an ontology.

References

[1] R. Benjamins, D. Fensel, S. Decker, and A. Gomez. (KA)2: Building ontologies for the internet: A midterm report. International Journal of Human Computer Studies, 51(3):687–712, 1999.
[2] R. Brachman. What is-a and isn't: An analysis of taxonomic links in semantic networks. IEEE Computer, 16(10):10–36, 1983.
[3] L. Crow and N. Shadbolt. Extracting focused knowledge from the semantic web. International Journal of Human Computer Studies, 54:155–184, 2001.
[4] N. Guarino, C. Masolo, and G. Vetere. Ontoseek: Content-based access to the web. IEEE Intelligent Systems, pages 70–80, 1999.
[5] J. Heflin and J. Hendler. Dynamic ontologies on the web. In American Association For Artificial Intelligence Conference, pages 251–254. AAAI Press, California, 2000.
[6] T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and H. Saarela. Self organization of a massive text document collection. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 171–182. Elsevier Science, Amsterdam, 1999.
[7] D. Merkl. Text classification with self-organizing maps: Some lessons learned. Neurocomputing, 21:61–77, 1998.
[8] H. Ritter and T. Kohonen. Self-organizing semantic maps. Biological Cybernetics, 61:241–254, 1989.
[9] G. Salton, J. Allan, and A. Singhal. Automatic text decomposition and structuring. Information Processing and Management, 32(2):127–138, 1996.
[10] I. Sanz, R. Berlanga, and M. Aramburu. Gathering metadata from web-based repositories of historical publications. In A. Tjoa and R. Wagner, editors, 9th International Workshop on Database and Expert Systems Applications, pages 473–478. IEEE Computer Soc Press, Los Alamitos, 1998.