Automatic Generation of Semantically Enriched Web Pages by a Text Mining Approach
Hsin-Chang Yang ∗ Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan
∗ Corresponding author. Address: Department of Information Management, National University of Kaohsiung, Kaohsiung, 811, Taiwan, ROC. Tel. +886 7 5919512, Fax +886 7 5919328. Email address: [email protected] (Hsin-Chang Yang). URL: http://www.im.nuk.edu.tw/yanghc
Abstract Nowadays most Web pages contain little structural or supporting information that can reveal their semantics or meanings. To enable automated processing of Web pages, semantic information such as metadata and tags regarding each page should be added to it. Several authoring tools have been developed to help users tackle this task. However, manual or semi-automatic authoring is implausible when we intend to annotate a large number of Web pages. In this work, we propose a method to automatically generate descriptive metadata and tags for a Web page. The idea is to apply the self-organizing map algorithm to cluster the Web pages and discover the relationships between these clusters. At the same time, the themes of each cluster are identified. We then use such relationships and themes to tag the Web pages and to generate metadata for them. Experimental results show that our method can generate semantically relevant metadata and tags for Web pages. Key words: Metadata Generation, Semantic Tagging, Text Mining, Self-Organizing Map
1 Introduction
The vision of the Semantic Web is to provide machine-processable metadata that describes the semantics of resources to facilitate the searching, filtering, condensing, or negotiation of knowledge for human users. A core technology for making the Semantic Web happen, and also for leveraging application areas like knowledge management and e-commerce, is semantic annotation, which turns human-understandable content into a machine-understandable form. For newly created Semantic Web resources, the annotation can be done manually or with the help of authoring tools. However, it is not practical to annotate existing Web pages in this way due to their gigantic number. To promote the Web to the Semantic Web, we need an automated process that discovers the semantics of a Web page and explicitly adds them to the page, generally in the form of XML-based metadata, for future use. Automatic semantic metadata generation thus plays an important role in ensuring the success of the Semantic Web. We would like to emphasize that the metadata mentioned here should carry more semantic meaning than common descriptive metadata such as the Dublin Core. Here, the semantics of a Web page refers to the implicit meanings or topics
that are difficult to extract using simple syntactic features and rules. That is, a more sophisticated process should be applied to analyze the content of the Web page and discover its implicit semantics. Given that most Web pages contain unstructured or semi-structured textual data, the process of extracting implicit knowledge from texts, namely text mining, is naturally introduced to generate semantic metadata.
The intent to annotate existing Web pages with semantic metadata is impractical if the annotation process requires human intervention or does not scale. In fact, the entire annotation process is technically trivial and can be easily automated if we omit its semantic extraction part. Generally, the semantics of a Web page is difficult to determine without human judgment. Automating the semantics extraction process is necessary because even a minor requirement for human effort would result in a tremendous time cost, given the billions of Web pages that exist on the Web. However, such automation is rather difficult because the recognition of semantics is a high-level cognitive process. A suitable text mining process may provide a solution to this automatic semantics extraction problem if three basic requirements are satisfied. The first is that the process should be fully automatic, without, or with a negligible amount of, human intervention. The second is that it should be generalizable and scalable, so that we need not use all existing Web pages for training. The last is that it should be able to extract the real semantics of the Web pages and present it in a human-comprehensible way. The second requirement can be met by a suitable machine learning algorithm. However, the remaining two requirements are considerably more difficult since they involve high-level cognitive processes and a sophisticated design of the extraction procedure.
In this work, we present a semantics extraction method using a text mining approach that meets the requirements mentioned above. Our goal is to generate semantic metadata and tags for a Web page. We should emphasize that a Web page mentioned here is generally an ordinary page in HTML or XML format without meaningful metadata within it. To obtain the semantic metadata and tags of the Web pages, we first cluster a set of training Web pages using the self-organizing map (SOM) algorithm (Kohonen, 1997) and generate two feature maps. A semantics extraction process is then applied to these maps to identify various types of semantic descriptions of a page, including thematic terms that describe the theme of the page, pages related to it, and its important keywords. The extraction process may also construct an ontology of these Web pages. These semantic descriptions, as well as the ontology, are then used to form the metadata and to tag some keywords within the page. Specifically, two kinds of semantic annotations will be generated and added to a Web page. The first kind is a piece of metadata that accompanies the Web page. Such metadata could exist within the page or be stored outside it. The second kind is a set of tags that mark important terms within the page. These tags generally exist within the page as ordinary HTML or XML tags. We believe that a Web page will be enriched with much semantic information through such metadata and tags. The main contribution of our work is that we incorporate semantic annotation and ontology creation in a unified framework. Furthermore, our method requires neither a predefined ontology nor human intervention, and can be applied to dynamic Web pages. Although the generated metadata is rather primitive, we believe that the method will be beneficial to the success of the Semantic Web once its output is migrated to existing standards.
The structure of this paper is as follows. We discuss related work in Sec. 2. In Sec. 3 we introduce the clustering process by SOM and the generation of the two feature maps. The text mining process for metadata generation, which consists of the identification methods for thematic keywords and their relationships, is described in Sec. 4. We then show the experimental results for the proposed method in Sec. 5. Finally, we give some conclusions and discussions in Sec. 6.
2 Related Work
Automatic creation of metadata for Web pages resembles the task of semantic annotation in general. Schemes for semantic annotation of Web pages may be divided into two major categories, namely manual annotation and automatic annotation. Most manual annotation schemes concern tools for annotating Web pages and sharing the annotations. The major concerns of these schemes include the representation of annotations, ease of use, the incorporation of ontologies, the design of efficient sharing methodologies, and the evaluation of annotations. Some works in this area can be found in Bechhofer and Gobel (2001); Erdmann et al. (2000); Handschuh and Staab (2002); Koivunen and Swick (2001); Martin and Eklund (1999); Vargas-Vera et al. (2001b). Meanwhile, a number of annotation tools for producing semantic markup have been developed, such as SHOE (Heflin and Hendler, 2000), Protege-2000 (Noy et al., 2001), OntoAnnotate (Handschuh et al., 2002), MnM (Vargas-Vera et al., 2001a), and Annotea (Kahan et al., 2001). On the other hand, automatic annotation intends to automatically or semi-automatically create annotations for existing Web pages by means of machine learning or syntactic analysis. Our work deals with the automatic creation of semantic metadata and falls into this category. In the following we discuss some related work.
In general, most of the works on semantic annotation require some predefined ontologies to extract, define, and relate the annotations, as we shall discuss below. Here we further differentiate automatic semantic annotation schemes into two models. The first model is called ontology-driven semantic tagging, whose task is to generate a set of tags that semantically describe and tag the original content of a Web page at the sentence or sub-structure level according to some ontologies. Graubitz et al. (2001) developed the DIAsDEM framework to incorporate legacy data and collections of semi-structured documents into an integrated information system that can be queried to support decision processes. They applied a KDD process to derive a preliminary flat XML DTD serving as a quasi-schema for the document archive, enabling the provision of database-like querying services on textual data. The framework was further improved by Winkler and Spiliopoulou (2001), who derived structured XML DTDs in order to extend the previously derived flat DTDs. Dill et al. (2003) introduced the SemTag system, which performs automated semantic tagging of large corpora and focuses on detecting the occurrence of particular entities in Web pages. The application is centralized, as in this work, such that knowledge from the entire corpus is used. However, the tagged terms are predefined in a standard ontology rather than automatically identified. Bonino et al. (2003) proposed an architecture for creating and managing annotations using previously defined ontologies, which allows the tagging of documents at different granularity levels, ranging from the whole document to a single paragraph. Li et al.
(2001) proposed a machine-learning-based automatic annotation approach, implemented in the ALPHA system, to annotate Web pages at the sentence level in RDF format. Kiryakov et al. (2005) describe the semantic annotation scheme of the KIM platform, in which named entities are extracted and tagged according to an upper-level ontology (KIMO) and a knowledge base. All the above-mentioned methods rely on predefined ontologies. Semantic tagging may also be achieved without the use of ontologies. Dingli et al. (2003) proposed the Armadillo architecture, which uses an iterative approach combining information extraction and machine learning to extract annotations for tagging semantically consistent portions of the Web. The knowledge for extracting annotations comes from Web sites that provide structured knowledge, such as Citeseer (www.citeseer.com) and Google (www.google.com). No ontological knowledge is used in their method. The second model for automatic semantic annotation is called semantic metadata generation, whose task is to generate a section of metadata that semantically describes the content of the annotated page. The generated metadata may incorporate ontologies implicitly or explicitly, where an implicit ontology means that a system defines its own semantic categories as well as their relationships, and an explicit ontology means that a system relies on some predefined ontologies. Most semantic metadata generation schemes use ontologies explicitly since it is difficult to generate ontologies. Volz et al. (2004) describe an annotation scheme called deep annotation that manually creates relational metadata for the Web presentation or annotates directly using the logical database schema. The CREAM framework (Handschuh and Staab,
2003) applies a metadata re-recognition process to compare existing metadata literals with newly typed or existing text. It also learns a wrapper from given markup in order to automatically annotate similarly structured pages. Finally, message-extraction-like systems may be used to recognize named entities, propose co-references, and extract relationships from texts. The PANKOW method (Cimiano et al., 2004), a module of the CREAM framework, employs an unsupervised, pattern-based approach to categorize instances with respect to a given ontology. However, the patterns are very limited.
3 SOM Clustering
To obtain the metadata and tags for a Web page, we first perform a clustering process on a training set of Web pages. We then generate feature maps to reveal the relationships among Web pages as well as among keywords. In the following subsections, we start with the preprocessing steps, followed by the clustering process using the SOM learning algorithm. Two labeling processes are then applied to the trained result to construct feature maps which characterize the relationships among keywords and Web pages.
3.1 Preprocessing of Documents
Our approach begins with a standard practice in information retrieval (Salton and McGill, 1983), i.e. the vector space model, to encode documents as vectors, in which each element of a document vector corresponds to a different index term. In this work the training corpus contains a set of Chinese news Web pages posted on the CNA (Central News Agency) newswire. We
select these pages for the following reasons. First, they are publicly available for research purposes and are used as a test set in many works. Second, these pages contain few tags and little metadata, which meets our need. Although these Web pages are mostly written in Chinese, our method can be applied to Web pages in any language, as long as they can be segmented into lists of keywords. To encode a Web page into a vector, we first extract index terms (or keywords) from the page. Traditionally there are two schemes for extracting terms from Chinese texts. One is the character-based scheme and the other is the word-based scheme (Huang and Robertson, 1997). We adopt the second scheme because individual Chinese characters generally carry no context-specific meaning; a word in Chinese is composed of two or more Chinese characters. After extracting words from all training Web pages, we collect all extracted keywords and obtain a vocabulary V for this corpus. It is then used to encode a Web page into a binary vector. In this vector, an element with value 1 indicates the presence of its corresponding word in the Web page; otherwise, a value of 0 indicates the absence of the word. We use the binary vector scheme to encode the Web pages because we intend to cluster Web pages according to the co-occurrence of words, which is irrelevant to the weights of individual words. Note that we only keep those words that belong to the content of the pages, i.e. we discard words within HTML or XML tags in the Web pages.
A problem with this encoding method is that if the vocabulary is very large, the dimensionality of the vectors is also high. In general, dimensions on the order of 1000 to 10000 are very common for even reasonably small collections of Web pages. As a result, techniques for controlling the dimensionality of the vector space are required. In information retrieval several techniques are widely used
to reduce the number of index terms. Unfortunately, these techniques are not fully applicable to Chinese documents. For example, stemming is generally not necessary for Chinese texts. On the other hand, we can use stopword removal and noun group selection to reduce the number of index terms. In this work, we adopt a simple approach of allowing only nouns to be index terms. In our experiments, this approach is able to reduce the vocabulary to a reasonable size.
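To make the encoding concrete, the following Python sketch builds binary document vectors from pages that have already been segmented and part-of-speech tagged. It is a minimal sketch rather than the authors' implementation: the (word, pos) input format, the "noun" tag value, and the stoplist handling are illustrative assumptions.

import numpy as np

def encode_pages(pages, stoplist):
    """Encode segmented Web pages as binary document vectors (nouns only).

    `pages` is assumed to be a list of lists of (word, pos) pairs produced by a
    Chinese word segmenter and POS tagger; the segmenter itself is not shown.
    """
    vocab = sorted({w for page in pages for (w, pos) in page
                    if pos == "noun" and w not in stoplist})
    index = {w: n for n, w in enumerate(vocab)}
    X = np.zeros((len(pages), len(vocab)), dtype=np.int8)
    for i, page in enumerate(pages):
        for w, pos in page:
            if w in index:
                X[i, index[w]] = 1   # presence/absence only, no term weights
    return X, vocab

The returned matrix X and vocabulary list are the document vectors and the vocabulary V used in the rest of this section.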
3.2 Clustering and Feature Map Generation
In this subsection we describe how to organize Web pages and keywords into clusters according to the co-occurrence similarities of keywords. The Web pages in the corpus are first encoded into a set of document vectors as described in Sec. 3.1. We intend to organize these Web pages into a set of clusters such that similar Web pages fall into the same cluster. Moreover, similar clusters should be 'close' in some manner. That is, we should be able to organize the clusters such that clusters that contain similar Web pages are close in some measurement space. The unsupervised learning algorithm of SOM networks (Kohonen, 1997) meets our needs. The SOM algorithm maps a set of high-dimensional vectors to a two-dimensional map of neurons according to the similarities among the vectors. Similar vectors, i.e. vectors with small distance, will map to the same or nearby neurons after the training (or learning) process. That is, the similarity between vectors in the original space is preserved in the mapped space. Applying the SOM algorithm to the document vectors, we actually perform a clustering process on the corpus. A neuron in the trained map can be considered a cluster. Similar Web pages will fall into the same
or neighboring neurons (clusters). Moreover, the similarity of two clusters can be measured by the geometrical distance between their corresponding neurons. To decide the cluster to which a Web page or a keyword belongs, we apply two labeling processes to the Web pages and keywords, respectively. After the Web page labeling process, each Web page is associated with a neuron in the map. We record such associations and obtain the document cluster map (DCM). Similarly, each neuron will be labeled by a set of keywords after the keyword labeling process, and we have the keyword cluster map (KCM). We then use these maps to generate the semantic metadata and tags. We define some notation and describe the clustering process here. Let $x_i = \{x_{in} \mid 1 \le n \le N\}$, $1 \le i \le M$, be the document vector of the ith Web page in the corpus, where N is the number of keywords and M is the number of Web pages in the corpus. We use these vectors as the training inputs to the SOM network. The network consists of a regular grid of neurons. Each neuron in the network has N synapses. Let $w_j = \{w_{jn} \mid 1 \le n \le N\}$, $1 \le j \le J$, be the synaptic weight vector of the jth neuron in the network, where J is the number of neurons in the network. We train the network with the following SOM algorithm:

Step 1 Randomly select a training vector x_i.
Step 2 Find the neuron j whose synaptic weight vector w_j is closest to x_i, i.e.

$\|x_i - w_j\| = \min_{1 \le k \le J} \|x_i - w_k\|.$   (1)

Step 3 For every neuron l in the neighborhood of neuron j, update its synaptic weights by

$w_l^{new} = w_l^{old} + \alpha(t)(x_i - w_l^{old}),$   (2)

where α(t) is the training gain at epoch number t.
Step 4 Repeat Steps 1 through 3 until all training vectors have been selected, then go to Step 5.
Step 5 Increase the epoch number t. If t reaches the preset maximum training epoch number T, halt the training process; otherwise decrease α(t) and the neighborhood size, and go to Step 1.

To obtain the DCM, each document vector is compared with every neuron in the trained map. A Web page is labeled to a neuron if the document vector of this page is the closest to the synaptic weight vector of this neuron. Formally, the ith Web page is labeled to the jth neuron if Eq. 1 holds. When all Web pages have been labeled to the map, we record the Web pages labeled on each neuron and obtain the DCM. In the DCM, each neuron is labeled by a set of Web pages which have similar document vectors, i.e. vectors with many overlapping elements. Such similarity makes a neuron a cluster of similar Web pages in terms of their keyword co-occurrences. Thus Web pages in the same cluster can be considered semantically related, since Web pages that contain many overlapping words should be similar in context. Note that some neurons may have no document labeled on them. We call these neurons the unlabeled neurons. We may obtain the KCM by labeling each neuron with certain keywords, which can be achieved by examining the neurons' synaptic weight vectors. We design the keyword labeling process based on the following observations. Since we use a binary representation for the document vectors, the elements in the weight vectors will move toward either 0 or 1. Ideally, all elements should be either 0 or 1 after the training. In practice this situation is not likely to happen since the training vectors are different. However, an element will have a value near 1 if it is repeatedly moved toward 1 during the training. Since an element
with value 1 in a document vector represents the presence of its corresponding keyword in that Web page, an element with value near 1 in a synaptic weight vector also shows that the neuron has been repeatedly told to 'learn' the corresponding keyword during the training process. This keyword should be important and occur often in those Web pages that are labeled to this neuron. Thus we should label this neuron with those keywords whose corresponding elements have values near 1. According to this interpretation, we design the following keyword labeling process. First we calculate the agglomerative weight vector of a neuron by aggregating the weights of neighboring neurons as follows:

$w_j = \frac{1}{|N_c(j)|} \sum_{k \in N_c(j)} w_k,$   (3)
where N_c(j) is the set of neighboring neurons of neuron j. Unlabeled neurons are not used in Eq. 3. If the nth element of w_j exceeds a predetermined threshold τ_K, the corresponding keyword of that element is labeled to neuron j. To achieve better results, the threshold is a real value near 1. We aggregate neighboring neurons to prevent noisy weight values that may be caused by imperfect convergence. After the labeling process, a neuron may be labeled by several keywords which often co-occur in a set of Web pages. Thus a neuron forms a keyword cluster. The KCM autonomously clusters keywords according to the similarity of their co-occurrence. Keywords that tend to occur simultaneously in the same Web page will be mapped to neighboring neurons in the map. For example, the translated Chinese words for 'neural' and 'network' often occur simultaneously in a Web page. They will map to the same neuron, or to neighboring neurons, in the map because their corresponding elements in the encoded document vectors are both set to 1. Thus a neuron will try to learn these two keywords simultaneously. As a result, their corresponding elements
will have a good chance of having large values and of being labeled to this neuron. Thus we can reveal the relationship between two keywords according to their corresponding neurons in the KCM.
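A minimal sketch of the training and document-labeling steps described above might look as follows. It assumes NumPy, a square map, and simple linear schedules for the training gain and neighborhood radius, which the algorithm above does not fix.

import numpy as np

def train_som(X, J, T=100, alpha0=0.4, seed=0):
    """Train a SOM on binary document vectors X (M x N) with J neurons on a square grid."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    side = int(np.sqrt(J))
    coords = np.array([(j // side, j % side) for j in range(J)], dtype=float)
    W = rng.random((J, N))                                   # synaptic weight vectors
    for t in range(T):
        alpha = alpha0 * (1.0 - t / T)                       # decreasing training gain (Step 5)
        radius = max(1.0, (side / 2.0) * (1.0 - t / T))      # shrinking neighborhood (assumed schedule)
        for i in rng.permutation(M):                         # Step 1: pick each training vector once
            j = int(np.argmin(np.linalg.norm(X[i] - W, axis=1)))   # Step 2, Eq. (1)
            near = np.linalg.norm(coords - coords[j], axis=1) <= radius
            W[near] += alpha * (X[i] - W[near])              # Step 3, Eq. (2)
    return W, coords

def label_dcm(X, W):
    """Label each Web page to its closest neuron, yielding the DCM."""
    return np.array([int(np.argmin(np.linalg.norm(x - W, axis=1))) for x in X])

Calling train_som on the encoded corpus and then label_dcm on the same vectors yields the weight matrix and the DCM used in the remainder of the paper.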
The determination of the threshold is an important issue in labeling keywords. Large threshold values allow fewer keywords to be labeled. These keywords, however, may be more specific in describing the theme of the underlying Web pages. Small threshold values label more keywords to a neuron, but some of them may be erroneous. It is difficult to determine an optimal threshold value. Here we devise a scheme to determine the threshold value for each neuron. Since the values of the elements of the synaptic weight vector reflect the importance of the keywords that the neuron recognizes, we may set the threshold value according to the norm of the weight vector. When a neuron recognizes many keywords as important, many elements of its weight vector have large values. This results in a large norm. The threshold should also be large to prevent too many keywords from being labeled on this neuron. On the other hand, a small norm indicates that the neuron recognizes only a few keywords as important. We should set the threshold to a smaller value to allow an adequate number of keywords to be labeled. According to this observation, we use the following equation to determine the threshold value:
$\tau_K = \beta \|w_j\|,$   (4)

where β is a scale factor and $\|w_j\|$ is the norm of w_j. Note that the threshold τ_K can be different for different neurons.
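A corresponding sketch of the keyword labeling process, following Eqs. 3 and 4 under assumptions consistent with the code above (a grid neighborhood of radius 1, and a scale factor β that must be chosen so the resulting threshold stays near 1), could be:

import numpy as np

def label_kcm(W, coords, dcm, vocab, beta=0.8):
    """Label neurons with keywords via the agglomerative weight vector (Eq. 3)
    and a per-neuron threshold tau_K = beta * ||w_j|| (Eq. 4); beta is a placeholder."""
    J, N = W.shape
    labeled = {int(j) for j in dcm}              # neurons with at least one labeled page
    kcm = {}
    for j in range(J):
        # neighboring labeled neurons within grid distance 1 (unlabeled neurons excluded)
        near = [k for k in labeled
                if np.linalg.norm(coords[k] - coords[j]) <= 1.0]
        if not near:
            continue
        w_agg = W[near].mean(axis=0)             # agglomerative weight vector, Eq. (3)
        tau = beta * np.linalg.norm(w_agg)       # per-neuron threshold, Eq. (4)
        kcm[j] = [vocab[n] for n in range(N) if w_agg[n] > tau]
    return kcm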
4 Metadata and Tags Generation
In this work, two kinds of annotations, namely the metadata and the tags, will be added to an ordinary Web page. In this section we describe the methods used to generate such annotations. We first develop a method to generate an ontology from the feature maps generated as described in Sec. 3.
4.1 Ontology Generation
To enrich a Web page with semantic information, an understanding of the meaning of the page itself and of its relationships to other Web pages helps a great deal. In many cases the meaning (or semantics) of a Web page can be represented by its themes, which are generally represented by one or more keywords. Meanwhile, the relationships between Web pages can be approximated by the relationships between their themes. Therefore, identifying the themes of a Web page and the relationships between these themes is the foundation of annotating Web pages. When these kinds of knowledge have been identified, we in effect obtain an ontology. In this subsection, we develop a method to construct an ontology that will be used to generate metadata and tags for Web pages. An ontology consists of a set of concepts along with the relationships between them. Here we let the concepts be the keywords labeled on the KCM, since these keywords are considered important. The relationships between them can be revealed by the locations of their labeled neurons. Since the SOM preserves the topology of the training feature space, nearby neurons in the trained map should be similar. Therefore, keywords labeled on nearby neurons should be
related. We decide to build relationships among neurons that are close enough, i.e. within a certain distance. The distance between two neurons here is composed of their geometrical distance and their semantic distance. The geometrical distance measures the Euclidean distance between the coordinates of these neurons in the map. For example, for a square formation of J neurons, the coordinates of neuron j are $(j_x, j_y) = (\lfloor j/\sqrt{J} \rfloor,\ j \bmod \sqrt{J})$, where j_x and j_y denote the x and y coordinates of this neuron in the map. Similar definitions of coordinates can be applied to other types of map formation. The geometrical distance between neurons j and k is defined as follows:

$D_G(j, k) = \sqrt{(j_x - k_x)^2 + (j_y - k_y)^2}.$   (5)
On the other hand, the semantic distance between neurons j and k is the difference between their corresponding synaptic weight vectors, defined as follows:

$D_S(j, k) = \|w_j - w_k\|.$   (6)
The distance between neuron j and k is then calculated by a weighted sum of DG and DS : D(j, k) = DG (j, k) + γDS (j, k),
(7)
where γ is a weight factor that scales the contribution of each component. A common setting is to allow these two components to give equal contributions. Note that the value of D_G(j, k) ranges from 0 to $\sqrt{2J}$ for a square formation with J neurons, but the value of D_S(j, k) ranges from 0 to |w|, i.e. N. In such a case we should let $\gamma = \sqrt{2J}/N$ to balance the contributions of D_G and D_S. The weight factor for other formations can be determined in a similar way. According to the above distance definitions, we relate neurons j and k if their distance D(j, k) is smaller than some threshold τ_D. Note that $0 \le \tau_D \le \sqrt{2J}$ in our example case. Here we let a neuron be related to itself.
When two neurons are related, all the keywords labeled on these neurons are also related. The relationships among keywords can then be established and an ontology is constructed.
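Under the same assumptions as the earlier sketches, the ontology construction described in this subsection can be sketched as follows; γ and τ_D are passed in as parameters (for example γ = √(2J)/N as suggested above), and the ontology is kept as a plain keyword-to-keywords mapping for illustration.

import numpy as np
from itertools import combinations

def build_ontology(W, coords, kcm, gamma, tau_D):
    """Relate keywords labeled to neurons whose combined distance D (Eq. 7) is below tau_D."""
    related = {j: {j} for j in kcm}                       # every neuron is related to itself
    for j, k in combinations(sorted(kcm), 2):
        d_g = np.linalg.norm(coords[j] - coords[k])       # geometrical distance, Eq. (5)
        d_s = np.linalg.norm(W[j] - W[k])                 # semantic distance, Eq. (6)
        if d_g + gamma * d_s < tau_D:                     # combined distance, Eq. (7)
            related[j].add(k)
            related[k].add(j)
    # keywords on related neurons are related; this mapping plays the role of the ontology
    ontology = {kw: set() for kws in kcm.values() for kw in kws}
    for j, neighbours in related.items():
        for k in neighbours:
            for kw in kcm[j]:
                ontology[kw].update(kcm[k])
    return ontology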
4.2 Generating Metadata
The metadata of a Web page in this work consists of four parts. The first part is the important keywords section, which contains a set of high-frequency keywords that appear in this Web page. The second part is the important themes section, which is a set of identified themes that reflect the main interest of the Web page. The third part is the related themes section, which depicts some themes related to those in the second part as well as the relationships among the themes. Finally, we include a related pages section, which contains a set of Web pages that are similar to the annotated one. The important keywords are selected according to the standard tf·idf weighting scheme. We select those keywords whose tf·idf weight exceeds a threshold. These keywords, together with their weights, are recorded in the important keywords section as follows: keyword1:weight1, · · ·, keywordn:weightn. This kind of information can be used for simple search and content highlighting. The second part contains a set of themes, which have been discovered in the KCM. In the KCM, each neuron is labeled by a set of keywords that are important to those Web pages labeled to the same neuron in the DCM. Thus it is straightforward to obtain the themes of a Web page through
investigating the KCM and DCM. We first find the neuron to which the Web page is labeled in the DCM. Let it be neuron j. The themes of this Web page are then those keywords labeled to neuron j in the KCM. The corresponding element values of these keywords in the agglomerative weight vector w_j are used as the weights of these keywords. We record these themes and their weights in the important themes section as follows: keyword1:weight1, · · ·, keywordn:weightn. Note that the identified themes may not appear in this Web page. To obtain the third part, which depicts the relationships among the themes, we use the ontology constructed in Sec. 4.1. When the themes of a Web page have been identified as described above, their related themes can be derived from the ontology. Let $\{K_1, K_2, \ldots, K_t\}$ be the set of themes of this Web page. The related themes consist of all keywords that are related to these themes. That is,
$R = \bigcup_{1 \le k \le t} R_{K_k},$   (8)
where R is the set of related themes and $R_{K_k}$ is the set of keywords related to K_k as defined in the ontology. We use the distance defined in Eq. 7 as the weight of a related keyword. These related keywords and their weights are recorded in the related themes section as follows: keyword1:weight1, · · ·, keywordn:weightn. Note that we omit themes that already occur in the THEMES part.
Finally, we obtain the related pages of a Web page according to the DCM. We define the related pages of a Web page as those pages labeled to the same neuron. We may also include pages labeled to nearby neurons as related pages, since these neurons are considered similar by virtue of the SOM. When Web page j is related to Web page i, the weight of Web page j is the normalized difference between their document vectors, i.e. $\|x_i - x_j\|/N$, multiplied by the geometrical distance between their corresponding neurons. Note that a value of 1 is added to the geometrical distance defined in Eq. 5 to avoid zero distances. We then record these related pages as well as their weights as follows: page1:weight1, · · ·, pagen:weightn.
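The following sketch assembles the four metadata sections for one page. The tf·idf cutoff tau_w, the dictionary layout, and the omission of related-page weights are illustrative simplifications, not the paper's exact output format; w_agg is assumed to be a mapping from neuron index to its agglomerative weight vector (Eq. 3).

def generate_metadata(i, tfidf_row, vocab, dcm, kcm, w_agg, ontology, tau_w=0.1):
    """Build the KEYWORDS, THEMES, RELATED_THEMES, and RELATED_PAGES sections for page i."""
    index = {kw: n for n, kw in enumerate(vocab)}
    j = int(dcm[i])                                        # the neuron page i is labeled to
    keywords = {vocab[n]: float(tfidf_row[n])              # high tf-idf keywords of the page
                for n in range(len(vocab)) if tfidf_row[n] > tau_w}
    themes = {kw: float(w_agg[j][index[kw]])               # KCM keywords of that neuron
              for kw in kcm.get(j, [])}
    related = {kw for t in themes for kw in ontology.get(t, set())} - set(themes)
    pages = [p for p in range(len(dcm)) if dcm[p] == j and p != i]   # weights omitted here
    return {"KEYWORDS": keywords, "THEMES": themes,
            "RELATED_THEMES": sorted(related), "RELATED_PAGES": pages}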
4.3 Generating Tags
We define tags as pieces of information that accompany pieces of text in a Web page. In a plain Web page, several types of tags have been defined. These tags are generally syntactic and used for typesetting or for the inclusion of multimedia objects. In this subsection we develop several types of semantic tags that provide related information about some tagged text when a user browses the page. Different from syntactic tags, semantic tags generally provide semantic information about the tagged text. A common type of semantic tag is the part-of-speech tag, but it is often considered a syntactic tag. It is difficult to draw a strict line between semantic tagging and syntactic tagging since they often overlap. We adopt a broader view of semantics here. A semantic tag, in this work, is a piece of information that describes the semantics of the tagged text. Such information could
have various levels of abstraction. For example, it could be some metaphor behind the text, the conclusion given in the text, or the causality of the text. However, such high-level information is rather difficult to derive. Thus, in practice the tags assigned to some texts are often simple categorical labels. For example, semantic tagging Web sites such as Flickr (http://flickr.com), Del.icio.us (http://del.icio.us), and Technorati (http://technorati.com) allow users to tag a Web page with user-defined categories. In these sites, users can add tags to photos, posts, Web sites, etc. to help searching. Similarly, our goal is to generate tags that can be used to provide guidance for a user to find related Web pages. We tag keywords in the Web page with their themes. It is then simple to find the related pages and related themes through the ontology developed in Sec. 4.1. In a Web page, we tag all keywords that appear in the KCM. The tagged information of a keyword is the index of its labeled neuron in the KCM. For example, let the keyword 'knowledge' be labeled to neuron 17 in the KCM. The keyword is then tagged as follows: ...knowledge... When there are multiple choices in the KCM for the keyword, we simply select the neuron which is closest to the one to which this page is labeled. We can then find related themes and Web pages according to this neuron index. To facilitate such representation, it is necessary to refer to both the KCM and the DCM. A plausible approach is to include additional metadata that point to the KCM and DCM as follows:
URL of the KCM
URL of the DCM
An agent or a Web browser that uses the tags of keywords should refer to these maps to retrieve the necessary information. In fact, the metadata generated in Sec. 4.2, except for the KEYWORDS part, could be abbreviated in this way. However, we make the references explicit during the generation of metadata to increase its comprehensibility and to alleviate the burden on the agent or browser. On the other hand, there are many keywords that need to be tagged and not all of them will be used by users. Accompanying each keyword with all of its related themes and pages would dramatically increase the volume of the page. The index approach conserves a great deal of space and, at the same time, provides a way to find helpful information regarding the keywords.
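A sketch of the tagging step, using a hypothetical <THEME NEURON="..."> tag (the text above does not fix the tag vocabulary, so the tag name and attribute are illustrative), might be:

import re
import numpy as np

def tag_keywords(text, page_neuron, kcm, coords):
    """Wrap occurrences of KCM keywords in the page text with their neuron index."""
    kw_neurons = {}
    for j, kws in kcm.items():                 # invert the KCM: keyword -> labeled neurons
        for kw in kws:
            kw_neurons.setdefault(kw, []).append(j)
    for kw, neurons in kw_neurons.items():
        # when a keyword labels several neurons, pick the one closest to the page's neuron
        j = min(neurons, key=lambda k: np.linalg.norm(coords[k] - coords[page_neuron]))
        text = re.sub(re.escape(kw), '<THEME NEURON="%d">%s</THEME>' % (j, kw), text)
    return text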
4.4 Application Scenario
We give a scenario to demonstrate the usage and feasibility of the generated metadata and tags. A search engine sends some crawlers to gather pages and builds indices of them using the keywords within these pages. With the generated metadata, the search engine not only can retrieve pages using themes rather than keywords, but can also provide users with related themes and pages for each page. For example, when a user sends the query word 'SOM', the search engine may show a list of pages containing 'SOM', 'self-organizing maps', or 'self-organizing feature maps', since they all have the same theme. Note that these pages may not contain the query word. Together with each page, the search engine also shows links to its related themes and related pages. A fictitious result is depicted in Fig. 1 to demonstrate possible
outcomes of the search engine. The user can click on the Related Pages button to view other related pages. He can also click on the Related Themes button to browse a list of themes related to this page. When a user browses a page, he will find some keywords highlighted by double underlines to distinguish them from normal hyperlinks. These keywords are tagged as described in Sec. 4.3. When he moves the cursor over a tagged keyword, a dialog window appears and shows the semantic annotations of this keyword. This is depicted in Fig. 2. The user can then move the cursor to the dialog box and click on a keyword for a list of pages related to it, or on the 'Find related pages' link to obtain a list of pages related to the double-underlined keyword, which is 'SOM' in the figure.
5 Experimental Results
We applied our method to the Chinese news articles posted daily on the Web by CNA (Central News Agency, Taiwan). Two corpora were constructed in our experiments. The first corpus (CORPUS-1) contains 100 news articles posted on Aug. 1, 2, and 3, 1996. The second corpus (CORPUS-2) contains 3268 Web pages (or documents, interchangeably) posted during Oct. 1 to Oct. 9, 1996. CORPUS-1 is rather small and is used for explanatory purposes only. A word extraction process was applied to the corpora to extract Chinese words. A total of 1475 and 10937 words were extracted from CORPUS-1 and CORPUS-2, respectively. To reduce the dimensionality of the feature vectors we discarded those words which occur only once in a page. We also discarded words appearing in a manually constructed stoplist. Finally, we discarded all words other than nouns. These processes reduce the number of words to 414
and 1567 for CORPUS-1 and CORPUS-2, respectively. Reduction rates of 72% and 86% are thus achieved for the two corpora, respectively. To train CORPUS-1, we constructed a self-organizing map which contains 64 neurons in an 8 × 8 grid format. The number of neurons is determined experimentally such that a better clustering can be achieved. Each neuron in the map contains 563 synapses. The initial training gain is set to 0.4 and the maximal training epoch number is set to 100. These settings are also determined experimentally. We tried different gain values ranging from 0.1 to 1.0 and various training epoch settings ranging from 50 to 200, and simply adopted the setting which achieves the most satisfying result. After training we labeled the map with keywords and Web pages by the methods described in Sec. 3.2, and obtained the KCM and DCM, respectively, for CORPUS-1. The above process was also applied to CORPUS-2 to obtain its KCM and DCM. The thresholds for labeling keywords in constructing the KCMs were set to 0.7 for both corpora. Fig. 3 shows the KCM of CORPUS-1. To verify the effectiveness of the generated maps, we performed two evaluation processes on these maps. The first process evaluates the goodness of the DCM, which clusters similar documents together. We first categorized the training Web pages in CORPUS-1 and CORPUS-2 manually and created 27 and 229 classes, respectively. The categorization is based on semantic similarity as judged by three examiners. We do not assign labels to these categories since we only need to know the relationships among the documents. A strict distinction between categories is adopted. That is, we consider documents in different categories to be semantically different, although such a distinction is always fuzzy in practice. However, such a strict distinction simplifies the evaluation process. When two documents in the same category are labeled to different neurons
in the DCM, we give a score of 1 to this pair of documents. Otherwise, the pair has score 0. We denote the score between documents d_i and d_j as s_{i,j}. The average score over every pair of documents is calculated as follows:

$S = \frac{1}{\binom{M}{2}} \sum_{1 \le i < j \le M} s_{i,j}.$   (9)
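A direct implementation of the pairwise score S, exactly as stated in the text above (s_{i,j} = 1 when two same-category documents are labeled to different neurons, 0 otherwise), could be:

from itertools import combinations

def average_score(dcm, categories):
    """Compute S of Eq. (9) from the DCM labels and the manually assigned categories."""
    M = len(dcm)
    s = sum(1 for i, j in combinations(range(M), 2)
            if categories[i] == categories[j] and dcm[i] != dcm[j])
    return s / (M * (M - 1) / 2)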
Fig. 6. A sample Web page with automatically generated metadata and tags. The necessary keywords are translated into English in parentheses following the keywords.