Exploring Open Web Directory for Improving the Performance of Text Document Clustering
by
Gaurav Ruhela
Roll Number: 200402012
[email protected]
Thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science (by Research) in Computer Science and Engineering
Center for Data Engineering
International Institute of Information Technology
Hyderabad, India
January 2010
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Exploring Open Web Directory for Improving the Performance of Text Document Clustering” by Gaurav Ruhela, has been carried out under my supervision and has not been submitted elsewhere for a degree.
Date
Advisor: Prof. P. Krishna Reddy Center for Data Engineering IIIT, Hyderabad
Copyright © Gaurav Ruhela, 2010. All Rights Reserved.
Dedicated to my parents Mrs. Krishna Kant Ruhela and Mr. Subhash Chandra Ruhela, for their everlasting love and support.
Acknowledgments
First and foremost, all praise belongs to God who gave me all the help, knowledge, and courage to finish my work. This work would not have been possible without the help and support of many individuals. I offer my sincerest gratitude to my advisor, Prof. P. Krishna Reddy, who has supported me throughout my thesis with his patience and knowledge whilst allowing me the room to work in my own way. I attribute the level of my Master's degree to his encouragement and effort, and without him this thesis would not have been completed or written. He taught me how to pursue research. He helped to shape the direction of this work, filled in many of the gaps in my knowledge, and helped steer me toward solutions. His constant encouragement and near-miraculous ability to always find time for his students have made working with him a true pleasure. I want to thank all the people in the IT for Agriculture Lab and the Center for Data Engineering Lab for their stimulating company during the past years. My life would not be the same without the many friends I have made here at IIIT-H. My good friends Anshul Agarwal, Pratyush Bhatt, Sandeep Saini and Sumit Maheshwari have kept my life both interesting and entertaining during my MS. The many other people of note include Abheet, Anuj, Arjun, Batra, Dinkar, Gopal, Kochar, Mohit, Naveen, Rashi, Shishir, Shukla, Singla and Swati. Finally, I want to express my gratitude to my mother Mrs. Krishna Kant Ruhela and my father Mr. Subhash Chandra Ruhela for their endless love, support, encouragement, patience and self-sacrifice. I am also thankful to my parents for teaching me the value of knowledge and education. No words in any natural language would be sufficient to thank my parents for all they have done for me. I am thankful to my sister Dr. Garima Ruhela and my brother Saurabh Ruhela for their motivation and everything else that I cannot express.
Abstract

In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, company-wide intranets and so on. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information. The ultimate goal is to help users find what they are looking for. There are mainly two approaches to enhance the task of document organization: the supervised approach, where pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents; and the unsupervised approach, where there is no need for human intervention or labeled documents at any point in the whole process. Fast and high-quality document clustering algorithms play an important role towards the goal of document organization, as they have been shown to provide an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters. It can be noted that the performance of an information retrieval method depends on the feature selection and weight assignment methods employed to extract features from the documents. Normally, features from a document are selected based on some criteria (e.g., frequency). These features are weighted by the TF-IDF (term frequency and inverse document frequency) method. There are research efforts to improve the performance of feature extraction methods by extending concepts from the areas of ontology, open web directories and so on. The vector space model is one of the most widely used models to represent text documents for computing similarity. Cosine similarity is a popular method to calculate the similarity between two vectors. The similarity computed by the cosine similarity method is influenced by the common features (and their weights) between two document vectors. It can be observed that cosine similarity is unable to find similarity based on the meaning or semantics. For example, consider two documents, one with the word “BMW” and the other with the word “Jaguar”. The cosine similarity between these two documents equals zero. Even though these two documents have different vocabulary, they are semantically similar, as both refer to car models. In this thesis, we make an effort to capture the context of a document and to incorporate it into similarity computation. We exploit the generalization ability of hierarchical knowledge repositories such as the Open Web Directory, in which a given term relates to a context, and the context, in turn, relates to a collection of terms; this allows us to extract related terms for each term in the document. In a simple generalization hierarchy of a web directory, a term at a higher level is a generalized concept for all the terms under this node, e.g., sport is a generalized concept for football, cricket, baseball, etc. We can add these generalized terms along with their weights to the document vector. By enriching
the document with generalized contextual terms, there is scope to increase the number of common terms, and thus the similarity between two documents. An improved approach is also proposed in which these generalized terms are later removed from the document vector, leaving behind only the document terms. So, as output, we get a document vector with boosted weights while the dimension space of the document vector remains the same. In addition to feature extraction and feature weighting of a document vector, we propose a method to improve clustering quality by assigning weights to the feature vector of a cluster. The contribution of this thesis is threefold:
1. We propose a framework that performs feature generation (using an open web directory) and enriches the feature vector with new, more informative and discriminative features. We use the topic paths of a term to obtain related contextual and generalized terms.
2. We propose an improved term weighting method called BoostWeight that considers the semantic association between terms. The BoostWeight method increases the weights of terms if they have a common generalized term.
3. We propose a methodology to refine a given set of clusters by incrementally moving documents between clusters. More weight is given to the representative and discriminative features of a cluster.
To compare the performance of the proposed approaches with existing weighting approaches, we have conducted clustering experiments on two data-sets: WebData and the Reuters21578 news corpus. Experimental results show that the proposed approach improves clustering performance over other term extraction and weighting approaches.
Contents

1 Introduction
  1.1 Vector Space Model and Cosine Similarity
  1.2 Problem Description
  1.3 Overview of Proposed Approach
    1.3.1 Open Web Directory
    1.3.2 Overview
  1.4 Thesis Contribution
  1.5 Thesis Organization
2 Document Preprocessing and Representation
  2.1 Representing the Document Vector
  2.2 About Document Features
  2.3 Identifying Features
    2.3.1 Parsing the Documents
    2.3.2 Removing Stop-words
    2.3.3 Stemming
  2.4 Proximity Measurement
  2.5 Summary
3 Related Work
  3.1 Feature Selection
    3.1.1 Bag-of-Words Model
    3.1.2 Extended Bag-of-Words Model
    3.1.3 NLP based Models
  3.2 Feature Generation
  3.3 Feature Weighting
    3.3.1 Supervised Weighting
    3.3.2 Unsupervised Weighting
  3.4 Summary
4 An Approach to Improve Feature Generation by Exploiting Open Web Directory
  4.1 Problems in Bag of Word Approach
  4.2 Basic Idea
    4.2.1 About Open Web Directory
    4.2.2 Overview of Proposed Approach
  4.3 Proposed Approach
    4.3.1 Generation of DTerms for document
    4.3.2 Determine Weight for the terms
    4.3.3 Formation of enriched document vector (d'_d)
  4.4 Experiments
    4.4.1 Experiment without pruning terms
    4.4.2 Experiment with pruning terms
  4.5 Summary
5 BoostWeight: An Approach to Boost the Term Weights in a Document Vector by Exploiting Open Web Directory
  5.1 Issue
  5.2 Basic Idea
  5.3 Proposed Term Weighting Scheme
    5.3.1 Generation of document vector d_m
    5.3.2 Formation of enriched document vector d'_m
    5.3.3 Computing the weight of terms in d'_m
    5.3.4 Boosting the weight of terms in d_m
    5.3.5 Document vector d_m with boosted weights
  5.4 Experiments
  5.5 Comparison of Proposed Schemes
  5.6 Summary
6 Improving Cohesiveness of Text Document Clusters
  6.1 Background
  6.2 Proximity Measure
  6.3 Basic Idea
  6.4 Proposed Cluster Refinement Scheme
    6.4.1 Generation of virtual clusters
    6.4.2 Finding core documents in virtual cluster
    6.4.3 Refinement of virtual clusters
  6.5 Experiments
  6.6 Discussion
  6.7 Summary
7 Conclusion and Future Work
  7.1 Summary
  7.2 Conclusion
  7.3 Limitation of the Proposed Work
  7.4 Future Work
Publications
Bibliography

List of Figures

2.1 Vector Representation of Documents
4.1 Relation between DTerm_i and its topic paths
4.2 Document after merging topic paths for all DTerms
4.3 Topic Paths for the term BMW
4.4 Topic Paths for the term Jaguar
4.5 Terms Contributing in Cosine Similarity Computation
4.6 Flowchart of Proposed Approach
4.7 Average Purity values without Pruning Terms
4.8 Average Purity values after removing terms with frequency < 6
4.9 Average Purity values after removing terms with frequency < 31
5.1 Sports as generalized term of soccer and cricket
5.2 Relation between DTerm_i and its topic paths
5.3 Flowchart of the proposed approach
5.4 Document after merging all topic paths for all DTerms
5.5 Topic Path for term BMW
5.6 Topic Paths for term Jaguar
5.7 Relation Between GTerms and DTerms in a document
5.8 Purity values for WebData dataset
5.9 Purity values for Reuter21578 dataset
5.10 Entropy values for WebData dataset
5.11 Entropy values for Reuter21578 dataset
6.1 Refining of Clusters
6.2 Average Purity values after every iteration

List of Tables

1.1 Topic paths and count for term Jaguar
2.1 Portion of Stop-word List Used
2.2 Words and their corresponding stemmed words
3.1 Relevant Terminology
3.2 Local Weighting Formulas
3.3 Global Weighting Formulas
3.4 Normalization Factors Formulas
3.5 Popular Weighting Schemas
4.1 Topic paths and count for the term BMW
4.2 Topic paths and count for the term Jaguar
4.3 Algorithm for Generation and Weighting of EDtermSet_i
4.4 Algorithm for Formation of enriched document vector d'_d
4.5 Term Weighting Schema. tf means term frequency, D is the total number of documents in the collection, df is the document frequency, dl is the document length, avg_dl is the average document length for a collection.
4.6 Average purity values without pruning terms
4.7 Average Purity values after removing terms with frequency ...

Chapter 2

Document Preprocessing and Representation

2.1 Representing the Document Vector

A text document d_j is abstracted as a vector of term-weight pairs:

d_j = (<t_1j, w_1j>, <t_2j, w_2j>, ..., <t_nj, w_nj>)    (2.1)

where w_nj is the weight of term t_nj of document d_j.
2.2 About Document Features

An essential task is the identification of a simplified subset of features that can be used to represent a document. We refer to such a set of features as the representational model of a document and say that individual documents are represented by the set of features that their representational model contains. However, given the potentially large number of words, phrases, sentences, etc., even a short document may have a vast number of features.
Even with attempts to develop efficient representational models, each document in a collection is usually made up of a large number, sometimes an exceedingly large number, of features. Another characteristic of text documents is what might be described as “feature sparsity”. Only a small percentage of all possible features for a document collection appear in any single document, and thus when a document is represented as a binary vector of features, most of the values of the vector are zero. Feature-based representation of documents involves a trade-off between two important goals. The first goal is to portray the meaning of a document accurately, which tends to incline toward selecting or extracting relatively more features to represent documents. The second goal is to identify features in a way that is computationally efficient and practical. Sometimes feature generation is supported by cross-referencing features against controlled vocabularies or external knowledge sources such as dictionaries, thesauri, ontologies, or knowledge bases to assist in generating more semantically rich features. Although many potential features can be employed to represent documents, the following are most commonly used: “Character”, “Word/Term”, and “Concept”.
• Characters: The individual component-level letters, numerals, special characters and spaces are the building blocks of higher-level semantic features such as words and concepts. A character-level representation can include the full set of all characters for a document or some filtered subset. Character-based representations without positional information (i.e., bag-of-characters approaches) are often of very limited utility. Character-based representations that include some level of positional information (e.g., bi-grams or tri-grams) are somewhat more useful and common.
• Words/Terms: Specific words selected directly from a “native” document are at what might be described as the basic level of semantic richness. For this reason, word-level features are sometimes referred to as existing in the native feature space of a document. Phrases and multiword expressions do not constitute single word-level features. It is possible for a word-level representation of a document to include a feature for each word within that document, that is, the “full text”. This can lead to some word-level representations of document collections having tens or even hundreds of thousands of unique words in their feature space. However, most word-level document representations exhibit at least some minimal optimization and therefore consist of subsets of representative features filtered for items such as stop-words, symbolic characters, and meaningless numerics.
• Concepts: Concepts are features generated for a document by means of manual, statistical, or rule-based methodologies. Concept-level features can be manually generated for documents but are now more commonly extracted from documents using complex preprocessing routines that identify single words, multiword expressions, or even larger syntactical units that are then related to specific concept identifiers. For instance, a document collection that includes reviews of sports cars may not actually include the specific word “automotive” or the specific phrase “test drives”, but the concepts “automotive” and “test drives” might nevertheless be found among the set of concepts used to identify and represent the collection.
Word-level representations can be more easily and automatically generated from the original source text (through various term-extraction techniques) than concept-level representations, which as a practical matter have often entailed some level of human interaction. In the vector space representation, defining terms as distinct single words is referred to as the “bag of words” representation. Some researchers state that using phrases rather than single words to define terms produces more accurate classification results [2, 3], whereas others argue that using single words as terms does not produce worse results [4]. As the “bag of words” representation is the most frequently used method for defining terms and is computationally more efficient than the phrase representation, we use single-word terms as features to represent a document. With the possibility of a large number of features, identifying relevant features becomes necessary to reduce noise. In the next section we discuss the various steps involved in identifying the relevant features of a document.
2.3 Identifying Features

The preprocessing and document representation phase consists of the following steps:
• Parsing the documents.
• Removing stop-words.
• Stemming.
These steps are described briefly in the following sections.
2.3.1 Parsing the Documents

In this step, all the HTML mark-up tags are removed from the documents in the document corpora. Case-folding, i.e., converting all characters in a document into the same case, is also performed here by converting all characters into lower-case.
2.3.2 Removing Stop-words

Stop-words are words such as pronouns, prepositions and conjunctions that are used to provide structure in the language rather than content. Such words are encountered very frequently and carry no useful information about the content, and thus the category, of a document. Removing stop-words from the documents is a very common and necessary step to reduce noise. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms (frequency greater than a pre-defined threshold) and the least frequent terms (frequency less than some pre-defined threshold), often hand-filtered, as a stop list, the members of which are then discarded during indexing. We have decided to eliminate the stop-words from the documents, which leads to a drastic reduction in the dimensionality of the feature space. The list of 571 stop-words used in the SMART system is used [1]; this stop-word list was obtained from [5]. Table 2.1 shows a portion of the stop-word list.

Table 2.1: Portion of Stop-word List Used
a, able, about, after, again, all, also, am, an, and, any, are, as, at, be, been, being, but, by, can, com, comes, could, de, did, do, does, doing, each, else, en, etc, for, from, get, gets, got, how, if, in, is, it, la, of, on, or, that, the, then, this, to, und, was, what, when, where, who, will, with, www
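As an illustration of the frequency-based strategy described above, the following sketch derives a candidate stop list from collection frequency. It is only a hypothetical helper with thresholds and names of our choosing; in this work the fixed 571-word SMART stop list is used instead.

```python
from collections import Counter

def build_stop_list(documents, high_threshold, low_threshold):
    """Derive a candidate stop list from collection frequency.

    documents: list of token lists; the thresholds are pre-defined
    cut-offs that would be hand-tuned (and hand-filtered) in practice.
    """
    collection_freq = Counter(token for doc in documents for token in doc)
    too_frequent = {t for t, f in collection_freq.items() if f > high_threshold}
    too_rare = {t for t, f in collection_freq.items() if f < low_threshold}
    # Candidates are usually hand-filtered before being adopted as the stop list.
    return too_frequent | too_rare

docs = [["the", "car", "is", "fast"], ["the", "bike", "is", "slow"], ["the", "car"]]
print(build_stop_list(docs, high_threshold=2, low_threshold=2))
```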
2.3.3 Stemming

For grammatical reasons, documents contain different forms of a word, such as organize, organizes, and organizing. In many situations, it would be useful for a search for one of these words to return documents that contain another word in the set. The goal of stemming is to reduce derivationally related forms of a word to a common base form. For instance, we reduce the similar terms “run”, “runs” and “running” to the word stem “run”.

Table 2.2: Words and their corresponding stemmed words
Word       Stem
runs       run
running    run
testing    test
motoring   motor
diving     dive

In order to map words that are used in the same context to the same term, and consequently to reduce dimensionality, we have decided to define the terms as stemmed words. To stem the words, we have chosen to use Porter's stemming algorithm [6], which is the most commonly used algorithm for word stemming in English. Implementations of Porter's stemming algorithm in C, PHP, Perl, etc. can be downloaded from [7]. The algorithm is embedded in the preprocessing system. Table 2.2 displays a sample of words and the stems produced by Porter's stemming algorithm. After stemming, terms that are shorter than two characters are also removed, as they do not carry much information about the content of a document.
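A minimal sketch of this preprocessing pipeline is shown below. It assumes NLTK's PorterStemmer as a stand-in for the implementation referenced in [7], and the stop-word set shown is a tiny placeholder for the 571-word SMART list; both are illustrative choices, not the thesis's exact setup.

```python
import re
from nltk.stem import PorterStemmer  # stand-in for the implementation referenced in [7]

STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of"}  # placeholder for the 571-word SMART list
stemmer = PorterStemmer()

def preprocess(text):
    """Case-fold, tokenize, remove stop-words, stem, and drop single-character terms."""
    tokens = re.findall(r"[a-z]+", text.lower())           # parsing + case-folding
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    stems = [stemmer.stem(t) for t in tokens]              # Porter stemming
    return [s for s in stems if len(s) >= 2]               # drop terms shorter than two characters

print(preprocess("The cars were running and testing the motoring equipment"))
```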
2.4 Proximity Measurement
A proximity measure quantifies the closeness between two feature vectors. We measure the proximity using either a similarity or distance measure. Similarity and distance are essentially synonymous and can usually be interchanged (i.e. the distance between elements measures their dissimilarity). As a preprocessing step, given a proximity measure, we can convert the n×m matrix representation of feature vectors into an n × n half-matrix of the proximity between each element:
        | 0                               |
        | d_21   0                        |
    X = | d_31   d_32   0                 |
        | ...    ...    ...   ...         |
        | d_n1   d_n2   ...   d_n,n-1   0 |
where d_ij is the distance or dissimilarity between elements x_i and x_j. Note that the distance between two identical elements is ‘0’ (i.e. the diagonal of the matrix) and that the distances are symmetrical (i.e. d_ij = d_ji). If the proximity metric is not symmetric then the full matrix must be used. This matrix is a general representation of the proximity between elements and it can be used directly in any clustering algorithm. Since clustering decisions are based on the proximity between feature vectors, the selection of a good proximity measure is critical for the success of a clustering algorithm. For this reason, there are many proximity measures available. The selection of a measure is partially driven by the types of features in the feature vectors. For calculating the similarity between two text documents, cosine similarity is very popular. In our study we use the widely used cosine similarity measure to calculate the similarity of two documents [8]. This measure is defined as:

Sim(d_1, d_2) = Cosine(d_1, d_2) = (d_1 · d_2) / (|d_1| |d_2|)    (2.2)

where ‘·’ indicates the vector dot product and |d_1| is the length of document vector d_1. The cosine of the angle between the two vectors is a measure of similarity: it represents how similar or alike one document is to another. If documents are very alike, the angle between their vectors is very small, approaching zero (cosine similarity value approaching ‘1’). On the other hand, if the angle is high, say 90 degrees, the vectors are perpendicular (orthogonal) and the cosine similarity value is ‘0’. In other words, the vectors do not have any common dimension, and in such a case the documents are not considered to be related.
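The sketch below computes equation (2.2) for sparse term-weight dictionaries and builds the pairwise proximity matrix described above (stored here as similarities, whereas the half-matrix X above stores dissimilarities). The example vectors and names are illustrative, not taken from the data-sets used later.

```python
import math

def cosine(d1, d2):
    """Cosine similarity (equation 2.2) between two sparse document vectors (term -> weight)."""
    shared = set(d1) & set(d2)                  # only common terms contribute to the dot product
    dot = sum(d1[t] * d2[t] for t in shared)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def proximity_matrix(doc_vectors):
    """Symmetric matrix of pairwise cosine similarities between all document vectors."""
    n = len(doc_vectors)
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims[i][i] = 1.0
        for j in range(i):
            sims[i][j] = sims[j][i] = cosine(doc_vectors[i], doc_vectors[j])
    return sims

d1 = {"bmw": 0.8, "engine": 0.5}
d2 = {"jaguar": 0.9, "engine": 0.4}
print(cosine(d1, d2))            # > 0 only because of the shared term "engine"
print(proximity_matrix([d1, d2]))
```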
2.5 Summary

In this chapter, we have explained how to represent a text document as a vector of features. In the next chapter, we discuss related work on feature extraction, feature generation and feature weighting schemes.
Chapter 3

Related Work

A text document can be modeled as a vector of features. These features represent the content and context of a document. Generally, the features which are important for a given document are included in its vector representation. To quantify the importance of a feature for a given document, several weighting schemes exist. One important point we want to make here is that weighting the features is just as important as selecting representative features. Along with common features, the weights of these features also play an important role in proximity measurement. In this chapter, we discuss the related work in the areas of feature selection, feature generation and feature weighting.
3.1 Feature Selection

Most of the documents we deal with are not structured. There is a need to model these documents so that they can be used for further tasks. Feature selection selects informative features for a document. We present several feature selection models in the following sub-sections.
3.1.1 Bag-of-Words Model

The basic method to make a feature vector is to add to the vector all distinct terms occurring in the document. Some pruning can be done on this basic technique: terms occurring more often or less often than a pre-specified threshold can be removed. If a term is very frequent in a document collection, then it may not be a representative and discriminating feature for a document. The same holds for low-frequency terms: one or two instances of a term in the document collection will not help in the proximity measurement task.
3.1.2 Extended Bag-of-Words Model

There have been efforts to extend the basic “Bag of Words” (BOW) approach. Several studies augmented the bag of words with n-grams [9, 10, 11, 12]. An n-gram is a subsequence of ‘n’ items from a given sequence. The items in consideration can be characters or words according to the application. An n-gram model can be used as a probabilistic model for predicting the next item in such a sequence. n-grams are used in various areas of natural language processing and genetic sequence analysis. If n-grams occur frequently in a document collection, then they can be used as possible features for a document.
3.1.3 NLP based Models

Phrases consist of multiple words, such as “data mining” or “mobile phone”, and constitute a different context than when the words are used separately. Phrases can be extracted by using statistical or Natural Language Processing (NLP) techniques. With statistical methods, phrases can be extracted by considering the frequently appearing sequences of words in the document collection [2]. Research on extracting phrases by using NLP techniques for text categorization is discussed in [3]. Phrases can also be extracted by manually defining the phrases for a particular domain, as done to filter spam mail in [13]. Additional studies researched the use of word clustering [14, 15]. Features based on syntactic information, such as that available from part-of-speech tagging, are used in [16, 17]. In this thesis, we consider words as features to represent the document. Once we have fixed words as features (the most elementary type of feature), the problem at hand is to weight them. We believe that if we can improve the performance of the system with words as features, we may improve it further by analyzing other feature types (phrases, etc.). Till now we have covered some approaches which use the features present in the document to build the feature vector of a document. In the next section, we explore the methodologies used to generate or include features into a feature vector without the literal occurrence of those features in the document.
3.2 Feature Generation
Over a decade ago, the authors of [18] formulated the knowledge principle, which postulated that “If a program is to perform a complex task well, it must know a great deal about the world it operates in”. In order to perform a complex task well, the computer needs access to much more extensive and deep knowledge. More recently, there have been a number of efforts to add outside knowledge to the document feature set. Pseudo-relevance feedback [19] uses information from the top-ranked documents, which are assumed to be relevant to the query; for example, characteristic terms from
such documents may be used for query expansion [20]. In the vector space model, feedback is typically treated as query vector updating. A well-known approach is the Rocchio method, which simply adds the centroid vector of the relevant documents to the query vector and subtracts from it the centroid vector of the non-relevant documents with appropriate coefficients [21]. In effect, this leads to an expansion of the original query vector, i.e., additional terms are extracted from the known relevant (and non-relevant) documents, and are added to the original query vector with appropriate weights [22]. Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is also used to perform automated document categorization [23, 24]. LSI uses example documents to establish the conceptual basis for each category. During categorization processing, the concepts contained in the documents being categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the documents based on the similarities between the concepts they contain and the concepts that are contained in the example documents. This is very useful when dealing with an unknown collection of unstructured text. Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri. The disadvantage of the simple SVD is that it uses a fixed matrix. A term-document matrix can be very large. Hence it will cost much time for calculation of the SVD. Hence every change (when the terms and concepts of a new set of documents need to be included in an LSI index) of the input requires the rebuild of the matrix and a new run of the algorithm [25]. LSI requires relatively high computational performance and memory in comparison to other information retrieval techniques [26]. A word can be used to express two or more different meanings. It is important to find the context in which a word has been used in the document. Word sense disambiguation is an active topic of research which focuses on finding the correct sense in which the word has been used [27, 28, 29]. To capture the context of the document, there have been efforts to augment features from resources like “Yahoo Web Directories”, “Wikipedia”, “Wordnet” etc. Work that uses web directories to gain information about the context of the document is discussed in [30, 31][32]. Such categories offer a standardized and universal way for referring to or describing the nature of real world objects, activities, documents and so on, and may be used (we suggest) to semantically characterize the content of documents. The knowledge resources like Yahoo provide a huge hierarchy of categories (topics) that touch every aspect of human endeavors. Such topics can be used as descriptors.
In the course of investigating this idea, an approach for automatic web-page classification based on the Yahoo hierarchy is proposed in [30]. Documents are represented as feature vectors that include n-grams instead of including only single words (unigrams). Feature generation is performed in ‘n’ passes over the documents, where i-grams are generated in the ith pass. The high number of features is reduced by pruning words contained in a publicly available stop-word list of common English words and by pruning low-frequency features. Based on the hierarchical structure, the problem is divided into sub-problems, each representing one of the categories included in the Yahoo hierarchy. Classifiers are learnt on the whole content of the documents. The result of the learning is a set of independent classifiers, each used to predict the probability that a new document is a member of the corresponding category.
18
below some predefined threshold. In [32], intentions in dialogues of instant messaging applications are captured, which are used for advertising. In instant messaging applications, a dialog is composed of several utterances issuing by at least two users. They are different from sponsored search in that advertiser content is matched to user utterances instead of user queries. While reading users’ conversation, an intention detecting system recommends suitable advertiser information at a suitable time. The time of the recommendation and the effect of advertisement have a strong relationship. The earlier the correct recommendation is, the larger the effect is. However, time and accuracy are trade-off. At the earlier stages of a dialog, the system may have deficient information to predict suitable advertisement. Thus, a false advertisement may be proposed. On the other hand, the system may have enough information at the later stages. However, users may complete their talk at any time in this case, so the advertisement effect may be lowered. In [32] a solution is proposed to this problem by using yahoo as knowledge resource. In each round of the conversation, retrieve an utterance from a given instant message application. Then, parse the utterance and try to predict intention of the dialog based on current and previous utterances, and consult the advertisement databases that provide sponsor links accordingly. If the information in the utterances is enough for prediction, then several candidates are proposed. Finally, based on predefined criteria, the best candidate is selected and proposed to the IM application as the sponsor link. After figuring out the feature “term frequency” is used to weight the feature. So, external knowledge resources can be used to gain relevant information. With this information, additional and relevant features can be generated and used along with the features in the document. These features may not be equally informative for a document; their relative importance for a document may differ. So, in the next section, we discuss the weighing schemes to assign weight to the features and capture the relative importance of a feature towards a document.
3.3 Feature Weighting

Now that we know how to construct a feature vector, the other important element is to weight these features. Current state-of-the-art term weighting schemes can be categorized into two classes: supervised and unsupervised. These classes are discussed in the following sub-sections.
3.3.1 Supervised Weighting

Supervised term weighting schemes are based on the distribution of words in different categories. The main aim is to capture the correlation between word frequency and categorical information. Machine learning techniques and probabilistic approaches are used to enhance learning from available knowledge [35, 36]. Many functions, mostly from traditional information theory and statistics, have been used in text categorization; chi-square, information gain, and gain ratio [37] are very popular. One of the drawbacks of supervised approaches is that they need to be trained on pre-defined positive and negative samples or predefined categories. The efficiency of these models depends on the quality of the sample sets. With the enormous amount of data and different types of applications, it is not always possible to create these training sets or contextual categories manually.
3.3.2 Unsupervised Weighting

Unsupervised term weighting schemes do not use information on the membership of training documents. They just use the statistics of the frequency of terms in the document/corpus. A weighting scheme is composed of three different types of term weighting: local, global and normalization. The term weight w_nj is given by

w_nj = L_nj * G_n * N_j    (3.1)

where L_nj is the local weight for term t_nj of document d_j, G_n is the global weight of term t_n, and N_j is the normalization factor for document d_j.
• Generally, local weights are functions of “how many times each term appears in a document”. Local weighting formulas perform well if they work on the principle that terms with higher within-document frequency are more pertinent to that document. A list of local weight formulas is given in Table 3.2 and the relevant terminology can be found in Table 3.1. Binary, term frequency, log, augmented log and square root are a few examples. Here, in one way or another, all these weighting schemes use the term frequency information of a term.
• Global weights are functions of “how many times each term appears in the entire collection”. Global weighting tries to give a “discrimination value” to each term. Many schemes are based on the idea that the less frequently a term appears in the whole collection, the more discriminating it is. A list of global weighting schemes is given in Table 3.3 and the relevant terminology can be found in Table 3.1. Inverse document frequency, probabilistic inverse and global frequency IDF are a few examples. Among these, inverse document frequency (IDF) is very popular and widely used.
• The normalization factor compensates for discrepancies in the lengths of documents [38]. It is useful to normalize the document vectors so that documents are retrieved independent of their lengths. Some of the normalization techniques are mentioned in Table 3.4 and the relevant terminology can be found in Table 3.1. In our experiments, we use the popular cosine normalization method.
We can choose different combinations of local, global and normalization schemes to assign a weight to a feature. A major class of statistical weighting schemes is examined below, each of which uses “tf” as a measure of a feature's capacity to describe the document contents. These methods differ in how they measure a feature's capacity to discriminate between similar documents via various statistical functions. There are three assumptions that, in one form or another, appear in practically all weighting methods:
1. Rare terms are no less important than frequent terms (the IDF assumption).
2. Multiple appearances of a term in a document are no less important than single appearances (the TF assumption).
3. For the same quantity of term matching, long documents are no more important than short documents (the normalization assumption).

3.3.2.1 Term Frequency Inverse Document Frequency Weighting Scheme

Binary weighting and term frequency weighting do not consider the frequency of a term throughout all the documents in the document corpus. Term Frequency Inverse Document Frequency (TF-IDF) weighting is the most common and popular term weighting method that takes this property into account. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. In this approach, the weight of the nth term in document d_j is assigned proportionally to the number of times the term appears in the document, and in inverse proportion to the number of documents in the corpus in which the term appears. The relevant terminology can be found in Table 3.1.
w_nj = tf_nj * log2(|D|/df + 1)    (3.2)

The TF-IDF weighting approach weights the frequency of a term in a document with a factor that discounts its importance if the term appears in most of the documents, as in this case the term is assumed to have little discriminating power.

3.3.2.2 Other Popular Weighting Schemes

LTU, INQUERY and OKAPI are among the other good weighting schemes available. They use modified term frequency and inverse document frequency factors. The length of a document is also considered. These schemes differ from each other in some aspects, either by the term frequency factor, by a variation of the inverse document frequency factor, or by the document length normalization factor. These weighting schemes are formulated in Table 3.5.
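As a concrete sketch of the local * global * normalization composition of equation (3.1), the code below applies the TF-IDF variant of equation (3.2) with cosine normalization; the function and variable names are ours, and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Weight terms with w_nj = tf_nj * log2(|D|/df + 1) (equation 3.2), then cosine-normalize."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                      # document frequency: one count per document

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)                        # local weight: raw term frequency
        vec = {t: f * math.log2(n_docs / df[t] + 1) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))   # cosine normalization factor
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

docs = [["bmw", "car", "engine", "car"], ["jaguar", "car"], ["cricket", "bat"]]
print(tfidf_vectors(docs)[0])
```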
The above-mentioned methods easily scale to very large corpora, with a computational complexity approximately linear in the number of features and training documents. However, idf is a global measure and ignores the fact that features may have different discriminating powers for different document topics. For example, “football” is a most valuable term in sports news while it has little value for implying financial news. According to idf, whether “football” appears in sports news or not, its idf value is the same. The following sections will discuss some methods that calculate a feature's ability to discriminate between similar documents in terms of document categories. So, the importance of the contextual usage of a word is not considered by these weighting schemes.
3.4 Summary

Generally, standard statistical weighting schemes are used to weight the selected features, and these do not consider the context of the documents. So, the development of feature selection methods and of methods to weight the selected features needs to proceed hand-in-hand if there is to be hope of improving performance. In the next chapter, we make an effort to develop an unsupervised term extraction and term weighting scheme that requires neither any tagging of text nor any kind of training process. The proposed approaches, which we discuss in the next two chapters, differ from previous approaches in many aspects: (i) we introduce a term extraction method using the topic paths of an open web directory, (ii) we assign weights to document terms and conceptual terms differently, and (iii) we introduce various factors which should be considered while assigning weight to conceptual terms.
Table 3.1: Relevant Terminology

Term      Description
D         Set of documents in the data-set.
|D|       Total number of documents in the data-set.
d_j       jth document in set D, d_j ∈ D for j > 0.
T         Set of distinct terms in the data-set D.
|T|       Number of distinct terms in the data-set D.
t_n       nth term in set T, t_n ∈ T for n > 0.
F_n       Frequency of t_n in the data-set.
t_nj      nth term of document d_j.
tf_nj     Frequency of t_nj.
L_nj      Local weight for term t_nj.
G_n       Global weight of term t_n.
N_j       Normalization factor for document d_j.
w_nj      Final evaluated weight of term t_nj of document d_j.
df        Number of documents where t_nj occurs at least once.
a_j       Average frequency of the terms that appear in document d_j.
x_j       Maximum frequency of any term in document d_j.
dl_j      Length of document d_j.
l_j       Number of distinct terms in document d_j.
slope     Set to 0.2.
pivot     Set to the average number of distinct terms per document in the entire collection.
avg_dl    Average document length for a collection.
Table 3.2: Local Weighting Formulas. Refer Table 3.1 for Relevant Terminology. In each case, L_nj = 0 when tf_nj = 0.

Binary:                                       L_nj = 1, if tf_nj > 0
Term Frequency:                               L_nj = tf_nj
Log:                                          L_nj = 1 + log(tf_nj), if tf_nj > 0
Normalized Log:                               L_nj = (1 + log(tf_nj)) / (1 + log(a_j)), if tf_nj > 0
Augmented Log:                                L_nj = 0.2 + 0.8 * log(tf_nj + 1), if tf_nj > 0
Square Root:                                  L_nj = sqrt(tf_nj - 0.5) + 1, if tf_nj > 0
Augmented Normalized Term Frequency:          L_nj = 0.5 + 0.5 * (tf_nj / x_j), if tf_nj > 0
Augmented Normalized Average Term Frequency:  L_nj = 0.9 + 0.1 * (tf_nj / a_j), if tf_nj > 0
Table 3.3: Global Weighting Formulas. Refer Table 3.1 for Relevant Terminology

No Global Weight:                   G_n = 1
Inverse Document Frequency:         G_n = log(|D| / df)
Probabilistic Inverse:              G_n = log((|D| - df) / df)
Entropy:                            G_n = 1 + [ sum_{j=1..|D|} (tf_nj / F_n) * log(tf_nj / F_n) ] / log(|D|)
Global Frequency IDF:               G_n = F_n / df
Log-Global Frequency IDF:           G_n = log(1 + F_n / df)
Incremental Global Frequency IDF:   G_n = 1 + F_n / df
Square Root Global Frequency IDF:   G_n = sqrt(F_n / df - 0.9)
Table 3.4: Normalization Factors Formulas. Refer Table 3.1 for Relevant Terminology

No Normalization Factor:      N_j = 1
Cosine Normalization:         N_j = 1 / sqrt( sum_{n=1..m} (G_n * L_nj)^2 )
Pivot Unique Normalization:   N_j = 1 / ((1 - slope) * pivot + slope * l_j)
Table 3.5: Popular Weighting Schemas. Refer Table 3.1 for Relevant Terminology

TF-IDF:     tf * log(D / df)

LTU:        [ (log(tf) + 1) * log(D / df) ] / [ 0.8 + 0.2 * (dl / avg_dl) ]

INQUERY:    [ tf / (tf + 0.5 + 1.5 * (dl / avg_dl)) ] * [ log((D + 0.5) / df) / log(D + 1) ]

OKAPI:      [ tf / (0.5 + 1.5 * (dl / avg_dl) + tf) ] * log((N - df + 0.5) / (df + 0.5))
Chapter 4

An Approach to Improve Feature Generation by Exploiting Open Web Directory

The process of term extraction and weighting affects the performance of information retrieval, search engines and text mining systems. A text document is abstracted as a vector of terms, and the weight for each term is usually given by the popular TF-IDF method. In the TF-IDF method, the weight of a term is a function of its frequency in the document and in the overall document collection. Weighting features is something that many information retrieval systems seem to regard as being of minor importance compared to finding the features in the first place; but the experiments described here suggest that weighting is as important as feature selection. In this chapter, we attempt to demonstrate the relative importance of, and the difficulties involved in, the task of document representation and feature weighting. We introduce an improved feature generation and feature weighting method that exploits the contextual/semantic relationship between terms using knowledge repositories such as open web directories.
4.1 Problems in Bag of Word Approach

Since the majority of existing systems employ the bag of words approach to represent documents, we begin by analyzing typical problems and limitations of this method.
• Vocabulary Mismatch: In supervised algorithms like classification, words that appear in testing documents but not in training documents are completely ignored by the basic BOW approach, which does not use external data to compensate for such vocabulary mismatch. Since the classification model is built with a subset of words that appear in the training documents, words that do not appear there are excluded by definition. Lacking the ability to analyze such words, the system may overlook important parts of the document being classified.
• Ignores Semantic Relationship between Words: We observe that a critical limitation of the BOW approach lies in its ignorance of the connections between words. Suppose we have a group of related words, where each word appears only a few times in the collection, and few documents contain more than one word of the group. As a result, the connection between these words remains implicit and cannot be learned without resorting to external knowledge.
Some of these limitations are due to data sparsity; after all, if we had infinite amounts of related and non-ambiguous text on every imaginable topic, the bag of words would perform much better. Humans avoid these limitations due to their extensive world knowledge, as well as their ability to understand words in context rather than just view them as an unordered bag. Our approach, which uses structured background knowledge, is a form of context-based learning, where generalizations of terms are used, thus mimicking the human ability to learn about the context of terms. Later, we show how the above problems and limitations can be resolved through the use of knowledge-based feature generation and weighting. Having presented the problems with the BOW approach, we continue by presenting the basic idea that will address and alleviate these problems using knowledge repositories.
4.2 Basic Idea

Cosine similarity with traditional term selection and term weighting fails to find similarity between two documents that share a topic but have different terminology. If we use a knowledge resource such as an open web directory, in which a given term relates to a context, and the context, in turn, relates to a collection of terms, then we can extract related terms for each term in the document. In a simple generalization hierarchy of a web directory, a term at a higher level is a generalized concept for all the terms under that node; e.g., sport is a generalized concept for football, cricket, baseball, etc. By adding related terms to the feature vector, two different terms which have the same context may get the same related generalized terms. As a result, there is an opportunity to increase the similarity between two documents.
4.2.1 About Open Web Directory

Using the hierarchical categories of web directories, it is possible to add additional features to the document vector without their literal occurrence in the document. This enriched feature vector has the document terms along with the generalized categorical terms which represent the context of the document. This increases the similarity between two documents even if they do not have common vocabulary but are semantically related.

Table 4.1: Topic paths and count for the term BMW
#   Topic Path                                                             Count
1   Recreation: Autos: Makes and Models: BMW                               91
2   Recreation: Motorcycles: Makes and Models: BMW                         90
3   World: Deutsch: Freizeit: Auto: Marken: BMW                            69
4   Business: Automotive: Motorcycles: Makes and Models: Retailers: BMW    29
5   Home: Consumer Information: Automobiles: Purchasing: By Make: BMW      11

Table 4.2: Topic paths and count for the term Jaguar
#   Topic Path                                                                                 Count
1   Recreation: Autos: Makes and Models: Jaguar                                                67
2   Games: Video Games: Console Platforms: Atari: Jaguar                                       34
3   Shopping: Vehicles: Parts and Accessories: Makes and Models: European: British: Jaguar     28
4   Kids and Teens: School Time: Science: Living Things: Animals: Mammals: Jaguar              9
5   Sports: Football: American: NFL: Jacksonville Jaguars: Jaguar                              4
Example: Consider two document vectors, where one document vector contains the term “BMW” and the other contains the term “Jaguar”. If we calculate the cosine similarity of these two documents, the similarity value returned would be ‘0’, even though both documents contain information about cars. By exploiting a hierarchical knowledge resource such as an open web directory, it is possible to improve the performance of similarity computation. When a term is queried in an open web directory such as DMOZ, it returns several topic paths and their respective count values. The topic paths obtained for the terms “BMW” and “Jaguar” are listed in Table 4.1 and Table 4.2 respectively. In Table 4.1, the first topic path for “BMW” has “Makes and Models” as its immediate generalized term, followed by “Autos” and then “Recreation”. If the generalized terms of “BMW” and “Jaguar” are included in the corresponding document vectors, then “Makes and Models”, “Autos” and “Recreation” will be common to both document vectors. As a result, the cosine similarity between these two documents will be greater than ‘0’. So, there is an opportunity to improve the performance of similarity computation by exploiting hierarchical knowledge resources like the open web directory.
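A minimal sketch of this example is shown below: each document term is enriched with the generalized terms from its first topic path (taken from Tables 4.1 and 4.2), and the added terms are given a placeholder weight of 1.0. The helper names and the fixed weight are assumptions for illustration; the actual weighting of generalized terms is developed in Section 4.3.

```python
import math

topic_paths = {  # first topic path for each term, from Tables 4.1 and 4.2
    "bmw": "Recreation: Autos: Makes and Models: BMW",
    "jaguar": "Recreation: Autos: Makes and Models: Jaguar",
}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def enrich(doc_vector, paths):
    """Add the generalized terms of each document term with a placeholder weight of 1.0."""
    enriched = dict(doc_vector)
    for term in list(doc_vector):
        for gterm in paths.get(term, "").split(":")[:-1]:   # labels above the term itself
            key = gterm.strip().lower()
            if key:
                enriched[key] = enriched.get(key, 0.0) + 1.0
    return enriched

d1, d2 = {"bmw": 1.0}, {"jaguar": 1.0}
print(cosine(d1, d2))                                             # 0.0: no common term
print(cosine(enrich(d1, topic_paths), enrich(d2, topic_paths)))   # > 0 via the shared generalized terms
```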
4.2.2 Overview of Proposed Approach
A text document is abstracted as a vector of terms, and the weight of each term is usually given by the popular TF-IDF method, in which the weight of a term is a function of its frequency in the document and in the overall document collection. The similarity computation by the cosine similarity method is influenced by the common terms (and their weights) between two document vectors and ignores the semantic relation between terms. We can use the generalization property of hierarchical knowledge repositories to establish that certain terms correspond to specific instances of some generalized term. These generalized terms can be used to enrich the document vector; by enriching and weighting we intend to obtain better similarity values between two documents.

The input data is transformed into a set of features (a feature vector). If the features are carefully chosen, it is expected that the feature set will capture the relevant information from the input data needed to perform the desired task. The feature vector of a document has the problems mentioned for the bag-of-words model. To overcome these problems we use external knowledge resources explicitly cataloged by humans. Feature extraction is introduced as a method for enhancing machine learning algorithms with a large volume of knowledge extracted from available human-generated repositories. Our method capitalizes on the power of existing hierarchical knowledge repositories by enriching the language of representation, namely, exploring new feature spaces. Prior to term weighting, we employ a feature generator that uses words occurring in the document to enrich the bag of words with new, more informative and discriminating features. These generalized terms give rise to a set of constructed features that provide background knowledge about the document's content. The constructed features can then be used in conjunction with the original bag of words. The resulting set undergoes feature weighting, and the most representative features are weighted highly for document representation. Many sources of world knowledge have become available in recent years; examples of general-purpose knowledge bases include the Open Directory Project (ODP), the Yahoo! Web Directory, Google Directory and Wikipedia.
4.3 Proposed Approach
We explain the proposed approach after explaining the relevant terminology.

• Document Term (DTerm_i): Given a document, we extract n (n > 0) terms to form the initial document vector. We call each term a document term (DTerm). The i-th term in the document vector is denoted by DTerm_i (1 ≤ i ≤ n).

• Topic Path (TPath_ij): When the web directory is queried with a DTerm_i, it returns p topic paths. Each topic path contains a sequence of terms. The first term is DTerm_i itself and the rest of the terms are generalizations of the preceding term. TPath_ij is the j-th (1 ≤ j ≤ p) topic path of DTerm_i. Formally, TPath_ij is defined as follows:

$$TPath_{ij} := \langle x_k : x_{k-1} : \cdots : x_0,\ count \rangle \qquad (4.1)$$

Here, x_0 is DTerm_i, x_k is an immediate generalization of x_{k-1}, k is the level in the topic path, and count is the number of related web pages which fall under the respective topic path. Table 4.1 shows five topic paths related to the term "BMW". Let "BMW" be the first term of the document; then TPath_11 := <Recreation : Autos : Makes and Models : BMW, 91>.

• Generalized term (GTerm_ijk): Given a topic path, the terms other than the DTerm are called generalized terms. GTerm_ijk is a generalized term occurring in TPath_ij for k ≠ 0, where k is the level number. For example, in TPath_11, k is 0 for "BMW", and 1, 2, 3 for "Makes and Models", "Autos" and "Recreation" respectively.

• Enriched Term Set (EDtermSet_i): EDtermSet_i is the set of generalized terms of a document term, along with the document term and their respective weights. Formally,

$$EDtermSet_i = \{DTerm_i, WDTerm_i\} \cup \{GTerm_{ijk}, WGTerm_{ijk}\} \qquad (4.2)$$

• WDTerm_i and WGTerm_ijk: The weights of DTerm_i and GTerm_ijk are denoted by WDTerm_i and WGTerm_ijk respectively.

The relation between DTerm_i and its topic paths is depicted in Figure 4.1, where a term points to its immediate generalized term. For each document, the proposed approach follows these steps:

1. Generation of DTerms for the document.
2. Determination of weights for the terms.
3. Formation of the enriched document vector.

Details of these steps are discussed in the following sub-sections.
4.3.1 Generation of DTerms for document
Let |D| be the total number of documents in the dataset. The feature vector of each document d_m (1 ≤ m ≤ |D|) in n-dimensional term space is

$$\vec{d}_m = (\langle t_{1m}, w_{1m} \rangle, \langle t_{2m}, w_{2m} \rangle, \cdots, \langle t_{nm}, w_{nm} \rangle),$$

where t_{nm} is the n-th term in document d_m and w_{nm} is the weight of term t_{nm}.

Figure 4.1: Relation between DTerm_i and its topic paths

High-frequency words (stop-words) such as "i", "the", "am" and "and" are removed using a stop-word list [1]. Then the terms are reduced to their basic stem by applying a stemming algorithm [6]. All the terms in the document remaining after pre-processing (DTerms) are used as features. These DTerms are further used to extract generalized terms from the open web directory.
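The following is a minimal Python sketch of this pre-processing step. It assumes NLTK's Porter stemmer is available and uses a small illustrative stop-word set rather than the actual list referenced in [1].

```python
import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

# tiny illustrative stop-word set, not the list used in the thesis
STOP_WORDS = {"i", "the", "am", "and", "a", "an", "of", "to", "in", "is", "are"}

def extract_dterms(text):
    """Tokenize a document, drop stop-words and stem the rest to obtain DTerms."""
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS]

# Example
print(extract_dterms("The BMW and the Jaguar are recreational autos."))
```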
4.3.2 Determine Weight for the terms
In the long term, the best weighting methods will be those that can adapt weights as more information becomes available. The process of term weighting affects the performance of information retrieval, search engines and text mining systems. The document should be represented by those terms which define the document, and such terms should be highly weighted. Topic Paths (TPaths) for every DTerm of the document are extracted; each TPath_ij consists of one DTerm and a sequence of GTerms. Weights are assigned to DTerms as well as GTerms using the following methods.

4.3.2.1 Assigning weight to document terms (DTerm_i)

Good weighting methods are as important as the feature selection process, and it is suggested that the two need to go hand-in-hand in order to be effective. We can use any weighting scheme to weigh DTerms. Here, we use TF-IDF; the effects of weighting with TF, LTU and INQUERY (refer to Table 4.5) are also shown in the experiments section. The same weights (WDTerm) will be used to weigh GTerms in the next sub-section.
4.3.2.2 Assigning weight to generalized terms (GTerm_ijk)

GTerms that overlap across several terms should receive higher weights than GTerms that appear in isolation. However, if only the frequency criterion is considered, terms at higher levels will get more weight, since being generalized terms of many terms makes their frequency high. So, along with the addition of generalized terms, their weighting is also important. The weight of a generalized term is a function of three factors, namely:

1. GTerms of important terms are important.
2. Importance of the topic path.
3. Less weight to more generalized terms.

The weight of GTerm_ijk of DTerm_i in the j-th topic path at the k-th level can be formalized as

$$WGTerm_{ijk} = WDTerm_i \cdot imp_j \cdot e^{-k} \qquad (4.3)$$
We elaborate these three factors one by one.

• GTerms of important terms are important: WGTerm_ijk is proportional to WDTerm_i because a high weight indicates the importance of a term for the document. In other words, highly weighted terms represent the document in a more informative manner, so the generalized terms of highly weighted DTerms also become important. Thus the weight of generalized terms should be proportional to the weight of DTerms.

• Importance of topic path (imp_j): For a DTerm_i, several topic paths are obtained because of the polysemous nature of terms. The importance of a term may differ across topic paths. The importance of a term towards the j-th topic path is captured by the probability of the term occurring in that topic path, i.e.

$$imp_j = \frac{count(j)}{\sum_{m=1}^{p} count(m)}$$

A high imp_j shows that a term is used more frequently in the j-th topic path and is thus more related to it, so the GTerms occurring in this topic path are important.

• Less weight to more generalized terms (e^{-k}): GTerms closer to the DTerm represent the document more precisely than the more generalized terms further along the topic path. So, GTerms which are close to the DTerm in the topic path should get relatively more weight than GTerms which are farther away. We use a decreasing (exponential) function to assign less weight as the level in the topic path increases. For example, consider topic path 1 of DTerm "BMW" in Table 4.1: "BMW" is at level 0, while "Makes and Models", "Autos" and "Recreation" are at levels 1, 2 and 3 respectively. "Makes and Models" is the immediate generalized term and thus defines "BMW" better than the other generalized terms such as "Autos" or "Recreation". Thus more weight should be given to "Makes and Models".
Table 4.3: Algorithm for Generation and Weighting of EDtermSet_i
Input: DTerm_i
Output: EDtermSet_i, the weighted and enriched term set for DTerm_i

1:  WDTerm_i = tf(DTerm_i) * idf(DTerm_i)
2:  EDtermSet_i = {DTerm_i, WDTerm_i}
3:  TPathList = {}
4:  TPathList = AddTopicPaths(DTerm_i)              // add topic paths of DTerm_i
5:  foreach TPath_ij in TPathList
6:      foreach GTerm_ijk
7:          imp_j = count(j) / sum_{m=1..p} count(m)
8:          WGTerm_ijk = WDTerm_i * imp_j * exp(-k)
9:          EDtermSet_i = EDtermSet_i ∪ {GTerm_ijk, WGTerm_ijk}
10:     end
11: end
In this step, by giving each DTerm_i of the document (along with its weight) as input, the pairs of GTerms along with their weights are obtained. We define this collection of DTerm_i and its GTerms from all topic paths as the enriched term set (EDtermSet_i). Formally,

$$EDtermSet_i = \{DTerm_i, WDTerm_i\} \cup \{GTerm_{ijk}, WGTerm_{ijk}\} \quad \forall j, k \qquad (4.4)$$

The algorithm for generating EDtermSet_i and weighting its terms is given in Table 4.3. The input to the algorithm is a DTerm_i, and the enriched term set is returned as output. The main steps of the algorithm are as follows.
Step 1: The document term DTerm_i is weighted using a standard weighting scheme from Table 4.5 (TF-IDF is used as an example).
Step 2: DTerm_i is added to the enriched term set along with its weight.
Step 4: All the topic paths (from the open web directory) are stored.
Step 8: The weight of every GTerm in all topic paths is calculated.
Step 9: All the GTerms are added to the enriched term set along with their weights.
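The following is a minimal Python sketch of the enriched-term-set generation of Table 4.3. The callable `query_topic_paths` stands in for a lookup against the open web directory (e.g. a local DMOZ index) and is an assumption, not a component provided by the thesis; weights follow Equation 4.3.

```python
import math

def build_edterm_set(dterm, w_dterm, query_topic_paths):
    """Build the enriched term set for one document term (Table 4.3).

    dterm             -- the document term (DTerm_i)
    w_dterm           -- its weight, e.g. TF-IDF (WDTerm_i)
    query_topic_paths -- callable returning [(path_terms, count), ...] where
                         path_terms is ordered from the DTerm (level 0) upwards;
                         this stands in for the open web directory lookup.
    """
    edterm_set = {dterm: w_dterm}                    # step 2: DTerm with its weight
    tpaths = query_topic_paths(dterm)                # step 4: all topic paths
    total = sum(count for _, count in tpaths) or 1.0
    for path_terms, count in tpaths:                 # step 5: each TPath_ij
        imp_j = count / total                        # step 7: importance of the path
        for level, gterm in enumerate(path_terms):
            if level == 0:                           # level 0 is the DTerm itself
                continue
            w_gterm = w_dterm * imp_j * math.exp(-level)              # step 8: Eq. 4.3
            edterm_set[gterm] = edterm_set.get(gterm, 0.0) + w_gterm  # step 9
    return edterm_set
```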
4.3.3 Formation of enriched document vector (d'_d)
For every term of every EDtermSet, if a term from the EDtermSet is not present in d'_d, then the term along with its weight is added to d'_d. If the term already exists in d'_d, then the weight of this term is added to its instance in d'_d. Formally,

$$\vec{d}'_d := \bigcup_i \{EDtermSet_i\}, \quad 1 \le i \le n \qquad (4.5)$$
The algorithm for forming the enriched document vector is given in Table 4.4. To build the enriched document vector we use the EDtermSets (enriched term sets) of all the DTerms.

Table 4.4: Algorithm for Formation of enriched document vector d'_d
Input: n enriched term sets (EDtermSets)
Output: enriched and weighted document vector d'_d

1: d'_d = {}
2: foreach EDtermSet_i, 1 ≤ i ≤ n
3:     foreach term_t in EDtermSet_i
4:         if (AlreadyExists(term_t))           // check in d'_d
5:             IncrementWeight(term_t)          // weight added to its instance in d'_d
6:         else
7:             d'_d = d'_d ∪ {term_t, Wterm_t}  // add the term and its weight to d'_d
8:     end
9: end
The main steps of the algorithm are as follows.
Step 4: Check whether the term already exists in the enriched document vector.
Step 5: If the condition at Step 4 is true, the weight of the term is added to its previous weight. This happens because a GTerm can appear in various topic paths of several terms, so it can exist in several enriched term sets.
Step 7: If the term does not already exist in the enriched document vector, the term is added to the vector along with its weight.
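A minimal sketch of this merge step, assuming each EDtermSet is represented as a term-to-weight dictionary as in the earlier sketch:

```python
def build_enriched_vector(edterm_sets):
    """Merge the enriched term sets of all DTerms into one enriched
    document vector (Table 4.4). Weights of repeated terms accumulate."""
    enriched = {}
    for edterm_set in edterm_sets:          # one set per DTerm_i
        for term, weight in edterm_set.items():
            # steps 4-7: add the term, or add its weight to the existing instance
            enriched[term] = enriched.get(term, 0.0) + weight
    return enriched
```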
To provide more insight into the formation of the enriched document vector, we present an example and pictorially show an enriched document with two DTerms, DTerm_x and DTerm_y (Figure 4.2). The DTerms, along with their topic paths, are merged to form the enriched document.

Figure 4.2: Document after merging topic paths for all DTerms

For the term DTerm_x, three topic paths are obtained:
1. TPath_x1 := <Ax13 : Ax12 : Ax11 : DTerm_x>
2. TPath_x2 := <Bx22 : Bx21 : DTerm_x>
3. TPath_x3 := <Cx33 : Cx32 : Cx31 : DTerm_x>

The term DTerm_y has two topic paths:
1. TPath_y1 := <By12 : By11 : DTerm_y>
2. TPath_y2 := <Cy23 : Cy22 : Cy21 : DTerm_y>

Suppose Bx21 and By11, and Bx22 and By12, are the same terms; after merging they are therefore kept in the same nodes of the enriched document. The weight of these terms is the summation of their weights from both topic paths. The factors which affect the weight of the node holding term Bx21 are:
1. Importance of the paths DTerm_x → Bx21 and DTerm_y → By11.
2. Weight of DTerm_x and DTerm_y.
3. Distance of Bx21 and By11 from DTerm_x and DTerm_y respectively.

So, the final weight of a GTerm is the summation of exponentially decayed weights from all the DTerms occurring in the document in whose topic paths this GTerm appears.

For example, take two documents, one containing the term "BMW" and the other "Jaguar". A pictorial (hierarchical) view of the generalized terms in the topic paths for "BMW" is shown in Figure 4.3.
Figure 4.3: Topic Paths for the term BMW

It can be noted that the immediate, or level one, generalized terms for "BMW" are "Makes and Models", "Retailers" and "By Make". The level two generalized terms for "BMW" are "Autos", "Motorcycles", "Makes and Models" and "Purchasing". Level one generalized terms are closer to the document term and thus define it more precisely, so level one generalized terms get more weight through the generalization level factor (e^{-k}). The same generalized term can appear in different topic paths; at the term weighting step, information about the topic path is necessary as the importance of the topic path is considered. For example, "Makes and Models" appears in three of the links. In the links "Recreation: Autos: Makes and Models" and "Recreation: Motorcycles: Makes and Models", "Makes and Models" is at level one generalization. So, "Makes and Models" receives equal weight from the generalization level factor but different weight because of the topic path importance factor. The final enriched document vector has distinct terms, each with the summation of weights from all its instances. Similar observations can be made in the pictorial representation for the term "Jaguar" in Figure 4.4. Figure 4.5 shows the terms which contribute to the cosine similarity computation. Cosine similarity depends on the common terms between two vectors and their weights; the terms in the dotted nodes are of no importance in the similarity computation. It can be noted that the similarity between these two vectors is greater than zero: the generalized terms "Makes and Models", "Autos" and "Recreation" are common to both vectors. We can easily see that the weight assigned to "Makes and Models" should be, and is, more than that of "Autos", as it is less generalized than "Autos" and appears in important topic paths. So, more weight is given to the terms which depict the context of a term. The flowchart of the overall approach is shown in Figure 4.6.
Figure 4.4: Topic Paths for the term Jaguar
Figure 4.5: Terms Contributing in Cosine Similarity Computation
Figure 4.6: Flowchart of Proposed Approach (Corpus → Indexer (stop-word removal, stemming, TF) → Enriched Document (DTerms and GTerms via the knowledge repository) → Weights (standard scheme such as TF or TF-IDF for DTerms; proposed scheme for GTerms) → Enriched and Weighted Document Vector)
In Figure 4.6, documents in the corpus are pre-processed and features are indexed by the indexer module. The "Enriched Document" module enriches the document with generalized terms by using the index terms along with the external knowledge repository. We then have two types of features in the document vector: (i) document terms (index terms) and (ii) generalized terms. These features are weighted in the Weights module, which uses different methods to weigh document terms and generalized terms. Finally, as output we get the enriched and weighted document vector, which can be used as input to several text mining tasks. The detailed procedure of our approach was explained in the preceding sub-sections.
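As a usage illustration, the earlier sketches can be combined into the pipeline of Figure 4.6. Here `tokenize`, `idf` and `query_topic_paths` are assumed helpers (a tokenizer with stop-word removal and stemming, an IDF table, and the web-directory lookup), not components defined in the thesis.

```python
from collections import Counter

def enrich_document(text, idf, query_topic_paths, tokenize):
    """Build the enriched and weighted vector for one document (Figure 4.6)."""
    dterms = tokenize(text)                       # indexer: stop-words + stemming
    tf = Counter(dterms)
    edterm_sets = []
    for dterm, freq in tf.items():
        w_dterm = freq * idf.get(dterm, 0.0)      # TF-IDF weight for the DTerm
        edterm_sets.append(build_edterm_set(dterm, w_dterm, query_topic_paths))
    return build_enriched_vector(edterm_sets)     # merge all EDtermSets
```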
4.4 Experiments
We conducted experiments on the WebData¹ dataset consisting of 314 web documents already classified into 10 categories. To measure the effectiveness of different weighting schemes, we cluster the documents with Bisecting-KMeans [39], a variant of KMeans. Several runs of Bisecting-KMeans are used to register the average purity value. Cosine similarity is used as the proximity measure. For evaluation of cluster quality we use the following purity measure. Let the given test clusters be

$$C = \{C_1, C_2, \cdots, C_{10}\} \qquad (4.6)$$

and the clusters obtained by the several approaches be

$$C' = \{C'_1, C'_2, \cdots, C'_{10}\} \qquad (4.7)$$

Each resulting cluster C'_i from a partitioning C' of the overall document set D is treated as if it were the result of a query. The precision of a cluster C'_i ∈ C' for a given category C_j ∈ C is given by

$$Precision(C'_i, C_j) = \frac{|C'_i \cap C_j|}{|C'_i|} \qquad (4.8)$$

The overall value of purity is computed by taking the weighted average of the maximal precision values:

$$Purity(C', C) = \sum_{C'_i \in C'} \frac{|C'_i|}{|D|} \max_{C_j \in C} Precision(C'_i, C_j) \qquad (4.9)$$
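A short Python sketch of the purity computation in Equations 4.8 and 4.9, assuming each cluster is given as a list of true category labels; this is an illustrative reading of the formulas, not the exact evaluation code used for the experiments.

```python
def clustering_purity(clusters):
    """Weighted purity of a clustering (Eqs. 4.8 and 4.9).

    clusters -- list of clusters, each a list of true category labels
    """
    total = sum(len(c) for c in clusters)
    purity = 0.0
    for cluster in clusters:
        if not cluster:
            continue
        # max precision over categories = size of the majority category / |C'_i|
        best = max(cluster.count(label) for label in set(cluster))
        purity += (len(cluster) / total) * (best / len(cluster))
    return purity

# Example with two clusters over ground-truth labels
print(clustering_purity([["sports", "sports", "autos"], ["autos", "autos"]]))
```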
We have conducted experiments with three types of feature vectors:

• Only document vector (ODV): In this experimental setting, the vectors of the original documents (d_d) are used for document-document similarity, that is, similarities are calculated without adding GTerms. The feature vector is then weighted with the weighting schemes in Table 4.5.

¹ http://pami.uwaterloo.ca/~hammouda/webdata/
Table 4.5: Term Weighting Schemes. tf is the term frequency, D is the total number of documents in the collection, df is the document frequency, dl is the document length, and avg_dl is the average document length for the collection.

Name      Term Weight Scheme
TF        tf
TF-IDF    tf * log(D / df)
LTU       ((log(tf) + 1) * log(D / df)) / (0.8 + 0.2 * dl / avg_dl)
INQUERY   (tf / (tf + 0.5 + 1.5 * dl / avg_dl)) * (log((D + 0.5) / df) / log(D + 1))
• Enriched document vector with common weighting (EDVCW): Here, GTerms are added to the document vector d_d. The enriched document vector is the concatenation of the document terms and all generalized terms, d'_d = {d_d, G_d}. The same weighting scheme is used to weight both d_d and G_d; for instance, if TF-IDF is used to weight d_d, TF-IDF is used to weight G_d too.

• Enriched document vector with proposed weighting (EDVPW): The enriched document vector is again d'_d = {d_d, G_d}. What differs from EDVCW is the way weights are assigned to G_d. We use each weighting scheme mentioned in Table 4.5 for DTerms and the proposed approach to assign weights to GTerms.
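A minimal sketch of the weighting schemes of Table 4.5, written as plain functions over the quantities named in the table caption; the natural logarithm is assumed, as the base is not stated in the table.

```python
import math

def tf_weight(tf):
    return tf

def tfidf_weight(tf, D, df):
    return tf * math.log(D / df)

def ltu_weight(tf, D, df, dl, avg_dl):
    return (math.log(tf) + 1) * math.log(D / df) / (0.8 + 0.2 * dl / avg_dl)

def inquery_weight(tf, D, df, dl, avg_dl):
    return (tf / (tf + 0.5 + 1.5 * dl / avg_dl)) * \
           (math.log((D + 0.5) / df) / math.log(D + 1))
```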
4.4.1 Experiment without pruning terms
In this experiment we have not used any threshold to remove non-informative words. In Table 4.6 and Figure 4.7, we can see that EDVPW outperforms the other feature vector representations for all the weighting schemes; INQUERY is the exception, where EDVPW obtains slightly lower purity than ODV. Taking the TF-IDF column, the value in the ODV row, i.e. 0.7580, is the purity of clusters obtained with only the document vector (without enrichment) whose terms were weighted by the TF-IDF scheme. The value in the TF-IDF column and EDVCW row, i.e. 0.6883, is the purity obtained with the enriched document vector whose terms (both DTerms and GTerms) were weighted by the TF-IDF scheme. The value in the TF-IDF column and EDVPW row, i.e. 0.7730, is the purity obtained with the enriched document vector whose terms were weighted by our proposed scheme: DTerms by TF-IDF and GTerms by the hierarchical weighting scheme. A similar reading applies to the other weighting schemes TF, LTU and INQUERY. Figure 4.7 presents a histogram comparing the purity values of the different feature selection and weighting schemes.

Table 4.6: Average purity values without pruning terms

Feature Vector   TF       TF-IDF   LTU      INQUERY
ODV              0.7053   0.7580   0.6920   0.6565
EDVCW            0.5231   0.6883   0.6121   0.6367
EDVPW            0.7480   0.7730   0.7282   0.6503
It can be observed that, surprisingly, the clustering performance did not improve when generalized terms were added and weighted with the same weighting scheme used for DTerms (the EDVCW vector); on the contrary, it degraded. With the addition of highly generalized terms (which tend to be super-concepts of several terms) without appropriate weights, the power to discriminate between two documents is crippled. As a result, the proximity computation between two documents is affected, and so is the quality of the clusters. The purity comparison is made across schemes applied to the different feature vector representations. It can be noted that we can take any base statistical weighting scheme and apply our feature representation and feature weighting approach to obtain better results. Comparing across the different weighting schemes, TF-IDF together with EDVPW performs best.
Figure 4.7: Average Purity values without Pruning Terms
4.4.2 Experiment with pruning terms
When using a vector space approach, documents lie in a space whose dimensionality typically ranges from several thousand to tens of thousands. Nonetheless, most documents normally contain a very limited fraction of the total number of terms in the adopted vocabulary; hence the vectors representing documents are very sparse. This can make learning extremely difficult in such a high-dimensional space, especially due to the so-called curse of dimensionality. It is typically desirable to first project documents into a lower-dimensional subspace which preserves the semantic structure of the document space but facilitates the use of traditional clustering algorithms. In this experiment, we used a simple pruning method of removing terms which occur less than a certain number of times and carried out two experiments: one by selecting the document terms having term frequency greater than 6 and then adding generalized terms for each document term (Table 4.7 and Figure 4.8), and another by selecting the terms having frequency greater than 31 (Table 4.8 and Figure 4.9). The experiments conducted with pruned terms showed better results than those without pruning. Low-frequency terms do not represent the document, so the addition of their generalized terms adds noise to the document vector; due to this noise, the similarity values, and thus the purity values, were compromised.
Figure 4.8: Average Purity values after removing terms with frequency < 6
Figure 4.9: Average Purity values after removing terms with frequency < 31
Table 4.7: Average Purity values after removing terms with frequency < 6
$$TPath_{ij} := \langle x_k : x_{k-1} : \cdots : x_0,\ count \rangle \qquad (5.1)$$
Here, x_0 is DTerm_i, x_k is an immediate generalization of x_{k-1}, k is the level in the topic path, and count is the number of related web pages which fall under the respective topic path. Table 5.1 shows five topic paths related to the term "BMW". Let "BMW" be the first term of the document; then TPath_11 := <Recreation : Autos : Makes and Models : BMW, 91>.

• Generalized term (GTerm_ijk): Given a topic path, the terms other than the DTerm are called generalized terms. GTerm_ijk is a generalized term occurring in TPath_ij for k ≠ 0, where k is the level number. For example, in TPath_11, k is 0 for "BMW", and 1, 2, 3 for "Makes and Models", "Autos" and "Recreation" respectively.

• WDTerm_i and WGTerm_ijk: The weights of DTerm_i and GTerm_ijk are denoted by WDTerm_i and WGTerm_ijk, respectively.
Figure 5.2: Relation between DTerm_i and its topic paths
The relation between DTerm_i and its topic paths is depicted in Figure 5.2. Here, a term is pointed to by its immediate generalized term. The flowchart of the proposed approach is shown in Figure 5.3.

Figure 5.3: Flowchart of the proposed approach (Corpus → document vector d_m (stop-word removal, stemming, TF) → enriched document d'_m (DTerms and GTerms from the knowledge repository) → compute weights of terms in d'_m (scheme such as TF or TF-IDF) → boost the weights of DTerms in d_m → enriched and weighted document vector d_m)

For each document, the proposed approach follows these steps:

1. Generation of the document vector d_m.
2. Formation of the enriched document vector d'_m.
3. Computing the weights of terms in d'_m.
4. Boosting the weights of terms in d_m.
5. Document vector d_m with boosted weights.

These steps are explained in the following sub-sections one by one.
5.3.1 Generation of document vector d_m
Let D be the total number of documents in the dataset. The feature vector of each document d_m (1 ≤ m ≤ D) in n-dimensional term space is d_m = (<t_1, w_1>, <t_2, w_2>, ..., <t_n, w_n>), where t_n is the n-th term in document d_m and w_n is the weight of term t_n. High-frequency words (stop-words) such as "i", "the", "am" and "and" are removed using a stop-word list. Then the terms are reduced to their basic stem by applying a stemming algorithm.
5.3.2 Formation of enriched document vector d'_m
In this step, by giving each DTerm_i of the document as input, the respective GTerms are obtained. We define this collection of DTerm_i and its GTerms from all topic paths as the enriched term set (EDTermSet_i). Formally,

$$EDTermSet_i := \{DTerm_i\} \cup \{GTerm_{ijk}\} \qquad (5.2)$$

The algorithm for generating EDTermSet_i is given in Table 5.2. A description of some steps in Table 5.2 is as follows:
Step 1: DTerm_i is stored in the enriched term set.
Step 3: TPathList stores the list of all topic paths of DTerm_i.
Step 6: All the generalized terms in all the topic paths of DTerm_i are stored in EDTermSet_i.

Table 5.2: Algorithm for Generating the Enriched-term set (EDTermSet_i)
Input: DTerm_i
Output: EDTermSet_i, the enriched term set for DTerm_i

1. EDTermSet_i = {DTerm_i}
2. TPathList = {}
3. TPathList = AddTopicPaths(DTerm_i)
4. foreach TPath_ij ∈ TPathList
5.     foreach GTerm_ijk
6.         EDTermSet_i = EDTermSet_i ∪ {GTerm_ijk}
7.     end
8. end
For every term of every EDTermSet, if a term from the EDTermSet is not present in d'_m, then the term is added to d'_m. If the term already exists in d'_m, then the frequency of this term is added to its instance in d'_m. Formally,

$$\vec{d}'_m := \bigcup_i \{EDTermSet_i\}, \quad 1 \le i \le n \qquad (5.3)$$
Table 5.3: Algorithm for Formation of the enriched document vector d'_m
Input: n enriched term sets (EDTermSets)
Output: enriched document vector d'_m

1. d'_m = {}
2. foreach EDTermSet_i, 1 ≤ i ≤ n
3.     foreach term_t ∈ EDTermSet_i
4.         if (AlreadyExists(term_t))         // check in d'_m
5.             IncrementFrequency(term_t)     // frequency updated in d'_m
6.         else
7.             d'_m = d'_m ∪ {term_t, frequency_t}
8.     end
9. end
The algorithm for generating d'_m is given in Table 5.3. A description of some of its steps is as follows. For each term in EDTermSet_i:
Step 4: checks whether the term already exists in d'_m.
Step 5: increments the frequency of the already existing term.
Step 7: adds the term and its frequency if the term is not present in d'_m.
5.3.3 Computing the weight of terms in d'_m
We can use any weighting scheme to weigh the terms (DTerms and GTerms) in d'_m. Here, we use TF-IDF; the effects of weighting with OKAPI and LTU are also shown in the experiments section. The same weights (WGTerm_ijk) will be used to boost the weights of DTerms in the next sub-section. A word within a given document is considered important if it occurs frequently within the document but infrequently in the larger collection. So, the weights of terms in d'_m are initialized by the popular TF-IDF approach.
5.3.4 Boosting the weight of terms in d_m
In this section, we propose a method to boost the term weight. Let BWDTerm_i denote the boosted weight of DTerm_i, which is equal to WDTerm_i + BWeight_i. Here, BWeight_i is the boost component, calculated from the corresponding generalized terms of DTerm_i:

$$BWeight_i = \sum_{j=1}^{p} \sum_{\forall k} BWeight_{ijk} \qquad (5.4)$$

Here, BWeight_ijk represents the boost component for DTerm_i from the corresponding generalized term GTerm_ijk, which is calculated as follows:

$$BWeight_{ijk} = imp_j \cdot e^{-k} \cdot WGTerm_{ijk} \qquad (5.5)$$
It can be noted that BWeight_ijk is proportional to three factors:

1. WGTerm.
2. imp_j.
3. e^{-k}.

These factors are explained as follows.

• WGTerm: BWeight_i is proportional to WGTerm_ijk because a high weight indicates the importance of a term for the document. In other words, highly weighted terms represent the document in a more informative manner. If two DTerms share a common important GTerm, then those DTerms are important and related. So, DTerms having highly weighted GTerms also become important. Thus the weight of DTerms should be proportional to the weight of their GTerms.

• imp_j: A DTerm_i has several topic paths because of the polysemous nature of terms, that is, a term can be used in many contexts. The importance of a term may differ across topic paths, so more weight should be given to the topic path which is more important for the term. The importance of a term towards the j-th topic path is captured by the probability of the term occurring in that topic path, i.e.

$$imp_j = \frac{count(j)}{\sum_{m=1}^{p} count(m)}$$

A high imp_j shows that DTerm_i is used more frequently in the j-th context and is thus more related to it, so the DTerm should receive more weight from the GTerms occurring in this topic path.

• e^{-k}: GTerms closer to the DTerm represent the document more precisely than the more generalized terms further along the topic path. So, GTerms which are close to the DTerm in the topic path should boost relatively more weight than GTerms which are farther away. We use a decreasing (exponential) function so that less weight is boosted as the level in the topic path increases. For example, consider topic path 1 of DTerm "BMW" in Table 5.1: "BMW" is at level 0, while "Makes and Models", "Autos" and "Recreation" are at levels 1, 2 and 3 respectively. "Makes and Models" is the immediate generalized term and thus defines "BMW" better than the other generalized terms such as "Autos" or "Recreation". Thus more weight should be boosted by "Makes and Models".
Table 5.4: Algorithm for Boosting the weights of DTerms
Input: d'_m and the n EDTermSets
Output: document vector d_m with boosted weights

1.  BWeight_i = 0
2.  Compute the weights of terms in d'_m with TF-IDF
3.  foreach EDTermSet_i, 1 ≤ i ≤ n
4.      foreach GTerm_ijk ∈ EDTermSet_i
5.          imp_j = count(j) / sum_{m=1..p} count(m)
6.          WGTerm_ijk = Fetch_weight(GTerm_ijk)
7.          BWeight_ijk = imp_j * exp(-k) * WGTerm_ijk
8.          BWeight_i = BWeight_i + BWeight_ijk
9.      end
10.     WDTerm_i = Fetch_weight(DTerm_i)
11.     BWDTerm_i = WDTerm_i + BWeight_i
12.     d_m := UpdateWeight(DTerm_i, BWDTerm_i)
13. end
14. Normalize the weights in document vector d_m:

$$WDTerm_i = \frac{WDTerm_i}{\sqrt{\sum_{x=1}^{n} (WDTerm_x)^2}}$$
The algorithm for boosting the weights of DTerms is given in Table 5.4. A description of some of its important steps is as follows:
Step 1: initializes the boost weight to zero.
Step 2: computes the weights of the terms (DTerms and GTerms) in d'_m.
Step 6: fetches the weight of GTerm_ijk from d'_m (weight assigned at Step 2).
Step 8: calculates the boost weight that has to be assigned to DTerm_i.
Step 10: fetches the weight of DTerm_i from d'_m (weight assigned at Step 2).
Step 12: updates the weight of DTerm_i in the document vector d_m.
Step 14: normalizes the weights of the terms in d_m.
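The following is a minimal Python sketch of the BoostWeight step of Table 5.4, assuming the enriched vector d'_m is a term-to-weight dictionary (e.g. TF-IDF weights) and that topic paths are supplied as (path_terms, count) pairs with path_terms ordered from the DTerm at level 0 upwards; the `topic_paths` mapping is an assumed stand-in for the open web directory lookup.

```python
import math

def boost_dterm_weights(doc_vector, enriched_weights, topic_paths):
    """Boost DTerm weights using their GTerms (Table 5.4).

    doc_vector       -- {dterm: weight} for the original document terms
    enriched_weights -- {term: weight} over d'_m (DTerms and GTerms), e.g. TF-IDF
    topic_paths      -- {dterm: [(path_terms, count), ...]}, path_terms[0] == dterm
    """
    boosted = {}
    for dterm, w_dterm in doc_vector.items():
        paths = topic_paths.get(dterm, [])
        total = sum(count for _, count in paths) or 1.0
        bweight = 0.0                                         # step 1
        for path_terms, count in paths:
            imp_j = count / total                             # step 5
            for level, gterm in enumerate(path_terms):
                if level == 0:
                    continue
                w_gterm = enriched_weights.get(gterm, 0.0)    # step 6
                bweight += imp_j * math.exp(-level) * w_gterm # steps 7-8
        boosted[dterm] = w_dterm + bweight                    # steps 10-12
    # step 14: normalize the boosted vector to unit length
    norm = math.sqrt(sum(w * w for w in boosted.values())) or 1.0
    return {t: w / norm for t, w in boosted.items()}
```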
5.3.5 Document vector d_m with boosted weights
Now the document vector d_m is obtained, which contains the DTerms along with their boosted weights; the enriched document vector d'_m is discarded. To provide more insight, we present an example and pictorially show an enriched document with two DTerms, DTerm_x and DTerm_y (Figure 5.4). The DTerms, along with their topic paths, are merged to form the enriched document.

Figure 5.4: Document after merging all topic paths for all DTerms

For the term DTerm_x, three topic paths are obtained:
1. TPath_x1 := <Ax13 : Ax12 : Ax11 : DTerm_x>
2. TPath_x2 := <Bx22 : Bx21 : DTerm_x>
3. TPath_x3 := <Cx33 : Cx32 : Cx31 : DTerm_x>
The term DTerm_y has two topic paths:
1. TPath_y1 := <By12 : By11 : DTerm_y>
2. TPath_y2 := <Cy23 : Cy22 : Cy21 : DTerm_y>

Suppose Bx21 and By11, and Bx22 and By12, are the same terms; after merging they are therefore kept in the same nodes of the enriched document. From Figure 5.4 we can see that the nodes with terms Bx21 and Bx22 are generalized terms for both DTerm_x and DTerm_y, so both Bx21 and Bx22 should boost the weights of DTerm_x and DTerm_y. The factors which affect the boost in weight of the nodes holding DTerm_x and DTerm_y are:
1. Importance of the paths Bx21 → DTerm_x and Bx22 → DTerm_y.
2. Weight of Bx21 and Bx22.
3. Distance of Bx21 and Bx22 from DTerm_x and DTerm_y respectively.

So, the final weight of a DTerm is the summation of boosts from all the GTerms occurring in the document in whose topic paths this DTerm appears.
Figure 5.5: Topic Paths for the term BMW
Figure 5.6: Topic Paths for the term Jaguar
For example, the topic paths obtained for "BMW" from the open web directory are listed in Table 5.1. A pictorial (hierarchical) view of the generalized terms in the topic paths for "BMW" is shown in Figure 5.5. It can be noted that the immediate, or level one, generalized terms for "BMW" are "Makes and Models", "Retailers" and "By Make". The level two generalized terms for "BMW" are "Autos", "Motorcycles", "Makes and Models" and "Purchasing". Similar observations can be made in the pictorial representation for the term "Jaguar" in Figure 5.6. We have an enriched document which initially had only the two terms "BMW" and "Jaguar". Figure 5.7 shows the terms and their structural and contextual relationships. Cosine similarity depends on the common terms between two vectors and their weights. The frequency of the generalized terms "Makes and Models", "Autos" and "Recreation" is higher than that of the other generalized terms. We can easily see that the weight assigned to "Makes and Models" should be, and is, more than that of "Autos", as it is less generalized than "Autos" and appears in important topic paths. Level one generalized terms are closer to the document term and thus define it more precisely, so level one generalized terms give more weight through the generalization level factor (e^{-k}). The same generalized term can appear in different topic paths, and information about the topic path is necessary as the importance of the topic path is considered. For example, "Makes and Models" appears in three of the links.
Figure 5.7: Relation Between GTerms and DTerms in a document

In the links "Recreation: Autos: Makes and Models" and "Recreation: Motorcycles: Makes and Models", "Makes and Models" is at level one generalization. So, "Makes and Models" gives equal weight through the generalization level factor but different weight because of the topic path importance factor. The final document vector has distinct terms, each with the summation of weights from all the instances of its respective generalized terms. As the length of the document increases, we obtain more generalized terms that are common to several terms, thus depicting the possible context of the document. We cannot use the frequency directly, though, as the frequency of highly generalized terms will be very high because they tend to appear in many links. So we follow the proposed factors to boost the weights of the document terms. In the current example, the weights of "BMW" and "Jaguar" will be increased. If some other term is present in the document and does not share the generalized terms of "BMW" and "Jaguar", then the boost in weight for this term will be smaller. So, the increase in weight of related terms is greater than that of non-related terms, and more weight is given to the set of terms which depicts the context of the document.
Table 5.5: Term Weighting Schemes. In this table, tf indicates the term frequency, N is the total number of documents in the collection, df is the document frequency, dl is the document length, and avg_dl is the average document length for the collection.

Name     Term Weight Scheme
TF-IDF   tf * log(N / df)
OKAPI    (tf / (0.5 + 1.5 * dl / avg_dl + tf)) * log((N - df + 0.5) / (df + 0.5))
LTU      ((log(tf) + 1) * log(N / df)) / (0.8 + 0.2 * dl / avg_dl)

5.4 Experiments
To compare the performance of the proposed approach with existing weighting approaches, we conducted clustering experiments on two data-sets: the WebData² data-set and the Reuters21578 news corpus³. WebData consists of 314 web documents already classified into 10 different categories [40]. From the Reuters dataset we selected documents which belong to only one class and chose those classes whose size was greater than 15, which gave 28 classes with 6599 documents. We used the CLUTO toolkit⁴ to cluster the documents with the bisecting KMeans algorithm and registered the purity and entropy values returned by it. The quality of a clustering solution was measured using two different metrics that look at the class labels of the documents assigned to each cluster. The first metric is the widely used entropy measure, which looks at how the various classes of documents are distributed within each cluster, and the second is purity, which measures the extent to which each cluster contains documents from primarily one class.

² http://pami.uwaterloo.ca/~hammouda/webdata/
³ http://www.daviddlewis.com/resources/testcollections/reuters21578/
⁴ http://www-users.cs.umn.edu/~karypis/
Given a particular cluster C_i of size |C_i|, the entropy of this cluster is defined to be

$$E(C_i) = -\frac{1}{\log(q)} \sum_{j=1}^{q} \frac{|C_i|_j}{|C_i|} \log \frac{|C_i|_j}{|C_i|} \qquad (5.6)$$

where q is the number of classes in the dataset, and |C_i|_j is the number of documents of the j-th class that were assigned to the i-th cluster, C_i. The entropy of the entire clustering solution is then defined to be the sum of the individual cluster entropies weighted according to the cluster size. That is,

$$Entropy = \sum_{i=1}^{k} \frac{|C_i|}{|D|} E(C_i) \qquad (5.7)$$

where k is the number of clusters and |D| is the total number of documents in the dataset.
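A small Python sketch of Equations 5.6 and 5.7, assuming clusters are given as lists of true class labels; this is an illustrative reading of the formulas, not CLUTO's implementation.

```python
import math
from collections import Counter

def clustering_entropy(clusters, num_classes):
    """Weighted entropy of a clustering (Eqs. 5.6 and 5.7).

    clusters    -- list of clusters, each a list of true class labels
    num_classes -- q, the number of classes in the dataset
    """
    total_docs = sum(len(c) for c in clusters)
    entropy = 0.0
    for cluster in clusters:
        if not cluster:
            continue
        counts = Counter(cluster)
        # Eq. 5.6: per-cluster entropy, normalized by log(q)
        e_ci = -sum((n / len(cluster)) * math.log(n / len(cluster))
                    for n in counts.values()) / math.log(num_classes)
        # Eq. 5.7: weight by cluster size
        entropy += (len(cluster) / total_docs) * e_ci
    return entropy

# Example: two clusters over a 3-class dataset
print(clustering_entropy([["a", "a", "b"], ["c", "c"]], num_classes=3))
```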
A perfect clustering solution is one that leads to clusters containing documents from only a single class, in which case the entropy is zero. In general, the smaller the entropy values, the better the clustering solution. In addition to the entropy values we calculate the purity values as mentioned in Section 4.7. The overall purity of the clustering solution is obtained as a weighted sum of the individual cluster purities. In general, the larger the purity value, the better the clustering solution.

In the experiments we compare the existing weighting schemes TFIDF, OKAPI and LTU (Table 5.5) with the corresponding boosted weighting schemes, which we call B-TFIDF, B-OKAPI and B-LTU, respectively. Table 5.6 shows the purity and entropy values of clusters on the WebData and Reuter21578 data-sets for the existing weighting schemes (TFIDF, OKAPI and LTU) as well as the improved weighting schemes proposed in this chapter (B-TFIDF, B-OKAPI and B-LTU).

Table 5.6: Purity and Entropy values of Clusters

                 WebData              Reuter21578
Scheme           Purity    Entropy    Purity    Entropy
TFIDF            0.865     0.122      0.855     0.153
B-TFIDF          0.958     0.061      0.864     0.143
OKAPI            0.868     0.116      0.839     0.168
B-OKAPI          0.952     0.070      0.857     0.156
LTU              0.840     0.149      0.849     0.148
B-LTU            0.869     0.122      0.865     0.141
Figure 5.8 shows the comparison of purity values on the WebData dataset and Figure 5.9 the comparison on the Reuter21578 dataset. Regarding entropy values, Figure 5.10 shows the comparison on the WebData dataset and Figure 5.11 on the Reuter21578 dataset. Histograms compare the purity and entropy values of the different weighting schemes and their boosted versions. It can be observed that BoostWeight improves performance for all weighting schemes. On the WebData dataset, B-TFIDF registers a 9.3% purity improvement and a 6.1% decrease in entropy compared to TFIDF, B-OKAPI registers an 8.4% purity improvement and a 4.6% decrease in entropy compared to OKAPI, and B-LTU registers a 2.9% purity improvement and a 2.7% decrease in entropy compared to LTU. An increase in purity values and a decrease in entropy values with the boosted schemes is noted on the Reuter21578 dataset as well; Figure 5.9 and Figure 5.11 show the results on Reuter21578.
5.5 Comparison of Proposed Schemes
After experimenting with several weighting schemes, we found that B-TFIDF achieves the maximum purity value. In the previous chapter, the best performance was shown by EDVPW with TF-IDF for DTerms. In this section we compare these two schemes on the WebData dataset using the CLUTO toolkit. We try several thresholds with EDVPW to capture its performance.

Table 5.7: Purity and Entropy values of Clusters (WebData)

Scheme        Purity    Entropy
EDVPW, 0      0.8675    0.1241
EDVPW, 5      0.8690    0.1200
EDVPW, 30     0.8695    0.1187
EDVPW, 45     0.8954    0.1156
EDVPW, 60     0.9227    0.0962
EDVPW, 75     0.9312    0.0925
EDVPW, 90     0.9344    0.0915
B-TFIDF       0.9580    0.0610
In Table 5.7, "EDVPW, 5" means that EDVPW is used with a threshold of 5: terms whose frequency is less than 5 are removed from the document vector. It can be noted that, as expected, decreasing the dimensionality improves both the entropy and purity values. B-TFIDF registers the best result.
Figure 5.8: Purity values for the WebData dataset

Figure 5.9: Purity values for the Reuter21578 dataset

Figure 5.10: Entropy values for the WebData dataset

Figure 5.11: Entropy values for the Reuter21578 dataset
5.6 Summary
Traditional weighting schemes consider statistics of term occurrence to compute term weights and do not consider the semantic association between terms. In this chapter, in addition to term occurrence statistics, we have investigated how the semantic association among terms can be exploited for a better term weighting method. We exploit the generalization ability of hierarchical knowledge repositories such as the Open Web Directory to semantically associate terms. In the proposed approach, we first extract the document terms from the document and the corresponding generalized terms from the Open Web Directory. Using the proposed weighting approach, we boost the term weights of the document terms by exploiting the semantic relationship among document terms based on their corresponding generalized terms. Clustering experiments on both data-sets show that the proposed BoostWeight approach improves both purity and entropy values.
Chapter 6

Improving Cohesiveness of Text Document Clusters

Text document clustering plays an important role in the performance of information retrieval, search engines and text mining systems. Traditional clustering algorithms like k-means and bisecting k-means depend on the similarity value between a document vector and a cluster vector. Ideally, the representative (highly weighted) features of a cluster and the representative features of the documents in that cluster should be identical. So, assigning weights to the feature vector of a cluster is an important issue: more weight should be given to the representative and discriminative features of the cluster. Clustering is an unsupervised process, so it is difficult to capture the possible distribution of documents; an intermediate step has to be introduced from which a clustering algorithm can find the discriminative features of each cluster. A feature should be a representative feature of a cluster if it is frequent in this cluster and at the same time not so frequent in other clusters. In other words, a representative feature of a cluster should also discriminate the cluster from other clusters. By capturing and weighting these discriminative features there is scope to obtain quality clusters. In this chapter, we propose a methodology to refine a given set of clusters by incrementally moving documents between clusters. To accomplish this, we make an attempt to identify a discriminative feature set for each cluster.
6.1 Background
Clustering is the partitioning of a data set into subsets, so that the data in each subset share some common trait. The main idea is to (i) extract unique content-bearing words from the set of documents and (ii) represent each cluster as a vector of certain weighted terms.
Text classification is a supervised learning process, which assigns pre-defined category labels to new documents based on the likelihood suggested by a training set. Pre-categorized data-sets are far less available than non-structured, non-categorized data-sets. This shows the need for unsupervised methods which can use and learn from the beneficial steps of supervised learning and return quality results. State-of-the-art clustering algorithms extract and use the relationship between features and documents, but do not use classes as they are not available. In this chapter, we focus on the quality of clustering and put aside the orthogonal problem of time complexity. We use the bisecting k-means algorithm to cluster the documents. These clusters are not considered the final result and are used as virtual clusters. The virtual clusters are then used to:

• capture the neighborhood information of documents;
• identify and weight the discriminative features of each cluster;
• refine clusters by incrementally moving documents from one cluster to another based on a new (proposed) similarity criterion between document vectors and the cluster vector (cluster mean).

Document clustering methods can be mainly categorized into two types: partitioning and hierarchical clustering. Both of these methods have been extensively studied by researchers and there has been a lot of interesting work in this area. Clustering plays an important role in automated indexing, browsing collections of documents, genre identification, spam filtering, word sense disambiguation, image segmentation and gene expression analysis, by organizing large amounts of documents into a small number of meaningful clusters [41, 8, 42]. Hierarchical clustering methods are useful for various data mining tasks. Since hierarchical clustering is a greedy search algorithm, the merging decisions made early in the agglomerative process are not necessarily the right ones. In [43], a solution is proposed to refine a clustering produced by the agglomerative hierarchical algorithm to correct the mistakes made early in the agglomerative process. Scatter/Gather [44] is a well-known algorithm which has been proposed for a document browsing system based on clustering; it uses a hierarchical clustering algorithm to determine an initial clustering, which is then refined using the k-means clustering algorithm. Many variants of the k-means algorithm have been proposed for the purpose of text clustering. The classical version of k-means uses Euclidean distance; however, this distance measure is inappropriate for clustering a collection of text documents [45]. An effective measure of similarity between documents, and one that is often used in information retrieval, is cosine similarity, which uses the cosine of the angle between two document vectors.
K-means depends on the initial seed set and may therefore fail to give a globally optimal clustering solution. In [46], various clustering initialization methods are discussed for better separation in the output solution. The bisecting k-means algorithm outperforms basic k-means as well as agglomerative hierarchical clustering algorithms in terms of accuracy and efficiency [47, 8].
6.2 Proximity Measure
We measure the similarity between two document vectors d_1 and d_2 by calculating cosine similarity. Similarly, the value of the dot product between a document vector and a cluster's mean is equivalent to the average similarity between the document and all the documents of the cluster represented by the cluster mean [48]:

$$Sim(\vec{d}_1, \vec{C}_k) = \frac{\vec{d}_1 \cdot \vec{C}_k}{|\vec{d}_1||\vec{C}_k|} = \frac{1}{|S|}\sum_{k \in S} Cosine(\vec{d}_1, \vec{d}_k) \qquad (6.1)$$
6.3 Basic Idea
The problem of finding clusters in data is challenging when clusters are of widely differing sizes, densities and shapes, and when the data contains large amounts of noise and outliers. The use of a shared nearest neighbor definition of similarity removes problems with varying density, while the use of core points handles problems with shape and size [49]. If this knowledge about the distribution of documents is utilized, there is scope to improve the quality of clusters. An intermediate phase should be introduced from which the clustering algorithm can learn about the neighborhood of the documents. A cluster should contain documents which are correlated with each other and which, at the same time, carry some discriminative features of the cluster.
6.4 Proposed Cluster Refinement Scheme
We explain the proposed approach in the following steps:

1. Generation of virtual clusters.
2. Finding core documents in virtual clusters.
3. Refinement of virtual clusters.

We explain these steps one by one in the following sub-sections.
6.4.1 Generation of virtual clusters
Bisecting k-means [39] is used to cluster the given data-set. These clusters are considered virtual clusters, which are used to (i) capture neighborhood information of documents, (ii) identify and weight discriminative features of each cluster, and (iii) refine clusters by incrementally moving documents from one cluster to another based on a new (proposed) similarity criterion between the document vector and the cluster vector (cluster mean). The number of virtual clusters is equal to the desired number of clusters, N. So VC_k is the k-th virtual cluster, where 1 ≤ k ≤ N, and the mean of VC_k is denoted by VM_k. Given a set S of documents in a virtual cluster VC_k, we define the mean of the virtual cluster to be

$$\vec{VM}_k = \frac{1}{|S|} \sum_{k=1}^{|S|} \vec{d}_k \qquad (6.2)$$

which is obtained by averaging the weights of the various terms present in the documents of S. Here, we use the TF-IDF weighting scheme to weigh the terms of the document vector d_k, though various other weighting schemes can be used, some of which are mentioned in Table 2.
6.4.2 Finding core documents in virtual cluster

Core documents of a VC_k (CDoc_k) are the documents whose similarity with the mean VM_k is greater than α. The documents which are not close to the mean (lower similarity values) are considered for transfer to other clusters. As clusters can differ in size and density, a global similarity threshold should not be fixed. If a global threshold is fixed and its value is low, then the number of core documents will be high and may include documents which should be in other clusters. If the threshold is chosen to be high, then there will be more non-core documents and the membership of documents which should remain in this cluster is unnecessarily checked. To overcome this problem of varying density across clusters, the similarity threshold for core documents should not be global; the information from the virtual clusters can be used to set the threshold locally. So, the threshold α is set to be the average pairwise similarity between all documents in VC_k:

$$\alpha = \frac{1}{|S|^2} \sum_{x=1}^{|S|} \sum_{y=1}^{|S|} Sim(\vec{d}_x, \vec{d}_y) \qquad (6.3)$$

The value of α will be higher for a dense cluster, as its documents are in close proximity; in such a dense cluster, if a document is far from the mean, it may not belong to this cluster. So, only non-core documents are checked for membership in other clusters. The value of α will be low for clusters with low density.
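A minimal sketch of this step, assuming documents are dense, unit-normalized numpy vectors so that the dot product equals cosine similarity; the names below are illustrative, not from the thesis.

```python
import numpy as np

def find_core_documents(doc_vectors):
    """Split one virtual cluster into core and non-core documents (Eqs. 6.2, 6.3).

    doc_vectors -- (|S|, n) array of unit-normalized document vectors in the cluster
    Returns (core_idx, noncore_idx, mean_vector, alpha).
    """
    mean = doc_vectors.mean(axis=0)                  # Eq. 6.2: cluster mean VM_k
    sims = doc_vectors @ doc_vectors.T               # pairwise cosine similarities
    alpha = sims.sum() / (len(doc_vectors) ** 2)     # Eq. 6.3: local threshold
    sim_to_mean = doc_vectors @ mean / (np.linalg.norm(mean) + 1e-12)
    core_idx = np.where(sim_to_mean > alpha)[0]
    noncore_idx = np.where(sim_to_mean <= alpha)[0]
    return core_idx, noncore_idx, mean, alpha
```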
6.4.3 Refinement of virtual clusters

In this step, the membership of each non-core document (a document whose Sim(d_k, VM_k) is less than α) is checked. For a document d_k, the similarity with all clusters is calculated and d_k is moved to the cluster with which it has the maximum similarity value. The quality of this iterative movement process depends on the choice of similarity function. As cosine similarity is used, the weights assigned to the terms in d_k and VM_k are crucial. Traditional clustering algorithms like k-means and bisecting k-means, which decide the membership of a document in a cluster based on its similarity with the cluster, use the same weighting scheme for the document and the cluster vector, i.e. if TF-IDF is used for d_k then TF-IDF is used to weigh the terms in the cluster mean (VM_k). Here, TF-IDF is not used to weigh the mean of a cluster, as it treats each feature equally for all clusters and does not take into account the fact that the weight of a feature in a document is related not only to the document but also to the cluster that the document belongs to [3, 39, 50]. A feature may have different importance for different clusters. In this chapter, the refined mean (RM_k) is introduced to keep track of the discriminative features of VC_k. The feature vector of a cluster, i.e. RM_k, should be weighted in such a manner that the weight represents the importance of the feature for the cluster. A weighting scheme, "term frequency inter-cluster frequency (TF-ICF)", is proposed to weigh the terms in RM_k.

• TF-ICF weighting scheme: Once virtual clusters are formed, features are weighted according to their frequency in a cluster, taking the other clusters into account as well. This measure captures the importance of a term for a cluster; features which occur in a cluster and are not so frequent in other clusters get more weight.

$$W_{ik} = tf_{ik} \cdot \log_2\left(\frac{N+1}{n}\right) \qquad (6.4)$$

where W_ik is the weight of term t_i in VC_k, tf_ik is the frequency of term t_i in VC_k, N is the number of clusters to be formed, and n is the number of clusters in which t_i occurs at least once.

For every term t_i of VC_k, the weight W_ik is calculated, which represents the contribution of term t_i to the discriminative features of VC_k. The greater the weight W_ik, the more important this term/feature is for the cluster. Any document which has the same semantics as this cluster (VC_k) should be moved to VC_k.
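A minimal sketch of the TF-ICF weighting (Equation 6.4) for building the refined mean of each virtual cluster. Clusters are assumed to be given as lists of per-document term-frequency dictionaries, an illustrative representation rather than the thesis's own data structures.

```python
import math
from collections import Counter

def tficf_refined_means(clusters):
    """Compute a TF-ICF-weighted refined mean RM_k for each virtual cluster.

    clusters -- list of clusters; each cluster is a list of {term: tf} dicts
    Returns a list of {term: W_ik} dicts, one per cluster (Eq. 6.4).
    """
    N = len(clusters)
    # tf_ik: total frequency of each term inside cluster k
    cluster_tf = [sum((Counter(doc) for doc in cluster), Counter())
                  for cluster in clusters]
    # n: number of clusters in which the term occurs at least once
    cluster_count = Counter()
    for tf in cluster_tf:
        cluster_count.update(tf.keys())
    refined_means = []
    for tf in cluster_tf:
        rm = {term: freq * math.log2((N + 1) / cluster_count[term])
              for term, freq in tf.items()}
        refined_means.append(rm)
    return refined_means
```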
Figure 6.1: Refining of clusters (schematic showing documents being moved between virtual clusters VC_i, VC_j and VC_k, with their refined means RM_i, RM_j, RM_k and the core documents CDoc_k).

To decide the membership of a non-core document, the similarity of its document vector d_d with the refined mean RM_k of VC_k is calculated for every k. The document is moved to RC_k if and only if
\mathrm{Sim}(\vec{d}_d, \overrightarrow{RM}_k) > \mathrm{Sim}(\vec{d}_d, \overrightarrow{RM}_j) \quad \forall j,\; j \neq k \qquad (6.5)
Otherwise, the document is moved to the cluster with which it has the maximum similarity value. Removing a document from VC_i and assigning it to VC_j (Figure 6.1) changes the means of VC_i and VC_j; in other words, the terms in RM_i and RM_j will change. The TF-ICF weights of the terms of d_d are therefore recalculated, since the term frequencies and the number of clusters in which a term appears (tf_ij and n respectively) may change for the terms in d_d.

The refined means RM_k of the clusters "from which" and "to which" a document is transferred have to be updated after every transfer. Because of these movements, the neighborhood of the documents in a cluster keeps changing, and so does the density of the different features; more weight is given to a feature that is frequent in one cluster and not so frequent in the others. After every movement, if any, a check for small clusters is performed. Clusters having fewer than 5 documents are considered small. If a small cluster is found, its documents are assigned to other clusters using the same notion of similarity, and, since the number of clusters has to remain N, the largest cluster is bisected using bisecting k-means (this step follows the routine algorithm, with the same weighting scheme for both the cluster vector and the document vector). The whole process can be repeated ITER times.

For example, in Figure 6.1 the document d_d is moved from VC_i to VC_j because Sim(d_d, RM_j) is greater than both Sim(d_d, RM_i) and Sim(d_d, RM_k); after this movement, RM_i and RM_j are generated again. In short, after clustering the documents, the feature vector of every cluster is compared with the document vector, and the document is moved to the cluster having the most similar feature vector.
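The decision rule of equation (6.5) can be sketched as below, again assuming L2-normalised vectors so that a dot product equals cosine similarity; the function name and the return convention are illustrative only.

```python
import numpy as np

def reassign_non_core(doc_vector, refined_means, current_cluster):
    """Pick the target cluster of one non-core document (Eq. 6.5): it goes to
    the cluster whose TF-ICF-weighted refined mean RM_k is most similar to it.
    refined_means: one row per cluster."""
    sims = refined_means @ doc_vector       # Sim(d_d, RM_j) for every cluster j
    target = int(np.argmax(sims))
    moved = (target != current_cluster)     # the document moves only if some
    return target, moved                    # other refined mean is closer
```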
After a document movement, the feature vectors of the affected clusters are updated. The cluster feature vectors are therefore dynamic and incorporate the changes in feature density. Once a document has been moved, it is not considered for transfer to another cluster again, at least within the current iteration. The same process is repeated for a pre-defined number of iterations (ITER). The algorithm for cluster refinement is given in Table 6.1; the important steps are as follows:

Step 7: finds the core documents of cluster k.
Step 9: documents which are in cluster k but are non-core are checked for membership in other clusters.
Step 11: checks whether the document has been moved to another cluster.
Step 12: the TF-ICF weights of the terms occurring in this document are recalculated.
Step 13: checks whether all clusters have at least the pre-defined number of documents. If not, then at
Step 14: all documents of the small cluster are moved to other clusters, and the TF-ICF weights of these document terms are recalculated at Step 15.
Step 16: as the number of clusters should be N, bisecting k-means is used to bisect the largest cluster.
6.5 Experiments
To compare the performance of the proposed approach with existing clustering algorithms, experiments are conducted on the WebData data-set (http://pami.uwaterloo.ca/~hammouda/webdata/), which consists of 314 web documents already classified into 10 categories. To evaluate the refinement steps and the cluster quality, the purity value (explained in chapter 4.7) of the clusters is recorded. In Table 6.2, "iPurity" is the initial purity value registered by traditional bisecting k-means, in which both the document vector and the cluster vector are weighted by the same scheme; these values are used as the baseline purity of the clusters. "rPurity" is the refined purity value; refinement values are shown for up to five iterations. It can be noted that the purity values keep increasing with the number of iterations of the refinement process. For iteration number 0, the TF weighting scheme registered the lowest average purity value, while TF-IDF registered the highest, followed by LTU and then OKAPI. A significant improvement in purity can be noted after even a single iteration: 15.6 percentage points for the TF weighting scheme, and 6.74, 12.5 and 9.09 percentage points for TF-IDF, OKAPI and LTU respectively.
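For reference, a small sketch of the purity computation used as the evaluation measure is given below, assuming the standard majority-class definition; the exact variant defined in chapter 4.7 may differ in detail, and the function name is illustrative.

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Purity: each cluster is credited with the size of its majority class,
    and the credits are summed and divided by the number of documents."""
    members = {}
    for c, t in zip(cluster_labels, true_labels):
        members.setdefault(c, []).append(t)
    hits = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return hits / len(true_labels)
```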
Table 6.1: Algorithm for cluster refinement
Input: document set D, ITER, N
Output: N clusters

 1:  k = 0
 2:  GetCluster by bisecting k-means
 3:  Calculate TF-ICF for terms in all virtual clusters
 4:  while (k != ITER)
 5:      k++
 6:      foreach VC_k, k ∈ N
 7:          CDoc_k = Find-CoreDocs(k)
 8:          foreach document not in CDoc_k
 9:              Doc-feature-vector = Make-feature-vector(document)
10:              ClusterNo = Assign-clusterNo(Doc-feature-vector)
11:              if (ClusterNo != k)
12:                  TF-ICF(Doc-feature-vector)
13:                  if (Check-for-small-cluster())
14:                      Assign-docs-to-other-clusters()
15:                      TF-ICF(Documents-in-VC_k)
16:                      bisect-largest-cluster()
17:                  endif
18:              endif
19:          end
20:      end
Table 6.2: Average purity values in different settings

                        rPurity after iteration
Scheme    iPurity     #1       #2       #3       #4       #5
TF        0.6004    0.7564   0.7686   0.7809   0.7820   0.7831
TF-IDF    0.7542    0.8216   0.8274   0.8285   0.8296   0.8296
OKAPI     0.6239    0.7489   0.7487   0.7948   0.7935   0.7935
LTU       0.7142    0.8051   0.8263   0.8294   0.8294   0.8294
In Figure 6.2, the iteration number and the average purity are the two axes. Iteration number 0 means that the refinement process is not applied at all; in other words, the result obtained from traditional bisecting k-means is taken as the final result. Similarly, iteration numbers 1 through 5 show the number of iterations of the refinement process. A document vector can be weighted with several weighting schemes, resulting in different clusters; the refinement process has been tested on the clusters obtained from the different weighting schemes (Table 5.5).
Figure 6.2: Average purity values after every iteration (average purity on the y-axis versus iteration number 0 to 5 on the x-axis, for the TF, TF-IDF, OKAPI and LTU weighting schemes).
A theoretical proof of convergence is not presented, but the experimental results show that the process appears to converge after about 3 iterations. The results also show that even a single run of the refinement process improves cluster quality, and that the percentage increase in purity decreases as the number of iterations grows.
6.6 Discussion
With the traditional bisecting k-means algorithm, if TF-IDF is used as the weighting scheme, both the document vector and the cluster mean are weighed by TF-IDF. If the proposed refinement approach is used, in which the cluster mean is weighted by the TF-ICF measure, a significant improvement in the purity of the clusters can be observed. Similarly, for the weighting schemes in Table 5.5, it can be observed that refinement improves the clustering result no matter which weighting scheme is used for the document vector. Note that TF-ICF is used to weight the cluster mean only when the membership of a document has to be checked during the refinement process. When a cluster has to be bisected (i.e. when all documents of a small cluster have been allocated to other clusters and the number of clusters must be kept at N), the traditional weighting method is used for both the document vector and the cluster mean.
6.7 Summary
An approach for assigning weights to the terms representing a cluster has been proposed. Using these weighted feature vectors, documents can be iteratively moved to the appropriate cluster. The performance evaluation shows that the proposed method improves document clustering quality by a significant margin. Furthermore, terms are weighed in such a way that the weight reflects a term's importance for a cluster; for a given term, the weight can differ across clusters. Apart from better-quality clusters, discriminative terms for each cluster are also obtained.
Chapter 7
Conclusion and Future Work
7.1 Summary
Traditional weighting schemes consider statistics regarding term occurrence to compute the term weight and do not consider the semantic association between terms. In addition to the statistics regarding term occurrence, we have investigated how the semantic association among terms can be exploited for a better term weighting method. We exploit the generalization ability of hierarchical knowledge repositories such as the Open Web Directory to semantically associate terms for an improved term weighting method. We use the DMOZ Open Directory Project (http://www.dmoz.org/) as our knowledge base, due to the easy accessibility of its structure and linked resources (cataloged web sites). However, our methodology is general enough to accommodate other hierarchical knowledge repositories. We impose the following requirements on knowledge repositories for feature generation:

• The knowledge repository should contain a collection of generalized terms or concepts organized in a hierarchical tree structure, where edges represent the "is-a" relationship. Each hierarchy node is labeled with a generalized term, which is more general than those of its children. Using a hierarchical ontology allows us to incorporate generalization of terms.

• There should be a count or probability that denotes the usage of a generalized term or concept for a term occurring in the document; in other words, the importance of a generalized term for a term should be known beforehand.

We first presented a context-based unsupervised term extraction and weighting scheme. We have exploited the notion that any two terms are related if they have been used in the same context. In the proposed approach, a given document is enriched with generalized terms using the open web directory.
We have proposed a term weighting scheme that gives appropriate weights to both document terms and generalized terms. One of the factors used to weigh GTerms is the importance of the topic paths, so GTerms with high weight in the enriched document vector represent the context of the overall document. The performance results show that the proposed weighting scheme gives better clustering performance than existing weighting schemes. We also conducted experiments in which the dimension was reduced by removing non-informative terms from the document vector; as expected, the performance increased as the dimension decreased.

We then proceeded with another experiment (BoostWeight) to explore the possibility of improving performance without permanently keeping the generalized terms. In BoostWeight, we first extract the document terms from the document and the corresponding generalized terms from the Open Web Directory. Using the proposed weighting approach, we boost the term weights of the document terms by exploiting the semantic relationship among them, based on the corresponding generalized terms. Clustering experiments on the WebData and Reuters-21578 data-sets show that the proposed BoostWeight approach improves both purity and entropy values.

Along the same lines of improving clustering performance, an approach for assigning weights to the terms representing a cluster has been proposed. Using these weighted feature vectors, documents can be iteratively moved to the appropriate cluster. The performance evaluation shows that the proposed method improves document clustering quality by a significant margin. Furthermore, terms are weighed in such a way that the weight reflects a term's importance for a cluster; for a given term, the weight can differ across clusters. Apart from better-quality clusters, discriminative terms for each cluster are also obtained.
7.2 Conclusion
It can be concluded that the addition of generalized terms provides scope for computing better similarity values and thus better clustering purity. From the experiments, it was noted that the increase in dimensionality does add some non-informative terms, and that removing those terms increases the performance of the system. Since setting a weight threshold to remove non-informative terms is difficult, we chose not to increase the dimension of the vectors and instead to modify the weights; this framework showed promising results. Still, we expect that the addition of generalized terms is beneficial, provided it is done while taking care of the number of dimensions. The addition of terms requires the use of an external knowledge repository, and creating such knowledge repositories is itself a difficult task. A comparison with the lexical database WordNet has not been carried out, as the focus of this work is the use of the open web directory structure to improve the feature extraction and feature weighting schemes.
We also proposed an approach to refine cluster quality by iteratively moving documents between clusters. It shows that using different weighting schemes for the cluster representative vector and the document vector is effective, and it opens a direction for developing more effective and efficient algorithms.
7.3 Limitation of the Proposed Work
In this thesis, we have made an effort to propose improved approaches for term extraction from documents by exploiting knowledge repositories such as open web directories. However, as mentioned in the related work, efforts have also been made in the literature to propose improved approaches [33, 34] that exploit knowledge repositories such as WordNet. In this thesis, we have not compared the performance of the proposed approaches with the existing WordNet-based approaches; we leave this task to be carried out as part of future work.
7.4 Future Work
As part of future work, we plan to conduct detailed experiments with other types of data-sets and with knowledge repositories such as WordNet. In addition, we plan to conduct experiments applying dimension reduction techniques. By using the hierarchical structure of the knowledge repository, there is also scope for word sense disambiguation. The proposed approach can further be extended to effective content-based recommendation systems. Effective, personalized recommendations are central to cross-selling, a common business strategy that suggests additional items (products or services) to customers for their consideration. Content-based recommendation and collaborative filtering represent two salient approaches to automated recommendation; the content-based approach uses essential features (attributes) of items to make recommendations, without making reference to the preferences of other customers.
Publications

1. Gaurav Ruhela and P. Krishna Reddy, "Improving Text Document Clustering by Exploiting Open Web Directory," in Proceedings of the Twenty-First International Conference on Software Engineering and Knowledge Engineering (SEKE'09), Boston, USA.
2. Gaurav Ruhela and P. Krishna Reddy, "BoostWeight: An Approach to Boost the Term Weights in Document Vector by Exploiting Open Web Directory," in Proceedings of the 2009 International Conference on Information and Knowledge Engineering (IKE'09), USA.
3. Gaurav Ruhela, "Improving Cohesiveness of Text Document Clusters," in Proceedings of the Fifth International Conference on Data Mining (DMIN'09), Nevada, USA.
4. Gaurav Ruhela and P. Krishna Reddy, "Exploring Open Web Directory for Improving the Performance of Text Document Clustering," poster presentation, I-CARE, IBM-IRL Collaborative Academia Research Exchange Program, IBM Research India, Delhi, October 26, 2009.
Bibliography

[1] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.
[2] W. W. Cohen and Y. Singer, "Context-sensitive learning methods for text categorization," in SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 1996, pp. 307–315.
[3] J. Fürnkranz, T. Mitchell, and E. Riloff, "A case study in using linguistic phrases for text categorization on the WWW," in Working Notes of the AAAI/ICML Workshop on Learning for Text Categorization. AAAI Press, 1998, pp. 5–12.
[4] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive learning algorithms and representations for text categorization," in CIKM '98: Proceedings of the Seventh International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 1998, pp. 148–155.
[5] ftp://ftp.cs.cornell.edu/pub/smart/, 2004.
[6] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.
[7] http://tartarus.org/~martin/porterstemmer/, 2004.
[8] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[9] D. Mladenic, "Feature subset selection in text-learning," 1998.
[10] B. Raskutti, H. Ferrá, and A. Kowalczyk, "Second order features for maximising text classification performance," in Proceedings of ECML-01, 12th European Conference on Machine Learning, L. D. Raedt and P. A. Flach, Eds., 2001.
[11] M. F. Caropreso, S. Matwin, and F. Sebastiani, "A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization," pp. 78–102, 2001.
[12] F. Peng and D. Schuurmans, "Combining naive Bayes and n-gram language models for text classification," in 25th European Conference on Information Retrieval Research (ECIR). Springer-Verlag, 2003, pp. 335–350.
[13] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," 1998.
[14] L. D. Baker and A. K. McCallum, "Distributional clustering of words for text classification," in SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 1998, pp. 96–103.
[15] I. S. Dhillon, S. Mallela, and R. Kumar, "A divisive information-theoretic feature clustering algorithm for text classification," Journal of Machine Learning Research, vol. 3, 2003.
[16] R. Basili, A. Moschitti, and M. T. Pazienza, "Language sensitive text classification," in Proceedings of the 6th RIAO Conference (RIAO 2000), Content-Based Multimedia Information Access, Collège de France, 2000.
[17] C. Sable, K. McKeown, and K. W. Church, "NLP found helpful (at least for one text categorization task)," in EMNLP '02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Morristown, NJ, USA: Association for Computational Linguistics, 2002, pp. 172–179.
[18] D. B. Lenat and E. A. Feigenbaum, "On the thresholds of knowledge," Artificial Intelligence, vol. 47, no. 1-3, pp. 185–250, 1991.
[19] I. Ruthven and M. Lalmas, "A survey on the use of relevance feedback for information access systems," Knowledge Engineering Review, vol. 18, no. 2, pp. 95–145, 2003.
[20] J. Xu and W. B. Croft, "Query expansion using local and global document analysis," in SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 1996, pp. 4–11.
[21] J. Rocchio, Relevance Feedback in Information Retrieval, 1971, pp. 313–323.
[22] G. Salton and C. Buckley, "Improving retrieval performance by relevance feedback," Ithaca, NY, USA, Tech. Rep., 1988.
[23] T. K. Landauer, D. Laham, and P. Foltz, "Learning human-like knowledge by singular value decomposition: a progress report," in NIPS '97: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10. Cambridge, MA, USA: MIT Press, 1998, pp. 45–51.
[24] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive learning algorithms and representations for text categorization," in CIKM '98: Proceedings of the Seventh International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 1998, pp. 148–155.
[25] K. Buss, "Literature review on preprocessing for text mining."
[26] G. Karypis and E.-H. S. Han, "Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval," in CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 2000, pp. 12–19.
[27] E. Agirre and G. Rigau, "Word sense disambiguation using conceptual density," in Proceedings of the 16th International Conference on Computational Linguistics, 1996, pp. 16–22.
[28] Y. Karov and S. Edelman, "Similarity-based word sense disambiguation," Computational Linguistics, vol. 24, no. 1, pp. 41–59, 1998.
[29] E. Gabrilovich and S. Markovitch, "Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization," Journal of Machine Learning Research, vol. 8, pp. 2297–2345, 2007.
[30] D. Mladenic, "Turning Yahoo into an automatic web-page classifier," in European Conference on Artificial Intelligence, 1998, pp. 473–474.
[31] Y. Labrou and T. Finin, "Yahoo! as an ontology: using Yahoo! categories to describe documents," in Proceedings of the 8th International Conference on Information and Knowledge Management (CIKM), 1999, pp. 180–187.
[32] H.-C. Huang, M.-S. Lin, and H.-H. Chen, "Analysis of intention in dialogues using category trees and its application to advertisement recommendation," in Proceedings of the Third International Joint Conference on Natural Language Processing, Hyderabad, Andhra Pradesh, India, 2008, pp. 625–630.
[33] A. Hotho, S. Staab, and G. Stumme, "WordNet improves text document clustering," in Proceedings of the SIGIR 2003 Semantic Web Workshop, 2003, pp. 541–544.
[34] G. A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[35] M. Lan, C. L. Tan, and H. B. Low, "Proposing a new term weighting scheme for text categorization," Boston, 2006, pp. 763–768.
[36] H. Xu and C. Li, "A novel term weighting scheme for automated text categorization," in ISDA '07: Proceedings of the Seventh International Conference on Intelligent Systems Design and Applications. Washington, DC, USA: IEEE Computer Society, 2007, pp. 759–764.
[37] F. Debole and F. Sebastiani, "Supervised term weighting for automated text categorization," in SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing. New York, NY, USA: ACM, 2003, pp. 784–788.
[38] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[39] S. M. Savaresi and D. L. Boley, "On the performance of bisecting k-means and PDDP," in Proceedings of the First SIAM International Conference on Data Mining (ICDM-2001), 2001, pp. 1–14.
[40] R. Lecœuche, "Finding comparatively important concepts between texts," in ASE '00: Proceedings of the 15th IEEE International Conference on Automated Software Engineering. Washington, DC, USA: IEEE Computer Society, 2000, p. 55.
[41] E. M. Rasmussen, "Clustering algorithms," in Information Retrieval: Data Structures & Algorithms, 1992, pp. 419–442.
[42] F. Sebastiani, "Text categorization," in Text Mining and its Applications to Intelligence, CRM and Knowledge Management. WIT Press, 2005, pp. 109–129.
[43] G. Karypis, E.-H. Han, and V. Kumar, "Multilevel refinement for hierarchical clustering," University of Minnesota, Computer Science and Engineering Technical Report, 1999.
[44] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, "Scatter/Gather: a cluster-based approach to browsing large document collections," in SIGIR '92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 1992, pp. 318–329.
[45] L. Zhu, J. Guan, and S. Zhou, "CWC: A clustering-based feature weighting approach for text classification," in MDAI '07: Proceedings of the 4th International Conference on Modeling Decisions for Artificial Intelligence. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 204–215.
[46] A. Strehl, E. Strehl, J. Ghosh, and R. Mooney, "Impact of similarity measures on web-page clustering," in Workshop on Artificial Intelligence for Web Search (AAAI 2000). AAAI, 2000, pp. 58–64.
[47] B. C. M. Fung, K. Wang, and M. Ester, "Hierarchical document clustering," in The Encyclopedia of Data Warehousing and Mining. Hershey, PA: Idea Group, August 2008, pp. 970–975.
[48] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[49] L. Ertöz, M. Steinbach, and V. Kumar, "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data," in SDM, 2003.
[50] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Ithaca, NY, USA, Tech. Rep., 1987.