Mining Web Content Outliers using Structure Oriented Weighting Techniques and N-Grams

Malik Agyemang
Ken Barker
Rada S. Alhajj
Department of Computer Science University of Calgary 2500 University Drive N.W. Calgary, AB, Canada T2N 1N4
{agyemang, barker, alhajj}@cpsc.ucalgary.ca
ABSTRACT
Classifying text into predefined categories is a fundamental task in information retrieval (IR). IR and web mining techniques have been applied to categorize web pages so that users can manage and use the huge amount of information available on the web. Consequently, user-friendly and automated tools for managing web information are in high demand in the web mining and information retrieval communities. Text categorization, information routing, identification of junk material, topic identification and structured search are some of the hot spots in web information management. A great number of techniques exist for classifying web documents into categories. Interestingly, almost none of the existing algorithms consider documents having 'varying contents' relative to the rest of the documents taken from the same domain (category), called web content outliers. In this paper, we take advantage of the HTML structure of web documents and of the n-gram technique for partial matching of strings, and propose an n-gram-based algorithm for mining web content outliers. To reduce the processing time, the optimized algorithm uses only data captured in <TITLE> and <META> tags. Experimental results using planted motifs indicate the proposed n-gram-based algorithm is capable of finding web content outliers. In addition, using text captured in <TITLE> and <META> tags gave the same results as using text embedded in <TITLE>, <META>, and <BODY> tags.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data Mining.

General Terms
Algorithms, Design

Keywords
Text categorization, web mining, n-grams, dissimilarity measure, web contents.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'05, March 13-17, 2005, Santa Fe, New Mexico, USA. Copyright 2005 ACM 1-58113-964-0/05/0003…$5.00.

1. INTRODUCTION
The web continues to grow at an exponential rate, making it the single largest data repository available for research and user consumption. The huge volume of data on the web calls for automated tools if the information is to be managed efficiently. This has given rise to many intelligent systems for mining information on the web in areas such as text categorization, information routing, topic identification, structured search and identification of junk material. Interestingly, existing web mining algorithms do not consider documents having 'varying contents' within the same category, called web content outliers. A web content outlier is a web document whose contents differ from those of similar web documents taken from the same category. The process of locating web content outliers in large web data is called web content outlier mining [2]. The motivation for web content outlier mining is enormous. For example, Superstore, Safeway and Co-Op are three grocery stores in Alberta. Superstore has clothing and hardware sections which the others do not. Supposing Superstore makes a profit while Safeway and Co-Op do not, this may be due to good sales from the clothing and hardware sections. Conversely, if Safeway and Co-Op are profitable while Superstore is not, it might be the case that the clothing and hardware sections are not profitable. Such interesting and useful information can be obtained using web content outlier mining algorithms. It is assumed that all the products of the participating companies are available online.

N-gram techniques have been widely investigated for a number of text processing tasks. The n-grams of a string are the contiguous slices of the string into substrings of size n. N-gram systems suffer from large memory requirements because of the huge number of n-gram vectors resulting from the slicing; for example, a string of length k has k-n+1 possible n-grams, ignoring all n-grams with trailing or preceding blanks. This problem notwithstanding, n-gram systems have advantages over full word matching in web page categorization for the following reasons: (1) a great deal of web pages are posted on the web without any thorough error checking, because checking might be too costly or the documents may be time dependent. Such documents may contain significant numbers of errors, making full word matching less effective. In such cases, n-gram techniques offer a more efficient and effective means of comparing strings because n-gram systems support partial matching of strings. (2) N-gram techniques are faster and more efficient because all n-grams have the same length, whereas words may have different lengths.
In this paper, we present an n-gram-based algorithm for mining web content outliers, which exploits the advantages of n-gram techniques. The algorithm also takes advantage of the HTML structure of web documents. It is further optimized to use only data captured in <TITLE> and <META> tags in order to reduce the huge number of n-gram vectors. Experimental results show that using text embedded in <TITLE> and <META> tags gives the same results as using <TITLE>, <META> and <BODY>. This result confirms the known fact that using metadata gives a better representation of the contents of a web page than using the actual data on the page [22].
1.1 Motivation
The motivation for mining web content outliers includes, but is not limited to, the following:
i. It is an open research issue with a number of practical applications in information management and electronic commerce.
ii. Consulting companies spend millions of dollars tracking the activities of their clients to ascertain what they are doing differently to stay in business. With the growing use of the web in the corporate world, it is only a matter of time before all the operations of companies are available online. Consulting companies can run web content outlier mining algorithms over clients within the same category. The results may reveal clients doing 'something different', helping consultants keep abreast of the changes.
iii. It can be used to determine why certain companies within the same category are profitable while others are not. For example, Superstore, Safeway and Co-Op are three grocery stores. Superstore has clothing and hardware sections which the others do not. Supposing Superstore makes a profit while Safeway and Co-Op do not, this may be due to the clothing and hardware sections. However, if the others are profitable while Superstore is not, it might be the case that the clothing and hardware sections are not profitable, assuming all the products of the participating companies are available online.
iv. Web content outlier mining may lead to the discovery of new and emerging business patterns and trends. For example, Wal-Mart, which deals in general goods, has a grocery section in some of its stores in Canada, where grocery was absent three years ago. Zellers is also experimenting with a grocery section in some of its stores. This is an emerging business trend that can be detected using web content outlier mining techniques.
Mining web content outliers can also improve the quality of hubs and authority sites by removing identified outlying pages. Removing web content outliers eventually improves the quality of the results returned by a search engine.

1.2 Outline of Paper
Section 2 presents an overview of outlier mining. The n-gram concept and the algorithm for mining web content outliers are presented in Section 3. Section 4 presents experimental results using planted motifs. Conclusions and future work are presented in Section 5.

2. RELATED WORK
Outlier mining has been studied extensively in statistics, where data objects are usually fitted to standard distributions. Data objects that show significantly different characteristics from the remaining data are declared outliers [10]. Over a hundred tests, called discordancy tests, have been developed for different scenarios [4]. The statistical techniques require a priori knowledge of the data distribution (e.g., Normal, Poisson) and of the distribution parameters (e.g., mean, variance), which is a major setback. In addition, most of the distributions used are univariate. Depth-based techniques represent every data object in a k-d space and assign a depth to each object [21]. Outliers are data objects with smaller depths. In practice, depth-based algorithms are efficient for lower dimensional data [12, 20]. They become inefficient for higher dimensional data (k >= 4) because they rely on the computation of the k-d convex hull, which has a lower bound complexity of Omega(n^ceil(k/2)) for n objects.

The distance-based outlier mining concept assigns numeric distances to data objects and computes outliers as data objects with relatively larger distances [1, 14, 15]. The distance-based outlier concept generalizes many of the existing notions of outlier and enjoys better computational complexity than the depth-based techniques for higher dimensional data. The work described in [19] uses the distance to the k-nearest neighbor to rank outliers, and an efficient algorithm for computing the top-n outliers from their rankings is provided. However, distance-based outlier algorithms are not capable of detecting all forms of outliers [5]. The local outlier factor concept addresses this major setback. It assigns a 'degree of outlyingness' to every object and declares data objects with high local outlier factor values as outliers [3, 5, 11]. The local outlier factor depends on the remoteness of an object with respect to its immediate neighbors. The concept treats every data object as a potential outlier and is hence able to capture the different forms of outliers that are missed by the earlier algorithms. Agyemang et al. [2] introduced web outlier mining concepts and proposed a general framework for mining web outliers, together with an algorithm for mining web content outliers. Jung et al. [13] used semantic outlier mining techniques for sessionizing web log data to track the behavior of individuals who may easily change their interest and intention while surfing the web. N-gram techniques have been studied and used in many information retrieval tasks. They have been applied in different domains such as language identification [8], document categorization and comparison [17], robust handling of noisy text, and many other natural language processing applications [7]. The success of n-gram-based systems stems from the fact that strings are decomposed into smaller parts, so errors (misspellings, typographical errors and errors arising from optical character recognition (OCR)) affect only a limited number of those parts rather than the whole word. The number of n-grams (higher order n-grams) common to two strings is a measure of the similarity between the words, and this measure is resistant to a large variety of textual errors.
This paper exploits the advantages of using n-grams to determine the similarity of strings and extends the idea to pages containing similar strings. We establish a dissimilarity measure and use it to identify documents whose contents differ from those of similar pages within a given category.
3. N-GRAMS
Traditionally, an n-gram is defined as an n-character contiguous slice of a string. The term n-gram can also be used for any set of consecutive characters occurring in a string (e.g., an n-gram consisting of the second, third and fifth characters of a string), but this paper uses n-gram in its traditional sense. The n-grams of a string of length k are obtained by sliding a window of size n over the string, starting at the first position and moving the window one position at a time until it reaches the end of the string. The characters that appear in the window at each position form the n-grams of that string. For example, the string 'computer' has the 4-grams 'comp', 'ompu', 'mput', 'pute', 'uter' and the 5-grams 'compu', 'omput', 'mpute', 'puter', excluding all n-grams having preceding or trailing blanks. The number of possible n-grams resulting from a string of length k is k-n+1, where n is the size of the n-gram.

A study of different n-gram sizes reveals that n-grams that are too short tend to capture similarities between words that are due to factors other than semantic relatedness. For example, the words 'community' and 'commodity' have five out of eight 2-grams in common ('co', 'om', 'mm', 'it', 'ty'), suggesting that 'community' and 'commodity' may be related when they are not. Conversely, n-grams that are too long fail to capture similarities between similar but different words; they behave like full word matching. N-grams of moderate length are able to determine similarity between different but related words better than shorter n-grams: 'community' and 'commodity' are not related and have only one 4-gram and no 5-grams in common. The literature on n-grams identifies 5-grams as the n-gram size capable of supporting high information retrieval precision and recall [8, 18]. Hence, this paper uses 4-grams and 5-grams in the web content outlier mining algorithm with the aim of achieving high precision and recall.
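The slicing and the overlap counts above can be reproduced with a short routine. The sketch below (plain Python; the function names are ours, not from the paper) generates character n-grams and counts the n-grams two words share.

```python
def ngrams(word, n):
    """Return the list of contiguous character n-grams of a word.

    A word of length k yields k-n+1 n-grams (no padding with blanks).
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def shared(a, b, n):
    """Number of distinct n-grams that two words have in common."""
    return len(set(ngrams(a, n)) & set(ngrams(b, n)))

if __name__ == "__main__":
    print(ngrams("computer", 4))   # ['comp', 'ompu', 'mput', 'pute', 'uter']
    print(ngrams("computer", 5))   # ['compu', 'omput', 'mpute', 'puter']
    # Short n-grams over-report similarity, longer ones are stricter:
    print(shared("community", "commodity", 2))  # 5 of 8 2-grams in common
    print(shared("community", "commodity", 4))  # only 1 ('comm')
    print(shared("community", "commodity", 5))  # 0
```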
3.1 Merits of N-Gram Systems
The main advantages of n-gram systems over full word matching are:
i. They are computationally efficient (in speed and memory), since they compare n-grams of fixed length rather than words of variable length.
ii. N-grams are more effective at determining the similarity between different but related words in text processing. N-gram techniques outperform, in precision and recall, the naive approach of declaring two words similar only if they match fully. For example, 'community' and 'communal' are two different words but share the 5-grams 'commu' and 'ommun' because they are inflections of the same root word. A naive word matching algorithm cannot match the two words as similar. The problem can be addressed with stemming algorithms, but at additional computational cost.
iii. N-grams support partial matching of strings containing errors. Thus the word 'computer', misspelled as 'tomputer', 'camputer', or 'domputer', could still be matched because each of these words has a reasonable number of n-grams in common with 'computer'. Full word matching algorithms report a mismatch and hence perform worse under such circumstances. A small example of this follows the list.
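As a rough illustration of point iii, the snippet below (our own sketch, not code from the paper) scores a misspelled word against a target by the fraction of the target's 4-grams it still contains; the 0.5 threshold is an arbitrary choice for the example.

```python
def ngrams(word, n=4):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_score(candidate, target, n=4):
    """Fraction of the target's n-grams that also occur in the candidate."""
    target_grams = ngrams(target, n)
    return len(ngrams(candidate, n) & target_grams) / len(target_grams)

for w in ["tomputer", "camputer", "domputer", "keyboard"]:
    score = ngram_score(w, "computer")
    verdict = "match" if score >= 0.5 else "no match"
    print(f"{w}: {score:.2f} ({verdict})")
# 'tomputer' and 'domputer' keep 4 of the 5 4-grams of 'computer' (0.80),
# 'camputer' keeps 3 of 5 (0.60), while an unrelated word scores 0.00.
```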
3.2 Web Content Outlier Detection
The inherent property of n-grams exploited in many information retrieval tasks is that documents within the same category should have similar n-gram frequency distributions. Thus, documents belonging to the same category should have a greater number of n-grams in common. The proposed algorithm exploits this property together with the HTML structure of the web. The paper assumes the existence of a dictionary for the intended category. The n-gram frequency profiles for the dictionary and for the documents are generated. Documents are weighted based on their n-gram frequency distributions and on the HTML tags that enclose their root words. N-grams found in a document but not in the dictionary are given a higher penalty than those found in the dictionary: n-grams not in the dictionary contribute more to the dissimilarity of the document, while those in the dictionary increase the similarity between the document and the dictionary. The relative weight of a document determines its dissimilarity to other documents. Outliers are documents with high dissimilarity weights compared to the other documents in the category. The major components of the proposed algorithm are shown in Figure 1; each component of the dataflow is described below.

Figure 1: Dataflow for proposed algorithm

3.2.1 Document Extraction
The document extraction phase retrieves web pages from the category of interest. It can be carried out using existing web search engines or web crawlers [6]. The crawlers should be very efficient in extracting all pages belonging to the category of interest. The tasks expected to be completed at the end of this phase are:
i. Download all the pages for the category to be mined (e.g., health, commerce, e-business, resume, etc.).
ii. The algorithm uses the HTML tags <TITLE>, <META>1 and <BODY>; any other tags are removed. Extensions to existing web crawler algorithms may be needed to ensure effective results.
The output from this phase is fed as input to the preprocessing phase. A minimal sketch of the tag extraction follows.

1 We use only the 'description' attribute of the <META> tag.
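The extraction of the <TITLE>, <META> description, and <BODY> text described above can be approximated with the Python standard library HTML parser. This is a simplified sketch under our own assumptions (well-formed pages, no handling of scripts or nested markup), not the crawler used in the paper.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect text enclosed in <title> and <body>, plus the meta description."""

    def __init__(self):
        super().__init__()
        self.parts = {"title": [], "meta": [], "body": []}
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "body"):
            self._stack.append(tag)
        elif tag == "meta":
            a = dict(attrs)
            if a.get("name", "").lower() == "description":
                self.parts["meta"].append(a.get("content", ""))

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            self.parts[self._stack[-1]].append(data)

page = ("<html><head><title>My Resume</title>"
        "<meta name='description' content='Resume of a software engineer'>"
        "</head><body>Experience with Java and databases.</body></html>")
extractor = TagTextExtractor()
extractor.feed(page)
print({k: " ".join(v).strip() for k, v in extractor.parts.items()})
# {'title': 'My Resume', 'meta': 'Resume of a software engineer',
#  'body': 'Experience with Java and databases.'}
```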
3.2.2 Preprocessing
The data extracted in the previous phase is transformed into a format suitable for the next phase of the algorithm. Anything besides the text embedded in the HTML tags is removed; this may include hyperlinks, sound, pictures, etc. In addition, stop-words (words with frequency greater than some user-specified frequency) are carefully removed. The stop-word removal is done with the aid of a publicly available list of stop-words2. Using a public list of stop-words is category independent and ensures that important words within a category that happen to occur frequently are not removed. The disadvantage is that there are many different public lists of stop-words, and they are not all the same; nevertheless, a number of the lists can be compared and an appropriate one chosen. The words in the documents are grouped under the various HTML tags, and finally the words in the dictionary and in the documents are converted to the same case. The output is a set of documents with whitespace-separated words grouped under the various HTML tags.

2 http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words
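A minimal version of this preprocessing step (stop-word removal and case folding) might look as follows; the stop-word set shown is a placeholder, not the published list cited in the footnote.

```python
import re

# Placeholder stop-word list; the paper uses a publicly available list instead.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "with"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case the text, keep alphabetic tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(preprocess("Experience with Java and relational databases"))
# ['experience', 'java', 'relational', 'databases']
```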
3.2.3 Generate N-gram Frequency Profiles
An n-gram frequency profile is generated for each document and for the dictionary separately. This is achieved by reading the text and counting the number of occurrences of each n-gram in the document. This paper uses n-grams of greater length because higher order n-grams capture similarities between different but related words better than shorter n-grams [8, 18]. In particular, we use 4-grams and 5-grams as the desired n-gram sizes. Lower order n-grams are not used: 1-grams merely show the distribution of the letters of the alphabet in a language, and 2-grams show the distribution of very frequent prefixes and suffixes and are affected by the language in which the document is written. The n-gram frequency profile is computed as follows (a code sketch is given after the list):
i. Tokenize the text into words using the output from preprocessing.
ii. Generate all possible 4-grams and 5-grams. Tokens shorter than the n-gram length are dropped. For example, 'testy' has only the 4-grams 'test' and 'esty' and the single 5-gram 'testy'.
iii. Hash the results into a table, using a hash function with a collision management mechanism, to keep track of each n-gram and its count (frequency). We expect the number of n-gram vectors to grow linearly with the number of words on a page since we are using only 4-grams and 5-grams.
iv. Finally, output the n-grams and their counts in sorted order to a file. The sorted file for each document is the n-gram frequency profile of that document and is used as input to the next phase of the algorithm.
The above process is repeated to obtain the n-gram profiles of all documents in the category and of the dictionary.
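The profile generation in steps i-iv can be sketched as below. We use a Python dictionary in place of the explicit hash table with collision management, and the on-disk sorted file of step iv becomes a sorted list of (n-gram, count) pairs; the names are ours.

```python
from collections import Counter

def word_ngrams(word, sizes=(4, 5)):
    """All 4-grams and 5-grams of a single token (none if the token is too short)."""
    for n in sizes:
        for i in range(len(word) - n + 1):
            yield word[i:i + n]

def frequency_profile(tokens, sizes=(4, 5)):
    """Sorted (n-gram, count) pairs for a preprocessed token list."""
    counts = Counter()                      # plays the role of the hash table
    for token in tokens:
        counts.update(word_ngrams(token, sizes))
    return sorted(counts.items())

profile = frequency_profile(["resume", "software", "engineer", "resume"])
print(profile[:4])
# [('engi', 1), ('engin', 1), ('esum', 2), ('esume', 2)]
```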
3.2.4 Compute Dissimilarity Measures
The goal is to compute a dissimilarity measure for determining the differences among pages within the same category. The document and dictionary profiles are used as inputs to this phase. Weights are assigned to the n-grams on each page based on the HTML tags that enclose their root words. A penalty function is defined to take care of n-grams that exist in a document but not in the dictionary. Since the focus of the algorithm is to measure the dissimilarity between documents, n-grams present in both the dictionary and a document are assigned lower weights because they demonstrate stronger similarity, whereas n-grams absent from the dictionary are assigned higher weights because they indicate stronger dissimilarity. The overall dissimilarity measure (DM) of a page is the weighted sum of all n-grams on that page.

3.2.4.1 Weight Computation
The hierarchical importance of the HTML structure is employed in assigning weights to n-grams depending on the tags that enclose their root text. Weights are assigned considering only words enclosed in the HTML tags <TITLE>, <META>, and <BODY>, because they are the most important tags in an HTML document. Text contained in <TITLE> and <META> is considered metadata, and the corresponding n-grams are therefore assigned higher weights than those of text contained in the <BODY> tag. The same weight is assigned to the two metadata tags even though they differ slightly in their level of importance.

The weight assigned to an n-gram as a result of its presence in or absence from the dictionary is called a penalty. Higher penalties are given to n-grams that are absent from the dictionary because they demonstrate stronger dissimilarity; the penalty λ given to n-grams present in the dictionary varies for different categories of documents. The dissimilarity measure of each document is the relative weighted sum of all n-grams in that document, so documents having different numbers of n-grams can be compared:

    DMi = (1 / ni) * Σ (j = 1 to ni) p(Nj) ∗ w(Tk) ∗ F(Nj, Tk, Di)        (1)

where Tk is the HTML tag, w(Tk) is the weight assigned to the n-gram Nj occurring in the HTML tag Tk, F(Nj, Tk, Di) denotes the number of times n-gram Nj occurs in the HTML tag Tk in document Di, ni is the number of n-grams in document Di, and p(Nj) is the penalty given to n-gram Nj depending on its presence in the dictionary. The functions w(Tk) and p(Nj) are defined as:

    w(Tk) = β if Tk ∈ metadata (β > 1); 1 otherwise        (2)

    p(Nj) = 1 if Nj ∉ dictionary; λ otherwise (0 < λ < 1)        (3)
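Equations (1)-(3) translate almost directly into code. The sketch below is our reading of the formulas; the profile layout and the values β = 2 and λ = 0.3 are illustrative choices, not values from the paper.

```python
BETA = 2.0      # weight for n-grams from metadata tags (<TITLE>, <META>); beta > 1
LAMBDA = 0.3    # penalty for n-grams found in the dictionary; 0 < lambda < 1
METADATA_TAGS = {"title", "meta"}

def weight(tag):
    """w(Tk): metadata n-grams count more than <BODY> n-grams (equation 2)."""
    return BETA if tag in METADATA_TAGS else 1.0

def penalty(ngram, dictionary):
    """p(Nj): unknown n-grams are penalized fully, known ones only by lambda (equation 3)."""
    return LAMBDA if ngram in dictionary else 1.0

def dissimilarity(profile, dictionary):
    """DMi from equation (1).

    `profile` maps (tag, ngram) -> frequency F(Nj, Tk, Di) for one document.
    """
    total_ngrams = sum(profile.values())
    if total_ngrams == 0:
        return 0.0
    weighted = sum(penalty(ng, dictionary) * weight(tag) * freq
                   for (tag, ng), freq in profile.items())
    return weighted / total_ngrams

dictionary = {"resu", "esum", "sume", "resum", "esume"}
doc = {("title", "resu"): 1, ("body", "esum"): 2, ("body", "plum"): 1}
print(round(dissimilarity(doc, dictionary), 3))   # 0.55
```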
3.2.5 Determine Outliers
This phase of the algorithm determines outliers based on the dissimilarity measures. The dissimilarity measures computed for the different documents are ranked and the top-n are declared outliers. The detailed description of the web content outlier detection algorithm is shown in Figure 2, and a small ranking example is given below.
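Ranking by DM and reporting the top-n is straightforward; this small sketch (scores invented for illustration) shows the intended selection.

```python
def top_n_outliers(dm_scores, n=3):
    """Return the n document ids with the highest dissimilarity measures."""
    return sorted(dm_scores, key=dm_scores.get, reverse=True)[:n]

scores = {"P1": 0.21, "P2": 0.19, "P151": 0.78, "P152": 0.74, "P3": 0.22}
print(top_n_outliers(scores, n=2))   # ['P151', 'P152']
```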
3.2.6 Reduction of N-gram Vectors
The scalability of the proposed algorithm depends on the number of n-gram vectors resulting from the pages (documents) involved. Experimental results on web page categorization indicate that using the metadata description of web pages alone gives a better representation of the contents of web pages than using the actual text contained on the pages [9, 22]. Thus, the number of n-gram vectors can be reduced by using only the metadata description, without the actual contents of the pages involved. In addition, the weighting function is no longer needed, because all metadata tags have the same weight, resulting in the dissimilarity function given below:
    DMi = (1 / ni) * Σ (j = 1 to ni) p(Nj) ∗ F(Nj, Di)        (4)
where Nj is an n-gram vector, F(Nj, Di) denotes the number of times n-gram Nj occurs in document Di, and p(Nj) is the penalty given to n-gram Nj depending on its presence in the dictionary. The function p(Nj) has the same meaning and definition as in equation (3) above.
Algorithm
Input: Dictionary, documents Di
Output: Outlier pages
Other variables: weights w(Tk), penalties p(Nj)
1.  Read the contents of the documents Di and the dictionary
2.  Generate the n-gram frequency profile for the dictionary
3.  Generate the n-gram frequency profile for each document
4.  For (int i = 0; i < NoOfDoc; i++) {
5.    For (int n = 0; n < NoOfNgrams; n++) {
6.      IF (n-gram exists in dictionary) {
7.        Weighti = Σ (j = 1 to ni) p(Nj) ∗ w(Tk) ∗ F(Nj, Tk, Di)
        } Else {
8.        Weighti = Σ (j = 1 to ni) w(Tk) ∗ F(Nj, Tk, Di)
        } End IF
9.    } // end of inner for loop
10.   DMi = Weighti / NoOfNgrams
11. } // end of outer for loop

Figure 2: N-gram-based algorithm for mining web outliers
4. EXPERIMENTAL RESULTS
This section presents the experimental results of the n-gram-based algorithm for mining web content outliers. The two datasets used to test the algorithm were obtained from the web using DMOZ (the Open Directory Project, http://dmoz.org), an open directory containing millions of classified web pages. Both datasets consist of resume pages with other pages planted in them as motifs to be detected. Planted motifs are used because there are no benchmark data for testing web content outliers.
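The planted-motif evaluation amounts to scoring every test page, taking the top-10 by dissimilarity, and counting how many of the planted pages appear there. A sketch of that bookkeeping (labels and scores are placeholders) is shown below; the datasets themselves are described next.

```python
def planted_in_top_k(dm_scores, planted_ids, k=10):
    """Count how many planted (outlier) pages land in the top-k by DM."""
    top_k = sorted(dm_scores, key=dm_scores.get, reverse=True)[:k]
    return sum(1 for page_id in top_k if page_id in planted_ids)

# Placeholder scores: resume pages P1..P150, planted pages P151..P160.
scores = {f"P{i}": 0.2 for i in range(1, 151)}
scores.update({f"P{i}": 0.8 for i in range(151, 161)})
planted = {f"P{i}" for i in range(151, 161)}
print(planted_in_top_k(scores, planted))   # 10 when all planted pages rank highest
```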
4.1 Dataset and Methodology
The first dataset consists of 200 resume pages (pages from 200 websites providing resume services) and 10 home improvement pages (pages providing home improvement services). The contents enclosed in the <TITLE>, <META>, and <BODY> tags were retrieved; to reduce redundancy, we only used the data in the 'Description' attribute of the <META> tag. After preprocessing, a dictionary of words for the resume category was generated from 50 of the resume pages. The remaining 150 resume pages, numbered P1 to P150, and the 10 home improvement pages, numbered P151 to P160, constitute our test data; the 10 home improvement pages are the planted motifs we want to identify. First, we generated 4-gram frequency profiles for the dictionary and the test data separately and computed the dissimilarity between each page and the dictionary. The algorithm detected 9 home improvement pages and 1 resume page among the top-10 outliers. All 10 home improvement pages were among the top-10 outliers when the experiment was run using 5-grams, as shown in Table 1.

The experiment was then repeated using the <TITLE> and <META> tags only. We created a new dictionary from the same 50 resume pages using the text enclosed in <TITLE> and <META>, ignoring the information contained in the <BODY> tag. The text enclosed in the <TITLE> and <META> tags of the remaining 150 resume pages and 10 home improvement pages was extracted and preprocessed. The 4-gram and 5-gram frequency profiles were generated for the dictionary and the test data separately, dissimilarity measures were computed and the top-10 outliers determined. In both cases, all 10 home improvement pages were listed among the top-10 outliers. The summary of the experimental results is shown in Table 1.

Table 1. HI (home improvement) pages as planted motifs

                                    Top-10 Web Content Outliers
Data                         4-Grams                        5-Grams
<TITLE>, <META>, <BODY>      HI pages 9, Resume pages 1     HI pages 10, Resume pages 0
<TITLE>, <META>              HI pages 10, Resume pages 0    HI pages 10, Resume pages 0

The second dataset consists of the same 150 resume pages used above, with 10 recruiter pages planted in them. The experiment was repeated as described above and the number of recruiter pages detected among the top-10 outlying pages noted. The algorithm identified at least 7 recruiter pages among the top-10 outliers irrespective of the data used and the n-gram size. The summary of the results is given in Table 2.

Table 2. Recruiter pages as planted motifs

                                    Top-10 Web Content Outliers
Data                         4-Grams                        5-Grams
<TITLE>, <META>, <BODY>      Recruiter 7, Resume 3          Recruiter 8, Resume 2
<TITLE>, <META>              Recruiter 7, Resume 3          Recruiter 7, Resume 3
The results shown in Table 1 and Table 2 indicate that the proposed algorithm is capable of identifying web content outliers. They also demonstrate that using metadata alone gives results as good as using the metadata together with the data itself. The resume pages that consistently appeared among the top-10 outliers in Table 2 looked much more like recruiter pages than resume pages. It is also apparent from the results that identifying web content outliers among nearly identical pages (pages with similar contents) is more difficult than finding outliers among completely unrelated pages.
5. CONCLUSIONS AND FUTURE WORK
Web mining is a growing research area in the mining community because of the great patronage the web continues to enjoy. The success of electronic commerce on the web has attracted industrial support for web mining research. Classifying web pages into predefined categories to aid information search and retrieval is a very common task. However, sifting through already categorized documents looking for web pages with varying contents has not received much attention in the mining community. This paper proposes an n-gram-based algorithm for mining web content outliers. Experimental results using planted motifs show that the proposed algorithm is capable of detecting web content outliers in web data. The results also confirm that, for already classified pages, using the data captured in the metadata (<TITLE> and <META>) tags gives the same results as using the actual contents of the pages. Using metadata in the experiments drastically reduced the number of possible n-gram comparisons, but we intend to perform more experiments with larger datasets to establish this fact.

Areas of future research include experimental evaluation of full word match algorithms and n-gram-based algorithms in terms of precision, recall and response time. We shall also compare the results of our dissimilarity measure with standard similarity metrics. Finally, benchmark data need to be established for evaluating the performance of web outlier mining algorithms.

6. REFERENCES
[1] Angiulli, F., and Pizzuti, C. Fast Outlier Detection in High Dimensional Spaces. In Elomaa, T. (Ed.), Proc. of PKDD 2002, LNAI 2431, 2002, pp 15-27.
[2] Agyemang, M., Barker, K., and Alhajj, R. Framework for Mining Web Content Outliers. In Proc. of the 19th Annual ACM Symposium on Applied Computing (ACM SAC), Nicosia, Cyprus, 2004, pp 590-594.
[3] Agyemang, M., and Ezeife, C.I. LSC-Mine: Algorithm for Mining Local Outliers. In Proc. of the 15th Information Resource Management Association (IRMA) International Conference, New Orleans, 2004, pp 5-8.
[4] Barnett, V., and Lewis, T. Outliers in Statistical Data. John Wiley, 1994.
[5] Breunig, M.M., Kriegel, H-P., Ng, R.T., and Sander, J. LOF: Identifying Density-Based Local Outliers. In Proc. of ACM SIGMOD 2000, Dallas, TX, 2000.
[6] Chakrabarti, S., van den Berg, M., and Dom, B. Focused Crawling: A New Approach to Topic-specific Web Resource Discovery. Computer Networks, Amsterdam, Netherlands, 1999.
[7] Cavnar, W.B., and Trenkle, J.M. N-Gram-Based Text Categorization. In Proc. of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
[8] Damashek, M. Gauging Similarity with N-Grams: Language-Independent Categorization of Text. Science, 267 (1995), pp 843-848.
[9] Riboni, D. Feature Selection for Web Page Classification. D.S.I., Università degli Studi di Milano, Italy, 2002.
[10] Hawkins, D. Identification of Outliers. Chapman and Hall, London, 1980.
[11] Jin, W., Tung, A.K.H., and Han, J. Mining Top-n Local Outliers in Large Databases. In Proc. of KDD 2001, San Francisco, CA, USA, 2001.
[12] Johnson, T., Kwok, I., and Ng, R. Fast Computation of 2-D Depth Contours. In Proc. of KDD 1998, pp 224-228.
[13] Jung, J.J., and Jo, G-S. Semantic Outlier Analysis for Sessionizing Web Logs. In Proc. of the 14th European Conference on Machine Learning / 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Cavtat-Dubrovnik, 2004, pp 13-25.
[14] Knorr, E.M., and Ng, R.T. A Unified Notion of Outliers: Properties and Computation. In Proc. of KDD 97, 1997, pp 219-222.
[15] Knorr, E.M., and Ng, R.T. Algorithms for Mining Distance-Based Outliers in Large Datasets. In Proc. of the 24th VLDB Conference, New York, USA, 1998.
[16] Kosala, R., and Blockeel, H. Web Mining Research: A Survey. SIGKDD Explorations, ACM, July 2000.
[17] Labrou, Y., and Finin, T. Experiments on Using Yahoo! Categories to Describe Documents. In IJCAI-1999 Workshop on Intelligent Information Extraction.
[18] Mayfield, J., and McNamee, P. Indexing Using Both N-Grams and Words. In NIST Special Publication 500-242: The Seventh Text REtrieval Conference (TREC 7), 1998, pp 419-424.
[19] Ramaswamy, S., Rastogi, R., and Shim, K. Efficient Algorithms for Mining Outliers from Large Data Sets. In Proc. of ACM SIGMOD 2000, USA, pp 127-138.
[20] Ruts, I., and Rousseeuw, P. Computing Depth Contours of Bivariate Point Clouds. Computational Statistics and Data Analysis, 23(1), 1996, pp 153-168.
[21] Tukey, J.W. Exploratory Data Analysis. Addison-Wesley, 1977.
[22] Yang, Y., Slattery, S., and Ghani, R. A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, 18(2/3), 2002, pp 219-241. Special Issue on Automated Text Categorization.