A Text Processing System for Mining Association Rules

Pengwei Yang and Gongzhu Hu
Department of Computer Science, Central Michigan University, Mount Pleasant, MI 48859, USA
[email protected]

Abstract

Text mining is a much-researched area of the general data mining field, dealing with data mining tasks on text data. Despite the large body of research in this area, few easy-to-use open-source systems exist that provide users with tools and interfaces for such text mining tasks; most available products are expensive commercial offerings. In this paper, we present a text processing system that incorporates mining association rules from text. Although this paper does not present new and innovative research ideas, the text processing system we describe here can be considered a simple working application of an important computing task.

1

Introduction

Text mining is a cross-disciplinary field involving databases, statistics, artificial intelligence, and various knowledge mining techniques. Most previous work in knowledge discovery is concerned with structured data (such as relational databases). In reality, a large portion of information does not appear as structured data but rather in collections of text documents. Large and increasing numbers of documents are created, such as electronic mail, online documents, technical reports, and news. Text mining enables users to transform this unstructured data into a format that can be easily analyzed and grouped into logical categories, so that new inferences can be discovered from the text. Many text mining techniques and systems have been proposed and developed. Most systems available today, however, are commercial products, such as SAS Enterprise Miner [15] and IBM Intelligent Miner [8], and they are expensive. Open-source systems do exist, such as Weka [18], but few incorporate text mining and are easy to use. In this paper, we present such a text mining system that provides a graphical user interface for the user to discover association rules in text documents. It is suitable for relatively simple text mining tasks and can be used as a test bed for research. Although focused on association rule mining, the system is extendable, and other mining modules can be easily plugged in. This paper makes little new research contribution, but we believe it is a meaningful application of data mining techniques that provides a tool to allow users to perform data mining tasks easily in a user-friendly environment.

2

Background and Related Work

Data mining in general aims to find knowledge hidden in a data set, knowledge that is most likely unknown beforehand. Common data mining tasks include classification, pattern discovery, and association discovery. These tasks also apply to text data. Most text mining techniques [7] have focused on text classification, or categorization. Given a text document d and a set of n categories C = {c_i, i ∈ [1, n]}, the text classification task is to decide a category c_k such that d is classified as belonging to c_k. The decision is based on a similarity measure S_{d,c_i} between the document d and each category c_i. Document d is classified into category c_k if the similarity of d and c_k is the largest among all categories; that is, d ∈ c_k where

S_{d,c_k} = max_{1≤i≤n} (S_{d,c_i})

The property of category c_i used to calculate the similarity is determined by a set of pre-categorized documents (called a training set) that are known to be in category c_i. Discovering association rules from text data is much less researched. The basic idea of association rule mining is to find associations among items in the data set. By association, we mean that an item set A appears together with another item set B in the application's units of data (database transactions, for example) with an above-threshold frequency. Most association rule mining methods have been applied only to database transactions. Association rule mining is defined as follows. Let I = {I_i, i = 1, ..., n} be the set of n data items, and let T = {T_i, i = 1, ..., m} be a set of m transactions, where transaction T_i = {I_{i_j}, j = 1, ..., l_i} contains l_i data items I_{i_j} ∈ I. Also let A ⊆ I and B ⊆ I be sets of data items with A ∩ B = ∅. The association rule A ⇒ B (support, confidence) is said to hold if

1. the probability of both A and B appearing in T is greater than the given threshold support: p(AB) > support;
2. given that A appears in a set of transactions T* ⊆ T, the probability that B also appears in T* is greater than the given threshold confidence: p(B|A) > confidence.

Research on mining association rules has been going on for about 15 years. The problem was first introduced by Agrawal, Imielinski, and Swami [1]. Shortly after, Agrawal and Srikant, and Mannila et al., independently published the Apriori algorithm [2, 11]. Several variations of the Apriori algorithm were proposed, including the use of hash tables to improve mining efficiency [13], partitioning the data to find candidate itemsets [16], reducing the number of transactions [6, 13], sampling [17], and dynamic itemset counting [3]. Various extensions of association rule mining were also studied [19, 14], and it is still an active research area today. Although there are many algorithms for finding association rules in relational databases, few have applied these algorithms to text data to find associations among text documents. Data mining tools also exist, but few are open source, easy to extend, and scalable, and even fewer target text mining. Our aim in this work is to develop a tool to support text mining tasks.
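The two threshold conditions can be sketched in code. The function, data, and threshold values below are illustrative, not part of the system described in this paper:

```python
# Sketch: checking support and confidence of a candidate rule A => B
# over a small set of transactions (each transaction is a set of items).

def rule_holds(transactions, A, B, min_support, min_confidence):
    """Return (support, confidence, holds) for the rule A => B."""
    A, B = set(A), set(B)
    n = len(transactions)
    n_A = sum(1 for t in transactions if A <= set(t))          # count of A
    n_AB = sum(1 for t in transactions if (A | B) <= set(t))   # count of A and B
    support = n_AB / n                                         # p(AB)
    confidence = n_AB / n_A if n_A else 0.0                    # p(B|A)
    return support, confidence, (support > min_support and
                                 confidence > min_confidence)

transactions = [
    {"data", "mining", "rules"},
    {"data", "mining"},
    {"text", "mining", "rules"},
    {"data", "rules"},
]
s, c, ok = rule_holds(transactions, {"data"}, {"mining"}, 0.25, 0.5)
# s = 0.5 (2 of 4 transactions), c = 2/3 (2 of the 3 containing "data")
```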

3

Text Mining

Text mining is closely related to linguistic processing in terms of words and phrases in the text, simply because words and phrases are the basic data items in the data set. Each "meaningful" word (or keyword) is considered a feature of a text document. General topics of text mining are covered in [5, 10], and programming applications in text mining are described in [12]. The feature space is modeled as a t-dimensional vector space in which each document is represented as a vector of weighted keywords (a point in the space), and each dimension represents a keyword found in the documents. A keyword is typically a word, a word stem, or a term associated with the text under consideration. The vector space model procedure can be divided into three stages: keyword extraction, keyword weighting, and similarity measurement.

3.1

Keyword Extraction

The first step in almost all text mining tasks is to extract keywords from the text documents. Keywords are the meaningful, context-bearing words. The three major steps in keyword extraction are (1) removal of insignificant words such as stop words like "a", "the", "to", etc.; (2) reduction of words to their stems; and (3) identification of words that are considered "significant", i.e., words with above-threshold frequencies of occurrence in the text.

To remove stop words from a text document, we first prepare a list of words that we consider stop words and store them in a file. The text document is then checked against the list; that is, the document is filtered by the list to remove stop words from the text. Since the pre-defined list is often very limited, we may find new stop words in the text, which are often words with very high frequencies. Word stemming reduces inflected (or sometimes derived) words to their stems, because morphological variants of a word have the same or similar meanings and can be considered equivalent for the purpose of information retrieval or text mining. A common approach to stemming is suffix stripping, which removes the more common morphological and inflexional endings. Suffix stripping can be achieved using a series of rules, but it sometimes also induces errors. After removal of stop words and stemming, the remaining text is filtered by a frequency threshold to produce keywords with above-threshold frequencies. Once the keywords are extracted, we use serial clustering to find sets of frequently occurring consecutive or closely located keywords; these phrases, or compound keywords, are also important to text mining tasks. The lengths of phrases are generally small, about 2 to 3 words. Longer phrases are possible, but the probability of finding them is quite low in many real text data sets unless the frequency threshold is set very low.

3.2

Keyword Weighting

Keywords are not equally useful for content representation. Many of the words in a document might not describe the content at all, while others might bear the important information of the document. High weights are assigned to terms deemed important and low weights to the less important ones. Appropriate keyword weighting is important for creating a more accurate document vector [9]. There are three main factors in keyword weighting: the keyword frequency factor, the collection frequency factor, and the length normalization factor. Various weighting schemes have been investigated, such as normalized frequency and inverse document frequency. Normalized frequency is a common weighting scheme for keywords within a document, while inverse document frequency is used to discriminate one document from another. There are a few useful measures of normalized frequency. One of them was proposed in [4]:

k + (1 − k) × freq_ij / maxFreq_j    (1)

where k is a value between 0 and 1, freq_ij is the frequency of keyword i in document j, and maxFreq_j is the frequency of the most frequent keyword in document j. All documents are represented by weighted keyword vectors of the form D_j = {d_ji, i = 1, ..., n}, where d_ji is the weight assigned to keyword i in document j.
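Formula (1) is straightforward to express in code; the parameter values below are illustrative:

```python
# Formula (1) as code: w_ij = k + (1 - k) * freq_ij / maxFreq_j.

def normalized_weight(freq_ij, max_freq_j, k=0.5):
    """Normalized-frequency weight of keyword i in document j."""
    return k + (1 - k) * freq_ij / max_freq_j

w = normalized_weight(freq_ij=3, max_freq_j=6, k=0.5)  # 0.5 + 0.5 * 0.5 = 0.75
```

Note that the most frequent keyword always receives weight 1, and no keyword's weight falls below k.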

A simplified version of the measure in formula (1) is used in our work to normalize keyword weights without selecting a value of k, scaled between 0 and 10 as 10 × (freq_ij / maxFreq_j). Words with weights below a given threshold are filtered out. Phrases (compound words) are located as adjacent keywords, and the weight of a phrase is assigned based on the weights of its individual words. The keyword list is then sorted by weight, and the top N keywords are selected. These keywords and their associated weights are stored in a keyword vector to represent the document.
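A minimal sketch of this simplified weighting pipeline, assuming an illustrative stop-word list, weight threshold, and N (none of these values come from the paper, and phrase detection is omitted):

```python
# Sketch: stop-word removal, 0-10 scaled weighting, threshold filtering,
# and top-N selection. All constants here are illustrative assumptions.

from collections import Counter

STOP_WORDS = {"a", "the", "to", "of", "and", "in", "is"}

def keyword_vector(text, weight_threshold=2.0, top_n=5):
    """Return a dict of the top-N keywords mapped to scaled weights."""
    words = [w for w in text.lower().split()
             if w.isalpha() and w not in STOP_WORDS]
    freq = Counter(words)
    max_freq = max(freq.values())
    weights = {w: 10 * f / max_freq for w, f in freq.items()}   # scale 0..10
    kept = {w: wt for w, wt in weights.items() if wt >= weight_threshold}
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])

vec = keyword_vector("text mining finds rules in text data and text documents")
# "text" occurs 3 times (the maximum), so it receives the full weight 10.0
```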

3.3

Similarity Measure

The most popular similarity measure in the vector space model is the cosine of the angle between two document vectors. It is computed as the inner product of the two vectors, normalized by the product of the vector lengths:

cos(θ) = (P · Q) / (‖P‖ · ‖Q‖) = ∑_{i=1}^{n} w_pi w_qi / ( √(∑_{i=1}^{n} w_pi²) · √(∑_{i=1}^{n} w_qi²) )

where θ is the angle between vectors P and Q, and w_pi and w_qi are the weights stored in P and Q. The cosine similarity between documents measures the angle between the two document vectors: the smaller the angle, the larger the similarity. In our work, we retrieve the keyword lists from the two text files, and both lists are transformed into standard lists of the same size and the same keyword order. The cosine measure is calculated as the similarity of the two files. If the similarity exceeds a minimum value (say, 0.5), a combined keyword list is created from the two keyword lists, and association rules are then induced on this combined list.
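The cosine measure can be sketched over keyword-to-weight dictionaries (a representation we assume here for brevity; the paper uses fixed-order lists):

```python
# Sketch of cosine similarity between two weighted keyword vectors.
# A keyword missing from one vector contributes weight 0 to the dot product.

import math

def cosine_similarity(p, q):
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

d1 = {"text": 10.0, "mining": 5.0}
d2 = {"text": 5.0, "rules": 5.0}
sim = cosine_similarity(d1, d2)  # only "text" overlaps, so dot = 50
```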

4

Mining Association Rules

The Apriori algorithm is an influential algorithm for finding frequent patterns, correlations, or causal structures among sets of items or objects in database transactions, relational databases, or other information repositories. In a document database, each document can be viewed as a transaction, and the set of keywords in the document can be considered the set of items in the transaction. Thus the problem of keyword association mining in a document database maps to item association mining in transactional databases. Frequent item sets (also called itemsets) are sets of items whose frequencies are larger than a given threshold. The Apriori algorithm is based on the a priori property of frequent itemsets: every subset of a frequent itemset must also be a frequent itemset. The algorithm iteratively finds frequent itemsets of cardinality from 1 to k (k-itemsets) and uses the frequent itemsets to generate association rules.

The algorithm consists of two steps. The first step generates the candidate itemsets of size k, C_k, by joining the frequent itemset collection L_{k−1} with itself; the second step reduces the candidate set by pruning any candidate in C_k that has a (k − 1)-subset that is not frequent, because such a candidate cannot be a frequent k-itemset. In our work, a keyword list is processed to create transaction sets and item sets. First, the whole list is scanned and the distinct keywords are listed in the item list. Then, the keywords with non-zero weights are included in the transaction lists, and the Apriori algorithm is applied to generate frequent itemsets from the transaction sets. After the frequent itemsets are found, the association rules can be generated simply from combinations of the subsets of the frequent itemsets, as outlined in Algorithm 1.

Algorithm 1: Generate association rules
Input: F: set of frequent itemsets, and min_conf: minimum confidence threshold
Output: association rules
1 begin
2   foreach frequent itemset l ∈ F do
3     foreach s ⊂ l, s ≠ ∅ do
4       if support_count(l) / support_count(s) ≥ min_conf then
5         output rule s ⇒ (l − s);
6       end
7     end
8   end
9 end
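A compact sketch of Apriori and the rule-generation step of Algorithm 1. The variable names and the small transaction set are ours, and the join/prune steps are simplified by filtering candidates directly on their support counts rather than pruning on subsets first:

```python
# Sketch: brute-force Apriori followed by rule generation (Algorithm 1 style).

from itertools import combinations

def apriori(transactions, min_support_count):
    """Return a dict mapping each frequent itemset to its support count."""
    items = sorted({i for t in transactions for i in t})
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)
    frequent = {}
    level = [frozenset([i]) for i in items]
    k = 1
    while level:
        # keep the k-itemsets meeting the support threshold
        current = {c: count(c) for c in level if count(c) >= min_support_count}
        frequent.update(current)
        # join step: build (k+1)-candidates from surviving k-itemsets
        keys = list(current)
        level = list({a | b for a in keys for b in keys if len(a | b) == k + 1})
        k += 1
    return frequent

def generate_rules(frequent, min_conf):
    """For each frequent itemset l, emit s => (l - s) meeting min_conf."""
    rules = []
    for l, l_count in frequent.items():
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                if l_count / frequent[s] >= min_conf:
                    rules.append((set(s), set(l - s)))
    return rules

transactions = [frozenset(t) for t in
                [{"data", "mining"}, {"data", "mining", "rules"},
                 {"data", "rules"}]]
freq = apriori(transactions, min_support_count=2)
rules = generate_rules(freq, min_conf=0.6)
```

Here {"data", "mining"} and {"data", "rules"} each occur in two of the three transactions, so both pairs are frequent and four rules clear the 0.6 confidence bar.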

5

The Association Rules Mining Tool

We developed a text processing tool for mining association rules. It consists of two main modules: manipulation and processing of text files, and keyword/phrase maintenance using a relational database.

5.1

Text Document Manipulation and Processing

This module is the main module of the system, responsible for all the text processing and mining tasks. An example of the keyword processing result for one text file is shown in Figure 1. This module consists primarily of the following four parts:

1. The main user interface, which includes these functions:
• Create GUI components and set up the event handling framework.
• Make database connections.
• Load selected text.
• Retrieve and save keyword lists from/to the database.
• Create the stop word list.
• Perform word stemming.
• Highlight the vector of words in the text display area, as shown in Figure 1.

Figure 1: Keywords of a text document are identified

2. Keyword processing, which performs these tasks:
• Extract keywords from the selected text.
• Calculate the weights of the words.
• Filter out low-weighted words.
• Find phrases based on the weights of the words.
• Sort the items (words and phrases) by their weights.

3. Similarity measures.
• Create standard keyword vectors from the two text documents. That is, both vectors have the same set of words (if one vector doesn't contain a particular word, an empty string is placed at the word's position in the vector).
• Calculate similarity as the dot product of the two vectors.

4. Discovery of association rules.
• Read keyword/phrase vectors from the database and form transactions.
• Find frequent itemsets using the Apriori algorithm.
• Generate association rules using Algorithm 1.

This module can also be considered to operate in two modes: single-file processing and multi-file processing. The single-file processing unit takes one text file as input and performs the necessary processing steps before mining: removing stop words, extracting keywords and phrases, creating the vector of keywords, and calculating their weights. The keywords are saved in a database. The multi-file processing unit takes two text files, selected by the user, retrieves their keyword vectors from the database, and calculates the similarity of the two files. If the similarity measure is above a threshold, association rules are induced from the combined keywords. Figure 2 shows a partial result of the association rules found from the two keyword lists. In the experiments shown in the figures, we used the following thresholds:

• min_frequency = 0.6 for filtering tokens and creating the keyword list.
• min_similarity = 0.5 as the similarity threshold for two files.
• min_support = 0.5 and min_confidence = 0.25 for generating association rules.

Other threshold values were also tested, but the results are not included in this paper.
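The decision step of the multi-file mode can be sketched as follows. The merge rule (keeping the larger weight for a keyword shared by both files) is our assumption, since the paper does not specify how the two lists are combined:

```python
# Sketch: combine the two keyword vectors only if the similarity computed
# in the previous step clears the min_similarity threshold.

def combine_if_similar(vec1, vec2, similarity, min_similarity=0.5):
    """Return a merged keyword vector, or None if the files are too dissimilar."""
    if similarity < min_similarity:
        return None
    merged = dict(vec1)
    for word, weight in vec2.items():
        # assumed merge rule: keep the larger of the two weights
        merged[word] = max(merged.get(word, 0.0), weight)
    return merged

merged = combine_if_similar({"text": 10.0, "mining": 6.0},
                            {"text": 8.0, "rules": 4.0},
                            similarity=0.77)
```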

Figure 2: Association rules found (only partially displayed in the scroll pane)

5.2

Database Manipulation

A simple Oracle relational database is used to store information about the files and keywords, in addition to the user information. The two main tables in the database are the file table and the keyword table. The database tables are created when the system is started. Records are inserted into the tables as the text files are stored in the associated file structure and the keywords are extracted from the files. After a text file has been processed, the user can issue an SQL query to get information from the database, as illustrated in Figure 3.
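A hedged sketch of the two-table design, using an in-memory SQLite database in place of Oracle; the column names are illustrative assumptions, as the paper does not give the schema:

```python
# Sketch: a file table and a keyword table as described above.
# Column names and types are our assumptions, not the paper's schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file (
    file_id   INTEGER PRIMARY KEY,
    file_name TEXT NOT NULL
);
CREATE TABLE keyword (
    file_id INTEGER REFERENCES file(file_id),
    word    TEXT NOT NULL,
    weight  REAL NOT NULL
);
""")
conn.execute("INSERT INTO file VALUES (1, 'doc1.txt')")
conn.executemany("INSERT INTO keyword VALUES (1, ?, ?)",
                 [("text", 10.0), ("mining", 6.0)])
# the kind of query a user might issue after processing a file
rows = conn.execute(
    "SELECT word, weight FROM keyword WHERE file_id = 1 ORDER BY weight DESC"
).fetchall()
```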

6

Summary

This paper presents a text mining tool for users to analyze text documents, particularly to find frequent itemsets and generate association rules. The tool provides an easy-to-use graphical interface for the user to select files and to set the frequency threshold for text filtering and the parameters for association rule mining, as well as easy-to-read displays. A back-end database stores the meta-data of the files and information about the keywords. The tool can also be extended easily with more text mining modules, because it is open source with a modular structure. Once a new module (for example, text classification) is implemented, we can simply include it in the package; the only change needed is to add new user interfaces to allow the user to communicate with the new module. This work is a preliminary experiment in developing a text mining tool. There are many aspects in which the system can be improved, such as using different, more efficient algorithms than the Apriori algorithm and exploring various keyword weighting measures.

Figure 3: Database query and result

References

[1] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM-SIGMOD Intl. Conf. on Management of Data (SIGMOD93), pages 207–216. ACM Press, 1993.
[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Intl. Conf. on Very Large Databases (VLDB94), pages 487–499. ACM Press, 1994.
[3] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 255–264. ACM Press, 1997.
[4] W. Bruce Croft. Applications for information retrieval techniques in the office. In Research and Development in Information Retrieval, Sixth Annual International ACM SIGIR Conference, pages 18–23. ACM, 1983.
[5] Ronen Feldman and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2006.
[6] Jiawei Han and Yongjian Fu. Mining multiple-level association rules in large databases. IEEE Transactions on Knowledge and Data Engineering, 11:798–805. IEEE, 1999.
[7] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, 2006.
[8] IBM. DB2 Intelligent Miner. http://www-306.ibm.com/software/data/iminer.
[9] Yunjae Jung, Haesun Park, and Ding-Zhu Du. A balanced term-weighting scheme for improved document comparison and classification. In First SIAM International Conference on Data Mining: Workshop on Text Mining, pages 69–76, 2001.
[10] Anne Kao and Steve R. Poteet, editors. Natural Language Processing and Text Mining. Springer, 2007.
[11] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases (KDD-94), pages 181–192. AAAI Press, 1994.
[12] Manu Konchady. Text Mining Application Programming. Programming Series. Charles River Media, 2006.
[13] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of the ACM-SIGMOD Intl. Conf. on Management of Data (SIGMOD95), pages 175–186. ACM Press, 1995.
[14] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and Dongqing Yang. H-mine: hyper-structure mining of frequent patterns in large databases. In Proceedings of the Intl. Conf. on Data Mining (ICDM'01), pages 441–448. IEEE, 2001.
[15] SAS. SAS Enterprise Miner. http://www.sas.com/technologies/analytics/datamining/miner.
[16] Ashok Savasere, Edward Robert Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the Intl. Conf. on Very Large Databases (VLDB95), pages 432–443, 1995.
[17] Hannu Toivonen. Sampling large databases for association rules. In Proceedings of the 22nd International Conference on Very Large Databases, pages 134–145. Morgan Kaufmann, 1996.
[18] University of Waikato. Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka.
[19] Marek Wojciechowski and Maciej Zakrzewicz. Hash-mine: a new framework for discovery of frequent itemsets. In Proc. of 2000 ADBIS-DASFAA Conference, 2000.
