Concepts and Techniques in Data Mining and APPLICATION TO Text ...

CC: BY NC ND

Table of Contents

Chapter 1. Introduction

1.1. What is ...

... hyperlinks) to other text that the reader can immediately access, usually by a mouse click or keypress sequence. Apart from running text, hypertext may contain tables, images and other presentational devices. Other means of interaction may also be present, such as a bubble with text appearing when the mouse hovers over a particular area, a video clip starting, or a form to complete and submit. The most extensive example of hypertext today is the World Wide Web.

(b) the source of the hypertext

Figure 1-8: An example of a hypertext with its source (source: http://en.wikipedia.org/wiki/Hypertext)

1.4. Some Data Mining Applications

Data mining can be applied in a wide range of areas. Some major applications are listed below.

Data Mining in Retail Industry Data

Finding interesting patterns in large amounts of data on sales, customer shopping records, goods transportation, consumption, and service provides useful information and knowledge for managing the retail industry. Data mining in retail databases can help us identify customer shopping behaviors, patterns, and trends. This information is useful for improving the quality of customer service, achieving better customer retention and satisfaction, enhancing goods consumption ratios, designing more effective goods transportation and distribution policies, and reducing the cost of business. Recently, the growing popularity of conducting business transactions online (e-commerce) has increased the quantity of data collected, and it continues to expand rapidly. Applying data mining to these databases to discover purchasing patterns can help guide the design and development of data warehouse structures, manage effective sales campaigns, and retain customers with personalized product recommendations and targeted services.

Data Mining in Financial Data

Currently, a wide variety of financial services are offered, such as deposit, withdrawal, loan, foreign exchange, and investment. Financial data collected from these services are relatively reliable and of high quality. Therefore, data analysis and/or data mining (knowledge discovery) on these data are realistic and useful. For example, data analysis enables us to view the average, total, maximum, minimum, trend, outliers, and other statistical characteristics of each financial resource (e.g., deposit, credit, fund, stock, etc.), together with their changes by time period, by geographical region, by customer sector, and by other factors. It is possible to cluster or categorize customers into a set of target groups for marketing purposes, or for tasks such as loan payment prediction and credit policy analysis. Data mining in financial data can also support the detection of money laundering and other financial crimes. Detecting such crimes requires integrating information from multiple databases to find unusual patterns, such as large amounts of cash flow at certain periods by certain groups of customers.

Sponsored by AIAT.or.th and KINDML, SIIT

Data Mining in Telecommunication Data

At present, the telecommunication industry generates a tremendous amount of data, which falls into three categories: call detail data describing the calls traversing the telecommunication networks, network data describing the state of the hardware and software components used in the network, and customer data describing the telecommunication usage of customers. Data mining can be used to uncover useful information hidden in these three data sets in order to identify telecommunication fraud, identify network faults, and improve marketing performance. Fraudulent activity costs the telecommunication industry millions of dollars per year. With data mining techniques, we can detect potentially fraudulent actions and usage patterns; identify attempts to gain fraudulent entry to other customers’ accounts; and reveal unusual patterns that may harm the system itself, such as busy-hour frustrated call attempts and router/switch congestion. With data mining, we can also find software- and hardware-related problems inside the telecommunication system. Moreover, by finding frequent sequential patterns, such as frequent calling patterns, we can promote the sales of specific long-distance and cellular phone combinations and improve the availability of particular services in a region.

Data Mining in Biological Data

Recently, biotechnology has become popular, with many potential uses. Its rapid advancement has triggered an explosive growth of biological data, such as those in genomics, functional genomics, proteomics, and biomedical studies. Applications include the identification and comparative analysis of the genomes of humans and other species (by discovering sequencing patterns, gene functions, and evolution paths), the investigation of genetic networks and protein pathways, and the development of new pharmaceuticals and advances in cancer treatment. Known as bioinformatics, the field of biological data mining is broad, rich, and dynamic.

1.5. Text Mining Application

Text mining is a knowledge-based process in which a user interacts with a document collection, and its tasks and applications are vast. Analogous to data mining, text mining (alternatively referred to as text data mining) aims to extract useful information from data sources through the identification and exploration of interesting patterns. Some typical text mining tasks are text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). A number of text mining applications are given below.

Text Mining for Biomedical Applications

With the accelerated growth of online biomedical information, a set of computational tools is required to filter public biomedical text databases and to highlight their relevant information in a well-organized and coherent manner. For example, it is possible to find large numbers of apparent correlations among thousands of genes by analyzing gene-gene relationships stated in biomedical research articles related to mRNA expression profiling experiments with cDNA microarrays and oligonucleotide chips. The large body of biomedical literature gives us a great chance to detect hidden relationships.

Text Mining for Security Applications

Recently, information security has evolved from focusing only on data or actions related to the network and server layers to also including text content, such as web pages and email, transmitted through the network in the application layer. A number of powerful software packages have been developed to monitor web content, radio broadcasting, and cellular/telephone speech in order to find certain keywords. When the keywords are found, the web and/or speech results are recorded and analyzed. One of the dominant systems related to text mining for security is the classified ECHELON surveillance system developed by the United States National Security Agency (NSA) in partnership with the UK, Canada, Australia and New Zealand. Besides this, AeroText and Attensity are software packages marketed towards security applications that analyze plain text sources such as online news articles. Text categorization also enables us to distinguish the characteristics of normal and malicious user behaviors from the log entries generated by an online application server.

Text Mining for Marketing Applications

Text mining can also help marketing professionals find nuggets of information that support good decisions. As a simple type of text mining, search engines (or information retrieval systems) can help us improve our daily tasks. For example, one can type a set of keywords related to a topic of interest to find a set of existing related documents that may help. Moreover, text mining can help us find relationships among different keywords using concept clustering, indexing, association, feature extraction, information visualization, and summarization. This technology helps marketing professionals identify hidden information that could reveal business opportunities, with visually interactive, user-friendly tools that depict patterns and relationships between keywords.

Text Mining for Academic Applications

Many academic articles are available online, as abstracts or full texts, either publicly or commercially, e.g., through CiteSeer, ACM Portal, Elsevier ScienceDirect, SpringerLink, PubMed, MS Academic Search, Google Scholar and DBLP. As its most basic function, text mining can help us properly index documents for later retrieval of similar academic articles. It also helps us group similar articles for literature review or other purposes, provide semantic cues that let machines answer specific queries, and find relations among multiple articles for deeper analysis.

Text Mining for Patent Analysis

Generally, patent documents express important research and development results. However, they are usually lengthy and rich in technical terminology, so analyzing them requires considerable human effort. Given the many millions of patent documents in a patent database, one may need to find those similar to a given or intended one. An emerging application of text mining is to search patents based on similarity to assist patent engineers or decision makers in patent analysis.



1.6. Types of Mining Tasks

As stated in Section 1.3, there are various types of databases and information repositories to which we can apply data mining or text mining to find intuitive, valid, useful and novel patterns. In general, there are two classes of data mining tasks:

1. Descriptive mining characterizes general properties of the data in the database without attempting to predict a future event. Examples are characterization, discrimination, clustering, and association rule mining or frequent pattern mining.
2. Predictive mining builds a model that is later used to predict related or prospective events by making use of information hidden in the current data. Examples are classification, numeric prediction, and pattern recognition.

Once we have a set of data to be mined, we have to determine which kinds of mining tasks to perform. In many cases, we may have no idea what kinds of interesting patterns can be found in the data, and hence we may search for several different kinds of patterns in parallel. Mining multiple kinds of patterns can also accommodate different user expectations or applications. Besides the type of patterns to be mined, in several cases we need a mechanism that guides users to search for and discover interesting patterns at various granularities (different levels of abstraction) in an interactive environment, and that allows users to track how topics change across time periods. Moreover, since some patterns may not hold for all of the data in the database, a measure of certainty or “trustworthiness” is usually associated with each discovered pattern. The following subsections describe some common types of data mining tasks.

1.6.1. Classification or Categorization: Finding the class of an object

Classification (or categorization) is a common task in human activities that involves making a decision or forecast in an unknown or future situation, using currently available information. From another point of view, classification is the process of constructing a model (or function) that describes and distinguishes data classes or concepts, so that the model can later be used to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). In some areas, classification is referred to as pattern recognition, discrimination, or supervised learning, in contrast with unsupervised learning or clustering, where no classes are predefined and they are instead inferred from the data.

Classification has been applied to many scientific, industrial and commercial problems. Some typical classification tasks are the recognition of a letter from a character image (as in an automatic postcode reader), the assignment of a credit status to a customer on the basis of financial and other personal information, and the preliminary diagnosis of a patient’s disease while waiting for definitive test results.

A learned classification model can be expressed in various forms. Some common forms are classification (IF-THEN) rules, decision trees, mathematical formulae, and neural networks. Given the sample dataset in Figure 1-9 (a), examples of these forms are shown in Figure 1-9 (b)-(d) and Figure 1-10, respectively. A decision tree is a flow-chart-like tree structure in which each internal node denotes a test on an attribute value, each branch represents an outcome of the test, and the tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest neighbor classification.



Outlook   Temp.  Humidity  Windy  Play
sunny     hot    high      false  no
sunny     hot    high      true   no
overcast  hot    high      false  yes
rainy     mild   high      false  yes
rainy     cool   normal    false  yes
rainy     cool   normal    true   no
overcast  cool   normal    true   yes
sunny     mild   high      false  no
sunny     cool   normal    false  yes
rainy     mild   normal    false  yes
sunny     mild   normal    true   yes
overcast  mild   high      true   yes
overcast  hot    normal    false  yes
rainy     mild   high      true   no

(a) A sample data set (the Play-Tennis data set)

1. If (Outlook = “Overcast”) then Play = “Yes”
2. If (Humidity = “Normal” and Windy = “False”) then Play = “Yes”
3. If (Temp = “Mild” and Humidity = “Normal”) then Play = “Yes”
4. If (Outlook = “Rainy” and Windy = “False”) then Play = “Yes”

(b) Classification rules

Outlook
  sunny:    Humidity
              high:   No
              normal: Yes
  overcast: Yes
  rainy:    Windy
              false:  Yes
              true:   No

(c) Decision Tree

Play(yes) = 0.6 * outlook(sunny) + 1.0 * outlook(overcast) + 0.2 * outlook(rainy)
          + 0.1 * temp(hot) + 0.2 * temp(mild) + 0.2 * temp(cool)
          + 0.5 * humidity(high) + 0.8 * humidity(normal)
          + 0.6 * windy(false) + 0.3 * windy(true)

Play(no)  = 0.3 * outlook(sunny) + 0.1 * outlook(overcast) + 0.7 * outlook(rainy)
          + 0.2 * temp(hot) + 0.1 * temp(mild) + 0.3 * temp(cool)
          + 0.7 * humidity(high) + 0.1 * humidity(normal)
          + 0.3 * windy(false) + 0.8 * windy(true)

(d) Linear regression equations

Figure 1-9: Three classification models: (a) a sample classification dataset, (b) classification (IF-THEN) rules, (c) a decision tree, and (d) mathematical formulae (linear equations).
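As a minimal sketch, the classification rules of Figure 1-9 (b) and the decision tree of Figure 1-9 (c) can be written as two small Python functions and checked against the records of Figure 1-9 (a). The function names are illustrative, and the default class "no" for a record that matches none of the four rules is an assumption, since the figure lists only rules that conclude "Yes".

```python
# Play-Tennis records from Figure 1-9 (a):
# (outlook, temp, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def classify_by_rules(outlook, temp, humidity, windy):
    """Apply the four IF-THEN rules of Figure 1-9 (b) in order."""
    if outlook == "overcast":
        return "yes"
    if humidity == "normal" and not windy:
        return "yes"
    if temp == "mild" and humidity == "normal":
        return "yes"
    if outlook == "rainy" and not windy:
        return "yes"
    # Assumption: records matched by no rule fall back to "no".
    return "no"


def classify_by_tree(outlook, temp, humidity, windy):
    """Walk the decision tree of Figure 1-9 (c); temp is not tested."""
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    # Remaining branch: outlook == "rainy".
    return "no" if windy else "yes"


def accuracy(classifier):
    """Fraction of the records in DATA whose label the classifier reproduces."""
    hits = sum(classifier(*row[:4]) == row[4] for row in DATA)
    return hits / len(DATA)


print("rules:", accuracy(classify_by_rules))
print("tree: ", accuracy(classify_by_tree))
# A record that does not occur in the sample data:
print(classify_by_tree("sunny", "cool", "high", True))
```

Under these assumptions both functions reproduce the Play column for all 14 sample records. Agreement on the training data alone says nothing about how well either model would classify unseen records.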



[Figure: a network whose input units are outlook=sunny, outlook=overcast, outlook=rainy, temperature=hot, temperature=mild, temperature=cool, humidity=high, humidity=normal, windy=true and windy=false, and whose output units are play=yes and play=no]

Figure 1-10: An example of an artificial neural network.

1.6.2. Prediction: Predicting the value for an object

While classification predicts categorical (discrete, unordered) labels, prediction forecasts continuous-valued functions. Also called numeric prediction, it is applied to estimate missing or unavailable numerical data values rather than class labels. Intuitively, the term ‘prediction’ covers both numeric prediction and class label prediction. However, in much of the literature it refers to numeric prediction, while label prediction is called classification. Prediction also encompasses the identification of distribution trends based on the available data. For example, it is possible to predict the potential sales amount of a product given its price, or the performance of a computer given its components.

Regression analysis, a statistical methodology developed by Sir Francis Galton (1822–1911), is most often used for numeric prediction. Although other methods for numeric prediction exist, scientists and researchers often use the terms “regression” and “numeric prediction” synonymously. It is also possible to apply classification techniques (such as artificial neural networks, decision trees, support vector machines, and k-nearest-neighbor classifiers) to prediction and, conversely, numeric prediction techniques (i.e., regression analysis) to classification.

Regression analysis aims to model the relationship between one or more independent or predictor variables, which are discrete- or continuous-valued, and a dependent or response variable, which is continuous-valued. In data mining or knowledge discovery, the predictor variables are the attributes of interest describing each training tuple (example), possibly in the form of an attribute vector. In general, the values of the predictor variables are known. Even if some of them are missing, statistical techniques can be applied to recover and handle such cases. The response variable, on the other hand, is the target value to be predicted, also called the predicted attribute. Some common models for prediction are linear regression, regression trees and model trees, as shown in Figure 1-11 (b)-(d) for the sample dataset in Figure 1-11 (a), where ‘Play’ is the target value and the four preceding variables (‘Outlook’, ‘Temp’, ‘Humidity’ and ‘Windy’) are predictor variables.
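To make the predictor/response relationship concrete, the sketch below fits a straight line by ordinary least squares with a single predictor. It uses only the Humidity and Play columns of the sample data in Figure 1-11 (a), so it is a simplification and not the four-predictor equation of Figure 1-11 (b). The function names are illustrative.

```python
# Humidity (predictor) and Play (response) columns of Figure 1-11 (a).
HUMIDITY = [80, 85, 90, 80, 50, 55, 45, 95, 50, 45, 40, 85, 60, 90]
PLAY = [5, 10, 80, 95, 85, 15, 80, 10, 100, 85, 95, 90, 75, 20]


def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Estimate the response for a predictor value x."""
    return intercept + slope * x


slope, intercept = fit_line(HUMIDITY, PLAY)
residuals = [y - predict(x, slope, intercept) for x, y in zip(HUMIDITY, PLAY)]

print(f"Play = {intercept:.2f} + ({slope:.3f}) * humidity")
print(f"predicted Play at humidity 70: {predict(70, slope, intercept):.1f}")
```

Two properties of a least-squares line with an intercept make handy sanity checks: the residuals sum to zero (up to rounding), and the line passes through the point formed by the two column means.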



Outlook  Temp.  Humidity  Windy  Play
90       40     80        10     5
95       32     85        80     10
50       35     90        20     80
10       24     80        5      95
15       10     50        15     85
20       12     55        90     15
55       9      45        95     80
85       22     95        25     10
95       7      50        5      100
5        26     45        10     85
80       25     40        80     95
45       24     85        85     90
40       37     60        15     75
25       23     90        95     20

(a) A sample data set (the real-valued Play-Tennis data set)

Play = 149.3 - 0.171*outlook - 0.273*temp - 0.298*humidity - 0.359*windy

(b) Linear regression equation

Outlook
  > 75:   Humidity
            >= 65: 8.33
            < 65:  97.5
  35-75:  81.25
  < 35:   Windy
            >= 50: 17.5
            < 50:  ...

(c) Regression Tree

[Model tree: the same splits as the regression tree, with leaves L1 to L5 each holding a linear model of the form Play = ...]
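The leaf values of the regression tree in Figure 1-11 (c) can be reproduced from the sample data in Figure 1-11 (a). The sketch below assumes the usual convention that a regression-tree leaf predicts the mean target value of the training records that reach it, and it takes the split thresholds (Outlook at 35 and 75, Humidity at 65, Windy at 50) from the figure. The leaf labels and function names are illustrative.

```python
from collections import defaultdict

# Real-valued Play-Tennis records from Figure 1-11 (a):
# (outlook, temp, humidity, windy, play)
ROWS = [
    (90, 40, 80, 10, 5),
    (95, 32, 85, 80, 10),
    (50, 35, 90, 20, 80),
    (10, 24, 80, 5, 95),
    (15, 10, 50, 15, 85),
    (20, 12, 55, 90, 15),
    (55, 9, 45, 95, 80),
    (85, 22, 95, 25, 10),
    (95, 7, 50, 5, 100),
    (5, 26, 45, 10, 85),
    (80, 25, 40, 80, 95),
    (45, 24, 85, 85, 90),
    (40, 37, 60, 15, 75),
    (25, 23, 90, 95, 20),
]


def leaf_of(outlook, humidity, windy):
    """Route a record through the splits of Figure 1-11 (c)."""
    if outlook > 75:
        if humidity >= 65:
            return "outlook > 75, humidity >= 65"
        return "outlook > 75, humidity < 65"
    if outlook < 35:
        if windy >= 50:
            return "outlook < 35, windy >= 50"
        return "outlook < 35, windy < 50"
    return "outlook 35-75"


def leaf_means(rows):
    """Mean Play value of the records that fall into each leaf."""
    groups = defaultdict(list)
    for outlook, _temp, humidity, windy, play in rows:
        groups[leaf_of(outlook, humidity, windy)].append(play)
    return {leaf: sum(values) / len(values) for leaf, values in groups.items()}


MEANS = leaf_means(ROWS)


def predict(outlook, humidity, windy):
    """Predict Play as the mean of the leaf the record falls into."""
    return MEANS[leaf_of(outlook, humidity, windy)]


for leaf, value in sorted(MEANS.items()):
    print(f"{leaf}: {value:.2f}")
```

Under this assumption the computed means agree with the four leaf values that are legible in the figure (8.33, 97.5, 81.25 and 17.5), which suggests that the tree was built in this way.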
