IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 40, NO. 3, MAY 2010
Master Defect Record Retrieval Using Network-Based Feature Association Andrew Rodriguez, W. Art Chaovalitwongse, Member, IEEE, Liang Zhe, Harsh Singhal, and Hoang Pham, Fellow, IEEE
Abstract—As electronic records (e.g., medical records and technical defect records) accumulate, the retrieval of a record from a past instance with the same or similar circumstances has become extremely valuable. This is because a past record may contain the correct diagnosis or correct solution to the current circumstance. We refer to the two records of the same or similar circumstances as master and duplicate records. Current record retrieval techniques are lacking when applied to this special master defect record retrieval problem. In this study, we propose a new paradigm for master defect record retrieval using network-based feature association (NBFA). We train the master record retrieval process by constructing feature associations to limit the search space. The retrieval paradigm was employed and tested on a real-world large-scale defect record database from a telecommunications company. The empirical results suggest that the NBFA was able to significantly improve the performance of master record retrieval and should be implemented in practice. This paper presents an overview of technical aspects of the master defect record retrieval problem, describes general methodologies for retrieval of master defect records, proposes a new feature association paradigm, provides performance assessments on real data from a telecommunications company, and highlights difficulties and challenges in this line of research that should be addressed in the future.

Index Terms—Information retrieval (IR), keyword library, keyword weighting, query, text mining, vector space model (VSM).
Manuscript received December 30, 2008; revised May 14, 2009 and September 30, 2009. First published February 5, 2010; current version published April 14, 2010. This work was supported by the National Science Foundation under CAREER Grant 0546574 (Chaovalitwongse) and the Cisco-Academic Research and Technology Initiative grant. This paper was recommended by Associate Editor Z. Zdrahal. A. Rodriguez, W. A. Chaovalitwongse, L. Zhe, and H. Pham are with the Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ 08854 USA (e-mail: [email protected]). H. Singhal is a consultant in Bangalore, India (e-mail: singhalblr@gmail.com). Digital Object Identifier 10.1109/TSMCC.2010.2040079

I. INTRODUCTION

In recent years, there has been an increasing interest in data mining research applied to information retrieval (IR) and text classification (TC). Such data mining needs arise in many different areas, including telecommunications [1], manufacturing [2], [3], medicine [4], finance and banking [5], defect record retrieval [1], and online business [6]. As electronic storage space cost has decreased, it has become quite worthwhile to store large amounts of information about past instances (e.g., transactions, failures, and diagnoses). For example, if a technical defect occurs, a record can be made on the circumstances surrounding the defect and the action that was required to resolve the problem. Then, when the same or a similar circumstance occurs later, a search can be performed on stored records and a
relevant solution may be found quickly. The described practice is known as case-based reasoning (CBR) [7]. At the highest level, the CBR cycle can be described by these four processes: retrieve the most similar case(s); reuse the information in this case to solve the current problem; revise the proposed solution; and retain the parts of this experience likely to be useful for future problem solving [7]. In this paper, we are not proposing CBR or a competing technique; rather, we are working to improve the retrieve aspect of CBR. The challenge of this line of research is that a typical text record contains information stored in rich and complex ways, often with little or no fixed-form fields (for the proposed network-based feature association (NBFA) to be applicable, the records of the dataset should contain at least one fixed-form field). There are several ways to describe and store information contained in text records in concise ways. Still, in spite of hopes to use text records in a helpful way, the increase in stored information in text record databases has caused an often disadvantageous information flood [8]. In order to overcome this setback, a special branch of data mining, called text mining, has been widely studied in recent years to allow users to acquire useful knowledge from large amounts of text records [1], [4]. IR is an application of computer systems to fetch unstructured electronic text and perform other related activities. Before the proliferation of the World Wide Web, IR meant the retrieval of relevant information by a user who had a need for local information. A definition of IR is given by [9]: “An IR system does not inform (i.e., change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or nonexistence) and whereabouts of documents relating to his request.” Querying and query processing are other common terms for IR. TC is the process of automated classification of documents based on keyword occurrences. When comparing document vectors, the similarity measure employed can be the sum of common keyword values in the compared document vectors (for global keyword values) or the sum of the products of common keyword values in the compared document vectors (i.e., dot product). In this paper, we will focus on the development of a master defect record retrieval framework, which will be tested using real-world technical defect records from a telecommunications company. A new record retrieval framework is developed for the purpose of locating defect records that refer to the same defect problem. The defect record database investigated in this study is used to maintain transaction records for repair and maintenance. Given a duplicate record, we are interested in the retrieval of its master record. The master–duplicate pairings were designated by engineers familiar with the defects. In practice, an engineer
would only realize that the record he/she is creating, in response to a service call, is a duplicate record when he/she finds, via a search of the database, that the same problem has been encountered and documented at an earlier time. Records contain both fixed-format fields and free-form text. It should be noted that there is nothing special about the content that makes a master or a duplicate record be considered as such. Any record can potentially be a master or a duplicate record. The classification is just a matter of timing (i.e., which record came first). We leverage this information to train our IR approaches. Note that the text record database contains a large number (the majority) of records that are not master or duplicate records. The problem of retrieving the single master defect record for some duplicate record poses a challenge in text mining research. The work proposed here describes a new paradigm for defect record retrieval that deals with the master–duplicate search for the first time. In addition to using a traditional keyword weighting scheme that is widely studied in the existing TC and IR literature [10], we propose an NBFA approach to train the linkage relationships between the master and duplicate records. The new NBFA leverages the information from fixed-form fields in hybrid records as features. The idea is to build a network to associate features and reduce the search space based on this network. This novel NBFA approach can be viewed as a generalization of the decision tree paradigm, where the record linkages and network relationships can be trained during the training step. Through the trained network relationships, we can query multiple branches at the same time using different configurations of NBFA. The reduction of search space should in turn improve the search quality and reduce search time. The contribution of this paper is to demonstrate the gains achieved by using NBFAs, in addition to a keyword weighting scheme, when applied to a real-world, dynamic dataset. The NBFA approach can be viewed as a Bayesian network approach, where the trained linkages are given the same probability. The rest of this paper is organized as follows. Section II provides background on data mining techniques as they relate to IR and TC. Section III formally defines the problem that this paper addresses. Section IV outlines the methods traditionally employed and the ones employed in this paper to retrieve desired records. Section V presents results we achieved by applying the described techniques. Section VI presents a conclusion.

II. BACKGROUND

A. Text Mining

In this paper, our work relates to two major text mining research areas: IR and TC. Here, the IR problem is the problem of retrieving a single, linked record via a query of a text record database. The TC problem refers to the assignment of text records into distinct classes or categories. In a sense, we need to categorize two records (master and duplicate records) into a single class or category that is distinct from all other defect records. IR generally utilizes unsupervised learning techniques [11], since for a given input, the record that should be returned is not always well-defined or known (thus, we cannot reinforce our learning). In this paper, however, we are able to
reinforce our learning because we have master–duplicate pairs as they are defined by engineers familiar with the defects. TC generally utilizes supervised learning techniques [11], since we are often aware of the desired classification of at least a subset of the records (i.e., we are aware of the classification of the training set). Most literature in text mining is focused on TC (supervised learning). This work is important because we are able to leverage information in the defect records of the training set to improve the recall for future queries.

B. Traditional IR Practices

Consider the characteristics of text documents. Fixed-format fields are those that have strict formatting criteria for the type, range, and precision of their contents, such as specific model names, product lines, components, parts, dates, etc. Free-form text fields, on the other hand, are unstructured with no formatting requirements. The challenge in mining these types of text record databases is the high flexibility of the text contents in the free-form text field. Additionally, the size and content of a problem record itself can be dynamic and can keep growing until the problem is solved. In some cases, for control reasons, problem details may not be edited or removed after being entered. The representation of the text records should be standard so that, among other things, one can apply statistical processes or other data mining approaches to obtain useful results, findings, and feedback. We developed a semiautomated data extraction system that is described in Section IV-A. To undertake query processing and categorization in text mining, unstructured text must be represented in a format conducive to manipulation and quantitative–qualitative analysis. The vector space model (VSM) is an efficient method of text representation proposed in [12]. One previous study discusses how the VSM works with the idea that an approximate meaning of the document is captured by the words contained in it [13]. VSM represents documents as vectors in a vector space. The document set comprises an r × t record-term matrix in which each row represents a document, each column represents a keyword/term, and each entry represents the term value, which could be weighted using one of many term schemes or simply represented in binary {0,1} notation. For the sake of record retrieval or classification, the choice of similarity measure and initial term weighting schemes is important. Extensive research has been done assessing the effect of application parameters on a variety of similarity measures and schemes [14], [15]. The claim is that no similarity measure was found that could be considered a good performer under all circumstances [14]. In fact, minor variations, such as different logarithmic base values, caused pronounced changes in the performance of the similarity measure. In the determination of a suitable similarity measure, factors such as the type of data, the type of query, the individual query itself, the evaluation metric, and the conditions that the returned result must meet play a crucial role. An implication of this claim is that different similarity measures will affect performance measures used in the retrieval and classification tasks. For our paper, we consider a simple n-nearest neighbor search (n-NNS) using a Jaccard similarity
coefficient, when comparing vectors with binary-valued weights as well as real-valued weights. We chose to use a Jaccard similarity coefficient, since it is among the most common in the IR literature [16]. Research in term weighting schemes has found many approaches that make use of the frequency of occurrences of keywords in the record and across the collection of records. In TC, we would like to weigh terms using class information provided in training samples, resulting in greater recall. Difficulty in optimal term weighting arises from the lack of supervision encountered in traditional weighting schemes. Term weights must reflect the record's membership in its class and also serve to distinguish the record from records belonging to other classes. Such global term weights can then be used as modifiers obtained from the training set in classifying records that are encountered in the test (i.e., previously unseen) set. The term frequency–inverse document frequency (TF–IDF) weight is a weighting scheme often used in IR and text mining [10]. This weight is a statistical measure used to evaluate how important a term is to a document and to a collection of documents. The importance increases proportionally to the number of times a term appears in the document, but is offset by the frequency of the term within the entire database. The TF–IDF weighting scheme is often used in the VSM to determine the similarity between two text records. Salton et al. [12] proposed the following TF–IDF term weighting scheme: $\mathrm{tf}_{i,j} = n_{i,j} / \sum_{k} n_{k,j}$, where $n_{i,j}$ is the number of occurrences of the considered term $i$ in record $j$ and the denominator is the number of occurrences of all terms in record $j$; and $\mathrm{IDF}_i = \log\bigl(|D| / \sum_{j \in D} a_{ij}\bigr)$, where $|D|$ is the total number of records in the database, $a_{ij}$ is a binary variable indicating whether term $i$ appears in record $j$, and the denominator is therefore the number of documents in the database in which term $i$ appears (that is, where $n_{i,j} \neq 0$). Then, one can derive the TF–IDF weight by $\mathrm{tfIDF}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{IDF}_i$. Note that the TF–IDF weighting scheme assigns a weight to every term in every record, while the same term in different records may have different TF–IDF weights. In this paper, we only use the IDF function in our weighting scheme (which will be discussed in a later section). Reasons for using IDF rather than TF–IDF include the fact that the TF in a document depends largely on the author of the document and the nature of the content.

III. PROBLEM DEFINITION

In this paper, the fixed-format data includes well-defined information about the product, component, version, software, etc. surrounding the defect. Here, the component field, to be mentioned later, is used in the network generation for feature association to limit the search space. The free-form text includes a descriptive problem headline, problem summary, and diagnosis results. These records usually contain all of the information about a problem from its first reporting, up to and including its resolution. When a customer faces a defect, a service engineer will look for similarities between the current problem and the previously recorded problems that are stored in the database. Our objective is to retrieve the record most relevant to the engineer, when he/she performs a query. The size of these defect records
Fig. 1. Query and check cycle—starting with entered keywords (upper left). From the entered keywords, a search vector is created according to the VSM. The search vector for the text record is used by the IR tool for a database query. Ranked results are returned based on a similarity measure. There is a check to see if the master record was returned in the first n results. For example, say the entered keywords correspond to record r4. The master of record r4 is record r7, as seen in pair 2 of the example in Fig. 2. In this case, the query returned the associated master record of r4, namely r7, as the second result; therefore, the feedback is positive.
varies greatly, since each engineer includes varying amounts of detail about defects and each defect, by its nature, requires a varying amount of detail. The aim of TC is to accurately classify or group text records to common entities in a dataset. The entities or groups are based on the correlation criterion of record characteristics. However, we herein focus on correlating a special kind of text record as defined earlier (known as a duplicate) with a special existing record (known as a master). The designated master–duplicate pairing indicates that the same defect/problem occurs in both instances. It is noted that this problem is different from text mining in most of the literature. Although many clustering techniques have been successfully applied to TC, these techniques might not be sufficient in our case, as we do not look to merely group records into a set number of clusters. The records that make up our entire database can be said to have similar content because all are defect records from a single and specific industry. Instead, we try to find a unique one-to-one correspondence between the duplicate record and its master record. The difference in objectives is one of degree, rather than of type. Note that the one-to-one relationship is not true in the reverse, that is, a master record can have more than one duplicate record. See Fig. 1 for an illustration of the master–duplicate record search procedure.

IV. MATERIALS AND METHODS

A. Test Dataset and Preprocessing

The dataset consists of defect records in HTML files that were supplied to us by a telecommunications company. Additionally, the company provided a list of what they believe are the most relevant keywords. To supplement these keywords, we scanned the records and analyzed TF. The keyword extraction process is semiautomated and incorporates initial human input (by field experts and the authors) to reduce irrelevant content in the text record representation. We started by generating a list of
Fig. 2. Network construction—associated components are used to construct component link networks in three different ways. Known master–duplicate pairs (shown as rectangles to the left), from the training set, are processed one at a time in order to build (i.e., learn) component links. For example, take pair 1. The duplicate record is r8 and the master record is r7. Record r8 has component B and record r7 has component A. The resulting link from this single master–duplicate pair evaluation, then, involves components B and A. In the direct link case, there is a directed edge from B to A. This means that in any later query (i.e., any query taking place after the training), if component B is known to be the associated component, the record search tool will only search records in the database with known component A. Training is done with not just one pair, but all pairs except the pair that is to be the subject of the query. See Fig. 3 for the narrowed search space using the indirect links search configuration.
possible keywords by scanning the records' headline and problem summary, which exist in all records. We and the supplier of the database manually reduced the list of scanned words and terms. The terms that made it through this process were retained as keywords. We retained terms that met a frequency threshold, excluding words that provide little to no information, such as articles, conjunctions, prepositions, and pronouns. Such terms are often referred to as stopwords or functional words in this body of research. Next, the VSM of the dataset is constructed by building the keyword vector for each record based on the keywords that occur in its headline and problem summary. The rest of the document is not considered in building the keyword vector representation, as the rest of the document may not be English words or technical acronyms, but, for example, cryptic diagnostic output. Using our semiautomated extraction system, the record representations can be updated periodically with the increase in size and number of records. There are 11697 defect records in the dataset. After the human filtering, 1429 keywords make up the keyword library. While each HTML file is being scanned for keywords, we scan the component field in parallel as a feature, assigning a component identifier to each record. There are a total of 1458 different components (i.e., the component feature of each record is one of 1458 components). The component feature will later be used in NBFA. Our last objective when scanning the dataset is to record the master–duplicate relationships. Each record has a duplicate of field. If this field does not indicate another record, then the record being scanned is not a duplicate of any other record. If the duplicate of field has a value, it is the record identifier of
its master record (see Fig. 2 for use of the duplicate of field). There are 2412 master–duplicate pairs, as defined by engineers familiar with the records. Once an appropriate representation function is implemented, we have component information for each record, and we know all master–duplicate pairs; therefore, we are able to apply our cooperative data mining techniques to perform the queries and evaluate various search configurations.
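To illustrate the kind of representation just described, the following sketch builds binary keyword vectors and collects the component and duplicate-of information for each record. It is a simplified illustration, not the semiautomated system used in this study; the field names (headline, summary, component, duplicate_of), the stopword list, and the frequency threshold are hypothetical placeholders.

```python
import re
from collections import Counter

# Hypothetical stopword list; the real list was curated by hand.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is", "it"}

def tokenize(text):
    """Lowercase free-form text and split it into candidate terms."""
    return re.findall(r"[a-z0-9_\-]+", text.lower())

def build_keyword_library(records, min_freq=5):
    """Keep terms from headline + summary that meet a document-frequency threshold
    and are not stopwords; in the paper this list was further reduced by hand."""
    doc_freq = Counter()
    for rec in records:
        doc_freq.update(set(tokenize(rec["headline"] + " " + rec["summary"])))
    return sorted(t for t, c in doc_freq.items() if c >= min_freq and t not in STOPWORDS)

def binary_vector(rec, keywords):
    """Binary VSM representation: entry t is 1 if keyword t occurs in the record."""
    terms = set(tokenize(rec["headline"] + " " + rec["summary"]))
    return [1 if k in terms else 0 for k in keywords]

def preprocess(records):
    """Return the keyword library, one binary vector per record, the component
    feature of each record, and the known (duplicate, master) pairs."""
    keywords = build_keyword_library(records)
    vectors = {rec["id"]: binary_vector(rec, keywords) for rec in records}
    components = {rec["id"]: rec["component"] for rec in records}
    pairs = [(rec["id"], rec["duplicate_of"]) for rec in records if rec.get("duplicate_of")]
    return keywords, vectors, components, pairs
```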
B. n-Nearest Neighbor Search

In order to find records from the database having the same or similar defect issues (approximate matches) as the duplicate record, we begin by implementing a standard n-NNS using a Jaccard similarity coefficient. This search can be viewed as a selection of the elements in a database that are within the similarity rank of the query size (fixed number of nearest neighbors, n). The problem of finding nearest neighbors has been widely studied in the past when the data lie in a simple, low-dimensional vector space [11]. In our case, however, the data lie in a large metric space in which the number of neighbors grows very rapidly as the search space increases. In order to mitigate the total cost of comparisons, we implement an NBFA (described in Section IV-D) to reduce the search space (see Fig. 3). To find nearest neighbors, the Jaccard similarity coefficient between records $i$ and $j$ is defined as
$$\mathrm{similarity}_{i,j} = \frac{\sum_{t \in T} v_t^i \cdot v_t^j}{\sum_{t \in T} \max(v_t^i, v_t^j)}$$
where $v^i = (v_1^i, v_2^i, \ldots, v_{|T|}^i)$ and $v^j = (v_1^j, v_2^j, \ldots, v_{|T|}^j)$ are the binary keyword vectors of records $i$ and $j$, respectively, and $|T|$ is the dimension of the keyword vector space.
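A minimal sketch of this n-NNS, reusing the hypothetical binary keyword vectors from the preprocessing sketch above; the similarity function mirrors the Jaccard coefficient just defined, and the candidate set is simply a list of record identifiers.

```python
def jaccard(v_i, v_j):
    """Jaccard coefficient for binary keyword vectors: shared terms / union of terms."""
    num = sum(a * b for a, b in zip(v_i, v_j))
    den = sum(max(a, b) for a, b in zip(v_i, v_j))
    return num / den if den else 0.0

def n_nearest_neighbors(query_vec, candidate_ids, vectors, n=10):
    """Return the identifiers of the n records most similar to the query vector."""
    ranked = sorted(candidate_ids,
                    key=lambda rid: jaccard(query_vec, vectors[rid]),
                    reverse=True)
    return ranked[:n]
```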
Fig. 3. Search spaces—the keyword vector of each known duplicate record is used as search input for a query. A successful query returns the master record of the duplicate record, from the entire database. Only the records whose associated components are reachable from the component of the duplicate record are included in the search space.
C. Inverse Document Frequency (Keyword Weights)

In this section, we address the issue of the uniqueness of keywords. This issue is important as more commonly used keywords may not be helpful in identifying the master–duplicate relationships. In order to incorporate the uniqueness of keywords, we employ a global keyword weight. That is, in the keyword matching step of our NNS, each keyword weight depends on the number of documents in which the word occurs. We investigate the uniqueness of keywords in our database and generate a keyword weight using a basic count of the number of documents in which the keyword occurs. This technique is intuitive, since the keywords that appear in many different records of the database (note that we do not keep a count of the number of times a term appears in a document) are not unique and are less likely to assist in the location of the master record. We want to weigh the keywords that appear in only a few records with a greater weight than the keywords that appear in many records. We define our weight function as follows. For term $t$, the keyword weight is given by $w_t = |\log \mathrm{prob}(t)|$, where $\mathrm{prob}(t) = (\text{column sum for column } t)/(\text{total number of nonzero matrix entries})$. After the keyword weight function is generated, we incorporate it with our n-NNS by modifying the similarity measure between two records. The new similarity is given by
$$\mathrm{similarity}_{i,j} = \frac{\sum_{t \in T} w_t \cdot v_t^i \cdot v_t^j}{\sum_{t \in T} \max(v_t^i, v_t^j)}$$
where $w_t$ is the global weight of keyword $t$, $v^i = (v_1^i, v_2^i, \ldots, v_{|T|}^i)$ and $v^j = (v_1^j, v_2^j, \ldots, v_{|T|}^j)$ are the keyword vectors of records $i$ and $j$, respectively, and $|T|$ is the dimension of the keyword vector space.
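The weight function and the weighted similarity above can be sketched as follows; the vector layout follows the earlier sketches, and the handling of keywords that never occur (weight 0) is our own assumption.

```python
import math

def keyword_weights(vectors):
    """w_t = |log prob(t)|, with prob(t) = (column sum for column t) /
    (total number of nonzero entries in the record-term matrix)."""
    vecs = list(vectors.values())
    n_terms = len(vecs[0])
    col_sums = [sum(v[t] for v in vecs) for t in range(n_terms)]
    total_nonzero = sum(col_sums)
    return [abs(math.log(c / total_nonzero)) if c else 0.0 for c in col_sums]

def weighted_similarity(v_i, v_j, w):
    """Weighted similarity as defined above: weighted overlap over the plain union."""
    num = sum(wt * a * b for wt, a, b in zip(w, v_i, v_j))
    den = sum(max(a, b) for a, b in zip(v_i, v_j))
    return num / den if den else 0.0
```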
D. Network-Based Feature Association

In order to improve the performance of record searches, we incorporate additional features (from a fixed-format field) extracted from text records by establishing associations among the fixed-format fields. As mentioned, records contain information such as component, product, and software. In this study, we leverage only component information because it was suggested (and verified) as the most useful of the fixed-form fields by
engineers familiar with the records and their use. Note that a similar association procedure can be applied to other fixed-form features as well. The rationale of feature association is to reduce the search space. Specifically, we build associations of the component field by linking components based on training master–duplicate pairs. Since these features are in all records, we can use this information to constrain our search space, as shown in Fig. 3. If we were to search for the master record for a current duplicate record without feature association, we would consider every record in the database. On the other hand, if we were to employ feature association, we would go through only the records whose components are linked with the component of the current duplicate record. The associated components of the grayed records are not reachable from the current duplicate record (thus not searched), while the nongrayed records are reachable (thus are searched in the indirect link search configuration). Note that although it is likely that the search space reduction by feature association might improve the quality and speed of the search, there is also a possibility that the master record is excluded from the search space and will never be retrieved by our search. The actual component associations used in our search are obtained via a network-based modeling approach as follows. First, construct an unconnected component network, where the set of nodes is the set of components. Then, add an arc (i, j) between nodes (components) i and j, if in our training set, a record with component i is a duplicate of a record with component j. When we perform the search for a master record in our database, we only consider the records associated with components that are reachable from the component of the new record, as shown in Fig. 3. Note that each node is reachable from itself. The associated components network can be constructed in several ways due to the nature and characteristics of network linking. Here, we show three types of network construction as follows.

1) Direct Links: The direct links method of network construction will most limit the search space. A node (component) will only be reachable from another node if there exists a directed arc built in the training set when there was a master–duplicate pair that directly links the two components. For example, in the direct links column of Fig. 2, assuming we use only three master–duplicate pairs for training, the resulting NBFAs are a) if B then A; b) if C then A; and c) if D then C. After the training takes place, if we have a record with known component B, we will limit our search space to only records with known components B and A. Similarly, if we have a record with known component C, we will limit our search space to only records with known components C and A. However, if we have a record with known component A, we will limit our search space to only records with known component A.

2) Undirect Links: The undirect links method of network construction is much like the method described earlier, except that the links (arcs) that are built are bidirectional (undirected). Essentially, the information that is added is the duplicate component to master component relationship. For example, in the undirect links column of Fig. 2, assuming we use only three master–duplicate pairs for training, the resulting NBFAs are a) if A then B and C; b) if B then A; c) if C then A and D; and d) if D then C. For example, after the training takes place, if we have a record with known component A, we will limit our search space to records with known components A, B, and C. Similarly, if we have a record with known component C, we will limit our search space to only records with known components C, A, and D.

3) Indirect Links: The indirect links method of network construction will grow the largest network with respect to the number of connected components in a tree. For example, after the training that takes place in Fig. 2, an indirect link has been formed from component D to A by the pairs A–C and C–D, although there is no direct link from component A to D. That is, an indirect link was made. Let us consider an example in Fig. 2. Assume that a new text record r7 has component A, whose indirect links (right-most column) include nodes B, C, and D. When we perform the NNS, we only consider the records that are associated with components A, B, C, and D. All other records are not included as part of the search space. All of these techniques aid not only in improving the search quality, but also in reducing the query time. In essence, the search space has been narrowed/limited (see the right side of Fig. 3).

E. Experimental Design

Based on the keyword weighting scheme and NBFA methods described in Sections IV-C and IV-D, we can derive eight search configurations. The eight configurations are: 1) no component association, without keyword weights (basic NNS); 2) no component association, with keyword weights; 3) direct links, without keyword weights; 4) direct links, with keyword weights; 5) undirect links, without keyword weights; 6) undirect links, with keyword weights; 7) indirect links, without keyword weights; and 8) indirect links, with keyword weights. We note that there are advantages and disadvantages for each of these configurations. For example, the direct links method is the most limiting and will reduce the search space the most. However, this method will increase the probability that the true master record is excluded from the search space. On the other hand, the indirect links method provides the least constrained search space. Although it is more likely that the true master record is included in the search space, it may not drastically increase the search recall. If the feature association network is a connected network, our search will be the same as the configuration without feature association (basic n-NNS). For comparison purposes, we also consider the best case scenario for feature association, called a perfect association. That is, we say the ideal situation is knowing exactly the component of a master record for a given query. Then, in that best case, we would search only the records that have that one component. In other words, the perfect association is the most limiting search space while ensuring the existence of the true master record in the search space. Obviously, it may not be possible to develop a sophisticated process to exactly pinpoint the single component to which the desired master record would belong. Still, we cannot guarantee that the perfect association will certainly return a master record. This is because the search itself still depends on the use of n-NNS.
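To make the three construction schemes of Section IV-D and the resulting search-space restriction concrete, here is a minimal sketch. It reuses the hypothetical component map and training-pair format from the earlier sketches; it illustrates the idea and is not the authors' implementation.

```python
from collections import defaultdict, deque

def build_links(training_pairs, components):
    """One directed arc per trained pair: duplicate's component -> master's component."""
    arcs = defaultdict(set)
    for dup_id, master_id in training_pairs:
        arcs[components[dup_id]].add(components[master_id])
    return arcs

def reachable_components(component, arcs, mode):
    """Components searched for a query component under the three NBFA variants:
    'direct'   - the component itself plus its directed successors (one hop);
    'undirect' - one hop over the same arcs treated as undirected edges;
    'indirect' - every component in the same connected piece of the network."""
    if mode == "direct":
        return {component} | arcs[component]
    undirected = defaultdict(set)
    for u, succs in arcs.items():
        for v in succs:
            undirected[u].add(v)
            undirected[v].add(u)
    if mode == "undirect":
        return {component} | undirected[component]
    seen, frontier = {component}, deque([component])      # 'indirect': BFS closure
    while frontier:
        u = frontier.popleft()
        for v in undirected[u] - seen:
            seen.add(v)
            frontier.append(v)
    return seen

def restrict_search_space(query_component, candidate_ids, components, arcs, mode="direct"):
    """Keep only the records whose component is reachable (cf. Fig. 3)."""
    ok = reachable_components(query_component, arcs, mode)
    return [rid for rid in candidate_ids if components[rid] in ok]
```

With the three Fig. 2 training pairs (arcs B→A, C→A, and D→C), reachable_components("D", arcs, "direct") gives {"D", "C"}, while the indirect variant gives {"A", "B", "C", "D"}, matching the behavior described above.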
Fig. 4. Record processing—defect records are inputs for both the training and testing. The inputs were originally HTML files (or text input in the case that an engineer is performing a real-time search). Gray-shaded rectangles indicate preprocessing steps for record representation and for learning. Ultimately, the output is the set of the most similar records for the given defect input.
In brief, this perfect association is theoretically the performance upper bound concerning our experiment. The upper bound reflects the inherent limitations due to the content of defect files, the keyword library, service engineer record taking, etc. In order to investigate the search recall of n-NNS with component association, we investigate this issue as a supervised learning problem. We divide the master–duplicate pairs into training and testing sets. We use the training data to construct the component associations and calculate the keyword weights. We perform an n-NNS with component association on the test data. The validation procedure is illustrated in Fig. 4. There are several techniques in the literature proposed to divide the data into training and testing sets, while minimizing the bias of sampling. Based on our past experience, the best known approach is leave-one-out cross validation (LOOCV). The LOOCV technique is extensively used as a method to estimate the generalization error based on resampling. In other words, it is used to estimate how well the learning from the training data is going to perform on future (unseen) data. LOOCV is the same as k-fold cross validation with k equal to the number of master–duplicate pairs in the original dataset. The component association will be constructed and tested using n-NNS as many times as there are duplicate pairs. Each time, one of the duplicate pairs from the database is left out of the training and is used for testing. We use the LOOCV technique to reduce the bias of training and testing data. That means we take each one of the 2412 master–duplicate pairs out of the training set, one at a time. We train the component links (as described in Section IV-D) on all pairs but the one excluded. Then, we use the one test pair that was not
TABLE I LOOCV RESULTS (2412 MASTER–DUPLICATE PAIRS)—SEARCH RECALLS YIELDED BY ALL EIGHT SEARCH CONFIGURATIONS USING KEYWORD WEIGHTING AND NBFAS, COMPARED TO THOSE OBTAINED FROM PERFECT LINK ASSOCIATION (THE UPPER BOUND)
used in training, and we perform a query, searching for the master record out of 9285 (11697 − 2412) records in the database when the duplicate record is provided as search input. Next, we choose a different master–duplicate pair to be the test pair, and we train using all other pairs except that one and test with the duplicate record of that pair as search input. The process continues until we have considered all 2412 pairs. For this paper, we measure the performance of our search by recall, which is defined by the number of correct master record retrievals divided by the number of attempts at master record retrieval. Here, we evaluate ten different query sizes, n = 10, 20, . . . , 100. One can also refer to the search recall at n as the probability that the master record is returned within the first n records when the dataset is queried. As mentioned earlier, the collection of all records' vector representations can be thought of as an r × t matrix, where each row of the matrix is a vector representation for one record (each entry is 0 or 1). This means that for each query, if the search space is not narrowed via feature association, the number of term comparisons is of the order of rt, since r − 1 distance measures are made, where each vector has a cardinality of t.

V. RESULTS

This section discusses the results of the defect record retrieval task. The validations are carried out by automatically querying the database with duplicate record vectors and determining if the corresponding master record vector is returned in the top n retrieved document vectors from the database (as illustrated in Fig. 1). We make comparisons among IDF, our NBFA, and combinations of the two methods. Specifically, the IR results are achieved via a script that iteratively performs the LOOCV using the vector representation of each duplicate record. That is, for each duplicate, we try to retrieve the single, designated master record from the database
using the vector representation as search input and comparing it to the vector representation of all other nonduplicate records. Then, we calculate the recall as the number of queries where the duplicate record's master record was returned in the top n records, n = 10, 20, . . . , 90, 100, divided by the total number of queries. We will use n for the number of top-ranked records returned. There has been some debate over the years, but two properties are now well accepted by the research community for measurement of search effectiveness. They are recall: the proportion of relevant documents retrieved by the system; and precision: the proportion of retrieved documents that are relevant [17]. We do not calculate precision in this paper, since in all cases, all but one record will be considered a nonrelevant record. At times, we plot the improvement of the recall that results from using our techniques as compared to using a basic NNS. Table I reports the search recall for the eight search configurations in comparison with their upper bound, the perfect association. Recall is another way of representing the probability that the master record is returned among the given number of records returned by the query n. For example, if the recall is 70% at some n, then out of ten queries, the user can expect to have seven correct master records returned for that n. Naturally, the recall increases as n increases because the probability of returning the master record is greater as more records are returned. The y-axis of the plot indicates the recall and the x-axis indicates the number of records returned n. From the table and plot, we observed that the direct link configuration with keyword weighting yields the best search performance. It achieved a recall of better than 41%, even when only ten records are returned, and reached a recall of better than 62% when n is 100. We also observed that the perfect association achieved a recall of almost 100% when n = 100, and the keyword weighting did not improve its search recall when n ≥ 80. This is mainly because the perfect association extensively limited the search space, so that there were fewer records in the search space than the number of records returned. Therefore, the master files were returned regardless of the keyword weights. Note that the perfect association is an ideal case and is used here as a reference for best possible learning. Tables II and III present p-values and test statistics for recall. Table II presents these values for recall improvement due to keyword weighting.
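The validation loop just described can be sketched as follows, reusing the hypothetical helpers from the earlier sketches (jaccard, weighted_similarity, build_links, restrict_search_space); it illustrates the LOOCV recall-at-n protocol rather than the actual experiment script.

```python
def loocv_recall(pairs, vectors, components, all_ids, n=10, mode=None, weights=None):
    """Fraction of duplicate queries whose designated master is in the top n results."""
    dup_ids = {d for d, _ in pairs}
    hits = 0
    for dup_id, master_id in pairs:
        training = [p for p in pairs if p != (dup_id, master_id)]     # leave one pair out
        candidates = [rid for rid in all_ids if rid not in dup_ids]   # nonduplicate records only
        if mode is not None:                                          # optional NBFA restriction
            arcs = build_links(training, components)
            candidates = restrict_search_space(components[dup_id], candidates,
                                               components, arcs, mode)
        query = vectors[dup_id]
        score = (lambda rid: weighted_similarity(query, vectors[rid], weights)) if weights \
                else (lambda rid: jaccard(query, vectors[rid]))
        top = sorted(candidates, key=score, reverse=True)[:n]
        hits += master_id in top
    return hits / len(pairs)
```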
TABLE II STATISTICAL SIGNIFICANCE OF RECALL IMPROVEMENT FOR ALL 2412 MASTER–DUPLICATE PAIRS DUE TO KEYWORD WEIGHTS WITH RESPECT TO VARIOUS SEARCH METHODS
TABLE III STATISTICAL SIGNIFICANCE OF RECALL DUE TO NBFAS FOR ALL 2412 MASTER–DUPLICATE PAIRS
A. Effect of Keyword Weights

Recall improvement reports the increase in percentage points for recall due to the keyword weighting scheme and NBFA. The y-axis of this plot indicates the recall improvement for each of the search configurations. For example, if the recall for a basic NNS is 35% and the recall for direct link association is 62%, then the recall improvement is 27 percentage points. In each case—no feature association, direct link association, undirect link association, and indirect link association—employing keyword weights always improves the query recall (see Table I). These improvements suggest that keyword prioritization based on uniqueness is helpful in the location of master records. Keywords that appear in many records are not as helpful in distinguishing duplicate records as keywords that occur in fewer records. To evaluate the significance of the improvement due to keyword weighting, we performed a t-test as a statistical hypothesis test to assess the statistical significance of the recall improvement. The null hypothesis is that the mean recall improvement (of the ten recall improvement points) for the various methods with keyword weighting is no greater than the mean recall improvement (of the ten recall improvement points) for the methods without keyword weighting. Note that the t-test is appropriately applied when the population is assumed to be normally distributed and the sample size is small, so that the statistic on which inference is based follows a t-distribution rather than a normal distribution. In our case, because we only evaluated ten different query sizes, the statistic from this test only serves as a means to compare the improvement due to keyword weights with respect to the various search methods. Table II presents the p-values for the recall improvement due to keyword weighting with respect to the various search methods.

While the unanimous recall improvement is a promising result, the improvement is quite modest for the direct link and undirect link configurations. Specifically, as the number of returned records increases, the search recall improvements of the basic NNS and indirect link configurations are much more prominent than those of the direct link and undirect link configurations. We should note that the basic NNS and indirect link configurations did not limit the search space as much as the direct link and undirect link configurations. These observations are expected, as keyword weighting is more effective when the search space is large.

B. Effect of Feature Associations
It is observed in Table I that the indirect link association does not yield recall as high as the direct and undirect link associations (both with and without keyword weights). This result may be explained by the nature of the dataset used. When links are created via the indirect link method, one feature association tree (connected and reachable nodes of the network) grows very large, while a number of independent trees remain quite small. For example, in our experiments, the largest tree contained 405 components and the second largest tree contained only 47 components, while the remaining trees contained eight or fewer components. Since the largest tree has so many components, there is not much narrowing of the search space in many cases. That is, the largest tree begins to resemble the standard NNS. It may be that some links that are indirectly made are well-justified because, in the training pairs, the associated components are linked a number of times. However, two associated components only need to be linked once during training in order for our current method to form the indirect link. The binary nature of the method seems to yield too large a search space for the components in the largest tree and perhaps too small a search space for components that belong to the smaller trees. This disadvantage in building links is not necessarily the case for all datasets, but it seems true for this dataset. Of the NBFA methods evaluated, direct link association performed the best in the recall experiments. Direct link associations improved the search recall by 20 to 25 percentage points as compared to the basic NNS. We believe this recall improvement is due to the fact that direct feature association produces the tightest association for components. Indeed, the recall improvements from these validations are ordered based on the tightness of the associations, from no link association (where all nonduplicate records are searched) to perfect feature association (where only records that have the same component as the master record are searched). By tightness, we mean that trees of the association network remain relatively small and distinct from other trees because, in training, we avoid creating any loose (distant) associations. To evaluate the significance of the improved recall due to feature association, we performed a t-test as a statistical hypothesis test to assess the statistical significance of the recall improvement. The null hypothesis is that the mean recall (of the ten recall points) of the various NBFA configurations is no greater than the mean recall (of the ten recall points) of the non-NBFA configuration (referred to as "none" in Table I). Table III presents the p-values for recall due to the various NBFA search methods.
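This comparison can be reproduced in outline with a paired t-test over the ten recall-at-n points; the sketch below uses SciPy, and the recall values shown are placeholders, not the numbers from Table I.

```python
from scipy import stats

# Recall at n = 10, 20, ..., 100 for two configurations (placeholder values).
recall_basic_nns   = [0.15, 0.20, 0.24, 0.27, 0.30, 0.32, 0.34, 0.36, 0.38, 0.41]
recall_direct_link = [0.35, 0.42, 0.46, 0.49, 0.52, 0.54, 0.56, 0.58, 0.59, 0.60]

# Paired t-test on the ten recall points; halve the p-value for a one-sided test
# of "the NBFA configuration has higher mean recall".
t_stat, p_two_sided = stats.ttest_rel(recall_direct_link, recall_basic_nns)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.4f}")
```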
We believe it can be concluded that the NBFA methods that yield the tighter links will result in the highest recall, as long as the training method is intelligently performed. It is clear that the links perform well to improve recall beyond what the keyword library would otherwise allow. Generally speaking, this is a promising result for text mining, as hybrid records often contain important fixed-form fields that can be leveraged. It is important to note that the aforementioned risk of excluding the master record by performing feature association is outweighed by the improvement achieved in recall due to the reduced search space. In addition, we believe the reason keyword weighting yields a smaller improvement than the feature association method is likely that the keyword weights are calculated in an unsupervised manner, while the feature association is done in a supervised manner. As a result, we believe there is a need for a smarter keyword prioritization technique. We believe that our future work should focus on an optimization method to best adjust the keyword weights based on supervised learning. The concept of this new keyword optimization technique would go beyond mere statistical analysis and ensure that keywords that link (i.e., are in common between) duplicate pairs are assigned a greater weight, while keywords that do not link duplicate pairs are assigned a lesser weight, even if those keywords are unique from a statistical point of view.
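One possible way to write down this future-work idea, as a sketch only (the formulation below is our reading of the sentence above, not a model given in the paper): treat each keyword weight as a decision variable, minimize the total weighted distance between training master–duplicate pairs, and add a normalization constraint so that the all-zero solution is excluded.

```latex
\begin{align*}
\min_{w}\quad & \sum_{(d,m)\in P}\ \sum_{t\in T} w_t\,\bigl|v_t^{d}-v_t^{m}\bigr| \\
\text{s.t.}\quad & \sum_{t\in T} w_t = |T|, \qquad w_t \ge 0 \quad \forall\, t \in T,
\end{align*}
```

where P denotes the training master–duplicate pairs and v^d, v^m are the binary keyword vectors of a duplicate and its master. Note that this simplest form also rewards keywords absent from both records of a pair; a refinement along the lines suggested above would add a term that explicitly rewards keywords the pair has in common.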
TABLE IV PERFORMANCE CHARACTERISTICS FOR THE 2066 MASTER–DUPLICATE PAIRS WITH GREATER THAN ZERO KEYWORDS IN COMMON
TABLE V PERFORMANCE CHARACTERISTICS FOR THE 1298 MASTER–DUPLICATE PAIRS WITH GREATER THAN ONE KEYWORD IN COMMON
C. Influence of Keyword Library

In this section, we demonstrate the need for our proposed NBFA by showing the improvement in recall it achieves when the keyword library does not contain sufficiently informative terms or when the description of the defect is insufficient in either the master record, the duplicate record, or both. Statistical analysis indicates that there were relatively few keywords in common between the 2412 master–duplicate pairs in our original dataset. This finding motivated us to repeat our developed techniques on subsets of the original duplicate pairs to test the effectiveness of our methods as there are more keywords in common between duplicate pairs. We created two data subsets of master–duplicate pairs and evaluated the proposed search configurations for each of the two subsets. In particular, the original dataset contains a total of 2412 known master–duplicate pairs. The second set of pairs (G0) is a subset of the first set. The second set consists of all the duplicate pairs that have greater than zero keywords in common between master and duplicate records. The third set (G1) is a subset of the first two sets. The third set consists of all pairs that have greater than one keyword in common between master and duplicate records. Specifically, the three sets of master–duplicate pairs are: 1) all 2412 master–duplicate pairs; 2) the 2066 master–duplicate pairs, where at least one keyword is in common between the master and duplicate records; and 3) the 1298 master–duplicate pairs, where at least two keywords are in common between the master and duplicate records. Up to this point, we have considered the first set of pairs. We will now consider the second and third sets of pairs.
Table IV reports on the first master–duplicate pair subset (G0). At this point, we observed gains from the keyword weighting and feature association similar to those for all duplicate pairs (the original pair set). That is, we noticed substantial and sustained recall improvement over all values of n, just as with all duplicate pairs. Table V reports on the second master–duplicate pair subset (G1). Notice that for a smaller n, substantial gains from the keyword weighting and feature association are maintained. However, for a larger n, the recall improvement is less. This phenomenon can be explained by the relatively good performance of the basic NNS (nearly 60%) at a large n and with a good duplicate pair set (with greater than one keyword in common).
Table IV shows that for the same basic NNS at a large n and a slightly lower quality master–duplicate pair set (with greater than zero keywords in common), recall is significantly lower, at about 41%. Thus, we observe more room for improvement with a lower quality pair set than with a higher quality pair set. In the third set of pairs, more than one keyword is in common between each master and duplicate record; thus, the recall for the smaller n is significantly improved even before any keyword weighting or feature association is applied. Note that the recall improvement is smaller for the third set than for the second set of pairs. This decrease is because the basic NNS performs much better when at least two keywords are in common between the master and duplicate records (the third set of pairs). These results clearly show the benefit of the NBFA in the case that the keyword library does not allow for master–duplicate pairs to have at least two keywords in common. Statistical evaluation of the number of keywords in common between master–duplicate pairs shows that more than a simple NNS is needed for this type of problem. For example, in the case that no keywords are in common between the master and duplicate records, the NNS would surely not perform very well, assuming the duplicate record has a keyword in common with a record(s) that is not its master record, which is a good possibility as our statistical evaluation shows. However, by making the feature associations, we improve the likelihood of returning the master record, in the case that the training led to a tight feature association. For example, if a certain duplicate record has associated component A, and in training only a link was made from A to B, and the master of the duplicate record indeed has associated component B, and there are not many records with associated component B, then we have a good chance that the desired master record is returned even if the master and duplicate record do not have many (or any) keywords in common. The case just described is the case when feature association is potentially most valuable to the query. Statistical analysis shows that for this dataset and keyword library, we are often subject to a similar set of circumstances.

VI. CONCLUSION

In this paper, we have shown how keyword weighting and NBFA can be used to improve the performance and quality of master defect record search. Component (feature) associations were built in a direct, undirect, and indirect way. Component associations were represented as a network structure based on the set of the data designated as training data (i.e., all master–duplicate pairs except the pair that was left out for testing). The keyword weighting technique is based on the number of records in the database in which the term appears (IDF). For the original 2412 master–duplicate pairs, keyword weighting improved the search recall by as much as 7 percentage points and direct link association improved recall by as much as 25 percentage points. When the two techniques were combined, recall was improved by nearly 28 percentage points (see Table I). Note that there are potentially many ways to form feature associations. In this paper, we present three methods. The best method may well be a function of the dataset and the nature of master–duplicate pairings.
As the results demonstrate, it is possible to overassociate, creating associations that are too broad to be beneficial. This case was displayed in the indirect link association building, where one tree in the association network was very large and the others were very small. For example, the largest tree is about 8.6 times as large as the second largest tree. Again, the nature of the data will dictate which of the three methods performs best. It is probable that this weighting scheme does not give the optimal classification power for IR and TC. For future work, we plan to develop an optimization framework that goes beyond mere statistical analysis to determine the optimal keyword weights using a supervised learning method. For example, we can make the weight of each keyword a decision variable in a linear program and minimize the sum of the distance measures between pairs of records that are duplicates.

ACKNOWLEDGMENT

The authors would like to thank C. Pham and F. Lin of Cisco Systems, Inc. for their insight into this application of text mining and for access to the set of text records that was used in this paper.

REFERENCES

[1] J. Kreyss, S. Selvaggio, M. White, and Z. Zakharian, "Text mining for a clear picture of defect reports: A praxis report," in Proc. 3rd IEEE Int. Conf. Data Mining, 2003, pp. 727–730.
[2] C. Dagli and H.-C. Lee, Impacts of Data Mining Technology on Product Design and Planning. London, U.K.: Chapman & Hall, 1997.
[3] R. Menon, L. H. Tong, S. Sathiyakeerthi, A. Brombacher, and C. Leong, "The needs and benefits of applying textual data mining within the product development process," Qual. Rel. Eng. Int., vol. 20, no. 1, pp. 1–15, 2004.
[4] P. Cerrito and J. Cerrito, "Data and text mining the electronic medical record to improve care and to lower costs," in Proc. SAS SUGI, 2006, pp. 1–20.
[5] A. Kloptchenko, T. Eklund, B. Back, J. Karlsson, H. Vanharanta, and A. Visa, "Combining data and text mining techniques for analyzing financial reports," in Proc. Eighth Amer. Conf. Inf. Syst., 2002, pp. 20–28.
[6] B. Liu, "Mining data records in web pages," in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2003, pp. 601–606.
[7] A. Aamodt and E. Plaza, "Case-based reasoning: Foundational issues, methodological variations and system approaches," AICom—Artificial Intelligence Communications, vol. 7, no. 1, pp. 39–59, 1994.
[8] P. Myllymaki, T. Silander, H. Tirri, and P. Uronen, "Bayesian data mining on the web with B-Course," in Proc. First IEEE Int. Conf. Data Mining, 2001, pp. 626–629.
[9] F. W. Lancaster, Information Retrieval Systems: Characteristics, Testing, and Evaluation. New York: Wiley, 1968.
[10] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," J. Mach. Learn. Res., vol. 5, pp. 361–397, 2004.
[11] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[12] G. Salton, C. Yang, and A. Wong, "A vector-space model for automatic indexing," Commun. ACM, vol. 18, pp. 613–620, 1975.
[13] D. A. Grossman and O. Frieder, Information Retrieval: Algorithms and Heuristics. Norwell, MA: Kluwer, 1998.
[14] J. Zobel and A. Moffat, "Exploring the similarity space," SIGIR Forum, vol. 32, no. 1, pp. 18–34, 1998.
[15] W. P. Jones and G. W. Furnas, "Pictures of relevance: A geometric analysis of similarity measures," J. Am. Soc. Inf. Sci., vol. 38, no. 6, pp. 420–442, 1987.
[16] L. Lee, "Measures of distributional similarity," in Proc. 37th Annu.
Meet. Assoc. Comput. Linguistics, Morristown, NJ: Association for Computational Linguistics, 1999, pp. 25–32.
[17] C. Cleverdon, The Cranfield Tests on Index Language Devices. San Francisco, CA: Morgan Kaufmann, 1997.
Andrew Rodriguez received the B.Sc. (Hons.) degree in computer science from the University of Texas at San Antonio, San Antonio, TX, in 2006. He is currently working toward the Ph.D. degree in the Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ. He received a fellowship from The Graduate School-New Brunswick, Rutgers University, in 2006–2007 and the National Aeronautics and Space Administration Graduate Student Researchers Program fellowship, in 2007–2008.
W. Art Chaovalitwongse (M'05) received the B.E. degree in telecommunication engineering from King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand, in 1999 and the M.S. and Ph.D. degrees in industrial and systems engineering from the University of Florida, Gainesville, in 2000 and 2003, respectively. He is currently the Director of the COmplex Systems Modeling and Optimization Laboratory, Rutgers University, Piscataway, NJ. His research interests include integrating scientific concepts and research tools from diverse disciplines (e.g., neuroscience, computational biology, operations research, and computational statistics) with real-life applications. Dr. Chaovalitwongse was awarded the Excellence in Research award from the University of Florida for his contribution in epilepsy. He was also the recipient of the 2004 and 2008 William Pierskalla Best Paper Award for Research Excellence in Operations Research and Health Care Applications from the Institute for Operations Research and the Management Sciences, and his paper was ranked fifth among the top 25 articles in Operations Research Letters. He is also a recipient of the 2006 National Science Foundation CAREER Award.
Liang Zhe received the B.Eng. (Hons.) degree from the Department of Computer Engineering, National University of Singapore (NUS), Kent Ridge, Singapore, in 2001 and the M.Eng. degree from the Department of Industrial and Systems Engineering (ISE), NUS, in 2003. He is currently working toward the Ph.D. degree in the Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ. He was a Research Engineer with the ISE. His research interests include supply chain and logistics, combinatorial optimization, and meta-heuristics with real-world applications in oil transportation, computational biology, and telecommunication.
Harsh Singhal received the B.E. degree in industrial engineering and management from the M.S. Ramaiah Institute of Technology, Bangalore, Karnataka, India and the M.S. degree from the Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ. He currently provides data mining solutions as an independent consultant in Bangalore, India.
Hoang Pham (S'85–M'87–SM'92–F'05) received the B.S. degree in computer science and the B.S. degree in mathematics from Northeastern Illinois University, the M.S. degree in statistics from the University of Illinois at Urbana-Champaign, and the M.S. and Ph.D. degrees in industrial engineering from the State University of New York at Buffalo. He is currently a Professor and Chair of the Department of Industrial and Systems Engineering, Rutgers University, Piscataway, NJ. He was a Senior Engineering Specialist with the Idaho National Engineering Laboratory and the Boeing Company. His research interests include system reliability modeling, safety and maintenance, software reliability, and fault-tolerant computing. He is the editor-in-chief of the International Journal of Reliability, Quality and Safety Engineering. He is the author or coauthor of four books and has edited ten books. He has authored or coauthored more than 100 journal articles. Dr. Pham is an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS.