IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 12, NO. 2, APRIL 2008
HAL-Based Evolutionary Inference for Pattern Induction From Psychiatry Web Resources
Liang-Chih Yu, Chung-Hsien Wu, Senior Member, IEEE, Jui-Feng Yeh, and Fong-Lin Jang
Abstract—Negative and stressful life events play a significant role in triggering depressive episodes. Psychiatric services that can identify such events efficiently are vital for mental health care and prevention. Meaningful patterns must first be extracted from psychiatric texts before these services can be provided. This study presents an evolutionary text-mining framework capable of inducing variable-length patterns from unannotated psychiatry web resources. The proposed framework can be divided into two parts: 1) a cognitively motivated model, the Hyperspace Analog to Language (HAL) model, and 2) an Evolutionary Inference Algorithm (EIA). The HAL model constructs a high-dimensional context space to represent words as well as combinations of words. Based on the HAL model, the EIA bootstraps with a small set of seed patterns, and then iteratively induces additional relevant patterns. To avoid moving in the wrong direction, the EIA further incorporates relevance feedback to guide the induction process. Experimental results indicate that combining the HAL model and relevance feedback enables the EIA not only to induce patterns from the unannotated web corpora, but also to achieve useful results in a reasonable amount of time. The proposed framework thus significantly reduces reliance on annotated corpora.

Index Terms—Evolutionary computation, knowledge acquisition, natural language processing, text mining.
I. INTRODUCTION
Individuals may suffer from negative or stressful life events. Recent studies have summarized about 80 negative life events [1], [2], such as the death of a family member, an argument with a spouse, and the loss of a job. Such life events play an important role in triggering depressive symptoms, such as depressed mood, suicide attempts, and anxiety. Individuals in these circumstances often seek help from psychiatric web sites by describing their mental health problems on message boards and other services. Health professionals respond with suggestions, but the response time is generally several days, depending on both the processing time required by the health professionals and the number of problems to be processed. Such a long response time is unacceptable, especially for individuals suffering from psychiatric emergencies such as suicide attempts. A potential solution considers the problems that have
Manuscript received September 14, 2006; revised December 29, 2006. This work was supported in part by the National Science Council, Taiwan, R.O.C., under Grant NSC-95-2221-E-006-181-MY3. L.-C. Yu and C.-H. Wu are with the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan, R.O.C. (e-mail:
[email protected]). J.-F. Yeh is with the Department of Computer Science and Information Engineering, Far East University, Tainan 744, Taiwan, R.O.C. F.-L. Jang is with the Department of Psychiatry, Chi-Mei Medical Center, Tainan 710, Taiwan, R.O.C. Digital Object Identifier 10.1109/TEVC.2007.895270
been processed, together with the corresponding suggestions, as the psychiatry web resources. A retrieval system is subsequently developed to help individuals efficiently and effectively locate the relevant problems. The psychiatry web resources generally contain thousands of consultation records (problem-response pairs), which can provide useful educational information. For instance, the mental health website of the John Tung Foundation (http://www.jtf.org.tw) contains at least 3500 consultation records, and the number is still growing. Individuals can immediately learn how to deal with their mental health problems from the suggestions given in the relevant consultation records. In developing the retrieval system, the negative life events experienced by individuals are significant features for identifying their relevant problems. Moreover, considering personal life events helps improve the precision of retrieval results.

Negative life events are often expressed in natural language segments (e.g., sentences, text passages). A critical step in identifying these segments is to transform them into a machine-interpretable semantic representation. This step involves the extraction of key patterns from the segments. Consider the following example.

Two years ago, I lost my parents. (Negative life event)
Since then, I have tried to kill myself several times. (Suicide)

In this example, the pattern comprises two words, indicating that the subject suffered from a negative life event triggering the symptom "Suicide." A pattern can be considered as a semantically plausible combination of n words, where n denotes the length of the pattern. Accordingly, patterns can have variable length. Our previous studies [3], [4] presented a framework to identify depressive symptoms. The current study extends our previous studies to devise a text-mining framework that induces variable-length patterns automatically from psychiatry web resources.
Traditional approaches to pattern induction generally fall into two areas of research: knowledge-based [5] and corpus-based [6]–[10]. Knowledge-based approaches rely on exploiting expert knowledge to design handcrafted patterns. The major limitation of such approaches is the significant time and effort required to design the handcrafted patterns. Additionally, these patterns have to be redesigned when they are applied to a new domain. Such limitations form a knowledge acquisition bottleneck. This problem can be alleviated by using a general-purpose ontology such as WordNet [11] or EuroWordNet [12], or a domain-specific ontology constructed by automatic approaches [13]–[16]. These ontologies contain rich concepts and interconcept relations, such as hypernymy–hyponymy relations, based on which word similarities can be calculated [17], [18]. However,
YU et al.: HAL-BASED EVOLUTIONARY INFERENCE FOR PATTERN INDUCTION FROM PSYCHIATRY WEB RESOURCES
Fig. 1. Block diagram of the EIA.
an ontology is a static knowledge resource, which may not reflect the dynamic characteristics of language. For this reason, this study adopts web resources, specifically those relating to psychiatry, as the knowledge resource. Corpus-based approaches can learn patterns from domain corpora efficiently by applying statistical methods. The corpora must be annotated with domain-specific knowledge (e.g., events). Various statistical methods can then be adopted to induce patterns from possible combinations of words in the corpora. For instance, association pattern mining [19], [20], which has been extensively studied in the data mining community, can be adapted to generate patterns from a set of sentences annotated with negative life events. Statistical methods may suffer from the data sparseness problem, resulting in the requirement for large annotated corpora to improve the reliability of parameters. Such large corpora may be unavailable for some application domains. Therefore, this study proposes the use of the psychiatry web resources as the corpora. Traditional corpus-based approaches may be infeasible when using web corpora. For example, health professionals cannot reasonably be expected to annotate whole web corpora. Additionally, generating possible combinations of words from the web corpora, and then searching for the patterns, are computationally intensive procedures. To address these problems, this study devises an algorithm that can induce patterns from unannotated corpora, reducing the reliance on annotated corpora, and that can achieve good results in a reasonable amount of time, adapting to the web environment. To this end, this study devises an Evolutionary Inference Algorithm (EIA) based on evolutionary computation [21]–[25]. The EIA extends classical evolutionary algorithms (EAs) by exploiting a cognitively motivated model, the Hyperspace Analog to Language (HAL) model [26]–[28], and relevance feedback [29].
The HAL model provides an informative infrastructure for the EIA, which can thus induce variable-length patterns from unannotated corpora. The relevance feedback provides expert knowledge for guiding the induction process, ensuring that the EIA achieves good results in a reasonable amount of time. Fig. 1 shows the block diagram of the EIA. A crucial step for pattern induction is the representation of words, as well as combinations of words. To accomplish this goal, the HAL model constructs a high-dimensional context space for the psychiatry web corpora. Each word in the HAL space is represented as a vector of its context, which means that the sense of a word can be inferred from its context. This notion
Fig. 2. Framework for variable-length pattern induction.
is derived from observations of human behavior. In other words, human beings may determine the sense of an unknown word by referring to the words appearing in its context [27]. Based on this cognitive behavior, two words sharing more common contexts are more similar semantically. To further represent the patterns, the HAL model provides a mechanism to combine their constituent words over the HAL space. Once the HAL space is built, the EIA bootstraps with a small set of seed patterns to induce additional relevant patterns. For each seed pattern, the EIA first creates the initial population for each pattern length. The induction process is then applied iteratively to induce additional patterns relevant to the seed patterns by comparing their context distributions. To prevent the EIA from moving in the wrong direction, relevance feedback is incorporated to guide the induction process. The induction process is terminated when the termination criteria are satisfied.

The rest of this paper is organized as follows. Section II describes the overall framework for pattern induction. Section III then describes the process of building the HAL space. Next, Section IV details the EIA. Section V summarizes the experimental results, which are discussed in Section VI. Conclusions are finally drawn in Section VII, along with recommendations for future research.

II. OVERVIEW OF PATTERN INDUCTION

The overall framework, as shown in Fig. 2, is divided into two parts, the HAL model and the EIA. The HAL space is first built from the psychiatry web corpora after word segmentation. This step produces a vocabulary of size N. The EIA then creates the initial populations of patterns with different lengths (from 2 to a maximum length L). Each pattern of length n is created by selecting n distinct words from the vocabulary. Both the patterns and the seed patterns are represented by combining their constituent words over the HAL space. After each population is initialized, the fitness function is adopted to measure
the fitness of each pattern in each population. Once all the populations are measured, the selection process chooses a number of patterns to be the parents according to their fitness values. The variation operators, i.e., crossover and mutation, are then applied to produce the offspring. The offspring are also evaluated by the fitness function, and the superior offspring replace the inferior parents to form a new population. Relevance feedback is then applied to identify the relevant patterns in the population. This information can be adopted to refine the seed pattern to improve its similarity to the relevant set. The refined seed pattern is taken as the reference basis in the next generation. The induction process is performed iteratively until the termination criteria are satisfied.

III. HAL SPACE CONSTRUCTION

The HAL model represents each word in the vocabulary using a vector representation. Each dimension denotes the weight of a word in the context of the target word. The weights are calculated by considering syntactic properties, such as the location and distance of words. In principle, a context word that is closer to the target word is given a greater weight. The HAL model moves an observation window of length K over the corpus to calculate the weights. All words within the window are considered as co-occurring with each other. Thus, the weight between any two words separated by a distance d within a window is calculated as K - d + 1. Fig. 3 shows the weighting scheme of the HAL model. The HAL space views the corpus as a sequence of words. Additionally, a special character, namely a space, is inserted between two sentences to represent a sentence boundary. The HAL space is built by moving the window in one-word increments over the whole corpus. The resulting HAL space is an N x N matrix, where N denotes the vocabulary size. Each word in the HAL space is called a concept.

Fig. 3. Weighting scheme of the HAL model.
TABLE I. Example of a HAL space (window size = 5).
Fig. 4. Conceptual representation of the HAL space.

Table I presents the HAL space for the example text, "When I was young, I lost my parents." The row vector corresponding to each concept in Table I denotes its left context information, i.e., the weights of the words preceding it. The corresponding column vector denotes its right context information. Each concept can thus be represented as a pair of vectors. That is

c_i = {c_i^l, c_i^r}, with c_i^l = (w_i1, ..., w_iN)    (1)

where c_i^l and c_i^r denote the vectors of the left context information and right context information of a concept c_i, respectively; w_ij denotes the weight of the jth dimension of a vector; and N denotes the dimensionality of a vector, i.e., the vocabulary size. Fig. 4 shows the conceptual representation of the HAL space.

Since the weighting scheme of the HAL model is frequency-based, extremely infrequent words are considered as noise and removed from the vocabulary. Conversely, a highly frequent word generally receives a high weight, but this does not mean that it is informative, because it may also appear in many other vectors. Therefore, the number of vectors in which a word appears should be considered when measuring its informativeness. In principle, a word appearing in more vectors carries less information to discriminate among the vectors. This study adopts a weighting scheme analogous to TF-IDF [30], [31] to reweight the dimensions of a vector, as described in (2)

w'_ij = w_ij x log(M / m_j)    (2)

where M denotes the total number of vectors, and m_j denotes the number of vectors with word t_j as a dimension. After each dimension is reweighted, the HAL space is transformed into a probabilistic framework, and thus each weight can be redefined as

P(t_j | c_i) = w'_ij / sum_k w'_ik    (3)

where P(t_j | c_i) denotes the probability that word t_j appears in the vector of concept c_i. The context information of the words in the corpora can be obtained once the HAL space is built. Fig. 5 shows an example of the context information of the words "boss," "chief," and "flower." "Boss" and "chief" have quite similar contexts, but
Fig. 5. Example of context information. The dimensions with a gray shadow denote the frequent words in the context of the target word.
these are quite different from the context of “flower.” Accordingly, the words “boss” and “chief” are more similar to each other semantically than “boss” and “flower,” and “chief” and “flower.” Since the context information is calculated based on a window of predefined length, words such as “stress,” “colleague,” and “company” do not occur in the context of “flower,” possibly because the window size is not large enough. Enlarging the window size can increase the number of distinct words occurring in the contexts, but this does not mean that the similarity between words can be improved accordingly. For instance, the word “company” might occur in the context of “flower” after the window size is increased, thus increasing the similarity between “boss” and “flower.” However, this may also introduce many other words into the context of both “boss” and “chief,” which reduces their similarity. This study determines the best window size experimentally.
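To make the construction concrete, the windowed weighting and the TF-IDF-like reweighting described above can be sketched as follows. This is a minimal sketch in Python: the function names, the toy sentence, and the restriction to left-context vectors are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict
from math import log

def build_hal(words, window=5):
    """Left-context HAL vectors: a word at distance d inside the
    window contributes a weight of (window - d + 1)."""
    left = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(words):
        for d in range(1, window + 1):
            if i - d < 0:
                break
            left[target][words[i - d]] += window - d + 1
    return left

def reweight(left):
    """TF-IDF-like reweighting (in the style of (2)), then per-vector
    normalization into a probability distribution (in the style of (3))."""
    df = defaultdict(int)  # number of vectors each context word appears in
    for vec in left.values():
        for ctx in vec:
            df[ctx] += 1
    m = len(left)  # total number of vectors
    for vec in left.values():
        for ctx in vec:
            vec[ctx] *= log(m / df[ctx])  # a word in every vector gets weight 0
        total = sum(vec.values())
        for ctx in vec:
            vec[ctx] = vec[ctx] / total if total else 0.0
    return left

words = "when i was young i lost my parents".split()
hal = build_hal(words)
space = reweight(build_hal(words))
```

Each row of `space` is then a left-context distribution, ready for the probabilistic distance measures used later by the EIA.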
Fig. 6. Creation of the initial population (VC and VH denote verbs; Na and Nd denote nouns, and Dc denotes an adverb).
IV. EVOLUTIONARY INFERENCE ALGORITHM (EIA)

Given a seed pattern, the EIA induces a set of relevant patterns with different lengths. Let sp = {c_1, c_2, ..., c_m} denote the given seed pattern with m constituent concepts, and p = {c_1, c_2, ..., c_n} denote an arbitrary pattern with n constituent concepts. The formal description of the EIA is given as

EI(sp, p) = D(⊕(sp), ⊕(p)) = D(c_sp, c_p)    (4)

where EI denotes the evolutionary inference; ⊕ denotes the combination operator over the HAL space; c_sp and c_p denote the two new concepts generated by concept combination, representing sp and p, respectively; and D denotes the distance between two patterns, which is also adopted for fitness evaluation. The main steps of the EIA are briefly described as follows.
• Individual representation: HAL-based vector representation of individuals. Here, an individual is a pattern.
• Initial population: An initial set of patterns to be induced.
• Fitness function: A distance function adopted to calculate the distance between seed and candidate patterns.
• Variation operators: Crossover and mutation.
• Relevance feedback: A method for providing expert knowledge to guide the induction process.

A. Individual Representation

An individual in a classical EA is a potential solution to a particular problem. The potential solutions are the patterns in this task. Therefore, an individual represents a pattern in the HAL space. A pattern comprises a set of concepts, and is thus represented by combining concepts over the HAL space. This process forms a new concept in the HAL space. Let p = {c_1, c_2, ..., c_n} denote a pattern of length n. The concept combination is defined as

c_p = c_1 ⊕ c_2 ⊕ ... ⊕ c_n    (5)

where c_p denotes the new concept created by the concept combination. The new concept is the representation of a pattern, and is also a vector representation in the HAL space. That is

c_p = {c_p^l, c_p^r}    (6)

The combination operator, ⊕, is implemented by the product of the weights of the constituent concepts, described as follows:

w_pj = ∏_{i=1}^{n} w_ij    (7)

where w_pj denotes the weight of the jth dimension of c_p.
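As a small illustration of the combination operator, the product of weights in (7) can be sketched as follows. The helper name `combine` and the toy vectors are hypothetical; each concept's context vector is assumed to be held as a word-to-weight dict.

```python
from math import prod

def combine(concepts):
    """Concept combination (the ⊕ operator): the pattern's weight on each
    dimension is the product of its constituent concepts' weights, so only
    dimensions shared by all constituents survive."""
    shared = set.intersection(*(set(c) for c in concepts))
    return {dim: prod(c[dim] for c in concepts) for dim in shared}

# e.g., combining the left-context vectors of two toy concepts
c_lose = {"family": 0.5, "job": 0.3, "money": 0.2}
c_parents = {"family": 0.6, "old": 0.4}
pattern_vec = combine([c_lose, c_parents])  # only "family" is shared
```

The product acts as a soft intersection of contexts: a dimension keeps weight only if every constituent concept supports it.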
B. Initial Population

To induce variable-length patterns, the EIA first creates the initial population for each pattern length. A pattern of length n is created by selecting n concepts from the vocabulary. These concepts are then combined by the method described in the previous section. The combined concepts form a new concept to represent the pattern. Fig. 6 shows the creation process. The EIA considers the syntactic categories of concepts when selecting them from the vocabulary. Analysis of pattern structures indicates that nouns and verbs are the most common syntactic categories constituting a pattern. This observation also agrees with linguistic evidence that nouns and verbs generally carry more information than other syntactic categories [32]. Accordingly, this study stipulates that a pattern of length greater than two must contain at least one noun and one verb. The other concepts are selected randomly from the vocabulary.

C. Fitness Function

The fitness function calculates the distance between seed patterns and patterns to be induced. A smaller distance indicates
a fitter pattern. Let p denote a pattern, and sp denote a given seed pattern. The fitness of the pattern is defined as

fitness(p) = D(c_p, c_sp)    (8)

where D denotes the distance between two patterns. As mentioned earlier, concept combination means that a pattern becomes a new concept in the HAL space, and can thus be represented by its left and right contexts. The distance between two patterns can thus be calculated through their context distance. Equation (8) can thus be rewritten as

fitness(p) = (1/2) [ D(c_p^l, c_sp^l) + D(c_p^r, c_sp^r) ]    (9)

Since the weights of the vectors are represented in a probabilistic framework, each vector of a concept can be considered as a probabilistic distribution of the context words. Accordingly, the Kullback–Leibler (KL) distance [33] is adopted to calculate the distance between two probabilistic distributions, as shown in the following:

KL(x || y) = Σ_j x_j log(x_j / y_j)    (10)

where KL(x || y) denotes the KL distance between the two probabilistic distributions x and y. The following divergence measure is adopted in the case of a symmetric distance:

KL_div(x, y) = (1/2) [ KL(x || y) + KL(y || x) ]    (11)

In this way, the distance between two probabilistic distributions can be calculated from their KL divergence, KL_div. Thus, (9) becomes

fitness(p) = (1/2) [ KL_div(c_p^l, c_sp^l) + KL_div(c_p^r, c_sp^r) ]    (12)

D. Variation Operators

The variation operators aim to introduce new patterns into the search space by making changes to the patterns of the previous generation. The variation operators used in the EIA include crossover and mutation. Crossover is accomplished by exchanging the constituent concepts of two patterns. Mutation replaces a constituent concept of a pattern with a new concept. A number of patterns, i.e., the parents, have to be selected before applying the variation operators. The selection processes for crossover and mutation are performed by different criteria. Good patterns, i.e., those with smaller distances, are preferred for crossover. Therefore, the parents in a crossover operation are chosen in inverse proportion to their distance values. The worst patterns are preferred in mutation, since changing the worst patterns provides the best chance to improve their offspring. The processes of crossover and mutation are described as follows.
• Crossover: Crossover operates on two selected patterns and takes place with a predefined probability. A crossover point
is selected, dividing each of the parents into two parts. The parents are recombined to produce new offspring by combining the first part of one parent with the second part of the other parent.
• Mutation: Mutation operates on a selected pattern, and also takes place with a predefined probability. A concept in the pattern is replaced with a new concept selected randomly from the vocabulary.
The structure of patterns must be preserved during the crossover and mutation process. In other words, the variation operators must ensure that each of the offspring has at least one noun and one verb. The next step after applying the variation operators is to generate the populations for the next generation. The major consideration of the population update strategy is to preserve the good patterns from the previous generation and transfer them to the next, in order to improve the quality of the populations. For this reason, the offspring are first evaluated in terms of the fitness function. Then, only the parents inferior to the offspring are replaced. The population size remains the same throughout the induction process.

E. Relevance Feedback

In the evolution process, some nonrelevant patterns may have a small distance to a given seed pattern, which may move the evolutionary process further in the wrong direction. One possible solution to this problem is to incorporate expert knowledge to guide the search process. For this purpose, this study adopts relevance feedback, which is the most popular query reformulation strategy in the information retrieval (IR) community. The relevance feedback herein is applied after the new populations have been generated in each generation. Relevance judgment is traditionally performed by human experts. However, human relevance feedback is time-consuming. Additionally, experts cannot reasonably be involved in the whole evolutionary process. Therefore, this study adopts a semiautomatic approach, called lookup table (LUT) relevance feedback.
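The fitness evaluation (symmetric KL distance) and the structure-preserving variation operators above can be sketched as follows. This is a simplified sketch: the toy POS lexicon, the smoothing constant for unseen context words, and the function names are illustrative assumptions, not from the paper.

```python
import random
from math import log

EPS = 1e-12  # smoothing so the KL distance is defined for unseen context words

def kl(p, q):
    """KL distance between two context distributions held as dicts."""
    dims = set(p) | set(q)
    return sum(p.get(d, EPS) * log(p.get(d, EPS) / q.get(d, EPS)) for d in dims)

def fitness(pattern_ctx, seed_ctx):
    """Symmetric KL divergence averaged over the (left, right) context
    distributions of a pattern and the seed; smaller is fitter."""
    sym = lambda a, b: 0.5 * (kl(a, b) + kl(b, a))
    return 0.5 * (sym(pattern_ctx[0], seed_ctx[0]) + sym(pattern_ctx[1], seed_ctx[1]))

POS = {"lose": "V", "argue": "V", "parents": "N", "spouse": "N", "job": "N"}  # toy lexicon

def valid(pattern):
    """Patterns longer than two words must keep at least one noun and one verb."""
    tags = {POS.get(w) for w in pattern}
    return len(pattern) == 2 or {"N", "V"} <= tags

def crossover(p1, p2):
    """Exchange the tails of two parent patterns at a random crossover point."""
    cut = random.randint(1, min(len(p1), len(p2)) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(pattern, vocab):
    """Replace one randomly chosen concept with a random vocabulary word."""
    i = random.randrange(len(pattern))
    return pattern[:i] + [random.choice(vocab)] + pattern[i + 1:]
```

In the EIA itself, offspring produced by `crossover` or `mutate` would be kept only if `valid` holds and their fitness beats that of an inferior parent.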
This approach manually constructs a LUT containing a set of reference patterns, and consults the LUT in each EIA generation to automate the relevance feedback. The process is described in detail as follows. First, health professionals annotate a small number of training sentences with negative life events, and extract the patterns embedded in these sentences. These manually extracted patterns are considered as the reference patterns, and form a LUT. For each EIA generation, relevance judgment is performed by matching the patterns in the population against those in the LUT. The matched patterns in the population are considered relevant patterns, and the unmatched patterns are nonrelevant patterns. According to the relevant and nonrelevant information, the seed pattern can be refined to improve its similarity to the relevant set, such that the EIA can induce more relevant patterns and move away from noisy patterns in future generations. The seed pattern is refined by adjusting its context distributions (left and right). This adjustment is performed by reweighting the dimensions of the context vectors of the seed pattern. The dimensions that occur more frequently in the relevant patterns are more significant for identifying the relevant patterns,
and therefore should be emphasized. The significance of a dimension is measured as follows:

sig(t_j) = Σ_{r ∈ Rel} w_rj / Σ_{n ∈ NRel} w_nj    (13)

where sig(t_j) denotes the significance of the dimension t_j; Rel and NRel denote the patterns of the relevant and nonrelevant sets, respectively; and w_rj and w_nj denote the weights of t_j in patterns r and n, respectively. A higher ratio indicates a more significant dimension. To normalize sig(t_j) to range from 0 to 1, (13) can be written as

sig(t_j) = Σ_{r ∈ Rel} w_rj / ( Σ_{r ∈ Rel} w_rj + Σ_{n ∈ NRel} w_nj )    (14)

The dimension t_j of the seed pattern is then reweighted by

w_{sp,j} = w_{sp,j} x sig(t_j)    (15)

Once the context vectors of the seed pattern are reweighted, they are also transformed into probabilistic form using (3). The refined seed pattern is taken as the reference basis in the next generation. An EIA generation is completed after the relevance feedback. The whole EIA process stops when the seed patterns are exhausted; for each seed pattern, the EIA stops when the maximum number of generations is reached.

V. EXPERIMENTAL RESULTS

To evaluate the performance of the EIA, a prototype system was constructed, and a set of seed patterns was provided. The reference negative life events needed to be selected to create the seed patterns. Three experienced psychiatrists performed the selection process based on recent investigations of negative life events [1], [2]. The three physicians selected negative life events that frequently occur in their clinical interviews as the reference negative life events. Finally, a total of 20 seed patterns were created to form the seed set, as shown in Table II. The EIA then randomly selected one seed pattern without replacement from the seed set, and iteratively induced the relevant patterns from the psychiatry web corpora. The psychiatry web corpora used in this experiment include some professional mental health web sites, such as PsychPark (http://www.psychpark.org) [34] and the John Tung Foundation (http://www.jtf.org.tw). A total of 5000 consultation records were collected as the experimental corpora, and 300 of these were randomly selected to build a LUT for relevance feedback. The three physicians annotated each sentence in the 300 records as describing a normal or negative life event. The patterns were then manually extracted to form the LUT. In the following sections, the parameter settings of the EIA and the effect of using relevance feedback are first examined. The EIA is then compared with a corpus-based approach, namely, association pattern mining.

TABLE II. Seed patterns.

A. Parameter Setting
The parameters of the EIA include the HAL window size and the EIA parameters, namely:
• PS: population size;
• p_c: crossover rate;
• p_m: mutation rate;
• n: length of a pattern.
Since patterns with a length larger than 4 are very rarely used to express a negative life event, the length was limited to the range 2–4.
1) HAL Window Size: The HAL model was adopted to build a semantic space by applying an observation window over the whole corpus. The window size determines the number of context words of a concept that can be observed. If the window size is too small, then some important context information may be lost. Conversely, a large window may cause noisy information to be introduced. Hence, an appropriate setting of the window size can provide an informative infrastructure for evolutionary inference, and thereby yield a good population. In this experiment, the optimal setting of the window size for the EIA was studied. The metric mean average fitness (MAF) was adopted to evaluate the goodness of a population over all seed patterns. For a particular seed pattern, the average fitness of a population is the average fitness of the patterns in the final population produced by the EIA. The MAF is then the mean of the average fitness over all of the final populations. In the configuration of EIA parameters used in this experiment, the maximum number of generations was set to 500, with fixed settings of the population size PS, crossover rate p_c, and mutation rate p_m. Fig. 7 shows the MAF of the populations with different lengths, against different settings of the window size. The best setting for the HAL window size is 8, as indicated in Fig. 7. Therefore, this setting was adopted for the following experiments. The EIA was found to achieve better results for the populations with smaller lengths, possibly because the number of possible patterns increases with rising length. A large number of possible patterns may lead to a highly complex population.
Therefore, the EIA cannot achieve good results within a reasonable number of generations for populations with high complexity when using the same parameter settings for all populations.
2) EIA Parameters: The EIA parameters studied in this experiment were PS, p_c, and p_m. The population size controls
TABLE III EIA PARAMETER SETTINGS FOR DIFFERENT LENGTHS
TABLE IV RESULTS WITH AND WITHOUT RELEVANCE FEEDBACK
Fig. 7. Results of different HAL window size settings.
Fig. 8. Results of different EIA parameter settings.

the number of patterns that can be measured in each generation. Therefore, raising the population size would be useful for high-complexity populations, since it enables them to accommodate more varieties of possible patterns. The variation operators determine the number of new patterns, i.e., the offspring, to be introduced into a population. Thus, larger populations generally require higher rates of variation operators to speed up the evolutionary process. Fig. 8 presents the results of using different parameter settings. When the same rates of variation operators were used, the evolution slowed down as the population size rose. This finding indicates that the rates of variation operators should be increased in correspondence with the rise in population size. As indicated in Fig. 8, increasing the rates of variation operators indeed speeds up the evolutionary process. Therefore, the following experiments used larger parameter values for higher complexity populations. Table III shows the EIA parameter settings used to induce the patterns of different lengths. Considering that increasing the population size also increases computation costs, the population size was increased only slightly.

B. Effect of Relevance Feedback
This study adopted the LUT relevance feedback to guide the induction process. To evaluate the effectiveness of the relevance feedback, three variants of the EIA, RF(100), RF(200), and RF(300), were developed by building the LUT from 100, 200, and 300 consultation records, respectively. The three EIA variants were compared to the variant without relevance feedback, denoted as RF(—). The evaluation metric adopted herein is precision at 30 (prec@30), where prec@k is calculated as the number of relevant patterns ranked in the top k patterns of a population, divided by k. Additionally, a paired, two-tailed t-test was adopted to determine whether the performance difference among these strategies was statistically significant. Each of the four EIA variants was rerun with random starts in 30 experiments, and each experiment was stopped at the 500th generation. The top 30 patterns of the population produced at the 500th generation were then presented to the three physicians for relevance judgment. The physicians handled disagreements by majority voting. In other words, a pattern was considered relevant if two or more physicians judged it as relevant. Finally, the patterns with majority votes were obtained, and the prec@30 values of the EIA variants were calculated from them. Table IV shows their average prec@30 values over the 30 experiments. The results indicate that all three EIA variants with relevance feedback achieved significantly higher precision than the method without relevance feedback; RF(100), RF(200), and RF(300) improved on RF(—) by 18.3%, 35.4%, and 49.4%, respectively. Additionally, the precision rose significantly as more consultation records were used to construct the LUT. The main reason is that increasing the number of consultation records also increases the number of reference patterns in the LUT, increasing the amount of relevant information that can be provided during the evolutionary process.
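The LUT-driven seed refinement evaluated above can be sketched as follows. This is a minimal sketch assuming a reconstructed form of the significance weighting (relevant weight mass over total weight mass per dimension); `refine_seed`, the toy vectors, and the epsilon guard are illustrative assumptions.

```python
def refine_seed(seed_vec, relevant, nonrelevant, eps=1e-9):
    """Reweight one context vector of the seed pattern toward the relevant set:
    each dimension is scaled by the share of its weight mass contributed by
    relevant patterns, then the vector is renormalized to a distribution."""
    refined = {}
    for dim, w in seed_vec.items():
        rel = sum(p.get(dim, 0.0) for p in relevant)
        nrel = sum(p.get(dim, 0.0) for p in nonrelevant)
        sig = rel / (rel + nrel + eps)  # significance normalized to [0, 1]
        refined[dim] = w * sig
    total = sum(refined.values())
    return {d: w / total for d, w in refined.items()} if total else dict(seed_vec)

seed = {"family": 0.5, "flower": 0.5}
refined = refine_seed(seed, relevant=[{"family": 1.0}], nonrelevant=[{"flower": 1.0}])
```

Dimensions supported only by nonrelevant patterns are suppressed, so the refined seed drifts toward the contexts of the patterns the LUT judged relevant.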
To study the effect of errors in the evolutionary process without relevance feedback, the worst cases (with the lowest precision) of RF(—) and RF(100) over the 30 experiments were selected for comparison. Fig. 9 shows the comparison results. For the worst case of RF(—), the precision rose in the early stage of the evolutionary process, but fell after the 70th generation. This finding indicates that, without the help of relevance feedback, the EIA cannot guarantee that the relevant patterns discovered in the earlier stages are kept in the later stages. Instead, irrelevant patterns with good fitness values were retained, and eventually occupied most of the top 30 ranks of the populations. With relevance feedback incorporated, the precision of RF(100) rose gradually throughout the evolutionary process. This finding reveals that the discovered relevant patterns can be retained and then exploited to refine the seed patterns, enabling more relevant patterns to be discovered in later stages. Table V shows example patterns induced by the EIA with the seed pattern as input.

Fig. 9. Error analysis of the EIA without relevance feedback.

TABLE V
EXAMPLE PATTERNS INDUCED BY THE EIA

C. Comparative Evaluation

This section first briefly describes the baseline system, namely, association pattern mining, and then compares it with the EIA to examine the performance difference between a corpus-based approach and an evolutionary computation approach to pattern induction.

1) Association Pattern Mining: The problem of pattern induction can be converted into the problem of association pattern
mining, where each sales transaction in a database corresponds to a sentence in the corpora, and each item in a transaction corresponds to a word in a sentence. An association pattern is defined herein as a combination of multiple associated words, denoted by (w_1, w_2, ..., w_k). Thus, the task of association pattern mining is to mine the association patterns of frequently associated words from the training sentences. For this purpose, this experiment adopted the Apriori algorithm [9], [10], [19], [20] and modified it slightly to fit the application. Its basic concept is to identify frequent word sets recursively, and then generate association patterns from the frequent word sets. The detailed procedure is described as follows.
• Find frequent word sets: A word set is frequent if it possesses a minimum support. The support of a word set is defined as the number of training sentences containing the word set. For instance, the support of a two-word set (w_1, w_2) denotes the number of training sentences containing the word pair (w_1, w_2). The frequent k-word sets L_k are discovered from the frequent (k−1)-word sets L_{k−1}. First, the support of each word, i.e., its word frequency, in the training corpus is counted. The set of frequent one-word sets, denoted as L_1, is then generated by choosing the words with a minimum support level. To calculate L_k, the following two-step process is performed iteratively until no more frequent k-word sets are found.
• Join step: A set of candidate k-word sets, denoted as C_k, is first generated by merging frequent word sets of L_{k−1}, in which only the word sets whose first k−2 words are identical can be merged.
• Prune step: The support of each candidate word set in C_k is then counted to determine which candidate word sets are frequent. The candidate word sets with a support count greater than or equal to the minimum support (a predefined threshold) are considered to form L_k; candidate word sets with a subset that is not frequent are eliminated. Fig. 10 (left-hand side) shows an example of generating the frequent word sets.
• Generate association patterns from frequent word sets: Association patterns can be generated via a confidence measure once the frequent word sets have been identified. The confidence of an association pattern of k words is defined as the mutual information (MI) [32] of the words, as shown next:

MI(w_1, ..., w_k) = log [ P(w_1, ..., w_k) / ( P(w_1) × ... × P(w_k) ) ]    (16)

where P(w_1, ..., w_k) denotes the probability of the k words co-occurring in a sentence in the training set, and P(w_i) denotes the probability of the single word w_i occurring in the training set. Accordingly, for every frequent word set in L_k, an association pattern is generated if the mutual information of its words is greater than or equal to a minimum confidence. The resulting association patterns are thus those meeting the minimum confidence. Fig. 10 (right-hand side) shows an example of generating the association patterns from the frequent word sets.
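The join/prune procedure and the MI-based confidence filter described above can be sketched in a few lines. This is an illustrative reimplementation of standard Apriori combined with the MI criterion of (16), not the authors' exact code, and the toy sentences are invented:

```python
import math
from itertools import combinations

def apriori_patterns(sentences, min_support, min_confidence):
    """Find frequent word sets by iterative join/prune, then keep the
    multi-word sets whose mutual information meets a minimum confidence."""
    sents = [set(s) for s in sentences]
    n = len(sents)

    def support(ws):
        # Number of training sentences containing the whole word set.
        return sum(1 for s in sents if ws <= s)

    # L1: frequent one-word sets.
    vocab = {w for s in sents for w in s}
    frequent = {frozenset([w]) for w in vocab if support({w}) >= min_support}
    all_frequent, k = set(), 1
    while frequent:
        all_frequent |= frequent
        # Join step: merge frequent k-word sets into (k+1)-word candidates.
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == k + 1}
        # Prune step: every k-subset must itself be frequent, and the
        # candidate's own support must reach the minimum support.
        frequent = {c for c in candidates
                    if all(frozenset(sub) in frequent
                           for sub in combinations(c, k))
                    and support(c) >= min_support}
        k += 1

    def mi(ws):
        # Confidence measure of (16): log of joint over product of marginals.
        joint = support(ws) / n
        indep = math.prod(support({w}) / n for w in ws)
        return math.log(joint / indep)

    return [set(ws) for ws in all_frequent
            if len(ws) > 1 and mi(ws) >= min_confidence]

sents = [["argue", "with", "spouse"], ["argue", "with", "boss"],
         ["lose", "job"], ["argue", "with", "spouse"]]
pats = apriori_patterns(sents, min_support=2, min_confidence=0.1)
```

On this toy corpus the three-word set {"argue", "with", "spouse"} survives both the support and the MI thresholds.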
Fig. 10. Example of association pattern generation.

TABLE VI
COMPARISON OF OOP RATES BETWEEN EIA WITH AND WITHOUT RELEVANCE FEEDBACK AND APRIORI ALGORITHM
2) Comparative Results: Since the Apriori algorithm requires a set of annotated sentences to generate association patterns, this experiment adopted the 300 annotated consultation records as the training set. The EIA was implemented with and without relevance feedback, denoted by EIA_RF and EIA, respectively, where the LUT used in EIA_RF was constructed from the 300 consultation records. Both EIA_RF and EIA were run 30 times, and the resulting patterns were the union of those induced in each experiment. Additionally, the worst experiments of EIA_RF and EIA were chosen to further study the benefit of multiple EIA experiments. The patterns induced by EIA_RF, EIA, and Apriori were evaluated on their coverage of real data provided by 15 human subjects. The subjects described their experienced negative life events in natural language sentences; a total of 69 sentences were gathered as the test set. The evaluation metric adopted in this experiment was the out-of-pattern (OOP) rate, the proportion of test sentences containing unseen patterns, calculated as the number of test sentences with a pattern not occurring in
the set of discovered patterns, divided by the total number of test sentences. Additionally, a sign test for pairwise comparison was adopted to determine whether the performance differences were statistically significant. Table VI shows the OOP rates of the EIA-based approaches and the Apriori algorithm, where the rows denoted by *_Multiple represent the OOP rates of the EIA obtained after 30 experiments, and the rows denoted by *_Worst represent the worst OOP rates over the 30 experiments. The observations from Table VI are summarized as follows.
• EIA-based approaches performed better than the Apriori algorithm: The OOP rates of both EIA_* and EIA_RF_* were lower than that of Apriori, except for EIA_Worst. As indicated in Table VI, 42 of the 69 test sentences had a pattern not discovered by Apriori, i.e., an OOP rate of 60.9%. EIA_Multiple reduced this OOP rate by 31.0%, and EIA_RF_Multiple by 54.8%; both reductions were statistically significant.
• Using relevance feedback was better than not using it: The OOP rate of EIA_Worst was the
worst among all methods, since without relevance feedback the EIA may move in the wrong direction, resulting in almost no relevant patterns being discovered. Incorporating relevance feedback significantly reduced the OOP rate: EIA_RF_Multiple reduced the OOP rate by 34.5% relative to EIA_Multiple.
• Performing multiple EIA experiments enhanced the performance: For the EIA without relevance feedback, EIA_Multiple significantly reduced the OOP rate by 56.7% relative to EIA_Worst. Similarly, for the EIA with relevance feedback, EIA_RF_Multiple significantly reduced the OOP rate by 32.3% relative to EIA_RF_Worst. Additionally, the performance of EIA_Multiple and EIA_RF_Worst was similar. These findings indicate that multiple EIA experiments without relevance feedback perform comparably to a single experiment with relevance feedback.
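The OOP metric used in this comparison can be computed as follows (a minimal sketch; the containment criterion, all of a pattern's words occurring in the sentence, is an assumption for illustration, as is the toy data):

```python
def oop_rate(test_sentences, discovered_patterns):
    """Out-of-pattern (OOP) rate: fraction of test sentences whose
    pattern is not among the discovered patterns.  A sentence is
    'covered' here if some discovered pattern's words all occur in it."""
    def covered(sentence):
        words = set(sentence.split())
        return any(set(p) <= words for p in discovered_patterns)
    misses = sum(1 for s in test_sentences if not covered(s))
    return misses / len(test_sentences)

# Toy example: two discovered patterns, three test sentences.
patterns = [{"argue", "spouse"}, {"lose", "job"}]
tests = ["argue with my spouse", "lose my job", "death of family member"]
rate = oop_rate(tests, patterns)  # one of the three sentences is uncovered
```

A lower OOP rate means the discovered pattern set generalizes better to unseen descriptions of negative life events.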
VI. DISCUSSION

This study presents an EIA that needs only a small amount of annotated data for pattern induction. Experimental results indicate that the EIA is more effective than a corpus-based approach, namely, association pattern mining. The conclusions and discussion are summarized as follows.
1) Incorporation of the HAL Model: The HAL model records the context information of words in the psychiatry web corpora, and provides a combination method for individual words. Based on the HAL model, the EIA can induce patterns from the large, unannotated corpora. By contrast, the Apriori algorithm adopts the criterion of mutual information to discover associated words in annotated sentences, which means that patterns can only be induced from limited training data. Hence, many useful patterns cannot be identified due to the data sparseness problem.
2) Use of Relevance Feedback: The EIA adopts a LUT that provides a set of reference patterns to guide the induction process. Relevance feedback enables the EIA not only to ensure that the number of discovered patterns keeps growing during the evolutionary process, but also to discover relevant patterns earlier.
3) Multiple Experiments: The performance of the EIA can be improved by running multiple experiments and then unifying the patterns induced in each experiment, since the unification process increases both the number and the diversity of discovered patterns. The cost of this improvement is the CPU time needed for the experiments. The performance of the Apriori algorithm can also be improved by increasing the amount of annotated data; the cost is the annotation time needed by human experts. In our analysis, each EIA experiment took 95 min of CPU time on average, while the manual annotation of 300 consultation records by a health professional would take seven working days. Although the annotation time may vary across application domains, manual annotation is undoubtedly a labor-intensive process. Hence, multiple experiments can reduce the reliance on manual annotation and still improve the performance.

As the HAL model provides a semantic space for the representation of words, Latent Semantic Analysis (LSA) [35]–[37] is an alternative approach. LSA provides its semantic space through a term-by-document matrix, where each cell denotes the frequency of a word occurring in a document. In this representation, any two words within a document are considered as co-occurring with each other, ignoring the syntactic properties of words, such as their location and distance. This study adopts the HAL model mainly because it can capture such syntactic properties through its observation window and weighting scheme: any two words within a window are considered as co-occurring, and two words occurring close together receive a high dimension weight. The difference between HAL and LSA in co-occurrence computation can be minimized by enlarging the HAL window to the document boundary. However, as revealed in Fig. 7, the best window size was found to be 8, with performance decreasing when the window size exceeded 8. This finding indicates that enlarging the window does not improve the estimation of semantic similarity between words; the syntactic properties captured by the HAL model are therefore significant features in determining semantic similarity between words.
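The HAL construction discussed above, a sliding observation window with distance-decaying weights, can be sketched as follows (using a common formulation of the HAL weighting, window − distance + 1; for brevity only the preceding context is recorded, whereas full HAL also tracks following words as separate dimensions):

```python
from collections import defaultdict

def build_hal(words, window):
    """Sketch of HAL co-occurrence: for each word, every word within
    `window` positions before it contributes a weight that decreases
    with distance (window - distance + 1), so closer neighbors weigh more."""
    hal = defaultdict(float)
    for i, target in enumerate(words):
        for d in range(1, window + 1):
            j = i - d
            if j < 0:
                break
            hal[(target, words[j])] += window - d + 1
    return dict(hal)

m = build_hal(["death", "of", "family", "member"], window=2)
# Adjacent pair ("of", "death") gets weight 2; ("family", "death"),
# two positions apart, gets weight 1.
```

Enlarging `window` toward the document length makes this matrix approach LSA's document-level co-occurrence counts, which is the convergence noted above.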
VII. CONCLUSION

This study has presented an evolutionary text-mining framework that integrates the HAL model and the EIA for variable-length pattern induction. The HAL model provides an informative infrastructure for representing patterns in a high-dimensional context space. Based on the HAL space, the EIA bootstraps with a set of seed patterns, and then induces more relevant patterns from the unannotated psychiatry web corpora. Expert knowledge is incorporated during the induction process by adopting the LUT relevance feedback to guide the evolutionary search. Additionally, running multiple EIA experiments further improves performance. The proposed evolutionary framework needs only a small amount of annotated data, thus reducing the reliance on the large annotated corpora required by corpus-based approaches. Future work will be devoted to detecting negative life events using the induced patterns, thus improving psychiatric services.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers and the associate editor for their constructive comments.

REFERENCES

[1] E. M. Brostedt and N. L. Pedersen, “Stressful life events and affective illness,” Acta Psychiatrica Scandinavica, vol. 107, no. 3, pp. 208–215, 2003.
[2] M. E. Pagano, A. E. Skodol, R. L. Stout, M. T. Shea, S. Yen, C. M. Grilo, C. A. Sanislow, D. S. Bender, T. H. McGlashan, M. C. Zanarini, and J. G. Gunderson, “Stressful life events as predictors of functioning: Findings from the collaborative longitudinal personality disorders study,” Acta Psychiatrica Scandinavica, vol. 110, pp. 421–429, 2004.
[3] C. H. Wu, L. C. Yu, and F. L. Jang, “Using semantic dependencies to mine depressive symptoms from consultation records,” IEEE Intell. Syst., vol. 20, no. 6, pp. 50–58, Nov.–Dec. 2005.
[4] L. C. Yu, C. H. Wu, and F. L. Jang, “Psychiatric consultation record retrieval using scenario-based representation and multilevel mixture model,” IEEE Trans. Inf.
Technol. Biomed., 2006, pre-posted on IEEE Xplore.
[5] W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, E. Riloff, and S. Soderland, “University of Massachusetts: Description of the CIRCUS system used for MUC-4,” in Proc. 4th Message Understanding Conf. (MUC-4), 1992, pp. 282–288.
[6] J. T. Kim and D. I. Moldovan, “Acquisition of linguistic patterns for knowledge-based information extraction,” IEEE Trans. Knowl. Data Eng., vol. 7, no. 5, pp. 713–724, Oct. 1995.
[7] S. Soderland, “Learning information extraction rules for semi-structured and free text,” Mach. Learn., vol. 34, no. 1–3, pp. 233–272, 1999.
[8] I. Muslea, “Extraction patterns for information extraction tasks: A survey,” in Proc. AAAI Workshop on Mach. Learn. Inf. Extraction, 1999, pp. 1–6.
[9] C. H. Wu, Z. J. Chuang, and Y. C. Lin, “Emotion recognition from text using semantic labels and separable mixture models,” ACM Trans. Asian Language Inf. Process., vol. 5, no. 2, pp. 165–182, 2006.
[10] J. T. Chien, “Association pattern language modeling,” IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 5, pp. 1719–1728, Sep. 2006.
[11] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
[12] H. Rodríguez, S. Climent, P. Vossen, L. Bloksma, W. Peters, A. Alonge, F. Bertagna, and A. Roventini, “The top-down strategy for building EuroWordNet: Vocabulary coverage, base concepts and top ontology,” Comput. Humanities, vol. 32, pp. 117–159, 1998.
[13] J. F. Yeh, C. H. Wu, M. J. Chen, and L. C. Yu, “Automated alignment and extraction of bilingual domain ontology for cross-language domain-specific applications,” in Proc. 20th Int. Conf. Computational Linguistics (COLING’04), 2004, pp. 1140–1146.
[14] R. Navigli and P. Velardi, “Structural semantic interconnections: A knowledge-based approach to word sense disambiguation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 7, pp. 1075–1086, Jul. 2005.
[15] R. Navigli, P. Velardi, and A. Gangemi, “Ontology learning and its application to automated terminology translation,” IEEE Intell. Syst., vol. 18, no. 1, pp. 22–31, Jan.–Feb. 2003.
[16] R. Stevens, C. Goble, I. Horrocks, and S. Bechhofer, “Building a bioinformatics ontology using OIL,” IEEE Trans. Inf. Technol. Biomed., vol. 6, no. 2, pp. 135–141, Jun. 2002.
[17] C. H. Wu, J. F. Yeh, and Y. S. Lai, “Semantic segment extraction and matching for internet FAQ retrieval,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 7, pp. 930–940, Jul. 2006.
[18] G. Leroy and H. Chen, “Meeting medical terminology needs: The ontology-enhanced medical concept mapper,” IEEE Trans. Inf. Technol. Biomed., vol. 5, no. 4, pp. 261–270, Dec. 2001.
[19] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proc. Int. Conf. Very Large Data Bases (VLDB), 1994, pp. 487–499.
[20] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann, 2001.
[21] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs. New York: Springer-Verlag, 1996.
[22] W. H. Au, K. C. C. Chan, and X. Yao, “A novel evolutionary data mining algorithm with applications to churn prediction,” IEEE Trans. Evol. Comput., vol. 7, no. 6, pp. 532–545, Dec. 2003.
[23] J. Atkinson-Abutridy, C. Mellish, and S. Aitken, “A semantically guided and domain-independent evolutionary model for knowledge discovery from texts,” IEEE Trans. Evol. Comput., vol. 7, no. 6, pp. 546–560, Dec. 2003.
[24] L. Araujo, “Symbiosis of evolutionary techniques and statistical natural language processing,” IEEE Trans. Evol. Comput., vol. 8, no. 1, pp. 14–27, Feb. 2004.
[25] I. Kushchu, “Web-based evolutionary and adaptive information retrieval,” IEEE Trans. Evol. Comput., vol. 9, no. 2, pp. 117–125, Apr. 2005.
[26] C. Burgess, K. Livesay, and K. Lund, “Explorations in context space: Words, sentences, discourse,” Discourse Processes, vol. 25, no. 2&3, pp. 211–257, 1998.
[27] D. Song and P. D. Bruza, “Towards context sensitive information inference,” J. Amer. Soc. Inf. Sci. Technol., vol. 54, no. 4, pp. 321–334, 2003.
[28] J. Bai, D. Song, P. Bruza, J. Y. Nie, and G. Cao, “Query expansion using term relationships in language models for information retrieval,” in Proc. 14th ACM Int. Conf. Inf. Knowl. Manage. (CIKM’05), 2005, pp. 688–695.
[29] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Reading, MA: Addison-Wesley, 1999.
[30] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, 1988.
[31] Y. C. Chang and S. M. Chen, “A new query reweighting method for document retrieval based on genetic algorithms,” IEEE Trans. Evol. Comput., vol. 10, no. 5, pp. 617–622, Oct. 2006.
[32] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
[33] S. Kullback, Information Theory and Statistics. New York: Wiley, 1959.
[34] Y. M. Bai, C. C. Lin, J. Y. Chen, and W. C. Liu, “Virtual psychiatric clinics,” Amer. J. Psychiatry, vol. 158, no. 7, pp. 1160–1161, 2001.
[35] T. K. Landauer, P. W. Foltz, and D. Laham, “An introduction to latent semantic analysis,” Discourse Processes, vol. 25, no. 2&3, pp. 259–284, 1998.
[36] M. B. W. Wolfe and S. R. Goldman, “Use of latent semantic analysis for predicting psychological phenomena: Two issues and proposed solutions,” Behavior Research Methods, vol. 35, no. 1, pp. 22–31, 2003.
[37] S. Osinski and D. Weiss, “A concept-driven algorithm for clustering search results,” IEEE Intell. Syst., vol. 20, no. 3, pp. 48–54, May–Jun. 2005.

Liang-Chih Yu received the B.S. degree in computer science and engineering from Tatung University, Taipei, Taiwan, R.O.C., in 1998, and the M.S. degree in engineering science from the National Cheng Kung University, Tainan, Taiwan, in 2000. He is currently working towards the Ph.D. degree in the Department of Computer Science and Information Engineering, National Cheng Kung University. His research interests include natural language processing, information retrieval, text mining, and ontology construction.
Chung-Hsien Wu (M’94–SM’03) received the B.S. degree in electronics engineering from the National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., in 1981, and the M.S. and Ph.D. degrees in electrical engineering from the National Cheng Kung University (NCKU), Tainan, Taiwan, in 1987 and 1991, respectively. Since August 1991, he has been with the Department of Computer Science and Information Engineering, NCKU. He became a Professor in August 1997. From 1999 to 2002, he served as the Chairman of the Department. He also worked at the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, in summer 2003 as a Visiting Scientist. He is currently the Editor-in-Chief for the International Journal of Computational Linguistics and Chinese Language Processing. His research interests include speech recognition, text-to-speech, multimedia information retrieval, spoken language processing, and sign language processing for the hearing impaired. Dr. Wu is a member of International Speech Communication Association and ROCLING.
Jui-Feng Yeh received the B.S. degree in computer science and information engineering from the National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., in 1993, and the M.S. and Ph.D. degrees in computer science and information engineering from the National Cheng Kung University, Tainan, Taiwan, in 1995 and 2006, respectively. He is currently an Assistant Professor in the Department of Computer Science and Information Engineering, Far East University, Tainan, Taiwan. His research interests include information retrieval, spoken language processing, and ontology construction.
Fong-Lin Jang received the Bachelor of Medicine degree from Taiwan University, Taipei, Taiwan, in 1988. From 1990 to 1994, he did residency training in psychiatry at Taiwan University Hospital, Taipei. Since 1994, he has been a licensed psychiatrist in Taiwan. He has been working at the Psychiatry Department, Chi-Mei Medical Center, Tainan, Taiwan, since 1996, and is currently the Chairman of the Psychiatry Department. His research interests include group psychotherapy, biofeedback, integrated therapeutic system design of biofeedback with Internet and communications, and bioinformatics.