Mining Concept Profiles with the Vector Model or Where on Earth are Diseases being Studied?

Padmini Srinivasan, School of Library & Information Science
Micah Wedemeyer, Department of Computer Science
University of Iowa, Iowa City, IA 52242
{padmini-srinivasan,micah-wedemeyer}@uiowa.edu

Abstract

In this research we study the value of concept exploration, a function offered in our text mining prototype. This function, implemented using the vector space model, allows one to build a profile for a given concept. This profile is derived from the text collection being mined. The function may be used to build profiles for concepts that are as complex or as simple as the user desires. In this paper, we apply this function towards studying trends in disease research. Profiles are built with diseases as concepts and by mining the MEDLINE database. Disease research trends are compared with disease prevalence trends. The study indicates that text mining may offer a useful option for current efforts at estimating global epidemiological data. More generally, this research demonstrates the application of text mining and concept exploration.

1 Introduction

Consider a user who is interested in a particular concept. She wishes to figure out what she can learn about this concept from a given text collection. There is of course the standard option of conducting a search for the concept against the collection, following which she could read the retrieved documents and manually distill the concept’s properties. However, this process quickly becomes arduous as the size of the retrieved set increases to more than, say, 10 to 20 documents. This is where our concept exploration text mining function steps in. It may be used to automatically scour through the text collection and produce a set of attributes that are closely ‘associated’ with the initial concept. These associations are determined statistically. However, consistent with vector model based research, the hope is that statistical patterns reflect key semantic phenomena.

Let us assume that our user’s concept of interest is: “touring northern Italy”. Our function, if used to build a profile from a text database of guide books, might list for instance the ski resorts, the restaurants, the tourist spots, famous people, the railway stations, the universities, the hospitals etc., associated with the northern region of Italy. If the initial concept were “p53”, a gene, then the profile vector might identify its protein products, the diseases caused, and the cellular locations in which the gene is expressed. Note that each of these properties is itself essentially a concept. This is consistent with the fact that we usually describe one concept using other concepts. The concepts identified as properties are those that have a significant presence in the documents that are about the initiating concept, i.e., in the documents that are retrieved for it. Also, the profile of concepts is weighted to reflect the relative importance of each concept as a property. These property concepts may derive from the free text portions of the pool of retrieved documents. Alternatively, they may be derived from the metadata assigned to them. We adopt the latter approach here.

Clearly, if our user explores the same concept with a different text collection, then the profile may be different. This underlines the difference between a concept profile and a concept definition. The latter is relatively invariant, while the former is tied to the set of texts being mined. The former can in general be far more detailed, representing a fuller description. Even with the same dataset, concept exploration may be conducted against the full dataset or against particular subsets such as subsets corresponding to different time periods. Analysis of such temporal profiles may support trend analysis as in [11]. Finally, the explored concepts may be atomic, such as those represented by single words, or a user may specify more complex concepts such as Calcium channel blockers and Alzheimer’s disease or Climbing expeditions on the K2.

In this paper we use our concept exploration function to investigate trends regarding the prevalence of research on diseases. We have implemented this function in our prototype text mining tool. MEDLINE is the text database that we mine for this application. For each disease we obtain a profile that indicates the global prevalence of research on that disease. We then compare global trends in disease research with global trends
in disease prevalence. In general, our intent is to explore the value of this vector model based text mining function. In particular, we seek to determine if this function may be used to augment global efforts at collecting epidemiological data.

This paper is organized as follows. In the next section we provide a description of our concept profiles. After that, in section 3, we show how we use concept profiles to explore the global prevalence of research on diseases. Section 4 presents trend comparisons. Section 5 discusses related research and we offer our conclusions in section 6.

2 Concept Profiles

Figure 1 shows select fields of an example MEDLINE record. MEDLINE, started in 1966 by the National Library of Medicine (NLM), contains more than 12 million bibliographic records representing articles from over 4,500 medical and health care related journals. MeSH concepts [15] are assigned to the records by trained indexers who select from a MeSH hierarchy of around 21,000 phrases. In addition to MeSH, the NLM categorizes each MeSH concept as belonging to one or more of 134 semantic types [16]. Example semantic types are: Physiologic Function, Disease or Syndrome, Vitamin. In essence, these semantic types offer a higher level metadata scheme for the MeSH metadata itself. We employ both the MeSH metadata assigned to MEDLINE records and the semantic type based metadata categories. MeSH qualifiers are also selected by the indexers from a small vocabulary of about 100 phrases to provide additional focus on the MeSH concept. These are not used in this study.

Figure 1: Sample MEDLINE record (abbreviated).

Figure 2 depicts the procedure for building MeSH metadata based concept profiles using MEDLINE. The example concept shown is hip fractures in the elderly. The first step is to translate the concept into an appropriate search strategy (this is done by the user). This search is then executed against the full MEDLINE database via the PUBMED interface. Next the retrieved documents are analysed. As mentioned before, our profiles are built on the metadata in the records. Thus the MeSH concepts assigned to the retrieved documents are used to build the profile for the initiating concept.

Figure 2: Procedure for building MeSH-based profiles.

More formally, let the concept of interest be $C_i$. The profile for $C_i$ consists of a set of vectors, one for each metadata category, i.e., semantic type. Thus for a given category $x$ such as Disease or Syndrome,

$$\mathrm{Profile}(C_i)_x = \{ w_{ix1} t_{x1},\ w_{ix2} t_{x2},\ \cdots \} \tag{2.1}$$

where $t_{xy}$ represents the metadata term $t_y$ that belongs to the metadata category $x$, and $w_{ixy}$ is the computed weight for $t_{xy}$ in relation to $C_i$. This weight may be computed using any appropriate weighting scheme (such as mutual information and log likelihood). Below we use the tfidf weighting scheme with normalized weights:

$$w_{ixy} = v_{ixy} / \mathrm{highest}(v_{ixl}) \tag{2.2}$$

where $l = 1, \cdots, m$ and $v_{ixy} = n_{ixy} \cdot \log(N / n_{xy})$. Here $N$ is the number of documents in the database, $n_{xy}$ is the number of documents in which $t_{xy}$ occurs and $n_{ixy}$ is the number of retrieved documents for $C_i$ in which $t_{xy}$ occurs. Normalization by $\mathrm{highest}(v_{ixl})$, the highest value of $v_{ixy}$ observed for the metadata terms in category $x$, yields weights that are in [0,1] within each metadata category. (Note that there are $m$ terms in the domain for metadata category $x$.)

Figure 3: Ranking of Nations by Research on Mental Disorders (1966-2000).

Thus a profile is a set of vectors, one for each metadata category. Within each vector the weights represent the relative emphasis on the different metadata terms w.r.t. the concept’s document set. When appropriate, this profile may be limited to certain categories of metadata representing a view, thereby offering a more focused concept representation. For example, profiles of genes may be limited to just the vectors corresponding to functional metadata categories such as Cell Function and Pathologic Function.
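As a minimal sketch of equation 2.2 (an illustration, not the implementation used in the prototype), the function below computes the normalized weights for one metadata category. The function name, the argument names and the toy counts in the usage lines are assumptions made for this example. The natural logarithm is also an assumption, since the base is not stated; because of the normalization by the highest value, the final weights do not depend on the base.

```python
import math
from collections import Counter


def build_profile_vector(retrieved_docs, doc_freq, n_docs, category_terms):
    """Return {term: weight} for one metadata category, following equation 2.2.

    retrieved_docs: list of term collections, one per document retrieved for C_i
    doc_freq:       {term: number of database documents containing it}  (n_xy)
    n_docs:         total number of documents in the database           (N)
    category_terms: set of terms that belong to metadata category x
    """
    # n_ixy: number of retrieved documents in which each category term occurs.
    # set(doc) ensures a term is counted at most once per document.
    counts = Counter(
        term for doc in retrieved_docs for term in set(doc) if term in category_terms
    )
    # v_ixy = n_ixy * log(N / n_xy)
    raw = {term: n * math.log(n_docs / doc_freq[term]) for term, n in counts.items()}
    highest = max(raw.values(), default=0.0)
    if highest <= 0:
        # No informative term in this category: return zero weights.
        return {term: 0.0 for term in raw}
    # w_ixy = v_ixy / highest(v_ixl), so every weight lies in [0, 1]
    return {term: value / highest for term, value in raw.items()}


# Hypothetical toy data: three retrieved records and made-up database counts.
docs = [{"Italy", "Hip Fractures"}, {"Italy", "France"}, {"France"}]
weights = build_profile_vector(
    docs,
    doc_freq={"Italy": 10, "France": 100, "Japan": 50},
    n_docs=1000,
    category_terms={"Italy", "France", "Japan"},
)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that never occur in the retrieved set are simply absent from the returned vector, which is equivalent to a weight of zero. Restricting `category_terms` to one semantic type corresponds to building a single vector of the profile.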

3 The Global Prevalence of Research on Diseases

We now demonstrate the application of our concept exploration function towards studying the prevalence of disease research across the globe. Later we identify other applications. Consider a specialist in global health policies who is curious about the national contexts within which mental disorders as a health problem are researched. Using our text mining function a profile is built for the concept mental disorders. Since he is interested in the global prevalence of research, the profile is restricted to the vector for the metadata category represented by the semantic type Geographic Area. (Note that this has nothing to do with the geographic affiliations of the authors of the corresponding documents. In contrast, geographic MeSH terms are applied when the study in some way pertains to the geographic region.) The resultant profile is then a set of weighted country names (using equation 2.2), where the weights reflect the relative emphasis of mental disorders research in the context of the corresponding nation. We can thus rank the nations by their weight, and when this data is fed into appropriate visualization software such as a GIS the user may see output as shown in Figure 3. Figure 4 provides a similar visualization of the ranking of nations by emphasis of research on Cholera, but here the analysis (i.e., the documents retrieved for the concept Cholera) is limited to the decade of the 90s. As the colour changes from green to red the emphasis increases. Regions coloured white are not included in the analysis. Regions coloured blue have zero emphasis on research. Thus our user can visually explore the global distribution of research on diseases.

Figure 4: Ranking of Low and Middle Income Nations by Research on Cholera (1990-2000).

The next question posed by our policy specialist is: does the trend regarding the prevalence of research on mental disorders match the trend regarding the prevalence of the disease itself? In fact, he starts wondering if this is true of diseases in general. The intuition is that a disease will be studied in the context of a nation if indeed the disease occurs in that nation. Also, the more prevalent a disease in a nation, the greater the inclination to study it in the context of that region. However, our specialist wonders if these intuitions are correct. The MEDLINE database with its more than 12 million citations offers him a unique opportunity to test these intuitions using our text mining tool.

Simulating this scenario we decided to examine a set of diseases, comparing research trends and epidemiological data. Unfortunately, we quickly found it quite challenging to locate global epidemiological data, especially historical statistics spanning the last few decades. The World Health Organization (WHO) is a good source of such data through its Statistical Information System [31] and its weekly epidemiological records [30]. However, the list of diseases covered by WHO is not comprehensive. The procedure for obtaining global disease estimates, either as the number of cases reported or the number of deaths reported, is highly complex, involving many economic, geographic and political factors at the local, regional, national and international levels [12, 32]. Our present analysis is conducted using a set of 19 diseases where the data is obtained primarily from the WHO resources.
We use epidemiological data pertaining to the decade of the 1990s for the set of diseases. These include several types of cancers and some tropical diseases such as Yellow Fever and Cholera.

Profiles were created for each disease limited to Geographic Area metadata concepts. For each disease we ranked the different countries by their weight in the disease profile. We obtained a second ranking of nations based on the prevalence of the disease (say using the number of cases reported by WHO). We then compared rankings to see how well they correlate. Spearman’s rank correlation coefficient was used for all comparisons and all tests of statistical significance were done at the 95% confidence level.

Table 1 summarises the comparison between disease prevalence and research prevalence. Correlations are computed for the decade of the nineties, i.e., between epidemiological data for 1991-2000 and MEDLINE data from 1991-2000. For some diseases the prevalence data for the full decade are not available. Instead, data available for a smaller range of years within the decade of the 90s are used. The table also includes correlations run by income group. This is because our hypothetical user is interested in determining if income has anything to do with the relationship between research and disease trends. For this we obtained a list of 208 countries from the World Bank website [27]. This list is updated annually and includes 184 member countries as well as 24 non-member countries with a population of at least 30,000. The World Bank also classifies the countries as high income (52 countries), middle income (90 countries) and low income (66 countries), a classification which we use in our analysis. Thus Table 1 has at most four rows of information for each disease, corresponding to the high, middle and low income groups and a fourth row (all) for all the countries together. Since the available epidemiological data also differs across diseases in the extent of global coverage, the third column of the table (column N) identifies the number of nations included in the analysis for each disease and income group. The remaining column (CC) provides Spearman’s rank correlation coefficient.

It may be observed that for 14 of the 19 diseases (74%), we obtain a significant, positive correlation between a ranking of nations based on disease prevalence and a ranking based on research prevalence. This is true for all income groups for which the epidemiological data is available. Cholera, Meningitis, Yellow Fever, Dengue and Dracunculiasis are the five diseases for which a significant positive correlation is not observed for at least one income group. However, even for these, significant correlations were observed for 7 of the 16 group-based tests.
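The comparison of the two rankings can be sketched as follows. This is an illustrative, dependency-free computation of Spearman's coefficient, not the statistical software used in the study. The function names are hypothetical, and assigning tied values their average rank is an assumption, since the handling of ties is not described. Significance testing is left out of the sketch.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) downward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: the Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for four nations: profile weights vs. reported cases.
research_weight = [1.00, 0.40, 0.75, 0.10]
reported_cases = [5200, 900, 300, 120]
print(round(spearman(research_weight, reported_cases), 3))
```

In practice each pair of lists would cover only the nations for which both a profile weight and prevalence data exist, which is why N varies from row to row in Table 1.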

In order to further understand these results, we ranked, within each disease, the three income groups by their average prevalence (number of cases). Interestingly, if the top scoring income group is either high or middle income (10 diseases), then we obtain significant correlations for all but 1 disease (Melanoma), i.e., for 90% of them. In contrast, when the top ranking group is the low income one, we find significant correlations for only 4 of the 9 (44%) diseases. This suggests to our policy specialist that diseases dominant in the poorer nations are less likely to be researched in national contexts at a level consistent with the prevalence of the diseases.

4 Temporal Trend for Most Frequently Studied Diseases

Next our user is interested in examining trends in research prevalence over the last few decades. In particular he wishes to identify the top few (operationalized as top 3 here) diseases studied in nations of particular income groups and examine whether these have changed over time. He seeks to determine if some diseases are emphasized more than others and whether there are differences across income groups and over time.

This time we build profiles using each of our 208 nations in turn as a concept. We limit these profiles to vectors of metadata with the semantic type Disease or Syndrome. Four profiles are generated for each nation, distinguished by the time period: 1961-70, 1971-80, 1981-90 and 1991-2000. Based on these profiles we identify the 3 highest weighted diseases for each country within each income group and time period and pool these into a set S. Then we assess how often, i.e., for how many countries, a disease appears in S and rank the diseases according to this frequency. We reiterate that all ranks used in this analysis and reported in Table 2 are ranks of diseases based on their frequencies in S and not the original ranks associated with a country. Figure 5, where each D1, D2, ..., DN represents a disease, shows this process. In the figure, D1, D2 and D5 are the top 3 weighted diseases in Nigeria and so get added to S. We compute disease ranks against S generated for each income group and time period independently.

Table 2 summarizes this analysis. It includes a disease for an income group if it is one of the 3 most frequent diseases in S (shown in bold) for at least 1 time period. Its ranks in S for the remaining three time periods are also shown for comparison. As an example, during the decade of the 60’s (Time 1) Nutrition Disorders and Malaria are the two most frequent diseases in S for the middle income group and these are ranked within the top 3 positions for 8 of the 90 countries.
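The pooling and ranking procedure described above can be sketched as follows for a single income group and time period. The function names and the toy weights are hypothetical. Letting tied diseases share a rank, with the next rank following immediately, is an assumption inferred from the tied entries in Table 2; breaking weight ties alphabetically within a country is a further assumption.

```python
from collections import Counter


def top_k_terms(profile, k=3):
    """Return the k highest weighted diseases of one country's disease vector."""
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [disease for disease, _ in ordered[:k]]


def rank_pooled_diseases(country_profiles, k=3):
    """Pool every country's top-k diseases into S and rank diseases by frequency in S.

    country_profiles: {country: {disease: weight}} for ONE income group and
    ONE time period. Returns (disease, rank, n_countries) tuples, most frequent first.
    """
    pool = Counter()  # S: how many countries list each disease in their top k
    for profile in country_profiles.values():
        pool.update(top_k_terms(profile, k))
    ranked, rank, previous = [], 0, None
    for disease, freq in sorted(pool.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq != previous:
            rank += 1  # diseases with the same frequency share a rank
            previous = freq
        ranked.append((disease, rank, freq))
    return ranked


# Hypothetical disease vectors for three nations of one income group.
profiles = {
    "Nigeria": {"D1": 1.0, "D2": 0.8, "D5": 0.6, "D3": 0.1},
    "Kenya": {"D1": 1.0, "D3": 0.9, "D2": 0.5, "D4": 0.2},
    "Ghana": {"D4": 1.0, "D1": 0.7, "D5": 0.3},
}
for disease, rank, n_countries in rank_pooled_diseases(profiles):
    print(f"{disease}: {rank} ({n_countries})")
```

Each printed line mirrors the "rank (number of countries)" form of the entries in Table 2. Running the function once per income group and decade would give the twelve independent rankings that the table summarizes.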
Over the following decades, Nutrition Disorders becomes the third, then the tenth and finally the eleventh most frequent top ranked disease.

Several observations can be made. First, there are strong differences between the diseases listed for the high income group versus the two other groups. For example, Mental Disorders is consistently the most frequent top ranked disease for the high income group, but this is not a frequently studied disease for the rest of the world. Unfortunately, this pattern is not consistent with the prevalence of this problem. According to the World Bank Study on the Global Burden of Disease, Mental Disorders is the leading cause of disability in the world for 1990 [34]. The question asked now by our health policy specialist is: why are the middle and low income groups not showing greater research emphasis on this family of problems? This is in fact an issue that has generated much attention. For example, in developing countries mental health is viewed as secondary to the more life threatening infectious diseases. Lack of awareness, stigma and discrimination against the mentally ill are also some of the barriers [33].

Table 1: Correlations between Rankings on Disease Prevalence and Rankings on Research Prevalence for the Decade of the 90s. CC: Spearman’s coefficient; * indicates significance at the 95% confidence level; N = number of countries; - indicates that no value is listed for that income group.

Disease                          all            high           mid            low
                                 N    CC        N    CC        N    CC        N    CC
Breast Cancer                    168  0.645*    35   0.856*    71   0.709*    61   0.372*
Hodgkins                         165  0.539*    34   0.710*    70   0.545*    61   0.386*
Dengue                           61   0.488*    12   0.467     39   0.660*    10   0.716*
Liver Neoplasm                   60   0.340*    34   0.840*    71   0.681*    60   0.518*
Ovarian                          166  0.542*    34   0.813*    71   0.591*    61   0.352*
Cholera                          70   0.084     17   0.487*    21   0.347     32   -0.007
Stomach Cancer                   165  0.589*    34   0.809*    70   0.626*    61   0.427*
Leprosy                          64   0.600*    -    -         22   0.565*    42   0.611*
Malaria                          89   0.479*    -    -         39   0.395*    49   0.471*
Colorectal Cancer                163  0.606*    33   0.820*    71   0.608*    59   0.353*
Meningitis                       169  0.326*    42   0.492*    74   0.184     53   0.222
Tuberculosis                     187  0.470*    44   0.731*    82   0.585*    61   0.639*
Prostate Neoplasm                165  0.561*    35   0.825*    70   0.466*    60   0.287*
Esophagus Cancer                 165  0.554*    34   0.761*    71   0.610*    60   0.347*
AIDS                             171  0.440*    39   0.384*    76   0.467*    56   0.712*
Melanoma                         165  0.562*    34   0.891*    71   0.540*    60   0.179
Yellow Fever                     42   0.429     -    -         16   0.252     25   0.607*
Trypanosomiasis Dracunculiasis   29   0.340*    -    -         -    -         18   0.357

Figure 5: Analysis of the Three Most Frequently Studied Diseases for each Country.

We now look at Malaria, the most or close to the most frequent top ranked disease for the low and middle income groups but absent from the high income group for any time period. This pattern is consistent with the fact that Malaria is predominantly a tropical disease occurring in Asia, Africa and South and Central America [13]. Several other diseases are also present only in the middle and low income groups, such as Schistosomiasis and Nutrition Disorders. Tuberculosis, at least the second most frequent top ranked disease in the 60s across the world, becomes less frequent over time and is totally absent from the top ranking pool of the high income group for the decade of the 90s. These trends are also consistent with global occurrence patterns for the disease [29]. High income countries such as France and Finland have reduced their numbers of Tuberculosis cases over the decades, while in many of the low income nations these numbers are on the rise. For example, in Malawi the number went from about 5,300 in 1985 to about 19,000 in 1995 and we observe that it has remained a top ranked disease in research for this nation.

Overall, the low and middle income groups are more similar to each other in their top ranked diseases while being quite distinct from the high income group. Trend analysis supports our user’s intuition that changes in the prevalence of a disease over time may generate similar changes in the prevalence of research on the disease. The exceptions, as in Mental Disorders, are revealing in their imbalance and consistent with external evidence regarding the emphasis on the disease.

5 Related Research

Text mining [2, 4, 5, 6, 7, 8, 10, 19, 20, 21, 22, 23, 24, 25, 26] is about the automatic discovery of knowledge from text collections. Although similar to data mining [1, 3, 18] in its goal, the difference is that instead of mining a large collection of well structured data, text mining efforts are based on large collections of texts that are at best semi-structured. In both cases the knowledge discovered consists essentially of propositions or hypotheses that require further study and verification. Text mining interests us because of the availability of large text collections such as MEDLINE.

Text mining algorithms may operate on the natural language and free-text portions of texts such as the title and abstract [2, 5, 6, 26]. Such algorithms rely on information extraction technology to extract nuggets of information such as the specifics of a merger relationship between two companies described in a news stream or the properties of a gene described in a journal article. Thus text mining from free-text is closely tied to the underlying extraction technology.
Instead of exploring the free-text portions, text mining may operate on the metadata associated with these documents [4, 8, 14]. Such algorithms perform text metadata mining. Metadata such as the Dewey Decimal System and the Library of Congress Subject Headings (LCSH) have supported information retrieval for more than a century. Dublin Core and RDF have evolved more recently for Web documents. Metadata has also been at the foundation of text categorization research. Whether assigned manually or automatically, metadata offer us succinct representations of the essence of the source documents. The premise in metadata mining is that these core nuggets of information offer a reasonable basis for knowledge discovery. One limitation in metadata mining is that the metadata scheme constrains the kind of knowledge that may be mined. However, this parallels the constraint imposed by, for instance, the schema that underlies a relational database that is being mined.

Table 2: Most Frequent Top Ranked Diseases. The three most frequent top ranked diseases by income group and by time period are shown. For each entry the number of countries in which it occurs in the top 3 ranks is shown in parentheses. The symbol n indicates that the disease is not in the top 3 ranks for any country. Time 1, Time 2, Time 3 and Time 4 represent the decades of the 60s, 70s, 80s, and 90s respectively. Ranks 1, 2, and 3 are in bold.

Disease                                Time 1    Time 2    Time 3    Time 4
High Income: 52 countries
Mental Disorders                       1 (19)    1 (19)    1 (18)    1 (15)
Tuberculosis, Pulmonary                2 (11)    7 (2)     9 (1)     n
Neoplasms                              2 (1)     5 (5)     4 (7)     3 (13)
Coronary Disease                       3 (8)     3 (9)     5 (6)     8 (4)
Substance-Related Disorders            4 (6)     2 (12)    6 (5)     11 (1)
Acquired Immunodeficiency Syndrome     n         n         2 (16)    2 (12)
Occupational Diseases                  7 (2)     4 (7)     3 (13)    5 (7)
Middle Income: 90 countries
Malaria                                1 (8)     2 (11)    1 (17)    4 (9)
Nutrition Disorders                    1 (8)     3 (9)     10 (3)    11 (2)
Chagas Disease                         2 (7)     10 (2)    8 (5)     5 (8)
Schistosomiasis                        3 (6)     5 (7)     9 (4)     12 (1)
Tuberculosis, Pulmonary                2 (7)     1 (12)    5 (8)     7 (6)
Chronic Disease                        7 (1)     1 (12)    7 (12)    12 (1)
Leishmaniasis                          7 (1)     9 (3)     2 (11)    5
Occupational Diseases                  2 (7)     4 (8)     3 (10)    8
Acquired Immunodeficiency Syndrome     n         n         4 (9)     1 (22)
Mutation                               7 (1)     n         n         2 (13)
Low Income: 64 countries
Malaria                                1 (10)    1 (18)    1 (34)    1 (27)
Schistosomiasis                        2 (8)     2 (15)    4 (8)     3 (4)
Tuberculosis, Pulmonary                3 (5)     4 (7)     7 (2)     5 (3)
Leprosy                                3 (5)     4 (7)     3 (9)     5 (3)
Onchocerciasis                         5 (2)     3 (8)     4 (8)     5 (3)
Nutrition Disorders                    4 (3)     3 (8)     4 (8)     5 (3)
Acquired Immunodeficiency Syndrome     n         n         2 (9)     2 (26)
Stress Disorders, Post-Traumatic       n         n         7 (1)     3 (4)

Text mining for knowledge discovery is being explored by several research groups both within and outside the biomedical domain. There are general purpose discovery tools and systems such as IBM’s Intelligent Miner for Text [9] offering feature analysis, categorization and summarization capabilities. TextVis, evolving from earlier systems such as KDT [4], supports document clustering, the interactive creation of taxonomies and the generation of association rules. Association rules [1, 18] are a dominant theme in text mining research (e.g., [2, 8]). These rules link pairs or larger groups of concepts and are assigned support and confidence values, scores that are commonly used in data mining research. An association rule such as concept A → concept B indicates that there may be a potentially interesting directional association from A to B. Typically, these are discovered by exploiting the co-occurrence of concepts in the texts being mined, in this case between A and B. Such rules are in fact quite relevant in business intelligence applications. For example, a strong association between two companies may alert the analyst to an important upcoming merger.

These types of associations between pairs of concepts, relying on their co-occurrence in the database, capture a certain class of knowledge. Our goal, in contrast, is to identify a descriptive profile for a concept of interest. Concept properties are identified via a subset of documents retrieved from the collection and not from the full collection. Our approach also allows us the advantage of considering concepts that are as complex as needed. Finally, views allow focused representations that may be tailored to the particular application goals.
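For comparison with the profile-based approach, the support and confidence scores mentioned above for a rule A → B can be computed from co-occurrence counts as in the sketch below. It uses the standard data mining definitions (support as the fraction of documents containing both concepts, confidence as the fraction of documents containing A that also contain B). The function name and the toy documents are hypothetical and do not come from any of the cited systems.

```python
def support_and_confidence(docs, a, b):
    """Return (support, confidence) of the rule a -> b.

    docs: iterable of sets, each holding the concepts assigned to one document.
    """
    docs = list(docs)
    n_a = sum(1 for concepts in docs if a in concepts)
    n_ab = sum(1 for concepts in docs if a in concepts and b in concepts)
    support = n_ab / len(docs) if docs else 0.0  # share of documents with both a and b
    confidence = n_ab / n_a if n_a else 0.0      # share of a-documents that also have b
    return support, confidence


# Hypothetical collection of four documents with their assigned concepts.
collection = [{"A", "B"}, {"A"}, {"B"}, {"A", "B", "C"}]
print(support_and_confidence(collection, "A", "B"))
```

Note that these scores are computed over the full collection, whereas a concept profile is computed only over the documents retrieved for the initiating concept.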

eases studied by the high income nations are different from those studied by middle and low income nations. In general, research trends over time match trends in disease prevalence. Exceptions as with Mental Disorders are consistent with discussions in the literature and provide additional arguments for increased investment in research on certain problems. One limitation of this study stems from the fact that our text mining strategy relies heavily on MeSH indexing. It is acknowledged that the policies and practices followed by MeSH indexers at NLM have quite naturally evolved over the years. In fact the MeSH vocabulary of index phrases itself has also grown over time. Although any conclusions from such studies should be made with caution we consider it opportune to exploit the significant intellectual investment in MeSH indexing. What has been gained by exploring intuitions that may at some level seem quite obvious? We now have evidence supporting the suggestion that the relative distribution of research on diseases, especially those dominant in high and middle income nations, may provide scientists, policy makers and health care professionals with some estimates of the relative distribution of diseases. Given the expense and complexity in obtaining global epidemiological data, especially temporal data, our approach may offer a reasonable option, although with a time delay due to the publication and indexing processes. The advantage in our approach is that the estimates are obtained solely from the MEDLINE database and using a computational approach implemented in our text mining software. This approach may now be used to explore relative distributions on demand for any disease and over any time period covered by MEDLINE. Consequently one may also independently track different forms of the same disease such as the four main forms of Leishmaniasis; Cutaneous Leishmaniasis and Mucocutaneious Leishmaniasis being two examples 6 Conclusions [28]. 
Granular estimates for different disease forms may This investigation began with the goal of studying our allow researchers to later combine them for the broader vector model based text mining function that supports disease class. These may perhaps complement surveilconcept exploration. We show how a hypothetical user lence methods that focus on broader classes of diseases. interested in the prevalence of research on diseases may It is also possible to explore more intricate relationships use such a function to study disease profiles. The between diseases, such as the prevalence of Leishmania experiments show that his intuitions are correct in that and HIV as co-infections [28]. At the very least we offer 14 of the 19 tested diseases show significant positive our text mining approach as a possible supplement to correlations between research prevalence and disease the current strategies for collecting global disease occurprevalence, even when the analysis is conducted by rence data. At a more general level, our concept exploration income group. The analysis goes further to show that this positive correlation is less likely when the disease is function may be used for other applications such as discovering drug function. Give a particular drug X, one most prevalent in the low income group of nations. Our concept exploration function was also used to may proceed by comparing the profile of X with the probuild profiles for each of 208 nations over different time files of drugs whose functions are known. Drugs with periods. Analysis of these profiles showed that the dis- very similar profiles may suggest the correct function

of X based on their own functions. This is being explored in parallel research. Another application may be to explore whether a particular dietary substance and a disease are related. For this we may examine their individual profiles to determine if there are meaningful, possibly functional, conceptual connections. Such applications are also the subject of parallel research. We also plan to explore this text mining function with users.

References

[1] Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. ACM SIGMOD, 207-216, 1993.
[2] Blake C. and Pratt W. Better rules, fewer features: A semantic approach to selecting features from text. ICDM, 59-66, 2001.
[3] Fayyad U.M. and Uthurusamy R. Data mining and knowledge discovery in databases (Introduction to the special section). CACM, 39(11):24-26, 1996.
[4] Feldman R, Aumann Y, Amir A, Klosgen W, and Zilberstein A. Maximal association rules: A new tool for mining for keyword co-occurrences in document collections. KDD-97, Newport Beach, CA, 1997.
[5] Ronen Feldman and Ido Dagan and Haym Hirsh. Mining text using keyword distributions. Journal of Intelligent Systems, 10, 281-300, 1998.
[6] M. D. Gordon and R. K. Lindsay. Toward discovery support systems: A replication, re-examination, and extension of Swanson's work on literature-based discovery of a connection between Raynaud's and fish oil. JASIS, 47, 116-128, 1996.
[7] Hearst M. Untangling text data mining. Proceedings of the 37th ACL Conference, 1999.
[8] Dimitar Hristovski and Janez Stare and Borut Peterlin and Saso Dzeroski. Supporting discovery in medicine by association rule mining in Medline and UMLS. Proceedings of MedInfo, 10(2), 1344-8, 2001.
[9] IBM. Intelligent Miner for Text. http://www3.ibm.com/software/data/iminer/fortext
[10] Tor-Kristian Jenssen and Astrid Laegreid and Jan Komorowski and Eivind Hovig. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28, 21-28, May 2001.
[11] Lent B, Agrawal R. and Srikant R. Discovering trends in text databases. Proceedings of the 3rd International Conference on Knowledge Discovery, KDD-97, Newport Beach, CA, 1997.
[12] Margaret McGill and Martin Silink. Diabetes in Children and Adolescents of the Western Pacific Region of the International Diabetes Federation. Diabetes Spectrum, 12(3), 165, 1999. http://www.diabetes.org/diabetesspectrum/99v12n3/Pg165.htm.
[13] National Institute of Allergy and Infectious Diseases (US Department of Health and Human Services). Malaria. NIH Publication No. 02-7139. September 2002. http://www.niaid.nih.gov/publications/malaria/.
[14] Masys D.R, Welsh J.B, Fink J.L, Gribskov M, Klacansky I, Corbeil J. Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics, 17(4):319-326, 2001.
[15] National Library of Medicine. Medical Subject Headings. http://www.nlm.nih.gov/mesh/meshhome.html.
[16] National Library of Medicine. Unified medical language system knowledge sources. http://umlsks.nlm.nih.gov, 2002.
[17] Carolina Perez-Iratxeta and Peer Bork and Miguel A. Andrade. Association of genes to genetically inherited disease using data mining (letter). Nature Genetics, 31, 316-319, 2002.
[18] Piatetsky-Shapiro G. and Frawley W.J.E. Knowledge discovery in databases. Cambridge, MA: MIT Press, 1991.
[19] Smalheiser N.R. and Swanson D.R. Indomethacin and Alzheimer's disease. Neurology, 46:583, 1996.
[20] N.R. Smalheiser and D.R. Swanson. Linking estrogen to Alzheimer's disease: An informatics approach. Neurology, 47, 809-810, 1996.
[21] P. Srinivasan. MeSHmap: A text mining tool for MEDLINE. Proceedings of the American Medical Informatics Annual Symposium, 642-646, 2001.
[22] P. Srinivasan and Thomas C. Rindflesch. Exploring text mining from MEDLINE. Proceedings of the American Medical Informatics Annual Symposium, 2002.
[23] Swanson D.R, Smalheiser N.R and Bookstein A. Information discovery from complementary literatures: categorizing viruses as potential weapons. JASIST, 52(10), 797-812, August 2001.
[24] D.R. Swanson and N.R. Smalheiser. An interactive system for finding complementary literatures. Artificial Intelligence, 91, 183-203, 1997.
[25] D.R. Swanson. Migraine and Magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31, 526-557, 1988.
[26] Marc Weeber and Henny Klein and Alan R. Aronson and James G. Mork and Lolkje T.W. de Jong-van den Berg and Rein Vos. Text-based discovery in biomedicine: the architecture of the DAD-system. Proceedings of the American Medical Informatics Annual Symposium, 903-907, 2000.
[27] World Bank. Data Statistics. Classification of Economies. http://www.worldbank.org/data/countryclass/countryclass.html.
[28] World Health Organization. Leishmaniasis. Background of the Disease. 2001. http://www.who.int/emc-documents/surveillance/docs/whocdscsrisr2001.html/Leishmaniasis/Leishmaniasis.htm.
[29] World Health Organization. Tuberculosis. Strategies, Operations, Monitoring and Evaluation. http://www.who.int/gtb/Country_info/.
[30] World Health Organization. Weekly Epidemiological Record. http://www.who.int/wer.
[31] World Health Organization. Statistical Information System. http://www.who.int/whosis/WHO
[32] World Health Organization. Surveillance, Prevention and Management of Noncommunicable Diseases. MIP 2001. http://www.who.int/mip2001/index.pl?iid=1754.
[33] World Health Organization. The World Health Report 2001. Mental Health: New Understanding, New Hope. http://www.who.int/whr/2001/main/en.
[34] World Health Organization. The Global Burden of Disease. A Summary. http://www.who.int/msa/mnh/ems/dalys/intro.htm
