INTERNATIONAL JOURNAL OF IMAGING SCIENCE AND ENGINEERING (IJISE)
Information Extraction from Biomedical Literature using Text Mining Framework
Latha K.1, Kalimuthu S.2, Dr. Rajaram R.3
Abstract— Biomedical publications are unstructured, which makes it difficult to apply data mining or knowledge discovery techniques to retrieve the documents a researcher needs. In this paper we propose a Text Mining Framework for extracting documents from the biomedical literature. The framework is divided into three stages: text gathering, text preprocessing and data analysis. The goal of this paper is to apply optimized methods at each of the three stages and to present the results. The framework is implemented with Java as the front end and Oracle as the back end.

Keywords— Biomedical literature, Data mining, Framework, Text mining.
INTRODUCTION
Biomedical information is growing explosively, and useful new results appear every day in research publications, many of which are available online in databases such as MedLine and PubMed [5,7]. Automatic extraction of useful information from these online sources is a challenge, because the documents are unstructured and written in natural language. The objective of this paper is to apply data mining and text mining techniques to extract knowledge from such documents and to minimize the number of documents that must be checked. We present a system called the Text Mining Framework, which consists of the following stages:
1. Text gathering
2. Text preprocessing
3. Data analysis
In the first stage, documents are collected from existing biomedical databases such as the PubMed Open Access Initiative [5,7], MedLine, the National Library of Medicine [3], MeSH [3] and MDB. A sample set of one thousand documents was collected from various biological domains; these documents are analyzed and given as input to the second stage.
In the second stage the documents are preprocessed to reduce the workload of the data analysis stage. Specialized algorithms such as stemming and stop-word removal are applied in order to save computing time. The third stage analyzes the preprocessed documents. We present a clustering method based on the support vector machine (SVM) [2,8,9,14], chosen for its nonparametric nature. A unique advantage of the SVM algorithm is that it can generate cluster boundaries of arbitrary shape, whereas other algorithms use a geometric representation and are most often limited to hyper-ellipsoids. Being based on a kernel method, it also avoids explicit calculations in the high-dimensional feature space, and hence is more efficient.
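As a rough illustration of how the three stages fit together, the following Python sketch uses stand-in functions; the names gather, preprocess and analyse are our own illustration (the framework itself is implemented in Java), and the third stage here merely counts term frequencies rather than performing the SVM clustering described later.

```python
from collections import Counter

def gather(sources):
    # Stage 1 (text gathering): keep the non-empty raw documents.
    return [doc for doc in sources if doc.strip()]

def preprocess(doc):
    # Stage 2 (text preprocessing): lower-case and split into tokens;
    # stop-word removal and stemming are omitted in this sketch.
    return doc.lower().split()

def analyse(token_lists):
    # Stage 3 (data analysis): stand-in that counts term frequencies
    # instead of clustering with an SVM.
    return Counter(t for tokens in token_lists for t in tokens)

docs = ["Protein binding sites", "Protein folding", ""]
freqs = analyse([preprocess(d) for d in gather(docs)])
print(freqs["protein"])  # 2
```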
1 K. Latha, Lecturer, Department of Information Technology, Thiagarajar College of Engineering, Madurai.
2 S. Kalimuthu, Final-year student, Department of Information Technology, Thiagarajar College of Engineering, Madurai.
3 Dr. R. Rajaram, Head of the Department of Computer Science and Engineering, Thiagarajar College of Engineering, Madurai.
IJISE, GA, USA, ISSN: 1934-9955, Vol. 1, No. 1, January 2007
I. TEXT GATHERING
Text gathering in the biomedical literature [10,11] draws on the PubMed Open Access Initiative [5,7], MedLine, the National Library of Medicine, the MeSH databases, MDB, etc. Together these databases contain more than 12,000,000 references to biomedical publications. "Google" finds more PDF documents containing the word "protein" than PubMed and the other biomedical resources do, but most of its results are irrelevant. The citation database "CiteSeer" showed good results only for computer science and information technology publications. We therefore used PubMed and the other biological databases to obtain large collections of full-text documents, all containing relevant results.
Our first step is to download the documents and prepare them for the later stages. We downloaded a sample set of 1,000 documents from various biological domains. All the documents are stored in a single centralized directory from which they can be processed and retrieved. Biologists and researchers with administrative rights can upload their own documents through our interface; these documents can also be updated and processed for later retrieval.
Most journals and research publications are encoded in PDF format [4], so it is necessary to convert these PDF documents into text. Conversion tools exist that turn PDF into HTML or DOC, but they are unreliable and costly. To employ text mining on PDF-encoded papers successfully, it is advantageous to start by importing archives developed for PDF conversion. We implemented PDF conversion successfully using Java archives.
II. TEXT PREPROCESSING
Text preprocessing transforms text into an information-rich document matrix that records the frequency of every term in the document collection. An "80-20 rule" applies: roughly 80% of the work is done by the preprocessing stage and 20% by the remaining stages. The purpose of preprocessing is to prepare the data for the data analysis phase. It is the most important phase, because data that is not properly preprocessed degrades every phase that follows. The basic steps are:
• Tokenization
• Data cleaning
• Stop word removal
• Stemming (Paice/Husk)
• Identification of the most interesting terms
In tokenization, sentences are tokenized by converting the words, numbers and punctuation marks in the raw text into separate tokens.
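A minimal tokenizer of this kind can be sketched in Python (an illustration, not the framework's actual Java implementation):

```python
import re

def tokenize(text):
    # Convert words, numbers and punctuation marks into separate tokens.
    return re.findall(r"[A-Za-z]+|[0-9]+|[^\w\s]", text)

print(tokenize("Protein p53 binds DNA."))
# ['Protein', 'p', '53', 'binds', 'DNA', '.']
```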
According to the descriptive statistics for the modified Hamming distance, the Paice/Husk algorithm [12,13] produces a larger mean modified Hamming distance than the other algorithms, so we selected Paice/Husk for stemming. The algorithm eliminates extreme outliers, which are clearly over-stemmed (too many characters removed); for example, it reduces "ultra nationalism" to "ultra". The algorithm contains various stemming rules, keyed on the reversed suffix, for example:
dei3y> {-ied > -y}   words ending in 'ied' have that suffix replaced by 'y';
de2>   {-ed > -}     words ending in 'ed' have that suffix removed.
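The two rules quoted above can be applied as follows. This is a toy sketch containing only these two rules; the real Paice/Husk stemmer has a full rule table, acceptability assertions, and iterates until no rule applies.

```python
# Each rule: (suffix to match, number of characters to strip, string to append).
RULES = [("ied", 3, "y"), ("ed", 2, "")]

def apply_first_rule(word):
    # Apply the first rule whose suffix matches the end of the word.
    for suffix, strip, append in RULES:
        if word.endswith(suffix):
            return word[:len(word) - strip] + append
    return word

print(apply_first_rule("studied"))  # study
print(apply_first_rule("jumped"))   # jump
```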
Data cleaning removes unwanted symbols such as punctuation and special characters and converts upper-case letters to lower case. We developed our own algorithm for the data cleaning process.
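A data-cleaning step of this kind might look as follows (a sketch, not the authors' own algorithm):

```python
import re

def clean(text):
    # Replace punctuation and special characters with spaces,
    # collapse runs of whitespace, and fold to lower case.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean("Gene-expression (mRNA) levels!"))  # gene expression mrna levels
```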
Stop word removal removes words such as around, at, above, below, bottom, far, many, if, the, a, etc. We collected all the possible stop words and stored them in a separate file; every token is matched against this stop-word list, and matching tokens are removed from the document.
For stemming, different algorithms are used for the identification of root words, such as Lovins, Porter [6,19], Paice/Husk [12,13] and S-removal. We analyzed the strength and similarity of these affix-removal stemming algorithms by measuring five metrics: (1) the mean number of words per conflation class, (2) the number of words and stems that differ, (3) the mean number of characters removed in forming stems, (4) the median modified Hamming distance between words and their stems, and (5) the mean modified Hamming distance between words and their stems.

TABLE I
STRENGTH AND SIMILARITY OF AFFIX-REMOVAL STEMMING ALGORITHMS

Stemmer     | Mean Modified    | Median Modified  | Mean Characters | Mean Conflation | Words and Stems
            | Hamming Distance | Hamming Distance | Removed         | Class Size      | Different
Lovins      | 1.72             | 1                | 1.67            | 1.42            | 34437 (69.4%)
Paice/Husk  | 1.98             | 2                | 1.94            | 1.49            | 34533 (69.5%)
Porter      | 1.16             | 1                | 1.08            | 1.20            | 27897 (56.2%)
S-removal   | 0.03             | 0                | 0.03            | 1.01            | 1636 (3.3%)
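The stop-word removal step described above can be sketched as follows; the stop-word set here is a tiny illustrative subset of the file the paper describes.

```python
STOP_WORDS = {"around", "at", "above", "below", "bottom", "far", "many", "if", "the", "a"}

def remove_stop_words(tokens):
    # Keep only tokens that do not appear in the stop-word list
    # (matching is case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "protein", "at", "many", "binding", "sites"]))
# ['protein', 'binding', 'sites']
```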
Fig. 1: The workflow of the Paice/Husk algorithm. [Flowchart: start; access the stemming rule according to the final letter of the term; if the final letters of the term match the rule and the rule can be applied, apply it; if the new stem passes the assertion, either stem it again or output the stem; otherwise revert to the old stem and access the next rule.]
The main function of this step is to identify nouns, verbs and stems. Terms that are grammatically close to each other (cell, cells) are mapped to one term (cell). The original words and the words obtained after applying the Paice/Husk stemming rules [12,13] are listed below.
TABLE II
PAICE/HUSK STEMMER OUTPUT

Original Word    | Stemmed Word
Abstract_with    | Abstract_with
Access           | access
rapid            | rapid
ever-increasing  | Everincreasing
growth           | growth
articles         | articl
Quantity         | quantiti
information      | inform
genomics         | genom
understand       | understand
research         | research
Rewest           | Rewest
III. DATA ANALYSIS
Many data mining and text mining techniques are applicable here, as this is where the actual information extraction happens. This stage performs fast clustering of the results in order to assist the user by showing the concepts already found in the search result. Several methods are available for document clustering, such as k-means, self-organizing maps (SOM) or hierarchical clustering, and the support vector machine (SVM) [8,9]. In this paper the biological data is analyzed with support vector clustering, because it is a nonparametric algorithm. Data points are mapped by means of a Gaussian kernel [20] to a high-dimensional feature space, where we search for the minimal enclosing sphere. This sphere, when mapped back to data space, can separate into several components, each enclosing a separate cluster of points, and we present a simple algorithm for identifying these clusters. The width of the Gaussian kernel [20] controls the scale at which the data is probed, while the soft-margin constant helps cope with outliers and overlapping clusters. The structure of a data set is explored by varying these two parameters while maintaining a minimal number of support vectors to assure smooth cluster boundaries. The algorithm deals with outliers by employing a soft-margin constant that allows the sphere in feature space not to enclose all points; for large values of this parameter it can also deal with overlapping clusters.
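The Gaussian kernel at the heart of this method is straightforward to express; a minimal sketch, where q is the width parameter discussed in what follows:

```python
from math import exp

def gaussian_kernel(x, y, q=1.0):
    # K(x, y) = exp(-q * ||x - y||^2); larger q probes the data at a finer scale.
    return exp(-q * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))  # 1.0
```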
THE SVC ALGORITHM

1. CLUSTER BOUNDARIES
Let {x_i} ⊆ χ be a data set of N points, with χ ⊆ R^d the data space. Using a nonlinear transformation Φ from χ to some high-dimensional feature space, we look for the smallest enclosing sphere of radius R. This is described by the constraints
||Φ(x_j) − a||² ≤ R² for all j,
where ||·|| is the Euclidean norm and a is the center of the sphere. Soft constraints are incorporated by adding slack variables ξ_j:
||Φ(x_j) − a||² ≤ R² + ξ_j,  (1)
with ξ_j ≥ 0. To solve this problem we introduce the Lagrangian
L = R² − Σ_j (R² + ξ_j − ||Φ(x_j) − a||²) β_j − Σ_j ξ_j μ_j + C Σ_j ξ_j,  (2)
where β_j ≥ 0 and μ_j ≥ 0 are Lagrange multipliers, C is a constant, and C Σ_j ξ_j is a penalty term. Setting to zero the derivatives of L with respect to R, a and ξ_j, respectively, leads to
Σ_j β_j = 1,  (3)
a = Σ_j β_j Φ(x_j),  (4)
β_j = C − μ_j.  (5)
The Karush-Kuhn-Tucker complementarity conditions of Fletcher (1987) result in
ξ_j μ_j = 0,  (6)
(R² + ξ_j − ||Φ(x_j) − a||²) β_j = 0.  (7)
It follows from Eq. (7) that the image of a point x_i with ξ_i > 0 and β_i > 0 lies outside the feature-space sphere. Eq. (6) states that such a point has μ_i = 0, so we conclude from Eq. (5) that β_i = C. Such a point is called a bounded support vector (BSV). A point x_i with ξ_i = 0 is mapped to the inside or to the surface of the feature-space sphere. If 0 < β_i < C, then Eq. (7) implies that its image Φ(x_i) lies on the surface of the feature-space sphere; such a point is referred to as a support vector (SV). SVs lie on cluster boundaries, BSVs lie outside the boundaries, and all other points lie inside them. Note that when C ≥ 1 no BSVs exist, because of constraint (3). Using these relations we may eliminate the variables R, a and μ_j, turning the Lagrangian into the Wolfe dual form, a function of the variables β_j:
W = Σ_j Φ(x_j)² β_j − Σ_{i,j} β_i β_j Φ(x_i)·Φ(x_j).  (8)
Since the variables μ_j do not appear in the Lagrangian, they may be replaced with the constraints
0 ≤ β_j ≤ C, j = 1, …, N.  (9)
We follow the SV method and represent the dot products Φ(x_i)·Φ(x_j) by an appropriate Mercer kernel K(x_i, x_j). Throughout this paper we use the Gaussian kernel
K(x_i, x_j) = e^(−q ||x_i − x_j||²)  (10)
with width parameter q. The Lagrangian W is now written as
W = Σ_j K(x_j, x_j) β_j − Σ_{i,j} β_i β_j K(x_i, x_j).  (11)
At each point x we define the distance of its image in feature space from the center of the sphere:
R²(x) = ||Φ(x) − a||².  (12)
In view of (4) and the definition of the kernel, we have
R²(x) = K(x, x) − 2 Σ_j β_j K(x_j, x) + Σ_{i,j} β_i β_j K(x_i, x_j).  (13)
The radius of the sphere is
R = { R(x_i) | x_i is a support vector }.  (14)
The contours that enclose the points in data space are defined by the set
{ x | R(x) = R }.  (15)
We interpret these contours as forming cluster boundaries (Figure 2). In view of Eq. (14), SVs lie on cluster boundaries, BSVs are outside, and all other points lie inside the clusters.

2. CLUSTER ASSIGNMENT
The cluster description algorithm does not differentiate between points that belong to different clusters. To do so, we use a geometric approach involving R(x), based on the following observation: given a pair of data points that belong to different components (clusters), any path that connects them must exit from the sphere in feature space. Therefore, such a path contains a segment of points y with R(y) > R. This leads to the definition of the adjacency matrix A_ij between pairs of points x_i and x_j whose images lie in or on the sphere in feature space:
A_ij = 1 if, for all y on the line segment connecting x_i and x_j, R(y) ≤ R; 0 otherwise.  (16)
Clusters are now defined as the connected components of the graph induced by A. Checking the line segment is implemented by sampling a number of points. BSVs are left unclassified by this procedure, since their feature-space images lie outside the enclosing sphere.

3. CLUSTERING WITH AND WITHOUT BSVS
Without BSVs. We begin with a data set in which the separation into clusters can be achieved without invoking outliers, i.e. C = 1. Figure 2 demonstrates that as the scale parameter q of the Gaussian kernel [20] is increased, the shape of the boundary in data space varies: with increasing q the boundary fits the data more tightly, and at several q values the enclosing contour splits, forming an increasing number of components (clusters). Figure 2a has the smoothest cluster boundary, defined by six SVs. With increasing q, the number of support vectors n_sv increases.

Fig. 2: Clustering of a data set containing 183 points using SVC with C = 1. Support vectors are designated by small circles, and cluster assignments are represented by different grey scales of the data points. (a): q = 1, (b): q = 20, (c): q = 24, (d): q = 48.

With BSVs. In order to observe splitting of contours, we must allow for BSVs. The number of outliers is controlled by the parameter C. From the constraints (3), (9) it follows that
n_bsv < 1/C,  (17)
where n_bsv is the number of BSVs. Thus 1/(NC) is an upper bound on the fraction of BSVs, and it is more natural to work with the parameter
p = 1/(NC).  (18)
When distinct clusters are present but some outliers (e.g. due to noise) prevent contour separation, it is very useful to employ BSVs. This is demonstrated in Figure 3a: without BSVs, contour separation does not occur for the two outer rings for any value of q. When some BSVs are present, the clusters are separated easily (Figure 3b).

Fig. 3: Clustering with and without BSVs. (a): The rings cannot be distinguished when C = 1; shown for q = 3.5, the lowest q value that leads to separation of the inner cluster. (b): Outliers allow easy clustering with parameters p = 0.3 and q = 1.0.
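To make the geometry concrete, here is a small numerical sketch of Eqs. (13) and (16): it computes R²(x) for a given set of β values and assigns clusters by sampling line segments. For simplicity it uses uniform β_j = 1/N rather than solving the Wolfe dual (8)-(9), so it only illustrates the geometry, not the full optimization; the use of numpy and all function names are our own illustration, not the paper's implementation.

```python
import numpy as np

def kernel_matrix(X, q):
    # Gaussian kernel, Eq. (10): K_ij = exp(-q * ||x_i - x_j||^2)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-q * d2)

def radius2(x, X, beta, q):
    # Eq. (13), with K(x, x) = 1 for the Gaussian kernel:
    # R^2(x) = 1 - 2 sum_j beta_j K(x_j, x) + sum_ij beta_i beta_j K(x_i, x_j)
    k_x = np.exp(-q * ((X - x) ** 2).sum(-1))
    return 1.0 - 2.0 * beta @ k_x + beta @ kernel_matrix(X, q) @ beta

def assign_clusters(X, beta, q, n_samples=10):
    # Eq. (16): i, j adjacent iff every sampled y on their segment has R(y) <= R.
    n = len(X)
    r2_data = np.array([radius2(x, X, beta, q) for x in X])
    r2_max = r2_data.max() + 1e-9          # sphere radius taken from the data
    ts = np.linspace(0.0, 1.0, n_samples)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i, n):
            ys = [(1 - t) * X[i] + t * X[j] for t in ts]
            adj[i, j] = adj[j, i] = all(radius2(y, X, beta, q) <= r2_max for y in ys)
    # clusters = connected components of the graph induced by the adjacency matrix
    labels = -np.ones(n, dtype=int)
    cluster = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack = [i]
        labels[i] = cluster
        while stack:
            k = stack.pop()
            for m in np.nonzero(adj[k])[0]:
                if labels[m] < 0:
                    labels[m] = cluster
                    stack.append(m)
        cluster += 1
    return labels

X = np.array([[-2.0], [-1.9], [2.0], [2.1]])   # two well-separated 1-D blobs
labels = assign_clusters(X, beta=np.full(4, 0.25), q=10.0)
print(labels)
```

With q = 10 the midpoint between the blobs has R²(y) well above the radius computed at the data points, so the segment test separates the two pairs into two clusters.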
IV. CONCLUSION
In this paper we have proposed a text mining framework consisting of three phases: text gathering, text preprocessing and data analysis. In text gathering, biological documents are identified from popular biomedical databases such as PubMed [5,7], MedLine, MDB, the National Library of Medicine [3] and the MeSH databases [3]. These documents are given as input to the text preprocessing phase, where they are preprocessed with techniques such as tokenization, data cleaning, stop-word removal and stemming. The processed documents are then passed to the data analysis block, where the data is analyzed with the SVM clustering algorithm [2,8,9]. This method has no explicit bias regarding either the number or the shape of the clusters. It has two parameters, p and q: p is the soft-margin constant that controls the number of outliers, while q, the width parameter of the Gaussian kernel, determines the scale at which the data is probed, and as it is increased clusters begin to split. The framework thus successfully retrieves biomedical documents, with the goal of increasing the precision and recall of the information retrieval.

V. REFERENCES
[1]
C. Blaschke, A. Valencia, "Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study", Comparative and Functional Genomics 2, 196-206, 2001.
[2] A. Kowalczyk, B. Raskutti, H. L. Ferrá, "Exploring Potential of Leave-One-Out Estimator for Calibration of SVM in Text Mining", PAKDD, 361-372, 2004.
[3] S. Nelson, "Medical Subject Headings – Fact Sheet", http://www.nlm.nih.gov/pubs/factsheets/mesh.html, U.S. National Library of Medicine, 2004.
[4] PDF Specification, http://partners.adobe.com/asn/acrobat/sdk/public/docs/PDFReference15_v6.pdf, Adobe, 2004.
[5] PubMed Central Open Access Initiative, http://www.pubmedcentral.nih.gov/about/openftplist.html.
[6] M. F. Porter, "An algorithm for suffix stripping (reprint)", in Readings in Information Retrieval, Morgan Kaufmann, http://www.tartarus.org/~martin/PorterStemmer/.
[7] PubMed, http://www.ncbi.nlm.nih.gov/PubMed/, 2004.
[8] S. J. Roberts, "Non-parametric unsupervised cluster analysis", Pattern Recognition, 30(2):261–272, 1997.
[9] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[10] H. Shatkay, R. Feldman, "Mining the biomedical literature in the genomic era: an overview", J Comput Biol, 10(6):821–855, 2003.
[11] Jung-Hsien Chiang and Hsu-Chun Yu, "Literature extraction of protein functions using sentence pattern mining", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, August 2005.
[12] Lancaster Paice/Husk stemming algorithm, www.lancs.ac.uk/ug/oneillc1/stemmer.
[13] Paice/Husk stemmer modifications by Antonio Zamora, www.scientificpsychic.com/paice/paice.html.
[14] G. W. Milligan and M. C. Cooper, "An examination of procedures for determining the number of clusters in a data set", Psychometrika, 50:159–179, 1985.
[15] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[16] D. Harman, "How effective is suffixing?", Journal of the American Society for Information Science 42 (1991) 7–15.
[17] D. Hull, "Stemming algorithms - a case study for detailed evaluation", Journal of the American Society for Information Science 47 (1996) 70–84.
[18] C. D. Paice, "Another Stemmer", SIGIR Forum 24 (3), 1990, 56–61.
[19] M. F. Porter, "An Algorithm for Suffix Stripping", Program 14, 1980, 130–137.
[20] Sei-Hyung Lee and Karen Daniels, "Gaussian Kernel Width Exploration in Support Vector Clustering", Technical Report 2004-009, University of Massachusetts Lowell, 2004.

K. Latha, M.E., works as a lecturer in Information Technology at Thiagarajar College of Engineering, Madurai. Her research area is text mining; her interests include applying data mining techniques in text analysis, information retrieval and DBMS. She received her B.E. (Electronics and Communication Engineering) from Bharathidasan University, Trichy, in 1996 and her M.E. (Computer Science and Engineering) from Madurai Kamaraj University in 2000. She has 7 years of teaching experience, is an Associate Life Member of the Institution of Engineers, and has published two technical papers in AICTE-sponsored national-level seminars (XML, UML). Mail ID:
[email protected]
S. Kalimuthu is a final-year student of Information Technology at Thiagarajar College of Engineering, Madurai. Areas of interest: data warehousing, data mining, information retrieval. Mail ID:
[email protected]
Dr. R. Rajaram is Professor and Head of the Computer Science and Engineering Department at Thiagarajar College of Engineering, Madurai. He teaches and guides research in computer science and engineering. He has successfully guided 7 research scholars so far, and 8 more are working under his guidance at Anna University. His research scholars have published 15 papers in refereed journals and presented more than 30 papers at conferences. He has developed course material on 10 computer-related topics. Mail ID:
[email protected]