International Journal of Database Theory and Application Vol. 4, No. 1, March 2011
Fuzzy Logic Application in Modeling Bioinformatics Sequence Markup Language (BSML)
Manuj Darbari, Hasan Ahmed
Department of Information Technology, Babu Banarasi Das National Institute of Technology and Management, A-649, Indira Nagar, Lucknow.

Abstract
This paper presents a fuzzy information retrieval approach for biological databases, where gene differentiation at the lower levels is very complex. We propose a methodology, FLi-BSML, based on the concepts of possibilistic ontology and fuzzy linguistic variables.

Keywords: Fuzzy Retrieval System, Possibilistic Ontology.
1. Introduction
Bioinformatics [3] has been narrowing itself down to developing and using new precise languages that do not suffer from vagueness and yet provide a precise and optimal methodology [4] with which the various forms of complicated species can be categorised. BSML is an open standard for representing and exchanging bioinformatics sequence data. It was originally created by Visible Genetics. This paper focuses on applying fuzzy linguistic terms to BSML [1,2] by developing a new language, FLi-BSML, to specify rules for bioinformatics classification.
2. BSML Document Structure
2.1. BSML is divided into three main sections:
Definitions: stores biological sequences and sequence annotations.
Research: stores information about experimental research.
Display: stores display widgets and references to external image files.
The BSML DTD consists of three basic elements: Attribute, Imp and Resource. The Attribute element is used to store arbitrary name/value pairs. The Resource element is used to store metadata about the BSML document, as shown in Figure 1.
2.2. Representation of Biological Sequence in BSML Document:
All sequence data appears in the BSML "Definitions" section. This section contains a "Sequences" element, which holds any number of "Sequence" elements. "Sequence-import" elements are used to reference sequence data stored in other BSML files.
<Sequences>
<Sequence-import source="ABC.bsml" id="AI"/>
</Sequences>
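As an illustration of how the "Sequences" container described above could be read programmatically, the following is a minimal sketch using Python's standard xml.etree.ElementTree module. The enclosing Bsml and Definitions elements, the Seq-data child and the attribute values of the local Sequence element are assumptions made for the example; only the Sequence-import line is modelled on the fragment in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical BSML fragment. Only the Sequence-import element mirrors the
# example in the text; every other element and value is an assumption.
BSML_FRAGMENT = """
<Bsml>
  <Definitions>
    <Sequences>
      <Sequence id="S1" title="demo sequence">
        <Seq-data>ACGTACGT</Seq-data>
      </Sequence>
      <Sequence-import source="ABC.bsml" id="AI"/>
    </Sequences>
  </Definitions>
</Bsml>
"""


def list_sequences(bsml_text):
    """Return (ids of local sequences, (id, source) pairs of imported ones)."""
    root = ET.fromstring(bsml_text.strip())
    local, imported = [], []
    for sequences in root.iter("Sequences"):
        for child in sequences:
            if child.tag == "Sequence":
                local.append(child.get("id"))
            elif child.tag == "Sequence-import":
                imported.append((child.get("id"), child.get("source")))
    return local, imported


local_ids, imports = list_sequences(BSML_FRAGMENT)
print("local sequences:", local_ids)
print("imported sequences:", imports)
```

Keeping local and imported sequences in separate lists reflects the distinction made in the text: a Sequence-import element carries no data of its own and only points to another BSML file.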
Figure 1. Overview of BSML Structure
In addition to raw sequence data, BSML can also represent sequence features. Sequence features provide additional details about a specific location. For example, we can take a raw sequence record and identify important parts such as promoter regions, protein-coding regions and 5' and 3' untranslated regions. Sequence features are also an important part of other file formats. There exists an XEMBL service providing complete access to the EMBL Nucleotide Sequence Database.
2.3. BSML Support to Distributed Annotation System (DAS)
BSML has built-in support for distributing and sharing genome annotation data through the Distributed Annotation System (DAS). DAS is formally specified by a client/server protocol and a set of XML Document Type Definitions (DTDs). A DAS server supports a small set of XML queries and a corresponding set of
XML Document Type Definitions. To analyse genome structure, we have to analyse regions of raw sequence data, which includes the identification of exons (protein-coding portions of genes) and introns (non-coding portions of genes). Genome annotation may also include linking sequence data to already catalogued genes and identifying sequence similarity across species. In summary, gene annotation attempts to decipher and analyse raw sequence data and finally connect it to biological function.
3. BSML Pattern Matching using Possibilistic Ontology
A fuzzy matching framework provides a tool for evaluating queries over imprecise data by allowing linguistic labels in both the classification and the query. Fuzzy sets interface numerical values with linguistic terms through membership functions, and the comparison of terms can be evaluated by fuzzy pattern matching.
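The comparison of a linguistic query with imprecise data can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the standard possibility and necessity measures of fuzzy pattern matching (the supremum of min(query, data) and the infimum of max(query, 1 - data)), trapezoidal membership functions, and a hypothetical domain of percent sequence identity sampled at integer values. The labels "high identity" and "about 70" are illustrative only.

```python
def trapezoid(a, b, c, d):
    """Membership function that is 0 outside [a, d] and 1 on the core [b, c]."""
    def mu(x):
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)   # rising edge
        return (d - x) / (d - c)       # falling edge
    return mu


# Assumed domain: percent sequence identity, sampled on an integer grid.
DOMAIN = range(0, 101)


def possibility(query, data, domain=DOMAIN):
    """Degree to which the data possibly matches the query."""
    return max(min(query(u), data(u)) for u in domain)


def necessity(query, data, domain=DOMAIN):
    """Degree to which the data certainly matches the query."""
    return min(max(query(u), 1.0 - data(u)) for u in domain)


# Hypothetical linguistic labels.
high_identity = trapezoid(60, 80, 100, 100)   # query term
about_70 = trapezoid(60, 70, 70, 80)          # imprecise stored value

print("possibility:", possibility(high_identity, about_70))
print("necessity:", necessity(high_identity, about_70))
```

The pair of degrees brackets the match: the necessity never exceeds the possibility, and both reach 1 only when every value the data may take lies in the core of the query.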
The similarity of protein sequences motivates us to use qualitative pattern matching, in which the ontology specifies hyponymy relations between terms by means of possibility and necessity degrees.
3.1. The Model:
Let the vocabulary of domain $i$ be defined by a set of terms $T_i = \{t_i^j,\ j = 1, \ldots, h(i)\}$, where $t_i^j$ is a label that can be used to describe a piece of information. The meaning of a piece of information and the domain are related through an ontology $O_i$. For two labels $t_i^j$ and $t_i^k$, $\Pi(t_i^j, t_i^k) = \Pi(t_i^k, t_i^j)$ represents to what extent $t_i^j$ and $t_i^k$ can describe the same thing. If the two meanings overlap but are not of the same type/classification, then $\Pi(t_i^j, t_i^k) = 1$ and $N(t_i^j, t_i^k) = 0$.
The degrees specified in the ontology are actually defined only on a subset of the Cartesian product. The rule of fuzzy ontology [4,5] states that evaluations must be estimated "at worst", and the ≥ is generally taken as an equality. The possibility rule of the fuzzy inference system estimates the degrees between the data and the query even if the searched items are not directly related to the same classification, without any explicit query expansion stage.
3.2. Application of FLi-BSML:
Let us consider a fragment of a simple ontology of proteins. We plot a graph in which the protein set is divided into various supports represented through BSML. BSML follows a simple form of DTD for representing protein data.
All the elements are defined to contain #PCDATA.
Figure 2. Protein Ontology.
To find a sample protein record, the search starts from the accession number, then moves to the entry name, then to the protein name and finally to the gene name. For example, an entry in Swiss-Prot with accession number P26954, entry name IL3B_MOUSE and protein name "Interleukin-3-receptor class II beta chain" has four Gene_Name entries:
The above Gene_Name element holds synonym entries. The degree values present in such an ontology are qualitative in nature and estimate semantic relations between terms. For convenience, these values may be associated with fuzzy linguistic labels, such as "very similar", "rather similar" or "only similar", to specify the strength of the relations. To convert these into a relative ordering we use rank-ordering [6] with the values 0.2, 0.5 and 0.7. The possibility degree is symmetrical, whereas a positive necessity $N(t_i^j, t_i^k)$ implies nothing for $N(t_i^k, t_i^j)$.
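The properties described above (linguistic labels mapped to rank-ordered degrees, a symmetric possibility and a directional necessity) can be sketched with a small data structure. This is an illustrative sketch under several assumptions: the assignment of 0.2, 0.5 and 0.7 to "only similar", "rather similar" and "very similar" is our reading of the rank ordering, unspecified pairs default to 0 as an "at worst" estimate, a term is taken to match itself with degree 1, and all term names are hypothetical placeholders rather than real gene names.

```python
# Assumed mapping from linguistic labels to the rank-ordered degrees.
LABEL_DEGREE = {
    "only similar": 0.2,
    "rather similar": 0.5,
    "very similar": 0.7,
}


class PossibilisticOntology:
    """Stores possibility (symmetric) and necessity (directional) degrees."""

    def __init__(self):
        self._poss = {}   # frozenset({a, b}) -> degree
        self._nec = {}    # (a, b) -> degree

    def set_possibility(self, a, b, label):
        # A frozenset key makes the stored degree independent of term order.
        self._poss[frozenset((a, b))] = LABEL_DEGREE[label]

    def set_necessity(self, a, b, label):
        # Stored for the ordered pair only; the reverse pair is unaffected.
        self._nec[(a, b)] = LABEL_DEGREE[label]

    def possibility(self, a, b):
        if a == b:
            return 1.0                     # assumption: a term matches itself
        return self._poss.get(frozenset((a, b)), 0.0)   # "at worst" default

    def necessity(self, a, b):
        if a == b:
            return 1.0                     # assumption: a term matches itself
        return self._nec.get((a, b), 0.0)               # "at worst" default


# Hypothetical terms, not taken from any real record.
onto = PossibilisticOntology()
onto.set_possibility("synonym_1", "synonym_2", "very similar")
onto.set_necessity("specific_term", "general_term", "rather similar")

print(onto.possibility("synonym_2", "synonym_1"))
print(onto.necessity("specific_term", "general_term"))
print(onto.necessity("general_term", "specific_term"))
```

Storing necessity under an ordered key is what keeps a positive degree in one direction from implying anything about the reverse direction.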
4. Conclusion
Using fuzzy logic to build a BSML retrieval system is a complex task: for every protein name with a particular accession number there are several very similar genes, any of which could be the appropriate choice. FLi-BSML supports the typed relations used in such ontologies; for example, hyponymy can be matched with necessity degrees. We are also enhancing FLi-BSML to support a Markov chain based backward ontology method for detecting the closest match to the required gene.
References
[1] Cibulskis, Kristian, "An Introduction to BSML", XML Journal, 4(3), 2005.
[2] Bray, "Annotated XML Specification", http://www.xml.com/axml/testaxml.htm, 2004.
[3] Chicurel, M., "Bioinformatics: bringing it all together", Nature, 2002, pp. 751-755.
[4] Darbari, M. (2005), "Modeling Biological Systems, A Unified Approach", ACM Software Engineering Notes, USA.
[5] Grefenstette, G. (1998), "Cross-Language Information Retrieval", Kluwer Academic, Boston.
[6] Kraft, D. (1999), "Fuzzy set techniques in information retrieval", Kluwer Academic Publishers.
[7] Lee, J. (2008), "Information retrieval based on conceptual Distance", Journal of Documentation.
Authors
Manuj Darbari is currently working as an Associate Professor in Information Technology at Babu Banarasi Das National Institute of Technology & Management, Lucknow. He has published ten papers in refereed international and national journals. He was selected for Marquis Who's Who in Science and Engineering 2003-2007. His teaching areas are Information Science, ERP, MIS, Soft Computing, Software Engineering, and Workflow Management.
Hasan Ahmed is currently working as an Assistant Professor in Information Technology at Babu Banarasi Das National Institute of Technology and Management, Lucknow. He completed his M Tech at IIIT Bangalore in 2007. He has more than five years of research and development experience, including work at industry laboratories in various positions. Prior to his current assignment, he was a Design Engineer at Nokia R & D, Bangalore. He has published several papers in reputed international conferences. He is also a co-author of 'Mobile Computing', an international technology best seller on ubiquitous computing. His areas of interest are Software Engineering and Mobile Computing.