Bioinformatics Database Design: An effective method for the universal indexing of DNA sequences using quarternary logic tables for efficient data mining

K S Praharshit Sharma*

Post Graduate Diploma in Bioinformatics, Institute of Bioinformatics & Applied Biotechnology, Biotech Park, Electronic City Phase 1, Bangalore - 560 100 (Karnataka) India.

*Email: [email protected]

ABSTRACT

DNA sequences are converted into patterns of quaternary (base 4) digits using the alphanumeric mapping A --> 0, C --> 1, G --> 2 and T --> 3. These patterns are then converted into their decimal equivalents (which are prime factorized in the case of large values) by a C program, to generate the latter part of a Unique Identification Code (UIC). The lower-case letters 'gi' (an abbreviation for 'gene index') are prefixed to this part of the UIC to identify the given sequence as a DNA sequence. The length of the DNA sequence, in decimal (base 10) form, is in turn prefixed to 'gi' to complete the UIC for that particular DNA sequence. As a further improvement, the model could also include protein sequences by replacing 'gi' with 'pi' (an abbreviation for 'protein index') and converting the protein sequences into digits of base 20 using an alphanumeric convention like the one described above, with the decimal (base 10) equivalent again used to generate the UIC.

ALGORITHM

Step 1: The DNA strand is identified.

Step 2: Its alphanumeric quaternary equivalent is generated using the rule A --> 0, C --> 1, G --> 2 and T --> 3.

Step 3: The decimal equivalent of the above number is computed, and is denoted by 'd'.

Step 4: The length of the DNA sequence is calculated, and is denoted by 'l'.

Step 5: The Unique Identification Code (UIC) of the DNA sequence is denoted by 'l'-gi-'d', where gi stands for 'gene index'.

EXAMPLE

Consider a random sequence:

G A T T A C A
2 0 3 3 0 1 0 (Alphanumeric Quaternary Equivalent)

Decimal equivalent: $d = 0\times4^{0} + 1\times4^{1} + 0\times4^{2} + 3\times4^{3} + 3\times4^{4} + 0\times4^{5} + 2\times4^{6} = 9156$

Length of the sequence = 7

UIC = 7-gi-9156

CONCLUSIONS

* The advantage of this method of storing DNA sequences is that the original DNA sequence can be computationally regenerated from its UIC as and when required.
* The major disadvantage is that the method becomes infeasible for larger DNA/protein sequences, because fixed-width data types such as int have an upper limit.
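
The encoding steps and the two points above can be illustrated with a minimal C sketch. It is not the author's program: the names `uic_encode`, `uic_decode`, `base_to_digit` and `MAX_BASES` are hypothetical, and the use of a 64-bit unsigned integer (rather than `int`) is an assumption made here to show where the size limit falls. The first base is treated as the most significant quaternary digit, as in the GATTACA example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: d is held in a uint64_t. Each base needs 2 bits,
   so at most 32 bases fit; longer sequences are rejected. */
#define MAX_BASES 32

/* Hypothetical helper: A->0, C->1, G->2, T->3, anything else -> -1. */
static int base_to_digit(char b) {
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Builds the UIC "l-gi-d" for seq into out.
   Returns 0 on success, -1 on an invalid base, an empty or too-long
   sequence, or an output buffer that is too small. */
int uic_encode(const char *seq, char *out, size_t out_size) {
    assert(seq != NULL && out != NULL);
    size_t len = strlen(seq);
    if (len == 0 || len > MAX_BASES) return -1;

    uint64_t d = 0;
    for (size_t i = 0; i < len; i++) {
        int digit = base_to_digit(seq[i]);
        if (digit < 0) return -1;
        d = d * 4 + (uint64_t)digit;  /* base-4 positional value */
    }
    int n = snprintf(out, out_size, "%zu-gi-%llu", len,
                     (unsigned long long)d);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}

/* Regenerates the sequence from a UIC of the form "l-gi-d".
   The length l restores leading A's, which add nothing to d.
   Returns 0 on success, -1 on malformed or inconsistent input. */
int uic_decode(const char *uic, char *seq, size_t seq_size) {
    assert(uic != NULL && seq != NULL);
    size_t len;
    unsigned long long d;
    if (sscanf(uic, "%zu-gi-%llu", &len, &d) != 2) return -1;
    if (len == 0 || len > MAX_BASES || len + 1 > seq_size) return -1;

    /* Peel off base-4 digits from the least significant end. */
    for (size_t i = len; i-- > 0; ) {
        seq[i] = "ACGT"[d % 4];
        d /= 4;
    }
    if (d != 0) return -1;  /* d needs more than len digits */
    seq[len] = '\0';
    return 0;
}
```

Under these assumptions, `uic_encode("GATTACA", ...)` produces `7-gi-9156`, since 2×4^6 + 3×4^4 + 3×4^3 + 1×4^1 = 9156, and `uic_decode("7-gi-9156", ...)` gives back `GATTACA`. The length prefix is what makes the round trip exact: `AAG` and `G` both have d = 2 and differ only in l. The 32-base cap makes the stated disadvantage concrete; going beyond it would need arbitrary-precision arithmetic or some other representation of d.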