TITB-00040-2001.R2


Database of Repetitive Elements in Complete Genomes and Data Mining Using Transcription Factor Binding Sites

J.T. Horng, Member, IEEE, F.M. Lin, J.H. Lin, H.D. Huang, and B.J. Liu

Abstract-- Approximately 43% of the human genome is occupied by repetitive elements, and around 51% of the rice genome is occupied by repetitive elements. The analysis presented here indicates that repetitive elements in complete genomes may have been very important in evolutionary genomics. In this study, a database called the Repeat Sequence Database (RSDB 1) is first designed and implemented to store complete and comprehensive repetitive sequences. The database contains direct, inverted and palindromic repetitive sequences, and each repetitive sequence has a variable length ranging from seven to many hundreds of nucleotides. The repetitive sequences in the database are explored using a mathematical algorithm to mine rules on how combinations of individual binding sites are distributed among them. Combinations of transcription factor binding sites in the repetitive sequences are obtained, and data mining techniques are then applied to mine association rules from these combinations. The discovered associations are further pruned to remove insignificant associations and obtain a reduced set of associations. The mined association rules facilitate efforts to identify gene classes regulated by similar mechanisms and to predict regulatory elements accurately. Experiments are performed on several genomes, including C. elegans, human chromosome 22, and Yeast.

Index Terms-- database, data mining, complete genome, gene, transcription factor.

I. INTRODUCTION

Among the many genomes recently sequenced, human chromosome 22 has a length exceeding thirty million nucleotides. Therefore, an enormous amount of information is available to elucidate the organization and function of the genome as a whole. Repetitive sequences are the most abundant sequences in the extragenic regions of genomes and have attracted strong biological interest. Biologists have found that many regulatory elements are also located there. As well as playing an important role in forming the chromatin structure in nuclei, these sequences may contain important clues to genetic evolution and phylogeny. The work [1] estimated that around 43% of the human genome is occupied by four major types of interspersed repetitive elements, i.e., LINEs (Long Interspersed Nuclear Elements), SINEs (Short Interspersed Nuclear Elements), LTR (Long Terminal Repeat) elements, and DNA transposons [2]. The analysis in this work provides some insights into the evolution of the human genome.

The International Human Genome Project has recently announced a working draft of the human genome sequence, and the genetic blueprint of humans will be completed in the near future. Over the past five years of genomic research, the Human Genome Project has also contributed markedly to molecular biology databases. Many internet sites can now be accessed for browsing, querying, retrieving or submitting molecular biology data and related data sets [3]. Despite the exponential growth of sequence records in databases such as the International Nucleotide Sequence Databases [4]-[6], records of repetitive sequences are reasonably complete and comprehensive in only a few other databases. Repbase [7] and STRBase [8] are two representative databases of repetitive sequences. Repetitive sequences in Repbase were collected from authors’ submissions. The results are obtained mainly through conventional experiments; a continually updated database server serves researchers at Repbase Update [7]. STRBase primarily emphasizes short tandem repeats, which are typically classified as microsatellites. Although the number of repetitive sequences in these two databases may be insufficient for users, these databases are still preferred for some specific analyses of repetitive sequences.

Many known transcription factor (TF) binding sites have been collected in the TRANSFAC database [9][10], which is well maintained and is the most complete database of transcription factor binding sites. Notably, the known TF binding sites in TRANSFAC are represented as consensus patterns or nucleotide distribution matrices.

J.T. Horng, F.M. Lin, J.H. Lin, and H.D. Huang are with the Department of Computer Science and Information Engineering, National Central University, Taiwan (telephone: 886-3-4227151 ext. 4519, e-mail: [email protected]). B.J. Liu is with the Department of Computer Science and Engineering, Yuan-Ze University, Taiwan (e-mail: [email protected]).

1 http://rsdb.csie.ncu.edu.tw
The known sites represented as consensus patterns are the ones used in this study for data mining in repetitive sequences. Brazma et al. [11] stated that, “The matrix representation is generally considered to be the best available means of representing the consensus; however, at present, most consensus descriptions are unreliable in the sense that they tend to give many false positives when compared against the genome sequences of even modest length”. The known TF sites represented as nucleotide distribution matrices are not investigated here, in order to avoid such “false positives” among the site matches when locating the known sites in repetitive sequences.

Data mining is important in extracting knowledge from many repetitive sequences. Agrawal et al. [12] introduced the problem of mining association rules. Association rule mining identifies interesting associations or correlations among many data items. As data continue to be collected, many industries have become interested in mining association rules from their databases. The discovery of interesting associations can support many decision-making processes [13]. Frequently used data mining methods involve association rules, statistics, neural networks, and genetic algorithms. Mining association rules frequently yields a huge number of associations, particularly for data sets whose attributes are highly correlated. The Chi-square statistical test (χ²) is frequently applied to test independence and correlation [14]. Chi-square involves comparing observed frequencies with the corresponding expected frequencies. A closer similarity between observed frequencies and expected frequencies implies greater independence. Chi-square is used to test the significance of a difference from the expected values. Research on partial classification using association rules has included two case studies [15].

This study first designs and implements a database of repetitive elements in complete genomes and then identifies the combinations of transcription factor binding sites in repetitive sequences. Data mining techniques are then used to mine associations from the combinations of transcription factor binding sites that occur in repetitive sequences. The data mining technique can produce an enormous number of associations. Insignificant associations are then removed and a set of useful associations is obtained. Additionally, the discovered associations are used to partially classify the repetitive sequences in the repeat database.
The experimental genome sequences include C. elegans, human chromosome 22, Yeast, and many bacteria.
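To make the Chi-square comparison of observed and expected frequencies concrete, the following minimal sketch computes the Pearson statistic for a single rule A=>B from a 2x2 table of presence and absence counts. The 2x2 layout, the function name, and the use of the 95% critical value for one degree of freedom (3.841) are assumptions made for illustration; the paper does not spell out its exact test setup. The sketch also assumes every row and column total is non-zero.

```python
def chi_square_2x2(n_both, n_a_only, n_b_only, n_neither):
    """Pearson chi-square statistic for a 2x2 table of A versus B.

    Rows: A present / A absent. Columns: B present / B absent.
    Assumes all row and column totals are non-zero.
    """
    observed = [[n_both, n_a_only], [n_b_only, n_neither]]
    total = n_both + n_a_only + n_b_only + n_neither
    row_sums = [sum(row) for row in observed]
    col_sums = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if A and B were independent.
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Critical value of chi-square with 1 degree of freedom at the 95% level.
CRITICAL_95_DF1 = 3.841

# Hypothetical counts: 100 sequences, A and B co-occur far more than chance.
stat = chi_square_2x2(40, 10, 10, 40)
print(stat, stat > CRITICAL_95_DF1)
```

When the observed counts equal the expected ones, the statistic is zero, which corresponds to the "greater independence" case described above.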

II. DATABASE OF REPETITIVE ELEMENTS

A. System and Methods

The proposed system has two parts: (a) a web server that runs on one PC with a single processor (AMD K6-III 450 MHz) and 384 MB of physical memory, and (b) a database server that runs on two PCs, each with two processors (Intel Pentium III 550 MHz) and 512 MB of physical memory. Red Hat Linux 6.1 was installed on all of the PCs. Fig. 1 depicts an ER diagram of the core RSDB schema. Fig. 2 depicts the relational schema mapped from the ER schema in Fig. 1. The database contains five entity types, i.e., “REPSEQ”, “ORGANISM”, “SEQUENCE”, “REPCHR”, and “CHROMOSOME”. The REPSEQ entity type stores all repetitive elements in complete genomes. Repetitive elements in the relation REPSEQ are regarded as different ones even if the same element appears in different organisms. The REPCHR relation stores all occurrences of repetitive elements, including positions and organisms. The SEQUENCE relation stores only distinct repetitive elements, so an element that appears in different organisms is stored once. The CHROMOSOME and ORGANISM relations store chromosomes and organisms, respectively.

Users interested in repetitive sequences can very easily use client-side applications such as Netscape® and Internet Explorer® (IE) to search for and browse data over the Internet. The interfaces of the database search tools were written in "PHP: Hypertext Preprocessor" (commonly known as PHP), which is an HTML-embedded scripting language, and were located on a web server (Apache 1.3.12) with a PHP module add-on. Fig. 3 shows an overview of the system architecture.

Fig. 1. ER diagram for the core RSDB schema.

Fig. 2. Relational schema for the core RSDB schema.

Fig. 3. System architecture. (Diagram components: WWW client, Internet, WWW server, RepIns, ScheMaker, parser, cross check, translator, ResLog, Search DB (Oracle), RSDB (Oracle), config data, repeat data, value files, log files, other flat-file format.)

B. Implementation

1) Template of schema: Each organism has different properties. In this work, repetitive sequences of different organisms were stored in separate tables. Therefore, the database tables are dynamically generated. Maintaining such repositories of repetitive sequences is extremely difficult. This work developed a template for generating the necessary schema to solve this problem. The template uses the same logical design as the data model and supplies the various organism-dependent attributes in the physical database design. The organism-dependent information is stored in independent configuration files. Only the configuration files need to be preserved, rather than all the data and schema definitions with their redundancies. Using the template markedly reduces not only the use of storage, but also the time spent performing maintenance when migrating or reconstructing parts of the system.

2) Insertion of repetitive sequences: Once the computation tool has identified the repetitive sequences, a software program called RepIns is used to insert them into the database. A repetitive sequence, e.g., 5'-ATAGGGGGTA-3', may occur in different organisms. In RSDB, the sequence is stored in the SEQUENCE relation, and the occurrences of the repeat are stored in both the ORGANISM and REPSEQ relations. RepIns has two main functions. One is parsing data concerning ordinary items, such as organism, date created, date updated and repeat sequence, as well as data concerning repeat-related features, such as the distribution, composition, chromosome number, and positions in chromosomes. Some derived data, such as copy number, the arbitrary classification of copy number, and the length of repetitive sequences, are also stored. Each original data set is terminated by a double slash in the parsing process. The other function of RepIns is to perform cross checks. When the input data have been parsed, each repeat sequence must be checked to determine whether a duplicate exists, because only unique sequences are allowed.
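The cross check performed by RepIns can be illustrated with a small sketch that compares sequences in three stages: length, a hash of the first 4,000 nucleotides, and finally the full sequence. The class and function names are hypothetical, the in-memory dictionary stands in for the SEQUENCE relation, and the choice of MD5 is an assumption, since the paper does not name its hash function.

```python
import hashlib

PREFIX_LEN = 4000  # number of leading nucleotides that are hashed


def fingerprint(seq):
    """Return (length, hash of the first PREFIX_LEN nucleotides)."""
    prefix = seq[:PREFIX_LEN]  # the whole sequence if it is shorter
    digest = hashlib.md5(prefix.encode("ascii")).hexdigest()  # MD5 is an assumption
    return len(seq), digest


class SequenceStore:
    """Hypothetical in-memory stand-in for the SEQUENCE relation."""

    def __init__(self):
        self._by_fingerprint = {}  # fingerprint -> list of (seq_id, sequence)
        self._next_id = 1

    def insert(self, seq):
        """Insert seq if it is new; return (seq_id, is_new)."""
        bucket = self._by_fingerprint.setdefault(fingerprint(seq), [])
        # Full comparison only for candidates with equal length and hash.
        for seq_id, stored in bucket:
            if stored == seq:
                return seq_id, False
        seq_id = self._next_id
        self._next_id += 1
        bucket.append((seq_id, seq))
        return seq_id, True


store = SequenceStore()
print(store.insert("ATAGGGGGTA"))  # first occurrence is stored
print(store.insert("ATAGGGGGTA"))  # duplicate maps to the same identifier
```

The cheap length and hash comparison filters out almost all non-duplicates, so the expensive full-sequence comparison is needed only for the few candidates that share both values.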
Three attributes are considered when comparing repetitive sequences to verify that a sequence is unique. The first attribute is the length. The second is a hash value determined from the first 4,000 nucleotides of the repeat sequence, or from the entire sequence if its length is less than 4,000. The third is the sequence itself, which is compared against suspicious source sequences only if their lengths and hash values are the same. Ultimately, the information contained in the original input data set is inserted into the database.

3) Resolution of log files: When repetitive sequences are inserted, some output is written to log files. These log files help to elucidate the specific characteristics of the input data sets. The most important data in the log files are the clues obtained in cross-checking. The results of the cross checks are not examined at run time; instead, the corresponding sequence identifiers are recorded to improve the throughput of the insertion of repetitive sequences. These sequence identifiers can then be batch processed against the database to determine the related organisms. Each log file also briefly summarizes the total number of processed records and nucleotide positions, the matched records, and the elapsed time.

4) Searching tools: This work has also designed four basic tools with web interfaces to enable users to search the database. These tools are search by features, search by range, search by pattern, and look up by Accession Number or Identification (AC/ID). All the interfaces follow the suggestions of local biologists to ensure that they are comfortable for users.

(a) Search by features: This tool is useful for those who basically know what they are seeking, in terms of the fields provided in the web interface. Fields with an asterisk are currently not functional. When users choose an organism of interest, they are given three options (single value, range of values, and arbitrary classification) that apply to size, copy number and composition. “Larger than” or “less than” conditions are obtained by filling in only one range field and leaving the other blank.

(b) Search by range: The interface allows a search based on positions in chromosomes. JavaScript is used to change dynamically the optional values in the chromosome field to match those of the organism chosen by the user. Multiple choices are allowed in the chromosome field. “Larger than” or “less than” conditions on a position work as in search by features.

(c) Search by pattern: The interface queries repetitive elements that match a given pattern. A more effective method for searching repetitive sequences using regular expressions is also provided. This helps users to compare exact or fragmented repetitive sequences with the query sequence that they provide.

(d) Look up by AC/ID: This tool serves users who already know the accession numbers or sequence identifiers. Some users may occasionally store accession numbers or sequence identifiers. These users can flexibly and quickly look up information using this tool.

The interface for these four types of queries can be found at http://rsdb.csie.ncu.edu.tw.

5) Flow of search: In the presented system architecture, “Search DB” in Fig. 3 is directly connected to the WWW server. “Search DB” mainly caches the search results produced by the searching tools, except search by pattern and look up by AC/ID. A specific physical storage structure is used to maintain the cache.
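As an illustration of how one-sided range conditions and result caching might fit together, the sketch below builds a parameterized SQL statement from optional range fields and caches the result under the generated statement. It is only a sketch under stated assumptions: the original system uses PHP and Oracle, whereas this example uses Python with an in-memory SQLite database, and the table name `repseq`, its columns, the inclusive bounds, and the class name `CachedSearch` are all hypothetical.

```python
import sqlite3


def build_query(organism, min_len=None, max_len=None):
    """Build a parameterized SQL statement; a blank field adds no condition."""
    clauses, params = ["organism = ?"], [organism]
    if min_len is not None:
        clauses.append("length >= ?")  # inclusive bound is an assumption
        params.append(min_len)
    if max_len is not None:
        clauses.append("length <= ?")
        params.append(max_len)
    sql = ("SELECT seq_id, length FROM repseq WHERE "
           + " AND ".join(clauses) + " ORDER BY seq_id")
    return sql, tuple(params)


class CachedSearch:
    """Caches result rows keyed by the generated statement and its parameters."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.hits = 0

    def search(self, organism, min_len=None, max_len=None):
        key = build_query(organism, min_len, max_len)
        if key in self.cache:  # same search conditions were seen before
            self.hits += 1
            return self.cache[key]
        sql, params = key
        rows = self.conn.execute(sql, params).fetchall()
        self.cache[key] = rows
        return rows


def demo_connection():
    """Create a tiny hypothetical repeat table for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE repseq (seq_id INTEGER, organism TEXT, length INTEGER)")
    conn.executemany("INSERT INTO repseq VALUES (?, ?, ?)",
                     [(1, "CE", 25), (2, "CE", 120), (3, "SC", 40)])
    return conn


searcher = CachedSearch(demo_connection())
print(searcher.search("CE", min_len=100))  # queries the database
print(searcher.search("CE", min_len=100))  # served from the cache
```

Keying the cache on the generated statement means that two users who enter the same conditions share one cached result, which is the behavior the text attributes to “Search DB”.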
Using the cache, users can more efficiently browse the search results page by page. Repeat sequence data are rarely updated, which is why the cache does not need to be dropped at the end of a user’s session. Only an out-of-date cache must be deleted according to a schedule. Thus, when the next user enters the same search conditions, the results may be found in “Search DB”, significantly reducing the server load on RSDB when the hit rate of the search conditions becomes high. The flow of a search is briefly described below.

Step 1. Obtain form data from the user.
Step 2. Generate a dynamic SQL query statement according to the search conditions.
Step 3. Determine whether a cached result already exists in “Search DB”. If so, go to Step 6.
Step 4. Launch a remote query from “Search DB” to perform a search in RSDB.
Step 5. Store the search result back to “Search DB”.
Step 6. Retrieve the necessary description of each record.
Step 7. Generate the HTML pages.

III. DATA MINING IN REPETITIVE SEQUENCES USING TRANSCRIPTION FACTOR BINDING SITES

A. Background

The TRANSFAC database includes 4,965 site sequences and 2,837 factor entries. Most sites are also consensus patterns. The data in TRANSFAC exhibit the following features. A transcription factor binding site accession number may have different consensus sequences. Different binding site accession numbers may share a single consensus sequence. Wildcard characters such as ‘M’ or ‘W’ used in TRANSFAC cause one sequence to cover other sequences, as described below. Genome sequences are strings of A, C, G and T. The symbols used in addition to A, C, G, and T are the following:

W: A or T
S: C or G
R: A or G
Y: C or T
K: G or T
M: A or C
B: C, G, or T
D: A, G, or T
H: A, C, or T
V: A, C, or G
N: A, C, G, or T

Small consensus sequences may appear within larger ones. The proposed approach depends on a preprocessing step because TRANSFAC records complex characteristics of transcription factor binding sites.

1) Properties of repetitive sequences in the Repeat Database: Repetitive sequences in the repeat database are of one of the following four types.

1. Minisatellite repeats: Variable number tandem repeats (VNTRs). Each repeat sequence of this type is from 10 to 60 base pairs long. This type appears from five to 50 times in a sequence.
2. Microsatellite repeats: Each repeat of this type has a length ranging from one to four base pair units. This type is repeated 10-20 times.
3. Interspersed genome-wide repeats: Short Interspersed Nuclear Elements (SINEs), in which each repeat is less than 280 base pairs long, and Long Interspersed Nuclear Elements (LINEs), in which each repeat is from six to 8,000 base pairs long. This type repeats 50,000 to 100,000 times.
4. Inverted repeats: Repetitive sequences are inversions of each other. For example, the following two repetitive sequences are inverted.
5’ GATTC---GAATC 3’
3’ CTAAG---CTTAG 5’

The repetitive sequences in the experiments performed in this study include direct and inverted repeats that are 20 base pairs long or longer.

2) Properties of the data in TRANSFAC: For more details of the properties of data in TRANSFAC, readers should refer to [16].

3) Significance level: Measurements of correlation and independence are defined as follows [14].

Definition 1 (correlated): Let s be a minimum support, t be a significance level, A be a set of items, and B be an item. The rule A=>B is said to be correlated if it satisfies the following two conditions.
1. The support exceeds s.
2. The significance level exceeds t.

Definition 2 (independent): Let s be a minimum support, t be a significance level, A be a set of items, and B be an item. The rule A=>B is said to be independent if it satisfies the following two conditions.
1. The support exceeds s.
2. The significance level does not exceed t.

B. The Proposed Approach

The first stage is preprocessing and a mapping between the transcription factor binding sites in TRANSFAC and the repetitive sequences in RSDB. Next, Apriori and AprioriTid [17] are applied to mine association rules from the combinations of transcription factor binding sites in repetitive sequences. Chi-square is then used to select certain rules. Finally, the redundant rules are pruned and the remaining rules are structured. The steps of the proposed approach are summarized as follows.

(1) Determine the number of item sets of the transcription factor binding sites in TRANSFAC.
(2) Map the categorical binding sites to a set of transcription factor names.
(3) Find the combinations of transcription factors in repetitive sequences.
(4) Apply the data mining approach to generate association rules.
(5) Determine interesting rules using the Chi-square significance measure.
(6) Prune redundant rules [18].
(7) Classify rules into those that cover and those that do not cover sets.
(8) Partially classify repetitive sequences using the mined association rules.

TABLE I
PARTIAL COMBINATIONS OF TRANSCRIPTION FACTOR BINDING SITES FOR C. ELEGANS, HUMAN CHROMOSOME 22, YEAST, ARCHAEA, BACTERIA, AND VIRUS. (APPENDIX A LISTS THE ABBREVIATIONS OF THE ORGANISMS.)

Genome   Total Repetitive   Match    No       More Than    Average   Ratio
Name     Sequences          One      Match    One Match    Factors   (%)
CE       454927             73881    29962    351084       4.8       77.17
HS 22    1347364            47159    22211    1277994      7.6       94.85
SC       4329               305      338      3686         22.5      85.14
BS       700                73       27       600          11.5      85.71
HI       788                93       55       640          7.3       81.22
HP25     713                98       25       590          8.3       82.75
HPJ9     721                88       33       600          6.3       83.22
MG       373                26       16       331          6.7       88.74
MT       4932               784      171      3977         5.1       80.64
EC       1897               188      60       1649         8.8       86.93
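As a worked check on Table I, the Ratio column appears to be the “More Than One Match” count divided by the total number of repetitive sequences, and the three match columns appear to sum to the total. The short sketch below reproduces this for three rows copied from the table; the variable and function names are illustrative only, and agreement to two decimals is checked only for these rows.

```python
# Rows copied from Table I:
# genome -> (total, match_one, no_match, more_than_one, reported_ratio_percent)
TABLE_I = {
    "CE": (454927, 73881, 29962, 351084, 77.17),
    "HS 22": (1347364, 47159, 22211, 1277994, 94.85),
    "BS": (700, 73, 27, 600, 85.71),
}


def ratio_percent(total, more_than_one):
    """Percentage of repeats that contain more than one binding site."""
    return 100.0 * more_than_one / total


for genome, (total, one, zero, many, reported) in TABLE_I.items():
    # The three match categories partition the repeats of a genome.
    assert one + zero + many == total
    print(genome, round(ratio_percent(total, many), 2), reported)
```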

C. Results

1) Preprocessing and mapping between the data in RSDB and in TRANSFAC: The transcription factor binding sites in TRANSFAC are prepared first. Accordingly, the proposed approach requires preprocessing. Combinations of transcription factor binding sites in the repetitive sequences in our RSDB are then found. This paper focuses primarily on the repetitive sequences of C. elegans, human chromosome 22, Yeast and several bacteria. Table I summarizes the results of preprocessing. Each row refers to a genome. The “Total Repetitive Sequences” column indicates how many repetitive sequences of a genome are used in the experiments. “Match One”, “No Match” and “More Than One Match” give the numbers of repeat sequences that contain one binding site, no binding site, and more than one binding site, respectively. The column “Average Factors” states the average number of transcription factor binding sites found in a repeat sequence; it is computed from the combination of transcription factors found in each repeat sequence. The final column, “Ratio”, represents the ratio of the number of repetitive sequences that contain more than one binding site to the total number of repetitive sequences in a genome. For example, the proportion 77.17% for C. elegans indicates that 351,084 repetitive sequences, each containing more than one binding site, will be used to mine associations.

The method of mining associations from the combinations of transcription factor binding sites found above is as follows. Consider a large database of transactions, where each transaction consists of a set of items. An association rule is an expression of the form A=>B, where A and B are sets of items. The meaning of an association rule is that a transaction in the database that contains A also tends to contain B. For example, 90% of the people who purchase beer also purchase peanuts. Here, 90% is called the confidence of the rule. The support of the rule A=>B is the percentage of transactions that include both A and B.
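The two measures can be sketched in a few lines, treating each repeat sequence as a transaction whose items are the binding sites found in it. The function names and the four toy transactions are made up for illustration; only the site names are borrowed from the paper.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing A that also contain B."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a if n_a else 0.0


# Toy data: each repeat sequence is the set of binding sites found in it.
repeats = [frozenset(s) for s in (
    {"YY1", "R00388"},
    {"YY1", "R00388", "R00231"},
    {"YY1"},
    {"R00231"},
)]

print(support(repeats, {"YY1", "R00388"}))       # 2 of 4 transactions
print(confidence(repeats, {"YY1"}, {"R00388"}))  # 2 of the 3 containing YY1
```

A rule would be reported only if both values exceed the user-specified minimum support and minimum confidence.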
The formal statement of the problem is as follows. Let I={i1, i2, ... ,im} be a set of sites, called the item set. Let D be a set of repetitive sequences, where each repeat sequence S, corresponding to a transaction, includes a set of items such that S ⊆ I. A repeat sequence S is said to contain A, a set of items of I, if A ⊆ S. An association rule is an implication of the form A=>B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A=>B holds in the repetitive sequence set D with confidence c if c% of the repetitive sequences in D that contain A also contain B. The rule A=>B has support s in the repetitive sequence set D if s% of the repetitive sequences in D contain A ∪ B. In the experiments, the minimum support is set to 10%. The association rules are generated only if they have higher support and confidence than specified by the user. Apriori and AprioriTid [17] are then applied to mine association rules. Very many association rules are generated, so human identification of interesting and useful ones is difficult. Consequently, Chi-square is applied to prune the discovered association rules, removing the insignificant ones.

2) Pruning and structuring association results: Rules are selected using the Chi-square significance test. The set of discovered rules is still large and unreadable after the Chi-square significance test is applied. The redundant rules are therefore pruned, and the remaining rules are structured into those that cover and those that do not cover the set. The reasons for pruning and structuring are summarized as follows.

1. Discovered rules may be uninteresting for many reasons [19].
2. Rules can refer to uninteresting sites or combinations of sites such as transcription factor binding sites.
3. Rules can be redundant.

Three operations are used to process a large collection of rules.

(1) Pruning: reduces the number of insignificant rules.
(2) Structuring: divides the rules into covering and non-covering sets.
(3) Sorting: ranks the rules by confidence.

TABLE II
THE ASSOCIATION RULES MINED AFTER APPLYING CHI-SQUARE. (APPENDIX A LISTS THE ABBREVIATIONS OF THE ORGANISMS.)

Genome   MiniSup   Cover   Non Cover   Total   Ratio of Partial
Name               Rules   Rules       Rules   Classification
CE       5%        4       6           10      47%
HS 22    28%       4       6           10      79%
SC       31%       5       5           10      77%

TABLE III
PARTIAL ASSOCIATION RULES FOR ARCHAEA, BACTERIA AND VIRUS ARE MINED AFTER APPLYING CHI-SQUARE. (APPENDIX A LISTS THE ABBREVIATIONS OF THE ORGANISMS.)

Genome   Prune   Non Cover   Cover   Total
Name     Rules   Rules       Rules   Rules
BS       63      103         55      158
HI       3       3           3       6
HP25     0       3           1       4
HPJ9     18      11          21      32
MG       19      17          11      28
MT       0       5           1       6
EC       0       1           1       2
CP       0       3           1       4
MP       0       3           5       8
RP       3       10          14      24

Chi-square significance is not affected by simple redundancy and strict redundancy. For example, the rule AB=>C is redundant with respect to A=>BC; the rule AB=>C is tested, while A=>BC is not. Under strict redundancy, the rule A=>B is redundant with respect to A=>BC, and A=>B is tested. Redundancy also arises between rules such as A=>B and AC=>B. The rule A=>B is kept and the rule AC=>B is pruned because AC=>B is covered by the rule A=>B. As another example, consider the rule MAMAG=>AAAG. The binding site on the right-hand side is covered by that on the left-hand side because M may be A or C. Therefore, the rule is put into the cover set.

Tables II and III present the association rules mined after applying Chi-square to the data in Table I. In Table II, the significance level is set to 95%, and the “MiniSup” column refers to the minimum support used. The “Cover Rules” and “Non Cover Rules” columns represent the numbers of rules in the cover and non-cover sets, respectively, after pruning and structuring. “Total Rules” represents the total number of rules in the cover and non-cover sets. The “Ratio of Partial Classification” represents the ratio of repetitive sequences classified by “Total Rules”. For example, only 47% of the repetitive sequences in C. elegans are classified using the ten mined rules. Conversely, 53% of the repetitive sequences cannot be classified by the rules. Consequently, the ratio can also be used to measure whether the mined rules are representative. Similarly, Table III summarizes the data for archaea, bacteria and virus. The minimum support is set to 10%.
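The cover relation used when structuring rules, as in the example MAMAG=>AAAG, can be sketched with the IUPAC symbol table given in the Background. The definition used here is an assumption made for illustration: a longer consensus is taken to cover a shorter one if some window of the longer pattern is position-wise compatible with the shorter pattern, where two symbols are compatible when they share at least one nucleotide. The function names are hypothetical.

```python
# IUPAC nucleotide codes, as listed in the Background section.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def symbols_compatible(x, y):
    """True if the two IUPAC symbols share at least one nucleotide."""
    return bool(set(IUPAC[x]) & set(IUPAC[y]))


def covers(longer, shorter):
    """True if some window of `longer` is compatible with `shorter`."""
    n, m = len(longer), len(shorter)
    return any(
        all(symbols_compatible(longer[i + j], shorter[j]) for j in range(m))
        for i in range(n - m + 1)  # empty range when shorter is the longer one
    )


# The window AMAG of MAMAG matches AAAG because M may be A.
print(covers("MAMAG", "AAAG"))
```

A rule whose right-hand site is covered by its left-hand site in this sense would be placed in the cover set, and all other rules in the non-cover set.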

IV. DISCUSSION

Experiments were performed on several archaea and bacteria to confirm that the association rules found in repetitive sequences also appear in their genomes. Table IV presents some of the experimental results. The column “Association Rules in Repeats” presents association rules mined from the repetitive sequences. The column “Copies of Repeats” states how many copies of repetitive sequences are found in a genome. The column “Occurrences in a Genome” states the number of associations found in a genome for a specific window size. The window size is defined as the offset between the occurrences of the transcription factor binding sites. For example, two rules, YY1=>R00231\R00232\R00335\R00668\R00669\R00761\R01081\R01345\R01445\R01446\R02955\R02957 and YY1=>R00388, are found in the repetitive sequences of the organism “PA”. Thirty-nine copies of repetitive elements are found to contain the former association in this example. For window sizes of one, five, and ten, the numbers of occurrences of the transcription factor binding sites in the former association, YY1 and R00231, are 0, 34 and 105, respectively. Notably, few associations with a small window, such as a window of size one, are found in a genome in Table IV. The corresponding site sequences are presented in Appendix B. Many associations of the occurrences of the transcription factor binding sites are observed with a large window. This result seems reasonable, because most transcription factor binding sites are separated by a large distance. However, several associations with a small distance between two sites are found in the genome “MG”. This phenomenon will be studied further in the future.

Two problems must be addressed. First, one study [1] estimates that repetitive elements occupy around 43% of the human genome. The analysis presented reveals that repetitive elements in the genome may have been very important in evolutionary genomics. Statistical analyses of the repetitive elements in RSDB are now under way. The statistical data will also yield important findings, such as on the genetics of organisms, in the near future. For example, the relationship between genes and repetitive elements will be elucidated to the advantage of biologists. With such information, biologists may understand more deeply where most repetitive elements are located, for example in exons, introns or other areas. The second task is to create a statistical model of the number of significant hits of the mined associations. This study is ongoing, and a formal model is now being built for this purpose.

V. CONCLUSION

A database of repetitive elements has been developed to provide users not only with query tools but also with useful statistics. The statistics relate to all the data in RSDB. The authors hope that these statistics will help biologists discover a whole new area of biology, including predicting transposons. Combinations of transcription factor binding sites are found in the repeat sequences in the developed RSDB. Each repeat sequence is mapped onto a transaction, and combinations of transcription factor binding sites are mapped onto items of a transaction. Data mining methods are used to mine the associations from the combinations of transcription factor binding sites in repeat sequences. Very many association rules are generated, making the identification of the interesting and useful ones by a human user very difficult. A Chi-square significance test is used to remove the insignificant rules; finally, the redundant rules are pruned.

TABLE IV
PARTIAL ASSOCIATION RULES IN A SMALL SCALE (REPETITIVE SEQUENCES) AND GENOME SCALE. (APPENDIX A LISTS THE ABBREVIATIONS OF THE ORGANISMS.)

Columns: Genome; Association Rule in Repeats; Copies of Repeats; Occurrences in a Genome for Window=1, Window=5, Window=10.

PA   YY1=>R00231\R00232\R00335\R00668\R00669\R00761\R01081\R01345\R01445\R01446\R02955\R02957      39     0      34     105
PA   YY1=>R00388      41     0      48     175
PA   R00388=>R00231\R00232\R00335\R00668\R00669\R00761\R01081\R01345\R01445\R01446\R02955\R02957      37     0      37     64
TM   c-Ets-2=>R03553      272    1506   1700   2019
TM   R03553=>R01230      220    0      56     332
TM   c-Ets-2=>R01230      218    0      66     206
MG   TCF-1alpha\TCF-1\TCF-1F\TCF-1G\TCF-1E\TCF-1C\TCF-1B\TCF-1A\TCF-2alpha\LEF-1=>MNB1a      208    3785   3954   4557


APPENDIX

A. Abbreviation of Organisms

| Abbreviation | Organism |
|---|---|
| CE | Caenorhabditis elegans |
| HS | Homo sapiens |
| SC | Saccharomyces cerevisiae |
| BS | Bacillus subtilis |
| HI | Haemophilus influenzae Rd |
| HPJ9 | Helicobacter pylori J99 |
| HP25 | Helicobacter pylori 26695 |
| MG | Mycoplasma genitalium |
| MT | Mycobacterium tuberculosis H37Rv |
| EC | Escherichia coli |
| HCV | Hepatitis C virus |
| HIV1 | Human immunodeficiency virus type 1 |
| JEV | Japanese encephalitis virus |
| AA | Aquifex aeolicus |
| AP | Aeropyrum pernix K1 |
| AR | Archaeoglobus fulgidus |
| CP | Chlamydia pneumoniae AR39 |
| CT | Chlamydia trachomatis |
| MP | Mycoplasma pneumoniae M129 |
| PH | Pyrococcus horikoshii OT3 |
| RP | Rickettsia prowazekii strain Madrid E |
| S | Synechocystis PCC6803 |
| TM | Thermotoga maritima |
| TP | Treponema pallidum subsp. pallidum |
| UU | Ureaplasma urealyticum |
| PA | Pyrococcus abyssi |

B.

Association Rules:
YY1=>R00231\R00232\R00335\R00668\R00669\R00761\R01081\R01345\R01445\R01446\R02955\R02957
YY1=>R00388

| Site Names | Site Sequences |
|---|---|
| YY1 | TATTT, CCWTNTTNNNW, CATTA, CATTT |
| R00388 | TCAAT |
| R00231\R00232\R00335\R00668\R00669\R00761\R01081\R01345\R01445\R01446\R02955\R02957 | ATTGG |
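Site sequences such as CCWTNTTNNNW contain IUPAC degenerate nucleotide codes (W stands for A or T, N for any base). A minimal sketch of how such a site could be located in a DNA string is given below; the helper names `site_to_regex` and `find_site` and the test sequences are hypothetical, only the forward strand is scanned, and this is not the matching procedure used to build RSDB.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_to_regex(site):
    """Compile a site sequence into a regex; the lookahead allows overlapping matches."""
    body = "".join(IUPAC[base] for base in site.upper())
    return re.compile("(?=(" + body + "))")


def find_site(site, dna):
    """Return the 0-based start positions of every occurrence of `site` in `dna`."""
    return [m.start() for m in site_to_regex(site).finditer(dna.upper())]


# Toy sequences made up for illustration.
print(find_site("CCWTNTTNNNW", "GGCCATGTTACGAGG"))
print(find_site("TATTT", "TATTTATTT"))
```

The zero-width lookahead is a design choice: a plain pattern would skip an occurrence that overlaps the previous match, as happens with the two TATTT occurrences in the second toy sequence.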

ACKNOWLEDGMENTS

The authors thank the National Science Council of the Republic of China and Asia Bioinnovation Corporation for financially supporting this research. Prof. Ueng-Cheng Yang and Dr. Yu-Chung Chang are appreciated for their valuable discussions regarding molecular biology. Prof. Cheng-Yan Kao is also commended for his suggestions regarding our database. Additional thanks are given to Dr. Adam Yao and Pei-Ing Hwang for actively encouraging the initiation of this project.

REFERENCES

[1] W.H. Li, Z. Gu, H. Wang, and A. Nekrutenko, “Evolutionary analyses of the human genome,” Nature, vol. 409, Feb. 2001, pp. 847-849.
[2] T. A. Brown, GENOMES. John Wiley & Sons, Inc., 1999, pp. 135-141.
[3] C. Burks, “Molecular Biology Database List,” Nucleic Acids Res., vol. 27, Issue 1, 1999, pp. 1-9.
[4] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, B.A. Rapp and D.L. Wheeler, “GenBank,” Nucleic Acids Res., vol. 28, 2000, pp. 15-18.
[5] W. Baker, A. van den Broek, E. Camon, P. Hingamp, P. Sterk, G. Stoesser and M.A. Tuli, “The EMBL nucleotide sequence database,” Nucleic Acids Res., vol. 28, 2000, pp. 19-23.
[6] Y. Tateno, S. Miyazaki, M. Ota, H. Sugawara and T. Gojobori, “DNA data bank of Japan (DDBJ) in collaboration with mass sequencing teams,” Nucleic Acids Res., vol. 28, 2000, pp. 24-26.
[7] J. Jurka, “Repbase update: a database and an electronic journal of repetitive elements,” Trends Genetics, vol. 16(9), 2000, pp. 418-420.
[8] C.M. Ruitberg, D.J. Reeder and J.M. Butler, “STRBase: a short tandem repeat DNA database for the human identity testing community,” Nucleic Acids Res., vol. 29(1), 2001, pp. 320-322.
[9] T. Heinemeyer, X. Chen, H. Karas, A. E. Kel, O. V. Kel, I. Liebich, T. Meinhardt, I. Reuter, F. Schacherer and E. Wingender, “Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms,” Nucleic Acids Res., vol. 27, 1999, pp. 318-322.
[10] T. Heinemeyer, E. Wingender, I. Reuter, H. Hermjakob, A. E. Kel, O. V. Kel, E. V. Ignatieva, E. A. Ananko, O. A. Podkolodnaya, F. A. Kolpakov, N. L. Podkolodny and N. A. Kolchanov, “Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL,” Nucleic Acids Res., vol. 26, 1998, pp. 362-367.
[11] A. Brazma, J. Vilo and E. Ukkonen, “Finding Transcription Factor Binding Site Combinations in Yeast Genome (Extended Abstract). Computer Science and Biology,” in Proceedings of the German Conference on Bioinformatics GCB '97 D, 1997, pp. 57-59.
[12] R. Agrawal, T. Imielinski and A. Swami, “Mining Associations between Sets of Items in Large Databases,” Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Washington D.C., 1993, pp. 207-216.
[13] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[14] B. Liu, W. Hsu and Y. Ma, “Pruning and Summarizing the Discovered Associations,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 1999, pp. 125-134.
[15] K. Ali, S. Manganaris and R. Srikant, “Partial Classification Using Association Rules,” KDD, 1997, pp. 115-118.
[16] J.T. Horng and W.F. Cho, “Predicting regulatory elements in repetitive sequences using transcription factor binding sites,” Electronic Journal of Biotechnology, vol. 3, Dec. 2000.
[17] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” in Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994, pp. 487-499. Expanded version available as IBM Research Report RJ9839, 1994.
[18] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen and H. Mannila, “Pruning and grouping discovered association rules,” in MLnet Workshop on Statistics, Machine Learning, and Discovery in Databases, Heraklion, Crete, Greece, 1995, pp. 47-52.
[19] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen and A. I. Verkamo, “Finding Interesting Rules from Large Sets of Discovered Association Rules,” CIKM, 1994, pp. 401-407.
[20] J.T. Horng, H.D. Huang, and C.C. Huang, “Mining putative regulatory elements in gene promoter regions,” in Proceedings of the German Conference on Bioinformatics, 2001, pp. 90-95.

Jorng-Tzong Horng was born in Nantou, Taiwan, on April 10, 1960. He received the Ph.D. degree in Computer Science and Information Engineering from National Taiwan University, Taipei, in April 1993. In 1993, he joined the Department of Computer Science and Information Engineering, National Central University, Jungli, Taiwan, where he became Professor in 2002. His current research interests include database systems, data mining, genetic algorithms, and bioinformatics.
