A Hidden Markov Model web application for analysing bacterial genomotyping DNA microarray experiments
Richard Newton (1), Jason Hinds (2) and Lorenz Wernisch (1)
(1) School of Crystallography, Birkbeck College, University of London, Malet St., London WC1E 7HX, UK
(2) Bacterial Microarray Group, Department of Cellular and Molecular Medicine, St George’s Hospital Medical School, London SW17 0RE, UK
Running Title: Hidden Markov Model for bacterial DNA experiments
Acknowledgments The TB data set was provided by Jaqueline Inwald, Veterinary Laboratories Agency, the YP data set by Stewart Hinchliffe, London School of Hygiene and Tropical Medicine and the SA data set by Jodi Lindsay, St. George’s Hospital Medical School. RN was funded as part of a Wellcome Trust Functional Genomics programme grant. The Wellcome Trust funds the BµG@S multi-collaborative microbial pathogen microarray facility under its Functional Genomics Resources Initiative.
Corresponding Author: Richard Newton, School of Crystallography, Birkbeck College, University of London, Malet St., London WC1E 7HX, UK email:
[email protected] telephone: +44 (0)20 7631 6869 fax: +44 (0)20 7631 6803
Figure Legends
Figure 1 - Plot of M values with genome position. A typical plot of M values with genome position for an array from the TB data set (test strain 2122/97).
Figure 2 - Density distribution of M values. Density distribution of M values showing two normal distributions fitted by the HMM (test strain 2122/97).
Figure 3 - Plot of M values with genome position in the RD8 region of M. tuberculosis H37Rv. Plot of M values with genome position in the RD8 region of M. tuberculosis H37Rv, showing the gene assignments of the HMM and the position of a cut-off of minus three standard deviations.
Abstract
Whole genome DNA microarray experiments compare the gene content of different species or strains of bacteria. A statistical approach to analysing the results of these experiments was developed, based on a Hidden Markov model, that takes adjacency of genes along the genome into account when calling genes present or absent. The model was implemented in the statistical language R and applied to three data sets. The method is numerically stable with good convergence properties. Error rates are reduced compared to approaches that ignore spatial information. Moreover, the HMM circumvents a problem encountered in a conventional analysis: determining the cut-off value to use to classify a gene as absent. An Apache Struts web interface for the R script was created for the benefit of users unfamiliar with R.
Availability: The application may be found at http://hmmgd.cryst.bbk.ac.uk/hmmgd. Source code illustrating how to run R scripts from a Struts-based web application is available from the authors on request. The application is also available for local installation if required.
Introduction
DNA microarrays are commonly used to monitor gene expression profiles at the mRNA level. However, microarrays have also become an established technique in bacterial comparative genomics, in which comparisons are made at the genomic DNA rather than mRNA level, studying gene content rather than gene expression. In numerous studies of this type, DNA microarrays have been used to compare a test strain, of unknown gene content, to a sequenced reference strain.[1–11] Bacterial species or strains may vary phenotypically in their degree of pathogenicity, virulence, transmission or host specificity. The ability to compare genomes using microarrays makes it possible to identify the genes potentially associated with these differences. However, there are problems intrinsic to the method, namely sequence divergence and cross-hybridization, plus the noise inherent in any use of microarray technology, that require a statistical analysis of the data.
In this paper an analysis is described based on a first order Hidden Markov model (HMM) in which each gene is represented by a hidden variable that can be in one of two states, absent or present. The acquisition, loss or rearrangement of DNA in bacterial genomes commonly involves whole blocks of contiguous genes. For example, in the comparison of Mycobacterium bovis AF2122/97 with Mycobacterium tuberculosis H37Rv there are around 80 absent genes, but arranged in just 11 segments. Two of these segments contain one gene only; the remaining segments range in size from 3 genes up to 17 genes.[24] This feature of bacterial genomes results from events such as the horizontal transfer of DNA, recombination between repetitive DNA or phage-mediated events. The state of a gene therefore often depends on the states of the adjacent genes. The HMM exploits this fact and we show that this reduces error rates compared to approaches that ignore spatial information.
A further advantage of an HMM over a conventional analysis, which calls genes present or absent based on cut-off values, is that the latter requires either an arbitrary cut-off value or additional empirical determination of such values. In contrast, an HMM provides a probability for each of the two possible states of a gene, absent or present, making a decision straightforward without the need for further experimental work.
The HMM was written in the statistical language R.[12] R is widely used in bioinformatics. As a high-level mathematical scripting language it allows the rapid development of programs and now includes a vast range of preprogrammed functions. The Bioconductor[13] project contains many packages specifically for bioinformatics. A disadvantage of R is that many potential users of bioinformatics programs written in R, who come from a non-bioinformatics background, are unfamiliar with the language. A solution to this problem is to provide a user-friendly web interface for R scripts.
Whole genome DNA microarrays
Whole genome DNA microarrays are typically constructed so that there is a reporter element on the array representing each gene in the genome of the sequenced reference strain. Reporter elements are specific fragments of DNA able to hybridize to complementary DNA present in either of the two genomic DNA samples being compared in the experiment. Following labeling of the two DNA samples with different fluorescent dyes, the two labeled samples are co-hybridized to a single array. Measurement of the fluorescent intensities of the two dyes bound to a particular reporter element on the microarray enables the relative abundance of hybridizing DNA in each sample to be determined. For each reporter element, the signal intensity of the test strain relative to the reference strain indicates whether a gene that is present in the reference strain is present, absent or divergent in the test strain. The result for a gene is expressed as a log2 ratio of the test strain signal intensity compared to the reference strain signal intensity, so a lower ratio is indicative of divergence or absence of that particular gene in the test strain. Typical experiments may compare dozens of different test strains to a reference strain and generate large amounts of data.
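As a small worked example of the log2 ratio described above, the following Python sketch computes the value for two reporter elements. It is written in Python purely for illustration (the original analysis is in R); the function name, gene labels and intensity values are hypothetical and not taken from the data sets.

```python
import math

def log_ratio(test_intensity, ref_intensity):
    """Return log2(test / reference) for one reporter element."""
    return math.log2(test_intensity / ref_intensity)

# Hypothetical background-corrected intensities: (test strain, reference strain).
spots = {
    "geneA": (1000.0, 1000.0),  # equal signal in both strains
    "geneB": (125.0, 1000.0),   # much weaker signal in the test strain
}

# One log2 ratio per gene; strongly negative values suggest absence or divergence.
m_values = {gene: log_ratio(test, ref) for gene, (test, ref) in spots.items()}
```

Equal intensities give a log2 ratio of zero, while an eight-fold weaker test signal gives a value of minus three.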
In practice the data are not as clear-cut as might be expected, and genes do not segregate cleanly into present and absent categories. Generally the microarray process introduces experimental noise, but there are additional technical artifacts, caused by cross-hybridization and sequence divergence, that reduce the ability to clearly determine whether genes are present or not. Cross-hybridization results from an inability to select reporter elements that are truly unique for each gene. For a small minority of reporters, the signal measured may not be specific for the desired target gene but may also include signal from paralogues in the genome that are able to cross-hybridize to the reporter to some degree. This can have two effects: 1) if the target gene is absent but cross-hybridizing paralogues are not, then the ratio for the target gene will not be as low as for absent genes not affected by cross-hybridization, 2) if a cross-hybridizing paralogue is absent but the target gene is not, then the ratio for the target gene will be reduced and the target gene may incorrectly appear absent, due to the loss of cross-hybridizing DNA. Sequence divergence between the test and reference strains also contributes to a lack of clarity. A gene in the test strain with a reduced amount of sequence similarity to the reference strain will generate a lower ratio than a gene with identical sequence in the test and reference strains.
Various analysis approaches have been used to determine the genes considered absent or highly divergent in the test strain. These range from a fixed arbitrary ratio cut-off set at a two-fold difference,[4, 11] to an empirically determined cut-off based on independent experimental determination of presence or absence[5] or on the availability of comparative sequence information.[2, 7, 8] Other methods used to analyse comparative genomics data have accounted for variation in the spread of data to set more dynamic and appropriate cut-offs.
These include using the variation of ratios for sub-sets of genes considered universally conserved or absent in all the test strains to determine the cut-offs appropriate for a particular strain[6] and assessing shape properties of the normalized ratio distribution to assign cut-offs to detect outliers in the tail of the distribution that represent absent or highly divergent genes.[14]
If several strains of a bacterium have been sequenced then it is possible to design microarrays with reporter elements representing more than just the genes in the genome of the reference strain in the experiment.[15] The genomes of all the sequenced strains can be used in the design. As well as absences, the microarray can now measure additions, that is, those genes that are present in the test strain but absent in the reference strain. The additions have positive log ratio values.
Hidden Markov Model
Hidden Markov Models (HMMs) were first applied in the field of speech recognition[16] but in recent years have found numerous uses in the field of bioinformatics, in particular for homology searching and gene prediction.[17, 18] Hidden Markov Models are a method for modeling an observed, sequential signal in terms of a sequence of hidden variables in order to identify the underlying sequence of source states producing the observed signal. In the context of this paper the observed signal is the sequence of log ratio values of genes along the genome, measured in the microarray experiment. Fridlyand et al.[19, 20] have used an HMM approach to analyse microarray comparative genomic hybridization data for copy number alterations in tumours. This paper describes a related model designed specifically for analysing bacterial genomotyping microarray data.
Apache Struts
In order to create a user-friendly web interface for the HMM R script we chose to use Apache Struts.[21] Struts is an open source and widely used Java-based framework for constructing web applications. The Struts framework separates out the three main components of a web application: the View (the way in which information is presented to the user), the Model (the data processing) and the Controller (controlling the flow of the application). This produces a well-organised, stable and extensible web application.
Methods
Data
Three sets of microarray data from BµG@S (Bacterial Microarray Group at St. George’s Hospital Medical School) were used in the development of the Hidden Markov model. The arrays in all three data sets contained approximately 3000-4000 gene-specific reporter elements for genomes of 3-4 Mb in size, so on average there was one reporter per 1 kb of genome. Array designs can be accessed in ArrayExpress (http://www.ebi.ac.uk/arrayexpress) with Accession Numbers A-BUGS-1 (TB), A-BUGS-11 (YP) and A-BUGS-16 (SA).
In the first set, referred to as the TB data set, the microarray used was based on the sequenced genome of Mycobacterium tuberculosis H37Rv.[22] Three strains of Mycobacterium bovis, namely BCG Pasteur, AN5 and 2122/97, were compared to the M. tuberculosis H37Rv reference strain with three replicate hybridizations performed for each strain.[7] Figure 1 plots the log2 ratio values (denoted by M in the following[23]) against position along the genome for a microarray from the TB data set (test strain 2122/97). Each point on the graph corresponds to a single gene. The majority of genes are present in the test strain and are found in a distribution centred around M = 0. Some genes have M values in the region below this distribution and it is these genes which are either absent in the test strain or have a significantly diverged sequence compared to the reference strain gene sequence. A plot of the density of M values from this data set can be seen in Figure 2.
The TB data set was used to assess the results from the HMM as there were already independent comparative studies for two of the strains analysed. Mycobacterium bovis 2122/97 used in this experiment has been sequenced[24] and therefore any gene absences or sequence divergence indicated by the microarray results can be validated at the sequence level. In addition, gene absences in M. bovis BCG Pasteur have been determined previously in another microarray study.[1] The insertion sequence IS6110 occurs in multiple copies throughout the genome, which cross-hybridize to each other extensively. It has therefore been excluded from further analysis.
In the second data set, referred to as the YP data set, all genes in the genome of Yersinia pestis strain CO-92 were used as the basis of the array. This sequenced reference strain of Y. pestis was compared to 22 further strains of Y. pestis with two replicate hybridizations performed for each strain.[8] The results from a previous analysis of the data,[8] that included PCR analysis of specific genes, were used to assess the performance of the HMM in detecting gene absences.
The third data set, referred to as the SA data set, used microarrays designed from seven strains of Staphylococcus aureus.[15] The data set comprised six microarrays each comparing the same sequenced reference strain to a different sequenced test strain. Because the microarray was designed from more than one sequenced strain, as well as measuring absences in the test strain relative to the reference strain, the experiment can measure additions in the test strain relative to the reference strain.
To correct for spatial variation across an array, two-dimensional loess (locally weighted regression) normalization was applied to the data.[25] To correct for variation in the red/green bias with spot intensity, one-dimensional loess normalization was used.[23] Robust measures of location and scale, the median and the median absolute deviation (MAD), were used for between-array normalization, so that intensity values between arrays become comparable.
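The between-array step can be illustrated with a minimal Python sketch that centres one array's values on their median and divides by their median absolute deviation. This is an assumption about the general form of such a normalization, not a transcription of the procedure used for these data sets: the function names and example values are hypothetical, and any consistency constant applied to the MAD (R's `mad` multiplies by 1.4826 by default) is omitted here.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median (no consistency constant)."""
    centre = statistics.median(values)
    return statistics.median([abs(v - centre) for v in values])

def scale_array(m_values):
    """Centre one array on its median and scale by its MAD."""
    centre = statistics.median(m_values)
    spread = mad(m_values)
    return [(m - centre) / spread for m in m_values]

# Hypothetical values from a single array.
scaled = scale_array([1.0, 2.0, 3.0, 4.0, 10.0])
```

Because the median and MAD are barely affected by a minority of extreme values, the absent genes in the lower tail have little influence on the location and scale estimates.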
Hidden Markov Model
Structure of the HMM
The HMM models the test strain genome as a sequence of hidden variables. Each hidden variable may be in one of two states, depending on whether the corresponding gene is present or absent in the strain. We try to determine the sequence of hidden states from the observed sequence of log ratio values. This type of HMM is described by transition probabilities and emission densities. The four transition probabilities are the probabilities of either staying in the same state or switching to the alternative state when moving between adjacent positions in the genome. A probability density function for the emission of log ratio values is associated with each of the two states.
We assume that genes are ordered according to their position in the genome, each gene defined by a position $t$. The observed log ratio value of a gene at position $t$ is denoted by $y_t$. The hidden variable corresponding to the gene at position $t$ is $X_t$, which can take on values 0 (present) or 1 (absent). The emission probabilities are given by $P(y_t \mid X_t)$. Two normal distributions were used to model the emission probabilities, that is $P(y_t \mid X_t = x_t) = N(y_t; \mu_{x_t}, \sigma_{x_t})$ for $x_t \in \{0, 1\}$. Determining the emission probabilities of the two states involves determining the mean and standard deviation of the two distributions. By including both emission and transition probabilities, the HMM uses not only the log ratio value of an individual gene to classify the gene as absent or present but also takes into account the values of neighbouring genes.
There are two steps to using an HMM to analyze experimental data.[16] Firstly, the model parameters, that is the emission densities and transition probabilities, must be determined from the observed sequence. The model is then used to predict the most probable state sequence that could have generated the observed sequence.
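For reference, the first-order structure described above corresponds to the standard HMM factorisation of the joint probability of the hidden states and the observed log ratios, sketched below. The symbols $T$ (number of genes), $\pi_i$ (probability that the first gene is in state $i$) and $a_{ij}$ (probability of moving from state $i$ to state $j$ between adjacent genes) are notation introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
P(y_1,\dots,y_T,\, x_1,\dots,x_T)
  &= \pi_{x_1}\, N(y_1;\, \mu_{x_1}, \sigma_{x_1})
     \prod_{t=2}^{T} a_{x_{t-1} x_t}\, N(y_t;\, \mu_{x_t}, \sigma_{x_t}), \\
a_{ij} &= P(X_t = j \mid X_{t-1} = i), \qquad \sum_{j} a_{ij} = 1 .
\end{align}
```

With two states there are four $a_{ij}$, matching the four transition probabilities mentioned above, and only two of them are free because each row sums to one.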
For the SA data set, which measures additions as well as absences, a three-state HMM is required, with a state corresponding to genes absent in the test strain but present in the reference strain, a state corresponding to genes present in both strains, and a state corresponding to genes present in the test strain but absent from the reference strain.
Fitting the model, prediction of states
The HMM parameters (emission densities and transition probabilities) are estimated using an expectation maximization (EM) approach.[26] This is an extension of the maximum likelihood method for fitting the parameters of a distribution to a set of data. Essentially, in the E-step, expected values for the hidden state variables are derived using initial estimates of all parameters. In practice this step is implemented using a forward-backward algorithm. A modification of this algorithm as described by Murphy,[27] which helps to reduce problems with numerical underflow, was used. Once expected values of hidden variables are calculated, model parameters can be inferred by maximizing the expected log-likelihood in the M-step. For normal emission densities the M-step is straightforward because simple analytical expressions are obtained for re-estimating the means and standard deviations. The expected values of hidden states are then updated using the re-estimates of the model parameters, in a new E-step, followed by another M-step. The cycle is repeated until the model’s parameter values have stabilized. At each iteration the percentage change from the previous iteration in the log likelihood of the data was calculated and the algorithm was considered to have converged when the percentage change was less than 0.01. Around 10 iterations of the EM algorithm were found to be sufficient for convergence, depending on the data set. Programmed in R this takes about 30 seconds to complete.
Convergence was found to be insensitive to initial estimates of model parameters. The HMM converged quickly with general initial estimates such as means of 0 (present) and -5.0 (absent), standard deviations of 1.0, and transition probabilities all set to 0.5.
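The EM cycle described above can be sketched in Python as follows. This is a generic scaled forward-backward (Baum-Welch) implementation written for illustration, not the authors' R code or necessarily the exact variant described by Murphy; the function names, the toy data and the small floor on the standard deviation are assumptions. The starting values and the 0.01 percentage-change stopping rule follow the text. The sketch assumes every state keeps some posterior weight, as it does for these toy values.

```python
import math

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def forward_backward(y, means, sds, trans, init):
    """E-step: scaled forward-backward pass over one array."""
    n, k = len(y), len(means)
    emit = [[normal_pdf(v, means[s], sds[s]) for s in range(k)] for v in y]
    alpha, scale = [], []
    for t in range(n):
        if t == 0:
            a = [init[s] * emit[0][s] for s in range(k)]
        else:
            a = [emit[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in range(k))
                 for s in range(k)]
        c = sum(a)  # rescaling at each position guards against underflow
        scale.append(c)
        alpha.append([v / c for v in a])
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for r in range(k):
            beta[t][r] = sum(trans[r][s] * emit[t + 1][s] * beta[t + 1][s]
                             for s in range(k)) / scale[t + 1]
    # gamma[t][s]: posterior probability that gene t is in state s
    gamma = [[alpha[t][s] * beta[t][s] for s in range(k)] for t in range(n)]
    # xi_sum[r][s]: expected number of transitions from state r to state s
    xi_sum = [[0.0] * k for _ in range(k)]
    for t in range(n - 1):
        for r in range(k):
            for s in range(k):
                xi_sum[r][s] += (alpha[t][r] * trans[r][s] * emit[t + 1][s]
                                 * beta[t + 1][s] / scale[t + 1])
    return gamma, xi_sum, sum(math.log(c) for c in scale)

def fit_hmm(y, means, sds, trans, init, tol_percent=0.01, max_iter=100):
    means, sds = list(means), list(sds)
    previous = None
    for _ in range(max_iter):
        gamma, xi_sum, loglik = forward_backward(y, means, sds, trans, init)
        if previous is not None and \
                abs(loglik - previous) / abs(previous) * 100 < tol_percent:
            break  # log likelihood changed by less than tol_percent percent
        previous = loglik
        for s in range(len(means)):  # M-step: weighted mean and sd per state
            weight = sum(g[s] for g in gamma)
            means[s] = sum(g[s] * v for g, v in zip(gamma, y)) / weight
            var = sum(g[s] * (v - means[s]) ** 2 for g, v in zip(gamma, y)) / weight
            sds[s] = max(math.sqrt(var), 1e-3)  # floor is a safeguard added here
        trans = [[x / sum(row) for x in row] for row in xi_sum]
        init = gamma[0]
    return means, sds, trans, init

# Toy log ratios: mostly present genes with one block of four absent genes.
y = [0.1, -0.2, 0.0, 0.3, -0.1, -4.2, -3.8, -4.0, -4.1,
     0.2, -0.3, 0.1, 0.0, -0.1, 0.2]
means, sds, trans, init = fit_hmm(
    y, [0.0, -5.0], [1.0, 1.0], [[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
```

Starting from the general initial estimates quoted above, the fitted means move towards the centres of the two groups of values and the self-transition probabilities rise well above 0.5, reflecting the block structure of the toy data.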
Predicting the sequence of hidden states (that is, absence or presence of genes) is a matter of finding the most probable sequence of states, given the model parameters and the observed data, using the Viterbi algorithm.[27]
Because of the relatively high level of noise in microarray experiments, replication of arrays is essential for increasing the reliability of results. The HMM can be run separately on the replicate arrays and the results combined using an ‘all replicates agree’ rule, which assigns a gene as absent if and only if all replicates agree it is absent.
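The prediction step and the replicate rule can be sketched as follows. This is a textbook log-space Viterbi decoder in Python, given for illustration rather than as the authors' R implementation; the function names, parameter values and toy data are assumptions, and the transition and initial probabilities are assumed to be strictly positive so that their logarithms exist.

```python
import math

def log_normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

def viterbi(y, means, sds, trans, init):
    """Most probable state path (0 = present, 1 = absent)."""
    k = len(means)
    score = [math.log(init[s]) + log_normal_pdf(y[0], means[s], sds[s])
             for s in range(k)]
    back = []  # back[t - 1][s]: best predecessor of state s at position t
    for v in y[1:]:
        pointers, new_score = [], []
        for s in range(k):
            best = max(range(k), key=lambda r: score[r] + math.log(trans[r][s]))
            pointers.append(best)
            new_score.append(score[best] + math.log(trans[best][s])
                             + log_normal_pdf(v, means[s], sds[s]))
        back.append(pointers)
        score = new_score
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def all_replicates_agree(paths):
    """Call a gene absent (1) only if every replicate calls it absent."""
    return [int(all(states)) for states in zip(*paths)]

# Hypothetical fitted parameters and one short stretch of genome.
means, sds = [0.0, -4.0], [1.0, 1.0]
trans = [[0.95, 0.05], [0.20, 0.80]]
init = [0.9, 0.1]
path = viterbi([0.1, -0.2, -4.1, -1.9, -3.9, 0.0], means, sds, trans, init)
combined = all_replicates_agree([[0, 1, 1, 0], [0, 1, 0, 0]])
```

With these assumed parameters the fourth value, -1.9, lies slightly nearer the present mean than the absent mean, yet the decoded path keeps it inside the absent block, because leaving and re-entering the absent state is penalised by the transition probabilities.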
Running the R script from a Struts web application
The Model part of the Struts web application, a Java program, is responsible for data processing. This program must run the R script, passing user-entered information to the script, and then wait for the script to finish. The web application then needs to display the results of running the script. If the R script takes some time to run, the web application should display real-time progress information for the user. The web application should also be able to retrieve and display any R errors.
The R script is run using the Java Runtime class. Each Java application has one instance of the Runtime class, which allows the application to interface with the environment in which it is running. The R script is run using R batch mode (using R CMD BATCH). This batch command is placed in a shell script. The shell script in turn is run as a Process using the Model program’s instance of the Runtime class. In Java an instance of the Process class has a method (waitFor) that causes the current thread to wait until the Process has completed.
The web application and the R script communicate via files, saved in a directory. There is one such uniquely named directory for each submission of data. The information the user enters on the application’s web pages needs to be passed as variables to the R script. In our application the Model program writes this information, as R assignments, into a text file which is concatenated with the main R script prior to execution. Alternatively the user-entered data may be written to a file that the R script then reads on execution. To pass results back to the web application the R script writes the results to files and also saves a list of these filenames. The web application, on completion of the R script, reads the list of filenames and displays them as links on the web page of results.
Pseudo real-time progress information is achieved by having the R script append information to a text file at stages during its execution. This text file is displayed by a JavaScript pop-up window which refreshes itself at fixed intervals.
Struts is particularly efficient at validating data input and, if necessary, returning error messages to the web page, but it is also important to cater for errors that may occur whilst the R program is running. This is achieved by using the options command in R, set so that if an error does occur the error message is written to a text file. The web application checks this file when the Process terminates and, if it is not empty, displays the contents as an error message on the web page.
As well as online processing, the web application provides the option of offline processing, in which the results are e-mailed to the user rather than displayed on the web page. When data is submitted for offline processing its submission ID is added to a list in a text file. A continuously running Java program checks this file at regular intervals for additions to the list, in which case it processes the relevant files by running the R script, and on completion e-mails the results to the user.
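The file-based hand-off described above can be sketched in a few lines. The application itself uses Java's Runtime and Process classes; the Python below is only an analogue of the same pattern (one uniquely named directory per submission, settings written to a file, a blocking call, then a results list and an error file read back). All file names, the function name and the settings key are hypothetical, and capturing the error stream into a file stands in for the R options mechanism mentioned in the text. A real deployment would pass a command such as `["R", "CMD", "BATCH", "script.R"]`; here the Python interpreter is used as a stand-in so the sketch runs without R.

```python
import subprocess
import sys
import tempfile
import uuid
from pathlib import Path

def run_submission(command, base_dir, settings):
    """Run one batch job in its own directory and collect results and errors."""
    job_dir = Path(base_dir) / uuid.uuid4().hex  # unique directory per submission
    job_dir.mkdir(parents=True)
    # Write user-entered values as R-style assignments (hypothetical file name).
    (job_dir / "settings.R").write_text(
        "".join(f"{name} <- {value}\n" for name, value in settings.items()))
    error_file = job_dir / "errors.txt"
    with open(error_file, "w") as err:
        # subprocess.run blocks until the child process has finished.
        subprocess.run(command, cwd=job_dir, stderr=err, check=False)
    errors = error_file.read_text()
    results_list = job_dir / "results.txt"  # list of result filenames, if any
    results = results_list.read_text().split() if results_list.exists() else []
    return job_dir, results, errors

# Stand-in for the batch script: it just records one result filename.
demo_script = "open('results.txt', 'w').write('absent_genes.csv')"
job_dir, results, errors = run_submission(
    [sys.executable, "-c", demo_script], tempfile.mkdtemp(), {"n_states": 2})
```

A non-empty `errors` string would be shown to the user in place of the result links, mirroring the check on the error file described above.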
Results and Discussion
A graph showing the two normal distributions fitted by the HMM to the data from the TB data set (2122/97 strain) is shown in Figure 2. Because the actual gene absences are known for the TB data set, it is possible to classify the HMM gene assignments as either true positives (TP), false negatives (FN) or false positives (FP). The results of applying the HMM to the TB data set are given in Table 1. The table shows the results for the HMM compared to a conventional analysis using a cut-off of minus three standard deviations. The average of the results for the three strains shows that the HMM gives gene assignments with a lower FP rate (2.3) than a conventional analysis using a cut-off of minus three standard deviations (3.3). The FN rate is also lower for the HMM (3.3 compared to 6.7). The overall error rate FP + FN is 5.6 for the HMM and 10.0 for the cut-off method.
The FP rate of the conventional cut-off analysis may be reduced by lowering the cut-off value. A cut-off of minus four standard deviations was found to minimize the FP rate of the conventional analysis to 1.0, which is lower than the FP rate of the HMM (2.3). There is however a concomitant increase in the FN rate to 9.3, which is far higher than that of the HMM (3.3), and the overall error rate (FP + FN) is now 10.3. The lowest overall error rate possible for the conventional cut-off method was 10.0, corresponding to a cut-off of minus three standard deviations, so the HMM performs better, with a lower error rate, than the best conventional analysis. Moreover, in practice extra experimental work is required to determine the optimum cut-off to use in a conventional analysis, whereas the HMM requires no experimentally determined cut-off.
In some arrays extreme negative M values caused problems with machine underflow. The probability that a gene belongs in the population of present genes becomes too small for the computer to represent and the program fails. To prevent this the application trims extreme negative M values. Extreme positive values can also cause problems. At high positive M values the tails of the two normal distributions may cross over, so that the absent distribution becomes, incorrectly, more probable for these high positive M values. To prevent this the application trims high positive M values and also checks the final list of absent genes for any that may have positive M values.
Figure 3 shows the M values for a section of the genome of M. tuberculosis H37Rv in the region RD8 (genes Rv3617 to Rv3622), which is absent in M. bovis 2122/97. The two genes in the middle of the block (Rv3619 and Rv3620) with M values close to zero are in fact absent, but their M values are elevated due to cross-hybridization. The graphs show how the HMM recognizes that the genes are absent, whereas a conventional three standard deviation cut-off analysis will mistakenly assign Rv3620 as present.
The results for eleven strains from the YP data set are given in Table 1. In the YP data set the distributions of M values of present and absent genes are not in general as well defined as for the TB data set. It was noted that some strains have very few gene absences and a few of the strains have significant numbers of genes with some degree of sequence divergence. Results are compared to those obtained from a simple cut-off in log2 ratio values of -2.32, which corresponds to a ratio of 0.2 as used in the analysis of Hinchliffe et al.[8] The average results for the eleven strains show that the HMM gave slightly higher FP (3.0) than the conventional cut-off method (1.3) but gave greatly reduced FN (5.5 compared to 14.0).
The HMM was tested on 5 strains from the YP data set which have only one gene absence and was found not to work. The same was true for the strains in the data set that contain no absences. In these cases the HMM either fails to converge or else the absent distribution fits to the sequence-diverged tail of the distribution of present genes. The HMM is also likely not to work if applied to data from a bacterial genomotyping experiment where there is very little similarity between the strains or species being compared at the level of DNA sequence divergence or genome organisation.
A three-state HMM was applied to the SA data set. The HMM gave a 9.5% reduction in the total error rate (FP + FN) averaged over the six arrays in the experiment.
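The error counts used in this comparison can be reproduced from a list of calls and a list of known absences, as in the Python sketch below. It assumes, as the text implies, that a positive is a gene called absent; the function names, gene labels and M values are hypothetical and serve only to show the bookkeeping, together with the conversion of the ratio 0.2 into the log2 cut-off quoted above.

```python
import math

def cutoff_calls(m_values, cutoff):
    """Genes whose M value falls below the cut-off are called absent."""
    return [gene for gene, m in m_values.items() if m < cutoff]

def confusion_counts(called_absent, truly_absent):
    """Count TP, FP and FN, treating an absence call as a positive."""
    called, truth = set(called_absent), set(truly_absent)
    return {
        "TP": len(called & truth),  # absent and called absent
        "FP": len(called - truth),  # present but called absent
        "FN": len(truth - called),  # absent but called present
    }

cutoff = math.log2(0.2)  # a ratio of 0.2 on the log2 scale, about -2.32

# Hypothetical M values and known absences.
m_values = {"g1": 0.1, "g2": -3.0, "g3": -1.0, "g4": -2.5}
counts = confusion_counts(cutoff_calls(m_values, cutoff), ["g2", "g3"])
total_errors = counts["FP"] + counts["FN"]
```

In this toy case the gene with an elevated M value (g3) is missed by the cut-off, which is the kind of false negative that the spatial information in the HMM is intended to recover.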
Conclusion
An HMM analysis for gene absences achieves a lower overall error rate than a conventional analysis without the need to determine any arbitrary parameters such as cut-off values. All necessary parameters are estimated from the data alone. The conventional cut-off analysis requires independent experimental confirmation of absences to determine the optimum cut-off value to use. The HMM analysis was found to work well even when only 1% of the test genome is absent. The HMM approach was found to be well suited as the basis of an automatic bacterial strain microarray analysis tool, in that it was stable and converged in only a few iterations.
The R language makes programming the matrix algebra required for a Hidden Markov Model quick and simple, but many potential users of the resulting program will be unfamiliar with R. Providing a web interface for the R script solves this problem. It also simplifies distribution and updating of the application, since the R program runs on the server and the user requires only a web browser to use the program.
Apache Struts provides a rigorous and well-tested framework for creating web applications. Running R scripts from a Struts-based web application is robust and simple to implement. Although quick to program, R code can be slow to run compared to, for example, C++ or Java. Using the Struts framework means that a prototype R script, if demand merits, can be rewritten in C++ or Java and simply slotted into the application in place of the R code.
References
[1] Behr MA, Wilson MA, Gill WP, Salamon H, Schoolnik GK, Rane S, Small PM: Comparative genomics of BCG vaccines by whole genome DNA microarray. Science 1999, 284:1520–1523.
[2] Salama N, Guillemin K, McDaniel TK, Sherlock G, Tompkins L, Falkow S: A whole-genome microarray reveals genetic diversity among Helicobacter pylori strains. PNAS 2000, 97(26):14668–14673.
[3] Fitzgerald JR, Sturdevant DE, Mackie SM, Gill SR, Musser JM: Evolutionary genomics of Staphylococcus aureus: Insights into the origin of methicillin-resistant strains and the toxic shock syndrome epidemic. PNAS 2001, 98(15):8821–8826.
[4] Dorrell N, Mangan JA, Laing KG, Hinds J, Linton D, Al-Ghusein H, Barrell BG, Parkhill J, Stoker NG, Karlyshev AV, Butcher PD, Wren BW: Whole genome comparison of Campylobacter jejuni human isolates using a low-cost microarray reveals extensive genetic diversity. Genome Research 2001, 11:1706–1715.
[5] Dziejman M, Balon E, Boyd D, Fraser CM, Heidelberg JF, Mekalanos JJ: Comparative genomic analysis of Vibrio cholerae: genes that correlate with cholera endemic and pandemic disease. PNAS 2002, 99(3):1556–1561.
[6] Porwollik S, Wong RM, McClelland M: Evolutionary genomics of Salmonella: gene acquisitions revealed by microarray analysis. PNAS 2002, 99(13):8956–8961.
[7] Inwald J, Hinds J, Palmer S, Dale J, Butcher PD, Hewinson RG, Gordon SV: Genomic analysis of Mycobacterium tuberculosis complex strains used for production of purified protein derivative. J. Clin. Microbiol. 2003, 41(8):3929–3932.
[8] Hinchliffe SJ, Isherwood KE, Stabler RA, Prentice MB, Rakin A, Nichols RA, Oyston PCF, Hinds J, Titball RW, Wren BW: Application of DNA microarrays to study the evolutionary genomics of Yersinia pestis and Yersinia pseudotuberculosis. Genome Research 2003, 13:2018–2029.
[9] Cummings CA, Brinig MM, Lepp PW, van de Pas S, Relman DA: Bordetella species are distinguished by patterns of substantial gene loss and host adaptation. J. Bacteriol. 2004, 186(5):1484–1492.
[10] Snyder LA, Davies JK, Saunders NJ: Microarray genomotyping of key experimental strains of Neisseria gonorrhoeae reveals gene complement diversity and five new neisserial genes associated with Minimal Mobile Elements. BMC Genomics 2004, 5:23.
[11] Fukiya S, Mizoguchi H, Tobe T, Mori H: Extensive genomic diversity in pathogenic Escherichia coli and Shigella strains revealed by comparative genomic hybridization microarray. J. Bacteriol. 2004, 186(12):3911–3921.
[12] R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria 2004, [http://www.R-project.org]. ISBN 3-900051-07-0.
[13] Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004, 5:R80, [http://genomebiology.com/2004/5/10/R80].
[14] Kim CC, Joyce EA, Chan K, Falkow S: Improved analytical methods for microarray-based genome-composition analysis. Genome Biol. 2002, 3(11):Research0065.1–17.
[15] Witney AA, Marsden GL, Holden MTG, Stabler RA, Husain SE, Vass JK, Butcher PD, Hinds J, Lindsay JA: Design, validation, and application of a seven-strain Staphylococcus aureus PCR product microarray for comparative genomics. Appl. Env. Microbiol. 2005, 71:7504–7514.
[16] Rabiner LR: A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 1989, 77(2):257–286.
[17] Eddy SR: What is a Hidden Markov model? Nature Biotech 2004, 22(10):1315–1316.
[18] Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucleic Acids Research 1998, 26(2):544–548.
[19] Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Application of Hidden Markov Models to the analysis of array CGH data. J. Multivariate Analysis 2004, 90:132–153.
[20] Snijders AM, Nowak NJ, Huey B, Fridlyand J, Law S, Conroy J, Tokuyasu T, Demir K, Chiu R, Mao JH, Jain AN, Jones SJM, Balmain A, Pinkel D, Albertson DG: Mapping segmental and sequence variations among laboratory mice using BAC array CGH. Genome Research 2005, 15:302–311.
[21] Apache Struts [http://struts.apache.org/].
[22] Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry CE 3rd, Tekaia F, Badcock K, Basham D, Brown D, Chillingworth T, Connor R, Davies R, Devlin K, Feltwell T, Gentles S, Hamlin N, Holroyd S, Hornsby T, Jagels K, Barrell BG, et al: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 1998, 393(6685):537–544.
[23] Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalisation for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 2002, 30(4):e15.
[24] Garnier T, Eiglmeier K, Camus JC, Medina N, Mansoor H, Pryor M, Duthoy S, Grondin S, Lacroix C, Monsempe C, Simon S, Harris B, Atkin R, Doggett J, Mayes R, Keating L, Wheeler PR, Parkhill J, Barrell BG, Cole ST, Gordon SV, Hewinson RG: The complete genome sequence of Mycobacterium bovis. PNAS 2003, 100(13):7877–7882.
[25] Wernisch L, Kendall SL, Soneji S, Wietzorrek A, Parish T, Hinds J, Butcher PD, Stoker NG: Analysis of whole-genome microarray replicates using mixed models. Bioinformatics 2003, 19:53–61.
[26] Bilmes JA: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models 1998, [www.icsi.berkeley.edu/techreports/1997.html].
[27] Murphy KP: Dynamic Bayesian Networks: Representation, Inference and Learning. Chapter 3. Exact inference in DBNs 2002, [www.ai.mit.edu/~murphyk/Thesis/thesis.html].
Table
Table 1 - Results for the TB and YP data sets. Results from applying the HMM to the TB and YP data sets, compared to conventional cut-off analyses. The values are the numbers of genes assigned as TP, FP or FN.

Data set  Strain     Method    TP     FP    FN     FP+FN
TB        Mb2122     3sd       77     1     4      5
TB        Mb2122     HMM       78     0     3      3
TB        MbAN       3sd       86     5     4      9
TB        MbAN       HMM       89     3     1      4
TB        BCG        3sd       92     4     12     16
TB        BCG        HMM       98     4     6      10
TB        Average    3sd       85.0   3.3   6.7    10.0
TB        Average    HMM       88.3   2.3   3.3    5.6
YP        G-8786     Cut-off   47     0     10     10
YP        G-8786     HMM       52     8     5      13
YP        735        Cut-off   96     3     7      10
YP        735        HMM       100    5     3      8
YP        Yokahama   Cut-off   58     0     16     16
YP        Yokahama   HMM       71     0     3      3
YP        KUMA       Cut-off   30     0     9      9
YP        KUMA       HMM       38     1     1      2
YP        Nepal516   Cut-off   75     5     22     27
YP        Nepal516   HMM       80     8     17     25
YP        KIM        Cut-off   32     0     8      8
YP        KIM        HMM       38     1     2      3
YP        Harbin35   Cut-off   63     1     35     36
YP        Harbin35   HMM       77     2     21     23
YP        A1122      Cut-off   53     1     14     15
YP        A1122      HMM       65     2     2      4
YP        F361-66    Cut-off   60     1     15     16
YP        F361-66    HMM       73     2     2      4
YP        EV76       Cut-off   70     2     5      7
YP        EV76       HMM       72     2     3      5
YP        16-34      Cut-off   55     1     12     13
YP        16-34      HMM       65     2     2      4
YP        Average    Cut-off   58.1   1.3   14.0   15.3
YP        Average    HMM       66.5   3.0   5.5    8.5
Figure 1: [Plot. x-axis: Genome position (Mb); y-axis: M.]
Figure 2: [Histogram. x-axis: M; y-axis: Frequency; with an inset labelled "Detail".]
Figure 3: [Plot. x-axis: Genome position (Mb), 4.052 to 4.064; y-axis: M. Genes Rv3617, Rv3619, Rv3620 and Rv3622 are labelled. HMM assignments: p = present, d = absent. Horizontal line marked "3 std.dev. = −0.63".]