Workshop Notes
International Workshop Advances in Bioinformatics and Artificial Intelligence: Bridging the Gap (BAI@IJCAI 2015)
International Joint Conference on Artificial Intelligence (IJCAI 2015)
July 27th, 2015, Buenos Aires, Argentina

Editors:
Abdoulaye Baniré Diallo (Université du Québec à Montréal, Canada)
Engelbert Mephu Nguifo (Université Blaise Pascal, France)
Mohamed Zaki (Rensselaer Polytechnic Institute, USA)
Rabie Saidi (UniProt, European Bioinformatics Institute, UK)

http://bioinfo.uqam.ca/IJCAI_BAI2015/
Preface

Artificial Intelligence (AI) has played an increasingly important role in the analysis of sequence, structure and functional patterns or models from sequence databases. Bioinformatics aims to store, organize, explore, extract, analyze, interpret, and utilize information from biological data. The main aim of this workshop is to present the latest results in this exciting area at the intersection of biology and AI. AI approaches can usher in a new age of bioinformatics and computational biology, with discoveries in basic biology, evolution, metagenomics, systems biology, regulatory genomics, population genomics and diseases, structural bioinformatics, protein docking, next-generation sequencing (NGS) data processing, chemoinformatics, etc. Bioinformatics, in turn, provides opportunities for developing novel AI methods. Some of the grand challenges in bioinformatics include protein structure prediction, homology search, epigenetics, multiple alignment and phylogeny construction, genomic sequence analysis, gene finding and gene mapping, as well as applications in gene expression data analysis, drug discovery in the pharmaceutical industry, etc. Two questions were at the heart of this workshop:
– How can AI techniques contribute to bioinformatics research, in particular in dealing with biological problems?
– How can bioinformatics raise new fundamental research problems for AI research?
This one-day workshop brought together scholars and practitioners active in Artificial Intelligence-driven bioinformatics to present and discuss their research, share their knowledge and experiences, and discuss the current state of the art and future improvements to advance the intelligent practice of computational biology. This first edition of the Bioinformatics and Artificial Intelligence workshop was held during the International Joint Conference on Artificial Intelligence (IJCAI) in Buenos Aires, July 2015. The workshop, named BAI@IJCAI2015: Advances in Bioinformatics and Artificial Intelligence: Bridging the Gap, took place on July 27, 2015. The full-length papers submitted to the workshop were carefully peer-reviewed by three members of the program committee and additional reviewers. The committee decided to accept the seven full-length papers with the highest scores for oral presentation and invited 4 abstracts for poster presentation. The program also included one invited talk. We thank all the program committee members, the additional reviewers and all the authors for their contributions. We also thank the IJCAI organizers for their support in making this workshop a success. These proceedings were generated using the EasyChair platform.

July 27, 2015
Buenos Aires, Argentina
Abdoulaye Baniré Diallo, Engelbert Mephu Nguifo, Rabie Saidi, Mohammed Zaki
Table of Contents

Invited Talk
Computational Protein Design as an Optimization Problem (Thomas Schiex) . . . 1

Full-Length Contributions for Oral Presentations
Zseq: an approach for filtering low complex and biased sequences in next generation sequencing data (Abed Alkhateeb, Siva Reddy, Iman Rezaeian and Luis Rueda) . . . 2
Data-Driven Hospital Admission Prediction (Ognjen Arandjelovic) . . . 10
Using Ensemble Combiners to Improve Gene Regulatory Network Inference (Alan R. Fachini, David Correa Martins Jr and Anna Helena Reali Costa) . . . 17
Identifying the Network Biomarkers of Breast Cancer Subtypes By Interaction (Forough Firoozbakht, Iman Rezaeian, Alioune Ngom and Luis Rueda) . . . 24
Workflow Mining: Discovering Generalized Process Models from Texts (Ahmed Halioui, Petko Valtchev and Abdoulaye Banire Diallo) . . . 33
Evolution and Vaccination of Influenza Virus (Ham Ching Lam, Srinand Sreevatsan and Daniel Boley) . . . 40
A Logic for Checking the Probabilistic Steady-State Properties of Reaction Networks (Vincent Picard, Jeremie Bourdon and Anne Siegel) . . . 47

Abstract Contributions for Poster Presentation Sessions
Robust Intervention on Genetic Regulatory Networks Using Symbolic Dynamic Programming (Leliane N. de Barros, Ronaldo F. Hashimoto, Karina V. Delgado, Carolina F. Da Silva and Fabio A. C. Tisovec) . . . 54
Identification of High Affinity Ubiquitin Variants: an in Silico Mutagenesis-based Approach (Mina Maleki, Mohammad Haj Dezfulian and Luis Rueda) . . . 56
Program Committee

Alex Bateman – European Bioinformatics Institute, Cambridge
Eric Beaudry – University of Quebec at Montreal
Sergio Campos – Federal University of Minas Gerais
Marcilio de Souto – University of Orleans
Simon Degivry – INRA Toulouse
Abdoulaye Baniré Diallo – Université du Québec à Montréal
Jason Ernst – UCLA
Christine Gaspin – INRA Toulouse
Tubao Ho – JAIST School of Knowledge Science
Tae Rim Lee – Korea National Open University
Frederique Lisacek – Swiss Institute of Bioinformatics, Geneva
Mondher Maddouri – Unité de Recherche en Programmation, Algorithmique et Heuristiques (URPAH), Faculté des Sciences de Gafsa, Tunisie
Maria Jesus Martin – UniProt, European Bioinformatics Institute, Cambridge
Osamu Maruyama – Kyushu University
Engelbert Mephu Nguifo – LIMOS, Blaise Pascal University, CNRS
Claire Nedellec – INRA Jouy-en-Josas
Guilherme Oliveira – FIOCRUZ Minas, Belo Horizonte
Gaurav Pandey – Mount Sinai School of Medicine, New York City
Jan Ramon – Katholieke Universiteit Leuven
Dave Ritchie – INRIA-LORIA, University Henri Poincaré, Nancy
Sushmita Roy – University of Wisconsin, Madison
Derek Ruths – McGill University
Rabie Saidi – UniProt, European Bioinformatics Institute, Cambridge
João Carlos Setubal – University of Sao Paulo
Annegret Wagler – University Clermont-Ferrand II (Blaise Pascal)
Dechang Xu – Harbin Institute of Technology
Mohamed Zaki – Rensselaer Polytechnic Institute, NY
Additional Reviewers
Fotuhi Siahpirani, Alireza
Meddouri, Nida
Plancade, Sandra
Keyword Index

Ailment . . . 10
breast cancer subtyping . . . 24
Comorbidity . . . 10
Ensemble Learning . . . 17
factored markovian decision process with imprecision . . . 54
Filtering . . . 2
FoldX . . . 56
Gaussian approximation . . . 47
Gene Regulatory Networks Inference . . . 17
generalized pattern mining . . . 33
genetic regulatory networks . . . 54
greedy algorithm . . . 24
Health . . . 10
hierarchical classification . . . 24
Influenza virus . . . 40
information extraction . . . 33
Logic . . . 47
Low Complexity . . . 2
Medical . . . 10
Model validation . . . 47
Next Generation Sequencing . . . 2
phylogenetic analysis . . . 33
probabilistic boolean networks . . . 54
Protein Stability . . . 56
protein-protein interactions . . . 24
Reaction networks . . . 47
Satisfiability problem . . . 47
Stochastic modelling . . . 47
Systems Biology . . . 17
Ubiquitin Variants . . . 56
Unsupervised machine learning . . . 40
Vaccination . . . 40
Visualization . . . 40
workflow mining . . . 33
Author Index

Alkhateeb, Abed . . . 1
Alliot, Jean-Marc . . . 9
Arandjelovic, Ognjen . . . 11
Bernardes, Juliana . . . 18
Boley, Daniel . . . 44
Bourdon, Jeremie . . . 52
Costa, Anna Helena Reali . . . 21
Da Silva, Carolina F. . . . 19
de Barros, Leliane N. . . . 19
Delgado, Karina V. . . . 19
Demolombe, Robert . . . 9
Diallo, Abdoulaye Banire . . . 37
Fachini, Alan R. . . . 21
Farinas, Luis . . . 9
Firoozbakht, Forough . . . 28
Haj Dezfulian, Mohammad . . . 51
Halioui, Ahmed . . . 37
Hashimoto, Ronaldo F. . . . 19
Lam, Ham Ching . . . 44
Maleki, Mina . . . 51
Martins Jr, David Correa . . . 21
Ngom, Alioune . . . 28
Picard, Vincent . . . 52
Reddy, Siva . . . 1
Rezaeian, Iman . . . 1, 28
Rueda, Luis . . . 1, 28, 51
Siegel, Anne . . . 52
Sreevatsan, Srinand . . . 44
Tisovec, Fabio A. C. . . . 19
Valtchev, Petko . . . 37
Vieira, Fabio . . . 18
Zaverucha, Gerson . . . 18
Invited Talk

Computational Protein Design as an Optimization Problem

Thomas Schiex, PhD, Research Director, INRA Toulouse, France
Email:
[email protected]
Abstract After less than two decades, an increasing number of new proteins have been designed following a semi-rational design process. The ultimate aim of Protein Design is to produce an amino-acid sequence (a protein) that will fold in 3D-space according to a desired scaffold. In most cases, the aim is to obtain a new enzyme catalyzing a new reaction, improving an existing catalysis or creating affinity for new partners. The design may also have nanotechnological purposes. Applications are numerous, including in medicine, bioenergies, food and cosmetics. With 20 amino-acids, the space of all amino-acid sequences is extremely combinatorial and its systematic exploration, even if it is directed through experimental selection, is out of reach of experimental approaches. To focus this search, the rational design approach consists in modelling the protein as a 3D object, subjected to various forces (internal torsions, van der Waals, electrostatic, hydrogen bonds and solvation) and to seek an optimal sequence, with criteria that include stability and affinity for a chosen partner. Even with strong simplifying assumptions, this defines very complex combinatorial optimization problems, from both a modelling and solving perspective. At the core of most existing stability approaches lies a simple formulation of this problem, with a rigid scaffold, flexible side-chains represented by a discrete library of conformations (rotamers) and a decomposable energy field. We will see how this NP-hard problem can be modelled using a variety of usual discrete optimization frameworks from both Artificial Intelligence (Constraint Programming, Satisfiability, Machine Learning) and Operations Research (Integer Linear and Quadratic Programming and Optimization). On a benchmark of CPD problems, the efficiency of these different approaches varies tremendously. Among all those, we will quickly detail how the most successful approach works.
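To make the combinatorial structure sketched above concrete, the toy example below (a hypothetical three-position design with made-up energy tables) shows the rigid-scaffold, rotamer-based formulation as a discrete optimization over per-position choices with a pairwise-decomposable energy. It uses brute-force enumeration purely for illustration, whereas the talk surveys the CP, SAT, ILP and related solvers that make realistic instances tractable.

```python
import itertools

# Toy computational protein design instance: choose one residue/rotamer per
# position so as to minimise a pairwise-decomposable energy
#   E(r) = sum_i E1[i][r_i] + sum_{i<j} E2[(i, j)][(r_i, r_j)]
# The energy tables below are made-up numbers, purely for illustration.
E1 = {0: {"A": 1.2, "V": 0.7}, 1: {"L": 0.3, "K": 0.9}, 2: {"G": 0.5, "S": 0.4}}
E2 = {(0, 1): {("A", "L"): -0.2, ("A", "K"): 0.1, ("V", "L"): 0.0, ("V", "K"): 0.4},
      (1, 2): {("L", "G"): 0.2, ("L", "S"): -0.1, ("K", "G"): 0.3, ("K", "S"): 0.0}}

def total_energy(assignment):
    # Sum of single-position terms plus pairwise interaction terms.
    e = sum(E1[i][a] for i, a in enumerate(assignment))
    e += sum(E2[pair][(assignment[pair[0]], assignment[pair[1]])] for pair in E2)
    return e

# Exhaustive search is feasible only for toy instances; real CPD solvers
# attack this NP-hard problem with CP, SAT, ILP or related machinery.
choices = [sorted(E1[i]) for i in sorted(E1)]
best = min(itertools.product(*choices), key=total_energy)
print(best, round(total_energy(best), 2))
```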
Zseq: an approach for filtering low complex and biased sequences in next generation sequencing data

Abedalrhman Alkhateeb, Siva Reddy, Iman Rezaeian and Luis Rueda
School of Computer Science, University of Windsor
401 Sunset Avenue, Windsor, ON N9B 3P4, Canada
{alkhate,singir,rezaeia,lrueda}@uwindsor.ca

Abstract

Next generation sequencing (NGS) technology generates a huge number of reads (short sequences) which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts, and pre-processing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting the number of unique k-mers in it as its score, and also takes into account other factors such as ambiguous nucleotides and high GC-content in k-mers. Based on a z-score threshold, Zseq then sweeps through the sequences again and filters out those with a z-score below the user-defined threshold. The Zseq algorithm provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using a reference genome as well as by de novo assembly. The assembled transcripts show a better ability to discriminate cancer from normal samples in comparison with other state-of-the-art methods. Moreover, the de novo assembled transcripts obtained from the reads filtered by Zseq are longer than those of the other tested methods.

Keywords: Next Generation Sequencing; Low Complexity; Filtering; Machine Learning.
1 Introduction
In the last decade, NGS technology has evolved rapidly, which has reduced the cost of genome sequencing and influenced the progression of cancer research and other fields. The main purpose of NGS studies is to find clues to gene and protein structures and functions in the sequenced reads. However, this advanced technology can also produce unexpected artifacts [Waszak et al., 2014]. Some of these artifacts come from
cDNA library preparation; these are repetitive, low-complexity regions that appear in the sequenced reads [Mackinnon et al., 2009]. High GC-content is also a common bias introduced during cDNA library preparation, as GC-rich fragments tend to survive the preparation process longer [Yakovchuk et al., 2006]. GC-content bias in reads is also known to aggravate genome assembly, and hence may result in a lower-quality assembly. Moreover, the sequencing procedure itself can produce low-complexity repetitive regions, such as runs of ambiguous nucleotides. In general, however, it is not clear to what extent GC-content bias affects genome assembly [Chen et al., 2013].
1.1 Background in dealing with low-complexity and biased regions
Although NGS is able to generate several gigabytes of genomic data per dataset, this technology comes at the price of artifacts. Some of these artifacts include low-complexity regions, biased information and ambiguous nucleotides. Processing this huge amount of data is a big challenge. Some of these sequences may not be very useful because they belong to regions of the genome with low complexity. A low-complexity sequence of nucleotides has a highly biased distribution of nucleotides, which makes the sequence less diverse in terms of unique k-mers. The lower the complexity of a sequence, the more likely it is that the sequence maps to different parts of the genome. In other words, when we process a low-complexity sequence, there is less chance that we can align it uniquely to a specific part of the genome. This low level of certainty about the real position of a sequence makes it less desirable to use. Poly-A/poly-T is a chain of A or T nucleotides used to prime the 3' and 5' ends of a genomic sequence during cDNA library preparation [Brown, 2012]. Poly-A/T sequences may cause bias in the reads. The intronic poly-A/T tails tend to be spliced out rather than staying between coding exons [Zhao et al., 2014]. The GC-content represents the proportion of G-C pairs in a genomic sequence. Stop codons show a significantly high ratio of A-T nucleotides [Wuitschick and Karrer, 1999], while coding codons have a higher GC-content [Pozzoli et al., 2008]. The GC-content of a gene plays a significant role in carrying the genetic information. The GC-content of the human genome varies among chromosomes; on average it is 41% [Vogel, 1997]. The representation of A+T-rich sequences can be
significantly lower, because in the preparation of a standard library a gel slice is used and heated up to 50°C, thereby increasing the GC-content bias [Quail et al., 2008]. There are different techniques that try to remove sequences with low-complexity patterns from samples. Morgulis et al. presented a symmetric DUST method [Morgulis et al., 2006], which masks low-complexity regions in a sequence to overcome context sensitivity in calculating the complexity score. Schmieder and Edwards proposed two methods to evaluate sequence complexity [Schmieder and Edwards, 2011]. The first method uses entropy as a measure. The second, a variant of the DUST algorithm based on BLAST search, filters out sequences with low complexity scores. Both methods consider each triplet of nucleotides as a word. One of the downsides of these methods is that they focus only on the complexity of the sequences, which can be misleading in some cases due to the highly biased nature of the sequences. In this paper, we propose a novel method called Zseq, which decreases the uniqueness score of highly biased regions, thereby filtering out both highly biased and low-complexity sequences.
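As a rough illustration of the entropy-style complexity measures mentioned above, the sketch below scores a read by the Shannon entropy of its overlapping nucleotide-triplet ("word") distribution; it mirrors the general idea (triplets as words, low scores for repetitive reads) rather than reproducing the exact scoring of DUST or of Schmieder and Edwards' tools.

```python
import math
from collections import Counter

def triplet_entropy(read):
    """Shannon entropy (bits) of the overlapping 3-mer distribution of a read.

    Low-complexity reads (e.g. poly-A runs) use few distinct triplets and
    therefore score close to 0; diverse reads score higher.
    """
    triplets = [read[i:i + 3] for i in range(len(read) - 2)]
    counts = Counter(triplets)
    n = len(triplets)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(triplet_entropy("AAAAAAAAAAAAAAAA"))        # ~0: low complexity
print(triplet_entropy("ACGTTGCAGTACCGTTAGCA"))    # higher: more diverse
```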
2 Zseq
The z-score measure has been used in different applications in bioinformatics [Margulies et al., 2005; Cheadle et al., 2003]. Chopping sequences into k-mers is an essential technique in read assembly. We present the Zseq algorithm, which uses the z-score measure based on the uniqueness scores of all reads. The uniqueness score is the normalized number of unique k-mers in each read, taking low-complexity regions into account. Figure 1 depicts the process of finding reads with improved quality; each module is explained in detail in the next few sections. In the first step, Zseq scans all the reads and calculates the uniqueness score of each read, which is equal to the number of unique k-mers in that read. Zseq uses a default k-mer size of w = 4, which makes the vocabulary over the 4 nucleotides (A, T, C, G) consist of 4^4 = 256 words. Since long reads may contain thousands of nucleotides, a 3-mer size is not sufficient to measure the complexity of the reads; a 3-mer word can occur many times in the same read without being considered unique, even when it is associated with different nucleotides each time. Zseq excludes the 5-mers of low-complexity or biased artifacts, such as ambiguous bases (N), poly-A/T and GC-rich content, from being counted as unique, decreasing the uniqueness score of the read by one for each 2w, in order to reduce the chances of selecting such a sequence later. The uniqueness score of each read is then normalized by dividing it by the length of the read. The normalized uniqueness scores of all reads are stored in a vector in the same order as the reads appear in the input file. Figure 2 shows the distribution of the normalized uniqueness scores of all reads for sample SRR202054 from the prostate cancer dataset used in the study of [Kim et al., 2011]. The x-axis shows the normalized uniqueness scores, while the y-axis shows the number of reads. As shown in the figure, the penalized sequences have a very small score, down to -30. These are sequences that
have been generated from reads that contain long poly-A/T stretches, very high GC-content, or a very high number of ambiguous nucleotides (N). In the next step, Zseq calculates the mean and standard deviation of the normalized uniqueness scores. The mean of the normalized uniqueness scores of all reads is calculated in the first loop. The variance is also calculated linearly, using a naïve algorithm to reduce the cost of this step, and the standard deviation is obtained from the variance of the vector of normalized uniqueness scores. Next, for each normalized uniqueness score s, we calculate the z-score using the mean, µ, and the standard deviation, σ, as follows:

z = (s − µ)/σ.  (1)

The z-score represents how many standard deviations the normalized uniqueness score of the read is away from the mean µ of all normalized uniqueness scores. In other words, if a read has a z-score of 0, its normalized uniqueness score equals µ, while a z-score of 1 means that the normalized uniqueness score is exactly one standard deviation away from µ. Figure 3 shows the z-scores of all reads in sample SRR202054, where the x-axis is the z-score of the normalized uniqueness scores and the y-axis indicates how many reads in the sample have a particular z-score. Finally, a user-adjustable threshold θ is used to determine whether or not to select a read: if the z-score of the normalized uniqueness score of the read is greater than or equal to θ, the read is selected; otherwise, it is filtered out.

Figure 1: Schematic representation of the process for filtering reads using the Zseq method.

Figure 2: Distribution of the normalized uniqueness scores for all reads in sample SRR202054 (µ = 25.8169, σ = 7.1681).

Figure 3: Distribution of the z-scores of the normalized uniqueness scores corresponding to each read for sample SRR202054.
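A minimal sketch of the scoring-and-filtering loop described in this section: the uniqueness score is the number of distinct k-mers normalised by read length, and reads are kept when the z-score of their score reaches the threshold θ. The penalty handling for ambiguous bases, poly-A/T runs and GC-rich k-mers is deliberately simplified here to a per-occurrence deduction; the released Zseq tool should be consulted for the exact rules.

```python
import statistics

def uniqueness_score(read, k=4):
    """Number of distinct k-mers in the read, penalised for artifact k-mers
    (ambiguous bases, poly-A/T or all-GC words), normalised by read length.
    The penalty scheme is a simplification of the one described in the text."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    score = len(set(kmers))
    for kmer in kmers:
        if "N" in kmer or set(kmer) <= {"A"} or set(kmer) <= {"T"} or set(kmer) <= {"G", "C"}:
            score -= 1                      # penalise artifact k-mers
    return score / len(read)

def zseq_filter(reads, theta=-1.5, k=4):
    """Keep a read when the z-score of its normalised uniqueness score >= theta."""
    scores = [uniqueness_score(r, k) for r in reads]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or 1.0    # guard against zero spread
    return [r for r, s in zip(reads, scores) if (s - mu) / sigma >= theta]
```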
3 Results

In our experiments, we used the prostate cancer dataset utilized in the study of [Kim et al., 2011]. The dataset is publicly available in the NCBI Gene Expression Omnibus (GEO) under accession no. GSE29155. It contains 11 samples in total, of which seven belong to tumor tissues and the remaining four are benign. We measured the GC-content and the number of ambiguous bases of the output of each method, and then aligned the results of both methods to the human genome using Tophat2 as the alignment method [Kim et al., 2013]. DUST takes a value between 0 and 100 as the complexity threshold, while Zseq takes a z-score value as the complexity threshold, which indicates how many standard deviations the normalized uniqueness score of a read lies from the mean. For the DUST method we chose 5 as the threshold, which means that the complexity value of a read has to be greater than or equal to 5 for the read to be selected; otherwise, DUST will ignore the read. For Zseq, we chose -1.5 as the threshold, so a read is selected if its z-score is greater than or equal to -1.5. These two thresholds were chosen because both methods then filter almost the same number of reads in each sample. The reads filtered by Zseq have a lower GC-content than the reads filtered by DUST, along with a smaller standard deviation, which centers the reads more tightly around the mean than DUST. Figure 4 shows the GC-content distributions of both methods applied to the same sample (SRR202058).
Zseq shows a slight improvement in GC-content reduction, mapping rate and mapping time, while dropping the number of ambiguous bases drastically in comparison with DUST. Table 1 shows that the number of ambiguous bases, N, in the reads filtered using Zseq has decreased drastically in comparison with the reads filtered using DUST in all samples. For example, the number of occurrences of N in sample SRR202054 is 19,177 for reads filtered by DUST, while it is only 11,135 for reads filtered by Zseq. The results also indicate that Zseq slightly shrank the GC-content percentage distribution and reduced the mean GC-content percentage. For sample SRR202055, the mean GC-content is 52.48 ± 12.10% using Zseq, which is less than the 52.91 ± 12.38% obtained using the DUST method. Zseq also shows better mapping alignment of the filtered reads than DUST for most of the samples. For example, in sample SRR202061, the reads filtered by Zseq have a 79.20% mapping rate, which is greater than the 77.90% mapping rate of the reads filtered by DUST; the only exception is sample SRR202062, which shows a similar mapping rate of 71.30% for both DUST and Zseq.
Table 1: Comparison of the results of applying Zseq and DUST on samples from the prostate cancer dataset (occurrences of ambiguous bases N, mean GC-content, and mapping rate for the original reads and for the reads filtered by each method).

Sample     | Original: N / GC-content / Mapping | Zseq: N / GC-content / Mapping  | DUST: N / GC-content / Mapping
SRR202054  | 40,690 / 52.82±14.06% / 91.50%     | 11,135 / 52.61±12.20% / 93.00%  | 19,177 / 52.89±12.33% / 92.80%
SRR202055  | 42,965 / 53.01±13.74% / 91.20%     | 9,336 / 52.48±12.10% / 92.40%   | 19,470 / 52.91±12.38% / 92.10%
SRR202056  | 40,243 / 52.94±13.99% / 91.40%     | 10,721 / 52.67±12.22% / 92.80%  | 18,336 / 52.95±12.36% / 92.60%
SRR202057  | 42,630 / 52.94±13.94% / 91.30%     | 10,403 / 52.65±12.22% / 92.60%  | 20,018 / 52.93±12.36% / 92.40%
SRR202058  | 16,643 / 53.12±14.03% / 91.00%     | 14,023 / 52.63±12.08% / 92.40%  | 16,198 / 53.09±12.36% / 92.30%
SRR202059  | 17,741 / 52.56±13.88% / 90.70%     | 14,042 / 52.18±12.02% / 92.00%  | 17,091 / 52.61±12.28% / 91.90%
SRR202060  | 19,958 / 53.44±13.98% / 90.90%     | 13,775 / 53.23±12.09% / 92.40%  | 17,281 / 53.51±12.21% / 92.30%
SRR202061  | 2,156 / 50.06±11.50% / 77.00%      | 1,849 / 48.87±9.96% / 79.20%    | 2,100 / 49.95±11.12% / 77.90%
SRR202062  | 5,837 / 52.81±13.64% / 69.10%      | 5,122 / 52.69±11.77% / 71.30%   | 5,466 / 52.91±11.84% / 71.30%
Figure 4: Percentage of GC-content for all filtered reads using Zseq, histogram (a), with µ = 52.63% and σ = 12.08%, and DUST, histogram (b), with µ = 53.09% and σ = 12.36%.
3.1 De novo sequence validation

Using the Trinity de novo assembler [Grabherr et al., 2011], transcripts were reconstructed for the original reads of sample SRR202058, the reads filtered by DUST, and the reads filtered by Zseq. In the next step, all three sets of constructed transcripts were evaluated by searching the assembled transcripts against the human genome sequences using BLAST [Altschul et al., 1997]. The set of transcripts reconstructed from the reads filtered by Zseq contains a higher number of long sequences than the other two sets. Figure 5 shows the meaningful sequences for each set. Some of the sequences built from the reads filtered by Zseq have a length of 1,000 bp or more along with a high alignment score, while the sequence length is slightly more than 300 bp for the reads filtered by DUST and about 200 bp for the original, unfiltered reads.
Figure 5: Biologically meaningful human genomic sequences found using BLAST: de novo assembled transcripts using original reads (a), de novo assembled transcripts using reads filtered by DUST (b), and de novo assembled transcripts using reads filtered by Zseq (c).
3.2 Machine learning validation
In another experiment, we used an independent dataset containing 12 samples (six tumor and six matched normal) [Kannan et al., 2011]. Using these samples, three datasets were generated: one from the original reads, one by applying DUST on the reads, and the third by applying Zseq on the reads of all samples. In the next step, all reads corresponding to each dataset were aligned to the human genome hg19 using Tophat2 [Kim et al., 2013] and the Cufflinks assembler [Trapnell et al., 2012] with default parameters, to assemble the transcripts against the human genome and estimate their abundance, measured by the FPKM value (fragments per kilobase of exon per million mapped reads). Table 2 shows the average mapping rate of the reads filtered by each method.

Table 2: Average mapping rate of transcripts using the dataset generated by the original reads, reads filtered by DUST and reads filtered by Zseq.
Original: 88.90%    DUST: 90.10%    Zseq: 90.40%
Each dataset generated from the filtered reads has 43,497 features (transcripts) with FPKM values, and each of the 12 samples was labeled as cancer or matched benign. The FPKM value equals 0 if the transcript is not present in that sample. We measured the number of transcripts that can individually separate all cancer samples from normal samples perfectly, with 100% accuracy. In other words, we compute, for each method, the number of transcripts generated from the filtered reads whose FPKM values in cancer samples can be separated from those in normal samples. Figure 6 depicts two transcripts; transcript (a) has clearly separable FPKM values, while in transcript (b) the FPKM values cannot be separated accurately.
Figure 6: An example of two transcripts, one with separable FPKM values (a), and another with inseparable FPKM values (b).

Table 3 shows the number of transcripts that contain separable FPKM values. These results indicate that applying Zseq influences the alignment tool and assembler to quantify more meaningful transcripts that can discriminate cancer from normal samples, in comparison with the DUST method and the original reads.

Table 3: The number of discriminative transcripts for each of the three datasets.
Dataset             Number of discriminative transcripts
Original            167
Filtered by DUST    159
Filtered by Zseq    231

Moreover, using the Chi2 [Liu and Setiono, 1995] statistical test on the 231 discriminative transcripts from the Zseq dataset, the NM_001145410 transcript, corresponding to the NONO gene, was the most significant transcript among all transcripts in all three datasets. NONO is known to be differentially regulated in different types of cancer such as breast and prostate cancer [Traish et al., 1997; Ishiguro et al., 2003]. Next, a support vector machine (SVM) with a linear kernel was applied to the three datasets using this transcript as the feature. An SVM is a supervised learning machine that tries to find an optimal separating hyperplane between classes [Cortes and Vapnik, 1995]. Using a leave-two-out cross-validation scheme, the classification attains 100% accuracy for the Zseq dataset and 91.66% for the DUST dataset, while it drops to 83.33% on the original reads dataset.
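The single-feature classification experiment described above can be sketched as follows, assuming the FPKM values of one transcript across the 12 samples are available as a vector (the numbers below are illustrative, not the actual NM_001145410 measurements), and reading "leave-two-out" as holding out every pair of samples in turn.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

# Hypothetical FPKM values of a single transcript across 6 tumour and 6 normal
# samples; the paper uses NM_001145410 (NONO), these numbers are made up.
fpkm = np.array([35.2, 41.8, 28.9, 33.1, 39.4, 30.6,   # tumour
                 4.1, 6.3, 2.8, 5.5, 3.9, 4.7])        # normal
labels = np.array([1] * 6 + [0] * 6)

# Leave-two-out cross-validation: every pair of samples is held out in turn.
correct = total = 0
for test_idx in combinations(range(len(labels)), 2):
    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    clf = SVC(kernel="linear")
    clf.fit(fpkm[train_idx].reshape(-1, 1), labels[train_idx])
    preds = clf.predict(fpkm[list(test_idx)].reshape(-1, 1))
    correct += (preds == labels[list(test_idx)]).sum()
    total += 2

print("leave-two-out accuracy: %.2f%%" % (100.0 * correct / total))
```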
4 Conclusion
We have presented a novel method for filtering reads that reduces the number of biased, duplicate or ambiguous sequences. Our method measures the complexity of the sequences by assigning a uniqueness score to each read. Using a user-defined threshold, the user can filter out the reads with a score less than the threshold. Applying the proposed method to real samples shows that the Zseq algorithm is statistically sound and provides a better mapping rate, while significantly reducing the number of ambiguous bases in comparison with other state-of-the-art methods. The Zseq method is publicly available and can be accessed at the following link: http://sourceforge.net/projects/zseq/.
Acknowledgments This work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.
References
[Altschul et al., 1997] Stephen F Altschul, Thomas L Madden, Alejandro A Sch¨affer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped blast and psiblast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997. [Brown, 2012] Terry Brown. Introduction to Genetics: A Molecular Approach. Garland Science, 2012. [Cheadle et al., 2003] Chris Cheadle, Marquis P Vawter, William J Freed, and Kevin G Becker. Analysis of microarray data using z score transformation. The Journal of Molecular Diagnostics, 5(2):73–81, 2003. [Chen et al., 2013] Yen-Chun Chen, Tsunglin Liu, ChunHui Yu, Tzen-Yuh Chiang, and Chi-Chuan Hwang. Effects of gc bias in next-generation-sequencing data on de novo genome assembly. PloS One, 8(4):e62856, 2013.
[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995. [Grabherr et al., 2011] Manfred G Grabherr, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, et al. Full-length transcriptome assembly from rna-seq data without a reference genome. Nature biotechnology, 29(7):644–652, 2011. [Ishiguro et al., 2003] Hitoshi Ishiguro, Hiroji Uemura, Kiyoshi Fujinami, Naoya Ikeda, Shinsuke Ohta, and Yoshinobu Kubota. 55 kda nuclear matrix protein (nmt55) mrna is expressed in human prostate cancer tissue and is associated with the androgen receptor. International journal of cancer, 105(1):26–32, 2003. [Kannan et al., 2011] Kalpana Kannan, Liguo Wang, Jianghua Wang, Michael M Ittmann, Wei Li, and Laising Yen. Recurrent chimeric rnas enriched in human prostate cancer identified by deep sequencing. Proceedings of the National Academy of Sciences, 108(22):9172–9177, 2011. [Kim et al., 2011] Jung H Kim, Saravana M Dhanasekaran, John R Prensner, Xuhong Cao, Daniel Robinson, Shanker Kalyana-Sundaram, Christina Huang, Sunita Shankar, Xiaojun Jing, Matthew Iyer, et al. Deep sequencing reveals distinct patterns of dna methylation in prostate cancer. Genome Research, 21(7):1028–1041, 2011. [Kim et al., 2013] Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, and Steven L Salzberg. Tophat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4):R36, 2013. [Liu and Setiono, 1995] Huan Liu and Rudy Setiono. Chi2: Feature selection and discretization of numeric attributes. In 2012 IEEE 24th International Conference on Tools with Artificial Intelligence, pages 388–388. IEEE Computer Society, 1995. [Mackinnon et al., 2009] Margaret J Mackinnon, Jinguang Li, Sachel Mok, Moses M Kortok, Kevin Marsh, Peter R Preiser, and Zbynek Bozdech. Comparative transcriptional and genomic analysis of plasmodium falciparum field isolates. PLoS Pathogens, 5(10):e1000644, 2009. [Margulies et al., 2005] Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader, Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen, Zhoutao Chen, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380, 2005. [Morgulis et al., 2006] Aleksandr Morgulis, E Michael Gertz, Alejandro A Sch¨affer, and Richa Agarwala. A fast and symmetric dust implementation to mask lowcomplexity dna sequences. Journal of Computational Biology, 13(5):1028–1040, 2006. [Pozzoli et al., 2008] Uberto Pozzoli, Giorgia Menozzi, Matteo Fumagalli, Matteo Cereda, Giacomo P Comi, Rachele Cagliani, Nereo Bresolin, and Manuela Sironi.
Both selective and neutral processes drive gc content evolution in the human genome. BMC Evolutionary Biology, 8(1):99, 2008. [Quail et al., 2008] Michael A Quail, Iwanka Kozarewa, Frances Smith, Aylwyn Scally, Philip J Stephens, Richard Durbin, Harold Swerdlow, and Daniel J Turner. A large genome center’s improvements to the Illumina sequencing system. Nature Methods, 5(12):1005–1010, 2008. [Schmieder and Edwards, 2011] Robert Schmieder and Robert Edwards. Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6):863–864, 2011. [Traish et al., 1997] Abdulmaged M Traish, Yue-Hua Huang, Jacqueline Ashba, Mary Pronovost, Matthew Pavao, David B McAneny, and Robert B Moreland. Loss of expression of a 55 kda nuclear protein (nmt55) in estrogen receptor-negative human breast cancer. Diagnostic molecular pathology: the American journal of surgical pathology, part B, 6(4):209–221, 1997. [Trapnell et al., 2012] Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R Kelley, Harold Pimentel, Steven L Salzberg, John L Rinn, and Lior Pachter. Differential gene and transcript expression analysis of rna-seq experiments with tophat and cufflinks. Nature protocols, 7(3):562–578, 2012. [Vogel, 1997] Friedrich Vogel. Vogel and Motulsky’s Human Genetics: Problems and Approaches, volume 878. Springer, 1997. [Waszak et al., 2014] Sebastian M Waszak, Helena Kilpinen, Andreas R Gschwind, Andrea Orioli, Sunil K Raghav, Robert M Witwicki, Eugenia Migliavacca, Alisa Yurovsky, Tuuli Lappalainen, Nouria Hernandez, et al. Identification and removal of low-complexity sites in allele-specific analysis of chip-seq data. Bioinformatics, 30(2):165–171, 2014. [Wuitschick and Karrer, 1999] Jeffrey Wuitschick and Kathleen Karrer. Analysis of genomic g+c content, codon usage, initiator codon context and translation termination sites in Tetrahymena Thermophila. Journal of Eukaryotic Microbiology, 46(3):239–247, 1999. [Yakovchuk et al., 2006] Peter Yakovchuk, Ekaterina Protozanova, and Maxim D Frank-Kamenetskii. Basestacking and base-pairing contributions into thermal stability of the dna double helix. Nucleic Acids Research, 34(2):564–574, 2006. [Zhao et al., 2014] Zhixin Zhao, Xiaohui Wu, Praveen Kumar Raj Kumar, Min Dong, Guoli Ji, Qingshun Quinn Li, and Chun Liang. Bioinformatics analysis of alternative polyadenylation in green alga chlamydomonas reinhardtii using transcriptome sequences from three different sequencing platforms. G3: Genes— Genomes— Genetics, 4(5):871–883, 2014.
Data-Driven Hospital Admission Prediction

Ognjen Arandjelović
School of Computer Science, University of St Andrews
St Andrews, Fife KY16 9SX, Scotland
E-mail:
[email protected]
Abstract
The vast amounts of information in the form of electronic medical records are used to develop a novel model of disease progression. The proposed model is based on the representation of a patient's medical history in the form of a binary history vector, motivated by empirical evidence from previous work and validated using a large 'real-world' data corpus. The scope for the use of the described methodology is overarching and ranges from smarter allocation of resources and discovery of novel disease progression patterns and interactions, to incentivization of patients to make lifestyle changes.
1 Introduction
Public health care is an issue of major global significance and concern. On the one end of the spectrum, the developing world is still plagued by “diseases of poverty” which are nearly non-existent in the most technologically developed countries; on the other end, the health risk profile of industrially leading nations has dramatically changed in recent history with an increased skew towards so-called “diseases of affluence”, as illustrated in Figure 1 (data taken from [Murray et al., 2001]). Health care management poses challenges both in the sphere of policy making and scientific research. Considering the complexity of problems at hand, it is unsurprising that there is an ever-increasing effort invested in a diverse range of promising avenues. Yet, the available resources are inherently limited. To ensure their best usage it is crucial both to develop an understanding of the related epidemiology, as well as to be able to communicate this knowledge effectively to those who can benefit from it: governments [Berwick and Hackbarth, 2012], the medical research community [Beykikhoshk et al., 2015a], health care practitioners [Arandjelovi´c, 2015], and patients [Beykikhoshk et al., 2014]. The associations between diseases and a wide variety of risk factors are underlain by a complex web of interactions. This is particularly the case for the diseases of the developed world. The key premise of the present work is that to facilitate the understanding of this complexity and the discovery of meaningful patterns within it, it is crucial to make use of the
vast amounts of data routinely collected by health services in industrially and technologically developed countries. My specific aim is to develop a framework which allows a health practitioner (e.g. a doctor or a clinician) to manipulate the available patient information in an intuitive yet powerful fashion. Such a framework would, on the one end of the utility spectrum, facilitate a deepening of disease understanding, and on the other, provide the practitioner with a tool which can be used to incentivize the patient at risk to make the required lifestyle changes.

Figure 1: Causes of death for the developed world (Western Europe), developing nations (Subsaharan Africa), and the world average. (Chart categories: infectious and parasitic diseases, respiratory infections, maternal and perinatal conditions, nutritional deficiencies, injuries, non-communicable respiratory diseases, cardiovascular diseases, malignant and other neoplasms, other.)
1.1 Data: electronic medical records
This work leverages the large amounts of medical data routinely collected and stored in electronic form by health providers in most developed countries. This is a rich data source which contains a variety of information about each patient including the patient’s age and sex, mother tongue, religion, marital status, profession, etc. In the context of the present work, of main interest is the information collected each time a patient is admitted to the hospital (including outpatient visits to general practitioners or specialists). The format of this data is explained next. Each time a patient is admitted to the hospital the reason for the admission, as determined by the medical practitioner
in primary charge during the admission, is recorded in the patient's medical history. This is performed using a standardized international 'codebook', the International Statistical Classification of Diseases and Related Health Problems (ICD) [World Health Organization, 2004], a medical classification list of diseases, injuries, symptoms, examinations, and physical, mental or social circumstances issued by the World Health Organization. The ICD has a tree-like structure; at the top-most level codes are grouped into 12 chapters, each chapter encompassing a spectrum of related health issues (usually symptomatically rather than etiologically related). For example, ICD-10 Chapter 4, which includes codes E00-E90, covers "Endocrine, nutritional and metabolic diseases". At each subsequent depth level of the tree the grouping is refined and the scope of conditions narrowed down. In this paper I use the classification attained at depth two of ICD-10, which achieves a good compromise between specificity and frequency of occurrence. This results in each admission being given a three-character code comprising a leading capital letter (A-Z, first classification level) followed by a two-digit number (further refinement). For example, E62 codes for "Respiratory infections/inflammations" within the broader range of "Endocrine, nutritional and metabolic diseases".
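Working at depth two of the hierarchy amounts to truncating each recorded code to its leading letter plus two digits; a small sketch of that normalisation (with hypothetical raw codes) is:

```python
def icd10_level2(code):
    """Truncate a raw ICD-10 code (e.g. 'E11.65') to the three-character
    depth-two category used in the paper (e.g. 'E11')."""
    return code.replace(".", "")[:3].upper()

print([icd10_level2(c) for c in ["E11.65", "i25.10", "J45"]])  # ['E11', 'I25', 'J45']
```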
2 Modelling comorbidity progression
The major contribution of this work is a novel disease progression model. The principal challenge is posed by the need for a model which is sufficiently flexible to be able to capture complex patterns of comorbidity development, while at the same time constrained enough to facilitate learning from a ‘real world’ data corpus.
2.1 Bottom-up modelling
The problem of modelling disease progression has already attracted a considerable amount of research attention. Most previous research focuses on specific individual diseases, such as type-II diabetes mellitus [Topp et al., 2000; De Gaetano et al., 2008] or heart disease [Ye et al., 2012]. These methods are inherently 'low-level'-based in the sense that they explicitly model known physiological changes that affect disease progression. For example, the modelling of the progression of type-II diabetes may include low-level models of β-cell mass changes, and insulin and glucose dynamics [Topp et al., 2000], with the free parameters (e.g. β-cell replication rate) of the models adopted from previous empirical studies. Higher level disease progression then emerges from the interaction of low-level models. The low-level approach to disease modelling has several limitations. Firstly, by their very nature these models are limited to specific diseases only and cannot be readily adapted to deal with conditions with entirely different etiologies. Secondly, the modelling is usually constrained in practice to a single condition, two at the most, as the complexity of the modelled system increases dramatically with the inclusion of a greater number of conditions. This observation is of major significance as most diseases of the developed world are most often accompanied and affected by multiple comorbidities.
Lastly, the range of diseases which can be modelled in this manner is limited to diseases which are sufficiently well understood and studied to allow for the free model parameters to be set reliably; even for type-II diabetes, which has been studied extensively, at present some parameters must be set in an ad hoc manner and others using in vitro rather than in vivo data [Topp et al., 2000].
2.2 Direct high-level modelling
Given the significance of the disadvantages of low-level-based disease progression models, in this paper an alternative approach is pursued: that of seeking to describe disease progression, as well as the interplay of different comorbidities, directly on the 'high level' as observed by a medical practitioner. Previous research in this area is far scarcer than that on low-level modelling; a possible reason is the until recently limited availability of large-scale medical records data. The central idea of the existing corpus of work is to regard disease progression as a discrete sequence of events, with the progression governed by what is assumed to be a first-order Markov process [Sukkar et al., 2012; Jackson et al., 2003]. A high-level view of disease progression is seen as being reflected by a patient's admission history $H = a_1 \to a_2 \to \ldots \to a_n$, where $a_i$ is a discrete variable whose value is an ICD code corresponding to the $i$-th of $n$ admissions on the patient's record. The parameters of the underlying first-order Markov model are then learnt by estimating transition probabilities $p(a' \to a'')$ for all transitions encountered in training (the remaining transition probabilities are usually set to some low value rather than 0, using a pseudocount-based estimate) [Wang et al., 2014; Folino and Pizzuti, 2011; Bartolomeo et al., 2008]. The model can be applied to predict the admission $a_{n+1}$ expected to follow from the current history by likelihood maximization:

$$a_{n+1} = \arg\max_a \, p(a_n \to a). \quad (1)$$

Alternatively, it may be used to estimate the probability of a particular diagnosis $a^*$ at some point in the future:

$$p_f(a^*) = \sum_a \left[ p(a \to a^*) \, p_f(a) \right], \quad (2)$$

or to sample the space of possible histories:

$$H' = a_1 \to a_2 \to \ldots \to a_n \dashrightarrow a_{n+1} \dashrightarrow a_{n+2} \ldots \quad (3)$$
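For concreteness, the following sketch estimates the admission-level first-order Markov baseline summarised by (1)–(3): transition probabilities are obtained from observed consecutive admission pairs with a small pseudocount, and the next admission is predicted by likelihood maximisation. The admission codes and histories used here are hypothetical.

```python
from collections import Counter, defaultdict

def fit_markov(histories, codes, pseudocount=1e-3):
    """Estimate p(a -> a') from admission sequences with additive smoothing."""
    pair_counts = defaultdict(Counter)
    for h in histories:
        for a, a_next in zip(h, h[1:]):
            pair_counts[a][a_next] += 1
    probs = {}
    for a in codes:
        denom = sum(pair_counts[a].values()) + pseudocount * len(codes)
        probs[a] = {b: (pair_counts[a][b] + pseudocount) / denom for b in codes}
    return probs

def predict_next(probs, history):
    """Equation (1): a_{n+1} = argmax_a p(a_n -> a)."""
    last = history[-1]
    return max(probs[last], key=probs[last].get)

# Toy admission histories over hypothetical depth-two ICD codes.
histories = [["E11", "I25", "E11"], ["I25", "E11", "N18"], ["E11", "N18"]]
codes = ["E11", "I25", "N18"]
model = fit_markov(histories, codes)
print(predict_next(model, ["I25", "E11"]))
```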
The primary purpose of the Markovian assumption is to constrain the mechanism underlying a specific process and thus formulate it in a manner which leads to a tractable learning problem. Although it is seldom strictly true, that it is often a reasonable approximation to make is witnessed by its successful application across a diverse range of disciplines; examples of modelled phenomena include meteorological events [Gabriel and Neumann, 1962], software usage patterns [Whittaker and Thomason, 1994], breast cancer screening [Duffy and Yau, 1995], human motion and behaviour [Lee et al., 2005; Arandjelovi´c, 2011], and many others. Nonetheless, the key premise motivating the model in this paper is that the Markovian assumption is in fact not appropriate for the high-level modelling of disease progression
(note that this does not reject its possible applicability in disease progression modelling on different levels of abstraction). Indeed, I will demonstrate this empirically. The aforementioned premise is readily substantiated using a theoretical argument as well. Consider a patient who is admitted for what is diagnosed as a serious chronic illness. If the same patient is subsequently admitted for an unrelated ailment, possibly a trivial one, the knowledge of the serious underlying problem is lost, and with it the power to predict the next related admission. The model proposed in the section which follows solves this problem, while at the same time retaining the tractability of Markov process-based approaches.
2.3 Proposed approach
In this paper my aim is to predict the probability of a specific admission $a$ following the patient history $H$:

$$p(H \to a \mid H). \quad (4)$$
of history of a particular ailment, rather than the number of admissions related to it. The disease progression modelling problem at hand is thus reduced to the task of learning transition probabilities between different patient history vectors: p(v(H) ! v(H 0 )).
(5)
It is important to observe that unlike in the case of Markov process models working on the admission level when the number of possible transition probabilities is close to na 2 , here the transition space is far sparser. Specifically, note that it is impossible to observe a transition from a history vector which codes for the existence of a particular past admission to one which does not, that is: v(H)i(a) = 1 ^ v(H 0 )i(a) = 0 ) p(v(H) ! v(H 0 )) = 0. (6) The converse does not hold however. Moreover, possible transitions can be only those which include either no changes to the history vector (re-admission) or which encode exactly one additional admission: p(v(H) ! v(H 0 )) 8 > > 0 : 8a. v(H)i(a) = 1 ) v(H 0 )i(a) = 1 > > > > and < {a : v(H)i(a) = 1} 1 + {a : v(H 0 )i(a) = 1} > > > > > : = 0 : otherwise (7)
This gives the upper bound for the number of non-zero probability transitions of na ⇥ 2na . In practice the actual number of transitions is far smaller (several orders of magnitude for the data set described in the next section) which allows the learnt model to be stored and accessed efficiently. The final aspect of the proposed model concerns transitions with probabilities which do not vanish but which are nonetheless very low. These transitions can be reasonably considered to be noise in the sense that the corresponding probability estimates are unreliable due to low sample size. Hence admission history vectors are constructed using only the n ˆ a most common admission types and merge the remaining na n ˆa types into a single special code ‘other’. Thus, the dimensionality of admission history vectors becomes n ˆ a + 1. The soundness of this approach can be readily observed by examining the plot in Figure 2 which shows that only a small number of admission types covers a vast number of all data. For example the top 30 most frequent types account for 75% of all admissions.
2.4 Empirical model validation
We now turn our attention to the validation of the model on real data – the set of medical records of over 40,000 people treated by a local hospital. It is insightful to consider some of the characteristics of this data set before proceeding with the analysis of disease progression. As expected, the number of admissions per patient was found to vary greatly across the sample; the distribution is illustrated by the plot in Figure 3.
Figure 2: Frequency (red line) and cumulative frequency of different admissions. The plot illustrates the highly uneven distribution, with the top 30 most frequent admissions accounting for 75% of the entire data corpus.
Figure 3: Distribution of the number of admissions per patient across the evaluation data set.

Interestingly, the patient's age was found not to be associated with the number of admissions on record, while a low positive correlation (r = 0.14) was found between the patient's age and the number of conditions the patient had been diagnosed with at some point in the past – see Figures 4(a) and 4(b). A better predictor of the number of admissions was found to be the presence of a particular diagnosis/condition (e.g. a high number of admissions is associated with the presence of diagnoses of mental disorders, renal and cardiovascular conditions), as illustrated in Figures 5(a) and 5(b). Further insight can be gained by examining Figures 6(a) and 6(b), which summarize the re-admission statistics across different conditions. A mental disorder diagnosis or dialysis treatment, for example, predicts both a high probability of re-admission and a high total number of re-admissions. These results are consistent with previous studies in the literature [Vigod et al., 2013; Kilkenny et al., 2013; Allaudeen et al., 2011] and support my diagnosis presence-based model.

Next admission

To evaluate the predictive power of the proposed model, I examined its performance in the prediction of the next admission based on a patient's prior admission history, and compared this with the performance of the Markov process-based
approach described previously; see (1)–(3). Both methods were trained using an 80-20 split of the data into training and test sets. Specifically, 80% of the data corpus was used to learn the model parameters – conditional probabilities $p(\hat{H} \to a \mid \hat{H})$ in the case of the proposed model and $p(a \to a')$ for the Markov process-based model. The remaining 20% of the data was used as test input. For each test patient I considered the predictions obtained by the two methods given all possible partial histories. In other words, given a patient with the full admission history $H = a_1 \to a_2 \to \ldots \to a_n$, I obtain predictions using partial histories $H_k = a_1 \to \ldots \to a_k$ for $k = 1 \ldots n-1$. A summary of the results is given in Figure 7, which shows the cumulative match characteristic curves corresponding to the two methods – each point on a curve represents the proportion of cases (ordinate) for which the actual correct admission type is predicted with at worst a specific rank (abscissa). The first thing readily observed from the plot is that the proposed method (blue line) vastly outperforms the Markov process-based approach (red line). What is more, the accuracy of my method is rather remarkable – it correctly predicts the type of the next admission for a patient in 82% of the cases (rank-1). Already at rank-2 the accuracy is nearly 90%.

Figure 4: (a) Patient age is not associated with the total number of admissions of the patient. (b) Patient age shows low association (r = 0.14, p < 0.001) with the number of conditions the patient has been diagnosed with.
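The cumulative match characteristic curve used in Figure 7 can be computed from the per-prediction ranks of the correct next admission; a minimal sketch with hypothetical ranks:

```python
import numpy as np

def cmc_curve(correct_ranks, max_rank=30):
    """Fraction of test cases whose correct admission is predicted within each
    rank; correct_ranks holds the 1-based rank of the true next admission."""
    ranks = np.asarray(correct_ranks)
    return [(ranks <= r).mean() for r in range(1, max_rank + 1)]

# Hypothetical ranks for a handful of test predictions.
print(cmc_curve([1, 1, 2, 1, 5, 1, 3, 1], max_rank=5))
```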
Figure 5: (a) The presence of a particular condition in a patient's history is a good predictor of the total number of admissions. (b) Average number of admissions for patients containing a particular diagnosed condition in their history.
Figure 6: (a) Re-admission statistics for the top 30 diagnosed conditions. (b) Average number of re-admissions and the probability of re-admission for a particular condition.
In comparison, the Markov process-based method achieves only 35% accuracy at rank-1, less than 50% at rank-2, and reaches 90% only at rank-17. It is interesting to observe a particular feature of the CMC plot for the proposed method. Notice its tail behaviour – at rank-25 and above, the Markov process-based approach catches up and actually performs better. While performance at such a high rank is not of direct practical interest, it is insightful to consider how this observation can be explained given that it is highly unlikely for it to be a mere statistical anomaly, considering the amount of data used to estimate the characteristics. The answer is readily revealed by considering the plot in Figure 8 which shows the dependency between the average rank of the proposed method’s prediction and the length of the partial history used as input. Specifically, notice that higher ranks (i.e. worse performance) are associated with short histories. Put differently, when there is little information in a patient’s history, there is more uncertainty about the patient’s possible future ailments. This observation too strongly supports the validity of my model as it shows that accumulating evidence is used and represented in a more meaningful and robust way which allows for the learning of complex interactions between conditions and their development. Finally,
Figure 7: Cumulative match characteristics (CMCs) for the prediction of the next admission from a patient’s history.

Finally, this is illustrated in Figure 7, which also shows the proposed method’s CMC curve restricted to test histories containing at least 5 prior admissions. In this case, rank-1 and rank-2 performance reaches the remarkable accuracy of 91% and 97%, respectively.
Figure 8: Partial history length vs. next admission prediction rank.

Long-term prediction

Given the outstanding performance of my method in predicting the type of the next admission given the patient’s current medical history, I next considered how the proposed model performs in long-term predictions. Considering that we are now dealing with sequences of future admissions, and thus a much greater space of possible options, the characterization of performance using CMC curves is impractical. Rather, I now compare my approach with the Markov process-based method by comparing the corresponding conditional probabilities for the actual progression observed in the data. In other words, for the prediction following a partial history $\hat{H}$ of length $k$ and the correct full history $H = \hat{H} \to a_{k+1} \to \dots \to a_n$, I compute the log-ratio of conditional probabilities:

$\rho = \log \frac{p_{\text{Markov}}(\hat{H} \to a_{k+1} \to \dots \to a_n \mid \hat{H})}{p_{\text{proposed}}(\hat{H} \to a_{k+1} \to \dots \to a_n \mid \hat{H})}$   (8)
A positive value of $\rho$ means that the Markov process-based method performed better, and a negative value that the proposed method did. The greater the absolute value of $\rho$, the greater the measured difference in performance in the corresponding direction. As before, I divide the data into training and test sets using an 80-20 split and consider the predictions for all possible partial histories in the test set. A summary of the results is presented in Figure 9. Specifically, the plot shows the cumulative distribution function (CDF) of the log-ratio $\rho$. As in the case of the one-step prediction, it is readily apparent that the performance of the proposed method vastly exceeds that of the Markov process-based approach. The value of the CDF at the crossing of the curve with the $\rho = 0$ line is 0.82, which means that my method exhibited superior performance in 82% of the predictions. Even in the 18% of predictions in which the Markov process-based method performed better, the performance differential is not substantial. This is in sharp contrast with the instances in which the proposed method was better – in 67% of the cases the conditional probability of the correct history progression was over 100 times greater for my model.
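A minimal sketch of how the log-ratio of Equation (8) and its empirical CDF could be computed is given below; the two model-probability functions are hypothetical placeholders for whatever interface the trained models expose, and are not the author's implementation.

```python
import numpy as np

def log_ratio(markov_logprob, proposed_logprob, partial_history, future):
    """rho = log p_Markov(future | H_hat) - log p_proposed(future | H_hat);
    negative values mean the proposed model assigned the higher probability."""
    return markov_logprob(partial_history, future) - proposed_logprob(partial_history, future)

def empirical_cdf(rhos, grid):
    """Fraction of test predictions with rho at or below each grid point."""
    rhos = np.asarray(rhos)
    return np.array([np.mean(rhos <= g) for g in grid])

# e.g. empirical_cdf(rhos, grid=[0.0])[0] gives the proportion of cases in which
# the proposed model outperformed the Markov baseline (0.82 in Figure 9).
```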
Figure 9: Cumulative density function of the ratio of the probabilities of true patient medical history progression for the admission-level Markov process approach and the proposed method.
3 Summary and conclusions
In this paper the goal was to develop a framework for the inference of complex disease progression patterns from electronic medical records of patients collected by most hospitals in the developed world. The major contribution of this work is a model of disease progression whose key premise is that future development of a patient’s medical state can be predicted well by the presence of the diagnosis of a specific condition at some point in the patient’s past. Thus I introduced a novel representation of a patient’s health state in the form of a history vector – a fixed length vector with binary values which correspond to the presence or absence of specific diagnoses in the patient’s medical history. My model then learns the probabilities associated with semantically valid state transitions. The power of this model was demonstrated using a large, real-world data corpus collected by a local hospital on which it was shown to outperform vastly previous approaches in the literature, achieving over 91% accuracy in the prediction of the first future diagnosis for a patient given the patient’s present medical history (in comparison with 35% accuracy achieved by previously proposed methods). The modelling framework described in this paper readily opens a range of promising future research directions. Firstly, considering the outstanding performance of the methodology at the depth level two of ICD-10, given sufficient training data the framework can be extended to offer more refined prognosis through the use of more specific admission codes found at greater depths of the ICD-10 hierarchy; hierarchical models could prove useful in this regard. Secondly, the use of complementary expert (human) knowledge could be integrated with the proposed data-driven approach to constrain its search space, and increase robustness to missing data and noise. Finally, there are numerous useful possibilities in the realm of visualization and human-computer interaction. Through the use of effective visualization and data exploration techniques, the knowledge extracted by my framework can be used to enhance practitioners’ and clinicians’ understanding of specific patients or patient cohorts, or to incentivize and educate the health care recipient.
References [Allaudeen et al., 2011] N. Allaudeen, A. Vidyarthi, J. Maselli, and A. Auerbach. Redefining readmission risk factors for general medicine patients. J Hosp Med, 6(2):54–60, 2011. [Arandjelovi´c, 2011] O. Arandjelovi´c. Contextually learnt detection of unusual motion-based behaviour in crowded public spaces. In Proc. International Symposium on Computer and Information Sciences, pages 403–410, 2011. [Arandjelovi´c, 2015] O. Arandjelovi´c. Prediction of health outcomes using big (health) data. Conf Proc IEEE Eng Med Biol Soc, 2015. [Bartolomeo et al., 2008] N. Bartolomeo, P. Trerotoli, A. Moretti, and G. Serio. A Markov model to evaluate hospital readmission. BMC Med Res Methodol, 8(1):23, 2008. [Berwick and Hackbarth, 2012] D. M. Berwick and A. D. Hackbarth. Eliminating waste in US health care. JAMA, 307(14):1513–1516, 2012. [Beykikhoshk et al., 2014] A. Beykikhoshk, O. Arandjelovi´c, D. Phung, S. Venkatesh, and T. Caelli. Datamining Twitter and the autism spectrum disorder: a pilot study. In Proc. IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, pages 349–356, 2014. [Beykikhoshk et al., 2015a] A. Beykikhoshk, O. Arandjelovi´c, D. Phung, and S. Venkatesh. Hierarchical Dirichlet process for tracking complex topical structure evolution and its application to autism research literature. In Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1:550–562, 2015. [Beykikhoshk et al., 2015b] A. Beykikhoshk, O. Arandjelovi´c, D. Phung, S. Venkatesh, and T. Caelli. Using Twitter to learn about the autism community. Soc Net Anal Min, 5(1):5–22, 2015. [Butler and Kalogeropoulos, 2012] J. Butler and A. Kalogeropoulos. Hospital strategies to reduce heart failure readmissions. J Am Coll Cardiol, 60(7):615–617, 2012. [De Gaetano et al., 2008] A. De Gaetano, T. Hardy, B. Beck, E. Abu-Raddad, P. Palumbo, J. Bue-Valleskey, and N. Pørksen. Mathematical models of diabetes progression. Am J Physiol Endocrinol Metab, 295:E1462–E1479, 2008. [Dharmarajan et al., 2013] K. Dharmarajan, A. F. Hsieh, Z. Lin, H. Bueno, J. S. Ross, I. Horwitz, J. A. BarretoFilho, N. Kim, S. M. Bernheim, L. G. Suter, E. E. Drye, and H. M. Krumholz. Diagnoses and timing of 30-day readmissions after hospitalization for heart failure, acute myocardial infarction, or pneumonia. JAMA, 309(4):355– 363, 2013. [Duffy and Yau, 1995] N. D. Duffy and J. F. S. Yau. Estimation of mean sojourn time in breast cancer screening using a Markov chain model of both entry to and exit from the preclinical detectable phase. Stat Med, 14(14):1531–1543, 1995. [Folino and Pizzuti, 2011] F. Folino and C. Pizzuti. Combining Markov models and association analysis for disease
prediction. In Proc. International Conference on Information Technology in Bio- and Medical Informatics, pages 39–52, 2011. [Friedman et al., 2008 2009] B. Friedman, H. J. Jiang, and A. Elixhauser. Costly hospital readmissions and complex chronic illness. Inquiry, 45(4):408–421, 2008–2009. [Gabriel and Neumann, 1962] K. R. Gabriel and J. Neumann. A Markov chain model for daily rainfall occurrence at Tel Aviv. Q J Roy Meteor Soc, 88(375):90–95, 1962. [Jackson et al., 2003] C. H. Jackson, L. D. Sharples, S. G. Thompson, S. W. Duffy, and E. Couto. Multistate Markov models for disease progression with classification error. J Roy Stat Soc D, 52(2):193–209, 2003. [Kilkenny et al., 2013] M. F. Kilkenny, M. Longworth, M. Pollack, C. Levi, and D. A. Cadilhac. Factors associated with 28-day hospital readmission after stroke in Australia. Stroke, 44(8):2260–2268, 2013. [Lee et al., 2005] K. Lee, M. Ho, J. Yang, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE T Pattern Anal, 27(5):684–698, 2005. [Mudge et al., 2011] A. M. Mudge, K. Kasper, A. Clair, H. Redfern, J. J. Bell, M. A. Barras, G. Dip, and N. A. Pachana. Recurrent readmissions in medical patients: a prospective study. J Hosp Med, 6(2):61–67, 2011. [Murray et al., 2001] C. J. L. Murray, A. D. Lopez, C. D. Mathers, and C. Stein. The global burden of disease 2000 project: aims, methods and data sources. World Health Organization, 2001. [Sukkar et al., 2012] R. Sukkar, E. Katz, Y. Zhang, D. Raunig, and B. T. Wyman. Disease progression modeling using hidden Markov models. Conf Proc IEEE Eng Med Biol Soc, pages 2845–2848, 2012. [Topp et al., 2000] B. Topp, K. Promislow, G. de Vries, R. M. Miura, and D. T. Finegood. A model of -cell mass, insulin, and glucose kinetics: Pathways to diabetes. J Theor Biol, 206:605–619, 2000. [Vigod et al., 2013] S. N. Vigod, V. H. Taylor, K. Fung, and P. A. Kurdyak. Within-hospital readmission: an indicator of readmission after discharge from psychiatric hospitalization. Can J Psychiatry, 58(8):476–481, 2013. [Wang et al., 2014] X. Wang, D. Sontag, and F. Wang. Unsupervised learning of disease progression models. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 85–94, 2014. [Whittaker and Thomason, 1994] J. A. Whittaker and M. G. Thomason. A Markov chain model for statistical software testing. IEEE T Software Eng, 20(10):812–824, 1994. [World Health Organization, 2004] World Health Organization. International statistical classification of diseases and related health problems., volume 1. World Health Organization, 2004. [Ye et al., 2012] W. Ye, D. J. M. Isaman, and J. Barhak. Use of secondary data to estimate instantaneous model parameters of diabetic heart disease: Lemonade Method. Inform Fusion, 13:137–145, 2012.
Using Ensemble Combiners to Improve Gene Network Inference
Alan R. Fachini 1, David C. Martins, Jr. 2 and Anna H. Reali Costa 1
1 Escola Politécnica, Universidade de São Paulo, Brazil
2 Centro de Matemática, Computação e Cognição - Universidade Federal do ABC, Brazil
@usp.br,
[email protected]

Abstract
Ensemble learning explores the opinion of a community of experts rather than just an individual opinion. Its application to gene network inference still deserves further investigation. In this work, we evaluate widely known gene network inference methods and investigate the use of ensemble combiners to improve inference results when applied to datasets with a small number of samples. We analyze the performance of the individual and ensemble methods using simulated gene expression datasets. We found that ensemble combiners improve the accuracy of the inference results, being an effective approach that can help in the identification of gene networks.
1 Introduction
In a broad range of knowledge fields, decisions are made by taking the opinion of a community of specialists into account rather than relying on just one individual [Polikar, 2006]. This approach can weigh individual opinions and usually leads to better decisions. In Artificial Intelligence (AI), the ensemble learning approach is used to weigh and combine classifiers’ hypotheses. Previous research has shown that the use of ensembles generally produces more accurate results than a single classifier [Whalen and Pandey, 2013; Marbach et al., 2012; Maclin and Opitz, 2011; Polikar, 2006]. Huge databases of gene expression profiles have been generated by high-throughput sequencing and mRNA transcription technologies such as microarrays and RNA-Seq. The use of such data in the inference of gene regulatory interactions is one of the most challenging problems in bioinformatics research and has been intensively studied [Chai et al., 2014; Wang and HaiyanHuang, 2014; Alakwaa, 2014; Marbach et al., 2010; Hache et al., 2009]. Genes and their products do not work independently; they interact with each other, forming a complex network. The relationships among genes can be represented by a Gene Network (GN), which provides a global view of the gene regulatory system and which can be used to investigate cell development and function in organisms. The study of these networks can help to identify target genes potentially involved
in disease processes, fostering the development of drugs and medical treatments [Lopes et al., 2011]. In recent years many computational methods have been developed to reconstruct GNs [Liu, 2015], and many techniques have been borrowed from AI so as to achieve better results and to better deal with incomplete information. Most gene network inference approaches use only gene expression profile data, while more recent methods try to integrate multiple classes of data [Hecker et al., 2009]. Unfortunately, there is no single method that can correctly infer all gene interactions of a network, and both their absolute and comparative performances remain poorly understood [Marbach et al., 2010; 2012]. The main difficulty is that a large number of data points is needed, but the available expression data is usually very sparse, containing a small sample size relative to the number of covariates [Alakwaa, 2014; Hache et al., 2009]. Moreover, current inference methods still produce gene networks with many false gene interactions, suggesting that further improvements are needed to provide a method truly reliable for biological applications [Mendoza and Bazzan, 2012]. Another important issue is the lack of information about the organisms, which poses a challenge to the validation of inferred networks. Hence, simulated data from gene regulatory network models are often adopted to evaluate the topologies of inferred networks. Such evaluation is based on comparing inferred networks with artificially generated gold standard networks [Lopes et al., 2011; Marbach et al., 2012]. Recently, some studies have addressed the use of ensemble classifiers in genomics. Whalen and Pandey [Whalen and Pandey, 2013] explore the use of ensemble selection and stacking in protein function prediction and genetic interaction prediction by applying and combining several machine learning algorithms. They demonstrate that the presented approaches offer statistically significant improvements for gene interaction prediction. Marbach and colleagues [Marbach et al., 2012] have explored the wisdom-of-crowds concept applied to the gene network inference problem, which suggests that the collective knowledge of a community can be greater than the knowledge of any individual. The authors integrated the inferred networks of all participating teams to construct a community network by re-scoring interactions according to their average ranks across a set of different methods. The study suggests that combining inference methods could be a
good strategy to improve the quality of the network reconstruction, as different methods have different strengths and weaknesses. In fact, other authors have used ensembles to combine inference methods and improve their inference results. Mendoza and colleagues [Mendoza and Bazzan, 2012] performed a comparative study between several ensemble aggregation methods and analyzed their performance on artificially generated probabilistic Boolean GNs. The authors found that the ensemble predictions yield more accurate results than individual ones. Zhong and colleagues [Zhong et al., 2014] evaluated the performance of different statistical inference methods using simulated networks and Escherichia coli datasets. The authors propose the Ensemble-based Network Aggregation framework (ENA), showing that it can be used to integrate GNs constructed by different methods and to produce more accurate networks. Most network inference studies assume unrealistic scenarios, such as a large number of available data samples, or just a small set of genes present in the networks. However, only a few organisms such as Escherichia coli have databases with a large number of samples available. Another problem is that some studies assume binary data as input for their inference algorithms. While this can help in the investigation of the global behavior of gene networks [Shmulevich et al., 2002], binary datasets are prone to important information loss due to quantization and do not capture the noisy perturbations of the gene expression profiles provided by high-throughput technologies. In this paper, we compare the quality of popular co-expression methods (CLR, GeneNet, MRNET and WGCNA) in the inference of large-scale gene networks using the datasets provided by the DREAM5 Challenge 3 [Pinna et al., 2011]. We then conducted an experiment removing samples from the datasets to investigate the impact of the sample size on the quality of the inferred networks. To improve the results of the experiments, we applied popular ensemble combiner algorithms to aggregate the gene networks resulting from the inference methods. In the experiments we show that decreasing the dataset size drastically degrades the inference quality of all methods. Furthermore, we show that we can improve the quality of the networks by applying ensemble combiners. This paper is organized as follows. Section 2 reviews the inference methods and Section 3 describes the popular techniques for ensemble combiners used in our study. Our experiments and results are shown in Section 4, and Section 5 concludes the paper.
2 Inference Methods
In this section we review the inference methods used in this study. We chose methods that are easily accessible as software packages implemented in the R language and that can be executed on a personal computer. These methods accept gene expression profiles as input and are able to generate results from data containing a large number of genes and few samples.
2.1 CLR
The CLR (Context Likelihood of Relatedness) method [Faith et al., 2007] computes the mutual information (MI) for each pair of genes and derives a score related to the empirical distribution of the MI values. Instead of considering the MI $I(X_i, X_j)$ between genes $X_i$ and $X_j$, it considers the score

$z_{ij} = \sqrt{z_i^2 + z_j^2},$   (1)

where $z_i$ is defined as

$z_i = \max\left(0, \frac{I(X_i, X_j) - \mu_i}{\sigma_i}\right),$

where $\mu_i$ and $\sigma_i$ are the mean and the standard deviation of the empirical distribution of the MI values $I(X_i, X_k)$, $k = 1, ..., n$. CLR was used to infer the Escherichia coli GN [Meyer et al., 2007].
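A minimal sketch of the CLR background correction of Equation (1) is shown below. It assumes a symmetric mutual-information matrix has already been estimated with any MI estimator; it is not the reference implementation of the method.

```python
import numpy as np

def clr_scores(mi):
    """CLR background correction of a symmetric MI matrix (Eq. 1):
    z_i standardises each pair's MI against gene i's empirical MI distribution,
    and the edge score is sqrt(z_i^2 + z_j^2)."""
    mu = mi.mean(axis=1)                       # mean MI of each gene
    sigma = mi.std(axis=1) + 1e-12             # std of each gene's MI values
    z_rows = np.maximum(0.0, (mi - mu[:, None]) / sigma[:, None])   # z_i for every pair
    z_cols = np.maximum(0.0, (mi - mu[None, :]) / sigma[None, :])   # z_j for every pair
    return np.sqrt(z_rows ** 2 + z_cols ** 2)
```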
2.2 GeneNet
Graphical Gaussian models (GGMs) describe GNs using an undirected probabilistic graphical model that allows direct interactions to be distinguished from indirect ones. GGMs provide conditional independence relations between each pair of genes [Hache et al., 2009]. GeneNet implements a specific learning algorithm that allows GGMs to be estimated from small-sample, high-dimensional data. The GeneNet method [Schäfer and Strimmer, 2005] assumes that the data have a multivariate normal distribution. The partial correlation matrix, $r^X_{ij}$, is related to the inverse of the covariance matrix $C$ of the data. Hence the covariance matrix is inverted and the partial correlations are determined as

$r^X_{ij} = -\frac{C^{-1}_{ij}}{\sqrt{C^{-1}_{ii}\, C^{-1}_{jj}}}.$   (2)

In the last step, a significance test is applied to determine which partial correlations are significantly different from zero.
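The partial-correlation computation of Equation (2) can be sketched as follows. Note that GeneNet itself uses a shrinkage estimator of the covariance matrix together with a significance test; this sketch uses the plain sample covariance purely to illustrate the formula and assumes more samples than genes so that the matrix is invertible.

```python
import numpy as np

def partial_correlations(expr):
    """Partial correlations from the inverse covariance matrix (Eq. 2).
    expr: samples x genes expression matrix. GeneNet itself uses a shrinkage
    estimate of C, which matters when there are far fewer samples than genes."""
    C = np.cov(expr, rowvar=False)
    P = np.linalg.inv(C)                 # precision matrix
    d = np.sqrt(np.diag(P))
    r = -P / np.outer(d, d)              # r_ij = -P_ij / sqrt(P_ii * P_jj)
    np.fill_diagonal(r, 1.0)
    return r
```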
2.3 MRNET
MRNET is a method for gene network inference which performs a series of maximum relevance / minimum redundancy (MRMR) feature selection procedures, in which each gene $X_i$ in turn plays the role of the output target $Y$. The method ranks the input variable set $V$ according to the score $s_j = u_j - r_j$. The relevance term $u_j = I(X_j; Y)$ is the mutual information of $X_j$ with the target variable $Y$, and the redundancy term

$r_j = \frac{1}{|S|} \sum_{X_k \in S} I(X_j; X_k)$   (3)

measures the average redundancy of $X_j$ to each already selected variable $X_k \in S$. The selection procedure is repeated for each target gene by assigning $Y = X_i$ and $V = X \setminus X_i$, where $X$ is the expression level of all genes. Then, it computes and selects the maximum score

$X_j^{MRMR} = \arg\max_{X_j \in V \setminus S} (s_j)$   (4)

for the pair of genes $(X_i, X_j)$. A network can be generated by deleting all edges with scores smaller than a given threshold [Meyer et al., 2007].
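A naive sketch of the MRMR selection loop behind MRNET (Equations (3)–(4)) is shown below; it assumes a precomputed MI matrix and a fixed number of selections per target gene, and it is far less efficient than the actual MRNET implementation.

```python
import numpy as np

def mrnet_scores(mi, n_select=10):
    """MRNET sketch: for each target gene Y, greedily pick regulators maximising
    relevance minus redundancy (s_j = u_j - r_j), recording the score at which
    each gene was selected. Final edge score = max over both directions."""
    n = mi.shape[0]
    W = np.zeros((n, n))
    for y in range(n):
        candidates = [g for g in range(n) if g != y]
        selected = []
        for _ in range(min(n_select, len(candidates))):
            best_g, best_s = None, -np.inf
            for g in candidates:
                u = mi[g, y]                                                     # relevance
                r = np.mean([mi[g, k] for k in selected]) if selected else 0.0   # redundancy
                if u - r > best_s:
                    best_g, best_s = g, u - r
            selected.append(best_g)
            candidates.remove(best_g)
            W[y, best_g] = max(W[y, best_g], best_s)
    return np.maximum(W, W.T)   # symmetrise; threshold to obtain the network
```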
2.4 WGCNA
Methods based on correlation provide a simple approach to the inference of gene networks. If the expression of gene $X_i$ increases and gene $X_j$ either increases or decreases simultaneously, it can be said that the genes are co-expressed, and a correlation-based method can detect and model this behavior [Wang and HaiyanHuang, 2014]. WGCNA (Weighted Gene Co-expression Network Analysis) [Zhang and Horvath, 2005] is a widely applied correlation method. The gene network is depicted by an adjacency matrix $A = [a_{ij}]$ which encodes the interaction of two genes $X_i$ and $X_j$, using the Pearson correlation to measure the degree of correlation between them. The adjacency matrix is defined as

$A(G_i, G_j) = \mathrm{cor}(G_i, G_j) = \frac{1}{T} \sum_{l=1}^{T} \left(\frac{x_{il} - \bar{x}_i}{\sigma_i}\right) \left(\frac{x_{jl} - \bar{x}_j}{\sigma_j}\right),$   (5)

where $G_i$ and $G_j$ are the sample vectors of genes $X_i$ and $X_j$. Then a similarity between two genes, $S(G_i, G_j)$, is calculated by taking the complement of the adjacency matrix:

$S(G_i, G_j) = 1 - |A(G_i, G_j)|.$   (6)

WGCNA increases the contrast between high and low correlation values, softening the threshold by raising the correlation to a constant power.
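The correlation adjacency and similarity of Equations (5)–(6) can be sketched as below. The soft-thresholding power beta is an assumption used only for illustration; in practice WGCNA chooses it from the data (e.g., via a scale-free fit criterion).

```python
import numpy as np

def wgcna_adjacency(expr, beta=6):
    """Correlation-based adjacency (Eq. 5) with WGCNA-style soft thresholding:
    raising |cor| to a constant power beta increases the contrast between high
    and low correlations. expr: samples x genes expression matrix."""
    A = np.corrcoef(expr, rowvar=False)   # Pearson correlation between genes
    S = 1.0 - np.abs(A)                   # complement, as in Eq. (6)
    soft = np.abs(A) ** beta              # soft-thresholded adjacency
    return A, S, soft
```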
3 Ensemble Combiners

Ensemble-based learning is the process by which different results are generated and combined to solve a given problem. Systems based on ensembles can be used to improve performance or reduce the search neighborhood of a classification problem by choosing and combining the best classifiers. Such systems help to improve the classification quality when dealing with a large amount of data with class disparity. As this study adopts inference methods which result in continuous output values, one for each pair of genes, we can interpret each output as a confidence score given to the possible interaction of two genes. The confidence scores of many inference methods can be combined to improve the precision of the reconstructed GN. Figure 1 shows a general framework for an ensemble system. The gene expression matrix is given to the gene network inference methods, and the predicted output of each algorithm is combined by an ensemble combiner $\mu_j(x)$ to produce the ensemble output. Combining the output of several classifiers is useful only if there is disagreement among them, since combining identical classifiers produces no gain [Maclin and Opitz, 2011]. As will be shown in Section 4, the inference methods disagree in all evaluation metrics. There are many ensemble aggregation approaches, and the best algorithm depends on the data available and on the particular problem. The mean, maximum rule, and weighted average algorithms are simple and show consistent performance in problems where the classes are continuous outputs and the accuracies of the classifiers can be reliably estimated [Polikar, 2006]. In this section we describe these ensemble algorithms and also the rank product approach proposed by [Zhong et al., 2014].

Figure 1: An ensemble combiner of multiple gene network inference methods. Each algorithm produces its own representation of the gene network; these are aggregated by the ensemble combiner $\mu_j(x)$.

3.1 Maximum Rule
This is one of the simplest combiners; it takes the maximum among the classifiers’ outputs according to

$\mu_j(x) = \max_{t=1,\dots,T} \{d_{t,j}(x)\},$   (7)
where $T$ is the total number of classifiers and $d_{t,j}(x) \in [0, 1]$ represents the confidence score given by classifier $t$ to class $\omega_j$ for a given output instance $x$ [Polikar, 2006]. In this work each class $\omega_j$ represents the presence (or absence) of an edge between two genes.
3.2 Mean Rule
The mean rule is an algebraic combiner which uses a simple function to compute the total support for each class (gene interaction). The choice of the class $\omega_j$ is supported by taking the average of all $j$th outputs obtained by the inference methods:

$\mu_j(x) = \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x),$   (8)

where the class $\omega_j$ with the largest $\mu_j(x)$ is chosen [Polikar, 2006].
3.3 Weighted Average
This rule combines the mean rule with weighted majority voting. Majority voting is the algorithm traditionally used in audience polling, where the class $\omega_j$ that receives the largest number of votes wins. In weighted majority voting, if certain experts are more precise than others we can add weights $w_t$ to their opinions in the voting poll (a larger weight means better precision). In the weighted average, the weights are not applied to the output classes, but rather to the continuous output $d_{t,j}(x)$ of the classifier [Polikar, 2006]. This rule is defined as:

$\mu_j(x) = \sum_{t=1}^{T} w_t\, d_{t,j}(x).$   (9)
3.4 Rank Product
In the rank product algorithm, the lower rank is assigned to the class with higher confidence for each classifier’s output [Zhong et al., 2014]. After the rank $d_{t,j}(x)$ has been computed for each GN (given by each classifier $t \in T$), we calculate the rank of each particular edge $\mu_j(x)$ by taking the product of the ranks of the same edge across all GNs (across all classifiers $T$):

$\mu_j(x) = \prod_{t=1}^{T} d_{t,j}(x).$   (10)
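The four combiners of this section can be expressed compactly once every method's output has been flattened into a vector of per-edge confidence scores. The sketch below is illustrative only; the array layout and the convention that rank 1 denotes the most confident edge are assumptions.

```python
import numpy as np
from scipy.stats import rankdata

def combine(scores, weights=None):
    """scores: T x E array, one row of per-edge confidence scores per inference
    method. Returns the four combined scores of Section 3 (Eqs. 7-10)."""
    T, E = scores.shape
    if weights is None:
        weights = np.ones(T) / T
    max_rule = scores.max(axis=0)                # Eq. (7)
    mean_rule = scores.mean(axis=0)              # Eq. (8)
    weighted_avg = weights @ scores              # Eq. (9)
    # Rank product: rank 1 = highest confidence within each method, Eq. (10)
    ranks = np.vstack([rankdata(-row, method='average') for row in scores])
    rank_product = ranks.prod(axis=0)            # lower product = stronger edge
    return max_rule, mean_rule, weighted_avg, rank_product
```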
4 Experiments and Results
The experimental framework used in this paper is shown in Figure 2. The DREAM5 System Genetics datasets are used for performance assessment, and the inference methods are applied and aggregated by the ensemble algorithms. This section details each step of this framework.
Figure 2: Experimental Framework. DREAM5 simulated data is used as input for the inference methods, the inference methods are applied and a performance assessment is conducted.
4.1 Benchmark datasets
As the gene networks of real organisms are not well known, we used simulated gene expression datasets to allow a more confident performance assessment of the applied algorithms. The simulated data is limited as it does not come from real experiments, but the availability of gold standard networks helps in the quantitative assessment of the algorithms’ performance. We used the benchmark dataset provided by the DREAM5 System Genetics challenge (http://wiki.c2b2.columbia.edu/dream/index.php/D5c3) to explore the efficacy of our experimental framework. The simulated datasets were generated by the SysGenSIM software [Pinna et al., 2011] available on the challenge website. Networks of 1000 genes with steady-state gene-expression levels were created using a modular scale-free topology, and the simulations were done using deterministic ordinary differential equations [Liu et al., 2008]. The datasets contain gene expression profiles with
three sample sizes: 100, 300 and 999. For each sample size, five different networks with increasing number of edges were simulated. The gold standards provided consist of an edge list with the source gene vi in the first column and the target gene vj in second column. The third column refers to the presence (set to 1) or absence (set to 0) of regulatory interaction from vi to vj . As the aim is to evaluate the GN inference algorithms from datasets with small number of samples, we also created seven new gene expression profiles containing 5, 10, 15, 20, 30, 50, and 80 samples. Such datasets are subsets of 100 samples datasets.
4.2 Performance Assessment
The use of simulated datasets allows the performance assessment of the inference methods, as the gold standard network is known beforehand. The GN inference problem can be seen as a binary decision problem where, for each pair of genes, the inference method decides whether or not there is an edge representing the presence of an interaction between them. The inferred network can be represented as an adjacency matrix $E$, such that each edge from gene $v_i$ to gene $v_j$ implies $E_{ij} = 1$, or $E_{ij} = 0$ otherwise [Lopes et al., 2011]. The comparison of the inferred GN and the gold standard is summarized in a confusion matrix. If the interaction is correctly inferred, the edge is considered a true positive (TP); otherwise it is a false positive (FP). If the absence of an interaction is correctly inferred, it is considered a true negative (TN); otherwise it is a false negative (FN). Several metrics can be extracted from the confusion matrix. In this study the results are evaluated using the area under the Precision-Recall curve, AUPR, for the set of interactions inferred for a network. Precision, also named Positive Predictive Value (PPV), is defined as

$\text{Precision} = PPV = \frac{TP}{TP + FP},$   (11)

which measures the fraction of inferred interactions present in the gold standard among all inferred interactions. In turn, Recall, also named True Positive Rate (TPR), is defined as

$\text{Recall} = TPR = \frac{TP}{TP + FN},$   (12)

which measures the proportion of inferred interactions present in the gold standard out of all interactions present in the gold standard. We also evaluate the results using the area under the Receiver Operating Characteristic (ROC) curve, AUROC. The ROC curve shows the TPR (Equation 12) on the y axis. The x axis shows the False Positive Rate (FPR), defined as

$FPR = \frac{FP}{FP + TN},$   (13)

which denotes the fraction of inferred interactions that are absent in the gold standard among all absent interactions in the gold standard. The diagonal divides the ROC space, where points above the diagonal represent good classification results (better than random), and points below the line represent poor results (worse than random). A perfectly inferred network has an AUROC and AUPR of 1.
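A minimal sketch of this assessment using scikit-learn is given below; it assumes the inferred scores and the gold-standard labels have been flattened into aligned vectors over the candidate edges, and it uses average precision as a common stand-in for the area under the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def assess(confidence, gold):
    """confidence: per-edge scores from an inference method (flattened upper
    triangle); gold: matching 0/1 labels from the gold-standard edge list."""
    auroc = roc_auc_score(gold, confidence)
    aupr = average_precision_score(gold, confidence)   # step-wise AUPR estimate
    return auroc, aupr

def upper_triangle(matrix):
    """Flatten the upper-right triangle (E_ij, i < j) used for evaluation."""
    i, j = np.triu_indices_from(matrix, k=1)
    return matrix[i, j]
```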
4.3 Results
The inference methods generally return the inferred gene networks as a weighted undirected graph GN = (V, E) where V = {v1 , · · · , vn } are the vertices (genes) and E = {(vi , vj ) : vi , vj 2 V} is the adjacency matrix in which Eij is a confidence score given by the inference method for the interaction between vi and vj . Although some inference methods like CLR and MRNET return a sparse adjacency matrix where most of the interactions are 0, others like WGCNA and GeneNet return a graph with weights assigned to all edges Eij . While a large value of Eij expresses a high confidence that there is a regulatory interaction between genes vi and vj , values near zero indicate no interaction between them. For the evaluation of the algorithms, we converted the upper right triangle (Eij : i < j) of the inferred network adjacency matrix to an ordered adjacency list where the index of the source gene vi is always smaller than the index of the target gene vj . The ordered inferred adjacency list is compared with the ordered gold standard list G in order to indicate the accuracy of the inference method. The gold standard list Gij can assume the values 1 and 0. If Gij is equal to 1, a true positive (TP) interaction is assumed when a high confidence score Eij is given by the inference method. If Gij is equal to 0 and Eij is high, it means that the inference method returned a false positive (FP) interaction. When Gij is 0 and Eij is near 0, the inference method correctly identified a true negative (TN) interaction, while when Gij is 1 and Eij is near 0, the inference method identified a false negative (FN) interaction. After the inference methods have been applied for all datasets and the ensembles have been built, we calculated their individual AUROC and AUPR scores for each of the five networks generated (five networks for each sample size). These scores indicate the performance of the methods by varying the confidence score threshold. For each considered sample size, we average the five AUROC and AUPR scores obtained by each method in a single value. Figures 3 and 4 show the area under ROC curve (AUROC) and the area under PR curve (AUPR) for the inferred networks describing the results for the default datasets provided by the DREAM5 challenge (100, 300 and 999 samples) and the datasets generated in this experiment (5, 10, 15, 20, 50 and 80 samples). It is clear that a decreasing number of samples degrades the inference quality of all methods considering both AUROC and AUPR scores. WGCNA presented the best AUROC scores for the sample sizes 100, 300 and 999: 0.767, 0.844 and 0.896, respectively. Then comes CLR with AUROC scores 0.763, 0.842 and 0.883, and MRNET with 0.758, 0.831 and 0.869. The worst method in the comparison for these three sample sizes was GeneNet with AUROC scores of 0.757, 0.808 and 0.848, which means that GeneNet identified more false positive interactions than any other method. The AUROC scores achieved by all inference methods for small datasets were very similar. However WGCNA and GeneNet compete with each other: WGCNA performs better for sample sizes 50 and 80, while GeneNet performs better for smaller sample sizes. The AUROC scores obtained by
WGCNA were 0.539, 0.575, 0.604, 0.630, 0.707 and 0.750 for 5, 10, 15, 20, 50 and 80 samples, respectively. In turn, the AUROC scores obtained by GeneNet were 0.545, 0.583, 0.610, 0.633, 0.705 and 0.741.

Figure 3: AUROC for the inferred networks executed on different dataset sizes. Each value represents the average AUROC for the inference methods obtained from 5 different network sizes (in terms of number of edges). The horizontal lines above the bars show the AUROC scores for the ensemble combiners.

For the AUPR score we see a completely different situation, where GeneNet achieved the best AUPR scores for the networks reconstructed from dataset sizes 100, 300 and 999: 0.141, 0.251 and 0.336, respectively. It is followed by CLR, with scores 0.131, 0.241 and 0.330, MRNET with scores 0.125, 0.230 and 0.285, and WGCNA with the smallest AUPR scores of 0.125, 0.192 and 0.222. The AUPR scores obtained by the considered inference methods for smaller datasets show noticeable variation only for networks reconstructed using 50 and 80 samples. WGCNA AUPR scores for 50 and 80 samples were, respectively, 0.078 and 0.110, MRNET scores were 0.067 and 0.104, CLR scores were 0.078 and 0.119, and GeneNet scores were 0.073 and 0.120. While WGCNA performed better in terms of AUROC score, GeneNet presented better AUPR scores in most situations. This means that WGCNA presented better overall performance (i.e., considering the balance between TP and TN) than GeneNet. On the other hand, GeneNet correctly recovered more positive interactions (considering TP) than WGCNA. This is important because, as gene networks are sparse (i.e., the number of zeros is much greater than the number
of ones in the gold standard), it is preferable to correctly recover more true positives, even at the expense of obtaining a few more false positives [Davis and Goadrich, 2006]. As discussed next, this difference in the scores means that combining the inference methods might result in better networks.

Figure 4: AUPR for the inferred networks executed on different datasets. Each value represents the average AUPR for the inference methods obtained from 5 different network sizes (in terms of number of edges). The horizontal lines above the bars show the AUPR scores for the ensemble combiners.

Figures 3 and 4 also show the AUROC and AUPR scores obtained by the considered ensemble combiners. In general, all ensemble combiners performed better than any individual method, resulting in better scores in terms of both AUROC and AUPR for networks reconstructed from datasets with sample sizes 100, 300 and 999. The rank product resulted in the best AUROC scores (0.781, 0.864 and 0.914) and AUPR scores (0.149, 0.257 and 0.365) for sample sizes 100, 300 and 999, respectively. The ensemble combiners also performed better for datasets with a reduced number of samples. The AUROC scores obtained by rank product for networks reconstructed from 5, 10, 15, 20, 50 and 80 samples were, respectively, 0.545, 0.584, 0.617, 0.642, 0.719 and 0.764. In turn, the AUPR scores obtained by rank product were 0.005, 0.007, 0.013, 0.021, 0.081 and 0.123. The AUROC scores obtained by the weighted average for these networks were 0.546, 0.585, 0.613, 0.638, 0.715 and 0.756, while its AUPR scores were exactly the same as those obtained by rank product. For 15 or fewer data points we note that the AUROC scores increased only slightly when using ensemble combiners, and that the AUPR scores were almost the same for all ensemble combiners and individual methods. Overall, as the number of samples increases, the difference between the performances of the ensemble approaches and the individual methods tends to increase.
5 Conclusion and Future Work
In the present work we have undertaken a comparative evaluation of the performance of four gene network inference methods: CLR, WGCNA, GeneNet and MRNET. We have also tested the use of four ensemble combiner algorithms to aggregate the inferred networks: max rule, mean rule, weighted average and rank product. Our experiments were conducted using the simulated datasets provided by the DREAM5 System Genetics challenge. The evaluation performed indicates that the individual methods show different results depending on the applied evaluation metric, with WGCNA and GeneNet showing the best scores. By aggregating the results of the inference methods, we found that the predictions of an ensemble are more accurate than individual ones in most cases, corroborating previous studies [Zhong et al., 2014; Mendoza and Bazzan, 2012; Marbach et al., 2010; 2012]. Besides, the rank product ensemble performed better than the other combiners. Unlike other studies, we also tested the individual and ensemble methods on datasets with a very limited number of samples and a large number of genes, in order to mimic more realistic scenarios. Our results suggest that ensemble combiners can improve the accuracy of gene networks inferred in this situation. However, there is not a substantial improvement in accuracy when applying the ensemble methods to networks inferred from datasets with 15 or fewer samples. We suggest the use of the ensemble methods described here for datasets containing at least 20 gene expression samples, since the improvement obtained by applying ensembles becomes increasingly substantial as the number of samples increases. The evaluation criterion in DREAM5 Challenge 3 was based on directed edges, and the best performers used both the gene expression and the phenotype datasets provided in the challenge [Vignes et al., 2011]. On one hand, inference methods which produce networks with directed edges provide networks with richer information than the ones presented here (since directed edges may indicate causality). On the other hand, causalities are harder to infer, requiring more data samples (preferentially temporal samples) and other data sources in addition to gene expression data. Besides, the inference methods evaluated here are easier to access and use. Even though they produce networks with less information, the results are still valuable, as researchers can explore different thresholds in order to find gene clusters of interest. Further investigation is needed into the use of ensemble combiners for the inference of gene networks. Other combinations of inference methods, including methods that construct gene networks represented as directed graphs, could lead to better inferred networks. More complex ensemble learning algorithms could also be applied to incremental learning from datasets which gradually become available, and to the integration of different data sources. It would also be important to apply the concepts of ensemble learning to gene network inference from gene expression profiles of real organisms.
References [Alakwaa, 2014] Fadhl M Alakwaa. Modeling of Gene Regulatory Networks: A Literature Review. Journal of Computational Systems Biology, 1(1):8, 2014. [Chai et al., 2014] Lian En Chai, Swee Kuan Loh, Swee Thing Low, Mohd Saberi Mohamad, Safaai Deris, and Zalmiyah Zakaria. A Review on the Computational Approaches for Gene Regulatory Network Construction. Computers in Biology and Medicine, 48:55–65, 5 2014. [Davis and Goadrich, 2006] Jesse Davis and Mark Goadrich. The Relationship Between Precision-Recall and ROC Curves. pages 233–240, New York, NY, USA, 2006. ACM, ACM Press. [Faith et al., 2007] Jeremiah J Faith, Boris Hayete, Joshua T Thaden, Ilaria Mogno, Jamey Wierzbowski, Guillaume Cottarel, Simon Kasif, James J Collins, and Timothy S Gardner. Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS biology, 5(1):e8, 2007. [Hache et al., 2009] Hendrik Hache, Hans Lehrach, and Ralf Herwig. Reverse Engineering of Gene Regulatory Networks: A Comparative Study. EURASIP Journal on Bioinformatics and Systems Biology, 2009:8, 1 2009. [Hecker et al., 2009] Michael Hecker, Sandro Lambeck, and Susanne Toepfer. Gene Regulatory Network Inference: Data Integration in Dynamic Models – A Review. Biosystems, 96(1):86–103, 4 2009. [Liu et al., 2008] Bing Liu, Alberto de La Fuente, and Ina Hoeschele. Gene network inference via structural equation modeling in genetical genomics experiments. Genetics, 178(3):1763–1776, 2008. [Liu, 2015] Zhi-Ping Liu. Reverse engineering of genomewide gene regulatory networks from gene expression data. Current Genomics, 16(1):3–22, 2015. [Lopes et al., 2011] Fabricio M Lopes, Roberto M Cesar Jr, and Luciano Da F Costa. Gene expression complex networks: synthesis, identification, and analysis. Journal of Computational Biology, 18(10):1353–1367, 2011. [Maclin and Opitz, 2011] Richard Maclin and David W. Opitz. Popular ensemble methods: An empirical study. CoRR, abs/1106.0257, 2011. [Marbach et al., 2010] Daniel Marbach, Robert J. Prill, Thomas Schaffter, Claudio Mattiussi, Dario Floreano, and Gustavo Stolovitzky. Revealing Strengths and Weaknesses of Methods for Gene Network Inference. Proceedings of the National Academy of Sciences of the United States of America, 107(14):6286–91, 4 2010. [Marbach et al., 2012] Daniel Marbach, James C. Costello, Robert K¨uffner, Nicole M. Vega, Robert J. Prill, Diogo M. Camacho, Kyle R. Allison, The DREAM5 Consortium, Manolis Kellis, James J. Collins, and Gustavo Stolovitzky. Wisdom of Crowds for Robust Gene Network inference. Nature, 9(8):796–804, 8 2012.
[Mendoza and Bazzan, 2012] Mariana Recamonde Mendoza and Ana L. C. Bazzan. On the Ensemble Prediction of Gene Regulatory Networks: A Comparative Study. Neural Networks (SBRN), 2012 Brazilian Symposium on Neural Networks, pages 55–60, 10 2012. [Meyer et al., 2007] Patrick E Meyer, Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi. Information-theoretic inference of large transcriptional regulatory networks. EURASIP journal on bioinformatics and systems biology, 2007, 2007. [Pinna et al., 2011] Andrea Pinna, Nicola Soranzo, Ina Hoeschele, and Alberto de la Fuente. Simulating Systems Genetics Data with SysGenSIM. Bioinformatics, 27(17):2459–2462, 2011. [Polikar, 2006] Robi Polikar. Ensemble based systems in decision making. Circuits and systems magazine, IEEE, 6(3):21–45, 2006. [Sch¨afer and Strimmer, 2005] Juliane Sch¨afer and Korbinian Strimmer. An Empirical Bayes Approach to Inferring Large-scale Gene Association Networks. Bioinformatics, 21(6):754–764, 2005. [Shmulevich et al., 2002] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang. Probabilistic boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18(2):261–274, 2002. [Vignes et al., 2011] Matthieu Vignes, Jimmy Vandel, David Allouche, Nidal Ramadan-Alban, Christine CiercoAyrolles, Thomas Schiex, Brigitte Mangin, and de Simon Givry. Gene Regulatory Network Reconstruction Using Bayesian Networks, the Dantzig Selector, the Lasso and their Meta-analysis. PloS ONE, 6(12):e29165, 1 2011. [Wang and HaiyanHuang, 2014] Y. X. Rachel Wang and HaiyanHuang. Review on Statistical Methods for Gene Network Reconstruction using Expression Data. Journal of Theoretical Biology, 362:53–61, 4 2014. [Whalen and Pandey, 2013] Sean Whalen and Gaurav Pandey. A comparative analysis of ensemble classifiers: Case studies in genomics. Proceedings - IEEE International Conference on Data Mining, ICDM, pages 807–816, 2013. [Zhang and Horvath, 2005] Bin Zhang and Steve Horvath. A General Framework for Weighted Gene Co-expression Network Analysis. Statistical Applications in Genetics and Molecular Biology, 4(1):45, 2005. [Zhong et al., 2014] Rui Zhong, Jeffrey D. Allen, Guanghua Xiao, and Yang Xie. Ensemble-based Network Aggregation Improves the Accuracy of Gene Network Reconstruction. PloS one, 9(11):10, 1 2014.
Identifying the Network Biomarkers of Breast Cancer Subtypes By Interaction Forough Firoozbakht, Iman Rezaeian, Alioune Ngom and Luis Rueda School of Computer Science, University of Windsor 401 Sunset Avenue, Windsor, ON N9B 3P4, Canada {firoozb,rezaeia,angom,lrueda}@uwindsor.ca Abstract Breast cancer (BC) is a complex disease consisting of many different subtypes, each arising from a distinct molecular mechanism and having a distinct clinical progression. Identifying the biomarkers to accurately predict the BC subtypes and understanding the underlying mechanism driving specific subtypes is necessary for more effective treatments and the continual decline in BC mortality. In this paper, we introduce a new hierarchical classification approach based on first building a subtype dendrogram and then using the dendrogram to identify informative gene subnetworks as well as to predict the subtypes of given but arbitrary BC samples. Two methods are proposed which find the informative subnetworks by superimposing the gene expression data to the Human Protein Reference Database (HPRD) of protein-protein interactions, and selecting a set of high-scoring interaction pairs which best discriminate the subtypes. We have applied our approaches to a BC data set consisting of 997 tumor samples from 10 BC subtypes and obtained a collection of 9 discriminative subnetworks, each yielding high predictive accuracy. Keywords: breast cancer; subtyping; protein-protein interactions; hierarchical classification; greedy algorithm.
1 Introduction
Breast cancer (BC) is one of the leading causes of cancer-related deaths among women in North America [DeSantis et al., 2014]. Although BC is a heterogeneous disease, it can be categorized into different subtypes that can be distinguished based on gene expression characteristics [Curtis et al., 2012]. Appropriate diagnosis of the specific subtype of breast cancer is vital to provide the best treatment to the patient. Different methods such as MRI [Loo et al., 2011], mammography [Razzaghi et al., 2013], and CT scans [Mavi et al., 2006] have been used to phenotypically examine the changes in the tissue, but they provide little information toward directing therapy. Most bioinformatic methods have focused on identifying BC biomarkers as small subsets of differentially expressed (DE) genes. However, DE genes have limited
predictive performance due to (i) the heterogeneity within tissues and across patients and (ii) the dependence between genes, gene products, or pathways. To accurately identify effective BC biomarkers, new bioinformatics methods integrating additional biological information with gene expression data have become necessary. Within the last 5 years, a new class of biomarkers called network biomarkers (NBs) has been defined and studied [Liu et al., 2012a; Liu et al., 2012b; Zhang and Chen, 2013]. A NB is a disease-related network of genes identified by an appropriate integration of protein interaction network data with gene expression data, thus taking into account the dependencies between genes. Our contribution is only a proof of concept for now. First, we introduce a new hierarchical classification approach which uses a dendrogram to predict the subtypes of given but arbitrary samples. The classification dendrogram is proposed in order to perform multiple-class predictions in an efficient manner and with high accuracy. Second, we devise two methods for identifying the discriminative network biomarkers (NBs) given the gene expression data and the protein-protein interaction (PPI) network. The NB selection methods are also guided by the classification dendrogram.
2 Materials and Methods
We have used the METABRIC dataset [Curtis et al., 2012] (accession number EGAS00000000083), which contains the copy numbers and gene expressions of 2000 primary breast tumors with long-term clinical follow-up. In [Curtis et al., 2012], the copy number aberrations and copy number variations were generated using Affymetrix SNP 6.0 arrays, and the gene expression data were obtained using Illumina HT 12 technology. The dataset contains two sets of data, a validation set and a discovery set. Due to the lack of class labels in the validation set, in this paper we only use the discovery set, which contains 997 samples from ten subtypes of breast cancer. Each sample contains expression information for 48,803 probe IDs. The expression values of probes corresponding to the same gene have been merged based on the median expression of those probes, which maps all the probes to 24,351 unique genes. The numbers of samples corresponding to each subtype are listed in Table 1. Our approach consists of 3 main steps, discussed below and illustrated in Figure 1.
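The probe-to-gene merging step described above could be implemented along the following lines; the table layout and column names are hypothetical and are not those of the METABRIC files.

```python
import pandas as pd

def collapse_probes(expr, probe_to_gene):
    """Collapse probe-level expression to gene level by taking the median
    expression of all probes mapping to the same gene.
    expr: DataFrame indexed by probe ID, columns = samples.
    probe_to_gene: Series mapping probe ID -> gene symbol."""
    expr = expr.copy()
    expr["gene"] = expr.index.map(probe_to_gene)
    return expr.dropna(subset=["gene"]).groupby("gene").median()
```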
Table 1: The number of samples corresponding to each of the ten subtypes.

Subtype:  1   2   3    4    5   6   7    8    9   10
Samples:  76  45  156  167  94  44  109  143  67  96
Figure 1: A schematic view of the proposed method.

1. Build the classification dendrogram
We first build a dendrogram (as in the usual hierarchical agglomerative clustering algorithm) by clustering the subtypes (not the samples) given a subtype distance matrix. The base distance metric used in this study is the Euclidean distance between two points $X = \{x_1, x_2, ..., x_n\}$ and $Y = \{y_1, y_2, ..., y_n\}$ in the space $R^n$, defined as:

$d = |x - y| = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$   (1)

The subtypes being sets of samples, we have used four different methods for computing the distance between two subtypes (single linkage [Sibson, 1973], complete linkage [Defays, 1977], average linkage [Maulik and Bandyopadhyay, 2002] and Ward's method [Johnson, 1967]). In this paper we discuss only Ward's method, since it yields a dendrogram which gives better predictive accuracy when used to guide the classification of sample data. Moreover, Ward's dendrogram yields the most meaningful hierarchy of the 10 BC subtypes discussed in [Curtis et al., 2012] and the 5 BC subtypes discussed in [Perou et al., 2000]. Ward's method uses analysis of variance to evaluate the distances between clusters [Johnson, 1967]. Ward's minimum variance method is a special case of the objective function approach originally presented in [Ward Jr, 1963]. Ward's method works as follows:
• Use analysis of variance to evaluate the distances between clusters.
• Minimize the sum of squares of any two (hypothetical) clusters that can be formed at each step:

$d_{ij} = \sqrt{\frac{N_i \times N_j}{N_i + N_j}}\, \|c_i - c_j\|^2$   (2)

where $N_i$ and $N_j$ are the numbers of samples in clusters $i$ and $j$ respectively, $c_i$ and $c_j$ denote the centers of the clusters, and $\|\cdot\|$ is the Euclidean norm.
• Compute the mean and cardinality of the new merged cluster $k$ as:

$c_k = \frac{N_i c_i + N_j c_j}{N_i + N_j},$   (3)

$N_k = N_i + N_j$   (4)
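A naive sketch of the size-weighted agglomeration of Equations (2)–(4) is given below; the subtype centroids would be the per-subtype mean expression vectors, and the cubic-time loop is acceptable here because only ten subtypes are merged. This is an illustration, not the authors' implementation.

```python
import numpy as np

def ward_merge_order(centroids, sizes):
    """Agglomerate subtypes using the distance of Eqs. (2)-(4): repeatedly
    merge the pair with the smallest d_ij, updating the merged cluster's
    centroid and cardinality. Returns the list of merges (i, j, distance)."""
    clusters = {i: (np.asarray(c, dtype=float), float(n))
                for i, (c, n) in enumerate(zip(centroids, sizes))}
    merges, next_id = [], len(clusters)
    while len(clusters) > 1:
        best = None
        for i in clusters:
            for j in clusters:
                if i < j:
                    (ci, ni), (cj, nj) = clusters[i], clusters[j]
                    d = np.sqrt(ni * nj / (ni + nj)) * np.sum((ci - cj) ** 2)   # Eq. (2)
                    if best is None or d < best[0]:
                        best = (d, i, j)
        d, i, j = best
        (ci, ni), (cj, nj) = clusters.pop(i), clusters.pop(j)
        clusters[next_id] = ((ni * ci + nj * cj) / (ni + nj), ni + nj)          # Eqs. (3)-(4)
        merges.append((i, j, d))
        next_id += 1
    return merges
```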
2. Build the classifiers
The classification dendrogram generated in Step 1 is used similarly to a decision tree. In the training phase, the top node R (the root node) receives a training set D_R and then classifies each sample of D_R as belonging to its left subtree R_l or its right subtree R_r. If a subtree is a leaf node, then the sample is classified as being of the BC subtype associated with that leaf node; otherwise the classification process repeats at the root R_p of the subtree with the training data D_{R_p} ⊂ D_R (p = l or r) received from its parent. Let T be a non-leaf node with associated training data T_R; we trained an SVM classifier (with linear kernel and default parameters) within the framework of a 10-fold cross-validation as follows. 2.1) We first map all the genes of the gene expression data T_R onto the HPRD [Prasad et al., 2009] PPI network in order to subsequently find a discriminatory NB which best classifies the data at T; we removed the self-interactions and the repeated interactions, resulting in a PPI network consisting of 36,822 unique interactions and 9,465 proteins. Only those genes which can be mapped to the PPI network are retained for further analysis. 2.2) Next, we apply a simple feature ranking method to select a subset of features, in which each feature is an interacting gene pair, in order to obtain a network biomarker (NB); that is, a disease-related region of the interactome that is differentially expressed in BC. 2.3) Finally, at node T, we build an SVM classifier from its training
data T_R but using only the selected set of interacting pairs. At step 2.2 above, we have devised two methods for finding the NB that best discriminates the subtypes. In each method, we used a simple Fisher discriminant analysis (FDA) classifier [Scholkopft and Mullert, 1999] to rank an interacting pair according to how well it separates the data. Method 1: We apply FDA to all the interacting pairs and select the top t ranked pairs as our feature subset. This method tends to produce a NB containing disconnected components, since the pairs are selected independently of each other. Method 2: We first select the best-ranked interacting gene pair g_{i,j} = (g_i, g_j), then select the best pairs g_{i,k} = (g_i, g_k) and g_{j,l} = (g_j, g_l), with i ≠ l and k ≠ j, and we continue in this manner for each new gene g_a added to the selected subset until t pairs are selected. This method yields a connected NB.
3. Predict sample subtypes
To predict the subtype of new but arbitrary sample data, we use the classification dendrogram in the same manner as a decision tree. That is, starting from the root node R, we sort each sample down the tree, with each non-leaf node T further classifying the sample until it reaches a leaf node containing its subtype prediction.
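The greedy, connectivity-preserving selection of Method 2 can be sketched as follows. The per-pair score used here is a simple univariate Fisher-score proxy rather than the two-gene FDA classifier used in the paper, and the edge list and parameter t are assumptions for illustration.

```python
import numpy as np

def fisher_score_pair(X, y, i, j):
    """Illustrative pair score: sum of per-gene Fisher scores for genes i and j.
    X: samples x genes array; y: binary (0/1) labels for the two subtrees."""
    score = 0.0
    for g in (i, j):
        x = X[:, g]
        m0, m1 = x[y == 0].mean(), x[y == 1].mean()
        v0, v1 = x[y == 0].var() + 1e-12, x[y == 1].var() + 1e-12
        score += (m0 - m1) ** 2 / (v0 + v1)
    return score

def grow_connected_nb(X, y, edges, t=10):
    """Method-2-style greedy growth: start from the best-scoring interacting
    pair, then repeatedly add the best-scoring edge touching the current gene
    set, yielding a connected network biomarker of up to t edges."""
    scored = sorted(edges, key=lambda e: fisher_score_pair(X, y, *e), reverse=True)
    selected = [scored[0]]
    genes = set(scored[0])
    while len(selected) < t:
        candidates = [e for e in scored
                      if e not in selected and (e[0] in genes or e[1] in genes)]
        if not candidates:
            break
        best = candidates[0]          # 'scored' is already ordered best-first
        selected.append(best)
        genes.update(best)
    return selected
```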
3 Results
Figure 2 shows the subtype dendrogram obtained by Ward's method. The PAM50 index [Perou et al., 2000] classifies BC into 5 major subtypes, {Her2, Basal, LumA, LumB, Normal-Like}. However, since we are using the METABRIC data [Curtis et al., 2012], which classifies the disease into 10 subtypes, we find it useful to match the membership of samples from the 5 PAM50 index subtypes to the 10 METABRIC subtypes. In the figure, the color of a METABRIC subtype (a leaf node) matches the PAM50 index subtype that appears the most in that node. For example, most of the samples in Subtype 3 are members of the PAM50 LumA subtype, and most of the samples in the Her2 subtype are in Subtype 5. Since Ward's method takes the size of each subtype into account (i.e., the number of samples in each subtype), it tends to merge smaller subtypes first. The following measures are used for evaluating the predictive performance of our approach:

$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN},$   (5)

F-measure uses both the precision and recall measures to compute the score as follows:

$\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$   (6)

where

$\text{Precision} = \frac{TP}{TP + FP},$   (7)

$\text{Recall} = \frac{TP}{TP + FN}.$   (8)

Another measure, the area under the receiver operating characteristic (ROC) curve, AUC, shows the trade-off between Specificity and Sensitivity (Recall), where:

$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN},$   (9)

$\text{Specificity} = \frac{TN}{TN + FP}.$   (10)

Above, TP, TN, FP, and FN mean true positive, true negative, false positive, and false negative, respectively.

Figure 2: The hierarchical tree obtained using agglomerative clustering with Ward's method as the distance method. The color of each subtype shows the PAM50 index subtype in which the majority of samples of that subtype fall.

Figure 3 shows the NB obtained by Method 1 (NB-1) at the root of the subtree containing the leaf nodes Subtype 2 and Subtype 7, whereas Figure 4 shows the NB obtained by Method 2 (NB-2) at that same subtree root. Table 2 shows the results of our dendrogram classification approach with both Method 1 and Method 2 of selecting interacting gene pairs. Both methods yield excellent prediction results in general, with Method 2 generally under-performing Method 1, and by a relatively large margin at some internal nodes of the dendrogram (for instance, at subtree 1-9-vs-6). Method 1 performs better since it selects the top-ranked interacting gene pairs independently of the currently selected genes. In Method 2, an existing high-ranked interacting pair g_{i,j} that would be selected by Method 1 may not be selected if neither g_i nor g_j is already in the currently selected subset of genes. In the table, we also show the results of our dendrogram method when using the Chi-square gene ranking method [Liu and Setiono, 1995]; with this method the gene expression data is not mapped to the PPI network, and only top-ranked genes (rather than top-ranked interacting gene pairs) are selected. In most cases, the Chi-squared feature selection method outperforms both Method 1 and Method 2, but it tends to yield a NB containing only genes, with no interacting pairs. This is because the interactome is incomplete; there are 24,000 genes in the METABRIC data and only 9,465 genes in our HPRD PPI network data. Also, not all the genes in our expression data could be mapped to HPRD (i.e., their associated proteins are not in HPRD). In the future, we plan to integrate different PPI network data from different databases
Figure 3: The network biomarker NB-1 at root of Subtype 2 and Subtype 7. such that all the genes in the expression data can be mapped to.
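As a concrete illustration of equations (5)-(10), the short sketch below computes these measures from confusion-matrix counts; the counts in the usage line are invented for the example and this is not the authors' evaluation code.

```python
def classification_measures(tp, tn, fp, fn):
    """Compute the measures of equations (5)-(10) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)                    # eq. (5)
    precision = tp / (tp + fp)                                    # eq. (7)
    recall = tp / (tp + fn)                                       # eq. (8), same as sensitivity (9)
    f_measure = 2 * precision * recall / (precision + recall)     # eq. (6)
    specificity = tn / (tn + fp)                                  # eq. (10)
    return accuracy, f_measure, recall, specificity

# Hypothetical counts, only to show the call:
print(classification_measures(tp=90, tn=85, fp=10, fn=15))
```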
4 Discussion

We used the IPAD pathway analysis database [Zhang and Drabier, 2012] to determine the diseases and pathways associated with the network biomarkers (NB-2) obtained with Method 2 at the non-leaf nodes of the classification dendrogram. As shown in Tables 3 and 4 (respectively for the node classifying Subtypes 2 and 7, and for the root node), all of the genes in the NB are breast disease-related genes. Some of those genes, such as BRCA1, TP53, and PTEN, are known to have an important function in driving BC [Ford et al., 1994; Kappler et al., 2014; King et al., 2003; Walerych et al., 2012].

5 Conclusion

In this paper, we have introduced a novel method for predicting the subtype of a breast cancer sample. Our method is based on first constructing a classification dendrogram and then using the dendrogram to predict the class of a new sample, each non-leaf node being associated with a classifier and training data passed on by its parent node. In our approach, we have also appropriately combined biological information obtained from a PPI network with the gene expression data of breast cancer samples in order to obtain the network biomarkers that best classify the data. Computational experiments have shown that our method gives high predictive performance and that the resulting network biomarkers correspond to regions of the interactome which are related to breast diseases. This research is a proof of concept, and we plan to devise methods which generate the specific NBs that are differentially expressed in each given subtype but not in the remaining 9 subtypes. This will allow BC researchers to obtain further insights into the mechanism driving each BC subtype. Also, by appropriately integrating transcriptome information (such as RNA-seq data), mutation information (such as SNP, CNA and CNV data), and pathway information, one will be able to accurately detect the exact genes that drive each BC subtype.

Acknowledgments

This work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada, and WECCCF, the Windsor Essex County Cancer Centre Foundation Seeds4Hope program.
Figure 4: The network biomarker NB-2 at root of Subtype 2 and Subtype 7.
Table 2: Comparison between selected PPI sub-networks and individual genes using the chi-squared method.

Node (# of genes) | Subset identification | Accuracy | AUC | F-measure
2 vs 7 (101) | Method 2 | 95.5% | 0.929 | 0.954
2 vs 7 (101) | Method 1 | 94.8% | 0.918 | 0.947
2 vs 7 (101) | Chi-squared | 99.4% | 0.995 | 0.994
2 7 vs 8 (99) | Method 2 | 95.3% | 0.952 | 0.953
2 7 vs 8 (99) | Method 1 | 95.6% | 0.956 | 0.956
2 7 vs 8 (99) | Chi-squared | 95.3% | 0.953 | 0.953
2 7 8 vs 3 (98) | Method 2 | 80.8% | 0.767 | 0.803
2 7 8 vs 3 (98) | Method 1 | 83.2% | 0.808 | 0.831
2 7 8 vs 3 (98) | Chi-squared | 84.5% | 0.827 | 0.845
1 vs 9 (99) | Method 2 | 88.8% | 0.887 | 0.888
1 vs 9 (99) | Method 1 | 90.2% | 0.903 | 0.902
1 vs 9 (99) | Chi-squared | 96.5% | 0.965 | 0.965
6 vs 1 9 (99) | Method 2 | 90.9% | 0.823 | 0.903
6 vs 1 9 (99) | Method 1 | 95.2% | 0.906 | 0.951
6 vs 1 9 (99) | Chi-squared | 96.3% | 0.928 | 0.962
2 7 8 3 vs 6 1 9 (98) | Method 2 | 90.8% | 0.871 | 0.906
2 7 8 3 vs 6 1 9 (98) | Method 1 | 91.7% | 0.888 | 0.916
2 7 8 3 vs 6 1 9 (98) | Chi-squared | 90.8% | 0.886 | 0.908
2 7 8 3 6 1 9 vs 4 (101) | Method 2 | 85.4% | 0.671 | 0.829
2 7 8 3 6 1 9 vs 4 (101) | Method 1 | 87.5% | 0.737 | 0.863
2 7 8 3 6 1 9 vs 4 (101) | Chi-squared | 90.1% | 0.822 | 0.906
2 7 8 3 6 1 9 4 vs 5 (95) | Method 2 | 96.8% | 0.912 | 0.968
2 7 8 3 6 1 9 4 vs 5 (95) | Method 1 | 97.3% | 0.929 | 0.973
2 7 8 3 6 1 9 4 vs 5 (95) | Chi-squared | 97.8% | 0.964 | 0.978
2 7 8 3 6 1 9 4 5 vs 10 (99) | Method 2 | 95.8% | 0.879 | 0.958
2 7 8 3 6 1 9 4 5 vs 10 (99) | Method 1 | 95.9% | 0.875 | 0.959
2 7 8 3 6 1 9 4 5 vs 10 (99) | Chi-squared | 95.8% | 0.874 | 0.958
Table 3: Breast related diseases corresponding to the Subtype 2 vs. Subtype 7 sub-network.

Disease ID | Disease Name | No. of involved genes in sub-network | P-value | No. of related pathways | Top pathway name
MESH:D001943 | Breast Neoplasms | 98 | 5.19e-2 | 13 | Metabolism
MESH:D018270 | Carcinoma, Ductal, Breast | 46 | 2.54e-3 | 60 | Metabolism
MESH:D001941 | Breast Diseases | 39 | 3.48e-2 | 43 | Metabolism
MESH:D018567 | Breast Neoplasms, Male | 31 | 1.70e-3 | 57 | Metabolism
MESH:D058922 | Inflammatory Breast Neoplasms | 25 | 7.26e-4 | 174 | Pathways in cancer
MESH:D005348 | Fibrocystic Breast Disease | 22 | 6.58e-4 | 62 | Pathways in cancer
PA443560 | Breast Neoplasms | 4 | 1.53e-1 | 24 | Biological oxidations
MESH:D061325 | Hereditary Breast and Ovarian Cancer Syndrome | 1 (BRCA1) | 3.62e-2 | N/A | N/A
PA446745 | Breast Neoplasms, Male | 1 (BRCA1) | 3.62e-2 | N/A | N/A
OMIM:604370 | Breast-Ovarian cancer, familial | 1 (BRCA1) | 6.19e-2 | N/A | N/A
PA152132901 | Hereditary Breast/Ovarian Cancer Syndrome | 1 (BRCA1) | 6.19e-2 | N/A | N/A
Table 4: Breast related diseases corresponding to Subtype 10 vs. other subtypes at the root sub-network.

Disease ID | Disease Name | No. of involved genes in sub-network | P-value | No. of related pathways | Top pathway name
MESH:D001943 | Breast Neoplasms | 99 | 3.33e-2 | 13 | Metabolism
MESH:D001941 | Breast Diseases | 43 | 7.59e-3 | 43 | Metabolism
MESH:D018270 | Carcinoma, Ductal, Breast | 41 | 7.20e-3 | 60 | Metabolism
MESH:D018567 | Breast Neoplasms, Male | 23 | 2.81e-2 | 57 | Metabolism
MESH:D058922 | Inflammatory Breast Neoplasms | 19 | 9.97e-3 | 174 | Pathways in cancer
MESH:D005348 | Fibrocystic Breast Disease | 14 | 3.49e-2 | 62 | Pathways in cancer
PA443560 | Breast Neoplasms | 2 (RPS6KA3, TGFB1) | 5.57e-1 | 24 | Biological oxidations
References
[Curtis et al., 2012] Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J. Dunning, Doug Speed, Andy G. Lynch, Shamith Samarajiwa, Yinyin Yuan, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–352, 2012.
[Defays, 1977] Daniel Defays. An efficient algorithm for a complete link method. The Computer Journal, 20(4):364–366, 1977.
[DeSantis et al., 2014] Carol DeSantis, Jiemin Ma, Leah Bryan, and Ahmedin Jemal. Breast cancer statistics, 2013. CA: a cancer journal for clinicians, 64(1):52–62, 2014.
[Ford et al., 1994] Deborah Ford, Douglas F Easton, D Timothy Bishop, Steven A Narod, and David E Goldgar.
Risks of cancer in brca1-mutation carriers. The Lancet, 343(8899):692–695, 1994. [Johnson, 1967] Stephen C Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967. [Kappler et al., 2014] Christiana Kappler, Robert Wilson, Bridget Varughese, and Stephen P Ethier. Pten loss enhances amphiregulin-specific signaling and gene expression in triple-negative breast cancer. Cancer Research, 74(19 Supplement):3297–3297, 2014. [King et al., 2003] Mary-Claire King, Joan H Marks, Jessica B Mandell, et al. Breast and ovarian cancer risks due to inherited mutations in brca1 and brca2. Science, 302(5645):643–646, 2003.
[Liu and Setiono, 1995] Huan Liu and Rudy Setiono. Chi2: Feature selection and discretization of numeric attributes. In 2012 IEEE 24th International Conference on Tools with Artificial Intelligence, pages 388–388. IEEE Computer Society, 1995. [Liu et al., 2012a] Ke-Qin Liu, Zhi-Ping Liu, Jin-Kao Hao, Luonan Chen, and Xing-Ming Zhao. Identifying dysregulated pathways in cancers from pathway interaction networks. BMC bioinformatics, 13(1):126, 2012. [Liu et al., 2012b] Xiaoping Liu, Zhi-Ping Liu, Xing-Ming Zhao, and Luonan Chen. Identifying disease genes and module biomarkers by differential interactions. Journal of the American Medical Informatics Association, 19(2):241–248, 2012. [Loo et al., 2011] Claudette E Loo, Marieke E Straver, Sjoerd Rodenhuis, Sara H Muller, Jelle Wesseling, MarieJeanne TFD Vrancken Peeters, and Kenneth GA Gilhuijs. Magnetic resonance imaging response monitoring of breast cancer during neoadjuvant chemotherapy: relevance of breast cancer subtype. Journal of Clinical Oncology, 29(6):660–666, 2011. [Maulik and Bandyopadhyay, 2002] Ujjwal Maulik and Sanghamitra Bandyopadhyay. Performance evaluation of some clustering algorithms and validity indices. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(12):1650–1654, 2002. [Mavi et al., 2006] Ayse Mavi, Muammer Urhan, Q Yu Jian, Hongming Zhuang, Mohamed Houseni, Tevfik F Cermik, Dhurairaj Thiruvenkatasamy, Brian Czerniecki, Mitchell Schnall, and Abass Alavi. Dual time point 18f-fdg pet imaging detects breast cancer with high sensitivity and correlates well with histologic subtypes. Journal of nuclear medicine, 47(9):1440–1446, 2006. [Perou et al., 2000] Charles M Perou, Therese Sørlie, Michael B Eisen, Matt van de Rijn, Stefanie S Jeffrey, Christian A Rees, Jonathan R Pollack, Douglas T Ross, Hilde Johnsen, Lars A Akslen, et al. Molecular portraits of human breast tumours. Nature, 406(6797):747–752, 2000. [Prasad et al., 2009] TS Keshava Prasad, Renu Goel, Kumaran Kandasamy, Shivakumar Keerthikumar, Sameer Kumar, Suresh Mathivanan, Deepthi Telikicherla, Rajesh Raju, Beema Shafreen, Abhilash Venugopal, et al. Human protein reference database2009 update. Nucleic acids research, 37(suppl 1):D767–D772, 2009. [Razzaghi et al., 2013] Hilda Razzaghi, Melissa A Troester, Gretchen L Gierach, Andrew F Olshan, Bonnie C Yankaskas, and Robert C Millikan. Association between mammographic density and basal-like and luminal a breast cancer subtypes. Breast Cancer Res, 15(5):R76, 2013. [Scholkopft and Mullert, 1999] Bernhard Scholkopft and Klaus-Robert Mullert. Fisher discriminant analysis with kernels. In Proceedings of the 1999 IEEE Signal Processing Society Workshop Neural Networks for Signal Processing IX, Madison, WI, USA, pages 23–25, 1999.
[Sibson, 1973] Robin Sibson. Slink: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973. [Walerych et al., 2012] Dawid Walerych, Marco Napoli, Licio Collavin, and Giannino Del Sal. The rebel angel: mutant p53 as the driving oncogene in breast cancer. Carcinogenesis, 33(11):2007–2017, 2012. [Ward Jr, 1963] Joe H Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963. [Zhang and Chen, 2013] Fan Zhang and Jake Y Chen. Breast cancer subtyping from plasma proteins. BMC medical genomics, 6(Suppl 1):S6, 2013. [Zhang and Drabier, 2012] Fan Zhang and Renee Drabier. Ipad: the integrated pathway analysis database for systematic enrichment analysis. BMC bioinformatics, 13(Suppl 15):S7, 2012.
Workflow Mining: Discovering Generalized Process Models from Texts
Ahmed Halioui, Petko Valtchev and Abdoulaye Baniré Diallo
Computer Science Department, Laboratoire Bioinformatique, Université du Québec à Montréal, Québec, Canada
[email protected],
[email protected],
[email protected]

Abstract

Contemporary workflow management systems are driven by explicit process models specifying the inter-dependencies between tasks. Creating these models is a challenging and time-consuming process. Existing approaches for mining concrete workflows into models tackle design aspects with no regard for the diverging abstraction levels of the tasks. Concrete workflow logs represent tasks and cases of concrete events – partially or totally ordered – grounding hidden multi-level semantics and contexts. Relevant generalized events could be rediscovered within processes. Here, we propose an ontology-based workflow mining system generating patterns from sequences of events which are themselves extracted from texts. Our system T-GROWLer (GeneRalized Ontology-based Workflow minER within Texts) is based on two ontology-based modules: a workflow extractor and a pattern miner. To this end, it uses two different ontologies: a domain one and a process one. The first ontology supports workflow extraction from texts. Extracted concrete knowledge is enhanced using the process ontology and then fed into the generalized workflow pattern miner.
1 Introduction
The goal of workflow mining is to extract patterns about processes from event logs [1] [2]. Depending on the process mining technique used, the information presented in such logs may vary. The challenge is to extract such data from a variety of data sources, e.g., e-mails, databases and texts. Different types of workflow representations have been proposed in the process mining literature [1] [3] [4]. Petri nets are the most popular design for representing the dynamic aspect of processes. The Alpha-algorithm [1] is one of the first process discovery algorithms. An example of a pattern discovered within shopping basket web pages is ⟨if activity a: FillInDeliveryInfo is followed by b: CancelOrder, then b is never followed by a⟩; such patterns are also used in other models, like HMM, YAWL, BPMN, etc. [1] [5]. However, the representational bias of each model should be considered while mining patterns. Workflow representations delimit the search
space of patterns and circumscribe the expressive power of models. Moreover, depending on the questions one seeks to answer, different views on the available data could be presented. In reality, there is no "real" unique model, but different ones. There is a huge gap of knowledge between concrete workflows and models (i.e., experience and skills) [1]. Bridging this gap amounts to representing different levels of knowledge. Models should represent the expertise within workflows, which lies somewhere between concrete (actual) and generalized (abstract) workflows. In this paper, we adopt an ontological approach to represent the different levels of expertise. Ontologies offer shared vocabularies to facilitate tasks of knowledge integration and interpretation via triples of concepts and relations. Building ontologies is a proliferating research topic in artificial intelligence, bioinformatics and computational medicine [6] [7]. Today, ontologies are used to represent data via concepts and relations for semantic enrichment. We propose to use ontological representations for two purposes: (1) the extraction of workflow terms and relations from texts, and (2) the extraction of generalized (abstract) workflow patterns. Our approach to pattern mining from scientific texts follows an ontology-based representation of the experience contained in those texts. In this study, we distinguish two types of knowledge in order to represent experience and skills: theoretical knowledge and processual knowledge. The former is used to build a domain ontology representing the concepts and relations of the basics of a domain's analysis (theory), while a processual ontology is used to represent another dimension of knowledge, namely domain practices and software experience. It is specifically used to extract abstract but relevant patterns from texts. Concepts and relations of the processual ontology explicitly represent elements of workflows (events and activities). Here, we distinguish two types of workflows: (1) concrete workflows, whose elements are terms and links between them, and (2) abstract workflows, representing more general concepts and relations. Only the processual ontology is used in the extraction steps of these two kinds of workflows. The rest of the paper is organized as follows: the next section presents related work and background information on workflow pattern-based mining. In Section 3, we define the pipeline of our approach and the models representing the different kinds of knowledge. We also describe the different processes of concrete workflow extraction from texts and generalized pattern mining. The proposed model is applied to phylogenetic workflows, where processes describe tools and methods used in phylogenetic inference. Section 4 presents the phylogenetic domain knowledge. Section 5 presents the experimental evaluation and results.
2 Background and related work
Given a universe of items (events) O, a database D of records (processes) that combine items from O, and a frequency threshold σ, the goal of pattern mining is to extract the family F of patterns present in at least σ records. Two binary relations are generally used in a pattern mining problem combining hierarchies: the generality relation¹ between patterns of the pattern language, and the instantiation relation between a data record and a pattern. In the simplest settings, both data records and patterns are sets of items (itemsets). Hence, the mining goal is to find all the frequent subsets occurring in the family of sets D expressed in the data language. More elaborate data structures, such as sequences, episodes and graphs, are also found in the literature [8]. In the first studies on sequential pattern mining, several works proposed to include hierarchies in the mining process in order to semantically enrich the generated patterns. This research axis looks for generalized pattern languages built on top of concepts C abstracting objects from O. Concepts from a taxonomy H represent is-a relationships; in fact, a taxonomy is a simple domain ontology. The second wave of studies on generalized pattern mining uses ontologies as Directed Acyclic Graphs (DAGs). Furthermore, ontologies are also used in the different steps of data mining: (a) in the preprocessing step to filter infrequent items, (b) while generating patterns (in the joining step), or (c) in the postprocessing step in order to prune irrelevant patterns. Among the first studies on using knowledge representations in pattern mining, we find the work of [9], whose authors propose three data structures: HG-trees (generalized hierarchical trees), AR-rules (attribute-relation rules) and constraints on the environment (EBC). Inspired by this work, the authors of [10] used hierarchies in the pattern mining process. Their algorithm GSP (Generalized Sequential Patterns) searches for F through the pattern space by exploiting the monotonicity of frequency with regard to the generality operator. Their Apriori miner performs a level-wise top-down traversal of the pattern space. On itemsets, it examines patterns at level k, that is, of size k, on two points: the method computes the frequency of candidate patterns by matching them against the records in D, and it generates (k+1)-candidates by combining pairs of frequent k-patterns. However, their approach is expensive in terms of memory and computation: their representation of D becomes huge while exploring deeper levels of the hierarchy. Han in [11] uses associations at one level at a time, while Fortin et al. [12] used an object-oriented representation to generate association rules at different levels of abstraction. Zhou et al. in [14] represent a priori information in a DAG in order to calculate numerical dependencies (probabilities) between concepts C and items from O. Furthermore, Philips et al. in [15] proposed Prolog ontologies in an inference rule-based system using a first-order logic language. Cešpivová et al. integrated ontologies in the different steps of mining in the CRISP-DM model; a new miner, called 4ft-Miner, was published afterwards and used in mining medical processes [16]. In [14], the authors proposed a Generalization Impact measure based on support in order to estimate the sufficient level of abstraction to generate interesting generalized patterns. Finally, in [17], a fully-blown ontology is used. First, the data sequences are enriched with domain properties from the ontology. The resulting structures are akin to labeled digraphs, in which vertices are objects and edges represent links between objects or the induced sequential order. Patterns are generalized ontological digraphs, i.e., made of sets of ontology classes connected by properties (plus sequencing). Matching between data sequences and patterns is defined on top of the standard ontological instantiation between objects and classes and represents an order-preserving injective graph morphism. The proposed xPMiner method applies an Apriori-like level-wise mining whereby levels in the pattern space are defined w.r.t. the overall depth of the pattern elements within the ontology. Candidates of level k+1 are generated from k-level frequent patterns by a refinement operator (see for instance [18]) made of four types of primitives: 1) add (append) a root concept to the sequence, 2) replace a concept from the sequence by a direct specialization thereof, 3) add a root property between two concepts from the sequence, and 4) replace a property by a direct specialization. A major shortcoming of the proposed approach is the cost of the plain pattern-to-data matching operation. Indeed, as no optimization is proposed, the large number of labeled digraph morphism computations takes a significant toll on the method's performance. In our own study, we choose to enhance that approach (see below). Overall, in our approach, data records are extracted from unstructured sources, e.g., scientific texts. To the best of our knowledge, no ontology-based pattern mining system exists that uses data extracted from texts; only pattern-based text mining systems have been proposed [6]. In what follows, we present the T-GROWLer system, which elicits experience in the form of generalized workflows mined from concrete ones, themselves acquired by text analysis of textual sources.

¹ Also known as the raising operation.
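As a minimal illustration of what generalized support means when a taxonomy is available, the sketch below counts the records that instantiate a pattern whose items may be abstract taxonomy concepts. The taxonomy, records and function names are invented for the example; this is not GSP itself, only the matching idea such miners rely on.

```python
def is_a_closure(item, taxonomy):
    """All ancestors of an item in an is-a taxonomy given as {child: parent}."""
    out = {item}
    while item in taxonomy:
        item = taxonomy[item]
        out.add(item)
    return out

def support(pattern, records, taxonomy):
    """Fraction of records that instantiate every (possibly abstract) pattern item."""
    def covers(record):
        expanded = set().union(*(is_a_closure(i, taxonomy) for i in record))
        return set(pattern) <= expanded
    return sum(covers(r) for r in records) / len(records)

# Toy data: two concrete tools and their abstract classes.
taxonomy = {'ClustalW': 'AlignmentProgram', 'MrBayes': 'CharacterBasedProgram'}
records = [{'ClustalW', 'MrBayes'}, {'ClustalW'}, {'MrBayes'}]
print(support({'AlignmentProgram'}, records, taxonomy))   # 2/3
```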
3 T-GROWLer, an ontology-based workflow mining system
We describe in this section the proposed methods in two modules: (A) workflow extraction from texts and (B) workflow mining (see Fig. 1). The first module requires plain text articles, a domain ontology and a processual ontology. Ontologies representing abstract concepts and relations are described to guide the information extraction processes (see Fig. 1). The main goal of module A is to extract concrete workflows from texts. In order to preserve the order of processes in workflows, abstract workflows are used. Abstract workflows are sequences of abstract concepts ordered by the isInputOf and genInputTo relations. The first process (1 - A) extracts named entities from texts using the domain ontology schema.
Relation extraction is then performed to create RDF triples between the elements of the processual ontology. These semantic associations enrich concrete workflows with different multi-level links. Once the processual ontology and concrete workflows are constructed, the second module (1 - B) deals with the generation of frequent patterns.
Figure 1: Generalized-workflow mining from texts.
We discuss the different steps of workflow mining from texts in the following sections.
3.1 Workflow extraction from texts
The first module of our T-GROWLer system automatically extracts information from texts and organizes it in two ontologies: a domain ontology and a processual ontology. The first is used to extract terms (named entities and relations) from texts. The second reorganizes the extracted data in a processual schema. Formally, an ontology is a four-tuple Ω = ⟨C, R, ≤Ω, ρ⟩, where C is a set of concepts and R is the set of domain relations. The generality order ≤Ω holds both for concepts and relations. ρ is a ternary relation C1 × R × C2 which connects two concepts with one relation; C1 is called the domain concept and C2 the range. Our workflow extraction method takes place in three different steps: named entity extraction, relation extraction and workflow reconstruction. We recognize and annotate terms in texts automatically with different statistical models. We use a supervised approach in order to recognize and learn concepts from texts [19]. A hand-written grammar specific to each ontological concept is described in the JAPE language (the Java Annotation Patterns Engine, which provides finite state transduction over annotations based on regular expressions; https://gate.ac.uk/sale/tao/splitch8.html) to extract and filter the right instances. For example, when a program's term is a noun (such as IDEA, NETWORK or PHASE), different semantic ambiguities arise. We can solve this problem by modeling rules that verify whether the concerned word is written in uppercase. A learning set is then constructed. In addition, we use morpho-syntactic techniques from the Natural Language Processing domain, such as tokenization, stemming and POS (Part Of Speech) tags, to annotate texts. Two sets of features have been calculated on terms in order to learn them. These features represent a term's context in POS annotations and surrounding concepts in a window of five next and previous concepts and tokens. A PAUM model [20] is used to learn terms with the extracted features. PAUM (Perceptron Algorithm with Uneven Margins) was designed especially for imbalanced data and has successfully been applied to named entity recognition problems [20]. For the relation extraction module, we propose a distant supervision approach taking tuples from the domain ontology as input and annotating text in a semi-supervised way [21]. We adopt the following hypothesis for the relation extraction process: a link (an instance of a relation) between two concepts should exist in a sentence evoking instances of these concepts, and such a link is expressed by the verb located between the concepts' terms. In order to recognize tuples in texts, we designed morpho-syntactic rules for each relation. For example, the relation isAlignedWithHomologsIn is defined by specific terms in the context of the verb align. This relation should have as domain a DataType term and as range a Database term (see Fig. 5). A homologous alignment relation verb (for instance align) should be followed or preceded by the term homologs or homologous in the same sentence. From recognized links between named entities in text, we build a concrete workflow knowledge base in which items are named entities ordered by the isInputOf and genInputTo properties (relations). Workflows, as traceable graphs, represent sequences of events. Following the most abstract workflows, we order these events to construct a workflow database. For example, from the tuples (ρ1) (16sGene, "isInputOf", ClustalW), (ρ2) (ClustalW, "genInputTo", PHYLIP) and (ρ3) (PHYLIP, "genInputTo", TreeView), we construct the sequence ⟨16sGene, ClustalW, PHYLIP, TreeView⟩. This sequence is then enhanced with ontological properties and concepts from the ontology to construct a generalized workflow (see Fig. 5).
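To make the reconstruction step concrete, here is a small sketch of how a totally ordered sequence could be recovered from such isInputOf/genInputTo triples. The function name and the triple representation are ours, and real workflows may need branching or cycle handling that this toy version omits.

```python
def order_workflow(triples):
    """Chain (subject, property, object) triples such as
    ('16sGene', 'isInputOf', 'ClustalW') into a single event sequence."""
    flow = {s: o for s, p, o in triples if p in ('isInputOf', 'genInputTo')}
    targets = set(flow.values())
    # The start node is the one that never appears as an object.
    start = next(s for s in flow if s not in targets)
    seq, node = [start], start
    while node in flow:
        node = flow[node]
        seq.append(node)
    return seq

triples = [('16sGene', 'isInputOf', 'ClustalW'),
           ('ClustalW', 'genInputTo', 'PHYLIP'),
           ('PHYLIP', 'genInputTo', 'TreeView')]
print(order_workflow(triples))   # ['16sGene', 'ClustalW', 'PHYLIP', 'TreeView']
```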
3.2 Mining generalized workflows
The second module of our system develops two major aspects of mining patterns. The first is the definition itself of a generalized workflow mining technique. The second aspect treats the definition of frequent patterns in the search space of workflows seen as directed acyclic graphs. We developed an Apriori approach inspired by [17] to generate generalized sub-graphs of workflows. The authors in [17] proposed two descriptive languages to represent the pattern space. The data language is derived from sequences translated into object IDs from the knowledge base Ω. The second, pattern language is used to represent a pattern S in a canonical structure. Both languages represent workflows as a pair of a sequence and a triple set, S = (ζ, θ), where ζ is a sequence of concepts (or objects) and θ is a set of relations. Level-wise descent relies on the generation of (k+1)-level patterns from k-level ones, which amounts to the computation of four canonical operations in a depth-first search strategy. To generate candidates, we apply four different canonical operations on the root concepts from the ontology: (1) AddCLS adds a concept to a pattern, (2) AddREL adds a relation between two concepts, (3) SplCLS instantiates a concept, and (4) SplREL instantiates a relation in a pattern. For instance, AddREL adds a property between two concepts (a domain and a range) in a workflow pattern. The subject and object concepts could be in different positions in the pattern S. Specializing a concept/relation replaces the concerned entity with a less abstract event. For each pattern, the 4 different canonical operations are used. Although these operations constitute an effective generation mechanism, their unrestricted use is redundancy-prone and hence would harm the efficiency of the global mining process. In the previous work of [17], calculating the support of a pattern S amounts to matching this pattern with all the concrete workflow sequences in the database. For this purpose, we introduce a novel data structure to prune patterns while calculating their supports. Calculating frequent patterns begins from the last visited concept/relation of a parent pattern; we then do not need to recalculate all the matching possibilities for each concept and relation previously matched. Recall that a sequence in a workflow is obtained from the order of the isInputOf and genInputTo links. The non-strict partial order of events is obtained by following the most abstract pattern relations in the ontology Ω (see Fig. 5). For instance, the sequence s in Fig. 2 is a poset s.ζ of events i ∈ {1, ..., 7} and a set s.θ of links l ∈ {1, ..., 6}. A pattern S is likewise represented by a poset of concepts S.ζ and a set of relations S.θ. For an iteration k, S is matched with s in ζ: i1 → C10, i4 → C11, i6 → C14 and i7 → C15. S is matched as well with s in θ.
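Before turning to the matching structure, the sketch below gives one possible encoding of such patterns and of the four canonical refinement operations. The Pattern class, the relation encoding as (i, j, property) index triples, and the ontology interface (root_concepts, root_properties, children_of) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Pattern:
    """A generalized workflow pattern: an ordered list of concepts (zeta)
    plus a set of relations (theta) given as (i, j, property) index triples."""
    concepts: list = field(default_factory=list)
    relations: set = field(default_factory=set)

def refine(pattern, onto):
    """Generate (k+1)-level candidates from a k-level pattern with the four
    canonical operations: AddCLS, AddREL, SplCLS, SplREL."""
    out = []
    # (1) AddCLS: append a root concept from the ontology
    for c in onto.root_concepts():
        out.append(Pattern(pattern.concepts + [c], set(pattern.relations)))
    # (2) AddREL: add a root property between two existing concepts
    for i in range(len(pattern.concepts)):
        for j in range(len(pattern.concepts)):
            if i != j:
                for p in onto.root_properties():
                    out.append(Pattern(list(pattern.concepts),
                                       pattern.relations | {(i, j, p)}))
    # (3) SplCLS: specialize one concept into a direct child
    for i, c in enumerate(pattern.concepts):
        for child in onto.children_of(c):
            new_concepts = list(pattern.concepts)
            new_concepts[i] = child
            out.append(Pattern(new_concepts, set(pattern.relations)))
    # (4) SplREL: specialize one property into a direct sub-property
    for (i, j, p) in pattern.relations:
        for child in onto.children_of(p):
            new_rels = (pattern.relations - {(i, j, p)}) | {(i, j, child)}
            out.append(Pattern(list(pattern.concepts), new_rels))
    return out
```

As noted in the text, an unrestricted use of these four primitives generates redundant candidates, which is exactly what the matching structure described next is meant to mitigate.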
The proposed matching structure MS is a "stack" of the results of the canonical operations (adding/specializing concepts/relations) applied while extending a pattern. For example, the arrows in Fig. 3 show how a pattern S is obtained. The result of the last operation in this case is genInputTo.
Figure 3: Matching structure of a pattern S.
If the next operation in the (k+1)-th iteration is adding the relation R3(C14, C15) (see Fig. 4), then the new data structure of S should put the relation in the proper position. Our matching data structure adds the domain and the range properties by linking the concerned concepts.
Figure 4: Extension of a pattern S by adding a relation R3 (grey boxes).
The precedence relation within patterns takes the structure of the matching solutions into account to calculate pattern supports. Our proposed data structure follows the pattern generation order, so we do not have to re-match all the concrete workflows at each iteration. With the anti-monotonicity property of the Apriori approach, our algorithm guarantees that all frequent patterns are generated.
Figure 2: Workflow sequence s and pattern S representations.

4 Case study: phylogenetic workflows
Phylogeny represents the evolutionary history of a group of organisms having a common ancestor. Inferred from nucleic acid (DNA or RNA) or protein sequences, phylogenetic relations are discovered with tools detecting different evolutionary phenomena such as duplications, insertions and deletions [7]. Furthermore, during phylogenetic tasks, multiple bioinformatics tools and methods are used [22], which complicates
the comparison of different studies, even for the same species. Platforms like Phylogeny.fr, Taverna or Armadillo [23] propose different tools and web services. However, each platform represents a completely different level of phylogenetic knowledge: from workflow-based systems to web services and programming language packages. Moreover, the complex nature of bioinformatics data (heterogeneous, spread out and dynamic) makes the knowledge discovery task very difficult. Phylogenetic workflows follow a general pipeline of processes: (1) data collection, (2) homology search and (3) phylogenetic inference. In the first step, molecular data (sequences or alignments) of a set of species (taxa) is collected. In the second step, sequences are aligned with homologs ("similar sequences"). In the third step, phylogenetic trees describing the evolutionary history of the concerned taxa are inferred. This general pipeline has been used in the construction of workflows from texts: extracted relation instances should be ordered following this general workflow. We downloaded scientific articles from the digital repository PubMed Central (PMC, http://www.ncbi.nlm.nih.gov/pmc/) database (articles published from 2013 until April 2015) in XHTML format. Over 2,000 texts have been extracted. In addition, we constructed a terminology gazetteer (more than 1,800,000 terms) from different well-known databases in the phylogenetic literature, such as NCBI Unigene, Uniprot, Taxonomy Browser and KEGG Disease. The terminology is then reorganized in a domain ontology in order to recognize terms and relations in texts, as presented in module A of the proposed model (see Fig. 1). Next, we built a processual ontology to discover workflow patterns within the annotated terms in texts. We present in Fig. 5 a sample from the proposed processual phylogenetic ontology.
5 Experiments and results
5.1 Named entity learning results
We used a PAUM model to recognize the different phylogenetic concepts in texts. The GATE implementation of PAUM (with default parameters) is used to evaluate our supervised machine learning annotation system. The learning set is constructed from our gazetteer terminology. A specific grammar for each concept is also modeled. A 10-fold cross validation gives results in terms of precision and recall (see Table 1). Our proposed model with context features has learned and applied well 7 different domain concepts: DataType, Taxon, Gene, Protein, Disease, Method and Program. DataType is a concept including the different sequence data types: DNA, RNA, protein and alignments. 9,366 such terms have been extracted from the phylogenetic analyses sections (with 95% F-measure). The overall F-measure for both sections (abstracts and phylogenetic analyses) is 96%. Indeed, with only context POS annotations and concept neighbors, we could recognize the 7 different classes. Specific vocabularies and gazetteers are able to distinguish the different concepts in the texts. A gold standard list of terms and a corresponding acquisition corpus were also built: an expert manually extracted about 1,000 terms (and classified them into the different concepts) from 100 phylogenetic sections to build a reference terminology. Each extracted term has been compared to the reference entities in order to validate its domain precision and recall (see Figure 6). Evaluating the extracted terms against the expert reference terminology shows that only 6 classes have been well extracted by our PAUM model. The extracted disease terms were not recognized by the expert. This is expected, since disease terms are not frequent in phylogenetic analyses sections: only 6 terms have been extracted with our model and 3 are recognized by the expert. However, for the other concepts' instances, our model did recognize terms with an F-measure of 73.95%. We have not yet implemented our relation extraction system; thus, only 50 workflows have been manually built in order to execute the generalized pattern mining phase. Concept and property terms are then organized in the processual ontology. The latter contains 13 different object properties (links), 68 concepts (with 3 concept roots) and 560 individuals (terms). Our workflow database is composed of 50 workflows with an average of 11 items (terms) and 9 links per workflow. Our workflow database PhyloFlows and the processual ontology represent the input of the next module (B) of the proposed model (see Fig. 1). We present in the next section the different results of the workflow-pattern mining process.

Figure 5: A portion of the phylogenetic processual ontology (PhylOntology). At the bottom, we see abstract concepts ordered by the isInputOf and genInputTo properties. We describe the phylogenetic inference in 3 big steps: DataCollection, HomologousSearch and PhylogeneticInference. Each class/concept contains different sub-classes related to each other with different levels of relations. For example, a DataType (e.g., a DNA sequence) is aligned in a Database, used by a CharacterBasedProgram and a TreePlottingProgram, and modeled by a ModelSelectionProgram.
Table 1: Evaluation results of the automatic annotation system on phylogenetic analyses sections, extracted from articles published from 2013 to April 2015.

Concepts | N. Instances | Precision | Recall
DataType | 9366 | 0.97 | 0.92
Taxon | 8867 | 0.98 | 0.94
Gene | 922 | 1.00 | 0.95
Protein | 485 | 0.99 | 0.97
Disease | 6 | 0.30 | 0.30
Method | 2671 | 0.99 | 0.96
Program | 2132 | 0.99 | 0.90
Overall | 24449 | 0.98 | 0.93
Figure 6: Gold standard evaluation results.
5.2 Generalized-pattern results
In order to evaluate our pattern mining system T-GROWLer, we present different experiments. We tested the new matching algorithm in terms of speed (mining time) and scalability (see Fig. 7). T-GROWLer outperforms the xPMiner system [17] by up to two orders of magnitude at high support thresholds (above 0.8). With 60% minimum support, our approach executes the mining algorithm 10 times faster than the xPMiner algorithm. We also tested our method with a different database: eTP-tourism (Electronic Tourism Platform), a workflow base extracted from a portion of a log file describing the web page navigation of 56 different users on a Web travel site [17]. This database has an average of 19 items and 6 links per sequence and is enhanced by an ontology of 157 concepts and 35 object properties. Fig. 7-B shows the difference in mining time between PhyloFlows and eTP-tourism. Running time does eventually increase when more concepts and links are used: our system spends about 10 times the mining time to generate patterns with 500 more items and 100 more links. Since the eTP-tourism database presents denser graphs, matching sub-graphs and mining processes become heavier to handle. However, taking 23 seconds to do the whole process of matching and generating about 6,000 graph patterns is still a reasonable time.

Figure 7: (A) Comparing the mining time of T-GROWLer and xPMiner. (B) Varying support for the PhyloFlows and eTP-tourism databases.

We present next detailed results obtained from the PhyloFlows database. Recall that a pattern S is a set of concepts and relations; relations are couples of domain and range concepts. For instance, (1, 2) = [isAlignedIn] represents the relation isAlignedIn between the 1st and the 2nd concept in the pattern S. We present next a sample of 2 different generated patterns.

1. Concepts = [DataCollection, MultipleSequenceAlignment, PhylogeneticInference]; Relations = {(1, 3)=[isInferredBy], (1, 2)=[genInputTo], (2, 3)=[genInputTo]}; Threshold = 80%

2. Concepts = [DNASequence, HomologySearch, GTRModel, CharacterBasedProgram, TreePlottingProgram]; Relations = {(1, 2)=[isAlignedIn], (1, 3)=[isModeledBy], (1, 4)=[isUsedByProgram], (1, 5)=[isUsedByProgram], (1, 2)=[isInputOf], (2, 3)=[genInputTo], (3, 4)=[genInputTo], (4, 5)=[genInputTo]}; Threshold = 40%

All generated patterns present different levels of concepts and relations. For example, in 80% of the cases, we find that a phylogenetic analysis should have at least three ordered steps: data collection, multiple sequence alignment and a phylogenetic inference program. However, these kinds of patterns are generally well known by the community. When decreasing the support threshold, we find more interesting patterns, for instance that when DNA sequences are selected, a homology search is done with a GTR model for a character-based inference method and a visualization tool (pattern 2). This sequence is enhanced with different kinds of links, such as which alignment program is used or which kind of inference method has been adopted in these kinds of studies.
These kinds of patterns could be used for better recommendation, considering their abstract but semantically richer meanings. For example, if we know that the 16s rRNA gene is used by the ClustalW alignment program in 10% of the cases, then a concrete recommendation could offer MrBayes as the phylogenetic reconstruction program. However, offering any Character-based Inference Program is a more interesting recommendation due to its higher frequency (40%) in the database, as it is a more "general" concept.
6 Conclusion
Our proposed approach of mining generalized patterns from texts shows that different kinds of knowledge can be extracted from the phylogenetic literature. Expertise and domain knowledge from articles yield interesting patterns to mine. Different ontologies have been used to represent the different types of knowledge as concepts and relations. Extracted workflows are enhanced with ontology annotations to produce generalized frequent patterns. These patterns, covering a wider range of values than specific terms, could be used to assist practitioners with a richer vocabulary during their tasks. Our proposed model could also be applied to different domains where expertise is confronted with domain knowledge. Video games, for example, are a good application domain for our approach, since gamers constantly face cognitive gestures. Workflows of events could be extracted from game sessions and mined in order to recommend and predict generalized patterns, such as what type of event is going to happen during a game or what kind of item a player needs to buy to accomplish a mission.
References [1]
W. van der Aalst, T. Weijters, and L. Maruster. Workflow mining: Discovering process models from event logs. IEEE Trans. on Knowl. and Data Eng., 16(9):1128–1142, September 2004.
[2]
C. A. Ellis, A. J. Rembert, K.-H. Kim, and J. Wainer. Beyond workflow mining. In S. Dustdar, J. L. Fiadeiro, and A. P. Sheth, editors, Business Process Management, volume 4102 of Lecture Notes in Computer Science, pages 49–64. Springer, 2006.
[3]
W. M. P. van der Aalst and A. Weijters. Process mining: a research agenda. Computers in Industry, 53:231–244, 2004.
[4]
N. Hashmi. Abstracting workflows: unifying bioinformatics task conceptualization and specification through semantic web services. on Semantic Web for Life., (October):27–28, 2004.
[5]
W. Dong, W. Fan, L. Shi, C. Zhou, and X. Yan. A general framework to encode heterogeneous information sources for contextual pattern mining. Proceedings of the 21st ACM int. conference on Information and knowledge management CIKM ’12, page 65, 2012.
[6]
Y. Tsuruoka, Y. Tateishi, J.-D. Kim, T. Ohta, J. McNaught, S. Ananiadou, and J. Tsujii. Developing a robust part-ofspeech tagger for biomedical text. In Advances in informatics, pages 382–392. Springer, 2005.
[7]
M. Anisimova, D. a. Liberles, H. Philippe, J. Provan, T. Pupko, and A. von Haeseler. State-of the art methodologies dictate new standards for phylogenetic analysis. BMC evolutionary biology, 13:161, January 2013.
[8]
C. C. Aggarwal and J. Han, editors. Frequent Pattern Mining. Springer, 2014.
[9]
S. S. A. Bamshad Mobasher. The role of domain knowledge in data mining. In Proceedings of the 4th int. conference on Information and knowledge management, pages 37–43. ACM, 1995.
[10] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th int. Conference on Extending Database Technology: Advances in Database Technology, EDBT ’96, pages 3–17, London, UK, UK, 1996. Springer-Verlag. [11] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In VLDB, volume 95, pages 420–431, 1995. [12] S. Fortin and L. Liu. An object-oriented approach to multilevel association rule mining. In Proceedings of the fifth int. conf. on Information and knowledge management, pages 65– 72. ACM, 1996. [13] X. Zhou and J. Geller. Raising, to enhance rule mining in web marketing with the use of an ontology. Data Mining with Ontologies: Implementations, Findings and Frameworks, pages 18–36, 2007. [14] J. Phillips and B. G. Buchanan. Ontology-guided knowledge discovery in databases. In Proceedings of the 1st int. conf. on Knowledge capture, pages 123–130. ACM, 2001. [15] H. Ceˇspivov´a, J. Rauch, V. Svatek, M. Kejkula, and M. Tomeckova. Roles of medical ontology in association mining crisp-dm cycle. In ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies (KDO04), Pisa. Citeseer, 2004. [16] M. Adda, P. Valtchev, R. Missaoui, and C. Djeraba. Toward recommendation based on ontology-powered web-usage mining. IEEE Internet Computing, 11(4):45–52, 2007. [17] L. De Raedt. Logical and relational learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 5249 LNAI, page 1, 2008. [18] D. Nadeau and S. Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3– 26, 2007. [19] Y. Li, K. Bontcheva, and H. Cunningham. Using uneven margins svm and perceptron for information extraction. In Proceedings of the Ninth Conference on Computational Natural Language Learning, CONLL ’05, pages 72–79, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. [20] N. Bach and S. Badaskar. A survey on relation extraction. Language Technologies Institute, Carnegie Mellon University, www. ark. cs. cmu. edu/LS2/images/9/97/BachBadaskar, 2007. [21] J. W. Leigh, F. J. Lapointe, P. Lopez, and E. Bapteste. Evaluating phylogenetic congruence in the post-genomic era. Genome Biology and Evolution, 3:571–587, 2011. [22] E. Lord, M. Leclercq, A. Boc, A. B. Diallo, and V. Makarenkov. Armadillo 1.1: An Original Workflow Platform for Designing and Conducting Phylogenetic Analysis and Simulations. PloS one, 7(1):e29903, January 2012.
Evolution and Vaccination of Influenza Virus
Ham Ching Lam∗, Srinand Sreevatsan, and Daniel Boley
University of Minnesota-Twin Cities Campus, Minnesota, MN 55455, United States of America
∗ Corresponding author email: [email protected]
Abstract In this study, we present an application paradigm in which an unsupervised machine learning approach is applied to the high dimensional influenza genetic sequences in order to investigate whether vaccine is a driving force to the evolution of influenza virus. We first used a visualization approach to visualize the evolutionary paths of vaccine-controlled and non-vaccine controlled influenza viruses in low dimensional space. We then quantified the evolutionary differences between their evolutionary trajectories through the use of within and between scatter matrices computation in order to provide the statistical confidence to support the visualization results. We used the influenza surface Hemagglutinin (HA) gene for this study as the HA gene is the major target of the immune system. The visualization is achieved without using any clustering methods or prior information about the influenza sequences. Our results clearly showed that the evolutionary trajectories between vaccine-controlled and non-vaccine controlled influenza viruses are different and vaccine as an evolution driving force cannot be completely eliminated.
1 Introduction
The rapid growth of influenza genome sequence data due to the advanced development of sequencing technology in recent years has provided the opportunity for a more comprehensive sequence analysis of the influenza virus. Sieving through and making sense of this mountain of data relying solely on phylogenetic approaches has become increasingly difficult, in part due to the poor scalability of the relevant algorithms [Nicholas, 2007]. Therefore, a different methodology needs to be utilized in order to take advantage of the massive amount of available data while at the same time exposing important information or structure within the data. Here, we present an application paradigm in which an unsupervised machine learning approach is applied to the high dimensional influenza genetic sequences so that the evolution of the vaccine controlled and non-vaccine
controlled influenza viruses in the past century can be visualized. The main objectives of this study are twofold: (1) to visualize the evolution trajectories of influenza under vaccine pressure and in the wild without using any prior information about the viruses, and (2) to provide statistical confidence to support the visualization results. The influenza virus is thought to have originated from a natural reservoir consisting of wild aquatic birds [Taubenberger and Kash, 2010; Webster et al., 1992]. The influenza A virus is divided into subtypes based on differences in the surface proteins hemagglutinin (HA) and neuraminidase (NA), which are targets of the human immune system. Antigenic variants or immunologically distinct strains of A/H1N1, A/H3N2, and Type B have continued to emerge since their introduction into humans [Schweiger et al., 2002]. Vaccination is the main strategy for stopping the infection and transmission of the virus in humans [Hannoun, 2013]. There are three components in a seasonal flu vaccine: (1) A/H1N1, (2) A/H3N2 and (3) Type B influenza. Each component is designed to fight the specific strain in each subtype that is predicted to be the dominant circulating strain in the upcoming flu season. Over the years, there have been over 20 vaccine updates for the A/H3N2 strain, over 16 updates for the Type B strain and 10 updates for the A/H1N1 strain. Each vaccine update is designed to provide immunity to the new antigenic variant that has emerged from the previous flu season. However, the long-term effects of vaccination on the evolution of the virus itself are not clear. In order to shed light on this seemingly unsuspected problem, we used the nucleotide sequences from seasonal human A/H3N2 influenza virus from 1971 to 2009 as an example to demonstrate the evolutionary progression of this influenza virus against each successive vaccine introduction. Figure 1 shows the progression of influenza evolution based on the nonsynonymous substitution (dN) and synonymous substitution (dS) ratio analysis using the HA1 domain of the HA gene from the A/H3N2 virus. The HA1 domain is a hypervariable domain of the HA gene where constant mutational changes can be observed due to the immune pressure generated by the host. A dN/dS ratio greater than 1 indicates that the site is under positive selection pressure and is undergoing molecular adaptation. In Figure 1, a constant shift of positively selected sites (blue color: dN/dS ratio greater than 1) could be observed whenever a new vaccine (green square)
was introduced, which indicated that a new antigenic variant had emerged. When a repeated vaccine was introduced, the positively selected sites identified from the previous season remained unchanged. Given the results from the dN/dS ratio analysis, we compared the evolution trajectories of vaccine controlled to non-vaccine controlled influenza viruses and sought to better understand the effect vaccination has on the evolution of influenza virus. In the present study, we used the human A/H3N2, A/H1N1, Type B, and avian H5 HA sequences as the vaccine-controlled samples. We used the human H5N1 and avian H5N1 HA sequences as the non-vaccine controlled samples.

Figure 1: Seasonal human A/H3N2 influenza dN/dS ratio analysis against time of vaccine introduction. A constant shift of positively selected site location when a new vaccine was introduced. Horizontal axis represents the position in the HA1 domain of the HA gene. Vertical axis represents time progression from 1971 (bottom) to 2009 (top) when each new (green square) and repeated (black square) vaccine was introduced. Red color bars denote the range of positions with dN/dS ratio from 0.8 to 1. Blue color bars denote the range of positions with dN/dS ratio greater than 1.
2 Background

Influenza viruses have the ability to infect a very broad range of avian and mammalian hosts. Their genomic diversity is acquired through two biological mechanisms: antigenic drift and antigenic shift [Webster et al., 1992]. Antigenic drift consists of the accumulated and continual mutations on surface proteins, resulting in the generation of antigenic variants. Of these surface proteins, we are focused on the hemagglutinin protein. Antigenic shift occurs when complete gene segments are exchanged among different subtypes of influenza viruses within a host cell, resulting in what effectively amounts to a whole new influenza virus genome. Both antigenic drift and antigenic shift allow the virus to evade the host's immune response and rapidly adapt to new hosts [Caron et al., 2009; Suzuki, 2006]. The evolution of influenza A virus is driven by the high rate of mutations and the ability to reassort gene segments. Because of its high rate of mutation combined with the lack of error correcting mechanisms during replication, the influenza virus can easily generate different phenotypes that have the ability to survive within its host and infect others. To keep track of the evolution of the virus, an annual update to the influenza vaccine composition is needed in order to provide vaccine-induced immunity to the general public [Boni, 2008]. The main process in influenza vaccine strain selection is to assess the match between the vaccine strain and the currently circulating strains and the potential new antigenic variant [Russell et al., 2008]. If the vaccine strain does not match the currently circulating strains or the new antigenic variant that is likely to be the major variant in the upcoming influenza season, the vaccine composition is updated to contain a representative of the new variant [Russell et al., 2008]. Each vaccine update is designed to provide immunity to the new antigenic variant that has emerged from the previous flu season. The seasonal influenza vaccine is used to prevent the infection and transmission of the virus, but its effect on the evolution of the virus itself is not clear.

3 Materials and Methods

In this study, utilizing the online NCBI influenza database [Bao et al., 2008], we collected HA sequences from human A/H3N2, A/H1N1, Type B, and avian H5 viruses to represent the vaccine-controlled samples. We also collected human H5N1 and avian H5 HA sequences to represent the non-vaccine controlled samples. Table 1 lists the year range and number of HA nucleotide sequences from each sample.

Table 1: Vaccine controlled and non-vaccine* controlled human and avian sequences.

Samples | Year | Seqs
Human A/H1N1 | 1918-13 | 2140
Human A/H3N2 | 1968-09 | 175
Human Type B (Vic/Yam) | 1970-13 | 818
*Human H5N1 | 1997-12 | 127
Avian H5 (Mexico) | 1994-02 | 32
*Avian H5 (China) | 1997-02 | 32
3.1 Influenza evolution visualization
All genetic sequences were first converted into binary strings according to the method outlined in [Lam et al., 2012]. Nucleotide sequences are represented by strings of characters over an alphabet of four letters: A, C, G, T. To obtain the binary string, each letter is replaced by a code of 4 bits: 1000, 0100, 0010, 0001, respectively. All binary strings were collected into a matrix to which Principal Component Analysis (PCA) [Jolliffe, 2002] was applied to extract the dominant variation from the dataset. Here, we briefly outline the sequence of steps involved in the PCA analysis. Consider a data matrix $X_{m,n}$ of dimensions $m$ by $n$, with $m$ being the number of strains and $n$ being the number of sites or positions (in this case, $n = 987 \times 4 = 3948$ for nucleotide sequences). Each row of $X$ corresponds to a strain of virus and each column of $X$ corresponds to a particular position. We first center the columns of the data matrix $X$ with $\hat{X} = X - \frac{1}{m} e e^{T} X$, where $e$ is a column vector of all ones, and then obtain the sample covariance matrix $C$ from $\hat{X}$ by $C = \frac{1}{m-1}\hat{X}^{T}\hat{X}$. $C$ is a square symmetric $n \times n$ matrix whose diagonal entries are the variances of the individual sites across strains and whose off-diagonal terms are the covariances between different sites. The PCA algorithm is then applied to matrix $C$. The result is then visualized by plotting the top two or three principal components of the projected data. Since each strain is encoded as a binary string and PCA works at the binary data level, the pairwise distance relationship between the strains in a reduced space can be understood as follows. Let $\|s-t\|_{H}$ denote the pairwise Hamming distance between two strains $s, t$ (number of differences in the genetic sequences). Let $\|s-t\|_{bin,1}$ and $\|s-t\|_{bin,2}$ denote the distances between the binary encodings of the two sequences (1-norm and 2-norm, respectively), and let $\|s-t\|_{proj}$ denote the 2-norm distance in the lower dimensional space after projection onto the leading principal components. Every single change in the genetic sequence alphabet corresponds to changes to 2 bits in the binary encoding. Hence we have the following relation between the distance in the lower dimensional space shown on the plots and the Hamming distance among the original sequences: $\|s-t\|_{proj}^{2} \leq \|s-t\|_{bin,2}^{2} = \|s-t\|_{bin,1} = 2\|s-t\|_{H}$.
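The following sketch (our illustration, not the authors' code; the short aligned sequences are hypothetical) shows the binary encoding and PCA projection described above, together with a check of the stated relation between the binary 2-norm distance and the Hamming distance.

```python
import numpy as np

NUC_CODE = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
            "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def encode(seq):
    """Map a nucleotide string to its 4-bit-per-base binary vector."""
    return np.array([b for nt in seq for b in NUC_CODE[nt]], dtype=float)

# Hypothetical aligned HA fragments (rows = strains).
seqs = ["ACGTAC", "ACGTAA", "TCGTAC", "ACGGAC"]
X = np.vstack([encode(s) for s in seqs])          # m x n binary matrix

# Centre columns and project onto the leading principal components
# (SVD of the centred matrix yields the same components as eigen-decomposing C).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T                              # top-2 PC coordinates per strain

# Sanity check of the distance relation: ||s-t||_bin,2^2 = 2 * Hamming(s,t)
hamming = sum(a != b for a, b in zip(seqs[0], seqs[1]))
assert np.isclose(np.sum((X[0] - X[1]) ** 2), 2 * hamming)
print(proj)
```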
Box 1: Virus isolation year as class label
• $C$: number of classes; $N_i$: number of data points in class $i = 1, 2, \ldots, C$
• $\lambda = \mathrm{tr}(B)/\mathrm{tr}(W)$
• $B$: between-class scatter matrix
  – $B = \sum_{i=1}^{C} (u_i - M)(u_i - M)^{T}$
  – $M = \frac{1}{C}\sum_{i} u_i$, the global mean of the dataset
• $W$: within-class scatter matrix
  – $W = \sum_{i=1}^{C}\sum_{j=1}^{N_i} (x_j - u_i)(x_j - u_i)^{T}$
  – $u_i$: mean of class $i$

Alg. I: Estimate Separateness Measures
Let $\lambda_o = \mathrm{tr}(B_o)/\mathrm{tr}(W_o)$ be the observed separateness value.
Repeat $j = 1 : K_2$
    Repeat $i = 1 : K_1$
        generate a randomization of the class labels
        compute the within-cluster scatter $W$
        compute the ratio $\lambda_i = \frac{\mathrm{tr}(T) - \mathrm{tr}(W)}{\mathrm{tr}(W)} = \frac{\mathrm{tr}(B)}{\mathrm{tr}(W)}$
    compute the mean $\mu$ and std $\sigma$ of all $\lambda_{i=1,\ldots,K_1}$
    compute the distance $d_j = \frac{\mu - \lambda_o}{\sigma}$
Compute the mean $\bar{d}$ and std $\hat{d}$ of all $d_{j=1,\ldots,K_2}$
Report the distance of $\lambda_o$ from the mean in the form $\bar{d} \pm \hat{d}$
3.2 Quantification
In order to provide statistical support to the graphical results obtained, we performed a statistical analysis based on a method that combines a multi-class scatter matrix computation and class-label randomization. The projected data points served as the viruses' 2-D coordinates and the year of isolation of each virus served as the class label. The multi-class scatter matrix involves the computation of the between-class matrix (B) and the within-class matrix (W) (Box 1). These computed matrices were not used explicitly as we only sought the traces of B and W. These are just the scalar scatter values: sums of squared distances between points and their respective centers. The class separateness measure λo is the ratio of trace B over trace W. A large λo indicates that the classes or clusters are well separated from each other and that elements within a cluster are strongly related or share the same property. This is basically an estimate of how well a multi-class Fisher's linear discriminant could separate the classes [Alpaydin, 2010]. A class label randomization algorithm (Alg. I) provided the "distance measure" as a surrogate for the probability of observing the observed λo by chance. This is because the area under the tail of the randomized λ distributions beyond the observed separateness values was below the rounding error of $10^{-16}$, which made the computation of a p-value impossible. The larger the 'distance', the less likely it is that the observed λo was generated by chance.
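A minimal sketch (our illustration, with hypothetical toy data) of the separateness ratio and the label-randomization distance of Box 1 and Alg. I is shown below. Two simplifications are assumptions of ours: the between-class scatter is weighted by class size, so that tr(T) = tr(B) + tr(W) as used inside Alg. I, and only the inner randomization loop (K1) is shown, with the distance reported as a positive number of standard deviations.

```python
import numpy as np

def scatter_ratio(points, labels):
    """lambda = tr(B) / tr(W) for 2-D points grouped by class labels."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    global_mean = points.mean(axis=0)
    tr_B = tr_W = 0.0
    for c in np.unique(labels):
        grp = points[labels == c]
        mu = grp.mean(axis=0)
        tr_W += np.sum((grp - mu) ** 2)                    # within-class scatter
        tr_B += len(grp) * np.sum((mu - global_mean) ** 2) # size-weighted between-class scatter
    return tr_B / tr_W

def randomization_distance(points, labels, k1=100, rng=None):
    """Std deviations separating the observed lambda from label-shuffled lambdas."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    observed = scatter_ratio(points, labels)
    null = [scatter_ratio(points, rng.permutation(labels)) for _ in range(k1)]
    return (observed - np.mean(null)) / np.std(null)

# Hypothetical toy data: two well-separated "isolation year" classes.
pts = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
yrs = [1990] * 30 + [2000] * 30
print(scatter_ratio(pts, yrs), randomization_distance(pts, yrs))
```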
4 Results

The application of a high-throughput unsupervised method to the high-dimensional influenza virus genetic sequence data has made possible the visualization of the evolution of the influenza virus over the span of almost half a century. In this study, we present the graphical results from the visualization of vaccine and non-vaccine controlled influenza viruses based on their genetic sequences alone. The human influenza A/H3N2 has the highest number of vaccine updates among the three vaccine controlled influenza viruses circulating in humans. Given the observation that positively selected sites shifted constantly whenever a new vaccine was introduced, we sought to visualize the evolution trajectories of vaccine and non-vaccine controlled influenza samples. We also set out to compute the class or cluster separateness values for both vaccine and non-vaccine controlled samples using the multi-class scatter matrix computation method, both before and after the class-label randomization process. We performed 1000 runs of Alg. I on these samples and listed the results in Table 2. The observed separateness values λo of vaccine controlled samples are consistently higher than those of the non-vaccine controlled samples. This suggests that the vaccinated samples have very good separability by isolation years. In Figure 2, we observed that the human A/H3N2 viruses clustered around vaccine seed strains chronologically since their introduction into humans in 1968. The evolution trajectory is directional, going from lower left to lower right in the figure. In Figure 3, two separate lineages of human Type B influenza are co-circulating, and each lineage shows
Figure 2: Seasonal human A/H3N2 influenza virus evolution trajectory. Each arrow points to a vaccine seed strain (red dot). The directional evolution can be seen as traveling from lower left to the top and then coming down to the lower right.
the same observational characteristics as the A/H3N2: Type B viruses are also clustered around vaccine seed strains. For the human H1N1 influenza virus, a single lineage (black) can be seen that corresponds to the pre-2009 swine H1N1 pandemic. A sudden jump or gap is illustrated in the visualization due to the fact that the pandemic swine H1N1 strain replaced the classical A/H1N1 and began to evolve (directional trajectory) as it circulated among humans. A vaccinated avian sample (avian H5) was used to further understand the evolution characteristics of vaccine controlled influenza. In late 1993, an outbreak of avian H5 influenza in poultry in Mexico was detected and a long-term vaccination program was implemented in the hope of bringing the outbreak under control and eradicating the virus [Lee et al., 2004; Escorcia et al., 2008]. The vaccination program was in effect for over 13 years, but an increase in respiratory signs of disease was observed in vaccinated chickens [Escorcia et al., 2008]. In other words, the vaccine strain used in the vaccination program no longer matched the circulating strain in the field. The vaccine strain (A/Ck/Mexico/CPA-232/1994) was isolated in 1993 and was in use for the duration of the program, over a decade. Using the available genetic HA sequences from these vaccinated chickens, we produced a 3-dimensional PCA plot (Figure 5) to show the evolution of the field isolates from 1994 to 2002. The first observation from Figure 5 is that a directional evolutionary trend similar to other vaccinated samples can be seen in this figure. Second, a chronological pattern is obvious, indicating that the virus had undergone constant evolution, or antigenic drift, away from the early strains. A split in the evolutionary path can be seen occurring in the 1990s. This split or divergence has been reported in studies by [Lee et al., 2004; Escorcia et al., 2008] based on phylogenetic analyses conducted on the same sequence sample. Figure 6 illustrates the evolution trajectory of the non-vaccine controlled human H5N1 influenza from 1997 to 2002. We included the human H5N1 virus as the 'control' since this subtype is not currently being vaccinated against in humans but is under active research due to its high mortality rate in infected humans. Figure 6 suggests that this subtype has evolved into a few dominant clusters since 1997. Three major evolutionary trends or clustering patterns can be seen originating from the center cluster, which contains viruses from 1997. This also implies that this influenza subtype has undergone HA gene diversification. Although it has diversified since 1997, the specific H5 HA gene identified in 1997 has remained present to this day [Wei et al., 2012]. Figure 7 shows the evolution of the non-vaccine controlled avian H5 influenza virus. The overall observation that arises from this figure is that, rather than forming a restricted directional trend, the evolution of the virus is characterized by a collection of clusters scattered on the plot. The collection of clusters suggests a diverse pool of genetic diversity of the virus. For the avian H5 subtype, a less focused evolutionary trend than for vaccine controlled influenza viruses can be observed. The increased genetic diversity since 2000 has been observed by [Garcia et al., 1997] and is captured in this figure, with clusters scattered to the left and extended to the upper and lower corners at almost the same time. This clearly suggests the co-circulation of multiple clades or sublineages of the avian H5 subtype. The broad genetic diversity of the avian H5, represented by multiple clusters across a long time period, indicates that the avian subtype in the wild evolves much more slowly than seasonal human influenza viruses.
Figure 3: Seasonal human Type B influenza virus evolution trajectory. Two separate lineages (Victoria and Yamagata) are evolving simultaneously (top to lower left and to lower right). Vaccine introductions are indicated by year labels.
5 Discussions and Conclusions
Vaccination is the principal measure for preventing influenza and reducing its impact [Webby et al., 2004; Wood et al., 2001]. In the almost a century since the isolation of the first influenza virus, influenza vaccines have been persis-
Figure 4: Seasonal human H1N1 influenza virus evolution trajectory in 3 dimensions. Pre-2009 pandemic viruses are in black. A clear separation can be seen after pandemic09 replaced the classic A/H1N1 strain. Separate lineages emerged indicating different genetic diversity.
Table 2: Class separateness results: vaccine and non-vaccine* controlled human and avian samples

  Sample (Human)                    λo       Distance
  A/H3N2 (1968-2009)                30.5     978.3 ± 0.031
  Type B: Victoria (1970-2013)      26.3     1310 ± 0.02
  Type B: Yamagata (1970-2013)      25.3     1327.8 ± 0.019
  A/H1N1 (1918-2013)                24.7     617.2 ± 0.04
  *H5N1 (1997-2002)                 1.01     34.8 ± 0.029

  Sample (Avian)                    λo       Distance
  Avian H5 Mexico (1994-2002)       1.7      12.23 ± 0.11
  *Avian H5N1 China (1997-2002)     0.268    3.16 ± 0.6
tent and have evolved in response to the evolution of the influenza viruses circulating in humans [Gunn et al., 2010; Hannoun, 2013]. Antigenic drift of influenza viruses occurs frequently among circulating strains and leads to new antigenic variants. However, whether the drift mechanism operates in the presence of vaccine pressure is an important question that needs to be addressed at different levels, as vaccination is the primary method of prevention and protection for humans against the influenza virus. Two studies [Hensley et al., 2009; Lee et al., 2004] have shown that vaccination forces mutations on the HA protein of the influenza virus. These mutations changed the way in which the virus gradually evolved and adapted to a new vaccine-protected environment. Here, we extended the spectrum of analysis to include vaccine controlled human and avian samples and non-vaccine controlled human and avian samples to better compare, contrast and understand the evolutionary dynamics of influenza viruses under vaccine pressure. Using vaccinated and non-vaccinated samples from both human and avian hosts, we hope to minimize potential data selection bias and at the same time provide a fair comparison across hosts under vaccination pres-
Figure 5: Vaccine controlled avian H5 influenza virus evolution trajectory in 3 dimensions. Vaccine was introduced in early 1990s and the virus slowly evolved away from the vaccine strain and established two separate lineages.
Figure 6: Non-vaccine controlled human H5N1 influenza virus evolution trajectory in 3 dimensions. The virus has evolved into a few dominant lineages since 1997. Three major evolutionary lineages can be seen originating from the center cluster, which contains viruses from 1997. However, the specific H5 HA gene identified in 1997 has remained present to this day. sure. Our method utilized only the genetic composition of the HA sequences, without using any specific clustering algorithm. As mentioned above and shown in Figure 1, genetic sequences contain important signals for detecting evolutionary trends between different influenza subtypes under vaccination pressure. The genetic composition combined with the implicit positional information of the HA gene is enough to provide clues that the vaccine-controlled influenza viruses are under pressure to mutate in order to escape immune responses. Our method takes advantage of the binary coding of each sequence, which preserves the positional information of each HA gene. In this study, we have demonstrated that the evolutionary trajectories for vaccine controlled influenza are directional and restricted. The restricted directional evolutionary trends and the cluster formation around the vaccine strains along the evolutionary paths exhibited by the vaccine controlled in-
Figure 7: Non-vaccine controlled avian H5 influenza virus evolution trajectory in 3 dimensions. Multiple clusters are scattered throughout, sharing almost the same time periods, suggesting the co-circulation of multiple clades or sublineages of the avian H5 subtype. fluenza viruses are in sharp contrast to the non-vaccine controlled influenza viruses. Apart from this distinction, the naturally emerging chronological ordering of vaccine controlled influenza viruses in both the two- and three-dimensional visualizations is much more noticeable than for the non-vaccine controlled viruses. This natural chronological ordering reflects the active adaptation of the viruses to their changing environment. The class separateness measure exposes the fact that vaccine controlled influenza viruses that share the same isolation year tend to cluster tightly together with good separability. Each separate cluster or group represents a distinct genetic diversity of the virus group. In contrast, non-vaccine controlled influenza viruses isolated within the same time period appeared to be more scattered, and the clusters exhibited much larger within-cluster distances, with no narrow restricted bands being observed. These observations suggested that the mutations on the HA gene were not restricted to certain sites alone and that the majority of these mutations most likely were synonymous nucleotide substitutions on the HA gene. Also, the number of clusters observed is almost identical to the number of vaccine updates for the seasonal human A/H3N2 and influenza B viruses. The number of clusters observed in the seasonal human A/H1N1 is not the same as the number of vaccine updates, but it does show that this virus has been gradually evolving away from the vaccine strains as time passes. Since the A/H1N1pdm09 pandemic strain replaced the A/H1N1 strains in 2009 as the H1N1 vaccine component, the virus can be seen as slowly evolving but has not changed to a new antigenic variant. The very low value of λo computed from non-vaccine controlled influenza viruses clearly captures the fact that non-vaccine controlled viruses are not actively evolving by the year. In contrast, the vaccine controlled influenza viruses have been actively evolving and adapting to the changing environment constantly as a new vaccine composition is introduced year after year. This is clearly reflected in the very high λo value for vaccine controlled influenza viruses. Although our
analysis was based on genetic sequences alone, the results suggested that a clear difference exists between influenza viruses evolving in a vaccine-protected environment and those evolving in the wild. This difference is shown through the multi-class scatter computation of their evolutionary paths. This quantitative measurement also serves as basic statistical support for the observed differences in the evolution dynamics between vaccine controlled and non-vaccine controlled influenza viruses. There are other potential factors besides vaccination that can affect the evolution of influenza viruses, such as host-specific immune responses, the large difference in life expectancy between humans and avian species, vaccine efficacy and effectiveness, the transmission channel of the virus in different environments, and geographical regions. These factors have not been considered in the present study because our overall objective is to present a genetic-sequence-only approach as a first step in understanding the evolution of influenza viruses in a protected environment. Our approach works directly at the sequence level with no prior assumption about the evolution of the virus. It is a departure from the traditional one-dimensional phylogenetic approach in that we visualize influenza evolution in 2D and 3D space. All phylogenetic methods make or rely heavily upon assumptions about the underlying evolutionary process [Jenkins et al., 2002]. By using methods that avoid making assumptions about the parentage relations among the strains, we can avoid possible misinterpretation of the results. As has been shown in this paper, a data-driven approach with no prior assumptions about the evolution of the influenza virus affords us a different perspective in directly visualizing how the virus evolves over a span of more than half a century. This perspective has given us insight into the way we think about the driving forces behind the emergence of human seasonal influenza antigenic variant strains season after season. Perhaps vaccination did play a role in forcing the virus to undergo a different evolutionary path in order to continue to establish itself in its occupied host. A definitive scientific conclusion cannot be drawn without a thorough study of the virus in a controlled experiment over an extended period of time, which should include no fewer than multiple influenza epidemics in humans.
Acknowledgments This research was supported in part by NSF grants IIS0916750, IIS 1319749. Influenza research in Srinand Sreevatsan lab is funded by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN266200700007C. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or NSF. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References [Alpaydin, 2010] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2nd edition, 2010.
[Bao et al., 2008] Yiming Bao, Pavel Bolotov, Dmitry Dernovoy, Boris Kiryutin, Leonid Zaslavsky, Tatiana Tatusova, Jim Ostell, and David Lipman. The influenza virus resource at the national center for biotechnology information. J Virol, 82(2):596–601, Jan 2008. [Boni, 2008] Maciej F. Boni. Vaccination and antigenic drift in influenza. Vaccine, 26 Suppl 3:C8–14, Jul 2008. [Caron et al., 2009] Alexandre Caron, Nicolas Gaidet, Michel de Garine-Wichatitsky, Serge Morand, and Elissa Z. Cameron. Evolutionary biology, community ecology and avian influenza research. Infect Genet Evol, 9(2):298–303, Mar 2009. [Escorcia et al., 2008] Magdalena Escorcia, Lourdes Vzquez, Sara T. Mndez, Andrea Rodrguez-Ropn, Eduardo Lucio, and Gerardo M. Nava. Avian influenza: genetic evolution under vaccination pressure. Virol J, 5:15, 2008. [Garcia et al., 1997] M Garcia, DL Suarez, JM Crawford, JW Latimer, RD Slemons, DE Swayne, and ML Perdue. Evolution of h5 subtype avian influenza a viruses in north america. Virus research, 51(2):115–124, 1997. [Gunn et al., 2010] Jennifer Lee Gunn, Susan Craddock, and Tamara Giles-Vernick. Influenza and public health: Learning from past pandemics. Earthscan, 2010. [Hannoun, 2013] Claude Hannoun. The evolving history of influenza viruses and influenza vaccines. 2013. [Hensley et al., 2009] Scott E. Hensley, Suman R. Das, Adam L. Bailey, Loren M. Schmidt, Heather D. Hickman, Akila Jayaraman, Karthik Viswanathan, Rahul Raman, Ram Sasisekharan, Jack R. Bennink, and Jonathan W. Yewdell. Hemagglutinin receptor binding avidity drives influenza a virus antigenic drift. Science, 326(5953):734– 736, Oct 2009. [Jenkins et al., 2002] Gareth M Jenkins, Andrew Rambaut, Oliver G Pybus, and Edward C Holmes. Rates of molecular evolution in rna viruses: a quantitative phylogenetic analysis. Journal of molecular evolution, 54(2):156–165, 2002. [Jolliffe, 2002] Ian T Jolliffe. Principal component analysis. Springer verlag, 2002. [Lam et al., 2012] HamChing Lam, Srinand Sreevatsan, and Daniel Boley. Analyzing influenza virus sequences using binary encoding approach. Scientific Programming, 20:3– 13, 2012. [Lee et al., 2004] Chang-Won Lee, Dennis A. Senne, and David L. Suarez. Effect of vaccine use in the evolution of mexican lineage h5n2 avian influenza virus. J Virol, 78(15):8372–8381, Aug 2004. [Nicholas, 2007] Barton Nicholas. Evolution. Cold Spring Harbor Laboratory Press, 1st edition, 2007. [Russell et al., 2008] Colin A Russell, Terry C Jones, Ian G Barr, Nancy J Cox, Rebecca J Garten, Vicky Gregory, Ian D Gust, Alan W Hampson, Alan J Hay, Aeron C Hurt, et al. Influenza vaccine strain selection and recent studies on the global migration of seasonal influenza viruses. Vaccine, 26:D31–D34, 2008.
[Schweiger et al., 2002] B. Schweiger, I. Zadow, and R. Heckler. Antigenic drift and variability of influenza viruses. Med Microbiol Immunol, 191(3-4):133–138, Dec 2002. [Suzuki, 2006] Yoshiyuki Suzuki. Natural selection on the influenza virus genome. Mol Biol Evol, 23(10):1902– 1911, Oct 2006. [Taubenberger and Kash, 2010] Jeffery K Taubenberger and John C Kash. Influenza virus evolution, host adaptation, and pandemic formation. Cell host & microbe, 7(6):440– 451, 2010. [Webby et al., 2004] RJ Webby, DR Perez, JS Coleman, Y Guan, JH Knight, EA Govorkova, LR McClain-Moss, JS Peiris, JE Rehg, EI Tuomanen, et al. Responsiveness to a pandemic alert: use of reverse genetics for rapid development of influenza vaccines. the Lancet, 363(9415):1099– 1103, 2004. [Webster et al., 1992] R. G. Webster, W. J. Bean, O. T. Gorman, T. M. Chambers, and Y. Kawaoka. Evolution and ecology of influenza a viruses. Microbiol Rev, 56(1):152– 179, Mar 1992. [Wei et al., 2012] Kaifa Wei, Yanfeng Chen, Juan Chen, Lingjuan Wu, and Daoxin Xie. Evolution and adaptation of hemagglutinin gene of human h5n1 influenza virus. Virus genes, 44(3):450–458, 2012. [Wood et al., 2001] John M Wood, KG Nicholson, M Zambon, R Hinton, DL Major, RW Newman, U Dunleavy, D Melzack, JS Robertson, and GC Schild. Developing vaccines against potential pandemic influenza viruses. In International Congress Series, volume 1219, pages 751– 759. Elsevier, 2001.
A Logic for Checking the Probabilistic Steady-State Properties of Reaction Networks
Vincent Picard and Anne Siegel and Jérémie Bourdon
Université de Rennes 1, CNRS, Université de Nantes, Rennes/Nantes, France
Abstract
Designing probabilistic reaction models and determining their stochastic kinetic parameters are major issues in systems biology. In order to assist in the construction of reaction network models, we introduce a logic that allows one to express asymptotic properties about the steady-state stochastic dynamics of a reaction network. Basically, the formulas can express properties on expectancies, variances and co-variances. If a formula encoding for experimental observations on the system is not satisfiable, then the reaction network model can be rejected. We demonstrate that deciding the satisfiability of a formula is NP-hard, but we provide a decision method based on solving systems of polynomial constraints. We illustrate our method on a toy example.
1 Introduction
The dynamical quantitative analysis of systems of coupled chemical reactions, also known as reaction networks, is a major topic of interest in systems biology. Two main mathematical frameworks have been introduced to investigate their kinetic behavior [Helms, 2008]: ordinary differential equations at the population level and stochastic modeling at the single-cell level. Ordinary differential equations (ODEs) provide deterministic trajectories for the average quantities of molecules at the population level. The time evolution of the quantities of molecules $\vec{x}$ is described by a system of ordinary differential equations of the type $\frac{d\vec{x}}{dt} = S\vec{f}(\vec{x})$, where $S$ is the stoichiometry matrix of the system and $\vec{f}$ is a vector of fluxes that depends on the current matter quantities. Usually the value of $\vec{f}$ is given by the law of mass action, although other laws may be used (Michaelis-Menten, Droop, ...). When all molecular species, reactions, kinetic laws and their parameters are known, numerical analysis algorithms allow one to compute approximate trajectories of the average quantities of molecules. When the system is either too large or not sufficiently provided with experimental data, an alternative method is to consider the steady-state of the system, where the reactant concentrations are assumed to be constant because their pro-
duction and consumption are balanced. In this case, the fluxes, which depend on matter quantities, are constant and must satisfy the equation $S\vec{f} = \vec{0}$. Based on the information provided by the stoichiometry matrix, constraint-based approaches allow finding the appropriate $\vec{f}$ subject to $S\vec{f} = \vec{0}$ together with additional biological constraints, in the flux balance analysis (FBA) framework [Orth et al., 2010]. Among numerous applications, flux-based methods can be used for model validation and comparison, that is, deciding whether a proposed set of reactions is consistent with the observed data or not. Here the observations consist of measuring some steady-state production rates of output metabolites, that is, chemical species that are not consumed by the reactions. Hence we consider a steady-state where all metabolite quantities are constant, except the output metabolites. As an example, consider the following reaction network:

$$R_1: A \to B + C \qquad\qquad R_2: B \to A + D$$
The production rates $\tau_C$ and $\tau_D$ of the output metabolites C and D can be derived from the fluxes, which satisfy:

$$\frac{d\vec{x}}{dt} = S^{\top}\vec{f}(\vec{x}) = S^{\top}\vec{f} = \begin{pmatrix} 0 \\ 0 \\ \tau_C \\ \tau_D \end{pmatrix}. \qquad (1)$$
Experimental observations on $\tau_C$ and $\tau_D$ can be encoded into a logical formula of the type $\varphi = (\tau_C > 2\tau_D)$. Using equation (1), one can determine if $\varphi$ is compatible with a given flux $\vec{f}$. This can be logically formalized as: a flux $\vec{f}$ is a model of $\varphi$ ($\vec{f} \models \varphi$) if it satisfies both equation (1) and $\varphi$. If there is no value of $\vec{f}$ such that $\vec{f} \models \varphi$, in other words if $\varphi$ is not satisfiable, then the reaction network can be rejected based on the data at hand. In the proposed example, it can be quickly checked that $\varphi = (\tau_C > 2\tau_D)$ is not satisfiable, so the data would refute the proposed reaction network. This type of reasoning is very useful for biologists, who can eliminate modeling hypotheses based only on output slope measurements and by checking the satisfiability of a formula. Notice that no information about kinetic laws or parameters has been used. However, there exist some situations where constraints on steady-state fluxes are not sufficient to discriminate models.
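As an illustration of this flux-based reasoning, the sketch below (ours, not part of the paper; it uses scipy and approximates the strict inequality with a small margin eps of our choosing) tests the satisfiability of $\varphi = (\tau_C > 2\tau_D)$ for the example network as a linear feasibility problem.

```python
import numpy as np
from scipy.optimize import linprog

# Rows = reactions, columns = species (A, B, C, D) for A -> B + C and B -> A + D.
S = np.array([[-1, 1, 1, 0],
              [ 1, -1, 0, 1]], dtype=float)

# Steady state for the internal species A and B: the corresponding rows of S^T f vanish.
A_eq = S.T[:2]                   # rows for A and B
b_eq = np.zeros(2)

# tau_C = (S^T f)[C] = f1 and tau_D = (S^T f)[D] = f2.
# phi: tau_C > 2 tau_D, approximated as f1 - 2 f2 >= eps with eps > 0.
eps = 1e-6
A_ub = np.array([[-1.0, 2.0]])   # -(f1 - 2 f2) <= -eps
b_ub = np.array([-eps])

res = linprog(c=[0, 0], A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, None), (0, None)])
print("phi satisfiable:", res.success)   # expected: False -> reject the network
```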
As a toy illustrative example, let us consider the two following reaction models:

$$\text{(model 1)} \quad \begin{cases} R_1: A \to B + C \\ R_2: B \to A + 2D \end{cases} \qquad (2)$$

$$\text{(model 2)} \quad \begin{cases} R_1: A \to B + C + 2D \\ R_2: B \to A \end{cases} \qquad (3)$$
A flux-based approach indicates that in both models, every flux $\vec{f} = (f_1, f_2)$ with $f_1 = f_2$ satisfies the balance constraints and that the accumulation rate of C equals $f_1$ whereas the accumulation rate of D equals $2f_1$. Thus both systems are equivalent from the point of view of flux approaches based on the average production rates of C and D. However, the steady-states of these models actually can be distinguished from each other. Intuitively, in model 1 the quantities of C and D should be negatively correlated, while they should be positively correlated in model 2. To formalize this intuition, we need to focus on the single-cell level, where stochastic fluctuations exist. This can be seen in Fig. 1, where individual trajectories of the system at the stochastic level for both models are depicted. It appears that the mean and variance quantities do not allow distinguishing models 1 and 2, whereas the covariance line (dark line) is clearly distinct between models 1 and 2. Therefore, probabilistic modeling associated with stochastic data can be relevant to assist in the design of reaction networks. Moreover, the importance of dynamical stochastic modeling is continuously growing [Wilkinson, 2009] as biology intrinsically exhibits stochastic behaviors [McAdams and Arkin, 1997; Arkin et al., 1998] and techniques of single-cell observation are improving. The reference method in stochastic modeling is to use the Gillespie stochastic simulation algorithm [Gillespie, 1976; 2007], which generates stochastic trajectories of a reaction network. The distribution of the sampled trajectories is the solution to the Chemical Master Equation, which is the probabilistic equivalent of the law of mass action. However, using the Gillespie algorithm is computationally intensive and requires the knowledge of all reaction kinetic parameters, as for the differential methods. Consequently, we develop in this work a logic which focuses on the stochastic steady-state properties and does not require information about the kinetic parameters.

Objective. The aim of this article is to define a logic that makes it possible to express the steady-state stochastic properties (means, variances, covariances) of the outputs, instead of only average production rates. In doing so, we would increase the rejection power of the flux-based method by taking into account the fluctuations of individual cells. In section 2 we define the syntax and the semantics of the logic; in particular we define the satisfiability of formulas. In section 3 we demonstrate that deciding satisfiability is NP-hard and we propose an algorithm to decide it. In the last section, section 4, we apply these results to the introductory example.
2 Syntax and Semantics
Reaction networks. We consider systems of chemical reactions known as reaction networks. A reaction network consists of $n$ molecular species $X_1, \ldots, X_n$ that are involved in $m$ chemical reactions $R_i : a_{i,1}X_1 + \cdots + a_{i,n}X_n \to b_{i,1}X_1 + \cdots + b_{i,n}X_n$ ($1 \leq i \leq m$). The parameters $a_{i,j}, b_{i,j} \in \mathbb{N}$ are the stoichiometry coefficients of the reaction network. The number $a_{i,j}$ represents the quantity of $X_j$ molecules consumed by the reaction $R_i$ and the number $b_{i,j}$ represents the quantity of $X_j$ molecules produced by the reaction $R_i$. The global effect of the reactions on the molecular quantities is often summarized by the stoichiometry matrix $S = (s_{i,j})_{1 \leq i \leq m,\ 1 \leq j \leq n}$ where $s_{i,j} = b_{i,j} - a_{i,j}$. In our notation, each row of the stoichiometry matrix represents the effect of one reaction on the quantity of molecules. In this article we consider discrete-time dynamics of reaction networks. We denote by $(\vec{x}_k)_{k \in \mathbb{N}}$ the discrete-time stochastic process describing the number of molecules of each chemical species at time $k$. For instance, $(\vec{x}_k)_{k \in \mathbb{N}}$ may be generated by a discrete-time version of the Gillespie algorithm [Sandmann, 2008].
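For concreteness, the following sketch (our illustration; the fixed reaction-probability vector, the initial quantities and the step count are assumptions, and no guard against negative counts is included) shows one way a discrete-time trajectory $(\vec{x}_k)$ can be sampled once a stoichiometry matrix and reaction probabilities are given.

```python
import numpy as np

def simulate(S, x0, p, steps, rng=None):
    """Return the trajectory (x_k) for k = 0..steps: at each step one reaction
    is drawn according to p and its net stoichiometry row is applied."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        i = rng.choice(len(p), p=p)   # which reaction fires
        x = x + S[i]                  # apply its net stoichiometry
        traj.append(x.copy())
    return np.array(traj)

# Model 1 of the toy example: R1: A -> B + C, R2: B -> A + 2D; species order (A, B, C, D).
S1 = np.array([[-1, 1, 1, 0],
               [ 1, -1, 0, 2]], dtype=float)
traj = simulate(S1, x0=[1000, 1000, 0, 0], p=[0.5, 0.5], steps=100)
print(traj[-1])
```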
2.1 Syntax
Definition of terms. We want to formally define the syntax of formulas describing some asymptotic properties of $(\vec{x}_k)_{k \in \mathbb{N}}$. More precisely, we want to asymptotically compare polynomial expressions involving the first and second moments of $(\vec{x}_k)$. These polynomial expressions are the terms of our logic. We denote by $\mathcal{C} = \{X_1, \ldots, X_n\}$ the non-empty finite set of chemical species symbols. The algebra of terms is defined by structural induction as the least set $\mathcal{T}$ satisfying:
$$\forall X, Y \in \mathcal{C}, \quad \mathrm{Exp}(X) \in \mathcal{T},\ \mathrm{Var}(X) \in \mathcal{T},\ \mathrm{Cov}(X, Y) \in \mathcal{T},$$
$$\forall \lambda \in \mathbb{Q},\ \forall T_1, T_2 \in \mathcal{T}, \quad \lambda \in \mathcal{T},\ (\lambda \cdot T_1) \in \mathcal{T},\ (T_1 + T_2) \in \mathcal{T},\ (T_1 \times T_2) \in \mathcal{T}.$$
For the moment, Exp, Var and Cov are just function symbols; their semantics is defined later.
Example 1. $(\mathrm{Var}(X_1) + \mathrm{Cov}(X_3, X_4))$ and $((3 \cdot \mathrm{Exp}(X_1)) \times \mathrm{Var}(X_2))$ are terms.

Definition of formulas. We are now able to define the syntax of the formulas, which are used to compare two terms, that is, two polynomial expressions involving the first and second moments of $(\vec{x}_k)_k$. In order to provide a simple definition, the only atomic formulas we introduce are the comparisons with 0:
$$\text{atomic formulas}: \quad AF = \{(T \geq 0)\ /\ T \in \mathcal{T}\}.$$
The formulas are atomic propositions connected with the classical logical operators. Formally, the set of formulas is defined by structural induction as the least set $\mathcal{F}$ satisfying: $AF \subset \mathcal{F}$ and $\forall F_1, F_2 \in \mathcal{F}$, $\neg F_1 \in \mathcal{F}$, $(F_1 \vee F_2) \in \mathcal{F}$, $(F_1 \wedge F_2) \in \mathcal{F}$. The atomic formulas $AF$ and these three logical operators are sufficient to write the usual comparisons and logical operators, which we introduce as notations: $\forall T_1, T_2 \in \mathcal{T}$, $\forall F_1, F_2 \in \mathcal{F}$, $(T > 0) \equiv \neg((-1 \cdot T) \geq 0)$, $(T_1 \geq T_2) \equiv ((T_1 + (-1 \cdot T_2)) \geq 0)$, $(T_1 > T_2) \equiv ((T_1 + (-1 \cdot T_2)) > 0)$, $(F_1 \to F_2) \equiv (\neg F_1 \vee F_2)$ and $(T_1 = T_2) \equiv ((T_1 \geq T_2) \wedge (T_2 \geq T_1))$.
Example 2. $\mathrm{Exp}(X_1) \geq (3 \cdot \mathrm{Exp}(X_2))$ is a formula.
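To make the syntax concrete, here is a small sketch (ours, with constructor names of our own choosing; it is not part of the paper) of how terms and formulas could be represented in code. It simply builds Example 2 as a nested structure using the derived comparison defined above.

```python
# Term constructors
def Exp(x): return ("Exp", x)
def Var(x): return ("Var", x)
def Cov(x, y): return ("Cov", x, y)
def Scale(c, t): return ("scale", c, t)
def Add(t1, t2): return ("add", t1, t2)
def Mul(t1, t2): return ("mul", t1, t2)

# Atomic formula (T >= 0) and logical connectives
def Geq0(t): return ("geq0", t)
def Not(f): return ("not", f)
def Or(f1, f2): return ("or", f1, f2)
def And(f1, f2): return ("and", f1, f2)

# Derived comparison T1 >= T2  ==  (T1 + (-1 * T2)) >= 0, as in the text.
def Geq(t1, t2): return Geq0(Add(t1, Scale(-1, t2)))

# Example 2 of the paper: Exp(X1) >= (3 * Exp(X2))
phi = Geq(Exp("X1"), Scale(3, Exp("X2")))
print(phi)
```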
Figure 1: Differential and stochastic dynamics of model 1 (first row) and model 2 (second row). Red (resp. Blue) plots refer to quantities of C (resp. D). The first column depicts the solution to the differential equations derived from the law of mass actions. The second column depicts 50 runs of a stochastic simulation of the models (kinetic parameters equal to 1, 1000A and 1000B initially). The third column depicts the estimated mean, variance and covariance estimated from the simulations depicted in second column.
2.2 Semantics
Approximated moments in steady-state. We want to define a relevant semantics of terms, that is, a semantics corresponding to the moments (means, variances, covariances) of a stochastic process $(\vec{x}_k)_{k \in \mathbb{N}}$ that has a biologically correct distribution. We rely on a central limit theorem obtained in [Picard et al., 2014] when considering steady-state regime approximations:
$$\frac{1}{\sqrt{k}}\left(\vec{x}_k - \left(\vec{x}_0 + kS^{\top}\bar{p}\right)\right) \xrightarrow[k \to +\infty]{D} \mathcal{N}\!\left(\vec{0},\, W(S, \bar{p})\right), \qquad (4)$$
where $W(S, \bar{p}) = S^{\top}\left(\mathrm{diag}(\bar{p}) - \bar{p}\bar{p}^{\top}\right)S$ and $k$ is the discrete-time variable. Here $\bar{p}$ is an $m$-dimensional probability vector named the reaction probability vector, which represents the probabilities of triggering each reaction during the steady-state regime (that is, when the distributions of reactants are stabilized). We denote by $P_m$ the set of $m$-dimensional probability vectors, that is, vectors $\vec{u} \in \mathbb{R}^m$ satisfying $\forall i \in \{1, \ldots, m\},\ 0 \leq u_i \leq 1$ and $\sum_{i=1}^{m} u_i = 1$. Therefore $\bar{p} \in P_m$. Equation (4) provides us with asymptotic equivalents of the moments when $k \to \infty$:
$$\mathbb{E}[x_{a,k}] \sim_k x_{a,0} + k\sum_{j=1}^{m} s_{ja}\bar{p}_j, \qquad (5)$$
$$\mathbb{V}[x_{a,k}] \sim_k k\sum_{j=1}^{m} s_{ja}^2\,\bar{p}_j - k\sum_{1 \leq j,l \leq m} s_{ja}s_{la}\,\bar{p}_j\bar{p}_l, \qquad (6)$$
$$\mathrm{Cov}(x_{a,k}, x_{b,k}) \sim_k k\sum_{j=1}^{m} s_{ja}s_{jb}\,\bar{p}_j - k\sum_{1 \leq j,l \leq m} s_{ja}s_{lb}\,\bar{p}_j\bar{p}_l, \qquad (7)$$
where $u_k \sim_k v_k$ means the mathematical asymptotic equivalence of sequences, that is, $u_k = v_k + o(v_k)$. Therefore the approximated first and second moments of $\vec{x}_k$ can be obtained when knowing the triplet $(S, \vec{x}_0, \bar{p})$. This motivates the following definition for the possible models of the formulas.
Definition 1. A context is a pair $C = (S, \vec{x}_0)$ where $\vec{x}_0 \in \mathbb{Q}^n$
represents the initial quantities at the start of the steady-state regime and $S$ is an $m \times n$ stoichiometry matrix. An interpretation is a triplet $I = (S, \vec{x}_0, \vec{p})$, where $(S, \vec{x}_0)$ is a context and $\vec{p} \in P_m$ is a vector of reaction probabilities.
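The sketch below (ours, not the authors' implementation; the context and probabilities for model 1 of the toy example are taken from the introduction) computes the asymptotic moment approximations (5)-(7) for a given interpretation, using their equivalent matrix form: the mean grows as $\vec{x}_0 + kS^{\top}\vec{p}$ and the covariance as $kS^{\top}(\mathrm{diag}(\vec{p}) - \vec{p}\vec{p}^{\top})S$.

```python
import numpy as np

def approx_moments(S, x0, p, k):
    """Approximate mean vector and covariance matrix of x_k for interpretation (S, x0, p)."""
    S, x0, p = map(np.asarray, (S, x0, p))
    mean = x0 + k * S.T @ p
    cov = k * (S.T @ (np.diag(p) - np.outer(p, p)) @ S)
    return mean, cov

# Model 1 of the toy example: R1: A -> B + C, R2: B -> A + 2D, with p = (1/2, 1/2).
S1 = np.array([[-1, 1, 1, 0],
               [ 1, -1, 0, 2]], dtype=float)
mean, cov = approx_moments(S1, x0=[1000, 1000, 0, 0], p=[0.5, 0.5], k=100)
print(mean)       # approximate expectations of (A, B, C, D) at step k
print(cov[2, 3])  # approximate Cov(C, D); negative for model 1, as the intuition suggests
```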
Evaluation of terms. When a context is given, the terms can be evaluated as multivariate polynomials with variables corresponding to the time $k$ and the reaction probabilities $\vec{p} = (p_i)_{1 \leq i \leq m}$. The evaluation of the leaves when $\vec{p} := \bar{p}$ corresponds to the $\mathbb{R}[k]$ polynomial asymptotic expressions given in (5), (6) and (7).

Definition 2 (Evaluation of terms). The evaluation $[T]_C$ of the term $T$ in the context $C = (S, \vec{x}_0)$ is the polynomial in $\mathbb{Q}[k, p_1, \ldots, p_m]$ defined by structural induction as
$$[\mathrm{Exp}(X_a)]_C = x_{a,0} + k\sum_{j=1}^{m} s_{ja}p_j,$$
$$[\mathrm{Var}(X_a)]_C = k\left(\sum_{j=1}^{m} s_{ja}^2 p_j - \sum_{1 \leq j,l \leq m} s_{ja}s_{la}p_j p_l\right),$$
$$[\mathrm{Cov}(X_a, X_b)]_C = k\left(\sum_{j=1}^{m} s_{ja}s_{jb}p_j - \sum_{1 \leq j,l \leq m} s_{ja}s_{lb}p_j p_l\right),$$
$[c]_C = c$ when $c$ is a constant, $[(\lambda \cdot T)]_C = \lambda \cdot [T]_C$, $[(T_1 + T_2)]_C = [T_1]_C + [T_2]_C$, $[(T_1 \times T_2)]_C = [T_1]_C\,[T_2]_C$.
The following proposition, stating that $[T]_C$ corresponds to the above asymptotic approximation of $\vec{x}_k$ when evaluated with $\vec{p} = \bar{p}$, justifies the definition of the semantics of terms.

Proposition 1. Consider a reaction network with stoichiometry matrix $S$, initial state $\vec{x}_0$ and steady-state reaction probability vector $\bar{p}$, and a term $T$ (i.e. a polynomial expression of the first and second moments). We denote by $u_k$ the natural mathematical interpretation of $T$ as a polynomial of the expectancies, variances and covariances of $\vec{x}_k$; then $[T]_C(k, \bar{p}) \sim_k u_k$ when $k \to \infty$.
Proof. (sketch) The proof is done by structural induction on $T$.

Evaluation and models of formulas.
Definition 3 (Evaluation of formulas). The evaluation $[F]_C$ of the formula $F$ in the context $C = (S, \vec{x}_0)$ is the subset of $P_m$ defined by structural induction as
$$[(T \geq 0)]_C = \{\vec{p} \in P_m : \mathrm{dom}_k([T]_C) \geq 0\}, \qquad (8)$$
where $\mathrm{dom}_k(P) \in \mathbb{Q}[p_1, \ldots, p_m]$ is the dominant coefficient in $k$ of the polynomial $P \in \mathbb{Q}[k, p_1, \ldots, p_m]$, and $[\neg F]_C = P_m \setminus [F]_C$, $[(F_1 \vee F_2)]_C = [F_1]_C \cup [F_2]_C$, $[(F_1 \wedge F_2)]_C = [F_1]_C \cap [F_2]_C$.
Therefore, the evaluation of an atomic formula $(T \geq 0)$ is the subset of probability vectors $\vec{p} \in P_m$ such that $k \mapsto [T]_C(k, \vec{p}) \in \mathbb{Q}[k]$ is asymptotically non-negative (since the asymptotic behavior of a polynomial is given by its monomial of highest degree).

Using this definition we are now able to define what the models of a formula are. An interpretation $I = (S, \vec{x}_0, \vec{p})$ is a model of a formula $F$, noted
$$I \models F, \quad \text{if } \vec{p} \in [F]_{(S, \vec{x}_0)}. \qquad (9)$$
A formula $F$ is valid, noted $\models F$, if every interpretation is a model. A formula $F$ is valid in a context $C = (S, \vec{x}_0)$, noted $C \models F$, if $\forall \vec{p} \in P_m$, $(S, \vec{x}_0, \vec{p}) \models F$. A formula $F$ is satisfiable in a context $C = (S, \vec{x}_0)$ if there exists $\vec{p} \in P_m$ such that $(S, \vec{x}_0, \vec{p}) \models F$. It follows that the models of an atomic formula are triplets $(S, \vec{x}_0, \vec{p})$ such that the comparison is satisfied in the sense of the next proposition.

Proposition 2. The interpretation $I = (S, \vec{x}_0, \vec{p})$ is a model of $F = (T \geq 0)$ if and only if
$$\exists K \in \mathbb{N},\ \forall k \geq K, \quad [T]_{(S, \vec{x}_0)}(k, \vec{p}) \geq 0. \qquad (10)$$
Therefore, considering Proposition 1, saying that an interpretation $I = (C, \vec{p})$ is a model of a comparison means that the comparison between the two polynomial expressions of moments is ultimately true in the framework of the steady-state approximation when $\vec{p} = \bar{p}$.
Proof. Denote $f(x) = [T]_C(x, \vec{p})$ for $x \in \mathbb{R}$. The function $f$ is a polynomial in $\mathbb{Q}[x]$, so it has only a limited number of possible asymptotic behaviors: either $f$ is constant, or $\lim_{+\infty} f = +\infty$, or $\lim_{+\infty} f = -\infty$. If $I \models F$, then by definition $\mathrm{dom}_k([T]_C) \geq 0$, meaning that either $f$ is a non-negative constant or $f$ is non-constant with positive dominant coefficient. In both cases (10) holds. Conversely, if (10) holds then either $f$ is constant or $\lim_{+\infty} f = +\infty$, so $\mathrm{dom}_k([T]_C) \geq 0$, so $I \models F$.
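As an illustration of Proposition 2 and of the role of $\mathrm{dom}_k$, the following sketch (ours; it uses sympy, and the term $[\mathrm{Cov}(C,D)]_{C_1} = -2kp_1p_2$ is taken from the hand-derived constraints of the example section for model 1) checks an atomic comparison at a candidate probability vector by inspecting the sign of the dominant coefficient in $k$.

```python
import sympy as sp

k, p1, p2 = sp.symbols("k p1 p2")

# [Cov(C, D)]_C for model 1 of the toy example, already simplified by hand.
T_eval = -2 * k * p1 * p2

def satisfies(T_poly, point):
    """Test the atomic formula (T >= 0) at a candidate p: dominant coefficient in k >= 0."""
    poly_k = sp.Poly(T_poly.subs(point), k)
    coeffs = poly_k.coeffs()
    dom = coeffs[0] if coeffs else 0       # leading coefficient (0 for the zero polynomial)
    return sp.simplify(dom) >= 0

print(satisfies(T_eval, {p1: sp.Rational(1, 2), p2: sp.Rational(1, 2)}))
# False: (Cov(C, D) >= 0) is not modeled at p = (1/2, 1/2); its negation (Cov(C, D) < 0) is.
```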
In our definitions of models, we pay attention to distinguishing between valid formulas, which are always true (for instance $(7 \geq 5)$ or $(\mathrm{Exp}(X_1) + 2 \geq \mathrm{Exp}(X_1))$), and formulas valid in a context, that is, properties whose validity is a consequence of the topology and the stoichiometry of the considered reaction network. Valid properties in a context correspond to asymptotic properties that are true for all steady-state reaction probability vectors. Hence a reaction network can exhibit an asymptotic behavior $F$ without having $C \models F$. However, if such an asymptotic property is observed in a presumed steady-state then the formula $F$ must be satisfiable in the considered context. This last remark is very important because it allows one to reject a context $(S, \vec{x}_0)$, and especially to reject $S$, if a formula $F$ coding for experimental observations of a presumed steady-state is not satisfiable in the considered context.
3 Deciding Satisfiability and Validity
We have defined the validity and satisfiability of a formula in a context in the previous section. The next step is to design algorithms for determining the validity or satisfiability of a given formula. The following lemma shows that both notions are in close relationship, so we can focus on the satisfiability problem.
Lemma 1.
• A formula $F$ is valid in the context $C$ if and only if $[F]_C = P_m$.
• A formula $F$ is satisfiable in the context $C$ if and only if $[F]_C \neq \emptyset$.
• In the context $C$, a formula $F$ is valid (resp. satisfiable) if and only if $\neg F$ is not satisfiable (resp. not valid).
Theoretical Complexity. We now demonstrate that the satisfiability problem is NP-hard by using a reduction from 3-SAT.
Proposition 3. The following F-SAT satisfiability problem is NP-hard.
The F-SAT problem. Instance: $n$ (number of chemical species), $m$ (number of reactions), $S$ (an $n \times m$ stoichiometry matrix), $\vec{x}_0$ (initial quantities), $F$ (a formula). Question: Is there $\vec{p} \in P_m$ such that $(S, \vec{x}_0, \vec{p}) \models F$?
Therefore, there is no algorithm that can verify, for an arbitrary reaction network, the satisfiability (or the validity, due to Lemma 1) of a formula in polynomial time unless P = NP.
Algorithm 1: Deciding F-SAT
Data: A context $C = (S, \vec{x}_0)$, a formula $F$
Result: $\vec{p}$ such that $(C, \vec{p}) \models F$, or UNSAT
Step 1: Convert $F$ into disjunctive normal form (DNF):
$$F = \bigvee_{u=1}^{r} F_u = \bigvee_{u=1}^{r} \left(G_u^1 \wedge \cdots \wedge G_u^{n_u}\right)$$
for $u = 1$ to $r$ do
    Try Step 2: find $\vec{p} \in [F_u]_{(S,\vec{x}_0)}$;
    if $\vec{p}$ is found then return $\vec{p}$;
end
return UNSAT;
Proof. The proof is obtained by polynomial time reduction from 3-SAT.
3-SAT decision problem [Garey and Johnson, 2002]. Instance: $n$ (number of variables), a propositional formula in conjunctive normal form (CNF) $\varphi = \bigwedge_{i=1}^{r}(l_i^1 \vee l_i^2 \vee l_i^3)$, where the $l_i^{1/2/3}$ are literals. Question: Is there a valuation satisfying $\varphi$?
We provide a polynomial time reduction from 3-SAT. Consider a propositional formula in CNF $\varphi = \bigwedge_{i=1}^{r}(l_i^1 \vee l_i^2 \vee l_i^3)$ and denote by $\{x_1, \ldots, x_n\}$ the $n$ variables of $\varphi$. From $\varphi$ we build an associated formula of our logic $F = \bigwedge_{i=1}^{r}(G_i^1 \vee G_i^2 \vee G_i^3)$ where $G_i^q = (\mathrm{Exp}(X_k) > 0)$ if $l_i^q = x_k$ and $G_i^q = \neg(\mathrm{Exp}(X_k) > 0)$ if $l_i^q = \neg x_k$. Then we prove that $\varphi$ is satisfiable if and only if $F$ is satisfiable.
• Let $v : \{x_1, \ldots, x_n\} \to \{\top, \bot\}$ be a valuation satisfying $\varphi$. We consider the following reaction network with $n$ chemical species and $n+1$ reactions $\{R_0: \emptyset \to \emptyset,\ R_i: \emptyset \to X_i\ (i = 1 \ldots n)\}$ with stoichiometry matrix $S$. Then we define the reaction probabilities as $p_k = 1/n$ if $v(x_k) = \top$, $p_k = 0$ if $v(x_k) = \bot$, and $p_0 = 1 - \sum_{k=1}^{n} p_k$. We also set the initial conditions at zero, $\vec{x}_0 = 0$. Let us consider the interpretation $I = (S, \vec{x}_0, \vec{p})$. Then $I \models (\mathrm{Exp}(X_k) > 0) \Leftrightarrow p_k > 0 \Leftrightarrow v(x_k) = \top$ and $I \models \neg(\mathrm{Exp}(X_k) > 0) \Leftrightarrow p_k \leq 0 \Leftrightarrow v(x_k) = \bot$. Consequently $I \models F$.
• Conversely, if $I = (S, \vec{x}_0, \vec{p}) \models F$ then we define a valuation as $v(x_k) = \top$ if $I \models (\mathrm{Exp}(X_k) > 0)$ and $v(x_k) = \bot$ if $I \models \neg(\mathrm{Exp}(X_k) > 0)$. By definition of $\models$ it follows that $v$ satisfies $\varphi$.

An algorithm for deciding F-SAT. We have proven that deciding the satisfiability of a formula is NP-hard; nevertheless it is still interesting to design an algorithm for deciding this problem. Indeed, it is possible to find algorithms that are fast in practice but slow for a few specific reaction networks or formulas. We propose Algorithm 1 for determining $\exists?\ \vec{p} \in P_m, (S, \vec{x}_0, \vec{p}) \models F$. In step 2, finding $\vec{p} \in [(T_u^q \geq 0)]_C$ (resp. $[\neg(T_u^q \geq 0)]_C$) corresponds to finding a solution $\vec{p}$ such that $(\mathrm{dom}_k[T]_{(S,\vec{x}_0)})(\vec{p}) \geq 0$ (resp. $< 0$), that is, finding a solution to a polynomial inequality. Therefore, step 2 consists of finding a solution to a set of $n_u$ polynomial constraints. Finding such a solution is decidable [Tarski, 1951] and can be performed by state-of-the-art model checking tools such as the SMT solver dReal [Gao et al., 2013]. The algorithm has two sources of complexity. First, converting $F$ to DNF in step 1 can be computationally intensive since the size of the DNF can be exponential in the size of $F$, and thus $r$ may be large. This should not be a significant problem in usual cases since the formula $F$ is not complex. Second, solving step 2, that is, finding a solution to a system of polynomial constraints, can be computationally intensive.

Terms without multiplications lead to quadratic constraints. We have proposed a procedure for determining the satisfiability of a formula $F$ based on solving sets of polynomial inequalities. Since general systems of polynomial inequalities can be difficult to solve, we propose to consider a logical fragment of $\mathcal{F}$ by restricting terms to linear mathematical expressions of moments. Formally, we define $\mathcal{T}_{lin} \subset \mathcal{T}$ in the same way as $\mathcal{T}$ but by removing the last induction rule: $\forall T_1, T_2 \in \mathcal{T}, (T_1 \times T_2) \in \mathcal{T}$. We then define $\mathcal{F}_{lin} \subset \mathcal{F}$ with the same induction rules but using the terms in $\mathcal{T}_{lin}$.

Proposition 4. Consider a context $C$ and a term $T \in \mathcal{T}_{lin}$; then finding $\vec{p} \in P_m$, $\vec{p} \in [(T \geq 0)]_C$ can be done by solving a numerical quadratic inequation in the variables $(p_i)$.
Proof. (sketch) The proof consists in demonstrating by structural induction on the terms that, for all $T \in \mathcal{T}_{lin}$, the total degree in the variables $(p_i)$ of $[T]_C$ is at most two. Indeed, the $\mathrm{Exp}(\cdot)$ leaves are polynomials of degree at most 1 in $(p_i)$, and the $\mathrm{Var}(\cdot)$ and $\mathrm{Cov}(\cdot,\cdot)$ leaves are polynomials of degree at most two. Then, summing terms and multiplication by a scalar do not increase the degree.

As multiplication of terms is not used in the proof of Proposition 3, the satisfaction problem for $\mathcal{F}_{lin}$ is still NP-hard. However, using the algorithm described in the previous section may be simpler as the involved systems of constraints are quadratic. For instance, the constraints correspond to a second-order cone programming (SOCP) [Alizadeh and Goldfarb, 2003] problem which can be solved by interior point methods in tools such as CPLEX and Gurobi.
4 Example
Let us go back to the example of the introduction, which was not possible to solve using classical flux-based analysis. We consider that the biological experimental data are given by the first row of Figure 1. Also, as a consequence of the assumed steady-state, we consider that A and B are balanced, so their quantities do not change on average. Thus the data are encoded into the formula
$$F = (\mathrm{Exp}(A) = 1000) \wedge (\mathrm{Exp}(B) = 1000) \wedge (\mathrm{Exp}(D) \geq 2 \cdot \mathrm{Exp}(C)) \wedge (\mathrm{Cov}(C, D) < 0). \qquad (11)$$
Now we want to discriminate between the two reaction models, hence we introduce the two contexts $C_1 = (S_1, \vec{x}_0)$ and $C_2 = (S_2, \vec{x}_0)$ associated with each reaction network in order to check the satisfiability of $F$ in both contexts. Here we know the initial conditions $\vec{x}_0 = (1000, 1000, 0, 0)$. $F$ is already in DNF, so we directly derive the corresponding set of polynomial constraints from the semantics of the formulas.

  atomic formula                  context C1            context C2
  (Exp(A) = 1000)                 p2 - p1 = 0           p2 - p1 = 0
  (Exp(B) = 1000)                 p1 - p2 = 0           p1 - p2 = 0
  (Exp(D) >= 2 * Exp(C))          p2 - p1 >= 0          0 >= 0
  (Cov(C, D) < 0)                 -2 p1 p2 < 0          2 p1 (1 - p1) < 0
As expected, the constraints are at most quadratic in $\vec{p}$ since there is no multiplication in the formula $F$. The first system of quadratic constraints admits the (unique) solution $\vec{p} = (1/2, 1/2)$, whereas the second system of constraints has no solution. Consequently, $F$ is satisfiable in the context $C_1$ but not satisfiable in the context $C_2$:
$$\exists \vec{p} \in P_m,\ (C_1, \vec{p}) \models F \qquad\qquad \nexists \vec{p} \in P_m,\ (C_2, \vec{p}) \models F.$$
So, the steady-state properties $F$ cannot be obtained using the second reaction network (model 2), which must be rejected.
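A hedged sketch (ours, not the authors' implementation) of this satisfiability check is shown below. With $\vec{p} = (p_1, 1 - p_1)$ on the probability simplex, the quadratic constraint systems above reduce to a one-dimensional search, so a simple grid scan is enough to illustrate the outcome.

```python
import numpy as np

def satisfiable_C1(p1):
    p2 = 1.0 - p1
    # p2 - p1 = 0, p1 - p2 = 0, p2 - p1 >= 0, -2 p1 p2 < 0
    return np.isclose(p2 - p1, 0) and (p2 - p1 >= -1e-12) and (-2 * p1 * p2 < 0)

def satisfiable_C2(p1):
    p2 = 1.0 - p1
    # p2 - p1 = 0, p1 - p2 = 0, 0 >= 0 (always), 2 p1 (1 - p1) < 0
    return np.isclose(p2 - p1, 0) and (2 * p1 * (1 - p1) < 0)

grid = np.linspace(0.0, 1.0, 10001)
print("C1 satisfiable:", any(satisfiable_C1(x) for x in grid))   # True, at p = (1/2, 1/2)
print("C2 satisfiable:", any(satisfiable_C2(x) for x in grid))   # False -> reject model 2
```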
5 Conclusion
We have introduced a logic whose syntax makes it possible to express properties on the asymptotic first and second moments of the trajectories of a reaction network in steady-state. The semantics of formulas is obtained using a central limit theorem which provides us with analytical expressions of the first and second moments. A model of a formula is a reaction network with initial quantities and a steady-state reaction probability vector such that the corresponding Gaussian asymptotic approximation satisfies the formula. When a formula encoding for experimental data is not satisfiable in a given context, it means that the context, and possibly the stoichiometry matrix, is wrong. Thus, our logic provides a refutation of reaction networks based on the measurements of asymptotic first and second moments of the trajectories. After introducing the logic, we have demonstrated that the F-SAT problem is NP-hard. We provided an algorithm which relies on the DNF conversion and polynomial constraint solving. This opens perspectives for improving the satisfiability test by using efficient constraint-solving tools. Further work will focus on understanding which instances of F-SAT can be solved in reasonable time. This includes a precise study of the practical complexity of the algorithm on various instances, that is, pairs of biological models and datasets, with different sizes of reaction networks and different numbers and types of constraints.
Acknowledgments This work was supported by the French National Research Agency via the investment expenditure program IDEALG (ANR-10-BTBR-02-11). The authors are grateful to Pr. Phillipe Codognet for interesting discussions about solving systems of polynomial constraints.
References [Alizadeh and Goldfarb, 2003] Farid Alizadeh and Donald Goldfarb. Second-order cone programming. Mathematical programming, 95(1):3–51, 2003. [Arkin et al., 1998] Adam Arkin, John Ross, and Harley H McAdams. Stochastic kinetic analysis of developmental pathway bifurcation in phage -infected escherichia coli cells. Genetics, 149(4):1633–1648, 1998. [Gao et al., 2013] Sicun Gao, Soonho Kong, and Edmund M Clarke. dreal: An smt solver for nonlinear theories over the reals. In Automated Deduction–CADE-24, pages 208– 214. Springer, 2013. [Garey and Johnson, 2002] Michael R Garey and David S Johnson. Computers and intractability, volume 29. wh freeman, 2002. [Gillespie, 1976] Daniel T Gillespie. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of computational physics, 22(4):403–434, 1976. [Gillespie, 2007] Daniel T Gillespie. Stochastic simulation of chemical kinetics. Annu. Rev. Phys. Chem., 58:35–55, 2007. [Helms, 2008] Volkhard Helms. Principles of computational cell biology. Wiley, 2008. [McAdams and Arkin, 1997] HarleyH. McAdams and Adam Arkin. Stochastic mechanisms in gene expression. Proceedings of the National Academy of Sciences, 94(3):814–819, 1997. [Orth et al., 2010] Jeffrey D Orth, Ines Thiele, and Bernhard Ø Palsson. What is flux balance analysis? Nature biotechnology, 28(3):245–248, 2010. [Picard et al., 2014] Vincent Picard, Anne Siegel, and J´er´emie Bourdon. Multivariate Normal Approximation for the Stochastic Simulation Algorithm: limit theorem and
applications. In SASB - 5th International Workshop on Static Analysis and Systems Biology, Munchen, Germany, 2014. [Sandmann, 2008] Werner Sandmann. Discrete-time stochastic modeling and simulation of biochemical networks. Computational biology and chemistry, 32(4):292–297, 2008. [Tarski, 1951] Alfred Tarski. A decision method for elementary algebra and geometry. Rand report, 1951. [Wilkinson, 2009] Darren J Wilkinson. Stochastic modelling for quantitative description of heterogeneous biological systems. Nature Reviews Genetics, 10(2):122–133, 2009.
Robust Intervention on Genetic Regulatory Networks Using Symbolic Dynamic Programming Fabio A. C. Tisovec & Leliane N. de Barros & Karina V. Delgado & Carolina Feher da Silva & Ronaldo Fumio Hashimoto USP, Sao Paulo, Brazil
In a genetic regulatory network (GRN), as certain genes begin to be expressed and others cease to be expressed, cellular behavior is altered according to a pattern. Models of GRNs represent how genes involved in a given cellular function interact and how such an interaction determines patterns of evolution of a cell. One of the major objectives of modeling a genetic regulatory network is to design and analyze therapeutic intervention strategies for moving from an undesirable state, i.e. a diseased state, to a healthier one. Intervention in a GRN is a deliberate attempt to change the network dynamics by means of a drug administration. A Boolean Network (BN) is a mathematical model commonly used to represent a subset of genetic regulatory networks where the genes are represented by boolean variables and the system progresses from one instant of time to the next based on boolean functions [Kauffman, 1969] [Kim et al., 2002]. Within a BN we also make the assumption of determinism, i.e. it is possible to determine precisely what the resulting state of a biological system is after any number of state transitions. Since this is not a realistic scenario for GRNs, Probabilistic Boolean Networks (PBNs) were proposed, which assume: (i) the existence of non-determinism in the system behavior and (ii) a dynamics based on a set of probabilistic transition functions [Shmulevich et al., 2002]. Within PBNs, we can define different behaviors assuming different sets of probabilistic transition functions, representing different types of treatments [Datta et al., 2003]. Under certain assumptions, a PBN can be represented by a Markov chain. Research in the area of intervention in genetic networks models this problem as a Controlled Markov Chain problem, i.e., a Markov Decision Process (MDP) problem that can be extracted from a PBN. An MDP is a stochastic process commonly used for sequential decision-making. An MDP solution is a policy (intervention) of actions for every possible state that maximizes the expected intervention reward (or minimizes the expected intervention cost [Datta et al., 2003]). An important issue in solving an MDP is the curse of dimensionality, i.e., the exponential growth of time and memory that the algorithm needs with the number of dimensions of the data. In this case, dimensions correspond to genes and thus, if gene activities are rep-
resented by Boolean values, then the cardinality of the state space is 2^N for N genes. One way to deal with large MDPs is a factored MDP (FMDP), which is an alternative way to represent the intervention problem since it can solve larger problems by exploring the dependency structure of the model. In FMDPs, the structure of the model is exploited by encoding transition probabilities as factored models, such as dynamic Bayesian networks (DBNs), that can be solved with the Symbolic Dynamic Programming algorithm (SDP) [Boutilier et al., 1999] [Tan et al., 2010]. Another issue that can arise when modelling a genetic network is that it is often difficult to obtain precise transition probabilities because of insufficient data [Shmulevich et al., 2002] and of errors in data extraction. In addition, individuals may possess fully or partially functional alleles (copies) of a gene, which results in different transition probabilities within a natural, diverse population. Transition probabilities may also be non-stationary due to incomplete information, as networks are built on selections of small subsets of genes, rather than on the whole genome of the organism. To take this uncertainty into account, the MDP with imprecise transition probabilities (MDP-IP) model can be employed [Satia and Lave Jr., 1970] [Nilim and El Ghaoui, 2005] [Delgado et al., 2011]. The MDP-IP is simply an extension of the MDP where the transition probabilities can be imprecisely specified. Instead of a probability measure over the state space, we have a set of probability measures, referred to as a credal set. Although there are works on intervention on GRNs with imprecise probabilities [Denic et al., 2009] [Pal et al., 2008] [Pal et al., 2009], they only deal with problems of small size. In this work, we use the SPUDD-IP algorithm [Delgado et al., 2011], which is the state-of-the-art method for solving MDP-IPs, and thus it can be used to solve larger instances of the GRN intervention problem defined in a factored form and with imprecise probabilities. We empirically demonstrate how to find robust intervention policies for large GRNs by analysing artificial instances generated according to [Kim et al., 2002]. The results show that the SPUDD-IP algorithm can empower researchers to solve more realistic problems whose sizes are larger than current methods allow.
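To illustrate the kind of robust (pessimistic) Bellman backup that underlies MDP-IP solving, here is a sketch of ours: a flat robust value iteration on a tiny interval MDP. It is not the SPUDD-IP algorithm (which works on factored, decision-diagram representations rather than enumerated states), and the network, rewards and probability intervals below are made up for illustration.

```python
import numpy as np

def worst_case_expectation(lo, up, values):
    """Pick a distribution p with lo <= p <= up and sum(p) = 1 minimizing p . values."""
    p = np.array(lo, dtype=float)
    slack = 1.0 - p.sum()
    for i in np.argsort(values):          # push the remaining mass onto low-value states
        add = min(up[i] - p[i], slack)
        p[i] += add
        slack -= add
    return float(p @ values)

def robust_value_iteration(P_lo, P_up, R, gamma=0.95, iters=200):
    """P_lo/P_up: [n_actions][n_states][n_states] interval bounds; R: [n_states][n_actions]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([[R[s, a] + gamma * worst_case_expectation(P_lo[a][s], P_up[a][s], V)
                       for a in range(n_actions)] for s in range(n_states)])
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

# Hypothetical 2-gene network -> 4 states; action 0 = no intervention, 1 = drug.
# Rewards favour the "healthy" state 0 and penalise intervention slightly.
R = np.array([[1.0, 0.8], [0.0, -0.2], [0.0, -0.2], [-1.0, -1.2]])
P_lo = np.full((2, 4, 4), 0.05)           # lower bounds of the transition intervals
P_up = np.full((2, 4, 4), 0.80)           # upper bounds of the transition intervals
V, policy = robust_value_iteration(P_lo, P_up, R)
print(V, policy)
```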
References
[Boutilier et al., 1999] Craig Boutilier, Steve Hanks, and Thomas Dean. Decision-theoretic Planning: Structural Assumptions and Computational Leverage. JAIR, 11:1–94, 1999.
[Datta et al., 2003] A. Datta, A. Choudhary, M. L. Bittner, and E. R. Dougherty. External control in Markovian genetic regulatory networks. Kluwer Academic Publishers, 56(1-2):169–191, 2003.
[Delgado et al., 2011] Karina Valdivia Delgado, Scott Sanner, and Leliane Nunes de Barros. Efficient Solutions to Factored MDPs with Imprecise Transition Probabilities. Artif. Intell., 175(9-10):1498–1527, 2011.
[Denic et al., 2009] S. Z. Denic, B. Vasic, C. D. Charalambous, and R. Palanivelu. Robust control of uncertain context-sensitive probabilistic Boolean networks. IET Systems Biology, 3(4):279–295, July 2009.
[Kauffman, 1969] S. A. Kauffman. Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology, 22(3):437–467, Mar 1969.
[Kim et al., 2002] Seungchan Kim, Huai Li, Edward R. Dougherty, Nanwei Cao, Yidong Chen, Michael Bittner, and Edward B. Suh. Can Markov chain models mimic biological regulation? Journal of Biological Systems, 10(4):337–357, 2002.
[Nilim and El Ghaoui, 2005] Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Oper. Res., 53(5):780–798, 2005.
[Pal et al., 2008] Ranadip Pal, Aniruddha Datta, and Edward R. Dougherty. Robust intervention in probabilistic Boolean networks. IEEE Transactions on Signal Processing, 56(3):1280–1294, 2008.
[Pal et al., 2009] Ranadip Pal, Aniruddha Datta, and Edward R. Dougherty. Bayesian robustness in the control of gene regulatory networks. IEEE Transactions on Signal Processing, 57(9):3667–3678, 2009.
[Satia and Lave Jr., 1970] Jay K. Satia and Roy E. Lave Jr. Markovian Decision Processes with Uncertain Transition Probabilities. Operations Research, 21:728–740, 1970.
[Shmulevich et al., 2002] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics, 18(2):261–274, Feb 2002.
[Tan et al., 2010] Mehmet Tan, Reda Alhajj, and Faruk Polat. Automated large-scale control of gene regulatory networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 40(2):286–297, 2010.
Identification of High Affinity Ubiquitin Variants: an in Silico Mutagenesis-based Approach
Mina Maleki 1 ([email protected]), Mohammad H. Dezfulian 2 ([email protected]), Luis Rueda 1 ([email protected])
1 School of Computer Science, University of Windsor, ON, Canada
2 Department of Genetics, Harvard Medical School, Boston, MA, USA
Ubiquitin (Ub), a small regulatory protein found in almost all tissues of eukaryotic organisms, is involved in many cellular processes, including cell cycle regulation, DNA repair mechanisms and gene expression. As a consequence, mutations of Ub-mediated proteins within a cell appear to be a fundamental characteristic of the development of different types of diseases, ranging from various cancer types to neurological disorders such as Alzheimer's disease and idiopathic Parkinson's disease [1]. Therefore, assessing the mutational patterns of Ub and the consequent changes in the protein holds great promise as a diagnostic tool, and such patterns could also act as potential drug targets. To discover the effect of mutations on protein stability, both experimental and computational methods can be employed. Although experimental approaches are extremely innovative and elegant for the development of highly potent inhibitors and activators of Ub-mediated proteins, they require specialized instruments and are experimentally demanding and rather expensive. Thus, the in vivo assessment is typically done on a select number of mutant variants, and the subsequent selection of Ub variants relies on carefully designed readouts and extensive experience. In contrast, computational analysis and unsupervised machine learning techniques have allowed researchers to employ more cost-effective methods to discover the amino acid mutations that establish stable protein interactions. To this end, a computational method is proposed to find the smallest number of amino acid mutations in Ub-mediated proteins that make the protein interactions more stable. First, a large library of ubiquitin variants with single-substitution mutations is generated. Then, the stability of every protein variant in the library is estimated using FoldX [2]. FoldX is one of the most popular servers providing reliable, quantitative estimates of binding free energies, and hence of the stability of protein complexes, based on their sequence and structural information. Finally, the protein variants with substitution mutations that improve the affinity of the interactions are identified. Using the same dataset employed in [3], the results demonstrate that the proposed computational method not only finds the mutations reported in [3], but also finds additional mutations with higher stability scores, and does so far more efficiently than experimental methods since it considers and examines a much wider range of mutations. Considering multiple mutations, and verifying that the mutated protein sequence can fold properly within a cell, are directions worth further investigation.
1. K. Haglund, I. Dikic, "Ubiquitin Signaling and Cancer Pathogenesis," in Protein Degradation: The Ubiquitin-Proteasome System and Disease, edited by R. J. Mayer, A. Ciechanover, M. Rechsteiner, John Wiley & Sons, 2008.
2. J. Schymkowitz, J. Borg, F. Stricher et al., "The FoldX web server: an online force field," Nucleic Acids Research, 2005, vol. 33, pp. W382-8.
3. A. Ernst, G. Avvakumov, J. Tong et al., "A strategy for modulation of enzymes in the ubiquitin system," Science, 2013, 339(6119):590-5.
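As a rough illustration of the pipeline described in this abstract, the sketch below enumerates single-substitution variants of a ubiquitin fragment and ranks them by a predicted stability change. The estimate_ddg function is only a placeholder standing in for an external FoldX calculation (its interface is an assumption, not the FoldX API), and the sequence fragment and scores are purely illustrative.

import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_substitution_variants(seq):
    """Enumerate every single-point substitution of the input sequence."""
    for pos, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                yield (pos, wt, aa, seq[:pos] + aa + seq[pos + 1:])

def estimate_ddg(variant_seq):
    """Placeholder for a stability/affinity estimator such as FoldX.
    A real pipeline would run FoldX on the mutant structure and parse the
    predicted binding free-energy change; here a random score is returned
    purely so that the ranking step can be demonstrated."""
    return random.uniform(-2.0, 2.0)

def rank_variants(seq, top_k=10):
    """Score all single-substitution variants and keep the most stabilizing."""
    scored = [(estimate_ddg(var_seq), pos, wt, aa)
              for pos, wt, aa, var_seq in single_substitution_variants(seq)]
    scored.sort()                      # lowest (most stabilizing) ddG first
    return scored[:top_k]

if __name__ == "__main__":
    ub_fragment = "MQIFVKTLTGKTITLEVEPS"   # illustrative N-terminal Ub fragment
    for ddg, pos, wt, aa in rank_variants(ub_fragment, top_k=5):
        print(f"{wt}{pos + 1}{aa}: predicted ddG = {ddg:+.2f}")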