arXiv:1505.05007v2 [stat.ML] 26 May 2015

Modelling-based experiment retrieval: A case study with gene expression clustering

Paul Blomstedt∗, Ritabrata Dutta∗1, Alvis Brazma‡, Sohan Seth∗†, Samuel Kaski∗§

Abstract

Motivation: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case vs. control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well.

Results: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called the product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. We then show empirically that inference for the full probabilistic model can be approximated with good performance with the computationally fast k-means clustering algorithm. The suggested metric for retrieval using clusterings is the normalized information distance. The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method.

Availability: The method can be implemented using only standard k-means and normalized information distance, available in many standard statistical software packages.

∗ Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
† Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, UK
‡ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
§ Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
1 Shares equal contributions with the first author.


1 Introduction

As the use of high-throughput molecular measurement technologies continues to spread, an ever increasing amount of data from biological experiments is being stored in publicly available repositories. It is then often of interest for researchers to retrieve experimental datasets with relevance to a given experiment, in order to increase the power of statistical analyses and to be able to make novel findings not obtainable from one experiment alone. The current standard practice relies on searching for relevant experiments by keyword annotations (e.g. Zhu et al., 2008). However, despite efforts to maintain compliance with standard formats of documenting experiments, e.g. the MIAME standard (Brazma, 2001), information about experiments may often be missing, insufficient or suffer from variations in terminology (e.g. Baumgartner et al., 2007; Schmidberger et al., 2011). In view of the challenges associated with keyword-based retrieval, the complementary task of querying a database of experiments using measurement data, instead of keywords, has recently received increased attention in the literature.

Most earlier content-driven methods used for retrieval of gene expression data represent each experiment in terms of a profile over genes, or alternatively, over known gene sets or gene modules predicted from other data sources; see Hunter et al. (2001); Fujibuchi et al. (2007); Caldas et al. (2009); Engreitz et al. (2010); Georgii et al. (2012) and references therein. A representative example is to compute differential expression profiles of case vs. control, use the correlation between activity profiles as the measure of relevance, and retrieve the experiments with the highest correlations (e.g. Engreitz et al., 2010). This requires auxiliary information about the experiments, namely case and control labels of experiment samples, and possibly additional a priori defined sets of important genes.

Recently, two feasibility studies have gone beyond reducing experiments into single profiles by using probabilistic modelling of the experiments in the database being queried. In Faisal et al. (2014), it was assumed that the query dataset can be explained as a mixture of the learnt models, each model learnt from one dataset, such that the measure of relevance is given by the inferred mixture weights. In a slightly different approach (Seth et al., 2014), experiments were retrieved by evaluating the posterior marginal likelihoods, given the query data, of individual models stored for the experiments in the database.

In this paper, we introduce a method for retrieving full datasets, which is also based on probabilistic modelling. However, instead of using the query dataset itself as the query, we use a model learnt from it. The measure of relevance is therefore not a likelihood, but instead a suitably defined metric between the models. The argument is that for noisy and complex datasets, it is beneficial to extract relevant characteristics of the query dataset in the same way as was done with the datasets that are being queried. We also make explicit the importance of marginalizing out nuisance parameters which are not directly relevant for the retrieval task. For example, in a gene expression study, one is often more interested in how sets of genes are co-regulated, rather than in their exact expression values, which are additionally affected by numerous other influences.

We tackle the specific problem of retrieving gene expression experiments by using a product partition model (Jordan et al., 2007) to cluster together genes that show similar expression patterns across a number of samples. By integrating out expression levels of the gene sets (i.e., cluster-specific information), only the co-expression patterns revealed by the clustering structure are retained. The clustering induced by the query dataset is then finally compared with the clusterings associated with the database experiments using the normalized information distance (Vinh et al., 2010). Notice that, in contrast to Seth et al. (2014), this approach does not involve any “training stage”, and in contrast to Faisal et al. (2014), the retrieval step does not involve solving an optimization problem.

While gene clustering has a long history in characterizing gene expression datasets (Eisen et al., 1999; D'haeseleer, 2005), it appears not to have been used in the context of experiment retrieval before. The use of gene clustering provides a straightforward way of characterizing each experiment with minimal preprocessing of the data while capturing central co-expression patterns. Essentially all previous approaches for retrieving gene expression data have converted the data to differential expression (or gene set enrichments), requiring fixed and known case-control distinctions. In contrast, we have only applied standard quality control and RMA normalization steps carried out in-house at the European Bioinformatics Institute (EBI) for datasets in the Expression Atlas database (see Petryszak et al., 2014). Our experimental evaluation further suggests that, for the current application, inference of the full probabilistic model can be replaced by a computationally faster k-means clustering algorithm. The computational simplicity makes the method highly scalable and easy to apply in a black-box manner as a general-purpose retrieval scheme.

2 Approach

Let $D_q$ denote a data matrix from some experiment of interest, and let $\{D_m\}_{m=1}^M$ be a database of $M$ datasets from previously conducted experiments. The aim is to retrieve datasets from among the $\{D_m\}_{m=1}^M$ with similar characteristics as the query dataset $D_q$. Due to the complex nature of the data, there is no single sensible or obvious way of comparing datasets (matrices of possibly different sizes). We propose using a model to characterize each dataset, with the aim of reducing noise and making relevant aspects of the data more tangible, while making the experiments comparable. The retrieval task then consists in ranking the models $\{\mathcal{M}_m\}_{m=1}^M$, inferred from $\{D_m\}_{m=1}^M$, with respect to their similarity to the query model $\mathcal{M}_q$ inferred from $D_q$. Note that in a broad sense, the commonly used differential expression can be considered one model type, and clustering another.

To elaborate on the above idea further, we will now assume that each of the datasets can be represented in terms of a probabilistic model with density $f$ in some family $\{f(\cdot|\theta) \,|\, \theta \in \Theta\}$. Often, the parameter $\theta$ can be decomposed as $\theta = (\lambda, \psi)$, where $\psi$ is the parameter of interest (e.g., gene clusters) and $\lambda$ is a nuisance parameter (e.g., average expression level of the gene cluster). Marginalizing out (integrating the density over) $\lambda$ then gives an equivalent model family completely determined by $\psi$. In other words, each value $\psi$ is a representation for a model $\mathcal{M}$ in the model space $\mathbb{M}$ under consideration, which we can denote by setting $\mathcal{M} := \psi$. Making the operation of marginalizing out $\lambda$ explicit, the entity used in inferring which model is optimal for representing a dataset $D$ is the marginal likelihood,

$$p(D|\mathcal{M}) = \int_\Lambda f(D|\lambda, \mathcal{M})\,\pi_{\lambda|\mathcal{M}}(\lambda|\mathcal{M})\,d\lambda, \qquad (1)$$

where $\pi_{\lambda|\mathcal{M}}(\cdot|\mathcal{M})$ is a prior density on $\Lambda$. While a fully Bayesian approach would proceed with inferring a posterior distribution for $\mathcal{M}$, we will here, to keep the computational cost of the retrieval task manageable, choose a single element of $\mathbb{M}$ as a representation for $D$, the maximum a posteriori (MAP) solution

$$\tilde{\mathcal{M}} = \arg\max_{\mathcal{M} \in \mathbb{M}} \{p(D|\mathcal{M})\,\pi_{\mathcal{M}}(\mathcal{M})\}, \qquad (2)$$

where $\pi_{\mathcal{M}}$ is a prior density (or probability mass) function on $\mathbb{M}$. If a suitable function $d : \mathbb{M} \times \mathbb{M} \to \mathbb{R}$ can be defined for the pairwise relations between the elements of $\mathbb{M}$, a natural ranking among $\mathcal{M}_1, \ldots, \mathcal{M}_M$ will be induced by evaluating $d(\mathcal{M}_q, \mathcal{M}_m)$ for all $m$. For coherence of the ranking scheme, we will make a further assumption that $d$ is a metric. That is, for all $\mathcal{M}, \mathcal{M}', \mathcal{M}'' \in \mathbb{M}$, we require that

(M1) $d(\mathcal{M}, \mathcal{M}') \geq 0$
(M2) $d(\mathcal{M}, \mathcal{M}') = 0$ if and only if $\mathcal{M} = \mathcal{M}'$
(M3) $d(\mathcal{M}, \mathcal{M}') = d(\mathcal{M}', \mathcal{M})$
(M4) $d(\mathcal{M}, \mathcal{M}'') \leq d(\mathcal{M}, \mathcal{M}') + d(\mathcal{M}', \mathcal{M}'').     (3)

With the above conditions satisfied, the function d conforms to the intuition of a distance, and furthermore, provides a solid foundation for the design of data structures and algorithms as the model space M forms a metric space. Note that the above framework can also be extended to the case where each model M is a full posterior distribution.
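To make the ranking step concrete, the following is a minimal Python sketch of the retrieval loop; the `distance` function and the model objects are placeholders for whichever model family and metric $d$ are chosen, and the sketch is an illustration of the framework rather than the authors' implementation.

```python
def retrieve(query_model, database_models, distance, top=10):
    """Rank stored models by their metric distance to the query model
    and return the indices of the `top` closest experiments (Section 2)."""
    order = sorted(range(len(database_models)),
                   key=lambda m: distance(query_model, database_models[m]))
    return order[:top]
```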

3 Methods

3.1 Probabilistic model for gene clustering

The first task in constructing a retrieval scheme is to choose an appropriate model for the experiments. While several different approaches, with varying aims and assumptions, exist for modelling gene expression data, a particularly simple and frequently used approach is that of gene clustering (e.g. D'haeseleer, 2005), which seeks to cluster together genes that show similar expression patterns across a number of samples.

Consider first a gene expression data matrix $D$ of dimension $n \times p$, where $n$ is the number of genes and $p$ is the number of samples. A clustering $S = \{s_1, \ldots, s_k\}$ is a partition of the set $N = \{1, \ldots, n\}$ into $k$ non-empty and non-overlapping subsets, or clusters, such that $\cup_{c=1}^{k} s_c = N$ and $s_c \cap s_{c'} = \emptyset$ for $c \neq c'$. We focus here on a probabilistic formulation of clustering, which makes explicit use of partition structures, namely the product partition model (PPM). Technically, the PPM assumes that items in the same cluster are exchangeable and items in different clusters are independent (see Jordan et al., 2007). Using the terminology of Section 2, the parameter of interest for this model is the partition structure $S$, while the nuisance parameter is a vector of cluster-specific model parameters, $\lambda = (\lambda_1, \ldots, \lambda_k)$. This leads to a marginal likelihood of the form (see Equation (1))

$$p(D|S) = \int_\Lambda f(D|\lambda, S)\,\pi_{\lambda|S}(\lambda|S)\,d\lambda = \int_\Lambda \prod_{c=1}^{k} f\big(D^{(s_c)}|\lambda_c, s_c\big)\,\pi_{\lambda|S}(\lambda_c|s_c)\,d\lambda = \prod_{c=1}^{k} p\big(D^{(s_c)}|s_c\big), \qquad (4)$$

where $D^{(s_c)}$ denotes the subset of $D$ which is indexed by $s_c$. Note that the assumption of independence between clusters entails constructing the marginal likelihood as a product of cluster-specific components. Likewise, the prior distribution for $S$ will be constructed as a product,

$$P(S) = K \prod_{c=1}^{k} h(s_c), \quad \text{for all } k \in \{1, \ldots, n\}, \qquad (5)$$

where $K$ ensures normalization to 1 over the model space $\mathbb{S}$ and $h(s_c) \geq 0$ for all subsets $s_c$. Note that (5) actually specifies the joint distribution for $S$ and $k$, but since the latter is implied by the former, we omit $k$ from the notation. It can be shown that a PPM with $K$ and $h(s_c)$ chosen such that

$$P(S) = \frac{\eta_0^k \prod_{c=1}^{k} (|s_c| - 1)!}{\prod_{i=1}^{n} (\eta_0 + i - 1)}, \qquad (6)$$

where $|s_c|$ is the number of observations in cluster $s_c$ and $\eta_0 > 0$ controls the tendency to form new clusters, can be obtained by integrating out the model parameters in a Dirichlet process mixture model (Dahl, 2009).

The cluster-specific marginal likelihoods $p\big(D^{(s_c)}|s_c\big)$ in Equation (4) can in principle take any suitable form. Here, we assume that for $D^{(s_c)} = [x_{ij}]$, $i \in s_c$, $j = 1, \ldots, p$, the observations in each sample $j$ are independently generated from $N\big(\mu_{cj}, \tau_{cj}^{-1}\big)$ with a conjugate NormalGamma$(\mu_0, \rho_0, \alpha_0, \beta_0)$ prior on the unknown model parameters. Furthermore, we make the simplistic assumption that the samples themselves are independent, conditional on a cluster assignment (see Hand and Yu, 2001, for a discussion about the implications of this assumption in a classification context). It can then be shown that the resulting cluster-specific marginal likelihoods may be written as

$$p\big(D^{(s_c)}|s_c\big) = \prod_{j=1}^{p} (2\pi)^{-\frac{|s_c|}{2}} \left(\frac{\rho_0}{\rho_j}\right)^{\frac{1}{2}} \frac{\Gamma(\alpha_j)}{\Gamma(\alpha_0)} \frac{\beta_0^{\alpha_0}}{\beta_j^{\alpha_j}}, \qquad (7)$$

where

$$\rho_j = \rho_0 + |s_c|, \quad \alpha_j = \alpha_0 + \frac{|s_c|}{2}, \quad \bar{x}_j = \frac{1}{|s_c|} \sum_{i \in s_c} x_{ij},$$

$$\beta_j = \beta_0 + \frac{1}{2} \sum_{i \in s_c} (x_{ij} - \bar{x}_j)^2 + \frac{|s_c|\,\rho_0\,(\bar{x}_j - \mu_0)^2}{2\rho_j}.$$

Blomstedt et al. (2015) introduced a PPM for clustering mixed discrete and continuous data, where the continuous component was of form (7). Following their implementation, we normalize each column of the data matrix $D = \cup_{c=1}^{k} D^{(s_c)}$ to have zero mean and unit variance, and set the hyperparameter values to $\mu_0 = 0$ and $\rho_0 = \alpha_0 = \beta_0 = 1$. Furthermore, the model is equipped with a prior of the form (6), with $\eta_0 = 1$. Finally, combining Equations (4)–(7), an optimal clustering $\tilde{S}$ w.r.t. a dataset $D$ is given by the MAP solution (see Equation (2))

$$\tilde{S} = \arg\max_{S \in \mathbb{S}} \{p(D|S)\,P(S)\}. \qquad (8)$$
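For illustration, the closed-form quantities above are straightforward to evaluate numerically. The following Python sketch (our own illustration with the stated hyperparameter values as defaults, not the authors' reference code) computes the log posterior score $\log p(D|S) + \log P(S)$ of Equations (4)–(8) for a clustering encoded as an integer label vector:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(D, labels, mu0=0.0, rho0=1.0, alpha0=1.0, beta0=1.0):
    """Log of Equation (4): a sum over clusters of the Normal-Gamma
    marginal likelihoods of Equation (7). D is an (n x p) matrix with
    standardized columns; `labels` assigns each gene (row) to a cluster."""
    total = 0.0
    for c in np.unique(labels):
        X = D[labels == c]                # D^(s_c), shape (|s_c|, p)
        m = X.shape[0]                    # cluster size |s_c|
        rho_j = rho0 + m
        alpha_j = alpha0 + m / 2.0
        xbar = X.mean(axis=0)             # per-sample means, shape (p,)
        beta_j = (beta0
                  + 0.5 * ((X - xbar) ** 2).sum(axis=0)
                  + m * rho0 * (xbar - mu0) ** 2 / (2.0 * rho_j))
        total += np.sum(-0.5 * m * np.log(2.0 * np.pi)
                        + 0.5 * (np.log(rho0) - np.log(rho_j))
                        + gammaln(alpha_j) - gammaln(alpha0)
                        + alpha0 * np.log(beta0) - alpha_j * np.log(beta_j))
    return total

def log_partition_prior(labels, eta0=1.0):
    """Log of the Dirichlet-process partition prior of Equation (6)."""
    _, sizes = np.unique(labels, return_counts=True)
    n, k = sizes.sum(), len(sizes)
    # log[(|s_c| - 1)!] = gammaln(|s_c|); denominator runs over i = 1..n
    return (k * np.log(eta0) + gammaln(sizes).sum()
            - np.log(eta0 + np.arange(n)).sum())
```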

3.2 Inference

Although the space $\mathbb{S}$ of all clusterings of any dataset $D$ is finite, an exhaustive enumeration of the clusterings to evaluate the posterior probabilities $P(S|D) \propto p(D|S)P(S)$ is clearly not feasible in practice. Therefore, to find the optimal clustering $\tilde{S}$ as defined in Equation (8), Blomstedt et al. (2015) implemented a stochastic greedy search algorithm, which moves in the model space by successive application of move, split and merge operators. While being more efficient for the optimization task than standard Markov chain Monte Carlo methods, for large amounts of data the algorithm still requires a considerable amount of computation time. Compared to modelling-based clustering approaches, such as the above, heuristic clustering algorithms are usually much faster but require the number of clusters $k$ to be fixed in advance. As an intermediate between the two clustering paradigms, we provide the following simplified solution, which makes use of the speed of heuristic clustering while avoiding the problem of specifying the number of clusters in advance: first use some heuristic method to find clustering solutions for a range of values $k = k_1, \ldots, k_2$, such that $1 \leq k_1 \leq k_2 \leq n$, and then find $\tilde{S}$ by exhaustive enumeration in the restricted search space. As a special case, setting $k_1 = k_2$ corresponds to using a heuristic method with $k$ fixed. The success of the simplified search will largely depend on the choice of algorithm used to find the candidate clustering solutions; a comparison of a set of algorithms in the context of gene clustering is presented in Section 4.1.
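As a sketch of this simplified search (reusing the scoring functions from the sketch in Section 3.1, with scikit-learn's k-means as one possible heuristic and the range of $k$ used later in Section 4.2 as a default; an illustration under these assumptions, not the authors' code):

```python
import numpy as np
from sklearn.cluster import KMeans

def restricted_map_search(D, k_min=2, k_max=None, seed=0):
    """Simplified search of Section 3.2: generate candidate clusterings
    with k-means for k = k_min, ..., k_max, then return the candidate
    maximizing the posterior score p(D|S)P(S) of Equations (4)-(8)."""
    n = D.shape[0]
    k_max = k_max if k_max is not None else int(np.ceil(np.sqrt(n)))
    best_score, best_labels = -np.inf, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(D)
        score = log_marginal_likelihood(D, labels) + log_partition_prior(labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```

Setting `k_min = k_max` recovers the special case of a heuristic method with $k$ fixed.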

3.3 Distance metric for clusterings

Assuming now that each of the experiments in a database has been represented with a clustering $S \in \mathbb{S}$, the remaining task is to find a function $d$ which can be defined on $\mathbb{S}$ and satisfies conditions (M1)–(M4) of Equation (3). In recent years, a new generation of information-theoretic distance measures has emerged (see e.g. Meilă, 2007; Vinh et al., 2010), which possess many desirable properties, such as the metric property, and which have been employed because of their strong mathematical foundation and ability to detect non-linear similarities. Vinh et al. (2010) conducted a systematic comparison of information-theoretic distance measures, concluding that the preferred “general-purpose” measure for comparing clusterings is the normalized information distance, later denoted $d_{NID}$.

Briefly, for two clusterings $S$ and $S'$, the number of items co-occurring in clusters $s_c \in S$ and $s'_{c'} \in S'$ is given by $n_{cc'} = |s_c \cap s'_{c'}|$, with $\sum_{c=1}^{k}\sum_{c'=1}^{k'} n_{cc'} = n$. Furthermore, the marginal sums are denoted by $n_{c\cdot} = \sum_{c'=1}^{k'} n_{cc'}$ and $n_{\cdot c'} = \sum_{c=1}^{k} n_{cc'}$. A key realization in the derivation of information-theoretic distance measures is that each clustering induces an empirical probability distribution over the set $\{1, \ldots, k\}$, such that the probability of a randomly chosen item $i \in N$ being in cluster $s_c$ is given by $P(i \in s_c) = n_{c\cdot}/n$. Similarly, the joint probability of the pair $(i, j) \in N \times N$ co-occurring in clusters $s_c$ and $s'_{c'}$ is given by $P\big((i, j) \in s_c \times s'_{c'}\big) = n_{cc'}/n$. The entropy of a clustering $S$, describing the uncertainty associated with assigning items into the clusters of $S$, is then formulated as

$$H(S) = -\sum_{c=1}^{k} P(i \in s_c) \log P(i \in s_c).$$

The mutual information of clusterings $S$ and $S'$, which measures how much having knowledge of $S'$ reduces $H(S)$ (or vice versa), is further defined as

$$I(S, S') = \sum_{c=1}^{k}\sum_{c'=1}^{k'} P\big((i, j) \in s_c \times s'_{c'}\big) \log \frac{P\big((i, j) \in s_c \times s'_{c'}\big)}{P(i \in s_c)\,P(j \in s'_{c'})}.$$

It can also be interpreted as a measure of dependence in the sense that if $S$ and $S'$ are independent, then $I(S, S') = 0$. Finally, from the above quantities we obtain $d_{NID}$ as

$$d_{NID}(S, S') = 1 - \frac{I(S, S')}{\max\{H(S), H(S')\}}. \qquad (9)$$
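A minimal NumPy sketch of $d_{NID}$ for two clusterings given as integer label vectors is shown below; library routines such as scikit-learn's `mutual_info_score` could equally be used, but this version builds the contingency table $[n_{cc'}]$ directly to mirror the derivation above.

```python
import numpy as np

def nid(labels_a, labels_b):
    """Normalized information distance of Equation (9) between two
    clusterings of the same n items, given as integer label vectors."""
    n = len(labels_a)
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    cont = np.zeros((len(a_vals), len(b_vals)))   # contingency table n_cc'
    np.add.at(cont, (a_idx, b_idx), 1)
    p_joint = cont / n                            # P((i, j) in s_c x s'_c')
    p_a, p_b = p_joint.sum(axis=1), p_joint.sum(axis=0)
    nz = p_joint > 0
    mi = np.sum(p_joint[nz] * np.log(p_joint[nz] / np.outer(p_a, p_b)[nz]))
    h_a = -np.sum(p_a[p_a > 0] * np.log(p_a[p_a > 0]))
    h_b = -np.sum(p_b[p_b > 0] * np.log(p_b[p_b > 0]))
    h_max = max(h_a, h_b)
    return 1.0 - mi / h_max if h_max > 0 else 0.0
```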

4 Results

4.1 Data and experimental setup

To evaluate the modelling-based retrieval scheme developed in Sections 2 and 3, we used an initial set of 447 gene expression experiments conducted on the


A-AFFY-44 Affymetrix GeneChip and obtained from Expression Atlas, available at http://www.ebi.ac.uk/gxa; see also Petryszak et al. (2014). Values of the experimental factor ontology (EFO; see Malone et al., 2010, available at http://www.ebi.ac.uk/efo/) types “cell type”, “disease” and “organism part” were used as ground truth. Concretely, retrieved samples having the same EFO value were considered relevant, and other samples irrelevant. Note that this ground truth is incomplete, as these EFO types were typically not the main conditions of interest on which differential gene expression was studied in the experiments, but were chosen to give a more general description of the experiments. A more complete ground truth is not readily available, as most other EFO types were only present in small subsets of the experiments. For simplicity, only experiments with the same EFO value present in at least one other experiment were included in this study. Finally, experiments having multiple values for a given EFO type were excluded, resulting in 103 experiments remaining for “cell type”, 76 for “disease” and 174 for “organism part”. The numbers of different EFO values in these sets of experiments were 23, 19 and 32, respectively. Performance was measured using precision and recall, taken as an average of successively using each of the experiments as a query to retrieve among the remaining experiments; a sketch of this evaluation is given at the end of this section.

In order to reduce the number of genes for clustering, we initially selected for each experiment the top 100 genes with the lowest p-values for differential expression (as provided in the Expression Atlas). Finally, by taking the union of these genes over all experiments, we arrived at 537 genes per experiment. We emphasize that this is not an essential part of our approach but done for computational convenience only. In a preliminary stage of our analyses, we experimented with different numbers of genes but found that this only had a minor impact on the results.

Comparison of heuristic clustering methods for reducing the search space. In Section 3.2, a simplified search algorithm to find the optimal clustering $\tilde{S}$ for a dataset $D$ was suggested, based on an exhaustive search among heuristic clustering solutions for a given range of values of $k$, the number of clusters. To this end, the choice of method may be of crucial importance for performance, since the methods produce clusters in different ways; see D'haeseleer (2005) for an overview. While some guidance in this task can be found in the literature (e.g. de Souto et al., 2008; Jaskowiak et al., 2014), the choice will ultimately be problem-specific. To address this question, we initially chose six methods to perform gene clustering and subsequent retrieval with. These were: complete linkage (CL) with four different distance measures (Euclidean, Pearson's correlation, Spearman's correlation and cosine) and k-means with two different distance measures (squared Euclidean and cosine). For all methods, the number of clusters was fixed at $\lceil\sqrt{n}/2\rceil$, where $n$ is the number of genes in each experiment. With $n = 537$ this resulted in 12 clusters for each experiment. In a comparison of retrieval performance (not shown), clusterings obtained using k-means with squared Euclidean distance resulted in the


best performance for all three EFO types, thus being selected as the method of choice in the present setting.
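The leave-one-out evaluation described above can be sketched as follows; `dist` is a hypothetical $M \times M$ matrix of pairwise $d_{NID}$ values and `relevant` the binary ground-truth matrix, and since the exact averaging is not spelled out here, averaging precision and recall at each rank over queries is used as a simplification.

```python
import numpy as np

def mean_precision_recall(dist, relevant):
    """Each experiment queries the rest, retrieved in order of increasing
    distance; precision and recall at each rank are averaged over queries
    (a rank-based simplification of the evaluation in Section 4.1)."""
    M = dist.shape[0]
    precisions, recalls = [], []
    for q in range(M):
        others = np.delete(np.arange(M), q)
        order = others[np.argsort(dist[q, others])]
        rel = relevant[q, order].astype(float)
        hits = np.cumsum(rel)                    # relevant items so far
        precisions.append(hits / np.arange(1, M))
        recalls.append(hits / max(rel.sum(), 1.0))
    return np.mean(precisions, axis=0), np.mean(recalls, axis=0)
```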

4.2 Comparison of retrieval schemes

We will now proceed to evaluating the performance of the modelling-based retrieval approach proposed in Section 2, comparing it with two other approaches for content-based retrieval previously suggested in the literature. The specific model we use is a gene clustering model based on a Gaussian product partition model (PPM); see Section 3. Within the modelling-based approach, three different methods of inferring clusterings from the experimental datasets are considered. These are the maximum a posteriori (MAP) clustering found using a stochastic greedy search algorithm; the MAP clustering found on a search space restricted to a range of k-means solutions with $k = 2, \ldots, \lceil\sqrt{n}\rceil$; and finally, a further restriction of this range to $k = \lceil\sqrt{n}/2\rceil$, which trivially reduces to a single k-means solution. The three methods will be referred to as PPM, PPM–k-means, and k-means, respectively.

The second retrieval approach is closely related to modelling-based retrieval, but instead of evaluating the distances $d_{NID}(S_q, S_m)$, $m = 1, \ldots, M$ (see Equation (9)), the idea is to evaluate the marginal likelihoods $p(D_q|S_m)$ of the query dataset $D_q$. A similar approach was recently suggested by Seth et al. (2014), where the marginal likelihood was averaged over the posterior distribution of the parameter of interest, corresponding to $P(S|D_m)$ for the dataset to be retrieved. To keep computational complexities comparable, we did not average over clusterings but only over cluster parameters, as above. We refer to this approach as likelihood-based retrieval. See Section 5 for a further discussion about the differences between the two approaches.

The third approach, which we will refer to as differential expression-based retrieval, assumes that a statistical test to detect differentially expressed genes has been conducted beforehand. The method is then based on correlating the gene-specific differential expression p-values of the query experiment with those of the database experiments. An approach similar to this was suggested in Engreitz et al. (2010). If targeted at differential expression profiles obtained under some specific known condition, this scheme has much potential to achieve good retrieval performance. On the other hand, it assumes more background knowledge and preprocessing of the data than the suggested retrieval schemes based on gene clustering. Here, the p-values used were those given in Expression Atlas by default (see Petryszak et al., 2014, for details), assuming no selection of specific experimental conditions. The correlation measure used was Pearson's correlation. Also, the number of genes used to calculate the correlations was 40 569, instead of the 537 genes used for the two other approaches, as this resulted in slightly improved performance. We finally note that differential expression-based retrieval schemes can also be formulated under the general framework of Section 2, using some appropriate probabilistic model for differential expression, as formulated in e.g. Do et al. (2006).

The results of the comparison between the retrieval schemes are shown in Figure 1.


Figure 1: Precision-recall curves comparing modelling-based, likelihood-based, and differential expression-based retrieval using three EFO types (a–c) as ground truth: random (dashed horizontal line), modelling-based PPM (blue), modelling-based PPM–k-means (green), modelling-based k-means (red), likelihood-based (cyan), differential expression-based (violet).

In Figure 1, all three modelling-based retrieval schemes (PPM, PPM–k-means, and k-means) clearly outperform the two other schemes. The results further indicate that retrieval performance is quite robust to the number of clusters; for both PPM and PPM–k-means, the number of clusters is allowed to vary between experiments, whereas k-means assumes the number of clusters for all experiments to be fixed at $k = 12$ (for PPM the number of clusters varied between 4 and 19, with an average of 9.9, whereas for PPM–k-means, the corresponding numbers were 3, 24 and 9.2). We conclude that for the current task, the use of k-means provides a good, computationally inexpensive proxy for the full probabilistic model. This conclusion should not, however, be generalized beyond the scope of the current context (see e.g. Blomstedt et al., 2015).

The performance of the likelihood-based approach is surprisingly poor here. This may be due to the well-known fact that gene expression measurements tend to be extremely noisy. In essence, the marginal likelihood $p(D_q|S_m)$ measures how well the query dataset $D_q$ is predicted by a model $S_m$, learnt from dataset $D_m$. Even if experiments $q$ and $m$ are in some way related, the idealized model $S_m$ may still not provide a good prediction for data $D_q$. Therefore, instead of using the complex and possibly very noisy dataset $D_q$ as query input, retaining only the characteristics relevant for retrieval in both $D_q$ and $D_m$ may help to improve performance, as illustrated in Figure 1.

4.3 Biological information in gene clustering

Any single EFO type will necessarily capture only one aspect of an experiment, whereas a meaningful retrieval task usually involves an evaluation of relevance between experiments in terms of a combination of aspects. It is therefore of interest to study the effect of composing the ground truth as a combination of multiple EFO types. In the current experimental setup, the ground truth for each of the EFO types ‘cell type’, ‘disease’ and ‘organism part’ can be represented as a symmetric binary matrix $G$ of dimension $M \times M$, such that entry $g_{ij} = 1$ if and only if experiments $i$ and $j$ are mutually relevant. A ground truth which requires a match in $t$ EFO types can then be formed by summing the three matrices and requiring $g_{ij} \geq t$. In Figure 2, the modelling-based PPM retrieval scheme is evaluated against ground truth relevances requiring (a) any EFO class to match ($t \geq 1$), (b) two or more matches ($t \geq 2$) and (c) all EFO classes to match ($t = 3$). The numbers of experiments satisfying these conditions are 251, 54 and 6, respectively.

Intuitively, the ground truth can be considered increasingly informative as the number of matching EFO types required to declare relevance increases. A retrieval scheme capturing biologically relevant information should then be in better agreement with a more informative ground truth. Although the curves of Figures 2a and 2b are not directly comparable due to the differing number of experiments used, the shape of the latter gives an indication of a better agreement. In Figure 2c, owing to the small number of available experiments, the ground truth is compared with the single most relevant experiment retrieved for each query. Here, the retrieval result matches the ground truth in all but one experiment.
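In code, forming this combined ground truth amounts to a single thresholding operation; the matrix names below are hypothetical.

```python
import numpy as np

def combined_ground_truth(G_cell, G_disease, G_organism, t):
    """Relevance matrix requiring a match in at least t of the three EFO
    types: entry (i, j) is 1 iff the summed binary matrices reach t."""
    return ((G_cell + G_disease + G_organism) >= t).astype(int)
```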

4.4 Annotations and gene clustering combined

As noted by Schmidberger et al. (2011), full phenotype information (e.g. sex, age, tumour grade, metastatic disease) about experiments is often missing despite a formal declaration of compliance with MIAME criteria (Brazma, 2001). Hence, even if the user's task is keyword-based retrieval, it may be a good idea to complement it with information provided by gene clustering. We next used modelling-based retrieval to complement the retrieved experiments, in practice by multiplying the elements of a binary similarity matrix stemming from keyword search with a matrix of cross-distances between clusterings. To investigate the performance of the combined method, we chose all experiments matching in both ‘cell type’ and ‘organism part’, resulting in a total of 43 experiments (all other combinations of two EFO types resulted in significantly fewer experiments). A match in both of the EFO types was considered to be the ground truth.


Figure 2: Evaluation of modelling-based PPM retrieval scheme with respect to the ground truth requiring (a) at least one, (b) at least two, (c) exactly three matching EFO types. The last subfigure compares the ground truth matrix (hollow squares) with the single most relevant retrieved experiment per query (solid squares) for the six experiments having a simultaneous match in all three EFO types. Accession numbers for the experiments are provided as a reference.



Figure 3: Retrieval performance assuming only ‘cell type’ (blue) or ‘organism part’ (red) to be known, with respect to the ground truth of a match in both EFO types. The solid curves correspond to the complemented retrieval schemes, while the coloured dashed lines represent the precision achieved using only keywords.

The idea was then to retrieve experiments assuming only one of the EFO types to be known, complementing the retrieval with rankings from modelling-based retrieval. The comparison in Figure 3 suggests that this procedure may help to improve performance compared to the use of a keyword alone.
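A minimal sketch of the combination step is given below; `K` is a hypothetical binary keyword-match matrix and `D_nid` the matrix of pairwise clustering distances, and since the exact combination formula is left open above, the element-wise product is interpreted here as keyword matches ordered by model similarity $1 - d_{NID}$.

```python
import numpy as np

def complemented_ranking(K, D_nid, query):
    """One plausible reading of the combination: keyword matches
    (K[query] == 1) are ordered by model similarity 1 - d_NID, while
    non-matches receive similarity 0 and therefore rank last."""
    combined = K[query] * (1.0 - D_nid[query])
    return np.argsort(-combined)      # most similar experiments first
```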

5 Discussion

In this paper, we have introduced a general probabilistic framework for content-driven retrieval of experimental datasets. Compared to earlier works which also employ probabilistic modelling (e.g. Caldas et al., 2009, 2012; Faisal et al., 2014; Seth et al., 2014), we do not use the likelihood of the query data as a measure of relevance, but instead learn a model of the query data and compare models. We argue that this reduces noise in the query input. With nuisance parameters further marginalized out, only characteristics relevant for the retrieval task are retained. A special instance of the general framework introduced in this paper has been previously used as a comparative method in a simulation study (Seth et al., 2014), with performance slightly inferior to a likelihood-based approach. The simulation setting in that earlier study was, however, very simplistic compared to datasets encountered in many real-life scenarios, such as that of Section 4, where the modelling-based approach was seen to clearly outperform its likelihood-based counterpart.

Contrary to likelihood-based approaches, the modelling-based approach requires all models under consideration to belong to the same family. Although this may seem somewhat restrictive, in particular for the potential future scenario in which individual researchers independently store models in a repository along with their datasets (e.g. Faisal et al., 2014), there are also scenarios where the assumption is feasible. Datasets which arise as a result of some specific type


of experiment are often in practice modelled using a fairly standardized set of approaches. In particular, if the models are constructed automatically, or by a curator of a data repository, the assumption of the models belonging to the same family is feasible.

As a specific application of the general framework, in Sections 3 and 4 we proposed a retrieval scheme for gene expression experiments based on gene clustering. It turned out that clustering is a surprisingly good model for this purpose; with minimal preprocessing and prior knowledge about the experiments, it is able to yield reasonable retrieval performance (Section 4.2) and to capture biologically relevant characteristics of the experiments (Section 4.3). Finally, we showed that it is straightforward to combine modelling-based retrieval with retrieval using available keywords (Section 4.4).

Acknowledgement

The authors would like to thank Ugis Sarkans for providing useful information about Expression Atlas. This work was financially supported by the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, grant no 251170).

References

Baumgartner, Jr., W. A., Cohen, K. B., Fox, L. M., Acquaah-Mensah, G., and Hunter, L. (2007). Manual curation is not sufficient for annotation of genomic databases. Bioinformatics, 23, i41–i48.

Blomstedt, P., Tang, J., Xiong, J., Granlund, C., and Corander, J. (2015). A Bayesian predictive model for clustering data of mixed discrete and continuous type. IEEE TPAMI, 37(3), 489–498.

Brazma, A. (2001). Minimum information about a microarray experiment (MIAME) – towards standards for microarray data. Nature Genetics, 29, 365–371.

Caldas, J., Gehlenborg, N., Faisal, A., Brazma, A., and Kaski, S. (2009). Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics, 25(12), i145–i153.

Caldas, J., Gehlenborg, N., Kettunen, E., Faisal, A., Rönty, M., Nicholson, A. G., Knuutila, S., Brazma, A., and Kaski, S. (2012). Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma. Bioinformatics, 28(2), 246–253.

Dahl, D. B. (2009). Modal clustering in a class of product partition models. Bayesian Analysis, 4(2), 243–264.


de Souto, M. C. P., Costa, I. G., de Araujo, D. S. A., Ludermir, T. B., and Schliep, A. (2008). Clustering cancer gene expression data: a comparative study. BMC Bioinformatics, 9(1), 497.

D'haeseleer, P. (2005). How does gene expression clustering work? Nature Biotechnology, 23(12), 1499–1501.

Do, K.-A., Müller, P., and Vannucci, M., editors (2006). Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press.

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1999). Cluster analysis and display of genome-wide expression patterns. PNAS, 95, 14863–14868.

Engreitz, J. M., Morgan, A. A., Dudley, J. T., Chen, R., Thathoo, R., Altman, R. B., and Butte, A. J. (2010). Content-based microarray search using differential expression profiles. BMC Bioinformatics, 11, 603.

Faisal, A., Peltonen, J., Georgii, E., Rung, J., and Kaski, S. (2014). Toward computational cumulative biology by combining models of biological datasets. PLoS ONE, 9(11), e113053.

Fujibuchi, W., Kiseleva, L., Taniguchi, T., Harada, H., and Horton, P. (2007). CellMontage: similar expression profile search server. Bioinformatics, 23(22), 3103–3104.

Georgii, E., Salojärvi, J., Brosché, M., Kangasjärvi, J., and Kaski, S. (2012). Targeted retrieval of gene expression measurements using regulatory models. Bioinformatics, 28(18), 2349–2356.

Hand, D. J. and Yu, K. (2001). Idiot's Bayes – not so stupid after all? International Statistical Review, 69(3), 385–398.

Hunter, L., Taylor, R. C., Leach, S. M., and Simon, R. (2001). GEST: a gene expression search tool based on a novel Bayesian similarity metric. Bioinformatics, 17(Suppl 1), S115–S122.

Jaskowiak, P. A., Campello, R. J. G. B., and Costa, I. G. (2014). On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics, 15(Suppl 2), S2.

Jordan, C., Livingstone, V., and Barry, D. (2007). Statistical modelling using product partition models. Statistical Modelling, 7(3), 275–295.

Malone, J., Holloway, E., Adamusiak, T., Kapushesky, M., Zheng, J., Kolesnikov, N., Zhukova, A., Brazma, A., and Parkinson, H. (2010). Modeling sample variables with an experimental factor ontology. Bioinformatics.

Meilă, M. (2007). Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98, 873–895.

Petryszak, R., Burdett, T., Fiorelli, B., Fonseca, N., Gonzalez-Porta, M., Hastings, E., Huber, W., Jupp, S., Keays, M., Kryvych, N., McMurry, J., Marioni, J., Malone, J., Megy, K., Rustici, G., Tang, A. Y., Taubert, J., Williams, E., Mannion, O., Parkinson, H. E., and Brazma, A. (2014). Expression Atlas update – a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments. Nucleic Acids Res., 42(Database issue), D926–D932.

Schmidberger, M., Lennert, S., and Mansmann, U. (2011). Conceptual aspects of large meta-analyses with publicly available microarray data: a case study in oncology. Bioinformatics and Biology Insights, 5, 13–39.

Seth, S., Shawe-Taylor, J., and Kaski, S. (2014). Retrieval of experiments by efficient comparison of marginal likelihoods. In C. K. Loo, K. S. Yap, K. W. Wong, A. T. B. Jin, and K. Huang, editors, Neural Information Processing – 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3–6, 2014, Proceedings, Part II, volume 8835 of Lecture Notes in Computer Science. Springer.

Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837–2854.

Zhu, Y., Davis, S., Stephens, R., Meltzer, P. S., and Chen, Y. (2008). GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics, 24(23), 2798–2800.
