MEASURING CORRELATIONS IN METABOLOMIC NETWORKS WITH MUTUAL INFORMATION

JORGE NUMATA¹
[email protected]

OLIVER EBENHÖH²
[email protected]

ERNST-WALTER KNAPP¹
[email protected]

¹ Macromolecular Modeling Group, Freie Universität Berlin, Takustr. 6, 14195 Berlin, Germany
² Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, 14476 Potsdam-Golm, Germany

Non-linear correlation measures based on mutual information are evaluated for quantifying statistical dependencies between pairs of metabolomic variables. While the Pearson correlation coefficient is rigorously applicable only to strictly linear correlations with Gaussian noise, the mutual information coefficient is valid more generally. Here, we use recent distribution-free (non-parametric) mutual information estimators based on k-nearest neighbor distances. The mutual information algorithm of Kraskov et al. is found to yield estimates with low systematic and statistical error. The significance of the different methods is probed for artificial data sets of tens to hundreds of data points, a size currently typical for metabolomic data. We then apply these procedures to experimental metabolite concentration data from Arabidopsis thaliana. Mutual information detects additional non-linear correlations that are invisible to the Pearson coefficient.

Keywords: statistical correlation; Pearson coefficient; non-linear correlation; mutual information; k-nearest neighbor entropy; metabolomics; Arabidopsis thaliana.
1. Introduction: Linear and Non-linear Correlation Measures
1.1. Correlations of metabolite concentration data

Metabolomics is a crucial tool in systems biology, since it provides insight into the phenotypic result of gene expression. Metabolites show coupled changes in concentration, both under the influence of genomic and stress perturbations and as part of the intrinsic variability of a biological network. The meaning of these correlations in terms of biochemical network topology and gene expression remains to be fully elucidated [1-3]. One important stumbling block is encapsulated in the saying "correlation is not causation", which applies equally to non-linear correlations. Steuer et al. observed that large linear (Pearson) correlation coefficients often do not coincide with metabolite pairs that are neighbors in the biochemical network [4]. Here, we pursue the more modest goal of providing suitable measures for statistical correlations among metabolite concentrations and of testing their significance. The method can also detect non-linear correlations that are not accessible to the linear Pearson coefficient rPC.
Statistical correlation measures based on mutual information capture more features of the data than the linear Pearson correlation coefficient. At the same time, they demand larger data sets than the Pearson coefficient to reach significance. Here, we test recent developments in non-parametric methods for entropy estimation [5-7] to provide a general, non-linear measure of statistical dependencies.

1.2. Advantages of mutual information as a measure of correlation

Mutual information is a non-linear measure of statistical dependence based on information theory [8]. It has advantages over other methods:
• It requires no decomposition of the data into modes, so there is no need to assume additivity of the original variables, as is done in Principal Component Analysis (PCA) and Independent Component Analysis (ICA) [9].
• It makes no assumptions about the functional form (Gaussian or non-Gaussian) of the statistical distribution that produced the data. Hence, it is a non-parametric method.

A numerical implementation based on k-nearest neighbor distances [5-7, 10] is more attractive than other methods to estimate mutual information [11]:
• It requires no binning to generate histograms.
• It consumes less computational resources, and its parameters are easier to tune than those of kernel density estimators.

It is common practice to normalize the data to zero mean and unit variance using a linear transformation, which has no effect on the Pearson correlation coefficient. Linear transformations are smooth and uniquely invertible maps, as are the more general homeomorphic (non-linear) transformations. Mutual information for pairs of variables is not altered by general homeomorphic transformations of the data [5, 12]. These properties are important because metabolomic data rarely yield absolute concentrations, but rather ratios of concentrations [2].

1.3. Entropy, mutual information and statistical (in)dependence

We employ the usual symbol for entropy from information theory (H) instead of the thermodynamic notation (S). All logarithms (ln) refer to base e, so that entropy and mutual information are measured in nats. To convert to bits, divide by ln(2). We are interested in testing the correlation between two random variables xi and xj, which have marginal probability densities pi(xi) and pj(xj) and a joint probability density p(i,j)(xi, xj). In the current non-parametric approach, no particular functional form for the probability densities is assumed. The corresponding differential entropies (for continuous variables) are:
$$H_k = -\int p_k(x)\,\ln p_k(x)\,dx, \quad k = i, j, \qquad H_{(i,j)} = -\iint p_{(i,j)}(x_i, x_j)\,\ln p_{(i,j)}(x_i, x_j)\,dx_i\,dx_j \qquad (1)$$
The mutual information I(i,j) shared by xi and xj is
$$I_{(i,j)} = \iint p_{(i,j)}(x_i, x_j)\,\ln\!\left[\frac{p_{(i,j)}(x_i, x_j)}{p_i(x_i)\,p_j(x_j)}\right] dx_i\,dx_j. \qquad (2)$$
Two variables xi and xj are statistically independent if and only if the joint probability density equals the product of the marginal densities: p(i,j)(xi, xj) = pi(xi) × pj(xj), since in that case, the argument of the logarithm term in Eq. 2 is unity and the mutual information vanishes. The mutual information I(i,j) can also be written [8] as:
$$I_{(i,j)} = H_i + H_j - H_{(i,j)}. \qquad (3)$$
If the variables xi and xj are correlated, I(i,j) takes a positive value up to min(Hi, Hj). We employ a more intuitive non-linear correlation coefficient rI [6, 13, 14] that assumes values in the interval (0,1) for correlated variables. This coefficient rI measures the generalized statistical dependence between two variables. For strict correlation of the variables xi and xj (e.g., xi = xj or xi = −xj), rI adopts the maximum value of +1; in the absence of correlation, rI vanishes. Although the exact value of I(i,j) cannot become negative, approximate evaluations can. We therefore propose a modification of the coefficient rI that also allows negative values in (−1, 0), which quantify possible numerical errors in estimating mutual information:
$$r_I = \mathrm{sign}\!\left[I_{(i,j)}\right]\sqrt{1 - \exp\!\left(-2\,\lvert I_{(i,j)}\rvert\right)}. \qquad (4)$$
Note that negative values of rI should not be interpreted as anti-correlations, since for anti-correlated variables I(i,j) also adopts positive values. In contrast to mutual information, the Pearson correlation coefficient quantifies exclusively linear correlations and is actually given as a normalized covariance:
$$r^{PC} = \frac{\mathrm{cov}(x_i, x_j)}{\sigma_i \sigma_j} = \frac{\left\langle (x_i - \langle x_i \rangle)(x_j - \langle x_j \rangle) \right\rangle}{\sqrt{\left\langle (x_i - \langle x_i \rangle)^2 \right\rangle \left\langle (x_j - \langle x_j \rangle)^2 \right\rangle}}, \qquad (5)$$
where

$$\left\langle f(x_i, x_j) \right\rangle = \iint f(x_i, x_j)\,p_{(i,j)}(x_i, x_j)\,dx_i\,dx_j. \qquad (6)$$
Since the Pearson correlation coefficient is based on quadratic forms, it is relatively sensitive to outliers. Negative values of rPC for two variables denote anti-correlation (appearing as a negative slope in a linear fit). In the numerical implementation, a value of rPC = 0 is assigned to cases where one of the variances in the denominator of Eq. (5) vanishes. A non-vanishing rPC means that a linear fit can approximately describe the correlation between xi and xj. Similarly, a positive non-linear coefficient rI means that the variables xi and xj are correlated, and a non-linear fit could describe this relationship. This is a very general statement and does not imply any particular functional form (such as a quadratic polynomial). The numerical convention for rPC can be made concrete in a few lines, as sketched below.
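A minimal sketch in Python of Eq. 5 with the zero-variance convention (the function name is our own):

```python
import numpy as np

def pearson_rpc(xi: np.ndarray, xj: np.ndarray) -> float:
    """Pearson coefficient (Eq. 5); returns 0.0 if a variance vanishes."""
    xi_c = xi - xi.mean()
    xj_c = xj - xj.mean()
    denom = np.sqrt((xi_c ** 2).mean() * (xj_c ** 2).mean())
    if denom == 0.0:   # a vanishing variance makes Eq. 5 undefined
        return 0.0     # convention used in the text
    return float((xi_c * xj_c).mean() / denom)
```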
1.4. Numerical methods: k-nearest neighbor entropy and Kraskov mutual information

We employ two different methods to estimate the mutual information I(i,j) and its correlation coefficient rI from Eq. 4. The first method uses the k-nearest neighbor entropy [10] introduced by Hnizdo et al. [6, 7], which estimates Hi, Hj and H(i,j) individually and then calculates the mutual information I(i,j) from Eq. 3, yielding the coefficient rI kNN from Eq. 4. The second method estimates the mutual information I(i,j) more directly, using the algorithm of Kraskov et al. [5], which is also based on a nearest-neighbor approach. The more direct estimate is advantageous, since it avoids the accumulation of systematic biases inherent in the terms Hi, Hj and H(i,j) when using Eq. 3 for I(i,j). For both algorithms, rI kNN and rI Kras, the only adjustable parameter is the number of neighbors. We use the k = 6th nearest neighbor, which proved to be a good compromise between systematic and statistical errors (data not shown). Minimal sketches of both estimators are given below.
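For concreteness, illustrative Python sketches of both estimators (our own implementations, not the authors' code): a Kozachenko-Leonenko kNN estimator for the 1-D entropies entering Eq. 3, and the Kraskov et al. estimator (their algorithm 1) for I(i,j), together with the conversion to rI from Eq. 4:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy_1d(x: np.ndarray, k: int = 6) -> float:
    """Kozachenko-Leonenko kNN estimate of a 1-D differential entropy (nats).
    Assumes no duplicate values (a zero distance would make the log diverge)."""
    n = len(x)
    # distance from each point to its k-th nearest neighbor
    d = np.sort(np.abs(x[:, None] - x[None, :]), axis=1)[:, k]
    return digamma(n) - digamma(k) + np.mean(np.log(2.0 * d))

def kraskov_mi(x: np.ndarray, y: np.ndarray, k: int = 6) -> float:
    """Kraskov-Stoegbauer-Grassberger estimate of I(x;y) in nats
    (algorithm 1: max-norm in the joint space, strict counts in the marginals)."""
    n = len(x)
    xy = np.column_stack((x, y))
    # distance to the k-th neighbor in the joint space; k+1 because the
    # query point itself is returned as its own nearest neighbor
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
    nx = np.array([np.sum(np.abs(x - x[i]) < eps[i]) - 1 for i in range(n)])
    ny = np.array([np.sum(np.abs(y - y[i]) < eps[i]) - 1 for i in range(n)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

def r_i(mi: float) -> float:
    """Non-linear correlation coefficient of Eq. 4 (sign kept for negative estimates)."""
    return float(np.sign(mi) * np.sqrt(1.0 - np.exp(-2.0 * abs(mi))))
```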
2. Application to Constructed Data
2.1. Non-linear correlations are captured by mutual information

The non-linear correlation coefficient based on mutual information detects correlations that are invisible to the linear Pearson coefficient. Cases A1-A7 in Fig. 1 show comparable performance for both coefficients in linear cases. In contrast, the non-linear nature of the correlation between the variables in cases B1-B6 causes rPC to vanish. Visually it is obvious that a relationship exists, and this is quantified by rI Kras. A toy check of this behavior is sketched below.
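As a toy check (our own example, not from the figure), one can reproduce this qualitative behavior with scikit-learn's kNN-based mutual information estimator (which follows Kraskov et al.) as a stand-in for rI Kras. For a noisy parabola, similar to the B-row cases, rPC is near zero while the mutual-information coefficient is large:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x ** 2 + 0.1 * rng.normal(size=1000)   # parabola with Gaussian noise

r_pc = np.corrcoef(x, y)[0, 1]             # near 0: symmetric about x = 0
mi = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=6)[0]  # in nats
r_i = np.sqrt(1.0 - np.exp(-2.0 * mi))     # Eq. 4, positive branch
print(f"rPC = {r_pc:.3f}, rI = {r_i:.3f}") # rPC ~ 0.0, rI close to 1
```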
Figure 1: Comparison of the performance of the Pearson (linear) correlation coefficient rPC and the non-linear measure rI Kras (an implementation of rI) based on mutual information. Each of the 14 cases is an artificial example showing a different functional relationship between the variables xi and xj. The artificial data sets are large (Nsize = 10^5 points), some of them with Gaussian noise. The first row (A1-A7) represents linear correlations, and for A4 a lack thereof. Except for the sign, comparable performance is shown for rI Kras and rPC when the correlation is linear. The second row (B1-B7) displays non-linear cases where rPC fails to detect any correlation, while rI Kras can quantify it. Case B7 is shown to be uncorrelated by both coefficients.
Note that anti-correlation (negative rPC) in B5-B7 appears simply as statistical dependence in rI Kras, which is strictly never negative in the absence of numerical errors. In any case, the concept of anti-correlation is not applicable to relations with changing slope, such as B4 or B6 of Fig. 1. Furthermore, anti-correlation loses its meaning even for linear relations in more than two variables [14].

2.2. Significance of the coefficients given the limited sample size

Metabolomics and gene expression experiments currently yield tens to hundreds of data points. To probe the significance of the correlation coefficients among pairs of variables for such sample sizes, we numerically tested the artificial examples A2 (linear correlation), A4 (uncorrelated) and B4 (non-linear correlation undetectable by rPC). From Fig. 2, we observe that sample sizes need to exceed Nsize = 40 to detect correlations reliably. One important lesson from Fig. 2 is that weak correlations, corresponding to small correlation coefficients, cannot be discriminated from background noise, in agreement with the findings of Selbig et al. [11]. There is a gray area in the region 0.550 < rI Kras < 0.665 for Nsize = 43, where a more thorough statistical analysis could be made [15]. In this work, we stay on the safe side by considering only large correlations.
Figure 2: For cases A2 (linear correlation), A4 (uncorrelated) and B4 (non-linear correlation), all three correlation coefficients rPC, rI kNN and rI Kras (vertical axis) were estimated as a function of the sample size Nsize (horizontal axis). The error bars show empirical 95% confidence intervals (p = 0.05). They represent the observed, sometimes asymmetric, variation around the mean for 2000 samples of Nsize data points each. Among the non-linear methods, rI Kras shows smaller statistical and systematic errors than rI kNN. Negative values of rPC denote anti-correlations. Negative values of rI kNN and rI Kras denote numerical errors in the estimation of mutual information.
The method of Hnizdo et al. [6, 7] yielding rI kNN is based on a nearest-neighbor entropy estimator. As pointed out by Kraskov et al. [5], the systematic biases in the individual entropy estimates of 1-D and 2-D samples will not necessarily cancel in Eq. 3. In our numerical experiment (Fig. 2, column rI kNN), a negative systematic bias is evident for Nsize < 58 and a positive bias for Nsize > 58, with a larger spread of values and frequent occurrences of negative rI kNN values, which are traces of numerical errors. Nevertheless, the kNN method is still useful for very large sample sizes (Nsize > 1000) and when the 1-D entropy of each variable is of interest [10]. Computing rI Kras is better suited than rI kNN for our small sample sizes, in particular since we are interested in the mutual information I(i,j) and not in 1-D entropies. Among the computational methods considered, rI Kras shows the smallest systematic bias and the lowest variability (statistical error). Negative values of I(i,j) were only found for small to medium correlation values (rI Kras < 0.665) or very small sample sizes (Nsize < 40), where the presence of correlation is difficult to detect. The confidence intervals of Fig. 2 can be reproduced by simple resampling, as sketched below.
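A minimal sketch of the resampling procedure behind the error bars of Fig. 2 (our own reconstruction; kraskov_mi and r_i refer to the sketches in Sec. 1.4, and the linear test case mimics A2):

```python
import numpy as np

def empirical_ci(n_size: int, n_samples: int = 2000, k: int = 6):
    """Empirical 95% interval of rI Kras for a linear test case (like A2)."""
    rng = np.random.default_rng(1)
    estimates = np.empty(n_samples)
    for s in range(n_samples):
        x = rng.normal(size=n_size)
        y = x + 0.5 * rng.normal(size=n_size)   # linear relation + noise
        estimates[s] = r_i(kraskov_mi(x, y, k=k))
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return estimates.mean(), lo, hi
```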
Figure 3: For the test cases A2, A4 and B4 (see Fig. 1 for more details), the absolute value of the Pearson coefficient |rPC| is plotted against the mutual information coefficient rI Kras for 2×10^4 samples. Samples yielding points (+) outside the rectangle show significant correlations for both coefficients, while the points (×) from samples inside the rectangle do not. The rectangle marks the cutoff values |rPC| = 0.545 and rI Kras = 0.665, which were chosen to minimize the detection of false positives where no correlations are present. For linear correlation (A2), both coefficients provide similar information. But the Pearson coefficient rPC is not able to detect the non-linear correlations in B4 and reports values similar to the uncorrelated case A4. Negative values of rI Kras denote numerical errors in the estimation of mutual information.
The cutoff values rPC = 0.545 and rI Kras = 0.665 were chosen from Fig. 3 to minimize the detection of false positives in the absence of correlation (case A4) for Nsize = 43. Under these conditions, we obtain three false positives in 2×10^4 samples using rI Kras (see Fig. 3, middle part), corresponding to 0.015% of all samples and a concomitant p-value of p = 3/(2×10^4) = 0.00015. In the following we deal with data of sample size Nsize = 43, comparing 16290 pairs of metabolic variables. At the same time, we expect to detect 99.8% of highly linearly correlated pairs, but only 28.5% of the non-linear ones (see Fig. 3). This is because the limited sample size of 43 data points restricts reliable detection to large values of non-linear correlations. In an analogous numerical simulation with 2×10^4 samples using the larger sample size Nsize = 700, non-linear correlations with ⟨rI Kras⟩ = 0.75 could be completely separated from data obtained in the absence of correlation, ⟨rI Kras⟩ = 0 (data not shown). Even for the small Nsize = 43, applying both methods enriches the detection of correlations in comparison to using rPC alone. The cutoff calibration on the uncorrelated case can be reproduced along the lines of the sketch below.
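A sketch of how such a cutoff can be calibrated on the uncorrelated null case (our own illustration; kraskov_mi and r_i as in Sec. 1.4, and fp_target = 3 reproduces the three false positives quoted above):

```python
import numpy as np

def calibrate_cutoff(n_size: int = 43, n_samples: int = 20000,
                     fp_target: int = 3, k: int = 6) -> float:
    """Cutoff on rI Kras that leaves about fp_target false positives
    among n_samples uncorrelated samples (case A4)."""
    rng = np.random.default_rng(2)
    null_r = np.empty(n_samples)
    for s in range(n_samples):
        x = rng.normal(size=n_size)
        y = rng.normal(size=n_size)          # independent: no correlation
        null_r[s] = r_i(kraskov_mi(x, y, k=k))
    # leave only fp_target null samples above the returned cutoff
    return float(np.sort(null_r)[-(fp_target + 1)])
```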
3. Application to Metabolomic Data
3.1. Correlations among metabolite concentrations from Arabidopsis thaliana

The data set to be analyzed consists of a sample of Nsize = 43 Arabidopsis thaliana plants comprising four lines, where each line involves between ten and twelve biological replicates. The latter refers to plants of the same line (i.e. possessing the same DNA) kept under identical growth conditions. The four plant lines used are the ecotypes Col-0 and C24 and two of their crossings, Col-0xC24 and C24xCol-0. For each of these 43 plants, 181 standardized metabolite concentration ratios of the primary metabolism were measured. The data analyzed in the present study are the resulting (181×180)/2 = 16290 pairs of metabolites taken from a data set analyzed before [16]. In the following, we look for correlations between metabolite pairs by grouping plant lines and biological replicates to form a sample whose size (Nsize = 43) is above the minimum reliable sample size, found to be Nsize = 40. However, a larger sample size would allow detecting differences in correlation among plant lines more reliably. The conservative limits of rPC > 0.545 and rI Kras > 0.665 to detect correlations were chosen for the case of Nsize = 43. These limits are stringent and rather err on the side of producing false negatives in order to obtain trustworthy positives with p = 0.00015. More tolerant limits can be used for larger sample sizes or by employing p- and q-value analysis [15]. The classification of all pairs by these two cutoffs is summarized in Table 1; a sketch of this bookkeeping is given below.

Table 1: Correlations among the 181 metabolites were tested in a pairwise fashion, yielding 16290 pairs, of which 5.6% were found to be significantly correlated.

linear Pearson coefficient | non-linear mutual information coefficient | comments
rPC < 0.545 | rI Kras > 0.665 | non-linear / only discovered by mutual information (Figs. 4, 5)
rPC > 0.545 | rI Kras > 0.665 | linear (large rI Kras) (Fig. 6)
rPC > 0.545 | rI Kras < 0.665 | linear (small rI Kras) (Fig. 7)
rPC < 0.545 | rI Kras < 0.665 | uncorrelated / un-detectable (Fig. 8)

Figure 4: Examples of correlations discovered only by the mutual information coefficient (rI Kras > 0.665).
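A minimal sketch of the pairwise screening behind Table 1 (our own illustration; conc is assumed to be a 43 × 181 array of standardized concentration ratios, and kraskov_mi / r_i are the sketches from Sec. 1.4):

```python
import numpy as np
from itertools import combinations

def classify_pairs(conc: np.ndarray, pc_cut: float = 0.545,
                   mi_cut: float = 0.665, k: int = 6) -> dict:
    """Sort all metabolite pairs into the four categories of Table 1."""
    n_met = conc.shape[1]                    # expected: 181 metabolites (43 plants)
    classes = {"non-linear": [], "linear_large": [],
               "linear_small": [], "undetectable": []}
    for i, j in combinations(range(n_met), 2):   # 16290 pairs
        rpc = abs(np.corrcoef(conc[:, i], conc[:, j])[0, 1])
        rik = r_i(kraskov_mi(conc[:, i], conc[:, j], k=k))
        if rpc <= pc_cut and rik > mi_cut:
            classes["non-linear"].append((i, j))
        elif rpc > pc_cut and rik > mi_cut:
            classes["linear_large"].append((i, j))
        elif rpc > pc_cut:
            classes["linear_small"].append((i, j))
        else:
            classes["undetectable"].append((i, j))
    return classes
```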
Figure 5: Aside from the presence of correlations only detectable by rI Kras > 0.665, these plots exhibit differences among the plant lines. The Pearson coefficient was not significant (rPC < 0.545).
Figure 6: Examples where both correlation coefficients rPC and rI Kras indicate significant correlation (rPC > 0.545, rI Kras > 0.665).
Figure 7: Examples of the case where only the Pearson coefficient was significant (rPC > 0.545) but the non-linear coefficient was not (rI Kras < 0.665).