DIRECTED NETWORK INFERENCE USING A MEASURE OF DIRECTED INFORMATION

Ying Liu, Selin Aviyente
Michigan State University, Department of Electrical and Computer Engineering, East Lansing, MI 48824, USA

ABSTRACT

The concept of mutual information (MI) has been widely used for inferring complex networks such as genetic regulatory networks. However, MI-based methods cannot infer directed or dynamic networks. In this paper, we propose a new network inference algorithm that infers directed acyclic networks, determining both the connectivity and the causality between nodes based on the concepts of directed information (DI) and conditional directed information. The proposed method is applied to both simulated data and electroencephalography (EEG) data to evaluate its effectiveness.

Index Terms— Information theory, Directed graphs, Electroencephalography

1. INTRODUCTION

In many complex systems, such as genomic networks [1], it is important to quantify the causal relationships between subsystems and to model their interactions. Information-theoretic approaches have been widely used for the inference of large networks in the bioinformatics community. Most of these methods rely on estimating the mutual information between variables from expression data to quantify their dependency. The Chow-Liu tree algorithm was the first to adopt mutual information in probabilistic model design, fitting a minimum spanning tree, which has a low number of edges even for non-sparse target networks [2]. The relevance network (RELNET) approach extended these ideas by connecting a pair of genes X and Y whenever their mutual information is above a threshold. However, this method may infer false connections when two nodes are indirectly connected through an intermediate node [3]. The algorithm for the reconstruction of accurate cellular networks (ARACNE) addresses this problem by using the data processing inequality for mutual information [4].
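The data-processing-inequality pruning used by ARACNE can be sketched in a few lines. The following is a minimal illustration, not the ARACNE implementation itself; it assumes a precomputed symmetric matrix `mi` of pairwise mutual-information values, and the tolerance `eps` is a hypothetical parameter:

```python
import numpy as np

def dpi_prune(mi, eps=0.0):
    """Flag indirect edges via the data processing inequality: if X-Z-Y
    is a chain or a hub, then I(X;Y) <= min(I(X;Z), I(Z;Y)), so the
    weakest edge of every triangle is marked as indirect and removed."""
    n = mi.shape[0]
    keep = mi > 0  # start from all edges with positive MI
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                # (i, j) is indirect if it is the weakest edge of the triangle
                if mi[i, j] < min(mi[i, k], mi[k, j]) - eps:
                    keep[i, j] = keep[j, i] = False
    return keep

# toy example: chain 0 - 1 - 2 with a weak indirect 0 - 2 dependency
mi = np.array([[0.0, 0.9, 0.3],
               [0.9, 0.0, 0.8],
               [0.3, 0.8, 0.0]])
print(dpi_prune(mi))  # the indirect edge (0, 2) is pruned
```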
All of these methods focus on quantifying the dependency between variables without taking temporal information into account; they are therefore limited to inferring undirected or stationary networks. In this paper, we propose a directed acyclic network inference algorithm based on estimating directed

(This work was supported in part by the National Science Foundation under Grants No. CCF-0728984 and CAREER CCF-0746971.)
978-1-4244-4296-6/10/$25.00 ©2010 IEEE
information between signals over time in order to quantify both the connectivity and the causality in networks.

2. DIRECTED INFORMATION AND CAUSALLY CONDITIONAL DIRECTED INFORMATION

2.1. Directed information

The directed information measure introduced by Massey [5] is defined for two length-N sequences X^N = X_1, ..., X_N and Y^N = Y_1, ..., Y_N as follows:

    DI(X^N → Y^N) = \sum_{n=1}^{N} I(X^n; Y_n | Y^{n-1}),    (1)

where I(X; Y) is the mutual information between two random variables X and Y. Since 0 ≤ DI(X^N → Y^N) ≤ I(X^N; Y^N) < ∞, in practice a normalized version of DI, which maps DI to the [0, 1] range, is used for comparing different interactions [1]:

    ρ_DI = 1 − e^{−2 \sum_{i=1}^{N} I(X^i; Y_i | Y^{i-1})}.    (2)
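Equations (1) and (2) can be illustrated with a plug-in (counting) estimator over short binary sequences. The helper names and the toy driver-follower process below are our own, and a naive counting estimator like this only scales to very short sequences, since the joint alphabet grows exponentially in N:

```python
import numpy as np
from collections import Counter

def _rows(x):
    """Each realization as a hashable tuple (handles 1-D and empty cases)."""
    x = np.asarray(x)
    if x.ndim == 1:
        x = x[:, None]
    return [tuple(r) for r in x]

def cond_mi(a, b, c):
    """Plug-in estimate of I(A; B | C) in nats from R joint realizations."""
    A, B, C = _rows(a), _rows(b), _rows(c)
    R = len(A)
    p_abc = Counter(zip(A, B, C))
    p_ac, p_bc, p_c = Counter(zip(A, C)), Counter(zip(B, C)), Counter(C)
    est = sum(n / R * np.log(n * p_c[z] / (p_ac[x, z] * p_bc[y, z]))
              for (x, y, z), n in p_abc.items())
    return max(est, 0.0)

def directed_info(X, Y):
    """DI(X^N -> Y^N) = sum_n I(X^n; Y_n | Y^{n-1}), Eq. (1)."""
    return sum(cond_mi(X[:, :n + 1], Y[:, n], Y[:, :n])
               for n in range(X.shape[1]))

rng = np.random.default_rng(0)
R, N = 20000, 3
X = rng.integers(0, 2, (R, N))
Y = np.empty_like(X)
Y[:, 0] = rng.integers(0, 2, R)
Y[:, 1:] = X[:, :-1]                 # Y copies X with a one-step delay
di_xy = directed_info(X, Y)          # close to 2*ln(2): X drives Y
di_yx = directed_info(Y, X)          # close to 0
rho = 1 - np.exp(-2 * di_xy)         # normalized DI of Eq. (2)
print(di_xy > di_yx)
```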
2.2. Causally conditional directed information

Based on the definition of directed information, if X is the driver of Y, then DI(X → Y) > DI(Y → X), which indicates causality between the two processes. However, the converse does not hold: given a large DI value, X and Y can be connected either directly or indirectly through other nodes. For this reason, we introduce the concept of conditional directed information to discriminate between direct and indirect connections. The directed information from X^N to Y^N when causally conditioned on the sequence Z^N is defined as [6]

    DI(X^N → Y^N ‖ Z^N) = \sum_{n=1}^{N} I(X^n; Y_n | Y^{n-1}, Z^n),    (3)

where I(X; Y|Z) is the conditional mutual information between two random variables X and Y given Z. Any three nodes in a network can interact with each other through two possible scenarios: positioned as a chain X → Z → Y, or interacting through a hub node X ← Z → Y. In both cases all three pairs (X, Y), (Y, Z), (X, Z) have large DI values. However, given the information of the intermediate or hub node Z over the whole time series, X and Y become independent at all time points, i.e., the conditional directed information DI(X^N → Y^N ‖ Z^N) equals zero, as shown by the following theorem.
ICASSP 2010
Definition 1. X and Y are conditionally independent given Z if P(X, Y|Z) = P(X|Z)P(Y|Z), where P(X|Z) is the conditional probability density function of X given Z.

Theorem 1. If X_m and Y_k are conditionally independent given Z^n for m, k = 1, ..., n and n = 1, ..., N, then DI(X^N → Y^N ‖ Z^N) = 0.

In order to prove this theorem, we will use the following two lemmas.

Lemma 1. If P(X_m, Y_k|Z^n) = P(X_m|Z^n)P(Y_k|Z^n) for m, k = 1, ..., n, then I(X_m; Y_k|Z^n) = 0.

The proof of this lemma is given in [7].

Lemma 2. I(X^m; Y^k|Z^n) = 0 if and only if I(X_m; Y_k|Z^n) = 0 for m, k = 1, ..., n.

Proof. First, we prove that I(X^m; Y^k|Z^n) = 0 implies I(X_m; Y_k|Z^n) = 0:

    I(X^m; Y^k|Z^n) = E_{P(X^m, Y^k, Z^n)} [ \log \frac{P(X^m, Y^k|Z^n)}{P(X^m|Z^n) P(Y^k|Z^n)} ].    (4)

When this conditional mutual information equals 0, P(X^m, Y^k|Z^n) = P(X^m|Z^n)P(Y^k|Z^n). Therefore the marginal probability can be written as

    P(X_i, Y_j|Z^n) = ∫ P(X^m, Y^k|Z^n) dx_1 ⋯ dx_{i-1} dx_{i+1} ⋯ dx_m dy_1 ⋯ dy_{j-1} dy_{j+1} ⋯ dy_k
                    = ∫ P(X^m|Z^n) dx_1 ⋯ dx_{i-1} dx_{i+1} ⋯ dx_m × ∫ P(Y^k|Z^n) dy_1 ⋯ dy_{j-1} dy_{j+1} ⋯ dy_k
                    = P(X_i|Z^n) P(Y_j|Z^n).

Therefore,

    I(X_m; Y_k|Z^n) = E_{P(X_m, Y_k, Z^n)} [ \log \frac{P(X_m, Y_k|Z^n)}{P(X_m|Z^n) P(Y_k|Z^n)} ] = 0.    (5)

The converse is proved similarly.

Using Lemmas 1 and 2, we now give the proof of Theorem 1.

Proof. If X_m and Y_k are conditionally independent given Z^n for m, k = 1, ..., n and n = 1, ..., N, then P(X_m, Y_k|Z^n) = P(X_m|Z^n)P(Y_k|Z^n). From Lemma 1, I(X_m; Y_k|Z^n) = 0, and from Lemma 2, I(X^m; Y^k|Z^n) = 0. Therefore, taking m = n and k = n gives I(X^n; Y^n|Z^n) = 0, which by the definition of mutual information is equivalent to H(X^n|Z^n) = H(X^n|Y^n, Z^n); similarly, taking m = n and k = n − 1 gives I(X^n; Y^{n-1}|Z^n) = 0, which is equivalent to H(X^n|Z^n) = H(X^n|Y^{n-1}, Z^n). Therefore,

    DI(X^N → Y^N ‖ Z^N) = \sum_{n=1}^{N} I(X^n; Y_n | Y^{n-1}, Z^n)
                        = \sum_{n=1}^{N} [ H(X^n|Y^{n-1}, Z^n) − H(X^n|Y^n, Z^n) ]    (6)
                        = 0.
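Theorem 1 can be checked numerically on a chain X → Z → Y. The plug-in estimator below is an illustrative sketch of our own (not the paper's estimator); small positive values remain in the conditional DI purely from finite-sample bias:

```python
import numpy as np
from collections import Counter

def _rows(x):
    x = np.asarray(x)
    if x.ndim == 1:
        x = x[:, None]
    return [tuple(r) for r in x]

def cond_mi(a, b, c):
    """Plug-in estimate of I(A; B | C) in nats."""
    A, B, C = _rows(a), _rows(b), _rows(c)
    R = len(A)
    p_abc = Counter(zip(A, B, C))
    p_ac, p_bc, p_c = Counter(zip(A, C)), Counter(zip(B, C)), Counter(C)
    return max(sum(n / R * np.log(n * p_c[z] / (p_ac[x, z] * p_bc[y, z]))
                   for (x, y, z), n in p_abc.items()), 0.0)

def di(X, Y):
    # Eq. (1): sum_n I(X^n; Y_n | Y^{n-1})
    return sum(cond_mi(X[:, :n + 1], Y[:, n], Y[:, :n])
               for n in range(X.shape[1]))

def cdi(X, Y, Z):
    # Eq. (3): sum_n I(X^n; Y_n | Y^{n-1}, Z^n)
    return sum(cond_mi(X[:, :n + 1], Y[:, n],
                       np.hstack([Y[:, :n], Z[:, :n + 1]]))
               for n in range(X.shape[1]))

# chain X -> Z -> Y: Z_n = X_{n-1}, Y_n = Z_{n-1}
rng = np.random.default_rng(1)
R, N = 20000, 3
X = rng.integers(0, 2, (R, N))
Z = np.roll(X, 1, axis=1); Z[:, 0] = rng.integers(0, 2, R)
Y = np.roll(Z, 1, axis=1); Y[:, 0] = rng.integers(0, 2, R)
d, c = di(X, Y), cdi(X, Y, Z)
print(d, c)  # DI(X -> Y) is clearly positive; DI(X -> Y || Z) is near zero
```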
3. DIRECTED INFORMATION ESTIMATION

From equations (1) and (3), the computation of DI requires estimating joint probabilities of high-dimensional random variables over time. In this paper, directed information estimation based on multi-information [8] is used to estimate the DI and the causally conditional DI, which can be written in terms of multi-information as

    DI(X^N → Y^N) = \sum_{n=1}^{N} [ I(X^n, Y^n) − I(X^n, Y^{n-1}) ] − I(Y^N),

    DI(X^N → Y^N ‖ Z^N) = \sum_{n=1}^{N} [ I(X^n, Y^n, Z^n) − I(X^n, Y^{n-1}, Z^n) ] − \sum_{n=1}^{N} [ I(Y^n, Z^n) − I(Y^{n-1}, Z^n) ],    (7)

where I(X^n, Y^n) is the multi-information among the 2n random variables X_1, ..., X_n, Y_1, ..., Y_n and is related to mutual information by I(X^n, Y^n) = I(X^n; Y^n) + I(X^n) + I(Y^n). The multi-information can be estimated directly from data using the adaptive partitioning method [9].

4. INFERENCE ALGORITHM

4.1. Algorithm

The proposed inference algorithm utilizes both directed information and conditional directed information to infer the direct interactions between nodes. In the first step (lines 3-12), we calculate the time-averaged directed information between each pair of nodes i and j. If D_{i,j} is larger than a threshold t_0, the information flow from node i to node j is assumed to be significant. The threshold t_0 is determined using surrogate data sets generated by the Fourier bootstrapping method, such that 95% of the DI values from the random network are lower than t_0. In the second step (lines 13-26), we determine the directionality of the connections by applying the sign test to the difference in DI between the two directions. If the hypothesis that there is no difference between the two directions cannot be rejected at the 5% significance level, we keep both connections; otherwise, we keep the connection with the larger DI. In order to distinguish direct from indirect connections, for each node pair (i, j) we evaluate the conditional directed information given every other node k, for all k ≠ i, j. If node k does not interact with nodes i and j, then DI(X_i → X_j ‖ X_k) should be close to DI(X_i → X_j); otherwise, if k is an intermediate or hub node between i and j, DI(X_i → X_j ‖ X_k) should be close to 0. Finally (lines 27-37), we compute DI(X_i → X_j ‖ X_k)/DI(X_i → X_j), which is close to 1 when there are no interactions between k and the pair (i, j), and rank this quantity from high to low over all k. The connections with the highest ranks are kept until a desired number of connections, ec, is reached.

4.2.
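The surrogate threshold t_0 can be sketched with phase-randomized, amplitude-preserving surrogates, a common realization of Fourier bootstrapping; whether this matches the authors' exact surrogate recipe is an assumption. For brevity, the toy below uses a lagged correlation in place of the DI statistic, and all helper names are our own:

```python
import numpy as np

def phase_surrogate(x, rng):
    """Surrogate with the same power spectrum as x but randomized
    Fourier phases, which destroys any directed temporal structure."""
    Xf = np.fft.rfft(x)
    phases = rng.uniform(0, 2 * np.pi, len(Xf))
    phases[0] = 0.0                      # keep the mean (DC bin) untouched
    if len(x) % 2 == 0:
        phases[-1] = 0.0                 # keep the Nyquist bin real
    return np.fft.irfft(np.abs(Xf) * np.exp(1j * phases), n=len(x))

def surrogate_threshold(stat, x, y, n_surr=100, q=0.95, seed=0):
    """95th percentile of the statistic over surrogate pairs: values of
    stat(x, y) above this threshold are declared significant."""
    rng = np.random.default_rng(seed)
    null = [stat(phase_surrogate(x, rng), phase_surrogate(y, rng))
            for _ in range(n_surr)]
    return np.quantile(null, q)

# toy usage with lagged correlation standing in for the DI statistic
rng = np.random.default_rng(2)
x = rng.standard_normal(256)
y = np.roll(x, 1) + 0.1 * rng.standard_normal(256)   # y follows x by one step
lagcorr = lambda a, b: abs(np.corrcoef(a[:-1], b[1:])[0, 1])
t0 = surrogate_threshold(lagcorr, x, y)
print(lagcorr(x, y) > t0)  # the real coupling exceeds the null threshold
```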
Validation

In order to evaluate the network inference algorithm, the F-score is adopted [2]. The F-score can be interpreted as a weighted average of the precision and recall, and is defined as

    F = 2pr / (p + r),  p = TP / (TP + FP),  r = TP / (TP + FN).    (8)

A positive label predicted by the algorithm is considered a true positive (TP) or a false positive (FP) depending on whether a corresponding edge exists in the true network. Similarly, a negative label can be a true negative (TN) or a false negative (FN) depending on the true network. In this paper, we compute the F-score as a function of the expected number of connections ec.

Algorithm 1 Directed network inference
 1: Input: time series for M nodes; ec, the expected number of connections
 2: Initialize D ∈ R^{M×M}, C ∈ {0,1}^{M×M}, cD ∈ R^{M×M} as zero matrices
 3: for i = 1 to M do
 4:   for j = 1 to M do
 5:     D_{i,j} ⇐ DI(i → j)
 6:     if D_{i,j} > t_0 then
 7:       C_{i,j} = 1
 8:     else
 9:       C_{i,j} = 0
10:     end if
11:   end for
12: end for
13: for i = 1 to M − 1 do
14:   for j = i + 1 to M do
15:     if C_{i,j} == 1 and C_{j,i} == 1 then
16:       h = signtest(DI(i → j), DI(j → i))
17:       if h == 1 then
18:         if D_{i,j} < D_{j,i} then
19:           C_{i,j} = 0
20:         else
21:           C_{j,i} = 0
22:         end if
23:       end if
24:     end if
25:   end for
26: end for
27: for i = 1 to M do
28:   for j = 1 to M do
29:     if C_{i,j} == 1 then
30:       cD_{i,j} ⇐ min_k ( DI(x_i → x_j ‖ x_k) / D_{i,j} ), k ≠ i, j
31:     else
32:       cD_{i,j} ⇐ 0
33:     end if
34:   end for
35: end for
36: ca = cD(:)
37: cb = sort(ca) in descending order; keep the top ec connections

5. RESULTS

5.1. Simulated data

We first test our algorithm on a synthetic network of 14 nodes, containing both linear and nonlinear dependencies as shown in Fig. 1(a). We generate 128 realizations of the 14 time series [10] for each node and compute the DI over 10 time samples. The distribution of the 182 pairwise DI values is shown in Fig. 1(b). To determine the significance threshold, we generate 100 randomized networks and recalculate the pairwise DI for each permutation; the distribution of DI for the random networks is also shown in Fig. 1(b). We choose t_0 = 0.80 such that 95% of the DI values for the random network lie below this threshold, and keep all connections above it. We then eliminate bidirectional connections based on the sign test rule, and remove indirect connections due to intermediate and hub nodes using the conditional DI. Finally, the F-score is used to quantify the performance of the algorithm in terms of the number of expected connections ec, as seen in Fig. 1(c). The reconstructed networks with 11 expected connections, using only DI and using both DI and conditional DI, are shown in Figs. 1(d) and 1(e), respectively. For 11 connections, the F-score reaches 0.80. Our method effectively removes most of the indirect connections, such as 12 ↔ 13, and the missed connections are due to weak nonlinear interactions such as 8 → 6.

Fig. 1. (a) The synthetic network; (b) the distribution of directed information; (c) the F-score of the simulated network; (d) the reconstructed network without conditional DI; (e) the reconstructed network with conditional DI.

5.2. EEG data

We examined EEG data from a study of the error-related negativity (ERN). The ERN is a brain potential response that occurs following performance errors in a speeded reaction-time task. Previous work indicates increased information flow associated with the ERN in the theta frequency band (4-7 Hz) and the ERN time window 25-75 ms for error responses (ERN) compared to correct responses (CRN) [11]. We analyze data from 6 subjects for both the CRN and the ERN. The DI measure is computed over a window corresponding to the ERN response, across all 70 trials, between 63 electrode pairs in the theta band (4-8 Hz), and a network with 300 connections is constructed for each subject and each response type. Figs. 2(a) and 2(b) show the connections that at least 4 subjects in each group had in common. Most connections are in the central and parietal lobes for both groups, as expected, since this speeded response task results in increased information flow in the motor cortex. However, the CRN group shows more consistency, with more common connections among subjects, than the ERN group. The ERN group shows stronger information flow between the central and parietal lobes, such as Cz ← CP1, C1 ↔ CPz, C2 ↔ CPz, and CP5 ↔ P1, in accordance with prior results [11]. The bidirectional connections are mostly due to volume conduction in the brain and the short time window used for analysis, which makes it hard to determine the directionality of the information flow.
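The F-score of Eq. (8), used to validate the reconstructed networks above, reduces to a few lines of code; representing directed networks as sets of (source, target) pairs is our own illustrative choice:

```python
def f_score(pred_edges, true_edges):
    """F = 2pr/(p+r) with p = TP/(TP+FP) and r = TP/(TP+FN),
    for directed edge sets of (source, target) pairs."""
    tp = len(pred_edges & true_edges)   # predicted edges that exist
    fp = len(pred_edges - true_edges)   # predicted edges that do not
    fn = len(true_edges - pred_edges)   # true edges that were missed
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

true_net = {(1, 2), (2, 3), (3, 4)}
inferred = {(1, 2), (2, 3), (4, 3)}   # one edge inferred with flipped direction
print(f_score(inferred, true_net))    # p = r = 2/3, so F = 2/3
```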
Fig. 2. (a) Reconstructed network for correct responses; (b) reconstructed network for error responses.

6. CONCLUSIONS

In this paper, we propose a network inference algorithm to infer directed networks using DI and conditional DI. The algorithm is applied to both synthetic data and collected EEG data, and is shown to discriminate between the two response types in terms of the information flow patterns.

7. REFERENCES

[1] A. Rao, A. O. Hero, D. J. States, and J. D. Engel, "Using directed information to build biologically relevant influence networks," in Proc. of CSB, 2007, pp. 145-156.

[2] P. E. Meyer, K. Kontos, F. Lafitte, and G. Bontempi, "Information-theoretic inference of large transcriptional regulatory networks," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2007, no. 1, 2007.

[3] A. J. Butte and I. S. Kohane, "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements," in Pacific Symposium on Biocomputing, 2000, vol. 5, pp. 418-429.

[4] A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Favera, and A. Califano, "ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context," BMC Bioinformatics, vol. 7, 2006.

[5] J. Massey, "Causality, feedback and directed information," in Proc. of Intl. Symp. on Information Theory and its Applications (ISITA), 1990, pp. 27-30.

[6] G. Kramer, "Capacity results for the discrete memoryless network," IEEE Transactions on Information Theory, vol. 49, no. 1, pp. 4-21, 2003.

[7] W. Zhao, E. Serpedin, and E. R. Dougherty, "Inferring connectivity of genetic regulatory networks using information-theoretic criteria," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, no. 2, pp. 262-274, 2008.

[8] Y. Liu, S. Aviyente, and M. Al-khassaweneh, "A high dimensional directed information estimation using data-dependent partitioning," in Proc. of IEEE Workshop on Statistical Signal Processing (SSP), 2009, pp. 606-609.

[9] G. A. Darbellay and I. Vajda, "Estimation of the information by an adaptive partitioning of the observation space," IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315-1321, 1999.

[10] T. Q. Tung, T. Ryu, K. H. Lee, and D. Lee, "Inferring gene regulatory networks from microarray time series data using transfer entropy," in Proc. of the 20th IEEE Intl. Symp. on Computer-Based Medical Systems (CBMS), 2007, pp. 383-388.

[11] S. Aviyente, E. M. Bernat, W. S. Evans, C. J. Patrick, and S. R. Sponheim, "A phase synchrony measure for quantifying dynamic functional integration in the brain," submitted to Human Brain Mapping, 2008.