Improving Generalization of Sequence Encoder-Decoder Networks for Inverse Imaging of Cardiac Transmembrane Potential
Sandesh Ghimire, Prashnna Kumar Gyawali, John L Sapp, Milan Horacek, Linwei Wang
arXiv:1810.05713v1 [cs.LG] 12 Oct 2018
Golisano College of Computing and Information Sciences, Rochester Institute of Technology, USA
Abstract
Deep learning models have shown state-of-the-art performance in many inverse reconstruction problems. However, it is not well understood what properties of the latent representation may improve the generalization ability of the network. Furthermore, limited models have been presented for inverse reconstructions over time sequences. In this paper, we study the generalization ability of a sequence encoder-decoder model for solving inverse reconstructions on time sequences. Our central hypothesis is that the generalization ability of the network can be improved by 1) constrained stochasticity and 2) global aggregation of temporal information in the latent space. First, drawing from analytical learning theory, we theoretically show that a stochastic latent space will lead to an improved generalization ability. Second, we consider an LSTM encoder-decoder architecture that compresses a global latent vector from all last-layer units in the LSTM encoder. This model is compared with alternative LSTM encoder-decoder architectures, each in deterministic and stochastic versions. The results demonstrate that the generalization ability of an inverse reconstruction network can be improved by constrained stochasticity combined with global aggregation of temporal information in the latent space.
Introduction
There has been an upsurge in the use of deep learning based methods for inverse problems in computer vision and medical imaging (Lucas et al. 2018). Examples include image denoising (Mao, Shen, and Yang 2016), inpainting (Pathak et al. 2016), super resolution (Wang et al. 2015), and image reconstruction in a variety of medical imaging modalities such as magnetic resonance imaging, X-ray, and computed tomography (Jin et al. 2017; Wang et al. 2016; Chen et al. 2017). A common deep network architecture for inverse reconstruction is an encoder-decoder network, which learns to encode an input measurement into a latent representation that is then decoded into the desired inverse solution (Pathak et al. 2016; Zhu et al. 2018). Despite the significant successes of these models, two important questions remain relatively unexplored. First, it is not well understood what properties of the latent representation improve the generalization ability of the network. Second, existing works mostly focus on solving single-image problems, while limited models exist for solving inverse problems on image or signal
sequences. The latter, however, is important because the incorporation of temporal information can often help alleviate the ill-posedness of an inverse reconstruction problem. In this paper, we present a probabilistic sequence encoder-decoder model for solving inverse problems on time sequences. Central to the presented model is an emphasis on the generalization ability of the network, i.e., on learning a general inverse mapping from the measurements to the reconstruction. Our main hypothesis is that the generalization ability of the sequence encoder-decoder network can be improved by two properties of the latent space: 1) constrained stochasticity, and 2) global aggregation of information throughout the long time sequence. First, we theoretically show that using a stochastic latent space during training helps learn a decoder that is less sensitive to local changes in the latent space, which, based on analytical learning theory (Kawaguchi and Bengio 2018), leads to good generalization ability of the network. Second, while it is common in sequence models to use the last unit of a recurrent encoder network for decoding, we hypothesize that, in the presence of long time sequences, the last hidden unit may not be able to retain global information and may act as a bottleneck in information flow. This shares a fundamental idea with recent works in which alternative architectures with attention mechanisms were presented to consider information from all the hidden states of long short-term memory (LSTM) networks (Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015). Alternative strategies were also presented to back-propagate the loss through all hidden states to address the issue of vanishing gradients (Lipton et al. 2015; Yue-Hei Ng et al. 2015). Here, to arrive at a compact latent space that globally summarizes a long time sequence, we present an architecture (termed svs) that combines and compresses all LSTM units into a vector representation. While the presented methodology applies to general sequence inverse problems, we test it on the inverse reconstruction of the cardiac transmembrane potential (TMP) sequence from high-density electrocardiogram (ECG) sequences acquired on the body surface (MacLeod et al. 1995; Ramanathan et al. 2004; Wang et al. 2013). In this setting, the measurement at each time instant is a spatial potential map on the body surface, and the reconstruction output is a spatial potential map throughout the three-dimensional heart
muscle. The problem at each time instant is severely ill-posed, and incorporating temporal information is recognized as a main approach to alleviate this issue. To investigate the generalization ability of the presented model, we analyzed in depth the benefit brought by each of the two key components of the presented model. Specifically, we compared the presented svs architecture with two alternatives: one that decodes directly from the sequence produced by the LSTM encoder (termed sss) and one that decodes from the output of the last unit of the LSTM encoder, as most commonly used in language models (termed svs-L) (Sutskever, Vinyals, and Le 2014). We also compared deterministic and stochastic versions of svs and svs-L. The experimental results suggest that the generalization ability of the network in inverse reconstruction can be improved by constrained stochasticity combined with global aggregation of sequence information in the latent space. These findings may set a foundation for investigating the generalization ability of deep networks in sequence inverse problems and in inverse problems in general.
Related Work
There is a large body of work on the use of deep learning for inverse imaging, both in general computer vision (Lucas et al. 2018; Mao, Shen, and Yang 2016; Wang et al. 2015; Yao et al. 2017; Fischer et al. 2015) and in medical imaging (Jin et al. 2017; Wang et al. 2016; Chen et al. 2017). Some of these deep inverse reconstruction networks are based on an encoder-decoder structure (Pathak et al. 2016; Zhu et al. 2018), similar to that investigated in this paper. Among these works in the domain of medical image reconstruction, the presented work is closest to Automap (Zhu et al. 2018), in that the output image is reconstructed directly from the input measurements without any domain-specific intermediate transformations. However, all of these works employ a deterministic architecture which, as we will show later, can be improved with the introduction of stochasticity. Nor do these existing works handle inverse reconstructions of images or signals over time sequences.

Different elements of the presented work are conceptually similar to several works across different domains of machine learning. The use of encoder-decoder architectures for sequential signals is related to existing work in language translation (Sutskever, Vinyals, and Le 2014). However, we investigate an alternative global aggregation of sequence information by utilizing and compressing knowledge from all the units in the last layer of the LSTM encoder. This is similar in concept to the works in (Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015), where information from all the units of an LSTM encoder is used for language translation. However, to our knowledge, no existing works have analyzed in depth the difference in the generalization ability of sequence encoder-decoder models with respect to these different designs of the latent space.

The presented theoretical analysis of stochasticity in generalization utilizes analytical learning theory (Kawaguchi and Bengio 2018), which is fundamentally different from classical statistical learning theory in that it is strongly instance-dependent. While statistical learning theory deals with data-independent generalization bounds or data-dependent bounds for a certain hypothesis space of problems, analytical learning theory provides a bound on how well a model learned from a dataset performs under the true (unknown) measure of the variables of interest. This makes it well suited for measuring the generalization ability of a stochastic latent space for the given problem and data.

The presented work is related to the variational autoencoder (VAE) in its use of a stochastic latent space with regularization (Kingma and Welling 2013). Similarly, (Bowman et al. 2015) present a sequence-to-sequence VAE based on LSTM encoders and decoders to generate coherent and diverse sentences from continuous sampling of the latent code. However, it is not well understood why stochasticity of the latent space is so important. In this paper, we intend to provide a justification from the learning theory perspective. In addition, while a VAE by nature is concerned with the reconstruction of the same input data, the presented network is concerned with the ill-posed problem of inverse signal/image reconstruction from (often weak) measurements.

Preliminaries
Inverse Imaging of cardiac transmembrane potential (TMP)
Body-surface electrical potential is produced by TMP in the heart. Their mathematical relation is defined by the quasi-static approximation of electromagnetic theory (Plonsey 1969) and, when solved on patient-specific heart-torso geometry, can be derived as (Wang et al. 2010):

y = Hx    (1)
where y denotes the body-surface potential map, x the 3D TMP map over the heart muscle, and H the measurement matrix specific to the heart-torso geometry of an individual. The inverse reconstruction of x from y can be carried out at each time instant independently, which, however, is notoriously ill-posed since the surface potential provides only a weak projection of the 3D TMP. A popular approach is thus to reconstruct the time sequence of TMP propagation over a heart beat, with various strategies to incorporate temporal information to alleviate the ill-posedness of the problem (Wang et al. 2010; Greensite and Huiskamp 1998). Here, we examine the sequence setting, where x and y represent sequence matrices with each column denoting the potential map at one time instant. This problem has important clinical applications in supporting the diagnosis and treatment of diseases such as ischemia (MacLeod et al. 1995) and ventricular arrhythmia (Wang et al. 2018).
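To make the ill-posedness concrete, below is a minimal numerical sketch of the sequence formulation of Eq. (1). The matrix sizes and the random stand-in for H are hypothetical (the actual H is derived from the patient-specific heart-torso geometry, not sampled at random); the point is only that a frame-by-frame pseudo-inverse recovers little of x when the body-surface measurement is much lower-dimensional than the TMP.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heart, n_body, n_time = 1862, 120, 300    # hypothetical mesh-node, electrode, and time counts
H = rng.standard_normal((n_body, n_heart))  # random stand-in for the geometry-specific operator
X = rng.standard_normal((n_heart, n_time))  # stand-in TMP sequence (each column = one time instant)

Y = H @ X                                   # sequence forward model, Eq. (1), applied column-wise

# Frame-by-frame least-squares inversion: with n_body << n_heart the system is
# underdetermined, so the minimum-norm solution recovers little of the true X.
X_hat = np.linalg.pinv(H) @ Y
rel_err = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.2f}")
```

With this random stand-in the relative error stays close to 1.0; the real H is structured rather than random, but the same underdetermination is why temporal information across the whole sequence is exploited instead of inverting each frame independently.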
RNN and LSTM
Recurrent neural networks (RNN) (Werbos 1990; Kalchbrenner and Blunsom 2013) generalize traditional neural networks to sequential data by allowing the passage of information over time. Compared to traditional RNNs, LSTM networks (Graves 2013; Sutskever, Vinyals, and Le 2014) can better handle long-term dependencies in the data through architectural changes such as a memory line called the cell state that runs throughout the whole sequence, the ability to forget information deemed irrelevant using a forget gate, and the ability to selectively update the cell state.

Figure 1: Presented svs stochastic architecture with mean and variance networks in both the encoder and decoder.
Variational Autoencoder (VAE)
A VAE (Kingma and Welling 2013) is a probabilistic generative model that is typically trained by optimizing the variational lower bound of the data log-likelihood. A VAE is distinct from the traditional autoencoder in two aspects: 1) the use of a stochastic latent space, realized by sampling with a reparameterization trick, and 2) the use of the Kullback-Leibler (KL) divergence to regularize the latent distribution.
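For reference, a minimal PyTorch sketch of these two ingredients, assuming a diagonal Gaussian latent distribution parameterized by a mean and a log-variance (function names are illustrative):

```python
import torch

def reparameterize(mu, logvar):
    """Reparameterization trick: w = mu + sigma * eps with eps ~ N(0, I),
    so gradients flow through mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def kl_to_standard_normal(mu, logvar):
    """Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```

Both pieces reappear in the training objective derived in the Methodology section.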
Methodology
We train a probabilistic sequence encoder-decoder network to learn to reconstruct the time sequence of TMP, x, from the input body-surface potential, y. In a supervised setting, we maximize the log-likelihood as follows:

\arg\max_{\theta} \; \mathbb{E}_{P(x,y)}\big[\log p_\theta(x|y)\big]    (2)
where P(x, y) is the joint distribution of the input-output pair. We introduce a latent random variable w and express the conditional distribution as:

p_\theta(x|y) = \int p_{\theta_2}(x|w)\, p_{\theta_1}(w|y)\, dw    (3)

where θ = {θ_1, θ_2}. We model both p_{θ_2}(x|w) and p_{θ_1}(w|y) with Gaussian distributions, with mean and variance parameterized by neural networks:

p_{\theta_1}(w|y) = \mathcal{N}\big(w \mid t_{\theta_1}(y), \sigma_t^2(y)\big)    (4)
p_{\theta_2}(x|w) = \mathcal{N}\big(x \mid g_{\theta_2}(w), \sigma_x^2(w)\big)    (5)
where σ_x² denotes a matrix of the same dimension as x. We implicitly assume that each element of x is independent and Gaussian with variance given by the corresponding element of σ_x², and similarly for w. Introducing the latent random variable into the network allows us to constrain it in two ways to improve the generalization ability of inverse reconstruction. First, we constrain the conditional distribution p_{θ_1}(w|y) to be close to an isotropic Gaussian distribution. Second, we design it to be a concise vector representation compressed from the whole input time sequence.
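A small sketch of how Eqs. (3)-(5) could be expressed with torch.distributions, where enc and dec are hypothetical mean/log-variance networks (enc(y) -> (t, logvar_t), dec(w) -> (g, logvar_x)) and the integral in Eq. (3) is approximated by simple Monte Carlo:

```python
import torch
from torch.distributions import Normal

def log_marginal_mc(enc, dec, y, x, n_samples=16):
    """Monte Carlo estimate of log p_theta(x|y) in Eq. (3), with the two conditionals
    of Eqs. (4)-(5) modeled as diagonal Gaussians. enc and dec are hypothetical
    mean / log-variance networks, not the authors' exact modules."""
    t, logvar_t = enc(y)
    p_w_given_y = Normal(t, torch.exp(0.5 * logvar_t))
    log_px = []
    for _ in range(n_samples):
        w = p_w_given_y.rsample()                                   # sample w ~ p(w|y)
        g, logvar_x = dec(w)
        p_x_given_w = Normal(g, torch.exp(0.5 * logvar_x))
        log_px.append(p_x_given_w.log_prob(x).flatten(1).sum(-1))   # sum over all elements of x
    # log( (1/S) * sum_s p(x|w_s) ): a simple estimator of the integral in Eq. (3)
    return torch.logsumexp(torch.stack(log_px), dim=0) - torch.log(
        torch.tensor(float(n_samples)))
```

In practice the network is not trained with this direct estimate but with the regularized upper bound derived next.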
Regularized stochasticity
Drawing from the VAE (Kingma and Welling 2013), we regularize the latent space by constraining the conditional distribution p_{θ_1}(w|y) to be close to an isotropic Gaussian distribution. Training of the network can then be formulated as a constrained optimization problem:

\min_\theta \; -\mathbb{E}_{P(x,y)}\Big[\log \int p_{\theta_2}(x|w)\, p_{\theta_1}(w|y)\, dw\Big]
\text{such that} \quad D_{KL}\big(p_{\theta_1}(w|y) \,\|\, \mathcal{N}(w \mid 0, I)\big) < \delta    (6)
Using the method of Lagrange multipliers, we reformulate the objective function into:

L = -\mathbb{E}_{P(x,y)}\Big[\log \int p_{\theta_2}(x|w)\, p_{\theta_1}(w|y)\, dw\Big] + \lambda \big(D_{KL}(p_{\theta_1}(w|y) \,\|\, \mathcal{N}(w \mid 0,I)) - \delta\big)
\;\le\; -\mathbb{E}_{P(x,y)}\Big[\mathbb{E}_{w \sim p_{\theta_1}(w|y)}\big[\log p_{\theta_2}(x|w)\big]\Big] + \lambda \big(D_{KL}(p_{\theta_1}(w|y) \,\|\, \mathcal{N}(w \mid 0,I)) - \delta\big)    (7)

where the inequality in Eq. (7) is due to Jensen's inequality, as the negative logarithm is a convex function. We use the reparameterization w = t + σ_t ⊙ ε, with ε ~ N(0, I), as described in (Kingma and Welling 2013) to compute the inner expectation in the first term. The KL divergence in the second term is analytically available for two Gaussian distributions. We thus obtain the upper bound on the loss function:

L \;\le\; \mathbb{E}_{P(x,y)}\Big[\mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)} \sum_i \Big(\tfrac{1}{\sigma_{x_i}^2}\big(x_i - g_i(t + \sigma_t \odot \epsilon)\big)^2 + \log \sigma_{x_i}^2\Big)\Big] + \lambda\, D_{KL}\big(p_{\theta_1}(w|y) \,\|\, \mathcal{N}(w \mid 0,I)\big) + \text{constant}    (8)

where g_i is the i-th function mapping the latent variable to the i-th element of the mean of x, such that g_{θ_2} = [g_1, g_2, ..., g_U].
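A minimal sketch of how the bound in Eq. (8) translates into a training loss, assuming hypothetical encoder and decoder modules that each output a mean and a log-variance (the additive constant in Eq. (8) is dropped):

```python
import torch

def svs_training_loss(enc, dec, y, x, lam=1.0):
    """Single-sample estimate of the upper bound in Eq. (8): a Gaussian reconstruction
    term with learned per-element variance plus a lambda-weighted KL regularizer.
    enc and dec are hypothetical mean / log-variance networks (cf. Fig. 1)."""
    t, logvar_t = enc(y)                          # parameters of p(w|y)
    eps = torch.randn_like(t)
    w = t + torch.exp(0.5 * logvar_t) * eps       # reparameterization w = t + sigma_t * eps

    g, logvar_x = dec(w)                          # parameters of p(x|w)
    recon = ((x - g) ** 2 / logvar_x.exp() + logvar_x).sum()     # first term of Eq. (8)
    kl = -0.5 * (1 + logvar_t - t ** 2 - logvar_t.exp()).sum()   # analytic KL to N(0, I)
    return recon + lam * kl
```

A training step would then compute svs_training_loss(enc, dec, y_batch, x_batch, lam) and back-propagate it; the weight lam plays the role of the Lagrange multiplier λ in Eq. (7).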
Global aggregation of sequence information
In sequence inverse problems where the measurement at each time instant provides only a weak projection of the reconstruction solution, utilizing the temporal information in the sequence becomes important for better inverse reconstruction. This motivates us to design an architecture that can distill from the input sequence a global, time-invariant, and low-dimensional latent representation from which the entire TMP sequence can be reconstructed. To do so, we present an architecture with two LSTM networks followed by two fully connected neural networks (FC), respectively for the mean and variance in the encoder network. The decoder then consists of two FC followed by two LSTM networks for the mean and variance of the output. In the encoder, each LSTM decreases the spatial dimension while keeping the temporal dimension constant; the last-layer outputs from all the units in each LSTM are reshaped into a vector, the length of which is decreased by the FC. The structure and dimensions of the decoder mirror those of the encoder. The overall architecture of the presented network is illustrated in Fig. 1.
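The following PyTorch sketch illustrates this svs design under illustrative layer sizes; the authors' exact dimensions, layer counts, and hyperparameters are not specified here.

```python
import torch
import torch.nn as nn

class SvsEncoder(nn.Module):
    """Sketch of the svs encoder: an LSTM reduces the spatial dimension at every
    time step, the outputs of ALL time steps are flattened into one vector, and an
    FC layer compresses it into a global latent mean (a twin branch gives the
    log-variance). Sizes are illustrative, not the authors' settings."""
    def __init__(self, n_body=120, n_time=300, hidden=64, latent=32):
        super().__init__()
        self.lstm_mu = nn.LSTM(n_body, hidden, batch_first=True)
        self.lstm_lv = nn.LSTM(n_body, hidden, batch_first=True)
        self.fc_mu = nn.Linear(n_time * hidden, latent)
        self.fc_lv = nn.Linear(n_time * hidden, latent)

    def forward(self, y):                      # y: (batch, n_time, n_body)
        h_mu, _ = self.lstm_mu(y)              # (batch, n_time, hidden): all units kept
        h_lv, _ = self.lstm_lv(y)
        mu = self.fc_mu(h_mu.flatten(1))       # global aggregation over the whole sequence
        logvar = self.fc_lv(h_lv.flatten(1))
        return mu, logvar

class SvsDecoder(nn.Module):
    """Mirror of the encoder: an FC layer expands the latent vector back into a
    sequence, and an LSTM maps it to the mean (twin branch: log-variance) of x."""
    def __init__(self, n_heart=1862, n_time=300, hidden=64, latent=32):
        super().__init__()
        self.n_time, self.hidden = n_time, hidden
        self.fc = nn.Linear(latent, n_time * hidden)
        self.lstm_mu = nn.LSTM(hidden, n_heart, batch_first=True)
        self.lstm_lv = nn.LSTM(hidden, n_heart, batch_first=True)

    def forward(self, w):                      # w: (batch, latent)
        h = self.fc(w).view(-1, self.n_time, self.hidden)
        g, _ = self.lstm_mu(h)                 # (batch, n_time, n_heart)
        logvar_x, _ = self.lstm_lv(h)
        return g, logvar_x
```

In the deterministic variants the log-variance branches would be dropped, and in the svs-L baseline only the last time step (e.g., h_mu[:, -1]) rather than the full flattened sequence would be passed to the FC layer.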
Encoder-Decoder Learning from the Perspective of Analytical Learning Theory
In this section we look at encoder-decoder inverse reconstruction from the perspective of analytical learning theory (Kawaguchi and Bengio 2018). We start with a deterministic latent space and then show that having a stochastic latent space with regularization helps generalization. Let z = (y, x) be an input-output pair, and let D_n = {z^(1), z^(2), ..., z^(n)} denote the total set of training and validation data, with Z_m ⊂ D_n a validation set. During training, a neural network learns the parameters θ using an algorithm A and dataset D_n, at the end of which we have a mapping h_{A(D_n)}(·) from y to x. Typically, we stop training when the model performs well on the validation set. To evaluate this performance, we define a loss function based on our notion of goodness of prediction as ℓ(x, h_{A(D_n)}(y)). The average validation error is given by E_{Z_m} ℓ(x, h_{A(D_n)}(y)). However, there exists a gap between how well the model performs on the validation set versus on the true distribution of the input-output pair; this gap is called the generalization gap. To be precise, let (Z, S, µ) be a measure space with µ a measure on (Z, S). Here, Z = Y × X denotes the input-output space of all observations and inverse solutions. The generalization gap is given by:

\mathbb{E}_{\mu}\, \ell\big(x, h_{A(D_n)}(y)\big) - \mathbb{E}_{Z_m}\, \ell\big(x, h_{A(D_n)}(y)\big)    (9)
Note that this generalization gap depends on the specific problem instance. Theorem 1 (Kawaguchi and Bengio 2018) provides an upper bound on Eq. (9) in terms of the data distribution in the latent space and properties of the decoder.

Theorem 1 ((Kawaguchi and Bengio 2018)). For any ℓ, let (T, f) be a pair such that T : (Z, S) → ([0,1]^d, B([0,1]^d)) is a measurable function, f : ([0,1]^d, B([0,1]^d)) → (R, B(R)) is of bounded variation V[f] < ∞, and ℓ(x, h(y)) = (f ∘ T)(z) for all z ∈ Z, where B(A) denotes the Borel σ-algebra on A. Then for any dataset pair (D_n, Z_m) and any ℓ(x, h_{A(D_n)}(y)),

\mathbb{E}_{\mu}\, \ell\big(x, h_{A(D_n)}(y)\big) - \mathbb{E}_{Z_m}\, \ell\big(x, h_{A(D_n)}(y)\big) \;\le\; V[f]\, D^*\big[T_*\mu, T(Z_m)\big]

where T_*µ is the pushforward measure of µ under the map T.
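As a rough illustration of what the discrepancy term D*[T_*µ, T(Z_m)] measures, the sketch below gives a crude Monte Carlo lower bound on the star discrepancy between a large sample of latent codes (standing in for T_*µ) and the validation latent codes, both assumed rescaled to [0,1]^d. The sampling scheme and box count are illustrative, not part of the paper's method.

```python
import numpy as np

def star_discrepancy_mc(z_true, z_val, n_boxes=10000, seed=0):
    """Crude Monte Carlo lower bound on the star discrepancy between a large
    sample z_true (approximating the pushforward measure T_*mu) and the
    validation latent codes z_val; both arrays are (num_points, d) in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    d = z_true.shape[1]
    worst = 0.0
    for _ in range(n_boxes):
        b = rng.uniform(size=d)                        # random anchored box [0, b)
        p_true = np.mean(np.all(z_true < b, axis=1))   # mass of T_*mu in the box
        p_val = np.mean(np.all(z_val < b, axis=1))     # empirical validation mass
        worst = max(worst, abs(p_true - p_val))
    return worst
```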
For an encoder-decoder setup, T is the encoder, which maps the observation to the latent space, and f is the composition of the loss function and the decoder, which maps the latent representation to the reconstruction loss. Note that the [0,1]^d latent domain can easily be extended to a d-orthotope, as long as the latent variables are bounded, using a function s composed of scaling and translation in each dimension. Since s is uniformity preserving and affects the partial derivatives of f only up to a scaling factor, it does not affect our analysis. In practice, there always exist intervals within which the latent representations are bounded. Theorem 1 provides two ways to decrease the generalization gap in our problem: by decreasing the variation V[f] or the discrepancy D*[T_*µ, T(Z_m)]. Here, we show that constrained stochasticity of the latent space helps decrease the variation V[f]. The variation of f on [0,1]^d in the sense of Hardy and Krause (Hardy 1906) is defined as:

V[f] = \sum_{k=1}^{d} \; \sum_{1 \le j_1 < \cdots < j_k \le d} V^{(k)}\big[f_{j_1 \ldots j_k}\big]    (10)
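For intuition, under the standard anchored Hardy and Krause convention (the definition of V^{(k)} is cut off in the text above, so this expansion assumes that convention), the d = 2 case of Eq. (10) reads:

V[f] = V^{(1)}[f_1] + V^{(1)}[f_2] + V^{(2)}[f_{12}]

where f_1 and f_2 are the restrictions of f to a single coordinate (with the other coordinate fixed at the anchor) and V^{(2)}[f_{12}] is the Vitali variation of f over the full unit square.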