ECHOCARDIOGRAM VIEW CLASSIFICATION USING LOW-LEVEL FEATURES

Hui Wu⋆, Dustin M. Bowers⋆, Toan T. Huynh†, and Richard Souvenir⋆

⋆Department of Computer Science, University of North Carolina at Charlotte
†Department of General Surgery, Carolinas Medical Center

ABSTRACT
This paper presents a view classification method for 2D heart ultrasound. Our method uses low-level image features to train a frame-level classifier, which, unlike related approaches, does not require an additional pixel-level classification of heart structures. By employing kernel-based classification, our algorithm can classify images from any phase of the heartbeat cycle and efficiently incorporate information from subsequent frames without re-training the model. On real-world data, our algorithm achieves 98.51% accuracy for 8-way classification. While the method can efficiently aggregate multiple frames, in ∼94% of the tests, identification requires only a single frame.

Index Terms— echocardiogram, support vector classification, ultrasound
1. INTRODUCTION

Non-invasive measures of heart function are useful for the assessment of a patient's condition, and 2D ultrasound is a ubiquitous tool in clinical settings. While the resulting echocardiograms are typically interpreted by trained cardiologists, automated methods have recently been developed to aid in analysis, such as detecting heart chamber boundaries, detecting specific structures, and 3D reconstruction of the heart. Automated methods provide real-time advantages when an expert is not available for bedside interpretation, which is a frequent occurrence in the clinical setting. Many automated methods integrate information from multiple 2D views during processing (Figure 1). Typically, the views are manually specified and the algorithms are applied after the data has been collected. Real-time echocardiogram view selection would support more online methods for echocardiogram analysis. In this paper, we present an algorithm for echocardiogram view classification from video that employs low-level image features, which can be computed efficiently from the echocardiograms.

This material is based upon work supported by the National Science Foundation under Grant No. 1156822.
Fig. 1. Our algorithm supports 2D echocardiogram view classification. The apical 4-chamber (blue) and parasternal short axis (red) are two common views.

2. RELATED WORK

Echocardiogram view classification is a problem that has been previously considered. The method of Ebadollahi et al. [1] uses a generic cardiac chamber template for detection, models the properties of the detected chambers using Markov Random Fields, and uses a multi-class SVM for classification. This method relies on an external signal to identify the end-diastolic (ED) frame, where the heart is maximally dilated. The method of Zhou et al. [2] also makes use of the ED keyframes and uses boosted weak classifiers of Haar-like local rectangle features to generalize the detection of specific heart structures for view classification. Additionally, there have been approaches that do not rely on an external marker for identifying ideal frames for classification. The method of Kumar et al. [3] employs video features based on optical flow and image edge maps. CardiacVC [4] uses the multi-class LogitBoost algorithm to perform 4-way echocardiogram view classification. Our method does not require external gating to select specific images, nor image-level annotations and classification of specific heart structures.

3. APPROACH

Expert clinicians can quickly identify the transducer location window and image plane from a single frame of data. While other automated approaches rely on accurately detecting specific chambers or structures within an image on specific frames, we take a global approach to the problem and model the data using low-level features. In this section, we describe our feature representation, our kernel-based classification approach, and our algorithm for echocardiogram view selection from single or multiple frames.
Fig. 2. GIST feature visualization of two different views: (a) A4C and (b) SCLX.
3.1. Feature Representation

We use the GIST feature transform [5] for image representation. This descriptor was designed to provide a compact, robust, and discriminative model of a scene without the need for segmentation or classification of specific objects or image patches. It has been widely used for outdoor scene classification, scene perception, large-scale object recognition of compressed images, and as a pre-processing step for clinical content-based image retrieval [6]. The GIST feature transform computes the spectral energy of the image to give a global description using a single feature vector. It consists of spectrograms generated from a grid of 4 × 4 non-overlapping image blocks. Within each block, multiple oriented Gabor filters at different scales are applied to model structures of various sizes. Figure 2 shows the visualization of two echocardiograms and their corresponding GIST feature representations. Within each block, the colored radial histograms represent the intensity of the filter output at different scales and directions. The features from each block are concatenated to produce the final vector representation $x \in \mathbb{R}^d$, where $d$ is the total number of measurements at each scale and orientation across all of the image blocks.
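To make the representation concrete, the following is a minimal sketch of a GIST-like descriptor, not the exact implementation used here: it averages Gabor filter energy over a 4 × 4 grid at 8 orientations and 3 scales, yielding the 384-dimensional vector described in Section 4. The function name, filter frequencies, and defaults are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def gist_descriptor(image, n_blocks=4, n_orientations=8,
                    frequencies=(0.1, 0.2, 0.4)):
    """GIST-like descriptor: per-block Gabor filter energy (illustrative)."""
    image = image.astype(np.float64)
    h, w = image.shape
    bh, bw = h // n_blocks, w // n_blocks
    features = []
    for freq in frequencies:                # filter scales (assumed values)
        for k in range(n_orientations):     # filter orientations
            g = np.real(gabor_kernel(freq, theta=np.pi * k / n_orientations))
            response = np.abs(fftconvolve(image, g, mode='same'))
            # Average the filter energy within each block of the 4x4 grid.
            for i in range(n_blocks):
                for j in range(n_blocks):
                    block = response[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
                    features.append(block.mean())
    return np.asarray(features)  # 3 scales x 8 orientations x 16 blocks = 384
```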
3.2. Support Vector Classification

Support vector machines (SVMs) have been widely used for supervised classification. Here, we briefly review SVM classification and highlight two properties, class probability estimates and efficient feature vector concatenation, which form the basis of our view classification algorithm.

Consider a classification problem with data $\{x_i\}_{i=1}^{N}$ and the associated binary class labels $\{y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, -1\}$. An SVM learns the separating hyperplane in feature space that maximizes the classification margin between the two classes. Using the so-called "kernel trick", an SVM can be used to learn nonlinear boundaries in feature space by solving the following optimization problem:

$$\max_{\alpha} \quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \kappa(x_i, x_j) \quad (1)$$

$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0, \quad \forall i = 1, \cdots, N$$

where $C$ is the regularization parameter, $\{\alpha_i\}$ are Lagrange multipliers that are non-zero for the corresponding support vectors, and $\kappa(\cdot, \cdot)$ is a kernel function. We use the radial basis kernel function, $\kappa(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$, where $\gamma$ defines the kernel width. After training, given a new example, $x$, the class ($\pm 1$) is determined by:

$$f(x) = \operatorname{sgn}\left( \sum_i \alpha_i y_i \kappa(x_i, x) \right) \quad (2)$$

For multiclass problems, we use one-versus-one (OVO) classification, where, for $K$ classes, $K(K-1)/2$ binary classifiers are used in combination with a voting scheme for classification. The next two sections describe the two additional properties of SVM classification used by our algorithm.

3.2.1. Class Probability Estimates

As described, SVMs provide a hard classification; each data point is assigned to a single class. Probability models for multi-class SVM classification can be used to assess the confidence of a prediction. For a given example, $x$, $p_i(x)$ denotes the probability that $x$ belongs to class $i$. Using the binary classifiers, the pairwise probabilities of classes $i$ and $j$, $r_{ij}$, are estimated. The class probabilities, $p = [p_1, p_2, \cdots, p_K]$, are obtained by the following optimization [7]:

$$\min_{p} \quad \frac{1}{2} \sum_{i=1}^{K} \sum_{j \neq i} (r_{ji} p_i - r_{ij} p_j)^2 \quad (3)$$

$$\text{s.t.} \quad p_i \geq 0, \; \forall i = 1, \cdots, K, \quad \sum_i p_i = 1$$

which can be solved efficiently using an iterative approach.

3.2.2. Feature Vector Concatenation

Most classifiers require fixed-length vectors as input. In the case of video data, this typically means that data must be acquired at fixed intervals, or variable-length data must be quantized and represented as normalized histograms. The quantization step in the latter approach often requires many free parameters to be tuned to balance the feature vector length with the discriminative power of the classification approach.

It can be shown that a linear combination of kernels is itself a kernel. That is, $\kappa_\eta(\cdot) = \sum_{i=1}^{P} \eta_i \kappa_i(\cdot)$, where $\eta$ is a vector of weights for $P$ kernels, $\{\kappa_i\}$. In the case of convex $\eta$ (e.g., $\eta \in \mathbb{R}_{+}^{P}$ and $\sum_i \eta_i = 1$), the weights represent the importance of each kernel, and the combined kernel represents the similarity in a feature space defined by the concatenation of the individual feature vectors [8]. So, given a trained classifier and a set of $P$ data points, $X = \{x_1, x_2, \ldots, x_P\}$, we can extend single-example classification (Equation 2) to sets of arbitrary size:

$$f(X, \eta) = \operatorname{sgn}\left( \sum_{j, x_j \in X} \eta_j \sum_i \alpha_i y_i \kappa(x_i, x_j) \right) \quad (4)$$
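As an illustration, here is a minimal numpy sketch of the set classification in Equation 4 for a single binary SVM, assuming the support vectors and the products $\alpha_i y_i$ are available from training; the function and variable names are illustrative, not from the paper's implementation.

```python
import numpy as np

def rbf(a, b, gamma):
    """RBF kernel from Eq. (1)-(2): exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def classify_set(frames, support_vectors, alpha_y, gamma, eta=None):
    """Eq. (4): classify a set of feature vectors with convex weights eta."""
    P = len(frames)
    eta = np.full(P, 1.0 / P) if eta is None else eta  # uniform by default
    score = 0.0
    for w, x in zip(eta, frames):
        k = np.array([rbf(sv, x, gamma) for sv in support_vectors])
        score += w * np.dot(alpha_y, k)  # alpha_y[i] = alpha_i * y_i
    return np.sign(score)
```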
While it would be possible to re-train the model in this joint feature space, this would require re-training for every possible size of the feature set. In the next section, we describe our algorithm, which only requires training a single classifier.

3.3. Algorithm

Training requires a set of $N$ images, $I = \{I_1, I_2, \ldots, I_N\}$, and the associated labels $Z = \{z_1, z_2, \ldots, z_N \mid z_i \in [1, K]\}$, where $K$ is the number of different classes. For each image, $I_i$, we calculate the GIST descriptor, $x_i$ (Section 3.1), and train an SVM (Section 3.2).

Given a trained single-image SVM, and combining the ideas from Section 3.2, our classification algorithm (Algorithm 1) operates as follows. The feature transform of the new image is computed and evaluated against the trained multi-class SVM. The probability estimate of the classification is used as the confidence of the classification. If the class probability is above a provided threshold, the classification is returned. If not, the next image is acquired and transformed, a new feature is constructed as the convex sum of the kernels, and the result is evaluated against the multi-class SVM. This process is repeated until a classification is returned or the maximum video window has been reached.
Algorithm 1
Input: images, I; confidence threshold, τ_p; max frames, t_MAX
Output: class label, ĉ
1.  t ← 1
2.  repeat
3.      Calculate the GIST descriptor, x_t, of image I_t
4.      X ← {x_1, . . . , x_t}
5.      η ← t-length vector of 1/t
6.      Estimate the class label, ĉ, with the SVM using X and η
7.      Estimate the class probability, p_ĉ
8.      t ← t + 1
9.  until p_ĉ ≥ τ_p or t > t_MAX
10. return ĉ
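A hedged sketch of Algorithm 1 follows, using scikit-learn's precomputed-kernel SVC (whose probability estimates implement the pairwise coupling of [7]) in place of the Matlab/libsvm implementation described in Section 4. Averaging kernel rows across frames corresponds to the uniformly weighted combined kernel of Section 3.2.2. `gist_descriptor` refers to the sketch in Section 3.1, and `acquire_frame` is a hypothetical stand-in for the video feed.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def train_view_svm(X_train, y_train, C=1024, gamma=64):
    """Train once on single-frame descriptors with a precomputed RBF kernel."""
    K = rbf_kernel(X_train, X_train, gamma=gamma)
    clf = SVC(C=C, kernel="precomputed", probability=True)
    clf.fit(K, y_train)
    return clf

def classify_view(acquire_frame, clf, X_train, gamma=64, tau_p=0.9, t_max=4):
    """Algorithm 1: add frames until the class probability exceeds tau_p."""
    rows, c_hat = [], None
    for _ in range(t_max):
        x = gist_descriptor(acquire_frame())               # steps 3-4
        rows.append(rbf_kernel(x[None, :], X_train, gamma=gamma))
        K_set = np.mean(rows, axis=0)                      # uniform eta = 1/t
        probs = clf.predict_proba(K_set)[0]                # steps 6-7
        c_hat = clf.classes_[np.argmax(probs)]
        if probs.max() >= tau_p:                           # step 9
            break
    return c_hat
```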
4. RESULTS AND DISCUSSION

To evaluate our method, we used echocardiograms collected with a Philips CX50 Ultrasound System operating at 33 Hz. Each of the 270 videos consists of 5-10 heartbeat cycles and was annotated by a domain expert as one of seven views (or "Other" for unidentifiable views). The alphanumeric codes refer to the combination of transducer location window, parasternal (PS), apical (A), or subcostal (SC), and image plane, long-axis (LX), short-axis (SX), four-chamber (4C), or two-chamber (2C). Figure 3 shows examples and the distribution of each of the views contained in our data set. This data, collected in a clinical setting, roughly reflects the distribution of views typically examined.
(rows: actual view; columns: predicted view; values in %)

          A4C     A2C     PSLX    PSSX    SC2C    SC4C    SCLX    Other
A4C       100     0.00    0.00    0.00    0.00    0.00    0.00    0.00
A2C       0.00    100     0.00    0.00    0.00    0.00    0.00    0.00
PSLX      0.00    0.00    98.38   0.00    0.00    0.00    0.00    1.62
PSSX      0.00    0.00    0.00    100     0.00    0.00    0.00    0.00
SC2C      4.51    0.00    0.00    0.00    95.49   0.00    0.00    0.00
SC4C      0.00    0.00    0.00    0.00    0.00    100     0.00    0.00
SCLX      0.00    0.00    0.00    0.00    35.71   0.00    64.29   0.00
Other     1.95    1.30    0.00    0.00    0.65    0.00    0.00    96.10
Fig. 4. Confusion matrix for classification experiments.
Fig. 5. The first two frames were classified with low confidence. When the third frame was included, the confidence value was high enough and the set was correctly classified.
The training and test sets each consist of 2700 frames, collected across the set of echocardiograms. The algorithm was implemented in Matlab on a standard PC, using libsvm [9]. GIST features were computed in 4 × 4 blocks, with each block containing 8 orientations at each of three filter scales, resulting in a 384-dimensional feature vector. Using cross-validation on the training set, we set C = 1024 and γ = 64 for the SVM regularization parameter and kernel width, respectively. We used a confidence threshold of τ_p = 0.9 and a maximum window size of t_MAX = 4. Figure 4 shows the results of our 8-way classification experiment. In this confusion matrix, each row represents the actual class and each column represents the predicted class. Overall, our method achieved 98.51% accuracy, including 100% on a pair of views (A4C and A2C) that can be rather similar in appearance. Figure 5 shows an example where the first two frames observed were classified with low confidence and would have been misclassified. Incorporating the third frame, however, resulted in a high-confidence, correct classification. 94.85% of the testing examples were classified using a single frame; 1.06%, 0.48%, and 3.61% required 2, 3, or 4 consecutive frames, respectively. The most difficult class was SCLX, which is partially explained by the lack of training data and by its similarity to other SC views. Figure 6 shows three examples of misclassified views, including two where SC views were incorrectly identified.
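For completeness, the hyperparameter selection described above might look like the following scikit-learn sketch; the log-scale grid and fold count are assumptions about the search procedure, not the authors' exact protocol.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative log-scale grid; the paper reports the selected values
# C = 1024 and gamma = 64 from cross-validation on the training set.
param_grid = {"C": [2 ** k for k in range(0, 13)],
              "gamma": [2 ** k for k in range(-4, 9)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_params_
```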
Fig. 3. Example frames of views and the number of different echocardiograms used in the experiments: (a) A4C (73 videos), (b) A2C (46), (c) PSLX (44), (d) PSSX (31), (e) SC2C (19), (f) SC4C (11), (g) SCLX (2), (h) Other (44).
Fig. 6. Examples of misclassified results with the actual (green, left) and predicted (red, right) labels: SCLX as SC2C, SC2C as A4C, and PSLX as Other.

5. CONCLUSIONS AND FUTURE WORK

We presented a method for automatic view classification of 2D echocardiograms. Our approach relies on low-level image features and uses kernel-based methods, which are efficient in both training and testing. This method could not only be used as a pre-processing step for other automated methods, but also as a training tool for technicians, and could be deployed on low-resource, handheld devices.

6. REFERENCES

[1] S. Ebadollahi, Shih-Fu Chang, and H. Wu, "Automatic view recognition in echocardiogram videos using parts-based representation," in IEEE Conf. on Computer Vision and Pattern Recognition, 2004, vol. 2, pp. 2-9.

[2] S.K. Zhou, J.H. Park, B. Georgescu, D. Comaniciu, C. Simopoulos, and J. Otsuki, "Image-based multiclass boosting and echocardiographic view classification," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, vol. 2, pp. 1559-1565.
[3] R. Kumar, F. Wang, D. Beymer, and T. Syeda-Mahmood, "Echocardiogram view classification using edge filtered scale-invariant motion features," in IEEE Conf. on Computer Vision and Pattern Recognition, 2009.

[4] J.H. Park, S.K. Zhou, C. Simopoulos, J. Otsuki, and D. Comaniciu, "Automatic cardiac view classification of echocardiogram," in IEEE Intl. Conf. on Computer Vision, Oct. 2007, pp. 1-8.

[5] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175, 2001.

[6] Xiang Yu, Shaoting Zhang, Bo Liu, and Dimitris Metaxas, "Content based medical image retrieval using compressed features," in Medical Image Computing and Computer Assisted Intervention (MICCAI) Workshop on Sparsity Techniques in Medical Imaging, 2012.

[7] T.F. Wu, C.J. Lin, and R.C. Weng, "Probability estimates for multi-class classification by pairwise coupling," The Journal of Machine Learning Research, vol. 5, pp. 975-1005, 2004.

[8] M. Gönen and E. Alpaydın, "Multiple kernel learning algorithms," Journal of Machine Learning Research, vol. 12, pp. 2211-2268, 2011.

[9] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1-27:27, May 2011.