Exploiting Spatio-temporal Information for View Recognition in Cardiac Echo Videos
David Beymer, Tanveer Syeda-Mahmood, Fei Wang
IBM Almaden Research Center, 650 Harry Road, San Jose CA 95129 USA
[email protected], [email protected], [email protected]
Abstract
2D echocardiography is an important diagnostic aid for morphological and functional assessment of the heart. The transducer position is varied during an echo exam to elicit important information about the heart's function and anatomy. Knowledge of the transducer viewpoint is important in automatic cardiac echo interpretation, both to understand which regions are being depicted and to quantify their attributes. In this paper, we address the problem of inferring the transducer viewpoint from the spatio-temporal information in cardiac echo videos. Unlike previous approaches, we exploit the motion of the heart within a cardiac cycle, in addition to spatial information, to discriminate between viewpoints. Specifically, we use an active shape model (ASM) to model shape and texture information in an echo frame. The motion information derived by tracking ASMs through a heart cycle is then projected into the eigen-motion feature space of the viewpoint class for matching. We report a comparison with a re-implementation of state-of-the-art view recognition methods in echocardiography on a large database of patients with various cardiac diseases.
1. Introduction
Echocardiography is often used to diagnose cardiac diseases related to regional wall motion as well as valvular motion abnormalities. It provides images of cardiac structures and their movements, giving detailed anatomical and functional information about the heart. Much of the practice of echocardiography requires manual intervention in both imaging and interpretation. For imaging, the transducer position is varied during an echo exam to capture different anatomical sections of the heart for studying their spatial and temporal characteristics. To interpret these echocardiograms, echocardiographers often rely on both knowledge of the viewpoint and manual delineation of important regions such as the left ventricle.
With advances in computer vision for medical imaging, techniques for automatic cardiac ultrasound analysis are becoming increasingly available [15, 7, 13, 17]. Recent work has also tried to automatically characterize the spatio-temporal patterns in cardiac echo videos for disease discrimination [17] using prior knowledge of the region layout. However, due to the large difference in appearance of the cardiac anatomy under different transducer positions, many of these methods find it difficult to make an accurate cardiac assessment without knowledge of the transducer position. Similarly, as described in [15], knowledge of the viewpoint is important to regularize the motion in wall motion analysis. It is also needed for the placement of Doppler gates, where it is important to know which valves are being depicted. Thus extracting viewpoint information from cardiac echo videos is an important first step in echocardiogram interpretation. In this paper, we address the problem of inferring the transducer viewpoint from the spatio-temporal information in cardiac echo videos. Our motivation for recognizing the viewpoints comes from a somewhat different need than the computer-aided diagnosis mentioned above. We are developing a statistically-guided decision support system for cardiologists that exploits the consensus opinions of other physicians who have looked at similar patients to present statistical reports summarizing possible diagnoses. The key idea behind our statistical decision support system is the search for similar patients based on the underlying multimodal data. Using echocardiograms as one of the modalities, we are developing a search system for similar patients based on their echocardiograms. Due to the variability in appearance of the heart under different viewpoints, prior knowledge of the viewpoint is needed both for model selection and for filtering during such a search. To recognize the viewpoint, unlike existing approaches, we exploit the heart motion depicted in a view as an additional feature to discriminate between viewpoints. Specifically, we use an active shape model to model shape and texture information in an echo frame.
The motion information derived by tracking an ASM through a heart cycle is then projected into the eigen-motion feature space of the viewpoint class for matching. Unlike previous work, our approach does not rely on delineation of complete regions or their contours for anchoring view templates, but uses geometric and textural features for localization. We also report a comparison with state-of-the-art methods of view recognition in echocardiography on a large database of patients with various cardiac diseases. The rest of the paper describes this approach in detail and is organized as follows. In Section 2, we illustrate the inherent difficulty of recognizing transducer viewpoints from cardiac echo videos and motivate the need for capturing motion information. In Section 3, we describe related work on viewpoint recognition. In Section 4, we describe our approach to view recognition using spatio-temporal modeling of cardiac echo videos. In Section 5, we present results comparing our approach with two other approaches to view recognition. The first approach is a re-implementation of the method reported in [15], which uses LogitBoost to combine results of weak classification based on left ventricle (LV) detection. The second approach sets up a strawman comparison by using a general object category recognition algorithm based on pyramid match kernels (PMK) [10] for the task of view recognition.
2. The viewpoint recognition problem
An echocardiographic exam involves scanning the heart from four standard transducer positions, namely, parasternal, apical, subcostal and suprasternal. For each transducer position, the transducer is rotated and oriented to obtain multiple tomographic images of the heart along its long and short axes. Of these viewpoints, the parasternal long axis, parasternal short axis, apical 4-chamber, and apical 2-chamber are most common. Imaging from such viewpoints shows different combinations of anatomical regions of the heart, as shown in Fig. 1, which are used to study different diseases. For example, the 4-chamber view shows the left ventricle, left atrium, right ventricle, right atrium, septal wall, and mitral and tricuspid valves, and is used to study left ventricular hypertrophy. There are several challenges in inferring the transducer viewpoints. First, there is significant variation in appearance for a given transducer position and orientation. This can be due to anatomical differences between diseased and normal hearts as well as variability between patients, instruments, echocardiographer expertise and imaging modalities. Figs. 2a and b show four-chamber views of two different hearts, where the patients depicted have hypertrophy in one case and excessive enlargement of the left ventricle in the other. These variabilities can also introduce ambiguities in determining the viewpoints. For example, Figs. 2c, d, and e show three different echo frames where
[Figure 1 panels: Apical 4 Chamber; Parasternal Long Axis; Parasternal Short Axis, Basal; Apical 2 Chamber]
Figure 1. Illustration of different cardiac echo viewpoints used in an echocardiographic exam. (RVOT = right ventricular outflow tract, Ao = aorta, PA = pulmonary artery, LV = left ventricle, RV = right ventricle, LA = left atrium, RA = right atrium)
the first two are 2-chamber viewpoints while the last is from a short-axis view. Even if canonical global templates could be designed to represent different viewpoints, finding stable features to anchor these models can be difficult. For example, automatic delineation of the left ventricular contour has been attempted in earlier approaches in an effort to anchor the model matching [15, 20]. However, the left ventricular region may not be visible in all viewpoints (e.g., parasternal short axis - aortic valve, RV inflow view), and in some cases may be occluded. The effects of zoom and pan can also make the determination of viewpoint difficult. Figs. 2f-g illustrate this, where Fig. 2g is a zoomed-in version of the region depicted in Fig. 2f. Finally, if automatic interpretation of cardiac videos is to be integrated into an echocardiographer's workflow, it must analyze real-time recordings. The determination of viewpoint becomes even more challenging in this case, as the system must distinguish standard viewpoints from transitional ones seen while the transducer is being moved before settling in a new position. Fig. 2i shows a transitional view from a continuous recording, with the before and after views shown in Figs. 2h and j respectively. This must be classified as a spurious view. The examples above illustrate the challenges involved in building a robust and automatic system for cardiac view classification. These problems can be alleviated if temporal information is exploited in addition to spatial information. For example, transitional viewpoints can be distinguished by shot detection techniques from video analysis [16]. The fact that a region depicted is a zoomed-in version of the previous viewpoint can be inferred by analyzing the crop action in the preceding frames of a sequence. Similarly, ambiguity in viewpoint can be resolved by observing
the appearance change through a heart cycle. Finally, the effect of disease-specific variations on determining the viewpoint can be mitigated by better modeling of the spatio-temporal information present across frames of a heart cycle sequence. The focus of the work reported in this paper is to exploit this last observation and propose the automatic capture and modeling of spatio-temporal information to enhance the classification of viewpoints.
3. Previous Work
Classification of views has been attempted by researchers earlier, for popular views such as the apical four chamber, two chamber, parasternal long axis, and short axis views [7, 13, 20, 15, 1]. In [7], the first approach for view recognition in cardiac videos was attempted using a part-based approach. The cardiac chambers depicted in a view were modeled as a relational graph encoded as a Markov Random Field (MRF). Determining the viewpoint for a new echo video frame involves extracting the cardiac chambers, forming a relational graph, and classifying the resulting MRF into one of the template MRFs of views using a support vector machine (SVM). Since the chamber detection process is a weak process that primarily identifies dark 'holes', it is sensitive to noise in ultrasound, where small intensity variations can lead to large errors in the graph formation. Further, it does not adequately handle the effects of rotation, translation and zoom. In [1], the standard views were represented by templates obtained by applying multiresolution spline filters to intensity images. View recognition then involved elastically deforming the template splines to best explain a current image. The deformation energy and the similarity match of the warped current image to the reference templates were classified using a linear discriminant classifier. Since the only constraint used for elastic registration is smoothness, and since no explicit correspondence of regions is established prior to deformation, such elastic deformation is effective only when the current image viewpoint is already close to the known viewpoint templates. The work of [20] treats echo view recognition as a multi-class object detection problem. Each view is represented by a weak classifier based on Haar-like rectangle features as described in [18]. The weak learners are then combined into a strong classifier using the LogitBoost algorithm [9]. The use of Haar-like rectangle features gives a large number of features which need to be pruned, and the layout of cardiac regions cannot be well modeled. The low intensity regions and noise in ultrasound imaging also make it difficult to isolate good Haar-like features. Finally, the most recent work on echo view recognition is reported in [15]. Here a comprehensive system for echo video classification is described by combining ideas of region correspondence, spatial layout modeling and multi-object detection using MLBoost learning.
Figure 2. Illustration of the variation in cardiac appearance both within and between viewpoints. (a) apical 4 chamber (A4C) view of a patient with hypertrophy, (b) A4C view of a patient with a dilated LV. (c) and (d) show an apical 2 chamber view, and (e) shows a short axis view. Changing the echo magnification setting greatly affects appearance: the sonographer increased the magnification from the long axis view (f) to (g). Finally, in continuous echo recordings, there will be intermediate, indeterminate views between the standard ones: (i) is an indeterminate view between (h) and (j).
Specifically, the method works by anchoring the viewpoint recognition on
the detection of the left ventricle (LV) in the views. An LV detector was developed using the Haar-wavelet features reported in [14]. The prior knowledge of the layout of the various cardiac chambers with respect to the left ventricular region is captured via global templates, one for each of the views. Given a cardiac sequence, each video frame is analyzed to detect LV regions using the LV detectors. The global template with the LV region registered is fed into a multiview classifier based on the MLBoost algorithm. All of the above methods have been applied to popular views that are already visually distinct, such as the 4-chamber, parasternal long axis, and short axis views, which explains some of the high classification rates reported (96% in [15]). Another important aspect is that the choice of video frames used for view classification is not made explicit. Moreover, little attention is given to time synchronization or to variability in heart rate. In [7], for example, the time information was factored out by selecting keyframes from the sequence to avoid the structural variations of the regions during a heart cycle.
4. Spatio-temporal modeling for viewpoint recognition
Our approach to view recognition is based on building spatial and temporal models for known viewpoints from sample learning data and using a matching algorithm to resolve the best matching model. The overall approach is as follows. The spatial and textural information across frames of the echo sequences of a training set is used to build active shape models for the respective viewpoint classes. Next, the ASM models are tracked within the training sequences to generate motion models for each viewpoint class. To determine the viewpoint of an unknown single heart-cycle echo sequence, we fit a candidate ASM to each of the frames in the sequence to create a tracked ASM sequence. The motion information derived by tracking ASMs through a heart cycle is then projected into the eigen-motion feature space of the viewpoint class. The combined fit of the ASMs and the motion models is then used to evaluate the overall spatio-temporal model fit. The best viewpoint class is then determined by a winner-take-all approach.
4.1. Appearance modeling
Since the cardiac regions resemble an object layout, view recognition can be regarded as a specific instance of object category recognition. The category recognition problem has been an active field of research lately [10, 19, 11, 8], where popular approaches either use a bag-of-words model to collect scene information for categorization [8, 11], or use multi-resolution feature histograms [10] with increasing amounts of layout modeling [19]. Due to the lack of discriminable features in low-intensity ultrasound, such feature histograms are not very discriminatory. Further, the topological changes within viewpoints are larger than what can be modeled by the simple translations [10] or similarity transformations [19] used in these approaches. The approach presented here instead utilizes the machinery of active shape models (ASM) [6] and active appearance models (AAM) [5]. ASMs and AAMs are effective statistical tools for modeling the appearance and texture of nonrigid objects, and have been extensively applied in computer vision. They have also been previously used for echocardiogram segmentation and tracking [2, 21]. A variant of AAMs, Active Appearance Motion Models (AAMMs) [3], is most similar to our work. In AAMMs, shape and texture are represented jointly across the entire cardiac cycle, and motion is captured in the shape vector. In our approach, we have separate models for shape, texture and motion. Shape and texture are fit independently for each frame, and motion modeling encompasses the entire cardiac cycle. While their emphasis is to use the eigen-motions to assist and constrain tracking, we use motion eigen-analysis as a feature for viewpoint classification.
Using the paradigm of active shape models [6], we represent shape information in an echo frame by a set of feature points obtained from important anatomical regions depicted in the echo sequences. For example, in apical 4-chamber views, the feature points are centered around the mitral valve, interventricular septum, interatrial septum, and the lateral wall. We chose not to trace the entire boundary of the left ventricle because the apex is often not visible due to the ultrasound zoom and transducer location. Wall boundaries are annotated on both the inner and outer sections. While the feature points are manually isolated during the training stage, they are automatically identified during matching. Thus the cardiac region information used during model building captures features important for cardiac assessment. For each view, a number of training images were collected from cardiac cycle video clips, covering different patients, diseases, and time offsets within the cycle. We assume that each echo video clip represents a full heart cycle. If more than one heart cycle is present, the clip can be segmented into cycles based on the synchronizing ECG often present in cardiac echos. Capturing the spatio-temporal modes of variation in the training set allows us to capture the appearance variations of the isolated features throughout the heart cycle. Thus, each training image can be represented by a shape vector s obtained by concatenating the locations (x, y) of n features $f_1, f_2, \ldots, f_n$ as
$$s = [x_1, y_1, \ldots, x_n, y_n]^T. \qquad (1)$$
Next, we form image patches centered around the feature locations to capture texture information.
[Figure 3 panels: Apical 4 Chamber; Parasternal Long Axis; Parasternal Short Axis Basal; Parasternal Short Axis Papillary]
Figure 3. ASM feature points for four cardiac echo views. (Best viewed in color.)
Given a training image with n features, the texture vector t concatenates the pixels from all the patches into one long vector, where the patch size is matched to the pixel spacing between features. Fig. 3 shows the feature points we use for four different cardiac views. After shape annotation, the shape vectors s are geometrically normalized by a similarity transform for model generation; Procrustes analysis using two anchor points can be used for this normalization step. The texture vector $t_{raw}$ is similarly normalized for echo gain and offset by subtracting the mean and dividing by the standard deviation:
$$t = (t_{raw} - \bar{t}_{raw}) / \sigma(t_{raw}). \qquad (2)$$
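The paper does not give code for these normalization steps; as a rough illustration, the following Python/NumPy sketch shows one way the two-anchor-point similarity alignment and the gain/offset normalization of eqn (2) might be implemented (the function names and canonical anchor positions are our own choices, not from the paper).

```python
import numpy as np

def align_two_anchors(shape_xy, anchor_idx=(0, 1), canon=((0.0, 0.0), (1.0, 0.0))):
    """Similarity-align an (n x 2) array of (x, y) feature points so that two
    anchor points map to canonical positions, factoring out scale, rotation
    and translation (a two-point Procrustes alignment)."""
    p0, p1 = shape_xy[anchor_idx[0]], shape_xy[anchor_idx[1]]
    c0, c1 = np.asarray(canon[0]), np.asarray(canon[1])
    # represent points as complex numbers; a 2D similarity transform is then
    # a single complex multiplication plus an offset
    z_src = complex(p1[0] - p0[0], p1[1] - p0[1])
    z_dst = complex(c1[0] - c0[0], c1[1] - c0[1])
    scale_rot = z_dst / z_src
    pts = shape_xy[:, 0] + 1j * shape_xy[:, 1]
    out = (pts - complex(p0[0], p0[1])) * scale_rot + complex(c0[0], c0[1])
    return np.stack([out.real, out.imag], axis=1)

def normalize_texture(t_raw):
    """Eqn (2): normalize a concatenated patch-pixel vector for echo gain and offset."""
    return (t_raw - t_raw.mean()) / (t_raw.std() + 1e-8)

# toy usage: 14 annotated feature points and a 500-pixel texture vector
pts = np.random.rand(14, 2)
s = align_two_anchors(pts).reshape(-1)        # shape vector of eqn (1): [x1, y1, ..., xn, yn]
t = normalize_texture(np.random.rand(500))
```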
To construct the ASM model, we reduce the dimensionality of the shape and texture vectors using PCA to find a small set of eigenshapes and eigentextures. The shape s and texture t can then be linearly modeled to form the active shape model as
$$s = S a + \bar{s}, \quad S = [\,e^s_1 \; e^s_2 \; \ldots \; e^s_p\,],$$
$$t = T b + \bar{t}, \quad T = [\,e^t_1 \; e^t_2 \; \ldots \; e^t_q\,], \qquad (3)$$
where p eigenshapes $e^s_1, e^s_2, \ldots, e^s_p$ and q eigentextures $e^t_1, e^t_2, \ldots, e^t_q$ are retained in the PCA. The p-dimensional vector a and the q-dimensional vector b are the low-dimensional representations of shape and texture.
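A minimal sketch of the PCA step in eqn (3), assuming the aligned shape vectors and normalized texture vectors are stacked as rows of two matrices (the variable names and the numbers of retained modes are ours; the paper does not specify p and q):

```python
import numpy as np

def pca_model(X, num_modes):
    """Build a linear PCA model from the rows of X (one training vector per row).
    Returns (mean, modes, eigenvalues) so that x is approximately modes @ coeffs + mean."""
    mean = X.mean(axis=0)
    U, svals, Vt = np.linalg.svd(X - mean, full_matrices=False)
    eigvals = (svals ** 2) / (X.shape[0] - 1)      # PCA eigenvalues from singular values
    return mean, Vt[:num_modes].T, eigvals[:num_modes]

# toy training matrices: 40 examples, 14 points (2n = 28) and 500 texture pixels
S_train = np.random.rand(40, 28)
T_train = np.random.rand(40, 500)

s_mean, S, shape_eigvals = pca_model(S_train, num_modes=8)    # eigenshapes e^s_1..e^s_p
t_mean, T, tex_eigvals = pca_model(T_train, num_modes=12)     # eigentextures e^t_1..e^t_q

# eqn (3): low-dimensional coefficients for one training example
a = S.T @ (S_train[0] - s_mean)
b = T.T @ (T_train[0] - t_mean)
```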
4.2. ASM Fitting
Fitting an ASM to a new sequence involves finding a similarity transform $\Gamma_{sim}$ to position the model appropriately and recovering the shape and texture vectors a and b.
This is iteratively estimated by alternating between shape and texture update steps. Unlike most ASM techniques, which require a manual initialization to start the model fitting, we use an automatic initialization method where a distance-to-eigenspace method is first used to generate seed ASM initializations. The seed locations are then refined and evaluated using a coarse-to-fine pyramid approach, where seed hypotheses are pruned if their fitting error rises above a threshold. The ASM seed hypotheses that reach the finest pyramid level in the first frame are then tracked across the rest of the frames of the sequence, with the fitting result for frame t serving as the initial conditions for frame t + 1. If multiple hypotheses survive sequence tracking, they are ranked by their fitting score. To evaluate an ASM fit at a given position, we measure the error of fit in shape space and texture space using the Mahalanobis distance, and the fit to the image itself using a suitably normalized reconstruction error. For image I, the ASM fit $(a, b, \Gamma_{sim})$ is
$$\mathrm{fit}(a, b, \Gamma_{sim}) = a^T \Sigma_{shp}^{-1} a + b^T \Sigma_{tex}^{-1} b + 2R^2 / \lambda_{tex}^{q+1}, \qquad (4)$$
where $R = \| t - T T^T t \|$, $t = I(\Gamma_{sim}(x, y))$, $\lambda_{tex}^{q+1}$ is the (q+1)th texture eigenvalue, and $\Sigma_{shp}$ and $\Sigma_{tex}$ are diagonal matrices with PCA eigenvalues (see Cootes and Taylor [4]).
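Given the PCA models sketched above, the fit score of eqn (4) reduces to a few lines. This sketch assumes t is the gain/offset-normalized texture sampled from the image at the hypothesis pose and that the (q+1)th texture eigenvalue has been kept for the normalization term:

```python
import numpy as np

def asm_fit_score(a, b, t, shape_eigvals, tex_eigvals, T, lambda_q_plus_1):
    """Eqn (4): combined shape / texture / reconstruction fit of one ASM hypothesis.
    a, b            - shape and texture coefficients of the hypothesis
    t               - normalized texture sampled at the hypothesis pose, t = I(Gamma_sim(x, y))
    T               - eigentexture matrix (columns are eigentextures)
    lambda_q_plus_1 - the (q+1)th texture eigenvalue, normalizing the residual"""
    shape_term = np.sum(a ** 2 / shape_eigvals)      # a^T Sigma_shp^-1 a (diagonal covariance)
    tex_term = np.sum(b ** 2 / tex_eigvals)          # b^T Sigma_tex^-1 b
    residual = t - T @ (T.T @ t)                     # R = || t - T T^T t ||
    recon_term = 2.0 * residual.dot(residual) / lambda_q_plus_1
    return shape_term + tex_term + recon_term
```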
4.3. Motion Modeling
Using an ASM to track cardiac motion through a cycle with m frames produces m sets of feature points $\{s_1, s_2, \ldots, s_m\}$, where each $s_i$ is a column vector of stacked ASM (x, y) feature coordinates. To create a canonical representation, we vectorize the motion in the cardiac cycle by normalizing for image plane geometry and standardizing the time axis to a target length n as follows:
1. Geometry normalization. Align the first frame $s_1$ to a canonical position using a similarity transform $\Gamma^{sim}_1$. Apply the same similarity transform to standardize all frames, $s_i \leftarrow \Gamma^{sim}_1(s_i)$. Thus, all motion is described in a coordinate frame that factors out scale, rotation, and translation.
2. Time normalization. Interpolate the $s_i$'s in time using piecewise linear interpolation to standardize the sequence length from the input length m to a target length n.
3. Shape decoupling. Factor out object shape by subtracting frame 1, creating our final motion vector
$$m = \begin{bmatrix} s_2 - s_1 \\ s_3 - s_1 \\ \vdots \\ s_n - s_1 \end{bmatrix}.$$
If each $s_i$ contains F ASM points, the final vector m has dimensionality $2F(n-1)$. Given a set of example training motions $m_i$, our linear model for a new motion m is $m = \sum_i c_i m_i$. Applying PCA to the training set $m_i$ yields a set of eigenmotions $e^m_i$ and a mean motion $\bar{m}$, giving us the lower dimensional representation
$$m = \sum_{i=1}^{r} c_i e^m_i + \bar{m}, \quad \text{or} \quad m = M c + \bar{m}, \quad M = [\,e^m_1 \; e^m_2 \; \ldots \; e^m_r\,],$$
where r eigenmotions are retained. The r-dimensional weight vector c is our representation of object motion.
To collect training motions, the manual labeling effort is significant since we must annotate entire sequences. We developed a semiautomatic approach that leverages our ASM tracker. We ran our tracker on a set of training sequences, evaluating the returned tracks and making manual corrections as necessary. After vectorizing the motion vectors and applying PCA, we obtained motion models for all four echo views in Fig. 3.
A new sequence m can be analyzed to evaluate its degree of fit to our linear motion model. First, we project m to find c,
$$c = M^T (m - \bar{m}),$$
and then we compute the Mahalanobis distance $c^T \Sigma_{mot}^{-1} c$, where $\Sigma_{mot}$ is a diagonal $(r \times r)$ matrix of eigenvalues corresponding to $e^m_1, e^m_2, \ldots, e^m_r$.
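The motion vectorization and the projection onto the eigenmotion model can be sketched as follows, assuming the per-frame shape vectors have already been similarity-aligned to frame 1 (step 1 above); the helper names are ours:

```python
import numpy as np

def vectorize_motion(tracked_shapes, target_len):
    """Steps 2-3 of Section 4.3: resample a tracked cycle of 2F-dimensional shape
    vectors to target_len frames, subtract frame 1, and stack into one vector
    of dimensionality 2F(target_len - 1)."""
    S = np.asarray(tracked_shapes)                       # (m_frames, 2F)
    src_t = np.linspace(0.0, 1.0, S.shape[0])
    dst_t = np.linspace(0.0, 1.0, target_len)
    # piecewise-linear time normalization, one coordinate at a time
    S_resampled = np.stack(
        [np.interp(dst_t, src_t, S[:, j]) for j in range(S.shape[1])], axis=1)
    return (S_resampled[1:] - S_resampled[0]).reshape(-1)

def motion_distance(m_vec, m_mean, M, motion_eigvals):
    """Project a motion vector onto the eigenmotion model, c = M^T (m - m_bar), and
    return the Mahalanobis distance c^T Sigma_mot^-1 c along with c."""
    c = M.T @ (m_vec - m_mean)
    return np.sum(c ** 2 / motion_eigvals), c
```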
4.4. View recognition algorithm
Putting it all together, Fig. 4 presents our algorithm for recognizing cardiac echo views. We assume that the input sequence $I_1, I_2, \ldots, I_m$ represents a single heart cycle, and the motion models were trained assuming that frame $I_1$ is synchronized with the ECG R peak. For each cardiac view, the algorithm estimates a measure of sequence fit
$$\mathrm{sequence\;fit} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{fit}(a_i, b_i, \Gamma_i) + c^T \Sigma_{mot}^{-1} c, \qquad (5)$$
which is the average appearance fit over the cycle (eqn (4)) plus the Mahalanobis distance on motion. The recognized view is the one that minimizes this fitting metric.
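The winner-take-all decision of eqn (5) and Fig. 4 is then an argmin over the per-view scores. A sketch, assuming the per-view ASM tracking and motion projection have already produced per-frame appearance fits and a motion Mahalanobis distance (the data layout here is our own, for illustration):

```python
import numpy as np

def recognize_view(per_view_results):
    """Eqn (5): sequence fit = mean per-frame appearance fit + motion Mahalanobis
    distance; the recognized view minimizes this score.
    per_view_results maps a view name to (per_frame_fit_scores, motion_distance)."""
    scores = {view: float(np.mean(fits)) + motion_d2
              for view, (fits, motion_d2) in per_view_results.items()}
    best = min(scores, key=scores.get)
    return best, scores

# toy usage with three candidate views
results = {"A4C":  ([3.1, 2.9, 3.4], 1.2),
           "PLA":  ([5.0, 4.8, 5.3], 4.1),
           "PSAB": ([4.2, 4.5, 4.0], 2.7)}
view, all_scores = recognize_view(results)   # -> "A4C"
```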
5. Results
We now describe the results of our algorithm for view recognition and its comparison with other approaches, namely, multiclass boosting [15] and pyramid match kernels [10].
Cardiac Echo View Recognition Algorithm
Input: Sequence $I_1, I_2, \ldots, I_m$ of length m
       C view models: $ASM_j$, $MotionModel_j$, $1 \le j \le C$
Algo: for class j from 1 to C do
  1. Fit $ASM_j$ to $I_1$ → generates ASM fit hypotheses
  2. Track ASM hypotheses in $I_2, \ldots, I_m$ → generates $(a_i, b_i, \Gamma_i)$, $1 \le i \le m$, for each hypothesis
  3. Rank surviving hypotheses using eqn (4) to estimate the average fit; choose the hypothesis with the best rank:
     $\mathrm{appearance\;fit}_j = \frac{1}{m} \sum_{i=1}^{m} \mathrm{fit}(a_i, b_i, \Gamma_i)$
  4. Vectorize motion m from the $a_i$ and the shape model.
  5. Project m onto $MotionModel_j$, estimating c.
     $\mathrm{sequence\;fit}_j = \mathrm{appearance\;fit}_j + c^T \Sigma_{mot}^{-1} c$
return $\mathrm{argmin}_j \; (\mathrm{sequence\;fit}_j)$
Figure 4. Our algorithm for cardiac echo view recognition.
5.1. Data set
The dataset for our testing came from hospitals in India, with physicians recording a complete echo exam in continuous-loop videos; thus no view segmentation or view annotation was present. Further, the dataset exhibited many of the challenges described in Section 2, including transitional views as the echocardiographer changed transducer positions. Starting from video sequences of the echocardiographer's entire workflow, we manually extracted individual cardiac cycles for the four views shown in Table 1: parasternal long axis (PLA), apical four chamber (A4C), parasternal short axis - basal level (PSAB), and parasternal short axis - papillary muscle level (PSAP). The videos depicted heart motion of patients covering a range of ages and cardiac diseases. The video was captured at 320x240 and 25 Hz, and an ECG trace waveform at the bottom allowed us to synchronize extracted cycles with the ECG R peak. Each video sequence was about 5 minutes long per patient and depicted several heart cycles. Our current collection has over 200 echo videos, or 200x5x30x60 frames. The physicians in our project assisted us in their interpretation, so that ground truth labels for the cardiac cycles could be obtained. Due to the effort of manual labeling, the results in this paper are reported on four of the views tested, containing a total of 1612 frames as shown in Table 1.
# cycles and # total frames per view:
                 A4C    PLA    PSAB   PSAP   Total
# cycles          26     18     12     16      72
# total frames   597    384    271    360    1612
Table 1. Our cardiac view data set contains a total of 72 cardiac cycles, including 1612 frames, across 4 echo views.
               A4C     PLA     PSAB    PSAP    Ave
ASM+Motion     96.2%   88.9%   91.6%   75.0%   87.9%
ML-Boosting    80.3%   75.5%   67.5%   70.9%   73.5%
PMK            59.3%   65.5%   47.7%   73.2%   61.4%
Table 2. Performance comparison for different view classification algorithms.
5.2. Training and Testing
Both the ASM and motion models require manual labeling of the training data. For constructing the ASMs, we manually label frames from all sequences, sampling the time axis at regular intervals (about 1/4 of a cycle period). Motion modeling requires the entire training sequence to be annotated, so we leveraged the ASM model to track the sequences, evaluated and manually corrected the automatic tracks, and used these to bootstrap motion training. The testing results for our view recognition algorithm are presented in Table 2. During testing, because of our data set size, we employed a leave-one-out methodology: when testing a sequence from patient X, the entire data set minus X is used for training, and the ASMs and motion models are generated on the fly. Since our sequence fitting score in equation (5) is the sum of a number of terms, Table 3 breaks down the recognition performance for each of the individual terms. Clearly, the terms contain independent information, and fusing them is beneficial. In the last two rows, we compare results for just the appearance-related terms against the appearance + motion terms, demonstrating the incremental value of the motion term. Overall, the errors in our system come from both ASM detection and tracking errors and from the discrimination power of our sequence fitting score. Running our algorithm on an Intel Xeon 2.66 GHz CPU with 12 GB RAM, it takes an average of 54 sec to process a 20-frame cardiac cycle sequence with the 4 echo models A4C, PLA, PSAB, and PSAP. This includes the PCA computations for ASM and motion model construction, since we use a leave-one-out testing strategy. It also includes the time spent tracking a number of hypotheses for each view model: a number of hypotheses are tracked from an initial detector in frame 0, and the best ranking hypothesis is selected in the end. Breaking down the computation further, using a single ASM/motion model to track a single hypothesis over a 20-frame sequence takes only 550 msec.
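The leave-one-out protocol can be summarized by the loop below, a sketch in which train_models and classify stand in for the ASM/motion model construction and the view recognition algorithm of Fig. 4 (both names are placeholders, not the paper's code):

```python
def leave_one_patient_out(cycles, train_models, classify):
    """cycles: list of (patient_id, view_label, sequence).
    For each test cycle, all cycles from other patients form the training set,
    and the ASM + motion models are rebuilt on the fly before classifying."""
    correct = 0
    for pid, label, seq in cycles:
        training_set = [(v, s) for p, v, s in cycles if p != pid]  # hold out the whole patient
        models = train_models(training_set)
        correct += int(classify(seq, models) == label)
    return correct / len(cycles)
```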
a^T Σ_shp^-1 a   b^T Σ_tex^-1 b   2R^2/λ_tex^(q+1)   c^T Σ_mot^-1 c   Recog Rate
      1                0                 0                 0            70.8%
      0                1                 0                 0            48.6%
      0                0                 1                 0            66.6%
      0                0                 0                 1            43.1%
      1                1                 1                 0            83.9%
      1                1                 1                 1            87.9%
Table 3. Recognition performance for the terms in our fit score, where “1” means to include the term. Of the individual components, shape has the most information. In the last two rows, we compare the appearance terms against the appearance + motion terms, showing that motion adds to the performance.
5.3. Comparisons with multiclass boosting and pyramid match kernels
To compare against a state-of-the-art approach to echo view recognition, we implemented the multiclass boosting approach described in [15] and ran it on the same data set. As mentioned in Section 3, the training and testing pipeline of their view recognition system [15] contains three modules: (i) an LV region detector, (ii) a global template constructor, and (iii) a multi-class view classifier. Since frames from the parasternal short axis - basal (PSAB) view do not contain the left ventricle region, we used the left atrium region for training/testing and anchoring in that view, while still using the LV region for the other views. The region detector used the same Haar-wavelet local features [18] as the final multi-class classifier. Table 2 shows the accuracy of the view classification results of the multiclass boosting approach in comparison with our model. As evident from the table, our method achieves better view recognition accuracy than the ML-boosting approach under the same experimental setting. The relatively low accuracy reported here, in comparison with the accuracy reported in their paper (Park et al. [15]), can be explained by the relatively small amount of training data available in our setting, as well as the method's sensitivity to LV detection, which was missing in the PSAB view in our data set. Finally, we tested our approach against a general object categorization approach based on pyramid match kernels [10]. In the pyramid match kernel approach [10], images are represented as histograms in feature space. A kernel matrix comparing all pairs of images is then formed using pairwise multiresolution histogram intersection. The kernel matrix is used to train an SVM classifier. We used the code from [10] to implement the pyramid match kernel approach using six-dimensional SIFT [12] features (x, y, scale, rotation) + 2 PCA coefficients on the SIFT texture descriptor. The multiclass SVM was trained with 25 images/class.
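We used the authors' PMK code for the actual comparison; purely as an illustration of the kernel-SVM pipeline, the sketch below uses a single-level histogram-intersection kernel with scikit-learn's precomputed-kernel SVM, which is a much-simplified stand-in for the full multiresolution pyramid match of [10]:

```python
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """Histogram intersection between every row of A and every row of B."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

# toy data: per-image feature histograms (rows) and view labels 0..3
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 64)), rng.integers(0, 4, 100)
X_test = rng.random((10, 64))

clf = SVC(kernel="precomputed")                        # multiclass handled internally (one-vs-one)
clf.fit(intersection_kernel(X_train, X_train), y_train)
pred = clf.predict(intersection_kernel(X_test, X_train))
```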
The performance of the pyramid match kernel approach on the same dataset is shown in Table 2. Overall, its performance falls below both our ASM+motion approach and the boosting approach. Due to the considerable variation in the appearance of echo frames for the same viewpoint, together with the constrained region layouts, generic object categorization models are too weak for the echo view recognition problem.
6. Conclusions
In this paper, we addressed the problem of inferring transducer-induced viewpoints from echocardiographic sequences. We have shown that using the motion of the heart within a cardiac cycle, in addition to spatial information, improves the discriminability of viewpoints. At the same time, it allows us to handle the complex echo exam conditions that give rise to large variations in the appearance of heart regions even within known standard views. Comparisons with re-implementations of state-of-the-art view recognition and general category recognition approaches indicate that our technique outperforms these methods while placing minimal constraints on the acquisition of the sequences.
References
[1] S. Aschkenasy, C. Jansen, R. Osterwalder, A. Linka, M. Unser, S. Marsch, and P. Hunziker. Unsupervised image classification of medical ultrasound data by multiresolution elastic registration. Ultrasound in Medicine and Biology, 32(7):1047–1054, July 2006.
[2] R. Beichel, H. Bischof, F. Leberl, and M. Sonka. Robust active appearance models and their application to medical image analysis. IEEE Transactions on Medical Imaging, 24(9):1151–1169, September 2005.
[3] J. Bosch, S. Mitchell, B. Lelieveldt, F. Nijland, O. Kamp, M. Sonka, and J. Reiber. Automatic segmentation of echocardiographic sequences by active appearance motion models. IEEE Transactions on Medical Imaging, 21(11):1374–1383, November 2002.
[4] T. Cootes and C. Taylor. Using grey-level models to improve active shape model search. In International Conference on Pattern Recognition, volume 1, pages 63–67, 1994.
[5] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.
[6] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.
[7] S. Ebadollahi, S.-F. Chang, and H. Wu. Automatic view recognition in echocardiogram videos using parts-based representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 2–9, 2004.
[8] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 264–271, June 2003.
[9] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[10] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In International Conference on Computer Vision, volume 2, pages 1458–1465, 2005.
[11] F.-F. Li and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 524–531, 2005.
[12] D. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.
[13] M. E. Otey, J. Bi, S. Krishnan, B. Rao, J. Stoeckel, A. Katz, J. Han, and S. Parthasarathy. Automatic view recognition for cardiac ultrasound images. In MICCAI-CVII, 2006.
[14] C. P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on Computer Vision, pages 555–562, 1998.
[15] J. Park, S. Zhou, C. Simopoulos, J. Otsuki, and D. Comaniciu. Automatic cardiac view classification of echocardiogram. In International Conference on Computer Vision, pages 1–8, 2007.
[16] D. Ponceleon, A. Amir, S. Srinivasan, T. Syeda-Mahmood, and D. Petkovic. CueVideo: automated multimedia indexing and retrieval. In Proceedings of the Seventh ACM International Conference on Multimedia (Part 2), page 199, 1999.
[17] T. Syeda-Mahmood, F. Wang, D. Beymer, M. London, and R. Reddy. Characterizing spatio-temporal patterns for disease discrimination in cardiac echo videos. In MICCAI, pages 261–269, 2007.
[18] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[19] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213–238, 2007.
[20] S. K. Zhou, J. H. Park, B. Georgescu, D. Comaniciu, C. Simopoulos, and J. Otsuki. Image-based multiclass boosting and echocardiographic view classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1559–1565, 2006.
[21] S. K. Zhou, J. Shao, B. Georgescu, and D. Comaniciu. Pairwise active appearance model and its application to echocardiography tracking. In MICCAI, pages 736–743, 2006.