Learning from Images by Integrating Different Perspectives
Sharath Cholleti
PhD Dissertation, Technical Report 2009-54
Department of Computer Science & Engineering, Washington University in St. Louis

WASHINGTON UNIVERSITY IN ST. LOUIS
School of Engineering and Applied Science
Department of Computer Science and Engineering

Dissertation Examination Committee:
Sally A. Goldman, Chair
Fred Prior
Jason Fritts
Burchan Bayazit
Robert Pless
Bill Smart

LEARNING FROM IMAGES BY INTEGRATING DIFFERENT PERSPECTIVES

by Sharath Reddy Cholleti

A dissertation presented to the Graduate School of Arts and Sciences of Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

December 2008
St. Louis, Missouri

ABSTRACT OF THE DISSERTATION

Learning from Images by Integrating Different Perspectives
by Sharath Reddy Cholleti
Doctor of Philosophy in Computer Science
Washington University in St. Louis, 2008
Professor Sally A. Goldman, Chairperson

While much machine learning research focuses on learning tasks in which one begins from scratch, for many learning scenarios an important component, or even the entire task, is to learn how to combine the predictions made by several independent algorithms or experts. In this dissertation, we study a variety of problems that involve combining predictions from multiple sources. First, we study a type of content-based image retrieval where the user is only interested in a portion of the image. We model this task as a multiple-instance (MI) learning problem and describe a new algorithm, MI-Winnow, that first converts an MI problem into a regular machine learning problem and then uses Winnow to learn. One important component of our work is to combine hypotheses generated using multiple representations. Along with providing experimental evaluation of MI-Winnow, we present some theoretical results for learning in the MI setting. In addition, as part of this work, we present a new salient-point representation that reduces the number of salient points by using a segmentation algorithm as a mask but still preserves the overall variety in the image. Next, we study the evaluation of multiple segmentations of the same image using different existing evaluators, each being best suited for different types of images. We describe a new meta-learning algorithm to combine the existing base evaluators to create a better evaluator that adaptively weights the base evaluators in response to the specific input image. Finally, we present a new algorithm, Veritas, to estimate the ground truth in medical images from multiple expert segmentations. A unique aspect of this work is that we perform the combination without any training data, such as labeled images or a ranking of the expert segmentations.


Acknowledgments

Working towards my PhD has been exciting, challenging and stressful. Along the way I did some things right and many things wrong while learning a lot of lessons about research, my personality and life. It has been a great journey which I could not have finished without the support of various people.

My first thanks goes to my advisor, Sally Goldman. Sally gave me the time and freedom to explore while always being available to guide me in the right direction. I am very grateful to be her student as I learned from her way of thinking about problems, formulating solutions and presenting results. I, along with many others in the department, appreciate Sally's friendliness and cheerful attitude irrespective of how everything else is at the moment.

Fred Prior has been instrumental in opening a new research direction for me. I would like to thank Fred for showing me how valuable collaborations are in research and for encouraging and guiding me to explore the intersection of medicine and computer science. I would like to thank Jason Fritts for his constant support, discussions and ideas over the years. I would also like to thank the rest of my committee members, Robert Pless, Burchan Bayazit and Bill Smart, for their advice and ideas. A special thanks goes to my research group members Hui Zhang, Rouhollah Rahmani, John Krettek and Ibrahim Noorzaie for their friendship, support, discussions, ideas and feedback.

Interacting and working with many good researchers has been a very fruitful experience. For this I would like to thank Ron Cytron, Gruia-Catalin Roman, David Politte, Kirk Smith, Paul Commean, Bruce Whiting, Steven Don, Charles Hildebolt and Avrim Blum.

Like every graduate student I had to go through various administrative issues and paperwork over the years. Without the helpful and knowledgeable staff these would have taken a lot more time and effort. I would like to thank all the administrative staff who have been tremendously helpful — Jean Grothe, Myrna Harbison, Sharon Matlock, Madeline Hawkins, Andrea Levy, Peggy Fuller, Ron Brown, Stella Sung and Sonya Walker.

Computational resources are only as good as the people that manage them. Many thanks go to Mark Bober, Allen Reuter, Paul Koppel and the CTS staff for keeping the computational infrastructure running smoothly and quickly responding to my specific issues. I gratefully acknowledge the financial support from the National Science Foundation, the Department of Computer Science and Engineering, and the Mallinckrodt Institute of Radiology.

Having good friends and family has been equally important for my success. I would like to thank you all for a variety of reasons: I am greatly indebted to my aunt and uncle, Praveena and Sudhakar Reddy Malladi, for their constant support and encouragement all through my life. Without them my life would have been very different. Colin Heffern and the Heffern family, especially Judy and Ed, for being a great inspiration to face difficult times with courage and laughter. Laura Houser for numerous bike rides. Gergens Jean Polynice for giving me a taste of the thinking and the values I miss being in a different country. Bhargava Janakiraman, KVM Naidu, Pavan Mandalkar and Ruoyun Huang for checking on me. Vanessa Clark, Patience Graybill, Christine Smith and Michael Markus for their support and for sharing their thoughts, problems and solutions. Emily Miner for her friendship and for opening a new door of opportunity. Karolyn Senter and Ginny Fendell for helping me change my approach to life for the better. Robert Miller for being my karate teacher and a great example of a helpful, patient and honorable person despite difficult circumstances. Tatayya Thirupathi Reddy Cholleti for always wishing me well and wanting to see me succeed.


Nicole Dubruiel for supporting me during a crucial phase and helping me move forward whenever I slowed down in the final stretch. My sister, Sahithy, and brother-in-law, Mahesh Reddy Gillela, for taking care of me over the years and being there for my mom during tough times. Without their support this would not have been possible. For always believing in me and consistently being there for me during my and even your difficult times, I am forever grateful to you — Carol Heffern, Rebecca Giordano, Michael Plezbert and bayaji Sanjeev Dwivedi. For being my best friends this is your success as much as mine.

Sharath Reddy Cholleti
Washington University in St. Louis
December 2008


To my mom for her love and perseverance. To my dad for his dreams.


Contents

Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
  1.1 Localized Content-based Image Retrieval
    1.1.1 Data sets
    1.1.2 Measure of Ranking Accuracy
  1.2 Image Segmentation Evaluation
  1.3 Combining Expert Opinions without Labeled Data
2 Image Representation
  2.1 Segmentation-Based Image Representations
  2.2 Salient Points Auto-Reduction using SEgmentation: SPARSE
  2.3 Comparison of Salient Point-Based Representations
3 Multiple-Instance Winnow
  3.1 Related Work
  3.2 Winnow
  3.3 Multiple-Instance Winnow
    3.3.1 Reducing to Winnow
    3.3.2 Generating Hypotheses
    3.3.3 Ranking Images
  3.4 Experimental Results
  3.5 Concluding Remarks
4 Hardness of Multiple-Instance Learning
  4.1 Previous work
  4.2 Hardness of multiple-instance learning
5 Meta-Evaluation for Image Segmentation
  5.1 Related Work
  5.2 MSET: Meta-Segmentation Evaluation Technique
  5.3 Experimental Results
    5.3.1 Human segmentation results vs. machine segmentation results
    5.3.2 Results from different parameterizations of a segmentation method
    5.3.3 Results from different segmentation methods
    5.3.4 Discussion
  5.4 Conclusion
6 Veritas: Combining Expert Opinions without Labeled Data
  6.1 Background and Our Approach
  6.2 Veritas Algorithm
  6.3 Data and Experiments
    6.3.1 Real CT Data
    6.3.2 Artificial Data
    6.3.3 Loss Measures
  6.4 Results
    6.4.1 CT Data
    6.4.2 Artificial Data
    6.4.3 Various combining schemes for Veritas
  6.5 Conclusions and Future Directions
References
Vita

List of Tables

5.1 Results for human vs machine segmentations (mean ± 95% confidence interval).
5.2 The average evaluation accuracy (mean ± 95% confidence interval) for each method in the experiment when different parameterizations of the same algorithm are used.
5.3 The evaluation accuracy (mean ± 95% confidence interval) for each method in the experiment with different segmentation methods.
6.1 Average loss with 95% confidence interval for real CT data (18 experiments).
6.2 Average loss with 95% confidence interval for artificial data (93 experiments).
6.3 Average loss with 95% confidence interval for Veritas with weighted and unweighted combining of experts on CT data.
6.4 Average loss with 95% confidence interval for Veritas with weighted and unweighted combining of experts on artificial data.

List of Figures

1.1 An example for LCBIR.
1.2 Sample images from SIVAL data set.
1.3 Sample images from COREL data set.
1.4 Sample images from Flickr data set.
2.1 An example showing an image and its segmented image from which a bag is derived.
2.2 SPARSE image is derived using the segmented image and the image with wavelet-based salient points. The last two images are shown in gray scale just for easy visibility of the red salient points.
2.3 Salient points detection with SPARSE, the Haar wavelet-based method, and SIFT.
2.4 Comparing salient point methods on the SIVAL data set where the query set has 8 positive and 8 negative examples.
3.1 Overview of MI-Winnow.
3.2 Hypothesis generation method.
3.3 Ranking method.
3.4 Comparing MI-Winnow and Accio! on SIVAL data set where the query set has 8 positive and 8 negative examples.
3.5 Comparing MI-Winnow with varying the number of training examples from SIVAL data set.
3.6 MI-Winnow on COREL data set.
3.7 Comparing MI-Winnow with Accio! on Flickr data set.
5.1 An Overview of MSET. I is the input image. S1 and S2 are two segmentations of I.
5.2 An example decision tree output by MSET for base evaluator F.
5.3 An example decision tree for base evaluator E for which the overall accuracy is 86.8%.
5.4 Sample images.
5.5 Image examples segmented by EDISON and IHS. In the top example, EDISON is labeled best, and in the bottom example IHS is labeled as best.
6.1 On the left, a sample CT image is shown with green rectangle containing the nodule. On the right, three differing expert opinions about the lung nodule segmentation (only a small portion of the image close to the green rectangle is shown).
6.2 A sample of one of the CT scans that we use in our experimental evaluation. This particular CT slice of the chest shows both a real lung nodule (top arrow) and synthetic lung nodule (bottom arrow).
6.3 Overview of Veritas.
6.4 Another overview of Veritas.
6.5 Algorithm to create weak hypotheses.
6.6 Confidence Boost.
6.7 Combining expert segmentations.
6.8 The ground truth for our artificial data.
6.9 The eight expert segmentations used for our experiments with the artificial data.
6.10 Comparing Veritas with STAPLE, average and consensus for CT data with squared loss. First 15 data points use 4 expert opinions and the last 3 use 5 expert opinions.
6.11 Comparing Veritas with STAPLE and average for CT data with absolute loss. Consensus is not shown as it is equal to average with absolute loss.
6.12 Comparing Veritas with STAPLE, average and consensus for artificial data with squared loss. First 56 data points use 5 expert opinions, next 28 use 6 expert opinions, next 8 use 7 expert opinions and last one uses 8 expert opinions.
6.13 Comparing Veritas with STAPLE and average for artificial data with absolute loss.
6.14 Comparison (using squared loss) where only 1, 2 or 3 of the experts E4, E5 and E6 are chosen. At least 5 experts are used in the experiments. There are total 92 (18 + 48 + 26) experiments.
6.15 Comparing weighted and unweighted Veritas on CT data with squared loss.
6.16 Comparing weighted and unweighted Veritas on CT data with absolute loss.
6.17 Comparing weighted and unweighted Veritas on artificial data with squared loss.
6.18 Comparing weighted and unweighted Veritas on artificial data with absolute loss.

Chapter 1

Introduction

While much machine learning research focuses on learning tasks in which one begins from scratch, in this dissertation we study various learning problems where it is important and useful to combine the predictions made by several independent algorithms or experts. In this chapter, we introduce the three problems we address in this dissertation. In Section 1.1, we describe the localized content-based image retrieval problem and how we model it with multiple-instance learning. In Section 1.2, we describe the image segmentation evaluation problem. Finally, in Section 1.3, we briefly introduce the problem of combining expert opinions when there is no labeled training data.

1.1 Localized Content-based Image Retrieval

Content-Based Image Retrieval (CBIR) is the problem of retrieving semantically relevant images from an image database given one or more input images. One mechanism to get additional labeled images is through relevance feedback, where the user labels images from the results of an initial query as either desirable (positive) or not desirable (negative). Localized content-based image retrieval (LCBIR) is a special type of CBIR where only a small part of the image or a specific object is of interest to the user. Figure 1.1 shows an LCBIR example with 3 positive training images containing a green and yellow pot, 3 negative training images without a pot and a set of desired results containing a green and yellow pot irrespective of the background.


Figure 1.1: An example for LCBIR.

The multiple-instance (MI) learning model was first formalized by Dietterich et al. in their seminal paper [24]. In this model, each training example is a set (or bag) of instances along with a single label equal to the maximum label among all the instances in the bag. The MI learning model was originally motivated by the drug activity prediction problem, where each instance is a possible conformation (or shape) of a drug molecule and each bag contains all likely low-energy conformations for the molecule. A drug molecule is active if it binds strongly to the target protein in at least one of its conformations and is inactive if no conformation binds to the protein. The problem is to predict the label of drug molecules based on their conformations. The individual instances in the bag are not given labels. The goal is to learn to accurately predict the label of previously unseen bags. Supervised learning can be viewed as a special case of MI learning where each bag holds a single instance.

The MI learning model is well suited for LCBIR since there is natural ambiguity as to what portion of each image is important to the user. In the LCBIR application, each bag corresponds to an image, and each instance in the bag corresponds to a region of the image. Each region is represented by a feature vector. That is, each example is a collection (or bag) of d-dimensional instances where each dimension corresponds to a feature. A label is provided for the bag, but not for the individual instances within the bag. For example, if each region is represented by the average of each of the 3 color values then each instance would be 3-dimensional. However, typically much richer feature representations are used.

There has been significant research applying MIL to CBIR [50, 51, 84, 1, 15, 58, 6]. We are particularly interested in developing an MIL algorithm to use within an interactive LCBIR system in which the user would only have to label a very small number of images (≤20) in order to obtain good performance. This is very different from image categorization and other machine learning problems in which very large labeled data sets are available.

Learning the region of interest in an image is heavily dependent on how well the image is represented. Different image representations have different strengths. Specifically, we look at a salient point based representation and a segmentation based representation. We then combine the two representations to form a new improved salient point representation, SPARSE (Salient Points Auto-Reduction using SEgmentation), which we present in Chapter 2.

MI learning is inherently difficult as the positive data is very ambiguous. Given a positive bag we do not know which instances are truly positive. MI algorithms provide a mechanism to learn which instances are important. Even if an MI algorithm can learn which instances are truly positive during training, having knowledge of those instances from other sources can be helpful in improving or speeding up the learning. In Chapter 3, we present our new MI algorithm MI-Winnow [17] that uses an MI measure, diverse density [50], to approximately locate the truly positive instances, and then uses a single-instance learning algorithm, Winnow [43], for the rest of the learning task.

After presenting our algorithms to combine multiple techniques that are evaluated empirically in Chapters 2 and 3, we explore some theoretical aspects of MI learning in Chapter 4. The ambiguity in the positive examples makes learning with MI examples a difficult problem. Under different models of how the MI examples are generated, the problem either becomes an easy-to-learn single-instance problem [47, 5, 8] or is shown to be as hard as learning DNF formulas [5]. In Chapter 4, we extend the hardness result.
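The bag abstraction described above is simple to state concretely. The following minimal sketch (the class and function names are ours, not code from this dissertation) shows an MI example as a set of per-region feature vectors with a single bag-level label, and how a bag label relates to the hidden instance labels.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bag:
    """A multiple-instance example: one label for a set of feature vectors."""
    instances: List[List[float]]  # one d-dimensional feature vector per image region
    label: int                    # 1 if at least one region is of interest, else 0

def bag_label_from_instance_labels(instance_labels: List[int]) -> int:
    # The bag label is the maximum (logical OR) of the hidden instance labels;
    # the learner never observes the individual instance labels.
    return max(instance_labels)

# Example: an image segmented into three regions, each a 6-dimensional
# feature vector (3 color + 3 texture averages), labeled positive as a whole.
image_bag = Bag(
    instances=[[0.2, 0.5, 0.1, 0.3, 0.7, 0.4],
               [0.8, 0.1, 0.2, 0.6, 0.2, 0.9],
               [0.4, 0.4, 0.4, 0.5, 0.5, 0.5]],
    label=1,
)
```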


1.1.1 Data sets

To evaluate the algorithms we develop in Chapters 2 and 3, we use three benchmarks, all available at www.cse.wustl.edu/∼sg/accio/data.html: the SIVAL data set, which contains 1500 images among 25 categories; a COREL data set, which contains 2000 images among 20 categories; and a Flickr data set, which contains 4000 images from 20 categories.

Figure 1.2: Sample images from SIVAL data set.

The SIVAL benchmark (www.cse.wustl.edu/∼sg/accio/SIVAL.html) includes 25 image categories containing 1500 images. The categories consist of complex objects photographed against 10 different, highly diverse backgrounds. SIVAL emphasizes the task of LCBIR by including many nearly identical scenes that differ only by the localized target objects. For each object class, the same complex physical object was used in all scenes. It is also a difficult data set in that the scenes are highly diverse and often complex. Furthermore, the objects may occur anywhere spatially in the image and also may be photographed at a wide angle or close up, or with different orientations. In most images, the target object occupies less than 10%-15% of the image area but may occupy as much as 70%. Sample images are shown in Figure 1.2.



Figure 1.3: Sample images from COREL data set.

The COREL data set, which is very different in nature, is obtained from www.cs.uno.edu/∼yixin/ddsvm.html. In this data set, obtained from COREL Corporation, there are 20 very broad categories, including: African people and villages, beaches, historical buildings, buses, dinosaurs, elephants, flowers, horses, mountains and glaciers, and food. Unlike SIVAL, in which an image contains many objects other than the one of interest, in the COREL benchmark each image just contains an object or scene from the particular category. Sample images are shown in Figure 1.3.

Also used is a benchmark composed from Flickr containing 20 categories with 100 images per category, as well as 2000 images that are not from any category (found by searching for object). The images for the 20 categories were obtained by searching for the following terms using Flickr's API: American flag, boat, cat, Coca-Cola can, fire flame, fireworks, honey bee, Irish flag, keyboard, Mexico City taxi, Mountie, New York taxi, orchard, ostrich, Pepsi can, Persian rug, samurai helmet, snow boarding, sushi, and waterfall. The top 200 images (based on relevance) were downloaded for each category, and then we manually picked 100 images that best represented the category. For the 2000 random images we searched for the word “object” and used the top 2000 images. The specific set of images used is listed at www.cse.wustl.edu/∼sg/accio/flickr-data-set. We use the Flickr data set to illustrate how Accio! can successfully be used for real image retrieval problems where the user is interested in general object categories. Some example images are shown in Figure 1.4.


Figure 1.4: Sample images from Flickr data set.


1.1.2 Measure of Ranking Accuracy

Our task is to rank the test images, and not to classify the images into positive or negative. As our measure of performance, we use the area under the receiver operating characteristic (ROC) curve [34] that plots the true positive rate as a function of the false positive rate. The area under the ROC curve (AUC) is equivalent to the probability that a randomly chosen positive image is ranked higher than a randomly chosen negative image. Unlike the precision-recall curve, the ROC curve is insensitive to the ratio of positive to negative examples in the image repository. Regardless of the fraction of the images that are positive, for a random permutation the AUC is 0.5. For all AUCs reported, we repeat each experiment with 30 random selections of the positive and negative examples and use these to compute the average AUC and the 95% confidence intervals for the AUC.
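Since the AUC is exactly the probability that a randomly chosen positive image is ranked above a randomly chosen negative one, it can be computed directly from the two lists of ranking scores. The sketch below is only an illustration of that pairwise definition (the function and variable names are ours), rather than an integration of the ROC curve.

```python
import random

def auc(positive_scores, negative_scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive image is ranked above a randomly chosen negative image.
    Ties count as half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# A random ranking gives an AUC near 0.5 regardless of the class ratio.
pos = [random.random() for _ in range(50)]
neg = [random.random() for _ in range(500)]
print(auc(pos, neg))  # close to 0.5
```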

1.2 Image Segmentation Evaluation

Image segmentation is a fundamental step in many image, video and computer vision applications. Generally the choice of a segmentation algorithm, or parameterization of a given algorithm, is selected at the application level and fixed for all images within that application. Many segmentation methods have been developed, making it hard to compare different segmentation methods, or even different parameterizations of a single method. However, the ability to compare two segmentations (generally obtained via two different methods or parameterizations) in an application-independent way is important for a variety of reasons, including: to autonomously select among two possible segmentations within a segmentation algorithm or a broader application; to place a segmentation algorithm on a solid experimental and scientific ground [11]; and to monitor segmentation results on the fly, so that segmentation performance can be guaranteed and consistency can be maintained [26].

Designing a good measure of segmentation quality is a known hard problem as each person has his/her own standard for a good segmentation and different applications may function better using different segmentations. While the criteria of a good segmentation are often application-dependent and hard to explicitly define, for many applications the difference between a favorable segmentation and an inferior one is noticeable. It is possible, and necessary, to design performance measures to capture such differences.

Human-aided methods are widely used in segmentation evaluation, either by relative measures that compute the discrepancy between one segmentation and a human-generated reference segmentation [27, 52], or by subjective measures in which humans evaluate the segmentation visually or qualitatively [31]. Although usually deemed more satisfactory, human-aided methods are subjective, tedious and time-consuming. In contrast, stand-alone evaluation methods [13] evaluate a segmentation based on how well it matches the desired characteristics of a good segmentation, as based on human judgment [10, 42, 45, 55, 63, 73, 77]. Since they do not require reference images, these methods can operate over a wide range of image types and varying conditions, and in real-time systems where a large number of unknown images need to be processed. Current stand-alone evaluation methods use different criteria to evaluate segmentation quality and are better at evaluating some types of images, but not all.

With the very nature of the problem encouraging subjective opinions, and different evaluation methods being good at judging only some types of images, we believe a better way to tackle the segmentation evaluation problem is to acknowledge the inherent difficulty and to try to exploit the expertise of multiple evaluation methods at the same time. In Chapter 5, we present a new meta-evaluation algorithm, MSET, that learns how good each of the provided evaluation algorithms is on different types of images, and uses that knowledge to automatically weight the evaluation algorithms differently for a new image.
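To make the idea of adaptively weighting base evaluators concrete, the toy sketch below combines base-evaluator scores with per-image weights. It only illustrates the weighting idea; the actual MSET algorithm and its learned models are described in Chapter 5, and all names and numbers here are hypothetical.

```python
def combine_evaluators(scores, weights):
    """Weighted combination of base-evaluator scores for one candidate segmentation.
    scores[e] is evaluator e's quality score; weights[e] is how much we trust
    evaluator e for the current image (learned offline, e.g. per image type)."""
    total = sum(weights.values())
    return sum(weights[e] * scores[e] for e in scores) / total

# Hypothetical example: evaluator "F" is trusted more on highly textured images.
image_features = {"texture": 0.9}
weights = {"E": 0.3, "F": 0.7} if image_features["texture"] > 0.5 else {"E": 0.7, "F": 0.3}
scores = {"E": 0.62, "F": 0.81}
print(combine_evaluators(scores, weights))
```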

1.3 Combining Expert Opinions without Labeled Data

The lack of a gold standard of truth for medical images is a critical problem limiting computer aided detection, automated image analysis, and change determination.

Establishing ground truth for radiological results or quantitative analyses performed on medical images is exceptionally difficult [3, 4, 56, 72]. Consider the task of segmenting a lung nodule in computed tomography (CT) images of the human chest. Both human experts (radiologists) and computer algorithms are available to mark which voxels are part of a nodule. The accuracy of an expert opinion varies greatly depending on the specific feature and image type, and both inter- and intra-observer variance can be unacceptably high [36]. The question is how to combine these expert image segmentations to estimate the unknown ground truth (the real nodule) when there is no ground truth available for training.

While it is easy for human observers to identify and label relevant features in images of natural scenes with high accuracy, the same does not hold for medical images. Currently, expert opinion as to the existence and location of image features and their relevance for a specific diagnosis provides the “gold standard” of truth for medical images. In CT scans of the human chest, there are no labeled data where we know the nodule exactly. All we have are the CT scans and different expert opinions as to the location and extent of the nodule. While in some instances correlative clinical findings, such as pathology results and patient outcomes, may be used as substantiating evidence, this is not ground truth. Unlike similar problems of combining experts that have been considered in machine learning [12], or the problem of combining different evaluators we present in Chapter 5, for this problem the lack of any accurately labeled training data is a significant obstacle we must overcome. In Chapter 6, we present a new algorithm, Veritas, which combines the opinions of multiple experts to estimate the underlying truth.
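For reference, the obvious baseline for combining expert segmentations without labels is a per-voxel vote: average the binary expert masks and threshold at a simple majority. The sketch below implements only that naive consensus baseline (with hypothetical toy masks), not Veritas; it is the kind of average/consensus combination Veritas is compared against in Chapter 6.

```python
import numpy as np

def consensus_segmentation(expert_masks, threshold=0.5):
    """Naive per-voxel combination of binary expert segmentations: average the
    masks and keep voxels marked by at least a fraction `threshold` of experts.
    This ignores that some experts are systematically better than others."""
    masks = np.stack(expert_masks).astype(float)   # shape: (n_experts, H, W)
    vote_fraction = masks.mean(axis=0)             # fraction of experts marking each voxel
    return vote_fraction >= threshold

# Three hypothetical 4x4 expert masks for a tiny nodule.
e1 = np.zeros((4, 4), dtype=int); e1[1:3, 1:3] = 1
e2 = np.zeros((4, 4), dtype=int); e2[1:3, 1:4] = 1
e3 = np.zeros((4, 4), dtype=int); e3[0:3, 1:3] = 1
print(consensus_segmentation([e1, e2, e3]).astype(int))
```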


Chapter 2

Image Representation

In this chapter, we describe the image representations that are used in the algorithm developed in Chapter 3. One of the representations we describe is a new image representation, SPARSE (Salient Points Auto-Reduction using SEgmentation), that is formed by combining a salient point representation with a segmentation based representation [83, 60].

One distinction between region-based CBIR systems and localized CBIR is how the image is processed. Single feature vector CBIR systems represent the entire image as one feature vector. For example, a color histogram [21, 32, 46] defined over the entire image is such a representation. In contrast, multiple feature vector CBIR systems represent the image as a collection of feature vectors with one feature vector for either a block in some pre-specified image subdivision (e.g., [50, 51]), the region defined by a segmentation algorithm (e.g., [70]), or a window around each salient point (e.g., [37, 41, 48, 65, 71, 74]). Many CBIR systems either subdivide the image into pre-defined blocks [50, 51, 66], or more commonly partition the image into different meaningful regions by applying a segmentation algorithm [58, 70]. In both cases, each region of the image is represented as a vector of feature values extracted from that region. Other CBIR systems extract salient points [37, 41, 48, 65, 71, 74], which are points of high variability in the features of the local pixel neighborhood. With salient point-based methods, one feature vector is created for each salient point.

Another important distinction is the type of similarity metric used to rank the images. In a global ranking method, all feature vectors in the image representation affect the ranking. While salient point-based methods only use portions of the image around the salient points, if the ranking method uses all salient points, then it is a global method. In contrast, local ranking methods select only portions of an image (or a subset of the salient points) as relevant to rank the images. For example, if a salient point-based method learns which subset S of the salient points are contained in desirable images and ranks images using only the subset S of salient points, then it is a local ranking method. Localized CBIR systems must use local ranking methods.

As the number of salient points per image increases, localized CBIR systems need more computation to find the appropriate points for learning or retrieving similar images, or some technique to reduce the number of salient points by filtering out unimportant points before the learning phase. In this chapter, we present a novel way to filter out the salient points that do not contribute much to the retrieval accuracy of localized CBIR systems. By retaining only a subset of points that cover all the complex segments in the image, this representation reduces the work that localized CBIR systems need to do in selecting good salient points. Through our experiments we show that selectively reducing the number of salient points actually increases the quality of results while decreasing the computation.

In Section 2.1, we describe the segmentation-based image representations we use in this and the next chapter. In Section 2.2, we present our new salient point image representation, SPARSE. In Section 2.3, we compare SPARSE with wavelet and SIFT salient point methods.

2.1 Segmentation-Based Image Representations

Here we briefly describe the two image representations used in our experimental work. All images are first transformed into the YCrCb color space and then pre-processed using a wavelet texture filter so that each pixel in the image has three color features and three texture features. Next, the improved hierarchical segmentation (IHS) algorithm [80] is used to segment the image. IHS is a clustering-based segmentation method [29] that uses the Euclidean distance between 6-dimensional feature vectors, with 3 color features and 3 texture features, as its similarity measure. IHS creates segments with regions of contiguous similar pixels. A different segmentation algorithm could be used instead.

Figure 2.1: An example showing an image and its segmented image from which a bag is derived.

With the segments created, we represent each image as a bag of instances that are generated using a no-neighbor bag generator and a neighbor bag generator. In both cases, we represent each image as a bag of 32 instances (32 was chosen after experiments with different numbers of points for the data sets we use). Let I be the segmented image. In the no-neighbor bag generator, each segment x ∈ I is a point in a 6-dimensional feature space given by the average of just the color and texture values of all the pixels of the segment itself. Figure 2.1 shows how an image is converted to a bag using image segmentation. Sometimes we use point interchangeably with instance to refer to an instance in a bag.

Since often it is the immediate surroundings that allow for a more complex and descriptive definition of the content of the image, we compute the neighbors to the north, south, east, and west for every segment. The feature vector for each segment is augmented with the differences between its features and its neighbors' features for all four neighbors. We use the feature differences to allow for robustness against global changes in the image, such as brightness changes from variable light or shade. In the neighbor bag generator, each segment x ∈ I is a point in a 30-dimensional feature space where the first 6 features hold the average color and texture values for x, and the next 6 features hold the difference between the average color and texture values of x and the northern neighbor. Similarly, there are 6 features for the difference information between x and each of its other 3 cardinal neighbors. Even though the feature space of the no-neighbor bag generator is a subset of the feature space of the neighbor bag generator, some results are better for this generator (more details in Section 3.4).
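A minimal sketch of the neighbor bag generator's feature layout, assuming the per-segment averages and the four neighbors' averages have already been computed (the function and variable names are ours, not the dissertation's code):

```python
import numpy as np

def neighbor_instance(seg_features, neighbors):
    """Build one 30-dimensional instance for a segment: its 6 averaged color/texture
    features followed by the differences to its north, south, east and west
    neighbors' features (robust to global brightness shifts).
    seg_features is a length-6 array; neighbors maps direction -> length-6 array."""
    parts = [np.asarray(seg_features, dtype=float)]
    for direction in ("north", "south", "east", "west"):
        parts.append(parts[0] - np.asarray(neighbors[direction], dtype=float))
    return np.concatenate(parts)  # shape (30,)

# Hypothetical segment with identical neighbors: the 24 difference features are 0.
seg = [0.4, 0.5, 0.1, 0.3, 0.7, 0.2]
nbrs = {d: [0.4, 0.5, 0.1, 0.3, 0.7, 0.2] for d in ("north", "south", "east", "west")}
print(neighbor_instance(seg, nbrs).shape)  # (30,)
```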

2.2 Salient Points Auto-Reduction using SEgmentation: SPARSE

Here we present an alternate image representation that is based upon salient points, which are points of high variability in the features of the local pixel neighborhood. First, we motivate the use of salient points and present some challenges that face traditional salient point representations for localized CBIR.

Usually the performance of a CBIR system depends highly on the quality of the segmentation. Therefore, a segmentation-based representation requires high quality segmentation because small areas of incorrect segmentation might make the representation very different from that of the real object. Salient point-based representations decouple the sensitivity of a CBIR system from the quality of the segmentation.

Most salient point detection methods focus on finding the points where there is significant change with respect to a chosen image feature. Traditional uses of salient points for CBIR compute the feature vector for a salient point according to the features of all pixels in a window around the salient point [37, 41, 48, 65, 71, 74]. While such representations capture local characteristics of an image, since salient points can gather at locations of high texture and where the change of image feature is significant (such as edges), a very large number of salient points are generally needed to include salient points for all objects that might be of interest.

One drawback of having a large number is that many salient points capture the same portions of the image. Having many salient points with similar information does not help in increasing the retrieval accuracy for a localized CBIR system, yet it increases the computational cost. Also, having too many irrelevant salient points can decrease the retrieval accuracy, as shown in our experiments. Even salient point methods that are deliberately designed to reduce this gathering [65] still place many salient points on some objects. In contrast, an advantage of a region-based image representation is that most images are segmented into a fairly small number of segments (10-50 segments).

We introduce a new salient point representation for localized CBIR by using image segmentation to form a mask that limits the number of salient points in each segment while maintaining the diversity of the salient points. Our SPARSE image representation limits the number of salient points in each segment while maintaining the diversity needed for localized CBIR. SPARSE first applies a salient point detection algorithm to the image. We use a Haar wavelet-based salient point detection method. Beginning at the coarsest level of wavelet coefficients, we keep track of the salient points from level to level by finding the points with the highest coefficients on the next finer level among those used to compute the wavelet coefficients at the current level. The saliency value of a salient point is the sum of the wavelet coefficients of its parent salient points from all coarser scales. Although a wavelet-based method is used here, any salient point detection method could be used instead with little modification.

Figure 2.2: SPARSE image is derived using the segmented image and the image with wavelet-based salient points. The last two images are shown in gray scale just for easy visibility of the red salient points. (Panels, left to right: original image, segmented image, wavelet-based method, SPARSE.)

Next, the IHS segmentation algorithm [80] is applied to the image. The resulting segmentation is used as a mask to reduce the number of salient points. Specifically, SPARSE keeps at most k salient points in each segment, by retaining those with the highest saliency value and removing the others. If there are fewer than k salient points in a region, all are kept. In our implementation, we set k = 3 after some preliminary experiments. With 3 salient points per segment we potentially get slightly different points but avoid too much redundancy. We use IHS to create 32 segments per image. Figure 2.2 shows an example with the original image, the IHS segmented image, the image with salient points derived using the Haar wavelet-based method, and the resulting SPARSE image.

Since the segmentation is just used as a filter for reducing the salient points, the performance of the CBIR system is not as dependent on the exact boundaries of the segmentation, since the representation for each salient point is based only on the region around that salient point. In other words, the segmentation only affects the selection of the salient points but not their representation.

Figure 2.3 shows examples of salient points detected using SPARSE. For comparison, we also show the salient points detected by the Haar wavelet-based salient point detection method and the scale-invariant feature transform (SIFT) [48] method. The wavelet-based method selects the top 200 salient points for each image. SPARSE reduces this to at most 96 salient points per image. SIFT selects 392 salient points for the tea box, and 288 salient points for the coke can. When using SPARSE the salient points predominantly gather at complex objects, whereas with the wavelet-based method the salient points gather at the edges. While the wavelet-based method does reduce the number of salient points on the textured region (such as at the printed words on the calendar and tea box), SPARSE further reduces the number of salient points at textured regions.
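The masking step itself is straightforward; the sketch below keeps at most k of the highest-saliency points per segment, as described above. It assumes the salient point detector and the segmentation mask have already been applied, and the data layout and names are our own illustration rather than the actual implementation.

```python
def sparse_filter(salient_points, k=3):
    """Keep at most k salient points per segment, retaining those with the highest
    saliency. salient_points is a list of (segment_id, saliency, point) tuples
    produced by any salient point detector plus a segmentation mask."""
    by_segment = {}
    for seg_id, saliency, point in salient_points:
        by_segment.setdefault(seg_id, []).append((saliency, point))
    kept = []
    for seg_id, pts in by_segment.items():
        pts.sort(key=lambda sp: sp[0], reverse=True)   # highest saliency first
        kept.extend((seg_id, s, p) for s, p in pts[:k])
    return kept

# Hypothetical input: 5 points in segment 0, 2 in segment 7 -> at most 3 + 2 kept.
points = [(0, 0.9, (10, 12)), (0, 0.8, (11, 12)), (0, 0.4, (40, 3)),
          (0, 0.2, (41, 3)), (0, 0.1, (42, 3)), (7, 0.6, (5, 5)), (7, 0.3, (6, 5))]
print(len(sparse_filter(points, k=3)))  # 5
```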

2.3 Comparison of Salient Point-Based Representations

In this section we compare the performance of SPARSE with other salient point extraction methods using an existing localized CBIR system, Accio! [58, 59, 60]. Fig. 2.4 compares the three salient point extraction methods (wavelet, SIFT, and SPARSE) on the SIVAL data set (described in Section 1.1.1) using the AUC measure. In these experiments, both SPARSE and wavelet use the same salient point extraction and representation methods (3 color and 3 texture dimensions). The primary difference between them is where the salient points are placed in the image. SIFT differs in both respects: it uses a different feature extraction method that places the salient points differently, and it uses a more complex feature representation.

Figure 2.3: Salient points detection with SPARSE, the Haar wavelet-based method, and SIFT. (Panels: original image, SPARSE, wavelet-based method, SIFT.)

SPARSE outperforms wavelet in 23 of 25 categories, 16 of which are statistically significant. The use of SPARSE can also improve the algorithm efficiency by reducing the number of feature vectors per image, and hence the amount of computation. The SIFT feature vector has 128 dimensions that describe the local gradient orientation histogram around a salient point. Results using SIFT were generated using 5 random selections of the training data (as opposed to 30 for SPARSE and wavelet) since the high dimensionality makes it very computationally intensive. SIFT performs 5.9% better than wavelet over all categories, and SPARSE outperforms SIFT by 3.2% (over all categories), despite its relatively simpler feature representation.

These experiments show the benefits of combining two different types of image representation algorithms in producing a better representation suitable for localized CBIR. Though there is some extra computation involved in applying the segmentation algorithm and using it as a mask, overall it is beneficial as it reduces the computation in the learning and ranking phases. In our experiments, Accio! took less time with the fewer salient points of the SPARSE representation than with the regular wavelet representation.


Figure 2.4: Comparing salient point methods on the SIVAL data set where the query set has 8 positive and 8 negative examples. (The chart plots AUC per SIVAL category for SPARSE, Wavelet, and SIFT.)

In the next chapter we look at how some of these representations are used by a new localized CBIR algorithm, MI-Winnow.


Chapter 3

Multiple-Instance Winnow

In this chapter, we present MI-Winnow [17], a new multiple-instance learning (MIL) algorithm that provides a new technique to convert MIL data into standard single-instance data. MI-Winnow is different from existing multiple-instance learning algorithms in several key ways. First, MI-Winnow allows each image to be converted into a bag in multiple ways to create training (and test) data that varies both in the number of dimensions per instance and in the kind of features used. Second, instead of learning a concept defined by a single point-and-scaling hypothesis, MI-Winnow allows the underlying concept to be described by combining a set of separators learned by Winnow. For content-based image retrieval applications, such a generalized hypothesis is important since there may be different ways to recognize which images are of interest.

Many multiple-instance learning algorithms output a hypothesis that is a d-dimensional point and a d-dimensional scale vector defining a weighted Euclidean metric. In some cases an L1 norm is used, in which case the final hypothesis is a d-dimensional axis-parallel box. We refer to such a hypothesis as a point-and-scaling hypothesis (regardless of whether the L1 or L2 norm is used). For some application areas, a single point-and-scaling hypothesis has sufficient expressive power. However, in many CBIR tasks, one would expect several instances (regions) in the bag (image) to be needed to explain why the image is desirable. These instances may represent parts of a single complex object or several objects comprising a scene. Thus, it is important to study MIL when using a more complex hypothesis space than a single point-and-scaling hypothesis in order to have the expressive power to capture what makes an image desirable to a user.
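As an illustration of how a single point-and-scaling hypothesis labels a bag, the following sketch marks a bag positive when some instance falls within a threshold of the target point under the scaled Euclidean metric. The threshold rule and all names are illustrative assumptions on our part, not MI-Winnow.

```python
import math

def bag_label(point, scale, bag_instances, threshold):
    """Label a bag with a single point-and-scaling hypothesis: the bag is positive
    if some instance lies within `threshold` of the target point under the
    weighted (scaled) Euclidean metric."""
    def weighted_dist(x):
        return math.sqrt(sum((s * (xi - ti)) ** 2 for xi, ti, s in zip(x, point, scale)))
    return 1 if any(weighted_dist(x) <= threshold for x in bag_instances) else 0

bag = [[0.1, 0.9], [0.48, 0.52]]
print(bag_label(point=[0.5, 0.5], scale=[1.0, 1.0], bag_instances=bag, threshold=0.1))  # 1
```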

We present a new MIL algorithm, MI-Winnow [17], based upon Winnow [43], an on-line learning algorithm with provable bounds upon its performance. Winnow was designed to separate the relevant attributes from the irrelevant. MI-Winnow proposes a new way to convert an MIL problem to a set of problems, each of which can be solved using a standard single-instance learning algorithm. There are several advantages of MI-Winnow over the EM-DD algorithm [85] and others like it that are commonly used for CBIR. First, the hypothesis returned by MI-Winnow is a set of d-dimensional boxes as opposed to a feature point and scale vector. By taking this approach, the difficult problem of finding a good starting value for the scale factors is avoided. Second, by combining the predictions of a set of boxes, MI-Winnow is capable of capturing more complex concepts. A recent version of EM-DD [60] also combines multiple hypotheses. Another drawback of EM-DD is that, given a test bag, it cannot directly classify the bag as positive or negative. In order to perform classification of a bag, it is necessary to learn a threshold on the label between positive and negative. MI-Winnow can directly predict whether a bag is positive or negative.

As discussed in Chapter 2, we define the bag generator as the procedure used to convert the image to a bag. There are many different methods to generate an MIL representation for an image, and depending on the type of images, some can capture the target concept better than others. Unlike existing MIL algorithms, MI-Winnow allows any number of different bag generators to be applied to each image.

The remainder of this chapter is organized as follows. Section 3.1 discusses related work. We review Winnow in Section 3.2. MI-Winnow is presented in Section 3.3. Experimental results are presented in Section 3.4. Finally, in Section 3.5, we close with concluding remarks and a discussion of future work.

3.1 Related Work

We briefly overview prior work on MIL with an emphasis on applications to CBIR. Maron and Lozano-Pérez [50] were the first to apply MIL to a CBIR task. Their algorithm returns a point-and-scaling hypothesis h = {t_1, ..., t_d, s_1, ..., s_d} where t_k is the feature value for dimension k and s_k is a scale factor indicating the importance of feature k. They introduced the diverse density (DD) algorithm that uses a two-step gradient descent search with multiple starting points to find a hypothesis that minimizes NLDD, the negative logarithm of DD(h, D), which is a measure of the likelihood of h given the training data D. One step of the gradient search modifies the point t and the other modifies the scale vector s. Intuitively, the point in this 2d-dimensional search space where the DD is the greatest is where the greatest number of positive bags have instances nearby and all instances from the negative bags are far. Yang et al. [76] built upon the work of Maron and Lozano-Pérez by using a different fixed partitioning of the image and evaluating the quality of a hypothesis with a weighted correlation similarity measure instead of the diverse density measure.

Zhang and Goldman [85] introduced the EM-DD algorithm, which treats the knowledge of which instance corresponds to the label of the bag as a missing attribute and applies a variation of the EM algorithm [22] in which it selects the value for the hidden variable with the best (versus expected) correspondence to convert the multiple-instance learning problem to a standard supervised learning problem. (More accurately, the method used by EM-DD is MLESAC [67], a generalization of RANSAC [28], in which the maximum likelihood value for the hidden variables is selected in the estimation phase.) EM-DD starts with some initial guess of a point-and-scaling hypothesis h and then repeatedly performs the following two steps. In the first step (E-step), the current hypothesis h is used to pick one instance from each bag which is most likely (given the generative model) to be the one responsible for the label. In the second step (M-step), the two-step gradient search of the DD algorithm is used to find a new h that minimizes NLDD(h, D). From among the multiple starts of EM, the point-and-scaling concept with the minimum NLDD value is returned. A run of EM-DD is started from every positive instance from five randomly selected positive bags (or from all positive bags if there are fewer than five) with all scale factors set to 0.1.

Zhang et al. [84] applied EM-DD to CBIR using a segmentation-based approach to represent each image as a bag. Huang et al. [35] presented a variation of their work that incorporated a different segmentation algorithm and a neural network based MIL algorithm.

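For readers unfamiliar with diverse density, the sketch below computes the negative log diverse density of a point-and-scaling hypothesis under the standard noisy-or formulation of Maron and Lozano-Pérez [50]; this is the objective the DD and EM-DD gradient searches minimize. It is our own illustrative rendering, not code from any of the cited systems.

```python
import math

def nldd(point, scale, positive_bags, negative_bags):
    """Negative log diverse density of a point-and-scaling hypothesis under the
    noisy-or model: positive bags should have some instance near the target point,
    and every instance of every negative bag should be far from it."""
    def pr_instance(x):
        # Probability that instance x belongs to the target concept.
        return math.exp(-sum((s * (xi - ti)) ** 2 for xi, ti, s in zip(x, point, scale)))

    nl, eps = 0.0, 1e-12
    for bag in positive_bags:
        p_bag = 1.0 - math.prod(1.0 - pr_instance(x) for x in bag)
        nl -= math.log(max(p_bag, eps))
    for bag in negative_bags:
        p_bag = math.prod(1.0 - pr_instance(x) for x in bag)
        nl -= math.log(max(p_bag, eps))
    return nl

pos = [[[0.5, 0.5], [0.9, 0.1]]]   # one positive bag with two instances
neg = [[[0.0, 1.0]]]               # one negative bag with one instance
print(nldd([0.5, 0.5], [1.0, 1.0], pos, neg))
```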


Andrews and Hofmann [1] introduced two new MIL algorithms, MI-SVM and miSVM, that like EM-DD used an EM-based approach. However, instead of using the DD algorithm for the maximization phase, they used a support vector machine. The difference between these two approaches is the choice of the hidden variables. In mi-SVM, the hidden variables are the labels of the individual instances, whereas in MI-SVM, the hidden variable is a selector variable indicating which instance in the bag is positive. Chen and Wang [15] considered the problem of image categorization which is the multi-class problem of determining which of a set of pre-defined categories an image belongs to. For this problem, one would typically expect to have fairly large training sets and the image categories are generally quite broad. DD-SVM uses a one-againstthe-rest strategy to handle the multi-class setting. To create the binary classifier for each category, DD-SVM converts the multiple-instance problem into a single instance problem by introducing a dimension for each instance from a positive bag and then applies a SVM with examples mapped to this new feature space. Bi et al . [6] presented 1-norm SVM which modifies DD-SVM in several key ways. First, EM-DD is not used, but rather each instance in each positive bag is used as a positive prototype and no negative prototypes are used. Second, the feature value in each new dimension uses an unweighted Euclidean metric. Finally, a 1-norm SVM is used since it reduces to solving a linear program which is more efficient. The advantage of this approach is that when the training data size is large (as common for image categorization), comparable results can be obtained with significantly lower computation costs. However, this approach is not good when there are small data sets as one would expect for CBIR. Also, it does not perform any scaling, so all image features are treated equally. Finally, while their method removes the need to tune the parameters used to select the instance prototypes, there are two parameters that must be tuned for the SVM. There has also been research on building ensembles of multiple-instance learners and applying MIL to CBIR such as the work of Zhou and Zhang [86] and Xu and Frank [75]. Zhou and Zhang applied bagging to a set of different MIL algorithms. For several data sets the bagged version of EM-DD performed best. Xu and Frank


applied boosting to a MIL algorithm. One drawback of applying bagging or boosting is that the MIL algorithm must be run multiple times using different training data.

3.2

Winnow

In the on-line learning model, the learning algorithm receives one example at a time. During training, once the algorithm predicts a label for the example, it receives the true label and can update its hypothesis. A seminal result in the on-line learning model is Littlestone's noise-tolerant algorithm Winnow [43] for learning disjunctions among a set of N attributes in which only k (for N ≫ k) of the attributes are relevant. Winnow makes predictions based on a linear threshold function

ŷ_t = 1 if Σ_{i=1}^{N} w_i x_i ≥ θ, and ŷ_t = 0 otherwise,

where w_i is the weight associated with the Boolean attribute x_i, and θ is generally set to N/2 (unless additional prior information is known about the target concept). If the prediction is wrong then the weights are updated as follows. On a false negative prediction, for each attribute x_i with x_i = 1, Winnow promotes the weight w_i by multiplying it by some constant update factor α > 1. On a false positive prediction, for each attribute x_i with x_i = 1, Winnow demotes the weight w_i by dividing it by α. Winnow is similar to the classical Perceptron algorithm [62], except that the Perceptron algorithm updates its weights additively while Winnow uses multiplicative weight updates. Another major difference between these algorithms is that when learning disjunctions over N attributes with k of them relevant, Winnow's mistake bound is logarithmic in N whereas the Perceptron algorithm's mistake bound can be linear in N in the worst case [40]. Thus Winnow is a good choice when there are many irrelevant features. We learn a box in the MIL model using the same basic approach as Maass and Warmuth [49], who learn a box in a discretized d-dimensional space {0, 1, . . . , n−1}^d. For each of the d dimensions they introduce 2n half-spaces, with n directed in each direction

for each discrete value. Winnow can be used to efficiently determine which of the 2d half-spaces from among the 2dn possible half-spaces are relevant.
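To make the update rule concrete, here is a minimal sketch of Winnow for Boolean attribute vectors; the function name, the toy disjunction, and the default values of α and θ are illustrative choices, not taken from the dissertation's implementation.

import numpy as np

def train_winnow(X, y, alpha=2.0, n_epochs=10):
    # Littlestone's Winnow for Boolean attribute vectors X (0/1) and labels y (0/1).
    n = X.shape[1]
    w = np.ones(n)                 # all weights start at 1
    theta = n / 2.0                # default threshold N/2
    for _ in range(n_epochs):
        for x, label in zip(X, y):
            pred = 1 if np.dot(w, x) >= theta else 0
            if pred == 0 and label == 1:       # false negative: promote active attributes
                w[x == 1] *= alpha
            elif pred == 1 and label == 0:     # false positive: demote active attributes
                w[x == 1] /= alpha
    return w, theta

# Toy target: the disjunction x0 OR x3 over 6 attributes (hypothetical data).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = ((X[:, 0] == 1) | (X[:, 3] == 1)).astype(int)
w, theta = train_winnow(X, y)
print(w, theta)    # the weights for x0 and x3 should dominate the rest

The multiplicative promotion and demotion is what yields the logarithmic dependence on the number of irrelevant attributes mentioned above.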

3.3

Multiple-Instance Winnow

In this section we describe multiple-instance Winnow (MI-Winnow). A new contribution included within MI-Winnow is an alternative to using EM to handle the ambiguity inherent in not knowing which instance in a positive bag is responsible for the positive label. In particular, we use the diverse density (DD) measure, computed over the instances in all positive bags, to select a set of candidates for the truly positive instances. We then create negative training data from all the instances in the negative bags, and train Winnow to generate a box. Finally, we combine the hypotheses obtained from multiple runs of Winnow (possibly with data obtained using different bag generators) to create a final hypothesis. We have performed experiments in which we combined EM with MI-Winnow (reselecting the most likely instances to be positive and re-running Winnow), but found that using EM does not improve the accuracy of the resulting hypothesis (and increases the computation cost).

3.3.1

Reducing to Winnow

Now we describe how a bag can be converted into a standard example (with one instance) for Winnow. Recall that a bag is positive if and only if at least one instance in the bag is positive. However, the learning algorithm can only see the label for the bag. We define a true positive instance to be an instance from a positive bag that has the same label as the bag. What makes MIL hard is that a positive bag might contain only one true positive instance – the other instances in the bag could be false positives. In contrast, all instances in a negative bag are truly negative. For the moment, suppose MI-Winnow could locate a truly positive instance p from a positive bag. Then for any other positive bag, the closest instance in that bag to p is very likely to also be a positive instance. That is, the set S+ of the closest instance

to p from each of the other positive bags should contain mostly positive instances if p is truly positive. Let S− contain all instances from the negative bags. If p is truly positive, then S+ and S− will be consistent with a d-dimensional box, which can be learned with Winnow. Many existing algorithms address the difficulty of finding a truly positive instance by trying each instance from a set of positive bags as the candidate for p. However, such an approach is computationally expensive. MI-Winnow instead uses the training data to filter the instances from the positive bags and determine which ones are most likely to be truly positive, since those are the instances that will produce the best results. To achieve this goal, we use the diverse density (DD) measure as first defined by Maron and Lozano-Pérez [50]. We use the following notation and conventions. Let dist(p_i, p_j) denote the Euclidean distance between instances p_i and p_j, and let d_h(p) = exp(−dist²(h, p)). Intuitively, d_h(p) can be viewed as the likelihood that p is positive according to h. We use a Gaussian centered around h to convert the distance to this likelihood measure, which we then treat as a real-valued label for p. Finally, the diverse density of hypothesis h is

DD(h) = ∏_{i=1}^{n} max_j (1 − |y_i − d_h(x_{ij})|²).

Observe that the term |y_i − d_h(x_{ij})|²

is the absolute loss between the true label y_i for bag x_i and the label that would be given to the j-th instance x_{ij} in x_i. Thus the term inside the maximum in the definition of DD(h) is a measure of the likelihood that the label y_i given to bag x_i is correct based on instance x_{ij}. The label for each bag is the likelihood for the best instance in the bag. Finally, under the assumptions (as typically made in defining the DD measure) of independence between the instances in a bag and a uniform prior over the hypothesis space, DD(h) is the likelihood of h being the target based on the labeled data L. An observation used in creating MI-Winnow is that the use of EM to determine which instance in each bag is important, as EM-DD does, can be replaced by the much more direct approach of using the DD of each instance as a measure of its importance. MI-Winnow calculates the DD of all instances in each positive bag (treating all features as equally important) to determine which instances are most likely to be truly positive.


Only these instances are used as candidates for the truly positive instance p. This method generates fewer candidate instances for Winnow, and the resulting hypotheses are more likely to be accurate.

An overview of MI-Winnow is shown in Figure 3.1. We partition the training bags B into {(B1+, B1−), · · · , (Bb+, Bb−)} where each pair is generated by a different bag generator. While all instances created by the same bag generator must have the same dimensionality, different bag generators need not generate instances of the same dimensionality. MI-Winnow takes two parameters: τ, which controls how many different candidates are considered as the truly positive instance, and s, which controls the number of variables used by Winnow. As shown in the experimental results, the performance of MI-Winnow is not very sensitive to either of these parameter values.

MI-Winnow(B = {(B1+, B1−), · · · , (Bb+, Bb−)}, τ, s)
  Let H = {}
  For i = 1 to b
    Let d_i be the number of dimensions for instances in (Bi+, Bi−)
    Let Bi1+, . . . , Bim_i+ be the bags in Bi+
    Let P− = {p ∈ Bi−}
    For all p ∈ Bi+, compute DD(p)
    For k = 1 to m_i
      Pτ = {p ∈ Bik+ | DD(p) in top τ%}
      For all p ∈ Pτ
        Let P+ = {}
        For j = 1 to m_i
          Let Pq = {p′ ∈ Bij+ | DD(p′) in top 25%}
          Let p′ ∈ Bij+ be the instance closest to p
          If p′ ∈ Pq then P+ = P+ ∪ {p′}
        If |P+| > |Bi+|/2
          H = H ∪ {GenHypotheses(P+, P−, d_i, s)}
  Return H

Figure 3.1: Overview of MI-Winnow.

The set Pτ contains all instances from a positive bag that have a DD value in the top τ% among all instances in that bag. For each p in Pτ we pick the closest instance to p from each positive bag to place in P+. However, we do not use any instance from a bag if the closest instance to p is not within the top quarter of DD values among the instances in that bag (stored in Pq). The motivation for this choice is that there could be multiple target boxes, each defined by a different subset of positive bags. In such cases, not all positive bags have instances in every target box. If at least half of the positive bags do not contribute an instance to P+, then this P+ is not used with Winnow. This optimization further reduces the number of hypotheses generated.
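Because the candidate filtering above hinges on computing DD(p) for every instance in the positive bags, a small sketch may help. This is a hypothetical NumPy implementation of the DD measure defined earlier, treating all features as equally important (no scale factors); it is only an illustration, not the code used for MI-Winnow.

import numpy as np

def instance_likelihood(h, p):
    # d_h(p) = exp(-dist^2(h, p)): likelihood that instance p is positive under point hypothesis h.
    h, p = np.asarray(h, dtype=float), np.asarray(p, dtype=float)
    return np.exp(-np.sum((h - p) ** 2))

def diverse_density(h, bags, labels):
    # DD(h) = prod_i max_j (1 - |y_i - d_h(x_ij)|^2), taken over every bag in the training data.
    dd = 1.0
    for bag, y in zip(bags, labels):
        dd *= max(1.0 - (y - instance_likelihood(h, p)) ** 2 for p in bag)
    return dd

def candidate_instances(bags, labels, top_fraction=0.10):
    # Treat every instance of every positive bag as a point hypothesis, score it by DD,
    # and keep only the instances whose DD is in the top fraction of their own bag.
    candidates = []
    for bag, y in zip(bags, labels):
        if y != 1:
            continue
        scores = [diverse_density(p, bags, labels) for p in bag]
        cutoff = np.quantile(scores, 1.0 - top_fraction)
        candidates.extend(p for p, s in zip(bag, scores) if s >= cutoff)
    return candidates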

GenHypotheses(P+, P−, d, s)
  Let [v_i^min, v_i^max] be the range of the training data for dimension i
  Let (x_1, · · · , x_d) be an instance
  For i = 1 to d                                         /* for each dimension */
    h = ⌊v_i^min⌋                                        /* location of a half-space */
    t = (⌈v_i^max⌉ − ⌊v_i^min⌋)/(s − 1)                  /* distance between two half-spaces */
    For j = 1 to s                                       /* for each half-space add two variables */
      Add variable that is true iff x_i ≥ h
      Add variable that is true iff x_i < h
      h = h + t
  Initialize the corresponding 2ds-dimensional weight vector w = 1
  Let θ = ds                                             /* threshold */
  Repeat until the training error does not change in an iteration
    Perform Winnow weight update for each instance p ∈ P−
    Perform Winnow weight update for each instance p ∈ P+
  Return {w_1/θ, · · · , w_{2ds}/θ}

Figure 3.2: Hypothesis generation method.

3.3.2

Generating Hypotheses

Figure 3.2 describes how the hypotheses are generated from the single-instance data created by MI-Winnow. For each of the d dimensions, s half-space locations are created with a gap of t between them. Two variables, one pointing in each half-space direction, are created for each half-space location. After that, the weight vector w is initialized to all ones. These weights form the hypothesis of Winnow. Winnow is run with a threshold

θ = 2ds/2 = ds, which is half the sum of the initial weight vector (there are 2ds variables, each with initial weight 1). Training takes multiple iterations over the negative and positive instances, and stops if there is no change in the training error after a certain minimum number of iterations. Finally, the weight vector divided by θ is returned as the hypothesis. This division removes the bias introduced by d and s: even if other hypotheses are generated using different values of d and s, the predictions of all hypotheses are on the same scale, which allows them to be combined with equal importance. Different values of d occur when multiple image representations with different dimensions are used in learning. We use Winnow as our learning algorithm since it can provably learn a box (in time polynomial in the number of dimensions) and is robust to noise. Having multiple data sets helps Winnow learn multiple hypotheses (boxes). If multiple boxes are required to define when an image is positive, using a hypothesis space that combines multiple boxes is essential. Once Winnow has been applied to all the training sets (one for each candidate for a truly positive instance), the hypotheses created must be combined to create a single overall ranking of the test bags.
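A minimal sketch of the half-space variable construction in GenHypotheses may make the 2ds-dimensional encoding concrete; the function names and toy ranges below are illustrative assumptions, not part of the original implementation.

import numpy as np

def build_thresholds(v_min, v_max, s):
    # s evenly spaced half-space locations for one dimension, as in GenHypotheses.
    lo, hi = np.floor(v_min), np.ceil(v_max)
    step = (hi - lo) / (s - 1)
    return [lo + j * step for j in range(s)]

def to_halfspace_features(x, thresholds):
    # Map a d-dimensional instance to 2*d*s Boolean variables: for each dimension i and
    # each threshold h, one variable for x_i >= h and one for x_i < h.
    feats = []
    for i, ths in enumerate(thresholds):
        for h in ths:
            feats.append(1 if x[i] >= h else 0)
            feats.append(1 if x[i] < h else 0)
    return np.array(feats)

# Example: 2-dimensional instances with s = 4 thresholds per dimension -> 2*2*4 = 16 variables.
# Winnow would then be run over these Boolean variables with threshold theta = d*s.
thresholds = [build_thresholds(0.0, 1.0, 4), build_thresholds(0.0, 1.0, 4)]
print(to_halfspace_features(np.array([0.2, 0.9]), thresholds))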

3.3.3

Ranking Images

Rank(hypothesis set H, test image set T)
  For each image t ∈ T
    Let v_t = 0
    For each hypothesis w ∈ H
      Let A be the bag generator used for the training data for w
      Let B = {p_1, . . . , p_r} be the bag output by A(t)
      Let v_t = v_t + max_{p_i ∈ B} w · p_i
  Rank the images in non-increasing order using {v_1, . . . , v_|T|}

Figure 3.3: Ranking method.

In this section, we describe how the hypotheses are combined. First, hypotheses with low accuracy on the training set are removed. The remaining hypotheses are combined as described in Figure 3.3. Winnow computes a linear threshold function (after the normalization of dividing by θ, the threshold for all hypotheses is 1). Then the label for each bag can be defined according to the instance that maximizes the

dot product w · p, where w are the weights defining the linear function and p is an instance in the bag representation of the image that was used when training Winnow to obtain the given weight vector. Because the values are normalized, hypotheses from different types of bag generators can be combined. Finally, the sum of these dot products is used to rank the test images. If the goal is instead to classify an image, then a positive label is returned if the average value of the dot product is at least 1. Another alternative would be to use a majority vote of the predictions of the individual hypotheses. We found that these two methods gave similar performance.
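The ranking step of Figure 3.3 can be sketched as follows. Here `hypotheses` is assumed to be a list of (normalized weight vector, bag generator) pairs and each bag generator maps an image to a list of instance vectors; these names are hypothetical conveniences, not an interface defined in the text.

import numpy as np

def rank_images(hypotheses, test_images):
    # hypotheses: list of (w, bag_generator) pairs, where w has already been divided by theta.
    # Each image's score is the sum, over all hypotheses, of the best instance dot product.
    scores = []
    for image in test_images:
        v = 0.0
        for w, bag_generator in hypotheses:
            bag = bag_generator(image)               # list of instance vectors for this image
            v += max(np.dot(w, p) for p in bag)      # the bag label is driven by its best instance
        scores.append(v)
    order = np.argsort(scores)[::-1]                 # non-increasing order of the summed scores
    return [test_images[i] for i in order], scores

def classify_image(hypotheses, image, threshold=1.0):
    # Positive label if the average best dot product is at least the normalized threshold of 1.
    vals = [max(np.dot(w, p) for p in generator(image)) for w, generator in hypotheses]
    return float(np.mean(vals)) >= threshold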

3.4

Experimental Results

We use the SIVAL, COREL and Flickr data sets (as described in Section 1.1.1) in our experiments. Unlike SIVAL, in which an image often contains many objects other than the one of interest, in the COREL benchmark each image contains just an object from a particular category. Thus, in many ways, the SIVAL and COREL benchmarks have very different characteristics. The Flickr data set consists of real-world photographs with differing priority for the object of interest. MI-Winnow works well in all three settings without adjusting any parameters – the identical algorithm is run for all three data sets. For all the data sets, we segment each image into 32 segments. Each segment has either 6 features (3 color and 3 texture) with the no-neighbor bag generator, or 30 features when using the neighbor bag generator (both of these bag generators are described in Section 2.1). Unless otherwise mentioned, the experimental results are 30-fold, with the mean AUC and 95% confidence interval reported. Based on preliminary experiments, the Winnow update factor α is set to 1.05 and the Winnow algorithm trains for at most 130 iterations over the given data. A possible improvement is mentioned in Section 3.5. We first evaluate the sensitivity of MI-Winnow to the choice of τ (the parameter that controls how many different candidates are considered to be the truly positive instance from each positive bag). Recall that the computation time increases as τ increases. Also, if τ is too large, then incorrect hypotheses are likely to be created since the selected instance is not likely to be truly positive. We have performed experiments

with the SIVAL data set with τ = 10%, 25%, 50%, and 100% and found that all worked quite well, achieving mean AUC values of 0.748, 0.747, 0.747, and 0.746, respectively. In all of these tests we set s, the number of half-spaces, to 15 and used the neighbor bag generator. The performance is the same in all cases, showing that the DD measure picks good instances as starting points. For the remainder of the experiments reported, we let τ = 10%.

Figure 3.4: Comparing MI-Winnow and Accio! on the SIVAL data set where the query set has 8 positive and 8 negative examples.

Next we evaluated the sensitivity of MI-Winnow to the choice of s, the number of half-spaces used by Winnow for each dimension. We have performed experiments with the SIVAL data set with values for s of 15, 30 and 100. The average AUC values were 0.748, 0.747, and 0.752, respectively, without a statistically significant difference. Observe that even with only 15 half-spaces MI-Winnow performs almost as well as with 100 half-spaces. By reducing the number of half-spaces, the time complexity of MI-Winnow is reduced without any significant reduction in performance. However, performance drops with only 5 or 10 half-spaces. For the remainder of these experiments we fix the number of half-spaces to 15.


Figure 3.5: Comparing MI-Winnow with varying numbers of training examples from the SIVAL data set.

For some categories the results are better when using the neighbor bag generator, and for other categories the results are better when using the no-neighbors bag generator. To benefit from both representations, in the next set of experiments, shown in Figure 3.4, we compared the performance when both bag generators are combined against using them individually. For most categories the results when using both bag generators are almost as good as (or better than) the best result obtained when using only one bag generator. Only in a few cases is the performance significantly worse (at the 95% confidence interval) than the best of the two options. The average AUC of MI-Winnow with neighbors is 0.748 ± 0.052, without neighbors is 0.755 ± 0.05, and with neighbor and no-neighbor bags combined the AUC is 0.783 ± 0.048. By combining both bag generators, we obtain better performance than Accio! (0.746 ± 0.035) when it uses the best parameter choices. Recently, Rouhollah et al. [60] reported improved results (AUC of 0.818) with a new version of Accio!.

Figure 3.5 shows another set of experiments in which we vary the size of the training data. We label each curve by (p pos, n neg), where p is the number of positive images

in the training data and n is the number of negative images in the training data. The rest of the available images are used as test data. For these experiments we used MI-Winnow with both the neighbor and no-neighbor bag generators. The AUC values with 95% confidence intervals for (4 pos, 4 neg), (8 pos, 8 neg), (12 pos, 12 neg) and (16 pos, 16 neg) are 0.7 ± 0.054, 0.783 ± 0.048, 0.823 ± 0.048 and 0.844 ± 0.047, respectively. As expected, we find that the performance improves as more training data is given, but it improves only to a certain point. There is a bigger improvement from (4 pos, 4 neg) to (8 pos, 8 neg) than from (12 pos, 12 neg) to (16 pos, 16 neg).

Figure 3.6: MI-Winnow on the COREL data set.

Figure 3.6 gives results for MI-Winnow with the combined method on the COREL data set, which is a broad image category. Though MI-Winnow is designed for the localized CBIR task, these results show that it performs well even on the global CBIR task. The AUC values for (8 pos, 8 neg) and (16 pos, 16 neg) are 0.794 ± 0.051 and 0.83 ± 0.044, respectively. Here again the performance improves when there are more training examples. Recently, Rouhollah et al. [60] reported an AUC of 0.836 for (8 pos, 8 neg) with Accio!.

Figure 3.7 gives results for MI-Winnow and the recent version of Accio! [60] on the Flickr data set. The AUC values for MI-Winnow and Accio! are 0.744 ± 0.035 and 0.769 ± 0.033, respectively. Accio! performs slightly better than MI-Winnow, but within the overlap of the confidence intervals. Notice that the overall AUC is lower for the Flickr data set than for the SIVAL and COREL data sets, for both MI-Winnow and Accio!. Both the SIVAL and COREL data sets were created for this CBIR task, whereas the Flickr data set was created to reflect the real-world scenario where the nature of the pictures varies more drastically.

Figure 3.7: Comparing MI-Winnow with Accio! on the Flickr data set.

3.5

Concluding Remarks

We described a new algorithm, MI-Winnow, for the localized CBIR task, which addresses the multiple-instance problem using Winnow. It effectively combines two different techniques, diverse density and Winnow. Also, MI-Winnow takes advantage of the diversity of multiple image representations by combining the hypotheses generated from all of them. It outperforms the original Accio! [58] on the SIVAL data set, and compares favorably to the new Accio! [60], which performs slightly better on all three

data sets. Using various experiments we showed that MI-Winnow works well under various parameter settings and does not need fine-tuning based on the data set. There are multiple ways to improve the performance of MI-Winnow. One way is to use a set of values for the Winnow update parameter α instead of just one, as there is no single value of α that works well for all sets of examples. Higher α values can be used in the earlier iterations and lower values in the later iterations; we believe using more aggressive values of α in the initial iterations will reduce the overall number of iterations needed. Another way is to use Balanced Winnow [44], which has been reported to work better than Winnow. Interesting future work involves experiments in which classification tasks are considered as opposed to ranking tasks. Unlike Accio!, MI-Winnow can not only rank images but also predict whether a given image is positive or not. Given training data consisting of images from multiple categories, MI-Winnow can be used to learn a set of hypotheses with each category in turn treated as positive. The hypotheses predict whether a given image is positive or not, and these predictions can be combined to predict the category of the image.


Chapter 4

Hardness of Multiple-Instance Learning

In this chapter we focus on the theoretical aspects of MI learning. In the PAC learning model [68], to learn a target concept an algorithm must, in polynomial time and with a polynomial number of examples, create a hypothesis that approximates the target concept with high probability and low error rate. Previous positive results [47, 5, 8] in MI learning focused on PAC learning axis-aligned boxes, but all of them assume that all the instances in all the bags are drawn independently from the same distribution. This is a naive assumption. For example, in the drug activity prediction problem [24] a bag is made of instances that represent the various shapes a drug molecule can adopt by rotating its bonds; these shapes are related. A model in which the instances of a bag are independent does not capture the essence of a multiple-instance bag. It is like lumping together shapes of random drug molecules and calling the result a bag. But if the instances in a bag are allowed to be arbitrarily dependent, then PAC learning axis-aligned boxes with MI examples is as hard as PAC learning DNF formulas [5], which is a long-standing open problem [68] in learning theory. We describe the related work in Section 4.1. In Section 4.2, we extend the hardness result of Auer et al. to show that learning axis-aligned boxes with MI examples is as hard as learning DNF even if only some dimensions of some of the instances of each bag are arbitrarily dependent.


4.1

Previous work

Here we discuss previous theoretical work on learning axis-aligned boxes in n-dimensional space using multiple-instance examples. Long and Tan [47] described an efficient PAC algorithm [68] under the restriction that each point in the bag is drawn independently from a product distribution, D_product = D_1 × . . . × D_n (i.e., the coordinates of each instance are chosen independently). Hence the resulting distribution over r-examples is D_product^r. Auer et al. [5] gave an efficient PAC algorithm that allows each point to be drawn independently from an arbitrary distribution; hence each r-example is drawn from D^r for an arbitrary D. In their paper, Auer et al. also proved that if the distributional requirements are further relaxed to arbitrary distributions over r-examples, then PAC learning axis-aligned boxes is as hard as PAC learning DNF formulas, which has been an open problem since Valiant's seminal paper [68] formalizing the PAC learning model. Blum and Kalai [8] described a simple reduction from the problem of PAC learning from multiple-instance examples to that of PAC learning with one-sided random classification noise when the r-examples are drawn from D^r for any distribution D. They also described a more efficient reduction to the statistical-query model [38] that yields the most efficient PAC algorithm known for learning axis-aligned boxes in the multiple-instance model. Their algorithm has sample complexity Õ(n²r/ε²), roughly a factor of r faster than the result of Auer et al.

4.2

Hardness of multiple-instance learning

In this section we generalize the hardness result of Auer et al. [5]. As in previous work, we also focus on learning an axis-aligned n-dimensional box since it is easy to represent and is a good representation of a small target subspace in n-dimensional space. In noise-free data, the target box contains at least one instance from every positive bag and no instances from any negative bag. We start with some definitions adopted from Auer et al. [5].

Denote the real numbers by R and the positive integers by N. Let r, n ∈ N. An r-instance example is a tuple ((x_1, . . . , x_r), y), where x_i ∈ R^n and y ∈ {0, 1}. We assume each bag has r instances, and use r-instance example and bag interchangeably. For each n ∈ N, for each a, b ∈ R^n, define B_{a,b} = ∏_{i=1}^{n} [a_i, b_i], and BOXES_n = {B_{a,b} : a, b ∈ R^n}. So, B_{a,b} is the axis-aligned box defined by corners a and b. A sample is a sequence of r-instance examples. For a finite sequence σ = ⟨(x_{1,1}, . . . , x_{1,r}), . . . , (x_{ℓ,1}, . . . , x_{ℓ,r})⟩ of instances and a box B, the sample generated by σ and B is S_{B,σ} = ((x_{1,1}, . . . , x_{1,r}), y_1), . . . , ((x_{ℓ,1}, . . . , x_{ℓ,r}), y_ℓ), where

y_i = ψ_B(x_{i,1}, . . . , x_{i,r}) = 1 if ∃j ∈ {1, . . . , r} : x_{i,j} ∈ B, and 0 if ∀j ∈ {1, . . . , r} : x_{i,j} ∉ B,

for i = 1, . . . , ℓ. That is, y_i is 1 exactly when some instance in bag i is in B, the target box. Given the sample S and the required error ε as inputs, let H(S, ε) ⊆ R^n be the hypothesis output by a learning algorithm. The error of a hypothesis H(S, ε) with respect to a probability distribution D over r-instance examples is the probability that a random r-instance example is misclassified. That is,

er_{B,D}(H(S, ε)) = D{(x_1, . . . , x_r) : ψ_B(x_1, . . . , x_r) ≠ ψ_{H(S,ε)}(x_1, . . . , x_r)}.

For a distribution D_n on R^n, we define D = D_n^r as the distribution that independently samples r times from D_n. An r-instance example is generated according to a distribution D = D_n^r if its r instances are generated by sampling r times independently from D_n. We write D = (R^n)^r for an arbitrary distribution over r-instance examples in which all r instances are generated at the same time; an r-instance example generated according to a distribution D = (R^n)^r has r instances that are arbitrarily dependent. We explore learning under a partially dependent distribution, where the instances in an r-instance example are neither independent nor completely dependent. Let n_1 + n_2 = n. If an r-instance example is drawn according to a distribution D = (R^{n_1})^r × D_{n_2}^r, where D_{n_2} is a distribution on R^{n_2}, then the r instances in the example are dependent in n_1 dimensions but independent in the other n_2 dimensions.

Learning with r-instance examples: A learning algorithm learns BOXES_n from partially dependent r-instance examples with sample complexity ℓ(n, r, ε, δ) if the learning algorithm computes a hypothesis H(S, ε) such that for all B ∈ BOXES_n, for all distributions D on (R^{n_1})^r × D_{n_2}^r, for all ε, δ > 0, and for all ℓ > ℓ(n, r, ε, δ), the probability that er_{B,D}(H(S_{B,σ}, ε)) > ε is less than δ.

Theorem 4.2.1. For s ≥ 0 with s = poly(n, r), if there is a poly(n, r, 1/ε, 1/δ)-time algorithm A for learning BOXES_{nr+s} generated using a partially dependent distribution, then there is a poly(n, r, 1/ε, 1/δ)-time algorithm A′ for learning r-term DNF formulas over n variables (from 1-instance examples).

Proof: This proof generalizes the proof of Theorem 6 in [5]. Let C_i be a conjunction of a subset of the n variables. We reduce learning the r-term DNF f = C_1 ∨ C_2 ∨ . . . ∨ C_r over n variables x_1, . . . , x_n to learning a box in (nr + s) dimensions from r-instance examples. The main idea is that the nr dimensions are used as the dependent dimensions for the r-instance examples and the other s dimensions are used as the dimensions where the r-instance examples are independent. While in the s dimensions the instances are independent, the subproblem with nr dimensions remains hard to learn. Let nr + 1, . . . , nr + s be the s dimensions where the instances are independently and identically distributed. For 1 ≤ k ≤ s and 1 ≤ i ≤ r, let u_{nr+k}^i ∈ {0, 1, 2} be the values for the s dimensions. For each truth setting v ∈ {0, 1}^n, let ϕ(v) be the r-instance example

(2v_1, . . . , 2v_n, 1, . . . , 1, u_{nr+1}^1, . . . , u_{nr+s}^1) ∈ {0, 1, 2}^{nr+s},
. . .
(1, . . . , 1, 2v_1, . . . , 2v_n, u_{nr+1}^r, . . . , u_{nr+s}^r) ∈ {0, 1, 2}^{nr+s}.


Let B_{a,b} = ∏_{i=1}^{nr+s} [a_i, b_i], where a_i, b_i ∈ {0, 1, 2}. We now use f = C_1 ∨ C_2 ∨ . . . ∨ C_r to define the box B_{a,b}, where for each 0 ≤ i < r and 1 ≤ j ≤ n,

a_{in+j} = 1 and b_{in+j} = 2   if x_j ∈ C_{i+1},
a_{in+j} = 0 and b_{in+j} = 1   if x̄_j ∈ C_{i+1},
a_{in+j} = 0 and b_{in+j} = 2   otherwise,

and a_{rn+k} = 0 and b_{rn+k} = 2 for all 1 ≤ k ≤ s.

Now we prove that ϕ(v) is classified as 1 by B_{a,b} if and only if v satisfies f. Suppose ϕ(v) is classified as 1 by B_{a,b}. Then at least one of the r instances of ϕ(v) is in the box B_{a,b}; say the i-th instance, (1, . . . , 1, 2v_1, . . . , 2v_n, 1, . . . , 1, u_{nr+1}^i, . . . , u_{nr+s}^i), is in B_{a,b}. All the 2v_j's are either 0 or 2. If 2v_j = 2 then either x_j ∈ C_i or neither x_j nor x̄_j is present in C_i. If 2v_j = 0 then either x̄_j ∈ C_i or neither x_j nor x̄_j is present in C_i. Clearly, this shows that v satisfies C_i and thus satisfies f. Now suppose v satisfies f and say C_i is a term of f that is satisfied. If x_j ∈ C_i then v_j must be 1, and if x̄_j ∈ C_i then v_j must be 0. By construction of B_{a,b}, the i-th instance of ϕ(v) is in B_{a,b}. Hence, ϕ(v) is classified as 1 by B_{a,b}.

Consider the DNF learning algorithm A′ that, for each example (v, y), gives (ϕ(v), y) to A (which learns BOXES_{nr+s} from r-instance examples), and predicts the same as A. So A′ makes the same mistakes as A (and hence has the same probability of error), and runs in the same time as A. The consequence of the above result is that learning under a model where some dimensions of the bags are independently generated according to an arbitrary distribution, but some other dimensions are arbitrarily related, is as hard as PAC learning DNF formulas.
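As an illustration of the construction in the proof (not part of the original text), the following sketch encodes a truth assignment as an r-instance example ϕ(v) and builds the box B_{a,b} from the DNF terms; the s extra dimensions are filled with arbitrary values in {0, 1, 2}, and the small DNF used at the end is a hypothetical example.

import random

def phi(v, r, s):
    # Map a truth assignment v in {0,1}^n to an r-instance example in {0,1,2}^(n*r + s).
    n = len(v)
    instances = []
    for i in range(r):
        x = [1] * (n * r)
        x[i * n:(i + 1) * n] = [2 * vj for vj in v]          # the i-th block carries 2*v
        x += [random.choice([0, 1, 2]) for _ in range(s)]    # s independently drawn dimensions
        instances.append(x)
    return instances

def box_from_dnf(terms, n, r, s):
    # terms[i] maps a variable index j to +1 (x_j appears) or -1 (its negation appears).
    a = [0] * (n * r + s)
    b = [2] * (n * r + s)
    for i, term in enumerate(terms):
        for j in range(n):
            if term.get(j) == +1:
                a[i * n + j], b[i * n + j] = 1, 2
            elif term.get(j) == -1:
                a[i * n + j], b[i * n + j] = 0, 1
    return a, b

def bag_label(instances, a, b):
    # 1 iff some instance lies inside the box [a, b].
    return int(any(all(a[k] <= x[k] <= b[k] for k in range(len(x))) for x in instances))

# f = (x0 AND NOT x2) OR x1: r = 2 terms over n = 3 variables, with s = 2 extra dimensions.
terms = [{0: +1, 2: -1}, {1: +1}]
a, b = box_from_dnf(terms, n=3, r=2, s=2)
print(bag_label(phi([1, 0, 0], r=2, s=2), a, b))   # v satisfies the first term  -> 1
print(bag_label(phi([0, 0, 1], r=2, s=2), a, b))   # v satisfies neither term    -> 0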


Chapter 5

Meta-Evaluation for Image Segmentation

Often a segmentation algorithm is used as a pre-processing step for a larger system. In such situations, it is natural to use the overall performance of the end system to evaluate the segmentation quality. A system-level evaluation would typically segment all images with each of a set of segmentation techniques (or parameterizations) being considered, and then select the one giving the best overall performance. However, even for a given application, one segmentation technique (or parameterization) may be best for some of the images, and another technique may be best for other images. An advantage of a stand-alone method is that it could be applied to each image to adaptively select the best segmentation technique to use on that particular image, possibly improving on the performance obtained when using any one segmentation method for all images. Current stand-alone evaluation methods usually examine different fundamental criteria for segmentation quality, or examine the same criteria in different ways. Each of them typically works well for some types of images, but poorly for others. In this chapter, we propose a new meta-evaluation method, the Meta-Segmentation Evaluation Technique (MSET) [79], in which any existing evaluation methods (called base evaluators) are combined by a machine learning algorithm that coalesces their evaluations based on a learned weighting function that is dependent upon the characteristics of the image being segmented. An advantage of our approach is that any base evaluator can be used without any change in our learning algorithm. Also, the training data used by the machine learning algorithm can be labeled by a human, based on

similarity to a human-generated reference segmentation, or based upon system-level performance. An advantage of such a machine learning approach is that the resulting segmentation evaluator is tuned for the types of images upon which it is trained. Also, our method creates a decision tree for each base evaluator that provides information about which features are important in determining when that evaluator is reliable. The weights given to the base evaluators use the decision tree to tailor the influence given to each base evaluator based on the image being segmented. The remainder of the chapter is organized as follows. In Section 5.1, we describe the related work. In Section 5.2, we present our algorithm, MSET. Experimental results are presented in Section 5.3. Section 5.4 gives our conclusions and discusses future work.

5.1

Related Work

A large number of stand-alone segmentation evaluation methods have been proposed. Many of the early methods in this area focused on the evaluation of foreground-background segmentation, or only on gray-level images [54, 73]. There are segmentation measures that take into account the factors involved in defining a segmentation. These factors include region uniformity, normalized region uniformity, region contrast, line contrast, line connectivity, texture, and shape measures [13, 14, 42, 61, 63]. Liu and Yang [45] proposed the evaluation function

F(I) = √N · Σ_{j=1}^{N} e_j² / √S_j,

where N is the number of segments, S_j is the number of pixels in segment j, and e_j² is the squared color error of region j. Unless the image has very well-defined regions with very little variation in luminance and chrominance, the F evaluation function has a very strong bias towards segmentations with very few regions. Borsotti et al. [10] improved upon Liu and Yang's method, proposing a modified quantitative evaluation Q, where

Q(I) = (√N / (1000 · S_I)) · Σ_{j=1}^{N} [ e_j² / (1 + log S_j) + (N(S_j) / S_j)² ],

S_I is the total number of pixels in image I, and N(S_j) is the number of regions having area S_j. Here e_j² is given more influence in Q through the (1 + log S_j) term, and Q is penalized strongly by (N(S_j)/S_j)² when there are a large number of segments, so Q is less biased towards over-segmentation.
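A small sketch of these two evaluators may be useful; it assumes the image is an H×W×3 color array and the segmentation is an integer label map of the same spatial size, with e_j² taken to be the sum of squared deviations from each region's mean color, and the log base in Q left as the natural log since the text does not fix it. This is an illustrative implementation, not the original authors' code.

import numpy as np

def region_color_errors(image, labels):
    # Per-region sizes S_j and squared color errors e_j^2, where e_j^2 is the sum of squared
    # deviations of the region's pixels from the region's mean color.
    sizes, errors = [], []
    for r in np.unique(labels):
        pixels = image[labels == r].astype(float)      # shape (S_j, 3)
        sizes.append(len(pixels))
        errors.append(float(np.sum((pixels - pixels.mean(axis=0)) ** 2)))
    return np.array(sizes, dtype=float), np.array(errors)

def liu_yang_F(image, labels):
    S, e2 = region_color_errors(image, labels)
    N = len(S)
    return np.sqrt(N) * np.sum(e2 / np.sqrt(S))

def borsotti_Q(image, labels):
    S, e2 = region_color_errors(image, labels)
    N = len(S)
    S_I = labels.size
    N_of_S = np.array([np.sum(S == s) for s in S], dtype=float)   # number of regions with area S_j
    return (np.sqrt(N) / (1000.0 * S_I)) * np.sum(e2 / (1.0 + np.log(S)) + (N_of_S / S) ** 2)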


More recently, Zhang et al. [81] proposed a segmentation evaluation function E based on information theory. They define V_j as the set of all possible values for the luminance in region j and let L_j(m) denote the number of pixels in region j that have luminance m in the original image. The entropy for region j is defined as

H(R_j) = − Σ_{m ∈ V_j} (L_j(m)/S_j) log_2 (L_j(m)/S_j).

They next define the expected region entropy of image I,

H_r(I) = Σ_{j=1}^{N} (S_j/S_I) H(R_j),

which is simply the expected entropy across all regions, where each region has weight (or probability) proportional to its area. As with the squared color error, H_r(I) is a measure of the uniformity within the regions of I. To prevent a bias towards an over-segmented image, they let

E = H_r(I) − Σ_{j=1}^{N} (S_j/S_I) log_2 (S_j/S_I).

Pal and Bhandari [55] also proposed an entropy-based segmentation evaluation measure for intra-region uniformity based on the second-order local entropy. Several evaluation metrics designed for frames of a video can be easily modified for image segmentation evaluation [20, 26]. Correia and Pereira [20] proposed a set of metrics for both intra-object measures (such as shape regularity, spatial uniformity, etc.) and inter-object measures (such as contrast). These metrics are weighted based on a measure of how much a human reviewer's attention is attracted by each object. While these metrics were proposed as video segmentation quality measures, Zhang et al. [82] converted them into measures V_s and V_m for image segmentation quality by removing the motion- and temporal-related portions. For each region, the metrics used are circularity and elongation (circ elong), and compactness (compact). The inter-region metric contrast is defined by

Σ_{i,j} (2·DY_{i,j} + DU_{i,j} + DV_{i,j}) / (4 × 255 × N_b),

where N_b is the number of border pixels for the region, and for each pixel (i, j) in the image, DY_{i,j}, DU_{i,j} and DV_{i,j} are the maximum differences in the Y, U and V components between that pixel and its four neighbors. Also, contextual relevance metrics are used to capture the importance of a region in terms of the human visual system (HVS). The difference between V_s and V_m is in the way in which these metrics are weighted, with V_s weighting the contrast more and V_m weighting the circularity and elongation more. Our approach is inspired by the co-evaluation framework [82], which combines a set of base evaluators using a machine learning algorithm that coalesces the evaluation results from the base evaluators into an overall evaluation. One limitation of co-evaluation is that it weights each of the base evaluators the same regardless of the

image being considered. However, each base evaluator typically excels for some types of images/segmentations, yet works poorly for others. In the following section, we propose our meta-evaluation technique, MSET [79], that addresses the limitations of the co-evaluation framework. MSET determines when each base evaluator will perform best, so the weight given to each base evaluator depends upon the original image and its segmentations being evaluated. An advantage of such a machine learning approach is that the resulting segmentation evaluator is tuned for the types of images upon which it is trained. Also, our method creates a decision tree for each base evaluator that provides information about which features are important in determining when that evaluator is reliable. The decision tree tailors the influence given to each base evaluator according to the image being segmented.

5.2

MSET: Meta-Segmentation Evaluation Technique

Regular machine learning algorithms aim to find hypotheses that explain the provided training data and that generalize well to new data, whereas meta-learning aims to learn the conditions under which each of a set of learning algorithms or applications performs best [69]. The goal of the meta-segmentation evaluation technique (MSET) that we present in this chapter is to create a classifier that, given an image I and two segmentations of I, S1 (created by one algorithm/parameterization) and S2 (created by an alternate algorithm/parameterization), can accurately predict whether S1 or S2 is the better segmentation for I. Since MSET is constructed from a set of components that can be independently modified, it provides a lot of flexibility. The way in which these components are combined is illustrated in Figure 5.1. Observe that if the base evaluators are unsupervised segmentation evaluation methods, we obtain a new unsupervised evaluation method. However, unlike typical unsupervised methods, both the selection of the features to use in the decision tree and the training data allow our evaluation method to be tailored to a particular application, yet still be applied without the need for human intervention.

We now describe the components of MSET.

Base Evaluator: Any segmentation evaluation method can be used as a base evaluator. In fact, system-level evaluation methods, or even human-aided evaluation methods, can be used as base evaluators enabling fundamentally different evaluation methods to be coalesced. However, in this chapter, we focus on using existing unsupervised methods.

Features: Both the base evaluators and the learning component use a set of features to capture the important qualities of the image and the segmentation methods. Some of these features depend only on the image (e.g., overall color and texture information for the image itself), and other features depend upon the particular segmentation (e.g., number of segments, average color, texture and shape features across all segments). Application-specific features can also be added.

Training Data: The learning component tailors its performance to a particular set of images through the training data, which is composed of a set of examples, each of which includes a raw image I, two segmentations S1 and S2 for I, and a label indicating which of S1 and S2 is a better segmentation. The label could be provided by a human evaluator, by measuring the similarity to a human-generated reference segmentation, or based on which of S1 and S2 produces the better systemlevel performance.

Learning Algorithm: MSET first constructs a decision tree [57] for each base evaluator. However, unlike the standard ID3 decision tree algorithm in which each leaf node gives a predicted label, each leaf node in the decision tree constructed by MSET gives the predicted accuracy of that base evaluator for the input image. These decision trees are then used as the basis for defining a weighted vote over the base evaluators.

Figure 5.1: An Overview of MSET. I is the input image. S1 and S2 are two segmentations of I.

The decision tree for each evaluator is computed in the following manner. Each example in the training data is labeled as positive (if the evaluator agrees with the label), or otherwise negative. Each internal node in the decision tree partitions all of the examples into two or three sets according to the value of the selected feature. The following features, which depend upon the image itself, are considered as possible internal nodes:

• We compute the average value across the image for color L, color U, and color V. Then for each Y ∈ {L, U, V} and each possible threshold value t ∈ {90, 128, 150}, we define an internal node where the left branch corresponds to color Y < t, and the right branch corresponds to color Y ≥ t.

• Similarly, we compute the average value of the wavelet coefficients in the horizontal (HL), vertical (LH) and diagonal (HH) directions. For each of these texture features, we define an internal node with possible threshold values of 1.0, 1.5 and 2.0.

All of the above threshold values are chosen to give good splits over the possible range of values, based on experiments with a variety of images. We also consider the following features that depend upon the segmentation: the number of segments (NoS), and shape and texture features (perimeter, compactness, circularity, elongation, Sobel and contrast as defined in [19]), averaged over all segments. For each of these features, and for x ∈ {10, 25, 50}, we define a potential internal node with three branches based

on whether S1 ’s value for the attribute is x% greater than S2 ’s value, the difference between S1 and S2 ’s value is less than x%, or S2 ’s value for the attribute is x% greater than S1 ’s value. For all possible internal nodes (e.g., features and threshold choices to define the partitions), the one that maximizes the information gain is selected as the root where the information gain is the difference of entropy of the node and the sum of the entropies of the children based upon the proportion of the examples that are positive and negative in each branch [57]. The process is recursively repeated until either the information gain is less than 0.01, or the number of examples for any child is less than 10, or the number of examples is less than 15% of the total examples in the training set. These values are chosen experimentally to avoid over-fitting of the data by stopping repeated splitting of the nodes. The final decision tree partitions the set of all images into equivalence classes with one equivalence class per leaf. Each leaf holds the percentage of the training examples that reach that leaf correctly classified (as whether S1 or S2 is best) by the given base evaluator. An example decision tree is shown in Figure 5.2. The number of segments (NoS) with x = 25 had the highest information gain and thus was selected as the root. The branch for image pairs in which the number of segments are within 25% of each other becomes a leaf node where the accuracy on the training data is 80%. The other two equivalence classes defined by the root are split again. The leftmost one is split using Texture LH with a threshold of 1.5, and the rightmost one is split using Color V with a threshold of 150. Their children are not split again and become leaf nodes. The accuracy of the evaluator on the entire data set is 46.27%, however, the decision tree has discovered attribute values for which this classifier performs very well (namely, when there is less than a 25% difference between the number of segments), achieving 80% accuracy on this class of images. For each base classifier and its complement (where the selection of the best segmentation is reversed), a decision tree is built as described above. We now describe how these decision trees are combined to predict the label for a new example X consisting of image I, and segmentations S1 and S2 . For decision tree Ti of the ith base evaluator, let Li (X) be the leaf reached for example X, and let αi,j be the accuracy on the training examples for leaf j of tree Ti . Let BC1 be the set of base classifiers (or 45

Figure 5.2: An example decision tree output by MSET for base evaluator F.

For each base classifier and its complement (where the selection of the best segmentation is reversed), a decision tree is built as described above. We now describe how these decision trees are combined to predict the label for a new example X consisting of image I and segmentations S1 and S2. For decision tree T_i of the i-th base evaluator, let L_i(X) be the leaf reached for example X, and let α_{i,j} be the accuracy on the training examples for leaf j of tree T_i. Let BC1 be the set of base classifiers (or their complements) that indicate that S1 is better than S2, and let BC2 be the set of base classifiers (or their complements) that indicate that S2 is better than S1. MSET predicts that S1 is better than S2 if and only if

Σ_{i ∈ BC1} 2^{|α_{i,L_i(X)} − 0.5| × 10} ≥ Σ_{i ∈ BC2} 2^{|α_{i,L_i(X)} − 0.5| × 10}.

That is, the final prediction is made according to a weighted vote, where the weights are defined by the decision tree leaves reached by X. The multiplicative factor of 10 in the exponent is included so that a 10% increase in accuracy causes the weight given to that evaluator to be doubled. For instance, examples ending in a leaf with an accuracy of 0.6 receive a weight of 2, whereas examples ending in a leaf with an accuracy of 0.7 receive a weight of 4. The factor of 10 was chosen experimentally to increase the gap between the weights of the evaluators.
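A minimal sketch of this weighted vote follows; it assumes each decision tree object exposes a hypothetical leaf_accuracy(example) method returning the accuracy α of the leaf that the example reaches, which is not an interface defined in the text.

def mset_predict(trees_for_s1, trees_for_s2, example):
    # trees_for_s1 / trees_for_s2: decision trees of the base evaluators (or their complements)
    # whose base prediction on `example` is that S1 / S2 is the better segmentation.
    def weight(tree):
        alpha = tree.leaf_accuracy(example)          # accuracy of the leaf that `example` reaches
        return 2.0 ** (abs(alpha - 0.5) * 10.0)      # a 10% accuracy gain doubles the weight
    vote_s1 = sum(weight(t) for t in trees_for_s1)
    vote_s2 = sum(weight(t) for t in trees_for_s2)
    return "S1" if vote_s1 >= vote_s2 else "S2"

For example, a leaf accuracy of 0.6 yields a weight of 2 and a leaf accuracy of 0.7 yields a weight of 4, matching the doubling behavior described above.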

5.3

Experimental Results

In this section we describe experimental results to evaluate MSET. One obstacle in the research of segmentation evaluation is the lack of benchmark image sets. Due to availability, as well as to compare the performance of MSET with the co-evaluation

method [82], we acquired the same image sets as used in co-evaluation, which are in turn based on the Berkeley Segmentation Dataset [52] and the Military Graphics Collection [53]. We performed three sets of experiments that differ in both the type of images and the methods by which the segmentations are obtained, which result in radical differences in the performance of the base evaluators. Since MSET coalesces the results of base evaluators, the selection of base evaluators is crucial to its performance. A good selection of base evaluators must include those methods that evaluate as many criteria of a good segmentation as possible, thereby enabling MSET to combine the results from the most comprehensive perspectives. However, to compare with co-evaluation, we use the same five evaluators used by the co-evaluation method: F [45], Q [10], E [81] and Vs and Vm [82]. For each of the three experiments, we select a set of images I and for each I ∈ I, we generate segmentations S1 and S2 . We define a label of which segmentation is best using a subjective measure based on human visual evaluation. We then split these examples into two approximately equal-sized training and test sets. For the training set, the label is included along with I, S1 , and S2 . For the test set, the final evaluator created by MSET is given I, S1 and S2 , but not the label. We measure performance by the accuracy, which is defined as the percentage of test examples where the predicted label (of whether S1 or S2 is a better segmentation of I) matches the unseen label. For all experiments, we create 30 random splits of the examples into training and test sets, and report the mean accuracy and 95% confidence interval over the 30 runs. We compare the performance of MSET with each of the five base evaluators, as well as that of the co-evaluation method using the weighted majority combiner (CoE -WM ), which was shown to outperform other co-evaluation methods [82].


Evaluation Method    Accuracy
F                    17.99% ± 1.06%
Q                    14.07% ± 0.94%
E                    83.12% ± 1.00%
Vs                   17.41% ± 1.00%
Vm                   19.91% ± 0.96%
CoE-WM               90.44% ± 1.07%
MSET                 93.89% ± 0.85%

Table 5.1: Results for human vs. machine segmentations (mean ± 95% confidence interval).

5.3.1

Human segmentation results vs. machine segmentation results

In our first experiment, we apply MSET to compare human segmentation results and machine segmentation results. We use 189 test images from the Berkeley Segmentation Dataset [52]. The test images are segmented manually by a human, and then a machine segmentation is generated with the same number of segments using the Edge Detection and Image Segmentation (EDISON) System [25], which is a low-level feature extraction tool that integrates confidence-based edge detection and mean-shift-based image segmentation. We only use images where the human segmentation S1 is clearly better than the machine segmentation S2, but none of the evaluators make use of this knowledge. Table 5.1 shows the mean accuracy and the 95% confidence interval for each evaluation method. Recall that MSET considers the base evaluators and their complements. Of these 10 evaluators, the best 5 accuracies are 82.01 (complement of F), 85.93 (complement of Q), 83.12 (E), 82.59 (complement of Vs), and 80.09 (complement of Vm). Observe that the improvement of MSET over CoE-WM is statistically significant, and both statistically outperform the base evaluators (and their complements). The decision tree for E (for one of the 30 runs) is shown in Figure 5.3. The overall accuracy of E on the training data used to create this tree was 86.8%. The root of this decision tree partitions the examples based on whether or not the luminance of I is above 150. When the luminance is at least 150, a leaf node is created.


Figure 5.3: An example decision tree for base evaluator E, for which the overall accuracy is 86.8%.

For the images used in this experiment, high average luminance usually means there are large areas of uniform light background, such as sky, snow or sea. Those images are relatively simple, and EDISON can segment them well. Consequently, the difference between the human segmentation and the EDISON segmentation is small, and it is more difficult for E to differentiate them. As a result, E does not perform as well for the 25 examples whose luminance is over 150. For the remaining 66 examples (with luminance less than 150), E's accuracy is 92.4%. For those examples where the luminance is low, a further split is based on the thickness [20], where the average thickness of a segmentation describes both the average area of each segment and its shape: a larger thickness indicates both a larger area and more circularity. When the percentage difference between S1's and S2's thickness is greater than 25%, E achieves very high accuracy (95.8% and 100%). Since by definition E prefers segmentations whose regions are less equally sized, if the difference in thickness is high, one of the segmentations is more favorable to E. So the decision tree achieves its goal of finding the types of examples on which E performs very well. For this data set, E is given more weight if the test image has low luminance and the difference in thickness between the two segmentations is large.


5.3.2

Results from different parameterizations of a segmentation method

In our second experiment, 249 aircraft images from the Military Graphics Collection are segmented by the Improved Hierarchical Segmentation (IHS) algorithm [80] with different parameterizations (namely, the number of segments in the final segmentation). The images used for this experiment were the ones where the human evaluators all agreed which segmentation was best, and the label is given to the segmentation agreed to be better. In general, the variation in the number of segments was fairly high, so that the human evaluators were all in agreement. Some sample images from our evaluation set are shown in Figure 5.4.

Figure 5.4: Sample images.

The results for this experiment are shown in Table 5.2. Of these 5 evaluators and their complements, the best 5 accuracies are 53.2 (complement of F), 73.67 (Q), 66.27 (complement of E), 55.72 (Vs), and 62.88 (Vm). Thus, as compared with the first set of experiments, the base evaluators do not perform as well. One reason for this is that in about two-thirds of the training data, the segmentation with more segments is labeled as the better one. Both F and E have a bias towards segmentations with fewer segments, and thus they did not perform as well. However, the improvement of MSET over CoE-WM is statistically significant, and both statistically outperform the base evaluators (and their complements). Although the best base evaluator achieves only 73.67% accuracy, MSET achieves an accuracy over 83%.

5.3.3

Results from different segmentation methods

In our third experiment, 268 images from the Berkeley Segmentation Database are segmented with both IHS and EDISON using approximately the same number of segments.

Evaluation Method    Accuracy
F                    46.80% ± 1.00%
Q                    73.67% ± 1.13%
E                    33.73% ± 0.86%
Vs                   55.72% ± 1.07%
Vm                   62.88% ± 0.99%
CoE-WM               77.14% ± 2.17%
MSET                 83.13% ± 1.12%

Table 5.2: The average evaluation accuracy (mean ± 95% confidence interval) for each method in the experiment where different parameterizations of the same segmentation algorithm are used.

Evaluation Method    Accuracy
F                    37.05% ± 1.06%
Q                    47.11% ± 1.12%
E                    41.81% ± 1.19%
Vs                   57.30% ± 0.89%
Vm                   60.11% ± 1.06%
CoE-WM               62.43% ± 0.94%
MSET                 65.37% ± 1.21%

Table 5.3: The evaluation accuracy (mean ± 95% confidence interval) for each method in the experiment with different segmentation methods.

Two examples are shown in Figure 5.5. A group of six human evaluators independently compare the segmentations from both algorithms. Only those images where at least four evaluators agreed which segmentation is best were used. The results for this experiment are shown in Table 5.3. Of these 5 evaluators and their complements, the best 5 accuracies are 62.95 (complement of F), 52.89 (complement of Q), 58.19 (complement of E), 57.30 (Vs), and 60.11 (Vm). Again, the results show that the improvement of MSET over CoE-WM is statistically significant, and they both outperform the base evaluators.



Figure 5.5: Image examples segmented by EDISON and IHS. In the top example, EDISON is labeled best, and in the bottom example IHS is labeled as best.

5.3.4

Discussion

In all three experiments, MSET outperforms the other evaluation methods and this improvement is statistically significant at the 95% confidence level. Clearly, the performance of the base evaluators (and their complements) affects the performance of MSET. In the human vs. machine segmentation experiment, the accuracies of the evaluators are either very high (83.12%) or very low (14.07%, 19.91%), so both CoE-WM and MSET have high accuracy (90.44% and 93.89%, respectively). In the experiment with different segmentation methods, where the five evaluators have accuracies between 37% and 60%, i.e., most evaluators perform little better than random guessing, both CoE-WM and MSET have lower accuracy (62.43% and 65.37%, respectively). The experiment with different parameterizations of the same segmentation method falls between these two extremes. For MSET to obtain good results, two conditions are required: (1) for each image, some evaluator must perform well, and (2) the attributes used in constructing the decision tree must enable it to discriminate when each evaluator performs well. Thus, there are two ways in which the results in the third experiment could be improved. We believe that larger improvements of MSET over CoE-WM can be obtained by finding more discriminative features to use in creating the decision trees. As long as

there is at least one high-accuracy leaf reached for each example in the test set, much better performance for MSET can be achieved. Another way in which the results can be improved is to include better base evaluators. In fact, one nice feature of this work is that MSET can improve as better unsupervised evaluation methods are developed.

5.4

Conclusion

Current objective evaluation methods usually examine different fundamental criteria of good segmentation, and rely heavily on the image characteristics they measure. So they work well in some cases, or for some groups of images, and poorly for others. To improve the evaluation accuracy, we propose a meta-segmentation evaluation technique in which different evaluators judge the performance of the segmentation in different ways, and their measures are combined by a learning algorithm that determines how to coalesce the results from the constituent measures. Based on features extracted from the original image and each segmented image, and features defined for each segmentation, the learning module can learn which base evaluator is most likely to generate reliable evaluations for each type of image, and can use this to weight the influence of each base evaluator in a way that is appropriate for the individual image. The advantages of MSET are its improved accuracy compared to current evaluation methods, the ability to use any base evaluator with it, the potential to combine fundamentally different types of evaluation methods, the possibility of evaluating segmented images from different imaging technologies (e.g., segmentations of optical, radar and infra-red images of a target), and its parallel structure, which facilitates fast processing. Also, the flexible nature of MSET means that it is not necessary to find a single approach in order to obtain good stand-alone objective segmentation evaluation across all types of images. Rather, it is only necessary to find a set of base evaluation methods such that at least one of them works for each type of image. Combined with a careful selection of attributes to use in constructing the decision tree, MSET can combine such base evaluators to achieve good performance across all image types.

There are many interesting directions for future work. We can further explore the selection of the features used by the decision tree algorithm to find some that are more discriminating for determining when the base evaluators perform well. Also, it is interesting to explore other learning techniques besides the variation of the ID3 decision tree algorithm that MSET currently uses.


Chapter 6

Veritas: Combining Expert Opinions without Labeled Data

Clinical diagnosis, treatment response, biomedical research, and clinical trials all rely on precise determination of the true meaning of identified image features. As quantitative imaging biomarkers are increasingly used in clinical trials to evaluate new pharmaceutical agents, and as new automated software for decision support is created, the need for techniques to define a high-quality estimate of truth (a so-called gold standard, as opposed to actual ground truth) for biomedical image sets has become critical. The ability to establish a high-quality estimate of truth will permit the creation of much-needed standard image sets for the validation of new detection and decision support algorithms. If a technique can be established to combine the results of multiple experts to improve the accuracy of the composite result (one approach to defining a gold standard of truth), such a technique would have a significant impact on the accuracy of image-based biomarkers and hence on a large class of clinical trials. Volumetric computed tomography (CT) screening of subjects at risk for developing lung cancer is used for early detection of malignant lung nodules and results in reduced patient mortality and morbidity. Searching through volumetric CT data, which contains hundreds of images, to find possible malignant lesions is time intensive and costly. Automated computer-aided detection and measurement (CAD/CAM) systems are being developed to reduce the radiologists' time to read the CT screening data, but these systems need to be validated before they can be approved by the FDA for clinical use. Validating CAD/CAM systems depends on the existence of well-curated image repositories. While these repositories, e.g., the National Cancer

Image Archive, are being developed, they force the imaging informatics community to confront a critical problem – the lack of ground truth for most medical image data sets. Our work is motivated by the application of estimating the true segmentation of a lung nodule in a computed tomography scan of the human chest, where we have multiple expert segmentations. This is a different problem from the one in the previous chapter, as in this case there is no labeled data to train with. Even though we do not have labeled data, the goal of this work is quite different from that of an unsupervised learning problem, where the goal is to cluster the data into different groups. Figure 6.1 shows a slice of a CT image with a lung nodule and three different expert segmentations of the nodule. Here, for each pixel in the CT image we have multiple opinions from experts – whether the pixel is inside or outside the nodule – and the goal is to determine the truth of whether the pixel is part of the nodule or not.

Figure 6.1: On the left, a sample CT image is shown with a green rectangle containing the nodule. On the right, three differing expert opinions about the lung nodule segmentation (only a small portion of the image close to the green rectangle is shown).


The same underlying scenario exists in other potential applications, e.g., an unsupervised robot navigation system that has multiple obstacle detectors, or a spam detector with multiple algorithms running, each looking at different kinds of information. More generally, any setting where we have several already-trained algorithms that make predictions and we want to combine their opinions without access to any labeled data is a potential application. Even though obtaining ground truth is not as difficult in robot navigation or spam detection problems, it is possible that the different algorithms were trained on different data sets before being acquired from a variety of sources.

Figure 6.2: A sample of one of the CT scans that we use in our experimental evaluation. This particular CT slice of the chest shows both a real lung nodule (top arrow) and a synthetic lung nodule (bottom arrow).

We present the Veritas algorithm [16], which applies confidence-rated boosting [64] in a novel way to combine expert opinions when there is no labeled data for training. Though this approach can be applied in any domain where there is a need to combine expert opinions, we investigate its application to a medical image segmentation problem. We evaluate Veritas using both artificial data and real CT images to which synthetic nodules have been added to provide a ground truth. Figure 6.2 shows one of these data sets with both a real nodule and a synthetic nodule marked. Observe that the synthetic nodule is visually similar to the real nodule. Currently, the STAPLE algorithm [72] is the standard approach used for developing a model of truth in this context. We compare Veritas with STAPLE, as well as other natural benchmarks. In Section 6.1, we describe STAPLE and its drawbacks, and the background for our approach. We describe Veritas in Section 6.2. In Section 6.3, the data sets and loss

measures used in the experiments are described. Results from the experiments are given in Section 6.4. Conclusions and future work are given in Section 6.5.

6.1 Background and Our Approach

In the context of medical image segmentation, the state-of-the-art method for the problem of building a model of truth from the opinions of experts without any ground truth is the STAPLE algorithm [72]. STAPLE aims to simultaneously learn a sensitivity and specificity for each expert as well as the true segmentation, and uses the EM algorithm [23] together with Markov random fields [33] in the following manner. First, an estimate of the hidden true segmentation is made (e.g., using the consensus of the experts). This estimated true segmentation can then be used by the "E"-step to estimate a specificity and sensitivity for each expert. For the "M"-step, a Markov random field is used as a way to incorporate both the estimated specificity and sensitivity of the experts, as well as a smoothness constraint that penalizes two neighboring pixels for having different values. The approach used by STAPLE is very natural: combine expert opinions using some weighted vote, but also incorporate spatial constraints. However, it has the limitation of requiring a generative model and a good estimate of the model parameters (e.g., the sensitivity and specificity of the experts). As discussed in the paper presenting STAPLE [72], different initial parameters can produce very different hypothesized segmentations. Incorrect parameters can lead to bad estimates, and there is currently no easy way to find good initial parameters except by using prior knowledge about their approximate values. Another limitation of STAPLE is that the constraint of spatial homogeneity is hard-coded into the edge weights of the Markov random field, and is not dependent on the experts' segmentations being considered. Also, STAPLE is designed for a Boolean ground truth and is not well suited to making real-valued predictions. However, in CT images the lung nodules can have real values, indicating the nodule density or that a nodule occupies only a fraction of a pixel in the discretized CT image.
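To make this estimation loop concrete, the sketch below shows a simplified EM iteration in the spirit of STAPLE for binary segmentations, with the Markov random field smoothing term omitted; the function and variable names are ours and do not correspond to any particular STAPLE implementation.

import numpy as np

def staple_like_em(D, prior=None, iters=50):
    """Simplified EM in the spirit of STAPLE (no Markov random field term).

    D     : (n_voxels, n_experts) array of binary expert decisions.
    prior : prior probability that a voxel is foreground (defaults to the mean vote).
    Returns the per-voxel posterior W and per-expert sensitivity p and specificity q.
    """
    D = np.asarray(D, dtype=float)
    n, m = D.shape
    W = D.mean(axis=1)                      # initial truth estimate: the expert average
    if prior is None:
        prior = W.mean()
    p = np.full(m, 0.9)                     # initial sensitivities
    q = np.full(m, 0.9)                     # initial specificities
    for _ in range(iters):
        # E-step: posterior probability that each voxel is truly foreground
        a = prior * np.prod(np.where(D == 1, p, 1 - p), axis=1)
        b = (1 - prior) * np.prod(np.where(D == 0, q, 1 - q), axis=1)
        W = a / (a + b + 1e-12)
        # M-step: re-estimate each expert's sensitivity and specificity
        p = (W[:, None] * D).sum(axis=0) / (W.sum() + 1e-12)
        q = ((1 - W)[:, None] * (1 - D)).sum(axis=0) / ((1 - W).sum() + 1e-12)
    return W, p, q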


Our approach is instead motivated by work in semi-supervised learning — in particular, co-training [9] and co-boosting [18] — as well as work in computational learning theory on learning from random classification noise [2]. The idea of our method is that we hold out one expert to use as the "label" for each pixel, and then use confidence-rated boosting [64] to learn a good hypothesis for predicting that label based on the predictions of the other experts. Confidence-rated boosting is a generalization of AdaBoost [30] with confidence-rated predictions, in which the weak learners can abstain (zero confidence) from predicting for some inputs. Specifically, we use the predictions of the other experts not only on the current pixel but also on various combinations of pixels in the surrounding region. For example, if we use expert m as the label, then the space of weak hypotheses for boosting would include prediction rules such as: "if expert 1 predicts '+' on at least 7 of the 9 pixels in the local region, then predict '+'," or "if experts 1 and 2 both predict '+' on at least 6 of the 9 pixels, then predict '+'." The idea here is twofold: first, rather than hard-code in beliefs about the strength and form of spatial constraints, we want to allow confidence-rated boosting to select which of these are most important to the problem based on the actual data. Second, if we can model the held-out expert's predictions as corresponding to the true label corrupted by random classification noise, then optimizing error rate over the noisy labels has been shown theoretically to optimize error rate with respect to the true labels as well [39]. In particular, the error rate of the predictor trained on the noisy labels can be much lower than that of the noisy labels themselves. Finally, we repeat this process for each expert as the hold-out label, and then combine the outputs of the resulting predictors. Collins and Singer [18] also consider boosting in a multi-view setting, developing the CoBoost semi-supervised learning algorithm. Unlike CoBoost, however, we do not explicitly optimize for agreement among the m predictors h1, ..., hm produced in this process (one for each held-out expert). This is because we have overlap in the feature space used for training each predictor, and thus an agreement-based objective among h1, ..., hm could cause the algorithms to simply focus on common inputs. Instead, we use the expert predictions as labels, with the hope that, much like in the simplified version of co-training analyzed by Blum and Mitchell [9], some of the experts will be making mistakes that are independent of the mistakes of the other experts, and thus minimizing error with respect to those predictions will minimize error with respect to the ground truth.

Veritas(E = {E1, ..., Em})
    for each expert Ei ∈ E
        Hi = CreateWeakHypotheses(E − Ei)
        Pi = ConfidenceBoost(Hi, Ei)
    Result = Combine(E, P = {P1, ..., Pm})

Figure 6.3: Overview of Veritas.
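The pseudocode translates almost directly into a short driver. The sketch below is a schematic Python rendering of Figure 6.3; the three helper callables stand in for CreateWeakHypotheses, ConfidenceBoost, and Combine, and their concrete signatures here are our assumption rather than part of the original implementation.

def veritas(experts, create_weak_hypotheses, confidence_boost, combine):
    """Sketch of the Veritas outer loop from Figure 6.3.

    experts : list of m expert segmentations (e.g., flat 0/1 arrays over pixels).
    The three callables correspond to CreateWeakHypotheses, ConfidenceBoost,
    and Combine; concrete versions are sketched later in this chapter.
    """
    predictions = []
    for i, held_out in enumerate(experts):
        others = experts[:i] + experts[i + 1:]        # E - Ei
        weak_hyps = create_weak_hypotheses(others)    # features built from the other experts
        predictions.append(confidence_boost(weak_hyps, held_out))  # held-out expert = label
    return combine(experts, predictions)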

6.2 Veritas Algorithm

In this section we present the Veritas algorithm. Our goal is to create a model of truth that combines the opinions of multiple experts (both human and machine). When the quality of the experts' predictions varies significantly, a simple consensus model is insufficient.

Figure 6.4: Another overview of Veritas.

An overview of the Veritas (truth-telling) algorithm is shown in Figure 6.3. Another view of the algorithm is shown in Figure 6.4. Veritas learns a hypothesis treating one of the m expert segmentations as the label for all pixels. This process is repeated m times, once for each expert segmentation being treated as the label. Each such execution creates a different hypothesis for the ground truth; these hypotheses are then combined to create the final prediction. Once an expert is selected to be treated as the label, the m − 1 other expert segmentations are used to create the feature predictions that form the weak learners for confidence-rated boosting [64]. In confidence-rated boosting, a weak hypothesis can abstain by saying "I don't know." This is similar to the specialist

CreateWeakHypotheses(E)   /* as we apply it, E has m − 1 experts */
    for each Ek ∈ E
        generate Sk        /* set containing the 2^9 − 1 subsets of pixels */
        set Hk to empty    /* weak hypotheses using Ek */
    set Hsingle, Hpairs, Hmajority to empty

    /* create single-expert weak hypotheses */
    for each Sk, where k ∈ {1, ..., |E|}
        for each s ∈ Sk
            /* all possible thresholds based on the size of s */
            for (thr = 0; thr < |s|; thr++)
                /* sum the pixel values in s; note s ∈ {0, 1}^|s| */
                append the following 4 weak hypotheses to Hk:
                    if (Σ s > thr) predict 1,  else predict 0
                    if (Σ s > thr) predict −1, else predict 0
                    if (Σ s > thr) predict 0,  else predict 1
                    if (Σ s > thr) predict 0,  else predict −1
        append Hk to Hsingle

    /* create paired-experts weak hypotheses */
    for (p = 0; p < |E|; p++)
        for (q = p + 1; q < |E|; q++)
            /* combine the ith hypothesis from Hp and Hq */
            for (i = 0; i < |H1|; i++)
                H_{p,q}^i = Sign(Hp^i + Hq^i)
            append H_{p,q} to Hpairs

    /* create majority-experts weak hypotheses */
    for (i = 0; i < |H1|; i++)
        /* combine the ith hypothesis from all Hk */
        H_majority^i = Sign(Σ_{k=1}^{|E|} Hk^i − |E|/2)
        append H_majority^i to Hmajority

    /* return the set of all weak hypotheses */
    return Hsingle ∪ Hpairs ∪ Hmajority

Figure 6.5: Algorithm to create weak hypotheses.


model of Blum [7]. These predictions are then combined to form a final opinion for the example. We now describe the details of Veritas for our specific application of lung nodule segmentation in CT images. For each image, Veritas is provided with m expert segmentations, where such a segmentation provides a classification (0 or 1) for each pixel in 2-dimensional data, and for each voxel in 3-dimensional data. Each pixel/voxel forms one example. In particular, to predict the label of the pixel/voxel at location ℓ, we apply confidence-rated boosting to a set of weak hypotheses created using subsets of pixels in the neighborhood of ℓ within a single expert, using pairs of experts, and using a majority among all the experts. We have selected these weak hypotheses since they incorporate spatial relationships. Let S be the set containing all 2^9 − 1 non-empty subsets of pixels defined by pixel location ℓ and its 8 neighbors. For computational reasons, we consider neighbors from the 2-dimensional slice in a 3-dimensional image; there would be 26 neighbors if the 3-dimensional neighborhood were considered. We now describe how the weak hypotheses are created. Pseudocode is given in Figure 6.5.

• Single-expert weak hypotheses: For each s ∈ S and expert segmentation E_k, we introduce four types of weak hypotheses: (1) a weak hypothesis f_s^1(E_k) that predicts 1 if the sum of all the pixels in s is greater than a threshold τ (for 1 ≤ τ ≤ |s| − 1), and otherwise predicts 0 ("I don't know"); (2) a weak hypothesis f_s^2(E_k) that makes the prediction (−1 or 0) complementary to f_s^1(E_k); (3) a weak hypothesis f_s^3(E_k) that predicts 1 if the sum of all the pixels in s is less than or equal to a threshold τ (for 1 ≤ τ ≤ |s| − 1), and otherwise predicts 0; (4) a weak hypothesis f_s^4(E_k) that makes the prediction (−1 or 0) complementary to f_s^3(E_k). These weak hypotheses capture spatial relationships within an image.

• Majority-experts weak hypotheses: For each set s ∈ S, we introduce a weak hypothesis that predicts based on the majority of the weak hypotheses f_s^i(E_1), ..., f_s^i(E_{m−1}). These weak hypotheses capture an overall consensus (and its complement) for each pixel set.


• Paired-experts weak hypotheses: Finally, for each s ∈ S and for all distinct pairs E_k, E_j among the m − 1 experts, we introduce the weak hypothesis f_s^i(E_k) ∧ f_s^i(E_j). These weak hypotheses capture the importance, in some situations, of two experts agreeing with one another.
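As an illustration, the sketch below shows how the single-expert threshold rules of Figure 6.5, and the pairing of two such rules, could be realized in Python. Only rule type (1) is materialized; the other three types, the majority rules, and any handling of image borders or 3-dimensional neighborhoods are omitted, and a practical implementation would prune the dense enumeration of all 2^9 − 1 subsets.

import itertools
import numpy as np

def make_weak_hypotheses(expert):
    """Single-expert threshold rules over the 3x3 neighbourhood of each pixel.

    `expert` is a 2-D 0/1 array.  Each rule looks at one non-empty subset of the
    pixel and its 8 neighbours and predicts +1 or 0 (abstain), mirroring rule
    type (1) of Figure 6.5; types (2)-(4) negate the output and/or the comparison
    in the same way.  Image borders are ignored for brevity.
    """
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    subsets = [s for k in range(1, 10) for s in itertools.combinations(offsets, k)]  # 2^9 - 1 subsets

    def rule(subset, thr):
        def h(r, c):
            total = sum(expert[r + dr, c + dc] for dr, dc in subset)
            return 1 if total > thr else 0          # predict '+' or abstain
        return h

    return [rule(s, t) for s in subsets for t in range(len(s))]  # thresholds 0 .. |s|-1

def paired_hypothesis(h1, h2):
    """Combine two weak hypotheses by the sign of their summed votes, as in Figure 6.5."""
    return lambda r, c: int(np.sign(h1(r, c) + h2(r, c)))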

The Sign function used by the CreateWeakHypotheses method is defined as:

$$\mathrm{Sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \end{cases}$$

ConfidenceBoost(Hi, Ei)
    /* Ei^ℓ provides the label for pixel location ℓ */
    /* Hi^ℓ is the set of weak-hypothesis predictions at location ℓ */
    Run confidence-rated boosting, where weak hypotheses can abstain, using the
        data set ⟨Hi^1, Ei^1⟩, ⟨Hi^2, Ei^2⟩, ...
    Using the learned hypothesis, for all ℓ, predict the label Pi^ℓ using only Hi^ℓ (without Ei^ℓ)
    Return Pi

Figure 6.6: Confidence Boost.

Using these weak hypotheses, ConfidenceBoost (Figure 6.6) is executed to generate a hypothesis with each expert serving as the label. If the boosted hypothesis can predict the label expert well, then that expert does not contribute much knowledge beyond that of the other experts. During the combining phase (Figure 6.7) of the weighted version of Veritas, such a hypothesis is weighted lower. The hypothesis with the lowest accuracy is given the highest weight, since this hypothesis contributes knowledge about the ground truth that the original expert does not have. Having an incorrect expert who is very different from the others can also lead to this situation. At present, the hypotheses are simply given the integer weights 1, ..., m. Though this weighting scheme might seem counterintuitive, in our experiments (see Section 6.4.3) such weights consistently outperformed the weighting scheme in which the lowest-accuracy hypothesis gets the lowest weight. More work needs to be done to assign more finely tuned weights instead of the integer weights.
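For reference, the following is a minimal sketch of confidence-rated boosting with abstaining weak hypotheses in the style of Schapire and Singer [64], applied to precomputed weak-hypothesis outputs; the matrix-based interface is our simplification of the per-pixel formulation used by ConfidenceBoost.

import numpy as np

def confidence_boost(H, y, rounds=25, eps=1e-10):
    """Confidence-rated boosting with abstaining weak hypotheses (after Schapire & Singer).

    H : (n_examples, n_hypotheses) array of weak-hypothesis outputs in {-1, 0, +1}.
    y : labels in {-1, +1} -- here the held-out expert's 0/1 segmentation remapped
        via y = 2 * E_i - 1.
    Returns the real-valued combined score F(x) = sum_t alpha_t * h_t(x).
    """
    n, m = H.shape
    D = np.full(n, 1.0 / n)                 # distribution over pixels
    F = np.zeros(n)
    agree = (H == y[:, None])               # where a hypothesis matches the label
    for _ in range(rounds):
        # W+ / W- : weighted mass of correct / incorrect non-abstaining predictions
        Wp = (D[:, None] * (agree & (H != 0))).sum(axis=0)
        Wm = (D[:, None] * (~agree & (H != 0))).sum(axis=0)
        W0 = 1.0 - Wp - Wm                  # mass on which the hypothesis abstains
        Z = W0 + 2.0 * np.sqrt(Wp * Wm)     # normaliser under the optimal alpha
        t = int(np.argmin(Z))               # pick the hypothesis minimising Z
        alpha = 0.5 * np.log((Wp[t] + eps) / (Wm[t] + eps))
        F += alpha * H[:, t]
        D *= np.exp(-alpha * y * H[:, t])   # reweight the examples
        D /= D.sum()
    return F

A 0/1 prediction P_i for each pixel can then be read off by thresholding the returned score at zero.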

Combine(E, P = {P1, ..., Pm})   /* unweighted version */
    for each location ℓ
        truth^ℓ = (Σ_{k=1}^{m} Pk^ℓ) / m

Combine(E, P = {P1, ..., Pm})   /* weighted version */
    /* the weight is based on the accuracy of Pk compared to Ek */
    /* ∀ i, k ∈ {1, ..., m}, and pixel locations ℓ */
    Wk = i when Σ_ℓ |Pk^ℓ − Ek^ℓ| is the ith highest of the m differences
    for each location ℓ
        truth^ℓ = (Σ_{k=1}^{m} Wk Pk^ℓ) / (Σ_{k=1}^{m} Wk)

Figure 6.7: Combining expert segmentations.

Currently, for the 3-dimensional data, we apply this algorithm to each of the 2-dimensional slices. An area of future work is to add a limited set of 3-dimensional features (to limit computation) that involve the corresponding voxels across the slices.
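A compact rendering of the two combining rules of Figure 6.7 is sketched below; the weighted version implements the PoorlyPredictedFavored scheme described in the text, in which the hypothesis that predicts its own label expert worst receives the largest integer weight.

import numpy as np

def combine_weighted(experts, predictions):
    """Weighted combining from Figure 6.7 (the PoorlyPredictedFavored scheme).

    experts     : list of m expert segmentations, each a flat 0/1 array.
    predictions : list of m boosted predictions P_k, one per held-out expert.
    """
    E = np.asarray(experts, dtype=float)          # shape (m, n_pixels)
    P = np.asarray(predictions, dtype=float)
    disagreement = np.abs(P - E).sum(axis=1)      # how poorly each P_k matches its E_k
    ranks = disagreement.argsort().argsort()      # 0 = smallest disagreement
    weights = ranks + 1                           # worst-predicted expert gets weight m
    return (weights[:, None] * P).sum(axis=0) / weights.sum()

def combine_unweighted(predictions):
    """Unweighted combining: the per-pixel average of the m predictions."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)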

6.3 Data and Experiments

We evaluate Veritas using both artificial data and real CT data. Though the experts assign either 1 or 0 (part of the nodule or not, respectively) for each voxel of a CT image, the ground truth is not Boolean as sometimes a nodule can partially occupy a voxel or the density of the nodule is low in that voxel (which corresponds to a fractional ground truth value). We use the average squared loss among all pixels in which there is disagreement among the original m experts as our criterion for comparing Veritas, STAPLE, and two baseline algorithms. We have selected this measure, rather than the loss over the entire image, since the nodule(s) to be isolated on the real CT images are small and the experts all agree on the vast majority of the pixels in the image. Thus, we want to focus on the interesting portion of the data.


STAPLE assigns a fractional value to each voxel, but its predictions are generally close to 0 or 1, as it was designed with the assumption that the ground truth is Boolean, whereas Veritas is designed to work with non-Boolean ground truth values. Though we believe average squared loss compared with ground truth is a reasonable measure, to give a fair comparison to STAPLE we also use an absolute loss measure in which the ground truth and the outputs from STAPLE and Veritas are rounded to 0 or 1 before taking the absolute difference. Again, we average the loss over all the voxels where there is disagreement among the original m experts.

6.3.1 Real CT Data

With Institutional Review Board (IRB) approval, for our preliminary experiments we used de-identified spiral CT data collected from a patient using standard pediatric chest protocols on a Siemens Sensation 16 scanner. Images were reconstructed with a voxel size of 0.3867 mm in the x- and y-directions (axial), and with a slice spacing of 1 mm in the z-direction (cranial-caudal). Three synthetic nodules were added to the CT scan of the patient. Figure 6.2 shows a slice from the patient that includes a real nodule (real nodules vary in size) and a synthetic nodule. Significant effort has been made to ensure that the synthetic nodules are similar in form to the real nodules, but at present our model is based on simple lobulated lung nodules, which consist of overlapping spheres of constant intensity. We are working on modeling more complicated spiculated lung nodules. De-identified images were modified for the purposes of this study by inserting two or three small overlapping spheres with diameters from 3.5 to 5 mm. In particular, the locations, sizes, and densities (−10 to 100 Hounsfield units) of the nodules were chosen by a pediatric radiologist. Partial volume effects and residual image blur were accounted for by creating the constant-intensity nodules in a grid of sub-voxels smaller than the image voxels, and then blurring with a Gaussian point-spread function with a full-width at half-maximum (FWHM) of 1 mm in the x- and y-directions and a FWHM of 2 mm in the z-direction. The blurred nodules were then downsampled to the original image resolution and added to the CT images. For the 3-dimensional CT data with 193 slices, the image includes over 50,000,000 voxels, of which 1036, 564, and 1115 voxels are part of the three nodules, respectively.
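For readers who want to reproduce this kind of data, the sketch below illustrates the blurring and downsampling step under the stated FWHM values; the sub-voxel grid size, the argument layout, and the function name are our assumptions, not the exact procedure used to generate our data sets.

import numpy as np
from scipy.ndimage import gaussian_filter

def blur_and_downsample(nodule_subvox, subvox_per_vox, fwhm_mm, voxel_mm):
    """Blur a constant-intensity nodule built on a sub-voxel grid, then downsample.

    nodule_subvox  : 3-D array on the fine sub-voxel grid (a hypothetical representation).
    subvox_per_vox : integer number of sub-voxels per image voxel along each axis.
    fwhm_mm        : per-axis FWHM of the Gaussian point-spread function, e.g. (1, 1, 2).
    voxel_mm       : per-axis voxel size in mm, e.g. (0.3867, 0.3867, 1.0).
    """
    # convert the FWHM (in mm) to a Gaussian sigma in sub-voxel units
    sigma = [f / (2.0 * np.sqrt(2.0 * np.log(2.0))) / (v / subvox_per_vox)
             for f, v in zip(fwhm_mm, voxel_mm)]
    blurred = gaussian_filter(nodule_subvox, sigma=sigma)
    # downsample by averaging each block of sub-voxels back to the image resolution
    s = subvox_per_vox
    z, y, x = (d // s for d in blurred.shape)
    return blurred[:z*s, :y*s, :x*s].reshape(z, s, y, s, x, s).mean(axis=(1, 3, 5))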

For the CT scans, each expert was provided data in DICOM format, the standard for data transfer of medical images, and was asked to segment the synthetic nodules using their preferred methodology. The only constraint placed upon the experts was that they not alter the spatial resolution of the images. For this preliminary data, three segmentations were obtained using seed-based region growing techniques, another was obtained using an edge detection algorithm, and one was obtained using a manual tracing method. Software programs utilized were Analyze (Biomedical Imaging Resource, Rochester, MN), Mimics (Materialise US, Ann Arbor, MI), and ImageJ (National Institutes of Health, USA). Any voxel that was part of a segmented nodule was assigned a binary 1, and any voxel not part of a segmented nodule was assigned a binary 0. Two experienced experts, each with a minimum of 15 years of experience in segmenting CT image data, segmented the data twice using different segmentation algorithms (4 expert segmentations). The other expert was a novice, trained by an experienced expert, who did a manual tracing of the image data. All of the parameters of the synthetic nodules (location, diameter, and density) are known exactly and constitute "truth" for our empirical results. To convert the continuous nodules into the discretized truth, we define the true label for each voxel to be the fraction of the volume of that voxel that is part of the synthetic nodule (prior to the application of the blurring function). We created three data sets for which ground truth is known – one for each of the three synthetic nodules.

6.3.2 Artificial Data

In addition to real CT data, we created artificial data in the following manner. First, we created a 2-dimensional artificial nodule (see Figure 6.8). As in the real CT data, the nodule is blurred around its edges. Then we created eight different experts, shown in Figure 6.9, intended to simulate the ways in which experts classify the data. Again, if a pixel is considered part of the nodule, it is assigned a binary 1; otherwise it is assigned a 0.

• Experts E1 and E2 represent radiologists with significant experience who provide very accurate segmentations. E1 is created by hand-marking the nodule

Figure 6.8: The ground truth for our artificial data.

(Figure panels: Expert E1 through Expert E8.)

Figure 6.9: The eight expert segmentations used for our experiments with the artificial data.

as well as possible, and E2 is obtained by defining any pixel with an intensity of 128 (out of 255) or higher to be part of the nodule.

• Expert E3 represents an expert who tends to treat boundary pixels as part of the nodule. It is obtained by defining any pixel with an intensity of 50 (out of 255) or higher to be part of the nodule.

• Experts E4, E5, and E6 represent a variety of expert segmentations that tend to treat boundary pixels as not part of the nodule. Expert E4 only considers a pixel to be part of the nodule if it is in a non-blurred portion of the nodule. E5 is obtained by marking the boundaries by hand in a way that excludes any blurred areas from the nodule, and E6 is obtained by defining only pixels with the maximum intensity of 255 to be part of the nodule.


• Experts E7 and E8 represent novice experts whose segmentations are fairly simple in form. Segmentation E7 is a circle drawn roughly around the true nodule, and segmentation E8 is a rectangle drawn roughly around the true nodule.

The advantage of using such artificial data is that we can vary it in controlled ways to help understand the strengths and limitations of our new algorithm. Similar to the real data, when the nodule is partly overlapping a pixel, we define the ground truth as the fraction of the area of the pixel covered by the nodule.
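The intensity-based experts above are simple thresholding operations; a sketch of how they could be generated is shown below. The array name artificial_nodule is a placeholder, and the hand-marked and shape-based experts (E1, E5, E7, E8) are not reproducible this way.

import numpy as np

def threshold_expert(image, low):
    """Mark a pixel as nodule (1) when its intensity is at least `low` (image values in 0..255)."""
    return (image >= low).astype(np.uint8)

# Sketches of the intensity-based experts described above:
# E2 = threshold_expert(artificial_nodule, 128)   # experienced expert
# E3 = threshold_expert(artificial_nodule, 50)    # includes blurred boundary pixels
# E6 = threshold_expert(artificial_nodule, 255)   # only maximum-intensity pixels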

6.3.3 Loss Measures

Now we formally define the squared and absolute loss measures that we use in our work. Let A_i be the value of the ith voxel according to algorithm A, and T_i be the truth value of the ith voxel. Let the total number of voxels where not all experts agree with each other be n, and let the foreground (maximum) value of a voxel be M. The squared and absolute loss for an algorithm A are given by:

$$\text{Squared Loss} = \frac{1}{nM}\sum_{i=1}^{n} (A_i - T_i)^2$$

$$\text{Absolute Loss} = \frac{1}{nM}\sum_{i=1}^{n} \left|\mathrm{round}(A_i) - \mathrm{round}(T_i)\right|$$

Since different types of images can have different ranges of values, division by M normalizes the maximum to 1.
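The two loss measures can be computed directly from the definitions above; a sketch restricted to the disagreement voxels is given below (the array-based representation is our choice).

import numpy as np

def disagreement_losses(A, T, experts, M=1.0):
    """Squared and absolute loss restricted to voxels where the experts disagree.

    A, T    : algorithm output and ground truth, flat arrays of voxel values.
    experts : list of the original m expert segmentations (0/1 arrays).
    M       : foreground (maximum) voxel value, used to normalise the loss to 1.
    """
    E = np.asarray(experts)
    mask = ~np.all(E == E[0], axis=0)            # voxels where not all experts agree
    n = mask.sum()
    A, T = np.asarray(A, float)[mask], np.asarray(T, float)[mask]
    squared = np.sum((A - T) ** 2) / (n * M)
    absolute = np.sum(np.abs(np.round(A) - np.round(T))) / (n * M)
    return squared, absolute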

6.4 Results

On all data sets we compare the Veritas algorithm to STAPLE and to two baseline algorithms. The first baseline we use is the "consensus" baseline, which is obtained by using the majority vote of the m experts as the prediction. The second baseline we introduce is the "average" baseline, which predicts according to the average over

the expert segmentations. For example, if there are five experts, one that indicates that voxel v is part of a nodule and four that indicate that voxel v is not part of a nodule, then the consensus baseline would predict 0 (not in the nodule) for v, whereas the average baseline would predict 0.2. Consensus is equal to average when using the absolute loss measure, and hence it is left out of the graphs using that measure. For STAPLE we use the implementation from the National Library of Medicine Insight Segmentation and Registration Toolkit (ITK) [78] and run it with the default parameters.

Algorithm   | Squared loss   | Veritas % improvement | Absolute loss  | Veritas % improvement
Veritas     | 0.0753±0.0068  |         --            | 0.2030±0.0241  |         --
STAPLE      | 0.1259±0.0108  |        40.19          | 0.1740±0.0173  |       -16.67
Average     | 0.0798±0.0052  |         5.64          | 0.2349±0.0317  |        13.58
Consensus   | 0.1591±0.0208  |        52.67          | 0.2349±0.0317  |        13.58

Table 6.1: Average loss with 95% confidence interval for real CT data (18 experiments).
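For clarity, the two baselines can be written in a few lines; the sketch below reproduces the example above, where one of five experts voting "nodule" yields a consensus of 0 and an average of 0.2.

import numpy as np

def consensus_baseline(experts):
    """Majority vote of the m experts at each voxel (ties broken towards 1 here)."""
    E = np.asarray(experts, dtype=float)
    return (E.mean(axis=0) >= 0.5).astype(float)

def average_baseline(experts):
    """Per-voxel average of the m expert opinions (e.g. 1 of 5 saying 'nodule' gives 0.2)."""
    return np.asarray(experts, dtype=float).mean(axis=0)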

6.4.1 CT Data

Table 6.1 reports the average results with 95% confidence intervals, using both squared loss and absolute loss, for the three data sets we created from the real CT scans of a patient. For each data set we have five expert segmentations. All the algorithms are run using all subsets of four experts (5 subsets for each data set), and with all five experts. That gives 6 experiments for each data set, and a total of 18 experiments for all three data sets combined. Note that since these experiments are based on only 3 data sets, they are not independent. More data sets are needed to better understand the differences in the algorithms. Figure 6.10 compares Veritas with STAPLE, average, and consensus for all 18 experiments on CT data with squared loss. Using squared loss, while there are noticeable improvements over STAPLE (40.19%) and the consensus baseline (52.67%), on the real CT data sets the average baseline performs at a similar level to Veritas. We believe this is caused by several factors. First, the synthetic nodules have simple shapes


Figure 6.10: Comparing Veritas with STAPLE, average, and consensus for CT data with squared loss. The first 15 data points use 4 expert opinions and the last 3 use 5 expert opinions.

made of multiple overlapping spheres. Second, the segmentation errors were not necessarily independent of each other (for this experiment, two experts each performed two segmentations, and three of the segmentations were performed using similar segmentation techniques). Third, all the segmentations were highly accurate, allowing the baseline average method to perform quite well; thus there is very little room for improvement with such high-quality expert segmentations. These limitations are not found in clinical trials, as lung nodules have many sizes, shapes, textures, locations, and attachments to surrounding structures, and accurately and precisely measuring them remains a very challenging problem. Clinical trials are usually multi-center with distributed experts, and a single expert would not generate two different segmentations of the same data, so independent segmentation errors would be expected, in contrast to our simple trial. We are currently working on modeling larger lung


Figure 6.11: Comparing Veritas with STAPLE and average for CT data with absolute loss. Consensus is not shown, as it is equal to average with absolute loss.

nodules with more complicated geometry, texture, and locations to better represent the range of nodules found in clinical practice. We will repeat these experiments on these data sets as they become available. As seen with the artificial data (Section 6.4.2), we believe that Veritas will outperform the average baseline once the expert segmentations naturally vary more from the underlying ground truth. Figure 6.11 compares Veritas with STAPLE and average for all 18 runs on CT data with absolute loss. We left out consensus, as it is equal to average when using absolute loss. Using the absolute loss measure, STAPLE does better than Veritas (16.67%) and the average baseline. This seems to occur whenever the label expert is predicted with high accuracy using weak hypotheses constructed from the other m − 1 experts. The performance of Veritas suffers whenever the label expert is similar to one or more other experts used to create the weak hypotheses. When the weak hypotheses from the similar experts can predict the label expert well, the label expert is not really

contributing any new knowledge to the group regarding the ground truth. Since our real data is very simple, the segmentations are very similar. We believe more complex lesions (which are common in real medical images) would lead to different segmentations, and the performance of Veritas relative to STAPLE and the two baselines would be closer to that seen with the artificial data. The main motivation of our work is to learn the ground truth when there are different expert opinions. If all the experts are very similar (and all very accurate), there is really no need for Veritas or STAPLE – so it is the more complex lesions that are really of interest.

Algorithm   | Squared loss   | Veritas % improvement | Absolute loss  | Veritas % improvement
Veritas     | 0.0389±0.0071  |         --            | 0.0972±0.0237  |         --
STAPLE      | 0.0858±0.0047  |        54.66          | 0.1242±0.0095  |        21.74
Average     | 0.0635±0.0075  |        38.74          | 0.1834±0.0412  |        47.00
Consensus   | 0.1475±0.0301  |        73.63          | 0.1834±0.0412  |        47.00

Table 6.2: Average loss with 95% confidence interval for artificial data (93 experiments).

6.4.2 Artificial Data

Table 6.2 shows the average of all results with 95% confidence intervals on the artificial data set given in Figure 6.9. We create a set of experiments using various subsets of the eight experts E1, ..., E8. The algorithms are run using all subsets of five of the eight experts ($\binom{8}{5} = 56$ subsets), all subsets of six ($\binom{8}{6} = 28$ subsets), all subsets of seven ($\binom{8}{7} = 8$ subsets), and also with all eight experts. All the results are averaged over these 93 experiments ($\binom{8}{5} + \binom{8}{6} + \binom{8}{7} + \binom{8}{8}$). As with the CT data experiments, these experiments are not independent, and more data sets are needed to draw more accurate conclusions. Figure 6.12 compares Veritas with STAPLE, average, and consensus for all 93 runs on artificial data with squared loss. Using the squared loss measure, Veritas has an average loss of 0.0389 per voxel compared to the ground truth, whereas STAPLE has an average loss of 0.0858, the baseline average 0.0635, and consensus 0.1475. Veritas outperforms STAPLE by an average of 54.66%, outperforms the average baseline by


Figure 6.12: Comparing Veritas with STAPLE, average, and consensus for artificial data with squared loss. The first 56 data points use 5 expert opinions, the next 28 use 6, the next 8 use 7, and the last one uses 8.

38.74%, and outperforms the consensus baseline by 73.63%. Since the ground truth is fractional and STAPLE mostly predicts values close to 0 or 1, its performance is worse than that of the baseline average method, which predicts fractional values. Figure 6.13 compares Veritas with STAPLE and average for all 93 runs on artificial data with absolute loss. Using the absolute loss measure, which is more favorable to STAPLE (0.1242), STAPLE performs much better than the baseline average (0.1834). Even under this measure, Veritas (0.0972) performs better than STAPLE, by 21.74%. To better understand the influence of different experts, we compared the results when only one, two, or three of the experts E4, E5, and E6 are used. These three experts tend to treat the fuzzy boundary pixels as not part of the nodule. Hence, these experts are biased towards smaller nodules. Figure 6.14 compares Veritas with STAPLE, average,


Figure 6.13: Comparing Veritas with STAPLE and average for artificial data with absolute loss.

and consensus using squared loss on 18 experiments with only one of the experts E4, E5, and E6, on 48 experiments with two of the experts, and on 26 experiments with all three of the experts. Veritas outperforms all others in most of the experiments, except when three of the five experts are E4, E5, and E6. In this case, STAPLE performs better than the others, as shown near the beginning of the "all 3" part of Figure 6.14. The weighting scheme used by Veritas is unable to overcome the majority of the experts being biased towards underestimating the nodule. Though we need more data sets to draw firm conclusions, this shows a potential weakness and a possible area of improvement for the Veritas weighting scheme. A similar pattern is observed when using absolute loss instead of squared loss.
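The experiment grids used in this section can be enumerated mechanically; the sketch below reproduces the subset counts quoted above. Grouping the 93 runs by how many of E4, E5, and E6 they contain is our own bookkeeping, consistent with the 18 + 48 + 26 experiments of Figure 6.14 (the one remaining run is the single subset containing none of them).

from itertools import combinations
from math import comb

experts = ["E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8"]

# All subsets of at least five of the eight experts: C(8,5)+C(8,6)+C(8,7)+C(8,8) = 93 runs.
subsets = [s for k in range(5, 9) for s in combinations(experts, k)]
assert len(subsets) == sum(comb(8, k) for k in range(5, 9)) == 93

# Group the runs by how many of the boundary-shrinking experts E4, E5, E6 they contain.
counts = {}
for s in subsets:
    k = len({"E4", "E5", "E6"} & set(s))
    counts[k] = counts.get(k, 0) + 1
assert counts == {0: 1, 1: 18, 2: 48, 3: 26}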



Figure 6.14: Comparison (using squared loss) where only 1, 2, or 3 of the experts E4, E5, and E6 are chosen. At least 5 experts are used in each experiment. There are a total of 92 (18 + 48 + 26) experiments.

6.4.3 Various Combining Schemes for Veritas

In the previous discussion, Veritas combined the generated hypotheses using integer weights, where the hypothesis with the lowest accuracy is given the highest weight. We call this method PoorlyPredictedFavored, as the hypothesis that predicts the label expert poorly is given a higher weight. Though this seems counterintuitive, the results shown below indicate that it is the better method. In this section, we introduce another weighting method, WellPredictedFavored, which assigns integer weights where the hypothesis with the highest accuracy is given the highest weight. We compare these two weighted variants of Veritas with the unweighted variant, where all the hypotheses are equally weighted.


Weighting scheme        | Squared loss   | Absolute loss
PoorlyPredictedFavored  | 0.0753±0.0068  | 0.2030±0.0241
Unweighted Veritas      | 0.0793±0.0058  | 0.2076±0.0266
WellPredictedFavored    | 0.1094±0.0086  | 0.2081±0.0256

Table 6.3: Average loss with 95% confidence interval for Veritas with weighted and unweighted combining of experts on CT data.

Weighting scheme        | Squared loss   | Absolute loss
PoorlyPredictedFavored  | 0.0389±0.0071  | 0.0972±0.0237
Unweighted Veritas      | 0.0619±0.0081  | 0.1647±0.0423
WellPredictedFavored    | 0.1101±0.0128  | 0.3395±0.0507

Table 6.4: Average loss with 95% confidence interval for Veritas with weighted and unweighted combining of experts on artificial data.

Table 6.3 and Table 6.4 show the average of all results with 95% confidence intervals for the weighted and unweighted versions of Veritas on CT and artificial data, respectively. Figure 6.15 and Figure 6.16 compare the unweighted and weighted versions of Veritas with squared loss and absolute loss, respectively, for all 18 runs on CT data. Figure 6.17 and Figure 6.18 compare the unweighted and weighted versions of Veritas with squared loss and absolute loss, respectively, for all 93 runs on artificial data. In all cases, the PoorlyPredictedFavored method performs better than the other two methods. Though the WellPredictedFavored method is exactly opposite to PoorlyPredictedFavored and seems more intuitive, it performs worse than PoorlyPredictedFavored, which is the scheme used in the previous sections. In fact, WellPredictedFavored performs worse than even the unweighted scheme, indicating that the PoorlyPredictedFavored way of weighting – the hypothesis with the lower accuracy is given the higher weight – is appropriate.

6.5 Conclusions and Future Directions

We have presented the Veritas (truth-telling) algorithm to combine expert opinions when there is no labeled data for training. This is a very different problem from others in the machine learning literature. We have shown that Veritas compares


Figure 6.15: Comparing weighted and unweighted Veritas on CT data with squared loss.

favorably to STAPLE and two baseline algorithms for both the artificial data and the real CT image data (under the squared loss) to which synthetic nodules were added. There are many directions for future work. Soon, we will obtain more data sets with even more realistic and complicated synthetic nodules, and with a larger variety of expert segmentations, which will enable us to perform much more extensive empirical evaluation on real medical images. In addition, we are exploring a variety of ways to improve our algorithm by introducing features that capture the relationship between corresponding voxels from neighboring slices (using 3-dimensional neighbors instead of just 2-dimensional), and by weighting more precisely the m hypotheses obtained when treating each expert as the ground truth.



Figure 6.16: Comparing weighted and unweighted Veritas on CT data with absolute loss.

For example, when two learned hypotheses are very close in their accuracies, giving those hypotheses the same weight might be better. Another important issue that needs to be addressed is when multiple experts are very similar. One idea is to precompute their similarities and reflect them when assigning weights during the combining phase. If there is a group of very similar experts, then we can use only one of them and save on computation without losing accuracy. More experimental work needs to be done in this regard. The use of a Markov random field in STAPLE hard-codes the constraints on spatial homogeneity, which limits the ability to learn other spatial relationships. We plan to experiment with using Markov random fields as a post-processing step, or as a mechanism to weight the predictions obtained with each expert serving as the label, so that the resulting ground truth is smoothed. Another research


Figure 6.17: Comparing weighted and unweighted Veritas on artificial data with squared loss.

direction is to incorporate some domain knowledge into Veritas by using features from the raw data along with features constructed from the expert opinions to learn which experts are better under different situations.



Figure 6.18: Comparing weighted and unweighted Veritas on artificial data with absolute loss.


References [1] S. Andrews, T. Hofmann, and I. Tsochantaridis. Multiple instance learning with generalized support vector machines. Artificial Intelligence, pages 943–944, 2002. [2] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988. [3] Samuel G. Armato, Michael F. McNitt-Gray, Anthony P. Reeves, Charles R. Meyer, Geoffrey McLennan, Denise R. Aberle, Ella A. Kazerooni, Heber MacMahon, Edwin J.R. van Beek, David Yankelevitz, Eric A. Hoffman, Claudia I. Henschke, Rachael Y. Roberts, Matthew S. Brown, Roger M. Engelmann, Richard C. Pais, Christopher W. Piker, David Qing, Masha Kocherginsky, Barbara Y. Croft, and Lawrence P. Clarke. The lung image database consortium (LIDC): An evaluation of radiologist variability in the identification of lung nodules on CT scans. Academic Radiology, 14(11):1409–1421, November 2007. [4] Samuel G. Armato, Rachael Y. Roberts, Michael F. McNitt-Gray, Charles R. Meyer, Anthony P. Reeves, Geoffrey McLennan, Roger M. Engelmann, Peyton H. Bland, Dense R. Aberle, Ella A. Kazerooni, Heber MacMahon, Edwin J.R. van Beek, David Yankelevitz, Barbara Y. Croft, and Lawrence P. Clarke. The lung image database consortium (LIDC): Ensuring the integrity of expert-defined “truth”. Academic Radiology, 14(12):1455–1463, December 2007. [5] Peter Auer, Philip M. Long, and Aravind Srinivasan. Approximating hyperrectangles: learning and pseudo-random sets. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 314–323. ACM Press, 1997. [6] J. Bi, Y. Chen, and J. Wang. A sparse support vector machine approach to region-based image categorization. IEEE Conference on Computer Vision and Pattern Recognition, pages 1121–1128, 2005. [7] Avrim Blum. Empirical support for Winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26(1):5–23, 1997. [8] Avrim Blum and Adam Kalai. A note on learning from multiple-instance examples. Machine Learning, 30(1):23–29, 1998.


[9] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with cotraining. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, New York, NY, 1998. ACM Press. [10] M. Borsotti, P. Campadelli, and R. Schettini. Quantitative evaluation of color image segmentation results. Pattern Recognition Letters, 19(8):741–747, 1998. [11] Kevin W. Bowyer and P. Jonathon Phillips. Empirical evaluation techniques in computer vision, 1998. [12] Nicol`o Cesa-Bianchi and G´abor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, 2006. [13] S. Chabrier, B. Emile, H. Laurent, C. Rosenberger, and P. Marche. Unsupervised evaluation of image segmentation appliation to multi-spectral images. Proceedings of the 17th International Conference on Pattern Recognition, 2004. [14] Hsin-Chia Chen and Sheng-Jyh Wang. The use of visible color difference in the quantitative evaluation of color image segmentation. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004. [15] Y. Chen and J. Wang. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research, pages 913–939, 2004. [16] Sharath R. Cholleti, Sally A. Goldman, Avrim Blum, David G. Politte, and Steven Don. Veritas: Combining expert opinions without labeled data. In Proceedings of the 20th International Conference on Tools with Artificial Intelligence, November 2008. [17] Sharath R. Cholleti, Sally A. Goldman, and Rouhollah Rahmani. MI-Winnow: A new multiple-instance learning algorithm. In Proceedings of the 18th International Conference on Tools with Artificial Intelligence, pages 336–346, November 2006. [18] Michael Collins and Yoram Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 100–111, 1999. [19] P. Correia and F.Pereira. Estimation of video object’s relevance. Proceedings of EUSIPCO’2000 - X European Signal Processing Conference, 2000. [20] P. Correia and F.Pereira. Objective evaluation of video segmentation quality. IEEE Transactions on Image Processing, 12(2):186–200, February 2003. [21] I.J. Cox, M.L. Miller, S.M. Omohundro, and P.N. Yianilos. PicHunter: Bayesian relevance feedback. International Conference on Pattern Recognition, pages 361– 369, 1996. 82

[22] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistics Society, pages 1–38, 1977. [23] A. P. Dempster, N. M. Laird, and D. B Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39:1–38, 1977. [24] T. Dietterich, R. Lathrop, and T. Lozano-P´erez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, pages 31–37, 1997. [25] Edge Detection and Image SegmentatioN www.caip.rutgers.edu/riul/research/code/EDISON/.


[26] Cigdem Eroglu Erdem, Bulent Sanker, and A. Murat Tekalp. Performance measures for video object segmentation and tracking. IEEE Transactions on Image Processing, 13:937–951, 2004. [27] Francisco J. Estrada and Allan D. Jepson. Quantitative evaluation of a novel image segmentation algorithm. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2005. [28] M. Fischler and R. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartograpy. Communications of the ACM, 24:381–395, 1981. [29] David A. Forsyth and Jean Ponce. Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002. [30] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer System Sciences, 55(1):119–139, 1997. [31] Elisa Drelie Gelasca, Touradj Ebrahimi, Mylene Farias, Marco Carli, and Sanjit Mitra. Towards perceptually driven segmentation evaluation metrics. Proceedings of Conference on Computer Vision and Pattern Recognition Workshop, 2004. [32] T. Gevers and A.W.M. Smeulders. Image search engines: An overview. In G. Medioni and S. B. Kang, editors, Emerging Topics in Computer Vision. Prentice Hall, 2004. [33] D. Greig, B. Porteous, and H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistics Society, B, 51:271–279, 1989. [34] J. Hanley and B. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve, volume 143, pages 29–36. 1982. 83

[35] X. Huang, S.-C. Chen, M.-L. Shyu, and C. Zhang. User concept pattern discovery using relevance feedback and multiple instance learning for content-based image retrieval. 8th International Conference on Knowledge Discovery and Data Mining, pages 100–108, 2002. [36] Carl C. Jaffe. Lecture to American college of radiology. IEEE Transactions on Medical Imaging, 1(4):226–229, 1982. [37] Yan Ke and Rahul Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptiors. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2004. [38] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998. [39] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994. [40] J. Kivinen, M. K. Warmuth, and P. Auer. The perceptron algorithm vs. Winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97(1-2):325–343, 1997. [41] Luke Ledwich and Stefan Williams. Reduced sift features for image retrieval and indoor localisation. Proceedings of Australasian Conference on Robotics and Automation, 2004. [42] Martin D. Levine and Ahmed M. Nazif. Dynamic measurement of computer generated image segmentations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(2):155–164, 1985. [43] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm. Machine Learning, 2(4):285–318, 1988. [44] N. Littlestone. Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, University of California at Santa Cruz, Santa Cruz, CA, USA, 1989. [45] Jianqing Liu and Yee-Hong Yang. Multi-resolution color image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7):689– 700, 1994. [46] F. Long, H. Zhang, and D. Feng. Fundamentals of content-based image retrieval. In D. Feng, W. Siu, and H. Zhang, editors, Multimedia Information Retrieval and Management – Technological Fundamentals and Applications. Springer, 2003.


[47] Philip M. Long and Lei Tan. Pac learning axis-aligned rectangles with respect to product distributions from multiple-instance examples. Machine Learning, 30(1):7–21, 1998. [48] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004. [49] Wolfgang Maass and Manfred K. Warmuth. Efficient learning with virtual threshold gates. Information and Computation, 141(1):66–83, 1998. [50] O. Maron and T. Lozano-P´erez. A framework for multiple-instance learning. Neural Information Processing Systems, 1998. [51] O. Maron and A. Ratan. Multiple-instance learning for natural scene classification. Proceedings of the 15th International Conference on Machine Learning, pages 341–349, 1998. [52] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the 8th International Conference on Computer Vision, 2:416–423, 2001. [53] Military Graphics Collection. http://www.locked.de/en/ index.html. [54] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9(1):62–66, 1979. [55] N.R. Pal and D. Bhandari. Image thresholding: some new techniques. Signal Processing, 33(2):139–158, 1993. [56] Fred W. Prior, Bradley J. Erickson, and Lawrence Tarbox. Open source software projects of the caBIG in vivo imaging workspace software special interest group. Journal of Digital Imaging, 20(1):94–100, November 2007. [57] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986. [58] R. Rahmani, S. Goldman, H. Zhang, J. Krettek, and J. Fritts. Localized contentbased image retrieval. Proceedings of ACM Workshop on Multimedia Image Retrieval, pages 227–236, 2005. [59] Rouhollah Rahmani, Sally Goldman, Hui Zhang, and Jason Fritts. Content based image retrieval using multiple instance learning. Technical report, Washington University in St Louis, 2005. [60] Rouhollah Rahmani, Sally A. Goldman, Hui Zhang, Sharath R. Cholleti, and Jason Fritts. Localized content based image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1902–1912, November 2008. 85

[61] C. Rosenberger and K. Chehdi. Genetic fusion: Application to multi-components image segmentation. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000. [62] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing (MIT Press, 1988).). [63] P.K. Sahoo, S. Soltani, A.K.C. Wong, and Y.C. Chen. A survey of thresholding techniques. Computer Vision, Graphics, and Image Processing, 41(2):233–260, 1988. [64] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999. [65] Q. Tian, N. Sebe, M. S. Lew, E. Loupias, and Thomas S. Huang. Image retrieval using wavelet-based salient points. Journal of Electronic Imaging, Special Issue on Storage and Retrieval of Digital Media, 2001. [66] Q. Tian, Y. Wu, and T. Huang. Combine user defined region-of-interest and spatial layout for image retrieval. Proceedings of International Conference on Image Processing, 2000. [67] P. Torr and A. Zisserman. MLESAC: a new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78:138– 156, 2000. [68] L. G. Valiant. A theory of the learnable. Communications of the ACM, pages 1134–1142, 1984. [69] R. Vilalta and Y. Drissi. A perspective view and survey of metalearning. Artificial Intelligence Review, 18(2):77–95, 2002. [70] J. Wang, J. Li, and G. Wiederhold. SIMPLIcity: semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 947–963, 2001. [71] Junqiu Wang, Hongbin Zha, and R. Cipolla. Combining interest points and edges for content-based image retrieval. Proceedings of IEEE International Conference on Image Processing, 2005. [72] Simon K. Warfield, Kelly H. Zhou, and William M. Wells. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7):903–921, 2004. 86

[73] J.S. Weszka and A. Rosenfeld. Threshold evaluation techniques. IEEE Transactions on Systems, Man and Cybernetics, 8(8):622–629, 1978. [74] Christian Wolf, Jean-Michel Jolion, Walter Kropatsch, and Horst Bischof. Content based image retrieval using interest points and texture features. Proc. IEEE Int. Conference on Pattern Recognition, 2000. [75] Xin Xu and Eibe Frank. Logistic regression and boosting for labeled bags of instances. In H. Dai, R. Srikant, and C. Zhang, editors, Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, volume 3056 of LNAI, pages 272–281, Sydney, Australia, 2004. Springer. [76] C. Yang and T. Lozano-P´erez. Image database retrieval with multiple instance techniques. Proceedings of the 16th International Conference on Data Engineering, pages 233–243, 2000. [77] Yitzhak Yitzhaky and Eli Peli. A method for objective edge detection evaluation and detector parameter selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):1027–1033, 2003. [78] T. Yoo, M. Ackerman, W. Lorensen, W. Schroeder, V. Chalana, S. Aylward, D. Metaxes, and R. Whitaker. Engineering and algorithm design for an image processing API: A technical report on ITK - the Insight Toolkit, 2002. [79] Hui Zhang, Sharath R. Cholleti, Sally A. Goldman, and Jason Fritts. Meta evaluation of image segmentation using machine learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1138–1145, June 2006. [80] Hui Zhang and Jason Fritts. Improved hierarchical segmentation. Technical report, Washington University in St Louis, 2005. [81] Hui Zhang, Jason Fritts, and Sally Goldman. An entropy-based objective evaluation method for image segmentation. Proceedings of SPIE – Storage and Retrieval Methods and Applications for Multimedia, 2004. [82] Hui Zhang, Jason Fritts, and Sally Goldman. A co-evaluation framework for improving segmentation evaluation. Proceedings of SPIE – Signal Processing, Sensor Fusion and Target Recognition, 5809, 2005. [83] Hui Zhang, Rouhollah Rahmani, Sharath R. Cholleti, and Sally A. Goldman. Local image representations using pruned salient points with applications to CBIR. In Proceedings of the 14th ACM International Conference on Multimedia, pages 287–296, October 2006.


[84] Q. Zhang, S. Goldman, W. Yu, and J. Fritts. Content-based image retrieval using multiple instance learning. Proceedings of the 19th International Conference on Machine Learning, pages 682–689, 2002. [85] Qi Zhang and Sally A. Goldman. Em-dd: An improved multiple-instance learning technique. In Advances in Neural Information Processing Systems, pages 1073– 1080. MIT Press, 2001. [86] Zhi-hua Zhou and Min-ling Zhang. Ensembles of multi-instance learners. In Proceedings of the 14th European Conference on Machine Learning, pages 492– 502. Springer, 2003.


Vita

Sharath Reddy Cholleti

Date of Birth: 1979
Place of Birth: India
Degrees: B.Tech. Computer Science and Engineering, May 2000; M.S. Computer Science, December 2002; Ph.D. Computer Science, December 2008
Professional Societies: IEEE

Publications

Sharath R. Cholleti, Sally A. Goldman, Avrim Blum, David G. Politte, and Steven Don. Veritas: Combining expert opinions without labeled data. In Proceedings of the 20th International Conference on Tools with Artificial Intelligence, pages 45–52, November 2008. Rouhollah Rahmani, Sally A. Goldman, Hui Zhang, Sharath R. Cholleti, and Jason Fritts. Localized content based image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1902–1912, November 2008. Sharath R. Cholleti, Sally A. Goldman, and Rouhollah Rahmani. MI-Winnow: A new multiple-instance learning algorithm. In Proceedings of the 18th International Conference on Tools with Artificial Intelligence, pages 336–346, November 2006. Hui Zhang, Rouhollah Rahmani, Sharath R. Cholleti, and Sally A. Goldman. Local image representations using pruned salient points with applications to CBIR. In Proceedings of the 14th ACM International Conference on Multimedia, pages 287–296, October 2006.

Hui Zhang, Sharath R. Cholleti, Sally A. Goldman, and Jason Fritts. Meta evaluation of image segmentation using machine learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1138–1145, June 2006. Delvin Defoe, Sharath R. Cholleti, and Ron Cytron. Upper bound for defragmenting buddy heaps. In Proceedings of the 2004 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, June 2005.

December 2008


