Weakly Supervised Object Localization with Progressive Domain Adaptation Supplementary Material
Dong Li1, Jia-Bin Huang2, Yali Li1, Shengjin Wang1∗, and Ming-Hsuan Yang3
1 Tsinghua University, 2 University of Illinois, Urbana-Champaign, 3 University of California, Merced
∗ Corresponding author
1. Overview
In this supplementary material, we present three sets of additional results to complement the paper. First, we report detailed quantitative evaluations on the PASCAL VOC and ILSVRC object detection datasets. Second, we show additional qualitative detection results on the VOC 2007 dataset. Third, we analyze the errors of three variants of the proposed approach and show the relative contribution of each component.
2. Quantitative Evaluation
We show in Tables 1, 2, and 3 the detection average precision (AP) of our method on the PASCAL VOC 2007, 2010, and 2012 datasets, respectively. In general, our algorithm achieves better localization performance for the animal and vehicle classes than for the furniture classes. Indoor objects are difficult to detect in a weakly supervised manner due to large appearance variations and background clutter. Using the deeper VGGNet as our base CNN model, we achieve better performance than with the AlexNet. We also present the precision-recall curves for each category obtained by our OM+MIL+FT-VGGNet method on the VOC 2007 test set in Figure 1, the VOC 2010 val set in Figure 2, and the VOC 2012 val set in Figure 3. Tables 4 and 5 show the per-class detection AP of our method on the ILSVRC 2013 detection val2 set. Similarly, we achieve better localization performance with the deeper network.
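For reference, the AP numbers and precision-recall curves above follow the standard PASCAL VOC evaluation protocol. The following is a minimal illustrative sketch of that computation, not the evaluation code used for the reported results: it assumes detections have already been matched to ground truth at IoU ≥ 0.5 and uses all-point interpolation, whereas the official VOC 2007 protocol uses 11-point interpolation; all function and variable names are ours.

```python
import numpy as np

def precision_recall_ap(scores, is_true_positive, num_gt):
    """Precision-recall curve and AP for one class.

    scores           -- confidence of each detection
    is_true_positive -- 1 if the detection matches an unmatched ground-truth
                        box of the class at IoU >= 0.5, else 0 (matching is
                        assumed to have been done beforehand)
    num_gt           -- number of ground-truth instances of the class
    """
    order = np.argsort(-np.asarray(scores, dtype=float))  # sort by score, descending
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)

    # Make precision monotonically non-increasing, then integrate over recall.
    interp = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return recall, precision, ap
```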
3. Sample detections
We show more sample detection results on the PASCAL VOC 2007 test set in Figures 4, 5, and 6. All of the detection results are obtained by our OM+MIL+FT-VGGNet method. Our algorithm is able to detect objects under different scales, lighting conditions, and partial occlusions.

4. Error analysis
In Figure 8, we apply the detector error analysis tool from Hoiem et al. [1] to analyze the errors of our detector on the VOC 2007 dataset. Comparing the first column (OM+MIL) with the third column (OM+MIL+FT), we find that confusion with similar objects is significantly reduced by the detection adaptation step (FT), particularly for the furniture classes. Fine-tuning the network for object-level detection helps learn discriminative appearance models for the object categories. Comparing the second column (MIL+FT) with the third column (OM+MIL+FT), confusion with the background (particularly for the vehicle classes) and with other objects (particularly for the furniture classes) is reduced by our class-specific proposal mining (OM) step. With classification adaptation, the OM step removes a substantial amount of noise from the initial class-independent proposal collection and mines class-specific object candidates with high precision. From the error analysis plots, we note that the majority of errors come from inaccurate localization. We show sample detection errors in Figure 7. Typical imprecise localizations include detecting only a bicycle wheel, a bird's body, or part of a bus or car. The detector is also sometimes confused by the background or by similar objects. The last two rows of Figure 7 show detection errors due to confusion with the background and with similar objects, respectively. For example, we detect plants in a lake and claim to have detected potted plants. In the first image of the last row, we incorrectly detect a chair as a sofa.
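For illustration, the sketch below shows how a single false positive could be assigned to one of the four error types in Figure 8. The 0.1 IoU cut-offs and the notion of a similar-class group follow the common convention of the diagnosis tool of Hoiem et al. [1], but the released tool may differ in its details; the helper names and the `similar` mapping are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def categorize_false_positive(det_box, det_class, gt_boxes, gt_classes, similar):
    """Assign a false positive (same-class IoU < 0.5) to Loc / Sim / Oth / BG.

    similar -- dict mapping a class to the set of classes considered similar,
               e.g. {"sofa": {"chair"}}; a hypothetical stand-in for the
               groups used in Figure 8.
    """
    best = {"same": 0.0, "sim": 0.0, "oth": 0.0}
    for box, cls in zip(gt_boxes, gt_classes):
        o = iou(det_box, box)
        if cls == det_class:
            best["same"] = max(best["same"], o)
        elif cls in similar.get(det_class, set()):
            best["sim"] = max(best["sim"], o)
        else:
            best["oth"] = max(best["oth"], o)

    if best["same"] >= 0.1:   # right class but poorly localized (or a duplicate)
        return "Loc"
    if best["sim"] >= 0.1:    # confused with a visually similar class
        return "Sim"
    if best["oth"] >= 0.1:    # confused with another annotated object class
        return "Oth"
    return "BG"               # fires on background
```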
References
[1] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In ECCV, 2012.
Table 1. Quantitative evaluation in terms of detection average precision (AP) on the PASCAL VOC 2007 test set.
Methods aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
OM + MIL + FT-AlexNet 49.7 33.6 30.8 19.9 13.0 40.5 54.3 37.4 14.8 39.8 9.4 28.8 38.1 49.8 14.5 24.0 27.1 12.1 42.3 39.7 31.0
OM + MIL + FT-VGGNet 54.5 47.4 41.3 20.8 17.7 51.9 63.5 46.1 21.8 57.1 22.1 34.4 50.5 61.8 16.2 29.9 40.7 15.9 55.3 40.2 39.5
Figure 1. Precision-recall curves for each category on the PASCAL VOC 2007 test set.
Table 2. Quantitative evaluation in terms of detection average precision (AP) on the PASCAL VOC 2010 val set.
Methods aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
OM + MIL + FT-AlexNet 37.0 26.7 17.3 7.1 8.7 55.3 27.4 34.8 6.2 18.7 4.2 20.7 28.4 39.6 8.9 11.3 22.6 6.4 23.3 22.7 21.4
OM + MIL + FT-VGGNet 46.7 50.9 31.9 9.2 13.2 54.6 35.4 47.1 7.5 24.2 6.2 32.6 46.3 56.8 11.6 21.2 32.6 12.0 45.5 28.9 30.7
Figure 2. Precision-recall curves for each category on the PASCAL VOC 2010 val set.
Table 3. Quantitative evaluation in terms of detection average precision (AP) on the PASCAL VOC 2012 val set.
Methods aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
OM + MIL + FT-AlexNet 33.7 27.8 19.9 5.7 6.5 49.6 28.1 37.3 7.6 17.4 2.8 23.3 33.7 40.4 8.9 13.7 22.3 5.5 36.4 26.7 22.4
OM + MIL + FT-VGGNet 42.2 27.8 32.7 4.2 13.7 52.1 35.8 48.3 11.8 31.7 4.9 30.4 45.3 51.8 11.5 13.4 33.5 7.2 45.6 38.4 29.1
Figure 3. Precision-recall curves for each category on the PASCAL VOC 2012 val set.
Table 4. Per-class detection average precision (AP) on the ILSVRC2013 detection val2 set by our OM+MIL+FT-AlexNet method.
accordion 14.0, airplane 14.9, ant 18.7, antelope 9.3, apple 9.5, armadillo 38.0, artichoke 3.1, axe 0.2, baby bed 10.6, backpack 1.0, bagel 3.1, balance beam 0.1, banana 5.2, band aid 0.6, banjo 7.6, baseball 12.3, basketball 0.0, bathing cap 1.2, beaker 6.5, bear 33.3, bee 12.4, bell pepper 3.7, bench 1.2, bicycle 12.1, binder 1.0, bird 28.4, bookshelf 2.4, bow tie 0.0, bow 0.6, bowl 4.6, brassiere 0.2, burrito 3.3, bus 21.0, butterfly 62.3, camel 5.3, can opener 2.5, car 14.9, cart 23.3, cattle 9.8, cello 8.8
centipede 12.5, chain saw 1.1, chair 3.9, chime 12.7, cocktail shaker 6.9, coffee maker 1.5, computer keyboard 4.9, computer mouse 0.0, corkscrew 1.0, cream 2.0, croquet ball 0.0, crutch 0.1, cucumber 2.4, cup or mug 12.7, diaper 0.0, digital clock 2.5, dishwasher 0.1, dog 24.0, domestic cat 8.8, dragonfly 17.0, drum 0.5, dumbbell 2.0, electric fan 0.0, elephant 23.2, face powder 5.8, fig 1.9, filing cabinet 4.9, flower pot 0.3, flute 0.2, fox 28.1, french horn 0.5, frog 11.2, frying pan 0.4, giant panda 41.1, goldfish 8.8, golf ball 17.4, golfcart 54.7, guacamole 12.0, guitar 6.4, hair dryer 1.6
hair spray 4.4, hamburger 15.3, hammer 0.4, hamster 26.9, harmonica 0.4, harp 13.4, hat with a wide brim 1.6, head cabbage 1.0, helmet 0.3, hippopotamus 13.3, horizontal bar 0.5, horse 14.4, hotdog 7.0, iPod 23.6, isopod 9.2, jellyfish 3.2, koala bear 13.1, ladle 0.0, ladybug 16.6, lamp 1.1, laptop 1.4, lemon 10.2, lion 3.4, lipstick 4.6, lizard 4.5, lobster 4.2, maillot 0.2, maraca 11.3, microphone 0.0, microwave 0.8, milk can 11.8, miniskirt 0.0, monkey 19.2, motorcycle 20.5, mushroom 11.4, nail 0.0, neck brace 0.1, oboe 9.8, orange 3.8, otter 2.2
pencil box 4.2, pencil sharpener 0.7, perfume 12.6, person 0.5, piano 13.2, pineapple 9.5, ping-pong ball 0.0, pitcher 5.1, pizza 3.9, plastic bag 0.1, plate rack 0.2, pomegranate 1.1, popsicle 1.8, porcupine 17.2, power drill 1.0, pretzel 4.2, printer 3.4, puck 0.0, punching bag 3.0, purse 4.6, rabbit 35.2, racket 0.0, ray 6.9, red panda 12.0, refrigerator 1.9, remote control 4.7, rubber eraser 0.1, rugby ball 0.1, ruler 0.4, salt or pepper shaker 6.6, saxophone 2.7, scorpion 23.7, screwdriver 0.1, seal 1.6, sheep 10.9, ski 0.2, skunk 8.5, snail 11.3, snake 5.1, snowmobile 2.5
snowplow 21.4, soap dispenser 0.1, soccer ball 14.3, sofa 2.7, spatula 0.2, squirrel 14.2, starfish 9.5, stethoscope 0.1, stove 0.6, strainer 1.5, strawberry 5.1, stretcher 0.4, sunglasses 1.6, swimming trunks 0.0, swine 27.1, syringe 0.0, table 0.8, tape player 8.8, tennis ball 12.3, tick 29.8, tie 0.1, tiger 14.5, toaster 22.9, traffic light 0.5, train 21.1, trombone 0.2, trumpet 3.3, turtle 25.3, tv or monitor 8.1, unicycle 0.2, vacuum 2.5, violin 1.6, volleyball 0.0, waffle iron 1.6, washer 7.7, water bottle 2.0, watercraft 6.0, whale 0.0, wine bottle 1.5, zebra 13.4
Table 5. Per-class detection average precision (AP) on the ILSVRC2013 detection val2 set by our OM+MIL+FT-VGGNet method.
accordion 15.9, airplane 24.0, ant 24.6, antelope 13.7, apple 8.1, armadillo 50.5, artichoke 4.1, axe 1.7, baby bed 19.3, backpack 0.7, bagel 7.4, balance beam 0.1, banana 6.9, band aid 3.2, banjo 6.7, baseball 18.1, basketball 0.0, bathing cap 0.8, beaker 4.4, bear 45.8, bee 11.4, bell pepper 9.6, bench 1.8, bicycle 13.6, binder 2.2, bird 40.8, bookshelf 1.3, bow tie 0.0, bow 3.3, bowl 6.9, brassiere 1.6, burrito 4.2, bus 25.5, butterfly 64.9, camel 9.2, can opener 8.1, car 19.4, cart 18.5, cattle 17.0, cello 6.1
centipede 14.6, chain saw 4.8, chair 4.7, chime 15.3, cocktail shaker 7.3, coffee maker 6.0, computer keyboard 9.8, computer mouse 0.0, corkscrew 10.2, cream 4.4, croquet ball 0.1, crutch 1.1, cucumber 4.7, cup or mug 17.8, diaper 0.8, digital clock 9.0, dishwasher 0.6, dog 40.2, domestic cat 22.1, dragonfly 26.0, drum 0.3, dumbbell 4.4, electric fan 0.3, elephant 33.6, face powder 5.5, fig 4.4, filing cabinet 6.2, flower pot 0.8, flute 0.0, fox 35.0, french horn 2.0, frog 27.0, frying pan 2.4, giant panda 49.1, goldfish 7.9, golf ball 22.2, golfcart 55.0, guacamole 14.4, guitar 11.2, hair dryer 4.4
hair spray 1.8, hamburger 12.9, hammer 0.5, hamster 43.0, harmonica 4.5, harp 28.9, hat with a wide brim 5.1, head cabbage 4.9, helmet 0.5, hippopotamus 6.7, horizontal bar 0.4, horse 19.8, hotdog 11.0, iPod 31.2, isopod 24.2, jellyfish 1.0, koala bear 21.1, ladle 0.6, ladybug 20.6, lamp 1.5, laptop 4.7, lemon 11.7, lion 17.6, lipstick 1.8, lizard 19.1, lobster 11.3, maillot 0.4, maraca 11.1, microphone 0.0, microwave 3.4, milk can 22.3, miniskirt 0.1, monkey 29.8, motorcycle 21.0, mushroom 13.2, nail 0.0, neck brace 0.0, oboe 1.4, orange 3.2, otter 5.0
pencil box 3.4, pencil sharpener 0.5, perfume 19.8, person 0.6, piano 6.0, pineapple 16.8, ping-pong ball 0.6, pitcher 11.9, pizza 6.5, plastic bag 1.7, plate rack 3.4, pomegranate 4.4, popsicle 1.6, porcupine 40.9, power drill 2.1, pretzel 2.0, printer 8.7, puck 0.0, punching bag 8.4, purse 8.1, rabbit 47.1, racket 0.5, ray 12.6, red panda 11.0, refrigerator 15.0, remote control 11.3, rubber eraser 0.1, rugby ball 0.1, ruler 0.9, salt or pepper shaker 2.8, saxophone 11.5, scorpion 28.1, screwdriver 0.1, seal 5.8, sheep 20.8, ski 0.1, skunk 8.6, snail 13.3, snake 13.7, snowmobile 6.0
snowplow 18.0, soap dispenser 1.1, soccer ball 21.7, sofa 5.2, spatula 0.1, squirrel 23.8, starfish 18.8, stethoscope 2.2, stove 2.5, strainer 4.9, strawberry 8.4, stretcher 0.7, sunglasses 2.3, swimming trunks 0.0, swine 39.6, syringe 1.3, table 0.9, tape player 9.6, tennis ball 17.5, tick 7.3, tie 0.1, tiger 27.3, toaster 23.2, traffic light 0.6, train 19.0, trombone 2.0, trumpet 5.5, turtle 35.6, tv or monitor 6.6, unicycle 0.4, vacuum 10.8, violin 4.8, volleyball 0.0, waffle iron 3.9, washer 3.7, water bottle 2.9, watercraft 12.0, whale 0.0, wine bottle 1.8, zebra 29.5
Figure 4. Sample detection results on the PASCAL VOC 2007 test set. Green boxes indicate ground-truth instance annotations. Yellow boxes indicate correct detections (with IoU ≥ 0.5). For all test results, we set the detection threshold to 0.8 and use NMS to remove duplicate detections.
Figure 5. Sample detection results on the PASCAL VOC 2007 test set. Green boxes indicate ground-truth instance annotations. Yellow boxes indicate correct detections (with IoU ≥ 0.5). For all test results, we set the detection threshold to 0.8 and use NMS to remove duplicate detections.
Figure 6. Sample detection results on the PASCAL VOC 2007 test set. Green boxes indicate ground-truth instance annotations. Yellow boxes indicate correct detections (with IoU ≥ 0.5). For all test results, we set the detection threshold to 0.8 and use NMS to remove duplicate detections.
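For reference, the post-processing described in the captions above (keeping detections scoring at least 0.8 and removing duplicates with NMS) can be sketched as follows. This is an illustrative greedy NMS, not our actual pipeline; the 0.3 overlap threshold is an assumed value, as the NMS threshold used for the figures is not specified here.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2)."""
    if len(boxes) == 0:
        return []
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    order = scores.argsort()[::-1]                 # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Overlap of the current best box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter + 1e-12)
        order = order[1:][iou <= iou_thresh]       # drop near-duplicate boxes
    return keep

def postprocess(boxes, scores, score_thresh=0.8, iou_thresh=0.3):
    """Keep detections scoring above the threshold, then apply NMS."""
    idx = [i for i, s in enumerate(scores) if s >= score_thresh]
    kept = nms([boxes[i] for i in idx], [scores[i] for i in idx], iou_thresh)
    return [idx[i] for i in kept]
```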
Figure 7. Additional examples of detection errors. Green boxes indicate ground-truth instance annotations. Red boxes indicate false positives.
Figure 8. Detector error analysis. The detection errors are categorized into four types: false positives due to poor localization (Loc), confusion with similar objects (Sim), confusion with other VOC objects (Oth), and confusion with background (BG). The stacked area plots show the fraction of false positives of each type as the total number of false positives increases. We show examples from the animal, vehicle, and furniture classes. The results of "MIL+FT" and "OM+MIL+FT" are obtained using the VGGNet.