A Dataset of Annotated Animals∗

Allan Hanbury
PRIP, Institute of Computer-Aided Automation
Favoritenstr. 9/1832, A-1040 Vienna, Austria
e-mail: [email protected]

Alireza Tavakoli Targhi
NADA/CVAP, KTH
S-100 44 Stockholm, Sweden
e-mail: [email protected]
Abstract

An annotation of 59 795 images from the Corel database is presented. Each image has been labelled as containing an animal or not. The type of animal is also specified for most images. In addition, 1289 images have been manually segmented into animals and background. This annotated dataset will allow the evaluation of segmentation, feature extraction and object recognition algorithms for the well-defined task of recognising animals in images.
1 Introduction
Object recognition in images is an important computer vision problem. Better object recognition would lead to a narrower semantic gap [17], allowing more effective content-based image retrieval (CBIR). This improvement would come about because matching could be based on the presence of the same objects in the images, rather than on lower-level features such as colour and texture. There are a number of approaches to recognising objects in images. The two most widely used are based on:

• Finding distinctive regions in the image [14], calculating features for these regions [13] and classifying the objects. Examples of such an approach can be found in [4, 19].

• Segmenting the images and classifying each resulting region into one of the object classes [1, 3].

∗ Manual segmentations were provided by: Gerd Brunner, Katarina Domijan, Patrick Hède, Beatriz Marcotegui, Branislav Mičušík, Christophe Millet, Pierre-Alain Moëllic, Montse Pardas, Lech Szumilas, Alexandra Teynor and Simon Wilson. This work was partly supported by the European Union Network of Excellence MUSCLE (FP6507752) and by the Austrian Science Foundation (FWF) under grant SESAME (P17189-N04).

One of the difficulties with recognising objects is finding enough data on which to train the classifier. Most annotated databases contain a list of keywords or a free-text description assigned to each image [7], without information on the correspondence between words and image regions. Ways to overcome this difficulty include ignoring the associations between keywords and regions by training an algorithm to assign keywords to the whole image [10], or using ideas from the machine translation of text to learn the associations [5]. The most comprehensive pixelwise labelled database is available from Microsoft Research Cambridge [16]. It contains 591 images manually labelled with 23 labels. Most of these images are roughly segmented, although a few accurate segmentations are available. With video data, obtaining many training images is less of a problem, as one can mark an object in one frame and track it through further frames to obtain many training images with good object localisation [15].

In this paper, we present a dataset in which we have attempted to overcome many of these difficulties. The dataset is aimed at developing and evaluating algorithms for animal recognition. The task of animal recognition is well defined: for the majority of images, there is strong agreement between people as to which region of an image contains an animal. The number of annotated images is large: 59 795 images have been annotated as containing an animal or not, with many of the animal images also annotated with the type of animal. Finally, 1289 manually segmented images of animals are part of the dataset, making it possible to evaluate segmentation, feature extraction and object recognition algorithms.

Many authors in computer vision evaluate the performance of their methods by applying them to animal images [6, 8, 16]. This could be because correctly recognising an animal usually involves using multiple feature types: colour, texture, shape, etc. Usually recognition or segmentation performance is shown on a small number of animal classes, for instance zebra, cheetah and giraffe [8], or cow, sheep, bird, cat and dog (along with 16 non-animal classes) [16]. Our initial study on the dataset described in this paper has shown that distinguishing between zebras and tigers is not as hard as distinguishing between cheetahs and leopards. Furthermore, we have found that recognition performance decreases as the number of classes increases. Many of these animal classes are seen as similar by humans, which often translates into similar responses to some of the feature detectors used in computer vision. The recognition of animals in the proposed dataset represents a realistic task, as the animals are imaged at different scales, from different viewpoints and under different illuminations. Similar but less varied animal image datasets that have been used for feature or classifier evaluation include a butterfly database containing 619 images of seven different classes of butterfly [8] and a bird database containing 600 images of six different classes of bird (100 images per class) [9].

The structure of this paper is as follows. Section 2 describes how the dataset was annotated and gives statistics on the occurrence of keywords. Section 3 describes the manual segmentations. Section 4 concludes.
2 Dataset Annotation
We re-annotated the Corel database used in [10, 20], which contains 59 795 images of a wide variety of scenes. In the original database, each group of 100 images is labelled by a group of keywords. For example, each of the 100 images in the "Paris/France" category is assigned the keywords "Paris, European, historical building, beach, landscape, water", the images in the "Lion" category are assigned the keywords "lion, animal, wildlife, grass", and the images in the "eagle" category are assigned the keywords "wildlife, eagle, sky, bird".

We annotated each of the 59 795 images as containing an animal or not. For most of the images containing an animal, the type of animal is added. We defined four groups of animals: bird, fish, arthropod (insects and spiders) and other (mostly mammals). An animal can be labelled by only the group (e.g. "bird") or by a group and a specific animal (e.g. "bird-parrot"). A single image can be assigned more than one keyword.

Of the 59 795 images, 8114 are of animals (13.6%). Of the images classified as animals, 1505 are in the group bird, 835 in the group fish and 331 in the group arthropod. This leaves 4668 images in the remaining group. 775 of the animal pictures have not been assigned any specific animal type as a label. Table 1 gives the number of times each label appears, with the labels sorted by frequency. The group labels alone indicate the number of times that the category appears without a specific label (so the count for bird does not contain the count of bird-parrot). In total, 102 different labels were used. The graph in Figure 1 shows the counts of the labels which appear more than 50 times. Horses appear most often in this set of images.
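The distributed format of the annotation files is not specified here; as a minimal sketch, assuming a hypothetical tab-separated file in which each line holds an image identifier followed by a comma-separated list of labels, the per-label counts of Table 1 could be reproduced as follows. The file name and layout are assumptions for illustration only.

```python
# Hypothetical annotation format assumed for illustration, e.g.
#   108003<TAB>animal,bird,bird-parrot
#   215076<TAB>landscape
from collections import Counter

def label_frequencies(annotation_path):
    """Count how often each label occurs, in the style of Table 1.

    Labels are assumed to be stored exactly as described above: a group alone
    ("bird") or a group plus a specific animal ("bird-parrot"), so counting
    keywords directly reproduces the convention that the count for "bird"
    excludes images labelled "bird-parrot".
    """
    counts = Counter()
    with open(annotation_path, encoding="utf-8") as f:
        for line in f:
            image_id, _, keywords = line.strip().partition("\t")
            for keyword in keywords.split(","):
                keyword = keyword.strip()
                if keyword:
                    counts[keyword] += 1
    return counts

if __name__ == "__main__":
    freq = label_frequencies("animal_annotations.txt")  # hypothetical file name
    for label, n in freq.most_common(20):
        print(f"{label:25s} {n}")
```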
3 Manual Segmentation
There exists no unique "correct" segmentation of a general photograph. Different test subjects asked simply to segment photographs tend to produce different segmentations of the same photograph [11]. This is because they may have different preconceptions about what is important in a scene, or because they segment the scene at different scales. As the task of animal recognition is well defined, one can, for almost all images, produce an objective segmentation dividing the image into those regions which are animals and those regions which are not.

A semi-automatic image segmentation tool (SAIST) was made available to simplify the manual segmentation (available from http://muscle.prip.tuwien.ac.at). It uses a marker-based watershed segmentation [18]. The user draws in the markers, as shown in Figure 2a, which leads to the segmentation shown in Figure 2b. This process can be iterated by adding or removing markers (Figure 2c) until the required segmentation is obtained (Figure 2d).

The instructions given to the people performing the manual segmentations were, for an image containing n animals, to label the animals from 1 to n; the background then receives the label n + 1. A group of volunteers (listed in the acknowledgements) manually segmented 1289 of the images annotated as containing animals. Table 2 shows the animal types for which manual segmentations were done, as well as the number of segmentations of each animal. Examples of some segmentations are shown in Figure 3.
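The SAIST implementation itself is not described beyond its use of marker-based watershed [18]. As a minimal sketch of that step, the following uses scikit-image (an assumption, not the tool's actual code) to grow a label image from user-drawn marker pixels, following the convention above that animals receive labels 1 to n and the background label n + 1. The image name and marker coordinates are hypothetical.

```python
# Minimal illustration of marker-based watershed segmentation.
# Not the SAIST code; scikit-image and the marker strokes below are assumed.
import numpy as np
from skimage import io, color, filters, segmentation

def watershed_from_markers(image_rgb, marker_strokes):
    """Segment an image from user-supplied marker strokes.

    marker_strokes maps a label to a list of (row, col) pixels drawn by the
    user; animals get labels 1..n and the background gets label n + 1.
    """
    gray = color.rgb2gray(image_rgb)
    gradient = filters.sobel(gray)            # flood the gradient image
    markers = np.zeros(gray.shape, dtype=np.int32)
    for label, pixels in marker_strokes.items():
        for r, c in pixels:
            markers[r, c] = label
    return segmentation.watershed(gradient, markers)

if __name__ == "__main__":
    img = io.imread("lion.jpg")               # hypothetical animal image
    strokes = {
        1: [(120, 200), (125, 210), (130, 220)],  # inside the single animal
        2: [(10, 10), (10, 300), (400, 20)],      # background label (n + 1, n = 1)
    }
    labels = watershed_from_markers(img, strokes)
```

Iterating the process, as described above, simply amounts to adding or removing marker pixels and re-running the flooding until the boundary is satisfactory.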
horse 1020, bird 925, fish 814, dog 610, cattle 260, arthropod-butterfly 193, elephant 188, bird-penguin 182, lizard 177, antelope 141, lion 135, cat 126, fox 125, tiger 119, bird-owl 113, deer 112, goat 109, arthropod 102, bird-eagle 100, cougar 100, coyote 100, snake 97, leopard 96, camel 76, rhinoceros 76, whale 71, sheep 66, monkey 62, dolphin 61, seal 60, frog 59, hippopotamus 58, crocodile 54, cheetah 47, bear 45, rabbit 43, tortoise 42, wolf 38, bird-duck 34, bird-parrot 28, squirrel 28, bird-goose 26, bird-swan 26, donkey 26, pig 24, crab 23, giraffe 23, moose 23, zebra 23, polar bear 20, arthropod-moth 19, turtle 19, bird-poultry 18, gorilla 18, chimpanzee 14, fish-shark 14, baboon 13, buffalo 13, wildebeest 13, bird-ostrich 12, racoon 11, kangaroo 10, panther 10, orangutan 9, bird-flamingo 8, hyena 8, llama 8, skunk 8, arthropod-spider 7, panda 7, bird-peacock 6, bird-vulture 6, hare 6, bird-seagull 5, koala 5, otter 5, warthog 5, bird-budgerigar 3, bird-guinea fowl 3, fish-sea horse 3, fish-starfish 3, arthropod-bee 2, arthropod-caterpillar 2, arthropod-scorpion 2, bird-emu 2, bird-pelican 2, bird-turkey 2, walrus 2, arthropod-beetle 1, arthropod-fly 1, arthropod-ladybird 1, arthropod-locust 1, beaver 1, bird-crane 1, bird-dove 1, bird-pigeon 1, bird-stork 1, chameleon 1, fish-octopus 1, hedgehog 1, lobster 1, mouse 1

Table 1: Frequency of appearance of each of the labels.
[Bar chart omitted in this text version; the per-label counts, on a scale from 0 to just over 1000, are those listed in Table 1.]

Figure 1: Frequency of the labels which appear more than 50 times.
Figure 2: Use of SAIST. (a) Initial markers. (b) Segmentation resulting from the markers in (a). (c) Additional markers. (d) Segmentation resulting from the markers in (c).
4 Conclusion
The manual segmentations are useful for testing both image segmentation and object recognition algorithms, as well as how they depend on each other. For segmentation algorithms intended to support subsequent animal recognition, the boundaries of the animals should be found as accurately as possible; a boundary evaluation algorithm such as the one proposed by Martin et al. [12] could be used to evaluate this. For evaluating features used in object recognition, the features within the regions manually marked as animals can be used as noise-free training and testing data. Once it is known how well recognition performs with (almost) perfect segmentation, the effect of poor segmentation on subsequent object recognition can be evaluated.

Aside from the manually segmented images, the extremely large number of images annotated as containing animals or not will be useful for training classifiers and for testing the generalisation of classifiers trained on the manually segmented images. This dataset could also be used for animal classification within a class, for example to classify different species of dogs, or to locate features which distinguish between individual animals of the same species. This has a practical application in wildlife management, for example recognising individual penguins [2]. The annotations and segmentations for this dataset will soon be made publicly available.
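As a concrete illustration of using a manual segmentation as a noise-free mask for feature extraction, the sketch below computes a colour histogram only over pixels labelled as animal. The choice of feature and the file names are assumptions for illustration; the paper does not prescribe a particular feature, only that the background carries the largest label n + 1.

```python
# Sketch of masking an image with its manual segmentation before feature
# extraction. The colour-histogram feature and file names are assumptions.
import numpy as np
from skimage import io

def animal_histogram(image_path, labels_path, bins=8):
    """Colour histogram computed only over pixels labelled as animal.

    The label image is assumed to follow the dataset convention: animals are
    labelled 1..n and the background carries the largest label, n + 1.
    Assumes an RGB image.
    """
    image = io.imread(image_path)
    labels = io.imread(labels_path)
    animal_mask = labels < labels.max()      # everything except the background
    pixels = image[animal_mask]              # (num_pixels, 3) RGB values
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

if __name__ == "__main__":
    feature = animal_histogram("lion.jpg", "lion_labels.png")  # hypothetical files
```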
Cheetah 34, Cougar 100, Coyote 100, Deer 86, Dog 198, Elephant 100, Goat 99, Hippopotamus 40, Horse 200, Leopard 63, Lion 96, Moose 14, Rhinoceros 59, Tiger 100

Table 2: Number of segmentations of each animal.
References

[1] Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.

[2] Tilo Burghardt, Barry Thomas, Peter J. Barham, and Janko Calic. Automated visual recognition of individual African penguins. In Fifth International Penguin Conference, Ushuaia, Tierra del Fuego, Argentina, 2004.

[3] Peter Carbonetto, Nando de Freitas, and Kobus Barnard. A statistical model for general contextual object recognition. In Proceedings of the ECCV 2004, Part I, pages 350–362, 2004.

[4] Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cedric Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision (at ECCV), 2004.

[5] Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the ECCV, 2002.

[6] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 264–271, 2003.
Figure 3: Manual segmentations of the images shown on the left. The greylevels of the labelled images have been stretched to make the regions more visible.

[7] Allan Hanbury. Review of image annotation for the evaluation of computer vision algorithms. Technical Report PRIP-TR-102, PRIP, TU Wien, January 2006.

[8] S. Lazebnik. Semi-Local and Global Models for Texture, Object and Scene Recognition. PhD thesis, University of Illinois at Urbana-Champaign, 2006.

[9] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. A maximum entropy framework for part-based texture and object recognition. In International Conference on Computer Vision, volume 1, pages 832–838, 2005.

[10] Jia Li and James Z. Wang. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1075–1088, 2003.

[11] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
[12] David Martin, Charless Fowlkes, and Jitendra Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549, 2004.

[13] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005.

[14] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1/2):43–72, 2005.

[15] Peter M. Roth, Michael Donoser, and Horst Bischof. Tracking for learning an object representation from unlabeled data. In Proc. Computer Vision Winter Workshop, pages 46–51, 2006.

[16] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages I:1–15, 2006.

[17] Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, December 2000.

[18] Pierre Soille. Morphological Image Analysis. Springer, second edition, 2002.

[19] Ilkay Ulusoy and Christopher M. Bishop. Generative versus discriminative models for object recognition. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

[20] James Z. Wang, Jia Li, and Gio Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963, 2001.