To cite this version: Vicky Kalogeiton. Localizing spatially and temporally objects and actions in videos. Computer Vision and Pattern Recognition [cs.CV]. University of Edinburgh; INRIA Grenoble, 2017. English. ⟨tel-01674504⟩
HAL Id: tel-01674504 https://hal.inria.fr/tel-01674504 Submitted on 3 Jan 2018
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Localizing spatially and temporally objects and actions in videos
Vicky Kalogeiton
Doctor of Philosophy Institute of Perception, Action and Behaviour School of Informatics University of Edinburgh 2017
Abstract

The rise of deep learning has facilitated remarkable progress in video understanding. This thesis addresses three important tasks of video understanding: video object detection, joint object and action detection, and spatio-temporal action localization.

Object class detection is one of the most important challenges in computer vision. Object detectors are usually trained on bounding-boxes from still images. Recently, video has been used as an alternative source of data. Yet, training an object detector on one domain (either still images or videos) and testing on the other results in a significant performance gap compared to training and testing on the same domain. In the first part of this thesis, we examine the reasons behind this performance gap. We define and evaluate several domain shift factors: spatial location accuracy, appearance diversity, image quality, aspect distribution, and object size and camera framing. We examine the impact of these factors by comparing detection performance before and after cancelling them out. The results show that all five factors affect the performance of the detectors and that their combined effect explains the performance gap.

While most existing approaches for detection in videos focus on objects or human actions separately, in the second part of this thesis we aim at detecting non-human-centric actions, i.e., objects performing actions, such as cat eating or dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-action pairs in videos, and show that both the object and the action detection tasks benefit from this joint learning. In experiments on the A2D dataset [Xu et al., 2015], we obtain state-of-the-art results on segmentation of object-action pairs.

In the third part, we are the first to propose an action tubelet detector that leverages the temporal continuity of videos instead of operating at the frame level, as state-of-the-art approaches do. In the same way that modern detectors rely on anchor boxes, our tubelet detector is based on anchor cuboids: it takes as input a sequence of frames and outputs tubelets, i.e., sequences of bounding boxes with associated scores. Our tubelet detector outperforms all state-of-the-art methods on the UCF-Sports [Rodriguez et al., 2008], J-HMDB [Jhuang et al., 2013a], and UCF-101 [Soomro et al., 2012] action localization datasets, especially at high overlap thresholds. The improvement in detection performance is explained by both more accurate scores and more precise localization.

Keywords: action localization, action recognition, object detection, video analysis, computer vision, deep learning, machine learning
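To make the two mechanisms sketched in this abstract concrete, the following minimal PyTorch-style sketch illustrates (a) a multitask head that sums an object loss and an action loss over shared features, in the spirit of the joint object-action objective, and (b) an anchor-cuboid head that scores each tubelet once but regresses four box offsets per frame, so that one anchor cuboid yields a sequence of K boxes. All names and sizes here (JointObjectActionHead, TubeletHead, K, NUM_ANCHORS, NUM_CLASSES) are illustrative assumptions, not the thesis implementation.

    import torch.nn as nn

    class JointObjectActionHead(nn.Module):
        # Hypothetical sketch: two classifiers on shared features, trained with
        # a summed multitask loss, one possible form of a joint object-action
        # objective.
        def __init__(self, feat_dim, num_objects, num_actions):
            super().__init__()
            self.object_head = nn.Linear(feat_dim, num_objects)
            self.action_head = nn.Linear(feat_dim, num_actions)
            self.ce = nn.CrossEntropyLoss()

        def forward(self, feats, obj_labels, act_labels):
            # Both losses backpropagate into the shared features.
            return (self.ce(self.object_head(feats), obj_labels)
                    + self.ce(self.action_head(feats), act_labels))

    K = 6             # assumed number of frames per input sequence
    NUM_ANCHORS = 9   # assumed anchor cuboids per spatial location
    NUM_CLASSES = 25  # assumed number of action classes (incl. background)

    class TubeletHead(nn.Module):
        # Hypothetical sketch: one class score vector per anchor cuboid, but
        # 4 regression offsets per frame, so each cuboid is refined into K boxes.
        def __init__(self, in_channels):
            super().__init__()
            self.cls = nn.Conv2d(in_channels, NUM_ANCHORS * NUM_CLASSES, 3, padding=1)
            self.reg = nn.Conv2d(in_channels, NUM_ANCHORS * 4 * K, 3, padding=1)

        def forward(self, fused):  # fused: (B, C, H, W) features from K frames
            return self.cls(fused), self.reg(fused)

A decoding step (not shown) would then apply the 4*K offsets of each anchor cuboid to obtain a tubelet, i.e., one bounding box per frame sharing a single score.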
Acknowledgements

First and foremost, I want to thank my advisors Vittorio Ferrari and Cordelia Schmid. Vitto's passion and Cordelia's vision were the two true driving forces that kept pushing me a step forward. Vitto's ability to instantaneously disassemble an idea and place it into a larger, much wider context is truly admirable. Vitto was a true teacher; I am grateful not just because he taught me how to approach an idea, how to tackle and present it, but most importantly because he taught me how to think. Cordelia's zeal for perfection is truly admirable. Her desire for deep understanding is summarized in one question that is imprinted in my mind: 'why?'. Whenever we were discussing an idea, a project, or even a result, she was determined to unravel all its aspects and discover its true meaning. I am very grateful for her guidance, her support, and especially her persistence in pursuing excellence. Cordelia's deep intuition and vision were my sources of motivation and inspiration.

I would like to thank my jury members, Taku Komura and Tinne Tuytelaars, for agreeing to review my thesis and for traveling long distances to attend my viva.

I have been very fortunate to collaborate and be friends with Philippe Weinzaepfel, to whom I am more t