Efficient Projections for Salient Motion Detection and Representation
NICORA, ELENA
2022-07-20
Abstract
Motion perception is one of the first abilities developed by our cognitive systems. From the earliest days of life, we are inclined to focus our attention on moving objects in order to gather information about what is happening around us, without having to process all the visual stimuli captured by our eyes. This ability is related to the notion of Visual Saliency, which is based on the idea of finding areas of the scene that differ significantly from their surroundings; it helps both biological and computational systems reduce the amount of incoming information, which would otherwise be extremely expensive to process, even in parallel.

Over the last decades, measuring and understanding motion has gained increasing importance in several Artificial Intelligence applications. In Computer Vision, the general problem of motion understanding is often broken down into several tasks, each specialized in a different motion-oriented goal. In recent years, Deep Learning solutions have established a sort of monopoly, especially in image and video processing, reaching outstanding results on the task of interest but offering poor generalization and interpretability. Furthermore, these methods come with major drawbacks in terms of time and computational complexity, requiring huge amounts of data to learn from. Hence, their use may not be suited to every task, in particular when we have to deal with pipelines composed of various steps. Robotics, assisted living and video surveillance are just some examples of application domains in need of alternative algorithmic solutions promoting portability, real-time computation and the use of limited quantities of data.

The aim of this thesis is to study approaches that couple effectiveness and efficiency, ultimately promoting overall sustainability. In this direction, we investigate the potential of a family of efficient filters, the Gray-Code Kernels, for addressing Visual Saliency estimation with a focus on motion information. Our implementation relies on 3D kernels applied to overlapping blocks of frames and is able to gather meaningful spatio-temporal information with very light computation. Through a single set of extracted features, we manage to tackle three different motion-oriented goals: motion saliency detection, video object segmentation and motion representation. Additionally, the three intermediate results are exploited to address the problem of Human Action Recognition.

To summarise, this thesis focuses on:

• The efficient computation of a set of features highlighting spatio-temporal information
• The design of global representations able to compactly describe motion cues
• The development of a framework that addresses increasingly higher-level tasks of motion understanding

The developed framework has been tested on two well-known Computer Vision tasks: Video Object Segmentation and Action Classification. We compared the motion detection and segmentation abilities of our method with classical approaches of similar complexity. In the experimental analysis we evaluate our method on publicly available datasets and show that it is able to effectively and efficiently identify the portion of the image where motion is occurring, with tolerance to a variety of scene conditions and complexities. We propose a comparison with classical methods for change detection, outperforming Optical Flow and Background Subtraction algorithms. By adding appearance information to our motion-based segmentation we manage to reach, under appropriate conditions, results comparable to more complex state-of-the-art approaches. Lastly, we tested the motion representation ability of our method by employing it in traditional and Deep Learning action recognition scenarios.
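In the literature, Gray-Code Kernels are a family of filter kernels (typically Walsh-Hadamard) ordered so that each projection can be computed from the previous one in a constant number of operations per pixel. The sketch below illustrates only the underlying idea as it relates to this abstract: projecting a stack of frames onto separable 3D Walsh-Hadamard kernels and pooling the temporally varying responses into a rough motion-saliency map. The kernel size, the subset of kernels, the L2 pooling and the names `walsh_1d` and `gck_saliency` are illustrative assumptions, not the thesis' exact configuration; for clarity the sketch uses direct convolution, whereas the efficiency of the actual Gray-Code scheme comes from the incremental computation between successive projections.

```python
# Hedged sketch: 3D Walsh-Hadamard (Gray-Code family) projections of a frame
# stack, pooled into a rough motion-saliency map. Kernel size, kernel subset
# and L2 pooling are illustrative assumptions, not the thesis' method.
import numpy as np
from scipy.ndimage import convolve

def walsh_1d(order: int) -> np.ndarray:
    """Rows of the 2**order Walsh-Hadamard matrix (+1/-1 entries)."""
    h = np.array([[1.0]])
    for _ in range(order):
        h = np.block([[h, h], [h, -h]])
    return h

def gck_saliency(frames: np.ndarray, order: int = 2) -> np.ndarray:
    """frames: (T, H, W) grayscale stack -> (H, W) saliency of the middle frame."""
    w = walsh_1d(order)          # (2**order, 2**order)
    k = 2 ** order               # kernel side length
    energy = np.zeros(frames.shape[1:])
    # Separable 3D kernels: outer products of 1D Walsh rows over (t, y, x).
    # Temporal index it > 0 selects kernels sensitive to change over time,
    # i.e. motion; it = 0 (including the all-ones DC kernel) is skipped.
    for it in range(1, k):
        for iy in range(k):
            for ix in range(k):
                kernel = np.einsum('t,y,x->tyx', w[it], w[iy], w[ix])
                resp = convolve(frames, kernel, mode='nearest')
                # Accumulate the squared response of the central frame.
                energy += resp[frames.shape[0] // 2] ** 2
    return np.sqrt(energy)

# Toy usage: a bright square moving across an otherwise static clip.
clip = np.zeros((8, 64, 64), dtype=float)
for t in range(8):
    clip[t, 20:28, 4 + 4 * t: 12 + 4 * t] = 1.0
saliency = gck_saliency(clip)
print(saliency.shape, saliency.max())  # (64, 64); high values along the path
```

On a static background this energy map is near zero, so thresholding it yields the kind of motion mask that the abstract's segmentation and representation steps would build on.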
| File | Type | Access | Size | Format |
|---|---|---|---|---|
| phdunige_3673528.pdf | Doctoral thesis | Open access | 22.39 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.