
Efficient Reinforcement Learning for Robotics Visual Policies with Sim-to-Real Transfer of Decoupled Architectures

RIZZARDO, CARLO
2024-02-19

Abstract

Learning-based approaches have brought great advances to robotics in recent years. Reinforcement Learning (RL) methods have proven capable of handling highly uncertain and complex tasks such as manipulation, locomotion, and visuomotor control, achieving extraordinary results. Furthermore, these methods have opened the possibility of developing end-to-end systems that perform robot control directly from visual observations, removing the need for custom perception pipelines. The drawback of this methodology, however, is that such approaches often require vast amounts of training data. This poor sample efficiency is a critical obstacle for training real-world robotics tasks, where collecting such massive amounts of data can be exceedingly expensive, if not impossible. This thesis focuses on methods for reducing the overall data requirements of robotic visual policies, by employing sample-efficient methods and performing sim-to-real transfer without introducing impractical computational needs. Over the course of this thesis, I construct a model-free architecture for learning visual tasks, structured around a Soft Actor-Critic agent and a learned model of the Partially Observable Markov Decision Process (POMDP) underlying the task. At first, I concentrate on a simple implementation of this decoupled architecture and show how it is more efficient than traditional fully end-to-end techniques. The learned POMDP model, based on a variational formulation, extracts low-dimensional representations from the input images while biasing those representations toward containing information relevant to the task dynamics. I show that learning the feature extractor via an unsupervised objective improves sample efficiency compared to relying solely on the reinforcement learning reward signal, while learning the task dynamics benefits both sample efficiency and asymptotic performance.
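The decoupled structure described above can be sketched schematically. The following is a minimal illustrative sketch, not the thesis implementation: the dimensions, the linear stand-ins for the encoder, decoder, and dynamics model, and the equal loss weighting are all hypothetical, and the stochastic (variational) elements of the encoder are omitted for brevity. The point it illustrates is the decoupling: the representation is shaped by an unsupervised objective (reconstruction plus latent dynamics prediction), while the RL agent consumes only the low-dimensional latent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from the thesis).
obs_dim, latent_dim, act_dim = 64, 8, 2

# Linear stand-ins for the learned modules: an encoder that compresses
# observations, a decoder for reconstruction, and a dynamics model that
# predicts the next latent from (latent, action).
W_enc = rng.normal(scale=0.1, size=(latent_dim, obs_dim))
W_dec = rng.normal(scale=0.1, size=(obs_dim, latent_dim))
W_dyn = rng.normal(scale=0.1, size=(latent_dim, latent_dim + act_dim))

def encode(obs):
    """Map a high-dimensional observation to a low-dimensional latent."""
    return W_enc @ obs

def unsupervised_loss(obs, act, next_obs):
    """Representation-learning objective, decoupled from the RL reward:
    reconstruction error plus latent dynamics prediction error."""
    z, z_next = encode(obs), encode(next_obs)
    recon = np.mean((W_dec @ z - obs) ** 2)    # reconstruct the input
    z_pred = W_dyn @ np.concatenate([z, act])  # predict the next latent
    dyn = np.mean((z_pred - z_next) ** 2)      # dynamics consistency
    return recon + dyn

obs, next_obs = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
act = rng.normal(size=act_dim)
loss = unsupervised_loss(obs, act, next_obs)

# The Soft Actor-Critic agent would act on the 8-dimensional latent, not
# the raw 64-dimensional observation, and train on its own RL objective.
print(encode(obs).shape, float(loss))
```

In this scheme the reward signal never backpropagates through the encoder; the dynamics term is what biases the latent toward task-relevant information.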
After this analysis, I focus on the domain transfer capabilities of the method, to show its effectiveness for sim-to-real transfer. The decoupled nature of the method, with separate vision and RL modules, allows for independent transfer of policy and feature extractor. I show how domain transfer can be performed by finetuning only the vision module while keeping the policy unchanged. The fundamental challenge in this approach is keeping the latent representation produced by the feature extractor compatible with the policy input. Experimenting on a real and a simulated table-top object-pushing scenario, I show how the dynamics model can act as a constraint that maintains this compatibility between policy and feature extractor. I then progressively explore more complex variations of this scenario and improve the architecture, both to support more complex tasks and to relax the requirements it poses for successful transfer. In its final design, the architecture proves capable of performing sim-to-real transfer of the object-pushing task with remarkable efficiency: in the simplest case, it requires just a couple of hours of real-world experience plus a couple of hours of training in simulation, thus solving in four hours a task that would require multiple days to train directly in the real world. While the method has been evaluated on a simple experimental task, the architecture is not task-specific and could be applied to vastly different problems. This work aims at building efficient architectures and defining effective and flexible sim-to-real transfer techniques. The availability of such techniques is crucial to the widespread adoption of RL-based robotics, which has the potential to scale to extremely complex tasks, advancing the practical capabilities of real-world robotic systems.
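The transfer scheme — freeze the policy and dynamics model, finetune only the encoder on real data so that its latents stay consistent with the frozen dynamics — can be sketched as follows. This is a hedged, hypothetical illustration, not the thesis code: the linear encoder, the numerical-gradient update, the batch of random "real" transitions, and all dimensions are stand-ins chosen only to make the constraint concrete.

```python
import numpy as np

rng = np.random.default_rng(1)
obs_dim, latent_dim, act_dim = 64, 8, 2

# Encoder pretrained in simulation; only this matrix is updated on real data.
W_enc = rng.normal(scale=0.1, size=(latent_dim, obs_dim))
# Dynamics model carried over frozen from simulation (the policy, not shown,
# is likewise kept unchanged).
W_dyn = rng.normal(scale=0.1, size=(latent_dim, latent_dim + act_dim))

def dynamics_consistency(W, obs, act, next_obs):
    """Loss anchoring finetuned latents to the frozen dynamics model,
    keeping them compatible with the unchanged policy."""
    z, z_next = W @ obs, W @ next_obs
    z_pred = W_dyn @ np.concatenate([z, act])
    return np.mean((z_pred - z_next) ** 2)

def finetune_step(W, batch, lr=1e-3, eps=1e-4):
    """One finite-difference gradient step on the encoder only."""
    base = sum(dynamics_consistency(W, *t) for t in batch)
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp = W.copy()
            Wp[i, j] += eps
            grad[i, j] = (sum(dynamics_consistency(Wp, *t) for t in batch)
                          - base) / eps
    return W - lr * grad, base

# A tiny batch of hypothetical real-world transitions (obs, act, next_obs).
batch = [(rng.normal(size=obs_dim), rng.normal(size=act_dim),
          rng.normal(size=obs_dim)) for _ in range(4)]

W_new, loss_before = finetune_step(W_enc, batch)
_, loss_after = finetune_step(W_new, batch)
print(loss_after < loss_before)
```

Because the dynamics model is frozen, minimizing this loss pulls the real-domain latents toward the latent space the policy was trained on, which is the compatibility constraint the abstract refers to.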
Files in this record:
phdunige_4784319.pdf (PhD thesis, open access, 15.37 MB, Adobe PDF)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1160115