# Leveraging Demonstrations with Latent Space Priors

Jonas Gehring (Meta AI, ETH Zürich), Deepak Gopinath (Meta AI), Jungdam Won (Meta AI), Andreas Krause (ETH Zürich), Gabriel Synnaeve (Meta AI), Nicolas Usunier (Meta AI)

**TL;DR:** Combining skill learning from demonstrations with sequence modeling to accelerate learning on transfer tasks.

*(Interactive figure: low-level policy and latent space prior.)*

### Abstract

Demonstrations provide insight into relevant state or action space regions, bearing great potential to boost the efficiency and practicality of reinforcement learning agents. In this work, we propose to leverage demonstration datasets by combining skill learning and sequence modeling. Starting with a learned joint latent space, we separately train a generative model of demonstration sequences and an accompanying low-level policy. The sequence model forms a latent space prior over plausible demonstration behaviors to accelerate learning of high-level policies. We show how to acquire such priors from state-only motion capture demonstrations and explore several methods for integrating them into policy learning on transfer tasks. Our experimental results confirm that latent space priors provide significant gains in learning speed and final performance in a set of challenging sparse-reward environments with a complex, simulated humanoid.

### Approach

Our approach consists of a pre-training phase (left), followed by high-level policy learning on transfer tasks. For pre-training, we embed demonstration trajectories $$\boldsymbol{x} \in X$$ into latent representations $$z$$ with an auto-encoder. We separately learn a prior $$\pi_0$$ that models latent space trajectories, as well as a low-level policy $$\pi_{lo}$$ trained to reenact demonstrations from proprioceptive observations $$s^p$$ and near-term targets $$z$$. On downstream tasks, we train a high-level policy $$\pi_{hi}(z|s)$$ and utilize the latent space prior $$\pi_0$$ to accelerate learning.
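The pre-training data flow above can be sketched as three stand-in components. This is a minimal, hypothetical illustration: `encode`, `prior_step`, and `low_level_action` are placeholder functions standing in for the learned auto-encoder, sequence-model prior $$\pi_0$$, and low-level policy $$\pi_{lo}$$; the paper trains each of them on motion capture data rather than using these fixed rules.

```python
# Hypothetical sketch of the pre-training pipeline; all three components
# are stand-ins for learned models, with toy 2-D observations and latents.
LATENT_DIM = 2

def encode(x):
    """Stand-in VAE encoder: map a demonstration frame x to a latent z.
    (A learned auto-encoder replaces this fixed averaging projection.)"""
    return [sum(x) / len(x)] * LATENT_DIM

def prior_step(z_history):
    """Stand-in latent space prior pi_0: predict the next latent from the
    latent history. (A learned sequence model replaces this persistence rule.)"""
    return list(z_history[-1]) if z_history else [0.0] * LATENT_DIM

def low_level_action(s_p, z_target):
    """Stand-in low-level policy pi_lo(a | s^p, z): step towards z_target
    from the proprioceptive observation s_p."""
    return [zt - sp for sp, zt in zip(s_p, z_target)]

# Embed a short demonstration trajectory, roll the prior one step forward,
# and ask the low-level policy for an action towards that latent target.
trajectory = [[0.1, 0.3], [0.2, 0.4], [0.4, 0.6]]
latents = [encode(x) for x in trajectory]
z_next = prior_step(latents)
action = low_level_action([0.0, 0.0], z_next)
```

The point of the decomposition is that the prior and the low-level policy share one latent space, so sequences produced by the prior are directly executable by the policy.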
We investigate several methods of integration: (1) augmenting the exploration policy with sequences sampled from the prior; (2) regularizing the policy towards distributions predicted by the prior; (3) conditionally generating sequences from high-level actions to provide temporal abstraction.
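As an illustration of integration method (2), regularizing towards the prior typically amounts to adding a weighted KL term between the high-level policy's action distribution and the prior's prediction. The sketch below is a self-contained example for diagonal Gaussians; the weight `alpha` and the concrete distributions are hypothetical, not values from the paper.

```python
import math

def kl_diag_gauss(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        kl += math.log(sq / sp) + (sp**2 + (mp - mq)**2) / (2 * sq**2) - 0.5
    return kl

# Option (2): penalize divergence of pi_hi(.|s) from the prior pi_0's
# predicted latent distribution, e.g. as an extra term in the RL objective:
#   objective = task_return - alpha * KL(pi_hi || pi_0)
alpha = 0.1  # hypothetical regularization weight
kl = kl_diag_gauss([0.2, -0.1], [0.5, 0.5],   # pi_hi: mean, std
                   [0.0, 0.0], [1.0, 1.0])    # pi_0:  mean, std
reg_penalty = alpha * kl
```

Method (1) instead mixes prior samples into exploration rollouts, and method (3) lets one high-level action condition the prior to emit a whole latent sub-sequence, trading decision frequency for temporal abstraction.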

### Low-Level Policy

We show our low-level policy reenacting random clips from the motion capture demonstration dataset. The low-level policy controls the golden robot; the demonstration poses are shown in silver.

### Sampled Motions (kinematic control)

We sample five latent state sequences from the prior, without providing an initial context, and decode them into poses with the VAE decoder. The sampled motions contain sequences from several different demonstration clips; for example, the leftmost clip starts with Subject 77, Trial 10 and transitions into punches from the end of Subject 86, Trial 1. Overall, motions are of good quality, but we sometimes notice unrealistic behavior, e.g., the character sliding over the floor or jumping too far.

### Sampled Motions (physics-based control)

Here, the sampled motions from above are re-enacted with the trained low-level policy by conditioning on the respective latent state sequence.