QueST: Self-Supervised Skill Abstractions for Learning Continuous Control

Georgia Institute of Technology

Abstract

The lack of generalization capability is one of the most important unsolved problems in robot learning, and although several large-scale efforts have set out to tackle it, it remains unsolved. In this paper, we hypothesize that learning temporal action abstractions using latent variable models (LVMs), which learn to map data to a compressed latent space and back, is a promising direction towards low-level skills that can readily be used for new tasks. Although several works have attempted to show this, they have generally been limited by architectures that do not faithfully capture shareable representations. To address this, we present the Quantized Skill Transformer (QueST), which learns a larger and more flexible latent encoding that is better able to model the breadth of low-level skills needed for a variety of tasks. To make use of this extra flexibility, QueST imparts a causal inductive bias from the action sequence data into the latent space, leading to more semantically useful and transferable representations. We compare against state-of-the-art imitation learning and LVM baselines and find that QueST’s architecture leads to strong performance on several multitask and few-shot learning benchmarks.

Video

Proposed Architecture

We factorize QueST into two stages. Stage-I maps a sequence of continuous actions to a sequence of discrete skill tokens. Stage-II learns a skill-based policy in the style of next-token prediction using a multi-modal GPT-like transformer.
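The two-stage factorization above can be illustrated with a minimal sketch. It makes several assumptions that are not from the source: a nearest-neighbor codebook lookup over fixed-length action chunks stands in for the learned Stage-I encoder, and a user-supplied `next_token_logits` callback stands in for the Stage-II transformer. All function names, the codebook, and the toy actions are hypothetical.

```python
def chunk_actions(actions, chunk_len):
    """Split a (T x action_dim) action sequence into flat, non-overlapping chunks."""
    chunks = []
    for start in range(0, len(actions) - chunk_len + 1, chunk_len):
        window = actions[start:start + chunk_len]
        chunks.append([x for step in window for x in step])
    return chunks


def sq_dist(a, b):
    """Squared Euclidean distance between two flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(actions, codebook, chunk_len):
    """Stage-I stand-in: continuous actions -> discrete skill tokens.

    Assumption: each chunk is assigned the index of its nearest codebook entry.
    """
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(chunk, codebook[k]))
        for chunk in chunk_actions(actions, chunk_len)
    ]


def decode(tokens, codebook, action_dim):
    """Inverse map: skill tokens -> reconstructed continuous actions."""
    actions = []
    for tok in tokens:
        flat = codebook[tok]
        actions.extend(flat[i:i + action_dim] for i in range(0, len(flat), action_dim))
    return actions


def generate_tokens(next_token_logits, context, num_tokens):
    """Stage-II stand-in: greedy next-token prediction over skill tokens.

    `next_token_logits(context, tokens_so_far)` is a hypothetical callback
    returning one score per codebook entry.
    """
    tokens = []
    for _ in range(num_tokens):
        logits = next_token_logits(context, tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


# Toy usage: 2 codebook entries, chunks of 2 steps, 2-dimensional actions.
codebook = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
actions = [[0.1, -0.1], [0.0, 0.2], [0.9, 1.1], [1.0, 0.8]]
tokens = encode(actions, codebook, chunk_len=2)
reconstructed = decode(tokens, codebook, action_dim=2)
```

In this sketch the policy only ever predicts token indices; continuous actions are recovered by passing the predicted tokens through `decode`.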

Model architecture.

Simulated Benchmarks


Multitask Learning

[Figure: LIBERO multitask results]

Relative improvement of at least 10.3% over VQ-BeT and Diffusion Policy on the 90-task LIBERO-90 benchmark.
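For readers unfamiliar with the term, a relative improvement is usually measured as a fraction of the baseline's score rather than as a difference in percentage points. The sketch below assumes that standard definition; the source does not spell out its formula, and the numbers used are illustrative only, not results from the paper.

```python
def relative_improvement(new_score, baseline_score):
    """Fractional gain of new_score over baseline_score (0.10 means 10%)."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (new_score - baseline_score) / baseline_score


# Illustrative success rates only; these are NOT numbers from the paper.
baseline, new = 0.50, 0.60
absolute_gain = new - baseline                       # gain in success-rate points
relative_gain = relative_improvement(new, baseline)  # gain as a fraction of the baseline
```

With these made-up values the absolute gain is about 10 points while the relative gain is about 20%, which is why the two should not be confused when reading the figures.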

Because MetaWorld is a simpler benchmark, all methods perform similarly across 45 tasks.

[Figure: MetaWorld multitask results]

5-Shot Adaptation

The "frozen" label denotes that the decoder is frozen during finetuning.

[Figure: LIBERO few-shot results]

Relative improvement of 24% over the next-best baseline on 10 unseen tasks from the LIBERO-LONG benchmark.

All methods perform comparably, with QueST showing a slight improvement over the others across 5 held-out tasks in MetaWorld.

[Figure: MetaWorld few-shot results]
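The frozen-decoder variant mentioned in this section can be pictured as a finetuning update that skips the decoder's parameters. The following is a minimal sketch under stated assumptions: parameters are plain lists grouped by name, the update is vanilla SGD, and the group names `decoder` and `policy` are hypothetical. It is not the training code used for the reported results.

```python
def sgd_step(params, grads, lr, frozen=()):
    """Return updated parameters, leaving any group named in `frozen` untouched."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            # Frozen group: copy the values through without applying gradients.
            updated[name] = list(values)
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Toy usage: only the policy group changes when the decoder is frozen.
params = {"decoder": [1.0, 2.0], "policy": [1.0, 2.0]}
grads = {"decoder": [1.0, 1.0], "policy": [1.0, 1.0]}
finetuned = sgd_step(params, grads, lr=0.5, frozen={"decoder"})
```

Freezing the decoder in this way keeps the mapping from skill tokens to actions fixed, so the few demonstrations are used only to adapt the remaining parameters.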

Skill Space t-SNE



LIBERO-90 Rollouts




LIBERO-LONG Rollouts




Few-shot LIBERO-LONG Rollouts





BibTeX

@misc{mete2024questselfsupervisedskillabstractions,
      title={QueST: Self-Supervised Skill Abstractions for Learning Continuous Control}, 
      author={Atharva Mete and Haotian Xue and Albert Wilcox and Yongxin Chen and Animesh Garg},
      year={2024},
      eprint={2407.15840},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2407.15840}, 
}