Zero-Shot Reinforcement Learning from Low Quality Data

NeurIPS 2024

Scott Jeen\(^{\alpha}\), Tom Bewley\(^{\beta}\) & Jonathan M. Cullen\(^{\alpha}\)

\(^{\alpha}\) University of Cambridge \(^{\beta}\) University of Bristol

Motivation

  • Training policies to (zero-shot) generalise to unseen tasks in an environment is hard! [1]
  • Behaviour Foundation Models (BFMs) based on forward-backward representations (FB) [2] and universal successor features (USF) [3] provide principled mechanisms for zero-shot task generalisation (sketched below)
  • However, BFMs assume access to idealised (large & diverse) pre-training datasets that we cannot expect for real problems
  • Can we pre-train BFMs on realistic (small & narrow) datasets?
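
As a refresher, here is a minimal sketch of the FB mechanism, following [2] (we assume the standard FB objects: a data distribution \(\rho\), representations \(F\) and \(B\), and the successor measure \(M^{\pi_z}\)):

\[
M^{\pi_z}(s, a, X) \approx \int_X F(s, a, z)^\top B(s') \, \mathrm{d}\rho(s'), \qquad
z_r = \mathbb{E}_{s \sim \rho}\big[r(s) \, B(s)\big],
\]
\[
Q_r(s, a) \approx F(s, a, z_r)^\top z_r, \qquad
\pi_{z_r}(s) = \arg\max_a F(s, a, z_r)^\top z_r,
\]

so a reward function \(r\) revealed only at test time is mapped to a task vector \(z_r\) and solved zero-shot, with no further training.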

Out-of-distribution Value Overestimation in BFMs
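
When the pre-training dataset \(\mathcal{D}\) is small and narrow, the problem enters through the bootstrapped term of the FB training loss (a sketch following [2], up to constants; \(\bar{F}, \bar{B}\) denote target networks):

\[
\mathcal{L}_{\mathrm{FB}} =
\mathbb{E}_{(s,a,s'),\, s_+ \sim \mathcal{D}}
\Big[\big(F(s,a,z)^\top B(s_+) - \gamma \, \bar{F}\big(s', \pi_z(s'), z\big)^\top \bar{B}(s_+)\big)^2\Big]
- 2 \, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\big[F(s,a,z)^\top B(s')\big].
\]

The greedy action \(\pi_z(s') \approx \arg\max_{a'} F(s',a',z)^\top z\) can lie far outside \(\mathcal{D}\); its value is never corrected by data, so overestimation is propagated through the bootstrap, exactly as in standard offline RL.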

Conservative BFMs
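
A minimal sketch of the conservative idea in code, assuming a CQL-style [7] penalty applied to the FB critic \(Q(s, a; z) = F(s, a, z)^\top z\) (all function and variable names are illustrative, not the paper's implementation):

```python
import torch

def conservative_fb_penalty(F, states, data_actions, z, num_samples=10, alpha=1.0):
    """CQL-style conservative penalty for an FB critic Q(s, a; z) = F(s, a, z)^T z.

    Pushes down the values of sampled (likely out-of-distribution) actions and
    pushes up the values of dataset actions, mirroring CQL [7].
    F: callable (states, actions, z) -> embeddings of shape (batch, d).
    """
    batch_size, action_dim = data_actions.shape

    def q_values(actions):
        # Critic evaluation: Q(s, a; z) = F(s, a, z)^T z, one value per sample.
        return (F(states, actions, z) * z).sum(-1)  # (batch,)

    # Q-values of the actions actually present in the dataset.
    q_data = q_values(data_actions)

    # Q-values of uniformly sampled actions: a cheap stand-in for OOD actions
    # (assumes actions live in [-1, 1]^action_dim, as in DMC-based ExORL tasks).
    rand_actions = [
        torch.empty_like(data_actions).uniform_(-1.0, 1.0)
        for _ in range(num_samples)
    ]
    q_rand = torch.stack([q_values(a) for a in rand_actions])  # (num_samples, batch)

    # logsumexp gives a soft maximum over the sampled actions, as in CQL.
    penalty = torch.logsumexp(q_rand, dim=0).mean() - q_data.mean()
    return alpha * penalty  # added to the usual FB training loss
```

In practice this penalty would be averaged over the task vectors \(z\) sampled during FB training; the sketch shows a single \(z\) for clarity.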

ExORL Results

Baselines

  • Zero-shot RL: FB, SF-LAP [5]
  • Goal-conditioned RL: GC-IQL [6]
  • Offline RL: CQL [7]

Datasets

D4RL Results

Performance on Idealised Datasets is Unaffected

Conclusions

  • Like standard offline RL methods, BFMs suffer from out-of-distribution value overestimation when pre-trained on small, narrow datasets
  • To resolve this, we introduce Conservative BFMs
  • Conservative BFMs considerably outperform standard BFMs on low-quality datasets
  • Conservative BFMs do not compromise performance on idealised datasets

Twitter/X: @enjeeneer

Website: https://enjeeneer.io