Zero-Shot Reinforcement Learning from Low Quality Data

NeurIPS 2024

Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), & Jonathan M. Cullen\(^{1}\)

\(^{1}\) University of Cambridge

\(^{2}\) University of Bristol

[Paper] [Code] [Poster] [Slides]

Summary

Zero-shot reinforcement learning (RL) methods learn general policies that can, in principle, solve any unseen task in an environment. Recently, methods leveraging successor features and successor measures have emerged as viable zero-shot RL candidates, returning near-optimal policies for many unseen tasks. However, to enable this, they have assumed access to unrealistically large and heterogeneous datasets of transitions for pre-training. Most real datasets, like historical logs created by existing control systems, are smaller and less diverse than these current methods expect. As a result, this paper asks:

Can we still perform zero-shot RL with small, homogeneous datasets?

Intuition

When the dataset is inexhaustive, existing methods like FB representations overestimate the value of actions not in the dataset (Figure 2). Figure 1 shows this overestimation as dataset size and quality are varied. The smaller and less diverse the dataset, the more \(Q\) values tend to be overestimated.

Figure 1: FB value overestimation with respect to dataset size \(n\) and quality. \(Q\) values predicted during training increase as both the size and “quality” of the dataset decrease. This contradicts the low return of all resultant policies (note: a return of 1000 is the maximum achievable for this task).

We fix this by suppressing the predicted values (or measures) for state-actions not in the dataset.

Figure 2: Conservative zero-shot RL methods. VC-FB (right) suppresses the predicted values for OOD state-action pairs.
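One way to picture "suppressing the predicted values" is a CQL-style penalty, sketched below. This is an illustration under stated assumptions, not the paper's exact loss: it assumes the FB value is \(Q(s, a, z) = F(s, a, z)^\top z\), and the helper names (`dot`, `logsumexp`, `value_penalty`) and the toy embeddings are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def value_penalty(f_sampled, f_dataset, z):
    """CQL-style penalty for a single state (illustrative assumption).

    f_sampled: forward embeddings F(s, a_i, z) for sampled candidate actions
    f_dataset: forward embedding F(s, a, z) for the action seen in the dataset
    z:         task vector

    The soft maximum of Q over sampled actions is pushed down, while the
    Q value of the dataset action is pushed up.
    """
    q_sampled = [dot(f, z) for f in f_sampled]
    q_dataset = dot(f_dataset, z)
    return logsumexp(q_sampled) - q_dataset


# Toy 2-D embeddings (made up for illustration).
z = [1.0, 0.0]
f_dataset = [1.0, 0.0]

# Sampled actions whose values are close to the dataset action: small penalty.
mild = value_penalty([[1.0, 0.0], [0.0, 1.0]], f_dataset, z)

# One sampled (out-of-distribution) action with an inflated value: large penalty.
inflated = value_penalty([[1.0, 0.0], [5.0, 1.0]], f_dataset, z)
```

In a training loop, a term like this would typically be added to the standard loss with a scaling weight, so that minimising it lowers values for actions the dataset does not support. The weight and the action-sampling scheme are left unspecified here.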

Results

We demonstrate that our methods improve performance relative to standard zero-shot RL / GCRL baselines on low quality datasets from ExORL (Figure 3) and D4RL (Figure 4), and do not hinder performance on large and diverse datasets (Figure 5).

Figure 3: Aggregate zero-shot performance on ExORL. (Left) IQM of task scores across datasets and domains, normalised against the performance of CQL, our baseline. (Right) Performance profiles showing the distribution of scores across all tasks and domains. Both conservative FB variants stochastically dominate vanilla FB.
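The interquartile mean (IQM) used in these aggregates is the mean of the middle 50% of scores. A minimal sketch follows, assuming a simple trim of the bottom and top quarters after sorting, and assuming normalisation is a ratio of the method's IQM to the baseline's IQM. The paper may normalise per task instead, and the function names and scores are hypothetical.

```python
def iqm(scores):
    """Interquartile mean: average of the scores left after discarding
    the lowest 25% and the highest 25% (counts rounded down)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    xs = sorted(scores)
    cut = int(0.25 * len(xs))
    middle = xs[cut:len(xs) - cut]
    return sum(middle) / len(middle)


def normalised_iqm(method_scores, baseline_scores):
    """IQM of a method divided by IQM of a baseline (assumed scheme)."""
    return iqm(method_scores) / iqm(baseline_scores)


# Made-up scores, for illustration only.
baseline = [1, 2, 3, 4, 5, 6, 7, 8]
method = [2, 4, 6, 8, 10, 12, 14, 16]

baseline_iqm = iqm(baseline)             # middle half is [3, 4, 5, 6]
ratio = normalised_iqm(method, baseline)  # 9.0 / 4.5
```

Trimming both tails makes the aggregate less sensitive to a few outlier tasks than a plain mean, while using more of the data than a median.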

Figure 4: Aggregate zero-shot RL performance on D4RL. Aggregate IQM scores across all domains and datasets, normalised against the performance of CQL.

Figure 5: Performance by RND dataset size. The performance gap between conservative FB variants and vanilla FB increases as dataset size decreases.

Read the full paper for more details!

@article{jeen2024,
  url = {https://arxiv.org/abs/2309.15178},
  author = {Jeen, Scott and Bewley, Tom and Cullen, Jonathan M.},  
  title = {Zero-Shot Reinforcement Learning from Low Quality Data},
  journal = {Advances in Neural Information Processing Systems 37},
  year = {2024},
}