Scott Jeen\(^{1}\), Tom Bewley\(^{2}\), & Jonathan M. Cullen\(^{1}\)

\(^{1}\) University of Cambridge

\(^{2}\) University of Bristol

[Paper] [Code] [Poster]


Imagine you’ve collected a dataset from a system you’d like to control more efficiently. Examples include household robots, chemical manufacturing processes, autonomous vehicles, or steel-making furnaces. An ideal solution would be to train an autonomous agent on your dataset, then have it use what it learns to solve any task inside the system. For our household robot, such tasks may include sweeping the floor, making a cup of tea, or cleaning the windows. Formally, we call this problem setting zero-shot reinforcement learning (RL), and taking steps toward realising it in the real world is the focus of this work.

If our dataset is pseudo-optimal, that is to say, it tells our domestic robot the full extent of the floorspace, where the tea bags are stored, and how many windows exist, then the existing state-of-the-art method, Forward Backward (FB) representations, performs excellently: on average it will solve any task inside the system with 85% accuracy. However, if the data we’ve collected from the system is suboptimal (it doesn’t provide all the information required to solve all tasks), then FB representations fail. They fail because they overestimate the value of data not present in the dataset, or in RL parlance, they overestimate out-of-distribution state-action values (Figure 1, middle).
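As background, FB methods factorise values into a forward embedding \(F(s, a, z)\) and a backward embedding \(B(s)\): the value of any task can be read off as the dot product \(Q_z(s, a) = F(s, a, z)^\top z\), where the task vector \(z\) is the reward-weighted average of backward embeddings over the dataset. The sketch below illustrates this zero-shot evaluation with toy stand-in networks; the dimensions, random linear maps, and reward function are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, d = 4, 2, 8  # d = embedding dimension (assumption)

# Stand-in linear "networks" for the learned forward and backward maps.
W_f = rng.normal(size=(state_dim + action_dim + d, d)) / np.sqrt(d)
W_b = rng.normal(size=(state_dim, d)) / np.sqrt(d)

def forward(s, a, z):
    """Forward embedding F(s, a, z)."""
    return np.tanh(np.concatenate([s, a, z]) @ W_f)

def backward(s):
    """Backward embedding B(s), applied row-wise to a batch of states."""
    return np.tanh(s @ W_b)

# Zero-shot task inference: z is the reward-weighted mean of B over the dataset.
states = rng.normal(size=(256, state_dim))
rewards = (states[:, 0] > 0).astype(float)  # toy reward function (assumption)
z = (rewards[:, None] * backward(states)).mean(axis=0)

def q_value(s, a, z):
    """Value of action a in state s for the task encoded by z."""
    return forward(s, a, z) @ z

q = q_value(rng.normal(size=state_dim), rng.normal(size=action_dim), z)
```

At test time, a new task only requires computing a new \(z\) from reward-labelled states; no further training is needed, which is what makes the approach zero-shot.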

In this work, we resolve this by artificially suppressing these out-of-distribution values, leveraging ideas from conservatism in the offline RL literature. The family of algorithms we propose is called Conservative Forward Backward representations (Figure 1, right). In experiments across a variety of systems and tasks, we show these methods consistently outperform FB representations when the datasets are suboptimal.
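The conservative term we borrow resembles CQL's regulariser: push down a soft maximum of the critic's values at sampled (potentially out-of-distribution) actions, while pushing up its values at dataset actions. A minimal numpy sketch of such a penalty follows; the function name and toy numbers are ours, not the paper's code.

```python
import numpy as np

def conservative_penalty(q_sampled, q_dataset):
    """CQL-style regulariser: log-sum-exp over Q at sampled actions minus
    mean Q at dataset actions. It is large when sampled (possibly
    out-of-distribution) actions look better than dataset actions, so
    minimising it suppresses OOD value estimates."""
    m = q_sampled.max()
    lse = m + np.log(np.exp(q_sampled - m).sum())  # numerically stable log-sum-exp
    return lse - q_dataset.mean()

# Toy check: inflating one OOD action's value increases the penalty.
q_data = np.array([1.0, 0.9, 1.1])
low = conservative_penalty(np.array([0.5, 0.4]), q_data)
high = conservative_penalty(np.array([0.5, 5.0]), q_data)
```

Adding this penalty to the critic loss trades a small amount of optimism for robustness: overestimated OOD values are pulled back toward (or below) the values of actions actually observed.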

Figure 1: FB’s failure mode on suboptimal datasets and VC-FB’s resolution. (Left) Ground truth value functions for two tasks in an environment for a given marginal state. (Middle) FB representations overestimate the value of actions not in the dataset for all tasks. (Right) Value-Conservative Forward Backward (VC-FB) representations suppress the value of actions not in the dataset for all tasks. Black dots represent state-action samples present in the dataset.


Figure 2: FB value overestimation with respect to dataset size (\(n\)) and quality. Log \(Q\) values and IQM of rollout performance on all Point-mass Maze tasks for datasets (a) RND and (b) RANDOM. \(Q\) values predicted during training increase as both the size and “quality” of the dataset decrease. This contradicts the low return of all resultant policies. Informally, we say the RND dataset is “high” quality, and the RANDOM dataset is “low” quality.

When the dataset is inexhaustive, FB representations overestimate the value of actions not in the dataset. The above shows this overestimation as dataset size and quality are varied. The smaller and less diverse the dataset, the more \(Q\) values tend to be overestimated.

We fix this by suppressing the predicted values of actions not in the dataset, and show below how this resolves the overestimation. Below we show a modified version of Point-mass Maze from the ExORL benchmark. Episodes begin with a point-mass initialised in the upper left of the maze (⊚), and the agent is tasked with selecting \(x\) and \(y\) tilt directions such that the mass is moved towards one of two goal locations (⊛ and ⊛). The action space is two-dimensional and bounded in \([−1, 1]\). We take the RND dataset and remove all “left” actions such that \(a_x \in [0, 1]\) and \(a_y \in [−1, 1]\), creating a dataset that has the necessary information for solving the tasks, but is inexhaustive (below, (a)). We train FB and VC-FB on this dataset and plot the highest-reward trajectories (below, (b) and (c)). FB overestimates the value of OOD actions and cannot complete either task. Conversely, VC-FB synthesises the requisite information from the dataset and completes both tasks.
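The dataset surgery described above is simple to reproduce: keep only transitions whose \(x\)-action is non-negative. A toy sketch with random stand-in transitions (array names and sizes are assumptions, not the ExORL data format):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for RND transitions: 2-D actions bounded in [-1, 1].
actions = rng.uniform(-1.0, 1.0, size=(10_000, 2))
states = rng.uniform(0.0, 1.0, size=(10_000, 2))

# Drop every "left" action, leaving a_x in [0, 1] and a_y in [-1, 1].
keep = actions[:, 0] >= 0.0
states, actions = states[keep], actions[keep]
```

Roughly half the transitions survive, so the filtered dataset still covers the maze but can no longer demonstrate leftward movement directly.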

Figure 3: Ignoring out-of-distribution actions. The agents are tasked with learning separate policies for reaching ⊛ and ⊛. (a) RND dataset with all “left” actions removed; quivers represent the mean action direction in each state bin. (b) Best FB rollout after 1 million learning steps. (c) Best VC-FB performance after 1 million learning steps. FB overestimates the value of OOD actions and cannot complete either task; VC-FB synthesises the requisite information from the dataset and completes both tasks.

Aggregate Performance on Suboptimal Datasets

Figure 4: Aggregate zero-shot performance. (Left) IQM of task scores across datasets and domains, normalised against the performance of CQL, our baseline. (Right) Performance profiles showing the distribution of scores across all tasks and domains. Both conservative FB variants stochastically dominate vanilla FB. The black dashed line represents the IQM of CQL performance across all datasets, domains, tasks and seeds.

Both MC-FB and VC-FB stochastically dominate FB, achieving 150% and 137% of its performance respectively. MC-FB and VC-FB also outperform our single-task baseline in expectation, reaching 111% and 120% of CQL performance respectively, despite not having access to task-specific reward labels and needing to fit policies for all tasks. This is a surprising result, and to the best of our knowledge, the first time a multi-task offline agent has been shown to outperform a single-task analogue.
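The IQM used throughout is the interquartile mean: discard the bottom and top 25% of scores and average the rest, which is more robust to outlier seeds than a plain mean. A minimal implementation (ours, not the paper's evaluation code, which in practice would use a library such as rliable):

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of scores."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

# A single outlier seed barely moves the IQM, unlike the plain mean.
print(iqm([0, 1, 2, 3, 100]))   # middle values [1, 2, 3]
```

Confidence intervals like those in Figure 4 are then typically obtained by bootstrapping this statistic over tasks and seeds.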

Aggregate Performance on Full Datasets

Table 1: Performance on full RND dataset. Aggregated IQM scores for all tasks with 95% confidence intervals, averaged across three seeds. Both VC-FB and MC-FB maintain the performance of FB.

Domain Task FB VC-FB MC-FB
Walker all tasks 639 (616–661) 659 (647–670) 651 (632–671)
Quadruped all tasks 656 (638–674) 579 (522–635) 635 (628–642)
Maze all tasks 219 (86–353) 287 (117–457) 261 (159–363)
Jaco all tasks 39 (29–50) 33 (24–42) 34 (18–51)
All all tasks 361 381 381

Both conservative FB variants maintain (and slightly exceed) the performance of vanilla FB. These results suggest that performance on large, diverse datasets does not suffer as a consequence of the design decisions made to improve performance on small, low-diversity datasets. We can therefore adopt conservatism into FB without worrying about a performance trade-off.


[View on arXiv]

If you find this work informative, please consider citing the paper:

@misc{jeen2023conservative,
  url = {},
  author = {Jeen, Scott and Bewley, Tom and Cullen, Jonathan M.},
  title = {Conservative World Models},
  publisher = {arXiv},
  year = {2023},
}