Zero-Shot Reinforcement Learning Under Partial Observability
RL Conference 2025
Scott Jeen\(^{1}\), Tom Bewley, & Jonathan M. Cullen\(^{1}\)
\(^{1}\) University of Cambridge
Summary
Behaviour foundation models (BFMs) can, under certain assumptions, learn general policies that solve any unseen task in an environment. BFMs have so far assumed access to Markov states that provide all the information required to solve a task, but in most real problems this assumption is unrealistic and the state is only partially observed. The standard fix for partial observability is to use memory models (such as Transformers and RNNs) to estimate the underlying state from histories of observations and actions. This work augments BFMs with memory models to improve their performance under partial observability.
Intuition
When BFMs are fed noisy or partial observations instead of clean states, they break in predictable ways (Figure 2). The forward model (F), which predicts how the world will evolve, misrepresents the true dynamics; this throws off the Q-values and leads to bad action choices. Meanwhile, the backward model (B), which tries to infer what task the agent is solving based on where it ends up, also fails because it can't reliably associate partial observations with the correct reward structure. These two breakdowns, state misidentification and task misidentification, undermine the policy's (π) ability to plan effectively. Our fix? Plug in memory models to help the BFM reconstruct the underlying state and task from past observations and actions. This lets us keep the same architecture, but extend its powers to messy, partially observed settings.
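To make this concrete, here is a minimal PyTorch sketch (not the paper's code; the class and function names, network sizes and the simple normalisation are illustrative assumptions) of where a memory model slots in: a GRU summarises the observation-action history into a state estimate that stands in for the Markov state as input to F, B and the policy, while the task embedding z is still inferred as a reward-weighted average of B's outputs.

```python
# Minimal sketch of a memory-augmented forward-backward (FB) agent.
# All names and sizes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Summarises a history of (observation, action) pairs into a state estimate."""
    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, T, obs_dim), act_seq: (batch, T, act_dim)
        x = torch.cat([obs_seq, act_seq], dim=-1)
        _, h = self.gru(x)          # final hidden state summarises the history
        return h.squeeze(0)         # (batch, hidden_dim)

class ForwardModel(nn.Module):
    """F(s, a, z): forward embedding of future outcomes, conditioned on task embedding z."""
    def __init__(self, in_dim, act_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + act_dim + z_dim, 256),
                                 nn.ReLU(), nn.Linear(256, z_dim))

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

class BackwardModel(nn.Module):
    """B(s): embeds (estimated) states so that z ~ E[r(s) * B(s)] identifies the task."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))

    def forward(self, s):
        return self.net(s)

def infer_task(encoder, backward, obs_seqs, act_seqs, rewards):
    """Zero-shot task inference from reward-labelled histories."""
    with torch.no_grad():
        s_hat = encoder(obs_seqs, act_seqs)                      # state estimates from histories
        z = (backward(s_hat) * rewards.unsqueeze(-1)).mean(0)    # reward-weighted average of B
    return z / z.norm()
```

At test time the agent then acts greedily with respect to Q(history, a, z) = F(encode(history), a, z) · z, as in the fully observed FB setting but with the history summary standing in for the state.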

Figure 2: The failure modes of BFMs under partial observability. FB’s average (IQM) all-task return on Walker when observations are passed to its respective components. Observations are created by adding Gaussian noise to the underlying states. (Left) Observations are passed as input to B causing FB to misidentify the task. (Middle) Observations are passed as input to F and π causing FB to misidentify the state. (Right) Observations are passed as input to F, π and B causing FB to misidentify both the task and state.
Results
We demonstrate that our methods improve performance relative to memory-free BFMs on ExORL environments adapted to exhibit partially observed states (Figure 3) and changes in dynamics at test time (Figure 4).
Figure 3: Aggregate zero-shot task performance on ExORL with partially observed states. IQM of task scores across all tasks on noisy and flickering variants of Walker, Cheetah and Quadruped, normalised against the performance of FB in the fully observed environment. 5 random seeds.
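For reference, the two corruption schemes named above can be sketched as simple per-step transformations of the underlying state (the noise scale and blanking probability below are illustrative assumptions, not the values used in our experiments):

```python
# Illustrative sketch of the noisy and flickering observation variants.
import numpy as np

rng = np.random.default_rng(0)

def noisy_obs(state, sigma=0.2):
    """Observation = underlying Markov state plus additive Gaussian noise."""
    return state + rng.normal(0.0, sigma, size=state.shape)

def flickering_obs(state, p_blank=0.5):
    """Observation = the underlying state, but blanked (zeroed) with probability p_blank."""
    return np.zeros_like(state) if rng.random() < p_blank else state
```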

Figure 4: Aggregate zero-shot task performance on ExORL with changed dynamics at test time. IQM of task scores across all tasks when trained on dynamics where mass and damping coefficients are scaled to {0.5×, 1.5×} their usual values and evaluated on {1.0×, 2.0×} their usual values, normalised against the performance of FB in the fully observed environment. To solve the test dynamics with 1.0× scaling the agent must interpolate within the training set; to solve the test dynamics with 2.0× scaling the agent must extrapolate from the training set.
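As a minimal sketch of how such dynamics shifts can be realised, assuming the standard dm_control suite tasks underlying ExORL (this is not the evaluation code used for the results above), the physics model's mass and damping arrays can be rescaled before rolling out:

```python
# Sketch only: rescale mass and damping coefficients in a dm_control suite task.
from dm_control import suite

def make_scaled_env(domain="walker", task="walk", mass_scale=2.0, damping_scale=2.0):
    env = suite.load(domain_name=domain, task_name=task)
    env.physics.model.body_mass[:] *= mass_scale       # scale every body's mass
    env.physics.model.dof_damping[:] *= damping_scale  # scale every joint's damping coefficient
    return env
```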
Interestingly, we find that GRUs are the most performant memory model for BFMs, outperforming Transformers and S4d. In particular, the GRU is the only memory model that can be trained stably when applied to both the forward and backward models.
Figure 5: Aggregate zero-shot task performance of FB-M with different memory models. IQM of task scores across all tasks on Walker flickering. (Left) Observations are passed only to a memory-based backward model; the forward model and policy are memory-free. (Middle) Observations are passed only to the forward model and policy; the backward model is memory-free. (Right) Observations are passed to all models.
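One practical convenience of recurrent models in this setting (a general property of RNNs, not our explanation for the GRU's training stability) is that at deployment the history is folded into a fixed-size hidden state updated once per step, rather than re-encoded in full as a Transformer would require. A minimal sketch with an assumed interface:

```python
# Sketch (assumed interface) of per-step recurrent state estimation at deployment.
import torch
import torch.nn as nn

class RecurrentStateEstimator(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=256):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim + act_dim, hidden_dim)

    def step(self, obs, prev_action, h):
        """One environment step: fold the newest (obs, action) into the running summary h."""
        return self.cell(torch.cat([obs, prev_action], dim=-1), h)

# Illustrative usage (dimensions are placeholders):
# estimator = RecurrentStateEstimator(obs_dim=24, act_dim=6)
# h = torch.zeros(1, 256)
# h = estimator.step(obs, prev_action, h)   # called once per environment step
```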
Check out the full paper for more details! If you find this work informative, please consider citing the paper:
@article{jeen2025,
  title   = {Zero-Shot Reinforcement Learning Under Partial Observability},
  author  = {Jeen, Scott and Bewley, Tom and Cullen, Jonathan M.},
  journal = {2nd Reinforcement Learning Conference},
  year    = {2025},
  url     = {https://arxiv.org/abs/2309.15178},
}