PhD Viva
University of Cambridge
The Case for Zero-Shot RL
Chapter 3: From Low Quality Data
Chapter 4: Under Changed Dynamics
- Engineering: energy generation control (fusion, fission, wind)
- Education: teacher-student interaction
- Mathematics: theorem-proving
- Policy: climate negotiations
- Science: forming hypotheses -> making predictions -> testing them
- System dynamics are rarely known (so standard control-theoretic approaches can't be applied)
- Data demonstrating the optimal policy is rarely available (so we can't imitate the optimal policy with supervised learning)
- It is much easier to evaluate a solution than to generate one (i.e. there exists a generator-verifier gap), so an agent can learn from evaluative feedback; the objective this motivates is formalised below
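Together these conditions motivate reinforcement learning [1]: maximise expected discounted return from evaluative feedback,

\[
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right], \qquad \gamma \in [0, 1).
\]

Zero-shot RL strengthens this: after reward-free pre-training, the agent must return a good policy for any reward function revealed only at test time, with no further learning. With forward-backward representations [2, 6], for example, the test-time reward is encoded and the policy acts greedily on that encoding:

\[
z_r = \mathbb{E}_{s \sim \rho}\big[ r(s)\, B(s) \big], \qquad \pi_{z_r}(s) = \arg\max_{a} \; F(s, a, z_r)^{\top} z_r,
\]

where \(\rho\) is the data distribution and \(F, B\) are the learned forward and backward embeddings.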
- Finite data (the dataset cannot cover every state-action pair)
- Function approximation (value estimates generalise, and therefore err, at the uncovered pairs; see the sketch below)
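When bootstrapped targets query these erroneous values, Q-estimates inflate, which Kumar et al. [4] identify as the core failure mode of learning from fixed datasets. Conservative Q-learning [5] counters this by penalising Q-values at actions the dataset does not support. Below is a minimal tabular sketch of that penalty, assuming discrete actions; the function name and hyperparameter values are illustrative, not the paper's.

```python
import numpy as np

def conservative_q_update(Q, s, a, r, s_next, alpha=1.0, gamma=0.99, lr=0.1):
    """One tabular Q-update with a CQL-style penalty (Kumar et al. [5]).

    On top of the usual TD(0) step, the regulariser
    logsumexp_a' Q(s, a') - Q(s, a) pushes down Q-values at actions a
    greedy policy would exploit and pushes the in-dataset action back up.
    """
    # Standard TD(0) step towards the bootstrapped target.
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += lr * td_error

    # Gradient of the regulariser: softmax over Q(s, .) minus an
    # indicator on the dataset action.
    weights = np.exp(Q[s] - Q[s].max())
    weights /= weights.sum()          # softmax over actions at state s
    Q[s] -= lr * alpha * weights      # push down would-be-exploited values
    Q[s, a] += lr * alpha             # push the observed action back up
    return Q
```

With alpha = 0 this reduces to ordinary Q-learning; larger alpha trades return for robustness to off-support actions.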
Authors | Building | Algorithm | Efficiency Gain | Training Data
---|---|---|---|---
Wei et al. (2017) [9] | 5-zone Building | DQN | ~35% | ~8 years |
Zhang et al. (2019) [10] | Office | A3C | ~17% | ~30 years |
Valladares et al. (2019) [11] | Classroom | DQN | 5% | ~10 years |
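To make the Training Data column concrete: assuming a 15-minute control interval (a common choice in this literature, though not reported uniformly above), the smallest figure already amounts to

\[
8 \times 365 \times 24 \times 4 \approx 2.8 \times 10^{5} \ \text{control steps},
\]

i.e. eight years of wall-clock operation if collected on the real building. This is why such controllers are trained in simulators, which in turn raises the dynamics mismatch addressed in Chapter 4.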
Building | Mixed-Use | Offices | Seminar Centre
---|---|---|---
Location | Greece | Greece | Denmark |
Floor Area (m\(^2\)) | 566 | 643 | 1278 |
Action-space dim | \(\mathbb{R}^{12}\) | \(\mathbb{R}^{14}\) | \(\mathbb{R}^{18}\) |
State-space dim | \(\mathbb{R}^{37}\) | \(\mathbb{R}^{56}\) | \(\mathbb{R}^{59}\) |
Equipment | Thermostats & AHU Flowrates | Thermostats | Thermostats |
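All three buildings are simulated with Energym [15]. A minimal interaction loop is sketched below; the environment name, weather key, and fixed-setpoint policy are assumptions based on Energym's documented make/step interface and should be checked against the library's model registry.

```python
# Driving an Energym building model [15]: a minimal sketch.
# Environment name and weather key are assumed; check Energym's docs.
import energym

env = energym.make(
    "MixedUseFanFCU-v0",      # assumed key for the Mixed-Use building
    weather="GRC_A_Athens",   # assumed weather file (Greece)
    simulation_days=30,
)

control_names = env.get_inputs_names()  # controllable setpoints/flowrates
outputs = env.get_output()              # dict of current sensor readings

for _ in range(96):  # roughly one simulated day at 15-minute steps (assumption)
    # Placeholder policy: a constant value per input. A real controller
    # would respect each input's valid range (e.g. flowrates in [0, 1]).
    control = {name: [21.0] for name in control_names}
    outputs = env.step(control)
```

The dict-of-lists control format mirrors Energym's examples; a learned policy would replace the constant with state-dependent setpoints computed from `outputs`.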
I set out to defend the following thesis:
[1] Sutton, R. and Barto, A. (2018). Reinforcement Learning: An Introduction. The MIT Press, second edition.
[2] Touati, A. and Ollivier, Y. (2021). Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 34:13–23.
[3] Borsa, D., Barreto, A., Quan, J., Mankowitz, D., Munos, R., Van Hasselt, H., Silver, D., and Schaul, T. (2018). Universal successor features approximators. arXiv preprint arXiv:1812.07626.
[4] Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, volume 32.
[5] Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779.
[6] Touati, A., Rapin, J., and Ollivier, Y. (2023). Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations.
[7] Park, S., Ghosh, D., Eysenbach, B., and Levine, S. (2023). HIQL: Offline goal-conditioned RL with latent states as actions. Advances in Neural Information Processing Systems, 36.
[8] Park, S., Kreiman, T., and Levine, S. (2024). Foundation policies with Hilbert representations. International Conference on Machine Learning.
[9] Wei, T., Wang, Y., and Zhu, Q. (2017). Deep reinforcement learning for building hvac control. In Proceedings of the 54th Annual Design Automation Conference 2017, DAC ’17, New York, NY, USA. Association for Computing Machinery.
[10] Zhang, Z., Chong, A., Pan, Y., Zhang, C., and Lam, K. P. (2019). Whole building energy model for hvac optimal control: A practical framework based on deep reinforcement learning. Energy and Buildings, 199:472–490.
[11] Valladares, W., Galindo, M., Gutiérrez, J., Wu, W.-C., Liao, K.-K., Liao, J.-C., Lu, K.-C., and Wang, C.-C. (2019). Energy optimization associated with thermal comfort and indoor air control via a deep reinforcement learning algorithm. Building and Environment, 155:105–117.
[12] Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., and Levine, S. (2018). Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103.
[13] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[14] Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.
[15] Scharnhorst, P., Schubnel, B., Fernández Bandera, C., Salom, J., Taddeo, P., Boegli, M., Gorecki, T., Stauffer, Y., Peppas, A., and Politi, C. (2021). Energym: A building model library for controller benchmarking. Applied Sciences, 11(8):3518.