Video: DeepMind's Deep Q-network solving the Atari game Breakout after 600 episodes of self-play (Mnih et al. (2013))
Figure: Top: AlphaGo's famous Move 37, a counterintuitive move for a human to make with this board set-up, but nonetheless a game-winning one. Bottom: Lee Sedol, the world no. 1 Go player, flummoxed by AlphaGo's moves. (Silver et al. (2017))
Figure: The complex, interconnected electricity grid
Figure: The RL control loop. Agents take sequential actions that affect their environment; the environment changes, and the agent receives a reward for its action. Adapted from Episode 1 of David Silver's RL YouTube series
More formally, the RL control loop looks something like this:
Figure: The formalised RL control loop.
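To make the loop concrete, here is a minimal Python sketch of one episode of agent-environment interaction. Everything in it is an illustrative assumption rather than something from the figures: `CorridorEnv`, `random_policy`, `run_episode`, and the reward values are hypothetical names and numbers chosen to keep the example small.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: the agent walks along a corridor to a goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode at the left end and return the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor.
        self.position = min(max(self.position + action, 0), self.length)
        done = self.position == self.length
        # Assumed reward scheme: small penalty per step, +1 on reaching the goal.
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def random_policy(state, rng):
    # A placeholder agent that ignores the state and acts at random.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=100):
    """Run the control loop: observe state, act, receive reward and next state."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = policy(state, rng)             # agent picks an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


rng = random.Random(0)
total_reward, steps = run_episode(CorridorEnv(), random_policy, rng)
print(f"episode finished after {steps} steps, return {total_reward:.1f}")
```

A learning agent would replace `random_policy` with something that uses the observed rewards to improve its choices over time; the loop itself stays the same.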
Generally, when we model a function using a black-box approximator, we can select (informally) from two sets of models:
