Week 7

2022/11/07 - 2022/11/13

1. Reinforcement Learning

Trade-off between exploration and exploitation.
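
One common way to handle this trade-off is epsilon-greedy action selection. The sketch below is only illustrative: the function name `epsilon_greedy` and its parameters are assumptions made for these notes, not part of any library.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    Hypothetical helper: with probability `epsilon` explore,
    otherwise exploit the current estimates.
    """
    if rng.random() < epsilon:
        # Explore: choose an action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 0` the agent always exploits; with `epsilon = 1` it always explores. Values in between mix the two.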

1.1 Tabular Solution

The state and action spaces are small enough for the approximate value functions to be represented as arrays or tables.

  • Finite Markov Decision Process
  • Dynamic Programming
  • Monte Carlo Methods

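Temporal-difference (TD) learning combines features of both Monte Carlo and Dynamic Programming (DP) methods: like Monte Carlo, it learns from sampled experience without a model; like DP, it bootstraps from current value estimates.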


  • Temporal-Difference Learning
  • n-step Bootstrapping
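
A minimal sketch of the tabular TD(0) update and an n-step return, assuming the value table `V` is a plain dict keyed by state. The names `td0_update` and `n_step_return` and the default step sizes are illustrative assumptions.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to the value table V in place; return the TD error."""
    # Bootstrapped target: sampled reward plus discounted estimate of the next state.
    target = reward + (0.0 if terminal else gamma * V[next_state])
    td_error = target - V[state]
    V[state] += alpha * td_error
    return td_error

def n_step_return(rewards, bootstrap_value, gamma=0.9):
    """Discounted sum of n sampled rewards plus the discounted bootstrap value."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With a single reward, `n_step_return` gives the TD(0) target. With all rewards up to the end of the episode and a bootstrap value of 0, it gives the Monte Carlo return.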

1.2 Approximate Solution Methods

Problems with arbitrarily large state spaces.


  • Prediction Task: Evaluate a given policy by estimating its value function, i.e. the expected return when following that policy.

  • Control Task: Find the optimal policy, i.e. the one that maximizes the expected cumulative reward.


  • On-policy: Estimate the value of a policy while using it for control.

  • Off-policy: The policy used to generate behaviour, called the behaviour policy, may be unrelated to the policy that is evaluated and improved, called the estimation policy (also known as the target policy).
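
The on-policy / off-policy distinction shows up directly in the TD target. A rough sketch, assuming `Q` is a dict mapping each state to a list of action values; all function names here are made up for illustration.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma=0.9):
    """On-policy target: uses the action the behaviour policy actually takes next."""
    return reward + gamma * Q[next_state][next_action]

def q_learning_target(Q, reward, next_state, gamma=0.9):
    """Off-policy target: uses the greedy action, whatever the behaviour policy does."""
    return reward + gamma * max(Q[next_state])

def td_control_update(Q, state, action, target, alpha=0.1):
    """Move Q[state][action] a step of size alpha towards the given target."""
    Q[state][action] += alpha * (target - Q[state][action])
```

The update rule is the same in both cases; only the target differs. SARSA evaluates the policy that is generating the data, while Q-learning evaluates the greedy policy.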


  • On-policy TD Prediction
    • TD(0)
  • On-policy TD Control
    • SARSA
  • Off-policy TD Control
    • Q-Learning
    • Dyna-Q and Dyna-Q+
    • Expected SARSA

  • Policy Gradient Methods
    • REINFORCE
    • Actor-Critic
    • Advantage Actor-Critic (A2C)
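
Policy gradient methods weight each action's log-probability gradient by a return or an advantage. A small sketch of those two quantities, assuming one finished episode stored as a list of rewards and a list of critic value estimates; `discounted_returns` and `advantages` are hypothetical helper names, and return minus value is only one common choice of advantage.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t for every time step of one episode (the weight used by REINFORCE)."""
    returns = []
    g = 0.0
    # Work backwards so each G_t reuses G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def advantages(returns, value_estimates):
    """Advantage estimate G_t - V(s_t), using the critic's values as a baseline."""
    return [g - v for g, v in zip(returns, value_estimates)]
```

Subtracting the critic's estimate leaves the expected gradient unchanged but usually lowers its variance, which is the motivation for actor-critic methods over plain REINFORCE.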

2. Webots Simulator (ROS2)

Multi-agent Reinforcement Learning.