Week 7

2022/11/07 - 2022/11/13

1. Reinforcement Learning

Trade-off between exploration and exploitation.
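
A minimal sketch of one common way to handle this trade-off, epsilon-greedy action selection. The function name `epsilon_greedy` and its parameters are illustrative assumptions, not taken from any particular library: with probability `epsilon` the agent explores (random action), otherwise it exploits (highest estimated value).

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise.

    q_values: list of estimated action values (hypothetical input format).
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick an action with the highest estimate, breaking ties at random.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

Setting `epsilon` to 0 gives a purely greedy agent, and 1 gives a purely random one. Values in between are often decayed over time, though the right schedule depends on the problem.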

1.1 Tabular Solution

The state and action spaces are small enough for the approximate value functions to be represented as arrays or tables.

Finite Markov Decision Process
Dynamic Programming
Monte Carlo Methods

TD learning combines some of the features of both Monte Carlo and Dynamic Programming (DP) methods.
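
As a rough illustration of that combination, the one-step TD update learns from a sampled transition (as Monte Carlo does) but bootstraps from the current estimate of the next state (as DP does). The sketch below assumes a tabular value function stored in a dict. The name `td0_update` and the `terminal` flag are hypothetical.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the new value.

    V: dict mapping state -> estimated value (assumed representation).
    """
    # Bootstrapped target: sampled reward plus discounted estimate of the next state.
    target = r + (0.0 if terminal else gamma * V[s_next])
    # Move the estimate a fraction alpha toward the target (the TD error).
    V[s] += alpha * (target - V[s])
    return V[s]
```

For example, with `V = {"A": 0.0, "B": 1.0}`, reward 0.5, `alpha=0.5` and `gamma=1.0`, the target for state A is 1.5 and the estimate moves halfway there, to 0.75.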

Temporal-Difference Learning
n-step Bootstrapping

1.2 Approximate Solution Methods

Problems with arbitrarily large state spaces, where a table entry per state is impractical and the value function must be approximated.

Prediction Task: Evaluate a given policy by estimating its value function, i.e. the expected return from following the policy.
Control Task: Find the optimal policy, i.e. the one that maximizes the expected cumulative reward.

On-policy: Estimate the value of a policy while using it for control.
Off-policy: The policy used to generate behaviour, called the behaviour policy, may be unrelated to the policy that is evaluated and improved, called the estimation policy (also known as the target policy).

On-policy TD Prediction
- TD(0)
On-policy TD Control
- SARSA

Off-policy TD Control
- Q-Learning
- Dyna-Q and Dyna-Q+
- Expected SARSA
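
The TD control methods listed above differ mainly in the target they bootstrap from. The sketch below compares the SARSA, Q-Learning and Expected SARSA targets, assuming a tabular `Q` indexed as `Q[state][action]` and, for Expected SARSA, an epsilon-greedy policy. All function names here are illustrative, not from a library.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: uses the action actually taken next by the current policy.
    return r + gamma * Q[s_next][a_next]

def q_learning_target(Q, r, s_next, gamma):
    # Off-policy: uses the greedy action, regardless of what the behaviour policy does.
    return r + gamma * max(Q[s_next])

def expected_sarsa_target(Q, r, s_next, gamma, epsilon):
    # Uses the expectation over next actions under an (assumed) epsilon-greedy policy.
    q = Q[s_next]
    n = len(q)
    greedy = max(range(n), key=lambda a: q[a])
    probs = [epsilon / n + (1.0 - epsilon if a == greedy else 0.0) for a in range(n)]
    return r + gamma * sum(p * v for p, v in zip(probs, q))
```

In each case the update is `Q[s][a] += alpha * (target - Q[s][a])`. With `epsilon = 0` the Expected SARSA target coincides with the Q-Learning target.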

Policy Gradient Methods
- REINFORCE
- Actor-Critic
- Advantage Actor-Critic (A2C)
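
A minimal sketch of the idea shared by the policy gradient methods above, assuming a softmax policy over per-action preferences in a single state. The names `softmax` and `reinforce_step` and the `baseline` argument are hypothetical. With `baseline=0` this is a plain REINFORCE-style step. Passing a learned state-value estimate as the baseline gives the advantage form that actor-critic methods build on.

```python
import math

def softmax(prefs):
    """Convert action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(prefs, action, ret, alpha=0.1, baseline=0.0):
    """Return updated preferences after one sampled (action, return) pair."""
    pi = softmax(prefs)
    advantage = ret - baseline
    # For a softmax policy, d log pi(action) / d prefs[a] = 1[a == action] - pi[a].
    return [
        p + alpha * advantage * ((1.0 if a == action else 0.0) - pi[a])
        for a, p in enumerate(prefs)
    ]
```

A positive advantage raises the probability of the sampled action and a negative one lowers it. No value table is maximized over, which is the main contrast with the TD control methods.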

2. Webots Simulator (ROS2)

Multi-agent Reinforcement Learning.