Monte Carlo learning and temporal difference learning
At this point, we understand that it is very useful for an agent to learn the state value function $V(s)$, which informs the agent about the long-term value of being in state $s$, so that the agent can decide whether it is a good state to be in. The Monte Carlo (MC) and Temporal Difference (TD) learning methods enable an agent to learn exactly that!
The goal of MC and TD learning is to learn the value functions from the agent's experience as the agent follows its policy $\pi$.
MC learning updates the value towards the actual return $G_t$, which is the total discounted reward from time step $t$ until the end of the episode: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$. It is important to note that we can calculate this value only after the end of the sequence. TD learning (TD(0), to be precise), by contrast, updates the value towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$, which can be calculated after every step.
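To make the contrast concrete, here is a minimal tabular sketch of the two update rules. It is an illustration under stated assumptions, not the text's own implementation: the step size `alpha`, the every-visit form of the MC update, the dictionary-based value table `V`, and the function names are all hypothetical choices. `rewards[t]` is taken to be the reward received after leaving `states[t]`.

```python
from collections import defaultdict


def discounted_returns(rewards, gamma):
    """Compute the return G_t for every time step of a finished episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def mc_update(V, states, rewards, alpha, gamma):
    """Every-visit MC (assumed variant): move V(S_t) towards the actual return G_t.

    This can only run once the episode has ended, because G_t depends on
    all rewards that follow time step t.
    """
    for s, g in zip(states, discounted_returns(rewards, gamma)):
        V[s] += alpha * (g - V[s])


def td0_update(V, s, r, s_next, done, alpha, gamma):
    """TD(0): move V(s) towards the estimated return r + gamma * V(s_next).

    This can run after every single step; a terminal next state is
    treated as having value 0.
    """
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])


if __name__ == "__main__":
    # A made-up two-step episode: A -> B -> terminal.
    V_mc = defaultdict(float)
    mc_update(V_mc, ["A", "B"], [1.0, 2.0], alpha=0.5, gamma=0.5)

    V_td = defaultdict(float)
    td0_update(V_td, "A", 1.0, "B", done=False, alpha=0.5, gamma=0.5)
    td0_update(V_td, "B", 2.0, None, done=True, alpha=0.5, gamma=0.5)
    print(dict(V_mc), dict(V_td))
```

Note that `mc_update` needs the whole list of rewards before it can change anything, while `td0_update` needs only one transition, which is why TD(0) can learn during an episode.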