Scientific Methodology of Policy Gradient Methods (Sutton et al.)
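- Observation: Value-function methods have no convergence guarantees when combined with function approximation, because an arbitrarily small change in an estimated value can change which action the greedy policy selects.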
- Question/Problem: How can we create a provably convergent RL algorithm that uses function approximation?
- Hypothesis: An explicit, gradient-based policy approximator, aided by an advantage function, will be provably convergent.
- Method (to test the Hyp.): Derive a new policy gradient form, suitable for estimation using an approximate advantage function.
- Experiment: Construct a mathematical proof of convergence for a policy iteration algorithm that uses this new gradient formulation with differentiable function approximation.
- Analysis: The proof holds, confirming that the algorithm is convergent.
- Conclusion: The paper proves, for the first time, that a version of policy iteration with arbitrary differentiable function approximation converges to a locally optimal policy.
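
The gradient form mentioned in the Method step can be checked numerically on a toy problem. The sketch below is not from the paper: it assumes a small random MDP, a tabular softmax policy, the discounted start-state objective, and a start-state distribution `mu0` (a mild generalization of a single start state). All names (`P`, `R`, `mu0`, `theta`, `policy_gradient`, and so on) are hypothetical. It compares the policy gradient theorem's expression, sum over s of d(s) times sum over a of dpi(s,a)/dtheta times Q(s,a), against a finite-difference gradient of the expected return.

```python
import numpy as np

def softmax_policy(theta):
    # Row-wise softmax: pi[s, a] is the probability of action a in state s.
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def evaluate(theta, P, R, mu0, gamma):
    # Exact policy evaluation for a small MDP (assumed setup, not the paper's).
    pi = softmax_policy(theta)
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)                # expected one-step reward under pi
    M = np.linalg.inv(np.eye(n_states) - gamma * P_pi)
    V = M @ r_pi                               # state values
    Q = R + gamma * (P @ V)                    # action values, shape (S, A)
    d = mu0 @ M                                # discounted state visitation weights
    return pi, V, Q, d

def policy_gradient(theta, P, R, mu0, gamma):
    # Policy gradient theorem: grad = sum_s d(s) sum_a dpi(s,a)/dtheta * Q(s,a).
    # For a tabular softmax policy the inner sum simplifies to
    # pi(s,b) * (Q(s,b) - V(s)), where Q - V is the advantage.
    pi, V, Q, d = evaluate(theta, P, R, mu0, gamma)
    return d[:, None] * pi * (Q - V[:, None])

def finite_difference_gradient(theta, P, R, mu0, gamma, eps=1e-6):
    # Central differences on the expected discounted return rho = mu0 . V.
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        plus, minus = theta.copy(), theta.copy()
        plus[idx] += eps
        minus[idx] -= eps
        rho_plus = mu0 @ evaluate(plus, P, R, mu0, gamma)[1]
        rho_minus = mu0 @ evaluate(minus, P, R, mu0, gamma)[1]
        grad[idx] = (rho_plus - rho_minus) / (2 * eps)
    return grad

# Hypothetical toy problem: 3 states, 2 actions, random dynamics and rewards.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)              # normalize to valid transition probabilities
R = rng.random((3, 2))
mu0 = np.array([1.0, 0.0, 0.0])                # always start in state 0
theta = rng.normal(size=(3, 2))
gamma = 0.9

analytic = policy_gradient(theta, P, R, mu0, gamma)
numeric = finite_difference_gradient(theta, P, R, mu0, gamma)
print("max abs difference:", np.abs(analytic - numeric).max())
```

Because each softmax row sums to one, subtracting V(s) from Q(s,a) does not change the gradient, which is why an advantage function can stand in for the action-value function here. The paper's convergence result concerns approximate value estimates with differentiable function approximation; this sketch uses exact values only to illustrate the gradient form.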