Scientific Methodology of Policy Gradient Methods (Sutton et al.)

  • Observation: Value-function methods combined with function approximation lack convergence guarantees; small changes in the estimated values can cause large, discontinuous changes in the greedy policy.
  • Question/Problem: How can we create a provably convergent RL algorithm that uses function approximation?
  • Hypothesis: An explicit policy, represented by its own function approximator and updated by gradient ascent on expected reward (with the help of an approximate advantage/action-value function), will be provably convergent.
  • Method (to test the Hyp.): Derive a new form of the policy gradient that can be estimated from experience using an approximate advantage function.
  • Experiment: Prove mathematically that a policy iteration algorithm based on this gradient formulation converges.
  • Analysis: The proof holds, confirming that the algorithm is convergent.
  • Conclusion: This is the first proof that a form of policy iteration with arbitrary differentiable function approximation converges to a locally optimal policy.
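
The gradient-ascent idea summarized above can be sketched with a minimal REINFORCE-style update on a two-armed bandit. This is an illustrative assumption, not the paper's algorithm: the paper covers full MDPs with general differentiable approximators, while here the policy is a simple softmax over per-arm preferences `theta`, and the update follows the policy gradient `alpha * r * grad log pi(a)`.

```python
import math
import random

def softmax(theta):
    """Softmax policy over per-arm preferences (numerically stabilized)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce(rewards=(1.0, 0.0), episodes=2000, alpha=0.1, seed=0):
    """Minimal REINFORCE sketch on a 2-armed bandit (hypothetical setup:
    deterministic per-arm rewards chosen for illustration)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # policy parameters (arm preferences)
    for _ in range(episodes):
        p = softmax(theta)
        a = 0 if rng.random() < p[0] else 1  # sample an action from pi
        r = rewards[a]
        # Policy gradient update: theta_i += alpha * r * d/dtheta_i log pi(a).
        # For a softmax policy, grad of log pi(a) w.r.t. theta_i is
        # (1 if i == a else 0) - p[i].
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - p[i]
            theta[i] += alpha * r * grad
    return softmax(theta)

final_policy = reinforce()
# The learned policy concentrates probability on the higher-reward arm.
```

Because the policy itself is adjusted smoothly in the direction of higher expected reward, there is no greedy argmax step, which is exactly what sidesteps the instability noted in the Observation bullet.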