# Generalized Advantage Estimation (GAE)

## Definition

Generalized Advantage Estimation (GAE) is a bias-variance balancing technique for estimating the advantage function in reinforcement learning, proposed by Schulman et al. in 2015[1]. GAE provides low-variance, nearly unbiased advantage estimates for policy gradient algorithms (such as PPO and A2PPO) by computing exponentially weighted averages of multi-step temporal difference (TD) residuals.
## Background: Advantage Function and TD Residuals

In Actor-Critic reinforcement learning, the advantage function is defined as:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

Direct computation requires knowledge of the true value function $V^\pi$, which in practice must be approximated. The simple one-step TD advantage estimate is the TD residual:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
However, one-step estimates have low variance but high bias (due to reliance on inaccurate value estimates), while $n$-step returns can reduce bias but increase variance.
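The one-step TD residual above can be sketched as a small function. This is a minimal illustration, not code from the paper; the handling of terminal states (dropping the bootstrap term) is a standard convention:

```python
def td_residual(r, v_s, v_s_next, gamma=0.99, done=False):
    """One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    If the episode terminates at this step, the bootstrap term
    gamma * V(s_{t+1}) is dropped, since there is no successor state.
    """
    bootstrap = 0.0 if done else gamma * v_s_next
    return r + bootstrap - v_s
```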
## GAE Definition

GAE balances bias and variance through exponentially weighted averaging of $k$-step TD residuals:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}$$

where $\lambda \in [0, 1]$ controls the bias-variance tradeoff:

- $\lambda = 0$: Degenerates to the one-step TD residual (low variance, high bias)
- $\lambda = 1$: Equivalent to the Monte Carlo return minus the value baseline (low bias, high variance)
In practice, due to the finite horizon, the recursive form is used:

$$\hat{A}_t = \delta_t + \gamma\lambda\,(1 - d_t)\,\hat{A}_{t+1}$$

where $d_t$ is the termination signal ($d_t = 1$ indicates episode termination at step $t$).
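The recursive form lends itself to a single backward pass over a collected trajectory. A minimal sketch follows; the function name and array layout are assumptions for illustration (not taken from the A2PPO implementation), and `values` carries one extra bootstrap entry $V(s_T)$:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward-recursive GAE: A_t = delta_t + gamma*lam*(1 - d_t)*A_{t+1}.

    rewards, dones: length-T arrays for one rollout.
    values: length-(T+1) array; values[T] is the bootstrap value V(s_T).
    dones[t] = 1.0 marks episode termination at step t, which zeroes
    both the bootstrap term and the recursion across the boundary.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

With `lam=0.0` each entry reduces to the one-step TD residual; with `lam=1.0` each entry is the discounted sum of all subsequent residuals, matching the two limiting cases above.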
## Application in A2PPO

In the A2PPO algorithm, GAE is used for advantage estimation with the following hyperparameter settings[2]:

| Parameter | Value | Meaning |
|---|---|---|
| $\gamma$ | 0.99 | Discount factor |
| $\lambda$ (GAE-$\lambda$) | 0.915 | GAE parameter |
In A2PPO's ablation experiments, combining GAE with the attention mechanism produces more stable policy gradient estimates, significantly outperforming vanilla PPO in final reward.
## GAE's Variance Control Mechanism

GAE's variance control stems from its finite-memory property: the contribution of distant future TD residuals decays exponentially as $(\gamma\lambda)^l$. GAE's variance is positively correlated with $\lambda$: increasing $\lambda$ reduces estimation bias but increases variance, as more reliance is placed on actual cumulative returns.
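The geometric weights and their sum $1/(1-\gamma\lambda)$, which can be read as an effective credit-assignment horizon, are easy to inspect numerically. The helper names below are illustrative, and the values use A2PPO's settings:

```python
def gae_weights(gamma, lam, n):
    """Weight (gamma*lam)**l placed on the l-th future TD residual."""
    return [(gamma * lam) ** l for l in range(n)]

def effective_horizon(gamma, lam):
    """Sum of the geometric weight series: 1 / (1 - gamma*lam)."""
    return 1.0 / (1.0 - gamma * lam)
```

For $\gamma = 0.99$, $\lambda = 0.915$, the per-step decay factor is $\gamma\lambda \approx 0.906$, so residuals roughly ten steps ahead still contribute meaningfully, while much more distant ones are suppressed.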
## Related Concepts
- A2PPO (Attention-Augmented PPO): The application framework for GAE in cislunar trajectory optimization
- Low-Thrust Transfer MDP: The RL problem formulation that GAE serves
## References
- [1] Schulman J, Moritz P, Levine S, et al. High-dimensional continuous control using generalized advantage estimation[J]. arXiv preprint arXiv:1506.02438, 2015.
- [2] Ul Haq I U, Dai H, Du C. Autonomous low-thrust trajectory optimization in cislunar space via attention-augmented reinforcement learning[J]. Aerospace Science and Technology, 2026.
