Generalized Advantage Estimation (GAE)

Definition

Generalized Advantage Estimation (GAE) is a bias-variance balancing technique for estimating the advantage function in reinforcement learning, proposed by Schulman et al. in 2015[1]. By forming an exponentially weighted sum of temporal-difference (TD) residuals, GAE supplies policy gradient algorithms (such as PPO and A2PPO) with advantage estimates whose variance is substantially reduced at the cost of a small, controllable bias.

Background: Advantage Function and TD Residuals

In Actor-Critic reinforcement learning, the advantage function is defined as:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

Direct computation requires knowledge of the true value function $V^\pi$, which in practice must be approximated. The simple one-step TD advantage estimate is:

$$A_t^{(1)} = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

However, one-step estimates have low variance but high bias (due to reliance on inaccurate value estimates), while $n$-step returns reduce bias but increase variance.
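As a concrete illustration, here is a minimal NumPy sketch of the one-step residual $\delta_t$ for a stored rollout; the array layout and function name are illustrative assumptions, not from the cited papers.

```python
import numpy as np

def td_residuals(rewards, values, last_value, gamma=0.99):
    """One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    # values holds V(s_0), ..., V(s_{T-1}); last_value bootstraps V(s_T).
    next_values = np.append(values[1:], last_value)
    return rewards + gamma * next_values - values
```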

GAE Definition

GAE balances bias and variance through an exponentially weighted average of $n$-step advantage estimates, which reduces to a discounted sum of TD residuals:

$$\hat{A}_t^{\text{GAE}(\lambda, \gamma)} = \sum_{k=0}^{\infty} (\gamma\lambda)^{k} \delta_{t+k}$$

where $\lambda \in [0,1]$ controls the bias-variance tradeoff:

  • $\lambda = 0$: degenerates to the one-step TD residual (low variance, high bias)
  • $\lambda = 1$: equals the discounted Monte Carlo return minus the value baseline (low bias, high variance)

In practice, because rollouts have a finite horizon, the recursive form is used:

$$\hat{A}_t = \delta_t + \gamma\lambda(1-d_t)\hat{A}_{t+1}$$

where $d_t$ is the termination signal ($d_t = 1$ indicates that the episode terminates at step $t$).
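A minimal sketch of this backward recursion, again with illustrative NumPy names; note that, as is common in PPO-style implementations, the bootstrap term inside $\delta_t$ is also masked at terminal steps.

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma, lam):
    """Backward GAE recursion: A_t = delta_t + gamma*lam*(1 - d_t)*A_{t+1}."""
    T = len(rewards)
    advantages = np.zeros(T)
    next_adv = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]  # zero out bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        next_adv = delta + gamma * lam * mask * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages
```

In PPO-style training, the critic's regression targets are then typically recovered as `advantages + values`.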

Application in A2PPO

In the A2PPO algorithm, GAE is used for advantage estimation with the following hyperparameter settings[2]:

| Parameter | Value | Meaning |
| --- | --- | --- |
| $\gamma$ | 0.99 | Discount factor |
| $\lambda$ (GAE-$\lambda$) | 0.915 | GAE parameter |

In A2PPO's ablation experiments, the combination of GAE with the attention mechanism produces more stable policy gradient estimates, significantly outperforming vanilla PPO (final reward $1071.41 \pm 7.75$ vs. $344.87 \pm 563.71$).
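Plugging the table's settings into the `compute_gae` sketch above (the rollout data here is random filler, purely to show the call):

```python
rng = np.random.default_rng(0)
T = 128
rewards = rng.normal(size=T)   # placeholder rollout rewards
values = rng.normal(size=T)    # critic estimates V(s_t)
dones = np.zeros(T)            # one non-terminating rollout segment
advantages = compute_gae(rewards, values, dones, last_value=0.0,
                         gamma=0.99, lam=0.915)  # A2PPO settings from the table
```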

GAE's Variance Control Mechanism

GAE's variance control stems from its exponentially decaying memory: TD residuals far in the future are down-weighted by $(\gamma\lambda)^k$ and contribute little to the estimate. Note that GAE's variance is positively correlated with $\lambda$: increasing $\lambda$ reduces estimation bias but increases variance, because more weight is placed on actual sampled returns.
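Both limiting cases listed under the GAE definition can be checked numerically by reusing the sketches and the toy rollout above: at $\lambda = 0$ the estimate collapses to the one-step residual $\delta_t$, and at $\lambda = 1$ the sum telescopes to the discounted return-to-go minus the value baseline.

```python
# lambda = 0: GAE reduces to the one-step TD residual delta_t.
adv0 = compute_gae(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.0)
assert np.allclose(adv0, td_residuals(rewards, values, 0.0, gamma=0.99))

# lambda = 1: GAE telescopes to the discounted return-to-go minus V(s_t).
adv1 = compute_gae(rewards, values, dones, last_value=0.0, gamma=0.99, lam=1.0)
returns = np.zeros_like(rewards)
acc = 0.0
for t in reversed(range(len(rewards))):
    acc = rewards[t] + 0.99 * acc
    returns[t] = acc
assert np.allclose(adv1, returns - values)
```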

Related Concepts

  • A2PPO (Attention-Augmented PPO): The application framework for GAE in cislunar trajectory optimization
  • Low-Thrust Transfer MDP: The RL problem formulation that GAE serves

References

  • [1] Schulman J, Moritz P, Levine S, et al. High-dimensional continuous control using generalized advantage estimation[J]. arXiv preprint arXiv:1506.02438, 2015.
  • [2] Ul Haq I U, Dai H, Du C. Autonomous low-thrust trajectory optimization in cislunar space via attention-augmented reinforcement learning[J]. Aerospace Science and Technology, 2026.