Cislunar Space Beginner's Guide
Generalized Advantage Estimation (GAE)

Definition

Generalized Advantage Estimation (GAE) is a bias-variance balancing technique for estimating the advantage function in reinforcement learning, proposed by Schulman et al. in 2015. By taking an exponentially weighted average of multi-step temporal-difference (TD) residuals, GAE provides low-variance, near-unbiased advantage estimates for policy-gradient algorithms such as PPO and A2PPO.

Background: Advantage Function and TD Residuals

In Actor-Critic reinforcement learning, the advantage function is defined as:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

Computing this directly requires the true value function $V^\pi$, which in practice must be approximated. The simplest one-step TD advantage estimate is:

$$A_t^{(1)} = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

One-step estimates have low variance but high bias (they rely on an inaccurate value estimate), while $n$-step returns reduce bias at the cost of higher variance.
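As a concrete illustration, the one-step residual $\delta_t$ can be computed vectorized over a whole rollout. A minimal NumPy sketch (the toy rewards and value estimates below are made up for illustration):

```python
import numpy as np

def td_residuals(rewards, values, gamma=0.99):
    """One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has length len(rewards) + 1, so the final entry bootstraps
    the value of the state reached after the last step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

# Toy 3-step rollout with constant reward 1 and a flat value estimate of 0.5
deltas = td_residuals([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5], gamma=0.9)
# each residual is 1 + 0.9 * 0.5 - 0.5 = 0.95
```

These residuals are exactly the building blocks that GAE combines in the next section.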

GAE Definition

GAE balances bias and variance by taking an exponentially weighted average of $n$-step TD residuals:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^{k} \delta_{t+k}$$

where $\lambda \in [0, 1]$ controls the bias-variance tradeoff:

  • $\lambda = 0$: reduces to the one-step TD residual (low variance, high bias)
  • $\lambda = 1$: recovers the full Monte Carlo return minus the value baseline (low bias, high variance)

In practice, with a finite horizon, the recursive form is used:

$$\hat{A}_t = \delta_t + \gamma\lambda(1 - d_t)\hat{A}_{t+1}$$

where $d_t$ is the termination signal ($d_t = 1$ indicates the episode terminates at step $t$).
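The recursion runs backward over a collected trajectory. A minimal NumPy sketch (not taken from any particular paper's code; note that, as in most common implementations, the bootstrap term in $\delta_t$ is also masked at termination, a detail the equation above leaves implicit):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma * lam * (1 - d_t) * A_{t+1}.

    rewards and dones have length T; values has length T + 1 (the last
    entry bootstraps the value of the state after the final step).
    The bootstrap in delta_t is masked by (1 - d_t) so terminal states
    do not bootstrap from the next episode's first state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
        next_adv = delta + gamma * lam * (1.0 - dones[t]) * next_adv
        advantages[t] = next_adv
    return advantages

# Toy two-step episode (zero value estimates) that terminates at step 2:
# A_1 = 1, A_0 = 1 + 0.5 * 1.0 * A_1 = 1.5
adv = compute_gae([1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 1.0], gamma=0.5, lam=1.0)
```

A single backward pass over the buffer makes this O(T), which is why PPO-style algorithms compute GAE once per rollout before the policy update.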

Application in A2PPO

In the A2PPO algorithm, GAE is used for advantage estimation with the following hyperparameter settings:

| Parameter | Value | Meaning |
| --- | --- | --- |
| $\gamma$ | 0.99 | Discount factor |
| $\lambda$ (GAE-$\lambda$) | 0.915 | GAE parameter |

In A2PPO's ablation experiments, the combination of GAE with the attention mechanism produces more stable policy gradient estimates, significantly outperforming Vanilla PPO (final reward $1071.41 \pm 7.75$ vs $344.87 \pm 563.71$).

GAE's Variance Control Mechanism

GAE's variance control stems from its finite-memory property: TD residuals far in the future decay exponentially as $(\gamma\lambda)^k$. Variance is positively correlated with $\lambda$: increasing $\lambda$ reduces estimation bias but increases variance, because the estimate relies more heavily on actual cumulative returns.
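The two limiting cases can be checked numerically. A small self-contained sketch (random rewards and value estimates, invented for illustration): at $\lambda = 0$ the GAE sum collapses to the one-step residual $\delta_t$, and at $\lambda = 1$ it telescopes to the full discounted, bootstrapped return minus $V(s_t)$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.99, 50
rewards = rng.normal(size=T)
values = rng.normal(size=T + 1)  # last entry bootstraps the final state
deltas = rewards + gamma * values[1:] - values[:-1]

def gae(lam):
    """Finite-horizon GAE via the backward recursion (no terminations)."""
    adv, nxt = np.zeros(T), 0.0
    for t in reversed(range(T)):
        nxt = deltas[t] + gamma * lam * nxt
        adv[t] = nxt
    return adv

# Discounted returns bootstrapped with the final value estimate
returns, acc = np.zeros(T), values[-1]
for t in reversed(range(T)):
    acc = rewards[t] + gamma * acc
    returns[t] = acc

assert np.allclose(gae(0.0), deltas)                 # lambda = 0: one-step TD
assert np.allclose(gae(1.0), returns - values[:-1])  # lambda = 1: return - baseline
```

Intermediate values such as A2PPO's $\lambda = 0.915$ interpolate between these two extremes.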

Related Concepts

  • A2PPO (Attention-Augmented PPO): The application framework for GAE in cislunar trajectory optimization
  • Low-Thrust Transfer MDP: The RL problem formulation that GAE serves

References

  • Schulman J, Moritz P, Levine S, et al. High-dimensional continuous control using generalized advantage estimation[J]. arXiv:1506.02438, 2015.
  • Ul Haq I U, Dai H, Du C. Autonomous low-thrust trajectory optimization in cislunar space via attention-augmented reinforcement learning[J]. Aerospace Science and Technology, 2026.
Last Updated: 4/29/26, 11:30 AM
Contributors: Hermes Agent, Cron Job