A2PPO (Attention-Augmented Proximal Policy Optimization)

Definition

A2PPO is a Deep Reinforcement Learning (DRL) framework for low-thrust trajectory optimization in cislunar space, proposed by Ul Haq et al. in 2026 [1]. Its core innovation is the integration of a directional cross-attention mechanism into the Actor-Critic architecture of the standard PPO (Proximal Policy Optimization) algorithm: the policy network selectively attends to the state features that the Critic network deems important for future value, improving learning stability and sample efficiency in chaotic multi-body dynamical environments.

Algorithm Architecture

Core Components

The forward propagation pipeline of A2PPO proceeds as follows (a minimal sketch follows the list):

  1. Shared MLP Encoder: Encodes the raw state $s_t \in \mathbb{R}^{16}$ into a hidden vector $h_t \in \mathbb{R}^{128}$
  2. Role Projection: Projects $h_t$ into Actor- and Critic-specific role vectors via two independent linear projections $W_a, W_c \in \mathbb{R}^{128 \times 128}$
  3. Tokenization: Reshapes each role vector into $M = 4$ sub-tokens of dimension $d = 32$ ($D = M \times d = 128$), with learned positional embeddings added
  4. Directional Cross-Attention: Actor tokens serve as Query and Critic tokens as Key and Value in multi-head cross-attention ($N_h = 2$ heads)
  5. Fusion Output: After residual connections and per-token feed-forward networks (FFN), layer normalization is applied and the result is flattened to obtain the fused hidden vector $z_t \in \mathbb{R}^{128}$
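A minimal PyTorch sketch of steps 1-5, assuming the paper's dimensions (16-dimensional state, 128-dimensional hidden vector, $M = 4$ tokens of size $d = 32$, $N_h = 2$ heads); module names and activation choices here are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class A2PPOFusion(nn.Module):
    """Sketch of the five-step fused forward pass described above."""
    def __init__(self, state_dim=16, hidden=128, m_tokens=4, n_heads=2):
        super().__init__()
        self.m, self.d = m_tokens, hidden // m_tokens            # M=4, d=32
        self.encoder = nn.Sequential(                            # 1. shared MLP
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh())
        self.w_a = nn.Linear(hidden, hidden, bias=False)         # 2. Actor role
        self.w_c = nn.Linear(hidden, hidden, bias=False)         #    Critic role
        self.pos = nn.Parameter(torch.zeros(m_tokens, self.d))   # 3. positions
        self.attn = nn.MultiheadAttention(self.d, n_heads,       # 4. cross-attn
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(self.d, self.d), nn.ReLU(),
                                 nn.Linear(self.d, self.d))      # 5. per-token FFN
        self.norm1, self.norm2 = nn.LayerNorm(self.d), nn.LayerNorm(self.d)

    def tokens(self, x):                   # (B, 128) -> (B, M, d) + pos. emb.
        return x.view(-1, self.m, self.d) + self.pos

    def forward(self, s_t):                # s_t: (B, 16)
        h_t = self.encoder(s_t)                                  # (B, 128)
        q = self.tokens(self.w_a(h_t))     # Actor tokens  -> Query
        kv = self.tokens(self.w_c(h_t))    # Critic tokens -> Key & Value
        fused, _ = self.attn(q, kv, kv)    # directional Critic -> Actor
        z = self.norm1(q + fused)          # residual + layer norm
        z = self.norm2(z + self.ffn(z))    # per-token FFN + residual + norm
        return z.flatten(1)                # fused hidden z_t: (B, 128)

# z_t = A2PPOFusion()(torch.randn(8, 16))   # -> shape (8, 128)
```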

Key Design: Directionality

A2PPO adopts an asymmetric Critic → Actor directional cross-attention design: the policy representation is conditioned on the value function's assessment signals, while the Critic remains decoupled from Actor exploration noise. This design outperforms self-attention variants in ablation experiments, significantly improving training stability.
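The snippet below makes the asymmetry concrete by contrasting the directional fusion with a symmetric self-attention variant; the random tensors and the attention module instance are illustrative stand-ins:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=32, num_heads=2, batch_first=True)
actor_tokens = torch.randn(1, 4, 32)   # policy-side tokens (illustrative)
critic_tokens = torch.randn(1, 4, 32)  # value-side tokens (illustrative)

# Directional Critic -> Actor fusion: the policy representation queries the
# value-side features; nothing flows back into the Critic tokens.
fused, _ = attn(actor_tokens, critic_tokens, critic_tokens)

# Symmetric self-attention ablation (reported weaker): every token attends
# to every other, so Actor exploration noise can reach the Critic features.
all_tokens = torch.cat([actor_tokens, critic_tokens], dim=1)
fused_sym, _ = attn(all_tokens, all_tokens, all_tokens)
```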

PPO Loss Function

A2PPO optimizes the following composite loss:

$$J(\theta, \psi) = -\mathcal{L}^{\mathrm{clip}}(\theta) + c_v \frac{1}{2} \mathbb{E}\left[ (V_\psi(z_t) - \hat{R}_t)^2 \right] - c_e \mathbb{E}\left[ \mathcal{H}(\pi_\theta(\cdot \mid z_t)) \right]$$

The three terms are the clipped policy loss, the value-function error (weight $c_v$), and the policy-entropy regularization (weight $c_e$).
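A minimal sketch of this composite objective, assuming precomputed probability ratios, advantages, returns, and entropies; $c_v = 0.5$ is a common PPO default, not a value quoted in the paper, while the clip range and entropy coefficient are the tuned values given below:

```python
import torch

def a2ppo_loss(ratio, adv, values, returns, entropy,
               clip_eps=0.249, c_v=0.5, c_e=0.01474):
    # Clipped surrogate objective L^clip (maximized, hence the minus sign
    # in the final sum, matching the equation above).
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    l_clip = torch.min(unclipped, clipped).mean()
    # Value-function error term, weighted by c_v (0.5 is an assumed default).
    v_loss = 0.5 * ((values - returns) ** 2).mean()
    # Entropy bonus H(pi_theta), weighted by c_e; subtracting it rewards
    # exploration.
    return -l_clip + c_v * v_loss - c_e * entropy.mean()
```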

Training Strategy

Curriculum Learning

A2PPO employs a progressive curriculum learning strategy, gradually tightening success thresholds: initial stages use relaxed terminal position/velocity tolerances (e.g., $\Delta d = 5 \times 10^{-3}$), progressively tightening to $\Delta d = 1 \times 10^{-3}$ as training advances. This strategy avoids initial instability in the chaotic CR3BP dynamical environment; a schedule sketch follows.
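A minimal sketch of such a tolerance schedule; the linear interpolation and step counts are assumptions, since the source states only the relaxed and final tolerances:

```python
def success_tolerance(step, total_steps, tol_init=5e-3, tol_final=1e-3):
    # Linearly tighten the terminal tolerance as training progresses
    # (linear schedule is an assumption; the endpoints are from the text).
    frac = min(step / total_steps, 1.0)
    return tol_init + frac * (tol_final - tol_init)

# e.g. at 20% of training: 5e-3 + 0.2 * (1e-3 - 5e-3) = 4.2e-3
# success = terminal_position_error < success_tolerance(step, total_steps)
```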

Hyperparameter Tuning

A two-stage hyperparameter search (100 trials per stage) is conducted using the Optuna framework, with key parameters including the learning rate ($1.315 \times 10^{-3}$), PPO clipping range (0.249), entropy coefficient (0.01474), and GAE $\lambda$ (0.915). A sketch of such a search is shown below.
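A sketch of one 100-trial Optuna stage; the search ranges and the `train_and_evaluate` helper are hypothetical, and only the tuned values quoted above come from the source:

```python
import optuna

def objective(trial):
    # Illustrative search ranges bracketing the tuned values quoted above
    # (lr 1.315e-3, clip 0.249, entropy coef 0.01474, GAE lambda 0.915).
    lr = trial.suggest_float("lr", 1e-5, 3e-3, log=True)
    clip = trial.suggest_float("clip_range", 0.1, 0.4)
    ent = trial.suggest_float("ent_coef", 1e-4, 0.05, log=True)
    lam = trial.suggest_float("gae_lambda", 0.90, 0.99)
    return train_and_evaluate(lr, clip, ent, lam)  # hypothetical helper

study = optuna.create_study(direction="maximize")  # maximize final reward
study.optimize(objective, n_trials=100)            # one of two 100-trial stages
```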

Performance Evaluation

Evaluation results across four cislunar low-thrust transfer scenarios:

| Scenario | Description | ToF (days) | Fuel (kg) | Direct Collocation (ToF / Fuel) |
| --- | --- | --- | --- | --- |
| S1 | L₂ Halo → Halo | 4.95 | 2.08 | 4.99 days / 1.28 kg |
| S2 | L₂ Halo → NRHO | 8.38 | 5.00 | 7.26 days / 5.29 kg |
| S3 | NRHO → DRO | 7.60 | 5.10 | 7.63 days / 5.11 kg |
| S4 | Multi-rev Halo → Halo (very low thrust) | 33.6 | 0.97 | 33.12 days / 0.97 kg |

Without any initial guess, A2PPO autonomously learns trajectories that closely match the direct-collocation baselines, while significantly outperforming the SAC baseline (37.37 days / 1.06 kg) in the multi-revolution transfer scenario S4.

Robustness

  • Monte Carlo perturbation test: 100% success rate across 100 perturbed initial states ($\sigma = 10^{-3}$ NDU); a test-loop sketch follows this list
  • Thrust degradation tolerance: Completes missions under up to 32% deterministic thrust degradation without retraining
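A sketch of the Monte Carlo test loop from the first bullet; `policy_rollout` is a hypothetical stand-in for running the trained policy through the CR3BP-LT simulator and reporting mission success:

```python
import numpy as np

def monte_carlo_success_rate(policy_rollout, s0_nominal,
                             n_trials=100, sigma=1e-3):
    # Gaussian perturbations of the initial state (sigma in NDU), as in
    # the test above; policy_rollout(s0) -> bool is a hypothetical callable
    # returning True when the perturbed mission still succeeds.
    rng = np.random.default_rng(seed=0)
    hits = sum(
        policy_rollout(s0_nominal + rng.normal(0.0, sigma, s0_nominal.shape))
        for _ in range(n_trials))
    return hits / n_trials
```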

Relation to Related Concepts

  • Standard PPO: A2PPO adds a directional cross-attention module on top of standard PPO and significantly outperforms vanilla PPO in both convergence speed and final reward
  • SAC (Soft Actor-Critic): comparison baseline; A2PPO achieves shorter flight times and lower fuel consumption in multi-revolution transfer scenarios
  • GTrXL: another Transformer-enhanced RL method; A2PPO's cross-attention mechanism differs in targeting Actor-Critic feature fusion
  • Generalized Advantage Estimation (GAE): the advantage estimator used by A2PPO (a sketch follows this list)
  • Curriculum Learning: the progressive training strategy employed by A2PPO
  • Low-Thrust Transfer MDP: the problem formulation underlying A2PPO
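For the GAE step, a standard implementation of the estimator as used in PPO-family methods; $\lambda = 0.915$ is the tuned value quoted earlier, while $\gamma$ and the single-segment assumption are ours:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.915):
    # Standard GAE recursion over a single, non-terminating segment;
    # `values` must hold len(rewards) + 1 entries (bootstrap value last).
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```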

References

  • [1] Ul Haq I U, Dai H, Du C. Autonomous low-thrust trajectory optimization in cislunar space via attention-augmented reinforcement learning. Aerospace Science and Technology, 2026.