A2PPO (Attention-Augmented Proximal Policy Optimization)
Definition
A2PPO is a Deep Reinforcement Learning (DRL) framework for low-thrust trajectory optimization in cislunar space, proposed by Ul Haq, Dai, and Du in 2026 [1]. Its core innovation is the integration of a directional cross-attention mechanism into the Actor-Critic architecture of the standard PPO (Proximal Policy Optimization) algorithm, which lets the policy network selectively attend to state features that the Critic network deems important for future value, improving learning stability and sample efficiency in chaotic multi-body dynamical environments.
Algorithm Architecture
Core Components
The forward propagation pipeline of A2PPO proceeds as follows:
- Shared MLP Encoder: Encodes the raw state into a hidden vector
- Role Projection: Projects into Actor- and Critic-specific role vectors via two independent linear projections
- Tokenization: Reshapes each role vector into a sequence of sub-tokens, with learned positional embeddings added
- Directional Cross-Attention: Actor tokens serve as Query, Critic tokens as Key and Value, performing feature fusion through multi-head cross-attention
- Fusion Output: After residual connections and per-token feed-forward networks (FFN), layer normalization is applied and the result is flattened to obtain the fused hidden vector
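A minimal PyTorch sketch of this pipeline is shown below. The hidden size, sub-token count, and head count are illustrative placeholders, not values from the paper:

```python
# Minimal PyTorch sketch of the A2PPO forward pipeline described above.
# Hidden size, sub-token count, and head count are illustrative placeholders,
# not values from the paper.
import torch
import torch.nn as nn

class A2PPOBackbone(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 128,
                 n_tokens: int = 8, n_heads: int = 4):
        super().__init__()
        assert hidden % n_tokens == 0
        self.n_tokens = n_tokens
        self.d_token = hidden // n_tokens
        # 1. Shared MLP encoder
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        # 2. Independent Actor / Critic role projections
        self.actor_proj = nn.Linear(hidden, hidden)
        self.critic_proj = nn.Linear(hidden, hidden)
        # 3. Learned positional embeddings over sub-tokens
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, self.d_token))
        # 4. Directional cross-attention (Actor = Query, Critic = Key/Value)
        self.attn = nn.MultiheadAttention(self.d_token, n_heads, batch_first=True)
        # 5. Per-token FFN with residual connections and layer norms
        self.ffn = nn.Sequential(nn.Linear(self.d_token, 4 * self.d_token),
                                 nn.GELU(),
                                 nn.Linear(4 * self.d_token, self.d_token))
        self.norm1 = nn.LayerNorm(self.d_token)
        self.norm2 = nn.LayerNorm(self.d_token)

    def _tokens(self, x: torch.Tensor) -> torch.Tensor:
        return x.view(-1, self.n_tokens, self.d_token) + self.pos

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.encoder(state)                    # (B, hidden)
        q = self._tokens(self.actor_proj(h))       # Actor tokens -> Query
        kv = self._tokens(self.critic_proj(h))     # Critic tokens -> Key/Value
        fused, _ = self.attn(q, kv, kv)            # Critic -> Actor attention
        x = self.norm1(q + fused)                  # residual + layer norm
        x = self.norm2(x + self.ffn(x))            # per-token FFN + residual
        return x.flatten(1)                        # fused hidden vector (B, hidden)
```

In this sketch the fused vector would feed the Actor head, while the Critic head could read the Critic role vector directly, consistent with the decoupling described next.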
Key Design: Directionality
A2PPO adopts an asymmetric Critic → Actor directional cross-attention design: the policy representation is conditioned on the value function's assessment signals, while the Critic remains decoupled from Actor exploration noise. This design outperforms self-attention variants in ablation experiments, significantly improving training stability.
PPO Loss Function
A2PPO optimizes the following composite loss (the standard PPO objective):

$$\mathcal{L}(\theta) = \hat{\mathbb{E}}_t\left[\mathcal{L}^{\mathrm{CLIP}}_t(\theta) - c_1\,\mathcal{L}^{\mathrm{VF}}_t(\theta) + c_2\,\mathcal{S}[\pi_\theta](s_t)\right]$$

The three terms are the clipped policy loss $\mathcal{L}^{\mathrm{CLIP}}_t$, the value function error $\mathcal{L}^{\mathrm{VF}}_t$ (weight $c_1$), and the policy entropy regularization $\mathcal{S}$ (weight $c_2$).
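For concreteness, a hedged sketch of this composite objective in minimization form; clip_eps, c1, and c2 are placeholder coefficients, not the paper's tuned values:

```python
# Hedged sketch of the composite PPO objective in minimization form;
# clip_eps, c1, and c2 are placeholder coefficients, not the tuned values.
import torch

def ppo_loss(logp, logp_old, adv, value, ret, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    ratio = torch.exp(logp - logp_old)                           # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()  # clipped policy term
    value_loss = (value - ret).pow(2).mean()                     # value function error
    entropy_bonus = entropy.mean()                               # exploration regularizer
    return policy_loss + c1 * value_loss - c2 * entropy_bonus
```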
Training Strategy
Curriculum Learning
A2PPO employs a progressive curriculum learning strategy that gradually tightens success thresholds: initial stages use relaxed terminal position/velocity tolerances, which are progressively tightened as training advances. This strategy avoids early instability in the chaotic CR3BP dynamical environment.
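A minimal sketch of such a tolerance schedule, assuming a geometric decay; the decay shape and bounds are illustrative, not taken from the paper:

```python
# Illustrative curriculum schedule: terminal tolerances shrink as training
# progresses. The geometric decay shape and the bounds are assumptions.
def success_tolerance(step: int, total_steps: int,
                      tol_init: float, tol_final: float) -> float:
    frac = min(step / total_steps, 1.0)
    # Interpolate geometrically from the relaxed to the tight tolerance
    return tol_init * (tol_final / tol_init) ** frac
```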
Hyperparameter Tuning
A two-stage hyperparameter search (100 trials each) is conducted using the Optuna framework; key tuned parameters include the learning rate, PPO clipping range (0.249), entropy coefficient (0.01474), and GAE-λ (0.915).
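A minimal Optuna sketch of one 100-trial search stage; the search ranges and the train_and_evaluate() helper are hypothetical, not from the paper:

```python
# Minimal Optuna sketch of one 100-trial search stage; the search ranges and
# the train_and_evaluate() helper are hypothetical, not from the paper.
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    clip = trial.suggest_float("clip_range", 0.1, 0.3)
    ent = trial.suggest_float("ent_coef", 1e-4, 0.05, log=True)
    lam = trial.suggest_float("gae_lambda", 0.9, 0.99)
    return train_and_evaluate(lr, clip, ent, lam)  # hypothetical training run

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```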
Performance Evaluation
Evaluation results across four cislunar low-thrust transfer scenarios:
| Scenario | Description | A2PPO ToF (days) | A2PPO Fuel (kg) | Direct Collocation (ToF / Fuel) |
|---|---|---|---|---|
| S1 | L₂ Halo → Halo | 4.95 | 2.08 | 4.99 days / 1.28 kg |
| S2 | L₂ Halo → NRHO | 8.38 | 5.00 | 7.26 days / 5.29 kg |
| S3 | NRHO → DRO | 7.60 | 5.10 | 7.63 days / 5.11 kg |
| S4 | Multi-rev Halo → Halo (very low thrust) | 33.6 | 0.97 | 33.12 days / 0.97 kg |
Without any initial guess, A2PPO autonomously learns trajectories that closely match the direct collocation baselines, while significantly outperforming the SAC baseline in the multi-revolution transfer scenario (37.37 days / 1.06 kg).
Robustness
- Monte Carlo perturbation test: 100% success rate across 100 initial-state perturbations (perturbation scale in nondimensional units, NDU; see the sketch after this list)
- Thrust degradation tolerance: Completes missions under up to 32% deterministic thrust degradation without retraining
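A hedged sketch of such a Monte Carlo robustness check; the gym-style env, the trained policy, the "success" flag, and the noise scale sigma are placeholders:

```python
# Hedged sketch of a Monte Carlo robustness check; the gym-style env, the
# trained policy, the "success" flag, and the noise scale sigma are placeholders.
import numpy as np

def monte_carlo_success_rate(env, policy, n_runs=100, sigma=1e-4):
    successes = 0
    for _ in range(n_runs):
        state = env.reset()
        state = state + np.random.normal(0.0, sigma, size=state.shape)  # NDU-scale noise
        done, success = False, False
        while not done:
            state, _, done, info = env.step(policy(state))
            success = info.get("success", False)
        successes += int(success)
    return successes / n_runs
```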
Relation to Related Concepts
- Standard PPO: A2PPO adds a directional cross-attention module on top of standard PPO, converging faster and reaching a significantly higher final reward than vanilla PPO
- SAC (Soft Actor-Critic): Used as a comparison baseline; A2PPO achieves shorter flight time and lower fuel consumption in the multi-revolution transfer scenario
- GTrXL: Another Transformer-enhanced RL method; A2PPO's cross-attention differs in targeting Actor-Critic feature fusion rather than temporal memory
- Generalized Advantage Estimation (GAE): A key component for advantage function estimation in A2PPO (a minimal sketch follows this list)
- Curriculum Learning: The progressive training strategy employed by A2PPO
- Low-Thrust Transfer MDP: The problem formulation framework for A2PPO
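As a reference for the GAE component noted above, a minimal advantage-estimation sketch; gamma is an assumed discount factor, while lam matches the tuned GAE-λ value reported earlier:

```python
# Minimal GAE sketch for reference; gamma is an assumed discount factor,
# while lam matches the tuned GAE-λ value reported above.
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.915):
    # values has length len(rewards) + 1 (bootstrap value appended)
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```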
References
- [1] Ul Haq I U, Dai H, Du C. Autonomous low-thrust trajectory optimization in cislunar space via attention-augmented reinforcement learning. Aerospace Science and Technology, 2026.
