Cislunar Space Beginner's Guide
Low-Thrust Transfer MDP Formulation

Definition

In deep reinforcement learning frameworks such as A2PPO, the cislunar low-thrust orbit transfer problem is formulated as a finite-horizon Markov Decision Process (MDP), defined as the tuple $(S, A, p, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $p(s' \mid s, a)$ is the state transition probability, $R$ is the reward function, and $\gamma \in [0,1]$ is the discount factor.

State Space Design

The agent's state space $S \subset \mathbb{R}^{16}$ contains the spacecraft's absolute dynamical state and its deviation relative to the target orbit:

$$
\mathbf{s}_t = [\mathbf{r}_t, \mathbf{v}_t, \tilde{m}_t, \Delta\mathbf{r}_t, \Delta\mathbf{v}_t, \Delta d_t, \Delta v_t, t_{\text{el},t}]^\top \in \mathbb{R}^{16}
$$

| State Component | Dimension | Description |
| --- | --- | --- |
| $\mathbf{r}_t = [x_t, y_t, z_t]$ | 3 | Position in rotating frame |
| $\mathbf{v}_t = [\dot{x}_t, \dot{y}_t, \dot{z}_t]$ | 3 | Velocity in rotating frame |
| $\tilde{m}_t$ | 1 | Normalized spacecraft mass |
| $\Delta\mathbf{r}_t = \mathbf{r}_t - \mathbf{r}_{\text{ref},t}$ | 3 | Position deviation (relative to nearest target orbit point) |
| $\Delta\mathbf{v}_t = \mathbf{v}_t - \mathbf{v}_{\text{ref},t}$ | 3 | Velocity deviation |
| $\Delta d_t = \lVert\Delta\mathbf{r}_t\rVert$ | 1 | Euclidean position error |
| $\Delta v_t = \lVert\Delta\mathbf{v}_t\rVert$ | 1 | Velocity error magnitude |
| $t_{\text{el},t}$ | 1 | Elapsed time, normalized by the maximum episode length |

This combination of absolute state and relative error simultaneously captures the spacecraft's current dynamical configuration and its guidance deviation from the target orbit, and has been shown to facilitate stable A2PPO training.
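As a rough illustration of how the 16 components could be assembled, the sketch below builds the observation vector from the quantities in the table. The function name `build_state`, its argument names, and the sample numbers are assumptions made for this example; choosing the nearest target orbit point (`r_ref`, `v_ref`) is assumed to happen elsewhere.

```python
import math


def build_state(r, v, m_norm, r_ref, v_ref, t_elapsed, t_max):
    """Assemble a 16-element observation (hypothetical helper).

    r, v         : position and velocity in the rotating frame (3 values each)
    m_norm       : normalized spacecraft mass
    r_ref, v_ref : reference point on the target orbit (assumed precomputed)
    t_elapsed    : elapsed episode time
    t_max        : maximum episode length, in the same unit as t_elapsed
    """
    dr = [a - b for a, b in zip(r, r_ref)]        # position deviation
    dv = [a - b for a, b in zip(v, v_ref)]        # velocity deviation
    dd = math.sqrt(sum(c * c for c in dr))        # Euclidean position error
    dv_mag = math.sqrt(sum(c * c for c in dv))    # velocity error magnitude
    t_el = t_elapsed / t_max                      # normalized elapsed time
    return [*r, *v, m_norm, *dr, *dv, dd, dv_mag, t_el]


# Illustrative values only
state = build_state(
    r=[3.0, 4.0, 0.0], v=[0.1, 0.2, 0.3], m_norm=0.9,
    r_ref=[0.0, 0.0, 0.0], v_ref=[0.1, 0.2, 0.3],
    t_elapsed=50.0, t_max=200.0,
)
print(len(state))
```

The ordering of the returned list follows the state vector definition above, so indices 13 to 15 hold the two scalar errors and the normalized time.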

Action Space Design

The agent outputs a continuous action $\mathbf{a}_t = (a_1, a_2, a_3) \in [-1,1]^3$ at each time step, using a spherical coordinate parameterization:

| Action Component | Mapping | Physical Meaning |
| --- | --- | --- |
| $a_1$ | $\nu = (a_1 + 1)/2 \in [0,1]$ | Throttle (thrust magnitude fraction) |
| $a_2$ | $\phi = \pi a_2 \in [-\pi, \pi]$ | Azimuth angle |
| $a_3$ | $\theta = (\pi/2) a_3 \in [-\pi/2, \pi/2]$ | Elevation angle |

The dimensionless thrust control vector is:

$$
\mathbf{u} = \nu \cdot \hat{\mathbf{u}}, \quad \hat{\mathbf{u}} = (\cos\theta\cos\phi, \cos\theta\sin\phi, \sin\theta)
$$
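The mapping from the raw action to the thrust vector can be sketched as follows. The function name `action_to_thrust` is hypothetical, and clipping the raw action to $[-1, 1]$ before mapping is an assumption of this example rather than something stated above.

```python
import math


def action_to_thrust(a1, a2, a3):
    """Map a raw action in [-1, 1]^3 to (throttle, thrust vector)."""
    # Assumption: out-of-range policy outputs are clipped before mapping.
    a1, a2, a3 = (max(-1.0, min(1.0, a)) for a in (a1, a2, a3))
    nu = (a1 + 1.0) / 2.0          # throttle in [0, 1]
    phi = math.pi * a2             # azimuth in [-pi, pi]
    theta = 0.5 * math.pi * a3     # elevation in [-pi/2, pi/2]
    u_hat = (
        math.cos(theta) * math.cos(phi),
        math.cos(theta) * math.sin(phi),
        math.sin(theta),
    )
    u = tuple(nu * c for c in u_hat)   # dimensionless thrust control vector
    return nu, u


nu, u = action_to_thrust(0.0, 0.5, 0.5)
print(nu, u)
```

Because $\hat{\mathbf{u}}$ is a unit vector by construction, the norm of $\mathbf{u}$ equals the throttle $\nu$, which keeps the thrust magnitude constraint satisfied for any action in the box.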

Reward Function Design

The per-step reward combines potential-based shaping, time and fuel penalties, a safety constraint term, and a terminal reward:

$$
r_t = \underbrace{\Delta\Phi(\mathbf{s}_t, \mathbf{s}_{t-1})}_{\text{Potential shaping}} - \underbrace{c_t - c_f \Delta m_t}_{\text{Time and fuel cost}} + \underbrace{r_{\text{safe},t}}_{\text{Safety constraint}} + \underbrace{\Omega_t}_{\text{Terminal reward}}
$$

Potential Function

$$
\Phi(\mathbf{s}) = -w_1^{\text{pos}}\Delta d - w_1^{\text{vel}}\Delta v + w_2^{\text{pos}} e^{-w_3^{\text{pos}}\Delta d} + w_2^{\text{vel}} e^{-w_3^{\text{vel}}\Delta v}
$$

The exponential terms approach $w_2^{\text{pos}}$ and $w_2^{\text{vel}}$ as $\Delta d, \Delta v \to 0$, while the linear terms provide sustained directional guidance.
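A minimal sketch of the potential function follows. The weight values are placeholders chosen for illustration, not values from the reference, and the shaping term is assumed here to be the plain difference $\Phi(\mathbf{s}_t) - \Phi(\mathbf{s}_{t-1})$ with no discount factor applied.

```python
import math

# Hypothetical weights for illustration only
WEIGHTS = {
    "w1_pos": 1.0, "w1_vel": 0.5,
    "w2_pos": 2.0, "w2_vel": 1.0,
    "w3_pos": 10.0, "w3_vel": 10.0,
}


def potential(dd, dv, w1_pos, w1_vel, w2_pos, w2_vel, w3_pos, w3_vel):
    """Potential Phi as a function of position error dd and velocity error dv."""
    linear = -w1_pos * dd - w1_vel * dv            # sustained guidance far away
    bonus = (w2_pos * math.exp(-w3_pos * dd)       # sharp bonus near the target
             + w2_vel * math.exp(-w3_vel * dv))
    return linear + bonus


def shaping_term(prev_err, curr_err, weights=WEIGHTS):
    """Assumed shaping term: Phi(s_t) - Phi(s_{t-1}); errors are (dd, dv) pairs."""
    return potential(*curr_err, **weights) - potential(*prev_err, **weights)


print(potential(0.0, 0.0, **WEIGHTS))
print(shaping_term((1.0, 1.0), (0.5, 0.5)))
```

With positive weights the potential decreases monotonically in both errors, so any step that reduces $\Delta d$ and $\Delta v$ yields a positive shaping term.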

Terminal Reward

| Condition | Reward |
| --- | --- |
| Successful orbit insertion | $+1000$ |
| Moon collision / fuel depletion | $-1000$ |
| Timeout | $0$ |

Moon Safety Constraint

$$
r_{\text{safe},t} = \begin{cases} -c_s\left(1 - \dfrac{\|\mathbf{r}_t - \mathbf{r}_M\|}{\beta R_M}\right)^2 & \text{if } \|\mathbf{r}_t - \mathbf{r}_M\| < \beta R_M \\ 0 & \text{otherwise} \end{cases}
$$

where $\beta = 3$ is the safety buffer multiplier and $R_M = 1737.4$ km is the Moon's radius.
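The piecewise penalty can be written as a short function. This is a sketch in kilometres; the function name and the value of `c_s` used in the example call are assumptions, since the penalty coefficient is not given above. The same code works in nondimensional units as long as the distance and the radius share a unit.

```python
R_MOON_KM = 1737.4   # Moon radius from the text
BETA = 3.0           # safety buffer multiplier from the text


def moon_safety_penalty(dist_km, c_s, beta=BETA, r_moon=R_MOON_KM):
    """Quadratic penalty inside the buffer sphere of radius beta * r_moon."""
    buffer_radius = beta * r_moon
    if dist_km < buffer_radius:
        return -c_s * (1.0 - dist_km / buffer_radius) ** 2
    return 0.0


# c_s = 10.0 is a placeholder coefficient
for d in (10000.0, 5000.0, 1737.4):
    print(d, moon_safety_penalty(d, c_s=10.0))
```

The penalty is zero at the buffer boundary and grows smoothly as the spacecraft approaches the Moon, reaching $-\tfrac{4}{9} c_s$ at the lunar surface for $\beta = 3$.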

Episode Termination Conditions

| Termination Type | Condition | Result |
| --- | --- | --- |
| Success | $\Delta d < \Delta d_{\text{thr}}$ and $\Delta v < \Delta v_{\text{thr}}$ | $+1000$ |
| Moon collision | $r_{M,t} \leq R_M$ | $-1000$ |
| Fuel depletion | $m_t \leq m_{\min}$ | $-1000$ |
| Timeout | Maximum episode length reached | $0$ |
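The termination table can be sketched as a single check run after every step. The function name, the order in which the conditions are tested, and all threshold values in the example configuration are assumptions for illustration; the Moon radius is nondimensionalized here with an assumed length unit of 384,400 km.

```python
def check_termination(dd, dv, dist_moon, mass, step, *,
                      dd_thr, dv_thr, r_moon, m_min, max_steps):
    """Return (label, terminal_reward); label is None while the episode continues."""
    if dd < dd_thr and dv < dv_thr:
        return "success", 1000.0
    if dist_moon <= r_moon:
        return "moon_collision", -1000.0
    if mass <= m_min:
        return "fuel_depletion", -1000.0
    if step >= max_steps:
        return "timeout", 0.0
    return None, 0.0


# Placeholder thresholds in nondimensional units
cfg = dict(dd_thr=1e-3, dv_thr=1e-3, r_moon=1737.4 / 384400.0,
           m_min=0.5, max_steps=1000)

print(check_termination(1e-4, 1e-4, 0.5, 0.8, 10, **cfg))
print(check_termination(0.2, 0.1, 0.5, 0.8, 10, **cfg))
```

Testing success first means an insertion achieved on the final permitted step is still rewarded; a different priority order is equally plausible and would only matter when two conditions hold at once.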

Transition Probabilities

State transitions in the CR3BP-LT environment are described by the following ordinary differential equations:

$$
\dot{\mathbf{x}} = f(\mathbf{x}, \mathbf{u}), \quad \mathbf{x} = [\mathbf{r}, \mathbf{v}, \tilde{m}]^\top
$$

Numerical integration uses an adaptive Runge-Kutta 4(5) integrator (relative tolerance $10^{-9}$, absolute tolerance $10^{-12}$).
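The explicit form of $f$ is not spelled out above, so the sketch below uses one common nondimensional convention for the CR3BP with low thrust. The thrust acceleration is written as $a_{\max}\mathbf{u}/\tilde{m}$ and the mass rate as $-a_{\max}\lVert\mathbf{u}\rVert/c_{\text{ex}}$, where `a_max` (maximum thrust acceleration at initial mass) and `c_ex` (exhaust velocity) are hypothetical nondimensional parameters. The value of `MU` is an approximate Earth-Moon mass parameter.

```python
import math

MU = 0.01215  # approximate Earth-Moon mass parameter


def cr3bp_lt_rhs(x, u, a_max, c_ex, mu=MU):
    """Right-hand side f(x, u) for x = [r, v, m] in the rotating frame."""
    px, py, pz, vx, vy, vz, m = x
    r1 = math.sqrt((px + mu) ** 2 + py ** 2 + pz ** 2)        # distance to Earth
    r2 = math.sqrt((px - 1.0 + mu) ** 2 + py ** 2 + pz ** 2)  # distance to Moon
    # Gravity, Coriolis and centrifugal terms plus thrust acceleration
    ax = (2.0 * vy + px
          - (1.0 - mu) * (px + mu) / r1 ** 3
          - mu * (px - 1.0 + mu) / r2 ** 3
          + a_max * u[0] / m)
    ay = (-2.0 * vx + py
          - (1.0 - mu) * py / r1 ** 3
          - mu * py / r2 ** 3
          + a_max * u[1] / m)
    az = (-(1.0 - mu) * pz / r1 ** 3
          - mu * pz / r2 ** 3
          + a_max * u[2] / m)
    throttle = math.sqrt(sum(c * c for c in u))   # equals nu for u = nu * u_hat
    m_dot = -a_max * throttle / c_ex              # assumed mass-flow model
    return [vx, vy, vz, ax, ay, az, m_dot]


# Sanity check: the triangular point L4 is an equilibrium when thrust is off
x_l4 = [0.5 - MU, math.sqrt(3.0) / 2.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(cr3bp_lt_rhs(x_l4, (0.0, 0.0, 0.0), a_max=0.01, c_ex=2.0))
```

A function of this shape could be wrapped as `fun(t, y)` and handed to an adaptive Runge-Kutta 4(5) routine such as SciPy's `solve_ivp` with `method="RK45"`, `rtol=1e-9`, and `atol=1e-12`, with the control held constant over each decision step.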

References

  • Ul Haq I U, Dai H, Du C. Autonomous low-thrust trajectory optimization in cislunar space via attention-augmented reinforcement learning[J]. Aerospace Science and Technology, 2026.
Last Updated: 4/29/26, 11:30 AM
Contributors: Hermes Agent, Cron Job