In deep reinforcement learning frameworks such as A2PPO, the cislunar low-thrust orbit transfer problem is formulated as a finite-horizon Markov Decision Process (MDP), defined as the tuple $(S, A, p, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $p(s'|s,a)$ is the state transition probability, $R$ is the reward function, and $\gamma \in [0,1]$ is the discount factor [1].
The agent's state space $S \subset \mathbb{R}^{16}$ contains the spacecraft's absolute dynamical state and its relative deviation from the target orbit:
$$\mathbf{s}_t = [\mathbf{r}_t, \mathbf{v}_t, \tilde{m}_t, \Delta\mathbf{r}_t, \Delta\mathbf{v}_t, \Delta d_t, \Delta v_t, t_{\text{el},t}]^\top \in \mathbb{R}^{16}$$
| State Component | Dimension | Description |
| --- | --- | --- |
| $\mathbf{r}_t = [x_t, y_t, z_t]$ | 3 | Position in rotating frame |
| $\mathbf{v}_t = [\dot{x}_t, \dot{y}_t, \dot{z}_t]$ | 3 | Velocity in rotating frame |
| $\tilde{m}_t$ | 1 | Normalized spacecraft mass |
| $\Delta\mathbf{r}_t = \mathbf{r}_t - \mathbf{r}_{\text{ref},t}$ | 3 | Position deviation (relative to nearest target-orbit point) |
| $\Delta\mathbf{v}_t = \mathbf{v}_t - \mathbf{v}_{\text{ref},t}$ | 3 | Velocity deviation |
| $\Delta d_t = \|\Delta\mathbf{r}_t\|$ | 1 | Euclidean position error |
| $\Delta v_t = \|\Delta\mathbf{v}_t\|$ | 1 | Velocity error magnitude |
| $t_{\text{el},t}$ | 1 | Elapsed time, normalized by maximum episode length |
This combination of absolute state and relative error simultaneously captures the spacecraft's current dynamical configuration and its guidance deviation from the target orbit, and has been shown to facilitate stable A2PPO training.
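For concreteness, below is a minimal sketch of assembling this observation vector, assuming NumPy and a target orbit discretized as an array of reference states; the function name `build_state` and the nearest-point lookup are illustrative choices, not code from [1].

```python
import numpy as np

def build_state(r, v, m_norm, ref_orbit, t_elapsed, t_max):
    """Assemble the 16-dimensional observation s_t (illustrative sketch).

    r, v      : spacecraft position/velocity in the rotating frame, shape (3,)
    m_norm    : normalized spacecraft mass (scalar)
    ref_orbit : discretized target-orbit states, shape (N, 6) as [r_ref, v_ref]
    t_elapsed : elapsed episode time
    t_max     : maximum episode length (for normalization)
    """
    # Nearest point on the target orbit, selected by position distance
    idx = np.argmin(np.linalg.norm(ref_orbit[:, :3] - r, axis=1))
    r_ref, v_ref = ref_orbit[idx, :3], ref_orbit[idx, 3:]

    dr = r - r_ref                # position deviation (3,)
    dv = v - v_ref                # velocity deviation (3,)
    dd = np.linalg.norm(dr)       # Euclidean position error
    dvm = np.linalg.norm(dv)      # velocity error magnitude

    # 3 + 3 + 1 + 3 + 3 + 1 + 1 + 1 = 16 components
    return np.concatenate([r, v, [m_norm], dr, dv,
                           [dd, dvm, t_elapsed / t_max]])
```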
The agent outputs a continuous action $\mathbf{a}_t = (a_1, a_2, a_3) \in [-1,1]^3$ at each time step, using a spherical-coordinate parameterization:
| Action Component | Mapping | Physical Meaning |
| --- | --- | --- |
| $a_1$ | $\nu = (a_1 + 1)/2 \in [0,1]$ | Throttle (thrust magnitude fraction) |
| $a_2$ | $\phi = \pi a_2 \in [-\pi, \pi]$ | Azimuth angle |
| $a_3$ | $\theta = (\pi/2)\,a_3 \in [-\pi/2, \pi/2]$ | Elevation angle |
The dimensionless thrust control vector is:
$$\mathbf{u} = \nu \cdot \hat{\mathbf{u}}, \qquad \hat{\mathbf{u}} = (\cos\theta\cos\phi,\ \cos\theta\sin\phi,\ \sin\theta)$$
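This mapping translates directly into code; a minimal sketch follows, with the helper name `action_to_thrust` assumed for illustration.

```python
import numpy as np

def action_to_thrust(a):
    """Map a raw policy action a = (a1, a2, a3) in [-1, 1]^3 to the
    dimensionless thrust vector u = nu * u_hat."""
    nu = (a[0] + 1.0) / 2.0           # throttle in [0, 1]
    phi = np.pi * a[1]                # azimuth in [-pi, pi]
    theta = (np.pi / 2.0) * a[2]      # elevation in [-pi/2, pi/2]
    u_hat = np.array([np.cos(theta) * np.cos(phi),
                      np.cos(theta) * np.sin(phi),
                      np.sin(theta)])  # unit thrust direction
    return nu * u_hat
```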
The reward function combines potential-based shaping, penalty terms, and safety constraints:
$$r_t = \underbrace{\Delta\Phi(\mathbf{s}_t, \mathbf{s}_{t-1})}_{\text{Potential shaping}} - \underbrace{c_t - c_f\,\Delta m_t}_{\text{Time and fuel cost}} + \underbrace{r_{\text{safe},t}}_{\text{Safety constraint}} + \underbrace{\Omega_t}_{\text{Terminal reward}}$$
$$\Phi(\mathbf{s}) = -w_1^{\text{pos}}\,\Delta d - w_1^{\text{vel}}\,\Delta v + w_2^{\text{pos}}\,e^{-w_3^{\text{pos}}\Delta d} + w_2^{\text{vel}}\,e^{-w_3^{\text{vel}}\Delta v}$$
The exponential terms approach $w_2^{\text{pos}}$ and $w_2^{\text{vel}}$ as $\Delta d, \Delta v \to 0$, while the linear terms provide sustained directional guidance far from the target.
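A sketch of the potential function and the resulting shaping term is given below; the weight values are placeholders, since the tuned weights from [1] are not reproduced in this section.

```python
import numpy as np

# Placeholder shaping weights (illustrative; not the tuned values from [1])
W1_POS, W1_VEL = 1.0, 1.0
W2_POS, W2_VEL = 1.0, 1.0
W3_POS, W3_VEL = 1.0, 1.0

def potential(dd, dvm):
    """Phi(s): linear terms give a far-field gradient; exponential terms
    sharpen the reward as the errors dd, dvm approach zero."""
    return (-W1_POS * dd - W1_VEL * dvm
            + W2_POS * np.exp(-W3_POS * dd)
            + W2_VEL * np.exp(-W3_VEL * dvm))

def shaping_reward(dd_t, dvm_t, dd_prev, dvm_prev):
    """Delta Phi(s_t, s_{t-1}): positive when the errors shrink."""
    return potential(dd_t, dvm_t) - potential(dd_prev, dvm_prev)
```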
The terminal reward $\Omega_t$ depends on the episode outcome:

| Condition | Reward $\Omega_t$ |
| --- | --- |
| Successful orbit insertion | $+1000$ |
| Moon collision / fuel depletion | $-1000$ |
| Timeout | $0$ |
$$r_{\text{safe},t} = \begin{cases} -c_s\left(1 - \dfrac{\|\mathbf{r}_t - \mathbf{r}_M\|}{\beta R_M}\right)^2 & \text{if } \|\mathbf{r}_t - \mathbf{r}_M\| < \beta R_M \\ 0 & \text{otherwise} \end{cases}$$
where $\beta = 3$ is the safety buffer multiplier and $R_M = 1737.4$ km is the Moon's radius.
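The safety term translates directly into code; a minimal sketch, assuming positions are expressed in kilometers and with a placeholder penalty coefficient $c_s$, which is not specified above:

```python
import numpy as np

R_MOON = 1737.4   # Moon radius [km]
BETA = 3.0        # safety buffer multiplier
C_S = 10.0        # penalty coefficient (placeholder; not given in the text)

def safety_reward(r, r_moon):
    """Quadratic penalty: 0 at the buffer boundary (beta * R_M) and
    approaching -c_s as the spacecraft nears the Moon's center."""
    d = np.linalg.norm(r - r_moon)   # distance to Moon center [km]
    if d < BETA * R_MOON:
        return -C_S * (1.0 - d / (BETA * R_MOON)) ** 2
    return 0.0
```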
Episode termination is governed by the following conditions (a code sketch follows the table):

| Termination Type | Condition | Result |
| --- | --- | --- |
| Success | $\Delta d < \Delta d_{\text{thr}}$ and $\Delta v < \Delta v_{\text{thr}}$ | $+1000$ |
| Moon collision | $r_{M,t} \leq R_M$ | $-1000$ |
| Fuel depletion | $m_t \leq m_{\min}$ | $-1000$ |
| Timeout | Maximum episode length reached | $0$ |

where $r_{M,t} = \|\mathbf{r}_t - \mathbf{r}_M\|$ is the distance to the Moon's center.
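A sketch of the corresponding termination logic; the threshold values are left as parameters since they are not specified in this section.

```python
def check_termination(dd, dvm, r_moon_dist, m, step,
                      dd_thr, dv_thr, m_min, max_steps):
    """Return (done, terminal_reward Omega_t) per the table above."""
    if dd < dd_thr and dvm < dv_thr:
        return True, 1000.0        # successful orbit insertion
    if r_moon_dist <= 1737.4:      # Moon collision (R_M in km)
        return True, -1000.0
    if m <= m_min:                 # fuel depletion
        return True, -1000.0
    if step >= max_steps:          # timeout
        return True, 0.0
    return False, 0.0
```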
State transitions in the CR3BP-LT environment are described by the following ordinary differential equations:
$$\dot{\mathbf{x}} = f(\mathbf{x}, \mathbf{u}), \qquad \mathbf{x} = [\mathbf{r}, \mathbf{v}, \tilde{m}]^\top$$
Numerical integration uses an adaptive Runge-Kutta 4(5) integrator (relative tolerance $10^{-9}$, absolute tolerance $10^{-12}$).
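As a sketch, the CR3BP-LT right-hand side and one RK45 propagation step might look as follows. The gravitational terms are the standard Earth-Moon CR3BP equations in the rotating frame; the thrust-acceleration and mass-flow terms use a common nondimensional form assumed here, since the paper's exact nondimensionalization is not reproduced in this section.

```python
import numpy as np
from scipy.integrate import solve_ivp

MU = 0.01215  # Earth-Moon mass parameter (approximate)

def cr3bp_lt_rhs(t, x, u, a_max, c_ex):
    """CR3BP with low thrust; x = [r, v, m_tilde] in the rotating frame.

    u     : dimensionless thrust vector (throttle * unit direction)
    a_max : maximum thrust acceleration (nondimensional; assumed form)
    c_ex  : nondimensional exhaust velocity (assumed form)
    """
    rx, ry, rz, vx, vy, vz, m = x
    r1 = np.sqrt((rx + MU)**2 + ry**2 + rz**2)      # distance to Earth
    r2 = np.sqrt((rx - 1 + MU)**2 + ry**2 + rz**2)  # distance to Moon
    at = a_max * np.asarray(u) / m                  # thrust acceleration
    ax = 2*vy + rx - (1 - MU)*(rx + MU)/r1**3 - MU*(rx - 1 + MU)/r2**3 + at[0]
    ay = -2*vx + ry - (1 - MU)*ry/r1**3 - MU*ry/r2**3 + at[1]
    az = -(1 - MU)*rz/r1**3 - MU*rz/r2**3 + at[2]
    m_dot = -a_max * np.linalg.norm(u) / c_ex       # mass depletion
    return [vx, vy, vz, ax, ay, az, m_dot]

def propagate(x0, u, dt, a_max=0.1, c_ex=1.0):
    """One environment step: integrate over [0, dt] with RK45 and the
    tolerances quoted above."""
    sol = solve_ivp(cr3bp_lt_rhs, (0.0, dt), x0, args=(u, a_max, c_ex),
                    method='RK45', rtol=1e-9, atol=1e-12)
    return sol.y[:, -1]
```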
[1] I. U. Ul Haq, H. Dai, and C. Du, "Autonomous low-thrust trajectory optimization in cislunar space via attention-augmented reinforcement learning," Aerospace Science and Technology, 2026.