Deep Deterministic Policy Gradient (DDPG)

Author: Tianjiang Says
Contributing institutions: School of Astronautics, Harbin Institute of Technology; National Key Laboratory of Rapid Design and Intelligent Swarm for Micro/Nano Spacecraft
References: Guan Yutong et al. Hyperparameter Auto-Tuning and Homotopy Methods for Spacecraft Long-Range Cooperative Rendezvous, Spacecraft Environment Engineering, 2026.

Definition

Deep Deterministic Policy Gradient (DDPG) is a deep reinforcement learning algorithm proposed by Lillicrap et al. in 2015. It combines the Actor-Critic framework with experience replay and target networks to learn a deterministic policy, which makes it suited to tasks with continuous action spaces. It has been widely applied in fields such as robotic control and spacecraft trajectory optimization.
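The experience replay mechanism mentioned above can be illustrated with a minimal sketch. The class name `ReplayBuffer`, its methods, and the transition layout `(state, action, reward, next_state, done)` are assumptions chosen for illustration, not taken from the cited works.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions (illustrative)."""

    def __init__(self, capacity):
        # A bounded deque evicts the oldest transition once capacity is reached
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement reduces temporal correlation
        batch = rng.sample(list(self.storage), batch_size)
        # Regroup into per-field tuples: states, actions, rewards, ...
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=[float(t)], action=[0.1 * t], reward=-float(t),
               next_state=[float(t + 1)], done=False)

states, actions, rewards, next_states, dones = buffer.sample(2)
```

Because the buffer holds only three transitions here, the two oldest of the five are discarded before sampling. In practice the capacity is typically far larger.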

Algorithm Architecture

DDPG employs an Actor-Critic structure with four networks: an online Actor and Critic, plus a target copy of each:

Actor network $\mu(s|\theta^\mu)$: Given a state $s$, outputs a deterministic action $a$
Critic network $Q(s,a|\theta^Q)$: Evaluates the value of a state-action pair
Target-Actor network $\mu'(s|\theta^{\mu'})$: Supplies the next action used in the Critic's learning target, which stabilizes training
Target-Critic network $Q'(s,a|\theta^{Q'})$: Supplies the value used in the Critic's learning target, which stabilizes training
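In the original DDPG paper, the target networks stabilize training by tracking the online networks slowly through a soft update, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ with $\tau \ll 1$. The sketch below shows that update on a flat list of numbers standing in for network weights. The function name `soft_update` and the value `tau=0.01` are illustrative assumptions.

```python
def soft_update(target_params, online_params, tau):
    """Return target parameters after theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]


actor_params = [0.5, -1.0, 2.0]            # stand-in for theta^mu
target_actor_params = list(actor_params)   # targets start as exact copies

actor_params = [1.5, 0.0, 2.0]             # pretend a gradient step changed the Actor
target_actor_params = soft_update(target_actor_params, actor_params, tau=0.01)
```

With a small `tau`, the target weights move only 1% of the way toward the online weights per update, so the learning target seen by the Critic changes gradually.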

Core Formulas

Loss function of the Critic network:

$$L(\theta^Q) = \mathbb{E}\left[\left(r + \gamma Q'\left(s',\mu'(s'|\theta^{\mu'})\,\middle|\,\theta^{Q'}\right) - Q(s,a|\theta^Q)\right)^2\right]$$
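As a worked illustration of this loss, the sketch below computes the mean squared temporal-difference error over a small batch, taking the next action from the target Actor. The toy functions `q`, `q_target`, and `mu_target` and the batch values are hypothetical. Skipping the bootstrap term on terminal transitions (`done`) is a common implementation detail that does not appear in the formula above.

```python
def critic_loss(batch, q, q_target, mu_target, gamma):
    """Mean squared TD error over a batch of (s, a, r, s_next, done) tuples."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        a_next = mu_target(s_next)  # next action from the target Actor
        # Terminal transitions do not bootstrap from the next state
        y = r + (0.0 if done else gamma * q_target(s_next, a_next))
        total += (y - q(s, a)) ** 2
    return total / len(batch)


# Toy scalar stand-ins for the networks (illustrative only)
q = lambda s, a: s + a
q_target = lambda s, a: s + a
mu_target = lambda s: 0.5 * s

batch = [
    (1.0, 0.0, 1.0, 2.0, False),  # y = 1 + 0.9 * Q'(2, 1) = 3.7, Q = 1
    (2.0, 1.0, 0.0, 0.0, True),   # terminal: y = 0, Q = 3
]
loss = critic_loss(batch, q, q_target, mu_target, gamma=0.9)
```

The two squared errors are $2.7^2 = 7.29$ and $3^2 = 9$, so the batch loss is their mean, 8.145.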

Gradient of the Actor network:

$$\nabla_{\theta^\mu} J \approx \mathbb{E}\left[\nabla_a Q(s,a|\theta^Q)\big|_{a=\mu(s|\theta^\mu)} \nabla_{\theta^\mu}\mu(s|\theta^\mu)\right]$$
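The chain rule in this gradient can be checked on a toy problem. The sketch assumes a scalar linear Actor $\mu(s|\theta) = \theta s$ and a fixed, hand-written Critic $Q(s,a) = -(a - 2s)^2$, both hypothetical, so the best policy is $\theta = 2$. In DDPG the Critic is a learned network and the derivatives come from automatic differentiation. Here they are written out by hand.

```python
def actor(theta, s):
    return theta * s                  # mu(s | theta)


def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)       # derivative of Q(s, a) = -(a - 2s)^2 w.r.t. a


def policy_gradient(theta, states):
    # Chain rule: grad_a Q evaluated at a = mu(s), times d mu / d theta (= s),
    # averaged over the sampled states
    return sum(dq_da(s, actor(theta, s)) * s for s in states) / len(states)


states = [0.5, 1.0, -1.5, 2.0]
initial_gradient = policy_gradient(0.0, states)

theta = 0.0
for _ in range(200):
    theta += 0.05 * policy_gradient(theta, states)  # gradient ascent on J
```

Starting from $\theta = 0$, repeated ascent steps drive $\theta$ toward 2, the value at which the Actor's action maximizes the toy Critic for every state.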

Application in Trajectory Optimization

In spacecraft cooperative rendezvous problems, DDPG is used for hyperparameter auto-tuning:

State design: Stagnation time, duration, iteration progress, particle distribution dispersion, particle distribution direction
Action output: Hyperparameters of the Hybrid Cluster Particle Swarm Optimization (HCPSO) algorithm, such as the inertia weight and acceleration factors
Reward function: Designed based on the difference between global best fitness and current fitness
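To make the action output concrete, the sketch below maps a bounded Actor output onto hyperparameter ranges. The ranges, the names in `HYPERPARAM_BOUNDS`, and the assumption that the Actor emits values in $[-1, 1]$ (for example through a tanh output layer) are hypothetical and are not taken from the cited paper.

```python
# Hypothetical ranges for illustration; the cited paper's bounds are not given here
HYPERPARAM_BOUNDS = {
    "inertia_weight": (0.4, 0.9),
    "cognitive_factor": (0.5, 2.5),
    "social_factor": (0.5, 2.5),
}


def scale_action(raw_action, bounds=HYPERPARAM_BOUNDS):
    """Map each Actor output in [-1, 1] linearly onto its hyperparameter range."""
    scaled = {}
    for value, (name, (low, high)) in zip(raw_action, bounds.items()):
        clipped = max(-1.0, min(1.0, value))  # guard against exploration noise
        scaled[name] = low + 0.5 * (clipped + 1.0) * (high - low)
    return scaled


params = scale_action([-1.0, 0.0, 1.7])
```

An output of -1 maps to the lower bound, 0 to the midpoint, and values beyond 1 (here 1.7) are clipped to the upper bound, so the optimizer never receives hyperparameters outside the allowed ranges.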

Application by Guan Yutong et al. (2026)

Guan Yutong et al. combined DDPG with Hybrid Cluster Particle Swarm Optimization (HCPSO) to form Reinforcement Learning Enhanced Particle Swarm Optimization (RLEPSO), which is used for:

Initial costate optimization in cooperative rendezvous fuel-optimal problems
Autonomous dynamic tuning of hyperparameters based on particle search conditions
Improving the search capability and convergence speed of the optimization algorithm

References

Lillicrap T P, et al. Continuous control with deep reinforcement learning[J]. arXiv:1509.02971, 2015.
Guan Yutong, Gao Changsheng, Hu Yudong, Zhao Han. Hyperparameter Auto-Tuning and Homotopy Methods for Spacecraft Long-Range Cooperative Rendezvous[J]. Spacecraft Environment Engineering, 2026. [in Chinese]