Reinforcement Learning Enhanced Particle Swarm Optimization (RLEPSO)

Author: Tianjiang Says
Contributing institutions: School of Astronautics, Harbin Institute of Technology; National Key Laboratory of Rapid Design and Intelligent Swarm for Micro/Nano Spacecraft
References: Guan Yutong et al. Hyperparameter Auto-Tuning and Homotopy Methods for Spacecraft Long-Range Cooperative Rendezvous, Spacecraft Environment Engineering, 2026.

Definition

Reinforcement Learning Enhanced Particle Swarm Optimization (RLEPSO) is a hybrid optimization algorithm that combines Deep Deterministic Policy Gradient (DDPG) with Hybrid Cluster Particle Swarm Optimization (HCPSO). The DDPG Actor network adjusts the HCPSO hyperparameters dynamically according to the particle search state, so the algorithm tunes its own parameters during a run, which improves its search capability and convergence speed.

Core Principles

Algorithm Architecture

RLEPSO embeds the DDPG framework on top of HCPSO:

1. Initialization: Set the initial HCPSO parameters and build the DDPG Actor-Critic networks
2. State perception: Compute the state variables from the current iteration (stagnation start time, stagnation duration, iteration progress, particle distribution dispersion, particle distribution direction)
3. Action output: The Actor network outputs hyperparameter adjustment actions
4. Parameter update: Decode the actions into HCPSO parameters such as the inertia weight and acceleration factors
5. Experience replay: Store experience samples for training the Critic network
6. Iterative optimization: Repeat steps 2-5 until convergence
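
The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain global-best PSO stands in for HCPSO, `stub_actor` returns random actions in place of a trained DDPG Actor, `decode` maps three action components to hypothetical parameter ranges, the state is reduced to two components, the velocity clamp `v_max` is added for numerical safety, and no network training is performed (the replay buffer is only filled). The reward follows the formula given in this article, with $f_g$ taken as the global best before the current generation is evaluated, which is one possible reading.

```python
from collections import deque

import numpy as np


def sphere(x):
    """Toy objective (assumption): sum of squares, minimized at the origin."""
    return np.sum(x ** 2, axis=-1)


def stub_actor(state, rng):
    """Placeholder for a trained DDPG Actor: random actions in [-1, 1]."""
    return rng.uniform(-1.0, 1.0, size=3)


def decode(action):
    """Hypothetical decoding of actions into (w, c1, c2)."""
    w = 0.4 + 0.25 * (action[0] + 1.0)   # inertia weight in [0.4, 0.9]
    c1 = 1.0 + 0.5 * (action[1] + 1.0)   # cognitive factor in [1.0, 2.0]
    c2 = 1.0 + 0.5 * (action[2] + 1.0)   # social factor in [1.0, 2.0]
    return w, c1, c2


def run(n_particles=20, dim=5, max_iter=100, v_max=1.0, seed=0):
    rng = np.random.default_rng(seed)

    # Initialization: swarm, personal bests, global best, replay buffer.
    x = rng.uniform(-5.0, 5.0, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), sphere(x)
    g_idx = np.argmin(pbest_f)
    gbest, gbest_f = pbest[g_idx].copy(), pbest_f[g_idx]
    replay = deque(maxlen=10_000)
    stagnation = 0
    state = np.array([0.0, 0.0])  # simplified state: [stagnation, progress]

    for it in range(1, max_iter + 1):
        # Action output and parameter update.
        action = stub_actor(state, rng)
        w, c1, c2 = decode(action)

        # One PSO step with the decoded hyperparameters.
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        v = np.clip(v, -v_max, v_max)
        x = x + v
        f = sphere(x)

        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]

        # Reward: tanh((f_g - f_c) * T_run), using the previous global best.
        t_run = it / max_iter
        reward = float(np.tanh((gbest_f - f.min()) * t_run))

        if pbest_f.min() < gbest_f:
            g_idx = np.argmin(pbest_f)
            gbest, gbest_f = pbest[g_idx].copy(), pbest_f[g_idx]
            stagnation = 0
        else:
            stagnation += 1

        # State perception for the next step, then experience replay.
        next_state = np.array([stagnation / max_iter, t_run])
        replay.append((state, action, reward, next_state))
        state = next_state

    return float(gbest_f), replay
```

Calling `run(max_iter=50)` returns the best fitness found and a buffer holding one transition per iteration. In a full DDPG setup, minibatches sampled from that buffer would drive the network updates.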

State Design

RLEPSO uses the following state variables:

State Variable	Definition	Physical Meaning
$T_{stop}$	Stagnation start time	Detects whether the algorithm has stagnated
$T_{dur}$	Stagnation duration	Assesses the severity of stagnation
$T_{run}$	Iteration progress	Ratio of current iteration to maximum iterations
$D_s$	Particle distribution dispersion	Characterizes the degree of particle clustering
$D_r$	Particle distribution direction	Characterizes the directional properties of particle distribution
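
As an illustration of how such a state vector might be assembled, the sketch below computes the five quantities from a swarm snapshot. The article does not give formulas for these variables, so every formula here is an assumption: the time-related variables are normalized by the maximum iteration count, $D_s$ is taken as the mean distance to the swarm centroid divided by the search-space diagonal, and $D_r$ is taken as the share of positional variance along the principal axis. The function name `swarm_state` and its parameters are hypothetical.

```python
import numpy as np


def swarm_state(positions, iteration, last_improve_iter, max_iter, lower, upper):
    """Return [T_stop, T_dur, T_run, D_s, D_r] under the stated assumptions."""
    positions = np.asarray(positions, dtype=float)

    # Time-related variables, normalized to [0, 1] (assumption).
    t_stop = last_improve_iter / max_iter                 # when stagnation began
    t_dur = (iteration - last_improve_iter) / max_iter    # how long it has lasted
    t_run = iteration / max_iter                          # iteration progress

    # Dispersion (assumption): mean distance to the centroid, scaled by
    # the diagonal of the search box.
    centroid = positions.mean(axis=0)
    centered = positions - centroid
    diagonal = np.linalg.norm(np.asarray(upper, float) - np.asarray(lower, float))
    d_s = np.linalg.norm(centered, axis=1).mean() / diagonal

    # Direction (assumption): fraction of variance along the principal axis,
    # from the singular values of the centered positions.
    variances = np.linalg.svd(centered, compute_uv=False) ** 2
    total = variances.sum()
    d_r = variances[0] / total if total > 0 else 0.0

    return np.array([t_stop, t_dur, t_run, d_s, d_r])
```

Under these definitions, a swarm collapsed onto a line gives $D_r = 1$, while an isotropic cloud in $n$ dimensions gives a value near $1/n$.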

Actions and Reward

Action: The Actor network outputs a 16-dimensional action vector, which is decoded into 8 HCPSO hyperparameters ($\omega_1, c_{11}, c_{12}, \mu_q, \omega_2, c_{21}, c_{22}, p_s$).

Reward function:

$$r = \tanh\left((f_g - f_c) \cdot T_{run}\right)$$

where $f_g$ is the global best fitness and $f_c$ is the current generation best fitness.
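
The reward can be written directly as a small Python function. This is a sketch of the formula only, and the function and argument names are illustrative.

```python
import math


def reward(f_g, f_c, t_run):
    """r = tanh((f_g - f_c) * T_run), bounded in (-1, 1)."""
    return math.tanh((f_g - f_c) * t_run)


# Same fitness gap, early versus late in the run.
early = reward(f_g=2.0, f_c=1.0, t_run=0.1)
late = reward(f_g=2.0, f_c=1.0, t_run=1.0)
```

Because the fitness gap is multiplied by $T_{run}$, the same gap yields a larger reward magnitude late in the run (`late > early` here), and the $\tanh$ keeps the reward bounded regardless of the fitness scale.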

Application in Spacecraft Cooperative Rendezvous

Guan Yutong et al. (2026) combined RLEPSO with the homotopy method to solve the fuel-optimal long-range cooperative rendezvous problem for spacecraft under $J_2$ perturbation:

Energy-optimal solution: RLEPSO rapidly obtains high-quality initial costates
Homotopy transition: Smooth transition from energy-optimal to fuel-optimal
Results: Compared with PSO and HCPSO, RLEPSO obtained higher-quality initial costates with faster convergence
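
For orientation, the energy-to-fuel transition is often written with a homotopy parameter $\varepsilon$ in the performance index. The form below is a commonly used one for a single thrust-limited spacecraft and is given here as an assumption; the article does not state the exact index used for the cooperative problem. The symbols $T_{\max}$ (maximum thrust), $I_{sp}$ (specific impulse), $g_0$ (standard gravity), and $u$ (throttle) are introduced for illustration.

$$J_\varepsilon = \frac{T_{\max}}{I_{sp}\, g_0} \int_{t_0}^{t_f} \left[ u - \varepsilon\, u (1 - u) \right] \mathrm{d}t, \qquad u \in [0, 1], \quad \varepsilon \in [0, 1]$$

With $\varepsilon = 1$ the integrand reduces to $u^2$ (the energy-optimal problem), and with $\varepsilon = 0$ it reduces to $u$ (the fuel-optimal problem). Decreasing $\varepsilon$ step by step from 1 to 0, each solution serves as the initial guess for the next, which is why good initial costates for the energy-optimal problem matter.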

Simulation Results

Metric	RLEPSO-Homotopy	Coupled Homotopy-SQP
Fuel consumption	205.40 kg	210.36 kg
Rendezvous time	208.89 TU	225.44 TU
Terminal rendezvous distance	0.7078 km	9.3624 km
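
The relative differences implied by the table can be checked with a few lines of Python. The values are copied from the table above; the helper name `relative_reduction` is illustrative.

```python
# (RLEPSO-Homotopy, Homotopy-SQP) pairs taken from the table above.
results = {
    "fuel_kg": (205.40, 210.36),
    "time_TU": (208.89, 225.44),
    "terminal_distance_km": (0.7078, 9.3624),
}


def relative_reduction(new, baseline):
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100.0


reductions = {
    name: round(relative_reduction(new, baseline), 2)
    for name, (new, baseline) in results.items()
}
print(reductions)
```

This gives roughly 2.4% less fuel, 7.3% shorter rendezvous time, and about 92% smaller terminal rendezvous distance for RLEPSO-Homotopy relative to the coupled Homotopy-SQP method.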

References

Guan Yutong, Gao Changsheng, Hu Yudong, Zhao Han. Hyperparameter Auto-Tuning and Homotopy Methods for Spacecraft Long-Range Cooperative Rendezvous[J]. Spacecraft Environment Engineering, 2026. [in Chinese]
Lillicrap T P, et al. Continuous control with deep reinforcement learning[J]. arXiv:1509.02971, 2015.