Reinforcement Learning Enhanced Particle Swarm Optimization (RLEPSO)
Author: Tianjiang Says
Contributing institutions: School of Astronautics, Harbin Institute of Technology; National Key Laboratory of Rapid Design and Intelligent Swarm for Micro/Nano Spacecraft
References: Guan Yutong et al. Hyperparameter Auto-Tuning and Homotopy Methods for Spacecraft Long-Range Cooperative Rendezvous, Spacecraft Environment Engineering, 2026.
Definition
Reinforcement Learning Enhanced Particle Swarm Optimization (RLEPSO) is a hybrid optimization algorithm that combines the Deep Deterministic Policy Gradient (DDPG) algorithm with Hybrid Cluster Particle Swarm Optimization (HCPSO). RLEPSO uses the DDPG Actor network to adjust the HCPSO hyperparameters dynamically according to the particle search state. This automates parameter tuning and, according to the cited study, improves the search capability and convergence speed of the optimizer.
Core Principles
Algorithm Architecture
RLEPSO embeds the DDPG framework on top of HCPSO:
1. Initialization: Set the initial HCPSO parameters and build the DDPG Actor-Critic networks
2. State perception: Compute the state variables from the current iteration (stagnation start time, stagnation duration, iteration progress, particle distribution dispersion, particle distribution direction)
3. Action output: The Actor network outputs hyperparameter adjustment actions
4. Parameter update: Decode the actions into HCPSO parameters such as the inertia weight and acceleration factors
5. Experience replay: Store experience samples for training the Critic network
6. Iterative optimization: Repeat steps 2-5 until convergence
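The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the published implementation: plain global-best PSO stands in for HCPSO, `placeholder_actor` is a hand-written rule standing in for the trained DDPG Actor, the state uses only two of the five variables, the reward is a placeholder (improvement of the global best), and no network training is performed. All function names are hypothetical.

```python
import random

def sphere(x):
    """Toy objective (assumption for illustration): sum of squares."""
    return sum(v * v for v in x)

def placeholder_actor(state):
    """Stand-in for the DDPG Actor: maps a state to (w, c1, c2).

    A real Actor is a trained neural network; this fixed rule only
    shows where its output enters the optimization loop.
    """
    progress, stagnation = state
    w = 0.9 - 0.5 * progress                        # inertia weight
    c1 = 2.0 - progress                             # cognitive factor
    c2 = 1.0 + progress + 0.1 * min(stagnation, 5)  # social factor
    return w, c1, c2

def run(objective, dim=2, n_particles=20, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialization
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_f.__getitem__)
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    stagnation, replay = 0, []
    for it in range(max_iter):
        # Step 2: state perception (reduced to two variables here)
        state = (it / max_iter, stagnation)
        # Steps 3-4: action output and parameter update
        w, c1, c2 = placeholder_actor(state)
        prev_best = gbest_f
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = objective(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
        stagnation = 0 if gbest_f < prev_best else stagnation + 1
        # Step 5: store the experience sample (placeholder reward)
        reward = prev_best - gbest_f
        replay.append((state, (w, c1, c2), reward))
    return gbest, gbest_f, replay
```

Calling `run(sphere)` returns the best position, its fitness, and the stored experience tuples. In a full implementation the replay buffer would feed the network updates, which this sketch leaves out.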
State Design
RLEPSO uses the following state variables:
| State Variable | Role |
|---|---|
| Stagnation start time | Detects whether the algorithm has stagnated |
| Stagnation duration | Assesses the severity of stagnation |
| Iteration progress | Ratio of current iteration to maximum iterations |
| Particle distribution dispersion | Characterizes the degree of particle clustering |
| Particle distribution direction | Characterizes the directional properties of particle distribution |
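One way these state variables could be computed is sketched below. The exact definitions are not given here, so every formula is an assumption made for illustration: times are normalized by the maximum iteration count, dispersion is taken as the mean Euclidean distance to the swarm centroid, and direction is taken as the unit vector from the centroid to the global best. The function name `swarm_state` and its arguments are hypothetical.

```python
import math

def swarm_state(positions, gbest, iteration, max_iter, last_improve_iter):
    """Assumed state features; not the definitions used in the cited work."""
    n = len(positions)
    dim = len(gbest)
    centroid = [sum(p[d] for p in positions) / n for d in range(dim)]
    # Dispersion (assumption): mean distance of particles to the centroid.
    dispersion = sum(math.dist(p, centroid) for p in positions) / n
    # Direction (assumption): unit vector from the centroid to the global best.
    offset = [gbest[d] - centroid[d] for d in range(dim)]
    norm = math.hypot(*offset)
    direction = [o / norm for o in offset] if norm > 0 else [0.0] * dim
    return {
        # Iteration of the last global-best improvement, normalized.
        "stagnation_start": last_improve_iter / max_iter,
        # Iterations elapsed without improvement, normalized.
        "stagnation_duration": (iteration - last_improve_iter) / max_iter,
        "iteration_progress": iteration / max_iter,
        "dispersion": dispersion,
        "direction": direction,
    }
```

Normalizing each feature to a comparable range is a common choice when the state is fed to a neural network, which is why the time-based features are divided by `max_iter` in this sketch.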
Actions and Reward
Action: The Actor network outputs a 16-dimensional action vector, which is decoded into 8 HCPSO hyperparameters.
Reward function: The reward is computed from the global best fitness and the current-generation best fitness.
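The decoding step can be illustrated with a short sketch. It assumes the Actor output is bounded to [-1, 1] (as with a tanh output layer, a common DDPG choice) and maps each component linearly onto a parameter range. How the 16 action components are grouped into 8 hyperparameters is not specified here, so the sketch maps one component per parameter; the parameter names and bounds in `HYPOTHETICAL_BOUNDS` are illustrative assumptions.

```python
# Hypothetical parameter ranges, chosen only for illustration.
HYPOTHETICAL_BOUNDS = {
    "inertia_weight": (0.4, 0.9),
    "cognitive_factor": (0.5, 2.5),
    "social_factor": (0.5, 2.5),
}

def decode_action(action, bounds):
    """Map action components in [-1, 1] to [low, high] parameter ranges.

    Assumption: one action component per hyperparameter, in the order
    of `bounds`. Components outside [-1, 1] are clipped first.
    """
    params = {}
    for a, (name, (low, high)) in zip(action, bounds.items()):
        a = max(-1.0, min(1.0, a))
        params[name] = low + 0.5 * (a + 1.0) * (high - low)
    return params
```

For example, `decode_action([-1.0, 0.0, 1.0], HYPOTHETICAL_BOUNDS)` places the inertia weight at its lower bound, the cognitive factor at the midpoint of its range, and the social factor at its upper bound.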
Application in Spacecraft Cooperative Rendezvous
Guan Yutong et al. (2026) combined RLEPSO with the homotopy method to solve the fuel-optimal long-range cooperative rendezvous problem for spacecraft under perturbation:
- Energy-optimal solution: RLEPSO rapidly obtains high-quality initial costates
- Homotopy transition: Smooth transition from energy-optimal to fuel-optimal
- Results: Compared with PSO and HCPSO, RLEPSO obtained higher-quality initial costates with faster convergence
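The homotopy transition listed above can be illustrated with a performance index that is widely used for low-thrust problems. The form used in the cited study is not given here, so the expression below is an assumption: the throttle u is taken in [0, 1], the homotopy parameter is ε in [0, 1], and constant scale factors are omitted.

```latex
J_\varepsilon = \int_{t_0}^{t_f} \left[ u - \varepsilon\, u\,(1 - u) \right] \mathrm{d}t,
\qquad u \in [0, 1], \quad \varepsilon \in [0, 1]

% \varepsilon = 1: J_1 = \int_{t_0}^{t_f} u^2 \,\mathrm{d}t   (energy-optimal)
% \varepsilon = 0: J_0 = \int_{t_0}^{t_f} u   \,\mathrm{d}t   (fuel-optimal)
```

In this kind of scheme, the problem is first solved at ε = 1, where the control is continuous and the initial costates are easier to find. ε is then reduced step by step toward 0, with each solution used as the initial guess for the next step.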
Simulation Results
| Metric | RLEPSO-Homotopy | Homotopy-SQP Coupled |
|---|---|---|
| Fuel consumption | 205.40 kg | 210.36 kg |
| Rendezvous time | 208.89 TU | 225.44 TU |
| Terminal rendezvous distance | 0.7078 km | 9.3624 km |
Related Concepts
- Deep Deterministic Policy Gradient (DDPG)
- Hybrid Cluster Particle Swarm Optimization (HCPSO)
- Particle Swarm Optimization (PSO)
- Homotopy Method
- Pontryagin's Maximum Principle
References
- Guan Yutong, Gao Changsheng, Hu Yudong, Zhao Han. Hyperparameter Auto-Tuning and Homotopy Methods for Spacecraft Long-Range Cooperative Rendezvous[J]. Spacecraft Environment Engineering, 2026. [in Chinese]
- Lillicrap T P, et al. Continuous control with deep reinforcement learning[J]. arXiv:1509.02971, 2015.
