Evolutionary Policy Optimization

Preprint

Jianren Wang*         Yifan Su*         Abhinav Gupta         Deepak Pathak
Carnegie Mellon University
*denotes equal contribution


Abstract

On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data due to limited policy-induced diversity. In contrast, Evolutionary Algorithms (EAs) scale naturally and encourage exploration via randomized population-based search, but are often sample-inefficient. We propose Evolutionary Policy Optimization (EPO), a hybrid algorithm that combines the scalability and diversity of EAs with the performance and stability of policy gradients. EPO maintains a population of agents conditioned on latent variables, shares actor-critic network parameters for coherence and memory efficiency, and aggregates diverse experiences into a master agent. Across tasks in dexterous manipulation, legged locomotion, and classic control, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability.
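The latent-conditioned, parameter-shared design described above can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it assumes a PyTorch actor-critic in which each of POP_SIZE population members is identified only by a learned latent embedding, and pooled rollouts from all members drive a single policy-gradient step on the shared parameters. All class names, dimensions, and hyperparameters here (e.g., LatentConditionedActorCritic, LATENT_DIM) are hypothetical placeholders.

# Minimal sketch (assumptions noted above), not the paper's code.
import torch
import torch.nn as nn

POP_SIZE, LATENT_DIM, OBS_DIM, ACT_DIM = 8, 16, 24, 6

class LatentConditionedActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding row per population member: the only per-agent parameters.
        self.latents = nn.Embedding(POP_SIZE, LATENT_DIM)
        # Actor and critic trunks are shared by the whole population.
        self.actor = nn.Sequential(
            nn.Linear(OBS_DIM + LATENT_DIM, 128), nn.Tanh(),
            nn.Linear(128, ACT_DIM),
        )
        self.critic = nn.Sequential(
            nn.Linear(OBS_DIM + LATENT_DIM, 128), nn.Tanh(),
            nn.Linear(128, 1),
        )
        self.log_std = nn.Parameter(torch.zeros(ACT_DIM))

    def forward(self, obs, agent_ids):
        z = self.latents(agent_ids)              # (B, LATENT_DIM) per-member latent
        x = torch.cat([obs, z], dim=-1)          # condition shared networks on latent
        dist = torch.distributions.Normal(self.actor(x), self.log_std.exp())
        value = self.critic(x).squeeze(-1)
        return dist, value

# Master-agent update: pool transitions from every member and take one
# policy-gradient step on the shared parameters (advantages/returns assumed given).
model = LatentConditionedActorCritic()
optim = torch.optim.Adam(model.parameters(), lr=3e-4)

obs = torch.randn(256, OBS_DIM)                  # pooled batch across all members
agent_ids = torch.randint(0, POP_SIZE, (256,))
actions = torch.randn(256, ACT_DIM)
advantages, returns = torch.randn(256), torch.randn(256)

dist, value = model(obs, agent_ids)
pg_loss = -(dist.log_prob(actions).sum(-1) * advantages).mean()
value_loss = (value - returns).pow(2).mean()
(pg_loss + 0.5 * value_loss).backward()
optim.step()

In this sketch, diversity comes from the per-member latents while all gradient signal accumulates in one shared network, which mirrors the memory-efficiency and aggregation properties claimed in the abstract.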

Experimental Setup


We evaluate our algorithm on eight challenging environments spanning dexterous manipulation, parkour, legged locomotion, and classic control benchmarks.

Quantitative Performance

We evaluate our approach against state-of-the-art RL methods tailored for large-scale training. The comparisons include the off-policy algorithms SAC and Parallel Q-Learning, the on-policy method PPO, the hybrid-policy method SAPG, and the population-based EvoRL methods PBT and CEM-RL. We use the original implementation of each method without introducing algorithm-level modifications.


Qualitative Performance





BibTeX

@article{wang2025evolutionary,
  title={Evolutionary Policy Optimization},
  author={Wang, Jianren and Su, Yifan and Gupta, Abhinav and Pathak, Deepak},
  journal={arXiv preprint arXiv:2503.19037},
  year={2025}
}