🛠️ Steven Gong

Search

Aug 03, 2025, 1 min read

What Matters for Batch Online Reinforcement Learning

RL is from the Q function

And they use IQL

One really important thing that they show is that implicit policy extraction is a lot better than explicit.

Explicit is training the policy using the q-function
Implicit policy extraction is doing multi-trajectory sampling, scoring trajectories, and selecting it

Graph View

Backlinks

Behavior Cloning (BC)
Policy Extraction

Created with Quartz, © 2025

Blog
LinkedIn
Twitter
GitHub