Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data (WSRL)

WSRL = Warm-Start RL

Project page:

Motivation

This is a problem I have noticed too: there's too strong a mismatch between the offline dataset distribution and the online data distribution.

This mismatch results in too much unlearning. What is their solution?

Definitions:

  • “unlearning” = the performance drop at the start of fine-tuning, with possible recovery later;
  • “forgetting” = destruction of the pre-trained initialization at the beginning of fine-tuning, such that recovery becomes nearly impossible with online RL training

  • Well, it's not a completely fair comparison: these are algorithms that were never meant to do well without data retention

Compared against methods that do retain offline data during online training, WSRL does slightly better in the online setting. Good job.

Forgetting the pre-trained initialization altogether is problematic, since it defeats the benefits of offline RL pre-training.

Why Warm-up Prevents Q-Value Divergence

This is really important. Warmup solves this problem by putting more offline-like data into the replay buffer, where Q-values are not as pessimistic, thereby preventing the downward spiral in the online Bellman backups; it then uses a high UTD (update-to-data) ratio in online RL to quickly re-calibrate the Q-values.
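
To make this concrete, here is a minimal Python sketch of the recipe as I understand it: a no-gradient warmup phase that fills the buffer with rollouts from the pre-trained policy, followed by online training at a high UTD ratio. `env`, `agent.act`, and `agent.update` are hypothetical placeholders standing in for a gym-style environment and an off-policy actor-critic agent initialized from the offline run; this is my reading of the idea, not the authors' actual code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Tiny FIFO replay buffer, just enough for the sketch."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, *transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))


def warmstart_finetune(env, agent, warmup_steps=5_000, online_steps=250_000,
                       utd_ratio=4, batch_size=256):
    """Fine-tune an offline-pre-trained agent online, retaining no offline data."""
    buffer = ReplayBuffer(capacity=1_000_000)  # starts empty: no offline data retained

    # Warmup phase: roll out the frozen pre-trained policy so the buffer fills
    # with near-offline-distribution transitions before any gradient updates.
    # Early Bellman backups then bootstrap from states the critic already fits.
    obs = env.reset()
    for _ in range(warmup_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

    # Online phase: a high update-to-data (UTD) ratio, i.e. several gradient
    # steps per environment step, re-calibrates the Q-values quickly.
    for _ in range(online_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        for _ in range(utd_ratio):
            agent.update(buffer.sample(batch_size))

    return agent
```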

These figures are really useful for thinking about how to debug these issues.

  • The 2nd figure shows how the Q-values on the offline distribution stay about the same, whereas if we just leave out the offline dataset, the agent completely forgets it (see the probe sketched below)

“Fine-tuning on more on-policy data destroys how well we fit offline data”.
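
One way to reproduce this figure's diagnostic in your own runs: track the critic's mean Q-value on a fixed held-out batch of offline transitions during fine-tuning. A minimal sketch, where `agent.q_value` is a hypothetical accessor for the critic, not the authors' API:

```python
import numpy as np


def offline_q_probe(agent, offline_states, offline_actions):
    """Mean Q on a fixed held-out batch of offline (state, action) pairs.

    Log this every few thousand online steps: if the curve collapses while
    online return also drops, that is the unlearning/forgetting signature
    the paper's figures show. `agent.q_value` is a hypothetical accessor.
    """
    qs = agent.q_value(offline_states, offline_actions)
    return float(np.mean(qs))
```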

This is really interesting: in settings where pre-training doesn't work, their method also doesn't work. Yet for Q-Chunking, those curves look really good?

“this is expected because when pre-training fails, initializing with the pre-trained network may not bring any useful information gain, and may actually hurt fine-tuning by reducing the network’s plasticity”

They cite “The Primacy Bias in Deep Reinforcement Learning”.