Robot Learning
Goal: become a 10x researcher in robot learning.
Best Conferences:
- CoRL
- RSS
- IROS
- ICRA
Other:
- ICLR
- ICML
- NeurIPS
Walkthrough (CS231n 2025 Lec 17, Yunzhu Li)
Robot learning is the third paradigm in this course, alongside supervised and self-supervised learning: an agent takes actions in an environment and receives rewards, with the goal of learning to act so as to maximize reward. The lecture is structured into seven sections — problem formulation, perception, RL, model learning + planning, imitation, robotic foundation models, remaining challenges — and what follows mirrors that.
Problem formulation
Universal diagram: a human gives a goal to an Agent, which exchanges (state $s_t$, action $a_t$, reward $r_t$) with the Physical World. Almost every interesting learning-based decision problem can be cast into this loop:
| Task | State | Action | Reward |
|---|---|---|---|
| Cart-Pole | angle, ang. speed, position, vel. | horizontal force | +1 per upright step |
| Robot Locomotion | joint angle/pos/vel | joint torques | +1 upright + forward |
| Atari (Mnih NeurIPS-DL 2013) | raw game pixels | game pad direction | score delta |
| Go | board configuration | next stone placement | win=1 / loss=0 (terminal only) |
| Text generation | current tokens | next token | 1 if matches GT |
| Chatbot | conversation so far | next sentence | human eval ±1/0 |
| Cloth folding | sensor readings | end-effector motion | human eval 1/0 |
The point of listing all of these is to show that the framing is general — the same machinery applies to a quadruped, a Dota agent, and an LLM.
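To make the loop concrete, here is a minimal interaction-loop sketch using the Gymnasium CartPole task from the table; the random action is a stand-in for any learned policy.

```python
import gymnasium as gym

# Cart-Pole from the table: state = pole angle / angular speed / cart position / velocity,
# action = horizontal push, reward = +1 per step the pole stays upright.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder for a learned policy pi(a|s)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

print(f"episode return: {episode_return}")
env.close()
```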
Robot perception (vs computer vision)
Three properties make robot perception different from vanilla CV:
- Embodied — the camera is mounted on a body whose own actions change what’s seen.
- Active — see Active Perception: the agent decides where to look by moving.
- Situated — outputs feed a perception → action → state-change loop, not an offline label.
Modalities go beyond RGB: depth, IMU, force/torque, tactile.
Reinforcement learning
Standard MDP: see RL and MDP for the formal setup, DQN for the canonical Atari pipeline (Conv 4→16, 8×8/s4 → Conv 16→32, 4×4/s2 → FC-256 → FC-A, where A is the number of actions, on a 4×84×84 grayscale frame stack).
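For reference, a minimal PyTorch sketch of that network shape (just the architecture, not the replay buffer, target network, or training loop):

```python
import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    """Q-network over a stack of 4 grayscale 84x84 frames, shaped as described above."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # Conv 4 -> 16, 8x8 / stride 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # Conv 16 -> 32, 4x4 / stride 2
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256 (spatial size 84 -> 20 -> 9)
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-A: one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)  # (B, 4, 84, 84) -> (B, A)

q = AtariQNet(num_actions=6)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 6])
```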
Why RL is structurally different from supervised learning (slide 32):
- Stochasticity — the same $(s_t, a_t)$ can yield different $s_{t+1}$, $r_t$.
- Credit assignment — the reward that matters arrived many steps after the action that earned it (the discounted-return snippet after this list makes this concrete).
- Nondifferentiable — you cannot compute gradients through the world.
- Nonstationary — the data distribution shifts as the policy improves, breaking the i.i.d. assumption.
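To make the credit-assignment point concrete, a small snippet (my own illustration, not from the lecture) computing discounted returns $G_t = r_t + \gamma G_{t+1}$ for a Go-like sparse reward: the single terminal reward is propagated backward so every earlier action shares credit for it.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backward from the end of the episode."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

# A Go-like sparse reward: nothing until the terminal win signal.
rewards = [0.0] * 9 + [1.0]
print([round(g, 3) for g in discounted_returns(rewards)])
# [0.914, 0.923, 0.932, 0.941, 0.951, 0.961, 0.97, 0.98, 0.99, 1.0]
```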
Game-playing milestones (used in lecture as evidence that RL works at scale):
| When | System | What changed |
|---|---|---|
| Jan 2016 | AlphaGo | imitation + tree search + RL; beat Lee Sedol |
| Oct 2017 | AlphaGo Zero | dropped imitation, pure self-play; surpassed the version that beat Lee Sedol |
| Dec 2018 | AlphaZero | one architecture for Go / Chess / Shogi |
| Nov 2019 | MuZero | planning with a learned model — no rules given |
| 2018–19 | AlphaStar / OpenAI Five | StarCraft II, Dota 2 — long-horizon partial-info |
Robot RL in the wild: ETH RSL quadrupedal locomotion (Sci Robotics 2020), Unitree B2-W (Dec 2024), OpenAI’s Rubik’s Cube hand (2019), Visual Dexterity (Sci Robotics 2023).
The catch: model-free RL is sample-inefficient. The slide cites the “3000 years of self-play in 40 days” framing — fine for simulators, not for a single physical robot. Hence safety, interpretability, and the human intuition that we do have a world model motivate the next section.
Model learning & model-based planning
Learn the dynamics $s_{t+1} = f_\theta(s_t, a_t)$, then plan with receding-horizon control: optimize an action sequence over a horizon, execute only the first action, re-plan from the new state. Modern variant: GPU-parallel sample shooting over thousands of candidate sequences; see Receding Horizon Control.
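A minimal numpy sketch of that receding-horizon loop with random shooting as the optimizer; `dynamics_model` and `reward_fn` are hypothetical placeholders for whatever learned model and task reward are plugged in.

```python
import numpy as np

def plan_action(state, dynamics_model, reward_fn,
                horizon=10, num_samples=1000, action_dim=2, rng=None):
    """Sample candidate action sequences, roll them out through the learned model,
    and return only the first action of the best sequence (receding-horizon control)."""
    rng = np.random.default_rng() if rng is None else rng
    # (num_samples, horizon, action_dim) candidate action sequences
    candidates = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    total_reward = np.zeros(num_samples)
    sim_state = np.repeat(state[None, :], num_samples, axis=0)
    for t in range(horizon):
        sim_state = dynamics_model(sim_state, candidates[:, t])  # batched s_{t+1} = f(s_t, a_t)
        total_reward += reward_fn(sim_state)
    best = np.argmax(total_reward)
    return candidates[best, 0]  # execute only the first action, then re-plan

# Control loop (pseudocode): execute one action, observe the new state, re-plan from scratch.
# while not done:
#     action = plan_action(state, dynamics_model, reward_fn)
#     state = env_step(action)   # env_step is a placeholder for the real robot/simulator step
```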
The interesting design question is what the state representation $s$ is:
- Pixel dynamics — Deep Visual Foresight (Finn & Levine, ICRA 2017) with CDNA conv+LSTM. Predicts pixels directly. Compute-heavy and entangles physics with rendering.
- Keypoint dynamics — Manuelli/Li/Florence/Tedrake CoRL 2020. Track a sparse set of keypoints on the object (e.g. KUKA pushing a Cheez-It box); dynamics are over keypoints, not pixels. Cheap to roll out.
- Particle dynamics — Wang/Li/Driggs-Campbell/Fei-Fei/Wu RSS 2023, RoboCook (Shi/Xu/Clarke/Li/Wu CoRL 2023, Best Systems Paper). Represent piles (granola, rice, carrot, candy, dough) as particles, learn the dynamics over the particles with a GNN (a rough message-passing sketch follows this list). Powerful for deformables / elasto-plastic objects where rigid- or keypoint-state breaks down.
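A rough sketch of one message-passing step for particle dynamics, as a generic illustration of the idea rather than the architecture of the cited papers: each particle aggregates features from neighbors within a radius and an MLP predicts its motion.

```python
import torch
import torch.nn as nn

class ParticleDynamics(nn.Module):
    """One message-passing step: aggregate neighbor messages, predict per-particle motion."""
    def __init__(self, dim=3, hidden=128, radius=0.05):
        super().__init__()
        self.radius = radius
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.node_mlp = nn.Sequential(nn.Linear(dim + hidden + dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, pos, action):
        # pos: (N, 3) particle positions; action: (3,) e.g. end-effector motion, broadcast to all particles
        N = pos.shape[0]
        dist = torch.cdist(pos, pos)                 # (N, N) pairwise distances
        adj = (dist < self.radius).float()
        adj.fill_diagonal_(0.0)                      # no self-edges
        pairs = torch.cat([pos.unsqueeze(1).expand(N, N, -1),
                           pos.unsqueeze(0).expand(N, N, -1)], dim=-1)
        messages = self.edge_mlp(pairs) * adj.unsqueeze(-1)   # zero out non-neighbor messages
        agg = messages.sum(dim=1)                    # (N, hidden) aggregated neighbor features
        delta = self.node_mlp(torch.cat([pos, agg, action.expand(N, -1)], dim=-1))
        return pos + delta                           # predicted particle positions at t+1

model = ParticleDynamics()
next_pos = model(torch.rand(200, 3) * 0.1, torch.tensor([0.0, 0.01, 0.0]))
```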
Imitation learning
“It’s just supervised learning from a demonstration dataset.” See Imitation Learning. The lecture walks through the standard variants:
- Behavior Cloning (BC) — fit $\pi_\theta(a \mid s)$ on expert $(s, a)$ pairs; a minimal training-loop sketch follows this list. Failure mode: distribution shift — once the learned policy drifts off the expert’s trajectory, the dataset has no recovery examples (the famous self-driving illustration: expert hugs the centerline, learned policy edges off and finds no data on how to come back). See Behavior Cloning for the MLE/regression formulations.
- DAgger — iteratively roll out the policy, query the expert at the visited states, retrain on the augmented dataset. Closes the distribution-shift loop.
- Inverse RL — flip the arrow: given behavior + environment, recover the reward the expert is optimizing. RL takes (env, reward) → behavior; IRL takes (env, behavior) → reward. See IRL.
- Implicit BC — instead of predicting $a = \pi_\theta(s)$ explicitly, learn an energy $E_\theta(s, a)$ and infer $\hat{a} = \arg\min_a E_\theta(s, a)$. Handles multi-modal action distributions that an explicit MLP collapses; see Implicit Behavioral Cloning.
- Diffusion Policies (Chi et al.) — generalize implicit BC by representing the policy as a denoising process run for $K$ iterations. Handles multi-modality cleanly and commits to one mode within a rollout (where LSTM-GMM and IBC are biased to one mode and BET fails to commit). Predicts a chunk of actions for receding-horizon control. See Diffusion Policy and Action Chunking.
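Since BC really is supervised learning, a minimal PyTorch sketch of the training loop on expert $(s, a)$ pairs (shapes and data are placeholders; a real pipeline would train on images with a stronger policy class):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 10, 7

# Expert demonstrations: random placeholders standing in for recorded (s, a) pairs.
demo_states = torch.randn(5000, state_dim)
demo_actions = torch.randn(5000, action_dim)

policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(demo_states, demo_actions), batch_size=256, shuffle=True)

for epoch in range(10):
    for s, a in loader:
        loss = nn.functional.mse_loss(policy(s), a)  # regression onto expert actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Failure mode from the BC bullet: at test time the rollout visits states the expert
# never did, and nothing in this loss tells the policy how to recover.
```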
Robotic foundation models
Definition (slide 80): a policy that maps (observation/state, goal) → action with no explicit state or transition function. Two framings used interchangeably: Vision-Language-Action models (VLAs) and Large Behavior Models (LBMs) — see VLA and Robot Foundation Models.
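As an interface, that definition is just (image observation, language goal) → action chunk. A hypothetical stub, with names and shapes that are illustrative rather than any particular model's API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray        # e.g. (224, 224, 3) camera image
    proprio: np.ndarray    # e.g. joint positions / gripper state

class VisionLanguageActionPolicy:
    """Illustrative VLA interface: no explicit state estimate or transition model,
    just (observation, language goal) -> a short chunk of low-level actions."""
    def act(self, obs: Observation, goal: str) -> np.ndarray:
        # A real model would tokenize the image + goal, run a VLM backbone,
        # and decode an action chunk; here zeros stand in for that.
        return np.zeros((16, 7))   # horizon x action-dim (e.g. 7-DoF actions as in OpenVLA)

policy = VisionLanguageActionPolicy()
actions = policy.act(Observation(np.zeros((224, 224, 3)), np.zeros(8)), "put the cup in the sink")
```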
The honest framing on slide 82 is worth quoting: VLMs aren’t perfect but always produce something reasonable; analogously, robotic foundation models won’t always be optimal but should always produce beautiful and reasonable trajectories.
Timeline shown: RT-1 (Dec 2022) → RT-2 (Jul 2023) → RT-X (Oct 2023, 1M episodes / 22 embodiments / 527 skills / 311 scenes / 34 labs / 21 institutions) → OpenVLA (Jun 2024, Llama 2 7B + DINOv2 + SigLIP → 7-DoF action de-tokenizer) → π-Zero (Oct 2024, pi0) → Helix (Figure), Hi-Robot (PI), Gemini Robotics, π-0.5, GR00T (Nvidia), DYNA-1.
π-Zero recipe (Physical Intelligence, Oct 2024): pre-train a VLM + action expert on internet-scale data plus Open X-Embodiment plus π’s own cross-embodiment dataset; then post-train along three tracks:
- Zero-shot in-distribution (e.g. bus tabling)
- Specialized post-training to difficult tasks (e.g. empty apartment dryer, batch fold shirts)
- Efficient post-training to unseen tasks (e.g. put items in drawer, replace paper towel)
Open-sourced as openpi on Feb 4, 2025 — both the flow-based diffusion VLA (π₀) and the autoregressive variant π₀-FAST.
Remaining challenges (slides 94–102)
The honest list of what is hard right now:
- Evaluation. Real-world eval is noisy and expensive: the slide quotes an unnamed lab — “we have large enough budget such that we can still make progress.” Training loss correlates only weakly with real-world success. Sim eval has its own gaps (sim-to-real for rigid / deformable / cloth, asset generation, digitalization, procedural diversity) — there’s no “ImageNet of embodied AI” yet. The ALOHA 2 fleet is shown as a real-world eval rig.
- Foundation policy → foundation world model. Yunzhu’s working definition of a world model is action-conditioned future prediction. Same data (action-conditioned robot interaction) trains both; they can co-evolve. Examples: DayDreamer, NVIDIA Cosmos World Foundation Model, 1X World Models.
- Foundation models tailored for embodiment. GPT/Llava-style VLMs fail on geometric / embodied / physical tasks. SAM and DINOv2 are closer to what an embodied agent actually needs. The framing shift: “RL from human feedback” → “RL from embodied feedback.”
- Adaptation / lifelong learning. Adapt to new scenes, to a specific human’s preferences, and self-improve over time (BEHAVIOR-1K’s preference-rank chart is the prop).
- Systems work. “Every robotics work is a system work.” Delays, compute budget, modules talking to each other. Two reference architectures shown: Figure AI’s Helix (System 2 = 7B VLM at 7–9 Hz on GPU 2; System 1 = 80M transformer at 200 Hz on GPU 1 for whole-upper-body control) and PI’s Hi-Robot (high-level VLM emits low-level language commands; low-level VLA emits joint actions, with prompts/interjections from the user). A toy dual-rate loop sketch follows below.
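To make the Helix-style split concrete, a toy dual-rate loop (a purely illustrative threading sketch, not Figure’s or PI’s actual design): a slow “System 2” refreshes a shared latent command at a few Hz while a fast “System 1” consumes the latest value at control rate.

```python
import threading
import time

latent_command = {"value": 0.0}   # shared slow-to-fast interface
lock = threading.Lock()
stop = threading.Event()

def system2_slow_planner(hz=8):
    """Stands in for the big VLM: updates the latent command a few times per second."""
    while not stop.is_set():
        new_command = time.time() % 1.0   # placeholder "plan"
        with lock:
            latent_command["value"] = new_command
        time.sleep(1.0 / hz)

def system1_fast_controller(hz=200):
    """Stands in for the small visuomotor policy: reads the latest latent at control rate."""
    while not stop.is_set():
        with lock:
            command = latent_command["value"]
        _action = command * 0.01          # placeholder low-level action computation
        time.sleep(1.0 / hz)

threads = [threading.Thread(target=system2_slow_planner),
           threading.Thread(target=system1_fast_controller)]
for t in threads:
    t.start()
time.sleep(1.0)
stop.set()
for t in threads:
    t.join()
```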