Robot Learning
Goal: become a 10x researcher in robot learning.
Best Conferences:
- CoRL
- RSS
- IROS
- ICRA
Other:
- ICLR
- ICML
- NeurIPS
Walkthrough (CS231n 2025 Lec 17, Yunzhu Li)
Robot learning is the third paradigm in this course, alongside supervised and self-supervised learning: an agent takes actions in an environment and receives rewards, with the goal of learning to act so as to maximize reward. The lecture is structured into seven sections — problem formulation, perception, RL, model learning + planning, imitation, robotic foundation models, remaining challenges — and what follows mirrors that.
Problem formulation
Universal diagram: a human gives a goal to an Agent, which exchanges (state $s_t$, action $a_t$, reward $r_t$) with the Physical World. Almost every interesting learning-based decision problem can be cast into this loop:
| Task | State | Action | Reward |
|---|---|---|---|
| Cart-Pole | angle, ang. speed, position, vel. | horizontal force | +1 per upright step |
| Robot Locomotion | joint angle/pos/vel | joint torques | +1 upright + forward |
| Atari (Mnih NeurIPS-DL 2013) | raw game pixels | game pad direction | score delta |
| Go | board configuration | next stone placement | win=1 / loss=0 (terminal only) |
| Text generation | current tokens | next token | 1 if matches GT |
| Chatbot | conversation so far | next sentence | human eval ±1/0 |
| Cloth folding | sensor readings | end-effector motion | human eval 1/0 |
The point of listing all of these is to show that the framing is general — the same machinery applies to a quadruped, a Dota agent, and an LLM.
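To make the loop concrete, here is a minimal interaction-loop sketch using the Gymnasium CartPole task from the table; the random action is a stand-in for any learned policy.

```python
import gymnasium as gym

# Cart-Pole from the table: state = pole angle / angular speed / cart position / velocity,
# action = horizontal push, reward = +1 per step the pole stays upright.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder for a learned policy pi(a|s)
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

print(f"episode return: {episode_return}")
env.close()
```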
Robot perception (vs computer vision)
Three properties make robot perception different from vanilla CV:
- Embodied — the camera is mounted on a body whose own actions change what’s seen.
- Active — see Active Perception: the agent decides where to look by moving.
- Situated — outputs feed a perception → action → state-change loop, not an offline label.
Modalities go beyond RGB: depth, IMU, force/torque, tactile.
Reinforcement learning
Standard MDP: see RL and MDP for the formal setup, DQN for the canonical Atari pipeline (Conv 4→16, 8×8/s4 → Conv 16→32, 4×4/s2 → FC-256 → FC-A, where A is the number of actions, on a 4×84×84 grayscale frame stack).
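For reference, a minimal PyTorch sketch of that network shape (just the architecture, not the replay buffer, target network, or training loop):

```python
import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    """Q-network over a stack of 4 grayscale 84x84 frames, shaped as described above."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # Conv 4 -> 16, 8x8 / stride 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # Conv 16 -> 32, 4x4 / stride 2
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256 (spatial size 84 -> 20 -> 9)
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-A: one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)  # (B, 4, 84, 84) -> (B, A)

q = AtariQNet(num_actions=6)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 6])
```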
Why RL is structurally different from supervised learning (slide 32):
- Stochasticity — the same $(s_t, a_t)$ can yield different $s_{t+1}$, $r_t$.
- Credit assignment — the reward that matters arrived many steps after the action that earned it (the discounted-return snippet after this list makes this concrete).
- Nondifferentiable — you cannot compute gradients through the world.
- Nonstationary — the data distribution shifts as the policy improves, breaking the i.i.d. assumption.
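To make the credit-assignment point concrete, a small snippet (my own illustration, not from the lecture) computing discounted returns $G_t = r_t + \gamma G_{t+1}$ for a Go-like sparse reward: the single terminal reward is propagated backward so every earlier action shares credit for it.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backward from the end of the episode."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

# A Go-like sparse reward: nothing until the terminal win signal.
rewards = [0.0] * 9 + [1.0]
print([round(g, 3) for g in discounted_returns(rewards)])
# [0.914, 0.923, 0.932, 0.941, 0.951, 0.961, 0.97, 0.98, 0.99, 1.0]
```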
Game-playing milestones (used in lecture as evidence that RL works at scale):
| When | System | What changed |
|---|---|---|
| Jan 2016 | AlphaGo | imitation + tree search + RL; beat Lee Sedol |
| Oct 2017 | AlphaGo Zero | dropped imitation, pure self-play; surpassed the version that beat Lee Sedol |
| Dec 2018 | AlphaZero | one architecture for Go / Chess / Shogi |
| Nov 2019 | MuZero | planning with a learned model — no rules given |
| 2018–19 | AlphaStar / OpenAI Five | StarCraft II, Dota 2 — long-horizon partial-info |
Robot RL in the wild: ETH RSL quadrupedal locomotion (Sci Robotics 2020), Unitree B2-W (Dec 2024), OpenAI’s Rubik’s Cube hand (2019), Visual Dexterity (Sci Robotics 2023).
The catch: model-free RL is sample-inefficient. The slide cites the “3000 years of self-play in 40 days” framing — fine for simulators, not for a single physical robot. Hence safety, interpretability, and the human intuition that we do have a world model motivate the next section.
Model learning & model-based planning
Learn the dynamics $s_{t+1} = f_\theta(s_t, a_t)$, then plan with receding-horizon control: optimize an action sequence over a horizon, execute only the first action, re-plan from the new state. Modern variant: GPU-parallel sample shooting over thousands of candidate sequences; see Receding Horizon Control.
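A minimal numpy sketch of that receding-horizon loop with random shooting as the optimizer; `dynamics_model` and `reward_fn` are hypothetical placeholders for whatever learned model and task reward are plugged in.

```python
import numpy as np

def plan_action(state, dynamics_model, reward_fn,
                horizon=10, num_samples=1000, action_dim=2, rng=None):
    """Sample candidate action sequences, roll them out through the learned model,
    and return only the first action of the best sequence (receding-horizon control)."""
    rng = np.random.default_rng() if rng is None else rng
    # (num_samples, horizon, action_dim) candidate action sequences
    candidates = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    total_reward = np.zeros(num_samples)
    sim_state = np.repeat(state[None, :], num_samples, axis=0)
    for t in range(horizon):
        sim_state = dynamics_model(sim_state, candidates[:, t])  # batched s_{t+1} = f(s_t, a_t)
        total_reward += reward_fn(sim_state)
    best = np.argmax(total_reward)
    return candidates[best, 0]  # execute only the first action, then re-plan

# Control loop (pseudocode): execute one action, observe the new state, re-plan from scratch.
# while not done:
#     action = plan_action(state, dynamics_model, reward_fn)
#     state = env_step(action)   # env_step is a placeholder for the real robot/simulator step
```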
The interesting design question is what the state representation $s$ is:
- Pixel dynamics — Deep Visual Foresight (Finn & Levine, ICRA 2017) with CDNA conv+LSTM. Predicts pixels directly. Compute-heavy and entangles physics with rendering.
- Keypoint dynamics — Manuelli/Li/Florence/Tedrake CoRL 2020. Track a sparse set of keypoints on the object (e.g. KUKA pushing a Cheez-It box); dynamics are over keypoints, not pixels. Cheap to roll out.
- Particle dynamics — Wang/Li/Driggs-Campbell/Fei-Fei/Wu RSS 2023, RoboCook (Shi/Xu/Clarke/Li/Wu CoRL 2023, Best Systems Paper). Represent piles (granola, rice, carrot, candy, dough) as particles, learn the dynamics over the particles with a GNN (a rough message-passing sketch follows this list). Powerful for deformables / elasto-plastic objects where rigid- or keypoint-state breaks down.
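A rough sketch of one message-passing step for particle dynamics, as a generic illustration of the idea rather than the architecture of the cited papers: each particle aggregates features from neighbors within a radius and an MLP predicts its motion.

```python
import torch
import torch.nn as nn

class ParticleDynamics(nn.Module):
    """One message-passing step: aggregate neighbor messages, predict per-particle motion."""
    def __init__(self, dim=3, hidden=128, radius=0.05):
        super().__init__()
        self.radius = radius
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.node_mlp = nn.Sequential(nn.Linear(dim + hidden + dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, pos, action):
        # pos: (N, 3) particle positions; action: (3,) e.g. end-effector motion, broadcast to all particles
        N = pos.shape[0]
        dist = torch.cdist(pos, pos)                 # (N, N) pairwise distances
        adj = (dist < self.radius).float()
        adj.fill_diagonal_(0.0)                      # no self-edges
        pairs = torch.cat([pos.unsqueeze(1).expand(N, N, -1),
                           pos.unsqueeze(0).expand(N, N, -1)], dim=-1)
        messages = self.edge_mlp(pairs) * adj.unsqueeze(-1)   # zero out non-neighbor messages
        agg = messages.sum(dim=1)                    # (N, hidden) aggregated neighbor features
        delta = self.node_mlp(torch.cat([pos, agg, action.expand(N, -1)], dim=-1))
        return pos + delta                           # predicted particle positions at t+1

model = ParticleDynamics()
next_pos = model(torch.rand(200, 3) * 0.1, torch.tensor([0.0, 0.01, 0.0]))
```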
Imitation learning
“It’s just supervised learning from a demonstration dataset.” See Imitation Learning. The lecture walks through the standard variants:
- Behavior Cloning (BC) — fit $\pi_\theta(a \mid s)$ on expert $(s, a)$ pairs; a minimal training-loop sketch follows this list. Failure mode: distribution shift — once the learned policy drifts off the expert’s trajectory, the dataset has no recovery examples (the famous self-driving illustration: expert hugs the centerline, learned policy edges off and finds no data on how to come back). See Behavior Cloning for the MLE/regression formulations.
- DAgger — iteratively roll out the policy, query the expert at the visited states, retrain on the augmented dataset. Closes the distribution-shift loop.
- Inverse RL — flip the arrow: given behavior + environment, recover the reward the expert is optimizing. RL takes (env, reward) → behavior; IRL takes (env, behavior) → reward. See IRL.
- Implicit BC — instead of predicting $a = \pi_\theta(s)$ explicitly, learn an energy $E_\theta(s, a)$ and infer $\hat{a} = \arg\min_a E_\theta(s, a)$. Handles multi-modal action distributions that an explicit MLP collapses; see Implicit Behavioral Cloning.
- Diffusion Policies (Chi et al.) — generalize implicit BC by representing the policy as a denoising process run for $K$ iterations. Handles multi-modality cleanly and commits to one mode within a rollout (where LSTM-GMM and IBC are biased to one mode and BET fails to commit). Predicts a chunk of actions for receding-horizon control. See Diffusion Policy and Action Chunking.
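Since BC really is supervised learning, a minimal PyTorch sketch of the training loop on expert $(s, a)$ pairs (shapes and data are placeholders; a real pipeline would train on images with a stronger policy class):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 10, 7

# Expert demonstrations: random placeholders standing in for recorded (s, a) pairs.
demo_states = torch.randn(5000, state_dim)
demo_actions = torch.randn(5000, action_dim)

policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(demo_states, demo_actions), batch_size=256, shuffle=True)

for epoch in range(10):
    for s, a in loader:
        loss = nn.functional.mse_loss(policy(s), a)  # regression onto expert actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Failure mode from the BC bullet: at test time the rollout visits states the expert
# never did, and nothing in this loss tells the policy how to recover.
```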
Robotic foundation models
Definition (slide 80): a policy that maps (observation/state, goal) → action with no explicit state or transition function. Two framings used interchangeably: Vision-Language-Action models (VLAs) and Large Behavior Models (LBMs) — see VLA and Robot Foundation Models.
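As an interface, that definition is just (image observation, language goal) → action chunk. A hypothetical stub, with names and shapes that are illustrative rather than any particular model's API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray        # e.g. (224, 224, 3) camera image
    proprio: np.ndarray    # e.g. joint positions / gripper state

class VisionLanguageActionPolicy:
    """Illustrative VLA interface: no explicit state estimate or transition model,
    just (observation, language goal) -> a short chunk of low-level actions."""
    def act(self, obs: Observation, goal: str) -> np.ndarray:
        # A real model would tokenize the image + goal, run a VLM backbone,
        # and decode an action chunk; here zeros stand in for that.
        return np.zeros((16, 7))   # horizon x action-dim (e.g. 7-DoF actions as in OpenVLA)

policy = VisionLanguageActionPolicy()
actions = policy.act(Observation(np.zeros((224, 224, 3)), np.zeros(8)), "put the cup in the sink")
```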
The honest framing on slide 82 is worth quoting: VLMs aren’t perfect but always produce something reasonable; analogously, robotic foundation models won’t always be optimal but should always produce beautiful and reasonable trajectories.
Timeline shown: RT-1 (Dec 2022) → RT-2 (Jul 2023) → RT-X (Oct 2023, 1M episodes / 22 embodiments / 527 skills / 311 scenes / 34 labs / 21 institutions) → OpenVLA (Jun 2024, Llama 2 7B + DINOv2 + SigLIP → 7-DoF action de-tokenizer) → π-Zero (Oct 2024, pi0) → Helix (Figure), Hi-Robot (PI), Gemini Robotics, π-0.5, GR00T (Nvidia), DYNA-1.
π-Zero recipe (Physical Intelligence, Oct 2024): pre-train a VLM + action expert on internet-scale data plus Open X-Embodiment plus π’s own cross-embodiment dataset; then post-train along three tracks:
- Zero-shot in-distribution (e.g. bus tabling)
- Specialized post-training to difficult tasks (e.g. empty apartment dryer, batch fold shirts)
- Efficient post-training to unseen tasks (e.g. put items in drawer, replace paper towel)
Open-sourced as openpi on Feb 4, 2025 — both the flow-based diffusion VLA (π₀) and the autoregressive variant π₀-FAST.
Remaining challenges (slides 94–102)
The honest list of what is hard right now:
- Evaluation. Real-world eval is noisy and expensive: the slide quotes an unnamed lab — “we have large enough budget such that we can still make progress.” Training loss correlates only weakly with real-world success. Sim eval has its own gaps (sim-to-real for rigid / deformable / cloth, asset generation, digitalization, procedural diversity) — there’s no “ImageNet of embodied AI” yet. The ALOHA 2 fleet is shown as a real-world eval rig.
- Foundation policy → foundation world model. Yunzhu’s working definition of a world model is action-conditioned future prediction. Same data (action-conditioned robot interaction) trains both; they can co-evolve. Examples: DayDreamer, NVIDIA Cosmos World Foundation Model, 1X World Models.
- Foundation models tailored for embodiment. GPT/Llava-style VLMs fail on geometric / embodied / physical tasks. SAM and DINOv2 are closer to what an embodied agent actually needs. The framing shift: “RL from human feedback” → “RL from embodied feedback.”
- Adaptation / lifelong learning. Adapt to new scenes, to a specific human’s preferences, and self-improve over time (BEHAVIOR-1K’s preference-rank chart is the prop).
- Systems work. “Every robotics work is a system work.” Delays, compute budget, modules talking to each other. Two reference architectures shown: Figure AI’s Helix (System 2 = 7B VLM at 7–9 Hz on GPU 2; System 1 = 80M transformer at 200 Hz on GPU 1 for whole-upper-body control) and PI’s Hi-Robot (high-level VLM emits low-level language commands; low-level VLA emits joint actions, with prompts/interjections from the user). A toy dual-rate loop sketch follows below.
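To make the Helix-style split concrete, a toy dual-rate loop (a purely illustrative threading sketch, not Figure’s or PI’s actual design): a slow “System 2” refreshes a shared latent command at a few Hz while a fast “System 1” consumes the latest value at control rate.

```python
import threading
import time

latent_command = {"value": 0.0}   # shared slow-to-fast interface
lock = threading.Lock()
stop = threading.Event()

def system2_slow_planner(hz=8):
    """Stands in for the big VLM: updates the latent command a few times per second."""
    while not stop.is_set():
        new_command = time.time() % 1.0   # placeholder "plan"
        with lock:
            latent_command["value"] = new_command
        time.sleep(1.0 / hz)

def system1_fast_controller(hz=200):
    """Stands in for the small visuomotor policy: reads the latest latent at control rate."""
    while not stop.is_set():
        with lock:
            command = latent_command["value"]
        _action = command * 0.01          # placeholder low-level action computation
        time.sleep(1.0 / hz)

threads = [threading.Thread(target=system2_slow_planner),
           threading.Thread(target=system1_fast_controller)]
for t in threads:
    t.start()
time.sleep(1.0)
stop.set()
for t in threads:
    t.join()
```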