Research Ideas
Ideas are cheap. It’s all about who can execute the fastest. Below is a list of ideas that I will execute on for robot learning over the next year.
Every day, write down a list of project ideas to force yourself to be more creative.
Also, building benchmarks is valuable; they give other people something to build on top of.
Active Perception
- Asking a robot to find my iPhone
- Can we combine this with goal-conditioning?
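A minimal sketch of what "find my iPhone" combined with goal-conditioning could look like: a policy that takes image features plus a goal embedding (e.g., a text encoding of the target object) and outputs search actions. All module names, dimensions, and encoder choices below are assumptions, not a worked-out method.

```python
import torch
import torch.nn as nn

class GoalConditionedSearchPolicy(nn.Module):
    """Hypothetical sketch: output search actions conditioned on a goal embedding."""

    def __init__(self, obs_dim=512, goal_dim=384, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),  # e.g., head/base velocity commands
        )

    def forward(self, obs_feat, goal_emb):
        # obs_feat: image features from any visual encoder
        # goal_emb: embedding of the goal, e.g., text("find my iPhone")
        return self.net(torch.cat([obs_feat, goal_emb], dim=-1))

# Random tensors standing in for real encoders:
policy = GoalConditionedSearchPolicy()
action = policy(torch.randn(1, 512), torch.randn(1, 384))
```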
Offline RL
- Monte-Carlo offline RL
- On-policy offline RL
- This would require a world model, and the policy learns on-policy through the world model (very similar to the Learning to Drive from a World Model idea)
- Use the Monte-Carlo return as opposed to the bootstrapped return initially (which will essentially collapse to imitation learning).
- Distributional goal-conditioned RL
- Instead of conditioning on a goal, what if we could condition on a distribution of goals? We would have π(a | s, p(g)), where the policy also learns a Gaussian p(g) to approximate the goal distribution.
- Conditioning on a distribution of rewards
- I don’t like this because there are going to be lots of moving parts
- Flow-matching based Q-function
- Would this work?
- Large Q-functions
- Integrating a world model into the Q-function (having a Q with joint objectives to predict state and value) (Q(s,a) → r, s’)
- Q(s,a | g) → r,s’
- Action chunked world model?
- W(s0 | a0, a1, a2, …, a_{h-1}) = s1, s2, s3, …, s_h
- Compounding errors? (a rough sketch of this chunked model follows the list)
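A minimal sketch of the action-chunked world model above, W(s0 | a0, …, a_{h-1}) → s1, …, s_h: predict the whole chunk of future states in one forward pass instead of rolling a one-step model h times, which is where compounding errors usually come from. All sizes and names are assumed.

```python
import torch
import torch.nn as nn

class ChunkedWorldModel(nn.Module):
    """Predict h future states from the current state and an h-step action chunk."""

    def __init__(self, state_dim=64, act_dim=8, horizon=16, hidden=512):
        super().__init__()
        self.horizon, self.state_dim = horizon, state_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + horizon * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * state_dim),  # s1..sh in one shot
        )

    def forward(self, s0, action_chunk):
        # s0: (B, state_dim), action_chunk: (B, horizon, act_dim)
        x = torch.cat([s0, action_chunk.flatten(1)], dim=-1)
        return self.net(x).view(-1, self.horizon, self.state_dim)

# Training signal: plain regression on logged trajectories (placeholder targets here).
model = ChunkedWorldModel()
s0, acts = torch.randn(32, 64), torch.randn(32, 16, 8)
loss = nn.functional.mse_loss(model(s0, acts), torch.randn(32, 16, 64))
```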
How do we improve Q(s,a) to be more general? What if we could combine the generality of Q-functions with function approximation?
- Q-functions cannot capture long horizons with sparse rewards
- But then how was Q-learning able to solve Atari?
- Tokenizing s,a?
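One hypothetical reading of "tokenizing s,a": uniformly bin each state and action dimension into discrete tokens and let a small transformer regress Q, so the same architecture can be reused across embodiments with different state/action spaces. The binning scheme and sizes below are assumptions.

```python
import torch
import torch.nn as nn

class TokenizedQ(nn.Module):
    """Q(s, a) where every state/action dimension is binned into a discrete token."""

    def __init__(self, n_bins=256, d_model=128):
        super().__init__()
        self.n_bins = n_bins
        self.embed = nn.Embedding(n_bins, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.q_head = nn.Linear(d_model, 1)

    def tokenize(self, x):
        # Map each dimension in [-1, 1] to an integer bin id.
        return ((x.clamp(-1, 1) + 1) / 2 * (self.n_bins - 1)).long()

    def forward(self, state, action):
        tokens = self.tokenize(torch.cat([state, action], dim=-1))
        h = self.encoder(self.embed(tokens))
        return self.q_head(h.mean(dim=1)).squeeze(-1)

q = TokenizedQ()
values = q(torch.rand(4, 10) * 2 - 1, torch.rand(4, 6) * 2 - 1)  # shape (4,)
```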
Upside-Down RL: Q(s, r) → a
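In Upside-Down RL the mapping is flipped: instead of predicting value from (s, a), a behavior function maps (state, desired return) → action and is trained with plain supervised learning on relabeled returns from logged data. A minimal sketch, with all sizes assumed:

```python
import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    """Upside-Down RL style: action = f(state, desired return)."""

    def __init__(self, state_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, state, desired_return):
        return self.net(torch.cat([state, desired_return], dim=-1))

# Supervised training on logged data: the label is the action actually taken,
# and the command is the Monte-Carlo return actually achieved from that state.
f = BehaviorFunction()
states, returns_to_go = torch.randn(64, 32), torch.randn(64, 1)
actions_taken = torch.randn(64, 7)
loss = nn.functional.mse_loss(f(states, returns_to_go), actions_taken)
```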
How do we achieve AGI in robotics? What’s missing?
What’s missing is the ability for robots to self-improve over time robustly.
- Reinforcement learning (generally model-free) →
- “How good are these actions?”
- teaching robots to learn on their own
- Q-learning
- Do we see policy gradient methods for manipulation? I have not seen any. Why not?
- Q-Chunking (a rough sketch of a critic over action chunks follows this list)
- World models
- “Where am I going to end up when I take this action?” Perhaps this is called reasoning? Or search? This is needed for robustness (I don’t care as much about safety; it’s more about recovering from failure states)
- Learning the dynamics of the world
- Dream to Control: Learning Behaviors by Latent Imagination
- How do we bake a world model into our policy?
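A rough sketch of the Q-Chunking idea mentioned above: a critic over an action chunk rather than a single action, so the bootstrap jumps h steps at a time and sparse rewards propagate faster. Everything here is an assumed shape, not the published method.

```python
import torch
import torch.nn as nn

class ChunkedQ(nn.Module):
    """Q(s, a_{t:t+h}): value of executing a whole action chunk from state s."""

    def __init__(self, state_dim=64, act_dim=8, horizon=8, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + horizon * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action_chunk):
        # action_chunk: (B, horizon, act_dim), e.g., sampled from a chunked policy
        return self.net(torch.cat([state, action_chunk.flatten(1)], dim=-1)).squeeze(-1)

# The n-step target would bootstrap h steps ahead:
#   y = sum_{k<h} gamma^k * r_{t+k} + gamma^h * Q(s_{t+h}, a_{t+h:t+2h})
critic = ChunkedQ()
q_val = critic(torch.randn(16, 64), torch.randn(16, 8, 8))
```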
Policy: no need for proprio history? We need proprio history somewhere.
W(s, s') = a (inverse dynamics); W(s, a) = s' (forward dynamics)
Perhaps we need a Q world model? (s, a) → (s', Q)
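A minimal sketch of the "Q world model" idea: one network with a shared trunk and two heads, (s, a) → (s', Q), trained jointly on a dynamics loss and a value loss. The architecture and losses below are assumptions.

```python
import torch
import torch.nn as nn

class QWorldModel(nn.Module):
    """Joint model: shared trunk with a next-state head and a Q head."""

    def __init__(self, state_dim=64, act_dim=8, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)
        self.q_head = nn.Linear(hidden, 1)

    def forward(self, s, a):
        h = self.trunk(torch.cat([s, a], dim=-1))
        return self.next_state_head(h), self.q_head(h).squeeze(-1)

model = QWorldModel()
s_next_pred, q_pred = model(torch.randn(32, 64), torch.randn(32, 8))
# Joint objective (targets below are placeholders for real batch data / TD targets):
loss = (nn.functional.mse_loss(s_next_pred, torch.randn(32, 64))
        + nn.functional.mse_loss(q_pred, torch.randn(32)))
```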
There’s also Goal-Conditioned RL.
1. First way: a three-stage approach
- Policy sampling
- Query the world model to get s'
- Query the Q function conditioned on all three (s, a, s'). But isn’t that just V(s')?
The problems of multi-stage approaches: slow, inefficient, hard to debug. Really, it should be end-to-end?
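A minimal sketch of the three-stage loop above (sample candidates from the policy, query the world model for s', score with V(s')), with every component mocked as a stand-in. It also makes the slowness concrete: every decision runs three networks in sequence.

```python
import torch

def select_action(policy, world_model, value_fn, s, n_candidates=64):
    """Three-stage action selection: sample -> predict s' -> score with V(s')."""
    # Stage 1: policy sampling - draw candidate actions for the current state.
    actions = policy(s.expand(n_candidates, -1))                 # (N, act_dim)
    # Stage 2: query the world model for the predicted next state of each candidate.
    s_next = world_model(s.expand(n_candidates, -1), actions)    # (N, state_dim)
    # Stage 3: score each candidate by the value of where it lands, i.e. V(s').
    scores = value_fn(s_next)                                    # (N,)
    return actions[scores.argmax()]

# Stand-ins so the sketch runs end-to-end:
state_dim, act_dim = 64, 8
policy = lambda s: torch.randn(s.shape[0], act_dim)
world_model = lambda s, a: s + 0.1 * torch.randn_like(s)
value_fn = lambda s_next: s_next.sum(dim=-1)
best_action = select_action(policy, world_model, value_fn, torch.randn(1, state_dim))
```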
There’s Past Token Prediction.
2. Second way
Bake the world model into the policy.
Evaluate the value at the world model’s predicted s'? Doesn’t this just collapse down to V(s')? Yes.
So what if we had both Q(s,a) and V(s') computed?
- There’s inaccuracy in Q(s,a)
- modeling error from the Q-function approximation
- There’s inaccuracy in V(s')
- modeling error from the value function approximation
- There’s inaccuracy in the computed s'
- modeling error from the world model, e.g., s' might actually not be reachable
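One hypothetical way to use both estimates instead of trusting either alone: compute Q(s,a) directly and V at the world model’s predicted s', then combine them conservatively. The min-combination is just an illustrative choice, not something proposed above.

```python
import torch

def combined_estimate(q_fn, v_fn, world_model, s, a):
    """Blend a direct Q(s, a) with V(s') evaluated at the model-predicted next state."""
    q_direct = q_fn(s, a)            # inaccurate: Q-network approximation error
    s_next_pred = world_model(s, a)  # inaccurate: predicted s' may not even be reachable
    v_at_pred = v_fn(s_next_pred)    # inaccurate: value-network approximation error
    # Conservative combination (assumed choice): trust the lower of the two estimates.
    return torch.minimum(q_direct, v_at_pred)

# Stand-ins to show the call shapes:
q_fn = lambda s, a: s.sum(-1) + a.sum(-1)
v_fn = lambda s: s.sum(-1)
world_model = lambda s, a: s + 0.1 * a.mean(-1, keepdim=True)
est = combined_estimate(q_fn, v_fn, world_model, torch.randn(8, 64), torch.randn(8, 8))
```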
How do we teach the model to predict better frames?
The solution might just be Goal-Conditioned RL.