Research Ideas
Ideas are cheap. It’s all about who can execute the fastest. Below is a list of ideas that I will execute on for robot learning over the next year.
Every day, write down a list of project ideas to force yourself to be more creative.
Also, building benchmarks is valuable; they give other people something to build on top of.
Active Perception
- Asking a robot to find my iPhone
- Can we combine this with goal-conditioning?
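A minimal sketch of what "find my iPhone" combined with goal-conditioning could look like: a policy that takes image features plus a goal embedding (e.g., a text encoding of the target object) and outputs search actions. All module names, dimensions, and encoder choices below are assumptions, not a worked-out method.

```python
import torch
import torch.nn as nn

class GoalConditionedSearchPolicy(nn.Module):
    """Hypothetical sketch: output search actions conditioned on a goal embedding."""

    def __init__(self, obs_dim=512, goal_dim=384, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),  # e.g., head/base velocity commands
        )

    def forward(self, obs_feat, goal_emb):
        # obs_feat: image features from any visual encoder
        # goal_emb: embedding of the goal, e.g., text("find my iPhone")
        return self.net(torch.cat([obs_feat, goal_emb], dim=-1))

# Random tensors standing in for real encoders:
policy = GoalConditionedSearchPolicy()
action = policy(torch.randn(1, 512), torch.randn(1, 384))
```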
Offline RL
- Monte-Carlo offline RL
- On-policy offline RL
- This would require a world model, and the policy learns on-policy through the world model (very similar to the Learning to Drive from a World Model idea)
- Use the Monte-Carlo return as opposed to the bootstrapped return initially (which will essentially collapse to imitation learning).
- Distributional goal-conditioned RL
- Instead of conditioning on a goal, what if we could condition on a distribution of goals? We would have π(a | s, p(g)), where the policy also learns a Gaussian p(g) to approximate the goal distribution.
- Conditioning on a distribution of rewards
- I don’t like this because there are going to be lots of moving parts
- Flow-matching based Q-function
- Would this work?
- Large Q-functions
- Integrating a world model into the Q-function (having a Q with joint objectives to predict state and value) (Q(s,a) → r, s’)
- Q(s,a | g) → r,s’
- Action chunked world model?
- W(s0 | a0, a1, a2, …, a_{h-1}) = s1, s2, s3, …, s_h
- Compounding errors? (a rough sketch of this chunked model follows the list)
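A minimal sketch of the action-chunked world model above, W(s0 | a0, …, a_{h-1}) → s1, …, s_h: predict the whole chunk of future states in one forward pass instead of rolling a one-step model h times, which is where compounding errors usually come from. All sizes and names are assumed.

```python
import torch
import torch.nn as nn

class ChunkedWorldModel(nn.Module):
    """Predict h future states from the current state and an h-step action chunk."""

    def __init__(self, state_dim=64, act_dim=8, horizon=16, hidden=512):
        super().__init__()
        self.horizon, self.state_dim = horizon, state_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + horizon * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * state_dim),  # s1..sh in one shot
        )

    def forward(self, s0, action_chunk):
        # s0: (B, state_dim), action_chunk: (B, horizon, act_dim)
        x = torch.cat([s0, action_chunk.flatten(1)], dim=-1)
        return self.net(x).view(-1, self.horizon, self.state_dim)

# Training signal: plain regression on logged trajectories (placeholder targets here).
model = ChunkedWorldModel()
s0, acts = torch.randn(32, 64), torch.randn(32, 16, 8)
loss = nn.functional.mse_loss(model(s0, acts), torch.randn(32, 16, 64))
```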
How do we improve Q(s,a) to be more general? What if we could combine the generality of Q-functions with function approximation?
- Q-functions cannot capture long horizons with sparse rewards
- But then how was Q-learning able to solve Atari?
- Tokenizing s,a?
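One hypothetical reading of "tokenizing s,a": uniformly bin each state and action dimension into discrete tokens and let a small transformer regress Q, so the same architecture can be reused across embodiments with different state/action spaces. The binning scheme and sizes below are assumptions.

```python
import torch
import torch.nn as nn

class TokenizedQ(nn.Module):
    """Q(s, a) where every state/action dimension is binned into a discrete token."""

    def __init__(self, n_bins=256, d_model=128):
        super().__init__()
        self.n_bins = n_bins
        self.embed = nn.Embedding(n_bins, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.q_head = nn.Linear(d_model, 1)

    def tokenize(self, x):
        # Map each dimension in [-1, 1] to an integer bin id.
        return ((x.clamp(-1, 1) + 1) / 2 * (self.n_bins - 1)).long()

    def forward(self, state, action):
        tokens = self.tokenize(torch.cat([state, action], dim=-1))
        h = self.encoder(self.embed(tokens))
        return self.q_head(h.mean(dim=1)).squeeze(-1)

q = TokenizedQ()
values = q(torch.rand(4, 10) * 2 - 1, torch.rand(4, 6) * 2 - 1)  # shape (4,)
```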
Upside-Down RL: Q(s, r) → a
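In Upside-Down RL the mapping is flipped: instead of predicting value from (s, a), a behavior function maps (state, desired return) → action and is trained with plain supervised learning on relabeled returns from logged data. A minimal sketch, with all sizes assumed:

```python
import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    """Upside-Down RL style: action = f(state, desired return)."""

    def __init__(self, state_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, state, desired_return):
        return self.net(torch.cat([state, desired_return], dim=-1))

# Supervised training on logged data: the label is the action actually taken,
# and the command is the Monte-Carlo return actually achieved from that state.
f = BehaviorFunction()
states, returns_to_go = torch.randn(64, 32), torch.randn(64, 1)
actions_taken = torch.randn(64, 7)
loss = nn.functional.mse_loss(f(states, returns_to_go), actions_taken)
```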
How do we achieve AGI in robotics? What’s missing?
What’s missing is the ability for robots to self-improve over time robustly.
- Reinforcement learning (generally model-free) →
- “How good are these actions?”
- teaching robots to learn on their own
- Q-learning
- Do we see policy gradient methods for manipulation? I have not seen any. Why not?
- Q-Chunking (a rough sketch of a critic over action chunks follows this list)
- World models
- “Where am I going to end up when I take this action?” Perhaps this is called reasoning? Or search? This is needed for robustness (I don’t care as much about safety; it’s more about recovering from failure states)
- Learning the dynamics of the world
- Dream to Control: Learning Behaviors by Latent Imagination
- How do we bake a world model into our policy?
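A rough sketch of the Q-Chunking idea mentioned above: a critic over an action chunk rather than a single action, so the bootstrap jumps h steps at a time and sparse rewards propagate faster. Everything here is an assumed shape, not the published method.

```python
import torch
import torch.nn as nn

class ChunkedQ(nn.Module):
    """Q(s, a_{t:t+h}): value of executing a whole action chunk from state s."""

    def __init__(self, state_dim=64, act_dim=8, horizon=8, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + horizon * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action_chunk):
        # action_chunk: (B, horizon, act_dim), e.g., sampled from a chunked policy
        return self.net(torch.cat([state, action_chunk.flatten(1)], dim=-1)).squeeze(-1)

# The n-step target would bootstrap h steps ahead:
#   y = sum_{k<h} gamma^k * r_{t+k} + gamma^h * Q(s_{t+h}, a_{t+h:t+2h})
critic = ChunkedQ()
q_val = critic(torch.randn(16, 64), torch.randn(16, 8, 8))
```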
Policy: no need for proprio history? We need proprio history somewhere.
W(s, s') = a (inverse dynamics); W(s, a) = s' (forward dynamics)
Perhaps we need a Q world model? (s, a) → (s', Q)
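A minimal sketch of the "Q world model" idea: one network with a shared trunk and two heads, (s, a) → (s', Q), trained jointly on a dynamics loss and a value loss. The architecture and losses below are assumptions.

```python
import torch
import torch.nn as nn

class QWorldModel(nn.Module):
    """Joint model: shared trunk with a next-state head and a Q head."""

    def __init__(self, state_dim=64, act_dim=8, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)
        self.q_head = nn.Linear(hidden, 1)

    def forward(self, s, a):
        h = self.trunk(torch.cat([s, a], dim=-1))
        return self.next_state_head(h), self.q_head(h).squeeze(-1)

model = QWorldModel()
s_next_pred, q_pred = model(torch.randn(32, 64), torch.randn(32, 8))
# Joint objective (targets below are placeholders for real batch data / TD targets):
loss = (nn.functional.mse_loss(s_next_pred, torch.randn(32, 64))
        + nn.functional.mse_loss(q_pred, torch.randn(32)))
```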
There’s also Goal-Conditioned RL.
1. First way: a three-stage approach
- Policy sampling
- Query the world model to get s'
- Query the Q function conditioned on all three (s, a, s'). But isn’t that just V(s')?
The problems of multi-stage approaches: slow, inefficient, hard to debug. Really, it should be end-to-end?
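A minimal sketch of the three-stage loop above (sample candidates from the policy, query the world model for s', score with V(s')), with every component mocked as a stand-in. It also makes the slowness concrete: every decision runs three networks in sequence.

```python
import torch

def select_action(policy, world_model, value_fn, s, n_candidates=64):
    """Three-stage action selection: sample -> predict s' -> score with V(s')."""
    # Stage 1: policy sampling - draw candidate actions for the current state.
    actions = policy(s.expand(n_candidates, -1))                 # (N, act_dim)
    # Stage 2: query the world model for the predicted next state of each candidate.
    s_next = world_model(s.expand(n_candidates, -1), actions)    # (N, state_dim)
    # Stage 3: score each candidate by the value of where it lands, i.e. V(s').
    scores = value_fn(s_next)                                    # (N,)
    return actions[scores.argmax()]

# Stand-ins so the sketch runs end-to-end:
state_dim, act_dim = 64, 8
policy = lambda s: torch.randn(s.shape[0], act_dim)
world_model = lambda s, a: s + 0.1 * torch.randn_like(s)
value_fn = lambda s_next: s_next.sum(dim=-1)
best_action = select_action(policy, world_model, value_fn, torch.randn(1, state_dim))
```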
There’s Past Token Prediction.
2. Second way
Bake the world model into the policy.
Evaluate the value at the world model’s predicted s'? Doesn’t this just collapse down to V(s')? Yes.
So what if we had both Q(s,a) and V(s') computed?
- There’s inaccuracy in Q(s,a)
- modeling error from the Q-function approximation
- There’s inaccuracy in V(s')
- modeling error from the value function approximation
- There’s inaccuracy in the computed s'
- modeling error from the world model, e.g., s' might actually not be reachable
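One hypothetical way to use both estimates instead of trusting either alone: compute Q(s,a) directly and V at the world model’s predicted s', then combine them conservatively. The min-combination is just an illustrative choice, not something proposed above.

```python
import torch

def combined_estimate(q_fn, v_fn, world_model, s, a):
    """Blend a direct Q(s, a) with V(s') evaluated at the model-predicted next state."""
    q_direct = q_fn(s, a)            # inaccurate: Q-network approximation error
    s_next_pred = world_model(s, a)  # inaccurate: predicted s' may not even be reachable
    v_at_pred = v_fn(s_next_pred)    # inaccurate: value-network approximation error
    # Conservative combination (assumed choice): trust the lower of the two estimates.
    return torch.minimum(q_direct, v_at_pred)

# Stand-ins to show the call shapes:
q_fn = lambda s, a: s.sum(-1) + a.sum(-1)
v_fn = lambda s: s.sum(-1)
world_model = lambda s, a: s + 0.1 * a.mean(-1, keepdim=True)
est = combined_estimate(q_fn, v_fn, world_model, torch.randn(8, 64), torch.randn(8, 8))
```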
How do we teach the model to predict better frames?
The solution might just be Goal-Conditioned RL.