Research Ideas

Ideas are cheap. It’s all about who can execute the fastest. Below is a list of ideas that I will execute on for robot learning over the next year.

Every day, write down a list of project ideas to force yourself to be more creative.

Also, building benchmarks is nice; they give people something to build on top of.

Active Perception

  • Asking a robot to find my iPhone
  • Can we combine this with goal-conditioning?

Offline RL

  • Monte-Carlo offline RL
  • On-policy offline RL
  • Use the Monte-Carlo return as opposed to the bootstrapped return initially (which will essentially collapse to imitation learning); see the sketch after this list.
  • Distributional goal-conditioned RL
    • Instead of conditioning on a goal, what if we could condition on a distribution of goals? We would have something like π(a | s, p(g)), where the policy also learns a Gaussian to approximate the goal distribution?
    • Conditioning on a distribution of rewards
    • I don’t like this because there are going to be lots of moving parts
  • Flow-matching based Q-function
    • Would this work?
  • Large Q-functions
  • Integrating a world model into the Q-function (having a Q with joint objectives to predict state and value): Q(s, a) → r, s’
  • Q(s, a | g) → r, s’
  • Action-chunked world model? (see the sketch below this list)
    • W(s_0 | a_0, a_1, a_2, …, a_{h-1}) = s_1, s_2, s_3, …, s_{h-1}
    • Compounding errors?
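A minimal sketch of the Monte-Carlo idea above, assuming a simple MLP critic and a single offline trajectory (all names here are placeholders, not an existing codebase): compute the discounted return-to-go and regress Q(s, a) onto it directly, with no bootstrapping.

```python
# Hypothetical sketch: regress Q(s, a) onto Monte-Carlo returns from an
# offline trajectory, instead of the usual bootstrapped TD target.
import torch
import torch.nn as nn

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return-to-go G_t = r_t + gamma * G_{t+1} for one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

class QFunction(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

# Toy usage on random data standing in for one offline trajectory.
obs_dim, act_dim, T = 8, 2, 50
obs = torch.randn(T, obs_dim)
act = torch.randn(T, act_dim)
rew = torch.randn(T).tolist()

q = QFunction(obs_dim, act_dim)
opt = torch.optim.Adam(q.parameters(), lr=3e-4)

targets = torch.tensor(monte_carlo_returns(rew))   # no bootstrapping
loss = nn.functional.mse_loss(q(obs, act), targets)
opt.zero_grad(); loss.backward(); opt.step()
```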

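And a hedged sketch of the action-chunked world model bullet, under the assumption that the whole chunk of future states is predicted jointly in one forward pass (module and variable names are made up for illustration):

```python
# Hypothetical sketch of an action-chunked world model:
# W(s_0, a_0..a_{h-1}) -> predicted future states in one forward pass.
import torch
import torch.nn as nn

class ChunkedWorldModel(nn.Module):
    def __init__(self, state_dim, act_dim, horizon, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.state_dim = state_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + horizon * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * state_dim),
        )

    def forward(self, s0, action_chunk):
        # s0: (B, state_dim), action_chunk: (B, horizon, act_dim)
        x = torch.cat([s0, action_chunk.flatten(1)], dim=-1)
        return self.net(x).view(-1, self.horizon, self.state_dim)

# Predicting the chunk jointly avoids feeding predictions back in,
# which is one way to sidestep single-step compounding error.
wm = ChunkedWorldModel(state_dim=8, act_dim=2, horizon=4)
s0 = torch.randn(16, 8)
acts = torch.randn(16, 4, 2)
print(wm(s0, acts).shape)  # torch.Size([16, 4, 8])
```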
How do we improve Q(s, a) to be more general? What if we could combine the generality of Q-functions with function approximation?

  • Q-functions cannot capture long horizons with sparse rewards
  • But then how was Q-learning able to solve Atari?
  • Tokenizing s,a?

Upside-Down RL: Q(s, r) → a
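A minimal sketch of the Upside-Down RL idea, assuming a return-conditioned policy trained by plain supervised learning on logged (state, return-to-go, action) tuples; everything here is a hypothetical stand-in:

```python
# Hypothetical upside-down RL sketch: a return-conditioned policy
# pi(a | s, desired_return), trained by supervised learning on
# (state, achieved return-to-go, action) tuples from logged data.
import torch
import torch.nn as nn

class ReturnConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, desired_return):
        return self.net(torch.cat([obs, desired_return], dim=-1))

policy = ReturnConditionedPolicy(obs_dim=8, act_dim=2)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

# One supervised step on a toy batch (random stand-ins for logged data).
obs = torch.randn(32, 8)
rtg = torch.randn(32, 1)        # achieved return-to-go at each step
act = torch.randn(32, 2)
loss = nn.functional.mse_loss(policy(obs, rtg), act)
opt.zero_grad(); loss.backward(); opt.step()

# At test time, command a high return and act on the output.
action = policy(torch.randn(1, 8), torch.tensor([[10.0]]))
```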

How to achieve AGI in robotics, what’s missing?

What’s missing is the ability of robots to self-improve robustly over time.

  1. Reinforcement learning (model-free generally)
  • “How good are these actions?”
  • teaching robots to learn on their own
  • Q-learning
    • Do we see policy gradient methods for manipulation? I have not. Why not?
  • Q-Chunking (see the sketch after this list)
  2. World models
  • “Where am I going to end up when I take this action?” Perhaps this is called reasoning? Or search? This is needed for robustness (I don’t care as much about safety, more about recovering from failure states)
  • Learning the dynamics of the world
  • Dream to Control: Learning Behaviors by Latent Imagination
  • How do we bake a world model into our policy?
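As a hedged sketch of the Q-Chunking item above (placeholder names, not the published method’s code): a critic that scores a whole action chunk at once, Q(s, a_t, …, a_{t+h-1}), instead of a single action.

```python
# Hypothetical sketch of a chunked critic: Q(s, a_t..a_{t+h-1}) scores a
# whole action chunk at once instead of a single action.
import torch
import torch.nn as nn

class ChunkedQ(nn.Module):
    def __init__(self, obs_dim, act_dim, horizon, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + horizon * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, action_chunk):
        # obs: (B, obs_dim), action_chunk: (B, horizon, act_dim)
        x = torch.cat([obs, action_chunk.flatten(1)], dim=-1)
        return self.net(x).squeeze(-1)

# Score a few candidate chunks (e.g. sampled from a chunked policy) and
# keep the best one to execute open-loop.
q = ChunkedQ(obs_dim=8, act_dim=2, horizon=4)
obs = torch.randn(5, 8)                 # same state repeated per candidate
chunks = torch.randn(5, 4, 2)           # 5 candidate action chunks
best = chunks[q(obs, chunks).argmax()]  # (4, 2) chunk to execute
```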

Policy: no need for proprio-history? We need proprio-history somewhere.

Dreamer

W(s, s’) = a (inverse dynamics); W(s, a) = s’ (forward dynamics)

Perhaps we need a Q world model? (s, a) → s’, Q
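A hedged sketch of the world-model variants above: an inverse model W(s, s’) → a, a forward model W(s, a) → s’, and the joint “Q world model” (s, a) → s’, Q. All module names are hypothetical.

```python
# Hypothetical sketch: inverse dynamics (s, s') -> a, forward dynamics
# (s, a) -> s', and a joint "Q world model" head (s, a) -> (s', Q).
import torch
import torch.nn as nn

def mlp(inp, out, hidden=256):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(),
                         nn.Linear(hidden, out))

class QWorldModel(nn.Module):
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.inverse = mlp(2 * s_dim, a_dim)           # (s, s') -> a
        self.forward_dyn = mlp(s_dim + a_dim, s_dim)   # (s, a) -> s'
        self.q_head = mlp(s_dim + a_dim, 1)            # (s, a) -> Q

    def predict(self, s, a):
        sa = torch.cat([s, a], dim=-1)
        s_next = self.forward_dyn(sa)
        q = self.q_head(sa).squeeze(-1)
        return s_next, q          # joint objective: dynamics + value

    def infer_action(self, s, s_next):
        return self.inverse(torch.cat([s, s_next], dim=-1))

model = QWorldModel(s_dim=8, a_dim=2)
s, a = torch.randn(4, 8), torch.randn(4, 2)
s_next_pred, q_pred = model.predict(s, a)
a_hat = model.infer_action(s, s_next_pred)
```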

There’s also Goal-Conditioned RL.

1. First way

Three stages:

  1. Policy sampling.
  2. Query the world model to get s’.
  3. Query the Q function conditioned on all three (s, a, s’). But isn’t that just V(s’)?
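A minimal sketch of this three-stage loop, assuming a pretrained policy, world model, and value function (all toy stand-ins below): sample candidate actions, predict s’ with the world model, score with V(s’), and execute the best candidate.

```python
# Hypothetical three-stage action selection: (1) sample candidate actions
# from the policy, (2) query the world model for s', (3) score with the
# critic and execute the best candidate.
import torch

def select_action(policy, world_model, value_fn, s, num_samples=16):
    s_rep = s.expand(num_samples, -1)            # repeat state per candidate
    actions = policy(s_rep)                      # stage 1: policy sampling
    s_next = world_model(s_rep, actions)         # stage 2: predicted s'
    scores = value_fn(s_next)                    # stage 3: V(s') as the score
    return actions[scores.argmax()]

# Toy stand-ins so the sketch runs end-to-end.
policy = lambda s: torch.randn(s.shape[0], 2)             # stochastic samples
world_model = lambda s, a: s + 0.1 * torch.randn_like(s)  # fake dynamics
value_fn = lambda s_next: s_next.sum(dim=-1)              # fake V(s')

s = torch.randn(1, 8)
print(select_action(policy, world_model, value_fn, s))
```

Stage 3 only ever sees predicted states, which makes the point above concrete: the score really is just V(s’).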

The problems of multi-stage approaches:

Slow, inefficient, and hard to debug. Really, shouldn’t it be end-to-end?

There’s Past Token Prediction.

2. Second way

Bake the world model into the policy.

Compute V(W(s, a))? Doesn’t this just collapse down to V(s’)? Yes.

So what if we had both Q(s, a) and V(s’) computed?

  • There’s inaccuracy in Q(s, a)
    • modeling error from the Q-function
  • There’s inaccuracy in V(s’)
    • modeling error from the value function
  • There’s inaccuracy in the computed s’
    • modeling error from the world model; e.g. s’ might actually not be reachable
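One hedged way to use both estimates despite the three error sources above (a sketch, not a recommendation): blend Q(s, a) with V evaluated on the world-model prediction s’ = W(s, a), with a trust weight α.

```python
# Hypothetical sketch: blend the direct critic Q(s, a) with V(s') evaluated
# on the world-model prediction s' = W(s, a). The weight alpha encodes how
# much we trust the world model relative to the critic.
import torch

def blended_score(q_fn, v_fn, world_model, s, a, alpha=0.5):
    direct = q_fn(s, a)          # subject to Q-function modeling error
    s_next = world_model(s, a)   # subject to world-model error (s' may be unreachable)
    imagined = v_fn(s_next)      # subject to value-function modeling error
    return alpha * direct + (1.0 - alpha) * imagined

# Toy stand-ins so the sketch runs.
q_fn = lambda s, a: s.sum(-1) + a.sum(-1)
v_fn = lambda s_next: s_next.sum(-1)
world_model = lambda s, a: s + 0.1 * a.mean(-1, keepdim=True)

s, a = torch.randn(4, 8), torch.randn(4, 2)
print(blended_score(q_fn, v_fn, world_model, s, a))
```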

How do we teach the model to predict better frames?

The solution might just be Goal-Conditioned RL.