Learning from Human Videos (LfV)
Broad field; so far mostly focused on locomotion, where the expert demos come from mocap. How do we stop relying on mocap?
What I need to understand:
- How do people close the cross-embodiment gap? (see the retargeting sketch after this list)
- How is training done specifically? RL or imitation learning?
- What is not solved in the field?
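My current understanding of the cross-embodiment answer: estimate human pose from video, then retarget it onto the robot's morphology before tracking it. A naive joint-space sketch of that retargeting step, where every joint name, scale, and limit is invented for illustration (real pipelines solve an optimization against the robot's URDF/kinematics):

```python
import numpy as np

# Hypothetical correspondence: human pose-estimator joints -> robot DoFs.
# Names, scales, and limits are made up for this sketch.
HUMAN_TO_ROBOT = {
    "left_hip_pitch":  ("l_hip_pitch", 1.0),
    "left_knee":       ("l_knee",      1.0),
    "right_hip_pitch": ("r_hip_pitch", 1.0),
    "right_knee":      ("r_knee",      1.0),
}
ROBOT_LIMITS = {  # radians, invented for the sketch
    "l_hip_pitch": (-1.5, 1.5), "l_knee": (0.0, 2.3),
    "r_hip_pitch": (-1.5, 1.5), "r_knee": (0.0, 2.3),
}

def retarget_frame(human_angles: dict[str, float]) -> dict[str, float]:
    """Map one frame of estimated human joint angles onto robot joints,
    scaling and clipping to the robot's limits (naive joint-space retargeting)."""
    robot_targets = {}
    for h_joint, angle in human_angles.items():
        if h_joint not in HUMAN_TO_ROBOT:
            continue  # joints the robot doesn't have are dropped
        r_joint, scale = HUMAN_TO_ROBOT[h_joint]
        lo, hi = ROBOT_LIMITS[r_joint]
        robot_targets[r_joint] = float(np.clip(scale * angle, lo, hi))
    return robot_targets
```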
https://www.youtube.com/watch?v=RdPftGBhN8c&t=2386s&ab_channel=CMURoboticsInstitute
Survey paper:
Relevant line of work:
- SFV: Reinforcement Learning of Physical Skills from Videos (video pose estimation + DeepMimic-style motion imitation; reward sketch after this list)
- Learning Physically Simulated Tennis Skills from Broadcast Videos
- VideoMimic
- BeyondMimic
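To make the "DeepMimic-style" training concrete (and partly answer the RL-vs-imitation question above): these pipelines run RL with a dense imitation reward that tracks the reference motion recovered from video. A minimal numpy sketch of the DeepMimic reward, simplified to treat joint rotations as flat angle vectors rather than the paper's quaternion differences:

```python
import numpy as np

def deepmimic_reward(q, dq, ee, com, q_ref, dq_ref, ee_ref, com_ref):
    """DeepMimic-style imitation reward: exponentiated tracking errors on joint
    rotations, joint velocities, end-effector positions, and center of mass.
    Weights and error scales follow the DeepMimic paper; inputs are flat numpy
    arrays (character state vs. reference-motion state at this timestep)."""
    r_pose = np.exp(-2.0 * np.sum((q_ref - q) ** 2))       # joint rotations
    r_vel  = np.exp(-0.1 * np.sum((dq_ref - dq) ** 2))     # joint velocities
    r_ee   = np.exp(-40.0 * np.sum((ee_ref - ee) ** 2))    # end-effector positions
    r_com  = np.exp(-10.0 * np.sum((com_ref - com) ** 2))  # center of mass
    return 0.65 * r_pose + 0.10 * r_vel + 0.15 * r_ee + 0.10 * r_com
```

So "imitation" here mostly means imitation via an RL tracking objective (DeepMimic and SFV train the policy with PPO against this reward), not behavior cloning on actions.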
Skill discovery