XSkill: Cross Embodiment Skill Discovery

They seem to take both a human video dataset and a robot dataset, and do some joint training / fine-tuning on them?

Per the project website, the method has three stages:

  1. Discover
  2. Transfer
  3. Compose

I think what’s cool about this paper is that we don’t need to explicitly segment the episode into sub-skills: when we give a human prompt video, the method can do this alignment via self-supervised learning.

How do they get this alignment?

  • To achieve this alignment, XSkill introduces a set of K learnable skill prototypes {c_k}_{k=1}^K.
  • These prototypes act as representative anchors in the continuous embedding space.
  • The crucial step is that XSkill forces both the human and robot skill representations to share the same set of prototypes. This constraint means that a video clip showing a robot grasping an object and a video clip showing a human grasping an object must both map to the same anchor point (prototype) in the shared space.
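
A minimal PyTorch sketch of what a shared prototype bank could look like: one set of K learnable vectors, and a cosine-similarity score from any clip embedding (human or robot) to each prototype. The class name, dimensions, and scoring choice are my assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrototypes(nn.Module):
    """K learnable prototype vectors shared by the human and robot encoders."""

    def __init__(self, num_prototypes: int = 32, dim: int = 128):
        super().__init__()
        # One prototype bank for BOTH embodiments -- this is the constraint
        # that forces human and robot clips of the same skill to the same anchor.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) clip embeddings from either the human or robot encoder.
        z = F.normalize(z, dim=-1)
        c = F.normalize(self.prototypes, dim=-1)
        # Cosine similarity of each clip to each prototype: (batch, K).
        return z @ c.t()

# Usage: the same prototype bank scores both embodiments.
protos = SharedPrototypes(num_prototypes=32, dim=128)
z_human = torch.randn(8, 128)   # stand-in for human-clip embeddings
z_robot = torch.randn(8, 128)   # stand-in for robot-clip embeddings
scores_human = protos(z_human)  # (8, 32)
scores_robot = protos(z_robot)  # (8, 32)
```

Because the bank is shared, a softmax over these scores is a distribution over the same K skills no matter which embodiment produced the clip.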

1. Discover

They use the Sinkhorn-Knopp algorithm (for balanced assignment of clips to prototypes) to learn a skill representation that is invariant to embodiment, i.e., whether a human or a robot performs the skill (sketched below).

Skills as Prototypes. Once the skill representations are obtained from demonstration videos, XSkill maps representations from all embodiments to a set of K skill prototypes {c_k}_{k=1}^K, where each c_k is a learnable vector.
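
A sketch of the Sinkhorn-Knopp step, written in the usual SwAV style: exponentiate the prototype scores, then alternately normalize over prototypes and over the batch so that assignments stay balanced. The epsilon and iteration count are placeholders, not the paper's settings.

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Turn prototype scores (batch, K) into balanced soft assignments.

    Iteratively normalizes rows and columns so each prototype receives roughly
    the same total assignment mass -- this keeps all clips (human or robot)
    from collapsing onto a single prototype.
    """
    Q = torch.exp(scores / eps).t()       # (K, batch)
    Q /= Q.sum()                          # treat as a joint distribution
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)   # normalize each prototype row
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)   # normalize each clip column
        Q /= B
    return (Q * B).t()                    # (batch, K), rows sum to ~1

# The balanced assignments can then serve as targets for a swapped-prediction /
# cross-entropy style loss between the human and robot views.
```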

2. Transfer

They use a skill-conditioned diffusion policy: imitation learning on the robot demonstrations, with the policy conditioned on the discovered skill representation.
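
A toy sketch of what a skill-conditioned diffusion policy can look like: a denoiser that takes a noisy action, the observation, the skill vector from the Discover stage, and the diffusion timestep, and is trained to predict the injected noise. The MLP architecture, noise schedule, and all dimensions are illustrative assumptions; the paper's actual policy (e.g., backbone, action chunking) will differ.

```python
import torch
import torch.nn as nn

class SkillConditionedDenoiser(nn.Module):
    """Predicts the noise on a noisy action, given obs, skill vector, and timestep."""

    def __init__(self, obs_dim: int, act_dim: int, skill_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + skill_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_action, obs, skill, t):
        # t is the diffusion timestep, scaled to [0, 1] and appended as a scalar feature.
        x = torch.cat([noisy_action, obs, skill, t], dim=-1)
        return self.net(x)

# One DDPM-style training step: add noise to a demo action, ask the denoiser
# to recover that noise, conditioned on the skill representation of the clip.
obs_dim, act_dim, skill_dim, T = 20, 7, 128, 100
model = SkillConditionedDenoiser(obs_dim, act_dim, skill_dim)
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

obs = torch.randn(16, obs_dim)       # stand-in robot observations
action = torch.randn(16, act_dim)    # stand-in demo actions
skill = torch.randn(16, skill_dim)   # stand-in skill vectors from the Discover stage

t = torch.randint(0, T, (16,))
noise = torch.randn_like(action)
a_bar = alphas_bar[t].unsqueeze(-1)
noisy_action = a_bar.sqrt() * action + (1 - a_bar).sqrt() * noise

pred = model(noisy_action, obs, skill, (t.float() / T).unsqueeze(-1))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
```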