XSkill: Cross Embodiment Skill Discovery

They seem to take both a human video dataset and a robot dataset, and do some joint training / fine-tuning on them?

Per the project website, the method has three stages:

  1. Discover
  2. Transfer
  3. Compose

I think what’s cool about this paper is that we don’t need to explicitly segment the episode into sub-skills: when we give a human prompt video, the method can do this alignment via self-supervised learning.

How do they get this alignment?

  • To achieve this alignment, XSkill introduces a set of K learnable skill prototypes {c_k}_{k=1}^K.
  • These prototypes act as representative anchors in the continuous embedding space.
  • The crucial step is that XSkill forces both the human and robot skill representations to share the same set of prototypes. This constraint means that a video clip showing a robot grasping an object and a video clip showing a human grasping an object must both map to the same anchor point (prototype) in the shared space.
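
A minimal PyTorch sketch of what a shared prototype bank could look like: one set of K learnable vectors, and a cosine-similarity score from any clip embedding (human or robot) to each prototype. The class name, dimensions, and scoring choice are my assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrototypes(nn.Module):
    """K learnable prototype vectors shared by the human and robot encoders."""

    def __init__(self, num_prototypes: int = 32, dim: int = 128):
        super().__init__()
        # One prototype bank for BOTH embodiments -- this is the constraint
        # that forces human and robot clips of the same skill to the same anchor.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) clip embeddings from either the human or robot encoder.
        z = F.normalize(z, dim=-1)
        c = F.normalize(self.prototypes, dim=-1)
        # Cosine similarity of each clip to each prototype: (batch, K).
        return z @ c.t()

# Usage: the same prototype bank scores both embodiments.
protos = SharedPrototypes(num_prototypes=32, dim=128)
z_human = torch.randn(8, 128)   # stand-in for human-clip embeddings
z_robot = torch.randn(8, 128)   # stand-in for robot-clip embeddings
scores_human = protos(z_human)  # (8, 32)
scores_robot = protos(z_robot)  # (8, 32)
```

Because the bank is shared, a softmax over these scores is a distribution over the same K skills no matter which embodiment produced the clip.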

1. Discover

They use the Sinkhorn-Knopp algorithm (for balanced assignment of clips to prototypes) to learn a skill representation that is invariant to embodiment, i.e., whether a human or a robot performs the skill (sketched below).

Skills as Prototypes. Once the skill representations are obtained from demonstration videos, XSkill maps representations from all embodiments to a set of K skill prototypes {c_k}_{k=1}^K, where each c_k is a learnable vector.
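
A sketch of the Sinkhorn-Knopp step, written in the usual SwAV style: exponentiate the prototype scores, then alternately normalize over prototypes and over the batch so that assignments stay balanced. The epsilon and iteration count are placeholders, not the paper's settings.

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Turn prototype scores (batch, K) into balanced soft assignments.

    Iteratively normalizes rows and columns so each prototype receives roughly
    the same total assignment mass -- this keeps all clips (human or robot)
    from collapsing onto a single prototype.
    """
    Q = torch.exp(scores / eps).t()       # (K, batch)
    Q /= Q.sum()                          # treat as a joint distribution
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)   # normalize each prototype row
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)   # normalize each clip column
        Q /= B
    return (Q * B).t()                    # (batch, K), rows sum to ~1

# The balanced assignments can then serve as targets for a swapped-prediction /
# cross-entropy style loss between the human and robot views.
```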

2. Transfer

They use a skill-conditioned diffusion policy: imitation learning on the robot demonstrations, with the policy conditioned on the discovered skill representation.
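
A toy sketch of what a skill-conditioned diffusion policy can look like: a denoiser that takes a noisy action, the observation, the skill vector from the Discover stage, and the diffusion timestep, and is trained to predict the injected noise. The MLP architecture, noise schedule, and all dimensions are illustrative assumptions; the paper's actual policy (e.g., backbone, action chunking) will differ.

```python
import torch
import torch.nn as nn

class SkillConditionedDenoiser(nn.Module):
    """Predicts the noise on a noisy action, given obs, skill vector, and timestep."""

    def __init__(self, obs_dim: int, act_dim: int, skill_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + skill_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_action, obs, skill, t):
        # t is the diffusion timestep, scaled to [0, 1] and appended as a scalar feature.
        x = torch.cat([noisy_action, obs, skill, t], dim=-1)
        return self.net(x)

# One DDPM-style training step: add noise to a demo action, ask the denoiser
# to recover that noise, conditioned on the skill representation of the clip.
obs_dim, act_dim, skill_dim, T = 20, 7, 128, 100
model = SkillConditionedDenoiser(obs_dim, act_dim, skill_dim)
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

obs = torch.randn(16, obs_dim)       # stand-in robot observations
action = torch.randn(16, act_dim)    # stand-in demo actions
skill = torch.randn(16, skill_dim)   # stand-in skill vectors from the Discover stage

t = torch.randint(0, T, (16,))
noise = torch.randn_like(action)
a_bar = alphas_bar[t].unsqueeze(-1)
noisy_action = a_bar.sqrt() * action + (1 - a_bar).sqrt() * noise

pred = model(noisy_action, obs, skill, (t.float() / T).unsqueeze(-1))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
```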