TorchRL has a slightly unusual way of writing state transitions, different from the conventional OpenAI Gym API. This convention, however, is one of the things that enables parallelization.
Generally, an MDP transition follows this shape: at each step the environment returns
- a new observation,
- an indicator of task completion (terminated, truncated, done), and
- a reward signal.
This can get more complicated in multi-agent RL settings, and sometimes a reward is not even necessary (as in imitation learning scenarios).
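As a point of reference, here is a minimal sketch of the conventional Gym-style transition, using Gymnasium and CartPole-v1 as a stand-in environment:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

action = env.action_space.sample()
# One transition: the env returns a new observation,
# a reward signal, and the completion indicators.
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
```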
So the information at time t lives at the root of the TensorDict:
General Rule
Everything that belongs to time step t is stored at the root of the tensordict, while everything that belongs to t+1 is stored in the "next" entry of the tensordict.
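A minimal sketch of what this looks like in practice, assuming a Gymnasium backend and Pendulum-v1 as the environment (the exact keys can vary by env):

```python
from torchrl.envs import GymEnv
from torchrl.envs.utils import step_mdp

env = GymEnv("Pendulum-v1")

td = env.reset()         # root keys: "observation", "done", "terminated", ...
td = env.rand_step(td)   # adds "action" at the root and a "next" sub-tensordict

print(td["observation"])           # observation at time t (root)
print(td["action"])                # action taken at time t (root)
print(td["next", "observation"])   # observation at time t+1
print(td["next", "reward"])        # reward produced by the transition
print(td["next", "done"])          # done flag at time t+1

# step_mdp promotes the "next" entries to the root,
# producing the tensordict to use for the following step.
td = step_mdp(td)
```

Note that the reward and done flags live under "next", since they are produced by the transition into t+1.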