Going to plan for this paper.

There are a few pieces of background knowledge that I need in order to motivate the paper:

  • Single Agent vs Multi Agent, emergent behavior
  • Autocurriculum
  • UED

Intro: Hi everyone, my name is Steven, and today I’m going to be presenting the MAESTRO paper.

It’s kind of funny how I found out about this paper: I found it on Twitter. Just a bit of background about me, I currently do research under Professor Pant, so I shared it with him, and now apparently I am presenting it.

So here I am now.

I don’t know how many of you actually read the paper, or what your backgrounds are in RL, but I’d like to provide some intuition and background first. The general motivation for this paper is the question: how can we train “generally capable” RL agents? This paper attempts to answer that in a multi-agent setting.

show the 2 questions.

A lot of attention has been on games. But when we talk about games, there’s not really a notion of “generally capable”; it’s very game-specific.

  • Because the rules don’t change between games

I will be focusing on this question in the context of autonomous racing, because that is what I know, but the paper generalizes to a lot of other settings.

And you might think that they go hand in hand. If you can figure out

Most of what we’ve been hearing is techniques to generate competitive agents.

use the example of autonomous driving and racing but I will give some more context on other things.

  • One approach to achieving this goal is to use open-ended learning methods that automatically generate a curriculum of increasingly challenging tasks

There’s this saying in machine learning that the quality of your model really depends on the quality of your data: garbage in, garbage out. You want a model that generalizes well, for example one that predicts when the next natural disaster is going to be, and for that you want a diverse dataset.

In reinforcement learning, we don’t really have the notion of a dataset. Rather, we put an agent in an environment, the environment sends it signals in the form of rewards, and the agent learns by itself to choose the optimal actions in that environment.

Teach an AI how to drive. You want to expose it to different scenarios. Different configurations of cars.

In regular supervised learning, that would be your regular dataset with labels,

Wide distribution of environments?

  • Video games: different levels, maps, and game modes with varying degrees of complexity, visual appearance, and gameplay mechanics.
  • Robotics: different objects to grasp, surfaces to navigate, or obstacles to avoid.
  • Finance: different market scenarios, such as bull and bear markets, changes in interest rates or economic policies, and different asset classes with varying levels of risk and return.

So let’s formalize it.

Minimax Adversarial Design

You have a joint action space.

And this is where it starts getting really complicated.

Prior work in UED focuses on single-agent RL and does not address the dependency between the environment and the strategies of other agents within it. In multi-agent domains, the behaviour of other agents plays a critical role in modulating the complexity and diversity of the challenges faced by a learning agent.
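To make “minimax” concrete on a slide, here is a sketch of the regret objective in the usual UED style. The notation is my own assumption, not copied from the paper: θ is the environment parameters, π is the student, π′ is the co-player, V^θ(π, π′) is the student’s expected return, and π* is a best response to π′ on θ.

```latex
\mathrm{Regret}^{\theta}(\pi, \pi') = V^{\theta}(\pi^{*}, \pi') - V^{\theta}(\pi, \pi'),
\qquad
\pi^{\mathrm{MMR}} \in \arg\min_{\pi} \, \max_{(\theta, \pi')} \mathrm{Regret}^{\theta}(\pi, \pi')
```

The point to stress: the max is taken jointly over the environment and the co-player, which is exactly the dependency that single-agent UED leaves out.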

Come up with the best set

MAESTRO is a replay-based approach, like PLR. It maintains a population of co-players (previous frozen checkpoints of the student agent), and for each co-player, a running buffer of environments on which the student receives the highest regret when playing that co-player.
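To give some intuition for the bookkeeping described above, here is a minimal sketch of per-co-player buffers that keep only the highest-regret environments. Everything in it is an assumption for illustration: the class name, the `capacity` parameter, the string IDs, and the rule for picking the hardest pair are hypothetical, and the paper’s actual regret estimation and sampling rules are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class CoPlayerBuffers:
    """Illustrative per-co-player buffers of high-regret environments."""
    capacity: int = 4  # hypothetical buffer size per co-player
    # Maps co_player_id -> {env_id: latest regret estimate}
    buffers: dict = field(default_factory=dict)

    def update(self, co_player_id, env_id, regret):
        """Record a regret estimate and evict the lowest-regret env if full."""
        buf = self.buffers.setdefault(co_player_id, {})
        buf[env_id] = regret
        if len(buf) > self.capacity:
            lowest = min(buf, key=buf.get)
            del buf[lowest]

    def hardest_pair(self):
        """Return the (co_player_id, env_id, regret) with the highest regret."""
        best = None
        for co_player_id, buf in self.buffers.items():
            for env_id, regret in buf.items():
                if best is None or regret > best[2]:
                    best = (co_player_id, env_id, regret)
        return best


# Usage: checkpoints are identified by hypothetical string IDs.
buffers = CoPlayerBuffers(capacity=2)
buffers.update("ckpt_0", "track_a", 0.1)
buffers.update("ckpt_0", "track_b", 0.5)
buffers.update("ckpt_0", "track_c", 0.3)  # evicts track_a (lowest regret)
buffers.update("ckpt_1", "track_a", 0.9)
```

The design choice to highlight is that regret is stored per (co-player, environment) pair, so the same track can be easy against one checkpoint and hard against another.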

Car Racing Environment

For this environment, we recognise the agent with a higher episodic return as the winner of that episode.

All tracks used to train student agents are procedurally generated by an environment generator, which was built on top of the original MultiCarRacing environment (Schwarting et al., 2021). Each track consists of a closed loop around which the agents must drive a full lap. In order to increase the expressiveness of the original MultiCarRacing, we reparameterized the tracks using Bézier curves.

  • Each track consists of a Bézier curve based on 12 randomly sampled control points within a fixed radius of B/2 of the centre O of the playfield with B × B size
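A rough sketch of what this track parameterization could look like in code. The 12 control points and the B/2 radius come from the description above; the uniform-over-the-disc sampling, the way the loop is closed (repeating the first control point), the default B, and all function names are my own assumptions, not the paper’s generator.

```python
import math
import random


def sample_control_points(n=12, B=100.0, rng=None):
    """Sample n control points within radius B/2 of the playfield centre O."""
    rng = rng or random.Random(0)
    cx = cy = B / 2  # centre O of the B x B playfield
    points = []
    for _ in range(n):
        # Assumption: uniform over the disc (sqrt makes the area uniform).
        r = (B / 2) * math.sqrt(rng.random())
        angle = 2 * math.pi * rng.random()
        points.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    return points


def bezier_point(points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = list(points)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]


def make_track(n=12, B=100.0, samples=200, seed=0):
    """Return a list of (x, y) points along a closed Bezier track."""
    control = sample_control_points(n, B, random.Random(seed))
    closed = control + [control[0]]  # assumption: repeat first point to close
    return [bezier_point(closed, i / samples) for i in range(samples + 1)]


track = make_track()
```

Because a Bézier curve stays inside the convex hull of its control points, every point of the track stays inside the same disc of radius B/2.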

For training, additional reward shaping was introduced, similar to (Ma, 2019): a reward penalty of −0.1 for driving on the grass, a penalty of −0.5 for driving backwards, and early termination if cars spend too much time on grass. These are all used to help terminate less informative episodes. We use a memoryless agent with frame stacking of 4 and sticky actions of 8.
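The shaping terms above can be sketched as a small function. The −0.1 and −0.5 penalties are from the description; the `max_grass_steps` threshold, the choice of a cumulative (not consecutive) grass counter, and the function signature are hypothetical, since the notes only say “too much time on grass”.

```python
def shape_reward(base_reward, on_grass, driving_backwards,
                 grass_steps, max_grass_steps=50):
    """Apply the shaping penalties for one step.

    Returns (shaped_reward, updated_grass_steps, terminate_early).
    max_grass_steps is a hypothetical threshold, not a value from the paper.
    """
    reward = base_reward
    if on_grass:
        reward -= 0.1      # penalty for driving on the grass
        grass_steps += 1   # assumption: cumulative count over the episode
    if driving_backwards:
        reward -= 0.5      # penalty for driving backwards
    terminate_early = grass_steps >= max_grass_steps
    return reward, grass_steps, terminate_early


# Usage: one step on the grass, driving forwards.
reward, grass_steps, done = shape_reward(1.0, True, False, grass_steps=0)
```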

Old Notes

This is useless jargon that doesn’t get to the core of what we are interested in.

Over the past few years, there have been notable accomplishments in developing RL agents that can achieve expert and even superhuman performance in competitive games.

  • AlphaGo in the game of Go, MCTS + RL
  • OpenAI Five for Dota 2, PPO
  • MuZero for chess, shogi, and Atari games, without any prior knowledge of the game rules, MCTS + RL

Some of the techniques used to get these performances

  • Autocurricula (?), where we automatically generate a curriculum of tasks or challenges that gradually increase in difficulty, based on the agent’s current level of proficiency
  • self-play and fictitious self-play algorithms for multi-player settings
  • UED, where we expose the agent to a wide distribution of environments
    • Think domain randomization, but better

The authors extend Unsupervised Environment Design (UED) to multi-agent environments. They introduce a new approach called Multi-Agent Environment Design Strategist for Open-Ended Learning (MAESTRO), which efficiently produces adversarial, joint curricula over both environments and co-players.

Their experiments show that MAESTRO outperforms a number of strong baselines on competitive two-player games, spanning discrete and continuous control settings. MAESTRO also attains minimax-regret guarantees at Nash equilibrium.

In summary, this paper presents a promising new approach to developing generally capable reinforcement learning agents through open-ended learning methods that generate curricula of increasingly challenging tasks. The authors extend an existing method to consider the dependency between the environment and co-player in multi-agent domains, and introduce a new approach called MAESTRO that outperforms strong baselines on competitive two-player games. Thank you