Sequence-To-Sequence Model (Seq2Seq)

Sequence-to-sequence modeling is a family of models for transforming one sequence into another.

Examples:

  • Machine translation: Input: “Bonjour le monde” → Output: “Hello world.”
  • Speech recognition: Input: audio frames → Output: text transcription.
  • Text summarization: Input: long document → Output: concise summary.
  • Dialogue systems: Input: user utterance → Output: response.

Is next token prediction Seq2Seq?

Not by itself: plain next-token prediction (language modeling) is autoregressive, but there is no separate source sequence being mapped to a target sequence, so it isn’t seq2seq in the classic sense.

The original Transformer was designed as a seq2seq model (it was introduced for machine translation).

Most classic seq2seq models (RNN encoder–decoder, Transformer for MT, summarization, etc.) are autoregressive.

The decoder generates one token at a time, left-to-right, conditioning on previous outputs.
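
In symbols, this is the standard chain-rule factorization of the output distribution (not specific to any one architecture):

$$
p(y_{1:T} \mid x) \;=\; \prod_{t=1}^{T} p\big(y_t \mid y_{<t},\, x\big)
$$

Greedy decoding takes the argmax of each factor and feeds it back in as the next input; beam search keeps several candidate prefixes instead.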

Why don’t seq2seq models just generate the entire output sequence in one shot (like one big classifier over all possible output sequences), instead of autoregressively?

Treating every possible output sequence as its own class is intractable: the number of candidate sequences grows exponentially with output length.

Autoregression breaks this into tractable steps, one next-token prediction at a time.
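
A rough back-of-the-envelope check (the vocabulary size and output length below are made-up illustrative numbers, not from the source):

```python
# Illustrative numbers only: a 32k-token vocabulary and a 20-token output.
V = 32_000   # vocabulary size (hypothetical)
T = 20       # output length (hypothetical)

# One-shot "classifier over all possible sequences": one class per sequence.
num_classes_one_shot = V ** T
print(f"one-shot classes: {num_classes_one_shot:.3e}")        # ~1.3e+90 classes

# Autoregressive factorization: T separate V-way softmax decisions.
num_decisions_autoregressive = T * V
print(f"autoregressive decisions: {num_decisions_autoregressive}")  # 640000
```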

Encoder-decoder RNN (CS231n 2024 Lec 7)

Before Transformers, seq2seq was a “many-to-one” RNN glued to a “one-to-many” RNN (Sutskever et al., NIPS 2014):

  1. Encoder (many-to-one) — an RNN with its own weights reads the input sequence and produces a final hidden state that’s meant to summarize the entire input in a single fixed-size vector.
  2. Decoder (one-to-many) — a second RNN with separate weights is initialized from the encoder’s final hidden state and autoregressively generates the output sequence.

Bottleneck: the entire source sentence has to fit through that single fixed-size hidden state. Attention was invented to let the decoder look back at all encoder hidden states directly instead of relying on the bottleneck.
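
A minimal sketch of this encoder-decoder pattern, assuming PyTorch; the GRU, hidden size, vocabulary sizes, and greedy decoding loop are illustrative choices, not the exact Sutskever et al. setup (which used deep LSTMs):

```python
import torch
import torch.nn as nn

class Seq2SeqRNN(nn.Module):
    """Sutskever-style encoder-decoder: the only thing passed from encoder
    to decoder is the final hidden state (the fixed-size bottleneck)."""

    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)   # many-to-one
        self.decoder = nn.GRU(emb, hidden, batch_first=True)   # one-to-many
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encoder: read the whole source, keep only the final hidden state.
        _, h = self.encoder(self.src_embed(src))          # h: (1, batch, hidden)
        # Decoder: initialized from h, predicts the next target token at each step.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in), h)
        return self.out(dec_out)                           # (batch, tgt_len, tgt_vocab)

    @torch.no_grad()
    def greedy_decode(self, src, bos_id, eos_id, max_len=20):
        _, h = self.encoder(self.src_embed(src))
        tok = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
        out = []
        for _ in range(max_len):                           # autoregressive loop
            dec_out, h = self.decoder(self.tgt_embed(tok), h)
            tok = self.out(dec_out[:, -1]).argmax(-1, keepdim=True)
            out.append(tok)
            if (tok == eos_id).all():
                break
        return torch.cat(out, dim=1)

# Toy usage with made-up vocabulary sizes and random token ids.
model = Seq2SeqRNN(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))                       # batch of 2 source sequences
print(model.greedy_decode(src, bos_id=1, eos_id=2).shape)
```

The single hidden state `h` handed from the encoder GRU to the decoder GRU is exactly the bottleneck described above; attention instead lets each decoder step look at all encoder outputs.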

Source

CS231n 2024 Lec 7 slides 42–43 (Sutskever seq2seq, many-to-one + one-to-many encoder-decoder).