Image Captioning

Generate a natural-language sentence describing an image. Modern systems use VLMs; the classic pre-VLM recipe (CS231n 2024 Lec 7) wires a CNN to an RNN.

CNN + RNN architecture

  1. Run a pretrained CNN (e.g. VGG) on the image, but drop the final FC-1000 + softmax; keep the penultimate FC-4096 feature vector.
  2. Condition the RNN on this image feature via an extra input-to-hidden weight matrix, so each step computes $h_t = \tanh(W_{xh}\,x_t + W_{hh}\,h_{t-1} + W_{ih}\,v)$, where $v$ is the CNN feature. (The only change from a standard RNN language model is the extra $W_{ih}\,v$ term that injects image context.)
  3. At test time, feed a <START> token, sample a word from the softmax over the vocabulary, and feed that word back in as the next input. Repeat until the model samples <END> (see the sketch after this list).
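
A minimal PyTorch sketch of steps 1–3, assuming torchvision’s VGG-16 as the CNN and greedy decoding at test time; the class name, layer sizes, and helper methods are illustrative assumptions, not the exact code from the slides or papers.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionRNN(nn.Module):
    """Image captioning with a vanilla RNN conditioned on a CNN feature."""

    def __init__(self, vocab_size, feat_dim=4096, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Pretrained VGG-16 with the final FC-1000 + softmax removed:
        # keep everything up to the penultimate FC-4096 layer.
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                 *list(vgg.classifier.children())[:-1])
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.W_xh = nn.Linear(embed_dim, hidden_dim)   # input-to-hidden
        self.W_hh = nn.Linear(hidden_dim, hidden_dim)  # hidden-to-hidden
        self.W_ih = nn.Linear(feat_dim, hidden_dim)    # extra image-to-hidden term
        self.W_hy = nn.Linear(hidden_dim, vocab_size)  # hidden-to-vocab scores

    def step(self, word_ids, h, v):
        # One recurrence step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + W_ih v)
        x = self.embed(word_ids)
        h = torch.tanh(self.W_xh(x) + self.W_hh(h) + self.W_ih(v))
        return self.W_hy(h), h

    @torch.no_grad()
    def caption(self, image, start_id, end_id, max_len=20):
        # image: (3, 224, 224) tensor; returns a list of sampled word ids.
        v = self.cnn(image.unsqueeze(0))               # (1, 4096) image feature
        h = torch.zeros(1, self.W_hh.in_features)
        word = torch.tensor([start_id])                # start with the <START> token
        out = []
        for _ in range(max_len):
            scores, h = self.step(word, h, v)
            word = scores.argmax(dim=1)                # greedy; sampling also works
            if word.item() == end_id:                  # stop at <END>
                break
            out.append(word.item())
        return out
```

Training would run the same recurrence with teacher forcing over (image, caption) pairs and a cross-entropy loss at each time step.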

The whole pipeline is trained end-to-end on (image, caption) pairs from datasets such as MS-COCO. Representative papers: Karpathy & Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions” (CVPR 2015), and Vinyals et al., “Show and Tell: A Neural Image Caption Generator” (CVPR 2015).

Failure modes (from Lec 7 failure-cases slide): the model often confuses objects in unusual contexts — e.g. “a woman is holding a cat” when she’s actually holding a fur coat; “a person holding a computer mouse on a desk” for a phone on a desk. Training distribution leaks into the captions.

Related vision-language tasks

  • Visual Question Answering (VQA): image + question → answer. The classic architecture fuses a CNN image feature with an LSTM-encoded question via pointwise multiplication, then applies a softmax over a fixed answer vocabulary (sketch after this list).
  • Visual Dialog: multi-turn Q&A grounded in an image.
  • Vision-Language Navigation: agent reads a natural-language instruction and issues actions as the visual scene updates.
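
A rough sketch of that classic VQA fusion, assuming a single-layer LSTM question encoder, a 4096-d CNN image feature, and a fixed answer vocabulary; the dimensions and class name are illustrative, not taken from a specific paper.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Classic VQA baseline: pointwise-multiply image and question codes."""

    def __init__(self, vocab_size, num_answers, img_feat_dim=4096, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden_dim, batch_first=True)  # question encoder
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)     # project CNN feature
        self.classifier = nn.Linear(hidden_dim, num_answers)    # answer scores

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_feat_dim) from a pretrained CNN
        # question_ids: (B, T) token ids of the question
        _, (h_n, _) = self.lstm(self.embed(question_ids))
        q = h_n[-1]                                  # (B, hidden_dim) question code
        v = torch.tanh(self.img_proj(img_feat))      # (B, hidden_dim) image code
        fused = q * v                                # pointwise multiplication
        return self.classifier(fused)                # softmax applied at loss time
```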

Source

CS231n 2024 Lec 7 slides 76–93 (CNN+RNN captioning pipeline, test-time sampling, example successes/failures, VQA, visual dialog, VLN).