Image Captioning

Generate a natural-language sentence describing an image. Modern systems use VLMs; the classic pre-VLM recipe (CS231n 2024 Lec 7) wires a CNN to an RNN.

CNN + RNN architecture

  1. Run a pretrained CNN (e.g. VGG) on the image, but drop the final FC-1000 + softmax; keep the penultimate FC-4096 feature vector.
  2. Condition the RNN on this image feature via an extra input-to-hidden weight matrix, so each step computes $h_t = \tanh(W_{xh}\,x_t + W_{hh}\,h_{t-1} + W_{ih}\,v)$, where $v$ is the CNN feature. (The only change from a standard RNN language model is the extra $W_{ih}\,v$ term that injects image context.)
  3. At test time, feed a <START> token, sample a word from the softmax over the vocabulary, and feed that word back in as the next input. Repeat until the model samples <END> (see the sketch after this list).
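
A minimal PyTorch sketch of steps 1–3, assuming torchvision’s VGG-16 as the CNN and greedy decoding at test time; the class name, layer sizes, and helper methods are illustrative assumptions, not the exact code from the slides or papers.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionRNN(nn.Module):
    """Image captioning with a vanilla RNN conditioned on a CNN feature."""

    def __init__(self, vocab_size, feat_dim=4096, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Pretrained VGG-16 with the final FC-1000 + softmax removed:
        # keep everything up to the penultimate FC-4096 layer.
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                 *list(vgg.classifier.children())[:-1])
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.W_xh = nn.Linear(embed_dim, hidden_dim)   # input-to-hidden
        self.W_hh = nn.Linear(hidden_dim, hidden_dim)  # hidden-to-hidden
        self.W_ih = nn.Linear(feat_dim, hidden_dim)    # extra image-to-hidden term
        self.W_hy = nn.Linear(hidden_dim, vocab_size)  # hidden-to-vocab scores

    def step(self, word_ids, h, v):
        # One recurrence step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + W_ih v)
        x = self.embed(word_ids)
        h = torch.tanh(self.W_xh(x) + self.W_hh(h) + self.W_ih(v))
        return self.W_hy(h), h

    @torch.no_grad()
    def caption(self, image, start_id, end_id, max_len=20):
        # image: (3, 224, 224) tensor; returns a list of sampled word ids.
        v = self.cnn(image.unsqueeze(0))               # (1, 4096) image feature
        h = torch.zeros(1, self.W_hh.in_features)
        word = torch.tensor([start_id])                # start with the <START> token
        out = []
        for _ in range(max_len):
            scores, h = self.step(word, h, v)
            word = scores.argmax(dim=1)                # greedy; sampling also works
            if word.item() == end_id:                  # stop at <END>
                break
            out.append(word.item())
        return out
```

Training would run the same recurrence with teacher forcing over (image, caption) pairs and a cross-entropy loss at each time step.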

The whole pipeline is trained end-to-end on (image, caption) pairs from datasets such as MS-COCO. Representative papers: Karpathy & Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions” (CVPR 2015), and Vinyals et al., “Show and Tell: A Neural Image Caption Generator” (CVPR 2015).

Failure modes (from Lec 7 failure-cases slide): the model often confuses objects in unusual contexts — e.g. “a woman is holding a cat” when she’s actually holding a fur coat; “a person holding a computer mouse on a desk” for a phone on a desk. Training distribution leaks into the captions.

Related vision-language tasks

  • Visual Question Answering (VQA): image + question → answer. The classic architecture fuses a CNN image feature with an LSTM-encoded question via pointwise multiplication, then applies a softmax over a fixed answer vocabulary (sketch after this list).
  • Visual Dialog: multi-turn Q&A grounded in an image.
  • Vision-Language Navigation: agent reads a natural-language instruction and issues actions as the visual scene updates.
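
A rough sketch of that classic VQA fusion, assuming a single-layer LSTM question encoder, a 4096-d CNN image feature, and a fixed answer vocabulary; the dimensions and class name are illustrative, not taken from a specific paper.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Classic VQA baseline: pointwise-multiply image and question codes."""

    def __init__(self, vocab_size, num_answers, img_feat_dim=4096, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden_dim, batch_first=True)  # question encoder
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)     # project CNN feature
        self.classifier = nn.Linear(hidden_dim, num_answers)    # answer scores

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_feat_dim) from a pretrained CNN
        # question_ids: (B, T) token ids of the question
        _, (h_n, _) = self.lstm(self.embed(question_ids))
        q = h_n[-1]                                  # (B, hidden_dim) question code
        v = torch.tanh(self.img_proj(img_feat))      # (B, hidden_dim) image code
        fused = q * v                                # pointwise multiplication
        return self.classifier(fused)                # softmax applied at loss time
```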

Source

CS231n 2024 Lec 7 slides 76–93 (CNN+RNN captioning pipeline, test-time sampling, example successes/failures, VQA, visual dialog, VLN).