Neural Radiance Fields (NeRF)

To look into: https://developer.nvidia.com/blog/getting-started-with-nvidia-instant-nerfs/

Damn I work at NVIDIA. I should be at the forefront of this.

FUNDAMENTAL TEACHING: https://sites.google.com/berkeley.edu/nerf-tutorial/home

I should be familiar with Raytracing

Other nooby tutorials?

Original paper: https://arxiv.org/abs/2003.08934

This idea is so cool, where you take a handful of photos of a scene and can then fly a camera through it. I think this was the catalyst to many of the fly-through things we see today that Google is working on.

Also, I think this is what Waabi is doing.

https://datagen.tech/guides/synthetic-data/neural-radiance-field-nerf/

So how does it work?

Input of NeRF is 5D data

  • input is a single continuous 5D coordinate: spatial location (x, y, z) and viewing direction (θ, φ)
  • output is the volume density σ and the view-dependent emitted radiance (RGB color) at that spatial location
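The input/output contract can be sketched as a toy function. This is only a stand-in for the real network (NeRF uses an 8-layer MLP with positional encoding; the layer sizes and names here are my assumptions):

```python
import numpy as np

# Toy stand-in for the NeRF MLP F_theta: (x, y, z, theta, phi) -> (sigma, RGB).
# Real NeRF is an 8-layer MLP with positional-encoded inputs; this sketch only
# shows the 5D-in / (density, color)-out contract.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 64))   # 5D input -> hidden
W2 = rng.normal(size=(64, 4))   # hidden -> (sigma, r, g, b)

def nerf_mlp(xyz, view_dir):
    """xyz: (N, 3) sample positions; view_dir: (N, 2) viewing angles."""
    h = np.maximum(np.concatenate([xyz, view_dir], axis=1) @ W1, 0.0)  # ReLU
    out = h @ W2
    sigma = np.maximum(out[:, 0], 0.0)        # volume density, must be >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[:, 1:]))   # colors squashed to [0, 1]
    return sigma, rgb

sigma, rgb = nerf_mlp(np.zeros((8, 3)), np.zeros((8, 2)))
# sigma has shape (8,), rgb has shape (8, 3)
```

Training fits one such network per scene; the weights *are* the scene representation.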

They do ray marching.

How do they estimate motion? Ahh, they use COLMAP:

  • “… and use the COLMAP structure-from-motion package [39] to estimate these parameters for real data)”

Nerfies

NeRFs usually come out blurry if the subject is moving; the Nerfies paper (Park et al., 2021) addresses this.

https://www.youtube.com/watch?v=IDMiMKWucaI

Walkthrough (CS231n 2025 Lec 15)

Setup

A NeRF is a per-scene MLP trained from a batch of posed images of a single scene. No 3D supervision: only pixel loss against the input views. Camera poses come from COLMAP structure-from-motion when they aren’t given.
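The supervision signal is just photometric error between rendered and ground-truth pixels. A minimal sketch (the function name is my assumption, not NeRF code):

```python
import numpy as np

# NeRF's only loss: mean squared error between colors rendered along rays
# and the corresponding ground-truth pixels from the posed input images.
def pixel_loss(rendered_rgb, gt_rgb):
    """Both arguments are (N, 3) arrays of pixel colors in [0, 1]."""
    return np.mean((rendered_rgb - gt_rgb) ** 2)

# A perfectly reconstructed batch of pixels has zero loss:
batch = np.full((4, 3), 0.5)
print(pixel_loss(batch, batch))  # 0.0
```

Because rendering is differentiable (next section), this loss alone is enough to train the MLP.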

Volume rendering (the differentiable bit)

Shoot a ray through a pixel, sample points along it, query the MLP at each sample, and composite from near to far:

  C(r) = Σ_i T_i · α_i · c_i,  with  α_i = 1 − exp(−σ_i δ_i)  and  T_i = Π_{j<i} (1 − α_j)

  • c_i, α_i — color and opacity at sample i (opacity derives from the MLP’s density σ_i and step size δ_i).
  • T_i — transmittance: fraction of light not absorbed before reaching sample i.
  • The whole thing is differentiable w.r.t. c_i and σ_i, so the pixel-loss gradient flows back into the MLP weights.
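The compositing step above can be written out directly. A numpy sketch for a single ray (real NeRF does this batched, with coarse + fine sampling, in a framework with autodiff):

```python
import numpy as np

# Alpha compositing along one ray, front to back:
#   alpha_i = 1 - exp(-sigma_i * delta_i)        (opacity of sample i)
#   T_i     = prod_{j<i} (1 - alpha_j)           (light surviving to sample i)
#   C       = sum_i T_i * alpha_i * c_i          (final pixel color)
def composite(sigma, rgb, deltas):
    """sigma: (S,) densities; rgb: (S, 3) colors; deltas: (S,) step sizes."""
    alpha = 1.0 - np.exp(-sigma * deltas)
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = T * alpha                      # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)

# A fully opaque red sample in front of a blue one: red wins the pixel.
color = composite(np.array([1e9, 1e9]),
                  np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
                  np.array([1.0, 1.0]))
# color == [1.0, 0.0, 0.0]
```

Every operation here is smooth in σ and c, which is exactly why the pixel loss can train the MLP end to end.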

Why it works

View-dependent output lets the same 3D point emit different radiance in different directions, capturing specular highlights. The implicit MLP representation is continuous, so arbitrary viewpoints interpolate smoothly. The tradeoff: rendering is slow (hundreds of MLP evals per ray).

Compared to 3D Gaussian Splatting

NeRF parameterizes the scene densely and implicitly via one MLP; 3DGS parameterizes it sparsely and explicitly via millions of Gaussian blobs. 3DGS trains in ~40min vs NeRF’s ~48h and renders at 137 FPS vs 0.07 FPS, at comparable reconstruction quality.

Variants (CS231n 2024 Lec 18)

The original 2020 paper assumes a static scene captured under fixed exposure. Three notable extensions:

| Variant   | What it adds | Reference |
| --------- | ------------ | --------- |
| Nerfies   | Deformable NeRF: adds a per-frame learned warp so the scene can move (e.g. selfie video of a person turning their head). | Park et al., ICCV 2021 |
| RawNeRF   | Trains on noisy raw HDR sensor data instead of tonemapped 8-bit JPEGs; recovers scene radiance directly, enabling HDR view synthesis. | Mildenhall et al., CVPR 2022 |
| Block-NeRF | Scales to a whole San Francisco neighborhood by tiling many NeRFs over space. | Tancik et al., CVPR 2022 |

The cost problem

Per the lecture (slide 95): training a NeRF takes 1–2 days on a V100 for a single scene. Inference for a single image at 224 samples/pixel = 14.6 M MLP forward passes (which works out to roughly a 256×256 image). This is what 3D Gaussian Splatting set out to fix.
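The back-of-the-envelope arithmetic behind that number, assuming a 256×256 image (my assumption, inferred from the 14.6 M figure):

```python
# Every pixel needs one ray, and every ray needs hundreds of MLP queries,
# so cost = pixels * samples_per_pixel forward passes per rendered frame.
pixels = 256 * 256               # assumed resolution; not stated on the slide
samples_per_pixel = 224          # per the lecture
forward_passes = pixels * samples_per_pixel
print(forward_passes)            # 14680064, i.e. ~14.6-14.7 M MLP evaluations
```

At that cost per frame, the ~0.07 FPS figure quoted above (vs 137 FPS for 3DGS) is unsurprising.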