Neural Radiance Fields (NeRF)
To look into: https://developer.nvidia.com/blog/getting-started-with-nvidia-instant-nerfs/
Damn I work at NVIDIA. I should be at the forefront of this.
FUNDAMENTAL TEACHING: https://sites.google.com/berkeley.edu/nerf-tutorial/home
I should be familiar with ray tracing
Other nooby tutorials?
Original paper: https://arxiv.org/abs/2003.08934
This idea is so cool: you capture a handful of posed images of a scene and then you can move the camera freely through it. I think this was the catalyst for many of the fly-through things we see today that Google is working on.
Also, I think this is what Waabi is doing.
https://datagen.tech/guides/synthetic-data/neural-radiance-field-nerf/
- Official Repo: https://github.com/bmild/nerf
So how does it work?
The input to NeRF is 5D data:
- input is a single continuous 5D coordinate: spatial location (x, y, z) plus viewing direction (θ, φ)
- output is the volume density and the view-dependent emitted radiance at that spatial location (interface sketched below)
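A minimal sketch of that input/output contract (toy model, my names; the real network is a deeper 8-layer MLP with positional encoding, and its density depends on position only):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Toy stand-in for the NeRF MLP: 5D coordinate -> (RGB, density).

    Only illustrates the interface. The real model uses positional
    encoding, 8 hidden layers, and feeds the view direction in late
    so that density is view-independent.
    """
    def __init__(self, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)  # volume density sigma
        self.color_head = nn.Linear(hidden, 3)    # view-dependent RGB radiance

    def forward(self, xyz, view_dir):
        # xyz: (N, 3) spatial location; view_dir: (N, 2) as (theta, phi)
        h = self.backbone(torch.cat([xyz, view_dir], dim=-1))
        sigma = torch.relu(self.density_head(h))   # keep density >= 0
        rgb = torch.sigmoid(self.color_head(h))    # keep colors in [0, 1]
        return rgb, sigma
```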
They do ray marching.
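What ray marching means here, as a hedged numpy sketch (function and variable names are mine): pick stratified t values between near and far bounds along each camera ray; those are the points where the MLP gets queried.

```python
import numpy as np

def sample_along_ray(origin, direction, near=2.0, far=6.0, n_samples=64, rng=None):
    """Stratified sampling along one ray.

    origin, direction: (3,) arrays; near/far bound the scene along the ray.
    Returns the 3D sample points, their depths t, and step sizes delta_i.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Split [near, far] into n_samples bins and jitter one sample per bin.
    edges = np.linspace(near, far, n_samples + 1)
    t = edges[:-1] + rng.uniform(size=n_samples) * (edges[1:] - edges[:-1])
    points = origin + t[:, None] * direction  # (n_samples, 3) query points
    deltas = np.diff(t, append=far)           # step sizes between samples
    return points, t, deltas
```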
How do they estimate the camera poses? Ahh, they use COLMAP:
- "… and use the COLMAP structure-from-motion package [39] to estimate these parameters for real data)"
Nerfies
NeRFs usually produce blurry images if the subject is moving; the Nerfies paper (Park et al., ICCV 2021) addresses this:
https://www.youtube.com/watch?v=IDMiMKWucaI
Walkthrough (CS231n 2025 Lec 15)
Setup
A NeRF is a per-scene MLP trained from a batch of posed images of a single scene. No 3D supervision, only a pixel loss against the input views. Camera poses come from COLMAP structure-from-motion when they aren't given.
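A hedged sketch of what "only pixel loss" means for a single training step; `render_rays` is a hypothetical stand-in for the differentiable volume renderer described next:

```python
import torch

def train_step(model, optimizer, rays_o, rays_d, gt_rgb, render_rays):
    """One NeRF training step: supervision is purely photometric.

    rays_o, rays_d: (B, 3) ray origins/directions from the posed cameras.
    gt_rgb: (B, 3) ground-truth pixel colors from the input images.
    render_rays: hypothetical volume renderer (sketched below).
    """
    pred_rgb = render_rays(model, rays_o, rays_d)  # differentiable render
    loss = torch.mean((pred_rgb - gt_rgb) ** 2)    # plain MSE against pixels
    optimizer.zero_grad()
    loss.backward()   # gradient flows through the renderer into the MLP
    optimizer.step()
    return loss.item()
```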
Volume rendering (the differentiable bit)
Shoot a ray through a pixel, sample points along it, query the MLP at each, and composite from near to far:

$\hat{C}(\mathbf{r}) = \sum_i T_i \, \alpha_i \, \mathbf{c}_i$

- $\mathbf{c}_i$, $\alpha_i$: color and opacity at sample $i$; the MLP outputs color $\mathbf{c}_i$ and density $\sigma_i$, and opacity derives from density and step size via $\alpha_i = 1 - e^{-\sigma_i \delta_i}$.
- $T_i = \prod_{j<i} (1 - \alpha_j)$: transmittance, the fraction of light not absorbed before reaching sample $i$.
- Whole thing is differentiable w.r.t. $\mathbf{c}_i$ and $\sigma_i$, so gradient flows back into the MLP weights (see the sketch after this list).
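A minimal numpy sketch of that compositing sum (my variable names, assuming per-sample densities and step sizes like those from the ray-marching snippet above):

```python
import numpy as np

def composite(rgb, sigma, deltas):
    """Alpha-composite per-sample colors into one pixel color.

    rgb: (N, 3) colors c_i; sigma: (N,) densities; deltas: (N,) step sizes.
    Implements C = sum_i T_i * alpha_i * c_i.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)  # alpha_i = 1 - e^{-sigma_i * delta_i}
    # T_i = prod_{j<i} (1 - alpha_j): light surviving to reach sample i
    # (exclusive cumulative product, so T_0 = 1).
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)  # (3,) final pixel color
```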
Why it works
View-dependent output lets the same 3D point emit different radiance in different directions, capturing specular highlights. The implicit MLP representation is continuous, so arbitrary viewpoints interpolate smoothly. The tradeoff: rendering is slow (hundreds of MLP evals per ray).
Compared to 3D Gaussian Splatting
NeRF parameterizes the scene densely and implicitly via one MLP; 3DGS parameterizes it sparsely and explicitly via millions of Gaussian blobs. 3DGS trains in ~40 min vs NeRF's ~48 h and renders at 137 FPS vs 0.07 FPS, at comparable reconstruction quality.
Variants (CS231n 2024 Lec 18)
The original 2020 paper assumes a static scene captured under fixed exposure. Three notable extensions:
| Variant | What it adds | Reference |
|---|---|---|
| Nerfies | Deformable NeRF β adds a per-frame learned warp so the scene can move (e.g. selfie video of a person turning their head). | Park et al. ICCV 2021 |
| RawNeRF | Train on noisy raw HDR sensor data instead of tonemapped 8-bit JPEGs. Recover scene radiance directly, enables HDR view synthesis. | Mildenhall et al. CVPR 2022 |
| BlockNeRF | Scale to a whole San Francisco neighborhood by tiling many NeRFs over space. | Tancik et al. CVPR 2022 |
The cost problem
Per the lecture (slide 95): training a NeRF takes 1–2 days on a V100 for a single scene. Rendering one image at 224 samples/pixel comes to 14.6 M MLP forward passes. This is what 3D Gaussian Splatting set out to fix.
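The image resolution didn't survive into my note; back-solving from the quoted count suggests roughly a 256×256 render (my inference, not from the slide):

```python
# If the render is 256 x 256 pixels at 224 MLP queries per pixel's ray:
print(256 * 256 * 224)  # 14,680,064 -> ~14.7M, consistent with the ~14.6M above
```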