3D Representation

How do you put a 3D shape into a computer? There's no single answer – each representation trades off sampling ease, inside/outside queries, memory, and learnability.

Why so many representations?

A surface is a continuous 2-manifold embedded in $\mathbb{R}^3$; discretizing it forces a choice. Explicit reps parameterize the surface directly (easy to sample, hard to query inside/outside); implicit reps describe the surface as the zero set of a field (hard to sample, trivial inside/outside). Cross this with parametric (closed-form shape family, e.g. splines) vs. non-parametric (free-form grid / list), and you get the four-quadrant taxonomy below.

Taxonomy (CS231n 2025 Lec 15)

              Non-parametric        Parametric
Explicit      Point cloud, Mesh     Spline patches, Subdivision surfaces
Implicit      Voxels, Level sets    Algebraic surfaces (zero set of a polynomial), CSG, Signed Distance Function (SDF)
  • Explicit surface: a map $f: \mathbb{R}^2 \to \mathbb{R}^3$. Example – torus: $f(u,v) = ((R + r\cos u)\cos v,\; (R + r\cos u)\sin v,\; r\sin u)$. Easy to sample (plug in $(u,v)$), hard to test inside/outside.
  • Implicit surface: the zero set $\{x : f(x) = 0\}$ of $f: \mathbb{R}^3 \to \mathbb{R}$. Example – unit sphere: $f(x,y,z) = x^2 + y^2 + z^2 - 1$. Hard to sample (need root-finding / marching cubes), easy to test inside/outside (just plug in; see the sketch after this list).
  • Level sets: grid of scalar values; surface is the trilinearly-interpolated zero crossing. Used for CT / MRI.
  • CSG (constructive solid geometry): Boolean ops on primitives.
  • SDF blending: combine distance fields (e.g. $\min(f_1, f_2)$ for union, or a smooth min), then take the zero set → smooth shape morphs.
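
A minimal NumPy sketch of the sampling-vs-querying asymmetry, using the torus and sphere above (radii, point counts, and function names are illustrative):

```python
import numpy as np

R, r = 2.0, 0.5  # torus major / minor radii

def torus_explicit(u, v):
    """Explicit rep: plug in (u, v), get a surface point. Sampling is trivial."""
    return np.array([(R + r*np.cos(u)) * np.cos(v),
                     (R + r*np.cos(u)) * np.sin(v),
                     r*np.sin(u)])

def sphere_implicit(p):
    """Implicit rep: f(p) < 0 inside, > 0 outside, = 0 on the unit sphere."""
    return p @ p - 1.0

def sdf_union(f1, f2):
    """CSG-style blend: the pointwise min of two fields describes their union."""
    return lambda p: min(f1(p), f2(p))

# Easy for the explicit rep: dense surface samples by sweeping (u, v).
samples = [torus_explicit(u, v)
           for u in np.linspace(0, 2*np.pi, 16)
           for v in np.linspace(0, 2*np.pi, 16)]

# Easy for the implicit rep: inside/outside is one function evaluation.
print(sphere_implicit(np.array([0.2, 0.3, 0.1])) < 0)   # True: inside

# Union of two spheres via min is still a one-evaluation query.
shifted = lambda p: sphere_implicit(p - np.array([1.0, 0.0, 0.0]))
print(sdf_union(sphere_implicit, shifted)(np.array([1.0, 0.0, 0.0])) < 0)  # True
```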

AI + Geometry (CS231n 2025 Lec 15 walkthrough)

Datasets

  • ShapeNet (Chang 2015) – 3M CAD models; ShapeNetCore subset = 51.3K models, 55 categories. Displaced the earlier Princeton Shape Benchmark (1,814 models).
  • Objaverse (Deitke 2022) → Objaverse-XL (2023) – 800K to 10M models.
  • CO3D (Reizenstein ICCV 2021) – 19K videos, 50 categories, real multi-view captures.
  • PartNet (Mo CVPR 2019) – fine-grained part decompositions + mobility, hierarchical.
  • ScanNet (Dai CVPR 2017) – 2.5M views, 1,500 RGB-D room scans. Recent follow-ups: ARKitScenes, ScanNet++.

Task zoo

  • Output is 3D (noise → shape, or partial shape → shape): generative – priors, completion, generation.
  • Input is 3D (shape → label): discriminative – classification, segmentation, descriptors.
  • Joint 3D + 2D: differentiable projection / back-projection, neural rendering.

Pipelines by representation

Multi-view (2D CNN reuse) – Multi-View CNN (Su ICCV 2015): render the shape from multiple viewpoints → shared CNN1 per view → element-wise max over views → CNN2 → softmax. Hits ~90% on ModelNet40.
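
A hedged PyTorch sketch of the view-pooling idea (the two sub-networks are user-supplied modules; shapes and names are illustrative):

```python
import torch
import torch.nn as nn

class MVCNN(nn.Module):
    """Shared per-view CNN -> element-wise max over views -> classifier head."""
    def __init__(self, cnn1: nn.Module, cnn2: nn.Module):
        super().__init__()
        self.cnn1 = cnn1   # shared across views, maps an image to a feature vector
        self.cnn2 = cnn2   # maps the pooled feature vector to class logits

    def forward(self, views):                        # views: (B, V, C, H, W)
        B, V = views.shape[:2]
        feats = self.cnn1(views.flatten(0, 1))       # (B*V, D), one pass per view
        feats = feats.view(B, V, -1).max(dim=1).values   # max-pool over the V views
        return self.cnn2(feats)                      # (B, num_classes)
```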

Voxels – 3D ShapeNets (Wu CVPR 2015): 30³ voxel CNN / DBN. 3D-GAN (Wu NeurIPS 2016): shape code → deconv → voxel grid. Visual Object Networks (Zhu NeurIPS 2018): shape → differentiable projection (depth + silhouette) → texture net → image; supports viewpoint / shape / texture edits. Dense voxels scale poorly – octree methods (OctNet Riegler CVPR 2017, O-CNN Wang SIGGRAPH 2017, OGN Tatarchenko ICCV 2017) store occupancy only at the surface.

Points (Lagrangian) – PointNet (Qi CVPR 2017): permutation-invariant set function via shared MLP + max pool. Graph extensions (EdgeConv, Wang TOG 2019) build edges over $k$-NN neighborhoods. Point-cloud distances: Chamfer (sum of two asymmetric nearest-neighbor terms, one per direction) and Earth Mover's (bijection cost, requires equal sizes).
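
A minimal NumPy sketch of the (squared) Chamfer distance; whether each direction is summed or averaged varies by paper, mean is used here:

```python
import numpy as np

def chamfer(X, Y):
    """Mean nearest-neighbor squared distance from X to Y, plus Y to X."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # (|X|, |Y|) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()  # two asymmetric terms

X, Y = np.random.rand(128, 3), np.random.rand(256, 3)
print(chamfer(X, Y))   # unlike EMD, |X| != |Y| is fine
```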

Parametric surfaces – AtlasNet (Groueix CVPR 2018): a collection of MLPs, each parameterizing one patch that maps the unit square $[0,1]^2$ (plus a shape code) into $\mathbb{R}^3$; the union of patches is the surface.
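
A sketch of one such patch under those assumptions (layer sizes are illustrative; the real AtlasNet architecture differs in detail):

```python
import torch
import torch.nn as nn

class Patch(nn.Module):
    """One AtlasNet-style patch: (u, v) in [0,1]^2 plus shape code -> 3D point."""
    def __init__(self, code_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 + code_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, uv, z):                 # uv: (P, 2), z: (code_dim,)
        z = z.expand(uv.shape[0], -1)         # broadcast the code to every sample
        return self.mlp(torch.cat([uv, z], dim=-1))   # (P, 3) surface samples

# Sampling the surface = sampling the unit square: the explicit-rep advantage.
patch = Patch()
pts = patch(torch.rand(1024, 2), torch.randn(256))
```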

Deep implicit functions – Occupancy Networks (Mescheder CVPR 2019): occupancy $o_\theta(x) \in [0,1]$, the probability that query point $x$ lies inside the shape. DeepSDF (Park CVPR 2019): regress signed distance instead – smooth gradients, better geometry. LDIF (Genova CVPR 2020): decompose shape into a structured set of local implicit elements (colored ellipsoids + latents).
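
A sketch of the shared shape of these models, assuming a latent-code-conditioned MLP (sizes illustrative, not either paper's exact architecture):

```python
import torch
import torch.nn as nn

class ImplicitNet(nn.Module):
    """MLP(z, x) -> scalar field value at query point x, for shape code z."""
    def __init__(self, code_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_dim + 3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, z, x):          # z: (B, code_dim), x: (B, 3)
        return self.net(torch.cat([z, x], dim=-1))   # (B, 1)

# DeepSDF-style: output is a signed distance, surface = {x : f(z, x) = 0}.
# Occupancy-Networks-style: sigmoid(output) = P(x inside), surface at 0.5.
```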

Radiance fields – NeRF (Mildenhall ECCV 2020): per-scene MLP $(\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$; volume-render along rays. 3D Gaussian Splatting (Kerbl SIGGRAPH 2023): replace the MLP with millions of explicit 3D Gaussians – ~2000× faster rendering at comparable quality.
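
A NumPy sketch of the volume-rendering quadrature NeRF uses along a single ray, $\hat{C} = \sum_i T_i (1 - e^{-\sigma_i \delta_i})\, \mathbf{c}_i$:

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """sigmas: (N,) densities, colors: (N, 3), ts: (N,) sample depths."""
    deltas = np.diff(ts, append=ts[-1] + 1e10)       # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # T_i
    weights = trans * alphas                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)   # (3,) rendered RGB
```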

Structure-aware representations

Instead of a monolithic shape, model element structure + element geometry:

  • Part sets (no relations) → relationship graphs (with connectivity) → hierarchies (trees, easier to generate) → StructureNet (Mo SIGGRAPH Asia 2019): hierarchical graphs with both parent-child and sibling edges, encoded/decoded by graph convnets.
  • Programs subsume all of the above (CAD-like) but are data-scarce.

Supervised 3D reconstruction (CS231n 2024 Lec 18 deltas)

CS231n 2024 Lec 18 ("3D Vision," slides credit Justin Johnson, presented Jun 4 2024) covers the same five-rep taxonomy (Depth Map / Voxel / Pointcloud / Mesh / Implicit) as Lec 15 but goes deeper on supervised reconstruction from a single image. The pieces not already in the Lec 15 walkthrough above:

2.5D – depth maps and surface normals (Eigen & Fergus ICCV 2015)

  • Depth map: per-pixel distance from camera to scene. RGB + Depth = RGB-D image (2.5D), recordable directly with Intel Realsense / Kinect.
  • Predict via fully-convolutional net trained with per-pixel L2.
  • Scale/depth ambiguity (slide 15): a small close cat and a large far cat project to identical pixels – absolute depth from a single image is fundamentally ambiguous. Fix with a scale-invariant loss over log depths, $L = \frac{1}{2n^2} \sum_{i,j} \big( (\log d_i - \log d_j) - (\log d_i^* - \log d_j^*) \big)^2$: it penalizes relative depths between pixel pairs and ignores any constant scale factor (sketch after this list).
  • Surface normals: per-pixel unit 3-vector, $3 \times H \times W$ output. Loss = per-pixel cosine similarity $\frac{\mathbf{n} \cdot \mathbf{n}^*}{\lVert\mathbf{n}\rVert\,\lVert\mathbf{n}^*\rVert}$.
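
A PyTorch sketch of both losses, assuming flattened per-pixel tensors (the pairwise form above reduces to the variance of the log-depth error, which is what this computes):

```python
import torch
import torch.nn.functional as F

def scale_invariant_loss(d, d_gt):
    """d, d_gt: (N,) positive depths. Invariant to a global scale on d."""
    g = torch.log(d) - torch.log(d_gt)
    return (g ** 2).mean() - g.mean() ** 2   # = the pairwise form / Var(g)

def normal_loss(n, n_gt):
    """n, n_gt: (N, 3) normals. 0 when predictions align with ground truth."""
    return (1.0 - F.cosine_similarity(n, n_gt, dim=-1)).mean()
```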

Voxels – pipelines + memory wall

  • 3D ShapeNets (Wu CVPR 2015) classification net: 30³ voxel input → 6³ conv (48 channels, 13³) → 5³ conv (160 channels, 5³) → 4³ conv (512 channels, 2³) → FC → class scores.
  • 3D-R2N2 (Choy ECCV 2016) generation: 2D CNN encoder → 3D CNN decoder → occupancy, trained with per-voxel cross-entropy.
  • Memory wall: a 1024³ float32 voxel grid = 4 GB (arithmetic below). Octrees (Tatarchenko ICCV 2017 OGN) use heterogeneous resolution: a coarse dense grid, with octree levels 1/2/3 refining only near the surface.
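
The back-of-envelope behind the memory wall ($R^3 \times 4$ bytes for a dense float32 grid at resolution $R$):

```python
# Dense float32 occupancy at resolution R costs R^3 * 4 bytes, per shape.
for R in (32, 128, 512, 1024):
    print(f"{R}^3 float32 grid: {R**3 * 4 / 2**30:.3f} GiB")
# 32^3: ~0.0001 GiB, 128^3: ~0.008 GiB, 512^3: 0.5 GiB, 1024^3: 4.0 GiB
```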

Pointclouds – sensor fusion + generation

  • Generation: Fan CVPR 2017 Point Set Generation Network – 2D CNN trunk + two heads (an FC head emits an unordered set of points; a conv head emits points arranged as a 2D grid), trained with Chamfer distance.
  • Sensor fusion – DenseFusion (Wang CVPR 2019, 6D pose): RGB → CNN → per-pixel feature; pointcloud → PointNet → per-point feature; project image features onto points and concatenate per-point → joint per-point feature for downstream heads (sketch after this list). The pattern of "lift 2D CNN features onto 3D primitives" recurs across mesh and NeRF work.
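
A sketch of that lifting pattern under simple assumptions (pinhole intrinsics K, points already in the camera frame; the function and argument names are illustrative, not DenseFusion's code):

```python
import torch
import torch.nn.functional as F

def lift_2d_features(feat2d, pts_cam, K, pt_feats):
    """feat2d: (1, C, H, W) CNN map; pts_cam: (N, 3); K: (3, 3); pt_feats: (N, D)."""
    uv = (K @ pts_cam.T).T                       # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                  # (N, 2) pixel coordinates
    H, W = feat2d.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,    # grid_sample wants [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(1, 1, -1, 2)
    img_feats = F.grid_sample(feat2d, grid, align_corners=True)  # bilinear sample
    img_feats = img_feats.squeeze(0).squeeze(1).T                # (N, C)
    return torch.cat([img_feats, pt_feats], dim=-1)              # (N, C + D)
```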

Meshes – Pixel2Mesh and Mesh R-CNN

Pixel2Mesh and Mesh R-CNN are the canonical "predict a triangle mesh from one RGB image" pipelines. Three reusable ideas crystallized in Pixel2Mesh (Wang ECCV 2018):

  1. Iterative refinement – start from a fixed ellipsoid mesh (156 → 628 → 2466 vertices); each stage predicts per-vertex offsets + a graph-unpooling subdivision.
  2. Graph convolution on the mesh: $v_i' = W_0 v_i + \sum_{j \in \mathcal{N}(i)} W_1 v_j$ – vertices update from their 1-ring neighbors with shared $W_0, W_1$ (sketch after this list).
  3. Vertex-aligned features: project each 3D vertex to the image plane, bilinearly sample CNN feature maps (conv3_3 / conv4_3 / conv5_3). Same trick as RoI-Align in detection.
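
A PyTorch sketch of that graph convolution, with mesh connectivity given as a directed edge list (an assumption about the data layout, not Pixel2Mesh's actual code):

```python
import torch
import torch.nn as nn

class MeshGraphConv(nn.Module):
    """v_i' = W0 v_i + sum over 1-ring neighbors j of W1 v_j."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w0 = nn.Linear(d_in, d_out, bias=False)   # self term
        self.w1 = nn.Linear(d_in, d_out, bias=False)   # neighbor term

    def forward(self, v, edges):       # v: (V, d_in); edges: (E, 2) = (dst, src)
        out = self.w0(v)
        agg = torch.zeros_like(out)
        agg.index_add_(0, edges[:, 0], self.w1(v)[edges[:, 1]])  # sum messages
        return out + agg
```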

Loss: convert mesh to pointcloud (sample points on the surface, online for the prediction + offline for the ground truth), then Chamfer distance. Sampling sidesteps the "same shape, different triangulation" problem.
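
A NumPy sketch of uniform surface sampling (area-weighted face choice, then uniform barycentric coordinates), which feeds the Chamfer function above:

```python
import numpy as np

def sample_surface(verts, faces, n):
    """verts: (V, 3); faces: (F, 3) int indices -> (n, 3) uniform surface samples."""
    a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    idx = np.random.choice(len(faces), n, p=areas / areas.sum())  # area-weighted
    u, v = np.random.rand(n, 1), np.random.rand(n, 1)
    flip = (u + v) > 1.0                         # reflect to stay inside triangle
    u, v = np.where(flip, 1.0 - u, u), np.where(flip, 1.0 - v, v)
    return a[idx] * (1 - u - v) + b[idx] * u + c[idx] * v
```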

Mesh R-CNN (Gkioxari ICCV 2019) bolts a mesh-prediction head onto Mask R-CNN: 2D detection → per-instance triangle mesh in the image.

Implicit – algebraic surfaces / CSG / level sets / DeepSDF (slides credit Ren Ng, CS184/284A)

  • Algebraic surfaces = zero set of a polynomial in $x, y, z$.
  • CSG = Boolean ops on implicit primitives; the expression is a tree.
  • Level sets = grid of scalar values; surface where the trilinearly-interpolated value crosses $0$. Trades closed-form complexity for grid-controlled expressiveness (like a texture).
  • DeepSDF (Park CVPR 2019): MLP regresses signed distance; surface = decision boundary $f_\theta(x) = 0$ (extraction sketch below). Smooth gradients beat occupancy for geometry quality.
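
A sketch of going from any such field back to a mesh: evaluate on a grid, then extract the zero level set with marching cubes (scikit-image's measure.marching_cubes; the analytic sphere stands in for a learned $f_\theta$):

```python
import numpy as np
from skimage import measure

xs = np.linspace(-1.5, 1.5, 64)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
sdf = np.sqrt(X**2 + Y**2 + Z**2) - 1.0        # unit-sphere SDF on a 64^3 grid
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)                # triangle mesh of the zero crossing
```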

NeRF and Gaussian Splatting variants

NeRF and Gaussian Splatting are documented in their own notes – NeRF and Gaussian Splatting – including the Lec 18 variants (Nerfies, RawNeRF, BlockNeRF, Dynamic 3D Gaussians, Gaussian Splatting SLAM).

Foundation models for 3D generation

  • DreamFusion (Poole et al. arXiv 2022) – text-to-3D without a 3D dataset: optimize a NeRF so its rendered views match a 2D text-to-image diffusion model (Score Distillation Sampling; sketch after this list).
  • CAT3D (Gao et al. arXiv 2024) – "Create Anything in 3D" via multi-view diffusion: generate consistent multi-view images, then fit a 3D model.
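
A heavily simplified sketch of one Score Distillation Sampling step; render, diffusion_eps, add_noise, and sample_timestep are hypothetical stand-ins for a differentiable renderer and a frozen 2D diffusion prior, and the timestep weighting w(t) is omitted:

```python
import torch

def sds_step(theta, optimizer, prompt_emb):
    img = render(theta)                       # hypothetical differentiable render
    t = sample_timestep()                     # hypothetical random diffusion step
    noise = torch.randn_like(img)
    noisy = add_noise(img, noise, t)          # hypothetical forward-diffusion op
    with torch.no_grad():
        eps_hat = diffusion_eps(noisy, t, prompt_emb)   # frozen 2D prior
    # SDS: push d(img)/d(theta) along (eps_hat - noise), skipping the prior's
    # own Jacobian; the "loss" below exists only to produce that gradient.
    loss = ((eps_hat - noise).detach() * img).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```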

The 2D-diffusion-as-3D-supervisor trick is the bridge from the diffusion foundation-model line into 3D, paralleling how VLMs serve as supervisors for VLA action prediction.