3D Representation
How do you put a 3D shape into a computer? There's no single answer: each representation trades off sampling ease, inside/outside queries, memory, and learnability.
Why so many representations?
A surface is a continuous 2-manifold embedded in $\mathbb{R}^3$; discretizing it forces a choice. Explicit reps parameterize the surface directly (easy to sample, hard to query inside/outside); implicit reps describe the surface as the zero set of a field (hard to sample, trivial inside/outside). Cross this with parametric (closed-form shape family, e.g. splines) vs. non-parametric (free-form grid / list), and you get the four-quadrant taxonomy below.
Taxonomy (CS231n 2025 Lec 15)
| | Non-parametric | Parametric |
|---|---|---|
| Explicit | Point cloud, mesh | Spline patches, subdivision surfaces |
| Implicit | Voxels, level sets | Algebraic surfaces (zero set of a polynomial), CSG, signed distance functions |
- Explicit surface: a map $f: \mathbb{R}^2 \to \mathbb{R}^3$, $(u,v) \mapsto (x,y,z)$. Example, torus: $f(u,v) = ((R + r\cos v)\cos u,\ (R + r\cos v)\sin u,\ r\sin v)$. Easy to sample (plug in $(u,v)$), hard to test inside/outside.
- Implicit surface: the zero set $\{(x,y,z) : F(x,y,z) = 0\}$. Example, unit sphere: $x^2 + y^2 + z^2 - 1 = 0$. Hard to sample (need root-finding / marching cubes), easy to test inside/outside (plug in and check the sign).
- Level sets: grid of scalar values; surface is the trilinearly-interpolated zero crossing. Used for CT / MRI.
- CSG (constructive solid geometry): Boolean operations (union, intersection, difference) on implicit primitives.
- SDF blending: interpolate two signed distance fields, $d = \alpha d_1 + (1 - \alpha) d_2$, then take the zero set → smooth shape morphs (sketched below).
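A minimal numpy sketch of the two query regimes and SDF blending. The sphere/box SDFs and the blend weight are illustrative choices, not from the lecture:

```python
import numpy as np

def sdf_sphere(p, r=1.0):
    """Signed distance from points p (..., 3) to a sphere of radius r."""
    return np.linalg.norm(p, axis=-1) - r

def sdf_box(p, half=(0.8, 0.5, 0.5)):
    """Signed distance to an axis-aligned box with half-extents `half`."""
    q = np.abs(p) - np.asarray(half)
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    inside = np.minimum(q.max(axis=-1), 0.0)
    return outside + inside

p = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(sdf_sphere(p) < 0)       # inside/outside is just a sign check
alpha = 0.3                    # blended field; its zero set morphs the shapes
print(alpha * sdf_sphere(p) + (1 - alpha) * sdf_box(p))
```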
AI + Geometry (CS231n 2025 Lec 15 walkthrough)
Datasets
- ShapeNet (Chang 2015): 3M CAD models; the ShapeNetCore subset has 51.3K models across 55 categories. Displaced the earlier Princeton Shape Benchmark (1,814 models).
- Objaverse (Deitke 2022) → Objaverse-XL (2023): 800K → 10M models.
- CO3D (Reizenstein ICCV 2021): 19K videos, 50 categories, real multi-view captures.
- PartNet (Mo CVPR 2019): fine-grained, hierarchical part decompositions + mobility annotations.
- ScanNet (Dai CVPR 2017): 2.5M views, 1,500 RGB-D room scans. Recent follow-ups: ARKitScenes, ScanNet++.
Task zoo
- 3D as output (generative): priors, completion, generation.
- 3D as input (discriminative): classification, segmentation, descriptors.
- Joint 3D + 2D: differentiable projection / back-projection, neural rendering.
Pipelines by representation
Multi-view (2D CNN reuse): Multi-View CNN (Su ICCV 2015) renders views → shared CNN1 → element-wise max over views → CNN2 → softmax. Hits ~90% on ModelNet40.
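A toy torch sketch of the aggregation step: the view axis is folded into the batch for the shared first-stage CNN, then collapsed with an element-wise max. `cnn1` / `cnn2` are stand-in modules, not the paper's architectures:

```python
import torch
import torch.nn as nn

cnn1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(8))                  # shared across views
cnn2 = nn.Sequential(nn.Flatten(), nn.Linear(16 * 8 * 8, 40))  # ModelNet40 scores

views = torch.randn(4, 12, 3, 64, 64)                # (batch, V, C, H, W)
b, v, c, h, w = views.shape
feats = cnn1(views.reshape(b * v, c, h, w)).reshape(b, v, 16, 8, 8)
fused = feats.max(dim=1).values                      # element-wise max over views
scores = cnn2(fused)                                 # (batch, 40)
```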
Voxels: 3D ShapeNets (Wu CVPR 2015), a 30³ voxel CNN / DBN. 3D-GAN (Wu NeurIPS 2016): shape code → deconv → voxel grid. Visual Object Networks (Zhu NeurIPS 2018): shape → differentiable projection (depth + silhouette) → texture net → image; supports viewpoint / shape / texture edits. Dense voxels scale poorly, so octree methods (OctNet, Riegler CVPR 2017; O-CNN, Wang SIGGRAPH 2017; OGN, Tatarchenko ICCV 2017) store occupancy only near the surface.
Points (Lagrangian): PointNet (Qi CVPR 2017), a permutation-invariant set function via shared MLP + max pool. Graph extensions (EdgeConv, Wang TOG 2019) build edges over $k$-NN neighborhoods. Point-cloud distances: Chamfer (nearest-neighbor distances summed in each direction) and Earth Mover's (optimal bijection cost, requires equal sizes).
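A minimal torch version of the Chamfer distance just described (mean-reduced, squared-L2 variant); real pipelines use KD-trees or CUDA kernels instead of a dense pairwise matrix:

```python
import torch

def chamfer(a, b):
    """a: (N, 3), b: (M, 3) point clouds -> scalar Chamfer distance."""
    d = torch.cdist(a, b) ** 2                    # (N, M) pairwise squared L2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

a, b = torch.randn(1024, 3), torch.randn(512, 3)
print(chamfer(a, b))        # unequal sizes are fine, unlike Earth Mover's
```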
Parametric surfaces: AtlasNet (Groueix CVPR 2018), $N$ MLPs each parameterizing one patch as a map from the unit square $[0,1]^2$ into $\mathbb{R}^3$; the union of patches is the surface.
Deep implicit functions: Occupancy Networks (Mescheder CVPR 2019) predict occupancy probability $o(x) \in [0,1]$ at any query point $x$. DeepSDF (Park CVPR 2019): regress signed distance instead → smooth gradients, better geometry. LDIF (Genova CVPR 2020): decompose the shape into a structured set of local implicit elements (colored ellipsoids + latents).
Radiance fields: NeRF (Mildenhall ECCV 2020), a per-scene MLP $(x, d) \mapsto (c, \sigma)$; volume-render along rays. 3D Gaussian Splatting (Kerbl SIGGRAPH 2023): replace the MLP with millions of explicit 3D Gaussians → 2000× faster rendering at comparable quality.
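The volume-rendering quadrature NeRF evaluates along each ray, as a numpy sketch; `sigma` and `color` stand in for the MLP outputs at the sampled points:

```python
import numpy as np

t = np.linspace(2.0, 6.0, 64)            # sample depths along one ray
sigma = np.random.rand(64)               # densities (stand-in for MLP output)
color = np.random.rand(64, 3)            # colors (stand-in for MLP output)

delta = np.diff(t, append=t[-1] + 1e10)  # spacing between adjacent samples
alpha = 1.0 - np.exp(-sigma * delta)     # per-segment opacity
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
weights = trans * alpha                  # how much each sample contributes
pixel = (weights[:, None] * color).sum(axis=0)   # rendered RGB for this ray
```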
Structure-aware representations
Instead of a monolithic shape, model element structure + element geometry:
- Part sets (no relations) → relationship graphs (with connectivity) → hierarchies (trees, easier to generate) → StructureNet (Mo SIGGRAPH Asia 2019): hierarchical graphs with both parent-child and sibling-edge relations, encoded/decoded by graph convnets.
- Programs subsume all of the above (CAD-like) but are data-scarce.
Supervised 3D reconstruction (CS231n 2024 Lec 18 deltas)
CS231n 2024 Lec 18 ("3D Vision," slides credit Justin Johnson, presented Jun 4 2024) covers the same five-rep taxonomy (depth map / voxel / point cloud / mesh / implicit) as Lec 15 but goes deeper on supervised reconstruction from a single image. The pieces not already in the Lec 15 walkthrough above:
2.5D: depth maps and surface normals (Eigen & Fergus ICCV 2015)
- Depth map: per-pixel distance from camera to scene. RGB + Depth = RGB-D image (2.5D), recordable directly with Intel Realsense / Kinect.
- Predict via a fully-convolutional net trained with per-pixel L2.
- Scale/depth ambiguity (slide 15): a small close cat and a large far cat project to identical pixels, so absolute depth from a single image is fundamentally ambiguous. Fix with a scale-invariant loss on log depths, $\frac{1}{2n^2}\sum_{i,j}\big((\log y_i - \log y_j) - (\log y_i^* - \log y_j^*)\big)^2$: it penalizes relative depths between pixel pairs and ignores any constant scale factor. Both losses are sketched after this list.
- Surface normals: per-pixel unit 3-vector, so an $H \times W \times 3$ output. Loss = per-pixel cosine similarity $\frac{n \cdot n^*}{\|n\|\,\|n^*\|}$ (negated, so aligned normals minimize it).
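A sketch of both losses, assuming positive depth maps and per-pixel normals; this is the fully scale-invariant variant of the Eigen loss (the paper down-weights the second term with a factor λ):

```python
import torch
import torch.nn.functional as F

def scale_invariant_loss(pred, gt):
    """Loss on log depths, invariant to a global scale factor on `pred`."""
    d = torch.log(pred) - torch.log(gt)       # per-pixel log-depth error
    n = d.numel()
    return (d ** 2).sum() / n - (d.sum() ** 2) / (n ** 2)

def normal_loss(n_pred, n_gt):
    """Negative mean per-pixel cosine similarity between normal maps."""
    return -F.cosine_similarity(n_pred, n_gt, dim=-1).mean()

pred, gt = torch.rand(32, 32) + 0.1, torch.rand(32, 32) + 0.1
print(scale_invariant_loss(pred, gt))         # unchanged if pred is scaled by 2
print(normal_loss(torch.randn(32, 32, 3), torch.randn(32, 32, 3)))
```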
Voxels: pipelines + memory wall
- 3D ShapeNets (Wu CVPR 2015) classification net: voxel input → 6³ conv (48 channels, output 13³) → 5³ conv (160 channels, 5³) → 4³ conv (512 channels, 2³) → FC → class scores.
- 3D-R2N2 (Choy ECCV 2016) generation: 2D CNN encoder → 3D CNN decoder → occupancy, trained with per-voxel cross-entropy.
- Memory wall: a dense float32 voxel grid at 1024³ is ~4 GB. Octrees (OGN, Tatarchenko ICCV 2017) use heterogeneous resolution: a dense base grid, then octree levels 1/2/3 at increasingly fine resolutions near the surface. The arithmetic is spelled out below.
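The memory arithmetic, spelled out: dense grids grow cubically with resolution, which is the whole case for octrees:

```python
# Dense float32 voxel grids: 4 bytes per voxel, resolution cubed.
for res in (32, 128, 256, 1024):
    gib = res ** 3 * 4 / 2 ** 30
    print(f"{res}^3 grid: {gib:.3f} GiB")   # 1024^3 is ~4 GiB per grid
```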
Point clouds: sensor fusion + generation
- Generation: Point Set Generation Network (Fan CVPR 2017), a 2D CNN trunk + two heads (an FC head outputs a fixed set of points; a conv head outputs points as a 2D map), trained with Chamfer distance.
- Sensor fusion: DenseFusion (Wang CVPR 2019, 6D pose): RGB → CNN → per-pixel feature; point cloud → PointNet → per-point feature; project image features onto points and concatenate per point → joint per-point feature for downstream heads. The pattern of "lift 2D CNN features onto 3D primitives" recurs across mesh and NeRF work (sketched below).
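A sketch of the lift-and-concatenate step, with random stand-ins for the CNN / PointNet features; the camera projection that yields each point's pixel is assumed already done:

```python
import torch

img_feat = torch.randn(64, 120, 160)      # (C_img, H, W) from a 2D CNN
pt_feat = torch.randn(500, 32)            # (N, C_pt) from PointNet
rows = torch.randint(0, 120, (500,))      # each point's projected pixel row
cols = torch.randint(0, 160, (500,))      # ... and column

per_point_img = img_feat[:, rows, cols].T           # (N, C_img) lifted features
joint = torch.cat([pt_feat, per_point_img], dim=1)  # (N, C_pt + C_img)
```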
Meshes: Pixel2Mesh and Mesh R-CNN
Pixel2Mesh and Mesh R-CNN are the canonical "predict a triangle mesh from one RGB image" pipelines. Three reusable ideas crystallized in Pixel2Mesh (Wang ECCV 2018), the first two sketched after this list:
- Iterative refinement: start from a fixed ellipsoid mesh (156 → 628 → 2466 vertices); each stage predicts per-vertex offsets + a graph-unpooling subdivision.
- Graph convolution on the mesh: $f'_i = W_0 f_i + \sum_{j \in N(i)} W_1 f_j$, i.e. vertices update from their 1-ring neighbors with shared weights $W_0, W_1$.
- Vertex-aligned features: project each 3D vertex to the image plane, bilinearly sample CNN feature maps (conv3_3 / conv4_3 / conv5_3). Same trick as RoI-Align in detection.
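Sketches of the graph convolution and the vertex-aligned sampling, with stand-in sizes and adjacency; vertices are assumed already projected into normalized [-1, 1] image coordinates for `grid_sample`:

```python
import torch
import torch.nn.functional as F

def graph_conv(x, neighbors, w0, w1):
    """x: (V, C) vertex features; neighbors: (V, K) 1-ring indices."""
    agg = x[neighbors].sum(dim=1)        # sum over each vertex's 1-ring
    return x @ w0 + agg @ w1             # f'_i = W0 f_i + sum_j W1 f_j

def vertex_aligned(feat_map, verts_2d):
    """feat_map: (1, C, H, W); verts_2d: (V, 2) in [-1, 1] -> (V, C)."""
    grid = verts_2d.view(1, -1, 1, 2)    # grid_sample wants (N, Hg, Wg, 2)
    samp = F.grid_sample(feat_map, grid, align_corners=True)
    return samp.view(feat_map.shape[1], -1).T

x = torch.randn(156, 64)                       # stage-1 ellipsoid vertices
neighbors = torch.randint(0, 156, (156, 6))    # fake 1-ring adjacency
w0, w1 = torch.randn(64, 64), torch.randn(64, 64)
h = graph_conv(x, neighbors, w0, w1)           # (156, 64) updated features
vf = vertex_aligned(torch.randn(1, 256, 56, 56), torch.rand(156, 2) * 2 - 1)
```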
Loss: convert the mesh to a point cloud (sample points on the surface, online for the prediction and offline for the ground truth), then Chamfer distance. Sampling sidesteps the "same shape, different triangulation" problem.
Mesh R-CNN (Gkioxari ICCV 2019) bolts a mesh-prediction head onto Mask R-CNN: 2D detection → per-instance triangle mesh in the image.
Implicit: algebraic surfaces / CSG / level sets / DeepSDF (slides credit Ren Ng, CS184/284A)
- Algebraic surfaces = zero set of a polynomial in $x, y, z$.
- CSG = Boolean ops on implicit primitives; the expression forms a tree.
- Level sets = grid of scalar values; the surface is where the trilinearly-interpolated value crosses 0. Trades closed-form complexity for grid-controlled expressiveness (like a texture).
- DeepSDF (Park CVPR 2019): an MLP regresses signed distance; the surface is the decision boundary $f_\theta(x) = 0$. Smooth gradients beat occupancy for geometry quality. A minimal sketch follows.
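A minimal DeepSDF-style sketch; layer sizes are illustrative, and the per-shape latent `z` stands in for the paper's auto-decoder codes:

```python
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    def __init__(self, latent_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))          # scalar signed distance

    def forward(self, z, xyz):
        return self.net(torch.cat([z, xyz], dim=-1)).squeeze(-1)

f = SDFNet()
z = torch.randn(1, 256).expand(1024, -1)   # one shape code for all queries
xyz = torch.rand(1024, 3) * 2 - 1          # query points in [-1, 1]^3
sdf = f(z, xyz)                            # (1024,) signed distances
inside = sdf < 0                           # surface = the f(x) = 0 boundary
```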
NeRF and Gaussian Splatting variants
NeRF and Gaussian Splatting are documented in their own notes (NeRF and Gaussian Splatting), including the Lec 18 variants (Nerfies, RawNeRF, BlockNeRF, Dynamic 3D Gaussians, Gaussian Splatting SLAM).
Foundation models for 3D generation
- DreamFusion (Poole et al., arXiv 2022): text-to-3D without a 3D dataset; optimize a NeRF so its rendered views match a 2D text-to-image diffusion model (Score Distillation Sampling).
- CAT3D (Gao et al., arXiv 2024): "Create Anything in 3D" via multi-view diffusion; generate consistent multi-view images, then fit a 3D model.
The 2D-diffusion-as-3D-supervisor trick is the bridge from the diffusion foundation-model line into 3D, paralleling how VLMs serve as supervisors for VLA action prediction.
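A schematic of one Score Distillation Sampling step; `render` and `diffusion_eps` are hypothetical stand-ins for a NeRF renderer and a frozen text-to-image U-Net, and the noising here is simplified relative to a real DDPM schedule:

```python
import torch

def sds_step(render, diffusion_eps, params, text_emb, opt):
    img = render(params)                     # differentiable NeRF render
    t = torch.randint(20, 980, (1,))         # random diffusion timestep
    eps = torch.randn_like(img)
    noised = img + eps * (t / 1000.0)        # simplified noising (not real DDPM)
    with torch.no_grad():
        eps_hat = diffusion_eps(noised, t, text_emb)   # frozen 2D scorer
    # Surrogate loss whose gradient w.r.t. img is (eps_hat - eps): the
    # diffusion model's score steers the NeRF parameters through the render.
    loss = ((eps_hat - eps) * img).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```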