3D Representation

How do you put a 3D shape into a computer? There's no single answer – each representation trades off sampling ease, inside/outside queries, memory, and learnability.

Why so many representations?

A surface is a continuous 2-manifold embedded in $\mathbb{R}^3$; discretizing it forces a choice. Explicit reps parameterize the surface directly (easy to sample, hard to query inside/outside); implicit reps describe the surface as the zero set of a field (hard to sample, trivial inside/outside). Cross this with parametric (closed-form shape family, e.g. splines) vs. non-parametric (free-form grid / list), and you get the four-quadrant taxonomy below.

Taxonomy (CS231n 2025 Lec 15)

              Non-parametric        Parametric
Explicit      Point cloud, Mesh     Spline patches, Subdivision surfaces
Implicit      Voxels, Level sets    Algebraic surfaces (zero set of a polynomial), CSG, Signed Distance Function (SDF)
  • Explicit surface: a map $f: \mathbb{R}^2 \to \mathbb{R}^3$. Example – torus: $f(u,v) = ((R + r\cos u)\cos v,\; (R + r\cos u)\sin v,\; r\sin u)$. Easy to sample (plug in $(u,v)$), hard to test inside/outside.
  • Implicit surface: the zero set $\{x : f(x) = 0\}$ of $f: \mathbb{R}^3 \to \mathbb{R}$. Example – unit sphere: $f(x,y,z) = x^2 + y^2 + z^2 - 1$. Hard to sample (need root-finding / marching cubes), easy to test inside/outside (just plug in; see the sketch after this list).
  • Level sets: grid of scalar values; surface is the trilinearly-interpolated zero crossing. Used for CT / MRI.
  • CSG (constructive solid geometry): Boolean ops on primitives.
  • SDF blending: combine distance fields (e.g. $\min(f_1, f_2)$ for union, or a smooth min), then take the zero set → smooth shape morphs.
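
A minimal NumPy sketch of the sampling-vs-querying asymmetry, using the torus and sphere above (radii, point counts, and function names are illustrative):

```python
import numpy as np

R, r = 2.0, 0.5  # torus major / minor radii

def torus_explicit(u, v):
    """Explicit rep: plug in (u, v), get a surface point. Sampling is trivial."""
    return np.array([(R + r*np.cos(u)) * np.cos(v),
                     (R + r*np.cos(u)) * np.sin(v),
                     r*np.sin(u)])

def sphere_implicit(p):
    """Implicit rep: f(p) < 0 inside, > 0 outside, = 0 on the unit sphere."""
    return p @ p - 1.0

def sdf_union(f1, f2):
    """CSG-style blend: the pointwise min of two fields describes their union."""
    return lambda p: min(f1(p), f2(p))

# Easy for the explicit rep: dense surface samples by sweeping (u, v).
samples = [torus_explicit(u, v)
           for u in np.linspace(0, 2*np.pi, 16)
           for v in np.linspace(0, 2*np.pi, 16)]

# Easy for the implicit rep: inside/outside is one function evaluation.
print(sphere_implicit(np.array([0.2, 0.3, 0.1])) < 0)   # True: inside

# Union of two spheres via min is still a one-evaluation query.
shifted = lambda p: sphere_implicit(p - np.array([1.0, 0.0, 0.0]))
print(sdf_union(sphere_implicit, shifted)(np.array([1.0, 0.0, 0.0])) < 0)  # True
```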

AI + Geometry (CS231n 2025 Lec 15 walkthrough)

Datasets

  • ShapeNet (Chang 2015) – 3M CAD models; ShapeNetCore subset = 51.3K models, 55 categories. Displaced the earlier Princeton Shape Benchmark (1,814 models).
  • Objaverse (Deitke 2022) → Objaverse-XL (2023) – 800K to 10M models.
  • CO3D (Reizenstein ICCV 2021) – 19K videos, 50 categories, real multi-view captures.
  • PartNet (Mo CVPR 2019) – fine-grained part decompositions + mobility, hierarchical.
  • ScanNet (Dai CVPR 2017) – 2.5M views, 1,500 RGB-D room scans. Recent follow-ups: ARKitScenes, ScanNet++.

Task zoo

  • Output is 3D (noise → shape, or partial shape → shape): generative – priors, completion, generation.
  • Input is 3D (shape → label): discriminative – classification, segmentation, descriptors.
  • Joint 3D + 2D: differentiable projection / back-projection, neural rendering.

Pipelines by representation

Multi-view (2D CNN reuse) – Multi-View CNN (Su ICCV 2015): render the shape from multiple viewpoints → shared CNN1 per view → element-wise max over views → CNN2 → softmax. Hits ~90% on ModelNet40.
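
A hedged PyTorch sketch of the view-pooling idea (the two sub-networks are user-supplied modules; shapes and names are illustrative):

```python
import torch
import torch.nn as nn

class MVCNN(nn.Module):
    """Shared per-view CNN -> element-wise max over views -> classifier head."""
    def __init__(self, cnn1: nn.Module, cnn2: nn.Module):
        super().__init__()
        self.cnn1 = cnn1   # shared across views, maps an image to a feature vector
        self.cnn2 = cnn2   # maps the pooled feature vector to class logits

    def forward(self, views):                        # views: (B, V, C, H, W)
        B, V = views.shape[:2]
        feats = self.cnn1(views.flatten(0, 1))       # (B*V, D), one pass per view
        feats = feats.view(B, V, -1).max(dim=1).values   # max-pool over the V views
        return self.cnn2(feats)                      # (B, num_classes)
```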

Voxels – 3D ShapeNets (Wu CVPR 2015): 30³ voxel CNN / DBN. 3D-GAN (Wu NeurIPS 2016): shape code → deconv → voxel grid. Visual Object Networks (Zhu NeurIPS 2018): shape → differentiable projection (depth + silhouette) → texture net → image; supports viewpoint / shape / texture edits. Dense voxels scale poorly – octree methods (OctNet Riegler CVPR 2017, O-CNN Wang SIGGRAPH 2017, OGN Tatarchenko ICCV 2017) store occupancy only at the surface.

Points (Lagrangian) – PointNet (Qi CVPR 2017): permutation-invariant set function via shared MLP + max pool. Graph extensions (EdgeConv, Wang TOG 2019) build edges over $k$-NN neighborhoods. Point-cloud distances: Chamfer (sum of two asymmetric nearest-neighbor terms, one per direction) and Earth Mover's (bijection cost, requires equal sizes).
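
A minimal NumPy sketch of the (squared) Chamfer distance; whether each direction is summed or averaged varies by paper, mean is used here:

```python
import numpy as np

def chamfer(X, Y):
    """Mean nearest-neighbor squared distance from X to Y, plus Y to X."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # (|X|, |Y|) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()  # two asymmetric terms

X, Y = np.random.rand(128, 3), np.random.rand(256, 3)
print(chamfer(X, Y))   # unlike EMD, |X| != |Y| is fine
```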

Parametric surfaces – AtlasNet (Groueix CVPR 2018): a collection of MLPs, each parameterizing one patch that maps the unit square $[0,1]^2$ (plus a shape code) into $\mathbb{R}^3$; the union of patches is the surface.
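
A sketch of one such patch under those assumptions (layer sizes are illustrative; the real AtlasNet architecture differs in detail):

```python
import torch
import torch.nn as nn

class Patch(nn.Module):
    """One AtlasNet-style patch: (u, v) in [0,1]^2 plus shape code -> 3D point."""
    def __init__(self, code_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 + code_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, uv, z):                 # uv: (P, 2), z: (code_dim,)
        z = z.expand(uv.shape[0], -1)         # broadcast the code to every sample
        return self.mlp(torch.cat([uv, z], dim=-1))   # (P, 3) surface samples

# Sampling the surface = sampling the unit square: the explicit-rep advantage.
patch = Patch()
pts = patch(torch.rand(1024, 2), torch.randn(256))
```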

Deep implicit functions – Occupancy Networks (Mescheder CVPR 2019): occupancy $o_\theta(x) \in [0,1]$, the probability that query point $x$ lies inside the shape. DeepSDF (Park CVPR 2019): regress signed distance instead – smooth gradients, better geometry. LDIF (Genova CVPR 2020): decompose shape into a structured set of local implicit elements (colored ellipsoids + latents).
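
A sketch of the shared shape of these models, assuming a latent-code-conditioned MLP (sizes illustrative, not either paper's exact architecture):

```python
import torch
import torch.nn as nn

class ImplicitNet(nn.Module):
    """MLP(z, x) -> scalar field value at query point x, for shape code z."""
    def __init__(self, code_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_dim + 3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, z, x):          # z: (B, code_dim), x: (B, 3)
        return self.net(torch.cat([z, x], dim=-1))   # (B, 1)

# DeepSDF-style: output is a signed distance, surface = {x : f(z, x) = 0}.
# Occupancy-Networks-style: sigmoid(output) = P(x inside), surface at 0.5.
```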

Radiance fields – NeRF (Mildenhall ECCV 2020): per-scene MLP $(\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$; volume-render along rays. 3D Gaussian Splatting (Kerbl SIGGRAPH 2023): replace the MLP with millions of explicit 3D Gaussians – ~2000× faster rendering at comparable quality.
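
A NumPy sketch of the volume-rendering quadrature NeRF uses along a single ray, $\hat{C} = \sum_i T_i (1 - e^{-\sigma_i \delta_i})\, \mathbf{c}_i$:

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """sigmas: (N,) densities, colors: (N, 3), ts: (N,) sample depths."""
    deltas = np.diff(ts, append=ts[-1] + 1e10)       # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # T_i
    weights = trans * alphas                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)   # (3,) rendered RGB
```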

Structure-aware representations

Instead of a monolithic shape, model element structure + element geometry:

  • Part sets (no relations) → relationship graphs (with connectivity) → hierarchies (trees, easier to generate) → StructureNet (Mo SIGGRAPH Asia 2019): hierarchical graphs with both parent-child and sibling edges, encoded/decoded by graph convnets.
  • Programs subsume all of the above (CAD-like) but are data-scarce.

Supervised 3D reconstruction (CS231n 2024 Lec 18 deltas)

CS231n 2024 Lec 18 ("3D Vision," slides credit Justin Johnson, presented Jun 4 2024) covers the same five-rep taxonomy (Depth Map / Voxel / Pointcloud / Mesh / Implicit) as Lec 15 but goes deeper on supervised reconstruction from a single image. The pieces not already in the Lec 15 walkthrough above:

2.5D – depth maps and surface normals (Eigen & Fergus ICCV 2015)

  • Depth map: per-pixel distance from camera to scene. RGB + Depth = RGB-D image (2.5D), recordable directly with Intel Realsense / Kinect.
  • Predict via fully-convolutional net trained with per-pixel L2.
  • Scale/depth ambiguity (slide 15): a small close cat and a large far cat project to identical pixels – absolute depth from a single image is fundamentally ambiguous. Fix with a scale-invariant loss over log depths, $L = \frac{1}{2n^2} \sum_{i,j} \big( (\log d_i - \log d_j) - (\log d_i^* - \log d_j^*) \big)^2$: it penalizes relative depths between pixel pairs and ignores any constant scale factor (sketch after this list).
  • Surface normals: per-pixel unit 3-vector, $3 \times H \times W$ output. Loss = per-pixel cosine similarity $\frac{\mathbf{n} \cdot \mathbf{n}^*}{\lVert\mathbf{n}\rVert\,\lVert\mathbf{n}^*\rVert}$.
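
A PyTorch sketch of both losses, assuming flattened per-pixel tensors (the pairwise form above reduces to the variance of the log-depth error, which is what this computes):

```python
import torch
import torch.nn.functional as F

def scale_invariant_loss(d, d_gt):
    """d, d_gt: (N,) positive depths. Invariant to a global scale on d."""
    g = torch.log(d) - torch.log(d_gt)
    return (g ** 2).mean() - g.mean() ** 2   # = the pairwise form / Var(g)

def normal_loss(n, n_gt):
    """n, n_gt: (N, 3) normals. 0 when predictions align with ground truth."""
    return (1.0 - F.cosine_similarity(n, n_gt, dim=-1)).mean()
```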

Voxels – pipelines + memory wall

  • 3D ShapeNets (Wu CVPR 2015) classification net: 30³ voxel input → 6³ conv (48 channels, 13³) → 5³ conv (160 channels, 5³) → 4³ conv (512 channels, 2³) → FC → class scores.
  • 3D-R2N2 (Choy ECCV 2016) generation: 2D CNN encoder → 3D CNN decoder → occupancy, trained with per-voxel cross-entropy.
  • Memory wall: a 1024³ float32 voxel grid = 4 GB (arithmetic below). Octrees (Tatarchenko ICCV 2017 OGN) use heterogeneous resolution: a coarse dense grid, with octree levels 1/2/3 refining only near the surface.
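
The back-of-envelope behind the memory wall ($R^3 \times 4$ bytes for a dense float32 grid at resolution $R$):

```python
# Dense float32 occupancy at resolution R costs R^3 * 4 bytes, per shape.
for R in (32, 128, 512, 1024):
    print(f"{R}^3 float32 grid: {R**3 * 4 / 2**30:.3f} GiB")
# 32^3: ~0.0001 GiB, 128^3: ~0.008 GiB, 512^3: 0.5 GiB, 1024^3: 4.0 GiB
```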

Pointclouds – sensor fusion + generation

  • Generation: Fan CVPR 2017 Point Set Generation Network – 2D CNN trunk + two heads (an FC head emits an unordered set of points; a conv head emits points arranged as a 2D grid), trained with Chamfer distance.
  • Sensor fusion – DenseFusion (Wang CVPR 2019, 6D pose): RGB → CNN → per-pixel feature; pointcloud → PointNet → per-point feature; project image features onto points and concatenate per-point → joint per-point feature for downstream heads (sketch after this list). The pattern of "lift 2D CNN features onto 3D primitives" recurs across mesh and NeRF work.
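
A sketch of that lifting pattern under simple assumptions (pinhole intrinsics K, points already in the camera frame; the function and argument names are illustrative, not DenseFusion's code):

```python
import torch
import torch.nn.functional as F

def lift_2d_features(feat2d, pts_cam, K, pt_feats):
    """feat2d: (1, C, H, W) CNN map; pts_cam: (N, 3); K: (3, 3); pt_feats: (N, D)."""
    uv = (K @ pts_cam.T).T                       # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                  # (N, 2) pixel coordinates
    H, W = feat2d.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,    # grid_sample wants [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(1, 1, -1, 2)
    img_feats = F.grid_sample(feat2d, grid, align_corners=True)  # bilinear sample
    img_feats = img_feats.squeeze(0).squeeze(1).T                # (N, C)
    return torch.cat([img_feats, pt_feats], dim=-1)              # (N, C + D)
```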

Meshes – Pixel2Mesh and Mesh R-CNN

Pixel2Mesh and Mesh R-CNN are the canonical "predict a triangle mesh from one RGB image" pipelines. Three reusable ideas crystallized in Pixel2Mesh (Wang ECCV 2018):

  1. Iterative refinement – start from a fixed ellipsoid mesh (156 → 628 → 2466 vertices); each stage predicts per-vertex offsets + a graph-unpooling subdivision.
  2. Graph convolution on the mesh: $v_i' = W_0 v_i + \sum_{j \in \mathcal{N}(i)} W_1 v_j$ – vertices update from their 1-ring neighbors with shared $W_0, W_1$ (sketch after this list).
  3. Vertex-aligned features: project each 3D vertex to the image plane, bilinearly sample CNN feature maps (conv3_3 / conv4_3 / conv5_3). Same trick as RoI-Align in detection.
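
A PyTorch sketch of that graph convolution, with mesh connectivity given as a directed edge list (an assumption about the data layout, not Pixel2Mesh's actual code):

```python
import torch
import torch.nn as nn

class MeshGraphConv(nn.Module):
    """v_i' = W0 v_i + sum over 1-ring neighbors j of W1 v_j."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w0 = nn.Linear(d_in, d_out, bias=False)   # self term
        self.w1 = nn.Linear(d_in, d_out, bias=False)   # neighbor term

    def forward(self, v, edges):       # v: (V, d_in); edges: (E, 2) = (dst, src)
        out = self.w0(v)
        agg = torch.zeros_like(out)
        agg.index_add_(0, edges[:, 0], self.w1(v)[edges[:, 1]])  # sum messages
        return out + agg
```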

Loss: convert mesh to pointcloud (sample points on the surface, online for the prediction + offline for the ground truth), then Chamfer distance. Sampling sidesteps the "same shape, different triangulation" problem.
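
A NumPy sketch of uniform surface sampling (area-weighted face choice, then uniform barycentric coordinates), which feeds the Chamfer function above:

```python
import numpy as np

def sample_surface(verts, faces, n):
    """verts: (V, 3); faces: (F, 3) int indices -> (n, 3) uniform surface samples."""
    a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    areas = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    idx = np.random.choice(len(faces), n, p=areas / areas.sum())  # area-weighted
    u, v = np.random.rand(n, 1), np.random.rand(n, 1)
    flip = (u + v) > 1.0                         # reflect to stay inside triangle
    u, v = np.where(flip, 1.0 - u, u), np.where(flip, 1.0 - v, v)
    return a[idx] * (1 - u - v) + b[idx] * u + c[idx] * v
```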

Mesh R-CNN (Gkioxari ICCV 2019) bolts a mesh-prediction head onto Mask R-CNN: 2D detection → per-instance triangle mesh in the image.

Implicit – algebraic surfaces / CSG / level sets / DeepSDF (slides credit Ren Ng, CS184/284A)

  • Algebraic surfaces = zero set of a polynomial in $x, y, z$.
  • CSG = Boolean ops on implicit primitives; the expression is a tree.
  • Level sets = grid of scalar values; surface where the trilinearly-interpolated value crosses $0$. Trades closed-form complexity for grid-controlled expressiveness (like a texture).
  • DeepSDF (Park CVPR 2019): MLP regresses signed distance; surface = decision boundary $f_\theta(x) = 0$ (extraction sketch below). Smooth gradients beat occupancy for geometry quality.
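
A sketch of going from any such field back to a mesh: evaluate on a grid, then extract the zero level set with marching cubes (scikit-image's measure.marching_cubes; the analytic sphere stands in for a learned $f_\theta$):

```python
import numpy as np
from skimage import measure

xs = np.linspace(-1.5, 1.5, 64)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
sdf = np.sqrt(X**2 + Y**2 + Z**2) - 1.0        # unit-sphere SDF on a 64^3 grid
verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)                # triangle mesh of the zero crossing
```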

NeRF and Gaussian Splatting variants

NeRF and Gaussian Splatting are documented in their own notes – NeRF and Gaussian Splatting – including the Lec 18 variants (Nerfies, RawNeRF, BlockNeRF, Dynamic 3D Gaussians, Gaussian Splatting SLAM).

Foundation models for 3D generation

  • DreamFusion (Poole et al. arXiv 2022) – text-to-3D without a 3D dataset: optimize a NeRF so its rendered views match a 2D text-to-image diffusion model (Score Distillation Sampling; sketch after this list).
  • CAT3D (Gao et al. arXiv 2024) – "Create Anything in 3D" via multi-view diffusion: generate consistent multi-view images, then fit a 3D model.
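
A heavily simplified sketch of one Score Distillation Sampling step; render, diffusion_eps, add_noise, and sample_timestep are hypothetical stand-ins for a differentiable renderer and a frozen 2D diffusion prior, and the timestep weighting w(t) is omitted:

```python
import torch

def sds_step(theta, optimizer, prompt_emb):
    img = render(theta)                       # hypothetical differentiable render
    t = sample_timestep()                     # hypothetical random diffusion step
    noise = torch.randn_like(img)
    noisy = add_noise(img, noise, t)          # hypothetical forward-diffusion op
    with torch.no_grad():
        eps_hat = diffusion_eps(noisy, t, prompt_emb)   # frozen 2D prior
    # SDS: push d(img)/d(theta) along (eps_hat - noise), skipping the prior's
    # own Jacobian; the "loss" below exists only to produce that gradient.
    loss = ((eps_hat - noise).detach() * img).sum()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```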

The 2D-diffusion-as-3D-supervisor trick is the bridge from the diffusion foundation-model line into 3D, paralleling how VLMs serve as supervisors for VLA action prediction.