3D Representation

Mesh

A triangle mesh represents a 3D shape as a set of vertices in 3D space plus a set of triangular faces over those vertices.
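
Concretely, the data structure is just two arrays: vertex positions and integer face indices into them. A tiny NumPy sketch (a tetrahedron; the per-vertex colors are illustrative):

```python
import numpy as np

# verts: (N, 3) float array of vertex positions in 3D.
verts = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# faces: (M, 3) int array; each row indexes three vertices forming a triangle.
faces = np.array([
    [0, 1, 2],
    [0, 1, 3],
    [0, 2, 3],
    [1, 2, 3],
])

# Per-vertex attributes (RGB here) share the same indexing and get
# interpolated across each face at render time.
vert_colors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
```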

Why meshes vs voxels / pointclouds / implicit fields?

Meshes are the standard representation for graphics: every renderer and physics engine consumes them. They are adaptive: flat regions cost few triangles, fine detail gets more. Per-vertex data (RGB, UV coords, normals) interpolates cleanly across faces. The catch is that meshes are nontrivial to process with neural networks; they're irregular graphs, not tensors. (CS231n 2024 Lec 18, slides 44–46.)

Predicting meshes from a single image (CS231n 2024 Lec 18, slides 44–60)

Pixel2Mesh (Wang et al. ECCV 2018)

Single RGB image → triangle mesh, by deforming a fixed initial ellipsoid mesh through three refinement stages (156 → 628 → 2466 vertices). Three reusable ideas:

  1. Iterative refinement: start from an ellipsoid template; each stage predicts per-vertex 3D offsets, then graph unpooling subdivides the mesh before the next round.
  2. Graph convolution: vertices live on a graph, so use graph convolutions with weights shared across the mesh; the new feature for a vertex depends on its own feature and those of its 1-ring neighbors.
  3. Vertex-aligned features: project each mesh vertex onto the input image with the camera, then bilinearly sample CNN feature maps (conv3_3 / conv4_3 / conv5_3). Same trick as RoI-Align in detection: it keeps image features aligned with the 3D state. (Ideas 2 and 3, and a simplified refinement stage, are sketched in code after this list.)
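
A minimal PyTorch sketch of these ideas, not the paper's code: GraphConv implements the shared-weight graph convolution, and vert_align assumes a simple pinhole camera (focal length f, principal point (cx, cy)) and a single feature map, where Pixel2Mesh actually concatenates samples from conv3_3/conv4_3/conv5_3. All names and shapes here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    """Shared-weight graph conv: f_i' = relu(W0 f_i + W1 * sum_{j in N(i)} f_j)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)
        self.w_neigh = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats, edges):
        # feats: (V, in_dim) per-vertex features; edges: (E, 2) directed edge list.
        # Sum each vertex's 1-ring neighbor features, then apply shared weights.
        neigh_sum = torch.zeros_like(feats).index_add(0, edges[:, 0], feats[edges[:, 1]])
        return F.relu(self.w_self(feats) + self.w_neigh(neigh_sum))

def vert_align(feat_map, verts, f, cx, cy):
    """Project camera-space vertices into the image, bilinearly sample features.

    feat_map: (1, C, H, W) CNN feature map; verts: (V, 3); returns (V, C).
    """
    x = f * verts[:, 0] / verts[:, 2] + cx       # pinhole projection to pixels
    y = f * verts[:, 1] / verts[:, 2] + cy
    H, W = feat_map.shape[2:]
    gx = 2.0 * x / (W - 1) - 1.0                 # grid_sample wants [-1, 1] coords
    gy = 2.0 * y / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(1, -1, 1, 2)    # (1, V, 1, 2)
    samp = F.grid_sample(feat_map, grid, align_corners=True)  # (1, C, V, 1)
    return samp[0, :, :, 0].t()                               # (V, C)

# One simplified refinement stage (idea 1), tying the pieces together:
C = 64
feat_map = torch.randn(1, C, 56, 56)
verts = torch.randn(100, 3).abs() + 1.0          # keep z > 0 for the projection
edges = torch.randint(0, 100, (600, 2))
gconv = GraphConv(C + 3, 128)
offset_head = nn.Linear(128, 3)

img_feats = vert_align(feat_map, verts, f=50.0, cx=28.0, cy=28.0)  # idea 3
h = gconv(torch.cat([img_feats, verts], dim=1), edges)             # idea 2
verts = verts + offset_head(h)                   # per-vertex offsets deform the mesh
```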

Loss: the same shape can be tiled with very different triangulations, so comparing meshes vertex-to-vertex is ill-posed. Instead, convert both meshes to pointclouds (sample points on the predicted surface online, and on the GT surface offline) and use Chamfer distance:

$$ d_{\mathrm{CD}}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 \;+\; \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2 $$
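
A dense-pairwise sketch of that formula in PyTorch (fine for a few thousand sampled points; production code would use a nearest-neighbor-accelerated implementation):

```python
import torch

def chamfer_distance(p1, p2):
    """Chamfer distance between point clouds p1: (N, 3) and p2: (M, 3).

    For each point, take the squared distance to its nearest neighbor in the
    other cloud, and sum both directions. O(N*M) memory.
    """
    d = torch.cdist(p1, p2) ** 2                 # (N, M) pairwise squared distances
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()

# e.g. points sampled online from the predicted mesh vs. offline from the GT mesh
loss = chamfer_distance(torch.rand(1000, 3), torch.rand(1500, 3))
```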

Mesh R-CNN (Gkioxari, Malik, Johnson ICCV 2019)

Bolts a mesh-prediction head onto Mask R-CNN: 2D detection + per-instance segmentation as before, plus a triangle mesh per detected object. Extends "image → 2D shapes" to "image → 3D shapes" with the same proposal/feature-pooling backbone.