Mesh
A triangle mesh represents a 3D shape as a set of vertices in 3D space plus a set of triangular faces over those vertices.
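Concretely, a mesh is often stored as a float array of vertex positions and an integer array of face indices into it. A minimal sketch (the tetrahedron and the (V, 3) / (F, 3) layout are illustrative conventions, not something the notes mandate):

```python
import numpy as np

# A tetrahedron: 4 vertices in 3D plus 4 triangular faces over them.
verts = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
], dtype=np.float32)   # (V, 3) vertex positions

faces = np.array([
    [0, 1, 2],
    [0, 1, 3],
    [0, 2, 3],
    [1, 2, 3],
], dtype=np.int64)     # (F, 3) indices of each triangle's corner vertices
```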
Why meshes vs voxels / pointclouds / implicit fields?
Meshes are the standard representation for graphics: every renderer and physics engine consumes them. They are adaptive: flat regions cost few triangles, fine detail gets more. Per-vertex data (RGB, UV coords, normals) interpolates cleanly across faces. The catch is that meshes are nontrivial to process with neural networks: they're irregular graphs, not tensors. (CS231n 2024 Lec 18, slides 44-46.)
Predicting meshes from a single image (CS231n 2024 Lec 18, slides 44-60)
Pixel2Mesh (Wang et al. ECCV 2018)
Single RGB image → triangle mesh, produced by deforming a fixed initial ellipsoid mesh through three refinement stages (156 → 628 → 2466 vertices). Three reusable ideas:
- Iterative refinement: start from an ellipsoid template; each stage predicts per-vertex 3D offsets, then a graph-unpooling step subdivides the mesh before the next round.
- Graph convolution: vertices live on a graph, so use a graph convolution with weights shared across the mesh; the new feature for a vertex depends on its own feature and those of its 1-ring neighbors (see the first sketch after this list).
- Vertex-aligned features: for each mesh vertex, project it onto the input image with the camera and bilinearly sample CNN feature maps (conv3_3 / conv4_3 / conv5_3). Same trick as RoI-Align in detection: it keeps image features aligned with the 3D state (see the second sketch after this list).
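A minimal PyTorch sketch of the graph convolution idea: a learned transform of a vertex's own feature plus a shared transform of the sum of its 1-ring neighbors. The layer name, edge-list format, and ReLU are assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Pixel2Mesh-style graph conv sketch: new vertex feature =
    W_self * own feature + W_neigh * (sum of 1-ring neighbor features)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)
        self.w_neigh = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats, edges):
        # feats: (V, in_dim) per-vertex features
        # edges: (E, 2) undirected edge list, each edge listed once
        src, dst = edges[:, 0], edges[:, 1]
        neigh_sum = torch.zeros_like(feats)
        # accumulate neighbor features in both directions of each edge
        neigh_sum.index_add_(0, dst, feats[src])
        neigh_sum.index_add_(0, src, feats[dst])
        return torch.relu(self.w_self(feats) + self.w_neigh(neigh_sum))
```

Because the same two weight matrices are applied at every vertex, the layer works for any mesh size and connectivity, which is what lets the same network run across the unpooled stages.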
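And a sketch of vertex-aligned feature sampling: project vertices with a pinhole camera, then bilinearly sample a feature map at the projected locations. The function name, shape conventions, and intrinsics format are assumptions; the paper's implementation differs in detail.

```python
import torch
import torch.nn.functional as F

def vert_align(feat_map, verts_3d, K):
    """feat_map: (1, C, H, W) CNN feature map for one conv stage
    verts_3d: (V, 3) vertices in camera coordinates (z > 0)
    K:        (3, 3) pinhole intrinsics at the feature map's resolution
    returns:  (V, C) vertex-aligned features"""
    # perspective projection to pixel coordinates
    proj = verts_3d @ K.t()              # (V, 3)
    xy = proj[:, :2] / proj[:, 2:3]      # (V, 2) pixel coords (x, y)

    # convert pixels to grid_sample's normalized [-1, 1] coordinates
    _, _, H, W = feat_map.shape
    grid = torch.empty_like(xy)
    grid[:, 0] = 2.0 * xy[:, 0] / (W - 1) - 1.0
    grid[:, 1] = 2.0 * xy[:, 1] / (H - 1) - 1.0
    grid = grid.view(1, 1, -1, 2)        # (1, 1, V, 2)

    # bilinear sampling, analogous to RoI-Align's sub-pixel sampling
    sampled = F.grid_sample(feat_map, grid, align_corners=True)  # (1, C, 1, V)
    return sampled[0, :, 0, :].t()       # (V, C)
```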
Loss: the same shape can be tiled with very different triangulations, so comparing meshes vertex-to-vertex is ill-posed. Instead, convert both meshes to point clouds (sample points on the predicted surface online and on the GT surface offline) and use the Chamfer distance between the two point sets:
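For point sets $S_1$ (sampled from the prediction) and $S_2$ (sampled from the ground truth), the bidirectional squared Chamfer distance is the standard form (written here as a sum over points; a mean is also common):

$$
d_{\mathrm{CD}}(S_1, S_2) \;=\; \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 \;+\; \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2
$$

Each predicted point is pulled toward its nearest ground-truth point and vice versa, so the loss depends only on the surfaces, not on any particular triangulation.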
Mesh R-CNN (Gkioxari, Malik, Johnson ICCV 2019)
Bolts a mesh-prediction head onto Mask R-CNN: 2D detection + per-instance segmentation as before, plus a triangle mesh per detected object. Extends "image → 2D shapes" to "image → 3D shapes" with the same proposal/feature-pooling backbone.