3D Representation

Mesh

A triangle mesh represents a 3D shape as a set of vertices in 3D space plus a set of triangular faces over those vertices.
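
Concretely, the data structure is just two arrays: vertex positions and integer face indices into them. A tiny NumPy sketch (a tetrahedron; the per-vertex colors are illustrative):

```python
import numpy as np

# verts: (N, 3) float array of vertex positions in 3D.
verts = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# faces: (M, 3) int array; each row indexes three vertices forming a triangle.
faces = np.array([
    [0, 1, 2],
    [0, 1, 3],
    [0, 2, 3],
    [1, 2, 3],
])

# Per-vertex attributes (RGB here) share the same indexing and get
# interpolated across each face at render time.
vert_colors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
```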

Why meshes vs voxels / pointclouds / implicit fields?

Meshes are the standard representation for graphics: every renderer and physics engine consumes them. They are adaptive: flat regions cost few triangles, fine detail gets more. Per-vertex data (RGB, UV coords, normals) interpolates cleanly across faces. The catch is that meshes are nontrivial to process with neural networks; they're irregular graphs, not tensors. (CS231n 2024 Lec 18, slides 44–46.)

Predicting meshes from a single image (CS231n 2024 Lec 18, slides 44–60)

Pixel2Mesh (Wang et al. ECCV 2018)

Single RGB image → triangle mesh, by deforming a fixed initial ellipsoid mesh through three refinement stages (156 → 628 → 2466 vertices). Three reusable ideas:

  1. Iterative refinement: start from an ellipsoid template; each stage predicts per-vertex 3D offsets, then graph unpooling subdivides the mesh before the next round.
  2. Graph convolution: vertices live on a graph, so use graph convolutions with weights shared across the mesh; the new feature for a vertex depends on its own feature and those of its 1-ring neighbors.
  3. Vertex-aligned features: project each mesh vertex onto the input image with the camera, then bilinearly sample CNN feature maps (conv3_3 / conv4_3 / conv5_3). Same trick as RoI-Align in detection: it keeps image features aligned with the 3D state. (Ideas 2 and 3, and a simplified refinement stage, are sketched in code after this list.)
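
A minimal PyTorch sketch of these ideas, not the paper's code: GraphConv implements the shared-weight graph convolution, and vert_align assumes a simple pinhole camera (focal length f, principal point (cx, cy)) and a single feature map, where Pixel2Mesh actually concatenates samples from conv3_3/conv4_3/conv5_3. All names and shapes here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    """Shared-weight graph conv: f_i' = relu(W0 f_i + W1 * sum_{j in N(i)} f_j)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)
        self.w_neigh = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats, edges):
        # feats: (V, in_dim) per-vertex features; edges: (E, 2) directed edge list.
        # Sum each vertex's 1-ring neighbor features, then apply shared weights.
        neigh_sum = torch.zeros_like(feats).index_add(0, edges[:, 0], feats[edges[:, 1]])
        return F.relu(self.w_self(feats) + self.w_neigh(neigh_sum))

def vert_align(feat_map, verts, f, cx, cy):
    """Project camera-space vertices into the image, bilinearly sample features.

    feat_map: (1, C, H, W) CNN feature map; verts: (V, 3); returns (V, C).
    """
    x = f * verts[:, 0] / verts[:, 2] + cx       # pinhole projection to pixels
    y = f * verts[:, 1] / verts[:, 2] + cy
    H, W = feat_map.shape[2:]
    gx = 2.0 * x / (W - 1) - 1.0                 # grid_sample wants [-1, 1] coords
    gy = 2.0 * y / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(1, -1, 1, 2)    # (1, V, 1, 2)
    samp = F.grid_sample(feat_map, grid, align_corners=True)  # (1, C, V, 1)
    return samp[0, :, :, 0].t()                               # (V, C)

# One simplified refinement stage (idea 1), tying the pieces together:
C = 64
feat_map = torch.randn(1, C, 56, 56)
verts = torch.randn(100, 3).abs() + 1.0          # keep z > 0 for the projection
edges = torch.randint(0, 100, (600, 2))
gconv = GraphConv(C + 3, 128)
offset_head = nn.Linear(128, 3)

img_feats = vert_align(feat_map, verts, f=50.0, cx=28.0, cy=28.0)  # idea 3
h = gconv(torch.cat([img_feats, verts], dim=1), edges)             # idea 2
verts = verts + offset_head(h)                   # per-vertex offsets deform the mesh
```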

Loss: the same shape can be tiled with very different triangulations, so comparing meshes vertex-to-vertex is ill-posed. Instead, convert both meshes to pointclouds (sample points on the predicted surface online, and on the GT surface offline) and use Chamfer distance:

$$ d_{\mathrm{CD}}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 \;+\; \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2 $$
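
A dense-pairwise sketch of that formula in PyTorch (fine for a few thousand sampled points; production code would use a nearest-neighbor-accelerated implementation):

```python
import torch

def chamfer_distance(p1, p2):
    """Chamfer distance between point clouds p1: (N, 3) and p2: (M, 3).

    For each point, take the squared distance to its nearest neighbor in the
    other cloud, and sum both directions. O(N*M) memory.
    """
    d = torch.cdist(p1, p2) ** 2                 # (N, M) pairwise squared distances
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()

# e.g. points sampled online from the predicted mesh vs. offline from the GT mesh
loss = chamfer_distance(torch.rand(1000, 3), torch.rand(1500, 3))
```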

Mesh R-CNN (Gkioxari, Malik, Johnson ICCV 2019)

Bolts a mesh-prediction head onto Mask R-CNN: 2D detection + per-instance segmentation as before, plus a triangle mesh per detected object. Extends "image → 2D shapes" to "image → 3D shapes" with the same proposal/feature-pooling backbone.