Agentic AI / Generative AI

High-Fidelity 3D Mesh Generation at Scale with Meshtron

Meshes are one of the most important and widely used representations of 3D assets. They are the default standard in the film, design, and gaming industries and they are natively supported by virtually all the 3D softwares and graphics hardwares. 

A 3D mesh can be considered as a collection of polygon faces, most commonly consisting of triangles or quadrilaterals. 

An important property of a mesh is its topology, which refers to the organization of these polygon faces that discretize the 3D surface. Artist-created meshes usually feature highly informative and well-organized topologies that align closely with the underlying structure of the object. 

Video 1. An overview of Meshtron

Having artist-like topology is essential for editing, texturing, animation, and efficient rendering. However, creating these meshes manually by artists is a labor-intensive task that requires significant time and expertise in 3D modeling.

A mesh can be extracted algorithmically from other 3D representations. In a typical text-to-3D or image-to-3D generation system, a neural generator produces a neural field, which is then converted into a mesh using algorithms such as variants of Marching Cubes [FlexiCubes (NVIDIA), NMC, DiffMC] or Marching Tetrahedra [DMTet (NVIDIA)]. 

Unfortunately, these hand-designed algorithms produce dense meshes that do not have artist-like topology, hindering the quality and usefulness of these methods.

A diagram shows that hand-designed algorithms like Marching Tetrahedra and Marching Cubes produce dense meshes that lack adaptivity. This results in meshes with missing details, bumpy geometry, and uninformative tessellation. Meshtron, as a data-driven algorithm, produces artist-like topology with great geometric detail, even at a lower face count.

Figure 1. Mesh comparisons

An emerging line of work generates meshes in a data-driven fashion by learning from existing artist-created meshes using machine learning algorithms. These works show great potential yet are limited in scalability, with the best method handling up to 1.6K faces:

In sharp contrast, a delicate 3D object often requires over 10K faces to be accurately modeled. Our survey shows that artist-created meshes typically have a median face count of 10K and a mean of 32K.

Meshtron provides a simple and scalable, data-driven solution for generating intricate, artist-like meshes of up to 64K faces at 1024-level coordinate resolution. This is over an order of magnitude higher face count and 8x higher coordinate resolution compared to existing methods.

A diagram shows previous works generate meshes with limited face counts at a low, 128-level spatial resolution and Meshtron meshes with controllable face counts of up to 64K at a higher, 1024-level spatial resolution. Compared to previous works, Meshtron produces better quality meshes at similar face counts, while being capable of generating much more sophisticated meshes.

Figure 2. Comparison of previous low-poly meshes with low resolution and Meshtron-generated meshes with controllable face count and high resolution

A mesh as a sequence of tokens

Meshtron is an autoregressive model that generates mesh tokens. It shares the same working principle as autoregressive language models such as GPTs. 

A mesh can easily be converted to a sequence of tokens. The basic building block of a mesh is a triangle face, which can be represented with nine tokens: 

  • Each triangle has three vertices.
  • Each vertex has three coordinates.
  • Each coordinate can be quantized to obtain a discrete token. 

A mesh can thus be represented uniquely as a sequence of tokens by chaining these face tokens together according to a bottom-to-top sorted order.

 GIF shows that a mesh can be converted to a unique sequence of tokens by sorting the vertices and faces. The obtained sequence has a length of 9 times the face count.
Figure 3. Mesh representation by token sequence
A GIF shows that the Meshtron generates triangles in sorted order, from bottom to top.
Figure 4. Bottom-to-top generation process

Meshtron is an efficient mesh generator

Directly generating these mesh tokens can be costly. A mesh with 32K faces requires 300K tokens. For a typical autoregressive model, this would be extremely slow to generate and expensive to train.

Meshtron adopts an Hourglass Transformer architecture, which saves computation and memory by internally merging the tokens to reduce the sequence length. More specifically, the model consists of three stages, each stage reducing the sequence length by three times:

  • First stage: Operates on the coordinate level, which processes every token.
  • Second stage: Operates on the vertex level, processing one token per vertex (3x reduction).
  • Third stage: Operates on the face level, processing one token per face (9x reduction).
A GIF shows that Meshtron uses an Hourglass Transformer as its backbone, which internally reduces the token count. This reduces the computing and memory consumption significantly while making the generation process much faster.
Figure 5. Hourglass Transformer in action

By allocating more layers to the stages with higher reduction ratios, the computation and memory cost is significantly reduced compared to conventional, single-staged models. 

Each stage of the Hourglass network aligns perfectly with the coordinate-vertex-face semantics of a mesh. This not only improves the modeling efficiency, but also has the additional benefit of allocating more compute power to difficult-to-generate tokens. Within each triangle, the first vertex is the easiest to generate while the last vertex is the most difficult.

A diagram shows that mesh sequences have inherent periodicity in the difficulty of generating each token. The hourglass model allocates more compute (GFLOPs) to difficult-to-generate tokens (ones with high log perplexity). Vertical lines show the triangle boundaries within the token sequence.

Figure 6. Log perplexity and per-token GFLOPs at different token positions

Another efficiency-boosting technique used by Meshtron is sliding window attention. A conventional Transformer model has a context length that grows with the sequence length, leading to quadratic growth of compute and linear growth of memory consumption as the sequence becomes longer. This leads to significant slowdown with long sequences both during training and generation. 

Instead, Meshtron maintains a fixed-length context window of 8192 faces. During training, the mesh sequences are randomly cropped to up to 8192 faces. During inference, the token generated more than 8192 faces ago is evicted from the KV cache. The sliding window technique leads to a constant memory cost and constant token throughput that never slows down as the mesh size grows.

With the help of these techniques, Meshtron achieves 2.5x faster token throughput and over 50% saving in memory both during training and inference, while generating better quality meshes.

Meshtron is highly controllable

The current version of Meshtron accepts the following control inputs:

  • Point cloud: Determines the shape of the output mesh.
  • Face count: Determines the density of the output mesh.
  • Quad ratio: Switches between quad and triangle tessellation.
  • Creativity:  Can be adjusted to generate extra details not present in the point cloud.

As almost all of the 3D representations can be converted to point clouds, Meshtron can either serve as a stand-alone remesher to improve the quality of existing meshes, or be used in tandem with a text-to-3D or image-to-3D model to generate artist-grade meshes from scratch.

A diagram shows that in the faithful mode, the mesh generated by Meshtron follows closely to the input, reproducing all the details as well as imperfections. In the creative mode, Meshtron attempts to generate extra details and fix the imperfections. The input meshes in this example are generated by a text-to-3D service.

Figure 7. Meshtron control over creativity

The control input of Meshtron is implemented with cross-attention, which is highly extensible and can easily be adapted to other types of inputs such as images, additional control, and so on. It also enables sliding-window inference by providing the global context.

A diagram shows that all the conditioning inputs are encoded into embeddings using respective encoders. The embeddings are then concatenated and used for controlling mesh generation with cross-attention.

Figure 8. Meshtron full model architecture

Conclusion

Meshtron is an autoregressive mesh generator based on the Hourglass architecture and using sliding window attention. It exploits the periodicity and locality of a mesh sequence to achieve greatly improved efficiency.

Meshtron generates meshes of detailed, complex 3D objects at unprecedented levels of resolution and fidelity, closely resembling those created by professional artists. It opens the door to a more realistic generation of detailed 3D assets for animation, gaming, and virtual environments.

Owing to its general and highly extensible architecture, we hope that Meshtron can serve as the reference for future mesh generation research and applications.

For more information see the following resources:

Discuss (3)

Tags