AI news
February 12, 2025

Microsoft's New Trellis Tool Turns An Image Into 3D Assets

Trellis is a tool for artists, developers, and designers to produce amazing 3D content efficiently.

by Jim Clyde Monge

A few weeks ago, Microsoft unveiled a novel 3D generation method for versatile and high-quality 3D asset creation called Trellis. The model uses a unified structured latent representation (SLAT) to decode into various formats, such as Radiance Fields, 3D Gaussians, and meshes, by integrating sparse 3D grids with multiview visual features.

Okay, that sounds like a mouthful, but in simple terms, Trellis is really good at creating high-quality 3D models that look realistic and match the descriptions or pictures you provide. It’s an incredible tool for artists, developers, and designers to produce amazing 3D content efficiently.

I’ve talked about AI-powered 3D object generators in the past, but this one is particularly impressive in terms of speed and quality.

How Does Trellis Work?

The method uses rectified flow transformers and achieves superior results compared to existing approaches, while also offering flexible editing capabilities.

The model is trained on a large 3D asset dataset (500K objects) and surpasses existing methods in quality and versatility, as demonstrated through extensive experiments and user studies.

Image by Jim Clyde Monge

The 3D object generation in Trellis is a two-stage process that uses a special code called “Structured LATent” (SLAT).

Here’s how it works:

Stage 1: Building the Structure

  1. Sparse Structure: Trellis starts by creating a basic framework of the object using a set of “active voxels.” Voxels are like tiny cubes in 3D space. The active voxels outline the rough shape of the object. Imagine building a Lego model and first putting together the main blocks to get the general shape.
  2. Compressing the Structure: To make things more efficient, Trellis compresses this framework into a compact “cheat sheet” of instructions using a VAE (Variational Autoencoder).
  3. Generating the Framework: Trellis then uses a special type of artificial intelligence called a “Rectified Flow Transformer” to turn this cheat sheet into a detailed plan for the object’s framework. This plan tells the computer exactly where to place the active voxels in 3D space.
Image from Microsoft Trellis

Stage 2: Adding the Details

  1. Local Latents: Once the framework is in place, Trellis adds details to each active voxel using “local latents.” These latents contain information about the object’s appearance, like color and texture.
  2. Feature Aggregation: To figure out what each local latent should look like, Trellis uses a powerful vision model (DINOv2). This vision model analyzes pictures of the object from many different angles and extracts important features, like edges, shapes, and colors.
  3. Generating the Details: Trellis uses another Rectified Flow Transformer to take these features and turn them into the detailed local latents. These latents are then attached to the active voxels to complete the 3D model.

This two-stage process allows Trellis to create high-quality 3D models efficiently. It leverages the power of artificial intelligence and computer vision to understand and recreate complex 3D objects from text descriptions or pictures.
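To make the two-stage idea concrete, here is a very loose sketch in Python. Everything in it (the grid size, the latent dimension, the random numbers standing in for the generative models) is a toy placeholder, not the actual TRELLIS implementation:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# --- Stage 1: sparse structure ----------------------------------------
# Pretend a generator produced an occupancy grid; thresholding it gives
# the "active voxels" that outline the object's rough shape.
GRID = 16                                       # toy resolution
occupancy = rng.random((GRID, GRID, GRID))
active_voxels = np.argwhere(occupancy > 0.97)   # (N, 3) integer coords

# --- Stage 2: local latents -------------------------------------------
# Attach a feature vector ("local latent") to every active voxel. In
# Trellis these come from a second rectified flow transformer guided by
# DINOv2 image features; here they are random placeholders.
LATENT_DIM = 8
local_latents = rng.normal(size=(len(active_voxels), LATENT_DIM))

# The pair (coordinates, latents) is the structured latent (SLAT): a
# sparse set of 3D positions plus an appearance code stored at each one.
slat = {"coords": active_voxels, "latents": local_latents}
print(slat["coords"].shape, slat["latents"].shape)
```

The key takeaway is the data layout: SLAT is not a dense grid but a sparse list of positions, each carrying its own latent code, which is what the decoders then turn into Gaussians, radiance fields, or meshes.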

Trellis can convert this SLAT representation into various 3D model formats, like:

  • 3D Gaussians
  • Radiance Fields
  • Meshes

Check out these high-quality examples:

Image from Microsoft Trellis

The way Trellis compresses the structure and adds details is reminiscent of how professional 3D artists work—starting with a base mesh and then layering details.

However, unlike human artists, Trellis does it in a fraction of the time.

How To Try Trellis

You can try Trellis on HuggingFace.

Upload an image and click “Generate” to create a 3D asset. If the image has an alpha channel, it will be used as the mask. You can play with the generation settings or the GLB extraction settings or leave them as default.

Image by Jim Clyde Monge

Here’s the sample 3D output:

GIF by Jim Clyde Monge

If you feel satisfied with the 3D asset, click “Extract GLB” to extract the GLB file and download it. You can also view the 3D asset in online tools like GLTF Viewer.
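If you want to sanity-check a downloaded GLB before opening it in a viewer, the file's 12-byte header (the ASCII magic "glTF", a version number, and the total length) is easy to read with Python's standard library. This is just a convenience sketch; the filename is whatever you saved from the demo:

```python
import struct

def read_glb_header(path):
    """Return (version, byte_length) from a GLB file's 12-byte header.

    A binary glTF (.glb) file starts with the ASCII magic b'glTF',
    a uint32 container version (2 for current files), and the total
    file length in bytes, all little-endian.
    """
    with open(path, "rb") as f:
        magic, version, length = struct.unpack("<4sII", f.read(12))
    if magic != b"glTF":
        raise ValueError(f"{path} is not a GLB file (magic={magic!r})")
    return version, length

# Example: inspect the file extracted from the Trellis demo.
# version, length = read_glb_header("sample.glb")
# print(version, length)   # version should be 2
```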

Note: The Gaussian file can be very large (~50 MB), so it may take a while to display and download.

Here are more examples:

Prompt: Spherical robot with gold and silver design.

The result looks pretty decent overall. The gold and silver textures add a nice touch, and from a distance, it looks great. But if you zoom in, you’ll notice it’s still a bit low-poly. The details aren’t as refined as they could be, with some edges looking rough. That said, for something generated this quickly, it’s hard to complain. It’s a solid result if you’re after speed and good enough for most use cases.

Here’s another example using an image as an input.

I really like how close the 3D model gets to the original reference image. The overall shape and structure feel spot-on, which is super impressive. But when you focus on the smaller details, like the ropes or the intricate textures on the sides, they’re not perfect. Even so, from a regular viewing distance, it still looks pretty good. For something generated in seconds, it’s honestly better than I expected. If you’re okay with minor imperfections, this is a fantastic starting point.

Trellis is also great at creating multiple variants of a single 3D object based on text prompts, which makes it easy to iterate on designs quickly.

Image from Microsoft Trellis

It doesn’t stop there because Trellis is also crazy good at composing complex and vibrant 3D art designs.

Image from Microsoft Trellis

I am very impressed by the quality of this 3D scene. Microsoft is setting a new standard in 3D generation with this high-quality, scalable model. In fact, some users have already 3D-printed models created with Trellis, which is super cool!

Image from X

Running Trellis Locally

You can find Trellis’ code on GitHub and run it locally by following the steps below:

1. Clone the repository:

git clone --recurse-submodules https://github.com/microsoft/TRELLIS.git
cd TRELLIS

2. Create a new conda environment named trellis and install the dependencies. The repository’s setup.sh script handles this; a typical invocation, per the TRELLIS README, looks like:

. ./setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast

Run setup.sh --help to see all available options:

Usage: setup.sh [OPTIONS]
Options:
    -h, --help              Display this help message
    --new-env               Create a new conda environment
    --basic                 Install basic dependencies
    --xformers              Install xformers
    --flash-attn            Install flash-attn
    --diffoctreerast        Install diffoctreerast
    --vox2seq               Install vox2seq
    --spconv                Install spconv
    --mipgaussian           Install mip-splatting
    --kaolin                Install kaolin
    --nvdiffrast            Install nvdiffrast
    --demo                  Install all dependencies for demo

Here is an example of how to use the pretrained models for 3D asset generation.

import os
# os.environ['ATTN_BACKEND'] = 'xformers'   # Can be 'flash-attn' or 'xformers', default is 'flash-attn'
os.environ['SPCONV_ALGO'] = 'native'        # Can be 'native' or 'auto', default is 'auto'.
                                            # 'auto' is faster but will do benchmarking at the beginning.
                                            # Recommended to set to 'native' if run only once.

import imageio
from PIL import Image
from trellis.pipelines import TrellisImageTo3DPipeline
from trellis.utils import render_utils, postprocessing_utils

# Load a pipeline from a model folder or a Hugging Face model hub.
pipeline = TrellisImageTo3DPipeline.from_pretrained("JeffreyXiang/TRELLIS-image-large")
pipeline.cuda()

# Load an image
image = Image.open("assets/example_image/T.png")

# Run the pipeline
outputs = pipeline.run(
    image,
    seed=1,
    # Optional parameters
    # sparse_structure_sampler_params={
    #     "steps": 12,
    #     "cfg_strength": 7.5,
    # },
    # slat_sampler_params={
    #     "steps": 12,
    #     "cfg_strength": 3,
    # },
)
# outputs is a dictionary containing generated 3D assets in different formats:
# - outputs['gaussian']: a list of 3D Gaussians
# - outputs['radiance_field']: a list of radiance fields
# - outputs['mesh']: a list of meshes

# Render the outputs
video = render_utils.render_video(outputs['gaussian'][0])['color']
imageio.mimsave("sample_gs.mp4", video, fps=30)
video = render_utils.render_video(outputs['radiance_field'][0])['color']
imageio.mimsave("sample_rf.mp4", video, fps=30)
video = render_utils.render_video(outputs['mesh'][0])['normal']
imageio.mimsave("sample_mesh.mp4", video, fps=30)

# GLB files can be extracted from the outputs
glb = postprocessing_utils.to_glb(
    outputs['gaussian'][0],
    outputs['mesh'][0],
    # Optional parameters
    simplify=0.95,          # Ratio of triangles to remove in the simplification process
    texture_size=1024,      # Size of the texture used for the GLB
)
glb.export("sample.glb")

# Save Gaussians as PLY files
outputs['gaussian'][0].save_ply("sample.ply")

Here’s a list of what you’ll get as a result:

  • sample_gs.mp4: a video showing the 3D Gaussian representation
  • sample_rf.mp4: a video showing the Radiance Field representation
  • sample_mesh.mp4: a video showing the mesh representation
  • sample.glb: a GLB file containing the extracted textured mesh
  • sample.ply: a PLY file containing the 3D Gaussian representation

Final Thoughts

I’m impressed not just by the quality but also by the speed at which you can generate a splat. Doing all this in seconds is a big step forward and shows huge promise for future iterations.

It’s not perfect, though. Complex models, especially those involving human features, can trip it up. But for people who don’t know or want to learn 3D modeling, it’s still an amazing tool.

As a developer, I can’t wait to get API access. The ability to quickly create 3D assets opens up so many possibilities. Game developers and animators are going to find this incredibly useful, and I’m excited to see what the community creates with it.

