Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

1Technical University of Munich, 2Synthesia, 3University College London

We present Pixel3DMM, a fine-tuned DINO ViT for per-pixel surface normal and uv-coordinate prediction. From top to bottom, we show FFHQ input images, estimated surface normals, estimated 2D vertices derived from the predicted uv-coordinates, and FLAME fits against the two cues above.

In-the-Wild Tracking with Pixel3DMM.

From left to right: input, predicted normals, predicted 2D vertices, tracking overlay, FLAME tracking.

Abstract

We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features high diversity in facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the most competitive baselines by over 15% in terms of geometric accuracy for posed facial expressions.

Video

Single Image Reconstructions

Given an input image (top right), we show posed geometry reconstructions of DECA, FlowFace, and Ours against the ground-truth COLMAP point cloud.

Given an input image (top right), we show neutral geometry reconstructions of DECA, FlowFace, and Ours against the ground-truth COLMAP point cloud.

Method Overview

Left: Our network consists of a DINO backbone and a lightweight prediction head (see the sketch below). We train our models on the NPHM, FaceScape, and Ava256 datasets, which we bring into a uniform format using FLAME and non-rigid registration.
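To make this setup concrete, below is a minimal PyTorch sketch of a DINOv2 ViT-B/14 backbone (loaded from torch.hub) combined with a lightweight convolutional prediction head. The head architecture and all names here are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelPredictionHead(nn.Module):
    # Hypothetical head: upsamples ViT patch tokens to a dense per-pixel map
    # (3 channels for surface normals, or 2 for uv-coordinates).
    def __init__(self, embed_dim=768, out_channels=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(embed_dim, 256, 3, padding=1), nn.GELU(),
            nn.Conv2d(256, 128, 3, padding=1), nn.GELU(),
            nn.Conv2d(128, out_channels, 1),
        )

    def forward(self, tokens, grid_hw, out_hw):
        B, N, C = tokens.shape
        h, w = grid_hw
        feat = tokens.transpose(1, 2).reshape(B, C, h, w)  # token sequence -> feature grid
        feat = F.interpolate(feat, size=out_hw, mode="bilinear", align_corners=False)
        return self.conv(feat)

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")  # patch size 14
head = PixelPredictionHead(embed_dim=768, out_channels=3)

img = torch.randn(1, 3, 518, 518)  # 518 = 37 * 14, a valid DINOv2 input size
tokens = backbone.forward_features(img)["x_norm_patchtokens"]  # (1, 37*37, 768)
normals = head(tokens, grid_hw=(37, 37), out_hw=(518, 518))    # (1, 3, 518, 518)
normals = F.normalize(normals, dim=1)  # unit-length normal per pixel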

Right: At inference time, we use the normal and uv-coordinate predictions as optimization targets in a FLAME fitting procedure. While the normal constraint is straightforward, we incorporate the UV predictions by first predicting 2D vertex locations using nearest-neighbor lookups (a sketch of this step follows below).
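The following PyTorch sketch illustrates how such a lookup could work: for every FLAME vertex, find the valid pixel whose predicted uv-coordinate is closest to that vertex's canonical uv, yielding a 2D target for the fitting energy. This reflects our reading of the description above; flame_uv, mask, and the loss term in the trailing comment are assumptions, not the released implementation.

import torch

def uv_to_2d_vertices(pred_uv, flame_uv, mask):
    # pred_uv:  (H, W, 2) per-pixel uv prediction
    # flame_uv: (V, 2) canonical uv-coordinate of each FLAME vertex
    # mask:     (H, W) bool mask of valid (face) pixels
    # returns:  (V, 2) pixel coordinates and (V,) nearest-neighbor uv distances
    ys, xs = torch.nonzero(mask, as_tuple=True)           # valid pixel coordinates
    uv = pred_uv[ys, xs]                                  # (P, 2)
    d = torch.cdist(flame_uv, uv)                         # (V, P) pairwise uv distances
    dist, idx = d.min(dim=1)                              # nearest pixel per vertex
    px = torch.stack([xs[idx], ys[idx]], dim=-1).float()  # (V, 2) 2D vertex targets
    return px, dist

# Inside the fitting loop, a 2D-vertex term could then look like (pseudocode):
#   proj = project(flame_model(params), camera)           # (V, 2) projected vertices
#   loss_uv = ((proj - px.detach()) ** 2).sum(-1)[dist < tau].mean()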

Surface Normal Estimation.

Given an input image (left), we show predictions from several surface normal estimators (top right) and error maps (bottom right).
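Error maps like these are typically per-pixel angular errors between predicted and ground-truth unit normals; the helper below is a hypothetical example of such a metric, not necessarily the exact one used for the figure.

import torch

def normal_angular_error(pred, gt):
    # pred, gt: (..., 3) unit normals -> per-pixel angular error in degrees
    cos = (pred * gt).sum(-1).clamp(-1.0, 1.0)  # clamp guards acos against rounding
    return torch.rad2deg(torch.acos(cos))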

Related Links

For more work on similar tasks, please check out the following papers.

FLAME Tracking: VHAP (also supports multi-view tracking), FlowFace (no code release yet, but promised), and MetricalTracker.

FLAME Feed-Forward Regressors: MICA (best identity predictions) and EMOCA.

Surface Normal Estimation: Sapiens and Diff-E2E.

BibTeX

@misc{giebenhain2025pixel3dmm,
  title={Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction},
  author={Simon Giebenhain and Tobias Kirschstein and Martin R{\"{u}}nz and Lourdes Agapito and Matthias Nie{\ss}ner},
  year={2025},
  url={https://arxiv.org/abs/2505.00615},
}