MonoNPHM: Dynamic Head Reconstruction from Monocular Videos

1Technical University of Munich, 2Synthesia, 3University College London

We present MonoNPHM, a neural-field-based parametric head model with a disentangled latent space for geometry, expression, and appearance (left). MonoNPHM enables 3D head tracking from only a monocular RGB video by optimizing for latent codes via SDF-based volumetric rendering (right).

Tracking Results of MonoNPHM.

From left to right: input, overlay, reconstructions.


We present Monocular Neural Parametric Head Models (MonoNPHM) for dynamic 3D head reconstruction from monocular RGB videos. To this end, we propose a latent appearance space that parameterizes a texture field on top of a neural parametric model. We constrain predicted color values to be correlated with the underlying geometry, such that gradients from RGB effectively influence latent geometry codes during inverse rendering. To increase the representational capacity of our expression space, we augment our backward deformation field with hyper-dimensions, thus improving color and geometry representation in topologically challenging expressions. Using MonoNPHM as a learned prior, we approach the task of 3D head reconstruction through signed-distance-field-based volumetric rendering. By numerically inverting our backward deformation field, we incorporate a landmark loss using facial anchor points that are closely tied to our canonical geometry representation. To evaluate the task of dynamic face reconstruction from monocular RGB videos, we record 20 challenging Kinect sequences under casual conditions. MonoNPHM outperforms all baselines by a significant margin and makes an important step toward easily accessible neural parametric face models through RGB tracking.
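The SDF-based volumetric rendering used for tracking can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a VolSDF-style Laplace-CDF transform from signed distance to density, and the `beta` parameter and function names are hypothetical.

```python
import numpy as np

def sdf_to_density(sdf, beta=0.05):
    """Map signed distance to volume density via a Laplace CDF
    (VolSDF-style; beta is an assumed sharpness parameter)."""
    return np.where(sdf <= 0,
                    1.0 - 0.5 * np.exp(sdf / beta),   # inside: density -> 1/beta
                    0.5 * np.exp(-sdf / beta)) / beta  # outside: density -> 0

def render_ray(sdf_vals, colors, deltas):
    """Alpha-composite per-sample colors along one ray.
    sdf_vals: (N,) SDF at each sample, colors: (N, 3), deltas: (N,) step sizes."""
    sigma = sdf_to_density(sdf_vals)
    alpha = 1.0 - np.exp(-sigma * deltas)                          # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha
    return weights @ colors, weights.sum()  # rendered RGB, accumulated opacity (silhouette)
```

The accumulated opacity doubles as the silhouette value, which is how an RGB and a silhouette loss can share one rendering pass.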


Reconstructed Geometry.

Press R to reset views.

Latent Shape Interpolation

Here is an interactive viewer allowing for latent identity interpolation. Drag the blue cursor around to linearly interpolate between four different identities. The resulting geometry and appearance are displayed on the right.

Latent Shape Coordinates
(Quadrilateral linear interpolation between 4 cornering identities.)
Resulting geometry and appearance in neutral expression.
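The quadrilateral interpolation driving the viewer amounts to bilinear blending of the four corner latent codes. A minimal sketch (the function name, corner layout, and cursor parameterization are hypothetical, not taken from the viewer's source):

```python
import numpy as np

def bilerp_codes(corners, u, v):
    """Bilinearly blend four corner latent codes.
    corners: dict with keys 'tl', 'tr', 'bl', 'br', each a (D,) latent vector.
    (u, v) in [0, 1]^2 is the 2D cursor position (assumed convention:
    u moves left->right, v moves top->bottom)."""
    top = (1 - u) * corners['tl'] + u * corners['tr']
    bot = (1 - u) * corners['bl'] + u * corners['br']
    return (1 - v) * top + v * bot
```

The blended code is then decoded by the identity (or expression) network exactly like a code obtained from fitting.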

Latent Expression Interpolation

Here is an interactive viewer allowing for latent expression interpolation for a fixed identity. Drag the blue cursor around to linearly interpolate between four expressions. The resulting geometry and appearance are displayed on the right.

Latent Expression Coordinates
(Quadrilateral linear interpolation between 4 cornering expressions.)
Posed Geometry and Appearance.

Expression Transfers.

Tracked expression codes from the input video (left) are transferred to other identities on the right.

Method Overview

1. Given a point in posed space, we backward-warp it into canonical space.

2. Geometry is represented in canonical space using a neural SDF, which also produces a conditioning signal for the appearance network.

3. Appearance is modeled using a texture field in canonical space. Conditioning on geometry features ensures more effective gradients from an RGB loss to the geometry code during inverse rendering.

4. Our geometry and appearance networks depend on a set of discrete face anchor points. Using iterative root finding, we can numerically invert the backward deformation field to obtain anchors in posed space.

5. We perform tracking by optimizing for latent geometry, appearance and expression parameters. RGB and silhouette losses are built using deformable volumetric rendering. The landmark loss is computed by projecting posed anchors into image space.
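The iterative root finding from step 4 can be sketched as a fixed-point iteration: starting from the canonical anchor, repeatedly correct a posed-space estimate until the backward warp maps it onto the anchor. This is a simplified stand-in for the paper's procedure (the `warp` callable and the plain fixed-point update are illustrative assumptions; faster quasi-Newton updates such as Broyden's method are common here):

```python
import numpy as np

def invert_backward_warp(warp, x_canonical, n_iters=20, tol=1e-6):
    """Find a posed point x with warp(x) == x_canonical.
    warp: callable mapping a posed point (3,) to canonical space (3,);
    assumes the deformation (warp(x) - x) is small, so the
    fixed-point update below is a contraction."""
    x = np.asarray(x_canonical, dtype=float).copy()  # init at the canonical anchor
    for _ in range(n_iters):
        residual = warp(x) - x_canonical
        if np.linalg.norm(residual) < tol:
            break
        x = x - residual  # fixed-point update: x <- x - (warp(x) - target)
    return x
```

The recovered posed anchors can then be projected into image space to build a landmark loss, as described in step 5.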

Deformation Consistency

We visualize the tracked geometry (left), xy-canonical coordinates (middle), and the predicted hyper-dimensions mapped to the red and green channels (right).

Related Links

For more work on similar tasks, please check out the following papers.

NPHM and ImFace learn neural parametric models for facial geometry.

Neural parametric models including color were proposed in PhoMoH and SSIF.

For more work utilizing NPHMs see:

DiffusionAvatar utilizes NPHM as a proxy geometry to add fine-grained control to a powerful diffusion-based neural renderer.

FaceTalk generates a consistent expression sequence given an audio sequence using a diffusion model.

DPHM builds a diffusion prior for robust NPHM tracking using a depth sensor.

ClipHead uses a large vision-language model to add text control to NPHM.


@inproceedings{giebenhain2024mononphm,
 author={Simon Giebenhain and Tobias Kirschstein and Markos Georgopoulos and Martin R{\"{u}}nz and Lourdes Agapito and Matthias Nie{\ss}ner},
 title={MonoNPHM: Dynamic Head Reconstruction from Monocular Videos},
 booktitle = {Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)},
 year = {2024}}