MonoNPHM: Dynamic Head Reconstruction from Monocular Videos

1Technical University of Munich, 2Synthesia, 3University College London

We present MonoNPHM, a neural-field-based parametric head model with a disentangled latent space for geometry, expression, and appearance (left). MonoNPHM enables 3D head tracking from only a monocular RGB video by optimizing for latent codes via SDF-based volumetric rendering (right).

Tracking Results of MonoNPHM.

From left to right: input, overlay, reconstructions.


We present Monocular Neural Parametric Head Models (MonoNPHM) for dynamic 3D head reconstruction from monocular RGB videos. To this end, we propose a latent appearance space that parameterizes a texture field on top of a neural parametric model. We constrain predicted color values to be correlated with the underlying geometry, such that gradients from RGB effectively influence latent geometry codes during inverse rendering. To increase the representational capacity of our expression space, we augment our backward deformation field with hyper-dimensions, thus improving color and geometry representation in topologically challenging expressions. Using MonoNPHM as a learned prior, we approach the task of 3D head reconstruction through signed-distance-field-based volumetric rendering. By numerically inverting our backward deformation field, we incorporate a landmark loss using facial anchor points that are closely tied to our canonical geometry representation. To evaluate the task of dynamic face reconstruction from monocular RGB videos, we record 20 challenging Kinect sequences under casual conditions. MonoNPHM outperforms all baselines by a significant margin and makes an important step toward easily accessible neural parametric face models through RGB tracking.
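The SDF-based volumetric rendering used for tracking can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a VolSDF-style Laplace-CDF transform from signed distance to density, and the `beta` parameter and function names are hypothetical.

```python
import numpy as np

def sdf_to_density(sdf, beta=0.05):
    """Map signed distance to volume density via a Laplace CDF
    (VolSDF-style; beta is an assumed sharpness parameter)."""
    return np.where(sdf <= 0,
                    1.0 - 0.5 * np.exp(sdf / beta),   # inside: density -> 1/beta
                    0.5 * np.exp(-sdf / beta)) / beta  # outside: density -> 0

def render_ray(sdf_vals, colors, deltas):
    """Alpha-composite per-sample colors along one ray.
    sdf_vals: (N,) SDF at each sample, colors: (N, 3), deltas: (N,) step sizes."""
    sigma = sdf_to_density(sdf_vals)
    alpha = 1.0 - np.exp(-sigma * deltas)                          # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha
    return weights @ colors, weights.sum()  # rendered RGB, accumulated opacity (silhouette)
```

The accumulated opacity doubles as the silhouette value, which is how an RGB and a silhouette loss can share one rendering pass.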


Reconstructed Geometry.

Press R to reset views.

Latent Shape Interpolation

Here is an interactive viewer allowing for latent identity interpolation. Drag the blue cursor around to linearly interpolate between four different identities. The resulting geometry and appearance are displayed on the right.

Latent Shape Coordinates
(Quadrilateral linear interpolation between 4 cornering identities.)
Resulting geometry and appearance in neutral expression.
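The quadrilateral interpolation driving the viewer amounts to bilinear blending of the four corner latent codes. A minimal sketch (the function name, corner layout, and cursor parameterization are hypothetical, not taken from the viewer's source):

```python
import numpy as np

def bilerp_codes(corners, u, v):
    """Bilinearly blend four corner latent codes.
    corners: dict with keys 'tl', 'tr', 'bl', 'br', each a (D,) latent vector.
    (u, v) in [0, 1]^2 is the 2D cursor position (assumed convention:
    u moves left->right, v moves top->bottom)."""
    top = (1 - u) * corners['tl'] + u * corners['tr']
    bot = (1 - u) * corners['bl'] + u * corners['br']
    return (1 - v) * top + v * bot
```

The blended code is then decoded by the identity (or expression) network exactly like a code obtained from fitting.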

Latent Expression Interpolation

Here is an interactive viewer allowing for latent expression interpolation for a fixed identity. Drag the blue cursor around to linearly interpolate between four expressions. The resulting geometry and appearance are displayed on the right.

Latent Expression Coordinates
(Quadrilateral linear interpolation between 4 cornering expressions.)
Posed Geometry and Appearance.

Expression Transfers.

Tracked expression codes from the input video (left) are transferred to other identities on the right.

Method Overview

1. Given a point in posed space, we backward-warp it into canonical space.

2. Geometry is represented in canonical space using a neural SDF, which also produces a conditioning signal for the appearance network.

3. Appearance is modeled using a texture field in canonical space. Conditioning on geometry features ensures more effective gradients from an RGB loss to the geometry code during inverse rendering.

4. Our geometry and appearance networks depend on a set of discrete face anchor points. Using iterative root finding, we can numerically invert the backward deformation field to obtain anchors in posed space.

5. We perform tracking by optimizing for latent geometry, appearance and expression parameters. RGB and silhouette losses are built using deformable volumetric rendering. The landmark loss is computed by projecting posed anchors into image space.
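The iterative root finding from step 4 can be sketched as a fixed-point iteration: starting from the canonical anchor, repeatedly correct a posed-space estimate until the backward warp maps it onto the anchor. This is a simplified stand-in for the paper's procedure (the `warp` callable and the plain fixed-point update are illustrative assumptions; faster quasi-Newton updates such as Broyden's method are common here):

```python
import numpy as np

def invert_backward_warp(warp, x_canonical, n_iters=20, tol=1e-6):
    """Find a posed point x with warp(x) == x_canonical.
    warp: callable mapping a posed point (3,) to canonical space (3,);
    assumes the deformation (warp(x) - x) is small, so the
    fixed-point update below is a contraction."""
    x = np.asarray(x_canonical, dtype=float).copy()  # init at the canonical anchor
    for _ in range(n_iters):
        residual = warp(x) - x_canonical
        if np.linalg.norm(residual) < tol:
            break
        x = x - residual  # fixed-point update: x <- x - (warp(x) - target)
    return x
```

The recovered posed anchors can then be projected into image space to build a landmark loss, as described in step 5.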

Deformation Consistency

We visualize the tracked geometry (left), xy-canonical coordinates (middle), and the predicted hyper-dimensions mapped to the red and green channels (right).

Related Links

For more work on similar tasks, please check out the following papers.

NPHM and ImFace learn neural parametric models for facial geometry.

Neural parametric models including color were proposed in PhoMoH and SSIF.

For more work utilizing NPHMs see:

DiffusionAvatar utilizes NPHM as a proxy geometry to add fine-grained control to a powerful diffusion-based neural renderer.

FaceTalk generates a consistent expression sequence given an audio sequence using a diffusion model.

DPHM builds a diffusion prior for robust NPHM tracking using a depth sensor.

ClipHead uses a large vision-language model to add text control to NPHM.


@inproceedings{giebenhain2024mononphm,
 author={Simon Giebenhain and Tobias Kirschstein and Markos Georgopoulos and Martin R{\"{u}}nz and Lourdes Agapito and Matthias Nie{\ss}ner},
 title={MonoNPHM: Dynamic Head Reconstruction from Monocular Videos},
 booktitle = {Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)},
 year = {2024}}