Tracking Results of MonoNPHM.
From left to right: input, overlay, reconstructions.
We present Monocular Neural Parametric Head Models (MonoNPHM) for dynamic 3D head reconstruction from monocular RGB videos. To this end, we propose a latent appearance space that parameterizes a texture field on top of a neural parametric model. We constrain predicted color values to be correlated with the underlying geometry such that gradients from RGB effectively influence latent geometry codes during inverse rendering. To increase the representational capacity of our expression space, we augment our backward deformation field with hyper-dimensions, thus improving color and geometry representation in topologically challenging expressions. Using MonoNPHM as a learned prior, we approach the task of 3D head reconstruction using signed distance field based volumetric rendering. By numerically inverting our backward deformation field, we incorporate a landmark loss using facial anchor points that are closely tied to our canonical geometry representation. To evaluate the task of dynamic face reconstruction from monocular RGB videos, we record 20 challenging Kinect sequences under casual conditions. MonoNPHM outperforms all baselines by a significant margin and makes an important step towards easily accessible neural parametric face models through RGB tracking.
Reconstructed Geometry.
Here is an interactive viewer allowing for latent identity interpolation. Drag the blue cursor around to linearly interpolate between four different identities. The resulting geometry and appearance are displayed on the right.
Here is an interactive viewer allowing for latent expression interpolation for a fixed identity. Drag the blue cursor around to linearly interpolate between four expressions. The resulting geometry and appearance are displayed on the right.
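Conceptually, both viewers blend the four corner codes with bilinear weights before decoding. Below is a minimal PyTorch sketch of this blending; the function name bilerp_codes and the decoder calls in the usage comment are illustrative assumptions, not part of the released code.

import torch

def bilerp_codes(codes, u, v):
    # Bilinearly blend four latent codes (identity or expression codes).
    # codes: tensor of shape (4, D) holding the corner codes
    #        [top-left, top-right, bottom-left, bottom-right].
    # u, v:  cursor position in [0, 1]^2 (horizontal, vertical).
    top = (1.0 - u) * codes[0] + u * codes[1]
    bottom = (1.0 - u) * codes[2] + u * codes[3]
    return (1.0 - v) * top + v * bottom

# Hypothetical usage with an (assumed) MonoNPHM decoder:
# z = bilerp_codes(z_id_corners, u=0.3, v=0.7)
# mesh = extract_mesh(sdf_decoder, z)   # e.g. via marching cubes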
Expression Transfers.
Tracked expression codes from the input video (left) are transferred to other identities on the right.
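Mechanically, expression transfer amounts to decoding a tracked expression code together with a different identity code. A minimal sketch, assuming a hypothetical decode_frame helper that maps an (identity, expression) code pair to a posed reconstruction:

def transfer_expressions(tracked_expr_codes, target_id_code, decode_frame):
    # Re-target a tracked expression sequence onto a different identity by
    # keeping the per-frame expression codes and swapping in the new identity code.
    # decode_frame: hypothetical callable (z_id, z_ex) -> posed reconstruction.
    return [decode_frame(target_id_code, z_ex) for z_ex in tracked_expr_codes]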
1. Given a point in posed space, we backward-warp it into canonical space.
2. Geometry is represented in canonical space using a neural SDF, which also produces a condition for the appearance.
3. Appearance is modeled using a texture field in canonical space. Conditioning on geometry features ensures more effective gradients from an RGB loss to the geometry code during inverse rendering (steps 1-3 are sketched in code after this list).
4. Our geometry and appearance networks depend on a set of discrete face anchor points. Using iterative root finding, we can numerically invert the backward deformation field to obtain anchors in posed space (see the root-finding sketch below).
5. We perform tracking by optimizing for latent geometry, appearance and expression parameters. RGB and silhouette losses are computed via deformable volumetric rendering, and the landmark loss is computed by projecting posed anchors into image space (see the tracking sketch below).
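To make steps 1-3 concrete, here is a minimal PyTorch sketch of the three fields and how they are chained. All layer counts, latent sizes and the class name CanonicalHead are illustrative assumptions; the actual MonoNPHM networks are larger and include additional conditioning (e.g. on the anchor points).

import torch
import torch.nn as nn

class CanonicalHead(nn.Module):
    # Illustrative sketch of the MonoNPHM-style field chain, not the released architecture.
    def __init__(self, d_id=64, d_ex=64, d_app=64, d_hyper=2, d_feat=32, width=256):
        super().__init__()
        # Backward deformation: posed point + expression code -> canonical offset + hyper-dims.
        self.deform = nn.Sequential(
            nn.Linear(3 + d_ex, width), nn.Softplus(),
            nn.Linear(width, 3 + d_hyper))
        # Canonical SDF: (canonical point, hyper-dims, identity code) -> SDF value + geometry feature.
        self.sdf = nn.Sequential(
            nn.Linear(3 + d_hyper + d_id, width), nn.Softplus(),
            nn.Linear(width, 1 + d_feat))
        # Texture field: conditioned on the geometry feature, so RGB gradients reach the geometry code.
        self.color = nn.Sequential(
            nn.Linear(3 + d_feat + d_app, width), nn.Softplus(),
            nn.Linear(width, 3))

    def forward(self, x_posed, z_id, z_ex, z_app):
        n = x_posed.shape[0]
        # 1. Backward-warp posed points into canonical space (plus hyper-dimensions).
        d = self.deform(torch.cat([x_posed, z_ex.expand(n, -1)], dim=-1))
        x_canon, hyper = x_posed + d[:, :3], d[:, 3:]
        # 2. Evaluate canonical geometry and extract a feature for appearance.
        g = self.sdf(torch.cat([x_canon, hyper, z_id.expand(n, -1)], dim=-1))
        sdf, feat = g[:, :1], g[:, 1:]
        # 3. Evaluate the canonical texture field, conditioned on the geometry feature.
        rgb = torch.sigmoid(self.color(torch.cat([x_canon, feat, z_app.expand(n, -1)], dim=-1)))
        return sdf, rgb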
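For step 4, the backward deformation field can be inverted numerically to find the posed-space location of each canonical anchor. The sketch below uses a simple fixed-point iteration; the helper name and the particular solver are assumptions, as the method only requires some form of iterative root finding.

import torch

def invert_backward_warp(warp_fn, anchors_canonical, num_iters=20):
    # warp_fn:           callable mapping posed points (N, 3) to canonical points (N, 3).
    # anchors_canonical: target canonical anchor locations (N, 3).
    # Returns posed points x such that warp_fn(x) is (approximately) the canonical anchors.
    x = anchors_canonical.clone()          # initialize posed points at the canonical anchors
    for _ in range(num_iters):
        residual = warp_fn(x) - anchors_canonical
        x = x - residual                   # step toward the root of warp_fn(x) - anchor = 0
    return x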
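Step 5 then reduces to an inverse-rendering loop over the latent codes. The sketch below shows the optimization structure only; render_fn, project_anchors_fn and all loss weights are hypothetical placeholders rather than the paper's actual settings.

import torch

def track_frame(render_fn, project_anchors_fn, image, mask, lmk_2d,
                z_geo, z_app, z_ex, num_steps=200, lr=1e-2, w_sil=0.5, w_lmk=1e-3):
    # render_fn:          callable (z_geo, z_app, z_ex) -> predicted RGB image and silhouette.
    # project_anchors_fn: callable (z_geo, z_ex) -> predicted 2D landmark positions.
    # image, mask, lmk_2d: observed RGB frame, foreground mask, detected 2D landmarks.
    params = [z_geo, z_app, z_ex]
    for p in params:
        p.requires_grad_(True)
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(num_steps):
        opt.zero_grad()
        rgb_pred, sil_pred = render_fn(z_geo, z_app, z_ex)   # SDF-based volumetric rendering
        lmk_pred = project_anchors_fn(z_geo, z_ex)           # posed anchors projected to 2D
        loss = ((rgb_pred - image) ** 2).mean() \
             + w_sil * ((sil_pred - mask) ** 2).mean() \
             + w_lmk * ((lmk_pred - lmk_2d) ** 2).mean()
        loss.backward()
        opt.step()
    return z_geo.detach(), z_app.detach(), z_ex.detach()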
For more work on similar tasks, please check out the following papers.
NPHM and ImFace learn neural parametric models for facial geometry.
Neural parametric models including color were proposed in PhMoH and SSIF.
For more work utilizing NPHMs, see:
DiffusionAvatar utilizes NPHM as a proxy geometry to add fine-grained control to a powerful diffusion-based neural renderer.
FaceTalk generates a consistent expression sequence given an audio sequence using a diffusion model.
DPHM builds a diffusion prior for robust NPHM tracking using a depth sensor.
ClipHead uses a large vision-language model to add text control to NPHM.
@inproceedings{giebenhain2024mononphm,
  author    = {Simon Giebenhain and Tobias Kirschstein and Markos Georgopoulos and Martin R{\"{u}}nz and Lourdes Agapito and Matthias Nie{\ss}ner},
  title     = {MonoNPHM: Dynamic Head Reconstruction from Monocular Videos},
  booktitle = {Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024}
}