FlexNeRF: Photorealistic Free-viewpoint Rendering of Moving Humans from Sparse Views

CVPR 2023


1 University of Maryland, College Park    2 Amazon.com, Inc.

FlexNeRF Overview: Pose-independent temporal deformation is used in conjunction with pose-dependent motion fields (rigid and non-rigid). We choose one of the input frames as the canonical view, allowing us to use cyclic-consistency for regularization.
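As a rough illustration of how these components can compose, the PyTorch-style sketch below warps an observation-space point to the canonical configuration. The module names, the pose dimensionality (72, as in SMPL), and the conditioning choices are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class WarpToCanonical(nn.Module):
    """Illustrative sketch: maps an observation-space point x at time t to the
    canonical frame by composing a pose-dependent motion field (rigid + non-rigid)
    with a pose-independent temporal deformation. Names and dimensions are
    hypothetical, not the paper's released code."""

    def __init__(self, dim=3, pose_dim=72, hidden=128):
        super().__init__()
        # Pose-dependent non-rigid offset, conditioned on body pose.
        self.non_rigid = nn.Sequential(
            nn.Linear(dim + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))
        # Pose-independent temporal deformation, conditioned only on time.
        self.temporal = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))

    def forward(self, x, pose, t, skeletal_warp):
        # Rigid, skeleton-driven warp (e.g., blend skinning), assumed given.
        x_rigid = skeletal_warp(x, pose)
        # Pose-dependent non-rigid refinement on top of the rigid warp.
        x_pd = x_rigid + self.non_rigid(torch.cat([x_rigid, pose], dim=-1))
        # Pose-independent temporal deformation complements the motion field.
        x_canonical = x_pd + self.temporal(torch.cat([x_pd, t], dim=-1))
        return x_canonical
```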

Abstract

We present FlexNeRF, a method for photorealistic free-viewpoint rendering of humans in motion from monocular videos. Our approach works well with sparse views, which is a challenging scenario when the subject exhibits fast or complex motions.

We propose a novel approach that jointly optimizes a canonical time and pose configuration, with a pose-dependent motion field and pose-independent temporal deformations complementing each other. Thanks to our novel temporal and cyclic consistency constraints, along with additional losses on intermediate representations such as segmentation, our approach provides high-quality outputs even as the observed views become sparser.

We empirically demonstrate that our method significantly outperforms the state-of-the-art on public benchmark datasets as well as a self-captured fashion dataset.
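A minimal sketch of the cyclic-consistency idea, assuming it takes the common warp-and-return form (the exact loss in the paper may differ): points warped to the canonical configuration and back should return to where they started.

```python
import torch

def cyclic_consistency_loss(x, warp_fwd, warp_bwd):
    """Hypothetical cyclic-consistency regularizer: warp observation-space
    points x to the canonical frame and back, then penalize displacement."""
    x_canonical = warp_fwd(x)          # observation -> canonical
    x_cycled = warp_bwd(x_canonical)   # canonical -> observation
    return torch.mean((x_cycled - x) ** 2)
```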

FlexNeRF Evaluation


Comparison of performance across benchmark datasets: ∗ denotes LPIPS values reported in H-NeRF, adjusted to the same scale as our experiments. † denotes the Self-Captured Fashion (SCF) dataset. ‡ denotes the model trained with sparse (∼40) views.

Qualitative comparison of rendered novel views

Self-Captured Fashion (SCF) Dataset

ZJU-MoCap Dataset

Qualitative comparison of rendered novel views on the ZJU-MoCap dataset: Notice the higher quality of images rendered by our method on details such as faces, hands, and shirt buttons.

Analysis of Number of Views

LPIPS metric comparison on ZJU-MoCap between HumanNeRF and our method with decreasing numbers of views. To further analyze the effect of the number of views, we train multiple models using different numbers of input views. Our approach is better than HumanNeRF in all settings, with a significant reduction in the LPIPS metric as the number of views decreases.
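For reference, LPIPS can be computed with the lpips Python package; the tensor shapes below are placeholders for illustration.

```python
import torch
import lpips  # pip install lpips

# LPIPS perceptual distance (lower is better).
loss_fn = lpips.LPIPS(net='vgg')

# Images as NCHW float tensors scaled to [-1, 1]; random data used here
# as a stand-in for a rendered view and its ground-truth frame.
rendered = torch.rand(1, 3, 512, 512) * 2 - 1
ground_truth = torch.rand(1, 3, 512, 512) * 2 - 1

with torch.no_grad():
    score = loss_fn(rendered, ground_truth)
print(f"LPIPS: {score.item():.4f}")
```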

Quality of Rendered Training View

Each cell shows the original training frame (left) alongside frames rendered from the same viewpoint after training. For difficult/challenging poses, HumanNeRF (right) fails to render the correct pose of the training frame itself, unlike ours (middle), which hinders its overall learning. This highlights the pitfalls of purely pose-dependent learning.

BibTeX

@InProceedings{Jayasundara_2023_CVPR,
    author    = {Jayasundara, Vinoj and Agrawal, Amit and Heron, Nicolas and Shrivastava, Abhinav and Davis, Larry S.},
    title     = {FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans From Sparse Views},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {21118-21127}
}