TL;DR:

We investigate transferring the appearance from a source Neural Radiance Field (NeRF) to a target 3D geometry in a semantically meaningful and multiview-consistent way by leveraging semantic correspondence from ViT features.

Abstract:

A Neural Radiance Field (NeRF) encodes the specific relation of 3D geometry and appearance of a scene. We here ask whether we can transfer the appearance from a source NeRF onto a target 3D geometry in a semantically meaningful way, such that the resulting new NeRF retains the target geometry but has an appearance that is an analogy to the source NeRF. To this end, we generalize classic image analogies from 2D images to NeRFs. We leverage correspondence transfer along semantic affinity, driven by semantic features from large, pre-trained 2D image models, to achieve multi-view consistent appearance transfer. Our method allows exploring the mix-and-match product space of 3D geometry and appearance. We show that our method outperforms traditional stylization-based methods and that a large majority of users prefer it over several typical baselines.

Method Overview:
To create a NeRF Analogy, we perform the following three steps:
  1. We first render a set of images from both the source and target NeRFs. We then precompute DiNO-ViT features on these images, which serve as dense, semantic image descriptors (see the feature-extraction sketch after this list).
  2. In the second step, we sample random pixels (with their positions and normals) on the target and compute their correspondence to the source images by comparing the cosine similarity of their DiNO features. We then query the source appearance at the point of maximum similarity, allowing us to create training tuples that combine source and target information (see the correspondence sketch below).
  3. Finally, we transfer the appearance from the source to the target by optimizing the target NeRF's appearance to match the source appearance in the training tuples, and refine the results with an additional edge loss (see the loss sketch below).
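
As a rough illustration of step 1, the sketch below extracts dense per-patch features from a rendered image with a pre-trained DiNO ViT. The specific backbone (ViT-S/8 from torch.hub), the use of the last transformer block, and the input normalization are assumptions made for this sketch, not necessarily the exact configuration used in the paper.

import torch

# Pre-trained DiNO ViT-S/8 from torch.hub (backbone choice is an assumption).
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
dino.eval()

@torch.no_grad()
def dense_dino_features(image):
    """image: (1, 3, H, W), ImageNet-normalized, with H and W divisible by 8.
    Returns a (1, H/8, W/8, C) grid of per-patch descriptors."""
    tokens = dino.get_intermediate_layers(image, n=1)[0]  # (1, 1 + P, C), last block
    patch_tokens = tokens[:, 1:, :]                       # drop the [CLS] token
    h, w = image.shape[-2] // 8, image.shape[-1] // 8     # 8x8 patch stride for ViT-S/8
    return patch_tokens.reshape(1, h, w, -1)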
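Step 2 is essentially a nearest-neighbor lookup in feature space. A minimal sketch, assuming the source and target features and the source colors have already been flattened into per-pixel tensors (all names below are hypothetical):

import torch
import torch.nn.functional as F

def build_training_tuples(tgt_feats, tgt_positions, tgt_normals, src_feats, src_rgb):
    """Match every sampled target pixel to its most similar source pixel via
    cosine similarity of DiNO features, then gather the source appearance there.

    tgt_feats:     (N, C)  features of N sampled target pixels
    tgt_positions: (N, 3)  corresponding 3D positions on the target geometry
    tgt_normals:   (N, 3)  corresponding surface normals
    src_feats:     (M, C)  features of M source pixels (all source views, flattened)
    src_rgb:       (M, 3)  source colors at those pixels
    """
    # Cosine similarity = dot product of L2-normalized features.
    tgt_n = F.normalize(tgt_feats, dim=-1)
    src_n = F.normalize(src_feats, dim=-1)
    sim = tgt_n @ src_n.T                      # (N, M) similarity matrix

    best_sim, best_idx = sim.max(dim=-1)       # best source match per target pixel
    matched_rgb = src_rgb[best_idx]            # query source appearance at the maximum

    # Training tuples: target geometry (position, normal) plus transferred source color.
    return tgt_positions, tgt_normals, matched_rgb, best_sim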
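For step 3, the target NeRF's appearance is optimized against these training tuples. The sketch below shows a plausible photometric term plus a simple finite-difference edge term on rendered views; the exact form of the paper's edge loss is not reproduced here, so treat this version as an assumption.

import torch.nn.functional as F

def photometric_loss(pred_rgb, matched_rgb):
    # The re-optimized appearance should reproduce the matched source colors.
    return F.l1_loss(pred_rgb, matched_rgb)

def edge_loss(rendered, reference):
    """Hypothetical edge term: compare horizontal/vertical finite differences of a
    rendered view against a reference rendering of the target; shapes (B, 3, H, W)."""
    def grads(img):
        return img[..., :, 1:] - img[..., :, :-1], img[..., 1:, :] - img[..., :-1, :]
    (dx_r, dy_r), (dx_t, dy_t) = grads(rendered), grads(reference)
    return F.l1_loss(dx_r, dx_t) + F.l1_loss(dy_r, dy_t)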

Results:

In the video captions, DIA is short for Deep Image Analogies, ST for Style Transfer, WCT for the Whitening and Coloring Transform introduced in "Universal Style Transfer via Feature Transforms", SNeRF for the method by Nguyen-Phuoc et al., and Ours for our NeRF Analogy.

[Result videos: six examples, each showing the Target (Shape), Ours, and the Source (Appearance).]


Citation
If you find our work useful and use parts or ideas of our paper or code, please cite:

@inproceedings{fischer2024nerf,
  title={NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs},
  author={Fischer, Michael and Li, Zhengqin and Nguyen-Phuoc, Thu and Bozic, Aljaz and Dong, Zhao and Marshall, Carl and Ritschel, Tobias},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={4640--4650},
  year={2024}
}

Acknowledgements
This research was conducted during an internship at Meta Reality Labs Research. We extend our gratitude to the anonymous reviewers for their insightful feedback and to Meta Reality Labs for their continuous support. Additionally, we are thankful for the support provided by the Rabin Ezra Scholarship Trust.