NeRF-LipSync: A Diffusion Model for Speech-Driven and View-Consistent Lip Synchronization in Digital Avatars
Keywords: Lip Synchronization, Diffusion Models, NeRF-based Rendering, Audio-Conditioned Generation, Temporal Consistency
Abstract. Achieving natural, accurate, and identity-preserving lip synchronization in talking avatars is a fundamental problem in audio-visual synthesis. Existing methods often struggle to generalize across speakers, maintain temporal smoothness, or preserve view consistency due to architectural limitations. In this paper, we present NeRF-LipSync, a novel generative framework that synthesizes lip movements conditioned on speech audio while maintaining temporal coherence and view-consistent appearance through a combination of diffusion-based modeling and NeRF-based spatial alignment. Our model incorporates temporal attention and leverages rich audio-visual embeddings to produce expressive, speaker-specific articulation. We evaluate NeRF-LipSync on the VoxCeleb2 and LRW datasets and compare it against strong baselines, including Wav2Lip, PC-AVS, and Diff2Lip. On VoxCeleb2, our method achieves an FID of 2.75, SSIM of 0.56, PSNR of 18.32, and LMD of 3.01, with synchronization accuracy (Sync-C) reaching 9.06. On LRW, it yields an FID of 2.40, SSIM of 0.71, PSNR of 21.03, and LMD of 2.16. These results demonstrate the strong generalization ability and perceptual realism of our approach. Ablation studies highlight the contribution of NeRF alignment to identity consistency, of diffusion to visual expressiveness, and of temporal attention to motion stability. NeRF-LipSync thus offers a robust, scalable solution for high-quality, speech-driven avatar animation.
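To make the conditioning scheme described in the abstract concrete, the following is a minimal, illustrative sketch of how a diffusion denoiser block can attend to audio embeddings and apply temporal attention across frames. It is not the paper's implementation: the module name, feature dimensions, and block layout are assumptions chosen for clarity, using standard PyTorch components.

```python
import torch
import torch.nn as nn


class AudioConditionedTemporalBlock(nn.Module):
    """Hypothetical denoiser block: cross-attends per-frame latents to audio
    features (speech conditioning), then self-attends along the time axis
    (temporal coherence). Shapes and layout are illustrative only."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, frames: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) noisy latent per video frame
        # audio:  (batch, num_audio_tokens, dim) speech embedding sequence
        h, _ = self.cross_attn(self.norm1(frames), audio, audio)  # audio conditioning
        frames = frames + h
        q = self.norm2(frames)
        h, _ = self.temporal_attn(q, q, q)  # attention across frames for motion stability
        frames = frames + h
        return frames + self.mlp(frames)


if __name__ == "__main__":
    block = AudioConditionedTemporalBlock()
    noisy_frames = torch.randn(2, 16, 256)  # 16 video frames per clip (assumed)
    audio_tokens = torch.randn(2, 64, 256)  # projected speech features (assumed)
    print(block(noisy_frames, audio_tokens).shape)  # torch.Size([2, 16, 256])
```

In this sketch, cross-attention injects the speech signal into each frame's latent while the temporal self-attention couples neighboring frames, which is one plausible way the audio conditioning and temporal-attention components summarized above could interact; NeRF-based spatial alignment and the diffusion noise schedule are omitted.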