NeRF FOR HERITAGE 3D RECONSTRUCTION

ABSTRACT: Conventional and learning-based methods for 3D reconstruction from images have clearly shown their potential for 3D heritage documentation. Nevertheless, Neural Radiance Field (NeRF) approaches have recently been revolutionising the way a scene can be rendered or reconstructed in 3D from a set of oriented images. This paper therefore reviews some of the latest NeRF methods applied to various cultural heritage datasets collected with smartphone videos, touristic image collections or reflex cameras. Firstly, several NeRF methods are evaluated. Instant-NGP and Nerfacto achieved the best outcomes, outperforming all other methods significantly. Subsequently, qualitative and quantitative analyses are performed on various datasets, revealing the good performance of NeRF methods, in particular for areas with uniform texture or shiny surfaces, as well as for small datasets of lost artefacts. This opens new frontiers for the 3D documentation, visualization and communication of digital heritage.

Figure 1. The NeRF method optimizes a continuous 5D neural radiance field representation of a scene starting from a set of oriented images: some of the input images (a), recovered camera poses and sparse point cloud (b), and a rendered 3D view from the NeRF representation (c).

INTRODUCTION
The 3D reconstruction and digital documentation of cultural heritage artefacts and scenes is an important task to valorize, study and safeguard, at least digitally, our patrimony. The improvements in and efficiency of mass digitisation campaigns of cultural heritage have been driven mainly by the growing need for their preservation as well as by the indubitable opportunities offered by digital 3D technologies, artificial intelligence (AI) methods and extended reality (XR) solutions for conservation, communication and virtual access purposes (Kniaz et al., 2019; Teruggi et al., 2021; Verhoeven et al., 2022). Nowadays, active and passive sensors, through static or mobile scanning and photogrammetric methods, provide reliable, fast and accurate 3D results (Di Stefano et al., 2021), often enriched with semantic information for further understanding and communication purposes (Grilli and Remondino, 2019; Mazzacca et al., 2022). The photogrammetric pipeline starts from the acquisition phase, which is essential for retrieving high-quality images. Then, most of the processing steps are presently performed with automated structure from motion (SfM) approaches and multi-view stereo (MVS) algorithms (Zhou et al., 2020; Wang et al., 2021a,b). A recent innovative approach for 3D scene reconstruction is offered by Neural Radiance Fields (NeRF - Figure 1). NeRF synthesizes novel views of complex scenes, starting from a set of oriented input images and optimizing an underlying continuous volumetric scene function (Mildenhall et al., 2020; Mueller et al., 2022). A neural radiance field is a simple fully connected network (weights of a few MB) trained to reproduce input views of a single scene using a rendering loss. The network directly maps from spatial location and viewing direction (5D input) to colour and opacity (4D output).
The aim of the paper is to shed light on emerging NeRF approaches for heritage 3D reconstruction in order to effectively use and optimize neural radiance fields to render novel photorealistic views of heritage scenes for 3D documentation, visualization and communication purposes.

RELATED WORKS
The recovery of 3D information from images is a long-standing problem, solved for many years with conventional geometric-based approaches (Strecha et al., 2006; Goesele et al., 2007; Remondino et al., 2008, 2014; Hirschmuller, 2008; Barnes et al., 2009; Furukawa and Ponce, 2010; Jancosek and Pajdla, 2011; Bleyer et al., 2011; Rothermel et al., 2012; Schoenberger et al., 2016). Recently, learning-based 3D reconstruction methods based on point-, voxel-, mesh- or implicit (and differentiable) representations have shown impressive results (Choy et al., 2016; Riegler et al., 2017; Chen and Zhang, 2019; Groueix et al., 2019; Wang et al., 2019; Yu and Gao, 2020), even from single images (Richter and Roth, 2018; Kniaz et al., 2019; Bath et al., 2023). Learning-based algorithms (CNN, GAN, etc.) try to infer a depth map from the set of input images, in a stereo or multi-view manner, with supervised or unsupervised approaches. Contrary to conventional methods, which rely on handcrafted features (e.g., photometric consistency) in their cost functions, they reformulate the problem by also leveraging semantic cues of the scene and learning more complex feature representations. Most methods require supervision and ground truth models, which are often hard to obtain for real-world heritage contexts, or are based on synthetic data. Therefore, differentiable volumetric rendering (DVR) for implicit representations gained popularity, as it can train reconstruction models from 2D images and learn implicit 3D shapes and textures (Liu et al., 2019; Niemeyer et al., 2020). Implicit representations describe shape and texture continuously and do not suffer, like voxel- and mesh-based representations, from discretization or low resolution. One of the most recent trends is based on neural scene representation (NeRF), which has gained popularity due to its expressiveness, speed of computation and, generally, low memory need.
Starting from the significant advance in the use of the attention mechanism (Vaswani et al., 2017), Mildenhall et al. (2020) introduced a method able to represent a scene using a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron, MLP). The input for the neural network is a single continuous 5D coordinate, i.e. a spatial location (x, y, z) and a viewing direction (θ, φ), whereas the output is the volume density (σ) and the view-dependent emitted radiance (RGB) in each direction and at each location (Figure 2). Starting from the recovered camera poses, the method is able to synthesize novel views by querying 5D coordinates along camera rays, and it uses classic volume rendering techniques to project the output colours and densities into an image. Further improvements to increase the performance of NeRF methods tackled the reduction of training time (Mueller et al., 2022), dynamic view synthesis (Pumarola et al., 2021), limiting the number of required input images (Yu et al., 2021; Niemeyer et al., 2022; Zhu et al., 2022), artefact reduction, integration of depth supervision with sparse point clouds (Deng et al., 2022), knowledge incorporation such as Manhattan world priors (Guo et al., 2022) or monocular geometric cues (Yu et al., 2022a), upscaling to Street View (Rematas et al., 2022), large-scale (Yuanbo et al., 2022; Zhang et al., 2022) and satellite (Roger et al., 2022) images, etc.
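The volume rendering step mentioned above can be illustrated with a minimal sketch of the standard quadrature used to composite samples along a single camera ray. The function name composite_ray and the handling of the last sample interval are illustrative assumptions, not taken from any of the evaluated implementations; in a real pipeline the densities and colours would come from the trained network.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Composite per-sample densities and colours along one ray.

    sigmas: (N,) volume densities; colors: (N, 3) RGB values;
    t_vals: (N,) increasing sample depths along the ray.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    t_vals = np.asarray(t_vals, dtype=float)
    # Distance between adjacent samples; the last interval is treated as very large.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity contributed by each ray segment.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light that reaches sample i without being absorbed.
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

A ray crossing only empty space (all densities zero) returns black with zero weights, whereas a ray hitting a very dense first sample returns the colour of that sample only.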

SDFstudio
The performances of these methods were evaluated, among others (Karami et al., 2023), on the Ignatius dataset (Knapitsch et al., 2017), which contains 265 sequential images (extracted from a video at 1920x1080 px resolution). The comparison results with respect to ground truth data are reported in Figure 3. They show that Instant-NGP and Nerfacto achieved the best outcomes, with an error of approximately 1 cm and 1.5 cm, respectively, outperforming all other methods. Instant-NGP uses multi-resolution hash encoding to reconstruct implicit surfaces. It is a practical and efficient learning-based approach that automatically identifies relevant details and is built upon the Tiny-CUDA-nn framework, a self-contained framework designed specifically for training and querying neural networks. By leveraging these techniques, Instant-NGP can achieve high-quality results while maintaining a fast computation time.
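To give an intuition of the multi-resolution hash encoding, the following sketch computes the grid resolution of each level and the hash-table slots of the eight voxel corners surrounding a 3D point, following the formulation published with Instant-NGP. The parameter values (n_min, n_max, n_levels, table_size) are illustrative assumptions, and the sketch omits the learned feature vectors, their trilinear interpolation and the direct (non-hashed) indexing used for coarse levels.

```python
import math

# Per-dimension primes of the spatial hash, as reported in the Instant-NGP paper.
PRIMES = (1, 2654435761, 805459861)

def level_resolutions(n_min=16, n_max=2048, n_levels=16):
    """Grid resolutions growing geometrically from n_min to about n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (n_levels - 1))
    return [int(math.floor(n_min * growth ** level)) for level in range(n_levels)]

def hash_index(ix, iy, iz, table_size=2**19):
    """Map integer grid coordinates to a slot of the hash table."""
    h = 0
    for coord, prime in zip((ix, iy, iz), PRIMES):
        h ^= (coord * prime) & 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic
    return h % table_size

def corner_indices(point, resolution, table_size=2**19):
    """Hash-table slots of the 8 voxel corners around a point in [0, 1]^3."""
    base = [int(math.floor(c * resolution)) for c in point]
    return [
        hash_index(base[0] + dx, base[1] + dy, base[2] + dz, table_size)
        for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)
    ]
```

Because fine levels have far more grid vertices than table slots, different vertices may collide in the same slot; the training is expected to resolve such collisions in favour of the regions that matter most for the rendering loss.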

EXPERIMENTS
Following the outcomes presented in Section 3, Instant-NGP (from NVlabs) and Nerfacto (from Nerfstudio) are used to perform various experiments on some heritage datasets featuring different characteristics: availability of ground truth (GT) data (Section 4.1), presence of textureless/uniform (Section 4.2) or reflective (Section 4.3) surfaces, and a touristic repository of lost heritage (Section 4.4). For each dataset, the required camera poses are derived using COLMAP or Agisoft Metashape, and ad-hoc converters are used to import the camera parameters into the NeRF framework. After training and rendering, a point cloud is generated and exported for analysis and visualization (Figure 4). All experiments were performed on an Alienware Aurora R12 with an 11th Gen Intel® Core™ i7-11700KF 3.60 GHz processor, 32 GB of RAM and an NVIDIA GeForce RTX 3080 (10 GB of VRAM).
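As an illustration of what such a pose converter has to do, the sketch below turns a COLMAP pose (world-to-camera rotation quaternion and translation) into the camera-to-world matrix with the OpenGL-style axis convention commonly expected by NeRF frameworks. It is a minimal sketch and not the converters used in this work: the function names are hypothetical, and the scene recentring, rescaling and world-axis reordering that some converters additionally apply are left out.

```python
import numpy as np

def quat_to_rotmat(qw, qx, qy, qz):
    """Rotation matrix from a quaternion given in COLMAP order (w, x, y, z)."""
    q = np.array([qw, qx, qy, qz], dtype=float)
    qw, qx, qy, qz = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ])

def colmap_to_nerf_pose(qvec, tvec):
    """Convert a COLMAP world-to-camera pose into a 4x4 camera-to-world matrix."""
    w2c = np.eye(4)
    w2c[:3, :3] = quat_to_rotmat(*qvec)
    w2c[:3, 3] = tvec
    c2w = np.linalg.inv(w2c)
    # COLMAP cameras look along +z with y pointing down; flip the camera
    # y and z axes to obtain the OpenGL-style convention (y up, looking along -z).
    c2w[:3, 1:3] *= -1
    return c2w
```

The resulting matrices are typically written, together with the camera intrinsics, to the per-frame entries of the transforms file read by the NeRF framework.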

Quantitative analysis
Geometric evaluations of NeRF-based 3D results with respect to reference ground truth (GT) data or conventional Multi-View Stereo (MVS) pipelines are reported hereafter. The evaluation was performed by calculating the signed distances between the NeRF meshes and the reference one.
The first dataset consists of a smartphone video sequence (images at 960x540 px) acquired around a Mausoleum in Trento (Italy). The monument has a diameter of ca 25 m and a height of ca 15 m (without the basement). The acquisitions were performed below the main basement, at ca 10 m distance from the object, producing occlusions. Around 200 frames were extracted to create 3D results with MVS (COLMAP) and NeRF (Instant-NGP). The geometric comparison with the available Terrestrial Laser Scanner (TLS) data revealed a standard deviation of ca 7.4 cm for the photogrammetric approach and ca 15 cm for the NeRF one (Figure 4, Table 2).
The second dataset consists of a smartphone video (3840x2160 px) of the remains of two arches of a structure situated in the archaeological site of Pafos (Cyprus). Approximately 180 frames, centred around a corner of the structure and taken at a distance of roughly 10 m while keeping the camera parallel to the archaeological remains, were extracted to apply a NeRF (Nerfacto) and an MVS 3D reconstruction. The reference 3D data (GT) are provided by a photogrammetric dense point cloud derived from a set of images acquired with a Nikon D3X (6048x4032 px). The geometric comparisons (Figure 5, Table 2) indicate a similar standard deviation (less than 5 cm), although the Nerfacto output presents significantly more noise and loss of detail along the object's surfaces.
Table 2. Geometric comparisons for the Mausoleum (Figure 4) and Pafos (Figure 5) datasets and processing time (3000 epochs for the NeRF approaches).
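The kind of signed-distance statistics used in this evaluation can be sketched as follows. This is a simplified stand-in, not the procedure used for the reported figures: the reference surface is assumed to be available as points with unit normals, each evaluated point is compared with its nearest reference point only, and the sign is taken along the reference normal. The function names are hypothetical, and the brute-force neighbour search is only suitable for small clouds.

```python
import numpy as np

def signed_distances(eval_pts, ref_pts, ref_normals):
    """Signed point-to-plane distance of each evaluated point to the reference.

    For every evaluated point, the nearest reference point is found and the
    offset is projected on the (unit) normal of that reference point.
    """
    eval_pts = np.asarray(eval_pts, dtype=float)
    ref_pts = np.asarray(ref_pts, dtype=float)
    ref_normals = np.asarray(ref_normals, dtype=float)
    # Brute-force nearest neighbour: O(M*N) memory, acceptable for a sketch.
    diff = eval_pts[:, None, :] - ref_pts[None, :, :]
    nearest = np.argmin((diff ** 2).sum(axis=2), axis=1)
    offset = eval_pts - ref_pts[nearest]
    return np.einsum("ij,ij->i", offset, ref_normals[nearest])

def summarize(distances):
    """Mean, standard deviation and RMSE of a set of signed distances."""
    d = np.asarray(distances, dtype=float)
    return {
        "mean": float(d.mean()),
        "std": float(d.std()),
        "rmse": float(np.sqrt((d ** 2).mean())),
    }
```

Positive values indicate points lying in front of the reference surface (along its normal), negative values points lying behind it.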

[Figure panel labels: Photogrammetric model | NeRF model]
The Pafos dataset was also used to calculate accuracy and completeness (often referred to as precision and recall, respectively) following the approaches of Knapitsch et al. (2017) and Nocerino et al. (2020). The two metrics were computed with respect to the photogrammetric (Nikon) 3D model. Figure 5g shows that the video-based photogrammetric reconstruction is more accurate, whereas the NeRF 3D model has a higher completeness.
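The two metrics can be sketched for point clouds as follows, in the spirit of Knapitsch et al. (2017): accuracy (precision) is the fraction of reconstructed points lying within a distance threshold tau of the reference, and completeness (recall) is the fraction of reference points lying within tau of the reconstruction. The function names are hypothetical, the threshold is an input chosen by the user, and the brute-force distance computation is an assumption made to keep the example self-contained.

```python
import numpy as np

def nn_dist(src, dst):
    """Distance from each point of src to its nearest neighbour in dst."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def accuracy_completeness(rec, ref, tau):
    """Precision, recall and F-score of a reconstruction at threshold tau."""
    rec = np.asarray(rec, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Accuracy: how many reconstructed points are close to the reference.
    precision = float((nn_dist(rec, ref) < tau).mean())
    # Completeness: how much of the reference is covered by the reconstruction.
    recall = float((nn_dist(ref, rec) < tau).mean())
    total = precision + recall
    f_score = 0.0 if total == 0 else 2 * precision * recall / total
    return precision, recall, f_score
```

A noisy but dense reconstruction tends to score a lower accuracy and a higher completeness, which is the pattern reported for the NeRF model on the Pafos dataset.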

Textureless surfaces
Conventional SfM and MVS methods normally encounter problems when reconstructing surfaces with uniform colours or textureless areas. A dataset of 20 high-resolution images (6048x4032 px) of some buildings in Trento's Duomo square (Italy) was acquired with a Nikon D3X. The images were captured at ground level, at varying distances from the building facades, which have evenly painted plaster. The MVS processing was done in Metashape whereas Nerfacto was used for the NeRF 3D reconstruction (Figure 6). The 3D result generated with NeRF seems to be more complete, with higher density and a more consistent point distribution in the challenging areas. As no real ground truth data were available, the completeness is computed with an approach for planar-like surfaces built upon Knapitsch et al. (2017). First, both point clouds are cropped to the common area of interest. A reference plane is determined by fitting a plane to a downsampled photogrammetric point cloud using a least-squares approach. Both point clouds are then projected onto this plane and their 3D coordinates are reduced to 2D in a new coordinate frame defined by the plane and the projections of the original Y and Z axes on it. In this new reference frame, a ground truth polygon of the complete façade is defined by constructing a concave hull of all evaluated point clouds. To evaluate the completeness, a buffer is calculated around each point of the evaluated point clouds at a series of distance thresholds τ. The resulting polygons are merged into a single geometry for each τ and cropped to a common extent within the ground truth polygon. The completeness function C(τ) is then defined as the ratio between the area of the polygon obtained for a certain τ and the total area of the reference façade polygon. The results show NeRF outperforming photogrammetry at a 1 cm distance threshold by 10 percentage points (pp) of the completeness metric (Figure 6c).
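The planar completeness procedure can be approximated with the short sketch below. It deliberately simplifies the description above and is not the implementation used for Figure 6c: the in-plane axes come from the principal directions of the fitted plane instead of the projected Y and Z axes, the reference area is an axis-aligned rectangle instead of a concave hull, and the merged buffer polygons are replaced by a raster test (a grid cell counts as covered if its centre lies within τ of any projected point). All function names and the cell size are assumptions.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane: returns the centroid and two in-plane unit axes."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # Rows of vt are the principal directions; the last one is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[0], vt[1]

def to_plane_2d(points, centroid, u, v):
    """Project 3D points on the plane and express them in the (u, v) frame."""
    centred = np.asarray(points, dtype=float) - centroid
    return np.stack([centred @ u, centred @ v], axis=1)

def completeness(pts2d, bounds, tau, cell):
    """Fraction of the reference rectangle lying within tau of any point."""
    pts2d = np.asarray(pts2d, dtype=float)
    xmin, ymin, xmax, ymax = bounds
    xs = np.arange(xmin + cell / 2, xmax, cell)
    ys = np.arange(ymin + cell / 2, ymax, cell)
    gx, gy = np.meshgrid(xs, ys)
    centres = np.stack([gx.ravel(), gy.ravel()], axis=1)
    # Squared distance from every cell centre to every projected point.
    d2 = ((centres[:, None, :] - pts2d[None, :, :]) ** 2).sum(axis=2)
    return float((d2.min(axis=1) <= tau ** 2).mean())
```

Evaluating completeness for a series of τ values on both the photogrammetric and the NeRF point clouds, with the same plane and the same reference area, gives two C(τ) curves that can be compared directly.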

Reflective surfaces
Conventional SfM and MVS methods face problems when reflective, shiny or transparent surfaces have to be digitized. A dataset of about 60 high-resolution images (6048x4032 px) was captured with a NIKON D750 camera at the MAG museum in Riva del Garda (Italy). The object (Figure 7a) is a small bronze statue featuring reflective surfaces and a transparent base. The images were acquired by rotating the object and capturing images from both parallel and oblique points of view (Figure 7b). [...] performed better. Nerfacto performed best in reconstructing the transparent pedestal and back-support. This highlights the potential of (some) NeRF methods in reconstructing transparent surfaces in a variety of contexts. The computational time was ca 8 min for MVS, ca 3 min for Nerfacto and ca 40 sec for Instant-NGP (3000 epochs).

Unconstrained touristic images (Photo-tourism)
Photogrammetry has often been used to reconstruct lost heritage objects or monuments by using tourist or archival photos (Gruen et al., 2004). The potential of NeRF methods was tested on a set of ca 30 unordered touristic images taken from the online repository REKREI (Vincent et al., 2015, 2016), focused on the Temple of Baalshamin in Palmyra, a monument destroyed in 2015 (Figure 9a). The dataset contains images of varying resolutions and distances, most focusing on the temple's frontal part. For the processing, COLMAP was applied as the conventional MVS approach. On the other hand, the Photo-tourism implementation in Nerfstudio, probably similar to NeRF-W (Martin-Brualla et al., 2021), was chosen as it was developed to handle unconstrained image collections "in the wild" and different camera models. The processing with COLMAP took ca 30 min whereas the NeRF approach needed ca 2 min. Due to the limited number of images from the sides and back, both approaches failed to reconstruct those parts. As shown in Figure 9d, the COLMAP dense point cloud has lower density and completeness than the NeRF results (Figure 9e) in a few areas, such as the columns' bases and the inner part of the façade. This is possibly due to the large baselines or inconsistencies among the images, caused by differences in acquisition conditions as well as by the front columns casting ever-changing shadows on the inner façade. However, the NeRF dense point cloud is noisier than the dense point cloud derived with COLMAP.
Figure 9. Some of the REKREI images utilised for the 3D reconstruction of the Palmyra temple (a) and the recovered camera network (b). NeRF-W 3D view (c) and visual comparison of the photogrammetric (d) and NeRF (e) 3D results.

CONCLUSIONS AND FUTURE WORKS
The work presented an investigation of NeRF methods for heritage 3D reconstruction. Qualitative and quantitative results showed the capability of neural radiance fields to derive quite accurate 3D models from a set of images. Textureless, transparent and reflective surfaces were considered, as well as low- and high-resolution images acquired with smartphones or reflex cameras. Instant-NGP and Nerfacto were primarily utilised as they showed the best performance on a typical historical monument. Additionally, the NeRF-W method was employed to process an unstructured collection of touristic images representing a heritage site that has been destroyed. The quantitative analyses indicate a level of accuracy comparable to the dense point clouds generated through conventional MVS methods, with COLMAP having a slightly better accuracy although requiring more processing time. Moreover, NeRF methods appear to perform better in scenarios where conventional MVS techniques usually struggle. Even if more tests are needed, their performance on textureless surfaces and transparent objects seems very promising. In terms of processing time, the NeRF approach is generally faster than an MVS approach. This article serves as an initial evaluation of NeRF capabilities in producing cultural heritage 3D contents. In the next phase of our research, we will narrow our focus to specific tasks to obtain a more comprehensive understanding of the behaviour and true potential of various NeRF methods in the cultural heritage domain.
In particular, we will:
• Investigate the impact of image quality and quantity on the accuracy and completeness of NeRF-based 3D reconstructions of cultural heritage objects;
• Perform an extended assessment of NeRF capabilities to accurately reconstruct reflective and transparent surfaces;
• Evaluate reliable approaches to remove background which is not part of the area/object we want to digitally reconstruct;
• Explore NeRF's potential in accurately reconstructing cultural heritage objects from tourist datasets with unconstrained acquisition conditions, focusing in particular on lost monuments;
• Finalize the NeRFBK dataset (https://github.com/3DOM-FBK/NeRFBK) for benchmarking NeRF methods in various contexts and scenarios (heritage, industry, urban, etc.).

ACKNOWLEDGMENTS
The authors are thankful to Matthew Vincent for supporting the data collection within the REKREI database.