NEURAL RADIANCE FIELDS (NERF): REVIEW AND POTENTIAL APPLICATIONS TO DIGITAL CULTURAL HERITAGE

Neural Radiance Fields (NeRF or NeRFs) are emerging as a new method for synthesizing novel views of complex 3D scenes, leveraging an artificial neural network to optimize a volumetric scene function from a set of input views. We conduct a preliminary critical review of the scientific and technical literature on NeRFs, and we highlight their possible applications in the Cultural Heritage domain for the image-based reconstruction of 3D models of real, multi-scale objects, also in combination with the more well-established photogrammetric techniques. A comparison is made between NeRFs and photogrammetry in terms of operating procedures and outputs (volumetric renderings vs. point clouds or meshes). It is demonstrated that NeRFs could be conveniently used for rendering objects (sculptures, archaeological remains, sites, paintings, etc.) that are challenging for photogrammetry, typically: i) metallic, translucent, and/or transparent surfaces; ii) objects that present homogeneous textures; iii) occlusions, vegetation, and elements of very fine detail.


INTRODUCTION
Neural Radiance Fields (NeRF or NeRFs) are a type of deep learning model that synthesizes novel views of a scene from a given set of multi-view images (Figure 1). The model was first presented by Mildenhall et al. (2020) (matthewtancik.com/nerf). NeRF uses an artificial neural network that takes as input a single continuous 5D coordinate (spatial location $(x, y, z)$ and viewing direction $(\theta, \phi)$) and outputs the volume density and the view-dependent emitted radiance (i.e., the amount of light emitted or reflected by a surface) at that spatial location.
A key feature of NeRF is the ability to render high-quality, photorealistic novel views with fine details and smooth transitions between different regions.

A multi-scale architecture allows the model to learn and generate features at multiple scales simultaneously. NeRFs can also handle occlusions and transparent objects, which makes them well-suited for tasks such as view synthesis and image-based rendering. Since the large-scale implementation by Müller et al. (2022), NeRFs have gained much attention in the Computer Vision field: the original article by Mildenhall et al. has received more than 2500 citations, and NeRFs have found interesting applications in various fields, including robotics (Adamkiewicz et al., 2022), industrial design (Mergy et al., 2021), autonomous navigation, medicine (Corona-Figueroa et al., 2022), 3D facial recognition (Guo et al., 2021) and human pose estimation (Su et al., 2021; Wang et al., 2022). Despite the ever-growing interest in the topic, the applications of NeRFs in Cultural Heritage are to date underdeveloped and understudied. Moreover, the benefits of using NeRFs compared to more established image-based reconstruction techniques such as photogrammetry are not yet adequately known or recognized. To tackle this issue, this paper reviews existing publications related to NeRF with the specific purpose of analysing the implications of such techniques for state-of-the-art methods for multi-scale 3D modelling and graphics in the digital heritage domain.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-M-2-2023 29th CIPA Symposium "Documenting, Understanding, Preserving Cultural Heritage: Humanities and Digital Technologies for Shaping the Future", 25-30 June 2023, Florence, Italy

VIEW SYNTHESIS AND NEURAL RENDERING
NeRFs were introduced by Mildenhall et al. (2020): starting from a set of images with known camera poses, views are synthesized by querying 5D coordinates along camera rays, and classic volume rendering techniques are then used to project the output colours and densities into an image. In its basic form, a NeRF model represents a static 3D scene as a continuous 5D function, expressed as:

$$F_\Theta(\mathbf{x}, \theta, \phi) \rightarrow (\mathbf{c}, \sigma) \qquad (1)$$

where $\mathbf{x} = (x, y, z)$ are the in-scene coordinates, the direction is expressed as the 3D Cartesian unit vector $\mathbf{d}$, $\mathbf{c} = (r, g, b)$ represents the colour values and $\sigma$ stands for the volume density. $(\theta, \phi)$ represent the azimuthal and polar viewing angles (viewing direction).
$F_\Theta$ is a Multi-Layer Perceptron (MLP), a feedforward artificial neural network, that outputs the colour information $\mathbf{c}$ and the volume density $\sigma$ (Figure 2). Although $\sigma$ is independent of the viewing direction, the colour $\mathbf{c}$ depends on both the viewing direction and the in-scene coordinate. The models are trained per-scene, and the COLMAP pipeline is used to extract camera parameters and camera poses from the input image set. For each pixel in the image being synthesized, camera rays are marched through the scene and a set of sampling points is generated along each ray. For each sampling point, the known viewing direction and sampling location are used to extract local colour and density through the MLP. 3D reconstruction and novel view synthesis are hence executed via volumetric rendering (Gao et al., 2022). In detail, given the volume density and colour functions, volume rendering requires estimating the expected colour $C(\mathbf{r})$ of any camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, with camera position $\mathbf{o}$ and viewing direction $\mathbf{d}$. Looking at $T(t)$ as the probability that the ray travels from $t_1$ to $t$ without hitting any other particle, $C(\mathbf{r})$ is expressed by the equation:

$$C(\mathbf{r}) = \int_{t_1}^{t_2} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt, \qquad T(t) = \exp\left(-\int_{t_1}^{t} \sigma(\mathbf{r}(u))\, du\right) \qquad (2)$$

Considering a non-deterministic stratified sampling approach, where the ray is divided into $N$ equally spaced bins and a sample is uniformly drawn from each bin, equation (2) can be approximated as:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} \alpha_i T_i \mathbf{c}_i, \qquad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \qquad (3)$$

where $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ is the opacity at sample point $i$, $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples, and $\sigma_i$ and $\mathbf{c}_i$ are the density and colour evaluated at sample point $i$.
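As an illustration, the stratified sampling and the discrete approximation of equation (2) can be sketched in a few lines of NumPy. This is a minimal sketch and not the implementation of any specific framework: the function names, the toy density field and the choice of closing the last interval at the far bound are assumptions made here for illustration. Each sample contributes its colour weighted by the opacity $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ and by the transmittance $T_i$ accumulated before it.

```python
import numpy as np

def stratified_samples(t_near, t_far, n_bins, rng):
    """Draw one uniform sample from each of n_bins equally spaced bins."""
    edges = np.linspace(t_near, t_far, n_bins + 1)
    return edges[:-1] + rng.random(n_bins) * (edges[1:] - edges[:-1])

def render_ray(sigma, colour, t, t_far):
    """Quadrature estimate of the expected colour along one ray.

    sigma:  (N,) densities at the samples
    colour: (N, 3) RGB values at the samples
    t:      (N,) sorted sample depths along the ray
    """
    # delta_i = t_{i+1} - t_i; the last interval is closed at the far bound
    # (an assumption of this sketch).
    delta = np.diff(t, append=t_far)
    alpha = 1.0 - np.exp(-sigma * delta)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance before sample i.
    optical_depth = np.concatenate(([0.0], np.cumsum(sigma * delta)[:-1]))
    trans = np.exp(-optical_depth)
    weights = trans * alpha
    return weights @ colour, weights

# Toy scene (hypothetical): empty space up to depth 4, then an opaque red surface.
rng = np.random.default_rng(0)
t = stratified_samples(2.0, 6.0, 64, rng)
sigma = np.where(t > 4.0, 50.0, 0.0)
colour = np.tile([1.0, 0.0, 0.0], (64, 1))
pixel, weights = render_ray(sigma, colour, t, 6.0)
```

In this toy scene the samples in empty space receive zero weight, and the weights concentrate on the first samples that hit the dense region, which is why the rendered pixel takes the colour of the surface.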

NeRFs optimization
NeRF models employ: i) positional encoding, to improve fine detail reconstruction and represent high-frequency functions, and ii) a hierarchical volume sampling strategy, to allocate the MLP's capacity towards areas of the scene with visible content (Mildenhall et al., 2020). Significant extensions of the work by Mildenhall et al. (2020) include the view synthesis of dynamic scenes with objects in rigid or non-rigid motion (Chen and Tsukada, 2022; Attal et al., 2021; Pumarola et al., 2021), anti-aliasing in rendering, depth estimation (Li et al., 2021) and the reconstruction of dynamic fluids from sparse multi-view videos or images (Chu et al., 2022). The work by Barron et al. (2021) studied the incorporation of depth maps within the colour image set and proposed the so-called Mip-NeRF approach to model depth uncertainty through local sampling.
In its earliest version, NeRF required extremely high computing power and long training times. The introduction of multiresolution hash encoding in neural graphics by Müller et al. (2022), together with the advanced ray-tracing features delivered by GeForce RTX graphics cards, significantly reduced the processing time and capacity required for NeRF training, so that training neural graphics in seconds is now possible.
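The positional encoding mentioned above maps each input coordinate $p$ to a vector of sines and cosines at exponentially growing frequencies, $\sin(2^k \pi p)$ and $\cos(2^k \pi p)$ for $k = 0, \dots, L-1$, before it is passed to the MLP. The following NumPy sketch illustrates the idea; the function name and the ordering of the output features are choices made here for illustration, while the value $L = 10$ used for spatial coordinates follows Mildenhall et al. (2020).

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Encode the last axis of p with sin/cos at frequencies 2^k * pi, k < num_freqs.

    p: array of shape (..., D); returns an array of shape (..., D * 2 * num_freqs).
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi          # (L,)
    angles = p[..., None] * freqs                           # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    # Flatten the per-coordinate features into a single feature vector.
    return enc.reshape(*p.shape[:-1], -1)

# Five hypothetical 3D sample locations, encoded with L = 10 frequencies.
points = np.random.default_rng(0).uniform(-1.0, 1.0, size=(5, 3))
encoded = positional_encoding(points, num_freqs=10)
```

Each 3D location thus becomes a 60-dimensional feature vector, which gives the network direct access to high-frequency variations of the input.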
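To give an idea of how a multiresolution hash encoding indexes its feature tables, the sketch below computes the grid resolution of each level and the spatial hash of an integer grid vertex, following the formulas reported by Müller et al. (2022). It is only a partial, simplified illustration: the function names are hypothetical, the table size is assumed to be a power of two, and the learned feature vectors, their trilinear interpolation and the direct indexing used for coarse levels are all omitted.

```python
import math

# Per-dimension constants of the spatial hash reported by Müller et al. (2022).
PRIMES = (1, 2654435761, 805459861)

def level_resolutions(n_min, n_max, num_levels):
    """Grid resolutions growing geometrically from n_min towards n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (num_levels - 1))
    return [math.floor(n_min * growth ** level) for level in range(num_levels)]

def hash_index(ix, iy, iz, table_size):
    """Map an integer grid vertex to a slot of a feature table.

    Assumes table_size is a power of two, so that Python's unbounded integers
    give the same slot as 32-bit wrap-around arithmetic would.
    """
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size

# Hypothetical configuration: 6 levels between 16 and 512 cells per axis,
# each level backed by a table of 2**14 entries.
resolutions = level_resolutions(16, 512, 6)
slot = hash_index(37, 512, 91, 2 ** 14)
```

Because the table size is fixed, fine levels with many more vertices than table entries share slots through hash collisions, which keeps the memory footprint bounded regardless of the scene resolution.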

Core implications of the theoretical formulation
Based on equation (3), colour can be expressed as the weighted combination of all alpha values $\alpha_i$ (transparency / opacity from alpha compositing at sample point $i$) and colour values $\mathbf{c}_i$ (colour evaluated at sample point $i$) of the points in a ray (Gao et al., 2022). In comparison with common ray tracing, NeRFs leverage a probabilistic function to determine the expected value of colour along the ray (Figure 3). This has three main implications:
- The 3D reconstruction is provided in the form of a volumetric rendering. The NeRF model is neither a point cloud, nor a mesh, but it consists of continuous voxels made of shiny, transparent cubes;
- The scene representation is view-dependent, i.e., the colour of the object may change depending on the point of view, meaning that, for the case of non-diffusely reflecting surfaces (non-Lambertian materials), a different reflection might be shown depending on the point of view;
- NeRFs allow the representation of real-world detailed scenes with complex occlusions, fine details and transient objects.

Existing platforms and tools
The Instant NGP application, developed by Müller et al. (2022) for NVIDIA, accepts both photos and videos as input; the codebase is built on CUDA and Python 3.9, while the camera pose estimation is run through COLMAP. The graphical user interface can be launched via the Anaconda prompt. Nerfstudio by Tancik et al. (2023) is a more recent Python framework allowing for end-to-end creation, training and testing of NeRFs. It integrates a real-time web viewer and supports multiple export modalities, including point clouds and meshes (Figures 4, 5). By default, Nerfstudio applies a scaling factor (downscale) to the resolution of the input camera images. Luma AI by Luma (lumalabs.ai/) is a mobile NeRF capture platform for iOS, currently in beta, designed to integrate more user-friendly tools for neural rendering. The first native volume rendering files compatible with Unreal Engine, the video game and software development tool, are now emerging both in Nerfstudio, in the form of the Volinga plug-in (volinga.ai/), and in Luma AI.

NERF APPLICATIONS IN DIGITAL HERITAGE
In the specific domain of Cultural Heritage, only a few publications have, so far, explicitly identified NeRFs as a possible tool for virtual reconstruction, digital preservation and conservation of heritage objects and sites. Condorelli et al. (2021) (2022), appearance variations were modelled using appearance embedding, but the output was sought in meshes rather than in radiance fields. The modular approach by Kuang et al. (2022) first inferred the geometry of a real scene by neural rendering and then identified material properties and defined per-image lighting conditions to better relight and composite the captured scene. Analogously, NeRFactor enabled free-viewpoint relighting to support shadow and material editing under arbitrary environment light changes. At the territorial scale, Mari et al. (2022) developed Sat-NeRF for the reconstruction of in-the-wild satellite images. Another NeRF variant has been adapted to multi-date collections of satellite images. By taking into account photographs taken at different times, with different lighting and weather conditions, different temporal states and transient occluding objects such as pedestrians and cars, in-the-wild reconstruction allows the production of photorealistic and temporally consistent renderings from novel viewpoints. However, the main limitations of in-the-wild reconstructions remain: i) the need to reconstruct initial camera poses for each training image (Kuang et al., 2022); ii) the sensitivity to camera calibration errors, which can lead to blurry effects; iii) the degradation of rendering quality in parts of the scene that are rarely visible in the training images (Martin-Brualla et al., 2021).

Semantic NeRFs
Recent developments in Cultural Heritage digitalization include the segmentation and classification of 3D models as point clouds and meshes (Grilli and Remondino, 2019). Class labels can be attached to a geometric model to encode semantics, i.e., human-defined concepts, such as information on architectural components, materials and degradation patterns within the digital representation (Croce et al., 2020). The enrichment of NeRFs with semantic information has not yet been developed for cultural heritage. In other domains, Zhi et al. (2021) extended NeRF's scene-specific representation to include semantic representations that were efficiently learned from partial, sparse or noisy annotations of indoor scenes. Similarly, Pavlakos et al. (2022) relied on NeRF models for an accurate estimation of human pose and location. The recovered, semantically enriched 3D scene context was used to render novel views of the human localization within certain environments.

NeRF rendering for Virtual and Extended Reality
The most recent developments of NeRF include the navigation of neural renderings in Virtual Reality (VR) or Extended Reality (XR) applications (Deng et al., 2022; Park et al., 2022). NeRFs created with Instant NGP can be run in VR or AR modes, through compatible headsets or glasses. However, the limitations of these methods are the absence of direct interactions with individual objects of the 3D scene, the lack of real-time collision detection, and the high latency and computational cost of renderings in medium- or large-sized scenes. As novel view synthesis is a prerequisite to many VR and XR applications, Chiang et al. (2022) propose a method to control the style of a rendered 3D scene by enabling seamless switching between real-world scenes and virtual artistic styles, prior to VR and augmented reality (AR) applications.

CASE STUDIES AND EARLY RESULTS
The comparison between NeRFs and more established photomodelling techniques is essential to understand the extent to which neural rendering and novel view synthesis can enhance, or complement, existing techniques for cultural heritage digitization. Ongoing work is aimed at testing the advantages and disadvantages of NeRFs for the digital documentation of cultural heritage objects and sites, in comparison and combination with other existing, consolidated techniques such as photogrammetry.
To this end, for the same set of images, taken with sufficient overlap, we evaluate photogrammetric reconstruction (with dense cloud and textured mesh extraction) on the one hand, and NeRF training on the other hand. The results of the two different methods are then aligned and compared (Figure 6).
A key initial assumption is that the camera poses are known for the input image set. The reconstruction of the camera orientation parameters via COLMAP is in fact common to both workflows. NeRFs are then trained to produce direct models (volumetric rendering) and some derived data, i.e., a point cloud and a textured mesh. To export point clouds and meshes from NeRFs, we use the latest ns-export command of Nerfstudio: the marching cubes algorithm (Lorensen and Cline, 1987) and the Poisson surface reconstruction (Kazhdan et al., 2006) are leveraged for mesh generation. Texture coordinates are later derived for the triangle meshes via the xatlas library (github.com/mworchel/xatlas-python). By deriving these outputs for the same synthetic datasets of different target objects, characterized by non-Lambertian materials or very fine details, we intend to compare photogrammetry and NeRFs, e.g., in terms of shape description and representation type (point cloud to point cloud or volumetric rendering to mesh). For this reason, the multi-scale datasets considered are all chosen amongst cases of Cultural Heritage interest that are typically difficult to process with traditional photogrammetry: the Tersicore statue by the sculptor Antonio Canova, an eagle-shaped lectern of the 14th century and the Caprona Tower, near Pisa.
The Tersicore statue. The Tersicore statue has a homogeneous colour, and the texture of its translucent material is difficult to render. The NeRF model (Figure 7) is generated using Nerfstudio and exported as a point cloud and a mesh from an input set of 233 images. The alignment between the NeRF-generated point cloud and the photogrammetric point cloud, realized via CloudCompare, allows an analysis of the cloud-to-cloud deviation (Figure 8) between the two outputs, on a scale from minimum (blue) to maximum values (red).
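In its simplest form, a cloud-to-cloud deviation assigns to each point of the compared cloud the distance to its nearest neighbour in the reference cloud. The sketch below illustrates this principle with a brute-force NumPy search; it is a stand-in for illustration only, not CloudCompare's implementation, and the function name, the chunk size and the two toy clouds are assumptions made here.

```python
import numpy as np

def cloud_to_cloud_distance(compared, reference, chunk=1024):
    """Distance from each point of `compared` to its nearest neighbour in `reference`.

    compared: (M, 3) array, reference: (K, 3) array; returns an (M,) array.
    Both clouds are assumed to be already aligned in the same coordinate system.
    """
    compared = np.asarray(compared, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    out = np.empty(len(compared))
    # Process the compared cloud in chunks to bound the memory of the
    # (chunk, K, 3) difference array.
    for start in range(0, len(compared), chunk):
        block = compared[start:start + chunk]
        diff = block[:, None, :] - reference[None, :, :]
        out[start:start + chunk] = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    return out

# Toy example (hypothetical data): a planar grid and a copy shifted by 0.1 along z.
xs, ys = np.meshgrid(np.arange(10.0), np.arange(10.0))
reference_cloud = np.column_stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)])
compared_cloud = reference_cloud + np.array([0.0, 0.0, 0.1])
deviation = cloud_to_cloud_distance(compared_cloud, reference_cloud)
```

The resulting per-point distances are the scalar values that a colour scale such as the blue-to-red one of Figure 8 visualizes; for real datasets with millions of points, a spatial index would be needed instead of the brute-force search.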
The results show that NeRF succeeds in describing more completely certain portions of the statue (such as the head, the upper part of the pedestal, and the base) that were framed in a reduced number of images of the input dataset. The point cloud from NeRF is characterized by higher noise, which, however, disappears when looking at the starting volumetric rendering.
The bronze eagle-shaped lectern. By processing an input dataset of 254 images, NeRFs reproduce the colour changes and reflections of the lectern's bronze material. The shiny, metallic effect, which is view-dependent (Figure 9), is much more consistent and realistic than the photogrammetric result. In contrast, the photogrammetric model is extremely flat and matte in terms of texture, making it difficult to tell whether the material is bronze or, e.g., wood (Figure 10).
Caprona Tower and its natural landscape. The challenge of the Caprona Tower dataset (Billi et al., 2023) is to integrate the natural environment around the tower, consisting of scattered vegetation and low shrubs, into the resulting model. We compare the volumetric rendering with the photogrammetric mesh after processing a dataset of 124 drone images (Figures 11, 12). In this case, there is a clear difference between the two types of rendering when it comes to the vegetation surrounding the tower: the mesh model shows a few holes or excessively sharp sections in the scattered low bushes, which are instead fully reproduced in the volumetric rendering. The difference can be seen in the tree to the left of the tower and in some of the low bushes. This suggests the potential use of NeRFs in large-scale spatial mapping applications, even in emergency situations (Croce et al., 2021), to survey difficult-to-access landscapes or urban contexts characterized by dense vegetation.
Figure 11. Neural rendering of the Caprona Tower, on Nerfstudio web viewer.

DISCUSSION
Both photogrammetry and NeRF models allow the prediction of appearance and geometry from observed images. At present, both techniques work on image sets with known camera poses. The camera parameter reconstruction phase is common to both methods, so that a set of images that cannot be correctly aligned by photogrammetry cannot be processed by neural rendering, at least for the time being. However, there have been several attempts at NeRF reconstruction with unknown and even randomly initialised camera poses (Meng et al., 2021). Given the same resolution and size of the input images, NeRFs produce more complete models (with less loss of information) in the form of volumetric renderings. When compared to the point clouds or meshes obtained by photogrammetry, neural renderings could be meaningful for the survey of: i) objects with fine, volumetric details; ii) reflective or transparent surfaces; iii) occluding elements. In photogrammetry, the appearance of real materials and surfaces does not depend on the point of view, while in NeRF models the viewing direction is combined with location features so as to predict the colour from specific points of view and handle transparency and reflectivity. As with photogrammetry, the 3D models produced by NeRFs are not to scale, unless they are appropriately combined with topographic surveys, e.g., from total stations or laser scanners. In terms of outputs, NeRFs were specifically designed for novel view synthesis and volumetric rendering, and the shift from volumetric renderings to more conventional forms of representation, such as point clouds and meshes, has not yet been extensively developed and studied; this prevents interoperability between the different model types. Conversion from NeRF to mesh (and vice versa, from mesh to NeRF) is indeed an active area of research.
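Regarding the scaling of the models with topographic surveys, one common option, given here only as an illustration and not as the procedure adopted in this work, is to estimate a similarity transformation (scale, rotation, translation) from a few control points whose coordinates are known both in the model and in the survey reference system, using the closed-form least-squares solution of Umeyama (1991). The function name and the synthetic control points below are hypothetical.

```python
import numpy as np

def similarity_from_control_points(model_pts, survey_pts):
    """Least-squares scale s, rotation R and translation t with survey ~ s * R @ model + t.

    model_pts, survey_pts: (N, 3) arrays of corresponding control points (N >= 3,
    not all collinear).
    """
    src = np.asarray(model_pts, dtype=np.float64)
    dst = np.asarray(survey_pts, dtype=np.float64)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centred survey and model coordinates.
    cov = dst_c.T @ src_c / len(src)
    u, d, vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    sign = np.ones(3)
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        sign[-1] = -1.0
    rot = u @ np.diag(sign) @ vt
    scale = (d * sign).sum() / (src_c ** 2).sum(axis=1).mean()
    trans = mu_dst - scale * rot @ mu_src
    return scale, rot, trans

# Synthetic check (hypothetical data): unscaled model points and their
# "surveyed" counterparts, generated with a known scale, rotation and shift.
rng = np.random.default_rng(1)
model = rng.random((6, 3))
angle = np.deg2rad(30.0)
rot_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle), np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])
shift_true = np.array([10.0, -4.0, 1.5])
survey = 2.5 * model @ rot_true.T + shift_true
scale, rot, trans = similarity_from_control_points(model, survey)
```

Applying the estimated transformation to a point cloud or mesh exported from either workflow brings it into the metric reference system of the survey; with real, noisy control points the residuals after the transformation give an indication of the achieved accuracy.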

CONCLUSIONS
This paper presented an overview of NeRFs, with a focus on possible applications and extensions of neural rendering in the domain of Cultural Heritage. NeRF is a promising alternative to photogrammetry for image-based modelling, but combining the two techniques, also in terms of output, could be a game changer in overcoming some of the common problems of photogrammetry, including the difficulty of rendering homogeneous textures, transparent or translucent materials, and objects with extremely fine details. Developments of this work could concern the conversion to other forms of digital representation and the comparison between NeRFs and photogrammetry in terms of processing time, resolution and interoperability. The scalability of the model in relation to information provided by metric surveys should also be assessed. Finally, another interesting question is how NeRFs respond, compared to photogrammetry, to the loss of information due to a reduced number or a lower resolution of input images. These aspects are currently under investigation.

AUTHOR CONTRIBUTIONS
The work is the result of an Italo-French collaboration between the home institutions of each author. V.C. is responsible for project conceptualization, execution of the experiments and drafting of the manuscript. G.C., L.D.L., A.P. and P.V. supervised all steps of the research, provided substantial insights and edits. The work has been funded, from January to February 2023, by the Joint LAB (CNR) project LIA Laboratoire International Associé (CNRS).