ACCURACY EVALUATION OF SMARTPHONE-BASED VIDEOGRAMMETRY FOR CULTURAL HERITAGE DOCUMENTATION PROCESS

Over the last decade, the Cultural Heritage field has increasingly adopted modern technologies to digitize assets and to facilitate protection and conservation activities. Much research has focused on innovative methods such as photogrammetry and Terrestrial Laser Scanners, assessing them in terms of reliability, precision, time and cost. This research instead investigates the use of the smartphone by comparing point clouds obtained via smartphone videogrammetry with those generated by other digital survey techniques, namely Terrestrial Laser Scanners and photogrammetry via SLR camera. The comparison between point clouds was conducted on four criteria (point cloud fitting, density evaluation, profiling and texture quality) in order to verify the geometric reliability of the data and the quality of the derived polygonal lines and mesh/texture. The selected case study is the sepulchral monument of the Passasepe-Lambertini family (14th century), located in the Crypt of Santa Maria della Scala in the Cathedral of Trani (South of Italy). The research shows that, in heritage documentation, this methodology offers greater portability and accessibility than the other methodologies while maintaining widely acceptable standards of accuracy, and therefore constitutes a valid alternative for the documentation of historical heritage.


INTRODUCTION
In the Cultural Heritage field, the use of advanced remote technologies is becoming an established practice for digitizing assets and facilitating conservation and preservation activities. The main purpose is the construction of suitable and accurate As-Built 3D models that allow indirect evaluations and analyses and facilitate remote inspections through digital twins and/or BIM (Building Information Modelling) models (Moyano et al., 2021). In the literature, many researchers have focused on the development of innovative technologies and methods, such as Close-Range Photogrammetry (CRP) and laser scanning, to minimize the disadvantages of traditional surveying in terms of accuracy, time, and cost. In this regard, Terrestrial Laser Scanner (TLS) and aerial and terrestrial photogrammetry are the most commonly used technologies, since they are potentially safer, more accurate and more reliable, and produce outputs, such as point clouds, that are less fragmented and can be accessed by remote devices and operators at any time for further data analysis, inspections, and measurements, significantly reducing on-site activities (El-Din Fawzy, 2019). The choice of the acquisition tool depends on various factors, such as the purpose of the 3D models, the LOD (Level of Detail) required, the experience of the operators, the accessibility of the site, and the assigned budget (Mohammadi et al., 2021). With regard to metric properties, digital cameras can be divided into two categories: non-professional compact cameras, characterized by low cost and low sensor resolution, and expensive, professional high-resolution digital cameras. The main difference between the two is the lower geometric stability of non-professional cameras. Smartphones are equipped with compact cameras and are a particularly interesting option because they are lightweight, portable, economical and equipped with high-resolution digital cameras.
Some studies in the literature have demonstrated that modern smartphone sensors may facilitate digital surveying activities for Cultural Heritage, as they offer greater portability and easier data acquisition. Considerable results have been obtained by testing modern smartphones as measuring instruments in the archaeological field (Shults, 2017). In other studies, the 3D depth sensors of smartphones provided morphometric data comparable in accuracy with photogrammetry and laser scanner results (Boboc et al., 2019) (Mikita et al., 2020). The accuracy attainable by point clouds obtained from images taken with different smartphones has also been investigated by comparing the 3D reconstruction with a TLS point cloud used as ground truth, demonstrating that the positional accuracy of the dense point cloud and of the resulting mesh model is in the order of millimetres (Yilmazturk and Gurbak, 2019) (Costantino et al., 2020). Nocerino et al. (2017) exploited another advantage of smartphones in the Cultural Heritage context by building a collaborative image-based 3D reconstruction pipeline in which image acquisition is performed with a smartphone and geometric 3D reconstruction on a server, during concurrent or disjoint acquisition sessions. However, the geometric accuracy of high-resolution smartphone camera images and of 3D object reconstructions extracted from video has not been sufficiently investigated in the literature, especially with regard to the documentation of historic buildings. One limitation of this technique is that the result may be inaccurate, since video recording can introduce noise caused by the weather, by movement during recording, or by the shadow of the operator (Ahmad et al., 2019).
Murtiyoso and Grussenmeyer (2021) applied smartphone videogrammetry to several decorative bas-reliefs and compared the results with those of traditional DSLR close-range photogrammetry, obtaining average and standard deviations in the order of a few millimetres.
Through a comparison among three point clouds acquired with TLS, image-based photogrammetry and video-based photogrammetry, this paper does not aim to determine the most precise method, which has already been established in previous research, but to evaluate the geometric reliability of the 3D reconstruction results for the construction of As-Built models for documentation purposes. The sepulchral monument of the Passasepe-Lambertini family (14th century), located in the Crypt of Santa Maria della Scala of the Cathedral of Trani (South of Italy), has been selected as the case study.

METHODS AND DATA
The present work tests the use of smartphone videogrammetry for the documentation process in the field of Cultural Heritage. The video-based reconstruction has been compared with the image-based and TLS point clouds to determine its quality. The following sub-sections explain the method adopted. The proposed research involves the relative comparison of the acquired data points, for different regions of the artefact, in the three point clouds with the as-is measurements.

Case study: historical notes
Inside the Cathedral of Trani, south-east of the wall that separates the crypt of Santa Maria della Scala from the crypt of San Nicola Pellegrino, the only funerary monument in the entire sacred building has stood for centuries (Francesco Calò, 2016). The Lambertini, or Passasepe-Lambertini, tomb consists of a rectangular stone case set on a high step and surmounted by an ogival ciborium with a double-pitched gabled roof; the structure is supported by four marble columns, of which the two at the front are twisted and the two at the rear are plain, all four embellished with crocket capitals. The case is decorated on the front and on the only visible short side. The main façade of the monument bears the heraldic coats of arms of the family. Above the chest, framed by the monumental canopy, is a badly damaged fresco which, owing to evident paint losses, has lost part of its lower section.
The presence of friezes, four turned columns supporting a canopy with ogival arches, and a decorative fresco makes the case study representative of numerous compositional characteristics found in historical heritage.

Instrumentation and survey phase
The case study has been scanned and virtually reconstructed through three digital survey techniques: TLS, photogrammetry via reflex camera and videogrammetry via smartphone. The following sections describe the acquisition phase and the 3D reconstruction.

TLS based data acquisition and 3D reconstruction:
In the acquisition process, the scans have been taken with a CAM2 FARO Focus 3D 120, which has a range of up to 120 m, a systematic measurement error (ranging error) of ±2 mm (1σ) at 10-25 m, and an integrated camera with a resolution of 70 megapixels. A total of four scans has been taken, following a previously drawn-up acquisition plan, by placing the device along the outer perimeter and on the stone case at an average distance of 0.80 m, so as to include the different sides of the artefact. The default normal resolution (6 mm at 10 m) and normal quality were used. The total acquisition time was 22 min. The scans have been merged in Autodesk® ReCap Pro. The resulting point cloud contains 4,645,420 points.

Data acquisition and 3D reconstruction:
The sets of photographs and video frames have been used to produce the 3D output in the form of a point cloud and a photorealistic reconstruction (textured mesh) via image matching and Structure from Motion (SfM) algorithms, using the software Agisoft Metashape, which creates 3D models from point clouds automatically. The pictures have been captured using a Nikon D3300 reflex camera equipped with an APS-C sensor of 23.5 x 15.7 mm and 24.2 megapixel resolution, at a fixed focal length of 18 mm. The video has instead been recorded using a Huawei P30 smartphone, equipped with a 1/4", 40 megapixel sensor, producing a 3840 x 2160 pixel video output. The photographs and the video have been acquired on the basis of a previously drawn-up acquisition plan in order to optimize the survey time and ensure an effective overlap of 70% between shots. The video has been taken by rotating around the object in a convergent way, especially near the columns. An additional factor considered was the shooting speed during data acquisition. A slower video allows a greater overlap between consecutive frames but increases the amount of data to be processed. However, the frame sampling rate is adjustable, so a slow video can be reduced to fewer frames when data size is a problem. The photographic set consists of 126 shots taken in 12 minutes and 32 seconds, while 307 frames have been extracted from a video of 2 minutes and 32 seconds. To obtain a scaled three-dimensional model, 4 numbered targets containing a 0.15 m metric bar, useful for evaluating the accuracy of the final scaling of the model, have been placed near the tomb. The arrangement of the targets took into account the irregular shape of the tomb, so they have been placed not only on the floor but also on the irregular surfaces that characterize the ogival vault.
The photographic and video acquisitions have been made with parallel axes with respect to the surfaces, following a smooth path. Processing these images produced two dense point clouds: 11,937,957 points from the photographic set and 13,122,643 points from the video, with a ground resolution of 0.321 mm/pixel for the former and 0.316 mm/pixel for the latter.
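The relation between camera geometry, ground resolution and overlap can be sketched with the usual pinhole approximation. This is only an illustrative sketch, not the computation performed by Agisoft Metashape: the sensor width (23.5 mm), focal length (18 mm) and 70% overlap come from the text, while the image width of 6000 pixels, the 1.5 m shooting distance and the function names are assumptions introduced for the example.

```python
def ground_sample_distance(sensor_width_mm, image_width_px, focal_mm, distance_mm):
    """Object-space size of one pixel (mm/pixel) for an ideal pinhole camera."""
    pixel_pitch_mm = sensor_width_mm / image_width_px
    return pixel_pitch_mm * distance_mm / focal_mm


def baseline_for_overlap(sensor_width_mm, focal_mm, distance_mm, overlap):
    """Camera displacement (mm) between two parallel-axis shots that keeps
    the requested fraction of the image footprint in common."""
    footprint_mm = sensor_width_mm * distance_mm / focal_mm
    return (1.0 - overlap) * footprint_mm


# Assumed values: 6000 px image width and 1500 mm camera-to-object distance.
gsd = ground_sample_distance(23.5, 6000, 18.0, 1500.0)
step = baseline_for_overlap(23.5, 18.0, 1500.0, overlap=0.70)
print(f"GSD = {gsd:.3f} mm/pixel, step between shots = {step:.0f} mm")
```

Under these assumptions a shorter distance or a longer focal length gives a finer ground resolution, at the price of a smaller footprint and therefore more shots for the same overlap.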

Point cloud processing:
In general, the geometric precision of raw data points may be degraded by the presence of noise. The Statistical Outlier Removal (SOR) method was used to remove noise from the point cloud data.
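The principle of the SOR filter can be illustrated with a minimal sketch. It assumes the common formulation in which a point is discarded when its mean distance to its k nearest neighbours exceeds the global mean by more than a chosen multiple of the standard deviation; the parameter values, the function name and the toy data are hypothetical, since the settings used in the study are not reported. The brute-force neighbour search is suitable only for a demonstration.

```python
import math
import statistics


def statistical_outlier_removal(points, k=6, std_ratio=1.0):
    """Keep the points whose mean distance to their k nearest neighbours
    does not exceed (global mean + std_ratio * global standard deviation)."""
    mean_knn = []
    for i, p in enumerate(points):
        # Brute-force search: acceptable for a toy cloud, too slow for real data.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)
    limit = statistics.fmean(mean_knn) + std_ratio * statistics.pstdev(mean_knn)
    return [p for p, d in zip(points, mean_knn) if d <= limit]


# Toy cloud: a 5 x 5 planar grid (1 cm spacing) plus one stray point 50 cm away.
grid = [(x * 0.01, y * 0.01, 0.0) for x in range(5) for y in range(5)]
stray = (0.02, 0.02, 0.50)
cleaned = statistical_outlier_removal(grid + [stray], k=6, std_ratio=1.0)
print(len(grid) + 1, "points before,", len(cleaned), "points after filtering")
```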

Accuracy evaluation methods:
The simplest method is visual inspection, which consists of a visual comparison of overlapping point clouds. It does not require much computation, but it is subjective and cannot provide a quantitative evaluation. The physical measurement method compares a set of measurements taken on the real building with their virtual counterparts (Anil et al., 2011). The compared values are statistically analysed to obtain a confidence value. Its advantage is that it avoids the errors caused by scaling the Structure from Motion (SfM) point clouds. However, it also has some limitations, since it is not possible to cover all the possible measurements (such as ceiling heights or complex morphologies) or to directly identify the sources of error. Furthermore, it is a time-consuming process that requires the collection of a large number of measurements. Another method is the Surface Deviation Analysis (SDA) (Anil et al., 2013). A fundamental assumption is that the model and the reference point cloud should geometrically match within a given tolerance. The SDA depends on the mathematical model used, the density of the point clouds and the order of comparison. Results of the analysis can be graphical and numerical (Bonduel et al., 2017). To estimate the correspondences between the 3D reconstructions (point cloud, mesh, 3D model, etc.), it is possible to use direct or indirect methods based on mathematical models.
In this work, the minimum Euclidean distance has been used to evaluate the geometric precision of the SfM point clouds and polygonal meshes, elaborated in Sec. 2.2, with respect to the TLS point cloud, assumed as ground truth (Dai et al., 2013). This distance has been chosen as the accuracy metric because it reflects the positional relationship between two objects in 3D space. From the Euclidean distances, the maximum error, the average (mean) distance, the standard deviation (σ), and the Root Mean Square Error (RMSE) have been calculated; they are listed in Table 1.
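The statistics named above can be sketched as follows, assuming a plain nearest-neighbour (cloud-to-cloud) distance from each compared point to the reference cloud. The function name and the toy data are hypothetical, and the brute-force search stands in for the spatial indexing that dedicated software would use on real clouds.

```python
import math


def cloud_to_cloud_stats(reference, compared):
    """Minimum Euclidean distance from each compared point to the reference
    cloud, summarised by min, max, mean, standard deviation and RMSE."""
    d = [min(math.dist(p, q) for q in reference) for p in compared]
    n = len(d)
    mean = sum(d) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in d) / n)
    rmse = math.sqrt(sum(x * x for x in d) / n)
    return {"min": min(d), "max": max(d), "mean": mean, "std": std, "rmse": rmse}


# Toy data: a reference grid (2 cm spacing) and a copy shifted 5 mm along z.
reference = [(x * 0.02, y * 0.02, 0.0) for x in range(10) for y in range(10)]
compared = [(x, y, z + 0.005) for (x, y, z) in reference]
stats = cloud_to_cloud_stats(reference, compared)
print({name: round(value, 4) for name, value in stats.items()})
```

With these definitions the RMSE combines the systematic offset and the dispersion, since RMSE² = mean² + σ²; a cloud that is merely shifted, as in the toy data, shows a non-zero mean and RMSE but a standard deviation close to zero.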
Firstly, carrying out an SDA requires the choice of a threshold to limit the distraction caused by irrelevant points. The selection of the threshold depends on several factors, including scene complexity, data noise level, the types of error being analysed, and the accuracy requirements of the as-is model. The second step is to visualise the correspondences through a colour-coded deviation map, made by colouring each surface according to its distance. Different errors produce different deviation patterns, which can be analysed to identify their source, type, and relevance within the point cloud data or the data derived from the 3D reconstruction. Anil et al. (2011) describe several colouring options that support the interpretation of the maps, such as continuous vs. binary colouring, colouring points vs. colouring surfaces, and signed vs. unsigned deviation maps. With lower threshold values, the exact size and position of the larger errors cannot be identified, because all regions with deviations above the threshold are shown in the same colour. Smaller thresholds are more effective for visualizing detailed deviations, such as local geometrical errors, whereas larger thresholds are more effective for visualizing modelling errors that influence the global geometry of larger components in the facilities. A data set can be analysed using a series of thresholds to identify different types of errors.
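The effect of the threshold on a continuous, unsigned deviation map can be sketched as follows. The blue-green-red ramp and the function name are assumptions made for illustration and do not reproduce the colour scale of any specific software; the point of the example is that every deviation at or above the threshold receives the same saturated colour.

```python
def deviation_colour(distance, threshold):
    """Map an unsigned deviation to an (R, G, B) tuple in the range 0..255:
    0 -> blue, threshold / 2 -> green, threshold and above -> red."""
    t = min(max(distance / threshold, 0.0), 1.0)  # clamp to [0, 1]
    if t < 0.5:
        s = t / 0.5
        return (0, round(255 * s), round(255 * (1 - s)))
    s = (t - 0.5) / 0.5
    return (round(255 * s), round(255 * (1 - s)), 0)


threshold = 0.10  # metres, i.e. the 10 cm threshold adopted in this study
for d in (0.0, 0.05, 0.10, 0.25):
    print(f"{d:.2f} m -> {deviation_colour(d, threshold)}")
```

Repeating the mapping with a smaller threshold spreads the colour ramp over small deviations, which is why a series of thresholds helps to separate local errors from global ones.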
In this study, the three point clouds were first cleaned of disturbing elements (furniture, points not belonging to the artefact, etc.) using manual segmentation. To compare the SfM point clouds with the TLS point cloud, the artefact has been divided into five regions corresponding to individual architectural components, such as the ogival vault, the columns, the fresco, and the tomb. Considering the goal of analysing the accuracy of the videogrammetric reconstruction, a threshold value of 10 cm has been adopted between the TLS point cloud and the SfM point clouds and polygonal meshes. The deviation maps have been visualized by colouring the points composing the surfaces of the model in RGB colour space.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-M-2-2023 29th CIPA Symposium "Documenting, Understanding, Preserving Cultural Heritage: Humanities and Digital Technologies for Shaping the Future", 25-30 June 2023, Florence, Italy

Profiling comparison:
Extracting cross-sectional profiles provides a meaningful comparison between the products derived from different devices and methods, because it shows the path of the acquired points in a perpendicular plane, giving a reliable representation of the section and of its detailed geometric features. In this study, regions of interest have been selected and the cross-sectional profiles have been extracted in the form of point clouds and meshes to calculate the Euclidean distance through SDA and to evaluate the minimum, maximum and average distance, the standard deviation (σ), and the Root Mean Square Error (RMSE), as described in Sec. 2.3.1. The analysis has been conducted on both point clouds and meshes in order to understand the relationship between the survey instrument and the accuracy of the derived representation in the 3D environment. In fact, according to the literature, one of the challenges in developing a reliable 3D model (BIM, solid models, etc.) is to convert the point cloud data obtained from the digital survey into a three-dimensional model as accurately as possible (Wang et al., 2015). Each point cloud of the sepulchral monument has been analysed through three sections, one horizontal (HS1) and two vertical (VS1 and VS2), and the sections have then been compared considering both the point cloud itself and the polygonal section profile obtained through CloudCompare.
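The extraction of a point cloud slice can be sketched as the selection of the points that fall within a thin slab around the cutting plane. The example below is a simplified illustration for a horizontal section only; the slab thickness, the function name and the toy cylinder are assumptions, and the procedure actually applied in CloudCompare may differ.

```python
import math


def horizontal_slice(points, z0, thickness):
    """Return the (x, y) profile of the points lying within a slab of the
    given thickness centred on the horizontal plane z = z0."""
    half = thickness / 2.0
    return [(x, y) for (x, y, z) in points if abs(z - z0) <= half]


# Toy cloud: a vertical cylinder of radius 0.10 m, sampled every 10 degrees
# and every 1 cm in height (a stand-in for one of the columns).
cloud = [
    (0.10 * math.cos(math.radians(a)), 0.10 * math.sin(math.radians(a)), h * 0.01)
    for h in range(50)
    for a in range(0, 360, 10)
]
profile = horizontal_slice(cloud, z0=0.25, thickness=0.005)
print(len(profile), "points in the slice")
```

A vertical section follows the same logic with the slab defined on x or y; a thicker slab captures more points but blurs the profile when the surface is not perpendicular to the cutting plane.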

Criteria for evaluation:
The methodology used for the comparison considers four criteria for evaluating the reliability of the acquired datasets and instruments:
• Point cloud fitting: Since a reference surface has not been used, the SDA has been performed using cloud-to-cloud and cloud-to-mesh absolute distances, assuming a threshold of 10 cm, and the maximum, minimum and average distances, the standard deviation (σ), and the Root Mean Square Error (RMSE) have been calculated.
• Density evaluation: Point cloud density is an indicator of the resolution of the data: higher density means more information (high resolution), while lower density means less information (low resolution). Understanding point cloud density is important because it may affect the quality or accuracy of further projects based on the point clouds. The average density of the surface of the artefact, in points per square centimetre, has been calculated for each point cloud and compared through CloudCompare.
• Profiling: Through CloudCompare, three sections (HS1, VS1 and VS2) have been extracted for each point cloud (TLS-based, image-based and video-based). They have been extracted in two ways, as point cloud slices and as polylines created directly by the software from the point cloud data, and then accurately overlapped to compare the results. The point cloud slices have been compared through SDA and visual inspection, whereas the polylines have been evaluated manually by measuring distances.
• Texture quality: the textures have been assessed qualitatively, by visual comparison of the orthophotos generated from the textured polygonal meshes.
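The density criterion listed above can be illustrated with a minimal sketch. It assumes one common definition of surface density, namely the number of points found within a radius r of each point divided by the area πr², averaged over the cloud; the exact estimator and radius used in the study are not stated, so the function names, the radius and the toy grids are hypothetical.

```python
import math


def mean_surface_density(points_cm, radius_cm):
    """Average over all points of N / (pi * r^2), where N counts the points
    (the query point included) within radius r. Coordinates are in
    centimetres and the surface is assumed to be locally planar."""
    area = math.pi * radius_cm ** 2
    densities = []
    for p in points_cm:
        n = sum(1 for q in points_cm if math.dist(p, q) <= radius_cm)
        densities.append(n / area)
    return sum(densities) / len(densities)


def planar_grid(n, spacing_cm):
    """Regular n x n grid of 2D points with the given spacing."""
    return [(i * spacing_cm, j * spacing_cm) for i in range(n) for j in range(n)]


dense = mean_surface_density(planar_grid(20, 0.5), radius_cm=1.0)
sparse = mean_surface_density(planar_grid(10, 1.0), radius_cm=1.0)
print(f"dense grid: {dense:.2f} pts/cm2, sparse grid: {sparse:.2f} pts/cm2")
```

Points near the border of the sample have fewer neighbours, so the averaged value slightly underestimates the nominal density; the same edge effect applies to any bounded region of a real cloud.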

RESULTS AND DISCUSSION
In the field of Cultural Heritage (CH), a detailed survey obtained with 3D recording tools is essential, as it provides a digital copy of the artefact with high metric precision. However, in some contexts the cost and reduced portability of LiDAR technology have favoured the use of the photogrammetric technique. Since each method has advantages and disadvantages, the most appropriate technique needs to be chosen according to the documentation requirements and the specific characteristics of the study site. This paper proposes an experimental investigation into the usability of the smartphone for CH documentation. The smartphone is an interesting option due to its portability and low cost, and may be a useful instrument to support survey tasks for 3D reconstruction in a documentation context. The first consideration in the application of videogrammetry is the setting of the frame extraction. Frames sampled too far apart risk reducing the overlap, whereas frames sampled too close together increase the processing time.
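The trade-off in frame extraction can be made concrete with a small sketch that spreads a chosen number of frames evenly over the video. The video duration (2 minutes and 32 seconds) and the number of frames (307) are those of this study, while the recording rate of 30 frames per second and the function name are assumptions, since the frame rate is not reported.

```python
def frame_indices(duration_s, fps, n_frames):
    """Indices of n_frames frames spread evenly over a video."""
    total = int(duration_s * fps)
    step = total / n_frames
    return [int(i * step) for i in range(n_frames)]


# 152 s video and 307 frames as in this study; 30 fps is an assumed value.
idx = frame_indices(duration_s=152, fps=30, n_frames=307)
interval_s = 152 / 307
print(f"{len(idx)} frames, one every {interval_s:.2f} s")
```

Requesting more frames shortens the interval and raises the overlap between consecutive frames, at the cost of a proportionally larger data set to align.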
Moreover, the main obstacle to high-precision 3D reconstruction from video is the quality of the frames, which can be blurred. During the reconstruction of the Structure from Motion (SfM) point clouds (photo-based, PH, and video-based, VD) in Agisoft Metashape, after a first alignment of the frames extracted from the video, some holes were noticed in the area of the upper surface of the tomb (consisting of a single smooth stone block), caused by the homogeneity of the texture; it was therefore chosen to extract a greater number of frames to ensure a greater overlap. This is not possible for the photogrammetric point cloud, so in case of errors in the photographic acquisition phase it is necessary to supplement the survey campaign with new acquisitions. As a result of the cloud-to-cloud SDA (Table 1), the maximum standard deviation (σ) among all selected regions is 0.0131 m for VD, in Region 4, and 0.0334 m for PH, in Region 1. The mean distance for VD is similar in all five regions and is less than one centimetre. The maximum mean value occurs in Region 1 for both VD and PH, and is 0.007 m and 0.0168 m respectively. The maximum error corresponds to Region 2 for both point clouds examined. The Root Mean Square Errors (RMSEs) are considered acceptable for both systems. Region 2 and Region 1 are the portions of the point cloud with the most irregular geometry.
For the video-based point cloud (VD), the Cloud-to-Mesh operation produced values in the order of centimetres for the maximum error, the average (mean) distance, the standard deviation (σ), and the Root Mean Square Error (RMSE). Figure 1 shows the SDA result between the TLS and VD point clouds visualized in RGB space: blue represents the minimum distances and red the maximum distances. The highest error values were found near areas with lower point density or geometrical discontinuities.
The average surface density per square centimetre differs markedly among the point clouds: 12.5 points for TLS, 33.5 points for VD and 32.05 points for PH, so that the TLS density is roughly 40% of that of the SfM point clouds. This may indicate more reliable results using SfM reconstructions at the level of detail useful for as-built 3D models for monitoring or inspections. However, the analysis of the SfM point clouds through the profiling criterion revealed a distribution of the PH and VD points different from that of the TLS. As anticipated in Sec. 2.3.2, in statistical terms the spatial distribution of both the image-based and the video-based point clouds has been measured through the Root Mean Square Error formula, using the TLS-based point cloud as reference. Figure 3 summarizes the visual comparison between cloud slices and section profiles; the differences among the three point clouds are clearly visible in the scaled detail, which highlights in particular the discrepancy between the TLS and image-based sections (almost coincident) and the video-based one, which is slightly offset. The average distance and RMSE of each pair of section slices are also reported in Table 2. The results clearly show that sections HS1 and VS1 present a higher error in both average distance and RMSE. This is due to the characteristics of the sectioned surfaces, which include, in the lower part, a highly irregular area corresponding to the richly decorated base of the sepulchral monument.

[Table 2: Cross Section | Average distance | RMSE]
In addition to spatial reliability, textural quality has also been considered. For this purpose, digital orthophotos of the façade have been created. The results are presented in Fig. 4, which shows that the orthophoto generated from photogrammetry (Fig. 4b) has a richer textural quality than those obtained via TLS and VD (Fig. 4a and 4c). However, the textural quality of the orthophoto generated from videogrammetry is not negligible: sharp contours, the main characteristic structural details, and the transitions between stones are more distinguishable than in the orthophoto generated from TLS.

CONCLUSIONS
In this research, the accuracy of a point cloud acquired through smartphone videogrammetry has been evaluated and compared with reflex-based photogrammetry and a LiDAR point cloud, using a series of criteria and methods applied to the sepulchral monument of the Passasepe-Lambertini family, located in the Crypt of Santa Maria della Scala of the Cathedral of Trani (South of Italy). The aim of the paper is to determine whether the level of detail achieved is sufficient for specific heritage documentation purposes. Quantitative experiments have shown that the results can achieve high accuracy when compared with a reference ground truth, in this case the TLS-derived point cloud. The potential of the proposed method has been evaluated by analysing statistical properties such as maximum error, average distance, standard deviation and RMSE, comparing the dense point clouds with the reference models obtained from a professional camera and TLS. The obtained results demonstrated that the behaviour of the different software tools in terms of performance is similar. In the application of photogrammetry and videogrammetry, the reliability of the survey reflects not only the quality of the sensor but also the registration process. Another issue encountered in this case is the quality of the sequenced frames; a balance between overlap and frame quality is important in this regard. In summary, based on the results of this study, smartphone-based videogrammetry can be considered useful in support of the documentation process of Cultural Heritage. The geometric accuracy obtained is sufficient when compared with the most common digital methods, such as laser scanning and photogrammetry. Evaluating the results, the main problem is the level of detail.
Furthermore, in this case, the obtained models showed that, by following a rapid acquisition process and using a limited number of inputs obtained from a mass-market sensor, it is possible to generate representations at a scale sufficient for the main general analyses. For some visual documentation applications, this technique is a viable alternative to other more expensive and less portable methods. However, when project requirements call for high precision, as in degradation mapping, valuable elements, orthophotos, and CAD or BIM-based applications, image-based photogrammetry and laser scanning produce undeniably better results. A balance between quality and cost is easier to find within a low price range, which matters in all projects on a budget, as often happens in the documentation of heritage.