COMBINED 360° VIDEO AND UAV RECONSTRUCTION OF SMALL-TOWN HISTORIC CITY CENTERS

Small-town historic urban environments increasingly face pressures that may threaten their future sustainability. Promoting a recovery plan for those centers requires a dedicated interdisciplinary methodology that orients future actions toward innovative solutions. A key role is played by low-cost methods for mapping and classifying the built heritage. Both ground and aerial (UAV) solutions can be used to create 3D city models for further classification and analysis. However, integrating those two types of data in small city centers poses specific challenges. This paper presents a combined approach using ground and aerial (UAV) data to comprehensively reconstruct a part of a small-town city center. Ground data are videos acquired with 360° (spherical) cameras. Previous works proved that 360° videos can reconstruct the building façades of historic city centers with good metric accuracy and completeness. The combined reconstruction is tested on the so-called “Casa Teatro”, an abandoned building in the city center of Vogogna, a small town in the Ossola Valley (Northern Italy). The results of the combined image orientation are validated by using ground truth data (Check Points) acquired with a total station, while the final 3D model reconstruction results are compared with a laser scanning acquisition of the building to evaluate the accuracy and completeness of the photogrammetric point cloud. In addition, matching among 360° frames and UAV frames is studied by analysing connectivity graphs.


INTRODUCTION
Since the introduction of the National Strategy for Internal Areas (SNAI, Strategia Nazionale per le Aree Interne) in 2013 at the central government level, and of other legal acts issued at the regional level in Italy, the depopulation of small centers has been countered by different strategies. Both national and regional policies emphasize the role of certain pre-conditions for the activation of specific territorial strategies, such as basic services for the community and a set of development guidelines concerning: a) active protection of the heritage and environmental sustainability; b) enhancement of natural and cultural capital and of tourism; c) enhancement of the countryside and food industries; d) activation of clean energy chains; e) fablabs and crafts. Among the various regional laws addressing the depopulation of small towns, the Piedmont Region introduced in 2019 legal act no. 14 for preserving, enhancing and developing the mountain territory. The National Plan of Recovery and Resilience (PNRR), approved in 2021, aims among its various targets to support territorial cohesion policies. Mission 1 of the recovery plan addresses digitization, innovation, competitiveness, culture, and tourism. This work analyses the application of an innovative digitization methodology to a cultural heritage site in a small town with a depressed economy, located within a natural park. The site, Vogogna, in the Ossola Valley, also hosts the archaeological ruins of an ancient building. It is an example of an economically depressed area with an increasing trend of depopulation, where local measures are trying to address the problem by improving some services connected to the cultural and environmental sectors.
The research developed by the authors is an attempt to apply the strategies indicated by the mentioned strategic guidelines for countering the depopulation of mountain areas through the enhancement of local cultural resources. An agreement between the local authorities and Politecnico di Milano organized the digitization of the chosen architectural heritage. In this framework, a key role is played by low-cost methods for mapping and classifying the built heritage. To support this digitization activity, ground and aerial (UAV) solutions can be used to create 3D city models for further classification and analysis. Their application has been successful in different contexts and has been verified in different projects. However, surveying small city centers poses specific challenges from both a technical and a non-technical point of view. From a non-technical perspective, even if several methods exist for rapid and efficient city and urban mapping, their cost can be out of budget for small communities and municipalities. From a technical point of view, narrow streets and tall buildings can make it difficult to record building elevations and façades, even with oblique UAV images. On the other hand, a ground-level acquisition may fail to capture the highest building floors and reconstructs roof shapes only accidentally. To cope with those limitations, in this work we present a combined reconstruction that integrates in a single framework the ground acquisition of 360° videos and frame images acquired with a small-payload UAV. The presented method is tested on the so-called "Casa Teatro", an abandoned building in the city centre of Vogogna, a small town in the Ossola Valley (Northern Italy).

RELATED WORKS
In the past decade, there has been an exponential growth in the market for low-cost and consumer-grade cameras, resulting in a broad range of new sensors being available today. One of the latest and most prolific research topics among photogrammetric researchers has been the potential to use spherical and cylindrical images captured by non-metric cameras for photogrammetric applications. Over the past few years, the market sector dealing with spherical images and videos, also known as panoramas, has experienced rapid growth. These images are used not only for documentation, visualization, and sharing purposes but also for 360° videos, AR/VR applications, and more. They offer some interesting advantages from a photogrammetric perspective that have not been fully explored yet. Fangi (2007) introduced the concept of spherical cameras for photogrammetric reconstruction and defined the mathematical model. From a practical point of view, spherical images were generated by stitching different images acquired around a nodal point with the same camera, following the Computer Vision approach developed by Szeliski and Shum in the late 1990s (Szeliski and Shum, 1997). Starting from those first attempts, the topic of spherical cameras has gained a lot of attention in the last few years thanks to the development of off-the-shelf consumer-grade 360° cameras. Those systems consist of two or more synchronized cameras shooting in different directions all around the device. The single shots are then combined into a unique 360° image using an equirectangular projection. The final projection is considered distortion free, up to a scale factor, and the relative orientation of the different cameras is estimated during the stitching phase (Fangi and Nardinocchi, 2013).
Since consumer-grade spherical cameras are not designed for metric purposes, several works investigate the possibility of using such instruments for photogrammetric applications, mainly working in two directions: (i) the implementation of more sophisticated algorithms and solutions for image stitching (Lee et al., 2020), and (ii) the improvement of photogrammetric techniques for processing these types of images (Fangi et al., 2018; Janiszewski et al., 2022). Indeed, even if the mathematical framework for spherical images is well-defined, some practical aspects may pose significant issues. In the last few years, spherical cameras have been used for the survey of narrow spaces such as tunnels and caves, for the documentation of some peculiar architectural structures such as the interiors of bell towers (Teppati Losè et al., 2021), or for the documentation of historical city centres (Barazzetti et al., 2022). Indeed, even if several methods exist for rapid and efficient city and urban mapping based on car-mounted or backpack laser-based solutions, their cost may not be sustainable for small communities and municipalities. For this reason, low-cost solutions can represent an interesting alternative. Previous works (Barazzetti et al., 2022) proved that 360° videos can reconstruct the building façades of historic city centers with a good level of metric accuracy and completeness. However, a ground-level acquisition may fail to capture the highest building floors and reconstructs roof shapes only accidentally. At the same time, Unmanned Aerial Vehicles (UAVs) have had significant market success, making low-cost entry-level solutions available and, on the other hand, allowing good-resolution camera systems to be carried as the payload of low-weight UAVs. Several projects (Hill, 2019; Kerle et al., 2019; Ren et al., 2019) proved the feasibility of consumer-grade UAVs for documentation and 3D reconstruction of large sites.
However, in small city centers narrow streets can make it difficult to record building elevations and façades, even with oblique UAV images. In those scenarios, sensor integration and the combination of data from two different platforms (ground and UAV) can be an interesting solution to reduce the uncertainty of the observations gathered separately from each sensor and to provide a more complete reconstruction. This paper presents a combined approach using ground and aerial (UAV) data to comprehensively reconstruct a part of a small-town city center. Ground data are videos acquired with 360° (spherical) cameras, while UAV data are recorded using a lightweight consumer-grade UAV solution. Combining both data allows the complete reconstruction of a building in the city center of Vogogna. Although some works (Calantropio et al., 2019) address the combined use of 360° cameras and UAVs, they mainly deal with mounting 360° cameras on board a UAV platform. This paper instead focuses on combining UAV data with ground data acquired with a 360° camera.

CASE STUDY
The case study is a historical building in Vogogna, a small center in the Ossola Valley. Vogogna became an important administrative center during the 14th century, when the Visconti from Milan transformed the town into the capital of the lower Ossola Valley. Its center preserves an interesting medieval settlement, with the medieval municipal palace, the remains of the defensive walls, a castle, and the archaeological remains of a fortress. Among the masonry buildings of the ancient centre, the so-called Theatre House ("Casa Teatro") is the object of this study and a resource for the development of the recovery policies promoted by the local government. The building dates back to the 18th century, when a public archive was established in the center of Vogogna. According to the historical analysis (Lossetti Mandelli d'Inveruno, 1926), the Theatre House was built to host the legal acts of the capital of the lower Ossola Valley. The contract for its creation indicates that the building was built on a pre-existing house. The document contains the main recommendations for its typology: four rooms divided into two levels, with an atrium on the ground floor. When Vogogna lost its administrative importance after the Napoleonic period, the archive was moved away and the building changed its function: in 1837 it became a theatre. Used as the house of the associations in the first half of the 20th century, it was abandoned in recent years. Today, it is considered a strategic building for the recovery plan of the municipality concerning the enhancement of the historical centre.

SURVEY STRATEGY
The survey of "Casa Teatro" was conducted in two separate steps. First, the ground acquisition was carried out by recording a 360° video at 5.7K resolution (5760 × 2880). The major benefit of capturing 360° images as video is the speed of the acquisition process. The user can walk around with the camera mounted on a pole like a selfie stick, and the overlap between consecutive frames is automatically ensured. Starting from the recorded 5.7K video, frames can be extracted at a defined rate (e.g., 1 frame per second) or using different criteria. However, it is important to notice that even if the overlap between different frames extracted from a video is generally guaranteed, the ground sampling distance (GSD) may vary significantly depending on the object geometry, the walking speed, and the time interval between successive extracted frames. For this reason, it is recommended to use a sampling interval between frames of a few seconds at most, to enhance image orientation and dense reconstruction. With a 360° field of view, the camera captures every aspect of the scene, enabling complete reconstruction. Special attention needs to be given to object geometries aligned with the camera locations, where the intersection of rays in 3D space may not be reliable. Similarly, the walking speed must be reduced at doors and at any connection between two different rooms. Using such acquisition criteria, the interior and exterior of the "Casa Teatro" were surveyed with the camera Insta One X2. The duration of the video is approximately 4:30 minutes for the acquisition of the outer part of the building and approximately 5:10 minutes for the indoor acquisition. Then, a UAV survey of the Casa Teatro was performed using a lightweight drone, the DJI Mini 2. The drone is equipped with a 1/2.3'' CMOS sensor with a resolution of 12 MP and a field of view of 83°. Two different acquisition configurations were adopted.
In the first configuration, images were acquired with a nadiral attitude (i.e., camera rotated by 90°, pointing toward the ground) along two strips. In the second configuration, images were acquired with an oblique geometry (camera tilt of 35° or less).
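The relation between sampling interval, frame count and GSD discussed above can be illustrated with a minimal sketch. The video frame rate (30 fps) and the camera-to-façade distance (3 m) are assumptions introduced only for illustration and are not reported for this survey; the GSD formula is the simple angular approximation valid along the horizon line of an equirectangular frame, whose width spans 2π radians.

```python
import math


def frame_indices(duration_s: float, fps: float, interval_s: float) -> list:
    """Indices of the frames kept when one frame is extracted every interval_s seconds."""
    step = max(1, round(fps * interval_s))   # video frames between two extracted frames
    total = int(duration_s * fps)            # video frames available
    return list(range(0, total, step))


def equirect_gsd(distance_m: float, width_px: int) -> float:
    """Approximate footprint (m) of one pixel on the horizon line of an
    equirectangular frame: the image width spans 2*pi radians."""
    return distance_m * 2.0 * math.pi / width_px


# Assumed values: 30 fps video, 4:30 min outdoor walk, facade 3 m from the camera.
outdoor = frame_indices(duration_s=270, fps=30, interval_s=1.0)
print(len(outdoor), "frames extracted")
print(round(equirect_gsd(3.0, 5760) * 1000, 2), "mm per pixel")
```

Under these assumptions the GSD grows linearly with the distance from the object, which is why narrow streets favour the ground acquisition while wide open areas do not.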

DATA PROCESSING AND RESULTS
Firstly, the raw files acquired with the 360° camera are downloaded, processed with the camera's proprietary software (Insta Studio) and exported in .MP4 format using the equirectangular (also known as latitude-longitude) projection. Compared to a traditional camera, the 360° solution captures all the surfaces around the camera without privileging a single direction. Metric reconstruction is done with a photogrammetry project in the Agisoft Metashape software. The 360° videos are imported in the software, and a set of keyframes is extracted at a specific frequency. For the presented case study, 1 frame per second was evaluated as a suitable sampling frequency and a total of 586 equirectangular images were extracted from the videos. The images acquired from the drone, both nadiral and oblique (267 in total), are imported in the software too. The equirectangular frames and the drone images are then processed with a traditional photogrammetric workflow (image orientation, ground control point definition, dense point cloud generation) to create a 3D point cloud model of the Casa Teatro. The spherical camera model must be set at the beginning of processing for the 360° images, while a frame camera model must be set for the drone images. The results of the combined image orientation are validated by using ground truth data (Check Points) acquired with a total station, and the final 3D model reconstruction results are compared with a laser scanning acquisition of the building to evaluate the accuracy and completeness of the photogrammetric point cloud (section 5.1). In addition, matching among 360° frames and UAV frames is studied by analysing the connectivity graph for both nadiral direction images and oblique acquisitions (section 5.2).
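As an illustration of the equirectangular (latitude-longitude) projection and of the spherical camera model mentioned above, the following sketch maps a pixel to a unit viewing ray and back. The axis convention (z toward the image centre, y up) and the function names are assumptions chosen for this example; they are not necessarily the convention adopted internally by the processing software.

```python
import math


def pixel_to_ray(u, v, width, height):
    """Unit viewing ray of pixel (u, v) in an equirectangular image.
    Assumed convention: longitude grows with u, latitude is +90 deg at the top row."""
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))


def ray_to_pixel(x, y, z, width, height):
    """Inverse mapping: project a 3D direction onto the equirectangular image."""
    lon = math.atan2(x, z)
    lat = math.atan2(y, math.hypot(x, z))
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v


W, H = 5760, 2880                      # frame size of the 5.7K video
ray = pixel_to_ray(1000.0, 700.0, W, H)
print(ray_to_pixel(*ray, W, H))        # recovers the original pixel up to rounding
```

Because every pixel corresponds to a direction on the full sphere, no principal distance is needed in this model, unlike the frame camera model used for the drone images.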

Comparison with TS and laser scanning data
A set of GCPs and CPs was measured with a total station (TS) to define the metric accuracy of the reconstruction. In total 11 points were measured: 4 points were used as GCPs (red points in Figure 1) and 6 were considered as CPs (yellow points in Figure 1). Residuals on GCPs and CPs are reported in Table 1. The optimization of the project with such additional constraints made it possible to reach a metric discrepancy of 1.0 centimetre on GCPs. GCPs and CPs were identified either on both 360° and UAV images (e.g., points A2, A4, A5, A6, A8, A9, A10) or on 360° images only (e.g., points 12, A1, A3, A7). Points identified on 360° images only were located either in the interior of the building or on vertical surfaces not detectable by the UAV. No significant difference in accuracy is observable between points identified on both 360° and UAV images and points identified on 360° images only.
Considering image residuals (Table 2), residuals on 360° images are generally larger than those on UAV images. In addition, for GCPs and CPs identified on both UAV and 360° images, the residuals on 360° images are significantly larger than those observed for points measured on 360° images only. This is probably because points observed on both UAV and 360° images belong to ground elements (e.g., manhole covers) positioned at the boundary of the equirectangular projection (Figure 2), an area with significant distortions and higher camera calibration residuals.
The metric accuracy of the reconstructed dense point cloud was further tested by comparing it with a reference point cloud acquired with a Faro Focus X130. 13 scans were performed covering the outdoor area and the interior of the "Casa Teatro". The acquired scans were registered by using a set of GCPs measured with the TS, so that the photogrammetric and laser scanning surveys are registered in the same local reference system.
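The check point validation described above amounts to comparing coordinates estimated by the photogrammetric block with those surveyed by total station. A minimal sketch of one common summary statistic, the root mean square error of the 3D discrepancies, is given below. The coordinates are hypothetical placeholders and are not the values of Table 1.

```python
import math


def rmse_3d(surveyed, estimated):
    """Root mean square of the 3D distances between paired points."""
    squared = [
        sum((s - e) ** 2 for s, e in zip(p, q))
        for p, q in zip(surveyed, estimated)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical coordinates (m), NOT the values measured in this survey.
total_station = [(10.000, 5.000, 2.000), (12.500, 7.250, 2.100)]
photogrammetry = [(10.003, 5.004, 2.000), (12.500, 7.247, 2.104)]
print(round(rmse_3d(total_station, photogrammetry), 4), "m")
```

Computing the statistic separately for GCPs and CPs keeps the two roles distinct: GCP residuals describe how well the constraints are met, while CP residuals are an independent estimate of accuracy.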
The comparison between the two point clouds is performed on common areas (i.e., Casa Teatro façades, nearby outdoor areas and indoor rooms). The roof was excluded from the comparison since it was not detected by the terrestrial laser scanning measurements. The comparison was carried out in CloudCompare (https://www.danielgm.net/cc/) by computing the cloud-to-cloud unsigned distance. The larger discrepancies in the indoor area of the building are probably due to its poor illumination (no artificial light is available there). Indeed, since the survey covered both indoor and outdoor areas, the ISO and shutter speed were set to automatic mode to cope with the different illumination conditions (it is not possible to modify those parameters manually while recording the video). In indoor areas, the ISO increased due to the low illumination, resulting in noisier images and consequently in a lower quality of the reconstructed point cloud.
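The idea behind a cloud-to-cloud unsigned distance can be sketched as a nearest-neighbour search from each photogrammetric point to the reference scan. The brute-force version below is only meant to show the principle on a handful of hypothetical points; it is not the implementation used by CloudCompare, which relies on spatial indexing to handle millions of points.

```python
import math


def cloud_to_cloud(compared, reference):
    """Unsigned distance from each compared point to its nearest reference point."""
    return [min(math.dist(p, q) for q in reference) for p in compared]


# Hypothetical points (m): a laser scanning reference and a photogrammetric cloud.
laser = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
photo = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
distances = cloud_to_cloud(photo, laser)
print([round(d, 3) for d in distances])
print("mean:", round(sum(distances) / len(distances), 3), "m")
```

The distance is not symmetric: it is computed from the compared cloud toward the reference, so areas present only in the reference (or only in the compared cloud, such as the roof here) have to be excluded beforehand.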

Matching between UAV and 360° images
A second study assessed the quality and effectiveness of matching between UAV and 360° images by considering image connectivity. Here, image connectivity is defined as the number of common validated matches (tie points) between a pair of images. Connectivity gives a measure of local redundancy: the higher the number of common points between a pair of images, the stronger the link between them. Connectivity may also help in identifying bottlenecks and weak points in the acquisition geometry, i.e., areas not well connected to the remaining portion of the set. For the presented case study, after image alignment, tie points observed in two images only and tie points with high reprojection error were deleted and the image block was re-optimized. This choice was motivated by the fact that only "strong" tie points were to be used as a measure of connectivity in the image block. Figure 4 reports the connectivity matrix for the presented dataset: the green box represents connections between UAV images, the blue box connections among 360° images, and the red boxes connections between UAV and 360° images.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-M-2-2023. 29th CIPA Symposium "Documenting, Understanding, Preserving Cultural Heritage: Humanities and Digital Technologies for Shaping the Future", 25-30 June 2023, Florence, Italy
Figure 4. Connectivity matrix: (a) overall matrix, where the same weight is given to each link (left) and where links are weighted according to the number of matches, the brighter the cell the higher the number of links (right); green identifies links among UAV images, blue links among 360° images and red links between UAV and 360° images; (b) zoom of the red area (links between UAV and 360° images).
As can be observed in Figure 4a, the set of UAV images represents a subset with strong connections (high number of common tie points). Connections are not concentrated only on the main diagonal, meaning that the subset of UAV images is highly interlinked. The set of 360° images shows a lower number of connections between images (Figure 4a) than the UAV image subset. The structure of the connectivity matrix is non-diagonal, showing that the 360° image dataset is also well connected. This is achieved by structuring the acquisition as a path made of a set of interconnected loops. In addition, openings at different levels are used to guarantee further interconnection: the camera is extended out of the windows, adding a further connection between indoor and outdoor images. Figure 4a (red areas) and Figure 4b show the presence of valid matches between UAV and 360° images. To study the connectivity between UAV and 360° images, a set of connectivity graphs was created for this portion of the connectivity matrix. A connectivity graph shows the links among different entities: each entity is represented as a node of the graph and a connection is represented as a line linking two nodes. For the presented case study, the images (UAV and 360°) are the nodes of the graph, while the existence of valid matches between images is reported as a link between the corresponding nodes. To have a clearer view of the matches, UAV images were divided into nadiral and oblique, and their connectivity with 360° images was studied separately.
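The construction of a connectivity matrix from tie points can be sketched as follows. Each tie point is described by the list of images in which it was observed and by its reprojection error; tie points seen in two images only or with a high reprojection error are discarded before counting, in the spirit of the filtering described above. The data layout, the image names and the 1-pixel error threshold are assumptions made for this example and do not reproduce the export format or the threshold used in the actual project.

```python
from collections import Counter
from itertools import combinations


def connectivity(tracks, max_error_px=1.0):
    """Count common 'strong' tie points for every image pair.
    tracks: iterable of (image_names, reprojection_error_px)."""
    links = Counter()
    for images, error_px in tracks:
        images = sorted(set(images))
        # Discard weak tie points: seen in two images only, or high error.
        if len(images) < 3 or error_px > max_error_px:
            continue
        for pair in combinations(images, 2):
            links[pair] += 1
    return links


# Hypothetical tie points ("s" = 360° frame, "uav" = drone image).
tracks = [
    (["uav1", "uav2", "s1"], 0.4),
    (["uav1", "s1"], 0.2),          # dropped: observed in two images only
    (["uav1", "s1", "s2"], 0.5),
    (["uav2", "s1", "s2"], 3.0),    # dropped: reprojection error too high
]
for pair, n in sorted(connectivity(tracks).items()):
    print(pair, n)
```

Each entry of the resulting counter corresponds to one off-diagonal cell of the connectivity matrix, and each pair with a non-zero count to one link of the connectivity graph.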
Figure 5 shows the connectivity graph between 360° images and UAV nadiral direction images. For this subset, 110 images at 360° have valid links with 116 UAV images. For a simpler visualization, in Figure 5 the nodes representing images at 360° are placed in the central row of the graph, while the nodes representing UAV images are displayed in the top and bottom rows. The link colour highlights the number of valid matches between a pair of images (i.e., fewer than 5 matches: red; between 5 and 15 matches: yellow; more than 15: green).
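The colour coding of the links and the count of images involved in cross-platform links can be expressed compactly. The sketch below uses the thresholds stated above; the dictionary of links and the image names are hypothetical.

```python
def link_colour(n_matches: int) -> str:
    """Colour class of a link from its number of valid matches."""
    if n_matches < 5:
        return "red"
    if n_matches <= 15:
        return "yellow"
    return "green"


def cross_platform_summary(links):
    """links: {(image_360, image_uav): n_matches}.
    Returns the number of 360° and UAV images involved and the links per colour."""
    images_360 = {a for a, _ in links}
    images_uav = {b for _, b in links}
    colours = {}
    for n in links.values():
        colour = link_colour(n)
        colours[colour] = colours.get(colour, 0) + 1
    return len(images_360), len(images_uav), colours


# Hypothetical links between 360° frames and UAV images.
links = {("s1", "uav1"): 3, ("s1", "uav2"): 9, ("s2", "uav2"): 22}
print(cross_platform_summary(links))
```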
Looking at the connectivity graph, it is possible to observe two different groups of matches (identified in Figure 5 with letters A and B). Elements inside each group are well connected with each other (especially in group B), while the connections between the two groups are weaker: only 5 UAV images connect 360° images of both groups. The top part of Figure 6 highlights the spatial position of the two groups. A closer look at the organization of the graph shows that the two matching groups (A and B) involve images covering the areas with larger streets at the two sides of the "Casa Teatro". For those two groups, the majority of the links between nadiral direction UAV images and 360° images involve points on the paving of the streets; the bottom part of Figure 6 presents two typical examples of matches between UAV and 360° images. As previously mentioned, points belonging to the paving are positioned at the boundary of the equirectangular projection. The significant cartographic distortion observable in those areas probably influences (i.e., reduces) the number of valid matches identified automatically during the image alignment phase. The second group of images analysed is composed of oblique UAV images and 360° images. For this subset, 112 images at 360° have valid links with 86 UAV images. Figure 7 presents the connectivity graph for this second subset. The high connectivity of this subset can be noticed both in the high number of links between the different images and in the number of valid matches observed in each link. In the graph presented in Figure 7, the link colour highlights the number of valid matches between a pair of images using the same criteria previously used for nadiral direction images.
Looking at the spatial distribution of this subset (Figure 8, left), it consists of images facing the main façade of the "Casa Teatro", which was the only façade for which it was possible to acquire oblique UAV images, thanks to the availability of a larger street. For this subset, the typical distribution of matches is presented in Figure 8 (right). Valid matches are generally points belonging to the building façade.
The study of the connectivity between UAV and 360° images showed the feasibility of establishing valid matches between them. In particular, the connectivity matrix showed the high local redundancy that can be achieved by combining UAV and 360° images. Considering the different types of UAV images, the number of matches is influenced by the direction of the UAV images. Indeed, even if both nadiral direction and oblique images present matches with the 360° images acquired on the ground, nadiral direction images show a lower number of valid matches, and those matches are located at the lower boundary of the equirectangular projection, an area characterized by larger residuals in camera calibration. On the other hand, the number of matches with oblique images is generally higher, even if in this case too the majority of matches involve the upper floors of buildings, which lie at the top boundary of the equirectangular projection.
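The distortion at the top and bottom boundaries of the equirectangular projection, recalled above as a likely cause of fewer valid matches, can be quantified with the horizontal stretch factor 1/cos(latitude) that is a general property of this projection. The sketch below evaluates it for a few image rows of a 2880-pixel-high frame. It illustrates the geometric stretching only and is not the camera calibration residual analysis of this study.

```python
import math


def horizontal_stretch(v: float, height: int) -> float:
    """Horizontal stretch of an equirectangular image at row v (1.0 on the horizon).
    A parallel at latitude lat is mapped to the full image width although its
    true length shrinks with cos(lat), hence the 1 / cos(lat) factor."""
    lat = (0.5 - v / height) * math.pi
    return 1.0 / math.cos(lat)


H = 2880
for row in (1440, 2160, 2640, 2800):   # from the horizon toward the bottom boundary
    print(row, round(horizontal_stretch(row, H), 2))
```

The factor grows rapidly toward the last rows, so the street paving right below the camera and the upper floors right above it are imaged with a geometry very different from that of the corresponding UAV frame images.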
Figure 7. Connectivity graph between 360° images (centre) and UAV oblique images (top and bottom). Link colour highlights the number of valid matches between images. Higher connectivity with respect to UAV nadiral direction images is identified.
Figure 8. Image matching identified between 360° images and UAV oblique images: images showing matches (left) and example of matches between an image pair (right).

CONCLUSIONS
Digitization can be an important opportunity for small historic urban areas and small towns. Indeed, the availability of digital tools may help such small towns to face challenges that may threaten their sustainability in the future. However, digitization can be difficult to sustain from an economic point of view for areas with low income and budget. For this reason, one essential factor in sustaining the digital transformation of those areas is the availability of low-cost methods to map and classify the built heritage. In this paper we presented a method based on low-cost sensors (the cost of each sensor is less than 500 €) combining ground and aerial (UAV) data. Combining the data obtained from these two sources presents specific challenges in small town centers, and this paper presented a solution to the specific issues posed by narrow streets in order to comprehensively reconstruct a portion of a small town's city center. The ground data are captured using 360° (spherical) cameras, which have been proven to accurately reconstruct the building façades of historic city centres, while the context is provided by UAV data. The combined reconstruction was tested on the "Casa Teatro", an abandoned building in Vogogna, a small town in the Ossola Valley in Northern Italy. The effectiveness of the reconstruction in terms of data quality was tested by using ground truth data (Check Points) acquired with a total station and by comparing the final 3D model reconstruction results with a laser scanning acquisition of the building. The paper also investigated in more detail the characteristics of the matching between the 360° frames and UAV frames by analyzing connectivity graphs. The study found that valid matches can be established between UAV images and 360° images, and the connectivity matrix showed high local redundancy when combining them.
The direction of UAV images influenced the number of matches, with oblique images generally having more matches than nadiral images. Matches with both nadiral and oblique images were located at the boundary (respectively bottom and top) of the equirectangular projection, which has larger residuals in camera calibration. The influence of this factor in valid link definition will be tested in future works.