DEEP LEARNING TO SUPPORT 3D MAPPING CAPABILITIES OF A PORTABLE VSLAM-BASED SYSTEM

The use of vision-based localization and mapping techniques, such as visual odometry and SLAM, has become increasingly prevalent in the field of Geomatics, particularly in mobile mapping systems. These methods provide real-time estimation of the 3D scene as well as of the sensor's position and orientation, using images or LiDAR sensors mounted on a moving platform. While visual odometry primarily focuses on the camera's position, SLAM also creates a 3D reconstruction of the environment. Conventional (geometric) and learning-based approaches are used in visual SLAM, with deep learning networks being integrated to perform semantic segmentation, object detection and depth prediction. The goal of this work is to report ongoing developments to extend the GuPho stereo-vision SLAM-based system with deep learning networks for tasks such as crack detection, obstacle detection and depth estimation. Our findings show how a neural network can be coupled to SLAM sequences in order to support 3D mapping applications with semantic information.


INTRODUCTION
Vision-based localization techniques, such as visual odometry (VO) and Simultaneous Localization And Mapping (SLAM), are becoming increasingly common in Geomatics and are a key component of many mobile mapping systems, especially portable ones (Torresani et al., 2021a; Otero et al., 2020; Nocerino et al., 2019a; Blaser et al., 2018; Schöps et al., 2017; Nüchter et al., 2015). VO and SLAM provide real-time estimation of the position and orientation of a sensor moving in an environment, based solely on a sequence of images or LiDAR profiles captured by one or more sensors rigidly mounted on a platform. They are often combined with other positioning systems such as GNSS and IMU to provide a seamless and more robust navigation and mapping solution. While VO primarily focuses on the camera's position, reconstructing sensor trajectories, SLAM also creates a sparse, semi-dense or dense 3D reconstruction of the environment (Yang et al., 2022; Taketomi et al., 2017; Scaramuzza and Fraundorfer, 2011). SLAM-based 3D surveying is nowadays used in multiple applications and fields: underwater mapping (Nocerino et al., 2018), rail tunnel inspection (Panella et al., 2020), exploration (Steenbeek and Nex, 2022), autonomous driving (Singandhupe and La, 2019), Augmented Reality (Torresani et al., 2021b), etc.
The aim of this work is to introduce the ongoing developments to extend our stereo-vision, SLAM-based, lightweight and modular system, called GuPho (Menna et al., 2022; Torresani et al., 2021), with deep learning neural networks in order to perform:
• Semantic segmentation, e.g., for crack detection: when the system is used in monitoring or inspection tasks, it identifies cracks in structures in real time; by leveraging stereo vision, metric information can be retrieved;
• Object detection, e.g., of rocks: when GuPho is used to automatically guide a moving robot, the detection of obstacles is fundamental for avoidance and re-routing;
• Monocular Depth Estimation (MDE): depth prediction is useful to improve scene understanding, support autonomous navigation and complement conventional MVS methods in textureless areas.
The paper is organized as follows: Section 2 briefly recalls the low-cost, lightweight and portable modular prototype system, GuPho. Section 3 reports single, stereo or multi-sensor SLAM solutions for 3D mapping purposes. Deep learning solutions are presented in Section 4. Data preparation is discussed in Section 5, whereas experiments, evaluations and results are presented in Section 6. Finally, Section 7 concludes the paper.
Figure 1: The GuPho stereo-vision system for real-time 3D mapping in its handheld (a) and robotic (b) version.

THE GUPHO SYSTEM
GuPho (Guided Photogrammetry system) is a low-cost, lightweight and portable modular prototype system based on stereo vision and vSLAM methods (Menna et al., 2022; Torresani et al., 2021a; Di Stefano et al., 2021). GuPho is equipped with a Raspberry Pi 4 Model B, with a Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.8 GHz and 8 GB RAM. It was developed to provide real-time guidance to the surveyor during the image capturing phase, ensuring a more reliable and effective photogrammetric data acquisition and processing. GuPho can use rectilinear or fisheye lenses to survey indoor or outdoor scenarios, including underwater environments. Real-time 3D mapping capabilities are provided through OpenVSLAM (Sumikura et al., 2019), which builds upon ORB-SLAM2 (Mur-Artal and Tardós, 2017). Real-time computation and visualisation capabilities are used to provide visual feedback to users, including camera-to-object distance warnings to guarantee the expected ground sample distance (GSD) and speed warnings to avoid motion blur. The system also uses a novel automatic exposure algorithm that exploits 3D information of the observed scene. Figure 1 shows the realized GuPho system, either in its handheld version or coupled to a ground robot (Leo Rover 1) for autonomous navigation and 3D mapping.
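The distance and speed warnings mentioned above can be illustrated with the standard pinhole relation between GSD, camera-to-object distance, focal length and pixel pitch. The following is only a minimal sketch under stated assumptions: the function names, the one-pixel blur tolerance and all numeric values are hypothetical and are not taken from the GuPho implementation.

```python
def ground_sample_distance(distance_m, focal_length_mm, pixel_pitch_um):
    """GSD in metres per pixel for a fronto-parallel surface (pinhole model)."""
    return distance_m * (pixel_pitch_um * 1e-6) / (focal_length_mm * 1e-3)


def max_distance_for_gsd(target_gsd_m, focal_length_mm, pixel_pitch_um):
    """Largest camera-to-object distance that still meets the target GSD."""
    return target_gsd_m * (focal_length_mm * 1e-3) / (pixel_pitch_um * 1e-6)


def motion_blur_px(speed_m_s, exposure_s, gsd_m):
    """Approximate blur (in pixels) caused by lateral motion during exposure."""
    return speed_m_s * exposure_s / gsd_m


def acquisition_warnings(distance_m, speed_m_s, exposure_s,
                         focal_length_mm, pixel_pitch_um,
                         target_gsd_m, max_blur_px=1.0):
    """Return the list of warnings a guidance system could raise.

    The 1-pixel blur tolerance is an assumed default, not a GuPho setting.
    """
    gsd = ground_sample_distance(distance_m, focal_length_mm, pixel_pitch_um)
    warnings = []
    if gsd > target_gsd_m:
        warnings.append("too far")
    if motion_blur_px(speed_m_s, exposure_s, gsd) > max_blur_px:
        warnings.append("too fast")
    return warnings


if __name__ == "__main__":
    # Hypothetical camera: 4 mm lens, 4 micron pixels, 1 mm target GSD.
    print(max_distance_for_gsd(1e-3, 4.0, 4.0))
    print(acquisition_warnings(2.0, 0.5, 0.01, 4.0, 4.0, 1e-3))
```

In such a scheme the measured distance would come from the stereo/SLAM depth of the observed scene and the speed from the estimated trajectory; both are inputs here rather than computed quantities.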

SLAM SOLUTIONS
The literature is populated by single, stereo or multi-camera photogrammetric systems designed for portable mobile mapping applications and SLAM processing (Perfetti and Fassi, 2022; Torresani et al., 2021; Ortiz-Coder and Sánchez-Ríos, 2020; Meyer et al., 2020; Menna et al., 2019; Nocerino et al., 2019b; Koehl et al., 2016; Teo et al., 2015; Shortis et al., 2007). For mobile mapping applications, real-time processing is mandatory and is performed with SLAM approaches (Lai, 2022). In particular, vSLAM can be divided into two categories: traditional and learning-based vSLAM (Chen et al., 2022). Traditional vSLAM uses geometric features, such as points and lines extracted from the images, or the pixel intensity values to understand and map the environment. Learning-based vSLAM methods rely on deep learning-based feature descriptors (Bruno and Colombini, 2021), hybrid methods (Tang et al., 2019) or complete end-to-end approaches (Wang et al., 2017). Convolutional Neural Networks (CNNs) have also been integrated into the SLAM pipeline (Tateno et al., 2017): the camera pose is estimated by minimizing the photometric error, whereas learning is used to compute depth information. Steenbeek and Nex (2022) proposed a similar concept applied to UAV video sequences.
Novel approaches are also integrating Neural Radiance Fields (NeRF; Mildenhall et al., 2021) into the SLAM pipeline in order to offer new geometric and photometric 3D mapping solutions for accurate and real-time scene reconstruction from monocular images (Rosinol et al., 2022). Sucar et al. (2021) introduced iMAP, the first real-time NeRF-based dense online SLAM model that optimizes camera pose and the implicit scene representation in a hand-held RGB-D camera system. The iMAP system employs an iterative two-step approach of tracking and mapping and utilizes keyframe selection. Zhu et al. (2022) introduced NICE-SLAM, a dense RGB-D SLAM system that uses a hierarchical scene representation incorporating information at multiple levels and pre-trained geometric priors, resulting in detailed reconstructions of large indoor scenes that are more scalable, efficient and robust than other recent SLAM systems using neural networks. The subsequent NICER-SLAM (Zhu et al., 2023) is a dense RGB SLAM system that optimizes for camera poses and a hierarchical neural implicit map representation, which allows for high-quality novel view synthesis. The system incorporates additional supervision signals, including monocular geometric cues and optical flow, and a simple warping loss to enforce geometric consistency. SLAM algorithms have also been coupled to neural networks to enhance recognition capabilities in images or classification algorithms in 3D space (Pillai and Leonard, 2015; Zhag et al., 2018; Duan et al., 2019).

DEEP LEARNING SOLUTIONS
In recent years, machine learning techniques have been applied to images or point clouds with promising results. Convolutional neural networks (CNNs) and other deep learning models can provide high accuracy, recall and prediction speed, allowing for real-time use in SLAM-based applications. Our developments focused on coupling deep learning methods to image sequences acquired by the GuPho system for semantic segmentation and object detection as well as monocular depth estimation (Section 4.3).
• Deep Learning for semantic segmentation and object detection: we rely on YOLOv8 (Ultralytics, 2023), designed to detect and localize objects within images or video frames. It can be re-trained to detect a wide range of (new) objects, ensuring real-time performance at 30 fps or higher on a mid-range GPU. YOLOv8 is based on multiple layers of convolutional and pooling operations, followed by several fully connected layers. The network takes an input image and processes it through the layers, gradually learning to recognize and locate objects within the image. We retrained the network to adapt it to our scenarios.
• Deep Learning for MDE: we build upon MiDaS (Ranftl et al., 2022), which was shown to clearly outperform competing methods across diverse datasets. It includes a flexible loss function and a robust training objective invariant to changes in depth range and scale, advocating the use of principled multi-objective learning to combine data from different sources.
1 https://www.leorover.tech/

Instance segmentation and object detection
The primary objective of object detection is to identify the (precise) location of the various objects present in a given scene and to assign relevant labels to the bounding boxes of these objects. Instance segmentation, on the other hand, identifies and labels individual objects in an image, and their components, at pixel level. This allows for a more precise understanding of objects and their relationships. A state-of-the-art family of neural networks for object detection in images is YOLO. The YOLO (You Only Look Once) algorithm (Redmon et al., 2016) was introduced as an object detection method able to achieve both high precision and speed. YOLO differs from traditional classifier-based detectors in that it examines the image just once to identify the objects within it. YOLO gained rapid popularity due to its high speed and accuracy in object detection and image segmentation.
As a one-stage object detector, YOLO directly predicts the bounding boxes and class probabilities of objects in a single pass through the network. Such models are known for their speed and efficiency, making them well suited for real-time applications. Different variants of YOLO (Redmon and Farhadi, 2017; Bochkovskiy et al., 2020; Wang et al., 2022; Ultralytics, 2023) were released throughout the years (Figure 2), with successive improvements in terms of speed, accuracy, efficiency and generalization. Tiny implementations of YOLO on single-board devices (e.g., Raspberry Pi, Jetson, etc.) were also proposed (Ayoub and Schneider-Kamp, 2021; Chan et al., 2022).
With respect to the latest version, YOLOv8 (Ultralytics, 2023), there are five models (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x) for detection, segmentation and classification.YOLOv8n is the fastest and smallest, while YOLOv8x is the most accurate yet the slowest among them.

Monocular depth estimation
Although humans find it easy to estimate the depth of a scene from a single image, it is a challenging task for computational models because the problem is ill-posed and resource-demanding. Monocular Depth Estimation (MDE) refers to the process of estimating depth from a single RGB image (Ming et al., 2021). Being able to estimate depth from a single image has several benefits, such as aiding scene comprehension, 3D modelling, robotics, autonomous driving, etc. The recovery of depth information is particularly important in these applications when other information, such as stereo images, optical flow or point clouds, is not available. Real-time depth estimation has traditionally been performed using stereo images or video sequences (Ha et al., 2016; Kong and Black, 2015; Cheng and Huang, 2015; Karsch et al., 2015). However, these methods are resource-intensive and require more data than monocular depth estimation. Therefore, MDE has become increasingly popular, leading to the development of several deep learning methods. These methods do not rely on hand-crafted features and utilize deep convolutional neural networks.
Among the different networks tested, we chose the Zero-shot Transfer by Combining Relative and Metric Depth (ZoeDepth) framework (Bhat et al., 2023): it combines metric depth estimation and relative depth estimation (RDE) in a two-stage framework (Figure 3). In the first stage, an encoder-decoder structure is trained to estimate relative depths from the input image. This model is trained on a large variety of datasets, which improves its generalization to different scenes and environments. It builds upon the MiDaS (Ranftl et al., 2020) training strategy for relative depth prediction, which uses a loss that is invariant to scale and shift. In the second stage, components responsible for estimating metric depth are added as an additional head. This stage refines the depth estimates by incorporating metric depth information, i.e., absolute distances in the scene expressed in metric units.
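Because a scale- and shift-invariant loss leaves relative depth predictions defined only up to an affine transformation, a relative depth map can be brought to metric units by fitting a scale and a shift against a few reference depths. The sketch below shows this least-squares alignment in closed form; it is an illustration only, not the ZoeDepth metric head, and the idea of using stereo-derived depths as references is an assumption. Note that MiDaS-style networks typically predict in inverse-depth space, in which case the same fit would be applied to inverse depths.

```python
def fit_scale_shift(pred, target):
    """Least-squares scale s and shift t minimising sum((s*p + t - d)^2).

    pred   : relative depth values at pixels where a reference is known
    target : reference metric depths at the same pixels (hypothetical source:
             e.g. sparse stereo or SLAM points)
    """
    n = len(pred)
    if n < 2 or n != len(target):
        raise ValueError("need at least two matching samples")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("predictions are constant; scale is undetermined")
    cov = sum((p - mean_p) * (d - mean_t) for p, d in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


def apply_scale_shift(pred, scale, shift):
    """Map relative predictions to metric values with the fitted parameters."""
    return [scale * p + shift for p in pred]


if __name__ == "__main__":
    s, t = fit_scale_shift([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
    print(s, t)
    print(apply_scale_shift([4.0], s, t))
```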

DATA PREPARATION FOR OBJECT DETECTION
The datasets used to evaluate the neural networks were acquired with GuPho using rectilinear or fisheye lenses. The image sequences have a resolution of 1280 × 1024 pixels and feature cracks in asphalt or cement surfaces, sidewalks or building walls, off-road paths with rocks, tunnels with fallen obstacles, etc. Given our objects of interest, manual image annotation was necessary to improve detection performance. Stones were annotated using bounding boxes (Figure 4a-b), whereas cracks were annotated using polygons (Figure 4c-d). The latter type of annotation is useful for detecting irregularly shaped objects and provides more precise information about the object's shape. In order to boost the model's performance, some

EXPERIMENTAL EVALUATION AND RESULTS
YOLOv8 was chosen due to its accuracy and speed in comparison with other versions. Owing to computational limitations, the learning-based functionalities are applied to monocular images of GuPho.
The extracted semantic information, coupled with the stereo-vision capabilities of GuPho, makes it possible to retrieve metric information and deliver added-value 3D mapping results. For these initial tests, the processing and analyses were performed "offline", using a 12th Gen Intel® Core™ i9-12950HX 2.30 GHz with 32 GB RAM and an NVIDIA RTX A3000 12 GB GPU. For stone detection, various variants of the YOLOv8 model were tested (Table 1), leading to the conclusion that YOLOv8s performed best for detecting the stones. Specifically, the highest detection accuracy was achieved after 150 epochs, with a Recall of 0.70 and a mAP of 0.64. Processing and inference times are also important, and here too YOLOv8s performed best (0.3 ms for processing and 8 ms for inference). Results indicated that there was no significant improvement in detection accuracy beyond 150 epochs. For monocular depth estimation, a single camera sequence from GuPho was considered. We chose ZoeDepth, the two-stage framework that combines metric and relative depth estimation. As shown in Figure 6, the learning-based approach performs well on our dataset (for rectilinear or fisheye lenses) and could be coupled to conventional photogrammetric approaches for depth estimation.
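How stereo vision can turn an image-space detection into metric information can be sketched with the classic rectified-stereo relation Z = f·B/d, followed by scaling a pixel extent by Z/f. This is a simplified illustration assuming rectified rectilinear images and an object roughly parallel to the image plane; the function names and the numbers in the example are hypothetical and do not describe the GuPho processing chain.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth Z (m) of a point from its stereo disparity: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


def metric_extent(extent_px, depth_m, focal_length_px):
    """Approximate metric size of an image-space extent at a given depth.

    Assumes the object is roughly fronto-parallel to the image plane.
    """
    return extent_px * depth_m / focal_length_px


if __name__ == "__main__":
    # Hypothetical values: 1000 px focal length, 0.2 m baseline,
    # a detected crack spanning 100 px observed with 50 px disparity.
    z = depth_from_disparity(50.0, 1000.0, 0.2)
    print(z)
    print(metric_extent(100.0, z, 1000.0))
```

For fisheye lenses the pinhole relation does not hold directly, so the images would first need to be undistorted or a suitable camera model used.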

CONCLUSIONS
The paper introduced an extension of the stereo-vision, SLAM-based, lightweight and modular GuPho system with deep learning neural networks in order to perform semantic segmentation, object detection and depth estimation. We focused on rock detection (to aid autonomous navigation and obstacle avoidance in robotics applications), crack detection (to support structural monitoring and inspection) and depth prediction (to complement conventional stereo-vision methods in areas with non-collaborative surfaces). Our findings show how a neural network can be coupled to SLAM sequences in order to support 3D mapping applications with semantic information.
In order to achieve on-board real-time processing of both SLAM and deep learning tasks, we plan to extend GuPho with a more powerful board (e.g., NVIDIA Jetson Nano, equipped with an NVIDIA GPU with 128 CUDA cores) to allow computationally intensive tasks on the GPU. The final aim in the long run is to transform GuPho into an intelligent system that can automatically and swiftly identify objects and obstacles in real-time. GuPho can be operated manually to identify damages on man-made structures or can navigate a robotic platform in challenging environments, like forests or tunnels. Incorporating deep learning methods, GuPho will obtain a profound and intelligent understanding of its surroundings for application and deployment in various fields.
Blaser, S., Cavegn, S., Nebiker, S., 2018. Development of a portable high performance mobile mapping system using the

Figure 2: Timeline of You Only Look Once (YOLO) variants (Zhang, 2023).
Besides the identification and tracking of people or animals (Kajabad and Ivanov, 2019; Tang et al., 2023), the YOLO network has been used to detect pavement or sidewalk cracks (Yang et al.,

Figure 3: The ZoeDepth architecture. An RGB image is fed into the MiDaS depth estimation framework to predict depth (after Bhat et al., 2023).

Figure 4: Annotated images: stones in a tunnel (a-b), cracks in tiles (c) or on asphalt (d, captured by fisheye lens).
To evaluate detection results, metrics like Recall (R) and mean Average Precision (mAP) are used:
$R = \frac{TP}{TP + FN}$, $P = \frac{TP}{TP + FP}$, $AP = \sum_{n} (R_n - R_{n-1}) P_n$, $mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$
where TP is the true positive, FN is the false negative and FP is the false positive; P is the precision, N is the number of classes and i is the corresponding class. Some of the prediction/detection results on test images are shown in Figure 5. Each predicted stone is recognised by a bounding box and a confidence score which shows how likely the box contains an object of interest and how confident the classifier is about it. Predicted cracks are shown with a bounding box, a polygon mask and also a confidence score. The confidence threshold was set to 0.25, i.e., the minimum score for which the model considers the prediction to be a true prediction.
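The evaluation metrics and the confidence threshold described above can be illustrated with a short sketch. It assumes that each detection has already been matched against the ground truth (for instance with an IoU criterion, which is an assumption here) and is given as a (confidence, is_true_positive) pair. Average Precision is computed as the plain sum of precision weighted by recall increments; evaluation toolkits often use an interpolated variant, so values may differ slightly from those reported by a given framework.

```python
def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def average_precision(detections, n_ground_truth, conf_threshold=0.0):
    """AP = sum_n (R_n - R_{n-1}) * P_n over detections sorted by confidence.

    detections     : list of (confidence, is_true_positive) pairs
    n_ground_truth : number of annotated objects of this class
    conf_threshold : detections scoring below this value are discarded
    """
    if n_ground_truth == 0:
        return 0.0
    kept = sorted((d for d in detections if d[0] >= conf_threshold),
                  key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, is_tp in kept:
        if is_tp:
            tp += 1
        else:
            fp += 1
        r = tp / n_ground_truth
        ap += (r - prev_recall) * precision(tp, fp)
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP = (1/N) * sum of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    # Hypothetical detections for one class with two annotated objects.
    dets = [(0.9, True), (0.8, False), (0.7, True)]
    print(average_precision(dets, n_ground_truth=2))
    print(average_precision(dets, n_ground_truth=2, conf_threshold=0.75))
```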

Figure 5: Results of rocks (above) and cracks (below) detection in some images of a GuPho sequence (b).

Table 1: Object detection evaluation of YOLOv8.
For crack detection, the YOLOv8x model was found to be the fastest (0.2 ms for processing) and the most accurate one (R of 0.73) after 100 epochs (Table 2). Figure 5 reports some detection results on GuPho frames from sequences in the field.