OCTREE-BASED APPROACH FOR REAL-TIME 3D INDOOR MAPPING USING RGB-D VIDEO DATA

: 3D indoor mapping is becoming increasingly critical for a variety of applications such as path planning and navigation for robots. In recent years, there is a growing interest in how low-cost sensors, such as monocular or depth cameras, can be used for 3D mapping. In our paper, we present an octree-based approach for real-time 3D indoor mapping using a handheld RGB depth camera. One benefit of the generated octree map is that it requires less storage and computational resources than point cloud models. Moreover, it explicitly represents free space and unmapped areas, which are essential for the robot's navigation tasks. In this work, on the basis of the ORB-SLAM3 system (Campos et al., 2021), we developed an octree mapping system, which directly calls the keyframes and estimated poses provided by ORB-SLAM3 algorithms. Furthermore, we used point cloud library (PCL) for the dense point cloud mapping and then OctoMap for the point cloud to octree map conversion. Finally, we implemented an efficient probabilistic 3D mapping in the robot operating system (ROS) environment. We used the TUM RGB-D dataset to evaluate the estimated trajectories of the camera. The evaluation shows an average translational RMSE of 5.9 cm on the TUM RGB-D dataset. Besides, we also compared the ground truth point clouds and our generated point clouds. The result shows the mean cloud-to-cloud distance in the corridor scene is about 6 cm. All the evaluation results show our proposed approach is a promising solution for advanced indoor voxel mapping and robotic navigation systems.


INTRODUCTION
With the increasing demand for robotics and autonomous systems, 3D indoor mapping is becoming more critical for various applications such as robot indoor navigation, building inspection, augmented reality (AR), and virtual reality (VR).Especially in complex and dynamic environments, to quickly generate and maintain accurate 3D maps with low-cost sensors.Due to the low cost and the intuitive approach to create a 3D map, many visual simultaneous localization and mapping (vSLAM) systems have been published in the past decades, such as LSD-SLAM (Engel et al., 2013), RGB-D SLAM (Endres et al., 2014) and ORB-SLAM3 (Campos et al., 2021).This paper proposes an octree-based approach for real-time 3D indoor mapping using RGB-D video data.The proposed approach extends ORB-SLAM3 (Campos et al., 2021) algorithm with RGB-D video data captured by Intel RealSense D455 camera to generate an octree-based voxel map in real-time.It utilizes an octree structure to divide the indoor space into smaller regions, allowing for efficient processing of data and reduced computational cost.
The main contribution of this paper is that we extend the stateof-the-art ORB-SLAM3 (Campos et al., 2021) algorithm to octree-based voxel mapping.We utilize an octree structure for spatial partitioning of 3D point clouds, enabling a real-time 3D indoor voxel mapping in a robot operating system (ROS).Our extended system can smoothly run with an Intel RealSense D455 camera.We evaluate the extended system using TUM RGB-D benchmark dataset (Sturm et al., 2012) and a set of ground truth point clouds scanned by an industrial laser mobile mapping system.
The experimental results demonstrate that the proposed system is effective in generating the octree map and occupancy grid map required for robot navigation in real time.This work provides a promising solution for indoor environment mapping tasks that require real-time performance and lightweight 3D mapping, which could be further applied in robotic indoor autonomous navigation, AR and VR applications.This paper consists of three main parts: 1. We extend the ORB-SLAM3 (Campos et al., 2021) algorithm for real-time 3D indoor mapping using an octreebased approach.
2. We apply the proposed system and evaluate its accuracy and efficiency in real indoor environments with varying levels of complexity and clutter.
3. We compare the performance of the proposed system with other advanced methods for 3D indoor mapping, in terms of accuracy and efficiency.This paper is organized as follows.Section 2 gives a brief review of the main algorithms in the field of RGB-D based SLAM.Section 3 describes the proposed octree-based indoor mapping system and its three main components.The evaluation of the proposed approach using datasets with ground truth is presented in Section 4, along with a discussion of the obtained results.Section 5 makes a conclusion of this work and provides an outlook on future research directions.

RELATED WORK
In the following, we review the literature related to RGB-D based SLAM systems, and highlight the main milestones achieved in the field of RGB-D SLAM.
The RGB-D camera-based SLAM algorithm was first proposed by (Henry et al., 2010).This paper explores the potential applications of such cameras in the field of robotics, particularly for the purpose of constructing dense 3D maps of indoor environments.The proposed algorithm detects features using the scale invariant feature transform (SIFT) (Lowe, 2004) method and extracts descriptors from two adjacent RGB frames.Depth information is then added to generate 3D-3D feature point pairs information.The random sample consensus (RANSAC) (Fischler and Bolles, 1981) method is utilized to align the 3D-3D matched point pairs and to derive the corresponding transformation matrix.The motion transformation is subsequently optimized using the iterative closest point (ICP) method (Besl and McKay, 1992).To achieve global optimization of the 3D map, the tree-based network optimizer (TORO) algorithm (Grisetti et al., 2008) is employed at the back end.Finally, the view-based loop closure detection is added to the algorithm to obtain globally consistent 3D maps.
To overcome the slow processing speed of the SIFT method for feature extraction, an improved algorithm was proposed by (Henry et al., 2012).Instead of SIFT, the features from accelerated segment test (FAST) (Bay et al., 2006) method based on accelerated segments is used.Additionally, the sparse beam adjustment (SBA) method (Triggs et al., 1999), with better performance, is used for global optimization instead of the TORO algorithm (Grisetti et al., 2008).Furthermore, scene identification is used to improve the efficiency of loop closure detection.
A handheld RGB-D SLAM system was developed by (Engelhard et al., 2011), from the Department of Computer Science at the University of Freiburg , utilizing a Kinect camera as a sensor.The feature detection and extraction of feature descriptors from RGB images are conducted at the front end of the SLAM algorithm using the speeded up robust features (SURF) algorithm (Bay et al., 2008).Feature matching is carried out between adjacent RGB frames, and the 3D spatial coordinates of matched feature points are calculated using depth information from the depth image.For motion estimation and optimization, the RANSAC method is used to estimate the motion between two frames, and a modified ICP method (Segal et al., 2010) is used to optimize the motion transfer matrix of the camera.The pose graph solver (Grisetti et al., 2010) is used to globally optimize the motion transfer matrix, resulting in the optimal global pose.The output of the system is a globally consistent 3D model of the perceived surrounding, represented as a colored point cloud.
Later, (Endres et al., 2012) developed a new RGB-D SLAM system.In the new system, they proposed to use the open-source general graph optimization (g2o) library (Kümmerle et al., 2011) for global pose graph optimization to generate a global dense 3D model of the environment.Finally, the point cloud was converted into a voxel representation through OctoMap, an octree-based 3D mapping Framework.This resulted a volumetric 3D map of the environment, which can be utilized for robot localization, path planning, and navigation.In our research, we will compare with this RGB-D SLAM system in the context of trajectories estimation.
To evaluate RGB-D SLAM algorithms proposed by researchers from various institutions and universities around the world, researchers from the Technical University of Munich and the University of Freiburg produced a standard RGB-D SLAM dataset (Sturm et al., 2012).They used a motion capture system to record the accurate and time-synchronized poses data of the Microsoft Kinect depth camera.These data are used as ground truth and can be compared with the poses estimated by the RGB-D SLAM algorithms.The image sequences from the Microsoft Kinect depth camera contain both color and depth images at full sensor resolution (640 × 480) with a video frame rate (30 Hz).Finally, a total of 39 sequences were recorded across an office environment and an industrial hall to create the dataset.The dataset provides a diverse range of scenes and camera motions for analysis.The dataset has gained significant popularity and is commonly utilized for the evaluation of RGB-D SLAM systems.In our research, we also use this dataset for system performance evaluation.
KinectFusion (Newcombe et al., 2011) is the first SLAM algorithm based on the Microsoft Kinect depth camera, which is capable of accurate real-time mapping of complex and arbitrary indoor scenes under different lighting conditions.Especially, it allows real-time reconstruction of dense surfaces.KinectFusion system is made up of four components, including surface measurement, surface reconstruction update, surface prediction and sensor pose estimation.Surface measurement as a preprocessing stage to generate a dense vertex map and normal map pyramid.In their paper, KinectFusion relies on the Kinect camera to calculate the camera's poses and generate a 3D map of the environment, but the algorithm is not Kinect specific and can be applied to any RGB-D camera, including Intel RealSense depth camera.
While the original KinectFusion is restricted to small-scale scenes, there are meanwhile extensions for large-scale.(Niesner et al., 2013) contributed an online system for large and fine scale volumetric reconstruction, their system extends a simple spatial hashing technique (Teschner et al., 2003) that compresses space and allows real-time access and update of implicit surface data.Surface data is only densely stored in cells where measurements are observed.In addition, surface data can be efficiently flowed into and out of the hash table, enabling further scalability during sensor motion.However, their approach is designed to be efficient for parallel graphics processing unit (GPU) hardware.
ORB-SLAM (Mur-Artal et al., 2015) employs Oriented FAST and Rotated BRIEF (ORB) (Rublee et al., 2011) algorithm for feature detection and matching, sparse map creation, and loop closure detection.ORB-SLAM algorithm is lightweight and can therefore be run on a standard central processing unit (CPU) host.From a monocular SLAM system by incorporating support for stereo and RGB-D cameras (Mur-Artal and Tardos, 2017), ORBbased SLAM systems have been continuously updated and improved in the past few years, with the release of ORB-SLAM3 (Campos et al., 2021), ORB-SLAM3 has become one of the best performing feature-based SLAM systems that operates in real time, both indoors and outdoors.In our previous research (Hou et al., 2023), we have extended ORB-SLAM3 algorithm and implemented real-time tracking and 3D dense reconstruction using an Intel RealSense D455 camera.Based on our previous 3D dense mapping work, we further extend ORB-SLAM3 algorithm to an octree-based 3D mapping system in this paper, which can generate 3D voxel map of the environment in real time.

METHODOLOGY
As shown in Figure 1, the processing pipeline of our system consists of three main components: 1) ORB-SLAM3 based trajectory estimation for RGB-D cameras.2) 3D dense mapping based on RGB-D data and poses of camera.3) Octree-based indoor mapping with ROS in real-time.In the following sections, we will describe the three main parts of our processing pipeline in more detail.

ORB-SLAM3 based trajectory estimation for RGB-D cameras
ORB-SLAM3 (Campos et al., 2021) is an accurate vSLAM algorithm that supports visual, visual-inertial, and multimap SLAM with monocular, stereo and RGB-D cameras, using pinhole and fisheye lens models.One major feature of ORB-SLAM3 is visual-inertial SLAM, it relies entirely on maximum a posteriori (MAP) estimation, enabling robust real-time operation in a wide range of indoor and outdoor environments.In addition, ORB-SLAM3 is also characterised by its multiple map system, which provides excellent positioning accuracy, loop closure detection and relocalization capability.
In our research, for RGB-D camera tracking and pose estimation, we rely on ORB-SLAM3 as it offers a good trade-off between speed and accuracy.ORB-SLAM3 system includes four main components: 1) Feature extraction and pose estimation.2) Local mapping and local bundle adjustment.3) Loop closure detection and map merging, followed by full bundle adjustment.4) Altas is used to generate a unique DBoW2 (Gálvez-López and Tardós, 2012) database of keyframes for relocalization, loop closing, and map merging.By performing these individual tasks in parallel threads, ORB-SLAM3 can achieve a globally consistent, longterm tracking capability while maintaining a lightweight profile that can be run on a standard CPU.
In our pipeline, we mainly use ORB-SLAM3 for estimating camera motion and processing keyframes derived from RGB-D videos.ORB-SLAM3 supports two different running modes, namely normal mode and ROS mode.The former involves using an RGB-D camera or RGB-D image dataset as input, while the latter involves using the ROS to process the input data obtained from an RGB-D camera.In this paper, we will conduct the octree-based mapping in ROS mode.

3D dense mapping based on RGB-D data and poses of camera
In the second stage of our pipeline, we perform a dense point cloud mapping based on RGB-D data and estimated camera poses.Here, we mainly use the powerful cross-platform open source point cloud library (PCL) proposed by (Rusu and Cousins, 2011) to generate global point clouds.PCL implements a large number of common algorithms related to point clouds, such as acquisition, filtering, segmentation and surface reconstruction.
In the first stage of the point cloud generation, we create a single thread to extend the ORB-SLAM algorithm, and then use a multistep process involving various algorithms.First, we invoke keyframes from ORB-SLAM3 and RGB-D information to generate local point clouds for each session.Finally, we transform these point clouds into a global map based on the estimated camera poses in keyframes.
To improve the quality and accuracy of the point cloud mapping, we use a double-layer filtering method.In the first layer, we perform statistical filtering in the local map to remove outlier points.This filtering method identifies points that are too far from the mean of the local patches and removes them.This helps to improve the accuracy of the reconstructed 3D model by eliminating points that are unlikely to be part of the actual scene.In the second layer of filtering, we use voxel filtering in the point clouds to downsample the number of points in the point cloud.This filtering method reduces memory usage without causing significant distortion of the shape features of the 3D model.The voxel filtering algorithm divides the 3D space into voxels, keeping only one point per voxel.This reduces the number of points while retaining the general shape of the object.In this task, we set the voxel resolution to 0.01 m for downsampling.
By using this multi-step approach with a double-layer filtering method, we have successfully carried out real-time 3D dense point cloud mapping with the Intel RealSense D455 camera.

Octree-based indoor mapping with robot operating system (ROS) in real-time
For the octree-based indoor mapping, we adopt the work of (Hornung et al., 2013)  The input data for this algorithm is the voxel filtered point clouds, generated during the 3D dense point clouds mapping process.
The objective is to generate a 3D model of the indoor environment in real-time using an octree data structure.The algorithm can be split into the following steps: 1) Create a node object to communicate with other nodes in the ROS environment.This object allows the current node to subscribe to topics, publish messages, and interact with the ROS parameter server.
2) After creating the node object, a publisher object is then created to advertise a ROS point cloud message "sensor_msgs/PointCloud2" to topic "cloud_in" as data input for octree mapping.The topic "cloud_in" should be the same as the topic in the octree server launch file.In addition, we empirically set the message queue size to 100,000, which determines how many messages can be cached if the subscriber cannot keep up with the publishing speed.
3) After the definition of a publisher object, a rate object with a frequency of 5 Hz is set, which is the loop rate of the node that was created in Step 1, this step ensures that the node runs at a consistent frequency, regardless of how quickly or slowly the loop computation completes.It is important for real-time applications and for maintaining efficient use of system resources.
4) Using a 3D affine transformation matrix provided by the Eigen library to transform the inputted point clouds.This transformation is useful for aligning point clouds or correcting orientation differences.In this paper, it is used to adjust the pose of the currently acquired global point cloud.
Eigen is a C++ template library for linear algebra.It is esigned to be fast, efficient and easy to use for numerical computation, particularly linear algebraic operations.
5) After the affine transformation, we then use the method "toROSMsg" provided by PCL to convert the transformed global point clouds into a ROS point cloud message format.followed by setting the ROS frame ID.Using the publisher object created in step 2 to publish the ROS point cloud message format data to the topic "cloud_in".
6) Launch the octree server and start the octree mapping.The topic "cloud_in" will be subscribed by RVIZ so that the generated octree map can be visualised in real time.
The above is the procedure of real-time octree mapping with the tool of OctoMap.In our research, the resolution of octree map is set as 0.05 m, the result is shown in Figure 4.

EVALUATION
To evaluate our system performance quantitatively.On the one hand, we use the TUM RGB-D benchmark dataset (Sturm et al., 2012) to evaluate the translational root mean square error (Transl.RSME) in the estimated camera trajectory by comparing it with the ground-truth, as shown in Table 1.On the other hand, we apply our octree mapping system with the self-collected data scanned by a handheld Intel RealSense D455 camera.Figure 2 shows our Intel RealSense D455 camera and example of an RGB-D image sequence from input data.As the octree map is built from point clouds, the quality of the generated 3D point clouds directly affects the quality of the octree mapping, we also use ground truth 3D point cloud data to evaluate the generated 3D point clouds.The ground truth 3D point cloud data is scanned by an industrial NavVis VLX laser mobile mapping system.We evaluate the results with the software CloudCompare.All experiments are conducted on a computer with Ubuntu18.04operating system, AMD Epyc 7402p, 24-core processor, 256G RAM and NVIDIA RTX A4000 GPU.
In this section, we first present the sequences we selected from the benchmark dataset for the evaluation of the camera trajectory, as shown in Table 1.Then, we detail the evaluation process and results, followed by a discussion of the results of our system for octree-based 3D indoor mapping.(Endres et al., 2012).Transl.RMSE is the evaluation of our octree-based 3D indoor mapping system.

Dataset used for evaluation TUM RGB-D dataset:
The TUM RGB-D dataset is a large-scale dataset primarily for texturing office (named as "fr1") and industrial hall (named as "fr2") scenes.The dataset includes both color and depth images taken by the Microsoft Kinect camera, as well as the ground truth trajectory of the camera.The data was recorded at full frame rate (30Hz) and sensor resolution (640x480).The ground truth trajectories were obtained from a high precision motion capture system using eight high speed tracking cameras (100 Hz).In addition, the dataset also provides an automated evaluation script for comparing the estimated trajectory with the ground truth trajectory.
In our research, we selected nine sequences of fr1 for evaluation.Among these selected sequences, as shown in Table 1, the sequences fr1/xyz and fr1/rpy are very simple and the result normally represents the best case.The rest of the sequences are more challenging as they cover a larger office space and more unstable camera movements.

Evaluation of the estimated trajectory
First, we evaluated our system on all fr1 sequences of TUM RGB-D dataset, see Table 1.However, because our octree mapping system is based on ORB-SLAM3 algorithm, it provides higher accuracy camera poses estimation, which will guarantee our system to achieve greater accuracy in the mapping process.For all sequences except the "fr1_360" sequence, the pose estimation results outperform RGB-D SLAM (Endres et al., 2012).On the simple "fr1_xyz" sequence, we obtained the best value of 1.0 cm Transl.RSME error.Even in the challenging "fr1_room" sequence, it was still possible to obtain a 7.2 cm Transl.RSME error.We achieved the worst 22.7 cm Transl.RSME in "fr1_360" sequence, because this sequence contains a fast 360 degree turn, and it has a relatively high average velocity in terms of both translational and angular, this sequence is more challenging for the extended ORB-SLAM3 system.
Overall, this evaluation shows that the extended ORB-SLAM3 system is reliable, which can provide relatively high pose estimation for our real-time octree-based mapping.

Evaluation of the mapping accuracy
To verify the accuracy of the 3D dense point clouds mapping results, we use the point cloud data scanned by a high precision NavVis VLX mobile mapping system as ground truth to conduct a comparison experiment.We evaluate the mapping accuracy using tools on CloudCompare.First, we align point clouds through point pair picking, and then make a fine registration with the iterative closest point (ICP) algorithm.Finally, we compare the cloud-to-cloud (C2C) distances between the ground truth point clouds and our generated point clouds.All parameter settings are selected as default parameters on CloudCompare.We use 0.5 m and 0.2 m C2C absolute distance filters to downweight the influence of outlier points caused by noise, respectively.
The evaluation result is shown in Figure 5 and Table 2, and the input RGB-D video stream details are shown in the Table 3.For the entire point cloud generated, the mean distance of computed distances is 7.4 cm, and the standard deviation of computed distances is 8.5 cm.We can see that in the left room scene, the distance error is large near the window without curtains.But inside the corridor, the accuracy of the obtained point cloud is higher.Then we filter out outliers with distances greater than 0.2 m and recalculate the cloud-to-cloud distances, the obtained mean distance of computed distances is 5.8 cm, and the standard deviation of computed distances is 4.2 cm.In general, this evaluation shows the accuracy of our indoor dense point clouds mapping is good, although the 3D indoor scenes are not as complete due to the camera's perspective.However, the mapping accuracy is also affected by the way of the camera movement and the light situation of the environment, these are classical problems for visual SLAM.Here, we only focus on the mapping with an ideal camera moving state, the direction of D455 camera movement is shown in Figure 3.   3. The details about the input RGB-D video.FPS: Frames per second.

The results of our octree-based 3D indoor mapping
Eventually, we succeeded in mapping an indoor octree model in real time with a handheld D455 camera, the result of our octreebased 3D indoor mapping is shown in Figure 4.As described in the methodology section, we set the default resolution of the octree map to 0.05 m.As can be seen in Table 4, the octree structure provides a compact memory representation for efficient storage of maps.The memory of the color point cloud data is reduced from 46.6M to 6.7M and 288.6kB, with compression rates of 85.6% and 99.4% respectively.

CONCLUSION & FUTURE WORK
We present an octree-based approach for real-time 3D indoor mapping using a handheld Intel RealSense D455 camera.In this work, we extend the ORB-SLAM3 (Campos et al., 2021) algorithm to build large-scale octree voxel maps on the fly.For our indoor mapping system, first, we invoke the keyframes and estimated poses from ORB-SLAM3.Then we apply point cloud library (PCL) to generate 3D dense point clouds.Finally, we use OctoMap, an octree-based mapping framework, to construct and update the octree map based on the published point clouds topic in the robot operating system (ROS) environment.The real-time octree mapping of indoor environments can be further used for robot path planning and navigation.We evaluate our system on the TUM RGB-D dataset.The average translational RMSE on TUM RGB-D dataset is 5.9 cm.In addition, we use ground truth 3D point clouds to evaluate the accuracy of the dense point clouds generated by our system.However, the good accuracy of point clouds also ensures the high quality of the 3D octree mapping.
During the evaluation of our system, we have also discovered some aspects that we need to further improve and verify.For instance, we just utilized the rough camera trajectory from ORB-SLAM3 algorithm, without adding loop closure detection for our octree-based system.Moreover, the mapping results of 3D indoor scenes are not complete, and the point cloud processing still needs to be further improved.Therefore, in the future, we will mainly optimize our octree-based mapping system from the above aspects.

Figure 3 .
Figure 3.The estimated trajectory of Intel RealSense Depth Camera D455, it starts at blue and ends at red.

Figure 4 .
Figure 4. Result of real-time 3D indoor octree mapping using RGB-D video.The ground truth data is scanned by an industrial NavVis VLX laser mobile mapping system.development.It provides a set of tools and libraries for building robot applications, such as communication mechanisms, and algorithms for perception, planning, and control.

Figure 5 .
Figure 5.The comparison results of cloud-to-cloud distances between the ground truth point clouds and our generated point clouds.We use 0.5 m and 0.2 m C2C absolute distance filters to downweight the influence of outlier points caused by noise, respectively.Data Color point clouds Color octree Octree Memory 46.6 M 6.7 M 288.6 kB Compression rate -85.6% 99.4%

Table 1 .
The statistical information on fr1 sequences of TUM RGB-D dataset.Duration, Length, Avg.translational velocity, Avg.angular velocity from TUM RGB-D dataset official website.Transl.RMSE* is the evaluation of RGB-D SLAM from

Table 4 .
The memory representation of color point clouds map in pcd format, color octree map in ot format and octree map in bt format.M represents megabyte and kB represents kilobytes, 1M = 1024 kB.