COMPARISON OF TWO MATHEMATICAL MODELS FOR FISHEYE CAMERAS APPLIED IN THE ORB-SLAM

Fisheye lens cameras are becoming increasingly popular for vSLAM applications due to their wide field of view (FoV), providing more features to be tracked in a single image shot. However, the complex lens geometry involved in the image formation process still limits their full potential, especially when points in the hyperhemispherical field are unmodeled. In this paper, we compare two adaptations of ORB-SLAM for fisheye lens cameras, considering the use of the rigorous projection model (equisolid-angle) versus the use of the generic projection model (EUCM). The ORB-SLAM versions were adapted for real-time processing on the Nvidia Jetson TX2 board. The experiment was conducted using hyperhemispherical images obtained with a Ricoh Theta S camera. Our results showed that the trajectory estimated with the equisolid-angle ORB-SLAM had smaller discrepancies, compared to the reference trajectory, than the EUCM ORB-SLAM. This suggests that a rigorous photogrammetric model with a suitable treatment of hyperhemispherical points is beneficial for trajectory estimation.


INTRODUCTION
The 3D reconstruction of an unknown environment with simultaneous determination of the sensor orientation is commonly known as SfM (Structure from Motion) in Computer Vision, Phototriangulation in Photogrammetry, and SLAM (Simultaneous Localization and Mapping) in robotics (Cadena et al., 2016; Durrant-Whyte and Bailey, 2006; Granshaw, 2018). The classic problem of SLAM is to sequentially estimate the position and orientation of an agent platform in real time based on remote sensors, such as cameras and laser scanning systems, mounted on mobile mapping platforms. Thus, SLAM methods enable a consistent estimate of the trajectory based on a map of the environment that contains 3D points. SLAM methods based only on sequential image information are known as Visual SLAM (vSLAM) (Li et al., 2019; Taketomi et al., 2017; Torresani and Remondino, 2019).
In recent years, we have witnessed significant advances in vSLAM technology to address the challenges of building accurate and robust maps in various environments. These advances were especially driven by the increased use of low-cost and lightweight optical sensors and by higher computational power and performance, resulting in state-of-the-art approaches such as Mono-SLAM (Davison et al., 2007), PTAM (Klein and Murray, 2007), DTAM (Newcombe et al., 2011), SVO (Forster et al., 2014), LSD-SLAM (Engel et al., 2014), ORB-SLAM (Mur-Artal et al., 2015) and DSO (Matsuki et al., 2018). ORB-SLAM is considered one of the leading state-of-the-art SLAM solutions due to its input data flexibility, versatility and accurate estimates. It uses the ORB operator to extract keypoints and match their features, which allows it to work with images acquired using different configurations, such as monocular, stereo, and RGB-D (red, green, blue, and depth) cameras. The matched points enable the estimation of the trajectory and of a sparse 3D point cloud in real time, in addition to loop closure and relocalization (Mur-Artal et al., 2015). However, a vSLAM solution combining fisheye images and ORB-SLAM has not yet been fully explored.
Fisheye lenses provide a wider field of view (FoV), enabling the capture of more features in a single image shot. As a result, they are an attractive choice for use in vSLAM solutions. However, many challenges remain in the use of large FoV systems (e.g., fisheye) in mobile mapping applications. The main issues are related to the complex camera model and fisheye lens geometry (Wang et al., 2018).
Several works have proposed generic models to cope with the fisheye lens geometry, which can be applied to different camera systems (Usenko et al., 2018). Campos et al. (2020) and Liu et al. (2019) used generic camera models to extend ORB-SLAM for use with fisheye cameras. Campos et al. (2020) utilized the camera model developed by Kannala and Brandt (2006), while Liu et al. (2019) employed the Enhanced Unified Camera Model (EUCM), introduced originally by Geyer and Konstantinos (2000). The choice of these models was driven by their generality and simplicity, which enabled simple adaptation for use in ORB-SLAM.
Applications that demand high accuracy may require rigorous projection models that consider the physical imaging principles of fisheye lenses, such as equidistant, orthogonal, stereographic, and equisolid-angle (Schneider et al., 2009).
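As an illustration of how the four rigorous projections named above differ, the sketch below evaluates their textbook radial mapping functions, which relate the incidence angle to the radial distance in the image. The function name `radial_distance` and the model labels are hypothetical and chosen only for this example; the formulas are the standard ones for each projection, not code taken from the compared ORB-SLAM versions.

```python
import math


def radial_distance(theta, f, model):
    """Radial image distance r for an incidence angle theta (radians).

    f is the principal distance; r is returned in the same unit as f.
    The model names are labels chosen for this sketch.
    """
    if model == "equidistant":
        return f * theta
    if model == "orthogonal":
        return f * math.sin(theta)
    if model == "stereographic":
        return 2.0 * f * math.tan(theta / 2.0)
    if model == "equisolid":
        return 2.0 * f * math.sin(theta / 2.0)
    raise ValueError("unknown projection model: " + model)


# Compare the four models at a few incidence angles (f = 1).
for deg in (30, 60, 90):
    theta = math.radians(deg)
    values = [radial_distance(theta, 1.0, m)
              for m in ("equidistant", "orthogonal", "stereographic", "equisolid")]
    print(deg, [round(v, 4) for v in values])
```

Note that the orthogonal mapping stops increasing at 90°, so it cannot represent points beyond the hemisphere, whereas the equisolid-angle mapping keeps increasing up to 180°.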
In this paper, we compare the accuracy of a real-time ORB-SLAM solution (Aldegheri et al., 2019) when using generic versus rigorous models for fisheye lenses. The real-time ORB-SLAM proposed by Aldegheri et al. (2019) was modified into two versions. The first one applies the generic EUCM model (EUCM ORB-SLAM), and the second one applies the rigorous equisolid-angle model (equisolid ORB-SLAM), which also allows the treatment of hyperhemispherical points (Castanheiro et al., 2021). Comparative analyses were performed with a fisheye image dataset acquired with a backpack-mounted mobile mapping system. Developed by Campos et al. (2018), the system is equipped with a Ricoh Theta S omnidirectional camera, which captures the entire scene with a hyperhemispherical geometry, covering a 360° FoV.

Mathematical Models
Fisheye lens cameras have a FoV of approximately 180° (for hemispherical lenses) or even more (for hyperhemispherical lenses), covering a wider view of a scene in a single image compared to conventional perspective cameras. Previous works proposed different mathematical models to deal with the geometric distortions caused by fisheye lenses and to address this specific lens geometry (Usenko et al., 2018). Here, we describe in detail the two mathematical models used in this work to adapt ORB-SLAM to the fisheye lens geometry: the generic EUCM and the rigorous equisolid-angle model.

Enhanced Unified Camera Model
The EUCM is a generic projection model for fisheye cameras. It is based on the unified camera model, is considered simple, and does not require additional distortion coefficients: only two extra parameters are needed to cope with the distortions. The model also allows the inverse projection function to be expressed in an explicit closed form. As shown in Figure 1, a point $P$ with coordinates $(X_c, Y_c, Z_c)$ in the camera reference system is first projected to $P_s$ on an ellipsoid and then mapped to $p$ in the image plane using the pinhole model. The vector of image coordinates ($u$ and $v$), in pixels, with origin at the bottom-left pixel of the image, is calculated by Equation 1, where $f$ is the principal distance in pixels, and $c_x$ and $c_y$ are the coordinates of the origin of a centered system. The projection parameters on the ellipsoid, $\alpha \in [0.5, 1]$ and $\beta$, allow the lens properties to be approximated despite the strong distortions, with $\rho$ being calculated by Equation 2.

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} f \dfrac{X_c}{\alpha \rho + (1 - \alpha) Z_c} + c_x \\[2ex] f \dfrac{Y_c}{\alpha \rho + (1 - \alpha) Z_c} + c_y \end{bmatrix} \tag{1}$$

$$\rho = \sqrt{\beta \left( X_c^2 + Y_c^2 \right) + Z_c^2} \tag{2}$$

The coordinates for the inverse projection from the image plane to the point $P_s$ on the ellipsoid in the EUCM model are obtained using Equation 3, in which $r^2 = m_x^2 + m_y^2$. Additionally, the maximum value of is set to 1. The ellipsoid is the projection surface, and the inverse function is a ray's collinearity function, since $P$ and $P_s$ are on the same ray.

$$P_s = \begin{bmatrix} m_x \\ m_y \\ m_z \end{bmatrix} = \begin{bmatrix} \dfrac{u - c_x}{f} \\[2ex] \dfrac{v - c_y}{f} \\[2ex] \dfrac{1 - \beta \alpha^2 r^2}{\alpha \sqrt{1 - (2\alpha - 1) \beta r^2} + (1 - \alpha)} \end{bmatrix} \tag{3}$$

Generic models, however, do not consider the physical principles of imaging. Conventional models consider the physical properties of the fisheye lens and are usually based on the projection of a sphere onto the image plane. Several such projection models exist, such as equidistant, orthogonal, stereographic, and equisolid-angle (Abraham and Förstner, 2005; Hughes et al., 2010; Ray, 1994; Schneider et al., 2009). Castanheiro et al. (2021) compared fisheye projection models in the camera calibration of the Ricoh Theta S dual-fisheye system to verify which one is suitable for hyperhemispherical lenses. The results were better when using the equisolid-angle projection model.
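The EUCM projection and its closed-form inverse described in this section can be sketched in a few lines. This is a minimal illustration and not the code of EUCM ORB-SLAM. It assumes the commonly published form of the model with a single principal distance `f`, a centered-system origin (`cx`, `cy`), and the two ellipsoid parameters `alpha` and `beta`. The function names and the validity checks are hypothetical choices made for this example.

```python
import math


def eucm_project(X, Y, Z, f, cx, cy, alpha, beta):
    """Project a camera-frame point (X, Y, Z) to pixel coordinates (u, v)."""
    rho = math.sqrt(beta * (X * X + Y * Y) + Z * Z)
    denom = alpha * rho + (1.0 - alpha) * Z
    if denom <= 0.0:
        # Assumed validity check: the point cannot be projected by the model.
        raise ValueError("point outside the valid projection domain")
    return f * X / denom + cx, f * Y / denom + cy


def eucm_unproject(u, v, f, cx, cy, alpha, beta):
    """Back-project a pixel (u, v) to a unit ray in the camera frame."""
    mx = (u - cx) / f
    my = (v - cy) / f
    r2 = mx * mx + my * my
    disc = 1.0 - (2.0 * alpha - 1.0) * beta * r2
    if disc < 0.0:
        raise ValueError("pixel outside the valid back-projection domain")
    mz = (1.0 - beta * alpha * alpha * r2) / (alpha * math.sqrt(disc) + (1.0 - alpha))
    norm = math.sqrt(mx * mx + my * my + mz * mz)
    return mx / norm, my / norm, mz / norm


# Round trip with illustrative (not calibrated) parameter values.
params = dict(f=300.0, cx=320.0, cy=240.0, alpha=0.6, beta=1.1)
u, v = eucm_project(0.3, -0.2, 1.0, **params)
ray = eucm_unproject(u, v, **params)
```

The returned ray is normalized to unit length, so it is parallel to the original point: the back-projection recovers the direction of the ray, not the depth.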

Equisolid-angle projection model
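Figure 2 illustrates the projection of a point $P$ onto the image plane as point $p$, following the equisolid-angle projection. In the equisolid-angle model, the radial distance $r$ is obtained by the relation $r = 2f \sin(\theta / 2)$, equivalent to the chord length of the arc segment on the sphere. Equation 4 presents the equisolid-angle projection model, with $X_c$, $Y_c$ and $Z_c$ being the coordinates of a point $P$ in the camera reference system, which are related to its coordinates $X$, $Y$ and $Z$ in the object reference system through a rigid body transformation (Equation 5). This transformation is a function of $X_0$, $Y_0$ and $Z_0$ (3D coordinates of the camera position) and $r_{ij}$, the components of the rotation matrix $\mathbf{R}$. The coordinates $x$ and $y$ are the projection of the point in the photogrammetric reference system, with $f$ being the camera principal distance.

$$x = 2f \frac{X_c}{\sqrt{X_c^2 + Y_c^2}} \sin\left[ \frac{1}{2} \arctan\left( \frac{\sqrt{X_c^2 + Y_c^2}}{Z_c} \right) \right], \qquad y = 2f \frac{Y_c}{\sqrt{X_c^2 + Y_c^2}} \sin\left[ \frac{1}{2} \arctan\left( \frac{\sqrt{X_c^2 + Y_c^2}}{Z_c} \right) \right] \tag{4}$$

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} X - X_0 \\ Y - Y_0 \\ Z - Z_0 \end{bmatrix} \tag{5}$$

The coordinates $x$ and $y$ in the photogrammetric reference system are obtained through Equation 6, where $x'$ and $y'$ are the image coordinates in a centered system, $x_0$ and $y_0$ are the principal point coordinates, and $\Delta x$ and $\Delta y$ are the lens distortion components. Since the lens does not exactly follow the mathematical model of projection due to distortions, fisheye projection models can be combined with the same distortion model used for perspective cameras, such as Conrady-Brown (Brown, 1971; Conrady, 1919).

$$x = x' - x_0 - \Delta x, \qquad y = y' - y_0 - \Delta y \tag{6}$$

The projection of $p$ from the image plane to the object space can be performed by first projecting the coordinates ($x$, $y$) in the image plane to the sphere and then to the object space, using the inverse collinearity equations. Thus, the collinearity equations can be considered valid after the projection of the image plane coordinates to the sphere, which is a deformation-free spatial domain (Campos, 2019). The coordinates $x$ and $y$ can be projected to the point ($x_s$, $y_s$, $z_s$) on the sphere of radius $R_s$ by Equation 7, where the incident angle is given by $\theta = 2 \arcsin\left( \sqrt{x^2 + y^2} / (2f) \right)$ and the angle formed between the radial direction and the $x$ axis by $\varphi = \arctan(y / x)$.

$$\begin{bmatrix} x_s \\ y_s \\ z_s \end{bmatrix} = R_s \begin{bmatrix} \sin\theta \cos\varphi \\ \sin\theta \sin\varphi \\ \cos\theta \end{bmatrix} \tag{7}$$

The 3D coordinates ($X$, $Y$, $Z$) of point $P$ in the object reference system are determined from the 3D straight line defined by the spherical coordinates ($x_s$, $y_s$, $z_s$), under the scale factor $\lambda$, as indicated by Equation 8.

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \end{bmatrix} + \lambda \, \mathbf{R}^{T} \begin{bmatrix} x_s \\ y_s \\ z_s \end{bmatrix} \tag{8}$$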
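The forward equisolid-angle projection and the back-projection to the unit sphere can be checked numerically with the sketch below. It is an illustration under stated assumptions, not the equisolid ORB-SLAM code: the optical axis is taken as the positive $Z_c$ axis (the sign convention of the original implementation may differ), lens distortions are ignored, and the incidence angle is computed with `atan2` so that hyperhemispherical points (incidence angle above 90°, i.e. negative $Z_c$ in this convention) are handled. The function names are hypothetical.

```python
import math


def equisolid_project(Xc, Yc, Zc, f):
    """Project a camera-frame point to image coordinates (x, y), r = 2 f sin(theta / 2)."""
    d = math.hypot(Xc, Yc)
    if d == 0.0:
        if Zc > 0.0:
            return 0.0, 0.0  # point on the optical axis
        raise ValueError("direction opposite to the optical axis has no unique image point")
    # atan2 keeps theta in (0, pi), so points behind the hemisphere are valid.
    theta = math.atan2(d, Zc)
    r = 2.0 * f * math.sin(theta / 2.0)
    return r * Xc / d, r * Yc / d


def equisolid_to_sphere(x, y, f):
    """Back-project image coordinates (x, y) to a point on the unit sphere."""
    r = math.hypot(x, y)
    if r > 2.0 * f:
        raise ValueError("radial distance exceeds the model limit 2f")
    theta = 2.0 * math.asin(r / (2.0 * f))  # incidence angle
    phi = math.atan2(y, x)                  # angle to the x axis
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


# A hyperhemispherical point (incidence angle of about 100 degrees), f in arbitrary units.
x, y = equisolid_project(1.0, 0.5, -0.2, f=1.0)
ray = equisolid_to_sphere(x, y, f=1.0)
```

Scaling the returned unit vector by a factor $\lambda$ and applying the inverse rigid body transformation would then give the object-space point, as in the last step described above.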

ORB-SLAM in real-time
ORB-SLAM operates using three modules that run simultaneously on different threads (tracking, local mapping, and loop closing) while parallelizing highly complex processes such as optimization. The algorithms used in ORB-SLAM demand substantial memory and processing power. Although ORB-SLAM can achieve real-time operation on conventional computers, it does not perform as well on mobile platforms due to limited resources and the complexity of operations, particularly in feature extraction and matching. These complexities can impede real-time performance without CPU and GPU optimization. Aldegheri et al. (2019) presented a real-time modification of ORB-SLAM for the Nvidia Jetson TX2 embedded platform for perspective cameras, achieving up to 30 frames per second (fps) by utilizing a heterogeneous computing paradigm, which exploits the potential of the board by distributing the processing between the CPU and GPU.
The two versions of ORB-SLAM proposed in this work were both modified and optimized to achieve high performance and enable realistic real-time applications, following the approach proposed by Aldegheri et al. (2019), as illustrated in Figure 3. EUCM ORB-SLAM uses the EUCM mathematical model (Figure 1), while equisolid ORB-SLAM employs the rigorous equisolid-angle model (Figure 2). The key challenges to processing more frames per second in real time are the feature extraction and matching steps. In the original formulation of ORB-SLAM, all processing is done on the CPU. Aldegheri et al. (2019) focused their efforts on improving the performance of the tracking module by implementing it on the GPU. A multi-platform open-source acceleration library specifically designed for real-time embedded platforms was used. The proposed approach applies compatible and optimized GPU CUDA kernels on the Jetson TX2, exploring the full potential of the CPU and GPU resources, and is further optimized through multithreading and Bundle Adjustment (BA) using graph optimization during the estimation process (Kummerle et al., 2011). With these optimizations, performance and efficiency are enhanced while accuracy is maintained, making real-time processing possible even with complex mathematical models for fisheye cameras.

EXPERIMENTS AND RESULTS
Two experiments were independently performed with the same dataset using (1) ORB-SLAM with the EUCM model (EUCM ORB-SLAM) and (2) ORB-SLAM with the equisolid-angle model (equisolid ORB-SLAM). Both experiments were conducted on the Jetson TX2, simulating real-time processing.

Data acquisition and dataset
The dataset consists of fisheye images collected with a Ricoh Theta S dual-fisheye camera. The camera was attached to a backpack-mounted mobile system, and images were captured over a 140 m long path. Only images captured with one of the camera sensors, at a frame rate of 5 fps, were used in the experiments. The study area is a path covered by high and low vegetation and urban features (buildings, sidewalks, and posts), as shown in Figure 4. Additionally, two reference trajectories, in UTM projection coordinates, were generated to assess performance and positional accuracy: (1) a sensor trajectory calculated indirectly by simultaneous bundle adjustment with Agisoft Metashape, based on a set of images selected at 1 fps and 48 GCPs (Ground Control Points) collected using GNSS (Global Navigation Satellite System) RTK (Real Time Kinematic); (2) a sensor trajectory estimated by post-processing dual-frequency GNSS positioning, using a Hiper GPS receiver attached to the backpack while walking along the path. The trajectory obtained from the dual-frequency GNSS receiver data presented an average precision of 0.02 m at the beginning of the trajectory (where there was no signal loss), while the trajectory calculated by Agisoft Metashape had a precision of 0.07 m in camera positions and an average discrepancy of 0.15 m compared to the positions at the beginning of the trajectory obtained from the dual-frequency GNSS receiver. The results obtained with ORB-SLAM are referenced to a local coordinate system with an arbitrary origin, orientation, and scale. Therefore, the parameters of a similarity transformation were calculated using GCPs to convert the coordinates of the two estimated ORB-SLAM trajectories to the UTM projection. The seven estimated similarity parameters (3 translations, 3 rotations and scale) were applied to transform the coordinates of the camera stations, estimated in a local reference system by EUCM ORB-SLAM and equisolid ORB-SLAM, to UTM projection coordinates.
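The application of the seven similarity parameters to a camera station can be sketched as follows. This example only applies a given set of parameters; the estimation of the parameters from GCPs (a least-squares problem) is not shown. The rotation order (rotations about the x, y and z axes, composed as Rz·Ry·Rx), the function names, and all numeric values are assumptions made for illustration and are not taken from the experiments.

```python
import math


def rotation_matrix(omega, phi, kappa):
    """Rotation matrix R = Rz(kappa) * Ry(phi) * Rx(omega); the order is an assumption."""
    co, so = math.cos(omega), math.sin(omega)
    cp, sp = math.cos(phi), math.sin(phi)
    ck, sk = math.cos(kappa), math.sin(kappa)
    rx = [[1.0, 0.0, 0.0], [0.0, co, -so], [0.0, so, co]]
    ry = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]
    rz = [[ck, -sk, 0.0], [sk, ck, 0.0], [0.0, 0.0, 1.0]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]

    return matmul(rz, matmul(ry, rx))


def similarity_transform(point, scale, angles, translation):
    """Map a local 3D point to the target frame: T + scale * R * point."""
    rot = rotation_matrix(*angles)
    return tuple(
        translation[i] + scale * sum(rot[i][j] * point[j] for j in range(3))
        for i in range(3)
    )


# Synthetic example: scale 2, 90-degree rotation about z, arbitrary translation.
station_local = (1.0, 0.0, 0.0)
station_mapped = similarity_transform(
    station_local, 2.0, (0.0, 0.0, math.pi / 2), (10.0, 20.0, 30.0)
)
```

Because a monocular SLAM trajectory has an arbitrary scale, the scale factor is what distinguishes this seven-parameter transformation from a six-parameter rigid body transformation.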

Performance Assessment
Comparative analyses were performed by selecting 296 common frames from: (1) the reference trajectory calculated with Agisoft Metashape (Metashape); (2) the trajectory obtained using ORB-SLAM with the equisolid-angle model (equisolid ORB-SLAM); and (3) the trajectory calculated using ORB-SLAM with the EUCM model (EUCM ORB-SLAM). The camera trajectories (exposure stations) estimated by Metashape (used as the reference), equisolid ORB-SLAM, and EUCM ORB-SLAM are depicted in Figure 5. The trajectories are shown in black, red, and green, respectively. The camera positions are plotted in the UTM reference system, providing a visual representation of the trajectory for each solution along the 140 m path. The trajectory was intentionally straight to simulate walking along a path where there is no expectation of return and of loop closure based on revisiting a place, such as in forests, agricultural corridors, and other similar scenarios. With the hyperhemispherical images from the Ricoh camera, the EUCM model exhibited significant drifts in position and scale when compared to both the reference trajectory and the equisolid ORB-SLAM trajectory. The equisolid ORB-SLAM presented more consistent results than the EUCM ORB-SLAM. However, a drift can be observed at the end of its trajectory, which affected not only the position but also the scale. We selected the first and last ten estimated camera positions from both the equisolid ORB-SLAM and EUCM ORB-SLAM trajectories. Then, we calculated the RMSE (Root Mean Squared Error) with respect to the corresponding camera positions obtained from the reference trajectory (Metashape bundle adjustment). The results are presented in Table 1 and Table 2, which show the RMSE of the discrepancies in camera positions between the first and last ten frames of the reference trajectory (Metashape) and the trajectories of equisolid ORB-SLAM and EUCM ORB-SLAM, respectively. The equisolid ORB-SLAM had an RMSE of 0.114 m, 0.111 m, and 0.005 m for the E, N, and h coordinates in the first ten frames, while for the last ten frames the RMSE increased to 3.941 m, 3.592 m, and 0.113 m for the E, N, and h coordinates, respectively, as indicated in Table 1. These results show that the discrepancies in the camera positions gradually increased as new observations were acquired and sequentially processed, leading to error propagation and drift at the end of the trajectory. Table 2 shows that the EUCM ORB-SLAM had an RMSE of 44.114 m, 2.402 m, and 17.580 m for the E, N, and h coordinates, respectively, for the first ten frames in the estimated trajectory, and an RMSE of 40.859 m, 23.125 m, and 6.854 m for the E, N, and h coordinates, respectively, for the last ten frames. The higher errors of the EUCM ORB-SLAM when compared to the equisolid ORB-SLAM can be explained by its generic nature, which was not able to cope with the hyperhemispherical geometry of the Ricoh Theta S, mainly due to the considerably long path and a lower frame rate (5 fps).
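The per-coordinate RMSE used in this assessment can be computed as in the sketch below, which compares matched camera positions (E, N, h) from an estimated trajectory against the reference. The function name and the coordinate values are hypothetical; the numbers are synthetic and are not the values from Table 1 or Table 2.

```python
import math


def rmse_per_axis(estimated, reference):
    """RMSE of the discrepancies in E, N and h between matched camera positions."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("trajectories must be non-empty and have the same length")
    sums = [0.0, 0.0, 0.0]
    for est, ref in zip(estimated, reference):
        for k in range(3):
            diff = est[k] - ref[k]
            sums[k] += diff * diff
    n = len(estimated)
    return tuple(math.sqrt(s / n) for s in sums)


# Synthetic positions in metres (E, N, h), for illustration only.
reference = [(100.0, 200.0, 10.0), (101.0, 200.5, 10.0), (102.0, 201.0, 10.1)]
estimated = [(100.1, 200.0, 10.0), (101.2, 200.4, 10.0), (102.3, 201.2, 10.1)]
rmse_e, rmse_n, rmse_h = rmse_per_axis(estimated, reference)
```

Applying the function separately to the first ten and the last ten matched frames gives the two rows reported for each solution, which makes the growth of the drift along the path directly visible.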

CONCLUSION
The complex lens geometry involved in the image formation process still limits the full potential of fisheye lens cameras, especially when points in the hyperhemispherical field are unmodeled, either by a generic model or by some rigorous model (Castanheiro et al., 2021). Based on our results, we conclude that, when using hyperhemispherical images, the EUCM model showed larger discrepancies with respect to the reference trajectory than the equisolid-angle model. One possible reason is the poor fit of the EUCM model to the Ricoh Theta S camera geometry. Furthermore, a longer path can also contribute to the trajectory drift, since the trajectory resulting from the equisolid-angle model was also affected by drift. It is important to emphasize that, in the conducted experiments, the loop closure option of ORB-SLAM was not performed, although it could have significantly reduced the effect of the trajectory drift. Additionally, a higher frame rate may improve the performance of the EUCM model, but at the cost of increasing computational complexity. Overall, we conclude that the use of a rigorous photogrammetric model, with a suitable treatment of hyperhemispherical points, proved to be beneficial, enabling the use of image observations throughout the full field of view for trajectory estimation.

Figure 3. Modified ORB-SLAM architecture with two mathematical models.

Table 1. RMSE of the beginning and end of the trajectory for equisolid ORB-SLAM.

Table 2. RMSE of the beginning and end of the trajectory for EUCM ORB-SLAM.