DESIGN AND IMPLEMENTATION OF STEREO VISUAL ODOMETRY SYSTEM BASED ON ROP ADJUSTMENT

Stereo Visual Odometry (SVO) is a technique used to estimate the continuous position and orientation of a moving platform using a dual-camera system that captures stereo image pairs. To obtain accurate results, an SVO system must be calibrated before use. System calibration is necessary to determine the intrinsic camera parameters (ICPs) and the relative orientation parameters (ROPs) between the cameras at real scale. Compared to monocular visual odometry, a calibrated SVO system can recover the real scale of the translation vector without additional sensors. The method proposed in this study utilizes ROP adjustment for SVO. Instead of conventional bundle adjustment, this method adopts all sets of ROPs as measurements in the designed network adjustment model. Specifically, there are six sets of ROPs among time-adjacent stereo image pairs. An SVO system was designed to implement the proposed SVO method. Two experiments were conducted in outdoor and indoor test fields to evaluate the performance. Several ground check points were set up for distance and position verification. The drift ratio was also evaluated. The results demonstrate that the designed SVO system is feasible and sufficiently accurate for navigation applications.


INTRODUCTION
Visual odometry (VO) systems can be categorized by whether they use a monocular camera or a stereo pair of cameras. Monocular visual odometry (MVO) (Tian et al., 2021), which employs only one camera, estimates the moving trajectory based on relative measurements between consecutive image frames, specifically relative orientation parameters (ROPs) that can be divided into relative rotation and relative translation. However, MVO cannot directly solve for the real scale of relative translation between image frames, so each relative translation vector is normalized. MVO generally uses 2D-to-2D correspondences for motion computation (Scaramuzza and Fraundorfer, 2011).
In contrast, Stereo Visual Odometry (SVO) is a technique that uses a dual-camera system taking sequences of stereo image pairs to estimate the continuous attitude and position of a moving platform (Nistér et al., 2006). Before application, this dual-camera system must be calibrated to acquire the intrinsic camera parameters (ICPs) and the ROPs between the cameras with real scale. Compared to MVO, SVO can recover the real scale of the relative translation vector without the need for additional sensors. SVO typically uses 3D-to-3D and 3D-to-2D correspondences for motion estimation.
In the conventional SVO procedure, local optimization is a critical step in improving the estimated attitude and position. Two commonly used strategies for local optimization are bundle adjustment and loop closure. However, these methods have limitations: bundle adjustment has enormous computing requirements, and loop closure requires revisiting the same location. Additionally, ROPs describe the geometry between two adjacent images, and six sets of ROPs can be formed among time-adjacent stereo image pairs. However, only some of these sets of ROPs are used for motion estimation and local optimization. This indicates the potential for generating and utilizing redundant measurements to achieve better performance in SVO.
Therefore, this study proposes a novel SVO method based on ROP adjustment, which establishes a network geometric constraint of multiple images. The main contributions of this study can be summarized as follows. First, instead of using image conjugate points through conventional bundle adjustment, as in (Xu, 2015; Yoon and Kim, 2019), all sets of ROPs obtained among time-adjacent stereo image pairs are discussed and utilized as vision measurements. Second, our SVO method applies 2D-to-2D correspondences to estimate the initial motion while solving the scale ambiguity. In contrast to the conventional use of 3D-to-3D and 3D-to-2D correspondences, as in (Geiger et al., 2011) and (Nistér et al., 2006), respectively, the procedure of our SVO method does not require generating 3D point clouds. Accordingly, both calculation complexity and resource requirements could be decreased. Finally, a network adjustment model based on adjusting all sets of ROPs is developed. The motion can be optimized locally while achieving reasonable accuracy. Through our SVO method, all sets of solved ROPs and calibrated ROPs are adopted to estimate the final exterior orientation parameters (EOPs) of the SVO system.

Overview of Proposed Method
The proposed SVO based on ROP adjustment follows the workflow shown in Figure 1. Given two stereo image pairs, six sets of ROPs can be formed, one for each pair of images. Two sets of ROPs are obtained through calibration in advance, and the remaining four sets are estimated in the subsequent image matching step.
The image matching process is critical for detecting corresponding feature points on the two images and establishing matching pairs, which enable the estimation of ROPs. In this study, the feature-based SURF method is adopted due to its robustness for VO applications. A random sample consensus (RANSAC) scheme is adopted to extract correct matching pairs. These filtered matching pairs are used to estimate the essential matrix, from which the ROPs are solved. Once all sets of ROPs are obtained from the time-adjacent stereo image pairs, they can be transformed into EOPs that represent the attitude and position in the object frame (O frame). In this study, the first left camera frame is defined as the O frame, even though each captured image has its individual camera frame. By transforming the continuous attitudes and positions into EOPs defined in the same O frame, the EOPs can be optimized using the proposed local optimization based on ROP adjustment.
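The step from an essential matrix to the ROPs can be illustrated with a small NumPy sketch. It is not the implementation used in this study: it assumes the convention E = [t]x R, with a point mapping X1 = R X2 + t between the two camera frames, and the function names are hypothetical. The SVD yields two candidate rotations and a translation direction whose sign and scale are undetermined; a cheirality check on the matched points, omitted here, would select the valid combination.

```python
import numpy as np

def skew(v):
    """Return the 3x3 cross-product matrix [v]x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rodrigues(axis, angle_deg):
    """Rotation matrix from an axis and an angle (Rodrigues formula)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    a = np.radians(angle_deg)
    K = skew(axis)
    return np.eye(3) + np.sin(a) * K + (1.0 - np.cos(a)) * (K @ K)

def decompose_essential(E):
    """Return the two candidate rotations and the translation direction (up to sign)."""
    U, _, Vt = np.linalg.svd(E)
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    rotations = []
    for Wk in (W, W.T):
        R = U @ Wk @ Vt
        if np.linalg.det(R) < 0:  # enforce a proper rotation (det = +1)
            R = -R
        rotations.append(R)
    t_dir = U[:, 2]  # left null vector of E: direction only, sign unknown
    return rotations, t_dir

# Hypothetical ground truth used only to build a test essential matrix.
R_true = rodrigues([0.2, 1.0, 0.1], 5.0)
t_hat = np.array([1.0, 0.05, 0.02])
t_hat = t_hat / np.linalg.norm(t_hat)
E = skew(t_hat) @ R_true

rotations, t_dir = decompose_essential(E)
```

Only five of the six ROPs come out of this step: the rotation (three parameters) and the direction of the translation (two parameters).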

Relative Orientation
Relative orientation refers to the geometric relationship between two images captured in the same object space. This relationship includes differences in both attitude and position between the two images. When an image is captured using a frame camera, it is formed based on the principles of perspective projection, where the center of projection represents the image position, and the optical axis represents the image orientation. To deal with image geometry, a three-dimensional (3D) camera frame is usually formed, associated with the image position and orientation. The origin of the camera frame of an image usually corresponds to the center of projection, and one of the frame axes coincides with the optical axis. Under these circumstances, the relative orientation of two images can be described as a 3D translation and rotation between the two camera frames.
As shown in Figure 2, the left and right camera frames represent the camera coordinate systems of the two images, and the relative orientation can be mathematically modeled using a rotation matrix, $\mathbf{R}$, and a translation vector, $\mathbf{t}$. Therefore, there are six parameters associated with relative orientation: three correspond to the rotation matrix and the other three to the translation vector.

Figure 2. Relative orientation of two images mathematically modelled as a rotation matrix and a translation vector.
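As a minimal sketch of this model, the snippet below derives the relative rotation, the unit translation direction, and the scale from the poses of two cameras. The conventions are assumptions chosen to match the later notation: each pose is a rotation from the camera frame to the O frame plus the position of the projection center in the O frame, and the translation is expressed in the first camera frame. The pose values are hypothetical, apart from the 0.27 m baseline quoted in the system design.

```python
import numpy as np

def rot_y(deg):
    """Rotation matrix about the Y axis by an angle in degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def relative_orientation(R_i, T_i, R_j, T_j):
    """ROPs of camera j relative to camera i.

    R_i, R_j rotate camera-frame vectors into the O frame;
    T_i, T_j are the projection centers in the O frame.
    """
    R_ji = R_i.T @ R_j             # rotation from frame j to frame i
    t_ji = R_i.T @ (T_j - T_i)     # baseline vector expressed in frame i
    scale = np.linalg.norm(t_ji)   # baseline length
    return R_ji, t_ji / scale, scale

# Hypothetical poses: camera j sits 0.27 m along camera i's X axis.
R_i, T_i = rot_y(10.0), np.array([1.0, 0.0, 2.0])
R_j = rot_y(12.0)
T_j = T_i + R_i @ np.array([0.27, 0.0, 0.0])

R_ji, t_hat, s = relative_orientation(R_i, T_i, R_j, T_j)
```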
When a stereo pair of images is captured with a calibrated dual-camera system, the relative orientation of this image pair is known from the calibration. In this case, all six ROPs are known. However, if the relative orientation of two overlapping images is estimated using detected tie points, only five out of six ROPs can be determined, since the baseline scale can be arbitrary. Therefore, the estimated ROPs suffer from scale ambiguity. Under this circumstance, as depicted in Figure 3, the translation vector is normalized to a unit vector denoted as $\hat{\mathbf{t}}$, and the vector scale, $s$, remains unknown.
Figure 3. Estimated relative orientation of two overlapped images by using detected tie points, in which the translation vector scale is unknown.
In summary, the difference between the ROPs of a stereo pair and those of an image pair is whether the scale of relative translation is known. Monocular cameras require scale determination through direct measurement, motion constraints, or integration with additional sensors (Fraundorfer and Scaramuzza, 2011). Stereo cameras can solve the scale ambiguity themselves. In this study, both types of ROPs are used for initial motion estimation and then adopted as measurements in the network adjustment for optimizing the original attitude and position.
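The scale ambiguity can be checked numerically. In the sketch below, which uses synthetic points and hypothetical pose values, the epipolar constraint x1^T [t]x R x2 = 0 holds equally well for the unit baseline and for a baseline five times the true length, so tie points alone cannot fix the length of the translation vector. The convention X1 = R X2 + t is an assumption of this example.

```python
import numpy as np

def skew(v):
    """Return the 3x3 cross-product matrix [v]x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Hypothetical relative pose: 2 degree rotation about Y, 0.27 m baseline.
a = np.radians(2.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.27, 0.0, 0.0])

# Synthetic object points in camera frame 2, 4 to 8 m in front of the camera.
rng = np.random.default_rng(0)
X2 = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(20, 3))
X1 = X2 @ R.T + t                 # same points in camera frame 1

# Normalized image coordinates (unit focal length).
x1 = X1 / X1[:, 2:3]
x2 = X2 / X2[:, 2:3]

def epipolar_residuals(t_vec):
    """Residuals x1^T [t]x R x2 for every tie point."""
    E = skew(t_vec) @ R
    return np.einsum("ni,ij,nj->n", x1, E, x2)

res_unit = epipolar_residuals(t / np.linalg.norm(t))
res_scaled = epipolar_residuals(5.0 * t)
```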

Motion Estimation
As mentioned previously, six sets of ROPs can be formed among time-adjacent stereo image pairs. As shown in Figure 4, if the image pair captured at epoch $t$ is denoted images 1 and 2 and the image pair captured at epoch $t+1$ is denoted images 3 and 4, the six ROP sets are $(\mathbf{R}_2^1, \mathbf{t}_2^1)$, $(\mathbf{R}_4^3, \mathbf{t}_4^3)$, $(\mathbf{R}_3^1, \mathbf{t}_3^1)$, $(\mathbf{R}_4^1, \mathbf{t}_4^1)$, $(\mathbf{R}_3^2, \mathbf{t}_3^2)$, and $(\mathbf{R}_4^2, \mathbf{t}_4^2)$. Among the six ROP sets, $(\mathbf{R}_2^1, \mathbf{t}_2^1)$ and $(\mathbf{R}_4^3, \mathbf{t}_4^3)$ are known from the system calibration of the applied SVO system and remain constant. The other four ROP sets can be evaluated with image tie points obtained through image matching.
Table 1 lists the notations of all sets of ROPs in the time-adjacent stereo image pairs. The coordinate systems of the left and right images at epoch $t$ are named frame 1 and frame 2, respectively, and those of the left and right images at epoch $t+1$ are named frame 3 and frame 4, respectively. $\mathbf{R}_2^1$ is the rotation matrix from frame 2 to frame 1. $\hat{\mathbf{t}}_2^1$ is a normalized direction vector defined in frame 1, pointing from the origin of frame 1 to the origin of frame 2. $s_2^1$ is the scale of the relative translation, that is, the distance from the origin of frame 1 to the origin of frame 2. The other notations are defined in the same way.

[Table 1: ROPs, relative rotation, scale of relative translation]
The ROP sets $(\mathbf{R}_2^1, \mathbf{t}_2^1)$ and $(\mathbf{R}_4^3, \mathbf{t}_4^3)$, which belong to the stereo pairs, are known from the calibration performed in advance. The remaining four sets of ROPs, which belong to image pairs formed across the two epochs, are solved after the RANSAC process. In this step, the initial EOPs can be solved based on the geometric relation among time-adjacent stereo image pairs.
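The text does not spell out which geometric relation is used for the initial EOPs. One plausible reading, sketched below under that assumption, is triangle closure: the positions of images 1 and 2 are fixed by the calibrated baseline, and the two estimated unit directions towards image 3 intersect at its position, which also fixes the two unknown scales. All values are synthetic, the function name is hypothetical, and the directions are assumed to be already rotated into the O frame.

```python
import numpy as np

def intersect_scales(T_a, d_a, T_b, d_b):
    """Solve T_a + s_a * d_a = T_b + s_b * d_b for (s_a, s_b) by least squares."""
    A = np.column_stack([d_a, -d_b])          # 3 equations, 2 unknown scales
    s, *_ = np.linalg.lstsq(A, T_b - T_a, rcond=None)
    return s

# Images 1 and 2: O frame origin and calibrated baseline (0.27 m along X).
T1 = np.zeros(3)
T2 = np.array([0.27, 0.0, 0.0])

# Synthetic true position of image 3, used only to generate unit directions.
T3_true = np.array([0.10, 0.02, 1.00])
d13 = (T3_true - T1) / np.linalg.norm(T3_true - T1)   # direction 1 -> 3
d23 = (T3_true - T2) / np.linalg.norm(T3_true - T2)   # direction 2 -> 3

s13, s23 = intersect_scales(T1, d13, T2, d23)
T3 = T1 + s13 * d13   # initial position of image 3 in the O frame
```

The intersection degenerates when image 3 lies on the line through images 1 and 2, because the two directions are then parallel.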

Local Optimization Based on ROP Adjustment
To optimize the EOPs obtained in the previous step, a two-step network adjustment based on ROPs is developed for local optimization in SVO. In the first step, only relative rotations are used as measurements in the adjustment, along with the six inner constraints of each rotation matrix.
The weights of the measurements are determined based on the number of inliers from the RANSAC results. All inner constraints of the rotation matrices, as well as the known relative rotation, $\mathbf{R}_4^3$, are assigned an extremely large weight. The entire least-squares form of the rotation adjustment is shown in Equation (1), whose terms are the residual vector, the design matrix, and the weight matrix.
This adjustment model has a total of 57 measurements and 18 unknown parameters, which compose the two rotation matrices in the O frame. The output of this rotation adjustment is then used as known coefficients in the next step of the network adjustment.
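The stated totals can be reproduced with a short count. The breakdown below is one reading consistent with the text, not a confirmed detail of the adjustment model: five relative rotations involve the unknown images 3 and 4 (four estimated sets plus the calibrated rotation between images 3 and 4), each contributing nine matrix elements, and each of the two unknown rotation matrices contributes six orthogonality constraints. The helper name is hypothetical.

```python
import numpy as np

def orthogonality_residuals(R):
    """Six independent entries of R^T R - I (upper triangle with diagonal)."""
    G = R.T @ R - np.eye(3)
    return G[np.triu_indices(3)]

n_relative_rotations = 5   # four estimated sets plus the calibrated 3-4 rotation
n_unknown_rotations = 2    # rotation matrices of images 3 and 4 in the O frame

n_measurements = 9 * n_relative_rotations + 6 * n_unknown_rotations
n_unknowns = 9 * n_unknown_rotations

# The constraints vanish for a valid rotation and not for a scaled one.
res_ok = orthogonality_residuals(np.eye(3))
res_bad = orthogonality_residuals(1.01 * np.eye(3))
```

Treating all nine elements of each rotation matrix as unknowns is what makes the six heavily weighted inner constraints necessary: they leave three effective degrees of freedom per rotation.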
In the second step, only the relative translations are used as measurements in the adjustment. Equation (2) gives the entire least-squares form of the translation adjustment, whose terms are again the residual vector, the design matrix, and the weight matrix. The weights of the measurements are set based on the number of inliers from the RANSAC results. Only the known relative translation, $\hat{\mathbf{t}}_4^3$, is assigned an extremely large weight; it acts as the constraint in this adjustment model. Therefore, there are 15 measurements and 6 unknown parameters, which compose the two positions in the O frame.
During the adjustment process, it is necessary to calculate and update the scales based on the solved EOPs in each iteration. The only fixed coefficient is the known scale, $s_4^3$. Equation (3) shows how the scale is calculated.
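A minimal sketch of the translation step follows, assuming each measurement takes the form T_j - T_i = s_ij d_ij, where d_ij is the unit direction already rotated into the O frame with the adjusted rotations. Five image pairs give 15 equations for the 6 unknown coordinates of images 3 and 4. The scales are recomputed as distances between the solved positions after each solve, except the calibrated scale between images 3 and 4. The weights, positions, and function names are hypothetical, and the loop is not claimed to reproduce the convergence behaviour of the published model.

```python
import numpy as np

PAIRS = [(1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]  # 5 pairs x 3 = 15 measurements
COL = {3: 0, 4: 3}   # column offsets of the unknown positions of images 3 and 4

def adjust_positions(directions, scales, weights, known):
    """One weighted least-squares solve of T_j - T_i = s_ij * d_ij."""
    A = np.zeros((15, 6))
    l = np.zeros(15)
    P = np.zeros((15, 15))
    for k, (i, j) in enumerate(PAIRS):
        rows = slice(3 * k, 3 * k + 3)
        A[rows, COL[j]:COL[j] + 3] = np.eye(3)
        l[rows] = scales[(i, j)] * directions[(i, j)]
        if i in COL:
            A[rows, COL[i]:COL[i] + 3] = -np.eye(3)   # image i is also unknown
        else:
            l[rows] += known[i]                       # move known T_i to the right side
        P[rows, rows] = weights[(i, j)] * np.eye(3)
    x = np.linalg.solve(A.T @ P @ A, A.T @ P @ l)
    return {**known, 3: x[:3], 4: x[3:]}

def update_scales(T, fixed):
    """Recompute every scale as a distance between positions; keep fixed ones."""
    scales = {(i, j): np.linalg.norm(T[j] - T[i]) for i, j in PAIRS}
    scales.update(fixed)
    return scales

# Synthetic, noise-free configuration used to generate the measurements.
truth = {1: np.zeros(3), 2: np.array([0.27, 0.0, 0.0]),
         3: np.array([0.05, 0.0, 1.0]), 4: np.array([0.32, 0.01, 1.02])}
known = {1: truth[1], 2: truth[2]}
true_scales = update_scales(truth, fixed={})
directions = {p: (truth[p[1]] - truth[p[0]]) / true_scales[p] for p in PAIRS}

# Hypothetical inlier-based weights; the calibrated pair gets a very large one.
weights = {(1, 3): 120.0, (1, 4): 95.0, (2, 3): 110.0, (2, 4): 130.0, (3, 4): 1e6}
fixed = {(3, 4): true_scales[(3, 4)]}

scales = {p: 1.0 for p in PAIRS}   # crude initial scales
scales.update(fixed)
for _ in range(20):
    T = adjust_positions(directions, scales, weights, known)
    scales = update_scales(T, fixed)
```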
Figure 5 illustrates an example of network adjustment. In this example, each set of ROPs in different camera frames has been transformed into the corresponding direction vector in the O frame, represented by different colors. Before the network adjustment, these six direction vectors cannot be aligned properly. However, after the network adjustment, they are perfectly aligned, demonstrating an improvement in SVO.
Compared to the conventional bundle adjustment of image points, the network adjustment based on ROPs offers the advantage of requiring less computation. In the network adjustment, the number of ROP measurements in the least-squares form is fixed and relatively small: there are only 57 measurements in the relative rotation adjustment model and 15 measurements in the translation adjustment model. In bundle adjustment, by contrast, the number of measurements grows with the number of image points.

IMPLEMENTATION AND EXPERIMENTS
This section begins by demonstrating the implementation of an SVO system. Next, the design and calibration of the SVO system are explained. Finally, two tests are conducted, and the experimental results are provided for both outdoor and indoor scenarios. A detailed analysis is also presented to evaluate the proposed SVO method. The position difference between the estimated trajectory and the corresponding GCP is calculated. The corresponding mean difference and Root Mean Square Difference (RMSD) are computed. Furthermore, the total difference based on the endpoint and the drift ratio are also calculated for evaluation.
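The evaluation quantities can be sketched as follows. The definitions are assumptions where the text is silent: differences are taken as estimated minus reference, the endpoint drift is the 3D distance between the estimated and reference endpoints, and the drift ratio is that distance divided by the travelled path length. The function names and sample coordinates are hypothetical and are not results from the experiments.

```python
import numpy as np

def position_statistics(estimated, reference):
    """Per-axis mean difference and root mean square difference (RMSD)."""
    diff = np.asarray(estimated, dtype=float) - np.asarray(reference, dtype=float)
    mean_diff = diff.mean(axis=0)
    rmsd = np.sqrt((diff ** 2).mean(axis=0))
    return mean_diff, rmsd

def drift_ratio(end_estimated, end_reference, path_length):
    """Endpoint drift in meters and drift ratio in percent of the path length."""
    drift = np.linalg.norm(np.asarray(end_estimated, dtype=float)
                           - np.asarray(end_reference, dtype=float))
    return drift, 100.0 * drift / path_length

# Hypothetical check point coordinates (X, Y, Z) in meters.
estimated = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
reference = [[0.0, 2.0, 4.0], [1.0, 2.0, 2.0]]
mean_diff, rmsd = position_statistics(estimated, reference)

drift, ratio = drift_ratio([3.0, 4.0, 0.0], [0.0, 0.0, 0.0], path_length=100.0)
```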

Design of SVO System
To implement the proposed SVO method, an SVO system was designed, as shown in Figure 6. The system consists of two Sony RX0 II cameras mounted on a rigid bar for stability. A synchronized shutter was also developed to capture a stereo image pair at each epoch. The ratio of the camera baseline to the depth of the scene (B/D) is analogous to the baseline-to-height (B/H) ratio used in aerial photogrammetry (PH), and it is critical for depth accuracy. According to (Shen, 2018), a longer camera baseline leads to better accuracy at the same distance. Current stereo camera products, such as the ZED Mini and ZED cameras, have baseline lengths of only 6.3 and 12 centimeters, respectively. For better geometrical intersection, the baseline length of our system was set to about 27 centimeters. The overlap of the two cameras is approximately 69% and 96% when the cameras are 1 meter and 10 meters away from the scene, respectively.
In addition, the proposed SVO method can be applied in various scenarios using the SVO system, which can be mounted on a tripod, camping cart, or land vehicle. This system has potential applications in navigation and mobile mapping. Figure 7 shows the camping cart attached with the SVO system. The stand on the camping cart is made of aluminum extrusions for easy assembly.
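The effect of the baseline on depth accuracy can be illustrated with the standard first-order error model for a rectified stereo pair, in which the depth standard deviation is Z^2 / (B f) times the disparity standard deviation. This is a textbook approximation, not a result taken from (Shen, 2018), and the focal length in pixels and the disparity precision below are hypothetical values chosen only for the comparison.

```python
def depth_sigma(distance_m, baseline_m, focal_px, disparity_sigma_px):
    """First-order depth standard deviation of a rectified stereo pair, in meters."""
    return distance_m ** 2 / (baseline_m * focal_px) * disparity_sigma_px

focal_px = 1000.0          # hypothetical focal length in pixels
disparity_sigma_px = 0.5   # hypothetical matching precision in pixels

# Baselines quoted in the text: 6.3 cm, 12 cm, and about 27 cm.
baselines_m = [0.063, 0.12, 0.27]
sigmas_at_10m = [depth_sigma(10.0, b, focal_px, disparity_sigma_px)
                 for b in baselines_m]

for b, sigma in zip(baselines_m, sigmas_at_10m):
    print(f"baseline {b:.3f} m -> depth sigma at 10 m: {sigma:.3f} m")
```

Under this model the depth uncertainty at a given distance is inversely proportional to the baseline, so the 27 cm baseline reduces it by a factor of 0.27 / 0.12 = 2.25 relative to a 12 cm baseline, whatever focal length and matching precision are assumed.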

Calibration of ICPs
Accurate system calibration is essential for obtaining precise ICPs and ROPs in the proposed SVO method. The distribution of detected object points during camera calibration determines the accuracy of the derived ICPs. Instead of using the computer vision (CV) calibration method with a flat checkerboard, we adopted the photogrammetry (PH) calibration method to acquire accurate ICPs and ROPs for the SVO system. The calibration was conducted in an indoor calibration field with coded targets. Fifty stereo image pairs were taken, and the ICPs were solved using the Australis photogrammetric software. The overall estimated precision of the calibration was 0.16 pixels. To ensure compatibility with the CV standard, the PH ICPs were converted to the CV standard using the ICP transformation method proposed by (Lin et al., 2022), as listed in Table 2. Finally, the rectified images generated using these CV ICPs were used as input to the proposed SVO method.

Calibration of ROPs
The calibrated ROPs were obtained by calculating the ROPs of each of the 50 stereo image pairs from the EOPs generated by the Australis photogrammetric software. The average of these ROPs was then used as the final calibrated ROPs, which serve as constraints in the proposed SVO method. The calibration results for the ROPs are listed in Table 3.
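The text states only that the per-pair ROPs were averaged. As one possible way to do this, the sketch below averages rotation matrices with a chordal mean (element-wise mean projected back onto a rotation with an SVD) and averages the baseline vectors directly. This is a swapped-in technique for illustration: for the small spread expected from a rigid stereo rig it gives nearly the same result as averaging the rotation angles. All sample values and function names are hypothetical.

```python
import numpy as np

def rot_y(deg):
    """Rotation matrix about the Y axis by an angle in degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def mean_rotation(rotations):
    """Chordal mean: project the element-wise mean onto the nearest rotation."""
    M = np.mean(rotations, axis=0)
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # guard against a reflection
    return U @ D @ Vt

# Hypothetical per-pair ROPs standing in for the 50 calibration pairs.
rotations = [rot_y(a) for a in (0.9, 1.0, 1.1)]
baselines = [np.array([0.270, 0.000, 0.0]),
             np.array([0.271, 0.001, 0.0]),
             np.array([0.269, -0.001, 0.0])]

R_cal = mean_rotation(rotations)
t_cal = np.mean(baselines, axis=0)
```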

[Table 3: relative rotation, average (degree)]
The EOPs in the O frame include the attitude and position parameters: the rotation angles around the X, Y, and Z axes, and the position in the X, Y, and Z directions. The units for the attitude and position parameters are degrees and meters, respectively.

Test Field
The outdoor test field is located around the museum at National Cheng Kung University (NCKU). The museum building and its surroundings provide clear texture for the scene, as shown in Figure 8. The outdoor environment provides sufficient GNSS signal for accurate measurement of ground control points (GCPs).
Seven GCPs are set up to evaluate the performance of the SVO system in the test field. The horizontal and vertical precisions of the GCPs are better than 0.015 (m) and 0.03 (m), respectively.

Test Results
The horizontal trajectory of the SVO system during the outdoor test is illustrated in Figure 9. The estimated trajectory is shown as a blue curve and the GCPs are indicated by red dots. The results are summarized in Table 4.
Table 4. The results of outdoor test.

Analysis and Evaluation
As shown in Figure 9, the estimated trajectory generated by our SVO system is consistent with the GCPs. The movement is continuous and similar to reality. The position differences between the estimated positions and the GCPs are mostly less than 1 (m), with mean differences of -0.316 (m), 0.084 (m), and -0.174 (m) in the X, Y, and Z directions, respectively. The RMSDs in the X, Y, and Z directions are ±0.632 (m), ±0.344 (m), and ±0.684 (m), respectively. For the entire route, the overall drift is 0.662 (m), with a translation ratio of 0.73 (%), and the scales estimated by our SVO method are reasonable and consistent with the GCPs.
In conclusion, our SVO method is feasible and has been successfully implemented using the designed SVO system.The estimated positions and total distance compared to GCPs demonstrate that our method produces suitable results.
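As a rough plausibility check, and assuming the translation ratio is the endpoint drift divided by the travelled path length (the definition is not stated explicitly), the reported outdoor figures imply a route length $L$ of about 91 m:

```latex
\text{translation ratio} = \frac{\text{drift}}{L}
\quad\Longrightarrow\quad
L \approx \frac{0.662\ \text{m}}{0.0073} \approx 90.7\ \text{m}
```

Because the ratio is rounded to two significant figures, this estimate is only good to roughly one meter.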

Test Field
The indoor test field is located on the first floor of the Department of Geomatics at NCKU. The scenario is quite diverse, and some areas lack texture, such as the white walls. The entire space is compact, measuring about 16 (m) × 12 (m). This indoor scenario is quite challenging for the SVO method. Figure 10 shows various scenes that our SVO system passes through. Several GCPs are set up in this indoor field, and their coordinates are defined in a local frame. The coordinates of these GCPs were measured using a total station and level. The horizontal and vertical precisions of the GCPs are both better than 0.030 (m).
Figure 10. The scenarios in the indoor test.

Test Results
The horizontal trajectory of the SVO system during the indoor test is illustrated in Figure 11. The estimated trajectory is shown as a blue curve, and the GCPs are indicated by red dots. The results of the SVO system are summarized in Table 5.

Analysis and Evaluation
Based on Figure 11, the estimated trajectory aligns with the GCPs, indicating that the movement is consistent with the actual experiment. For the entire route, the overall drift is 0.259 (m), with a translation ratio of 1.25 (%). The position differences between P01, P02 and the corresponding GCPs are acceptable, with minimal differences in the X direction (about 0.01 m). However, P03 has the maximum position differences in the X and Y directions. A likely reason is that the scene near P03 contains only a white wall at a distance of less than 2 (m). Thus, far fewer conjugate points are found in the time-adjacent stereo image pairs, which degrades the results. The RMSDs in the X, Y, and Z directions are ±0.311 (m), ±0.325 (m), and ±0.193 (m), respectively. Although these values are not ideal due to the challenging indoor field, the SVO system successfully recovers the moving trajectory, indicating its feasibility.

CONCLUSIONS
This study presents a novel approach to SVO, using ROPs instead of image points as vision measurements to estimate the attitude and position of a vehicle-mounted SVO system. Unlike conventional bundle adjustment methods that require calculating 3D point clouds, the developed network adjustment model reduces complexity by using fewer measurements.
An SVO system is developed and mounted on a camping cart, with a longer baseline than commercial stereo cameras for better geometric intersection. The system is calibrated using the PH method to obtain ICPs and ROPs, which are then transformed to the CV standard for SVO implementation.
Two experiments are conducted in outdoor and indoor test fields with changing textures, lighting, shadows, and pedestrian and vehicle traffic. In both tests, the estimated positions are comparable to the GCPs, with overall drifts of 0.662 (m) and 0.259 (m), and translation ratios of 0.73 (%) and 1.25 (%), respectively. The proposed SVO method successfully achieves continuous attitude and position estimation, and the estimated trajectories are consistent with reality.

Figure 1. Workflow of proposed SVO based on ROP adjustment.

Figure 4. Six sets of ROPs in the time-adjacent stereo image pairs.

Figure 5. An example of the network adjustment: (a) before the network adjustment; (b) after the network adjustment.

Figure 7. The camping cart attached with the SVO system.

Figure 8. The scenarios in the outdoor test.

Figure 9. The horizontal trajectory of SVO in the outdoor test.

Figure 11. The horizontal trajectory of SVO in the indoor test.

Table 1. Notations of ROPs and EOPs among two time-adjacent image pairs.

Table 2. Results of system calibration for ICPs (CV standard).

Table 3. Results of system calibration for ROPs.

Table 5. The results of indoor test.