Polar-vision 1 : A Novel Collinearity Equation of Perspective Projection in Polar Coordinate System

: Progress has been made in the community of photogrammetry and 3d computer vision in addressing the mathematical challenge posed by the collinearity equation. We introduce a new method for establishing the coordinate reference for 2d pixels and 3d landmarks using ' angular coordinates ' . The mathematical relationships required for converting 3d landmarks, expressed in angular coordinates, to the camera framework are presented. The landmarks are then projected using perspective projection to obtain 2d pixels represented in angular coordinates. This framework is formally nominated as the 'Polar-vision 1 ', which has been developed and integrated into the commercial software G3D-Cluster. Its application to pinhole camera image processing has demonstrated superior efficiency and admission rates of tie points, as well as reconstruction detail capabilities, compared to OpenMVG, achieving approximately a 1.4x improvement. The project 'Key Technologies and Tool System for Realistic 3D Modeling through Integration of Multi-Source Information in the Space-Air-Ground Domain' was awarded First Prize at the 2023 Surveying Science and Technology Awards, with Polar-vision 1 as one of the innovative points.


Motivation
The collinearity equation, also known as the projection matrix (Hartley & Zisserman, 2003) in 3d computer vision, is a fundamental theory in Photogrammetry.It serves to connect 2d images with the 3d physical world, allowing us to extract and optimize 3d metric information from 2d images using various technologies such as camera calibration (Zhang, 1998), structure from motion (Schonberger & Frahm, 2016), and multiview stereo (Goesele et al., 2006).It is important to note that the well-known collinearity equation conveys the concept of Euclidean geometry (Ryan, 1986), describing a bundle of rays that connect the center of projection, 2d pixels (image points), and 3d landmarks (object points) determined by the Euclidean coordinate.
In reality, a bundle of optical rays is better represented in the formula of polar coordinate due to its directional property.Instead of using mutually perpendicular axes, angles are more suitable for describing the vector nature of hitting positions of optical rays.The hitting points of an optical ray on the focal plane and the object are referred to as 2d pixel and 3d landmark, respectively (Rosenfeld, 1988).This inspires us to redefine those 2d-3d points and their mathematical relationship using the concept of "angular coordinate".That is a novel principle of collinearity equation to serve the community of Photogrammetry, nominated as Polar-vision 1 .

Related Works
The collinearity equation consists of three components that can be parameterized: pixel, landmark, and rotation.By optimizing these components non-linearly across different spatial domains, it has a significant impact on the accuracy of calculations related to camera distortions, interior-exterior orientation parameters, and the quality of 3D reconstruction (Triggs et al., 2000).

Pixel
In situations where GPS, IMU, and laser altimeters are unavailable, 2d pixels basically serve as the only observation constraint for photogrammetry.According to research, the majority of work in the fields of photogrammetry and 3d computer vision parameterizes the position of pixels into Euclidean coordinates (x, y).It estimates all camera states during photogrammetry by constructing observation equations for x and y and reconstructs 3D sparse scenes (Li, 1995;Mikhail and Bethel, 2001).
The camera lens deviates from the ideal pinhole model, causing light to refract through lenses of various shapes.This refraction results in the "bending" of light rays (though this term is used figuratively for ease of understanding, not to imply physical bending).Consequently, the pixels calculated by the observation equation represent ideal positions, disregarding distortion effects caused by lens imperfections.When using pixels with distortion effects in photos as observation constraints to estimate the camera state, significant systematic errors may be introduced in the estimation results.Therefore, it is necessary to model the distortion effect to accurately parameterize the pixels in the photo.In the 20 th century, the Brown Conrady model was widely regarded as the "standard" radial and tangential distortion model (Brown, 1996;Conrady, 1919).However, its inconsistency between the Euclidean parameterization of pixel coordinates and the polar forms of distortion effects (radial and tangential) rendered it incapable of adapting to the complex distortion effects of wide-angle, ultrawide-angle, and fisheye lenses.It wasn't until the early 21 st century that a breakthrough occurred with the introduction of the Kannala Brandt model, published in the TPAMI journal (Kannala, Brandt, 2006).This model marked the first successful replacement of the Brown Conrady model in a century.It characterized the distortion effect as a function of the incident angle of light passing through the lens.Some key advantages of this model include: (1) The parameterized distortion function based on the incident angle (denoted as θ) is smoother and more amenable to modeling as a power series.(2) The formula can be adjusted according to θ to support various types of projections.
In simpler terms, photogrammetry begins with the observations captured in photos and then reconstructs the process of translating a 3d scene into a 2d photo through collinearity equations.The goal is to recover the depth information lost in the 2d photo.This translation process is known as projection, where the 3d scene is rigidly transformed into the camera frame and then projected onto the photo following specific rules.This forms the underlying construction logic of collinearity equations.The projection rules supported by the Kannala-Brandt model are as follows: (1) Perspective: θ is parameterized as tan -1 (r/f) (Yang et al., 2005).( 2) Stereographic: θ is parameterized as 2tan -1 ( r/2f ) (Chang et al., 2013).(3) Equisdistant: θ is parameterized as r/f (Hughes et al., 2010).( 4) Equisolid: θ is parameterized as 2sin -1 (r/2f) (Eichenseer et al., 2015).( 5) Orthogonal: θ is parameterized as sin -1 (r/f) (Zhang et al., 2015).
Inspired from (1) the Kannala-Brandt model's parameterization design of distortion effects, and (2) a method presented at ICCV conference that uses 'angular coordinates' to parameterize reprojection errors (Lee and Civera, 2019).we propose an angular parameterization method for pixel coordinates.This involves replacing 'Euclidean coordinates' with 'angular coordinates' to parameterize the position of 2d pixels, thus adapting to the polar coordinate form of distortion effects and the angular distance definition of reprojection errors.At this stage, the camera state and the 3d scene are no longer restricted by the x and y observation equations.Instead, they are limited by a new set of angular coordinate constraints.The observation equations may benefit from the unique properties of angular coordinates.Unlike Euclidean coordinates, which are defined in two specific directions of the photograph, angular coordinates exhibit isotropy and are independent of photograph orientation.Moreover, compared to the Euclidean distances (d0, d1), the distance between two pixels measured in 'angular coordinates' (θ0, θ1) exhibits intrinsic rotational invariance.This property makes the objective function defined by it more robust against outliers encountered during the optimization process.

Landmark
According to research, almost all studies (Wang, 2007;Zhang, 2007;Liu, 2013;Wang, 2016) in the field of photogrammetry parameterise the positions of 3d landmarks in Euclidean coordinates (X, Y, Z).However, this parameterisation is no longer sufficient to address certain specific 3d tasks in the field of computer vision.We focus mainly on monocular vision (Mur Artal et al., 2015) and does not currently address SLAM tasks based on binocular vision (Mur Artal and Tardós, 2017) and LiDAR (Cole and Newman, 2006).In the absence of range sensors, bearing only localisation and mapping (BOLAM) tasks based on monocular vision assume that the target moves linearly in a Euclidean coordinate system.However, the orientation information of the target observed by the sensors is expressed in polar coordinates.Continuing with conventional parameterisation would introduce significant nonlinear challenges into the observation equation (Deans, 2005).In addition, it would prevent BOLAM systems based on Extended Kalman Filter (EKF) from achieving real-time (no delay) initialisation mapping (Chiuso et al., 2002;Bailey, 2003).
The initial mapping phase is crucial and challenging for BOLAM tasks.Monocular vision provides an irreversible rankdeficient measurement, making it difficult to estimate camera trajectories and scene maps through subsequent EKF filtering.There are two main types of methods: (1) Delayed initialization occurs when the camera motion acquires sufficient parallax before initialization begins.During this motion, low-parallax observations are entirely disregarded, leading to inconsistency in the mapping process and potentially resulting in filter failure (Davison, 2003).( 2) Undelayed Landmark Initialization (ULI) involves initializing points either in the direction of camera motion or located very far from the camera upon the first observation.These landmarks may remain visible throughout the entire motion process.ULI is employed to constrain the camera direction and enhance consistency throughout the mapping process (Sola et al., 2005;2008).
If the landmarks utilized by ULI are represented as Euclidean points, the uncertainty interval of their depth information (Zcoordinate) remains unbounded and cannot be resolved using EKF (Terejanu, 2008).At the 2008 ICCV conference, a novel parameterization method was proposed, which involves parameterizing the depth of landmarks (unobservable degrees of freedom) into inverse depth.This approach transforms the uncertainty interval from unbounded to bounded (Civera et al., 2008).Another parameterization method, similar to inverse depth but distinct from it, is inverse distance.Inverse distance is defined as the reciprocal of the distance to the origin of the world reference frame (Sola et al., 2012), and it is expressed in a homogeneous manner as (m, ρ).Note that when transformed into the camera frame, the inverse distance becomes the reciprocal of the distance to the camera center.Compared to Euclidean points, while the uncertainty interval of the inverse distance point is bounded, bilinearity is introduced into the equations when transformed into the camera framework.This introduces some unfavorable factors to the performance of the EKF, which requires a reasonably linear system (Hartley and Zisserman, 2003).Therefore, inspired by the reference (Eade and Drummond, 2006), the concept of anchor (the initial camera center) was introduced to alleviate the adverse effects of bilinearity.Compared to the inverse depth defined by three specific directions in space, the inverse distance has the advantage of isotropy and is independent of the direction of the spatial reference frame.It is expressed as (p0, m, ρ).Since m represents the directional vector of landmarks relative to anchor points, expressing it as Euclidean coordinates is clearly redundant.Therefore, the direction vector is parameterized as the azimuth and elevation angles of the landmarks relative to the anchor.The homogeneous expression of the inverse distance is then refined to (p0, α, β, ρ).Compared to Euclidean points, the use of inverse distance parameterization leads to improved consistency in EKF-SLAM mapping based on monocular vision.It is important to note that EKF is not the only option for SLAM systems.However, while inverse distance is tailored to EKF, it exhibits poor performance in Bundle Adjustment (BA).In BA, landmarks in the direction of camera motion can result in illconditioned normal equations (Konolige and Agrawal, 2008).
To address the issue of inverse depth in BA, (Zhao et al., 2015) proposed a parameterization of landmark depth using parallax angle defined by two anchors, expressed homogeneously as (pm, pa, α, β, γ).Their open-source project surpasses the performance of SBA (Lourakis and Argyros, 2009) and sSBA (Konolige and Garage, 2010) projects, which use Euclidean points.(Zhao et al., 2015) analysed the effects of parallax angle, inverse depth, and Euclidean parameterisation of landmarks on BA performance in a 2d BOLAM system.In contrast, (Zuo et al., 2023) investigated the impact of parallax angle parameterisation of landmarks on BA performance in a 3d BOLAM system.(Sun, 2015) extended the parameterisation of parallax angle from the BOLAM system to the UAV photogrammetry based on the pinhole camera model.The IEEE standard for this method was published in 2023 (IEEE 1937(IEEE .11-2023)).Additionally, (Zuo, 2023) applied the parameterization of parallax angle to satellite photogrammetry using the linear pushbroom model.All of these investigations have shown a significant decrease in the correlation coefficients of 3d structural variables parameterized by parallax angles.This reduction significantly relaxes the strict geometric requirements imposed by adjustment models in photogrammetric measurements.
However, the pixel coordinates are still represented in the Euclidean domain, resulting in a semi-angular domain transformation from the landmark to the pixel.This causes inherent dimensional inconsistency between the observational constraints and the state estimations, making the rigid transformation difficult and complicating the collinearity equation.As a result, numerical instabilities and suboptimal convergence properties occur during the adjustment and optimization processes.Therefore, we parametrize the pixel coordinates and the rigid transformation in the angular domain.

Rotation
The concept of rotation is typically classified into two categories.The first involves rotating a target object within a fixed spatial reference frame, such as the movements of the seven joints of a robot (ankle, knee, hip, shoulder, elbow, wrist, finger) or the rolling, pitching, and yawing actions of an aircraft.The second category involves rotating the spatial reference frame around a fixed target object (Foley, 1996).
In photogrammetry, data collection usually involves rotating the aircraft and camera gimbals.Data process, however, involves rotating a spatial reference frame.We focus solely on the latter type of rotation, which is the initial step in establishing a connection between the physical and digital worlds.
Euler angles were first proposed by (Euler, 1765).The concept was originally presented in Latin and can be referenced in (Weisstein, 2009).Any rotation of a rigid body can be parameterized by rotating three angles around three orthogonal axes (θx, θy, θz) in a specific order.Therefore, it is necessary to clarify the rotation sequence of the three axes before using Euler angles.However, this introduces the problem of universal deadlock, as noted by (Brezov et al., 2013).
In 1840, French mathematicians (Rodrigues, 1840) proposed the concept of the Axis angle parameterization to represent rotation as (n, θ).The original version was in French and can be referred to (Gray, 1980;Caccavalle et al., 1999).The Rodrigues rotation formula, described by (Murray et al., 2017), can rotate any 3d vector around the unit axis n by an angle θ.This formula was also adopted by the OpenMVG framework, as noted by (Moulon et al., 2017).The angle θ can be decomposed into three orthogonal directions of the rotation axis n, similar to the Euler angle rotation around three orthogonal axes.However, unlike Euler angles, the Rodrigues rotation formula does not require a predefined rotation sequence, which mitigates the issue of gimbal lock.(Craig, 2005) proposed a parameterization method for the rotation vector by multiplying the angle θ by the axis n.This method uses a 3d vector to represent the rotation axis and angle.The OpenMVG framework does not directly involve the axis angle in gradient calculation.Instead, it indirectly updates the axis angle by computing the partial derivative of the observation equation with respect to the rotation vector.The landmark is then rotated to the camera frame using the Rodrigues formula.
In 1840, the Irish mathematician Hamilton introduced the concept of Quaternions (Hamilton, 1840), which parameterizes rotations into a pair of conjugate quaternions (q, q*).To rotate any 3d vector v around the unit axis n by angle θ, it is written as a pure quaternion v, left-multiply by q and right-multiplied by q*.This concept is similar to the axis-angle representation, as conjugate quaternions can be transformed equivalently using the Rodrigues formula.Quaternions are not used directly in gradient calculations.Instead, they are first converted to axis angles and then to rotation vectors to compute correction increments.These updated rotation vectors are subsequently converted back to quaternions.The landmarks are transformed to the camera frame by multiplying them with conjugate quaternions.
Rotation matrix (Weisstein, 2003) is necessary to directly rotate landmarks to the camera frame, as Euler angles cannot achieve this.The rotation matrix shares similarities with the Rodrigues formula and conjugate quaternions.It was first developed based on research conducted by the French mathematician (Cauchy, 1815).Deriving the rotation matrix directly to estimate the camera rotation is not possible due to the addition and subtraction operations involved in gradient calculations, which lack closure.To address this issue, Lie group and Lie algebra theory (Lie, 1880;1888) can be introduced.For further details, please refer to the original work in French (Sola et al., 2018).Indeed, these rotation methods are developed in the Euclidean space domain, which is not suitable for rotating landmarks in the angular domain.Therefore, we propose a new method to rotate landmarks in the full angular domain.

Parameterization for Pixel in Angular Domain
We define two angles, namely the viewing-angle and the polarangle, to parameterize the positions of pixels, instead of using Euclidean coordinates (x, y).The viewing-angle θ is defined as the angle between the ray passing through the pixel and the principal axis, which is equivalent to the distance of the pixel from a pole.The polar angle φ represents the angle of the pixel from the pole axis.As shown in Figure 1, homonymous pixels xi and xj are parametrized in angular domain as Figure 1.Concept of pixel in angular domain.

Parameterization for Transformation in Angular Domain
In the angular domain, the transformation from landmarks parameterized by three angles (α, β, γ) to pixels parameterized by two angles (θ, φ) is defined as 1 2 ; As shown in Figure 2, the left camera is a main anchor, and the right is an associate anchor. 1  and 2  represent the mapping relationships from the landmark (α, β, γ) to homonymous pixels (θi, φi) and (θj, φj) respectively.

Geometric Relationship between Pixel and 3D Rotation in Angular Domain
As shown in Figure 3, we can get the geometric relationship between the pixel (θ, φ) and the 3D rotation (ey, ex, ez) by rotating the plane (that the landmark lies on) in the order of ey→ ex→ez to the plane (that the pixel lies on).It can be observed that the construction of the plane (that the pixel lies on) only requires twice rotations (ey → ex) from the plane (that the landmark lies on), as shown Figure 3a.Then, this plane is rotated by an angle ez to form the final image plane, as shown in Figure 3b.This is a significant difference between our method and traditional methods that use Euclidean coordinates (x, y) to parameterize the pixel.

Derivative for View-angle Observation Equation in Angular Domain
Based on Figure 4, the observation from the view-angle θ is finally derived as ( )

Derivative in Angular Domain
We use Figure 4 to derive the Equation 3-5.Firstly, we give some notations as following: (1) Vertex O, p and P are the camera center, the pixel and the landmark, respectively. (2) Segment O-o and O-p are the main optical axis of the camera and the ray passing through the pixel p, respectively.
Secondly, we indicate some geometric relationships in this figure : (1) Angle τ is the complementary angle of angle β. (2) Angle u is the dihedral angle between the plane SAC and the plane OAP. (3) The dihedral angle between the plane SBA and the plane SBC is the right angle.

Comparative Method
The Open Multiple View Geometry (OpenMVG) is an internationally renowned open-source C++ framework for Structure from Motion (SfM) (https://github.com/openMVG).We utilize it as a comparative method to validate the accuracy and efficiency of our method in processing the pinhole imaging datasets.

Satellite Dataset
We utilized data from the Rosetta OSIRIS (Optical, Spectroscopic, and Infrared Remote Imaging System) released by the Max Planck Institute for Solar System Research, sourced from the European Space Agency's Rosetta Mission.The data was acquired using a frame CCD (Charge-Coupled Device) reflecting telescope mounted on the spacecraft, as shown in Figure 5a.This mission marks the first controlled landing of a lander on the surface of a comet, with the target comet being 67P/Churyumov-Gerasimenko, as shown in Figure 5b.

Results
We used 631 comet images, as shown in Figure 6a, to estimate the photographic orbit composed of satellite positions at different times.The resulting orbit is shown in Figure 6b.Table 1 compares the processing accuracy and efficiency of OpenMVG and ours on this dataset.The evaluation metrics are the average reprojection error of all tie points (measured in pixels, denoted as "px") and the total runtime of the incremental SfM (measured in minutes, denoted as "m").

Indicators
OpenMVG It can be observed that on this dataset, our method achieves processing accuracy comparable to OpenMVG, both converging to the sub-pixel level (0.48 px).However, our method reduces the processing time by 21 minutes compared to OpenMVG (an improvement of about 1.4 times).Additionally, our method computes more tie points (an additional 353) and reprojected pixels (an additional 3,383) than OpenMVG, enhancing the admission rate of tie points and reconstruction detail capabilities.Moreover, our method demonstrates higher efficiency than OpenMVG even with an increase in observation equations, validating the effectiveness and efficiency of our method, particularly when applied to satellite pinhole imaging models.
Furthermore, the result of the 3D model of the comet constructed using our methods is shown in Figure 7. Figure 7a displays a sparse point cloud resulting from the forward intersection of 200,708 tie points.Figures 7b and 7c

Results
We used 857 images from the Poets_park_flats dataset, as shown in Figure 8a It can be seen from Table 2 that our method achieves a processing accuracy that is comparable to that of OpenMVG, We also constructed the sparse point cloud using our method with 493 images from the MapPilot dataset, as shown in Figure 9a.The resulting point cloud is shown in Figures 9b and 9c   Therefore, the effectiveness and efficiency of our method, especially when applied to UAV pinhole imagery models, is confirmed by the fact that our method is more efficient than OpenMVG even with an increase in the number of observation equations.

Potential Significance of Polar-vision 1
In the proposed theory of Polar-vision 1 , (1) the formula of angular coordinate (viewing angle, polar angle) is better to represent (modeling) position of 2d pixel with distortion in radial and tangential directions that are often associated with polar coordinates or circular motion.(2) the numerical differences of angular coordinate of 2d pixels are much less than Euclidean coordinate, improving the stability of numerical computation; (3) the formula of angular coordinate (azimuth angle, elevation angle, parallax angle) is better to represent and be used to estimate the depth of 3d landmark via the 3 rd element in the view of narrow parallax angle; (4) 2d pixel and 3d landmark are both represented in angular coordiate, consistent with attitude angles (roll, pitch, yaw) of the camera in the unit of measurement, degree or arc.This improves the stability of numerical computation in bundle adjustment to refine camera poses and positions of 3d landmark simultaneously; (5) the mathematical relationship between 2d pixel and 3d landmark in angular coordinate is more concise and clearer to reveal the geometry of perspective projection and transformation to camera frame; (6) Unlike x or y coordinate, which is defined with respect to a particular direction in image space, viewing angle has the advantage of being isotropic, that is, its properties are independent of the orientation of the image; (7) The mathematical relationship between the "viewing angle" coordinate of 2d pixel and angular coordinate of 3d landmark is independent of the yaw attitude of the camera.That allows us define the observation equation for the "viewing angle" coordinate of the 2d pixel to refine the roll and pitch attitude of the camera, which makes the block of unknown parameters shrinks for more effective estimation of camera attitudes.

Contributions
The main contributions of this work are (1) the parameterization of 2d-3d points via the angular coordinate and (2) the construction of their mathematical relationship in polar coordinate system; We nominate these two items of work as Polar-vision 1 , in which the superscript 1 reveals a series of potential works to do.That is to regain most theories of 3d computer vision in the perspective of polar coordinate system to exploite their potential significances so as to solve hard problems rooted in the primitive theories based on the Euclidean coordinate system.We will nominate those future works as Polar-vision 2 , Polarvision 3 , etc.

Limitations
There are several limitations in this work: (1) The derivative for the polar-angle observation equation in the angular domain is more challenging than for the viewangle equation.Thus, we only add the view-angle observation equation as the additional constraint for the ParallaxBA project.
(2) The mathematical proof for the performance of the ParallaxBA project in 3d photographic geometry is currently unclear.

Future Works
Polar-vision 1 represents a novel collinearity equation formulated via angular coordinates to connect 2d pixels and 3d landmarks.Then, Polar-vision 1 as a basic equation is replaced of the Euclidean coordinate version in the framework of structure from motion.That serves the pinhole camera model.
In the basis of Polar-vision 1 , (1) Polar-vision 2 will explore the potential significance of distortion modeling via angular coordinate of 2d pixel to serve the technology of camera calibration with distortion.
(2) Polar-vision 3 will investigate the advantage of depth estimation via angular coordinate of 3d landmark to serve the linear pushbroom camera model.
Inspired from those series works of Polar-vision, we will integrate it to the community of LiDAR.A more ambitious idea is to introduce the concept of Riemannian geometry (Petersen, 2006) to regain the principals of 3d computer vision.We will nominate those series of works as Riemannian-vision.
θ = view-angle of the pixel p α, β = azimuth and elevation angle of the landmark P ey, ex = two rotation angles from the plane (that the landmark P lies on) to the plane (that the pixel p lies on).e, v= functions of ey and ex
Photographic orbit constructed from 631 images.
show a dense point cloud and a mesh model, respectively.a b c Figure 7. (a) Sparse point cloud (b) Dense point cloud (c) Mesh model of the comet constructed from 631 images.
, to construct the sparse point cloud using our method.The resulting point cloud is shown in Figures 8b and 8c from different perspectives, which is from the forward intersection of 2,728,281 tie points.a b c Figure 8. Sparse point cloud constructed from 857 images.
from different perspectives.These perspectives are from the forward intersection of 671,331 tie points.a b c Figure 9. Sparse point cloud constructed from 493 images.

Table 1 .
Comparison on Rosetta dataset

Table 2 .
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-1-2024 ISPRS TC I Mid-term Symposium "Intelligent Sensing and Remote Sensing Application", 13-17 May 2024, Changsha, China with both converging to the sub-pixel level (0.92 px vs. 0.93 px).Compared to OpenMVG, however, our method reduces the processing time by 444.2 minutes.It also computes more tie points (an additional 3,666) and reprojected pixels (an additional 14,648) than OpenMVG, improving the tie point acceptance rate and reconstruction detail capabilities.Comparison on Poets_park_flats dataset

Table 3
shows that our method achieves processing accuracy comparable to OpenMVG, both converging to the pixel level (1.01 px).However, our method reduces processing time by 38.7 minutes compared to OpenMVG.It also computes 976 more tie points and 5,328 more reprojected pixels than OpenMVG, improving tie point acceptance rate and reconstruction detail capabilities.