STELLAR: A LARGE SATELLITE STEREO DATASET FOR DIGITAL SURFACE MODEL GENERATION

: Stellar is a large, satellite stereo dataset. It contains rectified stereo pairs of the terrain captured by the satellite image sensors and corresponding true disparity maps and semantic segmentation. Unlike stereo vision in autonomous driving and mobile imaging, a satellite stereo pair is not captured simultaneously. Thus, the same object in a satellite stereo pair is more likely to have a varied visual appearance. Stellar provides flexible access to such stereo pairs to train methods to be robust to such appearance variation. We use publicly available data sources, and invented several techniques to perform data registration, rectification, and semantic segmentation on the data to build Stellar. In our preliminary experiment, we fine-tuned two deep-learning stereo methods on Stellar. The result demonstrates that most of the time, these methods generate denser and more accurate disparity maps for satellite stereo by fine-tuning on Stellar, compared to without fine-tuning on satellite stereo datasets, or fine-tuning on previous, smaller satellite stereo datasets. Stellar is available to download at https://github.com/guo-research-group/Stellar .


INTRODUCTION
Digital Surface Models (DSMs) contain 3D representations of the earth's surface. It records the shape information of terrains and human constructions of an area at specific time stamps. DSMs have broad applications in environmental studies, for example, monitoring and analyzing glaciers melting, water level changes of rivers and lakes, and human activities. Additionally, they are widely used in non-environmental applications such as urban visualization, infrastructure planning, and mixed reality.
There are two major approaches to generating DSMs nowadays. The first approach uses Lidars on UAVs to scan the landscape to produce point clouds, which are then processed to become DSMs. The DSMs from this approach have an average height error as low as 0.3 centimeters (San Diego LiDAR Report, n.d.) thanks to the high accuracy of Lidar sensors. But because the cruising altitude of UAVs is relatively low, the Lidar scanning can only cover a limited area each time. A typical full swath spacing of a UAV Lidar scan is below 1km. Thus, it is slow and expensive to build large-area DSMs via this approach. Furthermore, Lidar scanning relies on GPS for accurate geolocation, but the reliability of GPS signals can be easily affected by landscapes, atmospheric conditions, etc., in practice.
Thanks to the recent maturation of very high resolution (VHR) satellite image sensors, an alternative approach that reconstructs DSMs via stereo matching emerges. In this approach, people first perform two-view or multi-view stereo matching using VHR images of the same area captured from different satellite perspectives to generate local disparity maps. Then they fuse disparity maps of areas together to produce a DSM of a larger area. A VHR image can cover areas of hundreds of sq. km with a ground sampling resolution between 30cm-60cm depending on the viewing angle. Therefore, despite its lower accuracy compared to Lidar scanning, it is cheaper and more efficient for people to create a global DSM via satellite stereo.
The accuracy of stereo matching algorithms has improved these years significantly in street view and indoor settings thanks to the appearance of large, supervised datasets such as KITTI (Menze and Geiger, 2015) and Middlebury (Scharstein et al., 2014). These datasets provide sufficient stereo image pairs and corresponding ground truth disparity maps to train data-hungry deep-learning (DL) methods. However, in the domain of satellite stereo, developing a large-scale supervised dataset with diverse VHR images that supports quantitative evaluation is still at an initial stage in our perspective.
We present Stellar. It is a large satellite stereo dataset that contains stereo VHR image pairs and corresponding ground truth disparity maps generated from Lidar measurements. Stellar is so far the largest in areas of data to the best of our knowledge (Table 1). It is flexible to use. People can select stereo VHR image pairs that have specific time differences or sun angle differences. This allows users to control the appearance difference between the stereo image pairs caused by seasonal vegetation, weather, human activities, and shadows and specular reflections caused by sunlight (Figure 1). Stellar also provides semantic segmentation of the areas (e.g., Figure 1), which enables quantitative analysis of stereo-matching accuracy based on types of areas, e.g., buildings, rivers, and vegetation.
Although the source data of Stellar comes from several public repositories (Bosch et al., 2016, Brown et al., 2018, Van Etten et al., 2018, there are image processing challenges that need to be overcome to build Stellar. First, the Lidar measurements and VHR images use different coordinate reference systems. This means the elevation measured from Lidar has to be calibrated to the same coordinate system as VHR images. We used a novel co-registration method to perform the alignment (Section 3.) Second, it is non-trivial to rectify stereo pairs for VHR images, as VHR sensors on satellites have nonlinear homography and non-conjugate epipolar curves. In Stellar, we developed a patch-based rectification method to overcome this challenge (Section 3.2.3.) Finally, Lidar measurements and VHR images  Table 1. Stellar vs. existing satellite stereo datasets. Apart from covering much larger areas than previous ones, Stellar allows users to select stereo image pairs with specific time differences and sun angle differences, etc. Stellar also supports generating stereo images with variable sizes and provides dense ground truth disparity.
of the same area may be acquired with a time difference longer than a year. The landscape may vary between the Lidar and the VHR image, and among VHR images ( Figure 1). This creates noise in the ground truth disparity map, as the disparity from Lidar may not reflect the true disparity of a stereo VHR image pair. To mitigate this issue, Stellar provides semantic segmentation along with the ground truth disparity maps, which enables users to analyze methods on areas less prone to change over time (Section refsubsubsec:semantic.) To validate the effectiveness of Stellar, we fine-tune two DL stereo matching algorithms on it and on previous satellite stereo datasets, and analyze their performance. Our experiments demonstrate that, in most cases, the DL stereo methods improve their accuracy on satellite stereo by fine-tuning on Stellar, compared to without fine-tuning or fine-tuning on previous, smaller satellite stereo datasets. The detailed description of our experiments is in Section 4.
Stellar is available to download at https://github.com/ guo-research-group/Stellar. The classic two-stage stereo matching pipeline first computes a cost volume from a pair of rectified images, then estimates the disparity map based on the cost volume. Traditionally, semiglobal matching (SGM) (Hirschmuller, 2005) is one of the most popular methods for stereo matching. In the era of deep learning, people leveraged deep neural networks to generate the cost volume (Zbontar et al., 2016), perform stereo matching (Zhang et al., 2019), or both. These DL methods clearly outperform traditional, non-DL methods on standard stereo benchmarks, such as Middlebury, KITTI, and ETH3D. Recently, more endto-end DL methods emerged. These methods replace the classic two-stage pipeline with a single neural network architecture that takes in a rectified image pair and directly outputs the disparity map. These end-to-end DL methods have outperformed two-stage DL methods and demonstrated the top performances in computer vision benchmarks (Lipson et al., 2021, Zhao et al., 2022, Li et al., 2022a. There have been several studies that quantitatively analyze the performance of DL stereo methods on satellite stereo (Albanwan and Qin, 2022, Gómez et al., 2022). Albanwan et al. (Albanwan and Qin, 2022) present an extensive comparative study  of various stereo-matching approaches by fine-tuning them using a relatively small satellite stereo dataset (DFC in Table 1). They analyze two-stage approaches including SGM using a non-DL census cost volume (Zabih and Woodfill, 1994) and using a DL-based cost volume (Zbontar et al., 2016), as well as end-to-end DL methods (Kendall et al., 2017, Chang and Chen, 2018, Cheng et al., 2020. They report that the two-stage approaches, whether using DL cost volume or not, are more robust and generalizable. Meanwhile, end-to-end DL methods have the highest geometric accuracy while not generalizing well for unseen data. To increase the robustness of these end-to-end DL methods, a large satellite stereo dataset that covers more diversity in its stereo pairs seems necessary. Table 1 provides a comparison among previous satellite stereo datasets and Stellar. A critical challenge in creating a satellite stereo dataset is the generation of ground truth disparity. Typically, people leverage Lidar measurements to synthesize the ground truths, but the Lidar point clouds are sparse and are in different coordinate systems than satellite imagery. The Multi-View Stereo 3D Mapping dataset (MVS3DM) (Bosch et al., 2016) only contains Lidar point clouds as ground truths and the authors do not calibrate the Lidar data and the Satellite imagery into the same coordinate system, which could cause difficulty in performing supervised training using the dataset. The 2019 Data Fusion Contest dataset (DFC) (Bosch et al., 2019) includes sparse true disparity maps generated from the Lidar point clouds. Its Lidar measurements and satellite imagery are aligned by maximizing mutual information between the two sources of data. SatStereo (Patil et al., 2019a) contains dense ground truths. The authors first create true DSMs from the Lidar measurements, then transform the Lidar DSMs into dense disparity maps in the same coordinate as the satellite imagery via bundle adjustments.

Satellite Stereo Datasets
A key difference between the satellite stereo datasets and stereo datasets in the computer vision community, such as Middlebury, KITTI, and ETH3D, is that the disparity maps are naturally bipolar in satellite stereo, meaning the disparity maps contain both positive and negative values. This is caused by the fundamental difference in camera model between the satellite cameras and pinhole cameras. See Sec 3.2.2 for detailed discussions. As in Table 1, all listed previous datasets contain both positive and negative disparity values. Adapting a stereomatching algorithm to work for the both positive and negative disparity, especially those based on semi-global matching (SGM), is non-trivial. Stellar overcomes this issue. It enables all positive disparity via an innovative tile-based polarity correction (Sec 3.2.3.) Compared to previous satellite stereo datasets, Stellar enables much higher flexibility. Compared to the recently released WHU-Stereo dataset (Li et al., 2022b), which contains a similar covered area as Stellar (Table 1), Stellar contains more than one stereo pair per region with different baseline distances and acquisition times that users can flexibly choose. This could help an algorithm learn to handle varying baseline distances and different visual appearances within stereo pairs.  Figure 3. As UAVs that carry Lidars and satellites use different vertical datum, the two data sources must be calibrated to the same coordinate system.

Lidar Data Processing
Lidars measure a sparse point cloud where (xi, yi) indicates the Lambert conformal conic projection (LCC) coordinate of the point, zi is the altitude of the point with respect to some zero altitude datum, and ci is a semantic label of the point, e.g. building, bridge, water, etc., that is sometimes present in the data.
The first step of Lidar data processing is coordinate transformation. The Lidar point clouds follow NAVD88 geoid vertical datum, which uses the approximate mean sea level as the zero altitudes ( Figure 3.) Meanwhile, the satellite imagery is collected under a different vertical datum, WGS84 ellipsoid. Therefore, the first step is to convert the altitude of each point zi into the same vertical datum as the satellite imagery: The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-M-1-2023 39th International Symposium on Remote Sensing of Environment (ISRSE-39) "From Human Needs to SDGs", 24-28 April 2023, Antalya, Türkiye We achieve the conversion using a predefined transformation grid. We also transform the 2D location of each point from LCC coordinate (xi, yi) to Universal Transverse Mercator (UTM) coordinate (xi,ỹi). We denote the point cloud after coordinate transformation as {xi = (xi,ỹi,zi, ci)}. For simplicity of notation, we omit the index range i = 1, 2, · · · hereafter.

Lidar DSM Generation
After the transformation, we rasterize the transformed point cloud {xi} to a dense 2D altitude map, i.e., the DSM, and a semantic segmentation map. As in Figure 4, we first use Delaunay triangulation to obtain a mesh from the point cloud, T = Delaunay({(xi,ỹi,zi)}). The mesh T contains a list of triangles tj whose vertices are neighboring points in the point cloud, T = {tj}. Next, we prune the redundant triangles in the mesh T to form a new mesh S based on the following rule: S ={tj ∈ T | All edges of tj are shorter than α AND The angle between tj's normal vector and all its neighboring triangles' normal vector are below β AND The maximum altitude difference among tj's vertices is smaller than γ}.
We use α = 10m, β = 20 • , and γ = 1m. Then, as in Figure 4, we define a grid with a step size d and project each point in the grid along the z-axis to intersect with the mesh to find out its altitude. If the ray intersects a triangle tj in the mesh, the height is estimated as the interpolated value of the three vertices of tj. If a ray doesn't intersect any triangles, that point is marked as invalid. In our experiment, we define d = 0.3m or 0.5m. Figure 5 shows a qualitative comparison between the Lidar DSMs in DFC (Bosch et al., 2019) and in Stellar for two regions included in both datasets. The average height difference between the two Lidar DSMs is less than 0.5 m. As highlighted using arrows in Figure 5, Stellar DSMs are more accurate along boundaries such as edges of buildings and include more fine details compared to DFC. 3.1.2 Semantic Segmentation As mentioned before, the semantic label may not be present in some point cloud data.
In this case, we leverage the connected component analysis on the mesh S to cluster all triangles. Then we label the largest connected component as ground and the moderate size connected components, i.e., components with at least 500 triangles, as buildings (Figure 4.) The semantic segmentation map is then generated using the same projection method described in Section 3.1.1. A sample semantic segmentation map is in Figure 1(e).

Satellite Stereo DSM
Each image I from the satellite comes with an approximate camera model P in rational polynomial coefficient (RPC) format, and metadata such as satellite (Figure 6), sun positions, timestamp, and ground sampling distance (GSD). The image we use is the panchromatic image that has a spatial resolution of 0.3-0.7m.

Data Alignment and Stereo Pair Selection
The provided RPC camera model associated with each satellite image has a relative pointing error up to a few meters w.r.t. other images in the same region. Similar to the practice by Patil et al. (Patil et al., 2019a), we correct the relative point errors using bundle adjustment. This correction improves the stereo rectification accuracy. Bundle adjustment is the process that, given a The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-M-1-2023 39th International Symposium on Remote Sensing of Environment (ISRSE-39) "From Human Needs to SDGs", 24-28 April 2023, Antalya, Türkiye set of images {I1, · · · , Ip}, determines the corresponding camera projection function of each image P1, · · · , Pp.
First, we use the SIFT/SURF to detect a set of corresponding key points in each image. We denote the ith key point in the jth image as x j i . We also perform RANSAC to remove outliers to ensure correspondences among key points. Then, we unproject the ith key point for all images {x j i , j = 1, · · · , p} using the estimated projection functions {Pj, j = 1, · · · , p} to locate the corresponding 3D world point Xi via triangulation, and project Xi back to each view to calculate the reprojection error. We use this reprojection error as an objective function to search for the optimal camera projection functions: P1,··· ,p = arg min In multi-date satellite stereo reconstruction, multiple factors such as view difference (Figure 6), sun angle difference, acquisition time difference, etc., influence the quality of DSM reconstruction. Given a set of n satellite images, there are nC2 possible stereo pairs. We ignore stereo pairs with view angle differences less than 5 • to avoid those with narrow baselines. All the rest possible stereo pairs are accessible to users in Stellar.

Epipolar Geometry of Satellite Imagers
The linear pushbroom (LP) camera used in earth observation satellites has only one line of pixels in its photosensor. As the satellite moves along its orbit, the LP camera scans the earth (Figure 7.) During the scanning, the camera center moves at a constant speed. This has a significant impact on the epipolar geometry of two LP cameras. Meanwhile, cameras in our daily life are usually perspective, which has a unique camera center and a constant projection matrix for each captured image.
For a pair of images captured by LP cameras, there is a different fundamental matrix Fij for every ith column of the left image and every jth column of the right image. As in Figure 7, for a point x L i on the ith column of the left image, it will correspond to an epipolar curve cR on the right image (Gupta and Hartley, 1997). The epipolar curve cR intersects the jth column at x R j : Meanwhile, each point x R j corresponds to different epipolar curves on the left image, which means the epipolar curves are non-conjugate. Denote C L i and C R j of the camera center when capturing the ith column of the left image and the jth column of the right image. The baseline vector − −−− → C L i C T j will vary its direction for different i, j in most situations (Habib et al., 2005). Thus, the disparity between the left and right images could become both positive and negative.

Stereo Rectification and Polarity Correction
Rectifying a pair of satellite images is non-trivial because of the hyperbolic, non-conjugate epipolar geometry between the images. To deal with this, we divide the satellite images into tiles and assume each tile to follow the affine camera projective geometry. Thus, we can assume linear, conjugate epipolar geometry between each pair of tiles. De Franchis et al. also used similar tilewise affine assumptions (De Franchis et al., 2014), but we have several innovations compared to their practice. First, to efficiently identify corresponding tiles between the left and right images, we use the DEM sculpting approach (Patil et al., 2019b) to obtain a rough estimate of correspondences among tiles by using the low-resolution digital elevation model (DEM) of the earth. Second, we modify the rectification homography for the right image to apply additional translation to obtain unipolar disparities.
Given a set of images {I1, · · · , Ip} with significant scene overlap and their corresponding bundle adjusted camera projection functions {P1, · · · ,Pp}, we pick a stereo pair (I L , I R ) ∈ {I1, · · · , Ip} and their camera projection functions (P L , P R ) ∈ {P1, · · · ,Pp}. We divide each image in the selected stereo pair into n 500 pixel × 500 pixel tiles with 100pixel overlap between neighboring tiles. This gives us a list of tiled stereo pairs {(I L i , I R i ), i = 1, · · · , n}.
We obtain a set of virtual correspondences in each tiled stereo pair using DEM-Sculpting (Patil et al., 2019b) as follows. The digital elevation model (DEM) is a two-dimensional altitude map of the terrain, similar to DSM but with a much lower spatial resolution. People already have access to the DEM of the entire earth. We uniformly sample points {x L j , j = 1, · · · , K1} on each left tile I L i and unproject these points onto the 30mresolution DEM to obtain the corresponding 3D world point Xj = (P L ) −1 (x L j ). We then project each 3D point Xj onto the right tile to calculate the corresponding pixel location x R j = P R (Xj). We compute another set of correspondences by creating another set of 3D world pointsXj = Xj +(0, 0, 100m) and project them onto I L and I R . We ignore any point that falls outside the boundary of the tiles (I L i , I R i ). We denote the new set  of correspondences as {(x L j ,x R j ), j = 1, · · · , K2}. This new set of points creates an elevation envelope to account for the altitude of buildings, as they do not exist in the low-resolution DEM. In total, the process generates K1 + K2 virtual correspondences between the left and right tile ( Then, we find the rectification homography H L i and H R i for the tile pair. Mathematically, the fundamental matrix Fi of the two tiles has the following form (Hartley and Zisserman, 2003).
as we assume affine camera model. Using the previously identified K = K1 + K2 virtual correspondences, denoted as (x L k , x R k ), k = 1, 2, · · · , K for simplicity, we can estimate the five parameters of the fundamental matrix using the Gold Standard algorithm (Hartley and Zisserman, 2003). The epipoles for left and right images can be given as e L = (−d, c, 0) T and e R = (−b, a, 0) T , respectively. We compute the stereo rectification homography as follows: 2||e||||e ′ || . The matrix T shifts the right image horizontally by tx so that the disparity is unipolar. The shift tx is the maximum disparity of all the K rectified correspondences: With the homography, we can obtain rectified tile pairs Then, we stitch all the rectified pairs together to form two rectified images (Ĩ L ,Ĩ R ). In case of overlapping pixel value for multiple tiles, we pick the value from the tile with a smaller index i. For simplicity, we denote the entire rectification process asĨ L/R = g L/R (I L/R ), whereĨ L/R is the rectified left/right image and g L/R is the left/right rectification function. Figure 9 shows a comparison of the disparity maps and their histograms for DFC and Stellar datasets. It is clear that Stellar contains all positive disparities.

DSM Generation from Disparity Maps
In this section, we discuss an intermediate DSM generated from satellite images. This DSM is for aligning the Lidar data with the satellite data. Given a rectified stereo pair (Ĩ L ,Ĩ R ), we use the modified tSGM algorithm (Patil et al., 2019b) to output two disparity maps (D L , D R ), where D L is the left disparity and D R is the right disparity. For a point x L in the left image I L , we can use the following transformation to find the corresponding point on the right image I R : By unprojecting points x L and x R using their corresponding camera projection functions and using triangulation, we can locate their corresponding 3D world point X. We repeat this process for all points in the left and right images and can generate a 3D point cloud of the scene. Then we can use the same triangulation and interpolation process in Section 3.1.1 to produce a DSM.

Ground Truth Disparity Map Generation
The calibration of the Lidar DSM and the alignment DSM described in Section 3.2.4 follows the procedure in Nuth et al. (Nuth and Kääb, 2011). It minimizes the overall elevation differences between the two DSMs by performing a global translation and bias of the Lidar DSM.
After alignment, we project all points X from the Lidar DSM to the left and right rectified image viax L/R = g L/R (P L/R (X)) and calculate the true left disparity as the horizontal difference between the projected points:D L (x L ) =x L u −x R u . We then treatD L as a point cloud and use the triangulation and interpolation approach in Section 3.1.1 to obtain the ground truth of left disparity map D L . We repeat the same process to obtain D R . Then, we perform a left-right-right-left (LRRL) consistency check to remove pixels where both disparity values disagree, which could happen at occlusion. Mathematically, the LRRL check is: where s = (xu − D L (x), xv) and x is the position of a pixel in the left image.

PRELIMINARY EXPERIMENTAL RESULTS
This section describes our preliminary effort to analyze the effectiveness of Stellar. We select two pre-trained DL stereo architectures, fine-tune them on Stellar and DFC (Bosch et al., 2019) datasets, and analyze their performance. The information about the two DL architectures is as follows: GANet (   cost volume. Then, the network uses semi-global aggregation (SGA) layers and local guided aggregation (LGA) layers to perform an approximated semi-global matching procedure that is differentiable and computationally efficient. The whole architecture is differentiable for end-to-end training.
RAFT-Stereo (Lipson et al., 2021) is an end-to-end approach that directly estimates the disparity map from an image pair without generating the cost volume. The model first extracts feature maps from the image pair to build a multi-resolution correlation pyramid. Then it uses a recurrent architecture, the gated recurrent units (GRUs), to create and iteratively refine the output disparity map.
In our experiment, we use models pre-trained by the original authors on the Sceneflow dataset (Mayer et al., 2016) for both architectures. We fine-tune these pre-trained models on DFC and on a portion of Stellar separately, and test all models on a different portion of Stellar. Figure 8 shows the division of Stellar dataset for this experiment. We use image pairs of two cities as the training set and those of another two as the testing set. The testing images may contain different terrains than the training set. For example, the testing images of California in Figure 8c contain hills that are not present in the training images. Therefore, this experiment also analyzes the generalizability of the models.
The loss function for fine-tuning is: where Dgt is the ground truth disparity map, f = {GA-Net, RAFT-Stereo}, {Ĩ L i ,Ĩ R i } are the ith rectified image pair in the training set. GANet use Huber loss and RAFT-Stereo use L1 loss as the loss function L. The learning rate is 5 × 10 −5 for RAFT-Stereo and 10 −4 for GANet.
We adopt the following metrics for the quantitative evaluation of estimated disparity maps.
• Good 3: given as percentage of the valid pixels in Dgt such that |DL(x) − Dgt(x)| < δ for δ = 3 where x = (u, v) T is pixel position in a disparity map and DL is an estimated left disparity map. For this metric, we consider all valid pixels in the ground truth disparity map, and it measures the density of estimated disparity maps.
• Average Error: Average error between estimated and ground truth disparity map considering valid pixels in both maps.
• RMSE: Root Mean Square Error (RMSE) between estimated and ground truth disparity map considering valid pixels in both maps. Figure 10 shows the quantitative and qualitative evaluation of the two DL models. For GANet, the model fine-tuned with Stellar achieves the best performance in both testing areas. For RAFT-Stereo, the Stellar model performs the best on one testing region, while the pre-trained model achieves the highest accuracy on another. It is worth noticing that both models trained on DFC perform the worst for both regions, which is probably because the ground truth disparity of DFC is sparse and bipolar, and there is limited training data.
It is not crystal clear to us why the pre-trained RAFT-Stereo sometimes performs better than the same architecture finetuned on Stellar. A possible explanation is the training data for Stellar is still not sufficient for such sophisticated DL architectures. We envision more comprehensive analyses of DL satellite stereo methods using datasets such as Stellar in future studies.