LARGE SCALE URBAN RECONSTRUCTION FROM REMOTE SENSING IMAGERY

Automatic large-scale stereo reconstruction of urban areas is increasingly becoming a vital aspect for physical simulations as well as for rapidly prototyping large-scale 3D city models. In this paper we describe an easily reproducible workflow for obtaining an accurate, textured 3D model of a scene from overlapping aerial images as input. Starting with the initial camera poses and their refinement via bundle adjustment, we create multiple heightmaps by dense stereo reconstruction and fuse them into one Digital Surface Model (DSM). This DSM is then triangulated, and mesh simplification methods are employed to reduce the amount of data. The resulting 3D mesh is finally projected into each of the input images to obtain the best-fitting texture for each triangle. For verification, we provide visual results and numerically evaluate the accuracy by comparing the resulting 3D model against ground truth generated by aerial laser scanning (LiDAR).


INTRODUCTION
Large-scale stereo reconstruction of urban areas is becoming increasingly important in varied scientific fields as well as in everyday life. Applications range from physical simulations like flood simulation or the propagation of radio beams or sound, to 3D change detection, flight planning for unmanned aerial vehicles (UAVs), and intuitive and detailed navigation and 3D city visualization. Due to the large-scale nature of the given problem, the stereo reconstruction needs to be addressed with remote sensing imagery from satellites or aerial sensors, and furthermore needs to be fully automatic, as the manual effort would render the problem intractable. While laser scanning (LiDAR) is a very accurate method for 3D reconstruction, it does not provide any color or texture information about the scene and requires a quite expensive hardware setup. In the case of satellite platforms (very large scale reconstruction), laser scanning is not feasible at all. This paper demonstrates the potential of dense stereo image matching as a cost-efficient way to create accurate and detailed textured 3D city models by combining current state-of-the-art computer vision methods. We further show that the accuracy of today's optical sensors and dense stereo algorithms is indeed sufficient to meet the requirements of the aforementioned applications: approximately 1m RMSE for aerial images (2km distance to the scene) and 3m RMSE for WorldView-2 satellite images (770km distance to the scene). When acquiring the images, an initial pose is computed and stored alongside the corresponding image. In the case of aerial images, data from the inertial measurement unit (IMU) and GPS information is used for estimating the initial pose. In the case of satellite images, a Rational Polynomial Camera (RPC) model (see e.g. Grodecki, 2001) is initially provided for every scene. In both cases these initial estimates are not very accurate, with an average re-projection error typically > 1 pixel. Therefore we extract and match SIFT features (Lowe, 2004) between all images and optimize the relative camera poses using bundle adjustment (a survey can be found in Triggs et al., 2000) to achieve subpixel accuracy. If available, ground control points are added to the optimization process to further refine the absolute orientation of the cameras. If no initial pose is available, we compute the fundamental matrix for each image pair based on SIFT matches and perform an image rectification on the two images.
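Bundle adjustment itself is beyond the scope of a short example, but the pose-free fallback, estimating the fundamental matrix from SIFT matches, can be sketched with the classical normalized 8-point algorithm. This is a minimal numpy illustration, not the authors' implementation:

```python
import numpy as np

def normalize_points(pts):
    """Translate points to zero mean and scale so the mean distance is sqrt(2)."""
    centroid = pts.mean(axis=0)
    d = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2) / d
    T = np.array([[s, 0, -s * centroid[0]],
                  [0, s, -s * centroid[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def fundamental_8point(x1, x2):
    """Estimate F from >= 8 point matches such that x2_h^T F x1_h = 0."""
    p1, T1 = normalize_points(x1)
    p2, T2 = normalize_points(x2)
    # Each match contributes one row of the linear system A f = 0.
    A = np.column_stack([
        p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
        p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
        p1[:, 0], p1[:, 1], np.ones(len(p1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint of a fundamental matrix.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    F = T2.T @ F @ T1          # undo the normalization
    return F / np.linalg.norm(F)
```

In practice one would wrap this in a RANSAC loop to reject SIFT mismatches before the final estimate.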
For reconstructing the 3D scene, we use dense stereo matching to exploit all the image information and compute a depth for every single image pixel, in contrast to sparse multiview image matching methods based on SIFT (Lowe, 2004) or SURF-like (Bay et al., 2006) features. As multiple depth hypotheses need to be tested for each image pixel, a computationally efficient cost function is needed for the image matching, which additionally has to be robust to some radiometric changes between the images. For this case we chose a combination of the Census transform (Zabih and Woodfill, 1994) and Mutual Information (Viola and Wells III, 1997). As the pure data term of the cost function is still prone to noise and performs poorly over large disparity ranges and homogeneously textured regions, a simple pixel-wise winner-takes-all strategy does not yield acceptable results, which is why we add additional smoothness constraints and solve the resulting optimization problem using the well established Semi-Global Matching (Hirschmueller, 2005). The whole process of the 3D reconstruction is described in Section 2.1. At this point, we have a heightmap for each input stereo image pair, which is equivalent to a point cloud in the coordinate system of the reference image, each point also having a color assigned by its projection into the corresponding reference image. An intermediate result would now be the fusion of all these point clouds into an even denser point cloud. However, for our purpose, we transform the different heightmaps (respectively their point clouds) into a common orthographic coordinate system (the UTM coordinate system) and, due to a large image overlap and therefore redundant height information, fuse the projected Digital Surface Models (DSMs) into one single DSM, yielding a higher accuracy because of reduced noise. Since all of our applications need a closed surface, we afterwards transform this DSM (or point cloud) into a meshed 3D model by Delaunay triangulation (Guibas and Stolfi, 1985). In the next step, we address the problem of the complexity of the 3D model. For applications having limited resources (e.g. navigation devices or web applications) or real-time requirements (collision detection for UAVs), the resulting mesh needs to be simplified, while at the same time preserving the accuracy and interesting scene details. As contradictory as this may sound compared to the earlier argument for a dense reconstruction, note the difference in the step at which we reduce the amount of information. For sparse multiview image matching, the amount of information is already reduced during the detection and matching of sparse image features, whereas in our approach the reduction takes place at a later step and uses all the dense depth information to detect and simplify locally planar surfaces. Finally, to get a visually appealing and realistically textured 3D model, we use images of the scene taken from different viewpoints to assign best-fitting textures to all triangles of the 3D model. The process of the automatic 3D modeling is described in Section 2.3. For the evaluation of the algorithm's accuracy, we provide both visual and numerical results for two test scenes of urban areas as described in Section 3. Test scene A consists of an aerial image set of the city of Munich, while test scene B consists of WorldView-2 satellite images of London. The results of both data sets are numerically evaluated against reference data obtained by airborne laser scanning (LiDAR) - see Section 4.

METHOD
In Figure 1, an overview of the complete workflow of the proposed method is shown, starting with a number of input images I_k together with their initial camera poses C_k. As already mentioned in the introduction, we first refine the initial poses to achieve subpixel accuracy for the dense stereo matching. To this end we extract and match SIFT features between all images and optimize the relative camera poses using bundle adjustment. If available, ground control points are added to the optimization process to further refine the absolute orientation of the cameras.

Dense Stereo Reconstruction
In classical dense stereo reconstruction, for every image pixel (x, y) ∈ I1 of the reference image and a number of height/depth hypotheses γ ∈ Γ, a matching cost is computed by back-projecting the pixel into 3D space, projecting the resulting 3D point into the second image, → (x', y') ∈ I2, and comparing the image information of the two images at their corresponding positions. The result is the so-called disparity space image (Bobick and Intille, 1999), containing the raw matching costs. If no information about the camera model were available, the search space for matching one pixel of image I1 would be the whole image I2. To reduce the search space from 2D to 1D, we need to establish an epipolar geometry between image pairs. If the cameras can be approximated by the pinhole camera model, the resulting epipolar geometry maps an image coordinate in the first image to a corresponding line in the second image. However, satellite images are obtained using a push-broom camera (the CCDs are arranged in a one-dimensional instead of a two-dimensional array), and the resulting epipolar lines of an image pair using the corresponding Rational Polynomial Camera (RPC) model are not straight, but curved (Oh, 2011). We therefore pursue a different strategy and establish the epipolar geometry between two images I1 and I2 directly by evaluating the function F(1,2)(x, γ), which projects a pixel x from I1 to I2 using the disparity γ, for every single pixel x ∈ Ω of I1 and every possible disparity γ ∈ Γ individually. As the camera model may be very complex and computationally expensive and Ω × Γ quite large, we compute F(1,2)(x, γ) only for a sparse (and uniformly distributed) set of grid points in Ω × Γ space and store the results in a lookup table. For all other points we compute the projected pixel by trilinear interpolation of the nearest grid points. By embedding this procedure in a plane-sweep approach (Collins, 1996), we furthermore obtain a rotation-invariant cost function without introducing additional effort. Given a disparity γ, we sweep over the reference image I1, sample image I2 at the corresponding image position (x', y') and copy the obtained color/intensity to an image Ĩ2 at the same position (x, y) as in the reference image. When computing the matching costs of a disparity hypothesis γ for the whole image I1, we simply evaluate the cost function at the same position (x, y), using the same local support window in both I1 and Ĩ2 (see Figure 2). Note that with this approach we have no need for an image rectification and avoid the errors induced by projective distortions and the involved sampling and interpolation.
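The lookup-table idea above can be sketched in a few lines: the (possibly expensive) camera mapping is evaluated only on a coarse grid, and all other projections are obtained by trilinear interpolation. The grid spacing and the `project` callback are illustrative assumptions:

```python
import numpy as np

def build_lut(project, xs, ys, gammas):
    """Evaluate the (possibly expensive) camera mapping on a coarse grid.
    project(x, y, gamma) -> (x', y') in image I2."""
    lut = np.empty((len(xs), len(ys), len(gammas), 2))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            for k, g in enumerate(gammas):
                lut[i, j, k] = project(x, y, g)
    return lut

def interp_lut(lut, xs, ys, gammas, x, y, g):
    """Trilinear interpolation of the stored projections at (x, y, gamma)."""
    def locate(grid, v):
        i = np.clip(np.searchsorted(grid, v) - 1, 0, len(grid) - 2)
        t = (v - grid[i]) / (grid[i + 1] - grid[i])
        return i, t
    i, tx = locate(xs, x)
    j, ty = locate(ys, y)
    k, tg = locate(gammas, g)
    c = 0.0
    for di, wx in ((0, 1 - tx), (1, tx)):
        for dj, wy in ((0, 1 - ty), (1, ty)):
            for dk, wg in ((0, 1 - tg), (1, tg)):
                c = c + wx * wy * wg * lut[i + di, j + dj, k + dk]
    return c
```

Since trilinear interpolation reproduces any mapping that is linear in each coordinate exactly, the approximation error depends only on the curvature of the camera model between grid points.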
Figure 2: Computation of the disparity space image using a plane-sweep approach: for a coordinate (x, y) in image I1 and disparity γ, obtain the corresponding coordinate (x', y') in image I2 using the camera model F(1,2), sample the pixel color/intensity and copy it to the warped image Ĩ2 at position (x, y). The matching cost can then be computed by comparing the pixel intensities in both images at position (x, y).

Cost Function
As the cost function for the image matching we chose the Census transform (Zabih and Woodfill, 1994), followed by a (very small) local aggregation using adaptive support-weights (Yoon and Kweon, 2006). The Census transform CT encodes the local image structure within a small patch around a given pixel. It is defined as an ordered set of comparisons of intensity differences and is therefore invariant to monotonic transformations which preserve the local pixel intensity order:

CT(x) = ⊗_{d ∈ D} s(I(x), I(x + d)),   s(a, b) = 1 if a < b, 0 otherwise,

with the concatenation operator ⊗ and an ordered set of displacements D ⊂ R², which is normally chosen to be a 7 × 9 window. This choice is also convenient for the implementation, as the resulting binary vector of length 63 fits into a 64 bit variable. The matching cost of two Census vectors s1, s2 is then computed as their Hamming distance dH(s1, s2), i.e. the number of differing bits, where the highest matching quality is achieved for a minimal Hamming distance, and the costs are normed to the real-valued interval [0, 1]. Using such a 7 × 9 support window for the Census transform increases the robustness of the matching function against mismatches, especially when searching through a large disparity range, as is very common in remote sensing data.
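A compact numpy sketch of the Census transform and the normed Hamming cost follows. The exact bit layout is an implementation detail; this version skips the trivial self-comparison of the center pixel (62 informative bits for a 7 × 9 window) and wraps at the image borders via np.roll, which a real implementation would handle by padding:

```python
import numpy as np

def census_transform(img, win=(7, 9)):
    """Census transform: compare every pixel in a win[0] x win[1] window
    against the window center and pack the bits into a uint64 per pixel."""
    ry, rx = win[0] // 2, win[1] // 2
    out = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-ry, ry + 1):
        for dx in range(-rx, rx + 1):
            if dy == 0 and dx == 0:
                continue  # center vs. center carries no information
            shifted = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
            out = (out << np.uint64(1)) | (shifted < img).astype(np.uint64)
    return out

def hamming_cost(c1, c2, n_bits=62):
    """Matching cost: Hamming distance of two Census vectors, normed to [0, 1]."""
    x = np.bitwise_xor(c1, c2)
    counts = np.array([bin(int(v)).count("1") for v in x.ravel()])
    return counts.reshape(x.shape) / n_bits
```

Because only the ordering of intensities enters the transform, any strictly increasing radiometric change between the two images leaves the cost unchanged.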
On the other hand, this window-based matching suffers from the "foreground fattening" phenomenon when support windows are located on depth discontinuities, such as partially covering a roof top and the adjacent street. To limit this effect, we locally aggregate the cost function of Equation (3) using adaptive support-weights (Yoon and Kweon, 2006) for corresponding pixels p in I1 and q in I2. The weights w(p, q) are based on color differences ∆c(p, q) and spatial distances ∆d(p, q):

w(p, q) = exp( -( ∆c(p, q)/γc + ∆d(p, q)/γd ) ),

with γc and γd controlling the influence of the color and spatial terms (see Figure 3 for the basic scheme of the adaptive support-weights).
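The weighting and aggregation can be sketched as follows; the γ values here are illustrative defaults, not the paper's settings:

```python
import numpy as np

def support_weights(patch, center_val, gamma_c=7.0, gamma_d=5.0):
    """w(p,q) = exp(-(dc/gamma_c + dd/gamma_d)) for every pixel of a patch,
    relative to the patch center (gamma values are illustrative)."""
    r = patch.shape[0] // 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    dc = np.abs(patch - center_val)   # color / intensity difference
    dd = np.sqrt(yy**2 + xx**2)       # spatial distance to the center
    return np.exp(-(dc / gamma_c + dd / gamma_d))

def aggregate_cost(cost_patch, w1, w2):
    """Aggregate raw matching costs with the combined weights of both images."""
    w = w1 * w2
    return (w * cost_patch).sum() / w.sum()
```

Pixels that are similar in color and close to the center dominate the sum, so the aggregation effectively adapts the support window to the object the center pixel belongs to.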

Optimization
In the resulting disparity space image (DSI) we now search for a function u(x) (the disparity map) which minimizes an energy consisting of the matching costs (the data term E_data) plus additional regularization terms. As the data term is prone to errors due to incorrect and noisy measurements, one typically needs smoothness constraints E_smooth, forcing the surface of the disparity map to be locally smooth.
This energy is non-trivial to minimize, since the smoothness constraints are based on gradients of the disparity map and therefore cannot be optimized pixel-wise anymore. Various approximations for this NP-hard problem exist, e.g. Semi-Global Matching (Hirschmueller, 2005), energy minimization via graph cuts (Boykov et al., 2001) or minimization of the Total Variation (Pock et al., 2008), just to name three examples. Because of its simple implementation, low computational complexity and good regularization quality, we use Semi-Global Matching for the optimization, minimizing the following discrete energy term:

E(D) = Σ_x ( C(x, D(x)) + Σ_{p ∈ Nx} P1 T[|D(x) - D(p)| = 1] + Σ_{p ∈ Nx} P2 T[|D(x) - D(p)| > 1] ).

The first term is the matching cost; the second and third terms are penalties for small and large disparity discontinuities between neighboring pixels p ∈ Nx. The key idea is now not to solve the intractable global 2D problem, but to approximate it by combining 1D solutions from different directions r, each solved via dynamic programming:

L_r(x, γ) = C(x, γ) + min( L_r(x-r, γ), L_r(x-r, γ ± 1) + P1, min_i L_r(x-r, i) + P2 ) - min_k L_r(x-r, k).

The penalties are set to the default values P1 = 0.4 and P2 = 0.8 (with the raw costs normed to [0, 1]). For deeper insights the reader is referred to Hirschmueller, 2005.
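The 1D dynamic-programming pass for a single direction (here: left to right along one scanline) can be sketched as:

```python
import numpy as np

def sgm_path(cost, P1=0.4, P2=0.8):
    """Aggregate a cost volume of shape (width, ndisp) along one scanline
    direction, following the semi-global matching recursion."""
    w, nd = cost.shape
    L = np.empty_like(cost)
    L[0] = cost[0]
    for x in range(1, w):
        prev = L[x - 1]
        m = prev.min()
        # same disparity, or an arbitrary jump penalized with P2
        best = np.minimum(prev, m + P2)
        # disparity change of +-1, penalized with P1
        best[1:] = np.minimum(best[1:], prev[:-1] + P1)
        best[:-1] = np.minimum(best[:-1], prev[1:] + P1)
        L[x] = cost[x] + best - m   # subtracting m keeps L bounded
    return L
```

In the full algorithm, such paths are computed for 8 or 16 directions r and summed, and the disparity per pixel is the argmin over the summed costs.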

DSM Post-Processing
After obtaining the disparity map for each image pair, there are most certainly some errors/outliers left due to mismatches or occlusions. To further reduce the number of outliers and increase the accuracy of the disparity map, we apply the following post-processing steps:

• Consistency check
By exchanging the two stereo images I1 and I2, we compute a second disparity map D2 (with the reference frame now being I2) and compare each disparity of D1 with the corresponding disparity in D2 - the so-called left-right check. Ideally, these disparities should be equal (uniqueness constraint). If they differ by a value larger than θLR, the disparity at this position is considered to be invalid. Note that this doubles the runtime of the overall algorithm, since two disparity maps have to be computed.
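For the classical rectified-stereo setting (the paper's disparities live in a more general camera geometry), the left-right check can be sketched as:

```python
import numpy as np

def left_right_check(D1, D2, theta_lr=1.0):
    """Textbook left-right consistency check for rectified stereo:
    pixel x in I1 with disparity d corresponds to pixel x - d in I2.
    Inconsistent or out-of-range disparities are set to NaN."""
    h, w = D1.shape
    out = D1.astype(float).copy()
    for y in range(h):
        for x in range(w):
            x2 = int(round(x - D1[y, x]))
            if x2 < 0 or x2 >= w or abs(D1[y, x] - D2[y, x2]) > theta_lr:
                out[y, x] = np.nan
    return out
```

Invalidated pixels are then filled by the interpolation step described below.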

• Median filtering
Additional noise and small-area outliers are removed by filtering the disparity map with a 3 × 3 median filter.

• Interpolation
The invalidated disparities, resulting from occlusions and mismatches, need to be interpolated, which is mostly done based on the disparities in the local neighborhood. To this end, we use the multilevel B-spline interpolation of Lee et al., 1997.
• Subdisparity accuracy
By fitting a quadratic curve through the cost C(x, d) of the obtained disparity and the costs of the adjacent disparities (d ± 1) and computing the minimum of this curve, we refine the disparity map to subdisparity accuracy. Note that this is theoretically only valid for SSD-based cost functions, but it nevertheless works well for most cost functions.
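The parabola fit has a well-known closed form; the refined disparity is d plus the offset of the parabola's minimum:

```python
def subdisparity(c_minus, c0, c_plus):
    """Refine an integer disparity d by fitting a parabola through the
    costs at d-1, d and d+1 and returning the sub-disparity offset of
    the parabola's minimum (in [-0.5, 0.5] when d is the discrete argmin)."""
    denom = c_minus - 2.0 * c0 + c_plus
    if denom <= 0:  # degenerate or non-convex fit: keep the integer d
        return 0.0
    return 0.5 * (c_minus - c_plus) / denom
```

For cost curves that really are parabolic around the minimum (the SSD case mentioned above), this recovers the true minimum exactly.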

• Multiview DSM fusion
In the case of two input images, we simply project the single disparity map into the UTM coordinate system, discretize the result and again interpolate the hereby invalidated areas of the DSM (areas of the DSM which are occluded in the disparity map).
For more than two input images, we compute a disparity map for each image pair and project all of them into a regularly spaced and discretized grid given in the UTM coordinate system. By collecting all the height values per DSM pixel, we afterwards choose their median to be the final height value of the DSM at this position.
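Once the individual heightmaps are resampled to the common UTM grid, the per-cell median fusion is a one-liner (NaN marking cells without data):

```python
import numpy as np

def fuse_dsms(heightmaps):
    """Fuse several projected DSMs (same UTM grid, NaN = no data) by
    taking the per-cell median of all available height values."""
    stack = np.stack(heightmaps)
    with np.errstate(all="ignore"):
        return np.nanmedian(stack, axis=0)
```

The median is robust against the occasional gross outlier in a single disparity map, which is why it is preferred over the mean here.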

Meshing
We create a triangulated mesh of the DSM by simply connecting four incident pixels (x, y), (x + 1, y), (x + 1, y + 1), (x, y + 1) into two triangles. Of the two possible triangulations we choose the one which better matches the height information, using an adaptive triangulation method which minimizes the second derivative of the surface in the neighborhood of the square, as proposed in Grabner, 2002.

Simplification
Using the naive meshing from above, the triangulated mesh contains roughly 2n triangles, where n is the number of pixels in the DSM. As our aerial image data, for example, has a resolution of 0.2m, our 3D model would have about 25 triangles per m² or 25·10⁶ triangles per km², which is simply too much for our purposes. Therefore, we employ a two-step mesh simplification to reduce the number of triangles needed to represent the 3D model, while at the same time preserving its dominant features and surface properties. The first step aims at simplifying planar structures. To this end, we iterate over all vertices and fit a 3D plane through the adjacent neighbors of each vertex using the least squares method. Using a quad-edge data structure as in Guibas and Stolfi, 1985 allows an efficient search for the neighboring vertices.
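The least-squares plane fit at the heart of this step might look like the following sketch, which fits z = ax + by + c to the neighbor vertices and returns the vertical distance of the candidate vertex to that plane (the vertical rather than the perpendicular distance, which suffices for a DSM defined over a grid):

```python
import numpy as np

def plane_distance(vertex, neighbors):
    """Fit z = a*x + b*y + c to the neighbor vertices (least squares)
    and return the vertical distance of `vertex` to the fitted plane."""
    A = np.column_stack([neighbors[:, 0], neighbors[:, 1],
                         np.ones(len(neighbors))])
    coeff, *_ = np.linalg.lstsq(A, neighbors[:, 2], rcond=None)
    a, b, c = coeff
    return abs(vertex[2] - (a * vertex[0] + b * vertex[1] + c))
```

A vertex is then marked for removal when this distance falls below the threshold on planarity and the additional height-difference constraint described next is satisfied.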
If the distance of the currently visited vertex to the fitted plane is < ∆plan, the vertex is deemed redundant and marked for removal (see Figure 4). As this would sometimes remove the corners of steep building walls, we add the further constraint that the vertex is only removed if the height difference to all of its adjacent vertices is < ∆disc. These two parameters depend on the grid resolution δ of the initial DSM and were chosen to be ∆plan = δ and ∆disc = 10δ. An evaluation of the influence of the parameter δ is conducted in Section 4, Figure 10. The second step of our mesh simplification removes nearly collinear triangles: if for any triangle (A, B, C), AB + AC < BC · ∆coll (with ∆coll > 1), the vertex A is removed. We chose to remove only very collinear triangles (∆coll = 1.01).

Texturing
Images of the scene taken from different viewpoints allow us to extract the texture of parts of the scene hidden from a single view, like for example the facades of buildings (see Figure 5). In that case we have to devise a quality measure Q for the projection π(ti, Ik) of a triangle ti into each image Ik available for texturing. Of all these K projections, we then choose the one with the best quality measure for texturing the triangle ti. To reduce the effect of perspective distortion, the angle between the normal vector of the 3D triangle and the viewing direction of the camera should be minimal. At the same time, the 2D projection of the triangle should have maximal size (to capture fine details) and it should be least occluded by other triangles (especially for urban areas containing large buildings and narrow streets). In practice, we found it sufficient to optimize just for the last two requirements, since our aerial image data was taken from roughly the same distance to the scene and it is therefore a valid assumption that the size of the 2D projection of a triangle correlates with the angle between its normal and the viewing direction of the camera.
We therefore extract the texture for a triangle ti from the image Ik where its projection onto the image plane is maximal and simultaneously least occluded by other projected triangles. To evaluate these requirements efficiently, we compute the texture quality of each triangle for an image Ik using the following z-buffering approach:

• Sort all 3D triangles of the model according to their distance to the camera center
• Beginning with the furthest triangle, compute the projection of all triangles onto the image plane and render them using a unique identifier / unique color each
• Sweep over the rendered image and compute the visibility of each triangle in terms of its remaining visible pixels

Choosing for each triangle the view with the highest expected texture quality, we store the corners of the triangle's 2D projection as its texture coordinates.
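The painter's-algorithm rendering behind this quality measure can be sketched with a minimal software rasterizer (pixel-center sampling, edge tests via 2D cross products); a production implementation would of course use GPU rendering:

```python
import numpy as np

def _cross2(u, v):
    """z-component of the cross product of a 2D edge u with rows of v."""
    return u[0] * v[:, 1] - u[1] * v[:, 0]

def visible_pixel_counts(triangles, depths, shape):
    """Painter's algorithm: render projected triangles back to front with
    unique ids and count the pixels where each id survives occlusion.
    triangles: list of 3x2 arrays (projected 2D corners, pixel units)
    depths:    distance of each triangle to the camera center"""
    buf = np.full(shape, -1, dtype=int)
    order = np.argsort(depths)[::-1]   # furthest triangle first
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pts = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    for tid in order:
        a, b, c = np.asarray(triangles[tid], dtype=float)
        d0 = _cross2(b - a, pts - a)
        d1 = _cross2(c - b, pts - b)
        d2 = _cross2(a - c, pts - c)
        # inside iff all three edge tests have a consistent sign
        inside = ((d0 >= 0) & (d1 >= 0) & (d2 >= 0)) | \
                 ((d0 <= 0) & (d1 <= 0) & (d2 <= 0))
        buf.ravel()[inside] = tid
    return {tid: int((buf == tid).sum()) for tid in range(len(triangles))}
```

Comparing the visible pixel count of a triangle across all K candidate images then directly yields the view with the largest, least occluded projection.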

DATA
For evaluation, two test sites of complex urban areas were chosen (see Figures 6 and 7), both covering an area of 2000 × 2000 pixels. The two test sites were captured with different remote sensing sensors, and ground truth data was obtained by airborne laser scanning (LiDAR) in both cases.

Aerial image data -Munich
The aerial images were taken of the inner city of Munich, using the 3K+ camera system (Kurz et al., 2007). The flight altitude (i.e. the distance of the camera to the scene) is ≈ 2km above ground, with a ground sampling distance (GSD) of ≈ 0.2m per pixel. 17 partially overlapping images (converted to gray scale, 8 bit per pixel) were used for the 3D reconstruction, while the RGB images themselves were used for texturing. The resulting DSM is resampled to 0.2m and, having a size of 2000 × 2000 pixels, covers an area of 400m × 400m.

Satellite image data -London
The satellite images were taken of the inner city of London, using the WorldView-2 satellite. The flight altitude (i.e. the distance of the camera to the scene) is ≈ 770km above ground, with a ground sampling distance (GSD) of 0.5m per pixel. 8 overlapping images (obtained during one pass) of the panchromatic sensor (11 bit per pixel) were used for the 3D reconstruction. For texturing, RGB images were computed by pansharpening the 0.5m GSD panchromatic images with the 2.5m GSD multispectral channels.
The resulting DSM is resampled to 0.5m and, having a size of 2000 × 2000 pixels, covers an area of 1km × 1km.

RESULTS
For both test areas, reference data was obtained by airborne laser scanning (LiDAR), having a positional resolution of about 1m, a positional accuracy of 0.5m RMSE and a vertical accuracy of 0.25m RMSE according to the provider of the data. Due to the different resolutions of the DSMs and the LiDAR point cloud, the data sets had to be brought to a common grid before the comparison. In Table 1 we provide the numerical results of the accuracy evaluation. As can be seen, the aerial test area produces more accurate results, which is to be expected since its GSD is 0.2m compared to the 0.5m GSD of the satellite test area. Furthermore, the 3D models are less accurate than the original DSMs, which is also to be expected, as the amount of data is reduced from 4,000,000 pixels to 221,000 and 103,000 vertices, respectively. This amounts to a data reduction of 94.5% for the aerial test area and 97.4% for the satellite test area. Additionally, we provide visual results of the textured 3D models in Figures 8 and 9.

CONCLUSION
In this paper we presented a complete, fully automatic and model-free workflow for the 3D reconstruction of large-scale (urban) areas. As our focus is on fully automatic processing chains, we can process the input image data very fast, around the clock and without the need for additional user-guided input. Of course, with manual interaction the accuracy is normally better than with a fully automatic approach, so if there is a need for higher accuracy, the resulting 3D models can be refined later on by further (semi-)manual processing. We additionally point out that, apart from the input images themselves, no additional data like building footprints, special building models/primitives, road maps, etc. is required.
The accuracy was shown to be in the range of about 1m mean absolute error and about 2m root mean square error (mainly in height) for images with a GSD (or pixel size) of 0.2-0.5m, taken from 2km and 770km distance to the scene. Compared to our LiDAR ground truth's horizontal accuracy of 0.5m RMSE and vertical accuracy of 0.25m RMSE, the results of our image-only reconstruction are not that far off.

Figure 1: Workflow of the 3D reconstruction process with n input images I_k and camera poses C_k, refined camera poses C*_k and n − 1 pairwise heightmaps HM_k.
Figure 3: a) The 7 × 9 window D of the Census transform, b) the basic scheme for the adaptive support-weights.

Figure 6: Aerial image data - inner city of Munich (Marienplatz)

Figure 8: Resulting 3D model of the aerial image data

Figure 10: Influence of the mesh simplification parameter δ on the accuracy of the resulting 3D model (London data set). The correlation between the reduced number of vertices and the increasing mean absolute error is clearly visible.

Table 1: Properties of the two data sets and accuracy of the resulting DSMs and 3D models - Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Normalized Median Absolute Deviation (NMAD)