FUSION OF MULTI-RESOLUTION DIGITAL SURFACE MODELS

: This paper proposes an algorithm for fusing digital surface models (DSM) obtained by heterogenous sensors. Based upon prior con-ﬁdence knowledge, each DSM can be weighted locally adaptively and therefore strengthen or lessen its inﬂuence on the fused result. The proposed algorithm is based on variational methods of ﬁrst and second order, minimizing a global energy functional comprising of a data term forcing the resulting DSM being similar to all of the input height information and incorporating additional local smoothness constraints. By applying these additional constraints in the form of favoring low gradients in the spatial direction, the surface model is forced to be locally smooth and in contrast to simple mean or median based fusion of the height information, this global formulation of context-awareness reduced the noise level of the result signiﬁcantly. Minimization of the global energy functional is done with respect to the L 1 norm and therefore is robust to large height differences in the data, which preserves sharp edges and ﬁne details in the fused surface model, which again simple mean-and median-based methods are not able to do in comparable quality. Due to the convexity of the framed energy functional, the solution furthermore is guaranteed to converge towards the global energy minimum. The accuracy of the algorithms and the quality of the resulting fused surface models is evaluated using synthetic datasets and real world spaceborne datasets from different optical satellite sensors.


INTRODUCTION
Digital surface models (DSM) are a basic component for many applications, such as orthophoto creation, mapping, visualisation and 3D planing in many application fields.Today, many technologies for DSM generation exist, such as airborne LiDAR, SAR interferometry and automatic image matching, each resulting in a different quality and characteristics.As a result, multiple datasets of DSMs are available for most parts of the earths landmass and it is therefore interesting to fuse these into a single, higher accuracy DSM.Depending on the underlying satellite characteristics like ground sampling distance (GSD), the DSMs capture different parts of the scene in different quality, which even can be mutually exlusive to some extent.For example, high resolution sensors like World-View 2 with a GSD of 0.5m perform very well in urban areas, whereas the results in forest areas are somewhat moderate.In contrast, Cartosat-1 with a GSD of 2.5m performs quite opposite in these areas.Even with the same sensor, a different exposure time can drastically alter the results in shadow areas or in highly reflective areas like glaciers.Apart from the obvious aspect of fusing multiple different DSMs, some techniques produce multiple DSMs for the same area, which need to be fused as well.For example, many multi-view image matching techniques are based on matching individual stereo pairs and later fusing these stereo pairs into a common height model, see e.g.(Hirschmueller, 2008), (Kuschk, 2013), (Rumpler et al., n.d.).Our work focuses on the fusion of 2.5D DSM grids, with a resolution from several decimeters to a few meters.DSM fusion has been considered by various authors previously.The simplest method is based on weighted averaging of two or more height maps (Schultz et al., 1999), (Reinartz et al., 2005).As weighted averaging cannot deal with outliers or blunders in the DSMs, a median fusion is often used for multi-DSM fusion, sometimes followed by weighted averaging of the inliers (Hirschmueller, 2008).Both median and weighted averaging does process each pixel independently, and thus cannot take into account the local surface shape, which is usually very regular.Applying additional mean or median based filtering spatially reduces the amount of noise to some extent, at the cost of blurring potentially sharp edges.An example for context aware fusion algorithms is the use of sparse representations (Papasaika et al., 2011), where a DSM patch is computed as a sparse linear combination of dictionary DSM patches.Except for median fusion, pixelwise error maps are required by weighted averaging and sparse representations.A comparison between weighted averaging and sparse representations (Schindler et al., 2011) found that the quality of the fused DSMs is mostly determined by the quality of these pixel based error maps.Another direction of work (Pock et al., 2011) aims to formulate a global energy function, minimizing the distance of the fused result to all input DSMs simultaneously and additionally incorporates the assumption of the world being locally planar.Due to its simple structure and theoretically well founded minimization procedure, we build upon this work and extend it to a weighted, multi-resolution, fusion framework.
As most computer vision problems are generally ill-posed (e.g.image segmentation, stereo reconstruction, image fusion), additional regularizers (constraints) are needed for physical meaningful solutions.In our case of 2.5D image fusion, these regularizers are the assumption of the world being locally planar, meaning that height value of each pixel of the DSM depends on its local context and e.g. is highly unlikely to have a signifanctly different height value than its surrounding pixels.This smoothness constraint typically is implemented by minimizing the gradient of the resulting DSM.Together with the data term this results in large systems of partial differential equations (pde) which are time-consuming to solve and special care has to be taken to still allow for strong discontinuities along building edges and to not smooth them over.The design of the energy functional has to be chosen such that it is convex in the variable to solve for.Otherwise it would be very hard for non-linear minimization method to not getting stuck in local minima.In recent years, Total Variation based methods (TV) for minimizing energy functionals have seen a lot of attention in the research community.One reason is that these algorithms are very wellsuited for parallelization and, together with the recent advances of GPU-based computational power, lead to efficient algorithms, solving these optimization problems efficiently.And as the energy functional of our image fusion problem can be written in a convex formulation, the solution is globally optimal.

TV-L1 Fusion
Based upon the ROF-model for image denoising (Rudin et al., 1992), the extension for multiple image fusion, together with replacing the quadratic data term by the more robust L1 norm as in (Pock et al., 2011) is written by where u ∈ R M •N is our fused DSM to solve, the K input DSMs are given as g k and the scalar factor λ d balances the impact of the smoothness term and the data term.While this model already provides good results by smoothing flat areas and preserving sharp discontinuities, it suffers from the so-called staircasing effect.This effect is a direct result of the regularizer, whose assumption is a locally planar world -where planar unfortunately refers to locally fronto-parallel.This staircasing effect of the TV-L1 algorithm is visible in Figure 2.

TGV-L1 Fusion
To overcome the fronto-parallel assumption of TV-L1 minimization, (Bredies et al., 2010) introduced the mathematical model of Total Generalized Variation (TGV) as a higher-order extension of Total Variation which favors the solution to consist of piecewise polynomial functions (e.g.fronto-parallel, affine, quadratic).Especially the 2nd order is of high interest, as it forces the solution to consist of piecewise planar functions.In contrast to the TV-L1 model, now also including slanted planes.(Pock et al., 2011) applied this model to DSM fusion, resulting in the following optimization problem Now, before the variation of the image u is measured, a 2D vector field v is subtracted from the gradient of u.An affine surface in the image u has a constant gradient ∇u, so by coupling and minimizing |∇u − v|, the vector field v will also be constant and it's gradient ∇v therefore zero.Regarding our overall optimization problem, this means that the energy term will be lower, if affine functions can be found in the image, whereas non-affine function get additional penalties by |∇v|.The values λs, λa, λ d are scalar weights and balance the impact of the smoothness term, the affine term and the data term.

Weighted TGV-L1 Fusion
When fusing DSMs it is desirable to weight the input DSMs on a per pixel base, to be able to incorporate additional prior knowledge into the fusion process.This prior knowledge for example can be based on the different sensor characteristics used to generate the DSM, confidence measures during the 3D reconstruction process itself, information about occluded and therefore unknown areas in each DSM, etc.We therefore extend Equation 2with a weighting matrix W k for each input DSM This optimization problem (and the ones in Equation 1 and 2) is very parameter dependent, as we need to adapt the influence of the data term λ d manually for datasets with different ranges of g (i,j) k ∈ g k as well as for a different number K of input images.To achieve independence of the data range of the input DSMs, we scale all input data to the interval [0..1] with gmin = min i,j,k g (i,j) k and gmax = max i,j,k g (i,j) k . The independence from K is achieved by normalizing the influence of the data term w.r.t. the two-image case and using the adaptive Note that all these extensions and modifications apply to the TV-L1 method similarly.In the next section we will go into detail about how to solve these optimization problems numerically.

ALGORITHM
To solve for the fused DSM u ∈ R M ×N (in the following written as stacked vector R M N ×1 ) in Equation 3, we need to overcome the non-differentiable L1-norm, which complicates any gradient descent based minimization scheme.An efficient algorithm which elegantly circumvents the differentiability problem of the gradient operator is the primal-dual algorithm of (Chambolle and Pock, 2011).By applying the Legendre-Fenchel transform we obtain the dual formulation / conjugate of the separate terms as such that the saddle-point problem in the primal variables u, v and the dual variables p, q, r k with constraints Please note that due to the stacked vector notation, the input weights are denoted as diagonal matrices W k and the corresponding multiplication W k r k is actually a pixelwise multiplication.The saddle-point problem above can be solved by iteratively performing gradient descents on the primal variables and gradient ascents on the dual variables.Applying this primal-dual algorithm leads to the following optimization scheme: do To ensure the constraints of Equation 7, the corresponding proximal mappings above are given as In the analytical derivation of the primal-dual scheme, we require the gradient and divergence operators to be negative adjoint, such that ∇u, p = − u, divp and ∇v, q = − v, divq .Therefore we use finite forward differences with Neumann boundary conditions for the gradient operators and for the divergence operators finite backward difference with Dirichlet boundary conditions.The step sizes of the gradient ascents/descents are bound to the norm of the gradient/divergence operators, see (Chambolle and Pock, 2011), and are set to τu = τp = τr = 1/ √ K1 and τv = τq = 1/ √ K2, with K1 = 2(6 + K) and K2 = 2(4 + K).
The whole algorithm stops, if either the maximum number of iterations has been reached (nIterations = 1000) or the energy change between successive iterations drops below a relative threshold ∆EnergyT hres = 0.1%.Again, the algorithm for the TV-L1 fusion is derived similarly.

Artificial Tests
The first evaluation is done on synthetic data.A given ground truth DSM g is perturbed with noise to simulate different noisy observations of the scene.Five of these noisy DSMs are then given as input to the fusion algorithms and the accuracy of the output DSM u is measured by the logarithmic signal-to-noise ratio: In Figure 2, visual and numerical results are given, showing a significantly higher accuracy of the global optimization methods for DSM fusion over simple mean and median based fusion.Please note that we applied mean and median filtering for all height values of a single pixel as well as its neighboring pixels to also include some spatial regularization.

Unimodal DSM fusion
In our second evaluation, we created 10 different DSMs of the same 1km × 1km area of the inner city of London using a stereo reconstruction framework as proposed in (Kuschk, 2013).For this we have a collection of 25 WorldView-2 images, taken from different positions during one pass of the satellite.The ground sampling distance (GSD) of these images are 0.5m and for evaluation purposes, we obtained a LiDAR measurement of the same area by aerial laser scanning, unfortunately only having a GSD of 1.0m.Two of the satellite images together with the computed heightmaps are shown in Figure 3.The 10 selected heightmaps are projected in the same orthogonal UTM coordinate system and the resulting DSMs are again fed into the different fusion algorithms.Figure 4 shows the result of median based fusion versus TGV-L1 fusion, again with a lower noise ratio visible for the TGV-L1 algorithm.The accuracy of the fused DSMs w.r.t. to the LiDAR ground truth is given in

Multimodal DSM fusion
Our third evaluation is investigating the results of fusing DSMs derived from different sensors and different spatial resolutions.The test data is taken from the ISPRS benchmark (Reinartz et al., 2010) and consists of 3 different scenes (hilly forest = La Mola, mountains = Vacarisses, city = Terassa) near Barcelona, Spain.For each scene, we compute a DSM from the two given CartoSat-1 images (GSD=2.5m)and a DSM from the two given

CONCLUSION
In this paper we proposed global optimization algorithms for fusing multi-resolution DSM obtained by heterogenous sensors.These global optimization algorithms are based on adaptively weighted TV-L1 and TGV-L1 optimization problems, allowing for a context-aware fusion of multiple DSMs.In three different evaluations, both synthetic and real world data sets, a significant improvement of the accuracy was shown with respect to mean and median based filtering methods.

Figure 1 :
Figure 1: Fusion of multiple DSMs covering the same area Fusion using pixel based plus spatial mean filtering, SN R = 22.76(d) Fusion using pixel based plus spatial median filtering, SN R = 32.61(e) TV-L 1 fusion, SN R = 39.60 (f) TGV-L 1 fusion, SN R = 40.32

Figure 2 :
Figure 2: Comparison of local fusion method versus global optimization methods.Both numerical results and visual appearance show the benefit of the latter ones.

Figure 3 :
Figure 3: London dataset: 2 out of 10 input images and their corresponding heightmaps WorldView-2 images (GSD=0.5m).Fusion of the different resolution DSMs is done by projecting the DSMs into the same coordinate frame and setting the weights w k for the invalid pixels of the coarser resolution to zero.As reference we have LiDAR data with a GSD of about 1.0m.The numerical results of local median fusion and global TV-L1 and TGV-L1 fusion are given in Table 2, indicating an improvement of both local and global fusion algorithms over the coarse CartoSat-1 DSM, but only the global fusion algorithm improves the result of the high resolution WorldView-2 DSM.
Figure 4: London dataset: Fusion results

Table 2 :
Results of local median fusion and global TGV-L1 fusion for heterogenous sensor data (CartoSat-1 and WorldView-2 satellite images).The first two rows show the accuracy of the unfused DSM of each satellite separately, whereas the two bottom rows show the fusion results of the two DSMs per scene.