AUTOMATIC GENERATION OF BUILDING MODELS WITH LEVELS OF DETAIL 1-3

We present a workflow for the automatic generation of building models with levels of detail (LOD) 1 to 3 according to the CityGML standard (Gröger et al., 2012). We start by orienting unsorted image sets employing (Mayer et al., 2012), compute depth maps using semi-global matching (SGM) (Hirschmüller, 2008), and fuse these depth maps to reconstruct dense 3D point clouds (Kuhn et al., 2014). Based on planes segmented from these point clouds, we have developed stochastic methods for roof model selection (Nguatem et al., 2013) and window model selection (Nguatem et al., 2014). We demonstrate our workflow up to the export into CityGML.


INTRODUCTION
The automatic derivation of 3D-models of individual buildings is essential for the generation of landscape and city models of larger areas, especially if the data is used for further analysis or if it is presented in simulation environments. Also, the data of large 3D surface meshes needs to be reduced, e.g., by replacing mesh parts by geometric primitives (Schnabel et al., 2007, Lafarge and Mallet, 2012), or by deriving building models on various levels of detail (Becker and Haala, 2008, Verdie et al., 2015).
In recent years, we proposed three methods for the automatic generation of building models with different levels of detail (LOD) following the LOD definitions of CityGML 2.0 (Gröger et al., 2012). First, we demonstrated our ability to detect cuboid-based buildings and their major walls, i.e., LOD 1 (Nguatem et al., 2012). Second, we presented a method for determining roof models to obtain building models with LOD 2 (Nguatem et al., 2013). And finally, we proposed a reliable window and door extraction method for modelling building façades with LOD 3 (Nguatem et al., 2014). All methods rely on statistical evaluation of the 3D points. They perform well even if the reconstructed point cloud is noisy or if it contains many holes due to, e.g., bright or textureless object surfaces. Hence, our approach is robust for different kinds of data.
In this paper, we present a combination of our previously published methods and the workflow for automatic data analysis consisting of the orientation of images, the computation of depth maps, the generation of highly detailed 3D-point clouds, and finally the interpretation of the data and the construction of building models. Our workflow is almost fully automatic; only very little manual interaction is needed for inspecting the intermediate results, for scaling the dense point cloud, and for rotating the scene into a selected coordinate system. The last two interactions could be skipped if the GPS information of the acquired images is used. Our software returns the recognized building parts, i.e., walls, roof planes, and windows, and we export the model in CityGML 2.0 format (Gröger et al., 2012).
The paper is structured as follows: In the next section, we describe our methodology. Section 3 presents and discusses our experiments. Finally, we summarize the current state of our work and propose next steps.

METHODOLOGY
In this section, we first present and discuss our workflow starting with image orientation and ending with the reconstruction of a dense 3D-point cloud. Second, we describe our semantic analysis for building modelling.

3D-Point Cloud Generation
As a first step, we estimate image orientations with (Mayer et al., 2012) including the recent improvements (Mayer, 2014, Michelini and Mayer, 2014, Michelini and Mayer, 2016). The orientation procedure efficiently estimates camera poses also for large, unsorted image sets. To this end, the images are first sorted according to the number of matched SIFT points (Lowe, 2004) to obtain overlap information between the images. Then, a triplet graph is constructed (Michelini and Mayer, 2016) and highly precise poses are estimated for the triplets. Finally, the poses of image triplets are hierarchically merged, including the detection of critical camera configurations (Michelini and Mayer, 2014) and a bundle adjustment on every level (Mayer, 2014).
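The sorting and triplet construction can be illustrated with a small sketch. It is not the procedure of (Michelini and Mayer, 2016); it only shows the idea of turning pairwise overlap information into candidate triplets. The input dictionary `matches`, the threshold `min_matches` and the scoring by the weakest pair are assumptions made for illustration, and the brute-force enumeration is only suitable for small image sets.

```python
from itertools import combinations

def candidate_triplets(matches, min_matches=50):
    """Return image triplets whose three pairs all have enough matches.

    matches: dict mapping frozenset({i, j}) -> number of matched points
             (hypothetical input; the counts would come from SIFT matching).
    """
    # Keep only image pairs with sufficient overlap.
    strong = {pair for pair, n in matches.items() if n >= min_matches}
    images = sorted({i for pair in strong for i in pair})
    triplets = []
    for a, b, c in combinations(images, 3):
        pairs = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(p in strong for p in pairs):
            # Score a triplet by its weakest pair (an assumed heuristic).
            score = min(matches[p] for p in pairs)
            triplets.append((score, (a, b, c)))
    # Strongest triplets first.
    triplets.sort(reverse=True)
    return [t for _, t in triplets]

# Toy example with four images; the pair (1, 3) has too few matches.
matches = {
    frozenset((0, 1)): 400, frozenset((0, 2)): 120, frozenset((1, 2)): 300,
    frozenset((2, 3)): 90, frozenset((1, 3)): 10, frozenset((0, 3)): 60,
}
triplets = candidate_triplets(matches)
print(triplets)
```

In this toy example, only the triplets that avoid the weak pair (1, 3) survive, and they are ordered by their weakest link.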
The obtained orientation is highly precise and very robust, also for arbitrary image configurations. The approach does not need additional information on position, e.g., by GPS, or viewing direction, e.g., by INS. Furthermore, no calibrated cameras are needed, so that almost any photogrammetric but also consumer camera can be used for image acquisition. The orientation is initialized with a mapping between the images and an approximate calibration matrix for each camera, which is given by
$$K = \begin{pmatrix} f_x & s & x_P \\ 0 & f_y & y_P \\ 0 & 0 & 1 \end{pmatrix}$$
with shear s = 0 and the normalized coordinates of the principal point (xP, yP) = (0, 0). The focal lengths fx and fy are set from f, the focal length [mm], and from h and w, the sensor height and width [mm]. Thus, fx and fy are the focal lengths with a normalized scale. The orientation returns a relative 3D-model of the scene containing the estimated poses of all cameras and the 3D-positions of the matched image points. This point cloud is relatively sparse, but dominant objects, such as buildings, trees or the ground, can readily be seen by manual inspection.
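As a worked example of the initialization, the following sketch builds an approximate calibration matrix from the nominal focal length and the sensor size. The normalization used here, the focal length divided by the mean of sensor width and height, is an assumption and not taken from the text; it is chosen because it reproduces, up to rounding, the initial values fx = fy = 0.8 and 1.169 used in our experiments. The sensor size of 35.9 mm × 24.0 mm for both cameras is likewise an assumption.

```python
def normalized_focal_length(f_mm, sensor_w_mm, sensor_h_mm):
    """Assumed normalization: focal length over the mean sensor side length."""
    return 2.0 * f_mm / (sensor_w_mm + sensor_h_mm)

def calibration_matrix(fx, fy, shear=0.0, xp=0.0, yp=0.0):
    """Approximate calibration matrix as nested lists (row-major)."""
    return [[fx, shear, xp],
            [0.0, fy, yp],
            [0.0, 0.0, 1.0]]

# Assumed full-frame sensor size (35.9 mm x 24.0 mm) for both cameras.
f_nikon = normalized_focal_length(24.0, 35.9, 24.0)  # 24 mm lens
f_sony = normalized_focal_length(35.0, 35.9, 24.0)   # 35 mm lens
K_nikon = calibration_matrix(f_nikon, f_nikon)
K_sony = calibration_matrix(f_sony, f_sony)
print(round(f_nikon, 3), round(f_sony, 3))
```

Since the orientation refines the calibration in the bundle adjustment, such an initialization only needs to be approximately right.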
A dense 3D-point cloud is obtained after computing depth maps using semi-global matching (SGM) by (Hirschmüller, 2008) with census as matching cost (Hirschmüller and Scharstein, 2009).
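To make the census matching cost concrete, the following minimal sketch computes a census transform on a small grey-value image and compares two census signatures by their Hamming distance. It is a plain, unoptimized illustration and not the implementation used in our workflow; the window radius and the list-of-rows image representation are assumptions.

```python
def census_transform(image, radius=1):
    """Census transform of a grey-value image given as a list of rows.

    Each pixel gets a bit string encoding which neighbours in the
    (2*radius+1)^2 window are darker than the centre pixel.
    Border pixels without a full window are set to 0.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            centre = image[y][x]
            bits = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    bits = (bits << 1) | int(image[y + dy][x + dx] < centre)
            out[y][x] = bits
    return out

def census_cost(a, b):
    """Matching cost: Hamming distance between two census bit strings."""
    return bin(a ^ b).count("1")

patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
brighter = [[v + 10 for v in row] for row in patch]
signature = census_transform(patch)[1][1]
signature_brighter = census_transform(brighter)[1][1]
print(census_cost(signature, signature_brighter))
```

Because only the ordering of grey values enters the signature, a constant brightness offset between two images does not change the cost, which is one reason why census is attractive for images taken under varying illumination.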
The dense 3D-point cloud is computed by fusing the depth maps and analysing the resulting 3D-points considering additional geometric constraints. The employed approach is scalable from small building scenes to large scenes of villages and cities (Kuhn et al., 2014, Kuhn and Mayer, 2015).
The 3D point cloud generation works fully automatically and we obtain point clouds with millions or even billions of 3D points, which can have a point spacing of less than 1 mm if the cameras have a sufficient resolution. The dense point cloud still describes a relative model without a meaningful scale, and the pose of the coordinate system is defined by the first camera analysed.
A further normalization for the dense point cloud is performed manually at the moment.
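The normalization amounts to a similarity transformation of the point cloud. The sketch below shows one possible form, assuming that an operator has measured one reference distance in the model (`measured_length`) whose true length is known (`true_length`), and has indicated the vertical direction of the scene (`up`). All three inputs and the function name are hypothetical.

```python
import math

def normalize_cloud(points, up, measured_length, true_length):
    """Scale a point cloud to metric units and rotate `up` onto the z-axis."""
    scale = true_length / measured_length
    norm = math.sqrt(sum(c * c for c in up))
    ux, uy, uz = (c / norm for c in up)
    # Rotation axis k = up x z; its length is the sine of the rotation angle.
    kx, ky = uy, -ux
    sin_a = math.sqrt(kx * kx + ky * ky)
    cos_a = uz
    result = []
    for x, y, z in points:
        if sin_a > 1e-12:
            ax, ay = kx / sin_a, ky / sin_a  # unit axis, z-component is 0
            # Rodrigues formula: v cos + (a x v) sin + a (a.v)(1 - cos)
            dot = ax * x + ay * y
            cx, cy, cz = ay * z, -ax * z, ax * y - ay * x
            x, y, z = (x * cos_a + cx * sin_a + ax * dot * (1 - cos_a),
                       y * cos_a + cy * sin_a + ay * dot * (1 - cos_a),
                       z * cos_a + cz * sin_a)
        elif cos_a < 0:
            x, y, z = x, -y, -z  # up equals -z: rotate 180 degrees about x
        result.append((scale * x, scale * y, scale * z))
    return result

# A point 2 model units along the measured up direction ends up 10 m above
# the origin if a reference distance of 2 units corresponds to 10 m.
normalized = normalize_cloud([(0.0, 2.0, 0.0)], up=(0.0, 1.0, 0.0),
                             measured_length=2.0, true_length=10.0)
print(normalized)
```

If GPS information of the images were used instead, the same transformation could be estimated automatically.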

Building Modelling
Our building modelling uses a coarse-to-fine approach, i.e., we first detect large building structures, such as major walls and roof surfaces, and only then search for smaller building parts such as windows. That is, we first derive building models with level of detail (LOD) 1 and 2, and then refine these models by further analysis of each wall.
We start by segmenting the 3D point cloud into small disjoint planar surfaces; then we analyse whether topologically adjacent surfaces fit a predefined roof model. Previously, our scene segmentation was limited to cuboid buildings (Nguatem et al., 2012), but we have extended our approach significantly.
Similar to other methods, where larger scenes with several buildings can be modelled, e.g., (Schnabel et al., 2007) or (Monszpart et al., 2015), we detect arbitrary planes in the entire reconstructed scene. To this end, we employ a divide-and-conquer approach, where we divide the scene into small disjoint patches. In each of these local neighbourhoods, we estimate the most dominant plane using RANSAC (Fischler and Bolles, 1981). Planes with similar normal vectors in adjacent neighbourhoods are merged to obtain reliable candidates for walls, roof planes and the ground surface.
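The estimation of the dominant plane in one local neighbourhood can be sketched with a basic RANSAC loop. This is a generic textbook variant and not our implementation; the distance threshold, the number of iterations and the fixed random seed are assumptions chosen for the synthetic example.

```python
import random

def plane_from_points(p, q, r):
    """Plane (unit normal n, offset d) with n.x = d through three points."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = sum(c * c for c in n) ** 0.5
    if length < 1e-12:
        return None  # degenerate (collinear) sample
    n = [c / length for c in n]
    return n, sum(n[i] * p[i] for i in range(3))

def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Return (normal, offset, inlier indices) of the dominant plane."""
    rng = random.Random(seed)
    best = (None, None, [])
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        # Inliers are all points closer to the plane than the threshold.
        inliers = [i for i, x in enumerate(points)
                   if abs(sum(n[k] * x[k] for k in range(3)) - d) <= threshold]
        if len(inliers) > len(best[2]):
            best = (n, d, inliers)
    return best

# Synthetic patch: 25 points on the plane z = 1 plus 5 outliers.
wall = [(0.5 * i, 0.5 * j, 1.0) for i in range(5) for j in range(5)]
outliers = [(0.5, 0.5, 3.0), (2.3, 1.1, -2.0), (1.7, 3.2, 5.0),
            (3.1, 0.2, 7.0), (4.4, 4.1, -4.0)]
normal, offset, inliers = ransac_plane(wall + outliers)
print(len(inliers), normal, offset)
```

The sign of the returned normal is arbitrary, which has to be taken into account when normals of adjacent neighbourhoods are compared for merging.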
In planar landscapes, the ground surface can easily be determined by selecting the largest plane perpendicular to the vertical direction. When the ground plane is removed, the major building planes characterize the scene. We cluster these planes and fit a roof model for each cluster employing (Nguatem et al., 2013). We employ the GRIC approach (Torr and Davidson, 2003) for our stochastic sampling to limit the influence of outliers. We make use of predefined roof shapes, and we selected several typical roof models of German buildings, e.g., pyramid roof, gable roof, or mansard roof. Since all these roof types have a small number of surfaces, we do not consider a penalty term for model complexity in our evaluation scheme, e.g., by considering minimum description length (Rissanen, 1978).
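The selection of the ground plane can be written down in a few lines. The sketch assumes that each merged plane is described by a unit normal and the number of its supporting 3D points, that "largest" refers to this number, and that a tolerance of 10 degrees against the vertical direction is acceptable; all three are assumptions for illustration.

```python
import math

def select_ground_plane(planes, vertical=(0.0, 0.0, 1.0), max_angle_deg=10.0):
    """Pick the largest plane whose normal is nearly parallel to `vertical`.

    planes: list of (unit_normal, number_of_supporting_points) tuples.
    Returns the index of the ground plane, or None if no plane qualifies.
    """
    cos_limit = math.cos(math.radians(max_angle_deg))
    best_index, best_support = None, -1
    for index, (normal, support) in enumerate(planes):
        # abs(): the sign of a plane normal is arbitrary.
        alignment = abs(sum(n * v for n, v in zip(normal, vertical)))
        if alignment >= cos_limit and support > best_support:
            best_index, best_support = index, support
    return best_index

planes = [
    ((1.0, 0.0, 0.0), 90000),   # large wall
    ((0.0, 0.0, 1.0), 50000),   # ground
    ((0.0, 0.6, 0.8), 70000),   # inclined roof plane
    ((0.0, 0.0, -1.0), 20000),  # small horizontal plane, flipped normal
]
ground_index = select_ground_plane(planes)
print(ground_index)
```

In the example, the wall and the roof plane have more supporting points than the ground, but they are rejected because their normals deviate too much from the vertical direction.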
The vertical walls below the recognized roof model are combined to obtain a watertight LOD 2 building model. Removing the roof structure, we can downgrade the building model to LOD 1. With respect to gable and half-hipped roofs, where the façades have different heights, we harmonize them by cropping the building model at the height of the lowest eaves.
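The downgrade from LOD 2 to LOD 1 can be sketched as an extrusion of the building footprint up to the lowest eaves. The representation of the building by a footprint polygon and one eaves height per wall is an assumption made for this example; a counter-clockwise footprint is assumed so that all surface normals point outwards.

```python
def lod1_block(footprint, eave_heights, ground_z=0.0):
    """Extrude a footprint polygon to the height of the lowest eaves.

    footprint: list of (x, y) corners in counter-clockwise order.
    eave_heights: one eaves height per wall.
    Returns the polygons (lists of 3D points) of a closed prism:
    floor, flat roof, and one rectangle per wall.
    """
    top_z = min(eave_heights)  # crop at the lowest eaves
    n = len(footprint)
    floor = [(x, y, ground_z) for x, y in reversed(footprint)]
    roof = [(x, y, top_z) for x, y in footprint]
    walls = []
    for i in range(n):
        (x0, y0), (x1, y1) = footprint[i], footprint[(i + 1) % n]
        walls.append([(x0, y0, ground_z), (x1, y1, ground_z),
                      (x1, y1, top_z), (x0, y0, top_z)])
    return [floor, roof] + walls

# Rectangular building whose gable walls are higher than the eaves walls.
footprint = [(0.0, 0.0), (12.0, 0.0), (12.0, 8.0), (0.0, 8.0)]
block = lod1_block(footprint, eave_heights=[6.0, 8.5, 6.0, 8.5])
print(len(block), block[1])
```

The resulting prism has a flat top at 6 m, i.e., at the height of the lowest eaves of the example building.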
For LOD 3 building models, we focus on openings in the walls such as windows and doors. So far, we have not finished the recognition and modelling of roof superstructures, such as dormers and chimneys, and other building parts like balconies, oriels and stairs. Again, the localization of windows is performed by stochastic evaluation (Nguatem et al., 2014), and we are able to fit the most common window styles in Germany: rectangular, arch-shaped and pointed arch-shaped windows.

EXPERIMENTS
In this section, we present the results of our tool chain. We start with describing the results of orienting 208 images acquired by two cameras. Then we show and discuss the results for our dense point cloud. Finally, we present the results of the functional modelling, i.e., the surface plane estimation and the window extraction.

Image Orientation
We acquired 208 images of a single building with two cameras: 70 images were taken manually with a Nikon D800 with a focal length of 24 mm, and the other 138 images were acquired with a Sony ILCE α7R with a focal length of 35 mm mounted on a remotely piloted Falcon 8 UAV. Both cameras capture images with 7 360 × 4 912 pixels, i.e., each image contains more than 36 million pixels (RGB). We employ the orientation approach described in the section on 3D-point cloud generation. We initialized the orientation with fx = fy = 0.8 for the Nikon camera and with fx = fy = 1.169 for the Sony camera. The orientation, including the construction of a graph of matchable image triplets and the hierarchical bundle adjustment, was performed in 21 minutes on a standard computer with 16 cores, returning the estimates of 43 321 3D points and the orientation of all 208 images, cf. Fig. 1. The calibration matrices returned for the Nikon and the Sony camera indicate that our initialization is a reasonable approximation. Recent experiments and a comparison with another approach, VisualSFM (Wu, 2011, Wu, 2013), are presented in (Mayer, 2015, Michelini and Mayer, 2016).
Figure 2. Six views of a building corner and the corresponding SGM outputs (depth maps). There are almost no commonly matched points between the third and the fourth image. However, due to sufficiently many corresponding points between images 1 to 3, and 4 to 6, respectively, we are able to reconstruct dense building surfaces.

SGM and Reconstruction of Dense 3D Point Clouds
SGM took 1357 minutes, i.e., almost 23 hours. The large computation time arises because we derive one depth map for each image, containing the fused depth information of all pairwise image matches with SGM. The pairwise SGM was calculated on a field programmable gate array (FPGA) Virtex-6 board; the fusion to one depth map was calculated on the CPU. We also downscaled the images by a factor of 2, so that all depth maps have a resolution of 3 680 × 2 456 pixels.
In our experiments, we used the original implementation of SGM by (Hirschmüller, 2008) with census matching cost (Hirschmüller and Scharstein, 2009). Although this implementation belongs to the best SGM implementations (high ratio between correctness and performance), we have difficulties in finding the correct correspondences on large weakly textured surfaces, in very bright or dark areas, and when viewing a surface at an angle of more than 45 degrees to its normal vector, cf. Fig. 2. No depth information could be estimated for the white pixels in the SGM output images.
In the next step, the depth maps for the individual images are fused to obtain a 3D point cloud. The fusion process analyses the data concerning geometric plausibility, so we obtain a point cloud with almost no outliers. Since the approach of (Kuhn et al., 2013) and (Kuhn et al., 2014) divides the scene into smaller parts using an octree, the depth of the octree is correlated with the size of the model and the positional accuracy of the individual 3D points.
Due to the large number of pixels, we would obtain an extremely large number of 3D points if we reconstructed the scene with the highest available resolution. In consequence, the 3D models would consist of billions of triangles, and we would not be able to visualize them on standard computers.
The 3D model shown in Fig. 3 consists of 25 687 052 3D points and 50 686 350 triangles for the entire scene of the building and its surroundings. The reconstruction was computed in approximately 14 hours, again on the standard PC with 16 cores. The density of the 3D model is higher than one point per cm³. E.g., the handrail of the stairs is clearly visible. The texture of the mesh could be improved, since the sign left of the door is not readable in the model.
Further results for SGM and the fusion of depth maps into dense 3D point clouds can be found in recently published papers, e.g., (Kuhn et al., 2014) and (Mayer, 2015).

Functional Modelling
So far, we only presented results of our workflow to demonstrate the generation of our input data when we derive 3D models from imagery. Nevertheless, our approach for functional modelling is also suitable for LiDAR point clouds, which usually have less noise and fewer outliers, and in which coplanar LiDAR points appear in a regular grid. Furthermore, the 3D models derived from imagery are relative models, i.e., the point cloud does not have a normalized scale, and we do not know the vertical direction of the scene. Consequently, we manually normalize each 3D model.
In the first step of functional modelling, we detect all major planar surfaces of the scene, cf. Fig. 4. The largest plane, which is nearly perpendicular to the vertical direction in non-mountainous areas, is usually the ground surface of the scene. The remaining planes are clustered to obtain candidates for building parts and other objects. With these planes, secondly, we can derive building models following (Nguatem et al., 2013). Our output shows all major walls and the half-hipped roof planes, cf. Fig. 5. Cutting off the surfaces of the roof, we can also derive the corresponding LOD 1 from the LOD 2 model. The LOD 2 model was derived in less than two minutes.
In the third step, we look for holes in all vertical walls. Thus, we are only able to detect open windows or windows which lie behind the plane of the building's wall. Windows with a closed shutter cannot be detected if the reconstructed 3D points of the shutter lie (almost) within the plane of the wall. Furthermore, we are able to localise windows of a previously defined size: we designed our window model with common width and height parameters. Due to performance issues, we have rejected small window sizes, i.e., we are unable to extract the smaller windows which can usually be found in the cellar.
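The idea of searching for holes of a predefined size can be illustrated with a deliberately simplified sketch. It replaces our stochastic evaluation (Nguatem et al., 2014) by an exhaustive search on an occupancy grid and only handles rectangular windows; the grid cell size, the wall coordinates (u, v) and the function names are assumptions for illustration.

```python
def occupancy_grid(points_2d, width, height, cell):
    """Rasterize wall points (u, v in wall coordinates) into a boolean grid."""
    cols, rows = round(width / cell), round(height / cell)
    grid = [[False] * cols for _ in range(rows)]
    for u, v in points_2d:
        c, r = int(u / cell), int(v / cell)
        if 0 <= c < cols and 0 <= r < rows:
            grid[r][c] = True
    return grid

def find_empty_windows(grid, win_cols, win_rows):
    """Return (row, col) corners of non-overlapping empty boxes of fixed size."""
    rows, cols = len(grid), len(grid[0])
    used = [[False] * cols for _ in range(rows)]
    hits = []
    for r in range(rows - win_rows + 1):
        for c in range(cols - win_cols + 1):
            cells = [(r + i, c + j)
                     for i in range(win_rows) for j in range(win_cols)]
            # A hit needs a box without wall points that overlaps no earlier hit.
            if all(not grid[i][j] and not used[i][j] for i, j in cells):
                hits.append((r, c))
                for i, j in cells:
                    used[i][j] = True
    return hits

# Synthetic wall of 4 m x 3 m with a 1 m x 1 m hole, sampled on a 0.5 m grid.
cell = 0.5
wall_points = [((c + 0.5) * cell, (r + 0.5) * cell)
               for r in range(6) for c in range(8)
               if not (2 <= r <= 3 and 2 <= c <= 3)]
grid = occupancy_grid(wall_points, width=4.0, height=3.0, cell=cell)
windows = find_empty_windows(grid, win_cols=2, win_rows=2)
print(windows)
```

The sketch also shows the limitation discussed above: a window whose shutter points fall into the wall plane fills the grid cells and is therefore not found.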
Figure 3. Reconstructed dense point cloud with surface mesh containing more than 25 million 3D points and more than 50 million triangles. This result is not the highest resolution we can obtain, but we are still able to visualise this model. The lower part shows a close view, where details such as the sign left of the door or the handrails of the stairs can be recognized.
Regarding our test example, we could localize all 40 typical windows, cf. Fig. 6. The small windows in the cellar, the closed windows and the windows within the dormers are missing. The derivation of the LOD 3 model with windows was done within one minute.
Further results of the derivation of building models with LOD 1, 2 and 3 can be found in the previous publications (Nguatem et al., 2012), (Nguatem et al., 2013) and (Nguatem et al., 2014). There we also show results of various roof types and window styles which are common for buildings in Germany, e.g., pyramid roof, gable roof and mansard roof, or round arch-shaped and pointed arch-shaped windows.
In this paper, we restrict ourselves to only one example with a half-hipped roof and with normal-sized windows, because we want to present our workflow with as many details as possible. Currently, we also test our workflow on publicly available data sets, e.g., the ISPRS benchmark for dense image matching (Nex et al., 2015). Yet, as the roof structures of the buildings of this data set are complex, we see a need for extending our roof modelling towards arbitrary roof structures, cf., e.g., (Xiong et al., 2014).

Export to CityGML
In the last step of our workflow, we export the derived building model to CityGML. We are able to import our output into the free CityGML viewer by the Institute of Applied Computer Science. The export is also done within a few seconds, so the total time consumed for the automatic derivation of the building model from 208 images is 37.5 hours, with most of the time being used for SGM and 3D reconstruction.
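The structure of such an export can be sketched with the Python standard library. The example writes one building with semantically labelled boundary surfaces, using the element names and namespaces of CityGML 2.0 as we understand them; it is a minimal illustration, not our export module, and the output has not been validated against the CityGML schema. The function name and the input format are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "core": "http://www.opengis.net/citygml/2.0",
    "bldg": "http://www.opengis.net/citygml/building/2.0",
    "gml": "http://www.opengis.net/gml",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

def q(prefix, tag):
    """Qualified tag name in ElementTree's {uri}tag notation."""
    return "{%s}%s" % (NS[prefix], tag)

def building_to_citygml(building_id, surfaces):
    """Serialize one building.

    surfaces: list of (kind, ring) with kind 'WallSurface' or 'RoofSurface'
    and ring a closed list of (x, y, z) points (first point repeated last).
    """
    model = ET.Element(q("core", "CityModel"))
    member = ET.SubElement(model, q("core", "cityObjectMember"))
    building = ET.SubElement(member, q("bldg", "Building"),
                             {q("gml", "id"): building_id})
    for kind, ring in surfaces:
        bounded = ET.SubElement(building, q("bldg", "boundedBy"))
        surface = ET.SubElement(bounded, q("bldg", kind))
        multi = ET.SubElement(
            ET.SubElement(surface, q("bldg", "lod2MultiSurface")),
            q("gml", "MultiSurface"))
        polygon = ET.SubElement(
            ET.SubElement(multi, q("gml", "surfaceMember")),
            q("gml", "Polygon"))
        linear_ring = ET.SubElement(
            ET.SubElement(polygon, q("gml", "exterior")),
            q("gml", "LinearRing"))
        pos_list = ET.SubElement(linear_ring, q("gml", "posList"))
        pos_list.text = " ".join("%g" % c for point in ring for c in point)
    return ET.tostring(model, encoding="unicode")

wall = [(0, 0, 0), (1, 0, 0), (1, 0, 3), (0, 0, 3), (0, 0, 0)]
xml_text = building_to_citygml("building_1", [("WallSurface", wall)])
print(xml_text)
```

Windows of an LOD 3 model would be attached to their wall surfaces as openings in a similar way, which is omitted here for brevity.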

CONCLUSION AND OUTLOOK
In this paper, we have described an approach for the automatic generation of dense 3D point clouds from unsorted image sets and the automatic derivation of building models with levels of detail (LOD) 1 to 3. The modelling of buildings is based on segmenting the 3D point cloud into planes. Then we fit roof models and window models to the data employing the stochastic approaches (Nguatem et al., 2013, Nguatem et al., 2014).
Our approach can easily be extended to other appearances of building parts, e.g., half-spherical and cone-shaped roofs or circular windows. To this end, we have to update the database of defined roof or window models.
Furthermore, we plan to integrate a scene interpretation module into our workflow. E.g., the methods of (Huang and Mayer, 2015, Kuhn et al., 2016) or (Kluckner and Bischof, 2010) can be used to classify 3D point clouds of landscapes to detect buildings in villages and cities. We also have to solve the problem of finding closed windows, i.e., windows which, or whose shutters, lie in the same plane as the surrounding wall. We are not able to detect such windows based on relative depth information, so we need a further analysis, e.g., of the rectified façade image. This can be done by employing grammar-based approaches (Teboul et al., 2013, Martinovic and Van Gool, 2014) or by façade image interpretation, e.g., by convolutional networks (Schmitz and Mayer, 2016) or by a marked point process (Wenzel and Förstner, 2016).

Figure 1. Orientation result of 208 images (presented as pyramids) showing a single building.

The orientation result shown in Fig. 1 has an average re-projection error of 0.45 pixels.

Figure 4. Segmented planes of the test scene showing a building and its surroundings.

Figure 5. LOD 2 building model. Surfaces of the roof model and 3D points in the same view, with the points supporting the model in yellow and the others in green (top), and wireframe model of the same building (below).