HEURISTICAL FEATURE EXTRACTION FROM LIDAR DATA AND THEIR VISUALIZATION

Extraction of landscape features from LiDAR data has been studied widely in the past few years. These feature extraction methodologies have been focussed on certain types of features only, namely the bare earth model, buildings principally containing planar roofs, trees and roads. In this paper, we present a methodology to process LiDAR data through DBSCAN, a density based clustering method, which extracts natural and man-made clusters. We then develop heuristics to process these clusters and simplify them to be sent to a visualization engine.


INTRODUCTION
Over the last decade, Light Detection And Ranging (LiDAR) has become one of the most significant technologies used in the field of topographic mapping.As on date, LiDAR is a standard for capturing object surface information for 3D modelling of terrains.Although, these systems provide detailed valuable geometric information, the data still requires interpretation for object extraction and recognition for its use for the purpose of mapping and/or modelling.
Owing to the inaccuracies inherent in the LiDAR data (Habib, 2009), which are due to its components namely the GPS, INS and the laser ranging system and other factors like flight planning, flying conditions, atmospheric effects, terrain undulation and vegetation cover, the extraction of features is not a trivial task.Extraction of features from LiDAR data has been studied extensively in the past decade (Brenner, 1999, Haala and Brenner, 1999, Brenner, 2000, Forlani et al., 2003, Rottensteiner, 2003, Fujisaki et al., 2003, Haala and Brenner, 1997, Hofmann et al., 2003, Rottensteiner et al., 2005, Clode et al., 2007, Bretar, 2008).The study of literature strongly indicates that the focus of these studies have been on specific types of features.In this paper, we propose a novel approach for processing LiDAR data using clustering techniques for extraction of features which would be followed by a development of heuristics, to help generalize and direct the features to a visualization engine.This research is an extension of an earlier work done by the authors (Ghosh and Lohani, 2007) in order to devise a method for quickly processing and visualizing LiDAR data.

CLUSTERING METHODS
Data clustering is known by different names e.g.segmentation, data mining, etc.The main goal is to partition a given set of data points into disjoint subsets.Clustering methods can be divided into many groups namely, partition based, hierarchical, manifold based, evolutionary, fuzzy and hybrid.
Partition based methods like ISODATA, k-means or k-medioid algorithms take an input k for the number of clusters to be generated.Manifold based methods determine only a particular form of surface in the dataset.Evolutionary methods using genetic algorithms, artificial bee colony, and particle swarm optimization technique have been effciently used in clustering datasets.Fuzzy c-means and its extensions namely fuzzy ISODATA, Gata-Gava, fuzzy c-shells, fuzzy c-spherical shells, fuzzy c-ellipsoidal shells etc utilize an optimization procedure tuned to particular shapes of clusters (Baghshah and Shouraki, 2008).
Density based clustering methods like DBSCAN (Ester et al., 1996), OPTICS (Ankrest et al., 1999), DENCLUE (Hinneburg and Gabriel, 2007) etc. are capable of detecting arbitrary shaped clusters in spaces of any dimension.LiDAR data clusters have arbitrary shapes and therefore density based clustering methods are applicable for segmenting LiDAR data.Literature on clustering suggests that amongst the density based clustering techniques, DBSCAN has been very widely studied, implemented and extended.We have therefore chosen DBSCAN as the clustering algorithm for our study related to LiDAR data.

OBJECTIVE
In this paper, we draw out a complete pipeline to mine, extract and generalize specific types of features using clustering of Li-DAR data and heuristical techniques which we design and demonstrate in the process.The clusters are generalized via the heuristic to enable redirection to a visualization engine.

STUDY AREA
In 2004, Optech Corporation conducted a LiDAR flight over the Niagra Falls with the ALTM sensor.The airplane flew at an average height of 1190m, with a DSS 301 SN0039 camera on board for the aerial photographs.The average density of the LiDAR data is 4.85 points per square metre, where all the returns are considered.Subsets of 100m×100m, were chosen based on the presence of features like gabled-roof houses, trees, and the ground.Results in this paper are presented for one of these subsets.The terrain for the given subset is flat.

Mathematical notations and definitions
Let S = {P i (x i , y i , z i ), i = 1, . . ., N} denote a set of three dimensional data points with cardinality N. Let x min , x max , y min , y max , z min , and z max denote the minimums and maximums of the x, y and z coordinates respectively.∆(S ) denotes the two-dimensional Delaunay triangulation (DTRI) of the set S .The longest edge of the triangle ∆ ∈ ∆, is denoted by (∆).Given a Delaunay triangulation ∆(S ) and a threshold τ, a trimmed Delaunay triangulation (TDTRI) of S is denoted by Given a DTRI or TDTRI ∆, the set Ψ(∆(S )) is called the "outline" of the DTRI or TDTRI ∆(S ).In an outline, each of the triangles are broken into edges and then the duplicate edges are removed.The remaining edges are then joined back to form the outline.

Adapting DBSCAN for LiDAR data
Since the LiDAR data has redundant points, we have modified the Euclidean metric as in equation [1].

Visual clustering
A visual clustering of the data is carried out using the TerraScan module from TerraSolid Inc.The visually segmented dataset would be used as a benchmark for evaluating the quality of clustering achieved by the DBSCAN algorithm.It would be an interesting exercise to see if the DBSCAN clustering algorithm follows the natural and artificial features in the given dataset.

Comparing clusters with the actual ones
In DBSCAN, for each of the objects belonging to a cluster its ε-neighbourhood needs to contain at least minPts objects.Ester et al. (1996) have attempted to develop a simple and effective heuristic to determine the parameters ε and minPts using sorted k-dist graphs which map each point to the distance of the k-th nearest neighbour.However, this process is rather interactive and the authors recommend the use of the graphical representation of the k-dist graph to help the user estimate the correct threshold.Daszykowski et al. (2001) assume a uniform distribution of points and thus determine the threshold accordingly.Since the LiDAR data do not follow this statistical distribution, we determine the appropriate distance threshold ε, using an experiment and the Adjusted RAND Index (ARI) (Hubert and Arabie, 1985).
The ARI has values ranging from 0 to 1, indicating similarities from the least to the exact between two clustering outputs.The clusters generated by DBSCAN are compared to the visually segmented dataset for a number of thresholds and the ARI is determined for each case.The highest value of the ARI gives the best threshold for the clustering algorithm.

Types of clusters
The various kinds of clusters that are obtained from the clustering are those containing a little number of points but spanning a small area, those containing a few number of points but spanning a large area, those shaped in the form of domes, those potentially containing planar areas and those following the terrain.Owing to the multiple returns present in the LiDAR dataset, a dome shaped cluster not only contains points at its periphery but also inside it.
Each of these kinds of clusters would have to be given a different treatment for final generalization and visualization.The various types of clusters that could be obtained from the clustering process are given in figure [1].In the next subsection, we develop the heuristics for treating the various types of clusters in the dataset.
Figure 1: Types of clusters

Heuristics
The following definitions are introduced in order to develop and understand the heauristics used for the processing of LiDAR data.
Given a set of data points S , with N points, the centroid of the set is given as Points P 1 , P 2 , P 3 and P 4 are defined as P 1 : (x min , y min , z min ), P 2 : (x min , y max , z min ), P 3 : (x max , y min , z min ), and P 4 : (x max , y max , z min ).Let P c 1 , P c 2 , P c 3 and P c 4 be the points closest to P 1 , P 2 , P 3 and P 4 respectively.Let P denote the least squares plane for the points P c 1 , P c 2 , P c 3 and P c 4 .Then the distance of the point P c , from the plane P for the set S is denoted by δ(S ) and is called as the apex distance for the set S .Let w and b be the length and breadth of the minimal rectangle surrounding the projection of the points of S into the XY plane.The ratio defined as is called as the aspect ratio for the set S .S is called wide if min(w, b) ≥ W where W is a given threshold.S is called flat if δ(S ) < f where f is a given threshold.It w, b and h denote the width, breadth and height of the minimal cuboid containing the points of S .Then denotes the point density per unit volume for the set S .If ρ(S ) ≤ ρ 1 or #(S ) < M, then this set of points is called as sparse, where ρ 1 is given threshold in terms of density, and M is a threshold in terms of number of points.If ρ 1 ≤ ρ(S ) ≤ ρ 2 , where ρ 1 and ρ 2 are given thresholds, the point set S would be referred as a dome shaped cluster.For the purpose of this work, the thresholds ρ 1 , ρ 2 and M are determined empirically, through a visual study of the clusters generated.It can be however said that ρ 1 and ρ 2 depend on a priori information from the data and area under study, like scanning frequency and the features present on the terrain.Calculate Ψ(∆ (S )) 9: Drop the walls to the baseheight h 10: end for The remaining clusters are supposed to be potentially containing planes.The detection of the planes are done using the RANSAC algorithm proposed by Fischler and Bolles (1981).We devise our RANSAC based processing close to Bretar (2008), but however restrict the accepted models so that it has at least n number of points close to it.A certain number of points are rejected during the RANSAC process.In case these points are placed closely, and exceed a certain number, we use triangulation before directing them to the visualization engine.The points belonging to the respective model are projected to the plane.The outlines of these planes are found out and walls are "dropped" to a base height.The algorithm for processing the clusters potentially containing planes is given in Algorithm

RESULTS AND DISCUSSION
We have used a visually segmented dataset as a benchmark for evaluating the performance of the clustering algorithm.The manually classified dataset had 31 clusters.

Identification of best thresholds for clustering
A pre-visualization of the LiDAR data indicated that if the εneighbourhood of a point consisted of less than three points, then we could label the point as an outlier.Thus, the minPts criteria for the points was chosen as 3. To determine the best distance threshold for DBSCAN, an experiment was conducted where ε was varied from a very low value of 0.7 meter to a high value of 4.0 meters.For each of the values of the distance threshold, the Adjusted RAND Index (ARI) was found out to compare the performance of the clustering algorithm for the given dataset with respect to the manually clustered dataset.The performance graphs Table 1: Parameters and contingency tables for testing heuristical data processing 6.2 "Peel"ing a dome shaped cluster As described in Algorithm [2], a dome shaped cluster extracted from the clustered dataset, is "peel"ed by a process of using tetrahedralization followed by trimming of the triangular facets of the tetrahedrons.Figure [6b] shows the peel of a specific dome shaped cluster.

Determination of the thresholds for RANSAC algorithm
The RANSAC algorithm requires a distance threshold ε for fitting the planes and the number of draws from the set of points that one should make to find out the best model.Bretar (2008) gives a method for finding out the number of draws.This approach was modified by adding another threshold m, which specifies that the model drawn out by the RANSAC algorithm had to conform to at least m points from the dataset.It was decided by observation in the dataset, that the minimum size of the plane to be detected had to be at least 4m × 4m.Therefore M can easily be determined by multiplying this area by the areal density of the dataset.
An experiment was conducted to find out the best threshold for RANSAC by taking out some planar clusters and then varying 6.4 Efficacy of the heuristics 5 different subsets were extracted from the main dataset and each of them were clustered using DBSCAN and the same parameters.40 clusters were manually extracted and sorted into various groups namely sparse, wide and flat, dome-shaped and potentially containing planes.Each of these clusters were also passed through the heuristics algorithm with a given set of paramters demonstrated in table [1a] which also sorted them into various groups.The efficacy of the heuristics is demonstrated by the contingency table [1b].

Processing of the dataset and three dimensional restitution
The given dataset is now passed through the pipeline using the thresholds described in table [1a], to generate a simplicial representation of the terrain and the features over it.The number of points in the clusters generated by DBSCAN vary from a very small number, in the order of 10 or less, to a very large number to the scale of 800 or more.To make the processing faster, it was empirically decided that clusters containing less than 100 points would not be processed through this heuristic process, and would be ignored.
The features thus generated by the heuristic process are converted into simplices and polygons with 3D vertices and stored into a file.This file is then read using a program developed using the OpenSceneGraph API and C++.OpenSceneGraph is a tool widely used for visualization of 3D data.The authors have used this tool for immersive visualization of the simplices and polygons generated by the algorithm.The 3D landscape model is also draped with the geocoded aerial photograph of the study area.
The complete procedure of clustering, followed by heuristics and three dimensional restitution took about 600 seconds of time on a 1.5 GHz, 64 bit Kubuntu Linux Machine with 2GB RAM.Two different views of this 3D landscape model in anaglyph mode are shown in figure [8].
In the earlier attempt (Ghosh and Lohani, 2007) for near-realistic three dimensional visualization of the LiDAR data using triangulation and tetrahedralization, it was noted that both these approaches lead to the loss of features and increased the complexity of computation.This paper, has extended the earlier work to present a hybrid approach using triangulation or tetrahedralization based on the decision made by the heuristical process.

CONCLUSION
This paper presented a density based algorithm DBSCAN (Ester et al., 1996) which was adapted for use with LiDAR data with a modified Euclidean metric for calculation of distances.The results show that the algorithms have demonstrated almost satisfactory clustering results, thus yielding clusters which are close to the ground features.The thresholds used in this study were determined by running several experiments and thus choosing the best threshold from existing statistical measures.A study on the analytical determination of the thresholds automatically from the parameters of the lidar data viz.pulse repetition frequency, flying height, flying velocity etc is currently in progress.We also demonstrated a heuristic based approach for processing the various kinds of clusters obtained through the process of clustering.
It is seen that triangulation of the entire dataset essentially ends up in the loss of three dimensional details and a tetrahedralization makes the computation too complex.The heuristic based approach has therefore enabled us to pursue a hybrid approach where both triangulation and tetrahedralization have been used for processing of the clusters.This study was conducted on a small subset of a comparatively larger dataset, for testing feature extraction.For larger datasets, it will be imperative to implement spatial indexing by R*-trees and tiling to test the performances of these algorithms on larger datasets and develop a methodology for processing it in a smaller amount of time.

Firstly
Figure 2: Schema for the processing of clusters [3].The schema for the heuristics of processing the clusters is shown is figure[2].

Figure 3 :
Figure 3: The processing pipeline for the LiDAR datasets 5.7 Summary of the pipeline for data processing In our processing pipeline, a visual segmentation of the given dataset is conducted using the TerraScan module.Clustering of LiDAR data is conducted using DBSCAN for various thresholds and an ARI based comparison is made with the visually segmented dataset.The highest value of ARI would give us the best distance threshold for clustering.Each of the clusters thus obtained is passed through the rule base for generalization and redirection to the visualization engine.A graphical representation of this process is given in figure [3].
for these thresholds v/s the ARI is shown in figure[4].The outputs of the DBSCAN algorithm with the best threshold are shown in figure

Figure 4 :
Figure 4: ARI Performance graphs for DBSCAN Figure 6: (a) A dome shaped cluster (b) A "peel"ed dome cluster using tetrahedralization and subsequent thresholding of the triangular facets.The colors of the triangles vary from dark blue to dark red based on the elevation of the centroid of the triangles (c) Data points for a gabled roof (d) The gabled roof structure generated by the pipeline (e) Data points for a hipped roof (f) The hipped roof structure generated by the pipeline.

Figure 7 :
Figure 7: Threshold determination process for the RANSAC algorithm.The correct threshold seems to settle in between σ h and 1.5σ h

Figure 8 :
Figure 8: Anaglyph views of the data