Zero-shot detection of buildings in mobile LiDAR using Language Vision Model

Recent advances have demonstrated that Language Vision Models (LVMs) surpass the existing State-of-the-Art (SOTA) in two-dimensional (2D) computer vision tasks, motivating attempts to apply LVMs to three-dimensional (3D) data. While LVMs are efficient and effective in addressing various downstream 2D vision tasks without training, they face significant challenges when it comes to point clouds, a common format for representing 3D data. Features are harder to extract from 3D data, and large data sizes together with the cost of collection and labelling result in a notably limited availability of datasets. Moreover, constructing LVMs for point clouds is even more challenging due to the requirements for large amounts of data and training time. To address these issues, our research aims to 1) apply Grounded SAM through spherical projection, which transfers 3D data to 2D, and 2) experiment with synthetic data to evaluate its effectiveness in bridging the gap between synthetic and real-world data domains. Our approach exhibited high performance with an accuracy of 0.96, an IoU of 0.85, a precision of 0.92, a recall of 0.91, and an F1 score of 0.92, confirming its potential. However, challenges such as occlusion problems and pixel-level overlaps of multi-label points during spherical image generation remain to be addressed in future studies.


INTRODUCTION
With the advancement of 3D acquisition technologies such as LiDAR and other 3D sensors, along with reductions in price, point cloud data has become more accessible and affordable than in the past. This has led to the emergence of multiple deep-learning techniques that can extract meaningful information from point clouds (Qi et al., 2017a,b; Thomas et al., 2019; Guo et al., 2021). In the field of remote sensing and photogrammetry, building detection is one of the most fundamental problems for urban scenes, as it forms the basis of 3D city model construction. Despite the remarkable progress in deep learning, significant obstacles remain when training deep learning models from scratch. One of the biggest challenges is the amount of data required for training, which is particularly hard to obtain in the case of point clouds. Moreover, even when a dataset is available, training models on such vast amounts of data requires considerable computing power. These issues weigh especially heavily on the automation of 3D building detection, where data collection, processing, and time-consuming training are significant obstacles.
In order to address these limitations, several previous studies have proposed techniques such as Sim-to-Real domain transferable learning (Griffiths and Boehm, 2019b; Xiao et al., 2022; Jin et al., 2022; Zhang et al., 2022c). This approach involves training models on synthetic datasets and then applying them to real-world datasets. Collecting and annotating real-world data is laborious and time-consuming, and training on synthetic datasets helps to avoid that. Moreover, synthesising data allows the generation of extensive data under various conditions such as different lighting and urban layouts. However, because of the domain discrepancy, models trained on synthetic datasets might struggle to generalise to real-world data, which differs in noise, density, and distribution of points. Therefore, when training on synthetic datasets, caution must be taken to avoid overfitting. It is crucial to build models that are robust to the variations and imperfections in real-world data.
Alternatively, zero-shot transfer can be utilised for various computer vision tasks. Thanks to the advancement of Language Vision Models (LVMs), solving these tasks without the need for explicit training is now possible. LVMs, pre-trained on extensive datasets, consistently demonstrate high performance across different tasks. For instance, CLIP (Radford et al., 2021), a prominent LVM, jointly trains an image encoder and a text encoder using an image-text paired dataset. Since the inception of the CLIP model, newer, more powerful models have emerged, such as OpenCLIP (Cherti et al., 2022), ALIGN (Jia et al., 2021), and Flamingo (Alayrac et al., 2022). Subsequently, LVMs have enhanced their robustness and adaptability through knowledge distillation for specific computer vision tasks like object detection (Liu et al., 2023; Du et al., 2022; Lin et al., 2022) and semantic segmentation (Zhou et al., 2023; Liang et al., 2023).
While LVMs have demonstrated impressive performance in visual understanding without fine-tuning and with an open vocabulary, their reliance on extensive training data and computational resources remains a challenge, particularly in the context of 3D point clouds. To address these challenges, researchers have explored techniques such as leveraging pre-trained LVMs for point cloud understanding using multi-view approaches (Takmaz et al., 2023; Peng et al., 2023), knowledge distillation (Zhang et al., 2022b; Zhu et al., 2023), or projections (Zhang et al., 2022b) instead of training LVMs directly on point cloud data from scratch.
In this study, we aim to achieve two main goals. Firstly, we seek to perform 3D point cloud segmentation by adapting an LVM initially designed for 2D computer vision. This adaptation is facilitated through a 3D-to-2D projection method, eliminating the need for pre-training or fine-tuning. Secondly, we analyse the usefulness of this approach by conducting experiments on synthetic data to assess its ability to bridge the domain gap between synthetic and real-world datasets.

Language Vision Models for Point Clouds
Thanks to the high impact of LVMs on the field of computer vision, many research efforts have been made to incorporate LVMs into 3D data processing. However, due to the significant amount of training time and data required to construct LVMs, most progress has been made only in 2D data. These drawbacks are more pronounced when it comes to 3D data, which are much larger in size and require longer training times. Therefore, many works have started investigating ways to utilise existing 2D LVMs for 3D data understanding rather than building LVMs specifically for 3D data, such as point clouds, meshes and voxels.
One method to enable 3D recognition with CLIP-based models is by projecting point clouds onto 2D images. PointCLIP (Zhang et al., 2022b) enables cross-modality zero-shot recognition on point clouds without prior 3D training by leveraging multi-view simple projection to transfer pre-trained 2D knowledge from CLIP to the 3D domain. With a lightweight inter-view adapter under few-shot settings, PointCLIP enhances classification performance. However, it still exhibits low performance in zero-shot transfer scenarios. PointCLIP-V2 (Zhu et al., 2023) addresses these shortcomings by replacing the multi-view simple projection with a realistic projection and employing prompt engineering with GPT-3 (Brown et al., 2020), enhancing performance not only in classification but also in part segmentation and object detection.
An alternative method involves aligning features from point cloud encoders with CLIP representations. For instance, ULIP (Xue et al., 2023) introduces a method that trains on triplets consisting of point clouds, images, and texts, using a limited set of synthesised triplets to align with the CLIP image-text space. Liu et al. (2024) enhance 3D representations by aggregating multiple 3D datasets and refine noisy text descriptions through the utilisation of a powerful large language model, GPT-4 (Achiam et al., 2023).
Success in point cloud classification leads to more complex tasks such as object detection and segmentation. OV-3DET (Lu et al., 2023) introduces a novel de-biased triplet cross-modal contrastive learning to connect image, point-cloud, and text modalities for improved performance with LVMs. OpenMask3D (Takmaz et al., 2023) and OpenScene (Peng et al., 2023) enable open-vocabulary 3D scene understanding utilising the pre-trained image-text embedding model, CLIP. These models, based on multi-view scene understanding, are more complex and error-prone as they require precise camera calibration in data integration. Moreover, they are limited to indoor scenes, raising uncertainty about their robustness for outdoor scenes.

Building Segmentation from Point Clouds
Building segmentation tasks in recent studies are often separated based on the method used to collect point clouds. A recent study by Gamal et al. (2021) utilised PointNet (Qi et al., 2017a) and the Dynamic Graph Convolutional Neural Network (DGCNN) (Wang et al., 2019) to detect buildings in cities of Indonesia. They collected the necessary LiDAR data using unmanned aerial vehicles, resulting in a promising approach to building detection. The Damage-Sensitive Network (DS-Net) (Xiu et al., 2023) was introduced as a specialised method for identifying collapsed buildings with the Laplacian Unit (LU).

METHODOLOGY
In our work we use 2D rendered views of the 3D point clouds to perform scene understanding (as reviewed in Griffiths and Boehm (2019a)). This projection and back-projection process follows work by Sanchez Castillo et al. (2021), Karara et al. (2021) and Tabkha et al. (2019). Building on the 2D projection, we approach our task of building segmentation using the combination of two 2D foundation models: Segment-Anything Model (SAM) (Kirillov et al., 2023) and Grounding DINO (Liu et al., 2023). Combining these two models was introduced by IDEA research as Grounded Segment-Anything (or Grounded SAM) (Ren et al., 2024).
Our indirect building segmentation method consists of 4 steps as shown in Figure 2. Each step will be introduced in the following sections.

Spherical Image Generation
In order to generate a spherical image from the point clouds, we first normalise each point P(x, y, z) by the reference point O(x0, y0, z0) as in the following equation:

$$(x', y', z') = (x - x_0,\; y - y_0,\; z - z_0) \tag{1}$$

We set the reference coordinate in the middle of the road or at an intersection.
In the next step, we compute the spherical coordinates (θ, ϕ) for all normalised points. Figure 3 shows a visual explanation of the spherical coordinates, where r is the radius from the origin to the point, θ is the azimuthal angle, and ϕ is the polar angle. Each coordinate is derived by the following equations:

$$r = \sqrt{x'^2 + y'^2 + z'^2}, \qquad \theta = \operatorname{atan2}(y', x'), \qquad \phi = \arccos\left(\frac{z'}{r}\right) \tag{2}$$

where x′, y′, z′ are the normalised coordinates from equation 1.
Lastly, we project these spherical coordinates to the 2D plane (image), which is also called an equirectangular projection. Each x and y coordinate in the image is determined by the following equations:

$$x = \frac{\theta + \pi}{2\pi}\, W, \qquad y = \frac{\phi}{\pi}\, H \tag{3}$$

where W and H represent the width and height of the image to project respectively.
Step 1 in Figure 2 shows the result of the spherical image generation procedure of the input 3D point clouds.This figure represents a spherical image generated from a reference point located at the centre of an intersection on the road.
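The projection described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function names `spherical_project` and `render_spherical_image` are hypothetical, the azimuth is taken as atan2(y′, x′) shifted into [0, 2π), the polar angle is measured from the +z axis, and the image is painted with the colour of the nearest point in each pixel (the text does not state which point supplies the pixel colour).

```python
import numpy as np


def spherical_project(points, origin, width, height):
    """Map 3D points to equirectangular pixel columns/rows (assumed conventions)."""
    # Normalise by the reference point O.
    p = np.asarray(points, dtype=float) - np.asarray(origin, dtype=float)
    r = np.linalg.norm(p, axis=1)
    theta = np.arctan2(p[:, 1], p[:, 0])        # azimuthal angle in (-pi, pi]
    safe_r = np.where(r > 0, r, 1.0)            # avoid division by zero at the origin
    phi = np.arccos(np.clip(p[:, 2] / safe_r, -1.0, 1.0))  # polar angle in [0, pi]
    col = np.floor((theta + np.pi) / (2 * np.pi) * width).astype(int)
    row = np.floor(phi / np.pi * height).astype(int)
    # theta == pi or phi == pi falls exactly on the far edge; clamp into the image.
    col = np.clip(col, 0, width - 1)
    row = np.clip(row, 0, height - 1)
    return col, row, r


def render_spherical_image(points, colours, origin, width, height):
    """Return an RGB image and the point-to-pixel mapping (flat pixel index per point)."""
    col, row, r = spherical_project(points, origin, width, height)
    pixel_index = row * width + col
    # Sort by pixel, then by distance, and keep the first (nearest) point per pixel.
    order = np.lexsort((r, pixel_index))
    sorted_pixels = pixel_index[order]
    is_first = np.ones(len(order), dtype=bool)
    is_first[1:] = sorted_pixels[1:] != sorted_pixels[:-1]
    nearest = order[is_first]
    image = np.zeros((height * width, 3), dtype=np.uint8)
    image[pixel_index[nearest]] = np.asarray(colours, dtype=np.uint8)[nearest]
    return image.reshape(height, width, 3), pixel_index
```

The returned `pixel_index` array plays the role of the saved mapping: it records, for every input point, the pixel it fell into, so that a 2D result can later be carried back to 3D.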

Building Bounding-Box Detection
For building detection, we use an open-set object detection model, Grounding DINO. Grounding DINO is a model that merges the Transformer-based detector DINO with grounded pre-training to detect a wide range of objects using language inputs (Liu et al., 2023). This model integrates language for open-set detection, dividing the detection process into a feature enhancer, language-guided query selection, and a cross-modality decoder (Liu et al., 2023). We apply the model to our task with the text prompt "buildings" to obtain bounding boxes of the buildings in the generated image.
Step 2 of Figure 2 shows an example of the results of building detection using Grounding DINO.
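Before the detected boxes can be used as segmentation prompts, they have to be expressed in pixel coordinates of the spherical image. The sketch below assumes that the detector returns boxes as normalised (centre x, centre y, width, height) values together with a confidence score per box, as the reference Grounding DINO implementation does to our knowledge. The function name `boxes_to_pixels` and the default threshold of 0.35 are illustrative assumptions, not values taken from our experiments.

```python
import numpy as np


def boxes_to_pixels(boxes_cxcywh, scores, width, height, score_threshold=0.35):
    """Convert normalised (cx, cy, w, h) boxes to pixel (x0, y0, x1, y1) boxes."""
    boxes = np.asarray(boxes_cxcywh, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    keep = scores >= score_threshold             # drop low-confidence detections
    cx, cy, w, h = (boxes[keep] * [width, height, width, height]).T
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    # Clip to the image so that boxes touching the border stay valid.
    xyxy[:, [0, 2]] = np.clip(xyxy[:, [0, 2]], 0, width)
    xyxy[:, [1, 3]] = np.clip(xyxy[:, [1, 3]], 0, height)
    return xyxy, scores[keep]
```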

Building Segmentation
We proceed to use SAM for the segmentation step. SAM is a foundation model for image segmentation that accepts input prompts such as points, boxes, or masks (Kirillov et al., 2023). It also explores zero-shot segmentation from free-form text (Kirillov et al., 2023). However, we specifically focus on using the segmentation capabilities of SAM with bounding box inputs. For each bounding box detected in the previous step, we run SAM to segment all the pixels that belong to the building category. In Step 3 of Figure 2, blue-coloured areas represent the segmented pixels corresponding to the buildings.
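Because SAM is run once per bounding box, the per-box results have to be combined into a single building mask for the image. A minimal sketch of that step follows, assuming each SAM call yields one boolean mask of the full image size. The helper names `merge_box_masks` and `overlay_mask` are hypothetical, and the blue overlay only mimics the visualisation of Step 3.

```python
import numpy as np


def merge_box_masks(masks, image_shape):
    """Union of per-box boolean masks into one building mask of shape (H, W)."""
    building = np.zeros(image_shape, dtype=bool)
    for mask in masks:
        building |= np.asarray(mask, dtype=bool)
    return building


def overlay_mask(image, building, colour=(0, 0, 255), alpha=0.5):
    """Blend a colour over the masked pixels of an RGB image for inspection."""
    out = image.astype(float)
    out[building] = (1 - alpha) * out[building] + alpha * np.asarray(colour, dtype=float)
    return out.round().astype(np.uint8)
```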

Back-Projection
After segmenting the building image, we perform a back-projection of the segmented image onto the 3D point cloud. To achieve this, we save the point-to-pixel mapping during the generation of the spherical image. Later, we load this mapping and replace the RGB data of the points in the initial point cloud that correspond to the segmented pixels.
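The back-projection amounts to a lookup through the saved mapping. The sketch below assumes the mapping is stored as one flat pixel index (row × width + column) per point. The name `back_project` and the yellow highlight colour are illustrative choices, not a description of the exact implementation.

```python
import numpy as np


def back_project(pixel_index, building_mask, colours, highlight=(255, 255, 0)):
    """Label every point whose pixel is segmented as building and recolour it."""
    flat_mask = np.asarray(building_mask, dtype=bool).ravel()
    # Every point that maps to a building pixel inherits the building label.
    is_building = flat_mask[np.asarray(pixel_index)]
    recoloured = np.array(colours, dtype=np.uint8, copy=True)
    recoloured[is_building] = highlight
    return is_building, recoloured
```

Note that the lookup is purely per pixel: all points that share a segmented pixel receive the building label, whatever their distance from the reference point.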

EXPERIMENTS
We use an NVIDIA GeForce RTX 3070 with 8 GB of memory for running Grounding DINO and SAM.

Performance Metrics
In order to measure model performance, we evaluate a range of commonly used metrics including Accuracy, Precision, Recall, F1 Score, and Intersection over Union (IoU). IoU is typically used specifically for segmentation tasks. For detailed formulas refer to e.g. Goodfellow et al. (2016); Terven et al. (2023).
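For reference, the five metrics can be computed from per-point binary labels (building versus non-building) as in the following sketch. It uses the standard confusion-matrix definitions. The function name `binary_metrics` is hypothetical, and returning 0.0 for an empty denominator is a convention chosen here.

```python
import numpy as np


def binary_metrics(pred, truth):
    """Accuracy, precision, recall, F1 and IoU for binary per-point labels."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))      # building predicted, building in truth
    fp = int(np.sum(pred & ~truth))     # building predicted, other class in truth
    fn = int(np.sum(~pred & truth))     # building missed
    tn = int(np.sum(~pred & ~truth))    # correctly rejected

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "iou": ratio(tp, tp + fp + fn),
    }
```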

SynthCity Dataset
SynthCity (Griffiths and Boehm, 2019b) is an open-source dataset that represents an entire city in the form of a large-scale synthetic point cloud. The dataset includes 9 sub-areas and 9 label categories. SynthCity is specifically designed for pre-training deep learning models, enabling generalisation and expansion of their usage to real-world data. The dataset comprises 367 million completely labelled points with RGB features, making it a valuable resource for researchers and practitioners in this field. Each point is labelled with one of the 9 categories: Building, Car, Natural Ground, Ground, Pole Like, Road, Street Furniture, Tree, and Pavement.
To avoid reaching computational resource limits, each sub-area in SynthCity is further divided into 4 sub-sub areas. Our dataset excludes 2 sub-sub areas of Area 7 and 1 sub-sub area of Area 9 as they contain neither roads nor buildings. As mentioned in the Spherical Image Generation section, we arbitrarily selected centre points to generate spherical images in each sub-sub area. When making these selections, we imposed the condition that the centre point must lie on a road. Additionally, if the area includes intersections, we placed the centre point at the centre of an intersection.
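The geometry of the subdivision is not prescribed above. One simple possibility, shown here purely as an assumption, is a 2 × 2 split of each sub-area at the midpoint of its horizontal bounding box. The function name `split_into_quadrants` is hypothetical.

```python
import numpy as np


def split_into_quadrants(points):
    """Return four index arrays, one per quadrant of the xy bounding box."""
    pts = np.asarray(points, dtype=float)
    xy = pts[:, :2]
    mid = (xy.min(axis=0) + xy.max(axis=0)) / 2   # centre of the bounding box
    east = xy[:, 0] >= mid[0]
    north = xy[:, 1] >= mid[1]
    # Quadrant id: 0 = south-west, 1 = south-east, 2 = north-west, 3 = north-east.
    tile_id = east.astype(int) + 2 * north.astype(int)
    return [np.flatnonzero(tile_id == k) for k in range(4)]
```

Returning index arrays rather than copies keeps the labels and colours of each tile addressable through the same indices.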

EXPERIMENTAL RESULTS
We present the comprehensive calculations of the error statistics for all sub-sub areas from Area 1 to 9.

Quantitative Evaluation
For the proper assessment of our approach, we compute five metrics. Table 1 displays an overall Accuracy of 0.96, IoU of 0.85, Precision of 0.92, Recall of 0.91, and F1 Score of 0.92 across all areas. Since our method has not been benchmarked against other approaches, we cannot claim it to be the superior choice. Nevertheless, given its highly encouraging outcomes, we wish to emphasise the versatility and resilience of LVMs when applied to 3D synthetic datasets using spherical images.

Qualitative Evaluation
Figure 1 (a) illustrates the complete raw SynthCity dataset, while (b) represents the building prediction for the entire dataset. Our method performs well not only in the quantitative analysis but also qualitatively, as seen in the full dataset view.
Figure 4 displays a detailed analysis of the different types of buildings. (a) and (b) highlight the detection of high-rise and low-rise modern buildings, respectively. The results demonstrate the high accuracy of our method in detecting buildings of different heights. Furthermore, (c) illustrates that our method can detect not only modern-style buildings but also old European-style ones, indicating the robustness and versatility of our approach.

False Positive Analysis
We have conducted a false positive analysis of our method and identified a pattern of misclassified points. The analysis in Table 2 revealed that flat surfaces such as roads, grounds, and pavements often get incorrectly classified as buildings because the building label passes through to points lying behind the buildings during the back-projection process. This misclassification is not limited to flat surfaces. For instance, tall and large-scale objects, such as trees, can also be labelled as buildings. Because trees typically stand close together, some points predicted as buildings might be projected close to points belonging to the tree category.
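A per-category breakdown of the false positives can be obtained as sketched below. The sketch assumes that the ratio for a category is the share of all false positive points whose ground-truth label is that category. The function name `false_positive_share` is hypothetical.

```python
import numpy as np


def false_positive_share(pred_building, truth_labels, building_label="Building"):
    """Share of false positive points per ground-truth category."""
    pred = np.asarray(pred_building, dtype=bool)
    truth = np.asarray(truth_labels)
    # False positives: predicted as building although the truth is another class.
    fp_truth = truth[pred & (truth != building_label)]
    if fp_truth.size == 0:
        return {}
    names, counts = np.unique(fp_truth, return_counts=True)
    return {str(n): int(c) / int(fp_truth.size) for n, c in zip(names, counts)}
```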
From the perspective of visualisation, as shown in Figure 5, false positives occur when we back-project from the mapping of the spherical image. In (a), the segmented area is incorrectly back-projected to the ground behind the buildings, while in (b) and (c), it is incorrectly back-projected to roads, pavements, and trees. They all share a common phenomenon: all false positives are generated behind the buildings. These results arise because, for each pixel of the spherical image, our method can save more than one point, and these points may carry different labels. As in Figure 6 (a), while we generate the point-pixel mapping, multiple points with different labels can be saved in one pixel. This happens for two reasons. First, when we generate spherical coordinates, points with the same θ (azimuthal angle) and ϕ (polar angle) but different r (distance from the origin) are mapped to the same pixel. This is because the equirectangular projection only considers the angular components of the spherical coordinates and not the radial distance, as in Figure 6 (b). Secondly, although the SAM model segments the buildings, some contour pixels are mistakenly classified as building. All points that fall within these incorrectly segmented pixels are back-projected as buildings.
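The extent of this overlap can be measured directly on the point-pixel mapping. The following sketch lists the pixels that hold points of more than one ground-truth label, assuming the mapping is stored as one flat pixel index per point. The function name `mixed_label_pixels` is hypothetical.

```python
import numpy as np


def mixed_label_pixels(pixel_index, labels):
    """Return the pixel indices that contain points with more than one label."""
    pixel_index = np.asarray(pixel_index)
    labels = np.asarray(labels)
    # Encode labels as integers so that (pixel, label) pairs can be deduplicated.
    _, label_ids = np.unique(labels, return_inverse=True)
    pairs = np.unique(np.stack([pixel_index, label_ids], axis=1), axis=0)
    # After deduplication, each pixel appears once per distinct label it holds.
    pixels, n_labels = np.unique(pairs[:, 0], return_counts=True)
    return pixels[n_labels > 1]
```

Pixels returned by this function are exactly those in which a building segmentation would also relabel points of another category during back-projection.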

CONCLUSION
In our study, we experimented with a training-free zero-shot transfer approach by adopting spherical image generation using the 3D-to-2D projection method for building segmentation tasks, applying an LVM. We integrated the powerful open-vocabulary segmentation model, Grounded SAM, which combines the open-vocabulary object detection model called Grounding DINO with the segmentation-enhanced Segment-Anything Model (SAM).
Instead of the commonly used multi-view projection, we utilised a 360-degree spherical projection, which avoids the complexity of handling the various orientations associated with multi-view projection, offering a relatively simple and intuitive approach. Despite the shape distortion introduced by spherical projection, Grounded SAM overcame building shape distortions and confirmed the effectiveness of LVMs in accurately detecting buildings. Qualitative analysis revealed the successful detection of buildings regardless of their construction era or height. Quantitatively, the model exhibited excellent performance across all metrics, showing the robustness and versatility of LVMs. However, a limitation of spherical image generation is the potential for false positives behind buildings, because a single pixel can hold several points with different labels. Despite this limitation, the use of an LVM for zero-shot transfer removed the need for compute-intensive training and operated effectively even on synthesised datasets, confirming its domain transferability. These promising results could potentially address the lack of point cloud training datasets and the cost of compute-intensive training.

Figure 1. This figure illustrates the zero-shot building segmentation of the "SynthCity" dataset. (a) displays the original full point clouds used as input, while (b) shows the building detection results highlighted in yellow.

Figure 2. The diagram illustrates the steps of the methodological workflow. The figures on the right side are examples of each step.

Figure 3. Principles of Spherical Coordinate Generation

Figure 4. The figure shows the results of the segmentation of buildings using zero-shot prediction across different types of buildings. The results indicate that this method is highly effective in achieving accurate building segmentation regardless of building type.

Figure 6. The figure displays the process of mapping that leads to the occurrence of false positives.
. Goo and Z. Zeng are supported by the Engineering and Physical Sciences Research Council through an industrial CASE studentship with Ordnance Survey (Grant number EP/X524840/1).

Table 1. Quantitative Analysis of 9 Sub-Areas

Table 2. False Positive ratio for each category