COMBINING IMAGE AND POINT CLOUD SEGMENTATION TO IMPROVE HERITAGE UNDERSTANDING

Current 2D and 3D semantic segmentation frameworks are developed and trained on specific benchmark datasets, often rich in synthetic data, and when they are applied to complex, real-world heritage scenarios they offer much lower accuracy than expected. In this work, we present and demonstrate early and late fusion methods for semantic segmentation in cultural heritage applications. We rely on image datasets, point clouds and BIM models. The early fusion utilizes multi-view rendering to generate RGBD imagery of the scene. In contrast, the late fusion approach merges image-based segmentation with a Point Transformer applied to point clouds. Two scenarios are considered, and inference results show that predictions are primarily influenced by whether the scene has a predominantly geometric or texture-based signature, underscoring the necessity of fusion methods.


INTRODUCTION
The semantic segmentation of constructions encompasses the segmentation of primary, secondary, and auxiliary building classes (Armeni et al., 2017). This segmentation is a crucial intermediate step for detecting different instances of elements within buildings, a requirement for various tasks such as scan-to-BIM procedures and building enrichment pipelines, among others (Croce et al., 2023). Prior to 2020, traditional machine learning methods, along with specific features, were the preferred choice. However, the state of the art has now completely shifted towards deep learning methods (Bello et al., 2020, Guo et al., 2021). These deep learning techniques generally offer improved generalization and reduce the need for feature engineering, such as radiometric feature extraction. Nonetheless, they demand a significantly larger amount of training data to achieve similar detection rates.
Currently, most semantic segmentation approaches still focus on a single modality, either imagery or point cloud data (Coudron et al., 2020). This bias toward single-modality methods is largely due to benchmark datasets that predominantly promote such competitions (Armeni et al., 2017) or to limitations in processing methods. However, these single-modality approaches fall short of market-ready detection rates, particularly when dealing with heritage objects that exhibit intricate geometries and textures. For example, identifying different types of columns in a heavily eroded setting can greatly benefit from both visual and geometric interpretations, even when the latter might introduce noise. Multi-modal data fusion in machine learning is a growing field (Townend et al., 2024), and some recent works have also started to introduce background knowledge into the learning pipelines of neural networks (Grilli et al., 2023).
In our work, we propose a framework that integrates image and point cloud segmentation techniques for cultural heritage building elements. To achieve this, we have developed an integration pipeline that combines state-of-the-art methods for semantic segmentation of both images and point clouds. In summary, our contributions include:
1. The theoretical framework and implementation for early and late image and point cloud semantic segmentation.
2. An implementation for automated image and point cloud training sample production.
3. An empirical study on two heritage assets to compare the proposed joint semantic segmentation.

RELATED WORK
Heritage Semantic Segmentation - Researchers have been exploring the application of machine learning techniques for the semantic enrichment of 3D point clouds in the cultural heritage field for some time now (Fiorucci et al., 2020, Yang et al., 2023). Supervised machine learning methods have primarily focused on mapping various materials, building techniques, and deterioration phenomena. Leveraging the geometric characteristics of 3D data (Weinmann et al., 2015), these methods utilize extracted geometric features, and sometimes sensor-based features, to train machine learning algorithms to perform their tasks (Grilli et al., 2018, Grilli and Remondino, 2020, Croce et al., 2021). Despite the potential of these approaches, research interest in the field of cultural heritage has also increasingly shifted towards deep learning methods due to their noteworthy improvements in the semantic enrichment of 3D point clouds (Pierdicca et al., 2020, Matrone et al., 2020).
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-2/W4-2024. 10th Intl. Workshop 3D-ARCH "3D Virtual Reconstruction and Visualization of Complex Architectures", 21-23 February 2024, Siena, Italy. This contribution has been peer-reviewed. https://doi.org/10.5194/isprs-archives-XLVIII-2-W4-2024-49-2024 | © Author(s) 2024. CC BY 4.0 License.
Joint point cloud and image segmentation - Joint point cloud and image segmentation is a popular topic in deep learning fusion approaches (Qi et al., 2021). Both data fusion (early) and method fusion (late) are predominantly pursued in academia. Early fusion methods, for instance, involve rendering 3D information as multi-view 2D images with an additional depth channel (RGBD), which can then be processed by standard 2D convolutions (MVCNN) (Cui et al., 2022). Alternatively, 2D images can be rendered as a 3D graph, tree, or raster point cloud representation (Lu et al., 2022). However, these 2D methods often lose some 3D geometric context and struggle with per-point label prediction. Recent advancements in MVCNN networks include ShapeConv (Cao et al., 2021) and FPS-Net (Xiao et al., 2021). On the other hand, late fusion combines the outputs of multiple networks and averages the results, for example, by integrating Point Transformers (Lu et al., 2022) with purely image-based networks. The advantage here is that each modality can be trained separately, leveraging numerous available benchmarks. However, the averaging of results is typically suboptimal and does not consider the quality of the geometric/texture signature at that location. Beyond early and late fusion, there are hybrid solutions that combine the strengths of both approaches (Zhang et al., 2021). These typically involve the use of intermediate fusion blocks that enable the parallelization of different networks, merging them at strategic points.

Training data production
The successful integration of Deep Learning (DL) methods into heritage projects is fundamentally linked to the automated generation of training and testing data. In our work, we aim for a seamless transition between IFC BIM models, geolocated images produced by a photogrammetric pipeline, and the combined point cloud resulting from both photogrammetry and 3D terrestrial laser scanning (Fig. 1). To effectively train semantic segmentation models, it is crucial to combine information from these three sources to generate the necessary training data. First, there is the choice of the initial modality to which training labels are assigned. Generally, labelling 3D objects is more efficient than labelling 2D images, as images in photogrammetric pipelines typically have over 60% overlap and the same element would therefore be labelled repeatedly. Among the 3D formats, the IFC model is much more efficient to label due to the limited number of elements. Additionally, IFC models already contain metadata that can be used for the production of training data. For instance, in the first test case (Section 4.1), the IFC model comprises only 214 elements across four types of building elements, whereas the dataset has a point cloud of 56 million points and 894 24MP images.
3D Point Cloud Annotation - In the initial phase of training data creation, the BIM information is therefore associated with the point cloud data. The annotation of BIM information onto the point cloud data, denoted as P, relies on a nearest neighbours variant involving a uniformly sampled BIM point cloud, represented as Q. Given the substantial abstractions present in the BIM, the criterion for assigning information is determined by the difference in normals between a source point p_i ∈ P and a set of neighbouring BIM points Q_j ⊂ Q, as expressed in Equation 1.
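Based on the description in this section (candidates within a Euclidean distance threshold t_d, best fit chosen by maximizing the dot product of the normals), the assignment can plausibly be written as follows. This is a sketch consistent with the text rather than the exact published notation; the symbols \mathbf{n}(\cdot) for the unit normal and \ell(\cdot) for the class label are assumptions introduced here.

```latex
\begin{align}
Q_j &= \{\, q \in Q \;:\; \lVert p_i - q \rVert_2 \le t_d \,\}, \\
q_j &= \operatorname*{arg\,max}_{q \in Q_j} \; \mathbf{n}(p_i) \cdot \mathbf{n}(q), \\
\ell(p_i) &= \ell(q_j).
\end{align}
```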
In this context, Q represents the joint visibility point cloud, which is obtained by sampling points from the BIM objects. However, points q_j that are situated within neighbouring objects are removed, up to a specified threshold.
The sets Q_j consist of points that are in close proximity to every p_i, determined by the Euclidean distance threshold t_d. To find the best fit q_j for each p_i, the dot product between the two normals is maximized. As shown in Fig. 2, the normal filtering improves the fit between the BIM and the point cloud annotation without significantly increasing computational complexity. Subsequently, the class information of the object that q_j belongs to is transferred to p_i as an additional point label.
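The normal-filtered label transfer can be sketched in a few lines of Python. This is a minimal brute-force illustration under stated assumptions, not the implementation used in this work: the function name `transfer_labels`, the toy coordinates and the threshold value are hypothetical, normals are assumed to be unit vectors, and a spatial index would replace the inner loop for a cloud of millions of points.

```python
import math

def transfer_labels(points, point_normals, bim_points, bim_normals, bim_labels, t_d):
    """For each scan point p_i, pick the BIM sample within t_d whose normal agrees best."""
    labels = []
    for p, n_p in zip(points, point_normals):
        best_label, best_dot = None, -math.inf
        for q, n_q, label in zip(bim_points, bim_normals, bim_labels):
            if math.dist(p, q) > t_d:
                continue  # q is outside the neighbourhood Q_j of p_i
            dot = sum(a * b for a, b in zip(n_p, n_q))
            if dot > best_dot:
                best_dot, best_label = dot, label
        labels.append(best_label)  # None when no BIM sample lies within t_d
    return labels

# Toy corner: a floor sample (normal +z) and a wall sample (normal +x).
bim_points = [(0.08, 0.0, 0.0), (0.0, 0.0, 0.02)]
bim_normals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
bim_labels = ["floor", "wall"]

# The first scan point lies closer to the wall sample but has a floor-like normal.
scan_points = [(0.03, 0.0, 0.01), (5.0, 5.0, 5.0)]
scan_normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
result = transfer_labels(scan_points, scan_normals, bim_points, bim_normals, bim_labels, t_d=0.1)
```

In this toy case a plain nearest-neighbour search would assign the first point to the wall, whereas the normal criterion assigns it to the floor, which mirrors the edge behaviour illustrated in Fig. 2.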
2D Image Annotation - The IFC or point cloud data are used to automatically label the imagery. Operating on the full imagery has a major advantage, as it offers a significantly higher level of detail (12 to 40 megapixels, resulting in an average ground sampling distance (GSD) of 0.002 m) than the point cloud (average density of about 0.005 m). The training data for the image classification is automatically derived from the manually annotated point cloud. Firstly, the images are undistorted using OpenCV, utilizing the intrinsic camera matrices K for each image. Subsequently, each image is subdivided into pixel regions in accordance with the requirements of the image classification model. Next, a set of depth maps, denoted as D, is generated by performing a dense ray tracing of the photogrammetric point cloud for each image, utilizing the extrinsic camera matrices M for each image (Fig. 3 left, Equation 2).
However, raycasting on the original point cloud is not ideal due to its limited density. Rays tend to pass in between points, resulting in labels for objects situated behind the initial layer of points (Fig. 3 middle). Instead, we adopt an alternative approach by generating a voxel mesh from the octree representation of the point cloud. By enhancing the voxel traversal mechanisms available in Open3D, we can create a dense mesh with the appropriate labels, which is considerably better suited to raycasting (Fig. 3 right and Fig. 4).
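The occlusion logic behind the depth maps D and the label masks can be illustrated with a small sketch. It is a deliberate simplification and not the pipeline described above: instead of raycasting against a voxel mesh in Open3D, it projects labelled points into the image with a pinhole model and a z-buffer, so it shows how the nearest surface wins per pixel but does not close the gaps between sparse points. The function name `project_labels`, the world-to-camera convention x_cam = R x + t and the toy camera values are assumptions for illustration.

```python
def project_labels(points, labels, K, R, t, width, height):
    """Render a label mask and a depth map by projecting points with a z-buffer."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    depth = [[float("inf")] * width for _ in range(height)]
    mask = [[None] * width for _ in range(height)]
    for p, label in zip(points, labels):
        # World to camera coordinates: x_cam = R @ p + t
        xc, yc, zc = (sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))
        if zc <= 0:
            continue  # point lies behind the camera
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        # Keep only the closest point per pixel so occluded labels are discarded
        if 0 <= u < width and 0 <= v < height and zc < depth[v][u]:
            depth[v][u] = zc
            mask[v][u] = label
    return mask, depth

K = [[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]]  # toy intrinsics
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]      # identity rotation
t = [0.0, 0.0, 0.0]

points = [(0.0, 0.0, 2.0), (0.0, 0.0, 5.0), (0.02, 0.0, 2.0), (0.0, 0.0, -1.0)]
labels = ["column", "wall", "floor", "ignored"]
mask, depth = project_labels(points, labels, K, R, t, width=5, height=5)
```

Here the wall point at 5 m falls on the same pixel as the column point at 2 m and is discarded, which is the behaviour the depth maps are meant to guarantee.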

Semantic segmentation
In the early fusion, images and point clouds are merged into RGBD images: this reduces the complexity of geometric reasoning but allows for the joint semantic segmentation of image and point cloud modalities.For late fusion, we first conduct an image-based semantic segmentation: the results of this segmentation are then associated with the point cloud data.Subsequently, the final classification is determined through a second semantic segmentation step, which is based on features extracted from the point cloud.
Point clouds - For the point cloud segmentation, a set of covariance features is computed for P, including linearity, planarity, verticality, and others as proposed in (Niemeyer et al., 2014). These features, together with the results of the 2D segmentation, are then passed to a neural network as extra channels of input data. For the tests, we employed the Point Transformer architecture (Zhao et al., 2021), a deep learning method that relies on the self-attention operator for essential tasks in scene understanding. In the Point Transformer, the self-attention mechanism is applied locally, allowing the network to scale to large scenes with millions of points. The training process was conducted in a single step, with class balancing techniques applied to account for low class presence. The resulting labels, Y, can then be directly applied to the point cloud P.
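As an illustration of the covariance features, the sketch below computes linearity, planarity and sphericity for one local neighbourhood from the eigenvalues of its 3x3 covariance matrix. The formulas are the commonly used eigenvalue-based definitions and are an assumption here, since this section does not spell them out; verticality is omitted because it additionally requires the normal eigenvector. The function names are hypothetical, and the eigenvalues are obtained with the closed-form trigonometric method for symmetric 3x3 matrices so that only the standard library is needed.

```python
import math

def covariance_matrix(points):
    """3x3 covariance matrix of a list of (x, y, z) tuples."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
             for b in range(3)] for a in range(3)]

def symmetric_eigenvalues(A):
    """Eigenvalues of a symmetric 3x3 matrix in descending order."""
    p1 = A[0][1] ** 2 + A[0][2] ** 2 + A[1][2] ** 2
    if p1 == 0.0:  # matrix is already diagonal
        return sorted((A[0][0], A[1][1], A[2][2]), reverse=True)
    q = (A[0][0] + A[1][1] + A[2][2]) / 3.0
    p2 = sum((A[k][k] - q) ** 2 for k in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    B = [[(A[r][c] - (q if r == c else 0.0)) / p for c in range(3)] for r in range(3)]
    det = (B[0][0] * (B[1][1] * B[2][2] - B[1][2] * B[2][1])
           - B[0][1] * (B[1][0] * B[2][2] - B[1][2] * B[2][0])
           + B[0][2] * (B[1][0] * B[2][1] - B[1][1] * B[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3
    return [e1, e2, e3]

def covariance_features(points):
    """Linearity, planarity and sphericity of one neighbourhood."""
    # Clamp tiny negative values caused by floating-point round-off
    l1, l2, l3 = (max(e, 0.0) for e in symmetric_eigenvalues(covariance_matrix(points)))
    if l1 == 0.0:
        return {"linearity": 0.0, "planarity": 0.0, "sphericity": 0.0}
    return {"linearity": (l1 - l2) / l1,
            "planarity": (l2 - l3) / l1,
            "sphericity": l3 / l1}

plane = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
line = [(float(s), float(s), 0.0) for s in range(5)]
plane_features = covariance_features(plane)
line_features = covariance_features(line)
```

A flat patch yields a planarity close to 1 and a straight edge yields a linearity close to 1, which is why such features help separate wall surfaces from elongated elements.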
RGBD - In the early fusion of image and point cloud semantic segmentation, we project the 3D coordinate information onto the image depth channel to form RGBD imagery using the aforementioned techniques. As it is challenging to unify depth maps based on their respective depths, we opt for producing HHA imagery, as proposed in (Gupta et al., 2014). This format incorporates the depth and viewing direction into a uniform depth format, which is more comprehensible than conventional depth maps, albeit computationally demanding to generate, as shown in Fig. 5.
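To make the HHA encoding more concrete, the following sketch computes the three channels of (Gupta et al., 2014) for a single pixel: horizontal disparity, height above ground and the angle between the surface normal and the gravity direction. It is a simplified illustration under stated assumptions: disparity is taken as the inverse depth without camera baseline or focal length, the up direction is a known unit vector, the ground height is given, and the rescaling of each channel to an 8-bit range is left out. The function name `hha_pixel` and its parameters are hypothetical.

```python
import math

def hha_pixel(depth, point, normal, up=(0.0, 0.0, 1.0), ground_height=0.0):
    """Return (disparity, height above ground, angle to gravity in degrees) for one pixel."""
    disparity = 1.0 / depth  # proportional to horizontal disparity
    height = sum(p * u for p, u in zip(point, up)) - ground_height
    norm = math.sqrt(sum(c * c for c in normal))
    cos_angle = sum(n * u for n, u in zip(normal, up)) / norm
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return disparity, height, angle

# A wall pixel 2 m from the camera, 3 m above the ground, with a horizontal normal
wall = hha_pixel(2.0, (1.0, 0.0, 3.0), (1.0, 0.0, 0.0))
# A floor pixel with a normal pointing straight up
floor = hha_pixel(4.0, (0.5, 0.5, 0.0), (0.0, 0.0, 2.0))
```

The per-pixel normal estimation and the gravity alignment are the costly parts of a full HHA computation, and they are not shown in this sketch.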
For the semantic segmentation itself, we employ ShapeConv combined with Deeplabv3+ (Chen et al., 2018) and a ResNet-101 backbone (He et al., 2016). ShapeConv is a model-agnostic convolutional layer that can be easily integrated into existing networks, focusing on jointly learning shape and base components. In the original paper, ShapeConv significantly improved the generalization and performance of the base networks on known datasets such as SiD, NYUv2-40, and SUN, as shown in Table 1.

EXPERIMENTS
The early and late fusion are compared against traditional networks that process only a single modality.Specifically, two photogrammetric datasets, each with a unique signature, are selected for these tests.

Dataset I: Paestum
The first test is a photogrammetric reconstruction of the Greek Temple of Neptune (ca. 25 m x 60 m x 15 m), located in Paestum, Italy (Fiorillo et al., 2013). The dataset comprises 894 geolocated images captured by hand and by UAV, resulting in a point cloud of ca. 56 million points. Although the temple is constructed entirely of one material, it features 10 different building techniques (Fig. 6). Consequently, the temple exhibits a predominantly geometric signature rather than a distinct texture signature. Each method was trained and validated on 25% of the data for 300 epochs, and inference was performed on the entire dataset. Table 2 presents the data distribution in the dataset, which is fairly unbalanced, as typically happens in such datasets. Notably, the distributions are similar for the point and pixel distributions, except for classes 4 (6.9% vs 0.2%) and 7 (4.5% vs 0.1%), which are underrepresented in the image dataset, and class 3 (15.8% vs 41.2%), which is significantly over-represented. This over-representation is attributable to the large number of [...]
The image segmentation is conducted as outlined in Section 3. For the late fusion, the segmentation results are projected onto the point cloud as an additional feature. Following this, the Point Transformer network was trained again on the same partition, also utilizing the covariance features listed in Table 3. A batch size of 48,000 points was employed, with a subsampling of 0.005 m. Additionally, ShapeConv combined with Deeplabv3+ and a ResNet-101 backbone was trained on the RGB and HHA channels.
The results are presented in Tables 4 and 5 and Figs. 7 and 8. Overall, while each method scores well for the more prevalent classes, some key differences are observed. Firstly, there is a notable divergence between the mIoU and the weighted mIoU, primarily attributed to class balancing. This effect is most pronounced with ShapeConv, which disproportionately favours the majority classes, resulting in a skewed performance as it neglects the minority classes (4, 7 and 9). In contrast, the other methods, which incorporate weighted training approaches, demonstrate a more balanced performance profile. Notably, the detection rates across different classes still show considerable variance, with the image segmentation networks being particularly susceptible to discrepancies in training sample sizes. Secondly, Point Transformer achieves the best results (71.4% mIoU), which is expected due to the geometric nature of the dataset. On the other hand, the DeepLabV2 network, rather than improving these results, actually contributes to greater confusion in the late fusion (with a lower mIoU of 68.5%) due to its subpar classification of the less represented classes. This underscores the importance of careful integration of network results in late fusion, potentially by including the confidence levels.
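The divergence between the mIoU and the weighted mIoU can be reproduced on a toy example. The sketch below assumes that the weighted mIoU weights each class IoU by its share of ground-truth samples; this definition is an assumption, as the weighting is not spelled out in this section, and the function names are hypothetical.

```python
def iou_per_class(y_true, y_pred, classes):
    """Intersection over union for every class label."""
    ious = {}
    for c in classes:
        inter = sum(1 for a, b in zip(y_true, y_pred) if a == c and b == c)
        union = sum(1 for a, b in zip(y_true, y_pred) if a == c or b == c)
        ious[c] = inter / union if union else float("nan")
    return ious

def mean_iou(ious):
    """Unweighted mean over classes that occur in the labels or predictions."""
    valid = [v for v in ious.values() if v == v]  # NaN is not equal to itself
    return sum(valid) / len(valid)

def weighted_mean_iou(ious, y_true):
    """Mean IoU weighted by the ground-truth frequency of each class."""
    n = len(y_true)
    return sum(v * y_true.count(c) / n for c, v in ious.items() if v == v)

# A majority class (0) and a minority class (1) that the model never predicts
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
ious = iou_per_class(y_true, y_pred, classes=[0, 1])
miou = mean_iou(ious)                      # 0.4
w_miou = weighted_mean_iou(ious, y_true)   # approximately 0.64
```

A model that ignores the minority class entirely still reaches a weighted mIoU of about 0.64 while its plain mIoU drops to 0.4, which is the pattern reported for ShapeConv.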
Thirdly, the ShapeConv network has mixed results. It scores better than the image-only segmentation for most classes and even better than the late fusion for some classes. Nevertheless, it underperforms in representing the minority classes from the image perspective, suggesting a loss of contextual understanding when transitioning from a general to a viewpoint-specific approach, in part due to the severe unbalancing of the training data. Finally, the training efficiency of early fusion is significantly higher than that of its late fusion counterpart. This depends on the implementation but also on the data modality (2D convolutions are faster) and on the joint training of a single network with fewer parameters, which is less demanding.
[...] detection in the imagery will outperform the point cloud in semantic segmentation. Each method was trained and validated on 50% of the data for 300 epochs, and inference was performed on the entire dataset. Table 6 again reveals a highly unbalanced dataset, with an average class balancing spread of σc = 15%. Similar imbalances in class representation are observed as in Paestum, with classes at eye level being over-represented. However, given that the Pecile dataset is significantly smaller, these effects are more pronounced. For instance, the average difference in class balance in Paestum is 5.5%, whereas in Pecile it is 10.5%. All methods were processed analogously to those in Paestum. The Point Transformer network was trained with a batch size of 48,000 points, a subsampling of 0.005 m, and the features listed in Table 7. For the image-based segmentation with DeepLabV2 and ShapeConv, the imagery was divided into 9 tiles, thereby increasing the number of samples to 504. This partitioning incurs minimal overhead on the total calculations. The generation of HHA imagery took 423 seconds.
The results are presented in Tables 8 and 9 and Fig. 10. The average detection rate of the methods is 63% mIoU while the weighted mIoU is 78.9%, showing a similar trend to the Paestum dataset due to training data differences. However, the image-based methods now score significantly better, with the Point Transformer being the weakest performer. This underscores the importance of choosing networks that leverage both texture and geometric features in a scene. Both early and late fusion techniques show comparable efficacy, with the contribution in late fusion primarily coming from DeepLabV2. Again, it is observed that ShapeConv does not deal well with low-presence classes. Despite this, the added geometry channels in ShapeConv do improve the detection rate, as some of the materials exhibit depth-sensitive erosion. An interesting insight is that each network has variable performance depending on the scene, except for the late fusion. It seems that by directly embedding image detection results into the geometric processing, both the best and the worst results are filtered out, leading to a stable performance across scenes with varying texture and geometric signatures. Contrary to expectations, early fusion did not mirror this behaviour. The further imbalance in training data appeared to hamper the network's efficacy. Additionally, the limited parameter set in early fusion, as opposed to the more elaborate setup in late fusion involving multiple networks, seemed to restrict its ability to capture the same level of complexity.

CONCLUSIONS
This work presented the adoption of early and late fusion methods for image and point cloud semantic segmentation. Experiments on two test cases demonstrate that the detection rate is primarily influenced by whether the scene has a predominantly geometric or texture-based signature, underscoring the necessity of fusion methods. Image semantic segmentation proves to be more effective in texture-rich areas, whereas Point Transformers excel in geometrically complex scenes. The combination of both approaches yields enhanced results in both cases, a pattern also observed in early fusion. Notably, late fusion tends to be more consistent, benefiting from better-suited data modalities and the absence of training entanglement.
The study concludes that employing networks in series or parallel, as seen in late fusion, tends to be more advantageous for projects than early fusion.This is because even if only one of the networks in the series performs well, satisfactory results are achievable.An essential factor in choosing between early and late fusion methods is the scene's complexity.In highly intricate scenes, late fusion is often the better choice as each modality requires a dedicated network for precise tuning.However, in simpler scenarios like those examined in this study, the ability to generalize quickly over smaller networks makes early fusion a viable option.Future research aims to explore further the relationship between scene complexity and the choice of fusion methods.

Figure 1: Overview of the project inputs: (left) hand-held and UAV images, point clouds and BIM model.

Figure 2: Overview of the transfer of 3D semantic labels from the IFC model to the 3D point cloud: without normal filtering, showing poor results near edges (left) and with normal filtering for a more nuanced segmentation (right).

Figure 3: Overview of the image raycasting: original image (left), raycasting on the original point cloud, which is unusable due to the lack of surfaces (middle), and raycasting on the voxel mesh, which does yield proper masks for image segmentation (right).

Figure 4: Voxel mesh generated from the point cloud octree.

Figure 5: Overview of the early fusion modality: (left) original undistorted image, (middle) projected point cloud labels and (right) HHA imagery with depth information.

[...] were applied to account for low class presence, and data augmentation methods, as recommended in (Shorten and Khoshgoftaar, 2019), were employed. The training process occurred in two stages. Initially, only the output layer was trained using automatically generated training data. Subsequently, the model was further fine-tuned. Out of the total 30,925,387 parameters in DeepLabV2, 30,840,427 were trained for both the building elements and the materials.
The outcomes of the image segmentation are then associated with the most suitable points in the point cloud P. By utilizing the image coordinates of the labels I and the depth maps D, a reference point cloud Q can be created using the same raycasting mechanism (Equation 3). As there is significant overlap in the imagery, mislabelling in I will result in a cluttered reference point cloud. To obtain the final result, a k-nearest neighbour evaluation is performed between the initial point cloud and the reference cloud. The labelling Y is then obtained as the weighted average label of the projected image labels, given inverse distance weights w. These image labels are then assigned as an additional feature in the point cloud semantic segmentation.
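The transfer of image labels to the point cloud by k-nearest neighbours with inverse distance weights can be sketched as follows. This is an illustrative brute-force version under stated assumptions: the weighted average of categorical labels is interpreted as a weighted vote, the function name `knn_label_vote`, the value of k and the small constant `eps` are hypothetical, and a spatial index would be needed for clouds with millions of points.

```python
import math
from collections import defaultdict

def knn_label_vote(points, ref_points, ref_labels, k=3, eps=1e-9):
    """Assign each point the label with the largest inverse-distance-weighted vote."""
    labels = []
    for p in points:
        # Distances to every reference point, keeping the k closest
        neighbours = sorted(
            ((math.dist(p, q), lab) for q, lab in zip(ref_points, ref_labels)),
            key=lambda pair: pair[0],
        )[:k]
        votes = defaultdict(float)
        for d, lab in neighbours:
            votes[lab] += 1.0 / (d + eps)  # inverse distance weight w
        labels.append(max(votes, key=votes.get))
    return labels

# Reference cloud Q built from projected image labels (toy values)
ref_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.1, 0.0, 0.0)]
ref_labels = ["column", "wall", "wall"]

# Points of the initial cloud P to be labelled
fused = knn_label_vote([(0.1, 0.0, 0.0), (1.0, 0.0, 0.0)], ref_points, ref_labels, k=3)
```

For the first point an unweighted majority vote over the three neighbours would return "wall", whereas the inverse distance weighting favours the much closer "column" reference, which limits the effect of clutter from overlapping images.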

Dataset II: Wall of the Pecile
The second test is a photogrammetric 3D reconstruction of the Wall of the Pecile (18 m x 1 m x 8 m), a part of the courtyard of the Roman Villa Adriana in Tivoli, Rome. The dataset comprises 54 geolocated images taken by hand, resulting in a point cloud of 2.5 million points. It includes 6 building techniques (Fig. 9). However, these classes primarily have texture signatures since the reconstruction consists of a gate at the center of a flat wall with limited geometric signatures. Therefore, it is expected that [...]

Figure 7: Semantic segmentation results of the early fusion method with ShapeConv on the Paestum dataset.

Figure 10: Semantic segmentation results for a part of the Pecile wall: ground truth (left), Point Transformer (middle) and DeepLabV2 plus Point Transformer (right).

Coudron, I., Puttemans, S. and Goedemé, T., 2020. Semantic extraction of permanent structures for the reconstruction of building interiors. Sensors, pp. 1-21.

Table 2: Class distribution (%) in the imagery and point cloud of Paestum.

Table 4: Semantic segmentation results per method.

Table 5: Average IoU per class for the 75% test area.

Table 6: Pecile class distribution (%) in the imagery and point cloud.

Table 8: Average Pecile semantic segmentation results per method.

Table 9: Pecile IoU per class for the 50% test area.

[...] in cultural heritage applications. It features a methodology for seamless transition between data modalities and efficient production of training data. The late fusion approach merges image-based segmentation with a Point Transformer applied to point clouds. In contrast, the early fusion utilizes multi-view rendering to generate RGBD imagery of the scene.