LAND USE CLASSIFICATION FROM VHR AERIAL IMAGES USING INVARIANT COLOUR COMPONENTS AND TEXTURE

Very high resolution (VHR) aerial images can provide a detailed analysis of landscape and environment; nowadays, thanks to rapidly growing airborne data acquisition technology, an increasing number of high resolution datasets are freely available. In a VHR image the essential information is contained in the red-green-blue (RGB) colour components and in the texture; therefore, a preliminary step in image analysis concerns classification, in order to detect pixels having similar characteristics and to group them in distinct classes. Common land use classification approaches use colour at a first stage, followed by texture analysis, particularly for the evaluation of landscape patterns. Unfortunately, RGB-based classifications are significantly influenced by image settings, such as contrast, saturation, and brightness, and by the presence of shadows in the scene. The classification methods analysed in this work aim to mitigate these effects. The procedures developed considered the use of invariant colour components, image resampling, and the evaluation of an RGB texture parameter for various increasing sizes of a structuring element. To identify the most efficient solution, the classification vectors obtained were then processed by a K-means unsupervised classifier using different metrics, and the results were compared with the corresponding user supervised classifications. The experiments performed and discussed in the paper allow us to evaluate the effective contribution of texture information, and to compare the most suitable vector components and metrics for the automatic classification of very high resolution RGB aerial images.


INTRODUCTION
Large datasets of aerial imagery exist in different government institutions, such as national land, urban planning, and construction departments (Lv et al., 2010). Commonly, the spatial resolution is high or even very high, but the spectral resolution is often limited to RGB, with a general lack of near-infrared bands (Laliberte and Rango, 2009). In the classification of RGB images, without the fundamental contribution of NIR and other bands, the quality of the process is variably influenced by different factors such as subject, illumination conditions, spatial resolution, and the vector components and metrics used to define the various clusters. In order to classify an image, it is common to use methods based on the statistical analysis of pixels: this approach has demonstrated good performance only when used to classify images with large pixel size (Wang et al., 2004). High spatial resolution digital aerial imagery, with a pixel size of less than one meter, is an underexplored field in which the classification process can be complicated by the spectral variability within a particular class due to the very small pixel size (Aguera et al., 2008), especially in the case of urban areas (Kiema, 2002). The urban environment in particular represents one of the most challenging problems for remote sensing analysis, as a consequence of the high spatial difference and spectral variance of the surface materials (Herold et al., 2003). This problem can be overcome with different techniques that take into account two fundamental image characteristics: colour (spectral information) and texture (Haralick et al., 1973). Colour has high discriminative power, and in many cases objects can be well recognized merely by this characteristic (Swain and Ballard, 1991; Burghouts and Geusebroek, 2009). Texture information can be used to classify images directly, or as an additional band in the clustering process; in the latter case, classification accuracies are generally improved (Wang et al., 2004; Puissant et al., 2005; Rao et al., 2002). Further, texture estimation is important because it provides information about the spatial and structural arrangement of objects, thanks to the strong correspondence between them and their pattern (Tso and Mather, 2001; Permuter et al., 2006). Even though colour is the most appealing feature, it is also the most vulnerable indexing parameter, since it strongly depends on image lighting conditions, pose, and sensor characteristics. This problem also affects texture evaluation, and there is no valid model able to provide illumination-invariant entities (Hanbury et al., 2005). Few applications of RGB classification for land use estimation are present in the literature. Among these, Chust et al. (2008) used an initial RGB supervised classification as reference, and compared it with those automatically obtained by adding further information (NIR, DTM, aspect, slope), reaching a mean accuracy of 73.2% and 75.1% on the two test areas considered. Lv et al. (2010) developed a method to carry out land cover classifications that depend only on RGB bands, reaching high values of accuracy. They observed that, in order to extract building information, it is necessary to include spectral and texture features. This work provides an experimental assessment of the benefits derived from converting the RGB colour components into invariant ones, in order to remove, or at least mitigate, the classification biases caused by varying scene illumination conditions, which produce different RGB spectral signatures for identical image objects. Further, the tests carried out aim to evaluate the actual improvements in high resolution colour image classification provided by textural information and spatial resolution resampling.

Invariant image features
In image processing, various colour systems related to RGB are in use today: among their characteristics, for the purpose of image classification, an important colour property is invariance. A colour invariant system contains models which are more or less insensitive to varying imaging conditions such as variations in illumination and object pose (Gevers, 1998). Colour invariants can be obtained by constructing a colour ratio model to remove the effects of perspective, light direction, illumination, reflected light intensity, and other factors. HSV, C1C2C3, l1l2l3, CIE-Lab and CIE-Luv are some invariant colour models described in the literature and briefly recalled below (e.g. Gevers and Smeulders, 1999; Geusebroek et al., 2001). In addition, other techniques such as the SIFT model and image stretch decorrelation can be considered for the same purposes. Thanks to these characteristics, all these methods appear suitable to improve the classification accuracy and to mitigate the distortions of RGB caused by different light conditions.

HSV colour model
HSV is an approximately perceptually uniform colour space that provides an intuitive representation of colours and simulates the way humans perceive and manipulate them. Hue (H) represents the dominant spectral component, and is commonly used in tracking applications where some degree of illumination change is expected. The hue component does not contain intensity information, and therefore it is invariant to intensity changes in illumination; however, it is sensitive to changes in light colour. Saturation (S) is an expression of the relative purity, that is, the degree to which a pure colour is diluted by adding white. Finally, the value (V) corresponds to the colour brightness. In this way, the luminous component (V) is decoupled from the colour-carrying information (H and S). HSV can be obtained from RGB through a nonlinear invertible transformation. After rescaling the RGB coordinates to the range [0, 1] by dividing them by their maximum theoretical value (e.g. 255 for an 8 bit colour band), H, S, and V are estimated as follows:

V = max(R, G, B)
S = (V − min(R, G, B)) / V, with S = 0 when V = 0
H = 60° · (G − B) / (V − min(R, G, B)) when V = R
H = 60° · [2 + (B − R) / (V − min(R, G, B))] when V = G
H = 60° · [4 + (R − G) / (V − min(R, G, B))] when V = B

with H taken modulo 360° and set to 0 when the three bands are equal.

C1C2C3 and l1l2l3 colour models
C1C2C3 and l1l2l3 represent colour models that are theoretically insensitive to viewing direction, surface orientation, illumination direction and intensity. The C1, C2 and C3 band components can be calculated from R, G, B by the following formulas:

C1 = arctan[R / max(G, B)]
C2 = arctan[G / max(R, B)]
C3 = arctan[B / max(R, G)]

The following equations provide the transformation from R, G, B to l1, l2, l3, which forms a set of normalized colour differences:

l1 = (R − G)² / [(R − G)² + (R − B)² + (G − B)²]
l2 = (R − B)² / [(R − G)² + (R − B)² + (G − B)²]
l3 = (G − B)² / [(R − G)² + (R − B)² + (G − B)²]

where R, G, B are the Red, Green, and Blue band component values respectively (Gevers and Smeulders, 1999).
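The conversions above are straightforward to express in code. The paper's processing was done in Matlab; purely as an illustration (not the authors' implementation), the following NumPy sketch computes the HSV, C1C2C3, and l1l2l3 components from RGB values rescaled to [0, 1]. Note that `arctan2` is used in place of the plain ratio so the C1C2C3 formulas remain defined when a denominator is zero:

```python
import numpy as np

def rgb_to_hsv(rgb):
    """Convert an (..., 3) RGB array with values in [0, 1] to HSV.

    H is returned in degrees [0, 360); S and V are in [0, 1].
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = np.max(rgb, axis=-1)
    c = v - np.min(rgb, axis=-1)                      # chroma
    s = np.where(v > 0, c / np.where(v > 0, v, 1.0), 0.0)
    with np.errstate(divide="ignore", invalid="ignore"):
        # piecewise hue, depending on which band is the maximum
        h = np.select(
            [c == 0, v == r, v == g],
            [0.0,
             60.0 * (g - b) / c,
             60.0 * (2.0 + (b - r) / c)],
            default=60.0 * (4.0 + (r - g) / c),
        )
    return np.stack([h % 360.0, s, v], axis=-1)

def rgb_to_c1c2c3(rgb):
    """C1C2C3 invariant components (Gevers and Smeulders, 1999)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    c1 = np.arctan2(r, np.maximum(g, b))
    c2 = np.arctan2(g, np.maximum(r, b))
    c3 = np.arctan2(b, np.maximum(r, g))
    return np.stack([c1, c2, c3], axis=-1)

def rgb_to_l1l2l3(rgb):
    """l1l2l3 normalized colour differences (Gevers and Smeulders, 1999)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    d = (r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2
    d = np.where(d == 0, 1.0, d)          # grey pixels: all differences are zero
    return np.stack([(r - g) ** 2 / d,
                     (r - b) ** 2 / d,
                     (g - b) ** 2 / d], axis=-1)
```

Each function operates band-wise on whole images, so a VHR scene can be converted in one vectorized call.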

CIE-Lab and CIE-Luv
These perceptually uniform colour spaces are derived from RGB by first passing through the CIE-XYZ tristimulus values, obtained through a linear transformation of the rescaled R, G, B coordinates under a standard illuminant; the L*, a*, b* components are then computed as:

L* = 116 f(Y/Yn) − 16
a* = 500 [f(X/Xn) − f(Y/Yn)]
b* = 200 [f(Y/Yn) − f(Z/Zn)]

where f(t) = t^(1/3) for t > (6/29)³ and f(t) = t / [3(6/29)²] + 4/29 otherwise, and Xn, Yn, Zn are the tristimulus values of the reference white. For L*u*v*, L* has the same formulation, while

u* = 13 L* (u′ − u′n)
v* = 13 L* (v′ − v′n)

with u′ = 4X / (X + 15Y + 3Z), v′ = 9Y / (X + 15Y + 3Z), and u′n, v′n the corresponding values of the reference white, assuming again standard conditions as above.

SIFT colour model
An RGB histogram is not invariant to changes in lighting conditions; however, by normalizing the pixel value distributions, scale invariance and shift invariance are achieved with respect to light intensity. Because each band is normalized independently, the descriptor is also normalized against changes in light colour and arbitrary offsets. SIFT descriptors are computed for every RGB band independently as follows:

R′ = (R − µR) / σR,  G′ = (G − µG) / σG,  B′ = (B − µB) / σB

with µ the mean and σ the standard deviation of the distribution of the respective colour band, computed over the area under consideration.
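As a minimal NumPy illustration (not the authors' code), the per-band normalization can be sketched as:

```python
import numpy as np

def normalize_bands(img):
    """Normalize each colour band to zero mean and unit standard deviation.

    img: float array of shape (rows, cols, 3). Each band is treated
    independently, which makes the result invariant to per-band affine
    changes of intensity (light colour changes and arbitrary offsets).
    """
    mu = img.mean(axis=(0, 1), keepdims=True)    # per-band mean
    sigma = img.std(axis=(0, 1), keepdims=True)  # per-band standard deviation
    sigma = np.where(sigma == 0, 1.0, sigma)     # guard against flat bands
    return (img - mu) / sigma
```

Here the statistics are computed over the whole image; they could equally be computed over a local area under consideration, as the text describes.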

Decorrelation Stretching
This colour transformation is used for enhancing the colour differences in images with high internal band correlation, as is evident in many RGB ones. Decorrelation makes it possible to distinguish picture details that are otherwise not immediately visible. The specificity of the model is the following pointwise linear transformation:

y = R · St · Sc · Rᵀ · (x − µ) + µtarget

where x is the input RGB colour pixel vector, y the output decorrelated pixel vector, µ the input dataset mean, R the rotation matrix, Sc the (diagonal) scaling matrix, St the stretching matrix, and µtarget the back-projected data shift (Gillespie et al., 1986).
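The paper relies on Matlab's built-in stretch decorrelation; as an illustration of the same pointwise transformation (rotation to principal components, scaling to unit variance, stretching to a common target standard deviation, and back-rotation), a self-contained NumPy sketch could read:

```python
import numpy as np

def decorrelation_stretch(img, target_sigma=50.0):
    """Pointwise decorrelation stretch of an RGB image.

    img: float array (rows, cols, 3). The bands are rotated into their
    principal components (eigenvectors of the band covariance), scaled
    to unit variance, stretched to a common target standard deviation,
    rotated back, and shifted to the original band means.
    """
    x = img.reshape(-1, 3)
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    eigval, rot = np.linalg.eigh(cov)        # columns of rot = eigenvectors
    scale = np.diag(1.0 / np.sqrt(np.maximum(eigval, 1e-12)))
    stretch = np.eye(3) * target_sigma
    t = rot @ stretch @ scale @ rot.T        # combined linear transform
    y = (x - mu) @ t.T + mu                  # back-projected to original means
    return y.reshape(img.shape)
```

By construction the output bands are decorrelated with equal variance (the covariance of the result is target_sigma² times the identity); here the output is shifted back to the input means rather than to a separate target mean.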

EXPERIMENTS
The experimental assessments have been carried out on two image pairs of two suburban environments, acquired respectively in summer 2008 and in winter 2011, at the same resolution (9 cm) (Figure 2).The four original images have been resampled, by the median method, producing four additional pictures with 27 cm pixel size, and further four images at 45 cm resolution.The 12 samples thus obtained formed the dataset employed to investigate the benefits of different invariant colour models, together with the role played by pixel size and scene illumination.
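Assuming the median resampling works block-wise on the pixel grid (the paper does not detail the implementation), a NumPy sketch of the operation could be:

```python
import numpy as np

def median_resample(img, factor):
    """Downsample an image by taking the median of each factor×factor block.

    img: array of shape (rows, cols, bands), with rows and cols multiples
    of factor (e.g. factor 3 turns 9 cm pixels into 27 cm ones, and
    factor 5 into 45 cm ones).
    """
    rows, cols, bands = img.shape
    blocks = img.reshape(rows // factor, factor, cols // factor, factor, bands)
    return np.median(blocks, axis=(1, 3))    # median over each block
```

The reshape groups the pixels of each block together, so the whole resampling is a single vectorized median.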
In the first step, in order to set a reference for the following evaluations, a thorough supervised classification of each of the 12 RGB colour images has been carried out: the results obtained were assumed to represent the truth.The classes considered were: vegetation, urban areas, streets, and shadows.
The 12 RGB images were then converted into their corresponding HSV, l1l2l3, and C1C2C3 invariant colour components, and into the CIE-Luv and CIE-Lab colour spaces; in addition, a SIFT transformation and a stretch decorrelation of the RGB bands were also performed. Each different image thus obtained was classified with the K-means unsupervised method, applying in many cases, for further comparison, two possible metrics: Euclidean and cosine distance.
To analyse the effects of texture information on the classification, a further component representing object image texture was considered in some cases and joined to the clustering vectors: this parameter was the mean RGB component value computed over a structuring element of size 3×3 or 9×9.
In this way, varying the vector composition of the test images, 228 independent K-means classifications have been produced, and individually compared with the corresponding reference ones obtained via the supervised approach (Figures 3 and 4).
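The actual processing was carried out in Matlab; purely as an illustration of the procedure described above, a NumPy sketch of the clustering-vector construction (colour bands plus a mean-value texture band) and of a minimal K-means supporting both metrics might look like this (function names are ours, not the paper's):

```python
import numpy as np

def mean_filter(band, window):
    """Mean over a window×window structuring element (edge-padded)."""
    pad = window // 2
    padded = np.pad(band, pad, mode="edge")
    rows, cols = band.shape
    out = np.zeros(band.shape, dtype=float)
    for i in range(window):
        for j in range(window):
            out += padded[i:i + rows, j:j + cols]
    return out / (window * window)

def classification_vectors(img, window=3):
    """Stack the colour bands with a texture band (mean RGB value over a
    window×window element) and flatten to one feature vector per pixel."""
    texture = mean_filter(img.mean(axis=-1), window)
    stacked = np.concatenate([img, texture[..., None]], axis=-1)
    return stacked.reshape(-1, img.shape[-1] + 1)

def kmeans(data, k, metric="euclidean", iters=100, seed=0):
    """Minimal K-means supporting Euclidean and cosine distances."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iters):
        if metric == "cosine":
            a = data / np.maximum(np.linalg.norm(data, axis=1, keepdims=True), 1e-12)
            b = centers / np.maximum(np.linalg.norm(centers, axis=1, keepdims=True), 1e-12)
            dist = 1.0 - a @ b.T                       # cosine distance
        else:
            dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):                  # converged
            break
        centers = new
    return labels, centers
```

A production run would of course use an optimized library implementation; the sketch only makes the vector composition and the role of the metric explicit.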
All the various steps of the computations have been performed by custom-made Matlab™ scripts, taking advantage of built-in high level functions only when necessary (K-means classification and stretch decorrelation).

RESULTS AND DISCUSSION
Table 1 collects the percentages of agreement between the various unsupervised K-means classifications of the different pixel components, and the user supervised classifications of the same RGB images, assumed to represent the truth.
Column 1 specifies the composition of the classification vector employed, column 2 shows the metric used, and the following twelve columns list the results of the various experiments. The rows are ordered considering the global average agreement obtained in the various tests and, secondly, their median value. Following this criterion, a significant outcome emerges: classifications exploiting invariant components are generally superior to those based on RGB and texture. Excluding the evidently worst case, represented by l1l2l3, RGB based methods generally produce average agreements around 71%, while the best three invariant based solutions are very close to 80% and over. Going into detail, it must be noted that, starting from the same information, represented by the original RGB image, very different agreements (from 45.3% to 93.9%) can be produced just by selecting a specific colour invariant or by stretch decorrelating the colour bands. The minimal variation (from 57.6% to 74.6%) has been obtained in image 1_2011 at 45 cm, while the maximal one (from 50.9% to 93.9%) has been observed in image 2_2008 at 9 cm. The metric employed appears to be effective only in the case of RGB, where substituting Euclidean with cosine distance improves the average agreement by 5-6%. In the other cases the role of the metric is very variable. RGB texture information does not seem to provide significant improvements, since it produces contrasting results: in the case of RGB, introducing texture 3x3 and 9x9 slightly worsens the outcomes, while in the case of C1C2C3 we observe a significant gain with texture 3x3 and a small loss with texture 9x9. Observing the highest classification results for the various images, L*a*b* performed five times out of twelve as the best classifier vector, particularly in the 2008 imagery, but C1C2C3 based methods also achieved similar results, particularly in the 2011 images. This can be explained by the fact that the 2011 images contain a larger amount of shadows than the corresponding 2008 ones, and the component C3 is particularly suited to detect them, as reported in the shadow detection literature (Duan et al., 2013; Movia et al., 2015). In any case, even when CIE-Lab, RGB decorr., and CIE-Luv were not the best ones in the tests, their average distance from the best ranges only from 2.1% to 3.7%.

Table 1: Agreement percentages between the various unsupervised K-means classifications of the different pixel components, and the supervised classifications assumed as reference. Column M specifies the metric (E=Euclidean, C=cosine), Med. the median values; t3 and t9 indicate the window size (3×3 or 9×9 cells) for texture evaluation.
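The paper does not state how the arbitrary K-means cluster labels were matched to the supervised classes before computing the agreement percentages; one simple assumption is a majority-vote mapping, sketched below (an optimal Hungarian assignment would be a stricter alternative):

```python
import numpy as np

def agreement_percentage(unsup, ref, k):
    """Percentage of pixels on which an unsupervised classification agrees
    with a reference one, after relabelling: each cluster is mapped to the
    reference class it overlaps most (majority vote)."""
    unsup, ref = unsup.ravel(), ref.ravel()
    mapped = np.empty_like(unsup)
    for j in range(k):
        mask = unsup == j
        if mask.any():
            # most frequent reference class inside cluster j
            classes, counts = np.unique(ref[mask], return_counts=True)
            mapped[mask] = classes[counts.argmax()]
    return 100.0 * np.mean(mapped == ref)
```

Note that majority voting can map two clusters to the same class; this keeps the sketch simple at the cost of a slightly optimistic score in such cases.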
Finally, regarding the pixel size effect on classification, small differences have been observed among the average accuracies at 9 cm, 27 cm, and 45 cm; at the same time, increasing the pixel size slightly reduces the classification variability. Analysing in detail the examples proposed in Figures 3 and 4, we can make more specific considerations: CIE-Lab and RGB decorrelated, with Euclidean metrics, graphically prove the best accordance with the supervised classification, while l1l2l3 appears very noisy and justifies the worst performance. Figure 3 shows that RGB mistakenly interprets a large roof as vegetation, while HSV even identifies all roofs as vegetation. C1C2C3, RGB decorrelated, and CIE-Lab slightly underestimate shadows around the buildings, demonstrating less sensitivity to penumbra pixels. Regarding Figure 4, RGB and HSV again misinterpret some roofs as vegetation, and the l1l2l3 classification again proves to be very noisy and unable to properly identify shadows.

CONCLUSIONS
RGB-based aerial image classifications are often influenced by shadows, illumination conditions, and camera settings. To mitigate these effects, an experimental comparison has been carried out considering texture information, different clustering metrics, RGB decorrelation, and various colour invariant transformations of the original RGB colour space. Results showed that significant classification improvements can be obtained after substituting RGB with CIE-Lab, CIE-Luv, and C1C2C3 respectively, or by stretch-decorrelating the colour bands. Generally, invariant parameters provide better classification results than those based on RGB. Furthermore, the introduction of texture information in the classification vector, or the application of a different clustering distance (Euclidean vs. cosine), does not appear to give effective improvements to the segmentation process.
L*a*b* and L*u*v* are two perceptually uniform colour spaces, specified by the International Commission on Illumination (CIE), that are considered device independent. There are no simple and univocal formulas to convert from RGB to L*a*b* and L*u*v*, since RGB is device dependent; in any case, for classification purposes, a suitable transformation can be obtained by converting RGB to CIE-XYZ, and then deriving L*a*b* and L*u*v* under standard conditions. To this aim, the R, G, B coordinates are first rescaled to the range [0, 1] by dividing them by their maximum theoretical value; the X, Y, Z tristimulus values are then evaluated through a linear transformation, assuming standard open-air illumination conditions. Regarding L*u*v*, the new components u* and v* are obtained from the tristimulus values, while L* has the same meaning and formulation as in L*a*b*.
In the decorrelation stretch transformation, x is the input RGB colour pixel vector, y the output decorrelated RGB colour pixel vector, µ the input dataset mean, R the rotation matrix, Sc the (diagonal) scaling matrix, St the stretching matrix, and µtarget the back-projected data shift. Refer to Gillespie et al. (1986) for a complete description.

Figure 1
Figure 1 shows some histogram examples representing transformations from RGB to invariant colour components: worthy of note is the lack of correlation among the various colour spaces. In the graphs, for better comparison with the original RGB, the various component values have been rescaled to [0, 255].

Figure 2 :
Figure 2: Image pairs employed for the experiment of two suburban areas at 9 cm resolution: on the left, pictures 1_2008 and 2_2008 taken in summer 2008, while on the right, the corresponding 1_2011 and 2_2011, acquired in winter 2011, showing various long shadows.

Figure 3: Examples of K-means classifications of image 2_2008 at 27 cm with different vector components: blue represents urban areas, cyan identifies streets and bare soil, brown vegetation, and yellow shadows. (E) indicates Euclidean distance.

Figure 4: