EXTRACTION OF OPTIMAL SPECTRAL BANDS USING HIERARCHICAL BAND MERGING OUT OF HYPERSPECTRAL DATA

Spectral optimization consists in identifying the most relevant band subset for a specific application. It is a way to reduce the very high dimensionality of hyperspectral data and can be applied to design superspectral sensors dedicated to specific land cover applications. Spectral optimization includes both band selection and band extraction. On the one hand, band selection aims at selecting an optimal band subset (according to a relevance criterion) among the bands of a hyperspectral data set, using automatic feature selection algorithms. On the other hand, band extraction defines the most relevant spectral bands by optimizing both their position along the spectrum and their width. The approach presented in this paper first builds a hierarchy of groups of adjacent bands, using a relevance criterion to decide which adjacent bands must be merged. Then, band selection is performed at the different levels of this hierarchy. Two approaches are proposed to achieve this task: a greedy one and a new adaptation of an incremental feature selection algorithm to this hierarchy of merged bands.


INTRODUCTION
High dimensional remote sensing imagery, such as hyperspectral imagery, generates huge data volumes, consisting of hundreds of contiguous spectral bands. Nevertheless, most of these spectral bands are highly correlated with each other, so using all of them is not necessary. Besides, this high dimensionality causes difficulties, such as the curse of dimensionality or data storage problems. To answer these general problems, dimensionality reduction strategies aim at reducing the data volume while minimizing the loss of useful information, and especially of class separability. These approaches belong either to the feature extraction or to the feature selection category. Feature extraction methods consist in reformulating and summing up the original information by reprojecting it into another feature space. Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Linear Discriminant Analysis (LDA) are state-of-the-art feature extraction techniques. In contrast, feature selection (FS) methods applied to band selection select the most relevant band subset (among the original bands of the hyperspectral data set) for a specific problem. Furthermore, in the case of hyperspectral data, adjacent bands are strongly correlated with each other. Thus band extraction, that is to say the definition of an optimal set of spectral bands optimizing both their width and their position along the spectrum, can be considered as intermediate between feature extraction techniques and individual band selection. Band selection/extraction approaches offer advantages compared to feature extraction techniques. First, they preserve the physical meaning of the selected bands. Most importantly, they are suited to the design of multispectral or superspectral sensors dedicated to a specific application, that is to say sensors designed to deal with specific land cover classification problems for which only a limited band subset is relevant.

Feature selection
Feature selection (FS) can be seen as a classic optimization problem involving both a metric to optimize (that is to say a FS score measuring the relevance of feature subsets) and an optimization strategy. Even though hybrid approaches involving several criteria exist (Estévez et al., 2009, Li et al., 2011), FS methods and criteria are often divided into "filter", "wrapper" and "embedded" ones. It is also possible to distinguish supervised from unsupervised ones, depending on whether classes are taken into account.
Filters. Filter methods compute a relevance score for each feature independently of any classifier. Some filter methods are ranking approaches: features are ranked according to an importance score, such as the ReliefF score (Kira and Rendell, 1992) or a score calculated from a PCA decomposition (Chang et al., 1999). Other filters associate a score to feature subsets. In supervised cases, separability measures such as the Bhattacharyya or Jeffries-Matusita (JM) distances can be used to identify the feature subsets that best separate the classes (Bruzzone and Serpico, 2000, Serpico and Moser, 2007). High order statistics from information theory such as divergence, entropy and mutual information can also be used to select the feature subsets achieving the minimum redundancy and the maximum relevance, either in unsupervised or supervised situations: (Martínez-Usó et al., 2007) first cluster "correlated" features and then select the most representative feature of each group, while (Battiti, 1994, Estévez et al., 2009) select the set of bands that are the most correlated with the ground truth and the least correlated with each other.
Wrappers. For wrappers, the relevance score associated to a feature subset corresponds to the classification performance (measured by a classification quality rate) reached using this feature subset. Examples of such approaches can be found in (Estévez et al., 2009, Li et al., 2011) using an SVM classifier, (Zhang et al., 2007) using a maximum likelihood classifier, (Díaz-Uriarte and De Andres, 2006) using random forests or even (Minet et al., 2010) for target detection.
Embedded. Embedded FS methods are also related to a classifier, but feature selection is performed using a feature relevance score different from a classification performance rate. Some embedded approaches are regularization models combining a fit-to-data term (e.g. a classification error rate) with a regularization function that penalizes models when the number of features increases (Tuia et al., 2014). Other embedded approaches progressively eliminate features from the model, such as SVM-RFE (Guyon et al., 2002), which considers the importance of the features in an SVM model. Other approaches have a built-in mechanism for feature selection, such as Random Forests (Breiman, 2001), which use only the most discriminative feature among a randomly selected feature subset when splitting a tree node.
Another issue for band selection is the optimization strategy used to determine the best feature subset according to a criterion. An exhaustive search is often impossible, especially for wrappers. Therefore, heuristics have been proposed to find a near optimal solution without visiting the entire solution space. These optimization methods can be divided into incremental and stochastic ones. Several incremental search strategies have been detailed in (Pudil et al., 1994), including the Sequential Forward Search (SFS), which starts from one feature and incrementally adds the feature that yields the best score, and, conversely, the Sequential Backward Search (SBS), which starts from all possible features and incrementally removes the worst feature. Variants such as Sequential Forward Floating Search (SFFS) or Sequential Backward Floating Search (SBFS) are proposed in (Pudil et al., 1994). (Serpico and Bruzzone, 2001) propose variants of these methods called Steepest Ascent (SA) algorithms. Several stochastic optimization strategies have also been used for feature selection, including genetic algorithms (Li et al., 2011, Estévez et al., 2009, Minet et al., 2010), Particle Swarm Optimization (PSO) (Yang et al., 2012) and simulated annealing (De Backer et al., 2005, Chang et al., 2011).

Band grouping and band extraction
Band grouping and clustering. In the specific case of hyperspectral data, adjacent bands are often strongly correlated with each other. Thus, band selection raises the question of clustering the spectral bands of a hyperspectral data set, which can be a way to limit the band selection solution space. Band clustering/grouping has sometimes been performed in association with individual band selection. For instance, (Li et al., 2011) first group adjacent bands according to conditional mutual information, and then perform band selection with the constraint that only one band can be selected per cluster. (Su et al., 2011) perform band clustering by applying k-means to the band correlation matrix and then iteratively remove the clusters that are too inhomogeneous and the bands that are too different from the representative of the cluster to which they belong. (Martínez-Usó et al., 2007) first cluster "correlated" features and then select the most representative feature of each group, according to mutual information. (Chang et al., 2011) perform band clustering using a more global criterion that specifically takes into account the existence of several classes: simulated annealing is used to maximise a cost function defined as the sum, over all clusters and over all classes, of the sum of correlation coefficients between bands belonging to the same cluster. (Bigdeli et al., 2013, Prasad and Bruce, 2008) perform band clustering, but not for band extraction: a multiple SVM classifier is defined, training one SVM classifier per cluster. (Bigdeli et al., 2013) have compared several band clustering/grouping methods, including k-means applied to the correlation matrix and an approach considering the local minima of mutual information between adjacent bands as cluster borders. (Prasad and Bruce, 2008) propose another band grouping strategy, starting from the first band of the spectrum and progressively growing the group with adjacent bands until a stopping condition based on mutual information is reached.
Band extraction. Specific band grouping approaches have been proposed for spectral optimization. (De Backer et al., 2005) define spectral bands by Gaussian windows along the spectrum and propose a band extraction method optimizing a score based on a separability criterion (Bhattacharyya error bound) by means of simulated annealing. (Cariou et al., 2011) merge bands according to a criterion based on mutual information. (Jensen and Solberg, 2007) merge adjacent bands by decomposing some reference spectra of several classes into piece-wise constant functions. (Wiersma and Landgrebe, 1980) define optimal band subsets using an analytical model considering spectra reconstruction errors. (Serpico and Moser, 2007) propose an adaptation of their Steepest Ascent algorithm to band extraction, also optimizing a JM separability measure. (Minet et al., 2010) apply genetic algorithms to define the most appropriate spectral bands for target detection. Last, some studies have also examined the impact of spectral resolution (Adeline et al., 2014), without selecting an optimal band subset.

Proposed approach
The approach proposed in this paper consists in first building a hierarchy of groups of adjacent bands. Then, band selection is performed at the different levels of this hierarchy. Two approaches are proposed to achieve this task. The hierarchy of groups of adjacent bands is thus intended to be used as a constraint for band extraction and as a way to limit the number of possible combinations, contrary to some existing approaches such as (Serpico and Moser, 2007), which extract optimal bands according to the JM measure using an adapted optimization method, or (Minet et al., 2010), which directly use a genetic algorithm to optimize a wrapper score.

DATA SET
The proposed algorithms were mostly tested on the ROSIS VNIR reflectance hyperspectral Pavia Center data set. Its spectral domain ranges from 430 nm to 860 nm. Its associated land cover ground truth includes the following classes: "water", "trees", "meadows", "self blocking bricks", "bare soil", "asphalt", "roofing bitumen", "roofing tiles" and "shadows". The algorithms were also tested on the VNIR-SWIR AVIRIS Indian Pines and Salinas scenes, captured over rural areas.

HIERARCHICAL BAND MERGING
The first step of the proposed approach consists in building a hierarchy of groups of adjacent bands, which are then merged. Even though it is intended to be used to select an optimal band subset, this hierarchy of merged bands can also be a way to explore several band configurations with varying spectral resolution, that is to say with contiguous bands of different bandwidths.

Hierarchical band merging algorithm
Notations. Let $B = \{\lambda_i\}_{0 \le i \le n_{bands}}$ be the original (ordered) set of bands. Let $H = \{H^{(i)}\}_{0 \le i < n_{levels}}$ be the hierarchy of merged bands. $H^{(i)} = \{H^{(i)}_j\}_{1 \le j \le n_i}$ is the ith level of this hierarchy of merged bands. It is composed of $n_i$ merged bands, that is to say $n_i$ ordered groups of adjacent bands from $B$. Thus, each $H^{(i)}_j$ is defined as a spectral domain: $H^{(i)}_j = [H^{(i)}_j.\lambda_{min}; H^{(i)}_j.\lambda_{max}]$. The merged band $B_1 \oplus B_2$ obtained when merging two such adjacent merged bands $B_1$ and $B_2$ is $B_1 \oplus B_2 = [B_1.\lambda_{min}; B_2.\lambda_{max}]$. Let $J(.)$ be the score that has to be optimized during the band merging process.
The proposed hierarchical band merging approach is a bottom-up one. The algorithm is defined below:
Initialization: $H^{(0)} = B$ (that is to say, each merged band of the first level of the hierarchy contains a single original band).
Band merging: create level $l+1$ from level $l$. Find the pair of adjacent bands at level $l$ that will optimize the score if they are merged: $\hat{j} = \arg\min_{1 \le j < n_l} J(\{H^{(l)}_1, \dots, H^{(l)}_{j-1}, H^{(l)}_j \oplus H^{(l)}_{j+1}, H^{(l)}_{j+2}, \dots, H^{(l)}_{n_l}\})$. Level $H^{(l+1)}$ is the configuration obtained for $\hat{j}$. A mapping is also defined to link the different merged bands at consecutive hierarchy levels.
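The bottom-up procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: merged bands are represented as inclusive index pairs (first, last), the score J is taken to be a squared piece-wise constant reconstruction error (one possible choice, not necessarily the exact form used in this paper), and the names approximation_error and build_hierarchy as well as the toy spectra are hypothetical.

```python
def approximation_error(groups, spectra):
    """Squared error between each spectrum and its piece-wise constant
    approximation (assumed form of the score J; lower is better)."""
    error = 0.0
    for first, last in groups:              # inclusive band indices
        for s in spectra:
            values = s[first:last + 1]
            mean = sum(values) / len(values)
            error += sum((v - mean) ** 2 for v in values)
    return error


def build_hierarchy(n_bands, score):
    """Bottom-up merging: level 0 holds the original bands, and each new
    level merges the adjacent pair that minimises the score."""
    level = [(i, i) for i in range(n_bands)]
    hierarchy = [level]
    while len(level) > 1:
        candidates = []
        for j in range(len(level) - 1):
            merged = (level[j][0], level[j + 1][1])   # B1 (+) B2
            candidates.append(level[:j] + [merged] + level[j + 2:])
        level = min(candidates, key=score)
        hierarchy.append(level)
    return hierarchy


# Toy data: two hypothetical reference spectra over six bands.
spectra = [[1.0, 1.0, 1.0, 5.0, 5.0, 9.0],
           [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
hierarchy = build_hierarchy(6, lambda g: approximation_error(g, spectra))
for groups in hierarchy:
    print(len(groups), groups)
```

Each level has exactly one merged band fewer than the previous one, so the hierarchy contains as many levels as original bands. Any other score J (correlation based or separability based) can be passed in place of the lambda.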

Band merging criteria
Several optimization scores J can be examined. (In the algorithm described in section 3.1, this score is to be minimized.) They can be either supervised or unsupervised, depending on whether classes are considered at this step.

Correlation between bands
Between band correlation (either the classic normalized correlation coefficient or mutual information) (see figure 1) measures the dependence between bands. So a first band merging criterion merges adjacent bands according to how strongly they are correlated with each other. Thus, it tries to obtain consistent groups of adjacent correlated bands. Such a measure, inspired by (Chang et al., 2011) and intended to be minimized, is a function of the correlation scores c(b1, b2) between bands b1 and b2 belonging to the same merged band.
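As an illustration of a correlation based merging score, the sketch below assumes one specific form: minus the sum, over all merged bands, of the correlation coefficients between pairs of bands inside the same merged band. This form follows the description of the cost function of (Chang et al., 2011) given earlier, but it is an assumption and not necessarily the exact function used here. The helper names pearson and correlation_score and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Normalised correlation coefficient between two bands (pixel lists)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_score(groups, bands):
    """Assumed form of J: minus the sum, over all merged bands, of the
    correlations between pairs of bands inside the same merged band."""
    total = 0.0
    for first, last in groups:              # inclusive band indices
        for b1 in range(first, last + 1):
            for b2 in range(b1 + 1, last + 1):
                total += pearson(bands[b1], bands[b2])
    return -total


# Hypothetical toy image: 4 bands observed on 5 pixels.
bands = [[1, 2, 3, 4, 5],
         [2, 4, 6, 8, 11],
         [5, 3, 4, 1, 2],
         [6, 3, 5, 1, 2]]
grouped_by_similarity = correlation_score([(0, 1), (2, 3)], bands)
grouped_across = correlation_score([(0, 0), (1, 2), (3, 3)], bands)
print(grouped_by_similarity, grouped_across)
```

With this convention, a configuration that groups strongly correlated neighbours gets a lower (better) score than one that groups weakly or negatively correlated neighbours.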

Spectra approximation error
Band merging could also use the method of (Jensen and Solberg, 2007) to decompose some reference spectra of several classes into piece-wise constant functions (fig. 2). Adjacent bands are then merged so as to minimize the reconstruction error between the original spectra and their piece-wise constant reconstructions. Such a measure is defined for a set $\{s_j\}_{1 \le j \le n_s}$ of $n_s$ spectra, each spectrum being approximated by a constant value over every merged band $H^{(l)}_i$ of the level.

Class separability
Separability measures such as the Bhattacharyya or Jeffries-Matusita (JM) distances can also be used as a band merging criterion (Bruzzone and Serpico, 2000, Serpico and Moser, 2007). The Bhattacharyya separability between classes i and j is computed from $\mu_i$ and $\Sigma_i$, the mean vector and covariance matrix of the radiometric distribution of class i. As the Bhattacharyya separability is defined for binary problems, its mean over all possible pairs of classes can be used as a global separability measure, and the Jeffries-Matusita measure for c classes is defined in this way. At a given level of the band merging hierarchy, the best set of merged bands is the one that maximizes class separability. So a possible criterion J (to minimize) for band merging can be defined as $J(H^{(l)}) = -JM(H^{(l)})$.
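As a reminder, and assuming Gaussian class distributions, the Bhattacharyya distance between classes i and j has the standard closed form given first below. The pairwise Jeffries-Matusita expression and the averaging over the c(c-1)/2 class pairs are written here as one common convention consistent with the text; they are an assumption, since the exact normalisation used in this paper is not recoverable from it.

```latex
B_{ij} = \frac{1}{8} (\mu_i - \mu_j)^{T}
         \left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{-1}
         (\mu_i - \mu_j)
       + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_i + \Sigma_j}{2} \right|}
                              {\sqrt{|\Sigma_i| \, |\Sigma_j|}}

JM_{ij} = \sqrt{2 \left( 1 - e^{-B_{ij}} \right)}

JM(H^{(l)}) = \frac{2}{c (c - 1)} \sum_{i=1}^{c-1} \sum_{j=i+1}^{c} JM_{ij},
\qquad
J(H^{(l)}) = -JM(H^{(l)})
```

Here $\mu_i$ and $\Sigma_i$ are estimated from the training samples of class i expressed in the merged bands of level $H^{(l)}$, so the measure has to be recomputed for every candidate configuration.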

Results
Results obtained on the Pavia data set for the 3 criteria described in the previous section can be seen in figure 3. The separability based criterion tends to lead to results that differ more from the others. It can be seen that the different criteria do not consider the same parts of the spectrum as having to be kept at fine resolution. For instance, the correlation and spectra reconstruction criteria tend to quickly merge bands number 30 to 32, while separability tends to preserve them at fine resolution. Conversely, separability tends to quickly merge some bands in the red-edge domain, while the other criteria keep this domain at fine resolution. This can be understood by considering the underlying criteria: adjacent bands are not strongly correlated with each other in this domain and the slope of the spectra is steep for vegetation classes, so these bands cannot be merged easily according to the correlation or spectra approximation error criteria. In contrast, the only useful information for classification (i.e. for class separability) is the fact that there is a slope there, and thus the values of the bands before and after this domain. Thus, merging these red-edge bands has little impact on class separability.
As the hierarchy of merged bands can also be a way to explore several band configurations with contiguous bands of varying spectral resolution, the band configurations corresponding to the different levels were evaluated using a classification quality measure. For each level, a classification was performed using a support vector machine (SVM) classifier with a radial basis function (rbf) kernel, and its Kappa coefficient was considered. These results are presented in figure 4.
It can be seen that some spectral configurations made it possible to obtain better results than at the original spectral resolution. Configurations obtained using the correlation coefficient generally perform worse than those obtained with the two other criteria. Except for Pavia, the spectra piece-wise approximation error merging criterion tends to lead to the best results. For Pavia, the classification Kappa values reached using the different criteria remained very similar.
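For reference, the Kappa coefficient used as the classification quality measure can be computed from a confusion matrix as sketched below. The function name cohen_kappa and the confusion matrix values are hypothetical and only serve as an illustration; they are not results from this paper.

```python
def cohen_kappa(confusion):
    """Cohen's Kappa from a square confusion matrix
    (rows = reference classes, columns = predicted classes)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: proportion of samples on the diagonal.
    observed = sum(confusion[k][k] for k in range(n_classes)) / total
    # Chance agreement: product of row and column marginals.
    expected = sum(
        sum(confusion[k]) * sum(row[k] for row in confusion)
        for k in range(n_classes)
    ) / total ** 2
    return (observed - expected) / (1.0 - expected)


# Hypothetical confusion matrix for three classes.
confusion = [[50, 2, 3],
             [4, 40, 6],
             [1, 5, 39]]
print(round(100 * cohen_kappa(confusion), 1))   # Kappa in %
```

Unlike the overall accuracy, Kappa discounts the agreement expected by chance, which makes it less sensitive to unbalanced class sizes.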

BAND SELECTION USING A GREEDY METHOD
To optimize the spectral configuration for a limited number of merged bands, a greedy approach was first used: it performed band selection at the different levels of the hierarchy of merged bands, without taking into account the results obtained at the previous level. Thus a set of merged bands was selected at each level of the hierarchy. The feature selection (FS) score to optimize was the Jeffries-Matusita separability measure. It was optimized at each level of the hierarchy using an incremental optimization heuristic called Sequential Forward Floating Search (SFFS) (Pudil et al., 1994), recalled below in its general formulation.

Sequential Forward Floating Search
The aim is to select fewer than p features from a feature set B. Let S be the selected band subset and J the FS score to maximize.
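A minimal Python sketch of the general SFFS heuristic of (Pudil et al., 1994) is given below, for a score J to maximize. It is an illustration under assumptions: the stopping rule (stop once p features are selected), the function name sffs and the toy score, which rewards relevant bands and penalises adjacent ones, are all hypothetical and are not taken from this paper.

```python
def sffs(features, score, p):
    """Sequential Forward Floating Search: returns, for each subset size
    up to p, the best (score, subset) found for a score to maximise."""
    selected = []
    best = {}                                  # size -> (score, subset)
    while len(selected) < p:
        # Forward step: add the feature that gives the best score.
        candidates = [f for f in features if f not in selected]
        added = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [added]
        k = len(selected)
        if k not in best or score(selected) > best[k][0]:
            best[k] = (score(selected), list(selected))
        else:
            selected = list(best[k][1])        # keep the better subset of size k
        # Floating step: drop features while this improves a smaller subset.
        while len(selected) > 2:
            removed = max(selected,
                          key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != removed]
            if score(reduced) > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (score(reduced), list(reduced))
            else:
                break
    return best


# Toy score: individual relevance minus a penalty for adjacent bands.
relevance = {0: 9, 1: 8, 2: 5, 3: 4}


def toy_score(subset):
    adjacent = sum(1 for a in subset for b in subset if b - a == 1)
    return sum(relevance[f] for f in subset) - 6 * adjacent


best = sffs([0, 1, 2, 3], toy_score, 3)
for size in sorted(best):
    print(size, best[size])
```

The floating step is what distinguishes SFFS from a plain forward search: a feature added early can be discarded later if a better subset of smaller size is found without it.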

Results
Results obtained on the Pavia data set are presented in figure 5: 5 merged bands (as in (Le Bris et al., 2014)) were selected at each level of the hierarchy of merged bands. It can be seen that the positions of the selected merged bands do not change much when climbing the hierarchy, except when reaching the lowest spectral resolution configurations. It can also be noticed that, at some levels of the hierarchy, the position of a selected merged band can move and then come back to its initial position when climbing further. Thus, the bands selected at a level l can be used to initialize the algorithm at the next level l + 1. This modified method is presented in section 5. The merged band subsets selected at the different levels of the hierarchy were evaluated according to a classification quality measure. As in the previous section, the Kappa coefficient reached by a rbf SVM was considered. Results for the Pavia and Indian Pines data sets can be seen in figure 6. At each level of the hierarchy, 5 bands were selected for Pavia, and 10 bands for Indian Pines. It can be seen that these accuracies remain very close to each other whatever the band merging criterion used, and no band merging criterion clearly outperforms the others. Results obtained using merged bands are generally better than those obtained using the original bands.
Figure 6: Kappa (in %) reached for rbf SVM classification for merged band subsets selected at the different levels of the hierarchy for Pavia and Indian Pines data sets using the greedy FS algorithm (x-axis = number of merged bands in the spectral configuration corresponding to the hierarchy level).

Algorithm
The previous merged band selection approach is greedy and computationally expensive. So an adaptation of the SFFS heuristic was proposed to directly take into account the band merging hierarchy in the band selection process. As for the hierarchical band merging algorithm, a bottom-up approach was chosen. Contrary to the greedy approach, this new algorithm uses the band subset selected at the previous lower level when performing band selection at a new level of the hierarchy of merged bands. This algorithm is described below:
Let $S^{(l)} = \{S^{(l)}_i\}_{1 \le i \le p}$ be the set of selected merged bands at level $l$ of the hierarchy. (NB: the same number $p$ of bands is selected at each level of the hierarchy.)
Initialization: the standard SFFS band selection algorithm is applied to the base level $H^{(0)}$ of the hierarchy.
Iterations over the levels of the hierarchy:
Generate $S^{(l+1)}$ from $S^{(l)}$.
Question $S^{(l+1)}$: find the band $s \in S^{(l+1)}$ such that $S^{(l+1)} \setminus \{s\}$ maximizes the FS score, i.e. $s = \arg\max_{z \in S^{(l+1)}} J(S^{(l+1)} \setminus \{z\})$.
Figure 7: Computing times and best Kappa coefficients reached on Pavia (for a 5 band subset) and Indian Pines (for a 10 band subset) data sets for band merging criterion "spectra piece-wise approximation error".
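The two steps named above can be illustrated as follows. This sketch rests on assumptions: the generation step is taken to map each selected band to the merged band of the next level that contains it, and only the search for the questioned band is shown, since what is done with it afterwards is not specified in the text. The names lift and question, the toy levels and the toy score are hypothetical.

```python
def lift(selected, next_level):
    """Map each selected merged band to the band of the next level that
    contains it (duplicates collapse when two selected bands were merged)."""
    lifted = []
    for first, last in selected:
        parent = next(b for b in next_level
                      if b[0] <= first and last <= b[1])
        if parent not in lifted:
            lifted.append(parent)
    return lifted


def question(selected, score):
    """Find the band s whose removal leaves the best-scoring subset,
    i.e. s = argmax over z of J(S minus {z})."""
    return max(selected,
               key=lambda z: score([b for b in selected if b != z]))


# Toy example with merged bands given as inclusive (first, last) indices.
selected_l = [(0, 0), (3, 3), (5, 5)]
level_next = [(0, 1), (2, 2), (3, 3), (4, 5)]
selected_next = lift(selected_l, level_next)


def toy_score(subset):
    """Hypothetical FS score: total width of the selected bands."""
    return sum(last - first + 1 for first, last in subset)


print(selected_next, question(selected_next, toy_score))
```

Reusing the previous selection in this way avoids restarting SFFS from an empty subset at every level, which is where the saving in computing time comes from.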

Results
Results obtained on the Pavia scene for the band merging criterion "spectra piece-wise approximation error" are presented in figure 8: 5 merged bands were selected at each level of the hierarchy, starting from an initial solution obtained at the bottom level of the hierarchy. As for the previous experiments, results were evaluated both for the Pavia (5 selected bands) and Indian Pines (10 selected bands) data sets. The Kappa reached by rbf SVM classification for the merged band subsets selected at the different levels of the hierarchy (built for band merging criterion "spectra piece-wise approximation error") can be seen in figure 9, to be compared with the greedy FS algorithm in figure 6: the results remain very close, whatever the optimization algorithm.
It can be seen from figure 7 that both algorithms lead to equivalent classification performance, while the proposed hierarchy aware algorithm is considerably faster.

CONCLUSION
In this paper, a method was proposed to extract optimal spectral band subsets out of hyperspectral data sets. A hierarchy of merged bands was first built according to a band merging criterion. It was then used to explore the solution space for band extraction: band selection was performed at each level of the hierarchy, either using a greedy approach or an adapted hierarchy aware approach. Classification results tend to be slightly improved when using merged bands, compared to a direct use of the original bands. Besides, in the context of band optimization for sensor design, merged bands are wider than the original ones, which is also a way to collect more photons. Further work will investigate band optimization aiming at selecting merged bands at different levels of the hierarchy. This method will also be applied to a specific sensor design band optimization problem: optimizing spectral bands for urban material classification within the French ANR HYEP project (ANR 14-CE22-0016-01).
In the spectra approximation error criterion, $mean(s_j, H^{(l)}_i)$ denotes the mean of spectrum $s_j$ over spectral domain $H^{(l)}_i$.

Figure 1: Examples of groups of bands superimposed on the between band correlation matrix (for Pavia data set)

Figure 3: Hierarchies of merged bands obtained for different criteria for Pavia data set: spectra piece-wise approximation error (top), between band correlation (middle) and class separability (bottom). x-axis corresponds to the band numbers/wavelengths. y-axis corresponds to the level in the band merging hierarchy (bottom: finest level with original bands, top: only a single merged band). Vertical black lines are the limits between merged bands: the lower in the hierarchy, the more merged bands. Reference spectra of the classes are displayed in colour.

Figure 4: Kappa (in %) reached by a rbf SVM for the different band configurations of the hierarchy (x-axis = number of merged bands in the spectral configuration corresponding to the hierarchy level), for Pavia (top), Indian Pines (middle) and Salinas (bottom) data sets.

Figure 5: Pavia data set: selected bands at the different levels of the hierarchy using the greedy approach for hierarchies of merged bands obtained using different band merging criteria: spectra piece-wise approximation error (top), between band correlation (middle), class separability (bottom). x-axis corresponds to the band numbers/wavelengths. y-axis corresponds to the level in the band merging hierarchy (bottom: finest level with original bands, top: only a single merged band).

Figure 8: Pavia data set: selected bands at the different levels of the hierarchy using the proposed hierarchy aware algorithm for a hierarchy of merged bands obtained using the spectra piece-wise approximation error band merging criterion.

Figure 9: Kappa (in %) reached for rbf SVM classification for merged band subsets selected at the different levels of the hierarchy (built for band merging criterion "spectra piece-wise approximation error") for Pavia and Indian Pines data sets, using the hierarchy aware band selection algorithm.