A Long-time-series Spatio-Temporal-Spectral Fusion Method via Multi-task Learning

Due to the limitations of sensor hardware, cloud and fog cover, and data transmission, it is difficult for data acquired by a spaceborne remote sensing imager to achieve high temporal, spatial and spectral resolution at the same time, which limits its application in long-time-series high-frequency monitoring. Several spatio-temporal-spectral algorithms can already fuse temporal, spatial and spectral resolution, but most of them are based on one or two discrete images, and integrated fusion at the multi-dimensional level has not yet been realized. There is currently no research on spatio-temporal-spectral fusion methods based on long-time-series multi-scene remote sensing data. To address the spatio-temporal-spectral resolution bottleneck of remote sensing data, this study proposes a new long-time-series spatio-temporal-spectral fusion method based on multi-task learning to realize the multi-dimensional optimization of multi-source remote sensing data resolutions. Experiments used simulated and real datasets, both of which contain 4 images of 10 m ZY1-02D multispectral data, 7 images of 16 m GF-6 multispectral data and 4 images of 30 m ZY1-02D hyperspectral data, and produced 7 images of 10 m hyperspectral data. The results show that our method performs best among the compared methods. This method can provide effective data support for applications based on long-time-series remote sensing data.


Introduction
With the rapid development of remote sensing satellites and earth observation technology, the number of in-orbit earth observation satellites has grown explosively. To date, a massive amount of remote sensing data has been acquired. The accumulation of historical remote sensing data has made applications based on long-time-series monitoring possible. However, the potential of long-time-series hyperspectral data has not been fully tapped. Due to the constraints of sensors, orbital height and data transmission capability (Wang and Wang, 2010), there is currently no spaceborne remote sensing satellite that can obtain high temporal, spatial and spectral resolution at the same time. Each data source has advantages in one or two resolutions but is inferior in the others, which has become a major limiting factor in long-time-series remote sensing monitoring.
In practical applications, for the area covered by a single scene within the research area, the actual temporal resolution often differs significantly from the theoretical temporal resolution. Because of cloud cover, the temporal resolution of a dataset also rarely reaches the nominal revisit period. At present, temporal resolution can be improved through satellite networking and data fusion. However, data obtained from different cameras may differ in irradiance characterization, and different imaging times may introduce further deviation.
Thus, this study aims to solve the mutual constraints on the temporal, spatial and spectral resolution of hyperspectral satellites, a bottleneck that hinders the application of long-time-series hyperspectral datasets. We propose a long-time-series spatio-temporal-spectral fusion method. After discrete multi-temporal remote sensing data are organized into long-time-series multidimensional remote sensing data (MDD), a multi-task learning model jointly extracts the multidimensional spatio-temporal-spectral information. The model contains three convolutional neural network branches that extract spatial, temporal and spectral mappings separately and learn them jointly through a multi-task structure. This method can achieve multi-dimensional integration of spatio-temporal-spectral information and improve the resolution of hyperspectral data.
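As a minimal illustration of organizing discrete multi-temporal scenes into one multidimensional array, the sketch below sorts co-registered scenes by acquisition date and stacks them along a time axis. The `(time, bands, height, width)` layout, the function name `build_mdd`, and the random example scenes are assumptions made for this illustration; the text does not specify how the MDD is stored.

```python
from datetime import date

import numpy as np


def build_mdd(scenes):
    """Stack {acquisition date: (bands, h, w) array} into a 4-D array.

    Returns the sorted dates and an array of shape (time, bands, h, w).
    All scenes are assumed to be co-registered on the same grid.
    """
    dates = sorted(scenes)
    shapes = {scenes[d].shape for d in dates}
    if len(shapes) != 1:
        raise ValueError("all scenes must share the same (bands, h, w) shape")
    return dates, np.stack([scenes[d] for d in dates], axis=0)


# Hypothetical scenes; real data would come from the registered image files.
rng = np.random.default_rng(0)
scenes = {
    date(2022, 6, 25): rng.random((4, 8, 8)),
    date(2021, 12, 13): rng.random((4, 8, 8)),
    date(2022, 3, 1): rng.random((4, 8, 8)),
}
dates, mdd = build_mdd(scenes)
```

Keeping the dates alongside the array preserves the irregular sampling intervals, which a fixed time axis alone would hide.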

Related Work
Spatio-temporal-spectral fusion can be traced back to the work of Huang et al. in 2013. Using the maximum a posteriori (MAP) criterion and two sensors, Landsat and MODIS, the authors explored integrated fusion of multi-temporal spatial-spectral images. Two scenes of 19-band MODIS images with 250 m/500 m/1000 m spatial resolution were fused with one scene of 6-band Landsat imagery with 30 m spatial resolution to reconstruct the temporally missing image with MODIS spectral resolution and Landsat spatial resolution, combining spatiotemporal fusion with spatial-spectral fusion (Huang et al., 2013). Shen et al. used MAP to construct an integrated spatio-temporal-spectral fusion framework for multi-source sensors (Shen et al., 2016), which can be used for multi-view spatial fusion, spatial-spectral fusion, spatio-temporal fusion, and spatio-temporal-spectral fusion. The performance of the spatio-temporal-spectral fusion method was validated using MODIS, Landsat, and SPOT data. Jiang et al. constructed a heterogeneous spatio-temporal-spectral integrated fusion framework using a Deep Residual CycleGAN (Jiang et al., 2021). Peng et al. built an integrated spatio-temporal-spectral fusion model on four-dimensional tensors, based on the semicoupled sparse tensor factorization method. In addition to verifying spatiotemporal fusion and spatial-spectral fusion, they used three groups of Hyperion data to simulate low-spatial, low-temporal, high-spectral-resolution data and high-spatial, high-temporal, low-spectral-resolution data for integrated spatio-temporal-spectral fusion (Peng et al., 2021). Wei et al. designed a spatio-temporal-spectral fusion method that performs panchromatic image sharpening and spatiotemporal fusion in series, using data from cameras on the same Gaofen-1 platform, aiming to construct 2 m multispectral images with high temporal resolution. The method was experimentally validated using three temporal phases of Gaofen-1 panchromatic, multispectral, and wideband data; however, integrated spatio-temporal-spectral fusion was not achieved (Wei et al., 2021). All of the above work has contributed substantially to improving the resolution of multi-source remote sensing data, but most of it targets the fusion of two or three scenes, and there is no research on spatio-temporal-spectral fusion methods for long-time-series or multi-scene remote sensing data involving more than 4 time phases.
For each residual block, the relation between its input and output can be simplified as $y = x + \sum_{i=1}^{C} T_i(x)$, in which $y$ is the output of the residual block, $x$ is the input of the residual block, $C$ is the number of group convolutions, that is, the cardinality of the network, and $T_i$ is the feature extraction network of the $i$-th scale. This approach converges easily while increasing the cardinality of the network, which can improve the accuracy of the network without increasing its depth. The architecture of the residual block of the mapping feature extraction network is shown in Figure 2.
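The residual block can be illustrated with a small NumPy sketch, assuming the usual aggregated residual form in which the block output is the input plus the sum of the parallel branch outputs. This is a simplification under stated assumptions, not the network in Figure 2: each branch here is a 1x1 convolution followed by ReLU, and `residual_block`, `branch_weights` and the shapes are hypothetical.

```python
import numpy as np


def residual_block(x, branch_weights):
    """Return x + sum_i T_i(x); len(branch_weights) is the cardinality.

    x              : feature map of shape (channels, height, width)
    branch_weights : list of (channels, channels) matrices, one per branch;
                     each branch T_i is a 1x1 convolution followed by ReLU
                     (a simplification chosen for this sketch).
    """
    y = x.copy()
    for w in branch_weights:
        # A 1x1 convolution is a matrix product over the channel axis.
        t = np.einsum("oc,chw->ohw", w, x)
        y += np.maximum(t, 0.0)  # ReLU, then aggregate into the output
    return y


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
weights = [0.1 * rng.standard_normal((8, 8)) for _ in range(3)]  # cardinality 3
y = residual_block(x, weights)
```

Adding a branch widens the block without adding a layer, which is why cardinality can grow while depth stays fixed.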

Joint Extraction Based on Multitask Learning
Through a multi-feature loss function based on multi-task learning, the objective loss functions of all single-task feature network branches are optimally combined to achieve joint feature extraction and fusion. Assume that the learnable parameter set of the multi-task learning network is $W$, which represents all weight parameters in the network. The input set of the multi-task learning network is $X$, the label set is $Y$, and the set of all predicted labels is $\hat{Y}$. For each single-task feature network branch, the loss function is $L_i$, where $i = 1, \dots, n$ indexes the branches. The total loss function of the multi-task learning network is designed as a linear combination of the branch losses, namely $L = \sum_{i=1}^{n} w_i L_i$, in which $w_i$ is the weight of the loss function of the $i$-th task feature network branch. The most direct way is to assign the weights manually, but the model is sensitive to these weights, which may affect the final accuracy. Therefore, a set of learnable parameters is added so that the weights are learned together with the network. To force a positive value and avoid a negative value as a learnable parameter decreases, a regularization term is added, and the total loss function of multi-task learning is finally the weighted branch losses plus this regularization term. The feature information extracted by the model is then fused and reconstructed in multiple dimensions: by concatenating and fusing the outputs of the mapping feature extraction networks, a spatio-temporal-spectral fusion dataset is obtained.
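The equations for the learnable weights did not survive in this text, so the following sketch is an assumption rather than the authors' formulation. It uses one common scheme that matches the description, the uncertainty-style weighting of Kendall et al. with the regularizer of Liebel and Körner: each branch loss $L_i$ is scaled by $1/(2\sigma_i^2)$ with a learnable $\sigma_i$, and the regularizer $\ln(1 + \sigma_i^2)$ stays non-negative for every $\sigma_i$, whereas a plain $\ln \sigma_i$ would turn negative for $\sigma_i < 1$. The function name `multitask_loss` and the example values are hypothetical.

```python
import numpy as np


def multitask_loss(branch_losses, sigmas):
    """Combine per-branch losses with learnable uncertainty-style weights.

    Assumed form (not confirmed by the source):
        L = sum_i [ L_i / (2 * sigma_i**2) + ln(1 + sigma_i**2) ]
    """
    losses = np.asarray(branch_losses, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    if losses.shape != sigmas.shape:
        raise ValueError("one sigma is required per branch loss")
    weights = 1.0 / (2.0 * sigmas ** 2)   # effective weight of each branch
    regularizer = np.log1p(sigmas ** 2)   # ln(1 + sigma^2) >= 0 for any sigma
    return float(np.sum(weights * losses + regularizer))


# Hypothetical losses of the spatial, temporal and spectral branches.
total = multitask_loss([0.8, 0.5, 0.3], sigmas=[1.0, 1.0, 1.0])
```

In a training framework the `sigmas` would be trainable parameters updated by the same optimizer as the network weights; here they are plain numbers so that the arithmetic is easy to follow.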

Datasets and Experimental Settings
The technical goal is to construct a long-time-series spatio-temporal-spectral fusion method based on multi-task learning, and to use ZY1-02D/E multispectral data, ZY1-02D/E hyperspectral data, and GF-6 wide-swath multispectral data to perform fusion reconstruction of a long-time-series dataset, obtaining a remote sensing dataset with the spatial resolution of ZY1-02D/E multispectral data, the temporal resolution of GF-6 and the spectral resolution of ZY1-02D/E hyperspectral data. The theoretical target resolution indicators before and after fusion are compared in Table 1.
Two datasets are used in this study. Dataset I is a simulated dataset built from hyperspectral images of 7 dates acquired by ZY1-02D. The images were acquired between Dec. 13th, 2021 and Jun. 25th, 2022 at N37.5 E116.7 in China. Dataset I mainly contains buildings and crop fields. The hyperspectral data were simulated to 4 images of 90 m hyperspectral data, 7 images of 60 m multispectral data and 30 m multispectral data. The spatial simulation was done using bilinear interpolation, and the spectral simulation used the spectral response functions of the GF-6 wide-swath camera and the ZY1-02D visible/near-infrared camera.
Dataset II is a real dataset that contains 4 images sensed by the ZY1-02D hyperspectral camera and multispectral camera as well as the Gaofen-6 wide-swath camera. The images were acquired between Sep. 7th, 2020 and Jan. 27th, 2021. The dataset was preprocessed with geometric correction, geospatial registration and normalized radiometric correction.
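The simulation steps for Dataset I can be sketched as follows. This is a minimal NumPy illustration, not the authors' processing chain: the spectral response matrices are random placeholders (the real GF-6 and ZY1-02D response functions would be needed), the band counts and image sizes are arbitrary, and `simulate_bands` and `bilinear_resize` are hypothetical helper names.

```python
import numpy as np


def simulate_bands(cube, srf):
    """Spectrally degrade a hyperspectral cube of shape (bands, h, w).

    srf has shape (out_bands, bands); each row is a spectral response
    function, normalized here so that its weights sum to 1.
    """
    srf = srf / srf.sum(axis=1, keepdims=True)
    return np.tensordot(srf, cube, axes=([1], [0]))


def bilinear_resize(img, out_h, out_w):
    """Resample a (bands, h, w) image to (bands, out_h, out_w) bilinearly."""
    _, h, w = img.shape
    # Sample positions in input pixel coordinates (pixel-center alignment).
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = img[:, y0][:, :, x0] * (1 - wx) + img[:, y0][:, :, x1] * wx
    bottom = img[:, y1][:, :, x0] * (1 - wx) + img[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy


rng = np.random.default_rng(0)
hyper_30m = rng.random((20, 90, 90))   # stand-in for a 30 m hyperspectral cube
srf_ms = rng.random((4, 20))           # placeholder multispectral responses
srf_wide = rng.random((4, 20))         # placeholder wide-swath responses

multi_30m = simulate_bands(hyper_30m, srf_ms)        # 30 m multispectral
hyper_90m = bilinear_resize(hyper_30m, 30, 30)       # 90 m hyperspectral
multi_60m = bilinear_resize(simulate_bands(hyper_30m, srf_wide), 45, 45)  # 60 m
```

Plain bilinear resampling applies no low-pass filter before downsampling, so at an integer factor of 3 it effectively samples single pixels; a production pipeline might prefer an anti-aliased or sensor point-spread-function model.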

Results
Experiments on the effect of different kernels were conducted on the two datasets. The kernel sizes were set to 1, 1/3 and 1/3/5, corresponding to different cardinalities. The results for the two datasets are shown in Table 2 and Table 3. For both datasets, 1 and 1/3 perform the best overall.
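For reference, three of the indicators used in this section, the spectral angle mapper (SAM), the correlation coefficient (CC) and the peak signal-to-noise ratio (PSNR), can be computed as in the sketch below. It assumes images stored as `(bands, height, width)` arrays scaled to [0, 1]; the function names and the averaging conventions are choices made for this illustration, and the exact definitions used in the experiments may differ.

```python
import numpy as np


def sam_degrees(ref, est, eps=1e-12):
    """Mean spectral angle (degrees) between per-pixel spectra."""
    dot = np.sum(ref * est, axis=0)
    norms = np.linalg.norm(ref, axis=0) * np.linalg.norm(est, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())


def cc(ref, est):
    """Pearson correlation coefficient averaged over bands."""
    r = ref.reshape(ref.shape[0], -1)
    e = est.reshape(est.shape[0], -1)
    r = r - r.mean(axis=1, keepdims=True)
    e = e - e.mean(axis=1, keepdims=True)
    num = np.sum(r * e, axis=1)
    den = np.sqrt(np.sum(r ** 2, axis=1) * np.sum(e ** 2, axis=1))
    return float(np.mean(num / den))


def psnr(ref, est, max_value=1.0):
    """Peak signal-to-noise ratio in dB over the whole cube."""
    mse = np.mean((ref - est) ** 2)
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Synthetic example: a random "ground truth" cube and a slightly noisy copy.
rng = np.random.default_rng(0)
truth = rng.random((6, 32, 32))
fused = np.clip(truth + 0.01 * rng.standard_normal(truth.shape), 0.0, 1.0)
scores = (sam_degrees(truth, fused), cc(truth, fused), psnr(truth, fused))
```

Lower SAM means better spectral fidelity, while higher CC and PSNR mean better overall agreement with the reference.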
For the spectral indicator SAM, smaller kernels perform better, as the results of kernel 1 are the smallest among the three groups. For the overall indicators CC and PSNR and the spatial indicator SSIM, 1/3 performs best on the simulated dataset, whereas 1 performs best on the real dataset. This may be due to the fact that the simulation was conducted on hyperspectral data, which are more blurred than the finer multispectral data, and the 1/3 kernel can better capture the blur effect. For the real dataset, where the fine images have no blur effect, a smaller kernel is better able to capture fine textures. The running times show that smaller kernels cost less time. From these results we conclude that, for real datasets with fine multispectral data, the model with only kernel 1 performs the best.
To show the effectiveness of our method, we compare its results on the real dataset with combinations of spatiotemporal and spatial-spectral methods. We chose the traditional methods ESTARFM (Zhu et al., 2010) and CNMF (Yokoya et al., 2012), as well as the deep learning methods STFDCNN (Peng et al., 2020) and SRECNN (Peng et al., 2019). Among them, STFDCNN is an integrated spatio-temporal method that can process multi-temporal images of all dates. Each two-stage method combines one spatiotemporal method with one spatial-spectral method. The results are shown in

Conclusions
This article proposed a long-time-series spatio-temporal-spectral fusion method. By organizing multi-temporal remote sensing datasets into a four-dimensional dataset, temporal information can be extracted compactly. The method extracts spatial, spectral and temporal information with three convolutional neural network branches and fuses them jointly through a multi-task learning structure. The method was tested on a simulated dataset and a real dataset, and was compared with two-stage fusion methods. The results show that our method performs the best in both accuracy and time cost.

Figure 1. Overall Framework

Figure 2. Residual Block

Figure 3. RGB composites of Ground Truth and Our Methods on Dataset I

Table 1. Resolutions of Different Instruments

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-1-2024, ISPRS TC I Mid-term Symposium "Intelligent Sensing and Remote Sensing Application", 13-17 May 2024, Changsha, China

Table 4 and Table 5 show the results for the different dates on both datasets, for the setting with the relatively best average result.

Table 2. Quantitative Results of Different Kernel Size on Dataset I

Table 3. Quantitative Results of Different Kernel Size on Dataset II

Table III below. We can see that, compared to the two-stage methods, our method performs the best in all quantitative indices as well as in time cost. This shows that our integrated method can utilize spatio-temporal-spectral information better than two-stage methods. Deep learning methods cost less time, and integrated methods such as STFDCNN can effectively reduce processing time.

Table 4. Quantitative Results of Different Methods on Dataset I

Table 5. Quantitative Results of Different Methods on Dataset II

(a) GT on date 2 (b) Ours on date 2