MONOCULAR DEPTH ESTIMATION FOR NIGHT-TIME IMAGES

Depth estimation plays a pivotal role in numerous computer vision applications. However, depth estimation networks trained exclusively on daytime images tend to perform poorly on nighttime scenes because of domain differences and variations in scene characteristics. To address this limitation, we created a synthetic nighttime dataset by translating daytime images with a generative network, and used the generated images to fine-tune a depth estimation network in order to investigate whether generated data can improve task performance. We evaluated our approach on generated nighttime data and observed a noticeable improvement in depth estimation after fine-tuning compared with before. Our approach yields results on nighttime images that are comparable to those achieved by networks designed for daytime prediction. These findings highlight the effectiveness of synthetic data for improving depth estimation, particularly in nighttime settings.


INTRODUCTION
Depth estimation is a fundamental problem in computer vision that is critical for a wide range of applications, including navigation for autonomous vehicles, augmented reality, and scene understanding. Accurate depth estimation is also essential for tasks such as object detection, tracking, and segmentation, as well as 3D reconstruction. Stereo vision is one method that allows for an accurate estimation of absolute depth using multiple cameras. Another approach is using geometry-based methods such as Structure from Motion (SfM), which is widely used for 3D reconstruction and simultaneous localization and mapping (SLAM). SfM estimates 3D structures from a series of 2D image sequences by exploiting geometric constraints (Zhao et al., 2022). These methods tend to treat depth estimation as a purely geometrical problem, ignoring the content of the images.
Monocular depth estimation seems ill-posed without a second input image to enable triangulation (Godard et al., 2018). Yet, the human brain can estimate depth, or at least relative depth, from a single image. Humans do this by exploiting several cues learned over time, such as perspective, the size of different objects relative to each other, lighting, shadows, and occlusions. By learning these cues, a deep learning model can be trained to estimate the depth from a single image. Many methods were developed to do so, and yielded strong results on popular datasets such as KITTI (Geiger et al., 2013) and Cityscapes (Cordts et al., 2016).
Both datasets, as well as many others used for outdoor depth estimation, consist solely of daytime images. Depth estimation models that are trained using daytime images often exhibit poor performance when applied to night images (Vankadari et al., 2020). This can be caused by the significant differences between the visual characteristics of the two domains. Night images pose two challenges that day images do not. Firstly, there are problems with low visibility and variable illuminance. Secondly, the varying illumination, caused by flickering streetlights or moving cars, can violate the assumption of brightness consistency that holds in daytime images, where all pixels are lit by the same light source, the sun (Wang et al., 2021).
The collection of high-quality depth data is a complex and costly process. That is why many approaches have turned to semi-supervised or self-supervised learning.
Our approach suggests a cost-efficient method of generating night-time images through the use of an image translation generative adversarial network (Zhu et al., 2017). Image translation transforms an image from one domain into another. In this work, we apply this technique to a subset of day images from the KITTI dataset to generate their corresponding synthetic night images. We then use the generated synthetic night images to fine-tune a pre-trained depth estimation network (Godard et al., 2018), thereby improving its performance on night images.
The remainder of the paper is structured as follows. The Related Work section reviews previous approaches to the problem. The Image Translation section describes the architecture used for the task, the training process of the network, and the generation of synthetic night images, explaining how the translation from one domain to another is achieved. The Depth Estimation section describes the depth estimation network and the process of fine-tuning it using the generated images. Lastly, the Experiments section presents the experiments carried out to validate the proposed approach, together with their results.

RELATED WORK
In this section, we present other relevant studies that have addressed the task of depth estimation, specifically focusing on their applicability to night images or similar conditions with limited available data.

Unsupervised and Self-supervised Techniques
These approaches can be employed to eliminate the necessity of collecting ground truth depth information, although the availability of images remains essential. Therefore, these methods prove valuable when only images are available. Many unsupervised and self-supervised techniques have been introduced, yielding positive outcomes for the task. For instance, in (Godard et al., 2018), self-supervised monocular depth estimation was carried out on the KITTI dataset. The architecture and concept of this network will be thoroughly explained in Section 4.

Approaches For Night Depth Estimation
Methods trained on daytime images exhibit poor performance when applied to nighttime images due to the presence of photometric inconsistencies. While lighting consistency is naturally assumed in daytime images, this assumption does not hold true for nighttime images. Lighting inconsistencies can arise from street lamps, car headlights, or variations in illuminance across different areas of the image. Unfortunately, only a limited number of approaches have specifically addressed the challenge of depth estimation in nighttime conditions.
In (Spencer et al., 2020), DeFeat-Net is introduced as a system capable of simultaneously learning depth from a single image and obtaining a dense feature representation of the environment, along with estimating ego-motion between consecutive frames. Notably, this is achieved through a fully self-supervised approach, eliminating the need for any ground truth data other than a monocular stream of images. Moreover, the learned features exhibit invariance across various weather and lighting conditions.
Another approach, proposed by (Vankadari et al., 2020), treats the problem as a domain adaptation challenge. The depth network is trained on daytime images, employing an encoder-decoder architecture. In addition, another encoder is trained using real nighttime images. To train the nighttime encoder, an adversarial domain feature adaptation technique is employed, where the night encoder acts as a generator aiming to produce feature maps from a nighttime image that resemble the feature maps obtained from daytime images. By doing so, the depth decoder becomes capable of decoding both the daytime and nighttime feature maps in a consistent manner.

IMAGE TRANSLATION
Our approach consists of two primary steps: generating night images using an image translation network and then utilizing the generated data to fine-tune the depth estimation network. In this section, we explain the fundamental concept behind the translation network, including the employed losses and the architecture of the network.
To begin with, the data generation process used a network from (Zhu et al., 2017), which implements a cycle generative adversarial network (GAN) architecture (Goodfellow et al., n.d.). First, we provide an overview of the architecture of cycle GANs, followed by an explanation of the losses used during training. Finally, we describe the architecture of the specific network employed in our approach.

Cycle Generative Adversarial Networks
In a Generative Adversarial Network (GAN), two competing networks are designed. The generative model, denoted as G, aims to capture the data distribution of the training data and generate images that closely resemble the real data. The discriminative model, denoted as D, strives to differentiate between real images from the training dataset and those generated by the generative model. The objective is for G to generate images that are indistinguishable from the target domain, while D tries to accurately classify real and fake images. This dynamic creates a learning process in which G minimizes a loss while D maximizes the same loss, known as the adversarial loss.
In the context of Cycle GAN, the network aims to map between two domains. This involves two generative networks, G and F, as depicted in Figure 1. G maps from domain X to domain Y, while F performs the reverse mapping. Additionally, there are two discriminative networks: D_Y, which discriminates domain Y images, and D_X, which discriminates domain X images. The goal here is not only to generate images in both domains but also to enable conditional mapping of scenes from one domain to the other. For instance, if we have a daytime image of a car parked in front of a building and we want to translate it into a nighttime scene, it is not sufficient for the generator to produce a realistic nighttime image. We also require the generator to produce the same scene, with the car and the building, at night. This is controlled by the cycle consistency loss, which ensures that the translated images preserve the essential elements of the original scene.

Adversarial Loss
The adversarial loss is applied to both mapping functions G and F [11]. Consider G mapping from domain X to Y, with D_Y responsible for distinguishing between samples generated by G and real samples from Y. The objective can be expressed as follows:
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\left[\log D_Y(y)\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\log\left(1 - D_Y(G(x))\right)\right] \tag{1}$$
where $\mathcal{L}_{GAN}(G, D_Y, X, Y)$ is the adversarial loss between G and D_Y, and x, y are samples from domains X and Y. Here, G aims to generate images that are similar enough to real ones to fool D_Y into thinking they are real. Thus, G minimizes the objective, while D_Y tries to maximize it by learning to differentiate between real and fake samples. The adversarial competition arises from the two models minimizing and maximizing the same objective.
A similar adversarial loss $\mathcal{L}_{GAN}(F, D_X, Y, X)$ is introduced for the mapping from domain Y to X, with the generator F and the discriminator D_X.
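As a rough numerical illustration of the adversarial objective described above, the following sketch evaluates it from discriminator outputs alone. The function name, the small `eps` term added for numerical safety, and the example scores are assumptions made for illustration; they are not taken from the implementation of (Zhu et al., 2017).

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """E[log D_Y(y)] + E[log(1 - D_Y(G(x)))] estimated from a batch.

    d_real: discriminator outputs D_Y(y) on real samples, values in (0, 1).
    d_fake: discriminator outputs D_Y(G(x)) on generated samples.
    eps:    small constant (an assumption) that avoids log(0).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    real_term = np.mean(np.log(d_real + eps))
    fake_term = np.mean(np.log(1.0 - d_fake + eps))
    return float(real_term + fake_term)

# A discriminator that separates real from generated samples well:
# the objective is close to 0, its largest possible value.
confident = adversarial_loss([0.99, 0.98], [0.01, 0.02])

# A fooled discriminator that outputs 0.5 everywhere:
# the objective drops to about 2 * log(0.5).
fooled = adversarial_loss([0.5, 0.5], [0.5, 0.5])
```

The discriminator benefits from the first situation and the generator from the second, which is the competition described in the text.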

Cycle Consistency Loss
In theory, the adversarial loss alone does not constrain the generative networks to produce images similar to the source image. While they may generate images that closely resemble the target domain, they might not capture the essence of the input image. This misalignment with the original objective of the cycle GAN, which aims to translate an image while preserving the scene, creates the need for cycle consistency.
In Figure 1(b), we observe the translation of image x from domain X to Y using G, followed by translating the result back to X using F, resulting in $\hat{x} = F(G(x))$. Ideally, if both mapping functions G and F are perfect, x and $\hat{x}$ should be identical. Similarly, the cycle consistency is also defined in the opposite direction, as shown in Figure 1(c). To enforce this constraint, the cycle consistency loss is formulated as follows:
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\lVert G(F(y)) - y \rVert_1\right] \tag{2}$$
where $\mathcal{L}_{cyc}(G, F)$ is the cycle loss, x, y are samples from domains X and Y, and G, F are the generator functions. The cycle consistency loss ensures that an image translated to the other domain and back remains consistent with the original input. The full objective function is obtained by summing the adversarial losses of both mappings, each of the form (1), and the cycle consistency loss (2).
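The cycle consistency idea above can be sketched numerically as follows. The two "generators" are hypothetical stand-ins (simple darkening and brightening functions), not the networks used in this work, and the L1 norm is averaged over pixels rather than summed. In (Zhu et al., 2017) the cycle term is additionally weighted by a factor λ in the full objective; the value used here is not stated in the text, so it is left out.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """Mean L1 error of both cycles: x -> G -> F -> x and y -> F -> G -> y."""
    forward = np.mean(np.abs(F(G(x)) - x))   # day -> night -> day
    backward = np.mean(np.abs(G(F(y)) - y))  # night -> day -> night
    return float(forward + backward)

# Hypothetical stand-ins for the generator networks.
G = lambda img: img * 0.5         # "day to night": darken
F_exact = lambda img: img * 2.0   # exact inverse of G
F_poor = lambda img: img * 1.5    # imperfect inverse of G

rng = np.random.default_rng(0)
day = rng.uniform(0.0, 1.0, size=(4, 4, 3))
night = rng.uniform(0.0, 0.5, size=(4, 4, 3))

loss_exact = cycle_consistency_loss(day, night, G, F_exact)  # cycles return the input
loss_poor = cycle_consistency_loss(day, night, G, F_poor)    # cycles change the input
```

When F inverts G exactly the loss vanishes, and any mismatch between the two mappings is penalised, which is what ties the translated image to the content of its source.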

Network Architecture
The architecture was adopted from (Johnson et al., 2016), which showed promising results. The generative network follows an encoder-decoder architecture consisting of three convolutional layers, several residual blocks (He et al., 2015), two fractionally-strided convolutional layers with a stride of ½, and a final convolutional layer that generates RGB images. During training, nine residual blocks were used with an image size of 256x256. For the discriminator, a PatchGAN approach (Isola et al., 2016) was employed with a patch size of 70x70. The discriminator is trained to classify overlapping image patches as real or fake. The patch architecture has fewer parameters than a full-image discriminator and can be applied to images of arbitrary size.
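The 70x70 patch size can be understood as the receptive field of one discriminator output. The short calculation below assumes the layer configuration commonly associated with the 70x70 PatchGAN of (Isola et al., 2016), namely five 4x4 convolutions with strides 2, 2, 2, 1, 1. This configuration is an assumption for illustration; the text above does not list the discriminator layers.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1
    # Walk backwards from the output: each layer widens the field.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf

# Assumed PatchGAN configuration: 4x4 kernels with strides 2, 2, 2, 1, 1.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
patch_size = receptive_field(patchgan_layers)
print(patch_size)  # 70
```

Because the discriminator is fully convolutional, each output value judges one such patch, so the same weights can be slid over inputs of any size.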

Training
Initially, we used a pre-trained version of the network to generate the images. However, the results did not meet our expectations. Consequently, we retrained the network from scratch. Our training was conducted on the Berkeley Deep Drive dataset (Yu et al., 2018), which contains images captured from the viewpoint of a car dashboard. For training, we used a total of 12,454 daytime images and 22,884 nighttime images. The network was trained for a total of 135 epochs. Figure 2 shows the original image alongside the translated nighttime images generated by both the pre-trained network and the network trained from scratch. The image generated by the pre-trained network exhibits scattered extra lights that should not be present, whereas these lights are absent in the version generated by the network trained from scratch.

DEPTH ESTIMATION
The network used for depth estimation is based on the work of (Godard et al., 2018) and (Zhou et al., 2017). Their network was originally trained for depth estimation on the KITTI dataset (Geiger et al., 2013), which exclusively comprises daytime images. We fine-tuned their network using the images translated from day to night.
In this section, we will delve into the fundamental concepts employed by their network, explain the derivation of the loss function, and explore the network architecture.

Self-supervised learning
Self-supervised learning is a form of unsupervised learning wherein the data itself acts as the source of supervision. It involves defining an auxiliary task, known as the pretext task, which guides the loss function for the primary task. Typically, the outcome of the pretext task is not of primary concern. Instead, the focus lies on the intermediate representation. In this case, image reconstruction serves as the pretext task (Godard et al., 2018). The ultimate goal is not the final result of the reconstruction, but rather the intermediate variable used in the process, which is the depth in this particular scenario.

Self-supervised Loss
The framework proposed by (Godard et al., 2018) and (Zhou et al., 2017) involves training two networks simultaneously: a CNN for single-view depth estimation and a camera pose estimation network. The supervision signal is derived from a pretext task known as view synthesis. In this task, the network aims to predict the view of a target frame, denoted as I_t, based on the depth map of that frame, other images capturing the same scene from different poses (referred to as source frames), and the pose mapping between the target and source frames. The source frames I_{t−1} and I_{t+1} are selected as the previous and following frames in a frame sequence relative to I_t. The pose network predicts the relative pose between consecutive frames.
To reconstruct the target view I_t, pixels are sampled from a source view I_s using the predicted depth map D_t and the relative pose T_{t→s}. Let p_t denote the homogeneous coordinates of a pixel in I_t and K denote the camera intrinsics. The projection of p_t into I_s, representing the pixel coordinates p_s of the corresponding pixel in I_s, can be determined as follows:
$$p_s \sim K \, T_{t \to s} \, D_t(p_t) \, K^{-1} \, p_t$$
This projection is applied to each pixel in I_t, with I_{t−1} and I_{t+1} as the source frames, so that the pixels of the target frame are projected onto the source frames. Since p_s generally does not fall on integer coordinates, the value of every target pixel is predicted by interpolating the values of the source pixels neighboring p_s, for both source frames. By following this procedure, an estimated target frame I′_t is obtained from each source frame. The depth network is trained by minimizing the photometric reprojection error L_p:
$$L_p = \sum_{s \in \{t-1,\, t+1\}} pe\left(I_t, I'_t\right)$$
where the sum runs over the source frames, I′_t is the target frame estimated from I_s, and pe is a photometric reconstruction error, e.g. the L1 distance in pixel space between the original target frame and the predicted one.
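The projection step described above can be sketched for a single pixel as follows. The intrinsics, the depth value, and the relative pose are hypothetical numbers chosen for illustration, and the photometric error is the plain L1 distance mentioned in the text as an example, not necessarily the exact error used by (Godard et al., 2018).

```python
import numpy as np

def project_pixel(p_t, depth, K, T_t_to_s):
    """Project target pixel p_t = (u, v) into the source frame.

    Back-projects the pixel with its depth, moves the 3D point with the
    4x4 relative pose, and projects it again with the intrinsics K.
    """
    u, v = p_t
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray of the pixel
    point_t = np.append(depth * ray, 1.0)           # 3D point, target camera frame
    point_s = (T_t_to_s @ point_t)[:3]              # 3D point, source camera frame
    uvw = K @ point_s                               # homogeneous pixel coordinates
    return uvw[:2] / uvw[2]

def photometric_l1(target, reconstructed):
    """Mean L1 distance in pixel space between two frames."""
    return float(np.mean(np.abs(target - reconstructed)))

# Hypothetical intrinsics (focal length 500 px, principal point (320, 96)).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 96.0],
              [0.0, 0.0, 1.0]])

identity_pose = np.eye(4)
shifted_pose = np.eye(4)
shifted_pose[0, 3] = 0.5  # point moves 0.5 units along x in the source frame

p_same = project_pixel((100.0, 50.0), depth=10.0, K=K, T_t_to_s=identity_pose)
p_moved = project_pixel((100.0, 50.0), depth=10.0, K=K, T_t_to_s=shifted_pose)
```

With the identity pose the pixel maps onto itself. With the sideways shift it moves by focal length times translation divided by depth, here 500 × 0.5 / 10 = 25 pixels, which shows how the predicted depth controls where each target pixel is sampled in the source frame.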

Network Architecture
The depth estimation network employs a U-Net architecture (Weng and Zhu, 2015), which consists of an encoder-decoder network with skip connections. The encoder network is based on ResNet18 (He et al., 2015), and the weights are initialized using pre-trained weights from ImageNet (Russakovsky et al., 2014).
For the pose estimation network, the architecture is derived from (Wang et al., 2017). It also uses ResNet18 (He et al., 2015) as its foundation. The network takes two frames as input and produces a single 6-degrees-of-freedom (DoF) relative pose between the frames.
In the training process for monocular depth estimation, a sequence of three consecutive frames is used, and the pose is estimated between every two consecutive frames within that sequence. To augment the data, horizontal flipping is applied and, with a 50% chance, random brightness, contrast, saturation, and hue jitter. The augmentation is performed on all three input images in a consistent manner.
The models are implemented using PyTorch (Paszke et al., n.d.) and trained using the Adam optimizer (Kingma and Ba, 2014) for 20 epochs. A batch size of 12 is used, and the training is conducted on the KITTI dataset (Geiger et al., 2013). Both the input and output images have a resolution of 640x192. During training, the learning rate is 10^-4 for the first 15 epochs and then drops to 10^-5 for the remaining five epochs.
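The learning-rate schedule described above amounts to a single step decay, which can be written as a small function. The function and parameter names are illustrative assumptions; in PyTorch a comparable effect can be obtained with a step scheduler such as `torch.optim.lr_scheduler.StepLR`.

```python
def learning_rate(epoch, base_lr=1e-4, drop_epoch=15, factor=0.1):
    """Step schedule: base_lr for epochs 0..drop_epoch-1, then base_lr * factor."""
    return base_lr if epoch < drop_epoch else base_lr * factor

# Learning rate for each of the 20 training epochs (epochs counted from 0).
schedule = [learning_rate(epoch) for epoch in range(20)]
```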
The training process described above was conducted by the original authors exclusively using daytime images from the KITTI dataset. In the following sections, we describe our fine-tuning process.

EXPERIMENTS
Firstly, we discuss the results of the image translation and highlight some of the challenges encountered. Subsequently, we present the various scenarios employed to evaluate the performance of the depth estimation network. The test set used in all of our experiments consists of selected images from the KITTI dataset that have been translated from day to night.

Incompatible resolution challenge
The first challenge we encountered in our work arose from using two different networks. The image translation network from (Zhu et al., 2017) produced images with a fixed resolution of 256x256, irrespective of the input resolution. However, this resolution was incompatible with the depth estimation network, which expected inputs of size 640x192. Additionally, the images in the KITTI dataset had dimensions of 1241x376.
Resizing the images resulted in significant degradation in quality. To address this issue, we divided the images into sub-images and fed them to the network individually. Subsequently, the translated sub-images were combined to form the final translated image.
Initially, we experimented with dividing the image into four non-overlapping sub-images. As depicted in Figure 3, the different divisions were easily distinguishable. Each pixel in the input sub-image contributed to the overall color palette of the corresponding output, so sub-images translated independently received different palettes, resulting in a fragmented appearance.
To mitigate this effect, we adopted a different approach and divided the image into overlapping sub-images with a horizontal shift of 20 pixels. In the final image, each pixel's value was calculated as the average of all values from the sub-images that contained that pixel. As shown in Figure 4, the region in the middle was present in all four sub-images, so the values of that region in the final image were the average across the four sub-images. Note that with a 20-pixel shift we used more than four sub-images, which ultimately resulted in a final image dimension of 640x192. Figure 3 demonstrates the noticeable improvement achieved through this approach.
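The overlapping-division strategy described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: tiles span the full image height, the translation function returns a tile of the same shape as its input, one extra tile is placed flush with the right edge, and the `darken` function is a hypothetical stand-in for the trained generator.

```python
import numpy as np

def translate_with_overlap(image, translate, tile_width=256, shift=20):
    """Translate an (H, W, C) image tile by tile and average the overlaps."""
    height, width, _ = image.shape
    if width < tile_width:
        raise ValueError("image is narrower than one tile")

    # Tile start columns every `shift` pixels, plus a last tile at the right edge.
    starts = list(range(0, width - tile_width + 1, shift))
    if starts[-1] != width - tile_width:
        starts.append(width - tile_width)

    total = np.zeros(image.shape, dtype=float)  # sum of translated values per pixel
    count = np.zeros((height, width, 1))        # number of tiles covering each pixel
    for x0 in starts:
        tile = image[:, x0:x0 + tile_width]
        total[:, x0:x0 + tile_width] += translate(tile)
        count[:, x0:x0 + tile_width] += 1
    return total / count                        # per-pixel average over all tiles

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(192, 640, 3))
darken = lambda tile: tile * 0.3  # hypothetical stand-in for the generator
night = translate_with_overlap(image, darken)
```

Because every pixel away from the borders is covered by many tiles, the palette differences between neighbouring tiles are averaged out instead of appearing as visible seams.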

Translation Results
The pre-trained translation model's results were not consistently perfect, with certain common errors observed in some translated images. Figure 2 demonstrates an instance where the network erroneously predicted additional non-existent lights on the left side of the first image. The network's objective is to learn how to illuminate lights that are not lit during the day but should appear at night, such as car headlights. However, there are instances where the network mistakenly identifies other image elements as lights when they are not. To

Evaluation Metrics for Depth Estimation
We follow the evaluation metrics employed by (Godard et al., 2018), which consist of error metrics where lower values indicate better performance, as well as accuracy metrics where higher values indicate better performance.

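Relative error using the absolute difference
$$\mathrm{AbsRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert gt_i - pred_i \rvert}{gt_i}$$
where gt is the ground truth depth map generated from the Velodyne sensor points included in the KITTI dataset, pred is the predicted depth map, and N is the number of pixels with ground truth.

Relative error using the squared difference
$$\mathrm{SqRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{(gt_i - pred_i)^2}{gt_i}$$

Root mean square error
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (gt_i - pred_i)^2}$$

Root mean square error of the log
$$\mathrm{RMSE}_{\log} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\log gt_i - \log pred_i)^2}$$

The parameters chosen for the fine-tuning were: a learning rate of 10^-4, the Adam optimizer (Kingma and Ba, 2014), a batch size of 10, and a smoothness regularization weight λ of 0.001. All the training in the following scenarios was conducted for 22 epochs.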
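The evaluation metrics named in this section can be computed as in the sketch below. It uses the standard definitions from the monocular depth literature: the error metrics as listed above, and the accuracy metrics δ as the fraction of pixels whose ratio max(gt/pred, pred/gt) falls below 1.25, 1.25² and 1.25³. The function name, the dictionary keys, and the toy depth values are assumptions; any further preprocessing applied by (Godard et al., 2018), such as scaling or depth capping, is omitted.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Error and accuracy metrics over pixels that have ground truth (gt > 0)."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    valid = gt > 0                      # LiDAR ground truth is sparse
    gt, pred = gt[valid], pred[valid]
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": float(np.mean(np.abs(gt - pred) / gt)),
        "sq_rel": float(np.mean((gt - pred) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((gt - pred) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "delta_1": float(np.mean(ratio < 1.25)),
        "delta_2": float(np.mean(ratio < 1.25 ** 2)),
        "delta_3": float(np.mean(ratio < 1.25 ** 3)),
    }

# Toy example: a value of 0 in gt marks a pixel without ground truth.
gt = np.array([[10.0, 0.0], [20.0, 40.0]])
pred = np.array([[12.0, 5.0], [20.0, 20.0]])
metrics = depth_metrics(gt, pred)
```

In the toy example the three valid pixels have relative errors 0.2, 0 and 0.5, so the absolute relative error is their mean, and two of the three pixels satisfy every δ threshold.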

Different Scenarios and Quantitative results
The test set consists of 697 images from KITTI (Geiger et al., 2013) translated from day to night. These images were used to evaluate the following scenarios.
• We initially evaluated the pretrained network from (Godard et al., 2018) on the original daytime images of the test set, without performing any fine-tuning or image translation on our part.
• We then evaluated the performance of the pretrained network on the translated night images of the test set without any fine-tuning.
• A total of 39810 images were generated for training purposes, along with an additional 4424 images for validation. For each image that was translated for training or validation, the preceding and subsequent frames were also translated. The generated data was not subjected to any filtering; all of it was used for fine-tuning the network. Subsequently, the network's performance was evaluated on the same test set.
• The images were filtered using the GAN's discriminator network. This discriminator acts as a classifier, determining whether images are genuine nighttime images or not, and assigning a score ranging from 0 to 1, where a score of 1 indicates a real image. The training images were filtered based on this score, selecting only those with a score higher than 0.85. As a result, 3600 images were chosen for training, while the validation and test sets remained unchanged.
• The filtering process was repeated using the same methodology as before, but this time employing a cutoff score of 0.7.As a result, 17293 images were chosen for training.
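The filtering used in the last two scenarios can be sketched as a simple threshold on the discriminator score. How the patch-wise discriminator outputs were reduced to a single score per image is not stated, so this sketch assumes that one score per image is already available; the file names and scores are hypothetical.

```python
def filter_by_score(scored_images, threshold):
    """Keep the images whose discriminator score is higher than the threshold."""
    return [name for name, score in scored_images if score > threshold]

# Hypothetical (file name, discriminator score) pairs.
scored = [
    ("0001.png", 0.91),
    ("0002.png", 0.74),
    ("0003.png", 0.42),
    ("0004.png", 0.86),
]

strict = filter_by_score(scored, 0.85)  # stricter cutoff keeps fewer images
loose = filter_by_score(scored, 0.7)    # looser cutoff keeps more images
```

The two cutoffs trade the realism of the retained images against the size and variety of the training set.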
As observed from Table 1, the first two rows serve as the baseline for our comparison.The test on daytime images represents the ideal scenario, showcasing the performance of the network trained specifically on daytime images.If our test results approach those of the daytime images, it indicates that our depth estimation works well at night, similar to how the pretrained version performs during the day.
The second row corresponds to the test on translated nighttime images using the pre-trained network without any fine-tuning. This serves as our starting point for improvement. The remaining rows in the table show our tests after the fine-tuning process. We observe a significant enhancement compared to the initial nighttime test, although the performance has not yet reached the level of the daytime test.
[Table 1 columns: Scenario | the lower, the better: SqRel, RMSE, RMSE log | the higher, the better: δ < 1.25, δ < 1.25², δ < 1.25³]
Further analysis involves the filtering test, where we selectively choose translated images based on their authenticity score as determined by the discriminator.Initially, we select all images above a score of 0.7, amounting to approximately 17 thousand images.This filtering mildly improves some of the evaluation criteria compared to not filtering at all.However, when using a stricter threshold of 0.85, resulting in 3600 images, the performance is worse than not filtering at all.This observation indicates that the variety and size of the training set play a crucial role in the overall outcome.
Table 2 presents a comparison of various depth estimation methods conducted by (Godard et al., 2018) using daytime images from the KITTI dataset.The methods are denoted by D, S, and M, representing the use of depth ground truth for supervision, self-supervised stereo vision, and self-supervised monovision, respectively.In the last row, we showcase our best results evaluated on nighttime images.It is important to note that the test is not perfect since it was not conducted on the same data.However, our intention is to demonstrate the performance of our model in its specific task (night depth estimation) compared to different models designed for day depth estimation.
As observed from the table, the evaluation results of our model fall somewhere between the results of the other approaches.It is evident that our model does not perform as well on nighttime images as it does on daytime images.Nevertheless, it remains comparable to other methods specifically developed for daytime depth estimation.It is worth noting that the other models were tested on original daytime images, while our model was evaluated on generated nighttime images.

Qualitative Test
The model was tested on actual nighttime images obtained from the Berkeley Deep Drive dataset (Yu et al., 2018). Since we do not possess ground truth depth information for this dataset, we used it solely for qualitative purposes, comparing the appearance of the depth maps generated by the model before and after fine-tuning. The Berkeley dataset comprises images captured in various environments, including nighttime scenes. The results can be observed in Figure 5.

CONCLUSION
Our approach noticeably improved depth estimation on nighttime images. Based on the conducted experiments, we conclude that image translation holds considerable potential as an affordable image synthesis tool for generating data that can be used by various tasks. However, it requires further refinement and examination to understand the impact of the generated data on training. Furthermore, image translation holds promise beyond day and night scenarios, such as simulating different seasons or transforming images from a simulated environment to resemble those captured in real-life scenes.

Figure 2. Arranged from top to bottom are the original image, the image translated by the pre-trained network, and the image translated by the network trained from scratch.

Figure 3. Comparison of the non-overlapping division (top) and the overlapping division (bottom).

Figure 5. From top to bottom: the original images, the depth maps before fine-tuning, and the depth maps after fine-tuning.

Table 1. Evaluation of different training and test scenarios.