COMPARING DIFFERENT MACHINE LEARNING OPTIONS TO MAP BARK BEETLE INFESTATIONS IN CROATIA

This paper presents different approaches to mapping bark beetle infested forests in Croatia. Bark beetle infestation presents a threat to forest ecosystems, and the large, inaccessible areas involved make infested stands difficult to map. The paper analyses the machine learning options available in the open-source software QGIS and SAGA GIS. All options are applied to Copernicus data, namely Sentinel 2 satellite imagery. The machine learning and classification options are maximum likelihood classifier, minimum distance, artificial neural network, decision tree, K-nearest neighbor, random forest, support vector machine, spectral angle mapper and Normal Bayes. The respective kappa values are 0.71, 0.72, 0.81, 0.68, 0.69, 0.75, 0.26, 0.60 and 0.41, which shows the highest classification accuracy for the artificial neural network method and the lowest for the support vector machine.


INTRODUCTION
Remote sensing is the process of acquiring information about an object or phenomenon without making physical contact with it. It involves the use of various sensors to capture data from a distance, producing, for example, aerial photography, satellite imagery and LiDAR data. One of the most important applications of remote sensing is classification, the process of categorizing objects or areas based on their characteristics in the acquired data. In recent years, machine learning methods have become increasingly popular for remote sensing classification. Machine learning algorithms, such as artificial neural networks, support vector machines and random forests, are used to automatically learn and recognize patterns in the data and then assign classification labels to the objects or areas. These methods have been shown to be effective for a wide range of remote sensing applications, including land use and land cover mapping, vegetation monitoring and urban growth analysis.
In this paper we explore the use of machine learning methods for remote sensing data classification (Feng et al., 2015; Foody, 2002; Jain et al., 2016; Jog and Dixit, 2016; Kranjčić et al., 2019a; Singh et al., 2017). We discuss various algorithms, their strengths and weaknesses, and their suitability for different types of remote sensing data. We also investigate the impact of different input features, such as spectral, textural and contextual information, on the performance of the classifiers. Finally, we compare the results of different machine learning methods with traditional classification techniques and discuss the potential for future research and development in this field. However, owing to page limitations, each method and comparison is described only briefly. The following methods are used and discussed: maximum likelihood, minimum distance, artificial neural network, decision tree, K-nearest neighbor, random forest, support vector machine, spectral angle mapper and naïve Bayes.
We executed all the classification methods using the OpenCV library from within the QGIS and SAGA GIS software. The paper is organised as follows: the second chapter presents the methods, study area and data sets used; chapter three deals with results and discussion; chapter four presents the conclusions; and the last chapter lists the references used.

METHODS, STUDY AREA AND DATA SETS
In this chapter, each method is briefly explained and the study area is presented together with the data sets used.

Maximum likelihood
Maximum likelihood (ML) is a supervised classification method based on Bayes' theorem. It is a statistical method that uses probability theory to classify each pixel in an image into one of several land cover categories based on its spectral characteristics (Ahmad and Quegan, 2012).
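As an illustration of the maximum likelihood rule, the following minimal Python sketch fits one Gaussian model per class and assigns a pixel to the class with the highest log-likelihood. It assumes two spectral bands, equal class priors and normally distributed class signatures. The reflectance values and function names are hypothetical; they are not taken from the study or from the OpenCV implementation.

```python
import math

def fit_gaussian(samples):
    """Estimate the mean vector and 2x2 covariance of two-band training pixels."""
    n = len(samples)
    m0 = sum(s[0] for s in samples) / n
    m1 = sum(s[1] for s in samples) / n
    s00 = sum((s[0] - m0) ** 2 for s in samples) / (n - 1)
    s11 = sum((s[1] - m1) ** 2 for s in samples) / (n - 1)
    s01 = sum((s[0] - m0) * (s[1] - m1) for s in samples) / (n - 1)
    return (m0, m1), (s00, s01, s11)

def log_likelihood(pixel, mean, cov):
    """Log of the bivariate normal density at the pixel."""
    s00, s01, s11 = cov
    det = s00 * s11 - s01 ** 2
    d0, d1 = pixel[0] - mean[0], pixel[1] - mean[1]
    # Squared Mahalanobis distance, using the explicit inverse of a 2x2 matrix.
    maha = (s11 * d0 * d0 - 2 * s01 * d0 * d1 + s00 * d1 * d1) / det
    return -0.5 * (math.log(det) + maha) - math.log(2 * math.pi)

def classify_ml(pixel, models):
    """Return the class whose Gaussian model gives the highest likelihood."""
    return max(models, key=lambda c: log_likelihood(pixel, *models[c]))

# Hypothetical (red, near-infrared) reflectances for two classes.
training = {
    "healthy": [(0.04, 0.40), (0.05, 0.45), (0.03, 0.42), (0.05, 0.38)],
    "infested": [(0.10, 0.25), (0.12, 0.22), (0.09, 0.28), (0.11, 0.20)],
}
models = {label: fit_gaussian(pixels) for label, pixels in training.items()}
print(classify_ml((0.04, 0.41), models))
print(classify_ml((0.11, 0.22), models))
```

With unequal priors, the logarithm of each class prior would be added to the log-likelihood before the classes are compared.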

Minimum distance
The minimum distance (MD) classifier uses training data to assign each unknown pixel to the class that minimizes the distance between the image data and the class in multidimensional space; the minimum distance indicates maximum similarity. Because it involves fewer calculations, it requires less processing time (Jog and Dixit, 2016).
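The minimum distance rule can be sketched in a few lines of Python. The sketch assumes that each class is represented by the mean of its training pixels and that distance is Euclidean. The sample values and function names are hypothetical.

```python
import math

def class_means(training):
    """Compute the mean spectral vector of each class."""
    means = {}
    for label, pixels in training.items():
        n_bands = len(pixels[0])
        means[label] = [sum(p[b] for p in pixels) / len(pixels) for b in range(n_bands)]
    return means

def classify_md(pixel, means):
    """Assign the pixel to the class with the nearest mean vector."""
    return min(means, key=lambda c: math.dist(pixel, means[c]))

# Hypothetical (red, near-infrared) reflectances for two classes.
training = {
    "healthy": [(0.04, 0.40), (0.05, 0.45), (0.03, 0.42), (0.05, 0.38)],
    "infested": [(0.10, 0.25), (0.12, 0.22), (0.09, 0.28), (0.11, 0.20)],
}
means = class_means(training)
print(classify_md((0.04, 0.41), means))
print(classify_md((0.11, 0.22), means))
```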

Artificial neural network
Artificial neural networks (ANNs) are a type of machine learning algorithm inspired by the structure and function of biological neurons. ANNs have been widely used in remote sensing classification due to their ability to learn complex relationships between input variables and output classes, and their ability to handle non-linear relationships in the data. ANNs are composed of multiple layers of interconnected nodes, or neurons, which receive input signals, perform a non-linear transformation on those signals, and then pass the transformed signals to the next layer of neurons. The final layer of neurons produces the output, which is typically a classification label (Miller et al., 1995; Song et al., 2012).
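The layer-by-layer computation described above can be sketched as a forward pass through a network with one hidden layer. The sketch assumes sigmoid activations and uses hand-picked, hypothetical weights purely for illustration; a real network would learn its weights from training data, and the architecture used in the study is not specified here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Weighted sum of the inputs per neuron, followed by a non-linear transformation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(pixel, hidden_w, hidden_b, out_w, out_b):
    """Pass the pixel through the hidden layer and then the output layer."""
    hidden = layer(pixel, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)

# Hypothetical weights for two input bands, two hidden neurons and two classes.
classes = ["healthy", "infested"]
hidden_w, hidden_b = [[20.0, -5.0], [-20.0, 5.0]], [0.0, 0.0]
out_w, out_b = [[-4.0, 4.0], [4.0, -4.0]], [0.0, 0.0]

def classify_ann(pixel):
    outputs = forward(pixel, hidden_w, hidden_b, out_w, out_b)
    return classes[outputs.index(max(outputs))]

print(classify_ann((0.04, 0.41)))
print(classify_ann((0.11, 0.22)))
```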

Decision tree
Decision tree (DT) is a general predictive modelling tool with applications in different areas. Decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on specific conditions. Owing to its simplicity, it is one of the most widely used and practical methods for supervised learning. It is a non-parametric supervised learning method used for both classification and regression tasks (Kumar, 2022; Song and Lu, 2015).

K-nearest neighbor
K-nearest neighbor (KNN) is one of the most basic yet significant classification algorithms in machine learning. It is a supervised machine learning method often used in the domain of pattern recognition, data mining and intrusion detection. It can solve classification and regression problems (Meng et al., 2007).
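A minimal sketch of KNN classification is given below. It assumes Euclidean distance and a simple majority vote among the k closest training pixels. The training values, the choice k = 3 and the function names are hypothetical.

```python
import math
from collections import Counter

def classify_knn(pixel, training, k=3):
    """Label the pixel by majority vote among its k nearest training pixels."""
    neighbours = sorted(training, key=lambda item: math.dist(pixel, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical (red, near-infrared) reflectances with class labels.
training = [
    ((0.04, 0.40), "healthy"), ((0.05, 0.45), "healthy"),
    ((0.03, 0.42), "healthy"), ((0.05, 0.38), "healthy"),
    ((0.10, 0.25), "infested"), ((0.12, 0.22), "infested"),
    ((0.09, 0.28), "infested"), ((0.11, 0.20), "infested"),
]
print(classify_knn((0.04, 0.41), training))
print(classify_knn((0.11, 0.22), training))
```

An odd value of k avoids tied votes when only two classes are present.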

Random forest
Random forest (RF) is a popular machine learning algorithm used for remote sensing classification that combines multiple decision trees to improve classification accuracy and reduce overfitting. In the RF algorithm, multiple decision trees are trained on different subsets of the training data and with a random subset of input features. Each tree makes a classification decision based on the selected features, and the final classification decision is made by aggregating the decisions of all the trees through a majority voting scheme (Kranjčić et al., 2019a; Oliveira et al., 2012; Pal, 2005; Rodriguez-Galiano et al., 2015).
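The three ingredients named above (bootstrap subsets, random feature selection and majority voting) can be sketched as follows. To keep the example short, each tree is simplified to a depth-one tree (a stump) that splits on a single randomly chosen band; this is a deliberate simplification, not the full RF algorithm. The data, function names, number of trees and random seed are hypothetical.

```python
import random
from collections import Counter

def fit_stump(samples, feature):
    """Fit a depth-one tree that splits the samples on one feature."""
    majority = Counter(label for _, label in samples).most_common(1)[0][0]
    # Fallback: predict the majority label if no split is possible.
    best = (len(samples) + 1, float("inf"), majority, majority)
    values = sorted({x[feature] for x, _ in samples})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [label for x, label in samples if x[feature] <= threshold]
        right = [label for x, label in samples if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return (feature,) + best[1:]

def fit_forest(samples, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        feature = rng.randrange(n_features)
        forest.append(fit_stump(bootstrap, feature))
    return forest

def predict(forest, pixel):
    """Aggregate the decisions of all trees by majority vote."""
    votes = Counter(
        left if pixel[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]

# Hypothetical (red, near-infrared) reflectances with class labels.
training = [
    ((0.04, 0.40), "healthy"), ((0.05, 0.45), "healthy"),
    ((0.03, 0.42), "healthy"), ((0.05, 0.38), "healthy"),
    ((0.10, 0.25), "infested"), ((0.12, 0.22), "infested"),
    ((0.09, 0.28), "infested"), ((0.11, 0.20), "infested"),
]
forest = fit_forest(training)
print(predict(forest, (0.04, 0.41)))
print(predict(forest, (0.11, 0.22)))
```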

Support vector machine
Support vector machine (SVM) is a supervised learning algorithm that seeks to find a hyperplane that separates the data into different classes with the largest margin between the classes. In SVM, each pixel in the image is represented as a point in a high-dimensional space, and the algorithm seeks to find the hyperplane that best separates the different classes. The hyperplane is selected to maximize the margin between the closest points of the different classes, which are known as support vectors (Jog and Dixit, 2016; Kranjčić et al., 2019b; Naghibi et al., 2017; Ngoc Thach et al., 2018).
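Once a separating hyperplane is known, classification and the margin width follow directly from its weight vector and bias. The sketch below assumes a linear SVM with a hand-picked, hypothetical hyperplane; it does not perform the optimisation that finds the maximum-margin hyperplane, and the function names are illustrative only.

```python
import math

def decision_value(pixel, weights, bias):
    """Signed value of w . x + b; its sign gives the side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, pixel)) + bias

def classify_svm(pixel, weights, bias):
    return "infested" if decision_value(pixel, weights, bias) >= 0 else "healthy"

def margin_width(weights):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(w * w for w in weights))

# Hypothetical hyperplane in (red, near-infrared) reflectance space.
weights, bias = (30.0, -40.0), 10.0
print(classify_svm((0.04, 0.41), weights, bias))
print(classify_svm((0.11, 0.22), weights, bias))
print(margin_width(weights))
```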

Spectral angle mapper
Spectral angle mapper (SAM) calculates the spectral angle between a pixel's spectral signature and the spectral signature of a known target class to determine its class membership. SAM assumes that the spectral signatures of different materials can be represented as vectors in a high-dimensional space, and that the angle between two vectors represents the similarity between the two spectra. SAM calculates the angle between the pixel's spectral signature and the spectral signature of each target class and assigns the pixel to the class with the smallest angle (De Carvalho and Meneses, 2000; Liu and Yang, 2013).
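The spectral angle between a pixel vector and a reference vector is the arccosine of their normalised dot product. A minimal Python sketch is given below; the reference spectra and function names are hypothetical.

```python
import math

def spectral_angle(pixel, reference):
    """Angle in radians between two spectral vectors."""
    dot = sum(p * r for p, r in zip(pixel, reference))
    norm = math.sqrt(sum(p * p for p in pixel)) * math.sqrt(sum(r * r for r in reference))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.acos(cosine)

def classify_sam(pixel, references):
    """Assign the pixel to the class with the smallest spectral angle."""
    return min(references, key=lambda c: spectral_angle(pixel, references[c]))

# Hypothetical (red, near-infrared) reference spectra.
references = {"healthy": (0.04, 0.41), "infested": (0.10, 0.24)}
print(classify_sam((0.05, 0.45), references))
print(classify_sam((0.12, 0.22), references))
```

Because the angle depends only on the direction of the vectors and not on their length, a uniform change in brightness leaves the result unchanged.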

Naïve Bayes
Naïve Bayes (NB) is a probabilistic machine learning algorithm based on Bayes' theorem, which describes the probability of a hypothesis given some observed evidence. In naïve Bayes, each pixel in the image is represented as a vector of input features, such as spectral bands or texture measures. The algorithm assumes that each input feature is independent of the others, which is known as the "naïve" assumption. Naïve Bayes calculates the probability of each class given the input features and assigns the pixel to the class with the highest probability. The probability of each class is calculated using Bayes' theorem, which incorporates the prior probability of the class and the probability of the input features given the class (Kholod et al., 2019; Solares and Sanz, 2005; Soria et al., 2011; Wieland and Pittore, 2014).
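The calculation described above can be sketched as a Gaussian naïve Bayes classifier. The sketch assumes that each band follows a normal distribution within each class and that class priors are estimated from the training sample counts. It is an illustration of the naïve independence assumption only, not a reproduction of the OpenCV Normal Bayes classifier; the data and function names are hypothetical.

```python
import math

def fit_naive_bayes(training):
    """Estimate the prior and the per-band mean and variance of each class."""
    total = sum(len(pixels) for pixels in training.values())
    model = {}
    for label, pixels in training.items():
        n = len(pixels)
        stats = []
        for b in range(len(pixels[0])):
            mean = sum(p[b] for p in pixels) / n
            var = sum((p[b] - mean) ** 2 for p in pixels) / (n - 1)
            stats.append((mean, var))
        model[label] = (n / total, stats)
    return model

def log_posterior(pixel, prior, stats):
    """Unnormalised log posterior: log prior plus per-band log likelihoods."""
    score = math.log(prior)
    for x, (mean, var) in zip(pixel, stats):
        # Independence assumption: band likelihoods multiply (logs add).
        score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
    return score

def classify_nb(pixel, model):
    """Return the class with the highest posterior probability."""
    return max(model, key=lambda c: log_posterior(pixel, *model[c]))

# Hypothetical (red, near-infrared) reflectances for two classes.
training = {
    "healthy": [(0.04, 0.40), (0.05, 0.45), (0.03, 0.42), (0.05, 0.38)],
    "infested": [(0.10, 0.25), (0.12, 0.22), (0.09, 0.28), (0.11, 0.20)],
}
model = fit_naive_bayes(training)
print(classify_nb((0.04, 0.41), model))
print(classify_nb((0.11, 0.22), model))
```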

Study area
The study area is in a mountainous part of Croatia, where spruce, beech and fir trees can be found. The municipality of Čabar is located at an altitude of 650 to 1200 m above sea level and is covered with spruce and fir forests. In 2014, a bark beetle outbreak was registered in the municipality of Čabar. The spruce forests are infested with bark beetles, and the main visible symptom is yellow or red treetops, which can be distinguished in remotely sensed data (Kranjčić et al., 2018).

Used data sets
We used Copernicus Sentinel 2A multispectral images. Sentinel 2A carries a multispectral imager covering 13 spectral bands (443 nm to 2190 nm) with spatial resolutions of 10 m, 20 m and 60 m (Agency, 2021; Rättich et al., 2020). The downloaded data are dated 4 August 2017. Figure 1 shows the study area together with the training and control data sets.

RESULTS AND DISCUSSION
Viera et al. (2005) show that kappa analysis is a powerful tool for comparing differences between classification results. Kappa values between 0.41 and 0.60 indicate moderate classification accuracy, values between 0.61 and 0.80 indicate high accuracy, and values higher than 0.80 indicate very high accuracy. Table 1 shows the kappa values for each method. Figures 2 to 10 show the results of supervised classification for each of the abovementioned methods, as follows: maximum likelihood, minimum distance, artificial neural network, decision tree, K-

CONCLUSIONS
In this paper we used several machine learning methods for bark beetle infestation mapping. For classification we used the QGIS and SAGA GIS software, and all methods are based on the OpenCV library. The methods analyzed are as follows: maximum likelihood, minimum distance, artificial neural network, decision tree, K-nearest neighbor, random forest, support vector machine, spectral angle mapper and naïve Bayes. The respective kappa values are 0.71, 0.72, 0.81, 0.68, 0.69, 0.75, 0.26, 0.60 and 0.41. This indicates that the artificial neural network achieved the highest classification accuracy and the support vector machine the lowest. Such results were expected, although a higher classification accuracy should be achievable for the support vector machine. Results are influenced by various parameters, such as the training data set, the control data set, the number of data sets and the size of each sample. Therefore, future research must explore how each specific parameter affects classification accuracy.
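For reference, the kappa statistic used to compare the methods can be computed from a confusion matrix as the observed agreement corrected for the agreement expected by chance. The minimal Python sketch below also applies the accuracy bands stated in this paper. The confusion matrix is hypothetical and is not taken from the study, and the label returned for values below 0.41 is a placeholder, since the paper does not name that band.

```python
def cohen_kappa(confusion):
    """Kappa from a square confusion matrix (rows: reference, columns: classified)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)

def accuracy_band(kappa):
    """Interpretation bands as stated in the paper."""
    if kappa > 0.80:
        return "very high"
    if kappa > 0.60:
        return "high"
    if kappa >= 0.41:
        return "moderate"
    return "below moderate (band not named in the paper)"

# Hypothetical confusion matrix for two classes.
confusion = [[45, 5], [10, 40]]
kappa = cohen_kappa(confusion)
print(round(kappa, 2), accuracy_band(kappa))

# Kappa values reported in the paper, in the order the methods are listed.
reported = [0.71, 0.72, 0.81, 0.68, 0.69, 0.75, 0.26, 0.60, 0.41]
print([accuracy_band(k) for k in reported])
```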