Machine Learning Tools

2.1 What is ‘scikit-learn’

Scikit-learn is one of the most powerful and popular Python packages for machine learning. It provides a wide range of algorithms for classification, regression, and clustering. In this section, we will use the random forest algorithm from the scikit-learn package to map SCA.

2.2 What is the random forest algorithm

Random forest is a widely used machine learning algorithm developed by Leo Breiman and Adele Cutler (citation). It is an ensemble of multiple decision trees whose outputs are aggregated to obtain the most likely result. A decision tree is a supervised machine learning model that uses a series of questions to categorize data or make predictions. Each question forms a tree node that splits the data into different branches. If the answer to the question is ‘yes’, the decision follows the ‘yes’ branch; otherwise, it follows the alternate branch. This continues until the decision reaches a result. The quality of each split is evaluated by metrics such as the mean squared error (MSE) for regression, and Gini impurity or information gain for classification.
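To make the idea of a question-based split concrete, the following is a minimal sketch, not the chapter's own workflow. It computes Gini impurity by hand and then fits a one-level decision tree with scikit-learn. The toy `reflectance` feature, its values, and the snow/no-snow labels are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2), where p_k is the proportion of class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


# Hypothetical toy data: one feature per sample, label 1 = snow, 0 = no snow.
X = np.array([[0.10], [0.15], [0.20], [0.70], [0.80], [0.90]])
y = np.array([0, 0, 0, 1, 1, 1])

# Impurity of the unsplit data: two equally frequent classes.
print("Gini impurity before splitting:", gini_impurity(y))

# A tree of depth 1 asks a single question about the feature.
tree = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=0)
tree.fit(X, y)

# Print the learned question (threshold) and the class at each leaf.
print(export_text(tree, feature_names=["reflectance"]))

# Classify two new, made-up samples.
pred = tree.predict([[0.05], [0.95]])
print("Predictions:", pred)
```

Here the single split separates the two classes perfectly, so each resulting leaf has a Gini impurity of zero. Real data is rarely this clean, which is why deeper trees and ensembles are needed.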

Although a single decision tree is easy to use, it is prone to overfitting. An ensemble of decision trees can substantially reduce overfitting and prediction variance, which typically yields more accurate results. Bagging, also known as bootstrap aggregation, is the most well-known ensemble learning technique: it trains multiple models independently, each on a training set sampled randomly with replacement. The final prediction is the average (for regression) or the majority vote (for classification) of all the models. Random forest extends bagging by drawing a random subset of both samples and features when training each model. Whereas a single decision tree can use all features to make decisions, each tree in a random forest works with only a subset of features, which reduces the influence of highly correlated features on the model prediction.

More complex approaches based on image segmentation techniques exist for land cover classification, including snow (citation). For the purpose of this chapter, we chose to illustrate the use of the random forest algorithm for snow classification, as it is a robust and versatile method to address both classification and regression problems.
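The bagging and feature-subsampling ideas described above can be sketched with scikit-learn as follows. This is an illustrative example on synthetic data generated by `make_classification`, not the snow dataset used in this chapter. The parameter values (`n_estimators=100`, `max_features="sqrt"`) are assumptions chosen for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for real observations (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Baseline: a single, fully grown decision tree.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Random forest: each tree is trained on a bootstrap sample (bootstrap=True)
# and considers a random subset of features (max_features="sqrt") at each split.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0
).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"Single tree test accuracy:   {tree_acc:.3f}")
print(f"Random forest test accuracy: {forest_acc:.3f}")
print("Number of trees in the forest:", len(forest.estimators_))
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by a strict majority vote, and it draws the feature subset at each split rather than once per tree. The accuracy gap between the two models depends on the data, so the comparison here should be read as a demonstration rather than a benchmark.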