Model Training#

In the field of Snow Water Equivalent (SWE) prediction, training models that accurately represent the complexities of environmental data is a critical task. This chapter delves into the intricacies of model training, focusing on the foundational BaseHole class, its extensions, and the specific machine learning models that utilize this structure.

7.1 BaseHole Class#

7.1.1 Overview#

The BaseHole class is a meticulously crafted blueprint for building SWE predictors. It encapsulates the core processes of data handling, model training, and evaluation, ensuring that common functionalities are standardized and reusable. Because BaseHole is designed as an extendable class, specific predictor classes can inherit and customize its methods, allowing flexibility in model creation while maintaining a consistent structure across different models.

Key Attributes:

  • all_ready_file: A path to the CSV file containing pre-processed data ready for training.

  • classifier: The machine learning model used for prediction.

  • holename: The name of the wormhole class, which is derived from the class name itself.

  • train_x, train_y: Training input and target data, respectively.

  • test_x, test_y: Testing input and target data, respectively.

  • test_y_results: The predicted results on the test data.

  • save_file: Path to save the trained model.
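
Putting these together, here is a minimal sketch of how the class skeleton might look. The constructor details and file paths are assumptions for illustration, not the project's exact code; work_dir is the working directory referenced elsewhere in this chapter.

class BaseHole:
    def __init__(self):
        # Paths below are hypothetical placeholders
        self.all_ready_file = f"{work_dir}/all_ready.csv"
        self.holename = self.__class__.__name__
        self.save_file = f"{work_dir}/{self.holename}-latest.joblib"
        self.classifier = self.get_model()  # provided by the subclass
        self.train_x, self.train_y = None, None
        self.test_x, self.test_y = None, None
        self.test_y_results = None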

7.1.2 Core Functions#

Preprocessing: The workflow begins with preprocessing, a critical phase in which raw data is transformed into a refined form suitable for training. The BaseHole class handles this phase by loading the data, cleaning it, and splitting it into training and testing sets. This preparatory step ensures that the models are fed data that is both digestible and informative, setting the stage for accurate predictions.

def preprocessing(self):
    '''
    Preprocesses the data for training and testing.

    Returns:
        None
    '''
    # Load the training-ready CSV produced by the data pipeline
    all_ready_pd = pd.read_csv(self.all_ready_file, header=0, index_col=0)
    print("all columns: ", all_ready_pd.columns)
    # all_cols and input_columns are module-level lists of column names
    all_ready_pd = all_ready_pd[all_cols]
    all_ready_pd = all_ready_pd.dropna()
    # Hold out 20% of the rows for testing
    train, test = train_test_split(all_ready_pd, test_size=0.2)
    self.train_x = train[input_columns].to_numpy().astype('float')
    self.train_y = train[['swe_value']].to_numpy().astype('float')
    self.test_x = test[input_columns].to_numpy().astype('float')
    self.test_y = test[['swe_value']].to_numpy().astype('float')

Train: The train function is responsible for training the machine learning model using the preprocessed data. This function prepares the model to make accurate predictions by learning patterns from the training data.

def train(self):
    '''
    Trains the machine learning model.

    Returns:
        None
    '''
    # Fit the underlying estimator on the preprocessed training set
    self.classifier.fit(self.train_x, self.train_y)

Test: The test function evaluates the model’s performance on a separate testing dataset, allowing for the assessment of its predictive accuracy.

def test(self):
    '''
    Tests the machine learning model on the testing data.

    Returns:
        numpy.ndarray: The predicted results on the testing data.
    '''
    # Predict on the held-out test split and cache the results
    self.test_y_results = self.classifier.predict(self.test_x)
    return self.test_y_results

Predict: The predict function leverages the trained model to make predictions on new, unseen data, providing valuable insights into potential outcomes.

def predict(self, input_x):
    '''
    Makes predictions using the trained model on new input data.

    Args:
        input_x (numpy.ndarray): The input data for prediction.

    Returns:
        numpy.ndarray: The predicted results.
    '''
    return self.classifier.predict(input_x)

Several other functions in this class are designed to be overridden by subclasses (see the sketch after the list):

  • Evaluate: The evaluate function, designed to be overridden, is where the performance metrics of the model are calculated and analyzed. This function is crucial for understanding the model’s strengths and weaknesses.

  • Get Model: The get_model function, another overridable method, is responsible for returning the specific machine learning model object that will be used for training and prediction.

  • Post-processing: The post_processing function handles the final steps after model predictions are made, such as generating visualizations, analyzing feature importance, and saving results.
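
To illustrate the intended extension pattern, here is a minimal hypothetical subclass. LinearHole, the LinearRegression model, and the MSE metric are illustrative choices only, not part of the actual codebase:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

class LinearHole(BaseHole):  # hypothetical subclass for illustration
    def get_model(self):
        # Return the estimator this predictor should train
        return LinearRegression()

    def evaluate(self):
        # Compare cached test predictions against ground truth
        mse = mean_squared_error(self.test_y, self.test_y_results)
        print(f"{self.holename} test MSE: {mse:.3f}")
        return mse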

7.2 ETHole Class#

The ETHole class is designed to leverage the power of the Extra Trees Regressor, an ensemble learning method. This class is a specialized extension of the RandomForestHole class, inheriting its structure while introducing model-specific adaptations.

Why Extra Trees Regressor?

  • The Extra Trees Regressor stands out because of its robustness in handling varied data distributions and its ability to capture intricate patterns without overfitting. Unlike traditional decision trees, which split the data by selecting the best feature thresholds, Extra Trees introduces additional randomness by selecting thresholds at random. This randomness helps in reducing variance, making the model less prone to overfitting, especially in high-dimensional spaces like environmental data.
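
This difference is visible directly in scikit-learn's estimator defaults:

from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# A single extra tree draws candidate split thresholds at random
print(DecisionTreeRegressor().splitter)   # 'best'
print(ExtraTreeRegressor().splitter)      # 'random'
# The Extra Trees ensemble also skips bootstrap sampling by default
print(RandomForestRegressor().bootstrap)  # True
print(ExtraTreesRegressor().bootstrap)    # False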

7.2.1 Custom Features#

To maximize the predictive power of the Extra Trees model, the ETHole class introduces several custom features that tailor the training process to the specific needs of SWE prediction.

  • Custom Loss Function: The custom_loss function in the ETHole class is a specialized loss function that penalizes errors differently based on the true value of SWE. In typical regression tasks, the goal is to minimize the average error across all predictions. However, in SWE prediction it is crucial to be more accurate in certain ranges, such as when SWE values are high, as these may correspond to critical environmental conditions.

def custom_loss(y_true, y_pred):
    # Absolute error per sample, doubled whenever the true SWE
    # exceeds 10 so that mistakes on deep snowpack cost more
    errors = np.abs(y_true - y_pred)
    return np.where(y_true > 10, 2 * errors, errors)
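
A quick sanity check on made-up values shows the doubled penalty above the threshold of 10:

import numpy as np

y_true = np.array([2.0, 15.0, 8.0])  # example true SWE values
y_pred = np.array([3.0, 12.0, 8.5])
print(custom_loss(y_true, y_pred))   # [1.  6.  0.5]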

  • Sample Weights: Sample weights adjust the importance of each data point during the training process. The create_sample_weights method generates weights based on the SWE values, giving more importance to higher values and ensuring that the model focuses on accurately predicting these critical instances.

def create_sample_weights(self, X, y, scale_factor, columns):
    # Min-max normalize the targets to [0, 1], then scale, so the
    # largest SWE values receive the largest weights
    return (y - np.min(y)) / (np.max(y) - np.min(y)) * scale_factor
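
The resulting weights can then be passed to the estimator's fit call; the scale factor below is an arbitrary illustration:

# Hypothetical usage: upweight high-SWE samples during fitting
weights = hole.create_sample_weights(hole.train_x, hole.train_y,
                                     scale_factor=5.0, columns=None)
hole.classifier.fit(hole.train_x, hole.train_y, sample_weight=weights.ravel())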

7.2.2 Training and Evaluation#


Model Creation: The get_model() method in this class overrides the base method to return an instance of ExtraTreesRegressor.

def get_model(self):
    """
    Returns the Extra Trees Regressor model with specified hyperparameters.

    Returns:
        ExtraTreesRegressor: The Extra Trees Regressor model.
    """
    # An explicit hyperparameter set, kept here for reference:
    # return ExtraTreesRegressor(n_estimators=200,
    #                            max_depth=None,
    #                            random_state=42,
    #                            min_samples_split=2,
    #                            min_samples_leaf=1,
    #                            n_jobs=5)
    # Default hyperparameters, all CPU cores, fixed seed for reproducibility
    return ExtraTreesRegressor(n_jobs=-1, random_state=123)

Train Method: The train method in the ETHole class is designed to take full advantage of the Extra Trees model’s capabilities. By incorporating sample weights, the model becomes more attuned to the nuances of the data, particularly in ranges that are more impactful in the real world.

def train(self):
    # Phase 1: fit on the raw training data
    self.classifier.fit(self.train_x, self.train_y)
    # Phase 2: weight samples by how badly the first fit missed them,
    # then refit so that large errors receive more attention
    predictions = self.classifier.predict(self.train_x)
    # ravel() aligns the (n, 1) targets with the (n,) predictions
    errors = np.abs(self.train_y.ravel() - predictions)
    weights = compute_sample_weight('balanced', errors)
    self.classifier.fit(self.train_x, self.train_y, sample_weight=weights)

The training process is carried out in two main phases:

  • Initial Training: The model is first trained on the entire training dataset without any sample weights.

  • Weighted Training: After the initial training, the model’s predictions are compared with actual values, and sample weights are computed based on the errors. The model is then retrained using these weights, making it more sensitive to critical prediction errors.
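
For intuition, scikit-learn's compute_sample_weight('balanced', y) treats each distinct value in y as a class and weights samples inversely to that value's frequency, so a rare large error is upweighted. A standalone illustration with made-up error values:

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Three common small errors and one rare large error
errors = np.array([0.5, 0.5, 0.5, 3.0])
# weight = n_samples / (n_classes * count(value))
print(compute_sample_weight('balanced', errors))  # [0.667 0.667 0.667 2.0]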

7.2.3 Post-Processing#

After training and making predictions, the post_processing method plays a key role in analyzing the model’s performance. One of the primary tasks is to assess feature importance, which helps in understanding which input features (e.g., temperature, precipitation) were most influential in the model’s predictions.



def post_processing(self, chosen_columns=None):
    # Rank features by the importance scores learned by the Extra Trees model
    feature_importances = self.classifier.feature_importances_
    feature_names = np.array(self.feature_names)
    sorted_indices = np.argsort(feature_importances)[::-1]
    sorted_importances = feature_importances[sorted_indices]
    sorted_feature_names = feature_names[sorted_indices]

    # Bar chart of importances, most influential feature first
    plt.figure(figsize=(10, 6))
    plt.bar(range(len(feature_names)), sorted_importances, tick_label=sorted_feature_names)
    plt.xticks(rotation=90)
    plt.xlabel('Feature')
    plt.ylabel('Feature Importance')
    plt.title('Feature Importance Plot (ET model)')
    plt.tight_layout()
    if chosen_columns is None:
        feature_png = f'{work_dir}/testing_output/et-model-feature-importance-latest.png'
    else:
        feature_png = f'{work_dir}/testing_output/et-model-feature-importance-{len(chosen_columns)}.png'
    plt.savefig(feature_png)
    print(f"Feature image is saved to {feature_png}")

The post-processing method generates a feature importance plot, which visually represents how much each feature contributed to the predictions. This is crucial for model interpretation, allowing researchers to understand which environmental factors most significantly impact SWE predictions.
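
A typical call after testing might look like the following; the column names here are purely illustrative:

et_hole.post_processing()  # plot importances for the full feature set
# Hypothetical reduced feature set; names are illustrative only
et_hole.post_processing(chosen_columns=['air_temperature', 'precipitation'])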

7.3 Training#

In the final stage of the training process, multiple models, including the ETHole, are trained and validated to determine the best performer. This process is encapsulated in a script that orchestrates the training and evaluation of several models, ensuring a comprehensive approach to model selection.

The main() function in the model_train_validate script serves as the entry point for the model training pipeline. By coordinating various model types, including ETHole, the script ensures that each model is thoroughly trained and evaluated under consistent conditions.

def main():
    print("Train Models")

    # List of predictor instances to train; add more hole classes here
    worm_holes = [ETHole()]

    for hole in worm_holes:
        hole.preprocessing()
        print(hole.train_x.shape)
        print(hole.train_y.shape)

        hole.train()
        hole.test()
        hole.evaluate()
        hole.save()

    print("Finished training and validating all the models.")

Each model created in this function is an instance of the ETHole class. One of the key strengths of this script is its modularity. By simply adjusting the list of models (worm_holes), you can train and validate different algorithms without modifying the core workflow.
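
For instance, assuming other hole subclasses such as the RandomForestHole mentioned in Section 7.2 are available:

# Train several predictors under identical conditions
worm_holes = [ETHole(), RandomForestHole()]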

This script provides a streamlined way to manage the training process, enabling efficient experimentation with different models and configurations.