Introduction
Machine learning (ML) has received considerable attention in computer vision, artificial intelligence, medicine, and many other fields. But with so many different approaches, opinions, and models, entering the field can feel overwhelming. In the earth sciences, we are primarily concerned with discovering how the natural world operates and the forces involved in shaping it. While ML is a powerful prediction tool, it has seen little use as a tool for discovery, especially in the earth sciences. This is primarily because many of the top-performing ML algorithms are difficult to interpret intuitively.
Complex, uninterpretable machine learning models raise further problems. First, trouble arises when we do not know which variables have the most and least influence on the expected outcome. This matters when deciding which variables to include and exclude. For instance, if a variable contributes minimally to the predictive power of a model, we can exclude it to form a less complex model. A second problem lies in understanding the functional form of the variables in the machine learning model. This becomes important if we are interested in how particular variables increase or decrease the outcome variable. For example, does an increase in some variable result in a quadratic increase in the outcome? Or a logarithmic decrease? Or an exponential increase?
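One common way to rank variables by influence is permutation importance: shuffle one input column at a time and measure how much the model's error grows. The sketch below is only an illustration of that idea, not the method used later in this tutorial. The `model` function stands in for a fitted model, and its coefficients and the synthetic inputs are hypothetical.

```python
import random

def mse(predict, X, y):
    """Mean squared error of predict over the rows of X."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when one column is shuffled, for each column."""
    rng = random.Random(seed)
    base = mse(predict, X, y)
    scores = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with column j replaced by its shuffled values
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increase += mse(predict, shuffled, y) - base
        scores.append(increase / n_repeats)
    return scores

# Hypothetical stand-in for a fitted model: strong, weak, and no dependence
def model(row):
    return 3.0 * row[0] ** 2 + 0.1 * row[1] + 0.0 * row[2]

rng = random.Random(1)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [model(row) for row in X]

scores = permutation_importance(model, X, y)
print(scores)  # larger score = more influential variable
```

A variable whose score stays near zero, like the third one here, is a candidate for exclusion. Note that the score says nothing about functional form: the first variable acts quadratically, but the ranking alone does not reveal that.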
Fortunately, the field of Explainable AI (XAI) concerns itself with exactly what we seek as earth scientists: how are these ML models making predictions, and what do those predictions tell us about the processes we seek to approximate? In this tutorial, we walk through how to use an interpretable ML model, how to use a more complex ML model, and how to apply a post-training explainability technique, specifically a Python package called SHAP, to better understand how our ML models make predictions.
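SHAP builds on Shapley values from cooperative game theory. To make the idea concrete, here is a minimal sketch that computes exact Shapley values by brute force for a toy model. It does not use the SHAP package or its API. The function `toy_model`, its coefficients, the input `x`, and the `baseline` are all hypothetical, and features left out of a coalition are simply replaced by one baseline value, which is a simplification of averaging over background data.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for a single prediction.

    Features outside a coalition take their baseline value
    (a simplifying assumption made for this sketch).
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical toy model with one interaction term (not fit to any data)
def toy_model(features):
    tss, ndvi, distance_to_water = features
    return 2.0 * tss + 1.5 * ndvi - 0.5 * distance_to_water + tss * ndvi

x = [1.0, 0.6, 2.0]          # hypothetical observation to explain
baseline = [0.5, 0.3, 1.0]   # hypothetical reference values

phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and the prediction at `baseline`, so each value can be read as that variable's share of the change. The brute-force loop visits every coalition and scales exponentially with the number of variables, which is why practical tools rely on approximations.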
Another goal of this tutorial is to provide a framework for machine learning on small datasets. When modeling physical processes, we often struggle to obtain large amounts of data because collection techniques are laborious. This becomes a problem because many state-of-the-art machine learning algorithms require large amounts of data and overfit when given too little. In this tutorial we use two models that work well on smaller datasets: Bayesian Linear Regression and Gaussian Process Regression. We choose these two models for two reasons. First, both provide smooth, continuous functions, as opposed to the step functions common in tree-based models. Second, the Bayesian Linear Regression gives us a baseline model with the embedded assumption that vertical accretion is simply a linear function of the input variables. We can then test whether a non-parametric machine learning model, a Gaussian Process Regression, improves on that baseline's performance.
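The baseline-versus-nonparametric comparison can be sketched on synthetic data using only the standard library. This is an illustrative sketch under stated assumptions, not the tutorial's implementation: the data are synthetic, the hyperparameters (`alpha`, `beta`, `length_scale`, `noise`) are fixed by hand rather than estimated, and only posterior means are computed. In practice, libraries such as scikit-learn provide `BayesianRidge` and `GaussianProcessRegressor` for these models.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_bayes_linear(X, y, alpha=1e-2, beta=25.0):
    """Posterior-mean weights for a Gaussian prior (precision alpha) and noise precision beta."""
    Phi = [[1.0] + list(row) for row in X]  # prepend an intercept column
    d = len(Phi[0])
    A = [[beta * sum(p[i] * p[j] for p in Phi) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [beta * sum(p[i] * t for p, t in zip(Phi, y)) for i in range(d)]
    w = solve(A, b)
    return lambda row: w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))

def rbf(a, b, length_scale):
    """Squared-exponential kernel."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * length_scale ** 2))

def fit_gp(X, y, length_scale=1.0, noise=0.04):
    """Gaussian process posterior mean with an RBF kernel and constant prior mean."""
    mean_y = sum(y) / len(y)
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    coef = solve(K, [t - mean_y for t in y])
    return lambda row: mean_y + sum(c * rbf(row, xi, length_scale) for c, xi in zip(coef, X))

def loo_rmse(fit, X, y):
    """Leave-one-out cross-validated RMSE, a common choice for small datasets."""
    sq_err = 0.0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        sq_err += (model(X[i]) - y[i]) ** 2
    return math.sqrt(sq_err / len(X))

# Synthetic, deliberately non-linear data (30 points, one input variable)
rng = random.Random(0)
X = [[rng.uniform(0.0, 5.0)] for _ in range(30)]
y = [2.0 * math.sin(row[0]) + rng.gauss(0.0, 0.2) for row in X]

linear_rmse = loo_rmse(fit_bayes_linear, X, y)
gp_rmse = loo_rmse(fit_gp, X, y)
print(f"linear baseline RMSE: {linear_rmse:.3f}, GP RMSE: {gp_rmse:.3f}")
```

Because the synthetic response is non-linear, the Gaussian process should beat the linear baseline here. On real data the gap may be small or absent, which is exactly what the baseline comparison is meant to reveal.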
Scientific Motivation
Sediment is what builds our coastlines. However, sedimentation is incredibly dynamic: sediment is constantly being removed, transported, and redeposited. This dynamism makes sedimentation rates very difficult to predict. But since the foundations of our communities are, literally, built upon deposited sediment, it is crucial to know how it moves.
Using a comprehensive dataset of coastal marsh sedimentation rates derived from the Coastwide Reference Monitoring System (CRMS), we want to discern the driving environmental factors related to sedimentation, as recorded by vertical accretion rate (Wagner and Haywood III 2022). Vertical accretion rate in our dataset is recorded as the height of sediment deposited above a datum, temporally averaged across 16 years, the full time span covered by the CRMS dataset. It records the product of the sedimentation processes that build our coastlines and combat land loss. Coastal Louisiana is no stranger to the delicate balance between land loss and sedimentation. Louisiana’s sediment budget has been diminished by upstream damming and the leveeing of its river systems, amplifying the land loss caused by relative sea-level rise across the coast. A common saying is that coastal Louisiana loses about an American football field's worth of land every hour!
Complementing the vertical accretion rate measurements, the CRMS dataset provides additional biologic, hydrologic, and sedimentologic variables (Table 1) (Wagner and Haywood III 2022). We augment the dataset by calculating our own Distance to Water, Distance to Rivers, Normalized Difference Vegetation Index (NDVI), and Total Suspended Sediment (TSS) to capture additional environmental factors that may influence vertical accretion rates. We want to first use a statistical method to identify the most salient features, then use a machine learning model to predict vertical accretion rates and to understand how well the environmental variables capture the process of vertical accretion.
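For readers unfamiliar with NDVI, its standard definition is the difference between near-infrared and red reflectance divided by their sum. The sketch below shows that formula only. The function name, argument names, and reflectance values are hypothetical, and the satellite product and processing used to derive NDVI for this dataset are not specified here.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return float("nan")  # undefined when both bands are zero
    return (nir - red) / total

# Hypothetical reflectance values, for illustration only
vegetated = ndvi(nir=0.50, red=0.10)   # healthy vegetation reflects strongly in NIR
open_water = ndvi(nir=0.02, red=0.06)  # water absorbs NIR, so NDVI is negative

print(vegetated, open_water)
```

Values range from -1 to 1, with higher values generally indicating denser green vegetation.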
All the data we use are openly available, with descriptions, at:
CRMS: https://www.lacoast.gov/chart/Charting.aspx?laf=crms&tab=2
CIMS: https://cims.coastal.louisiana.gov/monitoring-data/