Machine Learning in the Geosciences#

The GeoScience MAchine Learning Resources and Training (GeoSMART) framework provides an educational pathway that starts with a foundation in open-source scientific ecosystems and progresses through general ML theory, toolkits, and deployment on cloud computing.

This book is used in the course offered at the University of Washington: Machine Learning in the Geosciences (Autumn 2023 - ESS 469/569). The corresponding GitHub repository, with notebooks for the tutorials and homeworks, is MLGeo. The Docker image for the corresponding JupyterHub is MLGeo Image.

Instructors:

This project is supported by the GeoSMART team (Stefan Todoran, Nicoleta Cristea, Anthony Arendt, Scott Henderson, Ziheng Sun, Yiyu Ni, Akash Kharita).

Overview#

The course introduces machine learning in the geosciences, the basics of computing, and methodologies in applied machine learning. It focuses on canonical and topical data sets in seismology, oceanography, the cryosphere, planetary sciences, geology, and geodesy. The methods taught include unsupervised clustering, logistic regression, random forests, support vector machines, and deep learning.

Learning objectives#

By the end of the quarter, the students should be able to:

  • Demonstrate computing skills in Python, Jupyter notebooks, and Git version control, and deploy scripts on local computers, cloud-hosted hubs, or cloud instances.

  • Develop and apply standard machine-learning workflows: 1) Data preparation, 2) Model design, 3) Model training, validation, and evaluation.

  • Apply standard data manipulation strategies in the Geosciences: data types (time series and geospatial), data formats, data visualization, dimensionality reduction, and feature engineering.

  • Describe and demonstrate the adoption of open science principles, science reproducibility, and digital scholarship.

  • Describe the canonical examples in a breadth of disciplines in geoscience.

  • Understand, at least qualitatively, how some of the advanced techniques (Fourier and wavelet transforms, principal component analysis, …) manipulate and transform the data, in order to interpret the output.
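
As a qualitative illustration of the last objective, the sketch below shows how a Fourier transform turns a time series into a set of amplitudes per frequency. It is a minimal example under stated assumptions, not course material: the signal is synthetic, the function name `dft` and all variable names are hypothetical, and the transform is a naive pure-Python implementation rather than the optimized routines a real workflow would use.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


# Hypothetical synthetic record: 64 samples at 64 Hz (1 s of data), made of
# a 5 Hz sine with amplitude 1.0 plus a 12 Hz sine with amplitude 0.5.
sampling_rate = 64
n_samples = 64
signal = [
    math.sin(2 * math.pi * 5 * t / sampling_rate)
    + 0.5 * math.sin(2 * math.pi * 12 * t / sampling_rate)
    for t in range(n_samples)
]

spectrum = dft(signal)

# Keep the frequencies below Nyquist and rescale to sine amplitudes.
half = n_samples // 2
amplitudes = [2 * abs(c) / n_samples for c in spectrum[:half]]
freqs = [k * sampling_rate / n_samples for k in range(half)]

# The two largest peaks sit at the two frequencies present in the signal.
top_two = sorted(range(half), key=lambda k: amplitudes[k], reverse=True)[:2]
dominant_freqs = sorted(freqs[k] for k in top_two)
print(dominant_freqs)
```

Reading the output qualitatively: the transform does not change the information in the record, it re-expresses it so that periodic components appear as isolated peaks, which is often a more useful feature space than the raw samples.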

In the UW MLGeo course (ESS 469/569), we follow a syllabus.

Note that we introduce and incorporate data visualization concepts throughout the book.

Prerequisites#

Prerequisites: MATH 207 and MATH 208, or MATH 307 or 308, or AMATH 351 or 352, CS 160 or CS 163, or permission from the instructor.

Recommended skills: knowledge of Python, AMATH 301, and 100- or 200-level courses in the Earth Sciences. We will provide refreshers on computing as part of the course.

Syllabus#

  • Part I: AI-ready GeoData: This part will focus on geoscientific data: their modalities and dimensions, their basic characteristics, how to extract features, dimensionality reduction, and how to format an AI-ready data set from geoscientific data.

  • Part II: Classic Machine Learning: This part will focus on developing machine learning skills for model training, evaluation, and assessment of generalization, as well as good practices for robust model training in classic machine learning after feature engineering (e.g., K-means, random forest, k-NN, etc.).

  • Part III: Deep Learning: This part will give an overview of fundamental concepts in deep learning, such as fully connected layers, convolutional neural networks, and sequence-to-sequence learning with RNNs; canonical architectures such as large DNNs, ResNets, and U-Nets; strategies for training neural networks, such as data augmentation, regularization, and loss definitions that incorporate physics constraints; and modern topics such as foundation models and large language models for geoscience.
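
The classic machine learning workflow described in Part II (prepare data, design a model, then evaluate it on data it has not seen) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course's implementation: the data are synthetic two-dimensional points, the helper names `make_blob` and `knn_predict` are hypothetical, and the k-nearest-neighbours classifier is written in plain Python rather than with sklearn.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def make_blob(center, label, n):
    """Draw n 2-D points around `center` (synthetic stand-in for real features)."""
    return [
        ((random.gauss(center[0], 0.5), random.gauss(center[1], 0.5)), label)
        for _ in range(n)
    ]


# 1) Data preparation: two well-separated synthetic classes, shuffled and
#    split into a training set and a held-out test set.
data = make_blob((0.0, 0.0), 0, 50) + make_blob((5.0, 5.0), 1, 50)
random.shuffle(data)
train, test = data[:70], data[70:]


# 2) Model design: k-nearest neighbours with a majority vote.
def knn_predict(train_set, point, k=3):
    """Return the most common label among the k training points nearest to `point`."""
    nearest = sorted(train_set, key=lambda item: math.dist(item[0], point))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


# 3) Evaluation: accuracy on points the model never saw during "training".
predictions = [knn_predict(train, features) for features, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Scoring on the held-out set rather than on the training set is what makes the accuracy an estimate of generalization; with real geoscientific data the classes overlap far more than in this toy case, so the split and the choice of `k` matter much more.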

Technical skills building#

Throughout the course, the students will build skills in shell, version control using git and GitHub, Python programming, high-performance computing strategies, and simple data visualization using Python.

  • Shell: introduced early in the course and used as needed.

  • Version Control: introduced early in the course and used at every lecture

  • Python Programming: progressively introduced. Specifically, we detail the use of the following packages: numpy, (geo)pandas, sklearn, keras, pytorch.

  • Visualization in Python: introduced early with Matplotlib and Plotly, and used in every Python lecture.

  • High-performance computing: used in the second half of the course and during development of the final project.

Readings and webinars#

Each week, students will write a short report about either a paper or a webinar. Use the template on Canvas and answer the questions when appropriate. Submissions of the report PDF are due Wednesdays at 11:59 pm PDT on Canvas. The instructor will spend 15 minutes Monday morning summarizing the reading and webinar reports. Papers can be found and/or uploaded on a shared private course Google Drive (only accessible with a UW email address).

GitHub with tutorials and homeworks (course specific)#

The course GitHub contains the tutorial notebooks. Clone the tutorial repository:

    git clone "https://github.com/UW-ESS-DS/MLGeo-2023"

To update the local repository from the remote version:

    git fetch
    git merge
    git pull

To force-reset the repository from the main branch:

    git reset --hard origin/main
    git pull

Make your own repository (MLGEO2023_UWNETID). Copy the environment.yml file and the tutorials into your own repository to run and modify them.