Assignment: Preparing AI-Ready Data for The Final Project#

Objective#

This assignment focuses on organizing, cleaning, and preparing data in a form suitable for machine learning. By the end of this task, you should have an organized repository that contains the raw data, cleaned data, annotated attributes, and exploratory analysis that prepares the data for use in machine learning models.

Structure of the Assignment#

Project Repository Setup and Documentation
- Task: Create a public GitHub repository for the group project.
- Requirements:
  - A clear and concise README.md file that explains:
    - The data source(s).
    - Project objectives. There, we should describe the rational of the project.
    - Instructions for setting up the environment (dependencies, packages).
    - High-level description of each script/notebook.
  - Structure your repository using the MLGEO guidelines .
Data Download and Raw Data Organization
- Task: Download the raw geoscientific dataset relevant to your project and discuss the basic modalities.
- Requirements:
  - Include a script or notebook (scripts/download_data.py or notebooks/Download_Data.ipynb) that downloads and verifies the dataset.
  - Ensure that the raw data is stored in a dedicated folder (data/raw/).
  - If applicable, document any API keys or access credentials required to obtain the data in the README.md.
  - Describe the data modalities, data formats
  - If applicable, describe large data archives that can be used for model inference, their size.
Basic Data Cleaning and Manipulation
- Task: Clean the raw data to handle missing values, outliers, or inconsistencies.
- Requirements:
  - Write a script/notebook (scripts/clean_data.py or notebooks/Data_Cleaning.ipynb) that:
    - Handles missing values (e.g., imputation, removal).
    - Corrects or removes outliers.
    - Ensures data consistency (e.g., uniform date formatting, unit conversions).
    - Saves cleaned data in a new folder (data/clean/).
Organizing Data into AI-Ready Format
- Task: Prepare the cleaned data for machine learning, ensuring it is properly annotated and structured.
- Requirements:
  - Convert your data into a format suitable for ML (e.g., pandas DataFrame, NumPy arrays, Xarray).
  - Be very specific and state what is 1) data sample/instances, 2) what is input data (raw data or its feature), and 3) what is the target data. By identifying this early, it will set clear goals for the analysis. Note that choosing a “target data” probably means that the problem (e.g., regression, classification, etc) is already posed.
  - Ensure the data is well-documented with attributes, labels, and metadata.
  - Include a notebook (notebooks/Prepare_AI_Ready_Data.ipynb) that clearly describes:
    - The final shape of the data (number of samples, features, and target labels).
    - A description of each feature/attribute per data sample, what its physical meaning is, and write out the original dimensionality of the data based on the number of features.
    - Save the final AI-ready data in a dedicated folder (data/ai_ready/).
Exploratory Data Analysis (EDA)
- Task: Perform a basic exploration of the cleaned data to understand its structure and key characteristics.
- Requirements:
  - Create a notebook (notebooks/EDA.ipynb) that includes:
    - Basic summary statistics of the dataset (mean, variance, min, max, etc.).
    - Visualization of feature distributions (histograms, box plots, etc.).
    - Correlation analysis between different features and target variables (correlation matrix, heatmaps).
    - Brief discussion on any patterns or insights observed during the analysis.
Dimensionality Discussion and Reduction
- Task: Analyze the dimensionality of your dataset and propose methods to reduce it.
- Requirements:
  - In a notebook (notebooks/Dimensionality_Reduction.ipynb):
    - Discuss the current dimensions of the dataset and any challenges they present (e.g., high dimensionality, sparse data).
    - Propose and implement at least two dimensionality reduction techniques:
      - Feature extraction techniques like PCA (Principal Component Analysis).
      - Non-linear methods like t-SNE (t-Distributed Stochastic Neighbor Embedding).
    - Visualize the results of dimensionality reduction (scatter plots, explained variance charts).
    - Discuss the implications of dimensionality reduction on your dataset.

Deliverables#

A GitHub repository with the following structure:

- data/
  - raw/
  - clean/
  - ai_ready/
- scripts/
  - download_data.py
  - clean_data.py
- notebooks/
  - Download_Data.ipynb
  - Data_Cleaning.ipynb
  - Prepare_AI_Ready_Data.ipynb
  - EDA.ipynb
  - Dimensionality_Reduction.ipynb
- README.md

Ensure all the scripts and notebooks are well-documented, with comments explaining the code.
Submit a link to your GitHub repository as your final assignment and mark the version of the repository at the time of submission.

Grading Criteria#

Repository Organization (10%): Clean structure with appropriate directories, well-documented README.md.
Data Download and Cleaning (20%): Script functionality, handling missing/outlier data, clean data format.
AI-Ready Data Preparation (20%): Proper data annotation, clear dimensionality description, format suitability for ML.
Exploratory Data Analysis (20%): Quality of statistical analysis, insights, and visualizations.
Dimensionality Reduction (20%): Quality of analysis, use of techniques, and discussion on dimensionality challenges.
Documentation and Code Clarity (10%): Clear explanations and code readability.

Required Self Evaluation#

You should use chatGPT (4o is best as of 2024) for self-assessment. You may use the following prompt

Can you grade the following repository <ENTER-URL> with the following rubric "Repository Organization (10%): Clean structure with appropriate directories, well-documented README.md.
Data Download and Cleaning (20%): Script functionality, handling missing/outlier data, clean data format.
AI-Ready Data Preparation (20%): Proper data annotation, clear dimensionality description, format suitability for ML.
Exploratory Data Analysis (20%): Quality of statistical analysis, insights, and visualizations.
Dimensionality Reduction (20%): Quality of analysis, use of techniques, and discussion on dimensionality challenges.
Documentation and Code Clarity (10%): Clear explanations and code readability."

It may also provide additional feedback to improve if you use prompts like

Can you please provide more constructive feedback to improve?

Print & Upload the reports to Canvas to show 1) the initial assessment, and 2) the final assessment.

ML Geo Curriculum

Assignment: Preparing AI-Ready Data for The Final Project

Contents