Assignment: Preparing AI-Ready Data for the Final Project
Objective#
This assignment focuses on organizing, cleaning, and preparing data in a form suitable for machine learning. By the end of this task, you should have an organized repository that contains the raw data, cleaned data, annotated attributes, and exploratory analysis that prepares the data for use in machine learning models.
Structure of the Assignment#
Project Repository Setup and Documentation
Task: Create a public GitHub repository for the group project.
Requirements:
A clear and concise README.md file that explains:
The data source(s).
Project objectives, including the rationale for the project.
Instructions for setting up the environment (dependencies, packages).
High-level description of each script/notebook.
Structure your repository using the MLGEO guidelines.
Data Download and Raw Data Organization
Task: Download the raw geoscientific dataset relevant to your project and discuss the basic modalities.
Requirements:
Include a script or notebook (scripts/download_data.py or notebooks/Download_Data.ipynb) that downloads and verifies the dataset.
Ensure that the raw data is stored in a dedicated folder (data/raw/).
If applicable, document any API keys or access credentials required to obtain the data in the README.md.
Describe the data modalities and data formats.
If applicable, describe any large data archives that can be used for model inference, including their size.
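As one possible starting point for scripts/download_data.py, the sketch below downloads a file into data/raw/ and verifies it against a SHA-256 checksum. It uses only the Python standard library. The function names `sha256_of` and `download`, and the idea of verifying by checksum, are illustrative assumptions; your data provider may offer a different verification method or a dedicated client library.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download(url, dest, expected_sha256=None):
    """Download `url` to `dest` (skipped if `dest` exists), then verify it.

    `expected_sha256` is a hypothetical checksum that you would take from
    the data provider or record after a first trusted download.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
    if expected_sha256 is not None and sha256_of(dest) != expected_sha256:
        raise ValueError(f"Checksum mismatch for {dest}")
    return dest
```

A call would look like `download(url, Path("data/raw") / "dataset.csv", expected_sha256=checksum)`, where `url` and `checksum` come from your own data source.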
Basic Data Cleaning and Manipulation
Task: Clean the raw data to handle missing values, outliers, or inconsistencies.
Requirements:
Write a script/notebook (scripts/clean_data.py or notebooks/Data_Cleaning.ipynb) that:
Handles missing values (e.g., imputation, removal).
Corrects or removes outliers.
Ensures data consistency (e.g., uniform date formatting, unit conversions).
Saves cleaned data in a new folder (data/clean/).
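The cleaning steps listed above could be sketched with pandas as follows. This is a minimal example under stated assumptions, not a prescribed method: the column names `time` and `value` are hypothetical, outliers are flagged with the 1.5 × IQR rule, and missing values are imputed with the median. Other choices (removal, interpolation, domain-specific thresholds) may suit your dataset better.

```python
import numpy as np
import pandas as pd


def clean(df, time_col="time", value_col="value"):
    """Return a cleaned copy of `df` (column names are assumptions)."""
    out = df.copy()

    # Consistency: parse timestamps; unparseable entries become NaT and are dropped.
    out[time_col] = pd.to_datetime(out[time_col], errors="coerce")
    out = out.dropna(subset=[time_col])

    # Consistency: force the measurement column to numeric; bad entries become NaN.
    out[value_col] = pd.to_numeric(out[value_col], errors="coerce")

    # Outliers: mark values outside 1.5 * IQR of the quartiles as missing.
    q1, q3 = out[value_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (out[value_col] < q1 - 1.5 * iqr) | (out[value_col] > q3 + 1.5 * iqr)
    out.loc[is_outlier, value_col] = np.nan

    # Missing values: impute with the median of the remaining values.
    out[value_col] = out[value_col].fillna(out[value_col].median())

    return out.sort_values(time_col).reset_index(drop=True)
```

The result can then be written to data/clean/, for example with `clean(df).to_csv("data/clean/dataset.csv", index=False)`.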
Organizing Data into AI-Ready Format
Task: Prepare the cleaned data for machine learning, ensuring it is properly annotated and structured.
Requirements:
Convert your data into a format suitable for ML (e.g., pandas DataFrame, NumPy arrays, Xarray).
Ensure the data is well-documented with attributes, labels, and metadata.
Include a notebook (notebooks/Prepare_AI_Ready_Data.ipynb) that clearly describes:
The final shape of the data (number of samples, features, and target labels).
A description of each feature/attribute.
Save the final AI-ready data in a dedicated folder (data/ai_ready/).
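One way to meet these requirements is to store the feature matrix and target vector as NumPy arrays next to a small JSON file that records the shape and a description of each feature. The sketch below assumes a cleaned pandas DataFrame with numeric features. The function name `save_ai_ready`, the file names `dataset.npz` and `metadata.json`, and the metadata keys are illustrative choices, not a required format.

```python
import json
from pathlib import Path

import numpy as np
import pandas as pd


def save_ai_ready(df, feature_cols, target_col, descriptions, out_dir="data/ai_ready"):
    """Save X, y and a metadata file describing them (layout is an assumption)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    X = df[feature_cols].to_numpy(dtype="float32")  # shape: (n_samples, n_features)
    y = df[target_col].to_numpy()                   # shape: (n_samples,)
    np.savez_compressed(out_dir / "dataset.npz", X=X, y=y)

    metadata = {
        "n_samples": int(X.shape[0]),
        "n_features": int(X.shape[1]),
        "features": [
            {"name": col, "description": descriptions.get(col, "")}
            for col in feature_cols
        ],
        "target": {"name": target_col, "description": descriptions.get(target_col, "")},
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata
```

The arrays can be read back with `np.load("data/ai_ready/dataset.npz")`. For gridded or multi-dimensional geoscientific data, an Xarray dataset with attributes may be a more natural container than this flat layout.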
Exploratory Data Analysis (EDA)
Task: Perform a basic exploration of the cleaned data to understand its structure and key characteristics.
Requirements:
Create a notebook (notebooks/EDA.ipynb) that includes:
Basic summary statistics of the dataset (mean, variance, min, max, etc.).
Visualization of feature distributions (histograms, box plots, etc.).
Correlation analysis between different features and target variables (correlation matrix, heatmaps).
Brief discussion on any patterns or insights observed during the analysis.
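The statistical part of the EDA listed above could begin with a helper like the one below. It is a sketch that assumes the cleaned data is a pandas DataFrame with numeric features and a numeric target; the function name `summarize` is hypothetical. Plotting is left out here; histograms, box plots, and heatmaps are typically drawn with a plotting library such as matplotlib.

```python
import pandas as pd


def summarize(df, target_col):
    """Return summary statistics, the correlation matrix, and
    feature-target correlations ranked by absolute value."""
    numeric = df.select_dtypes(include="number")

    # Mean, sample variance, minimum, and maximum of every numeric column.
    stats = numeric.agg(["mean", "var", "min", "max"])

    # Pearson correlation between all pairs of numeric columns.
    corr = numeric.corr()

    # Correlation of each feature with the target, strongest first.
    target_corr = (
        corr[target_col]
        .drop(target_col)
        .sort_values(key=lambda s: s.abs(), ascending=False)
    )
    return stats, corr, target_corr
```

In a notebook, displaying `stats` and `corr` as tables gives a quick first look before moving on to plots and the written discussion.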
Dimensionality Discussion and Reduction
Task: Analyze the dimensionality of your dataset and propose methods to reduce it.
Requirements:
In a notebook (notebooks/Dimensionality_Reduction.ipynb):
Discuss the current dimensions of the dataset and any challenges they present (e.g., high dimensionality, sparse data).
Propose and implement at least two dimensionality reduction techniques:
Feature extraction techniques like PCA (Principal Component Analysis).
Non-linear methods like t-SNE (t-Distributed Stochastic Neighbor Embedding).
Visualize the results of dimensionality reduction (scatter plots, explained variance charts).
Discuss the implications of dimensionality reduction on your dataset.
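To make the PCA step concrete, the sketch below computes principal components with a singular value decomposition in NumPy and returns the explained variance ratio, which is the quantity usually shown in an explained variance chart. The function name `pca` is illustrative. In practice you may prefer library implementations such as `sklearn.decomposition.PCA` and, for the non-linear method, `sklearn.manifold.TSNE`; this example does not cover t-SNE.

```python
import numpy as np


def pca(X, n_components=2):
    """Project X (n_samples, n_features) onto its leading principal components.

    Returns the projected scores and the fraction of total variance
    explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature

    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()

    scores = Xc @ Vt[:n_components].T
    return scores, ratio[:n_components]
```

Features measured in different units are usually standardized (zero mean, unit variance) before PCA, since otherwise the features with the largest numerical range dominate the leading components.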
Deliverables#
A GitHub repository with the following structure:
- data/
  - raw/
  - clean/
  - ai_ready/
- scripts/
  - download_data.py
  - clean_data.py
- notebooks/
  - Download_Data.ipynb
  - Data_Cleaning.ipynb
  - Prepare_AI_Ready_Data.ipynb
  - EDA.ipynb
  - Dimensionality_Reduction.ipynb
- README.md
Ensure all the scripts and notebooks are well-documented, with comments explaining the code.
Submit a link to your GitHub repository as your final assignment.
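If it helps, the directory layout listed in the deliverables can be created with a short script like the one below. This is an optional convenience sketch; the function name `scaffold` and the placeholder README contents are assumptions. Because Git does not track empty directories, the sketch drops a `.gitkeep` file into each one so the folders appear in the repository.

```python
from pathlib import Path

# Directories from the deliverables list.
LAYOUT = ["data/raw", "data/clean", "data/ai_ready", "scripts", "notebooks"]


def scaffold(root="."):
    """Create the project directories and a placeholder README.md under `root`."""
    root = Path(root)
    for relative in LAYOUT:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        # Empty placeholder file so Git keeps the otherwise empty directory.
        (directory / ".gitkeep").touch()

    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text("# Project title\n")
    return root
```

Calling `scaffold()` from the repository root creates the folders in place. Large raw data files are often excluded from version control, in which case the README.md should explain how to regenerate them with the download script.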
Grading Criteria#
Repository Organization (10%): Clean structure with appropriate directories, well-documented README.md.
Data Download and Cleaning (20%): Script functionality, handling missing/outlier data, clean data format.
AI-Ready Data Preparation (20%): Proper data annotation, clear dimensionality description, format suitability for ML.
Exploratory Data Analysis (20%): Quality of statistical analysis, insights, and visualizations.
Dimensionality Reduction (20%): Quality of analysis, use of techniques, and discussion on dimensionality challenges.
Documentation and Code Clarity (10%): Clear explanations and code readability.
Required Self Evaluation#
You should use ChatGPT (4o is best as of 2024) for self-assessment. You may use the following prompt:
Can you grade the following repository <ENTER-URL> with the following rubric "Repository Organization (10%): Clean structure with appropriate directories, well-documented README.md.
Data Download and Cleaning (20%): Script functionality, handling missing/outlier data, clean data format.
AI-Ready Data Preparation (20%): Proper data annotation, clear dimensionality description, format suitability for ML.
Exploratory Data Analysis (20%): Quality of statistical analysis, insights, and visualizations.
Dimensionality Reduction (20%): Quality of analysis, use of techniques, and discussion on dimensionality challenges.
Documentation and Code Clarity (10%): Clear explanations and code readability."
ChatGPT may also suggest improvements if you follow up with prompts like:
Can you please provide more constructive feedback to improve?
Print and upload the reports to Canvas to show (1) the initial assessment and (2) the final assessment.