3.20 Final Project - Classic Machine Learning#

Objective#

This assignment guides you through classic machine learning tasks for your AI-ready dataset. You’ll explore various ML algorithms, evaluate model performance, and assess computational efficiency. The goal is to practice the steps of selecting and optimizing a model for your project, preparing you for deeper model development.

Structure of the Assignment#

  1. Research Project Relevance

    • Task: Write a brief essay that discusses the relevance of machine learning to your project and outlines your approach.

    • Requirements:

      • A concise, ½-page essay (in Research_Relevance.md) that includes:

        • The problem’s connection to classification or regression.

        • Justification of whether the work will use a supervised, unsupervised, self-supervised, or semi-supervised approach.

        • Expected outcomes and potential impact of applying ML to your dataset.

  2. Clustering Analysis for Classification Projects

    • Task: If your project involves classification, perform a clustering analysis on your feature space, ideally after dimensionality reduction.

    • Requirements:

      • Apply clustering algorithms (e.g., K-means) to understand feature clusters or patterns.

      • Follow best practices for clustering analysis in a notebook (notebooks/Clustering_Analysis.ipynb); a minimal code sketch follows this list:

        • Use silhouette analysis and the elbow curve to identify the optimal number of clusters K.

        • Evaluate clusters against your class labels with homogeneity or the Fowlkes-Mallows index.

        • Test robustness by repeating K-means runs with different random seeds.

      • Visualize the clusters and discuss how this analysis informs your approach to classification.
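
A minimal sketch of these checks, assuming a scikit-learn workflow; synthetic blobs stand in for your dimensionality-reduced feature matrix, and the cluster counts and seed range are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import fowlkes_mallows_score, silhouette_score

# Synthetic stand-in for your (dimensionality-reduced) feature matrix X.
X, _ = make_blobs(n_samples=500, centers=4, n_features=8, random_state=0)

# Elbow curve (inertia) and silhouette analysis over a range of K values.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")

# Robustness check: rerun K-means with different seeds and compare the
# resulting partitions with the Fowlkes-Mallows index (1.0 = identical labels).
labels = [KMeans(n_clusters=4, n_init=10, random_state=s).fit_predict(X) for s in range(5)]
agreement = [fowlkes_mallows_score(labels[0], lab) for lab in labels[1:]]
print("agreement with seed 0:", np.round(agreement, 3))
```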

  3. AutoML and Hyperparameter Tuning

    • Task: Use automated machine learning tools or perform a manual model hyperparameter search to find suitable models and settings for your problem.

    • Requirements:

      • Apply AutoML frameworks (e.g., pycaret) or manual search techniques to test and select models; a minimal manual-search sketch follows this list.

      • Save results and key insights in a notebook (notebooks/AutoML_Hyperparameter_Tuning.ipynb).

      • Describe the algorithms evaluated and the hyperparameters optimized.

      • Identify the most promising models based on accuracy, interpretability, or computational cost.
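
A minimal sketch of the manual-search path, assuming scikit-learn and a random forest as the illustrative candidate; synthetic data stands in for your AI-ready features, and an AutoML framework such as pycaret would automate a similar comparison across many model families:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for your AI-ready dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Illustrative search space; adapt the parameters to your chosen model.
param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=15,          # number of sampled configurations
    cv=5,               # 5-fold cross-validation per configuration
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

In the notebook, also record search.cv_results_ so the full set of tested configurations and their scores can be reported, not just the winner.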

  4. Training Engineering and Model Assessment

    • Task: Perform a thorough analysis of training strategies and model assessment.

    • Requirements:

      • In a notebook (notebooks/Model_Training_Assessment.ipynb), demonstrate:

        • Cross-validation and train-val-test splitting practices to ensure robust evaluation.

        • Generalization performance, assessed by testing on diverse subsets of the data.

        • Bootstrapping and bagging techniques across different model architectures and data variations.

        • Visualizations of model performance (e.g., learning curves, accuracy, loss) across training rounds.

        • A discussion of results, including how these methods affect model generalization and performance; a minimal evaluation sketch follows this list.
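
A minimal sketch of this evaluation scaffolding, assuming scikit-learn: a held-out test split, five-fold cross-validation of a bagged (bootstrap-aggregated) base model, and a learning curve, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, learning_curve, train_test_split

# Synthetic stand-in for your AI-ready dataset.
X, y = make_classification(n_samples=1500, n_features=20, random_state=0)

# Hold out a test set; the remaining data is used for cross-validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation of a bagged (bootstrapped) base model.
model = BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=25, random_state=0)
cv_scores = cross_val_score(model, X_trainval, y_trainval, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Learning curve: accuracy as a function of training-set size.
sizes, train_scores, val_scores = learning_curve(
    model, X_trainval, y_trainval, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")

# Final check on the untouched test set.
print("test accuracy:", round(model.fit(X_trainval, y_trainval).score(X_test, y_test), 3))
```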

  5. Computational Time Analysis

    • Task: Analyze the computational time needed for model training and deployment.

    • Requirements:

      • In a notebook (notebooks/Computational_Time_Analysis.ipynb), include:

        • Metrics on training time for each model architecture, detailing how various parameters affect speed.

        • An exploration of time vs. accuracy trade-offs for different configurations.

        • An assessment of expected time requirements for model deployment in real-world scenarios.

      • Summarize findings in a short conclusion, focusing on any computational challenges or optimizations relevant to your model; a minimal timing sketch follows this list.
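
A minimal sketch of the timing comparison, assuming scikit-learn models on synthetic stand-in data; wall-clock fit and prediction times are recorded alongside cross-validated accuracy so the time vs. accuracy trade-off can be tabulated:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your AI-ready dataset.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# Illustrative candidate models; replace with the architectures you tested.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest_200": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    acc = cross_val_score(model, X, y, cv=5).mean()  # cross-validated accuracy

    start = time.perf_counter()
    model.fit(X, y)                                  # wall-clock training time
    fit_s = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X)                                 # proxy for inference/deployment time
    pred_s = time.perf_counter() - start

    print(f"{name:20s}  fit={fit_s:6.2f}s  predict={pred_s:5.2f}s  cv_accuracy={acc:.3f}")
```

For deployment estimates, time model.predict on a batch sized like the input you expect in your real-world scenario.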

Deliverables#

  • A GitHub repository with the following structure:

    - data/
      - ai_ready/
    - notebooks/
      - Clustering_Analysis.ipynb
      - AutoML_Hyperparameter_Tuning.ipynb
      - Model_Training_Assessment.ipynb
      - Computational_Time_Analysis.ipynb
    - Research_Relevance.md
    - README.md
    
  • A clear README.md file describing the assignment, key findings, and instructions to reproduce the analyses.

Grading Criteria#

  • Relevance Essay (10%): Clarity of problem framing, appropriate ML approach, and impact explanation.

  • Clustering Analysis (20%) (classification projects only): Depth of clustering analysis, best practices in evaluation, and clarity of insights.

  • AutoML and Hyperparameter Tuning (20%): Range of models tested, quality of parameter search, and clarity in reporting.

  • Training Engineering and Model Assessment (30%): Robustness of training strategy, cross-validation practices, use of bootstrapping, and clear results discussion.

  • Computational Time Analysis (10%): Thorough analysis of training and deployment times, insights on computational efficiency.

  • Documentation and Code Clarity (10%): Clear explanations, code readability, and adherence to best practices.

This assignment enables you to apply classic machine learning methods to real geoscientific data, deepening your practical understanding of model selection, tuning, and assessment.