2.13 ML-ready data#

Preparing an ML Data Set#

Preparing and pre-processing data for integration into a machine learning workflow is fundamental to a good machine learning project.

Lecture Slides

  1. Consider the problem

    Before jumping to a specific algorithm, define the problem in the most general terms. Formulate the question for yourself and others. Is it about transforming \(A\) into \(B\)? Predicting \(Y\) from \(X\)? Determining how one variable is related to another? If the problem is computable, then it likely can be addressed using some ML method.

  2. What is realistically possible?

    • Review existing literature (not necessarily ML-specific literature!).

    • If experts can only achieve \(x\) accuracy for some task, that should be your first benchmark. If \(x\) accuracy is “good enough”, then the ML solution may not need to do better!

  3. Look into general approaches

    • If the problem involves recognizing textures in images, read papers that describe both the history and state-of-the-art solutions to that general problem.

    • Likewise, if the project involves dividing data into classes on the basis of several variables, read papers on classification methods.

    • Like in the previous step, get a sense for what is possible for your general question.

    • Determine whether the final approach will be narrow or general (and to what extent).

  4. Compile the data

    • Get your data into one location (e.g., your home folder).

    • This process can take some time, so do it early.

    • However you assemble your data, you should document every step!

  5. Organize the data in machine-readable formats and data structures that can be manipulated automatically in the ML workflow:

    • Arrange the data in NumPy arrays, xarray Datasets, or pandas DataFrames.

    • Save the data and its attributes in Zarr, HDF5, or CSV formats.

    • Store everything in a single folder.

    • Do not transform and overwrite the raw data; see the sketch below.
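
A minimal sketch of this step, assuming simple synthetic data; the file names, variable names, and attributes are placeholders to adapt to your own project:

```python
# A sketch of organizing data into machine-readable structures and saving it
# alongside its attributes; file/variable names here are placeholders.
import numpy as np
import pandas as pd
import xarray as xr

# Tabular data -> pandas DataFrame -> CSV
df = pd.DataFrame({"time": np.arange(100), "value": np.random.randn(100)})
df.to_csv("measurements.csv", index=False)

# Labeled/gridded data -> xarray Dataset with attributes -> Zarr (requires the zarr package)
ds = xr.Dataset(
    {"value": ("time", np.random.randn(100))},
    coords={"time": np.arange(100)},
    attrs={"units": "m/s", "source": "raw instrument files (kept unmodified)"},
)
ds.to_zarr("measurements.zarr", mode="w")
```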

  6. Characterize the data

    • Explore the data's statistical properties (draw histograms, distributions, and cross-plots; plot correlation matrices). Save the scripts of all data exploration; see the sketch below.

    • What will the actual data inputs be?

    • What are the outputs/labels?
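
A minimal sketch of such exploration, assuming the data were saved as "measurements.csv" in the previous step (a placeholder file name) and using pandas and matplotlib:

```python
# A sketch of basic data characterization: summary statistics, histograms,
# and a correlation matrix, all saved so the exploration is reproducible.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("measurements.csv")

print(df.describe())              # summary statistics for every column
df.hist(bins=50)                  # histograms of every column
plt.savefig("histograms.png")

corr = df.corr()                  # correlation matrix
plt.figure()
plt.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.tight_layout()
plt.savefig("correlation_matrix.png")
```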

  7. Consider data manipulations that extract features from the data as a first step toward dimensionality reduction:

    • extract statistical, temporal, or spectral features (use tsfresh, tsfel, …)

    • transform the data into Fourier or Wavelet space (use scipy fft or cwt module)

    • reduce dimensionality by applying PCA or ICA to the data, and save these features to a file or to metadata (use the scikit-learn PCA or FastICA modules); see the sketch below. Additional feature reduction might include:

      • Feature selection finds the dimensions that explain the data without loss of information, ending with a smaller dimensionality of the input data. A forward selection approach starts with the one variable that decreases the error the most and adds variables one by one. A backward selection approach starts with all variables and removes them one by one.
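
A minimal sketch combining a spectral transform with PCA, assuming the inputs are a set of fixed-length waveforms stored in a NumPy array (the array here is synthetic):

```python
# A sketch of feature extraction and dimensionality reduction: an amplitude
# spectrum per waveform (scipy.fft) followed by PCA (scikit-learn).
import numpy as np
from scipy.fft import rfft
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))          # placeholder: 200 waveforms of 512 samples

spectra = np.abs(rfft(X, axis=1))            # spectral features

pca = PCA(n_components=10)                   # keep 10 principal components
X_reduced = pca.fit_transform(spectra)
print(X_reduced.shape)                       # (200, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```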

  8. Consider data augmentation

    • Say you have a small [1] dataset. One thing you might do to address this issue is augment your data (e.g., create modified copies of your data).

    • Bootstrap your data. Or use Monte Carlo methods to propagate uncertainties. If you have images, skew, stretch, rotate, and mirror them.
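
A minimal sketch of both ideas on synthetic arrays; which augmentations are appropriate depends on what transformations preserve the meaning of your data:

```python
# A sketch of two simple augmentation strategies.
import numpy as np

rng = np.random.default_rng(0)

# 1) Bootstrap: resample observations with replacement
X = rng.standard_normal((100, 5))
idx = rng.integers(0, len(X), size=len(X))
X_boot = X[idx]

# 2) Images: flips and rotations create modified copies
img = rng.standard_normal((64, 64))
augmented = [np.fliplr(img), np.flipud(img), np.rot90(img, k=1), np.rot90(img, k=2)]
```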

  9. Towards reproducible workflows: save the data-processing workflow from raw data to feature data.

    • Use the scikit-learn Pipeline module (see the sketch below).

    • Write a Python script to reproduce the pre-processing.
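
A minimal sketch of such a pipeline; the particular steps (standardization followed by PCA) are illustrative, not prescriptive:

```python
# A sketch of a scikit-learn Pipeline that chains pre-processing steps so the
# raw-data-to-features workflow can be re-run in a single call.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),    # standardize each feature
    ("pca", PCA(n_components=10)),  # reduce dimensionality
])

X_raw = np.random.default_rng(0).standard_normal((200, 512))   # placeholder raw data
X_features = pipeline.fit_transform(X_raw)                     # reproducible pre-processing
```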

Preparing Data for Training#

What are the roles of the training and test sets#

A training data set is the foundation of machine learning models. It provides the data used by the algorithm to learn patterns, relationships, and representations necessary for making predictions or decisions. The primary goal of the training data set is to enable the model to minimize the error or loss function by optimizing its parameters.

A test data set is used to evaluate the performance of the model on unseen data. Preparing it involves:

  1. Data Splitting: The test set is typically a randomly chosen subset of the data, separate from the training set, often making up 10–30% of the dataset.

  2. Hold-Out Principle: Test data should never overlap with training or validation data to avoid biased performance estimates.

  3. Real-World Representativeness: The test data should reflect the real-world scenarios where the model will be deployed, ensuring a robust performance evaluation.
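
A minimal sketch of a hold-out split with scikit-learn, using synthetic data and an illustrative 20% test fraction:

```python
# A sketch of a hold-out split: 80% training, 20% test, fixed random seed.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))          # placeholder features
y = rng.integers(0, 2, size=1000)            # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```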

Difference between preparing data for Classic Machine Learning (CML) and for Deep Learning (DL)#

  1. Preparing Data for CML: CML models are typically simpler and have fewer hyperparameters to tune.

    • A training data set is used to fit the model. The model is often evaluated directly on the test set after training.

    • Cross-validation techniques (e.g., k-fold cross-validation) are frequently used to assess model performance and reduce overfitting, eliminating the need for a separate validation set in many cases.
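
A minimal sketch of k-fold cross-validation on synthetic data; the random forest is just an example estimator:

```python
# A sketch of 5-fold cross-validation for a classic ML model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))           # placeholder features
y = rng.integers(0, 2, size=500)             # placeholder labels

scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print(scores.mean(), scores.std())           # performance estimate, no separate validation set
```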

  2. Preparing Data for DL: DL models are more complex, often with millions of parameters, and require a validation set in addition to the training and test sets. The validation set is used during training to:

    • Tune hyperparameters (e.g., learning rate, number of layers).

    • Monitor the model’s performance and detect overfitting (e.g., via early stopping when validation loss stops improving).

A typical split for DL is:

  • Training set: 70–80% of the data.

  • Validation set: 10–20% of the data.

  • Test set: 10–20% of the data.

The validation set helps avoid the “test set leakage” issue by ensuring the test set remains unseen during training and hyperparameter tuning.
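
A minimal sketch of an 80/10/10 train/validation/test split built from two successive hold-out splits, using synthetic data:

```python
# A sketch of an 80/10/10 train/validation/test split from two hold-out splits.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))          # placeholder features
y = rng.integers(0, 2, size=1000)            # placeholder labels

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```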


[1] The definition of “small” is problem dependent. 1000 observations may be more than enough for simple regression analyses. The same number of observations may not be adequate for image segmentation tasks. Consider the extent of your problem space.