2.11 ML-ready data

Preparing and pre-processing data so that it integrates cleanly into a machine learning workflow is fundamental to a good machine learning project.

Click the teaching icon (../_images/teaching.png) to get the slides.

  1. Consider the problem

    Before jumping to a specific algorithm, define the problem in the most general terms. Formulate, for yourself and others, what the question actually is. Is it about transforming \(A\) into \(B\)? Predicting \(Y\) from \(X\)? Determining how one variable is related to another? If the problem is computable, it can likely be addressed with some ML method.

  2. What is realistically possible?

    • Review existing literature (not necessarily ML-specific literature!).

    • If experts can only achieve \(x\) accuracy for some task, that should be your first benchmark. If \(x\) accuracy is “good enough”, then the ML solution may not need to do better!

  3. Look into general approaches

    • If the problem involves recognizing textures in images, read papers that describe both the history and state-of-the-art solutions to that general problem.

    • Likewise, if the project involves dividing data into classes on the basis of several variables, read papers on classification methods.

    • As in the previous step, get a sense of what is possible for your general question.

    • Determine whether the final approach will be narrow or general (and to what extent).

  4. Compile the data

    • Get your data into one location (e.g., your home folder).

    • This process can take some time, so do it early.

    • However you assemble your data, you should document every step!

  5. Organize the data in machine-readable formats and data structures that can be manipulated automatically in the ML workflow (a short sketch follows this list):

    • arrange data in numpy arrays, xarray DataArrays, or pandas DataFrames.

    • save the data and its attributes in Zarr, HDF5, or CSV.

    • store in a single folder

    • do not transform and overwrite the raw data
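
    A minimal sketch of step 5, assuming an xarray/Zarr stack: wrap the data in a DataArray with descriptive attributes and write it to a new store, leaving the raw files untouched. The file, store, and variable names here are illustrative.

    ```python
    import numpy as np
    import xarray as xr

    rng = np.random.default_rng(42)
    data = rng.standard_normal((100, 3))  # stand-in for your measurements

    da = xr.DataArray(
        data,
        dims=("sample", "channel"),
        coords={"sample": np.arange(100), "channel": ["x", "y", "z"]},
        attrs={"units": "m/s", "source": "raw/experiment_01.csv"},  # document provenance
    )

    # Write to a separate ML-ready store; the raw data stays where it is.
    da.to_dataset(name="velocity").to_zarr("experiment_01.zarr", mode="w")
    ```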

  6. Characterize the data

    • Explore the statistical properties of the data (draw histograms, distributions, and cross-plots; plot correlation matrices). Save the scripts of all data exploration; a short pandas sketch follows this list.

    • What will the actual data input be?

    • What are the outputs/labels?
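
    The exploration above can start as a short pandas script like the sketch below; the file and column contents are placeholders for your own dataset.

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("experiment_01.csv")  # hypothetical ML-ready table

    print(df.describe())               # summary statistics per column
    print(df.corr(numeric_only=True))  # correlation matrix of numeric columns

    df.hist(bins=50, figsize=(10, 6))  # one histogram per numeric column
    plt.tight_layout()
    plt.savefig("data_histograms.png")  # keep the exploration outputs with the scripts
    ```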

  7. Consider data manipulations that extract features from the data as a first step toward dimensionality reduction (a short sketch follows this list):

    • extract statistical, temporal, or spectral features (use tsfresh, tsfel, …)

    • transform the data into Fourier or Wavelet space (use scipy fft or cwt module)

    • reduce dimensionality by taking the PCA or ICA of the data. Save these features to a file or as metadata (use the scikit-learn PCA or FastICA modules). An additional feature-reduction approach is:

      • Feature selection finds the dimensions that explain the data without loss of information, ending with a smaller dimensionality of the input data. Forward selection starts with the single variable that decreases the error the most and adds variables one by one. Backward selection starts with all variables and removes them one by one.
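
    A sketch of the spectral-feature and PCA ideas above, assuming time-series data of shape (n_samples, n_points); the array sizes and the number of components are arbitrary illustrative choices.

    ```python
    import numpy as np
    from scipy.fft import rfft
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    waveforms = rng.standard_normal((200, 1024))  # 200 time series, 1024 samples each

    # Spectral features: amplitude spectrum of each waveform.
    spectra = np.abs(rfft(waveforms, axis=1))     # shape (200, 513)

    # Dimensionality reduction: keep the first 10 principal components.
    pca = PCA(n_components=10)
    features = pca.fit_transform(spectra)         # shape (200, 10)
    print(pca.explained_variance_ratio_.sum())    # variance retained by the components

    np.save("features_pca.npy", features)         # save the features for later steps
    ```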

  8. Consider data augmentation

    • Say you have a small1 dataset. One thing you might do to address this issue is to augment your data (e.g., create modified copies of your data).

    • Bootstrap your data, or use Monte Carlo methods to propagate uncertainties. If you have images, skew, stretch, rotate, and mirror them (a bootstrap sketch follows this list).
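
    A minimal bootstrap-style augmentation sketch for a small tabular dataset; the noise level stands in for an assumed measurement uncertainty and is purely illustrative.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 4))  # small original dataset: 50 samples, 4 features
    sigma = 0.05                      # assumed per-feature uncertainty

    augmented = []
    for _ in range(10):  # make 10 resampled, perturbed copies
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        augmented.append(X[idx] + rng.normal(0.0, sigma, size=X.shape))

    X_aug = np.vstack(augmented)
    print(X_aug.shape)  # (500, 4): ten times more (noisy) samples
    ```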

  9. Towards reproducible workflows. Save the data-processing workflow from raw data to feature data (a Pipeline sketch follows this list).

    • Use the scikit-learn Pipeline module.

    • Write a Python script to reproduce the pre-processing.
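
    A sketch of a reproducible pre-processing chain using the scikit-learn Pipeline module; the particular steps (scaling, PCA, a classifier) are examples rather than a prescription.

    ```python
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([
        ("scale", StandardScaler()),    # standardize features
        ("pca", PCA(n_components=10)),  # dimensionality reduction
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # pipe.fit(X_train, y_train) applies every step in order; the fitted pipeline
    # can be saved (e.g., with joblib.dump) so the whole workflow, from features
    # to prediction, is reproducible from a single script.
    ```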


1

The definition of “small” is problem dependent. 1000 observations may be more than enough for simple regression analyses. The same number of observations may not be adequate for image segmentation tasks. Consider the extent of your problem space.