2.9 Feature engineering#

Feature engineering is the process of manipulating, transforming, and selecting raw data into features that can be used for statistical analysis or prediction. Common families of time-series features include:

  • statistical features

  • temporal features

  • spectral features (Fourier and Wavelet transforms)

This lecture demonstrates how to automatically extract common time-series features using tsfel, a popular (but simple) Python package. We will take the example of seismic waveforms recorded in the Pacific Northwest. The Pacific Northwest dataset detects and labels seismic waveforms for events of various origins: earthquakes, explosions (mostly quarry blasts), and surface events (usually avalanches and landslides), but also seismic noise (ambient Earth vibrations in between events).

We will explore how these features vary among the four categories, or classes of seismic events.
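Before automating the extraction, a few statistical and temporal features can be computed by hand. The sketch below uses a synthetic signal, and the feature names are our own choices, not tsfel's:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)  # synthetic stand-in for one waveform

# A few hand-rolled features of a 1-D signal
features = {
    "mean": x.mean(),
    "std": x.std(),
    "peak_amplitude": np.abs(x).max(),
    "energy": np.sum(x**2),
    # zero-crossing rate: fraction of consecutive samples that change sign
    "zcr": np.mean(np.signbit(x[1:]) != np.signbit(x[:-1])),
}
print(features)
```

Packages like tsfel compute dozens of such features (statistical, temporal, spectral) in one call.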

[Level 1]

# Import modules for seismic data and feature extraction

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats as st
import os
import h5py # for reading .h5 files
#make sure to install wget
!pip install wget
import wget

We download two files from the class storage: the waveforms themselves as an HDF5 file and their associated metadata as a CSV file:

wget.download("https://www.dropbox.com/s/f0e1ywupdbuv3l3/miniPNW_metadata.csv?dl=1")
wget.download("https://www.dropbox.com/s/0ffh4r23mitn2dz/miniPNW_waveforms.hdf5?dl=1")
'miniPNW_waveforms (1).hdf5'

Metadata#

We first read the metadata and arrange them into a pandas DataFrame.

# read the metadata into a DataFrame
df = pd.read_csv("miniPNW_metadata.csv")
df.head()
Unnamed: 0 event_id source_origin_time source_latitude_deg source_longitude_deg source_type source_depth_km preferred_source_magnitude preferred_source_magnitude_type preferred_source_magnitude_uncertainty ... trace_S_onset trace_P_onset trace_snr_db year source_type_pnsn_label source_local_magnitude source_local_magnitude_uncertainty source_duration_magnitude source_duration_magnitude_uncertainty source_hand_magnitude
0 0 uw61669232 2020-09-07T03:44:14.690000Z 46.560 -119.797 earthquake 23.300 1.30 ml 0.241000 ... impulsive impulsive -1.444|2.612|9.921 2020.0 eq 1.30 0.241273 1.17 0.187840 NaN
1 1 uw60888282 2014-10-08T15:39:31.330000Z 45.371 -121.708 earthquake -0.947 1.67 ml 0.128000 ... impulsive impulsive 0.368|3.526|5.981 2014.0 eq 1.67 0.128000 1.63 0.099000 NaN
2 2 uw61361706 2017-12-30T04:37:46.870000Z 46.165 -120.543 earthquake 13.520 2.46 ml 0.158000 ... impulsive emergent 11.274|13.32|15.828 2017.0 eq 2.46 0.158000 3.37 0.389000 NaN
3 3 uw61639436 2020-06-09T23:37:10.420000Z 46.542 -119.589 earthquake 16.370 1.59 ml 0.157000 ... impulsive emergent 27.007|20.797|19.252 2020.0 eq 1.59 0.156750 1.69 0.354773 NaN
4 4 uw61735446 2021-05-24T10:42:37.810000Z 46.857 -121.941 earthquake 12.380 0.83 ml 0.082983 ... impulsive emergent 18.579|18.912|7.609 2021.0 eq 0.83 0.082983 0.50 0.381880 NaN

5 rows × 37 columns

The nature of the event source is stored in one of the metadata attributes:

df['source_type'].unique()
array(['earthquake', 'explosion', 'sonic_boom', 'thunder',
       'surface_event'], dtype=object)

Let’s assume that we are exploring features to classify the waveforms into the categories of event types. We will use the source_type attribute as the labels.

labels=df['source_type']

How many seismic waveforms are there in each category?

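One way to count waveforms per class is pandas `value_counts`, sketched here on a small synthetic stand-in for the `labels` Series defined above:

```python
import pandas as pd

# Hypothetical stand-in for the labels Series loaded from the metadata
labels = pd.Series(["earthquake"] * 5 + ["explosion"] * 3 + ["surface_event"] * 2)

counts = labels.value_counts()  # counts per class, sorted by frequency
print(counts)
```

On the real metadata, `df['source_type'].value_counts()` gives the class counts directly.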

Would you say that this is a balanced data set with respect to the four classes of interest?

Now we will look at the seismic data, taking a random waveform from each of the categories

plt.hist(labels)
(array([500.,   0., 500.,   0.,   0., 126.,   0.,  94.,   0., 500.]),
 array([0. , 0.4, 0.8, 1.2, 1.6, 2. , 2.4, 2.8, 3.2, 3.6, 4. ]),
 <BarContainer object of 10 artists>)
[Figure: histogram of the event-type labels]

Now we read the data. It is stored in an HDF5 file under a finite number of groups. Each group holds an array of datasets that correspond to the waveforms. To link the metadata to the waveforms, the key trace_name encodes the dataset address, labeled as follows:

bucketX$i,:3,:n

where X is the HDF5 group number and i is the index within the group. Each entry typically contains three waveforms of n samples, one for each direction of ground motion (N, E, Z). In the following exercise, we will focus on the vertical (Z) waveforms.
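For example, a hypothetical trace name such as `bucket4$123,:3,:15001` (the exact string here is made up) can be split into its parts with plain string operations:

```python
# Parse a hypothetical trace_name into bucket, trace index, components, samples
tn = "bucket4$123,:3,:15001"
bucket, narray = tn.split("$")                            # "bucket4", "123,:3,:15001"
idx, ncomp, nsamp = (int(v) for v in narray.split(",:"))  # 123, 3, 15001
print(bucket, idx, ncomp, nsamp)  # bucket4 123 3 15001
```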

f = h5py.File("miniPNW_waveforms.hdf5", "r")

Below is a function to read a waveform from the file.

def read_data(tn, f):
    bucket, narray = tn.split('$')  # split trace_name into bucket and array address
    x, y, z = (int(i) for i in narray.split(',:'))  # trace index, n components, n samples
    data = f['/data/%s' % bucket][x, :y, :z]  # read one trace as a 2D array (components x samples)
    return data

The trace name is stored as an attribute in the metadata.

ldata = list(df['trace_name'])
waveform = read_data(ldata[400], f)
print(waveform.shape)
plt.plot(waveform[1, :])
(3, 15001)
[<matplotlib.lines.Line2D at 0x1c90d4820>]
[Figure: one horizontal-component waveform]

We will now extract the Z component of each waveform and stack them into a single array.

nt = waveform.shape[-1]
ndata = len(labels)
Z = np.zeros(shape=(ndata, nt))
for i in range(ndata):
    Z[i, :] = read_data(df.iloc[i]["trace_name"], f)[2, :nt]

Now we have the data and their attributes, in particular the source type as the label.

We are going to extract features automatically with tsfel and explore how they vary among the event classes.

!pip install tsfel
import tsfel

We need to format the input for tsfel. We first retrieve a configuration dictionary listing the features to compute (here, all feature domains), then run the extractor on each waveform and attach the source_type label to each row.

cfg = tsfel.get_features_by_domain()  # configuration with all feature domains
Z.shape
(1720, 15001)
for i in range(Z.shape[0]):
    print(i / Z.shape[0])  # progress indicator
    if i == 0:
        X = tsfel.time_series_features_extractor(cfg, Z[i, :], fs=100.)
        X['source_type'] = df.iloc[i]['source_type']
    else:
        XX = tsfel.time_series_features_extractor(cfg, Z[i, :], fs=100.)
        XX['source_type'] = df.iloc[i]['source_type']
        X = pd.concat([XX, X], axis=0, ignore_index=True)
X.head()

Exploring the feature space#

Here we will plot the distributions of features among the four classes, and explore which features are most correlated with each other in each category.

plt.hist(np.log10(X['0_Wavelet variance_1']));
plt.matshow(X.drop('source_type',axis=1).corr(),cmap='seismic',vmin=-1,vmax=1)
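Highly correlated features are redundant for classification. One way to list correlated pairs is sketched below on a synthetic feature table; the column names `f1`-`f3` and the 0.95 threshold are our own choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.standard_normal(200)
feats = pd.DataFrame({
    "f1": a,
    "f2": a + 0.01 * rng.standard_normal(200),  # nearly a copy of f1
    "f3": rng.standard_normal(200),             # independent
})

corr = feats.corr().abs()
# list feature pairs above the correlation threshold (upper triangle only)
pairs = [(r, c) for i, r in enumerate(corr.columns)
         for c in corr.columns[i + 1:] if corr.loc[r, c] > 0.95]
print(pairs)  # [('f1', 'f2')]
```

The same recipe applied to `X.drop('source_type', axis=1)` flags which tsfel features could be dropped before classification.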

Now calculate features for 100 events of each category and find features that differ among the classes.

You may 1) calculate all features, 2) plot histograms/distributions of each feature, overlaying the classes, and 3) report on the features that look different between classes and propose a workflow to classify event types.
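One way to screen for discriminative features is to compare per-class means against the within-class spread. The sketch below uses a synthetic two-class table; the feature and class values are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "feat": np.r_[rng.normal(0.0, 1.0, 50), rng.normal(3.0, 1.0, 50)],
    "source_type": ["earthquake"] * 50 + ["explosion"] * 50,
})

# Per-class summary: a feature whose class means are separated by more
# than its within-class spread is a good classification candidate
summary = X.groupby("source_type")["feat"].agg(["mean", "std"])
separation = abs(summary["mean"].diff().iloc[-1]) / summary["std"].mean()
print(summary)
print("separation (in std units):", separation)
```

Ranking the real tsfel features by such a separation score is one way to choose which ones to keep in a classification workflow.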