2.1 Data Definitions

Geoscientific data is particularly diverse: point measurements of soil moisture, high-rate seismogram time series (1000 samples per second), rasterized Landsat imagery, and simulated geophysical fields that vary in space and time.

The data modality

Modality refers to the field, or genre, of a measurement. Examples of different modalities are seismograms, GPS displacement time series, and surface air temperature time series. All of them are point-based measurements, share the same data type (1D arrays), and could be saved in the same data format (e.g., a CSV file), but they sense different physical fields.
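As a minimal sketch of this point, the snippet below builds three synthetic records that stand in for the three modalities. The values, sampling rate, and variable names are illustrative assumptions, not real observations. All three arrays end up with the same data type and shape even though they would represent different physical fields.

```python
import numpy as np

# Hypothetical one-minute records sampled once per second (synthetic values).
t = np.arange(60.0)                                # time, s
seismogram = 1e-6 * np.sin(2 * np.pi * 0.1 * t)    # ground velocity, m/s
gps_east = 0.002 * t / 60.0                        # eastward displacement, m
air_temp = 15.0 + 0.01 * t                         # surface air temperature, deg C

records = {
    "seismogram": seismogram,
    "gps_east": gps_east,
    "air_temp": air_temp,
}

# Same container (1D float array) for every modality; only the physics differs.
for name, values in records.items():
    print(name, values.dtype, values.shape)
```

Nothing in the arrays themselves records which physical field they sense, which is why the modality usually has to be carried as metadata (here, the dictionary keys).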

The data type refers to the type of an object. Geoscientific data is often numeric (floats, integers), so it can be used directly in calculations. It can also be categorical (i.e., qualitative or nominal).
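The distinction between numeric and categorical data can be illustrated with pandas. This is a small sketch under assumed values: the soil-moisture numbers and rock-type labels are made up for illustration.

```python
import pandas as pd

# Numeric data: arithmetic and statistics are meaningful.
soil_moisture = pd.Series([0.21, 0.34, 0.28], dtype="float64")
mean_moisture = soil_moisture.mean()

# Categorical data: a fixed set of labels; counting is meaningful, averaging is not.
rock_type = pd.Series(["basalt", "granite", "basalt"], dtype="category")
label_counts = rock_type.value_counts()

print(mean_moisture)
print(list(rock_type.cat.categories))
print(label_counts["basalt"])
```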

The data format refers to the specific parsing schema of a file (H5, CSV, JSON). A format can be binary (H5), text-based with a standard character encoding (CSV, JSON), or compressed (H5, Parquet), among other properties; see Chapter 2.5 for more details.
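To make the text-versus-binary distinction concrete, the sketch below serializes the same three values in three ways using only the Python standard library. The raw `struct` packing is a stand-in for a binary layout in general, not the HDF5 format itself, and the sample values are assumptions.

```python
import csv
import io
import json
import struct

values = [0.21, 0.34, 0.28]  # hypothetical soil-moisture samples

# Text formats: the numbers are written as human-readable characters.
csv_buffer = io.StringIO()
csv.writer(csv_buffer).writerow(values)
csv_text = csv_buffer.getvalue()
json_text = json.dumps({"soil_moisture": values})

# Binary layout: three little-endian 64-bit floats, 8 bytes each.
binary_blob = struct.pack("<3d", *values)

# Each format needs its own parser to recover the same numbers.
from_csv = [float(x) for x in next(csv.reader(io.StringIO(csv_text)))]
from_json = json.loads(json_text)["soil_moisture"]
from_binary = list(struct.unpack("<3d", binary_blob))

print(repr(csv_text))
print(json_text)
print(len(binary_blob), "bytes")
```

The content is identical in all three cases; only the parsing schema needed to read it back differs.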

The difference in dimensionality among geoscientific data challenges the design of machine-learning models across disciplines. In most machine-learning practice, data modalities are classified by their dimensionality. One example is a geophysical model that combines satellite imagery (2D in space) with time series (1D in time) from point-based sensor measurements to predict an output.
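The mismatch in dimensionality shows up directly in array shapes. The sketch below uses random numbers as placeholders for a satellite tile and a sensor record; the sizes (64 by 64 pixels, 365 daily samples) are arbitrary assumptions. Flattening and concatenating is only one naive way to feed both inputs to a single model.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((64, 64))   # placeholder 2D tile: (rows, columns)
series = rng.random(365)       # placeholder 1D record: (time,)

print(image.ndim, image.shape)     # 2 dimensions in space
print(series.ndim, series.shape)   # 1 dimension in time

# Naive fusion (an assumption, not a recommended design):
# flatten both inputs into one feature vector.
features = np.concatenate([image.ravel(), series])
print(features.shape)
```

Flattening discards the neighborhood structure of the image and the ordering context of the series, which is one reason mixed-dimensional inputs complicate model design.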

Data Frames

A DataFrame is a tabular data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. A DataFrame resembles a single table in a relational database. Its data schema defines how data is organized within the DataFrame: it maps each column name to the kind of values stored in that column.
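A short pandas sketch of a DataFrame and its schema follows. The station codes, coordinates, and measurements are invented for illustration.

```python
import pandas as pd

# Each key is a column name; each list holds that column's values.
df = pd.DataFrame(
    {
        "station": ["A01", "A02", "A03"],
        "latitude": [47.65, 46.85, 48.10],
        "soil_moisture": [0.21, 0.34, 0.28],
    }
)

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # the schema: column name -> data type

# Columns are addressed by name, rows by position or index label.
wettest = df.loc[df["soil_moisture"].idxmax(), "station"]
print(wettest)
```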

Data frames can be saved in row-based file formats (comma-separated values, CSV) or column-based formats (Parquet).
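The difference between row-based and column-based layouts can be sketched with plain Python containers. This illustrates the two orderings only; it does not read or write actual CSV or Parquet files, and the records are invented.

```python
# Row-oriented layout: one complete record after another (as in CSV).
rows = [
    {"station": "A01", "soil_moisture": 0.21},
    {"station": "A02", "soil_moisture": 0.34},
    {"station": "A03", "soil_moisture": 0.28},
]

# Column-oriented layout: all values of one column stored together (as in Parquet).
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Reading a single column touches every record in the row layout,
# but only one contiguous list in the column layout.
moisture_from_rows = [row["soil_moisture"] for row in rows]
moisture_from_columns = columns["soil_moisture"]

print(columns)
print(moisture_from_rows == moisture_from_columns)
```

Row layouts tend to suit appending or reading whole records, while column layouts tend to suit analyses that scan a few columns across many rows.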

Lecture Slides: https://docs.google.com/presentation/d/1PVu8vbYtX0G4W41TB537Irm5V845E4uPsIrWQRfoQB0/edit?usp=sharing