2.1 Data Definitions#

Data is foundational to geosciences, allowing us to observe, model, and predict natural processes. Understanding the various types of data, their formats, and how they are structured is key to effectively using them in research and applications. In this lecture, we will discuss data modalities encountered in geoscience, typical data formats, the concept of arrays, and data frames. Geoscientific data is particularly diverse: point measurements of soil moisture, high rate time series (1000 samples per second) seismograms, rasterized LandSAT imagery, Geospatial and Temporal simulated geophysical fields.

Geoscientific Temporal Data

AI-Art from Dall-e: geoscientific data with dataframes, geospatial, and temporal data.

Data modality in Geosciences#

In geosciences, data come in multiple modalities depending on the source, nature of the measurements, and intended applications:

In-situ Data: Measurements taken directly at the site of interest. In-situ data often comes as time series. Examples include:
- Temperature data from weather stations.
- Ground motion data from seismometer.
- Soil moisture data from in-situ probes.
Remote Sensing Data: Collected from instruments not in direct contact with the object of study, often using satellites, drones, or aircraft. Geospatial Data are tied to specific locations on Earth’s surface, often represented as maps or grids (e.g., GIS data). Examples include:
- Images at multiple wavelengths (e.g., multispectral or hyperspectral images) from satellites.
- Topography data using LiDAR or radar systems.
- Sea surface temperature from satellites.
Model Data: Simulated data generated from computational models. For example:
- Climate models predicting future temperatures, precipitation, chemistry.
- Hydrological models simulating water flow in river basins.
- Wavefield simulations that solve the wave equation in complex media
Geophysical Data: Subsurface measurements derived through indirect methods like seismic surveys, gravity, or magnetic studies.

Data Formats in Geosciences#

Geoscientific data is typically stored in formats that optimize storage, access, and sharing. Common formats include:

NetCDF (Network Common Data Form): Commonly used for multidimensional scientific data, such as atmospheric, oceanic, or climate model outputs. It efficiently stores array-based data with metadata.

HDF (Hierarchical Data Format): Similar to NetCDF but more general, used for large datasets including satellite imagery.

CSV (Comma-Separated Values): A simple format for tabular data. It’s human-readable and widely supported across software, but less efficient for large or multidimensional datasets.

GeoTIFF: A popular format for raster geospatial data, often used in remote sensing and GIS applications.

Shapefiles: A vector data format for geographic information system (GIS) software, which contains geometric locations and attribute information of spatial features.

Cloud Optimized Formats: As large archives are moving to cloud systems, other formats are becoming useful: COGT (Cloud Optimized GeoTiffs), Zarr, TileDB among others.

Arrays#

An array is a fundamental data structure used to store collections of values, often representing multidimensional data (e.g., gridded spatial data). Arrays in geosciences typically represent data like temperature, pressure, or rainfall on a grid.

Typical Dimensions of Arrays:

1D Arrays: A single sequence of data, such as temperature measurements over time at one location.
2D Arrays: Often represent gridded spatial data (e.g., a map of precipitation over a region).
3D Arrays: Can include additional dimensions, such as time or depth. For instance, a 3D array could represent temperature at various depths and over time for a given region, 3D Earth model of geophysical properties such as seismic wavespeed, time varying (snapshots) of seismic wavefields, …
4D Arrays: Add even more complexity, such as a time-varying 3D grid (e.g., atmospheric data changing over space and time), time-lapse images of the subsurface properties.

Data Frames#

A data frame is a two-dimensional, tabular data structure, commonly used in data analysis. Data frames can be thought of as equivalent to a spreadsheet or database table, where:

Each column represents a variable or feature (e.g., date, location, temperature).
Each row corresponds to an observation or data point. Data frames are popular in programming environments like R and Python (via the Pandas library) because they offer flexibility in handling mixed data types (numerical, categorical, etc.) and are ideal for statistical analysis and data manipulation.

ML Geo Curriculum

2.1 Data Definitions

Contents

2.1 Data Definitions#

Data modality in Geosciences#

Data Formats in Geosciences#

Arrays#

Data Frames#

Lecture Slides#