2.2 Data Formats#

In this tutorial, we will convert data structures to and from several common data formats.

JSON (JavaScript Object Notation)#

JSON is a lightweight, human-readable data format used in web applications and APIs for data exchange. JSON is used to store metadata, configuration files, and small datasets, particularly when working with web-based applications or interacting with APIs (e.g., querying weather data or geospatial information from an API). Here is a simple example:

{
    "location": "Yellowstone",
    "coordinates": {
        "latitude": 44.423691,
        "longitude": -110.588516
    },
    "elevation_m": 2399,
    "temperature_c": 22.5
}

JSON files are encoded in UTF-8. The data types in JSON may be numbers, strings, booleans, arrays, objects (collections of name-value pairs), or null. More information on JSON is available from the EarthDataScience course.

In Python, you can simply create a JSON file with the built-in json library:

import json

data = {
    "location": "Yellowstone",
    "coordinates": {"latitude": 44.423691, "longitude": -110.588516},
    "elevation_m": 2399,
    "temperature_c": 22.5
}

with open('data.json', 'w') as outfile:
    json.dump(data, outfile)
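
To read the file back into a Python dictionary, use json.load. A minimal sketch, assuming the data.json file written above:

# Reading the JSON file back into a dictionary
with open('data.json', 'r') as infile:
    data_read = json.load(infile)

print(data_read['coordinates']['latitude'])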

Tabular Data Formats#

Tabular data are common in geosciences. The CSV format is a convenient, human-readable format that stores small data sets as rows of plain text. The Parquet format is a binary, columnar format for large data sets that supports compression.

CSV (Comma-Separated Values):#

CSV is a simple, widely used format for tabular data, often used in geoscience for sharing and storing smaller datasets (e.g., soil samples, environmental readings). It stores data as plain text, which makes it easy to read but inefficient for large datasets. It is most often read using pandas.

import pandas as pd

# Build a DataFrame from the dictionary defined above (nested fields are flattened)
df = pd.json_normalize(data)

# Writing a CSV file
df.to_csv('data.csv', index=False)

# Reading the CSV file back
df = pd.read_csv('data.csv')
df.head()

Parquet:#

Parquet is a binary, columnar storage format optimized for efficiency, particularly for large datasets. It is widely used in big data environments (e.g., storing satellite imagery or climate model outputs).

# Writing the dictionary data to a Parquet file (requires a Parquet engine such as pyarrow)
df = pd.json_normalize(data)  # flatten the nested "coordinates" field
df.to_parquet('data.parquet', index=False)

# Reading the Parquet file
df_read = pd.read_parquet('data.parquet')
print(df_read)
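
Parquet also supports on-disk compression. A one-line sketch, assuming the pyarrow engine is installed (snappy is the default codec; gzip is another built-in option):

# Write with explicit compression (codec choice is illustrative)
df.to_parquet('data_compressed.parquet', compression='gzip')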

Geospatial Data#

The main geospatial data formats are:

  • GeoTIFF: a metadata standard that embeds georeferencing information in a TIFF (Tagged Image File Format) file; it stores pixelized raster data. Cloud-Optimized GeoTIFF (COG) enhances GeoTIFF for efficient access on cloud storage.

  • GeoJSON: a format for encoding a variety of geographic vector data structures (points, lines, polygons) in JSON. An example follows below.
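
For example, a minimal GeoJSON document describing a single point (note the longitude-first coordinate order):

{
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-110.588516, 44.423691]
    },
    "properties": {"name": "Yellowstone"}
}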

import io
import os
import zipfile

import folium
import geopandas as gpd
import h5py
import matplotlib.pyplot as plt
import netCDF4 as nc
import numpy as np
import pandas as pd
import rasterio
import requests
import wget

from folium.plugins import MarkerCluster
from rasterio.mask import mask
from rasterio.plot import show

1. Raster data#

1.1 rasterio to read GeoTIFF#

Raster data is any pixelated (or gridded) data where each pixel is associated with a specific geographical location. The value of a pixel can be continuous (e.g. elevation) or categorical (e.g. land use).

The Python package rasterio (with documentation here) can read raster formats such as GeoTIFF.

See additional introductory materials from EarthDataScience, and tutorials from the GeoHackweek.

We will download topography files found on this page, but stored in a Dropbox folder.

The file name is HYP_50M_SR and it is a zipped file.

# Download the data using wget.
fname = 'HYP_50M_SR'
wget.download("https://www.dropbox.com/s/j5lxhd8uxrtsxko/"+str(fname)+"?dl=1") # dl=1 requests the file itself rather than the Dropbox preview page

The zip file is saved in the current working directory; we want to move it into a data folder:

os.makedirs('./data', exist_ok=True)
os.replace(fname+".zip", './data/'+fname+".zip")

Unzip the file

os.makedirs("./data/"+fname+"/", exist_ok=True)
z = zipfile.ZipFile('./data/'+fname+".zip")
z.extractall("./data/"+fname+"/")

Now let’s get the Digital Elevation Map. We open the unzipped file using the package rasterio.

elevation = rasterio.open("./data/"+fname+"/"+fname+".tif")
print(elevation.meta)  # driver, dtype, width, height, band count, CRS, and transform

Let us have a look at the dimensions of the data:

elevation.height   # number of rows (pixels)
elevation.width    # number of columns (pixels)
elevation.indexes  # the band indexes

Can you guess how to query the data types of the raster bands?

# type below

and at the boundaries of the dataset:

elevation.bounds
print(elevation.transform * (0, 0)) # North West corner
print(elevation.transform * (elevation.width, elevation.height)) # South East corner

Here is the projection used for the data:

elevation.crs

How to interpret the data: there are three bands, one for each of the colors red, green, and blue:

print(elevation.colorinterp[0])
print(elevation.colorinterp[1])
print(elevation.colorinterp[2])
print(np.min(elevation.read(1)), np.max(elevation.read(1)))
print(np.min(elevation.read(2)), np.max(elevation.read(2)))
print(np.min(elevation.read(3)), np.max(elevation.read(3)))

Let us now plot the data:

image = elevation.read()
show(image)

1.2 Geopandas to read GeoJSON#

Rasters are not the only kind of geospatial files. Let us look at an example of reading vector data from a GeoJSON file (a special case of a JSON file with geographical coordinates) using geopandas.

url = 'https://www.nps.gov/lib/npmap.js/4.0.0/examples/data/national-parks.geojson'
parks = gpd.read_file(url)
parks.head()

Let us plot the data.

Folium is a nice Python package for visualization. The Geohackweek tutorial on Folium is also informative.

m = folium.Map(location=[40, -100], zoom_start=4)
folium.GeoJson(parks).add_to(m)
marker_cluster = MarkerCluster().add_to(m)
m

We are going to focus on the parks in Washington State:

parks_WA = parks.iloc[[94, 127, 187, 228, 286, 294, 295, 297, 299, 300, 302]].reset_index()

We create a list of locations to add popups to the map.

locations = []
for index in range(0, len(parks_WA)):
    location = [parks_WA['geometry'][index].y, parks_WA['geometry'][index].x]
    locations.append(location)
m = folium.Map(location=[47, -121], zoom_start=7)
marker_cluster = MarkerCluster().add_to(m)
for point in range(0, len(locations)):
    folium.Marker(location = locations[point], popup=parks_WA['Name'].iloc[point]).add_to(marker_cluster)
m

2. Hierarchical formats: NETCDF4 & HDF5#

Hierarchical data formats are designed to store large amounts of data in a single file. They mimic a file system (e.g., a tree-like data structure with nested directories) within a single file. There are two dominant hierarchical data formats (HDF5 and netCDF4), and one emerging format for the cloud (Zarr). Hierarchical formats can in general store many data types (numeric and string).

HDF5#

The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data. HDF5 uses a “file directory”-like structure that allows you to organize data within the file in many different structured ways, as you might organize files on your computer. HDF5 files are also self-describing: all elements (the file itself, groups, and datasets) can have associated metadata that describes the information they contain.

HDF Structure Example:

  • Datasets, which are typed multidimensional arrays

  • Groups, which are container structures that can hold datasets and other groups

Figure: HDF5 data structure example, from [neonscience](https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5).
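
To make this structure concrete, here is a minimal sketch that creates a small HDF5 file with h5py; the file, group, and dataset names are made up for illustration:

# Create a toy HDF5 file with one group, one dataset, and metadata
# (names are hypothetical, for illustration only)
with h5py.File('example.h5', 'w') as f:
    grp = f.create_group('site_A')                                # a group, like a directory
    dset = grp.create_dataset('temperature', data=np.random.rand(24, 10))
    dset.attrs['units'] = 'degC'                                  # metadata on the dataset
    f.attrs['description'] = 'toy HDF5 file'                      # metadata on the file

# Read it back: elements are addressed with a path-like syntax
with h5py.File('example.h5', 'r') as f:
    print(f['site_A/temperature'].shape, f['site_A/temperature'].attrs['units'])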

NetCDF#

The Network Common Data Form, or netCDF, was created in the early 1990s to address some of the challenges of working with N-dimensional arrays. NetCDF is a collection of self-describing, machine-independent binary data formats and software tools that facilitate the creation, access, and sharing of scientific data stored in N-dimensional arrays, along with metadata describing the contents of each array. NetCDF was built by the climate science community at a time when regional climate models were beginning to produce larger and larger output files. NetCDF version 4 is built on top of HDF5: a netCDF-4 file is a valid HDF5 file that uses a restricted subset of HDF5 features.
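
For comparison, a minimal sketch of creating a self-describing netCDF-4 file with the netCDF4 package (the file and variable names are hypothetical):

# Create a toy netCDF-4 file with one dimension, one variable, and metadata
with nc.Dataset('example.nc', 'w', format='NETCDF4') as ds:
    ds.createDimension('time', 24)
    temp = ds.createVariable('temperature', 'f4', ('time',))
    temp.units = 'degC'                     # metadata attached to the variable
    temp[:] = np.random.rand(24)
    ds.description = 'toy netCDF file'      # global metadata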

Handling large arrays#

The netCDF and HDF5 formats have no limit on file size. However, any analysis tool that reads data from a netCDF array into memory for a computational operation is limited by that particular machine's available memory.

But slow at I/O#

When reading a hierarchical file, the whole tree of the data structure is scanned from the root node down. Since this happens every time a user makes a query, reading HDF5 and netCDF files can be slow. Faster, cloud-optimized alternatives such as Zarr (see below) address this.

We will download a geological map stored in netCDF format on Dropbox. The original data can be found in the USGS database (https://doi.org/10.3133/ofr20191081).

# Download the geological framework
file1 = wget.download("https://www.dropbox.com/s/wdb25puxh3u07dj/NCM_GeologicFrameworkGrids.nc?dl=1")
# Download the coordinate grids
file2 = wget.download("https://www.dropbox.com/s/wdb25puxh3u07dj/NCM_SpatialGrid.nc?dl=1")
# Move the data into the data folder
os.replace(file1, './data/'+file1)
os.replace(file2, './data/'+file2)
# Read the data
geology = nc.Dataset('./data/'+file1)
grid = nc.Dataset('./data/'+file2)
geology                                      # inspect the file-level metadata
geology['Surface Elevation']                 # inspect one variable
np.shape(geology['Surface Elevation'])       # dimensions of the grid
geology['Surface Elevation'][3246, 1234]     # a single grid value
# Extract the coordinate grids and the elevation, then plot
x = grid['x'][0:4901, 0:3201]
y = grid['y'][0:4901, 0:3201]
elev = geology['Surface Elevation'][0:4901, 0:3201]
plt.contourf(x, y, elev)

3. Zarr#

Zarr is a cloud-optimized data format for chunked, compressed, N-dimensional arrays; like HDF5, it can store heterogeneous data sets in a hierarchy of groups.

In the following exercise, we will use the xarray tutorial dataset air_temperature and save it into a Zarr store.

Let’s work in groups to:

  • Download the xarray tutorial data.

  • Save it to a Zarr store; report the write time and the size of the data set.

  • Read it back and compare the write and read times.
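
A minimal sketch of the exercise, assuming xarray and zarr are installed (the store name air_temperature.zarr is arbitrary):

import time
import xarray as xr

# Download the tutorial dataset (requires internet access and the pooch package)
ds = xr.tutorial.open_dataset('air_temperature')

# Save it to a Zarr store and time the write
t0 = time.time()
ds.to_zarr('air_temperature.zarr', mode='w')
print(f"write time: {time.time() - t0:.2f} s")

# Read it back and time the read; .load() forces the data into memory
t0 = time.time()
ds_read = xr.open_zarr('air_temperature.zarr').load()
print(f"read time: {time.time() - t0:.2f} s")

# Check the size of the store on disk, e.g. with `du -sh air_temperature.zarr`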