2.2 Data Formats#

In this tutorial, we will manipulate the data structure from and to several data formats.

The formats that support unstructured (non relational) data are:

  • JSON: JavaScript Object Notation, an open standard file format that uses human-readable text. The data may be attribute-value pairs and arrays. It is language-independent. The syntax looks like this

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  }

The character encoding is UTF-8. The data types in JSON files may be numbers, string, boolean, array, object (collection of name-value pairs), or null. More information on JSON from the EarthDataScience course.

The main formats that support pixelized raster data are:

  • GeoTIFF: metadata standard that allows for georeferencing information embedded in a TIFF (Tagged Image File Format) file. GeoTIFF is enhanced to be cloud optimized.

  • GeoJSON: GeoJSON is a format for encoding a variety of geographic data structures in the JSON format.

The formats that support tabular data are:

  • CSV

  • Parquet

The data formats for big heterogeneous data (different data types):

  • NetCDF4

  • HDF5

  • Zarr

import requests, zipfile , os, io
import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import netCDF4 as nc
import numpy as np
import pandas as pd
# import pycrs
import rasterio
import h5py
import rasterio
import netCDF4 as nc
import wget


from folium.plugins import MarkerCluster
from rasterio.mask import mask
from rasterio.plot import show

1. Raster data#

1.1 rasterio to read GeoTIFF#

Raster data is any pixelated (or gridded) data where each pixel is associated with a specific geographical location. The value of a pixel can be continuous (e.g. elevation) or categorical (e.g. land use).

The python package rasterio, with documentation here, and that can read formats such as GeoTIFF and GeoJSON.

See additional introductory materials from EarthDataScience, and tutorials from the GeoHackweek.

We will download topography files found on this page, but stored on a Dropbox folder.

The file name is HYP_50M_SR and is a zipped file.

# Dowload the data using wget.
fname = 'HYP_50M_SR'
wget.download("https://www.dropbox.com/s/j5lxhd8uxrtsxko/"+str(fname)+"?dl=1") # note the last character as a string to request the file itself

The data file will be saved on the home directory, we want to move it into a data folder:

os.replace(fname+".zip", './data/'+fname)

Unzip the file

os.makedirs("./data/"+fname+"/",exist_ok=True)
# wget.download(url,out="HYP_50M_SR") # this does not work on the hub
z = zipfile.ZipFile('./data/'+fname+".zip")
z.extractall("./data/"+fname+"/")

Now let’s get the Digital Elevation Map. We open the unzipped file using the package rasterio.

elevation = rasterio.open("./data/"+fname+"/"+fname+".tif")
print(elevation.variables.keys())

Let us have a look at the dimensions of the data:

elevation.height
elevation.width
elevation.indexes

Can you guess on how to call the data types of the file entry?

# type below

and at the boundaries of the dataset:

elevation.bounds
print(elevation.transform * (0, 0)) # North West corner
print(elevation.transform * (elevation.width, elevation.height)) # South East corner

Here is the projection used for the data:

elevation.crs

How to interpret the data: There are three layers for the three colors red, green, and blue:

print(elevation.colorinterp[0])
print(elevation.colorinterp[1])
print(elevation.colorinterp[2])
print(np.min(elevation.read(1)), np.max(elevation.read(1)))
print(np.min(elevation.read(2)), np.max(elevation.read(2)))
print(np.min(elevation.read(3)), np.max(elevation.read(3)))

Let us now plot the data:

image = elevation.read()
show(image)

1.2 Geopandas to read GeoJSON#

GeoTIFF are not the only kind of files that we can read with geopandas. Let us look at an example of reading data from a geojson file (which is a special case of json file with geographical coordinates).

url = 'https://www.nps.gov/lib/npmap.js/4.0.0/examples/data/national-parks.geojson'
parks = gpd.read_file(url)
parks.head()

Let us plot the data.

Folium is a nice Python package for visualization. The Geohackweek tutorial on Folium is also informative.

m = folium.Map(location=[40, -100], zoom_start=4)
folium.GeoJson(parks).add_to(m)
marker_cluster = MarkerCluster().add_to(m)
m

We are going to focus on the parks in Washington State:

parks_WA = parks.iloc[[94, 127, 187, 228, 286, 294, 295, 297, 299, 300, 302]].reset_index()

We create a list of locations to add popups to the map.

locations = []
for index in range(0, len(parks_WA)):
    location = [parks_WA['geometry'][index].y, parks_WA['geometry'][index].x]
    locations.append(location)
m = folium.Map(location=[47, -121], zoom_start=7)
marker_cluster = MarkerCluster().add_to(m)
for point in range(0, len(locations)):
    folium.Marker(location = locations[point], popup=parks_WA['Name'].iloc[point]).add_to(marker_cluster)
m

2. Hierarchical formats: NETCDF4 & HDF5#

Hierarchical data formats are designed to store large amount of data into a single file. They mimic a file system (e.g., tree-like data structure with nested directories) into a single file. There are two dominent hierarchical data formats (HDF5 and NETCDF4), and one emerging for cloud (Zarr). Hierarchical formats in general can store many data types (numeric vs string).

HDF5#

The Hierarchical Data Format version 5 (HDF5), is an open-source file format that supports large, complex, heterogeneous data. HDF5 uses a “file directory” like structure that allows you to organize data within the file in many different structured ways, as you might do with files on your computer. The HDF5 format also allows for embedding of metadata making it self-describing. HDF5 files are self describing - this means that all elements (the file itself, groups and datasets) can have associated metadata that describes the information contained within the element.

HDF Structure Example:

  • Datasets, which are typed multidimensional arrays

  • Groups, which are container structures that can hold datasets and other groups

An illustration of a H5 data set Figure: HDF5 data example. Found in [neonscience](https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5)

Netcdf#

The network Common Data Form, or netCDF, was created in the early 1990s, and set out to solve some of the challenges in working with N-dimensional arrays. Netcdf is a collection of self-describing, machine-independent binary data formats and software tools that facilitate the creation, access, and sharing of scientific data stored in N-dimensional arrays, along with metadata describing the contents of each array. Netcdf was built by the climate science community at a time when regional climate models were beginning to produce larger and larger output files. NetCDF version 4 is now a subset of HDF5.

Handling large arrays#

The NetCDF & H5 format has no limit on file sizes. However, any analysis tools that read data from a NetCDF array into memory for some computational operation will be limited by that particular machine’s available memory.

But slow at I/O#

When reading a hierarchical file, the whole tree of the data structure is scanned from the root node down. Since it has to be done each time a user makes an inquiry, reading H5 and Netcdf is slow. There are faster

We will download a geological map stored in a netCDF format on Dropbox. The original data can be found on the USGS database. (https://doi.org/10.3133/ofr20191081)

# Download the geological framework
file1 = wget.download("https://www.dropbox.com/s/wdb25puxh3u07dj/NCM_GeologicFrameworkGrids.nc?dl=1") #"./data/NCM_GeologicFrameworkGrids.nc"
# Download the coordinate grids
file2 = wget.download("https://www.dropbox.com/s/wdb25puxh3u07dj/NCM_SpatialGrid.nc?dl=1") #"./data/NCM_GeologicFrameworkGrids.nc"
# move data
os.replace(file1,'./data/'+file1)
os.replace(file2,'./data/'+file2)
# read data
geology = nc.Dataset('./data/'+file1)
grid = nc.Dataset('./data/'+file2)
geology
geology['Surface Elevation']
np.shape(geology['Surface Elevation'])
geology['Surface Elevation'][3246, 1234]
x = grid['x'][0:4901, 0:3201]
y = grid['y'][0:4901, 0:3201]
elevation = geology['Surface Elevation'][0:4901, 0:3201]
plt.contourf(x, y, elevation)

3. Zarr#

Zarr is a cloud-optimized data format that handles heterogeneous data sets.

In the following exercise, we will use the Xarray open data sets air_temperature and save it into a Zarr file.

Let’s work in groups to :

  • Download the xarray data

  • Save it in to file, report on time and size of the data set.

  • Read it again and check again write time and read times