2.3 Data Arrays#

This lecture introduces arrays

This section introduces:

  • Level 1: Numpy package that provides fundamental python tools for data arrays

  • Level 2: Xarrays N-dimensional arrays

  • Level 3: Pytorch Tensor, N-dimensional arrays for pytorch.

At all levels, we will practice

  • Matplotlib python package that provides functionalities for basic plotting.

1 Numpy arrays#

Sequence of data can be stored in python lists. Lists are very flexible; data of identical type can be appended to list on the fly.

Numpy arrays area multi-dimensional objects of specific data types (floats, strings, integers, …). ! Numpy arrays should be declared first ! Allocating memory of the data ahead of time can save computational time. Numpy arrays support arithmetic operations. There are numerous tutorials to get help. https://www.earthdatascience.org/courses/intro-to-earth-data-science/scientific-data-structures-python/numpy-arrays/

# import module
import numpy as np

# define an array of dimension one from.
#this is a list of floats:
a=[0.7 , 0.75, 1.85]
# convert a list to a numpy array
a_nparray = np.array(a)

1.1 1D arrays in Numpy#

1-D arrays are also called vectors. The Numpy function arange will create regularly spaced values within a specific range. It is similar to the python function range.

# create 1D arrays
N=100
x_int = np.arange(N)# this will make a vector of int from 0 to N-1
print(x_int)
N = 100 # number of points

# linearly spaced vectors
x_lat = np.linspace(39,50,N)# this will make a vector of floats from min to max values evenly spaced
x_lon = np.linspace(-128,-125,N)# same for longitude
print(x_lat)
print(x_lon)


# time vectors
x_t = np.linspace(0,100,N)
dt=x_t[1]
print(dt)

# logspaced vectors
# 10^(-1) -> 10^(1) => 0.1 - 1
x_tl = np.logspace(-1,1,N)

print(x_tl)

Indexing Numpy arrays#

Accessing individual elements is done using squared brackets []. First element is indexed at 0, last element is indexed at -1.

arr = np.arange(5)

# Access the first element (index 0)
element = arr[0]
print(element)  # Output: 1

# Access the fourth element (index 3)
element = arr[3]
print(element)  # Output: 4


# Access the last element (index 5)
element = arr[4]
print(element)  # Output: 5

# Access the last element (index 5)
element = arr[-1]
print(element)  # Output: 5

Slicing#

We can slice arrays using the basic syntax start:stop:step:

x=np.arange(10)

# select subarrays between indexes 3 and 6
x1 = x[2:5]
print(x1)

Create an array that select every other elements

# answer here

Create an array with a reversed ordered indexes:

# answer here

1.2 Introduction fo Matplotlib#

Some tips from Sofware-Carpentry

  • Make sure your text is large enough to read. Use the fontsize parameter in xlabel, ylabel, title, and legend, and tick_params with labelsize to increase the text size of the numbers on your axes.

  • Similarly, you should make your graph elements easy to see. Use s to increase the size of your scatterplot markers and linewidth to increase the sizes of your plot lines.

  • Using color (and nothing else) to distinguish between different plot elements will make your plots unreadable to anyone who is colorblind, or who happens to have a black-and-white office printer. For lines, the linestyle parameter lets you use different types of lines. For scatterplots, marker lets you change the shape of your points. If you’re unsure about your colors, you can use Coblis or Color Oracle to simulate what your plots would look like to those with colorblindness.

# make the plots appear in the notebook
%matplotlib inline 
import matplotlib.pyplot as plt
# import matplotlib.pylab as pylab

# set default parameters
params = {'legend.fontsize': 14, \
          'xtick.labelsize':14, \
          'ytick.labelsize':14, \
          'font.size':14}
plt.rcParams.update(params)
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(10,5))  # 1 row, 2 co
ax1.plot(x_t,'.')
ax2.plot(x_tl,'r.');ax2.set_yscale('log')
ax2.set_xlabel('Index of vector')
ax1.set_xlabel('Index of vector')
ax2.set_ylabel('Time (s)')
ax1.set_ylabel('Time (s)')
ax1.set_title('My first plot!')
ax2.set_title('My second plot!')
ax1.grid(True)
ax2.grid(True)
plt.savefig('plot_test.png')

1.3 Random Arrays#

Creating synthetic arrays from statistical distributions of data.

Find some basic function from the statistics functions in Numpy: https://numpy.org/doc/stable/reference/routines.statistics.html

# generate random fields
from numpy import random

N=10000 # number of samples
# Uniform distribution
x1=2*random.rand(N)-1    # uniform distribution centered at 0, varying between -1 and 2.
# Gaussian distribution
x2=random.randn(N)   # Gaussian distribution with a standard deviation of 1
# Power law
a=2 # power of the distribution
x3=random.power(a,N)
# poisson
aa=4
x4=random.poisson(aa,N)
# now compare this distribution to the Gaussian distribution
fig1, (ax1,ax2,ax3,ax4) = plt.subplots(1,4, figsize=(15,5))  # 1 row, 2 co
ax1.plot(x1);ax1.set_title('Uniform')
ax2.plot(x2);ax2.set_title('Gaussian')
ax3.plot(x3);ax3.set_title('Power')
ax4.plot(x4);ax4.set_title('Poisson')


# plot the 2 histograms
fig2, (ax11,ax12,ax13,ax14) = plt.subplots(1,4, figsize=(15,5))  # 1 row, 2 co
ax11.hist(x1,bins=100);ax11.set_title('Uniform')
ax12.hist(x2,bins=100);ax12.set_title('Gaussian')
ax13.hist(x3,bins=100);ax13.set_title('Power')
ax14.hist(x4,bins=100);ax14.set_title('Poisson')
# numpy has built-in functions to calculate basic statistical properties of numpy arrays
print("mean of x1", "standard deviation of x1","mean of x2", "standard deviation of x2",)
print(np.mean(x1),np.std(x1),np.mean(x2),np.std(x2))

1.4 2D arrays in Numpy#

We will now create random 2D arrays

N=1000 # number of samples
# create
# Uniform distribution
x1=2*random.rand(N,N)-1    # uniform distribution centered at 0, varying between -1 and 2.
# Gaussian distribution
x2=random.randn(N,N)   # Gaussian distribution with a standard deviation of 1
# Power law
a=2 # power of the distribution
x3=random.power(a,[N,N])
# poisson
aa=4
x4=random.poisson(aa,[N,N])

# now compare this distribution to the Gaussian distribution
fig, (ax1,ax2,ax3,ax4) = plt.subplots(1,4, figsize=(26,4))  # 1 row, 2 co
ax1.pcolor(x1,vmin=-1, vmax=1)
ax2.pcolor(x2,vmin=-1, vmax=1)
ax3.pcolor(x3,vmin=-1, vmax=1)
ax4.pcolor(x4,vmin=0, vmax=10)

1.5 Array Norms#

\(L_2\) norm of an array \(X\):

\(||X||_2 = \sqrt{\sum_i^N \left(X_i^2\right) / N} \) In numpy that is the default norm of the linear algebra module linalg: np.linalg.norm(X)

We can normalize an array and will focus on wich dimension the array is normalized.

1.6 Measures of distance#

Comparing data often means calculating a distance or dissimilarity between the two data. Similarity is equialent to proximity of two data.

Euclidian distance

\(L_2\) norm of the residual between 2 vectors:

\(d = ||X -Y||_2 = \sqrt{\sum_i^N \left(X_i^2 - Y_i^2 \right) / N} \) In numpy that is the default norm of the linear algebra module linalg: d=np.linalg.norm(X-Y)

Total variation distance

Is the \(L_1\)-norm equivalent of the Euclidian distance: d=np.linalg.norm(X-Y,ord=1)

Pearson coefficient (aka the correlation coefficient)

\( P(X,Y) = \frac{cov(X,Y)}{std(X)std(Y)} \)

\( P(X,Y) = \frac{ \sum{ (X_i-mean(X)) (Y_i-mean(Y)}}{\sqrt{\sum{ (X_i-mean(X))^2 } \sum{ (Y_i-mean(Y))^2 } }} \)

# cross correlate 2 vectors:

x2=random.randn(N)   # Gaussian distribution 
x3=random.randn(N)   # Gaussian distribution 

# Euclidian distance:
d2=np.linalg.norm(x3-x2)
# Total variation distance:
d1=np.linalg.norm(x3-x2,ord=1)
# Pearson coefficient
r = np.corrcoef(x2,x3)[0,1]

print("the three distances are:")
print(d2,d1,r)

plt.plot(x2,x3,'o');plt.grid(True)
plt.xlabel('X2');plt.ylabel('X3')

2 Xarrays#

1.1 Xarray basics#

This tutorial has been copied and modified from the Xarray package tutorials.

Geoscientific data comes with complex metadata that describe the meaning and context for data arrays. Xarrays allows to attach metadat to the data arrays, making it easier to keep track of variables, their units, coordinate systems, and other important data attributes. Because geoscientific data are typically multi-dimensional, integrating dimensions like time, spatial coordinate, allows users to manipulate, transform, perform complex operations on the data arras, In particular, it simplifies tasks to slice in time and space, allows for regridding and subsetting.

Xarrays are well integrated with high performance computing, such as Dask, and storage such as HPC storage using NetCDF and cloud storage with Zarr.

Xarrays were designed to facilitate the manipulation of geoscientific data. Ge

Xarrays are multi-dimensional arrays (“tensors”) that can have several attributes and dimensions. The core structure is DataArray, the N dimensional array that is similar to a pandas.Series. The second is the Dataset that is a multi-dimensional, in-memory array database. It is a dictionary like container of DataArray, the equivalent to pandas.DataFrame.

Xarrrays can be read from netCDF and from Zarr.

You will find plenty of useful tutorials from the Xarray project, such as this.

Here is the Xarray tutorial book introducing Xarray from data structure to visualization. Generally, Xarray wraps Numpy and Pandas and behaves in a similar way as in Pandas. The transition to adopt Xarray should be smooth if you are familiar with Numpy and Pandas already.

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
import pooch

%matplotlib inline
%config InlineBackend.figure_format='retina'
ds = xr.tutorial.load_dataset("air_temperature")
ds

What are the keys/attributes of the data set?

Find two ways to print the values from attribute air

ds["air"]
ds.air
with xr.set_options(display_style="html"):
    display(ds)

The DataArray has named dimension:

ds.air.dims

and the coordinates are saved in .coord:

ds.air.coords
ds.air.attrs

Add new attributes

# assign your own attributes!
ds.air.attrs["who_is_awesome"] = "xarray"
ds.air.attrs

The underlying data is a numpy array

print(type(ds.air.data))
print(ds.air.data)

2.2 Extracting data#

How to extract data:

  • label-based indexing using .sel

  • position-based indexing using .isel

ds.air.isel(time=1).plot(x="lon")

You would notice that the air temperature is in Kelvin. We can convert it to Celsius by removing 273.15 and changing the attributes units.

ds2=ds
ds2['air']=ds['air']-273.15
ds2['air']['units']='degC'

We also want to show the longitudes in the west direction by removing 360\(^\circ\).

ds2.coords["lon"]=ds2.coords["lon"]-360

Show the mean temperature

ds2.air.mean("time").plot()
ds2.sel(time="2013-05")
# demonstrate slicing
ds.sel(time=slice("2013-05", "2013-07"))
# "nearest indexing at multiple points"
ds.sel(lon=[240.125-360, 234-360], lat=[40.3, 50.3], method="nearest")

2.3 High level computation#

  • groupby : Bin data in to groups and reduce

  • resample : Groupby specialized for time axes. Either downsample or upsample your data.

  • rolling : Operate on rolling windows of your data e.g. running mean

  • coarsen : Downsample your data

  • weighted : Weight your data before reducing

# seasonal groups
ds.groupby("time.season")
# make a seasonal mean
seasonal_mean = ds.groupby("time.season").mean()
seasonal_mean = seasonal_mean.sel(season=["DJF", "MAM", "JJA", "SON"])
seasonal_mean
# resample to monthly frequency
ds.resample(time="M").mean()
# facet the seasonal_mean
seasonal_mean.air.plot(col="season")

We can save Xarrays in to NetCDF and Zarr files

# write to netCDF
%timeit ds.to_netcdf("my-example-dataset.nc")
!ls -lh my-example-dataset.nc
%timeit ds.to_zarr(store="./my-example-dataset.zarr",mode="w")
!du -sh ./my-example-dataset.zarr

3. Pytorch Tensors#

This material is extracted from the Pytorch Package materials and from Dive into Deep Learning

The tensor class in Pytorch is similar to an ndarray in Numpy, with the added features that: 1) it has automatic differentiation and 2) it runs on CPUs and GPUs.

import torch

We can define a basic tensor that will be by default a 1-D array (vector) and run on the CPU

x = torch.arange(12, dtype=torch.float32)
x

Each value is refered to as an element. We can extract the length of the vector using the method numel() (a method is applied with ()) and its attribute shape is found by applyingshape

x.numel()
x.shape

We can reshape a tensor and flip dimensions

x.reshape(12,1).shape
x2=x.reshape(3,4)
x2.shape

We can create a random tensor

x3=torch.randn(3, 4)

We can apply element-wise operations on the tensor simply as methods. Here is an example by applying the function exp to all elements of the random tensor x3.

x3.exp()

We can manipulate 2 arrays of the same shape

x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y

We can normalize the tensor with its L2 norm. Try below

x /=np.linalg.norm(x)
x

We can convert a numpy array to a Pytorch tensor

a=x.numpy()
print(type(a))
print(type(x))

Move an array from CPU to GPU by changing the device attribute

device = torch.device("cpu")
print(device)

Below, we are going to perform a regression to write a sine function as a function

import torch
import math


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')