2.4 Pandas#

This section introduces new skills in Python for handling structured, tabular data.

Learning outcomes:

  • Manipulation of data frames (describing, filtering, …)

  • Learn about Lambda functions

  • Intro to datetime objects

  • Plotting data from data frames (histograms and maps)

  • Introduction to Plotly

  • Introduction to CSV & Parquet

This tutorial can be offered as a 2-hour course. The material is organized into Sections 1, 2, and 3, corresponding to Levels 1, 2, and 3 of difficulty, and instructors may choose to leave the higher levels for asynchronous, self-guided learning.

We will work on several structured data sets: sensor metadata and a seismic data product (an earthquake catalog).

First, we import all the modules we need:

import numpy as np
import pandas as pd
import io
import requests
import time
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

import plotly.express as px
import plotly.io as pio
# choose a Plotly renderer suited to your environment, for example:
# pio.renderers.default = 'vscode' # renders inside VS Code notebooks
# pio.renderers.default = 'iframe' # writes figures as standalone HTML files
# pio.renderers.default = 'png'    # renders figures as static PNG images
# try notebook, jupyterlab, png, vscode, iframe

1 Pandas Fundamentals#

1.1 Basics#

Pandas provides two core data structures: Series and DataFrame. A Series is a single labeled column of data, indexed by keys. A DataFrame is a two-dimensional table made up of Series.
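For instance, a standalone Series can be built directly from a list; a minimal sketch (the values and the name are arbitrary):

# a single labeled column of data
temperature = pd.Series([36, 37, 30, 50], name='temperature')
print(temperature)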

We can create a DataFrame composed of Series from scratch using a Python dictionary:

data = {
    'temperature' : [36,37,30,50],
    'precipitation':[3,1,0,0]
}
my_pd = pd.DataFrame(data)
print(my_pd)
   temperature  precipitation
0           36              3
1           37              1
2           30              0
3           50              0

Each (key, value) item in data becomes a column of the DataFrame. To get the keys (column names) of the DataFrame, type:

my_pd.keys()
Index(['temperature', 'precipitation'], dtype='object')

To get a specific Series (which is not the same as a NumPy array), type:

print(my_pd.temperature[:])
print(type(my_pd.temperature[:]))
0    36
1    37
2    30
3    50
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>

To get the value of a specific key (e.g., temperature) at a specific index (e.g., 2), type:

print(my_pd.temperature[2])
print(type(my_pd.temperature[2]))
30
<class 'numpy.int64'>
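Equivalent label-based and position-based ways to access the same value use loc and iloc; a quick sketch:

print(my_pd.loc[2, 'temperature'])   # row with index label 2, column 'temperature'
print(my_pd.iloc[2, 0])              # third row, first column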

1.2 Reading a DataFrame from a CSV file#

We can read a pandas DataFrame directly from a standard file. Here you will read a catalog of earthquakes.

quake = pd.read_csv("Global_Quakes_IRIS.csv")

Now use the head function to display the first rows of the file.

# enter answer here
quake.head()
time latitude longitude depth magnitude description
0 2010-07-02 06:04:03.570 -13.6098 166.6541 34400.0 6.3 VANUATU ISLANDS
1 2010-07-04 21:55:52.370 39.6611 142.5792 30100.0 6.3 NEAR EAST COAST OF HONSHU
2 2010-07-10 11:43:33.000 11.1664 146.0823 16900.0 6.3 SOUTH OF MARIANA ISLANDS
3 2010-07-12 00:11:20.060 -22.2789 -68.3159 109400.0 6.2 NORTHERN CHILE
4 2010-07-14 08:32:21.850 -38.0635 -73.4649 25700.0 6.6 NEAR COAST OF CENTRAL CHILE

Display the depth column using the two ways of indexing the pandas object (attribute and bracket notation).

print(quake.depth)
print(quake['depth'])
0        34400.0
1        30100.0
2        16900.0
3       109400.0
4        25700.0
          ...   
1780     10000.0
1781    105000.0
1782    608510.0
1783     10000.0
1784     35440.0
Name: depth, Length: 1785, dtype: float64
0        34400.0
1        30100.0
2        16900.0
3       109400.0
4        25700.0
          ...   
1780     10000.0
1781    105000.0
1782    608510.0
1783     10000.0
1784     35440.0
Name: depth, Length: 1785, dtype: float64

Calculate basic statistics of the data using the function describe.

quake.describe()
latitude longitude depth magnitude
count 1785.000000 1785.000000 1785.000000 1785.000000
mean 0.840700 40.608674 82773.187675 6.382403
std 30.579308 125.558363 146988.302031 0.429012
min -69.782500 -179.957200 0.000000 6.000000
25% -19.905100 -73.832200 12000.000000 6.100000
50% -4.478900 113.077800 24400.000000 6.300000
75% 27.161700 145.305700 58000.000000 6.600000
max 74.391800 179.998100 685500.000000 9.100000

Calculate the mean and median of a specific Series, for example depth.

# answer it here
print(quake.depth.mean())
print(quake.depth.median())
82773.18767507003
24400.0

1.3 Basic Python Functions#

We will now practice modifying the content of the DataFrame using functions. As an example, we want to change the depth values from meters to kilometers. First, we define this operation as a function:

# this function converts a value in meters to a value in kilometers
m2km = 1000 # this is defined as a global variable
def meters2kilometers(x):
    return x/m2km
# now test it using the first element of the quake DataFrame
meters2kilometers(quake.depth[0])
34.4

Let's define another function that uses a local instead of a global variable:

def meters2kilometers2(x):
    m2km2=1000
    return x/m2km2
# m2km2 is a local variable and cannot be called outside of the function. Prove it next by inquiring its value in the next cell.
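A quick check of the two functions; the commented line would fail because m2km2 only exists inside the function (a small sketch):

print(meters2kilometers2(quake.depth[0]))  # works and returns 34.4
# print(m2km2)  # NameError: name 'm2km2' is not defined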

A function can be applied to an entire Series at once:

# now we apply it on the entire Series
meters2kilometers(quake.depth)
0        34.40
1        30.10
2        16.90
3       109.40
4        25.70
         ...  
1780     10.00
1781    105.00
1782    608.51
1783     10.00
1784     35.44
Name: depth, Length: 1785, dtype: float64

We can also define this very basic function as a lambda (anonymous) function. There are several ways of applying an operation to all rows of a column; one option is the map function, which you will use below.

If you are not familiar with lambda functions in Python, please read additional tutorials here.

Let's practice lambda functions a bit:

# Now the equivalent in lambda is:
lambda_meters2kilometers = lambda x:x/1000
# x is the variable
# apply it to the entire series
lambda_meters2kilometers(quake.depth)
0        34.40
1        30.10
2        16.90
3       109.40
4        25.70
         ...  
1780     10.00
1781    105.00
1782    608.51
1783     10.00
1784     35.44
Name: depth, Length: 1785, dtype: float64
# you can add several variables into lambda functions
remove_anything = lambda x,y:x-y
remove_anything(3,2)
1
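Lambda functions are often passed directly as arguments to other functions, for instance as a sorting key; a small, self-contained illustration:

# sort a list of (region, magnitude) tuples by magnitude using a lambda as the key
events = [('VANUATU', 6.3), ('CHILE', 6.6), ('HONSHU', 6.2)]
print(sorted(events, key=lambda item: item[1]))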

This did not affect the values in the DataFrame; check it:

quake.depth
0        34400.0
1        30100.0
2        16900.0
3       109400.0
4        25700.0
          ...   
1780     10000.0
1781    105000.0
1782    608510.0
1783     10000.0
1784     35440.0
Name: depth, Length: 1785, dtype: float64

Instead, you could overwrite the column with quake.depth = X. Try both approaches, but apply the conversion only once! You can use map (a generic way of applying a function to every element) and apply (a pandas-specific function for applying functions to Series or DataFrames).
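As a small sketch of the difference, applied to the toy DataFrame my_pd from Section 1.1 (not to quake, which you should only convert once):

# map applies a function element-wise to a Series
print(my_pd.temperature.map(lambda x: x / 10.0))
# apply does the same for a Series (and can also work row- or column-wise on a DataFrame)
print(my_pd.temperature.apply(lambda x: x / 10.0))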

Student response section

This section is left for the student to complete.

Try map to apply a lambda function that rescales depth from meters to kilometers.

# implement answer here

Discuss in class: What happened to the depth field?

What happened to the original data frame?

Try apply to apply a lambda function that rescales depth from meters to kilometers.

# or like this
quake.depth=quake.depth.apply(lambda x:x/1000)

Plot a histogram of the depth distribution using the matplotlib function hist.

Student response section

This section is left for the student to complete.

# implement answer here
plt.hist(quake.depth, 100)
plt.grid(True)
plt.xlabel('Quake depth (km)')

You can use the interactive plotting package Plotly. First we will show a histogram of the event depth using the function histogram.

fig = px.histogram(quake,   #specify what dataframe to use
             x="depth",  #specify the variable for the histogram 
             nbins=50,       #number of bins for the histogram 
             height=400,     #dimensions of the figure
             width=600);
fig.show()
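If the interactive figure does not display in your environment, it can be written to a standalone file instead (a sketch; the file names are arbitrary, and static image export additionally requires the optional kaleido package):

# save the interactive figure as a standalone HTML file
fig.write_html("quake_depth_histogram.html")
# static export to PNG needs the kaleido package installed
# fig.write_image("quake_depth_histogram.png")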

Alternatively, you can try another interactive plotting package Bokeh. Here is the same plot as above but with Bokeh.

from bokeh.plotting import figure, show, output_notebook
output_notebook()
Loading BokehJS ...
p = figure(width=600, height=400)  # recent Bokeh versions use width/height instead of plot_width/plot_height

hist, edges = np.histogram(quake.depth, density=True, bins=50)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#636EFA", fill_color="#636EFA")

show(p)

We will now make a new plot of the earthquake locations using the Plotly tool.

The marker size will be scaled with the earthquake magnitude. To do so, we add a marker_size Series to the DataFrame:

quake['marker_size'] = np.fix(np.exp(quake['magnitude'])) # marker size scales as exp(magnitude)
quake['magnitude bin'] = 0.5*np.fix(2*quake['magnitude']) # magnitude rounded down to 0.5 bins
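A quick check of the new columns (a sketch):

# inspect the added columns next to the magnitude
print(quake[['magnitude', 'marker_size', 'magnitude bin']].head())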

1.4 Mapping using Plotly#

Now we will plot the earthquake locations on a map using the Plotly package. More tutorials on Plotly. The inputs of the function are self-explanatory and typical of Python functions. The code documentation of Plotly scatter_geo lists the variables.

fig = px.scatter_geo(quake,
                     lat='latitude',lon='longitude', 
                     range_color=(6,9),
                     height=600, width=600,
                     size='marker_size', color='magnitude',
                     hover_name="description",
                     hover_data=['description','magnitude','depth']);
fig.update_geos(resolution=110, showcountries=True)
fig.update_geos(resolution=110, showcountries=True,projection_type="orthographic")
fig

The data was sorted by time. We now want to sort and show the data by magnitude instead. We use the pandas function sort_values to create a new DataFrame with sorted values.

quakes2plot=quake.sort_values(by='magnitude bin')

quakes2plot.head()
time latitude longitude depth magnitude description marker_size magnitude bin
0 2010-07-02 06:04:03.570 -13.6098 166.6541 0.03440 6.3 VANUATU ISLANDS 544.0 6.0
1112 2017-08-31 17:06:55.750 -1.1590 99.6881 0.04314 6.3 SOUTHERN SUMATRA 544.0 6.0
1111 2017-08-27 04:17:51.010 -1.4513 148.0803 0.00800 6.3 ADMIRALTY ISLANDS REGION 544.0 6.0
1110 2017-08-19 02:00:52.550 -17.9609 -178.8406 0.54400 6.4 FIJI ISLANDS REGION 601.0 6.0
1108 2017-08-13 03:08:10.560 -3.7682 101.6228 0.03100 6.4 SOUTHERN SUMATRA 601.0 6.0
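You could also sort in descending order so that the largest events appear first; a brief sketch:

# descending sort puts the largest magnitudes at the top
quakes_desc = quake.sort_values(by='magnitude', ascending=False)
print(quakes_desc[['magnitude', 'description']].head())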

Now we will plot again using Plotly

fig = px.scatter_geo(quakes2plot,
                     lat='latitude',lon='longitude', 
                     range_color=(6,9),
                     height=600, width=600,
                     size='marker_size', color='magnitude',
                     hover_name="description",
                     hover_data=['description','magnitude','depth']);
fig.update_geos(resolution=110, showcountries=True)
# fig.update_geos(resolution=110, showcountries=True,projection_type="orthographic")

2 Creating a pandas DataFrame from a text file#

The Python package pandas is very useful for reading CSV files, but also many other text files that are more or less formatted as one observation per row and one column per feature.

As an example, we are going to look at the list of seismic stations from the Northern California seismic network, available here:

url = 'http://ncedc.org/ftp/pub/doc/NC.info/NC.channel.summary.day'
# this gets the file linked in the URL and converts its content to a string
s = requests.get(url).content 
# this decodes the string and parses it into a table (the raw string avoids escape warnings in '\s+')
data = pd.read_csv(io.StringIO(s.decode('utf-8')), header=None, skiprows=2, sep=r'\s+', usecols=list(range(0, 13)))
# because columns/keys were not assigned, assign them now
data.columns = ['station', 'network', 'channel', 'location', 'rate', 'start_time', 'end_time', 'latitude', 'longitude', 'elevation', 'depth', 'dip', 'azimuth']

Let us look at the data. It is now stored in a pandas DataFrame.

data.head()
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth
0 AAR NC EHZ -- 100.0 1984/01/01,00:00:00 1987/05/01,00:00:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0
1 AAR NC EHZ -- 100.0 1987/05/01,00:00:00 2006/01/04,19:19:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0
2 AAR NC SHZ -- 20.0 1994/11/28,00:00:00 2006/01/04,19:19:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0
3 AAS NC EHZ -- 100.0 1984/11/27,18:45:00 1987/05/01,00:00:00 38.43014 -121.10959 31.0 0.0 -90.0 0.0
4 AAS NC EHZ -- 100.0 1987/05/01,00:00:00 2021/08/13,16:50:00 38.43014 -121.10959 31.0 0.0 -90.0 0.0

We can output the first row of the DataFrame with iloc (and, similarly, the first column):

data.iloc[0]
station                       AAR
network                        NC
channel                       EHZ
location                       --
rate                        100.0
start_time    1984/01/01,00:00:00
end_time      1987/05/01,00:00:00
latitude                 39.27594
longitude              -121.02696
elevation                   911.0
depth                         0.0
dip                         -90.0
azimuth                       0.0
Name: 0, dtype: object
data.iloc[:, 0]
0        AAR
1        AAR
2        AAR
3        AAS
4        AAS
        ... 
7151     WMP
7152     WMP
7153     WMP
7154     WMP
7155    WWVB
Name: station, Length: 7156, dtype: object

The DataFrame may contain bad values. Typical data cleaning involves removing NaNs and zeros, for instance.
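Before cleaning, you can check how many rows would be affected; a quick sketch:

# count missing values per column, and rows with a zero longitude
print(data.isna().sum())
print((data.longitude == 0).sum())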

Use Plotly to map the stations.

data.dropna(inplace=True)
data=data[data.longitude!=0]
fig = px.scatter_geo(data,
                     lat='latitude',lon='longitude', 
                     range_color=(6,9),
                     height=600, width=600,
                     hover_name="station",
                     hover_data=['network','station','channel','rate']);
fig.update_geos(resolution=110, showcountries=True)
fig = px.scatter_mapbox(data,
                     lat='latitude',lon='longitude', 
                     range_color=(6,9),mapbox_style="carto-positron",
                     height=600, width=500,
                     hover_name="station",
                     hover_data=['network','station','channel','rate']);
fig.update_layout(title="Northern California Seismic Network")
fig.show()

3 Manipulating Pandas#

We can filter the data with the value taken by a given column:

data.loc[data.station=='KCPB']
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth
3407 KCPB NC BHE -- 50.0 1999/08/03,00:00:00 2000/06/06,16:00:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3408 KCPB NC BHE -- 50.0 2000/06/06,16:00:00 2002/01/24,23:50:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3409 KCPB NC BHE -- 50.0 2002/01/24,23:50:00 2002/10/16,23:59:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3410 KCPB NC BHE -- 20.0 2002/10/17,00:00:00 2006/01/24,18:00:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3411 KCPB NC BHN -- 50.0 1999/08/03,00:00:00 2000/06/06,16:00:00 39.68631 -123.58242 1261.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3480 KCPB NC MNE -- 10.0 2000/06/06,16:00:00 2000/07/12,00:00:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3481 KCPB NC MNN -- 10.0 1999/08/03,00:00:00 2000/06/06,16:00:00 39.68631 -123.58242 1261.0 0.0 0.0 0.0
3482 KCPB NC MNN -- 10.0 2000/06/06,16:00:00 2000/07/12,00:00:00 39.68631 -123.58242 1261.0 0.0 0.0 0.0
3483 KCPB NC MNZ -- 10.0 1999/08/03,00:00:00 2000/06/06,16:00:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0
3484 KCPB NC MNZ -- 10.0 2000/06/06,16:00:00 2000/07/12,00:00:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0

78 rows × 13 columns

# Select two stations, use the typical "OR" |
data.loc[(data.station=='KCPB') | (data.station=='KHBB')]
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth
3407 KCPB NC BHE -- 50.0 1999/08/03,00:00:00 2000/06/06,16:00:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3408 KCPB NC BHE -- 50.0 2000/06/06,16:00:00 2002/01/24,23:50:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3409 KCPB NC BHE -- 50.0 2002/01/24,23:50:00 2002/10/16,23:59:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3410 KCPB NC BHE -- 20.0 2002/10/17,00:00:00 2006/01/24,18:00:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3411 KCPB NC BHN -- 50.0 1999/08/03,00:00:00 2000/06/06,16:00:00 39.68631 -123.58242 1261.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3666 KHBB NC LHZ -- 1.0 2016/04/28,16:56:00 2022/08/09,18:00:00 40.65990 -123.21966 1864.0 0.0 -90.0 0.0
3667 KHBB NC LHZ -- 1.0 2022/08/09,18:00:00 3000/01/01,00:00:00 40.65990 -123.21966 1864.0 0.0 -90.0 0.0
3668 KHBB NC LNE -- 1.0 2015/10/29,21:18:00 2016/04/28,16:56:00 40.65990 -123.21966 1864.0 0.0 0.0 90.0
3669 KHBB NC LNN -- 1.0 2015/10/29,21:18:00 2016/04/28,16:56:00 40.65990 -123.21966 1864.0 0.0 0.0 0.0
3670 KHBB NC LNZ -- 1.0 2015/10/29,21:18:00 2016/04/28,16:56:00 40.65990 -123.21966 1864.0 0.0 -90.0 0.0

123 rows × 13 columns

# Select two stations, use the typical "AND" &
data.loc[(data.station=='KCPB') & (data.channel=='HNZ')]
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth
3457 KCPB NC HNZ -- 100.0 2002/10/17,00:00:00 2006/10/18,00:08:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0
3458 KCPB NC HNZ -- 100.0 2006/10/18,00:08:00 2010/11/01,22:00:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0
3459 KCPB NC HNZ -- 100.0 2010/11/01,22:00:00 2011/07/13,00:00:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0
3460 KCPB NC HNZ -- 100.0 2011/07/13,00:00:00 2011/09/07,19:00:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0
3461 KCPB NC HNZ -- 100.0 2011/09/07,19:00:00 2015/10/29,18:00:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0
3462 KCPB NC HNZ -- 100.0 2015/10/29,18:00:00 2016/04/28,16:39:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0
3463 KCPB NC HNZ -- 100.0 2016/04/28,16:39:00 3000/01/01,00:00:00 39.68631 -123.58242 1261.0 0.0 -90.0 0.0
# or like this
data.loc[data.station.isin(['KCPB', 'KHBB'])]
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth
3407 KCPB NC BHE -- 50.0 1999/08/03,00:00:00 2000/06/06,16:00:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3408 KCPB NC BHE -- 50.0 2000/06/06,16:00:00 2002/01/24,23:50:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3409 KCPB NC BHE -- 50.0 2002/01/24,23:50:00 2002/10/16,23:59:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3410 KCPB NC BHE -- 20.0 2002/10/17,00:00:00 2006/01/24,18:00:00 39.68631 -123.58242 1261.0 0.0 0.0 90.0
3411 KCPB NC BHN -- 50.0 1999/08/03,00:00:00 2000/06/06,16:00:00 39.68631 -123.58242 1261.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3666 KHBB NC LHZ -- 1.0 2016/04/28,16:56:00 2022/08/09,18:00:00 40.65990 -123.21966 1864.0 0.0 -90.0 0.0
3667 KHBB NC LHZ -- 1.0 2022/08/09,18:00:00 3000/01/01,00:00:00 40.65990 -123.21966 1864.0 0.0 -90.0 0.0
3668 KHBB NC LNE -- 1.0 2015/10/29,21:18:00 2016/04/28,16:56:00 40.65990 -123.21966 1864.0 0.0 0.0 90.0
3669 KHBB NC LNN -- 1.0 2015/10/29,21:18:00 2016/04/28,16:56:00 40.65990 -123.21966 1864.0 0.0 0.0 0.0
3670 KHBB NC LNZ -- 1.0 2015/10/29,21:18:00 2016/04/28,16:56:00 40.65990 -123.21966 1864.0 0.0 -90.0 0.0

123 rows × 13 columns
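An equivalent, often more readable way to express such filters is the query method (a brief sketch):

# the same AND filter written as a query string
data.query("station == 'KCPB' and channel == 'HNZ'")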

We can access a brief summary of the data:

data.station.describe()
count     7136
unique     905
top       KCPB
freq        78
Name: station, dtype: object
data.elevation.describe()
count    7136.000000
mean      665.036617
std       670.177475
min     -1388.000000
25%       164.000000
50%       446.000000
75%       925.000000
max      3680.000000
Name: elevation, dtype: float64

We can perform standard operations on the whole data set:

data.mean(numeric_only=True)  # restrict the mean to numeric columns
rate          87.692383
latitude      38.053684
longitude   -121.828840
elevation    665.036617
depth         15.064938
dip          -41.758688
azimuth       17.735987
dtype: float64

In the case of a categorical variable, we can get the list of possible values that this variable can take:

data.channel.unique()
array(['EHZ', 'SHZ', 'HHE', 'HHN', 'HHZ', 'HNE', 'HNN', 'HNZ', 'LHE',
       'LHN', 'LHZ', 'ACE', 'GAN', 'GEL', 'GLA', 'GLO', 'GNS', 'GPL',
       'LCE', 'LCQ', 'VCO', 'VDT', 'VEC', 'VEI', 'VPB', 'ELE', 'ELN',
       'ELZ', 'SLE', 'SLN', 'SLZ', 'LCL', 'LOG', 'OCF', 'VEA', 'VEP',
       'VFP', 'VKI', 'GST', 'EHE', 'EHN', 'BNE', 'BNN', 'BN1', 'BN2',
       'BN3', 'BV1', 'EP1', 'EP2', 'EP3', 'HDO', 'HN1', 'HN2', 'HN3',
       'HV1', 'SP2', 'SP3', 'BNZ', 'LDO', 'HJ2', 'HJ3', 'HJZ', 'SHE',
       'SHN', 'BHE', 'BHN', 'BHZ', 'LNE', 'LNN', 'LNZ', 'MHE', 'MHN',
       'MHZ', 'MNE', 'MNN', 'MNZ', 'HH2', 'HH3', 'LH2', 'LH3', 'SP1',
       'DP1', 'DP2', 'DP3', 'HLE', 'HLN', 'HLZ', 'XNE', 'XNN', 'XNZ',
       'EH1', 'VM1', 'VM2', 'VM3'], dtype=object)

and get the number of times that each value is taken:

data.station.value_counts()
KCPB    78
JSGB    73
KMPB    72
KHMB    63
CCH1    60
        ..
LAMB     1
LBA      1
PBY      1
MCW      1
MMW      1
Name: station, Length: 905, dtype: int64

The second option for applying a function to every row of the DataFrame is the apply function:

# compute the mean of the unique elevation values
data_elevation_mean = data.elevation.unique().mean()
def remean_elevation(row):
    # subtract the mean elevation from each row's elevation
    row.elevation = row.elevation - data_elevation_mean
    return row
# apply the function row by row (axis='columns' passes one row at a time)
data.apply(remean_elevation, axis='columns')
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth
0 AAR NC EHZ -- 100.0 1984/01/01,00:00:00 1987/05/01,00:00:00 39.27594 -121.02696 154.125816 0.0 -90.0 0.0
1 AAR NC EHZ -- 100.0 1987/05/01,00:00:00 2006/01/04,19:19:00 39.27594 -121.02696 154.125816 0.0 -90.0 0.0
2 AAR NC SHZ -- 20.0 1994/11/28,00:00:00 2006/01/04,19:19:00 39.27594 -121.02696 154.125816 0.0 -90.0 0.0
3 AAS NC EHZ -- 100.0 1984/11/27,18:45:00 1987/05/01,00:00:00 38.43014 -121.10959 -725.874184 0.0 -90.0 0.0
4 AAS NC EHZ -- 100.0 1987/05/01,00:00:00 2021/08/13,16:50:00 38.43014 -121.10959 -725.874184 0.0 -90.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
7150 WMP NC SHE -- 20.0 1995/07/02,12:00:00 2002/05/08,22:30:00 35.64059 -118.78570 321.125816 0.0 0.0 90.0
7151 WMP NC SHN -- 20.0 1995/07/02,12:00:00 2002/05/08,22:30:00 35.64059 -118.78570 321.125816 0.0 0.0 0.0
7152 WMP NC SHZ -- 20.0 1995/03/02,19:00:00 1995/07/02,12:00:00 35.64059 -118.78570 321.125816 0.0 -90.0 0.0
7153 WMP NC SHZ -- 20.0 1995/07/02,12:00:00 2002/05/08,22:30:00 35.64059 -118.78570 321.125816 0.0 -90.0 0.0
7154 WMP NC SHZ 10 20.0 1995/07/02,12:00:00 1999/05/11,23:59:00 35.64059 -118.78570 321.125816 0.0 -90.0 0.0

7136 rows × 13 columns
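Row-wise apply is flexible but slow on large tables; the same recentering can be done with a vectorized column operation (a sketch that returns a new Series rather than modifying data):

# vectorized equivalent of remean_elevation
recentered = data.elevation - data_elevation_mean
print(recentered.head())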

We can also carry out simple operations on columns, provided they make sense.

data.network + ' - ' + data.station
0       NC - AAR
1       NC - AAR
2       NC - AAR
3       NC - AAS
4       NC - AAS
          ...   
7150    NC - WMP
7151    NC - WMP
7152    NC - WMP
7153    NC - WMP
7154    NC - WMP
Length: 7136, dtype: object

A useful feature is to group the rows depending on the value of a categorical variable, and then apply the same operation to all the groups. For instance, to know how many times each station appears in the file:

data.groupby('station').station.count()
station
AAR     3
AAS     3
ABJ     4
ABR     3
ADW     3
       ..
VCL     2
VRC     4
VSP     2
VWB     1
WMP    10
Name: station, Length: 905, dtype: int64

We can access the data type of each column:

data.dtypes
station        object
network        object
channel        object
location       object
rate          float64
start_time     object
end_time       object
latitude      float64
longitude     float64
elevation     float64
depth         float64
dip           float64
azimuth       float64
dtype: object

Here, pandas does not recognize the start_time and end_time columns as a datetime format, so we cannot use datetime operations on them. We first need to convert these columns into a datetime format:

# check the current type of the start_time entries
type(data['start_time'][0])
str
# Transform column from string into datetime format
startdate = pd.to_datetime(data['start_time'], format='%Y/%m/%d,%H:%M:%S')
data['start_time'] = startdate
print(data['start_time'] )
type(data['start_time'][0])
0      1984-01-01 00:00:00
1      1987-05-01 00:00:00
2      1994-11-28 00:00:00
3      1984-11-27 18:45:00
4      1987-05-01 00:00:00
               ...        
7150   1995-07-02 12:00:00
7151   1995-07-02 12:00:00
7152   1995-03-02 19:00:00
7153   1995-07-02 12:00:00
7154   1995-07-02 12:00:00
Name: start_time, Length: 7136, dtype: datetime64[ns]
pandas._libs.tslibs.timestamps.Timestamp
print(data['end_time'])
0       1987/05/01,00:00:00
1       2006/01/04,19:19:00
2       2006/01/04,19:19:00
3       1987/05/01,00:00:00
4       2021/08/13,16:50:00
               ...         
7150    2002/05/08,22:30:00
7151    2002/05/08,22:30:00
7152    1995/07/02,12:00:00
7153    2002/05/08,22:30:00
7154    1999/05/11,23:59:00
Name: end_time, Length: 7136, dtype: object
# do the same for end times
# Avoid 'OutOfBoundsDatetime' error with year 3000
enddate = data['end_time'].str.replace('3000', '2025')
enddate = pd.to_datetime(enddate, format='%Y/%m/%d,%H:%M:%S')
data['end_time'] = enddate
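With both columns in datetime format, datetime arithmetic becomes available; for example, the duration of each channel epoch (a sketch):

# subtracting two datetime columns yields Timedelta objects
duration = data['end_time'] - data['start_time']
print(duration.head())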

We can now look at when each seismic station was installed:

data.groupby('station').apply(lambda df: df.start_time.min())
station
AAR   1984-01-01 00:00:00
AAS   1984-11-27 18:45:00
ABJ   1984-01-01 00:00:00
ABR   1984-01-01 00:00:00
ADW   1984-01-01 00:00:00
              ...        
VCL   1984-01-01 00:00:00
VRC   1993-09-23 22:20:00
VSP   1993-09-24 22:05:00
VWB   1985-01-01 00:00:00
WMP   1995-03-02 19:00:00
Length: 905, dtype: datetime64[ns]

The agg function allows us to carry out several operations on each group of rows:

data.groupby(['station']).elevation.agg(['min', 'max'])
min max
station
AAR 911.0 911.0
AAS 31.0 31.0
ABJ 434.0 434.0
ABR -1.0 -1.0
ADW 228.0 228.0
... ... ...
VCL 2048.0 2048.0
VRC 1666.0 1666.0
VSP 1545.0 1545.0
VWB 1736.0 1736.0
WMP 1078.0 1078.0

905 rows × 2 columns

For each station, find the date it was first deployed and the date it was last recovered:

data.groupby(['station']).agg({'start_time':lambda x: min(x), 'end_time':lambda x: max(x)})
start_time end_time
station
AAR 1984-01-01 00:00:00 2006-01-04 19:19:00
AAS 1984-11-27 18:45:00 2021-08-13 16:50:00
ABJ 1984-01-01 00:00:00 2019-06-26 19:17:00
ABR 1984-01-01 00:00:00 1997-08-04 21:02:00
ADW 1984-01-01 00:00:00 2006-04-20 01:08:00
... ... ...
VCL 1984-01-01 00:00:00 1985-03-25 21:30:00
VRC 1993-09-23 22:20:00 2001-07-12 16:50:00
VSP 1993-09-24 22:05:00 2001-07-12 16:50:00
VWB 1985-01-01 00:00:00 1985-03-25 19:00:00
WMP 1995-03-02 19:00:00 2002-05-08 22:30:00

905 rows × 2 columns

We can also make groups by selecting the values of two categorical variables:

data.groupby(['station', 'channel']).agg({'start_time':lambda x: min(x), 'end_time':lambda x: max(x)})
start_time end_time
station channel
AAR EHZ 1984-01-01 00:00:00 2006-01-04 19:19:00
SHZ 1994-11-28 00:00:00 2006-01-04 19:19:00
AAS EHZ 1984-11-27 18:45:00 2021-08-13 16:50:00
SHZ 1994-11-28 00:00:00 2006-01-24 18:00:00
ABJ EHZ 1984-01-01 00:00:00 2019-06-26 19:17:00
... ... ... ...
WMP EHN 1995-07-02 12:00:00 2002-05-08 22:30:00
EHZ 1995-03-02 19:00:00 2002-05-08 22:30:00
SHE 1995-07-02 12:00:00 2002-05-08 22:30:00
SHN 1995-07-02 12:00:00 2002-05-08 22:30:00
SHZ 1995-03-02 19:00:00 2002-05-08 22:30:00

4289 rows × 2 columns

Previously, we just printed the output, but we can also store it in a new variable:

data_grouped = data.groupby(['station', 'channel']).agg({'start_time':lambda x: min(x), 'end_time':lambda x: max(x)})
data_grouped.head()
start_time end_time
station channel
AAR EHZ 1984-01-01 00:00:00 2006-01-04 19:19:00
SHZ 1994-11-28 00:00:00 2006-01-04 19:19:00
AAS EHZ 1984-11-27 18:45:00 2021-08-13 16:50:00
SHZ 1994-11-28 00:00:00 2006-01-24 18:00:00
ABJ EHZ 1984-01-01 00:00:00 2019-06-26 19:17:00

When we select only some rows, the index is not automatically reset to start at 0. We can do it manually. Many pandas functions also have an option to reset the index, as well as an option to transform the DataFrame in place instead of saving the result in another variable.

data_grouped.reset_index()
station channel start_time end_time
0 AAR EHZ 1984-01-01 00:00:00 2006-01-04 19:19:00
1 AAR SHZ 1994-11-28 00:00:00 2006-01-04 19:19:00
2 AAS EHZ 1984-11-27 18:45:00 2021-08-13 16:50:00
3 AAS SHZ 1994-11-28 00:00:00 2006-01-24 18:00:00
4 ABJ EHZ 1984-01-01 00:00:00 2019-06-26 19:17:00
... ... ... ... ...
4284 WMP EHN 1995-07-02 12:00:00 2002-05-08 22:30:00
4285 WMP EHZ 1995-03-02 19:00:00 2002-05-08 22:30:00
4286 WMP SHE 1995-07-02 12:00:00 2002-05-08 22:30:00
4287 WMP SHN 1995-07-02 12:00:00 2002-05-08 22:30:00
4288 WMP SHZ 1995-03-02 19:00:00 2002-05-08 22:30:00

4289 rows × 4 columns
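To keep the flattened table, store the result, or use the in-place option mentioned above (a sketch):

# store the reset table in a new variable...
data_flat = data_grouped.reset_index()
# ...or modify data_grouped directly
# data_grouped.reset_index(inplace=True)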

It is also possible to sort the dataset by value.

data_grouped.sort_values(by='start_time')
start_time end_time
station channel
PG1 HV1 1983-06-27 1993-11-10 18:00:00
AAR EHZ 1984-01-01 2006-01-04 19:19:00
LWH EHZ 1984-01-01 2020-03-17 20:50:00
LWD EHZ 1984-01-01 1986-09-25 21:30:00
LTN EHZ 1984-01-01 1988-07-26 23:50:00
... ... ... ...
BTI VCO 2022-09-28 2025-01-01 00:00:00
VDT 2022-09-28 2025-01-01 00:00:00
VEC 2022-09-28 2025-01-01 00:00:00
GEL 2022-09-28 2025-01-01 00:00:00
VPB 2022-09-28 2025-01-01 00:00:00

4289 rows × 2 columns

We can apply the sorting to several columns:

data_grouped.sort_values(by=['start_time', 'end_time'])
start_time end_time
station channel
PG1 HV1 1983-06-27 1993-11-10 18:00:00
PGC EHZ 1984-01-01 1984-04-12 18:00:00
AFO EHZ 1984-01-01 1984-04-23 00:00:00
MCH EHZ 1984-01-01 1984-08-01 19:31:00
MCN EHZ 1984-01-01 1984-09-22 16:45:00
... ... ... ...
BTI VCO 2022-09-28 2025-01-01 00:00:00
VDT 2022-09-28 2025-01-01 00:00:00
VEC 2022-09-28 2025-01-01 00:00:00
VEI 2022-09-28 2025-01-01 00:00:00
VPB 2022-09-28 2025-01-01 00:00:00

4289 rows × 2 columns

4 CSV vs Parquet#

Parquet is a columnar data format that stores and compresses data column by column. It is fast for I/O and compact on disk.

Save data into a CSV file:

%timeit data.to_csv("my_metadata.csv")
!ls -lh my_metadata.csv
66.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now save the data in Parquet format and compare the time and file size.

%timeit data.to_parquet("my_metadata.pq")
!ls -lh my_metadata.pq
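To complete the comparison, you can also check the file sizes in a platform-independent way and time reading each format back (a sketch; reading and writing Parquet requires pyarrow or fastparquet):

import os
# compare on-disk sizes without relying on a shell command
print(os.path.getsize("my_metadata.csv"), "bytes (CSV)")
print(os.path.getsize("my_metadata.pq"), "bytes (Parquet)")
# compare read speeds
%timeit pd.read_csv("my_metadata.csv")
%timeit pd.read_parquet("my_metadata.pq")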