2.4 Pandas
This section aims to provide new skills in Python for handling structured, tabular data.
Learning outcomes:
Manipulation of data frames (describing, filtering, …)
Learn about Lambda functions
Intro to datetime objects
Plotting data from data frames (histograms and maps)
Introduction to Plotly
Introduction to CSV & Parquet
This tutorial can be offered as a 2-hour course. Content is organized by difficulty (Levels 1, 2, and 3 correspond to Sections 1, 2, and 3), and instructors may choose to leave the higher levels for asynchronous, self-guided learning.
We will work on several structured data sets: sensor metadata and a seismic data product (an earthquake catalog).
First, we import all the modules we need:
import numpy as np
import pandas as pd
import io
import requests
import time
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import plotly.express as px
import plotly.io as pio
# pio.renderers.default = 'vscode' # renders inside VS Code
# pio.renderers.default = 'iframe' # writes figures as standalone HTML files
# pio.renderers.default = 'png'    # writes figures as static PNG files
# try notebook, jupyterlab, png, vscode, iframe
1 Pandas Fundamentals
1.1 Basics
Pandas is built around two structures: the Series and the DataFrame. A Series is a single column with an index of attributes or keys. A DataFrame is a multi-dimensional table made up of Series.
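For instance, a minimal sketch of a standalone Series (the values here are illustrative):
s = pd.Series([36, 37, 30, 50], name='temperature')
print(s)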
We can create a DataFrame composed of Series from scratch using a Python dictionary:
data = {
'temperature' : [36,37,30,50],
'precipitation':[3,1,0,0]
}
my_pd = pd.DataFrame(data)
print(my_pd)
temperature precipitation
0 36 3
1 37 1
2 30 0
3 50 0
Each (key, value) item in data becomes a column of the DataFrame. To get the keys of the DataFrame, type:
my_pd.keys()
Index(['temperature', 'precipitation'], dtype='object')
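Equivalently, the columns attribute returns the same Index:
my_pd.columns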
To get a specific Series (which is different from a NumPy array):
print(my_pd.temperature[:])
print(type(my_pd.temperature[:]))
0 36
1 37
2 30
3 50
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>
To get the value of a specific key (e.g., temperature) at a specific index (e.g., 2), type:
print(my_pd.temperature[2])
print(type(my_pd.temperature[2]))
30
<class 'numpy.int64'>
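The same value can also be obtained with label-based indexing through loc:
my_pd.loc[2, 'temperature']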
1.2 Reading a DataFrame from a CSV file
We can read a pandas DataFrame directly from a standard file. Here you will read a catalog of earthquakes.
quake = pd.read_csv("Global_Quakes_IRIS.csv")
Now use the head function to display the first rows of the file.
# enter answer here
quake.head()
 | time | latitude | longitude | depth | magnitude | description |
---|---|---|---|---|---|---|
0 | 2010-07-02 06:04:03.570 | -13.6098 | 166.6541 | 34400.0 | 6.3 | VANUATU ISLANDS |
1 | 2010-07-04 21:55:52.370 | 39.6611 | 142.5792 | 30100.0 | 6.3 | NEAR EAST COAST OF HONSHU |
2 | 2010-07-10 11:43:33.000 | 11.1664 | 146.0823 | 16900.0 | 6.3 | SOUTH OF MARIANA ISLANDS |
3 | 2010-07-12 00:11:20.060 | -22.2789 | -68.3159 | 109400.0 | 6.2 | NORTHERN CHILE |
4 | 2010-07-14 08:32:21.850 | -38.0635 | -73.4649 | 25700.0 | 6.6 | NEAR COAST OF CENTRAL CHILE |
Display the depth using the two ways of accessing a Series in a pandas DataFrame:
print(quake.depth)
print(quake['depth'])
0 34400.0
1 30100.0
2 16900.0
3 109400.0
4 25700.0
...
1780 10000.0
1781 105000.0
1782 608510.0
1783 10000.0
1784 35440.0
Name: depth, Length: 1785, dtype: float64
0 34400.0
1 30100.0
2 16900.0
3 109400.0
4 25700.0
...
1780 10000.0
1781 105000.0
1782 608510.0
1783 10000.0
1784 35440.0
Name: depth, Length: 1785, dtype: float64
Calculate basic statistics of the data using the function describe.
quake.describe()
 | latitude | longitude | depth | magnitude |
---|---|---|---|---|
count | 1785.000000 | 1785.000000 | 1785.000000 | 1785.000000 |
mean | 0.840700 | 40.608674 | 82773.187675 | 6.382403 |
std | 30.579308 | 125.558363 | 146988.302031 | 0.429012 |
min | -69.782500 | -179.957200 | 0.000000 | 6.000000 |
25% | -19.905100 | -73.832200 | 12000.000000 | 6.100000 |
50% | -4.478900 | 113.077800 | 24400.000000 | 6.300000 |
75% | 27.161700 | 145.305700 | 58000.000000 | 6.600000 |
max | 74.391800 | 179.998100 | 685500.000000 | 9.100000 |
Calculate the mean and median of a specific Series, for example depth.
# answer it here
print(quake.depth.mean())
print(quake.depth.median())
82773.18767507003
24400.0
1.3 Basic Python Functions
We will now practice modifying the content of the DataFrame using functions. We will take the example of changing the depth values from meters to kilometers. First, we can define this operation as a function:
# this function converts a value in meters to a value in kilometers
m2km = 1000 # this is defined as a global variable
def meters2kilometers(x):
    return x/m2km
# now test it using the first element of the quake DataFrame
meters2kilometers(quake.depth[0])
34.4
Let’s define another function that uses a local instead of a global variable:
def meters2kilometers2(x):
    m2km2 = 1000
    return x/m2km2
# m2km2 is a local variable and cannot be called outside of the function. Prove it next by inquiring its value in the next cell.
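One quick way to prove it (a sketch that catches the expected error rather than crashing the notebook):
try:
    print(m2km2)
except NameError as err:
    # m2km2 only exists inside meters2kilometers2
    print('NameError:', err)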
Before turning to lambda functions, note that a regular function can be applied to an entire Series at once:
# now we apply it on the entire Series
meters2kilometers(quake.depth)
0 34.40
1 30.10
2 16.90
3 109.40
4 25.70
...
1780 10.00
1781 105.00
1782 608.51
1783 10.00
1784 35.44
Name: depth, Length: 1785, dtype: float64
We can also define this very basic function as a lambda function. There are several ways of applying an operation to all rows of a column; the first option is to use the map function. If you are not familiar with lambda functions in Python, please read additional tutorials here. We will practice lambda functions a bit.
# Now the equivalent in lambda is:
lambda_meters2kilometers = lambda x:x/1000
# x is the variable
# apply it to the entire series
lambda_meters2kilometers(quake.depth)
0 34.40
1 30.10
2 16.90
3 109.40
4 25.70
...
1780 10.00
1781 105.00
1782 608.51
1783 10.00
1784 35.44
Name: depth, Length: 1785, dtype: float64
# lambda functions can take several arguments
remove_anything = lambda x,y:x-y
remove_anything(3,2)
1
This did not affect the values of the DataFrame, check it:
quake.depth
0 34400.0
1 30100.0
2 16900.0
3 109400.0
4 25700.0
...
1780 10000.0
1781 105000.0
1782 608510.0
1783 10000.0
1784 35440.0
Name: depth, Length: 1785, dtype: float64
Instead, you could overwrite the column with quake.depth = X. Try two approaches, but run each only once! You can use map (a generic Python mechanism for applying a function to every element) and apply (the pandas-specific method for applying a function).
Student response section
This section is left for the student to complete.
Try map to apply a lambda function that rescales depth from meters to kilometers.
# implement answer here
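A minimal sketch of one possible answer (run it only once, otherwise depth gets rescaled repeatedly):
quake.depth = quake.depth.map(lambda x: x/1000) # overwrite the column, now in kilometers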
Discuss in class: What happened to the depth field? What happened to the original DataFrame?
Try apply to apply a lambda function that rescales depth from meters to kilometers.
# or like this
quake.depth=quake.depth.apply(lambda x:x/1000)
Plot a histogram of the depth distribution using the matplotlib function hist.
Student response section
This section is left for the student to complete.
# implement answer here
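A minimal sketch of one possible answer (assuming matplotlib was imported as plt at the top of the notebook):
plt.hist(quake.depth, 100) # 100 bins
plt.grid(True)
plt.xlabel('Quake depth (km)')
plt.show()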
You can use the interactive plotting package Plotly. First we will show a histogram of the event depth using the function histogram.
fig = px.histogram(quake, #specify what dataframe to use
x="depth", #specify the variable for the histogram
nbins=50, #number of bins for the histogram
height=400, #dimensions of the figure
width=600);
fig.show()
Alternatively, you can try another interactive plotting package, Bokeh. Here is the same plot as above, but with Bokeh.
from bokeh.plotting import figure, show # imports needed for figure() and show()
p = figure(width=600, height=400) # use plot_width/plot_height with Bokeh < 3.0
hist, edges = np.histogram(quake.depth, density=True, bins=50)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#636EFA", fill_color="#636EFA")
show(p)
We will now make a new plot of the earthquake locations on a map, using the Plotly tool. The marker size will be scaled with the earthquake magnitude. To do so, we add a marker_size Series to the DataFrame:
quake['marker_size'] = np.fix(np.exp(quake['magnitude'])) # marker size scales as exp(magnitude)
quake['magnitude bin'] = 0.5*np.fix(2*quake['magnitude']) # bin magnitudes in 0.5-magnitude increments
1.4 Mapping using Plotly
Now we will plot the earthquake locations on a map using the Plotly package. More tutorials on Plotly are available online. The inputs of the function are self-explanatory and typical of Python functions; the code documentation of Plotly scatter_geo lists the variables.
fig = px.scatter_geo(quake,
lat='latitude',lon='longitude',
range_color=(6,9),
height=600, width=600,
size='marker_size', color='magnitude',
hover_name="description",
hover_data=['description','magnitude','depth']);
fig.update_geos(resolution=110, showcountries=True)
fig.update_geos(resolution=110, showcountries=True,projection_type="orthographic")
fig
The data was sorted by time. We now want to sort and show the data by magnitude instead. We use the pandas function sort_values to create a new DataFrame with sorted values.
quakes2plot=quake.sort_values(by='magnitude bin')
quakes2plot.head()
 | time | latitude | longitude | depth | magnitude | description | marker_size | magnitude bin |
---|---|---|---|---|---|---|---|---|
0 | 2010-07-02 06:04:03.570 | -13.6098 | 166.6541 | 0.03440 | 6.3 | VANUATU ISLANDS | 544.0 | 6.0 |
1112 | 2017-08-31 17:06:55.750 | -1.1590 | 99.6881 | 0.04314 | 6.3 | SOUTHERN SUMATRA | 544.0 | 6.0 |
1111 | 2017-08-27 04:17:51.010 | -1.4513 | 148.0803 | 0.00800 | 6.3 | ADMIRALTY ISLANDS REGION | 544.0 | 6.0 |
1110 | 2017-08-19 02:00:52.550 | -17.9609 | -178.8406 | 0.54400 | 6.4 | FIJI ISLANDS REGION | 601.0 | 6.0 |
1108 | 2017-08-13 03:08:10.560 | -3.7682 | 101.6228 | 0.03100 | 6.4 | SOUTHERN SUMATRA | 601.0 | 6.0 |
Now we will plot again using Plotly:
fig = px.scatter_geo(quakes2plot,
lat='latitude',lon='longitude',
range_color=(6,9),
height=600, width=600,
size='marker_size', color='magnitude',
hover_name="description",
hover_data=['description','magnitude','depth']);
fig.update_geos(resolution=110, showcountries=True)
# fig.update_geos(resolution=110, showcountries=True,projection_type="orthographic")
2 Create a pandas DataFrame from a text file
The Python package pandas is very useful for reading CSV files, but also many other text files that are more or less formatted as one observation per row and one column per feature.
As an example, we are going to look at the list of seismic stations from the Northern California seismic network, available here:
url = 'http://ncedc.org/ftp/pub/doc/NC.info/NC.channel.summary.day'
# get the file linked at the URL and return its content as a byte string
s = requests.get(url).content
# decode the byte string and parse the text into a table
data = pd.read_csv(io.StringIO(s.decode('utf-8')), header=None, skiprows=2, sep=r'\s+', usecols=list(range(0, 13)))
# because columns/keys were not assigned, assign them now
data.columns = ['station', 'network', 'channel', 'location', 'rate', 'start_time', 'end_time', 'latitude', 'longitude', 'elevation', 'depth', 'dip', 'azimuth']
Let us look at the data. It is now stored in a pandas DataFrame.
data.head()
 | station | network | channel | location | rate | start_time | end_time | latitude | longitude | elevation | depth | dip | azimuth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAR | NC | EHZ | -- | 100.0 | 1984/01/01,00:00:00 | 1987/05/01,00:00:00 | 39.27594 | -121.02696 | 911.0 | 0.0 | -90.0 | 0.0 |
1 | AAR | NC | EHZ | -- | 100.0 | 1987/05/01,00:00:00 | 2006/01/04,19:19:00 | 39.27594 | -121.02696 | 911.0 | 0.0 | -90.0 | 0.0 |
2 | AAR | NC | SHZ | -- | 20.0 | 1994/11/28,00:00:00 | 2006/01/04,19:19:00 | 39.27594 | -121.02696 | 911.0 | 0.0 | -90.0 | 0.0 |
3 | AAS | NC | EHZ | -- | 100.0 | 1984/11/27,18:45:00 | 1987/05/01,00:00:00 | 38.43014 | -121.10959 | 31.0 | 0.0 | -90.0 | 0.0 |
4 | AAS | NC | EHZ | -- | 100.0 | 1987/05/01,00:00:00 | 2021/08/13,16:50:00 | 38.43014 | -121.10959 | 31.0 | 0.0 | -90.0 | 0.0 |
We can output the first row of the DataFrame:
data.iloc[0]
station AAR
network NC
channel EHZ
location --
rate 100.0
start_time 1984/01/01,00:00:00
end_time 1987/05/01,00:00:00
latitude 39.27594
longitude -121.02696
elevation 911.0
depth 0.0
dip -90.0
azimuth 0.0
Name: 0, dtype: object
We can also select the first column:
data.iloc[:, 0]
0 AAR
1 AAR
2 AAR
3 AAS
4 AAS
...
7151 WMP
7152 WMP
7153 WMP
7154 WMP
7155 WWVB
Name: station, Length: 7156, dtype: object
The DataFrame may contain bad values. Typical data cleaning involves removing NaNs and zeros, for instance. Then use Plotly to map the stations.
data.dropna(inplace=True)
data=data[data.longitude!=0]
fig = px.scatter_geo(data,
lat='latitude',lon='longitude',
range_color=(6,9),
height=600, width=600,
hover_name="station",
hover_data=['network','station','channel','rate']);
fig.update_geos(resolution=110, showcountries=True)
fig = px.scatter_mapbox(data,
lat='latitude',lon='longitude',
range_color=(6,9),mapbox_style="carto-positron",
height=600, width=500,
hover_name="station",
hover_data=['network','station','channel','rate']);
fig.update_layout(title="Northern California Seismic Network")
fig.show()
3 Manipulating Pandas
We can filter the data by the value taken in a given column:
data.loc[data.station=='KCPB']
 | station | network | channel | location | rate | start_time | end_time | latitude | longitude | elevation | depth | dip | azimuth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3407 | KCPB | NC | BHE | -- | 50.0 | 1999/08/03,00:00:00 | 2000/06/06,16:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3408 | KCPB | NC | BHE | -- | 50.0 | 2000/06/06,16:00:00 | 2002/01/24,23:50:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3409 | KCPB | NC | BHE | -- | 50.0 | 2002/01/24,23:50:00 | 2002/10/16,23:59:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3410 | KCPB | NC | BHE | -- | 20.0 | 2002/10/17,00:00:00 | 2006/01/24,18:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3411 | KCPB | NC | BHN | -- | 50.0 | 1999/08/03,00:00:00 | 2000/06/06,16:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3480 | KCPB | NC | MNE | -- | 10.0 | 2000/06/06,16:00:00 | 2000/07/12,00:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3481 | KCPB | NC | MNN | -- | 10.0 | 1999/08/03,00:00:00 | 2000/06/06,16:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 0.0 |
3482 | KCPB | NC | MNN | -- | 10.0 | 2000/06/06,16:00:00 | 2000/07/12,00:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 0.0 |
3483 | KCPB | NC | MNZ | -- | 10.0 | 1999/08/03,00:00:00 | 2000/06/06,16:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
3484 | KCPB | NC | MNZ | -- | 10.0 | 2000/06/06,16:00:00 | 2000/07/12,00:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
78 rows × 13 columns
# Select two stations, use the typical "OR" |
data.loc[(data.station=='KCPB') | (data.station=='KHBB')]
 | station | network | channel | location | rate | start_time | end_time | latitude | longitude | elevation | depth | dip | azimuth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3407 | KCPB | NC | BHE | -- | 50.0 | 1999/08/03,00:00:00 | 2000/06/06,16:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3408 | KCPB | NC | BHE | -- | 50.0 | 2000/06/06,16:00:00 | 2002/01/24,23:50:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3409 | KCPB | NC | BHE | -- | 50.0 | 2002/01/24,23:50:00 | 2002/10/16,23:59:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3410 | KCPB | NC | BHE | -- | 20.0 | 2002/10/17,00:00:00 | 2006/01/24,18:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3411 | KCPB | NC | BHN | -- | 50.0 | 1999/08/03,00:00:00 | 2000/06/06,16:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3666 | KHBB | NC | LHZ | -- | 1.0 | 2016/04/28,16:56:00 | 2022/08/09,18:00:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | -90.0 | 0.0 |
3667 | KHBB | NC | LHZ | -- | 1.0 | 2022/08/09,18:00:00 | 3000/01/01,00:00:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | -90.0 | 0.0 |
3668 | KHBB | NC | LNE | -- | 1.0 | 2015/10/29,21:18:00 | 2016/04/28,16:56:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | 0.0 | 90.0 |
3669 | KHBB | NC | LNN | -- | 1.0 | 2015/10/29,21:18:00 | 2016/04/28,16:56:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | 0.0 | 0.0 |
3670 | KHBB | NC | LNZ | -- | 1.0 | 2015/10/29,21:18:00 | 2016/04/28,16:56:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | -90.0 | 0.0 |
123 rows × 13 columns
# Select two stations, use the typical "AND" &
data.loc[(data.station=='KCPB') & (data.channel=='HNZ')]
 | station | network | channel | location | rate | start_time | end_time | latitude | longitude | elevation | depth | dip | azimuth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3457 | KCPB | NC | HNZ | -- | 100.0 | 2002/10/17,00:00:00 | 2006/10/18,00:08:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
3458 | KCPB | NC | HNZ | -- | 100.0 | 2006/10/18,00:08:00 | 2010/11/01,22:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
3459 | KCPB | NC | HNZ | -- | 100.0 | 2010/11/01,22:00:00 | 2011/07/13,00:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
3460 | KCPB | NC | HNZ | -- | 100.0 | 2011/07/13,00:00:00 | 2011/09/07,19:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
3461 | KCPB | NC | HNZ | -- | 100.0 | 2011/09/07,19:00:00 | 2015/10/29,18:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
3462 | KCPB | NC | HNZ | -- | 100.0 | 2015/10/29,18:00:00 | 2016/04/28,16:39:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
3463 | KCPB | NC | HNZ | -- | 100.0 | 2016/04/28,16:39:00 | 3000/01/01,00:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | -90.0 | 0.0 |
# or like this
data.loc[data.station.isin(['KCPB', 'KHBB'])]
 | station | network | channel | location | rate | start_time | end_time | latitude | longitude | elevation | depth | dip | azimuth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3407 | KCPB | NC | BHE | -- | 50.0 | 1999/08/03,00:00:00 | 2000/06/06,16:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3408 | KCPB | NC | BHE | -- | 50.0 | 2000/06/06,16:00:00 | 2002/01/24,23:50:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3409 | KCPB | NC | BHE | -- | 50.0 | 2002/01/24,23:50:00 | 2002/10/16,23:59:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3410 | KCPB | NC | BHE | -- | 20.0 | 2002/10/17,00:00:00 | 2006/01/24,18:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 90.0 |
3411 | KCPB | NC | BHN | -- | 50.0 | 1999/08/03,00:00:00 | 2000/06/06,16:00:00 | 39.68631 | -123.58242 | 1261.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3666 | KHBB | NC | LHZ | -- | 1.0 | 2016/04/28,16:56:00 | 2022/08/09,18:00:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | -90.0 | 0.0 |
3667 | KHBB | NC | LHZ | -- | 1.0 | 2022/08/09,18:00:00 | 3000/01/01,00:00:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | -90.0 | 0.0 |
3668 | KHBB | NC | LNE | -- | 1.0 | 2015/10/29,21:18:00 | 2016/04/28,16:56:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | 0.0 | 90.0 |
3669 | KHBB | NC | LNN | -- | 1.0 | 2015/10/29,21:18:00 | 2016/04/28,16:56:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | 0.0 | 0.0 |
3670 | KHBB | NC | LNZ | -- | 1.0 | 2015/10/29,21:18:00 | 2016/04/28,16:56:00 | 40.65990 | -123.21966 | 1864.0 | 0.0 | -90.0 | 0.0 |
123 rows × 13 columns
We can access a brief summary of the data:
data.station.describe()
count 7136
unique 905
top KCPB
freq 78
Name: station, dtype: object
data.elevation.describe()
count 7136.000000
mean 665.036617
std 670.177475
min -1388.000000
25% 164.000000
50% 446.000000
75% 925.000000
max 3680.000000
Name: elevation, dtype: float64
We can perform standard operations on the whole data set:
data.mean(numeric_only=True) # restrict the mean to numeric columns
rate 87.692383
latitude 38.053684
longitude -121.828840
elevation 665.036617
depth 15.064938
dip -41.758688
azimuth 17.735987
dtype: float64
In the case of a categorical variable, we can get the list of possible values that this variable can take:
data.channel.unique()
array(['EHZ', 'SHZ', 'HHE', 'HHN', 'HHZ', 'HNE', 'HNN', 'HNZ', 'LHE',
'LHN', 'LHZ', 'ACE', 'GAN', 'GEL', 'GLA', 'GLO', 'GNS', 'GPL',
'LCE', 'LCQ', 'VCO', 'VDT', 'VEC', 'VEI', 'VPB', 'ELE', 'ELN',
'ELZ', 'SLE', 'SLN', 'SLZ', 'LCL', 'LOG', 'OCF', 'VEA', 'VEP',
'VFP', 'VKI', 'GST', 'EHE', 'EHN', 'BNE', 'BNN', 'BN1', 'BN2',
'BN3', 'BV1', 'EP1', 'EP2', 'EP3', 'HDO', 'HN1', 'HN2', 'HN3',
'HV1', 'SP2', 'SP3', 'BNZ', 'LDO', 'HJ2', 'HJ3', 'HJZ', 'SHE',
'SHN', 'BHE', 'BHN', 'BHZ', 'LNE', 'LNN', 'LNZ', 'MHE', 'MHN',
'MHZ', 'MNE', 'MNN', 'MNZ', 'HH2', 'HH3', 'LH2', 'LH3', 'SP1',
'DP1', 'DP2', 'DP3', 'HLE', 'HLN', 'HLZ', 'XNE', 'XNN', 'XNZ',
'EH1', 'VM1', 'VM2', 'VM3'], dtype=object)
and get the number of times that each value is taken:
data.station.value_counts()
KCPB 78
JSGB 73
KMPB 72
KHMB 63
CCH1 60
..
LAMB 1
LBA 1
PBY 1
MCW 1
MMW 1
Name: station, Length: 905, dtype: int64
The second option, after map seen earlier, is to use the apply function:
data_elevation_mean = data.elevation.unique().mean() # mean over the unique elevation values

def remean_elevation(row):
    # subtract the mean elevation from each row
    row.elevation = row.elevation - data_elevation_mean
    return row
data.apply(remean_elevation, axis='columns')
 | station | network | channel | location | rate | start_time | end_time | latitude | longitude | elevation | depth | dip | azimuth |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAR | NC | EHZ | -- | 100.0 | 1984/01/01,00:00:00 | 1987/05/01,00:00:00 | 39.27594 | -121.02696 | 154.125816 | 0.0 | -90.0 | 0.0 |
1 | AAR | NC | EHZ | -- | 100.0 | 1987/05/01,00:00:00 | 2006/01/04,19:19:00 | 39.27594 | -121.02696 | 154.125816 | 0.0 | -90.0 | 0.0 |
2 | AAR | NC | SHZ | -- | 20.0 | 1994/11/28,00:00:00 | 2006/01/04,19:19:00 | 39.27594 | -121.02696 | 154.125816 | 0.0 | -90.0 | 0.0 |
3 | AAS | NC | EHZ | -- | 100.0 | 1984/11/27,18:45:00 | 1987/05/01,00:00:00 | 38.43014 | -121.10959 | -725.874184 | 0.0 | -90.0 | 0.0 |
4 | AAS | NC | EHZ | -- | 100.0 | 1987/05/01,00:00:00 | 2021/08/13,16:50:00 | 38.43014 | -121.10959 | -725.874184 | 0.0 | -90.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7150 | WMP | NC | SHE | -- | 20.0 | 1995/07/02,12:00:00 | 2002/05/08,22:30:00 | 35.64059 | -118.78570 | 321.125816 | 0.0 | 0.0 | 90.0 |
7151 | WMP | NC | SHN | -- | 20.0 | 1995/07/02,12:00:00 | 2002/05/08,22:30:00 | 35.64059 | -118.78570 | 321.125816 | 0.0 | 0.0 | 0.0 |
7152 | WMP | NC | SHZ | -- | 20.0 | 1995/03/02,19:00:00 | 1995/07/02,12:00:00 | 35.64059 | -118.78570 | 321.125816 | 0.0 | -90.0 | 0.0 |
7153 | WMP | NC | SHZ | -- | 20.0 | 1995/07/02,12:00:00 | 2002/05/08,22:30:00 | 35.64059 | -118.78570 | 321.125816 | 0.0 | -90.0 | 0.0 |
7154 | WMP | NC | SHZ | 10 | 20.0 | 1995/07/02,12:00:00 | 1999/05/11,23:59:00 | 35.64059 | -118.78570 | 321.125816 | 0.0 | -90.0 | 0.0 |
7136 rows × 13 columns
We can also carry out simple operations on columns, provided they make sense.
data.network + ' - ' + data.station
0 NC - AAR
1 NC - AAR
2 NC - AAR
3 NC - AAS
4 NC - AAS
...
7150 NC - WMP
7151 NC - WMP
7152 NC - WMP
7153 NC - WMP
7154 NC - WMP
Length: 7136, dtype: object
A useful feature is to group the rows depending on the value of a categorical variable, and then apply the same operation to all the groups. For instance, I want to know how many times each station appears in the file:
data.groupby('station').station.count()
station
AAR 3
AAS 3
ABJ 4
ABR 3
ADW 3
..
VCL 2
VRC 4
VSP 2
VWB 1
WMP 10
Name: station, Length: 905, dtype: int64
We can also access the data type of each column:
data.dtypes
station object
network object
channel object
location object
rate float64
start_time object
end_time object
latitude float64
longitude float64
elevation float64
depth float64
dip float64
azimuth float64
dtype: object
Here, pandas does not recognize the start_time and end_time columns as a datetime format, so we cannot use datetime operations on them. We first need to convert these columns into a datetime format:
type(data['start_time'][0])
str
# Transform column from string into datetime format
startdate = pd.to_datetime(data['start_time'], format='%Y/%m/%d,%H:%M:%S')
data['start_time'] = startdate
print(data['start_time'] )
type(data['start_time'][0])
0 1984-01-01 00:00:00
1 1987-05-01 00:00:00
2 1994-11-28 00:00:00
3 1984-11-27 18:45:00
4 1987-05-01 00:00:00
...
7150 1995-07-02 12:00:00
7151 1995-07-02 12:00:00
7152 1995-03-02 19:00:00
7153 1995-07-02 12:00:00
7154 1995-07-02 12:00:00
Name: start_time, Length: 7136, dtype: datetime64[ns]
pandas._libs.tslibs.timestamps.Timestamp
print(data['end_time'])
0 1987/05/01,00:00:00
1 2006/01/04,19:19:00
2 2006/01/04,19:19:00
3 1987/05/01,00:00:00
4 2021/08/13,16:50:00
...
7150 2002/05/08,22:30:00
7151 2002/05/08,22:30:00
7152 1995/07/02,12:00:00
7153 2002/05/08,22:30:00
7154 1999/05/11,23:59:00
Name: end_time, Length: 7136, dtype: object
# do the same for end times
# Avoid 'OutOfBoundsDatetime' error with year 3000
enddate = data['end_time'].str.replace('3000', '2025')
enddate = pd.to_datetime(enddate, format='%Y/%m/%d,%H:%M:%S')
data['end_time'] = enddate
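With both columns in datetime format, datetime arithmetic works directly. For example, a quick sketch of how long each channel operated:
lifetime = data['end_time'] - data['start_time'] # a Series of Timedelta objects
print(lifetime.head())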
We can now look at when each seismic station was installed:
data.groupby('station').apply(lambda df: df.start_time.min())
station
AAR 1984-01-01 00:00:00
AAS 1984-11-27 18:45:00
ABJ 1984-01-01 00:00:00
ABR 1984-01-01 00:00:00
ADW 1984-01-01 00:00:00
...
VCL 1984-01-01 00:00:00
VRC 1993-09-23 22:20:00
VSP 1993-09-24 22:05:00
VWB 1985-01-01 00:00:00
WMP 1995-03-02 19:00:00
Length: 905, dtype: datetime64[ns]
The agg function allows us to carry out several operations on each group of rows:
data.groupby(['station']).elevation.agg(['min', 'max'])
station | min | max
---|---|---
AAR | 911.0 | 911.0 |
AAS | 31.0 | 31.0 |
ABJ | 434.0 | 434.0 |
ABR | -1.0 | -1.0 |
ADW | 228.0 | 228.0 |
... | ... | ... |
VCL | 2048.0 | 2048.0 |
VRC | 1666.0 | 1666.0 |
VSP | 1545.0 | 1545.0 |
VWB | 1736.0 | 1736.0 |
WMP | 1078.0 | 1078.0 |
905 rows × 2 columns
For each station, select when it was first deployed and last recovered:
data.groupby(['station']).agg({'start_time':lambda x: min(x), 'end_time':lambda x: max(x)})
station | start_time | end_time
---|---|---
AAR | 1984-01-01 00:00:00 | 2006-01-04 19:19:00 |
AAS | 1984-11-27 18:45:00 | 2021-08-13 16:50:00 |
ABJ | 1984-01-01 00:00:00 | 2019-06-26 19:17:00 |
ABR | 1984-01-01 00:00:00 | 1997-08-04 21:02:00 |
ADW | 1984-01-01 00:00:00 | 2006-04-20 01:08:00 |
... | ... | ... |
VCL | 1984-01-01 00:00:00 | 1985-03-25 21:30:00 |
VRC | 1993-09-23 22:20:00 | 2001-07-12 16:50:00 |
VSP | 1993-09-24 22:05:00 | 2001-07-12 16:50:00 |
VWB | 1985-01-01 00:00:00 | 1985-03-25 19:00:00 |
WMP | 1995-03-02 19:00:00 | 2002-05-08 22:30:00 |
905 rows × 2 columns
We can also make groups by selecting the values of two categorical variables:
data.groupby(['station', 'channel']).agg({'start_time':lambda x: min(x), 'end_time':lambda x: max(x)})
station | channel | start_time | end_time
---|---|---|---
AAR | EHZ | 1984-01-01 00:00:00 | 2006-01-04 19:19:00 |
SHZ | 1994-11-28 00:00:00 | 2006-01-04 19:19:00 | |
AAS | EHZ | 1984-11-27 18:45:00 | 2021-08-13 16:50:00 |
SHZ | 1994-11-28 00:00:00 | 2006-01-24 18:00:00 | |
ABJ | EHZ | 1984-01-01 00:00:00 | 2019-06-26 19:17:00 |
... | ... | ... | ... |
WMP | EHN | 1995-07-02 12:00:00 | 2002-05-08 22:30:00 |
EHZ | 1995-03-02 19:00:00 | 2002-05-08 22:30:00 | |
SHE | 1995-07-02 12:00:00 | 2002-05-08 22:30:00 | |
SHN | 1995-07-02 12:00:00 | 2002-05-08 22:30:00 | |
SHZ | 1995-03-02 19:00:00 | 2002-05-08 22:30:00 |
4289 rows × 2 columns
Previously, we just printed the output, but we can also store it in a new variable:
data_grouped = data.groupby(['station', 'channel']).agg({'start_time':lambda x: min(x), 'end_time':lambda x: max(x)})
data_grouped.head()
station | channel | start_time | end_time
---|---|---|---
AAR | EHZ | 1984-01-01 00:00:00 | 2006-01-04 19:19:00 |
SHZ | 1994-11-28 00:00:00 | 2006-01-04 19:19:00 | |
AAS | EHZ | 1984-11-27 18:45:00 | 2021-08-13 16:50:00 |
SHZ | 1994-11-28 00:00:00 | 2006-01-24 18:00:00 | |
ABJ | EHZ | 1984-01-01 00:00:00 | 2019-06-26 19:17:00 |
When we select only some rows, the index is not automatically reset to start at 0. We can do this manually. Many pandas functions also have an option to reset the index, and an option to transform the DataFrame in place instead of saving the result in another variable.
data_grouped.reset_index()
 | station | channel | start_time | end_time |
---|---|---|---|---|
0 | AAR | EHZ | 1984-01-01 00:00:00 | 2006-01-04 19:19:00 |
1 | AAR | SHZ | 1994-11-28 00:00:00 | 2006-01-04 19:19:00 |
2 | AAS | EHZ | 1984-11-27 18:45:00 | 2021-08-13 16:50:00 |
3 | AAS | SHZ | 1994-11-28 00:00:00 | 2006-01-24 18:00:00 |
4 | ABJ | EHZ | 1984-01-01 00:00:00 | 2019-06-26 19:17:00 |
... | ... | ... | ... | ... |
4284 | WMP | EHN | 1995-07-02 12:00:00 | 2002-05-08 22:30:00 |
4285 | WMP | EHZ | 1995-03-02 19:00:00 | 2002-05-08 22:30:00 |
4286 | WMP | SHE | 1995-07-02 12:00:00 | 2002-05-08 22:30:00 |
4287 | WMP | SHN | 1995-07-02 12:00:00 | 2002-05-08 22:30:00 |
4288 | WMP | SHZ | 1995-03-02 19:00:00 | 2002-05-08 22:30:00 |
4289 rows × 4 columns
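As a minimal sketch of the in-place variant (working on a copy so the grouped table above is preserved):
tmp = data_grouped.copy()
tmp.reset_index(inplace=True) # modifies tmp directly and returns None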
It is also possible to sort the dataset by value:
data_grouped.sort_values(by='start_time')
station | channel | start_time | end_time
---|---|---|---
PG1 | HV1 | 1983-06-27 | 1993-11-10 18:00:00 |
AAR | EHZ | 1984-01-01 | 2006-01-04 19:19:00 |
LWH | EHZ | 1984-01-01 | 2020-03-17 20:50:00 |
LWD | EHZ | 1984-01-01 | 1986-09-25 21:30:00 |
LTN | EHZ | 1984-01-01 | 1988-07-26 23:50:00 |
... | ... | ... | ... |
BTI | VCO | 2022-09-28 | 2025-01-01 00:00:00 |
VDT | 2022-09-28 | 2025-01-01 00:00:00 | |
VEC | 2022-09-28 | 2025-01-01 00:00:00 | |
GEL | 2022-09-28 | 2025-01-01 00:00:00 | |
VPB | 2022-09-28 | 2025-01-01 00:00:00 |
4289 rows × 2 columns
We can apply the sorting to several columns:
data_grouped.sort_values(by=['start_time', 'end_time'])
station | channel | start_time | end_time
---|---|---|---
PG1 | HV1 | 1983-06-27 | 1993-11-10 18:00:00 |
PGC | EHZ | 1984-01-01 | 1984-04-12 18:00:00 |
AFO | EHZ | 1984-01-01 | 1984-04-23 00:00:00 |
MCH | EHZ | 1984-01-01 | 1984-08-01 19:31:00 |
MCN | EHZ | 1984-01-01 | 1984-09-22 16:45:00 |
... | ... | ... | ... |
BTI | VCO | 2022-09-28 | 2025-01-01 00:00:00 |
VDT | 2022-09-28 | 2025-01-01 00:00:00 | |
VEC | 2022-09-28 | 2025-01-01 00:00:00 | |
VEI | 2022-09-28 | 2025-01-01 00:00:00 | |
VPB | 2022-09-28 | 2025-01-01 00:00:00 |
4289 rows × 2 columns
4 CSV vs Parquet
Parquet is a column-oriented data format that stores and compresses data column by column. It is fast for I/O and compact on disk.
Save data into a CSV file:
%timeit data.to_csv("my_metadata.csv")
!ls -lh my_metadata.csv
66.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Try saving in Parquet and compare time and file size.
%timeit data.to_parquet("my_metadata.pq")
!ls -lh my_metadata.pq
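Since ls is not available on all platforms, here is a portable sketch to compare the file sizes and to read the Parquet file back (reading Parquet requires pyarrow or fastparquet):
import os
for f in ('my_metadata.csv', 'my_metadata.pq'):
    print(f, round(os.path.getsize(f)/1024, 1), 'KB')
data_back = pd.read_parquet('my_metadata.pq')
print(data_back.shape)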