2.3 Pandas#

This section aims to provide new skills in Python for handling structured, tabular data.

Learning outcomes:

  • Manipulation of data frames (describing, filtering, …)

  • Learn about Lambda functions

  • Intro to datetime objects

  • Plotting data from data frames (histograms and maps)

  • Introduction to Plotly

  • Introduction to CSV & Parquet

This tutorial can be offered as a 2-hour course. The material is organized by difficulty into Levels 1, 2, and 3, and instructors may choose to leave the higher levels for asynchronous, self-guided learning.

We will work with several structured data sets: sensor metadata and a seismic data product (an earthquake catalog).

First, we import all the modules we need:

import numpy as np
import pandas as pd
import io
import requests
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

This notebook uses Plotly to make interactive plots.

!pip install plotly
import plotly.express as px
import plotly.io as pio
# pio.renderers.default = 'vscode' # render figures inside VS Code notebooks
# pio.renderers.default = 'iframe' # write figures as standalone html files shown in iframes
# pio.renderers.default = 'png'    # render figures as static png images
# other renderers to try: notebook, jupyterlab, png, vscode, iframe

1 Pandas Fundamentals#

1.1 Basics#

Pandas is built around two objects: Series and DataFrame. A Series is a single column of values with an associated key (name). A DataFrame is a two-dimensional table made up of Series.

We can create a DataFrame composed of Series from scratch using a Python dictionary:

data = {
    'temperature' : [36,37,30,50],
    'precipitation':[3,1,0,0]
}
my_pd = pd.DataFrame(data)
print(my_pd)
   temperature  precipitation
0           36              3
1           37              1
2           30              0
3           50              0

Each (key, value) item of the dictionary becomes a Series (column) of the DataFrame. To get the keys of the DataFrame, type:

my_pd.keys()
Index(['temperature', 'precipitation'], dtype='object')

Get a specific Series (note that this is a pandas Series, not a NumPy array):

print(my_pd.temperature[:])
print(type(my_pd.temperature[:]))
0    36
1    37
2    30
3    50
Name: temperature, dtype: int64
<class 'pandas.core.series.Series'>

To get the value of a specific key (e.g., temperature) at a specific index (e.g., 2), type:

print(my_pd.temperature[2])
print(type(my_pd.temperature[2]))
30
<class 'numpy.int64'>
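
Pandas also provides label-based and position-based indexing through .loc and .iloc. A minimal sketch on the small my_pd DataFrame defined above:

# label-based: row label 2, column 'temperature'
print(my_pd.loc[2, 'temperature'])
# position-based: third row, first column
print(my_pd.iloc[2, 0])
# a whole row returned as a Series
print(my_pd.iloc[0])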

1.2 Reading a DataFrame from a CSV file#

We can read a pandas DataFrame directly from a standard file. We will download a catalog of earthquakes.

fname = ""
gist_dir = ""
!wget "https://raw.githubusercontent.com/UW-MLGEO/MLGeo-dataset/refs/heads/main/data/Global_Quakes_IRIS.csv"
# !wget "https://github.com/UW-MLGEO/MLGeo-dataset/blob/main/data/Global_Quakes_IRIS.csv"
ff = "Global_Quakes_IRIS.csv"
quake = pd.read_csv(ff)
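
Note that pd.read_csv can also read directly from the URL, so the wget step is optional when the notebook has network access:

# read the catalog straight from the remote file (assumes internet access)
quake = pd.read_csv("https://raw.githubusercontent.com/UW-MLGEO/MLGeo-dataset/refs/heads/main/data/Global_Quakes_IRIS.csv")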

Now use the head function to display the first rows of the file:

# enter answer here
quake.head()
time latitude longitude depth magnitude description
0 2010-07-02 06:04:03.570 -13.6098 166.6541 34400.0 6.3 VANUATU ISLANDS
1 2010-07-04 21:55:52.370 39.6611 142.5792 30100.0 6.3 NEAR EAST COAST OF HONSHU
2 2010-07-10 11:43:33.000 11.1664 146.0823 16900.0 6.3 SOUTH OF MARIANA ISLANDS
3 2010-07-12 00:11:20.060 -22.2789 -68.3159 109400.0 6.2 NORTHERN CHILE
4 2010-07-14 08:32:21.850 -38.0635 -73.4649 25700.0 6.6 NEAR COAST OF CENTRAL CHILE

Display the depth column using the two ways of accessing a Series in the DataFrame:

print(quake.depth)
print(quake['depth'])
0        34400.0
1        30100.0
2        16900.0
3       109400.0
4        25700.0
          ...   
1780     10000.0
1781    105000.0
1782    608510.0
1783     10000.0
1784     35440.0
Name: depth, Length: 1785, dtype: float64
0        34400.0
1        30100.0
2        16900.0
3       109400.0
4        25700.0
          ...   
1780     10000.0
1781    105000.0
1782    608510.0
1783     10000.0
1784     35440.0
Name: depth, Length: 1785, dtype: float64

Calculate basic statistics of the data using the function describe.

quake.describe()
latitude longitude depth magnitude
count 1785.000000 1785.000000 1785.000000 1785.000000
mean 0.840700 40.608674 82773.187675 6.382403
std 30.579308 125.558363 146988.302031 0.429012
min -69.782500 -179.957200 0.000000 6.000000
25% -19.905100 -73.832200 12000.000000 6.100000
50% -4.478900 113.077800 24400.000000 6.300000
75% 27.161700 145.305700 58000.000000 6.600000
max 74.391800 179.998100 685500.000000 9.100000

Calculate the mean and median of a specific Series, for example depth.

# answer it here
print(quake.depth.mean())
print(quake.depth.median())
82773.18767507003
24400.0

1.3 Manipulating Pandas with Python#

Classic functions#

We will now practice modifying the content of the DataFrame using functions. We take the example of converting the depth values from meters to kilometers. First, we define this operation as a function:

# this function converts a value in meters to a value in kilometers
m2km = 1000 # this is defined as a global variable
def meters2kilometers(x):
    return x/m2km
# now test it using the first element of the quake DataFrame
meters2kilometers(quake.depth[0])
34.4
Let's define another function that uses a local variable instead of a global one:

# this function converts a value in meters to a value in kilometers
def meters2kilometers2(x):
    m2km = 1000 # this is defined as a local variable
    return x/m2km

# Apply the meters2kilometers2 function to the 'depth' column and add it as a new column 'depth_km' to the quake DataFrame
quake['depth_km'] = quake['depth'].apply(meters2kilometers2)

# Display the first few rows to verify the new column
quake.head()
time latitude longitude depth magnitude description depth_km
0 2010-07-02 06:04:03.570 -13.6098 166.6541 34400.0 6.3 VANUATU ISLANDS 34.4
1 2010-07-04 21:55:52.370 39.6611 142.5792 30100.0 6.3 NEAR EAST COAST OF HONSHU 30.1
2 2010-07-10 11:43:33.000 11.1664 146.0823 16900.0 6.3 SOUTH OF MARIANA ISLANDS 16.9
3 2010-07-12 00:11:20.060 -22.2789 -68.3159 109400.0 6.2 NORTHERN CHILE 109.4
4 2010-07-14 08:32:21.850 -38.0635 -73.4649 25700.0 6.6 NEAR COAST OF CENTRAL CHILE 25.7

Lambda functions#

We now discuss the lambda functions.

  • Lambda functions are small, anonymous functions in Python.

  • They are used for quick, simple operations without needing to define a full function using def.

  • In geoscience, where we deal with large datasets (e.g., climate data, seismic measurements), lambda functions allow us to process data efficiently.

  • Lambda functions in Pandas allow you to quickly transform, filter, or process this data with minimal code.

  • Please read additional tutorials from RealPython.

lambda x: x * 1.2
<function __main__.<lambda>(x)>

This lambda function multiplies any input x by 1.2, which could be useful for tasks like converting units (e.g., from km to m).

Below is an example of how to use a lambda function in a pandas DataFrame. Example 1: the lambda function lambda x: (x - 32) * 5 / 9 converts Fahrenheit to Celsius for each value in the Temperature_F column.

import pandas as pd

# Sample DataFrame
data = {'Temperature_F': [32, 50, 77, 100]}
df = pd.DataFrame(data)

# Convert Fahrenheit to Celsius
df['Temperature_C'] = df['Temperature_F'].apply(lambda x: (x - 32) * 5 / 9)

print(df)
   Temperature_F  Temperature_C
0             32       0.000000
1             50      10.000000
2             77      25.000000
3            100      37.777778
# The equivalent of meters2kilometers written as a lambda function is:
lambda_meters2kilometers = lambda x:x/1000
# x is the variable

Now apply the function to the entire Series.

Student response section

This section is left for the student to complete.

# apply it to the entire series
lambda_meters2kilometers(quake.depth)
0        34.40
1        30.10
2        16.90
3       109.40
4        25.70
         ...  
1780     10.00
1781    105.00
1782    608.51
1783     10.00
1784     35.44
Name: depth, Length: 1785, dtype: float64

Lambda functions can take several inputs:

# you can add several variables into lambda functions
remove_anything = lambda x,y:x-y
remove_anything(3,2)
1

This did not affect the values in the DataFrame; check it:

quake.depth
0        34400.0
1        30100.0
2        16900.0
3       109400.0
4        25700.0
          ...   
1780     10000.0
1781    105000.0
1782    608510.0
1783     10000.0
1784     35440.0
Name: depth, Length: 1785, dtype: float64

Instead, you could overwrite the column with quake.depth = X. Try two approaches, but only run one of them! You can use map (a generic Python way to apply a function element by element) and apply (a pandas-specific method to apply a function to a Series or DataFrame).
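
Before trying it on the earthquake catalog, here is a minimal sketch of both methods on a toy Series (so the answer for quake is not given away):

s = pd.Series([1000.0, 2500.0, 4000.0])   # toy depths in meters
print(s.map(lambda x: x / 1000))          # element-wise with map
print(s.apply(lambda x: x / 1000))        # element-wise with apply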

Student response section

This section is left for the student to complete.

Try map to apply a lambda function that rescales depth from meters to kilometers.

# implement answer here

Discuss in class: What happened to the depth field?

What happened to the original data frame?

Try apply to apply a lambda function that rescales depth from meters to kilometers.

# or like this
quake.depth=quake.depth.apply(lambda x:x/1000) 

Plot a histogram of the depth distribution using the matplotlib function hist.

Student response section

This section is left for the student to complete.

# implement answer here
[Figure: histogram of the earthquake depth distribution]

You can use the interactive plotting package Plotly. First, we will show a histogram of the event depths using the function histogram.

fig = px.histogram(quake,   #specify what dataframe to use
             x="depth",  #specify the variable for the histogram 
             nbins=50,       #number of bins for the histogram 
             height=400,     #dimensions of the figure
             width=600);
fig.show()

Example 2: Conditional Logic for Earthquake Magnitude#

You can use lambda functions to add conditional logic, such as classifying earthquake magnitudes into categories.

Scenario: You have a column magnitude containing seismic event magnitudes, and you want to classify events as “Minor”, “Moderate”, or “Severe”.

# Classify earthquake magnitudes
quake['Category'] = quake['magnitude'].apply(lambda x: 'Minor' if x < 4.0 else ('Moderate' if x < 6.5 else 'Severe'))
quake.head()
time latitude longitude depth magnitude description depth_km Category
0 2010-07-02 06:04:03.570 -13.6098 166.6541 0.0344 6.3 VANUATU ISLANDS 34.4 Moderate
1 2010-07-04 21:55:52.370 39.6611 142.5792 0.0301 6.3 NEAR EAST COAST OF HONSHU 30.1 Moderate
2 2010-07-10 11:43:33.000 11.1664 146.0823 0.0169 6.3 SOUTH OF MARIANA ISLANDS 16.9 Moderate
3 2010-07-12 00:11:20.060 -22.2789 -68.3159 0.1094 6.2 NORTHERN CHILE 109.4 Moderate
4 2010-07-14 08:32:21.850 -38.0635 -73.4649 0.0257 6.6 NEAR COAST OF CENTRAL CHILE 25.7 Severe

We will now plot the locations of the earthquakes on a map using Plotly.

The marker size will be scaled with the earthquake magnitude. To do so, we add a marker_size Series to the DataFrame:

quake['marker_size'] = quake['magnitude'].apply(lambda x: np.fix(np.exp(x))) # marker size scales as exp(magnitude)
quake['magnitude_bin'] = quake['magnitude'].apply(lambda x: 0.5*np.fix(2*x)) # round magnitude down to a multiple of 0.5
# another way to do it, with vectorized NumPy operations
quake['marker_size'] = np.fix(np.exp(quake['magnitude'])) # marker size scales as exp(magnitude)
quake['magnitude_bin'] = 0.5*np.fix(2*quake['magnitude']) # round magnitude down to a multiple of 0.5

1.4 Intermediate Manipulation with Pandas#

You can also apply lambda functions to work with multiple columns, which is common in geoscientific datasets, where you might have spatial or temporal data.

Example 3: Calculating an Index Based on Multiple Measurements. Scenario: You have rainfall (Rainfall_mm) and evaporation (Evaporation_mm) data, and you want to calculate the Net Water Balance for each record.

data = {'Rainfall_mm': [100, 80, 120], 'Evaporation_mm': [60, 70, 65]}
df = pd.DataFrame(data)

# Calculate Net Water Balance
df['Net_Water_Balance'] = df.apply(lambda row: row['Rainfall_mm'] - row['Evaporation_mm'], axis=1)

print(df)
   Rainfall_mm  Evaporation_mm  Net_Water_Balance
0          100              60                 40
1           80              70                 10
2          120              65                 55

In this example, lambda row: row['Rainfall_mm'] - row['Evaporation_mm'] calculates the net water balance by subtracting evaporation from rainfall for each record.
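
As a side note, for a simple column difference like this, a vectorized expression gives the same result and is typically faster than apply with axis=1:

# vectorized column arithmetic, no row-wise lambda needed
df['Net_Water_Balance'] = df['Rainfall_mm'] - df['Evaporation_mm']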

1.5 Advanced: Time Series Data Manipulation with Lambda Functions#

Geoscientists frequently work with time series data (e.g., climate data). Lambda functions can be used for efficient data transformations within time series.

Example 4: Applying a Rolling Window Calculation. Scenario: Suppose you have daily temperature data, and you want to calculate a rolling average (here, a 3-day window over 10 days of data).

# Sample daily temperature data
data = {'Date': pd.date_range(start='2023-09-01', periods=10, freq='D'),
        'Temperature_C': [20, 22, 23, 21, 19, 24, 25, 26, 22, 20]}
df = pd.DataFrame(data)

# Calculate 3-day rolling average using lambda
df['3_day_avg'] = df['Temperature_C'].rolling(window=3).apply(lambda x: x.mean())
df.head()
Date Temperature_C 3_day_avg
0 2023-09-01 20 NaN
1 2023-09-02 22 NaN
2 2023-09-03 23 21.666667
3 2023-09-04 21 22.000000
4 2023-09-05 19 21.000000

Here, lambda x: x.mean() calculates the rolling mean over a 3-day window for temperature data, which is essential in climate analysis for smoothing short-term fluctuations.
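
As a side note, for a plain mean the built-in rolling aggregation is equivalent and avoids the per-window lambda call (the new column name below is only illustrative):

# equivalent to .apply(lambda x: x.mean()), but faster
df['3_day_avg_builtin'] = df['Temperature_C'].rolling(window=3).mean()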

1.6 Aggregate function#

agg is a powerful method in Pandas that allows you to perform multiple aggregation operations on DataFrames and Series. You can apply one or more aggregation functions such as sum, mean, min, max, etc., to different columns or groups of data.

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8]
}
df = pd.DataFrame(data)

# Apply aggregation functions
result = df['A'].agg(['sum', 'mean'])
print(result)
sum     10.0
mean     2.5
Name: A, dtype: float64

Aggregating Multiple Columns with Multiple Functions#

You can pass a dictionary to agg where the keys are column names and the values are the functions to be applied.

# Apply different functions to different columns
result = df.agg({'A': ['sum', 'mean'], 'B': ['min', 'max']})
print(result)
         A    B
sum   10.0  NaN
mean   2.5  NaN
min    NaN  5.0
max    NaN  8.0

In this example, column ‘A’ gets sum and mean, while column ‘B’ gets min and max.

You may also use custom functions:

# Custom function to calculate range (max - min)
def data_range(x):
    return x.max() - x.min()

# Apply custom function
result = df.agg({'A': ['mean', data_range], 'B': data_range})
print(result)
              A    B
mean        2.5  NaN
data_range  3.0  3.0
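
agg is especially useful in combination with groupby, which you will also need in the exercise below. A minimal sketch on a made-up toy table (column names are invented for illustration):

# toy station table: aggregate per station
toy = pd.DataFrame({
    'station': ['AAA', 'AAA', 'BBB'],
    'elevation': [911.0, 911.0, 120.0],
    'rate': [100.0, 200.0, 50.0]
})
print(toy.groupby('station').agg({'elevation': 'max', 'rate': ['min', 'max']}))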

2 Mapping using Plotly#

Now we will plot the earthquake locations on a map using the Plotly package (more tutorials on Plotly are available online). The inputs of the function are self-explanatory and typical of Python functions; the documentation of Plotly scatter_geo lists the arguments.

fig = px.scatter_geo(quake,
                     lat='latitude',lon='longitude', 
                     range_color=(6,9),
                     height=600, width=600,
                     size='marker_size', color='magnitude',
                     hover_name="description",
                     hover_data=['description','magnitude','depth']);
fig.update_geos(resolution=110, showcountries=True)
fig.update_geos(resolution=110, showcountries=True,projection_type="orthographic")
fig

Recall that the lambda function lambda x: 'Minor' if x < 4.0 else ('Moderate' if x < 6.5 else 'Severe') used earlier classifies earthquake magnitudes based on their values.

The data was sorted by time. We now want to sort and show the data by magnitude instead. We use the pandas function sort_values to create a new DataFrame with sorted rows.

quakes2plot=quake.sort_values(by='magnitude_bin')
quakes2plot.head()
time latitude longitude depth magnitude description depth_km Category marker_size magnitude_bin
0 2010-07-02 06:04:03.570 -13.6098 166.6541 0.03440 6.3 VANUATU ISLANDS 34.40 Moderate 544.0 6.0
1112 2017-08-31 17:06:55.750 -1.1590 99.6881 0.04314 6.3 SOUTHERN SUMATRA 43.14 Moderate 544.0 6.0
1111 2017-08-27 04:17:51.010 -1.4513 148.0803 0.00800 6.3 ADMIRALTY ISLANDS REGION 8.00 Moderate 544.0 6.0
1110 2017-08-19 02:00:52.550 -17.9609 -178.8406 0.54400 6.4 FIJI ISLANDS REGION 544.00 Moderate 601.0 6.0
1108 2017-08-13 03:08:10.560 -3.7682 101.6228 0.03100 6.4 SOUTHERN SUMATRA 31.00 Moderate 601.0 6.0

Now we will plot again using Plotly

fig = px.scatter_geo(quakes2plot,
                     lat='latitude',lon='longitude', 
                     range_color=(6,9),
                     height=600, width=600,
                     size='marker_size', color='magnitude',
                     hover_name="description",
                     hover_data=['description','magnitude','depth']);
fig.update_geos(resolution=110, showcountries=True)
# fig.update_geos(resolution=110, showcountries=True,projection_type="orthographic")

3 Create a Pandas DataFrame from a generic text file#

The Python package pandas is very useful for reading CSV files, but also many other text files that are more or less formatted as one observation per row and one column per feature.

As an example, we are going to look at the list of seismic stations from the Northern California seismic network, available here:

url = 'http://ncedc.org/ftp/pub/doc/NC.info/NC.channel.summary.day'
# this downloads the file at the URL and returns its content as bytes
s = requests.get(url).content
# decode the bytes to a string and parse the whitespace-separated table
data = pd.read_csv(io.StringIO(s.decode('utf-8')), header=None, skiprows=2, sep=r'\s+', usecols=list(range(0, 13)))
# because columns/keys were not assigned, assign them now
data.columns = ['station', 'network', 'channel', 'location', 'rate', 'start_time', 'end_time', 'latitude', 'longitude', 'elevation', 'depth', 'dip', 'azimuth']

Let us look at the data, now stored in a pandas DataFrame.

data.head()
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth
0 AAR NC EHZ -- 0.0 1976/07/20,17:38:00 1977/12/01,22:37:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0
1 AAR NC EHZ -- 0.0 1977/12/01,22:37:00 1981/06/12,19:15:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0
2 AAR NC EHZ -- 0.0 1981/06/12,19:15:00 1984/01/01,00:00:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0
3 AAR NC EHZ -- 100.0 1984/01/01,00:00:00 1985/11/03,22:00:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0
4 AAR NC EHZ -- 100.0 1985/11/03,22:00:00 1987/03/10,23:15:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0

We can output the first row of the DataFrame with iloc:

data.iloc[0]
station                       AAR
network                        NC
channel                       EHZ
location                       --
rate                          0.0
start_time    1976/07/20,17:38:00
end_time      1977/12/01,22:37:00
latitude                 39.27594
longitude              -121.02696
elevation                   911.0
depth                         0.0
dip                         -90.0
azimuth                       0.0
Name: 0, dtype: object
# display the type of each column
data.dtypes
station        object
network        object
channel        object
location       object
rate          float64
start_time     object
end_time       object
latitude      float64
longitude     float64
elevation     float64
depth         float64
dip           float64
azimuth       float64
dtype: object
# select the first column (station) by position
data.iloc[:, 0]
0         AAR
1         AAR
2         AAR
3         AAR
4         AAR
         ... 
29434     WMP
29435     WMP
29436     WMP
29437     WSL
29438    WWVB
Name: station, Length: 29439, dtype: object

The start_time and end_time columns are stored as object (strings); convert these columns to datetime objects. With a too-simple prompt, Copilot may fail at the conversion:

# convert start_time and end_time to datetime
data['start_time'] = pd.to_datetime(data['start_time'])
data['end_time'] = pd.to_datetime(data['end_time'])
UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
...
DateParseError: Unknown datetime string format, unable to parse: 1976/07/20,17:38:00, at position 0

Specifying the date format explicitly fixes the parsing:
data['startdate'] = pd.to_datetime(data['start_time'], format='%Y/%m/%d,%H:%M:%S')
data.head()
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth startdate
0 AAR NC EHZ -- 0.0 1976/07/20,17:38:00 1977/12/01,22:37:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1976-07-20 17:38:00
1 AAR NC EHZ -- 0.0 1977/12/01,22:37:00 1981/06/12,19:15:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1977-12-01 22:37:00
2 AAR NC EHZ -- 0.0 1981/06/12,19:15:00 1984/01/01,00:00:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1981-06-12 19:15:00
3 AAR NC EHZ -- 100.0 1984/01/01,00:00:00 1985/11/03,22:00:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1984-01-01 00:00:00
4 AAR NC EHZ -- 100.0 1985/11/03,22:00:00 1987/03/10,23:15:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1985-11-03 22:00:00
# do the same for end times
# open channels have an end time in year 3000, which is beyond the pandas datetime64[ns]
# range, so replace it first to avoid an 'OutOfBoundsDatetime' error
enddate = data['end_time'].str.replace('3000', '2026')
enddate = pd.to_datetime(enddate, format='%Y/%m/%d,%H:%M:%S')
data['enddate'] = enddate
data.head()
station network channel location rate start_time end_time latitude longitude elevation depth dip azimuth startdate enddate
0 AAR NC EHZ -- 0.0 1976/07/20,17:38:00 1977/12/01,22:37:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1976-07-20 17:38:00 1977-12-01 22:37:00
1 AAR NC EHZ -- 0.0 1977/12/01,22:37:00 1981/06/12,19:15:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1977-12-01 22:37:00 1981-06-12 19:15:00
2 AAR NC EHZ -- 0.0 1981/06/12,19:15:00 1984/01/01,00:00:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1981-06-12 19:15:00 1984-01-01 00:00:00
3 AAR NC EHZ -- 100.0 1984/01/01,00:00:00 1985/11/03,22:00:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1984-01-01 00:00:00 1985-11-03 22:00:00
4 AAR NC EHZ -- 100.0 1985/11/03,22:00:00 1987/03/10,23:15:00 39.27594 -121.02696 911.0 0.0 -90.0 0.0 1985-11-03 22:00:00 1987-03-10 23:15:00
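
With both columns converted, datetime arithmetic becomes available. For example, a minimal sketch computing how long each channel epoch was running (the duration column is illustrative and not part of the original table):

# duration of each channel epoch as a Timedelta
data['duration'] = data['enddate'] - data['startdate']
print(data['duration'].head())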

Use Plotly to map the stations.

data.dropna(inplace=True)          # drop rows with missing values
data = data[data.longitude != 0]   # remove rows with unset (zero) longitude
fig = px.scatter_geo(data,
                     lat='latitude',lon='longitude', 
                     range_color=(6,9),
                     height=600, width=600,
                     hover_name="station",
                     hover_data=['network','station','channel','rate']);
fig.update_geos(resolution=110, showcountries=True)
fig = px.scatter_mapbox(data,
                     lat='latitude',lon='longitude', 
                     range_color=(6,9),mapbox_style="carto-positron",
                     height=600, width=500,
                     hover_name="station",
                     hover_data=['network','station','channel','rate']);
fig.update_layout(title="Northern California Seismic Network")
fig.show()

4 Exercise#

We will now practice manipulating pandas DataFrames. This exercise pulls station metadata from a URL, and students are expected to work through specific tasks.

Download data from the NCEDC URL

url = 'http://ncedc.org/ftp/pub/doc/NC.info/NC.channel.summary.day'

The column names are shown at the top of the text file; make a list of strings of these names in order to rename the columns once the DataFrame is made.

# students answer here
# students answer here
# request the data from the URL and use the IO package
# assign the column names
data.columns = ['station', 'network', 'channel', 'location', 'rate', 'start_time', 'end_time', 'latitude', 'longitude', 'elevation', 'depth', 'dip', 'azimuth']

Find the rows of station KCPB.

# find the rows of station KCPB

Now select two stations of your choice using the “or” operator |.

# answer below

Now select a given station and a specific channel code; for example, station KCPB and channel code HNZ.

# Select a station AND a channel, using the "&" operator

You may also choose the pandas function isin to select rows whose attribute belongs to a list. For instance, use isin to select all rows whose station key is in the list ['KCPB', 'KHBB'].

#  students answer here

Q Use pandas native functions to calculate how many unique sites (station names) there are in the network.

# students answer here

Q Use pandas native functions to select the unique set of channel codes that end with Z: this will tell you how many types of digitized data the seismic network manages.

# students answer here

Q Which station name has the largest number of channels? Hint: you may use value_counts().

# students answer here

Q What is the maximum difference in elevation between the stations? Use lambda functions.

# students answer here

Here, pandas does not recognize the start_time and end_time columns as a datetime format, so we cannot use datetime operations on them. We first need to convert these columns into a datetime format:

# answer here
# Transform column from string into datetime format
# do the same for end times

We can now look at when each seismic station was installed, using groupby and sorting by the earliest deployment (i.e., the minimum of start_time).

# students answer here

Select the stations that were deployed first and recovered last, using agg and lambda functions.

# answer here

5 CSV vs Parquet#

Parquet is a columnar data format that stores and compresses data column by column. It is fast for I/O and compact on disk.

Save data into a CSV file:

%timeit data.to_csv("my_metadata.csv")
!ls -lh my_metadata.csv
195 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
-rw-r--r--@ 1 marinedenolle  staff   4.2M Oct  7 05:15 my_metadata.csv

Now save to Parquet and compare the time and file size.

%timeit data.to_parquet("my_metadata.pq")
!ls -lh my_metadata.pq
25.2 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
-rw-r--r--@ 1 marinedenolle  staff   591K Oct  7 05:15 my_metadata.pq
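
To read the Parquet file back into a DataFrame, you can use pd.read_parquet, which requires a Parquet engine such as pyarrow or fastparquet to be installed:

# read the compressed file back into memory
data2 = pd.read_parquet("my_metadata.pq")
data2.head()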