4.1 Bringing Data Together#
4.1.1 Introduction to the Data#
The three key datasets:

- **Climatology Data**: offers a broad view of weather patterns over time.
- **SNOTEL Data**: provides specific insights into snowpack conditions.
- **Terrain Data**: brings in the geographical and physical characteristics of the landscape.

Let's look at the data.
Climatology data#
import dask.dataframe as dd
climatology_data_path = '../data/training_ready_climatology_data.csv'
climatology_data = dd.read_csv(climatology_data_path)
climatology_data.head()
| | date | lat | lon | cell_id | station_id | etr | pr | rmax | rmin | tmmn | tmmx | vpd | vs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001-01-01 | 41.993149 | -120.178715 | 76b55900-eb3d-4d25-a538-f74302ffe72d | CDEC:ADM | 1.5 | 0.0 | 61.4 | 30.6 | 265.7 | 274.2 | 0.28 | 2.7 |
| 1 | 2001-01-02 | 41.993149 | -120.178715 | 76b55900-eb3d-4d25-a538-f74302ffe72d | CDEC:ADM | 2.8 | 0.0 | 58.1 | 23.2 | 266.6 | 277.3 | 0.44 | 4.1 |
| 2 | 2001-01-03 | 41.993149 | -120.178715 | 76b55900-eb3d-4d25-a538-f74302ffe72d | CDEC:ADM | 3.2 | 0.0 | 38.0 | 21.4 | 268.4 | 280.6 | 0.54 | 4.2 |
| 3 | 2001-01-04 | 41.993149 | -120.178715 | 76b55900-eb3d-4d25-a538-f74302ffe72d | CDEC:ADM | 2.0 | 0.0 | 53.1 | 23.0 | 269.7 | 278.2 | 0.40 | 3.0 |
| 4 | 2001-01-05 | 41.993149 | -120.178715 | 76b55900-eb3d-4d25-a538-f74302ffe72d | CDEC:ADM | 2.1 | 0.0 | 58.7 | 22.2 | 270.6 | 279.6 | 0.49 | 2.5 |
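The temperature columns (`tmmn`, `tmmx`) appear to be in Kelvin, judging by their magnitudes (roughly 265–280). If Celsius is more convenient for exploration, the conversion is a simple column expression. Below is a minimal sketch on a small in-memory sample (the values are copied from the rows above); the same expressions work unchanged on a Dask DataFrame.

```python
import pandas as pd

# Small in-memory sample mimicking the climatology temperature columns
# (assumed to be in Kelvin, based on the magnitudes shown above)
sample = pd.DataFrame({
    "tmmn": [265.7, 266.6],  # daily minimum temperature
    "tmmx": [274.2, 277.3],  # daily maximum temperature
})

# Kelvin -> Celsius
sample["tmmn_c"] = sample["tmmn"] - 273.15
sample["tmmx_c"] = sample["tmmx"] - 273.15
print(sample[["tmmn_c", "tmmx_c"]].round(2))
```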
SNOTEL Data#
snotel_data_path = '../data/training_ready_snotel_data.csv'
snotel_data = dd.read_csv(snotel_data_path)
snotel_data.head()
| | Date | Snow Water Equivalent (in) Start of Day Values | Change In Snow Water Equivalent (in) | Snow Depth (in) Start of Day Values | Change In Snow Depth (in) | Air Temperature Observed (degF) Start of Day Values | station_name | station_triplet | station_elevation | station_lat | station_long | mapping_station_id | mapping_cell_id | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-01-03 | 20.3 | 0.3 | NaN | NaN | 25.9 | Dismal Swamp | 446:CA:SNTL | 7360 | 41.99127 | -120.18033 | CDEC:ADM | 76b55900-eb3d-4d25-a538-f74302ffe72d | 41.993149 | -120.178715 |
| 1 | 2002-01-04 | 20.4 | 0.1 | NaN | NaN | 12.0 | Dismal Swamp | 446:CA:SNTL | 7360 | 41.99127 | -120.18033 | CDEC:ADM | 76b55900-eb3d-4d25-a538-f74302ffe72d | 41.993149 | -120.178715 |
| 2 | 2002-01-05 | 20.4 | 0.0 | NaN | NaN | 22.8 | Dismal Swamp | 446:CA:SNTL | 7360 | 41.99127 | -120.18033 | CDEC:ADM | 76b55900-eb3d-4d25-a538-f74302ffe72d | 41.993149 | -120.178715 |
| 3 | 2002-01-06 | 20.5 | 0.1 | NaN | NaN | 33.3 | Dismal Swamp | 446:CA:SNTL | 7360 | 41.99127 | -120.18033 | CDEC:ADM | 76b55900-eb3d-4d25-a538-f74302ffe72d | 41.993149 | -120.178715 |
| 4 | 2002-01-07 | 21.2 | 0.7 | NaN | NaN | 34.7 | Dismal Swamp | 446:CA:SNTL | 7360 | 41.99127 | -120.18033 | CDEC:ADM | 76b55900-eb3d-4d25-a538-f74302ffe72d | 41.993149 | -120.178715 |
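Notice that the snow-depth columns are `NaN` in the rows shown. Before merging, it is worth counting missing values per column so we know which features are sparse. A minimal sketch on a toy in-memory sample (column names taken from the table above); on a Dask DataFrame the same `isna().sum()` call returns a lazy result that needs `.compute()`.

```python
import numpy as np
import pandas as pd

# Toy sample mimicking two SNOTEL columns, with NaNs as in the rows above
snotel = pd.DataFrame({
    "Snow Water Equivalent (in) Start of Day Values": [20.3, 20.4, 20.4],
    "Snow Depth (in) Start of Day Values": [np.nan, np.nan, 18.0],
})

# Count missing values per column
missing = snotel.isna().sum()
print(missing)
```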
Terrain Data#
terrain_data_path = '../data/training_ready_terrain_data.csv'
terrain_data = dd.read_csv(terrain_data_path)
terrain_data.head()
| | lat | lon | elevation | slope | curvature | aspect | eastness | northness |
|---|---|---|---|---|---|---|---|---|
| 0 | 41.993149 | -120.178715 | 2290.8364 | 89.988850 | -9401.7705 | 40.629730 | 0.577196 | 0.649194 |
| 1 | 37.727154 | -119.136669 | 2955.2370 | 89.991880 | -5259.1160 | 38.885838 | 0.560589 | 0.661430 |
| 2 | 38.918144 | -120.205665 | 2481.0059 | 89.992966 | -9113.7150 | 40.579857 | 0.576732 | 0.649554 |
| 3 | 37.070608 | -118.768361 | 3329.7136 | 89.975500 | -7727.0957 | 78.698520 | 0.775608 | 0.193519 |
| 4 | 36.364939 | -118.292254 | 2851.1318 | 89.975540 | 2352.6350 | 123.959000 | 0.692435 | -0.509422 |
Each dataset carries the shared keys we need for joining: latitude and longitude in all three, plus date in the climatology and SNOTEL data. These keys let us combine the sources into a single training table for our SWE prediction model.
4.1.2 Integrating the Datasets#
We combine these large datasets into one DataFrame using Dask. Dask partitions the data and evaluates operations lazily, so the merge scales to datasets that do not fit in memory.

If the files grow even larger, reading the CSVs in chunks keeps memory use low, speeds up processing, and makes errors easier to isolate. This keeps the workflow flexible and scalable even on machines with limited resources.
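To make the chunking idea concrete, here is a minimal sketch using pandas' `chunksize` on a small hypothetical CSV held in memory (with a real file you would pass its path). Dask achieves the same effect automatically: `dd.read_csv` splits the file into independent partitions, controlled by its `blocksize` argument.

```python
import io
import pandas as pd

# A tiny CSV standing in for a large file on disk (hypothetical contents)
csv_text = "lat,lon,pr\n41.99,-120.18,0.0\n37.73,-119.14,1.2\n38.92,-120.21,0.5\n"

# Read two rows at a time; each chunk is a regular DataFrame that can be
# processed and discarded, so memory use stays bounded
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(total_rows)
```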
Read and Convert#
Each CSV file is read into a Dask DataFrame, with latitude and longitude data types converted to floats for uniformity.
import dask.dataframe as dd
import os
file_path1 = '../data/training_ready_climatology_data.csv'
file_path2 = '../data/training_ready_snotel_data.csv'
file_path3 = '../data/training_ready_terrain_data.csv'
# Read each CSV file into a Dask DataFrame
df1 = dd.read_csv(file_path1)
df2 = dd.read_csv(file_path2)
df3 = dd.read_csv(file_path3)
# Perform data type conversion for latitude and longitude columns
df1['lat'] = df1['lat'].astype(float)
df1['lon'] = df1['lon'].astype(float)
df2['lat'] = df2['lat'].astype(float)
df2['lon'] = df2['lon'].astype(float)
df3['lat'] = df3['lat'].astype(float)
df3['lon'] = df3['lon'].astype(float)
# Rename the SNOTEL 'Date' column to match the other DataFrames
df2 = df2.rename(columns={"Date": "date"})
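As an alternative to converting types after loading, dtypes can be declared at read time, which avoids a separate `astype` pass. A minimal sketch with a hypothetical in-memory CSV; `dd.read_csv` accepts the same `dtype` mapping.

```python
import io
import pandas as pd

# Hypothetical CSV contents; with a real file you would pass its path
csv_text = "lat,lon\n41.993149,-120.178715\n"

# Declare the column dtypes up front instead of converting afterwards
df = pd.read_csv(io.StringIO(csv_text), dtype={"lat": float, "lon": float})
print(df.dtypes)
```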
Merge on Common Ground#
The dataframes are then merged based on shared columns (latitude, longitude, and date), ensuring that each row represents a coherent set of data from all three sources.
# Merge the first two DataFrames on the shared 'lat', 'lon', and 'date' columns
merged_df1 = dd.merge(df1, df2, on=['lat', 'lon', 'date'])
# Merge the third DataFrame based on 'lat' and 'lon'
merged_df2 = dd.merge(merged_df1, df3, on=['lat', 'lon'])
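One caveat when joining on floating-point coordinates: the values must match bit-for-bit, and coordinates that differ only by floating-point noise silently fail to join. The datasets here already share identical coordinates, but when they come from different pipelines a common safeguard is rounding the keys to a fixed precision first. A sketch with toy data (at 6 decimal places, roughly 0.1 m of latitude); the same `.round()` and merge calls work on Dask DataFrames.

```python
import pandas as pd

# Two toy tables whose coordinates differ only by floating-point noise
left = pd.DataFrame({"lat": [41.9931491], "lon": [-120.1787151], "pr": [0.0]})
right = pd.DataFrame({"lat": [41.9931489], "lon": [-120.1787149], "elevation": [2290.8]})

# A direct merge on the raw floats finds no matches
assert pd.merge(left, right, on=["lat", "lon"]).empty

# Rounding both sides to a fixed precision restores the join
for df in (left, right):
    df[["lat", "lon"]] = df[["lat", "lon"]].round(6)
merged = pd.merge(left, right, on=["lat", "lon"])
print(len(merged))
```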
Output#
The merged DataFrame is saved as a new CSV file, ready for further processing or analysis.
merged_df2.to_csv('../data/model_training_data.csv', index=False, single_file=True)
['/Users/vangavetisaivivek/research/swe-workflow-book/book/data/model_training_data.csv']
df = dd.read_csv('../data/model_training_data.csv')
df.columns
Index(['date', 'lat', 'lon', 'cell_id', 'station_id', 'etr', 'pr', 'rmax',
'rmin', 'tmmn', 'tmmx', 'vpd', 'vs',
'Snow Water Equivalent (in) Start of Day Values',
'Change In Snow Water Equivalent (in)',
'Snow Depth (in) Start of Day Values', 'Change In Snow Depth (in)',
'Air Temperature Observed (degF) Start of Day Values', 'station_name',
'station_triplet', 'station_elevation', 'station_lat', 'station_long',
'mapping_station_id', 'mapping_cell_id', 'elevation', 'slope',
'curvature', 'aspect', 'eastness', 'northness'],
dtype='object')
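Because `dd.merge` defaults to an inner join, any SNOTEL day without a matching climatology row is silently dropped from the training table. A quick way to audit this is an outer merge with `indicator=True`, which labels where each row came from. A sketch with toy data, shown in pandas for brevity (Dask's `dd.merge` accepts the same `on` and `how` arguments):

```python
import pandas as pd

# Toy versions of two inputs; one SNOTEL day has no climatology match
clim = pd.DataFrame({"lat": [41.99, 41.99], "lon": [-120.18, -120.18],
                     "date": ["2002-01-03", "2002-01-04"], "pr": [0.0, 0.1]})
snotel = pd.DataFrame({"lat": [41.99, 41.99, 41.99], "lon": [-120.18] * 3,
                       "date": ["2002-01-03", "2002-01-04", "2002-01-05"],
                       "swe": [20.3, 20.4, 20.4]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only',
# revealing rows that a default inner merge would silently drop
audit = pd.merge(clim, snotel, on=["lat", "lon", "date"],
                 how="outer", indicator=True)
print(audit["_merge"].value_counts())
```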