{ "cells": [ { "cell_type": "markdown", "id": "7387d7e6", "metadata": {}, "source": [ "# 4.1 Bringing Data Together\n", "\n", "## 4.1.1 Introduction to the Data\n", "\n", "The three key datasets:\n", "\n", "- **Climatology Data:** Offers a broad view of weather patterns over time.\n", "- **SNOTEL Data:** Provides specific insights into snowpack conditions.\n", "- **Terrain Data:** Brings in the geographical and physical characteristics of the landscape.\n", "\n", "Lets look at the data" ] }, { "cell_type": "markdown", "id": "c59df521", "metadata": {}, "source": [ "### Climatology data" ] }, { "cell_type": "code", "execution_count": 1, "id": "4d270d3e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datelatloncell_idstation_idetrprrmaxrmintmmntmmxvpdvs
02001-01-0141.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM1.50.061.430.6265.7274.20.282.7
12001-01-0241.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM2.80.058.123.2266.6277.30.444.1
22001-01-0341.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM3.20.038.021.4268.4280.60.544.2
32001-01-0441.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM2.00.053.123.0269.7278.20.403.0
42001-01-0541.993149-120.17871576b55900-eb3d-4d25-a538-f74302ffe72dCDEC:ADM2.10.058.722.2270.6279.60.492.5
\n", "
" ], "text/plain": [ " date lat lon cell_id \\\n", "0 2001-01-01 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", "1 2001-01-02 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", "2 2001-01-03 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", "3 2001-01-04 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", "4 2001-01-05 41.993149 -120.178715 76b55900-eb3d-4d25-a538-f74302ffe72d \n", "\n", " station_id etr pr rmax rmin tmmn tmmx vpd vs \n", "0 CDEC:ADM 1.5 0.0 61.4 30.6 265.7 274.2 0.28 2.7 \n", "1 CDEC:ADM 2.8 0.0 58.1 23.2 266.6 277.3 0.44 4.1 \n", "2 CDEC:ADM 3.2 0.0 38.0 21.4 268.4 280.6 0.54 4.2 \n", "3 CDEC:ADM 2.0 0.0 53.1 23.0 269.7 278.2 0.40 3.0 \n", "4 CDEC:ADM 2.1 0.0 58.7 22.2 270.6 279.6 0.49 2.5 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask.dataframe as dd\n", "climatology_data_path = '../data/training_ready_climatology_data.csv'\n", "climatology_data = dd.read_csv(climatology_data_path)\n", "climatology_data.head()" ] }, { "cell_type": "markdown", "id": "80ae76d5", "metadata": {}, "source": [ "### SNOTEL Data" ] }, { "cell_type": "code", "execution_count": 2, "id": "c940b2fb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateSnow Water Equivalent (in) Start of Day ValuesChange In Snow Water Equivalent (in)Snow Depth (in) Start of Day ValuesChange In Snow Depth (in)Air Temperature Observed (degF) Start of Day Valuesstation_namestation_tripletstation_elevationstation_latstation_longmapping_station_idmapping_cell_idlatlon
02002-01-0320.30.3NaNNaN25.9Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
12002-01-0420.40.1NaNNaN12.0Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
22002-01-0520.40.0NaNNaN22.8Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
32002-01-0620.50.1NaNNaN33.3Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
42002-01-0721.20.7NaNNaN34.7Dismal Swamp446:CA:SNTL736041.99127-120.18033CDEC:ADM76b55900-eb3d-4d25-a538-f74302ffe72d41.993149-120.178715
\n", "
" ], "text/plain": [ " Date Snow Water Equivalent (in) Start of Day Values \\\n", "0 2002-01-03 20.3 \n", "1 2002-01-04 20.4 \n", "2 2002-01-05 20.4 \n", "3 2002-01-06 20.5 \n", "4 2002-01-07 21.2 \n", "\n", " Change In Snow Water Equivalent (in) Snow Depth (in) Start of Day Values \\\n", "0 0.3 NaN \n", "1 0.1 NaN \n", "2 0.0 NaN \n", "3 0.1 NaN \n", "4 0.7 NaN \n", "\n", " Change In Snow Depth (in) \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "\n", " Air Temperature Observed (degF) Start of Day Values station_name \\\n", "0 25.9 Dismal Swamp \n", "1 12.0 Dismal Swamp \n", "2 22.8 Dismal Swamp \n", "3 33.3 Dismal Swamp \n", "4 34.7 Dismal Swamp \n", "\n", " station_triplet station_elevation station_lat station_long \\\n", "0 446:CA:SNTL 7360 41.99127 -120.18033 \n", "1 446:CA:SNTL 7360 41.99127 -120.18033 \n", "2 446:CA:SNTL 7360 41.99127 -120.18033 \n", "3 446:CA:SNTL 7360 41.99127 -120.18033 \n", "4 446:CA:SNTL 7360 41.99127 -120.18033 \n", "\n", " mapping_station_id mapping_cell_id lat \\\n", "0 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", "1 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", "2 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", "3 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", "4 CDEC:ADM 76b55900-eb3d-4d25-a538-f74302ffe72d 41.993149 \n", "\n", " lon \n", "0 -120.178715 \n", "1 -120.178715 \n", "2 -120.178715 \n", "3 -120.178715 \n", "4 -120.178715 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "snotel_data_path = '../data/training_ready_snotel_data.csv'\n", "snotel_data = dd.read_csv(snotel_data_path)\n", "snotel_data.head()" ] }, { "cell_type": "markdown", "id": "951a2c8c", "metadata": {}, "source": [ "### Terrain Data" ] }, { "cell_type": "code", "execution_count": 3, "id": "482017bf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latlonelevationslopecurvatureaspecteastnessnorthness
041.993149-120.1787152290.836489.988850-9401.770540.6297300.5771960.649194
137.727154-119.1366692955.237089.991880-5259.116038.8858380.5605890.661430
238.918144-120.2056652481.005989.992966-9113.715040.5798570.5767320.649554
337.070608-118.7683613329.713689.975500-7727.095778.6985200.7756080.193519
436.364939-118.2922542851.131889.9755402352.6350123.9590000.692435-0.509422
\n", "
" ], "text/plain": [ " lat lon elevation slope curvature aspect \\\n", "0 41.993149 -120.178715 2290.8364 89.988850 -9401.7705 40.629730 \n", "1 37.727154 -119.136669 2955.2370 89.991880 -5259.1160 38.885838 \n", "2 38.918144 -120.205665 2481.0059 89.992966 -9113.7150 40.579857 \n", "3 37.070608 -118.768361 3329.7136 89.975500 -7727.0957 78.698520 \n", "4 36.364939 -118.292254 2851.1318 89.975540 2352.6350 123.959000 \n", "\n", " eastness northness \n", "0 0.577196 0.649194 \n", "1 0.560589 0.661430 \n", "2 0.576732 0.649554 \n", "3 0.775608 0.193519 \n", "4 0.692435 -0.509422 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "terrain_data_path = '../data/training_ready_terrain_data.csv'\n", "terrain_data = dd.read_csv(terrain_data_path)\n", "terrain_data.head()" ] }, { "cell_type": "markdown", "id": "3c8a584b", "metadata": {}, "source": [ "Each dataset comes packed with essential features like latitude, longitude, and date, ready to enrich our SWE prediction model.\n", "\n", "## 4.1.2 Integrating the Datasets\n", "\n", "We are combining these large datasets into one DataFrame using Dask. Dask allows us to work with big data efficiently, so we can merge the datasets quickly and easily, no matter how large they are.\n", "\n", "And also if the size of the data is larger then reading large CSV files in chunks helps manage big data more efficiently by reducing memory use, speeding up processing, and improving error handling. This approach makes it easier to work on large datasets with limited resources, ensuring flexibility and scalability in data analysis.\n", "\n", "### Read and Convert\n", "- Each CSV file is read into a Dask DataFrame, with latitude and longitude data types converted to floats for uniformity." ] }, { "cell_type": "code", "execution_count": 4, "id": "25039fe3", "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "import os\n", "file_path1 = '../data/training_ready_climatology_data.csv'\n", "file_path2 = '../data/training_ready_snotel_data.csv'\n", "file_path3 = '../data/training_ready_terrain_data.csv'\n", "# Read each CSV file into a Dask DataFrame\n", "df1 = dd.read_csv(file_path1)\n", "df2 = dd.read_csv(file_path2)\n", "df3 = dd.read_csv(file_path3)\n", "# Perform data type conversion for latitude and longitude columns\n", "df1['lat'] = df1['lat'].astype(float)\n", "df1['lon'] = df1['lon'].astype(float)\n", "df2['lat'] = df2['lat'].astype(float)\n", "df2['lon'] = df2['lon'].astype(float)\n", "df3['lat'] = df3['lat'].astype(float)\n", "df3['lon'] = df3['lon'].astype(float)\n", "#rename the columns to match the other dataframes\n", "df2 = df2.rename(columns={\"Date\": \"date\"})\n", "\n" ] }, { "cell_type": "markdown", "id": "cd96f259", "metadata": {}, "source": [ "#### Merge on Common Ground\n", "- The dataframes are then merged based on shared columns (latitude, longitude, and date), ensuring that each row represents a coherent set of data from all three sources." 
] }, { "cell_type": "code", "execution_count": 5, "id": "43de425b", "metadata": {}, "outputs": [], "source": [ "# Merge the first two DataFrames based on 'lat', 'lon', and 'date'\n", "merged_df1 = dd.merge(df1, df2, left_on=['lat', 'lon', 'date'], right_on=['lat', 'lon', 'date'])\n", "\n", "# Merge the third DataFrame based on 'lat' and 'lon'\n", "merged_df2 = dd.merge(merged_df1, df3, on=['lat', 'lon'])" ] }, { "cell_type": "markdown", "id": "d7c17226", "metadata": {}, "source": [ "#### Output\n", "- The merged DataFrame is saved as a new CSV file, ready for further processing or analysis." ] }, { "cell_type": "code", "execution_count": 6, "id": "c200ffde", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/Users/vangavetisaivivek/research/swe-workflow-book/book/data/model_training_data.csv']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged_df2.to_csv('../data/model_training_data.csv', index=False, single_file=True)" ] }, { "cell_type": "code", "execution_count": 7, "id": "414a3ebf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['date', 'lat', 'lon', 'cell_id', 'station_id', 'etr', 'pr', 'rmax',\n", " 'rmin', 'tmmn', 'tmmx', 'vpd', 'vs',\n", " 'Snow Water Equivalent (in) Start of Day Values',\n", " 'Change In Snow Water Equivalent (in)',\n", " 'Snow Depth (in) Start of Day Values', 'Change In Snow Depth (in)',\n", " 'Air Temperature Observed (degF) Start of Day Values', 'station_name',\n", " 'station_triplet', 'station_elevation', 'station_lat', 'station_long',\n", " 'mapping_station_id', 'mapping_cell_id', 'elevation', 'slope',\n", " 'curvature', 'aspect', 'eastness', 'northness'],\n", " dtype='object')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = dd.read_csv('../data/model_training_data.csv')\n", "df.columns" ] } ], "metadata": { "kernelspec": { "display_name": "Python (base)", "language": "python", "name": "base" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 5 }