3.4 GHCNd: Global Historical Climatology Network Daily#

Accurate Snow Water Equivalent (SWE) prediction depends on reliable input data, including snow depth measurements. This section details the process of retrieving, processing, and refining snow depth data from the Global Historical Climatology Network Daily (GHCNd) dataset. The script discussed here automates the retrieval of snow depth data from ground stations, filters and cleans the data, and prepares it for use in SWE prediction models.

import pandas as pd
import requests
import re
from io import StringIO
import dask
import dask.dataframe as dd

all_ghcd_station_file = '../Data/all_ghcn_station_list.csv'
only_active_ghcd_station_in_west_conus_file = '../Data/active_station_only_list.csv'
snowdepth_csv_file = '../Data/active_station_only_list.csv_all_vars.csv'
mask_non_snow_days_ghcd_csv_file = '../Data/active_station_only_list.csv_all_vars_masked_non_snow.csv'
southwest_lon = -125.0
southwest_lat = 25.0
northeast_lon = -100.0
northeast_lat = 49.0

# the training period spans 2018-01-03 through 2021-12-31
train_start_date = "2018-01-03"
train_end_date = "2021-12-31"

3.4.1 Overview#

The GHCNd dataset provides daily climate records from numerous weather stations worldwide. This script specifically targets snow depth data (SNWD) from active ground stations within a defined geographical area. The process involves several key steps:

  1. Downloading and Filtering Station Data: Identifying active stations within the specified region that report snow depth measurements.

  2. Retrieving Snow Depth Observations: Downloading and processing snow depth data for the specified training period.

  3. Masking Non-Snow Days: Adding a swe_value column that records a known SWE of zero on snow-free days and flags days with snow on the ground with a placeholder value. A driver that chains these three steps is sketched after this list.
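These steps correspond to the three functions defined in the rest of this section. A minimal driver that chains them, assuming the configuration and function definitions below have already been executed, might look like this sketch:

def run_ghcnd_snow_depth_pipeline():
    # Step 1: download the station inventory and keep active SNWD stations
    # inside the western CONUS bounding box
    download_convert_and_read()

    # Step 2: retrieve daily SNWD records for those stations over the
    # training period and save them to snowdepth_csv_file
    get_snow_depth_observations_from_ghcn()

    # Step 3: add the swe_value column (0 on snow-free days, -999 otherwise)
    mask_out_all_non_zero_snowdepth_days()

run_ghcnd_snow_depth_pipeline()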

3.4.1.1 What data is being downloaded?#

  • Source: The data is sourced from the Global Historical Climatology Network Daily (GHCNd), a comprehensive dataset managed by the National Centers for Environmental Information (NCEI). It provides historical daily climate data from thousands of stations worldwide.

  • Features: The key feature targeted by this script is Snow Depth (SNWD), which is the depth of snow on the ground at the time of observation, typically measured in millimeters.

  • Station Metadata: Additional metadata about each station includes the station ID, geographical coordinates (latitude and longitude), and the operational period of the station (start and end years).

3.4.2 Downloading and Filtering Data#

The first step in this process is to download the list of all GHCNd stations and filter them based on geographical and operational criteria. This ensures that only active stations within the region of interest are considered.

3.4.2.1 Downloading and Converting Station Data#

The script begins by downloading the GHCNd station inventory file and converting it into a usable format. The relevant stations are then filtered based on their operational status (active in 2024) and location.

def download_convert_and_read():
  
    url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-inventory.txt"
    # Download the text file from the URL
    response = requests.get(url)
    if response.status_code != 200:
        print("Error: Failed to download the file.")
        return None
    
    # Tokenize the fixed-width text into whitespace-separated fields
    pattern = r"(\S+)\s+"
    parsed_data = re.findall(pattern, response.text)
    print("len(parsed_data) = ", len(parsed_data))
    
    # Group the tokens into rows of six:
    # station, latitude, longitude, variable, first year, last year
    rows = []
    for i in range(0, len(parsed_data), 6):
        rows.append(parsed_data[i:i+6])
    
    print("rows[0:20] = ", rows[0:20])
    # Convert the rows into a CSV-like format
    csv_data = "\n".join([",".join(row) for row in rows])
    
    # Save the CSV-like string to a file
    with open(all_ghcd_station_file, "w") as file:
        file.write(csv_data)
    
    # Read the CSV-like data into a pandas DataFrame
    column_names = ['Station', 'Latitude', 'Longitude', 'Variable', 'Year_Start', 'Year_End']
    df = pd.read_csv(all_ghcd_station_file, header=None, names=column_names)
    print(df.head())
    
    # Keep only SNWD records from stations still reporting in 2024
    df = df[(df['Year_End'] == 2024) & (df['Variable'] == 'SNWD')]
    print("Removed non-active stations: ", df.head())
    
    # Filter rows within the latitude and longitude ranges
    df = df[
      (df['Latitude'] >= southwest_lat) & (df['Latitude'] <= northeast_lat) &
      (df['Longitude'] >= southwest_lon) & (df['Longitude'] <= northeast_lon)
    ]
    
    df.to_csv(only_active_ghcd_station_in_west_conus_file, index=False)
    print(f"saved to {only_active_ghcd_station_in_west_conus_file}")
    
    
    return df
download_convert_and_read()
len(parsed_data) =  4537908
rows[0:20] =  [['ACW00011604', '17.1167', '-61.7833', 'TMAX', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'TMIN', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'PRCP', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'SNOW', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'SNWD', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'PGTM', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'WDFG', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'WSFG', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'WT03', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'WT08', '1949', '1949'], ['ACW00011604', '17.1167', '-61.7833', 'WT16', '1949', '1949'], ['ACW00011647', '17.1333', '-61.7833', 'TMAX', '1961', '1961'], ['ACW00011647', '17.1333', '-61.7833', 'TMIN', '1961', '1961'], ['ACW00011647', '17.1333', '-61.7833', 'PRCP', '1957', '1970'], ['ACW00011647', '17.1333', '-61.7833', 'SNOW', '1957', '1970'], ['ACW00011647', '17.1333', '-61.7833', 'SNWD', '1957', '1970'], ['ACW00011647', '17.1333', '-61.7833', 'WT03', '1961', '1961'], ['ACW00011647', '17.1333', '-61.7833', 'WT16', '1961', '1966'], ['AE000041196', '25.3330', '55.5170', 'TMAX', '1944', '2024'], ['AE000041196', '25.3330', '55.5170', 'TMIN', '1944', '2024']]
       Station  Latitude  Longitude Variable  Year_Start  Year_End
0  ACW00011604   17.1167   -61.7833     TMAX        1949      1949
1  ACW00011604   17.1167   -61.7833     TMIN        1949      1949
2  ACW00011604   17.1167   -61.7833     PRCP        1949      1949
3  ACW00011604   17.1167   -61.7833     SNOW        1949      1949
4  ACW00011604   17.1167   -61.7833     SNWD        1949      1949
Removed non-active stations:           Station  Latitude  Longitude Variable  Year_Start  Year_End
424  AJ000037575   41.5500    46.6670     SNWD        1973      2024
446  AJ000037675   41.3670    48.5170     SNWD        1973      2024
454  AJ000037735   40.7167    46.4167     SNWD        1973      2024
475  AJ000037749   40.6500    47.7500     SNWD        1973      2024
485  AJ000037756   40.5330    48.9330     SNWD        1973      2024
saved to ../Data/active_station_only_list.csv
Station Latitude Longitude Variable Year_Start Year_End
70926 CA001011500 48.9333 -123.7500 SNWD 1991 2024
71000 CA001012055 48.8333 -124.0500 SNWD 1980 2024
71151 CA001015105 48.3667 -123.5667 SNWD 1980 2024
71190 CA001015628 48.8167 -123.7167 SNWD 1981 2024
71196 CA001015630 48.8167 -123.7167 SNWD 2007 2024
... ... ... ... ... ... ...
750909 USW00094143 43.5317 -112.9422 SNWD 1954 2024
750992 USW00094178 42.4786 -114.4775 SNWD 1998 2024
751016 USW00094182 44.8942 -116.0997 SNWD 1998 2024
751035 USW00094185 43.5947 -118.9578 SNWD 1973 2024
751532 USW00094290 47.6872 -122.2553 SNWD 1986 2024

4529 rows × 6 columns

  • URL Download: The script begins by downloading the GHCNd station inventory file from the National Centers for Environmental Information (NCEI) website.

  • Data Parsing and Formatting: The raw text data is parsed using regular expressions and converted into a CSV format that pandas can process (an alternative fixed-width parsing approach is sketched after this list).

  • Filtering: The script filters stations based on their operational status and geographical location, focusing on active stations within the defined latitude and longitude bounds.

  • Saving Processed Data: The filtered station data is saved to a CSV file, which will be used in the subsequent steps to retrieve snow depth observations.
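As an aside, ghcnd-inventory.txt is a fixed-width file, so the regex-based tokenization above could be replaced with pandas' read_fwf. A minimal sketch, assuming the column positions documented in the GHCNd readme:

import pandas as pd

inventory_url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-inventory.txt"
column_names = ['Station', 'Latitude', 'Longitude', 'Variable', 'Year_Start', 'Year_End']

# Column positions (0-indexed, end-exclusive) assumed from the GHCNd readme
colspecs = [(0, 11), (12, 20), (21, 30), (31, 35), (36, 40), (41, 45)]
inventory_df = pd.read_fwf(inventory_url, colspecs=colspecs, names=column_names)
print(inventory_df.head())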

3.4.2.2 Filtering and Saving Active Stations#

After filtering, the script saves the list of relevant stations, which will be used in subsequent steps to retrieve snow depth data. This step is crucial for ensuring that the data retrieval process focuses only on the most relevant stations, minimizing unnecessary data processing.

Key Points:

  • Geographical Filtering: Ensures that only stations within the defined latitudinal and longitudinal bounds are included.

  • Activity Status: Filters out stations that are no longer active, ensuring the data is current and relevant. A quick sanity check on the saved list is sketched below.
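The following optional sanity check (a sketch, not part of the original script) confirms both conditions using the variables defined at the top of this section:

import pandas as pd

stations = pd.read_csv(only_active_ghcd_station_in_west_conus_file)

# Every retained record should be an SNWD station still reporting in 2024 ...
assert (stations['Variable'] == 'SNWD').all()
assert (stations['Year_End'] == 2024).all()
# ... and fall inside the western CONUS bounding box defined above
assert stations['Latitude'].between(southwest_lat, northeast_lat).all()
assert stations['Longitude'].between(southwest_lon, northeast_lon).all()
print(f"{len(stations)} stations passed the checks")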

3.4.3 Retrieving Snow Depth Observations#

With the relevant stations identified, the next step is to retrieve snow depth observations from the GHCNd dataset for the specified training period. This involves downloading snow depth data from each station and processing it to ensure it is suitable for model training.

3.4.3.1 Processing Snow Depth Data#

The script uses Dask to parallelize the data retrieval process, efficiently downloading snow depth records from each station within the defined period. Dask is a powerful tool for handling large datasets, allowing for the concurrent execution of multiple data retrieval tasks.
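For readers new to Dask, the pattern used in the function below reduces to three moves: wrap the per-item work with dask.delayed, build a list of lazy tasks, and run them all with dask.compute. A minimal, self-contained sketch of that pattern (independent of GHCNd) looks like this:

import dask

@dask.delayed
def work(item):
    # stand-in for the per-station work (download, parse, filter)
    return item * 2

tasks = [work(i) for i in range(5)]   # build lazy tasks; nothing runs yet
results = dask.compute(*tasks)        # execute the tasks in parallel
print(results)                        # (0, 2, 4, 6, 8)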

import warnings
def get_snow_depth_observations_from_ghcn():
    warnings.filterwarnings("ignore")
    
    new_base_df = pd.read_csv(only_active_ghcd_station_in_west_conus_file)
    print(new_base_df.shape)
    
    # Parse the training period bounds once; each station task filters on them
    train_start_date_obj = pd.to_datetime(train_start_date)
    train_end_date_obj = pd.to_datetime(train_end_date)

    # Function to process each station
    @dask.delayed
    def process_station(station):
        station_name = station['Station']
        # print(f"retrieving for {station_name}")
        station_lat = station['Latitude']
        station_long = station['Longitude']
        try:
          url = f"https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/{station_name}.csv"
          response = requests.get(url)
          df = pd.read_csv(StringIO(response.text))
          #"STATION","DATE","LATITUDE","LONGITUDE","ELEVATION","NAME","PRCP","PRCP_ATTRIBUTES","SNOW","SNOW_ATTRIBUTES","SNWD","SNWD_ATTRIBUTES","TMAX","TMAX_ATTRIBUTES","TMIN","TMIN_ATTRIBUTES","PGTM","PGTM_ATTRIBUTES","WDFG","WDFG_ATTRIBUTES","WSFG","WSFG_ATTRIBUTES","WT03","WT03_ATTRIBUTES","WT08","WT08_ATTRIBUTES","WT16","WT16_ATTRIBUTES"
          columns_to_keep = ['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'SNWD']
          df = df[columns_to_keep]
          # Convert the date column to datetime objects
          df['DATE'] = pd.to_datetime(df['DATE'])
          # Filter rows based on the training period
          df = df[(df['DATE'] >= train_start_date_obj) & (df['DATE'] <= train_end_date_obj)]
          # print(df.head())
          return df
        except Exception as e:
          # Stations whose CSV lacks the SNWD column (or fails to download)
          # are skipped; pd.concat silently drops the None returned here
          print("An error occurred:", e)
          return None

    # List of delayed computations for each station
    delayed_results = [process_station(row) for _, row in new_base_df.iterrows()]

    # Compute the delayed results
    result_lists = dask.compute(*delayed_results)

    # Concatenate the per-station DataFrames into a single DataFrame
    result_df = pd.concat(result_lists, ignore_index=True)

    # Print the final Pandas DataFrame
    print(result_df.head())

    # Save the DataFrame to a CSV file
    result_df.to_csv(snowdepth_csv_file, index=False)
    print(f"All the data are saved to {snowdepth_csv_file}")

Active Stations#

get_snow_depth_observations_from_ghcn()
(4529, 6)
An error occurred: "None of [Index(['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'SNWD'], dtype='object')] are in the [columns]"
... (the same error repeats for the other stations whose record files do not contain an SNWD column)
       STATION       DATE  LATITUDE  LONGITUDE  SNWD
0  CA001011500 2018-01-03   48.9333    -123.75   0.0
1  CA001011500 2018-01-04   48.9333    -123.75   0.0
2  CA001011500 2018-01-05   48.9333    -123.75   0.0
3  CA001011500 2018-01-06   48.9333    -123.75   0.0
4  CA001011500 2018-01-07   48.9333    -123.75   0.0
All the data are saved to ../Data/active_station_only_list.csv_all_vars.csv
  • Station Processing with Dask: Each station’s data is processed using Dask’s delayed function, allowing for parallel data retrieval. This significantly speeds up the process, especially when dealing with a large number of stations.

  • Data Features: The primary features extracted from each station’s dataset are:

    • STATION: The unique identifier for the weather station.

    • DATE: The date of the observation.

    • LATITUDE: The latitude of the station.

    • LONGITUDE: The longitude of the station.

    • SNWD: Snow depth in millimeters.

  • Data Filtering: For each station, the script retrieves snow depth data (SNWD) and filters it based on the specified training period. This ensures that only relevant data is included in the final dataset.

  • Data Concatenation and Saving: The retrieved data is concatenated into a single DataFrame and saved to a CSV file for further analysis.

3.4.3.2 Saving the Processed Data#

The final snow depth data, now filtered and cleaned, is saved to a CSV file. This file serves as the foundation for subsequent analyses and SWE prediction tasks. By automating the data retrieval and cleaning process, the script ensures that the dataset is ready for immediate use in model training.
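The combined CSV can be large. Since dask.dataframe is imported at the top of the script (as dd) but not otherwise used, one option is to inspect the saved file lazily; the summary below is an illustrative sketch, not part of the original workflow:

import dask.dataframe as dd

# Lazily read the combined snow depth CSV; assume_missing guards against
# integer columns that contain gaps in later partitions
snow_ddf = dd.read_csv(snowdepth_csv_file, assume_missing=True)
print("rows:", len(snow_ddf))
station_max_depth = snow_ddf.groupby('STATION')['SNWD'].max().compute()
print(station_max_depth.head())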

3.4.4 Masking Non-Snow Days#

To refine the dataset further, the script adds a swe_value column that separates snow-free days from days with snow on the ground. On days where the observed snow depth is zero, SWE is known to be zero; on days with non-zero snow depth, SWE cannot be inferred from depth alone, so those records are flagged with a placeholder value.

def mask_out_all_non_zero_snowdepth_days():
    
    df = pd.read_csv(snowdepth_csv_file)
    # Create the new column 'swe_value': snow-free days (SNWD == 0) keep a
    # known SWE of 0
    df['swe_value'] = 0

    # Flag days with snow on the ground (SNWD != 0) with the placeholder -999,
    # since SWE cannot be inferred from depth alone
    df.loc[df['SNWD'] != 0, 'swe_value'] = -999

    # Display the first few rows of the DataFrame
    print(df.head())
    df.to_csv(mask_non_snow_days_ghcd_csv_file, index=False)
    
mask_out_all_non_zero_snowdepth_days()
       STATION        DATE  LATITUDE  LONGITUDE  SNWD  swe_value
0  CA001011500  2018-01-03   48.9333    -123.75   0.0          0
1  CA001011500  2018-01-04   48.9333    -123.75   0.0          0
2  CA001011500  2018-01-05   48.9333    -123.75   0.0          0
3  CA001011500  2018-01-06   48.9333    -123.75   0.0          0
4  CA001011500  2018-01-07   48.9333    -123.75   0.0          0
  • Data Masking: The function creates a new column, swe_value, and assigns it a default value of 0. Where snow depth (SNWD) is non-zero, the swe_value is set to -999, effectively masking out these records.

  • Refinement for Analysis: This step lets downstream processing separate days with a known SWE of zero from days where SWE must be obtained from other sources (see the sketch below). The masked data is then saved to a new CSV file, ready for use in further processing or analysis.
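As an illustration of how the masked file might be consumed downstream (not part of the original script; the filtering logic is an assumption), -999 can be treated as a sentinel for unknown SWE:

import pandas as pd

masked_file = '../Data/active_station_only_list.csv_all_vars_masked_non_snow.csv'
df = pd.read_csv(masked_file)

# Rows with swe_value == 0 are snow-free days with a known SWE of zero;
# rows with swe_value == -999 have snow on the ground and an unknown SWE
zero_swe_days = df[df['swe_value'] == 0]
flagged_days = df[df['swe_value'] == -999]
print(f"{len(zero_swe_days)} zero-SWE rows, {len(flagged_days)} flagged rows")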