3.6 Advanced Microwave Scanning Radiometer (AMSR)#
AMSR (Advanced Microwave Scanning Radiometer) is a satellite sensor that measures snow cover and water content by detecting Earth’s microwave emissions.
This chapter focuses on the use of AMSR satellite data to monitor snow cover and snow water equivalent (SWE), key indicators in understanding water resources and climate patterns.
This chapter demonstrates the practical applications of transforming AMSR data into actionable information for scientific and environmental purposes.
3.6.1 Generating Daily Snow Data Links from AMSR#
Our goal is to generate a list of download links for AMSR daily snow data files for a specified date range.
AMSR: Satellite sensor providing daily snow data.
start_year, end_year: Define the date range for which data links are generated.
base_url: The starting point of each download link.
timedelta: Used to iterate over each day within the date range.
generate_links: Function that builds and returns the list of URLs.
Here we create a list of download links for AMSR snow data by iterating over a date range from 2019 to 2022.
from datetime import datetime, timedelta
import os
import subprocess

def generate_links(start_year, end_year):
    '''
    Generate a list of download links for AMSR daily snow data files.

    Args:
        start_year (int): The starting year.
        end_year (int): The ending year (inclusive).

    Returns:
        list: A list of download links for AMSR daily snow data files.
    '''
    base_url = "https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/"
    url_date_format = "%Y.%m.%d"
    file_date_format = "%Y%m%d"
    delta = timedelta(days=1)
    start_date = datetime(start_year, 1, 1)
    end_date = datetime(end_year + 1, 1, 1)

    links = []
    current_date = start_date
    while current_date < end_date:
        url_date_str = current_date.strftime(url_date_format)
        file_date_str = current_date.strftime(file_date_format)
        link = base_url + url_date_str + "/AMSR_U2_L3_DailySnow_B02_" + file_date_str + ".he5"
        links.append(link)
        current_date += delta
    return links

start_year = 2019
end_year = 2022
work_dir = "../data/gridmet_test_run"

links = generate_links(start_year, end_year)
save_location = f'{work_dir}/amsr'

with open(f'{work_dir}/download_links.txt', "w") as txt_file:
    for l in links:
        txt_file.write(l + "\n")
3.6.2 Automated Script for Secure File Download with Wget from a List of URLs#
The following shell script automates the downloading of files by looping through the links we obtained in 3.6.1.
work_dir: Specifies the working directory where the download links and files will be stored.
input_file: The text file containing the list of URLs to be downloaded.
base_wget_command: The core command used to download files, with options for authentication, session management, and secure connections.
output_directory: The folder where the downloaded files will be saved.
Loop: Iterates over each URL in the input file, ensuring all files are downloaded.
Note: Before proceeding with the download, ensure you have logged in to the Earthdata website. You will need to retrieve the session cookies to authenticate your wget requests.
Visit the Earthdata Login page and log in with your credentials.
After logging in, use a browser extension like EditThisCookie to export the session cookies.
If the cookies are exported in JSON format, convert them to the standard wget format; you can use a cookie converter tool or manually extract the relevant cookies (a rough conversion sketch follows).
Save the cookies in a file named cookies.txt in the format required by wget.
The cookies are crucial for authenticating your requests and ensuring successful downloads. In the wget download script below, the output is shown for only 6 files.
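If you need to perform that conversion yourself, here is a minimal sketch. It assumes an EditThisCookie-style JSON export and example file names (cookies.json, cookies.txt), so adjust the field names and paths to whatever your extension actually produces.

import json

def json_cookies_to_netscape(json_path="cookies.json", out_path="cookies.txt"):
    # Convert a JSON cookie export into the tab-separated Netscape format wget expects
    with open(json_path) as f:
        cookies = json.load(f)
    lines = ["# Netscape HTTP Cookie File"]
    for c in cookies:
        lines.append("\t".join([
            c.get("domain", ""),
            "TRUE" if c.get("domain", "").startswith(".") else "FALSE",
            c.get("path", "/"),
            "TRUE" if c.get("secure", False) else "FALSE",
            str(int(c.get("expirationDate", 0))),   # session cookies fall back to 0
            c.get("name", ""),
            c.get("value", ""),
        ]))
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")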
%%bash
#!/bin/bash

# Specify the file containing the download links
input_file="../data/download_links_updated.txt"

# Specify the cookies file location
cookies_file="../data/cookies.txt"

# Ensure the cookies file exists (assumes it's been generated previously)
if [ ! -f "$cookies_file" ]; then
    echo "Cookies file not found: $cookies_file"
    exit 1
fi

# Specify the base wget command with common options
base_wget_command="wget --load-cookies $cookies_file --save-cookies mycookies.txt --keep-session-cookies --no-check-certificate --progress=bar:force --quiet"

# Specify the output directory for downloaded files
output_directory="../data/gridmet_test_run/amsr"

# Ensure the output directory exists
mkdir -p "$output_directory"

# Loop through each line (URL) in the input file and download it using wget
while IFS= read -r url || [[ -n "$url" ]]; do
    echo "Downloading: $url"
    $base_wget_command -P "$output_directory" "$url"
done < "$input_file"
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.01/AMSR_U2_L3_DailySnow_B02_20190101.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.02/AMSR_U2_L3_DailySnow_B02_20190102.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.03/AMSR_U2_L3_DailySnow_B02_20190103.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.04/AMSR_U2_L3_DailySnow_B02_20190104.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.05/AMSR_U2_L3_DailySnow_B02_20190105.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.06/AMSR_U2_L3_DailySnow_B02_20190106.he5
3.6.3 Extracting Features from AMSR Data#
Once the AMSR data files are downloaded, the next step is to extract relevant features from these files.
The following script accomplishes this task by processing each AMSR data file and extracting snow water equivalent (SWE)
values for specific grid cells corresponding to SNOTEL weather stations.
Let's break down each step involved in feature extraction.
3.6.3.1 Finding the Closest Grid Cell Index for Given Latitude and Longitude#
target_latitude, target_longitude: The coordinates of the specific location you want to match to a grid cell.
lat_grid, lon_grid: Arrays representing the grid of latitude and longitude values across a region.
np.unravel_index: Identifies the indices in the grid where the sum of latitude and longitude differences is minimized.
Here we calculate the absolute differences between the target location and each point in the grid to find the closest match:
lat_diff = np.abs(lat_grid - target_latitude)
lon_diff = np.abs(lon_grid - target_longitude)
Through this code snippet, we get the row index (lat_idx), the column index (lon_idx), and the actual latitude and longitude of the closest grid cell.
def find_closest_index(target_latitude, target_longitude, lat_grid, lon_grid):
    '''
    Find the index of the grid cell with the closest coordinates to the target latitude and longitude.

    Args:
        target_latitude (float): The target latitude.
        target_longitude (float): The target longitude.
        lat_grid (numpy.ndarray): An array of latitude values.
        lon_grid (numpy.ndarray): An array of longitude values.

    Returns:
        Tuple[int, int, float, float]: A tuple containing the row index, column index, closest latitude, and closest longitude.
    '''
    # Compute the absolute differences between target and grid coordinates
    lat_diff = np.abs(lat_grid - target_latitude)
    lon_diff = np.abs(lon_grid - target_longitude)

    # Find the indices corresponding to the minimum differences
    lat_idx, lon_idx = np.unravel_index(np.argmin(lat_diff + lon_diff), lat_grid.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
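A tiny self-contained usage sketch (the 3x3 grid values are invented purely for illustration) shows what the function returns:

import numpy as np

lat_grid = np.array([[40.0, 40.0, 40.0],
                     [41.0, 41.0, 41.0],
                     [42.0, 42.0, 42.0]])
lon_grid = np.array([[-120.0, -119.0, -118.0],
                     [-120.0, -119.0, -118.0],
                     [-120.0, -119.0, -118.0]])

# The target point (40.9, -119.2) is nearest to the cell at row 1, column 1
lat_idx, lon_idx, lat, lon = find_closest_index(40.9, -119.2, lat_grid, lon_grid)
print(lat_idx, lon_idx, lat, lon)   # 1 1 41.0 -119.0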
3.6.3.2 Function to Map SNOTEL Stations to AMSR Grid Coordinates and Create a CSV Mapper#
Next we map SNOTEL station locations to the nearest AMSR grid cells and save this mapping as a CSV file.
new_base_station_list_file: A CSV file containing the latitude and longitude of various SNOTEL stations.
target_csv_path: The file path where the output CSV file will be saved.
hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']: Accesses the Northern Hemisphere data group within the HDF5 file.
Here we read the station data, check whether the mapping file already exists, and download the necessary AMSR data file if it is not already present, using cmd = f"curl --output {target_amsr_hdf_path} ...".
For each SNOTEL station, we identify the closest AMSR grid cell by comparing latitudes and longitudes, and then save the SNOTEL-to-AMSR mapping to a CSV file with df.to_csv(target_csv_path, index=False).
This is useful for comparing ground-based measurements with satellite observations. By finding the closest grid point in the AMSR dataset to each SNOTEL station, scientists and researchers can analyze and compare the data more effectively. This helps in improving weather predictions, studying climate patterns, and better understanding environmental conditions.
def create_snotel_station_to_amsr_mapper(new_base_station_list_file, target_csv_path):
    station_data = pd.read_csv(new_base_station_list_file)
    date = "2023-01-01"
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")

    # Check if the CSV already exists
    if os.path.exists(target_csv_path):
        print(f"File {os.path.basename(target_csv_path)} already exists, skipping..")
        df = pd.read_csv(target_csv_path)
        return df

    print('date is ', date)
    target_amsr_hdf_path = f"../data/gridmet_test_run/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path):
        print(f"File {target_amsr_hdf_path} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ~/.urs_cookies -c ~/.urs_cookies -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        subprocess.run(cmd, shell=True)

    df = pd.DataFrame(columns=['amsr_lat', 'amsr_lon',
                               'amsr_lat_idx', 'amsr_lon_idx',
                               'station_lat', 'station_lon'])

    # Read the HDF
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]

    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)

    # Convert the AMSR grid into our gridMET 1km grid
    for idx, row in station_data.iterrows():
        target_lat = row['latitude']
        target_lon = row['longitude']
        # compare the performance and find the fastest way to search nearest point
        closest_lat_idx, closest_lon_idx, closest_lat, closest_lon = find_closest_index(target_lat, target_lon, lat, lon)
        df.loc[len(df.index)] = [closest_lat,
                                 closest_lon,
                                 closest_lat_idx,
                                 closest_lon_idx,
                                 target_lat,
                                 target_lon]

    # Save the new converted AMSR to CSV file
    df.to_csv(target_csv_path, index=False)
    print('AMSR mapper csv is created.')
    return df
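A hypothetical call might look like the following; the paths are the ones used later in this chapter, so adjust them to your own layout.

new_base_station_list_file = "../data/gridmet_test_run/all_training_points_snotel_ghcnd_in_westus.csv"
target_csv_path = "../data/gridmet_test_run/training_snotel_station_to_amsr_mapper.csv"
mapper_df = create_snotel_station_to_amsr_mapper(new_base_station_list_file, target_csv_path)
print(mapper_df.head())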
3.6.3.3 Extracting and Saving AMSR Snow Data to CSV#
Next, we extract snow water equivalent (SWE)
data from AMSR files for a range of dates, match it to specific locations (such as SNOTEL stations), and save the processed data into a CSV file.
amsr_data_dir: Directory containing the AMSR .he5 files.
Here we use the pre-generated mapping of SNOTEL stations to AMSR grid cells from 3.6.3.2 to extract the relevant SWE values.
Parallel processing with Dask: Dask is used to process the large set of files in parallel, making the extraction faster (a minimal standalone illustration of the delayed/compute pattern follows this list).
dask_station_data = dd.from_pandas(mapper_df, npartitions=1): Converts the DataFrame into a Dask DataFrame for parallel processing.
mapper_df = create_snotel_station_to_amsr_mapper(new_base_station_list_file, target_csv_path): Creates the mapping between SNOTEL stations and AMSR grid cells.
swe = hem_group['Data Fields/SWE_NorthernDaily'][:]: Extracts the Snow Water Equivalent (SWE) data from the AMSR HDF5 file.
delayed_results = [process_row(row, swe, new_date_str) for _, row in mapper_df.iterrows()]: Uses Dask's delayed function to process each row of the DataFrame in parallel.
processed_data = dask_bag.map(process_file).filter(lambda x: x is not None).compute(): Processes all the AMSR files in parallel, filtering out any None results.
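Before reading the full function, here is a minimal, self-contained sketch of the dask.delayed pattern it relies on: each decorated call only records a lazy task, and dask.compute runs them all together.

import dask
from dask import delayed

@delayed
def square(x):
    # Nothing runs yet; calling square(i) only records a task
    return x * x

tasks = [square(i) for i in range(5)]
print(dask.compute(*tasks))   # (0, 1, 4, 9, 16)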
def extract_amsr_values_save_to_csv(amsr_data_dir, output_csv_file, new_base_station_list_file, start_date, end_date):
    if os.path.exists(output_csv_file):
        os.remove(output_csv_file)

    target_csv_path = "../data/gridmet_test_run/training_snotel_station_to_amsr_mapper.csv"
    mapper_df = create_snotel_station_to_amsr_mapper(new_base_station_list_file,
                                                     target_csv_path)
    # station_data = pd.read_csv(new_base_station_list_file)
    start_date = datetime.strptime(start_date, "%Y-%m-%d")
    end_date = datetime.strptime(end_date, "%Y-%m-%d")

    # Create a Dask DataFrame
    dask_station_data = dd.from_pandas(mapper_df, npartitions=1)

    # Function to process each file
    def process_file(filename):
        file_path = os.path.join(amsr_data_dir, filename)
        file = h5py.File(file_path, 'r')
        hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
        date_str = filename.split('_')[-1].split('.')[0]
        date = datetime.strptime(date_str, '%Y%m%d')
        if not (start_date <= date <= end_date):
            print(f"{date} is not in the training period, skipping..")
            return None

        new_date_str = date.strftime("%Y-%m-%d")
        swe = hem_group['Data Fields/SWE_NorthernDaily'][:]
        flag = hem_group['Data Fields/Flags_NorthernDaily'][:]

        # Create an empty Pandas DataFrame with the desired columns
        result_df = pd.DataFrame(columns=['date', 'lat', 'lon', 'AMSR_SWE'])
        print('empty df ', result_df)

        # Sample loop to add rows to the Pandas DataFrame using dask.delayed
        @delayed
        def process_row(row, swe, new_date_str):
            closest_lat_idx = int(row['amsr_lat_idx'])
            closest_lon_idx = int(row['amsr_lon_idx'])
            closest_swe = swe[closest_lat_idx, closest_lon_idx]
            return pd.DataFrame([[new_date_str,
                                  row['station_lat'],
                                  row['station_lon'],
                                  closest_swe]],
                                columns=result_df.columns)

        # List of delayed computations
        delayed_results = [process_row(row, swe, new_date_str) for _, row in mapper_df.iterrows()]

        # Compute the delayed results and concatenate them into a Pandas DataFrame
        result_df = dask.compute(*delayed_results)
        result_df = pd.concat(result_df, ignore_index=True)

        # Print the final Pandas DataFrame
        print(result_df)
        return result_df

    # Get the list of files
    files = [f for f in os.listdir(amsr_data_dir) if f.endswith('.he5')]

    # Create a Dask Bag from the files
    dask_bag = db.from_sequence(files, npartitions=2)

    # Process files in parallel
    processed_data = dask_bag.map(process_file).filter(lambda x: x is not None).compute()
    print(processed_data)

    # Concatenate the processed data
    combined_df = pd.concat(processed_data, ignore_index=True)

    # Save the combined DataFrame to a CSV file
    combined_df.to_csv(output_csv_file, index=False)
    print(f"Merged data saved to {os.path.basename(output_csv_file)}")
3.6.3.4 Running the AMSR Data Extraction Process#
Here we extract and save AMSR snow data for a specified range of dates, link it to SNOTEL stations, and store the results in a CSV file.
new_base_station_list_file: CSV file containing a list of active SNOTEL stations in the western U.S.
Date range: The start and end dates (train_start_date and train_end_date) define the period for which the data will be processed.
extract_amsr_values_save_to_csv: The function discussed in 3.6.3.3; it processes the AMSR data and saves the output to a CSV file.
Here is the main entry point for a script that prepares and processes AMSR data, mapping it to specific training points that include SNOTEL and GHCND (Global Historical Climatology Network Daily) stations.
By mapping SNOTEL and GHCND stations to the closest AMSR grid points and extracting relevant data over a specified time period, this script prepares a dataset that can be used for further analysis or modeling.
amsr_data_dir = "../data/gridmet_test_run/amsr"
all_training_points_with_snotel_ghcnd_file = "../data/gridmet_test_run/all_training_points_snotel_ghcnd_in_westus.csv"
new_base_df = pd.read_csv(all_training_points_with_snotel_ghcnd_file)
print(new_base_df.head())
output_csv_file = f"{all_training_points_with_snotel_ghcnd_file}_amsr_dask_all_training_ponits_with_ghcnd.csv"
start_date = train_start_date
end_date = train_end_date
extract_amsr_values_save_to_csv(amsr_data_dir, output_csv_file, all_training_points_with_snotel_ghcnd_file, start_date, end_date)
latitude longitude modis_x modis_y
0 39.95500 -120.53800 123 251
1 42.95000 -112.83333 337 168
2 36.23333 -106.43333 515 354
3 36.23700 -106.42912 515 354
4 44.45615 -113.30097 324 126
File training_snotel_station_to_amsr_mapper.csv already exists, skipping..
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-24 39.95500 -120.53800 255
1 2022-12-24 42.95000 -112.83333 25
2 2022-12-24 36.23333 -106.43333 0
3 2022-12-24 36.23700 -106.42912 0
4 2022-12-24 44.45615 -113.30097 18
... ... ... ... ...
8056 2022-12-24 48.40890 -106.51440 20
8057 2022-12-24 48.54250 -109.76440 15
8058 2022-12-24 45.54940 -100.40860 40
8059 2022-12-24 43.53170 -112.94220 30
8060 2022-12-24 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-30 39.95500 -120.53800 0
1 2022-12-30 42.95000 -112.83333 70
2 2022-12-30 36.23333 -106.43333 0
3 2022-12-30 36.23700 -106.42912 0
4 2022-12-30 44.45615 -113.30097 44
... ... ... ... ...
8056 2022-12-30 48.40890 -106.51440 34
8057 2022-12-30 48.54250 -109.76440 56
8058 2022-12-30 45.54940 -100.40860 38
8059 2022-12-30 43.53170 -112.94220 55
8060 2022-12-30 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-31 39.95500 -120.53800 255
1 2022-12-31 42.95000 -112.83333 0
2 2022-12-31 36.23333 -106.43333 0
3 2022-12-31 36.23700 -106.42912 0
4 2022-12-31 44.45615 -113.30097 28
... ... ... ... ...
8056 2022-12-31 48.40890 -106.51440 30
8057 2022-12-31 48.54250 -109.76440 60
8058 2022-12-31 45.54940 -100.40860 39
8059 2022-12-31 43.53170 -112.94220 39
8060 2022-12-31 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-25 39.95500 -120.53800 0
1 2022-12-25 42.95000 -112.83333 29
2 2022-12-25 36.23333 -106.43333 255
3 2022-12-25 36.23700 -106.42912 255
4 2022-12-25 44.45615 -113.30097 27
... ... ... ... ...
8056 2022-12-25 48.40890 -106.51440 18
8057 2022-12-25 48.54250 -109.76440 0
8058 2022-12-25 45.54940 -100.40860 33
8059 2022-12-25 43.53170 -112.94220 27
8060 2022-12-25 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-28 39.95500 -120.53800 0
1 2022-12-28 42.95000 -112.83333 255
2 2022-12-28 36.23333 -106.43333 0
3 2022-12-28 36.23700 -106.42912 0
4 2022-12-28 44.45615 -113.30097 255
... ... ... ... ...
8056 2022-12-28 48.40890 -106.51440 24
8057 2022-12-28 48.54250 -109.76440 0
8058 2022-12-28 45.54940 -100.40860 25
8059 2022-12-28 43.53170 -112.94220 255
8060 2022-12-28 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-29 39.95500 -120.53800 32
1 2022-12-29 42.95000 -112.83333 55
2 2022-12-29 36.23333 -106.43333 0
3 2022-12-29 36.23700 -106.42912 0
4 2022-12-29 44.45615 -113.30097 31
... ... ... ... ...
8056 2022-12-29 48.40890 -106.51440 28
8057 2022-12-29 48.54250 -109.76440 39
8058 2022-12-29 45.54940 -100.40860 255
8059 2022-12-29 43.53170 -112.94220 59
8060 2022-12-29 47.68720 -122.25530 255
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-27 39.95500 -120.53800 0
1 2022-12-27 42.95000 -112.83333 19
2 2022-12-27 36.23333 -106.43333 255
3 2022-12-27 36.23700 -106.42912 255
4 2022-12-27 44.45615 -113.30097 23
... ... ... ... ...
8056 2022-12-27 48.40890 -106.51440 17
8057 2022-12-27 48.54250 -109.76440 0
8058 2022-12-27 45.54940 -100.40860 19
8059 2022-12-27 43.53170 -112.94220 30
8060 2022-12-27 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-26 39.95500 -120.53800 0
1 2022-12-26 42.95000 -112.83333 0
2 2022-12-26 36.23333 -106.43333 0
3 2022-12-26 36.23700 -106.42912 0
4 2022-12-26 44.45615 -113.30097 22
... ... ... ... ...
8056 2022-12-26 48.40890 -106.51440 23
8057 2022-12-26 48.54250 -109.76440 31
8058 2022-12-26 45.54940 -100.40860 37
8059 2022-12-26 43.53170 -112.94220 29
8060 2022-12-26 47.68720 -122.25530 0
[8061 rows x 4 columns]
[ date lat lon AMSR_SWE
0 2022-12-28 39.95500 -120.53800 0
1 2022-12-28 42.95000 -112.83333 255
2 2022-12-28 36.23333 -106.43333 0
3 2022-12-28 36.23700 -106.42912 0
4 2022-12-28 44.45615 -113.30097 255
... ... ... ... ...
8056 2022-12-28 48.40890 -106.51440 24
8057 2022-12-28 48.54250 -109.76440 0
8058 2022-12-28 45.54940 -100.40860 25
8059 2022-12-28 43.53170 -112.94220 255
8060 2022-12-28 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-29 39.95500 -120.53800 32
1 2022-12-29 42.95000 -112.83333 55
2 2022-12-29 36.23333 -106.43333 0
3 2022-12-29 36.23700 -106.42912 0
4 2022-12-29 44.45615 -113.30097 31
... ... ... ... ...
8056 2022-12-29 48.40890 -106.51440 28
8057 2022-12-29 48.54250 -109.76440 39
8058 2022-12-29 45.54940 -100.40860 255
8059 2022-12-29 43.53170 -112.94220 59
8060 2022-12-29 47.68720 -122.25530 255
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-27 39.95500 -120.53800 0
1 2022-12-27 42.95000 -112.83333 19
2 2022-12-27 36.23333 -106.43333 255
3 2022-12-27 36.23700 -106.42912 255
4 2022-12-27 44.45615 -113.30097 23
... ... ... ... ...
8056 2022-12-27 48.40890 -106.51440 17
8057 2022-12-27 48.54250 -109.76440 0
8058 2022-12-27 45.54940 -100.40860 19
8059 2022-12-27 43.53170 -112.94220 30
8060 2022-12-27 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-26 39.95500 -120.53800 0
1 2022-12-26 42.95000 -112.83333 0
2 2022-12-26 36.23333 -106.43333 0
3 2022-12-26 36.23700 -106.42912 0
4 2022-12-26 44.45615 -113.30097 22
... ... ... ... ...
8056 2022-12-26 48.40890 -106.51440 23
8057 2022-12-26 48.54250 -109.76440 31
8058 2022-12-26 45.54940 -100.40860 37
8059 2022-12-26 43.53170 -112.94220 29
8060 2022-12-26 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-24 39.95500 -120.53800 255
1 2022-12-24 42.95000 -112.83333 25
2 2022-12-24 36.23333 -106.43333 0
3 2022-12-24 36.23700 -106.42912 0
4 2022-12-24 44.45615 -113.30097 18
... ... ... ... ...
8056 2022-12-24 48.40890 -106.51440 20
8057 2022-12-24 48.54250 -109.76440 15
8058 2022-12-24 45.54940 -100.40860 40
8059 2022-12-24 43.53170 -112.94220 30
8060 2022-12-24 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-30 39.95500 -120.53800 0
1 2022-12-30 42.95000 -112.83333 70
2 2022-12-30 36.23333 -106.43333 0
3 2022-12-30 36.23700 -106.42912 0
4 2022-12-30 44.45615 -113.30097 44
... ... ... ... ...
8056 2022-12-30 48.40890 -106.51440 34
8057 2022-12-30 48.54250 -109.76440 56
8058 2022-12-30 45.54940 -100.40860 38
8059 2022-12-30 43.53170 -112.94220 55
8060 2022-12-30 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-31 39.95500 -120.53800 255
1 2022-12-31 42.95000 -112.83333 0
2 2022-12-31 36.23333 -106.43333 0
3 2022-12-31 36.23700 -106.42912 0
4 2022-12-31 44.45615 -113.30097 28
... ... ... ... ...
8056 2022-12-31 48.40890 -106.51440 30
8057 2022-12-31 48.54250 -109.76440 60
8058 2022-12-31 45.54940 -100.40860 39
8059 2022-12-31 43.53170 -112.94220 39
8060 2022-12-31 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-25 39.95500 -120.53800 0
1 2022-12-25 42.95000 -112.83333 29
2 2022-12-25 36.23333 -106.43333 255
3 2022-12-25 36.23700 -106.42912 255
4 2022-12-25 44.45615 -113.30097 27
... ... ... ... ...
8056 2022-12-25 48.40890 -106.51440 18
8057 2022-12-25 48.54250 -109.76440 0
8058 2022-12-25 45.54940 -100.40860 33
8059 2022-12-25 43.53170 -112.94220 27
8060 2022-12-25 47.68720 -122.25530 0
[8061 rows x 4 columns]]
Merged data saved to all_training_points_snotel_ghcnd_in_westus.csv_amsr_dask_all_training_ponits_with_ghcnd.csv
3.6.4 AMSR Data Processing and Analysis Pipeline#
In this section, we see how to automate the process of downloading, mapping, processing, and analyzing AMSR (Advanced Microwave Scanning Radiometer) data.
It provides a streamlined and automated pipeline for handling AMSR data, from initial download and grid alignment to final data processing and analysis.
3.6.4.1 Find Closest Point in a Grid#
target_latitude (float): The latitude of the target point.
target_longitude (float): The longitude of the target point.
lat_grid (numpy array): A 2D array representing the grid of latitudes.
lon_grid (numpy array): A 2D array representing the grid of longitudes.
Here we calculate the squared Euclidean distance between the target point and each point in the grid, then find the grid point with the minimum distance to the target point and return the indices of that point along with its actual latitude and longitude.
def find_closest_index_numpy(target_latitude, target_longitude, lat_grid, lon_grid):
    # Calculate the squared Euclidean distance between the target point and all grid points
    distance_squared = (lat_grid - target_latitude)**2 + (lon_grid - target_longitude)**2

    # Find the indices of the minimum distance
    lat_idx, lon_idx = np.unravel_index(np.argmin(distance_squared), distance_squared.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
3.6.4.2 Find the closest grid point indices for a target latitude and longitude using KDTree#
Here we find the closest grid point indices to a given target latitude and longitude using a KDTree for efficient spatial searching.
lat_idx (int): The index of the closest latitude in the grid.
lon_idx (int): The index of the closest longitude in the grid.
lat_grid[lat_idx, lon_idx] (float): The actual latitude of the closest grid point.
lon_grid[lat_idx, lon_idx] (float): The actual longitude of the closest grid point.
We use a global KDTree (latlontree) to efficiently search for the nearest grid point. If the KDTree is not already built, we construct one from the latitude and longitude grids.
Before constructing the KDTree, we replace any NaN values in the latitude and longitude grids with 0.0 to ensure the tree is built without issues. The KDTree is then queried with the target latitude and longitude to find the nearest point in the grid.
Then we convert the 1D index returned by the KDTree back into 2D grid indices to identify the exact location in the original latitude and longitude grids.
Finally, we return the indices of the closest latitude and longitude, along with the corresponding values from the grid.
def find_closest_index_tree(target_latitude, target_longitude, lat_grid, lon_grid):
    """
    Find the closest grid point indices for a target latitude and longitude using KDTree.

    Parameters:
        target_latitude (float): Target latitude.
        target_longitude (float): Target longitude.
        lat_grid (numpy.ndarray): Array of latitude values.
        lon_grid (numpy.ndarray): Array of longitude values.

    Returns:
        int: Latitude index.
        int: Longitude index.
        float: Closest latitude value.
        float: Closest longitude value.
    """
    global latlontree
    if latlontree is None:
        # Create a KD-Tree from lat_grid and lon_grid
        lat_grid_cleaned = np.nan_to_num(lat_grid, nan=0.0)  # Replace NaN with 0
        lon_grid_cleaned = np.nan_to_num(lon_grid, nan=0.0)  # Replace NaN with 0
        latlontree = KDTree(list(zip(lat_grid_cleaned.ravel(), lon_grid_cleaned.ravel())))

    # Query the KD-Tree to find the nearest point
    distance, index = latlontree.query([target_latitude, target_longitude])

    # Convert the 1D index to 2D grid indices
    lat_idx, lon_idx = np.unravel_index(index, lat_grid.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
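Because the function caches the tree in a module-level latlontree variable, that name must exist (typically set to None) before the first call, and it should be reset whenever a new grid is used. A minimal usage sketch with a toy 2x2 grid (values invented for illustration):

import numpy as np
from scipy.spatial import KDTree

latlontree = None   # must be defined before the first call; reset when the grid changes

lat_grid = np.array([[40.0, 40.0],
                     [41.0, 41.0]])
lon_grid = np.array([[-120.0, -119.0],
                     [-120.0, -119.0]])

print(find_closest_index_tree(40.9, -119.2, lat_grid, lon_grid))
# (1, 1, 41.0, -119.0) -- the tree is built on the first call and reused afterwards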
3.6.4.3 Find the closest grid point indices for a target latitude and longitude#
Here we find the grid point in a latitude-longitude array that is closest to a given target latitude and longitude.
We compute the absolute difference between the target latitude and longitude and all points in lat_grid and lon_grid, and then sum these differences to create a simple distance metric for each grid point.
Next, we identify the grid point with the smallest sum of differences, which is considered the closest point to the target location.
Finally, we return the indices and the actual latitude and longitude values of this closest grid point.
def find_closest_index(target_latitude, target_longitude, lat_grid, lon_grid):
    """
    Find the closest grid point indices for a target latitude and longitude.

    Parameters:
        target_latitude (float): Target latitude.
        target_longitude (float): Target longitude.
        lat_grid (numpy.ndarray): Array of latitude values.
        lon_grid (numpy.ndarray): Array of longitude values.

    Returns:
        int: Latitude index.
        int: Longitude index.
        float: Closest latitude value.
        float: Closest longitude value.
    """
    lat_diff = np.float64(np.abs(lat_grid - target_latitude))
    lon_diff = np.float64(np.abs(lon_grid - target_longitude))

    # Find the indices corresponding to the minimum differences
    lat_idx, lon_idx = np.unravel_index(np.argmin(lat_diff + lon_diff), lat_grid.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
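The code comments note that the three nearest-point searches should be compared for speed. A rough timing sketch on a synthetic grid (grid size and coordinates are invented for illustration; it assumes latlontree has been initialized as in the sketch above):

import time
import numpy as np

n = 500
lat_grid, lon_grid = np.meshgrid(np.linspace(30, 80, n),
                                 np.linspace(-130, -60, n), indexing="ij")

latlontree = None   # reset the cached KD-Tree so it is rebuilt for this grid
for fn in (find_closest_index, find_closest_index_numpy, find_closest_index_tree):
    start = time.time()
    fn(44.45, -113.30, lat_grid, lon_grid)
    print(fn.__name__, f"{time.time() - start:.4f} s")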
3.6.4.4 Preparing the AMSR to GridMET Mapper#
The goal here is to create a mapping between AMSR grid data and GridMET grid points, saving the results to a CSV file.
target_csv_path: The file path where the mapping between AMSR and GridMET grid points will be saved as a CSV file.
target_amsr_hdf_path: The path where the AMSR data file is stored, or will be downloaded to if it doesn't exist.
western_us_coords: A CSV file containing the latitude and longitude of GridMET grid points for the western U.S.
file = h5py.File(target_amsr_hdf_path, 'r'): Opens the AMSR HDF5 file for reading, allowing access to its contents.
lat = np.nan_to_num(lat, nan=0.0): Replaces any NaN values in the latitude data with zeros to avoid errors during processing.
closest_lat_idx, closest_lon_idx, closest_lat, closest_lon = find_closest_index(target_lat, target_lon, lat, lon): Finds the closest AMSR grid point to each target GridMET point, which is crucial for accurate data mapping.
This code is essential for linking the satellite-based AMSR data grid with the ground-based GridMET grid.
def prepare_amsr_grid_mapper():
    df = pd.DataFrame(columns=['amsr_lat', 'amsr_lon',
                               'amsr_lat_idx', 'amsr_lon_idx',
                               'gridmet_lat', 'gridmet_lon'])
    date = test_start_date
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")

    # Check if the CSV already exists
    target_csv_path = "../data/amsr_to_gridmet_mapper.csv"
    if os.path.exists(target_csv_path):
        print(f"File {os.path.basename(target_csv_path)} already exists, skipping..")
        return

    target_amsr_hdf_path = f"../data/gridmet_test_run/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path):
        print(f"File {os.path.basename(target_amsr_hdf_path)} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ../data/cookies.txt -c ~/.mycookies.txt -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        subprocess.run(cmd, shell=True)

    # Read the HDF
    print('target_amsr_hdf_path', target_amsr_hdf_path)
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]

    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)

    # Set the number of rows to process
    num_rows_to_process = 100  # Change this value to the desired number of rows

    # Convert the AMSR grid into our gridMET 1km grid
    western_us_df = pd.read_csv(western_us_coords)
    for idx, row in western_us_df.iterrows():
        if idx >= num_rows_to_process:
            break
        target_lat = row['Latitude']
        target_lon = row['Longitude']
        # compare the performance and find the fastest way to search nearest point
        closest_lat_idx, closest_lon_idx, closest_lat, closest_lon = find_closest_index(target_lat, target_lon, lat, lon)
        df.loc[len(df.index)] = [closest_lat,
                                 closest_lon,
                                 closest_lat_idx,
                                 closest_lon_idx,
                                 target_lat,
                                 target_lon]

    # Save the new converted AMSR to CSV file
    df.to_csv(target_csv_path, index=False)
    print('AMSR mapper csv is created.')
3.6.4.5 Downloading and Converting AMSR Snow Data to DEM Format#
Here we automate the downloading, conversion, and saving of AMSR data aligned with a DEM grid. We also add a cumulative-sum column to a DataFrame, which is useful for tracking cumulative metrics over time.
target_date: The specific date for which AMSR data is being processed, initially set to test_start_date.
target_mapper_csv_path: The path to the CSV file that maps AMSR grid points to GridMET grid points.
target_csv_path: The file path where the final converted data will be saved as a CSV.
target_amsr_hdf_path: The path where the downloaded AMSR HDF5 file will be stored.
mapper_df['AMSR_SWE'] = mapper_df.apply(get_swe, axis=1): Applies the get_swe function to each row in the mapping DataFrame to calculate the SWE value for each GridMET point.
mapper_df.drop(columns=['amsr_lat', 'amsr_lon', ...]): Removes unnecessary columns from the DataFrame before saving the final results.
Here we first construct the URL for the AMSR data corresponding to the specified date and attempt to download it using the curl command, making sure the necessary cookies are available for authentication.
Once the data is downloaded, we read the HDF5 file with the h5py library, extracting the latitude, longitude, snow water equivalent (SWE), and flag information.
We then convert the AMSR grid into a format compatible with the DEM grid by finding the corresponding grid points in the DEM grid; this involves identifying the nearest DEM grid point for each AMSR grid point.
For each DEM grid point, we perform a custom calculation to determine the SWE and flag values based on the nearest AMSR grid points.
Finally, we save the converted data, including the latitude, longitude, SWE, and flag information, into a CSV file, which can be further processed or analyzed in subsequent steps of the script.
def download_amsr_and_convert_grid(target_date=test_start_date):
    """
    Download AMSR snow data, convert it to DEM format, and save as a CSV file.
    """
    # the mapper
    target_mapper_csv_path = f'{work_dir}/amsr_to_gridmet_mapper.csv'
    mapper_df = pd.read_csv(target_mapper_csv_path)
    #print(mapper_df.head())

    df = pd.DataFrame(columns=['date', 'lat',
                               'lon', 'AMSR_SWE',
                               'AMSR_Flag'])
    date = target_date
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")
    target_date = ""

    # Check if the CSV already exists
    target_csv_path = f"../data/gridmet_test_run/amsr_testing/testing_ready_amsr_{date}.csv"
    if os.path.exists(target_csv_path):
        print(f"File {os.path.basename(target_csv_path)} already exists, skipping..")
        return target_csv_path

    target_amsr_hdf_path = f"{work_dir}/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path) and is_binary(target_amsr_hdf_path):
        print(f"File {os.path.basename(target_amsr_hdf_path)} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ../data/cookies.txt -c ~/.urs_cookies -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Check the exit code
        if result.returncode != 0:
            print(f"Command failed with exit code {result.returncode}.")
            if os.path.exists(target_amsr_hdf_path):
                os.remove(target_amsr_hdf_path)
                print(f"Wrong {target_amsr_hdf_path} removed successfully.")
            raise Exception(f"Failed to download {target_amsr_hdf_path} - {result.stderr}")

    # Read the HDF
    print('file is ', target_amsr_hdf_path)
    print(f"Reading {target_amsr_hdf_path}")
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]

    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)
    swe = hem_group['Data Fields/SWE_NorthernDaily'][:]
    flag = hem_group['Data Fields/Flags_NorthernDaily'][:]
    date = datetime.strptime(date, '%Y.%m.%d')

    # Convert the AMSR grid into our DEM 1km grid
    def get_swe(row):
        # Perform your custom calculation here
        closest_lat_idx = int(row['amsr_lat_idx'])
        closest_lon_idx = int(row['amsr_lon_idx'])
        closest_swe = swe[closest_lat_idx, closest_lon_idx]
        return closest_swe

    def get_swe_flag(row):
        # Perform your custom calculation here
        closest_lat_idx = int(row['amsr_lat_idx'])
        closest_lon_idx = int(row['amsr_lon_idx'])
        closest_flag = flag[closest_lat_idx, closest_lon_idx]
        return closest_flag

    # Use the apply function to apply the custom function to each row
    mapper_df['AMSR_SWE'] = mapper_df.apply(get_swe, axis=1)
    mapper_df['AMSR_Flag'] = mapper_df.apply(get_swe_flag, axis=1)
    mapper_df['date'] = date
    mapper_df.rename(columns={'dem_lat': 'lat'}, inplace=True)
    mapper_df.rename(columns={'dem_lon': 'lon'}, inplace=True)
    mapper_df = mapper_df.drop(columns=['amsr_lat',
                                        'amsr_lon',
                                        'amsr_lat_idx',
                                        'amsr_lon_idx'])
    print("result df: ", mapper_df.head())

    # Save the new converted AMSR to CSV file
    print(f"saving the new AMSR SWE to csv: {target_csv_path}")
    mapper_df.to_csv(target_csv_path, index=False)
    print('Completed AMSR testing data collection.')
    return target_csv_path


def add_cumulative_column(df, column_name):
    df[f'cumulative_{column_name}'] = df[column_name].sum()
    return df
3.6.4.6 Aggregate Cumulative AMSR Snow Data and Export to CSV#
The goal of this code is to calculate cumulative Snow Water Equivalent (SWE) values from AMSR data over a specific period, fill any gaps in the data, and save the cumulative results into a CSV file. This is particularly useful for analyzing long-term snow accumulation trends.
target_date: The specific date for which cumulative AMSR data is calculated.
past_october_1: The date of October 1st in the previous or current year, serving as the start of the period.
target_csv_path: The path where the cumulative AMSR data will be saved.
gap_filled_csv: The path where the gap-filled version of the cumulative data will be saved.
past_october_1 = datetime(selected_date.year - 1, 10, 1): Sets the start date for cumulative calculations to October 1st of the previous year if the target date is before October; otherwise, October 1st of the current year is used.
data_dict[current_date_str] = download_amsr_and_convert_grid(current_date_str): Downloads and converts AMSR data for each day from October 1st to the target date, storing the results in a dictionary.
filtered_columns.interpolate(axis=1, method='linear', inplace=True): Fills in missing values in the SWE data using linear interpolation across time.
df[f'cumulative_{column_name}'] = sum_column: Adds a new column to the DataFrame containing the cumulative sum of SWE values.
filled_data.to_csv(gap_filled_csv, index=False): Saves the gap-filled cumulative data to a CSV file.
result.to_csv(target_csv_path, index=False): Saves the final cumulative data for the target date into a CSV file.
This cumulative information is valuable for understanding seasonal snow accumulation and can be used in various environmental analyses and forecasting models. The approach of handling missing data ensures the integrity and completeness of the dataset before it’s saved and used for further analysis.
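The core gap-filling trick in the function below is to treat SWE values above 240 as flag/missing codes, interpolate across the date columns, and then sum into a cumulative column. A tiny standalone illustration of that logic, using made-up values:

import pandas as pd

df = pd.DataFrame({
    "AMSR_SWE_2022-12-24": [10.0, 255.0],   # 255 is a flag code, not a real SWE value
    "AMSR_SWE_2022-12-25": [255.0, 5.0],
    "AMSR_SWE_2022-12-26": [14.0, 7.0],
})
swe = df.filter(like="AMSR_SWE")
swe = swe.mask(swe > 240)                        # drop flag values
swe = swe.interpolate(axis=1, method="linear")   # fill gaps along the dates
swe = swe.fillna(0)                              # anything left (e.g. leading gaps) becomes 0
df["cumulative_AMSR_SWE"] = swe.sum(axis=1)
print(df["cumulative_AMSR_SWE"].tolist())        # [36.0, 12.0]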
def get_cumulative_amsr_data(target_date=test_start_date, force=False):
    selected_date = datetime.strptime(target_date, "%Y-%m-%d")
    print(selected_date)
    if selected_date.month < 10:
        past_october_1 = datetime(selected_date.year - 1, 10, 1)
    else:
        past_october_1 = datetime(selected_date.year, 10, 1)

    # Traverse and print every day from past October 1 to the specific date
    current_date = past_october_1
    target_csv_path = f'{work_dir}/testing_ready_amsr_{target_date}_cumulative.csv'
    columns_to_be_cumulated = ["AMSR_SWE"]

    gap_filled_csv = f"{target_csv_path}_gap_filled.csv"
    if os.path.exists(gap_filled_csv) and not force:
        print(f"{gap_filled_csv} already exists, skipping..")
        df = pd.read_csv(gap_filled_csv)
        print(df["AMSR_SWE"].describe())
    else:
        date_keyed_objects = {}
        data_dict = {}
        new_df = None
        while current_date <= selected_date:
            print(current_date.strftime('%Y-%m-%d'))
            current_date_str = current_date.strftime('%Y-%m-%d')
            data_dict[current_date_str] = download_amsr_and_convert_grid(current_date_str)
            current_df = pd.read_csv(data_dict[current_date_str])
            current_df.drop(columns=["date"], inplace=True)
            if current_date != selected_date:
                current_df.rename(columns={
                    "AMSR_SWE": f"AMSR_SWE_{current_date_str}",
                    "AMSR_Flag": f"AMSR_Flag_{current_date_str}",
                }, inplace=True)
            #print(current_df.head())
            if new_df is None:
                new_df = current_df
            else:
                new_df = pd.merge(new_df, current_df, on=['gridmet_lat', 'gridmet_lon'])
                #new_df = new_df.append(current_df, ignore_index=True)
            current_date += timedelta(days=1)

        print("new_df.columns = ", new_df.columns)
        print("new_df.head = ", new_df.head())
        df = new_df
        #df.sort_values(by=['gridmet_lat', 'gridmet_lon', 'date'], inplace=True)
        print("All current head: ", df.head())
        print("the new_df.shape: ", df.shape)
        print("Start to fill in the missing values")
        #grouped = df.groupby(['gridmet_lat', 'gridmet_lon'])
        filled_data = pd.DataFrame()

        # Apply the function to each group
        for column_name in columns_to_be_cumulated:
            start_time = time.time()
            #filled_data = df.apply(lambda row: interpolate_missing_and_add_cumulative_inplace(row, column_name), axis=1)
            #alike_columns = filled_data.filter(like=column_name)
            #filled_data[f'cumulative_{column_name}'] = alike_columns.sum(axis=1)
            print("filled_data.columns = ", filled_data.columns)
            filtered_columns = df.filter(like=column_name)
            print(filtered_columns.columns)
            filtered_columns = filtered_columns.mask(filtered_columns > 240)
            filtered_columns.interpolate(axis=1, method='linear', inplace=True)
            filtered_columns.fillna(0, inplace=True)
            sum_column = filtered_columns.sum(axis=1)

            # Define a specific name for the new column
            df[f'cumulative_{column_name}'] = sum_column
            df[filtered_columns.columns] = filtered_columns
            if filtered_columns.isnull().any().any():
                print("filtered_columns :", filtered_columns)
                raise ValueError("Single group: shouldn't have null values here")
            print("filled_data.columns: ", filled_data.columns)
            end_time = time.time()
            # Calculate the elapsed time
            elapsed_time = end_time - start_time
            print(f"calculate column {column_name} elapsed time: {elapsed_time} seconds")

        filled_data = df
        filled_data["date"] = target_date
        print("Finished correctly ", filled_data.head())
        filled_data.to_csv(gap_filled_csv, index=False)
        print(f"New filled values csv is saved to {gap_filled_csv}")
        df = filled_data

    result = df
    print("result.head = ", result.head())
    # fill in the rest NA as 0
    if result.isnull().any().any():
        print("result :", result)
        raise ValueError("Single group: shouldn't have null values here")

    # only retain the rows of the target date
    print(result['date'].unique())
    print(result.shape)
    print(result[["AMSR_SWE", "AMSR_Flag"]].describe())
    result.to_csv(target_csv_path, index=False)
    print(f"New data is saved to {os.path.basename(target_csv_path)}")
3.6.4.7 Interpolate Missing Values and Calculate Cumulative SWE In-Place for AMSR Data#
Here we aim to ensure that any missing or anomalous data points within a specific column are handled appropriately through interpolation, and then a cumulative sum is calculated.
This is particularly useful in time series data, where continuity is crucial, and missing data could skew analysis.
The cumulative sum provides an aggregated measure that can be used for further analysis, such as tracking total snowfall or snow cover over a period.
row: A Pandas Series representing a single row of data from a DataFrame, which contains the values to be interpolated.
column_name: The specific column in the DataFrame that needs interpolation and cumulative calculation.
degree: The degree of the polynomial used for interpolation, with a default of 1 (linear).
x_subset_key = x_all_key[x_all_key.str.startswith(column_name)]: Identifies the subset of columns within the row that corresponds to the specified column_name, allowing focused interpolation on the relevant data.
are_all_values_between_0_and_240 = row[x_subset_key].between(1, 239).all(): Checks whether all values in the specified subset are within a valid range (for SWE, between 1 and 239), which helps ensure that the data is suitable for interpolation.
row[f"cumulative_{column_name}"] = row[x_subset_key].sum(): After interpolation, calculates the cumulative sum of the values in the subset and adds it as a new column in the row.
def interpolate_missing_and_add_cumulative_inplace(row, column_name, degree=1):
    """
    Interpolate missing values in a Pandas Series using polynomial interpolation
    and add a cumulative column.

    Parameters:
    - row (pd.Series): The input row containing the data to be interpolated.
    - column_name (str): The name of the column to be interpolated.
    - degree (int, optional): The degree of the polynomial fit. Default is 1 (linear).

    Returns:
    - pd.Series: The row with interpolated values and a cumulative column.

    Raises:
    - ValueError: If there are unexpected null values after interpolation.

    Note:
    - For 'SWE' column, values above 240 are treated as gaps and set to 240.
    - For 'fsca' column, values above 100 are treated as gaps and set to 100.

    Examples:
    ```python
    # Example usage:
    interpolated_row = interpolate_missing_and_add_cumulative_inplace(my_row, 'fsca', degree=2)
    ```
    """
    # Extract X series (column names)
    x_all_key = row.index
    x_subset_key = x_all_key[x_all_key.str.startswith(column_name)]
    are_all_values_between_0_and_240 = row[x_subset_key].between(1, 239).all()
    if are_all_values_between_0_and_240:
        print("row[x_subset_key] = ", row[x_subset_key])
        print("row[x_subset_key].sum() = ", row[x_subset_key].sum())
        # create the cumulative column after interpolation
        row[f"cumulative_{column_name}"] = row[x_subset_key].sum()
    return row
3.6.4.8 Running the AMSR Data Extraction Process#
This script handles the entire workflow, from data preparation to the generation of cumulative time series data.
prepare_amsr_grid_mapper(): Maps the AMSR grid to the gridMET grid, preparing the necessary data for further processing.
get_cumulative_amsr_data(force=False): Calculates cumulative AMSR data for a specific date range and handles any missing data. The force parameter determines whether to overwrite existing processed data.
input_time_series_file: Defines the file path for the cumulative AMSR data, which will be used in subsequent analyses or processes.
# Run the download and conversion function
prepare_amsr_grid_mapper()
download_amsr_and_convert_grid()
get_cumulative_amsr_data(force=False)
input_time_series_file = f'{work_dir}/testing_ready_amsr_{test_start_date}_cumulative.csv_gap_filled.csv'
File amsr_to_gridmet_mapper.csv already exists, skipping..
File testing_ready_amsr_2024.07.18.csv already exists, skipping..
2024-07-18 00:00:00
../data/gridmet_test_run/testing_ready_amsr_2024-07-18_cumulative.csv_gap_filled.csv already exists, skipping..
count 100.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: AMSR_SWE, dtype: float64
result.head = gridmet_lat gridmet_lon AMSR_SWE_2023-10-01 AMSR_Flag_2023-10-01 \
0 49.0 -125.000 0.0 241
1 49.0 -124.964 0.0 241
2 49.0 -124.928 0.0 241
3 49.0 -124.892 0.0 241
4 49.0 -124.856 0.0 241
AMSR_SWE_2023-10-02 AMSR_Flag_2023-10-02 AMSR_SWE_2023-10-03 \
0 0.0 241 0.0
1 0.0 241 0.0
2 0.0 241 0.0
3 0.0 241 0.0
4 0.0 241 0.0
AMSR_Flag_2023-10-03 AMSR_SWE_2023-10-04 AMSR_Flag_2023-10-04 ... \
0 241 0.0 241 ...
1 241 0.0 241 ...
2 241 0.0 241 ...
3 241 0.0 241 ...
4 241 0.0 241 ...
AMSR_SWE_2024-07-15 AMSR_Flag_2024-07-15 AMSR_SWE_2024-07-16 \
0 0.0 241 0.0
1 0.0 241 0.0
2 0.0 241 0.0
3 0.0 241 0.0
4 0.0 241 0.0
AMSR_Flag_2024-07-16 AMSR_SWE_2024-07-17 AMSR_Flag_2024-07-17 AMSR_SWE \
0 241 0.0 241 0.0
1 241 0.0 241 0.0
2 241 0.0 241 0.0
3 241 0.0 241 0.0
4 241 0.0 241 0.0
AMSR_Flag cumulative_AMSR_SWE date
0 241 7.5 2024-07-18
1 241 7.5 2024-07-18
2 241 7.5 2024-07-18
3 241 7.5 2024-07-18
4 241 7.5 2024-07-18
[5 rows x 588 columns]
['2024-07-18']
(100, 588)
AMSR_SWE AMSR_Flag
count 100.0 100.000000
mean 0.0 247.750000
std 0.0 6.927292
min 0.0 241.000000
25% 0.0 241.000000
50% 0.0 241.000000
75% 0.0 255.000000
max 0.0 255.000000
New data is saved to testing_ready_amsr_2024-07-18_cumulative.csv
Utility Functions for Feature Extraction#
The following functions are categorized as utility functions. These functions are not central to the main discussion but play a supportive role by providing necessary functionality. They can be referenced as needed throughout the chapter to simplify and streamline the main code examples.
1. Importing Required Python Libraries to Run the Script#
Importing libraries: Essential libraries are imported for handling files, processing large datasets, and performing complex calculations.
os, shutil, subprocess: For file handling, copying, and executing shell commands.
csv, h5py, numpy, pandas: For reading/writing files, handling HDF5 datasets, numerical computations, and data manipulation.
dask, xarray: To manage and process large datasets efficiently using parallel computing.
import os
import csv
import h5py
import shutil
import numpy as np
import pandas as pd
from datetime import datetime
import dask
import dask.dataframe as dd
import dask.delayed as delayed
import dask.bag as db
import xarray as xr
import subprocess
# For demonstration purposes, we're using one week of data for training.
# The training period is set from December 24, 2022, to December 31, 2022.
train_start_date = "2022-12-24"
train_end_date = "2022-12-31"
work_dir = "../data/gridmet_test_run"
2. Function to Copy .he5 Files from Source to Destination Directory#
The goal here is to copy all .he5
files from a specified source directory to a destination directory.
source_dir: The directory where the .he5 files are originally located.
destination_dir: The target directory where the .he5 files will be copied.
os.walk: A function that traverses the directory tree, accessing all subdirectories and files.
shutil.copy: A method used to copy the files from the source to the destination.
The code specifically looks for files with the .he5
extension to identify the relevant files for copying.
def copy_he5_files(source_dir, destination_dir):
    '''
    Copy .he5 files from the source directory to the destination directory.

    Args:
        source_dir (str): The source directory containing .he5 files to copy.
        destination_dir (str): The destination directory where .he5 files will be copied.

    Returns:
        None
    '''
    # Get a list of all subdirectories and files in the source directory
    for root, dirs, files in os.walk(source_dir):
        for file in files:
            if file.endswith('.he5'):
                # Get the absolute path of the source file
                source_file_path = os.path.join(root, file)
                # Copy the file to the destination directory
                shutil.copy(source_file_path, destination_dir)
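A hypothetical usage sketch (the source path is an example only; adjust it to wherever your downloads actually live):

import os

source_dir = "../data/amsr_raw_downloads"        # example path, not part of the pipeline above
destination_dir = "../data/gridmet_test_run/amsr"
os.makedirs(destination_dir, exist_ok=True)      # shutil.copy needs an existing destination folder
copy_he5_files(source_dir, destination_dir)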
Utility Functions for Data Processing and Analysis Pipeline#
1. Library Imports and Setup for Snow Data#
KDTree: From scipy.spatial, used for performing efficient nearest-neighbor searches in spatial datasets.
plot_all_variables_in_one_csv: A custom function from convert_results_to_images, used for visualizing processed data.
import os
import h5py
import subprocess
import pandas as pd
import numpy as np
from datetime import datetime
from scipy.spatial import KDTree
import time
from datetime import datetime, timedelta, date
import warnings
import sys
# from convert_results_to_images import plot_all_variables_in_one_csv
homedir = os.path.expanduser('~')
work_dir = "../data/gridmet_test_run"
test_start_date = "2024-07-18"
western_us_coords = "../data/dem_file.tif.csv"
2. Identifying Binary Files#
Here we determine whether a given file is a binary file or a text file.
We attempt to open the file in binary mode ('rb') and read a chunk of bytes (1024 bytes).
We then check for null bytes (b'\x00'), which are common in binary files; if a null byte is found, the file is treated as binary.
Next, we check for a high percentage of non-printable ASCII characters by converting the byte chunk to characters and filtering out non-printable ones. If the chunk has no printable characters, the file is considered binary.
If neither of the above conditions is met, the function assumes the file is a text file.
def is_binary(file_path):
    try:
        with open(file_path, 'rb') as file:
            # Read a chunk of bytes from the file
            chunk = file.read(1024)

            # Check for null bytes, a common indicator of binary data
            if b'\x00' in chunk:
                return True

            # Check for a high percentage of non-printable ASCII characters
            text_characters = "".join(chr(byte) for byte in chunk if 32 <= byte <= 126)
            if not text_characters:
                return True

            # If none of the binary indicators are found, assume it's a text file
            return False
    except FileNotFoundError:
        print(f"File '{file_path}' not found.")
        return False
    except Exception as e:
        print(f"An error occurred: {e}")
        return False
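One place this helper matters is right after a download: a failed or unauthenticated Earthdata request can leave a small text/HTML error page where the .he5 granule should be, which is why download_amsr_and_convert_grid checks is_binary before skipping a re-download. A short usage sketch (the path is an example taken from this chapter's test run):

import os

path = "../data/gridmet_test_run/amsr_testing/testing_amsr_2024.07.18.he5"
if os.path.exists(path) and not is_binary(path):
    print(f"{path} does not look like a valid HDF5 granule; consider re-downloading it.")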