3.6 Advanced Microwave Scanning Radiometer (AMSR)#
AMSR (Advanced Microwave Scanning Radiometer) is a satellite sensor that measures snow cover and water content by detecting Earth’s microwave emissions.
This chapter focuses on the use of AMSR satellite data to monitor snow cover and snow water equivalent (SWE), key indicators in understanding water resources and climate patterns.
This chapter demonstrates the practical applications of transforming AMSR data into actionable information for scientific and environmental purposes.
3.6.1 Generating Daily Snow Data Links from AMSR#
Our goal is to generate a list of download links for AMSR daily snow data files for a specified date range.
AMSR: Satellite sensor providing daily snow data.
start_year, end_year: Define the date range for which data links are generated.
base_url: The starting point of each download link.
timedelta: Used to iterate over each day within the date range.
generate_links: Function that builds and returns the list of URLs.
Here we create a list of download links for AMSR snow data by iterating over a date range from 2019 to 2022.
from datetime import datetime, timedelta
import os
import subprocess

def generate_links(start_year, end_year):
    '''
    Generate a list of download links for AMSR daily snow data files.

    Args:
        start_year (int): The starting year.
        end_year (int): The ending year (inclusive).

    Returns:
        list: A list of download links for AMSR daily snow data files.
    '''
    base_url = "https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/"
    url_date_format = "%Y.%m.%d"
    file_date_format = "%Y%m%d"
    delta = timedelta(days=1)
    start_date = datetime(start_year, 1, 1)
    end_date = datetime(end_year + 1, 1, 1)

    links = []
    current_date = start_date
    while current_date < end_date:
        url_date_str = current_date.strftime(url_date_format)
        file_date_str = current_date.strftime(file_date_format)
        link = base_url + url_date_str + "/AMSR_U2_L3_DailySnow_B02_" + file_date_str + ".he5"
        links.append(link)
        current_date += delta
    return links

start_year = 2019
end_year = 2022
work_dir = "../data/gridmet_test_run"

links = generate_links(start_year, end_year)
save_location = f'{work_dir}/amsr'

with open(f'{work_dir}/download_links.txt', "w") as txt_file:
    for l in links:
        txt_file.write(l + "\n")
3.6.2 Automated Script for Secure File Download with Wget from a List of URLs#
The following shell script automates the downloading of files by looping through the links we obtained in 3.6.1.
work_dir: Specifies the working directory where the download links and files will be stored.
input_file: The text file containing the list of URLs to be downloaded.
base_wget_command: The core command used to download files, with options for authentication, session management, and secure connections.
output_directory: The folder where the downloaded files will be saved.
Loop: Iterates over each URL in the input file, ensuring all files are downloaded.
Note: Before proceeding with the download, ensure you have logged in to the Earthdata website. You will need to retrieve the session cookies to authenticate your wget requests.
Visit the Earthdata Login page and log in with your credentials.
After logging in, use a browser extension like EditThisCookie to export the session cookies.
If the cookies are exported in JSON format, convert them to the standard wget format; you can use a cookie converter tool or manually extract the relevant cookies (a rough conversion sketch follows).
Save the cookies in a file named cookies.txt in the format required by wget.
The cookies are crucial for authenticating your requests and ensuring successful downloads. In the wget download script below, the output is shown for only 6 files.
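If you need to perform that conversion yourself, here is a minimal sketch. It assumes an EditThisCookie-style JSON export and example file names (cookies.json, cookies.txt), so adjust the field names and paths to whatever your extension actually produces.

import json

def json_cookies_to_netscape(json_path="cookies.json", out_path="cookies.txt"):
    # Convert a JSON cookie export into the tab-separated Netscape format wget expects
    with open(json_path) as f:
        cookies = json.load(f)
    lines = ["# Netscape HTTP Cookie File"]
    for c in cookies:
        lines.append("\t".join([
            c.get("domain", ""),
            "TRUE" if c.get("domain", "").startswith(".") else "FALSE",
            c.get("path", "/"),
            "TRUE" if c.get("secure", False) else "FALSE",
            str(int(c.get("expirationDate", 0))),   # session cookies fall back to 0
            c.get("name", ""),
            c.get("value", ""),
        ]))
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")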
%%bash
#!/bin/bash

# Specify the file containing the download links
input_file="../data/download_links_updated.txt"

# Specify the cookies file location
cookies_file="../data/cookies.txt"

# Ensure the cookies file exists (assumes it's been generated previously)
if [ ! -f "$cookies_file" ]; then
    echo "Cookies file not found: $cookies_file"
    exit 1
fi

# Specify the base wget command with common options
base_wget_command="wget --load-cookies $cookies_file --save-cookies mycookies.txt --keep-session-cookies --no-check-certificate --progress=bar:force --quiet"

# Specify the output directory for downloaded files
output_directory="../data/gridmet_test_run/amsr"

# Ensure the output directory exists
mkdir -p "$output_directory"

# Loop through each line (URL) in the input file and download it using wget
while IFS= read -r url || [[ -n "$url" ]]; do
    echo "Downloading: $url"
    $base_wget_command -P "$output_directory" "$url"
done < "$input_file"
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.01/AMSR_U2_L3_DailySnow_B02_20190101.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.02/AMSR_U2_L3_DailySnow_B02_20190102.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.03/AMSR_U2_L3_DailySnow_B02_20190103.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.04/AMSR_U2_L3_DailySnow_B02_20190104.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.05/AMSR_U2_L3_DailySnow_B02_20190105.he5
Downloading: https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.06/AMSR_U2_L3_DailySnow_B02_20190106.he5
3.6.3 Extracting Features from AMSR Data#
Once the AMSR data files are downloaded, the next step is to extract relevant features from these files.
The following script accomplishes this task by processing each AMSR data file and extracting snow water equivalent (SWE)
values for specific grid cells corresponding to SNOTEL weather stations.
Let's break down each step involved in feature extraction.
3.6.3.1 Finding the Closest Grid Cell Index for Given Latitude and Longitude#
target_latitude, target_longitude: The coordinates of the specific location you want to match to a grid cell.
lat_grid, lon_grid: Arrays representing the grid of latitude and longitude values across a region.
np.unravel_index: Identifies the indices in the grid where the sum of latitude and longitude differences is minimized.
Here we calculate the absolute differences between the target location and each point in the grid to find the closest match:
lat_diff = np.abs(lat_grid - target_latitude)
lon_diff = np.abs(lon_grid - target_longitude)
Through this code snippet, we get the row index (lat_idx), the column index (lon_idx), and the actual latitude and longitude of the closest grid cell.
def find_closest_index(target_latitude, target_longitude, lat_grid, lon_grid):
    '''
    Find the index of the grid cell with the closest coordinates to the target latitude and longitude.

    Args:
        target_latitude (float): The target latitude.
        target_longitude (float): The target longitude.
        lat_grid (numpy.ndarray): An array of latitude values.
        lon_grid (numpy.ndarray): An array of longitude values.

    Returns:
        Tuple[int, int, float, float]: A tuple containing the row index, column index, closest latitude, and closest longitude.
    '''
    # Compute the absolute differences between target and grid coordinates
    lat_diff = np.abs(lat_grid - target_latitude)
    lon_diff = np.abs(lon_grid - target_longitude)

    # Find the indices corresponding to the minimum differences
    lat_idx, lon_idx = np.unravel_index(np.argmin(lat_diff + lon_diff), lat_grid.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
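A tiny self-contained usage sketch (the 3x3 grid values are invented purely for illustration) shows what the function returns:

import numpy as np

lat_grid = np.array([[40.0, 40.0, 40.0],
                     [41.0, 41.0, 41.0],
                     [42.0, 42.0, 42.0]])
lon_grid = np.array([[-120.0, -119.0, -118.0],
                     [-120.0, -119.0, -118.0],
                     [-120.0, -119.0, -118.0]])

# The target point (40.9, -119.2) is nearest to the cell at row 1, column 1
lat_idx, lon_idx, lat, lon = find_closest_index(40.9, -119.2, lat_grid, lon_grid)
print(lat_idx, lon_idx, lat, lon)   # 1 1 41.0 -119.0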
3.6.3.2 Function to Map SNOTEL Stations to AMSR Grid Coordinates and Create a CSV Mapper#
Next we map SNOTEL station locations to the nearest AMSR grid cells and save this mapping as a CSV file.
new_base_station_list_file: A CSV file containing the latitude and longitude of various SNOTEL stations.
target_csv_path: The file path where the output CSV file will be saved.
hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']: Accesses the Northern Hemisphere data group within the HDF5 file.
Here we read the station data, check whether the mapping file already exists, and download the necessary AMSR data file if it is not already present, using cmd = f"curl --output {target_amsr_hdf_path} ...".
For each SNOTEL station, we identify the closest AMSR grid cell by comparing latitudes and longitudes, and then save the SNOTEL-to-AMSR mapping to a CSV file with df.to_csv(target_csv_path, index=False).
This is useful for comparing ground-based measurements with satellite observations. By finding the closest grid point in the AMSR dataset to each SNOTEL station, scientists and researchers can analyze and compare the data more effectively. This helps in improving weather predictions, studying climate patterns, and better understanding environmental conditions.
def create_snotel_station_to_amsr_mapper(new_base_station_list_file, target_csv_path):
    station_data = pd.read_csv(new_base_station_list_file)
    date = "2023-01-01"
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")

    # Check if the CSV already exists
    if os.path.exists(target_csv_path):
        print(f"File {os.path.basename(target_csv_path)} already exists, skipping..")
        df = pd.read_csv(target_csv_path)
        return df

    print('date is ', date)
    target_amsr_hdf_path = f"../data/gridmet_test_run/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path):
        print(f"File {target_amsr_hdf_path} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ~/.urs_cookies -c ~/.urs_cookies -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        subprocess.run(cmd, shell=True)

    df = pd.DataFrame(columns=['amsr_lat', 'amsr_lon',
                               'amsr_lat_idx', 'amsr_lon_idx',
                               'station_lat', 'station_lon'])

    # Read the HDF
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]

    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)

    # Convert the AMSR grid into our gridMET 1km grid
    for idx, row in station_data.iterrows():
        target_lat = row['latitude']
        target_lon = row['longitude']
        # compare the performance and find the fastest way to search nearest point
        closest_lat_idx, closest_lon_idx, closest_lat, closest_lon = find_closest_index(target_lat, target_lon, lat, lon)
        df.loc[len(df.index)] = [closest_lat,
                                 closest_lon,
                                 closest_lat_idx,
                                 closest_lon_idx,
                                 target_lat,
                                 target_lon]

    # Save the new converted AMSR to CSV file
    df.to_csv(target_csv_path, index=False)
    print('AMSR mapper csv is created.')
    return df
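A hypothetical call might look like the following; the paths are the ones used later in this chapter, so adjust them to your own layout.

new_base_station_list_file = "../data/gridmet_test_run/all_training_points_snotel_ghcnd_in_westus.csv"
target_csv_path = "../data/gridmet_test_run/training_snotel_station_to_amsr_mapper.csv"
mapper_df = create_snotel_station_to_amsr_mapper(new_base_station_list_file, target_csv_path)
print(mapper_df.head())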
3.6.3.3 Extracting and Saving AMSR Snow Data to CSV#
Next, we extract snow water equivalent (SWE)
data from AMSR files for a range of dates, match it to specific locations (such as SNOTEL stations), and save the processed data into a CSV file.
amsr_data_dir: Directory containing the AMSR .he5 files.
Here we use the pre-generated mapping of SNOTEL stations to AMSR grid cells from 3.6.3.2 to extract the relevant SWE values.
Parallel processing with Dask: Dask is used to process the large set of files in parallel, making the extraction faster (a minimal standalone illustration of the delayed/compute pattern follows this list).
dask_station_data = dd.from_pandas(mapper_df, npartitions=1): Converts the DataFrame into a Dask DataFrame for parallel processing.
mapper_df = create_snotel_station_to_amsr_mapper(new_base_station_list_file, target_csv_path): Creates the mapping between SNOTEL stations and AMSR grid cells.
swe = hem_group['Data Fields/SWE_NorthernDaily'][:]: Extracts the Snow Water Equivalent (SWE) data from the AMSR HDF5 file.
delayed_results = [process_row(row, swe, new_date_str) for _, row in mapper_df.iterrows()]: Uses Dask's delayed function to process each row of the DataFrame in parallel.
processed_data = dask_bag.map(process_file).filter(lambda x: x is not None).compute(): Processes all the AMSR files in parallel, filtering out any None results.
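Before reading the full function, here is a minimal, self-contained sketch of the dask.delayed pattern it relies on: each decorated call only records a lazy task, and dask.compute runs them all together.

import dask
from dask import delayed

@delayed
def square(x):
    # Nothing runs yet; calling square(i) only records a task
    return x * x

tasks = [square(i) for i in range(5)]
print(dask.compute(*tasks))   # (0, 1, 4, 9, 16)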
def extract_amsr_values_save_to_csv(amsr_data_dir, output_csv_file, new_base_station_list_file, start_date, end_date):
    if os.path.exists(output_csv_file):
        os.remove(output_csv_file)

    target_csv_path = "../data/gridmet_test_run/training_snotel_station_to_amsr_mapper.csv"
    mapper_df = create_snotel_station_to_amsr_mapper(new_base_station_list_file,
                                                     target_csv_path)
    # station_data = pd.read_csv(new_base_station_list_file)
    start_date = datetime.strptime(start_date, "%Y-%m-%d")
    end_date = datetime.strptime(end_date, "%Y-%m-%d")

    # Create a Dask DataFrame
    dask_station_data = dd.from_pandas(mapper_df, npartitions=1)

    # Function to process each file
    def process_file(filename):
        file_path = os.path.join(amsr_data_dir, filename)
        file = h5py.File(file_path, 'r')
        hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
        date_str = filename.split('_')[-1].split('.')[0]
        date = datetime.strptime(date_str, '%Y%m%d')
        if not (start_date <= date <= end_date):
            print(f"{date} is not in the training period, skipping..")
            return None

        new_date_str = date.strftime("%Y-%m-%d")
        swe = hem_group['Data Fields/SWE_NorthernDaily'][:]
        flag = hem_group['Data Fields/Flags_NorthernDaily'][:]

        # Create an empty Pandas DataFrame with the desired columns
        result_df = pd.DataFrame(columns=['date', 'lat', 'lon', 'AMSR_SWE'])
        print('empty df ', result_df)

        # Sample loop to add rows to the Pandas DataFrame using dask.delayed
        @delayed
        def process_row(row, swe, new_date_str):
            closest_lat_idx = int(row['amsr_lat_idx'])
            closest_lon_idx = int(row['amsr_lon_idx'])
            closest_swe = swe[closest_lat_idx, closest_lon_idx]
            return pd.DataFrame([[new_date_str,
                                  row['station_lat'],
                                  row['station_lon'],
                                  closest_swe]],
                                columns=result_df.columns)

        # List of delayed computations
        delayed_results = [process_row(row, swe, new_date_str) for _, row in mapper_df.iterrows()]

        # Compute the delayed results and concatenate them into a Pandas DataFrame
        result_df = dask.compute(*delayed_results)
        result_df = pd.concat(result_df, ignore_index=True)

        # Print the final Pandas DataFrame
        print(result_df)
        return result_df

    # Get the list of files
    files = [f for f in os.listdir(amsr_data_dir) if f.endswith('.he5')]

    # Create a Dask Bag from the files
    dask_bag = db.from_sequence(files, npartitions=2)

    # Process files in parallel
    processed_data = dask_bag.map(process_file).filter(lambda x: x is not None).compute()
    print(processed_data)

    # Concatenate the processed data
    combined_df = pd.concat(processed_data, ignore_index=True)

    # Save the combined DataFrame to a CSV file
    combined_df.to_csv(output_csv_file, index=False)
    print(f"Merged data saved to {os.path.basename(output_csv_file)}")
3.6.3.4 Running the AMSR Data Extraction Process#
Here we extract and save AMSR snow data for a specified range of dates, link it to SNOTEL stations, and store the results in a CSV file.
new_base_station_list_file: CSV file containing a list of active SNOTEL stations in the western U.S.
Date range: The start and end dates (train_start_date and train_end_date) define the period for which the data will be processed.
extract_amsr_values_save_to_csv: The function discussed in 3.6.3.3; it processes the AMSR data and saves the output to a CSV file.
Here is the main entry point for a script that prepares and processes AMSR data, mapping it to specific training points that include SNOTEL and GHCND (Global Historical Climatology Network Daily) stations.
By mapping SNOTEL and GHCND stations to the closest AMSR grid points and extracting relevant data over a specified time period, this script prepares a dataset that can be used for further analysis or modeling.
amsr_data_dir = "../data/gridmet_test_run/amsr"
all_training_points_with_snotel_ghcnd_file = "../data/gridmet_test_run/all_training_points_snotel_ghcnd_in_westus.csv"
new_base_df = pd.read_csv(all_training_points_with_snotel_ghcnd_file)
print(new_base_df.head())
output_csv_file = f"{all_training_points_with_snotel_ghcnd_file}_amsr_dask_all_training_ponits_with_ghcnd.csv"
start_date = train_start_date
end_date = train_end_date
extract_amsr_values_save_to_csv(amsr_data_dir, output_csv_file, all_training_points_with_snotel_ghcnd_file, start_date, end_date)
latitude longitude modis_x modis_y
0 39.95500 -120.53800 123 251
1 42.95000 -112.83333 337 168
2 36.23333 -106.43333 515 354
3 36.23700 -106.42912 515 354
4 44.45615 -113.30097 324 126
File training_snotel_station_to_amsr_mapper.csv already exists, skipping..
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-24 39.95500 -120.53800 255
1 2022-12-24 42.95000 -112.83333 25
2 2022-12-24 36.23333 -106.43333 0
3 2022-12-24 36.23700 -106.42912 0
4 2022-12-24 44.45615 -113.30097 18
... ... ... ... ...
8056 2022-12-24 48.40890 -106.51440 20
8057 2022-12-24 48.54250 -109.76440 15
8058 2022-12-24 45.54940 -100.40860 40
8059 2022-12-24 43.53170 -112.94220 30
8060 2022-12-24 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-30 39.95500 -120.53800 0
1 2022-12-30 42.95000 -112.83333 70
2 2022-12-30 36.23333 -106.43333 0
3 2022-12-30 36.23700 -106.42912 0
4 2022-12-30 44.45615 -113.30097 44
... ... ... ... ...
8056 2022-12-30 48.40890 -106.51440 34
8057 2022-12-30 48.54250 -109.76440 56
8058 2022-12-30 45.54940 -100.40860 38
8059 2022-12-30 43.53170 -112.94220 55
8060 2022-12-30 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-31 39.95500 -120.53800 255
1 2022-12-31 42.95000 -112.83333 0
2 2022-12-31 36.23333 -106.43333 0
3 2022-12-31 36.23700 -106.42912 0
4 2022-12-31 44.45615 -113.30097 28
... ... ... ... ...
8056 2022-12-31 48.40890 -106.51440 30
8057 2022-12-31 48.54250 -109.76440 60
8058 2022-12-31 45.54940 -100.40860 39
8059 2022-12-31 43.53170 -112.94220 39
8060 2022-12-31 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-25 39.95500 -120.53800 0
1 2022-12-25 42.95000 -112.83333 29
2 2022-12-25 36.23333 -106.43333 255
3 2022-12-25 36.23700 -106.42912 255
4 2022-12-25 44.45615 -113.30097 27
... ... ... ... ...
8056 2022-12-25 48.40890 -106.51440 18
8057 2022-12-25 48.54250 -109.76440 0
8058 2022-12-25 45.54940 -100.40860 33
8059 2022-12-25 43.53170 -112.94220 27
8060 2022-12-25 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-28 39.95500 -120.53800 0
1 2022-12-28 42.95000 -112.83333 255
2 2022-12-28 36.23333 -106.43333 0
3 2022-12-28 36.23700 -106.42912 0
4 2022-12-28 44.45615 -113.30097 255
... ... ... ... ...
8056 2022-12-28 48.40890 -106.51440 24
8057 2022-12-28 48.54250 -109.76440 0
8058 2022-12-28 45.54940 -100.40860 25
8059 2022-12-28 43.53170 -112.94220 255
8060 2022-12-28 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-29 39.95500 -120.53800 32
1 2022-12-29 42.95000 -112.83333 55
2 2022-12-29 36.23333 -106.43333 0
3 2022-12-29 36.23700 -106.42912 0
4 2022-12-29 44.45615 -113.30097 31
... ... ... ... ...
8056 2022-12-29 48.40890 -106.51440 28
8057 2022-12-29 48.54250 -109.76440 39
8058 2022-12-29 45.54940 -100.40860 255
8059 2022-12-29 43.53170 -112.94220 59
8060 2022-12-29 47.68720 -122.25530 255
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-27 39.95500 -120.53800 0
1 2022-12-27 42.95000 -112.83333 19
2 2022-12-27 36.23333 -106.43333 255
3 2022-12-27 36.23700 -106.42912 255
4 2022-12-27 44.45615 -113.30097 23
... ... ... ... ...
8056 2022-12-27 48.40890 -106.51440 17
8057 2022-12-27 48.54250 -109.76440 0
8058 2022-12-27 45.54940 -100.40860 19
8059 2022-12-27 43.53170 -112.94220 30
8060 2022-12-27 47.68720 -122.25530 0
[8061 rows x 4 columns]
empty df Empty DataFrame
Columns: [date, lat, lon, AMSR_SWE]
Index: []
date lat lon AMSR_SWE
0 2022-12-26 39.95500 -120.53800 0
1 2022-12-26 42.95000 -112.83333 0
2 2022-12-26 36.23333 -106.43333 0
3 2022-12-26 36.23700 -106.42912 0
4 2022-12-26 44.45615 -113.30097 22
... ... ... ... ...
8056 2022-12-26 48.40890 -106.51440 23
8057 2022-12-26 48.54250 -109.76440 31
8058 2022-12-26 45.54940 -100.40860 37
8059 2022-12-26 43.53170 -112.94220 29
8060 2022-12-26 47.68720 -122.25530 0
[8061 rows x 4 columns]
[ date lat lon AMSR_SWE
0 2022-12-28 39.95500 -120.53800 0
1 2022-12-28 42.95000 -112.83333 255
2 2022-12-28 36.23333 -106.43333 0
3 2022-12-28 36.23700 -106.42912 0
4 2022-12-28 44.45615 -113.30097 255
... ... ... ... ...
8056 2022-12-28 48.40890 -106.51440 24
8057 2022-12-28 48.54250 -109.76440 0
8058 2022-12-28 45.54940 -100.40860 25
8059 2022-12-28 43.53170 -112.94220 255
8060 2022-12-28 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-29 39.95500 -120.53800 32
1 2022-12-29 42.95000 -112.83333 55
2 2022-12-29 36.23333 -106.43333 0
3 2022-12-29 36.23700 -106.42912 0
4 2022-12-29 44.45615 -113.30097 31
... ... ... ... ...
8056 2022-12-29 48.40890 -106.51440 28
8057 2022-12-29 48.54250 -109.76440 39
8058 2022-12-29 45.54940 -100.40860 255
8059 2022-12-29 43.53170 -112.94220 59
8060 2022-12-29 47.68720 -122.25530 255
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-27 39.95500 -120.53800 0
1 2022-12-27 42.95000 -112.83333 19
2 2022-12-27 36.23333 -106.43333 255
3 2022-12-27 36.23700 -106.42912 255
4 2022-12-27 44.45615 -113.30097 23
... ... ... ... ...
8056 2022-12-27 48.40890 -106.51440 17
8057 2022-12-27 48.54250 -109.76440 0
8058 2022-12-27 45.54940 -100.40860 19
8059 2022-12-27 43.53170 -112.94220 30
8060 2022-12-27 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-26 39.95500 -120.53800 0
1 2022-12-26 42.95000 -112.83333 0
2 2022-12-26 36.23333 -106.43333 0
3 2022-12-26 36.23700 -106.42912 0
4 2022-12-26 44.45615 -113.30097 22
... ... ... ... ...
8056 2022-12-26 48.40890 -106.51440 23
8057 2022-12-26 48.54250 -109.76440 31
8058 2022-12-26 45.54940 -100.40860 37
8059 2022-12-26 43.53170 -112.94220 29
8060 2022-12-26 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-24 39.95500 -120.53800 255
1 2022-12-24 42.95000 -112.83333 25
2 2022-12-24 36.23333 -106.43333 0
3 2022-12-24 36.23700 -106.42912 0
4 2022-12-24 44.45615 -113.30097 18
... ... ... ... ...
8056 2022-12-24 48.40890 -106.51440 20
8057 2022-12-24 48.54250 -109.76440 15
8058 2022-12-24 45.54940 -100.40860 40
8059 2022-12-24 43.53170 -112.94220 30
8060 2022-12-24 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-30 39.95500 -120.53800 0
1 2022-12-30 42.95000 -112.83333 70
2 2022-12-30 36.23333 -106.43333 0
3 2022-12-30 36.23700 -106.42912 0
4 2022-12-30 44.45615 -113.30097 44
... ... ... ... ...
8056 2022-12-30 48.40890 -106.51440 34
8057 2022-12-30 48.54250 -109.76440 56
8058 2022-12-30 45.54940 -100.40860 38
8059 2022-12-30 43.53170 -112.94220 55
8060 2022-12-30 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-31 39.95500 -120.53800 255
1 2022-12-31 42.95000 -112.83333 0
2 2022-12-31 36.23333 -106.43333 0
3 2022-12-31 36.23700 -106.42912 0
4 2022-12-31 44.45615 -113.30097 28
... ... ... ... ...
8056 2022-12-31 48.40890 -106.51440 30
8057 2022-12-31 48.54250 -109.76440 60
8058 2022-12-31 45.54940 -100.40860 39
8059 2022-12-31 43.53170 -112.94220 39
8060 2022-12-31 47.68720 -122.25530 0
[8061 rows x 4 columns], date lat lon AMSR_SWE
0 2022-12-25 39.95500 -120.53800 0
1 2022-12-25 42.95000 -112.83333 29
2 2022-12-25 36.23333 -106.43333 255
3 2022-12-25 36.23700 -106.42912 255
4 2022-12-25 44.45615 -113.30097 27
... ... ... ... ...
8056 2022-12-25 48.40890 -106.51440 18
8057 2022-12-25 48.54250 -109.76440 0
8058 2022-12-25 45.54940 -100.40860 33
8059 2022-12-25 43.53170 -112.94220 27
8060 2022-12-25 47.68720 -122.25530 0
[8061 rows x 4 columns]]
Merged data saved to all_training_points_snotel_ghcnd_in_westus.csv_amsr_dask_all_training_ponits_with_ghcnd.csv
3.6.4 AMSR Data Processing and Analysis Pipeline#
In this section, we see how to automate the process of downloading, mapping, processing, and analyzing AMSR (Advanced Microwave Scanning Radiometer) data.
It provides a streamlined and automated pipeline for handling AMSR data, from initial download and grid alignment to final data processing and analysis.
3.6.4.1 Find Closest Point in a Grid#
target_latitude (float): The latitude of the target point.
target_longitude (float): The longitude of the target point.
lat_grid (numpy array): A 2D array representing the grid of latitudes.
lon_grid (numpy array): A 2D array representing the grid of longitudes.
Here we calculate the squared Euclidean distance between the target point and each point in the grid, then find the grid point with the minimum distance to the target point and return the indices of that point along with its actual latitude and longitude.
def find_closest_index_numpy(target_latitude, target_longitude, lat_grid, lon_grid):
    # Calculate the squared Euclidean distance between the target point and all grid points
    distance_squared = (lat_grid - target_latitude)**2 + (lon_grid - target_longitude)**2

    # Find the indices of the minimum distance
    lat_idx, lon_idx = np.unravel_index(np.argmin(distance_squared), distance_squared.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
3.6.4.2 Find the closest grid point indices for a target latitude and longitude using KDTree#
Here we find the closest grid point indices to a given target latitude and longitude using a KDTree for efficient spatial searching.
lat_idx (int): The index of the closest latitude in the grid.
lon_idx (int): The index of the closest longitude in the grid.
lat_grid[lat_idx, lon_idx] (float): The actual latitude of the closest grid point.
lon_grid[lat_idx, lon_idx] (float): The actual longitude of the closest grid point.
We use a global KDTree (latlontree) to efficiently search for the nearest grid point. If the KDTree is not already built, we construct one from the latitude and longitude grids.
Before constructing the KDTree, we replace any NaN values in the latitude and longitude grids with 0.0 to ensure the tree is built without issues. The KDTree is then queried with the target latitude and longitude to find the nearest point in the grid.
Then we convert the 1D index returned by the KDTree back into 2D grid indices to identify the exact location in the original latitude and longitude grids.
Finally, we return the indices of the closest latitude and longitude, along with the corresponding values from the grid.
def find_closest_index_tree(target_latitude, target_longitude, lat_grid, lon_grid):
    """
    Find the closest grid point indices for a target latitude and longitude using KDTree.

    Parameters:
        target_latitude (float): Target latitude.
        target_longitude (float): Target longitude.
        lat_grid (numpy.ndarray): Array of latitude values.
        lon_grid (numpy.ndarray): Array of longitude values.

    Returns:
        int: Latitude index.
        int: Longitude index.
        float: Closest latitude value.
        float: Closest longitude value.
    """
    global latlontree
    if latlontree is None:
        # Create a KD-Tree from lat_grid and lon_grid
        lat_grid_cleaned = np.nan_to_num(lat_grid, nan=0.0)  # Replace NaN with 0
        lon_grid_cleaned = np.nan_to_num(lon_grid, nan=0.0)  # Replace NaN with 0
        latlontree = KDTree(list(zip(lat_grid_cleaned.ravel(), lon_grid_cleaned.ravel())))

    # Query the KD-Tree to find the nearest point
    distance, index = latlontree.query([target_latitude, target_longitude])

    # Convert the 1D index to 2D grid indices
    lat_idx, lon_idx = np.unravel_index(index, lat_grid.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
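Because the function caches the tree in a module-level latlontree variable, that name must exist (typically set to None) before the first call, and it should be reset whenever a new grid is used. A minimal usage sketch with a toy 2x2 grid (values invented for illustration):

import numpy as np
from scipy.spatial import KDTree

latlontree = None   # must be defined before the first call; reset when the grid changes

lat_grid = np.array([[40.0, 40.0],
                     [41.0, 41.0]])
lon_grid = np.array([[-120.0, -119.0],
                     [-120.0, -119.0]])

print(find_closest_index_tree(40.9, -119.2, lat_grid, lon_grid))
# (1, 1, 41.0, -119.0) -- the tree is built on the first call and reused afterwards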
3.6.4.3 Find the closest grid point indices for a target latitude and longitude#
Here we find the grid point in a latitude-longitude array that is closest to a given target latitude and longitude.
We compute the absolute difference between the target latitude and longitude and all points in lat_grid and lon_grid, and then sum these differences to create a simple distance metric for each grid point.
Next, we identify the grid point with the smallest sum of differences, which is considered the closest point to the target location.
Finally, we return the indices and the actual latitude and longitude values of this closest grid point.
def find_closest_index(target_latitude, target_longitude, lat_grid, lon_grid):
    """
    Find the closest grid point indices for a target latitude and longitude.

    Parameters:
        target_latitude (float): Target latitude.
        target_longitude (float): Target longitude.
        lat_grid (numpy.ndarray): Array of latitude values.
        lon_grid (numpy.ndarray): Array of longitude values.

    Returns:
        int: Latitude index.
        int: Longitude index.
        float: Closest latitude value.
        float: Closest longitude value.
    """
    lat_diff = np.float64(np.abs(lat_grid - target_latitude))
    lon_diff = np.float64(np.abs(lon_grid - target_longitude))

    # Find the indices corresponding to the minimum differences
    lat_idx, lon_idx = np.unravel_index(np.argmin(lat_diff + lon_diff), lat_grid.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
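The code comments note that the three nearest-point searches should be compared for speed. A rough timing sketch on a synthetic grid (grid size and coordinates are invented for illustration; it assumes latlontree has been initialized as in the sketch above):

import time
import numpy as np

n = 500
lat_grid, lon_grid = np.meshgrid(np.linspace(30, 80, n),
                                 np.linspace(-130, -60, n), indexing="ij")

latlontree = None   # reset the cached KD-Tree so it is rebuilt for this grid
for fn in (find_closest_index, find_closest_index_numpy, find_closest_index_tree):
    start = time.time()
    fn(44.45, -113.30, lat_grid, lon_grid)
    print(fn.__name__, f"{time.time() - start:.4f} s")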
3.6.4.4 Preparing the AMSR to GridMET Mapper#
The goal here is to create a mapping between AMSR grid data and GridMET grid points, saving the results to a CSV file.
target_csv_path: The file path where the mapping between AMSR and GridMET grid points will be saved as a CSV file.
target_amsr_hdf_path: The path where the AMSR data file is stored, or will be downloaded to if it doesn't exist.
western_us_coords: A CSV file containing the latitude and longitude of GridMET grid points for the western U.S.
file = h5py.File(target_amsr_hdf_path, 'r'): Opens the AMSR HDF5 file for reading, allowing access to its contents.
lat = np.nan_to_num(lat, nan=0.0): Replaces any NaN values in the latitude data with zeros to avoid errors during processing.
closest_lat_idx, closest_lon_idx, closest_lat, closest_lon = find_closest_index(target_lat, target_lon, lat, lon): Finds the closest AMSR grid point to each target GridMET point, which is crucial for accurate data mapping.
This code is essential for linking the satellite-based AMSR data grid with the ground-based GridMET grid.
def prepare_amsr_grid_mapper():
    df = pd.DataFrame(columns=['amsr_lat', 'amsr_lon',
                               'amsr_lat_idx', 'amsr_lon_idx',
                               'gridmet_lat', 'gridmet_lon'])
    date = test_start_date
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")

    # Check if the CSV already exists
    target_csv_path = "../data/amsr_to_gridmet_mapper.csv"
    if os.path.exists(target_csv_path):
        print(f"File {os.path.basename(target_csv_path)} already exists, skipping..")
        return

    target_amsr_hdf_path = f"../data/gridmet_test_run/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path):
        print(f"File {os.path.basename(target_amsr_hdf_path)} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ../data/cookies.txt -c ~/.mycookies.txt -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        subprocess.run(cmd, shell=True)

    # Read the HDF
    print('target_amsr_hdf_path', target_amsr_hdf_path)
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]

    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)

    # Set the number of rows to process
    num_rows_to_process = 100  # Change this value to the desired number of rows

    # Convert the AMSR grid into our gridMET 1km grid
    western_us_df = pd.read_csv(western_us_coords)
    for idx, row in western_us_df.iterrows():
        if idx >= num_rows_to_process:
            break
        target_lat = row['Latitude']
        target_lon = row['Longitude']
        # compare the performance and find the fastest way to search nearest point
        closest_lat_idx, closest_lon_idx, closest_lat, closest_lon = find_closest_index(target_lat, target_lon, lat, lon)
        df.loc[len(df.index)] = [closest_lat,
                                 closest_lon,
                                 closest_lat_idx,
                                 closest_lon_idx,
                                 target_lat,
                                 target_lon]

    # Save the new converted AMSR to CSV file
    df.to_csv(target_csv_path, index=False)
    print('AMSR mapper csv is created.')
3.6.4.5 Downloading and Converting AMSR Snow Data to DEM Format#
Here we automate the downloading, conversion, and saving of AMSR data aligned with a DEM grid. We also add a cumulative-sum column to a DataFrame, which is useful for tracking cumulative metrics over time.
target_date: The specific date for which AMSR data is being processed, initially set to test_start_date.
target_mapper_csv_path: The path to the CSV file that maps AMSR grid points to GridMET grid points.
target_csv_path: The file path where the final converted data will be saved as a CSV.
target_amsr_hdf_path: The path where the downloaded AMSR HDF5 file will be stored.
mapper_df['AMSR_SWE'] = mapper_df.apply(get_swe, axis=1): Applies the get_swe function to each row in the mapping DataFrame to calculate the SWE value for each GridMET point.
mapper_df.drop(columns=['amsr_lat', 'amsr_lon', ...]): Removes unnecessary columns from the DataFrame before saving the final results.
Here we first construct the URL for the AMSR data corresponding to the specified date and attempt to download it using the curl command, making sure the necessary cookies are available for authentication.
Once the data is downloaded, we read the HDF5 file with the h5py library, extracting the latitude, longitude, snow water equivalent (SWE), and flag information.
We then convert the AMSR grid into a format compatible with the DEM grid by finding the corresponding grid points in the DEM grid; this involves identifying the nearest DEM grid point for each AMSR grid point.
For each DEM grid point, we perform a custom calculation to determine the SWE and flag values based on the nearest AMSR grid points.
Finally, we save the converted data, including the latitude, longitude, SWE, and flag information, into a CSV file, which can be further processed or analyzed in subsequent steps of the script.
def download_amsr_and_convert_grid(target_date=test_start_date):
    """
    Download AMSR snow data, convert it to DEM format, and save as a CSV file.
    """
    # the mapper
    target_mapper_csv_path = f'{work_dir}/amsr_to_gridmet_mapper.csv'
    mapper_df = pd.read_csv(target_mapper_csv_path)
    #print(mapper_df.head())

    df = pd.DataFrame(columns=['date', 'lat',
                               'lon', 'AMSR_SWE',
                               'AMSR_Flag'])
    date = target_date
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")
    target_date = ""

    # Check if the CSV already exists
    target_csv_path = f"../data/gridmet_test_run/amsr_testing/testing_ready_amsr_{date}.csv"
    if os.path.exists(target_csv_path):
        print(f"File {os.path.basename(target_csv_path)} already exists, skipping..")
        return target_csv_path

    target_amsr_hdf_path = f"{work_dir}/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path) and is_binary(target_amsr_hdf_path):
        print(f"File {os.path.basename(target_amsr_hdf_path)} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ../data/cookies.txt -c ~/.urs_cookies -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Check the exit code
        if result.returncode != 0:
            print(f"Command failed with exit code {result.returncode}.")
            if os.path.exists(target_amsr_hdf_path):
                os.remove(target_amsr_hdf_path)
                print(f"Wrong {target_amsr_hdf_path} removed successfully.")
            raise Exception(f"Failed to download {target_amsr_hdf_path} - {result.stderr}")

    # Read the HDF
    print('file is ', target_amsr_hdf_path)
    print(f"Reading {target_amsr_hdf_path}")
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]

    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)
    swe = hem_group['Data Fields/SWE_NorthernDaily'][:]
    flag = hem_group['Data Fields/Flags_NorthernDaily'][:]
    date = datetime.strptime(date, '%Y.%m.%d')

    # Convert the AMSR grid into our DEM 1km grid
    def get_swe(row):
        # Perform your custom calculation here
        closest_lat_idx = int(row['amsr_lat_idx'])
        closest_lon_idx = int(row['amsr_lon_idx'])
        closest_swe = swe[closest_lat_idx, closest_lon_idx]
        return closest_swe

    def get_swe_flag(row):
        # Perform your custom calculation here
        closest_lat_idx = int(row['amsr_lat_idx'])
        closest_lon_idx = int(row['amsr_lon_idx'])
        closest_flag = flag[closest_lat_idx, closest_lon_idx]
        return closest_flag

    # Use the apply function to apply the custom function to each row
    mapper_df['AMSR_SWE'] = mapper_df.apply(get_swe, axis=1)
    mapper_df['AMSR_Flag'] = mapper_df.apply(get_swe_flag, axis=1)
    mapper_df['date'] = date
    mapper_df.rename(columns={'dem_lat': 'lat'}, inplace=True)
    mapper_df.rename(columns={'dem_lon': 'lon'}, inplace=True)
    mapper_df = mapper_df.drop(columns=['amsr_lat',
                                        'amsr_lon',
                                        'amsr_lat_idx',
                                        'amsr_lon_idx'])
    print("result df: ", mapper_df.head())

    # Save the new converted AMSR to CSV file
    print(f"saving the new AMSR SWE to csv: {target_csv_path}")
    mapper_df.to_csv(target_csv_path, index=False)
    print('Completed AMSR testing data collection.')
    return target_csv_path


def add_cumulative_column(df, column_name):
    df[f'cumulative_{column_name}'] = df[column_name].sum()
    return df
3.6.4.6 Aggregate Cumulative AMSR Snow Data and Export to CSV#
The goal of this code is to calculate cumulative Snow Water Equivalent (SWE) values from AMSR data over a specific period, fill any gaps in the data, and save the cumulative results into a CSV file. This is particularly useful for analyzing long-term snow accumulation trends.
target_date: The specific date for which cumulative AMSR data is calculated.
past_october_1: The date of October 1st in the previous or current year, serving as the start of the period.
target_csv_path: The path where the cumulative AMSR data will be saved.
gap_filled_csv: The path where the gap-filled version of the cumulative data will be saved.
past_october_1 = datetime(selected_date.year - 1, 10, 1): Sets the start date for cumulative calculations to October 1st of the previous year if the target date is before October; otherwise, October 1st of the current year is used.
data_dict[current_date_str] = download_amsr_and_convert_grid(current_date_str): Downloads and converts AMSR data for each day from October 1st to the target date, storing the results in a dictionary.
filtered_columns.interpolate(axis=1, method='linear', inplace=True): Fills in missing values in the SWE data using linear interpolation across time.
df[f'cumulative_{column_name}'] = sum_column: Adds a new column to the DataFrame containing the cumulative sum of SWE values.
filled_data.to_csv(gap_filled_csv, index=False): Saves the gap-filled cumulative data to a CSV file.
result.to_csv(target_csv_path, index=False): Saves the final cumulative data for the target date into a CSV file.
This cumulative information is valuable for understanding seasonal snow accumulation and can be used in various environmental analyses and forecasting models. The approach of handling missing data ensures the integrity and completeness of the dataset before it’s saved and used for further analysis.
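The core gap-filling trick in the function below is to treat SWE values above 240 as flag/missing codes, interpolate across the date columns, and then sum into a cumulative column. A tiny standalone illustration of that logic, using made-up values:

import pandas as pd

df = pd.DataFrame({
    "AMSR_SWE_2022-12-24": [10.0, 255.0],   # 255 is a flag code, not a real SWE value
    "AMSR_SWE_2022-12-25": [255.0, 5.0],
    "AMSR_SWE_2022-12-26": [14.0, 7.0],
})
swe = df.filter(like="AMSR_SWE")
swe = swe.mask(swe > 240)                        # drop flag values
swe = swe.interpolate(axis=1, method="linear")   # fill gaps along the dates
swe = swe.fillna(0)                              # anything left (e.g. leading gaps) becomes 0
df["cumulative_AMSR_SWE"] = swe.sum(axis=1)
print(df["cumulative_AMSR_SWE"].tolist())        # [36.0, 12.0]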
def get_cumulative_amsr_data(target_date=test_start_date, force=False):
    selected_date = datetime.strptime(target_date, "%Y-%m-%d")
    print(selected_date)
    if selected_date.month < 10:
        past_october_1 = datetime(selected_date.year - 1, 10, 1)
    else:
        past_october_1 = datetime(selected_date.year, 10, 1)

    # Traverse and print every day from past October 1 to the specific date
    current_date = past_october_1
    target_csv_path = f'{work_dir}/testing_ready_amsr_{target_date}_cumulative.csv'
    columns_to_be_cumulated = ["AMSR_SWE"]

    gap_filled_csv = f"{target_csv_path}_gap_filled.csv"
    if os.path.exists(gap_filled_csv) and not force:
        print(f"{gap_filled_csv} already exists, skipping..")
        df = pd.read_csv(gap_filled_csv)
        print(df["AMSR_SWE"].describe())
    else:
        date_keyed_objects = {}
        data_dict = {}
        new_df = None
        while current_date <= selected_date:
            print(current_date.strftime('%Y-%m-%d'))
            current_date_str = current_date.strftime('%Y-%m-%d')
            data_dict[current_date_str] = download_amsr_and_convert_grid(current_date_str)
            current_df = pd.read_csv(data_dict[current_date_str])
            current_df.drop(columns=["date"], inplace=True)
            if current_date != selected_date:
                current_df.rename(columns={
                    "AMSR_SWE": f"AMSR_SWE_{current_date_str}",
                    "AMSR_Flag": f"AMSR_Flag_{current_date_str}",
                }, inplace=True)
            #print(current_df.head())
            if new_df is None:
                new_df = current_df
            else:
                new_df = pd.merge(new_df, current_df, on=['gridmet_lat', 'gridmet_lon'])
                #new_df = new_df.append(current_df, ignore_index=True)
            current_date += timedelta(days=1)

        print("new_df.columns = ", new_df.columns)
        print("new_df.head = ", new_df.head())
        df = new_df
        #df.sort_values(by=['gridmet_lat', 'gridmet_lon', 'date'], inplace=True)
        print("All current head: ", df.head())
        print("the new_df.shape: ", df.shape)
        print("Start to fill in the missing values")
        #grouped = df.groupby(['gridmet_lat', 'gridmet_lon'])
        filled_data = pd.DataFrame()

        # Apply the function to each group
        for column_name in columns_to_be_cumulated:
            start_time = time.time()
            #filled_data = df.apply(lambda row: interpolate_missing_and_add_cumulative_inplace(row, column_name), axis=1)
            #alike_columns = filled_data.filter(like=column_name)
            #filled_data[f'cumulative_{column_name}'] = alike_columns.sum(axis=1)
            print("filled_data.columns = ", filled_data.columns)
            filtered_columns = df.filter(like=column_name)
            print(filtered_columns.columns)
            filtered_columns = filtered_columns.mask(filtered_columns > 240)
            filtered_columns.interpolate(axis=1, method='linear', inplace=True)
            filtered_columns.fillna(0, inplace=True)
            sum_column = filtered_columns.sum(axis=1)

            # Define a specific name for the new column
            df[f'cumulative_{column_name}'] = sum_column
            df[filtered_columns.columns] = filtered_columns
            if filtered_columns.isnull().any().any():
                print("filtered_columns :", filtered_columns)
                raise ValueError("Single group: shouldn't have null values here")
            print("filled_data.columns: ", filled_data.columns)
            end_time = time.time()
            # Calculate the elapsed time
            elapsed_time = end_time - start_time
            print(f"calculate column {column_name} elapsed time: {elapsed_time} seconds")

        filled_data = df
        filled_data["date"] = target_date
        print("Finished correctly ", filled_data.head())
        filled_data.to_csv(gap_filled_csv, index=False)
        print(f"New filled values csv is saved to {gap_filled_csv}")
        df = filled_data

    result = df
    print("result.head = ", result.head())
    # fill in the rest NA as 0
    if result.isnull().any().any():
        print("result :", result)
        raise ValueError("Single group: shouldn't have null values here")

    # only retain the rows of the target date
    print(result['date'].unique())
    print(result.shape)
    print(result[["AMSR_SWE", "AMSR_Flag"]].describe())
    result.to_csv(target_csv_path, index=False)
    print(f"New data is saved to {os.path.basename(target_csv_path)}")
3.6.4.7 Interpolate Missing Values and Calculate Cumulative SWE In-Place for AMSR Data#
Here we aim to ensure that any missing or anomalous data points within a specific column are handled appropriately through interpolation, and then a cumulative sum is calculated.
This is particularly useful in time series data, where continuity is crucial, and missing data could skew analysis.
The cumulative sum provides an aggregated measure that can be used for further analysis, such as tracking total snowfall or snow cover over a period.
row: A Pandas Series representing a single row of data from a DataFrame, which contains the values to be interpolated.
column_name: The specific column in the DataFrame that needs interpolation and cumulative calculation.
degree: The degree of the polynomial used for interpolation, with a default of 1 (linear).
x_subset_key = x_all_key[x_all_key.str.startswith(column_name)]: Identifies the subset of columns within the row that corresponds to the specified column_name, allowing focused interpolation on the relevant data.
are_all_values_between_0_and_240 = row[x_subset_key].between(1, 239).all(): Checks whether all values in the specified subset are within a valid range (for SWE, between 1 and 239), which helps ensure that the data is suitable for interpolation.
row[f"cumulative_{column_name}"] = row[x_subset_key].sum(): After interpolation, calculates the cumulative sum of the values in the subset and adds it as a new column in the row.
def interpolate_missing_and_add_cumulative_inplace(row, column_name, degree=1):
    """
    Interpolate missing values in a Pandas Series using polynomial interpolation
    and add a cumulative column.

    Parameters:
    - row (pd.Series): The input row containing the data to be interpolated.
    - column_name (str): The name of the column to be interpolated.
    - degree (int, optional): The degree of the polynomial fit. Default is 1 (linear).

    Returns:
    - pd.Series: The row with interpolated values and a cumulative column.

    Raises:
    - ValueError: If there are unexpected null values after interpolation.

    Note:
    - For 'SWE' column, values above 240 are treated as gaps and set to 240.
    - For 'fsca' column, values above 100 are treated as gaps and set to 100.

    Examples:
    ```python
    # Example usage:
    interpolated_row = interpolate_missing_and_add_cumulative_inplace(my_row, 'fsca', degree=2)
    ```
    """
    # Extract X series (column names)
    x_all_key = row.index
    x_subset_key = x_all_key[x_all_key.str.startswith(column_name)]
    are_all_values_between_0_and_240 = row[x_subset_key].between(1, 239).all()
    if are_all_values_between_0_and_240:
        print("row[x_subset_key] = ", row[x_subset_key])
        print("row[x_subset_key].sum() = ", row[x_subset_key].sum())
        # create the cumulative column after interpolation
        row[f"cumulative_{column_name}"] = row[x_subset_key].sum()
    return row
3.6.4.8 Running the AMSR Data Extraction Process#
This script handles the entire workflow, from data preparation to the generation of cumulative time series data.
prepare_amsr_grid_mapper(): Maps the AMSR grid to the gridMET grid, preparing the necessary data for further processing.
get_cumulative_amsr_data(force=False): Calculates cumulative AMSR data for a specific date range and handles any missing data. The force parameter determines whether to overwrite existing processed data.
input_time_series_file: Defines the file path for the cumulative AMSR data, which will be used in subsequent analyses or processes.
# Run the download and conversion function
prepare_amsr_grid_mapper()
download_amsr_and_convert_grid()
get_cumulative_amsr_data(force=False)
input_time_series_file = f'{work_dir}/testing_ready_amsr_{test_start_date}_cumulative.csv_gap_filled.csv'
File amsr_to_gridmet_mapper.csv already exists, skipping..
File testing_ready_amsr_2024.07.18.csv already exists, skipping..
2024-07-18 00:00:00
../data/gridmet_test_run/testing_ready_amsr_2024-07-18_cumulative.csv_gap_filled.csv already exists, skipping..
count 100.0
mean 0.0
std 0.0
min 0.0
25% 0.0
50% 0.0
75% 0.0
max 0.0
Name: AMSR_SWE, dtype: float64
result.head = gridmet_lat gridmet_lon AMSR_SWE_2023-10-01 AMSR_Flag_2023-10-01 \
0 49.0 -125.000 0.0 241
1 49.0 -124.964 0.0 241
2 49.0 -124.928 0.0 241
3 49.0 -124.892 0.0 241
4 49.0 -124.856 0.0 241
AMSR_SWE_2023-10-02 AMSR_Flag_2023-10-02 AMSR_SWE_2023-10-03 \
0 0.0 241 0.0
1 0.0 241 0.0
2 0.0 241 0.0
3 0.0 241 0.0
4 0.0 241 0.0
AMSR_Flag_2023-10-03 AMSR_SWE_2023-10-04 AMSR_Flag_2023-10-04 ... \
0 241 0.0 241 ...
1 241 0.0 241 ...
2 241 0.0 241 ...
3 241 0.0 241 ...
4 241 0.0 241 ...
AMSR_SWE_2024-07-15 AMSR_Flag_2024-07-15 AMSR_SWE_2024-07-16 \
0 0.0 241 0.0
1 0.0 241 0.0
2 0.0 241 0.0
3 0.0 241 0.0
4 0.0 241 0.0
AMSR_Flag_2024-07-16 AMSR_SWE_2024-07-17 AMSR_Flag_2024-07-17 AMSR_SWE \
0 241 0.0 241 0.0
1 241 0.0 241 0.0
2 241 0.0 241 0.0
3 241 0.0 241 0.0
4 241 0.0 241 0.0
AMSR_Flag cumulative_AMSR_SWE date
0 241 7.5 2024-07-18
1 241 7.5 2024-07-18
2 241 7.5 2024-07-18
3 241 7.5 2024-07-18
4 241 7.5 2024-07-18
[5 rows x 588 columns]
['2024-07-18']
(100, 588)
AMSR_SWE AMSR_Flag
count 100.0 100.000000
mean 0.0 247.750000
std 0.0 6.927292
min 0.0 241.000000
25% 0.0 241.000000
50% 0.0 241.000000
75% 0.0 255.000000
max 0.0 255.000000
New data is saved to testing_ready_amsr_2024-07-18_cumulative.csv
Utility Functions for Feature Extraction#
The following functions are categorized as utility functions. These functions are not central to the main discussion but play a supportive role by providing necessary functionality. They can be referenced as needed throughout the chapter to simplify and streamline the main code examples.
1. Importing Required Python Libraries to Run the Script#
Importing libraries: Essential libraries are imported for handling files, processing large datasets, and performing complex calculations.
os, shutil, subprocess: For file handling, copying, and executing shell commands.
csv, h5py, numpy, pandas: For reading/writing files, handling HDF5 datasets, numerical computations, and data manipulation.
dask, xarray: To manage and process large datasets efficiently using parallel computing.
import os
import csv
import h5py
import shutil
import numpy as np
import pandas as pd
from datetime import datetime
import dask
import dask.dataframe as dd
import dask.delayed as delayed
import dask.bag as db
import xarray as xr
import subprocess
# For demonstration purposes, we're using one week of data for training.
# The training period is set from December 24, 2022, to December 31, 2022.
train_start_date = "2022-12-24"
train_end_date = "2022-12-31"
work_dir = "../data/gridmet_test_run"
2. Function to Copy .he5 Files from Source to Destination Directory#
The goal here is to copy all .he5
files from a specified source directory to a destination directory.
source_dir: The directory where the .he5 files are originally located.
destination_dir: The target directory where the .he5 files will be copied.
os.walk: A function that traverses the directory tree, accessing all subdirectories and files.
shutil.copy: A method used to copy the files from the source to the destination.
The code specifically looks for files with the .he5
extension to identify the relevant files for copying.
def copy_he5_files(source_dir, destination_dir):
    '''
    Copy .he5 files from the source directory to the destination directory.

    Args:
        source_dir (str): The source directory containing .he5 files to copy.
        destination_dir (str): The destination directory where .he5 files will be copied.

    Returns:
        None
    '''
    # Get a list of all subdirectories and files in the source directory
    for root, dirs, files in os.walk(source_dir):
        for file in files:
            if file.endswith('.he5'):
                # Get the absolute path of the source file
                source_file_path = os.path.join(root, file)
                # Copy the file to the destination directory
                shutil.copy(source_file_path, destination_dir)
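A hypothetical usage sketch (the source path is an example only; adjust it to wherever your downloads actually live):

import os

source_dir = "../data/amsr_raw_downloads"        # example path, not part of the pipeline above
destination_dir = "../data/gridmet_test_run/amsr"
os.makedirs(destination_dir, exist_ok=True)      # shutil.copy needs an existing destination folder
copy_he5_files(source_dir, destination_dir)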
Utility Functions for Data Processing and Analysis Pipeline#
1. Library Imports and Setup for Snow Data#
KDTree: From scipy.spatial, used for performing efficient nearest-neighbor searches in spatial datasets.
plot_all_variables_in_one_csv: A custom function from convert_results_to_images, used for visualizing processed data.
import os
import h5py
import subprocess
import pandas as pd
import numpy as np
from datetime import datetime
from scipy.spatial import KDTree
import time
from datetime import datetime, timedelta, date
import warnings
import sys
# from convert_results_to_images import plot_all_variables_in_one_csv
homedir = os.path.expanduser('~')
work_dir = "../data/gridmet_test_run"
test_start_date = "2024-07-18"
western_us_coords = "../data/dem_file.tif.csv"
2. Identifying Binary Files#
Here we determine whether a given file is a binary file or a text file.
We attempt to open the file in binary mode ('rb') and read a chunk of bytes (1024 bytes).
We then check for null bytes (b'\x00'), which are common in binary files; if a null byte is found, the file is treated as binary.
Next, we check for a high percentage of non-printable ASCII characters by converting the byte chunk to characters and filtering out non-printable ones. If the chunk has no printable characters, the file is considered binary.
If neither of the above conditions is met, the function assumes the file is a text file.
def is_binary(file_path):
    try:
        with open(file_path, 'rb') as file:
            # Read a chunk of bytes from the file
            chunk = file.read(1024)

            # Check for null bytes, a common indicator of binary data
            if b'\x00' in chunk:
                return True

            # Check for a high percentage of non-printable ASCII characters
            text_characters = "".join(chr(byte) for byte in chunk if 32 <= byte <= 126)
            if not text_characters:
                return True

            # If none of the binary indicators are found, assume it's a text file
            return False
    except FileNotFoundError:
        print(f"File '{file_path}' not found.")
        return False
    except Exception as e:
        print(f"An error occurred: {e}")
        return False
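One place this helper matters is right after a download: a failed or unauthenticated Earthdata request can leave a small text/HTML error page where the .he5 granule should be, which is why download_amsr_and_convert_grid checks is_binary before skipping a re-download. A short usage sketch (the path is an example taken from this chapter's test run):

import os

path = "../data/gridmet_test_run/amsr_testing/testing_amsr_2024.07.18.he5"
if os.path.exists(path) and not is_binary(path):
    print(f"{path} does not look like a valid HDF5 granule; consider re-downloading it.")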