See the Jupyter Book and GitHub repo.
Issues#
Duplicate timestamps in THREDDS 1Min data#
Overcome this by using zarr files from CAVA.
In the data.ipynb notebook the goals are:

- Ingest low-time-resolution data from the THREDDS server
  - 1 sample per minute
  - 17 data files in NetCDF format, 23 sensors
- Select a time window (e.g. one month: Jan 2022)
- Write the results as NetCDF files to a local (in-repo) data folder
- Check the data for expected profiler behavior based on the `z` depth data variable
- Extract a timestamp dataset (table > CSV file) for ascents, descents, and rest intervals (a sketch of one possible approach follows this list)
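The last two goals lean on the `z` depth signal. Below is a minimal sketch of one way to classify samples, assuming a depth DataArray `z` that is negative downward and a hypothetical motion `threshold`; the notebook's actual logic may differ.

```python
import numpy as np
import xarray as xr

def classify_profiler_state(z: xr.DataArray, threshold=0.05):
    """Label each sample 'ascent', 'descent', or 'rest' from the depth signal.

    Sketch only: assumes z is negative downward (so z increases toward the
    surface during an ascent) and that 'threshold' (meters per 1Min sample)
    separates motion from rest; both are assumptions, not values from the repo.
    """
    dz = np.diff(z.values, prepend=z.values[0])        # change in z per sample
    state = np.where(dz > threshold, 'ascent',
                     np.where(dz < -threshold, 'descent', 'rest'))
    return xr.DataArray(state, coords={'time': z['time']}, dims='time')
```

Contiguous runs of each label then give the start/end timestamps for ascents, descents, and rest intervals.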
It proved to be the case that the 1Min data often had multiple values for a given timestamp. I added a fix to the processing chain and did an in-place repair of the initial dataset.

The main issue: when a timestamp is duplicated, the corresponding sensor values differ, so the records are not true duplicates. Why?

Lesser issue: why does a loop over Dataset times bog down and fail to complete? (see 'bog down' below)
Continuing to the fix: presuming 1Min data for one month, January 2022 amounts to 31 * 24 * 60 = 44640 samples. Survey the 17 data files for duplicates:
print(xr.open_dataset('../data/osb_do_jan22_do.nc'))
etcetera...
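Rather than printing each file by hand, a short loop can count duplicates directly. This is a sketch, assuming a hypothetical `jan22_files` list (two of the actual filenames shown; the full set of 17 lives in the repo's data folder):

```python
import numpy as np
import xarray as xr

# hypothetical subset of the 17 January 2022 files in ../data
jan22_files = ['osb_do_jan22_do.nc', 'osb_ctd_jan22_temperature.nc']

for fnm in jan22_files:
    ds = xr.open_dataset('../data/' + fnm)
    n_total  = len(ds['time'])
    n_unique = len(np.unique(ds['time']))
    print(f'{fnm}: {n_total} timestamps, {n_total - n_unique} duplicated')
```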
Results: the first six sensor groups clearly have many duplicates; their timestamp counts are well above 44640.

| timestamps | sensor | data variables / notes |
|-----------:|--------|------------------------|
| 146283 | ctd | density, pressure, salinity, temperature, conductivity |
| 146206 | do | |
| 138837 | fluor | bb, chlora, fdom |
| 146185 | par | |
| 147451 | spkir (one file) | 7 channels: 412, 443, 490, 510, 555, 620, 683 nm |
| 149216 | current (three files) | east, north, up |
| 998 | pco2 | no duplicates |
| 1005 | ph | 2 duplicates |
| 6399 | nitrate | duplicates present |
These two lines of code remove duplicate timestamps from a Dataset:
# keep only the first occurrence of each timestamp
_, keeper_index = np.unique(ds['time'], return_index=True)
ds = ds.isel(time=keeper_index)
Code explanation#
`np.unique()` called with `return_index=True` returns a tuple `(t, i)`: `t` is a sorted copy of the `time` dimension as a numpy array, and `i` is the list of indices of the unique values. We do not care about the sorted copy of `time`, since it is just a piece of the Dataset; it is assigned to the `_` variable, which is Python shorthand for "I do not care". The returned indices `i` are those of the first occurrence of each timestamp, whether it is duplicated or not. They are assigned to `keeper_index`, which in turn drives the `.isel()` subset operation applied to the entire Dataset: the `time` dimension is subsetted, as are the coordinates and data variables.
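To make the `return_index=True` behavior concrete, here is a small standalone numpy example with toy timestamps (not profiler data):

```python
import numpy as np

t = np.array(['2022-01-01T00:00', '2022-01-01T00:00', '2022-01-01T00:01',
              '2022-01-01T00:02', '2022-01-01T00:02'], dtype='datetime64[m]')
sorted_t, keeper_index = np.unique(t, return_index=True)
print(keeper_index)    # [0 2 3]: index of the first occurrence of each timestamp
print(sorted_t)        # the three unique timestamps, sorted
```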
Code to investigate duplication of timestamps#
import xarray as xr

ds_temperature = xr.open_dataset('../data/osb_ctd_jan22_temperature.nc')
ds_ph = xr.open_dataset('../data/osb_ph_jan22_ph.nc')
len_ph = len(ds_ph['time'])
len_temperature = len(ds_temperature['time'])
ndup_ph = 0
nvar_ph = 0
ndup_temperature = 0
nvar_temperature = 0
for i in range(len_ph-1):
if ds_ph['time'][i] == ds_ph['time'][i+1]:
ndup_ph += 1
if not ds_ph['ph'][i] == ds_ph['ph'][i+1]:
nvar_ph += 1
# Notice the code below only looks at the first 1000 samples, not all 140k. When trying
# to do all 140k the loop bogs down and fails to finish. This is strange as the data
# are thought to be sorted by time. Guess: Some JIT disk access pathology?
for i in range(1000):
if ds_temperature['time'][i] == ds_temperature['time'][i+1]:
ndup_temperature += 1
if not ds_temperature['temperature'][i] == ds_temperature['temperature'][i+1]:
nvar_temperature += 1
print('ph:')
print(len_ph, ndup_ph, nvar_ph)
print()
print('temperature:')
print(len_temperature, ndup_temperature, nvar_temperature)
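One possible workaround for the bog-down, assuming it comes from indexing the on-disk arrays one element at a time: pull the values into numpy once and compare adjacent samples as whole arrays. This is a sketch, not the notebook's code:

```python
import xarray as xr

ds = xr.open_dataset('../data/osb_ctd_jan22_temperature.nc')

# load into memory once, then compare neighbors with array operations
t = ds['time'].values
v = ds['temperature'].values

same_time  = t[1:] == t[:-1]                 # adjacent samples with identical timestamps
diff_value = same_time & (v[1:] != v[:-1])   # ...whose sensor values nevertheless differ

print(len(t), same_time.sum(), diff_value.sum())
```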
Remove duplicate timestamps#
This is free-standing but the important code is now in the main ingest function `ReformatDataFile()` found in `data.py`.
# jan22_data: list of January 2022 NetCDF filenames (defined elsewhere)
for s in jan22_data:
    fnm = '../data/' + s
    ds = xr.open_dataset(fnm)
    len0 = len(ds['time'])
    # keep only the first occurrence of each timestamp
    _, keeper_index = np.unique(ds['time'], return_index=True)
    ds = ds.isel(time=keeper_index)
    len1 = len(ds['time'])
    ds.to_netcdf(fnm)
    print(s + ' started with ' + str(len0) + ' timestamps, concluded with ' + str(len1))
output:
osb_ctd_jan22_pressure.nc started with 146283 timestamps, concluded with 44638
osb_ctd_jan22_temperature.nc started with 146283 timestamps, concluded with 44638
osb_ctd_jan22_density.nc started with 146283 timestamps, concluded with 44638
osb_ctd_jan22_salinity.nc started with 146283 timestamps, concluded with 44638
osb_ctd_jan22_conductivity.nc started with 146283 timestamps, concluded with 44638
osb_fluor_jan22_fdom.nc started with 138837 timestamps, concluded with 43150
osb_fluor_jan22_chlora.nc started with 138837 timestamps, concluded with 43150
osb_fluor_jan22_bb.nc started with 138837 timestamps, concluded with 43150
osb_spkir_jan22_spkir.nc started with 147451 timestamps, concluded with 44638
osb_nitrate_jan22_nitrate.nc started with 6399 timestamps, concluded with 4101
osb_pco2_jan22_pco2.nc started with 998 timestamps, concluded with 998
osb_do_jan22_do.nc started with 146206 timestamps, concluded with 44627
osb_par_jan22_par.nc started with 146185 timestamps, concluded with 44638
osb_ph_jan22_ph.nc started with 1005 timestamps, concluded with 1003
osb_vel_jan22_up.nc started with 149216 timestamps, concluded with 44638
osb_vel_jan22_east.nc started with 149216 timestamps, concluded with 44638
osb_vel_jan22_north.nc started with 149216 timestamps, concluded with 44638
Typical file size dropped from 3 MB to 900 kB.
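For context, a hedged sketch of how the de-duplication step might sit inside an ingest helper; `ReformatDataFile()` in `data.py` is the real implementation and may differ in its details:

```python
import numpy as np
import xarray as xr

def deduplicate_netcdf(fnm):
    """Drop repeated timestamps (keeping the first occurrence) and rewrite the file.
    Sketch only; the repo's ReformatDataFile() handles more than this step."""
    ds = xr.open_dataset(fnm)
    _, keeper_index = np.unique(ds['time'], return_index=True)
    ds_fixed = ds.isel(time=keeper_index).load()   # pull data into memory
    ds.close()                                     # release the source file before overwriting it
    ds_fixed.to_netcdf(fnm)
    return ds_fixed
```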
recreating shallow profiler notebook#
- Need consistent naming for spkir
- Need to re-get the data from CAVA
- Address bad pCO2 for months other than July 2022
icepyx install is tetchy#
I clone the `argo` branch: `git clone -b argo https://github.com/icesat2/icepyx`. I must install from the icepyx directory using `cd icepyx; pip install ./icepyx`. After that, `import icepyx` works from a notebook located in the `icepyx` folder. Expect this to be repaired on 1/1/2024.
current velocity#
How is current velocity data generated? ADCP on the profiler platform?
bumpy data#
The reference data production must be modified to get away from sawtooth standard deviation profiles. These were a little apparent in the `jan22` data and much more pronounced in the `jul21` data.