See the Jupyter Book and GitHub repo.
Issues#
Duplicate timestamps in THREDDS 1Min data#
Overcome this by using zarr files from CAVA.
In the data.ipynb notebook the goals are:

- Ingest low-time-resolution data from the THREDDS server
  - 1 sample per minute
  - 17 data files in NetCDF format, 23 sensors
- Select a time window (e.g. one month: Jan 2022)
- Write the results as NetCDF files to a local (in-repo) data folder
- Check the data for expected profiler behavior based on the `z` depth data variable
- Extract a timestamp dataset (table > CSV file) for ascents, descents, and rest intervals (a sketch of one possible approach follows this list)
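The last two goals lean on the `z` depth signal. Below is a minimal sketch of one way to classify samples, assuming a depth DataArray `z` that is negative downward and a hypothetical motion `threshold`; the notebook's actual logic may differ.

```python
import numpy as np
import xarray as xr

def classify_profiler_state(z: xr.DataArray, threshold=0.05):
    """Label each sample 'ascent', 'descent', or 'rest' from the depth signal.

    Sketch only: assumes z is negative downward (so z increases toward the
    surface during an ascent) and that 'threshold' (meters per 1Min sample)
    separates motion from rest; both are assumptions, not values from the repo.
    """
    dz = np.diff(z.values, prepend=z.values[0])        # change in z per sample
    state = np.where(dz > threshold, 'ascent',
                     np.where(dz < -threshold, 'descent', 'rest'))
    return xr.DataArray(state, coords={'time': z['time']}, dims='time')
```

Contiguous runs of each label then give the start/end timestamps for ascents, descents, and rest intervals.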
It proved to be the case that the 1Min data often had multiple values for a given timestamp. I added a fix to the processing chain and did an in-place repair of the initial dataset.

The main issue: when a timestamp is duplicated, the corresponding sensor values differ, so the records are not true duplicates. Why?

Lesser issue: why does a loop over Dataset times bog down and fail to complete? (see 'bog down' below)
Continuing to the fix: presuming 1Min data for one month, January 2022 amounts to 31 * 24 * 60 = 44640 samples. Survey the 17 data files for duplicates:
print(xr.open_dataset('../data/osb_do_jan22_do.nc'))
etcetera...
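Rather than printing each file by hand, a short loop can count duplicates directly. This is a sketch, assuming a hypothetical `jan22_files` list (two of the actual filenames shown; the full set of 17 lives in the repo's data folder):

```python
import numpy as np
import xarray as xr

# hypothetical subset of the 17 January 2022 files in ../data
jan22_files = ['osb_do_jan22_do.nc', 'osb_ctd_jan22_temperature.nc']

for fnm in jan22_files:
    ds = xr.open_dataset('../data/' + fnm)
    n_total  = len(ds['time'])
    n_unique = len(np.unique(ds['time']))
    print(f'{fnm}: {n_total} timestamps, {n_total - n_unique} duplicated')
```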
Results: the first six sensor groups clearly have many duplicates; their timestamp counts are well above 44640.

| timestamps | sensor | data variables / notes |
|-----------:|--------|------------------------|
| 146283 | ctd | density, pressure, salinity, temperature, conductivity |
| 146206 | do | |
| 138837 | fluor | bb, chlora, fdom |
| 146185 | par | |
| 147451 | spkir (one file) | 7 channels: 412, 443, 490, 510, 555, 620, 683 nm |
| 149216 | current (three files) | east, north, up |
| 998 | pco2 | no duplicates |
| 1005 | ph | 2 duplicates |
| 6399 | nitrate | duplicates present |
These two lines of code remove duplicate timestamps from a Dataset:
# keep only the first occurrence of each timestamp
_, keeper_index = np.unique(ds['time'], return_index=True)
ds = ds.isel(time=keeper_index)
Code explanation#
`np.unique()` called with `return_index=True` returns a tuple `(t, i)`: `t` is a sorted copy of the `time` dimension as a numpy array, and `i` is the list of indices of the unique values. We do not care about the sorted copy of `time`, since it is just a piece of the Dataset; it is assigned to the `_` variable, which is Python shorthand for "I do not care". The returned indices `i` are those of the first occurrence of each timestamp, whether it is duplicated or not. They are assigned to `keeper_index`, which in turn drives the `.isel()` subset operation applied to the entire Dataset: the `time` dimension is subsetted, as are the coordinates and data variables.
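To make the `return_index=True` behavior concrete, here is a small standalone numpy example with toy timestamps (not profiler data):

```python
import numpy as np

t = np.array(['2022-01-01T00:00', '2022-01-01T00:00', '2022-01-01T00:01',
              '2022-01-01T00:02', '2022-01-01T00:02'], dtype='datetime64[m]')
sorted_t, keeper_index = np.unique(t, return_index=True)
print(keeper_index)    # [0 2 3]: index of the first occurrence of each timestamp
print(sorted_t)        # the three unique timestamps, sorted
```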
Code to investigate duplication of timestamps#
import xarray as xr

ds_temperature = xr.open_dataset('../data/osb_ctd_jan22_temperature.nc')
ds_ph = xr.open_dataset('../data/osb_ph_jan22_ph.nc')
len_ph = len(ds_ph['time'])
len_temperature = len(ds_temperature['time'])
ndup_ph = 0
nvar_ph = 0
ndup_temperature = 0
nvar_temperature = 0
for i in range(len_ph-1):
if ds_ph['time'][i] == ds_ph['time'][i+1]:
ndup_ph += 1
if not ds_ph['ph'][i] == ds_ph['ph'][i+1]:
nvar_ph += 1
# Notice the code below only looks at the first 1000 samples, not all 140k. When trying
# to do all 140k the loop bogs down and fails to finish. This is strange as the data
# are thought to be sorted by time. Guess: Some JIT disk access pathology?
for i in range(1000):
if ds_temperature['time'][i] == ds_temperature['time'][i+1]:
ndup_temperature += 1
if not ds_temperature['temperature'][i] == ds_temperature['temperature'][i+1]:
nvar_temperature += 1
print('ph:')
print(len_ph, ndup_ph, nvar_ph)
print()
print('temperature:')
print(len_temperature, ndup_temperature, nvar_temperature)
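One possible workaround for the bog-down, assuming it comes from indexing the on-disk arrays one element at a time: pull the values into numpy once and compare adjacent samples as whole arrays. This is a sketch, not the notebook's code:

```python
import xarray as xr

ds = xr.open_dataset('../data/osb_ctd_jan22_temperature.nc')

# load into memory once, then compare neighbors with array operations
t = ds['time'].values
v = ds['temperature'].values

same_time  = t[1:] == t[:-1]                 # adjacent samples with identical timestamps
diff_value = same_time & (v[1:] != v[:-1])   # ...whose sensor values nevertheless differ

print(len(t), same_time.sum(), diff_value.sum())
```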
Remove duplicate timestamps#
This is free-standing but the important code is now in the main ingest function `ReformatDataFile()` found in `data.py`.
# jan22_data: list of January 2022 NetCDF filenames (defined elsewhere)
for s in jan22_data:
    fnm = '../data/' + s
    ds = xr.open_dataset(fnm)
    len0 = len(ds['time'])
    # keep only the first occurrence of each timestamp
    _, keeper_index = np.unique(ds['time'], return_index=True)
    ds = ds.isel(time=keeper_index)
    len1 = len(ds['time'])
    ds.to_netcdf(fnm)
    print(s + ' started with ' + str(len0) + ' timestamps, concluded with ' + str(len1))
output:
osb_ctd_jan22_pressure.nc started with 146283 timestamps, concluded with 44638
osb_ctd_jan22_temperature.nc started with 146283 timestamps, concluded with 44638
osb_ctd_jan22_density.nc started with 146283 timestamps, concluded with 44638
osb_ctd_jan22_salinity.nc started with 146283 timestamps, concluded with 44638
osb_ctd_jan22_conductivity.nc started with 146283 timestamps, concluded with 44638
osb_fluor_jan22_fdom.nc started with 138837 timestamps, concluded with 43150
osb_fluor_jan22_chlora.nc started with 138837 timestamps, concluded with 43150
osb_fluor_jan22_bb.nc started with 138837 timestamps, concluded with 43150
osb_spkir_jan22_spkir.nc started with 147451 timestamps, concluded with 44638
osb_nitrate_jan22_nitrate.nc started with 6399 timestamps, concluded with 4101
osb_pco2_jan22_pco2.nc started with 998 timestamps, concluded with 998
osb_do_jan22_do.nc started with 146206 timestamps, concluded with 44627
osb_par_jan22_par.nc started with 146185 timestamps, concluded with 44638
osb_ph_jan22_ph.nc started with 1005 timestamps, concluded with 1003
osb_vel_jan22_up.nc started with 149216 timestamps, concluded with 44638
osb_vel_jan22_east.nc started with 149216 timestamps, concluded with 44638
osb_vel_jan22_north.nc started with 149216 timestamps, concluded with 44638
Typical file size dropped from 3 MB to 900 kB.
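For context, a hedged sketch of how the de-duplication step might sit inside an ingest helper; `ReformatDataFile()` in `data.py` is the real implementation and may differ in its details:

```python
import numpy as np
import xarray as xr

def deduplicate_netcdf(fnm):
    """Drop repeated timestamps (keeping the first occurrence) and rewrite the file.
    Sketch only; the repo's ReformatDataFile() handles more than this step."""
    ds = xr.open_dataset(fnm)
    _, keeper_index = np.unique(ds['time'], return_index=True)
    ds_fixed = ds.isel(time=keeper_index).load()   # pull data into memory
    ds.close()                                     # release the source file before overwriting it
    ds_fixed.to_netcdf(fnm)
    return ds_fixed
```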
recreating shallow profiler notebook#
- Need consistent naming for spkir
- Need to re-get the data from CAVA
- Address bad pCO2 for months other than July 2022
icepyx install is tetchy#
I clone the `argo` branch: `git clone -b argo https://github.com/icesat2/icepyx`. I must install from the icepyx directory using `cd icepyx; pip install ./icepyx`. After that, `import icepyx` works from a notebook located in the `icepyx` folder. Expect this to be repaired on 1/1/2024.
current velocity#
How is current velocity data generated? ADCP on the profiler platform?
bumpy data#
The reference data production must be modified to get away from sawtooth standard deviation profiles. These were a little apparent in the `jan22` data and much more pronounced in the `jul21` data.