Data#
The second part of this oceanography Jupyter Book is concerned with methods and technical details in support of the science presented in the first part. The Epipelargosy chapter concerned the upper water column as observed by the regional cabled array shallow profilers.
Shallow Profiler Data#
Data types#
There are two data concepts:
Platform metadata: A record of when a profiler is at rest, ascending, or descending
Sensor data: Sensor values as a function of depth for a given profile: salinity, temperature, and so on
We identify the sensor data of interest by pulling a time range from the platform metadata. For this purpose we refer to a rest interval followed by an ascent followed by a descent as a single shallow profiler profile.
Selecting sensor data from profile metadata#
Profiles take on the order of an hour for one ascent/descent cycle as the Science Pod traverses from the platform at a depth of 200 meters to near the surface and back again. Typically nine profiles run per day. We consider each profile to be an observation of the state of the epipelagic water column. A data chart of one such observation features a vertical axis for depth (with the surface at the top) and a horizontal axis for the sensor parameter. Note that there is no time axis in this chart scheme.
Profile metadata is stored in a CSV file as a table of timestamps. A single profile consists of three consecutive stages as noted: Rest, Ascent, and Descent. (The Rest stage has the Science Pod parked in the platform at a depth of 200 meters.) Profile metadata is read into memory as a pandas DataFrame where each row corresponds to an entire profile.
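A minimal sketch of that load step, assuming the column naming described under "How to use profile files" below (the 2022 filename here is illustrative; in practice this notebook uses the ReadProfileMetadata() helper from data.py):

```python
import pandas as pd

# Hypothetical sketch: read one year of profile metadata and parse the six
# event timestamps (rest / ascent / descent start and end) into Timestamps.
profiles = pd.read_csv('./data/rca/profiles/osb/2022.csv')   # assumed filename
for c in ['r0t', 'r1t', 'a0t', 'a1t', 'd0t', 'd1t']:
    profiles[c] = pd.to_datetime(profiles[c])
# Each row now describes one complete Rest - Ascent - Descent profile.
```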
Sensor data#
There are about 31 million seconds in a year, and the sampling rate for many shallow profiler sensors is about one sample per second. As a result some data volume management is necessary. We can, for example, work with smaller blocks of time, typically somewhere between one day and one month.
The current solution to managing a fairly large dataset is to create a symbolic link (ln -s <target> <link-name>) from an empty folder in this repository to an external data volume and use the dataloader.ipynb notebook to populate it. Current emphasis is January 2022 at the Oregon Slope Base site. This breaks the Jupyter Book build (the build has no way of following a link from the repository to an external data volume), so that will be a problem to solve another day; perhaps by staging the reduced datasets to an open cloud location.
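A hedged sketch of setting up such a link from Python (the external path below is an assumption; any external data volume would do):

```python
import os

# Hypothetical: make the in-repo sensor folder a symlink to external storage
# so that dataloader.ipynb can populate it without bloating the repository.
external_volume = '/data1/rca/sensors/osb'   # assumed external data volume
repo_folder = './data/rca/sensors/osb'       # in-repo link location
if not os.path.exists(repo_folder):
    os.symlink(external_volume, repo_folder)
```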
Summary of data management considerations#
Profile metadata is relatively small volume, resides within this Book/Repository
Sensor data volume can be understood in terms of measurements
One measurement per second is typical
31 million seconds per year
Each sensor measurement is { sensed value, pressure/depth, timestamp }
Sensor data also includes some quality control; typically ignored (dropped) here
There are about 22 sensor values
Most measurements are made on ascent, some on descent as noted
Three of these are from a fluorometer instrument
Two of these are nitrate measurements
Seven of these are spectral irradiance channels
Not included: Spectrophotometer data
Water density is a measurement derived from temperature, pressure and salinity
Data goes back as far as 2015
There are gaps in the data owing to maintenance etcetera
There are three shallow profiler sites
Coastal Endurance “Oregon Offshore” in 500m depth, distal edge of the continental shelf
RCA “Oregon Slope Base” at 3000m depth, bottom of the continental shelf break
RCA “Axial Base” at 2100m depth, further out in the Pacific at the base of Axial Seamount
Moderate-size datasets can be stored in this repository (upper limit about 50MB)
Large datasets can be saved external to the repository
We use virtual or lazy loading of data structures via the Python xarray library
When a data file is 'read' the data are not literally read into memory
Rather a description of the data is loaded into an xarray Dataset
Operations on the data cause it to be actually read into memory
Lazy loading facilitates data reduction (see the sketch after this list):
Strip away excess data elements that are not of interest
Time-box a subset of the data
Sub-sample the data to lower temporal / spatial resolution
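A minimal sketch of this reduction pattern, using one of the in-repo NetCDF files that appears later in this notebook (the output filename is illustrative):

```python
import xarray as xr

# Lazy open: only a description of the data is loaded at this point
ds = xr.open_dataset('./data/rca/sensors/osb/ctd_jan22_conductivity.nc')

# Time-box a subset of the data: still lazy, no bulk read yet
ds_week = ds.sel(time=slice('2022-01-01', '2022-01-08'))

# Sub-sample to lower temporal resolution: keep every 10th sample
ds_coarse = ds_week.isel(time=slice(0, None, 10))

# Only now (on write, plot, or compute) are values actually read into memory
ds_coarse.to_netcdf('./data/rca/sensors/osb/conductivity_jan22_coarse.nc')
```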
Repository folder structure#
~/book/_toc.yml -- table of contents
  img/ -- resource images
  chapters/ -- Jupyter notebooks map to book chapters
    data/ -- repo data storage space
      modis/ -- data from the MODIS satellite system
      argo/ -- data from the ARGO drifter program
      roms/ -- data from the Regional Ocean Modeling System (ROMS)
      rca/ -- data from the regional cabled array
        profiles/ -- profile metadata folder
          axb/ -- site subdirectories
          oos/
          osb/
            2019.csv -- profile metadata for 2019 at the OSB site
        sensors/ -- sensor (measurement) data from shallow profilers / platforms
          axb/ -- Axial Base site
          oos/ -- Oregon Offshore site
          osb/ -- Oregon Slope Base site (symlink to external volume)
            temp_jan_22.nc -- instrument = CTD, jan22 = time range, sensor = temperature
Data resources#
Shallow profiler data was originally pulled from the OOI Data Explorer. This practice is now deprecated in favor of the Interactive Oceans Data Portal. The latter system includes shallow profiler data held on cloud object storage. The access pattern is described below. The Interactive Oceans website has built-in data exploration and is very educator-friendly.
Terminology#
Regional Cabled Array (RCA): A cabled observatory off the coast of Oregon, on the sea floor and, in select locations, rising up through the water column
Site: A location in the RCA
Platform: A mechanical structure – static or mobile – at a site
Instrument: A device fastened to a platform that carries one or more sensors
Sensor: A device that measures some aspect of the ocean like pH or temperature
Stream: Data produced by a sensor as part of an instrument located on a platform at a site in the RCA
The three shallow profiler platforms:
Profiler ‘Oregon Slope Base’
Profiler ‘Axial Base’ PN3A
Profiler ‘Oregon Offshore’ (shared with the OSU Endurance array)
Sensor dictionary with abbreviations#
The following table lists sensors in relation to instruments. Short abbreviations are included; they are sometimes used in the code to make it easier to read.
Spectral irradiance is abbreviated spkir in OOI nomenclature. These data are broken out by wavelength channel (7 total) into separate sensors. The spec instrument is a spectrophotometer with 83 channels; it is treated separately from the main collection of shallow profiler sensors. The current instrument is a platform-mounted current sensor providing three-dimensional estimates of current with depth.
| Abbrev. | Sensor | Name used here | Instrument | Sensor operates |
|---------|--------|----------------|------------|-----------------|
| A | Chlorophyll-A | chlora | fluor | continuous |
| B | Backscatter | backscatter | fluor | continuous |
| C | CDOM | fdom | fluor | continuous |
| D | Density | density | ctd | continuous |
| E[] | Spectrophotometer optical absorption | oa | spec | ? (83 channels) |
| F[] | Spectrophotometer beam attenuation | ba | spec | ? (83 channels) |
| G | pCO2 | pco2 | pco2 | midnight/noon descent |
| H | pH | ph | ph | midnight/noon descent |
| I[] | Spectral irradiance | spkir412nm, spkir443nm, spkir490nm, spkir510nm, spkir555nm, spkir620nm, spkir683nm | spkir | ? (7 channels) |
| K | Conductivity | conductivity | ctd | continuous |
| N | Nitrate | nitrate | nitrate | midnight/noon ascent |
| P | PAR | par | par | continuous |
| Q | Pressure | pressure | ctd | continuous |
| O | Dissolved oxygen | do | do | continuous |
| S | Salinity | salinity | ctd | continuous |
| T | Temperature | temp | ctd | continuous |
| U | Velocity east | veast | current | continuous (from platform?) |
| V | Velocity north | vnorth | current | continuous (from platform?) |
| W | Velocity up | vup | current | continuous (looking up?) |
| Z | Depth (= pressure) | depth | ctd | see 'pressure' |
Code note#
This notebook refers to functions in the modules data.py and shallowprofiler.py.
Tasks
Does the S3 Zarr source go down on Sunday evenings? Is there a way to test if it is down?
This section does not demonstrate profile use, but it should
This section effectively builds the profile-from-data chart twice; once is enough
Mothball the ReformatDataFile() function in data.py, which is part of the old Data Explorer procedure
Move ProfilerDepthChart() to a module file
Look for occurrences of cdom (change to fdom), temperature (change to temp), and conductivity (change to conduct)
flort backscatter has multiple possibilities; I picked one; verify with Wendi that it is the right one
Deal with the code artifact below on WriteProfile()
# def WriteProfile(date_id):
# fnm = '../data/osb_ctd_' + date_id + '_pressure.nc'
# a0, a1, d0, d1, r0, r1 = ProfileGenerator(fnm, 'z', True)
# # last 2 days chart check: xr.open_dataset(fnm)['z'][-1440*2:-1].plot()
# if not ProfileWriter('../profiles/osb_profiles_' + date_id + '.csv', a0, a1, d0, d1, r0, r1): print('ProfileWriter() is False')
#
# for date_id in ['apr21', 'jul21', 'jan22']: WriteProfile(date_id) # !!!!! hard coded flag
Profiles#
This section describes profile metadata: pre-generated and stored in this repository at relative path ./data/rca/profiles/<site-abbrev>.
Times are UTC (Zulu).
Typically nine profiles run per day, two of which are noticeably longer in duration
The two longer profiles are at midnight and noon local time
They are associated with nitrate, pCO2, and pH measurements
These have built-in pauses on descent for equilibration
During rest intervals the profiler is secured to the platform at 200m depth
The platform has its own set of instruments
from matplotlib import pyplot as plt
from shallowprofiler import *
from data import *
from charts import *
RenderShallowProfilerTwoDayDepthChart()
# This code intentionally disabled: Alternative (more expansive) view of shallow profiler cycling.
if False: VisualizeProfiles('jan22', 31, '2022', '01', 'January', 'Oregon Slope Base', 'osb', 'ctd_jan22_conductivity.nc')
(Cell output: a FileNotFoundError traceback. RenderShallowProfilerTwoDayDepthChart() attempts to open ./data/rca/sensors/osb/conductivity_jan_2022.nc, which is absent when the external data volume described above is not linked in.)
How to use profile files#
Read the file into a pandas DataFrame
Each row is a Rest — Ascent — Descent phase sequence
Each phase has a start and an end, for a total of six events
There is degeneracy: r1 == a0, a1 == d0, and usually d1 == the next row’s r0
Each entry for these six events is a triple (i, t, z)
i is an index from the source file; usually ignore this
t is the time of the event; important
z is the depth of the event
for a1/d0 this will indicate if the profile approached the surface
Consequently each row has 18 columns, enumerated in the sketch below
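A hedged sketch of that column layout: the time and depth names (r0t, a0t, and so on) appear later in this notebook, while the index-column names are assumed to follow the same pattern:

```python
# Assumed layout: six events, each recorded as an (index, time, depth) triple
events = ['r0', 'r1', 'a0', 'a1', 'd0', 'd1']      # rest / ascent / descent bounds
fields = ['i', 't', 'z']                           # index, time, depth
columns = [e + f for e in events for f in fields]  # 'r0i', 'r0t', 'r0z', ...
print(len(columns))                                # 18
```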
Suppose the idea is to make a depth plot of temperature for a particular profile. Select out this profile and slice the source data using a time range given as a0 to a1. See the next notebook for examples.
Example use of profile metadata#
The following cell reads January 2022 Oregon Slope Base metadata into a pandas DataFrame ‘profiles’. In so doing it converts the information from text to pandas Timestamps. The rows of the resulting DataFrame correspond to consecutive profiles. The columns correspond to (time, depth) pairs: r0t is rest start time; r0z is depth at that time; r1t is rest end time; r1z is depth at that time (typically close to r0z). Following this are a0, a1 for ascent and d0, d1 for descent. There is some degeneracy in this metadata, as the end of the ascent corresponds to the start of the descent.
January has 31 days, so at nine profiles per day there are at most 31 × 9 = 279 possible profiles.
profiles = ReadProfileMetadata()
profiles
# profile time access syntax:
profile_row_index = 0
tA0, tA1 = profiles['a0t'][profile_row_index], profiles['a1t'][profile_row_index]
print(tA0, tA1, 'Time difference:', tA1 - tA0)
print('Type of these times:', type(tA0))
ds = xr.open_dataset('./data/rca/sensors/osb/ctd_jan22_conductivity.nc')
ds_timebox = ds.sel(time=slice(tA0, tA1))
ds_timebox.conductivity.plot()
Sensor Data#
Part 1: Deprecated OOI Data Explorer 1-minute data#
In the preceding section we accessed some conductivity data from Oregon Slope Base circa January 2022. This section connects more formally to these existing sensor datasets.
base_path = './data/rca/sensors/'
date_id = 'jan22'
site_id = 'osb'
data_file_list = []
for s in sensors:
    # s is [sensor, instrument]; example result: './data/rca/sensors/osb/ctd_jan22_temperature.nc'
    data_file_list.append(base_path + site_id + '/' + s[1] + '_' + date_id + '_' + s[0] + '.nc')
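For orientation, each entry of the sensors list imported from shallowprofiler.py is evidently a two-element pair. A hypothetical excerpt, inferred from its use above and shown for illustration only:

```python
# Hypothetical excerpt of sensors (from shallowprofiler.py), inferred from
# its use above: each entry is [sensor_name, instrument_name]
sensors_example = [
    ['temperature',  'ctd'],
    ['conductivity', 'ctd'],
    ['chlora',       'fluor'],
    ['nitrate',      'nitrate'],
]
# For ['temperature', 'ctd'] the constructed filename would be
#   './data/rca/sensors/osb/ctd_jan22_temperature.nc'
```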
!ls -al ./data/rca/sensors/osb/ctd_jan22_conductivity.nc
# relative path to data files is base_path + site_id + '/'
# The data variables corresponding to the jan22_data filenames in data_file_list[]:
month_data_vars = [
['pressure'],['temperature'],['density'], ['salinity'], ['conductivity'],
['fdom'], ['chlora'], ['bb'],
['spkir412nm', 'spkir443nm', 'spkir490nm', 'spkir510nm', 'spkir555nm', 'spkir620nm', 'spkir683nm'],
['nitrate'],
['pco2'],
['do'],
['par'],
['ph'],
['up'], ['east'], ['north']
]
Dataset check#
The above cell creates data_file_list: a list of time-bounded NetCDF sensor measurement files (in-repo). To review these files in more detail:
ds = xr.open_dataset(data_file_list[0])
print(ds)
# This cell assembles a data dictionary "d" from the OOI datasets
d = {} # empty Python dictionary
relative_path = './data/rca/sensors/'
sitestring = 'osb' # available: osb = Oregon Slope Base
monthstring = 'jan22' # available: apr21, jul21, jan22
def GetSensorTuple(s, f):
    '''
    s is a sensor identifier string like 'temperature'. (Each element of the
    sensors list iterated below is a 2-element list: [0] is the sensor,
    [1] is the instrument; only the sensor string is passed in here.)
    f is the source filename like './data/rca/sensors/osb/ctd_jan22_temperature.nc'
    '''
    ds = xr.open_dataset(f)       # open the file once (lazy load)
    df_sensor = ds[s]             # the sensor DataArray
    df_z = ds['z']                # the corresponding depth DataArray
    range_lo = ranges[s][0]       # chart range: low end
    range_hi = ranges[s][1]       # chart range: high end
    sensor_color = colors[s]
    return (df_sensor, df_z, range_lo, range_hi, sensor_color)

for sensor in sensors:            # sensor is a 2-element list [sensor_str, instrument_str]
    f = relative_path + sitestring + '/' + sensor[1] + '_' + monthstring + '_' + sensor[0] + '.nc'
    d[sensor[0]] = GetSensorTuple(sensor[0], f)
# temperature and salinity
fig,axs = ChartTwoSensors(profiles, [ranges['temperature'], ranges['conductivity']], [0],
d['temperature'][0], d['temperature'][1], 'Temperature', colors['temperature'], 'ascent',
d['conductivity'][0], d['conductivity'][1], 'Conductivity', colors['conductivity'], 'ascent', 6, 4)
Note: The above chart shows a temperature excursion to the left at a depth of about 75 meters (red). There is no corresponding excursion in the conductivity (cyan), illustrating that an excursion in one parameter need not coincide with one in another.
# temperature: ascent versus descent
fig,axs = ChartTwoSensors(profiles, [ranges['temperature'], ranges['temperature']], [0],
d['temperature'][0], d['temperature'][1], 'T-Ascent', colors['temperature'], 'ascent',
d['temperature'][0], d['temperature'][1], 'T-Descent', 'green', 'descent', 6, 4)
Note: The above chart compares profile temperature on ascent (red) with the subsequent descent (green). The cold excursion noted above is present in both profiles. The temperature structure below 100 meters is similar but less well-matched. This is an initial view of what we can expect for stability.
Part 2. Interactive Oceans Zarr (Lazy Load)#
# This cell connects to the RCA Zarr filesystem for OOI (on the cloud).
# A 'stream' is a source of one or more sensor data time series organized by
# site and by instrument. For example the CTD instrument provides temperature,
# conductivity, pressure, salinity and water density. (Some of these parameters,
# like density, are derived from the others.)
import netCDF4
import xarray as xr
import s3fs
from shallowprofiler import *
from charts import *
fs = s3fs.S3FileSystem(anon=True)
streamlist = fs.listdir('ooi-data', detail = False)
# For more on connecting to the source data: See the dataloader.ipynb notebook:
# - data stream naming system
# - reading in a data stream by site and instrument
# - time range subset
# - eliminating extraneous information
# - saving the result to a local (fast access) NetCDF file
ds_zarr = xr.open_dataset('./data/rca/sensors/osb/temperature_jan_2022.nc')
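For reference, a hedged sketch of the dataloader flow summarized in the comments above; the stream name and variable names here are illustrative assumptions, not verified identifiers:

```python
# Hypothetical sketch: open one stream from the ooi-data bucket as a lazy
# Zarr-backed Dataset, time-box it, drop extraneous variables, and cache
# the result as a local NetCDF file. Stream / variable names are assumed.
stream = 'ooi-data/EXAMPLE-SITE-INSTRUMENT-streamed-example_sample'
ds_lazy = xr.open_zarr(fs.get_mapper(stream), consolidated=True)
ds_jan = ds_lazy.sel(time=slice('2022-01-01', '2022-02-01'))
ds_jan = ds_jan[['temperature', 'depth']]    # assumed variable names
ds_jan.to_netcdf('./data/rca/sensors/osb/temperature_jan_2022.nc')
```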
# temperature: ascent versus descent
fig,axs = ChartTwoSensors(profiles, [ranges['temperature'], ranges['temperature']], [0],
ds_zarr.temperature, -ds_zarr.depth,
'T-Ascent', colors['temperature'], 'ascent',
ds_zarr.temperature, -ds_zarr.depth,
'T-Descent', 'green', 'descent', 6, 4)
# temperature: ascent versus descent
fig,axs = ChartTwoSensors(profiles, [ranges['temperature'], ranges['temperature']], [0],
ds_zarr.temperature, -ds_zarr.depth,
'Zarr Data', colors['temperature'], 'ascent',
d['temperature'][0], d['temperature'][1], 'OOI Data', 'blue', 'ascent', 6, 4)
Fossil code:
df = ds.to_dataframe()
vals = [xr.DataArray(data=df[c], dims=['time'], coords={'time':df.index}, attrs=ds[c].attrs) for c in df.columns]
ds = xr.Dataset(dict(zip(df.columns, vals)), attrs=ds.attrs)
Comment on profile phase durations#