Data Access Methods#

Marty Hidas

Australian Ocean Data Network (AODN)

This tutorial demostrates several ways data can be accessed remotely and loaded into a Python environment, including

OPeNDAP
OGC Web Feature Service (WFS)
direct access to files on cloud storage (AWS S3)
cloud-optimised formats Zarr & Parquet

The examples here use data from Australia’s Integrated Marine Observing System (IMOS). These can be browsed and accessed via the AODN Portal, the IMOS Metadata Catalogue, or the IMOS THREDDS Server. Each data collection’s metadata record includes more specific links to the relevant THREDDS folders, and WFS services.

# Import all the tools we need...
import os

# For working with data
import pandas as pd
import xarray as xr

# For data that may be larger than the memory available on your computer...
import dask
import dask.dataframe as dd

# For accessing OGC Web Feature Service
from owslib.wfs import WebFeatureService

# For accessing AWS S3 cloud storage
import s3fs

# Plotting tools
import holoviews as hv
import hvplot.pandas
import hvplot.xarray

# For plotting geographic data & maps
import geoviews as gv
import geoviews.feature as gf
from geoviews import opts
from cartopy import crs

## Use Matplotlib backend for web preview of notebook
## Comment out these lines to get the default interactive plots using Bokeh
hvplot.extension('matplotlib', compatibility='bokeh')
gv.extension('matplotlib')
gv.output(size=200)

# Set up local data path
DATA_BASEPATH = "/home/jovyan/shared/IMOS"

The old school way#

The old (and still common) way to access data is to first download it to your computer and read it from there. This is easy for small datasets, but not always ideal:

What if the data is bigger than your hard disk?
What if you only need a small fraction of a dataset?
What if the dataset is routinely updated and you want to re-run your analysis on the latest data?
What if you want to run your analysis on another computer or in the cloud?

These days it is often more convenient to have data managed in a central location and access it remotely. There are many ways this can be done. In this tutorial we will look at a few of the common ones, and some of the newer ones.

Web Feature Service (WFS)#

A standard of the Open Geospatial Consortium (OGC)
Allows tabular geospatial data to be accessed via the Web.
A feature has a geometry (e.g. a point/line/polygon) indicating a geographic location, and a set of properties (e.g. temperature)
WFS allows filtering based on geometry or properties.
In Python WFS and other OGC Web Services (OWS) can be accessed using the owslib library

For example, most of the tabular hosted by the AODN is available via WFS.

wfs = WebFeatureService(url="https://geoserver-123.aodn.org.au/geoserver/wfs",
                        version="1.1.0")
wfs.identification.title

'AODN Web Feature Service (WFS)'

# Each dataset is served as a separate "feature type":
print(f"There are {len(wfs.contents)} fature types, e.g.")
list(wfs.contents)[:10]

There are 397 fature types, e.g.

['imos:anmn_ctd_profiles_data',
 'imos:anmn_ctd_profiles_map',
 'imos:anmn_velocity_timeseries_map',
 'imos:anmn_nrs_rt_meteo_timeseries_data',
 'imos:anmn_nrs_rt_meteo_timeseries_map',
 'imos:anmn_nrs_rt_bio_timeseries_data',
 'imos:anmn_nrs_rt_bio_timeseries_map',
 'imos:anmn_nrs_rt_wave_timeseries_data',
 'imos:anmn_nrs_rt_wave_timeseries_map',
 'imos:anmn_acoustics_map']

For now we’ll assume we already know which featuretype we want. In this example we’ll look at a dataset containing condicutivity-temperature-depth (CTD) profiles obtained at the National Reference Stations around Australia (here’s a detailed metadata record)

typename = 'imos:anmn_ctd_profiles_data'
wfs.get_schema(typename)['properties']

{'file_id': 'int',
 'site_code': 'string',
 'cruise_id': 'string',
 'time_coverage_start': 'dateTime',
 'time_coverage_end': 'dateTime',
 'TIME': 'dateTime',
 'INSTANCE': 'int',
 'DIRECTION': 'string',
 'TIME_quality_control': 'string',
 'LATITUDE': 'double',
 'LATITUDE_quality_control': 'string',
 'LONGITUDE': 'double',
 'LONGITUDE_quality_control': 'string',
 'DEPTH': 'float',
 'DEPTH_quality_control': 'string',
 'BOT_DEPTH': 'float',
 'BOT_DEPTH_quality_control': 'string',
 'PRES_REL': 'float',
 'PRES_REL_quality_control': 'string',
 'TEMP': 'float',
 'TEMP_quality_control': 'string',
 'PSAL': 'float',
 'PSAL_quality_control': 'string',
 'DOX2': 'float',
 'DOX2_quality_control': 'string',
 'TURB': 'float',
 'TURB_quality_control': 'string',
 'CHLF': 'float',
 'CHLF_quality_control': 'string',
 'CHLU': 'float',
 'CHLU_quality_control': 'string',
 'CPHL': 'float',
 'CPHL_quality_control': 'string',
 'CNDC': 'float',
 'CNDC_quality_control': 'string',
 'DESC': 'float',
 'DESC_quality_control': 'string',
 'DENS': 'float',
 'DENS_quality_control': 'string'}

We can read in a subset of the data by specifying a bounding box (in this case near Sydney, Australia). We’ll get the result in CSV format so it’s easy to read into a Pandas DataFrame.

First we’ll ask for just 10 features, for a quick look at the data.

xmin, xmax = 151.2, 151.25   # Port Hacking, near Sydney, NSW
ymin, ymax = -34.2, -34.1

response = wfs.getfeature(typename=typename,
                          bbox=(xmin, ymin, xmax, ymax),
                          maxfeatures=10,
                          outputFormat='csv')
df = pd.read_csv(response)
response.close()

df

	FID	file_id	site_code	cruise_id	time_coverage_start	time_coverage_end	TIME	INSTANCE	DIRECTION	TIME_quality_control	...	CHLU_quality_control	CPHL	CPHL_quality_control	CNDC	DESC	DENS	geom
0	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6266	0.228	1025.8478	POINT (-34.1161666667 151.218)
1	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6246	0.574	1025.8652	POINT (-34.1161666667 151.218)
2	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6224	0.741	1025.8737	POINT (-34.1161666667 151.218)
3	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6190	0.803	1025.8790	POINT (-34.1161666667 151.218)
4	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6138	0.749	1025.8892	POINT (-34.1161666667 151.218)
5	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6089	0.687	1025.9072	POINT (-34.1161666667 151.218)
6	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6067	0.722	1025.9241	POINT (-34.1161666667 151.218)
7	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6048	0.773	1025.9321	POINT (-34.1161666667 151.218)
8	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.6023	0.788	1025.9385	POINT (-34.1161666667 151.218)
9	anmn_ctd_profiles_data.fid--6fff0f0_189bed8a1b...	754	PH100	PHNRS_1108	2011-08-29T00:03:40	2011-08-29T00:03:40	2011-08-29T00:03:40	NaN	D	NaN	...	NaN	NaN	NaN	4.5982	0.846	1025.9432	POINT (-34.1161666667 151.218)

10 rows × 41 columns

## Load local copy of CSV file returned...

# local_csv = os.path.join(DATA_BASEPATH, 'wfs_response1.csv')
# df = pd.read_csv(local_csv)
# df

We can also filter the data based on the values in specified columns (properties) and ask for only a subset of the columns to be returned. The filters need to be provided in XML format, but the owslib library allows us to construct them in a more Pythonic way.

Here we select only the profiles associated with the Port Hacking 100m mooring site, and only the data points flagged as “good data” by automated quality-control procedures.

from owslib.etree import etree
from owslib.fes import PropertyIsEqualTo, And

filter = And([PropertyIsEqualTo(propertyname="site_code", literal="PH100"),
              PropertyIsEqualTo(propertyname="PRES_REL_quality_control", literal="1"),
              PropertyIsEqualTo(propertyname="TEMP_quality_control", literal="1"),
              PropertyIsEqualTo(propertyname="PSAL_quality_control", literal="1"),
              PropertyIsEqualTo(propertyname="CPHL_quality_control", literal="1")
             ])
filterxml = etree.tostring(filter.toXML(), encoding="unicode")

response = wfs.getfeature(typename=typename, filter=filterxml, outputFormat="csv",
                          propertyname=["TIME", "DEPTH", "TEMP", "PSAL", "CPHL"]
                         )
df = pd.read_csv(response, parse_dates=["TIME"])
response.close()

# the server adds a feature ID column we don't really need
df.drop(columns='FID', inplace=True)

## Load local copy of CSV file returned...

# local_csv = os.path.join(DATA_BASEPATH, 'wfs_response2.csv')
# df = pd.read_csv(local_csv, parse_dates=["TIME"]).drop(columns='FID')

df

	TIME	DEPTH	TEMP	PSAL	CPHL
0	2014-12-08 22:28:54	1.986	21.6432	35.5080	0.9365
1	2014-12-08 22:28:54	2.979	21.6441	35.5085	0.9560
2	2014-12-08 22:28:54	3.971	21.6417	35.5085	0.9644
3	2014-12-08 22:28:54	4.964	21.6314	35.5089	0.9963
4	2014-12-08 22:28:54	5.957	21.6077	35.5102	0.9844
...	...	...	...	...	...
11377	2023-05-15 22:08:05	82.398	18.0130	35.5832	0.1554
11378	2023-05-15 22:08:05	83.391	18.0008	35.5841	0.1417
11379	2023-05-15 22:08:05	84.384	17.9824	35.5843	0.1345
11380	2023-05-15 22:08:05	85.376	17.9343	35.5821	0.0937
11381	2023-05-15 22:08:05	86.368	17.8612	35.5799	0.0300

11382 rows × 5 columns

# We can explore the temperature, salinity and chlorophyll profiles
# Change "by" to "groupby" to view one profile at a time, with time selected interactively
temp_plot = df.hvplot(x="TEMP", y="DEPTH", by="TIME", flip_yaxis=True, legend=False, width=200)
psal_plot = df.hvplot(x="PSAL", y="DEPTH", by="TIME", flip_yaxis=True, legend=False, width=200)
cphl_plot = df.hvplot(x="CPHL", y="DEPTH", by="TIME", flip_yaxis=True, legend=False, width=200)

(temp_plot + psal_plot + cphl_plot).opts(tight=True)

# We can also extract the temperature measurements at a fixed depth
# and compare to the timeseries from the mooring 
comp_depth = 20  # metres

df_sub = df[df.DEPTH.round() == comp_depth]
ctd_plot = df_sub.hvplot.scatter(x="TIME", y="TEMP", c="red")

mooring_plot = ds_mooring.TEMP.sel(DEPTH=comp_depth).hvplot()

mooring_plot * ctd_plot

Direct access to files on cloud storage#

Data files made available to the public on cloud storage such as Amazon S3 (Simple Storage Service) can be accessed over the web as if they were stored locally. You just need to find the exact URL for each file.

In Python, we can access S3 storage in a very similar way to a local filesystem using the s3fs library.

For example, all the public data files hosted by the Australian Ocean Data Network are stored in an S3 bucket called imos-data. You can browse the contents of the bucket and download individual files here.

Below we’ll look at a high-resolution regional SST product from IMOS (based on satellite and in-situ observations). This product is a collection of daily gridded NetCDF files covering the Australian region.

s3 = s3fs.S3FileSystem(anon=True)

# List the most recent files available
sst_files = s3.ls("imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023")
sst_files[-20:]

['imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230713120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230714120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230715120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230716120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230717120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230718120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230719120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230720120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230721120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230722120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230723120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230725120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230726120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230727120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230728120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230729120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230730120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230731120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230801120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc',
 'imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/20230802120000-ABOM-L4_GHRSST-SSTfnd-RAMSSA_09km-AUS-v02.0-fv01.0.nc']

# Plot a subset of the dataset around Australia
sst_var = 'analysed_sst'
gds = gv.Dataset(ds.sel(lat=slice(-50, 0), lon=slice(105, 175)),
                 kdims=['lon', 'lat'],
                 vdims=[sst_var]
                )
sst_plot = (gds.to(gv.Image)
               .opts(cmap='coolwarm', colorbar=True, aspect=1.4, title=ds.title))
sst_plot * gf.land

/opt/conda/lib/python3.9/site-packages/cartopy/io/__init__.py:241: DownloadWarning: Downloading: https://naturalearth.s3.amazonaws.com/110m_physical/ne_110m_land.zip
  warnings.warn(f'Downloading: {url}', DownloadWarning)

It’s worth understanding a little about how this works.

The above example only makes use of the metadata from the file, one of the 4 data variables, and the lon and lat coordinates. On a local filesystem, it would be easy to read only these specific parts of the file from disk.

However, on cloud storage services like S3 (also called “object storage”) the basic read/write functions operate on the entire file (object), so at least in the backend, the entire file is read**. If you only need a small subset of a large file, this can be a very inefficient way to get it.

** Note: it is possible to request only a subset of an S3 object to be read, but this is more advanced usage than what we’re doing here.

For example, if we wanted to plot a timeseries of the above satellite SST product at a given point, we would only need a single value out of each file (corresponding to one point in the timeseries), but the entire file would need to be read each time.

For a quick demo we’ll try this with last month’s files. xarray has a handy open_mfdataset function that can create a single Dataset object out of a series of files (with similar structure).

%%time
s3_objs = [s3.open(f)
           for f in s3.glob("imos-data/IMOS/SRS/SST/ghrsst/L4/RAMSSA/2023/202307*")
          ]
mds = xr.open_mfdataset(s3_objs, engine="h5netcdf")
mds

CPU times: user 4.78 s, sys: 119 ms, total: 4.9 s
Wall time: 15.4 s

<xarray.Dataset>
Dimensions:           (time: 30, lat: 1081, lon: 1561)
Coordinates:
  * time              (time) datetime64[ns] 2023-07-01T12:00:00 ... 2023-07-3...
  * lat               (lat) float32 -70.0 -69.92 -69.83 ... 19.83 19.92 20.0
  * lon               (lon) float32 60.0 60.08 60.17 60.25 ... 189.8 189.9 190.0
Data variables:
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>
    crs               (time) int32 -2147483647 -2147483647 ... -2147483647
Attributes: (12/65)
    id:                         RAMSSA_09km-ABOM-L4-AUS-v01
    Conventions:                CF-1.6, ACDD-1.3, ISO 8601
    title:                      RAMSSA v1.1 Analysed high resolution foundati...
    summary:                    AMSR2-JAXA nobs=****** obsesd: avg=0.693 min=...
    source:                     AMSR2-JAXA,AVHRRMTB_G-NAVO,VIIRS_NPP_OSPO,VII...
    references:                 Beggs H., A. Zhong, G. Warren, O. Alves, G. B...
    ...                         ...
    geospatial_lat_max:         20.0
    geospatial_lat_min:         -70.0
    geospatial_lon_max:         190.0
    geospatial_lon_min:         60.0
    geospatial_bounds:          POLYGON((-70 60, 20 60, 20 190, -70 190, -70 ...
    geospatial_bounds_crs:      EPSG:4326

The variables in the dataset are not loaded into memory (they’re still dask.arrays). However, in the background, each complete file had to be downloaded from S3 before the metadata needed by open_mfdataset could be read.

mds.analysed_sst

<xarray.DataArray 'analysed_sst' (time: 30, lat: 1081, lon: 1561)>
dask.array<concatenate, shape=(30, 1081, 1561), dtype=float32, chunksize=(1, 1081, 1561), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2023-07-01T12:00:00 ... 2023-07-31T12:00:00
  * lat      (lat) float32 -70.0 -69.92 -69.83 -69.75 ... 19.75 19.83 19.92 20.0
  * lon      (lon) float32 60.0 60.08 60.17 60.25 ... 189.8 189.8 189.9 190.0
Attributes:
    valid_min:              -300
    valid_max:              4500
    clip_min:               269.30999398231506
    clip_max:               304.8399931881577
    units:                  kelvin
    long_name:              analysed sea surface temperature
    standard_name:          sea_surface_foundation_temperature
    comment:                Optimally interpolated analysis of SST observations.
    source:                 AMSR2-JAXA,AVHRRMTB_G-NAVO,VIIRS_NPP_OSPO,VIIRS_N...
    coverage_content_type:  physicalMeasurement
    grid_mapping:           crs

xarray.DataArray

'analysed_sst'

time: 30
lat: 1081
lon: 1561

dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>

	Array	Chunk
Bytes	193.11 MiB	6.44 MiB
Shape	(30, 1081, 1561)	(1, 1081, 1561)
Dask graph	30 chunks in 61 graph layers
Data type	float32 numpy.ndarray

Coordinates: (3)

time

(time)

datetime64[ns]

2023-07-01T12:00:00 ... 2023-07-...

long_name :: reference time of SST field
standard_name :: time
axis :: T
comment :: Nominal time because observations are from different sources and are made at different times of the day
coverage_content_type :: coordinate

array(['2023-07-01T12:00:00.000000000', '2023-07-02T12:00:00.000000000',
       '2023-07-03T12:00:00.000000000', '2023-07-04T12:00:00.000000000',
       '2023-07-05T12:00:00.000000000', '2023-07-06T12:00:00.000000000',
       '2023-07-07T12:00:00.000000000', '2023-07-08T12:00:00.000000000',
       '2023-07-09T12:00:00.000000000', '2023-07-10T12:00:00.000000000',
       '2023-07-11T12:00:00.000000000', '2023-07-12T12:00:00.000000000',
       '2023-07-13T12:00:00.000000000', '2023-07-14T12:00:00.000000000',
       '2023-07-15T12:00:00.000000000', '2023-07-16T12:00:00.000000000',
       '2023-07-17T12:00:00.000000000', '2023-07-18T12:00:00.000000000',
       '2023-07-19T12:00:00.000000000', '2023-07-20T12:00:00.000000000',
       '2023-07-21T12:00:00.000000000', '2023-07-22T12:00:00.000000000',
       '2023-07-23T12:00:00.000000000', '2023-07-25T12:00:00.000000000',
       '2023-07-26T12:00:00.000000000', '2023-07-27T12:00:00.000000000',
       '2023-07-28T12:00:00.000000000', '2023-07-29T12:00:00.000000000',
       '2023-07-30T12:00:00.000000000', '2023-07-31T12:00:00.000000000'],
      dtype='datetime64[ns]')

lat
(lat)
float32
-70.0 -69.92 -69.83 ... 19.92 20.0
long_name :
latitude
standard_name :
latitude
axis :
Y
comment :
Latitudes for locating data
valid_min :
-90.0
valid_max :
90.0
units :
degrees_north
coverage_content_type :
coordinate
```
array([-70.      , -69.916664, -69.833336, ...,  19.833334,  19.916666,
        20.      ], dtype=float32)
```
lon
(lon)
float32
60.0 60.08 60.17 ... 189.9 190.0
long_name :
longitude
standard_name :
longitude
axis :
X
comment :
Longitudes for locating data
units :
degrees_east
coverage_content_type :
coordinate
valid_min :
0.0
valid_max :
360.0
```
array([ 60.      ,  60.083332,  60.166668, ..., 189.83333 , 189.91667 ,
       190.      ], dtype=float32)
```

Attributes: (11)
valid_min :
-300
valid_max :
4500
clip_min :
269.30999398231506
clip_max :
304.8399931881577
units :
kelvin
long_name :
analysed sea surface temperature
standard_name :
sea_surface_foundation_temperature
comment :
Optimally interpolated analysis of SST observations.
source :
AMSR2-JAXA,AVHRRMTB_G-NAVO,VIIRS_NPP_OSPO,VIIRS_N20_OSPO,IN_SITU-GTS_BUOYS,IN_SITU-GTS_SHIP,IN_SITU-GTS_TESAC,ACCESS-G-WSP,NCEP-ICE
coverage_content_type :
physicalMeasurement
grid_mapping :
crs

Let’s compare this to reading the same files from a local filesystem…

%%time
from glob import glob
local_files = glob(os.path.join(DATA_BASEPATH, "RAMSSA", "*"))

mds = xr.open_mfdataset(local_files, engine="h5netcdf")
mds

CPU times: user 4.36 s, sys: 56.1 ms, total: 4.41 s
Wall time: 5.28 s

<xarray.Dataset>
Dimensions:           (time: 30, lat: 1081, lon: 1561)
Coordinates:
  * time              (time) datetime64[ns] 2023-07-01T12:00:00 ... 2023-07-3...
  * lat               (lat) float32 -70.0 -69.92 -69.83 ... 19.83 19.92 20.0
  * lon               (lon) float32 60.0 60.08 60.17 60.25 ... 189.8 189.9 190.0
Data variables:
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(1, 1081, 1561), meta=np.ndarray>
    crs               (time) int32 -2147483647 -2147483647 ... -2147483647
Attributes: (12/65)
    id:                         RAMSSA_09km-ABOM-L4-AUS-v01
    Conventions:                CF-1.6, ACDD-1.3, ISO 8601
    title:                      RAMSSA v1.1 Analysed high resolution foundati...
    summary:                    AMSR2-JAXA nobs=****** obsesd: avg=0.693 min=...
    source:                     AMSR2-JAXA,AVHRRMTB_G-NAVO,VIIRS_NPP_OSPO,VII...
    references:                 Beggs H., A. Zhong, G. Warren, O. Alves, G. B...
    ...                         ...
    geospatial_lat_max:         20.0
    geospatial_lat_min:         -70.0
    geospatial_lon_max:         190.0
    geospatial_lon_min:         60.0
    geospatial_bounds:          POLYGON((-70 60, 20 60, 20 190, -70 190, -70 ...
    geospatial_bounds_crs:      EPSG:4326

Whichever way we loaded the dataset, we can plot it the same way as any other xarray.Dataset.

%%time
mds[sst_var].sel(lat=-42, lon=150, method="nearest").hvplot()

CPU times: user 809 ms, sys: 86.3 ms, total: 895 ms
Wall time: 3.65 s

Zarr - a cloud-optimised data format#

Zarr is a relatively new data format specifically developed for efficient access to multi-dimensional data in the cloud. Each dataset is broken up into many smaller files containing “chunks” of the data, organised in a standard hierarchy. The metadata are stored in separate files. When reading such a dataset, only the required information is read for each operation.

# A Zarr "store" can easily be opened as an xarray.Dataset

# In this case the Zarr store is in an S3 bucket
# NOTE: This is an experimental dataset. It may not be available in the fultre.
store = s3fs.S3Map(root='imos-data-pixeldrill/zarrs/2021/', s3=s3, check=False)

zds = xr.open_zarr(store)
zds

<xarray.Dataset>
Dimensions:                  (time: 178, lat: 4500, lon: 6000)
Coordinates:
  * lat                      (lat) float32 19.99 19.97 19.95 ... -69.97 -69.99
  * lon                      (lon) float32 70.01 70.03 70.05 ... 190.0 190.0
  * time                     (time) datetime64[ns] 2021-01-01T15:20:00 ... 20...
Data variables:
    dt_analysis              (time, lat, lon) float32 dask.array<chunksize=(64, 64, 64), meta=np.ndarray>
    l2p_flags                (time, lat, lon) float32 dask.array<chunksize=(64, 64, 64), meta=np.ndarray>
    quality_level            (time, lat, lon) float32 dask.array<chunksize=(64, 64, 64), meta=np.ndarray>
    satellite_zenith_angle   (time, lat, lon) float32 dask.array<chunksize=(64, 64, 64), meta=np.ndarray>
    sea_surface_temperature  (time, lat, lon) float32 dask.array<chunksize=(64, 64, 64), meta=np.ndarray>
    sses_bias                (time, lat, lon) float32 dask.array<chunksize=(64, 64, 64), meta=np.ndarray>
    sses_count               (time, lat, lon) float32 dask.array<chunksize=(64, 64, 64), meta=np.ndarray>
    sses_standard_deviation  (time, lat, lon) float32 dask.array<chunksize=(64, 64, 64), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                CF-1.6
    Metadata_Conventions:       Unidata Dataset Discovery v1.0
    Metadata_Link:              TBA
    acknowledgment:             Any use of these data requires the following ...
    cdm_data_type:              grid
    comment:                    HRPT AVHRR experimental L3 retrieval produced...
    ...                         ...
    summary:                    Skin SST retrievals produced from stitching t...
    time_coverage_end:          20210101T151752Z
    time_coverage_start:        20210101T095824Z
    title:                      IMOS L3S Nighttime gridded multiple-sensor mu...
    uuid:                       4d02ee75-876d-4ff0-8956-ab68917c9001
    westernmost_longitude:      70.01000213623047

# We can see the chunked structure of the data by looking at one of the variables
zds.sea_surface_temperature

<xarray.DataArray 'sea_surface_temperature' (time: 178, lat: 4500, lon: 6000)>
dask.array<open_dataset-5ed7711d911bd370bffa8b7f9a31ccf1sea_surface_temperature, shape=(178, 4500, 6000), dtype=float32, chunksize=(64, 64, 64), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 19.99 19.97 19.95 19.93 ... -69.95 -69.97 -69.99
  * lon      (lon) float32 70.01 70.03 70.05 70.07 ... 189.9 189.9 190.0 190.0
  * time     (time) datetime64[ns] 2021-01-01T15:20:00 ... 2021-07-25T15:20:00
Attributes:
    _Netcdf4Dimid:  2
    comment:        The skin temperature of the ocean at a depth of approxima...
    long_name:      sea surface skin temperature
    standard_name:  sea_surface_skin_temperature
    units:          kelvin
    valid_max:      32767
    valid_min:      -32767

xarray.DataArray

'sea_surface_temperature'

time: 178
lat: 4500
lon: 6000

dask.array<chunksize=(64, 64, 64), meta=np.ndarray>

	Array	Chunk
Bytes	17.90 GiB	1.00 MiB
Shape	(178, 4500, 6000)	(64, 64, 64)
Dask graph	20022 chunks in 2 graph layers
Data type	float32 numpy.ndarray

Coordinates: (3)

lat
(lat)
float32
19.99 19.97 19.95 ... -69.97 -69.99
NAME :
lat
_Netcdf4Dimid :
1
axis :
Y
comment :
Latitudes for locating data
long_name :
latitude
standard_name :
latitude
units :
degrees_north
valid_max :
90.0
valid_min :
-90.0
```
array([ 19.99,  19.97,  19.95, ..., -69.95, -69.97, -69.99], dtype=float32)
```
lon
(lon)
float32
70.01 70.03 70.05 ... 190.0 190.0
NAME :
lon
_Netcdf4Dimid :
0
axis :
X
comment :
Longitudes for locating data
long_name :
longitude
standard_name :
longitude
units :
degrees_east
valid_max :
360.0
valid_min :
-180.0
```
array([ 70.01,  70.03,  70.05, ..., 189.95, 189.97, 189.99], dtype=float32)
```

time

(time)

datetime64[ns]

2021-01-01T15:20:00 ... 2021-07-...

NAME :: time
_Netcdf4Dimid :: 2
axis :: T
comment :: A typical reference time for data
long_name :: reference time of sst file
standard_name :: time

array(['2021-01-01T15:20:00.000000000', '2021-01-02T15:20:00.000000000',
       '2021-01-03T15:20:00.000000000', '2021-01-04T15:20:00.000000000',
       '2021-01-05T15:20:00.000000000', '2021-01-06T15:20:00.000000000',
       '2021-01-07T15:20:00.000000000', '2021-01-08T15:20:00.000000000',
       '2021-01-09T15:20:00.000000000', '2021-01-10T15:20:00.000000000',
       '2021-01-11T15:20:00.000000000', '2021-01-12T15:20:00.000000000',
       '2021-01-13T15:20:00.000000000', '2021-01-14T15:20:00.000000000',
       '2021-01-15T15:20:00.000000000', '2021-01-16T15:20:00.000000000',
       '2021-01-17T15:20:00.000000000', '2021-01-18T15:20:00.000000000',
       '2021-01-19T15:20:00.000000000', '2021-01-20T15:20:00.000000000',
       '2021-01-21T15:20:00.000000000', '2021-01-22T15:20:00.000000000',
       '2021-01-23T15:20:00.000000000', '2021-01-24T15:20:00.000000000',
       '2021-01-26T15:20:00.000000000', '2021-01-27T15:20:00.000000000',
       '2021-01-28T15:20:00.000000000', '2021-01-29T15:20:00.000000000',
       '2021-01-30T15:20:00.000000000', '2021-01-31T15:20:00.000000000',
       '2021-02-01T15:20:00.000000000', '2021-02-02T15:20:00.000000000',
       '2021-02-03T15:20:00.000000000', '2021-02-04T15:20:00.000000000',
       '2021-02-05T15:20:00.000000000', '2021-02-06T15:20:00.000000000',
       '2021-02-07T15:20:00.000000000', '2021-02-08T15:20:00.000000000',
       '2021-02-09T15:20:00.000000000', '2021-02-10T15:20:00.000000000',
       '2021-02-11T15:20:00.000000000', '2021-02-12T15:20:00.000000000',
       '2021-02-13T15:20:00.000000000', '2021-02-14T15:20:00.000000000',
       '2021-02-15T15:20:00.000000000', '2021-02-16T15:20:00.000000000',
       '2021-02-17T15:20:00.000000000', '2021-02-18T15:20:00.000000000',
       '2021-02-19T15:20:00.000000000', '2021-02-20T15:20:00.000000000',
       '2021-02-21T15:20:00.000000000', '2021-02-22T15:20:00.000000000',
       '2021-02-23T15:20:00.000000000', '2021-02-24T15:20:00.000000000',
       '2021-02-25T15:20:00.000000000', '2021-02-26T15:20:00.000000000',
       '2021-02-27T15:20:00.000000000', '2021-02-28T15:20:00.000000000',
       '2021-03-01T15:20:00.000000000', '2021-03-02T15:20:00.000000000',
       '2021-03-03T15:20:00.000000000', '2021-03-04T15:20:00.000000000',
       '2021-03-05T15:20:00.000000000', '2021-03-06T15:20:00.000000000',
       '2021-03-07T15:20:00.000000000', '2021-03-08T15:20:00.000000000',
       '2021-03-09T15:20:00.000000000', '2021-03-10T15:20:00.000000000',
       '2021-03-11T15:20:00.000000000', '2021-03-12T15:20:00.000000000',
       '2021-03-13T15:20:00.000000000', '2021-03-14T15:20:00.000000000',
       '2021-03-15T15:20:00.000000000', '2021-03-16T15:20:00.000000000',
       '2021-03-17T15:20:00.000000000', '2021-03-18T15:20:00.000000000',
       '2021-03-19T15:20:00.000000000', '2021-03-20T15:20:00.000000000',
       '2021-03-21T15:20:00.000000000', '2021-03-22T15:20:00.000000000',
       '2021-03-23T15:20:00.000000000', '2021-03-24T15:20:00.000000000',
       '2021-03-25T15:20:00.000000000', '2021-03-26T15:20:00.000000000',
       '2021-03-27T15:20:00.000000000', '2021-03-28T15:20:00.000000000',
       '2021-03-29T15:20:00.000000000', '2021-03-30T15:20:00.000000000',
       '2021-03-31T15:20:00.000000000', '2021-04-01T15:20:00.000000000',
       '2021-04-02T15:20:00.000000000', '2021-04-03T15:20:00.000000000',
       '2021-04-04T15:20:00.000000000', '2021-04-05T15:20:00.000000000',
       '2021-04-06T15:20:00.000000000', '2021-04-07T15:20:00.000000000',
       '2021-04-08T15:20:00.000000000', '2021-04-09T15:20:00.000000000',
       '2021-04-10T15:20:00.000000000', '2021-04-11T15:20:00.000000000',
       '2021-04-12T15:20:00.000000000', '2021-04-13T15:20:00.000000000',
       '2021-04-14T15:20:00.000000000', '2021-04-15T15:20:00.000000000',
       '2021-04-16T15:20:00.000000000', '2021-04-17T15:20:00.000000000',
       '2021-04-18T15:20:00.000000000', '2021-04-19T15:20:00.000000000',
       '2021-04-20T15:20:00.000000000', '2021-04-21T15:20:00.000000000',
       '2021-04-22T15:20:00.000000000', '2021-04-23T15:20:00.000000000',
       '2021-04-24T15:20:00.000000000', '2021-04-25T15:20:00.000000000',
       '2021-04-26T15:20:00.000000000', '2021-04-27T15:20:00.000000000',
       '2021-04-28T15:20:00.000000000', '2021-04-29T15:20:00.000000000',
       '2021-04-30T15:20:00.000000000', '2021-05-01T15:20:00.000000000',
       '2021-05-02T15:20:00.000000000', '2021-05-03T15:20:00.000000000',
       '2021-05-04T15:20:00.000000000', '2021-05-05T15:20:00.000000000',
       '2021-05-06T15:20:00.000000000', '2021-05-07T15:20:00.000000000',
       '2021-05-08T15:20:00.000000000', '2021-05-10T15:20:00.000000000',
       '2021-05-12T15:20:00.000000000', '2021-05-13T15:20:00.000000000',
       '2021-05-14T15:20:00.000000000', '2021-05-15T15:20:00.000000000',
       '2021-05-16T15:20:00.000000000', '2021-05-17T15:20:00.000000000',
       '2021-05-20T15:20:00.000000000', '2021-05-21T15:20:00.000000000',
       '2021-05-22T15:20:00.000000000', '2021-05-23T15:20:00.000000000',
       '2021-05-24T15:20:00.000000000', '2021-05-25T15:20:00.000000000',
       '2021-05-27T15:20:00.000000000', '2021-05-28T15:20:00.000000000',
       '2021-05-29T15:20:00.000000000', '2021-06-02T15:20:00.000000000',
       '2021-06-03T15:20:00.000000000', '2021-06-04T15:20:00.000000000',
       '2021-06-05T15:20:00.000000000', '2021-06-06T15:20:00.000000000',
       '2021-06-07T15:20:00.000000000', '2021-06-09T15:20:00.000000000',
       '2021-06-10T15:20:00.000000000', '2021-06-11T15:20:00.000000000',
       '2021-06-15T15:20:00.000000000', '2021-06-17T15:20:00.000000000',
       '2021-06-18T15:20:00.000000000', '2021-06-23T15:20:00.000000000',
       '2021-06-24T15:20:00.000000000', '2021-06-25T15:20:00.000000000',
       '2021-06-26T15:20:00.000000000', '2021-06-27T15:20:00.000000000',
       '2021-07-02T15:20:00.000000000', '2021-07-03T15:20:00.000000000',
       '2021-07-05T15:20:00.000000000', '2021-07-10T15:20:00.000000000',
       '2021-07-11T15:20:00.000000000', '2021-07-12T15:20:00.000000000',
       '2021-07-13T15:20:00.000000000', '2021-07-14T15:20:00.000000000',
       '2021-07-15T15:20:00.000000000', '2021-07-16T15:20:00.000000000',
       '2021-07-18T15:20:00.000000000', '2021-07-19T15:20:00.000000000',
       '2021-07-20T15:20:00.000000000', '2021-07-21T15:20:00.000000000',
       '2021-07-22T15:20:00.000000000', '2021-07-23T15:20:00.000000000',
       '2021-07-24T15:20:00.000000000', '2021-07-25T15:20:00.000000000'],
      dtype='datetime64[ns]')

Attributes: (7)
_Netcdf4Dimid :
2
comment :
The skin temperature of the ocean at a depth of approximately 10 microns
long_name :
sea surface skin temperature
standard_name :
sea_surface_skin_temperature
units :
kelvin
valid_max :
32767
valid_min :
-32767

%%time

# We can plot this dataset in exactly the same way as the NetCDF-based one
sst_var = 'sea_surface_temperature'
gds = gv.Dataset(zds[sst_var].sel(time='2021-01-02', lat=slice(0, -50), lon=slice(105, 175)),
                 kdims=['lon', 'lat'],
                 vdims=[sst_var]
                )
sst_plot = (gds.to(gv.Image, ['lon', 'lat'])
               .opts(cmap='coolwarm', colorbar=True, aspect=1.4, title=zds.title))
sst_plot * gf.land

Another example#

A more detailed example of working with similar data in Zarr format can be found here: https://github.com/aodn/rimrep-examples/blob/main/Python_based_scripts/zarr.ipynb

Parquet#

Parquet is a cloud-optimised format designed for tabular data.
Each column of the table is stored in a separate file/object.
These can be further partitioned into row groups.

For a quick demo, we’ll borrow an example from this more detailed notebook, looking at temperature logger data from the Australian Institute of Marine Science. The dataset contains 150 million temperature measurements from numerous sites around Australia (metadata for this dataset).

We’ll use dask.dataframe to access the parquet data in a lazy way - reading only what is necessary and only when requested.

# Here's the path to the dataset on AWS S3
parquet_path = "s3://rimrep-data-public/091-aims-sst/test-50-64-spatialpart/"

# Let's see if there are any temperature loggers near us (in Dunsborough, Western Australia)
filters = [('lon', '>', 114.5),
          ('lon', '<', 115.5),
          ('lat', '>', -34.),
          ('lat', '<', -33.)]

df = dd.read_parquet(parquet_path,
                     filters=filters,
                     # only read the site names and QC'd temperature values
                     columns = ['site', 'qc_val'],
                     index='time',
                     engine='pyarrow',
                     storage_options = {"anon": True}
                    )
df.head()

	site	qc_val
time
2015-01-02 07:30:00+00:00	Geographe Bay	24.3728
2015-01-02 07:00:00+00:00	Geographe Bay	24.3728
2015-01-02 06:30:00+00:00	Geographe Bay	24.3238
2015-01-02 06:00:00+00:00	Geographe Bay	24.2518
2015-01-02 05:30:00+00:00	Geographe Bay	24.1798

len(df)

# This subset should fit into memory, so let's turn it into a regular DataFrame
df = df.compute()

# Let's see how many sites we have...
df.site.unique()

array(['Geographe Bay', 'Cowaramup Bay', 'Canal Rocks'], dtype=object)

# Now we can plot the temperature timeseries for all these sites
df.hvplot(by='site')

Alternative dataset#

Another Parquet example using data from the Ocean Biodiversity Information System (OBIS) is shown in this notebook

Other methods#

ERDDAP#

Supports searching, subsetting, and downloads in a wide range of formats
Example
Also covered in this OHW22 tutorial

New OGC APIs#

New standards from the Open Geospatial Consortium
OGC Features (replacement for WFS) - example
OGC Coverages example

OGC Web Map Service (WMS)#

Also covered in this OHW22 tutorial

Further resources#

Lots of data access examples at IOOS CodeLab
Examples in both Python and R from Reef 2050 Integrated Monitoring and Reporting Program Data Management System (RIMReP DMS)

Data Access Methods#

The old school way#

OPeNDAP#

Another example#

Web Feature Service (WFS)#

Direct access to files on cloud storage#

Zarr - a cloud-optimised data format#

Another example#

Parquet#

Alternative dataset#

Other methods#

ERDDAP#

New OGC APIs#

OGC Web Map Service (WMS)#

Further resources#