[Notebook] Dask and Xarray on AWS-HPC Cluster: Distributed Processing of Earth Data
This notebook continues the previous post by showing the actual code for distributed data processing.
In [1]:
%matplotlib inline
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from dask.diagnostics import ProgressBar
from dask_jobqueue import SLURMCluster
from distributed import Client, progress
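The dask_jobqueue and distributed imports are used to launch Dask workers as SLURM jobs later in the notebook. As a rough sketch of that setup (the partition name, worker size, and worker count below are assumptions, not values from the original run):

cluster = SLURMCluster(
    queue='compute',   # assumed SLURM partition name
    cores=36,          # assumed cores per worker job
    memory='60GB',     # assumed memory per worker job
)
cluster.scale(4)           # assumed number of workers to request
client = Client(cluster)  # connect this notebook to the cluster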
In [2]:
import dask
import distributed
dask.__version__, distributed.__version__
Out[2]:
In [3]:
# Disable HDF5 file locking, which can fail on shared file systems
# like the /fsx mount used below
%env HDF5_USE_FILE_LOCKING=FALSE
Data exploration
Data are organized by year/month:
In [4]:
ls /fsx
In [5]:
ls /fsx/2008/
In [6]:
ls /fsx/2008/01/data # one variable per file
In [7]:
# hourly data over a month
dr = xr.open_dataarray('/fsx/2008/01/data/sea_surface_temperature.nc')
dr
Out[7]:
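The call above reads metadata only; for Dask-based processing, the same file can be opened with explicit chunks so that later computations run in parallel. A minimal sketch (the one-day chunk size is an assumption, to be tuned to the workload):

# Chunked open: each Dask task would handle one day (24 hourly steps)
dr_lazy = xr.open_dataarray(
    '/fsx/2008/01/data/sea_surface_temperature.nc',
    chunks={'time': 24},
)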
In [8]:
# Static plot of the first time slice
fig, ax = plt.subplots(1, 1, figsize=[12, 8], subplot_kw={'projection': ccrs.PlateCarree()})
dr[0].plot(ax=ax, transform=ccrs.PlateCarree(), cbar_kwargs={'shrink': 0.6})
ax.coastlines();
What happens to the values over land? That is easier to check with an interactive plot.
In [9]:
import geoviews as gv
import hvplot.xarray
fig_hv = dr[0].hvplot.quadmesh(
    x='lon', y='lat', rasterize=True, cmap='viridis', geo=True,
    crs=ccrs.PlateCarree(), projection=ccrs.PlateCarree(), project=True,
    width=800, height=400,
) * gv.feature.coastline
# fig_hv  # uncomment to display the interactive plot inline
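If displaying the interactive figure inline makes the rendered notebook too heavy, one alternative (an assumption, not a step from the original notebook) is to export it to a standalone HTML file with hvplot.save:

import hvplot
hvplot.save(fig_hv, 'sst_interactive.html')  # hypothetical output filename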