The Large Impact of MPI Collective Algorithms on an AWS EC2 HPC Cluster

I recently found that, on an EC2 HPC cluster, OpenMPI collectives like MPI_Bcast and MPI_Allreduce are 3~5 times slower than Intel-MPI, due to their underlying collective algorithms. This problem is reported in detail at https://github.com/aws/aws-parallelcluster/issues/1436. The take-away is that one should use Intel-MPI instead of OpenMPI to get optimal performance on EC2. The performance of MPI_Bcast is critical for I/O-intensive applications that need to broadcast data from the master process to the rest of the processes.

It is also possible to fine-tune OpenMPI to narrow the performance gap. For MPI_Bcast, the "knomial tree" algorithm in OpenMPI4 can give a 2~3x speed-up over the out-of-the-box OpenMPI3/OpenMPI4, getting closer to Intel-MPI without EFA. The algorithm can be selected by:

orterun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 7 your_mpi_program

Here are the complete code and instructions to benchmark all algorithms: https://github.com/JiaweiZhuang/aws-mpi-benchmark
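
To make the measurement concrete, here is a minimal broadcast-timing sketch in Python (an illustration only, assuming mpi4py and NumPy are installed; the linked repository uses the OSU micro-benchmarks instead):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# 32 MB buffer; rank 0 fills it and broadcasts to everyone else
data = np.zeros(4 * 1024 * 1024, dtype=np.float64)
if rank == 0:
    data[:] = 1.0

comm.Barrier()                 # start everyone together
t0 = MPI.Wtime()
comm.Bcast(data, root=0)       # the collective whose algorithm we are tuning
comm.Barrier()
t1 = MPI.Wtime()

if rank == 0:
    print(f"MPI_Bcast of {data.nbytes / 1e6:.0f} MB took {(t1 - t0) * 1e3:.2f} ms")

Running it under orterun with the MCA flags above (e.g. orterun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 7 python bcast_bench.py, where bcast_bench.py is a hypothetical file name) lets you compare algorithms directly.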

Below is a comparison of all available MPI_Bcast algorithms (more in the GitHub issue). The x-axis is clipped to not show very slow algorithms. See the numbers on each bar for the actual time.

/images/mpi_collectives/MPI_bcast.png

MPI over Multiple TCP Connections on EC2 C5n Instances

This is just a quick note regarding interesting MPI behaviors on EC2.

EC2 C5n instances provide an amazing 100 Gb/s bandwidth 1, much higher than the 56 Gb/s FDR InfiniBand network on Harvard's HPC cluster 2. It turns out that actually getting the 100 Gb/s bandwidth takes a bit of tweaking. This post solely focuses on bandwidth. In terms of latency, Ethernet + TCP (20~30 us) can hardly compete with InfiniBand + RDMA (~1 us).

Tests are conducted on an AWS ParallelCluster with two c5n.18xlarge instances, as in my cloud-HPC guide.

1 TCP bandwidth test with iPerf

Before doing any MPI stuff, first use the general-purpose network testing tool iPerf/iPerf3. AWS has provided an example of using iPerf on EC2 3. Note that iPerf2 and iPerf3 handle parallelism quite differently 4: the --parallel/-P option in iPerf2 creates multiple threads (and thus can utilize multiple CPU cores), while the same option in iPerf3 opens multiple TCP connections but only one thread (and thus can only use a single CPU core). This can lead to quite different benchmark results at high concurrency.

1.1 Single-thread results with iPerf3

Start server:

$ ssh ip-172-31-2-150  # go to one compute node
$ sudo yum install iperf3
$ iperf3 -s  # let it keep running

Start client (in a separate shell):

$ ssh ip-172-31-11-54  # go to another compute node
$ sudo yum install iperf3

# Single TCP stream
# `-c` specifies the server hostname (EC2 private IP).
# Most parameters are kept as default, which seem to perform well.
$ iperf3 -c ip-172-31-2-150 -t 4
...
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-4.00   sec  4.40 GBytes  9.46 Gbits/sec    0             sender
[  4]   0.00-4.00   sec  4.40 GBytes  9.46 Gbits/sec                  receiver
...

# Multiple TCP streams
$ iperf3 -c ip-172-31-2-150 -P 4 -i 1 -t 4
...
[SUM]   0.00-4.00   sec  11.8 GBytes  25.4 Gbits/sec    0             sender
[SUM]   0.00-4.00   sec  11.8 GBytes  25.3 Gbits/sec                  receiver
...

A single stream gets ~9.5 Gb/s, while -P 4 achieves the maximum bandwidth of ~25.4 Gb/s. Using more streams does not help further, as the workload starts to become CPU-limited.

1.2 Multi-thread results with iPerf2

On server:

$ sudo yum install iperf
$ iperf -s

On client:

$ sudo yum install iperf

# Single TCP stream
$ iperf -c ip-172-31-2-150 -t 4  # consistent with iPerf3 result
...
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 4.0 sec  4.40 GBytes  9.45 Gbits/sec
...

# Multiple TCP streams
$ iperf -c ip-172-31-2-150 -P 36 -t 4  # much higher than with iPerf3
...
[SUM]  0.0- 4.0 sec  43.4 GBytes  93.0 Gbits/sec
...

Unlike iPerf3, iPerf2 is able to approach the theoretical 100 Gb/s by using all the available cores.

2 MPI bandwidth test with OSU micro-benchmarks

Next, run the OSU Micro-Benchmarks, a well-known MPI benchmarking suite. Similar tests can be done with the Intel MPI Benchmarks.

Get OpenMPI v4.0.0, which allows a single pair of MPI processes to use multiple TCP connections 5.

$ spack install openmpi@4.0.0+pmi schedulers=slurm  # Need to fix https://github.com/spack/spack/pull/10853

Get OSU:

$ spack install osu-micro-benchmarks ^openmpi@4.0.0+pmi schedulers=slurm

Focus on point-to-point communication here:

# all the later commands are executed in this directory
$ cd $(spack location -i osu-micro-benchmarks)/libexec/osu-micro-benchmarks/mpi/pt2pt/

2.1 Single-stream

osu_bw tests bandwidth between a single pair of MPI processes.

$ srun -N 2 -n 2 ./osu_bw
  OSU MPI Bandwidth Test v5.5
  Size      Bandwidth (MB/s)
1                       0.47
2                       0.95
4                       1.90
8                       3.78
16                      7.66
32                     15.17
64                     29.94
128                    53.69
256                   105.53
512                   202.98
1024                  376.49
2048                  626.50
4096                  904.27
8192                 1193.19
16384                1178.43
32768                1180.01
65536                1179.70
131072               1180.92
262144               1181.41
524288               1181.67
1048576              1181.62
2097152              1181.72
4194304              1180.56

This matches the single-stream result 9.5 Gb/s = 9.5/8 GB/s ~ 1200 MB/s from iPerf.

Note

1 GigaByte (GB) = 8 Gigabits (Gb)
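
As a quick sanity check of the unit conversion (a sketch; the numbers are copied from the outputs above):

# iPerf reports Gbits/s, osu_bw reports MB/s
iperf_gbps = 9.46                    # single-stream iPerf3 result
mb_per_s = iperf_gbps / 8 * 1000     # 8 bits per byte, 1000 MB per GB
print(f"{iperf_gbps} Gb/s ~ {mb_per_s:.0f} MB/s")   # ~1180 MB/s, matching osu_bw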

2.2 Multi-stream

osu_mbw_mr tests bandwidth between multiple pairs of MPI processes.

# Simply calling `srun` on `osu_mbw_mr` seems to hang forever. Not sure why.
$ # srun -N 2 --ntasks-per-node 36 ./osu_mbw_mr  # in principle it should work

# Doing it in two steps fixes the problem.
$ srun -N 2 --ntasks-per-node 72 --pty /bin/bash  # request interactive shell

# `osu_mbw_mr` requires the first half of MPI ranks to be on one node.
# Check it with the verbose output below. Slurm should have the correct placement by default.
$ $(spack location -i openmpi)/bin/orterun --tag-output hostname
...

# Actually running the benchmark
$ $(spack location -i openmpi)/bin/orterun ./osu_mbw_mr
  OSU MPI Multiple Bandwidth / Message Rate Test v5.5
  [ pairs: 72 ] [ window size: 64 ]
  Size                  MB/s        Messages/s
1                      17.94       17944422.39
2                      35.76       17878198.29
4                      71.85       17963002.53
8                     143.80       17974644.52
16                    283.00       17687790.85
32                    551.03       17219816.70
64                   1067.73       16683260.55
128                  2076.05       16219122.14
256                  3890.82       15198501.12
512                  6790.84       13263356.64
1024                10165.19        9926942.84
2048                11454.89        5593209.95
4096                11967.32        2921708.63
8192                12597.32        1537758.49
16384               12686.13         774299.68
32768               12765.72         389578.83
65536               12857.16         196184.66
131072              12829.56          97881.74
262144              12994.67          49570.75
524288              12988.97          24774.49
1048576             12983.20          12381.74
2097152             13011.67           6204.45
4194304             12910.31           3078.06

This matches the theoretical maximum bandwidth (100 Gb/s ~ 12500 MB/s).

On an InfiniBand cluster there is typically little difference between single-stream and multi-stream bandwidth. This is something to keep in mind when working with TCP/Ethernet on EC2.

3 Tweaking TCP connections for OpenMPI

OpenMPI v4.0.0 allows one pair of MPI processes to use multiple TCP connections via the btl_tcp_links parameter 5.

$ export OMPI_MCA_btl_tcp_links=36  # https://www.open-mpi.org/faq/?category=tuning#setting-mca-params
$ ompi_info --param btl tcp --level 4 | grep btl_tcp_links  # double check
MCA btl tcp: parameter "btl_tcp_links" (current value: "36", data source: environment, level: 4 tuner/basic, type: unsigned_int)
$ srun -N 2 -n 2 ./osu_bw
  OSU MPI Bandwidth Test v5.5
  Size      Bandwidth (MB/s)
1                       0.46
2                       0.92
4                       1.88
8                       3.78
16                      7.60
32                     14.95
64                     30.34
128                    56.95
256                   113.52
512                   213.36
1024                  400.18
2048                  665.80
4096                  963.67
8192                 1187.67
16384                1180.56
32768                1179.53
65536                2349.06
131072               2379.48
262144               2589.47
524288               2805.73
1048576              2853.21
2097152              2882.12
4194304              2811.13

This matches the previous iPerf3 result (25 Gb/s ~ 3000 MB/s) for single-thread, multi-TCP bandwidth. A single MPI pair can hardly go further, as the communication is now limited by a single thread/CPU.

This tweak doesn't actually improve the performance of my real-world HPC code, which should already have multiple MPI connections. The lesson learned is probably -- be careful when conducting micro-benchmarks. The out-of-the-box osu_bw can be misleading on EC2.

4 References

1. New C5n Instances with 100 Gbps Networking: https://aws.amazon.com/blogs/aws/new-c5n-instances-with-100-gbps-networking/

2. Odyssey Architecture: https://www.rc.fas.harvard.edu/resources/odyssey-architecture/

3. "How do I benchmark network throughput between Amazon EC2 Linux instances in the same VPC?" https://aws.amazon.com/premiumsupport/knowledge-center/network-throughput-benchmark-linux-ec2/

4. Discussions on multithreaded iperf3: https://github.com/esnet/iperf/issues/289

5. "Can I use multiple TCP connections to improve network performance?" https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links

[Notebook] Dask and Xarray on AWS-HPC Cluster: Distributed Processing of Earth Data

This notebook continues the previous post by showing the actual code for distributed data processing.

In [1]:
%matplotlib inline
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

from dask.diagnostics import ProgressBar
from dask_jobqueue import SLURMCluster
from distributed import Client, progress
In [2]:
import dask
import distributed
dask.__version__, distributed.__version__
Out[2]:
('1.1.3', '1.26.0')
In [3]:
%env HDF5_USE_FILE_LOCKING=FALSE
env: HDF5_USE_FILE_LOCKING=FALSE

Data exploration

Data are organized by year/month:

In [4]:
ls /fsx
2008/  2010/  2012/  2014/  2016/  2018/
2009/  2011/  2013/  2015/  2017/  QA/
In [5]:
ls /fsx/2008/
01/  02/  03/  04/  05/  06/  07/  08/  09/  10/  11/  12/
In [6]:
ls /fsx/2008/01/data  # one variable per file
air_pressure_at_mean_sea_level.nc*
air_temperature_at_2_metres_1hour_Maximum.nc*
air_temperature_at_2_metres_1hour_Minimum.nc*
air_temperature_at_2_metres.nc*
dew_point_temperature_at_2_metres.nc*
eastward_wind_at_100_metres.nc*
eastward_wind_at_10_metres.nc*
integral_wrt_time_of_surface_direct_downwelling_shortwave_flux_in_air_1hour_Accumulation.nc*
lwe_thickness_of_surface_snow_amount.nc*
northward_wind_at_100_metres.nc*
northward_wind_at_10_metres.nc*
precipitation_amount_1hour_Accumulation.nc*
sea_surface_temperature.nc*
snow_density.nc*
surface_air_pressure.nc*
In [7]:
# hourly data over a month
dr = xr.open_dataarray('/fsx/2008/01/data/sea_surface_temperature.nc')
dr
Out[7]:
<xarray.DataArray 'sea_surface_temperature' (time0: 744, lat: 640, lon: 1280)>
[609484800 values with dtype=float32]
Coordinates:
  * lon      (lon) float32 0.0 0.2812494 0.5624988 ... 359.43674 359.718
  * lat      (lat) float32 89.784874 89.5062 89.22588 ... -89.5062 -89.784874
  * time0    (time0) datetime64[ns] 2008-01-01T07:00:00 ... 2008-02-01T06:00:00
Attributes:
    standard_name:  sea_surface_temperature
    units:          K
    long_name:      Sea surface temperature
    nameECMWF:      Sea surface temperature
    nameCDM:        Sea_surface_temperature_surface
In [8]:
# Static plot of the first time slice
fig, ax = plt.subplots(1, 1, figsize=[12, 8], subplot_kw={'projection': ccrs.PlateCarree()})
dr[0].plot(ax=ax, transform=ccrs.PlateCarree(), cbar_kwargs={'shrink': 0.6})
ax.coastlines();

What happens to the values over the land? It's easier to check with an interactive plot.

In [9]:
import geoviews as gv
import hvplot.xarray

fig_hv = dr[0].hvplot.quadmesh(
    x='lon', y='lat', rasterize=True, cmap='viridis', geo=True,
    crs=ccrs.PlateCarree(), projection=ccrs.PlateCarree(), project=True,
    width=800, height=400, 
) * gv.feature.coastline

# fig_hv 
In [10]:
# This is just a hack to display figure on Nikola blog post
# If you know an easier way let me know
import holoviews as hv
from bokeh.resources import CDN, INLINE
from bokeh.embed import file_html
from IPython.display import HTML

HTML(file_html(hv.render(fig_hv), CDN))
Out[10]:
Bokeh Application

So it turns out that the "temperature" over the land is set to 273.16 K (0 degrees Celsius). A better approach would probably be to mask those points out.

Serial read with master node

Let's see how slow it is to read one year of data with only the master node.

In [11]:
# Just querying metadata will cause files to be pulled from S3 to FSx.
# This takes a while at the first execution. Much faster the second time.
%time ds_1yr = xr.open_mfdataset('/fsx/2008/*/data/sea_surface_temperature.nc', chunks={'time0': 50})
dr_1yr = ds_1yr['sea_surface_temperature']
dr_1yr
CPU times: user 68.7 ms, sys: 15.3 ms, total: 84 ms
Wall time: 140 ms
Out[11]:
<xarray.DataArray 'sea_surface_temperature' (time0: 8784, lat: 640, lon: 1280)>
dask.array<shape=(8784, 640, 1280), dtype=float32, chunksize=(50, 640, 1280)>
Coordinates:
  * lon      (lon) float32 0.0 0.2812494 0.5624988 ... 359.43674 359.718
  * lat      (lat) float32 89.784874 89.5062 89.22588 ... -89.5062 -89.784874
  * time0    (time0) datetime64[ns] 2008-01-01T07:00:00 ... 2009-01-01T06:00:00
Attributes:
    standard_name:  sea_surface_temperature
    units:          K
    long_name:      Sea surface temperature
    nameECMWF:      Sea surface temperature
    nameCDM:        Sea_surface_temperature_surface

The aggregated size is ~29 GB:

In [12]:
dr_1yr.nbytes / 1e9  # GB
Out[12]:
28.7834112
In [13]:
with ProgressBar():
    mean_1yr_ser = dr_1yr.mean().compute()
[########################################] | 100% Completed |  2min 15.3s
In [14]:
mean_1yr_ser
Out[14]:
<xarray.DataArray 'sea_surface_temperature' ()>
array(282.24515, dtype=float32)

It takes ~2 min. Reading the full 10-year dataset would thus take ~20 min. Such slowness encourages the use of a distributed cluster.
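
A back-of-envelope check of the serial throughput (a sketch; the numbers are taken from the outputs above):

size_gb = 28.78      # dr_1yr.nbytes / 1e9
wall_s = 135         # ~2 min 15 s from the ProgressBar output
print(f"~{size_gb / wall_s:.2f} GB/s")   # ~0.2 GB/s on the master node alone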

Parallel read with dask cluster

Cluster initialization

In [15]:
!sinfo  # spin-up 8 idle nodes with AWS ParallelCluster
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      8   idle ip-172-31-0-[18,165],ip-172-31-3-104,ip-172-31-7-49,ip-172-31-10-179,ip-172-31-11-[156,246],ip-172-31-13-3
In [16]:
!mkdir -p ./dask_tempdir
In [17]:
# Reference: https://jobqueue.dask.org/en/latest/configuration.html
# - "cores" is the number of CPUs used per Slurm job. 
# Here it is fixed at 72, the number of vCPUs per c5n.18xlarge node, so one Slurm job gets exactly one node.
# - "processes" specifies the number of dask workers in a single Slurm job.
# - "memory" specifies the memory requested in a single Slurm job.

cluster = SLURMCluster(cores=72, processes=36, memory='150GB', 
                       local_directory='./dask_tempdir')
In [18]:
# 8 nodes * 36 workers/node
cluster.scale(8*36)
cluster

Visit http://localhost:8787 for the dashboard.

In [19]:
# remember to also create dask client to talk to the cluster!
client = Client(cluster)  # automatically switches to distributed mode
client
Out[19]:

Client

Cluster

  • Workers: 288
  • Cores: 576
  • Memory: 1.20 TB
In [20]:
# now the default scheduler is dask.distributed
dask.config.get('scheduler')
Out[20]:
'dask.distributed'
In [21]:
!sinfo  # nodes are now fully allocated
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      8  alloc ip-172-31-0-[18,165],ip-172-31-3-104,ip-172-31-7-49,ip-172-31-10-179,ip-172-31-11-[156,246],ip-172-31-13-3
In [22]:
!squeue  # all are dask worker jobs, one per compute node
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                10   compute dask-wor   centos  R       0:14      1 ip-172-31-0-18
                11   compute dask-wor   centos  R       0:14      1 ip-172-31-0-165
                12   compute dask-wor   centos  R       0:14      1 ip-172-31-3-104
                13   compute dask-wor   centos  R       0:14      1 ip-172-31-7-49
                14   compute dask-wor   centos  R       0:14      1 ip-172-31-10-179
                15   compute dask-wor   centos  R       0:14      1 ip-172-31-11-156
                16   compute dask-wor   centos  R       0:14      1 ip-172-31-11-246
                17   compute dask-wor   centos  R       0:14      1 ip-172-31-13-3

Read 1-year data

In [23]:
# Actually, no need to reopen files. Can just reuse the previous dask graph and put it onto the cluster

ds_1yr = xr.open_mfdataset('/fsx/2008/*/data/sea_surface_temperature.nc', chunks={'time0': 50})
dr_1yr = ds_1yr['sea_surface_temperature']
dr_1yr
Out[23]:
<xarray.DataArray 'sea_surface_temperature' (time0: 8784, lat: 640, lon: 1280)>
dask.array<shape=(8784, 640, 1280), dtype=float32, chunksize=(50, 640, 1280)>
Coordinates:
  * lon      (lon) float32 0.0 0.2812494 0.5624988 ... 359.43674 359.718
  * lat      (lat) float32 89.784874 89.5062 89.22588 ... -89.5062 -89.784874
  * time0    (time0) datetime64[ns] 2008-01-01T07:00:00 ... 2009-01-01T06:00:00
Attributes:
    standard_name:  sea_surface_temperature
    units:          K
    long_name:      Sea surface temperature
    nameECMWF:      Sea surface temperature
    nameCDM:        Sea_surface_temperature_surface
In [24]:
%time mean_1yr_par = dr_1yr.mean().compute()
CPU times: user 1.15 s, sys: 173 ms, total: 1.32 s
Wall time: 5.33 s

The throughput is roughly 29 GB / 5 s ~ 6 GB/s. This seems to exceed our Lustre bandwidth of ~3 GB/s. That's likely because the actual NetCDF files are compressed, and the 29 GB refers to the uncompressed in-memory arrays.
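
One way to check the compression hypothesis (a sketch, reusing dr_1yr from above) is to compare the on-disk size of the 2008 files with the in-memory size:

import glob, os

files = sorted(glob.glob('/fsx/2008/*/data/sea_surface_temperature.nc'))
on_disk_gb = sum(os.path.getsize(f) for f in files) / 1e9
print(f"{len(files)} files, {on_disk_gb:.1f} GB on disk vs "
      f"{dr_1yr.nbytes / 1e9:.1f} GB in memory")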

In [25]:
mean_1yr_par.equals(mean_1yr_ser) # consistent with serial result
Out[25]:
True
In [26]:
len(dr_1yr.chunks[0])  # number of dask chunks
Out[26]:
179

There are actually not that many chunks to keep all the dask workers busy, even though I am using quite small chunks. Let's try more files.

Read multi-year data

For this part you might get a "Too many open files" error. If so, run sudo sh -c "ulimit -n 65535 && exec su $LOGNAME" to raise the limit before starting Jupyter (ref: https://stackoverflow.com/a/17483998).

In [27]:
file_list = [f'/fsx/{year}/{month:02d}/data/sea_surface_temperature.nc' 
             for year in range(2008, 2018) for month in range(1, 13)]
len(file_list)  # number of files
Out[27]:
120
In [28]:
file_list[0:3], file_list[-3:]  # span over multiple years
Out[28]:
(['/fsx/2008/01/data/sea_surface_temperature.nc',
  '/fsx/2008/02/data/sea_surface_temperature.nc',
  '/fsx/2008/03/data/sea_surface_temperature.nc'],
 ['/fsx/2017/10/data/sea_surface_temperature.nc',
  '/fsx/2017/11/data/sea_surface_temperature.nc',
  '/fsx/2017/12/data/sea_surface_temperature.nc'])
In [29]:
# Will cause data to be pulled from S3 to FSx.
# Will take a long time at the first execution. Much faster the second time.
%time ds_10yr = xr.open_mfdataset(file_list, chunks={'time0': 50})
dr_10yr = ds_10yr['sea_surface_temperature']
dr_10yr
CPU times: user 734 ms, sys: 110 ms, total: 844 ms
Wall time: 1.34 s
Out[29]:
<xarray.DataArray 'sea_surface_temperature' (time0: 87672, lat: 640, lon: 1280)>
dask.array<shape=(87672, 640, 1280), dtype=float32, chunksize=(50, 640, 1280)>
Coordinates:
  * lon      (lon) float32 0.0 0.2812494 0.5624988 ... 359.43674 359.718
  * lat      (lat) float32 89.784874 89.5062 89.22588 ... -89.5062 -89.784874
  * time0    (time0) datetime64[ns] 2008-01-01T07:00:00 ... 2018-01-01T06:00:00
Attributes:
    standard_name:  sea_surface_temperature
    units:          K
    long_name:      Sea surface temperature
    nameECMWF:      Sea surface temperature
    nameCDM:        Sea_surface_temperature_surface

Nearly 300 GB!

In [30]:
dr_10yr.nbytes / 1e9 # GB
Out[30]:
287.2836096
In [32]:
%time mean_10yr = dr_10yr.mean().compute()
CPU times: user 9.53 s, sys: 1.06 s, total: 10.6 s
Wall time: 16.3 s

The throughput is roughly 287 GB / 15 s ~ 19 GB/s?! Again, that's likely due to HDF5/NetCDF compression.

Finally, instead of getting a scalar value (which is boring), let's get a time-series of global mean SST:

In [33]:
%time ts_10yr = dr_10yr.mean(dim=['lat', 'lon']).compute()
CPU times: user 11 s, sys: 1.47 s, total: 12.5 s
Wall time: 17 s

Despite my vastly inaccurate approximation (not masking out land, not weighting by grid cell areas, in order to keep the code simple), we can still see a clear increase of mean SST over the past years (0.2 °C is sort of a big deal for climate; more rigorous calculations suggest more).
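
For reference, a slightly more careful version would weight by cos(latitude) and mask the land fill values; a sketch (not what was run above, and assuming a recent xarray with DataArray.weighted()):

import numpy as np

# crude land mask, assuming land points are filled with exactly 273.16 K
sst = dr_10yr.where(abs(dr_10yr - 273.16) > 1e-3)
weights = np.cos(np.deg2rad(dr_10yr['lat']))       # proportional to grid-cell area
ts_weighted = sst.weighted(weights).mean(dim=['lat', 'lon']).compute()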

In [35]:
ts_10yr.plot(size=6)
Out[35]:
[<matplotlib.lines.Line2D at 0x7f834e32fc18>]

Experiments with AWS FSx for Lustre: I/O Benchmark and Dask Cluster Deployment

(This post was tweeted by Jeff Barr and by HPC Guru)

AWS has recently launched an extremely interesting new service called FSx for Lustre. Since last week, FSx has also been integrated with the AWS ParallelCluster framework 1, so you can spin up a Lustre-enabled HPC cluster with one click. As a reminder for non-HPC people, Lustre is a well-known high-performance parallel file system, deployed on many world-class supercomputers 2. Previously, ParallelCluster (and its predecessor CfnCluster) only provided a Network File System (NFS), running on the master node and exported to all compute nodes. This often causes a serious bottleneck for I/O-intensive applications. A fully-managed Lustre service should largely solve such I/O problems. Furthermore, FSx is "deeply integrated with Amazon S3" 3. Considering the incredible amount of public datasets living in S3 4, such integration can lead to endless interesting use cases.

As a new user of FSx, my natural questions are:

  • Does it deliver the claimed performance?

  • How easily can it be used for real-world applications?

To answer these questions, I did some initial experiments:

  • Performed I/O benchmark with IOR, the de-facto HPC I/O benchmark framework.

  • Deployed a Dask cluster on top of AWS HPC environment, and used it to process public Earth science data in S3 buckets.

The two experiments are independent so feel free to jump to either one.

Disclaimer: This is just a preliminary result and is by no means a rigorous evaluation. In particular, I have made no effort to fine-tune the performance. I am not affiliated with AWS (although I do have their generous funding), so please treat this as a third-party report.

1 Cluster infrastructure

The first step is to spin up an AWS ParallelCluster. See my previous post for a detailed walk-through.

1.1 AWS ParallelCluster configuration

The ~/.parallelcluster/config file is almost the same as in the previous post. The only new thing is the fsx section. At the time of writing, only CentOS 7 is supported 5.

[cluster your-cluster-section-name]
base_os = centos7
...
fsx_settings = fs

[fsx fs]
shared_dir = /fsx
storage_capacity = 14400

According to AWS 6, a 14,400 GB file system would deliver 2,880 MB/s throughput. You may use a much bigger size for production, but this size should be good for initial testing.
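
The throughput scales with storage size; a rough sanity check, assuming the ~200 MB/s of throughput per TB of storage described in the FSx performance docs 6:

storage_gb = 14400
throughput_mb_s = storage_gb / 1000 * 200    # ~200 MB/s per TB of storage
print(f"{storage_gb} GB -> ~{throughput_mb_s:.0f} MB/s")   # ~2880 MB/s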

The above config will create an empty file system. A more interesting usage is to pre-import S3 buckets:

[fsx fs]
shared_dir = /fsx
storage_capacity = 14400
imported_file_chunk_size = 1024
import_path = s3://era5-pds

Here I am using the ECMWF ERA5 Reanalysis data 7. The data format is NetCDF4. Other datasets would work similarly.

pcluster create can take several minutes longer than without FSx, because provisioning a Lustre server probably involves heavy-lifting on the AWS side. Go to the "Amazon FSx" console to check the creation status.

After a successful launch, log in and run df -h to make sure that Lustre is properly mounted to /fsx.

1.2 Testing integration with existing S3 bucket

Objects in the bucket will appear as normal on-disk files.

$ cd /fsx
$ ls # displays objects in the S3 bucket
...

However, at this point FSx only stores the metadata:

$ cd /fsx/2008/01/data # specific to ERA5 data
$ ls -lh *   # the data appear to be big (~1 GB)
-rwxr-xr-x 1 root root  988M Jul  4  2018 air_pressure_at_mean_sea_level.nc
...
$ du -sh *  # but the actual content is super small.
512 air_pressure_at_mean_sea_level.nc
...

The actual data will be pulled from S3 when accessed. For a NetCDF4 file, either ncdump -h or h5ls will display its basic contents and cause the entire file to be pulled from S3.

$ ncdump -h air_pressure_at_mean_sea_level.nc  # `ncdump` is installable from `sudo yum install netcdf`, or from Spack, or from Conda
...
$ du -sh *  # now much bigger
962M        air_pressure_at_mean_sea_level.nc
...

Note

If you get an HDF5 error on Lustre, set export HDF5_USE_FILE_LOCKING=FALSE 8.

2 I/O benchmark by IOR

For general reference, see IOR's documentation: https://ior.readthedocs.io

2.1 Install IOR by Spack

Configure Spack as in the previous post. Then, getting IOR is simply:

$ spack install ior ^openmpi+pmi schedulers=slurm

IOR is also quite easy to install from source, outside of Spack.

Discover the ior executable by:

$ export PATH=$(spack location -i ior)/bin:$PATH

2.2 I/O benchmark result

With two c5n.18xlarge compute nodes running, a multi-node, parallel write-read test can be done by:

$ mkdir /fsx/ior_tempdir
$ cd /fsx/ior_tempdir
$ srun -N 2 --ntasks-per-node 36 ior -t 1m -b 16m -s 4 -F -C -e
...
Max Write: 1632.01 MiB/sec (1711.28 MB/sec)
Max Read:  1654.59 MiB/sec (1734.96 MB/sec)
...

Conducting a proper I/O benchmark is not straightforward, due to various caching effects. IOR implements several tricks (reflected in command line parameters) to get around those effects 9.

I can get maximum throughput with 8 client nodes:

$ srun -N 8 --ntasks-per-node 36 ior -t 1m -b 16m -s 4 -F -C -e
...
Max Write: 2905.59 MiB/sec (3046.73 MB/sec)
Max Read:  2879.96 MiB/sec (3019.85 MB/sec)
...

This matches the 2,880 MB/s claimed by AWS! Using more nodes shows marginal improvement, since the bandwidth should already be saturated.

The logical next step is to test I/O-heavy HPC applications and conduct detailed I/O profiling. In this post, however, I decided to try a more interesting use case -- big data analytics.

3 Dask cluster on top of AWS HPC stack

The entire idea comes from the Pangeo project (http://pangeo.io) that aims to develop a big-data geoscience platform on HPC and cloud. At its core, Pangeo relies on two excellent Python libraries:

  • Xarray (http://xarray.pydata.org), which is probably the best way to handle NetCDF files and many other data formats in geoscience. It is also used as a general-purpose "multi-dimensional Pandas" outside of geoscience.

  • Dask (https://dask.org), a parallel computing library that can scale NumPy, Pandas, Xarray, and Scikit-Learn to parallel and distributed environments. In particular, Dask-distributed handles distributed computing.

The normal way to deploy Pangeo on the cloud is via Dask-Kubernetes, leveraging fully-managed Kubernetes services.

On the other hand, the deployment of Pangeo on local HPC clusters is through Dask-Jobqueue 10.

Since we already have a fully-fledged HPC cluster (with Slurm + MPI + Lustre), there is no reason not to test the second approach. Is AWS now a cloud platform or an HPC cluster? The boundary seems to be blurring.

3.1 Cluster deployment with Dask-jobqueue

The deployment turns out to be extremely easy. I am still on the Kubernetes learning curve, and this alternative HPC approach feels much more straightforward for an HPC person like me.

First, get Miniconda:

$ cd /shared
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
$ bash miniconda.sh -b -p miniconda
$ echo ". /shared/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc
$ source ~/.bashrc
$ conda create -n py37 python=3.7
$ conda activate py37  # replaces `source activate` for conda>=4.4
$ conda install -c conda-forge xarray netCDF4 cartopy dask-jobqueue jupyter

Optionally, install additional visualization libraries that I will use later:

$ pip install geoviews hvplot datashader

Note

It turns out that we don't need to install MPI4Py! Dask-jobqueue only needs a scheduler (here we have Slurm) to launch processes, and uses its own communication mechanism (defaults to TCP) 11.

With two idle c5n.18xlarge nodes, use the following code in ipython to initialize a distributed cluster:

from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(cores=72, processes=36, memory='150GB')  # Slurm thinks there are 72 cores per node due to EC2 hyperthreading
cluster.scale(2*36)

from distributed import Client
client = Client(cluster)
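
To see exactly what gets submitted to Slurm, you can print the batch script generated by dask-jobqueue (job_script() is part of its API); note that there is no mpirun in it, just plain worker processes talking over TCP:

# Peek at the Slurm batch script that dask-jobqueue will submit
print(cluster.job_script())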

In a separate shell, use sinfo to check the node status -- they should be fully allocated.

To enable Dask's dashboard 12, add an additional SSH connection in a new shell:

$ pcluster ssh your-cluster-name -N -L 8787:localhost:8787

Visit localhost:8787 in the web browser (NOT something like http://172.31.5.224:8787 shown in Python).

Alternatively, everything can be put together, including Jupyter notebook's port-forwarding:

$ pcluster ssh your-cluster-name -L 8889:localhost:8889 -L 8787:localhost:8787
$ conda activate py37
$ jupyter notebook --NotebookApp.token='' --no-browser --port=8889

Visit localhost:8889 to use the notebook.

That's all about the deployment! This Dask cluster is able to perform parallel read/write with the Lustre file system.

3.2 Computation with Dask-distributed

As an example, I compute the average Sea Surface Temperature (SST) 13 over nearly 300 GB of ERA5 data. It gets done in 15 seconds with 8 compute nodes, which would have taken > 20 minutes on a single small node. Here's a screen recording of the Dask dashboard during the computation.

The full code is available in the next notebook, with some technical comments. The end of the notebook also shows a sign of climate change (computed from the SST data), so at least we get a bit of scientific insight from this toy problem. Hopefully such great computing power can be used to solve some big science problems.

4 Final thoughts

Back to my initial questions:

  • Does it deliver the claimed performance? Yes, and very accurately, at least for the moderate size I tried. A larger-scale benchmark is TBD though.

  • How easily can it be used for real-world applications? It turns out to be quite easy. All building blocks are already there, and I just need to put them together. It took me one day to get such initial tests done.

This HPC approach might be an alternative way of deploying the Pangeo big data stack on AWS. Some differences from the Kubernetes + pure S3 way are:

  • No need to worry about the HDF + Cloud problem 14. People can now access data in S3 through a POSIX-compliant, high-performance file system interface. This seems like a big deal because huge amounts of data are already in HDF & NetCDF formats, and converting them to a more cloud-friendly format like Zarr might take some effort.

  • It is probably easier for existing cloud-HPC users to adopt. Numerical simulations and post-processing can be done in exactly the same environment.

  • It is likely to cost more (I haven't calculated rigorously), due to heavier resource provisioning. Lustre essentially acts as a huge cache for S3. In the long term, this kind of data analytics workflow should probably be handled in a more cloud-native way, using Lambda-like serverless computing, to maximize resource utilization and minimize computational cost. But it is nice to have something that "just works" right now.

Some possible further steps:

  • The performance can be fine-tuned indefinitely. There is an extremely-large parameter space: Lustre stripe size, HDF5 chunk size, Dask chunk size, Dask processes vs threads, client instance counts and types... But unless there are important scientific/business needs, fine-tuning it doesn't seem super interesting.

  • For me personally, this provides a very convenient test environment for scaling-out xESMF 15, the regridding package I wrote. Because the entire pipeline is clearly I/O-limited, what I really need is just a fast file system.

  • The most promising use case is probably some deep-learning-like climate analytics 16. DL algorithms are generally data-hungry, and the best place to put massive datasets is, without doubt, the cloud. How Dask + Xarray + Pangeo fit into the DL workflow is under active discussion 17.

5 References

1. Added in ParallelCluster v2.2.1: https://github.com/aws/aws-parallelcluster/releases/tag/v2.2.1. See the FSx section in the docs: https://aws-parallelcluster.readthedocs.io/en/latest/configuration.html#fsx

2. For example, NASA's supercomputing facility provides a nice user guide on Lustre: https://www.nas.nasa.gov/hecc/support/kb/102/

3. See "Using S3 Data Repositories" in the FSx guide: https://docs.aws.amazon.com/fsx/latest/LustreGuide/fsx-data-repositories.html

4. See the Registry of Open Data on AWS: https://registry.opendata.aws/. A large fraction of them are Earth data: https://aws.amazon.com/earth/.

5. See this issue: https://github.com/aws/aws-parallelcluster/issues/896

6. See "Amazon FSx for Lustre Performance": https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html

7. Search for "ECMWF ERA5 Reanalysis" in the Registry of Open Data on AWS: https://registry.opendata.aws/ecmwf-era5. As a reminder for non-atmospheric people, a reanalysis is like the best guess of past atmospheric states, obtained from observations and simulations. For a more detailed but non-technical introduction, read "Reanalyses and Observations: What's the Difference?" at https://journals.ametsoc.org/doi/full/10.1175/BAMS-D-14-00226.1

8. https://stackoverflow.com/questions/49317927/errno-101-netcdf-hdf-error-when-opening-netcdf-file

9. See "First Steps with IOR": https://ior.readthedocs.io/en/latest/userDoc/tutorial.html

10. See "Getting Started with Pangeo on HPC": https://pangeo.readthedocs.io/en/latest/setup_guides/hpc.html

11. See the "High Performance Computers" section in the Dask docs: http://docs.dask.org/en/latest/setup/hpc.html

12. See "Viewing the Dask Dashboard" in the Dask-Jobqueue docs: https://jobqueue.dask.org/en/latest/interactive.html#viewing-the-dask-dashboard

13. SST is an important climate change indicator: https://www.epa.gov/climate-indicators/climate-change-indicators-sea-surface-temperature

14. HDF in the Cloud: challenges and solutions for scientific data: http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

15. Initial tests regarding distributed regridding with xESMF on Pangeo: https://github.com/pangeo-data/pangeo/issues/334

16. For example, see Berkeley Lab's ClimateNet: https://cs.lbl.gov/news-media/news/2019/climatenet-aims-to-improve-machine-learning-applications-in-climate-science-on-a-global-scale/

17. See the discussion in this issue: https://github.com/pangeo-data/pangeo/issues/567

A Scientist's Guide to Cloud-HPC: Example with AWS ParallelCluster, Slurm, Spack, and WRF

1 Motivation and principle of this guide

Cloud-HPC is growing rapidly 1, and the growth will only get faster with AWS's recent HPC-oriented services such as EC2 c5n, FSx for Lustre, and the soon-coming EFA. However, orchestrating a cloud-HPC cluster is by no means easy, especially considering that many HPC users come from science and engineering and are not trained in IT and system administration. There is very little documentation for this niche field, and users can face a pretty steep learning curve. To make people's lives a bit easier (and to provide a reference for my future self), I wrote this guide to show an easy-to-follow workflow for building a fully-fledged HPC cluster environment on AWS.

My basic principles are:

  • Minimize the learning curve for non-IT/non-CS people. That being said, it can still take a while for new users to learn. But you should be able to use a cloud-HPC cluster with confidence after going through this guide.

  • Focus on common, general, transferable cases. I avoid diving into a particular scientific field, or into a niche AWS utility with no counterpart on other cloud platforms -- those can be left to other posts, but do not belong in this guide.

This guide will go through:

  • Spin up an HPC cluster with AWS ParallelCluster, AWS's official HPC framework. If you prefer a multi-platform, general-purpose tool, consider ElastiCluster, but expect a steeper learning curve and less up-to-date AWS features. If you feel that all those frameworks are too black-boxy, try building a cluster manually 2 to understand how multiple nodes are glued together. The manual approach becomes quite inconvenient in production, so you will be much better off using a higher-level framework.

  • Basic cluster operations with Slurm, an open-source, modern job scheduler deployed on many HPC centers. ParallelCluster can also use AWS Batch instead of Slurm as the scheduler; it is a very interesting feature but I will not cover it here.

  • Common cluster management tricks such as changing the node number and type on the fly.

  • Install the HPC software stack with Spack, an open-source, modern HPC package manager used in production at many HPC centers. This part should also work on other cloud platforms, on your own workstation, or in a container.

  • Build real-world HPC code. As an example I will use the Weather Research and Forecasting (WRF) Model, an open-source, well-known atmospheric model. This is just to demonstrate that getting real applications running is relatively straightforward. Adapt it for your own use cases.

2 Prerequisites

This guide only assumes:

  • Basic EC2 knowledge. Knowing how to use a single instance is good enough. Thanks to the widespread ML/DL hype, this seems to have become a common skill for science & engineering students -- most people in my department (non-CS) know how to use the AWS DL AMI, Google DL VM, or Azure DS VM. If not, the easiest way to learn is probably through online DL courses 3.

  • Basic S3 knowledge. Knowing how to aws s3 cp is good enough. If not, check out AWS's 10-min tutorial.

  • Entry-level HPC user knowledge. Knowing how to submit MPI jobs is good enough. If not, check out HPC Carpentry 4.

It does NOT require knowledge of:

  • CloudFormation. It is the underlying framework for AWS ParallelCluster (and many third-party tools), but can take quite a while to learn.

  • Cloud networking. You can use the cluster smoothly even without knowing what TCP is.

  • How to build complicated libraries from source -- this will be handled by Spack.

3 Cluster deployment

This section uses ParallelCluster version 2.2.1 as of Mar 2019. Future versions shouldn't be vastly different.

First, check out ParallelCluster's official doc: https://aws-parallelcluster.readthedocs.io. It guides you through some toy examples, but not production-ready applications. Play with the toy examples a bit and get familiar with those basic commands:

  • pcluster configure

  • pcluster create

  • pcluster list

  • pcluster ssh

  • pcluster delete

3.1 A minimum config file for production HPC

The cluster infrastructure is fully specified by ~/.parallelcluster/config. A minimum, recommended config file would look like:

[aws]
aws_region_name = xxx

[cluster your-cluster-section-name]
key_name = xxx
base_os = centos7
master_instance_type = c5n.large
compute_instance_type = c5n.18xlarge
cluster_type = spot
initial_queue_size = 2
scheduler = slurm
placement_group = DYNAMIC
vpc_settings = your-vpc-section-name
ebs_settings = your-ebs-section-name

[vpc your-vpc-section-name]
vpc_id = vpc-xxxxxxxx
master_subnet_id = subnet-xxxxxxxx

[ebs your-ebs-section-name]
shared_dir = shared
volume_type = st1
volume_size = 500

[global]
cluster_template = your-cluster-section-name
update_check = true
sanity_check = true

A brief comment on the settings:

  • aws_region_name should be set at initial pcluster configure. I use us-east-1.

  • key_name is your EC2 key-pair name, for ssh to master instance.

  • base_os = centos7 should be a good choice for HPC, because CentOS is particularly tolerant of legacy HPC code. Some code that doesn't build on Ubuntu can actually build fine on CentOS. Absent build problems, any OS choice is fine -- you shouldn't observe a visible performance difference across OSes, as long as the compilers are the same.

  • Use the biggest compute node, c5n.18xlarge, to minimize inter-node communication. The master node is less critical for performance, so its type is totally up to you.

  • cluster_type = spot will save you a lot of money by using spot instances for compute nodes.

  • initial_queue_size = 2 spins up two compute nodes at initial launch. This is the default but worth emphasizing. Sometimes there is not enough compute capacity in a zone, and with initial_queue_size = 0 you won't be able to detect that at cluster creation.

  • Set scheduler = slurm as we are going to use it in later sections.

  • placement_group = DYNAMIC creates a placement group 5 on the fly so you don't need to create one yourself. Simply put, a cluster placement group improves inter-node connection.

  • vpc_id and master_subnet_id should be set at the initial pcluster configure. Because a subnet ID is tied to an Availability Zone 6, the subnet option implicitly specifies which Availability Zone your cluster will be launched into. You may want to change it because spot pricing and capacity vary across Availability Zones.

  • volume_type = st1 specifies a throughput-optimized HDD 7 as the shared disk. The minimum size is 500 GB. It will be mounted to a directory /shared (which is also the default) and will be visible to all nodes.

  • cluster_template allows you to put multiple cluster configurations in a single config file and easily switch between them.

Credential information like aws_access_key_id can be omitted, as it will default to awscli credentials stored in ~/.aws/credentials.

The full list of parameters is available in the official docs 8. Other useful parameters you may consider changing are:

  • Set placement = cluster to also put your master node in the placement group.

  • Specify s3_read_write_resource so you can access that S3 bucket without configuring AWS credentials on the cluster. Useful for archiving data.

  • Increase master_root_volume_size and compute_root_volume_size, if your code involves heavy local disk I/O.

  • max_queue_size and maintain_initial_size are less critical as they can be easily changed later.

I have omitted the FSx section, which is left to the next post.

One last thing: a lot of HPC code runs faster with hyperthreading disabled 9. To achieve this at launch, you can write a custom script and execute it via the post_install option in pcluster's config file. This is a bit involved though. Hopefully there will be a simple option in future versions of pcluster. (Update in Nov 2019: the AWS ParallelCluster v2.5.0 release adds a new option disable_hyperthreading = true that greatly simplifies things!)

With the config file in place, run pcluster create your-cluster-name to launch a cluster.

3.2 What are actually deployed

(This part is not required for first-time users. It just helps understanding.)

AWS ParallelCluster (or other third-party cluster tools) glues many AWS services together. While not required, a bit more understanding of the underlying components would be helpful -- especially when debugging and customizing things.

The official doc provides a conceptual overview 10. Here I give a more hands-on introduction by actually walking through the AWS console. When a cluster is running, you will see the following components in the console:

  • CloudFormation Stack. Displayed under "Services" - "CloudFormation". This is the top-level framework that controls the rest. You shouldn't need to touch it, but its output can be useful for debugging.

/images/pcluster_components/cloudformation.png

The rest of the services are all displayed under the main EC2 console.

  • EC2 Placement Group. It is created automatically because of the line placement_group = DYNAMIC in the config file.

/images/pcluster_components/placement_group.png
  • EC2 Instances. Here, one master node and two compute nodes are running, as specified by the config file. You can directly ssh to the master node, but the compute nodes are only accessible from the master node, not from the Internet.

/images/pcluster_components/ec2_instance.png
  • EC2 Auto Scaling Group. Your compute instances belong to an Auto Scaling group 11, which can quickly adjust the number of instances with minimum human operation. The number under the "Instances" column shows the current number of compute nodes; the "Desired" column shows the target number of nodes, and this number can be adjusted automatically by the Slurm scheduler; the "Min" column specifies the lower bound of nodes, which cannot be changed by the scheduler; the "Max" column corresponds to max_queue_size in the config file. You can manually change the number of compute nodes here (more on this later).

/images/pcluster_components/autoscaling.png

The launch event is recorded in the "Activity History"; if a node fails to launch, the error message will go here.

/images/pcluster_components/activity_history.png
  • EC2 Launch Template. It specifies the EC2 instance configuration (like instance type and AMI) for the above Auto Scaling Group.

/images/pcluster_components/launch_template.png
  • EC2 Spot Request. With cluster_type = spot, each compute node is associated with a spot request.

/images/pcluster_components/spot_requests.png
  • EBS Volume. You will see three kinds of volumes: a standalone volume specified in the ebs section, a volume for the master node, and a few volumes for the compute nodes.

/images/pcluster_components/ebs_volume.png
  • Auxiliary Services. They are not directly related to the computation, but help glue the major computing services together. For example, the cluster uses DynamoDB (Amazon's NoSQL database) to store some metadata. The cluster also relies on Amazon SNS and SQS for the interaction between the Slurm scheduler and the Auto Scaling group. We will see this in action later.

/images/pcluster_components/dynamo_db.png

Imagine the workload involved if you launch all the above resources by hand and glue them together. Fortunately, as a user, there is no need to implement those from scratch. But it is good to know a bit about the underlying components.

In most cases, you should not manually modify those individual resources. For example, if you terminate a compute instance, a new one will be automatically launched to match the current autoscaling requirement. Let the high-level pcluster command handle the cluster operation. Some exceptions will be mentioned in the "tricks" section later.

4 ParallelCluster basic operation

4.1 Using Slurm

After logging in to the master node with pcluster ssh, you will use Slurm to interact with the compute nodes. Here I summarize commonly-used commands. For general reference, see Slurm's documentation: https://www.schedmd.com/.

Slurm is pre-installed at /opt/slurm/ :

$ which sinfo
/opt/slurm/bin/sinfo

Check compute node status:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      2   idle ip-172-31-3-187,ip-172-31-7-245

The 172-31-xxx-xxx addresses are the private IPs 12 of the compute instances. The address range falls within your AWS VPC subnet. On EC2, hostname prints the private IP:

$ hostname  # private ip of master node
ip-172-31-7-214

To execute commands on compute nodes, use srun:

$ srun -N 2 -n 2 hostname  # private ip of compute nodes
ip-172-31-3-187
ip-172-31-7-245

The printed IP should match the output of sinfo.

You can go to a compute node with the standard Slurm command:

$ srun -N 1 -n 72 --pty bash  # Slurm thinks a c5n.18xlarge node has 72 cores due to hyperthreading
$ sinfo  # one node is fully allocated
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      1  alloc ip-172-31-3-187
compute*     up   infinite      1   idle ip-172-31-7-245

Or simply via ssh:

$ ssh ip-172-31-3-187
$ sinfo  # still idle
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      2   idle ip-172-31-3-187,ip-172-31-7-245

In this case, the scheduler is not aware of such activity.

The $HOME directory is exported to all nodes via NFS by default, so you can still see the same files from compute nodes. However, system directories like /usr are specific to each node. Software libraries should generally be installed to a shared disk, otherwise they will not be accessible from compute nodes.

4.2 Check system & hardware

A natural thing is to check CPU info with lscpu and file system structure with df -h. Do this on both master and compute nodes to see the differences.

A serious HPC user should also check the network interface:

$ ifconfig  # display network interface names and details
ens5: ...

lo: ...

Here, the ens5 section is the network interface for inter-node communication. Its driver should be ena:

$ ethtool -i ens5
driver: ena
version: 1.5.0K

This means that "Enhanced Networking" is enabled 13. This should be the default on most modern AMIs, so you shouldn't need to change anything.

4.3 Cluster management tricks

AWS ParallelCluster is able to auto-scale 14, meaning that new compute nodes will be launched automatically when there are pending jobs in Slurm's queue, and idle nodes will be terminated automatically.

While this generally works fine, such automatic updating takes a while and feels a bit black-boxy. A more straightforward and transparent way is to modify the Auto Scaling group directly in the console. Right-click on your Auto Scaling Group, and select "Edit":

/images/pcluster_components/edit_autoscaling.png
  • Modifying "Desired Capacity" will immediately cause the cluster to adjust to that size, either by requesting more nodes or by killing redundant ones.

  • Increase "Min" to match "Desired Capacity" if you want the compute nodes to keep running even when they are idle. Or keep "Min" at zero, so idle nodes will be killed after some time period (a few minutes, roughly matching the "Default Cooldown" setting of the Auto Scaling Group).

  • "Max" must be at least the same as "Desired Capacity". This is the hard-limit that the scheduler cannot violate.

After compute nodes are launched or killed, Slurm should be aware of such change in ~1 minute. Check it with sinfo.
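
If you prefer to script this instead of clicking in the console, the same change can be made with boto3 (a sketch; the Auto Scaling group name below is hypothetical -- look up the real one in the console or via describe_auto_scaling_groups()):

import boto3

asg = boto3.client('autoscaling', region_name='us-east-1')
asg.set_desired_capacity(
    AutoScalingGroupName='parallelcluster-your-cluster-name-ComputeFleet',  # hypothetical name
    DesiredCapacity=4,      # target number of compute nodes
    HonorCooldown=False,    # apply immediately
)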

To further change the type (not just the number) of the compute nodes, you can modify the config file, and run pcluster update your-cluster-name 15.

5 Install HPC software stack with Spack

While you can get pre-built MPI binaries with sudo yum install -y openmpi-devel on CentOS or sudo apt install -y libopenmpi-dev on Ubuntu, they are generally not the specific version you want. On the other hand, building custom versions of libraries from source is too laborious and error-prone 16. Spack achieves a great balance between the ease-of-use and customizability. It has an excellent documentation which I strongly recommend reading: https://spack.readthedocs.io/.

Here I provide the minimum required steps to build a production-ready HPC environment.

Getting Spack is super easy:

cd /shared  # install to shared disk
git clone https://github.com/spack/spack.git
git checkout 3f1c78128ed8ae96d2b76d0e144c38cbc1c625df  # The Spack v0.13.0 release on Oct 26, 2019 broke some previous commands. Freeze at a ~Sep 2019 commit.
echo 'export PATH=/shared/spack/bin:$PATH' >> ~/.bashrc  # to discover spack executable
source ~/.bashrc

At the time of writing, I am using:

$ spack --version
0.12.1

The first thing is to check which compilers are available. Most OSes already have a GNU compiler installed, and Spack can discover it:

$ spack compilers
==> Available compilers
-- gcc centos7-x86_64 -------------------------------------------
gcc@4.8.5

Note

If not installed, just sudo yum install gcc gcc-gfortran gcc-c++ on CentOS or sudo apt install gcc gfortran g++ on Ubuntu.

You might want to get a newer version of the compiler:

$ spack install gcc@8.2.0  # can take 30 min!
$ spack compiler add $(spack location -i gcc@8.2.0)
$ spack compilers
==> Available compilers
-- gcc centos7-x86_64 -------------------------------------------
gcc@8.2.0  gcc@4.8.5

Note

Spack builds software from source, which can take a while. To persist the build you can run it inside tmux sessions. If not installed, simply run sudo yum install tmux or sudo apt install tmux.

Note

Always use spack spec to check versions and dependencies before running spack install!

5.1 MPI libraries (OpenMPI with Slurm support)

Spack can install many MPI implementations, for example:

$ spack info mpich
$ spack info mvapich2
$ spack info openmpi

In this example I will use OpenMPI. It has super-informative documentation at https://www.open-mpi.org/faq/

5.1.1 Installing OpenMPI

In principle, the installation is as simple as:

$ spack install openmpi  # not what we will use here

Or a specific version:

$ spack install openmpi@3.1.3  # not what we will use here

However, we want OpenMPI to be built with Slurm 17, so the launch of MPI processes can be handled by Slurm's scheduler.

Because Slurm is pre-installed, you will add it as an external package to Spack 18.

$ which sinfo  # comes with AWS ParallelCluster
/opt/slurm/bin/sinfo
$ sinfo -V
slurm 16.05.3

Add the following section to ~/.spack/packages.yaml:

packages:
  slurm:
    paths:
      slurm@16.05.3: /opt/slurm/
    buildable: False

This step is extremely important. Without modifying packages.yaml, Spack will install Slurm for you, but the newly-installed Slurm is not configured with the AWS cluster.

Then install OpenMPI with:

$ spack install openmpi+pmi schedulers=slurm  # use this

After installation, locate its directory:

$ spack find -p openmpi

Modify $PATH to discover executables like mpicc:

$ export PATH=$(spack location -i openmpi)/bin:$PATH

Note

Spack removes the mpirun executable by default if built with Slurm, to encourage the use of srun for better process management 19. I need mpirun for illustration purposes in this guide, so recover it by ln -s orterun mpirun in the directory $(spack location -i openmpi)/bin/.

A serious HPC user should also check the available Byte Transfer Layer (BTL) in OpenMPI:

$ ompi_info --param btl all
  MCA btl: self (MCA v2.1.0, API v3.0.0, Component v3.1.3)
  MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v3.1.3)
  MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.3)
  ...

  • self, as its name suggests, is for a process to talk to itself 20.

  • tcp is the default inter-node communication mechanism on EC2 21. It is not ideal for HPC, but this should change with the coming EFA.

  • vader is a high-performance intra-node communication mechanism 22.

(Update in Dec 2019: There was a problem with OpenMPI + EFA, but it is now solved. See this issue.)

5.1.2 Using OpenMPI with Slurm

Let's use this boring but useful "MPI hello world" example:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char hostname[32];
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(hostname, 31);

    printf("I am %d of %d, on host %s\n", rank, size, hostname);

    MPI_Finalize();
    return 0;
}

Put it into a hello_mpi.c file and compile:

$ mpicc -o hello_mpi.x hello_mpi.c
$ mpirun -np 1 ./hello_mpi.x  # runs on master node
I am 0 of 1, on host ip-172-31-7-214

To run it on compute nodes, the classic MPI way is to specify the node list via --host or --hostfile (for OpenMPI; other MPI implementations have similar options):

$ mpirun -np 2 --host ip-172-31-5-150,ip-172-31-14-243 ./hello_mpi.x
I am 0 of 2, on host ip-172-31-5-150
I am 1 of 2, on host ip-172-31-14-243

The arguments following --host are the compute node IPs shown by sinfo.

A saner approach is to launch it via srun, which takes care of the placement of MPI processes:

$ srun -N 2 --ntasks-per-node 2 ./hello_mpi.x
I am 1 of 4, on host ip-172-31-5-150
I am 0 of 4, on host ip-172-31-5-150
I am 3 of 4, on host ip-172-31-14-243
I am 2 of 4, on host ip-172-31-14-243
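
For non-interactive runs, the same launch can be wrapped in a batch script and submitted with sbatch. A minimal sketch (the job name and output file are arbitrary):

#!/bin/bash
#SBATCH --job-name=hello_mpi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --output=hello_mpi.%j.out

srun ./hello_mpi.x  # srun inherits the node/task counts requested above

Save it as, say, hello_mpi.sbatch and submit it with sbatch hello_mpi.sbatch.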

5.2 HDF5 and NetCDF libraries

HDF5 and NetCDF are very common I/O libraries for HPC, widely used in Earth science and many other fields.

In principle, installing HDF5 is simply:

$ spack install hdf5  # not what we will use here

Many HPC codes (like WRF) need the full HDF5 suite (use spack info hdf5 to check all the variants):

$ spack install hdf5+fortran+hl  # not what we will use here

Further specify MPI dependencies:

$ spack install hdf5+fortran+hl ^openmpi+pmi schedulers=slurm  # use this

Similarly, for NetCDF C & Fortran, in principle it is simply:

$ spack install netcdf-fortran  # not what we will use here

To specify the full dependency chain, we end up with:

$ spack install netcdf-fortran ^hdf5+fortran+hl ^openmpi+pmi schedulers=slurm  # use this
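
As advised at the beginning of this section, it is worth running spack spec on the full chain before the actual install (a sketch):

$ spack spec netcdf-fortran ^hdf5+fortran+hl ^openmpi+pmi schedulers=slurm
# inspect the printed dependency tree: hdf5 should depend on the Slurm-enabled openmpi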

5.3 Further reading on advanced package management

For HPC development you generally need to test many combinations of libraries. To better organize multiple environments, check out:

5.4 Note on reusing software installation

For a single EC2 instance, it is easy to save the environment - create an AMI, or just build a Docker image. Things get quite cumbersome with a multi-node cluster environment. From the official docs, "Building a custom AMI is not the recommended approach for customizing AWS ParallelCluster." 23.

Fortunately, Spack installs everything into a single, non-root directory (similar to Anaconda), so you can simply tar up the entire directory and upload it to S3 or other persistent storage:

$ spack clean --all  # clean all kinds of caches
$ tar zcvf spack.tar.gz spack  # compression
$ aws s3 mb s3://[your-bucket-name]  # create a new bucket; might need to configure AWS credentials for permission
$ aws s3 cp spack.tar.gz s3://[your-bucket-name]/   # upload to S3 bucket

Also remember to save (and later recover) your custom settings in ~/.spack/packages.yaml, ~/.spack/linux/compilers.yaml and .bashrc.

Then you can safely delete the cluster. Next time, simply pull the tar-ball from S3 and decompress it; the environment will look exactly the same as last time. You should use the same base_os to minimize binary-compatibility errors.
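
The restore step on a freshly created cluster is roughly the reverse (a sketch; the bucket name and the /shared path are placeholders for your own setup):

$ cd /shared
$ aws s3 cp s3://[your-bucket-name]/spack.tar.gz .
$ tar zxvf spack.tar.gz  # recreates spack/ at its original path, e.g. /shared/spack
$ source spack/share/spack/setup-env.sh  # assuming you use Spack's shell integration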

A minor issue concerns dynamic linking. When re-creating the cluster environment, make sure that the spack/ directory is placed at exactly the same path where the packages were installed last time. For example, if it was at /shared/spack/, then the new location should also be exactly /shared/spack/.

The underlying reason is that Spack uses RPATH for library dependencies, to avoid messing around with $LD_LIBRARY_PATH 24. Simply put, it hard-codes the dependency paths into the binary. You can check the hard-coded paths by, for example:

$ readelf -d $(spack location -i openmpi)/bin/mpicc | grep RPATH

If the new shared EBS volume is mounted to a new location like /shared_new, a quick-and-dirty fix would be:

$ sudo ln -s /shared_new /shared/

5.5 Special note on Intel compilers

Although I'd like to stick with open-source software, sometimes there is a solid reason to use proprietary ones like the Intel compiler -- WRF being a well-known example that runs much faster with ifort than with gfortran 32. Note that Intel is generous enough to provide student licenses for free.

Although Spack can install the Intel compilers by itself, a more robust approach is to install them externally and add them as an external package 25. Intel has a dedicated guide for installation on EC2 26 so I won't repeat the steps here.

Once you have a working icc/ifort in $PATH, just running spack compiler add should discover the new compilers 27.
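
A quick way to confirm that the discovery worked (a sketch; the exact version string depends on your installation):

$ which icc ifort     # the Intel compilers must already be in $PATH
$ spack compiler add  # auto-detect them
$ spack compilers     # the new intel@<version> entry should now be listed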

Then, you should also add something like

extra_rpaths: ['/shared/intel/lib/intel64']

to ~/.spack/linux/compilers.yaml under the Intel compiler section. Otherwise you will see interesting linking errors when later building libraries with the Intel compiler 28.

After those are all set, simply add %intel to your spack install commands to build new libraries with the Intel compilers.
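
For example, to rebuild the HDF5 stack from Section 5.2 with the Intel compiler (a sketch that simply reuses the same variants as before):

$ spack install hdf5+fortran+hl %intel ^openmpi+pmi schedulers=slurm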

6 Build real applications -- example with WRF

With common libraries like MPI, HDF5, and NetCDF installed, compiling real applications shouldn't be difficult. Here I show how to build WRF, a household name in the Atmospheric Science community. We will hit a few small issues (as is likely for other HPC codes), but they are all easy to fix by just Googling the error messages.

Get the recently released WRF v4:

$ wget https://github.com/wrf-model/WRF/archive/v4.0.3.tar.gz
$ tar zxvf v4.0.3.tar.gz

Here I only provide the minimum steps to build the WRF model, without diving into the actual model usage. If you plan to use WRF for either research or operations, please carefully study:

6.1 Environment setup

Add these to your ~/.bashrc (adapted from the WRF compile tutorial):

# Let WRF discover necessary executables
export PATH=$(spack location -i gcc)/bin:$PATH  # only needed if you installed a new gcc
export PATH=$(spack location -i openmpi)/bin:$PATH
export PATH=$(spack location -i netcdf)/bin:$PATH
export PATH=$(spack location -i netcdf-fortran)/bin:$PATH

# Environment variables required by WRF
export HDF5=$(spack location -i hdf5)
export NETCDF=$(spack location -i netcdf-fortran)

# run-time linking
export LD_LIBRARY_PATH=$HDF5/lib:$NETCDF/lib:$LD_LIBRARY_PATH

# this prevents segmentation fault when running the model
ulimit -s unlimited

# WRF-specific settings
export WRF_EM_CORE=1
export WRFIO_NCD_NO_LARGE_FILE_SUPPORT=0

WRF also requires NetCDF-C and NetCDF-Fortran to be located in the same directory 29. A quick-and-dirty fix is to copy NetCDF-C libraries and headers to NetCDF-Fortran's directory:

$ NETCDF_C=$(spack location -i netcdf)
$ ln -sf $NETCDF_C/include/*  $NETCDF/include/
$ ln -sf $NETCDF_C/lib/*  $NETCDF/lib/

6.2 Compile WRF

$ cd WRF-4.0.3
$ ./configure
  • For the first question, select 34, which uses GNU compilers and pure MPI ("dmpar" -- Distributed Memory PARallelization).

  • For the second question, select 1, which uses basic nesting.

You should see this success message:

(omitting many lines...)
------------------------------------------------------------------------
Settings listed above are written to configure.wrf.
If you wish to change settings, please edit that file.
If you wish to change the default options, edit the file:
    arch/configure.defaults


Testing for NetCDF, C and Fortran compiler

This installation of NetCDF is 64-bit
                C compiler is 64-bit
        Fortran compiler is 64-bit
            It will build in 64-bit

*****************************************************************************
This build of WRF will use NETCDF4 with HDF5 compression
*****************************************************************************

To fix a minor issue regarding WRF + GNU + OpenMPI 30, modify the generated configure.wrf so that:

DM_CC = mpicc -DMPI2_SUPPORT

Then build the WRF executable for the commonly used em_real case:

$ ./compile em_real 2>&1 | tee wrf_compile.log

You might also use a bigger master node (or go to a compute node) and add something like -j 8 for a parallel build.

It should finally succeed:

(omitting many lines...)
==========================================================================
build started:   Mon Mar  4 01:32:52 UTC 2019
build completed: Mon Mar 4 01:41:36 UTC 2019

--->                  Executables successfully built                  <---

-rwxrwxr-x 1 centos centos 41979152 Mar  4 01:41 main/ndown.exe
-rwxrwxr-x 1 centos centos 41852072 Mar  4 01:41 main/real.exe
-rwxrwxr-x 1 centos centos 41381488 Mar  4 01:41 main/tc.exe
-rwxrwxr-x 1 centos centos 45549368 Mar  4 01:41 main/wrf.exe

==========================================================================

Now you have the WRF executables. This is a good first step, considering that so many people are stuck at simply getting the code compiled 31. Actually using WRF for research or operational purposes requires a lot more steps and domain expertise, which is way beyond this guide. You will also need to build the WRF Preprocessing System (WPS), obtain the geographical data and the boundary/initial conditions for your specific problem, choose the proper model parameters and numerical schemes, and interpret the model output in a scientific way.

In the future, you might be able to install WRF with one click using Spack 33. For WRF specifically, you might also be interested in EasyBuild for a one-click install. A fun fact is that Spack can also install EasyBuild (see spack info easybuild), despite their similar purposes.

That's the end of this guide, which I believe has covered the common patterns for cloud-HPC.

7 References

1

Cloud Computing in HPC Surges: https://www.top500.org/news/cloud-computing-in-hpc-surges/

2

See Quick MPI Cluster Setup on Amazon EC2: https://glennklockwood.blogspot.com/2013/04/quick-mpi-cluster-setup-on-amazon-ec2.html. It was written in 2013 but all steps still apply. The AWS console looks quite different now, but the concepts have not changed.

3

For example, fast.ai's tutorial on AWS EC2 https://course.fast.ai/start_aws.html, or Amazon's DL book https://d2l.ai/chapter_appendix-tools-for-deep-learning/aws.html.

4

See Introduction to High-Performance Computing at: https://hpc-carpentry.github.io/hpc-intro/. It only covers very simple cluster usage, not parallel programming.

5

See Placement Groups in AWS docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster

6

You might want to review "Regions and Availability Zones" in AWS docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html

7

See Amazon EBS Volume Types: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html. HDD is cheap and good enough. If I/O is a real problem then you should use FSx for Lustre.

8

The Configuration section in the docs: https://aws-parallelcluster.readthedocs.io/en/latest/configuration.html

9

See Disabling Intel Hyper-Threading Technology on Amazon Linux at: https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/

10

AWS Services used in AWS ParallelCluster: https://aws-parallelcluster.readthedocs.io/en/latest/aws_services.html.

11

See AutoScaling groups in AWS docs https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html

12

You might want to review the IP Addressing section in AWS docs: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-ip-addressing.html

13

See Enhanced Networking in AWS docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html. For a more technical discussion, see SR-IOV and Amazon's C3 Instances: https://glennklockwood.blogspot.com/2013/12/high-performance-virtualization-sr-iov.html

14

See AWS ParallelCluster Auto Scaling: https://aws-parallelcluster.readthedocs.io/en/latest/autoscaling.html

15

See my comment at https://github.com/aws/aws-parallelcluster/issues/307#issuecomment-462215214

16

For example, try building MPI-enabled NetCDF once, and you will never want to do it again: https://www.unidata.ucar.edu/software/netcdf/docs/getting_and_building_netcdf.html

17

See Running jobs under Slurm in OpenMPI docs: https://www.open-mpi.org/faq/?category=slurm

18

https://github.com/spack/spack/pull/8427#issuecomment-395770378

19

See the discussion in my PR: https://github.com/spack/spack/pull/10340

20

See "3. How do I specify use of sm for MPI messages?" in OpenMPI docs: https://www.open-mpi.org/faq/?category=sm#sm-btl

21

See Tuning the run-time characteristics of MPI TCP communications in OpenMPI docs: https://www.open-mpi.org/faq/?category=tcp

22

See "What is the vader BTL?" in OpenMPI docs: https://www.open-mpi.org/faq/?category=sm#what-is-vader

23

See Building a custom AWS ParallelCluster AMI at: https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/02_ami_customization.html

24

A somewhat relevant discussion is the "Transitive Dependencies" section of the Spack docs: https://spack.readthedocs.io/en/latest/workflows.html#transitive-dependencies

25

See Integration of Intel tools installed external to Spack: https://spack.readthedocs.io/en/latest/build_systems/intelpackage.html#integration-of-intel-tools-installed-external-to-spack

26

See Install Intel® Parallel Studio XE on Amazon Web Services (AWS) https://software.intel.com/en-us/articles/install-intel-parallel-studio-xe-on-amazon-web-services-aws

27

See Integrating external compilers in Spack docs: https://spack.readthedocs.io/en/latest/build_systems/intelpackage.html?highlight=intel#integrating-external-compilers

28

See this comment at: https://github.com/spack/spack/issues/8315#issuecomment-393160339

29

Related discussions are at https://github.com/spack/spack/issues/8816 and https://github.com/wrf-model/WRF/issues/794

30

http://forum.wrfforum.com/viewtopic.php?f=5&t=3660

31

Just Google "WRF compile error"

32

Here's a modern WRF benchmark conducted in 2018: https://akirakyle.com/WRF_benchmarks/results.html

33

Until this PR gets merged: https://github.com/spack/spack/pull/9851

[Notebook] Testing Bokeh Interactive Plot on Nikola Sites

This post shows how to add interactive data visualizations to Nikola-powered websites. Everything will be done in Jupyter notebooks! I use Bokeh as an example, but similar approaches should work with other JavaScript-based tools like Plotly. For the impatient, scroll to the end of this page to see the results.

For deploying a website with Nikola, see my previous post.

In [1]:
!nikola --version
Nikola v8.0.1
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

Some boring stuff to test basic notebook display...

Non-executing code snippet:

print('Hello')

Other languages:

for (i=0; i<n; i++) {
  printf("hello %d\n", i);
}

Latex: $$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

Table:

This is
a table
In [3]:
# test matplotlib rendering
plt.plot(np.sin(np.linspace(-5, 5)));

Static plots work fine but are a bit boring. Let's create some interactivity.

Interactive visualization with HvPlot and Bokeh

HvPlot provides a high-level plotting API built on top of the well-known Bokeh library. Its API is very similar to pandas.DataFrame.plot(), so you don't need to learn much Bokeh/HoloViews-specific syntax.

It can be installed easily with pip install hvplot, which pulls in all major dependencies.

In [4]:
from IPython.display import HTML

import bokeh
from bokeh.resources import CDN, INLINE
from bokeh.embed import file_html
import holoviews as hv

import pandas as pd
import hvplot.pandas

bokeh.__version__, hv.__version__
Out[4]:
('1.0.4', '1.11.2')

Creating a demo plot

Taken from the HvPlot tutorial.

In [5]:
# just some fake data
index = pd.date_range('1/1/2000', periods=1000)
np.random.seed(42)
df = pd.DataFrame(np.random.randn(1000, 4), index=index, columns=list('ABCD')).cumsum()
df.head()
Out[5]:
A B C D
2000-01-01 0.496714 -0.138264 0.647689 1.523030
2000-01-02 0.262561 -0.372401 2.226901 2.290465
2000-01-03 -0.206914 0.170159 1.763484 1.824735
2000-01-04 0.035049 -1.743121 0.038566 1.262447
2000-01-05 -0.977782 -1.428874 -0.869458 -0.149856
In [6]:
plot = df.hvplot()
plot  # The plot shows up in the notebook, but not on the Nikola-generated website.
Out[6]:

By default, the figure is not shown on the Nikola-generated website (i.e. the page you are looking at right now). A bit more code is needed to display the figure.

Convert to HTML and display

We just need to convert the figure to raw HTML text and use IPython.display.HTML() to show it:

In [7]:
html_cdn = file_html(hv.render(plot), CDN)  # explained later
HTML(html_cdn)
Out[7]:
Bokeh Application

html_cdn is simply a Python string containing HTML:

In [8]:
type(html_cdn)
Out[8]:
str

It can be generated from a Bokeh figure by bokeh.embed.file_html(). Simply calling bokeh.embed.file_html(plot) will lead to an error because our plot object is not a Bokeh figure yet:

In [9]:
type(plot)
Out[9]:
holoviews.core.overlay.NdOverlay

You need to use hv.render() to convert the HoloViews object to a Bokeh figure.

In [10]:
type(hv.render(plot))
Out[10]:
bokeh.plotting.figure.Figure

So the full command becomes HTML(file_html(hv.render(plot), CDN)). The CDN option means that part of the HTML/CSS/JS code will be retrieved from a Content Delivery Network, not stored locally. You may also use INLINE to store all code locally. This allows you to view the HTML file without an internet connection but increases the file size:

In [11]:
html_inline = file_html(hv.render(plot), INLINE)
len(html_inline), len(html_cdn)  # The inline one is 5x bigger
Out[11]:
(933634, 203309)

Plot on Maps!

GeoViews is an excellent library for (interactive) geospatial visualizations. It depends on Cartopy, which is best installed with Conda:

conda install -c conda-forge cartopy  # caution: comes with heavy dependencies!
pip install geoviews
In [12]:
import geoviews as gv
gv.extension('bokeh')
gv.__version__
Out[12]:
'1.6.2'
In [13]:
tile = gv.tile_sources.Wikipedia
tile  # The background map. It can be zoomed in to great detail.
Out[13]:
In [14]:
# Again, a bit more code is needed to show the figure on the web
HTML(file_html(hv.render(tile), CDN))
Out[14]:
Bokeh Application
In [15]:
# Overlay some data points
from bokeh.sampledata.airport_routes import airports
airports_filtered = airports[(airports['City'] == 'New York') & (airports['TZ'] == 'America/New_York')]
airports_filtered
Out[15]:
AirportID Name City Country IATA ICAO Latitude Longitude Altitude Timezone DST TZ Type source
282 3697 La Guardia Airport New York United States LGA KLGA 40.777199 -73.872597 21 -5 A America/New_York airport OurAirports
379 3797 John F Kennedy International Airport New York United States JFK KJFK 40.639801 -73.778900 13 -5 A America/New_York airport OurAirports
999 8123 One Police Plaza Heliport New York United States \N NK39 40.712601 -73.999603 244 -5 A America/New_York airport OurAirports
In [16]:
plot_geo = tile * airports_filtered.hvplot.points('Longitude', 'Latitude', geo=True,
                                                  color='red', size=150, width=750, height=500)

HTML(file_html(hv.render(plot_geo), CDN))  # airport locations around NYC
Out[16]:
Bokeh Application

Personal Website with Jupyter Support using Nikola and GitHub Page

I have decided to build my new website and occasionally post blogs on research, software, and computing.

For the blogging tool, basic requirements are:

  • It must be a static site that can be easily version-controlled on GitHub. So, not WordPress.

  • I'd like the framework to be written in Python. So, not Jekyll.

  • It should support Jupyter notebook format, so I can easily post computing & data analysis results.

The options come down to Pelican (the most popular one), Tinkerer (Sphinx extended for blogging), and Nikola. I finally settled on Nikola because it is super intuitive, has clear and user-friendly documentation, and supports Jupyter natively. It took me about 10 minutes to set up the desired website layout. The other tools are also very powerful, but it took me longer to get their configuration right.

This post records my setup steps, to help others and the forgetful future me. Building such a website only requires basic knowledge of:

  • Git and GitHub

  • Python

  • Markdown (works fine) or reStructuredText (preferred)

I assume no knowledge of web front-end stuff (HTML/CSS/JS). This should be typical for many computational & data science students.

2 Installation and initialization

The official getting-started guide explains this well. I just summarize the steps here.

I come from the scientific side of Python, so I use Conda to create a new environment:

conda create -n blog python=3.6
source activate blog
pip install Nikola[extras]

Making the first demo site is as simple as

  • Run nikola init in a new folder, with default settings.

  • Then run nikola auto to build the page and start a local server for testing.

  • Visit localhost:8000 in the web browser.

Adding a new blog post is as simple as:

  • Run nikola new_post (defaults to reStructuredText format, add -f markdown for *.md format)

  • Write blog contents in the newly generated file (*.rst or *.md) in the posts/ directory.

  • Rerun nikola auto. You can also keep this command running so that new content is automatically rebuilt.

All configurations are managed by a single conf.py file, which is extremely neat.

Before adding any real content, you should first configure GitHub deployment, layout, themes, etc.

3 GitHub deployment and version control

There is an official deployment guide but I feel that more explanation is necessary.

Most posts mention deployment as the final step, but I think it is good to set it up at the very beginning.

We will use GitHub pages to host the website freely. Any GitHub repo can have its own GitHub page, but we will use the special one called user page that only works for a repo named [username].github.io, where [username] is your GitHub account name (mine is jiaweizhuang). The website URL will be https://[username].github.io/.

You will need to:

  1. Create a new GitHub repo named [username].github.io.

  2. Initialize a Git repo (git init) in your Nikola project directory.

  3. Link to your remote GitHub repo (git remote add origin https://github.com/[username]/[username].github.io.git)

So far, these are all standard Git practices. The non-standard part is that you shouldn't manually commit anything to the master branch: it is used for storing the HTML files displayed on the web and should be handled automatically by the command nikola github_deploy. To version-control your source files (conf.py, *.rst, *.md), you should create a new branch called src:

git checkout -b src

My .gitignore is:

cache
.doit.db*
__pycache__
output
.ipynb_checkpoints
.DS_Store

You can manually commit to this src branch and push to GitHub.

In conf.py, double-check that branch names are correct:

GITHUB_SOURCE_BRANCH = 'src'
GITHUB_DEPLOY_BRANCH = 'master'
GITHUB_REMOTE_NAME = 'origin'

I also recommend setting:

GITHUB_COMMIT_SOURCE = False

so that the nikola github_deploy command below won't touch your src branch.

To deploy the content on master, run:

nikola github_deploy

This builds the HTML files, commits to the master branch, and pushes to GitHub. The actual website https://[username].github.io/ will be updated in a few seconds.

You end up having:

  • A well version-controlled src branch, with only source files. You can add meaningful commit messages like for other code projects.

  • An automatically generated master branch, with messy HTML files that you never need to look at directly. It doesn't have meaningful commit messages, and the commit history is kind of a mess (diffs between HTML files).

Note

Remember that nikola github_deploy will use all the files in the current directory, not the most recent commit in the src branch! The master and src branches are not necessarily synchronized if you set GITHUB_COMMIT_SOURCE = False.

For all the later tweaks, you can incrementally update the GitHub repo and the website by manually pushing to src and using nikola github_deploy to publish to master.
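
A typical update cycle therefore looks like this (a sketch; the post file name is a placeholder):

git checkout src                       # work on the source branch
git add posts/my-new-post.rst conf.py
git commit -m "Add a new post"
git push origin src                    # version-control the sources
nikola github_deploy                   # build the HTML and publish it to master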

4 Change theme

The default theme looks more like a blog than a personal website. Twitter's Bootstrap is an excellent theme and is built into Nikola. In conf.py, set:

THEME = "bootstrap4"

The theme can be further tweaked by Bootswatch but I find the default theme perfect for me :)

5 Set non-blog layout

The official non-blog site guide explains this well. I just summarize the steps here.

Nikola defines two types of content:

  • "Posts" generated by nikola new_post. It is just the blog post and will be automatically added to the main web page whenever a new post is created.

  • "Pages" generated by nikola new_page. It is a standalone page that will not be automatically added to the main site. This is the building block for a non-blog site.

In conf.py, bring Pages to root level:

POSTS = (
    ("posts/*.rst", "blog", "post.tmpl"),
    ("posts/*.md", "blog", "post.tmpl"),
    ("posts/*.txt", "blog", "post.tmpl"),
    ("posts/*.html", "blog", "post.tmpl"),
)
PAGES = (
    ("pages/*.rst", "", "page.tmpl"),  # notice the second argument
    ("pages/*.md", "", "page.tmpl"),
    ("pages/*.txt", "", "page.tmpl"),
    ("pages/*.html", "", "page.tmpl"),
)

INDEX_PATH = "blog"

Generate your new index page (the entry point of your website):

$ nikola new_page
Creating New Page
-----------------

Title: index

To add more pages to the top navigation bar:

$ nikola new_page
Creating New Page
-----------------

Title: Bio

And then add it to conf.py:

NAVIGATION_LINKS = {
    DEFAULT_LANG: (
        ("/index.html", "Home"),
        ("/bio/index.html", "Bio"),
        ...
    ),
}

6 Enable Jupyter notebook format

Just add *.ipynb as a recognized format:

POSTS = (
    ("posts/*.rst", "blog", "post.tmpl"),
    ("posts/*.md", "blog", "post.tmpl"),
    ("posts/*.txt", "blog", "post.tmpl"),
    ("posts/*.html", "blog", "post.tmpl"),
    ("posts/*.ipynb", "blog", "post.tmpl"), # new line
)
PAGES = (
    ("pages/*.rst", "", "page.tmpl"),
    ("pages/*.md", "", "page.tmpl"),
    ("pages/*.txt", "", "page.tmpl"),
    ("pages/*.html", "", "page.tmpl"),
    ("pages/*.ipynb", "", "page.tmpl"), # new line
)

With the current version (v8), that's all you need to do!

Create a new blog post in notebook format:

$ nikola new_post -f ipynb

7 Add social media button

You might want to add buttons for other sites like GitHub and Twitter, or any icons from Font Awesome.

Adapted from this post by Jaakko Luttinen, a minimal example (with only one GitHub button) is:

EXTRA_HEAD_DATA = '<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/latest/css/font-awesome.min.css">'

CONTENT_FOOTER = '''
<div class="text-center">
<p>
<span class="fa-stack fa-2x">
<a href="https://github.com/[username]">
    <i class="fa fa-circle fa-stack-2x"></i>
    <i class="fa fa-github fa-inverse fa-stack-1x"></i>
</a>
</span>
</p>
</div>
'''

8 Various tweaks

8.1 Add table of contents

For *.rst posts, simply add:

.. contents::

Or optionally with numbering:

.. contents::
.. section-numbering::

8.2 Tweak Archive format

To avoid grouping posts by year:

CREATE_SINGLE_ARCHIVE = True

The creation time of each blog post is displayed down to the minute by default. Showing only the date seems enough:

DATE_FORMAT = 'YYYY-MM-dd'

8.3 Enable comment system

Because static sites do not have databases, you need to use a third-party comment system, as documented in the official docs. The steps are:

  1. Sign up for an account on https://disqus.com/.

  2. On Disqus, select "Create a new site" (or visit https://disqus.com/admin/create/).

  3. During configuration, take note of the "Shortname" you choose. The other settings are not very important.

  4. At "Select a plan", choosing the basic free plan is enough.

  5. At "Select Platform", just skip the instructions. No need to insert the "Universal Code" manually, as it is built into Nikola. Keep all default and finish the configuration.

In conf.py, add your Disqus shortname:

COMMENT_SYSTEM = "disqus"
COMMENT_SYSTEM_ID = "[disqus-shortname]"

Deploy to GitHub and the comment system should be enabled.