An Update on EasyBuild at JSC

Damian Alvarez (JSC)

JSC's environment

Systems

  • JURECA
  • JURECA Booster
  • JUAMS
  • JUROPA3
  • JUWELS (soon)

JURECA system characteristics

  • 1.8 + 0.44 PFlops
  • 1872 compute nodes (Haswell)
  • 75 compute nodes, each with 2 NVIDIA K80 GPUs
  • 12 visualization nodes, each with 2 NVIDIA K40 GPUs
  • Mellanox EDR InfiniBand with fat tree topology

  • ~950 users and ~180 projects

  • Any guess on user requirements?

User requirements

"I want it all, I want it all, I want it all, and I want it now"♪♬♫♪♬♫
  • Intel compilers

  • GNU compilers

  • PGI compilers

  • ParaStationMPI

  • Intel MPI

  • CUDA

  • CUDA-aware MPI

  • And of course:

    • Tons of libraries

    • Compatibility

JURECA Booster system characteristics

  • ~5 PFlops, #29 in Top500 (Nov'17) - Together with JURECA
  • 1640 compute nodes (KNL)
  • Intel OmniPath with fat tree topology

  • Needs to play nicely with JURECA...

Extra requirements

"I want more"♪♬♫♪♬♫
  • Jobs can run on the cluster and on the booster simultaneously

  • Cross-compiling when possible

  • Booster offers a subset of the SW available on the cluster

JUWELS

Soon...

  • ~12 PFlops
  • 2550 compute nodes (Skylake)
  • Mellanox EDR with fat tree topology
  • 48 compute nodes, each with 4 NVIDIA Volta GPUs

  • Different set of users, potentially different set of SW

User View

Module Naming Scheme

  • Hierarchy of compilers and MPI runtimes
    • Modules available are shown in a staged fashion
    • Intuitive
    • Visible software is compatible

    • This is our default Module Naming Scheme
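The staged visibility described above can be sketched as a function that returns the module directories a user sees after each load. This is only an illustration: the root path `/opt/modules` and the `Core`/`Compiler`/`MPI` directory names are assumptions, not JSC's actual layout, and the ParaStationMPI version in the usage note is a placeholder.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Tuple

# Hypothetical root of the module tree; the real layout at JSC may differ.
MODULE_ROOT = PurePosixPath("/opt/modules")


def visible_module_dirs(
    compiler: Optional[Tuple[str, str]] = None,
    mpi: Optional[Tuple[str, str]] = None,
) -> List[str]:
    """Return the module directories visible for the loaded compiler and MPI."""
    # Core modules (compilers, binary tools) are always visible.
    dirs = [MODULE_ROOT / "Core"]
    if compiler is None:
        if mpi is not None:
            raise ValueError("an MPI runtime can only be loaded on top of a compiler")
        return [str(d) for d in dirs]

    # Loading a compiler exposes the packages built with that compiler.
    comp_name, comp_ver = compiler
    dirs.append(MODULE_ROOT / "Compiler" / comp_name / comp_ver)

    # Loading an MPI runtime exposes packages built with compiler + MPI.
    if mpi is not None:
        mpi_name, mpi_ver = mpi
        dirs.append(MODULE_ROOT / "MPI" / comp_name / comp_ver / mpi_name / mpi_ver)
    return [str(d) for d in dirs]
```

Calling `visible_module_dirs(("GCC", "7.2.0"), ("ParaStationMPI", "5.2.0"))` returns three directories, one per level of the hierarchy, which is why everything a user can see is compatible with what is already loaded.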

Lmod as modules tool

  • Lmod was designed with module hierarchies in mind
    • module spider and module key
    • Module families (family("compiler") or family("mpi"))

  • Lmod also has other interesting features
    • Good support for hidden modules (--show-hidden)
    • Cache
    • Properties
    • Hooks
    • ...

------------------------------ Core packages ------------------------------
Advisor/2018        PyOpenGL/3.1.1a1-Python-2.7.14
AllineaForge/7.1    PyQt/4.12.1-Python-2.7.14
[...]

-------------------------------- Compilers --------------------------------
GCC/5.4.0        Intel/2017.5.239-GCC-5.4.0        Intel/2018.1.163-GCC-5.4.0
GCC/7.2.0 (D)    Intel/2018.0.128-GCC-5.4.0 (D)    PGI/17.9-GCC-5.4.0

-------------------------- Recommended defaults ---------------------------
defaults/CPU

------------------------------ Architectures ------------------------------
Architecture/Haswell (S)    Architecture/KNL (S,D)

------------------------------ Other Stages -------------------------------
Stages/Devel (S)          Stages/2015b (S)    Stages/2017a (S)
Stages/Devel-2017a (S)    Stages/2016a (S)    Stages/2017b (S,D)
Stages/Devel-2017b (S)    Stages/2016b (S)

--------------------------- Development modules ---------------------------
Developers/InstallSoftware          Developers/InstallSoftware-2017 (D)
Developers/InstallSoftware-2016a

Current State in JURECA

Toolchain hierarchy

[Figure: toolchain tree]

User View and Hidden Modules

  • Initial user view:
    • Compilers (GCC, Intel, PGI)
    • Binary tools (VTune, Advisor, TotalView, ...)
    • Packages built with GCCcore (it is preloaded, together with binutils)

  • After loading a compiler:
    • MPI runtimes (ParaStationMPI, MVAPICH2, IntelMPI)
    • Packages built with the chosen compiler

  • After loading an MPI runtime:
    • Packages built with the chosen compiler and MPI runtime

  • If a different compiler or MPI runtime is loaded on top of the current one
    • Lmod swaps branches and activates/deactivates modules accordingly (using Lmod's families)
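The swap behaviour can be illustrated with a toy model of one-module-per-family semantics. This is not Lmod's code, only a sketch of the rule that a newly loaded compiler or MPI runtime replaces the loaded one and that modules without a matching build in the new branch are deactivated. The availability table is an assumption made up for the example.

```python
from typing import Dict, Optional, Set

# Hypothetical availability table: which MPI runtimes exist under each compiler.
MPI_BY_COMPILER: Dict[str, Set[str]] = {
    "GCC": {"ParaStationMPI", "MVAPICH2"},
    "Intel": {"ParaStationMPI", "IntelMPI"},
}


class Session:
    """Toy model of a user session with a 'compiler' and an 'mpi' family."""

    def __init__(self) -> None:
        self.compiler: Optional[str] = None
        self.mpi: Optional[str] = None
        self.inactive: Set[str] = set()

    def load_compiler(self, name: str) -> None:
        # A second member of the "compiler" family replaces the first.
        self.compiler = name
        if self.mpi is not None and self.mpi not in MPI_BY_COMPILER[name]:
            # No matching build in the new branch: deactivate the MPI module.
            self.inactive.add(self.mpi)
            self.mpi = None

    def load_mpi(self, name: str) -> None:
        if self.compiler is None:
            raise RuntimeError("load a compiler first")
        if name not in MPI_BY_COMPILER[self.compiler]:
            raise KeyError(f"{name} is not available for {self.compiler}")
        # Same rule for the "mpi" family: the new runtime replaces the old one.
        self.mpi = name
        self.inactive.discard(name)
```

In this model, loading `Intel`, then `IntelMPI`, then `GCC` leaves `GCC` as the only active module and `IntelMPI` marked inactive, while a runtime that exists in both branches simply stays loaded.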

Current State in JURECA's Booster

Toolchain hierarchy

[Figure: toolchain tree]

User View and Hidden Modules

Not all packages available for a given combination are visible!

  • There are more than 200 hidden packages in total!
  • Close to 400 in stages with duplicated GCCcore

Bundling Extensions

Python, R and Perl have "extensions"

  • 1 module per extension is totally excessive
  • → Bundles
  • Python
    • Python (80 extensions)
    • SciPy-Stack (19 extensions)
    • ...
  • R (504 extensions)
  • Perl (227 extensions)
  • X11 (72 extensions)
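As an illustration of the bundle idea, an easyconfig can list many extensions under a single module. The fragment below is a sketch in easyconfig (Python) syntax, not JSC's actual file: the bundle version, toolchain and extension versions are placeholders, and a real easyconfig would also need source URLs and checksums.

```python
# Sketch of a bundle easyconfig: one module, many Python extensions.
easyblock = 'Bundle'

name = 'SciPy-Stack'
version = '2017b'                      # placeholder version
versionsuffix = '-Python-%(pyver)s'    # template resolved by EasyBuild

homepage = 'https://www.scipy.org'
description = "Bundle of scientific Python packages installed as one module"

toolchain = {'name': 'intel', 'version': '2017b'}  # placeholder toolchain

dependencies = [('Python', '2.7.14')]

# Every entry in exts_list is installed as a Python package.
exts_defaultclass = 'PythonPackage'

# Illustrative entries only; the real bundle lists 19 extensions.
exts_list = [
    ('numpy', '1.13.1'),
    ('scipy', '0.19.1'),
]

moduleclass = 'lang'
```

Users then load a single `SciPy-Stack` module instead of one module per package, which keeps the module listing short.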

Enabling the KNL stack

  • Load a specific architecture module
    • Swaps module trees
    • Swaps Lmod cache files

  • Careful with GCCcore modules
    • Don't compile them with AVX512

  • Cluster login - both stacks
  • Cluster compute - Haswell stack
  • Booster - KNL stack
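The mapping above, together with the rule that the module tree and the Lmod cache are swapped as a pair, can be sketched as a small lookup. The installation root and directory names are hypothetical; the slides do not give the real paths.

```python
from typing import Dict, List

# Stacks offered per node type, as listed above.
STACKS_BY_NODE: Dict[str, List[str]] = {
    "cluster-login": ["Haswell", "KNL"],
    "cluster-compute": ["Haswell"],
    "booster": ["KNL"],
}

# Hypothetical installation root; the real paths are not given in the slides.
SOFTWARE_ROOT = "/opt/software"


def stack_settings(node_type: str, architecture: str) -> Dict[str, str]:
    """Return the module tree and Lmod cache directory for one architecture."""
    if architecture not in STACKS_BY_NODE[node_type]:
        raise ValueError(f"{architecture} stack is not offered on {node_type} nodes")
    base = f"{SOFTWARE_ROOT}/{architecture}"
    # Both values change together, so the cache always matches the tree.
    return {"module_tree": f"{base}/modules", "lmod_cache": f"{base}/cache"}
```

Keeping the two settings in one return value reflects the point of the architecture module: switching the tree without switching the cache would make Lmod report modules from the wrong stack.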

Upgrading and Retiring Software

Stage concept:

  • Software deployment area for a given timeframe
  • Simply a directory
  • Default stage upgraded every 6 months
  • There is a development stage to test software
  • Tested software is added to our Golden repository (and deployed to production)
    • Overlay for Booster

  • Close to seamless transitions between stages during maintenance
  • Development and old stages are available but not visible by default
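Since a stage is simply a directory, choosing the default stage can be sketched as picking the newest production stage by name. The naming pattern (a year plus `a` or `b`, and a `Devel` prefix for development stages) is taken from the stage modules shown earlier; the selection rule itself is an assumption for illustration.

```python
from typing import List


def production_stages(stages: List[str]) -> List[str]:
    """Return production stages (named like '2017b') in chronological order."""
    # Names of the form <year><a|b> sort chronologically as plain strings.
    return sorted(s for s in stages if not s.startswith("Devel"))


def default_stage(stages: List[str]) -> str:
    """Pick the newest production stage as the default."""
    return production_stages(stages)[-1]


STAGES = [
    "Devel", "Devel-2017a", "Devel-2017b",
    "2015b", "2016a", "2016b", "2017a", "2017b",
]
print(default_stage(STAGES))  # prints 2017b
```

Older and development stages remain in the list and can still be selected explicitly; they are just never chosen as the default.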

Divergence from Upstream EasyBuild

  • Divergence motivated by
    • Use of latest versions available at deployment time
    • Re-positioning of packages in the toolchain hierarchy

  • Most differences are minimal
    • Different versions of software
    • Different versions of dependencies
    • Different toolchains



https://github.com/easybuilders/JSC

Future Work/What to change?

EasyBuild changes

  • Automatic upgrades
    • Of dependency versions
    • Of software versions
  • "Fat" easyconfigs
  • GCCcore specific optarch
  • GCCcore specific installation directory
    • To share between different systems
  • Use CI
  • Framework around EB to manage and deploy SW in multiple systems

Software management changes

  • Tracking module usage
    • Lmod hooks
  • EB hooks to update Lmod's cache
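One possible shape for the cache update is an EasyBuild hooks module, passed to `eb` via `--hooks`, whose `end_hook` calls Lmod's `update_lmod_system_cache_files` script once a session finishes. This is a sketch under assumptions: the cache directory and timestamp file are hypothetical, the script is assumed to be on `PATH`, and the `-d`/`-t` options follow Lmod's documented usage as far as I know.

```python
import os
import subprocess
from typing import List

# Hypothetical locations; adjust to the site's Lmod cache configuration.
CACHE_DIR = "/opt/lmod/cache"
TIMESTAMP_FILE = "/opt/lmod/cache/timestamp"


def cache_update_command(modulepath: str) -> List[str]:
    """Build the call to Lmod's cache update script for a MODULEPATH."""
    return [
        "update_lmod_system_cache_files",
        "-d", CACHE_DIR,
        "-t", TIMESTAMP_FILE,
        modulepath,
    ]


def end_hook(*args, **kwargs):
    """Run by EasyBuild at the end of a session when loaded via --hooks."""
    modulepath = os.environ.get("MODULEPATH", "")
    if modulepath:
        # Refresh the cache so newly installed modules show up immediately.
        subprocess.run(cache_update_command(modulepath), check=True)
```

Building the command in a separate function keeps the hook easy to test without touching the real cache.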

Questions?