DC3

Roman Nuterman

Personal web-page

Danish Center for Climate Computing (DC3)

 

HPC Computing & Storage

 

The ÆGIR (AEGIR) cluster is equipped with 656 compute cores (17 nodes with 16 cores each and 12 nodes with 32 cores each), 2.6 TB of RAM, and a high-speed InfiniBand internal network.

 

Hardware

  • 16-core nodes: 2 CPUs per node, Intel Xeon E5-2667 v3 3.2 GHz (8 cores per CPU)
    • RAM per node: 64 GB DDR4
  • 32-core nodes: 2 CPUs per node, Intel Xeon E5-2683 v4 2.1 GHz (16 cores per CPU)
    • RAM per node: 128 GB DDR4
  • SSD per node: 120 GB
  • Interconnect: Mellanox QDR InfiniBand
  • Tape system: 2 PB

 

Account request

 

Connecting to DC3

 

In order to log in to the DC3 computational system, you must use the SSH protocol. This is provided by the "ssh" command on Unix-like systems (including Mac OS X) or by an SSH-compatible application (e.g. PuTTY on Microsoft Windows). We recommend that you "forward" X11 connections when initiating an SSH session to DC3. For example, when using the ssh command on Unix-based systems, provide the "-Y" option:

 

ssh -Y jojo@fend01.hpc.ku.dk

 

To download or upload data from or to DC3, use the following command:

 

scp -pr user@host1:from_path_file1 user@host2:to_path_file2

 

For more information, use the man/info commands (man scp).
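
For example, to copy a local directory recursively to your home directory on DC3 (the username and paths below are only illustrative):

scp -pr ./my_input_data jojo@fend01.hpc.ku.dk:~/my_input_data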

 

Software

 

DC3 supplies a rich set of HPC utilities, applications, and programming libraries. If there is something missing that you want, send an email to nuterman@nbi.ku.dk with your request, and we will evaluate it for appropriateness, cost, effort, and benefit to the community.

 

More information about the available software and how to use it is included in the next section.

 

Computing Environment

 

When you log in to the DC3 computer, you are placed in your global $HOME directory. You land in the same place no matter which login node you connect to (fend01.hpc.ku.dk, fend05.hpc.ku.dk): their home directories are all the same. This means that if you have files or binary executables that are specific to a certain system, you need to manage their location. Many people make a subdirectory for each system in their home directory.
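
For example, a per-system layout for binaries built for the ÆGIR nodes might look like this (the directory names are just a suggestion):

mkdir -p $HOME/aegir/bin           # binaries built for the aegir nodes
export PATH=$HOME/aegir/bin:$PATH  # pick them up in this shell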

 

Customizing Your Environment

The way you interact with the DC3 computer can be controlled via certain startup scripts that run when you log in and at other times. You can customize some of these scripts, which are called "dot files," by setting environment variables and aliases in them.

There are several "standard" dot files, such as .bash_profile, .bashrc, .cshrc, .kshrc, .login, .profile, .tcshrc, and .zprofile. Which of these you modify depends on your choice of shell, although note that DC3 recommends bash.

The table below contains examples of basic customizations. Note that when making changes such as these it's always a good idea to have two terminal sessions active on the machine so that you can back out changes if needed!

 

Customizing Your Dot Files

bash                              csh
export ENVAR=value                setenv ENVAR value
export PATH=$PATH:/new/path       set PATH = ( $PATH /new/path)
alias ll='ls -lrt'                alias ll "ls -lrt"
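
For bash users, the customizations from the table might end up in ~/.bashrc like this (the values below are only placeholders):

# ~/.bashrc
export EDITOR=vim                  # example environment variable
export PATH=$PATH:$HOME/aegir/bin  # add your own bin directory to the search path
alias ll='ls -lrt'                 # long listing, sorted by modification time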

Modules

Easy access to software is controlled by the module utility. With modules, you can easily manipulate your computing environment to use applications and programming libraries. In order to have access to the software, you must execute the following command the first time you log in:

 

/groups/ocean/software/Modules/3.2.10/bin/add.modules

 

If you want to change your software environment, you "load," "unload," and "swap" modules. A small set of module commands can do most of what you'll want to do.

 

module list

 

The first command of interest is "module list", which will show you your currently loaded modules. When you first log in, you have a number of modules loaded for you.

 

module avail

 

Let's say you want to use a different compiler. The "module avail" command will list all the available modules. You can use the module's name stem to do a useful search.
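
For example, to search for the Intel compiler modules by their name stem:

module avail intel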

 

module swap

 

Let's say you want to use the Intel compilers instead of GCC. Here's how to make the change:

 

module swap gcc/5.4.0 intel/17.1.0

 

Now you are using the Intel compilers (C, C++, Fortran) version 17.1.0. Note that the module utility doesn't give you any feedback about whether the swap command did what you wanted it to do, so always double-check your environment using the "module list" command.
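
For example, to make the swap and immediately verify the result:

module swap gcc/5.4.0 intel/17.1.0
module list   # confirm that intel/17.1.0 is now loaded instead of gcc/5.4.0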

 

module load

 

There is plenty of software that is not loaded by default. You can use the "module avail" command to see what modules are available.

 

For example, suppose you want to use the LAPACK linear algebra library. Try "module avail lapack".

 

The default version is 3.7.0, but say you'd rather use some features available only in version 3.6.0. In that case, just load that module.
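
Assuming the modules follow the lapack/<version> naming suggested by "module avail lapack", loading the older version would look like:

module load lapack/3.6.0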

 

If you want to use the default version, you can type either "module load lapack" or "module load lapack/3.7.0"; either will work.

 

Software Available Through Module Utility:

 

ANACONDA is a completely free, enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing.

 

VEROS, the Versatile Ocean Simulator in pure Python, aims to be the Swiss Army knife of ocean modeling. It is a full-fledged GCM that supports anything between highly idealized configurations and realistic set-ups, targeting students and seasoned researchers alike. Thanks to its seamless interplay with Bohrium, Veros runs efficiently on your laptop, gaming PC (with experimental GPU support through OpenCL), and small clusters.

 

The C++ Boost library provides free, peer-reviewed, portable C++ source libraries to speed up software development.

 

CDO (Climate Data Operators) is a collection of command line Operators to manipulate and analyze Climate and NWP model Data. Supported data formats are GRIB 1/2, netCDF 3/4, SERVICE, EXTRA and IEG. There are more than 600 operators available.

 

CESM is the NCAR/UCAR Community Earth System Model.

 

FFTw3 (serial & parallel, single & double precision) is a C/FORTRAN subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).

 

GCC, the GNU Compiler Collection, includes front ends for C, C++, and Fortran, as well as libraries for these languages (libstdc++, ...). GCC was originally written as the compiler for the GNU operating system.

 

GRIB API is an ECMWF application program interface, accessible from C, FORTRAN and Python programs, developed for encoding and decoding WMO FM-92 GRIB edition 1 and edition 2 messages. A useful set of command line tools is also provided to give quick access to GRIB messages.

 

GSL (GNU Scientific Library) is a numerical library for C and C++ programmers. The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.

 

HDF5 / HDF5-parallel (Hierarchical Data Format) is a data model, library, and file format for storing and managing data. It supports an unlimited variety of data-types, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.

 

LAPACK (Linear Algebra PACKage) is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.

 

METIS is a set of serial programs for partitioning graphs, partitioning finite element meshes, and producing fill-reducing orderings for sparse matrices. The algorithms implemented in METIS are based on the multilevel recursive-bisection, multilevel k-way, and multi-constraint partitioning schemes.

 

MVAPICH2 is a Message Passing Interface-3 implementation.

 

OpenMPI is a Message Passing Interface-3 implementation.

 

NCO is netCDF Operator toolkit, which manipulates and analyzes data stored in netCDF-accessible formats, including DAP, HDF4, and HDF5. It exploits the geophysical expressivity of many CF (Climate & Forecast) metadata conventions, the flexible description of physical dimensions translated by UDUnits, the network transparency of OPeNDAP, the storage features (e.g., compression, chunking, groups) of HDF (the Hierarchical Data Format), and many powerful mathematical and statistical algorithms of GSL (the GNU Scientific Library). NCO is fast, powerful, and free.

 

NetCDF4 (Network Common Data Form) is a set of interfaces for array-oriented data access and a freely distributed collection of data access libraries for C, Fortran, C++ languages. The netCDF libraries support a machine-independent format for representing scientific data. Together, the interfaces, libraries, and format support the creation, access, and sharing of scientific data.

 

NetCDF-parallel is a library providing high-performance parallel I/O while still maintaining file-format compatibility with Unidata's NetCDF, specifically the formats of CDF-1 and CDF-2. Although NetCDF supports parallel I/O starting from version 4, the files must be in HDF5 format. PnetCDF is currently the only choice for carrying out parallel I/O on files that are in classic formats (CDF-1 and 2). In addition, PnetCDF supports the CDF-5 file format, an extension of CDF-2, that supports more data types and allows users to define large dimensions, attributes, and variables (>2B elements). NetCDF gives scientific programmers a self-describing and portable means for storing data. However, prior to version 4, netCDF does so in a serial manner. By making some small changes to the netCDF APIs, PnetCDF can use MPI-IO to achieve high-performance parallel I/O.

 

ParMETIS is an MPI-based parallel library that implements a variety of algorithms for partitioning unstructured graphs, meshes, and for computing fill-reducing orderings of sparse matrices. ParMETIS extends the functionality provided by METIS and includes routines that are especially suited for parallel AMR computations and large-scale numerical simulations. The algorithms implemented in ParMETIS are based on the parallel multilevel k-way graph-partitioning, adaptive repartitioning, and parallel multi-constrained partitioning schemes.

 

PETSc (real, complex) pronounced PET-see (the S is silent), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It supports MPI, shared memory pthreads, and GPUs through CUDA or OpenCL, as well as hybrid MPI-shared memory pthreads or MPI-GPU parallelism.

 

PISM is an open source, parallel, high-resolution ice sheet model.

 

Trilinos Project is an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. A unique design feature of Trilinos is its focus on packages.

 

Compiling Code

 

Let's assume that we're compiling code that will run as a parallel application using MPI for internode communication, and that the code is written in Fortran, C, or C++. In this case it's easy, because you will use standard compiler wrapper scripts that bring in all the include-file and library paths and set the linker options you'll need. Use the following wrappers: mpif90, mpicc, or mpic++ for Fortran, C, and C++, respectively.

 

To compile on DC3, use mpif90 -o hello.x hello.f90

 

If the compilation needs an extra library such as HDF5, it must first be loaded through the module utility. Even with the module loaded, however, the compiler doesn't automatically know where to find the HDF5 files. One way to figure out what to add to your compile line is to look under the covers of the HDF5 module:

 

module show hdf5

 

The "module show" command reveals (most of) what the module actually does when you load it. You can see that it defines some environment variables you can use, for example HDF5_INCLUDE, which you can use in your build script or Makefile. Look at the definitions of the HDF5_XXX environment variables: they contain all the include and link options.

 

Therefore, we can use mpicc -o hd_copy.x hd_copy.c $HDF5_INCLUDE $HDF5_LIB
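
Putting these steps together, a typical build might look like the following (the module name is as used above; hd_copy.c is the document's example source file):

module load hdf5      # make the HDF5 library and its HDF5_XXX variables available
module show hdf5      # inspect what the module defines
mpicc -o hd_copy.x hd_copy.c $HDF5_INCLUDE $HDF5_LIB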

 

Compiler Optimizations

 

These are some common compiler optimizations and the types of code that they work best with.

 

Vectorization

The registers and arithmetic units on DC3 are capable of performing the same operation on several double precision operands simultaneously in a SIMD (Single Instruction Multiple Data) fashion. This is often referred to as vectorization because of its similarities to the much larger vector registers and processing units of the Cray systems of the pre-MPP era.

Vector optimization is most useful for large loops in which each successive operation has no dependencies on the results of the previous operations. Loops can be vectorized by the compiler or by compiler directives in the source code.
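
For example, with the GCC compilers the vectorization flag from the Optimization Arguments table below can be passed straight through the MPI wrapper (the source and output file names are illustrative):

mpif90 -O2 -ftree-vectorize -o ocean_model.x ocean_model.f90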

 

Inter-procedural Optimization

This is defined as the compiler optimizing over subroutine, function, or other procedural boundaries. It can have many levels, ranging from inlining (the replacement of a function call with the corresponding source code at compile time) up to treating the entire program as one routine for the purpose of optimization.

This can be the most compute-intensive of all optimizations at compile time, particularly for large applications; it can increase the compile time by an order of magnitude or more without any significant speedup, and it can even cause the compilation to crash. For this reason, none of the DC3-recommended compiler optimization options include any significant inter-procedural optimizations. It is most suitable when there are function calls embedded within large loops.

 

Relaxation of IEEE Floating-point Precision

Full implementation of IEEE Floating-point precision is often very expensive. There are many floating-point optimization techniques that significantly speed up a code's performance by relaxing some of these requirements. Since most codes do not require an exact implementation of these rules, all of the DC3 recommended optimizations include relaxed floating-point techniques.

 

Optimization Arguments

This table shows how to invoke these optimizations with each compiler. Some of the options have numeric levels: the higher the number, the more extensive the optimizations, with a level of 0 turning the optimization off. For more information about these optimizations, see the compiler on-line man pages.

Optimization          Intel          gfortran/gcc                 PGI
Vectorization         -vec           -ftree-vectorize             -Mvect
Interprocedural       -ipo           -finline-[opt],-fipa[-opt]   -Mipa
IEEE FP relaxation    -mno-ieee-fp   -ffast-math                  -Knoieee

 

 

Running Jobs

 

General

The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

 

Architecture

The entities managed by the SLURM daemons include nodes (the compute resource in SLURM), partitions (which group nodes into logical, possibly overlapping, sets), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of possibly parallel tasks within a job). Partitions can be considered job queues, each of which has an assortment of constraints such as a job size limit, a job time limit, the users permitted to use it, etc.

 

SLURM Commands

These are the SLURM commands frequently used on DC3:

 

sinfo is used to show the state of partitions and nodes managed by SLURM:

 

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST

tier3* up 14-00:00:0 1 mix node408

tier3* up 14-00:00:0 10 idle node[406-407,409-416]

sbinlab up infinite 2 mix node[902,904]

sbinlab up infinite 7 alloc node[900-901,903,905-908]

sbinlab_gpu up infinite 1 down* node150

sbinlab_gpu up infinite 2 mix node[151-152]

sbinlab_ib up infinite 5 alloc node[153,157-160]

sbinlab_ib up infinite 3 idle node[154-156]

icecube_gpu up infinite 3 idle node[161-163]

aegir up 1-00:00:00 1 drain* node171

aegir up 1-00:00:00 1 idle* node441

aegir up 1-00:00:00 11 down* node[442-452]

aegir up 1-00:00:00 7 alloc node[164-167,170,172-173]

aegir up 1-00:00:00 9 idle node[168-169,174-180]

kemi1 up infinite 1 mix node318

kemi1 up infinite 15 alloc node[305-317,319-320]

astro2 up 10-00:00:0 1 down* node473

astro2 up 10-00:00:0 5 alloc node[480-481,483-485]

astro2 up 10-00:00:0 26 idle node[454-472,474-479,482]

kemi7 up infinite 2 idle node[148-149]

kemi_gemma up 14-00:00:0 3 mix node[087-088,090]

kemi_gemma up 14-00:00:0 5 idle node[083-086,089]

astro_devel up 2:00:00 4 down* node[769,796,816,878]

astro_devel up 2:00:00 32 alloc node[750-757,817-840]

astro_devel up 2:00:00 96 idle node[758-768,770-795,797-815,841-849,852-877,879-883]

astro_short up 12:00:00 4 down* node[769,796,816,878]

astro_short up 12:00:00 32 alloc node[750-757,817-840]

astro_short up 12:00:00 96 idle node[758-768,770-795,797-815,841-849,852-877,879-883]

astro_long up 5-00:00:00 4 down* node[769,796,816,878]

astro_long up 5-00:00:00 32 alloc node[750-757,817-840]

astro_long up 5-00:00:00 96 idle node[758-768,770-795,797-815,841-849,852-877,879-883]

astro_fe up 6:00:00 4 idle astro[06-09]

 

This shows that there are 14 partitions defined on the system at the moment, listed in the far-left column, together with the maximum runtime per job for each (TIMELIMIT). For the aegir partition, 16 nodes are currently up: 7 are allocated to jobs ("alloc") and 9 are idle and ready to run jobs, while the remaining nodes are down, drained, or not responding. A node has 16 or 32 processor cores; however, since the entire HPC facility is a heterogeneous system, the processor type (as well as memory, interconnect and the like) cannot be deduced from sinfo alone.

 

To see the detailed specifics of all partitions, use:

 

scontrol show partition

 

…………

PartitionName=aegir

AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL

AllocNodes=ALL Default=NO

DefaultTime=NONE DisableRootJobs=YES GraceTime=0 Hidden=NO

MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED

Nodes=node[164-179]

Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF

State=UP TotalCPUs=512 TotalNodes=16 SelectTypeParameters=N/A

DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

…………

 

This output shows us in detail that:

 

Anyone can submit a job to the aegir partition (AllowGroups=ALL)

The walltime limit on the aegir partition is 1 day (MaxTime=1-00:00:00)

It is important to understand that the "TotalCPUs=512" figure counts logical CPUs (cores times hardware threads) in the aegir partition as configured in this output: 16 nodes × 16 cores per node × 2 threads per core = 512. It is not the number of physical cores.
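
If you are only interested in a single partition, scontrol can show just that one, e.g. for aegir:

scontrol show Partition=aegir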

 

scontrol show nodes is used to show the available nodes on the system

 

or

 

scontrol show Node=node164 is used to show information about a specific node (here node164):

 

NodeName=node164 Arch=x86_64 CoresPerSocket=8

CPUAlloc=0 CPUErr=0 CPUTot=32 CPULoad=1.00 Features=(null)

Gres=(null)

NodeAddr=node164 NodeHostName=node164 Version=14.03

OS=Linux RealMemory=64301 AllocMem=0 Sockets=2 Boards=1

State=IDLE ThreadsPerCore=2 TmpDisk=101280 Weight=1

BootTime=2015-01-23T14:54:11 SlurmdStartTime=2015-01-23T15:02:21

CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

 

This output, which is edited for length, shows us a number of things:

 

Node 164 currently has no CPUs allocated by SLURM (CPUAlloc=0); the UNIX load average (CPULoad) on the machine is 1.00, i.e. on average about one core's worth of work is running on the node.

It also shows that there are 2 hardware threads per core (ThreadsPerCore), 32 logical CPUs available (CPUTot), the amount of memory on the node (RealMemory, in MB) and the temporary disk space (TmpDisk, in MB).

 

The squeue command is used to show the jobs in the queuing system. The command gives an output similar to this:

 

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

484842 astro2 6N8x_2.1 jslarsen R 2:38:06 1 node481

484517 sbinlab ruth.wt. papaleo PD 0:00 1 (Dependency)

484767 astro_lon shock_v1 gmurphy R 3:10:05 12 node[750-761]

484537 astro_lon shock gmurphy R 5:26:39 12 node[797-808]

475166 sbinlab_i CSCHARMM wyong R 5:36:23 2 node[155-156]

 

This partial output shows us that:

 

gmurphy is running in the astro_long partition (truncated to "astro_lon" in the output), on nodes [750-761] and [797-808] (two different jobs).

papaleo's job ruth.wt. is queued (PD) in the sbinlab partition, waiting for a job dependency to be satisfied.

 

More generally, the output shows us that:

The first column shows us the job id. The job id is used in all subsequent commands to terminate or modify the job.

The second column shows us the partition the job is running in.

The third column shows us the job name.

The fourth column shows us the username of the person who submitted the job.

The fifth column shows us the state of the job. Some of the possible job states are as follows:

PD (pending), R (running), CA (cancelled), CF (configuring), CG (completing), CD (completed), F (failed), TO (timeout), NF (node failure) and SE (special exit state).

The sixth column shows us the job runtime.

The seventh and eighth columns show us the number of allocated nodes and the list of nodes the job is running on.

 

sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

 

scancel is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
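
For example, to cancel the first job listed in the squeue output above, pass its job id to scancel:

scancel 484842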

 

SLURM Example Script

 

-----------------------------------------------------------------

#!/bin/sh

#

#SBATCH -p aegir

#SBATCH -A ocean

#SBATCH --job-name=myjob

#SBATCH --time=00:30:00

#SBATCH --constraint=v1

#SBATCH --nodes=2

#SBATCH --ntasks=32

#SBATCH --cpus-per-task=1

#SBATCH --exclusive

#SBATCH --mail-type=ALL

#SBATCH --mail-user=mymail@nbi.ku.dk

#SBATCH --output=slurm.out

 

srun --mpi=none --kill-on-bad-exit my_program.exe

-----------------------------------------------------------------

 

sbatch ./my_batch_script.sh

 

In this example we use the aegir partition to run my_program.exe, set our job name, request 30 minutes of runtime on nodes with 16 cores (--constraint=v1), 2 nodes and 32 tasks in total (one task per core), exclusive use of the nodes (no sharing of node resources), e-mail notifications, and a file name for the standard job output.
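
After submitting the script with sbatch, you can follow the job with squeue; the -u option limits the output to a single user's jobs:

squeue -u $USER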

One can instead request a node with 32 cores (--constraint=v2); in that case the SLURM batch script looks like this:

 

-----------------------------------------------------------------

#!/bin/sh

#

#SBATCH -p aegir

#SBATCH -A ocean

#SBATCH --job-name=myjob

#SBATCH --time=00:30:00

#SBATCH --constraint=v2

#SBATCH --nodes=1

#SBATCH --ntasks=32

#SBATCH --cpus-per-task=1

#SBATCH --exclusive

#SBATCH --mail-type=ALL

#SBATCH --mail-user=mymail@nbi.ku.dk

#SBATCH --output=slurm.out

 

srun --mpi=none --kill-on-bad-exit my_program.exe

-----------------------------------------------------------------

 

SLURM References

 

http://slurm.schedmd.com/slurm.html

 

 

 

 

 
