High-Performance Computing (HPC)

This section describes how to build and run FLUXOS on HPC clusters using MPI+OpenMP hybrid parallelization.

Overview

FLUXOS supports five parallelization modes:

OpenMP only: For workstations and small domains
MPI only: For distributed memory systems
Hybrid MPI+OpenMP: For scalability on HPC clusters
CUDA GPU: For maximum single-node performance with NVIDIA GPUs (recommended for large domains)
Hybrid MPI+OpenMP+CUDA: For maximum scalability on GPU-equipped HPC clusters (recommended)

The hybrid approach uses MPI for communication between nodes, OpenMP for parallelism within each node, and CUDA for GPU offloading of compute-intensive kernels.

Building for HPC

Prerequisites

MPI implementation (OpenMPI, MPICH, or Intel MPI)
OpenMP-enabled compiler (GCC 7+, Intel C++ 18+, or Clang 8+)
Armadillo 9.9+
CMake 3.10+
HDF5 (optional, for parallel output)

Build Commands

# Create build directory
mkdir build && cd build

# Configure with MPI support
cmake -DMODE_release=ON -DUSE_MPI=ON ..

# Build
make -j8

# The executable will be: build/bin/fluxos_mpi

With CUDA GPU acceleration:

cmake -DMODE_release=ON -DUSE_MPI=ON -DUSE_CUDA=ON ..
make -j8

With triangular mesh + GPU + MPI:

cmake -DMODE_release=ON -DUSE_MPI=ON -DUSE_CUDA=ON ..
make -j8

For module-based HPC systems:

# Load required modules (adjust for your system)
module load gcc/11.2.0
module load openmpi/4.1.1
module load armadillo/11.0
module load cmake/3.20

# Build
mkdir build && cd build
cmake -DMODE_release=ON -DUSE_MPI=ON ..
make -j8

Running on HPC Clusters

SLURM Job Script

Example SLURM script for running FLUXOS on a cluster:

#!/bin/bash
#SBATCH --job-name=fluxos_mpi
#SBATCH --output=fluxos_%j.out
#SBATCH --error=fluxos_%j.err
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=2
#SBATCH --time=24:00:00
#SBATCH --partition=compute
#SBATCH --account=your_account

# Load modules
module purge
module load gcc/11.2.0
module load openmpi/4.1.1
module load armadillo/11.0

# Set OpenMP threads per MPI task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# Run FLUXOS
srun --mpi=pmix ./build/bin/fluxos_mpi ./input/modset.json

SLURM GPU Job Script

Example SLURM script for GPU-accelerated runs:

#!/bin/bash
#SBATCH --job-name=fluxos_gpu
#SBATCH --output=fluxos_%j.out
#SBATCH --error=fluxos_%j.err
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=2
#SBATCH --time=24:00:00
#SBATCH --partition=gpu

# Load modules
module purge
module load gcc/11.2.0
module load cuda/11.8
module load openmpi/4.1.1
module load armadillo/11.0

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Run (each MPI rank uses one GPU)
srun --mpi=pmix ./build/bin/fluxos_mpi ./input/modset.json

PBS Job Script

Example PBS script:

#!/bin/bash
#PBS -N fluxos_mpi
#PBS -l nodes=4:ppn=32
#PBS -l walltime=24:00:00
#PBS -q batch

cd $PBS_O_WORKDIR

# Load modules
module load gcc openmpi armadillo

# Set OpenMP threads
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=close

# Run
mpirun -np 128 ./build/bin/fluxos_mpi ./input/modset.json

Scalability Guidelines

Use this table to select appropriate parallelization for your domain:

Domain Size	Mode	MPI Procs	OMP Threads	Notes
< 500x500	OpenMP or CUDA	1	4-8	Single GPU provides best speedup
500x500 - 2000x2000	CUDA or Hybrid	1-4	2-4	GPU recommended; MPI for multi-GPU
2000x2000 - 5000x5000	Hybrid+CUDA	4-16	2-4	Multi-GPU nodes recommended
> 5000x5000	Hybrid+CUDA	16-64	2-4	Large-scale HPC with GPU nodes

Triangular mesh considerations:

For unstructured triangular meshes, the domain decomposition uses graph-based partitioning rather than 2D Cartesian decomposition. This provides better load balance on irregular domains but requires METIS for optimal partitioning. Without METIS, naive block partitioning is used as a fallback.

Domain Decomposition

Regular Mesh:

FLUXOS uses 2D Cartesian domain decomposition:

The global domain is automatically divided among MPI processes
MPI_Cart_create establishes the process topology
Each process computes a local subdomain with ghost cells
Ghost cells are exchanged at each time step

Global Domain (1000 x 1000 cells)
┌─────────────┬─────────────┐
│  Process 0  │  Process 1  │
│  500x500    │  500x500    │
├─────────────┼─────────────┤
│  Process 2  │  Process 3  │
│  500x500    │  500x500    │
└─────────────┴─────────────┘

Ghost Cell Exchange

Each subdomain maintains ghost cells (halo regions) from neighboring processes:

┌─────────────────────────────┐
│  Ghost cells (from north)   │
├───┬───────────────────┬─────┤
│ G │                   │ G   │
│ h │   Local domain    │ h   │
│ o │                   │ o   │
│ s │                   │ s   │
│ t │                   │ t   │
├───┴───────────────────┴─────┤
│  Ghost cells (from south)   │
└─────────────────────────────┘

Triangular Mesh Decomposition:

For unstructured triangular meshes, domain decomposition uses graph-based partitioning:

METIS partitioning (preferred): METIS_PartGraphKway on the cell adjacency graph
Naive block partitioning (fallback): Sequential cell IDs divided among ranks
Halo cells: Cells across partition-boundary edges are exchanged
Communication: MPI_Isend/MPI_Irecv per neighbor rank

CUDA GPU on HPC Clusters

When using CUDA on HPC clusters:

Each MPI rank is assigned one GPU
Host-device transfers occur at forcing and output steps
CFL reduction uses device-side block reduction followed by host-side MPI reduction
For multi-GPU nodes, use CUDA_VISIBLE_DEVICES or let MPI rank assignment handle GPU selection

Performance Optimization

OpenMP Settings

For optimal OpenMP performance:

# Bind threads to cores
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# Set number of threads (typically 2-4 for hybrid)
export OMP_NUM_THREADS=2

MPI Settings

For OpenMPI:

# Disable InfiniBand if not available
export OMPI_MCA_btl=^openib

# Use shared memory for intra-node communication
export OMPI_MCA_btl=vader,self

For Intel MPI:

# Use shared memory fabric
export I_MPI_FABRICS=shm:ofi

Memory Considerations

Each MPI process allocates memory for its local subdomain plus ghost cells
For very large domains, ensure sufficient memory per node
Consider using fewer MPI processes with more OpenMP threads to reduce memory overhead

Parallel Output

FLUXOS supports two parallel output modes:

Gathered Output (Default)

All data gathered to root process for writing
Simpler file format, single output file
Suitable for moderate domain sizes

Parallel Output

Each process writes its own portion
Creates multiple files plus a manifest
Better scalability for very large domains

Output files are named:

Results/
├── 3600.txt              # Gathered output (single file)
├── 3600_rank0.txt        # Parallel output (per-process)
├── 3600_rank1.txt
├── 3600_manifest.txt     # Manifest listing all files
└── ...

Troubleshooting

Common Issues

MPI not found during build:

# Ensure MPI is in PATH
which mpicc
which mpicxx

# Set CC and CXX if needed
export CC=mpicc
export CXX=mpicxx

Poor scaling:

Check that ghost cell exchange is not dominating
Ensure load balance (equal subdomain sizes)
Verify network bandwidth (InfiniBand recommended)

Memory errors:

Reduce number of MPI processes
Increase OpenMP threads per process
Check for memory leaks with valgrind

Segmentation faults:

Ensure consistent MPI library between build and run
Check Armadillo is compiled with same compiler
Verify input file format

Profiling

For performance analysis:

# With Intel VTune
vtune -collect hotspots -- srun ./build/bin/fluxos_mpi input.json

# With Scalasca
scalasca -analyze srun ./build/bin/fluxos_mpi input.json

# With ARM MAP
map --profile srun ./build/bin/fluxos_mpi input.json

Best Practices

Start small: Test with few MPI processes before scaling up
Monitor load balance: Ensure all processes finish at similar times
Use hybrid mode: Typically 2-4 OpenMP threads per MPI process works best
Check I/O: Parallel I/O may become a bottleneck for frequent output
Verify results: Compare small-domain results with serial version
Use restart files: For long simulations, implement checkpoint/restart