High-Performance Computing (HPC)
This section describes how to build and run FLUXOS on HPC clusters using hybrid MPI+OpenMP parallelization, optionally combined with CUDA GPU acceleration.
Overview
FLUXOS supports five parallelization modes:
OpenMP only: For workstations and small domains
MPI only: For distributed memory systems
Hybrid MPI+OpenMP: For scalability on HPC clusters
CUDA GPU: For maximum single-node performance with NVIDIA GPUs (recommended for large domains)
Hybrid MPI+OpenMP+CUDA: For maximum scalability on GPU-equipped HPC clusters (recommended)
The hybrid approach uses MPI for communication between nodes, OpenMP for parallelism within each node, and CUDA for GPU offloading of compute-intensive kernels.
Building for HPC
Prerequisites
MPI implementation (OpenMPI, MPICH, or Intel MPI)
OpenMP-enabled compiler (GCC 7+, Intel C++ 18+, or Clang 8+)
Armadillo 9.9+
CMake 3.10+
HDF5 (optional, for parallel output)
Build Commands
# Create build directory
mkdir build && cd build
# Configure with MPI support
cmake -DMODE_release=ON -DUSE_MPI=ON ..
# Build
make -j8
# The executable will be: build/bin/fluxos_mpi
With CUDA GPU acceleration:
cmake -DMODE_release=ON -DUSE_MPI=ON -DUSE_CUDA=ON ..
make -j8
With triangular mesh + GPU + MPI:
cmake -DMODE_release=ON -DUSE_MPI=ON -DUSE_CUDA=ON ..
make -j8
For module-based HPC systems:
# Load required modules (adjust for your system)
module load gcc/11.2.0
module load openmpi/4.1.1
module load armadillo/11.0
module load cmake/3.20
# Build
mkdir build && cd build
cmake -DMODE_release=ON -DUSE_MPI=ON ..
make -j8
Running on HPC Clusters
SLURM Job Script
Example SLURM script for running FLUXOS on a cluster:
#!/bin/bash
#SBATCH --job-name=fluxos_mpi
#SBATCH --output=fluxos_%j.out
#SBATCH --error=fluxos_%j.err
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=2
#SBATCH --time=24:00:00
#SBATCH --partition=compute
#SBATCH --account=your_account
# Load modules
module purge
module load gcc/11.2.0
module load openmpi/4.1.1
module load armadillo/11.0
# Set OpenMP threads per MPI task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=close
export OMP_PLACES=cores
# Run FLUXOS
srun --mpi=pmix ./build/bin/fluxos_mpi ./input/modset.json
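As a quick consistency check on the request above, the following sketch multiplies out the SLURM directives: nodes times tasks per node gives the MPI rank count, and ranks times CPUs per task gives the total core count. The function name `layout_summary` is illustrative and not part of FLUXOS.

```shell
# Illustrative helper (not part of FLUXOS): summarize a hybrid MPI+OpenMP layout.
# Arguments: nodes, MPI tasks per node, CPUs (OpenMP threads) per task.
layout_summary() {
    nodes=$1; tasks_per_node=$2; cpus_per_task=$3
    ranks=$(( nodes * tasks_per_node ))
    cores=$(( ranks * cpus_per_task ))
    echo "ranks=${ranks} threads_per_rank=${cpus_per_task} cores=${cores}"
}

# Values taken from the #SBATCH directives in the script above
layout_summary 4 32 2   # ranks=128 threads_per_rank=2 cores=256
```

For the script above this means each node must provide at least 64 cores (32 tasks with 2 threads each).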
SLURM GPU Job Script
Example SLURM script for GPU-accelerated runs:
#!/bin/bash
#SBATCH --job-name=fluxos_gpu
#SBATCH --output=fluxos_%j.out
#SBATCH --error=fluxos_%j.err
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=2
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
# Load modules
module purge
module load gcc/11.2.0
module load cuda/11.8
module load openmpi/4.1.1
module load armadillo/11.0
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Run (each MPI rank uses one GPU)
srun --mpi=pmix ./build/bin/fluxos_mpi ./input/modset.json
PBS Job Script
Example PBS script:
#!/bin/bash
#PBS -N fluxos_mpi
#PBS -l nodes=4:ppn=32
#PBS -l walltime=24:00:00
#PBS -q batch
cd $PBS_O_WORKDIR
# Load modules
module load gcc openmpi armadillo
# Set OpenMP threads
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=close
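# Run: 4 nodes x 16 ranks per node x 2 threads per rank = 128 cores
# (mapping option shown in Open MPI syntax; adjust for other MPI libraries)
mpirun -np 64 --map-by ppr:16:node:PE=2 ./build/bin/fluxos_mpi ./input/modset.json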
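Rather than hard-coding the rank count, it can be derived from the allocation. The sketch below assumes a Torque/PBS-style `$PBS_NODEFILE` that lists one line per allocated core, and divides that count by the number of threads per rank. `ranks_for_allocation` is a hypothetical helper name, not part of FLUXOS.

```shell
# Illustrative helper (not part of FLUXOS).
# Arguments: path to a node file with one line per allocated core,
#            number of OpenMP threads per MPI rank.
ranks_for_allocation() {
    cores=$(wc -l < "$1")
    echo $(( cores / $2 ))
}

# Inside a PBS job this could be used as (assumption: Torque-style node file):
#   NP=$(ranks_for_allocation "$PBS_NODEFILE" "$OMP_NUM_THREADS")
#   mpirun -np "$NP" ./build/bin/fluxos_mpi ./input/modset.json
```

For a 128-core allocation and two threads per rank this yields 64 ranks, which keeps one thread per core.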
Scalability Guidelines
Use this table to select appropriate parallelization for your domain:
| Domain Size | Mode | MPI Procs | OMP Threads | Notes |
|---|---|---|---|---|
| < 500x500 | OpenMP or CUDA | 1 | 4-8 | Single GPU provides best speedup |
| 500x500 - 2000x2000 | CUDA or Hybrid | 1-4 | 2-4 | GPU recommended; MPI for multi-GPU |
| 2000x2000 - 5000x5000 | Hybrid+CUDA | 4-16 | 2-4 | Multi-GPU nodes recommended |
| > 5000x5000 | Hybrid+CUDA | 16-64 | 2-4 | Large-scale HPC with GPU nodes |
Triangular mesh considerations:
For unstructured triangular meshes, the domain decomposition uses graph-based partitioning rather than 2D Cartesian decomposition. This provides better load balance on irregular domains but requires METIS for optimal partitioning. Without METIS, naive block partitioning is used as a fallback.
Domain Decomposition
Regular Mesh:
FLUXOS uses 2D Cartesian domain decomposition:
The global domain is automatically divided among MPI processes
MPI_Cart_create establishes the process topology
Each process computes a local subdomain with ghost cells
Ghost cells are exchanged at each time step
Global Domain (1000 x 1000 cells)
┌─────────────┬─────────────┐
│ Process 0 │ Process 1 │
│ 500x500 │ 500x500 │
├─────────────┼─────────────┤
│ Process 2 │ Process 3 │
│ 500x500 │ 500x500 │
└─────────────┴─────────────┘
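The diagram shows the evenly divisible case. As a sketch of the arithmetic along one axis, the helper below splits `n` cells among a number of processes and gives the first `n mod parts` processes one extra cell. How FLUXOS itself distributes a remainder is not stated here, so treat that rule as an assumption; `local_extent` is an illustrative name.

```shell
# Illustrative helper (not part of FLUXOS): cells owned along one axis.
# Arguments: total cells, number of processes along the axis,
#            process coordinate along the axis (0-based).
local_extent() {
    base=$(( $1 / $2 ))
    rem=$(( $1 % $2 ))
    # Assumption: the first 'rem' coordinates receive one extra cell
    if [ "$3" -lt "$rem" ]; then
        echo $(( base + 1 ))
    else
        echo "$base"
    fi
}

local_extent 1000 2 0   # the 2 x 2 layout in the diagram: 500 cells per axis
local_extent 1000 3 0   # uneven split: coordinate 0 gets 334
local_extent 1000 3 2   # coordinate 2 gets 333
```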
Ghost Cell Exchange
Each subdomain maintains ghost cells (halo regions) from neighboring processes:
┌─────────────────────────────┐
│ Ghost cells (from north) │
├───┬───────────────────┬─────┤
│ G │ │ G │
│ h │ Local domain │ h │
│ o │ │ o │
│ s │ │ s │
│ t │ │ t │
├───┴───────────────────┴─────┤
│ Ghost cells (from south) │
└─────────────────────────────┘
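The cost of the exchange relative to the computation can be estimated from the subdomain shape. The sketch below assumes a halo of uniform width on all four sides, ignores corner cells, and reports halo cells as a percentage of interior cells. The halo width FLUXOS uses is not stated here, so it is left as a parameter; `halo_percent` is an illustrative name.

```shell
# Illustrative helper (not part of FLUXOS): halo cells as a percentage of interior cells.
# Arguments: local nx, local ny, halo width (assumed, not taken from the FLUXOS source).
halo_percent() {
    awk -v nx="$1" -v ny="$2" -v h="$3" \
        'BEGIN { printf "%.2f\n", 100 * 2 * h * (nx + ny) / (nx * ny) }'
}

halo_percent 500 500 1   # 0.80
halo_percent 50 50 1     # 8.00
```

The ratio grows as subdomains shrink, which is why adding ranks to a fixed-size domain eventually stops improving run time.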
Triangular Mesh Decomposition:
For unstructured triangular meshes, domain decomposition uses graph-based partitioning:
METIS partitioning (preferred): METIS_PartGraphKway on the cell adjacency graph
Naive block partitioning (fallback): Sequential cell IDs divided among ranks
Halo cells: Cells across partition-boundary edges are exchanged
Communication: MPI_Isend/MPI_Irecv per neighbor rank
CUDA GPU on HPC Clusters
When using CUDA on HPC clusters:
Each MPI rank is assigned one GPU
Host-device transfers occur at forcing and output steps
CFL reduction uses device-side block reduction followed by host-side MPI reduction
For multi-GPU nodes, use CUDA_VISIBLE_DEVICES or let MPI rank assignment handle GPU selection
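One common way to pin each rank to its own GPU is a small wrapper that sets `CUDA_VISIBLE_DEVICES` from the node-local rank before starting the executable. The sketch below assumes SLURM's `SLURM_LOCALID` (or Open MPI's `OMPI_COMM_WORLD_LOCAL_RANK`) is available. The helper name and the wrapper file name are hypothetical, and whether FLUXOS already selects a device per rank internally is not covered here.

```shell
# Illustrative helper (not part of FLUXOS): map a node-local rank to a GPU index.
# Arguments: node-local rank, number of GPUs per node.
gpu_for_local_rank() {
    echo $(( $1 % $2 ))
}

# Node-local rank from SLURM, falling back to Open MPI, then to 0 outside a job
local_rank=${SLURM_LOCALID:-${OMPI_COMM_WORLD_LOCAL_RANK:-0}}
gpus_per_node=4   # assumption: matches a request such as --gres=gpu:4

CUDA_VISIBLE_DEVICES=$(gpu_for_local_rank "$local_rank" "$gpus_per_node")
export CUDA_VISIBLE_DEVICES
echo "local rank ${local_rank} -> GPU ${CUDA_VISIBLE_DEVICES}"

# In a wrapper script (hypothetical name gpu_bind.sh), finish with:
#   exec "$@"
# and launch as:
#   srun --mpi=pmix ./gpu_bind.sh ./build/bin/fluxos_mpi ./input/modset.json
```

Some schedulers can already restrict each task to one device, in which case a wrapper like this is unnecessary.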
Performance Optimization
OpenMP Settings
For optimal OpenMP performance:
# Bind threads to cores
export OMP_PROC_BIND=close
export OMP_PLACES=cores
# Set number of threads (typically 2-4 for hybrid)
export OMP_NUM_THREADS=2
MPI Settings
For OpenMPI:
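# Exclude the openib BTL if InfiniBand is not available
export OMPI_MCA_btl=^openib
# Alternative for single-node runs only: restrict Open MPI to shared memory.
# Setting OMPI_MCA_btl a second time replaces the value above, so use one or the other.
# export OMPI_MCA_btl=vader,self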
For Intel MPI:
# Use shared memory within a node and OFI between nodes
export I_MPI_FABRICS=shm:ofi
Memory Considerations
Each MPI process allocates memory for its local subdomain plus ghost cells
For very large domains, ensure sufficient memory per node
Consider using fewer MPI processes with more OpenMP threads to reduce memory overhead
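A rough per-rank estimate follows from the local array shape. The sketch below assumes double-precision (8-byte) fields, a halo of uniform width on every side, and a caller-supplied number of field arrays. The actual number of arrays FLUXOS allocates is not stated here, and `rank_bytes` is an illustrative name.

```shell
# Illustrative helper (not part of FLUXOS): bytes for one rank's field arrays.
# Arguments: local nx, local ny, halo width, number of double-precision fields.
rank_bytes() {
    echo $(( ($1 + 2 * $3) * ($2 + 2 * $3) * $4 * 8 ))
}

# 500 x 500 subdomain, halo width 1, 10 fields (all three values are assumptions)
rank_bytes 500 500 1 10   # 20160320 bytes, about 19 MiB
```

Multiplying the result by the number of ranks per node gives a lower bound on the memory each node needs for field data.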
Parallel Output
FLUXOS supports two parallel output modes:
Gathered Output (Default)
All data gathered to root process for writing
Simpler file format, single output file
Suitable for moderate domain sizes
Parallel Output
Each process writes its own portion
Creates multiple files plus a manifest
Better scalability for very large domains
Output files are named:
Results/
├── 3600.txt # Gathered output (single file)
├── 3600_rank0.txt # Parallel output (per-process)
├── 3600_rank1.txt
├── 3600_manifest.txt # Manifest listing all files
└── ...
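When parallel output is enabled, a quick check that every rank wrote its file can catch incomplete runs. The sketch below relies only on the `<step>_rank<N>.txt` naming shown above. It does not parse the manifest, whose format is not described here, and `missing_rank_files` is an illustrative name.

```shell
# Illustrative helper (not part of FLUXOS): list per-rank output files that are absent.
# Arguments: results directory, output step (e.g. 3600), number of MPI ranks.
missing_rank_files() {
    rank=0
    while [ "$rank" -lt "$3" ]; do
        f="$1/${2}_rank${rank}.txt"
        [ -f "$f" ] || echo "$f"
        rank=$(( rank + 1 ))
    done
}

# Example: missing_rank_files Results 3600 128
```

An empty result means all expected per-rank files for that step are present.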
Troubleshooting
Common Issues
MPI not found during build:
# Ensure MPI is in PATH
which mpicc
which mpicxx
# Set CC and CXX if needed
export CC=mpicc
export CXX=mpicxx
Poor scaling:
Check that ghost cell exchange is not dominating
Ensure load balance (equal subdomain sizes)
Verify network bandwidth (InfiniBand recommended)
Memory errors:
Reduce number of MPI processes
Increase OpenMP threads per process
Check for memory leaks with valgrind
Segmentation faults:
Ensure consistent MPI library between build and run
Check that Armadillo was compiled with the same compiler
Verify input file format
Profiling
For performance analysis:
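# With Intel VTune (the launcher starts one collector per rank; the result directory name is arbitrary)
srun vtune -collect hotspots -r vtune_results -- ./build/bin/fluxos_mpi input.json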
# With Scalasca
scalasca -analyze srun ./build/bin/fluxos_mpi input.json
# With Arm MAP (part of Linaro Forge)
map --profile srun ./build/bin/fluxos_mpi input.json
Best Practices
Start small: Test with few MPI processes before scaling up
Monitor load balance: Ensure all processes finish at similar times
Use hybrid mode: Typically 2-4 OpenMP threads per MPI process works best
Check I/O: Parallel I/O may become a bottleneck for frequent output
Verify results: Compare small-domain results with the serial version
Use restart files: For long simulations, implement checkpoint/restart