CUDA GPU Acceleration
This section describes how to build and run FLUXOS with CUDA GPU acceleration for maximum performance on large-scale simulations.
Overview
FLUXOS supports CUDA GPU acceleration for both regular Cartesian and unstructured triangular mesh solvers. GPU offloading can provide 10-50x speedup over single-core CPU execution for domains with millions of cells.
Supported GPU operations:
Hydrodynamics solver (Roe flux, state update, wet/dry tracking)
Courant condition (CFL time step computation)
ADE solute transport (concentration adjustment)
Requirements
NVIDIA GPU: Compute Capability 6.0+ (Pascal or newer)
CUDA Toolkit: Version 11.0 or later
GPU Driver: Compatible with the installed CUDA Toolkit version
GPU |
Compute Capability |
VRAM |
Typical Domain Size |
|---|---|---|---|
GTX 1080 Ti |
6.1 |
11 GB |
Up to 2000x2000 |
RTX 2080 Ti |
7.5 |
11 GB |
Up to 3000x3000 |
RTX 3090 |
8.6 |
24 GB |
Up to 5000x5000 |
A100 |
8.0 |
40/80 GB |
10000x10000+ |
H100 |
9.0 |
80 GB |
10000x10000+ |
Building with CUDA
# Standard CUDA build (regular mesh)
mkdir build && cd build
cmake -DMODE_release=ON -DUSE_CUDA=ON ..
make -j$(nproc)
# CUDA with triangular mesh
cmake -DMODE_release=ON -DUSE_CUDA=ON ..
make -j$(nproc)
# Full-feature build
cmake -DMODE_release=ON -DUSE_CUDA=ON -DUSE_MPI=ON ..
make -j$(nproc)
Running with GPU
No special command-line flags are needed. When built with USE_CUDA, the GPU solver is used automatically:
# Single GPU
./bin/fluxos ./input/modset.json
# With MPI (multi-GPU, one GPU per MPI rank)
mpirun -np 2 ./bin/fluxos_mpi ./input/modset.json
Regular Mesh GPU Architecture
The regular mesh GPU solver uses a 2D CUDA thread grid matching the domain layout:
Thread block size: 16x16 (256 threads per block)
Grid size:
(NCOLS/16, NROWS/16)blocksEach thread processes one cell at
(irow, icol)
Kernels:
cuda_courant_condition: CFL time step with block-level parallel reductioncuda_hydrodynamics_calc: Roe flux computation + state updatecuda_ade_adjust: Concentration adjustment for depth changes
Memory layout:
Device memory mirrors the host
arma::Mat<double>layoutAll fields (h, z, qx, qy, etc.) allocated as contiguous 2D arrays on GPU
Host-device transfers occur at forcing and output steps
Triangular Mesh GPU Architecture
The triangular mesh GPU solver uses 1D thread indexing with 7 specialized kernels:
Per timestep execution order:
┌─────────────────────────────────────┐
│ 1. kernel_tri_wetdry (cells) │
│ 2. kernel_tri_gradient (cells) │
│ 3. kernel_tri_limiter (cells) │
│ 4. kernel_tri_edge_flux (edges) ← main compute kernel
│ 5. kernel_tri_accumulate (edges) │
│ 6. kernel_tri_update (cells) │
└─────────────────────────────────────┘
CFL: kernel_tri_courant (cells)
Thread indexing:
Cell kernels: 1 thread per cell,
blockDim = 256,gridDim = (ncells + 255) / 256Edge kernels: 1 thread per edge,
blockDim = 256,gridDim = (nedges + 255) / 256
Race condition handling:
The flux accumulation kernel (step 5) uses atomicAdd to safely add edge fluxes to the left and right cells concurrently:
// Each edge thread atomically adds its flux to both cells
atomicAdd(&d_dh[left_cell], -flux_mass * dt / cell_area[left_cell]);
atomicAdd(&d_dh[right_cell], +flux_mass * dt / cell_area[right_cell]);
Device memory:
Mesh topology (read-only): flat arrays for edge connectivity, normals, cell geometry
Solution data (read-write): flat
double*arrays indexed by cell/edge IDReduction buffer for CFL computation
CFL Reduction:
Block-level parallel reduction computes the global minimum time step:
Each thread computes its local CFL candidate
Shared memory reduction within each block finds the block minimum
Block minimums written to a reduction buffer
Host reads and reduces across blocks
Performance Guidelines
Maximizing GPU performance:
Large domains benefit most: GPU overhead is amortized over more cells
Minimize host-device transfers: Only transfer at forcing and output steps
Batch chemical species: Process all species in sequence on GPU before transferring back
Use pinned memory: For large domains, pinned (page-locked) host memory improves transfer speed
Expected speedup factors:
Domain Size |
CPU (OpenMP 8T) |
GPU (RTX 3090) |
Speedup |
|---|---|---|---|
500x500 |
1x |
3-5x |
Moderate |
2000x2000 |
1x |
10-20x |
Good |
5000x5000 |
1x |
20-40x |
Excellent |
10000x10000 |
1x |
30-50x |
Excellent |
Note
Actual speedup depends on GPU model, domain complexity, and wet fraction. Domains with large dry regions may show lower speedup due to early-exit optimizations in the CPU code.
Multi-GPU with MPI
For multi-GPU execution, combine USE_CUDA with USE_MPI. Each MPI rank uses one GPU:
# Build
cmake -DMODE_release=ON -DUSE_CUDA=ON -DUSE_MPI=ON ..
make -j$(nproc)
# Run on 4 GPUs
mpirun -np 4 ./bin/fluxos_mpi ./input/modset.json
SLURM example for multi-GPU:
#!/bin/bash
#SBATCH --job-name=fluxos_gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=12:00:00
module load cuda/11.8
module load openmpi/4.1.1
srun ./build/bin/fluxos_mpi ./input/modset.json
Troubleshooting
“CUDA error: no CUDA-capable device is detected”:
# Check GPU visibility
nvidia-smi
echo $CUDA_VISIBLE_DEVICES
“CUDA error: out of memory”:
Reduce domain size or use a GPU with more VRAM
For triangular meshes: reduce mesh cell count
Check for memory leaks with
cuda-memcheck
Poor GPU performance:
Ensure the GPU is not in power-saving mode:
nvidia-smi -pm 1Check GPU utilization:
nvidia-smi dmonVerify you’re using a release build (debug builds are much slower on GPU)
Build errors with nvcc:
Ensure CUDA Toolkit version matches your GPU driver
Check compute capability:
nvidia-smi --query-gpu=compute_cap --format=csv