CUDA GPU Acceleration

This section describes how to build and run FLUXOS with CUDA GPU acceleration for maximum performance on large-scale simulations.

Overview

FLUXOS supports CUDA GPU acceleration for both regular Cartesian and unstructured triangular mesh solvers. GPU offloading can provide 10-50x speedup over single-core CPU execution for domains with millions of cells.

Supported GPU operations:

Hydrodynamics solver (Roe flux, state update, wet/dry tracking)
Courant condition (CFL time step computation)
ADE solute transport (concentration adjustment)

Requirements

NVIDIA GPU: Compute Capability 6.0+ (Pascal or newer)
CUDA Toolkit: Version 11.0 or later
GPU Driver: Compatible with the installed CUDA Toolkit version

Recommended GPUs
GPU	Compute Capability	VRAM	Typical Domain Size (cells)
GTX 1080 Ti	6.1	11 GB	Up to 2000x2000
RTX 2080 Ti	7.5	11 GB	Up to 3000x3000
RTX 3090	8.6	24 GB	Up to 5000x5000
A100	8.0	40/80 GB	10000x10000+
H100	9.0	80 GB	10000x10000+

Building with CUDA

# Standard CUDA build (regular mesh)
mkdir build && cd build
cmake -DMODE_release=ON -DUSE_CUDA=ON ..
make -j$(nproc)

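# CUDA with triangular mesh
# (assumes the triangular solver is enabled with -DUSE_TRIMESH=ON; confirm the option name in CMakeLists.txt)
cmake -DMODE_release=ON -DUSE_CUDA=ON -DUSE_TRIMESH=ON ..
make -j$(nproc)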

# Full-feature build
cmake -DMODE_release=ON -DUSE_CUDA=ON -DUSE_MPI=ON ..
make -j$(nproc)

Running with GPU

No special command-line flags are needed. When built with USE_CUDA, the GPU solver is used automatically:

# Single GPU
./bin/fluxos ./input/modset.json

# With MPI (multi-GPU, one GPU per MPI rank)
mpirun -np 2 ./bin/fluxos_mpi ./input/modset.json

Regular Mesh GPU Architecture

The regular mesh GPU solver uses a 2D CUDA thread grid matching the domain layout:

Thread block size: 16x16 (256 threads per block)
Grid size: (ceil(NCOLS/16), ceil(NROWS/16)) blocks, rounded up so that partial blocks cover the domain edges
Each thread processes one cell at (irow, icol)
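
The grid sizing above can be sketched on the host side as follows. This is a minimal illustration rather than the FLUXOS source: kBlockEdge, blocks_for, and in_domain are hypothetical names. Rounding up launches a few threads past the domain edge whenever NCOLS or NROWS is not a multiple of 16, so a kernel written this way would also need the bounds check shown here.

```cpp
// Hypothetical helper names; the 16x16 block shape comes from the text above.
constexpr unsigned kBlockEdge = 16;

// Blocks needed to cover n_cells along one axis (ceiling division).
constexpr unsigned blocks_for(unsigned n_cells) {
    return (n_cells + kBlockEdge - 1) / kBlockEdge;
}

// Guard a thread would apply before touching cell (irow, icol).
constexpr bool in_domain(unsigned irow, unsigned icol,
                         unsigned nrows, unsigned ncols) {
    return irow < nrows && icol < ncols;
}

// Compile-time checks for a 1500-row by 2000-column domain.
static_assert(blocks_for(2000) == 125, "2000 columns fill exactly 125 blocks");
static_assert(blocks_for(1500) == 94, "1500 rows need a 94th, partly filled block");
static_assert(1500 / kBlockEdge == 93, "plain integer division would drop the last 12 rows");
```

In a kernel, the two arguments passed to in_domain would typically be built from blockIdx, blockDim, and threadIdx.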

Kernels:

cuda_courant_condition: CFL time step with block-level parallel reduction
cuda_hydrodynamics_calc: Roe flux computation + state update
cuda_ade_adjust: Concentration adjustment for depth changes

Memory layout:

Device memory mirrors the host arma::Mat<double> layout
All fields (h, z, qx, qy, etc.) allocated as contiguous 2D arrays on GPU
Host-device transfers occur at forcing and output steps
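
Armadillo stores matrices in column-major order, so a device array that mirrors arma::Mat<double> would be addressed as irow + icol * nrows. The sketch below illustrates that indexing on the host without depending on Armadillo; flat_index and make_field are hypothetical names, not FLUXOS functions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: flat offset of cell (irow, icol) in a column-major
// array with nrows rows, the storage order used by arma::Mat<double>.
inline std::size_t flat_index(std::size_t irow, std::size_t icol,
                              std::size_t nrows) {
    return irow + icol * nrows;
}

// Fill a flat buffer in column-major order. Each value is 100*irow + icol,
// so an entry encodes the coordinates it was written for.
inline std::vector<double> make_field(std::size_t nrows, std::size_t ncols) {
    std::vector<double> field(nrows * ncols);
    for (std::size_t icol = 0; icol < ncols; ++icol) {
        for (std::size_t irow = 0; irow < nrows; ++irow) {
            field[flat_index(irow, icol, nrows)] = 100.0 * irow + icol;
        }
    }
    return field;
}
```

With this layout, cells that differ only in irow are adjacent in memory. A kernel would typically map threadIdx.x to irow so that neighbouring threads read neighbouring addresses.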

Triangular Mesh GPU Architecture

The triangular mesh GPU solver uses 1D thread indexing with 7 specialized kernels: six that run in sequence every time step, plus one that computes the CFL time step.

Per timestep execution order:
┌─────────────────────────────────────┐
│ 1. kernel_tri_wetdry     (cells)    │
│ 2. kernel_tri_gradient   (cells)    │
│ 3. kernel_tri_limiter    (cells)    │
│ 4. kernel_tri_edge_flux  (edges)    │ ← main compute kernel
│ 5. kernel_tri_accumulate (edges)    │
│ 6. kernel_tri_update     (cells)    │
└─────────────────────────────────────┘
CFL: kernel_tri_courant    (cells)

Thread indexing:

Cell kernels: 1 thread per cell, blockDim = 256, gridDim = (ncells + 255) / 256
Edge kernels: 1 thread per edge, blockDim = 256, gridDim = (nedges + 255) / 256

Race condition handling:

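Several edges share each cell, so edge threads can update the same cell at the same time. The flux accumulation kernel (step 5) uses atomicAdd so that none of these concurrent updates to the left and right cells are lost: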

// Each edge thread atomically adds its flux to both cells
atomicAdd(&d_dh[left_cell],  -flux_mass * dt / cell_area[left_cell]);
atomicAdd(&d_dh[right_cell], +flux_mass * dt / cell_area[right_cell]);

Device memory:

Mesh topology (read-only): flat arrays for edge connectivity, normals, cell geometry
Solution data (read-write): flat double* arrays indexed by cell/edge ID
Reduction buffer for CFL computation

CFL Reduction:

Block-level parallel reduction computes the global minimum time step:

Each thread computes its local CFL candidate
Shared memory reduction within each block finds the block minimum
Block minimums written to a reduction buffer
Host reads and reduces across blocks
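
The four steps above can be emulated on the CPU to show the data flow. This is a sketch under stated assumptions, not the FLUXOS kernel: block_minimums and global_min are hypothetical names, the per-cell candidates are taken as given, and the block size is assumed to be a power of two (256 in the text).

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Stage 1 (emulated): tree reduction inside each block of block_size
// candidates, as a kernel would do in shared memory.
// block_size must be a power of two.
inline std::vector<double> block_minimums(const std::vector<double>& dt_cell,
                                          std::size_t block_size) {
    const double inf = std::numeric_limits<double>::infinity();
    const std::size_t nblocks = (dt_cell.size() + block_size - 1) / block_size;
    std::vector<double> block_min(nblocks);
    std::vector<double> shared(block_size);
    for (std::size_t b = 0; b < nblocks; ++b) {
        // Threads past the end of the array contribute +infinity.
        for (std::size_t t = 0; t < block_size; ++t) {
            const std::size_t i = b * block_size + t;
            shared[t] = (i < dt_cell.size()) ? dt_cell[i] : inf;
        }
        // Halve the active range on each pass. On the GPU a
        // __syncthreads() barrier would separate the passes.
        for (std::size_t stride = block_size / 2; stride > 0; stride /= 2) {
            for (std::size_t t = 0; t < stride; ++t) {
                shared[t] = std::min(shared[t], shared[t + stride]);
            }
        }
        block_min[b] = shared[0];  // one value per block in the reduction buffer
    }
    return block_min;
}

// Stage 2: the host reduces the per-block minimums to the global time step.
inline double global_min(const std::vector<double>& block_min) {
    if (block_min.empty()) {
        return std::numeric_limits<double>::infinity();
    }
    return *std::min_element(block_min.begin(), block_min.end());
}
```

The host-side pass is cheap because the reduction buffer holds one value per block: about 3,900 values for a million cells at 256 threads per block.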

Performance Guidelines

Maximizing GPU performance:

Large domains benefit most: GPU overhead is amortized over more cells
Minimize host-device transfers: Only transfer at forcing and output steps
Batch chemical species: Process all species in sequence on GPU before transferring back
Use pinned memory: For large domains, pinned (page-locked) host memory improves transfer speed

Expected speedup factors:

Domain Size (cells)	CPU baseline (OpenMP, 8 threads)	GPU speedup (RTX 3090)	Benefit
500x500	1x	3-5x	Moderate
2000x2000	1x	10-20x	Good
5000x5000	1x	20-40x	Excellent
10000x10000	1x	30-50x	Excellent

Note

Actual speedup depends on GPU model, domain complexity, and wet fraction. Domains with large dry regions may show lower speedup due to early-exit optimizations in the CPU code.

Multi-GPU with MPI

For multi-GPU execution, combine USE_CUDA with USE_MPI. Each MPI rank uses one GPU:

# Build
cmake -DMODE_release=ON -DUSE_CUDA=ON -DUSE_MPI=ON ..
make -j$(nproc)

# Run on 4 GPUs
mpirun -np 4 ./bin/fluxos_mpi ./input/modset.json

SLURM example for multi-GPU (2 nodes with 4 GPUs each, 8 MPI ranks in total):

#!/bin/bash
#SBATCH --job-name=fluxos_gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=12:00:00

module load cuda/11.8
module load openmpi/4.1.1

srun ./build/bin/fluxos_mpi ./input/modset.json
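
With one rank per GPU, each rank has to pick a distinct device on its node. How FLUXOS does this is not stated here, so the following is only a sketch of a common scheme. Both helper names are hypothetical, and ranks are assumed to be placed node by node with the same count on every node.

```cpp
// Hypothetical helpers; how FLUXOS itself selects a device is not stated here.

// Index of a rank within its node, assuming ranks are placed node by node
// (block distribution) with the same number of ranks on every node.
inline int local_rank_of(int world_rank, int ranks_per_node) {
    return world_rank % ranks_per_node;
}

// Device index a rank would pass to cudaSetDevice(). Wraps around if a node
// runs more ranks than it has GPUs, in which case ranks share devices.
inline int device_for_rank(int local_rank, int gpus_per_node) {
    return local_rank % gpus_per_node;
}
```

For the script above (8 ranks, 4 per node, 4 GPUs per node), world ranks 4 to 7 would map to devices 0 to 3 on the second node. Under srun the local rank is also available in the SLURM_LOCALID environment variable, which avoids the placement assumption.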

Troubleshooting

“CUDA error: no CUDA-capable device is detected”:

# Check GPU visibility
nvidia-smi
echo $CUDA_VISIBLE_DEVICES

“CUDA error: out of memory”:

Reduce domain size or use a GPU with more VRAM
For triangular meshes: reduce mesh cell count
Check for memory leaks with compute-sanitizer --leak-check full (this tool replaces cuda-memcheck, which was removed in CUDA 12)

Poor GPU performance:

Enable persistence mode so the driver stays initialized between runs (requires root): nvidia-smi -pm 1
Check GPU utilization: nvidia-smi dmon
Verify you’re using a release build (debug builds are much slower on GPU)

Build errors with nvcc:

Ensure the GPU driver is recent enough for the installed CUDA Toolkit: compare nvcc --version with the CUDA version reported by nvidia-smi
Check compute capability: nvidia-smi --query-gpu=compute_cap --format=csv