CUDA GPU Acceleration
==================================

This section describes how to build and run FLUXOS with CUDA GPU acceleration for maximum performance on large-scale simulations.

Overview
--------

FLUXOS supports CUDA GPU acceleration for both regular Cartesian and unstructured triangular mesh solvers. GPU offloading can provide 10-50x speedup over single-core CPU execution for domains with millions of cells.

**Supported GPU operations:**

* Hydrodynamics solver (Roe flux, state update, wet/dry tracking)
* Courant condition (CFL time step computation)
* ADE solute transport (concentration adjustment)

Requirements
------------

* **NVIDIA GPU**: Compute Capability 6.0+ (Pascal or newer)
* **CUDA Toolkit**: Version 11.0 or later
* **GPU Driver**: Compatible with the installed CUDA Toolkit version

.. list-table:: Recommended GPUs
   :widths: 30 20 25 25
   :header-rows: 1

   * - GPU
     - Compute Capability
     - VRAM
     - Typical Domain Size
   * - GTX 1080 Ti
     - 6.1
     - 11 GB
     - Up to 2000x2000
   * - RTX 2080 Ti
     - 7.5
     - 11 GB
     - Up to 3000x3000
   * - RTX 3090
     - 8.6
     - 24 GB
     - Up to 5000x5000
   * - A100
     - 8.0
     - 40/80 GB
     - 10000x10000+
   * - H100
     - 9.0
     - 80 GB
     - 10000x10000+

Building with CUDA
------------------

.. code-block:: bash

   # Standard CUDA build (regular mesh)
   mkdir build && cd build
   cmake -DMODE_release=ON -DUSE_CUDA=ON ..
   make -j$(nproc)

   # CUDA with triangular mesh
   cmake -DMODE_release=ON -DUSE_CUDA=ON ..
   make -j$(nproc)

   # Full-feature build
   cmake -DMODE_release=ON -DUSE_CUDA=ON -DUSE_MPI=ON ..
   make -j$(nproc)

Running with GPU
----------------

No special command-line flags are needed. When built with ``USE_CUDA``, the GPU solver is used automatically:

.. code-block:: bash

   # Single GPU
   ./bin/fluxos ./input/modset.json

   # With MPI (multi-GPU, one GPU per MPI rank)
   mpirun -np 2 ./bin/fluxos_mpi ./input/modset.json

Regular Mesh GPU Architecture
-----------------------------

The regular mesh GPU solver uses a 2D CUDA thread grid matching the domain layout:

* **Thread block size**: 16x16 (256 threads per block)
* **Grid size**: ``(NCOLS/16, NROWS/16)`` blocks
* Each thread processes one cell at ``(irow, icol)``

**Kernels:**

1. ``cuda_courant_condition``: CFL time step with block-level parallel reduction
2. ``cuda_hydrodynamics_calc``: Roe flux computation + state update
3. ``cuda_ade_adjust``: Concentration adjustment for depth changes

**Memory layout:**

* Device memory mirrors the host ``arma::Mat<double>`` layout
* All fields (h, z, qx, qy, etc.) allocated as contiguous 2D arrays on GPU
* Host-device transfers occur at forcing and output steps

Triangular Mesh GPU Architecture
---------------------------------

The triangular mesh GPU solver uses 1D thread indexing with 7 specialized kernels:

.. code-block:: text

   Per timestep execution order:
   ┌─────────────────────────────────────┐
   │ 1. kernel_tri_wetdry     (cells)    │
   │ 2. kernel_tri_gradient   (cells)    │
   │ 3. kernel_tri_limiter    (cells)    │
   │ 4. kernel_tri_edge_flux  (edges)  ← main compute kernel
   │ 5. kernel_tri_accumulate (edges)    │
   │ 6. kernel_tri_update     (cells)    │
   └─────────────────────────────────────┘
   CFL: kernel_tri_courant    (cells)

**Thread indexing:**

* Cell kernels: 1 thread per cell, ``blockDim = 256``, ``gridDim = (ncells + 255) / 256``
* Edge kernels: 1 thread per edge, ``blockDim = 256``, ``gridDim = (nedges + 255) / 256``

**Race condition handling:**

The flux accumulation kernel (step 5) uses ``atomicAdd`` to safely add edge fluxes to the left and right cells concurrently:

.. code-block:: text

   // Each edge thread atomically adds its flux to both cells
   atomicAdd(&d_dh[left_cell],  -flux_mass * dt / cell_area[left_cell]);
   atomicAdd(&d_dh[right_cell], +flux_mass * dt / cell_area[right_cell]);

**Device memory:**

* Mesh topology (read-only): flat arrays for edge connectivity, normals, cell geometry
* Solution data (read-write): flat ``double*`` arrays indexed by cell/edge ID
* Reduction buffer for CFL computation

**CFL Reduction:**

Block-level parallel reduction computes the global minimum time step:

1. Each thread computes its local CFL candidate
2. Shared memory reduction within each block finds the block minimum
3. Block minimums written to a reduction buffer
4. Host reads and reduces across blocks

Performance Guidelines
----------------------

**Maximizing GPU performance:**

1. **Large domains benefit most**: GPU overhead is amortized over more cells
2. **Minimize host-device transfers**: Only transfer at forcing and output steps
3. **Batch chemical species**: Process all species in sequence on GPU before transferring back
4. **Use pinned memory**: For large domains, pinned (page-locked) host memory improves transfer speed

**Expected speedup factors:**

.. list-table::
   :widths: 25 25 25 25
   :header-rows: 1

   * - Domain Size
     - CPU (OpenMP 8T)
     - GPU (RTX 3090)
     - Speedup
   * - 500x500
     - 1x
     - 3-5x
     - Moderate
   * - 2000x2000
     - 1x
     - 10-20x
     - Good
   * - 5000x5000
     - 1x
     - 20-40x
     - Excellent
   * - 10000x10000
     - 1x
     - 30-50x
     - Excellent

.. note::

   Actual speedup depends on GPU model, domain complexity, and wet fraction. Domains with large dry regions may show lower speedup due to early-exit optimizations in the CPU code.

Multi-GPU with MPI
------------------

For multi-GPU execution, combine ``USE_CUDA`` with ``USE_MPI``. Each MPI rank uses one GPU:

.. code-block:: bash

   # Build
   cmake -DMODE_release=ON -DUSE_CUDA=ON -DUSE_MPI=ON ..
   make -j$(nproc)

   # Run on 4 GPUs
   mpirun -np 4 ./bin/fluxos_mpi ./input/modset.json

**SLURM example for multi-GPU:**

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=fluxos_gpu
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=4
   #SBATCH --gres=gpu:4
   #SBATCH --time=12:00:00

   module load cuda/11.8
   module load openmpi/4.1.1

   srun ./build/bin/fluxos_mpi ./input/modset.json

Troubleshooting
---------------

**"CUDA error: no CUDA-capable device is detected":**

.. code-block:: bash

   # Check GPU visibility
   nvidia-smi
   echo $CUDA_VISIBLE_DEVICES

**"CUDA error: out of memory":**

* Reduce domain size or use a GPU with more VRAM
* For triangular meshes: reduce mesh cell count
* Check for memory leaks with ``cuda-memcheck``

**Poor GPU performance:**

* Ensure the GPU is not in power-saving mode: ``nvidia-smi -pm 1``
* Check GPU utilization: ``nvidia-smi dmon``
* Verify you're using a release build (debug builds are much slower on GPU)

**Build errors with nvcc:**

* Ensure CUDA Toolkit version matches your GPU driver
* Check compute capability: ``nvidia-smi --query-gpu=compute_cap --format=csv``