High-Performance Computing (HPC)
==================================

This section describes how to build and run FLUXOS on HPC clusters using hybrid MPI+OpenMP parallelization, optionally combined with CUDA GPU acceleration.

Overview
--------

FLUXOS supports five parallelization modes:

1. **OpenMP only**: For workstations and small domains
2. **MPI only**: For distributed memory systems
3. **Hybrid MPI+OpenMP**: For scalability on HPC clusters
4. **CUDA GPU**: For maximum single-node performance with NVIDIA GPUs (recommended for large domains)
5. **Hybrid MPI+OpenMP+CUDA**: For maximum scalability on GPU-equipped HPC clusters (recommended)

The hybrid approach uses MPI for communication between nodes, OpenMP for parallelism within each node, and CUDA for GPU offloading of compute-intensive kernels.

Building for HPC
----------------

Prerequisites
^^^^^^^^^^^^^

* MPI implementation (OpenMPI, MPICH, or Intel MPI)
* OpenMP-enabled compiler (GCC 7+, Intel C++ 18+, or Clang 8+)
* Armadillo 9.9+
* CMake 3.10+
* HDF5 (optional, for parallel output)

Build Commands
^^^^^^^^^^^^^^

.. code-block:: bash

   # Create build directory
   mkdir build && cd build

   # Configure with MPI support
   cmake -DMODE_release=ON -DUSE_MPI=ON ..

   # Build
   make -j8

   # The executable will be build/bin/fluxos_mpi (relative to the repository root)

**With CUDA GPU acceleration:**

.. code-block:: bash

   cmake -DMODE_release=ON -DUSE_MPI=ON -DUSE_CUDA=ON ..
   make -j8

**With triangular mesh + GPU + MPI:**

.. code-block:: bash

   cmake -DMODE_release=ON -DUSE_MPI=ON -DUSE_CUDA=ON ..
   make -j8

For module-based HPC systems:

.. code-block:: bash

   # Load required modules (adjust for your system)
   module load gcc/11.2.0
   module load openmpi/4.1.1
   module load armadillo/11.0
   module load cmake/3.20

   # Build
   mkdir build && cd build
   cmake -DMODE_release=ON -DUSE_MPI=ON ..
   make -j8

Running on HPC Clusters
-----------------------

SLURM Job Script
^^^^^^^^^^^^^^^^

Example SLURM script for running FLUXOS on a cluster:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=fluxos_mpi
   #SBATCH --output=fluxos_%j.out
   #SBATCH --error=fluxos_%j.err
   #SBATCH --nodes=4
   #SBATCH --ntasks-per-node=32
   #SBATCH --cpus-per-task=2
   #SBATCH --time=24:00:00
   #SBATCH --partition=compute
   #SBATCH --account=your_account

   # Load modules
   module purge
   module load gcc/11.2.0
   module load openmpi/4.1.1
   module load armadillo/11.0

   # Set OpenMP threads per MPI task
   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   export OMP_PROC_BIND=close
   export OMP_PLACES=cores

   # Run FLUXOS
   srun --mpi=pmix ./build/bin/fluxos_mpi ./input/modset.json

SLURM GPU Job Script
^^^^^^^^^^^^^^^^^^^^

Example SLURM script for GPU-accelerated runs:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=fluxos_gpu
   #SBATCH --output=fluxos_%j.out
   #SBATCH --error=fluxos_%j.err
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=4
   #SBATCH --gres=gpu:4
   #SBATCH --cpus-per-task=2
   #SBATCH --time=24:00:00
   #SBATCH --partition=gpu

   # Load modules
   module purge
   module load gcc/11.2.0
   module load cuda/11.8
   module load openmpi/4.1.1
   module load armadillo/11.0

   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

   # Run (each MPI rank uses one GPU)
   srun --mpi=pmix ./build/bin/fluxos_mpi ./input/modset.json

PBS Job Script
^^^^^^^^^^^^^^

Example PBS script:

.. code-block:: bash

   #!/bin/bash
   #PBS -N fluxos_mpi
   #PBS -l nodes=4:ppn=32
   #PBS -l walltime=24:00:00
   #PBS -q batch

   cd $PBS_O_WORKDIR

   # Load modules
   module load gcc openmpi armadillo

   # Set OpenMP threads
   export OMP_NUM_THREADS=2
   export OMP_PROC_BIND=close

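   # Run: 4 nodes x 32 cores / 2 threads per rank = 64 ranks (16 per node).
   # The --map-by option is Open MPI syntax; adjust for other MPI libraries.
   mpirun -np 64 --map-by ppr:16:node:PE=2 ./build/bin/fluxos_mpi ./input/modset.json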
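
A quick arithmetic check can catch oversubscription before a job is submitted: the MPI ranks per node multiplied by the OpenMP threads per rank should not exceed the cores per node. The sketch below is illustrative only; ``check_layout`` is a hypothetical helper and not part of FLUXOS.

.. code-block:: bash

   # check_layout CORES_PER_NODE RANKS_PER_NODE THREADS_PER_RANK
   # Hypothetical helper: reports whether a hybrid layout fits on one node.
   check_layout() {
       cores=$1; ranks=$2; threads=$3
       used=$((ranks * threads))
       if [ "$used" -gt "$cores" ]; then
           echo "oversubscribed: $used threads on $cores cores"
           return 1
       fi
       echo "ok: $used threads on $cores cores"
   }

   # 32-core node, 16 ranks with 2 threads each
   check_layout 32 16 2
   # 32-core node, 32 ranks with 2 threads each
   check_layout 32 32 2 || true

The first call fits exactly; the second requests 64 threads on 32 cores and is reported as oversubscribed.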

Scalability Guidelines
----------------------

Use this table to select appropriate parallelization for your domain:

.. list-table::
   :widths: 20 20 15 15 30
   :header-rows: 1

   * - Domain Size
     - Mode
     - MPI Procs
     - OMP Threads
     - Notes
   * - < 500x500
     - OpenMP or CUDA
     - 1
     - 4-8
     - Single GPU provides best speedup
   * - 500x500 - 2000x2000
     - CUDA or Hybrid
     - 1-4
     - 2-4
     - GPU recommended; MPI for multi-GPU
   * - 2000x2000 - 5000x5000
     - Hybrid+CUDA
     - 4-16
     - 2-4
     - Multi-GPU nodes recommended
   * - > 5000x5000
     - Hybrid+CUDA
     - 16-64
     - 2-4
     - Large-scale HPC with GPU nodes

**Triangular mesh considerations:**

For unstructured triangular meshes, the domain decomposition uses graph-based partitioning rather than 2D Cartesian decomposition. This provides better load balance on irregular domains but requires METIS for optimal partitioning. Without METIS, naive block partitioning is used as a fallback.

Domain Decomposition
--------------------

**Regular Mesh:**

FLUXOS uses 2D Cartesian domain decomposition:

* The global domain is automatically divided among MPI processes
* ``MPI_Cart_create`` establishes the process topology
* Each process computes a local subdomain with ghost cells
* Ghost cells are exchanged at each time step

.. code-block:: text

   Global Domain (1000 x 1000 cells)
   ┌─────────────┬─────────────┐
   │  Process 0  │  Process 1  │
   │  500x500    │  500x500    │
   ├─────────────┼─────────────┤
   │  Process 2  │  Process 3  │
   │  500x500    │  500x500    │
   └─────────────┴─────────────┘
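
The arithmetic behind such a split can be sketched in a few lines of shell. This is a minimal illustration under stated assumptions, not the FLUXOS implementation: ``grid_dims`` and ``local_extent`` are hypothetical helpers, and the rule for distributing remainder cells is an assumption. In an MPI code the factorization is usually delegated to ``MPI_Dims_create``.

.. code-block:: bash

   # grid_dims NPROCS -> prints "PX PY", the most nearly square factor pair
   grid_dims() {
       n=$1
       px=1
       i=1
       while [ $((i * i)) -le "$n" ]; do
           if [ $((n % i)) -eq 0 ]; then px=$i; fi
           i=$((i + 1))
       done
       echo "$px $((n / px))"
   }

   # local_extent NCELLS NPARTS COORD -> prints "START COUNT" along one axis.
   # Assumption: remainder cells go to the lowest-numbered coordinates.
   local_extent() {
       n=$1; parts=$2; coord=$3
       base=$((n / parts))
       rem=$((n % parts))
       if [ "$coord" -lt "$rem" ]; then
           echo "$((coord * (base + 1))) $((base + 1))"
       else
           echo "$((rem * (base + 1) + (coord - rem) * base)) $base"
       fi
   }

   grid_dims 4            # 2 2
   local_extent 1000 2 1  # 500 500
   local_extent 1000 3 0  # 0 334

With 4 processes and 1000 cells per axis this reproduces the 2 x 2 layout of 500 x 500 subdomains shown in the diagram.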

Ghost Cell Exchange
^^^^^^^^^^^^^^^^^^^

Each subdomain maintains ghost cells (halo regions) from neighboring processes:

.. code-block:: text

   ┌─────────────────────────────┐
   │  Ghost cells (from north)   │
   ├───┬───────────────────┬─────┤
   │ G │                   │ G   │
   │ h │   Local domain    │ h   │
   │ o │                   │ o   │
   │ s │                   │ s   │
   │ t │                   │ t   │
   ├───┴───────────────────┴─────┤
   │  Ghost cells (from south)   │
   └─────────────────────────────┘
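
The cost of the halo grows relative to the interior as subdomains shrink. The sketch below counts the ghost cells around a subdomain, assuming a halo one cell wide on every side (corners included); the actual halo width used by FLUXOS is not stated here, and ``halo_cells`` is a hypothetical helper.

.. code-block:: bash

   # halo_cells NX NY WIDTH -> ghost cells surrounding an NX x NY subdomain
   halo_cells() {
       nx=$1; ny=$2; w=$3
       echo $(( (nx + 2 * w) * (ny + 2 * w) - nx * ny ))
   }

   halo_cells 500 500 1   # 2004
   halo_cells 125 125 1   # 504

For the 1000 x 1000 example, 4 ranks give 2004 ghost cells per 250000 interior cells (about 0.8%), whereas 64 ranks give 504 per 15625 (about 3.2%), which is one reason communication overhead rises with process count.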

**Triangular Mesh Decomposition:**

For unstructured triangular meshes, domain decomposition uses graph-based partitioning:

* **METIS partitioning** (preferred): ``METIS_PartGraphKway`` on the cell adjacency graph
* **Naive block partitioning** (fallback): Sequential cell IDs divided among ranks
* **Halo cells**: Cells across partition-boundary edges are exchanged
* **Communication**: ``MPI_Isend``/``MPI_Irecv`` per neighbor rank
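
The naive fallback can be pictured as cutting the list of cell IDs into contiguous blocks. The following sketch shows one way to map a cell ID to its owning rank; it assumes zero-based IDs and that leftover cells go to the lowest ranks, neither of which is confirmed for FLUXOS. ``block_owner`` is a hypothetical helper.

.. code-block:: bash

   # block_owner CELL NCELLS NRANKS -> rank that owns a zero-based cell ID
   block_owner() {
       cell=$1; ncells=$2; nranks=$3
       base=$((ncells / nranks))
       rem=$((ncells % nranks))
       split=$((rem * (base + 1)))   # cells held by the first REM ranks
       if [ "$cell" -lt "$split" ]; then
           echo $((cell / (base + 1)))
       else
           echo $((rem + (cell - split) / base))
       fi
   }

   block_owner 0 10 3   # 0
   block_owner 4 10 3   # 1
   block_owner 9 10 3   # 2

Because consecutive IDs are not necessarily neighbors in an unstructured mesh, blocks formed this way can have long partition boundaries, which is why graph partitioning is preferred.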

CUDA GPU on HPC Clusters
^^^^^^^^^^^^^^^^^^^^^^^^^

When using CUDA on HPC clusters:

* Each MPI rank is assigned one GPU
* Host-device transfers occur at forcing and output steps
* CFL reduction uses device-side block reduction followed by host-side MPI reduction
* For multi-GPU nodes, use ``CUDA_VISIBLE_DEVICES`` or let MPI rank assignment handle GPU selection
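
One common way to set ``CUDA_VISIBLE_DEVICES`` per rank is a small wrapper script that derives the device index from the node-local rank. The sketch below assumes SLURM, which exports ``SLURM_LOCALID`` for each task; ``GPUS_PER_NODE`` and ``gpu_for_rank`` are hypothetical names introduced for illustration. If FLUXOS already selects a device from the rank, such a wrapper is unnecessary.

.. code-block:: bash

   # gpu_for_rank LOCAL_RANK GPUS_PER_NODE -> device index for this rank
   gpu_for_rank() {
       echo $(( $1 % $2 ))
   }

   # SLURM_LOCALID is the task index within its node; fall back to
   # rank 0 and a single GPU when run outside a job.
   export CUDA_VISIBLE_DEVICES=$(gpu_for_rank "${SLURM_LOCALID:-0}" "${GPUS_PER_NODE:-1}")
   echo "local rank ${SLURM_LOCALID:-0} -> GPU ${CUDA_VISIBLE_DEVICES}"

To use it as a launcher, append ``exec "$@"`` to the script and start the run as ``srun ./select_gpu.sh ./build/bin/fluxos_mpi ./input/modset.json``, where ``select_gpu.sh`` is an example file name. Each rank then sees only its assigned GPU, which appears to the application as device 0.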

Performance Optimization
------------------------

OpenMP Settings
^^^^^^^^^^^^^^^

For optimal OpenMP performance:

.. code-block:: bash

   # Bind threads to cores
   export OMP_PROC_BIND=close
   export OMP_PLACES=cores

   # Set number of threads (typically 2-4 for hybrid)
   export OMP_NUM_THREADS=2

MPI Settings
^^^^^^^^^^^^

For OpenMPI:

.. code-block:: bash

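   # Choose ONE of the settings below; a later export overrides an earlier one.

   # Multi-node runs without InfiniBand: exclude the openib transport
   export OMPI_MCA_btl=^openib

   # Single-node runs only: restrict transports to shared memory (vader) and self
   # export OMPI_MCA_btl=vader,self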

For Intel MPI:

.. code-block:: bash

   # Shared memory within a node, OFI (libfabric) between nodes
   export I_MPI_FABRICS=shm:ofi

Memory Considerations
^^^^^^^^^^^^^^^^^^^^^

* Each MPI process allocates memory for its local subdomain plus ghost cells
* For very large domains, ensure sufficient memory per node
* Consider using fewer MPI processes with more OpenMP threads to reduce memory overhead
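
A rough per-rank estimate helps when choosing how many ranks to place on a node. The sketch below assumes one ghost layer, 8-byte double-precision values, and a caller-supplied number of stored fields; the value 20 used in the example is a placeholder and not the actual number of FLUXOS state arrays. ``rank_mem_mib`` is a hypothetical helper.

.. code-block:: bash

   # rank_mem_mib NX NY PX PY NFIELDS -> approximate MiB per rank
   # Assumes one ghost layer and 8-byte (double precision) values.
   rank_mem_mib() {
       nx=$1; ny=$2; px=$3; py=$4; nfields=$5
       lx=$(( (nx + px - 1) / px + 2 ))   # largest local width plus ghosts
       ly=$(( (ny + py - 1) / py + 2 ))
       echo $(( lx * ly * nfields * 8 / 1048576 ))
   }

   # 5000 x 5000 domain on a 4 x 4 process grid, 20 fields (placeholder)
   rank_mem_mib 5000 5000 4 4 20   # 239

Multiplying the result by the number of ranks per node gives a lower bound on the memory a node needs, before MPI buffers and output staging are counted.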

Parallel Output
---------------

FLUXOS supports two output modes for parallel runs:

**Gathered Output (Default)**

* All data gathered to root process for writing
* Simpler file format, single output file
* Suitable for moderate domain sizes

**Parallel Output**

* Each process writes its own portion
* Creates multiple files plus a manifest
* Better scalability for very large domains

Output files are named:

.. code-block:: text

   Results/
   ├── 3600.txt              # Gathered output (single file)
   ├── 3600_rank0.txt        # Parallel output (per-process)
   ├── 3600_rank1.txt
   ├── 3600_manifest.txt     # Manifest listing all files
   └── ...
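
When parallel output is used, it can be worth confirming that every rank wrote its file for a given output step before post-processing. The sketch below relies only on the ``<step>_rank<N>.txt`` naming shown above; ``missing_ranks`` is a hypothetical helper, and the manifest is deliberately not parsed because its format is not described here.

.. code-block:: bash

   # missing_ranks DIR STEP NRANKS -> prints ranks with no output file
   missing_ranks() {
       dir=$1; step=$2; nranks=$3
       r=0
       while [ "$r" -lt "$nranks" ]; do
           [ -f "$dir/${step}_rank${r}.txt" ] || echo "$r"
           r=$((r + 1))
       done
   }

   # Report any of 4 ranks that did not write output step 3600
   missing_ranks Results 3600 4

An empty result means all per-process files for that step are present.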

Troubleshooting
---------------

Common Issues
^^^^^^^^^^^^^

**MPI not found during build:**

.. code-block:: bash

   # Ensure MPI is in PATH
   which mpicc
   which mpicxx

   # Set CC and CXX if needed
   export CC=mpicc
   export CXX=mpicxx

**Poor scaling:**

* Check that ghost cell exchange is not dominating
* Ensure load balance (equal subdomain sizes)
* Verify network bandwidth (InfiniBand recommended)
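
Scaling is easier to judge with numbers than by impression. The sketch below computes speedup and parallel efficiency from two wall-clock times, taking a single-process run as the baseline; ``scaling`` is a hypothetical helper and the timings in the example are made up for illustration.

.. code-block:: bash

   # scaling T1 TP P -> speedup and efficiency of a P-process run (time TP)
   # relative to a single-process run (time T1)
   scaling() {
       awk -v t1="$1" -v tp="$2" -v p="$3" 'BEGIN {
           s = t1 / tp
           printf "speedup=%.2f efficiency=%.2f\n", s, s / p
       }'
   }

   scaling 100 31.25 4   # speedup=3.20 efficiency=0.80

Efficiency that falls steadily as processes are added usually points to communication or load imbalance rather than to the compute kernels.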

**Memory errors:**

* Reduce number of MPI processes
* Increase OpenMP threads per process
* Check for memory leaks with Valgrind

**Segmentation faults:**

* Ensure consistent MPI library between build and run
* Check Armadillo is compiled with same compiler
* Verify input file format

Profiling
^^^^^^^^^

For performance analysis:

.. code-block:: bash

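   # With Intel VTune: launch vtune under srun so that each rank is profiled
   # (the result directory name is an example)
   srun vtune -collect hotspots -result-dir vtune_results -- ./build/bin/fluxos_mpi input.json

   # With Scalasca (requires a build instrumented with Score-P)
   scalasca -analyze srun ./build/bin/fluxos_mpi input.json

   # With Arm MAP
   map --profile srun ./build/bin/fluxos_mpi input.json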

Best Practices
--------------

1. **Start small**: Test with few MPI processes before scaling up
2. **Monitor load balance**: Ensure all processes finish at similar times
3. **Use hybrid mode**: Typically 2-4 OpenMP threads per MPI process works best
4. **Check I/O**: Parallel I/O may become a bottleneck for frequent output
5. **Verify results**: Compare small-domain results with serial version
6. **Use restart files**: For long simulations, implement checkpoint/restart