Kokkos MPI (Message Passing Interface) and PGAS (Partitioned Global Address Space)
1. Introduction
In the realm of high-performance computing, the integration of internode communication models with intranode parallelism frameworks has become increasingly crucial. This synergy is exemplified by the combination of MPI (Message Passing Interface) and PGAS (Partitioned Global Address Space) with Kokkos, a performance portability ecosystem for manycore architectures.
2. Internode Communication
The integration of MPI concept inside Kokkos represents a powerful approach to hybrid programming, leveraging the strengths of both paradigms. MPI excels in distributed memory parallelism, while Kokkos shines in shared memory parallelism and performance portability across diverse architectures [1].
When writing a hybrid MPI-Kokkos program, one of the primary considerations is data transfer between MPI ranks. Kokkos Views, the library’s multidimensional array abstraction, can be seamlessly integrated with MPI communications. To send data from a Kokkos View, one simply needs to pass the View’s data pointer and size to MPI functions [2]. For example:
Kokkos::View<double*> myView("MyView", 1000);
MPI_Send(myView.data(), myView.size(), MPI_DOUBLE, dest, tag, comm);
This straightforward approach works because Kokkos ensures that View data is contiguous in memory, aligning perfectly with MPI’s expectations [2].
A key optimization in hybrid MPI-Kokkos programs is the overlapping of communication and computation. This can be achieved by leveraging Kokkos' execution spaces and MPI’s non-blocking communication primitives. For example:
auto future = Kokkos::parallel_for(policy, KOKKOS_LAMBDA(int i) {
// Computation kernel
});
MPI_Request request;
MPI_Isend(data, count, MPI_INT, dest, tag, comm, &request);
future.wait(); // Wait for computation to complete
MPI_Wait(&request, MPI_STATUS_IGNORE); // Wait for communication to complete
This pattern allows the computation to proceed concurrently with the MPI communication, potentially masking latency and improving overall performance [1].
Buffer packing strategies play a crucial role in optimizing MPI communication, especially when dealing with non-contiguous data. Kokkos provides efficient mechanisms for packing and unpacking data. One approach is to use Kokkos parallel_for to pack data into a contiguous buffer before sending:
Kokkos::View<double*> sendBuffer("SendBuffer", count);
Kokkos::parallel_for(count, KOKKOS_LAMBDA(int i) {
sendBuffer(i) = computeValue(i);
});
MPI_Send(sendBuffer.data(), count, MPI_DOUBLE, dest, tag, comm);
This method ensures efficient memory access patterns and can leverage the full parallelism of the underlying hardware.
For sparse communication patterns, generating efficient index lists is crucial. Kokkos can assist in this process through its parallel algorithms. For instance, to create a list of indices for non-zero elements:
Kokkos::View<int*> indexList("IndexList", n);
Kokkos::parallel_scan(n, KOKKOS_LAMBDA(int i, int& update, bool final) {
if (data(i) != 0) {
if (final) indexList(update) = i;
++update;
}
});
This approach efficiently generates a compact list of relevant indices, which can then be used to optimize MPI communications for sparse data structures.
3. Kokkos Remote Spaces: PGAS Support
*PGAS* (Partitioned Global Address Space) models are gaining traction, particularly with the advent of "super-node" architectures and evolving network infrastructures [3][4]. Kokkos Remote Spaces extends the Kokkos ecosystem to embrace this paradigm, offering a bridge between shared and distributed memory programming models.
PGAS enables Kokkos to provide a global view of data for convenient multi-GPU, multi-node, and multi-device programming. PGAS provides a high-level abstraction for remote memory accesses, simplifying distributed programming for developers using Kokkos. Kokkos Remote Spaces supports multiple PGAS backends, including SHMEM, NVSHMEM, ROCSHMEM, and MPI One-side, providing flexibility for different types of systems and architectures. PGAS implementations are optimized for high-performance communications, which is crucial for the scientific computing applications that Kokkos targets. Using PGAS allows Kokkos to maintain its philosophy of performance portability across different architectures, from CPUs to GPUs. By using PGAS, Kokkos can offer efficient and portable distributed programming, while maintaining a consistent programming interface with the rest of the Kokkos ecosystem.
To write a PGAS application with Kokkos, developers can utilize the Kokkos Remote Spaces extension. This extension introduces new memory spaces that return data handles with PGAS semantics. Creating a global View in this context is straightforward:
Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Experimental::NVShmemSpace> globalView("GlobalView", N, M);
This declaration creates a two-dimensional View that spans across multiple processing elements in a PGAS model.
Accessing global data in a PGAS model requires careful consideration of data locality and communication costs. Kokkos Remote Spaces provides abstractions that simplify this process. For example, accessing an element of the global View might look like this:
auto element = globalView(i, j);
Behind the scenes, Kokkos handles the necessary communication to fetch or update the data, abstracting away the complexities of the underlying PGAS implementation.
A prime example of PGAS applications is the Sparse Matrix-Vector Multiplication (SpMV) operation, a key component of the Conjugate Gradient (CG) method. In a PGAS model using Kokkos Remote Spaces, the vector becomes distributed, while the sparse matrix stores global indices. This approach allows for efficient parallel computation across multiple nodes.
The implementation of SpMV in this context might involve:
-
Distributing the vector across processing elements.
-
Storing the sparse matrix with global indices.
-
Performing local computations using Kokkos parallel constructs.
-
Utilizing PGAS operations for necessary remote data accesses.
This strategy can lead to significant performance improvements, especially for large-scale problems that exceed the memory capacity of a single node.
Example
Kokkos::initialize(argc, argv);
{
using ExecSpace = Kokkos::Cuda;
using RemoteSpace = Kokkos::Experimental::NVShmemSpace;
using RemoteView = Kokkos::View<double*, Kokkos::LayoutLeft, RemoteSpace>;
const int N = 1000;
RemoteView remote_data("RemoteData", N);
Kokkos::parallel_for("InitializeData", Kokkos::RangePolicy<ExecSpace>(0, N),
KOKKOS_LAMBDA(const int i) {
remote_data(i) = static_cast<double>(i);
});
Kokkos::fence();
double sum = 0.0;
Kokkos::parallel_reduce("SumData", Kokkos::RangePolicy<ExecSpace>(0, N),
KOKKOS_LAMBDA(const int i, double& lsum) {
lsum += remote_data(i);
}, sum);
Kokkos::fence();
printf("Sum of remote data: %f\n", sum);
}
Kokkos::finalize();
Explanations:
Using Kokkos::Experimental::NVShmemSpace as a remote memory space. Creating a RemoteView using NVShmemSpace. Initializing the remote data using a parallel_for on the CUDA runspace. Computing the sum of the remote data with a parallel_reduce. Using Kokkos::fence() to ensure synchronization between remote operations.
This code demonstrates how Kokkos Remote Spaces allows using NVSHMEM as a PGAS backend for simplified multi-GPU programming, providing a global view of the data while maintaining the portability of Kokkos performance