Kokkos PGAS (Partitioned Global Address Space)
1. Introduction
In the realm of high-performance computing, the integration of internode communication models with intranode parallelism frameworks has become increasingly crucial. This synergy is exemplified by the combination of MPI (Message Passing Interface) and PGAS (Partitioned Global Address Space) with Kokkos, a performance portability ecosystem for manycore architectures.
2. Kokkos Remote Spaces: PGAS Support
PGAS (Partitioned Global Address Space) models are gaining traction, particularly with the advent of "super-node" architectures and evolving network infrastructures [1][2]. Kokkos Remote Spaces extends the Kokkos ecosystem to embrace this paradigm, offering a bridge between shared and distributed memory programming models.
PGAS gives Kokkos a global view of data, making multi-GPU, multi-node, and multi-device programming more convenient. By providing a high-level abstraction for remote memory accesses, it simplifies distributed programming for developers using Kokkos. Kokkos Remote Spaces supports multiple PGAS backends, including SHMEM, NVSHMEM, ROCSHMEM, and MPI one-sided communication, offering flexibility across different systems and architectures. These PGAS implementations are optimized for high-performance communication, which is crucial for the scientific computing applications that Kokkos targets, and they allow Kokkos to preserve its philosophy of performance portability from CPUs to GPUs. The result is efficient, portable distributed programming behind a programming interface consistent with the rest of the Kokkos ecosystem.
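As a sketch of how this backend flexibility appears in code, applications typically hide the concrete backend behind a type alias so the rest of the program stays backend-agnostic. The snippet below assumes the Kokkos_RemoteSpaces.hpp header and the Kokkos::Experimental::DefaultRemoteMemorySpace alias provided by Kokkos Remote Spaces, which resolves to the backend selected at build time:
#include <Kokkos_Core.hpp>
#include <Kokkos_RemoteSpaces.hpp>

// DefaultRemoteMemorySpace resolves to the PGAS backend chosen at build time
// (SHMEM, NVSHMEM, ROCSHMEM, or MPI one-sided).
using RemoteSpace_t = Kokkos::Experimental::DefaultRemoteMemorySpace;

// Views allocated in this space are globally addressable across processing elements.
using RemoteView_t = Kokkos::View<double*, RemoteSpace_t>;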
To write a PGAS application with Kokkos, developers can utilize the Kokkos Remote Spaces extension. This extension introduces new memory spaces that return data handles with PGAS semantics. Creating a global View in this context is straightforward:
Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Experimental::NVSHMEMSpace> globalView("GlobalView", N, M);
This declaration creates a two-dimensional View that spans across multiple processing elements in a PGAS model.
Accessing global data in a PGAS model requires careful consideration of data locality and communication costs. Kokkos Remote Spaces provides abstractions that simplify this process. For example, accessing an element of the global View might look like this:
auto element = globalView(i, j);
Behind the scenes, Kokkos handles the necessary communication to fetch or update the data, abstracting away the complexities of the underlying PGAS implementation.
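For a slightly fuller picture, the sketch below (reusing the globalView declared earlier and assuming a CUDA-capable default execution space matching the NVSHMEM backend) copies one row of the global View into node-local memory; any remote elements are fetched by the PGAS layer:
// Copy row 0 of the global View into an ordinary device-local View.
// Row 0 is chosen purely for illustration.
Kokkos::View<double*> localRow("LocalRow", M);
Kokkos::parallel_for("FetchRow", Kokkos::RangePolicy<>(0, M),
  KOKKOS_LAMBDA(const int j) {
    localRow(j) = globalView(0, j);  // may trigger a remote get
  });
Kokkos::fence();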
A prime example of a PGAS workload is Sparse Matrix-Vector Multiplication (SpMV), a key kernel of the Conjugate Gradient (CG) method. In a PGAS model using Kokkos Remote Spaces, the vector becomes distributed, while the sparse matrix stores global indices. This approach allows for efficient parallel computation across multiple nodes.
The implementation of SpMV in this context might involve:
- Distributing the vector across processing elements.
- Storing the sparse matrix with global indices.
- Performing local computations using Kokkos parallel constructs.
- Utilizing PGAS operations for necessary remote data accesses.
This strategy can lead to significant performance improvements, especially for large-scale problems that exceed the memory capacity of a single node.
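To make these steps concrete, here is a minimal sketch of such an SpMV kernel. The CRS views row_map, col_idx, and values, the distributed vector x, and the local result y are illustrative names rather than part of the Kokkos Remote Spaces API; col_idx stores global column indices, so x(col_idx(k)) may resolve to an element owned by another processing element:
// y = A * x with a locally stored CRS matrix and a globally indexed vector x.
using RemoteSpace_t  = Kokkos::Experimental::NVSHMEMSpace;
using RemoteVector_t = Kokkos::View<double*, RemoteSpace_t>;

void spmv(int num_local_rows,
          Kokkos::View<int*>    row_map,   // CRS row offsets for the local rows
          Kokkos::View<int*>    col_idx,   // global column indices
          Kokkos::View<double*> values,    // nonzero values
          RemoteVector_t        x,         // distributed input vector
          Kokkos::View<double*> y) {       // node-local output
  Kokkos::parallel_for("SpMV", Kokkos::RangePolicy<>(0, num_local_rows),
    KOKKOS_LAMBDA(const int row) {
      double sum = 0.0;
      for (int k = row_map(row); k < row_map(row + 1); ++k) {
        // x(col_idx(k)) may be a remote access; the PGAS backend
        // performs the communication transparently.
        sum += values(k) * x(col_idx(k));
      }
      y(row) = sum;
    });
  Kokkos::fence();
}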
Example
#include <Kokkos_Core.hpp>
#include <Kokkos_RemoteSpaces.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  // Note: in a real application the NVSHMEM/MPI runtime must be
  // initialized before the remote space is used (omitted here).
  Kokkos::initialize(argc, argv);
  {
    using ExecSpace   = Kokkos::Cuda;
    using RemoteSpace = Kokkos::Experimental::NVSHMEMSpace;
    using RemoteView  = Kokkos::View<double*, Kokkos::LayoutLeft, RemoteSpace>;

    const int N = 1000;

    // Allocate a globally addressable array backed by NVSHMEM.
    RemoteView remote_data("RemoteData", N);

    // Initialize the data from the GPU; accesses to remote elements are
    // handled by the PGAS backend.
    Kokkos::parallel_for("InitializeData", Kokkos::RangePolicy<ExecSpace>(0, N),
      KOKKOS_LAMBDA(const int i) {
        remote_data(i) = static_cast<double>(i);
      });
    Kokkos::fence();

    // Reduce over the same range to compute the sum of all elements.
    double sum = 0.0;
    Kokkos::parallel_reduce("SumData", Kokkos::RangePolicy<ExecSpace>(0, N),
      KOKKOS_LAMBDA(const int i, double& lsum) {
        lsum += remote_data(i);
      }, sum);
    Kokkos::fence();

    printf("Sum of remote data: %f\n", sum);
  }
  Kokkos::finalize();
  return 0;
}
Explanations:
- Kokkos::Experimental::NVSHMEMSpace is used as the remote memory space.
- A RemoteView is created in that space.
- The remote data is initialized with a parallel_for on the CUDA execution space.
- The sum of the remote data is computed with a parallel_reduce.
- Kokkos::fence() ensures synchronization between the remote operations.
This code demonstrates how Kokkos Remote Spaces allows NVSHMEM to be used as a PGAS backend for simplified multi-GPU programming, providing a global view of the data while maintaining Kokkos' performance portability.