Kokkos Memory Access Patterns
1. Introduction
In the realm of high-performance computing, managing memory access patterns is crucial for achieving optimal performance across diverse hardware architectures. The View’s Layout parameter plays a pivotal role in determining how data is organized in memory. Kokkos offers several layout options, including LayoutRight, LayoutLeft, and LayoutStride [1]. LayoutRight, which is typically the default for CPU architectures, organizes data in a row-major format, where elements of the rightmost dimension are contiguous in memory. Conversely, LayoutLeft, often preferred for GPU architectures, uses a column-major layout. This flexibility allows developers to tailor the data organization to the specific requirements of their target hardware, maximizing performance across different platforms.
Memory access patterns in Kokkos are intricately linked to how parallel work indices are mapped to the layout of multidimensional array data. The library provides a sophisticated mapping mechanism that aligns the iteration space of parallel computations with the underlying memory layout. This alignment is critical for performance, as it directly impacts how efficiently the hardware can fetch and process data [2]. For instance, when using a LayoutRight View on a CPU, iterating over the rightmost dimension in the innermost loop of a parallel_for construct will result in cache-friendly, stride-1 memory accesses [3].
The significance of proper memory access patterns and layouts cannot be overstated when it comes to performance. On CPUs, well-aligned access patterns lead to efficient cache utilization, reducing memory latency and improving overall throughput. The importance of caching is particularly evident in operations like inner products, where repeated access to the same data can benefit greatly from cache locality [2]. On GPUs, coalesced memory accesses are paramount. When threads in a warp access contiguous memory locations, the GPU can combine these accesses into fewer, larger transactions, significantly boosting memory bandwidth utilization [4].
2. Managing Memory Access Patterns
Memory access patterns play a pivotal role in achieving performance portability with Kokkos. The library provides mechanisms to control data layout and optimize memory access for different architectures.
2.1. View’s Layout Parameter and Data Layout Control
The View’s Layout parameter in Kokkos is a powerful tool for controlling data layout:
- Kokkos provides different layout options, primarily LayoutRight and LayoutLeft.
- LayoutRight is typically the default for CPUs, representing row-major order.
- LayoutLeft is often the default for GPUs, representing column-major order.
- These layouts determine how multidimensional data is stored in memory.
For example, to create a 2D view with a specific layout:
Kokkos::View<double**, Kokkos::LayoutRight> view2D("view2D", 64, 64);
This creates a 64x64 2D array with a row-major layout.
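To make the difference between the two layouts concrete, the following minimal sketch computes the linear offset of element (i, j) under each ordering. The helper names offset_right and offset_left are hypothetical and are not part of the Kokkos API; the sketch also assumes a dense array with no padding.

```cpp
#include <cstddef>

// Hypothetical helpers (not Kokkos API): linear offset, in elements, of
// element (i, j) in a dense n0 x n1 array without padding.

// Row-major, as with LayoutRight: the rightmost index j has stride 1.
constexpr std::size_t offset_right(std::size_t i, std::size_t j,
                                   std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;
}

// Column-major, as with LayoutLeft: the leftmost index i has stride 1.
constexpr std::size_t offset_left(std::size_t i, std::size_t j,
                                  std::size_t n0, std::size_t /*n1*/) {
    return i + j * n0;
}
```

For a 64x64 array, element (1, 2) sits at offset 66 under the row-major mapping and at offset 129 under the column-major mapping.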
2.2. Kokkos Mapping and Memory Access Patterns
Kokkos maps parallel work indices to the layout of multidimensional array data:
- The mapping aims to provide efficient access if iteration indices correspond to the first index of the array.
- This mapping is crucial for performance, as it determines how threads access memory.
Consider this example:
View<double***, ...> view(...);
Kokkos::parallel_for("Label", ...,
  KOKKOS_LAMBDA (int workIndex) {
    view(workIndex, ..., ...) = ...; // Efficient access
    view(..., workIndex, ...) = ...; // Less efficient
  });
Here, accessing the view with workIndex as the first index is more efficient because of the default layout and mapping.
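One way to see why the position of the work index matters is to look at the stride attached to each dimension. The sketch below is only an illustration: strides_left and strides_right are hypothetical helpers, not Kokkos functions, and they assume a dense rank-3 array without padding.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helpers (not Kokkos API): per-dimension strides, in elements,
// of a dense rank-3 array with extents n[0] x n[1] x n[2].

// Column-major (LayoutLeft-style): the first index has stride 1.
inline std::array<std::size_t, 3> strides_left(const std::array<std::size_t, 3>& n) {
    return {1, n[0], n[0] * n[1]};
}

// Row-major (LayoutRight-style): the last index has stride 1.
inline std::array<std::size_t, 3> strides_right(const std::array<std::size_t, 3>& n) {
    return {n[1] * n[2], n[2], 1};
}
```

With extents 4 x 5 x 6, the first index has stride 1 under the column-major mapping, so threads with neighbouring work indices touch neighbouring elements. Under the row-major mapping the first index has stride 30, so each work index owns its own contiguous slab of 30 elements, which suits a CPU thread walking the remaining indices in inner loops.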
2.3. Performance Impact of Memory Access Patterns and Layouts
To illustrate the performance impact of different memory configurations, consider a simple inner product computation. When implemented with a LayoutRight View on a CPU, the operation benefits from efficient cache usage because each thread iterates over contiguous memory. However, the same layout on a GPU may lead to uncoalesced memory accesses, potentially reducing performance by an order of magnitude or more [2]. Conversely, a LayoutLeft View would provide coalesced accesses on the GPU but might suffer from cache thrashing on the CPU. This example underscores the importance of selecting the appropriate layout for each target architecture to achieve optimal performance.
Therefore, the significance of memory access patterns and layouts on performance cannot be overstated:
- On CPUs, proper access patterns lead to effective caching, reducing memory latency.
- On GPUs, coalesced memory access is crucial for performance, where adjacent threads access adjacent memory locations.
- Misaligned or non-coalesced access can lead to significant performance degradation, potentially by more than 10x on GPUs.
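The effect of coalescing can be approximated with a small counting model. The sketch below is a deliberate simplification and not a hardware specification: segments_touched is a hypothetical helper, and the group size and segment size passed to it are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <set>

// Simplified model (an assumption, not a hardware specification): count how
// many distinct memory segments of `segment_bytes` are touched when `threads`
// neighbouring threads each read one element of `elem_bytes`, with
// `stride_elems` elements between the addresses of neighbouring threads.
inline std::size_t segments_touched(std::size_t threads, std::size_t stride_elems,
                                    std::size_t elem_bytes, std::size_t segment_bytes) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t byte_address = t * stride_elems * elem_bytes;
        segments.insert(byte_address / segment_bytes);
    }
    return segments.size();
}
```

Assuming 32 threads, 8-byte doubles and 128-byte segments, a stride of 1 touches 2 segments, while a stride of 64 (one row of a 64-column row-major array per thread) touches 32. In this model the strided pattern needs 16 times as many segments for the same amount of useful data.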
Concrete Example of Memory Configuration Performance
Let’s consider a simple inner product computation:
Kokkos::parallel_reduce("Label",
  Kokkos::RangePolicy<ExecutionSpace>(0, N),
  KOKKOS_LAMBDA (const size_t row, double& valueToUpdate) {
    // Each work item handles one row of A.
    double thisRowsSum = 0;
    for (size_t entry = 0; entry < M; ++entry) {
      thisRowsSum += A(row, entry) * x(entry);
    }
    // Accumulate this row's contribution into the reduction value.
    valueToUpdate += y(row) * thisRowsSum;
  }, result);
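Note that in this example:
- For a CPU with LayoutRight, this access pattern is cache-friendly, because each thread walks one row of A with stride 1.
- For a GPU, the same LayoutRight access to A(row, entry) is not coalesced, because neighboring threads handle neighboring rows that are M elements apart. With LayoutLeft, neighboring rows are adjacent in memory and the access is coalesced.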
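For checking results, it can help to have a serial reference of the same computation. The following is a minimal sketch in plain C++, assuming A is stored row-major in a flat std::vector; the function name yAx_row_major is hypothetical and the code does not use Kokkos.

```cpp
#include <cstddef>
#include <vector>

// Serial reference (hypothetical helper, not Kokkos API) for
//   result = sum over rows of y[row] * (sum over entries of A[row][entry] * x[entry])
// where A is an N x M matrix stored row-major in a flat vector.
inline double yAx_row_major(const std::vector<double>& A,
                            const std::vector<double>& x,
                            const std::vector<double>& y) {
    const std::size_t N = y.size();
    const std::size_t M = x.size();
    double result = 0.0;
    for (std::size_t row = 0; row < N; ++row) {
        double thisRowsSum = 0.0;
        for (std::size_t entry = 0; entry < M; ++entry) {
            thisRowsSum += A[row * M + entry] * x[entry];  // stride-1 in entry
        }
        result += y[row] * thisRowsSum;
    }
    return result;
}
```

With A = [[1, 2], [3, 4]], x = [1, 1] and y = [1, 2], the row sums are 3 and 7, so the function returns 1*3 + 2*7 = 17.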
To optimize for both architectures, you might need to transpose the data or use different layouts for different devices, for example by relying on the default layout of each device.
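As a rough illustration of what transposing the data means at the storage level, the sketch below copies a row-major buffer into column-major order. It is plain C++ with a hypothetical helper name, to_column_major, and is not how Kokkos itself converts between layouts.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not Kokkos API): copy an N x M matrix from row-major
// storage into column-major storage. Element (i, j) moves from offset
// i * M + j to offset i + j * N.
inline std::vector<double> to_column_major(const std::vector<double>& row_major,
                                           std::size_t N, std::size_t M) {
    std::vector<double> col_major(N * M);
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < M; ++j) {
            col_major[i + j * N] = row_major[i * M + j];
        }
    }
    return col_major;
}
```

For the 2x3 matrix stored row-major as 1, 2, 3, 4, 5, 6, the column-major buffer is 1, 4, 2, 5, 3, 6.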
…