Kokkos Asynchronicity, Streams, and Task Parallelism
1. Introduction
Kokkos supports asynchronous execution through its parallel dispatch operations, which may return before the work has completed while still executing in sequence with other Kokkos operations [1][2]. On the CUDA backend, a CUDA stream can be used to create a Kokkos CUDA execution space instance. Kernels dispatched to instances on different streams can overlap, enabling concurrent execution of multiple kernels.
2. Asynchronicity and Streams
Asynchronous execution and streams are essential for maximizing hardware utilization, especially on GPUs. Kokkos provides mechanisms for managing asynchronous operations and concurrent kernel execution.
Blocking and Non-blocking Operations : Kokkos operations can be either blocking or non-blocking:
- Blocking operations: These operations wait for completion before returning control to the caller.
- Non-blocking operations: These operations initiate work but return control immediately, allowing other computations to proceed concurrently.
Overlapping Work : Non-blocking operations enable work overlap, which can improve overall performance. Types of work that can overlap include:
- Host-to-device data transfers
- Device-to-host data transfers
- Kernel executions
- Host computations
Waiting for Completion : To ensure that all asynchronous operations have completed, Kokkos provides synchronization mechanisms:
- Kokkos::fence(): Waits for all outstanding asynchronous operations to complete.
- Kokkos::wait(): Can be used with specific futures or task policies to wait for particular operations.
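The launch, overlap, and wait pattern described above can be sketched without a GPU. The block below is a standard-library analogy and not Kokkos code: std::async stands in for a non-blocking dispatch, and future::get() plays the role of the wait that Kokkos::fence() performs. The function names launch_sum and overlapped_total are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for a non-blocking dispatch (an analogy, not a Kokkos call):
// starts summing `data` on another thread and returns immediately with a
// handle to the pending result. `data` must outlive the returned future.
std::future<long> launch_sum(const std::vector<int>& data) {
    return std::async(std::launch::async, [&data] {
        return std::accumulate(data.begin(), data.end(), 0L);
    });
}

// Launches the asynchronous sum, performs unrelated host work while it
// runs, then blocks until the asynchronous result is available.
long overlapped_total(const std::vector<int>& data, long host_work_items) {
    std::future<long> pending = launch_sum(data);  // returns right away

    long host_result = 0;  // host computation that overlaps the async sum
    for (long i = 1; i <= host_work_items; ++i) {
        host_result += i;
    }

    long async_result = pending.get();  // wait for completion
    return async_result + host_result;
}
```

For a vector of 100 elements all equal to 2 and host_work_items = 10, overlapped_total returns 255: 200 from the asynchronous sum plus 55 from the host loop. The result of the asynchronous part is only read after the wait, which is the same discipline a fence enforces before touching data produced by a non-blocking operation.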
Running Kernels Simultaneously on a GPU : To run kernels simultaneously on a GPU, Kokkos relies on streams. Kernels dispatched to execution space instances bound to different streams may execute concurrently, while kernels dispatched to the same instance execute in order. The instance is selected through the execution policy passed to the asynchronous dispatch call.
…
3. Task Parallelism
Task parallelism in Kokkos allows for fine-grained dependent execution, which is particularly useful for irregular problems and algorithms with complex dependency structures.
Basic Interface for Fine-grained Tasking :
Kokkos provides a TaskPolicy for coordinating task execution [2]. The basic interface includes:
- Creating tasks: policy.create(Functor<exec_space>())
- Adding dependencies: policy.add_dependence(task1, task2), which schedules task1 after task2
- Spawning tasks: policy.spawn(task)
- Waiting for completion: Kokkos::wait(task) or Kokkos::wait(policy)
Expressing Dynamic Dependency Structures :
Dynamic dependency structures can be expressed using the add_dependence method, allowing for the creation of complex task graphs [2]. For example:
auto fx = policy.create(Functor<exec_space>(x));
auto fy = policy.create(Functor<exec_space>(y));
policy.add_dependence(fx, fy); // fx is scheduled after fy
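To make the ordering guarantee concrete, the sketch below models a dependency-driven scheduler in plain standard C++. It is not the Kokkos implementation: MiniTaskGraph is a hypothetical, single-threaded stand-in whose create and add_dependence methods only mirror the argument order of the interface above, and run_all is an assumed helper that executes tasks in an order respecting every declared dependency.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a task policy; not part of Kokkos.
class MiniTaskGraph {
public:
    using TaskId = std::size_t;

    // Registers a task body and returns its handle.
    TaskId create(std::function<void()> body) {
        bodies_.push_back(std::move(body));
        successors_.emplace_back();
        pending_deps_.push_back(0);
        return bodies_.size() - 1;
    }

    // Declares that `after` must not start until `before` has finished,
    // matching the argument order of add_dependence(fx, fy) in the text.
    void add_dependence(TaskId after, TaskId before) {
        successors_[before].push_back(after);
        ++pending_deps_[after];
    }

    // Runs each runnable task once, in dependency order, and returns how
    // many tasks executed. A count below the number of created tasks
    // means a dependency cycle left some tasks unrunnable.
    std::size_t run_all() {
        std::vector<std::size_t> remaining = pending_deps_;
        std::queue<TaskId> ready;
        for (TaskId t = 0; t < bodies_.size(); ++t) {
            if (remaining[t] == 0) ready.push(t);
        }

        std::size_t executed = 0;
        while (!ready.empty()) {
            TaskId t = ready.front();
            ready.pop();
            bodies_[t]();
            ++executed;
            // Completing `t` may release tasks that were waiting on it.
            for (TaskId next : successors_[t]) {
                if (--remaining[next] == 0) ready.push(next);
            }
        }
        return executed;
    }

private:
    std::vector<std::function<void()>> bodies_;
    std::vector<std::vector<TaskId>> successors_;
    std::vector<std::size_t> pending_deps_;
};
```

Creating fx and then fy, calling add_dependence(fx, fy), and invoking run_all() executes the body of fy before the body of fx even though fx was created first. Because dependencies are ordinary runtime calls, the graph can be built from data that is only known while the program runs, which is what makes the dependency structure dynamic.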
When to Use Kokkos Tasking :
Kokkos tasking is particularly useful in the following scenarios:
- Irregular problems with complex dependencies
- Producer-consumer patterns
- Recursive algorithms
- When fine-grained parallelism is needed within tasks
Tasking in Kokkos allows for better locality exploitation by enabling nested data-parallelism within a task, which can be particularly beneficial for heterogeneous devices [2].
…