Kokkos Tasking Stream SIMD (Single Instruction Multiple Data)

1. Introduction

Kokkos provides powerful tools for high-performance computing, including SIMD (Single Instruction, Multiple Data) operations, asynchronous execution with streams, and task parallelism. These features enable developers to write efficient, portable code that can leverage the full potential of modern hardware architectures. Let’s explore each of these aspects in detail.

2. SIMD (Single Instruction, Multiple Data)

SIMD operations are a crucial component of modern high-performance computing, allowing for efficient vectorization of code. Kokkos offers portable vector intrinsic types that abstract away hardware-specific details, enabling developers to write vectorized code that can run efficiently on various architectures.

Portable Vector Intrinsic Types : Kokkos provides the Kokkos::Experimental::simd type, which serves as an abstraction over platform-specific vector datatypes [3]. This type is designed to work across all backends, potentially falling back to scalar operations when necessary [4]. It can be instantiated for the fundamental C++ types for which the current platform supports vector intrinsics.
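To make the idea of a portable vector type more concrete, the sketch below defines a hypothetical pack<T, N> in plain C++. It is not the Kokkos implementation, and none of its names come from Kokkos. It only illustrates how a single element-wise interface can cover both a vector-width case (N > 1) and a scalar fallback (N = 1). A real SIMD type would map these operations to hardware intrinsics rather than loops.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a SIMD pack: N lanes of type T.
// Not a Kokkos type; the loops stand in for hardware vector instructions.
template <class T, std::size_t N>
struct pack {
    std::array<T, N> lane{};

    static constexpr std::size_t size() { return N; }

    // Load N consecutive elements starting at ptr.
    static pack load(const T* ptr) {
        pack p;
        for (std::size_t j = 0; j < N; ++j) p.lane[j] = ptr[j];
        return p;
    }

    // Store all N lanes to consecutive elements starting at ptr.
    void store(T* ptr) const {
        for (std::size_t j = 0; j < N; ++j) ptr[j] = lane[j];
    }
};

// Element-wise addition of two packs.
template <class T, std::size_t N>
pack<T, N> operator+(const pack<T, N>& a, const pack<T, N>& b) {
    pack<T, N> r;
    for (std::size_t j = 0; j < N; ++j) r.lane[j] = a.lane[j] + b.lane[j];
    return r;
}

// Element-wise multiplication of two packs.
template <class T, std::size_t N>
pack<T, N> operator*(const pack<T, N>& a, const pack<T, N>& b) {
    pack<T, N> r;
    for (std::size_t j = 0; j < N; ++j) r.lane[j] = a.lane[j] * b.lane[j];
    return r;
}
```

With N = 1 the same code degenerates to ordinary scalar arithmetic, which is the kind of fallback described above for backends without vector intrinsics.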

Improving Vectorization with SIMD Types : To improve vectorization using SIMD types in Kokkos, developers can follow these steps:

  1. Include the necessary header: #include <Kokkos_SIMD.hpp>

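  2. Define the SIMD type and the alignment tag used for loads and stores: using simd_type = Kokkos::Experimental::native_simd<double>; using tag_type = Kokkos::Experimental::element_aligned_tag;

  3. Use SIMD types in computations to ensure vectorization (here x, y, z, and r are pointers to double, and i is the index of the first element of the pack):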

    simd_type sx(x + i, tag_type());
    simd_type sy(y + i, tag_type());
    simd_type sz(z + i, tag_type());
    simd_type sr = Kokkos::sqrt(sx * sx + sy * sy + sz * sz);
    sr.copy_to(r + i, tag_type());

Because every operation on simd_type maps to the vector intrinsics of the target architecture (or to scalar code on backends without them), vectorization no longer depends on the compiler's auto-vectorizer [1].

SIMD Types as an Alternative to ThreadVector Loops : SIMD types can be used as an alternative to ThreadVector loops (the innermost level of Kokkos hierarchical parallelism), providing more explicit control over vectorization. This approach allows developers to reason more clearly about the available parallelism in their code, often leading to better performance than relying on auto-vectorization [1].

Achieving Outer Loop Vectorization : SIMD types enable outer loop vectorization by processing multiple elements simultaneously. For example, on a CPU with 256-bit vector registers, a native_simd<double> holds four double-precision values (256 bits / 64 bits per double), so the following loop processes four elements per iteration:

    constexpr int width = int(simd_type::size());
    for (int i = 0; i < n; i += width) {
        // SIMD operations here
    }

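This approach can significantly improve performance for suitable algorithms [1]. Note that the loop as written assumes n is a multiple of width; otherwise the last n % width elements need separate handling.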
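One way to handle an n that is not a multiple of the vector width is sketched below in plain C++, without Kokkos, so that only the index arithmetic is on display. The function name norm_chunked and its signature are hypothetical. The parameter width plays the role of simd_type::size(), and the inner loop over lanes stands in for a single SIMD operation on one pack.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: r[k] = sqrt(x[k]^2 + y[k]^2 + z[k]^2) for k in [0, n).
// Full packs of 'width' elements are processed first, then a scalar tail.
// Returns the number of full packs processed.
inline std::size_t norm_chunked(const double* x, const double* y,
                                const double* z, double* r,
                                std::size_t n, std::size_t width) {
    // Number of elements covered by full packs.
    const std::size_t full = (n / width) * width;
    std::size_t packs = 0;
    for (std::size_t i = 0; i < full; i += width) {
        // Stand-in for one SIMD operation on lanes i .. i + width - 1.
        for (std::size_t j = 0; j < width; ++j) {
            const std::size_t k = i + j;
            r[k] = std::sqrt(x[k] * x[k] + y[k] * y[k] + z[k] * z[k]);
        }
        ++packs;
    }
    // Scalar tail: the n % width elements that do not fill a pack.
    for (std::size_t k = full; k < n; ++k) {
        r[k] = std::sqrt(x[k] * x[k] + y[k] * y[k] + z[k] * z[k]);
    }
    return packs;
}
```

With n = 10 and width = 4, this processes two full packs (8 elements) and leaves 2 elements for the tail loop.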

Example

    #include <Kokkos_Core.hpp>
    #include <Kokkos_SIMD.hpp>
    #include <cstdio>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            using simd_type = Kokkos::Experimental::native_simd<double>;
            using tag_type = Kokkos::Experimental::element_aligned_tag;
            constexpr int width = int(simd_type::size());
            // n = 1000 is a multiple of the common widths for double (1, 2, 4, 8).
            // For other sizes, the last n % width elements would need a tail loop.
            const int n = 1000;
            Kokkos::View<double*> x("x", n);
            Kokkos::View<double*> y("y", n);
            Kokkos::View<double*> z("z", n);
            Kokkos::View<double*> r("r", n);
            // Fill the inputs: x = i, y = 2i, z = 3i.
            Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
                x(i) = static_cast<double>(i);
                y(i) = static_cast<double>(i * 2);
                z(i) = static_cast<double>(i * 3);
            });
            // Each iteration handles one pack of `width` consecutive elements.
            Kokkos::parallel_for("compute", n / width, KOKKOS_LAMBDA(const int i) {
                const int idx = i * width;
                // Build each pack lane by lane from the views.
                simd_type sx([&x, idx](std::size_t j) { return x(idx + j); });
                simd_type sy([&y, idx](std::size_t j) { return y(idx + j); });
                simd_type sz([&z, idx](std::size_t j) { return z(idx + j); });
                simd_type sr = Kokkos::sqrt(sx * sx + sy * sy + sz * sz);
                // Write the whole pack back to r.
                sr.copy_to(r.data() + idx, tag_type());
            });
            Kokkos::fence();
            // Copy the results to the host before printing.
            auto h_r = Kokkos::create_mirror_view(r);
            Kokkos::deep_copy(h_r, r);
            std::printf("First 5 results:\n");
            for (int i = 0; i < 5; ++i) {
                std::printf("r[%d] = %f\n", i, h_r(i));
            }
        }
        Kokkos::finalize();
        return 0;
    }

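Explanation: This program computes r(i) = sqrt(x(i)^2 + y(i)^2 + z(i)^2) for three input vectors. The compute loop processes width consecutive elements per iteration using SIMD types; the result is then copied to the host and the first five values are printed.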
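Since the inputs are x(i) = i, y(i) = 2i, and z(i) = 3i, each result should equal sqrt(i^2 + 4i^2 + 9i^2) = i * sqrt(14), roughly 3.741657 * i. A scalar reference like the following can be used to check the SIMD output. It is plain C++, and the function name expected_r is hypothetical.

```cpp
#include <cmath>

// Hypothetical scalar reference for the example above:
// with x = i, y = 2i, z = 3i the expected result is i * sqrt(14).
inline double expected_r(int i) {
    const double x = static_cast<double>(i);
    const double y = static_cast<double>(i * 2);
    const double z = static_cast<double>(i * 3);
    return std::sqrt(x * x + y * y + z * z);
}
```

For i = 0 to 4 this gives approximately 0, 3.741657, 7.483315, 11.224972, and 14.966630, which is what the first five values of r should be close to.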

3. References

Points to keep in mind

SIMD (Single Instruction, Multiple Data) in Kokkos is a C++ representation of vector registers that allows a single instruction to be applied to multiple data simultaneously, thus improving performance by parallelizing operations at the data level.