GPU (Graphics Processing Unit is a graphics (co-)processor)

GPU

1. Definition

Graphics Processing Unit is a graphics (co-)processor capable of very efficiently performing calculations on images (2D, 3D, videos, etc.). The raw computing power offered is higher due to the large number of processors present on these cards. This is why it is not uncommon to obtain large acceleration factors between CPU and GPU for the same application.

Explicit code targeting GPUs: CUDA, HIP, SYCL, Kokkos, RAJA,…​

image2

Fig: illustrates the main hardware architecture differences between CPUs and GPUs. The transistor counts associated with various functions are represented abstractly by the relative sizes of the various shaded areas. In the figure, the green corresponds to the calculation; gold is instruction processing; purple is the L1 cache; blue is top level cache and orange is memory (DRAM, which really should be thousands of times larger than caches).

GPUs were originally designed to render graphics. They work great for shading, texturing, and rendering the thousands of independent polygons that make up a 3D object. CPUs, on the other hand, are meant to control the logical flow of any general-purpose program, where a lot of digit manipulation may (or may not) be involved. Due to these very different roles, GPUs are characterized by having many more processing units and higher overall memory bandwidth, while CPUs offer more sophisticated instruction processing and faster clock speed.

2. CPU vs GPU comparison

CPU: Latency-oriented design GPU: Throughput Oriented Design

Clock

High clock frequency

Moderate clock frequency

Caches

Large sizes

Converts high latency accesses in memory to low latency accesses in cache

Small caches

To maximize memory throughput

Control

Sophisticated control system

Branch prediction to reduce latency due to branching
Data loading to reduce latency due to data access

Single controlled

No branch prediction

No data loading

Powerful Arithmetic Logic Unit (ALU)

Reduced operation latency

Numerous, high latency but heavily pipelined for high throughput

Other aspects

Lots of space devoted to caching and control logic. Multi-level caches used to avoid latency

Limited number of registers due to fewer active threads

Control logic to reorganize execution, provide ILP, and minimize pipeline hangs

Requires a very large number of threads for latency to be tolerable

Beneficial aspects for applications

CPUs for sequential games where latency is critical.

CPUs can be 10+X faster than GPUs for sequential code.

GPUs for parallel parts where throughput is critical.

GPUs can be 10+X faster than GPUs for parallel code.