GPU = a throughput-oriented architecture
Slides
1.2 GPU = a Throughput-Oriented Processor.pdf
Video Transcription
Slide 3 – Two fundamental measures of processor performance
What is a good metric to characterize a processor's performance?
One reasonable metric is the task latency, that is, the time needed to execute a given task. Using this metric, a good processor is one that minimizes task latency.
A second reasonable metric is the throughput, that is, the amount of work completed per unit of time. Using this metric, a good processor is one that maximizes throughput.
To understand the difference between latency and throughput, we can use the analogy of water flowing in a pipe. The latency is the time it takes for a parcel of water to travel from one end of the pipe to the other. The throughput is the amount of water moving end-to-end in a given time period. We can design different systems to transport water, and we can choose either to minimize latency or to maximize throughput.
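To see that the two metrics are independent, here is a made-up numeric example (the numbers are chosen for illustration and are not taken from the slides). Suppose a parallel processor takes 5 ms per task but works on 10 tasks at once, while a serial processor finishes one task at a time in 1 ms:

\[
\text{parallel: latency} = 5~\text{ms}, \qquad \text{throughput} = \frac{10~\text{tasks}}{5~\text{ms}} = 2~\text{tasks/ms}
\]
\[
\text{serial: latency} = 1~\text{ms}, \qquad \text{throughput} = \frac{1~\text{task}}{1~\text{ms}} = 1~\text{task/ms}
\]

The parallel design has five times the latency of the serial one but twice the throughput: optimizing one metric does not optimize the other.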
In the same way, when designing processors, we need to decide whether to follow a latency-oriented or a throughput-oriented architecture design.
Slide 4 – Traditional scalar microprocessors are latency-oriented architectures
Traditional microprocessors, like the one in the laptop I am using now (an Intel Core i5), are latency-oriented: their design aims at minimizing latency.
As you can see in the figure presenting a simple diagram of a latency-oriented architecture, such architectures are characterized by large memory caches that can span one or more levels - you might have heard of L1, L2 and L3 caches -, by a large part of the chip being dedicated to control, and by a relatively small number of Arithmetic Logic Units (ALUs). In particular, because a large part of the chip is devoted to control, latency-oriented architectures can execute the instructions of a given task out of order.
Slide 5 – Throughput-Oriented Processors …
Accelerator microprocessors are throughput-oriented: their design aims at maximizing throughput.
The most striking characteristic of throughput-oriented architectures is the large number of Arithmetic Logic Units – the green squares in the diagram on the right – and the consequently large amount of available parallelism.
Slide 6 – Latency-Oriented vs Throughput-Oriented Architectures
When comparing the latency-oriented and throughput-oriented microarchitecture choices, the differences are clear.
Latency-oriented architectures, on the left part of this slide, have large caches to move data quickly to the processor and sophisticated control units to allow for out-of-order execution. This design choice is motivated by the fact that we want to complete a serial task as fast as possible.
On the other hand, throughput-oriented architectures - on the right part of the slide - are characterized by a large number of simple computing units, and they support highly parallel workloads, as the goal in this case is to do as much computation as possible in a given period of time.
Slide 7 – GPUs are exemplars of throughput-oriented architectures
After so much talk about throughput-oriented architectures and GPUs, it is probably no surprise to you that the GPU architecture is an exemplary case of a throughput-oriented architecture.
GPUs are very good at running highly parallel workloads with little or no dependency between tasks. This is clearly the right design decision for problems in many application domains. Killer apps here are real-time computer graphics, video processing, medical imaging, deep learning, and so on.
Of course, GPUs are not going to be good at solving problems with a lot of synchronization, control, and irregular data access, for which latency-oriented processors are a better fit.
Slide 8 – Throughput-Oriented Architectures and GPUs …
What are the basic principles we can use to implement a throughput-oriented architecture? There are three main concepts: first, we use many simple processing units; second, we use hardware threads; and third, we use SIMD execution.
Slide 9 – Many Simple Processing Units
The first design choice concerns the computing units. We want to have many of them, as many as possible: with many computing units, we can achieve high parallelism and high throughput. However, there is a downside. The processing units of a latency-oriented processor occupy a lot of space on the chip because they include a lot of control logic; so, the only way to have many cores is to reduce their complexity and devote less of the chip to control logic.
Slide 10 – Hardware Multithreading
As large parallelism is one of the main characteristics of throughput-oriented architectures, we need a means to exploit it. The way to do that is to create and manage threads. You can think of a thread as a virtualized scalar processor with its own program counter, registers, and stack memory. In throughput-oriented architectures, threads are implemented in hardware and are therefore very efficient. In latency-oriented architectures, threads are often implemented in software and handled by the operating system.
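To make this concrete, here is a minimal sketch in CUDA (a language we will introduce later; the kernel name scale and its parameters are made up for illustration). Each hardware thread behaves like its own small scalar processor: it holds its index i in a private register and steps through the program independently of the other threads.

    // Each GPU hardware thread scales one array element.
    // The index i lives in a per-thread register, so every thread acts
    // like a virtualized scalar processor running the same program.
    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per-thread index
        if (i < n)
            x[i] = a * x[i];
    }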
Slide 11 – Hiding Latency with Threads
In throughput-oriented architectures, we typically use a number of threads that is larger than the number of available computing units. This technique is called oversubscription, and it allows us to hide latency: the long-latency operations of one thread can be covered by ready-to-run work from another thread. For instance, a thread might be idle because it is waiting for a result from a pipelined functional unit; in this case, another thread carries out a ready-to-run task instead of leaving the hardware waiting. Another example is a thread waiting for data from memory; again, another thread can hide this idle period by completing a ready-to-run task in the meantime.
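A hedged sketch of what oversubscription looks like from the programmer's side, reusing the hypothetical scale kernel from above (d_x stands for an array assumed to be already allocated on the GPU): we launch millions of threads even though the chip has only a few thousand ALUs, and the hardware scheduler fills every stall with ready threads.

    int n = 1 << 24;                 // 16M elements, so 16M threads:
                                     // far more threads than ALUs on the chip
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    // While some threads stall on long-latency memory loads, the hardware
    // scheduler issues instructions from other, ready-to-run threads.
    scale<<<blocks, threadsPerBlock>>>(d_x, 2.0f, n);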
Slide 12 – SIMD execution
Finally, throughput-oriented architectures rely heavily on Single Instruction Multiple Data (SIMD) execution, that is, the capability of performing the same instruction on different data – you might imagine different elements of the same array – in one instruction cycle.
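Here is a small CUDA sketch of the same-instruction-different-data idea (the kernel vadd is made up for illustration; note that NVIDIA's variant of SIMD is called SIMT, where groups of 32 threads, called warps, issue the same instruction together):

    __global__ void vadd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // All threads of a warp execute this same add instruction in lockstep,
        // each on a different array element: one instruction, multiple data.
        if (i < n)
            c[i] = a[i] + b[i];
    }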
Slide 13 – To summarize
To summarize, in this lecture we covered three main points. First, we divided processors into latency-oriented and throughput-oriented designs. Second, GPUs are throughput-oriented architectures targeting problems with a lot of parallelism. Third, the GPU's throughput-oriented architecture is based on three main design choices: many simple processing units, hardware threads to hide latency, and SIMD execution.
In the next lecture, we are going to look at the GPU architecture in more detail.
Talk to you in a bit!