How GPU works?

graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systemsmobile phonespersonal computersworkstations, and game consoles.

A really comprehensive lecture from Pro. Mutlu:

GPUs and GPGPU Programming (pdf)

What is the difference between Unified and non-unified shader architectures?

Shader architectures can be unified or non-unified. Many of the mobile and embedded GPUs have unified shader architecture.

  • A unified shader architecture executes shader programs, such as fragment and vertex shaders, on the same processing modules.
  • A non-unified architecture uses separate dedicated processing modules for vertex and fragment processing.

Unified architectures can save power and increase performance compared to a non-unified architecture.

Unified architectures also scale much more easily to a given application, whether it is fragment or vertex shader bound, as the unified processors will be used accordingly.

What is a shader?

In computer graphics, a shader is a type of computer program originally used for shading in 3D scenes (the production of appropriate levels of lightdarkness, and color in a rendered image). They now perform a variety of specialized functions in various fields within the category of computer graphics special effects, or else do video post-processing unrelated to shading, or even perform functions unrelated to graphics at all.

What is Tile-Based Deferred Rendering (TBDR)?

The usual rendering technique on most GPUs is called Immediate Mode Rendering (IMR) on which geometry is sent to the GPU, and gets drawn straight away. This simple architecture is relatively inefficient, resulting in wasted processing power and memory bandwidth. Pixels are often still rendered despite never being visible on the screen, such as when a car is completely obscured by a closer building.

Tile-Based Deferred Rendering architecture works in a much intelligent way. It captures the whole scene before starting to render, thus occluded pixels can be identified and rejected before they are processed. The hardware starts splitting up the geometry data into small rectangular regions that will be processed as one image, which we call “tiles”. Every tile is rasterized and processed separately, and as the size of the render is so small, this allows all data to be kept on very fast chip memory.

Deferred rendering means that the architecture will defer all texturing and shading operations until all objects have been tested for visibility. This significantly reduces system memory bandwidth requirements, which in turn increases performance and reduces power requirements. This is a critical advantage for phones, tablets, and other devices where battery life makes all the difference.

What is the benefit of using half-cycle-path?

Until now, we focus on the timing of one-cycle-path or full-cycle-path. Sometimes, there may exist half-cycle-path in design. One example will be the launch flop is negedge triggered, while the capture flop is still posedge triggered.

We discussed the clock skew and how it affects STA in previous post. Equivalently, half-cycle-path can be modeled as one-cycle-path with clock skew δ = +T/2. It is obvious that hold time closure is easier while setup time closure is harder for half-cycle-path.

You may wonder what is the benefit of using half-cycle-path in the design.

Continue reading → What is the benefit of using half-cycle-path?