How does a GPU work?

A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles.

A really comprehensive lecture from Prof. Mutlu:

GPUs and GPGPU Programming (pdf)

What is the difference between Unified and non-unified shader architectures?

Shader architectures can be unified or non-unified. Many mobile and embedded GPUs have a unified shader architecture.

  • A unified shader architecture executes shader programs, such as fragment and vertex shaders, on the same processing modules.
  • A non-unified architecture uses separate dedicated processing modules for vertex and fragment processing.

Unified architectures can save power and increase performance compared to a non-unified architecture.

Unified architectures also scale much more easily to a given application, whether it is fragment or vertex shader bound, as the unified processors will be used accordingly.
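The scaling advantage can be sketched with a toy scheduling model. This is an illustrative simulation, not a description of real hardware: the core counts, job counts, and one-job-per-core-per-cycle assumption are all made up to show why a unified pool adapts to a fragment-bound or vertex-bound frame while a fixed split leaves cores idle.

```python
import math

# Hypothetical model: each core finishes one shader job per cycle.
# Core counts and workload sizes are illustrative, not real hardware figures.

def cycles_fixed(vertex_jobs, fragment_jobs, vertex_cores=4, fragment_cores=4):
    """Non-unified: dedicated pools run in parallel; the busier one dominates."""
    v = math.ceil(vertex_jobs / vertex_cores)
    f = math.ceil(fragment_jobs / fragment_cores)
    return max(v, f)

def cycles_unified(vertex_jobs, fragment_jobs, cores=8):
    """Unified: any core can run either shader type, so work load-balances."""
    return math.ceil((vertex_jobs + fragment_jobs) / cores)

# A fragment-bound frame: few vertex jobs, many fragment jobs.
print(cycles_fixed(8, 64))    # fragment cores bottleneck while vertex cores idle
print(cycles_unified(8, 64))  # all cores help with fragments
```

With these made-up numbers the fixed split takes 16 cycles while the unified pool takes 9, because no core sits idle once vertex work runs out.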

What is Per-Tile Rasterization (Renderer)?

Rasterization: The process of determining which pixels a given primitive touches. Rasterization and pixel coloring are performed on a per-tile basis with the following steps:

  1. When a tile operation begins, the corresponding tile list is retrieved from the Parameter Buffer (PB) to identify the screen-space primitive data that needs to be fetched.
  2. The Image Synthesis Processor (ISP) fetches the primitive data and performs Hidden Surface Removal (HSR), along with depth and stencil tests. The ISP only fetches screen-space position data for the geometry within the tile.
  3. The Tag Buffer contains information about which triangle is on the top for each pixel.
  4. The Texture and Shading Processor (TSP) then applies coloring operations, like fragment shaders, to the visible pixels.
  5. Alpha testing and subsequently alpha blending are then carried out.
  6. Once the tile’s render is complete, the color data is written to the frame buffer in system memory.

This process is repeated until all tiles have been processed and the frame buffer is complete.
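The HSR and tag-buffer steps above can be sketched in software. This is a minimal illustrative model, not the actual ISP/TSP pipeline: primitives are simplified to axis-aligned rectangles with a constant depth (smaller depth = closer), and the tile size is made up.

```python
TILE = 4  # 4x4-pixel tile for the example

def rasterize_tile(primitives):
    """Fill a tag buffer with the id of the front-most primitive per pixel.

    `primitives` maps id -> (x0, y0, x1, y1, depth); rectangles are
    half-open in x and y, a stand-in for real triangle coverage.
    """
    tag = [[None] * TILE for _ in range(TILE)]
    depth = [[float("inf")] * TILE for _ in range(TILE)]
    for pid, (x0, y0, x1, y1, z) in primitives.items():
        for y in range(max(0, y0), min(TILE, y1)):
            for x in range(max(0, x0), min(TILE, x1)):
                if z < depth[y][x]:   # depth test: hidden surface removal
                    depth[y][x] = z
                    tag[y][x] = pid   # tag buffer: which primitive is on top
    return tag

prims = {"far": (0, 0, 4, 4, 0.9), "near": (1, 1, 3, 3, 0.1)}
tag = rasterize_tile(prims)
# Only the visible pixels of each primitive would then be shaded by the TSP:
shaded = {pid: sum(row.count(pid) for row in tag) for pid in prims}
print(shaded)  # {'far': 12, 'near': 4}
```

Note that the "far" rectangle covers all 16 pixels of the tile but is shaded on only 12 of them, because HSR resolved visibility before any fragment shading ran; that deferred shading is the point of the tag buffer.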

What is On-Chip Buffer used for?

Read-Modify-Write operations for the color, depth and stencil buffers are performed using fast on-chip memory instead of relying on repeated system memory access, as traditional IMRs do. Attachments that the application has chosen to preserve, such as the color buffer, will be written to system memory.
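A back-of-the-envelope model shows why on-chip blending matters. The counts below are illustrative accounting, not measured figures: an IMR-style renderer is assumed to do one frame-buffer read plus one write in system memory per blended fragment, while a tile-based renderer blends on-chip and writes each preserved pixel out once.

```python
def system_memory_accesses(pixels, overdraw, tile_based):
    """Count per-pixel system-memory accesses under a simplified model."""
    if tile_based:
        # All read-modify-write happens in on-chip tile memory;
        # only the final color is written to the frame buffer.
        return pixels
    # IMR: every overlapping fragment reads and writes system memory.
    return pixels * overdraw * 2  # read + write per blend

PIXELS = 1920 * 1080
print(system_memory_accesses(PIXELS, overdraw=3, tile_based=False))
print(system_memory_accesses(PIXELS, overdraw=3, tile_based=True))
```

Under these assumptions, 3x overdraw at 1080p costs the IMR six system-memory accesses per pixel versus one for the tile-based path, which is the bandwidth saving the section describes.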

What is Vertex Processing (Tiler)?

Every frame, the hardware processes submitted geometry data with the following steps:

  1. The execution of application-defined transformations, such as vertex shaders (Vertex Processing).
  2. The resulting data is then converted to screen-space (Clip, Project, and Cull).
  3. The Tile Accelerator (TA) then determines which tiles contain each transformed primitive (Tiling).
  4. Per-tile lists are then updated to track the primitives which fall within the bounds of each tile.

Each tile in the list contains primitive lists which contain pointers to the transformed vertex data. The tile list and the transformed vertex data are both stored in an intermediate store called the Parameter Buffer (PB). This store resides in system memory, and is mostly managed by the hardware. It contains all information needed to render the tiles.
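The tiling step can be sketched as a binning pass. This is a simplified software model of what a Tile Accelerator does when building per-tile lists: the tile size is an assumption, and primitives are binned by their screen-space bounding box (real hardware can bin more precisely).

```python
TILE_W = TILE_H = 32  # illustrative tile size in pixels

def bounding_tiles(triangle):
    """Yield (tx, ty) for every tile the triangle's bounding box touches."""
    xs = [v[0] for v in triangle]
    ys = [v[1] for v in triangle]
    for ty in range(int(min(ys)) // TILE_H, int(max(ys)) // TILE_H + 1):
        for tx in range(int(min(xs)) // TILE_W, int(max(xs)) // TILE_W + 1):
            yield tx, ty

def build_tile_lists(triangles):
    """Map each tile to the indices of primitives that may touch it,
    the way per-tile lists in the Parameter Buffer reference primitives."""
    tile_lists = {}
    for idx, tri in enumerate(triangles):
        for key in bounding_tiles(tri):
            tile_lists.setdefault(key, []).append(idx)
    return tile_lists

tris = [
    [(2, 2), (20, 4), (10, 25)],     # fits entirely inside tile (0, 0)
    [(30, 30), (40, 30), (35, 40)],  # straddles four tiles
]
print(build_tile_lists(tris))
```

At raster time, each tile then fetches only the primitives in its own list, which is what keeps the per-tile working set small.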


How to optimize for GPU / Graphics workload performance

This section covers key principles to follow to avoid critical performance flaws when creating or optimizing graphics applications. The following recommendations draw on real industry experience and the developers working within it.

  1. Understand the Target Device
    • Learn as much as possible about the device and its graphics architecture so that you can use it in the most efficient manner possible.
  2. Profile the Workload
    • Identify the bottlenecks in the apps you are optimizing and determine whether there are opportunities for improvement.
  3. Perform Clear Well
    • Perform a clear on a framebuffer’s contents to avoid fetching the previous frame’s data on tile-based graphics architectures, which reduces memory bandwidth.
  4. Do Not Update Data Buffers Mid-Frame
    • Avoid touching any buffer when a frame is mid-flight to reduce stalls and temporary buffer stores.
  5. Use Texture Compression
    • Reduce the memory footprint and bandwidth cost of texture assets.
  6. Use Mipmapping
    • This increases texture cache efficiency, which reduces bandwidth and increases performance.
  7. Do Not Use Discard
    • Avoid forcing depth-test processing in the texture stage, as this will decrease performance on architectures with early depth rejection.
  8. Do Not Force Unnecessary Synchronization
    • Avoid API functionality that could stall the graphics pipeline and do not access any hardware buffer directly.
  9. Move Calculations “To the Front”
    • Reduce the overall number of calculations by moving them earlier in the pipeline, where there are fewer instances to process.
  10. Group per Material
    • Grouping geometry and texture data can improve app performance.
  11. Do Not Use Depth Pre-pass
    • Depth pre-pass is redundant on deferred rendering architectures.
  12. Prefer Explicit APIs
    • Graphics apps made using explicit APIs tend to run more efficiently when set up correctly.
  13. Prefer Lower Data Precision
    • Lower precision shader variables should be used, where appropriate, to improve performance.
  14. Use All CPU Cores
    • Using multi-threading in apps is critical to efficient CPU use.
  15. Use Indexed Lists
    • Indexed lists can reduce mesh storage requirements by eliminating redundant vertices.
  16. Use On-chip Memory Efficiently for Deferred Rendering
    • Making better use of on-chip memory reduces overall system memory bandwidth usage.
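Tip 15 (Use Indexed Lists) is easy to quantify. The sketch below compares the storage of a raw triangle list against an indexed one; the 12-byte position-only vertex and 16-bit index sizes are assumptions chosen for illustration.

```python
def raw_bytes(num_triangles, bytes_per_vertex=12):
    """Raw triangle list: every triangle stores its 3 vertices verbatim."""
    return num_triangles * 3 * bytes_per_vertex

def indexed_bytes(num_unique_vertices, num_triangles,
                  bytes_per_vertex=12, bytes_per_index=2):
    """Indexed list: unique vertices once, plus 3 indices per triangle."""
    return (num_unique_vertices * bytes_per_vertex
            + num_triangles * 3 * bytes_per_index)

# One quad drawn as two triangles: 6 raw vertices, but only 4 unique corners.
print(raw_bytes(2))         # 72 bytes
print(indexed_bytes(4, 2))  # 60 bytes
```

The savings grow with vertex sharing: in a typical closed mesh most vertices are shared by several triangles, so the indexed form is far smaller than this single-quad example suggests, and the vertex cache can also reuse shaded results for repeated indices.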