Shader architectures can be unified or non-unified. Many mobile and embedded GPUs use a unified shader architecture.
A unified shader architecture executes shader programs, such as fragment and vertex shaders, on the same processing modules.
A non-unified architecture uses separate dedicated processing modules for vertex and fragment processing.
Unified architectures can save power and increase performance compared to a non-unified architecture.
Unified architectures also adapt much more easily to a given application: whether it is fragment- or vertex-shader bound, the unified processors are allocated to whichever workload dominates.
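This scaling benefit can be illustrated with a toy scheduling model. All numbers here are hypothetical, and the two functions are illustrative simplifications, not a model of any real GPU:

```python
# Toy model: compare frame time for the same workload on a non-unified
# design (fixed vertex/fragment core pools) vs a unified design where
# every core can run either shader type.

def non_unified_time(vertex_work, fragment_work, vertex_cores, fragment_cores):
    # Each pool can only run its own shader type, so the slower pool
    # determines the frame time while the other pool sits idle.
    return max(vertex_work / vertex_cores, fragment_work / fragment_cores)

def unified_time(vertex_work, fragment_work, total_cores):
    # All cores pick up whichever work remains, so the load balances.
    return (vertex_work + fragment_work) / total_cores

# A fragment-bound frame: 10 units of vertex work, 90 of fragment work.
t_fixed = non_unified_time(10, 90, vertex_cores=4, fragment_cores=4)
t_unified = unified_time(10, 90, total_cores=8)
print(t_fixed)    # 22.5 -- the 4 fragment cores are the bottleneck
print(t_unified)  # 12.5 -- all 8 cores share the fragment load
```

In the fragment-bound case the dedicated vertex cores sit mostly idle, while the unified design keeps every core busy.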
Rasterization: The process of determining which pixels a given primitive touches. Rasterization and pixel coloring are performed on a per-tile basis with the following steps:
When a tile operation begins, the corresponding tile list is retrieved from the Parameter Buffer (PB) to identify the screen-space primitive data that needs to be fetched.
The Image Synthesis Processor (ISP) fetches the primitive data and performs Hidden Surface Removal (HSR), along with depth and stencil tests. The ISP only fetches screen-space position data for the geometry within the tile.
The Tag Buffer records, for each pixel, which triangle is visible on top.
The Texture and Shading Processor (TSP) then applies coloring operations, like fragment shaders, to the visible pixels.
Alpha testing is then carried out, followed by alpha blending.
Once the tile’s render is complete, the color data is written to the frame buffer in system memory.
This process is repeated until all tiles have been processed and the frame buffer is complete.
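The per-tile steps above can be sketched as a simplified software model. This is illustrative only, not the actual hardware: the tile size, the primitive representation, and treating depth as a single float per pixel are all assumptions:

```python
# Simplified per-tile pipeline: depth-test every opaque primitive first
# (Hidden Surface Removal), record only the front-most primitive per
# pixel in a "tag buffer", then shade each visible pixel exactly once.

TILE_W, TILE_H = 4, 4  # illustrative tile size

def render_tile(primitives):
    depth = [[float("inf")] * TILE_W for _ in range(TILE_H)]
    tags = [[None] * TILE_W for _ in range(TILE_H)]  # winning primitive per pixel

    # ISP stage: uses position data only, keeps the nearest primitive per pixel.
    for prim in primitives:
        for (x, y, z) in prim["fragments"]:  # pixel coords plus depth
            if z < depth[y][x]:
                depth[y][x] = z
                tags[y][x] = prim

    # TSP stage: run the fragment "shader" only for pixels that survived HSR.
    color = [[None] * TILE_W for _ in range(TILE_H)]
    for y in range(TILE_H):
        for x in range(TILE_W):
            if tags[y][x] is not None:
                color[y][x] = tags[y][x]["color"]  # stand-in for a real shader
    return color

# Two overlapping primitives: red in front (z=0.2) partly covers blue (z=0.8).
red = {"color": "red", "fragments": [(0, 0, 0.2), (1, 0, 0.2)]}
blue = {"color": "blue", "fragments": [(0, 0, 0.8), (2, 0, 0.8)]}
tile = render_tile([red, blue])
print(tile[0][:3])  # ['red', 'red', 'blue'] -- pixel (0,0) shaded once, as red
```

Note that pixel (0, 0) is covered by both primitives but shaded only once, which is the point of deferring shading until after visibility is resolved.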
Read-Modify-Write operations for the color, depth and stencil buffers are performed using fast on-chip memory instead of relying on repeated system memory access, as traditional IMRs do. Attachments that the application has chosen to preserve, such as the color buffer, will be written to system memory.
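A toy memory-traffic model makes the saving concrete. The per-pixel byte counts, the overdraw factor, and the assumption that on-chip access is free are all illustrative, not measured figures:

```python
# Toy DRAM-traffic model for color/depth read-modify-write.
# A TBDR keeps the working buffers on-chip and writes out only preserved
# attachments; an IMR performs the RMW traffic against system memory.

BYTES_COLOR, BYTES_DEPTH = 4, 4  # assumed per-pixel sizes

def imr_traffic(pixels, overdraw):
    # Each fragment reads and writes depth, and writes color, in DRAM.
    return pixels * overdraw * (2 * BYTES_DEPTH + BYTES_COLOR)

def tbdr_traffic(pixels, preserved_color=True):
    # On-chip RMW costs no DRAM traffic in this model; only preserved
    # attachments (here, the color buffer) are written at end of tile.
    return pixels * BYTES_COLOR if preserved_color else 0

px = 1920 * 1080
print(imr_traffic(px, overdraw=3))  # 74649600 bytes of DRAM traffic
print(tbdr_traffic(px))             # 8294400 bytes -- color write-out only
```

Even in this crude model the deferred design touches system memory an order of magnitude less, which is where the power saving comes from.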
The usual rendering technique on most GPUs is called Immediate Mode Rendering (IMR), in which geometry is sent to the GPU and drawn straight away. This simple architecture is relatively inefficient, wasting processing power and memory bandwidth: pixels are often rendered despite never being visible on screen, such as when a car is completely obscured by a closer building.
A Tile-Based Deferred Rendering architecture works in a much more intelligent way. It captures the whole scene before starting to render, so occluded pixels can be identified and rejected before they are processed. The hardware splits the geometry data into small rectangular screen regions, called "tiles", and each tile is rasterized and processed separately. Because each per-tile render is so small, all of its data can be kept in very fast on-chip memory.
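The tiling step, which builds a primitive list per tile, can be sketched as follows. The 32-pixel tile size and the bounding-box binning are assumptions for illustration; real hardware may bin more precisely:

```python
# Bin each primitive into every tile its screen-space bounding box
# overlaps, producing one primitive list per tile (the "tile lists").

TILE = 32  # assumed tile size in pixels

def bin_primitives(primitives, screen_w, screen_h):
    cols = (screen_w + TILE - 1) // TILE
    rows = (screen_h + TILE - 1) // TILE
    tile_lists = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for idx, verts in enumerate(primitives):
        xs = [v[0] for v in verts]
        ys = [v[1] for v in verts]
        # Append the primitive to every tile its bounding box touches.
        for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
            for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
                if (tx, ty) in tile_lists:
                    tile_lists[(tx, ty)].append(idx)
    return tile_lists

# One triangle entirely inside tile (0, 0), one spanning two tiles.
tris = [[(2, 2), (10, 2), (2, 10)], [(20, 5), (40, 5), (30, 12)]]
lists = bin_primitives(tris, screen_w=64, screen_h=64)
print(lists[(0, 0)])  # [0, 1]
print(lists[(1, 0)])  # [1]
```

Each resulting list is what a per-tile pass later fetches: only the primitives that can possibly affect that tile.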
Deferred rendering means that the architecture will defer all texturing and shading operations until all objects have been tested for visibility. This significantly reduces system memory bandwidth requirements, which in turn increases performance and reduces power requirements. This is a critical advantage for phones, tablets, and other devices where battery life makes all the difference.
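The shading saving can be made concrete with a toy overdraw count. The overdraw factor and worst-case IMR assumption (back-to-front submission defeating early depth testing) are hypothetical:

```python
# Count fragment-shader invocations for a stack of overlapping opaque
# layers. An IMR may shade every covered fragment as it arrives; a
# deferred renderer shades only the front-most fragment per pixel.

def imr_shades(layers_per_pixel, pixels):
    # Worst case: every covered fragment is shaded.
    return layers_per_pixel * pixels

def deferred_shades(layers_per_pixel, pixels):
    # Visibility is fully resolved first, so one shade per pixel.
    return pixels

pixels = 1920 * 1080
print(imr_shades(4, pixels))       # 8294400 shader runs at 4x overdraw
print(deferred_shades(4, pixels))  # 2073600 -- a 4x reduction
```

With 4x overdraw, three quarters of the IMR's shading work (and the texture bandwidth it pulls in) is spent on pixels that are never seen.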