Every frame, the hardware processes submitted geometry data with the following steps:
Application-defined transformations, such as vertex shaders, are executed (Vertex Processing).
The resulting data is then converted to screen space (Clip, Project, and Cull).
The Tile Accelerator (TA) then determines which tiles contain each transformed primitive (Tiling).
Per-tile lists are then updated to track the primitives which fall within the bounds of each tile.
Each tile in the list contains primitive lists which contain pointers to the transformed vertex data. The tile list and the transformed vertex data are both stored in an intermediate store called the Parameter Buffer (PB). This store resides in system memory, and is mostly managed by the hardware. It contains all information needed to render the tiles.
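The tiling step can be illustrated with a minimal Python sketch. It assumes simple bounding-box binning of 2D screen-space triangles and a hypothetical 32-pixel tile size; the names `TILE_SIZE` and `bin_triangles` are illustrative only, and real hardware may use a different tile size and a more precise coverage test.

```python
from collections import defaultdict

TILE_SIZE = 32  # hypothetical tile edge in pixels; real tile sizes vary


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Return {(tile_x, tile_y): [triangle indices]} using bounding-box binning."""
    tile_lists = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip primitives that lie completely outside the render target.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the screen, then convert to tile coordinates.
        x0 = max(int(min(xs)), 0) // tile_size
        x1 = min(int(max(xs)), width - 1) // tile_size
        y0 = max(int(min(ys)), 0) // tile_size
        y1 = min(int(max(ys)), height - 1) // tile_size
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                # Store a reference to the primitive, not a copy of its vertices.
                tile_lists[(tx, ty)].append(index)
    return dict(tile_lists)


example_triangles = [
    [(1, 1), (10, 1), (1, 10)],      # fits inside the first tile
    [(20, 20), (40, 20), (20, 40)],  # straddles four tiles
    [(100, 100), (120, 100), (100, 120)],  # off-screen for a 64x64 target
]
example_bins = bin_triangles(example_triangles, 64, 64)
```

Bounding-box binning is conservative: a triangle may be listed in a tile that its bounding box touches even though the triangle itself does not cover any pixel there.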
The usual rendering technique on most GPUs is called Immediate Mode Rendering (IMR), in which geometry is sent to the GPU and drawn straight away. This simple architecture is relatively inefficient and wastes processing power and memory bandwidth. Pixels are often still rendered despite never being visible on the screen, such as when a car is completely obscured by a closer building.
A Tile-Based Deferred Rendering (TBDR) architecture works in a much more intelligent way. It captures the whole scene before starting to render, so occluded pixels can be identified and rejected before they are processed. The hardware first splits the geometry data into small rectangular regions called “tiles”, each of which is processed as one image. Every tile is rasterized and processed separately, and because each tile is so small, all of its data can be kept in very fast on-chip memory.
Deferred rendering means that the architecture will defer all texturing and shading operations until all objects have been tested for visibility. This significantly reduces system memory bandwidth requirements, which in turn increases performance and reduces power requirements. This is a critical advantage for phones, tablets, and other devices where battery life makes all the difference.
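To make the saving concrete, the following Python sketch counts how many fragments get shaded under each approach. It is a simplified model and not a description of any specific GPU: it assumes opaque geometry only, a "closer wins" depth test, and an immediate-mode renderer that shades every fragment passing the depth test at the moment it is submitted. The function names are hypothetical.

```python
def shaded_fragments_immediate(fragments_per_pixel):
    """Count fragments shaded when each one is depth-tested and shaded in submission order."""
    shaded = 0
    for depths in fragments_per_pixel:
        nearest = float("inf")
        for depth in depths:
            if depth < nearest:  # passes the depth test, so it is shaded
                nearest = depth
                shaded += 1
    return shaded


def shaded_fragments_deferred(fragments_per_pixel):
    """Count fragments shaded when visibility is resolved before any shading."""
    # Only the front-most opaque fragment of each covered pixel is shaded.
    return sum(1 for depths in fragments_per_pixel if depths)


# Depths of the fragments submitted for three pixels, in submission order.
pixels = [[0.9, 0.5, 0.1], [0.2, 0.8], []]
immediate_count = shaded_fragments_immediate(pixels)
deferred_count = shaded_fragments_deferred(pixels)
```

In this model the gap is largest when geometry is submitted back to front, because every new fragment then passes the depth test and is shaded before being overwritten.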
This part covers key principles to follow in order to avoid critical performance flaws when creating or optimizing graphics apps. The recommendations come from real industry experience and from the developers working in it.
Understand the Target Device
Learn as much as possible about the target device and its graphics architecture so that you can use the device in the most efficient manner possible.
Profile the Workload
Identify the bottlenecks in the apps you are optimizing and determine whether there are opportunities for improvement.
Perform Clears Properly
Perform a clear on a framebuffer’s contents to avoid fetching the previous frame’s data on tile-based graphics architectures, which reduces memory bandwidth.
Do Not Update Data Buffers Mid-Frame
Avoid touching any buffer when a frame is mid-flight to reduce stalls and temporary buffer stores.
Use Texture Compression
Reduce the memory footprint and bandwidth cost of texture assets.
This increases texture cache efficiency, which reduces bandwidth and increases performance.
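As a rough illustration of the footprint saving, the arithmetic below compares an uncompressed 32-bit RGBA texture with a block-compressed format. The 4 bits-per-texel figure is an assumption chosen for the example; actual rates depend on the compression format and mode used.

```python
def texture_bytes(width, height, bits_per_texel):
    """Size in bytes of a single mip level with the given storage rate."""
    return width * height * bits_per_texel // 8


uncompressed = texture_bytes(1024, 1024, 32)  # RGBA, 8 bits per channel
compressed = texture_bytes(1024, 1024, 4)     # assumed 4 bits-per-texel format
ratio = uncompressed // compressed
```

Every texel fetched from the compressed texture also moves proportionally fewer bytes across the memory bus, which is where the bandwidth saving comes from.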
Do Not Use Discard
Avoid forcing depth-test processing in the texture stage, as this decreases performance on architectures that rely on early depth rejection.
Do Not Force Unnecessary Synchronization
Avoid API functionality that could stall the graphics pipeline and do not access any hardware buffer directly.
Move Calculations “To the Front”
Reduce the overall number of calculations by moving them earlier in the pipeline, where there are fewer instances to process.
Group per Material
Grouping geometry and texture data can improve app performance.
Do Not Use Depth Pre-pass
Depth pre-pass is redundant on deferred rendering architectures.
Prefer Explicit APIs
Graphics apps built with explicit APIs tend to run more efficiently, if set up correctly.
Prefer Lower Data Precision
Lower precision shader variables should be used, where appropriate, to improve performance.
Use All CPU Cores
Using multi-threading in apps is critical to efficient CPU use.
Use Indexed Lists
Indexed lists can reduce mesh storage requirements by eliminating redundant vertices.
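A minimal Python sketch of the idea follows. It assumes position-only vertices stored as three 32-bit floats and 16-bit indices; the helper name `build_indexed_list` is hypothetical and not part of any graphics API.

```python
def build_indexed_list(triangle_soup):
    """Collapse repeated vertices into a unique vertex list plus an index list."""
    vertices = []
    lookup = {}
    indices = []
    for vertex in triangle_soup:
        if vertex not in lookup:
            lookup[vertex] = len(vertices)
            vertices.append(vertex)
        indices.append(lookup[vertex])
    return vertices, indices


# A quad drawn as two triangles: six vertices, two of which are repeated.
quad_soup = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
    (0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0),
]
quad_vertices, quad_indices = build_indexed_list(quad_soup)

BYTES_PER_VERTEX = 3 * 4  # assumed: three 32-bit floats per position
BYTES_PER_INDEX = 2       # assumed: 16-bit indices
soup_bytes = len(quad_soup) * BYTES_PER_VERTEX
indexed_bytes = (len(quad_vertices) * BYTES_PER_VERTEX
                 + len(quad_indices) * BYTES_PER_INDEX)
```

The saving grows with the size of each vertex, since normals, texture coordinates, and other attributes are stored once per unique vertex rather than once per reference.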
Use On-chip Memory Efficiently for Deferred Rendering
Making better use of on-chip memory reduces overall system memory bandwidth usage.
The GPU pipeline takes the vertices through several stages, during which their coordinates are transformed between various spaces. Programmers using different graphics frameworks care about only certain stages of the pipeline. Normally, the GPU vendor takes care of the “Vertex Fetch”, “Primitive Assembly”, “Rasterization”, and “Framebuffer” stages in its drivers, while the graphics software programmer only needs to care about the Vertex Processing and Fragment Processing stages, which are programmable through graphics APIs such as Metal or OpenGL.
1. Vertex Fetch
Different graphics Application Programming Interfaces (APIs) have different names for this stage. In order to start rendering 3D content, we first need a scene. A scene consists of models, which have meshes of vertices. For example, one of the simplest models is the cube, which consists of six faces (12 triangles). A specific hardware unit called the Scheduler sends the vertices and their attributes on to the Vertex Processing stage.
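The cube example can be written out as data. The Python sketch below is illustrative only: the layout of `CUBE_VERTICES` and `CUBE_FACES` and the helper names are assumptions, and triangle winding order is ignored for brevity.

```python
# Eight corners of a unit cube.
CUBE_VERTICES = [
    (0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
    (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1),
]

# Six faces, each a quad given as four corner indices in order around the face.
CUBE_FACES = [
    (0, 1, 2, 3), (4, 5, 6, 7), (0, 1, 5, 4),
    (2, 3, 7, 6), (1, 2, 6, 5), (0, 3, 7, 4),
]


def triangulate(faces):
    """Split each quad (a, b, c, d) into the triangles (a, b, c) and (a, c, d)."""
    indices = []
    for a, b, c, d in faces:
        indices += [a, b, c, a, c, d]
    return indices


def fetch_vertices(vertices, indices):
    """Look up each index in the vertex list, as a vertex fetch stage would."""
    return [vertices[i] for i in indices]


cube_indices = triangulate(CUBE_FACES)
cube_stream = fetch_vertices(CUBE_VERTICES, cube_indices)
triangle_count = len(cube_indices) // 3
```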
2. Vertex Processing
At this stage, vertices are processed one by one. Programmers write code to calculate per-vertex lighting and color. The vertex coordinates are also sent through various coordinate spaces to reach their position in the final framebuffer. The CPU sends the GPU a vertex buffer that the programmer created from the model mesh. The programmer configures the vertex buffer using a vertex descriptor, which tells the GPU how the vertex data is structured. On the GPU, the programmer creates a struct to encapsulate the vertex attributes. The vertex shader takes in this struct as a function argument and, through the qualifier, acknowledges that position comes from the CPU via the position in the vertex buffer. The vertex shader then processes all the vertices and returns their positions as a float4. A specific hardware unit called the Distributer sends the grouped blocks of vertices on to the Primitive Assembly stage.
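The journey through coordinate spaces can be sketched on the CPU in Python. This is not shader code: it assumes an OpenGL-style perspective matrix (other APIs use a different clip-space depth range), an identity model-view transform, and a viewport with its origin at the top-left. All function names are hypothetical.

```python
import math


def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection matrix (camera looks down -z)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def vertex_shader(position, mvp):
    """Return the clip-space position as a 4-component vector."""
    x, y, z = position
    return mat_vec(mvp, [x, y, z, 1.0])


def to_screen(clip, width, height):
    """Perspective divide followed by a viewport transform."""
    x, y, _, w = clip
    ndc_x, ndc_y = x / w, y / w
    return ((ndc_x + 1.0) * 0.5 * width, (1.0 - ndc_y) * 0.5 * height)


projection = perspective(90.0, 1.0, 0.1, 100.0)
clip_position = vertex_shader((0.0, 0.0, -1.0), projection)
screen_position = to_screen(clip_position, 640, 480)
```

A vertex on the camera's view axis lands in the center of the viewport, which is a quick sanity check for the matrix and the viewport mapping.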
3. Primitive Assembly
The previous stage sends processed vertices, grouped into blocks of data, to this stage. One thing to keep in mind is that vertices belonging to the same geometrical shape (primitive) are always in the same block. The one vertex of a point, the two vertices of a line, or the three vertices of a triangle are therefore always in the same block, so a second block fetch is never necessary. At this stage, primitives are fully assembled from connected vertices and move on to the rasterizer.
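As a small illustration of assembly, the Python sketch below groups a flat stream of processed vertices into primitives. It assumes list topologies only (no strips or fans), and the names `VERTICES_PER_PRIMITIVE` and `assemble_primitives` are hypothetical.

```python
VERTICES_PER_PRIMITIVE = {"point": 1, "line": 2, "triangle": 3}


def assemble_primitives(vertices, topology):
    """Group a flat stream of processed vertices into complete primitives."""
    size = VERTICES_PER_PRIMITIVE[topology]
    # Ignore trailing vertices that do not form a complete primitive.
    usable = len(vertices) - len(vertices) % size
    return [tuple(vertices[i:i + size]) for i in range(0, usable, size)]


stream = ["v0", "v1", "v2", "v3", "v4", "v5", "v6"]
triangles = assemble_primitives(stream, "triangle")
lines = assemble_primitives(stream, "line")
points = assemble_primitives(stream, "point")
```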
4. Rasterization
For every object in the scene, rays are sent back into the screen to check which pixels are covered by the object. Depth information is kept, so the pixel color is updated only if the current object is closer than the previously saved one. At this point, all connected vertices sent from the previous stage need to be represented on a two-dimensional grid using their X/Y coordinates. This step is called triangle setup. Then a process called scan conversion runs on each line of the screen to look for intersections and to determine what is visible and what is not. For drawing onto the screen, only the vertices and the slopes they determine are needed. The stored depth information is used to determine whether each point is in front of other points in the scene. After this stage, the Scheduler unit again dispatches work to the shader cores, but this time it sends the rasterized fragments for Fragment Processing.
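Scan conversion with a depth test can be sketched in Python as follows. This is a simplified model under stated assumptions: triangles are already in screen space, coverage is sampled at pixel centers, and each triangle carries a single flat depth value, whereas real rasterizers interpolate depth per fragment. The dictionary layout and function names are hypothetical.

```python
import math


def scanline_spans(triangle, height):
    """Yield (row, x_start, x_end) spans covered by a 2D triangle."""
    ys = [y for _, y in triangle]
    first_row = max(math.ceil(min(ys) - 0.5), 0)
    last_row = min(math.floor(max(ys) - 0.5), height - 1)
    for row in range(first_row, last_row + 1):
        sample_y = row + 0.5  # sample at the pixel center
        crossings = []
        for i in range(3):
            (x0, y0), (x1, y1) = triangle[i], triangle[(i + 1) % 3]
            if y0 == y1:
                continue  # horizontal edges never cross a scanline
            if min(y0, y1) <= sample_y < max(y0, y1):
                t = (sample_y - y0) / (y1 - y0)
                crossings.append(x0 + t * (x1 - x0))  # follows the edge slope
        if len(crossings) >= 2:
            yield row, min(crossings), max(crossings)


def rasterize(triangles, width, height):
    """Return a color buffer after depth-tested rasterization."""
    color = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for tri in triangles:
        for row, x_start, x_end in scanline_spans(tri["points"], height):
            first = max(math.ceil(x_start - 0.5), 0)
            last = min(math.floor(x_end - 0.5), width - 1)
            for col in range(first, last + 1):
                if tri["depth"] < depth[row][col]:  # closer than what is stored
                    depth[row][col] = tri["depth"]
                    color[row][col] = tri["color"]
    return color


far_red = {"points": [(0, 0), (8, 0), (0, 8)], "depth": 0.8, "color": "red"}
near_blue = {"points": [(0, 0), (4, 0), (0, 4)], "depth": 0.2, "color": "blue"}
image = rasterize([far_red, near_blue], 8, 8)
```

Because of the depth test, the result does not depend on the order in which the two triangles are submitted.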
5. Fragment Processing
The primitive processing in the previous stages was sequential because there is only one Primitive Assembly unit and one Rasterization unit. However, as soon as fragments reach the Scheduler, work can be forked (divided) into many small sections, and each section is given to an available shader core. Thousands of cores then process the fragments in parallel. Fragment processing is another programmable stage. The programmer creates a fragment shader function, which receives the lighting, texture coordinate, depth, and color information that the vertex function outputs. All of these attributes are interpolated for each fragment. The fragment shader outputs a single color for that fragment, and each of these fragments contributes to the color of the final pixel in the framebuffer.
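Attribute interpolation can be illustrated with barycentric weights. The Python sketch below assumes plain screen-space (non-perspective-correct) interpolation and a trivial fragment shader that passes the interpolated color through; all names are hypothetical.

```python
def edge(a, b, p):
    """Twice the signed area of the triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def barycentric(tri, p):
    """Weights of the three triangle vertices at point p."""
    a, b, c = tri
    area = edge(a, b, c)
    return (edge(b, c, p) / area, edge(c, a, p) / area, edge(a, b, p) / area)


def interpolate(tri, attributes, p):
    """Blend one per-vertex attribute (a tuple per vertex) at point p."""
    weights = barycentric(tri, p)
    size = len(attributes[0])
    return tuple(
        sum(w * attr[i] for w, attr in zip(weights, attributes))
        for i in range(size)
    )


def fragment_shader(interpolated_color):
    """Return a single RGBA color for the fragment."""
    r, g, b = interpolated_color
    return (r, g, b, 1.0)


tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
vertex_colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
fragment_color = fragment_shader(interpolate(tri, vertex_colors, (2.0, 0.0)))
```

Each fragment is computed independently of the others, which is what allows the work to be spread across many shader cores.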
6. Framebuffer
As soon as fragments have been processed into pixels, the Distributer unit sends them to the Color Writing unit. This unit is responsible for writing the final color to a special memory location called the framebuffer. From here, the view gets its colored pixels refreshed every frame. A technique called double-buffering is used for this last step: while the first buffer is being shown on the display, the second one is updated in the background. Then the two buffers are swapped, so the second one is displayed on the screen while the first one is updated, and the cycle continues.
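The double-buffering cycle can be sketched with a small Python class. The `DoubleBuffer` class and its methods are hypothetical and only model the swap; a real display pipeline also synchronizes the swap with the display refresh.

```python
class DoubleBuffer:
    """Two pixel buffers: one being displayed, one being drawn into."""

    def __init__(self, width, height, clear_color=(0, 0, 0)):
        self.front = [[clear_color] * width for _ in range(height)]
        self.back = [[clear_color] * width for _ in range(height)]

    def write_pixel(self, x, y, color):
        """Color writes always go to the buffer that is not on screen."""
        self.back[y][x] = color

    def swap(self):
        """Present the finished frame and reuse the old one for drawing."""
        self.front, self.back = self.back, self.front


buffers = DoubleBuffer(2, 2)
buffers.write_pixel(0, 0, (255, 0, 0))
visible_before_swap = buffers.front[0][0]
buffers.swap()
visible_after_swap = buffers.front[0][0]
```

Swapping references instead of copying pixels keeps the swap cheap regardless of the framebuffer size.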