How to optimize for GPU / Graphics workload performance
This section lists key principles to follow to avoid critical performance flaws when creating or optimizing graphics apps. The recommendations come from real-world industry experience and from the developers working in it.
- Understand the Target Device
- Learn as much as possible about the target device and its graphics architecture so that you can use the device as efficiently as possible.
- Profile the Workload
- Identify the bottlenecks in the apps you are optimizing and determine whether there are opportunities for improvement.
- Perform a Clear
- Clear the framebuffer’s contents at the start of a frame so that tile-based graphics architectures do not fetch the previous frame’s data, which reduces memory bandwidth usage.
- Do Not Update Data Buffers Mid-Frame
- Avoid modifying any buffer while a frame is in flight, to reduce stalls and temporary buffer stores.
- Use Texture Compression
- Reduce the memory footprint and bandwidth cost of texture assets.
- Use Mipmapping
- Mipmapping increases texture cache efficiency, which reduces bandwidth and increases performance.
- Do Not Use Discard
- Avoid discard in fragment shaders: it forces depth-test processing into the texture stage, which decreases performance on architectures with early depth rejection.
- Do Not Force Unnecessary Synchronization
- Avoid API functionality that could stall the graphics pipeline and do not access any hardware buffer directly.
- Move Calculations “To the Front”
- Reduce the overall number of calculations by moving them earlier in the pipeline, where there are fewer instances to process.
- Group per Material
- Grouping geometry and texture data by material can improve app performance.
- Do Not Use Depth Pre-pass
- Depth pre-pass is redundant on deferred rendering architectures.
- Prefer Explicit APIs
- Graphics apps built with explicit APIs tend to run more efficiently, if set up correctly.
- Prefer Lower Data Precision
- Lower precision shader variables should be used, where appropriate, to improve performance.
- Use All CPU Cores
- Using multi-threading in apps is critical to efficient CPU use.
- Use Indexed Lists
- Indexed lists can reduce mesh storage requirements by eliminating redundant vertices.
- Use On-chip Memory Efficiently for Deferred Rendering
- Making better use of on-chip memory reduces overall system memory bandwidth usage.
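
One common way to follow the "Do Not Update Data Buffers Mid-Frame" recommendation is to keep several copies of each dynamic buffer and write only to the copy reserved for the frame being recorded. This technique is not described in the text above; the sketch below is a minimal illustration under stated assumptions. `BufferRing`, `begin_frame`, and `frames_in_flight` are hypothetical names, and plain `bytearray` objects stand in for GPU buffers.

```python
class BufferRing:
    """Round-robin set of per-frame buffers.

    Assumption: bytearrays stand in for GPU buffers. A real renderer would
    also wait on a fence before reusing a slot, which is omitted here.
    """

    def __init__(self, size_bytes, frames_in_flight=3):
        self._buffers = [bytearray(size_bytes) for _ in range(frames_in_flight)]
        self._frame = 0

    def begin_frame(self):
        """Return the buffer reserved for the frame about to be recorded."""
        buf = self._buffers[self._frame % len(self._buffers)]
        self._frame += 1
        return buf


ring = BufferRing(size_bytes=16, frames_in_flight=3)

first = ring.begin_frame()
first[0] = 1            # data for frame 0

second = ring.begin_frame()
second[0] = 2           # data for frame 1 goes to a different buffer
```

Because each frame writes to its own slot, data that an earlier frame may still be reading is left untouched until the ring wraps around.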
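
As a rough illustration of the "Use Mipmapping" recommendation, the following sketch computes the dimensions of a full mip chain and its memory cost. It assumes an uncompressed texture at 4 bytes per texel and halves each dimension per level (rounding down, never below 1). `mip_chain` and `chain_bytes` are hypothetical helper names.

```python
def mip_chain(width, height):
    """Return (width, height) for every mip level, from the base level to 1x1."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        width = max(1, width // 2)
        height = max(1, height // 2)
        levels.append((width, height))
    return levels


def chain_bytes(width, height, bytes_per_texel=4):
    """Total size in bytes of all mip levels (assumes an uncompressed format)."""
    return sum(w * h * bytes_per_texel for w, h in mip_chain(width, height))


base_bytes = 1024 * 1024 * 4
full_bytes = chain_bytes(1024, 1024)
overhead = full_bytes / base_bytes - 1.0   # extra memory relative to the base level
```

For a 1024x1024 texture the chain has 11 levels and costs about one third more memory than the base level alone.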
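
The "Group per Material" recommendation can be sketched as sorting draw calls so that calls sharing a material are submitted together. The example below is illustrative only: `DrawCall`, `count_material_switches`, and `group_by_material` are hypothetical names, a material is reduced to a string identifier, and the geometry is assumed to be opaque so that draw order does not affect the final image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawCall:
    mesh: str
    material: str  # hypothetical identifier for a shader + texture set


def count_material_switches(draws):
    """Count how often the bound material changes across a draw sequence."""
    switches = 0
    current = None
    for draw in draws:
        if draw.material != current:
            switches += 1
            current = draw.material
    return switches


def group_by_material(draws):
    """Stable sort: calls keep their relative order within each material."""
    return sorted(draws, key=lambda draw: draw.material)


draws = [
    DrawCall("rock_a", "stone"),
    DrawCall("tree_a", "bark"),
    DrawCall("rock_b", "stone"),
    DrawCall("tree_b", "bark"),
    DrawCall("rock_c", "stone"),
]

switches_before = count_material_switches(draws)
switches_after = count_material_switches(group_by_material(draws))
```

In this small scene the interleaved order binds a material five times, while the grouped order binds one only twice.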
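
To make the "Use Indexed Lists" recommendation concrete, the sketch below converts a plain triangle list with repeated vertices into a unique vertex list plus an index list. It is a minimal illustration, not a production mesh optimizer: `build_indexed_mesh` is a hypothetical helper, and vertices are assumed to be position-only tuples compared for exact equality.

```python
def build_indexed_mesh(triangle_vertices):
    """Collapse duplicate vertices into a unique vertex list and an index list."""
    unique_vertices = []
    index_of = {}       # vertex tuple -> position in unique_vertices
    indices = []
    for vertex in triangle_vertices:
        index = index_of.get(vertex)
        if index is None:
            index = len(unique_vertices)
            index_of[vertex] = index
            unique_vertices.append(vertex)
        indices.append(index)
    return unique_vertices, indices


# Two triangles forming a quad: 6 vertices, 2 of them repeated.
quad = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
    (0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0),
]

vertices, indices = build_indexed_mesh(quad)
```

The quad shrinks from six stored vertices to four, with the six-entry index list describing the same two triangles. The saving grows with vertex size, since an index is typically much smaller than a full vertex.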