How to optimize GPU / graphics workload performance

This section covers key principles to follow to avoid critical performance flaws when creating or optimizing graphics apps. The recommendations below come from real industry experience and the developers working in it.

  1. Understand the Target Device
    • Learn as much as possible about the target device and its graphics architecture so that you can use the hardware in the most efficient manner possible.
  2. Profile the Workload
    • Identify the bottlenecks in the apps you are optimizing and determine whether there are opportunities for improvement.
  3. Perform Clear Well
    • Clear the full framebuffer at the start of the frame so that tile-based graphics architectures do not fetch the previous frame’s data, which saves memory bandwidth (sketched below).
  4. Do Not Update Data Buffers Mid-Frame
    • Avoid touching any buffer while a frame that uses it is still in flight, to reduce stalls and temporary buffer stores (sketched below).
  5. Use Texture Compression
    • Reduce the memory footprint and bandwidth cost of texture assets.
  6. Use Mipmapping
    • This increases texture cache efficiency, which reduces bandwidth usage and increases performance (sketched below).
  7. Do Not Use Discard
    • Avoid discard, which forces depth-test processing to happen after the texture stage and decreases performance on architectures with early depth rejection (sketched below).
  8. Do Not Force Unnecessary Synchronization
    • Avoid API functionality that could stall the graphics pipeline and do not access any hardware buffer directly.
  9. Move Calculations “To the Front”
    • Reduce the overall number of calculations by moving them earlier in the pipeline, where there are fewer instances to process (sketched below).
  10. Group per Material
    • Grouping geometry by material (shader and texture set) reduces state changes and can improve app performance (sketched below).
  11. Do Not Use Depth Pre-pass
    • Depth pre-pass is redundant on deferred rendering architectures.
  12. Prefer Explicit APIs
    • Graphics apps made with explicit APIs tend to run more efficiently, if set up correctly.
  13. Prefer Lower Data Precision
    • Lower-precision shader variables should be used, where appropriate, to improve performance (sketched below).
  14. Use All CPU Cores
    • Using multi-threading in apps is critical to efficient CPU use (sketched below).
  15. Use Indexed Lists
    • Indexed lists can reduce mesh storage requirements by eliminating redundant vertices (sketched below).
  16. Use On-chip Memory Efficiently for Deferred Rendering
    • Making better use of on-chip memory reduces overall system memory bandwidth usage.
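
Several of these tips are easier to see in code; the sketches below use C++ with OpenGL ES 3.0 and are illustrations under stated assumptions, not drop-in implementations.

Tip 3 in code: a minimal sketch assuming an active OpenGL ES 3.0 context and the default framebuffer. glInvalidateFramebuffer additionally tells a tile-based GPU that it need not write the listed attachments back to memory at the end of the frame.

```cpp
#include <GLES3/gl3.h>

void begin_frame() {
    // A full-screen clear lets a tile-based GPU start each tile from a
    // known value instead of fetching the previous frame's pixels.
    glDisable(GL_SCISSOR_TEST);  // a scissored clear can inhibit the fast path
    glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
}

void end_frame() {
    // Depth and stencil are not needed after the frame: discarding them
    // lets the GPU skip the tile write-back for those attachments.
    const GLenum discards[] = { GL_DEPTH, GL_STENCIL };
    glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discards);
}
```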
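
Tip 4 in code: a common pattern is to round-robin between several copies of a dynamic buffer, so the CPU never rewrites a copy that an in-flight frame may still be reading. The struct and constant below are illustrative.

```cpp
#include <GLES3/gl3.h>
#include <cstddef>

constexpr int kFramesInFlight = 3;  // illustrative; match your swapchain depth

struct DynamicBuffer {
    GLuint ids[kFramesInFlight] = {};
    int    frame = 0;

    void create(size_t bytes) {
        glGenBuffers(kFramesInFlight, ids);
        for (GLuint id : ids) {
            glBindBuffer(GL_ARRAY_BUFFER, id);
            glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_DYNAMIC_DRAW);
        }
    }

    // Call once at the start of the frame, before recording draws.
    GLuint next(const void* data, size_t bytes) {
        frame = (frame + 1) % kFramesInFlight;
        glBindBuffer(GL_ARRAY_BUFFER, ids[frame]);
        // Writes a copy no in-flight frame is using, so no stall and no
        // hidden temporary copy by the driver.
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data);
        return ids[frame];
    }
};
```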
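
Tip 6 in code: standard OpenGL ES 3.0 calls to upload a texture, generate its mip chain, and enable trilinear filtering; pixels is assumed to point at width x height RGBA8 data.

```cpp
#include <GLES3/gl3.h>

GLuint create_mipmapped_texture(int width, int height, const void* pixels) {
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    glGenerateMipmap(GL_TEXTURE_2D);  // builds the full mip chain
    // Trilinear minification: distant surfaces sample small mips, which
    // keeps neighbouring fetches close together in the texture cache.
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    return tex;
}
```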
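
Tip 7 in code: two illustrative GLSL ES 3.0 fragment shaders, stored as C++ string constants. The first uses discard, which forces the depth test to run after shading for every draw that uses it; the second leaves early depth rejection intact.

```cpp
// Alpha-tested shader: discard disables early-Z for this draw.
const char* kDiscardFrag = R"(#version 300 es
precision mediump float;
uniform sampler2D uTex;
in vec2 vUV;
out vec4 oColor;
void main() {
    vec4 c = texture(uTex, vUV);
    if (c.a < 0.5) discard;  // depth test must now wait until after shading
    oColor = c;
})";

// Opaque shader: no discard, so hidden fragments can be rejected early.
const char* kOpaqueFrag = R"(#version 300 es
precision mediump float;
uniform sampler2D uTex;
in vec2 vUV;
out vec4 oColor;
void main() {
    oColor = texture(uTex, vUV);
})";
```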
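
Tip 9 in code: there are usually far fewer vertices than fragments, so a value that varies linearly across a triangle, such as the simple distance fog here, can be computed per vertex and interpolated instead of recomputed per fragment. Illustrative GLSL ES 3.0 as C++ strings.

```cpp
const char* kVert = R"(#version 300 es
uniform mat4 uMVP;
uniform vec3 uEye;
in vec3 aPos;
out float vFog;
void main() {
    gl_Position = uMVP * vec4(aPos, 1.0);
    // Computed once per vertex; interpolation across the triangle is free.
    vFog = clamp(distance(aPos, uEye) / 100.0, 0.0, 1.0);
})";

const char* kFrag = R"(#version 300 es
precision mediump float;
uniform vec4 uColor;
uniform vec4 uFogColor;
in float vFog;
out vec4 oColor;
void main() {
    // Only a cheap mix remains in the per-fragment code.
    oColor = mix(uColor, uFogColor, vFog);
})";
```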
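
Tip 10 in code: sort the frame’s draws by a material key (shader plus texture set) so each material is bound once per group rather than once per draw. Types and names are illustrative, not from any particular engine.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Draw {
    uint32_t material;  // shader + texture combination id
    uint32_t mesh;
};

// Returns how many material binds the sorted order needs; the real
// bind/draw calls would go where the comments indicate.
int submit_sorted(std::vector<Draw>& draws) {
    std::sort(draws.begin(), draws.end(),
              [](const Draw& a, const Draw& b) { return a.material < b.material; });
    int binds = 0;
    uint32_t bound = UINT32_MAX;
    for (const Draw& d : draws) {
        if (d.material != bound) {
            bound = d.material;  // bind shader + textures once per group
            ++binds;
        }
        // submit the draw call for d.mesh here
    }
    return binds;
}
```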
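
Tip 13 in code: an illustrative GLSL ES 3.0 fragment shader using precision qualifiers. On many mobile GPUs mediump roughly halves register pressure and ALU cost; highp is kept where range actually matters, such as texture coordinates on large surfaces.

```cpp
const char* kPrecisionFrag = R"(#version 300 es
precision mediump float;     // default precision for floats in this shader
uniform sampler2D uTex;
uniform mediump vec4 uTint;  // colour data: mediump is plenty
in highp vec2 vUV;           // coordinates may need highp range
out mediump vec4 oColor;
void main() {
    oColor = texture(uTex, vUV) * uTint;
})";
```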
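
Tip 14 in code: a minimal sketch that spreads per-chunk CPU work across cores with std::async; cull_chunk is a hypothetical stand-in for culling, animation, or command recording, and a real engine would more likely use a persistent job system sized by std::thread::hardware_concurrency().

```cpp
#include <future>
#include <vector>

// Hypothetical per-chunk work; stands in for culling, animation, or
// command-list recording.
int cull_chunk(int /*chunk*/) { return 10; }

int cull_scene(int num_chunks) {
    std::vector<std::future<int>> jobs;
    jobs.reserve(num_chunks);
    for (int c = 0; c < num_chunks; ++c)
        jobs.push_back(std::async(std::launch::async, cull_chunk, c));
    int visible = 0;
    for (auto& j : jobs) visible += j.get();  // join all workers
    return visible;
}
```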
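
Tip 15 in code: an indexed quad stores 4 unique vertices plus 6 small indices instead of 6 full vertices, and shared vertices become eligible for the post-transform cache. Assumes an active OpenGL ES 3.0 context with attribute 0 as position.

```cpp
#include <GLES3/gl3.h>
#include <cstdint>

static const float kQuadVerts[] = {       // 4 unique vertices (x, y)
    -1.f, -1.f,   1.f, -1.f,   1.f, 1.f,  -1.f, 1.f,
};
static const uint16_t kQuadIndices[] = {  // 2 triangles referencing them
    0, 1, 2,   0, 2, 3,
};

void draw_quad(GLuint vbo, GLuint ibo) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(kQuadVerts), kQuadVerts, GL_STATIC_DRAW);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(kQuadIndices), kQuadIndices,
                 GL_STATIC_DRAW);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, nullptr);
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, nullptr);
}
```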

How to design a memory controller with in-order read responses?

How to design a memory controller is a frequent technical interview question. We show one example below.

The memory controller takes incoming requests, each with an address and a request ID, as inputs. It is expected to produce read responses, each carrying a response ID, as outputs. Internally, it can access memory to fetch the read data.
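
A behavioral C++ sketch of one possible answer follows: requests are remembered in arrival order, memory may complete them out of order, and a response is released only when the oldest outstanding request has its data, which is what enforces in-order responses. This is a software model for illustration, not RTL, and all names and interfaces are assumptions.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_map>

struct Request  { uint64_t addr; uint32_t req_id; };
struct Response { uint32_t resp_id; uint32_t data; };

class MemoryController {
public:
    // Accept a new read request; remember its ID in arrival order.
    void issue(const Request& r) { order_.push_back(r.req_id); }

    // Memory returns data for some request, possibly out of order.
    void on_mem_complete(uint32_t req_id, uint32_t data) { done_[req_id] = data; }

    // Release the next response only if the oldest request has completed;
    // younger completed reads wait their turn in done_.
    std::optional<Response> pop_response() {
        if (order_.empty()) return std::nullopt;
        auto it = done_.find(order_.front());
        if (it == done_.end()) return std::nullopt;  // oldest still pending
        Response resp{order_.front(), it->second};
        done_.erase(it);
        order_.pop_front();
        return resp;
    }

private:
    std::deque<uint32_t> order_;                   // request IDs, oldest first
    std::unordered_map<uint32_t, uint32_t> done_;  // completed reads awaiting their turn
};
```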


Why is there sometimes no performance improvement from cache upsizing?

Usually we expect cache upsizing to improve system performance, but this is not always the case. There are several possible reasons:

  1. Compulsory misses, rather than capacity misses, are what limit performance, so a bigger cache does not help. In other words, the temporal and spatial locality a cache relies on is absent: the program keeps accessing new data with no reuse, which can happen in streaming applications. Likewise, if context switches happen often, the cache may be flushed frequently, causing more compulsory misses.
  2. In a cache-coherent system, two caches may compete for one copy of the data, i.e., “coherence” misses. This can happen when two CPUs try to acquire the same lock or semaphore simultaneously. Increasing cache size will not help performance in this case.
  3. If the cache upsizing is achieved by enlarging the cache line, then the time to load a cache line increases. This in turn increases the cache miss penalty and the average memory access time (AMAT); see the worked example after this list.
  4. If the cache upsizing is achieved by increasing associativity, then the hit latency, and with it the average memory access time, may increase, because the physical implementation of a highly associative cache can be hard.
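
A small worked example of the line-size effect in reason 3, using the standard formula AMAT = hit time + miss rate x miss penalty; all latency and miss-rate numbers below are made up for illustration.

```cpp
#include <cstdio>

int main() {
    const double hit_time = 1.0;  // cycles (illustrative)

    // Baseline line size: 5% miss rate, 50-cycle refill.
    const double amat_small = hit_time + 0.05 * 50.0;   // = 3.5 cycles

    // Doubled line size on a low-locality workload: the miss rate barely
    // improves (4.5%) while the longer refill raises the penalty (80 cycles).
    const double amat_large = hit_time + 0.045 * 80.0;  // = 4.6 cycles

    std::printf("AMAT, baseline lines: %.2f cycles\n", amat_small);
    std::printf("AMAT, doubled lines:  %.2f cycles\n", amat_large);
    // The larger cache is slower on this workload.
    return 0;
}
```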
