Serial Peripheral Interface

The serial peripheral interface (SPI) is a digital communication protocol that, like the UART, connects two or more devices. Here, we focus only on SPI communication between two devices, so one device will be the transmitter and the other the receiver. Unlike the UART, the SPI is a synchronous communication protocol. In addition, communication between the transmitter and receiver is full duplex. In other words, data is transmitted and received at the same time. For this reason, SPI communication uses four wires. Two of these wires carry data, one in each direction. One wire carries the common clock signal used for synchronization. The fourth wire carries the enable (select) signal, which is explained later.

Being synchronous, the SPI needs a common clock signal generated by either the transmitter or the receiver. The side that generates the clock is called the leader, and the other side is called the follower. These roles are generally called master and slave in the literature. However, we prefer the leader and follower naming on our website and will use these terms from here on. As a result, we can have leader-transmitter, leader-receiver, follower-transmitter, and follower-receiver options.

Working Principles of SPI

The working principles of the SPI are simpler than those of the UART. To understand them, we introduce the data format, connection diagram, transmission and reception operations, and timing in the following parts.

Data Format

Unlike the UART, the data packet size is not fixed in the SPI. This is an advantage since the user can select the packet size as desired. Moreover, the dedicated common clock and enable signals remove the need for the start and stop bits used in the UART. The only requirement is that the transmitter and receiver agree on the data packet size in advance so that they can understand each other.
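The framing difference can be illustrated with a small sketch. It assumes a UART frame with one start bit, one stop bit, and no parity, and an SPI word sent most significant bit first; the function names are hypothetical and only serve the illustration.

```python
def uart_wire_bits(data_bits: int, start_bits: int = 1, stop_bits: int = 1) -> int:
    """Bits on the wire for one UART frame (no parity bit assumed)."""
    return start_bits + data_bits + stop_bits


def spi_wire_bits(data_bits: int) -> int:
    """Bits clocked for one SPI word: no framing bits are added."""
    return data_bits


def split_into_words(bits: list[int], word_size: int) -> list[int]:
    """Group a received bit stream into words of the agreed size (MSB first)."""
    if len(bits) % word_size != 0:
        raise ValueError("bit stream length is not a multiple of the word size")
    words = []
    for start in range(0, len(bits), word_size):
        value = 0
        for bit in bits[start:start + word_size]:
            value = (value << 1) | bit
        words.append(value)
    return words
```

The same 16-bit stream decodes to two 8-bit words or four 4-bit words depending on the agreed size, which is why both sides must settle on the packet size beforehand.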

Connection Diagram

The SPI uses a dedicated clock line, two data lines (one for the transmitter, one for the receiver), and a select (enable) line, as mentioned in the previous section. The connection diagram is available at https://en.wikipedia.org/wiki/File:SPI_single_slave.svg. Here, the clock signal is denoted by SCLK. The leader output, follower input line is denoted by MOSI. The leader input, follower output line is denoted by MISO. The select line is denoted by SS and is used by the leader to wake up the follower. The select line is also used when more than one follower is connected to a single leader.

Transmission and Reception Operations

In the SPI, data transmission and reception are controlled by the leader through the SCLK and SS signals. When there is no transmission, SS stays at logic level 1 and SCLK stays at either logic level 0 or 1 depending on the SPI mode. The modes of the SPI and their timing diagrams will be discussed later. SPI communication starts when the leader wakes the follower by setting SS to logic level 0. Next, the leader and follower exchange data in every clock cycle set by SCLK. Because the link is full duplex, in each cycle the leader can send a bit through the MOSI line while the follower sends a bit through the MISO line. The SPI mode also determines whether data is sent on the rising or falling edge of SCLK. After all bits are transferred, the common clock stops and the leader deselects the follower by changing SS to logic level 1.
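The exchange described above can be sketched as two shift registers that swap contents one bit per clock cycle. This is a minimal software model, not a driver: the function name is hypothetical, bits are assumed to go out most significant bit first, and clock polarity and phase are ignored.

```python
def spi_transfer(leader_word: int, follower_word: int, bits: int = 8) -> tuple[int, int]:
    """Simulate one full-duplex SPI word exchange, MSB first.

    Returns (word received by the leader, word received by the follower).
    """
    mask = (1 << bits) - 1
    leader_shift = leader_word & mask
    follower_shift = follower_word & mask

    # The leader pulls SS to logic level 0 here to select the follower.
    for _ in range(bits):  # one loop iteration per SCLK cycle
        # Each side drives its data line with the MSB of its shift register.
        mosi = (leader_shift >> (bits - 1)) & 1
        miso = (follower_shift >> (bits - 1)) & 1
        # On the sampling edge each side shifts in the bit sent by the other.
        leader_shift = ((leader_shift << 1) | miso) & mask
        follower_shift = ((follower_shift << 1) | mosi) & mask
    # The leader returns SS to logic level 1 here to deselect the follower.

    return leader_shift, follower_shift
```

After `bits` clock cycles the two registers have traded values, so `spi_transfer(0xA5, 0x3C)` gives the leader `0x3C` and the follower `0xA5`.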

What is the difference between unified and non-unified shader architectures?

Shader architectures can be unified or non-unified. Many mobile and embedded GPUs have a unified shader architecture.

  • A unified shader architecture executes shader programs, such as fragment and vertex shaders, on the same processing modules.
  • A non-unified architecture uses separate dedicated processing modules for vertex and fragment processing.

Unified architectures can save power and increase performance compared to a non-unified architecture.

Unified architectures also scale much more easily to a given application, whether it is fragment or vertex shader bound, as the unified processors will be used accordingly.
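The scaling argument can be made concrete with a toy throughput model. It is only a sketch under strong assumptions: every processing unit finishes one work item per cycle, vertex and fragment work are independent, and scheduling is free. The function names and workload numbers are hypothetical.

```python
import math


def cycles_non_unified(vertex_work: int, fragment_work: int,
                       vertex_units: int, fragment_units: int) -> int:
    """Dedicated units: the slower of the two pools sets the frame time."""
    return max(math.ceil(vertex_work / vertex_units),
               math.ceil(fragment_work / fragment_units))


def cycles_unified(vertex_work: int, fragment_work: int, total_units: int) -> int:
    """Unified units: any unit can take any work item."""
    return math.ceil((vertex_work + fragment_work) / total_units)
```

With 8 units in total (2 vertex plus 6 fragment in the non-unified case), a vertex-bound load of 800 vertex items and 200 fragment items takes 400 cycles on the split design and 125 cycles on the unified one in this model, because the idle fragment units cannot help with vertex work.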

What is a shader?

In computer graphics, a shader is a type of computer program originally used for shading in 3D scenes (the production of appropriate levels of light, darkness, and color in a rendered image). Shaders now perform a variety of specialized functions in various fields within the category of computer graphics special effects, do video post-processing unrelated to shading, or even perform functions unrelated to graphics at all.

What is Per-Tile Rasterization (Renderer)?

Rasterization is the process of determining which pixels a given primitive touches. Rasterization and pixel coloring are performed on a per-tile basis with the following steps:

  1. When a tile operation begins, the corresponding tile list is retrieved from the Parameter Buffer (PB) to identify the screen-space primitive data that needs to be fetched.
  2. The Image Synthesis Processor (ISP) fetches the primitive data and performs Hidden Surface Removal (HSR), along with depth and stencil tests. The ISP only fetches screen-space position data for the geometry within the tile.
  3. The Tag Buffer contains information about which triangle is on the top for each pixel.
  4. The Texture and Shading Processor (TSP) then applies coloring operations, like fragment shaders, to the visible pixels.
  5. Alpha testing and then alpha blending are carried out.
  6. Once the tile’s render is complete, the color data is written to the frame buffer in system memory.

This process is repeated until all tiles have been processed and the frame buffer is complete.
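The per-tile steps above can be sketched in software. This is a toy model, not the hardware algorithm: primitives are simplified to axis-aligned rectangles with a single depth, smaller depth is assumed to mean closer, alpha and stencil handling are omitted, and `Rect`, `render_tile`, and the `shade` callback are hypothetical names.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Hypothetical screen-space primitive: half-open rectangle, one depth."""
    prim_id: str
    x0: int
    y0: int
    x1: int
    y1: int
    depth: float  # assumption: smaller means closer to the viewer


def render_tile(tile_x0, tile_y0, tile_size, tile_list, shade):
    """Render one square tile; returns (color buffer, number of shader calls)."""
    depth = [[float("inf")] * tile_size for _ in range(tile_size)]
    tag = [[None] * tile_size for _ in range(tile_size)]

    # Hidden surface removal: record the front-most primitive per pixel.
    for prim in tile_list:
        for y in range(max(prim.y0, tile_y0), min(prim.y1, tile_y0 + tile_size)):
            for x in range(max(prim.x0, tile_x0), min(prim.x1, tile_x0 + tile_size)):
                ly, lx = y - tile_y0, x - tile_x0
                if prim.depth < depth[ly][lx]:
                    depth[ly][lx] = prim.depth
                    tag[ly][lx] = prim.prim_id

    # Shading: run the shader once per visible pixel, using the tag buffer.
    color = [[None] * tile_size for _ in range(tile_size)]
    shader_calls = 0
    for ly in range(tile_size):
        for lx in range(tile_size):
            if tag[ly][lx] is not None:
                color[ly][lx] = shade(tag[ly][lx], tile_x0 + lx, tile_y0 + ly)
                shader_calls += 1
    return color, shader_calls
```

In this sketch the `depth`, `tag`, and `color` arrays play the role of on-chip tile storage; only the returned `color` buffer would be written out to the frame buffer.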

What is On-Chip Buffer used for?

Read-Modify-Write operations for the color, depth and stencil buffers are performed using fast on-chip memory instead of relying on repeated system memory access, as traditional IMRs do. Attachments that the application has chosen to preserve, such as the color buffer, will be written to system memory.

What is Vertex Processing (Tiler)?

Every frame, the hardware processes submitted geometry data with the following steps:

  1. The execution of application-defined transformations, such as vertex shaders (Vertex Processing).
  2. The resulting data is then converted to screen-space (Clip, Project, and Cull).
  3. The Tile Accelerator (TA) then determines which tiles contain each transformed primitive (Tiling).
  4. Per-tile lists are then updated to track the primitives which fall within the bounds of each tile.

Each tile in the list contains primitive lists which contain pointers to the transformed vertex data. The tile list and the transformed vertex data are both stored in an intermediate store called the Parameter Buffer (PB). This store resides in system memory, and is mostly managed by the hardware. It contains all information needed to render the tiles.
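The tiling step can be sketched as a binning pass over screen-space triangles. The sketch assumes a conservative bounding-box test, which may list a triangle in a tile it does not actually touch; real hardware may use a tighter test. The function name and the dictionary layout are hypothetical stand-ins for the tile lists held in the Parameter Buffer.

```python
import math
from collections import defaultdict


def bin_primitives(triangles, tile_size, screen_w, screen_h):
    """Map (tile_x, tile_y) to the indices of triangles whose bounding box overlaps it.

    triangles: list of three (x, y) screen-space vertices per triangle.
    """
    tiles_x = math.ceil(screen_w / tile_size)
    tiles_y = math.ceil(screen_h / tile_size)
    tile_lists = defaultdict(list)

    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Cull triangles whose bounding box is entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= screen_w or min(ys) >= screen_h:
            continue
        # Clamp the bounding box to the range of valid tile coordinates.
        tx0 = max(math.floor(min(xs) / tile_size), 0)
        tx1 = min(math.floor(max(xs) / tile_size), tiles_x - 1)
        ty0 = max(math.floor(min(ys) / tile_size), 0)
        ty1 = min(math.floor(max(ys) / tile_size), tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                # Store an index (a "pointer") rather than a copy of the vertex data.
                tile_lists[(tx, ty)].append(index)
    return dict(tile_lists)
```

Storing indices instead of vertex copies mirrors the description above, where the per-tile primitive lists hold pointers to the transformed vertex data.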

What is Tile-Based Deferred Rendering (TBDR)?

The usual rendering technique on most GPUs is called Immediate Mode Rendering (IMR), in which geometry is sent to the GPU and drawn straight away. This simple architecture is relatively inefficient, resulting in wasted processing power and memory bandwidth. Pixels are often still rendered despite never being visible on the screen, such as when a car is completely obscured by a closer building.

The Tile-Based Deferred Rendering architecture works in a much more intelligent way. It captures the whole scene before starting to render, so occluded pixels can be identified and rejected before they are processed. The hardware splits the geometry data into small rectangular regions, called “tiles”, each of which is processed as one image. Every tile is rasterized and processed separately, and because the size of each render is so small, all data can be kept in very fast on-chip memory.

Deferred rendering means that the architecture will defer all texturing and shading operations until all objects have been tested for visibility. This significantly reduces system memory bandwidth requirements, which in turn increases performance and reduces power requirements. This is a critical advantage for phones, tablets, and other devices where battery life makes all the difference.
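The saving from deferring shading can be illustrated by counting shader invocations in a toy scene. This is a sketch under assumptions: primitives are axis-aligned rectangles given as `(x0, y0, x1, y1, depth)` with half-open bounds, smaller depth means closer, and the immediate-mode model applies a depth test before shading. Both function names are hypothetical.

```python
def count_shaded_imr(width, height, rects):
    """Immediate mode: shade each fragment that passes the depth test on arrival."""
    depth = [[float("inf")] * width for _ in range(height)]
    shaded = 0
    for x0, y0, x1, y1, z in rects:  # processed in submission order
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if z < depth[y][x]:
                    depth[y][x] = z
                    shaded += 1  # may be overwritten later by a closer primitive
    return shaded


def count_shaded_deferred(width, height, rects):
    """Deferred: resolve visibility for all primitives first, then shade once per pixel."""
    depth = [[float("inf")] * width for _ in range(height)]
    for x0, y0, x1, y1, z in rects:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if z < depth[y][x]:
                    depth[y][x] = z
    return sum(1 for row in depth for d in row if d != float("inf"))
```

For an 8 by 8 screen with a far "car" rectangle `(2, 2, 6, 6, 5.0)` submitted before a near "building" rectangle `(0, 0, 8, 8, 1.0)`, the immediate-mode count is 80 while the deferred count is 64: the 16 car pixels are shaded and then thrown away in the first case and never shaded in the second.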