Inter-Integrated Circuit (I2C)

I2C is a multi-leader, multi-follower serial communication protocol between digital devices. In this section, we cover the working principles of I2C: the data format, the connection diagram, and the transmission and reception operations.

Data Format

In I2C, every follower has a unique address, and each data transfer starts with this address. When the addressed follower wakes up and acknowledges the leader, the transfer continues with a register pointer/address followed by data, or with data directly, depending on the device protocol. The follower address is usually seven bits long, although in some cases it can be eight or ten bits. Independent of the address, pointer, and data sizes, the transfer is performed in eight-bit packages. Each package (the seven-bit address plus the read/write bit, a pointer byte, or a data byte) is followed by a one-bit acknowledge. The receiver merges the packages to extract the data.
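The packaging described above can be sketched in Python. The helper names and the 7-bit follower address 0x48 below are illustrative, not from any real driver:

```python
# Sketch: how an I2C leader frames a transfer into 8-bit packages.

def address_byte(addr7: int, read: bool) -> int:
    """First package: 7-bit follower address plus the R/W bit (1 = read)."""
    return ((addr7 & 0x7F) << 1) | (1 if read else 0)

def frame_write(addr7: int, pointer: int, value16: int):
    """Address package, register pointer package, then data packages (MSB first)."""
    return [
        address_byte(addr7, read=False),
        pointer & 0xFF,
        (value16 >> 8) & 0xFF,   # high data byte
        value16 & 0xFF,          # low data byte
    ]

packets = frame_write(0x48, pointer=0x01, value16=0x1234)
print([hex(p) for p in packets])  # ['0x90', '0x1', '0x12', '0x34']
```

Note how the follower address 0x48 becomes the byte 0x90 once the write bit (0) is appended in the LSB position.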

Connection Diagram

The I2C data bus has two wires, called the serial data line (SDA) and the serial clock line (SCL). In addition, all connected devices need a common ground and power line, so I2C needs four wires in total. The connection diagram of a generic I2C bus is presented at https://www.analog.com/-/media/analog/en/landing-pages/technical-articles/i2c-primer-what-is-i2c-part-1-/36684.png?la=en&w=900 . SDA and SCL are bidirectional lines, each connected to VDD through a pull-up resistor, which means they sit at logic level 1 when idle. Unlike SPI, every follower in I2C has a unique address, so the leader can select a follower over the serial data line without a dedicated select signal. Thus, apart from the power and ground lines, the I2C bus has only two wires connected to all devices, which saves pins compared to SPI.

Transmission and Reception Operations

As mentioned in the previous section, data in I2C communication is carried in eight-bit packages. The leader starts the transmission by sending the follower address and a read/write decision bit. The follower with this address on the network wakes up and acknowledges the leader that it is alive and ready to talk. Then, depending on the decision bit, the leader writes data to or reads data from the follower. The leader ends the talk by sending a stop signal. The following figure shows the complete timing diagram of the I2C communication: https://www.analog.com/-/media/analog/en/landing-pages/technical-articles/i2c-primer-what-is-i2c-part-1-/36685.png?la=en&w=900 . The leader starts a transmission with a logic level 1 to 0 transition on SDA while SCL stays at logic level 1; this is called the start signal. The transmission ends with a logic level 0 to 1 transition on SDA while SCL is at logic level 1; this is called the stop signal. After the start signal, the leader transmits the address of the follower, followed by the R/W bit, which tells the follower whether the leader is going to read data from or write data to it. Next, the leader starts sending or receiving data (with the MSB first), each byte followed by an acknowledge bit. There is no restriction on the number of successively transmitted data bytes; the communication continues until the leader sends the stop signal. Note that during the acknowledge bit the transmitter releases the SDA line and the receiver pulls it to logic level 0 while SCL is at logic level 1.
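The start, stop, and acknowledge conditions can be made concrete with a small bit-banged simulation. The class and the recorded trace below are illustrative (a real driver would toggle open-drain GPIO pins, where 1 means released high via the pull-up):

```python
# Sketch of a bit-banged I2C leader write; line states are recorded, not driven.

class BitBangI2C:
    def __init__(self):
        self.trace = []          # recorded (sda, scl) states, for inspection

    def _set(self, sda, scl):
        self.trace.append((sda, scl))

    def start(self):             # SDA 1 -> 0 while SCL stays at 1
        self._set(1, 1); self._set(0, 1); self._set(0, 0)

    def stop(self):              # SDA 0 -> 1 while SCL is 1
        self._set(0, 1); self._set(1, 1)

    def write_byte(self, byte):  # MSB first, one SCL pulse per bit
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            self._set(bit, 0); self._set(bit, 1); self._set(bit, 0)
        # acknowledge slot: leader releases SDA, the follower pulls it low
        self._set(1, 0); self._set(1, 1); self._set(1, 0)

bus = BitBangI2C()
bus.start()
bus.write_byte((0x48 << 1) | 0)  # address package for follower 0x48, write
bus.write_byte(0x7F)             # one data byte
bus.stop()
```

The first two trace entries show SDA falling while SCL stays high (the start signal), and the last entry shows SDA rising while SCL is high (the stop signal).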


What is the Graphics Rendering Pipeline?


The GPU pipeline takes vertices through several stages, during which the vertex coordinates are transformed between various spaces. Programmers using different graphics frameworks care about only certain stages of the pipeline. Normally, the GPU vendor takes care of the “Vertex Fetch”, “Primitive Assembly”, “Rasterization”, and “Framebuffer” stages in its drivers, while the graphics programmer only needs to care about the Vertex and Fragment Processing stages, which are programmable through graphics APIs such as Metal, Vulkan, or OpenGL.

1. Vertex Fetch

Different graphics Application Programming Interfaces (APIs) have different names for this stage. To start rendering 3D content, we first need a scene. A scene consists of models, which have meshes of vertices. For example, one of the simplest models is the cube, which consists of six faces (12 triangles). A specific hardware unit called the Scheduler sends the vertices and their attributes on to the Vertex Processing stage.
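The cube mesh mentioned above can be written out as plain data. The layout below (a vertex buffer of 8 corner positions and an index buffer of 12 triangles) is illustrative rather than tied to any particular API:

```python
# Sketch: a cube mesh as vertex and index buffers, 2 triangles per face.

vertices = [
    (-1, -1, -1), (1, -1, -1), (1, 1, -1), (-1, 1, -1),   # back-face corners
    (-1, -1,  1), (1, -1,  1), (1, 1,  1), (-1, 1,  1),   # front-face corners
]

# Each of the 6 faces is split into two triangles: 6 * 2 = 12 triangles.
triangles = [
    (0, 1, 2), (0, 2, 3),  # back
    (4, 6, 5), (4, 7, 6),  # front
    (0, 3, 7), (0, 7, 4),  # left
    (1, 5, 6), (1, 6, 2),  # right
    (3, 2, 6), (3, 6, 7),  # top
    (0, 4, 5), (0, 5, 1),  # bottom
]

print(len(vertices), len(triangles))  # 8 12
```

Sharing the 8 corner vertices through an index buffer is what keeps the vertex count at 8 instead of 36 (12 triangles times 3 vertices).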

2. Vertex Processing

At this stage, vertices are processed one by one. Programmers write code to calculate per-vertex lighting and color. Moreover, vertex coordinates are transformed through various coordinate spaces to reach their position in the final framebuffer. The CPU sends the GPU a vertex buffer that the programmer created from the model mesh, and the programmer configures this vertex buffer with a vertex descriptor that tells the GPU how the vertex data is structured. On the GPU side, the programmer creates a struct to encapsulate the vertex attributes. The vertex shader takes this struct as a function argument and, through an attribute qualifier, knows that the position comes from the CPU via the position field in the vertex buffer. The vertex shader then processes all the vertices and returns their positions as a float4. A specific hardware unit called the Distributer sends the grouped blocks of vertices on to the Primitive Assembly stage.
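The core of a vertex shader is a matrix transform. The sketch below shows the idea in pure Python with a made-up model-view-projection (MVP) matrix that only translates; a real MVP also rotates and projects:

```python
# Sketch of vertex processing: multiply each vertex position by a 4x4 MVP
# matrix to get a clip-space float4.

def mat_vec4(m, v):
    """Multiply a 4x4 matrix (row-major, nested lists) by a 4-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

# Hypothetical MVP that just translates by (2, 0, 0).
mvp = [
    [1, 0, 0, 2],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

def vertex_shader(position3):
    x, y, z = position3
    return mat_vec4(mvp, (x, y, z, 1.0))   # returned as a float4 (x, y, z, w)

print(vertex_shader((1.0, 1.0, -1.0)))     # -> (3.0, 1.0, -1.0, 1.0)
```

The trailing w = 1.0 component is what lets a 4x4 matrix express translation and perspective, which pure 3x3 matrices cannot.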

3. Primitive Assembly

The former stage sends processed vertices, grouped into blocks of data, to this stage. One thing to keep in mind is that vertices belonging to the same geometrical shape (primitive) are always in the same block: the one vertex of a point, the two vertices of a line, or the three vertices of a triangle are always in the same block, so a second block fetch is never necessary. At this stage, primitives are fully assembled from connected vertices, and they move on to the rasterizer.

4. Rasterization

For every object in the scene, the rasterizer sends rays back into the screen and checks which pixels are covered by the object. Depth information is kept, so the pixel color is updated only if the current object is closer than the previously saved one. At this point, all connected vertices sent from the previous stage need to be represented on a two-dimensional grid using their X/Y coordinates; this step is called triangle setup. Then a process called scan conversion runs on each line of the screen, looking for intersections and determining what is visible and what is not. For drawing onto the screen, only the vertices and the slopes they determine are needed. The stored depth information is used to determine whether each point is in front of other points in the scene. After this stage, the Scheduler unit again dispatches work to the shader cores, but this time the rasterized fragments are sent for Fragment Processing.
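A minimal sketch of the coverage and depth test is shown below, using edge functions on a tiny made-up grid (real hardware uses incremental scan conversion, but the visibility logic is the same):

```python
# Sketch of rasterization with a depth test: an edge-function coverage check.

def edge(a, b, p):
    """Signed area; >= 0 when p is on the left of edge a->b (CCW winding)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(tri, depth, width, height, zbuf, frame, color):
    a, b, c = tri
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)       # sample at the pixel center
            if edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0:
                if depth < zbuf[y][x]:   # keep only the closest surface
                    zbuf[y][x] = depth
                    frame[y][x] = color

W = H = 8
zbuf = [[float("inf")] * W for _ in range(H)]
frame = [["."] * W for _ in range(H)]
tri = ((0, 0), (7, 0), (0, 7))           # counter-clockwise triangle
rasterize(tri, depth=1.0, width=W, height=H, zbuf=zbuf, frame=frame, color="A")
rasterize(tri, depth=0.5, width=W, height=H, zbuf=zbuf, frame=frame, color="B")
print("\n".join("".join(row) for row in frame))  # "B" overwrites "A": it is closer
```

Because the second pass has the smaller depth value, every covered pixel ends up with color "B", which is exactly the depth-comparison behavior described above.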

5. Fragment Processing

The primitive processing in the previous stages was sequential, because there is only one Primitive Assembly unit and one Rasterization unit. However, as soon as fragments reach the Scheduler, the work can be forked (divided) into many sections, and each section is given to an available shader core, so thousands of cores do the processing in parallel. Fragment processing is another programmable stage. The programmer creates a fragment shader function, which receives the lighting, texture coordinate, depth, and color information that the vertex function outputs; all of these attributes are interpolated for each fragment. The fragment shader outputs a single color for its fragment, and each of these fragments contributes to the color of the final pixel in the framebuffer.
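The attribute interpolation mentioned above is typically done with barycentric weights. The sketch below blends three made-up vertex colors for one fragment position (a real fragment shader would also sample textures and apply lighting):

```python
# Sketch of per-fragment attribute interpolation with barycentric weights.

def barycentric(a, b, c, p):
    """Barycentric weights of point p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w0, w1, 1.0 - w0 - w1

def fragment_shader(colors, weights):
    """Weighted blend of the three vertex colors for one fragment."""
    return tuple(sum(w * col[i] for w, col in zip(weights, colors))
                 for i in range(3))

tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
colors = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))  # red, green, blue
weights = barycentric(*tri, (1.0, 1.0))
print(fragment_shader(colors, weights))  # a red-dominant blend near vertex a
```

For the fragment at (1.0, 1.0) the weights come out to (0.5, 0.25, 0.25), so the interpolated color is half red plus a quarter each of green and blue.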

6. Framebuffer

As soon as fragments have been processed into pixels, the Distributer unit sends them to the Color Writing unit. This unit is responsible for writing the final color to a special memory location called the framebuffer. From here, the view gets its colored pixels refreshed every frame. A technique called double buffering is used for this last step: while the first buffer is being displayed on the screen, the second one is updated in the background. Then the two buffers are swapped, the second one is displayed while the first one is updated, and the cycle continues.
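The swap cycle can be sketched in a few lines. The class below is illustrative, with frame labels standing in for pixel data:

```python
# Sketch of double buffering: render into the back buffer while the front
# buffer is on screen, then swap the two.

class DoubleBuffer:
    def __init__(self, size):
        self.front = [0] * size   # what the display scans out
        self.back = [0] * size    # what the GPU is currently writing

    def render(self, frame_id):
        for i in range(len(self.back)):   # draw the new frame off-screen
            self.back[i] = frame_id

    def swap(self):                       # done at vsync to avoid tearing
        self.front, self.back = self.back, self.front

db = DoubleBuffer(4)
db.render(1); db.swap()
print(db.front)   # -> [1, 1, 1, 1]  frame 1 is now visible
db.render(2)      # frame 2 is drawn in the background
print(db.front)   # still [1, 1, 1, 1] until the next swap
db.swap()
print(db.front)   # -> [2, 2, 2, 2]
```

The key property is that the display never sees a half-drawn frame: it only ever scans out a buffer that rendering has already finished with.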

How to design a memory controller with in-order read responses?

Interviewers often ask candidates to design a memory controller in technical interviews. We show one example below.

The memory controller takes incoming requests, each with an address and a request ID, as inputs. It is expected to provide read responses, each with a response ID, as outputs. Internally, it can access memory to fetch the read data.
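One common scheme, sketched below, is a reorder buffer: memory completions may arrive out of order, so the controller holds them until the oldest outstanding request (tracked by a FIFO of request IDs) is ready. All names are illustrative, not from the original design:

```python
# Sketch of a memory controller that returns read responses in request order.
from collections import deque

class InOrderMemController:
    def __init__(self):
        self.order = deque()   # request IDs in arrival order
        self.done = {}         # req_id -> data: completed, not yet responded

    def request(self, req_id, addr):
        self.order.append(req_id)          # remember arrival order

    def memory_complete(self, req_id, data):
        """Memory fetch finished (possibly out of order); buffer the data."""
        self.done[req_id] = data

    def pop_responses(self):
        """Drain responses, oldest request first, stopping at the first gap."""
        out = []
        while self.order and self.order[0] in self.done:
            rid = self.order.popleft()
            out.append((rid, self.done.pop(rid)))
        return out

mc = InOrderMemController()
mc.request(0, 0x100); mc.request(1, 0x200); mc.request(2, 0x300)
mc.memory_complete(1, "B")     # request 1 finishes first...
print(mc.pop_responses())      # -> []  (request 0 is still outstanding)
mc.memory_complete(0, "A")
print(mc.pop_responses())      # -> [(0, 'A'), (1, 'B')]
```

Even though request 1 completed first, its response is held back until request 0's data arrives, preserving the in-order response guarantee.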


Why is there no possible performance improvement with cache upsizing?

Usually, with cache upsizing, we expect to see system performance improvement. However, this is not always the case. There could be several reasons:

  1. The “compulsory” misses, instead of the “capacity” misses, prevent the performance improvement from cache upsizing. This means the temporal and spatial locality offered by the cache are not utilized. For example, the program keeps accessing new data with no data reuse, which can happen in streaming applications; or, if context switches happen often, then cache flushes may happen often and more compulsory misses will occur.
  2. In a cache-coherent system, two caches may compete for one copy of data, i.e., “coherence” misses. This can happen when two CPUs try to gain a lock or semaphore simultaneously. Increasing the cache size will not help performance in this case.
  3. If the cache upsizing is achieved by enlarging the cache line, then the loading time of a cache line will increase. This in turn increases the cache miss penalty and the average memory access time.
  4. If the cache upsizing is achieved by increasing associativity, then the hit latency, and with it the average memory access time, may increase, because the physical implementation of a highly associative cache can be hard.
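Points 3 and 4 both follow from the standard average memory access time (AMAT) formula, AMAT = hit time + miss rate × miss penalty. The cycle counts below are made up purely to illustrate the trade-off:

```python
# Sketch: a bigger cache can lower the miss rate yet still raise AMAT.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

base = amat(hit_time=1.0, miss_rate=0.05, miss_penalty=40.0)
longer_line = amat(hit_time=1.0, miss_rate=0.04, miss_penalty=60.0)  # point 3
more_ways = amat(hit_time=1.6, miss_rate=0.04, miss_penalty=40.0)    # point 4

# Both upsized caches miss less often, yet both have a higher AMAT (~3.4 and
# ~3.2 cycles) than the base cache (~3.0 cycles).
print(base, longer_line, more_ways)
```

The lesson is that miss rate is only one of three terms: a larger line raises the miss penalty, and higher associativity can raise the hit time, and either can dominate.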


How to implement w = 3/2 x + 1/4 y + z?

Obviously, w = 3/2 x + 1/4 y + z = x + (x >> 1) + (y >> 2) + z. But this is not the end of the story.

All variables here need to be interpreted as fixed-point numbers, with the lower two bits representing the fractional weights 0.5 and 0.25.

Let’s say x, y, and z are within the range between 0 and 3, inclusive. Then w is within the range between 0 and 8.25, inclusive. w’s integer part needs 4 bits, and w needs 6 bits in total.

'd8.25 can be represented as 'b1000.01.
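The arithmetic above can be checked in Python by scaling every value by 4 (two fractional bits) so that the shifts become exact integer operations:

```python
# Sketch of w = 3/2*x + 1/4*y + z in fixed point with 2 fractional bits
# (stored integer = real value * 4).

FRAC = 2                      # two fractional bits, resolution 0.25

def to_fx(value):             # real value -> scaled integer
    return int(value * (1 << FRAC))

def from_fx(fx):              # scaled integer -> real value
    return fx / (1 << FRAC)

def w_fx(x, y, z):
    """x, y, z already in fixed point; returns w = x + x/2 + y/4 + z."""
    return x + (x >> 1) + (y >> 2) + z

x, y, z = to_fx(3), to_fx(3), to_fx(3)    # the maximum inputs from the text
w = w_fx(x, y, z)
print(from_fx(w))                          # -> 8.25, i.e. 'b1000.01
print(format(w, "06b"))                    # -> '100001': 1000.01, point dropped
```

This confirms the bit-width analysis: the maximum result 8.25 fits exactly in 6 bits, four integer (1000) plus two fractional (01).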