## How Computers Work?

What if you were asked the very basic question: how do computers work?

A brief answer:

A more comprehensive answer:

## Category: Facebook

Posts in this category:

- How Computers Work?
- Can we arbitrarily increase the CPU pipeline depth?
- How to design a memory controller with in-order read responses?
- Why is there no possible performance improvement with cache upsizing?
- How to implement w = 3/2 x + 1/4 y + z?
- Design a circuit that detects if one input is a delayed version of the other
- How to implement pseudo LRU?


## Can we arbitrarily increase the CPU pipeline depth?

CPU pipelining is a common technique to increase throughput and instruction-level parallelism. Can we arbitrarily increase the pipeline depth? The short answer is no.

This is an open-ended question. We recommend that interviewees answer it from several aspects.
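One way to see why depth cannot grow arbitrarily is a back-of-envelope throughput model. The model and all of its numbers below are our own illustrative assumptions, not from the post: a fixed amount of combinational logic gets divided into more stages (shorter cycle time, but with per-stage register overhead), while the branch misprediction penalty grows with depth because the whole pipeline must refill.

```python
# Illustrative model (hypothetical numbers): 20 ns of total logic delay,
# 0.5 ns of register overhead per stage, branches are 20% of instructions,
# 25% of them mispredict, and a mispredict costs `depth` cycles (refill).
def throughput(depth, logic_ns=20.0, reg_ns=0.5,
               branch_freq=0.2, mispredict=0.25):
    cycle_ns = logic_ns / depth + reg_ns          # shorter stages, fixed overhead
    cpi = 1.0 + branch_freq * mispredict * depth  # refill cost grows with depth
    return 1.0 / (cycle_ns * cpi)                 # instructions per ns

# Throughput first rises with depth, peaks at a moderate depth, then falls.
best = max(range(1, 101), key=throughput)
```

Under these assumptions the optimum lands at a finite depth: beyond it, the growing misprediction refill cost and the fixed register overhead outweigh the faster clock.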

## How to design a memory controller with in-order read responses?

Interviewers often ask candidates to design a memory controller in technical interviews. One example is shown below.

The memory controller takes incoming read requests, each with an address and a request ID, as inputs. It is expected to produce read responses, each with a response ID, as outputs. Internally, it can access memory to fetch the read data.
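The post does not spell out the design, but one classic structure can be sketched behaviorally: record request IDs in a FIFO in arrival order, buffer data that memory returns out of order, and release a response only when it belongs to the oldest outstanding request. The class below is our own hypothetical sketch of that idea, not the post's solution.

```python
from collections import deque

class InOrderReadController:
    """Sketch: a FIFO of request IDs plus a reorder buffer for data."""
    def __init__(self):
        self.order = deque()   # request IDs in arrival order
        self.done = {}         # req_id -> data already returned by memory

    def accept(self, req_id, addr):
        # addr would be issued to memory; the memory model is omitted here
        self.order.append(req_id)

    def memory_return(self, req_id, data):
        # memory may return reads in any order
        self.done[req_id] = data

    def pop_responses(self):
        # release responses only in original request order
        out = []
        while self.order and self.order[0] in self.done:
            req_id = self.order.popleft()
            out.append((req_id, self.done.pop(req_id)))
        return out
```

For example, if request 2's data comes back before request 1's, `pop_responses()` returns nothing until request 1 completes, then releases both in order.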

## Why is there no possible performance improvement with cache upsizing?

Usually, with cache upsizing, we expect to see a system performance improvement. However, this is not always the case. There are several possible reasons:

- “Compulsory” misses, rather than “capacity” misses, dominate, so upsizing does not help. This means the temporal and spatial locality offered by the cache are not utilized. For example, the program keeps accessing new data with no reuse, which can happen in streaming applications; or, if context switches happen often, the cache may be flushed often and more compulsory misses will occur.
- In a cache-coherent system, two caches may compete for one copy of the data, i.e., “coherence” misses. This can happen when two CPUs try to acquire a lock or semaphore simultaneously. Increasing the cache size will not help performance in this case.
- If the upsizing is achieved by enlarging the cache line, then the time to load a cache line increases. This in turn increases the cache miss penalty and the average memory access time.
- If the upsizing is achieved by increasing associativity, then the hit latency, and with it the average memory access time, may increase, because the physical implementation of a highly associative cache can be hard.
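The line-size point above can be made concrete with the standard average-memory-access-time formula, AMAT = hit time + miss rate × miss penalty. The numbers below are made up for illustration: doubling the line size trims the miss rate slightly but doubles the line fill time, so AMAT gets worse.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: 1-cycle hit; doubling the line halves little of the
# miss rate (0.05 -> 0.04) but doubles the fill penalty (40 -> 80 cycles).
small_line = amat(1.0, 0.05, 40.0)   # = 3.0 cycles
large_line = amat(1.0, 0.04, 80.0)   # = 4.2 cycles, i.e. worse
```

The "upsized" configuration loses because the penalty term grows faster than the miss-rate term shrinks.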

## How to implement w = 3/2 x + 1/4 y + z?

Obviously, w = 3/2 x + 1/4 y + z = x + (x >> 1) + (y >> 2) + z. But this is not the end of the story.

All variables here need to be interpreted as fixed-point numbers, with the lower 2 bits representing 0.5 and 0.25.

Let’s say x, y and z are in the range 0 to 3, inclusive. Then w is in the range 0 to 8.25, inclusive. w’s integer part needs 4 bits, so w needs 6 bits in total.

‘d8.25 can be represented as ‘b1000.01.
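As a quick sanity check (our own, not from the post): with 2 fractional bits, a value v is stored as the raw integer 4v, so w = 3/2 x + 1/4 y + z for integer x, y, z becomes 6x + y + 4z in raw units, which is exactly x + (x >> 1) + (y >> 2) + z after scaling.

```python
def w_fixed_raw(x, y, z):
    """Raw fixed-point result in units of 0.25 (2 fractional bits).

    x stored as 4x contributes 4x + (4x >> 1) = 6x,
    y stored as 4y contributes (4y >> 2)      = y,
    z stored as 4z contributes                  4z.
    """
    return 6 * x + y + 4 * z

# Sweep the stated input range 0..3 and recover the real-valued w.
vals = [w_fixed_raw(x, y, z) / 4
        for x in range(4) for y in range(4) for z in range(4)]
```

The sweep confirms the range 0 to 8.25 and that the maximum raw value, 33 = ‘b100001, fits in the claimed 6 bits.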

## Design a circuit that detects if one input is a delayed version of the other

Assume 1-bit input A is generated by a random sequence, and 1-bit input B is a delayed version of A. The delay value varies, and can be anywhere in [1, 10] (inclusive).

Design a circuit that takes A and B as inputs and produces Y as output. Assuming B is guaranteed to contain 1s, Y should be 1 if B is a delayed version of A.
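One classic approach can be sketched behaviorally (our own sketch, not the post's RTL): keep the last 10 values of A in a shift register, track which delays d in [1, 10] remain consistent with B every cycle, and assert Y once the history is warmed up and the candidate set is non-empty.

```python
class DelayDetector:
    """Behavioral model: eliminate delay candidates that mismatch B."""
    MAX_DELAY = 10

    def __init__(self):
        self.hist = [0] * self.MAX_DELAY            # hist[i] = A from i+1 cycles ago
        self.candidates = set(range(1, self.MAX_DELAY + 1))
        self.cycles = 0

    def step(self, a, b):
        # Only compare once the history is fully warmed up, so stale
        # reset values cannot wrongly eliminate a candidate delay.
        if self.cycles >= self.MAX_DELAY:
            self.candidates = {d for d in self.candidates
                               if b == self.hist[d - 1]}
        self.hist = [a] + self.hist[:-1]            # shift A in
        self.cycles += 1
        return self.cycles > self.MAX_DELAY and bool(self.candidates)
```

If B really is A delayed by some d in [1, 10], that d always matches and survives, so Y goes high after warm-up; an unrelated B eliminates every candidate and Y stays 0.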

## How to implement pseudo LRU?

We discussed true LRU implementations previously, but they are usually costly. In this post, we will explain how to implement pseudo LRU.

An example is shown below. The data structure is similar to a binary tree: for N ways in a cache set, we need to keep (N − 1) bits, so for 8 ways in a set, we need 7 bits.

Upon a cache access, all tree nodes on the path to the accessed way are flipped. For example, if way 3 is accessed, then h0, h1 and h4 will be flipped in the next cycle.

To find the LRU way, we perform a depth-first search starting from h0: if a node is 0, we traverse the left sub-tree; otherwise, we traverse the right sub-tree. In the diagram above, the LRU is way 3.

Once we understand the basic idea of pseudo LRU, it is easy to write the code.

```systemverilog
module lru_pseudo (
  input              clk,
  input              rst_n,
  input              acc_en,   // access indication
  input        [2:0] acc_idx,  // index to accessed entry
  output logic [2:0] lru_idx   // index to LRU entry
);

  // tree nodes: h0 is the root, h1/h2 its children, h3-h6 the leaves
  logic h0, h1, h2, h3, h4, h5, h6;

  always_ff @(posedge clk or negedge rst_n)
    if (~rst_n) h0 <= '0;
    else if (acc_en) h0 <= ~h0;  // always flip h0 if there's an access

  always_ff @(posedge clk or negedge rst_n)
    if (~rst_n) h1 <= '0;
    else if (acc_en & ~acc_idx[2]) h1 <= ~h1;  // flip h1 if acc_idx == 0/1/2/3

  always_ff @(posedge clk or negedge rst_n)
    if (~rst_n) h2 <= '0;
    else if (acc_en & acc_idx[2]) h2 <= ~h2;  // flip h2 if acc_idx == 4/5/6/7

  always_ff @(posedge clk or negedge rst_n)
    if (~rst_n) h3 <= '0;
    else if (acc_en & (acc_idx[2:1] == 2'b00)) h3 <= ~h3;  // flip h3 if acc_idx == 0/1

  always_ff @(posedge clk or negedge rst_n)
    if (~rst_n) h4 <= '0;
    else if (acc_en & (acc_idx[2:1] == 2'b01)) h4 <= ~h4;  // flip h4 if acc_idx == 2/3

  always_ff @(posedge clk or negedge rst_n)
    if (~rst_n) h5 <= '0;
    else if (acc_en & (acc_idx[2:1] == 2'b10)) h5 <= ~h5;  // flip h5 if acc_idx == 4/5

  always_ff @(posedge clk or negedge rst_n)
    if (~rst_n) h6 <= '0;
    else if (acc_en & (acc_idx[2:1] == 2'b11)) h6 <= ~h6;  // flip h6 if acc_idx == 6/7

  // lru_idx assignment: walk the tree, taking the left sub-tree on 0
  // and the right sub-tree on 1, at every level including the leaves
  assign lru_idx[2] = h0;
  assign lru_idx[1] = (~h0 & h1) | (h0 & h2);
  assign lru_idx[0] = (~h0 & ~h1 & h3) | (~h0 & h1 & h4) |
                      ( h0 & ~h2 & h5) | ( h0 &  h2 & h6);

endmodule: lru_pseudo
```
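The flip-and-traverse behavior can be cross-checked with a small behavioral model (our own, in Python rather than RTL): flip the three nodes on the accessed way's path, then read out the LRU way by walking the tree, going left on 0 and right on 1.

```python
class PseudoLRU8:
    """Behavioral model of 8-way tree pseudo-LRU with flip-on-access."""
    def __init__(self):
        # h[0] is the root (h0); h[1], h[2] its children; h[3..6] the leaves
        self.h = [0] * 7

    def access(self, way):
        self.h[0] ^= 1                    # root always flips
        self.h[1 + (way >> 2)] ^= 1       # h1 for ways 0-3, h2 for ways 4-7
        self.h[3 + (way >> 1)] ^= 1       # one of h3..h6

    def lru(self):
        node = 0
        for _ in range(3):                # 3 levels cover 8 ways
            node = 2 * node + 1 + self.h[node]  # 0 -> left child, 1 -> right
        return node - 7                   # leaf indices 7..14 map to ways 0..7
```

Starting from all-zero state the LRU is way 0; accessing way 0 flips h0, h1, h3 and moves the LRU to way 4, matching a hand trace of the RTL equations above.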