In recent years, RISC-V has attracted global attention. This revolutionary ISA has swept the market thanks to continuous innovation, a wealth of learning resources and tools, and contributions from the engineering community. The biggest charm of RISC-V is that it is an open-source ISA.
In this article, I (Mitu Raj, the author of this article, the same below) will introduce how to design a RISC-V CPU from scratch. We will walk through defining the specification, designing and refining the architecture, identifying and solving challenges, developing the RTL, implementing the CPU, and testing it in simulation and on an FPGA board.
Start with a Name
It is important to name or brand your idea so that you can keep going until you reach your goal! We are going to build a very simple processor, so I came up with a fancy name "Pequeno", which means "tiny" in Spanish; the full name is: Pequeno RISC-V CPU, aka PQR5.
RISC-V has many flavors and extensions of the ISA architecture. We will start with the simplest one, RV32I, aka 32-bit base integer ISA. This ISA is suitable for building 32-bit CPUs that support integer operations. So, the first spec of Pequeno is as follows:
Pequeno is a 32-bit RISC-V CPU that supports RV32I ISA.
RV32I has 37 32-bit base instructions that we plan to implement in Pequeno. We therefore need to understand each instruction in depth. It took me a while to fully grasp the ISA; along the way, I studied the complete specification and wrote my own assembler, pqr5asm, which I verified against some popular RISC-V assemblers.
"RISBUJ"
The six-letter word above summarizes the instruction types in RV32I. These 37 instructions belong to one of the following categories:
- R-type: all integer computation instructions operating on registers.
- I-type: all integer computation instructions using a register and an immediate value; also includes JALR and the load instructions.
- S-type: all store instructions.
- B-type: all branch instructions.
- U-type: special instructions such as LUI and AUIPC.
- J-type: jump instructions such as JAL.
There are 32 general-purpose registers in the RISC-V architecture, x0-x31, all 32 bits wide. Among them, x0, also known as zero, is a useful special register: it is hardwired to zero, ignores writes, and always reads as zero. So what is it used for? You can use x0 as a dummy destination to dump results you don't want to read, as an operand of value zero, or to generate NOP instructions to idle the CPU.
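As a quick illustration of the x0 behavior, here is a toy Python model of the register file (illustrative only; real hardware implements this as a gated write enable, not software logic):

```python
# Toy model of the RV32I register file: 32 registers, x0 hardwired to zero.
class RegFile:
    def __init__(self):
        self.regs = [0] * 32

    def write(self, rd, value):
        if rd != 0:                          # writes to x0 are silently dropped
            self.regs[rd] = value & 0xFFFFFFFF

    def read(self, rs):
        return self.regs[rs]                 # x0 always reads as 0

rf = RegFile()
rf.write(5, 123)    # normal register write
rf.write(0, 999)    # attempted write to x0: ignored
```

Reading back x5 returns 123, while x0 still reads as zero despite the write attempt.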
Integer computation instructions are ALU instructions that are executed on registers and/or 12-bit immediate values. Load/store instructions are used to store/load data between registers and data memory. Jump/branch instructions are used to transfer program control to different locations.
Details of each instruction can be found in the RISC-V specification: RISC-V User Level ISA v2.2.
To learn the ISA, the RISC-V specification document is enough. However, for more clarity, you can study the implementations of different open cores in RTL.
In addition to the 37 basic instructions, I have added 13 pseudo/custom instructions to pqr5asm and extended the ISA to 50 instructions. These instructions are derived from the basic instructions and are intended to simplify the assembly programmer's life... For example:
The NOP instruction maps to ADDI x0, x0, 0, which of course does nothing on the CPU! But it is much simpler and clearer to read in code.
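To make the encoding concrete, here is a minimal Python sketch that assembles ADDI x0, x0, 0 using the I-type field layout from the RV32I spec (the helper function name is ours, not pqr5asm's):

```python
# Encode an RV32I I-type instruction: imm[11:0] | rs1 | funct3 | rd | opcode
def encode_i_type(imm, rs1, funct3, rd, opcode):
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

# ADDI x0, x0, 0 -- the canonical NOP (OP-IMM opcode 0b0010011, ADDI funct3 0b000)
nop = encode_i_type(imm=0, rs1=0, funct3=0b000, rd=0, opcode=0b0010011)
print(hex(nop))   # 0x13
```

The result, 0x00000013, is exactly the NOP encoding defined by the RISC-V spec: all fields zero except the OP-IMM opcode.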
Before we start designing the processor architecture, we should fully understand what each instruction does and how it is encoded in 32-bit binary.
The RISC-V RV32I assembler PQR5ASM that I developed in Python can be found on my GitHub. You can refer to the Assembler Instruction Manual to write sample assembly code. Compile it and see how it converts to 32-bit binary to consolidate/verify your understanding before moving on to the next step.
Specifications and Architecture
In this chapter, we define the full specification and architecture of Pequeno. Last time, we simply defined it as a 32-bit CPU; now we will go into more detail and form a general picture of the architecture we are going to design.
We will design a simple single-core CPU that executes one instruction at a time, in the order in which instructions are fetched, but in a pipelined manner. We will not support the RISC-V privileged specification, because we do not currently plan to run an operating system on the core, nor do we plan to support interrupts.
The CPU specifications are as follows:
- 32-bit CPU, single-issue, single-core.
- Classic five-stage RISC pipeline.
- Strictly in-order pipeline.
- Compliant with the RV32I user-level ISA v2.2.
- Supports all 37 base instructions.
- Separate bus interfaces for instruction and data memory access. (Why? More on that later…)
- Suitable for bare-metal applications; no support for operating systems or interrupts. (More precisely, a limitation!)
As mentioned above, we will support the RV32I ISA. Therefore, the CPU only supports integer operations.
All registers in the CPU are 32 bits. The address and data buses are also 32 bits. The CPU uses the classic little-endian byte addressing memory space. Each address corresponds to a byte in the CPU address space.
0x00 - byte[7:0], 0x01 - byte[15:8] ...
32-bit words can be accessed by 32-bit aligned addresses, i.e. addresses that are multiples of 4:
0x00 - word 0, 0x04 - word 1...
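A small sketch in Python of how a 32-bit word is assembled from this little-endian, byte-addressed space (the dict here is just a stand-in for RAM):

```python
# Little-endian byte-addressed memory: the 32-bit word at aligned address A is
# assembled from bytes A (LSB) through A+3 (MSB).
def read_word(mem, addr):
    assert addr % 4 == 0, "word accesses must be 4-byte aligned"
    return (mem[addr]
            | (mem[addr + 1] << 8)
            | (mem[addr + 2] << 16)
            | (mem[addr + 3] << 24))

mem = {0x00: 0x78, 0x01: 0x56, 0x02: 0x34, 0x03: 0x12}
word = read_word(mem, 0x00)   # 0x12345678
```

The byte at the lowest address lands in bits [7:0], matching the 0x00 - byte[7:0] mapping above.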
Pequeno is a single-issue CPU, i.e. it fetches only one instruction from memory at a time and issues it for decoding and execution. A pipelined processor with a single issue has a maximum IPC = 1 (or minimum/optimal CPI = 1), i.e. the ultimate goal is to execute at a rate of 1 instruction per clock cycle. This is theoretically the highest performance that can be achieved.
The classic five-stage RISC pipeline is the basic architecture for understanding any other RISC architecture. This is the most ideal and simple choice for our CPU. The architecture of Pequeno is built around this five-stage pipeline. Let's dive into the underlying concepts.
For simplicity, we will not support timers, interrupts, and exceptions in the CPU pipeline. Consequently, CSRs and privilege levels need not be implemented, and the RISC-V privileged ISA is left out of the current implementation of Pequeno.
The simplest way to design a CPU is the non-pipelined way. Let's look at several design approaches for non-pipelined RISC CPUs and understand their drawbacks.
Let's assume the classic sequence of steps that a CPU follows to execute instructions: fetch, decode, execute, memory access, and write back.
The first design approach is to design the CPU as a finite state machine (FSM) with four or five states and perform all operations sequentially. For example:
But this architecture seriously limits instruction execution speed, because every instruction takes multiple clock cycles: a register write, for example, completes only after 3 clock cycles. For load/store instructions, memory latency adds even more. This is a poor and primitive way to design a CPU. Let's discard it completely!
The second approach is that the instruction can be fetched from the instruction memory, decoded, and then executed by fully combinatorial logic. Then, the result of the ALU is written back to the register file. The whole process until the write back can be completed in one clock cycle. Such a CPU is called a single-cycle CPU. If the instruction needs to access data memory, read/write latency should be taken into account. If the read/write latency is one clock cycle, then the store instruction may still execute in one clock cycle like all other instructions, but the load instruction may require an additional clock cycle because the loaded data must be written back to the register file. The PC generation logic must handle the effect of this latency. If the data memory read interface is combinatorial (asynchronous read), then the CPU becomes truly single-cycle for all instructions.
The main disadvantage of this architecture is obviously the long critical path of the combinatorial logic from instruction fetch to write to memory/register file, which limits the timing performance. However, this design approach is simple and suitable for low-end microcontrollers where low clock speed, low power and low area are required.
To achieve higher clock speeds and performance, the instruction sequential processing function of the CPU can be separated. Each sub-process is assigned to an independent processing unit. These processing units are cascaded in sequence to form a pipeline. All units work in parallel and operate on different parts of the instruction execution. In this way, multiple instructions can be processed in parallel. This technique to achieve instruction-level parallelism is called instruction pipelining. This execution pipeline forms the core of a pipelined CPU.
The classic five-stage RISC pipeline has five processing units, also called pipeline stages. These stages are: Instruction Fetch (IF), Decode (ID), Execute (EX), Memory Access (MEM), Write Back (WB). The working principle of the pipeline can be intuitively represented as follows:
Each clock cycle, every stage processes a different instruction, each at a different phase of its execution. If you look closely, you will see that instruction 1 only completes in the 5th cycle. This initial delay is called the pipeline latency, and it equals the number of pipeline stages. After this latency, instruction 2 completes in cycle 6, instruction 3 in cycle 7, and so on... In theory, we can calculate the throughput (instructions per cycle, IPC) as follows:
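The formula follows from the behavior just described: executing n instructions on an ideal s-stage pipeline takes s + (n - 1) cycles, so IPC = n / (s + n - 1), which tends to 1 as n grows. A quick sanity check in Python:

```python
# Throughput of an ideal s-stage pipeline: the first instruction completes
# after s cycles; each subsequent instruction completes one cycle later,
# so n instructions take (s + n - 1) cycles in total.
def ipc(n, s=5):
    return n / (s + n - 1)

# IPC approaches 1 as the instruction stream grows
short_run = ipc(10)       # 10 instructions:    ~0.71
long_run = ipc(10_000)    # 10000 instructions: ~0.9996
```

For a single instruction, IPC degenerates to 1/s, which is exactly the non-pipelined case.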
Therefore, a pipelined CPU approaches a rate of one instruction completed per clock cycle. This is the maximum IPC possible in a single-issue processor.
By splitting the critical path across multiple pipeline stages, the CPU can now also run at a higher clock speed. Mathematically, this gives a pipelined CPU a multiplicative throughput improvement over an equivalent non-pipelined CPU.
This is called pipeline speedup. In simple terms, a CPU with an S-stage pipeline can, ideally, run at S times the clock speed of its non-pipelined counterpart.
Pipelining generally increases area/power consumption, but the performance gain is worth it.
The math assumes that the pipeline never stalls, that is, data continues to flow from one stage to another on every clock cycle. But in real CPUs, pipelines can stall for a variety of reasons, the main ones being structural/control/data dependencies.
For example: the Nth instruction cannot read register X because the (N-1)th instruction has not yet written its result back to X. This is an example of a data hazard in the pipeline.
The Pequeno architecture uses a classic five-stage RISC pipeline. We will implement a strictly in-order pipeline. In an in-order processor, instructions are fetched, decoded, executed, and completed/committed in the order generated by the compiler. If one instruction stalls, the entire pipeline stalls.
In an out-of-order processor, instructions are fetched and decoded in the order generated by the compiler, but execution can proceed in a different order. If one instruction stalls, it does not stall subsequent instructions unless there are dependencies. Independent instructions can pass forward. Execution can still complete/commit in order (this is how it is in most CPUs today). This opens the door to a variety of architectural techniques that significantly improve throughput and performance by reducing clock cycles wasted by stalls and minimizing the insertion of bubbles (what are “bubbles”? Read on…).
Out-of-order processors are fairly complex due to the dynamic scheduling of instructions, but are now the de facto pipeline architecture in today’s high-performance CPUs.
The five pipeline stages are designed as independent units: Fetch Unit (FU), Decode Unit (DU), Execution Unit (EXU), Memory Access Unit (MACCU), and Write Back Unit (WBU).
Fetch Unit (FU): The first stage of the pipeline, interfaces with the instruction memory. The FU fetches instructions from the instruction memory and sends them to the Decode Unit. The FU may contain instruction buffers, initial branch logic, etc.
Decode Unit (DU): The second stage of the pipeline, responsible for decoding instructions from the Fetch Unit (FU). The DU also initiates read accesses to the register file. The packets from the DU and the register file are retimed and sent together to the Execution Unit.
Execution Unit (EXU): The third stage of the pipeline, which validates and executes all decoded instructions from the DU. Invalid/unsupported instructions are not allowed to continue in the pipeline and become "bubbles". The Arithmetic Logic Unit (ALU) handles all integer arithmetic and logical instructions. The Branch Unit handles jump/branch instructions. The Load/Store Unit handles load/store instructions that require memory access.
Memory Access Unit (MACCU): The fourth stage of the pipeline, which interfaces with the data memory. The MACCU initiates all memory accesses based on the instructions from the EXU. The data memory is an address space that may consist of data RAM, memory-mapped I/O peripherals, bridges, interconnects, etc.
Write Back Unit (WBU): The fifth or last stage of the pipeline. Instructions complete execution here. The WBU is responsible for writing the data (load data) from the EXU/MACCU back to the register file.
A valid-ready handshake is implemented between the pipeline stages. This is not so obvious at first glance. Each stage registers a data packet and sends it to the next stage. This packet may carry instruction/control/data information to be used by the next stage or subsequent stages, and it is qualified by a valid signal. If the packet is invalid, it is called a bubble in the pipeline. A bubble is nothing more than a "hole" that moves forward through the pipeline without actually performing any operation, similar to a NOP instruction. But don't think bubbles are useless! We will see one use for them in a later section when discussing pipeline hazards. The following table defines bubbles in the Pequeno instruction pipeline.
Each stage can also stall the previous stage by asserting a stall signal. Once stalled, a stage holds its data packet until the stall condition disappears. The stall signal is simply the inverse of a ready signal. In an in-order processor, a stall at any stage acts like a global stall, as it eventually stalls the entire pipeline.
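The handshake just described can be sketched as a toy model of a single pipeline stage register (a Python sketch; signal and class names are illustrative, not the actual RTL):

```python
# Minimal model of one pipeline stage register with a valid/stall handshake.
# stall_in comes from the downstream stage: while asserted, the stage holds
# its current packet instead of accepting a new one.
class StageReg:
    def __init__(self):
        self.packet = None
        self.valid = False        # an invalid packet is a bubble

    def clock(self, in_packet, in_valid, stall_in):
        if not stall_in:
            self.packet, self.valid = in_packet, in_valid
        return stall_in           # in-order pipe: stall propagates upstream

s = StageReg()
s.clock("i1", True, stall_in=False)   # i1 accepted into the stage
s.clock("i2", True, stall_in=True)    # downstream stalled: i1 is held
```

After the second clock, the stage still holds i1; i2 is only accepted once the stall deasserts.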
The flush signal is used to flush the pipeline. The flush operation will invalidate all packets registered by the previous stages at once, as they are identified as no longer useful.
For example, after a jump/branch instruction executes, the pipeline may have fetched and decoded instructions from the wrong path, and the mistake is only identified in the execute stage. The pipeline should then be flushed, and fetching should restart from the correct branch address!
Although pipelining significantly improves performance, it also increases the complexity of the CPU architecture. CPU pipelining always comes with its evil twin: pipeline hazards! For now, let's assume we know nothing about pipeline hazards; we didn't consider them when designing the architecture.
Dealing with Pipeline Hazards
In this chapter, we will explore pipeline hazards. Last time, we successfully designed a pipeline architecture for the CPU, but we didn't consider the "evil twin" that comes with pipelines. What impact can pipeline hazards have on the architecture? What architectural changes are needed to mitigate these hazards? Let's go ahead and demystify them!
Hazards in the CPU instruction pipeline are dependencies that interfere with its normal execution. When a hazard occurs, an instruction cannot execute in its designated clock cycle, because doing so could produce incorrect results or incorrect control flow. The pipeline may therefore be forced to stall until the instruction can execute safely.
In the above example, the CPU executes instructions in the order generated by the compiler. Assume instruction i2 has a dependency on i1: i2 needs to read a register that is still being modified by i1. i2 must therefore wait until i1 writes its result back to the register file; otherwise stale data would be read from the register file and used by the execute stage. To avoid this data inconsistency, i2 is forced to stall for three clock cycles. The bubbles inserted in the pipeline represent this stall or wait state. i2 is decoded only after i1 completes, and eventually finishes execution in the 10th clock cycle instead of the 7th. The data dependency thus introduces a three-cycle delay. How does this delay affect CPU performance?
Ideally, we expect the CPU to run at full throughput, i.e. CPI = 1. When the pipeline stalls, the CPI increases and the throughput/performance of the CPU drops. For a non-ideal CPU:
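The relation for a non-ideal CPU can be written as CPI = 1 + (average stall cycles per instruction). A tiny sketch, with an assumed instruction mix purely for illustration:

```python
# For a non-ideal single-issue pipeline, every stall cycle inflates the CPI:
#   CPI = 1 + (average stall cycles per instruction)
def cpi(stall_cycles_per_instr):
    return 1.0 + stall_cycles_per_instr

# Assumed mix for illustration: 20% of instructions hit the 3-cycle stall above
stall_cpi = cpi(0.2 * 3)   # ~1.6
```

Even a modest stall rate pushes the CPI well above the ideal value of 1.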
There are various ways in which hazards occur in the pipeline. Pipeline hazards can be divided into three categories:
- Structural hazards
- Control hazards
- Data hazards
Structural hazards occur due to hardware resource conflicts, i.e. when two stages of the pipeline want to access the same resource. For example: two instructions need to access memory in the same clock cycle.
In the above example, the CPU has only one memory for both instructions and data. The fetch stage accesses this memory every clock cycle to fetch the next instruction. So an instruction in the fetch stage may conflict with an earlier instruction in the memory access stage that also needs the memory. This forces the CPU to insert stall cycles: the fetch stage must wait until the instruction in the memory access stage releases the resource (the memory).
Some ways to mitigate structural hazards include:
- Stall the pipeline until the resource is available.
- Duplicate the resource so that no conflict can occur.
- Pipeline the resource so that the two instructions occupy different stages of the resource.

Let's analyze the different situations that can cause structural hazards in Pequeno's pipeline and how to solve them. We do not intend to use stalling as an option to mitigate structural hazards!
In Pequeno's architecture, we apply these solutions to mitigate the various structural hazards.
Control hazards are caused by jump/branch instructions, the flow-control instructions of the CPU ISA. When control reaches a jump/branch instruction, the CPU must decide whether to take the branch. At that point, the CPU should take one of the following actions.
Fetch the next instruction at PC+4 (branch not taken) or fetch the instruction at the branch target address (branch taken).
The correctness of the decision can only be verified when the execute stage computes the result of the branch instruction. Depending on whether the branch is taken or not, the branch address (the address the CPU should branch to) is resolved there. If the earlier decision turns out to be wrong, all instructions fetched and decoded in the pipeline up to that clock cycle must be discarded, because they should never have been executed at all! This is achieved by flushing the pipeline and fetching the instruction at the resolved branch address on the next clock cycle. Flushing invalidates those instructions, converting them into NOPs, i.e. bubbles. The clock cycles lost this way are called the branch penalty. This is why control hazards have the worst impact on CPU performance.
In the above example, i10 completes execution in the 10th clock cycle, though it should have completed in the 7th: 3 clock cycles were lost because the wrong path was fetched after the branch instruction (i5). When the execute stage identifies the misprediction in the 4th clock cycle, the pipeline must be flushed. How does this affect CPU performance?
If a program running on the above CPU contains 30% branch instructions, the CPI becomes:
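Plugging the numbers in (assuming, as in this naive pipeline, that every branch pays the full 3-cycle flush penalty):

```python
# Assumption: every branch in this naive pipeline pays the full flush penalty.
branch_fraction = 0.30   # 30% of instructions are branches
branch_penalty = 3       # cycles lost per flushed branch

cpi = 1 + branch_fraction * branch_penalty   # ~1.9
relative_perf = 1 / cpi                      # ~0.53 of the ideal CPI=1 machine
```

The CPI rises from 1 to about 1.9, so the CPU delivers only about 53% of its ideal throughput.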
CPU performance drops by nearly 50%!
To mitigate control hazards, we can adopt some strategies in the architecture...
If a fetched instruction is identified as a branch, simply stall the pipeline. This decode logic can be implemented in the fetch stage itself. Once the branch executes and the branch address is resolved, the next instruction is fetched and the pipeline resumes.
Add dedicated branch logic like branch prediction in the Fetch stage.
The essence of branch prediction is: we use some prediction logic in the instruction fetch stage to guess whether the branch should be taken. In the next clock cycle, we fetch the guessed instruction. This instruction is either fetched from PC+4 (predicted branch not taken) or from the branch target address (predicted branch taken). Now there are two possibilities:
If the prediction is found to be correct in the execute stage, nothing is done and the pipeline can continue processing.
If the prediction is found to be wrong, the pipeline is flushed and the correct instruction is fetched from the branch address resolved in the execute stage. This incurs a branch penalty.
As you can see, branch prediction still incurs a branch penalty if it predicts wrong. The design goal should be to reduce the probability of misprediction. The performance of a CPU depends a lot on how “good” the prediction algorithm is. Sophisticated techniques like dynamic branch prediction keep instruction history in order to predict correctly with 80% to 90% probability.
To mitigate control hazards in Pequeno, we will implement simple branch prediction logic. More details will be revealed in our upcoming blog on the design of the fetch unit.
A data hazard occurs when an instruction has a data dependency on the result of a previous instruction that is still being processed in the pipeline. Let's walk through the three types of data hazards with examples to better understand the concept.
Suppose an instruction i1 writes a result to register x. The next instruction i2 also writes a result to the same register. Any subsequent instruction in the program order should read the result of i2 at x. Otherwise, data integrity will be compromised. This data dependency is called output dependency and can lead to WAW (Write-After-Write) data hazard.
Suppose an instruction i1 reads register x, and the next instruction i2 writes a result to the same register. Here, i1 should read the old value of x, not the result of i2. If i2 writes its result to x before i1 has read it, a data hazard results. This data dependency is called an anti-dependency and can lead to a WAR (Write-After-Read) data hazard.
Suppose an instruction, i1, writes the result to register x. The next instruction, i2, reads the same register. At this point, i2 should read the value written by i1 to register x instead of the previous value. This data dependency is called a true dependency and can lead to a RAW (Read-After-Write) data hazard.
This is the most common and dominant type of data hazard in pipelined CPUs.
To mitigate data hazards in in-order CPUs, we can use some techniques:
Pipeline stalling: when a data dependency is detected, the pipeline is stalled (see the first figure); the decode stage simply waits until the previous instruction has executed.
Compiler rescheduling: the compiler reorders the code so that dependent instructions execute further apart, avoiding the hazard without affecting the integrity of the program control flow; this is not always possible. The compiler can also insert NOP instructions between two dependent instructions, but that stalls the pipeline and costs performance.
Data/Operand Forwarding: This is the prominent architectural solution for mitigating RAW data hazards in in-order CPUs. Let's analyze the CPU pipeline to understand the principle behind this technique.
Suppose two adjacent instructions, i1 and i2, have a RAW data dependency because both access register x. The CPU should stall i2 until i1 writes its result back to x. Without a stall mechanism, i2 reads a stale value of x in the decode stage during the third clock cycle, and in the fourth clock cycle executes with the wrong value of x.
If you look closely at the pipeline, we already have the result of i1 in the third clock cycle. Of course, it is not written back to the register file, but the result is still available at the output of the execute stage. So if we can somehow detect data dependencies and then "forward" that data to the input of the execute stage, then the next instruction can use the forwarded data instead of the data from the decode stage. That way, the data hazard is mitigated! The idea is this:
This is called data/operand forwarding, or data/operand bypassing. The data is forwarded ahead in time so that subsequent dependent instructions in the pipeline can pick up the bypassed data and execute correctly in the execute stage.
This idea extends across stages. In a 5-stage pipeline executing instructions in the order i1, i2, ... in, data dependencies may require:

- i1 and i2: bypass from the execute stage to the output of the decode stage.
- i1 and i3: bypass from the memory access stage to the output of the decode stage.
- i1 and i4: bypass from the write back stage to the output of the decode stage.
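The forwarding selection can be sketched as a small function: check the in-flight results from youngest to oldest, and fall back to the register-file value when nothing matches (a Python sketch; names and structure are illustrative, not the actual RTL):

```python
# Operand selection at the execute-stage input with forwarding.
# Each forward source is (destination register, value) or None.
def select_operand(rs, regfile_value, ex_out, mem_out, wb_out):
    for fwd in (ex_out, mem_out, wb_out):       # youngest result first
        if fwd is not None and fwd[0] == rs and rs != 0:
            return fwd[1]                       # forward the pending result
    return regfile_value                        # no hazard: use register file

# i1's result (42, destined for x5) sits at the EX output; i2 reads x5
val = select_operand(rs=5, regfile_value=7, ex_out=(5, 42), mem_out=None, wb_out=None)
```

The x0 check matters: x0 is hardwired to zero, so a pending "write" to x0 must never be forwarded.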
The architectural solution for mitigating RAW data hazards originating from any stage of the pipeline is as follows:
Consider the following scenario:
There is a data dependency between two adjacent instructions i1 and i2, where the first is a load. This is a special case of a data hazard: i2 cannot execute until the data has been loaded into x1. So, can we still mitigate this hazard with data forwarding? The load data only becomes available in the memory access stage of i1, and it would have to reach the decode stage output of i2 in time. The requirement is as follows:
Assuming the load data is available in the memory access stage in cycle 4, you would need to "forward" this data back to cycle 3, to the decode stage output of i2 (why cycle 3? Because in cycle 4, i2 is already in the execute stage!). Essentially, you would be forwarding data into the past, which is impossible unless your CPU can time travel! This is not data forwarding, but "data backtracking".
Data forwarding can only be done forward in time.
This data hazard is called a pipeline interlock. The only way to solve this problem is to insert a bubble to stall the pipeline for one clock cycle when the data dependency is detected.
A NOP instruction (aka bubble) is inserted between i1 and i2. This delays i2 by one cycle, so data forwarding can now forward the load data from the memory access stage to the output of the decode stage.
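The interlock condition itself is a simple check, sketched here in Python (the signal names are ours, not the actual RTL):

```python
# Load-use (pipeline interlock) detection: if the instruction in decode reads
# a register that the instruction in execute is LOADing, one bubble must be
# inserted -- forwarding alone cannot deliver data that doesn't exist yet.
def needs_interlock(ex_is_load, ex_rd, id_rs1, id_rs2):
    return ex_is_load and ex_rd != 0 and ex_rd in (id_rs1, id_rs2)

# i1 = LW x1, 0(x2); i2 = ADD x3, x1, x4  ->  stall one cycle
stall = needs_interlock(ex_is_load=True, ex_rd=1, id_rs1=1, id_rs2=4)
```

Only loads trigger the interlock; ALU results are already available at the execute-stage output and can be forwarded without stalling.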
So far, we have only discussed how to mitigate RAW data hazards. So, what about WAW and WAR hazards? A strictly in-order pipeline implementation like Pequeno's is inherently immune to them!
All register writebacks happen in the order in which instructions are issued, so a later write to the same register always lands after an earlier one. Therefore, a WAW hazard never occurs! Writeback is the last stage of the pipeline, so by the time an instruction writes back, every older instruction has already read its (older) operands and completed execution. Therefore, a WAR hazard never occurs either!
To mitigate RAW data hazards in Pequeno, we will implement data forwarding in hardware, together with pipeline interlock protection. More details will be revealed later, when we design the data forwarding logic.
We have now analyzed the various pipeline hazards that could cause incorrect instruction execution in our CPU architecture, and designed mechanisms to mitigate them. Let's put the necessary microarchitecture together and finalize the architecture of the Pequeno RISC-V CPU, free of all types of pipeline hazards!
In the following posts, we will dive into the RTL design of each pipeline stage/functional unit. We will discuss the different microarchitectural decisions and challenges during the design phase.
Fetch Unit
From here, we start to dive into the microarchitecture and RTL design! In this chapter, we will build and design the Fetch Unit (FU) of Pequeno.
The Fetch Unit (FU) is the first stage of the CPU pipeline that interacts with the instruction memory. The Fetch Unit (FU) fetches instructions from the instruction memory and sends the fetched instructions to the Decode Unit (DU). As discussed in the previous post on the improved architecture of Pequeno, the FU contains branch prediction logic and flush support.
1 Interfaces
Let’s define the interfaces of the Fetch Unit:
2 Instruction Access Interfaces
The core function of the FU is instruction access, served by the Instruction Access Interface (I/F). Instructions reside in the instruction memory (RAM) during execution. Modern CPUs fetch instructions from a cache instead of directly from the instruction memory. The instruction cache (the primary or L1 cache, in computer architecture terms) sits closer to the CPU and enables faster instruction access by caching frequently accessed instructions and prefetching larger nearby blocks. Thus, there is no need to constantly access the slower main memory (RAM); most instructions can be served quickly from the cache.
The CPU does not interface directly with the instruction cache/memory; a cache/memory controller sits between them to manage the memory accesses.
It is a good idea to define a standard interface so that any standard instruction memory/cache (IMEM) can be plugged into our CPU with little or no glue logic. Let's define two interfaces for instruction access. The Request interface (I/F) carries requests from the FU to the instruction memory. The Response interface (I/F) carries responses from the instruction memory back to the FU. We will define both as simple valid-ready interfaces, as these are easy to convert to bus protocols such as APB, AXI, etc. if necessary.
Instruction access requires the address of the instruction in memory. The address sent on the Request I/F is the PC generated by the FU. In the FU interface, we use a stall signal instead of a ready signal; it behaves as the inverse of ready. A cache controller usually provides such a stall signal to hold off requests from the processor; this signal is represented by cpu_stall. The response from memory is the fetched instruction, received on the Response I/F. Besides the instruction itself, the response also carries the corresponding PC. The PC acts as an ID identifying which request the response belongs to; in other words, it indicates the address of the fetched instruction. This is important information for the next stages of the CPU pipeline (how is it used? We will see soon!). The fetched instruction and its PC together form the response packet to the FU. When the internal pipeline is stalled, the CPU may also need to stall responses from the instruction memory; this signal is represented by mem_stall.
At this point, let's define instruction packet={instruction, PC} in the CPU pipeline.
3 PC Generation Logic
The core of the FU is the PC generation logic that drives the request I/F. Since we are designing a 32-bit CPU, the PC is generated in increments of 4. After reset, a new PC is generated every clock cycle. The reset value of the PC can be hard-coded; this is the address from which the CPU fetches and executes instructions after reset, i.e., the address of the first instruction in memory. PC generation is free-running logic that is stalled only by cpu_stall.
The free-running PC can be overridden by the flush I/F and the internal branch prediction logic. The PC generation logic works as follows:
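As an illustration (a behavioral sketch, not the author's RTL), the PC generation behavior can be written in Python. The signal names mirror those used in the text; RESET_PC and predicted_pc are assumed names:

```python
RESET_PC = 0x0000_0000  # hard-coded reset value (assumed address of the first instruction)

def next_pc(pc, cpu_stall, branch_flush, branch_pc,
            branch_taken, predicted_pc):
    """One cycle of the FU's free-running PC generation.

    Priority: pipeline flush from the EXU > local flush from the
    branch predictor > stall from the IMEM > free-running increment.
    """
    if branch_flush:          # EXU redirects the pipeline on a branch miss
        return branch_pc
    if branch_taken:          # static prediction redirects the fetch stream
        return predicted_pc
    if cpu_stall:             # IMEM cannot accept a request; hold the PC
        return pc
    return pc + 4             # 32-bit instructions: increment by 4
```

After reset the PC starts at RESET_PC and advances by 4 every cycle unless stalled or redirected.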
4 Instruction Buffers
There are two back-to-back instruction buffers inside the FU. Buffer 1 buffers the instruction fetched from the instruction memory; it directly interfaces with the response I/F. Buffer 2 buffers the instruction from Buffer 1 and then sends it to the DU through the DU I/F. Together, the two buffers form the instruction pipeline inside the FU.
5 Branch Prediction Logic
As discussed above, we must add branch prediction logic in the FU to mitigate control hazards. We will implement a simple, static branch prediction algorithm. The gist of the algorithm is as follows:
Always take unconditional jumps.
If a branch instruction jumps backward, predict it taken, because:
1. The instruction may be part of the loop exit check of a do-while loop. In that case, taking the branch is correct with higher probability.
If a branch instruction jumps forward, predict it not taken, because:
2. The instruction may be part of the loop entry check of a for loop or while loop. In that case, not taking the branch and continuing with the next instruction is correct with higher probability.
3. The instruction may be part of an if-else statement. In this case, we always assume the if condition is true and continue with the next instruction. In theory, this bet is correct 50% of the time.
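This policy is the classic "backward taken, forward not taken" (BTFN) scheme. It can be sketched as follows (an illustrative function, not the author's RTL). For B-type instructions, the sign of the branch offset sits in bit 31 of the instruction word, so a set bit 31 means a backward jump:

```python
OPCODE_JAL    = 0b1101111  # unconditional jump (J-type)
OPCODE_BRANCH = 0b1100011  # conditional branches (B-type)

def predict_taken(instr):
    """Static BTFN prediction from a 32-bit RV32I instruction word."""
    opcode = instr & 0x7F
    if opcode == OPCODE_JAL:
        return True                       # always take unconditional jumps
    if opcode == OPCODE_BRANCH:
        return bool((instr >> 31) & 1)    # negative offset => backward => taken
    return False                          # not a branch/jump: fall through

# jal x0, 0      -> 0x0000006F : predicted taken
# beq x0, x0, -4 -> 0xFE000EE3 : backward branch, predicted taken
# beq x0, x0, +8 -> 0x00000463 : forward branch, predicted not taken
```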
The instruction packet in Buffer 1 is monitored and analyzed by the branch prediction logic, which generates a branch prediction signal: branch_taken. This signal is then registered and travels in lockstep with the instruction packet sent to the DU. The branch prediction signal is sent to the DU through the DU I/F.
6 DU I/F
This is the main interface between the fetch unit and the decode unit for sending payloads. The payload contains the fetched instructions and branch prediction information.
Since this is the interface between two pipeline stages of the CPU, a valid-ready I/F is implemented. The following signals constitute the DU I/F:
7 Stall and Flush Logic
In the previous blog post, we discussed the concepts of stall and flush in the CPU pipeline and their importance. We also discussed the various scenarios in the Pequeno architecture that require a stall or a flush. Therefore, proper stall and flush logic must be integrated into each pipeline stage of the CPU. It is crucial to determine at which stage a stall or flush is required, and which part of the logic in that stage needs to be stalled or flushed.
Some initial thoughts before implementing the stall and flush logic:
Pipeline stages may be stalled by externally or internally generated conditions. Pipeline stages may be flushed by externally or internally generated conditions. There is no centralized stall or flush generation logic in Pequeno; each stage may have its own stall and flush generation logic. A stage in the pipeline can be stalled only by the stage downstream of it. Any stage that stalls eventually affects the upstream pipeline and may stall the entire pipeline.
A stage can be flushed by any stage downstream of it. This is called a pipeline flush, because the entire upstream pipeline needs to be flushed at the same time. In Pequeno, a pipeline flush is required only on a branch miss in the Execution Unit (EXU).
The stall logic comprises logic to generate local and external stalls. The flush logic comprises logic to generate local and pipeline flushes.
A local stall is generated internally and used locally to stall the current stage. An external stall is generated internally and sent out to the next stage of the upstream pipeline. Both local and external stalls are generated based on internal conditions and on the external stall from the next stage of the downstream pipeline.
A local flush is generated internally and used to flush the local stage. An external flush, or pipeline flush, is generated internally and sent out to the upstream pipeline, flushing all upstream stages simultaneously. Both local and external flushes are generated based on internal conditions.
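To make the stall propagation rule concrete, here is a small illustrative sketch (an assumed helper, not the author's RTL) of how one stage combines its internal stall condition with the stall coming from the downstream stage:

```python
def stage_stall(local_cond, downstream_stall):
    """Stall outputs of one pipeline stage.

    local_stall freezes this stage; ext_stall is forwarded to the
    upstream stage, so a stall asserted downstream ripples backwards
    until the whole pipeline is stalled.
    """
    local_stall = local_cond or downstream_stall
    ext_stall = local_stall  # the upstream stage must also hold its packet
    return local_stall, ext_stall

# A stall asserted by the EXU ripples back: first the DU stalls, then the FU.
```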
Only the DU can stall the FU externally. When the DU asserts its stall, the internal instruction pipeline of the FU (Buffer 1 -> Buffer 2) should halt immediately, and since the FU can no longer accept packets from the IMEM, it should also assert mem_stall to the IMEM. Depending on the pipeline/buffer depth inside the IMEM, the PC generation logic may also eventually be stalled by cpu_stall from the IMEM, since the IMEM cannot accept any more requests. There are no internal conditions in the FU that cause a local stall.
Only the EXU can flush the FU externally. The EXU initiates a branch_flush in the CPU instruction pipeline, passing the address of the next instruction to be fetched after the pipeline is flushed (branch_pc). The FU provides a flush interface (Flush I/F) to accept the external flush.
Buffer 1, Buffer 2, and the PC generation logic in the FU are flushed by branch_flush. The branch_taken signal from the branch prediction logic also acts as a local flush to Buffer 1 and the PC generation logic. If a branch is predicted taken:
The next instruction should be fetched from the branch-predicted PC. Therefore, the PC generation logic should be flushed, and the next PC should be branch_pc. The instruction in Buffer 1 should be flushed and invalidated, i.e., a NOP/bubble is inserted.
Wondering why Buffer 2 is not flushed by branch_taken? Because the branch instruction from Buffer 1 (which triggers the flush) is buffered into Buffer 2 in the next clock cycle and allowed to continue through the pipeline. This instruction should not be flushed!
The instruction memory pipeline should also be flushed appropriately. The IMEM flush, mem_flush, is generated from branch_flush and branch_taken.
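The flush rules above can be summarized in a small illustrative sketch (signal names follow the text; the function itself is an assumption, not the author's RTL):

```python
def fu_flush(branch_flush, branch_taken):
    """Flush controls inside the FU for one cycle.

    branch_flush comes from the EXU (branch miss); branch_taken is
    the local flush from the static branch predictor.
    """
    flush_pc_gen = branch_flush or branch_taken  # redirect the PC
    flush_buf1   = branch_flush or branch_taken  # insert a bubble in Buffer 1
    flush_buf2   = branch_flush                  # branch_taken must NOT flush the branch itself
    mem_flush    = branch_flush or branch_taken  # flush the IMEM pipeline too
    return flush_pc_gen, flush_buf1, flush_buf2, mem_flush
```

Note how Buffer 2 is spared on branch_taken, so the predicted-taken branch instruction keeps flowing down the pipeline.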
Let's integrate all the microarchitectures designed so far to complete the architecture of the Fetch Unit.
Ok, everyone! We have successfully designed the Fetch Unit of Pequeno. In the next part, we will design the Decode Unit (DU) of Pequeno.
Decode Unit
The Decode Unit (DU) is the second stage of the CPU pipeline and is responsible for decoding instructions from the Fetch Unit (FU) and sending them to the Execution Unit (EXU). In addition, it is responsible for decoding register addresses and sending them to the register file for register read operations.
Let's define the interface of the Decode Unit.
Among them, the FU interface is the main interface between the fetch unit and the decode unit to receive the payload. The payload contains the fetched instructions and branch prediction information. This interface has been discussed in the previous section.
The EXU interface is the main interface between the decode unit and the execution unit to send the payload. The payload includes the decoded instructions, branch prediction information, and decoded data.
The following are the instruction and branch prediction signals that make up the EXU I/F:
Decoded data is the important information that the DU decodes from the fetched instructions and sends to the EXU. Let's understand what information the EXU needs to execute an instruction.
Opcode, funct3, funct7: identify the operation that the EXU is going to perform on the operands. Operands: depending on the opcode, the operands can be register data (rs0, rs1), the register address for writeback (rdt), or a 12-bit/20-bit immediate value. Instruction type: identifies which operands/immediate must be processed.
The decoding process can be tricky. If you understand the ISA and the instruction formats correctly, you can recognize the patterns of the different instruction types. Recognizing these patterns helps in designing the decoding logic of the DU.
The following information is decoded and sent to the EXU via the EXU I/F.
The EXU will use this information to demultiplex the data to the appropriate execution subunit and execute the instruction.
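For illustration, the fixed fields can be sliced out of a 32-bit RV32I instruction word like this (a behavioral sketch, not the author's decoder; the names follow the article's convention of rs0/rs1 for the source registers and rdt for the destination):

```python
def decode_fields(instr):
    """Extract the fixed RV32I fields from a 32-bit instruction word."""
    return {
        "opcode": instr & 0x7F,          # bits [6:0]
        "rdt":    (instr >> 7) & 0x1F,   # bits [11:7], destination register
        "funct3": (instr >> 12) & 0x7,   # bits [14:12]
        "rs0":    (instr >> 15) & 0x1F,  # bits [19:15], source register 1
        "rs1":    (instr >> 20) & 0x1F,  # bits [24:20], source register 2
        "funct7": (instr >> 25) & 0x7F,  # bits [31:25]
    }

# add x3, x1, x2 -> 0x002081B3
f = decode_fields(0x002081B3)
# f["opcode"] == 0x33, f["rdt"] == 3, f["rs0"] == 1, f["rs1"] == 2
```

Which of these fields are meaningful for a given instruction depends on the decoded instruction type (R/I/S/B/U/J); the immediate bits are reassembled differently per type.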
For R-type instructions, the source registers rs0 and rs1 must be decoded and read. The data read from these registers are the operands. All general-purpose registers reside in the register file outside the DU. The DU uses the register file interface to send the addresses of rs0 and rs1 to the register file for register access. The data read from the register file should be sent to the EXU in the same clock cycle as the payload.
The register file takes one cycle to read a register. The DU also takes one cycle to register the payload it sends to the EXU. Therefore, the source register addresses are decoded directly from the FU instruction packet by combinational logic. This keeps 1) the payload from the DU to the EXU and 2) the read data from the register file to the EXU aligned in time.
Only the EXU can stall the DU externally. When the EXU asserts its stall, the internal instruction pipeline of the DU should halt immediately, and the DU should also assert its stall to the FU because it can no longer accept packets from the FU. For synchronous operation, the register file should be stalled together with the DU, because both sit at the same stage of the CPU's five-stage pipeline. Therefore, the DU forwards the external stall from the EXU to the register file. There are no internal conditions in the DU that cause a local stall.
Only the EXU can flush the DU externally. The EXU initiates a branch_flush in the CPU instruction pipeline, passing the address of the next instruction to be fetched after the pipeline is flushed (branch_pc). The DU provides a flush interface (Flush I/F) to accept the external flush.
The internal pipeline is flushed by branch_flush. The branch_flush from the EXU should invalidate the DU instruction headed to the EXU immediately, with 0-clock-cycle latency. This avoids a potential control hazard in the EXU in the next clock cycle.
In the Fetch Unit design, we did not invalidate the FU instruction with 0-cycle latency on receiving branch_flush. This is because the DU will also be flushed in the next clock cycle, so no control hazard can arise in the DU, and there is no need to invalidate the FU instruction. The same idea applies to the instructions from the IMEM to the FU.
The diagram above shows how the instruction packet and branch prediction data from the FU are buffered in the DU stage of the instruction pipeline. Only a single level of buffering is used in the DU.
Let’s integrate all the microarchitectures designed so far to complete the architecture of the Decode Unit.
Currently we have completed: Fetch Unit (FU), Decode Unit (DU). In the next section, we will design the register file of Pequeno.
Register File
In a RISC-V CPU, the register file is a key component: a set of general-purpose registers used to store data during execution. The Pequeno CPU has 32 32-bit general-purpose registers (x0 – x31).
Register x0 is called the zero register. It is hardwired to the constant value 0, providing a useful default that other instructions can use. Suppose you want to initialize another register to 0; simply execute mv x1, x0.
x1-x31 are general-purpose registers used to hold intermediate data, addresses, and results of arithmetic or logical operations.
In the CPU architecture designed in the previous article, the register file requires two access interfaces.
Among them, the read access interface is used to read the registers at the addresses sent by the DU. Some instructions (such as ADD) require two source register operands (rs0 and rs1). Therefore, the read access I/F needs two read ports to read two registers simultaneously. Read access should be single-cycle so that the read data reaches the EXU in the same clock cycle as the DU's payload; this keeps the read data and the DU payload synchronized in the pipeline.
The write access interface is used to write the execution result back to the register at the address sent by the WBU (Write-Back Unit). Only one destination register, rdt, is written at the end of execution, so one write port is sufficient. Write access should also be single-cycle.
Since the DU and the register file must stay synchronized at the same pipeline stage, they should always be stalled together (why? Check the block diagram in the previous section!). For example, if the DU is stalled, the register file must not present new read data to the EXU, as this would corrupt the pipeline. In this case, the register file should be stalled as well. This is ensured by inverting the DU's stall signal to generate the register file's read_enable. When the stall is asserted, read_enable is driven low and the previous data is held at the read data output, effectively stalling register file operation.
Since the register file does not send any instruction packets to the EXU, it does not need any flush logic. Flushing only needs to be handled inside the DU.
In summary, the register file is designed with two independent read ports and one write port. Both read and write accesses are single cycle. The read data is registered. The final architecture is as follows:
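As an illustration of these rules, here is a small behavioral model in Python (a sketch under stated assumptions, not the author's RTL): two read ports, one write port, registered read data gated by read_enable, and x0 hardwired to zero:

```python
class RegFile:
    """Behavioral model: 32 x 32-bit registers, 2R/1W, registered reads."""

    def __init__(self):
        self.regs = [0] * 32
        self.rdata0 = 0  # registered read data, port 0
        self.rdata1 = 0  # registered read data, port 1

    def clock(self, raddr0, raddr1, read_enable, waddr=None, wdata=0):
        """Simulate one clock edge; returns the registered read data."""
        # Read ports sample the pre-write value (read-old behavior assumed).
        # Outputs hold their previous data when read_enable is low
        # (read_enable is the inverted DU stall signal).
        if read_enable:
            self.rdata0 = self.regs[raddr0]
            self.rdata1 = self.regs[raddr1]
        # Write port: one destination register (rdt) per cycle;
        # x0 is hardwired to zero, so writes to it are dropped.
        if waddr is not None and waddr != 0:
            self.regs[waddr] = wdata & 0xFFFFFFFF
        return self.rdata0, self.rdata1
```

Whether a same-cycle write is visible to a simultaneous read of the same address (write-first vs. read-first) is an implementation choice; this sketch assumes read-first.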
We have currently completed: instruction fetch unit (FU), decode unit (DU), register file.
Please stay tuned for the next part.