Chinese chips
  • IBM To Bring Deca's Fan-Out Packaging Technology To North America
    IBM has formed an alliance with Deca Technologies to leverage Deca's MFIT technology and enter the fan-out wafer-level packaging (FOWLP) market, with plans to build a new production line at its Bromont plant in Canada in the second half of 2026. On May 22, the two companies signed an agreement to bring Deca's M-Series and Adaptive Patterning technology into the plant, focusing on MFIT to expand the supply chain for high-performance chiplet integration. Global FOWLP capacity is concentrated in Asia, while North America is expanding local capacity. IBM today focuses on chip design and packaging. The partnership aims to capture markets such as AI, and it also reflects the regionalization trend in the global semiconductor supply chain.

    IBM and Deca Technologies form an alliance in semiconductor packaging

    IBM and Deca Technologies have formed an important alliance in the semiconductor packaging field, one that will take IBM into the advanced fan-out wafer-level packaging market. Under the plan, IBM expects to build a new high-volume production line inside its existing packaging plant in Bromont, a city in southern Quebec, Canada. At some point in the future, the new line is expected to produce advanced packages based on Deca's M-Series Fan-Out Interposer Technology (MFIT), which enables a new class of complex multi-chip packages.

    IBM has provided packaging and test services to external clients at Bromont for many years, in addition to meeting its own internal needs. With the Deca announcement, IBM will expand its packaging capabilities into fan-out wafer-level packaging (FOWLP). Basically, after a chip is manufactured in a wafer fab, it is assembled into a package, a small enclosure that protects one or more chips from harsh operating conditions. FOWLP is an advanced packaging form that can integrate complex chips into the package; FOWLP and other advanced package types help improve chip performance. Deca's MFIT is an advanced form of FOWLP in which the latest memory devices, processors, and other chips can be integrated in a 2.5D/3D package. Deca CEO Tim Olson described MFIT as a high-density integration platform for AI and other memory-intensive computing applications.

    FOWLP is an enabling technology, but most, if not all, of the world's FOWLP capacity is located in Asia, where companies such as ASE and TSMC produce fan-out packages. Some customers, however, may wish to manufacture and package chips in North America. At some point in the future, those customers may have two new fan-out capacity options there: IBM is working toward this, and SkyWater, a US wafer foundry, is developing fan-out capacity based on Deca's technology at a plant in the United States.

    A Brief History of IBM

    IBM is an iconic brand in computing with a long history, and a long, sometimes painful, history in the semiconductor industry. IBM's origins trace back to 1911, when a company called the Computing-Tabulating-Recording Company (CTR) was established to provide record-keeping and measuring systems. In 1924, CTR was renamed International Business Machines. In 1952, IBM launched its first commercial/scientific computer, the 701 Electronic Data Processing Machine.
    The 701 integrated three electronic technologies: vacuum tubes, magnetic drums, and magnetic tape. Four years later, IBM established a new semiconductor R&D team with the goal of finding a technology to replace outdated vacuum tubes in its systems. In the 1960s, IBM developed a newer, more advanced alternative, solid-state electronics based on an emerging technology called the integrated circuit (IC), and then adopted ever more advanced chip technology across its computer product line. In 1966, IBM established its Microelectronics division, which became the company's semiconductor arm; at the time, the company developed chips for its own systems. In the same year, IBM's Robert Dennard invented DRAM, which is still used as main memory in personal computers, smartphones, and other products today.

    Another major event came in 1993, when IBM entered the commercial semiconductor market, manufacturing and selling ASICs, processors, and other chips to external customers. In the 1990s, IBM also entered the foundry business, laying the groundwork for competition with companies such as TSMC: it offered leading-edge processes and RF technology to foundry customers and produced chips in its own wafer fabs. In the 2010s, however, IBM's Microelectronics division ran into trouble. It struggled in the commercial semiconductor business, losing millions of dollars, and its foundry business also faltered. In 2014, IBM handed its Microelectronics division (including its wafer fabs and foundry business) to foundry supplier GlobalFoundries (GF), paying GF approximately $1.5 billion to take the division.

    IBM's current semiconductor/packaging work

    Time flies. Today, IBM provides not only systems but also hybrid cloud and consulting services, and it is still involved in semiconductors. It designs processors and other chips but no longer produces them in its own fabs, relying instead on contract manufacturers. IBM also runs a large semiconductor R&D center in New York. In 2015, its R&D arm developed a groundbreaking transistor technology called the nanosheet, essentially a next-generation gate-all-around (GAA) transistor.

    In addition, IBM has provided packaging and test services to customers at Bromont for many years; in fact, the Bromont plant is the largest outsourced semiconductor assembly and test (OSAT) facility in North America, offering flip-chip packaging and test services. IBM is also developing an assembly process for co-packaged optics. And IBM has established an important alliance with Rapidus, a foundry startup headquartered in Japan: Rapidus is developing a 2nm process based on IBM's nanosheet transistor technology, and the two are also jointly developing methods for producing chiplets, small modular dies that are electrically connected and combined in one package to form a new, complex chip.

    Now IBM is working with Deca to develop fan-out packaging capabilities. According to IBM's website, the company plans to bring up its FOWLP manufacturing capability in the second half of 2026.

    What is fan-out?

    FOWLP is not a new technology; it has a long development history. FOWLP gained fame in 2016, when Apple used TSMC's fan-out packaging technology in the iPhone 7.
    In that package, TSMC stacked DRAM on top of the application processor. The processor, the A10, was designed by Apple and manufactured by TSMC on a 16nm process. Apple adopted TSMC's fan-out packaging in subsequent smartphones as well. FOWLP has a wide range of applications; for example, fan-out can integrate multiple chips and components such as MEMS, filters, crystals, and passive devices. But the distinguishing feature of fan-out is its ability to deliver small packages with large I/O counts. In many cases, small chips end up in oversized packages, which wastes space. According to ASE, in fan-out packaging the package size is roughly the same as the chip itself, and fan-out can be defined as "a package where any connection is fanned out from the chip surface to support more external I/O."

    Taiwan's ASE, the world's largest OSAT, runs a fan-out packaging production line based on Deca's M-Series technology. South Korean OSAT Nepes is another Deca licensee. On the R&D side, IBM and SkyWater are developing fan-out packaging based on Deca's technology. Last year, SkyWater and Deca announced a $120 million contract with the US Department of Defense; SkyWater is expected to produce fan-out packages at its US plant by the end of this year.

    Meanwhile, Deca has developed multiple versions of its M-Series fan-out technology. Broadly, M-Series technology helps customers develop single-chip and multi-chip packages, 3D packages, and chiplet designs. Deca has also developed a manufacturing technology for the M-Series called Adaptive Patterning, used to produce fine-pitch fan-out packages. The M-Series includes a version called MFIT, an advanced technology covering double-sided routing, dense 3D interconnects, and embedded bridge dies. It lets customers develop multi-chip packages that integrate high-bandwidth memory (HBM), processors, and other devices.

    Deca's Olson said: "MFIT uses M-Series chip-first fan-out technology combined with embedded bridge technology to create a high-density chiplet interposer, on which the processor and memory chips are then mounted. Adaptive Patterning can achieve extremely high density at pitches below 10 µm." He added: "MFIT uses Deca's second-generation technology, starting at a 20 µm pitch for embedded components with a plan to move progressively to finer pitches. The flip-chip attach of chip-scale devices on the interposer starts at today's industry-leading pitch, likewise with a plan to move to finer pitches. Adaptive Patterning can extend to finer pitches while maintaining strong manufacturability through design during manufacturing."

    Fan-out is not the only choice in advanced packaging; other options include 2.5D and 3D packaging as well as chiplet technology. In short, the market offers multiple options, with more innovation to come.
    - May 28, 2025
  • First "Made in India" chip produced by semiconductor factories in the northeast region.
    First "Made in India" chip produced by semiconductor factories in the northeast region.
    Indian Prime Minister Narendra Modi announced on Friday (May 23) that India will soon get its first "Made in India" chip from semiconductor plants in the Northeast region, saying the region is becoming an important destination for both the energy and semiconductor industries. "Today, the Northeast is playing an increasingly important role in strengthening India's semiconductor ecosystem. India will soon get its first 'Made in India' chip from semiconductor plants in the Northeast region," Modi said in his inaugural address at the Rising Northeast Investors Summit 2025.

    Last August, Tata Group began building a semiconductor plant in Assam with a total investment of 270 billion rupees. The Prime Minister said semiconductor plants have opened up opportunities for the semiconductor industry and other cutting-edge technologies in the region. Modi said the government is making large-scale investments in hydropower and solar across the northeastern states, with projects worth tens of millions of rupees already allocated.

    He said investors have not only the opportunity to invest in plants and infrastructure in the Northeast, but also a golden opportunity to invest in the region's manufacturing industry. He emphasized that significant investment is needed in solar modules, batteries, energy storage, and research and development, because they represent the future. "The more we invest in the future, the less we rely on other countries," he said.

    The Prime Minister said robust roads, good power infrastructure, and logistics networks are the pillars of every industry: where there is seamless connectivity, trade flourishes. Robust infrastructure, in other words, is the first condition and foundation of any development. Modi said the trade potential of the Northeast region will double in the next decade. At present, trade between India and ASEAN is close to US$125 billion; in the coming years it will exceed US$200 billion, and the Northeast region will become a solid bridge to that goal, serving as a trade gateway to ASEAN.

    Adani Group Chairman Gautam Adani announced in a speech that the group will invest an additional 500 billion rupees in the Northeast region over the next 10 years, on top of the 500 billion rupees it committed to Assam three months ago.
    - May 24, 2025
  • Proposal and Working Principle of a Gallium Oxide/p-NiO Heterojunction Bidirectional Switching Device
    The earlier power p-GaN SJ BDS, a gallium nitride superjunction high-voltage bidirectional switching device, concentrates the surge stress of the original lateral PSJ and lateral p-GaN RESURF structures on the region closest to the edge of the polarization structure. That creates reliability problems under overload surges, and the large capacitance of the RESURF field plate aggravates hot-electron injection in this region during such surges. So Erbao thought it over and decided against multiple thin p-GaN field-limiting rings, all tied to the drain to form a uniform voltage divider, and instead considered a RESURF superjunction voltage-blocking structure.

    So, why not try another high-voltage bidirectional switch? And what to call it? Power p-GaN SJ BDS, a gallium nitride superjunction high-voltage bidirectional switching device?

    A friend left a message asking: Erbao, didn't you present and discuss new gallium oxide device structures at the Nanjing meeting on Saturday? Can this superjunction bidirectional switch (SJ BDS) structure be used for gallium oxide devices?

    Of course, Erbao wants to give it a try. If one day a second Shuji Nakamura discovers a new buffer-growth technique that can grow quasi-single-crystal-quality gallium oxide epitaxial layers directly on foreign substrates such as silicon or sapphire wafers, perhaps gallium oxide will shine in heteroepitaxial lateral high-voltage devices, even high-voltage integrated ICs, and even displace GaN or silicon carbide devices in many fields.

    A heterojunction bidirectional switching device built from gallium oxide (Ga₂O₃) and p-type nickel oxide (p-NiO) is a new type of power electronic device. Its working principle combines the properties of wide-bandgap semiconductors, heterojunction band engineering, and superjunction structure design to achieve high blocking voltage, low loss, and bidirectionally controllable switching. A detailed analysis of its working principle follows.

    ---

    **1. Material and structural characteristics**
    - **Gallium oxide (Ga₂O₃)**:
      - An ultra-wide-bandgap semiconductor (bandgap about 4.8–4.9 eV) with an extremely high critical breakdown field (about 8 MV/cm), well suited to high-voltage applications.
      - Naturally n-type; stable p-type doping is lacking, so p-type material (such as p-NiO) must be introduced via a heterojunction.
    - **p-type nickel oxide (p-NiO)**:
      - A p-type transparent conductive oxide that forms a heterojunction with Ga₂O₃, compensating for the missing p-type Ga₂O₃ and providing hole-injection capability.
      - The band alignment at the heterojunction interface is critical for carrier transport (it may form a Type-II band structure, promoting charge separation).
    - **Superjunction structure**:
      - Alternating p-NiO and n-Ga₂O₃ regions; charge balance optimizes the lateral electric-field distribution, significantly raising the breakdown voltage while reducing the on-resistance.
    ---

    **2. Bidirectional switching mechanism**

    **(1) Blocking state (off)**
    - **Forward and reverse blocking**:
      - Under voltage of either polarity, the heterojunction interface and the superjunction structure spread a uniformly distributed electric field through extended depletion regions, avoiding local field crowding.
      - Superjunction charge balance lets the longitudinal electric field (perpendicular to the junction) be shared by the transverse electric field (parallel to the junction), significantly raising the breakdown voltage (up to several kilovolts).

    **(2) Conducting state (on)**
    - **Bidirectional carrier injection**:
      - Forward bias (Ga₂O₃ terminal positive): holes from p-NiO inject into Ga₂O₃ and electrons from Ga₂O₃ inject into p-NiO, lowering the heterojunction barrier and producing bipolar conduction.
      - Reverse bias (Ga₂O₃ terminal negative): thanks to the symmetric superjunction design, a conduction path also forms at the p-NiO/Ga₂O₃ interface under reverse bias, enabling bidirectional current flow.
      - The high doping concentration permitted by the superjunction structure further reduces the on-resistance (Ron) and improves efficiency.

    **(3) Switching trigger mechanisms**
    - **Voltage-triggered**: when the applied voltage exceeds a threshold, avalanche breakdown or tunneling in the heterojunction depletion region multiplies carriers and turns the device on rapidly.
    - **Field-controlled**: active switching is achieved by modulating the heterojunction barrier height through a gate (if one is designed in) or through the structural electric field.

    ---

    **3. Key advantages**
    - **High blocking voltage**: the superjunction structure and the high breakdown field of Ga₂O₃ together support blocking voltages in the kilovolt range.
    - **Low conduction loss**: the bipolar conduction mechanism (electrons and holes conducting together) reduces Ron and improves energy efficiency.
    - **Bidirectional symmetry**: the structural design gives consistent electrical characteristics in both directions, suiting AC circuits and bidirectional power control.
    - **High-temperature stability**: the wide-bandgap materials tolerate high temperatures, suiting harsh environments.

    ---

    **4. Potential applications**
    - High-voltage DC/AC converters, e.g., smart grids and electric-vehicle charging systems.
    - Solid-state circuit breakers: fast-response, high-reliability circuit protection.
    - RF power devices: high-frequency, high-power communication systems.

    ---

    **5. Challenges and research directions**
    - **Interface optimization**: defects at the Ga₂O₃/p-NiO heterojunction interface may impair carrier transport and need to be improved through annealing or interface passivation.
    - **Thermal management**: Ga₂O₃ has low thermal conductivity and must be paired with heat-dissipation design (such as diamond-substrate integration).
    - **Process compatibility**: heteroepitaxial growth and superjunction fabrication are complex, and low-cost volume-production techniques must be developed.
    ---

    **Summary**

    The gallium oxide/p-NiO heterojunction bidirectional switching device combines heterojunction band engineering with superjunction charge-balance design to achieve high-voltage bidirectional conduction and fast switching. It promises to break through the performance limits of traditional silicon devices and advance the next generation of high-power electronic systems.
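    To make the superjunction charge-balance idea above concrete, here is a minimal worked relation in LaTeX. It is a generic textbook sketch, not data from a specific device in the article; the symbols (N_A, N_D for pillar doping, W_p, W_n for pillar widths, E_c for critical field, L for drift length) are my own notation.

```latex
% Generic superjunction charge-balance sketch (illustrative, not device data):
% the acceptor charge per unit area in each p-NiO pillar should match the
% donor charge in the adjacent n-Ga2O3 pillar,
\[ N_A \, W_p \approx N_D \, W_n \]
% When balanced, the pillars deplete each other fully and the lateral field
% is nearly uniform, so the ideal blocking voltage scales with drift length
% L and the critical field E_c (about 8 MV/cm for Ga2O3, as quoted above):
\[ V_{BR} \approx E_c \, L \]
```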
    - May 21, 2025
  • Teach you how to design a RISC-V CPU
    In recent years, RISC-V has attracted global attention. This revolutionary ISA has swept the market with continuous innovation, countless learning and tooling resources, and contributions from the engineering community. The biggest attraction of RISC-V is that it is an open-source ISA. In this article, I (Mitu Raj, the author of this article, and so throughout) will introduce how to design a RISC-V CPU from scratch. We will walk through defining the specification, designing and refining the architecture, identifying and solving challenges, developing the RTL, implementing the CPU, and testing it in simulation and on an FPGA board.

    Start with a Name

    It is important to name or brand your idea so that you can keep going until you reach your goal! We are going to build a very simple processor, so I came up with a fancy name, "Pequeno", which means "tiny" in Spanish; the full name is Pequeno RISC-V CPU, aka PQR5. RISC-V has many flavors and extensions of the ISA. We will start with the simplest one, RV32I, aka the 32-bit base integer ISA, which is suitable for building 32-bit CPUs that support integer operations. So the first spec of Pequeno is: Pequeno is a 32-bit RISC-V CPU that supports the RV32I ISA.

    RV32I has 37 32-bit base instructions, all of which we plan to implement in Pequeno. Therefore, we have to understand each instruction in depth. It took me a while to fully grasp the ISA. In the process, I learned the complete specification and wrote my own assembler, pqr5asm, which was verified against some popular RISC-V assemblers.

    "RISBUJ": this six-letter word summarizes the instruction types in RV32I. The 37 instructions each belong to one of the following categories:
    R-type: all integer computation instructions on registers.
    I-type: all integer computation instructions on a register and an immediate; also includes JALR and the load instructions.
    S-type: all store instructions.
    B-type: all branch instructions.
    U-type: special instructions such as LUI and AUIPC.
    J-type: jump instructions like JAL.

    There are 32 general-purpose registers in the RISC-V architecture, x0-x31, all 32 bits wide. Among them, x0 (named zero) is a useful special register: it is hardwired to zero, cannot be written, and always reads as zero. So what is it used for? You can use x0 as a dummy destination to dump results you don't want to read, as a zero operand, or to generate NOP instructions to idle the CPU.

    Integer computation instructions are ALU instructions executed on registers and/or 12-bit immediates. Load/store instructions move data between registers and data memory. Jump/branch instructions transfer program control to different locations. Details of each instruction can be found in the RISC-V specification: RISC-V User-Level ISA v2.2. To learn the ISA, the specification document is enough; for more clarity, you can also study the RTL implementations of different open cores.

    In addition to the 37 base instructions, I added 13 pseudo/custom instructions to pqr5asm, extending the ISA to 50 instructions. These are derived from the base instructions and are intended to simplify the assembly programmer's life. For example: the NOP pseudo-instruction maps to ADDI x0, x0, 0, which of course does nothing on the CPU! But it is much simpler and clearer to read in code (see the encoding sketch below).
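    To make the encoding concrete before moving on, here is a minimal Python sketch, independent of pqr5asm, that uses only the standard RV32I I-type layout from the spec to assemble ADDI and confirm that the canonical NOP, ADDI x0, x0, 0, encodes to 0x00000013:

```python
# Minimal RV32I I-type encoder (field layout from the RISC-V spec):
# imm[31:20] | rs1[19:15] | funct3[14:12] | rd[11:7] | opcode[6:0]

def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Encode ADDI rd, rs1, imm as a 32-bit RV32I instruction word."""
    opcode = 0b0010011          # OP-IMM major opcode
    funct3 = 0b000              # ADDI
    imm12 = imm & 0xFFF         # 12-bit immediate, two's complement
    return (imm12 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

# The canonical NOP pseudo-instruction is ADDI x0, x0, 0:
nop = encode_addi(rd=0, rs1=0, imm=0)
assert nop == 0x00000013
print(f"NOP = {nop:#010x}")     # NOP = 0x00000013
```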
    Before we start designing the processor architecture, the expectation is that we fully understand how each instruction is encoded in 32-bit binary and what it does.

    The RISC-V RV32I assembler PQR5ASM, which I developed in Python, can be found on my GitHub. You can refer to the Assembler Instruction Manual to write sample assembly code. Compile it and see how it converts to 32-bit binary to consolidate and verify your understanding before moving on to the next step.

    Specifications and Architecture

    In this chapter, we define the full specification and architecture of Pequeno. So far we have only called it a 32-bit CPU; now we go into more detail to get a general idea of the architecture we are going to design. We will design a simple single-core CPU that executes one instruction at a time, in the order the instructions are fetched, but still in a pipelined manner. We will not support the RISC-V privileged specification, because we do not currently plan to run an operating system on the core, nor do we plan to support interrupts. The CPU specifications are as follows:
    32-bit CPU, single-issue, single-core.
    Classic five-stage RISC pipeline.
    Strictly in-order pipeline.
    Compliant with the RV32I user-level ISA v2.2; supports all 37 base instructions.
    Separate bus interfaces for instruction and data memory access. (Why? More on that later…)
    Suitable for bare-metal applications; no support for operating systems and interrupts. (More precisely, a limitation!)

    As mentioned above, we will support the RV32I ISA, so the CPU only supports integer operations. All registers in the CPU are 32 bits, and the address and data buses are also 32 bits. The CPU uses a classic little-endian, byte-addressed memory space: each address corresponds to one byte in the CPU address space, i.e., 0x00 - byte[7:0], 0x01 - byte[15:8], and so on. 32-bit words are accessed at 32-bit-aligned addresses, i.e., addresses that are multiples of 4: 0x00 - word 0, 0x04 - word 1, and so on.

    Pequeno is a single-issue CPU: it fetches only one instruction from memory at a time and issues it for decoding and execution. A single-issue pipelined processor has a maximum IPC = 1 (or a minimum/optimal CPI = 1); the ultimate goal is to execute at a rate of one instruction per clock cycle, which is theoretically the highest performance achievable. The classic five-stage RISC pipeline is the basic architecture for understanding any other RISC architecture, and it is the most natural and simple choice for our CPU. The architecture of Pequeno is built around this five-stage pipeline. Let's dive into the underlying concepts.

    For simplicity, we will not support timers, interrupts, or exceptions in the CPU pipeline. Therefore CSRs and privilege levels need not be implemented either, and the RISC-V privileged ISA is not part of the current implementation of Pequeno.

    The simplest way to design a CPU is the non-pipelined way. Let's look at several design approaches for non-pipelined RISC CPUs and understand their drawbacks. Assume the classic sequence of steps a CPU follows to execute an instruction: fetch, decode, execute, memory access, and write back. The first approach is to design the CPU as a finite state machine (FSM) with four or five states, one per step, performing all operations sequentially. But this architecture seriously hurts instruction execution speed, because each instruction takes multiple clock cycles.
    For example, writing back to a register takes 3 clock cycles, and for load/store instructions, memory latency adds further cycles. This is a bad, primitive way to design a CPU. Let's get rid of it completely! The second approach: the instruction can be fetched from instruction memory, decoded, and executed by fully combinational logic, with the ALU result written back to the register file; the whole process up to the writeback completes in one clock cycle. Such a CPU is called a single-cycle CPU. If the instruction needs to access data memory, read/write latency must be taken into account. If that latency is one clock cycle, a store may still execute in one cycle like every other instruction, but a load may need an extra cycle because the loaded data must be written back to the register file, and the PC generation logic must handle the effect of this latency. If the data-memory read interface is combinational (asynchronous read), the CPU becomes truly single-cycle for all instructions.

    The main disadvantage of this architecture is obviously the long combinational critical path from instruction fetch to the memory/register-file write, which limits timing performance. Still, this approach is simple and suits low-end microcontrollers where low clock speed, low power, and small area are required.

    To achieve higher clock speeds and performance, the CPU's sequential instruction processing can be split up. Each sub-process is assigned to an independent processing unit, and these units are cascaded to form a pipeline. All units work in parallel, each operating on a different part of instruction execution, so multiple instructions can be processed in parallel. This technique for achieving instruction-level parallelism is called instruction pipelining, and the execution pipeline forms the core of a pipelined CPU.

    The classic five-stage RISC pipeline has five processing units, also called pipeline stages: Instruction Fetch (IF), Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). Each clock cycle, different parts of an instruction are processed, and each stage processes a different instruction. If you look closely, you will see that instruction 1 only completes in the 5th cycle. This delay is called the pipeline latency, and it equals the number of pipeline stages. After this latency, instruction 2 completes in cycle 6, instruction 3 in cycle 7, and so on.

    In theory, running N instructions on an s-stage pipeline takes s + (N - 1) cycles, so the throughput in instructions per cycle is IPC = N / (s + N - 1), which approaches 1 as N grows. A pipelined CPU therefore approaches one instruction completed per clock cycle, the maximum IPC possible in a single-issue processor. By splitting the critical path across multiple pipeline stages, the CPU can now also run at a higher clock speed. Mathematically, this gives a pipelined CPU a multiple of the throughput of an equivalent non-pipelined CPU. This is called pipeline speedup: in simple terms, a CPU with an s-stage pipeline can run at roughly s times the clock speed of its non-pipelined counterpart. Pipelining generally increases area and power consumption, but the performance gain is worth it. The math assumes the pipeline never stalls, that is, data continues to flow from one stage to another on every clock cycle (a small numerical check follows below).
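    Here is a quick numerical sketch of the ideal pipeline math above, assuming no stalls as stated; the helper name is mine, for illustration only:

```python
def pipeline_cycles(n_instr: int, stages: int) -> int:
    """Cycles to run n_instr instructions on an ideal s-stage pipeline:
    the first instruction takes `stages` cycles to fill the pipe, then
    one instruction completes every cycle afterwards."""
    return stages + (n_instr - 1)

S = 5                                # classic five-stage RISC pipeline
for n in (10, 1_000, 1_000_000):
    cycles = pipeline_cycles(n, S)
    ipc = n / cycles                 # approaches 1.0 as n grows
    speedup = (n * S) / cycles       # vs. non-pipelined CPU (S cycles/instr)
    print(f"n={n:>9}: IPC={ipc:.4f}, speedup={speedup:.2f}x")
```

    As n grows, IPC tends to 1 and the speedup tends to the stage count s, matching the "s times" claim above.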
    But in real CPUs, pipelines can stall for a variety of reasons, chiefly structural/control/data dependencies. For example: register X cannot be read by the Nth instruction because X has not yet been written back by the (N-1)th instruction; this is an example of a data hazard in the pipeline. The Pequeno architecture uses the classic five-stage RISC pipeline, implemented strictly in order. In an in-order processor, instructions are fetched, decoded, executed, and completed/committed in the order generated by the compiler; if one instruction stalls, the entire pipeline stalls. In an out-of-order processor, instructions are fetched and decoded in compiler order, but execution can proceed in a different order: a stalled instruction does not stall subsequent instructions unless there are dependencies, so independent instructions can move ahead, while completion/commit can still happen in order (as in most CPUs today). This opens the door to a variety of architectural techniques that significantly improve throughput and performance by reducing clock cycles wasted on stalls and minimizing the insertion of bubbles (what are "bubbles"? read on…). Out-of-order processors are fairly complex due to the dynamic scheduling of instructions, but they are now the de facto pipeline architecture in today's high-performance CPUs.

    The five pipeline stages are designed as independent units: Fetch Unit (FU), Decode Unit (DU), Execution Unit (EXU), Memory Access Unit (MACCU), and Write Back Unit (WBU).

    Fetch Unit (FU): the first stage of the pipeline; interfaces with the instruction memory. The FU fetches instructions from instruction memory and sends them to the Decode Unit. The FU may contain instruction buffers, initial branch logic, etc.
    Decode Unit (DU): the second stage of the pipeline, responsible for decoding instructions from the Fetch Unit (FU). The DU also initiates read accesses to the register file. Packets from the DU and the register file are retimed and sent together to the Execution Unit.
    Execution Unit (EXU): the third stage of the pipeline, which validates and executes all decoded instructions from the DU. Invalid/unsupported instructions are not allowed to continue in the pipeline and become "bubbles". The Arithmetic Logic Unit (ALU) handles all integer arithmetic and logical instructions; the Branch Unit handles jump/branch instructions; the Load/Store Unit handles load/store instructions that require memory access.
    Memory Access Unit (MACCU): the fourth stage of the pipeline, which interfaces with the data memory and initiates all memory accesses based on instructions from the EXU. The data memory is an address space that may consist of data RAM, memory-mapped I/O peripherals, bridges, interconnects, etc.
    Write Back Unit (WBU): the fifth and last stage of the pipeline, where instructions complete execution. The WBU writes the result (or load data) from the EXU/MACCU back to the register file.

    Between the pipeline stages, a valid-ready handshake is implemented. This is not so obvious at first glance. Each stage registers a data packet and sends it to the next stage. This packet may be instruction/control/data information used by the next stage or subsequent stages, and it is qualified by a valid signal. If the packet is invalid, it is called a bubble in the pipeline.
    A bubble is nothing more than a "hole" in the pipeline that moves forward without performing any real operation, similar to a NOP instruction. But don't think bubbles are useless! We will see one use for them in a later section when discussing pipeline hazards. (The original post includes a table defining the bubbles in the Pequeno instruction pipeline.)

    Each stage can also stall the previous stage by asserting a stall signal. Once stalled, a stage retains its data packet until the stall condition disappears. This signal is simply the inverted ready signal. In an in-order processor, a stall at any stage is effectively a global stall, since it eventually stalls the entire pipeline.

    The flush signal is used to flush the pipeline. A flush invalidates, at once, all packets registered by the earlier stages, because they have been identified as no longer useful. For example, after executing a jump/branch instruction, the pipeline may have fetched and decoded instructions from the wrong path, something only identified as an error in the execute stage; in that case the pipeline should be flushed and fetching should resume from the correct branch target!

    Although pipelining significantly improves performance, it also increases the complexity of the CPU architecture. The pipelining of a CPU is always accompanied by its evil twin: pipeline hazards! For now, let's assume we know nothing about pipeline hazards; we didn't consider them when designing the architecture.

    Dealing with Pipeline Hazards

    In this chapter, we explore pipeline hazards. Last time, we successfully designed a pipelined architecture for the CPU, but we didn't consider the "evil twin" that comes with pipelines. What impact can pipeline hazards have on the architecture? What architectural changes are needed to mitigate them? Let's go ahead and demystify them!

    Hazards in a CPU instruction pipeline are dependencies that interfere with normal pipelined execution. When a hazard occurs, an instruction cannot execute in its designated clock cycle, because doing so could produce incorrect results or control flow. Therefore, the pipeline may be forced to stall until the instruction can execute successfully.

    Consider an example: the CPU executes instructions in compiler order, and instruction i2 has some dependency on i1, say i2 must read a register that is still being modified by the earlier instruction i1. Then i2 must wait until i1 writes its result back to the register file; otherwise stale data would be read from the register file for the execute stage to use. To avoid this data inconsistency, i2 is forced to stall for three clock cycles, with bubbles inserted into the pipeline representing the stall or wait state. i2 is decoded only when i1 completes, and it finally completes execution in the 10th clock cycle instead of the 7th: a three-cycle delay caused by the stall from the data dependency. How does this delay affect CPU performance? Ideally, we expect the CPU to run at full throughput, i.e., CPI = 1. When the pipeline stalls, the throughput/performance of the CPU drops because the CPI rises. For a non-ideal CPU: CPI = 1 + (average stall cycles per instruction).

    Hazards can arise in the pipeline in several ways.
    Pipeline hazards fall into three categories: structural hazards, control hazards, and data hazards.

    Structural hazards arise from hardware resource conflicts, i.e., two pipeline stages wanting to access the same resource in the same clock cycle. For example: two instructions need to access memory in the same cycle. If the CPU has only one memory for both instructions and data, the fetch stage accesses that memory every cycle to fetch the next instruction, so the instruction in the fetch stage may conflict with an earlier instruction in the memory-access stage that also needs the memory. This forces the CPU to insert stall cycles: the fetch stage must wait until the instruction in the memory-access stage releases the resource (the memory). Ways to mitigate structural hazards include:
    Stalling the pipeline until the resource is available.
    Duplicating the resource so that no conflict can occur.
    Pipelining the resource so that the two instructions occupy different stages of the pipelined resource.
    Let's analyze the situations that can cause structural hazards in Pequeno's pipeline and how to solve them. We do not intend to use stalling as an option to mitigate structural hazards! In Pequeno's architecture, we applied the above solutions to mitigate the various structural hazards.

    Control hazards are caused by jump/branch instructions, the flow-control instructions of the CPU ISA. When control reaches a jump/branch instruction, the CPU must decide what to fetch next: the next instruction at PC+4 (branch not taken) or the instruction at the branch target address (branch taken). Whether the decision was correct is known only when the execute stage computes the result of the branch instruction. Depending on taken/not-taken, the branch address (the address the CPU should branch to) is determined. If the earlier decision was wrong, all instructions fetched and decoded in the pipeline since then must be discarded, because they should never have executed at all! This is achieved by flushing the pipeline and fetching from the branch address on the next clock cycle. Flushing invalidates those instructions, converting them to NOPs or bubbles, and costs a number of clock cycles as a penalty: the branch penalty. Control hazards therefore have the worst impact on CPU performance.

    In the example, i10 completes execution in the 10th clock cycle instead of the 7th: three clock cycles were lost because the wrong path after the branch instruction (i5) was fetched. When the execute stage identifies the wrong branch in the 4th clock cycle, the pipeline must be flushed. How does this affect CPU performance? If a program running on this CPU contains 30% branch instructions, then with a 3-cycle branch penalty the CPI becomes CPI = 1 + 0.30 × 3 = 1.9 ≈ 2. CPU performance is roughly halved!

    To mitigate control hazards, we can adopt some strategies in the architecture. One: if an instruction is identified as a branch, simply stall the pipeline; this decode logic can be implemented in the fetch stage itself.
    Once the branch instruction is executed and the branch address is resolved, the next instruction can be fetched and the pipeline resumed. Two: add dedicated branch logic, such as branch prediction, in the fetch stage. The essence of branch prediction: prediction logic in the instruction fetch stage guesses whether the branch will be taken, and the next clock cycle fetches accordingly, either from PC+4 (predicted not taken) or from the branch target address (predicted taken). Now there are two possibilities: if the prediction turns out correct in the execute stage, nothing is done and the pipeline continues processing; if it turns out wrong, the pipeline is flushed and the correct instruction is fetched from the branch address resolved in the execute stage, incurring the branch penalty. As you can see, branch prediction still incurs a branch penalty when it mispredicts, so the design goal should be to reduce the probability of misprediction; CPU performance depends a great deal on how "good" the prediction algorithm is. Sophisticated techniques like dynamic branch prediction keep instruction history in order to predict correctly with 80% to 90% probability. To mitigate control hazards in Pequeno, we will implement simple branch prediction logic. More details will be revealed in the upcoming section on the design of the Fetch Unit.

    A data hazard occurs when the execution of an instruction has a data dependency on the result of a previous instruction still being processed in the pipeline. Let's look at the three types of data hazards with examples to better understand the concept.

    Suppose instruction i1 writes a result to register x, and the next instruction i2 also writes a result to the same register. Any subsequent instruction in program order should read the result of i2 in x; otherwise data integrity is compromised. This output dependency can lead to a WAW (Write-After-Write) data hazard.

    Suppose instruction i1 reads register x, and the next instruction i2 writes a result to the same register. i1 should read the old value of x, not the result of i2; if i2 writes its result to x before i1 reads it, a data hazard results. This anti-dependency can lead to a WAR (Write-After-Read) data hazard.

    Suppose instruction i1 writes a result to register x, and the next instruction i2 reads the same register. i2 should read the value written by i1, not the previous value. This true dependency can lead to a RAW (Read-After-Write) data hazard, the most common and dominant type of data hazard in pipelined CPUs.

    To mitigate data hazards in in-order CPUs, several techniques exist. Stall the pipeline when a data dependency is detected: the decode stage waits until the earlier instruction has executed. Compiler rescheduling: the compiler reorders the code, scheduling the dependent instruction to execute later, to avoid the hazard; the idea is to avoid stalls without affecting the integrity of the program control flow, though this is not always possible. The compiler can also insert a NOP instruction between two dependent instructions, but that causes stalls, which cost performance.
    Data/Operand Forwarding: this is the prominent architectural solution for mitigating RAW data hazards in in-order CPUs. Let's analyze the CPU pipeline to understand the principle behind it. Suppose two adjacent instructions i1 and i2 have a RAW data dependency because both access register x. The CPU should stall i2 until i1 writes its result back to register x. If the CPU has no stall mechanism, i2 will read a stale value of x in the decode stage in the third clock cycle, and in the fourth clock cycle it will execute with the wrong value of x.

    Look closely at the pipeline, though: we already have i1's result in the third clock cycle. It has not been written back to the register file, but it is available at the output of the execute stage. So if we can somehow detect the data dependency and "forward" that data to the input of the execute stage, the next instruction can use the forwarded data instead of the stale data from the decode stage. That way, the data hazard is mitigated! This is called data/operand forwarding (or data/operand bypassing): we forward the data forward in time so that a later dependent instruction in the pipeline can use the bypassed data in the execute stage.

    The idea extends to different stages. In a 5-stage pipeline executing instructions in the order i1, i2, …, in, data dependencies may require:
    i1 and i2: bypass from the execute stage to the decode-stage output.
    i1 and i3: bypass from the memory-access stage to the decode-stage output.
    i1 and i4: bypass from the writeback stage to the decode-stage output.
    This is the architectural solution for mitigating RAW data hazards originating from any stage of the pipeline.

    Now consider the following scenario: two adjacent instructions i1 and i2 have a data dependency, and the first instruction is a load. This is a special case of a data hazard. Here, we cannot execute i2 until the data has been loaded into x1. So, can we still mitigate this hazard with data forwarding? The load data is available only in i1's memory-access stage, and it would have to be forwarded to i2's decode-stage output one cycle earlier. Assuming the load data is available in the memory-access stage in cycle 4, you would need to "forward" it back to cycle 3, to the decode-stage output of i2 (why cycle 3? because in cycle 4, i2 is already in the execute stage!). Essentially, you would be forwarding present data into the past, which is impossible unless your CPU can time-travel! That is not data forwarding but "data backtracking". Data forwarding can only be done forward in time.

    This data hazard is called a pipeline interlock. The only way to solve it is to insert a bubble, stalling the pipeline for one clock cycle when the dependency is detected. A NOP instruction (aka bubble) is inserted between i1 and i2, delaying i2 by one cycle so that forwarding can now deliver the load data from the memory-access stage to the decode-stage output.

    So far, we have only discussed how to mitigate RAW data hazards. What about WAW and WAR hazards? The RISC-V architecture, implemented as an in-order pipeline, is inherently immune to WAW and WAR hazards!
    All register writebacks happen in the order instructions were issued, so data written back is always overwritten by a subsequent instruction writing the same register: WAW hazards never occur! Writeback is the last stage of the pipeline, and by the time a writeback happens, the earlier reading instruction has already completed execution with the older data: WAR hazards never occur! To mitigate RAW data hazards in Pequeno, we will implement data forwarding in hardware, with pipeline-interlock protection. More details will be revealed later, when we design the data forwarding logic.

    We have now analyzed the various pipeline hazards that can cause instruction execution failures in the CPU architecture, and designed solutions and mechanisms to mitigate them. Let's put together the necessary microarchitecture and finally design the architecture of the Pequeno RISC-V CPU to be free of all types of pipeline hazards! In the following sections, we will dive into the RTL design of each pipeline stage/functional unit and discuss the various microarchitectural decisions and challenges along the way.

    Fetch Unit

    From here, we start to dive into the microarchitecture and RTL design! In this chapter, we build and design the Fetch Unit (FU) of Pequeno. The Fetch Unit is the first stage of the CPU pipeline, the one that interacts with the instruction memory: it fetches instructions from the instruction memory and sends them to the Decode Unit (DU). As discussed in the earlier section on the improved architecture of Pequeno, the FU contains branch prediction logic and flush support.

    1 Interfaces

    Let's define the interfaces of the Fetch Unit.

    2 Instruction Access Interfaces

    The core function of the FU is instruction access, served by the instruction access interfaces (I/F). Instructions are stored in the instruction memory (RAM) during execution. Modern CPUs fetch instructions from a cache rather than directly from the instruction memory: the instruction cache (the primary or L1 cache, in computer-architecture terms) is closer to the CPU and enables faster instruction access by caching frequently accessed instructions and prefetching larger blocks of nearby instructions. Therefore, there is no need to constantly access the slower main memory (RAM), and most instructions can be accessed quickly from the cache. The CPU does not interface with the instruction cache/memory directly; a cache/memory controller sits between them to control memory access.

    It is a good idea to define a standard interface so that any standard instruction memory/cache (IMEM) can be easily plugged into our CPU with little or no glue logic. Let's define two interfaces for instruction access: a request interface (Request I/F) that carries requests from the FU to the instruction memory, and a response interface (Response I/F) that carries responses from the instruction memory back to the FU. We will define both as simple valid-ready interfaces, since these are easy to convert to bus protocols such as APB, AXI, etc., if necessary.

    Instruction access requires knowing the address of the instruction in memory: the address issued on the Request I/F is simply the PC generated by the FU.
    In the FU interface, we use a stall signal instead of the ready signal; it behaves as the inverse of ready. A cache controller usually has a stall signal to hold off requests from the processor; this signal is cpu_stall. The response from memory is the fetched instruction, received on the Response I/F. In addition to the fetched instruction, the response also carries the corresponding PC. The PC acts as an ID identifying the request to which a response has been received; in other words, it indicates the address of the instruction that has been fetched. This is important information required by the next stage of the CPU pipeline (how is it used? we will see soon!). The fetched instruction and its PC together constitute the response packet to the FU. When the internal pipeline is stalled, the CPU may also need to stall the response from the instruction memory; that signal is mem_stall. At this point, let's define instruction packet = {instruction, PC} in the CPU pipeline.

    3 PC Generation Logic

    The core of the FU is the PC generation logic driving the Request I/F. Since we are designing a 32-bit CPU, the PC advances in increments of 4. Out of reset, a PC is generated every clock cycle. The reset value of the PC can be hard-coded: it is the address from which the CPU fetches and executes instructions after reset, i.e., the address of the first instruction in memory. PC generation is free-running logic that is stalled only by cpu_stall; the free-running PC can be overridden by the Flush I/F and by the internal branch prediction logic.

    4 Instruction Buffers

    There are two back-to-back instruction buffers inside the FU. Buffer 1 buffers instructions fetched from the instruction memory and connects directly to the Response I/F. Buffer 2 buffers instructions from Buffer 1 and sends them to the DU over the DU I/F. Together, these two buffers constitute the instruction pipeline inside the FU.

    5 Branch Prediction Logic

    As discussed above, we must add branch prediction logic in the FU to mitigate control hazards. We will implement a simple, static branch prediction algorithm (a behavioral sketch follows below). The main content of the algorithm:
    Always take unconditional jumps.
    If the branch instruction is a backward jump, take it. The instruction is likely part of the loop-exit check of some do-while loop, in which case taking the branch is more often correct.
    If the branch instruction is a forward jump, do not take it. The instruction is likely (1) part of the loop-entry check of some for or while loop, where falling through to the next instruction is more often correct, or (2) part of some if-else statement, where we always assume the if-condition is true and continue with the next instruction. Theoretically, this bet is right about 50% of the time.

    The instruction packet in Buffer 1 is monitored and analyzed by the branch prediction logic, which generates a branch prediction signal: branch_taken. This prediction signal is then registered and transmitted synchronously with the instruction packet sent to the DU.
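    Here is the promised behavioral sketch of that static algorithm, in plain Python for illustration; it is not the Pequeno RTL, and the function and argument names are mine:

```python
def predict_taken(is_jump: bool, is_branch: bool, pc: int, target: int) -> bool:
    """Static BTFN (backward-taken, forward-not-taken) prediction,
    as described above. Unconditional jumps are always taken."""
    if is_jump:
        return True          # always take unconditional jumps
    if is_branch:
        return target < pc   # backward branch (e.g. do-while exit check): taken
    return False             # not a control-flow instruction

# A backward branch closing a loop is predicted taken:
assert predict_taken(False, True, pc=0x100, target=0x0F0)
# A forward branch (e.g. an if-else skip) is predicted not taken:
assert not predict_taken(False, True, pc=0x100, target=0x120)
```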
    The branch prediction signal is sent to the DU through the DU interface.

    6 DU Interface

    This is the main interface between the Fetch Unit and the Decode Unit for sending the payload, which contains the fetched instruction and branch prediction information. Since this is an interface between two pipeline stages of the CPU, a valid-ready I/F is implemented. (The original post tabulates the signals that constitute the DU I/F.)

    Earlier we discussed the concepts of stall and flush in the CPU pipeline and their importance, along with the various scenarios in the Pequeno architecture that require a stall or a flush. Proper stall and flush logic must therefore be integrated into each pipeline stage of the CPU. It is crucial to determine at which stage a stall or flush is required, and which part of the logic in that stage needs to be stalled or flushed. Some initial ground rules before implementing the stall and flush logic:
    A pipeline stage may be stalled by externally or internally generated conditions.
    A pipeline stage may be flushed by externally or internally generated conditions.
    There is no centralized stall or flush generation logic in Pequeno; each stage may have its own stall and flush generation logic.
    A stage in the pipeline can only be stalled by the next stage. Any stage stalling eventually back-pressures the upstream pipeline and stalls the entire pipeline.
    Any stage in the downstream pipeline can flush an upstream stage. This is called a pipeline flush, because the entire upstream pipeline needs to be flushed at the same time. In Pequeno, a pipeline flush is required only for branch misses in the Execution Unit (EXU).

    The stall logic contains logic to generate local and external stalls; the flush logic contains logic to generate local and pipeline flushes. A local stall is generated internally and used locally to hold the current stage. An external stall is generated internally and sent externally to the next stage of the upstream pipeline. Both local and external stalls are generated from internal conditions and from the external stall of the next stage of the downstream pipeline. A local flush is generated internally and flushes the local stage; an external flush, or pipeline flush, is generated internally and sent externally to the upstream pipeline, flushing all upstream stages simultaneously. Both are generated from internal conditions.

    Only the DU can stall the FU externally. When the DU asserts stall, the FU's internal instruction pipeline (Buffer 1 -> Buffer 2) must stop immediately, and since the FU can no longer receive packets from the IMEM, it must also assert mem_stall toward the IMEM. Depending on the pipeline/buffer depth inside the IMEM, the PC generation logic may also eventually be stalled by cpu_stall from the IMEM, since the IMEM can accept no more requests. There are no internal conditions in the FU that cause a local stall.

    Only the EXU can flush the FU externally. The EXU initiates branch_flush in the CPU instruction pipeline, passing the address of the next instruction to fetch after the pipeline is flushed (branch_pc). The FU provides a flush interface (Flush I/F) to accept the external flush; Buffer 1, Buffer 2, and the PC generation logic in the FU are flushed by branch_flush. The signal branch_taken from the branch prediction logic also acts as a local flush for Buffer 1 and the PC generation logic. If the branch is predicted taken, the next instruction should be fetched from the branch-predicted PC.
6. DU I/F

This is the main interface between the Fetch Unit and the Decode Unit for sending the payload. The payload contains the fetched instruction and the branch prediction information. Since this is an interface between two pipeline stages of the CPU, a valid-ready I/F is implemented; the DU I/F thus consists of the valid/ready handshake plus the instruction packet and the branch prediction signal.

In the previous blog post, we discussed the concepts of stall and flush in the CPU pipeline and why they matter, along with the various scenarios in the Pequeno architecture that require a stall or a flush. Proper stall and flush logic must therefore be integrated into each pipeline stage of the CPU. It is crucial to determine at which stage a stall or a flush is required, and which part of the logic in that stage has to be stalled or flushed. Some initial thoughts before implementing the stall and flush logic:

Pipeline stages may be stalled by externally or internally generated conditions.
Pipeline stages may be flushed by externally or internally generated conditions.
There is no centralized stall or flush generation logic in Pequeno; each stage may have its own stall and flush generation logic.
A stage in the pipeline can be stalled only by the next stage. Any stage's stall eventually propagates to the upstream pipeline and can stall the entire pipeline.
A stage can be flushed by any stage downstream of it. This is called a pipeline flush, because the entire upstream pipeline has to be flushed at the same time. In Pequeno, a pipeline flush is required only on a branch miss in the Execution Unit (EXU).

The stall logic contains the logic that generates local and external stalls; the flush logic contains the logic that generates local and pipeline flushes. A local stall is generated internally and used locally to stall the current stage. An external stall is generated internally and sent out to the next stage of the upstream pipeline. Both local and external stalls are generated from internal conditions and from the external stall of the next stage of the downstream pipeline. A local flush is generated internally and used to flush the local stage. An external flush, or pipeline flush, is generated internally and sent out to the upstream pipeline; it flushes all upstream stages simultaneously. Both local and external flushes are generated from internal conditions.

Only the DU can stall the FU externally. When the DU asserts its stall, the internal instruction pipeline of the FU (Buffer-1 -> Buffer-2) should stall immediately, and since the FU can no longer accept packets from the IMEM, it should also assert mem_stall to the IMEM. Depending on the pipeline/buffer depth in the IMEM, the PC generation logic may eventually be stalled as well, by cpu_stall from the IMEM, because the IMEM cannot accept any more requests. There are no internal conditions in the FU that cause a local stall.

Only the EXU can flush the FU externally. The EXU initiates branch_flush in the CPU instruction pipeline and passes the address of the next instruction to be fetched after the pipeline flush (branch_pc). The FU provides a flush interface (Flush I/F) to accept the external flush. Buffer-1, Buffer-2, and the PC generation logic in the FU are flushed by branch_flush. The signal branch_taken from the branch prediction logic also acts as a local flush to Buffer-1 and the PC generation logic. If a branch is predicted taken, the next instruction should be fetched from the predicted branch PC: the PC generation logic is therefore flushed, the next PC becomes the predicted PC, and the next instruction in Buffer-1 is flushed and invalidated, i.e., a NOP/bubble is inserted.

Wondering why Buffer-2 is not flushed by branch_taken? Because the branch instruction from Buffer-1 (the one responsible for generating the flush) should be buffered into Buffer-2 in the next clock cycle and allowed to continue down the pipeline. That instruction must not be flushed! The instruction memory pipeline should also be flushed appropriately: the IMEM flush, mem_flush, is generated from branch_flush and branch_taken.
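A minimal Verilog sketch of this flush behavior follows. The buffer handshake is heavily simplified, and all names (du_stall, imem_resp_valid, buf1_valid, buf2_valid) are illustrative assumptions:

    // The IMEM pipeline is flushed on either flush source.
    assign mem_flush = branch_flush | branch_taken;

    always @(posedge clk or negedge rstn) begin
       if (!rstn) begin
          buf1_valid <= 1'b0;
          buf2_valid <= 1'b0;
       end
       else if (branch_flush) begin
          buf1_valid <= 1'b0;              // pipeline flush: kill both buffers
          buf2_valid <= 1'b0;
       end
       else if (!du_stall) begin
          // Local flush: insert a bubble into Buffer-1 only; the branch itself
          // still moves from Buffer-1 to Buffer-2 and continues execution.
          buf1_valid <= branch_taken ? 1'b0 : imem_resp_valid;
          buf2_valid <= buf1_valid;
       end
    end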
Let's integrate all the microarchitecture designed so far to complete the architecture of the Fetch Unit. OK, everyone! We have successfully designed the Fetch Unit of Pequeno. Next, we design the Decode Unit (DU) of Pequeno.

Decode Unit

The Decode Unit (DU) is the second stage of the CPU pipeline. It is responsible for decoding instructions from the Fetch Unit (FU) and sending them to the Execution Unit (EXU). It is also responsible for decoding the register addresses and sending them to the register file for the register read operation. Let's define the interface of the Decode Unit.

The FU interface is the main interface between the Fetch Unit and the Decode Unit for receiving the payload. The payload contains the fetched instruction and the branch prediction information. This interface was discussed in the previous section.

The EXU interface is the main interface between the Decode Unit and the Execution Unit for sending the payload. The payload consists of the decoded instruction with its branch prediction information, plus the decoded data.

Decoded data is the important information that the DU extracts from the fetched instruction and sends to the EXU. Let's understand what information the EXU needs to execute an instruction.

Opcode, funct3, funct7: identify the operation the EXU has to perform on the operands.
Operands: depending on the opcode, an operand can be register data (rs1, rs2), the register address for writeback (rdt), or a 12-bit/20-bit immediate value.
Instruction type: identifies which operands/immediate value must be processed.

The decoding process can be tricky. If you understand the ISA and the instruction structure correctly, you can recognize the patterns of the different instruction types, and recognizing these patterns helps in designing the decoding logic of the DU. The information above is decoded and sent to the EXU via the EXU I/F, and the EXU uses it to demultiplex the data to the appropriate execution subunit and execute the instruction.

For R-type instructions, the source registers rs1 and rs2 must be decoded and read; the data read from these registers are the operands. All general-purpose user registers sit in the register file outside the DU. The DU uses the register file interface to send the addresses of rs1 and rs2 to the register file for register access. The data read from the register file should reach the EXU in the same clock cycle as the payload.

The register file takes one cycle to read a register, and the DU also takes one cycle to register the payload sent to the EXU. Therefore, the source register addresses are decoded directly from the FU instruction packet by combinational logic. This keeps 1) the payload from the DU to the EXU and 2) the data from the register file to the EXU synchronized in time.
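The field extraction itself is simple combinational logic. Below is a minimal Verilog sketch using the standard RV32 instruction layout; instr and the rf_* outputs are illustrative names:

    // Fixed field positions in every RV32 instruction word.
    wire [6:0] opcode = instr[6:0];
    wire [4:0] rdt    = instr[11:7];     // destination register address
    wire [2:0] funct3 = instr[14:12];
    wire [4:0] rs1    = instr[19:15];    // source register 1 address
    wire [4:0] rs2    = instr[24:20];    // source register 2 address
    wire [6:0] funct7 = instr[31:25];

    // The source register addresses go to the register file directly,
    // un-registered, so the register read data and the registered DU
    // payload reach the EXU in the same clock cycle.
    assign rf_rs1_addr = rs1;
    assign rf_rs2_addr = rs2;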
Only the EXU can stall the DU externally. When the EXU asserts its stall, the internal instruction pipeline of the DU should stall immediately, and the DU should also assert stall to the FU, because it can no longer accept packets from the FU. For synchronous operation, the register file should be stalled together with the DU, because both sit at the same stage of the CPU's five-stage pipeline; the DU therefore forwards the external stall from the EXU to the register file. There are no internal conditions in the DU that cause a local stall.

Only the EXU can flush the DU externally. The EXU initiates branch_flush in the CPU instruction pipeline and passes the address of the next instruction to be fetched after the pipeline flush (branch_pc). The DU provides a flush interface (Flush I/F) to accept the external flush, and the internal pipeline is flushed by branch_flush. The branch_flush from the EXU should invalidate the DU instruction headed to the EXU immediately, with 0-cycle latency, to avoid a potential control hazard in the EXU in the next clock cycle. In the design of the Fetch Unit, we did not invalidate the FU instruction with 0-cycle latency on receiving branch_flush: the DU is flushed in the next clock cycle anyway, so no control hazard can arise in the DU, and there is no need to invalidate the FU instruction. The same reasoning applies to the instructions from the IMEM to the FU.

The flowchart above shows how the instruction packets and branch prediction data from the FU are buffered in the instruction pipeline of the DU. Only a single level of buffering is used in the DU. Let's integrate all the microarchitecture designed so far to complete the architecture of the Decode Unit. We have now completed the Fetch Unit (FU) and the Decode Unit (DU). In the next section, we will design the register file of Pequeno.

Register File

In a RISC-V CPU, the register file is a key component: a set of general-purpose registers used to store data during execution. The Pequeno CPU has 32 32-bit general-purpose registers (x0-x31). Register x0 is the zero register. It is hardwired to the constant value 0, providing a useful default that can be used with other instructions; for example, to initialize another register to 0, just execute mv x1, x0. Registers x1-x31 are general-purpose registers used to hold intermediate data, addresses, and the results of arithmetic or logical operations. In the CPU architecture designed in the previous article, the register file requires two access interfaces.

The read access interface is used to read the registers at the addresses sent by the DU. Some instructions (such as ADD) require two source register operands, rs1 and rs2, so the read access interface (I/F) needs two read ports to read two registers at the same time. The read access should be a single-cycle access, so that the read data reaches the EXU in the same clock cycle as the payload from the DU; this keeps the read data and the DU payload synchronized in the pipeline. The write access interface is used by the WBU to write the execution result back to the register at the address it sends. Only one destination register, rdt, is written at the end of execution, so one write port is sufficient, and the write access should also be single-cycle.

Since the DU and the register file have to stay synchronized at the same stage of the pipeline, they should always be stalled together (why? Check the block diagram in the previous section!). For example, if the DU is stalled, the register file should not keep driving new read data to the EXU, because this would corrupt the pipeline; in this case, the register file should be stalled as well. This is ensured by inverting the stall signal of the DU to generate the read_enable of the register file: when the stall is asserted, read_enable is driven low and the previous data is held at the read data output, effectively stalling the register file. Since the register file does not send any instruction packets to the EXU, it does not need any flush logic; flushing only has to be handled inside the DU.

In summary, the register file is designed with two independent read ports and one write port. Both read and write accesses are single-cycle, and the read data is registered. The final architecture follows, with a minimal RTL sketch at the end of this section. We have now completed: the Fetch Unit (FU), the Decode Unit (DU), and the register file. Please stay tuned for the next part.
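Here is that sketch of the register file as described above: two read ports, one write port, single-cycle accesses, registered read data, and x0 hardwired to zero. All signal names are illustrative, not the actual Pequeno RTL:

    reg [31:0] regs [1:31];              // x1..x31; x0 is not stored
    reg [31:0] rs1_data, rs2_data;       // registered read data outputs

    // Single write port, driven by the WBU; writes to x0 are ignored.
    always @(posedge clk) begin
       if (wr_en && (rdt_addr != 5'd0))
          regs[rdt_addr] <= wr_data;
    end

    // Two read ports. rd_en = ~du_stall, so a DU stall freezes the
    // outputs and the previously read data is held.
    always @(posedge clk) begin
       if (rd_en) begin
          rs1_data <= (rs1_addr == 5'd0) ? 32'd0 : regs[rs1_addr];
          rs2_data <= (rs2_addr == 5'd0) ? 32'd0 : regs[rs2_addr];
       end
    end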
    - May 11, 2025
  • RUNIC launches the first voltage-following LDO RS3011-Q1
The voltage-following low-dropout linear regulator (LDO) is a special type of LDO whose output voltage (VOUT) directly tracks, or "follows", its reference input voltage (VREF); that is, VOUT = VREF. This type of LDO is typically used to power off-board sensors or other off-board modules in automobiles, ensuring the accuracy and stability of the sensor's operating voltage and preventing damage to on-board components from shorts to ground or to the battery caused by cable breakage.

The RS3011-Q1 is built on an automotive-grade process with maximized safety margins. It withstands input voltages from -40 V to 45 V and provides up to 300 mA of output current, meeting the high-reliability requirements of automotive electronics. Its main features are as follows:
- Input voltage rating of -40 V to 45 V;
- Adjustable output voltage range of 1.5 V to 40 V;
- Quiescent current of 80 µA under light load;
- Maximum output current capability of 300 mA;
- Maximum tracking error of 4 mV;
- High power-supply ripple rejection: PSRR of 78 dB @ 100 Hz;
- Low dropout voltage: 280 mV @ 200 mA;
- Reverse-polarity protection;
- Output short-to-ground and output short-to-supply protection;
- Output-pin inductive-load clamp protection;
- Over-temperature and over-current protection.

Typical application circuit: the voltage-follower LDO is designed mainly for special protection circuits in automotive electronics, but it can also be used in other power systems that need a real-time adjustable supply. For example, by letting a DAC drive the ADJ pin directly with the required voltage, a high-precision, real-time adjustable power supply can be built. The RS3011-Q1 has an internal back-to-back PMOS topology, which achieves reverse-polarity protection without external diodes, making it particularly suitable for circuits that require reverse-connection and reverse-current protection. The RS3011-Q1 comes in a standard ESOP8 package, fully compatible with commonly used parts on the market, and is offered in versions with and without an enable pin to suit different product designs. Automotive-grade qualification of the RS3011-Q1 is in progress, and samples are in stock; you are welcome to request samples for comparative testing.
    - May 09, 2025
  • How to choose parameters for operational amplifiers?
The meanings of the operational-amplifier parameters encountered in practice are recorded here. Recently, while using a PGA, I found that there was always a rectangular-wave signal in the output even with the PGA input grounded. After 1000x amplification it was very obvious, and I suspected interference from the power supply. At first, 100 µF and 0.1 µF capacitors were added to both the positive and negative power inputs, but the effect was not significant. Next we planned to put a resistor in series with the power input. Initially we chose a 1 kΩ resistor, but after powering on, the chip could not work at all; we measured the supply voltage across the chip and found it was only slightly above 0 V. At that point I checked the quiescent current in the datasheet and found it was actually 5 mA. The PGA is powered from 5 V, so if the PGA drew its normal current, the voltage dropped across the 1 kΩ resistor would reach 5 V. A 50 Ω resistor was therefore used instead, together with the 100 µF and 0.1 µF capacitors, to form a low-pass filter. With this, the chip worked normally and the output ripple was also much smaller (the arithmetic is worked out below).

When choosing an operational amplifier, you should know your design requirements and search the op-amp parameter tables for them. Generally, the issues to consider in a design include:
1. Supply voltage and supply mode;
2. Package;
3. Feedback type: VFA (voltage-feedback amplifier) or CFA (current-feedback amplifier);
4. Bandwidth;
5. Offset voltage and bias current;
6. Temperature drift;
7. Slew rate;
8. Input impedance;
9. Output drive capability;
10. Quiescent power consumption, i.e., the ICC current;
11. Noise;
12. Settling time when driving the load, etc.

Offset voltage and input bias current: in precision circuit design, offset voltage is a key factor, and often-overlooked parameters such as offset voltage drift over temperature and voltage noise must also be evaluated. A precision amplifier typically requires an input offset voltage below 200 µV, input voltage noise below 6 nV/√Hz, and offset voltage drift over temperature below 1 µV/°C. A low offset voltage is important in high-gain designs, because the amplified offset appears at the output and can consume a large part of the output swing. Temperature sensing and strain measurement circuits are application examples that use precision amplifiers.

Low input bias current is sometimes necessary. The amplifier in a light-receiving system must have both low offset voltage and low input bias current: the dark current of a photodiode is on the order of pA, so the amplifier must have an even smaller input bias current. CMOS- and JFET-input amplifiers currently offer the lowest input bias currents. Because I am currently using a photodetector system for data collection, I pay particular attention to offset voltage and bias current; for other needs, other parameters deserve more consideration.
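For reference, the arithmetic behind the two resistor choices above, using the 5 mA quiescent current, 5 V supply, and 100 µF capacitor from the text:

\[ V_{drop} = I_Q R = 5\,\mathrm{mA} \times 1\,\mathrm{k\Omega} = 5\,\mathrm{V} \]

which eats the entire supply, so the chip could not start; with the smaller resistor,

\[ V_{drop} = 5\,\mathrm{mA} \times 50\,\Omega = 0.25\,\mathrm{V}, \qquad f_c = \frac{1}{2\pi RC} = \frac{1}{2\pi \times 50\,\Omega \times 100\,\mathrm{\mu F}} \approx 32\,\mathrm{Hz} \]

so only 0.25 V is lost while supply ripple above a few tens of hertz is attenuated.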
1. Input Offset Voltage (VIO)
The input offset voltage is defined as the compensating voltage that must be applied between the two input terminals to drive the output voltage of the op amp to zero. It reflects the symmetry of the op amp's internal circuit: the better the symmetry, the smaller the input offset voltage. Input offset voltage is a very important indicator, especially for precision op amps or for DC amplification.

2. Input Offset Voltage Drift
The temperature drift (also called the temperature coefficient) of the input offset voltage is defined as the ratio of the change in input offset voltage to the change in temperature over a given temperature range. This parameter supplements the input offset voltage and makes it easy to calculate the temperature-induced drift of an amplifier circuit over its operating range. The input offset voltage drift of general-purpose op amps is around ±10 to ±20 µV/°C, while that of precision op amps is below ±1 µV/°C.

3. Input Bias Current (IB)
An op amp also draws an input bias current IB, the DC current into the bases of the input transistors of the first amplifier stage. This current keeps the amplifier in its linear range by establishing its DC operating point. The input bias current is defined as the average of the bias currents at the two input terminals when the output DC voltage is zero. It has a significant impact wherever input impedance matters, such as high-impedance signal amplification and integrator circuits. The input bias current depends on the manufacturing process: for bipolar processes (the standard silicon process mentioned above) it is between ±10 nA and 1 µA, while for FET input stages it is generally below 1 nA. For bipolar op amps the value varies widely between parts but is almost unaffected by temperature; for MOS-input op amps it is the gate leakage current, which is small but strongly affected by temperature.

4. Input Offset Current
The input offset current is the mismatch between the bias currents of the two differential inputs, defined as the difference between the bias currents at the two input terminals when the output DC voltage is zero. It, too, reflects the symmetry of the op amp's internal circuit: the better the symmetry, the smaller the offset current. It is roughly one hundredth to one tenth of the input bias current. The input offset current matters for small-signal precision amplification and DC amplification, especially when large external resistors (such as 10 kΩ or more) are used; its effect on accuracy can then exceed that of the input offset voltage. The smaller the input offset current, the smaller the midpoint offset in DC amplification and the easier it is to handle, so for precision op amps it is an extremely important indicator.
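As a worked example of why the offset current matters with large external resistors (the 10 nA and 100 kΩ figures here are illustrative, not from any specific datasheet), the offset current flowing through the external resistance appears as an input-referred error voltage:

\[ V_{err} = I_{OS} \times R = 10\,\mathrm{nA} \times 100\,\mathrm{k\Omega} = 1\,\mathrm{mV} \]

which already exceeds the input offset voltage of many precision op amps.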
5. Input Impedance
(1) Differential input impedance: defined as the ratio of the voltage change between the two input terminals to the corresponding change in input current when the op amp operates in its linear region. Differential input impedance includes input resistance and input capacitance; at low frequencies it reduces to the input resistance.
(2) Common-mode input impedance: defined as the ratio of the change in common-mode input voltage to the corresponding change in input current when the op amp operates on a common-mode input signal (i.e., the same signal applied to both inputs). At low frequencies it manifests as the common-mode resistance.

6. Voltage Gain
(1) Open-loop voltage gain: the amplification factor of the op amp without negative feedback (open-loop), denoted AVOL; in datasheets it often appears as "Large Signal Voltage Gain". The ideal value of AVOL is infinite; in practice it generally ranges from thousands to tens of thousands of times, and it may be expressed in dB or in V/mV.
(2) Closed-loop gain: as the name suggests, the amplification factor of the op amp with feedback applied.

7. Output Voltage Swing
The maximum output voltage amplitude the op amp can deliver into a specified load, operating in the linear region at the given supply voltage.

8. Input Voltage Range
(1) Differential input voltage range: the maximum allowed voltage difference between the two input terminals. Exceeding the maximum differential-mode input voltage may damage the op amp's input stage.
(2) Common-mode input voltage range: the maximum common-mode input voltage is defined as the common-mode voltage at which the op amp, while operating in the linear region, shows significant degradation of its common-mode rejection; it is commonly specified as the common-mode input voltage at which the CMRR drops by 6 dB. It limits the common-mode component of the input signal, which must be taken into account in circuit design when interference is present.

9. Common-Mode Rejection Ratio
The common-mode rejection ratio (CMRR) is defined as the ratio of the differential-mode gain to the common-mode gain when the op amp operates in the linear region. It is an extremely important indicator of how well common-mode interference is suppressed. Because the CMRR of most op amps is tens of thousands or more, raw ratios are inconvenient to compare directly, so CMRR is usually recorded and compared in decibels. The CMRR of typical op amps is between 80 and 120 dB.
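In decibels the conversion is

\[ \mathrm{CMRR_{dB}} = 20\,\log_{10}\frac{A_d}{A_{cm}} \]

so a ratio of 10^4 (tens of thousands) corresponds to 80 dB, and 10^6 to 120 dB, matching the 80-120 dB range quoted above.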
10. Supply Voltage Rejection Ratio
The power supply rejection ratio (PSRR) is defined as the ratio of the change in the op amp's input offset voltage to the change in its power supply voltage, with the op amp operating in the linear region. PSRR reflects how much supply variation leaks into the op amp's output, so for DC or small-signal analog processing the op amp's supply needs careful, meticulous treatment. An op amp with a high common-mode rejection ratio can compensate for part of a poor supply rejection. Note also that with dual supplies, the rejection ratios for the positive and negative supplies may differ.

11. Static Power Consumption
The power the op amp consumes at a given supply voltage, usually specified with no load. Related to it is the quiescent current IQ, the current the op amp draws when running unloaded; it is the op amp's minimum current consumption (excluding sleep modes).

12. Slew Rate
The slew rate is defined, under closed-loop conditions with a large signal (such as a step) applied to the input, as the rate of rise measured at the op amp's output. Because the input stage is overdriven (in a switching state) while the output slews, the feedback loop is ineffective, which means the slew rate is independent of the closed-loop gain. Slew rate is an important indicator for large-signal processing: for general-purpose op amps SR ≤ 10 V/µs, while for high-speed op amps SR > 10 V/µs, with the fastest high-speed op amps currently reaching about 6000 V/µs. It is the key selection criterion for large-signal applications.

13. Gain Bandwidth
(1) Gain-bandwidth product (GBP): the product of gain and bandwidth.
(2) Unity-gain bandwidth: the bandwidth at which the op amp's amplification factor is 1. The two concepts are similar but not identical. Note that for voltage-feedback op amps the gain-bandwidth product is a constant, but this does not hold for current-feedback op amps, whose bandwidth and gain are not linearly related.
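As a worked example of the constant gain-bandwidth product of a voltage-feedback op amp (the 1 MHz GBP here is an illustrative figure, not from a specific part):

\[ BW = \frac{GBP}{A_v} = \frac{1\,\mathrm{MHz}}{100} = 10\,\mathrm{kHz} \]

so a VFA with a 1 MHz gain-bandwidth product configured for a closed-loop gain of 100 delivers only about 10 kHz of bandwidth; for a current-feedback op amp this calculation is not valid, since its bandwidth and gain are not linearly related.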
14. Output Impedance
The output impedance is defined as the ratio of the voltage change at the output terminal to the corresponding current change when a signal voltage is applied to the output while the op amp operates in the linear region. At low frequencies it reduces to the op amp's output resistance. This parameter is tested in the open-loop state.

15. Equivalent Input Noise Voltage
The equivalent input noise voltage is defined as the irregular AC interference voltage present at the output of a well-shielded op amp with no signal applied, referred back to the op amp's input (it is sometimes also expressed as a noise current). For broadband noise, the RMS input noise voltage of ordinary op amps is about 10-20 µV.
    - April 11, 2025
  • HBM - Human Body Model
HBM stands for Human Body Model, commonly known as the human-body discharge model in ESD (electrostatic discharge) testing. It characterizes a chip's anti-static capability, and electronic engineers know that the higher this parameter, the stronger the chip's ESD robustness. However, different chip suppliers usually select different test standards based on their own understanding, experience, partner resources, or the chip's target application scenarios. Different standards mean differences in test methods or conditions, so the ESD robustness of chips cannot be compared directly from the HBM numbers quoted in their datasheets. At present, the commonly used test standards for non-automotive chips are ANSI/ESDA/JEDEC JS-001 and MIL-STD-883, while automotive chips use the AEC Q100-002 standard. ANSI is the American National Standards Institute; ESDA is the Electrostatic Discharge Association; JEDEC is the Joint Electron Device Engineering Council (the solid-state technology association); MIL-STD denotes a US Military Standard. Detailed HBM test plans for these standards can be found online. Comparing them, the RC values selected by all three standards are the same, R = 1.5 kΩ and C = 100 pF, and there is no significant difference in the tested peak current or current waveform, but the number of pulses and the interval between pulses are completely different. Large differences in HBM data can therefore appear even though the influence of packaging materials on ESD parameters is very small and chips built on the same die should have comparable ESD robustness. The MIL-STD-883 standard is more stringent than the other two, so the measured data will be smaller. Strictly speaking, taking samples from the same batch and testing them under the different standards would make the data more convincing; comparison data published by electronics enthusiasts can also be found online. In summary, MIL-STD-883 is the most stringent standard and yields smaller HBM numbers, while HBM data tested to ANSI/ESDA/JEDEC JS-001 will be relatively larger. It should be emphasized that the standards followed for chip-level ESD testing are completely different from those for system-level (whole-product) ESD testing, and the static-electricity energy levels differ even more; it is therefore not acceptable to use system-level ESD test equipment directly on chip pins. In system-level circuit design, especially at external interfaces, special attention should be paid to anti-static and anti-surge protection: integrated chips are relatively delicate and cannot be expected to play the role of discrete protection devices.
    - February 10, 2025
  • Chip Knowledge - MSL (Moisture Sensitivity Level)
MSL stands for Moisture Sensitivity Level, which characterizes a chip's ability to withstand humid environments. It is an extremely important parameter that electronic engineers often overlook. Chips exposed to an open environment absorb moisture, which can enter the chip's plastic package along the pins. During SMT reflow soldering, the moisture expands under the instantaneous high temperature, and the so-called "popcorn" phenomenon can occur.

Classification of moisture sensitivity levels. According to the JEDEC J-STD-020D standard, MSL is classified into 8 levels, as follows:
MSL1 - unlimited floor life at up to and including 30 °C/85% RH
MSL2 - floor life of 1 year at ≤30 °C/60% RH
MSL2a - floor life of 4 weeks at ≤30 °C/60% RH
MSL3 - floor life of 168 hours at ≤30 °C/60% RH
MSL4 - floor life of 72 hours at ≤30 °C/60% RH
MSL5 - floor life of 48 hours at ≤30 °C/60% RH
MSL5a - floor life of 24 hours at ≤30 °C/60% RH
MSL6 - mandatory bake before use; parts must be reflow-soldered within the time limit specified on the moisture-sensitive label
Note in particular that if the ambient temperature or humidity exceeds the test conditions for the corresponding level, the chip can only be exposed to the open environment for a shorter time than the standard specifies.

The impact of moisture sensitivity and protective measures. Once moisture enters the chip, sufficient vapor pressure can build up during SMT to damage or destroy the component. Common damage includes internal delamination of the plastic body from the die or lead frame, damaged bond-wire welds, die damage, or cracks inside the chip that cannot be observed on its surface. The most serious case is the chip swelling and bursting (the "popcorn" effect). Moisture inside the chip may also cause electrochemical corrosion: when powered, water vapor may ionize into hydroxide ions, which can react chemically with the bond pads or even the chip's internal metal layers to form hydrated oxides. These oxides absorb further water vapor, creating fragile spots at the interface between the packaging resin and the metal and leading to bonding failure. If the moisture carries potassium, sodium, or chloride ions, the probability of corrosion of the die, lead frame, and pads rises greatly, leading to delamination or peeling. Once delamination occurs, moisture penetrates far more easily and the chip's reliability drops sharply.

Considering both cost and practical production-process control, most chips are packaged to the MSL3 moisture-sensitivity level in vacuum-sealed bags, with desiccant and a humidity indicator card inside. After unpacking, surface mounting and testing should be completed as soon as possible, followed by moisture protection such as conformal coating.
Once the vacuum bag leaks, the humidity indicator card changes color, or the parts have been left out too long after unpacking, the chips must be baked according to standard procedures before use to ensure they are safe to use. Runshi Technology's automotive-grade products all use MSL1, and some industrial-grade products also use MSL1, to better cope with the harsh operating or production environments of different product systems. Cost and quality are a classic trade-off (like the proverbial fish and bear's paw: you cannot have both): the better the moisture-sensitivity rating, the higher the cost of packaging materials and processes, and the more expensive the chip. When selecting parts, electronic engineers need to choose an acceptable cost, that is, an acceptable chip quality level, based on their SMT production controls, the finished product's operating environment, and market positioning.
    - February 10, 2025
  • 12-bit low-power digital-to-analog converter chip RS1320
The RS1320 is a low-power, single-channel, 12-bit digital-to-analog converter chip with a supply range of 2.7 V to 5.5 V. It supports SPI, QSPI, Microwire, and DSP interfaces and can be used to generate an analog output voltage under digital control, reconstruct analog signals, or provide a controllable reference voltage. It has a wide range of applications in industrial data acquisition and in various measurement and analysis instruments. The RS1320 is based on mainstream market products and optimizes the key parameters according to user needs, further reducing linearity error, zero-code-error temperature drift, gain error, and gain-error temperature drift. It also improves the conversion rate while keeping power consumption low, shortening the output-voltage settling time to suit more application scenarios. Its main characteristics are as follows:
- Guaranteed output monotonicity; built-in buffer with rail-to-rail voltage output;
- Zero-voltage output at power-on;
- Low power consumption: 1.17 mW (3.6 V) / 2.94 mW (5.5 V);
- INL: -0.7 LSB / +1.2 LSB;
- DNL: -0.1 LSB / +0.2 LSB;
- Zero-code error: 1.3 mV;
- Full-scale error: -0.01% FS;
- Output-voltage settling time: 6 µs @ CL = 500 pF;
- SPI data interface;
- Extended industrial temperature range: -40 °C to 125 °C.

The RS1320 adopts a resistor-string architecture, achieving excellent AC/DC characteristics; for detailed data and parameter curves, please refer to the datasheet. RS1320 packaging and pin definition: the RS1320 comes in a standard SOT23-6 package, with pin definitions fully compatible with the DAC121S101. Engineers from all fields are welcome to sample and evaluate it.
    - February 10, 2025