A purchased chip empire?
In the semiconductor world, some companies make their living on technology and others on patents. Qualcomm, armed with both, has built a chip empire through a deliberate strategy of buying. Today Qualcomm is a chip giant spanning mobile phones, automobiles, the Internet of Things, and AI edge computing. What is less well known is that the foundation of this empire rests on technology teams and assets that were quietly acquired, then polished with time and patience.

GPU: Adreno, rising from the afterglow of ATI

In 2006, the graphics capability of mobile devices could fairly be called primitive. The smartphone had not yet arrived: BlackBerry and Palm devices still dominated the market, and the iPhone would not appear until 2007. In an era when mobile graphics only needed to be good enough to run Snake, most device makers had very limited demand for GPUs, caring mostly about basic 2D interface rendering and simple multimedia playback. Yet it was during this seemingly quiet period that a far-reaching technology acquisition was brewing. AMD decided to acquire ATI Technologies for the sky-high price of $5.4 billion in order to gain a graphics advantage in its competition with Intel. The deal not only reshaped the PC graphics card market; it also, unexpectedly, planted an important seed in mobile graphics.

In the AMD-ATI deal, AMD's main target was ATI's desktop and server GPU business; ATI's mobile graphics division was considered peripheral. The team had real technical strength, but its market prospects were unclear at the time: graphics demand on mobile devices was limited, and the industry as a whole lacked confidence in the potential of mobile GPUs.
Qualcomm, however, showed forward-looking strategic vision. The company, which had started in communication chips, saw clearly that as mobile devices gained functionality, graphics processing would become one of the core competitive strengths of future mobile platforms. When AMD put ATI's mobile GPU division up for sale as part of the bundle, Qualcomm moved quickly and acquired the experienced team at a relatively low price.

After the acquisition, Qualcomm gave its new GPU division a memorable name: Adreno, an anagram of ATI's famous GPU brand "Radeon". The naming was no accident. It paid respect to ATI's technical heritage while signaling a fresh start under a new owner, and it carried real technical weight: the Radeon series had once competed with NVIDIA's GeForce in the PC market and embodied deep graphics expertise. By preserving this technological DNA in symbolic form, Qualcomm was effectively declaring to the market that Adreno would inherit ATI's strengths in graphics and build on them.

The original ATI team brought valuable assets to Adreno. These engineers not only had extensive experience in GPU architecture design but, more importantly, a deep understanding of every stage of the graphics rendering pipeline, from vertex processing to pixel shading, from texture mapping to anti-aliasing. With the arrival of the smartphone era, Adreno had its shining moment. Qualcomm integrated Adreno deeply into the Snapdragon system-on-chip platform, enabling coordinated optimization of the CPU, GPU, DSP, and modem.
This SoC-level design improved overall performance and, more importantly, enabled better power and thermal management. The history of the Adreno line is a microcosm of the progress of mobile graphics as a whole. From the early Adreno 200 series to the current Adreno 740 series, every generation has delivered significant improvements in performance, power consumption, and feature support, and Adreno has consistently stayed at the leading edge in its support for graphics APIs such as OpenGL ES, Vulkan, and DirectX.

Today Adreno has gone far beyond the scope of ATI's original mobile GPU. Modern Adreno GPUs support not only traditional 3D rendering but also machine-learning acceleration, compute shaders, and variable-rate shading (VRS), and they have proven adaptable and scalable in emerging fields such as AR/VR, computational photography, and AI image processing. From a small team in an ATI laboratory to the graphics engine inside billions of mobile devices worldwide, Adreno's story is something of a legend in technology history: a spark left behind by ATI that, carefully tended by Qualcomm, eventually lit up the whole sky of mobile graphics.

CPU: the confidence to bet on self-developed cores

In chip design there is an unwritten rule: when the giants start to feel threatened, real change is about to begin. In 2020, when Apple released MacBooks built on the M1 chip, the entire industry was shocked, not only because the M1's performance was stunning, but because it proved something to the world: starting from mobile silicon, it is entirely possible to build products that rival or even surpass traditional x86 processors. No one felt that shock more deeply than Qualcomm.
As the dominant player in mobile chips, Qualcomm suddenly faced an unprecedented challenge. Apple was no longer content to rule only smartphones and tablets; it was reaching into the PC market, a field Qualcomm had long wanted to enter but had never managed to crack.

For a long time, Qualcomm's Snapdragon processors relied on the stock Cortex cores licensed from Arm. In the early days of the mobile market this model worked well: Arm supplied mature, stable core designs, while Qualcomm and other vendors focused on system-level optimization and integration. That division of labor enabled the rapid growth of the Android ecosystem and underpinned Qualcomm's leadership in mobile chips.

As the demands on mobile computing grew, however, reliance on stock cores began to expose serious limitations. First, differentiation was limited: when everyone uses the same CPU core, products differ mainly in process technology, frequency tuning, and the integration of peripheral components, and it is hard to build an advantage at the heart of the architecture. Second, the room for performance optimization was limited, because a stock core must accommodate the needs of every licensee and is therefore hard to optimize deeply for specific workloads. Most fatal of all, the success of Apple's M-series chips showed the industry the enormous potential of a self-developed architecture. Through fully autonomous design, Apple achieved breakthroughs not only in performance but, more importantly, in power efficiency.
Apple's advantage in hardware-software integration put unprecedented pressure on every manufacturer that depended on stock cores. Facing the impact of the M-series, the whole industry began to rethink its architecture strategy: Intel pushed its hybrid-architecture designs, AMD kept refining the Zen series, and within the Arm ecosystem the major vendors sought more autonomy. As the leader in mobile chips, Qualcomm was acutely aware of the urgency of change. A consensus formed inside the company: continuing to rely entirely on Arm's stock cores would make it hard to compete with Apple on performance and, worse, would mean ceding the initiative in the next round of technological competition. In emerging fields such as AI computing and edge computing in particular, the flexibility and optimization headroom of a self-developed architecture would be a decisive advantage.

In January 2021, Qualcomm announced its acquisition of Nuvia for $1.4 billion, sending a shock through the industry. To many, Nuvia was still an unfamiliar name: a two-year-old startup with fewer than a hundred employees, no mass-produced products, and no mature business model. The valuation was indeed startling at the time. But Qualcomm was not buying Nuvia's present; it was buying the technical strength and potential behind it. Nuvia's founding team was star-studded: CEO Gerard Williams III had been chief architect of Apple's A-series chips, involved in the design of multiple processor generations from the A7 to the A12X, while CTO Manu Gulati and chief system architect John Bruno also brought extensive experience in high-performance processor design.
This team had not only lived through the golden age of Apple silicon; more importantly, it had a distinctive understanding of how to build high-performance, low-power processors. Nuvia's technology roadmap was also closely aligned with Qualcomm's strategic goals. The company focused on high-performance Arm processors for data centers and edge computing, with a design philosophy that emphasized maximum performance at low power, and despite its youth its team had demonstrated impressive design capability in a short time. More importantly, Nuvia possessed complete, end-to-end CPU core design capability, from instruction-set-level optimization through microarchitectural innovation to compilers and software stacks. This was exactly the core capability Qualcomm urgently needed.

After the acquisition, Qualcomm did not rush to productize Nuvia's technology; it chose the safer strategy of deep integration. Nuvia's core engineering team was folded entirely into Qualcomm's R&D organization, becoming a pillar of its CPU design department. The integration went beyond personnel to a genuine merging of technical philosophy and design methodology. Throughout the process, Qualcomm showed considerable patience and strategic resolve. Rather than shipping transitional products, it gave the Nuvia team the time and resources to advance steadily along the established roadmap, while combining its own deep experience in mobile chips, including power management, thermal design, and manufacturing processes, with Nuvia's high-performance design philosophy.
Building on Nuvia's technology, Qualcomm began to re-plan its high-performance CPU roadmap. This was not just a technical adjustment but a major shift in product strategy. The new roadmap put greater emphasis on balancing performance against power. Drawing on Nuvia's experience with high-performance processors, Qualcomm began exploring more aggressive architectural innovations, including wider execution units, deeper pipelines, and smarter branch prediction, while its own mobile heritage pushed the new designs toward finer dynamic power management across different workloads. The roadmap also became more sharply targeted: beyond traditional mobile applications, PC computing, edge AI, and cloud-native workloads became key optimization targets, supporting Qualcomm's expansion into new markets.

In 2024, Qualcomm officially shipped the Oryon CPU core built on Nuvia's technology, the first major fruit of the acquisition three years on. Oryon marks Qualcomm's formal entry into the era of self-developed CPU cores and, more importantly, injects new vitality into the Arm ecosystem as a whole. On the published specifications, Oryon delivers impressive results: industry-leading single-core and multi-core performance at comparatively low power, with particularly large gains on AI workloads thanks to targeted optimization. That performance rests on innovations across several technical fronts.
These innovations demonstrate both the technical strength of the Nuvia team and Qualcomm's deep expertise in system-level optimization. In microarchitecture, Oryon adopts a wider execution engine and deeper out-of-order execution queues, extracting more instruction-level parallelism, while improved branch prediction and a larger cache hierarchy reduce effective memory latency. In power management, Oryon inherits Qualcomm's long experience in mobile: finely partitioned power domains and dynamic adjustment of voltage and frequency let it match power consumption to the actual workload, maximizing battery life without sacrificing performance. For AI acceleration, Oryon integrates dedicated matrix and vector units that execute machine-learning workloads efficiently, giving Oryon-based devices strong support in AI applications.

Oryon has also opened new markets for Qualcomm. The Snapdragon X series processors built around it aim squarely at the PC market, competing head-on in Intel's and AMD's traditional strongholds, while also underpinning Qualcomm's push into edge computing and AI inference. The $1.4 billion price tag drew plenty of doubt at the time, but in hindsight the strategic value of the investment is plain: Nuvia brought Qualcomm a world-class CPU design team and core technology, and with them the initiative in the next round of competition.
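The power-management technique described above, dynamically adjusting voltage and frequency to the workload, can be illustrated with the classic CMOS dynamic-power model P ≈ α·C·V²·f. The sketch below is a textbook illustration only; the activity factor, capacitance, voltage, and frequency values are made-up assumptions, not Oryon's actual operating points.

```python
# Illustrative DVFS sketch: dynamic power scales as alpha * C * V^2 * f.
# All numbers below are hypothetical, not actual Oryon operating points.

def dynamic_power(alpha: float, c_eff: float, volts: float, freq_hz: float) -> float:
    """Classic CMOS dynamic-power model: P = alpha * C * V^2 * f."""
    return alpha * c_eff * volts**2 * freq_hz

# Two hypothetical operating points for one core.
high = dynamic_power(alpha=0.2, c_eff=1.0e-9, volts=1.00, freq_hz=3.0e9)  # "boost"
low = dynamic_power(alpha=0.2, c_eff=1.0e-9, volts=0.70, freq_hz=1.5e9)   # "efficiency"

# Halving frequency alone would halve power, but lowering voltage along
# with it compounds quadratically: this low point uses ~4x less power.
print(f"high: {high:.3f} W, low: {low:.3f} W, ratio: {high / low:.1f}x")
```

Because voltage enters the model squared, dropping frequency and voltage together saves far more energy than throttling frequency alone, which is why fine-grained voltage/frequency control across separate power domains matters so much for battery life.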
Wi-Fi/Bluetooth: the 'invisible wings' of Atheros

Atheros, once a pioneer of wireless communication, is now part of the Qualcomm empire, but its technical DNA still quietly flows through billions of devices. From the early laptop Wi-Fi card to the connectivity module in today's smartphones, Atheros' legacy spans the entire development of the wireless era.

To understand Atheros' value, we have to return to the starting point of wireless networking. In 1998, when the Wi-Fi standard had only just been established and most people were still on dial-up internet, Atheros already sensed the enormous potential of wireless communication. Founded by a research team out of Stanford University, the company focused from the start on a problem that sounds simple but is fiendishly complex: how to connect devices wirelessly with both efficiency and stability. Wireless technology was then in its very early stages; the newly released 802.11b standard topped out at 11 Mbps, and connection stability was limited. But Atheros' engineers saw the infinite possibilities of the technology and dug into RF design, antenna technology, and signal-processing algorithms, trying to break through the bottlenecks of the day.

Atheros' first major breakthrough came from deep optimization of OFDM (Orthogonal Frequency Division Multiplexing). OFDM has great advantages in theory, but practical deployment faces many challenges, including signal synchronization, inter-carrier interference, and power consumption.
By solving these problems through innovative algorithm design and hardware optimization, Atheros' engineers laid the foundation for the rapid development of Wi-Fi.

The key to Atheros standing out in a fierce market was sustained investment in core technology. Technology-driven from its founding, the company poured resources into R&D. After the 802.11g standard was released, Atheros was first to market with chips supporting the 54 Mbps transmission rate, far ahead of its competitors at the time; its chips also excelled in power control and signal stability, giving devices that used them longer battery life and more reliable connections. When the 802.11n standard made MIMO (Multiple-Input Multiple-Output) the next major direction for Wi-Fi, Atheros again demonstrated its innovative capability, shipping some of the first commercial chips to support MIMO. By exploiting multiple antennas, its solutions raised transmission speeds significantly while also improving coverage and resistance to interference.

Along the way, Atheros accumulated deep RF design experience and signal-processing expertise. Its engineers studied every kind of difficult wireless environment, from homes to corporate offices, from dense cities to open countryside, and developed optimizations for each scenario. On the strength of these capabilities, Atheros steadily established its position in the market.
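The OFDM idea at the heart of Atheros' early breakthrough, splitting one fast data stream across many slow, mutually orthogonal subcarriers via an inverse Fourier transform, can be sketched in a few lines. This is a textbook illustration, not Atheros' implementation; the subcarrier count and QPSK mapping are arbitrary choices, and the hard parts the text mentions (synchronization, channel impairments, the cyclic prefix) are deliberately left out.

```python
import cmath
import random

# Textbook OFDM sketch: map bits to QPSK symbols, put one symbol on each
# subcarrier, modulate with an inverse DFT, recover with a forward DFT.
# No channel, no cyclic prefix, no synchronization -- orthogonality only.

N = 16  # number of subcarriers (arbitrary; kept small for the naive DFT)

random.seed(0)
# One QPSK symbol (+/-1 +/- 1j) per subcarrier.
symbols = [complex(random.choice([-1, 1]), random.choice([-1, 1])) for _ in range(N)]

def idft(freq):
    """Inverse DFT: one time-domain OFDM symbol from per-subcarrier symbols."""
    n = len(freq)
    return [sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def dft(time):
    """Forward DFT: separates the overlapping-but-orthogonal subcarriers."""
    n = len(time)
    return [sum(time[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

time_signal = idft(symbols)   # transmit side
recovered = dft(time_signal)  # receive side, ideal channel

# With an ideal channel every subcarrier's symbol comes back exactly:
# orthogonality means no inter-carrier interference.
assert all(abs(r - s) < 1e-9 for r, s in zip(recovered, symbols))
print("all", N, "subcarriers recovered with no inter-carrier interference")
```

In a real system the subcarriers drift out of orthogonality the moment timing or frequency synchronization slips, which is exactly where the engineering effort described above went.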
The company's first major breakthrough came in the laptop market. When Wi-Fi was just emerging, laptops were its most important carrier, and Atheros' chip performance and stability won over many laptop manufacturers. As home broadband spread, the consumer router market took off; Atheros seized the opportunity with chips designed specifically for routers, supporting higher data rates and stronger concurrent processing so that multiple devices could be served stably at once. Beyond the consumer market, Atheros also succeeded in enterprise wireless, where requirements are far more stringent. Its dedicated enterprise chips reached industry-leading levels in RF performance, concurrency, and security encryption, and many network-equipment vendors built their enterprise access points (APs) and wireless controllers on Atheros silicon.

Although Atheros began with Wi-Fi, it quickly realized the limits of a single technology in the wireless field. Different scenarios call for different connectivity technologies, and the companies that can offer comprehensive solutions hold the greatest advantage. On that understanding, Atheros expanded into Bluetooth.
Bluetooth cannot match Wi-Fi in range or throughput, but its low power consumption and point-to-point simplicity make it ideal for applications such as audio transmission and input devices. Atheros' Bluetooth chips showed the same technical standards: its engineers optimized every level of the Bluetooth protocol stack, from low-level RF design to upper-layer application protocols, delivering better audio quality, more simultaneous device connections, and richer application functionality.

As smartphones and other mobile devices rose, combined connectivity solutions became the mainstream demand, and Atheros responded with Wi-Fi/Bluetooth combo chips that integrated both technologies on a single die, cutting cost and power while simplifying device design. But the wireless market was changing fundamentally. Mobile devices demanded connectivity chips utterly unlike those for PCs: smaller, lower-power, more highly integrated, and under far stricter cost control. For Atheros, still focused mainly on the PC market, this was a huge challenge.

Just as Atheros struggled with the transition to the mobile age, Qualcomm stepped in. In January 2011, Qualcomm announced the acquisition of Atheros for $3.1 billion, sending a shock through the industry. To many observers the price seemed too high, especially given the market challenges Atheros faced. But the acquisition showed Qualcomm's strategic vision in full.
Qualcomm understood the trajectory of the mobile-internet era and recognized that connectivity would be one of the key elements of mobile-device competitiveness. It dominated cellular communication, but its strength in non-cellular technologies such as Wi-Fi and Bluetooth was comparatively limited. Atheros' value lay not only in its existing products and market position but, more importantly, in its deep technical accumulation and experienced engineering team. Qualcomm prized Atheros' core capabilities in RF design, signal processing, and protocol-stack optimization, which complemented its own cellular technology perfectly. Qualcomm also saw the coming convergence of connectivity: in a mobile device, Wi-Fi, Bluetooth, and cellular must work together, and a vendor that can optimize them as one gains a significant edge. By acquiring Atheros, Qualcomm gained world-class connectivity technology and laid the foundation for a comprehensive position in mobile communication.

After the acquisition, Qualcomm did not simply run Atheros as an independent business unit; it chose deep integration, at the level of technology, culture, and organization alike. Technically, Qualcomm fused Atheros' connectivity technology with its own mobile processors, not as simple physical stitching but as coordinated optimization across architecture design, power management, and signal processing.
Through this deep integration, Qualcomm could offer customers connectivity solutions with better performance, lower power, and better cost-effectiveness. Organizationally, Qualcomm retained Atheros' core technical team, gave it the autonomy to keep innovating, and wove its engineers into Qualcomm's own R&D, forming a complete technology chain from RF up to applications. The strategy paid off: the Atheros team kept its innovative vitality and gained a larger platform, while Qualcomm's strength in connectivity grew enormously.

With Atheros, Qualcomm completed a comprehensive position in wireless connectivity. Today Qualcomm is the only company in the world that provides top-tier solutions across all three major wireless domains: cellular, Wi-Fi, and Bluetooth. That breadth solidly supports its dominance in mobile communications. In the smartphone market, almost all mainstream Android devices use Qualcomm connectivity, from the Samsung Galaxy series to Chinese brands such as Xiaomi, OPPO, and vivo, and from high-end flagships to mid-range and entry-level products; much of that coverage traces back to Atheros technology. In the PC market, Intel may dominate processors, but Qualcomm competes strongly in Wi-Fi, and many Intel-based laptops ship with Qualcomm Wi-Fi modules. In the IoT and smart-home markets, Qualcomm's connectivity technology is widely deployed.
From smart speakers to smart appliances, from industrial IoT to smart cities, Qualcomm's connectivity solutions provide reliable support across scenarios. One of the greatest values of the Atheros acquisition is the unified optimization of multiple connectivity technologies. In a modern mobile device, cellular, Wi-Fi, and Bluetooth must operate concurrently, and traditional siloed designs tend to raise power consumption, degrade performance, and hurt the user experience. By integrating Atheros' technology deeply, Qualcomm achieved a unified design: on the Snapdragon platform the various radios share underlying resources such as the RF front end, antenna system, and power management, which cuts cost and power while lifting overall performance. If mobile chips are Qualcomm's calling card, connectivity is its invisible wing. The Atheros deal gave Qualcomm world-class connectivity technology and, more importantly, completed its strategic transformation from mobile chip maker into comprehensive communication-solution provider.

V2X and the automotive-grade blueprint: Autotalks adds the finishing touch

In the tide of intelligent driving, 'conversation' between vehicles is moving from dream to reality. Vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication together form the foundation of V2X (vehicle-to-everything). These invisible conversations will be a key pillar of safety and coordination in autonomous driving: an early-warning system that sees farther than radar or cameras. Qualcomm has long regarded the car as the next high ground of the mobile internet.
From the Snapdragon Cockpit platform to the Snapdragon Ride autonomous-driving platform, Qualcomm has built a multi-level automotive SoC lineup covering infotainment, AI decision-making, sensor fusion, and more. What the full solution lacked was the 'last mile': V2X communication. To fill it, Qualcomm moved to acquire the Israeli company Autotalks, which had worked in V2X chips for years and was one of the few vendors supporting both the DSRC (Dedicated Short-Range Communications) and C-V2X (cellular V2X) protocols, with customers across the European and American automotive supply chains. Compared with the slow grind of in-house development at the major manufacturers, Autotalks' technology had already been hardened through field testing and automotive qualification, offering the practical advantage of near plug-and-play deployment. The acquisition, announced in 2023, gave Qualcomm a complete chain from intelligent cockpit and autonomous-driving control to the vehicle's connection with the outside world. On the Snapdragon Ride platform, Qualcomm integrated V2X natively for the first time rather than as an external chip. This system-level integration brings power and cost benefits and, more importantly, improves V2X's stability and synergy within the whole vehicle, opening new paths for fusing AI with V2X information at the vehicle edge. At a deeper level, the deal is not just technical reinforcement but a bet on the future landscape of intelligent transportation. Driven by 5G, the connection between cars and everything else is no longer an isolated system but an ecosystem of collaborative perception and decision-making.
Whoever occupies a pivotal position in that network holds the discourse power of 'swarm intelligence' in autonomous driving. For Qualcomm, Autotalks is the finishing touch: it fills a technology gap and elevates Qualcomm's automotive-grade blueprint from single-vehicle intelligence to collaborative intelligence, from in-car smart systems to active awareness of the outside environment. It marks another crucial step from 'smart-car chip supplier' to 'leader in intelligent transportation platforms'.

SerDes: the behind-the-scenes player in high-speed interconnect

If AI chips are powerful 'factories' of computing power, the data flowing between them is the 'blood' that keeps the whole system running. What really determines whether that blood circulates efficiently is an often-overlooked 'data highway': SerDes (serializer/deserializer) technology. With the rapid growth of AI inference, edge computing, automotive electronics, and data centers, data interconnect has become the new bottleneck of SoC systems. The latency and bandwidth of data transfer, whether between blocks inside a chip or between chips, often set the performance ceiling of the whole platform. Qualcomm, long renowned for RF and communications expertise, realized early that high-speed interconnect would be decisive infrastructure on the road to AI and the data center. But this was not a traditional Qualcomm strength. Facing this niche-but-crucial gap, Qualcomm chose not to build from scratch; it acted decisively and quietly acquired SerDes assets from the Canadian technology company Alphawave Semi.
Though less famous than the major chipmakers, Alphawave is a star in high-performance SerDes IP, particularly skilled in high-speed interface solutions across protocol standards such as PCIe, CXL, and Ethernet. Its technology moves data at high speed through narrow physical channels with low power and low bit-error rates, which is critical for cutting-edge SoC architectures: sub-5nm processes, chiplet packaging, and multi-die interconnect. Through this acquisition, Qualcomm quietly completed a "remedial lesson" in its high-speed I/O IP portfolio. The strategic significance goes well beyond improving the data-transfer capability of existing SoCs; it is groundwork for Qualcomm's entry into new battlefields such as data centers, AI accelerator cards, and chiplet-based heterogeneous computing platforms. In today's chiplet trend especially, the single large chip is giving way to combinations of smaller dies, and die-to-die interconnect capability has become the watershed between platforms that succeed and those that fail. Without mature SerDes technology, chiplets are a puzzle that cannot be assembled. With Alphawave's capabilities, Qualcomm can not only break through bottlenecks inside the SoC but also build its own efficient, modular platform architecture for the chiplet era. More importantly, this interconnect capability is no longer limited to "internal optimization". It has become a bridge for Qualcomm's next generation of fused AI and communications systems: whether in server-side AI inference cards or the cooperating modules of a vehicle platform, SerDes is the irreplaceable "critical channel". 
In this contest over speed, latency, power, and area, Qualcomm has, with one precise move, quietly upgraded itself from a "chip company" to a "system-interconnect player". At the critical juncture between the era of chips and the era of system integration, SerDes, this small screw, is supporting Qualcomm's next round of ambitions. At the end: what was bought is not only technology, but the future The outside world often calls Qualcomm a technology company, but it is better seen as a prime example of capital and technology combined. Almost every one of its core capabilities (graphics GPU, wireless communication, V2X, SerDes) originated in strategic mergers and acquisitions. What made these capabilities take root, however, is Qualcomm's ability to internalize them into an ecosystem and unify them onto a platform. This is not simple assembly of parts, but melting them into an organic chip empire. It is a story that peaked through M&A, yet it is not only about buying: it relied on digestion, integration, and re-creation. From phones to cars, from connectivity to computing, Qualcomm has struck precisely time and again, writing the legend of a semiconductor empire that was "bought". As you can see, the foundation of the empire may have been purchased, but the towers were built brick by brick.
    - June 15, 2025
  • 0.7nm chip, roadmap update
    0.7nm chip, roadmap update
    The main feature of GAA nanosheet devices is the vertical stacking of two or more nanosheet conductive channels, with each logic standard cell containing one stack for p-type devices and another for n-type devices. This configuration lets designers further reduce the height of the logic standard cell, defined as the number of metal lines (or tracks) per cell multiplied by the metal pitch. Designers can also widen the channel, trading cell height for larger drive current. Besides the area reduction, GAA nanosheet transistors have another advantage over FinFETs: the gate surrounds the conductive channel on all sides, strengthening the gate's control over the channel even at short channel lengths. Figure 1- TEM image of a GAA nanosheet device GAA nanosheet technology is expected to last at least three generations before chip makers transition to CFET (complementary FET) technology. Because of its vertically stacked nMOS-pMOS structure, CFET's integration complexity is significantly higher than that of conventional nanosheet devices. According to imec's roadmap, mass production of CFET becomes feasible only from node A7. That means the GAA nanosheet era must extend at least to the A10 technology node, where cell height is expected to shrink to as little as 90 nanometers. However, shrinking GAA-nanosheet-based standard cells without hurting performance is extremely challenging. This is exactly where the forksheet device architecture may bring relief: it is a non-disruptive technology with more scaling headroom than conventional GAA nanosheet technology. Forksheet, the 1nm enabler In 2017, imec introduced the forksheet device architecture, first as a scaling booster for SRAM cells and later as a scaling enabler for logic standard cells. 
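The cell-height definition above (tracks multiplied by metal pitch) can be checked with a line of arithmetic. The track count and pitch below are illustrative assumptions, chosen only so the product lands on the 90 nm A10 target the text cites; they are not imec-published figures.

```python
def cell_height_nm(tracks: int, metal_pitch_nm: float) -> float:
    """Logic standard-cell height = number of metal tracks x metal pitch."""
    return tracks * metal_pitch_nm

# Illustrative example: a 5-track cell with an assumed 18 nm metal pitch
# reaches the 90 nm cell height cited for the A10 node.
print(cell_height_nm(5, 18.0))  # -> 90.0
```

Any other track/pitch combination multiplying to 90 would satisfy the same definition; the point is only that shrinking either factor shrinks the cell.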
The unique feature of its first implementation was the placement of a dielectric wall between the nMOS and pMOS devices before gate patterning. Because this wall sits in the middle of the logic standard cell, the architecture is called the "inner-wall" forksheet. The wall physically isolates the p-gate trench from the n-gate trench, allowing a tighter n-to-p spacing than FinFET or nanosheet devices. This permits further cell-area reduction (cell height down to 90nm) while still delivering a performance improvement. In this inner-wall configuration, the sheets are controlled by a tri-gate, fork-shaped structure, which gives the device its name. Figure 2- TEM image of an inner-wall forksheet device At VLSI 2021, imec demonstrated the manufacturability of a 300mm inner-wall forksheet process flow. Electrical characterization of fully functional devices confirmed the forksheet as the most promising device architecture for extending the logic and SRAM nanosheet scaling roadmap to the A10 node. Because the integration flow reuses most of the nanosheet production steps, the evolution from nanosheets to forksheets can be considered non-disruptive. Manufacturability is being challenged Despite the successful hardware demonstration, concerns about manufacturability remained, prompting imec to reconsider and improve its initial forksheet device architecture. The main challenge concerns the manufacturability of the inner wall itself. To reach a 90nm logic standard-cell height, the dielectric wall must be very thin, in the range of 8-10nm. But because the wall is formed early in the device process flow, it is exposed to all subsequent front-end-of-line (FEOL) etch steps, which may thin it further, placing considerable demands on the choice of wall material. 
In addition, to enable n- or p-specific process steps (such as p/n source/drain epitaxy), dedicated masks must be placed precisely on the thin dielectric wall, which challenges p/n mask alignment. Furthermore, about 90% of devices in practical applications share a common gate between the n and p channels. In standard cells built from inner-wall forksheet devices, the dielectric wall obstructs that p-n gate connection unless the gate is made taller to cross the wall, which increases parasitic capacitance. Finally, chip makers are wary of the tri-gate architecture, in which the gate surrounds the channel from only three sides. Compared with a GAA structure, the gate risks losing control of the channel, especially at short channel lengths. Outer-wall forksheet: a dielectric wall at the cell boundary At the 2025 Symposium on VLSI Technology and Circuits (VLSI 2025), imec researchers presented a novel forksheet device architecture they named the "outer-wall forksheet". Through TCAD simulation, they showed how the outer-wall forksheet improves on the earlier design by reducing process complexity and delivering better performance while preserving area scaling. Figure 3- imec's logic technology roadmap, showing the nanosheet era extended from 2nm to the A10 node using the outer-wall forksheet, then transitioning to CFET at A7 and beyond The outer-wall forksheet places the dielectric wall at the boundary of the standard cell, making it a p-p or n-n wall. Each wall can thus be shared with the adjacent standard cell and can be thickened (to about 15 nanometers) without affecting the 90-nanometer cell height. Another distinctive feature is the wall-last integration scheme. The flow begins with the formation of a wide Si/SiGe stack, a step common to any GAA technology. 
After the SiGe is etched away in the nanosheet channel-release step, the stacked Si layers form the nanosheet-shaped conductive channels. The dielectric wall eventually divides the stack in two, with two FETs of the same polarity on either side of the wall. The wall itself is processed toward the end of the integration flow, that is, after nanosheet channel release, source/drain etch, and source/drain epitaxial growth. The replacement metal gate (RMG) step completes the integration flow. Figure 4- Schematic of the forksheet structure with the (top) inner wall and (bottom) outer wall 5 key improvements of the outer-wall forksheet Compared with GAA nanosheet devices, the inner- and outer-wall forksheets share two advantages. In area scaling, both achieve a 90nm logic standard-cell height at the A10 node, an improvement over the 115nm cell height of A14 nanosheet technology. The second shared advantage is reduced parasitic capacitance: the two FETs on either side of the wall (n and p for the inner wall; n and n, or p and p, for the outer wall) can be placed closer together than in nanosheet-based cells without capacitance problems. Beyond this, the outer-wall forksheet is expected to surpass the inner-wall version in five key design aspects. First, thanks to the wall-last integration scheme, the dielectric wall avoids several complex FEOL steps and can therefore be made from mainstream silicon dioxide. In the wall-last step, the wall is formed by etching a trench in the wide Si/SiGe stack and filling it with SiO2 dielectric. Second, because the wall sits at the cell boundary, its width can be relaxed to about 15nm, simplifying the process. Third, it is now easy to connect the gates of the n and p devices within a standard cell without crossing a dielectric wall. 
Fourth, the outer-wall forksheet is expected to give better gate control than the inner-wall device, owing to its ability to form an Ω-gate structure instead of the tri-gate of the inner-wall forksheet. The wider dielectric wall makes it possible to etch the wall back by a few nanometers during the final RMG step. This lets the gate partially wrap the fourth side of the channel, forming an Ω-shaped gate and strengthening control over the channel. Through TCAD simulation, imec researchers found that etching back 5 nanometers of the dielectric wall is the optimum, raising the drive current by about 25%. Figure 5- The effect of wall etch-back on gate formation: from tri-gate to Ω-gate, and then to GAA The fifth aspect concerns the potential of the forksheet integration flow to provide full channel strain, an additional performance boost that benefits drive current. Usually, full channel strain is obtained by implementing source/drain stressors. This method has proven highly effective in (p-type) FinFETs but is difficult to implement in GAA nanosheet and inner-wall forksheet device architectures. Conceptually, the idea is to incorporate Ge atoms into the source/drain regions. Because Ge atoms are larger than Si atoms, they introduce compressive strain in the Si channel, increasing carrier mobility. Figure 6- At the start of the outer-wall forksheet flow, a hard mask (brown) is deposited on top of a wide Si (gray)/SiGe (purple) layer stack. The Si "seed" beneath the hard mask can then support epitaxial growth of the source/drain regions The reason the outer-wall forksheet device can realize fully effective source/drain stressors is its wall-last approach. Before the wall is made, the hard mask continues to cover the middle portion of the wide Si/SiGe stack that will later be used to form the wall (Figure 6). 
The "Si spine" beneath this hard mask can now serve as a seed during source/drain epitaxial growth, acting as a silicon template extending from one gate channel to the next. This is similar to the Si subfin in FinFET technology: imagine rotating the source/drain epitaxy module by 90° (Figure 7). Without such a silicon template, vertical defects form at the source/drain epitaxial interface, eliminating the compressive strain in the silicon channel. Figure 7- The Si spine (right) in the outer-wall forksheet provides a continuous silicon template from one gate channel to the next, conceptually similar to the Si subfin in FinFET technology (left) The outer-wall forksheet in SRAM and ring-oscillator designs Finally, imec ran a benchmark study to quantify the power-performance-area (PPA) advantage of the outer-wall forksheet. When the SRAM bit-cell areas of the A10 outer-wall forksheet and A14 nanosheet are compared, the forksheet architecture's area advantage becomes apparent. Layouts show that the SRAM cell area based on the outer-wall forksheet shrinks by 22%, thanks to the reduced p-p and n-n spacing on top of the reduced gate pitch. Another key performance metric is the simulated ring-oscillator frequency, expressed as the ratio of effective drive current to effective capacitance (Ieff/Ceff). Simulations show that, at node A10, the outer-wall forksheet maintains frequency parity with the earlier A14 and 2nm nodes, provided all of these device structures achieve full channel stress. But achieving full channel stress has proven challenging in nanosheet (2nm and A14) and inner-wall forksheet devices, and its absence costs roughly 33% of drive current. 
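The frequency metric and the strain penalty just described combine into simple arithmetic: if frequency is taken as proportional to Ieff/Ceff and missing strain removes about 33% of drive current at fixed capacitance, the frequency falls by the same fraction. A toy sketch under those stated numbers (all values illustrative, not imec data):

```python
def relative_frequency(i_eff: float, c_eff: float) -> float:
    # Ring-oscillator frequency taken as proportional to Ieff / Ceff,
    # per the metric used in the text.
    return i_eff / c_eff

strained = relative_frequency(1.0, 1.0)            # baseline with full channel stress
unstrained = relative_frequency(1.0 - 0.33, 1.0)   # ~33% drive-current loss, same Ceff
print(round(unstrained / strained, 2))  # -> 0.67
```

This is why the outer-wall device's compatibility with source/drain stressors translates directly into a frequency advantage over structures that cannot realize full channel strain.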
Therefore, the ability to implement an effective source/drain stressor in the outer-wall forksheet device is expected to yield a further performance advantage in ring-oscillator designs. Figure 8- Ring-oscillator simulation results (with and without back-end-of-line (BEOL) load) Outlook and conclusion The forksheet device architecture was introduced by imec to extend the nanosheet-based logic technology roadmap to the A10 technology node, until CFET reaches mass production. Because of manufacturability issues, imec abandoned the original inner-wall forksheet design and developed an upgraded version: the outer-wall forksheet. Compared with the inner-wall forksheet, the new design is more manufacturable while improving performance and reducing area. Looking ahead, imec is now studying the compatibility of the outer-wall forksheet design with the CFET architecture, and the extent to which CFET can draw PPA benefits from this innovative scaling booster.
    - June 13, 2025
  • Semiconductor giant NXP plans to adjust its production line
    Semiconductor giant NXP plans to adjust its production line
    Recently, it was reported that NXP plans to close four 8-inch wafer fabs, one in Nijmegen, the Netherlands, and the other three in the United States. As NXP's other key Dutch site besides its Eindhoven headquarters, Nijmegen hosts manufacturing, R&D, testing, technology enablement, and support functions, and plays an important role in new-product introduction. Behind this, NXP plans to shift production to new 12-inch wafer fabs: even before accounting for edge loss, a 12-inch wafer offers 2.25 times the area of an 8-inch wafer, which means lower fixed and manufacturing costs per die and higher profit. NXP therefore plans to close the four fabs over the next 10 years. In addition, the 12-inch wafer fab that NXP is building in Singapore with Vanguard International Semiconductor through their joint venture VSMC will begin mass production in 2027, which will reduce the risk of NXP's capacity build-out. The factory focuses on mixed-signal, power-management, and analog chips from 130nm down to 40nm, and is expected to reach a monthly output of 55,000 wafers by 2029, becoming an important NXP manufacturing hub in the Asia-Pacific region. NXP's strategic adjustment is not an isolated case but a microcosm of the global semiconductor industry's upgrade. The explosive growth in AI and data-center demand is pushing the market toward more efficient, lower-cost manufacturing. According to SEMI, 82 new 12-inch fabs and production lines are expected to be built globally between 2023 and 2026; by 2026, 12-inch fab capacity will rise to 9.6 million wafers per month. 
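The 2.25x figure in the article is simply the ratio of wafer areas: a 12-inch (300 mm) wafer versus an 8-inch (200 mm) wafer, ignoring edge loss. A quick check:

```python
def area_ratio(d_large_mm: float, d_small_mm: float) -> float:
    # Wafer area scales with the square of the diameter, so the pi/4
    # factor cancels in the ratio.
    return (d_large_mm / d_small_mm) ** 2

# 300 mm vs 200 mm wafers, edge loss ignored, as stated in the article.
print(area_ratio(300, 200))  # -> 2.25
```

In practice the usable-die gain is somewhat higher than 2.25x, because edge loss is proportionally smaller on the larger wafer; the article's figure is the conservative bound.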
According to related data, 12-inch wafers account for about 65% of total semiconductor wafer shipments and 8-inch wafers for about 20%, with the remainder mainly smaller sizes. Dr. Li Wei, Executive Vice President of Shanghai Silicon Industry Group, believes 2024 may be the turning point at which 8-inch silicon wafers begin to exit the stage, because the integrated-circuit industry tends to retire outdated capacity during industry adjustments. Industry analysts see NXP's 12-inch transition as the combined result of technology iteration, market demand, and competitive pressure. Despite challenges such as equipment cost and process complexity, NXP is gradually building a composite capacity system covering both advanced and mature processes through joint ventures, contract manufacturing, and other models. It still needs to find a new balance among technological breakthroughs, cost control, and regional layout.
    - June 11, 2025
  • 17.2 billion yuan! The semiconductor giant just announced
    17.2 billion yuan! The semiconductor giant just announced
    Qualcomm agreed on Monday to acquire the British semiconductor company Alphawave IP Group. Qualcomm said the enterprise value of the transaction is approximately US$2.4 billion (approximately RMB 17.2 billion). Under the terms of the acquisition, each Alphawave shareholder is entitled to receive $2.48 in cash per Alphawave share. Alphawave said its board unanimously recommended that shareholders vote in favor of the plan. After two months of negotiations, Alphawave agreed to accept Qualcomm's $2.4 billion offer. The price is equivalent to 183 pence per share (approximately RMB 16.07), a 96% premium over the company's closing price of 93.50 pence (approximately RMB 9.09) on March 31, the day before Qualcomm announced its acquisition intention. Alphawave focuses on high-speed semiconductors and connectivity technology for data-center and artificial-intelligence applications, designing and licensing semiconductor technology for data centers, networking, and storage. Its serializer/deserializer (SerDes) technology attracted acquisition interest in early April from both Qualcomm and SoftBank's chip-technology provider Arm, though according to earlier reports Arm withdrew after preliminary discussions with Alphawave. SerDes technology is indispensable for AI applications: chatbots like ChatGPT typically require thousands of chips working together to run smoothly. As one of Broadcom's core competitive advantages, SerDes has been a key factor in winning AI customers such as Google and OpenAI.
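The 96% premium quoted above follows directly from the two share prices in the article:

```python
offer_pence = 183.0   # Qualcomm's offer per share
close_pence = 93.50   # closing price on March 31, before the intention was announced

# Premium = (offer / pre-announcement close) - 1
premium = offer_pence / close_pence - 1
print(f"{premium:.0%}")  # -> 96%
```

The exact value is about 95.7%, which the article rounds to 96%.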
    - June 09, 2025
  • TI plans to increase prices for some product lines
    TI plans to increase prices for some product lines
    TI plans to raise prices on some product lines, effective June 15th. The average increase is over 10%, with some part numbers rising 40-70% or more. The increases are concentrated in three categories of products: low-margin parts, old part numbers, and parts whose committed volumes have not been met. This is a global price increase, not limited to the China region. In China, the increases mainly target low-margin products, involving part numbers such as operational amplifiers, interface chips, and ADCs.
    - June 06, 2025
  • Chipanalog CA-PM4644BA four-channel fully integrated multi-phase DCDC micromodule
    Chipanalog CA-PM4644BA four-channel fully integrated multi-phase DCDC micromodule
    In the era of high-density digital integration, the efficiency, flexibility, and reliability of the power supply system have become core challenges. Chipanalog has launched the CA-PM4644BA, a wide-input-voltage four-channel DC/DC buck converter module that provides high-precision power solutions for FPGA, communications, storage, and other scenarios, with three advantages: multiple outputs, flexible expansion, and ultra-high integration. 01 Product Overview   The CA-PM4644BA is a step-down DC/DC converter with wide input voltage and 4A output on each of its four channels. The channels can also be paralleled to provide up to 16A of output current. The device comes in a BGA77 package and integrates the switching control circuit, power MOSFETs, power inductors, decoupling capacitors, and other circuit components. Only a few external parts (input capacitor, output capacitor, feedback resistor, and the like) are needed to form a complete four-channel step-down DC/DC regulator. The input voltage range of the CA-PM4644BA is 4V~15V, and the output voltage can be set from 0.6V to 5.5V by changing the external feedback resistor. The CA-PM4644BA is typically used as a load power supply, providing high-precision rails such as 1.0V, 1.2V, 1.5V, 1.8V, 3.3V, and 5V for the digital circuits in a system (FPGA control circuits, motherboards and CPUs, communications and storage circuits), with up to 4A of output current per channel. The four channels can also be flexibly paralleled, continuously delivering up to 8A (two-phase parallel) or 16A (four-phase parallel).   02 Features   Multiple outputs, flexible expansion: one "core" serves multiple scenarios Four-channel independent supply: 4A per channel, with each channel able to power a different load (such as 1.8V/3.3V/5V multi-voltage requirements). 
Parallel output up to 16A: the four channels can be flexibly paralleled, supporting 8A (two-phase) and 16A (four-phase) high-current output for high-power core rails such as CPU/GPU supplies. Wide voltage coverage: 4V-15V input, 0.6V-5.5V adjustable output, accurately matching digital-circuit supply requirements. Highly integrated design: simplifies the system and saves space. The BGA77 package (9mm×15mm×5.01mm) integrates the switching circuits, MOSFETs, inductors, capacitors, and other core components; only a few external resistors and capacitors are needed, reducing PCB area and aiding high-density design. Efficient and stable: a reliable choice in harsh environments. Efficiency up to 95% (5V input, 3.3V/1A output), reducing system power consumption and temperature rise. ±1.5% output-voltage accuracy, combined with COT control, for fast dynamic response and low ripple. -40℃~125℃ wide-temperature operation, supporting industrial and automotive environments. Multiple protections: built-in input overvoltage protection, output overcurrent/overvoltage protection, soft start, and temperature monitoring to protect the equipment under abnormal conditions.   03 Typical application scenarios FPGA/ASIC power supply: provides 1.0V/1.2V low-voltage, high-precision rails for multi-core processors and logic units. Communications base stations and servers: supports multi-channel power management for 5G base-station BBUs and data-center storage modules.
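The text says the 0.6V-5.5V output is set by an external feedback resistor. A common arrangement in buck regulators of this class is a divider from the output to the FB pin, giving Vout = Vref × (1 + Rtop/Rbot); the 0.6V reference (suggested by the module's minimum output) and the divider topology are assumptions here, not CA-PM4644BA datasheet facts. A sketch under those assumptions:

```python
VREF = 0.6  # assumed feedback reference voltage in volts; the 0.6 V
            # minimum output of the module hints at such a reference

def vout(r_top_ohm: float, r_bot_ohm: float) -> float:
    # Assumed divider: Rtop from Vout to FB, Rbot from FB to ground.
    return VREF * (1 + r_top_ohm / r_bot_ohm)

# Example: a 45 kOhm / 10 kOhm divider would target a 3.3 V rail
# under these assumptions.
print(round(vout(45_000, 10_000), 2))  # -> 3.3
```

Consult the actual datasheet for the real reference voltage, divider placement, and recommended resistor ranges before designing with the part.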
    - June 04, 2025
  • Chip giant, heading to India
    Chip giant, heading to India
      In recent years, against the backdrop of the deglobalization wave and geopolitical contest in the global semiconductor industry, India has been rising at an impressive speed as a core coordinate in the strategic layout of international chip giants. From Renesas Electronics launching 3nm advanced-process R&D in India, to Texas Instruments placing its smallest MCU design team in Bangalore, to Foxconn partnering with HCL to build a semiconductor packaging base, an "India fever" is unfolding across the entire chain of chip design, manufacturing, and packaging. Indian semiconductors, lively now Renesas 3nm, a strong entry into India On May 13, 2025, the Japanese semiconductor giant Renesas Electronics launched two 3nm chip design centers in Noida and Bangalore, India. This is India's first 3nm chip design project and marks a crucial step in the country's semiconductor ambitions. The Renesas 3nm design centers focus on automotive-grade and high-performance computing chips, with mass production planned for the second half of 2027. The project has strong support from the Indian government, with over 270 academic institutions receiving EDA software and learning kits for engineer training. Renesas plans to grow its Indian workforce to 1,000 by the end of 2025 and to collaborate with over 250 academic institutions and startups through its semiconductor program and the Production Linked Incentive (PLI) scheme. On the manufacturing side, Renesas, together with India's CG Power and Thailand's Star Microelectronics, has invested 76 billion rupees (approximately 920 million US dollars) in Gujarat to build an outsourced packaging and testing plant focusing on defense and space chip packaging, and is working with Tata Group's 28nm wafer fab to build a full "design-manufacture-package" industry chain. 
Renesas is focused on end-to-end capability expansion and hopes to obtain a 50% financial subsidy through cooperation with the Indian government, while integrating deeply into India's talent pipeline. India plans to train 85,000 VLSI engineers and support 100 startups within five years, with the goal of making India Renesas' second-largest global R&D base.   The Indian Ministry of Electronics and Information Technology calls it a "major leap" in the semiconductor roadmap, aiming for a semiconductor output value of $109 billion by 2030, about 10% of the global market. However, implementation faces many challenges. On manufacturing, the 3nm process demands extreme equipment precision, and only a few companies such as TSMC and Samsung can mass-produce it globally. Renesas plans to outsource manufacturing to TSMC, but geopolitical risk could affect the stability of that arrangement. On the supply chain, India's domestic ecosystem is immature, and raw materials and equipment rely on imports, making costs high and unstable. On the technical level, India has a large pool of engineers but little high-end design experience, with capabilities so far limited to mature-process design. The 3nm process places extreme demands on transistor density and energy-efficiency optimization, and local IP libraries and design toolchains are lacking, so external support is required. Ambition and challenge coexist in India's semiconductor industry; the landing of Renesas' 3nm design centers is real progress, but whether India can overcome its manufacturing dependence, supply-chain difficulties, and technology gaps will determine whether it truly earns a place in the global semiconductor landscape. 
Foxconn and HCL joint venture: building a semiconductor packaging plant in India On May 14, 2025, the Indian Cabinet approved a joint venture between Foxconn and HCL Group to build a semiconductor packaging plant, with a total investment of 37.06 billion rupees (approximately 435 million US dollars), located at Jawar Airport in Uttar Pradesh and expected to start production in 2027. The project has two phases: the first focuses on packaging and testing, and the second upgrades to a complete manufacturing plant, ultimately reaching a monthly capacity of 20,000 wafers and 36 million display driver chips. In terms of technology and product planning, the project will initially provide downstream services for overseas-made chips, sidestepping the weaknesses of domestic Indian manufacturing; the second phase will shift to display-driver-chip manufacturing for phones, automobiles, and other fields, forming a vertically integrated "chip-module-device" ecosystem with Foxconn's iPhone assembly plants in India. The project is deeply tied to Apple's supply-chain restructuring: Indian-made iPhones currently account for 20% of US imports, and Apple plans to expand Indian capacity to hedge geopolitical risk. Foxconn is not only answering Apple's "Made in India" strategy but also cutting import tariffs on electronic components by 20% through localized chip supply. Its panel factory with Innolux Optoelectronics will also work alongside the packaging plant to localize the display industry chain. This is the sixth semiconductor manufacturing project approved by India, supported by its semiconductor-plan policy. The Indian government provides capital subsidies, land concessions, and tax exemptions, and Uttar Pradesh adds electricity-tax exemptions and skills-training grants. 
Foxconn holds 40% of the shares and HCL Group 60%. The partners plan a "technology introduction + local operation" model to build automotive-electronics manufacturing capability, and intend to add two more wafer fabs and one packaging plant in the future. As of May 2025, the project has completed company registration and site survey, with infrastructure construction expected to start by year-end.   HCL Group is in talks with NXP and Tesla on OEM cooperation for automotive display driver chips. But the project faces multiple challenges. India lacks accumulated display-driver-chip know-how, and although Foxconn has brought in panel technology, chip design still relies on external IP licensing.   Moreover, the global market is dominated by Samsung and LG, and Foxconn must hit demanding technical targets to enter the mainstream supply chain. India can absorb only 30% of the planned output domestically; the rest depends on exports, where geopolitical risk may affect order stability. Overall, this cooperation is an important attempt at a "differentiated breakthrough" for India's semiconductors. If mass production goes smoothly, regional advantages could form, but the leap from packaging and testing to independent design and manufacturing still requires overcoming bottlenecks in technology and capacity. TSMC to build its first 12-inch wafer fab in India In September 2024, TSMC signed a contract with India's Tata Electronics to jointly build India's first 12-inch wafer fab in Gujarat, with a total investment of $11 billion and a monthly capacity of 50,000 wafers; mass production is expected to start in 2026. The project is not only a milestone for semiconductor manufacturing in India but also a key piece of TSMC's global layout. 
TSMC is responsible for fab design and construction, transfer of mature process technology (28nm and above), and talent training, while Tata Group undertakes over 90% of the investment and the operational management. Through a "technology licensing + local operation" model, the two sides will build a full "design, manufacturing, packaging" industry-chain ecosystem. The fab will focus on automotive-grade chips, panel driver ICs, and high-speed computing logic, targeting electric vehicles, AI, and other fields. Tata Electronics has negotiated foundry cooperation with NXP and Tesla, plans to build two more fabs, and is simultaneously advancing construction of the Assam packaging plant. For TSMC, the technology transfer consolidates its influence in mature processes and buys market access at low cost via India's 760-billion-rupee "Semiconductor Plan" subsidies and the Production Linked Incentive scheme. The Indian government provides financial subsidies of up to 50% of project cost, along with promised land concessions and tax reductions. India has folded the project into its "Self-Reliant India" strategy, aiming to train 50,000 semiconductor talents and raise chip self-sufficiency to 50% by 2030. So far, 30% of the factory infrastructure has been completed, 12 mature-process patents have been transferred, the first batch of 500 trainees has begun instruction, and the Tata-NXP foundry cooperation has entered technology verification. The project nevertheless faces serious challenges. On the market side, mature-process capacity is in global oversupply, and Indian demand alone may struggle to digest 50,000 wafers per month, so the fab will have to rely on foundry orders to balance its capacity.
On the policy side, India's earlier $10 billion subsidy plan achieved little because of slow approvals and low participation, and it remains to be seen whether the current subsidies will be delivered on time. The TSMC-Tata cooperation is a bold attempt at "leapfrog development" for India's semiconductor industry; its success will depend not only on technology transfer but also on the Indian government's sustained follow-through on policy implementation, infrastructure support, and market cultivation.

Infineon Opens a Research and Development Center in India

On March 24, 2025, Infineon officially opened its Global Competence Center (GCC) in Gujarat's GIFT City, near Ahmedabad. As its fifth R&D base in India, the center plans to hire 500 engineers over the next five years, focusing on chip design, product software development, information technology, supply-chain management, and system application engineering. Infineon already has over 2,500 employees in India, with Bangalore its largest R&D base. Infineon treats India as a core of its global innovation network, aiming for over 1 billion euros in sales there by 2030, tracking Indian demand for automotive and industrial chips, and accelerating its layout with the up-to-50% financial subsidies available under the "Semiconductor Plan". It follows an "R&D localization + manufacturing outsourcing" model: on the R&D side it develops next-generation automotive and industrial-control chips while using Indian engineers to reduce cost; on the manufacturing side it has reached wafer-supply agreements with the Indian companies CDIL and Kaynes, which handle packaging, testing, and sales, forming a "design, packaging, sales" collaborative chain.
For now there are no plans for a self-built wafer fab, though the strategy may be adjusted over the long term as the Indian supply chain matures. Infineon is also actively building a local ecosystem, partnering with universities to train semiconductor talent and deepening government-enterprise cooperation by leveraging Gujarat's land and tax incentives. It aims to capture over 10% of India's projected $100 billion semiconductor market by 2032. Infineon's India layout is a key outcome of its "global localization" strategy: an attempt to seize the moment of India's semiconductor boom and, through R&D centers, local cooperation networks, and policy resources, support India's transformation into a manufacturing power.

Micron Builds a Packaging and Testing Plant in India

The plant focuses on wafer dicing, packaging, testing, and module production. The first products were expected in the first half of 2025, and at full production the plant will create over 5,000 high-tech jobs, becoming a large-scale memory-chip packaging and testing base in South Asia. The site forms a 50-kilometer industrial cluster with the Tata Electronics wafer fab and the Renesas packaging-and-testing project, sketching an initial regional "design, manufacturing, packaging and testing" closed loop. The plant uses mature processes of 40nm and above to serve the Indian, Southeast Asian, and Middle Eastern markets, and is expected to cut Micron's Asia-Pacific packaging and testing costs by 15-20%. On project progress, Micron is pushing supply-chain localization: Korean material suppliers are investing alongside the plant, local Indian firms are cooperating on equipment maintenance and chemical supply, and the US government is providing support on key raw materials.
Although production has been delayed by six months due to India's infrastructure shortcomings, Micron still sees great potential in the Indian market. The project is a fruit of the Modi government's "Self-Reliant India" strategy and marks a breakthrough for India in chip manufacturing. As India prepares a new round of semiconductor incentives worth billions of dollars, Micron is evaluating a phase-two expansion, planning to raise monthly test capacity to 150,000 wafers by 2030 and to cover more advanced technologies. Micron's India layout demonstrates the country's determination, via "policy leverage + international cooperation", to accelerate its transformation into a new global hub for chip manufacturing.

Semiconductor Giants Gather in India

Beyond these projects, many of the world's leading semiconductor companies are building strategic footholds in India. Chip giants such as NVIDIA and AMD were early to establish large research and design centers there, folding India into their global innovation networks to diversify supply-chain risk and stay close to a fast-growing consumer electronics market. NXP, a leader in automotive chips, announced it will double its R&D investment in India to over $1 billion in the coming years; it currently has four design centers and 3,000 employees, plans a second R&D site focused on 5-nanometer automotive chips at the Greater Noida semiconductor park, and aims to grow its headcount to 6,000. Qualcomm, TI, and others have set up R&D centers and localized teams, participating deeply in India's development of 5G communications, the Internet of Things, and other emerging fields.
ADI has formed a strategic alliance with Tata Group to explore building semiconductor manufacturing plants in India, focusing on customized chips for electric vehicles and network infrastructure; the move marks international players extending from design into manufacturing. These layouts resonate with Indian government industrial policy: by revising its $10 billion semiconductor incentive plan, relaxing technology requirements, and raising subsidy ratios, India has attracted $10 billion in wafer-fab projects, including a collaboration between Israel's Tower Semiconductor and the Adani Group. Global semiconductor equipment giants are likewise building strategic footholds in India, participating deeply in reshaping its industrial ecosystem and filling out the industry-chain layout. Japan's DISCO was first to establish a legal entity in Bangalore and a service network in Ahmedabad; its initial team of ten will be expanded as customer demand grows. Its aim is to provide equipment installation and technical support for Micron, Tata Electronics, and other wafer fabs and packaging plants in India, while training Indian sales staff in advance through its Singapore base. Applied Materials positions India as a global hub for R&D and its supply chain, and the $400 million investment plan it launched in 2023 is advancing steadily: it is establishing a Center of Excellence for artificial intelligence and data science in Chennai focused on AI applications for chip manufacturing, expected to create 500 high-end positions, and plans to expand total headcount from 8,000 to 10,000.
Applied Materials is also working with 15 suppliers to explore equipment-component manufacturing bases in India, striving to physically co-locate verification centers with wafer fabs to shorten R&D cycles, improve material-verification efficiency, and help India build competitiveness in mature-process segments. Lam Research is implementing a supply-chain localization strategy, announcing a $1.2 billion investment in Karnataka in 2024 and working with the state government to build local supply capabilities in precision components, high-purity gas delivery systems, and related areas. The company is evaluating Indian suppliers' potential in core components for wafer-manufacturing equipment and plans to bring India into its global network of 3,000 suppliers, achieving localized support in key equipment areas such as etch and thin-film deposition, thereby strengthening regional supply-chain resilience and reducing risk across Asia-Pacific. Tokyo Electron has established deep cooperation with Tata Electronics, supplying equipment for the 12-inch wafer fab in Gujarat and setting up a dedicated training system to help Tata engineers master advanced process-equipment operation. It plans to establish an equipment delivery and after-sales support system in India by 2026 and to form a local engineering team serving Tata Electronics' manufacturing needs in automotive electronics, AI chips, and other areas. These moves resonate with India's industrial policies, under which central and state governments together provide up to 75% of project-cost subsidies, promoting coordinated development between equipment giants and wafer fabs. The influx of international capital confirms the strategic value of the Indian market.
Its appeal lies not only in chip demand expected to exceed $100 billion by 2026, which would make it the fastest-growing semiconductor market in the world, but also in explosive growth in automotive electronics, 5G communications, and other fields that provide broad application scenarios for the industry. Although India's semiconductor industry is still constrained by weak infrastructure and thin technological accumulation, it is gradually moving from being a major chip-design outsourcing country into manufacturing through "policy leverage + international cooperation". With leading semiconductor companies deeply engaged, India may well form differentiated competitiveness in sub-sectors such as automotive electronics and industrial control, becoming an important variable in the restructuring of the global semiconductor supply chain.

The Story of India's Semiconductor Industry

The development of India's semiconductor industry has been full of twists and opportunities, from early technological breakthroughs through policy adjustments to today's influx of global giants, reflecting one country's persistent exploration of the field. The starting point can be traced to 1984 and the government-funded Semiconductor Complex Limited (SCL), which during the 1980s upgraded its process from 5 microns to 0.8 microns, only one generation behind Intel. But a major fire in 1989 destroyed the SCL factory, reconstruction took eight years, and India missed the golden period of semiconductor development. Since then India has made multiple attempts to attract foreign investment to build fabs, repeatedly frustrated by lagging policy and insufficient resources: in 2005 Intel abandoned an investment over policy shortcomings, and in 2012 an incentive plan stalled over capital and water-supply issues.
Not until December 2021 did the Modi government launch the "India Semiconductor Plan", providing 760 billion rupees (approximately $10 billion) in incentive funds, and even then the initial response was muted. The real turning point came in June 2023, when the revised plan raised the financial-support ratio to 50%, covered the entire chain from semiconductor manufacturing to packaging and testing, and relaxed technical requirements, attracting giants such as Micron and Renesas. This adjustment marks India's shift from sloganeering incentives to substantive industrial support. Under this policy push, India's semiconductor industry has made significant progress. Beyond the manufacturers discussed above, nearly all of the world's top semiconductor companies, including Intel, Texas Instruments, Nvidia, and Qualcomm, have design and research centers in India, with most staff concentrated in Bangalore, Karnataka, in the south. (Image source: ISM) India has also signed cooperation agreements with the United States, Japan, and the European Union to promote technology transfer and supply-chain diversification. Market data show semiconductor consumption in India growing from $22 billion in 2019 to a projected $64 billion in 2026, a compound annual growth rate of about 16%, with automotive, consumer electronics, and wireless communications the main growth areas.

Reasons for Semiconductor Giants to Invest in India

In my view, several factors explain why international semiconductor giants are rushing into India. Policy and financial support: India offers the most generous subsidy policy in the world, with the central government bearing 50% of project costs and state governments adding another 20-25%; enterprises need contribute only 25-30% of the actual investment, directly lowering the entry threshold.
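The growth and cost-sharing figures just quoted are easy to verify. The sketch below uses only numbers from the article ($22B in 2019, $64B in 2026, and the 50% / 20-25% subsidy split); the CAGR formula is the standard one.

```python
# Checking the growth and subsidy arithmetic quoted above. All input
# numbers come from the article; the CAGR formula is the standard one.

def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1

growth = cagr(22e9, 64e9, 2026 - 2019)  # $22B (2019) -> $64B (2026)
print(f"Implied CAGR: {growth:.1%}")     # about 16%, matching the text

# Subsidy cost sharing: central 50% plus state 20-25% leaves the
# enterprise funding the remaining 25-30% of project cost.
central = 0.50
for state in (0.20, 0.25):
    print(f"State {state:.0%} -> enterprise pays {1 - central - state:.0%}")
```

Both claims check out: $22B compounding to $64B over seven years implies roughly 16% annual growth, and the subsidy split leaves the enterprise with 25-30% of project cost.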
The revised plan also provides special support for sub-sectors such as packaging and testing and compound semiconductors, further reducing enterprises' investment risk. (Image source: India Semiconductor Mission, ISM) Talent reserve and cost advantage: India accounts for 20% of the world's semiconductor design talent; 25 leading companies such as Intel and Qualcomm have R&D centers in Bangalore, and Synopsys alone has over 5,500 employees there. Each year some 100,000 new engineering graduates enter the workforce, providing ample manpower at labor costs roughly one-third of those in developed countries. Intel, Qualcomm, and others use local talent for chip design and software development, while equipment giants such as Applied Materials and Lam Research expect to train tens of thousands of engineers over the next five years through their programs. Geopolitics and supply-chain restructuring: amid China-US trade frictions and global supply-chain diversification, India has become an important hedge; by setting up in India, semiconductor giants can sidestep geopolitical risk while staying close to fast-growing local markets such as automotive electronics and 5G equipment. The memorandum of understanding on semiconductor supply chains and innovation partnership signed between India and the United States further strengthens its position as a "reliable manufacturing center". Market potential and industry synergy: India's semiconductor market is expected to reach $110 billion by 2030, and the government's "Make in India" and "Digital India" programs are stimulating local demand.
At the same time, India is building a complete industry chain through local giants and international cooperation, constructing an ecosystem spanning design, manufacturing, and packaging that attracts upstream and downstream enterprises into clusters and reduces collaboration costs. Apple's production of iPhones in India likewise drives matching demand for chips. Infrastructure upgrades: India is building a "semiconductor city" in Gujarat with supporting electricity and transport infrastructure, and has established a semiconductor manufacturing ecosystem fund for park development and logistics networks. The government's "Digital India" program is also investing in 11,000 kilometers of highways and in smart grids to improve supply-chain efficiency.
    - June 02, 2025
  • The United States demands that the three major EDA giants completely cut off their supply to China
    The United States demands that the three major EDA giants completely cut off their supply to China
    According to the Financial Times, the Bureau of Industry and Security (BIS) of the US Department of Commerce has reportedly issued notices to the world's top three electronic design automation (EDA) software suppliers, Synopsys, Cadence, and Siemens EDA, requesting that they cease providing services to Chinese customers. An industry insider confirmed that the three companies did receive notification from BIS, though the specific contents remain unclear. Insiders say the US government is evaluating a broader policy to restrict sales of chip-design software to China; as part of that action, BIS recently sent letters to some leading EDA suppliers requesting a suspension of shipments to Chinese customers. In response, a BIS spokesperson stated: "The US Department of Commerce is reviewing exports of strategic significance to China. In some cases, existing export licenses may be suspended or additional license requirements may be imposed during the review period." Synopsys CEO Sassine Ghazi said on a May 28 conference call that the company had not yet received formal notification from BIS, though he acknowledged the reports: "We cannot speculate on the potential impact of a notification that has not yet been received." This is not the first US cutoff of EDA supply to China. In 2019, after Huawei was added to the Entity List, Synopsys, Cadence, and Mentor Graphics (now Siemens EDA) were required to suspend software licensing and updates to Huawei. In August 2022, the US Department of Commerce further tightened export controls on EDA tools used for advanced-process chip design at 3 nanometers and below, aiming to limit China's development in cutting-edge chip design.
These ongoing measures indicate that cutting off EDA supply to China is a key link in the US semiconductor strategy, with the core goal of curbing China's ability to advance in high-end chip design and manufacturing.
    - May 30, 2025
  • IBM To Bring Deca's Fan-Out Packaging Technology To North America
    IBM To Bring Deca's Fan-Out Packaging Technology To North America
    IBM has formed an alliance with Deca Technologies to enter the fan-out wafer-level packaging (FOWLP) market using Deca's MFIT technology, with plans to build a new production line at its Bromont plant in Canada in the second half of 2026. On the 22nd, the two parties signed a contract to bring Deca's M-Series and Adaptive Patterning technologies into the plant, focusing on MFIT to expand the supply chain for high-performance chiplet integration. Global FOWLP capacity is concentrated in Asia, and North America is now expanding local capacity. IBM today focuses on chip design and packaging; this cooperation aims to capture markets such as AI and also reflects the regionalization trend in the global semiconductor industry chain.

IBM and Deca Technologies form an alliance in semiconductor packaging

IBM and Deca Technologies have formed an important alliance in semiconductor packaging that will take IBM into the advanced fan-out wafer-level packaging market. Under the plan, IBM expects to establish a new high-volume production line within its existing packaging plant in Bromont, a town in southern Quebec, Canada. At some point in the future, the new line is expected to produce advanced packages based on Deca's M-Series Fan-out Interposer Technology (MFIT), which enables a new class of complex multi-chip packages. IBM has for many years provided packaging and test services at Bromont, both for its internal needs and for external clients; with the Deca announcement, it will expand those capabilities into FOWLP. To recap the basics: after a chip is fabricated in a wafer fab, it is assembled into a package, a small enclosure that protects one or more chips from harsh operating conditions.
FOWLP is an advanced packaging form that can integrate complex chips into a package; like other package types, it helps improve chip performance. Deca's MFIT is an advanced form of FOWLP in which the latest memory devices, processors, and other chips can be integrated in a 2.5D/3D package. Deca CEO Tim Olson describes MFIT as a high-density integration platform for AI and other memory-intensive computing applications. FOWLP is an enabling technology, but most if not all global FOWLP capacity sits in Asia, where companies such as ASE and TSMC produce fan-out packages. Some customers, however, may wish to manufacture and package chips in North America, and in the future they may have two new fan-out capacity options there: IBM is working toward one, and SkyWater, a US wafer foundry, is developing fan-out capacity based on Deca technology at a factory in the United States.

A brief history of IBM

IBM is an iconic brand in computing with a long, and in semiconductors sometimes painful, history. Its origins trace to 1911 and a company called the Computing-Tabulating-Recording Company (CTR), which provided record-keeping and measurement systems; in 1924 CTR was renamed International Business Machines. In 1952, IBM launched its first commercial/scientific computer, the 701 Electronic Data Processing Machine, which combined three electronic technologies: vacuum tubes, magnetic drums, and magnetic tape. Four years later, IBM set up a new semiconductor R&D team with the goal of finding a technology to replace outdated vacuum tubes in its systems.
In the 1960s, IBM developed a newer, more advanced alternative: solid-state electronics based on an emerging technology called the integrated circuit (IC), and it subsequently adopted ever more advanced chip technology across its computer product lines. In 1966, IBM established its Microelectronics unit, which became the company's semiconductor division; at that time the company developed chips only for its own systems. The same year, IBM's Robert Dennard invented DRAM, still the main memory in personal computers, smartphones, and other products today. Another milestone came in 1993, when IBM entered the commercial semiconductor market, manufacturing and selling ASICs, processors, and other chips to external customers. In the 1990s, IBM also entered the foundry business, setting up competition with companies such as TSMC; it offered leading-edge processes and RF technology to foundry customers and produced chips in its own fabs. In the 2010s, however, IBM's Microelectronics division ran into trouble: it struggled in the commercial semiconductor business, losing millions of dollars, and its foundry business also faltered. In 2014, IBM sold the division, including its fabs and foundry operations, to GlobalFoundries (GF); IBM in fact paid GF approximately $1.5 billion to take the unit off its hands.

IBM's current semiconductor and packaging work

Times have changed. Today IBM provides systems as well as hybrid-cloud and consulting services, and it remains involved in semiconductors: it designs processors and other chips but no longer fabricates them in its own fabs, relying instead on contract manufacturers. IBM also operates a large semiconductor R&D center in New York.
In 2015, the company's researchers developed a groundbreaking transistor technology called the nanosheet, essentially a next-generation gate-all-around (GAA) transistor. IBM has also provided packaging and test services to customers at Bromont for many years; the Bromont plant is in fact the largest outsourced semiconductor assembly and test (OSAT) site in North America, offering flip-chip packaging and test services there. In addition, IBM is developing an assembly process for co-packaged optics. IBM has also formed an important alliance with Rapidus, a wafer-foundry startup headquartered in Japan, which is developing a 2nm process based on IBM's nanosheet transistor technology; the two are also jointly developing methods for producing chiplets. Chiplets are essentially small modular dies that are electrically connected and then combined in one package to form a brand-new, complex chip. Now, IBM is working with Deca to build fan-out packaging capabilities; according to IBM's website, the company plans to bring up its FOWLP manufacturing capability in the second half of 2026.

What is fan-out?

FOWLP is not a new technology; it has a long development history and gained fame in 2016, when Apple used TSMC's fan-out packaging in the iPhone 7. In that package, TSMC stacked DRAM chips on top of the application processor, the A10, designed by Apple and manufactured by TSMC on a 16-nanometer process. Apple adopted TSMC's fan-out packaging in subsequent smartphones as well. FOWLP has a wide range of applications: fan-out packages can integrate multiple chips and components such as MEMS, filters, crystals, and passive devices. But its distinguishing strength is the ability to build small packages with a large number of I/O connections.
In many cases, small chips end up in large packages, wasting space. According to ASE, in a fan-out package the package size is roughly the same as the chip itself; fan-out packaging can be defined as any package in which connections fan out from the chip surface to support more external I/O. Taiwan's ASE, the world's largest OSAT, runs a fan-out production line based on Deca's M-Series technology, and South Korean OSAT Nepes is another Deca licensee. On the R&D side, IBM and SkyWater are both developing fan-out packaging based on Deca's technology; last year SkyWater and Deca announced a $120 million contract with the US Department of Defense, and SkyWater expects to produce fan-out packages at its US factory by the end of this year. Deca, meanwhile, has developed multiple versions of its M-Series fan-out technology, which help customers build single-chip and multi-chip packages, 3D packages, and chiplet-based designs. Deca has also developed a manufacturing technique called Adaptive Patterning for the M-Series, used to produce fine-pitch fan-out packages. The M-Series includes a version called MFIT, an advanced technology covering double-sided routing, dense 3D interconnects, and embedded bridge dies; it lets customers build multi-chip packages integrating high-bandwidth memory (HBM), processors, and other devices. "MFIT utilizes M-Series chip-first fan-out technology combined with embedded bridges to create a high-density interposer, onto which the processor and memory chips are then mounted," Deca's Olson said. "Adaptive Patterning enables extremely high density at pitches below 10 µm."
"MFIT uses Deca's second-generation technology, which starts at a 20 µm pitch for embedded components with a roadmap to progressively finer pitches," he said. "The flip-chip technology used on the interposer for chip-level devices starts in line with today's industry-leading pitches, also with a roadmap to finer pitches. Adaptive Patterning can extend to finer pitches while maintaining strong manufacturability through design during the manufacturing process." Fan-out is not the only option in advanced packaging; alternatives include 2.5D and 3D packaging as well as chiplet approaches. In short, the market offers multiple options, and more innovation is coming.
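To make concrete why the pitch figures Olson quotes matter, the sketch below bounds the I/O count of a square interconnect grid at several pitches. Only the 20 µm and sub-10 µm pitches come from the article; the 10 mm x 10 mm routing area is an arbitrary assumption chosen purely for illustration.

```python
# Illustrative: how interconnect pitch caps achievable I/O count in a
# fan-out package. The 10 mm x 10 mm routing area is an assumed figure;
# the 20 um and <10 um pitches are the ones quoted by Deca above.

def max_io(side_mm: float, pitch_um: float) -> int:
    """Upper bound on pads in a square grid of the given side and pitch."""
    per_side = int(side_mm * 1000 // pitch_um)
    return per_side * per_side

for pitch in (40, 20, 10):  # micrometres
    print(f"{pitch:>3} um pitch -> up to {max_io(10, pitch):,} I/O in 10 x 10 mm")
```

Halving the pitch quadruples the available I/O budget, which is why pushing Adaptive Patterning below 10 µm matters for HBM-class multi-die packages.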
    - May 28, 2025
  • First "Made in India" chip produced by semiconductor factories in the Northeast region
    First "Made in India" chip produced by semiconductor factories in the Northeast region
    Indian Prime Minister Narendra Modi announced on Friday (May 23) that India will soon get its first "Made in India" chip from a semiconductor plant in the Northeast region, saying the region is becoming an important destination for both the energy and semiconductor industries. "Today the Northeast is playing an increasingly important role in strengthening India's semiconductor ecosystem. India will soon get its first 'Made in India' chip from a semiconductor plant in the Northeast," Modi said in his inaugural address at the Rising Northeast Investors Summit 2025. Last August, Tata Group began building a semiconductor plant in Assam with a total investment of 270 billion rupees. The Prime Minister said the plant has opened up opportunities for the semiconductor industry and other cutting-edge technologies in the region. Modi said the government is investing at scale in hydropower and solar across the northeastern states, with projects worth tens of millions of rupees already allocated. He said investors have the opportunity to invest not only in factories and infrastructure in the Northeast but in the region's manufacturing industry more broadly, and emphasized that significant investment is needed in solar modules, batteries, energy storage, and R&D, because they represent the future: "The more we invest in the future, the less we rely on other countries." Robust roads, good power infrastructure, and logistics networks, he said, are the pillars of all industry; where there is seamless connectivity, trade flourishes, which makes robust infrastructure the first condition and foundation of any development. Modi said the trade potential of the Northeast region will double in the next decade.
At present, trade between India and ASEAN is close to 125 billion US dollars. In the coming years this trade volume is expected to exceed 200 billion US dollars, and the Northeast region will become a solid bridge to achieving that goal. He stated that the Northeast region will become a trade gateway to ASEAN. Adani Group Chairman Gautam Adani announced in a speech that the group will invest an additional 500 billion rupees in the Northeast region over the next 10 years. Three months ago, the group had promised to invest 500 billion rupees in Assam.
    - May 24, 2025
  • Proposal and Working Principle of Gallium Oxide p-NiO Heterojunction Bidirectional Switching Device
    Proposal and Working Principle of Gallium Oxide p-NiO Heterojunction Bidirectional Switching Device
    The Power P-GaN SJ BDS (gallium nitride superjunction high-voltage bidirectional switching device) addresses a weakness of the earlier lateral PSJ and lateral p-GaN RESURF structures: surge stress concentrates along the line closest to the edge of the polarization structure, creating overload-surge reliability issues, and the large capacitance of the RESURF field plate aggravates hot-electron injection in that region under surge overload. So Erbao thought it over and decided to instead use multiple field-limiting rings made of thin p-GaN layers, all connected to the drain to form a uniform voltage divider, while also keeping a RESURF superjunction voltage-blocking structure.   So, why not try building another bidirectional high-voltage switch?     What to call it? Power P-GaN SJ BDS, the gallium nitride superjunction high-voltage bidirectional switching device?     A friend left a message asking: Erbao, didn't you share and discuss new gallium oxide device structures at the Nanjing meeting on Saturday? Can this superjunction bidirectional switching SJ BDS structure be used for gallium oxide devices?     Of course, Erbao wants to give it a try too. If one day a second Shuji Nakamura appears and discovers a new buffer growth technique that can grow quasi-single-crystal-quality gallium oxide epitaxial layers directly on dissimilar substrates such as silicon or sapphire wafers, then perhaps gallium oxide materials will shine in heteroepitaxial lateral high-voltage devices, and even high-voltage integrated ICs, and may even replace GaN or silicon carbide devices in many fields?     
The heterojunction bidirectional switching device composed of gallium oxide (Ga₂O₃) and p-type nickel oxide (p-NiO) is a new type of power electronic device. Its working principle combines the properties of wide-bandgap semiconductor materials, heterojunction band engineering, and superjunction structure design to achieve high blocking voltage, low loss, and bidirectionally controllable switching. A detailed analysis of its working principle follows:

---

**1. Material and structural characteristics**

- **Gallium oxide (Ga₂O₃)**:
  - An ultra-wide-bandgap semiconductor (bandgap about 4.8-4.9 eV) with an extremely high critical breakdown field (about 8 MV/cm), well suited to high-voltage applications.
  - Naturally n-type; stable p-type doping is lacking, so a p-type material (such as p-NiO) must be introduced via a heterojunction.
- **p-type nickel oxide (p-NiO)**:
  - A p-type transparent conductive oxide that forms a heterojunction with Ga₂O₃, compensating for the missing p-type conductivity of Ga₂O₃ and providing hole-injection capability.
  - The band alignment at the heterojunction interface is critical for carrier transport (it may form a type-II band structure that promotes charge separation).
- **Superjunction structure**:
  - Composed of alternating p-NiO and n-Ga₂O₃ regions; charge balance optimizes the lateral electric-field distribution, significantly raising the breakdown voltage while lowering the on-resistance.

---

**2. Bidirectional switching mechanism**

**(1) Blocking state (off state)**

- **Forward and reverse blocking**:
  - Under voltage of either polarity, the heterojunction interface and the superjunction structure spread the electric field uniformly through the depletion regions, avoiding local field crowding.
  - The charge balance of the superjunction lets the vertical field (perpendicular to the junction) be shared with the lateral field (parallel to the junction), significantly increasing the breakdown voltage (up to several thousand volts).

**(2) Conducting state (on state)**

- **Bidirectional carrier injection**:
  - Forward bias (Ga₂O₃ terminal positive): holes from p-NiO are injected into Ga₂O₃ and electrons from Ga₂O₃ into p-NiO, lowering the heterojunction barrier and producing bipolar conduction.
  - Reverse bias (Ga₂O₃ terminal negative): thanks to the symmetric superjunction design, a conduction path also forms at the p-NiO/Ga₂O₃ interface under reverse bias, enabling bidirectional current flow.
- The high doping concentration of the superjunction regions further reduces the on-resistance (Ron) and improves efficiency.

**(3) Switch triggering mechanism**

- **Voltage-triggered**:
  - When the applied voltage exceeds the threshold, avalanche breakdown or tunneling in the heterojunction depletion region multiplies carriers and turns the device on rapidly.
- **Field-controlled**:
  - Active switching control is achieved by modulating the heterojunction barrier height via a gate (if one is designed in) or the structural electric field.

---

**3. Key advantages**

- **High blocking voltage**: the superjunction structure and the high breakdown field of Ga₂O₃ together support blocking voltages in the kilovolt range.
- **Low conduction loss**: the bipolar conduction mechanism (electrons and holes both carry current) reduces Ron and improves energy efficiency.
- **Bidirectional symmetry**: the structural design ensures consistent electrical characteristics in both the forward and reverse directions, suiting AC circuits and bidirectional power control.
- **High-temperature stability**: the wide-bandgap material tolerates high temperatures, suiting harsh environments.

---

**4. Potential applications**

- High-voltage DC/AC converters, e.g. smart grids and electric-vehicle charging systems.
- Solid-state circuit breakers: fast-response, high-reliability circuit protection.
- RF power devices: high-frequency, high-power communication systems.

---

**5. Challenges and research directions**

- **Interface optimization**: defects at the Ga₂O₃/p-NiO heterojunction interface may impair carrier transport; annealing or interface passivation is needed.
- **Thermal management**: Ga₂O₃ has low thermal conductivity and must be paired with heat-dissipation design (such as diamond-substrate integration).
- **Process compatibility**: heteroepitaxial growth and superjunction fabrication are complex; low-cost mass-production techniques need to be developed.

---

**Summary**

Through the synergy of heterojunction band engineering and superjunction charge-balance design, the gallium oxide/p-NiO heterojunction bidirectional switching device achieves high-voltage bidirectional conduction and fast switching, promising to break through the performance limits of traditional silicon-based devices and advance next-generation high-power electronic systems.
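As an order-of-magnitude check on the kilovolt-class blocking voltages cited above, the ideal triangular-field breakdown estimate follows directly from the critical field of about 8 MV/cm quoted earlier; the 5 µm drift width here is an assumed illustrative value, and real superjunction designs trade this simple bound off against charge balance:

```latex
% Ideal (triangular-field) breakdown estimate for a drift region of width W_d:
V_{\mathrm{BR}} \approx \tfrac{1}{2}\,E_{c}\,W_{d}
  = \tfrac{1}{2}\,\bigl(8\times 10^{6}\,\mathrm{V/cm}\bigr)\bigl(5\times 10^{-4}\,\mathrm{cm}\bigr)
  \approx 2\,\mathrm{kV}
```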
    - May 21, 2025
  • Teach you how to design RISC-V CPU
    Teach you how to design RISC-V CPU
    In recent years, RISC-V has attracted global attention. This revolutionary ISA has swept the market with its continuous innovation, countless learning and tool resources, and contributions from the engineering community. The biggest charm of RISC-V is that it is an open-source ISA. In this article, I (referring to the author of this article, Mitu Raj, the same below) will introduce how to design a RISC-V CPU from scratch. We will explain the process of defining specifications, designing and improving the architecture, identifying and solving challenges, developing RTL, implementing the CPU, and testing it in simulation and on an FPGA board.   Start with a Name   It is important to name or brand your idea so that you can keep going until you reach your goal! We are going to build a very simple processor, so I came up with a fancy name, "Pequeno", which means "tiny" in Spanish; the full name is Pequeno RISC-V CPU, aka PQR5. RISC-V has many flavors and extensions of the ISA. We will start with the simplest one, RV32I, aka the 32-bit base integer ISA. This ISA is suitable for building 32-bit CPUs that support integer operations. So, the first spec of Pequeno is as follows: Pequeno is a 32-bit RISC-V CPU that supports the RV32I ISA. RV32I has 37 32-bit base instructions that we plan to implement in Pequeno. Therefore, we have to understand each instruction in depth. It took me a while to fully grasp the ISA. In the process, I learned the complete specification and designed my own assembler, pqr5asm, which was verified against some popular RISC-V assemblers. The six-letter word "RISBUJ" summarizes the instruction types in RV32I. These 37 instructions belong to one of the following categories: R-type: All integer computation instructions on registers. I-type: All integer computation instructions based on registers and immediate values. Also includes JALR and Load instructions. S-type: All store instructions. B-type: All branch instructions. 
U-type: Special instructions such as LUI, AUIPC. J-type: Jump instructions like JAL. There are 32 general-purpose registers in the RISC-V architecture, x0-x31. All registers are 32 bits. Among these 32 registers, x0, also called zero, is a useful special register. It is hardwired to zero, cannot be written, and always reads as zero. So what is it used for? You can use x0 as a dummy destination to dump results you don't want to read, as the operand zero, or to generate NOP instructions to idle the CPU. Integer computation instructions are ALU instructions that operate on registers and/or 12-bit immediate values. Load/store instructions move data between registers and data memory. Jump/branch instructions transfer program control to different locations. Details of each instruction can be found in the RISC-V specification: RISC-V User Level ISA v2.2. To learn the ISA, the RISC-V specification document is enough. However, for more clarity, you can study the RTL implementations of different open cores. In addition to the 37 basic instructions, I have added 13 pseudo/custom instructions to pqr5asm, extending the ISA to 50 instructions. These instructions are derived from the basic instructions and are intended to simplify the assembly programmer's life... For example: the NOP pseudo-instruction expands to ADDI x0, x0, 0, which of course does nothing on the CPU, but is much simpler and clearer to write in code. Before we start designing the processor architecture, we should fully understand how each instruction is encoded in 32-bit binary and what it does.   The RISC-V RV32I assembler PQR5ASM that I developed in Python can be found on my GitHub. You can refer to the Assembler Instruction Manual to write sample assembly code. Compile it and see how it converts to 32-bit binary to consolidate/verify your understanding before moving on to the next step.   
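To make the encoding concrete before moving on, here is a minimal sketch (in Python, like pqr5asm, but not the actual pqr5asm code) of how an I-type instruction such as ADDI packs into 32 bits. The field layout comes from the RV32I spec; the helper name is mine:

```python
def encode_i_type(opcode, rd, funct3, rs1, imm):
    """Pack an RV32I I-type instruction: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | ((rs1 & 0x1F) << 15) \
         | ((funct3 & 0x7) << 12) | ((rd & 0x1F) << 7) | (opcode & 0x7F)

OP_IMM = 0b0010011  # opcode shared by ADDI, SLTI, ANDI, ...

# NOP is the pseudo-instruction ADDI x0, x0, 0:
nop = encode_i_type(OP_IMM, rd=0, funct3=0b000, rs1=0, imm=0)
print(hex(nop))  # → 0x13 (i.e. 0x00000013)

# ADDI x1, x2, 5:
print(hex(encode_i_type(OP_IMM, rd=1, funct3=0b000, rs1=2, imm=5)))  # → 0x510093
```

Comparing such hand-computed encodings against a known-good assembler is exactly the kind of cross-check the author describes for pqr5asm.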
    Specifications and Architecture   In this chapter, we define the full specifications and architecture of Pequeno. Last time we simply defined it as a 32-bit CPU. Next, we will go into more detail to get a general idea of the architecture we are going to design. We will design a simple single-core CPU that executes one instruction at a time in the order in which instructions are fetched, but in a pipelined manner. We will not support the RISC-V privileged specification, because we do not plan for our core to support an operating system, nor do we plan to support interrupts. The CPU specifications are as follows: 32-bit CPU, single-issue, single-core. Classic five-stage RISC pipeline. Strictly in-order pipeline. Compliant with the RV32I user-level ISA v2.2. Supports all 37 basic instructions. Separate bus interfaces for instruction and data memory access. (Why? More on that later…) Suitable for bare-metal applications; no support for operating systems and interrupts. (More precisely, a limitation!) As mentioned above, we will support the RV32I ISA. Therefore, the CPU only supports integer operations. All registers in the CPU are 32 bits. The address and data buses are also 32 bits. The CPU uses a classic little-endian, byte-addressed memory space. Each address corresponds to a byte in the CPU address space: 0x00 - byte[7:0], 0x01 - byte[15:8] ... 32-bit words are accessed via 32-bit-aligned addresses, i.e. addresses that are multiples of 4: 0x00 - word 0, 0x04 - word 1... Pequeno is a single-issue CPU, i.e. it fetches only one instruction from memory at a time and issues it for decoding and execution. A single-issue pipelined processor has a maximum IPC = 1 (or minimum/optimal CPI = 1), i.e. the ultimate goal is to execute at a rate of one instruction per clock cycle. This is theoretically the highest performance that can be achieved. 
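The little-endian, byte-addressed layout described above can be sanity-checked with a few lines of Python (chosen because the project's assembler tooling is Python; the snippet is illustrative, not part of Pequeno):

```python
import struct

# A 32-bit word stored at the 4-byte-aligned address 0x00:
word = 0xDEADBEEF
mem = struct.pack('<I', word)   # '<I' = little-endian unsigned 32-bit

# The lowest address holds the least significant byte:
print(hex(mem[0]))  # → 0xef  (byte[7:0]   at address 0x00)
print(hex(mem[1]))  # → 0xbe  (byte[15:8]  at address 0x01)
print(hex(mem[3]))  # → 0xde  (byte[31:24] at address 0x03)
```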
The classic five-stage RISC pipeline is the basic architecture for understanding any other RISC architecture. It is the most straightforward and suitable choice for our CPU. The architecture of Pequeno is built around this five-stage pipeline. Let's dive into the underlying concepts. For simplicity, we will not support timers, interrupts, and exceptions in the CPU pipeline. Therefore, CSRs and privilege levels do not need to be implemented either, and the RISC-V privileged ISA is not included in the current implementation of Pequeno. The simplest way to design a CPU is the non-pipelined way. Let's look at several design approaches for non-pipelined RISC CPUs and understand their drawbacks. Assume the classic sequence of steps that a CPU follows to execute an instruction: fetch, decode, execute, memory access, and write back. The first design approach is to design the CPU as a finite state machine (FSM) with four or five states and perform all operations sequentially. For example:   But this architecture seriously limits instruction execution speed, because each instruction takes multiple clock cycles. For example, writing to a register takes 3 clock cycles. For load/store instructions, memory latency adds further delay. This is a bad and primitive way to design a CPU. Let's get rid of it completely! The second approach is to fetch the instruction from instruction memory, decode it, and execute it entirely in combinational logic. Then, the result of the ALU is written back to the register file. The whole process, up to the write back, completes in one clock cycle. Such a CPU is called a single-cycle CPU. If the instruction needs to access data memory, read/write latency must be taken into account. 
If the read/write latency is one clock cycle, then a store instruction can still execute in one clock cycle like all other instructions, but a load instruction may require an additional clock cycle, because the loaded data must be written back to the register file. The PC generation logic must handle the effect of this latency. If the data memory read interface is combinational (asynchronous read), then the CPU becomes truly single-cycle for all instructions.   The main disadvantage of this architecture is obviously the long critical path of combinational logic from instruction fetch to the write to memory/register file, which limits timing performance. However, this design approach is simple and suitable for low-end microcontrollers where low clock speed, low power, and small area are required. To achieve higher clock speeds and performance, the CPU's sequential instruction processing can be split up. Each sub-process is assigned to an independent processing unit. These processing units are cascaded to form a pipeline. All units work in parallel, each operating on a different part of instruction execution. In this way, multiple instructions can be processed in parallel. This technique for achieving instruction-level parallelism is called instruction pipelining. This execution pipeline forms the core of a pipelined CPU.   The classic five-stage RISC pipeline has five processing units, also called pipeline stages: Instruction Fetch (IF), Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). The working principle of the pipeline can be intuitively represented as follows:   Each clock cycle, different parts of an instruction are processed, and each stage processes a different instruction. If you look closely, you will see that instruction 1 only completes execution in the 5th cycle. This delay is called the pipeline delay, and it equals the number of pipeline stages. 
After this delay, cycle 6: instruction 2 completes, cycle 7: instruction 3 completes, and so on... In theory, we can calculate the throughput (instructions per cycle, IPC) as follows: for n instructions on a k-stage pipeline, completion takes k + n - 1 cycles, so IPC = n / (k + n - 1), which approaches 1 as n grows large.   Therefore, a pipelined CPU approaches a rate of one instruction executed per clock cycle. This is the maximum IPC possible in a single-issue processor. By splitting the critical path across multiple pipeline stages, the CPU can now also run at higher clock speeds. Mathematically, this gives a pipelined CPU a multiple of the throughput of an equivalent non-pipelined CPU.   This is called pipeline speedup. In simple terms, a CPU with an s-stage pipeline can run at up to s times the clock speed of its non-pipelined counterpart. Pipelining generally increases area/power consumption, but the performance gain is worth it. The math assumes that the pipeline never stalls, that is, data continues to flow from one stage to another on every clock cycle. But in real CPUs, pipelines can stall for a variety of reasons, chiefly structural/control/data dependencies. For example: the Nth instruction cannot yet read register x, because the (N-1)th instruction that modifies x has not written it back; this is an example of a data hazard in the pipeline. The Pequeno architecture uses a classic five-stage RISC pipeline. We will implement a strictly in-order pipeline. In an in-order processor, instructions are fetched, decoded, executed, and completed/committed in the order generated by the compiler. If one instruction stalls, the entire pipeline stalls. In an out-of-order processor, instructions are fetched and decoded in the order generated by the compiler, but execution can proceed in a different order. If one instruction stalls, it does not stall subsequent instructions unless there are dependencies; independent instructions can move ahead. Execution can still complete/commit in order (this is how it is in most CPUs today). 
This opens the door to a variety of architectural techniques that significantly improve throughput and performance by reducing clock cycles wasted on stalls and minimizing the insertion of bubbles (what are “bubbles”? Read on…).   Out-of-order processors are fairly complex due to the dynamic scheduling of instructions, but are now the de facto pipeline architecture in today’s high-performance CPUs.   The five pipeline stages are designed as independent units: Fetch Unit (FU), Decode Unit (DU), Execution Unit (EXU), Memory Access Unit (MACCU), and Write Back Unit (WBU).   Fetch Unit (FU): The first stage of the pipeline; interfaces with the instruction memory. The FU fetches instructions from the instruction memory and sends them to the Decode Unit. The FU may contain instruction buffers, initial branch logic, etc. Decode Unit (DU): The second stage of the pipeline, responsible for decoding instructions from the Fetch Unit (FU). The DU also initiates read accesses to the register file. Packets from the DU and the register file are retimed and sent together to the Execution Unit. Execution Unit (EXU): The third stage of the pipeline, which validates and executes all decoded instructions from the DU. Invalid/unsupported instructions are not allowed to continue in the pipeline and become "bubbles". The Arithmetic Logic Unit (ALU) is responsible for all integer arithmetic and logical instructions. The Branch Unit handles jump/branch instructions. The Load/Store Unit handles load/store instructions that require memory access. Memory Access Unit (MACCU): The fourth stage of the pipeline, which interfaces with the data memory. The MACCU initiates all memory accesses based on instructions from the EXU. The data memory is the addressing space, which may consist of data RAM, memory-mapped I/O peripherals, bridges, interconnects, etc. Write Back Unit (WBU): The fifth and last stage of the pipeline. 
Instructions complete execution here. The WBU is responsible for writing the data (or load data) from the EXU/MACCU back to the register file. Between the pipeline stages, a valid-ready handshake is implemented. This is not so obvious at first glance. Each stage registers a data packet and sends it to the next stage. This packet may contain instruction/control/data information to be used by the next stage or subsequent stages. The packet is qualified by a valid signal. If the packet is invalid, it is called a bubble in the pipeline. A bubble is nothing more than a "hole" in the pipeline that simply moves forward without performing any actual operation, similar to a NOP instruction. But don't think bubbles are useless! We will see one use for them later, when discussing pipeline hazards. The following table defines bubbles in the Pequeno instruction pipeline.   Each stage can also stall the previous stage by asserting a stall signal. Once stalled, a stage retains its data packet until the stall condition disappears. This signal is simply the inverted ready signal. In an in-order processor, a stall at any stage amounts to a global stall, as it eventually stalls the entire pipeline.   The flush signal is used to flush the pipeline. A flush invalidates all packets registered by the previous stages at once, as they are identified as no longer useful.   For example, when the pipeline has fetched and decoded instructions from the wrong path after a jump/branch instruction, and the mistake is only identified in the execute stage, the pipeline should be flushed and fetching should resume from the correct branch!   Although pipelining significantly improves performance, it also increases the complexity of the CPU architecture. Pipelining always comes with its evil twin: pipeline hazards! Now, let's assume that we know nothing about pipeline hazards. 
We didn't consider the hazards when designing the architecture.   Dealing with Pipeline Hazards   In this chapter, we will explore pipeline hazards. Last time, we successfully designed a pipeline architecture for the CPU, but we didn't consider the "evil twin" that comes with pipelines. What impact can pipeline hazards have on the architecture? What architectural changes are needed to mitigate them? Let's go ahead and demystify them! Hazards in the CPU instruction pipeline are dependencies that interfere with its normal execution. When a hazard occurs, an instruction cannot execute in its designated clock cycle, because doing so could produce incorrect results or an incorrect control flow. The pipeline may therefore be forced to stall until the instruction can execute successfully.   In the above example, the CPU executes instructions in the order generated by the compiler. Assume that instruction i2 has a dependency on i1: i2 needs to read a register that is being modified by the preceding instruction i1. Therefore, i2 must wait until i1 writes its result back to the register file; otherwise stale data would be read from the register file and used by the execute stage. To avoid this data inconsistency, i2 is forced to stall for three clock cycles. The bubbles inserted in the pipeline represent the stall or wait state. i2 is decoded only when i1 has completed. Eventually, i2 completes execution in the 10th clock cycle instead of the 7th. A three-clock-cycle delay is introduced by the stall caused by the data dependency. How does this delay affect CPU performance?   Ideally, we expect the CPU to run at full throughput, i.e. CPI = 1. However, when the pipeline stalls, throughput/performance decreases because the CPI increases. For non-ideal CPUs: CPI = 1 + (average stall cycles per instruction).   There are various ways in which hazards occur in the pipeline. 
Pipeline hazards can be divided into three categories:   Structural hazards Control hazards Data hazards   Structural hazards occur due to hardware resource conflicts, for example, when two stages of the pipeline want to access the same resource in the same clock cycle: two instructions both need to access memory.   In the above example, the CPU has only one memory for storing instructions and data. The instruction fetch stage accesses the memory every clock cycle to fetch the next instruction. Therefore, the instructions in the fetch stage and the memory access stage conflict whenever the older instruction in the memory access stage also needs the memory. This forces the CPU to insert stall cycles; the fetch stage must wait until the instruction in the memory access stage releases the resource (the memory). Some ways to mitigate structural hazards: Stall the pipeline until the resource is available. Duplicate the resource so that there is no conflict. Pipeline the resource so that the two instructions occupy different stages of the pipelined resource. Let's analyze the different situations that can cause structural hazards in Pequeno's pipeline and how to solve them. We do not intend to use stalling as an option to mitigate structural hazards! In Pequeno's architecture, we applied these solutions to mitigate the various structural hazards. Control hazards are caused by jump/branch instructions, the flow-control instructions of the CPU's ISA. When control reaches a jump/branch instruction, the CPU must decide whether the branch will be taken. At this point, the CPU should take one of the following actions: fetch the next instruction at PC+4 (branch not taken), or fetch the instruction at the branch target address (branch taken). 
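The two possible fetch actions just listed amount to a next-PC selector. A minimal sketch in Python (a hypothetical helper, not Pequeno's actual RTL; RV32I instructions are 4 bytes, hence PC + 4):

```python
def next_pc(pc, is_branch, take_branch, branch_target):
    """Choose the next fetch address: the branch target if the branch is
    taken, otherwise the next sequential instruction at PC + 4."""
    if is_branch and take_branch:
        return branch_target
    return pc + 4

print(hex(next_pc(0x100, is_branch=False, take_branch=False, branch_target=0x200)))  # → 0x104
print(hex(next_pc(0x100, is_branch=True,  take_branch=True,  branch_target=0x200)))  # → 0x200
```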
The correctness of that decision can only be determined when the execute stage computes the result of the branch instruction. Depending on whether the branch is taken or not, the branch address (the address the CPU should branch to) is resolved. If the earlier decision was wrong, all instructions fetched and decoded into the pipeline before that clock cycle must be discarded, because they should never have been executed at all! This is achieved by flushing the pipeline and fetching the instruction at the branch address on the next clock cycle. Flushing invalidates the instructions, converting them into NOPs or bubbles. This costs a large number of clock cycles as a penalty, called the branch penalty. Control hazards therefore have the worst impact on CPU performance.   In the above example, i10 completed execution in the 10th clock cycle, but it should have completed in the 7th. Because the wrong instructions following the branch (i5) entered the pipeline, 3 clock cycles were lost. When the execute stage identifies the wrong branch decision in the 4th clock cycle, the pipeline must be flushed. How does this affect CPU performance? If a program running on the above CPU contains 30% branch instructions, the CPI becomes 1 + 0.3 × 3 = 1.9: CPU performance is reduced by almost 50%! To mitigate control hazards, we can adopt some strategies in the architecture... If the instruction is identified as a branch instruction, just stall the pipeline. This decoding logic can be implemented in the fetch stage itself. Once the branch instruction executes and the branch address is resolved, the next instruction can be fetched and the pipeline resumed. Alternatively, add dedicated branch logic, like branch prediction, in the fetch stage. The essence of branch prediction is: we use some prediction logic in the instruction fetch stage to guess whether the branch will be taken. In the next clock cycle, we fetch the guessed instruction. 
This instruction is either fetched from PC+4 (predicted not-taken) or from the branch target address (predicted taken). Now there are two possibilities: If the prediction turns out to be correct in the execute stage, nothing is done and the pipeline continues. If the prediction turns out to be wrong, the pipeline is flushed and the correct instruction is fetched from the branch address resolved in the execute stage, incurring a branch penalty. As you can see, branch prediction still incurs a branch penalty when it predicts wrongly. The design goal should be to reduce the probability of misprediction. The performance of a CPU depends a lot on how “good” the prediction algorithm is. Sophisticated techniques like dynamic branch prediction keep instruction history in order to predict correctly with 80% to 90% probability. To mitigate control hazards in Pequeno, we will implement a simple branch prediction logic. More details will be revealed in our upcoming blog on the design of the fetch unit.   A data hazard occurs when the execution of an instruction has a data dependency on the result of a previous instruction still being processed in the pipeline. Let's walk through the three types of data hazards with examples. Suppose an instruction i1 writes a result to register x. The next instruction i2 also writes a result to the same register. Any subsequent instruction in program order should read the result of i2 at x; otherwise, data integrity is compromised. This data dependency is called an output dependency and can lead to a WAW (Write-After-Write) data hazard.   Suppose an instruction i1 reads register x. The next instruction, i2, writes a result to the same register. Here, i1 should read the old value of register x, not the result of i2. If i2 writes its result to x before i1 reads it, a data hazard results. 
This data dependency is called an anti-dependency and can lead to a WAR (Write-After-Read) data hazard.   Suppose an instruction, i1, writes a result to register x. The next instruction, i2, reads the same register. Here, i2 should read the value written by i1 to register x, not the previous value. This data dependency is called a true dependency and can lead to a RAW (Read-After-Write) data hazard.   This is the most common and dominant type of data hazard in pipelined CPUs. To mitigate data hazards in in-order CPUs, we can use several techniques: When a data dependency is detected, the pipeline is stalled (see the first figure). The dependent instruction waits in the decode stage until the previous instruction has executed. Compiler rescheduling: the compiler reorders the code so that the dependent instruction executes later, avoiding the data hazard. The idea is to avoid stalls without affecting the integrity of the program's control flow, but this is not always possible. The compiler can also insert a NOP instruction between two instructions with a data dependency, but this causes stalls, which hurt performance.   Data/operand forwarding: This is the prominent architectural solution for mitigating RAW data hazards in in-order CPUs. Let's analyze the CPU pipeline to understand the principle behind this technique. Suppose two adjacent instructions i1 and i2 have a RAW data dependency because both access register x. The CPU should stall instruction i2 until i1 writes its result back to register x. If the CPU has no stall mechanism, i2 will read a stale value of x in the decode stage in the third clock cycle, and in the fourth clock cycle i2 will execute with the wrong value of x.   If you look closely at the pipeline, we already have the result of i1 in the third clock cycle. True, it has not been written back to the register file, but the result is available at the output of the execute stage. 
So if we can somehow detect the data dependency and "forward" that data to the input of the execute stage, the next instruction can use the forwarded data instead of the stale data from the decode stage. The data hazard is thereby mitigated! The idea is this:

This is called data/operand forwarding, or data/operand bypassing. We forward the data forward in time so that subsequent dependent instructions in the pipeline can access the bypassed data and execute correctly in the execute stage.

The idea can be extended to different stages. In a 5-stage pipeline that executes instructions in the order i1, i2, ... in, a data dependency may exist between:

i1 and i2 - requires a bypass from the execute stage to the output of the decode stage.
i1 and i3 - requires a bypass from the memory access stage to the output of the decode stage.
i1 and i4 - requires a bypass from the writeback stage to the output of the decode stage.

The architectural solution for mitigating RAW data hazards originating from any stage of the pipeline is as follows:

Consider the following scenario: there is a data dependency between two adjacent instructions i1 and i2, where the first instruction is a load. This is a special case of a data hazard. Here, we cannot execute i2 until the data has been loaded into x1. So the question is: can we still mitigate this data hazard with data forwarding?

The load data only becomes available in the memory access stage of i1, and it would have to be forwarded to the decode stage of i2 to prevent the hazard. The requirement is as follows: assuming the load data is available in the memory access stage in cycle 4, you would need to "forward" this data back to cycle 3, to the decode-stage output of i2 (why cycle 3? Because in cycle 4, i2 has already finished in the execute stage!). Essentially, you would be forwarding present data into the past, which is impossible unless your CPU can time travel! This is not data forwarding but "data backtracking".
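The bypass-source selection described above, and the one case forwarding cannot fix (a load feeding its immediate successor), can be sketched behaviorally in Python. This is an illustrative model under assumed signal names, not the actual RTL:

```python
def forward_select(src, ex_rd, mem_rd, wb_rd):
    """Choose the forwarding source for operand register `src`, read at the
    decode-stage output. Younger stages take priority, since they hold the
    most recent producer. x0 never forwards: it is hardwired to zero."""
    if src == 0:
        return "REGFILE"
    if src == ex_rd:
        return "EX"       # producer is one instruction ahead
    if src == mem_rd:
        return "MEM"      # producer is two instructions ahead
    if src == wb_rd:
        return "WB"       # producer is three instructions ahead
    return "REGFILE"      # no dependency in flight

def needs_interlock(prev_is_load, prev_rd, rs1, rs2):
    """A load followed immediately by a consumer of its destination register
    cannot be fixed by forwarding alone (the data does not exist yet); a
    one-cycle bubble is required."""
    return prev_is_load and prev_rd != 0 and prev_rd in (rs1, rs2)
```

Note the priority order in forward_select: if the same register is being written by several in-flight instructions, only the youngest producer's value is the architecturally correct one to bypass.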
Data forwarding can only be done forward in time.

This data hazard is called a pipeline interlock. The only way to solve it is to insert a bubble that stalls the pipeline for one clock cycle when the data dependency is detected. A NOP instruction (aka bubble) is inserted between i1 and i2. This delays i2 by one cycle, so data forwarding can now forward the load data from the memory access stage to the output of the decode stage.

So far, we have only discussed how to mitigate RAW data hazards. What about WAW and WAR hazards? In-order pipelined implementations of the RISC-V architecture are inherently immune to WAW and WAR hazards!

All register writebacks are performed in the order in which instructions are issued. Data written back is always overwritten by subsequent instructions writing to the same register. Therefore, a WAW hazard never occurs. Writeback is the last stage of the pipeline. By the time a writeback occurs, the earlier reading instruction has already completed execution on the older data. Therefore, a WAR hazard never occurs.

To mitigate RAW data hazards in Pequeno, we will implement data forwarding in hardware, along with pipeline interlock protection. More details will be revealed later, when we design the data forwarding logic.

We have now understood and analyzed the various pipeline hazards in the CPU architecture that could cause instruction execution to fail, and designed solutions and mechanisms to mitigate them. Let's put together the necessary microarchitecture and finalize the architecture of the Pequeno RISC-V CPU, free of all types of pipeline hazards!

In the following posts, we will dive into the RTL design of each pipeline stage/functional unit. We will discuss the different microarchitectural decisions and challenges during the design phase.

Fetch Unit

From here, we start to dive into the microarchitecture and RTL design! In this chapter, we will build and design the Fetch Unit (FU) of Pequeno.
The Fetch Unit (FU) is the first stage of the CPU pipeline and the one that interacts with the instruction memory. It fetches instructions from the instruction memory and sends them to the Decode Unit (DU). As discussed in the previous post on the improved architecture of Pequeno, the FU contains branch prediction logic and flush support.

1 Interfaces

Let's define the interfaces of the Fetch Unit:

2 Instruction Access Interfaces

The core function of the FU is instruction access, and the Instruction Access Interface (I/F) serves this purpose. Instructions are stored in the instruction memory (RAM) during execution. Modern CPUs fetch instructions from a cache instead of directly from the instruction memory. The instruction cache (the primary or L1 cache in computer architecture terms) sits closer to the CPU and enables faster instruction access by caching frequently accessed instructions and prefetching larger blocks of nearby instructions. This avoids constant accesses to the slower main memory (RAM), so most instructions can be served quickly from the cache. The CPU does not interface with the instruction cache/memory directly; a cache/memory controller sits in between to manage the memory accesses.

It is a good idea to define a standard interface so that any standard instruction memory/cache (IMEM) can be plugged into our CPU with little or no glue logic. Let's define two interfaces for instruction access. The Request Interface (I/F) carries requests from the FU to the instruction memory. The Response Interface (I/F) carries responses from the instruction memory back to the FU. We will define simple valid-ready based request and response interfaces, as these are easy to convert to bus protocols such as APB, AXI, etc. if necessary.
Instruction access requires the address of the instruction in memory. The address sent on the Request I/F is simply the PC generated by the FU. On the FU interface, we use a stall signal instead of a ready signal; it behaves as the inverse of ready. A cache controller usually provides a stall signal to hold off requests from the processor. This signal is cpu_stall.

The response from memory is the fetched instruction, received on the Response I/F. Besides the fetched instruction, the response should also carry the corresponding PC. The PC acts as an ID identifying which request the response belongs to; in other words, it is the address of the instruction that was fetched. This is important information required by the next stages of the CPU pipeline (how is it used? We will see soon!). The fetched instruction and its PC together constitute the response packet to the FU. When the internal pipeline is stalled, the CPU may also need to stall responses from the instruction memory. This signal is mem_stall.

At this point, let's define instruction packet = {instruction, PC} in the CPU pipeline.

3 PC Generation Logic

The core of the FU is the PC generation logic, which drives the Request I/F. Since we are designing a 32-bit CPU, the PC should be generated in increments of 4. After reset, a PC is generated every clock cycle. The reset value of the PC can be hard-coded; this is the address from which the CPU fetches and executes instructions after reset, i.e., the address of the first instruction in memory. PC generation is free-running logic that is stalled only by cpu_stall. The free-running PC can be overridden by the Flush I/F and by the internal branch prediction logic.
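The PC update rule just described can be modeled cycle-by-cycle in Python. This is a behavioral sketch, not the RTL; the reset vector and the priority among flush, prediction, and stall are assumptions for illustration:

```python
RESET_PC = 0x0000_0000   # assumed reset vector; in the real design this is
                         # a hard-coded parameter

def next_pc(pc, cpu_stall, branch_flush, branch_pc, branch_taken, predicted_pc):
    """One cycle of the free-running PC generator. Assumed priority:
    flush from the EXU > local flush from branch prediction > stall."""
    if branch_flush:      # misprediction resolved in EXU: refetch from branch_pc
        return branch_pc
    if branch_taken:      # static prediction redirects the fetch stream
        return predicted_pc
    if cpu_stall:         # IMEM cannot accept a request: hold the PC
        return pc
    return pc + 4         # 32-bit instructions: sequential fetch, +4 per cycle

pc = RESET_PC
pc = next_pc(pc, False, False, 0, False, 0)   # sequential fetch -> 0x4
```

The key property is that the PC is free-running: absent a stall or an override, it advances by 4 every clock.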
The PC generation algorithm is implemented as follows:

4 Instruction Buffers

There are two back-to-back instruction buffers inside the FU. Buffer 1 buffers instructions fetched from the instruction memory and directly interfaces with the Response I/F. Buffer 2 buffers instructions from Buffer 1 and then sends them to the DU through the DU I/F. These two buffers constitute the instruction pipeline inside the FU.

5 Branch Prediction Logic

As discussed above, we must add branch prediction logic to the FU to mitigate control hazards. We will implement a simple, static branch prediction algorithm:

1. Always take unconditional jumps.
2. If the branch is a backward jump, predict taken. The instruction may be part of the loop exit check of a do-while loop; in that case, taking the branch gives a higher probability of being correct.
3. If the branch is a forward jump, predict not taken. The instruction may be part of the loop entry check of a for or while loop; in that case, not taking the branch and continuing with the next instruction gives a higher probability of being correct. Alternatively, the instruction may be part of an if-else statement; in that case, we always assume the if condition is true and continue with the next instruction. Theoretically, this bet is correct about 50% of the time.

The instruction packet in Buffer 1 is monitored and analyzed by the branch prediction logic, which generates a branch prediction signal: branch_taken. This signal is then registered and travels in sync with the instruction packet sent to the DU. The branch prediction signal is sent to the DU through the DU I/F.

6 DU I/F

This is the main interface between the Fetch Unit and the Decode Unit for sending the payload.
The payload contains the fetched instruction and branch prediction information. Since this is an interface between two pipeline stages of the CPU, a valid-ready I/F is implemented. The following signals constitute the DU I/F:

In a previous post, we discussed the concepts of stall and flush in the CPU pipeline and their importance. We also discussed the various scenarios in the Pequeno architecture that require a stall or a flush. Therefore, proper stall and flush logic must be integrated into each pipeline stage of the CPU. It is crucial to determine at which stage a stall or flush is required, and which part of the logic in that stage needs to be stalled or flushed.

Some initial thoughts before implementing the stall and flush logic:

Pipeline stages may be stalled by externally or internally generated conditions.
Pipeline stages may be flushed by externally or internally generated conditions.
There is no centralized stall or flush generation logic in Pequeno. Each stage may have its own stall and flush generation logic.
A stage in the pipeline can only be stalled by the next stage. Any stage stalling will eventually back-pressure the upstream pipeline and stall the entire pipeline.
A stage can be flushed by any stage downstream of it. This is called a pipeline flush, because the entire upstream pipeline needs to be flushed at the same time. In Pequeno, a pipeline flush is required only on a branch miss in the Execution Unit (EXU).

The stall logic comprises logic to generate local and external stalls. The flush logic comprises logic to generate local and pipeline flushes. A local stall is generated internally and used locally to stall the current stage. An external stall is generated internally and sent out to the next stage of the upstream pipeline. Both local and external stalls are generated based on internal conditions and on the external stall from the next stage of the downstream pipeline.
A local flush is generated internally and used to flush the local stage. An external flush, or pipeline flush, is generated internally and sent out to the upstream pipeline, flushing all upstream stages simultaneously. Both local and external flushes are generated based on internal conditions.

Only the DU can stall the FU externally. When the DU asserts stall, the internal instruction pipeline of the FU (Buffer 1 -> Buffer 2) should stall immediately, and since the FU can no longer accept packets from the IMEM, it should also assert mem_stall to the IMEM. Depending on the pipeline/buffer depth in the IMEM, the PC generation logic may also eventually be stalled by cpu_stall from the IMEM, since the IMEM cannot accept any more requests. There are no internal conditions in the FU that cause a local stall.

Only the EXU can flush the FU externally. The EXU initiates branch_flush in the CPU instruction pipeline and passes the address of the next instruction to be fetched after the pipeline flush (branch_pc). The FU provides a Flush I/F to accept the external flush. Buffer 1, Buffer 2, and the PC generation logic in the FU are all flushed by branch_flush.

The branch_taken signal from the branch prediction logic also acts as a local flush to Buffer 1 and the PC generation logic. If a branch is predicted taken: the next instruction should be fetched from the branch-predicted PC, so the PC generation logic should be flushed and the next PC should be the predicted branch PC; the next instruction in Buffer 1 should be flushed and invalidated, i.e., a NOP/bubble inserted.

Wondering why Buffer 2 is not flushed by branch_taken? Because the branch instruction in Buffer 1 (which is responsible for generating the flush) should be buffered into Buffer 2 in the next clock cycle and allowed to continue down the pipeline. This instruction must not be flushed! The instruction memory pipeline should also be flushed appropriately. The IMEM flush, mem_flush, is generated from branch_flush and branch_taken.
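The static prediction decision and the stall/flush signals it drives can be pulled together in a small Python model. Signal names follow the post, but the exact gating shown here is an assumption sketched from the description above, not the verified RTL:

```python
def predict_taken(is_jump, is_branch, offset):
    """Static prediction: always take unconditional jumps; take a
    conditional branch only if it jumps backward (negative offset)."""
    if is_jump:
        return True
    if is_branch:
        return offset < 0    # backward branch: likely a loop, predict taken
    return False             # not a control-flow instruction

def fu_control(du_stall, branch_flush, branch_taken):
    """Combinational control outputs of the Fetch Unit (assumed gating)."""
    mem_stall  = du_stall                      # FU can't accept IMEM responses
    flush_buf1 = branch_flush or branch_taken  # kill the wrong-path fetch
    flush_buf2 = branch_flush                  # the branch itself must survive
    mem_flush  = branch_flush or branch_taken  # flush the IMEM pipeline too
    return mem_stall, flush_buf1, flush_buf2, mem_flush

# A backward conditional branch is predicted taken; it locally flushes
# Buffer 1 and the IMEM pipeline, but not Buffer 2:
taken = predict_taken(is_jump=False, is_branch=True, offset=-16)
ctrl = fu_control(du_stall=False, branch_flush=False, branch_taken=taken)
```

Note how flush_buf2 is driven only by branch_flush, capturing the point above: the predicted-taken branch itself must advance from Buffer 1 to Buffer 2 and continue down the pipeline.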
Let's integrate all the microarchitecture designed so far to complete the architecture of the Fetch Unit.

Okay, everyone! We have successfully designed the Fetch Unit of Pequeno. In the next part, we will design the Decode Unit (DU) of Pequeno.

Decode Unit

The Decode Unit (DU) is the second stage of the CPU pipeline. It decodes instructions from the Fetch Unit (FU) and sends them to the Execution Unit (EXU). In addition, it decodes the register addresses and sends them to the register file for register read operations. Let's define the interfaces of the Decode Unit.

The FU I/F is the main interface between the Fetch Unit and the Decode Unit for receiving the payload. The payload contains the fetched instruction and branch prediction information. This interface was discussed in the previous section.

The EXU I/F is the main interface between the Decode Unit and the Execution Unit for sending the payload. The payload comprises the decoded instruction, branch prediction information, and decoded data. The following are the instruction and branch prediction signals that make up the EXU I/F:

Decoded data is the important information that the DU extracts from the fetched instruction and sends to the EXU. Let's understand what information the EXU needs to execute an instruction.

Opcode, funct3, funct7: identify the operation that the EXU is to perform on the operands.
Operands: depending on the opcode, the operands can be register data (rs1, rs2), the register address for writeback (rdt), or a 12-bit/20-bit immediate value.
Instruction type: identifies which operands/immediate value must be processed.

The decoding process can be tricky. If you understand the ISA and instruction structure correctly, you can recognize the different instruction patterns, and recognizing the patterns helps in designing the decoding logic in the DU. The following information is decoded and sent to the EXU via the EXU I/F.
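The fixed field positions of the RV32I base ISA make this extraction straightforward. As an illustrative Python sketch of the field slicing (per-type immediate decoding is omitted for brevity):

```python
def decode_rv32i(insn):
    """Slice the fixed RV32I fields out of a 32-bit instruction word.
    opcode/rd/funct3/rs1/rs2/funct7 sit at the same bit positions in
    every R-type instruction; other types reuse a subset of these fields."""
    return {
        "opcode": insn & 0x7F,          # bits [6:0]
        "rd":     (insn >> 7)  & 0x1F,  # bits [11:7]
        "funct3": (insn >> 12) & 0x7,   # bits [14:12]
        "rs1":    (insn >> 15) & 0x1F,  # bits [19:15]
        "rs2":    (insn >> 20) & 0x1F,  # bits [24:20]
        "funct7": (insn >> 25) & 0x7F,  # bits [31:25]
    }

# add x3, x1, x2 encodes as 0x002081B3 (opcode 0x33 = OP, funct3 0, funct7 0)
f = decode_rv32i(0x002081B3)
```

Because the source register fields always sit at bits [19:15] and [24:20], the DU can slice them out with pure combinational logic, which is exactly what makes the early register-file read described below possible.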
The EXU uses this information to demultiplex the data to the appropriate execution subunit and execute the instruction. For R-type instructions, the source registers rs1 and rs2 must be decoded and read; the data read from these registers are the operands. All general-purpose user registers reside in the register file outside the DU. The DU uses the Register File I/F to send the addresses of rs1 and rs2 to the register file for register access. The data read from the register file should be sent to the EXU in the same clock cycle as the payload.

The register file takes one cycle to read a register. The DU also takes one cycle to register the payload to be sent to the EXU. Therefore, the source register addresses are decoded directly from the FU instruction packet by combinational logic. This ensures the timing alignment of 1) the payload from the DU to the EXU and 2) the read data from the register file to the EXU.

Only the EXU can stall the DU externally. When the EXU asserts stall, the internal instruction pipeline of the DU should stall immediately, and it should also assert stall to the FU, because it can no longer accept packets from the FU. For synchronous operation, the register file should be stalled together with the DU, because both sit at the same stage of the CPU's five-stage pipeline. Therefore, the DU forwards the external stall from the EXU to the register file. There are no internal conditions in the DU that cause a local stall.

Only the EXU can flush the DU externally. The EXU initiates branch_flush in the CPU instruction pipeline and passes the address of the next instruction to be fetched after the pipeline flush (branch_pc). The DU provides a Flush I/F to accept the external flush. The internal pipeline is flushed by branch_flush. The branch_flush from the EXU should invalidate the DU instruction heading to the EXU immediately, with 0-clock-cycle latency.
This is to avoid a potential control hazard in the EXU in the next clock cycle. In the design of the Fetch Unit, we did not invalidate the FU instruction with 0-cycle latency on receiving branch_flush. That is because the DU is also flushed in the next clock cycle, so no control hazard can arise in the DU, and there is no need to invalidate the FU instruction. The same reasoning applies to the instructions from the IMEM to the FU.

The flowchart above shows how the instruction packets and branch prediction data from the FU are buffered in the DU of the instruction pipeline. Only a single level of buffering is used in the DU. Let's integrate all the microarchitecture designed so far to complete the architecture of the Decode Unit.

So far we have completed: the Fetch Unit (FU) and the Decode Unit (DU). In the next section, we will design the register file of Pequeno.

Register File

In a RISC-V CPU, the register file is a key component: a set of general-purpose registers used to store data during execution. The Pequeno CPU has 32 32-bit general-purpose registers (x0 - x31). Register x0 is the zero register. It is hardwired to the constant value 0, providing a useful default that can be used with other instructions. Suppose you want to initialize another register to 0: just execute mv x1, x0. Registers x1 - x31 are general-purpose registers that hold intermediate data, addresses, and the results of arithmetic or logical operations.

In the CPU architecture designed in the previous posts, the register file requires two access interfaces. The read access interface is used to read the registers at the addresses sent by the DU. Some instructions (such as ADD) require two source register operands, rs1 and rs2. Therefore, the Read Access I/F needs two read ports to read two registers at the same time.
Read access should be single-cycle, so that the read data reaches the EXU in the same clock cycle as the payload from the DU. In this way, the read data and the DU payload stay synchronized in the pipeline.

The write access interface is used to write the execution result back to the register at the address sent by the Writeback Unit (WBU). Only one destination register, rdt, is written at the end of execution, so one write port is sufficient. Write access should also be single-cycle.

Since the DU and the register file must stay synchronized at the same stage of the pipeline, they should always be stalled together (why? Check the block diagram in the previous section!). For example, if the DU is stalled, the register file should not drive new read data to the EXU, because that would corrupt the pipeline. In this case, the register file should be stalled as well. This is ensured by inverting the DU's stall signal to generate the read_enable of the register file. When the stall is asserted, read_enable is driven low and the previous data is held at the read data output, effectively stalling the register file.

Since the register file does not send any instruction packets to the EXU, it does not need any flush logic; flushing is handled entirely inside the DU.

In summary, the register file is designed with two independent read ports and one write port. Both read and write accesses are single-cycle. The read data is registered. The final architecture is as follows:

We have now completed: the Fetch Unit (FU), the Decode Unit (DU), and the register file. Stay tuned for the next part.
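The register file's behavior can be summarized in a short Python model: two read ports, one write port, and x0 hardwired to zero. This is a behavioral sketch; the actual design registers the read data and gates reads with read_enable, which is omitted here:

```python
class RegFile:
    """Behavioral model of a 32 x 32-bit register file with two read ports
    and one write port. x0 is hardwired to zero: writes to it are discarded
    and reads always return 0."""
    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        # Two independent read ports, accessed in the same cycle.
        return self.regs[rs1], self.regs[rs2]

    def write(self, rdt, data):
        # Single write port; masking keeps values 32-bit.
        if rdt != 0:                      # writes to x0 are silently dropped
            self.regs[rdt] = data & 0xFFFF_FFFF

rf = RegFile()
rf.write(1, 42)
rf.write(0, 99)    # discarded: x0 stays 0
```

Discarding writes to x0 in the write port (rather than special-casing every read) is the usual way the hardwired zero register is realized.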
    - May 11, 2025