# TPU-Gen: LLM-Driven Custom Tensor Processing Unit Generator
**Authors**:
- Ramtin Zand, Shaahin Angizi (New Jersey Institute of Technology, Newark, NJ, USA,
University of South Carolina, Columbia, SC, USA)
- E-mails: {dv336,shaahin.angizi}@njit.edu
## Abstract
The increasing complexity and scale of Deep Neural Networks (DNNs) necessitate specialized tensor accelerators, such as Tensor Processing Units (TPUs), to meet various computational and energy efficiency requirements. Nevertheless, designing an optimal TPU remains challenging due to the high level of domain expertise required, considerable manual design time, and the lack of high-quality, domain-specific datasets. This paper introduces TPU-Gen, the first Large Language Model (LLM) based framework designed to automate the exact and approximate TPU generation process, focusing on systolic array architectures. TPU-Gen is supported by a meticulously curated, comprehensive, and open-source dataset that covers a wide range of spatial array designs and approximate multiply-and-accumulate units, enabling design reuse, adaptation, and customization for different DNN workloads. The proposed framework leverages Retrieval-Augmented Generation (RAG) as an effective solution for building LLMs in a data-scarce hardware domain, addressing their most pressing issue: hallucination. TPU-Gen transforms high-level architectural specifications into optimized low-level implementations through an effective hardware generation pipeline. Our extensive experimental evaluations demonstrate superior performance, power, and area efficiency, with average reductions in area and power of 92% and 96%, respectively, relative to manually optimized reference designs. These results set new standards for driving advancements in next-generation design automation tools powered by LLMs.
## I Introduction
The rising computational demands of Deep Neural Networks (DNNs) have driven the adoption of specialized tensor processing accelerators, such as Tensor Processing Units (TPUs). These accelerators, characterized by low global data transfer, high clock frequencies, and deeply pipelined Processing Elements (PEs), excel in accelerating training and inference tasks by optimizing matrix multiplication [1]. Despite their effectiveness, the complexity and expertise required for their design remain significant barriers. Static accelerator design tools, such as Gemmini [2] and DNNWeaver [3], address some of these challenges by providing templates for systolic arrays, data flows, and software ecosystems [4, 5]. However, these tools still face limitations, including complex programming interfaces, high memory usage, and inefficiencies in handling diverse computational patterns [6, 7]. These constraints underscore the need for innovative solutions to streamline hardware design processes.
Large Language Models (LLMs) have emerged as a promising solution, offering the ability to generate hardware descriptions from high-level design intents. By encapsulating vast domain-specific knowledge, LLMs can potentially reduce the expertise and time required for DNN hardware development. However, realizing this potential requires overcoming three critical challenges. First, existing datasets are often limited in size and detail, hindering the generation of reliable designs [8, 9]. Second, while fine-tuning is essential to minimize human intervention, fine-tuned LLMs often hallucinate, producing nonsensical or factually incorrect responses that compromise their applicability [10, 11]. Finally, an effective pipeline is needed to mitigate these hallucinations and ensure the generation of consistent, contextually accurate code [11]. Therefore, the core questions we seek to answer are: Can LLMs be made to act with a critical mind, adopting techniques such as Retrieval-Augmented Generation (RAG) to minimize hallucinations? Can we leverage domain-specific LLMs with RAG through an effective pipeline to automate the TPU design process and meet various computational and energy efficiency requirements?
TABLE I: Comparison of the Selected LLM-based HDL/HLS generators.
| Property | Ours | [10] | [9] | [8] | [12] | [13] | [14] | [15] | [16] | [17] | [18] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Function | TPU Gen. | Verilog Gen. | AI Accel. Gen. | Verilog Gen. | Verilog Gen. | Verilog Gen. | Hardware Verf. | Hardware Verf. | Verilog Gen. | $\dagger$ | AI Accel. Gen. |
| Chatbot ∗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Dataset | ✓ | ✓(Verilog) | ✗ | NA | NA | NA | ✗ | ✗ | ✓ | ✓ | ✓ |
| Output format | Verilog | Verilog | HLS | Verilog | Verilog | Verilog | Verilog | HDL | Verilog | Verilog | Chisel |
| Auto. Verif. | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ |
| Human in Loop | Low | Medium | Medium | Medium | High | Low | Low | Low | Low | Low | Low |
| Fine tuning | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| RAG | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
∗ A user interface featuring Prompt template generation for the input of LLM. † Not applicable.
To answer these questions, we develop TPU-Gen, a first-of-its-kind automated exact and approximate TPU design generation framework with a comprehensive dataset specifically tailored for ever-growing DNN topologies. Our contributions in this paper are threefold: (1) Due to the limited availability of annotated data necessary for efficient fine-tuning of an open-source LLM, we introduce a meticulously curated dataset that encompasses various levels of detail and corresponding hardware descriptions, designed to enhance LLMs’ learning and generative capabilities in the context of TPU design; (2) We develop TPU-Gen to reduce hallucinations by combining RAG and fine-tuning, streamlining the approximate TPU design generation process under budgetary constraints (e.g., power, latency, area) and ensuring a seamless transition from high-level specifications to low-level implementations; and (3) We design extensive experiments to evaluate our approach’s performance and reliability, demonstrating its superiority over existing methods. We anticipate that TPU-Gen will provide a framework that will influence the future trajectory of DNN hardware acceleration research for generations to come. The dataset and fine-tuned models are open-sourced; the link is omitted to preserve anonymity, since anonymous GitHub links are limited to 2 GB, which this study exceeds.
## II Background
LLM for Hardware Design. LLMs show promise in generating Hardware Description Language (HDL) and High-Level Synthesis (HLS) code. Table I compares notable methods in this field. VeriGen [10] and ChatEDA [19] refine hardware design workflows, automating the RTL-to-GDSII process with fine-tuned LLMs. ChipGPT [8] and Autochip [13] integrate LLMs to generate and optimize hardware designs, with Autochip producing precise Verilog code through simulation feedback. Chip-Chat [12] demonstrates interactive LLMs like ChatGPT-4 in accelerating design space exploration. MEV-LLM [20] proposes a multi-expert LLM architecture for Verilog code generation. RTLLM [21] and GPT4AIGChip [9] enhance design efficiency, showcasing LLMs’ ability to manage complex design tasks and broaden access to AI accelerator design. To the best of our knowledge, GPT4AIGChip [9] and SA-DS [18] are among the few initial works focused on an extensive framework specifically aimed at generating domain-specific AI accelerator designs; SA-DS focuses on creating an HLS dataset and employs fine-tuning-free methods such as single-shot and multi-shot inputs to the LLM. Other hardware-oriented works include the generation of SPICE circuits [22, 23]. However, the absence of prompt optimization, tailored datasets, and model fine-tuning, together with LLM hallucination, poses a barrier to fully harnessing the potential of LLMs in such frameworks [19, 18]. This limitation confines their application to standard LLMs without fine-tuning or In-Context Learning (ICL) [19], which are among the most promising methods for optimizing LLMs [24].
Retrieval-Augmented Generation. RAG is a promising paradigm that combines deep learning with traditional retrieval techniques to help mitigate hallucinations in LLMs [25]. RAG leverages external knowledge bases, such as databases, to retrieve relevant information, facilitating the generation of more accurate and reliable responses [26, 25]. The primary challenge in deploying LLMs for hardware generation, or any application, lies in their tendency to deviate from the data and hallucinate, making it difficult to capture the essence of circuits and architectural components. LLMs tend to prioritize creativity and innovative solutions, which often results in straying from the data [11]. As previous works show, the RAG model can be a cost-efficient solution by retrieving and augmenting data, avoiding heavy computational demands [27].
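The retrieval step can be illustrated with a minimal sketch. The bag-of-words similarity and the dataset entries below are toy assumptions for illustration only; practical RAG systems use learned dense embeddings over the actual design corpus:

```python
import math
from collections import Counter

# Hypothetical design-dataset entries standing in for the knowledge base.
knowledge_base = [
    "8x8 output-stationary systolic array with LOA approximate adders",
    "16x16 weight-stationary systolic array with exact MAC units",
    "4x4 systolic array using RoBA approximate multipliers",
]

def bow(text):
    """Bag-of-words term counts (a toy stand-in for dense embeddings)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Return the k knowledge-base entries most similar to the query."""
    q = bow(query)
    return sorted(knowledge_base, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

# Augment the prompt with retrieved context before calling the LLM.
query = "generate an 8x8 systolic array with LOA adders"
augmented_prompt = f"Context: {retrieve(query)[0]}\nRequest: {query}"
```

Grounding the generation in retrieved designs this way is what lets RAG curb hallucination without the cost of retraining the model.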
Approximate MAC Units. Approximate computing has been widely explored as a means to trade reduced accuracy for gains in design metrics, including area, power consumption, and performance [28, 29, 30, 31, 32, 33]. As the computation core of the PEs in TPUs, several approximate Multiply-and-Accumulate (MAC) units have been proposed as alternatives to precise multipliers and adders and extensively analyzed for accelerating deep learning [34, 35]. These MAC units comprise two arithmetic stages, multiplication and accumulation with previous products, each of which can be independently approximated. Most approximate multipliers, such as logarithmic multipliers, are composed of two key components: low-precision arithmetic logic and a pre-processing unit that acts as steering logic to prepare the operands for low-precision computation [36]. These multipliers typically balance accuracy and power efficiency. For example, the logarithmic multiplier introduced in [29] emphasizes accuracy, while the multipliers in [37] are designed to reduce power and latency. On the other hand, most approximate adders, such as the lower-part OR adder (LOA) [38], exploit the fact that extended carry propagation is infrequent, allowing adders to be divided into independent sub-adders that shorten the critical path. To preserve computational accuracy, the approximation is applied to the least significant bits of the operands, while the most significant bits remain accurate.
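A behavioral sketch makes the LOA split explicit. This is a simplified Python model, not the parameterized Verilog used in the framework; some LOA variants additionally feed the AND of the lower parts' most significant bits as a carry into the exact sub-adder, which this sketch omits:

```python
def loa_add(a, b, width=8, imprecise=4):
    """Behavioral model of the Lower-part OR Adder (LOA): the low
    `imprecise` bits of the operands are combined with a carry-free
    bitwise OR, while the remaining high bits use an exact adder."""
    lo_mask = (1 << imprecise) - 1
    lo = (a & lo_mask) | (b & lo_mask)        # approximate lower sub-adder
    hi = (a >> imprecise) + (b >> imprecise)  # exact upper sub-adder
    return ((hi << imprecise) | lo) & ((1 << (width + 1)) - 1)

# Error is confined to the low bits: dropping the lower-part carries
# perturbs only the imprecise portion of the sum.
approx_sum = loa_add(0b10101100, 0b01010110)
exact_sum = 0b10101100 + 0b01010110
```

Because carries never propagate out of the OR-ed lower part, the worst-case error is bounded by `2**imprecise`, which is exactly the accuracy/critical-path trade-off the paragraph above describes.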
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Dataflow Architecture for a Processing Unit
### Overview
The image depicts a dataflow architecture for a processing unit, likely a specialized accelerator for neural network operations. It illustrates the flow of data from input feature maps and weights, through processing elements (PAUs and APEs), and finally to output memory. The diagram emphasizes parallel processing and the role of a controller in managing the data flow.
### Components/Axes
The diagram consists of the following key components:
* **Weight/IFMAP Memory:** Located on the left side, serving as the primary input source for both weights and input feature maps (IFMAP).
* **IFMAP/Weight Memory:** Located at the top, providing weights to the processing elements.
* **DEMUX (Demultiplexer):** Two instances are present, one for the Weight/IFMAP Memory and one for the IFMAP/Weight Memory. These distribute data to multiple FIFO queues.
* **FIFO (First-In, First-Out) Queues:** These act as buffers between the memory and the processing units. Multiple FIFO queues are shown, receiving data from the DEMUX.
* **PAU (Pre-Approximate Unit):** These units perform initial operand pre-processing on the data received from the FIFO queues.
* **APE (Approximate Processing Element):** These units perform further processing on the output of the PAUs. Multiple APEs are chained together.
* **MUX (Multiplexer):** Located at the bottom, combining the outputs from the APEs into a single output stream.
* **Output Memory (OFMAP):** Located at the bottom, storing the final output feature maps.
* **Controller:** A dashed box in the center-left, responsible for coordinating the data flow between the various components.
There are no explicit axes in this diagram, as it represents a system architecture rather than a data plot.
### Detailed Analysis or Content Details
The diagram illustrates a parallel processing architecture.
1. **Data Input:** Weights and IFMAPs are read from the Weight/IFMAP Memory and IFMAP/Weight Memory.
2. **Demultiplexing:** The DEMUX distributes the data to multiple FIFO queues. The number of FIFO queues is not explicitly stated, but appears to be at least 4.
3. **Buffering:** The FIFO queues buffer the data before it is fed to the PAUs.
4. **Initial Processing (PAU):** The PAUs perform an initial stage of processing on the data.
5. **Further Processing (APE):** The output of the PAUs is then fed to a chain of APEs for further processing. The number of APEs in the chain is not explicitly stated, but appears to be multiple.
6. **Multiplexing:** The MUX combines the outputs from the APEs.
7. **Output:** The final output is written to the Output Memory (OFMAP).
8. **Control:** The Controller manages the entire data flow, coordinating the operation of the DEMUX, FIFO queues, PAUs, APEs, and MUX.
The dashed lines indicate control signals or data flow managed by the Controller. The diagram suggests a highly parallel architecture, with multiple PAUs and APEs operating concurrently.
### Key Observations
* The architecture is designed for parallel processing, with multiple processing elements operating simultaneously.
* The FIFO queues provide buffering to handle variations in data rates between the memory and the processing units.
* The Controller plays a crucial role in coordinating the data flow and ensuring correct operation.
* The diagram does not specify the type of processing performed by the PAUs and APEs, but it is likely related to neural network operations such as convolution or matrix multiplication.
* The use of DEMUX and MUX suggests a flexible architecture that can handle different data widths and formats.
### Interpretation
This diagram represents a specialized hardware accelerator designed for efficient processing of data, likely for deep learning applications. The parallel architecture, combined with the buffering provided by the FIFO queues and the coordination of the Controller, allows for high throughput and low latency. The separation of processing into PAUs and APEs suggests a pipelined architecture, where data is processed in stages. The overall design emphasizes maximizing computational efficiency and minimizing data movement, which are critical for performance in deep learning workloads. The diagram highlights a common approach to designing hardware accelerators for neural networks, focusing on parallel processing and efficient data flow. The absence of specific numerical values or performance metrics suggests that the diagram is intended to illustrate the overall architecture rather than provide detailed performance characteristics.
</details>
Figure 1: The overall template for TPU design.
## III TPU-Gen Framework
### III-A Architectural Template
Developing a Generic Template. The TPU architecture utilizes a systolic array of PEs with MAC units for efficient matrix and vector computations. This design enhances performance and reduces energy consumption by reusing data, minimizing buffer operations [1]. Input data propagates diagonally through the array in parallel. The TPU template, illustrated in Fig. 1, extends the TPU’s systolic array with Output Stationary (OS) dataflow to enable concurrent approximation of input feature maps (IFMaps) and weights. It comprises five components: weight/IFMap memory, FIFOs, a controller, Pre-Approximate Units (PAUs), and Approximate Processing Elements (APEs). The weights and IFMaps are stored in their respective memories, with the controller managing memory access and data transfer to FIFOs per the OS dataflow. PAUs, positioned between FIFOs and APEs, dynamically truncate high-precision operands to lower precision before sending them to APEs, which perform MAC operations using approximate multipliers and adders. Sharing PAUs across rows and columns reduces hardware overhead, introducing minimal latency but significantly improving overall performance [39].
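The OS dataflow and the PAU/APE division of labor can be sketched behaviorally. This is a simplified Python model under assumed bit-widths: it ignores the cycle-level skew of a real systolic array and models the PAU as plain bit truncation with an exact multiply standing in for the APE's approximate MAC:

```python
def pau_truncate(x, in_bits=8, out_bits=4):
    """Pre-Approximate Unit (PAU): truncate a high-precision operand to
    lower precision by zeroing its least significant bits."""
    shift = in_bits - out_bits
    return (x >> shift) << shift

def os_systolic_matmul(A, B, approx=True):
    """Output-stationary dataflow: each PE(i, j) owns one accumulator and
    adds one product per streaming step as IFMap rows and weight columns
    flow through the S x S array (cycle-level skew omitted)."""
    S = len(A)
    acc = [[0] * S for _ in range(S)]
    for k in range(S):                          # one streaming step
        for i in range(S):
            for j in range(S):
                a, b = A[i][k], B[k][j]
                if approx:                      # PAU truncation before the APE
                    a, b = pau_truncate(a), pau_truncate(b)
                acc[i][j] += a * b              # APE MAC operation
    return acc
```

Because the accumulators stay put while operands stream, each partial sum is read and written once per step, which is the data-reuse property that cuts buffer traffic in the OS design.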
Highly-Parameterized RTL Code. We design highly flexible, parameterized RTL code for 13 approximate adders and 12 approximate multipliers as representative approximate circuits. Each approximate adder has two tunable parameters: the bit-width, which specifies the number of bits in each operand, and the imprecise part, which specifies the number of inexact bits in the adder output. Each approximate multiplier has one common parameter, the width (W), which specifies the bit-width of the multiplication operands. Additional tunable parameters exist for specific multipliers, some of which are listed in Table II.
TABLE II: Approximate multiplier hyper-parameters
| Design | Parameter | Description | Default |
| --- | --- | --- | --- |
| BAM [40] | VBL | No. of zero bits during partial product generation | W/2 |
| ALM_LOA [41] | M | Inaccurate part of LOA adder | W/2 |
| ALM_MAA3 [41] | M | Inaccurate part of MAA3 adder | W/2 |
| ALM_SOA [41] | M | Inaccurate part of SOA adder | W/2 |
| ASM [42] | Nibble_Width | Number of precomputed alphabets | 4 |
| DRALM [37] | MULT_DW | Truncated bits of each operand | W/2 |
| RoBA [43] | ROUND_WIDTH | Scales the widths of the shifter | 1 |
We leveraged the parameterized RTL library of approximate arithmetic circuits to build a TPU library that enables automatic selection of the systolic array size $S$, bit precision $n$, and one of the approximate multipliers and adders. The internal parameters used to tune the approximate arithmetic libraries are also included in the parameterized TPU RTL library, thus allowing users complete flexibility to adjust their designs to meet specific hardware specifications and application accuracy requirements. Moreover, we developed a design automation methodology enabling the automatic implementation and simulation of many TPU circuits in platforms such as Design Compiler and Vivado. In addition to the highly parameterized RTL code, we developed TCL and Python scripts to autonomously measure error, area, performance, and power dissipation under various constraints.
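As an illustration of the kind of error measurement those scripts automate, the sketch below exhaustively sweeps an assumed truncation-style multiplier (in the spirit of DRALM's operand truncation) and reports mean error distance (MED), normalized MED (NMED), and mean relative error distance (MRED). The function names and the multiplier are hypothetical stand-ins, not the framework's actual scripts:

```python
def trunc_mul(a, b, t=4):
    """Hypothetical truncation-style approximate multiplier: the low t
    bits of each operand are zeroed before an exact multiplication."""
    return ((a >> t) << t) * ((b >> t) << t)

def error_metrics(approx_mul, w=8):
    """Exhaustively sweep all w-bit operand pairs and compute MED,
    NMED (MED over the maximum possible output), and MRED."""
    n = 0
    total_ed = 0
    total_red = 0.0
    max_out = ((1 << w) - 1) ** 2              # normalization constant
    for a in range(1 << w):
        for b in range(1 << w):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))
            total_ed += ed
            if exact:                           # relative error undefined at 0
                total_red += ed / exact
            n += 1
    med = total_ed / n
    return {"MED": med, "NMED": med / max_out, "MRED": total_red / n}
```

Exhaustive sweeps are feasible at 8-bit precision (65,536 pairs); for wider operands the real flow would switch to Monte Carlo sampling.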
### III-B Framework Overview
The TPU-Gen framework, depicted in Fig. 2, targets the development of domain-specific LLMs, emphasizing the interplay between the model’s responses and two key factors: the input prompt and the model’s learned parameters. The framework optimizes both elements to enhance the LLM’s performance. In Step 1, the Prompt Generator produces an initial prompt conveying the user’s intent along with key software and hardware specifications of the intended TPU design and application. A verbal description of a tensor processing accelerator design can often result in a many-to-one mapping, as shown in Fig. 3 (a), especially when such descriptions do not align with the format of the training dataset. This misalignment increases the likelihood of hallucinations in the LLM’s output, potentially leading to faulty designs [44]. To minimize hallucinations and incorrect outputs in LLM-generated designs, studies have shown that inputs adhering closely to patterns observed in the training data produce more accurate and desirable results [17, 18]. However, this critical aspect has often been overlooked in previous state-of-the-art research [9], with some researchers opting instead to address the issue through prompt optimization techniques [18]. In this framework, we tackle the problem by employing a script that extracts key features, such as systolic size and relevant metrics, from any verbal input given by the user. These features are then embedded into a template, which serves as the prompt for the LLM input. As a domain-specific LLM, TPU-Gen focuses on generating the RTL top file detailing the circuit and blocks of the architectural template presented in Section III-A.
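A minimal sketch of such a feature-extraction script is shown below. The regular expressions, template wording, defaults, and multiplier names are illustrative assumptions, not the framework's actual implementation:

```python
import re

# Illustrative template aligned with a (hypothetical) training-data format.
TEMPLATE = ("Generate the Verilog top file for a {rows}x{cols} systolic "
            "array TPU with {width}-bit operands using the {mult} multiplier.")

def generate_prompt(description):
    """Extract key design features from a free-form user request and
    embed them into a training-aligned template (defaults are assumed)."""
    size = re.search(r"(\d+)\s*[xX]\s*(\d+)", description)
    width = re.search(r"(\d+)\s*-?\s*bit", description)
    mult = re.search(r"\b(LOA|RoBA|BAM|ASM|DRALM|exact)\b", description, re.I)
    return TEMPLATE.format(
        rows=size.group(1) if size else "8",
        cols=size.group(2) if size else "8",
        width=width.group(1) if width else "8",
        mult=mult.group(1).upper() if mult else "exact",
    )
```

Whatever phrasing the user chooses, the LLM always sees the same templated prompt, which is precisely how the many-to-one mapping of Fig. 3 (a) is collapsed into a single training-aligned input.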
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: APTPU Generation Framework
### Overview
This diagram illustrates the workflow of an APTPU (Approximate TPU) Generation Framework. It depicts a process that takes a user prompt as input, utilizes multi-shot learning with a fine-tuned Large Language Model (LLM), employs Retrieval-Augmented Generation (RAG), validates generated code, and ultimately produces an APTPU with the needed performance characteristics. The diagram uses numbered circles to indicate the flow of the process.
### Components/Axes
The diagram is divided into three main sections: "Input", "APTPU Generation Framework", and "Output".
Within the "APTPU Generation Framework" section, the following components are visible:
* **Multi-shot Learning & Fine-tuned LLM:** Contains an icon of a brain and the text "Fine-tuned LLM".
* **LLM:** A rectangular box labeled "LLM".
* **Data-set:** A rectangular box labeled "Data-set".
* **Retrieval-Augmented Generation (RAG):** A rectangular box labeled "Retrieval-Augmented Generation (RAG)".
* **Generate Code:** Text within the RAG box.
* **Automated Code Validation:** A rectangular box labeled "Automated Code Validation".
* **APTPU w. needed perf:** A rectangular box labeled "APTPU w. needed perf" with checkboxes next to "Power", "Delay", and "Area" and an ellipsis.
The diagram also includes the following labels:
* **User prompt:** Located at the top-left, indicating the input source.
* **Prompt Generator:** An icon of a gear and a book, indicating the component that generates prompts.
* **Invalid:** Labeling a dashed arrow returning from "Automated Code Validation" to "LLM".
* **Valid:** Labeling a solid arrow from "Automated Code Validation" to "APTPU w. needed perf".
Numbered circles indicate the process flow: 1 through 7.
### Detailed Analysis / Content Details
The process flow is as follows:
1. A "User prompt" enters the system.
2. The "User prompt" is processed by a "Prompt Generator".
3. The output of the "Prompt Generator" feeds into "Multi-shot Learning & Fine-tuned LLM".
4. The "Multi-shot Learning & Fine-tuned LLM" interacts with the "LLM" and "Data-set".
5. The "LLM" generates code, which is passed to "Retrieval-Augmented Generation (RAG)".
6. If the code generated by "RAG" is "Invalid", it is sent back to the "LLM" and "Data-set" for refinement. This is indicated by a dashed arrow.
7. If the code generated by "RAG" is "Valid", it is passed to "Automated Code Validation".
8. If the code is validated, it is used to generate an "APTPU w. needed perf". The output includes options for "Power", "Delay", and "Area".
### Key Observations
The diagram highlights a closed-loop system where code generation and validation are iterative. The use of "Multi-shot Learning" and "Retrieval-Augmented Generation" suggests a sophisticated approach to code generation, leveraging existing data and knowledge. The "Automated Code Validation" step is crucial for ensuring the quality and correctness of the generated APTPU. The checkboxes for "Power", "Delay", and "Area" indicate that these are key performance parameters being optimized.
### Interpretation
This diagram represents a modern approach to automated approximate TPU (APTPU) generation. It leverages the power of Large Language Models (LLMs) and machine learning techniques to streamline the creation of specialized accelerator designs. The iterative feedback loop between code generation and validation is essential for producing high-quality, reliable APTPUs. The inclusion of performance parameters like "Power", "Delay", and "Area" suggests that the framework is designed to optimize these critical aspects of the generated designs. The diagram implies a shift from manual APTPU development to an automated, data-driven approach, potentially reducing development time and improving design quality. The use of RAG suggests the LLM is not operating in isolation, but is augmented by a knowledge base to improve the quality of the generated code.
</details>
Figure 2: The proposed TPU-Gen framework.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: TPU Design Process with LLM
### Overview
The image presents a diagram illustrating three different approaches (a, b, and c) to designing a TPU (Tensor Processing Unit) using a Large Language Model (LLM). The diagram focuses on how user descriptions are processed and translated into a design, highlighting the impact of prompt generation on the final outcome.
### Components/Axes
The diagram consists of three main sections labeled (a), (b), and (c), each representing a different design process. Common elements include:
* **User:** Represented by a silhouette icon.
* **Description 1, Description 2, Description 3... Description n:** Cloud-shaped elements representing user input descriptions.
* **LLM:** A stylized brain icon representing the Large Language Model.
* **Prompt Generator:** A gear icon representing a component that generates prompts.
* **Arrows:** Indicate the flow of information.
* **"Wrong Design"**: A red "X" symbol indicating a failed design outcome.
* **"Desired Design"**: A blue checkmark symbol indicating a successful design outcome.
* **Text Boxes:** Contain example user inputs.
### Detailed Analysis or Content Details
**Section (a): Direct Input**
* A user provides multiple descriptions (Description 1, 2, 3) directly to the LLM.
* The LLM attempts to generate a design.
* The outcome is labeled "Wrong Design" with a red "X".
* User input example 1: "I want to design a TPU with 16 processing elements for..."
* User input example 2: "I need a 16x16 systolic array with a dataflow With support bits for app..."
**Section (b): Prompt Generation - Incorrect**
* The user provides multiple descriptions (Description 1, 2, n) to a Prompt Generator.
* The Prompt Generator creates a prompt and sends it to the LLM.
* The LLM generates code based on the prompt.
* The outcome is not explicitly labeled as wrong, but the overall flow suggests an undesirable result.
* Prompt example: "Generate the entire code for the systolic size with... following input bitwidth..."
**Section (c): Prompt Generation - Correct**
* The user provides multiple descriptions (Description 1, 2, 3) to a Prompt Generator.
* The Prompt Generator creates a prompt and sends it to the LLM.
* The LLM generates a design.
* The outcome is labeled "Desired Design" with a blue checkmark.
### Key Observations
* The diagram highlights the importance of prompt generation when using an LLM for design tasks.
* Direct input from the user to the LLM (section a) results in a "Wrong Design".
* Using a Prompt Generator improves the outcome, as demonstrated by the "Desired Design" in section (c).
* Section (b) suggests that even with a Prompt Generator, the quality of the prompt is crucial for achieving the desired result.
### Interpretation
The diagram illustrates a workflow for utilizing LLMs in hardware design, specifically for TPUs. It demonstrates that simply feeding user descriptions directly to an LLM is insufficient for generating a correct design. The introduction of a Prompt Generator acts as an intermediary, refining the user's intent into a format that the LLM can effectively process. The success of the Prompt Generator is critical; a poorly generated prompt (as potentially implied in section b) can still lead to suboptimal results. The diagram suggests that effective prompt engineering is a key factor in leveraging LLMs for complex design tasks. The use of visual cues like the red "X" and blue checkmark reinforces the idea that the prompt generation step is a binary success/failure point. The diagram doesn't provide quantitative data, but rather a qualitative comparison of different approaches. It's a conceptual illustration of a design process, emphasizing the role of prompt engineering in achieving desired outcomes.
</details>
Figure 3: (a) Multiple descriptions for a single TPU design demonstrate that a design can be verbally defined in numerous ways, potentially misleading LLMs in generating the intended design, (b) Proposed prompt generator extracts the required features from the given verbal descriptions, (c) Using a script to generate a verbal description aligned with the training data.
An immediate use of the proposed dataset, described in Section III-C, is to fine-tune a generic LLM for the task of TPU design, where the input, together with a prompt, is fed to the LLM (Step 2 in Fig. 2). Equivalently, one may employ ICL, or multi-shot learning, as a more computationally efficient alternative to fine-tuning [24]; in that case, the proposed dataset functions as the source of multi-shot examples. Given that the TPU-Gen dataset pairs verbal descriptions with corresponding TPU systolic array designs, the LLM generates a TPU’s top-level file in Verilog as the output. This top-level file includes all necessary architectural module dependencies to ensure a fully functional design (Step 3). Further, we leverage the RAG module to pull the remaining dependency files into the project, completing the design (Step 4). Next, a third-party quality evaluation tool provides a quantitative evaluation of the design, verifies functional correctness, and integrates the design with the full stack (Step 5). Here, for quality and functional evaluation, the generated designs, initially described in Verilog, are synthesized using YOSYS [45]. This synthesis process incorporates an automated RTL-to-GDSII validation stage, where the generated designs are evaluated and classified as either Valid or Invalid based on the completeness of their code sequences and the correctness of their input-output relationships. Valid designs proceed to resource validation, where they are optimized with respect to Power, Performance, and Area (PPA) metrics. In contrast, designs flagged as Invalid initiate a feedback loop for error analysis and subsequent LLM retraining, enabling iterative refinement (Steps 2 to 6) to achieve predefined performance criteria. Ultimately, designs that successfully pass these stages reach Step 7 and are ready for submission to the foundry.
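The regenerate-until-valid loop (Steps 2 to 6) can be sketched as follows. The structural check here is a lightweight, hypothetical stand-in for the actual YOSYS synthesis and RTL-to-GDSII validation, and the feedback string is illustrative:

```python
import re

def structurally_valid(verilog):
    """Lightweight stand-in for the YOSYS check: the real flow runs full
    synthesis; here we only verify module/endmodule pairing and that a
    port list is present."""
    modules = len(re.findall(r"\bmodule\b", verilog))
    ends = len(re.findall(r"\bendmodule\b", verilog))
    return modules > 0 and modules == ends and "(" in verilog

def generate_until_valid(generate, max_iters=5):
    """Steps 2 to 6 of the pipeline: regenerate with feedback until the
    design passes validation or the iteration budget is exhausted."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        design = generate(feedback)           # LLM call in the real flow
        if structurally_valid(design):
            return design, attempt            # proceeds to PPA validation
        feedback = "previous design failed structural validation"
    raise RuntimeError("no valid design within the iteration budget")
```

The key design choice is that validation failures are not terminal: they are folded back into the next generation attempt, so the loop converges toward a synthesizable design rather than rejecting outright.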
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: APTPU Design and Generation Flow
### Overview
This diagram illustrates an iterative process for designing and generating APTPU (Approximate TPU) configurations. The process involves configuration files, verification, OpenRoad analysis, prompt generation, and ultimately, APTPU generation. The diagram uses numbered arrows to indicate the flow of information and a circular arrow to denote the iterative nature of the process.
### Components/Axes
The diagram consists of several rectangular blocks representing data or process stages, connected by arrows. Key components include:
* **APTPU CONFIG FILES:** Stack of blue rectangles.
* **Verification:** A grey gear-shaped block with text "Verify, Synthesize."
* **OpenRoad:** A light blue rectangle with text "PPA reports".
* **APTPU + Metrics corpus:** Stack of orange rectangles.
* **APTPU + Metrics + Descriptions:** Stack of green rectangles.
* **APTPU-Gen:** A light-brown cone-shaped block.
* **Granulated prompt:** A black circle containing a document icon.
* **Iterative process:** A circular arrow.
* **Numbered Arrows:** 1 through 5, indicating the flow.
### Detailed Analysis or Content Details
The diagram depicts a five-step process:
1. **APTPU CONFIG FILES** feed into **Verification**. The arrow is labeled "Tune variables, features".
2. **Verification** outputs to **OpenRoad**.
3. **OpenRoad** outputs to **APTPU + Metrics corpus**.
4. **APTPU + Metrics corpus** feeds into **Granulated prompt**.
5. **Granulated prompt** feeds into **APTPU + Metrics + Descriptions**.
6. **APTPU + Metrics + Descriptions** feeds into **APTPU-Gen**.
The **Iterative process** arrow loops from **APTPU + Metrics corpus** back to **APTPU CONFIG FILES**, indicating a feedback loop.
The text within the **Verification** block states: "Verify, Synthesize."
The text within the **OpenRoad** block states: "PPA reports".
The text within the **APTPU-Gen** block states: "APTPU-Gen".
### Key Observations
The diagram emphasizes an iterative design process. The flow starts with configuration files, goes through verification and analysis (OpenRoad), and then uses the results to refine the configurations. The inclusion of "Metrics" and "Descriptions" suggests a focus on quantifiable performance and detailed documentation. The "Granulated prompt" suggests the use of a prompt-based system, potentially leveraging large language models or similar technologies.
### Interpretation
This diagram outlines a methodology for automated or semi-automated design space exploration and optimization of an APTPU. The iterative loop suggests a continuous improvement cycle where analysis results (PPA reports from OpenRoad) are used to refine the configuration files. The generation of "Metrics" and "Descriptions" alongside the APTPU suggests a focus on not only creating a functional component but also understanding and documenting its performance characteristics. The "Granulated prompt" component hints at the use of AI or machine learning techniques to guide the design process, potentially by generating prompts for synthesis or optimization tools. The overall process appears to be geared towards efficient and well-documented APTPU development. The diagram does not provide specific data or numerical values, but rather illustrates a workflow.
</details>
Figure 4: TPU-Gen dataset curation.
### III-C Dataset Curation
Leveraging the parameterized RTL code of the TPU, we develop a script to systematically explore various architectural configurations and generate a wide range of designs within the proposed framework (step 1 in Fig. 4). The generated designs undergo synthesis and functional verification (step 2). Subsequently, the OpenROAD suite [46] is employed to produce PPA metrics (step 3). The PPA data is parsed using Pyverilog (step 4), resulting in a detailed, multi-level dataset that captures the reported PPA metrics (step 5). Steps 1 to 3 are iterated until all architectural variations are generated. The time required to generate each data point varies with the specific configuration. To efficiently populate the TPU-Gen dataset, we utilize multiple scripts that automate data-point generation across different systolic array sizes, ensuring comprehensive coverage of the design space. Fig. 4 shows the detailed methodology underpinning our dataset creation. Validating this methodology against prior works [10, 47] is difficult because we operate at a different design-space abstraction, which precludes a fully fair comparison. Nevertheless, judged by the scale of operation and the framework’s efficiency, our flow requires comparatively minimal effort.
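The sweep of steps 1 through 5 can be sketched in Python. The helper callables `synthesize` and `report_ppa` are hypothetical placeholders for the verification flow and the OpenROAD PPA reporting described above; the real scripts drive those tools directly.

```python
import itertools

def enumerate_configs(sizes, data_widths, multipliers):
    """Step 1: systematically enumerate architectural configurations
    from the parameterized RTL template's knobs."""
    for (m, n), dw, mult in itertools.product(sizes, data_widths, multipliers):
        yield {"M": m, "N": n, "DW": dw, "MULT": mult}

def curate(configs, synthesize, report_ppa):
    """Steps 2-5: synthesize/verify each design, collect PPA metrics,
    and emit one dataset record per configuration."""
    dataset = []
    for cfg in configs:
        netlist = synthesize(cfg)       # step 2: synthesis + functional verification
        ppa = report_ppa(netlist)       # step 3: OpenROAD PPA reports
        dataset.append({**cfg, **ppa})  # steps 4-5: parsed, multi-level record
    return dataset
```

Because each configuration is independent, the real flow parallelizes this loop across multiple scripts and array sizes.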
<details>
<summary>x5.png Details</summary>

### Visual Description
## Diagram: Approximate Processing Unit (APU) Architecture
### Overview
The image depicts the architecture of an Approximate Processing Unit (APU), highlighting the flow of Input Feature Maps (IFMap) through Processing Array Units (PAU) and the use of approximate adders and multipliers. The diagram illustrates a parallel processing structure with weight application, multiplication, and accumulation operations.
### Components/Axes
The diagram consists of the following key components:
* **IFMap:** Input Feature Map – the input data stream.
* **Weight:** Input weights applied to the IFMap.
* **PAU:** Processing Array Unit – the core processing block.
* **APE:** Approximate Processing Element.
* **APTPU (MxN):** Approximate Processing Tile Processing Unit – the output of the PAU array.
* **Arrows:** Indicate data flow direction.
* **Legend (Top-Right):**
* Solid Black Arrow: DW : \[8,16,32]
* Dashed Black Arrow: WW : \[3,4,5,6,7,8,16,32]
* Solid Red Arrow: Mult\_DW : \[2,3,4,...,12]
* **Approximate Adders (Bottom-Right):** Listed algorithms: SETA, HERLOA, MHEAA…10 more.
* **Approximate Multipliers (Bottom-Right):** Listed algorithms: BAM, UDM, ALM\_LOA…10 more.
### Detailed Analysis or Content Details
The diagram shows two parallel processing paths. Each path consists of the following stages:
1. **Input:** An IFMap enters the PAU.
2. **Weight Application:** A Weight is applied to the IFMap.
3. **Multiplication:** The weighted IFMap is multiplied (indicated by the 'X' symbol) using approximate multipliers (represented by the red arrows labeled "Mult\_DW : \[2,3,4,...,12]").
4. **Shift Operation:** A right shift operation is performed (indicated by the "<<").
5. **Addition:** The shifted result is added (indicated by the '+' symbol) using approximate adders.
6. **Output:** The final result is output from the APTPU (MxN).
The PAU is represented by a light blue square, and there are two PAUs shown in parallel. The dashed cyan arrows represent data flow with a width of "WW : \[3,4,5,6,7,8,16,32]". The solid black arrows represent data flow with a width of "DW : \[8,16,32]".
The bottom-right section lists approximate adders and multipliers used in the APU. The approximate adders include SETA, HERLOA, and MHEAA, with "10 more" algorithms not explicitly listed. The approximate multipliers include BAM, UDM, and ALM\_LOA, with "10 more" algorithms not explicitly listed.
### Key Observations
* The diagram emphasizes the use of approximate computing techniques (approximate adders and multipliers) to potentially reduce power consumption and improve performance.
* The parallel structure of the PAUs suggests a high degree of parallelism in the processing.
* The legend indicates different data widths (DW, WW) and multiplication factors (Mult\_DW) used in the processing.
* The diagram does not provide specific numerical values for the weights or IFMap data.
### Interpretation
The diagram illustrates a hardware architecture designed for efficient approximate computation. The use of PAUs and parallel processing suggests a focus on throughput. The inclusion of approximate adders and multipliers indicates a trade-off between accuracy and efficiency. The different data widths (DW, WW) and multiplication factors (Mult\_DW) suggest a configurable architecture that can be optimized for different applications. The listing of multiple approximate algorithms (SETA, HERLOA, BAM, UDM, etc.) implies a flexible design that can leverage various approximation techniques. The diagram is a high-level representation and does not provide details on the specific implementation of the approximate algorithms or the control logic of the PAUs. The diagram suggests a system designed for applications where some loss of accuracy is acceptable in exchange for significant gains in performance and energy efficiency, such as image processing, machine learning, or signal processing.
</details>
Figure 5: An example of one category and its design space parameters.
Fig. 5 visualizes the selection of different circuits to build PAUs and APEs, accommodating different input Data Widths (DW) (8, 16, 32 bits) and Weight Widths (WW) (ranging from 3 to 32 bits) to generate approximate MAC units. These configurable units highlight the flexibility of the TPU template and enhance its adaptability and performance across various DNN workloads. Support for lower bit-width weights is particularly advantageous for heavily quantized models, enabling efficient processing with reduced computational resources.
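The approximate-MAC design space of Fig. 5 can be enumerated as a Cartesian product of these parameter lists. The sketch below uses the ranges shown in the figure, truncates the adder/multiplier lists to the three named algorithms, and assumes (as an illustrative constraint, not stated in the paper) that WW never exceeds DW:

```python
from itertools import product

# Parameter lists as shown in Fig. 5 (adder/multiplier lists truncated:
# the figure notes "10 more" algorithms in each category).
DW = [8, 16, 32]
WW = [3, 4, 5, 6, 7, 8, 16, 32]
MULT_DW = list(range(2, 13))                # accurate part of approx. multipliers
ADDERS = ["SETA", "HERLOA", "MHEAA"]
MULTIPLIERS = ["BAM", "UDM", "ALM_LOA"]

def mac_variants():
    """Enumerate candidate approximate-MAC configurations."""
    for dw, ww, mdw, add, mul in product(DW, WW, MULT_DW, ADDERS, MULTIPLIERS):
        if ww <= dw:                        # assumed constraint: WW <= DW
            yield {"DW": dw, "WW": ww, "MULT_DW": mdw, "ADDER": add, "MULT": mul}
```

Even this truncated space yields thousands of MAC variants, which is why the dataset-generation scripts automate the sweep rather than relying on manual configuration.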
<details>
<summary>x6.png Details</summary>

### Visual Description
## Text Block: Verilog Code Configuration Summary
### Overview
The image presents a text block summarizing Verilog code configuration options for a systolic array design. It details preprocessor macros used to control design parameters like bitwidth, array dimensions, and approximation methods. It also includes metrics related to area and power consumption.
### Components/Axes
The text is structured into sections:
* **Metrics:** Provides numerical values for "Area", "WNS", and "Total Power".
* **Preprocessor Macros:** Defines macros like `DW`, `M`, `N`, `HERLOA`, `SHARED_PRE_APPROX`, and conditional blocks based on selections like `MITCHELL`, `ALM_SOA`, `ALM_LOA`, and `ROBA`.
* **Summaries:** Includes "BLOCK SUMMARY", "DETAILED GLOBAL SUMMARY", and "HIGH-LEVEL GLOBAL SUMMARY" sections providing descriptions of the code's purpose and functionality.
### Detailed Analysis or Content Details
**Metrics:**
* "Area": "29162" (units unspecified)
* "WNS": "-12.268" (units unspecified, likely Worst Negative Slack)
* "Total Power": "4.21e-03" (units unspecified)
**Preprocessor Macros:**
* `define DW 8` // Choose IFMAP bitwidth
* `define M 4` // Choose M dimensions of the systolic array
* `define N 4` // Choose N dimensions of the systolic array
* `define HERLOA /APADDER`
* `ifdef MITCHELL ...`
* `define SHARED_PRE_APPROX`
* `elsif ALM_SOA ...`
* `define SHARED_PRE_APPROX`
* `elsif ALM_LOA ...`
* `define SHARED_PRE_APPROX`
* `elsif ROBA ...`
**BLOCK SUMMARY:**
The code block defines various preprocessor macros to configure the design parameters, such as the type of nibble width (NIBBLE_WIDTH), bitwidths for IFMAP (DW), dimensions of the systolic array ('M' and 'N'), accurate part of approximate multipliers (MULT_DW). block_4: This code.... related to different approximate... the `ALM` macro.
**DETAILED GLOBAL SUMMARY:**
The provided Verilog code represents a design for a 4x4 systolic array implementation...the type of multiplier, adder, and other design choices. features, pre-approximation (SHARED_PRE_APPROX). These macros are controlled by the selection. Overall, this code represents by ...adjusting the preprocessor macros.
**HIGH-LEVEL GLOBAL SUMMARY:**
The provided Verilog...4x4 systolic array design that utilizes...adder (HERLOA), bitwidths, the design is highly configurable, with the ability... features like...This flexibility allows the design to be tailored... leading to improvements in area, power, and timing performance...such as machine learning efficiency are critical factors.
### Key Observations
* The design is for a 4x4 systolic array (defined by `M 4` and `N 4`).
* The IFMAP bitwidth is set to 8 (defined by `DW 8`).
* The `SHARED_PRE_APPROX` macro is defined within multiple conditional compilation blocks (`ifdef MITCHELL`, `elsif ALM_SOA`, `elsif ALM_LOA`, `elsif ROBA`), suggesting different approximation strategies can be selected.
* The metrics provided (Area, WNS, Total Power) are likely the result of a synthesis or implementation run.
* The summaries emphasize the configurability and flexibility of the design.
### Interpretation
The text describes a configurable Verilog implementation of a 4x4 systolic array, likely intended for machine learning applications given the mention of machine learning efficiency. The use of preprocessor macros allows for tailoring the design to specific requirements, balancing area, power, and performance. The conditional compilation blocks suggest different approximation techniques can be employed, potentially trading off accuracy for efficiency. The provided metrics offer a quantitative assessment of the design's characteristics, although the units are not specified. The negative WNS value indicates potential timing issues that may need to be addressed. The overall design philosophy appears to be focused on flexibility and optimization for resource-constrained environments.
</details>
Figure 6: An example of a data point adapting the MG-Verilog format.
The TPU-Gen dataset offers 29,952 possible variations per systolic array size, across 8 different systolic array implementations, to facilitate workloads spanning from 4 $\times$ 4 arrays for smaller loads to 256 $\times$ 256 arrays for larger DNN workloads. Accounting for the systolic-size variations, the TPU-Gen dataset promises a total of 29,952 $\times$ 8 = 239,616 data points with reported PPA metrics. While TPU-Gen is constantly growing with newer data points, our current dataset checkpoint contains 25,000 individual TPU designs. We provide two variations: $(i)$ a top-module file containing the details of the entire circuit implementation, usable in cases such as RAG integration to save computational resources, and $(ii)$ a detailed, multi-level granulated dataset, depicted in Fig. 6, curated by adapting MG-Verilog [17] to assist LLMs in generating Verilog code and to support the development of a highly sophisticated, fine-tuned model. This model facilitates the automated generation of individual hardware modules, along with intelligent integration, deployment, and reuse across various designs and architectures. Please note that, due to the domain-specific nature of the dataset, some data redundancy is inevitable, as similar modules are reused and reconfigured to construct new TPUs with varying architectural configurations. This structured dataset enables efficient exploration and customization of TPU designs while ensuring that the generated modules can be systematically adapted to different design requirements, leading to enhanced flexibility and scalability in hardware design automation. Additionally, we provide detailed metrics for each design iteration, which aid the LLM in generating budget-constrained designs or in creating an efficient design space exploration strategy to accelerate the optimization process.
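One multi-level data point in the adapted MG-Verilog format can be pictured as the record below. The field names follow the structure shown in Fig. 6 (code, PPA metrics, and block/detailed/high-level summaries); the exact schema of the released dataset may differ, so treat this as an illustrative sketch:

```python
# Illustrative MG-Verilog-style record (field names follow Fig. 6;
# the released dataset's exact schema may differ).
data_point = {
    "code": "`define DW 8\n`define M 4\n`define N 4\n`define HERLOA\n...",
    "metrics": {"area": 29162, "wns": -12.268, "total_power": 4.21e-3},
    "summaries": {
        "block": "Preprocessor macros configure bitwidths, array size, ...",
        "detailed_global": "A 4x4 systolic array with selectable approximate units ...",
        "high_level_global": "Highly configurable 4x4 systolic array design ...",
    },
}

def total_variants(per_size_variants=29_952, num_implementations=8):
    """Design-space size quoted in the text: per-size variations
    multiplied by the number of systolic array implementations."""
    return per_size_variants * num_implementations
```

The multi-granularity summaries let the fine-tuned model be prompted at whichever abstraction level (block vs. whole-design) suits the generation task.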
TABLE III: Prompts to successfully generate exact TPU modules via TPU-Gen.
| LLM Model | Module Generation Pass@1 | Pass@3 | Pass@5 | Pass@10 | Module Integration Pass@1 | Pass@3 | Pass@5 | Pass@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mistral-7B (Q3) | 17% | 83% | 100% | 100% | 0% | 25% | 75% | 100% |
| CodeLlama-7B (Q4) | 0% | 50% | 83% | 100% | 0% | 50% | 75% | 100% |
| CodeLlama-13B (Q4) | 66% | 83% | 100% | 100% | 25% | 75% | 100% | 100% |
| Claude 3.5 Sonnet | 83% | 100% | 100% | 100% | 75% | 100% | 100% | 100% |
| ChatGPT-4o | 83% | 100% | 100% | 100% | 50% | 100% | 100% | 100% |
| Gemini Advanced | 50% | 50% | 74% | 91% | 25% | 75% | 74% | 91% |
## IV Experiment Results
### IV-A Objectives
We designed four distinct experiments employing various approaches, each tailored to the unique capabilities of LLMs such as GPT [48], Gemini [49], and Claude [50], as well as the best open-source models from the leaderboard [51]. Each model is deployed in experiments aligned with the study’s objectives and anticipated outcomes. Experiment 1 focuses on observing the prompting mechanism that assists the LLM in generating the desired output by implementing ICL; with this knowledge, we develop the prompt template discussed in Section III-B. Experiment 2 focuses on adapting the proposed TPU-Gen framework by fine-tuning LLM models. For fine-tuning, we used 4 $\times$ A100 GPUs with 80GB VRAM. Experiment 3 demonstrates the effectiveness of RAG in TPU-Gen and its applicability to hardware design. Experiment 4 tests the TPU-Gen framework’s ability to generate designs efficiently with an industry-standard 45nm technology library. Throughout the process, we also keep the hardware within the given PPA budget to ensure the feasibility of achieving the objectives outlined in the initial phases.
### IV-B Experiments and Results
<details>
<summary>x7.png Details</summary>

### Visual Description
## Bar and Line Chart: LLM Performance vs. APTPU Modules & Model Comparison
### Overview
The image presents two charts side-by-side. Chart (a) is a bar chart with overlaid line graphs, comparing the average prompts for Commercial and Open-Sourced LLMs against the number of APTPU Modules. Chart (b) is a bar chart showing the number of prompts for different LLM models.
### Components/Axes
**Chart (a):**
* **X-axis:** APTPU Modules, ranging from 0 to 30, with markers at 5, 10, 15, 20, 25.
* **Y-axis:** Average Prompts for LLMs, ranging from 0 to 25.
* **Data Series:**
* Commercial LLMs (represented by dark gray bars)
* Open-Sourced LLMs (represented by light gray bars)
* Trendline Commercial (represented by a solid blue dashed line)
* Trendline Open-Sourced (represented by a solid red dashed line)
* **Legend:** Located in the top-left corner, clearly labeling each data series with corresponding colors.
**Chart (b):**
* **X-axis:** LLM Models, labeled 1 to 6.
* **Y-axis:** Number of Prompts, ranging from 0 to 8.
* **Data Series:**
* ChatGPT 4o (represented by black bars)
* Gemmini Advanced (represented by dark gray bars)
* Claude (represented by light gray bars)
* Codellama 13B (represented by white bars)
* Codellama 7B (represented by light blue bars)
* Mistral 7B (represented by light green bars)
* **Legend:** Located in the top-right corner, clearly labeling each data series with corresponding colors.
### Detailed Analysis or Content Details
**Chart (a):**
The dark gray bars (Commercial LLMs) generally show higher average prompts than the light gray bars (Open-Sourced LLMs) across all APTPU module values.
* At 0 APTPU Modules: Commercial LLMs ≈ 2 prompts, Open-Sourced LLMs ≈ 1 prompt.
* At 5 APTPU Modules: Commercial LLMs ≈ 8 prompts, Open-Sourced LLMs ≈ 10 prompts.
* At 10 APTPU Modules: Commercial LLMs ≈ 11 prompts, Open-Sourced LLMs ≈ 17 prompts.
* At 15 APTPU Modules: Commercial LLMs ≈ 14 prompts, Open-Sourced LLMs ≈ 19 prompts.
* At 20 APTPU Modules: Commercial LLMs ≈ 16 prompts, Open-Sourced LLMs ≈ 20 prompts.
* At 25 APTPU Modules: Commercial LLMs ≈ 18 prompts, Open-Sourced LLMs ≈ 18 prompts.
The blue dashed line (Trendline Commercial) shows an upward trend initially, then plateaus around 16-18 prompts. The red dashed line (Trendline Open-Sourced) shows a more pronounced upward trend, peaking around 20 prompts, then decreasing slightly.
**Chart (b):**
* LLM Model 1 (ChatGPT 4o): ≈ 7 prompts.
* LLM Model 2 (Gemmini Advanced): ≈ 6 prompts.
* LLM Model 3 (Claude): ≈ 2 prompts.
* LLM Model 4 (Codellama 13B): ≈ 2 prompts.
* LLM Model 5 (Codellama 7B): ≈ 1 prompt.
* LLM Model 6 (Mistral 7B): ≈ 1 prompt.
### Key Observations
* In Chart (a), Open-Sourced LLMs initially outperform Commercial LLMs in terms of average prompts at lower APTPU module counts, but this advantage diminishes and reverses as the number of modules increases.
* The trendlines in Chart (a) suggest diminishing returns for both Commercial and Open-Sourced LLMs as the number of APTPU modules increases.
* In Chart (b), ChatGPT 4o and Gemmini Advanced receive significantly more prompts than the other models. Claude, Codellama 13B, Codellama 7B, and Mistral 7B receive a relatively low number of prompts.
### Interpretation
The data suggests a relationship between the number of APTPU modules and the performance (measured by average prompts) of LLMs. Initially, Open-Sourced LLMs may be more efficient with fewer modules, but Commercial LLMs scale better with increased resources. The trendlines indicate that there's a point of diminishing returns, where adding more modules doesn't significantly improve performance.
Chart (b) highlights the popularity or usage of different LLM models. ChatGPT 4o and Gemmini Advanced are clearly the most frequently used models in this dataset, while the others are used much less often. This could be due to factors such as model capabilities, accessibility, or cost.
The combination of these two charts provides insights into the trade-offs between resource allocation (APTPU modules) and model choice when deploying LLMs. It suggests that optimizing both the hardware infrastructure and the model selection is crucial for achieving optimal performance. The difference in prompt numbers between models could be due to a variety of factors, including model quality, task suitability, and user preference.
</details>
Figure 7: Average TPU-Gen prompts for (a) Module Generation, and (b) Module Integration via LLMs.
#### IV-B 1 Experiment 1: ICL-Driven TPU Generation and Approximate Design Adaptation.
We evaluate the capability of LLMs to generate and synthesize a novel TPU architecture and its approximate version using TPU-Gen. Utilizing the prompt template from [18], we refined it to better harness LLM capabilities. LLM performance is assessed on two metrics: $(i)$ Module Generation, the ability to generate the required modules, and $(ii)$ Module Integration, the capability to construct the top module by integrating components. We tested commercial models [48, 49] via chat interfaces and the open-source models listed in Table III, using LM Studio [52]. For the exact TPU, we successfully developed the design and obtained the GDSII layout (Fig. 8 (a)). Commercial models performed well with a single prompt, averaging 72% in module generation and 50% in integration at pass@1. Open-source models improved markedly as $k$ increased, rising to 100% in module generation by pass@10 and from 25-75% at pass@3 to 100% at pass@10 in integration. For the approximate TPU, which involves approximate circuit algorithms, we provided example circuits and used ICL and Chain of Thought (CoT) to guide the LLMs. Open-source models struggled due to a lack of specialized knowledge, as shown in Fig. 7. The design layout from this experiment is shown in Fig. 8 (b). All outputs were manually verified using test benches. This is the first work to generate both exact and approximate TPU architectures via LLM prompting. However, significant human expertise and intervention are required, especially for complex architectures such as approximate circuits. To minimize human involvement, we next employ fine-tuning.
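The pass@k rates of this kind (Table III) are commonly computed with the standard unbiased pass@k estimator; the paper does not state which estimator it uses, so the following is a generic sketch rather than the authors' exact method:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples passes, given n generated attempts of which c were correct.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 attempts of which 5 pass, `pass_at_k(10, 5, 1)` is 0.5, and the estimate rises toward 1.0 as k grows.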
Takeaway 1. LLMs with efficient prompting are capable of generating exact and approximate TPU modules and integrating them into complete designs. However, extensive human involvement is still required, especially for novel architectures. Fine-tuning LLMs is necessary to reduce human intervention and facilitate the exploration of new designs.
#### IV-B 2 Experiment 2: Full TPU-Gen Implementation
This experiment investigates cost-efficient approaches for adapting domain-specific language models to hardware design. In previous experiments, we observed that limited spatial and hierarchical hardware knowledge hindered LLM performance in integrating circuits. The TPU-Gen template (Fig. 2) addresses this by delegating creative tasks to the LLM and retrieving dependent modules via RAG, optimizing AI accelerator design while reducing computational overhead and minimizing LLM hallucinations. The ICL experiments show that fine-tuning enhances LLM reliability, and TPU-Gen proposes a way to develop domain-specific LLMs with minimal data. The experiment used version 1 of the TPU-Gen dataset, comprising 5,000 Verilog headers with DW and WW inputs. This dataset comprises systolic array implementations with biased approximate circuit variations. We statically split the data 80:20 for training and testing open-source LLMs [51], with two primary goals: $1.$ analyzing the impact of the prompt-template generator on the fine-tuned LLM’s performance (Table IV), and $2.$ investigating the RAG model for hardware development.
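A static 80:20 split simply means the partition is fixed and reproducible across runs. A minimal sketch (the seed and shuffling strategy are illustrative assumptions, not details from the paper):

```python
import random

def static_split(records, train_frac=0.8, seed=0):
    """Deterministic ("static") train/test split of dataset records.
    The fixed seed makes the same partition reproducible across runs."""
    rng = random.Random(seed)
    shuffled = records[:]          # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

Applied to the 5,000-header dataset, this yields 4,000 training and 1,000 test records.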
<details>
<summary>extracted/6256789/Figures/GDSII.jpg Details</summary>

### Visual Description
## Heatmap: Microstructure Visualization
### Overview
The image presents three heatmaps (labeled a, b, and c) visualizing a microstructure. Each heatmap displays a grid-like structure with color variations representing different properties or concentrations within the material. The heatmaps appear to represent the same area, but with differing levels of granularity or processing. There are no explicit axis labels or legends provided within the images themselves.
### Components/Axes
The images consist of a grid overlayed with colored regions. The grid appears to be uniform across all three images. The color scheme appears to be consistent across all three images, with shades of green, red, and pink/purple. There are no visible axis titles or scales. The images are labeled (a), (b), and (c) at the bottom-left corner.
### Detailed Analysis or Content Details
Due to the lack of a legend, precise quantification of the color values is impossible. However, we can describe the distribution of colors within each heatmap:
* **(a):** This heatmap shows large, clustered regions of color. Predominantly, there are large areas of orange/red, interspersed with green and pink/purple. The red regions are more concentrated and form distinct, blocky shapes. The green and pink/purple are more dispersed within and around the red regions.
* **(b):** This heatmap exhibits a more refined structure compared to (a). The red regions are smaller and more fragmented, with a greater proportion of green and pink/purple. The overall distribution appears more uniform, with less distinct clustering.
* **(c):** This heatmap displays the highest level of granularity. The red regions are even more fragmented and dispersed, with a significant increase in green and pink/purple. The structure appears almost entirely composed of a fine network of colored regions.
The grid lines are visible in all three images, providing a reference for spatial distribution. The grid appears to be approximately 20x20.
### Key Observations
The primary trend observed is a decrease in the size and concentration of red regions from (a) to (c), accompanied by an increase in the proportion of green and pink/purple. This suggests a process of refinement or dispersion occurring across the three stages represented by the heatmaps. The images suggest a transition from a coarse, clustered microstructure (a) to a fine, dispersed microstructure (c).
### Interpretation
The heatmaps likely represent a process of material transformation, such as diffusion, phase separation, or etching. The red color could represent a specific element or phase, while green and pink/purple represent other components or the matrix material.
* **(a)** could represent the initial state of the material, with large concentrations of the red component.
* **(b)** could represent an intermediate stage, where the red component begins to diffuse or break down.
* **(c)** could represent the final state, where the red component is evenly dispersed throughout the material.
The increasing granularity from (a) to (c) suggests that the process is leading to a more homogeneous microstructure. Without a legend, it is impossible to determine the exact nature of the process or the meaning of the colors. However, the visual trend clearly indicates a change in the material's microstructure over time or under different conditions. The images are descriptive and do not provide quantitative data.
</details>
Figure 8: A GDSII layout of (a) TPU, (b) TPU by prompting LLM, (c) approximate TPU by TPU-Gen framework.
All models were fine-tuned with Low-Rank Adaptation (LoRA) using the Adam optimizer at a learning rate of $10^{-5}$. The fine-tuned models were then evaluated on generating the TPU from a random prompt at pass@1. From Table IV, we observe that the outputs produced without the prompt generator are labeled as failures, as they were unsuitable for further development and RAG integration. When the same prompt is first parsed by the prompt-template generator, a single try achieves an accuracy of 86.6%. Further, we used RAG to process the generated Verilog headers for module retrieval. According to [11], LLMs tend to prioritize creativity and innovative solutions, which often results in straying from the data; to address this, we employed a compute- and cost-efficient method. This shows that fine-tuning together with RAG can greatly enhance performance. Fig. 8 (c) shows the GDSII layout of the design generated by the TPU-Gen framework.
TABLE IV: Prompt Generator vs Human inputs to Fine-tuned models.
| LLM Model | Prompt Generator Pass | Prompt Generator Fail | Human Input Pass | Human Input Fail |
| --- | --- | --- | --- | --- |
| CodeLlama-7B-hf | 27 | 03 | 01 | 29 |
| CodeQwen1.5-7B | 25 | 05 | 0 | 30 |
| Mistral-7B | 28 | 02 | 02 | 28 |
| Starcoder2-7B | 24 | 06 | 0 | 30 |
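Aggregating Table IV, and assuming its first numeric column pair reports pass/fail counts with the prompt generator (over 30 trials per model), the overall success rate works out to roughly the 86.6% quoted above:

```python
# Per-model (pass, fail) counts with the prompt generator, read from Table IV
# under the assumed pass/fail column interpretation.
prompt_gen = {
    "CodeLlama-7B-hf": (27, 3),
    "CodeQwen1.5-7B": (25, 5),
    "Mistral-7B": (28, 2),
    "Starcoder2-7B": (24, 6),
}

passes = sum(p for p, _ in prompt_gen.values())          # 104
total = sum(p + f for p, f in prompt_gen.values())       # 120
success_rate = 100 * passes / total                      # ~86.7%, i.e. the ~86.6% reported
```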
Takeaway 2. Prompting techniques such as the prompt template steer the LLM to generate the desired results after fine-tuning, as observed with an 86.6% generation success rate. RAG is a cost-efficient method to reliably generate hardware modules, completing the entire Verilog design for an application with minimal computational overhead.
#### IV-B 3 Experiment 3: Significance of RAG
To assess the effectiveness of RAG in the TPU-Gen framework, we evaluated 1,000 Verilog header codes generated by fine-tuned LLMs under two conditions: with and without RAG integration. Table V presents results over 30 designs tested by our framework to generate complete project files. Without RAG, failures occurred due to output token limitations and hallucinated variables. RAG is essential as the design is not a standalone file to compile. Validated header codes were provided in the RAG-enabled pipeline, and required modules were dynamically retrieved from the RAG database, ensuring fully functional and accurate designs. Conversely, models without RAG relied solely on internal knowledge, leading to hallucinations, token constraints, and incomplete designs. Models using RAG consistently achieved pass rates exceeding 95%, with Mistral-7B and CodeLlama-7B-hf attaining 100% success. In contrast, all models failed entirely without RAG, underscoring its pivotal role in ensuring design accuracy and addressing LLM limitations. RAG provides a robust solution to key challenges in fine-tuned LLMs for TPU hardware design by retrieving external information from the RAG database, ensuring contextual accuracy, and significantly reducing hallucinations. Additionally, RAG dynamically fetches dependencies in a modular manner, enabling the generation of complete and accurate designs without exceeding token limits. RAG is a promising solution in this context since our models were fine-tuned with only Verilog header data detailing design features. However, fine-tuning models with the entire design data would expose LLMs to severe hallucinations and token limitations, making generating detailed and functional designs challenging.
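The dynamic, modular dependency retrieval described above can be sketched as follows. The regex-based instantiation scan and the dict-backed `rag_db` are simplifying assumptions for illustration; the actual framework's retrieval mechanism is not specified at this level of detail.

```python
import re

def retrieve_dependencies(header_code, rag_db):
    """Given an LLM-generated top file, fetch every instantiated module
    from the RAG design database, following transitive dependencies so
    the project compiles as a complete design."""
    # Naive Verilog instantiation pattern: "<module_name> <instance_name> ("
    inst = re.compile(r"^\s*(\w+)\s+\w+\s*\(", re.MULTILINE)
    needed, project = set(inst.findall(header_code)), {}
    while needed:
        name = needed.pop()
        if name in project or name not in rag_db:
            continue  # already retrieved, or a primitive/unknown identifier
        project[name] = rag_db[name]
        needed.update(inst.findall(rag_db[name]))  # transitive dependencies
    return project
```

Because modules are fetched one at a time from the database rather than regenerated, the LLM's output stays within token limits and the retrieved RTL is hallucination-free by construction.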
TABLE V: Significance of RAG in TPU-Gen.
| LLM Model | Pass w/ RAG (%) | Fail w/ RAG (%) | Pass w/o RAG (%) | Fail w/o RAG (%) |
| --- | --- | --- | --- | --- |
| CodeLlama-7B-hf | 100 | 0 | 0 | 100 |
| Mistral-7B | 100 | 0 | 0 | 100 |
| CodeQwen1.5-7B | 95 | 5 | 0 | 100 |
| StarCoder2-7B | 98 | 2 | 0 | 100 |
Takeaway 3. The experiment highlights the significance of pairing RAG with a fine-tuned model: RAG curbs hallucinations while still allowing the LLM to be consistently creative.
#### IV-B 4 Experiment 4: Design Generation Efficiency
Building on the successful generation of the approximate TPU in Experiment 2, here we evaluate and benchmark the architectures produced by the TPU-Gen framework. As the work performed in this paper is the first of its kind, we compare against manual optimization by expert human designers, focusing on power, area, and latency as shown in Fig. 9 (a)-(c). We utilize four DNN architectures for this evaluation: LeNet, ResNet18, VGG16, and ResNet56, performing inference tasks on the MNIST, CIFAR-10, SVHN, and CIFAR-100 datasets. In the manually optimized designs, a skilled hardware engineer fine-tunes parameters within the TPU template. This iterative optimization is repeated until no further performance gains can be achieved within a reasonable timeframe of approximately one day [9], or until the expert determines, based on empirical results, that additional refinements would yield minimal benefit. Using the PPA metrics as reference values (e.g., 100 mW, 0.25 mm$^2$, 48 ms for ResNet56), both TPU-Gen and the manual designer are tasked with generating the TPU architecture. Fig. 9 illustrates that, across a range of network architectures, TPU-Gen consistently yields results with minimal deviation from the reference benchmarks, whereas the manual designs exhibit significant PPA violations.
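Checking a generated design against such a reference budget is a simple per-metric comparison. A minimal sketch, using the ResNet56 budget quoted above (the metric names and the optional tolerance knob are illustrative assumptions):

```python
def check_budget(design_ppa, reference, tolerance=0.0):
    """Return, per metric, whether the design meets the reference budget.
    `tolerance` optionally allows a fractional overshoot (0.0 = strict)."""
    return {
        metric: design_ppa[metric] <= limit * (1 + tolerance)
        for metric, limit in reference.items()
    }

# Reference budget for ResNet56 from the text: 100 mW, 0.25 mm^2, 48 ms.
resnet56_budget = {"power_mw": 100, "area_mm2": 0.25, "latency_ms": 48}
```

A design reporting, say, 160 mW would be flagged on the power metric while still passing area and latency, mirroring the per-constraint annotations in Fig. 9.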
<details>
<summary>x8.png Details</summary>

### Visual Description
## Bar Charts: Performance Comparison of Manual Effort vs. APTPU-Gen
### Overview
This image presents three bar charts (labeled (a), (b), and (c)) comparing the performance of "Manual effort" and "APTPU-Gen" across three metrics: Power Consumption (mW), Area (µm²), and Latency (ms). Each chart compares these methods for four different neural network architectures: LeNet, ResNet18, VGG16, and ResNet56. Horizontal dashed red lines indicate constraints for each metric, and text annotations indicate whether these constraints are met or violated.
### Components/Axes
Each chart shares the following components:
* **X-axis:** Neural Network Architecture (LeNet, ResNet18, VGG16, ResNet56)
* **Y-axis:**
* (a) Power Consumption (mW) - Scale from 0 to 200.
* (b) Area (µm²) - Scale from 0 to 6 x 10⁴.
* (c) Latency (ms) - Scale from 0 to 60.
* **Legend:**
* Black bars: "Manual effort"
* Gray bars: "APTPU-Gen"
* **Constraints:** Horizontal dashed red lines with annotations:
* (a) "Power Constraint Met!"
* (b) "Area Constraint Violated!"
* (c) "Latency Constraint Met!"
### Detailed Analysis or Content Details
**Chart (a): Power Consumption (mW)**
* **LeNet:** Manual effort ≈ 10 mW, APTPU-Gen ≈ 8 mW.
* **ResNet18:** Manual effort ≈ 20 mW, APTPU-Gen ≈ 15 mW.
* **VGG16:** Manual effort ≈ 110 mW, APTPU-Gen ≈ 80 mW.
* **ResNet56:** Manual effort ≈ 160 mW, APTPU-Gen ≈ 100 mW.
* The red dashed line is at approximately 100 mW. The "Power Constraint Met!" annotation is associated with the APTPU-Gen results.
**Chart (b): Area (µm²)**
* **LeNet:** Manual effort ≈ 0.2 x 10⁴ µm², APTPU-Gen ≈ 0.1 x 10⁴ µm².
* **ResNet18:** Manual effort ≈ 0.8 x 10⁴ µm², APTPU-Gen ≈ 0.5 x 10⁴ µm².
* **VGG16:** Manual effort ≈ 2.5 x 10⁴ µm², APTPU-Gen ≈ 2.0 x 10⁴ µm².
* **ResNet56:** Manual effort ≈ 3.5 x 10⁴ µm², APTPU-Gen ≈ 2.8 x 10⁴ µm².
* The red dashed line is at approximately 3 x 10⁴ µm². The "Area Constraint Violated!" annotation is associated with the Manual effort results.
**Chart (c): Latency (ms)**
* **LeNet:** Manual effort ≈ 5 ms, APTPU-Gen ≈ 3 ms.
* **ResNet18:** Manual effort ≈ 15 ms, APTPU-Gen ≈ 10 ms.
* **VGG16:** Manual effort ≈ 35 ms, APTPU-Gen ≈ 30 ms.
* **ResNet56:** Manual effort ≈ 50 ms, APTPU-Gen ≈ 40 ms.
* The red dashed line is at approximately 50 ms. The "Latency Constraint Met!" annotation is associated with the APTPU-Gen results.
### Key Observations
* APTPU-Gen consistently outperforms Manual effort across all three metrics and all four neural network architectures.
* The performance gap between the two methods widens as the complexity of the neural network increases (from LeNet to ResNet56).
* Manual effort violates the area constraint, while APTPU-Gen meets it.
* APTPU-Gen meets both the power and latency constraints, while manual effort does not consistently meet the power constraint.
### Interpretation
The data strongly suggests that APTPU-Gen is a superior method for optimizing neural network performance compared to Manual effort. It achieves lower power consumption, smaller area, and reduced latency across a range of network architectures. The increasing performance gap with network complexity indicates that APTPU-Gen is particularly effective for more demanding models. The constraint violations highlight the practical benefits of APTPU-Gen, as it enables designs that adhere to critical performance limitations. The consistent trend across all architectures suggests that the advantages of APTPU-Gen are not specific to any particular network structure. This data could be used to justify the adoption of APTPU-Gen as a preferred optimization technique in resource-constrained environments.
</details>
Figure 9: PPA metrics comparison for TPU architectures generated by TPU-Gen and the manual user: (a) Power consumption, (b) Area, (c) Latency.
Takeaway 4. TPU-Gen consistently yields results with minimal deviation from the PPA reference values, whereas the manual designs exhibit significant violations.
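The constraint check underlying Fig. 9 can be sketched as a simple per-metric budget comparison. This is an illustrative sketch only: the function and field names are hypothetical, and the sample metric values are made up, but the budgets match the ResNet56 reference values quoted above (100 mW, 0.25 mm², 48 ms).

```python
def check_ppa(metrics, budget):
    """Compare a design's PPA metrics against reference budgets.

    Returns, per metric, whether the constraint is met and the relative
    deviation from the budget (negative means under budget)."""
    report = {}
    for key, limit in budget.items():
        value = metrics[key]
        report[key] = {
            "met": value <= limit,
            "deviation": (value - limit) / limit,
        }
    return report

# Reference budgets for ResNet56, as quoted in the text.
resnet56_budget = {"power_mw": 100.0, "area_mm2": 0.25, "latency_ms": 48.0}

# Hypothetical metrics for a generated design (for illustration only).
generated = {"power_mw": 98.0, "area_mm2": 0.24, "latency_ms": 40.0}

report = check_ppa(generated, resnet56_budget)
print(all(r["met"] for r in report.values()))  # True: all constraints met
```

A design like the manual ResNet56 point in Fig. 9(b), which exceeds the area budget, would simply report `met: False` with a positive deviation for that metric.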
## V Conclusions
This paper introduces TPU-Gen, a novel dataset and framework for TPU generation, addressing the complexities of generating AI accelerators amidst rapid AI model evolution. A key challenge, hallucinated variables, is mitigated using a RAG approach that dynamically adapts hardware modules. RAG enables cost-effective, full-scale RTL code generation, achieving budget-constrained outputs via fine-tuned models. Our extensive experimental evaluations demonstrate superior performance, power, and area efficiency, with average reductions in area and power of 92% and 96% relative to the manual optimization reference values. These results set new standards for driving advancements in next-generation design automation tools powered by LLMs. We are committed to releasing the dataset and fine-tuned models publicly if accepted.
## References
- [1] N. Jouppi, C. Young, N. Patil, and D. Patterson, “Motivation for and evaluation of the first tensor processing unit,” IEEE Micro, vol. 38, no. 3, pp. 10–19, 2018.
- [2] H. Genc et al., “Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration,” in 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021, pp. 769–774.
- [3] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models to fpgas,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12.
- [4] W.-Q. Ren et al., “A survey on collaborative dnn inference for edge intelligence,” Machine Intelligence Research, vol. 20, no. 3, pp. 370–395, 2023.
- [5] D. Vungarala, M. Morsali, S. Tabrizchi, A. Roohi, and S. Angizi, “Comparative study of low bit-width dnn accelerators: Opportunities and challenges,” in 2023 IEEE 66th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 2023, pp. 797–800.
- [6] P. Xu and Y. Liang, “Automatic code generation for rocket chip rocc accelerators,” 2020.
- [7] S. Angizi, Z. He, A. Awad, and D. Fan, “Mrima: An mram-based in-memory accelerator,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 5, pp. 1123–1136, 2019.
- [8] K. Chang, Y. Wang, H. Ren, M. Wang, S. Liang, Y. Han, H. Li, and X. Li, “Chipgpt: How far are we from natural language hardware design,” arXiv preprint arXiv:2305.14019, 2023.
- [9] Y. Fu, Y. Zhang, Z. Yu, S. Li, Z. Ye, C. Li, C. Wan, and Y. C. Lin, “Gpt4aigchip: Towards next-generation ai accelerator design automation via large language models,” in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 1–9.
- [10] S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “Verigen: A large language model for verilog code generation,” ACM Transactions on Design Automation of Electronic Systems, vol. 29, no. 3, pp. 1–31, 2024.
- [11] X. Jiang, Y. Tian, F. Hua, C. Xu, Y. Wang, and J. Guo, “A survey on large language model hallucination via a creativity perspective,” arXiv preprint arXiv:2402.06647, 2024.
- [12] J. Blocklove, S. Garg, R. Karri, and H. Pearce, “Chip-chat: Challenges and opportunities in conversational hardware design,” in 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD). IEEE, 2023, pp. 1–6.
- [13] S. Thakur, J. Blocklove, H. Pearce, B. Tan, S. Garg, and R. Karri, “Autochip: Automating hdl generation using llm feedback,” arXiv preprint arXiv:2311.04887, 2023.
- [14] R. Ma, Y. Yang, Z. Liu, J. Zhang, M. Li, J. Huang, and G. Luo, “Verilogreader: Llm-aided hardware test generation,” arXiv preprint arXiv:2406.04373, 2024.
- [15] W. Fang et al., “Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms,” arXiv preprint arXiv:2402.00386, 2024.
- [16] M. Liu, N. Pinckney, B. Khailany, and H. Ren, “Verilogeval: Evaluating large language models for verilog code generation,” arXiv preprint arXiv:2309.07544, 2024.
- [17] Y. Zhang, Z. Yu, Y. Fu, C. Wan, and Y. C. Lin, “Mg-verilog: Multi-grained dataset towards enhanced llm-assisted verilog generation,” arXiv preprint arXiv:2407.01910, 2024.
- [18] D. Vungarala, M. Nazzal, M. Morsali, C. Zhang, A. Ghosh, A. Khreishah, and S. Angizi, “Sa-ds: A dataset for large language model-driven ai accelerator design generation,” arXiv e-prints, pp. arXiv–2404, 2024.
- [19] H. Wu et al., “Chateda: A large language model powered autonomous agent for eda,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024.
- [20] B. Nadimi and H. Zheng, “A multi-expert large language model architecture for verilog code generation,” arXiv preprint arXiv:2404.08029, 2024.
- [21] Y. Lu, S. Liu, Q. Zhang, and Z. Xie, “Rtllm: An open-source benchmark for design rtl generation with large language model,” in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2024, pp. 722–727.
- [22] D. Vungarala, S. Alam, A. Ghosh, and S. Angizi, “Spicepilot: Navigating spice code generation and simulation with ai guidance,” arXiv preprint arXiv:2410.20553, 2024.
- [23] Y. Lai, S. Lee, G. Chen, S. Poddar, M. Hu, D. Z. Pan, and P. Luo, “Analogcoder: Analog circuit design via training-free code generation,” arXiv preprint arXiv:2405.14918, 2024.
- [24] D. Dai, Y. Sun, L. Dong, Y. Hao, S. Ma, Z. Sui, and F. Wei, “Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers,” arXiv preprint arXiv:2212.10559, 2022.
- [25] G. Izacard et al., “Atlas: Few-shot learning with retrieval augmented language models,” Journal of Machine Learning Research, vol. 24, no. 251, pp. 1–43, 2023.
- [26] J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large language models in retrieval-augmented generation,” arXiv preprint arXiv:2309.01431, 2023.
- [27] R. Qin et al., “Robust implementation of retrieval-augmented generation on edge-based computing-in-memory architectures,” arXiv preprint arXiv:2405.04700, 2024.
- [28] A. Roohi, S. Sheikhfaal, S. Angizi, D. Fan, and R. F. DeMara, “Apgan: Approximate gan for robust low energy learning from imprecise components,” IEEE Transactions on Computers, vol. 69, no. 3, pp. 349–360, 2019.
- [29] M. S. Ansari, B. Cockburn, and J. Han, “An improved logarithmic multiplier for energy-efficient neural computing,” IEEE Trans. on Comput., vol. 70, pp. 614–625, 2021.
- [30] S. Angizi, M. Morsali, S. Tabrizchi, and A. Roohi, “A near-sensor processing accelerator for approximate local binary pattern networks,” IEEE Transactions on Emerging Topics in Computing, vol. 12, no. 1, pp. 73–83, 2023.
- [31] H. Jiang, S. Angizi, D. Fan, J. Han, and L. Liu, “Non-volatile approximate arithmetic circuits using scalable hybrid spin-cmos majority gates,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 68, no. 3, pp. 1217–1230, 2021.
- [32] S. Angizi, Z. He, A. S. Rakin, and D. Fan, “Cmp-pim: an energy-efficient comparator-based processing-in-memory neural network accelerator,” in Proceedings of the 55th Annual Design Automation Conference, 2018, pp. 1–6.
- [33] S. Angizi, H. Jiang, R. F. DeMara, J. Han, and D. Fan, “Majority-based spin-cmos primitives for approximate computing,” IEEE Transactions on Nanotechnology, vol. 17, no. 4, pp. 795–806, 2018.
- [34] M. E. Elbtity, H.-W. Son, D.-Y. Lee, and H. Kim, “High speed, approximate arithmetic based convolutional neural network accelerator,” 2020 International SoC Design Conference (ISOCC), pp. 71–72, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:231826033
- [35] H. Younes, A. Ibrahim, M. Rizk, and M. Valle, “Algorithmic level approximate computing for machine learning classifiers,” 2019 26th IEEE Int. Conf. on Electron., Circuits and Syst. (ICECS), pp. 113–114, 2019.
- [36] S. Hashemi, R. I. Bahar, and S. Reda, “DRUM: A dynamic range unbiased multiplier for approximate applications,” 2015 IEEE/ACM Int. Conf. on Comput.-Aided Design (ICCAD), pp. 418–425, 2015.
- [37] P. Yin, C. Wang, H. Waris, W. Liu, Y. Han, and F. Lombardi, “Design and analysis of energy-efficient dynamic range approximate logarithmic multipliers for machine learning,” IEEE Transactions on Sustainable Computing, vol. 6, no. 4, pp. 612–625, 2021.
- [38] A. Dalloo, A. Najafi, and A. Garcia-Ortiz, “Systematic design of an approximate adder: The optimized lower part constant-or adder,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 8, pp. 1595–1599, 2018.
- [39] M. E. Elbtity, P. S. Chandarana, B. Reidy, J. K. Eshraghian, and R. Zand, “Aptpu: Approximate computing based tensor processing unit,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 12, pp. 5135–5146, 2022.
- [40] F. Farshchi et al., “New approximate multiplier for low power digital signal processing,” The 17th CSI International Symposium on Computer Architecture & Digital Systems (CADS 2013), pp. 25–30, 2013.
- [41] W. Liu et al., “Design and evaluation of approximate logarithmic multipliers for low power error-tolerant applications,” IEEE Trans. on Circuits and Syst. I: Reg. Papers, vol. 65, pp. 2856–2868, 2018.
- [42] S. S. Sarwar et al., “Energy-efficient neural computing with approximate multipliers,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 14, pp. 1 – 23, 2018.
- [43] R. Zendegani et al., “Roba multiplier: A rounding-based approximate multiplier for high-speed yet energy-efficient digital signal processing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, pp. 393–401, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:206810935
- [44] M. Niu, H. Li, J. Shi, H. Haddadi, and F. Mo, “Mitigating hallucinations in large language models via self-refinement-enhanced knowledge retrieval,” arXiv preprint arXiv:2405.06545, 2024.
- [45] (2024) Yosys. [Online]. Available: https://github.com/YosysHQ/yosys
- [46] (2018) Openroad. [Online]. Available: https://github.com/The-OpenROAD-Project/OpenROAD
- [47] H. Pearce et al., “Dave: Deriving automatically verilog from english,” in MLCAD, 2020, pp. 27–32.
- [48] (2024) Openai gpt-4. [Online]. Available: https://openai.com/index/hello-gpt-4o/
- [49] (2024) Gemini. [Online]. Available: https://deepmind.google
- [50] (2023) Anthropic. [Online]. Available: https://www.anthropic.com
- [51] Evalplus leaderboard. https://evalplus.github.io/leaderboard.html. Accessed: 2024-09-21.
- [52] “Lm studio - discover, download, and run local llms,” https://lmstudio.ai/, accessed: 2024-09-21.