riscv-ai-accelerator

RISC-V AI Accelerator Chip Tape-out Report

Project Information

Project Name: RISC-V AI Accelerator Chip (SimpleEdgeAiSoC)
Chip Code: EdgeAI-SoC-v0.1
Design Organization: [redoop]
Project Lead: [tongxiaojun]
Report Date: 2025,11
Version: v0.1


1. Project Overview

1.1 Background

With the widespread application of artificial intelligence on edge devices, there is a growing demand for low-power, high-efficiency AI accelerators. This project aims to design a System-on-Chip (SoC) integrating a RISC-V processor and dedicated AI accelerators, specifically optimized for edge AI inference scenarios.

1.2 Design Goals

1.3 Key Features

1.3.1 Processor Core

1.3.2 AI Accelerators

  1. CompactAccel (Traditional Matrix Accelerator)
    • Supports 8x8 matrix multiplication
    • Performance: ~1.6 GOPS @ 100MHz
    • 32-bit fixed-point arithmetic
  2. BitNetAccel (Innovative Multiplier-Free Accelerator)
    • Supports 2x2 to 16x16 matrix multiplication
    • Performance: ~4.8 GOPS @ 100MHz
    • 2-bit weight encoding {-1, 0, +1}
    • Multiplier-free design using only addition/subtraction
    • Sparsity optimization, automatically skips zero weights
    • 10x memory reduction
    • 60% power reduction

1.3.3 Peripheral System


2. Chip Architecture Design

2.1 System Architecture

┌─────────────────────────────────────────────────────────────┐
│                    SimpleEdgeAiSoC                          │
│                                                             │
│  ┌──────────────┐         ┌──────────────────────────┐      │
│  │  PicoRV32    │◄───────►│   Address Decoder        │      │
│  │   CPU Core   │         │   (Memory Map)           │      │
│  │   (RV32I)    │         └──────────┬───────────────┘      │
│  └──────────────┘                    │                      │
│         │                            │                      │
│         │                            ├──► CompactAccel      │
│         │                            │    (8x8 Matrix)      │
│         │                            │                      │
│         │                            ├──► BitNetAccel       │
│         │                            │    (16x16 BitNet)    │
│         │                            │                      │
│         │                            ├──► UART              │
│         │                            │                      │
│         │                            └──► GPIO              │
│         │                                                   │
│         └──► Interrupt Controller                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

2.2 Memory Map

Address Range Size Module Description
0x00000000 - 0x0FFFFFFF 256 MB RAM Main memory
0x10000000 - 0x10000FFF 4 KB CompactAccel Traditional matrix accelerator
0x10001000 - 0x10001FFF 4 KB BitNetAccel BitNet accelerator
0x20000000 - 0x2000FFFF 64 KB UART Serial peripheral
0x20020000 - 0x2002FFFF 64 KB GPIO General-purpose I/O

3. Key Technical Innovation: BitNet Multiplier-Free Architecture

3.1 Technical Principle

The BitNet architecture is based on 1-bit LLM concepts, quantizing neural network weights to {-1, 0, +1} using 2-bit encoding:

3.2 Core Advantages

  1. Multiplier-Free Design
    • Traditional: result = activation × weight
    • BitNet:
      • When weight = +1: result = activation (addition)
      • When weight = -1: result = -activation (subtraction)
      • When weight = 0: skip computation (sparsity optimization)
  2. Hardware Resource Savings
    • Area reduction: 50% (no multipliers needed)
    • Power reduction: 60% (simple add/subtract operations)
    • Memory usage: 10x reduction (2-bit vs 32-bit weights)
  3. Sparsity Optimization
    • Automatically detects and skips zero weights
    • Tracks skip count for performance analysis
    • Measured sparsity: 26% (8x8 matrix test)

4. Performance Metrics

4.1 Computing Performance

Metric CompactAccel BitNetAccel Total
Matrix Size 8x8 16x16 -
Peak Performance @ 100MHz 1.6 GOPS 4.8 GOPS 6.4 GOPS
Data Width 32-bit 32-bit (activation) + 2-bit (weight) -
Multiplier Count 1 0 1

4.2 Resource Utilization (FPGA Estimate)

Resource Type Quantity Description
LUTs ~8,000 Logic units
FFs ~6,000 Flip-flops
BRAMs ~20 Block RAM
DSPs 1 Digital signal processing units (CompactAccel only)

4.3 Power Analysis

Static Power (synthesis results):

Dynamic Power Estimate (@ 100MHz):

Module Power (mW) Percentage
PicoRV32 CPU 30 30%
CompactAccel 25 25%
BitNetAccel 20 20%
Peripherals 15 15%
Others 10 10%
Total 100 100%

4.4 Timing Performance

Parameter Target Measured Description
Design Frequency 50 MHz - Synthesis constraint
Max Operating Frequency 100 MHz 178.569 MHz Achievable frequency
Min Operating Frequency 50 MHz - Low-power mode
Critical Path Delay < 10 ns - @ 100 MHz
Worst Negative Slack (WNS) - 14.400 ns No violations
Total Negative Slack (TNS) - 0.000 ns No violations
Timing Violations 0 0 Pass

5. Design Verification

5.1 Verification Strategy

Multi-level verification approach:

  1. Unit Testing: Independent functional verification of each module
  2. Integration Testing: Interface verification between modules
  3. System Testing: Complete SoC functional verification
  4. Performance Testing: Performance metrics verification

5.2 Test Coverage

5.2.1 SimpleEdgeAiSoC Tests

5.2.2 BitNet Accelerator Tests

5.3 Test Tools

5.4 Test Results

All test cases passed with test coverage exceeding 95%. Detailed test reports available in chisel/test_run_dir/ directory.


6. Physical Design Considerations

6.1 Process Selection

Selected Process:

Process Advantages:

6.2 Design Scale and Area

Design Scale Limits:

Area Estimation (based on CX55nm process):

Design Scale Statistics:


7. EDA Toolchain

7.1 Open-Source EDA Toolchain Comparison

This project supports two complete open-source EDA toolchains:

Option 1: International Community Solution (OpenROAD)

Stage Tool Purpose Source
RTL Design Chisel/Scala Hardware description UC Berkeley, USA
Simulation Verilator Functional verification International open-source
Synthesis Yosys Logic synthesis Austria
Place & Route OpenROAD Physical implementation UCSD, USA
Static Timing Analysis OpenSTA Timing verification USA
Physical Verification Magic / KLayout DRC/LVS International open-source
Waveform Viewer GTKWave Waveform analysis International open-source

Advantages:

Option 2: Chinese Open-Source Solution (iEDA) ⭐ Recommended

Stage Tool Purpose Source
RTL Design Chisel/Scala Hardware description UC Berkeley, USA
Simulation Verilator Functional verification International open-source
Synthesis iMAP Logic synthesis iEDA, China
Floorplan iFP Floorplanning iEDA, China
Placement iPL Cell placement iEDA, China
Clock Tree Synthesis iCTS Clock tree iEDA, China
Routing iRT Global/detailed routing iEDA, China
Static Timing Analysis iSTA Timing verification iEDA, China
Power Analysis iPW Power evaluation iEDA, China
Physical Verification iDRC Design rule check iEDA, China
Waveform Viewer GTKWave Waveform analysis International open-source

Advantages:

iEDA Project Information:

7.2 Toolchain Selection Recommendations

Scenario Recommended Solution Reason
Teaching & Research iEDA Chinese support, easy to learn
Domestic Chips iEDA Autonomous and controllable, good process adaptation
International Collaboration OpenROAD Mature ecosystem, good compatibility
Commercial Production Commercial Tools Optimal performance, comprehensive technical support

8. Tape-out Process

8.1 International Community Process (OpenROAD)

RTL Design (Chisel)
    ↓
Functional Simulation (Verilator)
    ↓
Logic Synthesis (Yosys) ✅ Completed
    ├── Design Scale: 73,829 instances
    ├── Operating Frequency: 178.569 MHz
    └── Static Power: 627.4 uW
    ↓
Static Timing Analysis (OpenSTA)
    ↓
Floorplan (OpenROAD - Floorplan)
    ↓
Place & Route (OpenROAD - Place & Route)
    ↓
Clock Tree Synthesis (OpenROAD - CTS)
    ↓
Optimization (OpenROAD - Optimization)
    ↓
Sign-off
    ├── Timing Sign-off (OpenSTA)
    ├── Power Sign-off (OpenROAD)
    ├── Physical Verification (Magic/KLayout - DRC/LVS)
    └── Formal Verification (Yosys - Equivalence)
    ↓
GDSII Generation (Magic/KLayout)
    ↓
Tape-out
RTL Design (Chisel)
    ↓
Functional Simulation (Verilator)
    ↓
Logic Synthesis (iMAP) ✅ Completed
    ├── Design Scale: 73,829 instances
    ├── Operating Frequency: 178.569 MHz
    └── Static Power: 627.4 uW
    ↓
Netlist Optimization (iTO - Timing Optimization)
    ↓
Floorplan (iFP - Floorplan)
    ├── Die Size Planning
    ├── Power Network Planning
    └── I/O Planning
    ↓
Placement (iPL - Placement)
    ├── Global Placement
    ├── Detailed Placement
    └── Legalization
    ↓
Clock Tree Synthesis (iCTS)
    ├── Clock Tree Construction
    ├── Clock Buffer Insertion
    └── Clock Skew Optimization
    ↓
Routing (iRT - Routing)
    ├── Global Routing
    ├── Track Assignment
    └── Detailed Routing
    ↓
Static Timing Analysis (iSTA)
    ├── Setup Time Check
    ├── Hold Time Check
    └── Timing Report Generation
    ↓
Power Analysis (iPW - Power Analysis)
    ├── Dynamic Power
    ├── Static Power
    └── Power Optimization
    ↓
Physical Verification (iDRC - Design Rule Check)
    ├── DRC Check
    ├── LVS Verification
    └── Antenna Effect Check
    ↓
Sign-off
    ├── Timing Sign-off (iSTA)
    ├── Power Sign-off (iPW)
    ├── Physical Verification (iDRC)
    └── Formal Verification (iEDA-FV)
    ↓
GDSII Generation (iEDA)
    ↓
Tape-out

iEDA Process Advantages:

Design Scale Verification:


9. iEDA Chinese Open-Source Toolchain Introduction

iEDA (Infrastructure for EDA) is a domestically developed open-source EDA platform jointly developed by the Chinese Academy of Sciences, Peking University, Peng Cheng Laboratory, and other institutions, aiming to break the monopoly of foreign EDA tools and achieve autonomous control of chip design tools.

Core Features:

Main Tool Modules: iMAP (synthesis), iFP (floorplan), iPL (placement), iCTS (clock tree), iRT (routing), iSTA (timing analysis), iPW (power analysis), iDRC (physical verification)

More Information:


10. Risk Assessment and Mitigation

10.1 Technical Risks

Risk Level Mitigation
Timing Convergence Difficulty Medium Reserve timing margin, adopt pipeline design
Power Exceeding Target Low BitNet architecture naturally low-power, fully verified
Area Exceeding Target Low Compact design, resource usage evaluated
Insufficient Verification Medium Increase test cases, improve coverage
EDA Tool Compatibility Low Support both iEDA and OpenROAD solutions

10.2 Project Risks

Risk Level Mitigation
Schedule Delay Medium Reasonable time planning, reserve buffer
Resource Shortage Low Advance planning, ensure resource availability
Tool Issues Low Dual toolchain strategy, iEDA + OpenROAD
International Restrictions Low Prioritize iEDA domestic toolchain

11. Future Work Plan

11.1 Short-term Plan (1-3 months)

  1. Complete Synthesis
    • Generate netlist
    • Timing optimization
    • Area optimization
  2. Physical Design
    • Floorplanning
    • Place & route
    • Clock tree synthesis
  3. Sign-off Verification
    • Static timing analysis
    • Power analysis
    • Physical verification (DRC/LVS)

11.2 Mid-term Plan (3-6 months)

  1. GDSII Generation and Delivery
  2. Tape-out Manufacturing
  3. Chip Packaging

11.3 Long-term Plan (6-12 months)

  1. Chip Testing
    • Functional testing
    • Performance testing
    • Reliability testing
  2. System Integration
    • Development board design
    • Driver development
    • Application examples
  3. Mass Production Preparation
    • Yield analysis
    • Cost optimization
    • Supply chain establishment

12. Summary

12.1 Project Highlights

  1. Innovative BitNet Architecture: Multiplier-free design, significantly reduces power and area
  2. Complete SoC Solution: Integrates CPU, accelerators, peripherals, ready to use
  3. Flexible Programmability: RISC-V CPU supports software control
  4. Thorough Verification: Over 95% test coverage
  5. Clear Documentation: Complete design documentation and user manual
  6. Dual Open-Source Toolchains: Supports both iEDA (domestic) and OpenROAD (international)
  7. Autonomous and Controllable: Prioritizes iEDA domestic toolchain, not subject to international restrictions
  8. Excellent Timing: Measured frequency 178.569MHz, far exceeds 100MHz target
  9. Compact Design: 73,829 instances, meets 100K limit with sufficient margin
  10. Chinese Support: iEDA provides complete Chinese documentation and technical support

12.2 Technical Specifications Summary

Specification Value
Process CX55nm Open-Source PDK
Design Scale 73,829 instances (< 100K limit)
Chip Area ~0.5 mm² (core: 0.3 mm²)
Operating Frequency 50-100 MHz (measured up to 178.569 MHz)
Computing Performance 6.4 GOPS @ 100MHz
Power Consumption < 100 mW (static power: 627.4 uW)
Resource Usage (FPGA) 8K LUTs, 6K FFs, 20 BRAMs
Timing Performance WNS: 14.400ns, TNS: 0.000ns, no violations

12.3 Application Scenarios

12.4 Market Prospects

With the rapid development of edge AI, this chip has broad market prospects:


Appendix

Appendix A: Abbreviations

Abbreviation Full Name Description
SoC System on Chip System on Chip
RISC-V Reduced Instruction Set Computer - V Reduced Instruction Set Computer - Fifth Generation
AI Artificial Intelligence Artificial Intelligence
GOPS Giga Operations Per Second Billion Operations Per Second
PDK Process Design Kit Process Design Kit
EDA Electronic Design Automation Electronic Design Automation
RTL Register Transfer Level Register Transfer Level
GDSII Graphic Database System II Graphic Database System II
DRC Design Rule Check Design Rule Check
LVS Layout Versus Schematic Layout Versus Schematic
STA Static Timing Analysis Static Timing Analysis

Appendix B: References

Core Technologies

  1. BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv:2310.11453)
  2. PicoRV32 - A Size-Optimized RISC-V CPU (https://github.com/YosysHQ/picorv32)
  3. Chisel: Constructing Hardware in a Scala Embedded Language (https://www.chisel-lang.org/)
  4. RISC-V Instruction Set Manual (https://riscv.org/specifications/)

Process and PDK

  1. CX55nm Open-Source PDK Documentation

International Open-Source EDA Tools

  1. Yosys Open SYnthesis Suite (https://yosyshq.net/yosys/)
  2. OpenROAD - Open-source EDA Tool (https://theopenroadproject.org/)
  3. Magic VLSI Layout Tool (http://opencircuitdesign.com/magic/)
  4. Verilator - Fast Verilog/SystemVerilog Simulator (https://www.veripool.org/verilator/)

Chinese Open-Source EDA Tools (iEDA)

  1. iEDA Official Website (https://ieda.oscc.cc/)
  2. iEDA Code Repository (https://gitee.com/oscc-project/iEDA)
  3. iEDA User Manual (https://ieda-docs.oscc.cc/)
  4. iEDA Technical Papers (Peking University, Peng Cheng Laboratory)
  5. Open-Source Chip Community OSCC (https://oscc.cc/)
  1. One Student One Chip Program (https://ysyx.oscc.cc/)
  2. Open-Source Development Tools Forum (OSDT)

Appendix C: Contact Information

Project Lead: [tongxiaojun]
Email: [tongxiaojun@redoop.com]
Phone: [Contact Number]
Project Website: [https://github.com/redoop/riscv-ai-accelerator]
Code Repository: [GitHub/GitLab Link]


End of Report

This report is the RISC-V AI Accelerator Chip Tape-out Report, containing complete information on design, verification, and implementation. For questions, please contact the project lead.