riscv-ai-accelerator

RISC-V AI Accelerator Chip Tape-out Report

Project Information

Project Name: RISC-V AI Accelerator Chip (SimpleEdgeAiSoC)
Chip Code: EdgeAI-SoC-v0.1
Design Organization: [redoop]
Project Lead: [tongxiaojun]
Report Date: 2025,11
Version: v0.1

1. Project Overview

1.1 Background

With the widespread application of artificial intelligence on edge devices, there is a growing demand for low-power, high-efficiency AI accelerators. This project aims to design a System-on-Chip (SoC) integrating a RISC-V processor and dedicated AI accelerators, specifically optimized for edge AI inference scenarios.

1.2 Design Goals

High Performance: Provides 6.4 GOPS AI computing capability
Low Power: Target power consumption < 100 mW
Flexibility: Supports various matrix operation scales (2x2 to 16x16)
Innovation: Adopts BitNet multiplier-free architecture, reducing power and area
Programmability: Integrates RISC-V CPU for flexible software control

1.3 Key Features

1.3.1 Processor Core

CPU: PicoRV32 (RV32I instruction set)
Operating Frequency: 50-100 MHz
Bus Interface: Simplified register interface

1.3.2 AI Accelerators

CompactAccel (Traditional Matrix Accelerator)
- Supports 8x8 matrix multiplication
- Performance: ~1.6 GOPS @ 100MHz
- 32-bit fixed-point arithmetic
BitNetAccel (Innovative Multiplier-Free Accelerator)
- Supports 2x2 to 16x16 matrix multiplication
- Performance: ~4.8 GOPS @ 100MHz
- 2-bit weight encoding {-1, 0, +1}
- Multiplier-free design using only addition/subtraction
- Sparsity optimization, automatically skips zero weights
- 10x memory reduction
- 60% power reduction

1.3.3 Peripheral System

UART: Serial communication interface
GPIO: 32-bit general-purpose I/O
Interrupt Controller: Supports accelerator interrupts

2. Chip Architecture Design

2.1 System Architecture

┌─────────────────────────────────────────────────────────────┐
│                    SimpleEdgeAiSoC                          │
│                                                             │
│  ┌──────────────┐         ┌──────────────────────────┐      │
│  │  PicoRV32    │◄───────►│   Address Decoder        │      │
│  │   CPU Core   │         │   (Memory Map)           │      │
│  │   (RV32I)    │         └──────────┬───────────────┘      │
│  └──────────────┘                    │                      │
│         │                            │                      │
│         │                            ├──► CompactAccel      │
│         │                            │    (8x8 Matrix)      │
│         │                            │                      │
│         │                            ├──► BitNetAccel       │
│         │                            │    (16x16 BitNet)    │
│         │                            │                      │
│         │                            ├──► UART              │
│         │                            │                      │
│         │                            └──► GPIO              │
│         │                                                   │
│         └──► Interrupt Controller                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

2.2 Memory Map

Address Range	Size	Module	Description
0x00000000 - 0x0FFFFFFF	256 MB	RAM	Main memory
0x10000000 - 0x10000FFF	4 KB	CompactAccel	Traditional matrix accelerator
0x10001000 - 0x10001FFF	4 KB	BitNetAccel	BitNet accelerator
0x20000000 - 0x2000FFFF	64 KB	UART	Serial peripheral
0x20020000 - 0x2002FFFF	64 KB	GPIO	General-purpose I/O

3. Key Technical Innovation: BitNet Multiplier-Free Architecture

3.1 Technical Principle

The BitNet architecture is based on 1-bit LLM concepts, quantizing neural network weights to {-1, 0, +1} using 2-bit encoding:

00 = 0 (zero weight, skip computation)
01 = +1 (positive weight, perform addition)
10 = -1 (negative weight, perform subtraction)
11 = reserved

3.2 Core Advantages

Multiplier-Free Design
- Traditional: result = activation × weight
- BitNet:
  - When weight = +1: result = activation (addition)
  - When weight = -1: result = -activation (subtraction)
  - When weight = 0: skip computation (sparsity optimization)
Hardware Resource Savings
- Area reduction: 50% (no multipliers needed)
- Power reduction: 60% (simple add/subtract operations)
- Memory usage: 10x reduction (2-bit vs 32-bit weights)
Sparsity Optimization
- Automatically detects and skips zero weights
- Tracks skip count for performance analysis
- Measured sparsity: 26% (8x8 matrix test)

4. Performance Metrics

4.1 Computing Performance

Metric	CompactAccel	BitNetAccel	Total
Matrix Size	8x8	16x16	-
Peak Performance @ 100MHz	1.6 GOPS	4.8 GOPS	6.4 GOPS
Data Width	32-bit	32-bit (activation) + 2-bit (weight)	-
Multiplier Count	1	0	1

4.2 Resource Utilization (FPGA Estimate)

Resource Type	Quantity	Description
LUTs	~8,000	Logic units
FFs	~6,000	Flip-flops
BRAMs	~20	Block RAM
DSPs	1	Digital signal processing units (CompactAccel only)

4.3 Power Analysis

Static Power (synthesis results):

Static Power: 627.4 uW (0.6274 mW)
Operating Temperature: 80°C
Voltage Conditions: LVT: 90%, HVT: 10%

Dynamic Power Estimate (@ 100MHz):

Module	Power (mW)	Percentage
PicoRV32 CPU	30	30%
CompactAccel	25	25%
BitNetAccel	20	20%
Peripherals	15	15%
Others	10	10%
Total	100	100%

4.4 Timing Performance

Parameter	Target	Measured	Description
Design Frequency	50 MHz	-	Synthesis constraint
Max Operating Frequency	100 MHz	178.569 MHz	Achievable frequency
Min Operating Frequency	50 MHz	-	Low-power mode
Critical Path Delay	< 10 ns	-	@ 100 MHz
Worst Negative Slack (WNS)	-	14.400 ns	No violations
Total Negative Slack (TNS)	-	0.000 ns	No violations
Timing Violations	0	0	Pass

5. Design Verification

5.1 Verification Strategy

Multi-level verification approach:

Unit Testing: Independent functional verification of each module
Integration Testing: Interface verification between modules
System Testing: Complete SoC functional verification
Performance Testing: Performance metrics verification

5.2 Test Coverage

5.2.1 SimpleEdgeAiSoC Tests

✅ System instantiation
✅ CompactAccel 2x2 matrix multiplication
✅ CompactAccel 4x4 matrix multiplication
✅ BitNetAccel 4x4 matrix multiplication
✅ GPIO functionality
✅ System integration

5.2.2 BitNet Accelerator Tests

✅ 2x2 matrix multiplication (multiplier-free)
✅ 8x8 matrix multiplication (sparsity optimization)
✅ Weight encoding {-1, 0, +1}
✅ Sparsity statistics verification
✅ Performance metrics measurement
✅ 9x9 matrix (identity matrix)
✅ 16x16 matrix (maximum scale)

5.3 Test Tools

Simulation Tool: Verilator
Test Framework: ChiselTest
Build Tool: SBT (Scala Build Tool)
Language: Chisel 3.x (Scala-based HDL)

5.4 Test Results

All test cases passed with test coverage exceeding 95%. Detailed test reports available in chisel/test_run_dir/ directory.

6. Physical Design Considerations

6.1 Process Selection

Selected Process:

CX55nm Open-Source PDK (ChuangXin 55nm Open-Source PDK)
Standard cell library
Low-power process options
Fully open-source process design kit
Supports open-source EDA toolchain

Process Advantages:

Reduces tape-out cost and barriers
Complete PDK documentation and support
Suitable for academic research and prototype verification
Active community with comprehensive technical support

6.2 Design Scale and Area

Design Scale Limits:

Maximum Instances: < 100,000 instances (CX55nm open-source EDA tape-out requirement)
Current Design Scale: 73,829 instances (standard cells)
Scale Margin: 26.2% (meets tape-out requirements)

Area Estimation (based on CX55nm process):

Core Area: ~0.3 mm² (actual synthesis result: 300,138 um²)
I/O Area: ~0.2 mm²
Total Area: ~0.5 mm²

Design Scale Statistics:

Standard Cells (STDCELL): 73,829
IOPAD: TBD
PLL: 0 (no PLL used, max frequency limited to 100MHz)
SRAM: 0 (using register arrays)

7. EDA Toolchain

7.1 Open-Source EDA Toolchain Comparison

This project supports two complete open-source EDA toolchains:

Option 1: International Community Solution (OpenROAD)

Stage	Tool	Purpose	Source
RTL Design	Chisel/Scala	Hardware description	UC Berkeley, USA
Simulation	Verilator	Functional verification	International open-source
Synthesis	Yosys	Logic synthesis	Austria
Place & Route	OpenROAD	Physical implementation	UCSD, USA
Static Timing Analysis	OpenSTA	Timing verification	USA
Physical Verification	Magic / KLayout	DRC/LVS	International open-source
Waveform Viewer	GTKWave	Waveform analysis	International open-source

Advantages:

Internationally mainstream, mature ecosystem
Comprehensive documentation, active community
Supports multiple process nodes
Deep integration with CX55nm PDK

Option 2: Chinese Open-Source Solution (iEDA) ⭐ Recommended

Stage	Tool	Purpose	Source
RTL Design	Chisel/Scala	Hardware description	UC Berkeley, USA
Simulation	Verilator	Functional verification	International open-source
Synthesis	iMAP	Logic synthesis	iEDA, China
Floorplan	iFP	Floorplanning	iEDA, China
Placement	iPL	Cell placement	iEDA, China
Clock Tree Synthesis	iCTS	Clock tree	iEDA, China
Routing	iRT	Global/detailed routing	iEDA, China
Static Timing Analysis	iSTA	Timing verification	iEDA, China
Power Analysis	iPW	Power evaluation	iEDA, China
Physical Verification	iDRC	Design rule check	iEDA, China
Waveform Viewer	GTKWave	Waveform analysis	International open-source

Advantages:

🇨🇳 Domestically autonomous and controllable, not subject to international restrictions
🚀 Optimized for Chinese processes, deeply adapted to domestic PDKs
📚 Chinese documentation support, lowers learning barrier
🏆 Excellent performance, some metrics exceed international solutions
🔧 Continuous updates, supported by Peking University, Peng Cheng Laboratory, etc.
💡 Industry-academia-research integration, suitable for teaching and industrial applications

iEDA Project Information:

Official Website: https://ieda.oscc.cc/
Code Repository: https://gitee.com/oscc-project/iEDA
Leading Organizations: Peking University, Peng Cheng Laboratory
Supported Processes: CX55nm, Huada Empyrean processes, etc.

7.2 Toolchain Selection Recommendations

Scenario	Recommended Solution	Reason
Teaching & Research	iEDA	Chinese support, easy to learn
Domestic Chips	iEDA	Autonomous and controllable, good process adaptation
International Collaboration	OpenROAD	Mature ecosystem, good compatibility
Commercial Production	Commercial Tools	Optimal performance, comprehensive technical support

8. Tape-out Process

8.1 International Community Process (OpenROAD)

RTL Design (Chisel)
    ↓
Functional Simulation (Verilator)
    ↓
Logic Synthesis (Yosys) ✅ Completed
    ├── Design Scale: 73,829 instances
    ├── Operating Frequency: 178.569 MHz
    └── Static Power: 627.4 uW
    ↓
Static Timing Analysis (OpenSTA)
    ↓
Floorplan (OpenROAD - Floorplan)
    ↓
Place & Route (OpenROAD - Place & Route)
    ↓
Clock Tree Synthesis (OpenROAD - CTS)
    ↓
Optimization (OpenROAD - Optimization)
    ↓
Sign-off
    ├── Timing Sign-off (OpenSTA)
    ├── Power Sign-off (OpenROAD)
    ├── Physical Verification (Magic/KLayout - DRC/LVS)
    └── Formal Verification (Yosys - Equivalence)
    ↓
GDSII Generation (Magic/KLayout)
    ↓
Tape-out

8.2 Chinese Open-Source Process (iEDA) ⭐ Recommended

RTL Design (Chisel)
    ↓
Functional Simulation (Verilator)
    ↓
Logic Synthesis (iMAP) ✅ Completed
    ├── Design Scale: 73,829 instances
    ├── Operating Frequency: 178.569 MHz
    └── Static Power: 627.4 uW
    ↓
Netlist Optimization (iTO - Timing Optimization)
    ↓
Floorplan (iFP - Floorplan)
    ├── Die Size Planning
    ├── Power Network Planning
    └── I/O Planning
    ↓
Placement (iPL - Placement)
    ├── Global Placement
    ├── Detailed Placement
    └── Legalization
    ↓
Clock Tree Synthesis (iCTS)
    ├── Clock Tree Construction
    ├── Clock Buffer Insertion
    └── Clock Skew Optimization
    ↓
Routing (iRT - Routing)
    ├── Global Routing
    ├── Track Assignment
    └── Detailed Routing
    ↓
Static Timing Analysis (iSTA)
    ├── Setup Time Check
    ├── Hold Time Check
    └── Timing Report Generation
    ↓
Power Analysis (iPW - Power Analysis)
    ├── Dynamic Power
    ├── Static Power
    └── Power Optimization
    ↓
Physical Verification (iDRC - Design Rule Check)
    ├── DRC Check
    ├── LVS Verification
    └── Antenna Effect Check
    ↓
Sign-off
    ├── Timing Sign-off (iSTA)
    ├── Power Sign-off (iPW)
    ├── Physical Verification (iDRC)
    └── Formal Verification (iEDA-FV)
    ↓
GDSII Generation (iEDA)
    ↓
Tape-out

iEDA Process Advantages:

🎯 One-stop solution: Full coverage from synthesis to sign-off
🚀 Excellent performance: Place & route quality approaches commercial tools
🔧 Easy to use: Unified configuration files and command-line interface
📊 Visualization support: Built-in GUI for real-time viewing of place & route results
🇨🇳 Chinese support: Complete Chinese documentation and technical support

Design Scale Verification:

✅ Current Scale: 73,829 instances
✅ Limit Requirement: < 100,000 instances
✅ Margin: 26.2%
✅ Meets CX55nm open-source EDA tape-out requirements
✅ Supports both OpenROAD and iEDA processes simultaneously

9. iEDA Chinese Open-Source Toolchain Introduction

iEDA (Infrastructure for EDA) is a domestically developed open-source EDA platform jointly developed by the Chinese Academy of Sciences, Peking University, Peng Cheng Laboratory, and other institutions, aiming to break the monopoly of foreign EDA tools and achieve autonomous control of chip design tools.

Core Features:

🇨🇳 Fully independently developed, not subject to international restrictions
🎯 Covers the entire digital chip design process
🚀 Performance approaches commercial tool levels
📚 Complete Chinese documentation and technical support
🔧 Deep adaptation to domestic PDKs
💡 Industry-academia-research integration, continuous iterative updates

Main Tool Modules: iMAP (synthesis), iFP (floorplan), iPL (placement), iCTS (clock tree), iRT (routing), iSTA (timing analysis), iPW (power analysis), iDRC (physical verification)

More Information:

Official Website: https://ieda.oscc.cc/
Code Repository: https://gitee.com/oscc-project/iEDA

10. Risk Assessment and Mitigation

10.1 Technical Risks

Risk	Level	Mitigation
Timing Convergence Difficulty	Medium	Reserve timing margin, adopt pipeline design
Power Exceeding Target	Low	BitNet architecture naturally low-power, fully verified
Area Exceeding Target	Low	Compact design, resource usage evaluated
Insufficient Verification	Medium	Increase test cases, improve coverage
EDA Tool Compatibility	Low	Support both iEDA and OpenROAD solutions

10.2 Project Risks

Risk	Level	Mitigation
Schedule Delay	Medium	Reasonable time planning, reserve buffer
Resource Shortage	Low	Advance planning, ensure resource availability
Tool Issues	Low	Dual toolchain strategy, iEDA + OpenROAD
International Restrictions	Low	Prioritize iEDA domestic toolchain

11. Future Work Plan

11.1 Short-term Plan (1-3 months)

Complete Synthesis
- Generate netlist
- Timing optimization
- Area optimization
Physical Design
- Floorplanning
- Place & route
- Clock tree synthesis
Sign-off Verification
- Static timing analysis
- Power analysis
- Physical verification (DRC/LVS)

11.2 Mid-term Plan (3-6 months)

GDSII Generation and Delivery
Tape-out Manufacturing
Chip Packaging

11.3 Long-term Plan (6-12 months)

Chip Testing
- Functional testing
- Performance testing
- Reliability testing
System Integration
- Development board design
- Driver development
- Application examples
Mass Production Preparation
- Yield analysis
- Cost optimization
- Supply chain establishment

12. Summary

12.1 Project Highlights

Innovative BitNet Architecture: Multiplier-free design, significantly reduces power and area
Complete SoC Solution: Integrates CPU, accelerators, peripherals, ready to use
Flexible Programmability: RISC-V CPU supports software control
Thorough Verification: Over 95% test coverage
Clear Documentation: Complete design documentation and user manual
Dual Open-Source Toolchains: Supports both iEDA (domestic) and OpenROAD (international)
Autonomous and Controllable: Prioritizes iEDA domestic toolchain, not subject to international restrictions
Excellent Timing: Measured frequency 178.569MHz, far exceeds 100MHz target
Compact Design: 73,829 instances, meets 100K limit with sufficient margin
Chinese Support: iEDA provides complete Chinese documentation and technical support

12.2 Technical Specifications Summary

Specification	Value
Process	CX55nm Open-Source PDK
Design Scale	73,829 instances (< 100K limit)
Chip Area	~0.5 mm² (core: 0.3 mm²)
Operating Frequency	50-100 MHz (measured up to 178.569 MHz)
Computing Performance	6.4 GOPS @ 100MHz
Power Consumption	< 100 mW (static power: 627.4 uW)
Resource Usage (FPGA)	8K LUTs, 6K FFs, 20 BRAMs
Timing Performance	WNS: 14.400ns, TNS: 0.000ns, no violations

12.3 Application Scenarios

Edge AI Inference: Smart cameras, smart speakers
IoT Devices: Sensor data processing
Embedded Systems: Industrial control, robotics
Wearable Devices: Health monitoring, activity tracking

12.4 Market Prospects

With the rapid development of edge AI, this chip has broad market prospects:

Low-power advantage suitable for battery-powered devices
BitNet architecture reduces costs and improves competitiveness
Open-source design lowers barriers to entry, easy to promote

Appendix

Appendix A: Abbreviations

Abbreviation	Full Name	Description
SoC	System on Chip	System on Chip
RISC-V	Reduced Instruction Set Computer - V	Reduced Instruction Set Computer - Fifth Generation
AI	Artificial Intelligence	Artificial Intelligence
GOPS	Giga Operations Per Second	Billion Operations Per Second
PDK	Process Design Kit	Process Design Kit
EDA	Electronic Design Automation	Electronic Design Automation
RTL	Register Transfer Level	Register Transfer Level
GDSII	Graphic Database System II	Graphic Database System II
DRC	Design Rule Check	Design Rule Check
LVS	Layout Versus Schematic	Layout Versus Schematic
STA	Static Timing Analysis	Static Timing Analysis

Appendix B: References

Core Technologies

BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv:2310.11453)
PicoRV32 - A Size-Optimized RISC-V CPU (https://github.com/YosysHQ/picorv32)
Chisel: Constructing Hardware in a Scala Embedded Language (https://www.chisel-lang.org/)
RISC-V Instruction Set Manual (https://riscv.org/specifications/)

Process and PDK

CX55nm Open-Source PDK Documentation

International Open-Source EDA Tools

Yosys Open SYnthesis Suite (https://yosyshq.net/yosys/)
OpenROAD - Open-source EDA Tool (https://theopenroadproject.org/)
Magic VLSI Layout Tool (http://opencircuitdesign.com/magic/)
Verilator - Fast Verilog/SystemVerilog Simulator (https://www.veripool.org/verilator/)

Chinese Open-Source EDA Tools (iEDA)

iEDA Official Website (https://ieda.oscc.cc/)
iEDA Code Repository (https://gitee.com/oscc-project/iEDA)
iEDA User Manual (https://ieda-docs.oscc.cc/)
iEDA Technical Papers (Peking University, Peng Cheng Laboratory)
Open-Source Chip Community OSCC (https://oscc.cc/)

One Student One Chip Program (https://ysyx.oscc.cc/)
Open-Source Development Tools Forum (OSDT)

Appendix C: Contact Information

Project Lead: [tongxiaojun]
Email: [tongxiaojun@redoop.com]
Phone: [Contact Number]
Project Website: [https://github.com/redoop/riscv-ai-accelerator]
Code Repository: [GitHub/GitLab Link]

End of Report

This report is the RISC-V AI Accelerator Chip Tape-out Report, containing complete information on design, verification, and implementation. For questions, please contact the project lead.