IP evaluation guide

If you are considering integrating Coral NPU as the ML accelerator in your SoC, use this guide to evaluate the IP's performance and functional compatibility. Start your evaluation by reviewing these pages:

ML performance measurements

When evaluating any NPU, you will want to measure both the wall-clock time and the number of clock cycles that ML kernels take to execute. Because I/O latency is often the largest contributor to ML performance, take measurements on a system that accurately reflects the latencies of your target system.

To obtain accurate performance metrics, choose the simulator or emulator that best fits your current evaluation stage:

  • MPACT-CoralNPU behavioral simulator: This is the fastest simulation option but is not clock-cycle-accurate, nor does it model memory or instruction latency.
  • RTL-based Verilator simulator: This simulator provides cycle-accurate metrics for the Coral NPU processor but is slow (approximately 15 kilocycles per second), which may make full ML model runs impractical.
  • FPGA emulation: This is the best option for realistic measurements, as it allows for fast hardware emulation of the actual Coral NPU RTL.
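To get a feel for what the Verilator rate means in practice, the following sketch estimates host run time from a cycle budget. The 15 kilocycles per second default comes from the list above; the function name and the two example workloads are hypothetical and only illustrate the arithmetic.

```python
def verilator_wall_time_s(total_cycles: int, sim_rate_hz: float = 15_000.0) -> float:
    """Estimate host wall-clock seconds needed to simulate total_cycles."""
    if sim_rate_hz <= 0:
        raise ValueError("sim_rate_hz must be positive")
    return total_cycles / sim_rate_hz


# Hypothetical cycle budgets, chosen only to show the scale of the difference.
for name, cycles in [("small kernel", 200_000), ("full model", 500_000_000)]:
    seconds = verilator_wall_time_s(cycles)
    print(f"{name}: {cycles} cycles -> {seconds / 3600:.2f} h")
```

If the estimate for your workload runs to hours, that is a reasonable signal to restrict Verilator runs to individual kernels and move full-model measurements to FPGA emulation.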

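Coral NPU includes the RISC-V Zicntr counter registers to measure performance. These are implemented as 64-bit Control and Status Registers (CSRs). In the 32-bit (RV32) Coral NPU system, each counter is accessed as two 32-bit halves: in the standard RISC-V CSR map, the base CSR holds the low word and a companion CSR with an h suffix (for example, mcycleh) holds the high word.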

  • mcycle: Counts the number of clock cycles executed by the core.
  • minstret: Counts the number of retired (successfully completed) instructions.
  • mtime: Tracks wall-clock real time. This is a memory-mapped register that requires an external timer peripheral to be connected to the Coral NPU core. See this Python test code for an example using mtime on the Google Coral FPGA emulation platform.

Accessing these counters typically requires a coordinated software stack. For more details on the Zicntr counter registers, see Section 3.1.10, Hardware Performance Monitor, in the RISC-V Instruction Set Manual Volume II.
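Once counter values have been captured before and after a kernel, the derived metrics are simple arithmetic. The following is a minimal sketch, assuming you already have 64-bit mcycle and minstret snapshots; the `Snapshot` type, the `kernel_metrics` function, the sample values, and the 100 MHz clock are all hypothetical.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass
class Snapshot:
    """Counter values captured at one point in time (hypothetical container)."""
    mcycle: int
    minstret: int


def kernel_metrics(start: Snapshot, end: Snapshot, clock_hz: float) -> dict:
    """Derive cycle count, instruction count, IPC, and core time for one kernel."""
    # Masking the difference keeps the result correct across a 64-bit wraparound.
    cycles = (end.mcycle - start.mcycle) & MASK64
    instructions = (end.minstret - start.minstret) & MASK64
    return {
        "cycles": cycles,
        "instructions": instructions,
        "ipc": instructions / cycles if cycles else 0.0,
        "seconds": cycles / clock_hz,
    }


# Made-up snapshot values and an assumed 100 MHz core clock.
metrics = kernel_metrics(Snapshot(1_000, 400), Snapshot(51_000, 30_400), clock_hz=100e6)
print(metrics)
```

The `seconds` value here is only cycles divided by an assumed clock frequency. Where an mtime peripheral is connected, reading it directly gives a wall-clock figure that does not depend on that assumption.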

Functional design evaluation

In evaluating Coral NPU, you will likely start by doing some initial design experimentation with the IP.

First, generate the SystemVerilog code for the design as shown in the IP integration guide.

The Coral NPU core will be an AXI4/TileLink peripheral in the system. Review the AXI bus interface and create a simple RTL simulation test bench to begin verifying the core logic.

Focus initially on the most important top-level signals, such as clock, reset, and the 128-bit s_axi and m_axi bus connections. Once you are confident that the core is functional, you can move on to higher-level operations, for example writing test code that writes and reads Coral NPU's CSRs.

Other ideas:

  • Write a small, simple assembly language program that drives a single observable bit, either over the master AXI bus or on the halted bit. Load the binary program into Coral NPU's ITCM instruction memory, then execute it.
  • Test the assembly language equivalent of return 0. In C and C++, a return 0; statement at the end of a main function is a standard way to signal to the operating system that the program has finished executing successfully.
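As an illustration of how small such a program can be, the sketch below hand-encodes the two-instruction RV32I sequence that a compiler typically emits for `return 0;` (`li a0, 0` followed by `ret`) and packs it into a little-endian byte image. The encoder function and variable names are hypothetical. How the image is loaded into ITCM, and how the core should signal completion in a bare-metal image with no caller to return to, are not covered here; treat this as an encoding exercise rather than a complete test program.

```python
import struct


def encode_i_type(opcode: int, rd: int, funct3: int, rs1: int, imm: int) -> int:
    """Encode one RV32I I-type instruction word."""
    if not -2048 <= imm <= 2047:
        raise ValueError("I-type immediate must fit in 12 signed bits")
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


X0, RA, A0 = 0, 1, 10      # standard RISC-V register numbers for zero, ra, a0
OP_IMM, JALR = 0x13, 0x67  # RV32I major opcodes

program = [
    encode_i_type(OP_IMM, A0, 0b000, X0, 0),  # addi a0, x0, 0  (li a0, 0)
    encode_i_type(JALR, X0, 0b000, RA, 0),    # jalr x0, 0(ra)  (ret)
]

# RISC-V instruction words are stored little-endian in memory.
image = b"".join(struct.pack("<I", word) for word in program)
print([f"{word:08x}" for word in program])
```

In practice you would normally produce the binary with a RISC-V assembler and linker; encoding by hand is mainly useful for checking that the words your test bench sees on the instruction memory match what you expect.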

To get an idea of how to write some basic instantiation and integration test code like this, take a look at this SystemVerilog test bench in the Coral NPU GitHub repository. You will probably need to modify this code to work properly in your development environment.

The Google Coral team uses cocotb, a coroutine-based cosimulation test bench environment for verifying VHDL and SystemVerilog RTL using Python. cocotb is free, open source, and hosted here on GitHub.