I don't know how they got their 3 GB/s memory bandwidth.
My own testing shows 5347.7 MB/s on a 64 MiB to 64 MiB `memcpy()` using a basic 7-instruction RVV copy loop on an X100 core. That's a total of 10.7 GB/s memory bandwidth (read plus write).
The A100 "AI" cores do better, with 13225.9 MB/s on the 64 MiB to 64 MiB copy, for a total of 26.5 GB/s memory bandwidth.
Both core types do a 25 GB/s `memcpy()`, 50 GB/s total, in cache.
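The "total" figures are just the copy rate doubled, since every byte copied is read once and written once. A quick sketch of that arithmetic (the function name is made up for illustration, and 1 GB is taken as 1000 MB, which is an assumption about the units used):

```python
def total_bandwidth_gb_s(copy_rate_mb_s: float) -> float:
    """Total memory traffic for a copy: each byte is read once and written once."""
    # ASSUMPTION: 1 GB = 1000 MB, matching the rounded figures in the comment.
    return 2 * copy_rate_mb_s / 1000.0


print(round(total_bandwidth_gb_s(5347.7), 1))   # X100 core  -> 10.7
print(round(total_bandwidth_gb_s(13225.9), 1))  # A100 core  -> 26.5
```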
Why does bandwidth (MB/s) decrease at larger sizes? Is it possible that caches play a larger role during smaller `memcpy()` calls, and you only see the real CPU<->RAM bandwidth when you're touching larger areas of memory?
EDIT: never mind, your comment seems to indicate that to be the case
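One way to see the cache effect for yourself is to time the same copy at several sizes. This is only a rough sketch, not the RVV loop benchmark mentioned above: it relies on `bytearray` slice assignment, which in CPython should boil down to a plain memory copy, and the sizes and repeat count are arbitrary choices.

```python
import time


def copy_rate_mb_s(size_bytes: int, repeats: int = 5) -> float:
    """Best-of-N copy rate for a size_bytes to size_bytes buffer copy."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: copies the bytes in place
        best = min(best, time.perf_counter() - start)
    # ASSUMPTION: MB means 10**6 bytes; guard against a zero timer reading.
    return size_bytes / max(best, 1e-9) / 1e6


# Small buffers should fit in cache; the 64 MiB one should not.
for size in (16 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    print(f"{size:>10} bytes: {copy_rate_mb_s(size):10.1f} MB/s")
```

If the small sizes come out much faster than the 64 MiB case, the small copies are being served from cache and only the large one reflects RAM bandwidth.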
Unlike a GPU or NPU, you can just run all your normal RISC-V Linux programs on the AI cores. Bash, gcc, emacs, nodejs ... whatever you want. It's an extra 40% of scalar processing power, for free.
> That documentation also seems to be the original source of the marketing claims that the K3 provides 60 TOPS of AI compute and can run 30B models at over 10 tokens/s. Well, sort of– as another spoiler, I can share that I hit a hard cap at an effective 3B
I'm running Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf on mine ... picked kind of at random from a web page as I'm a complete n00b at running local LLMs, have never used anything other than ChatGPT or (mostly) Grok.
I get 6-7 tok/s, which is slow for throwing around program code, but fine for general knowledge queries. It's a reasonable speed to read along as it outputs. Simple queries start to give output in about 2 seconds.
e.g. I dunno ..
> What does auipc do?
`auipc` stands for "Add Upper Immediate to PC" and is a RISC-V instruction that:
## What it does:
- Takes a 20-bit immediate value and adds it to the program counter (PC)
- The immediate is sign-extended and shifted left by 12 bits
- The result is stored in a destination register
## Syntax:
```
auipc rd, imm
```
## Key characteristics:
- **PC-relative addressing**: Uses the current program counter as a base
- **12-bit offset**: The immediate is shifted left by 12 bits, so it's effectively a 32-bit value
- **No memory access**: Only performs arithmetic, no memory operations
- **Used for**: Creating 32-bit immediate values, setting up base addresses for other instructions
## Example:
```assembly
auipc x5, 0x1000 # x5 = PC + (0x1000 << 12)
```
## Common uses:
- Setting up base addresses for load/store instructions
- Creating 32-bit immediate values when the standard `li` instruction can't handle the full range
- Jump table setup
- Address calculation for large data structures
The instruction is particularly useful in RISC-V because it allows for efficient 32-bit address generation while maintaining the 16-bit instruction size constraint.
[ Prompt: 20.6 t/s | Generation: 6.5 t/s ]
Output started in about 2 seconds.
> what is the airspeed velocity of an unladen swallow?
The classic answer to this question comes from Monty Python's "Holy Grail" - when asked about the airspeed velocity of an unladen swallow, the character replies "What do you mean? An African or European swallow?"
But in reality, the airspeed velocity depends on the species:
**European Swallow (Hirundo rustica):**
- Typical flight speed: 15-20 mph (24-32 km/h)
- Maximum speed: around 25 mph (40 km/h)
**African Swallow (Hirundo sp.):**
- Similar speeds to European swallows, though some species can reach up to 30 mph (48 km/h)
The "unladen" part refers to the swallow not carrying anything, which would slow it down. Swallows are quite agile birds that can fly at impressive speeds while maneuvering through the air.
So while the Python reference is the more famous answer, the real-world airspeed of an unladen swallow is roughly 15-25 mph, depending on the specific species and conditions.
[ Prompt: 25.5 t/s | Generation: 6.6 t/s ]
Again, output starts in about two seconds.
This is offline, no internet, and uses 14W while running all 8 A100 "AI" cores at max.
Is this useful? I mean, for something, right?
I asked it to review https://github.com/brucehoult/trv which is a total of 320 lines of code (I used `/read` on a tar file containing the two code files). It thought for 22 minutes before output started and then spent 8 minutes outputting comments at just over 6.5 tok/s.
Nothing there to scare Claude, but 30 minutes total is still faster than asking a colleague for a code review, and probably more comprehensive too. And it did it on about 0.25 cents of electricity.
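For anyone who wants to check the numbers, here is the back-of-envelope version. It assumes the review drew the same 14W as the chat queries, and the electricity price is my own placeholder, not something stated above.

```python
power_w = 14.0          # reported draw with all 8 A100 cores at max
think_minutes = 22      # time before output started
output_minutes = 8      # time spent generating the review
gen_rate_tok_s = 6.5    # reported generation speed ("just over 6.5")

# Energy for the whole 30-minute run, in kWh.
energy_kwh = power_w * ((think_minutes + output_minutes) / 60) / 1000

# Approximate size of the review that was written out.
output_tokens = gen_rate_tok_s * output_minutes * 60

# ASSUMPTION: hypothetical tariff of 35 cents per kWh; substitute your own.
price_cents_per_kwh = 35.0
cost_cents = energy_kwh * price_cents_per_kwh

print(f"{energy_kwh * 1000:.1f} Wh, about {output_tokens:.0f} tokens, "
      f"{cost_cents:.3f} cents")
```

At that assumed tariff the run uses 7 Wh and costs roughly a quarter of a cent, which lines up with the figure quoted.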
> Turns out getting a thread onto the A100 cores requires a two-step handshake:
>
> write the thread’s TID to /proc/set_ai_thread (a kernel interface that unlocks scheduling on cores 8–15 for that specific thread)
> then call sched_setaffinity to pin it.
If you want to just run arbitrary Linux programs on the A100 cores, I wrote a small assembly language launcher which does the above PID writing and then EXECs the thing you really want.
    # just run a single program on the A100 cores
    ai as hello.s -o hello.o
    # same thing but maybe 1ms faster
    aix /usr/bin/as hello.s -o hello.o
    # run a whole build. All processes started by `make` will run on the A100 cores.
    ai make -j8 test
    # start a shell on the A100 cores. All programs run from it will be run only on the A100 cores
    ai bash
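For illustration, the same idea can be sketched in Python. This is not the assembly launcher itself, just a guess at the equivalent steps: the `/proc/set_ai_thread` path and the core range 8 to 15 come from the quoted article, while the function names and the assumption that the kernel interface accepts the TID as decimal text are mine.

```python
import os
import sys

AI_THREAD_FILE = "/proc/set_ai_thread"  # path as quoted from the article
AI_CORES = range(8, 16)                 # cores 8-15, as quoted from the article


def move_to_ai_cores(tid, proc_path=AI_THREAD_FILE, cores=AI_CORES,
                     set_affinity=None):
    """Two-step handshake: register the thread, then pin it."""
    if set_affinity is None:
        set_affinity = os.sched_setaffinity  # Linux-only
    # Step 1. ASSUMPTION: the interface takes the TID written as decimal text.
    with open(proc_path, "w") as f:
        f.write(str(tid))
    # Step 2: restrict the thread to the A100 cores.
    set_affinity(tid, set(cores))


def launch(argv):
    """Hypothetical launcher body: handshake for ourselves, then exec the target."""
    if len(argv) < 2:
        sys.exit("usage: ai.py program [args...]")
    # In a single-threaded process the main thread's TID equals the PID.
    move_to_ai_cores(os.getpid())
    # The affinity mask is preserved across exec, so the target starts pinned.
    os.execvp(argv[1], argv[1:])
```

Calling `launch(sys.argv)` at the bottom of the file would turn this into a script used like `python3 ai.py make -j8 test`. The parameters on `move_to_ai_cores` are there so the handshake can be tried against an ordinary file on a machine without A100 cores.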
Used as normal CPUs, the eight 2-wide in-order A100 cores (like an A53 or A55 or Pentium or PPC603) add about 40% more scalar processing power on top of the eight X100 cores.
That's better than Hyperthreading and well worth using for some additional processing power. Just kick off a background build, or CI or something there while you do something else on the X100 cores. If you ignore the special "AI" matrix processing extension they are just perfectly normal RISC-V RVA23 cores as far as user code is concerned — and in fact significantly faster than the previous generation K1 chip.
A Linux kernel build on just the A100 "AI" cores is faster than on any previous RISC-V SBC under $1000, including the HiFive Premier P550 or Milk-V Megrez. It's several times faster than the VisionFive 2 or Milk-V Jupiter / BPI-F3.
The K3 is also faster than using QEMU/Docker on my 24 core i9-13900 laptop, while using 25W instead of 200W.
Note that the fastest time uses one distccd on the X100 cores and another distccd on the A100 cores. This adds a lot of overhead in preprocessing and communication over the network (loopback, but still), yet it still gives a pretty nice boost. Running independent tasks on each set of cores is more efficient, though. Or teaching `gmake` or `ninja` to distribute to two pools of cores using my `ai` launcher would be even better ...
> instead of a PDP 11 as the virtual machine, why not risc V
What would that look like, in your view? I can't see them being significantly different, as computation models go. And I'm familiar with both.
If you're thinking of ++ and --, those were introduced in the B language before the first PDP-11 was even made, plus of course the PDP-11 only offers pre-decrement and post-increment while C offers all four combinations.
I'm not really sure. The problem is that a lot of the compressed instructions have been informed by the C way of doing things.
TBF I'm playing with a Pi Pico, so anything I do is likely to be more influenced by the memory setup, i.e. XIP.
I could write a whole essay on this. But I was more meditating on the assumptions and decisions made, because that was obvious/easy given the hardware available. We don't really write languages like that now, and I don't really think C is that language anymore. But it would be interesting to see what could emerge.
To answer this follow-up, Benji, the syntax was designed so the learning curve isn’t too steep for people coming from languages like C/C++. Even so, I think that analogy is wrong: Riscrithm isn’t trying to be another C, as that would just be a cheap and worse copy. My goal is to allow people writing drivers or high-performance code to focus on structuring their logic without the compiler abstracting it into ambiguous implementations. Also, it’s great seeing you using a Pico; working with microcontrollers is fantastic, and I think it would be nice if you considered running Riscrithm on it, especially v1.1. Let’s be honest, v1.0 was an MVP, while v1.1 aims to bring all the features that will appear in v2.0, which will greatly improve the compiler’s structure and make the language cleaner. I also appreciate your curiosity about what might emerge, and I want to assure you that I’ll do my best to make it a great language!
I was more just musing on the space around high-level assembler / low-level programming language. Currently C is the option, so anything entering that space will be competing with C.
Hi Benji, I tried to reply to your comment earlier, but it seems it didn’t go through, so I’m trying again to say that you’re right. I hope you enjoyed exploring (and maybe even using) Riscrithm. My goal was to make writing assembly cleaner while still giving developers full control. Sure, C is the standard now, but I hope Riscrithm can offer a fresh perspective on low-level coding.
They've finally stopped comparing RISC-V boards to Pi 3!
But why don't they include Pi 3 and Pi 4 in the charts *anyway*, as well as some more appropriate x86 machines such as Core2Duo, i7-860 (or other Nehalem), 2nd-5th gen i5 etc?
It's not telling anyone anything they didn't know that a K3 isn't going to compare to an Apple M5 or Core Ultra 9 or whatever.
I've got a 32 GB Pico-ITX K3 and I've been comparing it to a "Late 2012" Mac Mini with an i7-3720QM running Ubuntu 24.04. The Mac is 20% or 30% faster single-core, but of course on some things the K3 wins from having lots of cores.
I imagine the issue is not having hardware to test. Or, at least, not having a test environment for that hardware.
Probably the only hardware they tested for this article was the K3 and the only recent test results they had that were slow enough were the Pi 500+, the P550, and Loongson. I agree though that comparing it to 10 different Ryzens is not super useful.
The Pi and the P550 are what people are comparing it against. And showing how much slower it still is compared to modern x86-64 is useful too. Something like the RK3588 would have been interesting.
This shows that RISC-V has improved a lot compared to itself, that it is becoming competitive with ARM, and that it has a long way to go to the high-end desktop. That is about the right story to tell.
The average bystander doesn't have to care, just buy a machine implementing the RVA23 profile (standard set of extensions) and be happy.
If you're building your own embedded hardware then you determine what your needs actually are, e.g. do you need double precision? Half precision? Vector? Then you choose a chip implementing that. Then you copy the ISA string from your chip's documentation to the `-march=` argument for GCC/Clang and be happy.
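As a small illustration of that last step, here is a sketch of a build helper that drops the ISA string straight into the compiler command. The helper name, the `rv64gcv` string, and the file names are placeholders; a cross-compiler will have a different executable name, and some toolchains also want a matching `-mabi=` option.

```python
def compile_command(isa_string, source, output, cc="gcc"):
    """Build a compile command that passes the chip's ISA string to -march."""
    # isa_string is whatever the chip's documentation lists, copied verbatim.
    return [cc, f"-march={isa_string}", "-O2", "-c", source, "-o", output]


# Hypothetical chip whose documentation gives the ISA string "rv64gcv".
print(" ".join(compile_command("rv64gcv", "main.c", "main.o")))
# prints: gcc -march=rv64gcv -O2 -c main.c -o main.o
```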
It's not hard and you don't have to think about it unless you very specifically want to.
The average bystander might want to write high-performance code for their risc-v cpu. Then they must know precisely which instructions are available and what the performance implications of using them are. E.g., the difference between a shared and non-shared fp register file is huge.
For the "average bystander" they're going to buy an OS and compatible hardware, or if they're the average programmer they're going to use a compiler and libraries that solve the problem already for them. Very very few people need to worry about the details.
If you want to get the absolute most out of a specific CPU that is in your hands then you of course have to refer to the documentation for that specific CPU.
That process doesn't depend on whether it's an x86 or an Arm or a RISC-V.
That's why x86 people refer to the HUGE document maintained by Agner Fog.
If you want your code to run well on all standards-compliant implementations then you write according to the ISA documentation, in this case RVA23. Or ARMv9-A. Or x86_64 v3.
Nope. I want to get the most out of all cpus that will run my code. This is a combinatorial problem that grows exponentially by the number of relevant extensions. So, yes, you need to know the hardware, but accounting for combinations of 5 features is way easier than accounting for combinations of 10 features.
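To put numbers on that growth: with n independent optional features there are 2^n possible combinations a build might meet. A minimal sketch that enumerates them, with placeholder feature names rather than real extensions:

```python
from itertools import combinations


def feature_subsets(features):
    """Every subset of optional features that a target might implement."""
    return [set(combo)
            for size in range(len(features) + 1)
            for combo in combinations(features, size)]


five = [f"ext{i}" for i in range(5)]    # placeholder names
ten = [f"ext{i}" for i in range(10)]

print(len(feature_subsets(five)))  # 32
print(len(feature_subsets(ten)))   # 1024
```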
Riscv is basically repeating the same mistake X11 did. A minimal base that could be varied endlessly by combining extensions. It didn't work for X11. Some extensions became de facto mandatory (shm) while others fell by the wayside. But you could never rely on the availability of a given extension because someone somewhere might not have had it or disabled it. Then Wayland came along and cleaned up the gazillion extensions mis-design because it was a huge PITA. Riscv will get there too, sooner or later.
You think the average person writes performance optimized code?
If you are on that level then you know pretty well what you are targeting. And even then in 99% of cases you just look at the top level profile.
If you do performance analysis for some specific embedded project that is not using a standard profile, then it's a bit more work, but hardly impossible.
Bruh, the "average person" won't buy a riscv-based computer in decades. The average bystander to the riscv project indeed writes high-performance code for their, so far, mostly non-existent or emulated riscv processors.
You're seriously arguing that the average person writes performance code so critical that minor differences in hardware implementation are relevant? Most people write code that isn't that performance critical, write firmware, or are porting existing software over. An extreme minority of the people who interact with an ISA are hand-optimizing code.
Lol... the RISC-V ecosystem has loooong passed that stage. RISC-V is eating into markets from deeply embedded to automotive, high-end server cpu's to specialized accelerators. That's mass-produced hard silicon.
It's here to stay, coming to a device near you Real Soon Now (tm).
Do high-performance RISC-V CPUs (that you can actually buy) still exist? SiFive Unleashed was great but IIRC it was a single batch that never returned.
I have in my hands one of the new SpacemiT K3 machines. It arrived today. I'm comparing it to several other things, and finding that it is pretty comparable to a "late 2012" Mac Mini with an i7-3720QM (base 2.6 GHz, turbo 3.6 GHz) running Ubuntu 24.04. They are quite close in feel for general use, web browsing, code editing, watching YouTube etc. The Mac is a little faster on many things, a LOT slower on others (anything that can use 8 cores, obviously).
You might say that's not "high performance" but we thought it was pretty good a dozen years ago.
The previous SpacemiT K1 chip two years ago was more like one of the last Pentium IIIs or PowerPC G4s, except with a lot more cores.
SpacemiT have a next generation K5 coming out, they say, at the end of the year. Tenstorrent have their new Ascalon-X core comparable to Apple's late 2020 M1 — and designed by the same guy who designed the M1. They've taped out a chip using that and say they'll be selling a dev board in Q2 or Q3. For now the first version is using an old chip process and it will be running at half the clock speed of the M1, but that's still going to be a very decent machine.
The HiFive Unleashed was of course 8 years ago. Since then there have been the HiFive Unmatched (roughly like Cortex A55) and the HiFive Premier P550 (a bit better than Cortex A72, other than no SIMD).
> You might say that's not "high performance" but we thought it was pretty good a dozen years ago.
Definitely sounds pretty high-performance compared to basically every RISC I've seen (and including nearly every cell phone I've ever owned with the exception of the Apple ones).
Tenstorrent is awesome, can't wait to see if I can afford any of their hardware in 5ish years. I miss when you could buy TPUs as a consumer (Coral etc.)
The Arace purchase link for the Jupiter 2 kit says it's “in stock”, but it's actually for a discount coupon. The actual system can only be pre-ordered. The Sipeed web site does not say anything about shipping timelines, and the products are not offered in their AliExpress store. I think the Sipeed boards are in preorder, too.
Of course, neither of these are machines. And the average bystander probably isn't used to importing computer parts directly from China, either.
Deep Computing have started taking orders for the final product, and the preorders are shipping within the next 6 weeks. They will be shipping from China, I expect, but it's a proper shop front.
AVR is vastly better to program than 8 bit PIC — either by hand or by compiler from C — but some people still insist on using those PICs for simple things.
The "PIC32" name was originally used for MIPS CPUs but more recently ARM ones and PIC32A is an extended dsPIC (16 bit).
There is also now PIC64, which is currently a couple of different RISC-V implementations: one based on the quad-core SiFive U54 from 2018 (same as the PolarFire SoC FPGAs), and a higher-performance (and rad-tolerant in some versions) one based on the octa-core SiFive X280 with vector processing. Microchip have, I think, also indicated there will be future Arm-based 64 bit PICs.
The PIC and AVR are both more about peripherals (and low power, since they can sleep while peripherals do things).
The documentation on PIC or AVR assembly is extremely short, less than 200 pages. But the chips' other peripherals (MVIO, 50mA push/pull current at up to 5V, OpAmps, differential ADCs, Event systems, and more) are where these chips get a resounding advantage.
Except now PIC32 CM has those very nice peripherals (comparable to AVR DD at least). It's very curious to me how it all works out: if Microchip will continue porting their nice hardware to ARM, or if they'll continue to develop new stuff for the 8bit market.
----------
Because the peripherals are hardware, I don't think it's really too valuable looking at the assembly language or other comp-sci details. The 8-bit assembly languages, be it PIC or AVR (or 8051) are all sufficient. Enough CPU to do things and glue the peripherals together.
I'm not sure why anyone would care what chip is in a router that should just sit there doing its job and you're not going to write or run other software on?
Sure it's kind of nice to know the car media player thing I've had for a couple of years has a RISC-V D1s/F133 chip in it, but I bought it because it receives CarPlay (and Android Auto) and transmits audio on FM (actually the best quality one I've had, and I've had a few) and cost $30, not because it's RISC-V.
And I'm eagerly awaiting a pico-ITX SpacemiT K3 box arriving in the next week or so.
But a router? Why do I care, past price and functionality?
See https://news.ycombinator.com/item?id=48523343