Hacker News | scottjg's comments

i admit, you got me chuckling with that one.

interesting. that might be a fun intro project to MoltenVK. I hadn't dug into what was missing for Doom. I thought maybe the issue was that the intro/menu always ran in opengl mode or something. If it's just one missing op, that's way easier.


it is a separate stack, but that probably doesn't matter much. a user process (in my case, qemu) can communicate with a driverkit driver. the user process can also map memory through the driver, which is how this pci passthrough system works.

i don't think the issues with the project really are specific to driverkit.


i don't know for sure, but i suspect what makes the tinygrad stuff slow isn't the macos host driver itself. i think they're doing something very similar to what i'm doing, which is just mapping the PCI BARs to userspace, then they have a bunch of python code that drives the GPU.
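As a rough illustration of what "map a BAR into userspace, then drive it from Python" can look like, here is a minimal sketch. It is not the tinygrad or QEMU code: `MappedBar`, `BAR_SIZE`, the register offset, and the temporary file standing in for the real device mapping are all assumptions made so the example runs anywhere.

```python
import mmap
import struct
import tempfile

BAR_SIZE = 4096  # hypothetical size; real BAR sizes are device-specific


class MappedBar:
    """Little-endian 32-bit register access over a memory-mapped region."""

    def __init__(self, fileno, size):
        # With a real device, fileno would refer to whatever the host
        # driver exposes for the BAR; here it is just an ordinary file.
        self._mem = mmap.mmap(fileno, size)

    def read32(self, offset):
        return struct.unpack_from("<I", self._mem, offset)[0]

    def write32(self, offset, value):
        struct.pack_into("<I", self._mem, offset, value & 0xFFFFFFFF)

    def close(self):
        self._mem.close()


def demo():
    # A temporary file stands in for the device mapping (assumption).
    with tempfile.TemporaryFile() as backing:
        backing.truncate(BAR_SIZE)
        bar = MappedBar(backing.fileno(), BAR_SIZE)
        bar.write32(0x10, 0xDEADBEEF)  # 0x10 is a made-up register offset
        value = bar.read32(0x10)
        bar.close()
        return value
```

Once the region is mapped, everything else is plain reads and writes at offsets, which is why the host driver itself can stay small.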

this is only speculation, but i think the big thing that makes tinygrad slow is that the tinygrad inference engine has not really been optimized much for all these open LLM models. probably most of the work has gone towards optimizing the stack for george's self-driving hardware company. since you can't just run the existing CUDA kernels on their engine, that makes things a lot tougher, engineering-wise.

i am actually curious if my project could share a macos host driver with them. i think it would need some changes, but it seems like there's a lot of overlap


in fairness to the LLM critics, every time i ran into a minor speed bump in this project, it told me it probably just wasn't possible to get it to work well. the LLM did pretty actively discourage me from trying to get the whole thing working.

that said, since i was willing to ignore that aspect of it, it did accelerate getting the work done by a lot. it seems like it understands system programming really well, and did a good job navigating the qemu codebase. i have ~20 years of systems programming experience so i already knew what had to be done here. it didn't really guide the project much, but it did write a lot of the code.


the exact numbers in the graph are 17019ms vs 142ms. so you're right, it's not 120x, it's 119.85x.

That explains it. Thanks!

two semi interesting things to note around this:

1. Virtualization.framework seems to support some form of GPU passthrough from the host (granted, not eGPU - it's for the integrated GPU). I think the primary use case is having macOS guests get acceleration, while still sharing GPU time with the host. There is also a patch that recently hit QEMU mainline that supports using the "venus server" with virtio-gpu to provide similar functionality for Linux guests under Hypervisor.framework.

2. Apple internally has some kind of PCI Passthrough support available in Virtualization.framework. It seems like the code is shipped to customers in the framework, but it relies on some kind of kext or kernel component that isn't shipped in retail macOS. I can't say if that's intended to ever be released to customers, but clearly someone at Apple has thought about this feature.


I experimented with booting Arm macOS 14-26 in QEMU a while back, building on the work of Alexander Graf for macOS 12-13, and reverse-engineered substantial parts of Hypervisor.framework, the in-kernel hypervisor, and a bit of Virtualization.framework. Got newer versions of Sequoia to boot past the login screen, with GPU acceleration too.

Unless there's another method I missed, the internal GPU "pass through" of Virtualization.framework you're thinking of might actually just be paravirtualization, at least that's what the name suggests. It's implemented in the public ParavirtualizedGraphics framework [0], although for PG on Arm macOS the relevant interfaces are private [1]. I haven't looked that deep into it per se, but, fixing the bugs around it, I've run into a few clues suggesting that it's just a command stream + shared memory being passed around. It also uses its own generic driver on the guest side.

Great job, by the way! Love how authors of pieces like this casually come here to comment :)

[0] https://developer.apple.com/documentation/paravirtualizedgra...

[1] https://github.com/qemu/qemu/blob/edcc429e9e41a8e0e415dcdab6...


FYI: https://patchew.org/QEMU/20260324204855.29759-1-mohamed@unpr...

There's some randomness around Tahoe for FileVault and it crashing because Data is detected as not encrypted (and that's not OK on bare metal). If hitting that case you might need to enable FileVault inside the VM (and remember to sync aux storage afterwards if not done)


Looks like someone beat me to it! Thanks!

I see the author of this patch set has run into a few similar issues as me. Strangely, not all of them: I don't see patches for the new PCI MSI-X device introduced somewhere in Sequoia (IIRC), a source of kernel panics for me; and there's still a bug in the PG MMIO path that casts all writes to 32-bit, this one caused me a lot of headaches (no errors but no video adapter detected). I'm somewhat surprised to hear that this works!
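To illustrate why a write path that casts everything to 32 bits can fail without any error, here is a toy model. It is not the QEMU PG code; the function names, the register offset, and the dictionary standing in for device registers are all hypothetical.

```python
MASK32 = 0xFFFFFFFF


def write_cast_to_32(regs, offset, value, size_bytes):
    """Buggy model: the access size is ignored and the value is cast to 32 bits."""
    regs[offset] = value & MASK32


def write_sized(regs, offset, value, size_bytes):
    """Model that keeps as many bits as the guest actually wrote."""
    regs[offset] = value & ((1 << (8 * size_bytes)) - 1)


def demo():
    # A 64-bit value whose upper half matters, e.g. an address above 4 GiB.
    value = 0x0000_0001_8000_0000
    buggy, correct = {}, {}
    write_cast_to_32(buggy, 0x40, value, 8)
    write_sized(correct, 0x40, value, 8)
    return buggy[0x40], correct[0x40]
```

In this model the buggy path stores `0x80000000` and the sized path stores `0x180000000`. Nothing raises, so the only symptom is a device that ends up configured with the wrong value.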

There are two significant problems that we have both come across:

- LLVM now favoring pre-indexed load/stores which trap with ISV=0 for MMIO accesses, and those ending up in the GIC initialization path of the newer macOS kernels (looks like there's a separate patch set for this [0]),

- Hypervisor.framework trapping PAC HVC calls.
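For the first item: when a guest data abort traps to the hypervisor, the syndrome register normally says which register and access size were involved, but only if the ISV bit is set. A minimal sketch of that check, assuming the ESR_EL2 data abort layout from the Arm architecture (EC in bits 31:26, ISV at bit 24, SAS in bits 23:22, SRT in bits 20:16, WnR at bit 6); the function name and the returned dictionary are illustrative only.

```python
EC_DATA_ABORT_LOWER_EL = 0x24  # data abort taken from a lower exception level


def decode_data_abort(esr):
    """Pull the MMIO-relevant fields out of an ESR_EL2 value."""
    ec = (esr >> 26) & 0x3F
    if ec != EC_DATA_ABORT_LOWER_EL:
        raise ValueError(f"not a lower-EL data abort (EC={ec:#x})")

    isv = bool((esr >> 24) & 0x1)
    info = {
        "isv": isv,
        "is_write": bool((esr >> 6) & 0x1),  # WnR
        "fault_status": esr & 0x3F,          # DFSC
    }
    if isv:
        # These fields are only meaningful when ISV=1.
        info["access_bytes"] = 1 << ((esr >> 22) & 0x3)  # SAS
        info["sign_extend"] = bool((esr >> 21) & 0x1)    # SSE
        info["register"] = (esr >> 16) & 0x1F            # SRT
        info["is_64bit_reg"] = bool((esr >> 15) & 0x1)   # SF
    return info


def needs_software_decode(esr):
    """True when the VMM has to fetch and decode the instruction itself."""
    return not decode_data_abort(esr)["isv"]
```

For example, `0x93810047` decodes as a 4-byte write from register 1 with ISV=1, while `0x92000047` has ISV=0 and leaves the VMM with only the fault address and the write flag.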

It seems like the latter has been worked around here by signing QEMU with an Apple-private entitlement and running with SIP off, but there's actually a different way! While some HVCs are trapped right in the host kernel, the PAC trapping happens within Hypervisor.framework itself (at least in the host OSes I tested). It's possible to patch out this functionality without special privileges or talk to the in-kernel hypervisor directly. I originally tested with the former, and chose to implement the latter as a separate accel in the code I was planning to submit upstream, but the complexity of it exploded and, besides confirming that it would have worked, I haven't managed to finish my implementation.

Does the Tahoe crash you mentioned manifest itself in stage 2 iBoot panics? I must admit 26 was never quite my priority so I haven't looked into it, but if so, it might have been closer to booting than I thought :)

[0] https://lists.nongnu.org/archive/html/qemu-devel/2026-04/msg...


It was a kernel panic for Tahoe. Anything between macOS 12 and 26 wasn't tested so releases in-between might have more issues.

The userspace reboot after FileVault password entry acts a bit oddly with QEMU input devices so you might need to attach a new USB tablet or kbd from the monitor.

> looks like there's a separate patch set for this

Yup and it's a bit of a problem to figure out the right thing to do for it on the upstreaming side as normal guests aren't supposed to do that.

> It's possible to patch out this functionality without special privileges or talk to the in-kernel hypervisor directly

Or pre-patch them all to HVC #1 works too. Patching the host Hypervisor.framework sounds quite brittle especially after they moved to a pile of C++


> It was a kernel panic for Tahoe.

Ah, must be something else then.

> normal guests aren't supposed to do that

Oh how I wish Arm didn't let anything like this slip into the architecture spec to begin with! Massive source of pain, especially with protected memory/CCA guests. It's not macOS triggering this in isolation either. Most start up binaries for QNX do this too, somehow also in the GIC init path.

I've looked at how different hypervisors/VMMs handle this and, if this makes that patch set any less hacky, Virtualization.framework, QNX Hypervisor, and (I think) VMware all decode and emulate those instructions in software. Virtualization.framework is a remarkable spaghetti in this regard :)
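A minimal sketch of what "decode and emulate in software" involves for the simplest case, an integer LDR/STR with a pre- or post-indexed 9-bit immediate. It follows the A64 load/store (immediate) encoding and is not taken from any of the hypervisors mentioned; the function names are hypothetical, and sign-extending loads, SIMD/FP, and pair instructions are deliberately left out.

```python
def decode_indexed_ldst(insn):
    """Decode an integer LDR/STR (immediate, pre- or post-indexed).

    Returns a dict describing the access, or None if the 32-bit word is
    not one of the forms this sketch handles.
    """
    fixed_ok = (
        ((insn >> 27) & 0x7) == 0b111      # load/store register class
        and ((insn >> 26) & 0x1) == 0      # V=0: integer, not SIMD/FP
        and ((insn >> 24) & 0x3) == 0b00
        and ((insn >> 21) & 0x1) == 0
    )
    index_mode = (insn >> 10) & 0x3        # 0b01 post-index, 0b11 pre-index
    opc = (insn >> 22) & 0x3               # 0b00 store, 0b01 load
    if not fixed_ok or index_mode not in (0b01, 0b11) or opc > 0b01:
        return None

    imm9 = (insn >> 12) & 0x1FF
    if imm9 & 0x100:                       # sign-extend the 9-bit offset
        imm9 -= 0x200

    return {
        "is_load": opc == 0b01,
        "access_bytes": 1 << ((insn >> 30) & 0x3),
        "rt": insn & 0x1F,
        "rn": (insn >> 5) & 0x1F,
        "offset": imm9,
        "pre_index": index_mode == 0b11,
    }


def effective_address(decoded, base):
    """Return (access_address, written_back_base) for a decoded access."""
    new_base = base + decoded["offset"]
    if decoded["pre_index"]:
        return new_base, new_base
    return base, new_base
```

For instance, `0xB8004C01` (`str w1, [x0, #4]!`) decodes as a 4-byte pre-indexed store. The VMM then performs the MMIO access at the computed address and writes the updated base back to the guest register, which is the part the hardware syndrome does not describe when ISV=0.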

> Or pre-patch them all to HVC #1 works too. Patching the host Hypervisor.framework sounds quite brittle especially after they moved to a pile of C++

Possibly! IIRC, if HCR_EL2.HCD==1, HVC should trap as undefined instruction. Not sure how much of HCR_EL2 can be set from the user-space, but perhaps this could be the least invasive way.

Simply ignoring the instruction, though, is not enough. I remember in my setup, with HVC handling stubbed out, secondary cores would always fail to start. I suspect this to be the culprit.

The SMP bring-up code would fail to pass pointer authentication on the first indirect branch. It would then immediately pivot into FLEH->SLEH->panic(). panic() shortly would attempt to make an indirect jump itself, hoping to crash the other processors, but instead, getting stuck in a loop of calling itself. This would eventually get caught by a stack overflow guard somewhere in FLEH/SLEH, which would place the core in an infinite loop, and... the system would continue to run with just the boot core. Yo dawg, I heard you like panics :)


> HCR_EL2.HCD

That's not ideal because of:

> Any resulting exception is taken to the Exception level at which the HVC instruction is executed.

instead of trapping to the hypervisor

> I've looked at how different hypervisors/VMMs handle this and, if this makes that patch set any less hacky, Virtualization.framework, QNX Hypervisor, and (I think) VMware all decode and emulate those instructions in software. Virtualization.framework is a remarkable spaghetti in this regard :)

And so does Hyper-V.

> It's not macOS triggering this in isolation either

There are some nightmare cases that SEPOS specifically triggers, such as doing isv=0 accesses to GICR... when using the Apple vGIC handling _that_ becomes truly bizarre.

> Simply ignoring the instruction, though, is not enough

Yeah that's not a great idea


> instead of trapping to the hypervisor

My bad! I mean, ehh, I guess you could maintain a breakpoint in the guest kernel's exception vector table or have QEMU inject its own "zero-level exception handler" whose only purpose would be to capture those HVCs, but that's not as straightforward as I originally thought. And since those PAC calls are expected to set a few Apple-specific registers anyway, using the entitlement or skipping Hypervisor.framework and talking straight to the kernel seem like the only viable options when macOS is the guest.

> There are some nightmare cases that SEPOS specifically triggers, such as doing isv=0 accesses to GICR... when using the Apple vGIC handling _that_ becomes truly bizarre.

Interesting! Are there any resources out there about virtualizing sepOS?


Not much public yet about VRE virtualisation (which includes SEP) at this point.

> whose only purpose would be to capture those HVCs

quite expensive because you get to trap ~ all EL0 -> EL1 priv transitions through the virtualisation infrastructure as the sync handler has a lot going through it


Only if you used a breakpoint or something similar. I believe a "shadow" exception vector like that can run entirely in the guest context with the guest not even being aware of this (MRS is generally always trapped so you can return the address of the real one while still taking exceptions to the injected one).

Figuring out where to put it and how to keep it mapped is another problem, though!


thanks!

there also appears to be a generic pci passthrough path. we were discussing it on the qemu-devel list: https://lore.kernel.org/qemu-devel/C35B5E97-73F2-4A60-951B-B...


Oh, thanks for letting me know, and for the upstreaming work too! I might join the party once I find some more time :)

I very recently ran the numbers on these GPUs for an upcoming blog post. The token generation performance is bad, but the prefill performance is _really_ bad.

For a Qwen 3.6 35B / 3B MoE, 4-bit quant:

- parsing a 4k prompt on an M4 MacBook Air takes 17 seconds before generating a single token.

- on an M4 Max Mac Studio it's faster at 2.3 seconds

- on an RTX 5090, it's 142ms.

RTX 5090 uses more power than an M4 Max Mac Studio but it's not 16x more power.
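The ratios behind that comparison can be checked with a few lines. This sketch uses the prefill times listed above, with the MacBook Air entry taken as the exact 17019 ms figure quoted elsewhere in the thread; the names `PREFILL_MS` and `slowdown_vs` are just for illustration.

```python
# Prefill time for the 4k prompt, in milliseconds.
PREFILL_MS = {
    "M4 MacBook Air": 17019,
    "M4 Max Mac Studio": 2300,
    "RTX 5090": 142,
}


def slowdown_vs(baseline, timings=PREFILL_MS):
    """How many times slower each device is than the baseline device."""
    base = timings[baseline]
    return {name: round(ms / base, 2) for name, ms in timings.items()}
```

`slowdown_vs("RTX 5090")` gives about 119.85x for the MacBook Air and about 16.2x for the Mac Studio, which is where the 16x figure comes from.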


That's just a 4k context too. At a realistic context window of 16-32k tokens, the comparison becomes downright unfair.


i thought my post was already too long to include this, but to your point, you can run AI inference in this setup and the performance can be pretty good.


There are definitely some use cases where it works out, others where it doesn't; I spent a bit of time testing that side of things late last year: https://www.jeffgeerling.com/blog/2025/big-gpus-dont-need-bi...


a great post that definitely inspired this one. i link to it in the first paragraph of my blog post.


I feel like maybe by the end of this year someone with access to a bunch of RTX Pro 6000s will have them running on a Pi or RK3588 lol.


we can only hope


I appreciate you making the post not about AI.


Yeah, it is obviously useful for AI inference, where the PCIe speed isn't particularly important and a single board computer gets you a small system.

