Teaching a CUDA Engine to Speak Metal

June 29, 2026

Most of the Apple-Silicon ports I’ve written up are the same fight in different costumes: take a PyTorch project that assumes an NVIDIA card is bolted under the desk and talk it down to MPS. That’s a porting job. The GPU support already exists; you’re just stopping the code from hardcoding the wrong god.

This was a different animal, and I want to be honest about that up front because it changes how impressive - or not - the whole thing is. CTranslate2 isn’t a PyTorch project. It’s a from-scratch C++ inference engine with its own tensors, its own memory allocator, its own CUDA kernels. It has no “GPU device” abstraction you can just point at Metal. So the job wasn’t “port a model.” It was “add a third backend to an engine that has exactly two - CUDA and CPU - and was architected by people who reasonably assumed those were the only two that would ever matter.”

That sounds like a research project. It mostly wasn’t, and the reason it wasn’t is one fact about Apple Silicon that does almost all the work. The rest of the series is the part that one fact didn’t do for free.

The disclosure, because it’s the whole point and not the fine print: I have never written a line of C++. Not for this, not ever. The last C I touched was for Harvard’s CS50, and given a patient afternoon and Stack Overflow I could maybe flip an array. On paper there is no way I’m qualified to read this engine’s source, let alone bolt a GPU backend onto it.

So I didn’t read it. I directed agents that did - armed with a couple of skills I built for the job and a plan I sketched in about ten minutes - and I judged the results by what came out the far end: is this transformer producing the right tokens, or isn’t it. That’s a black-box judgment, a behavioral one, not a line-by-line one. So when a post in this series says “I wrote a kernel” or “I clamped the tanh,” read it the way a general contractor says “I built that house.” I didn’t lay a single brick. I knew what finished was supposed to look like, I knew which wall was crooked, and I knew who to send back to fix it. The bricklaying - every line of it - was the agent’s.

That division of labor isn’t a footnote here; it’s the subject. It’s why this is a writeup and not a pull request, and Part 7 is about exactly that - because the codebase is the kind where, as its own maintainers put it, “a single misplaced pointer can take hours to debug,” and the gap between “I can explain what this does” and “I can vouch for this line” turns out to be the whole story.

Where it got to

A full encoder-decoder transformer runs end-to-end on Metal, in both 32- and 16-bit float, producing token-for-token the same output as the CPU - GPT-2-style and Llama/Mistral-style architectures both. The whole per-token forward pass executes as real GPU kernels: matmuls on Apple’s Metal Performance Shaders (MPS) library, plus hand-written Metal kernels for softmax, the normalizations, rotary embeddings, gather, fused bias-and-activation, and elementwise math. Everything not yet on the GPU runs correctly on a CPU-reference path over shared memory, behind a full regression net.

It is correct, it is memory-safe, and - the part I won’t oversell - for some workloads it is still slower than the CPU, for reasons that turn out to be the most interesting thing in the whole project. That honest middle is what the seven parts below are about.
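For a sense of what "token-for-token the same output as the CPU" means as a black-box check, here is a minimal sketch in Python. It is not the project's actual regression net: the function name `first_divergence` and the sample token lists are made up for illustration, and it assumes both backends have already produced their output as plain lists of token strings.

```python
from typing import Optional, Sequence


def first_divergence(reference: Sequence[str], candidate: Sequence[str]) -> Optional[int]:
    """Return the index of the first differing token, or None if the sequences match.

    Hypothetical helper: `reference` stands in for the CPU output and
    `candidate` for the Metal output of the same prompt.
    """
    # Walk both sequences in lockstep and stop at the first mismatch.
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    # Same shared prefix but different lengths: the divergence is where the shorter one ends.
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


# Made-up token lists, purely for illustration.
cpu_tokens = ["Hello", "world", "</s>"]
metal_tokens = ["Hello", "word", "</s>"]

position = first_divergence(cpu_tokens, metal_tokens)
if position is None:
    print("outputs match token-for-token")
else:
    print(f"outputs diverge at token {position}")
```

Reporting the first divergent position rather than a bare pass/fail is a design choice: in autoregressive decoding, everything after the first wrong token is downstream noise, so the first mismatch is the useful part.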

The series

Read them in order or cherry-pick a war story - Parts 4, 5, and 6 each stand alone as debugging stories. Each links back to the glossary where a term needs unpacking.

Part 1 - The Cheat Code: Unified Memory
The one fact about Apple Silicon that turns ‘add a whole new GPU backend’ from a research project into an afternoon: the CPU and GPU share the same RAM, so a GPU buffer is also a CPU pointer - and …