issue #1 · Memory Systems
A 24 GB model on a 24 GB GPU
Welcome to the first issue. The format is simple: every time the lab ships a proof drop — a small, reproducible result — you get the short version here, with the link to the full writeup and the code.
One rule governs all of it: every claim is reproducible from a one-command script, or it’s marked speculative. No hype, no vibes-based benchmarks.
This drop: UFM
The question: can a single RTX 4090 run a model whose memory footprint exceeds its VRAM?
I tested UFM — a manager that treats VRAM + RAM as one pool — on a routed Mixture-of-Experts with a 24 GB expert bank, on a 23.5 GB card.
- The standard “all on GPU” approach OOMs.
- UFM runs the same model, holding VRAM at 19.6 GB.
- When the working set fits the VRAM budget, UFM runs within ~1% of baseline throughput and ~240× faster than naive CPU offload.
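To make the "one pool" idea concrete, here is a toy sketch of a VRAM budget backed by host RAM. It is not UFM's code: the `ExpertCache` class, the LRU eviction policy, and the sizes are all assumptions for illustration, and the "copy" is only a byte counter, with no real GPU involved.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy model of a VRAM-resident expert cache backed by host RAM.

    Hypothetical illustration only: the names, sizes, and LRU policy
    are assumptions, not UFM's actual design.
    """

    def __init__(self, vram_budget_bytes):
        self.budget = vram_budget_bytes
        self.used = 0
        self.resident = OrderedDict()  # expert_id -> size in bytes, oldest first
        self.bytes_copied = 0          # simulated host-to-device traffic

    def fetch(self, expert_id, size_bytes):
        """Make an expert resident; return True on a cache hit."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return True
        if size_bytes > self.budget:
            raise ValueError("expert is larger than the whole VRAM budget")
        # Evict least recently used experts until the new one fits.
        while self.used + size_bytes > self.budget:
            _, evicted_size = self.resident.popitem(last=False)
            self.used -= evicted_size
        self.resident[expert_id] = size_bytes
        self.used += size_bytes
        self.bytes_copied += size_bytes
        return False


GB = 1024 ** 3
cache = ExpertCache(vram_budget_bytes=4 * GB)  # hypothetical 4 GB budget
for expert_id in [0, 1, 2, 0, 1, 2]:           # hot set of three 1 GB experts
    cache.fetch(expert_id, 1 * GB)
print(cache.bytes_copied // GB, "GB copied")
```

In this toy run the three experts are copied once and then served from the budget, so only 3 GB crosses the bus for six accesses. That is the regime the last bullet describes: once the working set is resident, transfers stop being the bottleneck.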
And the part most benchmarks skip — the failure case: when every expert fires every step (no locality), UFM ties naive streaming. It’s a bet on locality, not magic memory. I’d rather show you the edge of the envelope than pretend there isn’t one.
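The failure case is easy to reproduce in miniature. The sketch below assumes an LRU cache over equal-sized experts (UFM's real policy may differ) and uses hypothetical counts: 24 experts, 19 of which fit the budget. When every expert fires every step in a fixed order, each access evicts exactly the expert that will be needed soonest, so nothing is ever served from VRAM.

```python
from collections import OrderedDict
import random


def lru_hit_rate(trace, capacity):
    """Fraction of accesses in `trace` served from an LRU cache of `capacity` slots."""
    cache = OrderedDict()
    hits = 0
    for expert_id in trace:
        if expert_id in cache:
            cache.move_to_end(expert_id)   # refresh recency on a hit
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[expert_id] = True
    return hits / len(trace)


NUM_EXPERTS = 24  # hypothetical: equal-sized experts
CAPACITY = 19     # hypothetical: experts that fit in the VRAM budget
STEPS = 200

# No locality: every expert fires every step, always in the same order.
no_locality = [e for _ in range(STEPS) for e in range(NUM_EXPERTS)]

# Strong locality: routing only ever picks from a hot set that fits the budget.
rng = random.Random(0)
hot_set = list(range(8))
with_locality = [rng.choice(hot_set) for _ in range(STEPS * NUM_EXPERTS)]

print(f"no locality:   {lru_hit_rate(no_locality, CAPACITY):.3f}")
print(f"with locality: {lru_hit_rate(with_locality, CAPACITY):.3f}")
```

The no-locality trace gets a hit rate of exactly zero, which means every expert is streamed in on every use, the same traffic as naive streaming. The hot-set trace misses at most once per hot expert, so its hit rate sits just under 1.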
→ Full writeup, figures, and repro: yandesbiens.com/blog/ufm-benchmark
What’s next
The Memory Systems thread continues with training-time paging and optimizer-state offload curves. After that, the fractal backbone gets its first controlled comparison.
You can always see the whole program — threads, maturity, and the next proof needed — on the research page.
— Yan
Research conducted at Éthiqueia Québec inc.