yan@yandesbiens:~/projects$ cat ufm/README.md

UFM ● research

run models larger than your VRAM

UFM treats GPU VRAM and CPU RAM as one elastic pool. It keeps the hot parts of a model on the card, prefetches what's about to be used, and evicts least-recently-used sub-modules when memory gets tight — so a single 4090 can run a model whose footprint exceeds 24 GB.
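The keep-hot, prefetch, and evict-LRU behaviour described above can be sketched with a toy residency tracker. This is a minimal illustration under stated assumptions, not UFM's actual code: `ResidencyPool`, `touch`, and `prefetch` are hypothetical names, sizes are abstract units rather than real tensors, and a "transfer" is only counted, never performed.

```python
from collections import OrderedDict


class ResidencyPool:
    """Toy model of a VRAM budget: tracks which sub-modules are resident (hypothetical API)."""

    def __init__(self, budget):
        self.budget = budget           # capacity of the "GPU" side, in abstract units
        self.used = 0                  # units currently resident
        self.resident = OrderedDict()  # name -> size, least recently used first
        self.transfers = 0             # number of simulated host-to-device copies

    def _ensure(self, name, size):
        """Make `name` resident, evicting LRU entries if needed. Returns True on a miss."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return False
        if size > self.budget:
            raise ValueError(f"{name} ({size}) cannot fit in budget {self.budget}")
        while self.used + size > self.budget:
            _, freed = self.resident.popitem(last=False)  # evict least recently used
            self.used -= freed
        self.resident[name] = size
        self.used += size
        self.transfers += 1
        return True

    def touch(self, name, size):
        """Called when a sub-module is about to run; a miss would stall on a copy."""
        return self._ensure(name, size)

    def prefetch(self, name, size):
        """Called ahead of use. In this toy it is the same bookkeeping as touch();
        a real system would issue the copy asynchronously to overlap it with compute."""
        return self._ensure(name, size)


pool = ResidencyPool(budget=3)
for name in ["a", "b", "c"]:
    pool.touch(name, 1)
pool.touch("a", 1)      # hit: "a" becomes most recently used
pool.prefetch("d", 1)   # miss: evicts "b", the least recently used entry
```

After this sequence the pool holds `c`, `a`, and `d`, and four transfers have been counted. The ordered dictionary keeps eviction order and lookup in one structure, which is why it is a common choice for small LRU sketches like this one.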

It's the first piece of the research program to get a formal, reproducible benchmark. On a routed Mixture-of-Experts model, the standard all-on-GPU approach runs out of memory (OOM) at a 24 GB expert bank; UFM runs the same model while holding VRAM at 19.6 GB. When the active working set fits the VRAM budget, UFM runs within ~1% of full-GPU throughput and ~240× faster than naive CPU offloading.

I also published the case where it doesn't help: if every expert is touched on every step, the run is transfer-bound and UFM only ties dumb streaming. It's a bet on routing locality, not magic memory, and saying so plainly is the point.
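Why locality matters can be seen in a small LRU simulation. This is a sketch with made-up numbers (8 experts, room for 6, 100 steps), not the published benchmark; `lru_hit_rate` is a hypothetical helper. Every miss stands in for a host-to-device transfer.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of accesses served from an LRU cache holding `capacity` items."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits / len(accesses)


# Illustrative values only (assumptions, not measurements).
n_experts, capacity, steps = 8, 6, 100

# Worst case: every expert is touched on every step, in a fixed order.
sweep = [e for _ in range(steps) for e in range(n_experts)]

# Local routing: each step only touches the same 4 experts.
local = [e for _ in range(steps) for e in (0, 1, 2, 3)]

sweep_rate = lru_hit_rate(sweep, capacity)
local_rate = lru_hit_rate(local, capacity)
print(f"sweep hit rate: {sweep_rate:.2f}")
print(f"local hit rate: {local_rate:.2f}")
```

With a cyclic sweep over more items than fit, LRU always evicts the entry it will need soonest, so every access misses and the cache behaves like plain streaming. With a working set that fits, only the first touches miss.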
