
Inferno blog

Experiments in self-hosted AI inference, GPU optimization, and ML systems.

Tensor Deduplication for Multi-Model Inference
inference · gpu · memory

Multi-model workloads are the norm: A/B tests, customer fine-tunes, safety variants, multi-stage pipelines. GPU memory requirements scale linearly with model count, and VRAM is usually the limiting resource. Tensor deduplication, sharing a single copy of identical weight tensors across models, can substantially cut that footprint; see the sketch below this card.

Dec 08, 2025 · 8 min read
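To make the idea concrete, here is a minimal PyTorch sketch, not code from the post: `tensor_key` and `dedup_state_dicts` are hypothetical helpers. It hashes each tensor's dtype, shape, and raw bytes, then points byte-identical entries at one canonical copy.

```python
# Hedged sketch: content-hash deduplication of model weights (PyTorch).
# tensor_key / dedup_state_dicts are illustrative names, not a real API.
import hashlib
import torch

def tensor_key(t: torch.Tensor) -> str:
    # Hash dtype, shape, and raw bytes so only byte-identical tensors match.
    h = hashlib.sha256()
    h.update(str(t.dtype).encode())
    h.update(str(tuple(t.shape)).encode())
    h.update(t.detach().cpu().contiguous().flatten().view(torch.uint8).numpy().tobytes())
    return h.hexdigest()

def dedup_state_dicts(state_dicts: list[dict[str, torch.Tensor]]) -> int:
    """Point duplicate tensors at one canonical copy; return bytes freed."""
    canonical: dict[str, torch.Tensor] = {}
    freed = 0
    for sd in state_dicts:
        for name, t in sd.items():
            key = tensor_key(t)
            if key in canonical and canonical[key] is not t:
                sd[name] = canonical[key]  # duplicate storage becomes collectable
                freed += t.numel() * t.element_size()
            else:
                canonical[key] = t
    return freed

sd_a = {"w": torch.ones(4, 4)}
sd_b = {"w": torch.ones(4, 4)}          # e.g. a layer untouched by a fine-tune
print(dedup_state_dicts([sd_a, sd_b]))  # 64 bytes freed; sd_b["w"] is sd_a["w"]
```

Two byte-identical layers then occupy memory once, with only the hash registry as overhead.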
Shared Backbones: Loading Weights Once, Serving Many Models
inference · gpu · memory

Many multimodal and multi-task models share the same underlying text encoder or LLM backbone. This post explores loading the shared backbone once and letting multiple task heads reuse it; a sketch of the pattern follows the card.

Nov 29, 2025 · 8 min read
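A minimal sketch of the pattern, assuming PyTorch; `MultiHeadServer` and the toy modules are illustrative stand-ins, not the post's actual models.

```python
# Hedged sketch: one shared backbone serving multiple task heads (PyTorch).
import torch
import torch.nn as nn

class MultiHeadServer(nn.Module):
    """Holds one backbone instance; every registered head reuses it."""
    def __init__(self, backbone: nn.Module, heads: dict[str, nn.Module]):
        super().__init__()
        self.backbone = backbone          # loaded once, shared by all heads
        self.heads = nn.ModuleDict(heads)

    @torch.inference_mode()
    def forward(self, head_name: str, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)       # single shared forward pass
        return self.heads[head_name](features)

# Toy stand-ins; a real deployment would plug in an LLM or text encoder here.
backbone = nn.Sequential(nn.Linear(512, 512), nn.GELU())
server = MultiHeadServer(backbone, {
    "sentiment": nn.Linear(512, 2),
    "safety": nn.Linear(512, 4),
})
logits = server("safety", torch.randn(1, 512))  # only the safety head runs
```

Because every head receives the same backbone instance, its weights are resident once regardless of how many heads are registered.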
Inferno © 2025
