
Inferno blog

Experiments in self-hosted AI inference, GPU optimization, and ML systems.

Tensor Deduplication for Multi-Model Inference
inference · gpu · memory

Multi-model workloads are the norm: A/B tests, customer fine-tunes, safety variants, multi-stage pipelines. GPU memory requirements scale linearly with model count, and VRAM is usually the limiting resource. Tensor deduplication, sharing a single copy of identical weight tensors across models, can substantially cut that footprint; see the sketch below this card.

Dec 08, 2025 · 8 min read
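To make the idea concrete, here is a minimal PyTorch sketch, not code from the post: `tensor_key` and `dedup_state_dicts` are hypothetical helpers. It hashes each tensor's dtype, shape, and raw bytes, then points byte-identical entries at one canonical copy.

```python
# Hedged sketch: content-hash deduplication of model weights (PyTorch).
# tensor_key / dedup_state_dicts are illustrative names, not a real API.
import hashlib
import torch

def tensor_key(t: torch.Tensor) -> str:
    # Hash dtype, shape, and raw bytes so only byte-identical tensors match.
    h = hashlib.sha256()
    h.update(str(t.dtype).encode())
    h.update(str(tuple(t.shape)).encode())
    h.update(t.detach().cpu().contiguous().flatten().view(torch.uint8).numpy().tobytes())
    return h.hexdigest()

def dedup_state_dicts(state_dicts: list[dict[str, torch.Tensor]]) -> int:
    """Point duplicate tensors at one canonical copy; return bytes freed."""
    canonical: dict[str, torch.Tensor] = {}
    freed = 0
    for sd in state_dicts:
        for name, t in sd.items():
            key = tensor_key(t)
            if key in canonical and canonical[key] is not t:
                sd[name] = canonical[key]  # duplicate storage becomes collectable
                freed += t.numel() * t.element_size()
            else:
                canonical[key] = t
    return freed

sd_a = {"w": torch.ones(4, 4)}
sd_b = {"w": torch.ones(4, 4)}          # e.g. a layer untouched by a fine-tune
print(dedup_state_dicts([sd_a, sd_b]))  # 64 bytes freed; sd_b["w"] is sd_a["w"]
```

Two byte-identical layers then occupy memory once, with only the hash registry as overhead.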
Shared Backbones: Loading Weights Once, Serving Many Models
inference · gpu · memory

Many multimodal and multi-task models share the same underlying text encoder or LLM backbone. This post explores loading the shared backbone once and letting multiple task heads reuse it; a sketch of the pattern follows the card.

Nov 29, 2025 · 8 min read
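A minimal sketch of the pattern, assuming PyTorch; `MultiHeadServer` and the toy modules are illustrative stand-ins, not the post's actual models.

```python
# Hedged sketch: one shared backbone serving multiple task heads (PyTorch).
import torch
import torch.nn as nn

class MultiHeadServer(nn.Module):
    """Holds one backbone instance; every registered head reuses it."""
    def __init__(self, backbone: nn.Module, heads: dict[str, nn.Module]):
        super().__init__()
        self.backbone = backbone          # loaded once, shared by all heads
        self.heads = nn.ModuleDict(heads)

    @torch.inference_mode()
    def forward(self, head_name: str, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)       # single shared forward pass
        return self.heads[head_name](features)

# Toy stand-ins; a real deployment would plug in an LLM or text encoder here.
backbone = nn.Sequential(nn.Linear(512, 512), nn.GELU())
server = MultiHeadServer(backbone, {
    "sentiment": nn.Linear(512, 2),
    "safety": nn.Linear(512, 4),
})
logits = server("safety", torch.randn(1, 512))  # only the safety head runs
```

Because every head receives the same backbone instance, its weights are resident once regardless of how many heads are registered.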
Inferno © 2025
