Shared Backbones: Loading Weights Once, Serving Many Models
Many multimodal and multi-task models share the same underlying text encoder or LLM backbone. This post explores loading shared backbones once and letting multiple heads reuse them.
I keep running into the same pattern when trying to self-host models (which is a lot of fun): we run several big models side by side, all of them valuable, all of them slightly different, and all of them wasting VRAM by reloading nearly the same weights.
This post is my attempt to explore a specific idea:
Can we load a shared backbone of weights once on a GPU, then load only the small, unique pieces per model that reuse that backbone?
For example, a BLIP-2 captioning model and a T5-XXL question-answering model may both carry the same T5-XXL weights. Today, that means roughly 11 billion parameters are loaded twice.
This idea is only interesting in a very particular regime:
- The models in the group all provide distinct value. You do not want to replace one with another.
- At the same time, they have large parameter overlap: something like 70 to 90 percent of their weights are structurally the same.
When both of those are true, paying the VRAM cost for each full checkpoint starts to feel obviously wrong.
1. Dedup weights?
Today, if you run a handful of related models on one node, you usually pay this cost:
- Each model comes as a full checkpoint: 7B, 8B, 30B, 70B parameters or more.
- Each checkpoint is loaded into GPU memory as a separate blob.
- Even if 80 percent of the weights are identical across two models, they still occupy VRAM twice.
In practice, many combinations of captioning, VQA, embedding, reranking, and generation models reuse the same backbone layers under the hood even though most inference servers treat them as independent models.
That might be fine if one model wins and the others are just worse baselines. But that is not what happens in practice.
In reality, we deploy families of models that:
- Share large text encoders, towers, or multimodal backbones.
- Differ in heads, diffusion backbones, adapters, and small sets of fine tuned weights.
- Are all needed in production because they serve different roles.
This is where a shared backbone idea starts to look compelling.
2. Some Shared Backbone Scenarios
The shared backbone idea is about co-hosting models that:
- Serve clearly different purposes for the user.
- Share a large fraction of parameters under the hood.
You want clusters of models that are:
- Fast plus slow: a small model for instant feedback, a larger one for final high quality output.
- Same backbone, different heads: chat, vision, and reranking all built on one text tower.
- Same encoder, different decoders: one encoder that feeds diffusion, captioning, and document QA.
The rest of this post walks through concrete examples in that sweet spot.
3. Example Cluster: Fast and Slow Variants Sharing a Backbone
3.1 Flux.1 Schnell + Flux.1 Dev
In text-to-image, users often want both:
- A fast model for instant previews.
- A slow, better model for final output.
For example:
- `black-forest-labs/FLUX.1-schnell` for fast draft generation.
- `black-forest-labs/FLUX.1-dev` for slower, higher quality images.
These two checkpoints ship the same text encoders: a CLIP text tower plus a heavy T5-XXL encoder. The text side is large and identical, while the diffusion transformers differ (Schnell is distilled for fast, few-step generation).
From the user perspective:
- The fast model powers interactive UX: drafts, quick suggestions, live previews.
- The slow model is for final renders, marketing assets, production content.
You never want to replace one with the other. You want both. And under the hood, a large chunk of weights could be shared.
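As a concrete sketch of what sharing could look like today, diffusers lets you pass already-loaded components into a second pipeline, so the text encoders only sit in memory once. Treat this as a sketch rather than a production recipe; it assumes both repos use the standard diffusers layout and ship identical encoder weights:

```python
import torch
from diffusers import FluxPipeline

# Load the fast model normally; this pulls in the CLIP and T5 text encoders.
schnell = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)

# Load the slow model, but hand it the components that are already in memory,
# so only the Dev diffusion transformer is loaded on top.
dev = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=schnell.text_encoder,
    text_encoder_2=schnell.text_encoder_2,
    tokenizer=schnell.tokenizer,
    tokenizer_2=schnell.tokenizer_2,
    vae=schnell.vae,
    torch_dtype=torch.bfloat16,
)
```

This only covers the easy part (the encoders and VAE); the two diffusion transformers are still separate multi-billion-parameter blobs, which is exactly where a real shared backbone runtime would have to do more work.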
3.2 SDXL Turbo + SDXL Base and Style Variants
A similar pattern shows up in SDXL:
- `stabilityai/sdxl-turbo` for 30 to 50 times faster previews at lower quality.
- `stabilityai/stable-diffusion-xl-base-1.0` for higher quality base images.
- `SG161222/RealVisXL_V4.0` for photorealistic style.
- `RunDiffusion/JuggernautXL_v9` for cinematic style.
Most of these share the same pair of CLIP text encoders (CLIP ViT-L plus OpenCLIP ViT-bigG). Together, those encoders are a large, expensive chunk of the total parameters.
Again, the story is similar:
- Turbo for instant preview.
- Base or style variants for final delivery.
In a shared backbone world, you would load those CLIP text encoders once and let multiple diffusion heads reuse them.
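A sketch of that, again leaning on the diffusers component-override pattern. It assumes each variant is published in the diffusers layout and really does ship unmodified encoder weights; many fine tunes touch the encoders too, which is the drift problem discussed later:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import CLIPTextModel, CLIPTextModelWithProjection

BASE = "stabilityai/stable-diffusion-xl-base-1.0"

# Load the two SDXL text encoders exactly once.
text_encoder = CLIPTextModel.from_pretrained(
    BASE, subfolder="text_encoder", torch_dtype=torch.float16
)
text_encoder_2 = CLIPTextModelWithProjection.from_pretrained(
    BASE, subfolder="text_encoder_2", torch_dtype=torch.float16
)

def load_variant(repo_id: str) -> StableDiffusionXLPipeline:
    # Each variant reuses the shared encoder instances; only its UNet (and VAE) is new.
    return StableDiffusionXLPipeline.from_pretrained(
        repo_id,
        text_encoder=text_encoder,
        text_encoder_2=text_encoder_2,
        torch_dtype=torch.float16,
    )

turbo = load_variant("stabilityai/sdxl-turbo")
base = load_variant(BASE)
realvis = load_variant("SG161222/RealVisXL_V4.0")
```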
3.3 Stable Diffusion 3 Base + SD3 Turbo and Style Variants
Stable Diffusion 3 repeats the pattern with a T5 XXL text backbone:
- `stabilityai/stable-diffusion-3-medium` as a general high quality model.
- `Shakker-Labs/SD3-Turbo` for fast generations.
- Many community style variants on top.
All of these tend to share a massive T5 XXL text encoder that can be several gigabytes even in reduced precision.
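To get a feel for how big that shared piece is, you can load just the T5 encoder and count its parameters. The repo id and subfolder name below assume the diffusers-format SD3 medium checkpoint:

```python
import torch
from transformers import T5EncoderModel

# Load only the shared T5-XXL text encoder from the SD3 checkpoint.
t5 = T5EncoderModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="text_encoder_3",
    torch_dtype=torch.float16,
)

n_params = sum(p.numel() for p in t5.parameters())
print(f"T5-XXL encoder: {n_params / 1e9:.1f}B params, "
      f"~{n_params * 2 / 1e9:.1f} GB in fp16")
```

That is memory every SD3 style variant on the node pays again today, even when the encoder tensors are effectively identical.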
4. Example Cluster: One Llama Tower, Many Purposes
Consider using a model like Llama 3 8B as a common text backbone.
You might want three very different endpoints:
- Chat and agents — For support bots, orchestration, tools, and general conversations.
- Vision question answering — LLaVA style models for screenshot understanding, product image Q&A, document Q&A.
- Reranking and scoring — A model that scores search results or ranks candidate answers.
All three can share the same Llama 3 text tower, which might be 80 to 90 percent of the parameters.
They differ in:
- An extra vision encoder plus projector for vision models.
- Different heads, adapters, or LoRAs for chat versus reranking.
From a user perspective, these are distinct tools:
- A chatbot or agent endpoint.
- A vision understanding endpoint.
- A retrieval scoring or reranking endpoint.
From a systems perspective, a shared backbone runtime could:
- Load the Llama 3 tower once.
- Attach small, task specific components as needed per request.
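The chat and reranking parts of this already map onto existing tooling: a frozen base plus per-task LoRA adapters via PEFT. The adapter repo ids below are hypothetical placeholders, and the vision endpoint needs more than a LoRA (an image encoder plus projector), but the Llama tower still only needs to be resident once:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Attach several small adapters to the single resident tower.
# These adapter repo ids are hypothetical placeholders.
model = PeftModel.from_pretrained(base, "my-org/llama3-chat-lora", adapter_name="chat")
model.load_adapter("my-org/llama3-rerank-lora", adapter_name="rerank")

model.set_adapter("chat")     # handle a chat request
model.set_adapter("rerank")   # switch for a reranking request
```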
5. Example Cluster: One T5 XXL for Text-to-Image, Image-to-Text, and Document QA
Now imagine using T5 XXL as a text backbone across several modalities.
You might have:
- Image to text (captioning and VQA) — BLIP-2 style models that use T5 XXL as the language model for captioning and visual question answering.
- Text to image — Stable Diffusion 3 type models where T5 XXL is the text encoder that conditions the diffusion process.
- Document and OCR question answering — A T5 XXL fine tune for long form document QA or summarization.
These correspond to very different user facing features:
- Generate marketing or product images.
- Caption dashboards, screenshots, or photos.
- Answer questions about scanned PDFs and long documents.
Yet all of them could reuse one large T5 XXL text tower.
A shared backbone approach would:
- Load T5 XXL to GPU once.
- Attach BLIP-2 vision heads, SD3 diffusion heads, and QA heads as small, separate modules.
This is the kind of configuration where you get both large VRAM savings and very different business use cases.
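The mechanism that makes this possible is ordinary object sharing: several head modules can hold a reference to one frozen encoder instance, so the big tensors exist once no matter how many heads you build. A toy sketch, using `t5-small` so it runs anywhere; the heads are hypothetical stand-ins for BLIP-2, SD3, and QA components:

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel

# One frozen encoder instance (T5-XXL in practice; t5-small keeps the sketch light).
shared_encoder = T5EncoderModel.from_pretrained("t5-small")
shared_encoder.requires_grad_(False)

class CaptionHead(nn.Module):
    """Hypothetical captioning head reading pooled encoder states."""
    def __init__(self, encoder: T5EncoderModel, vocab_size: int = 32128):
        super().__init__()
        self.encoder = encoder  # a reference, not a copy
        self.proj = nn.Linear(encoder.config.d_model, vocab_size)

    def forward(self, input_ids, attention_mask=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.proj(h.mean(dim=1))

class QAHead(nn.Module):
    """Hypothetical span-scoring head over the same encoder states."""
    def __init__(self, encoder: T5EncoderModel):
        super().__init__()
        self.encoder = encoder
        self.span = nn.Linear(encoder.config.d_model, 2)

    def forward(self, input_ids, attention_mask=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.span(h)

caption = CaptionHead(shared_encoder)
qa = QAHead(shared_encoder)
# Both heads point at the same parameter tensors; the encoder lives in memory once.
assert caption.encoder is qa.encoder
```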
6. Example Cluster: One SigLIP for Retrieval, Generation, and Multimodal Assistance
SigLIP is another natural candidate for sharing.
You can imagine a stack that uses one SigLIP encoder for:
- Image retrieval — CLIP-like models for visual search and finding similar images.
- Text to image generation — diffusion models that condition on a SigLIP style text embedding.
- Multimodal assistance — Vision language assistants that use SigLIP as an image encoder for VQA and multimodal chat.
User facing purposes here include:
- Visual search over catalogs and product images.
- Generating marketing or UI art.
- Answering questions about screenshots, PDFs, or photos.
Under a shared backbone design:
- SigLIP is loaded once.
- Retrieval, generation, and assistant endpoints all reuse that encoder.
The shared component is large and common, while the value of each endpoint is genuinely different.
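A sketch of the retrieval side with one resident SigLIP, using the transformers implementation. The checkpoint is one of the published SigLIP models; the claim that an assistant head could reuse `siglip.vision_model` is the speculative part:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

CKPT = "google/siglip-so400m-patch14-384"
processor = AutoProcessor.from_pretrained(CKPT)
siglip = SiglipModel.from_pretrained(CKPT).to("cuda").eval()

@torch.no_grad()
def embed_images(images: list[Image.Image]) -> torch.Tensor:
    """Retrieval endpoint: image embeddings for visual search."""
    inputs = processor(images=images, return_tensors="pt").to("cuda")
    return siglip.get_image_features(**inputs)

@torch.no_grad()
def embed_texts(texts: list[str]) -> torch.Tensor:
    """Retrieval endpoint: text embeddings for text-to-image search."""
    inputs = processor(text=texts, padding="max_length", return_tensors="pt").to("cuda")
    return siglip.get_text_features(**inputs)

# The same vision tower (siglip.vision_model) is what a multimodal assistant
# would use as its image encoder; it is already resident, so an assistant head
# could reference it instead of loading its own copy.
```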
7. VRAM Saving Estimates
Some rough estimates of potential VRAM savings:
| Cluster | Shared backbone | Unique per model | Models | VRAM (no sharing) | VRAM (with sharing) | Savings |
|---|---|---|---|---|---|---|
| Flux.1 Schnell + Dev | ~24 GB | ~1 GB | 2 | 50 GB | 26 GB | ~48% |
| SDXL Turbo + Base + styles | ~14 GB | ~2 GB | 4 | 64 GB | 22 GB | ~66% |
Even with conservative assumptions, sharing the large tower once usually saves tens of gigabytes of VRAM per node for these clusters.
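The arithmetic behind those rows is just the following; the gigabyte figures are the same rough estimates as in the table, not measurements:

```python
def vram_estimate(shared_gb: float, unique_gb: float, n_models: int):
    """No-sharing vs sharing VRAM for a cluster of n models on one backbone."""
    no_sharing = n_models * (shared_gb + unique_gb)
    with_sharing = shared_gb + n_models * unique_gb
    savings = 1 - with_sharing / no_sharing
    return no_sharing, with_sharing, savings

print(vram_estimate(24, 1, 2))   # Flux cluster:  (50, 26, 0.48)
print(vram_estimate(14, 2, 4))   # SDXL cluster:  (64, 22, ~0.66)
```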
8. Implementation
All of these examples point to the same core requirement:
Load the big, shared weights once per device, then cheaply switch or stack the smaller unique pieces per model.
In practical terms, a shared backbone runtime would need to:
- Keep one or more backbones resident on each GPU.
- Maintain a registry of small, model specific components (heads, adapters, diffusion backbones).
- Route each request to a combination of one backbone plus one or more unique components.
- Manage batching and scheduling so that different logical models can share the same physical backbone compute.
This is more ambitious than simply loading one model per process, but the payoff is potentially large VRAM savings and more flexible multi-model serving.
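Here is a minimal sketch of what that registry and routing layer could look like. Everything below is hypothetical scaffolding, not an existing serving API, and batching across logical models is deliberately left out:

```python
from typing import Callable, Dict

import torch
import torch.nn as nn


class BackboneRegistry:
    """Keeps each large backbone resident exactly once per device."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], nn.Module]] = {}
        self._backbones: Dict[str, nn.Module] = {}

    def register(self, name: str, loader: Callable[[], nn.Module]) -> None:
        self._loaders[name] = loader

    def get(self, name: str) -> nn.Module:
        # Lazy-load once; every later caller gets the same frozen instance.
        if name not in self._backbones:
            backbone = self._loaders[name]()
            self._backbones[name] = backbone.eval().requires_grad_(False)
        return self._backbones[name]


class LogicalModel:
    """One serveable endpoint: a backbone name plus its small unique head."""

    def __init__(self, backbone_name: str, head: nn.Module) -> None:
        self.backbone_name = backbone_name
        self.head = head


class Router:
    def __init__(self, registry: BackboneRegistry) -> None:
        self.registry = registry
        self.endpoints: Dict[str, LogicalModel] = {}

    def add_endpoint(self, name: str, model: LogicalModel) -> None:
        self.endpoints[name] = model

    @torch.no_grad()
    def run(self, endpoint: str, *inputs):
        model = self.endpoints[endpoint]
        backbone = self.registry.get(model.backbone_name)
        features = backbone(*inputs)    # shared compute, shared weights
        return model.head(features)     # per-model compute, small weights
```

The hard parts this glosses over, batching requests from different logical models into one backbone forward pass and handling backbones with different input signatures, are where most of the engineering would go.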
9. Challenges
There are real technical challenges hiding behind the nice examples:
- Fine tunes that start from the same checkpoint still drift in their weights, so extracting a clean shared subset is nontrivial.
- Quantization formats often differ across models, so a shared backbone must handle mixed precisions or align formats.
- Some layers drift far more than others, which makes it hard to decide what belongs in the backbone versus the per model delta.
- There is no standard tensor diff format for weights, so today we mostly ship full copies instead of base plus delta.
- Inference engines usually assume one contiguous blob per model, not a composable backbone plus components.
So while the motivation is straightforward, getting to a robust implementation requires research on weight factoring, new storage formats, and changes in serving infrastructure.
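As one example of the base-plus-delta direction, here is a toy sketch that compares two plain PyTorch state dicts and keeps only the tensors that actually changed. Real checkpoints would also need quantization-aware comparison and an agreed storage format, which is exactly the missing piece:

```python
import torch


def extract_delta(base_sd: dict, finetuned_sd: dict, atol: float = 1e-6) -> dict:
    """Keep only tensors that differ from the base checkpoint.

    Both arguments are plain state dicts, e.g. from torch.load or
    safetensors.torch.load_file.
    """
    delta = {}
    for name, base_t in base_sd.items():
        ft_t = finetuned_sd.get(name)
        if ft_t is None:
            continue  # tensor dropped in the fine tune
        if ft_t.shape != base_t.shape:
            delta[name] = ft_t  # reshaped tensor must ship in full
        elif not torch.allclose(ft_t, base_t, atol=atol):
            delta[name] = ft_t - base_t  # store only the difference
    # Tensors that only exist in the fine tune (new heads, adapters) ship in full.
    for name, ft_t in finetuned_sd.items():
        if name not in base_sd:
            delta[name] = ft_t
    return delta


def apply_delta(base_sd: dict, delta: dict) -> dict:
    """Reconstruct the fine-tuned weights as base + delta."""
    merged = dict(base_sd)
    for name, d in delta.items():
        if name in base_sd and base_sd[name].shape == d.shape:
            merged[name] = base_sd[name] + d
        else:
            merged[name] = d  # new or reshaped tensor shipped in full
    return merged
```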
10. Worth Exploring
Despite the friction, I think this idea is worth exploring:
- Adapter-only fine tunes and LoRA stacks already demonstrate that you can do a lot of work with a frozen core and small deltas.
- Multimodal and diffusion systems increasingly reuse the same encoders across many endpoints.
- Mixture of Experts models share huge parts of their networks at runtime, showing that shared compute is viable at scale.
If we accept that many model families have:
- High overlap in parameters, and
- Real, distinct value across their variants,
then it feels wasteful to keep paying VRAM and load time for fully separate checkpoints.
I do not know yet exactly how far this idea can go, but the examples above make it clear that there is a lot of redundant weight sitting in memory today. Finding a good way to load the shared parts once and treat the unique bits as small add-ons feels like a direction worth pushing on.
I plan to explore this idea further and share any interesting results in a follow-up post.