Optimizing vLLM Cold Start with Model Streaming and Compile Caching

Tue, 30 Jun 2026 00:00:00 +1000

I’ve recently been working on an AI-at-the-edge project (SafetyLens), which is a topic for another post. While developing the demo, I frequently needed to tweak the model serving engine, swap models in and out, and refine configurations. Every one of those changes meant restarting vLLM, and each restart triggered a full cold start that took five minutes or more before the server was ready to serve requests again.

And that’s with just one Qwen3.6-35B-A3B model running off local NVMe storage on a single NVIDIA DGX Spark node. It will only get worse once I scale the deployment to a multi-node cluster backed by an external PVC, where the weights have to travel over the network on every start. So I went looking for ways to cut vLLM cold-start time, and it turns out most of that time is recoverable, with no changes to the model or its output quality.

GenAI on Route179
