AWS Publications

Deploy DeepSeek-R1-0528-671B on Amazon EKS using vLLM

Deploy production generative AI at the edge using Amazon EKS Hybrid Nodes with NVIDIA DGX

Personal Blog

Optimizing vLLM Cold Start with Model Streaming and Compile Caching

I’ve recently been working on an AI-at-the-edge project, SafetyLens (a topic for another post). While developing the demo, I frequently needed to tweak the model serving engine, swap models in and out, and refine configurations. Every one of those changes meant restarting vLLM, and each restart triggered a full cold start of five minutes or more before the server was ready to serve again. And that’s with a single Qwen3.6-35B-A3B model loading from local NVMe storage on one NVIDIA DGX Spark node. It will only get worse once I scale the deployment to a multi-node cluster backed by an external PVC, where the weights have to travel over the network on every start. So I went looking for ways to cut vLLM cold-start time, and it turns out most of it is recoverable, with no changes to the model or its output quality. ...

June 30, 2026 · 10 min · route179