Optimizing vLLM Cold Start with Model Streaming and Compile Caching
I’ve recently been working on an AI-at-the-edge project, SafetyLens, which is a topic for another post. While developing the demo, I frequently needed to tweak the model serving engine, swap models in and out, and refine configurations. Every one of those changes meant restarting vLLM, and each restart triggered a full cold start that took five minutes or more before the server was ready to serve again. And that’s with just one Qwen3.6-35B-A3B model running off local NVMe storage on a single NVIDIA DGX Spark node. It will only get worse once I scale the deployment to a multi-node cluster backed by an external PersistentVolumeClaim (PVC), where the weights have to travel over the network on every start. So I went looking for ways to cut down the vLLM cold-start time, and it turns out most of it is recoverable, with no changes to the model or its output quality. ...