Deep Dive into vLLM Architecture
Inside vLLM
A Complete Technical Reference
Engine construction, KV cache mechanics, the scheduler loop, slot mapping, prefix caching, guided decoding, and speculative decoding: from source to silicon.
Foundation: How LLMs Generate Text
Everything in vLLM is shaped by two fundamental principles about autoregressive language models. These determine every design decision in the system.
1: One token at a time, sequentially
LLMs generate tokens one by one. Given a sequence [t₁, t₂, ..., tₙ], the model runs a full forward pass to sample token tₙ₊₁, then runs again. This is autoregressive generation: each step depends on all prior tokens, and no step can start until the previous one completes.
Generation has two distinct phases with entirely different performance characteristics. Prefill processes the full input prompt in a single parallel pass, which is compute-bound. Decode generates tokens one at a time, loading all model weights each step, which is memory-bandwidth-bound.
2: The KV Cache
In transformer attention, every token produces a Key (K) and Value (V) vector. During decode, the new token must attend to all past tokens' K and V. Rather than recomputing past K,V from scratch each step, they're stored. We call this the KV cache. It grows as generation proceeds, with memory allocated in blocks of 16 tokens at a time (block_size = 16 by default).
The KV cache for a single request at a single layer in bf16: 2 (K and V) × seq_len × num_kv_heads × head_dim × 2 bytes.
For Llama-3-8B (8 KV heads, head_dim = 128), the KV cache footprint per token per layer is:
2 × 8 × 128 × 2 bytes = 4,096 bytes
Across 32 layers, this becomes:
4,096 × 32 = 131,072 bytes ≈ 128 KB per token
For a 2,048-token response:
128 KB × 2,048 = 256 MB
That means a single long generation can consume 256 MB of KV cache memory. When serving many concurrent requests, this memory usage scales linearly, making KV memory management the central constraint in large-scale LLM inference systems.
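These numbers are easy to recompute. A small sketch using the Llama-3-8B values from the text:

```python
def kv_bytes_per_token(num_kv_heads: int, head_dim: int, num_layers: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: K and V vectors across all layers."""
    return 2 * num_kv_heads * head_dim * dtype_bytes * num_layers

# Llama-3-8B: 8 KV heads, head_dim 128, 32 layers, bf16 (2 bytes)
per_token = kv_bytes_per_token(num_kv_heads=8, head_dim=128, num_layers=32)
print(per_token)                           # 131072 bytes = 128 KB per token
print(per_token * 2048 / 2**20, "MiB")     # 256.0 MiB for a 2,048-token sequence
```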
Engine Construction
The LLM engine is the fundamental building block of vLLM. On its own, it already enables high-throughput inference.
vLLM has two entry points: LLM for offline synchronous use (scripts, benchmarks), and AsyncLLMEngine for online serving. Both wrap the same LLMEngine / EngineCore.
We'll use the following offline inference snippet as our running example (adapted from basic.py).
```python
# Offline usage
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

def main():
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    outputs = llm.generate(prompts, sampling_params)

if __name__ == "__main__":
    main()
```
This configuration is:
- Offline - no web or distributed system scaffolding
- Synchronous - all execution happens in a single blocking process
- Single-GPU - no data, model, pipeline, or expert parallelism
LLM Engine Constructor
The main components of the engine are:
- vLLM config - controls model settings, KV cache size, parallelism, and other configuration knobs.
- Processor - validates raw input, tokenizes it, and converts it into internal engine requests.
- Engine core client - runs the model. In offline mode this is InprocClient; in large-scale serving it becomes a distributed client.
- Output processor - converts raw engine outputs into the final response returned to the user.
Engine Core Components
1. Model Executor
Drives forward passes on the GPU.
- UniProcExecutor - single process, single GPU.
- MultiProcExecutor - supports multiple GPUs.
2. Structured Output Manager
Handles guided decoding (e.g., structured or JSON outputs).
3. Scheduler
Determines which requests are processed next.
- Policy setting - FCFS (First Come First Served) or priority-based scheduling.
- Waiting and Running queues
4. KV Cache Manager
The core of paged attention. It manages memory blocks used for storing KV cache and maintains a free block queue which is a pool of available KV-cache blocks. These blocks act as an indexing structure that maps tokens to their computed KV cache.
Depending on VRAM size and block size, there may be hundreds of thousands of available blocks. As emphasized earlier, efficient KV cache management is critical for high-concurrency inference.
Let's take a look at the steps:
Engine Initialization and Model Loading
1. Initialize the Device
- Assign a CUDA device (e.g., cuda:0) to the worker.
- Verify that the model dtype (e.g., bf16) is supported.
- Check available VRAM against gpu_memory_utilization (e.g., 0.8 = 80%).
- Configure distributed settings (DP, TP, PP, EP, etc.).
- Instantiate a model_runner (sampler, KV cache, forward-pass GPU buffers).
- Instantiate an InputBatch (CPU-side buffers, KV block tables, sampling metadata).
2. Load the Model
- Instantiate the model architecture.
- Load model weights.
- Call model.eval() to enable inference mode.
- Optionally apply torch.compile() for optimization.
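vLLM uses its own model loader, but the sequence is roughly the following sketch, with Hugging Face transformers standing in for the internal loader:

```python
import torch
from transformers import AutoModelForCausalLM

# Instantiate the architecture and load weights
# (stand-in for vLLM's internal model loader)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.bfloat16,
).to("cuda:0")

model.eval()                    # inference mode: disables dropout etc.
model = torch.compile(model)    # optional: fuse kernels / optimize the graph
```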
3. Initialize the KV Cache
- Determine per-layer KV cache specification.
- Run a dummy forward pass to profile GPU memory usage.
- Compute how many KV cache blocks fit into available VRAM.
- Allocate and bind KV cache tensors to attention layers.
- Prepare attention metadata (e.g., select FlashAttention backend).
- Unless --enforce-eager is provided, do a dummy run for each warmup batch size and capture CUDA graphs. CUDA graphs record the whole sequence of GPU work into a DAG.
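For intuition, this is roughly what capture-and-replay looks like with PyTorch's public CUDA graph API; a simplified sketch with a toy model, not vLLM's actual capture code:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.zeros(8, 4096, device="cuda")   # one graph per batch shape

# Warm up on a side stream before capture (required by CUDA graphs)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture: record every kernel launched inside the context into a graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

# Replay: copy fresh inputs into the static buffer, relaunch the whole DAG
static_in.copy_(torch.randn(8, 4096, device="cuda"))
g.replay()                                        # static_out now holds new results
```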
Fig 2 - LLM Engine Constructor: 6 Steps Before the First Token
Two execution modes
After initialization, every forward pass runs in one of two modes. Eager mode: standard PyTorch. It's flexible, but each kernel launch has Python overhead. Captured mode: CUDA graphs are recorded during init_kv_cache() for common batch sizes, then replayed as a single GPU command with no Python overhead and dramatically lower latency. Pass --enforce-eager to skip capture.
The generate() Loop
When you call llm.generate(prompts, sampling_params), two phases happen: feeding requests into the engine, then stepping until all complete.
Phase A: Feeding requests
- Create a unique request_id (UUID) and capture arrival_time for priority scheduling.
- Call _process_inputs() → tokenize, validate, return {prompt, prompt_token_ids, type}.
- Pack into EngineCoreRequest with priority, sampling params, and all metadata.
- Pass to engine core → wrapped in a Request object, status = WAITING, added to the scheduler's waiting queue.
Phase B: The step() function
Once the engine has requests, it calls step() in a loop until all requests finish. Each step is exactly three stages:
- Schedule: choose which requests to run (decode and/or prefill)
- Forward pass: execute one step of the model for all running requests
- Post-process: append new tokens to request states, detokenize, check for completion, prepare outputs
- Process RUNNING first. For each request, call allocate_slots(req_id, num_new_tokens=1). If KV memory is tight, preempt the lowest-priority RUNNING request: free its blocks and push it back to WAITING.
- Pull from WAITING. For each waiting request (in priority order), check whether enough free blocks exist for its prefill. If yes, move it to RUNNING and allocate slots. If not, skip it until the next step.
- Return SchedulerOutput: the batch plan - which requests run, token counts, block tables, prefill vs decode flags. The model executor uses this to build all inputs.
- Update state - prune finished requests from input_batch; refresh KV cache block metadata for each active request.
- Prepare inputs - copy buffers CPU → GPU; compute position indices; build slot_mapping (maps each token to its physical slot in paged KV memory); construct the attention metadata struct.
- Forward pass - run the model with custom paged-attention kernels. All sequences are flattened and concatenated into one "super sequence." Position indices and attention masks ensure each sequence attends only to its own tokens, enabling continuous batching without right-padding.
- Gather last-token states - extract the hidden state at each sequence's final position and project to vocabulary logits.
- Sample - pick the next token from the logits per the sampling config (greedy, temperature, top-p, top-k, etc.).
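Conceptually, the whole loop is this simple (a paraphrase of the control flow, not vLLM's literal code; the attribute names here are illustrative and mirror the components described above):

```python
def run_engine(engine):
    outputs = []
    while engine.has_unfinished_requests():
        # 1. Schedule: pick requests, allocate KV blocks, build the batch plan
        scheduler_output = engine.scheduler.schedule()

        # 2. Forward pass: one model step over the whole batch
        model_output = engine.model_executor.execute_model(scheduler_output)

        # 3. Post-process: append tokens, detokenize, retire finished requests
        outputs.extend(engine.process_outputs(scheduler_output, model_output))
    return outputs
```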
The Scheduler
The scheduler is the brain of vLLM. Before every step it decides: which requests run, which wait, and who gets preempted if memory is tight. It manages two queues:
Waiting queue: requests that have arrived but haven't started. Status: WAITING. Ordered by policy (FCFS or priority). The scheduler tries to promote them to RUNNING each step, if enough free KV blocks exist for their prefill.
Running queue: requests actively generating tokens. Status: RUNNING. Each decode step processes one new token per request, reading all past KV from cache. Always processed first - these are live user requests.
Scheduler decision algorithm
| Mode | Queue type | Ordering |
|---|---|---|
| FCFS | Deque (append/pop) | First in, first served by arrival_time |
| Priority | Heap (heappush/heappop) | User-provided priority value; lower = higher priority |
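A minimal sketch of the two orderings using Python's standard containers (illustrative, not vLLM's actual queue classes):

```python
import heapq
from collections import deque

# FCFS: arrival order is the priority
waiting_fcfs = deque()
waiting_fcfs.append(("req-1", 0.0))               # (request_id, arrival_time)
waiting_fcfs.append(("req-2", 0.1))
next_req = waiting_fcfs.popleft()                 # req-1: first in, first served

# Priority: lower value = scheduled sooner; arrival_time breaks ties
waiting_prio = []
heapq.heappush(waiting_prio, (5, 0.0, "req-1"))   # (priority, arrival_time, id)
heapq.heappush(waiting_prio, (1, 0.1, "req-2"))
next_req = heapq.heappop(waiting_prio)            # req-2: priority 1 beats 5
```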
Available KV memory is bounded by gpu_memory_utilization and request concurrency limits.
Forward Pass
We call ModelExecutor.execute_model, which delegates to the Worker, which
in turn delegates to the ModelRunner. Here are the main steps:
The forward pass itself has two execution modes:
- Eager mode - standard PyTorch forward pass when eager execution is enabled.
- Captured mode - replay of a pre-captured CUDA Graph when eager is not enforced (graphs are recorded during engine startup in the initialize-KV-cache procedure).
The diagram below walks through an example that makes continuous batching and paged attention concrete:
The Forward Pass - Inside execute_model()
Every step() dispatches a single forward pass over the current batch; the five-stage pipeline inside execute_model() is the one detailed in the step() walkthrough above.
What is slot_mapping?
This is the critical piece that glues the flat batch to the paged KV cache. The slot_mapping tells the CUDA attention kernel: "token X in the flattened super-sequence corresponds to physical KV slot Y in GPU memory."
Physical slot = block_id × block_size + position_within_block. If a request holds blocks [3, 7] (block size 16), its tokens map to physical slots [48..63, 112..127]. The slot_mapping is a flat array with one entry per token that the CUDA kernel uses to look up the right memory address without branching.
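A sketch of that computation (build_slot_mapping is a hypothetical helper, but the arithmetic is exactly the formula above):

```python
def build_slot_mapping(block_table: list[int], num_tokens: int,
                       block_size: int = 16) -> list[int]:
    """Map each logical token position to its physical KV slot."""
    slots = []
    for pos in range(num_tokens):
        block_id = block_table[pos // block_size]   # which physical block
        offset = pos % block_size                   # position within the block
        slots.append(block_id * block_size + offset)
    return slots

# Request holding physical blocks [3, 7] with 20 tokens written:
print(build_slot_mapping([3, 7], num_tokens=20))
# [48, 49, ..., 63, 112, 113, 114, 115]
```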
Worked Example: Three Prompts, Prefill → Decode
Let's trace exactly what vLLM does with three prompts from start to finish, with block_size=4 for clarity.
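As a sketch of what happens, here is a tiny simulation of the block allocation side with block_size=4 (the prompt lengths and pool size are made up for illustration):

```python
from collections import deque

BLOCK_SIZE = 4
free_blocks = deque(range(10))            # small pool for the example

def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)   # ceiling division

# Hypothetical prompt lengths for requests A, B, C
prompts = {"A": 6, "B": 3, "C": 5}
block_tables = {}

# Prefill: each request grabs just enough blocks for its prompt's KV
for req, n in prompts.items():
    block_tables[req] = [free_blocks.popleft() for _ in range(blocks_needed(n))]
print(block_tables)                        # {'A': [0, 1], 'B': [2], 'C': [3, 4]}

# Decode: B's 4th token still fits in block 2 (slots 0-3);
# its 5th token would overflow, so allocate one more block then.
block_tables["B"].append(free_blocks.popleft())
print(block_tables["B"])                   # [2, 5]
```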
Advanced Features
With the basic engine flow in place, we can now look at the advanced features. We've already discussed preemption, paged attention, and continuous batching.
Paged Attention
Before vLLM, every framework pre-allocated the maximum possible KV cache for every request, even if it only used a fraction. A 2K-token limit request would claim 2K × 128KB = 256MB from the moment it arrived, even if it only generated 50 tokens. This made serving memory-inefficient and throughput low.
Paged Attention solves this by dividing the KV cache into fixed-size pages (blocks) of 16 tokens each. Pages are allocated dynamically as generation proceeds, only when new tokens actually need them. Pages for different requests can be non-contiguous in GPU memory (just like virtual memory), and the attention kernel is modified to look up pages via a block table rather than assuming contiguous memory.
Block size: 16 tokens per block (tunable). Each block holds K,V vectors for 16 positions.
Physical pool: One giant GPU tensor pre-allocated at startup. Carved into equal blocks. The free block queue tracks which blocks are available.
Block table: Per-request mapping: logical block index → physical block ID. The attention CUDA kernel uses this to reconstruct the right KV memory addresses.
ref_cnt: Reference count for prefix sharing - a block with ref_cnt > 1 is shared across multiple requests and cannot be freed until all readers drop it.
Allocate: When a request needs another 16 tokens of KV space, pop one block from free_block_queue. Increment ref_cnt to 1. Add to request's block table.
Free: When a request finishes (or is preempted), walk its block table. For each block, decrement ref_cnt. If ref_cnt = 0, push back to free_block_queue. Immediately reusable.
No copying: Memory is never moved. "Freeing" just returns a block index to the free list. The stale data in it doesn't matter - it'll be overwritten by the next tenant.
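A minimal sketch of those pool mechanics (simplified; the real KVCacheManager layers prefix caching and eviction on top):

```python
from collections import deque

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free_block_queue = deque(range(num_blocks))
        self.ref_cnt = [0] * num_blocks

    def allocate(self) -> int:
        block_id = self.free_block_queue.popleft()      # pop a free block
        self.ref_cnt[block_id] = 1
        return block_id

    def free(self, block_table: list[int]) -> None:
        for block_id in block_table:
            self.ref_cnt[block_id] -= 1
            if self.ref_cnt[block_id] == 0:             # no readers left
                self.free_block_queue.append(block_id)  # immediately reusable

pool = BlockPool(num_blocks=4)
table = [pool.allocate(), pool.allocate()]   # request grows to 32 tokens of KV
pool.free(table)                             # request finished: blocks return
```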
Continuous Batching
Static batching waits for the entire batch to finish before starting new requests. With wildly different output lengths, short requests hold their GPU slot idle while waiting for long ones.
Continuous batching fixes this by letting new requests join after any step. Because the scheduler runs before every single step, a WAITING request can slot in the moment a RUNNING request finishes and frees its blocks.
This only works because of paged attention: when A finishes, its KV blocks return immediately to free_block_queue, and C can begin allocation instantly. No memory copying needed - C's blocks are wherever the free list puts them.
Chunked Prefill
Chunked prefill is a technique for handling long prompts by splitting their prefill into smaller chunks. Without it, a single large prefill (e.g., a 10,000-token document) can monopolize an entire engine step, blocking all other requests for seconds and increasing their latency. Users mid-conversation experience a freeze.
The fix
Set a long_prefill_token_threshold (e.g., 512). During scheduling, if a prefill request would consume more tokens than this threshold, cap it:
```python
# Inside scheduler - prefill token budget accounting
if num_new_tokens > long_prefill_token_threshold:
    num_new_tokens = long_prefill_token_threshold  # cap it
# Remaining tokens spill to the next step automatically.
# KV block indexing handles resumption - the request picks up exactly where it left off.
```
Implementation is straightforward: cap the number of new tokens per step. If the requested number exceeds long_prefill_token_threshold, reset it to exactly that value. The underlying indexing logic (described earlier) takes care of the rest.
The KV cache already knows how many tokens have been processed (blocks allocated), so the next step continues from the right position with zero extra bookkeeping. Decode requests get tokens every step during the chunked prefill period so there are no stalls.
The total per-step token budget is controlled by the max_num_batched_tokens parameter.
Prefix Caching
Most production deployments prepend the same large block of tokens to every request: a 3,000-token system prompt, a RAG document, an instruction template. Without prefix caching, every request recomputes these KV vectors from scratch. Prefix caching eliminates this waste.
Content-addressed KV blocks
vLLM maintains a hash map: cached_block_hash_to_block. When a block is full (all 16 slots filled) and its tokens won't change, it becomes cacheable. Its hash is stored as a key pointing to the physical block. On a new request, find_longest_cache_hit() walks tokens in 16-token chunks and checks the map - every hit means zero recomputation for those 16 tokens.
Chained block hashing
The same 16 tokens ("the cat sat on") can appear at different positions and must produce different hashes - because KV vectors encode positional information. The hash is therefore chained:
```python
block_hash = hash((
    parent_hash,       # hash of the PREVIOUS block → encodes everything before this block
    tuple(token_ids),  # the exact 16 token IDs in this block
    extra_hashes,      # optional: lora_id, cache_salt for per-tenant isolation
))

# Block 0: parent_hash = 0 (root)
# Block 1: parent_hash = hash(block 0)
# Block 2: parent_hash = hash(block 1)
# → "the cat sat" at position 48 has a DIFFERENT hash than at position 0
```
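And a sketch of the lookup side (illustrative; the real find_longest_cache_hit also handles eviction and block pools, and vLLM uses a stronger hash than Python's built-in):

```python
def find_longest_cache_hit(token_ids: list[int], cached: dict,
                           block_size: int = 16) -> list[int]:
    """Walk full blocks left to right; stop at the first miss."""
    hits, parent_hash = [], 0
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block_tokens = tuple(token_ids[start:start + block_size])
        h = hash((parent_hash, block_tokens))
        if h not in cached:          # first miss ends the reusable prefix
            break
        hits.append(cached[h])       # physical block id; zero recomputation
        parent_hash = h
    return hits
```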
Speculative Decoding
During decode at small batch sizes, the GPU spends most of its time loading model weights, not computing. For a 70B model in bf16 the weights are roughly 140 GB, and every decode step must stream all of them through the memory system to produce one token per sequence, so per-token cost is dominated by memory bandwidth and accumulates quickly over a long generation.
The fix: use a cheap draft model to propose k tokens, then let the large model verify all k simultaneously in one pass. Since the verify pass costs roughly the same as a single-token pass (same weight load), every accepted draft token is effectively free.
How to accept/reject
For each draft token x, let p(x) = large model's probability and q(x) = draft model's probability:
| Case | Rule | Why |
|---|---|---|
| p(x) ≥ q(x) | Accept with probability 1 | Large model agrees or prefers this token |
| p(x) < q(x) | Accept with probability p(x)/q(x) | Draft was overconfident. Partial acceptance corrects for the gap |
| Rejected | Resample from max(0, p−q) / Σmax(0,p−q) | Residual distribution removes draft bias, restores correct marginals |
| All k accepted | Sample one bonus token from p(x_{k+1}) | Verify pass already computed this distribution |
The final sequence is drawn from exactly the same distribution as pure autoregressive sampling from the large model.
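The table translates directly into code. A sketch of the rule for a single draft token, where p and q are the large and draft models' probability vectors over the vocabulary (the draft only proposes tokens with q(x) > 0, so the division is safe):

```python
import numpy as np

def accept_or_resample(x: int, p: np.ndarray, q: np.ndarray, rng):
    """Return (token, accepted) for one draft token x."""
    if rng.random() < min(1.0, p[x] / q[x]):   # accept with probability min(1, p/q)
        return x, True
    residual = np.maximum(p - q, 0.0)          # remove draft bias
    residual /= residual.sum()                 # restore correct marginals
    return rng.choice(len(p), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                  # large model
q = np.array([0.2, 0.6, 0.2])                  # overconfident draft
print(accept_or_resample(1, p, q, rng))        # token 1 accepted w.p. 0.3/0.6 = 0.5
```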
Draft strategies in vLLM V1
- N-gram - looks back at the last prompt_lookup_max tokens. If that n-gram appeared earlier in context, it proposes the tokens that followed it. Zero extra model weight, zero extra GPU time. Acceptance: 30–50% for repetitive content (code, summaries).
- EAGLE - keeps the large model's embeddings and LM head, replacing the transformer stack with a 1–2 layer MLP fine-tuned to predict the large model's hidden states. Drafts by running this MLP over the large model's cached hidden states. No separate model file needed beyond a small adapter.
- Medusa - adds k extra linear "heads" on top of the large model's final hidden state, one per speculative position (+1, +2, ..., +k). All run in parallel in a single extra matmul. Baked into the main checkpoint. Slightly lower accuracy than EAGLE but no architecture changes.
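The n-gram strategy is simple enough to sketch in full (an illustrative standalone version, not vLLM's implementation; n and k play the roles of the prompt-lookup window and draft length):

```python
def ngram_propose(context: list[int], n: int = 3, k: int = 4) -> list[int]:
    """If the trailing n-gram appeared earlier, propose the tokens that followed it."""
    if len(context) <= n:
        return []
    tail = context[-n:]
    # Scan right to left so the most recent earlier occurrence wins
    for i in range(len(context) - n - 1, -1, -1):
        if context[i:i + n] == tail:
            return context[i + n:i + n + k]   # draft tokens: zero model cost
    return []

# Context "1 2 3 4 1 2 3": tail [1, 2, 3] matched at position 0,
# so the tokens that followed it are proposed as the draft.
print(ngram_propose([1, 2, 3, 4, 1, 2, 3]))   # [4, 1, 2, 3]
```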
Conclusion
These components aren't independent modules bolted together. They form a tight dependency chain where each layer enables the next:
- Paged Attention is the enabler. By making KV blocks independently allocatable, it enables continuous batching, prefix caching, and dynamic preemption - all of which require allocating and freeing KV memory for individual requests at any time.
- Continuous Batching is what makes paged attention's dynamic allocation useful at serving scale. Without it, blocks would still be under-utilized because requests sit idle in a static batch.
- The Scheduler orchestrates both - every step, deciding which requests get blocks, who runs decode, who starts prefill, and who gets preempted. The slot_mapping and block tables it produces flow directly into the forward pass.
- Chunked Prefill makes the scheduler fair across prompt lengths. Without it, a single 50k-token prompt would hold all decode responses hostage for seconds.
- Prefix Caching exploits the paged block structure - blocks are exactly 16 tokens, which is also the granularity for content-addressable caching. The chained hash ensures positional uniqueness.
- Guided Decoding operates orthogonally at the logit level. FSMs compile to bitmasks, applied before sampling. Compatible with all other techniques - adds a small masking overhead per decode step.
- Speculative Decoding attacks the memory-bandwidth bottleneck that remains even after all the above. At small batch sizes, drafting + verifying in bulk extracts more tokens per weight load.
The source at commit 42172ad is well-commented - start with vllm/engine/ and follow the LLMEngine → EngineCore → Scheduler chain.