Deep Dive into vLLM Architecture
Inside vLLM
A Complete Technical Reference
Engine construction, KV cache mechanics, the scheduler loop, slot mapping, prefix caching internals — from source to silicon.
Foundation: How LLMs Generate Text
Everything in vLLM is shaped by two facts about autoregressive language models. Before touching any internals, these must be clear.
Fact 1: One token at a time, sequentially
LLMs generate tokens one by one. Given a sequence of tokens [t₁, t₂, ..., tₙ], the model runs a full forward pass to produce a probability distribution over the vocabulary, then samples token tₙ₊₁. Then runs again with [t₁,...,tₙ,tₙ₊₁] to get tₙ₊₂. And so on. This is autoregressive generation — each token depends on all prior tokens.
Fact 2: The KV Cache
In transformer attention, every token produces a Key (K) and Value (V) vector. During decode, the new token must attend to all past tokens' K and V. Rather than recomputing past K,V from scratch each step, they're stored — this is the KV cache. It grows by one token's worth of K,V per layer at every decode step; vLLM allocates that storage in fixed blocks of 16 tokens.
The KV cache for a single request at a single layer in bf16 costs: 2 × seq_len × num_kv_heads × head_dim × 2 bytes. For Llama-3-8B (8 KV heads, head_dim=128) that's 4,096 bytes per token per layer, times 32 layers = 128 KB per token. A 2,048-token response stores 256 MB of KV data. This is why memory management is the central problem.
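A quick sanity check of that arithmetic — a minimal sketch assuming the Llama-3-8B shapes quoted above:

```python
# KV cache cost per token, per layer, in bf16 (2 bytes per element).
num_kv_heads, head_dim, num_layers, bytes_per_elem = 8, 128, 32, 2

per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem  # K and V → 4,096 bytes
per_token = per_token_per_layer * num_layers                        # 131,072 bytes = 128 KB
response_2048 = per_token * 2048                                    # 268,435,456 bytes = 256 MB

print(per_token_per_layer, per_token // 1024, response_2048 // 2**20)  # 4096 128 256
```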
Engine Construction — What Happens at Startup
vLLM has two entry points: LLM for offline synchronous use (scripts, benchmarks), and AsyncLLMEngine for online serving with HTTP. Both wrap the same LLMEngine / EngineCore.
```python
# Offline usage — basic.py pattern
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(prompts, sampling_params)

# Configuration: offline, synchronous, single GPU, standard transformer
```
When you call LLM(model=...), a chain of constructors runs before any token is ever processed. The result is a fully warmed-up, CUDA-graph-captured, KV-cache-allocated inference system.
The 6-step engine construction sequence
The Processor — _process_inputs()
The Processor component sits between raw user input and the engine. It validates, tokenizes, and packs inputs into an EngineCoreRequest. Here's the actual signature from llm.py:
```python
# llm.py
def _process_inputs(
    self,
    request_id: str,
    engine_prompt: PromptType,
    params: SamplingParams | PoolingParams,
    *,
    lora_request: LoRARequest | None,
    priority: int,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> tuple[EngineCoreRequest, dict[str, Any]]:
    ...
```
Before the GPU ever sees anything, this function:
1. Validates that inputs are not empty, sizes are within model limits, and all parameters are legal values
2. Tokenizes raw text into token IDs via the model's tokenizer, and returns a dict with `prompt`, `prompt_token_ids`, and `type` (text / tokens / embeds)
3. Attaches sampling parameters (`temperature`, `top_p`, `top_k`, `max_tokens`, stop conditions)
4. Normalizes tensor shapes and LoRA request info, then packs everything into an `EngineCoreRequest`
InProcClient — the engine abstraction
EngineCore is the component that schedules, manages KV cache, runs the model, and generates tokens. But instead of calling EngineCore directly, vLLM uses a client interface. The question this interface answers is: "How do I send requests to the engine and step it forward?"
InProcClient is the simplest implementation — it communicates with the engine in-process (same Python interpreter, same thread). More advanced clients communicate across processes or network sockets, enabling the same API to work for multi-GPU distributed serving. The abstraction lets vLLM change where and how the engine runs without changing any call sites.
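To make the abstraction concrete, here is a minimal sketch of what such a client interface could look like — class and method names here are illustrative, not vLLM's exact definitions:

```python
from abc import ABC, abstractmethod

class EngineClient(ABC):
    """How the front end talks to EngineCore, wherever the engine lives."""

    @abstractmethod
    def add_request(self, request) -> None:
        """Hand an EngineCoreRequest to the engine."""

    @abstractmethod
    def step(self) -> list:
        """Advance the engine by one scheduling + forward-pass iteration."""

class InProcClient(EngineClient):
    """Simplest case: the engine runs in the same process and thread."""

    def __init__(self, engine_core):
        self.engine_core = engine_core

    def add_request(self, request) -> None:
        self.engine_core.add_request(request)   # plain method call, no IPC

    def step(self) -> list:
        return self.engine_core.step()          # a cross-process client would
                                                # serialize and ship this instead
```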
After init_kv_cache() allocates the KV pool, CUDA graphs are captured for common batch sizes (typically powers of 2) and later replayed as a single GPU command — no per-kernel Python launch overhead, dramatically lower latency. If you pass --enforce-eager, capture is skipped and all forward passes run in eager mode.
The generate() Loop — Request Lifecycle
When you call llm.generate(prompts, sampling_params), two phases happen: feeding requests into the engine, then stepping until all are done.
Phase A: Feeding requests
For each prompt in the list:
1. Create a unique `request_id` (UUID) and capture `arrival_time` (for priority scheduling)
2. Call `_process_inputs()` → tokenize, validate, return `{prompt, prompt_token_ids, type}`
3. Pack into an `EngineCoreRequest` with priority, sampling params, and all metadata
4. Pass to the engine core → it wraps the request in a `Request` object, sets status to `WAITING`, and adds it to the scheduler's waiting queue: `append()` if FCFS, `heap_push()` if priority-based
Phase B: The step() loop
Once the engine has requests, it calls step() in a tight loop until all requests finish. Each step() is exactly three stages:
```python
# Conceptual view of the step() loop
while not finished:
    scheduler_output = scheduler.schedule()                           # ① who runs, which KV blocks
    model_output = model_executor.execute_model(scheduler_output)     # ② GPU forward pass
    request_outputs = output_processor.process_outputs(model_output)  # ③ tokens → text
```
- ① Schedule — Prioritize decode requests (RUNNING queue), then pull from WAITING. Allocate KV blocks via `allocate_slots()`. Handle preemption if memory is tight. Return a batch plan.
- ② Forward pass — `ModelExecutor` → `Worker` → `ModelRunner`. Prepare inputs (copy CPU→GPU), run the forward pass (paged attention reads/writes KV blocks), gather last-token hidden states, sample next tokens.
- ③ Post-process — Append sampled token IDs to each request's sequence. Detokenize. Check stop conditions (EOS, max_tokens, stop strings). Free KV blocks for finished requests.
The Forward Pass — Inside execute_model()
The call chain is: Engine → ModelExecutor.execute_model → Worker → ModelRunner. Inside ModelRunner, a forward pass has five distinct sub-steps.
What is slot_mapping?
This is the critical piece that glues the flat batch to the paged KV cache. The slot mapping tells the attention kernel: "Token X in the flattened super-sequence corresponds to physical KV slot Y in GPU memory."
Physical slot = block_id × block_size + position_within_block. If a request has blocks [3, 7] (block size 16), its tokens map to physical slots [48, 49, 50, ... 63, 112, 113, ...]. The slot_mapping is a flat array — one entry per token — that encodes this mapping so the CUDA attention kernel can look up the right memory address without any branching.
reshape_and_cache_flash is the CUDA function responsible for writing newly computed KV vectors into the paged memory after the forward pass, using this same slot_mapping.
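A small sketch of how a slot mapping can be derived from a request's block table (block_size=16 to match the example above; illustrative, not the vLLM implementation):

```python
def build_slot_mapping(block_table: list[int], num_tokens: int, block_size: int = 16) -> list[int]:
    """Map each token position to its physical KV slot: block_id * block_size + offset."""
    slots = []
    for pos in range(num_tokens):
        block_id = block_table[pos // block_size]   # which physical block holds this position
        offset = pos % block_size                   # where inside that block
        slots.append(block_id * block_size + offset)
    return slots

# Blocks [3, 7] → slots 48..63 for the first 16 tokens, then 112, 113, ...
print(build_slot_mapping([3, 7], num_tokens=20))
```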
Worked Example: Three Prompts, Prefill → Decode
Let's trace exactly what vLLM does with three prompts from start to finish, using block_size=4 to keep the numbers small.
prompts = [ "Hi, my name is", # after tokenization: [1, 2, 3, 4, 5] — 5 tokens "Today is a beautiful summer day", # → [1, 6, 5, 7, 8, 9, 10] — 7 tokens "Hello there", # → [1, 12, 13] — 3 tokens ] # (Note: "is" == 5, BOS token == 1, block_size = 4)
Step 1: allocate_slots() — assigning KV blocks
Before the first forward pass, the scheduler allocates KV blocks. With block_size=4: sequence 1 (5 tokens) needs ⌈5/4⌉ = 2 blocks → blocks 1 and 2; sequence 2 (7 tokens) needs 2 blocks → blocks 3 and 4; sequence 3 (3 tokens) needs 1 block → block 5.
Step 2: Continuous batching flattens sequences
The scheduler converts the three separate sequences into a single concatenated "super-sequence" for the forward pass:
```python
# After continuous batching
input_ids = [1,2,3,4,5,  1,6,5,7,8,9,10,  1,12,13]  # 15 tokens total
positions = [0,1,2,3,4,  0,1,2,3,4,5,6,   0,1,2]    # per-sequence positions

# Attention metadata
query_start_loc   = [0, 5, 12, 15]  # where each seq starts in the flattened batch
seq_lens          = [5, 7, 3]       # lengths for attention masking
num_actual_tokens = 15              # total tokens this step
```
Step 3: slot_mapping — linking tokens to physical KV slots
For every token in the flattened batch, slot_mapping records exactly which physical slot in GPU memory to write its K,V vectors:
```python
# Physical slot = block_id × block_size + offset_within_block
# Seq 1 uses blocks 1, 2 → physical slots 4..7, 8..11   (block_size=4)
# Seq 2 uses blocks 3, 4 → physical slots 12..15, 16..19
# Seq 3 uses block 5     → physical slots 20..23

slot_mapping = [4,5,6,7,8,  12,13,14,15,16,17,18,  20,21,22]
#               ↑ seq1      ↑ seq2                  ↑ seq3

# The 2nd sequence (7 tokens) maps to slots 12–18 because:
#   block 3 starts at slot 12 (3×4=12) → slots 12,13,14,15
#   block 4 starts at slot 16 (4×4=16) → slots 16,17,18 (3 tokens, 1 slot unused)
```
Step 4: After prefill — decode step
Say we sample tokens [14, 15, 16] across the three sequences. Next step:
```python
# Decode step — only the NEW tokens are passed as input
input_ids    = [14, 15, 16]  # one new token per sequence
positions    = [5, 7, 3]     # next position in each sequence
slot_mapping = [9, 19, 23]   # where to write the new KVs
#                ↑ seq1: position 5 → block 2, offset 1 → 2×4+1 = 9
#                    ↑ seq2: position 7 → block 4, offset 3 → 4×4+3 = 19
#                        ↑ seq3: position 3 → block 5, offset 3 → 5×4+3 = 23

# Attention metadata for decode
query_start_loc   = [0, 1, 2, 3]  # 1 query token per sequence
seq_lens          = [6, 8, 4]     # total context length per seq (for KV access)
num_actual_tokens = 3             # 3 decode tokens total

# KV cache reused from prefill — no recomputation of past tokens.
# The attention kernel reads KVs for all previous tokens via the block table.
```
During prefill, num_actual_tokens = 15 (all prompt tokens), while seq_lens = [5, 7, 3] tracks each context length for masking. During decode, num_actual_tokens = 3 (one new token per sequence), but seq_lens = [6, 8, 4] — the full extended lengths including cached history. The attention kernel needs the full lengths to know how many past KV vectors to load from paged memory.
Paged Attention — Memory Architecture
The core memory problem: naive implementations pre-allocate a contiguous block of KV cache at max_tokens size per request. Most requests are shorter, leaving the tail wasted. With 40–60% internal waste per request, you can only fit a fraction of the possible concurrent requests on your GPU.
Paged attention borrows virtual memory from OS design: physical memory is non-contiguous; a block table provides the mapping. The KV cache is divided into fixed blocks of 16 tokens. Blocks are allocated on demand from a free_block_queue and returned when the request finishes.
Block metadata fields
Each block in the pool carries metadata:
| Field | Purpose |
|---|---|
| block_id | Unique identifier (index into the KV pool) |
| ref_cnt | Reference count. >1 means the block is shared (copy-on-write for beam search, or prefix cache hit) |
| block_hash | Content hash for prefix caching. Set when the block is full and immutable. |
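As a rough sketch, the block metadata and free list can be pictured like this — field names follow the table, but the pool size, helper functions, and dataclass are illustrative, not vLLM's exact code:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class KVCacheBlock:
    block_id: int                  # index into the physical KV pool
    ref_cnt: int = 0               # >1 → shared (copy-on-write / prefix cache hit)
    block_hash: int | None = None  # set once the block is full and immutable

num_gpu_blocks = 8  # in reality: derived from free GPU memory at startup
free_block_queue = deque(KVCacheBlock(block_id=i) for i in range(num_gpu_blocks))

def allocate_block() -> KVCacheBlock:
    block = free_block_queue.popleft()  # pop from the front of the free list
    block.ref_cnt = 1
    return block

def free_block(block: KVCacheBlock) -> None:
    block.ref_cnt -= 1
    if block.ref_cnt == 0:
        free_block_queue.append(block)  # return to the back of the free list
```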
The Scheduler — Decision Logic
The scheduler runs at the start of every step(). It manages two queues and one manager:
- WAITING queue — requests that haven't started yet. Status: `WAITING`. Ordered by FCFS (append) or priority (heap). These are prefill candidates. Prefill = process the entire prompt once; compute-bound. After prefill, one token is sampled and the request moves to RUNNING.
- RUNNING queue — requests actively generating tokens. Status: `RUNNING`. Each decode step processes only the most recent token and reads all past KV from cache. Decode = memory-bandwidth-bound; one token per step per request; always prioritized.
Scheduler decision algorithm
1. Process RUNNING first — these are already generating for live users. For each, call `allocate_slots(req_id, num_new_tokens=1)`. If KV memory is tight, preempt the lowest-priority RUNNING request (free its blocks, push it back to WAITING).
2. Pull from WAITING — for each waiting request (in priority order), check whether enough free blocks exist for its prefill. If yes, move it to RUNNING and call `allocate_slots(req_id, num_prompt_tokens)`. If there's no room, skip it (try again next step).
3. Return SchedulerOutput — the batch plan: which requests run, their token counts, block tables, and whether each is prefill or decode. The model executor uses this to build inputs.
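A condensed, illustrative sketch of that decision logic — the queue and manager names follow this section, but the real scheduler also tracks token budgets, chunking, and more:

```python
from dataclasses import dataclass, field

@dataclass
class SchedulerOutput:
    # (request, num_tokens_this_step, "prefill" | "decode") triples
    scheduled: list = field(default_factory=list)

def schedule(running: list, waiting: list, kv_manager) -> SchedulerOutput:
    out = SchedulerOutput()

    # 1) RUNNING first: each decode request needs room for exactly one new token.
    for req in list(running):
        if not kv_manager.allocate_slots(req.request_id, num_new_tokens=1):
            victim = running.pop()              # preempt the lowest-priority request
            kv_manager.free(victim.request_id)  # return its blocks to the free list
            waiting.insert(0, victim)           # it will redo its prefill later
            continue
        out.scheduled.append((req, 1, "decode"))

    # 2) Then admit WAITING requests whose entire prefill fits in free blocks.
    for req in list(waiting):
        if kv_manager.allocate_slots(req.request_id, len(req.prompt_token_ids)):
            waiting.remove(req)
            running.append(req)
            out.scheduled.append((req, len(req.prompt_token_ids), "prefill"))

    return out  # the batch plan the model executor consumes
```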
Policy modes
| Mode | Queue type | Ordering |
|---|---|---|
| FCFS | Deque (append/pop) | First in, first served by arrival_time |
| Priority | Heap (heap_push/heappop) | User-provided priority value; lower = higher priority |
Continuous Batching
Static batching waits until the entire batch finishes before starting new requests. With wildly different output lengths (3 tokens vs 300 tokens), the short requests hold their GPU slot idle while waiting for long ones to finish.
Continuous batching fixes this by letting new requests join after any step. Because the scheduler runs before every step, it can immediately slot in a WAITING request the moment a RUNNING request finishes and frees its blocks.
Crucially, this only works because of paged attention: when one request finishes, its KV blocks return to free_block_queue, and a newly admitted request can immediately be allocated blocks. There's no need to shift any memory around — the new request's blocks live wherever the free list puts them.
In the synchronous engine (basic.py), only the initial prompts can be batched — you can't inject new ones mid-run. In the async engine (serving mode), every step the scheduler checks for newly arrived requests and folds them into the next batch.
Chunked Prefill
A single large prefill (e.g., 10,000-token document) can monopolize an entire engine step — blocking all decode requests for seconds while the long prompt is processed. Users mid-conversation experience a freeze.
The fix (surprisingly simple)
Set a long_prefill_token_threshold (e.g., 512). During scheduling, if a prefill request would consume more than this many tokens in one step, reset its token count to the threshold:
```python
# Inside the scheduler, during prefill token budget accounting
if num_new_tokens > long_prefill_token_threshold:
    num_new_tokens = long_prefill_token_threshold  # cap it

# The remaining tokens spill to the next step automatically.
# KV block indexing handles resumption correctly — the request picks up where it left off.
```
The KV cache already knows how many tokens have been processed (blocks allocated), so the next step continues from the right position with no special bookkeeping. Decode requests get tokens every step during the chunked prefill period — no stalls.
The overall per-step token budget that chunked prefills and decodes share is controlled by the max_num_batched_tokens parameter.
Prefix Caching — Internals
Most production deployments prepend the same large block of tokens to every request: a 3,000-token system prompt, a RAG document, an instruction template. Without prefix caching, every request recomputes the same KV vectors from scratch. Prefix caching avoids this.
The core idea: content-addressed KV blocks
vLLM maintains a hash map: cached_block_hash_to_block. When a block is full (all 16 slots filled) and its tokens won't change, it becomes cacheable. Its hash is computed and stored as a key pointing to the physical block.
On a new request, find_longest_cache_hit() walks the new request's tokens in 16-token chunks, hashing each block and checking the map. Every hit means zero recomputation for those 16 tokens.
How the block hash is computed
This is where it gets subtle. The same 16 tokens ("the cat sat on") can appear at different positions in different sequences and must produce different hashes — because their KV vectors depend on their position in the sequence. The hash is therefore chained:
```python
block_hash = hash((
    parent_hash,       # hash of the PREVIOUS block → encodes everything before this block
    tuple(token_ids),  # the exact 16 token IDs in this block
    extra_hashes,      # optional: lora_id, cache_salt for per-tenant isolation
))

# Block 0: parent_hash = None (or 0 for the root)
# Block 1: parent_hash = hash of block 0
# Block 2: parent_hash = hash of block 1
# This chain means "the cat sat" at position 48 has a different hash than at position 0
```
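A toy sketch of the chained hashing plus the longest-prefix lookup (block_size=16; `cached_block_hash_to_block` is a plain dict here, and the extra hashes are omitted — not the real vLLM code):

```python
BLOCK_SIZE = 16

def find_longest_cache_hit(token_ids: list[int], cached_block_hash_to_block: dict) -> list:
    hits, parent_hash = [], None
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = tuple(token_ids[start:start + BLOCK_SIZE])  # only full blocks are cacheable
        block_hash = hash((parent_hash, chunk))             # chained → position-aware
        block = cached_block_hash_to_block.get(block_hash)
        if block is None:
            break                                           # first miss ends the prefix hit
        hits.append(block)
        parent_hash = block_hash
    return hits  # KV for these blocks needs no recomputation
```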
Lazy eviction — why cached blocks survive after requests finish
When a request finishes, its blocks are returned to free_block_queue — but the hash map entries are not immediately cleared. This means:
1. The block returns to the back of `free_block_queue` (available for reallocation), but its entry in `cached_block_hash_to_block` persists.
2. A new request with the same prefix → `find_longest_cache_hit()` finds the block in the map → removes it from `free_block_queue` → reuses it directly. Zero recomputation.
3. If another request allocates that block first (pops it from `free_block_queue`), the eviction code detects that it has a hash entry, clears it from the map, and hands over a fresh block. The cache miss is handled gracefully.
This gives LRU-like behavior for free: popular system prompts stay cached indefinitely because they're reclaimed before eviction. Rare prefixes get evicted naturally as the free list shrinks.
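A minimal sketch of the reclamation path in step 3, reusing the block shape from the metadata table above (illustrative only):

```python
def pop_block_for_new_request(free_block_queue, cached_block_hash_to_block):
    block = free_block_queue.popleft()
    if block.block_hash is not None:
        # The block still backs a cached prefix: evict it from the hash map now,
        # only because someone actually needs the memory (lazy eviction).
        cached_block_hash_to_block.pop(block.block_hash, None)
        block.block_hash = None
    block.ref_cnt = 1
    return block
```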
Guided Decoding — Making the Model Follow Rules
Language models are probabilistic. Even a perfectly prompted model can hallucinate field names, produce malformed JSON, or break a required format. In production systems where AI output feeds directly into databases or APIs, this is unacceptable.
Guided decoding makes invalid outputs literally impossible — not just unlikely — by modifying the logits before sampling.
Finite State Machines (FSMs)
A Finite State Machine is a directed graph: states connected by transitions. Each transition is triggered by an input (a token ID). At any state, only certain tokens are valid. vLLM compiles your desired output format (JSON schema, regex, choice list, or full context-free grammar) into an FSM using xgrammar.
Implementation inside vLLM
1. Request arrives with `GuidedDecodingParams(json_schema=...)`. Status is set to `WAITING_FOR_FSM`. xgrammar begins compiling asynchronously.
2. Scheduler: if the FSM compile is done, transition to `WAITING`. If not, skip this step. No blocking.
3. `StructuredOutputManager.prepare_bitmask()` is called for all FSM requests in the batch. It produces a `_grammar_bitmask` tensor (vocab_size / 32 integers per request).
4. After the forward pass: a Triton kernel expands the bitmask and sets forbidden logits to `-inf`. Normal sampling proceeds on the constrained distribution.
5. `accept_tokens()` advances the FSM state based on the sampled token. The next step's bitmask is ready.
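A toy, PyTorch-only illustration of the bitmask-to-logits step — the real path uses a Triton kernel, and the sizes and names here are made up:

```python
import torch

vocab_size = 64
logits = torch.randn(vocab_size)

# Packed bitmask: vocab_size / 32 int32 words; here only tokens 0–3 are allowed.
packed = torch.zeros(vocab_size // 32, dtype=torch.int32)
packed[0] = 0b1111

# Expand bits → boolean "allowed" mask over the vocabulary.
bit_positions = torch.arange(vocab_size)
allowed = (packed[bit_positions // 32] >> (bit_positions % 32)) & 1

# Forbidden tokens get -inf, so sampling cannot pick them.
logits = torch.where(allowed.bool(), logits, torch.tensor(float("-inf")))
probs = torch.softmax(logits, dim=-1)
```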
Speculative Decoding — Multiple Tokens Per Weight Load
The decode bottleneck is clear: at small batch sizes, the GPU spends most of its time loading model weights from HBM, not computing. For a 70B model in bf16, every token costs streaming 140 GB of weights — ~42ms at H100 bandwidth, mostly idle compute.
The insight: if you use a cheap draft model to propose k tokens, the large model can verify all k simultaneously in one pass. Since the verify pass costs roughly the same as a single token (same weight load), if any draft tokens are accepted, you've done better than baseline.
The accept/reject rule — why it's mathematically correct
For each draft token x at position i, let p(x) = large model's probability and q(x) = draft model's probability:
| Case | Rule | Why |
|---|---|---|
| p(x) ≥ q(x) | Accept with probability 1 | Large model agrees or prefers this token — no correction needed |
| p(x) < q(x) | Accept with probability p(x)/q(x) | Draft was overconfident — partial acceptance corrects for the gap |
| Rejected | Resample from max(0, p−q) / Σmax(0,p−q) | The residual distribution removes the draft bias, restores correct marginals |
| All k accepted | Sample one bonus token from p(x_{k+1}) | The verify pass already computed this distribution — it's "free" |
The result is that the final sequence is drawn from exactly the same distribution as pure autoregressive sampling from the large model. Speculative decoding is not an approximation — it's a lossless acceleration.
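A minimal NumPy sketch of that rule for a single draft token, following the table above (vLLM's implementation is batched and fused on GPU, but the math is the same):

```python
import numpy as np

def verify_draft_token(x: int, p: np.ndarray, q: np.ndarray, rng=None):
    """p = target-model probs, q = draft-model probs, x = drafted token id."""
    rng = rng or np.random.default_rng()
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True                            # accepted: keep the draft token
    residual = np.maximum(p - q, 0.0)             # strip out the draft's bias
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False  # rejected: resample correctly
```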
Draft strategies in vLLM V1
1. N-gram — Looks back at the last `prompt_lookup_max` tokens in the existing sequence. If that n-gram has appeared earlier in the context, proposes the tokens that followed it. Zero extra model weights, zero extra GPU time for drafting. Acceptance: 30–50% for repetitive content (code, summaries); low for creative text.
2. EAGLE — "Model surgery": keep the large model's embeddings and LM head, replace the full transformer stack with a 1–2 layer MLP fine-tuned to predict the large model's hidden states. Drafts by running this lightweight MLP over the large model's cached hidden states. No separate model file needed beyond a small adapter. Acceptance: 60–80%+ for most tasks.
3. Medusa — Adds `k` extra linear "heads" on top of the large model's final hidden state, one per speculative position (+1, +2, ..., +k). All run in parallel in a single extra matmul. Part of the main checkpoint. Slightly lower accuracy than EAGLE but no architecture surgery. Works well for structured, repetitive tasks.
System View — How It All Connects
These components aren't independent modules bolted together. They form a tight dependency chain:
1. Paged Attention is the enabler. By making KV blocks independently allocatable, it enables continuous batching, prefix caching, and dynamic preemption — all of which require the ability to allocate and free KV memory for individual requests at any time.
2. Continuous Batching is what makes paged attention's dynamic allocation useful at serving scale. Without it, blocks would still be under-utilized because requests sit idle in a static batch.
3. The Scheduler orchestrates both — every step, it decides which requests get blocks, who runs decode, who starts prefill, and who gets preempted. The slot_mapping and block tables it produces flow directly into the forward pass.
4. Chunked Prefill makes the scheduler fair across prompt lengths. Without it, a single 50k-token prompt would hold all decode responses hostage for seconds.
5. Prefix Caching exploits the paged block structure — blocks are already exactly 16 tokens, which is also exactly the granularity for content-addressable caching. The chained hash ensures positional uniqueness for free.
6. Guided Decoding operates orthogonally at the logit level. It compiles format specs to FSMs, produces bitmasks per step, and applies them before sampling. Compatible with all other techniques — adds a small masking overhead per decode step.
7. Speculative Decoding attacks the memory-bandwidth bottleneck that remains even after all the above. At small batch sizes where the GPU is still waiting on HBM loads, drafting + verifying in bulk extracts more tokens per weight load.
The source tree at commit 42172ad is also well-commented — start with vllm/engine/ and follow the LLMEngine → EngineCore → Scheduler chain.