Inside vLLM
A Complete Technical Reference

21-2-2026

Engine construction, KV cache mechanics, the scheduler loop, slot mapping, prefix caching internals — from source to silicon.

By Sai Saketh Cherukuri

Foundation: How LLMs Generate Text

Everything in vLLM is shaped by two facts about autoregressive language models. Before touching any internals, these must be clear.

Fact 1: One token at a time, sequentially

LLMs generate tokens one by one. Given a sequence of tokens [t₁, t₂, ..., tₙ], the model runs a full forward pass to produce a probability distribution over the vocabulary, then samples token tₙ₊₁. Then runs again with [t₁,...,tₙ,tₙ₊₁] to get tₙ₊₂. And so on. This is autoregressive generation — each token depends on all prior tokens.
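The loop itself is short. A minimal sketch, assuming greedy sampling and a model callable that maps a [1, seq_len] tensor of token IDs to [1, seq_len, vocab] logits:

# Minimal autoregressive generation loop (illustrative sketch, not vLLM code).
import torch

def generate(model, prompt_ids: list[int], max_new_tokens: int, eos_id: int) -> list[int]:
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([tokens]))[0, -1]   # full forward pass, keep last position
        next_id = int(torch.argmax(logits))             # greedy "sampling" for brevity
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens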

Fig 1 — Prefill phase vs decode phase: two different bottlenecks.
Prefill (compute-bound): all prompt tokens ("The", "cap-", "-ital", "of", "France") go through one parallel forward pass — K,V are computed and written to paged KV blocks for all 5 tokens at once (K₁V₁ … K₅V₅), attention is O(n²), and a single token ("Paris") is sampled.
Decode (memory-bound): one token per step, and every step loads ALL model weights while reading every past K,V vector from the paged blocks (O(n) memory access). For a 70B model that means streaming ~140 GB of weights per token; at H100 memory bandwidth of 3.35 TB/s that is ~42 ms/token at batch size 1, with the compute units mostly idle.

Fact 2: The KV Cache

In transformer attention, every token produces a Key (K) and Value (V) vector. During decode, the new token must attend to all past tokens' K and V. Rather than recomputing past K,V from scratch each step, they're stored — this is the KV cache. Its contents grow by one token per decode step, and vLLM allocates the backing memory in fixed blocks of 16 tokens as generation proceeds.

The KV cache for a single request at a single layer in bf16 costs: 2 × seq_len × num_kv_heads × head_dim × 2 bytes. For Llama-3-8B (8 KV heads, head_dim=128) that's 4,096 bytes per token per layer, times 32 layers = 128 KB per token. A 2,048-token response stores 256 MB of KV data. This is why memory management is the central problem.
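The same arithmetic, written out (a back-of-the-envelope sketch using Llama-3-8B's head counts, not a measurement):

# KV cache size per token in bf16: 2 (K and V) × num_kv_heads × head_dim × 2 bytes
num_kv_heads, head_dim, num_layers, dtype_bytes = 8, 128, 32, 2    # Llama-3-8B, bf16
per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes    # 4,096 bytes
per_token = per_token_per_layer * num_layers                       # 131,072 bytes = 128 KB
print(per_token * 2048 / 2**20, "MB for a 2,048-token sequence")   # → 256.0 MB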

Engine Construction — What Happens at Startup

vLLM has two entry points: LLM for offline synchronous use (scripts, benchmarks), and AsyncLLMEngine for online serving with HTTP. Both wrap the same LLMEngine / EngineCore.

# Offline usage — basic.py pattern
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(prompts, sampling_params)
# Configuration: offline, synchronous, single GPU, standard transformer

When you call LLM(model=...), a chain of constructors runs before any token is ever processed. The result is a fully warmed-up, CUDA-graph-captured, KV-cache-allocated inference system.

The engine construction sequence

Fig 2 — LLM engine constructor: the steps that run before the first token.

Step 1 — init_device(): choose cuda:0 (or the assigned GPU); check bf16 support and available VRAM; reserve e.g. 80% of it if gpu_memory_utilization=0.8.
Step 2 — Create ModelRunner: holds the model, sampler, attention metadata, and GPU buffers (input_ids, positions, …) — everything needed for one forward pass. Also creates the CPU-side InputBatch: block tables, sampling params, request metadata.
Step 3 — load_model(): build the PyTorch model architecture from config, load weights from disk (sharded or full), call model.eval(), optionally torch.compile().
Step 4 — init_kv_cache() (critical): inspect the attention layers to determine the KV layout (num_heads, head_dim); run a dummy forward pass to measure the GPU memory actually used by weights; calculate how many KV blocks fit in the remaining VRAM; allocate one giant contiguous KV pool and bind slices of it to the attention layers. The model is now "KV-cache aware". CUDA graphs are also captured at this point.
Step 5 (optional) — CUDA graph capture: record entire forward passes for common batch sizes, to be replayed later without Python overhead.

Component hierarchy: LLM (offline entry point) → Processor (text → EngineCoreRequest) → InProcClient (interface to EngineCore) → EngineCore (schedules, runs the model, manages KV). EngineCore contains the Scheduler (FCFS/priority policy, waiting and running queues, the KV Cache Manager, the StructuredOutputManager) and the Model Executor (UniProc/MultiProc Worker per GPU → ModelRunner → Sampler → CPU-side InputBatch). The Output Processor turns EngineCoreOutput into detokenized, user-visible output objects.

The Processor — _process_inputs()

The Processor component sits between raw user input and the engine. It validates, tokenizes, and packs inputs into an EngineCoreRequest. Here's the actual signature from llm.py:

def _process_inputs(
    self,
    request_id: str,
    engine_prompt: PromptType,
    params: SamplingParams | PoolingParams,
    *,
    lora_request: LoRARequest | None,
    priority: int,
    tokenization_kwargs: dict[str, Any] | None = None,
) -> tuple[EngineCoreRequest, dict[str, Any]]:

Before the GPU ever sees anything, this function:

  1. Validates that inputs are not empty, sizes are within model limits, and all parameters are legal values.
  2. Tokenizes raw text into token IDs via the model's tokenizer, returning a dict with prompt, prompt_token_ids, and type (text / tokens / embeds).
  3. Attaches sampling parameters (temperature, top_p, top_k, max_tokens, stop conditions).
  4. Normalizes tensor shapes and LoRA request info, then packs everything into an EngineCoreRequest.

InProcClient — the engine abstraction

EngineCore is the component that schedules, manages KV cache, runs the model, and generates tokens. But instead of calling EngineCore directly, vLLM uses a client interface. The question this interface answers is: "How do I send requests to the engine and step it forward?"

InProcClient is the simplest implementation — it communicates with the engine in-process (same Python interpreter, same thread). More advanced clients communicate across processes or network sockets, enabling the same API to work for multi-GPU distributed serving. The abstraction lets vLLM change where and how the engine runs without changing any call sites.

Two execution modes
After initialization, every forward pass runs in one of two modes. Eager mode: standard PyTorch — flexible, but each kernel launch has Python overhead. Captured mode: CUDA graphs are recorded during init_kv_cache() for common batch sizes (typically powers of 2), then replayed as a single GPU command — no Python per kernel call, dramatically lower latency. If you pass --enforce-eager, capture is skipped and all passes use eager mode.
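For intuition, the capture/replay mechanism in raw PyTorch looks roughly like this — a standalone sketch, not vLLM's capture code:

import torch

model = torch.nn.Linear(4096, 4096).cuda().half()
static_in = torch.zeros(8, 4096, device="cuda", dtype=torch.half)   # fixed batch shape

# Warm up on a side stream (required before capture), then record one forward pass.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = model(static_in)

# Replay: copy new data into the static input buffer and launch the whole
# recorded kernel sequence as one unit — no per-kernel Python overhead.
static_in.copy_(torch.randn_like(static_in))
graph.replay()
result = static_out.clone()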

The generate() Loop — Request Lifecycle

When you call llm.generate(prompts, sampling_params), two phases happen: feeding requests into the engine, then stepping until all are done.

Phase A: Feeding requests

For each prompt in the list:

  1. Create a unique request_id (UUID) and capture arrival_time (used for priority scheduling).
  2. Call _process_inputs() → tokenize, validate, return {prompt, prompt_token_ids, type}.
  3. Pack into an EngineCoreRequest with priority, sampling params, and all metadata.
  4. Pass to the engine core → it wraps the request in a Request object, sets status to WAITING, and adds it to the scheduler's waiting queue: append() if FCFS, heap_push() if priority-based.

Fig 3 — Request lifecycle: from user prompt to waiting queue to output.
Raw text ("Hello, my name is") → Processor (validate, tokenize to [1,2,3,4,5], pack) → EngineCoreRequest (request_id: "uuid-42", arrival_time, prompt, prompt_token_ids, type: "token", priority: 0, plus sampling params and lora_request) → waiting queue (status = WAITING; append() for FCFS, heap_push() for priority) → step() loop (① schedule, ② forward pass, ③ postprocess, repeated until done). In synchronous mode, all initial prompts are loaded first and then execution starts; in async mode, new requests can arrive mid-run (continuous batching).

Phase B: The step() loop

Once the engine has requests, it calls step() in a tight loop until all requests finish. Each step() is exactly three stages:

# Conceptual view of step() loop
while not finished:
    scheduler_output = scheduler.schedule()       # ① who runs, which KV blocks
    model_output     = model_executor.execute_model(scheduler_output)  # ② GPU fwd pass
    request_outputs  = output_processor.process_outputs(model_output)  # ③ tokens → text
  • Schedule — Prioritize decode requests (RUNNING queue), then pull from WAITING. Allocate KV blocks via allocate_slots(). Handle preemption if memory is tight. Return a batch plan.
  • Forward pass — ModelExecutor → Worker → ModelRunner. Prepare inputs (copy CPU→GPU), run the forward pass (paged attention reads/writes KV blocks), gather last-token hidden states, sample next tokens.
  • Post-process — Append sampled token IDs to each request's sequence. Detokenize. Check stop conditions (EOS, max_tokens, stop strings). Free KV blocks for finished requests.

The Forward Pass — Inside execute_model()

The call chain is: Engine → ModelExecutor.execute_model → Worker → ModelRunner. Inside ModelRunner, a forward pass has five distinct sub-steps.

Fig 4 — ModelRunner forward pass: 5 sub-steps.
① Update states — remove finished requests from the active batch; for the rest, track where their KV cache lives, how many tokens they have so far, and which KV blocks they will read/write this step.
② Prepare inputs (CPU → GPU) — tokens arrive on CPU; copy input_ids to the GPU; compute position indices (each token's position within its sequence); build attention metadata (who can attend to whom); build slot_mapping: for each token in the flattened batch, which physical KV cache slot (block_id × block_size + offset) it belongs to. Also set seq_lens (per request), query_start_loc (where each request starts in the flattened batch), and num_actual_tokens (total tokens this step).
③ Forward pass (GPU) — all sequences are flattened and concatenated into one "super-sequence"; position indices and attention masks keep each sequence logically isolated. The paged attention kernel uses block tables to access the KV cache via slot_mapping — no assumption of contiguous memory. reshape_and_cache_flash writes the new KVs.
④ Gather last-token hidden states — only the last token's hidden state per request is needed; gather these via indices and pass them through the LM head to get logits (vocabulary probabilities).
⑤ Sample — sample a token ID, append it to the request, and write the new token's K,V into the KV cache.

What is slot_mapping?

This is the critical piece that glues the flat batch to the paged KV cache. The slot mapping tells the attention kernel: "Token X in the flattened super-sequence corresponds to physical KV slot Y in GPU memory."

Physical slot = block_id × block_size + position_within_block. If a request has blocks [3, 7] (block size 16), its tokens map to physical slots [48, 49, 50, ... 63, 112, 113, ...]. The slot_mapping is a flat array — one entry per token — that encodes this mapping so the CUDA attention kernel can look up the right memory address without any branching.

reshape_and_cache_flash is the CUDA function responsible for writing newly computed KV vectors into the paged memory after the forward pass, using this same slot_mapping.
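To make the mapping concrete, here is a small sketch of how a slot_mapping could be assembled from per-request block tables. The function name and arguments are illustrative, not vLLM's internal API; block_size=16 as in the example above:

def build_slot_mapping(block_tables: list[list[int]],
                       token_positions: list[list[int]],
                       block_size: int = 16) -> list[int]:
    """For each token scheduled this step, compute its physical KV slot.

    block_tables[i]    — physical block IDs owned by request i, in logical order
    token_positions[i] — positions within request i being written this step
    """
    slot_mapping = []
    for blocks, positions in zip(block_tables, token_positions):
        for pos in positions:
            block_id = blocks[pos // block_size]   # which physical block this position lands in
            offset = pos % block_size              # position inside that block
            slot_mapping.append(block_id * block_size + offset)
    return slot_mapping

# A request with blocks [3, 7] writing positions 0..17 → slots 48..63, 112, 113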

Worked Example: Three Prompts, Prefill → Decode

Let's trace exactly what vLLM does with three prompts from start to finish. This is the canonical example from the notes, with block_size=4 for simplicity.

prompts = [
    "Hi, my name is",              # after tokenization: [1, 2, 3, 4, 5]  — 5 tokens
    "Today is a beautiful summer day", # → [1, 6, 5, 7, 8, 9, 10]              — 7 tokens
    "Hello there",                  # → [1, 12, 13]                          — 3 tokens
]
# (Note: "is" == 5, BOS token == 1, block_size = 4)

Step 1: allocate_slots() — assigning KV blocks

Before the first forward pass, the scheduler allocates KV blocks. With block_size=4:

Fig 5 — Block allocation for 3 sequences (block_size=4).
Seq 1: [1,2,3,4,5], 5 tokens → ceil(5/4) = 2 blocks: block 1 (full: tokens 1,2,3,4 — gets block_hash = h₁) and block 2 (partial: token 5, 3 slots free); ref_cnt = 1.
Seq 2: [1,6,5,7,8,9,10], 7 tokens → ceil(7/4) = 2 blocks: block 3 (full: 1,6,5,7 — block_hash = h₂) and block 4 (partial: 8,9,10, 1 slot free); ref_cnt = 1.
Seq 3: [1,12,13], 3 tokens → ceil(3/4) = 1 block: block 5 (partial: 1,12,13, 1 slot free); ref_cnt = 1.
In the physical GPU KV block pool, blocks 1–5 are simply the next free blocks handed out to the three sequences.

Step 2: Continuous batching flattens sequences

The scheduler converts the three separate sequences into a single concatenated "super-sequence" for the forward pass:

# After continuous batching
input_ids = [1,2,3,4,5, 1,6,5,7,8,9,10, 1,12,13]  # 15 tokens total
positions  = [0,1,2,3,4, 0,1,2,3,4,5,6, 0,1,2]   # per-sequence positions

# Attention metadata
query_start_loc    = [0, 5, 12, 15]  # where each seq starts in flattened batch
seq_lens           = [5, 7, 3]         # lengths for attention mask
num_actual_tokens  = 15                 # total tokens this step

Step 3: slot_mapping — linking tokens to physical KV slots

For every token in the flattened batch, slot_mapping records exactly which physical slot in GPU memory to write its K,V vectors:

# Physical slot = block_id × block_size + offset_within_block
# Seq 1 uses blocks 1, 2 → physical slots 4..7, 8..11 (block_size=4)
# Seq 2 uses blocks 3, 4 → physical slots 12..15, 16..19
# Seq 3 uses block  5    → physical slots 20..23

slot_mapping = [4,5,6,7,8, 12,13,14,15,16,17,18, 20,21,22]
#                ↑seq1                ↑seq2                 ↑seq3

# The 2nd sequence (7 tokens) maps to slots 12–18 because:
#   block 3 starts at slot 12 (3×4=12) → slots 12,13,14,15
#   block 4 starts at slot 16 (4×4=16) → slots 16,17,18  (3 tokens, 1 slot unused)

Fig 6 — slot_mapping: flattened batch → physical GPU KV slots.
The flattened input_ids [1,2,3,4,5, 1,6,5,7,8,9,10, 1,12,13] map to slot_mapping [4,5,6,7,8, 12,13,14,15,16,17,18, 20,21,22] via slot = block_id × block_size + offset_within_block. Seq 2 uses blocks 3 and 4, so its tokens land in slots 12–15 (block 3) and 16–18 (block 4, 3 of 4 slots used). reshape_and_cache_flash writes K,V at these physical addresses; the attention kernel reads them back through the block table.

Step 4: After prefill — decode step

Say we sample tokens [14, 15, 16] across the three sequences. Next step:

# Decode step — only the NEW tokens are passed as input
input_ids = [14, 15, 16]              # one new token per sequence
positions  = [5,  7,  3]               # next position in each sequence
slot_mapping = [9, 19, 23]             # where to write new KVs
#               ↑seq1: block2, slot 9  (2×4+1=9)
#               ↑seq2: block4, slot 19 (4×4+3=19)
#               ↑seq3: block5, slot 23 (5×4+3=23)

# Attention metadata for decode
query_start_loc   = [0, 1, 2, 3]     # 1 query token per sequence
seq_lens          = [6, 8, 4]         # total context length per seq (for KV access)
num_actual_tokens = 3                  # 3 decode tokens total

# KV cache reused from prefill — no recomputation of past tokens
# The attention kernel reads KVs for all previous tokens via block table
Why seq_lens and num_actual_tokens differ between prefill and decode
During prefill, num_actual_tokens = 15 (all prompt tokens) and seq_lens = [5,7,3] tracks each context length for masking. During decode, num_actual_tokens = 3 (one new token per sequence), but seq_lens = [6,8,4] — the full extended lengths including cached history. The attention kernel needs the full lengths to know how many past KV vectors to load from paged memory.

Paged Attention — Memory Architecture

The core memory problem: naive implementations pre-allocate a contiguous block of KV cache at max_tokens size per request. Most requests are shorter, leaving the tail wasted. With 40–60% internal waste per request, you can only fit a fraction of the possible concurrent requests on your GPU.

Paged attention borrows virtual memory from OS design: physical memory is non-contiguous; a block table provides the mapping. The KV cache is divided into fixed blocks of 16 tokens. Blocks are allocated on demand from a free_block_queue and returned when the request finishes.

Fig 7 — Block table: logical sequence → physical KV cache blocks.
Request A's logical KV sequence (block 0: tokens 0–15, block 1: tokens 16–31, block 2: tokens 32–47) maps through block_table[req_A] (logical index → physical block ID: 0 → #47, 1 → #23, 2 → #89) to blocks scattered across physical GPU memory, interleaved with other requests' blocks and free blocks. The free_block_queue is a doubly-linked list: allocation pops from the front, freeing pushes to the back.

Block metadata fields

Each block in the pool carries metadata — visible in the example above from the notes:

Field        Purpose
block_id     Unique identifier (index into the KV pool)
ref_cnt      Reference count; >1 means the block is shared (copy-on-write for beam search, or a prefix cache hit)
block_hash   Content hash for prefix caching; set when the block is full and immutable
Memory utilization before vs after
Before paged attention: ~20–40% GPU memory utilization (pre-allocated contiguous buffers, mostly wasted). After: ~95%+. On an 80 GB H100, loading a 16 GB 8B model leaves ~60 GB for KV. At ~2 MB/block (Llama-3-8B, 32 layers), that's ~30,000 KV blocks — enough for thousands of concurrent requests.
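That block-count arithmetic, spelled out (a rough sketch reusing the 128 KB/token figure from earlier; real counts depend on gpu_memory_utilization and activation memory):

free_for_kv = 60e9                      # ~60 GB left on an 80 GB H100 after a ~16 GB model
block_bytes = 128 * 1024 * 16           # 128 KB/token × block_size 16 ≈ 2 MB per block
print(int(free_for_kv / block_bytes))   # ≈ 28,600 blocks — the callout's "~30,000" ballpark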

The Scheduler — Decision Logic

The scheduler runs at the start of every step(). It manages two queues and one manager:

WAITING queue

Requests that haven't started yet. Status: WAITING. Ordered by FCFS (append) or priority (heap). These are prefill candidates.

Prefill = process the entire prompt once. Compute-bound. After prefill, one token is sampled and the request moves to RUNNING.

RUNNING queue

Requests actively generating tokens. Status: RUNNING. Each decode step processes only the most recent token, reads all past KV from cache.

Decode = memory-bandwidth-bound. One token per step per request. Prioritized always.

Scheduler decision algorithm

  1. Process RUNNING first — these requests are already generating for live users. For each, call allocate_slots(req_id, num_new_tokens=1). If KV memory is tight, preempt the lowest-priority RUNNING request (free its blocks, push it back to WAITING).
  2. Pull from WAITING — for each waiting request, in priority order, check whether enough free blocks exist for its prefill. If yes, move it to RUNNING and call allocate_slots(req_id, num_prompt_tokens). If not, skip it and try again next step.
  3. Return SchedulerOutput — the batch plan: which requests run, with what token counts, their block tables, and whether each is prefill or decode. The model executor uses this to build inputs. (A sketch of the whole loop follows below.)
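A minimal sketch of that decision loop. The names (allocate_slots, free_blocks, the queue methods, the Request fields) mirror the prose above but are not the real Scheduler API — treat it as pseudocode that happens to run:

def schedule(running: list, waiting, kv_manager):
    plan = []

    # 1. RUNNING (decode) requests first — each needs room for exactly one new token.
    for req in list(running):
        if not kv_manager.allocate_slots(req.request_id, num_new_tokens=1):
            # Memory is tight: preempt the lowest-priority running request
            # (highest priority value), free its blocks, push it back to WAITING.
            victim = max((r for r in running if r is not req),
                         key=lambda r: r.priority)
            running.remove(victim)
            kv_manager.free_blocks(victim.request_id)
            waiting.push(victim)
            kv_manager.allocate_slots(req.request_id, num_new_tokens=1)
        plan.append((req, 1))                       # one decode token this step

    # 2. Pull WAITING requests (prefills) while free blocks remain.
    while len(waiting) > 0:
        req = waiting.peek()
        n = len(req.prompt_token_ids)
        if not kv_manager.allocate_slots(req.request_id, num_new_tokens=n):
            break                                   # no room — retry next step
        running.append(waiting.pop())
        plan.append((req, n))                       # full (or chunked) prefill

    return plan                                     # → the SchedulerOutput batch plan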

Policy modes

Mode       Queue type                    Ordering
FCFS       Deque (append/pop)            First in, first served by arrival_time
Priority   Heap (heap_push/heappop)      User-provided priority value; lower = higher priority
Preemption note
When memory runs out while serving decode requests, vLLM must free some blocks. It picks the lowest-priority RUNNING request, frees all its KV blocks, and returns it to the WAITING queue. The request's token history (not its KV) is preserved, so it will re-prefill when resources free up. This preserves correctness at the cost of redoing the prefill work.

Continuous Batching

Static batching waits until the entire batch finishes before starting new requests. With wildly different output lengths (3 tokens vs 300 tokens), the short requests hold their GPU slot idle while waiting for long ones to finish.

Continuous batching fixes this by letting new requests join after any step. Because the scheduler runs before every step, it can immediately slot in a WAITING request the moment a RUNNING request finishes and frees its blocks.

Fig 8 — Continuous batching: requests join the moment a slot opens.
With static batching, request A (short) finishes early but its slot sits idle while request B (long) controls the batch; request C cannot start until B finishes and batch 2 begins. With continuous batching, C joins immediately when A finishes, while B continues uninterrupted.

Crucially, this only works because of paged attention: when A finishes, its KV blocks return to free_block_queue, and C can immediately be allocated new blocks. There's no need to shift any memory around — C's blocks are wherever the free list puts them.

In the synchronous engine (basic.py), only the initial prompts can be batched — you can't inject new ones mid-run. In the async engine (serving mode), every step the scheduler checks for newly arrived requests and folds them into the next batch.

Chunked Prefill

A single large prefill (e.g., 10,000-token document) can monopolize an entire engine step — blocking all decode requests for seconds while the long prompt is processed. Users mid-conversation experience a freeze.

The fix (surprisingly simple)

Set a long_prefill_token_threshold (e.g., 512). During scheduling, if a prefill request would consume more than this many tokens in one step, reset its token count to the threshold:

# Inside scheduler, during prefill token budget accounting
if num_new_tokens > long_prefill_token_threshold:
    num_new_tokens = long_prefill_token_threshold  # cap it
# The remaining tokens spill to the next step automatically
# KV block indexing handles resumption correctly — the request picks up where it left off

The KV cache already knows how many tokens have been processed (blocks allocated), so the next step continues from the right position with no special bookkeeping. Decode requests get tokens every step during the chunked prefill period — no stalls.

Automatic triggering
Even without explicit configuration, chunked prefill kicks in automatically if a step's total token budget is exhausted by decode requests: any prefill that exceeds the remaining budget gets truncated to fit. The budget is the max_num_batched_tokens parameter.

Prefix Caching — Internals

Most production deployments prepend the same large block of tokens to every request: a 3,000-token system prompt, a RAG document, an instruction template. Without prefix caching, every request recomputes the same KV vectors from scratch. Prefix caching avoids this.

The core idea: content-addressed KV blocks

vLLM maintains a hash map: cached_block_hash_to_block. When a block is full (all 16 slots filled) and its tokens won't change, it becomes cacheable. Its hash is computed and stored as a key pointing to the physical block.

On a new request, find_longest_cache_hit() walks the new request's tokens in 16-token chunks, hashing each block and checking the map. Every hit means zero recomputation for those 16 tokens.

Critical constraint
vLLM only caches full blocks. If block_size=16 and your system prompt is 100 tokens, the first 6 complete blocks (96 tokens) are cacheable; the last 4 tokens must always be recomputed. For maximum cache effectiveness, pad your system prompt to a multiple of 16.

How the block hash is computed

This is where it gets subtle. The same 16 tokens ("the cat sat on") can appear at different positions in different sequences and must produce different hashes — because their KV vectors depend on their position in the sequence. The hash is therefore chained:

block_hash = hash(
    parent_hash,         # hash of the PREVIOUS block → encodes everything before this block
    tuple(token_ids),    # the exact 16 token IDs in this block
    extra_hashes,        # optional: lora_id, cache_salt for per-tenant isolation
)
# Block 0: parent_hash = None (or 0 for root)
# Block 1: parent_hash = hash of block 0
# Block 2: parent_hash = hash of block 1
# This chain means "the cat sat" at position 48 has a different hash than at position 0

Fig 9 — Chained block hashing for prefix caching.
Request 1 (a system prompt spanning 3 blocks plus a user question): SP block 0 (tokens 0–15) → hash(0, [t0..t15]) = H₀; SP block 1 (tokens 16–31) → hash(H₀, [t16..t31]) = H₁; SP block 2 (tokens 32–47) → hash(H₁, [t32..t47]) = H₂; the user-question block (tokens 48–63) is unique and not cached. Stored: { H₀ → Block#14, H₁ → Block#7, H₂ → Block#31 }. When the request finishes, the blocks go to free_block_queue but the hash-map entries persist — blocks are lazily evicted, staying in the map until another request reclaims their physical block.
Request 2 (same system prompt, different user question): all three system-prompt blocks hit and Blocks #14, #7, #31 are reused; only the new 16-token question block is computed — 48 of the 64 prompt tokens come from cache. In practice, a 6,000-token system prompt means 6,000/16 = 375 blocks skipped. find_longest_cache_hit() walks blocks greedily from the start, stopping at the first miss.

Lazy eviction — why cached blocks survive after requests finish

When a request finishes, its blocks are returned to free_block_queue — but the hash map entries are not immediately cleared. This means:

  1. The block returns to the back of free_block_queue (available for reallocation), but its entry in cached_block_hash_to_block persists.
  2. A new request with the same prefix → find_longest_cache_hit() finds the block in the map → removes it from free_block_queue → reuses it directly. Zero recomputation.
  3. If another request allocates that block first (pops it from free_block_queue), the eviction code detects that it still has a hash entry, clears it from the map, and hands over a fresh block. The cache miss is handled gracefully.

This gives LRU-like behavior for free: popular system prompts stay cached indefinitely because they're reclaimed before eviction. Rare prefixes get evicted naturally as the free list shrinks.
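A compact sketch of the two pieces together — chained hashing of full blocks and a greedy longest-prefix lookup. Illustrative only: the real block pool and cached_block_hash_to_block structures carry more metadata (ref counts, eviction state):

# Illustrative sketch of chained block hashing + longest-prefix lookup.
BLOCK_SIZE = 16

def block_hashes(token_ids: list[int], parent_hash: int = 0) -> list[int]:
    """Hash only the FULL blocks of a token sequence, chaining each hash to its parent."""
    hashes = []
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = tuple(token_ids[start:start + BLOCK_SIZE])
        parent_hash = hash((parent_hash, block))   # position-aware via the chained parent
        hashes.append(parent_hash)
    return hashes

def find_longest_cache_hit(token_ids: list[int], cache: dict[int, int]) -> list[int]:
    """Walk full blocks from the start; stop at the first miss. Returns reusable block IDs."""
    hits = []
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hits.append(cache[h])                      # physical block to reuse — zero recomputation
    return hits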

Guided Decoding — Making the Model Follow Rules

Language models are probabilistic. Even a perfectly prompted model can hallucinate field names, produce malformed JSON, or break a required format. In production systems where AI output feeds directly into databases or APIs, this is unacceptable.

Guided decoding makes invalid outputs literally impossible — not just unlikely — by modifying the logits before sampling.

Finite State Machines (FSMs)

A Finite State Machine is a directed graph: states connected by transitions. Each transition is triggered by an input (a token ID). At any state, only certain tokens are valid. vLLM compiles your desired output format (JSON schema, regex, choice list, or full context-free grammar) into an FSM using xgrammar.

Fig 10 — FSM to bitmask: constraining token sampling at each step.
The FSM for {"label": "positive" | "negative", "score": float}: S0 (start) —"{"→ S1 —"label", then "positive" or "negative"→ S2 (got label) —"score"→ S3 (float digits, then "}")→ DONE. At state S1, choosing the label value, only "positive" and "negative" are allowed: the raw logits ("the" 2.3, "positive" 4.1, "1234" 0.4, "negative" 3.7, "maybe" 2.1, plus ~31,995 more tokens) are masked with the bitmask [0,1,0,1,0,…], sending every forbidden token to -inf. Softmax then runs over only "positive" and "negative" — other tokens are impossible.

Implementation inside vLLM

  1. A request arrives with GuidedDecodingParams(json_schema=...). Its status is set to WAITING_FOR_FSM, and xgrammar begins compiling the grammar asynchronously.
  2. Scheduler: if the FSM compile is done, the request transitions to WAITING; if not, it is skipped this step. No blocking.
  3. StructuredOutputManager.prepare_bitmask() is called for all FSM requests in the batch, producing a _grammar_bitmask tensor (vocab_size / 32 integers per request).
  4. After the forward pass, a Triton kernel expands the bitmask and sets forbidden logits to -inf. Normal sampling proceeds on the constrained distribution.
  5. accept_tokens() advances the FSM state based on the sampled token, so the next step's bitmask is ready.
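For intuition, the masking step amounts to the following in plain PyTorch. The bit layout (bit i of word w gates token w×32+i) is an assumption for illustration; vLLM's actual kernel does this expansion in Triton on the GPU:

import torch

def apply_grammar_bitmask(logits: torch.Tensor, packed_mask: torch.Tensor) -> torch.Tensor:
    """Sketch: packed_mask is [num_requests, ceil(vocab/32)] int32; expand each word
    into 32 allowed/forbidden flags and push forbidden logits to -inf."""
    vocab_size = logits.shape[-1]
    bits = torch.arange(32, device=logits.device)
    allowed = ((packed_mask.unsqueeze(-1) >> bits) & 1)       # [reqs, words, 32]
    allowed = allowed.reshape(packed_mask.shape[0], -1)[:, :vocab_size].bool()
    return logits.masked_fill(~allowed, float("-inf"))        # softmax sees only legal tokens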

Speculative Decoding — Multiple Tokens Per Weight Load

The decode bottleneck is clear: at small batch sizes, the GPU spends most of its time loading model weights from HBM, not computing. For a 70B model in bf16, every token costs streaming 140 GB of weights — ~42ms at H100 bandwidth, mostly idle compute.

The insight: if you use a cheap draft model to propose k tokens, the large model can verify all k simultaneously in one pass. Since the verify pass costs roughly the same as a single token (same weight load), if any draft tokens are accepted, you've done better than baseline.

Fig 11 — Speculative decoding: draft → verify → accept/reject.
Phase 1 — Draft (cheap: n-gram lookup or a tiny neural model): from the context "The sky…", the proposer drafts k=3 tokens — "is", "blue", "today" — using k cheap passes, or zero extra passes for n-gram lookup.
Phase 2 — Verify (large model, ONE forward pass over context + k draft tokens): the single pass yields k+1 distributions — p₁("is"), p₂("blue"), p₃("today"), plus one free bonus position.
Phase 3 — Accept/reject (statistically equivalent to pure large-model sampling): draft 1 "is" (p_large=0.72 ≥ p_draft=0.65) is accepted with probability 1; draft 2 "blue" (p=0.45 < q=0.60) is accepted with probability p/q = 0.75; draft 3 "today" (p=0.05, q=0.50, p/q=0.10) is rejected, and a correction token is resampled from max(0, p−q) / Σ max(0, p−q), yielding "and". Final output: "is blue and" — three tokens from one large-model pass.

The accept/reject rule — why it's mathematically correct

For each draft token x at position i, let p(x) = large model's probability and q(x) = draft model's probability:

Case             Rule                                          Why
p(x) ≥ q(x)      Accept with probability 1                     The large model agrees or prefers this token — no correction needed
p(x) < q(x)      Accept with probability p(x)/q(x)             The draft was overconfident — partial acceptance corrects for the gap
Rejected         Resample from max(0, p−q) / Σ max(0, p−q)     The residual distribution removes the draft bias and restores the correct marginals
All k accepted   Sample one bonus token from p(x_{k+1})        The verify pass already computed this distribution — it's "free"

The result is that the final sequence is drawn from exactly the same distribution as pure autoregressive sampling from the large model. Speculative decoding is not an approximation — it's a lossless acceleration.
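The table above, as a one-position sketch in PyTorch (illustrative, not vLLM's rejection sampler):

import torch

def accept_or_reject(p: torch.Tensor, q: torch.Tensor, draft_token: int) -> int:
    """One position of the accept/reject rule.
    p — target-model distribution over the vocab at this position
    q — draft-model distribution used to propose draft_token"""
    if torch.rand(()) < torch.clamp(p[draft_token] / q[draft_token], max=1.0):
        return draft_token                        # accepted: keep the draft token
    residual = torch.clamp(p - q, min=0.0)        # rejected: resample from max(0, p - q)
    residual /= residual.sum()
    return int(torch.multinomial(residual, 1))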

Draft strategies in vLLM V1

  1. N-gram — look back at the last prompt_lookup_max tokens of the existing sequence; if that n-gram appeared earlier in the context, propose the tokens that followed it (sketched below). Zero extra model weights, zero extra GPU time for drafting. Acceptance: 30–50% for repetitive content (code, summaries); low for creative text.
  2. EAGLE — "model surgery": keep the large model's embeddings and LM head, and replace the full transformer stack with a 1–2 layer MLP fine-tuned to predict the large model's hidden states. Drafting runs this lightweight MLP over the large model's cached hidden states; no separate model file is needed beyond a small adapter. Acceptance: 60–80%+ for most tasks.
  3. Medusa — adds k extra linear "heads" on top of the large model's final hidden state, one per speculative position (+1, +2, …, +k), all run in parallel in a single extra matmul and shipped as part of the main checkpoint. Slightly lower accuracy than EAGLE but no architecture surgery; works well for structured, repetitive tasks.
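A toy version of the n-gram proposer referenced in item 1 — illustrative only; vLLM's prompt-lookup drafting differs in detail:

def ngram_propose(tokens: list[int], n: int = 3, k: int = 3) -> list[int]:
    """If the last n tokens appeared earlier in the context, propose the k tokens
    that followed that earlier occurrence; otherwise propose nothing."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Search the context (excluding the tail itself) for the most recent match.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []                                      # no match → no draft this step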
When speculative decoding helps vs hurts
It helps most when the system is memory-bandwidth-bound — small batch sizes (1–16 requests), interactive chat, long outputs. The draft cost is tiny, the verify pass costs barely more than a single token, and even 50% acceptance gives a significant speedup. It hurts at large batch sizes (100+ requests), where the system is already compute-bound and the draft overhead adds latency without enough acceptance gain to compensate.

System View — How It All Connects

These components aren't independent modules bolted together. They form a tight dependency chain:

  1. Paged Attention is the enabler. By making KV blocks independently allocatable, it enables continuous batching, prefix caching, and dynamic preemption — all of which require the ability to allocate and free KV memory for individual requests at any time.
  2. Continuous Batching is what makes paged attention's dynamic allocation useful at serving scale. Without it, blocks would still be under-utilized because requests sit idle in a static batch.
  3. The Scheduler orchestrates both — every step, it decides which requests get blocks, who runs decode, who starts prefill, and who gets preempted. The slot_mapping and block tables it produces flow directly into the forward pass.
  4. Chunked Prefill makes the scheduler fair across prompt lengths. Without it, a single 50k-token prompt would hold all decode responses hostage for seconds.
  5. Prefix Caching exploits the paged block structure — blocks are already exactly 16 tokens, which is also exactly the granularity for content-addressable caching. The chained hash ensures positional uniqueness for free.
  6. Guided Decoding operates orthogonally at the logit level. It compiles format specs to FSMs, produces bitmasks per step, and applies them before sampling. Compatible with all other techniques — it adds a small masking overhead per decode step.
  7. Speculative Decoding attacks the memory-bandwidth bottleneck that remains even after all the above. At small batch sizes where the GPU is still waiting on HBM loads, drafting and verifying in bulk extracts more tokens per weight load.
What to read next
The original technical breakdown by Aleksa Gordić at aleksagordic.com/blog/vllm goes further into multi-GPU tensor parallelism, pipeline parallelism, data parallelism across nodes, and the distributed serving stack. The vLLM codebase at commit 42172ad is also well-commented — start with vllm/engine/ and follow the LLMEngine → EngineCore → Scheduler chain.