Deep Dive into vLLM Architecture

21-2-2026

Engine construction, KV cache mechanics, the scheduler loop, slot mapping, prefix caching, guided decoding, and speculative decoding: from source to silicon.

By Sai Saketh Cherukuri · Based on Aleksa Gordić's breakdown

Foundation: How LLMs Generate Text

Everything in vLLM is shaped by two fundamental principles about autoregressive language models. These determine every design decision in the system.

1: One token at a time, sequentially

LLMs generate tokens one by one. Given a sequence [t₁, t₂, ..., tₙ], the model runs a full forward pass to sample token tₙ₊₁, then runs again for the next. This is autoregressive generation: each step depends on all prior tokens, and no step can start until the previous one completes.

Generation has two distinct phases with entirely different performance characteristics. Prefill processes the full input prompt in a single parallel pass, which is compute-bound. Decode generates tokens one at a time, loading all model weights each step, which is memory-bandwidth-bound.

Fig 1 - Prefill vs Decode: Two Different Bottlenecks. Prefill runs all prompt tokens ("The cap -ital of France" → 5 tokens) through one large parallel matrix multiply, computing and writing K,V for every token simultaneously into paged blocks; the O(n²) attention makes it compute-bound. Decode produces one token per step ("is", then "Paris"), reading all past K,V vectors from paged blocks and loading every model weight each step - a 70B bf16 model streams ~140 GB of weights per token, roughly 42 ms/token on an H200 at batch size 1 - leaving compute units idle: memory-bandwidth-bound.

2: The KV Cache

In transformer attention, every token produces a Key (K) and Value (V) vector. During decode, the new token must attend to all past tokens' K and V. Rather than recomputing past K,V from scratch each step, they're stored. We call this the KV cache. It grows one block of 16 tokens (generally block_size = 16) at a time as generation proceeds.

The KV cache for a single request at a single layer in bf16: 2 × seq_len × num_kv_heads × head_dim × 2 bytes.

For Llama-3-8B (8 KV heads, head_dim = 128), the KV cache footprint per token per layer is:

2 × 8 × 128 × 2 bytes = 4,096 bytes

Across 32 layers, this becomes:

4,096 × 32 = 131,072 bytes ≈ 128 KB per token

For a 2,048-token response:

128 KB × 2,048 = 256 MB

That means a single long generation can consume 256 MB of KV cache memory. When serving many concurrent requests, this memory usage scales linearly, making KV memory management the central constraint in large-scale LLM inference systems.
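The arithmetic above generalizes to any model. A quick sketch (the Llama-3-8B dimensions match those quoted above; `bytes_per_value=2` assumes bf16):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache footprint per token: K and V (the factor of 2) at every layer."""
    return 2 * num_kv_heads * head_dim * bytes_per_value * num_layers

# Llama-3-8B: 32 layers, 8 KV heads, head_dim = 128, bf16
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)                  # → 131072 bytes, i.e. 128 KB per token
print(per_token * 2048 // 2**20)  # → 256 (MB for a 2,048-token response)
```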

Engine Construction

The LLM engine is the fundamental building block of vLLM. On its own, it already enables high-throughput inference.

vLLM has two entry points: LLM for offline synchronous use (scripts, benchmarks), and AsyncLLMEngine for online serving. Both wrap the same LLMEngine / EngineCore.

We'll use the following offline inference snippet as our running example (adapted from basic.py).

# Offline usage
    from vllm import LLM, SamplingParams

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
    ]

    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
    )

    def main():
        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        outputs = llm.generate(prompts, sampling_params)
        for output in outputs:
            print(output.prompt, "->", output.outputs[0].text)

    if __name__ == "__main__":
        main()

This configuration is:

  • Offline - no web or distributed system scaffolding
  • Synchronous - all execution happens in a single blocking process
  • Single-GPU - no data, model, pipeline, or expert parallelism

LLM Engine Constructor

The main components of the engine are:

  • vLLM config - controls model settings, KV cache size, parallelism, and other configuration knobs.
  • Processor - validates raw input, tokenizes it, and converts it into internal engine requests.
  • Engine core client - runs the model. In offline mode this is InprocClient; in large-scale serving it becomes a distributed client.
  • Output processor - converts raw engine outputs into the final response returned to the user.

Engine Core Components

1. Model Executor

Drives forward passes on the GPU.

  • UniProcExecutor - single process, single GPU.
  • MultiProcExecutor - supports multiple GPUs.

2. Structured Output Manager

Handles guided decoding (e.g., structured or JSON outputs).

3. Scheduler

Determines which requests are processed next.

  • Policy setting - FCFS (First Come First Served) or priority-based scheduling.
  • Waiting and Running queues

4. KV Cache Manager

The core of paged attention. It manages memory blocks used for storing KV cache and maintains a free block queue which is a pool of available KV-cache blocks. These blocks act as an indexing structure that maps tokens to their computed KV cache.

Depending on VRAM size and block size, there may be hundreds of thousands of available blocks. As emphasized earlier, efficient KV cache management is critical for high-concurrency inference.

Let's take a look at the steps:

Engine Initialization and Model Loading

1. Initialize the Device

  • Assign a CUDA device (e.g., cuda:0) to the worker.
  • Verify that the model dtype (e.g., bf16) is supported.
  • Check available VRAM against gpu_memory_utilization (e.g., 0.8 = 80%).
  • Configure distributed settings (DP, TP, PP, EP, etc.).
  • Instantiate a model_runner (sampler, KV cache, forward-pass GPU buffers).
  • Instantiate an InputBatch (CPU-side buffers, KV block tables, sampling metadata).

2. Load the Model

  • Instantiate the model architecture.
  • Load model weights.
  • Call model.eval() to enable inference mode.
  • Optionally apply torch.compile() for optimization.

3. Initialize the KV Cache

  • Determine per-layer KV cache specification.
  • Run a dummy forward pass to profile GPU memory usage.
  • Compute how many KV cache blocks fit into available VRAM.
  • Allocate and bind KV cache tensors to attention layers.
  • Prepare attention metadata (e.g., select FlashAttention backend).
  • Unless --enforce-eager is provided, do a dummy run for each of the warmup batch sizes and capture CUDA graphs. A CUDA graph records the whole sequence of GPU work as a DAG that can later be replayed in a single launch.
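The block-count computation in step 3 boils down to simple division. A sketch under stated assumptions (the function name and the "measured non-KV usage" input are illustrative; vLLM derives the latter from its profiling dummy pass):

```python
def num_kv_blocks(total_vram, gpu_memory_utilization, non_kv_bytes, block_size,
                  num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """How many KV cache blocks fit once weights/activations are accounted for."""
    kv_budget = total_vram * gpu_memory_utilization - non_kv_bytes
    bytes_per_block = (2 * block_size * num_kv_heads * head_dim
                       * bytes_per_value * num_layers)
    return int(kv_budget // bytes_per_block)

# 80 GB GPU at 90% utilization; dummy pass measured ~20 GB of non-KV usage;
# Llama-3-8B-like shapes with block_size = 16.
print(num_kv_blocks(80 * 2**30, 0.9, 20 * 2**30, 16, 32, 8, 128))  # → 26624
```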

The 6-step engine construction sequence

Fig 2 - LLM Engine Constructor: 6 Steps Before the First Token. Step 1, init_device(): choose cuda:0 (or the assigned GPU); check bf16 support and VRAM, reserving e.g. 80% when gpu_memory_utilization=0.8. Step 2, create the ModelRunner: holds the model, sampler, and attention-metadata GPU buffers ("everything needed for one forward pass"), and creates the CPU-side InputBatch with block tables, sampling params, and request metadata. Step 3, load_model(): build the PyTorch model architecture from config, load weights (sharded or full), call model.eval(); torch.compile() is optional. Step 4, init_kv_cache() (critical): inspect attention layers to determine the KV layout, run a dummy forward pass to measure GPU memory, calculate how many KV blocks fit in VRAM, and allocate one giant contiguous KV pool. Component hierarchy: LLM (offline entry point) → EngineCore (schedules, runs the model, manages KV cache) → Scheduler (FCFS/priority queues), KV Cache Manager, Model Executor (Worker → ModelRunner), and Output Processor (EngineCoreOutput → detokenize → user output).
Two execution modes. After initialization, every forward pass runs in one of two modes. Eager mode: standard PyTorch. It's flexible, but each kernel launch has Python overhead. Captured mode: CUDA graphs are recorded during init_kv_cache() for common batch sizes, then replayed as a single GPU command with no Python overhead and dramatically lower latency. Pass --enforce-eager to skip capture.

The generate() Loop

When you call llm.generate(prompts, sampling_params), two phases happen: feeding requests into the engine, then stepping until all complete.

  • Create a unique request_id (UUID) and capture arrival_time for priority scheduling
  • Call _process_inputs() → tokenize, validate, return {prompt, prompt_token_ids, type}
  • Pack into EngineCoreRequest with priority, sampling params, and all metadata
  • Pass to engine core → wraps in Request object, status = WAITING, add to scheduler's waiting queue
Fig 3 - Request Lifecycle: From User Prompt to Output. Raw text ("Hello, my name is") passes through the Processor - validate, tokenize to [1,2,3,4,5], pack - into an EngineCoreRequest carrying request_id ("uuid-42"), arrival_time, prompt_token_ids, type ("token"), priority (0), and the sampling params. The request lands in the waiting queue with status WAITING (FCFS → append(); priority → heap push), then the step() loop runs: ① schedule, ② forward pass, ③ post-process, repeating. In sync mode all prompts are loaded first, then the step loop runs; in async mode new requests fold in mid-run.
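The request objects above can be pictured as a small dataclass. The field names follow Fig 3; treat this as an illustrative shape rather than vLLM's exact class:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class EngineCoreRequest:
    prompt_token_ids: list
    priority: int = 0
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    arrival_time: float = field(default_factory=time.time)
    status: str = "WAITING"  # promoted to RUNNING by the scheduler

waiting = []  # FCFS mode: plain append; priority mode would use heapq instead
req = EngineCoreRequest(prompt_token_ids=[1, 2, 3, 4, 5])
waiting.append(req)
print(req.status)  # → WAITING
```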

Phase B: The step() function

Once the engine has requests, it calls step() in a loop until all requests finish. Each step is exactly three stages:

  • Schedule: We choose which requests to run (decode and/or prefill)
  • Forward pass: execute one step of the model for all running requests
  • Post-process: Append new tokens to request states, detokenize and check for completion, prepare outputs
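The three stages can be sketched as a toy loop. `ToyScheduler` and `execute_model` here are minimal stand-ins, not vLLM's classes:

```python
class ToyScheduler:
    def __init__(self, requests):           # requests: {id: tokens still to generate}
        self.remaining = dict(requests)

    def has_unfinished_requests(self):
        return bool(self.remaining)

    def schedule(self):                     # stage 1: pick who runs this step
        return list(self.remaining)

    def update_from_output(self, sampled):  # stage 3: append token, check finish
        done = []
        for rid in sampled:
            self.remaining[rid] -= 1
            if self.remaining[rid] == 0:
                done.append(rid)
                del self.remaining[rid]
        return done

def execute_model(batch):                   # stage 2: one forward pass for the batch
    return {rid: "<tok>" for rid in batch}

sched = ToyScheduler({"a": 2, "b": 3})
finished, steps = [], 0
while sched.has_unfinished_requests():
    batch = sched.schedule()
    out = execute_model(batch)
    finished += sched.update_from_output(out)
    steps += 1
print(steps, finished)  # → 3 ['a', 'b']
```

Note how "a" leaves the batch after step 2 while "b" keeps running - the seed of continuous batching, covered below.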
The Scheduler

    The scheduler is the brain of vLLM. Before every step it decides: which requests run, which wait, and who gets preempted if memory is tight. It manages two queues:

    WAITING queue

    Requests that have arrived but haven't started. Status: WAITING. Ordered by policy (FCFS or priority). Scheduler tries to promote them to RUNNING each step - if enough free KV blocks exist for their prefill.

    RUNNING queue

    Requests actively generating tokens. Status: RUNNING. Each decode step processes one new token per request, reading all past KV from cache. Always processed first - these are live user requests.

    Scheduler decision algorithm

    • Process RUNNING first. For each: call allocate_slots(req_id, num_new_tokens=1). If KV memory is tight, preempt the lowest-priority RUNNING request - free its blocks, push it back to WAITING.
    • Pull from WAITING. For each waiting request (priority order): check if enough free blocks exist for its prefill. If yes, move to RUNNING, allocate slots. If no room, skip until next step.
    • Return SchedulerOutput. The batch plan: which requests run, token counts, block tables, prefill vs decode flag. The model executor uses this to build all inputs.
    Mode     | Queue type              | Ordering
    FCFS     | Deque (append/pop)      | First in, first served by arrival_time
    Priority | Heap (heappush/heappop) | User-provided priority value; lower = higher priority
    Preemption. When memory runs out during decode, vLLM frees the lowest-priority RUNNING request's KV blocks and returns it to WAITING. The request's token history is preserved, so it will re-prefill when resources free up - correct but wasteful. Avoid this by tuning gpu_memory_utilization and request concurrency limits.
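Putting the three rules together, a deliberately simplified pass might look like this (one block per decode token, whole-prefill admission; vLLM's real accounting runs per-token through allocate_slots and block tables):

```python
import heapq

def schedule_step(running, waiting, free_blocks, blocks_of):
    """One simplified scheduling pass.
    running: req ids already generating (each needs 1 free block this step);
    waiting: min-heap of (priority, req_id); blocks_of: KV blocks per request."""
    preempted = []
    # 1) RUNNING first. If the pool can't cover one block per live request,
    #    preempt from the back of the queue and reclaim those blocks.
    while free_blocks < len(running):
        victim = running.pop()            # stand-in for "lowest priority"
        free_blocks += blocks_of[victim]  # its KV blocks return to the pool
        preempted.append(victim)
    free_blocks -= len(running)           # reserve 1 decode block each
    # 2) Pull from WAITING while a full prefill still fits.
    while waiting and blocks_of[waiting[0][1]] <= free_blocks:
        _, req = heapq.heappop(waiting)
        free_blocks -= blocks_of[req]
        running.append(req)
    return running, preempted, free_blocks

blocks_of = {"a": 4, "b": 4, "c": 3, "d": 8}
wait_q = [(0, "c"), (1, "d")]
heapq.heapify(wait_q)
# "c" fits in the leftover budget, "d" must wait for the next step.
print(schedule_step(["a", "b"], wait_q, 5, blocks_of))  # → (['a', 'b', 'c'], [], 0)
```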

    Forward Pass

    We call ModelExecutor.execute_model, which delegates to the Worker, which in turn delegates to the ModelRunner. Here are the main steps:

    1. Update state - Prune finished requests from input_batch; refresh KV cache block metadata for each active request.
    2. Prepare inputs - Copy buffers CPU → GPU; compute position indices; build slot_mapping (maps each token to its physical slot in paged KV memory); construct the attention metadata struct.
    3. Forward pass - Run the model with custom paged-attention kernels. All sequences are flattened and concatenated into one "super sequence." Position indices and attention masks ensure each sequence attends only to its own tokens, enabling continuous batching without right-padding.
    4. Gather last-token states - Extract the hidden state at each sequence's final position and project to vocabulary logits.
    5. Sample - Pick the next token from the logits per the sampling config (greedy, temperature, top-p, top-k, etc.).

    The forward pass itself has two execution modes:

    • Eager mode - Standard PyTorch forward pass when eager execution is enabled.
    • Captured mode - Replay a pre-captured CUDA Graph when eager is not enforced (graphs are recorded during engine startup in the initialize-KV-cache procedure).

    The diagram below walks through a worked example that makes continuous batching and paged attention concrete:

    The Forward Pass - Inside execute_model()

    Every step() dispatches a single forward pass over the current batch. The five-stage pipeline inside execute_model():

    Fig 4 - execute_model(): 5 Stages Per Step
    ① Update states - remove finished requests; track KV positions, block tables, and token counts.
    ② Prepare inputs (CPU → GPU) - copy input_ids to the GPU, compute positions, build attention metadata, and build the slot mapping (KV slot = block_id × block_size + offset).
    ③ Forward pass (GPU) - sequences are flattened into one super-sequence; paged attention uses block tables plus the slot mapping; new KVs are written to the cache.
    ④ Gather last-token hidden states - extract the last token per request → LM head → logits.
    ⑤ Sample and append - sample a token, append it to the request, and write its K,V to the KV cache.

    What is slot_mapping?

    This is the critical piece that glues the flat batch to the paged KV cache. The slot_mapping tells the CUDA attention kernel: "token X in the flattened super-sequence corresponds to physical KV slot Y in GPU memory."

    Physical slot = block_id × block_size + position_within_block. If a request holds blocks [3, 7] (block size 16), its tokens map to physical slots [48..63, 112..127]. The slot_mapping is a flat array with one entry per token that the CUDA kernel uses to look up the right memory address without branching.
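That mapping is a one-liner to compute. A sketch (the helper name is ours, not vLLM's):

```python
def build_slot_mapping(block_table, num_tokens, block_size=16):
    """Physical slot for each token: block_id * block_size + offset_within_block."""
    return [block_table[i // block_size] * block_size + i % block_size
            for i in range(num_tokens)]

# A request holding blocks [3, 7] with 20 tokens so far:
slots = build_slot_mapping([3, 7], num_tokens=20)
print(slots[:4], slots[-4:])  # → [48, 49, 50, 51] [112, 113, 114, 115]
```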

    Worked Example: Three Prompts, Prefill → Decode

    Let's trace exactly what vLLM does with three prompts from start to finish, with block_size=4 for clarity.

    Fig 6 - Complete worked example: three prompts through prefill and decode

    Step 1 - Tokenize inputs and allocate KV cache blocks (block_size = 4 slots per block). The prompts "Hi, my name is", "Today is a beautiful summer day", and "Hello there" tokenize (IDs simplified for clarity) to seq₁ = [1, 2, 3, 4, 5], seq₂ = [1, 6, 5, 7, 8, 9, 10], seq₃ = [1, 12, 13]. allocate_slots assigns blocks 1-2 to seq₁, blocks 3-4 to seq₂, and block 5 to seq₃, each with ref_cnt = 1.

    Step 2 - Continuous batching: flatten into one "super sequence". All sequences are concatenated with no padding: input_ids = [1,2,3,4,5, 1,6,5,7,8,9,10, 1,12,13] (seq₁ · 5 tokens, seq₂ · 7 tokens, seq₃ · 3 tokens) and positions = [0,1,2,3,4, 0,1,2,3,4,5,6, 0,1,2] - positions reset to 0 per sequence.

    Step 3 - Slot mapping: each token gets its physical slot in GPU paged memory. slot_mapping = [4,5,6,7,8, 12,13,14,15,16,17,18, 20,21,22]. For example, seq₂ holds blocks 3 and 4 (slots 12-15 and 16-19), so its 7 tokens fill slots 12-18. Before prefill, the GPU paged KV cache is 5 empty blocks.

    Forward pass 1 - Prefill: run all 15 tokens and write their KVs. Afterwards blk 1 is 4/4 full and blk 2 is 1/4 used (seq₁, 5 tokens); blk 3 is 4/4 full and blk 4 is 3/4 used (seq₂, 7 tokens); blk 5 is 3/4 used (seq₃, 3 tokens). Attention metadata (prefill): query_start_loc = [0, 5, 12, 15], seq_lens = [5, 7, 3], num_actual_tokens = 15. One new token is sampled per sequence.

    Forward pass 2 - Decode: only one new token per sequence. With sampled tokens 14 (seq₁), 15 (seq₂), and 16 (seq₃), the next step's input_ids are just those 3 tokens - all prior KVs are reused - and slot_mapping writes one new slot per sequence (slots 9, 19, and 23). After decode: seq₁ → 6 tokens (blk 2 at 2/4), seq₂ → 8 tokens (blks 3-4 full), seq₃ → 4 tokens (blk 5 full). Attention metadata (decode): query_start_loc = [0, 1, 2, 3], seq_lens = [6, 8, 4], num_actual_tokens = 3.

    Why this is fast: decode cost scales with the number of active sequences, not total tokens generated. KVs are computed once (during prefill) and cached in paged GPU blocks. The attention kernels use slot_mapping and seq_lens to route each new query to the right non-contiguous memory block - no recomputation, no padding, no wasted slots.

    Advanced Features

    With the basic engine flow in place, we can now look at the advanced features. We've already discussed preemption, paged attention, and continuous batching.

    Paged Attention

    Before vLLM, every framework pre-allocated the maximum possible KV cache for every request, even if it only used a fraction. A 2K-token limit request would claim 2K × 128KB = 256MB from the moment it arrived, even if it only generated 50 tokens. This made serving memory-inefficient and throughput low.

    Paged Attention solves this by dividing the KV cache into fixed-size pages (blocks) of 16 tokens each. Pages are allocated dynamically as generation proceeds only when new tokens actually need them. Pages for different requests can be non-contiguous in GPU memory (just like virtual memory), and the attention kernel is modified to look up pages via a block table rather than assuming contiguous memory.

    Paged Attention - Key Properties

    Block size: 16 tokens per block (tunable). Each block holds K,V vectors for 16 positions.

    Physical pool: One giant GPU tensor pre-allocated at startup. Carved into equal blocks. The free block queue tracks which blocks are available.

    Block table: Per-request mapping: logical block index → physical block ID. The attention CUDA kernel uses this to reconstruct the right KV memory addresses.

    ref_cnt: Reference count for prefix sharing - a block with ref_cnt > 1 is shared across multiple requests and cannot be freed until all readers drop it.

    Allocate / Free Cycle

    Allocate: When a request needs another 16 tokens of KV space, pop one block from free_block_queue, set its ref_cnt to 1, and add it to the request's block table.

    Free: When a request finishes (or is preempted), walk its block table. For each block, decrement ref_cnt. If ref_cnt = 0, push back to free_block_queue. Immediately reusable.

    No copying: Memory is never moved. "Freeing" just returns a block index to the free list. The stale data in it doesn't matter - it'll be overwritten by the next tenant.
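The whole cycle fits in a few lines. A toy pool, illustrative rather than vLLM's KVCacheManager:

```python
from collections import deque

class BlockPool:
    """Toy free-block queue with reference counts."""
    def __init__(self, num_blocks):
        self.free_queue = deque(range(num_blocks))
        self.ref_cnt = [0] * num_blocks

    def allocate(self):
        block_id = self.free_queue.popleft()  # no zeroing: stale KVs get overwritten
        self.ref_cnt[block_id] = 1
        return block_id

    def share(self, block_id):                # e.g. a prefix-cache hit on this block
        self.ref_cnt[block_id] += 1

    def free(self, block_table):
        for block_id in block_table:
            self.ref_cnt[block_id] -= 1
            if self.ref_cnt[block_id] == 0:   # last reader gone: reusable
                self.free_queue.append(block_id)

pool = BlockPool(4)
table = [pool.allocate(), pool.allocate()]  # request grows to 2 blocks
pool.share(table[0])                        # a second request reuses block 0
pool.free(table)                            # first request finishes
print(len(pool.free_queue))                 # → 3: block 0 is still pinned by the sharer
```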

    Continuous Batching

    Static batching waits for the entire batch to finish before starting new requests. With wildly different output lengths, short requests hold their GPU slot idle while waiting for long ones.

    Continuous batching fixes this by letting new requests join after any step. Because the scheduler runs before every single step, a WAITING request can slot in the moment a RUNNING request finishes and frees its blocks.

    Fig 8 - Continuous Batching: New Requests Join the Moment a Slot Opens
    Fig 8 - Continuous Batching: New Requests Join the Moment a Slot Opens. Static batching (t1-t7): Req A (short) finishes early but its GPU slot sits idle while Req B (long) controls the batch duration; Req C cannot start until B finishes and batch 2 begins. Continuous batching: the moment A finishes, C joins immediately while B continues uninterrupted.

    This only works because of paged attention: when A finishes, its KV blocks return immediately to free_block_queue, and C can begin allocation instantly. No memory copying needed - C's blocks are wherever the free list puts them.

    Chunked Prefill

    Chunked prefill is a technique for handling long prompts by splitting their prefill into smaller chunks. Without it, a single very long request could monopolize an entire engine step, postponing every other request and inflating their latency.

    A single large prefill (e.g., a 10,000-token document) can block all decode requests for seconds; users mid-conversation experience a freeze.

    The fix

    Set a long_prefill_token_threshold (e.g., 512). During scheduling, if a prefill request would consume more tokens than this threshold, cap it:

    # Inside scheduler - prefill token budget accounting
        if num_new_tokens > long_prefill_token_threshold:
            num_new_tokens = long_prefill_token_threshold   # cap it
        # Remaining tokens spill to the next step automatically
        # KV block indexing handles resumption - request picks up exactly where it left off

    Implementation is straightforward: cap the number of new tokens per step. If the requested number exceeds long_prefill_token_threshold, reset it to exactly that value; the underlying indexing logic (described earlier) takes care of the rest.

    The KV cache already knows how many tokens have been processed (blocks allocated), so the next step continues from the right position with zero extra bookkeeping. Decode requests get tokens every step during the chunked prefill period so there are no stalls.

    Automatic triggering. Even without explicit configuration, chunked prefill kicks in automatically when decode requests exhaust the total token budget: any prefill that exceeds the remaining budget gets truncated to fit. The budget is the max_num_batched_tokens parameter.
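The budget accounting can be sketched in a few lines (simplified: decodes claim their tokens first, prefills share what's left; the parameter names mirror vLLM's knobs but the function itself is our illustration):

```python
def plan_step(decode_reqs, prefill_reqs, max_num_batched_tokens=2048,
              long_prefill_token_threshold=512):
    """Token plan for one step: decodes take 1 token each; prefills are capped
    both by the threshold and by whatever budget the decodes left over."""
    budget = max_num_batched_tokens - len(decode_reqs)  # decodes are served first
    plan = {req: 1 for req in decode_reqs}
    for req, remaining in prefill_reqs.items():
        chunk = min(remaining, long_prefill_token_threshold, budget)
        if chunk <= 0:
            break
        plan[req] = chunk  # leftover prompt tokens spill to the next step
        budget -= chunk
    return plan

# 4 decode requests + a 10,000-token prompt: the prompt advances 512 tokens
# per step while the decodes never stall.
print(plan_step(["d1", "d2", "d3", "d4"], {"p1": 10_000}))
# → {'d1': 1, 'd2': 1, 'd3': 1, 'd4': 1, 'p1': 512}
```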

    Prefix Caching

    Most production deployments prepend the same large block of tokens to every request: a 3,000-token system prompt, a RAG document, an instruction template. Without prefix caching, every request recomputes these KV vectors from scratch. Prefix caching eliminates this waste.

    Content-addressed KV blocks

    vLLM maintains a hash map: cached_block_hash_to_block. When a block is full (all 16 slots filled) and its tokens won't change, it becomes cacheable. Its hash is stored as a key pointing to the physical block. On a new request, find_longest_cache_hit() walks tokens in 16-token chunks and checks the map - every hit means zero recomputation for those 16 tokens.

    Critical constraint: vLLM only caches full blocks. If block_size=16 and your system prompt is 100 tokens, the first 6 complete blocks (96 tokens) are cacheable. The last 4 tokens must always be recomputed. For maximum cache effectiveness, pad your system prompt to a multiple of 16.

    Chained block hashing

    The same 16 tokens ("the cat sat on") can appear at different positions and must produce different hashes - because KV vectors encode positional information. The hash is therefore chained:

    block_hash = hash((
        parent_hash,         # hash of the PREVIOUS block → encodes everything before this block
        tuple(token_ids),    # the exact 16 token IDs in this block
        extra_hashes,        # optional: lora_id, cache_salt for per-tenant isolation
    ))
    # Block 0: parent_hash = 0 (root)
    # Block 1: parent_hash = hash(block 0)
    # Block 2: parent_hash = hash(block 1)
    # → "the cat sat" at position 48 has a DIFFERENT hash than at position 0
    Fig 9 - Chained Block Hashing for Prefix Caching. Request 1 (3 system-prompt blocks + a user question): SP block 0 (tokens 0-15) hashes to H₀ = hash(0, [t0..t15]); SP block 1 (tokens 16-31) to H₁ = hash(H₀, [t16..t31]); SP block 2 (tokens 32-47) to H₂ = hash(H₁, [t32..t47]); the user-question block (tokens 48-63) is unique per request and not cached. Stored: { H₀ → Block#14, H₁ → Block#7, H₂ → Block#31 }; blocks are lazily evicted, so hash entries persist. Request 2 (same system prompt, different user question): SP blocks 0-2 all hit and reuse Blocks #14, #7, #31; only the new user question is recomputed. Result: 48 of 64 tokens are free - 75% of the prefill skipped.
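The chained hash and the lookup are easy to make concrete. A runnable sketch (the function names mirror the description above; vLLM's real implementation hashes extra metadata and manages eviction):

```python
def block_hashes(token_ids, block_size=16):
    """Chained hashes for every FULL block (partial tail blocks are never cached)."""
    hashes, parent = [], 0
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        parent = hash((parent, tuple(token_ids[start:start + block_size])))
        hashes.append(parent)
    return hashes

def find_longest_cache_hit(token_ids, cache, block_size=16):
    """Number of leading blocks whose chained hash is already cached."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hits += 1
    return hits

system_prompt = list(range(48))         # 3 full blocks of 16 tokens
cache = {h: f"block#{i}" for i, h in enumerate(block_hashes(system_prompt))}
request = system_prompt + [99, 98, 97]  # same prefix, new user question
print(find_longest_cache_hit(request, cache))  # → 3 blocks = 48 tokens reused
```

The chaining also gives positional uniqueness for free: repeating the same 16 tokens twice yields two different block hashes, because the second block's parent hash differs.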

    Speculative Decoding

    During decoding, at small batch sizes, the GPU spends most of its time loading model weights, not computing. For a 70B model in bf16, every generated token means streaming roughly 140 GB of weights from memory; as the tokens accumulate, that cost dominates end-to-end latency.

    Use a cheap draft model to propose k tokens, then let the large model verify all k simultaneously in one pass. Since the verify pass costs roughly the same as a single-token pass (same weight load), every accepted draft token is effectively free.

    Fig 11 - Speculative Decoding: Draft → Verify → Accept/Reject. Phase 1, draft (cheap: n-gram lookup or a tiny neural model such as EAGLE): from the context "The sky...", propose k = 3 tokens - "is", "blue", "today" - at the cost of k cheap passes, or zero extra work for n-gram. Phase 2, verify: the large model runs ONE forward pass over [context + k draft tokens], yielding k+1 distributions - p₁("is"), p₂("blue"), p₃("today") plus one free bonus token. Phase 3, accept/reject (statistically equivalent to pure large-model sampling): "is" (p_large = 0.72 ≥ p_draft = 0.65) is accepted; "blue" (p_large = 0.45 < p_draft = 0.60) is accepted with probability 0.75; "today" (p_large = 0.05, p_draft = 0.50) is rejected, and a correction is resampled from max(0, p−q) / Σmax(0, p−q), yielding "and". Final output: "is blue and".

    How to accept/reject

    For each draft token x, let p(x) = large model's probability and q(x) = draft model's probability:

    Case           | Rule                                      | Why
    p(x) ≥ q(x)    | Accept with probability 1                 | Large model agrees or prefers this token
    p(x) < q(x)    | Accept with probability p(x)/q(x)         | Draft was overconfident; partial acceptance corrects for the gap
    Rejected       | Resample from max(0, p−q) / Σmax(0, p−q)  | Residual distribution removes draft bias, restores correct marginals
    All k accepted | Sample one bonus token from p(x_{k+1})    | The verify pass already computed this distribution

    The final sequence is drawn from exactly the same distribution as pure autoregressive sampling from the large model.
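The accept/reject table can be exercised directly. A sketch of the verification rule (the bonus-token case is omitted for brevity; `verify` and its dict-based distributions are our illustration, not vLLM's sampler):

```python
def verify(draft_tokens, q, p, rng):
    """Accept/reject k draft tokens. q[i] / p[i] map token -> probability at
    position i under the draft and large model; rng() returns a float in [0, 1)."""
    out = []
    for i, x in enumerate(draft_tokens):
        if rng() < min(1.0, p[i].get(x, 0.0) / q[i][x]):
            out.append(x)  # accepted draft token: effectively free
            continue
        # Rejected: resample from the residual max(0, p - q), renormalized.
        residual = {t: max(0.0, p[i].get(t, 0.0) - q[i].get(t, 0.0)) for t in p[i]}
        r, acc = rng() * sum(residual.values()), 0.0
        for t, w in residual.items():
            acc += w
            if r <= acc:
                out.append(t)
                break
        break  # everything after a rejection is discarded
    return out

# The Fig 11 numbers: "is" and "blue" accepted, "today" rejected → "and" resampled.
drafts = ["is", "blue", "today"]
q = [{"is": 0.65}, {"blue": 0.60}, {"today": 0.50}]
p = [{"is": 0.72, "and": 0.28}, {"blue": 0.45, "and": 0.55},
     {"today": 0.05, "and": 0.95}]
print(verify(drafts, q, p, rng=lambda: 0.6))  # → ['is', 'blue', 'and']
```

With a fixed `rng` of 0.6, "is" passes outright (ratio capped at 1), "blue" passes its 0.75 acceptance test, and "today" fails its 0.1 test, so the correction token is drawn from the residual.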

    Draft strategies in vLLM V1

    • N-gram - Looks back at the last prompt_lookup_max tokens. If that n-gram appeared earlier in context, proposes the tokens that followed it. Zero extra model weight, zero extra GPU time. Acceptance: 30–50% for repetitive content (code, summaries).
    • EAGLE - Keep the large model's embeddings and LM head, replace the transformer stack with a 1–2 layer MLP fine-tuned to predict large model hidden states. Drafts by running this MLP over the large model's cached hidden states. No separate model file needed beyond a small adapter.
    • Medusa - Adds k extra linear "heads" on top of the large model's final hidden state, one per speculative position (+1, +2, ..., +k). All run in parallel in a single extra matmul. Baked into the main checkpoint. Slightly lower accuracy than EAGLE, but no architecture changes.
    When speculative decoding helps vs hurts: it helps most when the system is memory-bandwidth-bound - small batch sizes (1-16 requests), interactive chat, long outputs. The draft cost is tiny, the verify pass costs barely more than one token, and even 50% acceptance gives a significant speedup. It hurts at large batch sizes (100+ requests), where the system is already compute-bound and the draft overhead adds latency without enough acceptance gain to compensate.

    Conclusion

    These components aren't independent modules bolted together. They form a tight dependency chain where each layer enables the next:

    • Paged Attention is the enabler. By making KV blocks independently allocatable, it enables continuous batching, prefix caching, and dynamic preemption - all of which require allocating and freeing KV memory for individual requests at any time.
    • Continuous Batching is what makes paged attention's dynamic allocation useful at serving scale. Without it, blocks would still be under-utilized because requests sit idle in a static batch.
    • The Scheduler orchestrates both - every step, deciding which requests get blocks, who runs decode, who starts prefill, and who gets preempted. The slot_mapping and block tables it produces flow directly into the forward pass.
    • Chunked Prefill makes the scheduler fair across prompt lengths. Without it, a single 50k-token prompt would hold all decode responses hostage for seconds.
    • Prefix Caching exploits the paged block structure - blocks are exactly 16 tokens, which is also the granularity for content-addressable caching. The chained hash ensures positional uniqueness.
    • Guided Decoding operates orthogonally at the logit level. FSMs compile to bitmasks, applied before sampling. Compatible with all other techniques - adds a small masking overhead per decode step.
    • Speculative Decoding attacks the memory-bandwidth bottleneck that remains even after all the above. At small batch sizes, drafting + verifying in bulk extracts more tokens per weight load.
    What to read next: Aleksa Gordić's breakdown at aleksagordic.com/blog/vllm goes further into multi-GPU tensor parallelism, pipeline parallelism, and the distributed serving stack. The vLLM codebase at commit 42172ad is well-commented - start with vllm/engine/ and follow the LLMEngine → EngineCore → Scheduler chain.