Paged Attention: Turning the Page on Transformer Memory

Systems · LLM Inference · Memory Management


How ideas borrowed from operating systems - paging and virtual memory - revolutionize KV cache management for large-scale LLM serving.

By Sai Saketh  ·  December 19, 2025  ·  9 min read

KV Cache and the Hidden Cost of Long-Context LLM Inference

Large Language Models based on the transformer architecture have become an integral part of modern systems, powering applications ranging from search and recommendation to conversational AI and code generation. Despite their success, serving these models efficiently at scale remains a major systems challenge. One of the most critical and often misunderstood bottlenecks is the management of the Key Value cache, commonly referred to as the KV cache.

This article is based on the paper "Efficient Memory Management for Large Language Model Serving with Paged Attention", which introduces a novel systems-level approach to managing the Key Value cache during transformer inference. The paper demonstrates how ideas borrowed from operating systems, such as paging and virtual memory, can be applied to attention mechanisms to significantly improve memory efficiency and throughput when serving large language models at scale.

What is KV Cache?

Figure 1 - The Key Value Cache in a Transformer
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where q = W_q·x is the query (the new token in this decoder step), k = W_k·x is the key (the previous context the model should attend to), and v = W_v·x gives the values (combined as a weighted sum over the previous context).

During transformer inference, the Key Value cache is a critical optimization that enables efficient token generation by storing the attention keys and values computed for previously processed tokens at every transformer layer. In the self-attention mechanism, each token produces a query, key, and value vector, but only the query is specific to the current decoding step. The keys and values associated with past tokens remain invariant once computed and are reused for all subsequent steps. By caching these tensors, the model avoids recomputing attention states for the entire prefix at every iteration, reducing the per-token computation from quadratic to linear complexity with respect to sequence length.

The KV cache grows monotonically as generation progresses and is maintained per request, per layer, and per attention head. Its memory footprint is therefore proportional to the number of layers, hidden dimension, sequence length, numerical precision, and batch size. While this design dramatically improves computational efficiency, it introduces a significant memory burden, often exceeding the size of the model weights themselves for long sequences or large batches.

For a more in-depth look, check out my article on encoders.

Problem at hand

Transformers rely on self-attention to allow each token to attend to all previously generated tokens. During inference, this creates a fundamental limitation: the more text the model generates, the more memory it consumes, until eventually it runs out of memory.

Figure 2 - KV Cache is not sufficient for the maximum sequence length

In Figure 2, we can see a real-life example from my own work, where I attempted to benchmark the LLaMA 3.1 8B model on an NVIDIA RTX 4090 using MLPerf, but the 24 GB of VRAM wasn't sufficient for a sequence of that length.

Looking at OpenAI's pricing model (Figure 3), you can notice something interesting: they charge twice as much for the longer-context model. This is one economic consequence of the high memory usage involved in handling large context lengths.

Figure 3 - Pricing for an 8k-context vs a 32k-context model, inputs and outputs

Once the Key and Value tensors are computed for a particular token in a sequence, they remain unchanged for the rest of the generation process, regardless of how many additional tokens are produced. In principle, this means the model should be able to reuse this information indefinitely. However, without an explicit caching mechanism, a transformer would be forced to recompute the Keys and Values for all previous tokens at every decoding step. This repeated computation constitutes a significant portion of the inference cost and quickly becomes prohibitively slow as sequence length grows.

An intuitive way to understand this is to imagine writing a document one word at a time. If, before writing each new word, you had to reread and mentally reconstruct everything you had written so far, progress would be extremely slow. A far more efficient approach is to remember what has already been written and focus only on adding the next word. The KV cache plays exactly this role in transformer inference.

The KV cache stores the attention Keys and Values that have already been computed for previous tokens. When the model generates a new token, it computes only the Key and Value vectors corresponding to that token and appends them to the existing cache.
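The append-and-attend loop described above can be sketched in a few lines. This is a minimal single-head NumPy illustration, not vLLM code; all names (`d_k`, `W_q`, `W_k`, `W_v`, `decode_step`) are hypothetical.

```python
# Minimal sketch of decode steps with a KV cache (NumPy, single head).
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_k, d_k)) for _ in range(3))

k_cache, v_cache = [], []          # grows by one entry per generated token

def decode_step(x):
    """x: hidden state of the newest token, shape (d_k,)."""
    q = x @ W_q
    k_cache.append(x @ W_k)        # only the NEW token's K and V are computed
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)          # (seq_len, d_k), reused from the cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d_k)  # attention over the whole cached prefix
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V             # weighted sum over previous context

for t in range(5):                 # five decode steps: cache grows from 1 to 5
    out = decode_step(rng.standard_normal(d_k))
```

Note that each step does work linear in the prefix length (one matmul per cached key), rather than recomputing every past token's K and V.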

While the KV cache significantly reduces computational overhead, it introduces a substantial memory cost. The memory required to store the cache can be approximated by the following expression:

2 × Precision × Number_Of_Layers × d_model × Seq_length × Batch_size
  • Factor of 2: Accounts for storing both the Key and Value tensors in the attention mechanism.
  • Precision: Number of bytes used per parameter, for example 4 bytes for FP32 or 2 bytes for FP16.
  • Number of layers (Number_Of_Layers): Determined by the transformer architecture and fixed for a given model.
  • Model dimension (d_model): The embedding or hidden dimension defined by the model architecture.
  • Sequence length (Seq_length): The number of tokens processed or generated, which varies with the input and output length.
  • Batch size (Batch_size): The number of sequences processed in parallel during inference, determined by the workload and system capacity.

To illustrate the magnitude of this cost, consider the OPT-30B model. Using FP16 inference, with 48 transformer layers, a model dimension of 7168, a sequence length of 1024 tokens, and a batch size of 128, the KV cache alone requires approximately 180 GB of memory. In contrast, the model weights themselves occupy roughly 60 GB.

Key Insight: For OPT-30B at batch size 128, the KV cache is ≈ 180 GB while the model weights are ≈ 60 GB. The cache is 3× larger than the model itself.
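Plugging the OPT-30B numbers into the formula above confirms the figure; a quick sanity check (the helper name `kv_cache_bytes` is mine):

```python
# KV cache size, following the formula: 2 x precision x layers x d_model x seq x batch
def kv_cache_bytes(precision, n_layers, d_model, seq_len, batch):
    return 2 * precision * n_layers * d_model * seq_len * batch

cache = kv_cache_bytes(precision=2,          # FP16: 2 bytes per value
                       n_layers=48, d_model=7168,
                       seq_len=1024, batch=128)
print(cache / 1e9)   # 180.38..., i.e. roughly 180 GB, ~3x the ~60 GB of weights
```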

Paged Attention to the rescue

Serving large language models is expensive even on high-end GPUs such as the NVIDIA A100. Despite their computational power, a single GPU can handle only a limited number of requests per second. For example, with LLaMA-13B and moderately sized inputs, a single A100 typically achieves less than one request per second. As a result, production deployments require large GPU fleets to meet throughput demands, making cost and scalability a critical pain point for many organizations.

Now, to understand what exactly the issue is here, let's take a look at the steps involved in inference (Figure 4).

Figure 4 - Inference Process of LLMs
(Figure content: the prompt "Artificial Intelligence is" flows through layers 1 through N to produce "the future", repeating until the sequence reaches its predefined maximum length, e.g. 2048 tokens, or the model generates an end-of-sequence token.)
  • 1
    First the user provides an input (prompt) having multiple tokens.
  • 2
    The prompt is processed by the model to generate the next token.
  • 3
    The generated token is appended to the input sequence and fed back into the model.
  • 4
    The model produces the subsequent token based on the updated sequence.
  • 5
    This iterative process continues until the model reaches a predefined maximum sequence length, such as 2048 tokens, or generates an end-of-sequence token.
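The five steps above can be sketched as a plain Python loop; `model_step`, the toy model, and the token ids are all illustrative stand-ins, not a real inference API.

```python
# Sketch of the iterative decoding loop described above.
MAX_LEN, EOS = 2048, -1

def generate(prompt_ids, model_step):
    seq = list(prompt_ids)                 # step 1: user-provided prompt tokens
    while len(seq) < MAX_LEN:
        nxt = model_step(seq)              # steps 2 and 4: predict the next token
        seq.append(nxt)                    # step 3: append and feed back
        if nxt == EOS:                     # step 5: stop on end-of-sequence
            break
    return seq

# toy model: emits token 7 three times, then an end-of-sequence token
counter = {"n": 0}
def toy(seq):
    counter["n"] += 1
    return 7 if counter["n"] < 4 else EOS

result = generate([1, 2, 3], toy)
print(result)   # [1, 2, 3, 7, 7, 7, -1]
```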

The KV cache is used in this inference process as explained above. Efficient management of the KV cache is therefore essential for high-throughput LLM serving.

Memory waste in previous systems

Figure 5 - Memory Waste in Previous Systems
(Figure content: request A's pre-allocated cache holds 3 KV slots for the prompt and 2 for generated tokens, with 3 more slots reserved for future use and 2040 slots never used (internal fragmentation); the gap between requests A and B is external fragmentation.)

Looking at the layout of memory in older systems, we can identify 3 main issues.

  • Internal fragmentation - slots are pre-allocated for the maximum possible output but never used, because we don't know in advance how many tokens the model will generate.
  • Reservation - slots that are not used right now but are held for tokens the request will generate later, so no other request can use them.
  • External fragmentation - gaps left between allocations, because different requests A and B may have different sequence lengths.

According to the authors' measurements, only 20%–40% of KV cache memory is actually used to store token states; the rest is wasted.

vLLM solves this issue by introducing a new method of memory management, modeled on the tried-and-true paging and virtual-memory techniques used in operating systems: paged memory.

Paged Memory

vLLM addresses this problem by applying ideas that have existed in operating systems for decades, specifically paging and virtual memory.

Instead of allocating one large contiguous KV cache per request, vLLM partitions the KV cache into fixed-size blocks called KV blocks. Each block stores the Key and Value states for a fixed number of tokens.

vLLM introduces a level of indirection between logical and physical KV blocks.

  • Logical KV blocks represent the sequence order of tokens for a request.
  • Physical KV blocks represent actual memory locations on the GPU.
Figure 6 - Paged Memory in Practice: OS Virtual Memory vs vLLM KV Cache
(Figure content: in an OS, processes A and B map pages 0 through 4 into physical memory; in vLLM, requests A and B analogously map KV blocks 0 through 4 into the KV cache.)

The mapping between logical and physical blocks is maintained in a block table, analogous to a page table in an operating system.

This design allows logical token order to be preserved even when physical memory is fragmented.

These blocks are allocated on demand rather than being pre-allocated upfront.
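A minimal sketch of an on-demand block table, assuming a toy free list and a block size of 4 tokens for readability (the paper's typical block sizes are 16 or 32); the class and field names are illustrative, not vLLM's implementation.

```python
# Sketch of a block table with on-demand physical block allocation.
BLOCK_SIZE = 4                          # tokens per KV block (toy value)

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block ids
        self.table = []                 # logical block index -> physical block id
        self.n_tokens = 0

    def append_token(self):
        if self.n_tokens % BLOCK_SIZE == 0:      # last block full (or no block yet):
            self.table.append(self.free.pop())   # allocate a physical block on demand
        self.n_tokens += 1

req = BlockTable(free_blocks=[7, 1, 5, 4, 3])   # physical ids need not be contiguous
for _ in range(9):                               # 9 tokens -> ceil(9/4) = 3 blocks
    req.append_token()
print(req.table)   # [3, 4, 5]: logical order preserved, physical ids arbitrary
```

The key property is that `req.table` keeps the logical order of the sequence even though the physical ids it holds came from wherever the free pool happened to have space.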

Figure 7 - Paged Memory: Logical KV Blocks Mapped to Physical KV Blocks via Block Table
(Figure content: request A's logical KV blocks, "Alan Turing is a" / "computer scientist and mathematician" / "renowned", are mapped through a block table recording each physical block number and how many of its slots are filled, onto non-contiguous physical blocks 7, 1, and 5; new physical blocks are allocated on demand.)

Paged Attention

Paged Attention introduces a different approach to managing attention memory by abandoning the assumption that all Key Value states must be stored in a single contiguous block of memory. Traditional attention implementations rely on contiguous KV storage, which works well when input and output lengths are known and static. However, this approach becomes highly inefficient for LLM inference, where sequence lengths are dynamic and unpredictable.

To address this, Paged Attention partitions the KV cache into fixed-size blocks and virtualizes them into logical and physical views. This design decouples the logical token order from the physical memory layout and enables on-demand memory allocation.

Figure 8 - Paged Attention
(Figure content: requests A ("Alan Turing is a computer scientist and mathematician renowned") and B ("Artificial Intelligence is the future of technology") each use their own block table to map logical KV blocks into a shared pool of physical KV blocks.)

For example, consider a request with the prompt: "The ".

  • The logical KV blocks represent the token sequence in order, with tokens mapped to consecutive logical blocks to preserve their natural order.
  • The physical KV blocks represent actual memory locations, which may be allocated in any order and are not required to be contiguous.
  • A block table maintains the mapping between logical KV blocks and physical KV blocks, functioning similarly to a page table in an operating system.
  • When the final logical block becomes full, a new physical block is allocated and mapped on demand, rather than being pre-allocated upfront as in traditional systems.

This block-based and virtualized design allows KV cache memory to grow dynamically with the sequence length while avoiding the inefficiencies of large contiguous allocations.
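To make the virtualization concrete, here is a sketch of how a sequence's keys could be gathered in logical order from non-contiguous physical blocks at attention time; the 3-D NumPy array stands in for GPU block storage, and all names are illustrative.

```python
# Sketch: reading a sequence's K (or V) vectors from non-contiguous
# physical KV blocks via its block table.
import numpy as np

BLOCK_SIZE, D = 4, 8                    # toy block size and head dimension
N_PHYSICAL = 8                          # size of the physical block pool
physical_kv = np.arange(N_PHYSICAL * BLOCK_SIZE * D, dtype=np.float32)
physical_kv = physical_kv.reshape(N_PHYSICAL, BLOCK_SIZE, D)

def gather_keys(block_table, n_tokens):
    """Return keys for tokens 0..n_tokens-1 in logical order."""
    blocks = physical_kv[block_table]        # fancy-index: (n_blocks, BLOCK_SIZE, D)
    return blocks.reshape(-1, D)[:n_tokens]  # drop the final block's unfilled slots

K = gather_keys(block_table=[6, 2, 5], n_tokens=9)
print(K.shape)   # (9, 8): 9 tokens in logical order despite scattered storage
```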

Memory efficiency of paged attention

Paged Attention significantly improves memory efficiency by minimizing both internal and external fragmentation.

  • Internal fragmentation is limited to the final block of each sequence, since all preceding blocks are fully utilized. The amount of wasted space per sequence is therefore bounded by the block size. In practice, block sizes of 16 or 32 tokens are commonly used, which is orders of magnitude smaller than typical sequence lengths. As a result, internal fragmentation is negligible.
  • External fragmentation is completely eliminated because all KV blocks are of fixed size. This uniformity allows memory to be allocated and reused efficiently without gaps caused by variable-length allocations.
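A quick worked comparison of the two regimes, assuming a 2048-token maximum sequence length and a 16-token block size:

```python
# Wasted slots for one request: contiguous pre-allocation vs. fixed-size KV blocks.
MAX_SEQ, BLOCK_SIZE = 2048, 16
generated = 50                            # tokens this request actually produced

contiguous_waste = MAX_SEQ - generated    # slots reserved up front, never used
paged_waste = -generated % BLOCK_SIZE     # only the final block's unfilled tail
print(contiguous_waste, paged_waste)      # 1998 vs 14
```

Per sequence, paged waste is bounded by `BLOCK_SIZE - 1` slots regardless of how long the sequence runs, whereas contiguous pre-allocation wastes everything up to the maximum length.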

Data collected by the authors show that wasted KV cache space is reduced to less than 4%, a 3x–5x improvement in memory utilization over prior systems. This enables larger batch sizes and directly leads to higher throughput during LLM inference.

Result: KV cache waste reduced to <4%, a 3x–5x improvement over prior systems. Larger batch sizes, higher throughput.

Shared memory

Figure 9 - A Single Prompt may Produce Multiple Outputs (Parallel Sampling)
(Figure content: in parallel sampling, the prompt "The future of cloud computing is" is processed once and shared between sequences, while the LLM produces multiple candidate continuations such as "bright and poised for further growth and transformation...", "intertwined with the advancement of artificial intelligence (AI)...", and "likely to be characterized by several key trends...".)

Paged Attention also enables efficient sharing of KV blocks across multiple sequences through its dynamic block mapping mechanism. When multiple generation paths originate from the same prompt, their initial Key Value states are identical and do not need to be recomputed or duplicated in memory. Instead, these shared states can be stored once and referenced by multiple sequences.

Consider a prompt such as "The future of cloud computing is", where the model generates multiple candidate continuations for selection.

  • 1
    The prompt is processed once, and the resulting KV blocks are allocated in physical memory.
  • 2
    Multiple logical sequences are created, all initially pointing to the same physical KV blocks corresponding to the shared prompt.
  • 3
    Each physical KV block maintains a reference count indicating how many logical sequences currently map to it.
  • 4
    When sequence A generates a new token that differs from the others, it detects that the corresponding physical KV block is shared.
  • 5
    The reference count for that block is decremented, and a new physical KV block is allocated using copy-on-write.
  • 6
    Sequence A writes its newly generated Key and Value tensors into this new physical block.
  • 7
    When sequence B generates its next token and finds that the mapped physical KV block has a reference count of one, it writes the new KV state directly into the existing block without copying.
  • 8
    Generation continues, with only the different blocks being unique while all shared prefix blocks remain common across sequences.
Figure 10 - Sharing of KV Cache: Two Sequences Share the Same Prompt Blocks
(Figure content: sequences A and B, each with its own block table, share the physical KV blocks holding the prompt "The future of cloud computing is", while their divergent continuations, "bright and" versus "intertwined with", occupy separate physical blocks.)

This significantly reduces memory usage for long prompts and directly increases throughput by freeing GPU memory for larger batch sizes.
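The reference-counting and copy-on-write steps listed above can be sketched as follows; the `BlockPool` class and its dict-based bookkeeping are illustrative, not vLLM's implementation.

```python
# Sketch of copy-on-write sharing of KV blocks with reference counts.
class BlockPool:
    def __init__(self):
        self.next_id = 0
        self.refcount = {}              # physical block id -> number of mappers

    def alloc(self):
        bid, self.next_id = self.next_id, self.next_id + 1
        self.refcount[bid] = 1
        return bid

    def share(self, bid):
        self.refcount[bid] += 1         # another sequence maps the same block

    def write(self, bid):
        """Return the block id a sequence may safely write into."""
        if self.refcount[bid] == 1:     # sole owner: write in place
            return bid
        self.refcount[bid] -= 1         # shared: copy-on-write to a fresh block
        return self.alloc()

pool = BlockPool()
prompt_block = pool.alloc()             # step 1: prompt KV stored once
pool.share(prompt_block)                # steps 2-3: sequence B maps the same block
a_block = pool.write(prompt_block)      # steps 4-6: A sees refcount > 1, gets a copy
b_block = pool.write(prompt_block)      # step 7: refcount is now 1, B writes in place
print(a_block, b_block)                 # 1 0: A got a new block, B kept the original
```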

This sharing mechanism naturally extends to more complex decoding strategies, such as beam search, where multiple hypotheses share substantial portions of their prefix.

That's all for this article!

Paged Attention is one of the most elegant applications of operating-systems thinking to deep learning infrastructure. By treating the KV cache like virtual memory, complete with logical blocks, physical blocks, block tables, and copy-on-write semantics, vLLM achieves dramatically better memory utilization and throughput.

This article is based on the paper "Efficient Memory Management for Large Language Model Serving with Paged Attention" by Kwon et al. Be sure to check out the original paper for full experimental results and implementation details.