kv-cache-compression

OnlyTerp / turboquant

First open-source implementation of Google TurboQuant (ICLR 2026) -- near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.

machine-learning compression deep-learning pytorch transformer attention quantization iclr vector-quantization memory-optimization kv-cache google-research llm vllm llm-inference kv-cache-compression

Updated May 25, 2026
Python

snu-mllab / Context-Memory

PyTorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24)

efficient-llm-inference context-compression kv-cache-compression

Updated Apr 18, 2024
Python

JIA-Lab-research / Q-LLM

This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"

fast-inference inference-acceleration large-language-models long-context kv-cache-compression

Updated Jul 16, 2024
Python

abdelfattah-lab / xKV

xKV: Cross-Layer SVD for KV-Cache Compression

mla low-rank long-context llm-inference deepseek kv-cache-compression inter-layer

Updated May 27, 2026
Python

Linking-ai / SCOPE

(ACL 2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation

long-context kv-cache-compression kvcache

Updated May 28, 2025
Jupyter Notebook

Janghyun1230 / FastKVzip

Accurate and fast KV cache compression with a gating mechanism

large-language-models kv-cache-compression

Updated Apr 5, 2026
Python

aivrar / vllm-windows-build

Native Windows build of vLLM 0.21.0 (no WSL, no Docker). Now for RTX 50-series (Blackwell, sm_120): Python 3.13 + CUDA 12.8 + PyTorch 2.11. Includes a pre-built wheel and Windows patch, 10 KV-cache compression dtypes, and an OpenAI API server fixed to run on Windows.

Updated May 26, 2026
Python

OnlyTerp / kvtc

First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA + adaptive quantization + entropy coding

compression pytorch nvidia transformer pca attention dynamic-programming quantization deflate entropy-coding memory-optimization kv-cache llm llm-inference kv-cache-compression iclr-2026

Updated Apr 17, 2026
Python

MAC-AutoML / Awesome-Efficient-Large-Models

A list of awesome papers on compression and acceleration of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).

acceleration compression survey pruning quantization knowledge-distillation awesome-papers large-language-models multimodal-large-language-models speculative-decoding kv-cache-compression

Updated May 12, 2026

AMD-AGI / AMD-Hybrid-Models

Official repo for the training and inference workflow of AMD hybrid models

amd attention-mechanism mamba mla hybrid-models kv-cache large-language-models llm kv-cache-compression zebra-llama

Updated May 14, 2026
Python

FluffyAIcode / LLM-KV--Cache-compress

Discrete Kakeya cover for LLM KV cache: D4/E8 nested-lattice quantisation realising a Kakeya-style tube-cover over the direction sphere. 2.4x-2.8x compression at <1% perplexity loss on Qwen3, Llama-3, DeepSeek, GLM-4, Gemma. Drop-in transformers.DynamicCache. pip install kakeyalattice.

transformers quantization discrete-geometry kv-cache long-context vllm llm-inference kv-cache-compression qwen3 lattice-quantization e8-lattice d4-lattice kakeya kakeya-set

Updated Apr 30, 2026
Python

MGDDestiny / Lava

LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation

llm kv-cache-compression

Updated Sep 17, 2025
Python

NAME0x0 / AVA

Research and training stack for AVA — a tool-using, memory-aware virtual assistant targeting 4 GB VRAM. Spans custom transformers, verifier-RL, external memory, multi-domain benchmarks, and Gemma 4 inference optimization.

Updated May 20, 2026
Python

Here are 34 public repositories matching this topic...

Zefan-Cai / KVCache-Factory

NVIDIA / kvpress

Zefan-Cai / Awesome-LLM-KV-Cache

AtomicBot-ai / atomic-llama-cpp-turboquant

snu-mllab / KVzip

itsnamgyu / block-transformer

shadowpa0327 / Palu

OnlyTerp / turboquant

snu-mllab / Context-Memory

JIA-Lab-research / Q-LLM

abdelfattah-lab / xKV

Linking-ai / SCOPE

Janghyun1230 / FastKVzip

aivrar / vllm-windows-build

OnlyTerp / kvtc

MAC-AutoML / Awesome-Efficient-Large-Models

AMD-AGI / AMD-Hybrid-Models

FluffyAIcode / LLM-KV--Cache-compress

MGDDestiny / Lava

NAME0x0 / AVA
