Dynamic KV Cache compression based on the vLLM framework

Motivation

Drawing on academic work from the past year in the field of KV sparsity (H2O, SnapKV, PyramidKV), we apply KV sparsity to different layers of the model. The pruning strategy evicts KV pairs with lower attention scores while retaining those with higher scores and those closest to the current position. This reduces memory usage as well as computational and I/O overhead, ultimately accelerating inference.
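The eviction rule itself is compact. The sketch below is a minimal illustration of this style of score-based pruning; the function name, tensor shapes, and the window/budget split are our assumptions for exposition, not the production kernel code. The most recent tokens are always kept, and older KV pairs survive only if their accumulated attention scores rank among the top entries.

```python
import torch

def prune_kv_cache(keys, values, attn_scores, window: int, budget: int):
    """Illustrative score-based KV eviction in the spirit of H2O / SnapKV.

    keys, values: [seq_len, num_heads, head_dim]
    attn_scores:  [seq_len] accumulated attention weight each KV position
                  has received from recent queries (scores would be
                  gathered with the same indices in a real system).
    """
    seq_len = keys.shape[0]
    if seq_len <= window + budget:
        return keys, values  # nothing to evict yet

    # The most recent `window` tokens are always retained
    # (the "closer proximity" part of the strategy).
    recent = torch.arange(seq_len - window, seq_len)

    # From the older region, retain the `budget` highest-scoring tokens.
    old_scores = attn_scores[: seq_len - window]
    top_old = torch.topk(old_scores, k=budget).indices

    keep = torch.cat([top_old.sort().values, recent])
    return keys[keep], values[keep]
```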

Experiments

Baselines and settings: We run all KV compression experiments using our vLLM integration forked from v0.6.2, running in CUDA graph mode with a block size of 16. For all RTX 4090 / Llama-3.1-8B-Instruct experiments, we use the default GPU memory utilization of 0.9 and set max-model-len to 32k. We evaluate our compression on Llama-3.1-8B-Instruct, comparing performance against the following baselines introduced in prior work (a configuration sketch follows the list):

  • vLLM-0.6.2

  • Novita AI: Pyramid KV Cache compression based on the vLLM framework
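For reference, the baseline setup above maps onto the standard vLLM Python API roughly as follows. This is only a sketch of the configuration; the compression-specific options live in our fork and are omitted here.

```python
from vllm import LLM, SamplingParams

# Baseline evaluation setup: argument names follow the public vLLM API.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # default utilization used in our runs
    max_model_len=32768,         # 32k context
    block_size=16,               # paged-attention block size
    enforce_eager=False,         # keep CUDA graph mode enabled
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```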

MMLU Pro and LongBench: We control the KV Cache compression ratio by setting different sliding-window lengths on different layers. In the experiments we mainly use three sliding-window lengths (1024, 1280, and 1536) and cross-test them against different numbers of sparsified layers; a back-of-the-envelope sketch of the resulting compression ratio follows the table below.

MMLU Pro: In the MMLU Pro test, different numbers of KV sparsity layers and different sliding-window lengths produce different results. Taking the acceleration ratio into consideration, overall accuracy can be kept above 98% of the baseline.

| KV sparsity layers | vllm-0.6.2 (full cache) | Novita AI (sliding window=1536) | Novita AI (sliding window=1280) | Novita AI (sliding window=1024) |
|---|---|---|---|---|
| full | - | 0.4496 | 0.4479 | 0.4349 |
| 22 | 0.4517 | 0.4496 | 0.4479 | 0.4349 |
| 26 | 0.4517 | 0.4476 | 0.4449 | 0.4377 |
| 31 | 0.4517 | 0.4476 | 0.4403 | 0.4310 |
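To make the compression ratio concrete, here is our own back-of-the-envelope calculation (an illustration only, ignoring any scored heavy-hitter budget): with a sliding window of w tokens on s of the model's 32 layers and the remaining layers keeping the full cache, the fraction of KV cache retained at sequence length n is roughly (s * min(w, n) + (32 - s) * n) / (32 * n).

```python
def kv_retention(seq_len: int, window: int, sparse_layers: int,
                 total_layers: int = 32) -> float:
    """Approximate fraction of the KV cache kept when `sparse_layers` of
    `total_layers` use a sliding window and the rest keep the full cache.
    Illustrative arithmetic only."""
    kept = (sparse_layers * min(window, seq_len)
            + (total_layers - sparse_layers) * seq_len)
    return kept / (total_layers * seq_len)

# e.g. a 1024-token window on 26 of 32 layers at 32k context
print(f"{kv_retention(32768, 1024, 26):.1%}")  # ~21.3% of the cache kept
```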

LongBench: In the LongBench test, we selected a sliding window of 1024 for performance testing and found that the accuracy loss was about 1.03%.

| LongBench task | vllm-base | Novita AI |
|---|---|---|
| narrativeqa | 30.18 | 30.42 |
| qasper | 44.74 | 44.98 |
| multifieldqa_en | 52.84 | 52.37 |
| hotpotqa | 54.92 | 54.27 |
| 2wikimqa | 45.7 | 44.75 |
| musique | 28.41 | 30.69 |
| gov_report | 34.47 | 28.99 |
| qmsum | 25.46 | 24.77 |
| multi_news | 26.98 | 25.72 |
| trec | 72.5 | 71.5 |
| triviaqa | 91.65 | 91.61 |
| samsum | 43.76 | 43.09 |
| passage_count | 6.83 | 7.55 |
| passage_retrieval_en | 99.5 | 99.5 |
| lcc | 63.42 | 62.84 |
| repobench-p | 56.51 | 55.89 |
| avg | 49.5 | 48.99 |

Throughput benchmarks: In real-world LLM applications, an input/output length of 5000/500 is the most commonly observed configuration, and TTFT (time to first token) must stay below 2 s. Under these conditions, we ran batched performance comparisons and measured a 1.5x inference speedup over vLLM; a TTFT measurement sketch follows the table.

Throughput (normalized to vllm-0.6.2 = 1):

| KV sparsity layers | vllm-0.6.2 | Novita AI (sliding window=1536) | Novita AI (sliding window=1280) | Novita AI (sliding window=1024) |
|---|---|---|---|---|
| full | 1 | 1.26 | 1.31 | 1.38 |
| 22 | 1 | 1.34 | 1.47 | 1.49 |
| 26 | 1 | 1.44 | 1.53 | 1.58 |
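To check the TTFT constraint yourself, you can stream from a vLLM OpenAI-compatible server and time the first token. The snippet below is an assumed setup (the local server address, prompt construction, and model name are placeholders), not our internal benchmark harness:

```python
import time
from openai import OpenAI

# Assumes a server started with: vllm serve meta-llama/Llama-3.1-8B-Instruct
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "word " * 5000  # roughly the 5000-token input of the scenario above
start = time.perf_counter()
stream = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt=prompt,
    max_tokens=500,
    stream=True,
)
for chunk in stream:
    ttft = time.perf_counter() - start  # time until the first streamed token
    print(f"TTFT: {ttft:.2f}s")
    break
```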

Major changes

The modified files mainly include the following (a simplified sketch of the scoring logic follows the list):

  • Flash attention: sparse scoring implemented on top of FlashAttention, while keeping the kernel performance loss under 1%.

  • Paged attention and reshape_and_cache: sparse scoring implemented on top of PagedAttention, with the scores kept synchronized between the prefill and decode stages.

  • Block_manager and other functions related to memory management and tensor preparation.
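Conceptually, the sparse scoring added to these kernels accumulates how much attention mass each KV position receives and aggregates it per paged-attention block, so the block manager can later free the lowest-scoring blocks. A simplified, kernel-free sketch follows; the function name and block-level aggregation are our assumptions, not the fork's actual kernels:

```python
import torch

def update_block_scores(attn_weights: torch.Tensor,
                        block_scores: torch.Tensor,
                        block_size: int = 16) -> torch.Tensor:
    """attn_weights: [num_heads, q_len, kv_len] softmax weights from one step.
    block_scores:  running score per KV block, shape [kv_len // block_size].
    Returns updated scores; the block manager would evict the lowest ones.
    """
    kv_len = attn_weights.shape[-1]
    assert kv_len % block_size == 0, "kv_len must align to the block size"

    # Total attention mass each KV position received in this step.
    per_token = attn_weights.sum(dim=(0, 1))             # [kv_len]
    # Aggregate token scores into paged-attention blocks of 16 tokens.
    per_block = per_token.reshape(-1, block_size).sum(dim=1)
    return block_scores + per_block
```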

Conclusion

Novita AI also supports tensor parallelism, enabling models such as Llama3-70B to run on multiple GPUs. For the moment we cannot open-source the code, but we hope to contribute some of these techniques and ideas to the community, and we welcome technical exchanges with everyone. Just a heads-up: the following features aren't supported yet in vLLM-0.6.2:

  • Chunked-prefill

  • Prefix caching

  • FlashInfer and other non-FlashAttention backends

  • Speculative Decoding

Originally published at Novita AI

Novita AI is an all-in-one cloud platform that empowers your AI ambitions. With integrated APIs, serverless computing, and GPU instances, it offers the cost-effective tools you need. Eliminate infrastructure overhead, start for free, and make your AI vision a reality.

Recommended reading

  1. How KV Sparsity Achieves 1.5x Acceleration for vLLM

  2. Unlock Llama 3–8b Zero-Shot Chat: Expert Tips and Techniques

  3. Utilize Clipboard Conqueror with Novita AI API Key for Developer Productivity