Motivation
Drawing on academic work on KV sparsity from the past year (H2O, SnapKV, PyramidKV), we apply KV sparsity to different layers of the model. Using a pruning strategy, we evict KV pairs with low scores while retaining those with high scores and those closest to the current position. This reduces memory usage as well as computation and I/O overhead, ultimately accelerating inference.
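To make the idea concrete, here is a minimal, simplified sketch of score-based eviction in the spirit of H2O/SnapKV. It is an illustration only, not our kernel-level implementation, and all function and variable names are invented for this example.

```python
# Toy sketch of score-based KV eviction (H2O/SnapKV-style), not the actual kernels.
# Scores = accumulated attention weight per cached token; recent tokens are always kept.
import numpy as np

def evict_kv(attn_weights: np.ndarray, keep_budget: int, recent_window: int) -> np.ndarray:
    """Return the indices of KV positions to keep.

    attn_weights:  [num_queries, num_kv] softmax attention weights observed so far.
    keep_budget:   total number of KV positions to retain.
    recent_window: number of most recent positions that are always retained.
    """
    num_kv = attn_weights.shape[1]
    if num_kv <= keep_budget:
        return np.arange(num_kv)

    scores = attn_weights.sum(axis=0)                 # accumulated score per KV position
    recent = np.arange(num_kv - recent_window, num_kv)
    older = np.arange(num_kv - recent_window)

    # Keep the highest-scoring older positions up to the remaining budget.
    remaining = max(keep_budget - recent_window, 0)
    top_older = older[np.argsort(scores[older])[::-1][:remaining]]
    return np.sort(np.concatenate([top_older, recent]))

# Example: keep 8 of 16 cached positions, always retaining the last 4.
w = np.random.rand(4, 16)
w /= w.sum(axis=1, keepdims=True)
print(evict_kv(w, keep_budget=8, recent_window=4))
```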
Experiments
Baselines and Settings: We run all KV-Compress experiments using our vLLM integration forked from v0.6.2, running in CUDA graph mode with a block size of 16. For all RTX 4090 / Llama-3.1-8B-Instruct experiments, we use the default GPU memory utilization of 0.9 and set max-model-len to 32k. We evaluate our compression on Llama-3.1-8B-Instruct, comparing performance across the following configurations:
vLLM-0.6.2
Novita AI: Pyramid KV cache compression built on the vLLM framework
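For reference, the settings above map roughly onto the following stock vLLM initialization. This sketch uses only the upstream v0.6.x Python API and does not include our sparsity patches, so the exact flags in our fork may differ.

```python
# Rough sketch of the serving configuration described above, using the standard
# vLLM Python API (v0.6.x); our forked build adds the KV sparsity options on top.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,   # default GPU memory utilization
    max_model_len=32768,          # 32k context length
    block_size=16,                # KV cache block size
    enforce_eager=False,          # keep CUDA graph mode enabled
)

out = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```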
MMLU Pro and LongBench: We control the KV cache compression ratio by setting different sliding-window lengths on different layers. In our experiments we mainly use three sliding-window lengths (1024, 1280, and 1536) and cross-test them against different numbers of KV sparsity layers.
MMLU Pro
In the MMLU Pro test, different numbers of KV sparsity layers and different sliding-window lengths yield different results. Taking the acceleration ratio into account, overall accuracy can be kept above 98% of the uncompressed baseline.
| KV sparsity layers | vLLM-0.6.2 (full KV cache) | Novita AI, sliding window=1536 | Novita AI, sliding window=1280 | Novita AI, sliding window=1024 |
| --- | --- | --- | --- | --- |
| 22 | 0.4517 | 0.4496 | 0.4479 | 0.4349 |
| 26 | 0.4517 | 0.4476 | 0.4449 | 0.4377 |
| 31 | 0.4517 | 0.4476 | 0.4403 | 0.4310 |
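As a rough back-of-the-envelope check on what these sliding windows mean for KV cache size (assuming the 32 layers of Llama-3.1-8B and ignoring block-granularity rounding):

```python
# Estimate the fraction of KV cache retained when `sparse_layers` of `total_layers`
# keep only a sliding window of `window` tokens out of a `context`-token prompt.
def kv_retained_fraction(context: int, window: int, sparse_layers: int, total_layers: int = 32) -> float:
    dense = (total_layers - sparse_layers) * context      # layers kept in full
    sparse = sparse_layers * min(window, context)         # layers limited to the window
    return (dense + sparse) / (total_layers * context)

# e.g. a 5000-token context with window=1024 on 26 of 32 layers
print(f"{kv_retained_fraction(5000, 1024, 26):.2%}")      # roughly 35% of the full KV cache
```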
LongBench
In the LongBench test, we selected a sliding window of 1024 for performance testing and found that the accuracy loss was about 1.03%.
| LongBench | narrativeqa | qasper | multifieldqa_en | hotpotqa | 2wikimqa | musique | gov_report | qmsum | multi_news | trec | triviaqa | samsum | passage_count | passage_retrieval_en | lcc | repobench-p | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vllm-base | 30.18 | 44.74 | 52.84 | 54.92 | 45.7 | 28.41 | 34.47 | 25.46 | 26.98 | 72.5 | 91.65 | 43.76 | 6.83 | 99.5 | 63.42 | 56.51 | 49.5 |
| Novita AI | 30.42 | 44.98 | 52.37 | 54.27 | 44.75 | 30.69 | 28.99 | 24.77 | 25.72 | 71.5 | 91.61 | 43.09 | 7.55 | 99.5 | 62.84 | 55.89 | 48.99 |
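The ~1.03% figure is simply the relative drop of the LongBench average in the table above:

```python
# Relative accuracy drop on the LongBench average (sliding window = 1024).
baseline_avg = 49.50   # vllm-base
novita_avg = 48.99     # Novita AI with KV sparsity
drop = (baseline_avg - novita_avg) / baseline_avg
print(f"relative accuracy loss: {drop:.2%}")  # ~1.03%
```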
Throughput Benchmarks: In real-world LLM applications, an input/output length of 5000/500 tokens is the most commonly observed configuration, and TTFT (time to first token) must stay below 2 s. Under these conditions we ran batched performance comparisons, which yielded a roughly 1.5x inference speedup over vLLM.
Throughput (relative speedup):

| KV sparsity layers | vLLM-0.6.2 | Novita AI, sliding window=1536 | Novita AI, sliding window=1280 | Novita AI, sliding window=1024 |
| --- | --- | --- | --- | --- |
| full | 1 | 1.26 | 1.31 | 1.38 |
| 22 | 1 | 1.34 | 1.47 | 1.49 |
| 26 | 1 | 1.44 | 1.53 | 1.58 |
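For reference, a simple way to approximate this kind of throughput measurement with the stock vLLM offline API is sketched below; the prompt and output lengths mimic the 5000/500 setting, while TTFT is normally measured separately against a streaming server endpoint. The numbers in the table come from our internal benchmark harness, so treat this only as an illustration.

```python
# Rough throughput measurement sketch for the 5000-in / 500-out setting.
# Uses only the standard vLLM offline API; timings are wall-clock approximations.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=32768)
prompts = ["word " * 5000] * 16                      # a batch of ~5000-token prompts
params = SamplingParams(max_tokens=500, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:.1f} output tokens/s over {elapsed:.1f}s")
```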
Major changes
Modified files mainly include:
FlashAttention: sparse scoring computed on top of the FlashAttention kernel, while keeping the kernel performance loss below 1%.
PagedAttention and reshape_and_cache: sparse scoring based on PagedAttention, with the scores kept in sync across the prefill and decode stages (a schematic sketch follows this list).
block_manager and other functions related to memory management and tensor preparation.
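To make the division of labor concrete, the hypothetical sketch below shows how per-block sparsity scores accumulated from attention probabilities could drive block eviction in the block manager. The names and structure are illustrative only; in the real integration the scoring happens inside the FlashAttention/PagedAttention kernels and the eviction inside vLLM's memory manager.

```python
# Illustrative (hypothetical) glue between attention-kernel scoring and the block manager.
# In the real integration the scores are produced inside the attention kernels during
# prefill and decode; here they are plain tensors for clarity.
import torch

BLOCK_SIZE = 16

def score_blocks(attn_probs: torch.Tensor) -> torch.Tensor:
    """Aggregate token-level attention probabilities into per-KV-block scores.

    attn_probs: [num_heads, q_len, kv_len] softmax outputs for one sequence.
    """
    token_scores = attn_probs.sum(dim=(0, 1))                  # [kv_len]
    kv_len = token_scores.shape[0]
    pad = (-kv_len) % BLOCK_SIZE                               # pad up to a block boundary
    token_scores = torch.nn.functional.pad(token_scores, (0, pad))
    return token_scores.view(-1, BLOCK_SIZE).sum(dim=1)        # [num_blocks]

def blocks_to_free(block_scores: torch.Tensor, keep_blocks: int, protect_last: int = 2) -> list[int]:
    """Pick the lowest-scoring blocks to release, never touching the most recent ones."""
    num_blocks = block_scores.shape[0]
    candidates = torch.arange(num_blocks - protect_last)
    if candidates.numel() <= keep_blocks - protect_last:
        return []
    order = torch.argsort(block_scores[candidates])            # ascending: evict lowest first
    n_evict = candidates.numel() - (keep_blocks - protect_last)
    return candidates[order[:n_evict]].tolist()
```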
Conclusion
Novita AI also supports tensor parallelism, enabling models such as Llama3-70B to run across multiple GPUs. For the time being we are unable to open-source the code, but we hope to contribute some of these techniques and ideas to the community, and we welcome technical exchanges with everyone.
Just a heads-up: the following features aren't supported yet in our vLLM-0.6.2-based integration:
Chunked-prefill
Prefix caching
FlashInfer and other non-FlashAttention backends
Speculative Decoding
Novita AI
Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.