Motivation
Drawing on academic work on KV sparsity from the past year (H2O, SnapKV, PyramidKV), we apply KV sparsity to different layers of the model. Using a pruning strategy, we evict KV pairs with low scores while retaining those with high scores and those closest to the current position. This reduces memory usage as well as computation and I/O overhead, ultimately accelerating inference.
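To make the idea concrete, here is a minimal, simplified sketch of score-based eviction in the spirit of H2O/SnapKV. It is an illustration only, not our kernel-level implementation, and all function and variable names are invented for this example.

```python
# Toy sketch of score-based KV eviction (H2O/SnapKV-style), not the actual kernels.
# Scores = accumulated attention weight per cached token; recent tokens are always kept.
import numpy as np

def evict_kv(attn_weights: np.ndarray, keep_budget: int, recent_window: int) -> np.ndarray:
    """Return the indices of KV positions to keep.

    attn_weights:  [num_queries, num_kv] softmax attention weights observed so far.
    keep_budget:   total number of KV positions to retain.
    recent_window: number of most recent positions that are always retained.
    """
    num_kv = attn_weights.shape[1]
    if num_kv <= keep_budget:
        return np.arange(num_kv)

    scores = attn_weights.sum(axis=0)                 # accumulated score per KV position
    recent = np.arange(num_kv - recent_window, num_kv)
    older = np.arange(num_kv - recent_window)

    # Keep the highest-scoring older positions up to the remaining budget.
    remaining = max(keep_budget - recent_window, 0)
    top_older = older[np.argsort(scores[older])[::-1][:remaining]]
    return np.sort(np.concatenate([top_older, recent]))

# Example: keep 8 of 16 cached positions, always retaining the last 4.
w = np.random.rand(4, 16)
w /= w.sum(axis=1, keepdims=True)
print(evict_kv(w, keep_budget=8, recent_window=4))
```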
Experiments
Baselines and Settings: We run all KV-Compress experiments using our vLLM integration forked from v0.6.2, running in CUDA graph mode with a block size of 16. For all RTX 4090 / Llama-3.1-8B-Instruct experiments, we use the default GPU memory utilization of 0.9 and set max-model-len to 32k. We evaluate our compression on Llama-3.1-8B-Instruct, comparing performance across the following configurations:
vLLM-0.6.2
Novita AI: Pyramid KV cache compression built on the vLLM framework
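For reference, the settings above map roughly onto the following stock vLLM initialization. This sketch uses only the upstream v0.6.x Python API and does not include our sparsity patches, so the exact flags in our fork may differ.

```python
# Rough sketch of the serving configuration described above, using the standard
# vLLM Python API (v0.6.x); our forked build adds the KV sparsity options on top.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,   # default GPU memory utilization
    max_model_len=32768,          # 32k context length
    block_size=16,                # KV cache block size
    enforce_eager=False,          # keep CUDA graph mode enabled
)

out = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```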
MMLU Pro and LongBench: We control the KV cache compression ratio by setting different sliding-window lengths on different layers. In our experiments we mainly use three sliding-window lengths (1024, 1280, and 1536) and cross-test them against different numbers of KV sparsity layers.
MMLU Pro
In the MMLU Pro test, different numbers of KV sparsity layers and different sliding-window lengths yield different results. Taking the acceleration ratio into account, overall accuracy can be kept above 98% of the uncompressed baseline.
| KV sparsity layers | vLLM-0.6.2 (full KV cache) | Novita AI, sliding window=1536 | Novita AI, sliding window=1280 | Novita AI, sliding window=1024 |
| --- | --- | --- | --- | --- |
| 22 | 0.4517 | 0.4496 | 0.4479 | 0.4349 |
| 26 | 0.4517 | 0.4476 | 0.4449 | 0.4377 |
| 31 | 0.4517 | 0.4476 | 0.4403 | 0.4310 |
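As a rough back-of-the-envelope check on what these sliding windows mean for KV cache size (assuming the 32 layers of Llama-3.1-8B and ignoring block-granularity rounding):

```python
# Estimate the fraction of KV cache retained when `sparse_layers` of `total_layers`
# keep only a sliding window of `window` tokens out of a `context`-token prompt.
def kv_retained_fraction(context: int, window: int, sparse_layers: int, total_layers: int = 32) -> float:
    dense = (total_layers - sparse_layers) * context      # layers kept in full
    sparse = sparse_layers * min(window, context)         # layers limited to the window
    return (dense + sparse) / (total_layers * context)

# e.g. a 5000-token context with window=1024 on 26 of 32 layers
print(f"{kv_retained_fraction(5000, 1024, 26):.2%}")      # roughly 35% of the full KV cache
```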
LongBench
In the LongBench test, we selected a sliding window of 1024 for performance testing and found that the accuracy loss was about 1.03%.
| LongBench | narrativeqa | qasper | multifieldqa_en | hotpotqa | 2wikimqa | musique | gov_report | qmsum | multi_news | trec | triviaqa | samsum | passage_count | passage_retrieval_en | lcc | repobench-p | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vllm-base | 30.18 | 44.74 | 52.84 | 54.92 | 45.7 | 28.41 | 34.47 | 25.46 | 26.98 | 72.5 | 91.65 | 43.76 | 6.83 | 99.5 | 63.42 | 56.51 | 49.5 |
| Novita AI | 30.42 | 44.98 | 52.37 | 54.27 | 44.75 | 30.69 | 28.99 | 24.77 | 25.72 | 71.5 | 91.61 | 43.09 | 7.55 | 99.5 | 62.84 | 55.89 | 48.99 |
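The ~1.03% figure is simply the relative drop of the LongBench average in the table above:

```python
# Relative accuracy drop on the LongBench average (sliding window = 1024).
baseline_avg = 49.50   # vllm-base
novita_avg = 48.99     # Novita AI with KV sparsity
drop = (baseline_avg - novita_avg) / baseline_avg
print(f"relative accuracy loss: {drop:.2%}")  # ~1.03%
```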
Throughput Benchmarks: In real-world LLM applications, an input/output length of 5000/500 tokens is the most commonly observed configuration, and TTFT (time to first token) must stay below 2 s. Under these conditions we ran batched performance comparisons, which yielded a roughly 1.5x inference speedup over vLLM.
Throughput (relative speedup):

| KV sparsity layers | vLLM-0.6.2 | Novita AI, sliding window=1536 | Novita AI, sliding window=1280 | Novita AI, sliding window=1024 |
| --- | --- | --- | --- | --- |
| full | 1 | 1.26 | 1.31 | 1.38 |
| 22 | 1 | 1.34 | 1.47 | 1.49 |
| 26 | 1 | 1.44 | 1.53 | 1.58 |
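For reference, a simple way to approximate this kind of throughput measurement with the stock vLLM offline API is sketched below; the prompt and output lengths mimic the 5000/500 setting, while TTFT is normally measured separately against a streaming server endpoint. The numbers in the table come from our internal benchmark harness, so treat this only as an illustration.

```python
# Rough throughput measurement sketch for the 5000-in / 500-out setting.
# Uses only the standard vLLM offline API; timings are wall-clock approximations.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=32768)
prompts = ["word " * 5000] * 16                      # a batch of ~5000-token prompts
params = SamplingParams(max_tokens=500, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {generated / elapsed:.1f} output tokens/s over {elapsed:.1f}s")
```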
Major changes
Modified files mainly include:
FlashAttention: sparse scoring computed on top of the FlashAttention kernel, while keeping the kernel performance loss below 1%.
PagedAttention and reshape_and_cache: sparse scoring based on PagedAttention, with the scores kept in sync across the prefill and decode stages (a schematic sketch follows this list).
block_manager and other functions related to memory management and tensor preparation.
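To make the division of labor concrete, the hypothetical sketch below shows how per-block sparsity scores accumulated from attention probabilities could drive block eviction in the block manager. The names and structure are illustrative only; in the real integration the scoring happens inside the FlashAttention/PagedAttention kernels and the eviction inside vLLM's memory manager.

```python
# Illustrative (hypothetical) glue between attention-kernel scoring and the block manager.
# In the real integration the scores are produced inside the attention kernels during
# prefill and decode; here they are plain tensors for clarity.
import torch

BLOCK_SIZE = 16

def score_blocks(attn_probs: torch.Tensor) -> torch.Tensor:
    """Aggregate token-level attention probabilities into per-KV-block scores.

    attn_probs: [num_heads, q_len, kv_len] softmax outputs for one sequence.
    """
    token_scores = attn_probs.sum(dim=(0, 1))                  # [kv_len]
    kv_len = token_scores.shape[0]
    pad = (-kv_len) % BLOCK_SIZE                               # pad up to a block boundary
    token_scores = torch.nn.functional.pad(token_scores, (0, pad))
    return token_scores.view(-1, BLOCK_SIZE).sum(dim=1)        # [num_blocks]

def blocks_to_free(block_scores: torch.Tensor, keep_blocks: int, protect_last: int = 2) -> list[int]:
    """Pick the lowest-scoring blocks to release, never touching the most recent ones."""
    num_blocks = block_scores.shape[0]
    candidates = torch.arange(num_blocks - protect_last)
    if candidates.numel() <= keep_blocks - protect_last:
        return []
    order = torch.argsort(block_scores[candidates])            # ascending: evict lowest first
    n_evict = candidates.numel() - (keep_blocks - protect_last)
    return candidates[order[:n_evict]].tolist()
```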
Conclusion
Novita AI also supports tensor parallelism, enabling models such as Llama3-70B to run across multiple GPUs. For the time being we are unable to open-source the code, but we hope to contribute some of these techniques and ideas to the community, and we welcome technical exchanges with everyone.
Just a heads-up: the following features aren't supported yet in our vLLM-0.6.2-based integration:
Chunked-prefill
Prefix caching
FlashInfer and other non-FlashAttention backends
Speculative Decoding
Novita AI
Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.