DeepGEMM Tested by Novita AI: Can It Replace SGLang?


Yesterday, the third day of Open Source Week, DeepSeek officially released the DeepGEMM open-source library.
DeepGEMM is an FP8 GEMM computation library designed for both dense and MoE models, providing strong support for the training and inference of FP8-quantized MoE models such as DeepSeek-V3/R1.

DeepGEMM is deeply optimized for NVIDIA Hopper-architecture GPUs (such as the H100, H200, and H800).
Its main appeal is concise code (the core kernel is only about 300 lines) combined with outstanding performance: it matches or even surpasses expert-tuned libraries across a wide range of matrix shapes.

As a cloud platform dedicated to providing high-performance AI computing services, Novita AI has deployed a large number of FP8-quantized MoE models (such as the FP8 version of DeepSeek).
To better leverage DeepGEMM technology and enhance the inference efficiency of these models, Novita AI conducted comprehensive performance testing on DeepGEMM as soon as it was available.
Before delving into the specific test data, let's first familiarize ourselves with some relevant basic concepts.

What is GEMM?

GEMM (General Matrix Multiplication) is the most fundamental and important computational operator in deep learning, and GEMM optimization is the core of high-performance AI computing.
DeepGEMM is an open-source library designed specifically to accelerate key GEMM operations in deep learning, enhancing the overall performance of the AI system by improving GEMM computation efficiency.
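
Concretely, a GEMM computes D = alpha * (A @ B) + beta * C for matrices A, B, and C. The snippet below is a plain PyTorch reference, used here only to pin down the operation; tuned libraries such as DeepGEMM compute the same thing, just far faster on the GPU's Tensor Cores.

```python
import torch

# GEMM reference: D = alpha * (A @ B) + beta * C
def gemm_reference(A: torch.Tensor, B: torch.Tensor, C: torch.Tensor,
                   alpha: float = 1.0, beta: float = 0.0) -> torch.Tensor:
    return alpha * (A @ B) + beta * C

A = torch.randn(4096, 7168)   # activations: (M, K)
B = torch.randn(7168, 2048)   # weights:     (K, N)
C = torch.zeros(4096, 2048)   # accumulator: (M, N)
D = gemm_reference(A, B, C)
```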

Unique Advantages of DeepGEMM

Compared to mature template libraries like CUTLASS and CuTe, DeepGEMM takes a lightweight design approach: it does not aim for broad compatibility with all GPUs and computational scenarios, but focuses on fully leveraging the FP8 computing capabilities of the Hopper architecture, with meticulous optimization for the matrix shapes commonly used in large models like DeepSeek R1 and V3.

Technological Innovations of DeepGEMM

DeepGEMM achieves performance breakthroughs through the following four core technological innovations:

  • Just-In-Time Compilation (JIT)

Traditional approaches require CUDA kernels to be compiled ahead of time before they can be called, whereas DeepGEMM compiles its kernels just-in-time at runtime, so no manual build step is needed.
Developers also do not need to maintain complex pre-built Python extensions; the functionality is reachable with just a few lines of code, which keeps development simple (a minimal calling sketch appears after this list).

  • Computation and Transfer Overlap Optimization

DeepGEMM overlaps data transfer with computation, making full use of the Hopper architecture's Tensor Memory Accelerator (TMA) to optimize data-movement efficiency. In addition, DeepGEMM drops down to low-level PTX instructions to squeeze out extra performance (a loose host-side analogy of the overlap idea appears after this list).

  • Support for Arbitrary Matrix Sizes

Traditional GEMM kernels typically tile matrices with fixed, power-of-two block sizes (such as 128 or 256) and pad dimensions that do not divide evenly, while DeepGEMM supports non-aligned block sizes. This avoids wasted padding work and improves overall computational efficiency (a quick padding calculation appears after this list).

  • FFMA SASS Instruction-Level Optimization

By modifying the yield and reuse bits of FFMA instructions in the compiled SASS, DeepGEMM creates more opportunities to overlap MMA instructions with the FFMA instructions used for FP8 accumulation promotion, yielding performance improvements of over 10% in some scenarios. This trick was found empirically rather than derived from documented architectural behavior.
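
To make these optimizations concrete, here are three small Python sketches. They are illustrative only: anything not taken from the DeepGEMM repository's documented interface (scale layouts, exact shapes, alignment handling) is an assumption.

First, the JIT-based calling style. The entry point gemm_fp8_fp8_bf16_nt and the (tensor, per-block scales) argument pairs follow the interface documented in the DeepGEMM repository; the scale shapes and the lack of alignment handling shown here are assumptions and may need adjusting for a real run.

```python
import torch
import deep_gemm  # kernels are JIT-compiled on first use; no ahead-of-time build step

m, k, n = 128, 7168, 4096

# FP8 activations with per-token/128-channel scales and FP8 weights with 128x128
# block scales (these scale layouts are assumptions based on the repository's examples).
x = torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn)
x_scales = torch.ones(m, k // 128, device="cuda", dtype=torch.float32)
y = torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn)
y_scales = torch.ones(n // 128, k // 128, device="cuda", dtype=torch.float32)
out = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)

# The first call for a given shape triggers runtime compilation of a specialized
# kernel; later calls with the same shape hit the compiled-kernel cache.
deep_gemm.gemm_fp8_fp8_bf16_nt((x, x_scales), (y, y_scales), out)
```

Second, the overlap idea. DeepGEMM's overlap happens inside a single kernel (TMA transfers running concurrently with Tensor Core math); the snippet below is only a host-side analogy in plain PyTorch, overlapping a host-to-device copy with a matmul on separate CUDA streams.

```python
import torch

copy_stream = torch.cuda.Stream()
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
host_batch = torch.randn(4096, 4096, pin_memory=True)

with torch.cuda.stream(copy_stream):
    next_batch = host_batch.to("cuda", non_blocking=True)  # transfer in flight
c = a @ b                                                   # compute on the default stream
torch.cuda.current_stream().wait_stream(copy_stream)        # sync before using next_batch
```

Third, a back-of-the-envelope look at why non-aligned block sizes matter: with a fixed 128-row block, any M that does not divide evenly is padded up, and the padded rows are wasted work.

```python
def padding_waste(m: int, block: int = 128) -> float:
    """Fraction of rows that are padding when m is rounded up to a multiple of block."""
    padded = ((m + block - 1) // block) * block
    return (padded - m) / padded

for m in (120, 130, 257):
    print(f"m={m}: {padding_waste(m):.1%} of rows are padding with 128-aligned blocks")
```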

Novita AI First-Hand Evaluation: DeepGEMM Universality

In the inference scenarios of MoE models, Novita AI conducted detailed performance tests of DeepGEMM on the H100 and H200 GPUs and compared the results with the official H800 benchmark data.

First, we summarized the key hardware parameters of the H100, H200, and H800 GPUs that impact the performance of DeepGEMM:

| Metric | H100 SXM | H200 SXM | H800 SXM |
| --- | --- | --- | --- |
| FP8 Compute Power | 3958 TFLOPS | 3958 TFLOPS | 3958 TFLOPS |
| Memory Bandwidth | 3.35 TB/s | 4.8 TB/s | 3.35 TB/s |

MoE Model: Grouped GEMM with Contiguous Storage Layout (Training Forward Pass, Inference Prefill)

In MoE networks using the contiguous storage layout, the performance differences between the H100, H200, and H800 (official figures) are minimal.

The figure below shows the memory bandwidth utilization comparison. Because this workload is compute-bound and the three GPUs have essentially the same FP8 compute throughput, their performance shows no significant differences.

[Figure: MoE grouped GEMM (contiguous layout), memory bandwidth utilization comparison]

The figure below illustrates the computational performance comparison. Since the memory access bottleneck was not reached, the performance of the three GPUs shows no notable differences.

[Figure: MoE grouped GEMM (contiguous layout), compute performance comparison]

MoE Model: Grouped GEMM with Masked Storage Layout (Inference Decoding)

In MoE networks using the masked storage layout, the H200 demonstrates the best performance, while the differences between H100 and H800 are very small.

The figure below shows the memory bandwidth utilization comparison. Since the masked layout consumes more memory bandwidth than the contiguous layout, some shapes do reach the memory bandwidth bottleneck, which produces performance differences among the three GPUs.

The figure below shows the compute performance comparison, highlighting the differences caused by bandwidth.
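
For reference, the two grouped-GEMM data layouts benchmarked above can be sketched in plain PyTorch. This is a reference emulation only (the real DeepGEMM kernels run in FP8 on Hopper); names such as m_indices and masked_m follow the terminology used in the DeepGEMM repository.

```python
import torch

num_experts, n, k = 4, 512, 256
weights = torch.randn(num_experts, n, k)              # one (N, K) weight matrix per expert

# Contiguous layout (training forward pass / prefill): tokens routed to each expert
# are packed back-to-back along M, with a per-row expert index.
tokens_per_expert = [128, 64, 256, 32]
lhs = torch.randn(sum(tokens_per_expert), k)
m_indices = torch.repeat_interleave(torch.arange(num_experts),
                                    torch.tensor(tokens_per_expert))
out_contiguous = torch.empty(lhs.shape[0], n)
for e in range(num_experts):
    rows = m_indices == e
    out_contiguous[rows] = lhs[rows] @ weights[e].T

# Masked layout (decoding): every expert owns a fixed-size buffer of max_m rows;
# masked_m[e] says how many rows are actually valid, the rest are padding.
max_m = 64
lhs_masked = torch.randn(num_experts, max_m, k)
masked_m = torch.tensor([64, 17, 40, 3])
out_masked = torch.empty(num_experts, max_m, n)
for e in range(num_experts):
    valid = int(masked_m[e])
    out_masked[e, :valid] = lhs_masked[e, :valid] @ weights[e].T
```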

DeepGEMM vs. SGLang Triton Performance Comparison

Currently, mainstream inference frameworks use grouped GEMM operators developed based on SGLang Triton for the MoE module. We conducted performance comparison tests between DeepGEMM and SGLang Triton under H200 hardware conditions:

DeepGEMM shows a clear advantage in the contiguous storage layout, while SGLang Triton performs better in the masked storage layout, which is exactly the scenario where these grouped GEMM operators are used most heavily in current inference frameworks. DeepGEMM therefore needs further optimization before it can replace SGLang Triton in inference frameworks.

  • For MoE models using the contiguous storage layout (training forward pass, inference prefill), DeepGEMM shows a more significant advantage, as the figures below illustrate.

[Figures: DeepGEMM vs. SGLang Triton performance, contiguous storage layout]

  • For MoE models using the masked storage layout (inference decoding), SGLang Triton demonstrates superior performance.

Conclusion

The evaluation results demonstrate that DeepGEMM delivers significant performance gains across multiple GPUs, including the H100, H200, and H800, highlighting its strong versatility.

For MoE series models (such as DeepSeek V3 and R1) running on the Hopper architecture, integrating DeepGEMM into the inference framework by replacing the original CUTLASS version of grouped GEMMs is expected to deliver approximately 1.2x acceleration in model inference, enhancing overall performance.

Currently, DeepGEMM cannot fully replace SGLang Triton and requires further optimization to expand its application scope. In inference decoding, SGLang Triton remains more efficient, while DeepGEMM shows greater advantages in training forward passes and inference prefill stages.

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through our simple API, while also providing an affordable and reliable GPU cloud for building and scaling.


Get $20 in credits and try DeepSeek now!
