Today, following the release of FlashMLA, DeepSeek has launched its second OpenSourceWeek project—DeepEP.
As the first open-source EP communication library designed specifically for training and inference of MoE (Mixture-of-Experts) models, DeepEP marks a significant step forward for Expert Parallelism (EP). It aims to give MoE models low-latency, high-bandwidth, and high-throughput communication, both across GPUs within a node and between nodes. According to test results, DeepEP achieves near-maximum bandwidth for intra-node multi-GPU communication, while also significantly improving inter-node communication efficiency.
What is EP?
Before diving deeper into DeepEP, it’s important to first understand what EP is.
EP (Expert Parallelism) is a distributed computing method designed specifically for MoE (Mixture-of-Experts) models. MoE is a Transformer-based model architecture that employs a sparse activation strategy, making it more lightweight to train than traditional dense models. In an MoE neural network, only a subset of the model’s components (referred to as "experts") is activated to process any given input.
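To make the sparse-activation idea concrete, here is a minimal top-k gating sketch in PyTorch. The layer size, expert count, and top-k value are placeholder assumptions for illustration, not DeepSeek's actual configuration:

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative MoE layer: only the top-k experts chosen by the router run per token."""
    def __init__(self, hidden=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts)           # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):                                       # x: [num_tokens, hidden]
        scores = self.router(x).softmax(dim=-1)                 # [num_tokens, num_experts]
        weights, expert_ids = scores.topk(self.top_k, dim=-1)   # route each token to its top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, k] == e                    # tokens assigned to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

Only two of the eight experts run for each token, which is the sparsity that EP later exploits by placing different experts on different GPUs.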
The importance of EP (Expert Parallelism) in accelerating large language model inference lies in its ability to efficiently partition MoE models. When a model adopts the MoE architecture with hundreds of experts (e.g., 320 experts), EP can assign different experts to independent computing nodes, with its parallel granularity directly matching the number of experts.
In contrast, TP (Tensor Parallelism) relies on splitting computation along the multi-head dimension of Attention layers. In a typical 32-head configuration, for example, TP struggles to scale to 64 or more GPUs because the splitting dimension is too small (32 < 64), making it difficult to fully utilize the hardware. EP, on the other hand, partitions computation along the expert dimension, so its degree of parallelism scales with the number of experts rather than the number of attention heads.
Source: EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
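As a small sketch of that contrast, the snippet below assigns a set of 320 experts (the count used in the example above) to expert-parallel ranks in contiguous blocks; the EP size and the mapping scheme are illustrative assumptions only:

```python
# Illustrative expert-parallel sharding: each EP rank owns a contiguous block of experts.
NUM_EXPERTS = 320   # e.g. a model with 320 experts, as in the example above
EP_SIZE = 64        # hypothetical number of GPUs participating in expert parallelism

experts_per_rank = NUM_EXPERTS // EP_SIZE        # 5 experts per GPU

def owner_rank(expert_id: int) -> int:
    """Which EP rank hosts a given expert."""
    return expert_id // experts_per_rank

# TP, by contrast, splits along attention heads: with 32 heads, 64 GPUs cannot each own a head.
NUM_HEADS = 32
assert NUM_HEADS < EP_SIZE, "head dimension is too small to split across 64 GPUs"

print(owner_rank(0), owner_rank(319))  # 0 63
```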
DP vs. TP vs. PP vs. EP
| Method | Internal Logic | Core Problem Solved |
| --- | --- | --- |
| Data Parallelism (DP) | Replicate the model across devices, split the input data, and synchronize gradient updates. | Slow training due to large dataset size. |
| Tensor Parallelism (TP) | Split parameter matrices across devices, perform distributed computation, and aggregate results. | Single-layer parameters exceeding device memory capacity. |
| Pipeline Parallelism (PP) | Partition model layers across devices and schedule micro-batches through a pipeline. | Insufficient memory for extremely deep models. |
| Expert Parallelism (EP) | Dynamically route inputs to expert sub-networks with sparse parameter activation. | Memory and computational inefficiency at trillion-parameter scale. |
Modern large-scale models (e.g., GPT-4, DeepSeek-V3) typically integrate multiple parallelism strategies simultaneously to maximize efficiency:
Tensor Parallelism (TP): Splits the parameters of individual layers across devices.
Pipeline Parallelism (PP): Distributes different layers of the model across devices to process in a pipeline manner.
Data Parallelism (DP): Synchronizes training across multiple machines by replicating the model and splitting the dataset.
Expert Parallelism (EP): Expands sparse parameters by distributing experts across devices for MoE models.
By combining these strategies, large models can effectively utilize available hardware resources, scaling to larger model sizes and datasets while maintaining training and inference efficiency.
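As a rough illustration of how these degrees compose, the sketch below factors a hypothetical cluster into DP, PP, and TP groups and then regroups ranks into expert-parallel groups for MoE layers; all of the sizes are made up for illustration, not a recommended configuration:

```python
# Illustrative only: parallelism degrees multiply up to the total GPU count.
DP, PP, TP = 4, 2, 8                   # hypothetical data-/pipeline-/tensor-parallel degrees
WORLD_SIZE = DP * PP * TP              # 64 GPUs in total

# In MoE layers, expert parallelism typically regroups existing ranks into EP groups.
EP = 16                                # hypothetical expert-parallel degree
assert WORLD_SIZE % EP == 0, "EP groups must tile the cluster evenly"
ep_groups = [list(range(start, start + EP)) for start in range(0, WORLD_SIZE, EP)]

print(WORLD_SIZE, len(ep_groups))      # 64 4
```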
What is DeepEP?
DeepEP is a communication library specifically designed for MoE (Mixture-of-Experts) and EP (Expert Parallelism), offering the following core advantages:
1. Highly Optimized All-to-All Communication
DeepEP provides an efficient All-to-All communication kernel that significantly reduces data transfer bottlenecks, ensuring smoother information exchange between experts in distributed environments.
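The snippet below is a minimal stand-in for what such a kernel does, using PyTorch's generic `torch.distributed.all_to_all_single` to exchange per-rank token buffers. It illustrates the communication pattern only; it is not DeepEP's actual API, and the `dispatch_tokens` helper is a hypothetical name:

```python
# Minimal all-to-all token exchange with torch.distributed (illustrative, not DeepEP's kernel).
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist

def dispatch_tokens(tokens_per_peer: torch.Tensor) -> torch.Tensor:
    """Send the i-th chunk of `tokens_per_peer` to rank i and receive one chunk from every rank."""
    received = torch.empty_like(tokens_per_peer)
    dist.all_to_all_single(received, tokens_per_peer)
    return received

if __name__ == "__main__":
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    hidden = 7168                                          # hidden size from the DeepSeek-V3 setting below
    tokens = torch.randn(world, hidden, device="cuda")     # one token destined for each peer rank
    routed = dispatch_tokens(tokens)                       # each rank now holds the tokens routed to it
    print(rank, routed.shape)
    dist.destroy_process_group()
```

This "dispatch" direction is mirrored by a "combine" all-to-all that returns expert outputs to the tokens' original ranks.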
2. Support for NVLink and RDMA in Intra-Node/Inter-Node Communication
DeepEP supports both NVLink and RDMA technologies, enabling high-performance communication within nodes and across nodes:
NVLink: Delivers bandwidth up to 160 GB/s for intra-node communication.
RDMA: Enables low-latency inter-node data transfers, meeting the demands of large-scale distributed training.
3. High-Throughput Compute Core
For training and inference prefill stages, DeepEP provides a high-throughput compute core, ensuring efficient processing of large-scale data.
4. Low-Latency Compute Core
DeepEP offers a low-latency compute core based on RDMA/InfiniBand, which minimizes inference latency. This is especially beneficial for latency-sensitive applications during the inference decoding stage.
5. Native Support for FP8 Data Distribution
DeepEP natively supports FP8 data distribution, reducing data transfer volume while maintaining precision, further improving communication efficiency.
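To illustrate the idea (not DeepEP's implementation), the sketch below casts activations to FP8 before a transfer and accumulates results back in BF16. It assumes a recent PyTorch build that exposes the `torch.float8_e4m3fn` dtype, and it omits the per-tensor scaling that real FP8 pipelines typically use:

```python
import torch

# Illustrative precision flow: dispatch payloads in FP8, combine/accumulate in BF16.
x = torch.randn(4096, 7168, dtype=torch.bfloat16)        # activations to be sent to experts

# An FP8 payload is half the bytes of BF16, roughly halving communication volume.
payload_fp8 = x.to(torch.float8_e4m3fn)                   # what would travel over NVLink/RDMA
print(payload_fp8.element_size(), x.element_size())       # 1 byte vs. 2 bytes per element

# On the receiving side, expert outputs are accumulated back in BF16 for numerical stability.
received = payload_fp8.to(torch.bfloat16)
combined = torch.zeros_like(received)
combined += received                                      # BF16 aggregation of expert outputs
```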
6. Flexible GPU Resource Control
DeepEP features a flexible GPU resource scheduling mechanism, allowing efficient overlap of computation and communication. This minimizes resource wastage and enhances overall performance.
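A common way to realize this kind of overlap is to issue communication on a separate CUDA stream while computation continues on the default stream. The sketch below shows that generic pattern only; it is not how DeepEP actually partitions its SMs, and it assumes a single CUDA device is available:

```python
import torch

# Generic computation/communication overlap with CUDA streams (illustrative, not DeepEP's SM control).
comm_stream = torch.cuda.Stream()

x = torch.randn(4096, 7168, device="cuda")
payload = torch.randn(4096, 7168, device="cuda")
staging = torch.empty(payload.shape, dtype=payload.dtype, pin_memory=True)  # pinned host buffer

with torch.cuda.stream(comm_stream):
    # Stand-in for a dispatch/combine transfer; runs concurrently with the matmul below.
    staging.copy_(payload, non_blocking=True)

y = x @ x.T                                               # compute proceeds on the default stream
torch.cuda.current_stream().wait_stream(comm_stream)      # re-synchronize before using the transfer
```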
EP vs. DeepEP
In essence, EP defines the "what" (how to split experts and distribute workloads), while DeepEP provides the "how" (efficient communication mechanisms to make EP faster and more scalable).
DeepEP Performance
DeepEP showcases exceptional performance in both intra-node and inter-node communication, particularly in hybrid architectures combining NVLink and RDMA. Below are the performance results in two typical scenarios:
Regular Kernel Performance (NVLink and RDMA Forwarding)
Test Environment:
GPU: H800 (NVLink with a maximum bandwidth of ~160 GB/s)
Network: CX7 InfiniBand 400 Gb/s RDMA NIC (maximum bandwidth ~50 GB/s)
Configuration: DeepSeek-V3/R1 pretraining setup (batch size: 4096 tokens, hidden size: 7168, top-4 groups, top-8 experts, FP8 dispatch, and BF16 combine)
Performance Results:
Intra-node communication achieves bandwidth close to the NVLink maximum (160 GB/s), demonstrating extremely high data transfer efficiency.
Inter-node communication maintains stable bandwidth under RDMA, meeting the requirements for large-scale distributed training.
Low-Latency Kernel Performance (Pure RDMA)
Test Environment:
GPU: H800
Network: CX7 InfiniBand 400 Gb/s RDMA NIC (maximum bandwidth ~50 GB/s)
Configuration: Typical DeepSeek-V3/R1 production setup (batch size: 128 tokens, hidden size: 7168, top-8 experts, FP8 distribution, and BF16 aggregation)
Performance Results:
The low-latency kernel achieves microsecond-level latency in pure RDMA mode, making it suitable for latency-sensitive inference decoding tasks.
Even under high parallelism (#EP=256), RDMA bandwidth remains stable, ensuring efficient data transfer.
DeepEP Application Scenarios
DeepEP is well-suited for various MoE model training and inference scenarios, particularly in large-scale distributed training. Key application scenarios include:
MoE Model Training
- DeepEP's high-throughput compute core and efficient All-to-All communication mechanism significantly accelerate the training process, especially in multi-node, multi-GPU environments.
Inference Prefill Stage
- During the inference prefill stage, DeepEP's high-throughput compute core efficiently processes large amounts of data, ensuring a highly efficient inference pipeline.
Inference Decoding Stage
- For the decoding stage, DeepEP's low-latency compute core minimizes inference delays, making it ideal for real-time applications.
Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through a simple API, while also providing affordable and reliable GPU cloud resources for building and scaling.
Get $20 credits and Try DeepSeek now!