Recommendation systems are core to the Internet industry, and training them efficiently is a key challenge for the companies that run them. Most modern recommendation systems are deep learning recommendation models (DLRMs), which contain billions, or even tens of billions, of ID features. Figure 1 shows a typical structure.
In recent years, GPU solutions such as NVIDIA Merlin HugeCTR and TorchRec have significantly accelerated DLRM training by storing large-scale ID feature embeddings on GPUs and processing them in parallel. Leveraging GPU memory bandwidth delivers significant improvements over CPU solutions.
At the same time, as the number of GPUs in training clusters grows (from 8 to 128 GPUs), we have found that embedding communication accounts for an increasingly large share of the total training time. In some large-scale training scenarios (such as on 16 nodes), it even exceeds half (51%).
This is mainly due to two reasons:
1. As the number of GPUs in the cluster increases, each node holds fewer embedding tables, making it harder to distribute the work evenly; this leads to unbalanced loads across nodes and reduces training efficiency.
2. Internode bandwidth is much lower than intranode bandwidth, so the model-parallel embedding communication takes longer.
To help industry users better understand and solve these problems, the NVIDIA HugeCTR team presented EMBark at RecSys 2024. EMBark supports flexible 3D sharding strategies and combines them with communication compression strategies to address load imbalance in large-scale cluster DLRM training and to reduce the communication time required for embeddings. The related code and paper are open source.
EMBark introduction
EMBark improves the performance of embeddings in DLRM training under different cluster configurations and accelerates training throughput. It’s implemented using the NVIDIA Merlin HugeCTR open-source recommendation system framework, but the techniques can also be applied to other machine learning frameworks.
EMBark has three key components: embedding clusters, flexible 3D sharding schemes, and a sharding planner. Figure 3 shows the overall architecture of EMBark.
Embedding clusters
Embedding clusters train embeddings efficiently by grouping embeddings with similar characteristics and applying a customized compression strategy to each cluster. Each cluster includes a data distributor, embedding storage, and embedding operators, which work together to convert feature IDs into embedding vectors.
There are three types of embedding clusters: data parallel (DP), reduction-based (RB), and unique-based (UB). Each uses a different communication method during training and suits different embeddings; a minimal sketch of how tables might be assigned to cluster types follows the list.
- DP clusters: Don’t compress communication, making them simple and efficient, but they replicate the embedding table on each GPU, so they are only suitable for small tables.
- RB clusters: Use reduction operations and are highly effective at compressing communication for multi-hot input tables with pooling operations.
- UB clusters: Only send unique vectors, which is beneficial for handling embedding tables with obvious access hotspots.
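To make the distinction concrete, the sketch below assigns hypothetical tables to a cluster type with simple heuristics. The `TableInfo` fields, thresholds, and `choose_cluster` helper are illustrative assumptions, not EMBark's actual selection logic.

```python
# Hypothetical sketch: choosing a cluster type per embedding table.
# Thresholds and field names are illustrative only.
from dataclasses import dataclass

@dataclass
class TableInfo:
    name: str
    num_rows: int       # number of IDs in the table
    hotness: float      # average number of lookups per sample
    is_one_hot: bool    # True if each sample hits at most one row

def choose_cluster(table: TableInfo,
                   dp_row_limit: int = 1_000_000,
                   rb_hotness_threshold: float = 2.0) -> str:
    """Pick a cluster type for one table.

    - Small tables can be replicated on every GPU (DP): no embedding
      communication at all.
    - Multi-hot tables with pooling suit reduction-based (RB) clusters,
      since partial sums can be reduced before transfer.
    - Tables with strong access hotspots suit unique-based (UB) clusters,
      which deduplicate vectors before sending them.
    """
    if table.num_rows <= dp_row_limit:
        return "DP"
    if not table.is_one_hot and table.hotness >= rb_hotness_threshold:
        return "RB"
    return "UB"

tables = [
    TableInfo("user_region", 50_000, 1.0, True),
    TableInfo("clicked_items", 200_000_000, 80.0, False),
    TableInfo("item_id", 500_000_000, 1.0, True),
]
for t in tables:
    print(t.name, "->", choose_cluster(t))
```

In practice, the right thresholds would depend on GPU memory budget, batch size, and interconnect bandwidth.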
Flexible 3D sharding scheme
The flexible 3D sharding scheme can solve the workload imbalance problem in RB clusters. Unlike fixed sharding strategies such as row-wise, table-wise, or column-wise, EMBark uses a 3D tuple (i, j, k) to represent each shard, where i represents the table index, j represents the row shard index, and k represents the column shard index. This approach enables each embedding to be sharded across an arbitrary number of GPUs, providing flexibility and precise control over workload balance.
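As an illustration of the (i, j, k) representation, the sketch below enumerates the shards of one table and places them on GPUs. The helper names and the round-robin placement are assumptions for illustration; EMBark's planner places shards by estimated cost rather than round-robin.

```python
# Minimal sketch of representing embedding shards as 3D tuples
# (i, j, k) = (table index, row-shard index, column-shard index).
from itertools import product

def shard_table(table_idx: int, num_row_shards: int, num_col_shards: int):
    """Enumerate all (i, j, k) shards for one embedding table."""
    return [(table_idx, j, k)
            for j, k in product(range(num_row_shards), range(num_col_shards))]

def assign_round_robin(shards, num_gpus: int):
    """Place shards on GPUs round-robin; a real planner would instead
    balance estimated lookup and communication cost per GPU."""
    placement = {gpu: [] for gpu in range(num_gpus)}
    for idx, shard in enumerate(shards):
        placement[idx % num_gpus].append(shard)
    return placement

# Table 0 split 4 ways by row and 2 ways by column -> 8 shards on 8 GPUs.
shards = shard_table(table_idx=0, num_row_shards=4, num_col_shards=2)
print(assign_round_robin(shards, num_gpus=8))
```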
Sharding planner
To find the optimal sharding strategy, EMBark provides a sharding planner—a cost-driven greedy search algorithm that identifies the best sharding strategy based on hardware specifications and embedding configurations.
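The sketch below shows one way a cost-driven greedy placement could work, assigning the most expensive shards first to the currently least-loaded GPU. The cost model and function names are hypothetical; EMBark's planner also accounts for hardware specifications and embedding configurations when estimating cost.

```python
# Hedged sketch of cost-driven greedy shard placement.
import heapq

def greedy_plan(shard_costs: dict, num_gpus: int):
    """Assign each shard to the currently least-loaded GPU.

    shard_costs maps a shard id (e.g. an (i, j, k) tuple) to an
    estimated per-iteration cost for that shard.
    """
    # Min-heap of (accumulated_cost, gpu_id).
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    plan = {gpu: [] for gpu in range(num_gpus)}
    # Place the most expensive shards first (classic greedy load balancing).
    for shard, cost in sorted(shard_costs.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        plan[gpu].append(shard)
        heapq.heappush(heap, (load + cost, gpu))
    return plan

# Example: costs roughly proportional to hotness * embedding dimension.
costs = {(0, j, 0): 10.0 for j in range(4)}
costs.update({(1, 0, k): 80.0 * 128 for k in range(2)})
print(greedy_plan(costs, num_gpus=4))
```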
Evaluation
Leveraging embedding clusters, the flexible 3D sharding scheme, and an advanced sharding planner, EMBark significantly enhances training performance. To evaluate its effectiveness, experiments were conducted on a cluster of NVIDIA DGX H100 nodes, each equipped with eight NVIDIA H100 GPUs (640 GB of HBM per node, with an aggregate memory bandwidth of 24 TB/s).
Within each node, all GPUs are interconnected through NVLink (900 GB/s bidirectional), and internode communication uses InfiniBand (8x 400 Gbps).
| | DLRM-DCNv2 | T180 | T200 | T510 | T7 |
|---|---|---|---|---|---|
| # parameters | 48B | 70B | 100B | 110B | 470B |
| FLOPs per sample | 96M | 308M | 387M | 1,015M | 25M |
| # embedding tables | 26 | 180 | 200 | 510 | 7 |
| Table dimensions (d) | 128 | {32, 128} | 128 | 128 | {32, 128} |
| Average hotness (h) | ~10 | ~80 | ~20 | ~5 | ~20 |
| Achieved QPS | ~75M | ~11M | ~15M | ~8M | ~74M |

*Table 1. Models used in the evaluation*
To demonstrate that EMBark can efficiently train DLRM models of any scale, we tested the MLPerf DLRM-DCNv2 model and generated several synthetic models with larger embedding tables and different properties (see Table 1). Our training dataset exhibits a power-law skew with α = 1.2.
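For readers who want to reproduce a similarly skewed workload, the sketch below draws IDs from a truncated Zipf distribution with α = 1.2. The vocabulary size, sample count, and helper name are arbitrary assumptions; this is not the generator used in the paper.

```python
# Illustrative sketch of generating category IDs with a power-law
# (Zipf-like) skew of alpha = 1.2.
import numpy as np

def zipf_ids(vocab_size: int, num_samples: int, alpha: float = 1.2,
             seed: int = 0) -> np.ndarray:
    """Sample IDs from a truncated Zipf distribution: P(id=k) ∝ (k+1)^-alpha."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, vocab_size + 1, dtype=np.float64)
    probs = ranks ** -alpha
    probs /= probs.sum()
    return rng.choice(vocab_size, size=num_samples, p=probs)

ids = zipf_ids(vocab_size=1_000_000, num_samples=65_536)
# A small number of hot IDs dominates the lookups, which is what makes
# unique-based (UB) compression effective.
print("unique fraction:", len(np.unique(ids)) / ids.size)
```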
The baseline uses a serial kernel execution order, a fixed table-row-wise sharding strategy, and RB clusters for all tables. The experiments successively applied three optimizations: overlapping, more flexible sharding strategies, and better cluster configurations. Across four representative DLRM variants (DLRM-DCNv2, T180, T200, and T510), EMBark achieved an average 1.5x speedup in end-to-end training throughput, up to 1.77x faster than the baseline.
You can refer to the paper for more detailed experimental results and related analysis.
Faster DLRM training with EMBark
EMBark tackles the challenge of lengthy embedding computation in large-scale recommendation model training. By supporting flexible 3D sharding strategies and combining them with different communication compression strategies, it mitigates load imbalance in deep recommendation model training on large-scale clusters and reduces the communication time required for embeddings.
This improves the training efficiency of large-scale recommendation models. Across four representative DLRM variants (DLRM-DCNv2, T180, T200, and T510), EMBark achieved an average 1.5x speedup in end-to-end training throughput, up to 1.77x faster than the baseline.
We are also actively exploring embedding-offloading techniques and Torch-related optimizations, and we look forward to sharing our progress in the future. If you are interested in this area, please contact us for more information.