Deepspeed Chat

Affiliation

Microsoft

Commercial

Fine-tuning Method

RLHF

Note

- DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales . Doc, code example - Features . Easy-Breezy Training & Inference API : A complete E2E RLHF training (from Huggingface model + an inference API) . High performance system : Hybrid Engine achieves 15x training speedup over SOTA RLHF systems (memory management & data movement across different phases of RLHF) . Accessible Large Model Support : a single or multi-GPUs through ZeRO and LoRA . An Universal Acceleration Backend for RLHF : Support InstructGPT pipeline (+ data abstraction & blending capabilities for multiple data sources) - Hybrid Engine for Optimization • Inference : a light-weight memory management system (KV-cache & intermediate results) ◦ highly-optimized inference-adapted kernels ◦ tensor parallelism implementation (model partitioning) • Training : memory optimization techniques ◦ ZeRO family (sharding) (model partitioning) & LoRA - Performance . OPT-13B (9 hours; $300 Azure cloud) & OPT-30B (18 hours; $600 Azure cloud) . 13B (1.25 hours; 64 GPU cluster) & 175B (1 day; 64 GPU cluster) [training speed; throughput] . single GPU : 10x for RLHF . multiple GPUs : 6-19x speedup over Colossal-AI & 1.4-10.5x over HuggingFace DDP . example . Actor (OPT-66B) & Reward (OPT-350M) : Step1 (82m) - Step2 (5m) - Step3 (7.5h) . Actor (OPT-1.3B) & Reward (OPT-350M) : Step1 (2900s) - Step2 (670s) - Step (1.2h) [model scalability] . DeepSpeed-HE : 6.5B (a single GPU) & 50B (a single A100 40G node) . Colossal-AI : 1.3B (a single GPU) & 6.7B (a single A100 40G node)

데이터

모델 크기

새롭게 제공된 Resource

Training/Inference Pipeline

출시일

2023-04-12

Full-fledged RLHF Training Pipeline

Our pipeline includes three main steps:

•

Step 1: Supervised finetuning (SFT), where human responses to various queries are carefully selected to finetune the pretrained language models.

•

Step 2: Reward model finetuning, where a separate (usually smaller than the SFT) model (RW) is trained with a dataset that has human-provided rankings of multiple answers to the same query.

•

Step 3: RLHF training, where the SFT model is further finetuned with the reward feedback from the RW model using the Proximal Policy Optimization (PPO) algorithm.

We provide two additional features in Step 3 to help improve model quality:

•

Exponential Moving Average (EMA) collection, where an EMA based checkpoint can be chosen for the final evaluation.

•

Mixture Training, which mixes the pretraining objective (i.e., the next word prediction) with the PPO objective to prevent regression performance on public benchmarks like SQuAD2.0.

according to InstructGPT, EMA checkpoints generally provide better response quality than conventional final trained model and Mixture Training can help the model retain the pre-training benchmark solving ability. As such, we provide them for users to fully get the training experience as described in InstructGPT and strike for higher model quality.

In addition to being highly consistent with InstructGPT paper, we also provide convenient features to support researchers and practitioners to train their own RLHF model with multiple data resources:

•

Data Abstraction and Blending Capabilities: DeepSpeed-Chat is able to train the model with multiple datasets for better model quality. It is equipped with (1) an abstract dataset layer to unify the format of different datasets; and (2) data splitting/blending capabilities so that the multiple datasets are properly blended then split across the 3 training stages.

Step 3 (RLHF training) 에서의 Hybrid Engine

Each iteration requires efficient processing of two phases a) inference phase for token/experience generation, producing inputs for the training and b) training phase to update the weights of actor and reward models, as well as the interaction and scheduling between them.

two major costs: (1) the memory cost, as several copies of the SFT and RW models need to be served throughout stage 3; and (2) the predominant generation phase, which if not accelerated properly, will significantly slow down the entire stage 3. Additionally, the two important features we added in Stage 3, including Exponential Moving Average (EMA) collection and Mixture Training, will incur additional memory and training costs.

inference engine과 training engine을 동시에 제공하되, 각각에 맞는 optimization technique 이용

•

Inference : a light-weight memory management system (KV-cache & intermediate results)

◦

highly-optimized inference-adapted kernels

◦

tensor parallelism implementation (model partitioning)

•

Training : memory optimization techniques 

◦

ZeRO family (sharding) (model partitioning) & LoRA

⇒ avoid memory allocation bottlenecks & support large batch sizes