ColossalChat (Coati)

HPC-AI tech
Fine-tuning Method
- Homepage, blog, ColossalAI GitHub, ColossalChat GitHub
- Provides a pipeline for training giant GPT models: up to 10x speedup, 47x cost savings, 175B+ parameters
- The first to open-source a complete RLHF pipeline, from training to deployment
- Training/inference pipeline: vs. PyTorch, 1.4x faster inference, 7.7x faster training, and models up to 10.3x larger (3x faster training than the Alpaca code)
- Automatic parallelism, memory management, dynamic scheduling, and data / pipeline / sequence / tensor parallelism
- Chip- and cloud-agnostic: GPUs, TPUs, FPGAs, CPUs
- ColossalAI Talking Intelligence (Coati models): provides the RLHF pipeline
  • Supports ColossalAI's comprehensive large-model training acceleration, without requiring knowledge of complex distributed training algorithms
  • Supervised dataset collection
  • Supervised instruction fine-tuning
  • Reward model training
  • Reinforcement learning with human feedback
  • Quantized inference
  • Fast model deployment
  • Fully integrated with the Hugging Face ecosystem, with a high degree of model customization
- Limitations of the LLaMA-fine-tuned models and dataset
  . LLaMA fine-tuned models: knowledge gaps inherited from LLaMA / lack of counting ability / lack of logic (reasoning and calculation) / tendency to repeat the last sentence / poor multilingual results
  . InstructWild dataset: lacks summarization ability / multi-turn chat and role-playing / self-recognition / safety
- Quantization and deployment support
  . 8-bit quantization (RTN), 4-bit quantization (GPTQ), and FP16 inference
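The 8-bit path above uses RTN (round-to-nearest). A minimal NumPy sketch of symmetric RTN quantization, with a per-tensor scale, can illustrate the idea; the function names and per-tensor (rather than per-channel) scaling are illustrative assumptions, not Coati's actual implementation:

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int = 8):
    """Round-to-nearest (RTN) symmetric quantization of a weight tensor.
    Returns int codes plus the scale needed to dequantize."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit
    scale = np.abs(w).max() / qmax        # map the largest |weight| to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights from the int codes."""
    return q.astype(np.float32) * scale

# Quantize a small weight matrix and measure the worst-case error,
# which for RTN is bounded by half the quantization step (scale / 2).
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = rtn_quantize(w)
w_hat = rtn_dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())
```

GPTQ improves on this baseline by choosing the rounding of each weight to minimize layer output error, rather than rounding each weight independently.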
- InstructWild (link): 52K instructions in English (24M tokens) & 52K instructions in Chinese (30M tokens)
  . Collection process: gather 700 noisy instructions from Twitter and filter out the noisy ones → 429 clean instructions (unlike Alpaca, instructions are collected without restrictions) → ChatGPT, given 5 example prompts, generates new instructions → ChatGPT then produces a response for each instruction
  . English and Chinese are collected separately
  . Total data collection cost: $880
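The few-shot generation step above can be sketched as prompt construction: sample 5 seed instructions as in-context examples and ask the model for new ones. The seed pool and prompt wording below are hypothetical stand-ins for the 429 cleaned Twitter instructions, not the actual InstructWild prompt:

```python
import random

# Hypothetical seeds standing in for the 429 cleaned Twitter instructions.
SEED_INSTRUCTIONS = [
    "Explain the difference between TCP and UDP.",
    "Write a haiku about autumn rain.",
    "Summarize the plot of Hamlet in two sentences.",
    "Translate 'good morning' into French.",
    "List three uses of a binary search tree.",
    "Suggest a healthy 15-minute breakfast.",
]

def build_generation_prompt(seed_pool, n_examples=5, rng=None):
    """Build a few-shot prompt mirroring the InstructWild recipe:
    show n_examples sampled seed instructions, then request new ones."""
    rng = rng or random.Random(0)
    examples = rng.sample(seed_pool, n_examples)
    lines = ["Here are some example instructions:"]
    lines += [f"{i + 1}. {s}" for i, s in enumerate(examples)]
    lines.append("Now write 20 new, diverse instructions in the same style:")
    return "\n".join(lines)

prompt = build_generation_prompt(SEED_INSTRUCTIONS)
```

The generated instructions would then be sent back to ChatGPT a second time to collect a response for each one.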
Model size
Newly provided resources
Training/Inference Pipeline



ZeRO + Gemini to Reduce Memory Redundancy

Colossal-AI supports ZeRO (Zero Redundancy Optimizer) to improve memory usage efficiency, enabling larger models to be accommodated at a lower cost, without affecting computing granularity and communication efficiency.
The automatic chunk mechanism can further improve ZeRO’s performance by increasing memory usage efficiency, reducing communication frequency, and avoiding memory fragmentation.
The heterogeneous memory space manager, Gemini, supports offloading optimizer states from GPU memory to CPU memory or hard disk space to overcome the limitations of GPU memory capacity, expand the scale of trainable models, and reduce the cost of large AI model applications.
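The two ideas above can be sketched together: parameters are packed into fixed-size chunks so that communication and offloading operate on whole chunks rather than individual tensors, and a memory manager evicts chunks that exceed the GPU budget to CPU memory. The chunk size, names, and greedy eviction policy below are illustrative assumptions, not Colossal-AI's actual heuristics:

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 1024  # elements per chunk (illustrative; real sizes are tuned)

@dataclass
class Chunk:
    capacity: int
    device: str = "gpu"
    used: int = 0
    params: list = field(default_factory=list)

    def try_append(self, name: str, numel: int) -> bool:
        """Append a parameter if it fits in the remaining capacity."""
        if self.used + numel > self.capacity:
            return False
        self.params.append((name, numel))
        self.used += numel
        return True

def pack_into_chunks(param_sizes, capacity=CHUNK_SIZE):
    """Greedily pack (name, numel) parameters into fixed-size chunks,
    avoiding the fragmentation of managing each tensor separately."""
    chunks = [Chunk(capacity)]
    for name, numel in param_sizes:
        if not chunks[-1].try_append(name, numel):
            chunks.append(Chunk(capacity))
            chunks[-1].try_append(name, numel)
    return chunks

def offload_cold_chunks(chunks, gpu_budget):
    """Gemini-style policy sketch: keep chunks on GPU until the budget
    is exhausted, then evict the remainder to CPU memory."""
    used = 0
    for c in chunks:
        if used + c.used <= gpu_budget:
            c.device = "gpu"
            used += c.used
        else:
            c.device = "cpu"
    return chunks
```

In the real system the manager also tracks access patterns at runtime to decide which chunks to prefetch back to GPU before they are needed.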