Paper
- Architecture : Pre-Norm, SwiGLU, Rotary PE (first used in GPT-Neo & GPT-J, then adopted across LLMs); a minimal block sketch follows this list
- Efficient Implementation : memory-efficient causal attention via the xformers library (FlashAttention-style self-attention backward); backward function implemented by hand instead of PyTorch autograd, saving the expensive activations (linear-layer outputs) from the forward pass; model & sequence parallelism to reduce memory (sketch below)
- Resources for 65.2B : ~380 tokens/sec/GPU on 2048 A100 GPUs with 80 GB of RAM; training on 1.4T tokens takes ~21 days (sanity check below)
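To make the three architectural changes concrete, here is a minimal PyTorch sketch of one transformer block. Dimensions are illustrative, LayerNorm stands in for LLaMA's RMSNorm, and the RoPE helper uses the rotate-half formulation; this is a shape-level sketch, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, base=10000.0):
    """Rotary PE, rotate-half formulation (as in GPT-NeoX/LLaMA-style code).
    x: (batch, heads, seq, head_dim) with even head_dim."""
    b, h, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    ang = torch.arange(s, dtype=torch.float32, device=x.device)[:, None] * freqs  # (s, half)
    cos, sin = ang.cos(), ang.sin()            # broadcast over batch and heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class SwiGLU(nn.Module):
    """FFN(x) = W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class PreNormBlock(nn.Module):
    """Pre-Norm: normalize sub-layer *inputs* (LLaMA uses RMSNorm; LayerNorm here for brevity)."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.ffn = SwiGLU(dim, hidden=int(2 / 3 * 4 * dim))  # LLaMA's 2/3 * 4d hidden size
        self.n_heads = n_heads
    def attn(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        q, k = apply_rope(q), apply_rope(k)    # RoPE is applied to queries/keys only
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, s, d))
    def forward(self, x):
        x = x + self.attn(self.n1(x))
        return x + self.ffn(self.n2(x))
```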
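A sketch of what the efficiency tricks look like in code, assuming xformers is installed and a CUDA device is available. `torch.utils.checkpoint` is used here only as a generic stand-in for the paper's hand-written backward; the paper instead keeps the expensive linear-layer outputs and recomputes only the cheap activations:

```python
import torch
from torch.utils.checkpoint import checkpoint
import xformers.ops as xops

# Causal attention without materializing the (seq x seq) score matrix.
# xformers expects (batch, seq, heads, head_dim) tensors.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
y = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
# (torch.nn.functional.scaled_dot_product_attention(..., is_causal=True) is a
#  built-in alternative that dispatches to FlashAttention-style kernels.)

# Trading recomputation for memory: checkpoint() re-runs the block in backward
# instead of storing all intermediate activations.
ffn = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
).cuda().half()
x = torch.randn(2, 1024, 512, device="cuda", dtype=torch.float16, requires_grad=True)
out = checkpoint(ffn, x, use_reentrant=False)
out.sum().backward()
```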
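The ~21-day figure follows directly from the throughput numbers; a quick sanity check:

```python
tokens_per_sec = 380 * 2048             # per-GPU throughput x GPU count = 778,240 tokens/s
days = 1.4e12 / tokens_per_sec / 86400  # 1.4T training tokens, 86,400 s per day
print(round(days, 1))                   # 20.8 -> matches the reported ~21 days
```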
Data (mixture proportions in parentheses; sampling sketch after the list)
- English CommonCrawl (67%) : CCNet pipeline for deduplication + quality filtering, plus an extra linear classifier that keeps pages resembling those cited as references by Wikipedia (toy sketch after the list)
- C4 (15%) : deduplication and language identification similar to CCNet, but quality filtering relies mostly on heuristics
- GitHub (4.5%) : only projects under permissive Apache, BSD, and MIT licenses
- Wikipedia (4.5%) : 20 languages using Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk
- Gutenberg and Books3 (4.5%) : book corpora, deduplicated at the book level
- ArXiv (2.5%) : LaTeX source files (everything before the first section, comments, and the bibliography removed; macros inline-expanded)
- Stack Exchange (2%) : high-quality Q&A data (HTML tags removed, answers sorted by score)
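The percentages above are sampling proportions over disjoint sources. A minimal sketch of how such a mixture can drive training-time sampling; the corpus names are just labels for the list above, not real dataset handles:

```python
from collections import Counter
import numpy as np

# Disjoint sampling proportions from the list above (sum to 1.0).
mixture = {
    "CommonCrawl": 0.67, "C4": 0.15, "GitHub": 0.045, "Wikipedia": 0.045,
    "Books": 0.045, "arXiv": 0.025, "StackExchange": 0.02,
}
names, probs = list(mixture), list(mixture.values())
rng = np.random.default_rng(0)

def sample_source() -> str:
    """Pick which corpus the next training sequence is drawn from."""
    return str(rng.choice(names, p=probs))

# Roughly 67% of sequences should come from CommonCrawl:
print(Counter(sample_source() for _ in range(1000)))
```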
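And a toy sketch of the shape of the extra CommonCrawl filter: a linear classifier trained to keep pages that look like those Wikipedia cites as references. The features and training examples here are placeholders, since the paper does not spell out the exact setup:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder examples: positives are pages Wikipedia cites as references,
# negatives are randomly sampled crawl pages.
pos = ["peer reviewed study of population trends ...", "official statistics report ..."]
neg = ["buy cheap pills now!!!", "click here to win a prize ..."]
texts, labels = pos + neg, [1] * len(pos) + [0] * len(neg)

vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = LogisticRegression().fit(vec.transform(texts), labels)

def keep(page_text: str, threshold: float = 0.5) -> bool:
    """Keep a crawled page only if the classifier scores it as reference-like."""
    return clf.predict_proba(vec.transform([page_text]))[0, 1] >= threshold
```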