- RoBERTa Data (BookCorpus, Stories, CCNews v2)
- the Pile (CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO, Wikipedia)
- PushShift.io Reddit (keep only the longest comment chain in each thread; this discards roughly 66% of the corpus)
→ mostly English text extracted (CommonCrawl contains a small amount of non-English data)
→ deduplicated with MinhashLSH (documents with Jaccard similarity ≥ .95 filtered out; the Pile contains many duplicate documents)
→ GPT-2 BPE tokenizer
→ final dataset: 180B tokens
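The MinhashLSH deduplication step above can be sketched in plain Python. This is an illustrative toy, not the actual pipeline: the character 5-gram shingles, 128 permutations, and 32×4 banding scheme are assumed parameters (production setups typically use a library such as datasketch).

```python
import hashlib
import random
from collections import defaultdict

def shingles(text, n=5):
    """Character n-gram shingle set of a document (n=5 is an assumption)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def _h(salt, shingle):
    """Deterministic 64-bit hash of a shingle under a given salt."""
    data = f"{salt}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(shingle_set, num_perm=128, seed=0):
    """MinHash signature: for each salted hash function, keep the minimum."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return [min(_h(s, sh) for sh in shingle_set) for s in salts]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidates(sigs, bands=32, rows=4):
    """Band each signature; docs sharing any band bucket become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(sigs):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

def dedup(docs, threshold=0.95):
    """Drop the later document of any candidate pair at or above threshold."""
    sigs = [minhash(shingles(d)) for d in docs]
    dropped = set()
    for i, j in sorted(lsh_candidates(sigs)):
        if j not in dropped and est_jaccard(sigs[i], sigs[j]) >= threshold:
            dropped.add(j)
    return [d for k, d in enumerate(docs) if k not in dropped]
```

The LSH banding avoids comparing all document pairs: only documents that collide in at least one band bucket are checked against the 0.95 threshold.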
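The GPT-2 tokenizer mentioned above is a byte-level BPE model with a fixed, pre-trained merge table. The BPE merge-learning rule itself (repeatedly merge the most frequent adjacent symbol pair) can be illustrated with a toy trainer; the corpus and merge count below are made up for illustration and have nothing to do with GPT-2's actual 50k-merge vocabulary.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    words = [tuple(w) for w in corpus]  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                # replace every occurrence of the winning pair with one symbol
                if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(tuple(out))
        words = new_words
    return merges, words
```

At tokenization time, GPT-2 applies its learned merges in order to raw bytes rather than characters, so any input is representable without unknown tokens.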