Polyglot-Ko

Affiliation

EleutherAI

BigScience

Commercial

Fine-tuning Method

Note

- github, huggingface, blog
- GPT-architecture language model released by EleutherAI (multilingual; specifically for Korean)
- 5.8B model: trained on 172B tokens over 320k steps on 256 A100 GPUs with the GPT-NeoX framework.
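
The training figures in the note imply an average token budget per step. The following is a minimal sketch of that arithmetic, assuming that "steps" means optimizer steps and that the work is spread evenly over all 256 GPUs; neither assumption is stated in the note, and the function names are illustrative only.

```python
# Figures quoted in the note for the 5.8B model.
TOTAL_TOKENS = 172e9    # 172B tokens
TOTAL_STEPS = 320_000   # 320k steps
NUM_GPUS = 256          # A100 GPUs


def tokens_per_step(total_tokens: float, total_steps: int) -> float:
    """Average number of tokens consumed per training step."""
    return total_tokens / total_steps


def tokens_per_gpu_per_step(total_tokens: float, total_steps: int, num_gpus: int) -> float:
    """Average tokens per step per GPU, ASSUMING an even split across GPUs."""
    return tokens_per_step(total_tokens, total_steps) / num_gpus


per_step = tokens_per_step(TOTAL_TOKENS, TOTAL_STEPS)
per_gpu = tokens_per_gpu_per_step(TOTAL_TOKENS, TOTAL_STEPS, NUM_GPUS)

print(f"{per_step:,.0f} tokens per step")
print(f"{per_gpu:,.1f} tokens per GPU per step")
```

Under these assumptions the average works out to 537,500 tokens per step, or roughly 2,100 tokens per GPU per step. The actual batch size and parallelism layout are not given in the note, so this is only an order-of-magnitude check.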

Data

- TUNiB Korean Data (863 GB, from 1.2 TB of original data)

Model size

1.3B, 3.8B, 5.8B

Newly provided resource

Model

Release date

2022-12

Training Data

Source	Size (GB)	Link
Korean blog posts	682.3	-
Korean news dataset	87.0	-
Modu corpus	26.4	corpus.korean.go.kr
Korean patent dataset	19.0	-
Korean Q & A dataset	18.1	-
KcBert dataset	12.7	github.com/Beomi/KcBERT
Korean fiction dataset	6.1	-
Korean online comments	4.2	-
Korean wikipedia	1.4	ko.wikipedia.org
Clova call	< 1.0	github.com/clovaai/ClovaCall
Naver sentiment movie corpus	< 1.0	github.com/e9t/nsmc
Korean hate speech dataset	< 1.0	-
Open subtitles	< 1.0	opus.nlpl.eu/OpenSubtitles.php
AIHub various tasks datasets	< 1.0	aihub.or.kr
Standard Korean language dictionary	< 1.0	stdict.korean.go.kr/main/main.do
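
As a consistency check, the per-source sizes in the table can be summed and compared with the 863 GB total quoted for the dataset. The sketch below copies the sizes from the table and treats each "< 1.0" row as an unknown amount between 0 and 1 GB; the variable and function names are illustrative only.

```python
# Sizes in GB copied from the Training Data table.
# None marks rows listed as "< 1.0" (exact size not given).
SOURCES = [
    ("Korean blog posts", 682.3),
    ("Korean news dataset", 87.0),
    ("Modu corpus", 26.4),
    ("Korean patent dataset", 19.0),
    ("Korean Q & A dataset", 18.1),
    ("KcBert dataset", 12.7),
    ("Korean fiction dataset", 6.1),
    ("Korean online comments", 4.2),
    ("Korean wikipedia", 1.4),
    ("Clova call", None),
    ("Naver sentiment movie corpus", None),
    ("Korean hate speech dataset", None),
    ("Open subtitles", None),
    ("AIHub various tasks datasets", None),
    ("Standard Korean language dictionary", None),
]


def summarize(sources):
    """Return (sum of known sizes in GB, number of rows listed as '< 1.0')."""
    known = sum(size for _, size in sources if size is not None)
    small = sum(1 for _, size in sources if size is None)
    return known, small


known_gb, small_rows = summarize(SOURCES)
lower_gb = known_gb                      # if every "< 1.0" row were ~0 GB
upper_gb = known_gb + small_rows * 1.0   # if every "< 1.0" row were ~1 GB
blog_share = 682.3 / known_gb            # share of the listed sizes from blog posts

print(f"known sizes: {known_gb:.1f} GB, rows under 1 GB: {small_rows}")
print(f"total between {lower_gb:.1f} and {upper_gb:.1f} GB")
print(f"blog posts: {blog_share:.1%} of the listed sizes")
```

The listed sizes sum to 857.2 GB, and the six rows under 1 GB keep the total below 863.2 GB, which is consistent with the 863 GB figure. Blog posts account for roughly 80% of the listed sizes.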