Polyglot-Ko

Affiliation
EleutherAI
BigScience
Commercial
Fine-tuning Method
Note
- github, huggingface, blog
- GPT-architecture language model released by EleutherAI (multilingual; specifically for Korean)
- Model 5.8B: 172B tokens, 320k steps, 256 A100 GPUs with the GPT-NeoX framework.
Data
- TUNiB Korean Data (863 GB; 1.2 TB in its original form)
Model size
1.3B, 3.8B, 5.8B
Newly released resource
Model
Release date
2022-12
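
The training figures quoted in the note for the 5.8B model (172B tokens, 320k steps, 256 A100 GPUs) imply an average number of tokens processed per optimizer step. The sketch below only does that arithmetic. It assumes the rounded figures from the note and that all 256 GPUs took part in every step, which the entry does not state; the variable names are illustrative.

```python
# Rounded figures from the note for the 5.8B model.
total_tokens = 172_000_000_000   # "172B tokens"
training_steps = 320_000         # "320k steps"
num_gpus = 256                   # "256 A100 GPUs"

# Average tokens consumed per optimizer step across the whole cluster.
tokens_per_step = total_tokens / training_steps

# Average tokens per step per GPU, ASSUMING all GPUs participate in every step.
tokens_per_step_per_gpu = tokens_per_step / num_gpus

print(f"tokens per step:         {tokens_per_step:,.0f}")
print(f"tokens per step per GPU: {tokens_per_step_per_gpu:,.1f}")
```

Because the inputs are rounded, the result is an estimate of the effective batch size in tokens, not a reported hyperparameter.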

Training Data

| Source | Size (GB) | Link |
| --- | --- | --- |
| Korean blog posts | 682.3 | - |
| Korean news dataset | 87.0 | - |
| Modu corpus | 26.4 | corpus.korean.go.kr |
| Korean patent dataset | 19.0 | - |
| Korean Q & A dataset | 18.1 | - |
| KcBert dataset | 12.7 | github.com/Beomi/KcBERT |
| Korean fiction dataset | 6.1 | - |
| Korean online comments | 4.2 | - |
| Korean wikipedia | 1.4 | ko.wikipedia.org |
| Clova call | < 1.0 | github.com/clovaai/ClovaCall |
| Naver sentiment movie corpus | < 1.0 | github.com/e9t/nsmc |
| Korean hate speech dataset | < 1.0 | - |
| Open subtitles | < 1.0 | opus.nlpl.eu/OpenSubtitles.php |
| AIHub various tasks datasets | < 1.0 | aihub.or.kr |
| Standard Korean language dictionary | < 1.0 | stdict.korean.go.kr/main/main.do |
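
As a quick consistency check, the per-source sizes listed above can be summed and compared with the 863 GB total given for the TUNiB Korean data. The sketch below is a minimal illustration: the six sources listed as "< 1.0" have no exact size, so it treats each as contributing between 0 and 1 GB (an assumption), and the variable names are hypothetical.

```python
# Sizes in GB for sources with an exact figure in the training data listing.
listed_gb = {
    "Korean blog posts": 682.3,
    "Korean news dataset": 87.0,
    "Modu corpus": 26.4,
    "Korean patent dataset": 19.0,
    "Korean Q & A dataset": 18.1,
    "KcBert dataset": 12.7,
    "Korean fiction dataset": 6.1,
    "Korean online comments": 4.2,
    "Korean wikipedia": 1.4,
}

# Six sources are listed only as "< 1.0" GB; ASSUME each adds between 0 and 1 GB.
num_sources_under_1gb = 6
stated_total_gb = 863.0  # total quoted for the processed dataset

lower_bound_gb = sum(listed_gb.values())
upper_bound_gb = lower_bound_gb + num_sources_under_1gb * 1.0

# Share of the stated total that comes from blog posts.
blog_share = listed_gb["Korean blog posts"] / stated_total_gb

print(f"sum of listed sizes: {lower_bound_gb:.1f} GB")
print(f"upper bound:         {upper_bound_gb:.1f} GB")
print(f"stated total within bounds: {lower_bound_gb <= stated_total_gb <= upper_bound_gb}")
print(f"blog posts share:    {blog_share:.1%}")
```

Under these assumptions the stated total falls inside the computed range, and blog posts account for roughly four fifths of the corpus.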