Training Data
| Source | Size (GB) | Link |
| --- | --- | --- |
| Korean blog posts | 682.3 | - |
| Korean news dataset | 87.0 | - |
| Modu corpus | 26.4 | corpus.korean.go.kr |
| Korean patent dataset | 19.0 | - |
| Korean Q&A dataset | 18.1 | - |
| KcBERT dataset | 12.7 | github.com/Beomi/KcBERT |
| Korean fiction dataset | 6.1 | - |
| Korean online comments | 4.2 | - |
| Korean Wikipedia | 1.4 | ko.wikipedia.org |
| ClovaCall | < 1.0 | github.com/clovaai/ClovaCall |
| Naver sentiment movie corpus | < 1.0 | github.com/e9t/nsmc |
| Korean hate speech dataset | < 1.0 | - |
| OpenSubtitles | < 1.0 | opus.nlpl.eu/OpenSubtitles.php |
| AIHub various tasks datasets | < 1.0 | aihub.or.kr |
| Standard Korean language dictionary | < 1.0 | stdict.korean.go.kr/main/main.do |
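
To get a feel for how the corpus is distributed, the sizes in the table above can be summed directly. The following is a minimal sketch, not part of the original data pipeline: the dictionary `SIZES_GB` and the helper `size_bounds` are hypothetical names introduced here, and entries listed only as "< 1.0" are treated as unknown values between 0 and 1 GB, so the result is a range rather than a single figure.

```python
# Sizes in GB copied from the table above.
# None marks sources whose size is given only as "< 1.0" (assumption: 0 to 1 GB each).
SIZES_GB = {
    "Korean blog posts": 682.3,
    "Korean news dataset": 87.0,
    "Modu corpus": 26.4,
    "Korean patent dataset": 19.0,
    "Korean Q&A dataset": 18.1,
    "KcBERT dataset": 12.7,
    "Korean fiction dataset": 6.1,
    "Korean online comments": 4.2,
    "Korean Wikipedia": 1.4,
    "ClovaCall": None,
    "Naver sentiment movie corpus": None,
    "Korean hate speech dataset": None,
    "OpenSubtitles": None,
    "AIHub various tasks datasets": None,
    "Standard Korean language dictionary": None,
}


def size_bounds(sizes):
    """Return (lower, upper) bounds in GB for the total corpus size.

    The lower bound counts only sources with an explicit size; the upper
    bound adds 1.0 GB for every source listed as "< 1.0".
    """
    known = sum(v for v in sizes.values() if v is not None)
    n_small = sum(1 for v in sizes.values() if v is None)
    return known, known + n_small * 1.0


lower_gb, upper_gb = size_bounds(SIZES_GB)

# Share of blog posts relative to the explicitly sized sources only.
blog_share = SIZES_GB["Korean blog posts"] / lower_gb

print(f"Total size: between {lower_gb:.1f} and {upper_gb:.1f} GB")
print(f"Blog posts: about {blog_share:.0%} of the explicitly sized data")
```

The range is reported instead of a single total because six sources have no exact size in the table; the blog-post share is therefore computed against the explicitly sized sources only and slightly overstates the true fraction.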