A curated list of Japanese, Korean and Vietnamese open speech corpora
This is a curated list of open speech corpora for academic use in Japanese, Korean and Vietnamese. While speech processing systems have rapidly achieved outstanding results for major languages such as English and Chinese, development for other languages has not been as active. This list was created to make it easier to jump-start a speech processing project and to spark interest in the research and development of speech processing systems.
Japanese, Korean and Vietnamese were all heavily influenced by Chinese in the past, but their modern forms have moved in different directions, which creates unique and challenging problems for speech processing systems. Japanese still uses Chinese characters (Kanji) along with Hiragana, Katakana and Romaji as writing systems. Korean uses Hangul as its main writing system, although Chinese characters are still recognized in the culture. Vietnamese has completely abandoned Chinese characters and is written exclusively in an extended Roman alphabet, yet many borrowed words (sounds) are still used in everyday life, and most people do not notice their origin. Moreover, all three languages are experiencing code-mixing, mostly with English, as the Internet becomes a utility for everyone.
This post presents a curated list of speech corpora for these three languages. The corpora listed should be usable for academic purposes. For commercial use, go to the corpus homepage and contact the owners directly. This post will be updated when new corpora are found.
Available tags: #single, #multiple, #dialect, #polyglot, #in-the-wild, #code-switch, #bilingual
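The tags can be used to narrow the list down to corpora that fit a given task. A minimal sketch follows, assuming the entries are kept in a small in-memory table; the `Corpus` class and the `filter_by_tags` helper are hypothetical names introduced for illustration, and only a handful of entries from this list are transcribed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    """One entry of the list: corpus name plus its tags (without '#')."""
    name: str
    tags: frozenset


# A few entries transcribed from this list (not the full catalogue).
CORPORA = [
    Corpus("JSUT", frozenset({"single"})),
    Corpus("JVS", frozenset({"multiple"})),
    Corpus("JMD", frozenset({"multiple", "dialect"})),
    Corpus("JTubeSpeech", frozenset({"multiple", "in-the-wild"})),
    Corpus("tri-jek", frozenset({"single", "polyglot"})),
    Corpus("JECS", frozenset({"single", "bilingual", "code-switch"})),
    Corpus("KSS Dataset", frozenset({"single"})),
    Corpus("VIVOS", frozenset({"multiple"})),
]


def filter_by_tags(corpora, required):
    """Return the names of corpora that carry every tag in `required`."""
    required = frozenset(required)
    return [c.name for c in corpora if required <= c.tags]


if __name__ == "__main__":
    # Single-speaker corpora, e.g. as candidates for a TTS voice.
    print(filter_by_tags(CORPORA, {"single"}))
    # Multi-speaker corpora that also cover dialects.
    print(filter_by_tags(CORPORA, {"multiple", "dialect"}))
```

Requiring all tags (a subset test) rather than any tag keeps combined queries such as #multiple plus #dialect precise.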
Japanese
JSUT
- Tags: #single
- Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo
- Type: Single speaker, Female (Native Japanese)
- Amount: 10 hours
- Audio quality: 48kHz, recorded in anechoic room
- License: can be used for research
- Link: JSUT
- Release year: 2017
JVS
- Tags: #multiple
- Description: Japanese versatile speech corpus
- Type: Multiple speaker (Professional speakers)
- Amount: 30 hours, 100 speakers
- Audio quality: 24kHz, studio recording
- License: can be used for research
- Link: JVS
- Release year: 2019
CSS10-ja
- Tags: #single
- Description: A collection of single speaker speech datasets for 10 languages - Japanese
- Type: Single speaker, Male (Native Japanese)
- Amount: 14.9 hours
- Audio quality: 22kHz, LibriVox audiobook
- License: CC0, public domain
- Link: CSS10-ja
- Release year: 2019
JSUT-book
- Tags: #single
- Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo, audiobook
- Type: Single speaker, Female (non-professional Japanese speaker)
- Amount: ~1 hour
- Audio quality: 48kHz
- License: can be used for research
- Link: JSUT-book
- Release year: 2020
JSSS
- Tags: #single
- Description: Japanese speech corpus for summarization and simplification
- Type: Single speaker, Female (non-professional Japanese speaker)
- Amount: ~8 hours
- Audio quality: 24kHz
- License: can be used for research
- Link: JSSS
- Release year: 2020
Also check:
- JSSS-misc: misc tasks of JSSS corpus
TEDxJP-10K
- Tags: #multiple
- Description: Japanese speech dataset for ASR evaluation built from Japanese TEDx videos and their subtitles
- Type: Multiple speakers, TEDx talks
- Amount: 10,000 segments of videos in YouTube “TEDx talks in Japanese” playlist
- Audio quality: varying
- License: N/A
- Link: TEDxJP-10k
- Release year: 2020
LaboroTVSpeech
- Tags: #multiple
- Description: A large-scale Japanese speech corpus on TV recordings
- Type: Multiple speakers, TV recordings and their subtitles
- Amount: over 2,000 hours of speech
- Audio quality: 16 kHz
- License: N/A
- Link: LaboroTVSpeech
- Release year: 2020
JMD
- Tags: #multiple, #dialect
- Description: Japanese multi-dialect corpus for speech synthesis
- Type: Several speakers, native dialect speaker’s voice
- Amount: 2 speakers, ~2 hours per speaker
- Audio quality: 24kHz
- License: can be used for research
- Link: JMD
- Release year: 2021
J-KAC
- Tags: #single
- Description: Japanese Kamishibai and audiobook corpus
- Type: Single speaker, Male (Professional speaker)
- Amount: ~9 hours
- Audio quality: 48kHz
- License: research only
- Link: J-KAC
- Release year: 2021
JTubeSpeech
- Tags: #multiple, #in-the-wild
- Description: Corpus of Japanese speech collected from YouTube
- Type: YouTube scraping, natural and synthetic speech (TTS)
- Amount: 10,000 hours, lots of speakers
- Audio quality: varying
- License: N/A
- Link: JTubeSpeech
- Release year: 2021
tri-jek
- Tags: #single, #polyglot
- Description: Japanese-English-Korean tri-lingual speech corpus
- Type: Single speaker, Female, Japanese (native), Korean (native), English
- Amount: 11 hours (ja: 2.8, kr: 6.7, and en: 1.5 hours)
- Audio quality: 24kHz
- License: can be used for research
- Link: tri-jek
- Release year: 2021
Kokoro
- Tags: #single
- Description: Kokoro Speech Dataset is a public domain Japanese speech dataset
- Type: Single Speaker, Male, native Japanese, Librivox audiobook
- Amount: ~60 hours
- Audio quality: 22.05 kHz
- License: CC0, public domain
- Link: Kokoro-Speech-Dataset
- Release year: 2021
JECS
- Tags: #single, #bilingual, #code-switch
- Description: Japanese-English bilingual code-switching corpus
- Type: Single speaker, Male, bilingual speaker, parallel English and Japanese utterances + code-switched utterances with acted emotion
- Amount: 2.5 hours in total
- Audio quality: 24kHz
- License: Can be used for research
- Link: jecs
- Release year: 2022
SpeedSpeech-JA-2022
- Tags: #multiple
- Description: Speech-rate conversion corpus, each sentence read at different speeds by the same speaker
- Type: One male and one female professional narrator
- Amount: 324 sentences per speed rate per speaker
- Audio quality: 48 kHz, 24-bit
- License: CC BY-NC 4.0
- Link: SpeedSpeech-JA-2022
- Release year: 2022
SMASH corpus
- Tags: #multiple
- Description: A spontaneous speech corpus recording third-person audio commentaries on gameplay
- Type: players’ conversations (Super Smash Bros. Ultimate), game screen capture, third-person commentaries and transcript
- Amount: ~3.2 hours of speech, multiple matches
- Audio quality: 16 kHz
- License: Can be used for research
- Link: smash
- Release year: 2022
Korean
Seoul Corpus
- Tags: #multiple
- Description: The Korean Corpus of Spontaneous Speech
- Type: Multiple speakers, age/gender groups, interviews, labeling
- Amount: 42.8 hours, 40 speakers
- Audio quality: 22.05kHz
- License: CC BY-NC 2.0
- Link: Seoul Corpus - OpenSLR
- Release year: 2015
KSS Dataset
- Tags: #single
- Description: Korean Single speaker Speech Dataset
- Type: Single speaker, Female (Professional voice actress)
- Amount: 12 hours, 12853 utterances
- Audio quality: 44.1kHz
- License: non-commercial use only
- Link: KSS Dataset
- Release year: 2018
Zeroth Korean
- Tags: #multiple
- Description: Audio data of Project Zeroth for Korean Speech Recognition
- Type: Multiple speakers (Crowdsourcing)
- Amount: 76.6 hours, 35139 utterances, 137 speakers, 16472 unique sentences
- Audio quality: crowdsourcing using MoreCoin (Android phone record devices)
- License: CC BY 4.0
- Link: Zeroth Project, alias: Openslr - Zeroth Korean
- Release year: 2018
Pansori-TEDxKR
- Tags: #multiple
- Description: Korean speech corpus generated from Korean language TEDx talks
- Type: Multiple speakers (TEDx talks)
- Amount: ~3 hours, 41 speakers
- Audio quality: 16kHz, TEDx talks
- License: CC BY-NC-ND 4.0
- Link: Pansori TEDxKR Corpus, alias: Openslr - Pansori-TEDxKR
- Release year: 2019
Deeply Korean Read Speech corpus
- Tags: #multiple
- Description: Pairs of Korean reading the scripts with 3 text sentiments using 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
- Type: Multiple speakers
- Amount: ~3 hours, ~2000 utterances (a 1% subset of a commercial corpus)
- Audio quality: Studio apartment, Dance studio, Anechoic chamber
- License: CC BY-NC-ND 4.0
- Link: Deeply Korean read speech corpus, Openslr
- Release year: 2021
Deeply parent-child vocal interaction dataset
- Tags: #multiple
- Description: Interactions between parent-child pairs (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
- Type: Multiple speakers
- Amount: ~16 hours, ~20000 utterances (a 1% subset of a commercial corpus)
- Audio quality: Studio apartment, Dance studio, Anechoic chamber
- License: CC BY-NC-ND 4.0
- Link: Deeply parent-child vocal interaction dataset, Openslr
- Release year: 2021
tri-jek
- Tags: #single, #polyglot
Details in the Japanese Section
Vietnamese
VIVOS
- Tags: #multiple
- Description: Vietnamese speech corpus for speech recognition
- Type: Multiple speakers (volunteers)
- Amount: 15 hours, 12420 utterances, 65 speakers
- Audio quality: 16kHz, quiet room
- License: CC BY-NC-SA 4.0
- Link: VIVOS
- Release year: 2016
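Several entries above report both a total duration and an utterance count, which allows a rough estimate of the average utterance length when deciding whether a corpus suits a given model. The sketch below is a back-of-the-envelope calculation only: the figures are copied from the "Amount" fields of KSS Dataset, Zeroth Korean and VIVOS, the hour totals are rounded in the source, and the helper name `mean_utterance_seconds` is hypothetical.

```python
# (total hours, number of utterances), copied from the "Amount" fields above.
AMOUNTS = {
    "KSS Dataset": (12.0, 12853),
    "Zeroth Korean": (76.6, 35139),
    "VIVOS": (15.0, 12420),
}


def mean_utterance_seconds(hours, utterances):
    """Average utterance length in seconds, from rounded corpus totals."""
    if utterances <= 0:
        raise ValueError("utterances must be a positive count")
    return hours * 3600.0 / utterances


if __name__ == "__main__":
    for name, (hours, count) in AMOUNTS.items():
        seconds = mean_utterance_seconds(hours, count)
        print(f"{name}: about {seconds:.1f} s per utterance")
```

Because the totals are rounded, the results are approximate and say nothing about how utterance lengths are distributed.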
Update History
- 20220916: Updated the VIVOS link; added several Japanese speech corpora: TEDxJP-10K, LaboroTVSpeech, Kokoro, JECS, SpeedSpeech-JA-2022, SMASH corpus
- 20211123: Removed VinBigdata-VLSP2020-100h; added a new Korean corpus (Seoul Corpus), a new Japanese corpus (JTubeSpeech), and a new polyglot corpus (tri-jek)
- 20210628: Added several Japanese corpora: JSSS, JMD, J-KAC
- 20210209: Added a new Japanese single-speaker corpus (JSUT-book) and 2 Korean multi-speaker corpora: Deeply Korean read speech corpus, Deeply parent-child vocal interaction dataset
- 20201209: Added a new Vietnamese multi-speaker corpus: VinBigdata-VLSP2020-100h
- 20190925: Added a new Japanese single-speaker corpus: CSS10-ja
- 20190902: Added a new Japanese multi-speaker corpus: JVS
- 20190215: Added a new Korean multi-speaker corpus: Pansori-TEDxKR
- 20180422: Initial post with 4 corpora: JSUT, KSS, Zeroth-Korean, VIVOS