

JTubeSpeech: Corpus of Japanese speech collected from YouTube

This repository provides 1) a list of YouTube videos with Japanese subtitles (JTubeSpeech), 2) scripts for making similar lists for new languages, and 3) tiny lists for other languages.

Description

Each data/{lang}/{YYYYMM}.csv file lists videos as follows (see step 4 for downloading):

| videoid | auto | sub | channelid |
| --- | --- | --- | --- |
| 0017RsBbUHk | True | True | UCTW2tw0Mhho72MojB1L48IQ |
| 00PqfZgiboc | False | True | UCzoghTgl4dvIW9GZF6UC-BA |

  • lang: Language ID (ja [Japanese], en [English], ...).
  • YYYYMM: Year and month when the data was collected.
  • videoid: YouTube video ID. Its YouTube page is https://www.youtube.com/watch?v={videoid}.
  • auto: Whether the video has automatic subtitles.
  • sub: Whether the video has manual (i.e., human-generated) subtitles.
  • channelid: YouTube channel ID. Its YouTube page is https://www.youtube.com/channel/{channelid}.
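
To work with these lists programmatically, a minimal loading sketch (assuming pandas is installed and that the CSV header matches the columns above):

```python
# Minimal sketch: load a JTubeSpeech list and keep videos with manual
# subtitles. Assumes the CSV header is videoid,auto,sub,channelid as in
# the table above; pandas parses the True/False values as booleans.
import pandas as pd

df = pd.read_csv("data/ja/202103.csv")
manual = df[df["sub"]]  # videos with human-generated subtitles
print(f"{len(manual)} of {len(df)} videos have manual subtitles")
```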

Statistics

| lang | filename (data/) | #videos-sub-true | #videos-auto-true |
| --- | --- | --- | --- |
| ja | ja/202103.csv | 110,000 (10,000 hours) | 4,960,000 |
| en | en/202108_middle.csv | 739543 | 667555 |
|    | en/202108_tiny.csv | 74227 | 65570 |
| ru | ru/202203_middle.csv | 258222 | 349388 |
|    | ru/202108_tiny.csv | 39890 | 46061 |
| de | de/202203_middle.csv | 194468 | 527993 |
|    | de/202108_tiny.csv | 30727 | 66954 |
| fr | fr/202203_middle.csv | 164261 | 524261 |
|    | fr/202108_tiny.csv | 25371 | 70466 |
| ar | ar/202203_middle.csv | 158568 | 311697 |
|    | ar/202108_tiny.csv | 31993 | 42649 |
| th | th/202203_middle.csv | 154416 | 250417 |
|    | th/202108_tiny.csv | 40886 | 26907 |
| tr | tr/202203_middle.csv | 154213 | 494187 |
|    | tr/202108_tiny.csv | 27317 | 68079 |
| hi | hi/202203_middle.csv | 132175 | 172565 |
|    | hi/202108_tiny.csv | 34034 | 31439 |
| zh | zh/202108_middle.csv | 126271 | 23387 |
|    | zh/202108_tiny.csv | 63126 | 23387 |
| id | id/202203_middle.csv | 105334 | 447836 |
|    | id/202108_tiny.csv | 18086 | 72760 |
| el | el/202203_middle.csv | 96436 | 156445 |
|    | el/202108_tiny.csv | 25947 | 26735 |
| pt | pt/202203_middle.csv | 90600 | 436425 |
|    | pt/202108_tiny.csv | 11692 | 48974 |
| da | da/202203_middle.csv | 86027 | 421190 |
|    | da/202108_tiny.csv | 18779 | 62094 |
| bn | bn/202203_middle.csv | 75371 | 303335 |
|    | bn/202108_tiny.csv | 16315 | 57112 |
| fi | fi/202203_middle.csv | 68571 | 347307 |
|    | fi/202108_tiny.csv | 15561 | 50626 |
| ta | ta/202203_middle.csv | 66923 | 89209 |
|    | ta/202108_tiny.csv | 21860 | 26120 |
| hu | hu/202203_middle.csv | 64792 | 351426 |
|    | hu/202108_tiny.csv | 13154 | 49237 |
| uk | uk/202203_middle.csv | 55098 | 283741 |
|    | uk/202108_tiny.csv | 9103 | 36392 |
| fa | fa/202203_middle.csv | 54165 | 203794 |
|    | fa/202108_tiny.csv | 10482 | 24102 |
| ur | ur/202203_middle.csv | 47426 | 177232 |
|    | ur/202108_tiny.csv | 10917 | 26503 |
| az | az/202203_middle.csv | 42906 | 272895 |
|    | az/202108_tiny.csv | 11188 | 52025 |
| te | te/202203_middle.csv | 41478 | 110521 |
|    | te/202108_tiny.csv | 11929 | 24444 |
| ka | ka/202203_middle.csv | 38199 | 158179 |
|    | ka/202108_tiny.csv | 10395 | 23914 |
| ml | ml/202203_middle.csv | 35477 | 249624 |
|    | ml/202108_tiny.csv | 9080 | 42359 |
| be | be/202203_middle.csv | 33935 | 227854 |
|    | be/202108_tiny.csv | 7622 | 37739 |
| is | is/202203_middle.csv | 32272 | 159506 |
|    | is/202108_tiny.csv | 10632 | 38268 |
| kk | kk/202203_middle.csv | 26021 | 148230 |
|    | kk/202108_tiny.csv | 6917 | 26163 |
| ga | ga/202203_middle.csv | 22177 | 131863 |
|    | ga/202108_tiny.csv | 9058 | 51411 |
| ky | ky/202203_middle.csv | 20583 | 150884 |
|    | ky/202108_tiny.csv | 7241 | 42027 |
| tg | tg/202203_middle.csv | 15451 | 135276 |
|    | tg/202108_tiny.csv | 5491 | 40244 |

Contributors

eiichiroi, hamadatakaki, lumaku, s3nh, shirayu, takaaki-saeki, vebmaylrie

Scripts for data collection

scripts/*.py are scripts for collecting data from YouTube. Since the scripts are language-independent, users can collect data for any language they like. youtube-dl and ffmpeg are required.

step1: making search words

The script scripts/make_search_word.py downloads a Wikipedia dump file and extracts words for searching videos. {lang} is the language code, e.g., ja (Japanese) or en (English).

$ python scripts/make_search_word.py {lang}

step2: obtaining video IDs

The script scripts/obtain_video_id.py obtains YouTube video IDs by searching for the words. {filename_word_list} is the word-list file made in step 1. From this step onward, processing takes a long time; it is recommended to split the input files (e.g., {filename_word_list}) and run them in parallel, as sketched after the command below.

$ python scripts/obtain_video_id.py {lang} {filename_word_list}
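
A minimal splitting sketch (not part of the repository; the ".partNN" suffix is a hypothetical naming convention for illustration):

```python
# Minimal sketch: split a word list into N balanced chunks so that
# scripts/obtain_video_id.py can be run on each chunk in parallel.
import sys
from pathlib import Path


def split_word_list(filename: str, n_chunks: int) -> None:
    words = Path(filename).read_text(encoding="utf-8").splitlines()
    for i in range(n_chunks):
        chunk = words[i::n_chunks]  # round-robin keeps chunk sizes balanced
        Path(f"{filename}.part{i:02d}").write_text(
            "\n".join(chunk) + "\n", encoding="utf-8"
        )


if __name__ == "__main__":
    split_word_list(sys.argv[1], int(sys.argv[2]))
```

Each chunk can then be passed to scripts/obtain_video_id.py in its own process.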

step3: checking if subtitles are available

The script scripts/retrieve_subtitle_exists.py checks whether each video has subtitles. {filename_videoid_list} is the video-ID list file made in step 2. This step produces a CSV file in the format shown in the Description section.

$ python scripts/retrieve_subtitle_exists.py {lang} {filename_videoid_list}

step4: downloading videos with manual subtitles

The script scripts/download_video.py downloads audio and manual subtitles. Note that this step requires a very large amount of storage. {filename_subtitle_list} is the subtitle list file made in step 3. The audio and subtitles are saved in video/{lang}/wav16k and video/{lang}/txt, respectively.

$ python scripts/download_video.py {lang} {filename_subtitle_list}
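
As a rough estimate of the storage needed (assuming 16-bit mono WAV; the wav16k directory name suggests a 16 kHz sampling rate): audio alone takes 16,000 samples/s × 2 bytes = 32 kB/s, i.e., about 115 MB per hour, so the 10,000 hours of the Japanese subset occupy on the order of 1.2 TB before any intermediate files.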

step5 (ASR): alignment and scoring

Subtitles are not always correctly aligned with the audio, and in some cases they do not match the audio at all. The script scripts/align.py aligns subtitles and audio with CTC segmentation, using an ESPnet2 ASR model:

$ python scripts/align.py {asr_train_config} {asr_model_file} {wavdir} {txtdir} {output_dir}

The result is written to a segments file segments.txt and a log file segments.log in the output directory. Using the segments file, bad utterances or audio files can be filtered out:

min_confidence_score=-0.3
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${output_dir}/segments.txt
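
The same filter as a Python sketch (like the awk one-liner, it assumes the confidence score is the fifth whitespace-separated field of segments.txt):

```python
# Minimal sketch: print only the segments whose confidence score exceeds
# the threshold. The fifth field of each segments.txt line is the
# CTC-segmentation confidence score, as in the awk command above.
import sys

MIN_CONFIDENCE_SCORE = -0.3

with open(sys.argv[1], encoding="utf-8") as f:  # {output_dir}/segments.txt
    for line in f:
        fields = line.split()
        if len(fields) >= 5 and float(fields[4]) > MIN_CONFIDENCE_SCORE:
            print(line, end="")
```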

step5 (ASV): speaker variation scoring

There are three types of videos: text-to-speech (TTS) videos, single-speaker (i.e., monologue) videos, and multi-speaker (e.g., dialogue) videos. The script scripts/xxx.py computes a speaker-variation score within each video to classify videos into these three types.

$ python scripts/xxx.py
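
The script itself is not yet in the repository (see the issue below). One plausible scoring scheme, sketched here under the assumption that per-utterance speaker embeddings (e.g., x-vectors from a pretrained ASV model) have already been extracted, is the mean pairwise cosine distance within a video:

```python
# Sketch of a speaker-variation score: the mean pairwise cosine distance
# between per-utterance speaker embeddings of one video. How embeddings
# are extracted is assumed; this is not the repository's actual method.
import numpy as np


def speaker_variation_score(embeddings: np.ndarray) -> float:
    """embeddings: (n_utterances, dim) array of speaker embeddings."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T  # pairwise cosine similarities
    off_diag = sim[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())


# Intuition: TTS videos should score lowest, monologues slightly higher,
# and multi-speaker videos highest.
```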

Reference

  • coming soon

Link

Update

  • Aug. 2021: first update ({lang}/*_tiny.csv)
  • Jan. 2022: add mid-size data ({lang}/*_middle.csv)


jtubespeech's Issues

scripts/download_video.py exits abnormally when the video title cannot be retrieved

I got an error when I ran scripts/download_video.py.

The error message is:

$ python3 ./scripts/download_video.py ja data/ja/202103.csv
...
[youtube] 00bfDzS1HzE: Downloading webpage
[youtube] 00bfDzS1HzE: Downloading embed webpage
[youtube] 00bfDzS1HzE: Refetching age-gated info webpage
WARNING: Unable to extract video title
ERROR: no conn, hlsvp, hlsManifestUrl or url_encoded_fmt_stream_map information found in video info; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  0%|                                               | 18/111784 [15:07<1565:48:05, 50.43s/it]
Traceback (most recent call last):
  File "/usr/lib64/python3.6/shutil.py", line 550, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: 'video/ja/wav/00/00bfDzS1HzE.ja.vtt' -> 'video/ja/vtt/00/00bfDzS1HzE.vtt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./scripts/download_video.py", line 64, in <module>
    dirname = download_video(args.lang, args.sublist, args.outdir)
  File "./scripts/download_video.py", line 39, in download_video
    shutil.move(f"{base}.{lang}.vtt", fn["vtt"])
  File "/usr/lib64/python3.6/shutil.py", line 564, in move
    copy_function(src, real_dst)
  File "/usr/lib64/python3.6/shutil.py", line 263, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib64/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'video/ja/wav/00/00bfDzS1HzE.ja.vtt'

This error occurs because some videos (e.g., 00bfDzS1HzE, 00d0gfXqEfU) in the CSV are now private or deleted.

I'm testing the following patch:

diff --git a/scripts/download_video.py b/scripts/download_video.py
index 112f885..61aa345 100644
--- a/scripts/download_video.py
+++ b/scripts/download_video.py
@@ -35,7 +35,10 @@ def download_video(lang, fn_sub, outdir="video", wait_sec=10, keep_org=False):
       # download
       url = make_video_url(videoid)
       base = fn["wav"].parent.joinpath(fn["wav"].stem)
-      subprocess.run(f"youtube-dl --sub-lang {lang} --extract-audio --audio-format wav --write-sub {url} -o {base}.\%\(ext\)s", shell=True,universal_newlines=True)
+      cp = subprocess.run(f"youtube-dl --sub-lang {lang} --extract-audio --audio-format wav --write-sub {url} -o {base}.\%\(ext\)s", shell=True,universal_newlines=True)
+      if cp.returncode != 0:
+        print(f"Failed to download the video: url = {url}")
+        continue
       shutil.move(f"{base}.{lang}.vtt", fn["vtt"])

       # vtt -> txt (reformatting)
  • Notes:
    • The patch does not print detailed error messages because youtube-dl already outputs them in detail.
    • There is no cleanup code for the video file (youtube-dl may delete the video file?).

Total duration of segments after filtering bad segments is less than result in paper

Hi,
I ran steps 4 and 5 using the file data/ja/202103.csv you provided. I got more than 10M files with a total duration of over 10,000 hours for all segments. But after filtering bad segments with min_confidence_score=-0.3, only about 480,000 good segments remain, with a total duration of 351 hours. So the yield is roughly 3.5%, and the total duration is much less than what you mention in the paper (1,300 hours). Do you know the possible reasons?

Noisy result issue

I tried it for Tagalog; unfortunately, the video IDs are too noisy and capture a lot of irrelevant languages. I guess this approach is mostly effective for languages with exclusive scripts, like Japanese or Amharic. For languages that mostly use Roman characters, a more sophisticated keyword choice is needed to really get valid IDs; otherwise a lot of irrelevant video IDs are captured. Alternatively, one could get meta info using youtube-dl to confirm the language of the content (a sketch follows), or, for videos with subtitles, compute perplexity against a trusted language model.

Or am I missing something?
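
For the metadata idea, a minimal sketch (youtube-dl on PATH and the third-party langdetect package are assumed; guessing the language from title and description is only a heuristic):

```python
# Minimal sketch: fetch video metadata with youtube-dl and guess the
# language of the title/description with langdetect (pip install langdetect).
import json
import subprocess

from langdetect import detect


def video_language(videoid: str) -> str:
    url = f"https://www.youtube.com/watch?v={videoid}"
    out = subprocess.run(
        ["youtube-dl", "--dump-json", "--skip-download", url],
        capture_output=True, text=True, check=True,
    ).stdout
    meta = json.loads(out)
    text = f"{meta.get('title', '')} {meta.get('description', '')}".strip()
    return detect(text) if text else "unknown"
```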

Any recommended model for English alignment?

Thanks for your exciting work!

I am starting to learn ASR and make an ASR dataset for experiments. Your work is really helpful.

In the code, you recommend this model for Japanese alignment.

Do you have a recommended English model?

License of downloaded videos.

The search filters at https://github.com/sarulab-speech/jtubespeech/blob/master/scripts/util.py#L12 are only "videos" and "Subtitles/CC", so we obtain audio data whose license is not only Creative Commons but also the standard YouTube license.

Videos under the standard YouTube license do not seem to permit downloading or modification:
https://www.youtube.com/static?template=terms

The following restrictions apply to your use of the Service. You are not allowed to:

access, reproduce, download, distribute, transmit, broadcast, display, sell, license, alter, modify or otherwise use any part of the Service or any Content except: (a) as expressly authorized by the Service; or (b) with prior written permission from YouTube and, if applicable, the respective rights holders;

Is this OK?
If it isn't, how about adding a Creative Commons filter?

Speaker variation scoring script.

In the README file, there is a step 5 in the pipeline for producing an ASV dataset:

step5 (ASV): speaker variation scoring
There are three types of videos: text-to-speech (a.k.a., TTS) video, single-speaker (i.e., monologue) video, and multi-speaker (e.g., dialogue) video. The script scripts/xxx.py obtains scores of speaker variation within a video to classify videos into three types.

$ python scripts/xxx.py

but I cannot find that script anywhere in this repo. Can you provide the script to calculate speaker-variation scores?
