

JTubeSpeech: Corpus of Japanese speech collected from YouTube

This repository provides 1) a list of YouTube videos with Japanese subtitles (JTubeSpeech), 2) scripts for making similar lists for new languages, and 3) tiny lists for other languages.

Description

Each data/{lang}/{YYYYMM}.csv file lists videos as follows (see step 4 for downloading):

| videoid | auto | sub | channelid |
| --- | --- | --- | --- |
| 0017RsBbUHk | True | True | UCTW2tw0Mhho72MojB1L48IQ |
| 00PqfZgiboc | False | True | UCzoghTgl4dvIW9GZF6UC-BA |

  • lang: Language ID (ja [Japanese], en [English], ...).
  • YYYYMM: Year and month when the data was collected.
  • videoid: YouTube video ID. Its YouTube page is https://www.youtube.com/watch?v={videoid}.
  • auto: Whether the video has automatic subtitles.
  • sub: Whether the video has manual (i.e., human-generated) subtitles.
  • channelid: YouTube channel ID. Its YouTube page is https://www.youtube.com/channel/{channelid}.
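
To work with these lists programmatically, a minimal loading sketch (assuming pandas is installed and that the CSV header matches the columns above):

```python
# Minimal sketch: load a JTubeSpeech list and keep videos with manual
# subtitles. Assumes the CSV header is videoid,auto,sub,channelid as in
# the table above; pandas parses the True/False values as booleans.
import pandas as pd

df = pd.read_csv("data/ja/202103.csv")
manual = df[df["sub"]]  # videos with human-generated subtitles
print(f"{len(manual)} of {len(df)} videos have manual subtitles")
```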

Statistics

| lang | filename (data/) | #videos-sub-true | #videos-auto-true |
| --- | --- | --- | --- |
| ja | ja/202103.csv | 110,000 (10,000 hours) | 4,960,000 |
| en | en/202108_middle.csv | 739543 | 667555 |
|    | en/202108_tiny.csv | 74227 | 65570 |
| ru | ru/202203_middle.csv | 258222 | 349388 |
|    | ru/202108_tiny.csv | 39890 | 46061 |
| de | de/202203_middle.csv | 194468 | 527993 |
|    | de/202108_tiny.csv | 30727 | 66954 |
| fr | fr/202203_middle.csv | 164261 | 524261 |
|    | fr/202108_tiny.csv | 25371 | 70466 |
| ar | ar/202203_middle.csv | 158568 | 311697 |
|    | ar/202108_tiny.csv | 31993 | 42649 |
| th | th/202203_middle.csv | 154416 | 250417 |
|    | th/202108_tiny.csv | 40886 | 26907 |
| tr | tr/202203_middle.csv | 154213 | 494187 |
|    | tr/202108_tiny.csv | 27317 | 68079 |
| hi | hi/202203_middle.csv | 132175 | 172565 |
|    | hi/202108_tiny.csv | 34034 | 31439 |
| zh | zh/202108_middle.csv | 126271 | 23387 |
|    | zh/202108_tiny.csv | 63126 | 23387 |
| id | id/202203_middle.csv | 105334 | 447836 |
|    | id/202108_tiny.csv | 18086 | 72760 |
| el | el/202203_middle.csv | 96436 | 156445 |
|    | el/202108_tiny.csv | 25947 | 26735 |
| pt | pt/202203_middle.csv | 90600 | 436425 |
|    | pt/202108_tiny.csv | 11692 | 48974 |
| da | da/202203_middle.csv | 86027 | 421190 |
|    | da/202108_tiny.csv | 18779 | 62094 |
| bn | bn/202203_middle.csv | 75371 | 303335 |
|    | bn/202108_tiny.csv | 16315 | 57112 |
| fi | fi/202203_middle.csv | 68571 | 347307 |
|    | fi/202108_tiny.csv | 15561 | 50626 |
| ta | ta/202203_middle.csv | 66923 | 89209 |
|    | ta/202108_tiny.csv | 21860 | 26120 |
| hu | hu/202203_middle.csv | 64792 | 351426 |
|    | hu/202108_tiny.csv | 13154 | 49237 |
| uk | uk/202203_middle.csv | 55098 | 283741 |
|    | uk/202108_tiny.csv | 9103 | 36392 |
| fa | fa/202203_middle.csv | 54165 | 203794 |
|    | fa/202108_tiny.csv | 10482 | 24102 |
| ur | ur/202203_middle.csv | 47426 | 177232 |
|    | ur/202108_tiny.csv | 10917 | 26503 |
| az | az/202203_middle.csv | 42906 | 272895 |
|    | az/202108_tiny.csv | 11188 | 52025 |
| te | te/202203_middle.csv | 41478 | 110521 |
|    | te/202108_tiny.csv | 11929 | 24444 |
| ka | ka/202203_middle.csv | 38199 | 158179 |
|    | ka/202108_tiny.csv | 10395 | 23914 |
| ml | ml/202203_middle.csv | 35477 | 249624 |
|    | ml/202108_tiny.csv | 9080 | 42359 |
| be | be/202203_middle.csv | 33935 | 227854 |
|    | be/202108_tiny.csv | 7622 | 37739 |
| is | is/202203_middle.csv | 32272 | 159506 |
|    | is/202108_tiny.csv | 10632 | 38268 |
| kk | kk/202203_middle.csv | 26021 | 148230 |
|    | kk/202108_tiny.csv | 6917 | 26163 |
| ga | ga/202203_middle.csv | 22177 | 131863 |
|    | ga/202108_tiny.csv | 9058 | 51411 |
| ky | ky/202203_middle.csv | 20583 | 150884 |
|    | ky/202108_tiny.csv | 7241 | 42027 |
| tg | tg/202203_middle.csv | 15451 | 135276 |
|    | tg/202108_tiny.csv | 5491 | 40244 |

Contributors

eiichiroi, hamadatakaki, lumaku, s3nh, shirayu, takaaki-saeki, vebmaylrie

Scripts for data collection

scripts/*.py are scripts for collecting data from YouTube. Since the scripts are language-independent, users can collect data for any language they like. youtube-dl and ffmpeg are required.

step1: making search words

The script scripts/make_search_word.py downloads a Wikipedia dump file and extracts words for searching videos. {lang} is the language code, e.g., ja (Japanese) or en (English).

$ python scripts/make_search_word.py {lang}

step2: obtaining video IDs

The script scripts/obtain_video_id.py obtains YouTube video IDs by searching for the words. {filename_word_list} is the word-list file made in step 1. From this step onward, processing takes a long time; it is recommended to split the input files (e.g., {filename_word_list}) and run them in parallel, as sketched after the command below.

$ python scripts/obtain_video_id.py {lang} {filename_word_list}
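
A minimal splitting sketch (not part of the repository; the ".partNN" suffix is a hypothetical naming convention for illustration):

```python
# Minimal sketch: split a word list into N balanced chunks so that
# scripts/obtain_video_id.py can be run on each chunk in parallel.
import sys
from pathlib import Path


def split_word_list(filename: str, n_chunks: int) -> None:
    words = Path(filename).read_text(encoding="utf-8").splitlines()
    for i in range(n_chunks):
        chunk = words[i::n_chunks]  # round-robin keeps chunk sizes balanced
        Path(f"{filename}.part{i:02d}").write_text(
            "\n".join(chunk) + "\n", encoding="utf-8"
        )


if __name__ == "__main__":
    split_word_list(sys.argv[1], int(sys.argv[2]))
```

Each chunk can then be passed to scripts/obtain_video_id.py in its own process.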

step3: checking if subtitles are available

The script scripts/retrieve_subtitle_exists.py checks whether each video has subtitles. {filename_videoid_list} is the video-ID list file made in step 2. This step produces a CSV file in the format shown in the Description section.

$ python scripts/retrieve_subtitle_exists.py {lang} {filename_videoid_list}

step4: downloading videos with manual subtitles

The script scripts/download_video.py downloads audio and manual subtitles. Note that this step requires a very large amount of storage. {filename_subtitle_list} is the subtitle list file made in step 3. The audio and subtitles are saved in video/{lang}/wav16k and video/{lang}/txt, respectively.

$ python scripts/download_video.py {lang} {filename_subtitle_list}
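
As a rough estimate of the storage needed (assuming 16-bit mono WAV; the wav16k directory name suggests a 16 kHz sampling rate): audio alone takes 16,000 samples/s × 2 bytes = 32 kB/s, i.e., about 115 MB per hour, so the 10,000 hours of the Japanese subset occupy on the order of 1.2 TB before any intermediate files.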

step5 (ASR): alignment and scoring

Subtitles are not always correctly aligned with the audio, and in some cases they do not match the audio at all. The script scripts/align.py aligns subtitles and audio with CTC segmentation, using an ESPnet2 ASR model:

$ python scripts/align.py {asr_train_config} {asr_model_file} {wavdir} {txtdir} {output_dir}

The result is written to a segments file segments.txt and a log file segments.log in the output directory. Using the segments file, bad utterances or audio files can be filtered out:

min_confidence_score=-0.3
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${output_dir}/segments.txt
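
The same filter as a Python sketch (like the awk one-liner, it assumes the confidence score is the fifth whitespace-separated field of segments.txt):

```python
# Minimal sketch: print only the segments whose confidence score exceeds
# the threshold. The fifth field of each segments.txt line is the
# CTC-segmentation confidence score, as in the awk command above.
import sys

MIN_CONFIDENCE_SCORE = -0.3

with open(sys.argv[1], encoding="utf-8") as f:  # {output_dir}/segments.txt
    for line in f:
        fields = line.split()
        if len(fields) >= 5 and float(fields[4]) > MIN_CONFIDENCE_SCORE:
            print(line, end="")
```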

step5 (ASV): speaker variation scoring

There are three types of videos: text-to-speech (TTS) videos, single-speaker (i.e., monologue) videos, and multi-speaker (e.g., dialogue) videos. The script scripts/xxx.py computes a speaker-variation score within each video to classify videos into these three types.

$ python scripts/xxx.py
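
The script itself is not yet in the repository (see the issue below). One plausible scoring scheme, sketched here under the assumption that per-utterance speaker embeddings (e.g., x-vectors from a pretrained ASV model) have already been extracted, is the mean pairwise cosine distance within a video:

```python
# Sketch of a speaker-variation score: the mean pairwise cosine distance
# between per-utterance speaker embeddings of one video. How embeddings
# are extracted is assumed; this is not the repository's actual method.
import numpy as np


def speaker_variation_score(embeddings: np.ndarray) -> float:
    """embeddings: (n_utterances, dim) array of speaker embeddings."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T  # pairwise cosine similarities
    off_diag = sim[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())


# Intuition: TTS videos should score lowest, monologues slightly higher,
# and multi-speaker videos highest.
```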

Reference

  • coming soon

Link

Update

  • Aug. 2021: first update ({lang}/*_tiny.csv)
  • Jan. 2022: add mid-size data ({lang}/*_middle.csv)


jtubespeech's Issues

scripts/download_video.py exits abnormally when the video title cannot be retrieved

I got an error when I ran scripts/download_video.py.

The error message is:

$ python3 ./scripts/download_video.py ja data/ja/202103.csv
...
[youtube] 00bfDzS1HzE: Downloading webpage
[youtube] 00bfDzS1HzE: Downloading embed webpage
[youtube] 00bfDzS1HzE: Refetching age-gated info webpage
WARNING: Unable to extract video title
ERROR: no conn, hlsvp, hlsManifestUrl or url_encoded_fmt_stream_map information found in video info; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  0%|                                               | 18/111784 [15:07<1565:48:05, 50.43s/it]
Traceback (most recent call last):
  File "/usr/lib64/python3.6/shutil.py", line 550, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: 'video/ja/wav/00/00bfDzS1HzE.ja.vtt' -> 'video/ja/vtt/00/00bfDzS1HzE.vtt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./scripts/download_video.py", line 64, in <module>
    dirname = download_video(args.lang, args.sublist, args.outdir)
  File "./scripts/download_video.py", line 39, in download_video
    shutil.move(f"{base}.{lang}.vtt", fn["vtt"])
  File "/usr/lib64/python3.6/shutil.py", line 564, in move
    copy_function(src, real_dst)
  File "/usr/lib64/python3.6/shutil.py", line 263, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib64/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'video/ja/wav/00/00bfDzS1HzE.ja.vtt'

This error occurs because some videos (e.g., 00bfDzS1HzE, 00d0gfXqEfU) in the CSV are now private or deleted.

I'm testing the following patch:

diff --git a/scripts/download_video.py b/scripts/download_video.py
index 112f885..61aa345 100644
--- a/scripts/download_video.py
+++ b/scripts/download_video.py
@@ -35,7 +35,10 @@ def download_video(lang, fn_sub, outdir="video", wait_sec=10, keep_org=False):
       # download
       url = make_video_url(videoid)
       base = fn["wav"].parent.joinpath(fn["wav"].stem)
-      subprocess.run(f"youtube-dl --sub-lang {lang} --extract-audio --audio-format wav --write-sub {url} -o {base}.\%\(ext\)s", shell=True,universal_newlines=True)
+      cp = subprocess.run(f"youtube-dl --sub-lang {lang} --extract-audio --audio-format wav --write-sub {url} -o {base}.\%\(ext\)s", shell=True,universal_newlines=True)
+      if cp.returncode != 0:
+        print(f"Failed to download the video: url = {url}")
+        continue
       shutil.move(f"{base}.{lang}.vtt", fn["vtt"])

       # vtt -> txt (reformatting)
  • Notes:
    • The patch does not print detailed error messages because youtube-dl already outputs them in detail.
    • There is no cleanup code for the video file (youtube-dl may delete the video file?).

Total duration of segments after filtering bad segments is less than result in paper

Hi,
I ran steps 4 and 5 using the file data/ja/202103.csv you provided. I got more than 10M files with a total duration of over 10,000 hours for all segments. But after filtering bad segments with min_confidence_score=-0.3, only about 480,000 good segments remain, with a total duration of 351 hours. So the yield is roughly 3.5%, and the total duration is much less than what you mention in the paper (1,300 hours). Do you know the possible reasons?

Noisy result issue

I tried it for Tagalog; unfortunately, the video IDs are too noisy and capture a lot of irrelevant languages. I guess this approach is mostly effective for languages with exclusive scripts, like Japanese or Amharic. For languages that mostly use Roman characters, a more sophisticated keyword choice is needed to really get valid IDs; otherwise a lot of irrelevant video IDs are captured. Alternatively, one could get meta info using youtube-dl to confirm the language of the content (a sketch follows), or, for videos with subtitles, compute perplexity against a trusted language model.

Or am I missing something?
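
For the metadata idea, a minimal sketch (youtube-dl on PATH and the third-party langdetect package are assumed; guessing the language from title and description is only a heuristic):

```python
# Minimal sketch: fetch video metadata with youtube-dl and guess the
# language of the title/description with langdetect (pip install langdetect).
import json
import subprocess

from langdetect import detect


def video_language(videoid: str) -> str:
    url = f"https://www.youtube.com/watch?v={videoid}"
    out = subprocess.run(
        ["youtube-dl", "--dump-json", "--skip-download", url],
        capture_output=True, text=True, check=True,
    ).stdout
    meta = json.loads(out)
    text = f"{meta.get('title', '')} {meta.get('description', '')}".strip()
    return detect(text) if text else "unknown"
```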

Any recommended model for English alignment?

Thanks for your exciting work!

I am starting to learn ASR and make an ASR dataset for experiments. Your work is really helpful.

In the code, you recommend this model for Japanese alignment.

Do you have a recommended English model?

License of downloaded videos.

The search filters at https://github.com/sarulab-speech/jtubespeech/blob/master/scripts/util.py#L12 are only "videos" and "Subtitles/CC", so we obtain audio data whose license is not only Creative Commons but also the standard YouTube license.

Videos under the standard YouTube license do not seem to permit downloading or modification:
https://www.youtube.com/static?template=terms

The following restrictions apply to your use of the Service. You are not allowed to:

access, reproduce, download, distribute, transmit, broadcast, display, sell, license, alter, modify or otherwise use any part of the Service or any Content except: (a) as expressly authorized by the Service; or (b) with prior written permission from YouTube and, if applicable, the respective rights holders;

Is this OK?
If it isn't, how about adding a Creative Commons filter?

Speaker variation scoring script.

In the README file, there is a step 5 in the pipeline for producing an ASV dataset:

step5 (ASV): speaker variation scoring
There are three types of videos: text-to-speech (a.k.a., TTS) video, single-speaker (i.e., monologue) video, and multi-speaker (e.g., dialogue) video. The script scripts/xxx.py obtains scores of speaker variation within a video to classify videos into three types.

$ python scripts/xxx.py

but I cannot find that script anywhere in this repo. Can you provide the script to calculate speaker-variation scores?
