flakm / jupiter-search

Convert a podcast RSS feed to transcriptions using the whisper model

License: Apache License 2.0

Rust 92.26% Shell 2.01% Nix 3.32% Dockerfile 2.40%

podcast rust transcription

jupiter-search's Issues

Deprecate ffmpeg subprocess calling and use fully rust solution

Currently, ffmpeg_decoder.rs converts mp3 to wav by spawning a child process and calling ffmpeg directly.
This is suboptimal for a number of reasons:

It is sensitive to ffmpeg version changes (or ffmpeg may be missing completely)
It requires the data to be written to and read from the disk: the downloaded mp3 and the converted/resampled wav
It makes it impossible to use in no-std contexts like wasm

I've tried to prepare an alternative rust-only solution inside decoder.rs but it is not working - possibly because I'm a complete tool when it comes to audio.

Why it is not a blocker:

It's still a PoC
ffmpeg is pretty stable so the flags won't change from version to version
The error will suggest that the ffmpeg is missing (?)
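
Until a pure Rust decoder works, the last point could be made explicit with a small pre-flight check. The following is only a sketch: `check_decoder_available` is a hypothetical helper that is not part of the repository, and it assumes that running the binary with `-version` is an acceptable probe.

```rust
use std::io::ErrorKind;
use std::process::{Command, Stdio};

/// Hypothetical helper (not in the repository): tries to run `<binary> -version`
/// and turns the outcome into a readable message.
fn check_decoder_available(binary: &str) -> Result<(), String> {
    let status = Command::new(binary)
        .arg("-version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status();

    match status {
        // The binary was found and exited with code 0.
        Ok(s) if s.success() => Ok(()),
        // The binary was found but reported a failure.
        Ok(s) => Err(format!("`{binary}` was found but exited with {s}")),
        // The OS could not find the executable at all.
        Err(e) if e.kind() == ErrorKind::NotFound => {
            Err(format!("`{binary}` is not installed or not on PATH"))
        }
        // Any other spawn error (permissions, etc.).
        Err(e) => Err(format!("failed to start `{binary}`: {e}")),
    }
}
```

Calling `check_decoder_available("ffmpeg")` before decoding would let the tool fail early with a clear message instead of a generic I/O error.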

Experiment with speedup of the audio file itself

According to ggerganov/whisper.cpp#394 it should be possible to speed up audio and still get some results.

The ffmpeg instructions for this are here: https://trac.ffmpeg.org/wiki/How%20to%20speed%20up%20/%20slow%20down%20a%20video

Acceptance criteria:

it is possible to control speedup from the cli
the timestamps are adjusted in the resulting transcript
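
The second criterion is simple arithmetic: if the audio is played `speedup` times faster (for example with ffmpeg's `atempo` filter), a timestamp measured on the fast audio corresponds to `timestamp * speedup` in the original. A minimal sketch, assuming a hypothetical `Segment` type with millisecond timestamps (the crate's real transcript types may differ):

```rust
/// Hypothetical transcript segment; the real types in the crate may differ.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Maps timestamps measured on sped-up audio back onto the original timeline.
/// `speedup` is the playback factor, e.g. 1.5 means the audio was 1.5x faster.
fn rescale_to_original(segments: &[Segment], speedup: f64) -> Vec<Segment> {
    assert!(speedup.is_finite() && speedup > 0.0, "speedup must be positive");
    segments
        .iter()
        .map(|s| Segment {
            // One millisecond of fast audio covers `speedup` ms of the original.
            start_ms: (s.start_ms as f64 * speedup).round() as u64,
            end_ms: (s.end_ms as f64 * speedup).round() as u64,
            text: s.text.clone(),
        })
        .collect()
}
```

With `speedup = 1.5`, a segment recognised at 2000..4000 ms in the fast audio maps to 3000..6000 ms in the original recording.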

Add mp3 metadata collecting

Since we already need to download the mp3, we might as well parse its metadata and connect it to the transcript.

Awesome crate for this specific task: https://docs.rs/lofty/latest/lofty/

Prepare caching for transcription based on rss feed

Since transcription is very time-consuming (roughly 0.66 × audio length), the results of STT should be cached, keyed by a hash of the RSS data, in s3 or some other storage.

preferably cache should contain all the metadata for possible transformations (ie speaker tagging etc)
there should be some kind of index of all cached entries to enable watching new RSS feeds
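
One way to derive such a cache key is to hash the fields that identify a transcription. The sketch below is an assumption, not the project's design: the choice of fields (feed URL, item GUID, model name), the `transcripts/` prefix and the use of FNV-1a are all illustrative. FNV-1a is implemented by hand because the standard library's default hasher is not guaranteed to be stable across Rust releases, which matters for keys persisted in storage.

```rust
/// 64-bit FNV-1a parameters.
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// Feeds `bytes` into a running FNV-1a hash and returns the new state.
fn fnv1a_update(mut hash: u64, bytes: &[u8]) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical key layout: one object per (feed, episode, model) triple.
fn cache_key(feed_url: &str, item_guid: &str, model: &str) -> String {
    let mut hash = FNV_OFFSET;
    for part in [feed_url, item_guid, model] {
        hash = fnv1a_update(hash, part.as_bytes());
        // A separator byte so ("ab", "c") and ("a", "bc") hash differently.
        hash = fnv1a_update(hash, &[0u8]);
    }
    format!("transcripts/{hash:016x}.json")
}
```

Including the model name in the key means that switching to a different model produces a new cache entry instead of silently reusing an old transcript.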

Split vad chunking and stt into separate crate

prepare set of ci tests that will run inference on github ci using git lfs for audio and models (might be the smallest ones)
publish crate to crates.io

RUSTSEC-2020-0071: Potential segfault in the time crate

Potential segfault in the time crate

Details
Package	`time`
Version	`0.1.44`
URL	time-rs/time#293
Date	2020-11-18
Patched versions	`>=0.2.23`
Unaffected versions	`=0.2.0,=0.2.1,=0.2.2,=0.2.3,=0.2.4,=0.2.5,=0.2.6`

Impact

Unix-like operating systems may segfault due to dereferencing a dangling pointer in specific circumstances. This requires an environment variable to be set in a different thread than the affected functions. This may occur without the user's knowledge, notably in a third-party library.

The affected functions from time 0.2.7 through 0.2.22 are:

time::UtcOffset::local_offset_at
time::UtcOffset::try_local_offset_at
time::UtcOffset::current_local_offset
time::UtcOffset::try_current_local_offset
time::OffsetDateTime::now_local
time::OffsetDateTime::try_now_local

The affected functions in time 0.1 (all versions) are:

at
at_utc
now

Non-Unix targets (including Windows and wasm) are unaffected.

Patches

Pending a proper fix, the internal method that determines the local offset has been modified to always return None on the affected operating systems. This has the effect of returning an Err on the try_* methods and UTC on the non-try_* methods.

Users and library authors with time in their dependency tree should perform cargo update, which will pull in the updated, unaffected code.

Users of time 0.1 do not have a patch and should upgrade to an unaffected version: time 0.2.23 or greater or the 0.3 series.

Workarounds

No workarounds are known.

See advisory page for additional details.

Investigate why stt unit test is hanging

Find out why the unit test that does the same thing as the example code in get_transcript.rs hangs. oO

    #[test]
    fn stt_works() {
        // Load the smallest English model and transcribe a very short clip.
        let mut ctx = SttContext::try_new("resources/ggml-tiny.en.bin").unwrap();
        let t = ctx
            .get_transcript_file("resources/super_short.mp3", false)
            .unwrap();
        println!("{:?}", t);
        assert!(!t.utterances.is_empty());
    }
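
While the cause is unknown, it may help to make the test fail instead of hang. The helper below is a generic sketch using only the standard library; `run_with_timeout` is a hypothetical name and nothing here is specific to `SttContext`. It runs a closure on a worker thread and gives up waiting after a deadline.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `f` on a separate thread and waits at most `timeout` for its result.
/// Returns `None` if the deadline passes or the worker panics.
/// Note: a timed-out worker is not killed, it keeps running detached.
fn run_with_timeout<T, F>(timeout: Duration, f: F) -> Option<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error: the receiver may have already given up.
        let _ = tx.send(f());
    });
    rx.recv_timeout(timeout).ok()
}
```

Wrapping the body of the test in `run_with_timeout(...)` and asserting on `Some(_)` would turn a hang into an ordinary test failure, which is easier to spot in CI.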

Add CI and publish to crates.io

Maybe those crates won't get used so much but having a hosted version of documentation would be awesome ;)

Resources

Requirements

cross build for mac & linux
publish to crates.io
publish SHA sums, unless crates.io already provides them?

Prepare a new scorer for STT inference

Currently the results of the generic models are less than satisfactory.
The language model clearly doesn't have any linuxy techy words, so it should be tuned for this specific purpose.

Training requires checkpoints, which can be taken from the releases page: https://github.com/coqui-ai/STT/releases?q=1.3.0&expanded=true. All 1.x.x releases are backward compatible with the generic models.