amsehili / auditok

An audio/acoustic activity detection and audio segmentation tool

License: MIT License

Python 100.00%
audio-activities audio-data audio-segmentation voice-detection vad voice-activity-detection

auditok's Introduction


auditok is an Audio Activity Detection tool that can process online data (read from an audio device or from standard input) as well as audio files. It can be used as a command-line program or by calling its API.

The latest version of the documentation can be found on readthedocs.

Installation

A basic version of auditok will run with standard Python (>=3.4). However, without installing additional dependencies, auditok can only deal with audio files in wav or raw formats. If you want more features, the following packages are needed:

  • pydub : read audio files in popular audio formats (ogg, mp3, etc.) or extract audio from a video file.
  • pyaudio : read audio data from the microphone and play audio back.
  • tqdm : show progress bar while playing audio clips.
  • matplotlib : plot audio signal and detections.
  • numpy : required by matplotlib. Also used for some math operations instead of standard Python if available.

Install the latest stable version with pip:

sudo pip install auditok

Install the latest development version from github:

pip install git+https://github.com/amsehili/auditok

or

git clone https://github.com/amsehili/auditok.git
cd auditok
python setup.py install

Basic example

import auditok

# split returns a generator of AudioRegion objects
audio_regions = auditok.split(
    "audio.wav",
    min_dur=0.2,     # minimum duration of a valid audio event in seconds
    max_dur=4,       # maximum duration of an event
    max_silence=0.3, # maximum duration of tolerated continuous silence within an event
    energy_threshold=55 # threshold of detection
)

for i, r in enumerate(audio_regions):

    # Regions returned by `split` have 'start' and 'end' metadata fields
    print("Region {i}: {r.meta.start:.3f}s -- {r.meta.end:.3f}s".format(i=i, r=r))

    # play detection
    # r.play(progress_bar=True)

    # region's metadata can also be used with the `save` method
    # (no need to explicitly specify region's object and `format` arguments)
    filename = r.save("region_{meta.start:.3f}-{meta.end:.3f}.wav")
    print("region saved as: {}".format(filename))

output example:

Region 0: 0.700s -- 1.400s
region saved as: region_0.700-1.400.wav
Region 1: 3.800s -- 4.500s
region saved as: region_3.800-4.500.wav
Region 2: 8.750s -- 9.950s
region saved as: region_8.750-9.950.wav
Region 3: 11.700s -- 12.400s
region saved as: region_11.700-12.400.wav
Region 4: 15.050s -- 15.850s
region saved as: region_15.050-15.850.wav

Split and plot

Visualize audio signal and detections:

import auditok
region = auditok.load("audio.wav") # returns an AudioRegion object
regions = region.split_and_plot(...) # or just region.splitp()

output figure:

doc/figures/example_1.png

Limitations

Currently, the core detection algorithm is based on the energy of the audio signal. While this is fast and works very well for audio streams with low background noise (e.g., podcasts with few people talking, language lessons, audio recorded in a rather quiet environment, etc.), the performance can drop as the level of noise increases. Furthermore, the algorithm makes no distinction between speech and other kinds of sounds, so you shouldn't use it for Voice Activity Detection if your audio data also contains non-speech events.

License

MIT.

auditok's People

Contributors

amsehili, jhoelzl, leminhnguyen, ps2, samelltiger, taf2, yoyota


auditok's Issues

Cannot use split for an audio-humming.wav

(split_env) C:\Users\v_gejwzhang\ASR>python split.py
Traceback (most recent call last):
  File "split.py", line 9, in <module>
    energy_threshold=55 # threshold of detection
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\site-packages\auditok\core.py", line 227, in split
    source = AudioReader(input, block_dur=analysis_window, **params)
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\site-packages\auditok\util.py", line 1008, in __init__
    input = get_audio_source(input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\site-packages\auditok\io.py", line 731, in get_audio_source
    return from_file(filename=input, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\site-packages\auditok\io.py", line 919, in from_file
    return _load_wave(filename, large_file)
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\site-packages\auditok\io.py", line 813, in _load_wave
    with wave.open(file) as fp:
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\wave.py", line 499, in open
    return Wave_read(f)
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\wave.py", line 163, in __init__
    self.initfp(f)
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\wave.py", line 143, in initfp
    self._read_fmt_chunk(chunk)
  File "C:\ProgramData\Anaconda3\envs\split_env\lib\wave.py", line 260, in _read_fmt_chunk
    raise Error('unknown format: %r' % (wFormatTag,))
wave.Error: unknown format: 3
import auditok

# split returns a generator of AudioRegion objects
audio_regions = auditok.split(
    "humming.wav",
    min_dur=0.2,     # minimum duration of a valid audio event in seconds
    max_dur=4,       # maximum duration of an event
    max_silence=0.3, # maximum duration of tolerated continuous silence within an event
    energy_threshold=55 # threshold of detection
)

for i, r in enumerate(audio_regions):

    # Regions returned by `split` have 'start' and 'end' metadata fields
    print("Region {i}: {r.meta.start:.3f}s -- {r.meta.end:.3f}s".format(i=i, r=r))

    # play detection
    # r.play(progress_bar=True)

    # region's metadata can also be used with the `save` method
    # (no need to explicitly specify region's object and `format` arguments)
    filename = r.save("region_{meta.start:.3f}-{meta.end:.3f}.wav")
    print("region saved as: {}".format(filename))
humming.wav is a file bigger than 25 MB

AudioParameterError

auditok.split(
'/tmp/tmprn08794x.wav'
)

AudioParameterError: The length of audio data must be an integer multiple of sample_width * channels

I'm unsure what I should do.

Error when running auditok

This is what I get when I try to run auditok

ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2501:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2501:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2501:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:867:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_dmix.c:1099:(snd_pcm_dmix_open) unable to open slave
connect(2) call to /dev/shm/jack-1000/default/jack_0 failed (err=No such file or directory)
attempt to connect to server failed

I'm on archlinux with python 3.6


Creating an .srt subtitle file

Hi there,
@amsehili
I want to create an .srt subtitle file from an mp3 audio file that I have. My goal is to automatically detect the speech regions and create an .srt file with time-stamps and blank transcriptions, which I will then fill in manually.

I have used this command:

auditok -e 55 -i interview.mp3 -m 10 --printf "{id}\n{start} --> {end}\nPut some text here...\n" --time-format "%h:%m:%s.%i" -o output.srt

the result is:

Output #0, srt, to 'c:\users\divana\appdata\local\temp\tmp_s9dvg':
Output file #0 does not contain any stream

Though if I don't use the -o option, the segments can be viewed in the terminal, but they can't be exported to srt.
Can anyone help with creating an .srt subtitle file with time-stamps?
Waiting for your reply.

Audio file attached
interview.zip

Real-Time Silence detection from bytes

From #23, I am trying to split the speaker's audio using a pyaudio stream:

The callback part (how can I use in_data and split it like microphone input is read in #23?)

def callback(self, in_data, frame_count, time_info, status):
    """Write frames and return PA flag"""
    # wave_file.writeframes(in_data)
    self.frames.append(in_data)
    input= b''.join(self.frames)
    print(input)
    reader = AudioReader(
        input=input,
        sr=self.__SAMPLE_RATE,
        sw=self.__SAMPLE_WIDTH,
        ch=self.__CHANNEL
        )
    for (i, region) in enumerate(split(
        input=reader,
        # eth=self.__ENERGY_THRESHOLD,
        max_silence=self.__MAX_SILENCE,
        max_dur=self.__MAX_DURATION,
        min_dur=self.__MIN_DURATION
        )):
        print(f"{constants.CONSOLE_COLOR_RED}split{constants.CONSOLE_COLOR_WHITE}")
        path = f'{constants.TEMP_SPEAKER_OUTPUT_AUDIO_DIR}/{str(time.time()) + constants.TEMP_SPEAKER_OUTPUT_AUDIO_FORMAT}'
        region.save(path)
        self.frames = []
        break

    return (in_data, pyaudio.paContinue)

The pyaudio part

with p.open(format=pyaudio.paInt16,
        channels=default_speakers["maxInputChannels"],
        rate=int(default_speakers["defaultSampleRate"]),
        frames_per_buffer=pyaudio.get_sample_size(pyaudio.paInt16),
        input=True,
        input_device_index=default_speakers["index"],
        stream_callback=self.callback
) as stream:
    """
    Open a PA stream via context manager.
    After leaving the context, everything will
    be correctly closed (Stream, PyAudio manager).
    """
    while self.ai_listen_handler.is_listening_speaker:
        time.sleep(1)

Use auditok.split to split microphone input in real-time

for region in auditok.split(
    input=None,
    sr=self.__SAMPLE_RATE,
    sw=self.__SAMPLE_WIDTH,
    ch=self.__CHANNEL,
    eth=self.__ENERGY_THRESHOLD,
    max_silence=self.__MAX_SILENCE,
    max_dur=self.__MAX_DURATION,
    min_dur=self.__MIN_DURATION
    ):
    if not self.ai_listen_handler.is_listening_mic:
        return

    path = f'{constants.TEMP_MIC_INPUT_AUDIO_DIR}/{str(time.time()) + constants.TEMP_MIC_INPUT_AUDIO_FORMAT}'
    region.save(path)

Doubt: format of the output

Hello,
this is not an issue but a doubt.
I am using auditok for audio tokenization, but I need the data in librosa/soundfile format.
When I check librosa/sf, the data values are floating-point numbers, while in auditok they are large integers.
But both documentations mention that the output is a time series.
Can you please help me convert one output format to the other, or at least explain what the format of auditok's output is?
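auditok's regions hold integer PCM sample values (e.g. -32768..32767 for 16-bit audio), whereas librosa and soundfile scale samples to floats in [-1.0, 1.0]. A sketch of the conversion, assuming numpy is installed (the helper name is hypothetical):

```python
import numpy as np

def to_float(samples, sample_width=2):
    """Scale integer PCM samples to [-1.0, 1.0), the librosa/soundfile
    convention, by dividing by 2**(bits - 1)."""
    scale = float(1 << (8 * sample_width - 1))
    return np.asarray(samples, dtype=np.float32) / scale

# region = auditok.load("audio.wav")
# librosa_style = to_float(region.samples)
```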

Run on-device on Android device

I plan to run auditok on-device to detect pauses in speech. Currently, it is being handled as an API call (front-end streams the audio via a microphone and sends it to a back-end API which does the segmentation on a real-time basis).

Is it possible to convert it into a Tensorflow lite model sorts for on-device inference, rather than an API call?

Plotting tests failed

Version: 0.2.0
Command: python -m unittest discover -v tests/.
Output: tests.log
All tests pass with the exception of those in test_plotting.py

Keyword {duration} returns error

Hello, beside {id}, {start} and {end}, i also want to use the keyword {duration}.

However, when using it, i get following error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/myproject/venv/local/lib/python2.7/site-packages/auditok/cmdline.py", line 524, in run
    end = self.time_formatter(end_time)))
KeyError: 'duration'

Regards,
Josef

How to process audio already loaded in numpy and/or torch?

Thank you for this good repo!

I find that auditok generally works better than popular VADs like Silero (which can have unexplained behaviour on some types of audio).
I'd like to use it in my project, but I struggle to do so, because when I call the VAD I don't have access to a wav file.
The only way I found to pass the torch tensor of raw audio is this awkward conversion:

    byte_io = io.BytesIO(bytes())
    scipy.io.wavfile.write(byte_io, SAMPLE_RATE, (audio.numpy() * 32767).astype(np.int16)) # audio is a torch tensor
    bytes_wav = byte_io.read()

    segments = auditok.split(
        bytes_wav,
        sampling_rate=SAMPLE_RATE,        # sampling frequency in Hz
        channels=1,                       # number of channels
        sample_width=2,                   # number of bytes per sample
        min_dur=min_speech_duration,      # minimum duration of a valid audio event in seconds
        max_dur=len(audio)/SAMPLE_RATE,   # maximum duration of an event
        max_silence=min_silence_duration, # maximum duration of tolerated continuous silence within an event
        energy_threshold=50,
        drop_trailing_silence=True,
    )

Is there a better way to do that?

If you want to see more, or directly comment on the related PR, it's here: https://github.com/linto-ai/whisper-timestamped/pull/78/files#diff-4d4adecf50ce8affc04f13ab7274717945dd716eb910225ff154f717e81c3b64R1791

splitting the audio with overlaps.

Hi thanks for this library.
I was wondering if there is a way to split my audio into smaller clips, with adjacent clips having overlapping samples.
Example:
Let's say I have a 1-minute clip with silences from 10-20 s and from 30-40 s, and my maximum audio length is 5 s. Is there a way to get the split audio with, say, the first clip from 0-5 s, the second from 3-8 s (overlapping at seconds 4 and 5), and so on?

Thanks any help is appreciated.

Splitting audio

Is it possible to use auditok to split audio into fixed lengths, at the silence before the limit?

For example:

auditok -n 600 -m 600 -i myFile.wav

I want to split myFile.wav into chunks of up to 600 seconds (10 min) of audio. If an audio event is greater than 10 min, then it should be split at the closest silence before the 10-min boundary. Is that possible with the current implementation?

Integrate spleeter model

Add a quantized Spleeter model so that auditok can isolate the vocals from the original audio before splitting. Without the background noise, auditok can split the dialogue better.

Callback data format

Hi, I'm using auditok to segment sentences from speech, and I would like to convert the data returned in the callback into an int16 wav format for use with a different application. Could you tell me how I could go about doing that?
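A sketch with the standard-library `wave` module, assuming 16-bit mono data at 16 kHz (adjust the defaults to your stream's actual parameters; the helper name is hypothetical):

```python
import wave

def save_wav(data, path, rate=16000, channels=1, sample_width=2):
    """Write raw int16 PCM bytes (e.g. from a detection callback) to a
    playable WAV file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(rate)
        w.writeframes(data)
```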

Dependencies aren't listed in setup.py and pyproject.toml

The project imports numpy, pydub, pyaudio, tqdm, but they aren't listed as dependencies.

In case there are no required dependencies - the require and depend clauses should be left empty, and optional dependencies should still be listed.

Otherwise the lack of this info causes confusion.

ImportError: No module named setuptools

When I try to run the setup script, I get the following error:

~/auditok $ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 4, in <module>
    from setuptools import setup
ImportError: No module named setuptools

Where should it be finding this "setuptools" directory/module?

Thanks

Specify recording device

I have multiple recording devices on my system, but it seems to only pick up the mic plugged into the microphone jack. How can I get it to use the microphone on the webcam that is plugged into a USB port?

I tried inserting the following at line 354 of io.py:
input_device_index = 2,
but it didn't have any effect.

I'm not very familiar with Python, but this looks like the right place according to the PyAudio documentation.

Real-Time Silence detection

I am using PyAudio to collect audio streams from the microphone input to detect end-of-speech. I was wondering if I can process the streams in a real-time manner using auditok. If so, how do I optimize it to run faster?

I am using the following code for the time-being:

from auditok import ADSFactory, AudioEnergyValidator, StreamTokenizer, player_for


# We set the `record` argument to True so that we can rewind the source
asource = ADSFactory.ads(max_time=10, record=True)

validator = AudioEnergyValidator(sample_width=asource.get_sample_width(), 
                                 energy_threshold=65)

tokenizer = StreamTokenizer(
    validator=validator,
    min_length=20,
    max_length=400,
    max_continuous_silence=30,
)

asource.open()
tokens = tokenizer.tokenize(asource)

# Play detected regions back
player = player_for(asource)

trimmed_signal = b""
for p, q, r in tokens:
    trimmed_signal += b"".join(p)
print("\n ** Playing trimmed signal...")
player.play(trimmed_signal)

asource.close()
player.stop()

Ultimately, my end goal is to record user's input from the microphone (mobile device) and record audio, before doing a speech-to-text transcription using Google API.

Making standalone executable for win32

Hi. After I successfully installed auditok, I have these two files in the Scripts directory:

  • auditok.exe
  • auditok-script.py

I tested the executable, and it worked perfectly. But now I want to be able to run this on another machine without Python installed. In other words, I need a win32 standalone console executable of auditok.

I tried to use pyinstaller:

pip install pyinstaller
pyinstaller auditok-script.py

After that I have a "dist" directory with an "auditok-script" dir and a bunch of files in it. I ran auditok-script.exe in that dir, and it gave me this error:

...\python\Scripts\dist\auditok-script>auditok-script.exe
Traceback (most recent call last):
  File "auditok-script.py", line 11, in <module>
  File "site-packages\pkg_resources\__init__.py", line 480, in load_entry_point
  File "site-packages\pkg_resources\__init__.py", line 472, in get_distribution
  File "site-packages\pkg_resources\__init__.py", line 344, in get_provider
  File "site-packages\pkg_resources\__init__.py", line 892, in require
  File "site-packages\pkg_resources\__init__.py", line 778, in resolve
pkg_resources.DistributionNotFound: The 'auditok==0.1.8' distribution was not found and is required by the application
[21452] Failed to execute script auditok-script

...\python\Scripts\dist\auditok-script>

I'm not very familiar with Python, so maybe there's another way to build standalone binaries? Or how can I fix this error?

Thanks.

Replace/remove genty

The dependency genty looks abandoned and unmaintained, and it has issues with its own tests. The last commit was at the beginning of 2016. Please consider removing or replacing it.

Get {start} and {end} with 3 decimals

Hello,

first of all, this is a great python module and does a very good job!

I am wondering if I can extend the number of decimals for the export tags {start} and {end}.
By default, they have 2 decimals; I need rounding to 3 decimals.

Regards,
Josef

Errors running auditok

I'm on Linux Mint 17.2.

When I try to run "auditok", I get the following messages:

~/git_projects/auditok $ auditok
ALSA lib pcm_dsnoop.c:618:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2239:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
bt_audio_service_open: connect() failed: Connection refused (111)
ALSA lib pcm_dmix.c:1022:(snd_pcm_dmix_open) unable to open slave
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
1 0.00 5.00
2 5.00 10.00
3 10.00 15.00
4 15.00 20.00
^C5 20.00 20.60

Should I be concerned about any of these messages?
It doesn't seem to be detecting any sound; it just outputs automatically every 5 seconds.

Any suggestions would be appreciated.

Thanks.

Use auditok as API to detect pauses in speech

Thanks for the great lib!!

I have a bunch of utterances extracted from conversations, and I want to detect pauses in each of these utterances and how long the pauses are, both short and long (with 500 ms as the threshold, for example, for determining long vs. short).

Is it possible to use auditok as an API to call such a function for pause detection in my existing data pipeline? Sorry if the question seems general, but it would be greatly appreciated if you could provide any advice. Thank you again.
