
kaldi-io-for-python's Introduction

kaldi-io-for-python

'Glue' code connecting kaldi data and python.

Supported data types

  • vector (integer)
  • Vector (float, double)
  • Matrix (float, double)
  • Posterior (posteriors, nnet1 training targets, confusion networks, ...)

Examples

Reading feature scp example:
import kaldi_io
for key,mat in kaldi_io.read_mat_scp(file):
  ...
Writing feature ark to file/stream:
import kaldi_io
with open(ark_file,'wb') as f:
  for key,mat in d.items():  # 'd' maps utterance keys to numpy matrices
    kaldi_io.write_mat(f, mat, key=key)
Writing features as 'ark,scp' by pipeline with 'copy-feats':
import kaldi_io
ark_scp_output='ark:| copy-feats --compress=true ark:- ark,scp:data/feats2.ark,data/feats2.scp'
with kaldi_io.open_or_fd(ark_scp_output,'wb') as f:
  for key,mat in d.items():  # 'd' maps utterance keys to numpy matrices
    kaldi_io.write_mat(f, mat, key=key)

Install

  • from pypi:
pip install kaldi_io
  • from sources:
git clone https://github.com/vesis84/kaldi-io-for-python.git <kaldi-io-dir>
cd <kaldi-io-dir>
pip install -r requirements.txt
pip install --editable .

Note: it is recommended to set the KALDI_ROOT environment variable (export KALDI_ROOT=<some_kaldi_dir>). The pipe-based I/O can then invoke kaldi binaries.
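For instance, a minimal sketch of doing this from Python before importing kaldi_io (the path '/opt/kaldi' is a placeholder for your own Kaldi checkout):

```python
import os

# Placeholder path -- point this at your own Kaldi checkout.
os.environ['KALDI_ROOT'] = '/opt/kaldi'

# kaldi_io reads $KALDI_ROOT at import time to extend $PATH with Kaldi's
# binary dirs, so the variable must be set *before* 'import kaldi_io'.
os.environ['PATH'] = (os.path.join(os.environ['KALDI_ROOT'], 'src', 'featbin')
                      + os.pathsep + os.environ.get('PATH', ''))
```

With the variable set, rspecifiers like 'ark:copy-feats ... |' can find the binaries.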

Unit tests

(note: these are not included in pypi package)

Unit tests are started this way:

./run_tests.sh

or by:

python3 -m unittest discover -s tests -t .
python2 -m unittest discover -s tests -t .

License

Apache License, Version 2.0 ('LICENSE-2.0.txt')

Community

  • accepting pull requests with extensions on GitHub
  • accepting feedback via GitHub 'Issues' in the repo

kaldi-io-for-python's People

Contributors

alicegaz, bobchennan, boeddeker, florian1990, hawkaaron, karelvesely84, oplatek, xx205


kaldi-io-for-python's Issues

Immediately reading a big file after writing it may lead to an error?

I am wondering if we need to wait for the thread that writes files to hard disk in this script. Today I used kaldi_io to write a big file and immediately read it, which led to this error:

ivector-mean ark:data/train/spk2utt scp:exp/train_embed_vectors/embeddings.scp 'ark:| copy-vector ark:- ark,scp:exp/train_embed_vectors/spk_embeddings.ark,exp/train_embed_vectors/spk_embeddings.scp' ark,t:exp/train_embed_vectors/num_utts.ark 
WARNING (ivector-mean[5.4.84~1405-c643]:ReadScriptFile():kaldi-table.cc:72) Invalid 148626'th line in script file:"id11251-gFfcgOVmiO0-00006"
WARNING (ivector-mean[5.4.84~1405-c643]:ReadScriptFile():kaldi-table.cc:46) [script file was: exp/train_embed_vectors/embeddings.scp]
ERROR (ivector-mean[5.4.84~1405-c643]:RandomAccessTableReader():util/kaldi-table-inl.h:2512) Error opening RandomAccessTableReader object  (rspecifier is: scp:exp/train_embed_vectors/embeddings.scp)

The error will disappear after I manually type this cmd:

ivector-mean ark:data/train/spk2utt scp:exp/train_embed_vectors/embeddings.scp 'ark:| copy-vector ark:- ark,scp:exp/train_embed_vectors/spk_embeddings.ark,exp/train_embed_vectors/spk_embeddings.scp' ark,t:exp/train_embed_vectors/num_utts.ark

So does this cmd:

copy-vector "scp:echo 'id11251-gFfcgOVmiO0-00006 exp/train_embed_vectors/embeddings.ark:614118526' |" ark,t:-|less

I guess the possible reason is that when I read the newly created big file, the earlier thread is still writing it.
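That guess is plausible: a hedged sketch of the mitigation is to make sure the writer has fully flushed (and, for piped writes, that the Kaldi subprocess has exited) before reading the file back. For a plain file this can be forced explicitly (the path below is a toy stand-in for the big ark):

```python
import os
import tempfile

# Toy stand-in for the big ark being written (path is hypothetical).
path = os.path.join(tempfile.mkdtemp(), 'big.ark')

f = open(path, 'wb')
f.write(b'\x00' * 1024)
f.flush()               # push Python's buffer to the OS
os.fsync(f.fileno())    # ask the OS to commit the data to disk
f.close()

# Only now is it safe for another process to read the full file.
size = os.path.getsize(path)
```

For the piped 'ark:| copy-vector ...' case, the analogue is closing the file object returned by open_or_fd before reading, which sends EOF to the subprocess and lets the cleanup thread wait() on it.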

Modifications to $PATH if $KALDI_ROOT is not set

Hi,

Are the modifications to the PATH variable in lines 17-24 really necessary?

If yes, I would suggest replacing the modifications with an exception when $KALDI_ROOT is not set; and if they are not necessary for the script, I would suggest removing them completely!

Thanks for all the work,
Best,
Quentin

Reading scp files created by subsegment_data_dir.sh

Hi,

I was trying to read mfcc features from a subsegmented directory, i.e. one created using utils/data/subsegment_data_dir.sh. The contents of feats.scp are of the form:

<path_to_ark_file>:xx[0:N]

Currently, this cannot be handled by read_mat_scp. Is there any alternative?
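Until read_mat_scp learns ranges, one workaround sketch is to strip the range suffix yourself, read the whole matrix, and slice. The helper below handles only the simple row-range form path.ark:offset[start:end]; Kaldi's full grammar also allows column ranges, which are not covered here:

```python
import re

def split_row_range(rxfilename):
    """Split e.g. 'feats.ark:5[30:40]' into ('feats.ark:5', (30, 40)).
    Returns (rxfilename, None) when no range suffix is present."""
    m = re.match(r'^(.*)\[(\d+):(\d+)\]$', rxfilename)
    if m is None:
        return rxfilename, None
    return m.group(1), (int(m.group(2)), int(m.group(3)))

# With kaldi_io one would then do (not run here):
#   rxfile, rng = split_row_range(entry)
#   mat = kaldi_io.read_mat(rxfile)
#   if rng is not None:
#       mat = mat[rng[0]:rng[1] + 1]   # Kaldi row ranges are inclusive
```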

Compressed matrices lead to `sample_size` not defined

If the matrix type is CM, then the sample_size assertions will fail because we never set sample_size.

Not sure what the sample_size is for CM, or whether it should simply complain?

Either way I'm happy to make a pull request.

It cannot be used with python3

First, thank you for your work.
I saw that you added python3 compatibility, but it seems the code in kaldi_io.py was not revised accordingly, and as a matter of fact I cannot run kaldi_io.py under python3. If it really can run on python3, could you give me some advice on how to use it? Before now, I have tried to modify the code to fit python3, but it does not work.
The primary problem is str vs. bytes, which differ between python2 and python3.
Thank you, I hope for a response!

Exit code 255 with open_or_fd

Hi,
I'm working with python 3.5.2, and I am using a virtualenv to run kaldi_io. I'm trying to use this sample:

import kaldi_io
ark_scp_output='ark:| copy-feats --compress=true ark:- ark,scp:data/feats2.ark,data/feats2.scp'
with kaldi_io.open_or_fd(ark_scp_output,'wb') as f:
  for key,mat in dict.iteritems():
    kaldi_io.write_mat(f, mat, key=key)

My dict is a csv file, such that the first column is utt-id, and the 2nd column is the feature. This feature is in a 2-D (1x1) numpy matrix format. I have 2 problems:
a) In using key, string type is not supported. Since that is an optional argument, I just didn't pass it. But I know it will be important.

b) In using open_or_fd, I'm getting the following error:
ERROR (copy-feats[5.4.176~1-be967]:Read():kaldi-matrix.cc:1616) Failed to read matrix from stream. : Expected "[", got "��

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::Matrix::Read(std::istream&, bool, bool)
kaldi::KaldiObjectHolder<kaldi::Matrix >::Read(std::istream&)
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Next()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Open(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Open(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
main
__libc_start_main
_start

WARNING (copy-feats[5.4.176~1-be967]:Read():util/kaldi-holder-inl.h:84) Exception caught reading Table object.
WARNING (copy-feats[5.4.176~1-be967]:Next():util/kaldi-table-inl.h:574) Object read failed, reading archive standard input
WARNING (copy-feats[5.4.176~1-be967]:Open():util/kaldi-table-inl.h:521) Error beginning to read archive file (wrong filename?): standard input
ERROR (copy-feats[5.4.176~1-be967]:SequentialTableReader():util/kaldi-table-inl.h:860) Error constructing TableReader: rspecifier is ark:-

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
main
__libc_start_main
_start

Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/home/ayushi/Projects_2018/non_native_perception/data/recordings_edited/kaldi-io/kaldi_io/kaldi_io/kaldi_io.py", line 82, in cleanup
raise SubprocessFailed('cmd %s returned %d !' % (cmd,ret))
kaldi_io.kaldi_io.SubprocessFailed: cmd copy-feats --print-args=false ark:- ark,scp:/home/ayushi/Tools/kaldi/egs/nn_perception/data/train/feats.ark,/home/ayushi/Tools/kaldi/egs/nn_perception/data/train/feats.scp returned 255 !

Am I doing something wrong?

Writing features as 'ark,scp' by pipeline with 'copy-feats'

Hi,
I can't correctly execute the example you provided to write the .ark and .scp files at the same time.
(screenshot: error_kaldi_io)
If instead I create the ark file and use copy-feats to create its copy and the attached .scp file, I don't encounter any problems.

Reading target (alignment) files

Hi,
I was trying to use kaldi_io to import alignment files, but I could not find out how to do it, and if it's possible.

I ran the timit recipe and ended up with a number of ali.<n>.gz files for example under the exp/mono_ali/ directory. I can convert those files from transition model IDs to PDF IDs with the command (for example):
ali-to-pdf exp/mono_ali/final.mdl "ark:gunzip -c exp/mono_ali/ali.1.gz|" ark,t:mono_ali.1.pdf.txt
The resulting file contains a line for each utterance, with utterance ID (for example faem0_si1392) and then a list of integer identifiers of the states in the model for each frame in that utterance.

  • Is there a way to import this file into python using kaldi_io?
  • Is there a way to pipe the ali-to-pdf command when opening the ali.1.gz file, so that I don't need to run it separately?

Thank you!
Giampiero
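A hedged sketch of both ideas combined: kaldi_io accepts a shell pipeline as the rspecifier, so ali-to-pdf can run on the fly while reading (paths follow the example above; the Kaldi binaries must be on $PATH):

```python
# Pipe 'ali-to-pdf' inside the rspecifier so transition-ids are converted
# to pdf-ids while reading; the trailing '|' marks it as an input pipe.
ali_rspec = ('ark:gunzip -c exp/mono_ali/ali.1.gz | '
             'ali-to-pdf exp/mono_ali/final.mdl ark:- ark:- |')

# With Kaldi available one would then iterate (not run here):
#   import kaldi_io
#   for key, vec in kaldi_io.read_ali_ark(ali_rspec):
#       print(key, vec.shape)   # vec: 1-D int array, one pdf-id per frame
```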

`read_ali_ark` crashes when reading gzipped file

I am trying to read the alignment file using read_ali_ark method. My code looks like this:

src_file = 's5/exp/tri2_ali/ali.1.gz'
abc = kaldi_io.read_ali_ark(src_file)

But this crashes on assert. It goes like this:

Simply removing this assert fixes the issue and makes it possible to read gzipped ark files.

Parse matrix range in read_mat()

As of now, only read_mat_scp() supports matrix ranges (as in /path/to/file/foo.ark:5[30:40])
I suggest moving the range parsing into read_mat() so that ranges are also supported for direct calls of this function.

UTF8 decoding problem and accent management

Hi !

First of all, thank you for this great work! However, I had to change every decode() in kaldi-io.py to decode('latin-1') in order to deal with French accents. I also had to comment out an assertion that only allowed non-accented characters. It would be great if you could add this accent handling for non-English users!

Support for ranges in script files

Hi,

I noticed that kaldi_io currently does not support ranges in script files. I need this feature so I implemented it here. I guess it is best to generate test cases for that, before I open a pull request. Unfortunately I have no experience with testing in python so far. If you would like to help me with that or could point out a resource, that would be great.

Another point is that I do not yet fully understand the Kaldi rx/wxfilenames. So I guess you could also add ranges to script-file lines like
utt_id_01002 gunzip -c /usr/data/01002.gz |
but I am not sure how this would be done.

Thanks for your work on kaldi_io!

Add tags for releases

Last release on pypi is 0.9.4, but there are no tags in this repo. For packaging things well in conda-forge, we need to know the relationship between the version and the sources, which is what git tags are for. 🙃

Could you please add them, ideally also for the last release? (tags can be pushed for past commits as well)

Only load small parts of a big file

Hi, my situation is that I want to load small parts of a big ark file. Of course, it is possible to load the entire ark file and then select certain rows, but it is not memory and time efficient. I wonder if it is possible to read only small parts of the ark file? (like np.load('/tmp/123.npy', mmap_mode='r')) Thanks for your help!
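Partial loading at the matrix level is in fact what the scp mechanism provides: each scp line stores a byte offset, and read_mat opens the ark at that offset and reads just the one matrix. A sketch (file names hypothetical):

```python
def parse_scp_line(line):
    """Split one feats.scp line, e.g. 'utt1 feats.ark:14213', into
    (key, rxfilename). The ':14213' is the byte offset of that single
    matrix inside the ark, so reading it touches only that record."""
    key, rxfile = line.rstrip('\n').split(' ', 1)
    return key, rxfile

key, rxfile = parse_scp_line('utt1 feats.ark:14213\n')

# With kaldi_io one would then read just this matrix (not run here):
#   mat = kaldi_io.read_mat(rxfile)
```

Note this gives per-matrix granularity, not np.memmap-style row slicing within one matrix; splitting very large matrices across several keys is the usual workaround.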

Updated kaldi_io cannot read from pipe (python3)

Hi Karel,
It's nice to support python3. I tested the new kaldi_io script; however, though it works fine for directly reading "feats.ark", it fails when reading from the stream "ark:apply-cmvn-sliding --center=true ark:feats.ark ark:- |". Line 49 throws an error:

fd = os.popen(file[:-1], 'rb')
  File "/cm/shared/apps/python/3.6/lib/python3.6/os.py", line 970, in popen
    raise ValueError("invalid mode %r" % mode)

This may be caused by the difference in stdin/stdout handling between python2 and python3.
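Right: on python3, os.popen() only accepts text modes. A sketch of the usual replacement using subprocess, which works on both python versions:

```python
import subprocess

def popen_rb(cmd):
    """Open a shell pipeline for binary reading. Unlike
    os.popen(cmd, 'rb'), which raises ValueError on python3,
    subprocess.Popen hands back a genuine binary file object."""
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    return proc.stdout   # caller should also proc.wait() when done

fd = popen_rb('printf hello')
data = fd.read()
```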

Also writing to a .scp file would be cool

Hi @vesis84 ,
When writing a feature matrix to a .ark file, it might be helpful to generate the corresponding .scp file to indicate positions.
Like the guys do here (http://kaldi-to-matlab.gforge.inria.fr/); that would complete this tool's functionality.
BTW, It's a great tool, thank you. Tests are also great.
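A sketch of the idea, assuming Kaldi's scp convention that the offset points just past the 'key ' token. write_mat itself doesn't return offsets, so we take f.tell() ourselves; the write_payload callback is a stand-in for the real record writer (e.g. kaldi_io.write_mat with its default empty key, so the key is not written twice):

```python
import os
import tempfile

def write_ark_with_scp(ark_path, scp_path, items, write_payload):
    """items: iterable of (key, obj); write_payload(f, obj) writes one
    binary record (real use: lambda f, m: kaldi_io.write_mat(f, m)).
    Each scp line is 'key ark_path:offset'."""
    with open(ark_path, 'wb') as ark, open(scp_path, 'w') as scp:
        for key, obj in items:
            ark.write((key + ' ').encode())
            offset = ark.tell()          # points at the binary header
            write_payload(ark, obj)
            scp.write('%s %s:%d\n' % (key, ark_path, offset))

# toy demo with a fake 5-byte payload instead of a real Kaldi matrix
tmp = tempfile.mkdtemp()
ark, scp = os.path.join(tmp, 'f.ark'), os.path.join(tmp, 'f.scp')
write_ark_with_scp(ark, scp, [('utt1', b'\x00B123')],
                   lambda f, payload: f.write(payload))
scp_line = open(scp).read()
```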

About AssertionError

Hi,
When I read compressed features in parallel from Python, I encountered the following problem; asking for help.

AssertionError: Caught AssertionError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_dataset.py", line 68, in __getitem__
full_mat = read_mat(self.dataset[aid][1])
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 717, in read_mat
mat = _read_mat_binary(fd)
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 730, in _read_mat_binary
if header.startswith('CM'): return _read_compressed_mat(fd, header)
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 776, in _read_compressed_mat
assert(format == 'CM ') # The formats CM2, CM3 are not supported...

Thank you~

Nnet example files

I'm trying to access the features that are used for kaldi's dnn model. It looks like these matrices are stored as a different type of object (Nnet3Eg, NumIo). I don't see that these are supported. Would it be non-trivial to read them?

Typo in UnknownMatrixHeader class definition

UnknownMatrixHeader is undefined due to a typo in the class definition. The diff below contains a fix:

$ git diff
diff --git a/kaldi_io.py b/kaldi_io.py
index e05a60c..49d518c 100755
--- a/kaldi_io.py
+++ b/kaldi_io.py
@@ -21,7 +21,7 @@ os.environ['PATH'] = os.popen('echo $KALDI_ROOT/src/bin:$KALDI_ROOT/tools/openfs
 # Define all custom exceptions,
 class UnsupportedDataType(Exception): pass
 class UnknownVectorHeader(Exception): pass
-class UnkonwnMatrixHeader(Exception): pass
+class UnknownMatrixHeader(Exception): pass

 class BadSampleSize(Exception): pass
 class BadInputFormat(Exception): pass

hardcoded path

Can you change the hardcoded path in kaldi_io/kaldi_io.py from:

os.environ['KALDI_ROOT']='/mnt/matylda5/iveselyk/Tools/kaldi-trunk'

to something like:

os.environ['KALDI_ROOT']=os.path.join(os.environ['CONDA_PREFIX'], 'bin')

?

Maybe also, instead of printing warnings, using the logging module would be useful, so issues like that can be suppressed?

appended scp and ark file

Hello,
I am working on kaldi and I want to try other features. I use kaldi-io-for-python and it works, but I want to have the same number of scp and ark files as my number of jobs, like the default kaldi features. However, the "open_or_fd" function doesn't have an 'ab' mode, and I want to append to my ark and scp files.
Could anyone give me some suggestions, please?
Regards!
Zhor
:)
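A possible workaround, since write_mat accepts any binary file object: open the ark in append mode yourself rather than through open_or_fd. The toy below only demonstrates that 'ab' keeps earlier records; the commented lines show the intended kaldi_io use:

```python
import os
import tempfile

# Intended use (not run here):
#   with open('feats.ark', 'ab') as f:
#       kaldi_io.write_mat(f, mat, key=key)
# Matching scp lines would need f.tell() noted before each record,
# since offsets continue past the pre-existing content.

path = os.path.join(tempfile.mkdtemp(), 'toy.ark')
with open(path, 'ab') as f:
    f.write(b'first ')
with open(path, 'ab') as f:   # reopening in 'ab' keeps earlier records
    f.write(b'second')
content = open(path, 'rb').read()
```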

Paths as keys for Matrices

Hey Karel,
since the humble beginnings of this script, the read_key function has only supported keys without a / symbol. I was just wondering: is there a reason it is like that, since kaldi itself supports keys with / in them (e.g. audio/file.wav), but kaldi_io does not?

Otherwise, I just propose to change line 115 to:

assert(re.match(r'^[/.a-zA-Z0-9_-]+$', key) is not None)

Packaging as a library

Hey Karel,
I'd like to ask if it would be possible to ship this script as a library (e.g. installable with pip), since I guess most people using this script copy it around their systems a lot. It would just be a bit more convenient.

Query on wav.scp reader - Streaming audio

Hi,

I have found this tool very useful for understanding the kaldi IO mechanisms. I have a small query on extracting samples from streaming speech.

Is it possible, using ReadHelper, to pass a real-time audio signal and observe the numpy_array of samples?

Thanks in advance,

Regards
Pradeep
PhD student
Dept of CSE
IIT Kharagpur

"Failed to read vector from stream. : Expected token FV, got W"

Hello,
I'm getting an error when attempting to use copy-vector on the output of 'kaldi_io.write_vec_int'.

Error is: "Failed to read vector from stream. : Expected token FV, got W"

Goal: I have a large text file of kaldi features. The file is in .ark format, but the contents are in human-readable form, which I created using 'copy-feats ark:- ark,t:-'. I want to create multiple small files, where each file contains one key and mat pair. To do this I read the ark file using kaldi_io and attempt to write a new file, also using kaldi_io, inside the kaldi_io.read_vec_int_ark loop. I am able to read key and mat from the file successfully, but an error occurs when attempting to write.

Code:
for key, mat in kaldi_io.read_vec_int_ark(sfile):
    print("{} {}".format(key, mat.shape))

    # create new file to write to
    new_file_path_txt = os.path.join(sdir, "{}.{}".format(key, file_tail))
    new_file_path = os.path.join(sdir, "{}.ark".format(key))

    # write new file
    print("type: {}".format(type(mat)))
    print("dtype: {}".format(mat.dtype))
    mat = mat.astype('int32')  # need to cast for writing purposes
    print("dtype2: {}".format(mat.dtype))

    ark_txt_output = 'ark:| copy-vector ark:- ark,t:{}'.format(new_file_path_txt)
    with kaldi_io.open_or_fd(ark_txt_output, 'wb') as w:
        kaldi_io.write_vec_int(w, mat, key=key)

I met an error when I use the read_vec_int_ark function

elif mode == "rb":
    err = open(output_folder + '/log.log', "a")
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=err)
    threading.Thread(target=cleanup, args=(proc, cmd)).start()  # clean-up thread,
    return proc.stdout

When the program reaches threading.Thread(target=cleanup, args=(proc, cmd)).start(), with

def cleanup(proc, cmd):
    ret = proc.wait()
    if ret > 0:
        raise SubprocessFailed('cmd %s returned %d !' % (cmd, ret))
    return

it raises:
data_io.SubprocessFailed: cmd gunzip -c /home/sxyl3800/workspace/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test/ali*.gz | ali-to-pdf /home/sxyl3800/workspace/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test/final.mdl ark:- ark:- returned 127 !
What is the cleanup function for? And why does it raise an error when ret = proc.wait() > 0?
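On the second question: cleanup() is a watchdog that turns a failed pipeline into a Python exception instead of a silent truncated read. Exit status 127 specifically means the shell could not find a command, here most likely ali-to-pdf missing from $PATH (i.e. $KALDI_ROOT not set). A tiny demonstration of that convention, using an intentionally bogus command name:

```python
import subprocess

# POSIX shells return 127 when the command to run does not exist --
# the same status the cleanup() thread saw from the ali-to-pdf pipeline.
proc = subprocess.Popen('definitely_no_such_command_xyz',
                        shell=True,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
ret = proc.wait()
```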
