
kaldi-io-for-python's Introduction

kaldi-io-for-python

'Glue' code connecting kaldi data and python.

Supported data types

  • vector (integer)
  • Vector (float, double)
  • Matrix (float, double)
  • Posterior (posteriors, nnet1 training targets, confusion networks, ...)

Examples

Reading feature scp example:
import kaldi_io
for key,mat in kaldi_io.read_mat_scp(file):
  ...
Writing feature ark to file/stream:
import kaldi_io
with open(ark_file,'wb') as f:
  for key,mat in d.items():  # 'd' maps utterance keys to numpy matrices
    kaldi_io.write_mat(f, mat, key=key)
Writing features as 'ark,scp' by pipeline with 'copy-feats':
import kaldi_io
ark_scp_output='ark:| copy-feats --compress=true ark:- ark,scp:data/feats2.ark,data/feats2.scp'
with kaldi_io.open_or_fd(ark_scp_output,'wb') as f:
  for key,mat in d.items():  # 'd' maps utterance keys to numpy matrices
    kaldi_io.write_mat(f, mat, key=key)

Install

  • from pypi:
pip install kaldi_io
  • from sources:
git clone https://github.com/vesis84/kaldi-io-for-python.git <kaldi-io-dir>
cd <kaldi-io-dir>
pip install -r requirements.txt
pip install --editable .

Note: it is recommended to set the KALDI_ROOT environment variable (export KALDI_ROOT=<some_kaldi_dir>). The pipe-based I/O can then invoke kaldi binaries.
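For instance, a minimal sketch of doing this from Python before importing kaldi_io (the path '/opt/kaldi' is a placeholder for your own Kaldi checkout):

```python
import os

# Placeholder path -- point this at your own Kaldi checkout.
os.environ['KALDI_ROOT'] = '/opt/kaldi'

# kaldi_io reads $KALDI_ROOT at import time to extend $PATH with Kaldi's
# binary dirs, so the variable must be set *before* 'import kaldi_io'.
os.environ['PATH'] = (os.path.join(os.environ['KALDI_ROOT'], 'src', 'featbin')
                      + os.pathsep + os.environ.get('PATH', ''))
```

With the variable set, rspecifiers like 'ark:copy-feats ... |' can find the binaries.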

Unit tests

(note: these are not included in pypi package)

Unit tests are started this way:

./run_tests.sh

or by:

python3 -m unittest discover -s tests -t .
python2 -m unittest discover -s tests -t .

License

Apache License, Version 2.0 ('LICENSE-2.0.txt')

Community

  • accepting pull requests with extensions on GitHub
  • accepting feedback via GitHub 'Issues' in the repo

kaldi-io-for-python's People

Contributors

alicegaz, bobchennan, boeddeker, florian1990, hawkaaron, karelvesely84, oplatek, xx205


kaldi-io-for-python's Issues

Immediately reading a big file after writing it may lead to an error?

I am wondering if we need to wait for the thread that writes files to hard disk in this script. Today I used kaldi_io to write a big file and immediately read it, which led to this error:

ivector-mean ark:data/train/spk2utt scp:exp/train_embed_vectors/embeddings.scp 'ark:| copy-vector ark:- ark,scp:exp/train_embed_vectors/spk_embeddings.ark,exp/train_embed_vectors/spk_embeddings.scp' ark,t:exp/train_embed_vectors/num_utts.ark 
WARNING (ivector-mean[5.4.84~1405-c643]:ReadScriptFile():kaldi-table.cc:72) Invalid 148626'th line in script file:"id11251-gFfcgOVmiO0-00006"
WARNING (ivector-mean[5.4.84~1405-c643]:ReadScriptFile():kaldi-table.cc:46) [script file was: exp/train_embed_vectors/embeddings.scp]
ERROR (ivector-mean[5.4.84~1405-c643]:RandomAccessTableReader():util/kaldi-table-inl.h:2512) Error opening RandomAccessTableReader object  (rspecifier is: scp:exp/train_embed_vectors/embeddings.scp)

The error will disappear after I manually type this cmd:

ivector-mean ark:data/train/spk2utt scp:exp/train_embed_vectors/embeddings.scp 'ark:| copy-vector ark:- ark,scp:exp/train_embed_vectors/spk_embeddings.ark,exp/train_embed_vectors/spk_embeddings.scp' ark,t:exp/train_embed_vectors/num_utts.ark

So does this cmd:

copy-vector "scp:echo 'id11251-gFfcgOVmiO0-00006 exp/train_embed_vectors/embeddings.ark:614118526' |" ark,t:-|less

I guess the possible reason is that when I read the newly created big file, the earlier thread is still writing it.
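That guess is plausible: a hedged sketch of the mitigation is to make sure the writer has fully flushed (and, for piped writes, that the Kaldi subprocess has exited) before reading the file back. For a plain file this can be forced explicitly (the path below is a toy stand-in for the big ark):

```python
import os
import tempfile

# Toy stand-in for the big ark being written (path is hypothetical).
path = os.path.join(tempfile.mkdtemp(), 'big.ark')

f = open(path, 'wb')
f.write(b'\x00' * 1024)
f.flush()               # push Python's buffer to the OS
os.fsync(f.fileno())    # ask the OS to commit the data to disk
f.close()

# Only now is it safe for another process to read the full file.
size = os.path.getsize(path)
```

For the piped 'ark:| copy-vector ...' case, the analogue is closing the file object returned by open_or_fd before reading, which sends EOF to the subprocess and lets the cleanup thread wait() on it.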

Modifications to $PATH if $KALDI_ROOT is not set

Hi,

Are the modifications to the PATH variable in lines 17-24 really necessary?

If yes, I would suggest replacing the modifications with an exception when $KALDI_ROOT is not set; and if they are not necessary for the script, I would suggest removing them completely!

Thanks for all the work,
Best,
Quentin

Reading scp files created by subsegment_data_dir.sh

Hi,

I was trying to read mfcc features from a subsegmented directory, i.e. one created using utils/data/subsegment_data_dir.sh. The contents of feats.scp are of the form:

<path_to_ark_file>:xx[0:N]

Currently, this cannot be handled by read_mat_scp. Is there any alternative?
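Until read_mat_scp learns ranges, one workaround sketch is to strip the range suffix yourself, read the whole matrix, and slice. The helper below handles only the simple row-range form path.ark:offset[start:end]; Kaldi's full grammar also allows column ranges, which are not covered here:

```python
import re

def split_row_range(rxfilename):
    """Split e.g. 'feats.ark:5[30:40]' into ('feats.ark:5', (30, 40)).
    Returns (rxfilename, None) when no range suffix is present."""
    m = re.match(r'^(.*)\[(\d+):(\d+)\]$', rxfilename)
    if m is None:
        return rxfilename, None
    return m.group(1), (int(m.group(2)), int(m.group(3)))

# With kaldi_io one would then do (not run here):
#   rxfile, rng = split_row_range(entry)
#   mat = kaldi_io.read_mat(rxfile)
#   if rng is not None:
#       mat = mat[rng[0]:rng[1] + 1]   # Kaldi row ranges are inclusive
```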

Compressed matrices lead to `sample_size` not defined

If the matrix type is CM, then the sample_size assertions will fail because we never set sample_size.

Not sure what the sample_size is for CM, or whether it should simply complain?

Either way I'm happy to make a pull request.

It cannot be used with python3

First, thank you for your work.
I saw that you added python3 compatibility, but it seems the code in kaldi_io.py was not revised accordingly, and as a matter of fact I cannot run kaldi_io.py under python3. If it really can run on python3, could you give me some advice on how to use it? Before now, I have tried to modify the code to fit python3, but it does not work.
The primary problem is str vs. bytes, which differ between python2 and python3.
Thank you, I hope for a response!

Exit code 255 with open_or_fd

Hi,
I'm working with python 3.5.2, and I am using a virtualenv to run kaldi_io. I'm trying to use this sample:

import kaldi_io
ark_scp_output='ark:| copy-feats --compress=true ark:- ark,scp:data/feats2.ark,data/feats2.scp'
with kaldi_io.open_or_fd(ark_scp_output,'wb') as f:
  for key,mat in dict.iteritems():
    kaldi_io.write_mat(f, mat, key=key)

My dict is a csv file, such that the first column is utt-id, and the 2nd column is the feature. This feature is in a 2-D (1x1) numpy matrix format. I have 2 problems:
a) In using key, string type is not supported. Since that is an optional argument, I just didn't pass it. But I know it will be important.

b) In using open_or_fd, I'm getting the following error:
ERROR (copy-feats[5.4.176~1-be967]:Read():kaldi-matrix.cc:1616) Failed to read matrix from stream. : Expected "[", got "��

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::Matrix::Read(std::istream&, bool, bool)
kaldi::KaldiObjectHolder<kaldi::Matrix >::Read(std::istream&)
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Next()
kaldi::SequentialTableReaderArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Open(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::Open(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
main
__libc_start_main
_start

WARNING (copy-feats[5.4.176~1-be967]:Read():util/kaldi-holder-inl.h:84) Exception caught reading Table object.
WARNING (copy-feats[5.4.176~1-be967]:Next():util/kaldi-table-inl.h:574) Object read failed, reading archive standard input
WARNING (copy-feats[5.4.176~1-be967]:Open():util/kaldi-table-inl.h:521) Error beginning to read archive file (wrong filename?): standard input
ERROR (copy-feats[5.4.176~1-be967]:SequentialTableReader():util/kaldi-table-inl.h:860) Error constructing TableReader: rspecifier is ark:-

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&)
main
__libc_start_main
_start

Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/home/ayushi/Projects_2018/non_native_perception/data/recordings_edited/kaldi-io/kaldi_io/kaldi_io/kaldi_io.py", line 82, in cleanup
raise SubprocessFailed('cmd %s returned %d !' % (cmd,ret))
kaldi_io.kaldi_io.SubprocessFailed: cmd copy-feats --print-args=false ark:- ark,scp:/home/ayushi/Tools/kaldi/egs/nn_perception/data/train/feats.ark,/home/ayushi/Tools/kaldi/egs/nn_perception/data/train/feats.scp returned 255 !

Am I doing something wrong?

Writing features as 'ark,scp' by pipeline with 'copy-feats'

Hi,
I can't correctly execute the example you provided to write the .ark and .scp files at the same time.
(screenshot: error_kaldi_io)
If instead I create the ark file and use copy-feats to create its copy and the attached .scp file, I don't encounter any problems.

Reading target (alignment) files

Hi,
I was trying to use kaldi_io to import alignment files, but I could not find out how to do it, and if it's possible.

I ran the timit recipe and ended up with a number of ali.<n>.gz files for example under the exp/mono_ali/ directory. I can convert those files from transition model IDs to PDF IDs with the command (for example):
ali-to-pdf exp/mono_ali/final.mdl "ark:gunzip -c exp/mono_ali/ali.1.gz|" ark,t:mono_ali.1.pdf.txt
The resulting file contains a line for each utterance, with utterance ID (for example faem0_si1392) and then a list of integer identifiers of the states in the model for each frame in that utterance.

  • Is there a way to import this file into python using kaldi_io?
  • Is there a way to pipe the ali-to-pdf command when opening the ali.1.gz file, so that I don't need to run it separately?

Thank you!
Giampiero
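A hedged sketch of both ideas combined: kaldi_io accepts a shell pipeline as the rspecifier, so ali-to-pdf can run on the fly while reading (paths follow the example above; the Kaldi binaries must be on $PATH):

```python
# Pipe 'ali-to-pdf' inside the rspecifier so transition-ids are converted
# to pdf-ids while reading; the trailing '|' marks it as an input pipe.
ali_rspec = ('ark:gunzip -c exp/mono_ali/ali.1.gz | '
             'ali-to-pdf exp/mono_ali/final.mdl ark:- ark:- |')

# With Kaldi available one would then iterate (not run here):
#   import kaldi_io
#   for key, vec in kaldi_io.read_ali_ark(ali_rspec):
#       print(key, vec.shape)   # vec: 1-D int array, one pdf-id per frame
```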

`read_ali_ark` crashes when reading gzipped file

I am trying to read the alignment file using read_ali_ark method. My code looks like this:

src_file = 's5/exp/tri2_ali/ali.1.gz'
abc = kaldi_io.read_ali_ark(src_file)

But this crashes on assert. It goes like this:

Simply removing this assert fixes the issue and makes it possible to read gzipped ark files.

Parse matrix range in read_mat()

As of now, only read_mat_scp() supports matrix ranges (as in /path/to/file/foo.ark:5[30:40])
I suggest moving the range parsing into read_mat() so that ranges are also supported for direct calls of this function.

UTF8 decoding problem and accent management

Hi !

First of all, thank you for this great work! However, I had to change every decode() in kaldi-io.py to decode('latin-1') in order to deal with French accents. I also had to comment out an assertion that only allowed non-accented characters. It would be great if you could add this accent handling for non-English users!

Support for ranges in script files

Hi,

I noticed that kaldi_io currently does not support ranges in script files. I need this feature so I implemented it here. I guess it is best to generate test cases for that, before I open a pull request. Unfortunately I have no experience with testing in python so far. If you would like to help me with that or could point out a resource, that would be great.

Another point is that I do not yet fully understand the Kaldi rx/wxfilenames. So I guess you could also add ranges to script-file lines like
utt_id_01002 gunzip -c /usr/data/01002.gz |
but I am not sure how this would be done.

Thanks for your work on kaldi_io!

Add tags for releases

Last release on pypi is 0.9.4, but there are no tags in this repo. For packaging things well in conda-forge, we need to know the relationship between the version and the sources, which is what git tags are for. 🙃

Could you please add them, ideally also for the last release? (tags can be pushed for past commits as well)

Only load small parts of a big file

Hi, my situation is that I want to load small parts of a big ark file. Of course, it is possible to load the entire ark file and then select certain rows, but it is not memory and time efficient. I wonder if it is possible to read only small parts of the ark file? (like np.load('/tmp/123.npy', mmap_mode='r')) Thanks for your help!
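Partial loading at the matrix level is in fact what the scp mechanism provides: each scp line stores a byte offset, and read_mat opens the ark at that offset and reads just the one matrix. A sketch (file names hypothetical):

```python
def parse_scp_line(line):
    """Split one feats.scp line, e.g. 'utt1 feats.ark:14213', into
    (key, rxfilename). The ':14213' is the byte offset of that single
    matrix inside the ark, so reading it touches only that record."""
    key, rxfile = line.rstrip('\n').split(' ', 1)
    return key, rxfile

key, rxfile = parse_scp_line('utt1 feats.ark:14213\n')

# With kaldi_io one would then read just this matrix (not run here):
#   mat = kaldi_io.read_mat(rxfile)
```

Note this gives per-matrix granularity, not np.memmap-style row slicing within one matrix; splitting very large matrices across several keys is the usual workaround.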

Updated kaldi_io cannot read from pipe (python3)

Hi Karel,
It's nice to support python3. I tested the new kaldi_io script; however, though it works fine for directly reading "feats.ark", it fails when reading from the stream "ark:apply-cmvn-sliding --center=true ark:feats.ark ark:- |". Line 49 throws an error:

fd = os.popen(file[:-1], 'rb')
  File "/cm/shared/apps/python/3.6/lib/python3.6/os.py", line 970, in popen
    raise ValueError("invalid mode %r" % mode)

This may be caused by the difference in stdin/stdout handling between python2 and python3.
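Right: on python3, os.popen() only accepts text modes. A sketch of the usual replacement using subprocess, which works on both python versions:

```python
import subprocess

def popen_rb(cmd):
    """Open a shell pipeline for binary reading. Unlike
    os.popen(cmd, 'rb'), which raises ValueError on python3,
    subprocess.Popen hands back a genuine binary file object."""
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    return proc.stdout   # caller should also proc.wait() when done

fd = popen_rb('printf hello')
data = fd.read()
```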

Also writing to a .scp file would be cool

Hi @vesis84 ,
When writing a feature matrix to a .ark file, it might be helpful to generate the corresponding .scp file to indicate positions.
Like the guys do here (http://kaldi-to-matlab.gforge.inria.fr/); that would complete this tool's functionality.
BTW, It's a great tool, thank you. Tests are also great.
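A sketch of the idea, assuming Kaldi's scp convention that the offset points just past the 'key ' token. write_mat itself doesn't return offsets, so we take f.tell() ourselves; the write_payload callback is a stand-in for the real record writer (e.g. kaldi_io.write_mat with its default empty key, so the key is not written twice):

```python
import os
import tempfile

def write_ark_with_scp(ark_path, scp_path, items, write_payload):
    """items: iterable of (key, obj); write_payload(f, obj) writes one
    binary record (real use: lambda f, m: kaldi_io.write_mat(f, m)).
    Each scp line is 'key ark_path:offset'."""
    with open(ark_path, 'wb') as ark, open(scp_path, 'w') as scp:
        for key, obj in items:
            ark.write((key + ' ').encode())
            offset = ark.tell()          # points at the binary header
            write_payload(ark, obj)
            scp.write('%s %s:%d\n' % (key, ark_path, offset))

# toy demo with a fake 5-byte payload instead of a real Kaldi matrix
tmp = tempfile.mkdtemp()
ark, scp = os.path.join(tmp, 'f.ark'), os.path.join(tmp, 'f.scp')
write_ark_with_scp(ark, scp, [('utt1', b'\x00B123')],
                   lambda f, payload: f.write(payload))
scp_line = open(scp).read()
```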

About AssertionError

Hi,
When I read compressed features in parallel from Python, I encountered the following problem; asking for help.

AssertionError: Caught AssertionError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zhangpeng/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_dataset.py", line 68, in __getitem__
full_mat = read_mat(self.dataset[aid][1])
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 717, in read_mat
mat = _read_mat_binary(fd)
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 730, in _read_mat_binary
if header.startswith('CM'): return _read_compressed_mat(fd, header)
File "/home/zhangpeng/pycharmProjects/FFSVC/datasets/kaldi_io.py", line 776, in _read_compressed_mat
assert(format == 'CM ') # The formats CM2, CM3 are not supported...

Thank you~

Nnet example files

I'm trying to access the features that are used for kaldi's dnn model. It looks like these matrices are stored as a different type of object (Nnet3Eg, NumIo). I don't see that these are supported. Would it be non-trivial to read them?

Typo in UnknownMatrixHeader class definition

UnknownMatrixHeader is undefined due to a typo in the class definition. The diff below contains a fix:

$ git diff
diff --git a/kaldi_io.py b/kaldi_io.py
index e05a60c..49d518c 100755
--- a/kaldi_io.py
+++ b/kaldi_io.py
@@ -21,7 +21,7 @@ os.environ['PATH'] = os.popen('echo $KALDI_ROOT/src/bin:$KALDI_ROOT/tools/openfs
 # Define all custom exceptions,
 class UnsupportedDataType(Exception): pass
 class UnknownVectorHeader(Exception): pass
-class UnkonwnMatrixHeader(Exception): pass
+class UnknownMatrixHeader(Exception): pass

 class BadSampleSize(Exception): pass
 class BadInputFormat(Exception): pass

hardcoded path

Can you change the hardcoded path in kaldi_io/kaldi_io.py from:

os.environ['KALDI_ROOT']='/mnt/matylda5/iveselyk/Tools/kaldi-trunk'

to something like:

os.environ['KALDI_ROOT']=os.path.join(os.environ['CONDA_PREFIX'], 'bin')

?

Maybe also, instead of printing warnings, using the logging module would be useful, so issues like that can be suppressed?

appended scp and ark file

Hello,
I am working on kaldi and I want to try other features. I use kaldi-io-for-python and it works, but I want to have the same number of scp and ark files as my number of jobs, like the default kaldi features. However, the "open_or_fd" function doesn't have an 'ab' mode, and I want to append to my ark and scp files.
Could anyone give me some suggestions, please?
Regards!
Zhor
:)
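A possible workaround, since write_mat accepts any binary file object: open the ark in append mode yourself rather than through open_or_fd. The toy below only demonstrates that 'ab' keeps earlier records; the commented lines show the intended kaldi_io use:

```python
import os
import tempfile

# Intended use (not run here):
#   with open('feats.ark', 'ab') as f:
#       kaldi_io.write_mat(f, mat, key=key)
# Matching scp lines would need f.tell() noted before each record,
# since offsets continue past the pre-existing content.

path = os.path.join(tempfile.mkdtemp(), 'toy.ark')
with open(path, 'ab') as f:
    f.write(b'first ')
with open(path, 'ab') as f:   # reopening in 'ab' keeps earlier records
    f.write(b'second')
content = open(path, 'rb').read()
```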

Paths as keys for Matrices

Hey Karel,
since the humble beginnings of this script, the read_key function has only supported keys without a / symbol. I was just wondering: is there a reason it is like that, since kaldi itself supports keys with / in them (e.g. audio/file.wav), but kaldi_io does not?

Otherwise, I just propose to change line 115 to:

assert(re.match(r'^[/.a-zA-Z0-9_-]+$', key) is not None)

Packaging as a library

Hey Karel,
I'd like to ask if it would be possible to ship this script as a library (e.g. installable with pip), since I guess most people using this script copy it around their systems a lot. It would just be a bit more convenient.

Query on wav.scp reader - Streaming audio

Hi,

I have found this tool very useful for understanding the kaldi IO mechanisms. I have a small query on extracting samples from streaming speech.

Is it possible, using ReadHelper, to pass a real-time audio signal and observe the numpy_array of samples?

Thanks in advance,

Regards
Pradeep
PhD student
Dept of CSE
IIT Kharagpur

"Failed to read vector from stream. : Expected token FV, got W"

Hello,
I'm getting an error when attempting to use copy-vector on the output of 'kaldi_io.write_vec_int'.

Error is: "Failed to read vector from stream. : Expected token FV, got W"

Goal: I have a large text file of kaldi features. The file is in .ark format, but the contents are in human-readable form, which I created using 'copy-feats ark:- ark,t:-'. I want to create multiple small files, where each file contains one key and mat pair. To do this I read the ark file using kaldi_io and attempt to write a new file, also using kaldi_io, inside the kaldi_io.read_vec_int_ark loop. I am able to read key and mat from the file successfully, but an error occurs when attempting to write.

Code:
for key, mat in kaldi_io.read_vec_int_ark(sfile):
    print("{} {}".format(key, mat.shape))

    # create new file to write to
    new_file_path_txt = os.path.join(sdir, "{}.{}".format(key, file_tail))
    new_file_path = os.path.join(sdir, "{}.ark".format(key))

    # write new file
    print("type: {}".format(type(mat)))
    print("dtype: {}".format(mat.dtype))
    mat = mat.astype('int32')  # need to cast for writing purposes
    print("dtype2: {}".format(mat.dtype))

    ark_txt_output = 'ark:| copy-vector ark:- ark,t:{}'.format(new_file_path_txt)
    with kaldi_io.open_or_fd(ark_txt_output, 'wb') as w:
        kaldi_io.write_vec_int(w, mat, key=key)

I met an error when I use the read_vec_int_ark function

elif mode == "rb":
    err = open(output_folder + '/log.log', "a")
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=err)
    threading.Thread(target=cleanup, args=(proc, cmd)).start()  # clean-up thread,
    return proc.stdout

When the program reaches threading.Thread(target=cleanup, args=(proc, cmd)).start(), with

def cleanup(proc, cmd):
    ret = proc.wait()
    if ret > 0:
        raise SubprocessFailed('cmd %s returned %d !' % (cmd, ret))
    return

it raises:
data_io.SubprocessFailed: cmd gunzip -c /home/sxyl3800/workspace/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test/ali*.gz | ali-to-pdf /home/sxyl3800/workspace/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test/final.mdl ark:- ark:- returned 127 !
What is the cleanup function for? And why does it raise an error when ret = proc.wait() > 0?
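On the second question: cleanup() is a watchdog that turns a failed pipeline into a Python exception instead of a silent truncated read. Exit status 127 specifically means the shell could not find a command, here most likely ali-to-pdf missing from $PATH (i.e. $KALDI_ROOT not set). A tiny demonstration of that convention, using an intentionally bogus command name:

```python
import subprocess

# POSIX shells return 127 when the command to run does not exist --
# the same status the cleanup() thread saw from the ali-to-pdf pipeline.
proc = subprocess.Popen('definitely_no_such_command_xyz',
                        shell=True,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
ret = proc.wait()
```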
