
bgen-reader-py's Introduction

limix


Genomic analyses require flexible models that can be adapted to the needs of the user.

Limix is a flexible and efficient linear mixed model library with interfaces to Python. It includes methods for

  • Single-variant association and interaction testing
  • Variance decomposition analysis with linear mixed models
  • Association and interaction set tests
  • Utilities for statistical analysis, basic I/O, and plotting.

We have extensive documentation of the library. If you need further help or want to discuss anything related to limix, please join our forum 💬 and have a chat with us 😃. If you have found a bug, please report it by creating an issue.

Install

NOTE: We will be maintaining limix 2.0.x for a while, in case you find some feature missing in limix 3.0.x. If that is the case, please run pip install "limix<3,>=2" in your terminal.

Installation is easy and works on macOS, Linux, and Windows:

pip install limix

If you already have Limix but want to upgrade it to the latest version:

pip install limix --upgrade

Interactive tutorials

Running tests

After installation, you can test it with

python -c "import limix; limix.test()"

provided you have pytest installed.

Authors

License

This project is licensed under the Apache License.

bgen-reader-py's People

Contributors

carlkcarlk, horta, rm18


bgen-reader-py's Issues

More options for control of behavior in read_genotype

Given that, per the paper, the intention with bgen in practice is often to use fewer than 20 bits to store the discretized probabilities, do you think it would be reasonable to use 32-bit floats instead of 64-bit for the numpy arrays those probabilities are read into, @horta?

I'm looking at this and wondering what it would take to make that dtype a parameter, or whether it shouldn't just be float32 all the time:

probs = full((nsamples, ncombs), nan, dtype=float64)  # one row of probabilities per sample
lib.bgen_genotype_read(genotype, ffi.cast("double *", probs.ctypes.data))  # the C call writes doubles, so the dtype and the cast must change together

I'm also wondering if making it possible to skip reading some of these other fields would speed things up appreciably:

phased = lib.bgen_genotype_phased(genotype)
ploidy = full(nsamples, 0, dtype=uint8)
lib.read_ploidy(genotype, ffi.cast("uint8_t *", ploidy.ctypes.data), nsamples)
missing = full(nsamples, 0, dtype=bool)
lib.read_missing(genotype, ffi.cast("bool *", missing.ctypes.data), nsamples)
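
For a sense of the memory at stake, here is a small self-contained illustration (the sample count is made up for scale; only numpy is assumed):

import numpy as np

# With <= 20 bits of stored precision, float32's 24-bit mantissa is plenty,
# and the per-variant probability buffer shrinks by half.
nsamples, ncombs = 500_000, 3
probs64 = np.full((nsamples, ncombs), np.nan, dtype=np.float64)
probs32 = np.full((nsamples, ncombs), np.nan, dtype=np.float32)
print(probs64.nbytes / 2**20)  # ~11.4 MiB per variant
print(probs32.nbytes / 2**20)  # ~5.7 MiB per variant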

Error when installing with pip

Hi,

I was trying to install bgen-reader using pip (pip install bgen-reader --user), but at some point, I receive the error:

GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include -I/usr/include -I/usr/local/include -I/usr/include/python2.7 -c build/temp.linux-x86_64-2.7/bgen_reader._ffi.c -o build/temp.linux-x86_64-2.7/build/temp.linux-x86_64-2.7/bgen_reader._ffi.o
build/temp.linux-x86_64-2.7/bgen_reader._ffi.c:492:18: fatal error: bgen.h: No such file or directory
#include "bgen.h"
^
compilation terminated.
error: command 'gcc' failed with exit status 1

----------------------------------------

Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-zqQ5TK/bgen-reader/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-YLS4PR/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-install-zqQ5TK/bgen-reader/

Is there any workaround for this?

Best regards,

Can't figure out dev build on Durbin

[Sorry for the flood of queries. - Carl]

On Durbin, python setup.py build_ext --inplace seems to work. But when I set my Python path and try to run a test, I get ImportError: libbgen.so.3: cannot open shared object file: No such file or directory

Below are some details.
Thanks! Carl

(base) carlk@kadie2:~$ ls /usr/local/include
athr_export.h  athr.h  bgen  bgen.h  zbuff.h  zdict.h  zstd_errors.h  zstd.h
(base) carlk@kadie2:~$ cd /mnt/d/OneDrive/programs/bgen-reader-py2
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$ export PYTHONPATH=/mnt/d/OneDrive/programs/bgen-reader-py2
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$ python setup.py build_ext --inplace
running build_ext
generating cffi module 'build/temp.linux-x86_64-3.7/bgen_reader._ffi.c'
building 'bgen_reader._ffi' extension
gcc -pthread -B /home/carlk/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/carlk/anaconda3/include -I/usr/include -I/usr/local/include -I/home/carlk/anaconda3/include/python3.7m -c build/temp.linux-x86_64-3.7/bgen_reader._ffi.c -o build/temp.linux-x86_64-3.7/build/temp.linux-x86_64-3.7/bgen_reader._ffi.o
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/bgen_reader
gcc -pthread -shared -B /home/carlk/anaconda3/compiler_compat -L/home/carlk/anaconda3/lib -Wl,-rpath=/home/carlk/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/build/temp.linux-x86_64-3.7/bgen_reader._ffi.o -L/home/carlk/anaconda3/lib -L/usr/lib -L/usr/local/lib -lbgen -lz -lzstd -lathr -o build/lib.linux-x86_64-3.7/bgen_reader/_ffi.abi3.so
copying build/lib.linux-x86_64-3.7/bgen_reader/_ffi.abi3.so -> bgen_reader
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$ ls
bgen_reader  build_ext.py  docs        LICENSE.md   __pycache__  readthedocs.yml   setup.cfg  version.py
build        ci            libpath.py  MANIFEST.in  README.md    requirements.txt  setup.py
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$ cd bgen_reader/
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader$ ls
_bgen.py     _dosage.py    _ffi.abi3.so  _helper.py   _metadata.py   _samples.py  _testit.py
conftest.py  _download.py  _file.py      __init__.py  _partition.py  _string.py
_dask.py     _example      _genotype.py  interface.h  _reader.py     test


(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader$ cd test
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader/test$ ls
__init__.py  test_bgen_reader.py  test_dosage.py  test_interface.py  test_large_file.py
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader/test$ python test_bgen_reader.py
Traceback (most recent call last):
  File "test_bgen_reader.py", line 8, in <module>
    from bgen_reader import create_metafile, example_files, read_bgen
  File "/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader/__init__.py", line 26, in <module>
    from ._metadata import create_metafile
  File "/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader/_metadata.py", line 4, in <module>
    from ._bgen import bgen_file
  File "/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader/_bgen.py", line 3, in <module>
    from ._ffi import ffi, lib
ImportError: libbgen.so.3: cannot open shared object file: No such file or directory
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader/test$ cd ..
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2/bgen_reader$ cd ..
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$ find . | fgrep libbgen.so
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$ echo $PYTHONPATH
/mnt/d/OneDrive/programs/bgen-reader-py2
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$ find . | fgrep .so
./bgen_reader/_ffi.abi3.so
./build/lib.linux-x86_64-3.7/bgen_reader/_ffi.abi3.so
(base) carlk@kadie2:/mnt/d/OneDrive/programs/bgen-reader-py2$
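
A quick, library-agnostic diagnostic for this situation (my suggestion, not part of the original report): ask the loader itself whether it can see libbgen.

import ctypes.util

# find_library consults the linker's search paths; None means libbgen is
# invisible to the loader even though the headers were installed.
print(ctypes.util.find_library("bgen"))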


sporadic error when reading in genotypes

Hi Danilo,

Thanks for the great package! I have run into one problem, though: when reading genotype probabilities with x = bgen['genotype'][:100].compute(), it sometimes works perfectly and sometimes I get an error message (see below). This happens when I run the command multiple consecutive times on the exact same set of SNPs. Can you think of any explanation for this behavior? Does the function use any stochastic methods that could lead to the exact same command failing sometimes while working otherwise?

Best wishes,
Armin

x1 = bgen['genotype'][:100].compute()
x1 = bgen['genotype'][:100].compute()
Traceback (most recent call last):
File "", line 1, in
File "/.../python2.7/site-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/.../python2.7/site-packages/dask/base.py", line 395, in compute
results = schedule(dsk, keys, **kwargs)
File "/.../python2.7/site-packages/dask/threaded.py", line 75, in get
pack_exception=pack_exception, **kwargs)
File "/.../python2.7/site-packages/dask/local.py", line 501, in get_async
raise_exception(exc, tb)
File "/.../python2.7/site-packages/dask/local.py", line 272, in execute_task
result = _execute_task(task, data)
File "/.../python2.7/site-packages/dask/local.py", line 252, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/.../python2.7/site-packages/dask/local.py", line 253, in _execute_task
return func(*args2)
File "/.../python2.7/site-packages/bgen_reader/_reader.py", line 192, in
Gcall = delayed(lambda *args: _genotype_block(*args)[0], **kws)
File "/.../python2.7/site-packages/bgen_reader/_pylru.py", line 569, in wrapper
self.cache[key] = value
File "/.../python2.7/site-packages/bgen_reader/_pylru.py", line 141, in setitem
del self.table[node.key]
KeyError: ((<cdata 'struct bgen_vi * *' owning 8 bytes>, 487409, 26, 13), ())
x1 = bgen['genotype'][:100].compute()
x1 = bgen['genotype'][:100].compute()

Deal with NULL ids

Some IDs and rsids in BGEN files are stored as null, and therefore cause the wrapper to crash.

bgen_reader.allele_expectation allocates memory based on unindexed genotype

bgen_reader.allele_expectation allocates memory based on the unindexed genotype. This causes problems when indexing a large bgen (for example UKBioBank).

The following code attempts to allocate a 4.45 TiB array when computing the expectation for a single variant and sample:

from bgen_reader import open_bgen
bgen = open_bgen('ukb_imp_chr22_v3.bgen', samples_filepath = 'ukb1404_imp_chr1_v2_s487406.sample', verbose = True)
bgen.allele_expectation(index=(1, 1))
Traceback (most recent call last):
File "", line 1, in
File "/n/home12/jrossen/.conda/envs/python3/lib/python3.8/site-packages/bgen_reader/_bgen2.py", line 1381, in allele_expectation
ploidy0 = self.read(return_probabilities=False, return_ploidies=True)[
File "/n/home12/jrossen/.conda/envs/python3/lib/python3.8/site-packages/bgen_reader/_bgen2.py", line 563, in read
ploidy_val = np.full(
File "/n/home12/jrossen/.conda/envs/python3/lib/python3.8/site-packages/numpy/core/numeric.py", line 343, in full
a = empty(shape, dtype, order)
numpy.core._exceptions.MemoryError: Unable to allocate 4.45 TiB for an array with shape (487409, 1255683) and data type int64
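
The reported size is consistent with the shape in the traceback; plain arithmetic, using only the numbers above:

# an int64 entry for every sample x variant pair, even though a single
# variant and sample were requested
nbytes = 487_409 * 1_255_683 * 8
print(nbytes / 2**40)  # ~4.45 (TiB)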

'bgen_reader._ffi' not found

Hi,

I'm having an issue just importing read_bgen: a "ModuleNotFoundError: No module named 'bgen_reader._ffi'" error is returned. I get this on both my Mac and Linux machines, but I have an older installation on a different Mac that works really well.

Output from loading in ipython is pasted below. Any help would be really appreciated!

Running Python 3.6.2, IPython 6.1.0

In [1]: from bgen_reader import read_bgen
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-0d4fe5d6d0d4> in <module>()
----> 1 from bgen_reader import read_bgen

~/.conda/envs/py3-ml/lib/python3.6/site-packages/bgen_reader/__init__.py in <module>()
     10 from __future__ import absolute_import as _
     11 
---> 12 from .bgen_reader import convert_to_dosage, read_bgen, create_metadata_file
     13 from .testit import test
     14 

~/.conda/envs/py3-ml/lib/python3.6/site-packages/bgen_reader/bgen_reader.py in <module>()
      2 from multiprocessing import cpu_count
      3 from multiprocessing.pool import ThreadPool
----> 4 from ._misc import (make_sure_bytes, create_string,
      5                     check_file_exist, check_file_readable)
      6 

~/.conda/envs/py3-ml/lib/python3.6/site-packages/bgen_reader/_misc.py in <module>()
      3 import sys
      4 import os
----> 5 from ._ffi import ffi
      6 from os.path import exists
      7 

ModuleNotFoundError: No module named 'bgen_reader._ffi'

pickle error on dask cluster

Hi,

I'm getting an error when using bgen_reader on a distributed dask cluster. I can load the bgen file in, but I get an error when I try to access the genotypes after that. This works without any problems if I don't start a cluster though. Is this something that could be fixed, or could it be specific to my cluster setup?

>>> from bgen_reader import read_bgen
>>> cluster = get_cluster()  # code omitted, but starts cluster on Sun Grid Engine
>>> cluster
SGECluster('tcp://10.112.80.16:35423', workers=5, threads=10, memory=200.00 GB)
>>> bgen = read_bgen(PATH_TO_BGEN, samples_filepath=PATH_TO_SAMPLES)
>>> bgen['genotype'][0].compute()
Traceback (most recent call last):
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/distributed/worker.py", line 3244, in dumps_function
    result = cache_dumps[func]
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/distributed/utils.py", line 1439, in __getitem__
    value = super().__getitem__(key)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/collections/__init__.py", line 1027, in __getitem__
    raise KeyError(key)
KeyError: <function _get_read_genotype.<locals>._read_genotype at 0x2aab08752b00>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 38, in dumps
    result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
AttributeError: Can't pickle local object '_get_read_genotype.<locals>._read_genotype'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/dask/base.py", line 165, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/distributed/client.py", line 2565, in get
    actors=actors,
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/distributed/client.py", line 2492, in _graph_to_futures
    "tasks": valmap(dumps_task, dsk3),
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/toolz/dicttoolz.py", line 83, in valmap
    rv.update(zip(iterkeys(d), map(func, itervalues(d))))
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/distributed/worker.py", line 3282, in dumps_task
    return {"function": dumps_function(task[0]), "args": warn_dumps(task[1:])}
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/distributed/worker.py", line 3246, in dumps_function
    result = pickle.dumps(func)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 51, in dumps
    return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/cloudpickle/cloudpickle.py", line 1125, in dumps
    cp.dump(obj)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/cloudpickle/cloudpickle.py", line 482, in dump
    return Pickler.dump(self, obj)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/cloudpickle/cloudpickle.py", line 556, in save_function
    return self.save_function_tuple(obj)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/site-packages/cloudpickle/cloudpickle.py", line 758, in save_function_tuple
    save(state)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 859, in save_dict
    self._batch_setitems(obj.items())
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 885, in _batch_setitems
    save(v)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 859, in save_dict
    self._batch_setitems(obj.items())
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 885, in _batch_setitems
    save(v)
  File "/home/jmcrae/.local/miniconda/lib/python3.7/pickle.py", line 531, in save
    (t.__name__, obj))
_pickle.PicklingError: Can't pickle 'CompiledLib' object: <Lib object for 'bgen_reader._ffi'>

index out of range in bgen["variants"]

Hi,
I am using the bgen_reader package to read bgen files and compute dosage values. I have 5,089 samples and 400,000 variants. But I have run into a strange issue. I ran code like the following:

bgen=read_bgen(bgen_file,verbose=False)
variants=bgen['variants'].compute()

The resulting data frame, variants, has 400,000 rows, but the row index only goes up to 632. Many rows are indexed by the same number, although they are indeed different variants. This seems related to the npartitions in the Dask DataFrame structure of bgen["variants"]:

bgen["variants"]
Dask DataFrame Structure:
id rsid chrom pos nalleles allele_ids vaddr
npartitions=632
0 object object object int64 int64 object int64
633 ... ... ... ... ... ... ...
... ... ... ... ... ... ...
399423 ... ... ... ... ... ... ...
399999 ... ... ... ... ... ... ...
Dask Name: from-delayed, 1264 tasks

bgen['variants'].compute()
id rsid chrom pos nalleles allele_ids vaddr
0 1 1:754198:A:G 1 754198 2 A,G 60
1 1 1:762330:G:T 1 762330 2 G,T 632
2 1 1:764074:T:C 1 764074 2 T,C 1974
3 1 1:791420:A:C 1 791420 2 A,C 2631
4 1 1:792643:A:G 1 792643 2 A,G 2988
.. .. ... ... ... ... ... ...
572 21 21:30030111:A:G 21 30030111 2 A,G 1121838554
573 21 21:30030207:T:C 21 30030207 2 T,C 1121839479
574 21 21:30031175:C:T 21 30031175 2 C,T 1121840770
575 21 21:30035212:G:A 21 30035212 2 G,A 1121842971
576 21 21:30038641:C:T 21 30038641 2 C,T 1121843427

[400000 rows x 7 columns]

Thus, when I run

compute_dosage(allele_expectation(bgen, any int >632 ))

I get the following error message:

nalleles = bgen["variants"].loc[variant_idx, "nalleles"].compute().values[0]
IndexError: index 0 is out of bounds for axis 0 with size 0

I don't know how to deal with this. I will be very grateful for any comments!
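
For reference, a partial workaround sketch (my assumption; it renumbers the dataframe rows but does not fix the underlying genotype indexing):

# discard the per-partition row index and renumber rows 0..n-1
variants = bgen['variants'].compute().reset_index(drop=True)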

Thank you!

SijiaWang

example_filepath reports AWS error

If we run the first example from the documentation, it fails:

from bgen_reader import example_filepath
bgen_file = example_filepath("example.bgen")

# Read from the file
from bgen_reader import open_bgen
bgen = open_bgen(bgen_file, verbose=False)
probs0 = bgen.read(0)   # Read 1st variant
print(probs0.shape)     # Shape of the NumPy array
# (500, 1, 3)
probs_all = bgen.read() # Read all variants
print(probs_all.shape)  # Shape of the NumPy array
# (500, 199, 3)
HTTPError: 404 Client Error: Not Found for url: https://bgen-examples.s3.amazonaws.com/example.bgen

@horta If you'd like me to switch it to use Pooch (as per issue #31), assign it to me and let me know.

-- Carl

Feature request: bulk distributed access

Thank you for all your work on this library! We are using bgen-reader-py to implement a Dask-based BGEN reader in sgkit-dev/sgkit-bgen#1. There is already some support for Dask, but (as far as I can tell) the delayed object that is created is for a single row (variant), rather than a chunk of rows. (See https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46.) This is a good choice for random access, but may be prohibitive for bulk access.

Would it be possible to support bulk access with an API that returns the genotype probability array as a Dask array?
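
A minimal sketch of what such an API might look like, assuming the open_bgen interface (its nsamples, nvariants, and max_combinations properties and a read method that accepts a slice); the file name and chunk size are illustrative:

import dask.array as da
from dask import delayed
from bgen_reader import open_bgen

bgen = open_bgen("example.bgen", verbose=False)
chunk = 1_000  # variants per dask task, rather than one task per variant

blocks = []
for start in range(0, bgen.nvariants, chunk):
    stop = min(start + chunk, bgen.nvariants)
    blocks.append(
        da.from_delayed(
            delayed(bgen.read)(slice(start, stop)),
            shape=(bgen.nsamples, stop - start, bgen.max_combinations),
            dtype="float64",
        )
    )

probs = da.concatenate(blocks, axis=1)  # samples x variants x combinations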

Retrieving all genotypes for a single sample is very slow

It looks like the current implementation is optimised for retrieving one variant's genotypes across all samples. For my data, the example case print(bgen['genotype'][0].compute()) is very fast, but print(bgen['genotype'][:, 0].compute()) is extremely slow and often exhausts all memory.

My guess is that the dask array chunk size is set so that single-variant retrieval is optimised. Could we have an option to optimise retrieval of all genotypes for a given list of samples? Or is there a better way to print/get all genotypes for a single sample?

ImportError

Hi, I installed via conda install -c conda-forge bgen-reader. When I imported it with from bgen_reader import example_filepath, I got an import error. Not sure how to fix it. Thanks!

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yc2553/miniconda3_new/envs/cy2/lib/python3.8/site-packages/bgen_reader/__init__.py", line 23, in <module>
from ._bgen2 import open_bgen
File "/home/yc2553/miniconda3_new/envs/cy2/lib/python3.8/site-packages/bgen_reader/_bgen2.py", line 11, in <module>
from cbgen import bgen_file, bgen_metafile
File "/home/yc2553/miniconda3_new/envs/cy2/lib/python3.8/site-packages/cbgen/__init__.py", line 4, in <module>
from ._bgen_file import bgen_file
File "/home/yc2553/miniconda3_new/envs/cy2/lib/python3.8/site-packages/cbgen/_bgen_file.py", line 11, in <module>
from ._ffi import ffi, lib
ImportError: libathr.so.1: cannot open shared object file: No such file or directory

ci\before-build-win.bat references bgen install.bat that no longer exists

Greetings,

The file bgen-reader-py\ci\before-build-win.bat calls

powershell -Command "(New-Object Net.WebClient).DownloadFile('https://raw.githubusercontent.com/limix/bgen/master/install.bat', 'install-bgen.bat')"

and then fails, because the bgen GitHub project no longer contains an install.bat file.

Minor bug in docs

There is a line of sample code in the docs that says:
from bgen_reader import allele_expectation, example_filepath, read_bgen

It should say:
from bgen_reader import allele_expectation, example_filepath, open_bgen

In other words, open_bgen rather than read_bgen.

Saving metadata in a different dir than the bgen file

Thanks for making this library. I am starting to use it, but my bgen files are saved in a location I can't write to. Is there any way to store the metadata file (for speedy access) in a different directory than the bgen files?

As expected, I get the following message:

warnings.warn(_metafile_nowrite_dir.format(filepath=metafile), UserWarning)
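
For reference, recent bgen-reader releases document a metafile_filepath argument on read_bgen that should cover this; a sketch with placeholder paths:

from bgen_reader import read_bgen

# keep the metadata in a writable directory, away from the read-only bgen
bgen = read_bgen(
    "/read-only/share/data.bgen",
    metafile_filepath="/home/user/metadata/data.bgen.metadata",
)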

Many thanks in advance,

Elena

GitHub Action python-package.yml seems to have rotted

I don't think python-package.yml works anymore for any branch.

When I check in a minor change on a branch and python-package.yml runs, it gets an error from --use-feature=in-tree-build, which I think is now obsolete.

When I remove that and check in, I get a flake8 error on a file that I haven't changed:

Run flake8 .
would reformat bgen_reader/test/write_random.py

(I haven't been able to repro this under Ubuntu & Miniconda 3.9 or Windows and Conda 3.7)

RuntimeError: Could not open /path/to/test.bgen.metadata

Describe the bug
I am upgrading from an older version of bgen_reader (migrating my code to Python 3) and the read_bgen commands no longer work. According to the docs, the metadata file is supposed to be written if it isn't found, but that doesn't seem to be happening.

To reproduce, simply import and then read a bgen file without metadata:
$ python
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

import bgen_reader
bgen = bgen_reader.read_bgen("test.bgen")
setupterm() failed: TERM=xterm-256color not found in database or too generic
Reading samples|===============================================================|
Error: Unrecognized bgen index version: �.
Traceback (most recent call last):
File "", line 1, in
File "/path/stuff/python3.7/site-packages/bgen_reader/_reader.py", line 84, in read_bgen
variants = map_metadata(filepath, metafile_filepath)
File "/path/stuff/lib/python3.7/site-packages/bgen_reader/_partition.py", line 14, in map_metadata
with bgen_metafile(metafile_filepath) as mf:
File "/path/stuff/lib/python3.7/contextlib.py", line 112, in enter
return next(self.gen)
File "/path/stuff/lib/python3.7/site-packages/bgen_reader/_bgen.py", line 22, in bgen_metafile
raise RuntimeError(f"Could not open {filepath}.")
RuntimeError: Could not open /path/stuff/tests/bedfiles/test.bgen.metadata.

These files worked fine with an earlier version (maybe one release after Python 2 support was dropped).

Can't retrieve large.bgen test file

Hi!

When I try to run

with example_files("large.bgen") as filepath:

from test_large_file, it tries to run

'curl http://rest.s3for.me/bgen/large.bgen.bz2.enc -s | openssl enc -d -pbkdf2 -aes-256-cbc -kfile /Users/horta/pass |bunzip2 > /tmp/tmp49mee87i/large.bgen'

which fails for lack of a password. (If the large file isn't private, a workaround would be just to give me temporary access and I'll make my own copy.)

Thanks,
Carl

I get "Error while storing buffer: Permission denied"

This is probably something to do with my setup (e.g. the BGEN files are in a read-only dir, and I don't know if I can write to the /tmp directory). It may also be a message output by the underlying bgen software. Even so, I think it would be useful to know if there is any way to solve it; a brief search online seems not to reveal anything.

Can't open bgen using symbolic links

Hello,

I want to use the numpy-inspired API to open a bgen file via a symbolic link. I want to do this because I don't want to create the .metadata2.mmm file in the shared directory where the bgen lives. I thought I recalled being able to do this in a previous version of bgen-reader, but am running into the following problem now.

from bgen_reader import open_bgen
wha = open_bgen('/n/holylfs05/LABS/liang_lab_l3/Lab/jordan/ssctpr/data/ukbb_symbolic_link/ukb_imp_chr1_v3.bgen')
Traceback (most recent call last):
File "", line 1, in
File "/n/home_fasse/jrossen/.conda/envs/ssctpr2/lib/python3.9/site-packages/bgen_reader/_bgen2.py", line 133, in init
assert_file_exist(filepath)
File "/n/home_fasse/jrossen/.conda/envs/ssctpr2/lib/python3.9/site-packages/bgen_reader/_file.py", line 14, in assert_file_exist
raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), filepath)

I don't have a very strong understanding of how PosixPath works, but after a little playing around, it seems like PosixPath precludes the use of symbolic links.

Either allowing for the use of symbolic links, or allowing the user to specify a directory where .metadata2.mmm should be created would solve my problem completely.
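
A small diagnostic that may help narrow this down (my suggestion, not from the report): pathlib follows symlinks when checking existence, so a False from exists() usually points at a dangling link rather than at symlinks being unsupported.

from pathlib import Path

p = Path('/path/to/link.bgen')  # placeholder path
print(p.is_symlink())  # True if the path itself is a link
print(p.exists())      # follows the link; False means the target is missing
print(p.resolve())     # the fully resolved target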

Thanks for making this software!

read_bgen ignores new file contents

(Thanks for looking at these.)

It seems like the library may be caching values globally by file name. If the file changes, it doesn't notice and returns the values from the old contents. (One might expect del to release all caches associated with a read_bgen, but it doesn't. Removing the metadata file doesn't help either.)

Consider this script:

import os
import shutil
from bgen_reader import read_bgen

filename1 = '/mnt/d/OneDrive/Shares/bgenreaderpy/abs_error1.bgen'
filename31 = '/mnt/d/OneDrive/Shares/bgenreaderpy/abs_error31.bgen'
filenameX = '/mnt/d/OneDrive/Shares/bgenreaderpy/abs_errorX.bgen'

rb1 = read_bgen(filename1,verbose=True)
print(os.path.getsize(filename1), rb1["genotype"][0].compute()['probs'][0])
del rb1

rb31 = read_bgen(filename31,verbose=True)
print(os.path.getsize(filename31),rb31["genotype"][0].compute()['probs'][0])
del rb31

shutil.copy(filename1,filenameX)
rbX = read_bgen(filenameX,verbose=True)
print(os.path.getsize(filenameX),rbX["genotype"][0].compute()['probs'][0])
del rbX

shutil.copy(filename31,filenameX)
rbX = read_bgen(filenameX,verbose=True)
print(os.path.getsize(filenameX),rbX["genotype"][0].compute()['probs'][0])
del rbX

os.remove(filenameX+'.metadata')
rbX = read_bgen(filenameX,verbose=True)
print(os.path.getsize(filenameX),rbX["genotype"][0].compute()['probs'][0])
del rbX

Which produces this output:

6189 [1. 0. 0.]
9545 [0.99346924 0.0027771  0.00375366]
We will create the metafile `/mnt/d/OneDrive/Shares/bgenreaderpy/abs_errorX.bgen.metadata`. This file will speed up further
reads and only need to be created once. So, please, bear with me.
6189 [1. 0. 0.]
File `/mnt/d/OneDrive/Shares/bgenreaderpy/abs_errorX.bgen` has been modified after the creation of `/mnt/d/OneDrive/Shares/bgenreaderpy/abs_errorX.bgen.metadata`.
We will therefore recreate the metadata file. So, please, bear with me.
9545 [1. 0. 0.]
We will create the metafile `/mnt/d/OneDrive/Shares/bgenreaderpy/abs_errorX.bgen.metadata`. This file will speed up further
reads and only need to be created once. So, please, bear with me.
9545 [1. 0. 0.]

But it should instead produce this:

6189 [1. 0. 0.]
9545 [0.99346924 0.0027771  0.00375366]
9545 [0.99346924 0.0027771  0.00375366]
9545 [0.99346924 0.0027771  0.00375366]

Segmentation fault (core dumped)

Hello,
I encounter the following error:
Segmentation fault (core dumped)

After running the following:

from bgen_reader import read_bgen

import ipdb; ipdb.set_trace()
filepath = "/data/genotypes/impute_21_interval.bgen"
samples_filepath = "/data/genotypes/interval.sample"
bgen = read_bgen(filepath, samples_filepath=samples_filepath, verbose=True)
print(bgen["variants"].head())

geno = bgen["genotype"][0].compute()

Could you help me, please?
Thanks,

Empty bgen file might cause division by zero.

An empty bgen file being one with no genotypes but with samples. Marc found the following exception with such a file:

Traceback (most recent call last):
  File "/hps/nobackup/stegle/users/galvari/src/hipsci_pipeline/limix_QTL_pipeline/run_QTL_analysis.py", line 813, in <module>
    randomeff_filename=randeff_file, sample_mapping_filename=samplemap_file, extended_anno_filename=extended_anno_file, regressCovariatesUpfront = regressBefore, debugger= debugger)
  File "/hps/nobackup/stegle/users/galvari/src/hipsci_pipeline/limix_QTL_pipeline/run_QTL_analysis.py", line 64, in run_QTL_analysis
    extended_anno_filename=extended_anno_filename, feature_variant_covariate_filename=feature_variant_covariate_filename)
  File "/hps/nobackup/stegle/users/galvari/src/hipsci_pipeline/limix_QTL_pipeline/qtl_utilities.py", line 90, in run_QTL_analysis_load_intersect_phenotype_covariates_kinship_sample_mapping
    bgen = read_bgen(geno_prefix+'.bgen', verbose=False)
  File "/opt/conda/lib/python3.7/site-packages/bgen_reader/_reader.py", line 81, in read_bgen
    create_metafile(filepath, metafile_filepath, verbose)
  File "/opt/conda/lib/python3.7/site-packages/bgen_reader/_metadata.py", line 57, in create_metafile
    nparts = _estimate_best_npartitions(lib.bgen_nvariants(bgen))
  File "/opt/conda/lib/python3.7/site-packages/bgen_reader/_metadata.py", line 69, in _estimate_best_npartitions
    return nvariants // m
ZeroDivisionError: integer division or modulo by zero
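
A sketch of the kind of guard that would avoid this (an assumption, not the project's actual fix; the divisor heuristic is illustrative):

def _estimate_best_npartitions(nvariants: int) -> int:
    # an empty file has nothing to partition; short-circuit before dividing
    if nvariants == 0:
        return 1
    m = min(nvariants, 256)  # illustrative chunking heuristic
    return nvariants // m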

GEN-reader?

Thanks for this great package! I tried a lot of tools to read bgen files from UK Biobank (including Hail), and yours just works beautifully with Dask. It saved me a lot of work!

Do you have a version of this but for GEN files? The Dask integration is what I'm looking for.

Just Python 3?

Is bgen-reader Python 3 only? (Fine if it is. Just wondering.)

(py2) carlk@kadie2:/mnt/d/OneDrive/programs/pysnptools/tests$ python
Python 2.7.16 |Anaconda, Inc.| (default, Sep 24 2019, 21:51:30)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from bgen_reader import read_bgen
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/carlk/anaconda3/envs/py2/lib/python2.7/site-packages/bgen_reader/__init__.py", line 18, in <module>
    e.msg = e.msg + msg
AttributeError: 'exceptions.ImportError' object has no attribute 'msg'
>>>

Add example for using batches if you want to accessing all the genome

I'm using the library to iterate over the entire genome. It's much faster to do this in batches rather than calling .compute() on each variant in turn; i.e., I now do

bgen['genotype'][x:(x+batch_size),:,0:2].compute()

rather than this slow iteration

for i in range(x, x+batch_size):
    bgen['genotype'][i,:,0:2].compute()

In fact, the .compute() call seems to take almost the same time regardless of whether it is getting 1 or 100 variants. It's obvious in hindsight, but it's probably worth saying so in the documentation, or even giving an example; see the sketch below. Of course, if the batch size is too high, you run out of memory, so you would have to recommend tweaking it for the machine you are running on. But I got at least a 100x speedup in my code by batching the .compute() requests in this way.
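
Something like the following could serve as that documentation example (the batch size is illustrative and machine-dependent):

# iterate the genome in batches; each .compute() pays a roughly fixed cost,
# so amortizing it over many variants gives a large speedup
batch_size = 100
nvariants = bgen['genotype'].shape[0]
for start in range(0, nvariants, batch_size):
    probs = bgen['genotype'][start:start + batch_size, :, 0:2].compute()
    # ... process the batch ...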

Error while creating metafile bgen.metadata

Hello,

I'm trying to read a bgen file but I get the following error:

WARNING: magic number mismatch
ERROR: unknown layout 1
Writing variants|                                                     | - eta - 

The total volume has not been fully consumed.
Traceback (most recent call last):
  File "bgen.py", line 11, in <module>
    bgen2 = open_bgen(filename, verbose=True)
  File "/home/username/.local/lib/python3.7/site-packages/bgen_reader/_bgen2.py", line 159, in __init__
    self._create_metadata2()
  File "/home/username/.local/lib/python3.7/site-packages/bgen_reader/_bgen2.py", line 171, in _create_metadata2
    self._extract_nalleles_ids_etc(mmm_wplus)
  File "/home/username/.local/lib/python3.7/site-packages/bgen_reader/_bgen2.py", line 1049, in _extract_nalleles_ids_etc
    self._cbgen.create_metafile(metafile_filepath, verbose=self._verbose)
  File "/home/username/.local/lib/python3.7/site-packages/cbgen/_bgen_file.py", line 154, in create_metafile
    raise RuntimeError(f"Error while creating metafile {filepath}.")
RuntimeError: Error while creating metafile bgen.metadata.

Any ideas on what might be happening?

Thank you!

Lots of issues with python 2.7

Is Python 2.7 no longer supported?

bgen_reader/_download.py has a handful of Python 3-only imports, and after making those work for 2.7, I get syntax errors from _files.py:61, which I assume is Python 3 syntax.

variant position is returning negative numbers

On bgen files with pos values larger than about 2 billion, the reader returns negative numbers. (Granted, this is larger than the largest human chromosome.)

  • Open roundtrip1.bgen in the reader.
  • Look at the last pos values and see that they are negative
  • list(bgen['variants']['pos'])[-5:]
  • To confirm that the values are positive in the bgen file, convert it to *.gen
  • /mnt/m/qctool/build/release/qctool_v2.0.7 -g roundtrip1.bgen -og roundtrip2.gen
  • Inspect the *.gen file and see that the pos values are positive. (The largest one is 3037630001).
  • (The spec says the variant position is an unsigned 32-bit int; see the sketch below.)
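
A pure-Python illustration of the wraparound (just the arithmetic, not the reader's internals):

# a position above 2**31 - 1 wraps negative when squeezed into a signed
# 32-bit integer
pos = 3037630001
print(pos - 2**32)  # -1257337295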

temp.zip

index out of range

Hello!

I am using the bgen-reader package to read in bgen files and analyse dosage information for certain cases. I have 40,101 people in the sample and 314,926 variants from chromosome X. I have noticed a strange issue. I followed the example on the website and ran this code:

data = read_bgen(bgen_file, samples_filepath = sample_file)
variants = data['variants'].compute()

The resulting data frame, variants, has 314,926 rows, but the row index only goes up to about 314,350. This means that about 600 rows are indexed by the same number, although they contain information on distinct variants (I checked this). I was hoping this was only a problem in the variants dataframe, and simply reset the index.

However, now I see it also affects genotypes. If I run data['genotype'][314604], I get back the output Delayed('314604'). But if I run data['genotype'][314604].compute(), I get an IndexError: list index out of range. It seems that the cause of the index problem is the same for both variants and genotypes, but I don't know how to fix it for genotype.

I will be grateful for any comment!

Thank you!

Fatima

Variant info performance seems slow as # of variants goes to 1M+

Thanks for creating bgen-reader. It is great! [This "issue" might be a "won't fix". That is fine. There are workarounds. - Carl]

I'm working with a group that needs to access bgen files with 1M+ variants. Some operations that could be almost instantaneous are taking 15 to 30 seconds (and as the number of variants grows, even longer).

Here is my test file: "1x1000000.bgen". It is 1 individual x 1,000,000 variants of random values. [It was created by generating a *.gen file and then using "qctool ... -bgen-bits 8 -bgen-compression zlib" to convert to *.bgen. Anyone can use this file for any purpose.]

Here are some examples with timings and comments.

In:

%%time
#/mnt/m/qctool/build/release/qctool_v2.0.7 -g /mnt/m/deldir/1x1000000.gen -s /mnt/m/deldir/1x1000000.sample -og /mnt/m/deldir/1x1000000.bgen -bgen-bits 8 -bgen-compression zlib
from bgen_reader import read_bgen
filename = r'm:\deldir\1x1000000.bgen'
bgen = read_bgen(filename,verbose=True)

Out:

We will create the metafile m:\deldir\1x1000000.bgen.metadata. This file will speed up further
reads and only need to be created once. So, please, bear with me.
Mapping variants: 100%|██████████████████████████████████████████████████| 1000000/1000000 [00:13<00:00, 73777.15it/s]
Wall time: 18.9 s

That was as expected.

In:

%%time
id_list = bgen['variants']['id']
len(id_list)

Out:

Wall time: 19.2 s
1000000

That is slower than expected. If the metadata file contained the ids as, e.g., a numpy memmap or an npz, returning 1M values would take almost no time. Asking for chrom takes about the same time.

In:

%%time
chrom_list = bgen['variants']['chrom']
len(chrom_list)

Out:

Wall time: 16.3 s
1000000

Happily, getting a genotype is nice and fast!

In:

%%time
bgen["genotype"][500000].compute()['probs']

Out:

Wall time: 140 ms
array([[0.65490196, 0.2 , 0.14509804]])

Finally, if I reopen the bgen file. It needs to map the variants again.

In:

%%time
del bgen
bgen = read_bgen(filename,verbose=True)

Out:

Mapping variants: 100%|██████████████████████████████████████████████████| 1000000/1000000 [00:14<00:00, 67933.42it/s]
Wall time: 15.4 s

My guess is that the variant information is not included in the metadata file. A fix would be to add it, which would allow bgen-reader to scale nicely to bigger and bigger files.
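
A sketch of that idea (an assumption about a possible design, not bgen-reader's actual metadata layout): persist the per-variant fields once, then memory-map them so reopening is near-instant.

import numpy as np

# one-time write of the ids (fixed-width bytes keep the file mappable)
ids = np.array([f'var{i}' for i in range(1_000_000)], dtype='S16')
np.save('ids.npy', ids)

# later sessions map the file instead of re-parsing a million variants
ids_mm = np.load('ids.npy', mmap_mode='r')
print(len(ids_mm), ids_mm[500_000])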

Carl

Library fails on *.bgen files larger than 2 GBytes (2**31 bytes)

Sorry for another one. Please let me know if you'd like me to work on proposing a fix.

The vaddr variable is an offset into a file and so should be a uint64. (It looks like that is what it is in the C bgen library.) On the Python side, in interface.h, it is
long vaddr; /* variant offset-address */
which on most machines is a int32.

I generated a 2,500 sample x 500,000 variant (bgen_bits=16) file of size 4.4 GB. When I try to access anything past 2 GB (2**31 bytes), it fails.

Carl

P.S. I'm happy to test any fixes. Or, if you want your own giant test file, I can give instructions on how to generate one with a dev version of PySnpTools.

Could not open genotype

Hi,
I'm trying to load a BGEN file in an IPython session using open_bgen, but it fails, returning two errors:

ERROR: `max_ploidy` cannot be zero
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-dc223f36040c> in <module>
----> 1 bgen = open_bgen('5c614288-7779-4013-a631-af36bbe743d6.bgen', allow_complex=False,)

/gpfs/milgram/scratch60/christakis/fb343/conda_envs/jupyter/lib/python3.8/site-packages/bgen_reader/_bgen2.py in __init__(self, filepath, samples_filepath, allow_complex, verbose)
    157             self._metadata2_path.unlink()
    158         if not self._metadata2_path.exists():
--> 159             self._create_metadata2()
    160
    161         self._metadata2_memmaps = MultiMemMap(self._metadata2_path, mode="r")

/gpfs/milgram/scratch60/christakis/fb343/conda_envs/jupyter/lib/python3.8/site-packages/bgen_reader/_bgen2.py in _create_metadata2(self)
    170         with MultiMemMap(metadata2_temp, mode="w+") as mmm_wplus:
    171             self._extract_nalleles_ids_etc(mmm_wplus)
--> 172             self._extract_ncombinations_etc(mmm_wplus)
    173             self._extract_samples_etc(mmm_wplus)
    174         os.rename(metadata2_temp, self._metadata2_path)

/gpfs/milgram/scratch60/christakis/fb343/conda_envs/jupyter/lib/python3.8/site-packages/bgen_reader/_bgen2.py in _extract_ncombinations_etc(self, mmm_wplus)
    303             previous_i = None
    304             while i < self.nvariants:
--> 305                 genotype = self._cbgen.read_genotype(mmm_wplus["vaddr"][i])
    306                 ncombinations_memmap[i] = genotype.probability.shape[-1]
    307                 phased_memmap[i] = genotype.phased

/gpfs/milgram/scratch60/christakis/fb343/conda_envs/jupyter/lib/python3.8/site-packages/cbgen/_bgen_file.py in read_genotype(self, offset, precision)
    178         gt: CData = lib.bgen_file_open_genotype(self._bgen_file, offset)
    179         if gt == ffi.NULL:
--> 180             raise RuntimeError(f"Could not open genotype (offset {offset}).")
    181
    182         if precision not in [64, 32]:

RuntimeError: Could not open genotype (offset 4685).

I'm using Python 3.8.5 with bgen_reader version 4.0.6

bgen-reader no longer working on windows

Suggested fix: change cbgen>=1.0.1 to cbgen==1.0.1

Repro:
conda create --yes --name mini38d python=3.8
conda activate mini38d
pip install bgen-reader

which fails.

Version 1.0.1 won't run unless you manually install everything

It appears that the latest version of bgen on conda is 0.1.16 (with no bgen_reader available there, despite the docs suggesting otherwise). The current version of bgen_reader expects libbgen 1.0.1; as a result, libbgen refuses to load.

The only way I was able to get this working without installing all of the pieces manually was to:

  • Install bgen via the conda package manager (which also installed all dependencies)
  • Have pip uninstall bgen_reader 1.0.1, then download version 0.1.16 and install that one

This is all from OSX, using the most recent versions of conda and Python 2.7.x.
