biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project

Home Page: http://biom-format.org

License: Other

Python 97.53% Shell 0.37% Makefile 0.09% Cython 1.86% Dockerfile 0.15%

biom-format's People

Contributors

adamrp, ake123, amethyst-asuka, antgonza, cleme, ebolyen, eldeveloper, gitter-badger, gregcaporaso, jairideout, joey711, jorge-c, josenavas, justin212k, midnighter, mlangill, mwhall, nsoranzo, peterjc, pieterprovoost, potter-s, qiyunzhu, rec3141, sfiligoi, squirrelo, stevendbrown, teravest, wasade, wdwvt1, wwydmanski


biom-format's Issues

SparseMat does not compile with GCC 4.7.1

See below

mcdonadt@hopper03:~/software/biom-format> gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/common/usg/python/2.7.1/include/python2.7 -c python-code/support-code/_sparsemat.cpp -o build/temp.linux-x86_64-2.7/python-code/support-code/_sparsemat.o -std=c++0x
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++ [enabled by default]
python-code/support-code/_sparsemat.cpp: In function 'PyObject* __pyx_pf_10_sparsemat_16PySparseMatFloat_6getRow(PyObject*, PyObject*, PyObject*)':
python-code/support-code/_sparsemat.cpp:1424:7: warning: variable '__pyx_v_r' set but not used [-Wunused-but-set-variable]
python-code/support-code/_sparsemat.cpp: In function 'PyObject* __pyx_pf_10_sparsemat_16PySparseMatFloat_7getCol(PyObject*, PyObject*, PyObject*)':
python-code/support-code/_sparsemat.cpp:1695:7: warning: variable '__pyx_v_c' set but not used [-Wunused-but-set-variable]
python-code/support-code/_sparsemat.cpp: In function 'PyObject* __pyx_pf_10_sparsemat_14PySparseMatInt_6getRow(PyObject*, PyObject*, PyObject*)':
python-code/support-code/_sparsemat.cpp:3399:7: warning: variable '__pyx_v_r' set but not used [-Wunused-but-set-variable]
python-code/support-code/_sparsemat.cpp: In function 'PyObject* __pyx_pf_10_sparsemat_14PySparseMatInt_7getCol(PyObject*, PyObject*, PyObject*)':
python-code/support-code/_sparsemat.cpp:3670:7: warning: variable '__pyx_v_c' set but not used [-Wunused-but-set-variable]
python-code/support-code/_sparsemat.cpp: In function 'void __Pyx_RaiseArgtupleInvalid(const char*, int, Py_ssize_t, Py_ssize_t, Py_ssize_t)':
python-code/support-code/_sparsemat.cpp:5164:95: error: unable to find string literal operator 'operator"" PY_FORMAT_SIZE_T'
python-code/support-code/_sparsemat.cpp: In function 'void __Pyx_RaiseNeedMoreValuesError(Py_ssize_t)':
python-code/support-code/_sparsemat.cpp:5424:52: error: unable to find string literal operator 'operator"" PY_FORMAT_SIZE_T'
python-code/support-code/_sparsemat.cpp: In function 'void __Pyx_RaiseTooManyValuesError(Py_ssize_t)':
python-code/support-code/_sparsemat.cpp:5430:73: error: unable to find string literal operator 'operator"" PY_FORMAT_SIZE_T'

Should we warn if using SparseDict?

It has come up a few times now that users are not installing BIOM as expected, leading to performance issues when working with large sparse datasets. Should a warning be raised when SparseDict is in use?
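One way to surface this is to warn at backend-selection time. A minimal sketch, assuming the Cython extension module is named `_sparsemat` (the function name `select_backend` is hypothetical, not the biom API):

```python
import warnings

def select_backend():
    """Pick the sparse backend, warning if we fall back to SparseDict."""
    try:
        # the compiled Cython extension backing SparseMat (name assumed)
        import _sparsemat  # noqa: F401
        return 'SparseMat'
    except ImportError:
        warnings.warn(
            "Cython SparseMat backend could not be imported; falling back "
            "to the pure-Python SparseDict, which is much slower and more "
            "memory-hungry on large tables.",
            RuntimeWarning)
        return 'SparseDict'
```

A `RuntimeWarning` is visible by default but can be silenced with the standard `warnings` filters, so it would not break scripted pipelines.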

add generation_params field

Requested by Rob. The direct use case within QIIME is storing the QIIME parameters, but it is useful in other contexts as well. The field can be nullable, but it must be present for a table to be valid.

This was previously in QIIME Trac as ticket #148, created by Daniel McDonald.
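A hypothetical sketch of how the field might sit in the top-level JSON; the `generation_params` contents shown here are invented for illustration, and only the surrounding keys come from the existing format:

```json
{
  "id": null,
  "format": "Biological Observation Matrix 0.9",
  "generation_params": {
    "generated_by": "QIIME",
    "params": {"pick_otus:similarity": "0.97"}
  }
}
```

A table with no recorded parameters would carry `"generation_params": null`, satisfying the nullable-but-required constraint above.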

metadata incorrectly parsed in biom to txt conversion for EC table

Details about the error are in an email from Jeff Werner to the QIIME forum on 22/05/2012 (Subj: Funny error in convert_biom.py).

In convert_biom.py we do:

if biom_to_classic_table:
    try:
        output_f.write(convert_biom_to_table(
            input_f, header_key, output_metadata_id))

and then in convert_biom_to_table:

if md_format is None:
    md_format = lambda x: '; '.join(x)

which is later used in

return table.delimitedSelf(header_key=header_key,
                           header_value=header_value,
                           metadata_formatter=md_format)

which was probably done with OTU tables in mind, but it is breaking here.
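The root cause is that `'; '.join(x)` assumes `x` is a list of strings; applied to a plain string it joins the individual characters. A defensive formatter along these lines would handle both cases (hypothetical helper, not the biom API):

```python
def safe_md_format(md):
    """Format a metadata value for a classic-table column.

    Lists are joined with '; '; strings (e.g. a pre-joined taxonomy,
    or EC annotations) pass through untouched instead of being split
    into characters.
    """
    if isinstance(md, str):
        return md
    return '; '.join(md)
```

For example, `safe_md_format(['Bacteria', 'Proteobacteria'])` gives `'Bacteria; Proteobacteria'`, while an already-joined string comes back unchanged.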

Error raised when pickling a biom table

import pickle
from biom.parse import parse_biom_table
biom_table = parse_biom_table(open('otu_table.biom', 'U'))
output = open('data_biom.pkl', 'wb')
pickle.dump(biom_table, output)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1370, in dump
    Pickler(file, protocol).dump(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 663, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 615, in _batch_appends
    save(x)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 401, in save_reduce
    save(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 748, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <function <lambda> at 0x101465de8>: it's not found as biom.table.<lambda>
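The failing object here is a lambda stored on the table: pickle serializes functions by qualified name, and `<lambda>` cannot be looked up again at load time. A minimal sketch of the failure and the usual fix (replacing the lambda with a module-level function; `join_metadata` is an illustrative name, not biom code):

```python
import pickle

# A lambda cannot be pickled: pickle stores functions by reference to
# an importable qualified name, which '<lambda>' does not have.
fmt = lambda x: '; '.join(x)
try:
    pickle.dumps(fmt)
    lambda_picklable = True
except Exception:
    lambda_picklable = False

# A module-level function pickles fine, because its name is importable.
def join_metadata(x):
    return '; '.join(x)

payload = pickle.dumps(join_metadata)
```

So one fix for this issue would be moving any lambdas held as table state into named functions in biom.table.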

R: Include rbiom checks and unit tests in biom-format tests

I haven't looked yet at how these get wrapped up. I imagine there's a nice testing script somewhere that just needs one or two lines added to it. The R-code directory contains several .sh or .R scripts that just need to be called.

These checks do require that you have R installed on your system, and also that certain packages are installed. I've tried to make the additional package installation automatic as part of these tests, but post a complaint here right away if that doesn't work.

SparseObj is SparseDict by default, but should be SparseMat

In biom-format/python-code/biom/__init__.py, lines 53 and 54:

if backend is None:
    backend = 'SparseDict'

this should be

if backend is None:
    backend = 'SparseMat'

otherwise we run into the memory problems we've been having in the past. For instance, single_rarefaction.py with an input biom file of ~173M requires over 512GB of memory to run. After modifying this to SparseMat, memory seems to stabilize at only ~4GB.

And by the way, is there a way to post an issue linking directly to the source code? For instance, you navigate the source repo, then make the issue point directly to the line you are interested in (like in this case).

update install instructions

  • should show alternative to using sudo
  • should differentiate the directory that the code will be in for development versus release installs
  • should generalize the result of calling which convert_biom.py

conversion of .txt to .biom to .txt with taxonomy formats taxonomy incorrectly

We need to add a new option to convert_biom.py to define the function that should be applied to format the taxonomy strings before writing them to file. They are currently treated as a list of taxonomy assignments (QIIME's default handling), so a plain taxonomy string gets written as follows:

B; a; c; t; e; r; i; a; ;; P; r; o; t; e; o; b; a; c; t; e; r; i; a; ;; A; l; p; h; a; p; r; o; t; e; o; b; a; c; t; e; r; i; a; ;; S; p; h; i; n; g; o; m; o; n; a; d; a; l; e; s; ;; S; p; h; i; n; g; o; m; o; n; a; d; a; c; e; a; e; ;; S; p; h; i; n; g; o; b; i; u; m

Sparse representation benchmarking: relational database

Need to benchmark using a relational database (RDBMS) as the backend/underlying sparse representation. This technique seems to be most promising for extremely large tables. If this approach proves to be useful, it would be nice to support many types of databases (e.g. MySQL, PostgreSQL, etc.).

Sparse representation benchmarking: scipy.sparse

Need to benchmark using Scipy's sparse module (scipy.sparse), which provides several different sparse matrix representations. Each one has strengths and weaknesses, so we would need to explore converting between different representations under the hood, depending on the table size, density, and the operations that need to be performed on it.

Useful links:

http://www.scipy.org/SciPyPackages/Sparse
http://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse
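To make the representation trade-offs concrete, here is a pure-Python sketch (no scipy dependency; `dok_to_csr` is a hypothetical helper, not part of biom) converting a dict-of-keys matrix, the SparseDict-style layout, into the three CSR arrays that make row slicing cheap:

```python
def dok_to_csr(dok, shape):
    """Convert {(row, col): value} to CSR arrays (indptr, indices, data).

    CSR stores all nonzeros row by row: row r's entries live at
    data[indptr[r]:indptr[r + 1]], with column ids in indices.
    """
    n_rows, _ = shape
    indptr = [0]
    indices = []
    data = []
    for r in range(n_rows):
        # collect this row's entries in column order
        for c, v in sorted((c, v) for (rr, c), v in dok.items() if rr == r):
            indices.append(c)
            data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data
```

For example, `dok_to_csr({(0, 2): 5, (1, 0): 1, (3, 1): 2}, (4, 3))` yields `([0, 1, 2, 2, 3], [2, 0, 1], [5, 1, 2])`; the repeated `2` in `indptr` marks the empty row.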

Improve load and write mechanisms

The JSON decoder and encoder are heavyweight, as they need to work on data without prior knowledge of the types. The following prototype code allows for faster loading of BIOM tables and cuts the peak memory usage for getting a table in memory in half. A similar mechanism would likely work for writing tables to improve performance (I don't remember if the memory bloat is there), although the writer may be I/O bound; testing would be necessary.

The following is proof-of-concept code. Please bounce questions off of Daniel McDonald as necessary.

def light_biom_parse(s):
    # gut the data: splice a placeholder in for the real "data" entry
    start_idx = s.find('"data":') + 8
    end_idx = s[start_idx:].find(']]') + start_idx
    data = s[start_idx:end_idx]
    new_s = s[:start_idx]
    new_s += '[[0, 0, 1]]'
    new_s += s[(end_idx + 2):]

    # get the shape
    start_idx = s.find('"shape":') + 10
    end_idx = s[start_idx:start_idx + 30].find('],') + start_idx
    row, col = map(int, s[start_idx:end_idx].replace('[', '').split(', '))

    # parse the gutted table, then refill the data directly
    biom_table = parse_biom_table(new_s)
    biom_table._data = SparseMat(row, col)
    for rec in data.replace('[', '').split('], '):
        r, c, count = map(int, rec.split(', '))
        biom_table._data[r, c] = count
    return biom_table

def light_biom_formatter(obj, generated_by):
    # gut the data: temporarily swap in a 1x1 placeholder matrix
    data = obj._data
    obj._data = data.__class__(1, 1)
    obj._data[0, 0] = 1

    # convert the real data to a string
    newdata = []
    for (r, c), v in data.items():
        newdata.append('[%d, %d, %d]' % (r, c, v))
    newdata = "[%s]" % ', '.join(newdata)

    biom_str = obj.getBiomFormatString(generated_by)

    # insert the real data in place of the placeholder
    start_idx = biom_str.find('"data":') + 8
    end_idx = biom_str[start_idx:].find(']]') + start_idx
    new_s = biom_str[:start_idx]
    new_s += newdata
    new_s += biom_str[(end_idx + 2):]

    # update the shape to match the real data
    start_idx = new_s.find('"shape":') + 10
    end_idx = new_s[start_idx:start_idx + 30].find('],') + start_idx
    newshape = '%d, %d' % data.shape
    final_s = new_s[:start_idx]
    final_s += newshape
    final_s += new_s[end_idx:]

    return final_s

Incorporating observation relationships into the BIOM format

As a possible extension/addition to the current BIOM format, it would be nice to have some way to store relationship data for observations in a table.

For example, a phylogenetic tree showing evolutionary relationships among OTUs (the observations) in a sample x OTU table could be stored in Newick format in a BIOM file. Another example would be gene networks (DAGs) stored in eNewick format for a table containing metagenomes x genes.

A nested JSON structure might be appropriate for storing this data. Here's my first cut at what this would look like:

{
  "observation_relationships": [
    {"type": "phylogenetic tree", "format": "newick", "representation": "some newick string..."},
    {"type": "gene network", "format": "enewick", "representation": "some enewick string..."},
    ...
  ]
}

The "type" and "format" values would be a controlled vocabulary, and this would allow us to support an arbitrary number of relationships among observations, as well as store them in different supported formats (e.g. maybe we want to support more than just Newick format for trees, etc.). This nested JSON structure would reside in the top-level portion of the BIOM format (e.g. along with "id", "format", etc.) and would be completely optional.

I'd like to solicit some input on this proposed format. Specifically,

  1. Does the proposed structure make sense? What changes need to be made to it?
  2. What additional types of relationships should we support?
  3. What additional types of formats should we support?

Thanks for your input!

Specify sparse matrix representation at runtime

We need an easy way to specify the underlying sparse matrix representation to be used at runtime (e.g. be able to specify that the database backend is used instead of the pure-python dictionary structure). Currently, the code tries to import the sparse matrix cython code, and if this fails, it uses the pure-python dictionary structure.

This will benefit us in two ways: first, it will be easy to test out different sparse structures for benchmarking purposes, and second, we'll be able to test each sparse structure using the unit tests that are already in place to ensure they are working correctly.
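One common shape for this is a registry plus an environment-variable override, so both test harnesses and end users can pick a backend without touching code. A sketch under assumed names (`_BACKENDS`, `BIOM_SPARSE_BACKEND`, and `get_backend` are all illustrative, not existing biom API):

```python
import os

# Hypothetical registry mapping backend names to classes; each real
# backend (SparseMat, SparseDict, a database-backed one, ...) would
# register itself here at import time.
_BACKENDS = {}

def register_backend(name, cls):
    _BACKENDS[name] = cls

def get_backend(name=None):
    """Resolve a sparse backend: explicit argument > env var > default."""
    name = name or os.environ.get('BIOM_SPARSE_BACKEND') or 'SparseDict'
    try:
        return _BACKENDS[name]
    except KeyError:
        raise ValueError('unknown sparse backend: %r' % name)

class SparseDict(dict):
    """Stand-in for the pure-Python backend."""

register_backend('SparseDict', SparseDict)
```

With this, running the existing unit tests under each backend is just a loop over `_BACKENDS`, or an exported `BIOM_SPARSE_BACKEND` before invoking the suite.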

Add wiki for this project

An admin for this repo should click the button that instantiates the GitHub wiki for this repo. Then we can start posting documentation to it. Writing the Markdown pages is really fast and easy.

can't pickle PySparseMatInt objects

import pickle
from biom.parse import parse_biom_table
biom_table = parse_biom_table(open('/Users/antoniog/svn_programs/qiime/examples/qiime_tutorial/otus/otu_table.biom', 'U'))
output = open('data_biom.pkl', 'wb')
biom_table.SampleMetadata = None
biom_table.ObservationMetadata = None
pickle.dump(biom_table, output)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1370, in dump
    Pickler(file, protocol).dump(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 663, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 725, in save_inst
    save(stuff)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 663, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle PySparseMatInt objects

Need easy way to benchmark sparse representations

We need an easy way to benchmark various types of underlying sparse matrix representations in order to be consistent in our evaluations. It would be nice to be able to specify the size and density of the matrix and have one created, and I think it would be good to also have some real tables to test on as well.
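A small generator along these lines would cover the "specify size and density" half of this request; `random_sparse_entries` is a hypothetical helper sketched here, not existing biom code. Using a fixed seed keeps the generated matrix identical across backends, so timings are directly comparable:

```python
import random

def random_sparse_entries(n_rows, n_cols, density, seed=0):
    """Return (row, col, value) triples for a reproducible random matrix.

    density is the fraction of entries that are nonzero; a fixed seed
    makes every backend see exactly the same table.
    """
    rng = random.Random(seed)
    n_nonzero = int(round(n_rows * n_cols * density))
    coords = set()
    # rejection-sample distinct coordinates until we hit the target count
    while len(coords) < n_nonzero:
        coords.add((rng.randrange(n_rows), rng.randrange(n_cols)))
    return [(r, c, rng.randint(1, 100)) for r, c in sorted(coords)]
```

Feeding the same triples into each candidate backend, plus a handful of real tables for realistic row/column density skew, would give a consistent benchmark corpus.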

Add ability to include phylogenetic tree

This feature request is not R-specific, but a format feature.

Should not be necessary to re-implement phylogenetic tree formats, just borrow one (or more) standard(s) and define how it will be embedded in the .biom file.

This would be extremely useful for downstream tools that would also care about the structure of the tree.

This should be considered an enhancement to the current format.

Add --classic_table_style to convert_biom.py

Currently convert_biom.py does not handle --header_key correctly. This ticket has two parts:

  • md_format is not being passed correctly to Table.delimitedSelf when converting from biom to a classic table with a column for metadata
  • supporting generalized formats here is difficult; the easy solution is a --classic_table_style option such that --classic_table_style=qiime would add a consensus taxonomy column

Address CSMat limitations

CSMat currently does not support empty (i.e. all zeros) matrices, nor empty rows/columns if in CSR/CSC format. A ValueError or index out of bounds error is thrown inside convert (or the private methods convert calls).

Make Table object immutable

The current Table objects are mutable, though many of the methods return new Table objects with the necessary modifications. After discussion with Daniel, we think it'll be best (performance-wise and for clarity) to make Table objects immutable. This will cut down on the bookkeeping required in the sparse backends which should yield performance gains.
