biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project

Home Page: http://biom-format.org

License: Other

Python 97.53% Shell 0.37% Makefile 0.09% Cython 1.86% Dockerfile 0.15%

biom-format's People

Contributors

adamrp, ake123, amethyst-asuka, antgonza, cleme, ebolyen, eldeveloper, gitter-badger, gregcaporaso, jairideout, joey711, jorge-c, josenavas, justin212k, midnighter, mlangill, mwhall, nsoranzo, peterjc, pieterprovoost, potter-s, qiyunzhu, rec3141, sfiligoi, squirrelo, stevendbrown, teravest, wasade, wdwvt1, wwydmanski


biom-format's Issues

SparseMat does not compile with GCC 4.7.1

See below

mcdonadt@hopper03:~/software/biom-format> gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/usr/common/usg/python/2.7.1/include/python2.7 -c python-code/support-code/_sparsemat.cpp -o build/temp.linux-x86_64-2.7/python-code/support-code/_sparsemat.o -std=c++0x
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++ [enabled by default]
python-code/support-code/_sparsemat.cpp: In function 'PyObject* __pyx_pf_10_sparsemat_16PySparseMatFloat_6getRow(PyObject*, PyObject*, PyObject*)':
python-code/support-code/_sparsemat.cpp:1424:7: warning: variable '__pyx_v_r' set but not used [-Wunused-but-set-variable]
python-code/support-code/_sparsemat.cpp: In function 'PyObject* __pyx_pf_10_sparsemat_16PySparseMatFloat_7getCol(PyObject*, PyObject*, PyObject*)':
python-code/support-code/_sparsemat.cpp:1695:7: warning: variable '__pyx_v_c' set but not used [-Wunused-but-set-variable]
python-code/support-code/_sparsemat.cpp: In function 'PyObject* __pyx_pf_10_sparsemat_14PySparseMatInt_6getRow(PyObject*, PyObject*, PyObject*)':
python-code/support-code/_sparsemat.cpp:3399:7: warning: variable '__pyx_v_r' set but not used [-Wunused-but-set-variable]
python-code/support-code/_sparsemat.cpp: In function 'PyObject* __pyx_pf_10_sparsemat_14PySparseMatInt_7getCol(PyObject*, PyObject*, PyObject*)':
python-code/support-code/_sparsemat.cpp:3670:7: warning: variable '__pyx_v_c' set but not used [-Wunused-but-set-variable]
python-code/support-code/_sparsemat.cpp: In function 'void __Pyx_RaiseArgtupleInvalid(const char*, int, Py_ssize_t, Py_ssize_t, Py_ssize_t)':
python-code/support-code/_sparsemat.cpp:5164:95: error: unable to find string literal operator 'operator"" PY_FORMAT_SIZE_T'
python-code/support-code/_sparsemat.cpp: In function 'void __Pyx_RaiseNeedMoreValuesError(Py_ssize_t)':
python-code/support-code/_sparsemat.cpp:5424:52: error: unable to find string literal operator 'operator"" PY_FORMAT_SIZE_T'
python-code/support-code/_sparsemat.cpp: In function 'void __Pyx_RaiseTooManyValuesError(Py_ssize_t)':
python-code/support-code/_sparsemat.cpp:5430:73: error: unable to find string literal operator 'operator"" PY_FORMAT_SIZE_T'

Should we warn if using SparseDict?

It has come up a few times now that users are not installing BIOM as expected, leading to performance issues when working with large sparse datasets. Should a warning be raised when SparseDict is in use?
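One way to surface this is to warn at backend-selection time. A minimal sketch, assuming the Cython extension module is named `_sparsemat` (the function name `select_backend` is hypothetical, not the biom API):

```python
import warnings

def select_backend():
    """Pick the sparse backend, warning if we fall back to SparseDict."""
    try:
        # the compiled Cython extension backing SparseMat (name assumed)
        import _sparsemat  # noqa: F401
        return 'SparseMat'
    except ImportError:
        warnings.warn(
            "Cython SparseMat backend could not be imported; falling back "
            "to the pure-Python SparseDict, which is much slower and more "
            "memory-hungry on large tables.",
            RuntimeWarning)
        return 'SparseDict'
```

A `RuntimeWarning` is visible by default but can be silenced with the standard `warnings` filters, so it would not break scripted pipelines.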

add generation_params field

Requested by Rob. The direct use case within QIIME is storing the QIIME parameters, but it is useful in other contexts as well. The field can be nullable, but it must be present for a table to be valid.

This was previously in QIIME Trac as ticket #148, created by Daniel McDonald.
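A hypothetical sketch of how the field might sit in the top-level JSON; the `generation_params` contents shown here are invented for illustration, and only the surrounding keys come from the existing format:

```json
{
  "id": null,
  "format": "Biological Observation Matrix 0.9",
  "generation_params": {
    "generated_by": "QIIME",
    "params": {"pick_otus:similarity": "0.97"}
  }
}
```

A table with no recorded parameters would carry `"generation_params": null`, satisfying the nullable-but-required constraint above.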

metadata incorrectly parsed in biom to txt conversion for EC table

Details about the error are in an email from Jeff Werner to the QIIME forum on 22/05/2012 (Subj: Funny error in convert_biom.py).

In convert_biom.py we do:

if biom_to_classic_table:
    try:
        output_f.write(convert_biom_to_table(
            input_f, header_key, output_metadata_id))

and then in convert_biom_to_table:

if md_format is None:
    md_format = lambda x: '; '.join(x)

which is later used in

return table.delimitedSelf(header_key=header_key,
                           header_value=header_value,
                           metadata_formatter=md_format)

which was probably done with OTU tables in mind, but it is breaking here.
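The root cause is that `'; '.join(x)` assumes `x` is a list of strings; applied to a plain string it joins the individual characters. A defensive formatter along these lines would handle both cases (hypothetical helper, not the biom API):

```python
def safe_md_format(md):
    """Format a metadata value for a classic-table column.

    Lists are joined with '; '; strings (e.g. a pre-joined taxonomy,
    or EC annotations) pass through untouched instead of being split
    into characters.
    """
    if isinstance(md, str):
        return md
    return '; '.join(md)
```

For example, `safe_md_format(['Bacteria', 'Proteobacteria'])` gives `'Bacteria; Proteobacteria'`, while an already-joined string comes back unchanged.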

Error raised when pickling a biom table

import pickle
from biom.parse import parse_biom_table
biom_table = parse_biom_table(open('otu_table.biom', 'U'))
output = open('data_biom.pkl', 'wb')
pickle.dump(biom_table, output)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1370, in dump
    Pickler(file, protocol).dump(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 663, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 615, in _batch_appends
    save(x)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 401, in save_reduce
    save(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 748, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <function <lambda> at 0x101465de8>: it's not found as biom.table.<lambda>
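The failing object here is a lambda stored on the table: pickle serializes functions by qualified name, and `<lambda>` cannot be looked up again at load time. A minimal sketch of the failure and the usual fix (replacing the lambda with a module-level function; `join_metadata` is an illustrative name, not biom code):

```python
import pickle

# A lambda cannot be pickled: pickle stores functions by reference to
# an importable qualified name, which '<lambda>' does not have.
fmt = lambda x: '; '.join(x)
try:
    pickle.dumps(fmt)
    lambda_picklable = True
except Exception:
    lambda_picklable = False

# A module-level function pickles fine, because its name is importable.
def join_metadata(x):
    return '; '.join(x)

payload = pickle.dumps(join_metadata)
```

So one fix for this issue would be moving any lambdas held as table state into named functions in biom.table.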

R: Include rbiom checks and unit tests in biom-format tests

I haven't looked yet at how these get wrapped up. I imagine there's a nice testing script somewhere that just needs one or two lines added to it. The R-code directory contains several .sh or .R scripts that just need to be called.

These checks do require that you have R installed on your system, and also that certain packages are installed. I've tried to make the additional package installation automatic as part of these tests, but post a complaint here right away if that doesn't work.

SparseObj is SparseDict by default, but should be SparseMat

In biom-format/python-code/biom/__init__.py, lines 53 and 54:

if backend is None:
    backend = 'SparseDict'

this should be

if backend is None:
    backend = 'SparseMat'

otherwise we run into the memory problems we've been having in the past. For instance, single_rarefaction.py with an input biom file of ~173M requires over 512GB of memory to run. After modifying this to SparseMat, memory seems to stabilize at only ~4GB.

And by the way, is there a way to post an issue linking directly to the source code? For instance, you navigate the source repo, then make the issue point directly to the line you are interested in (like in this case).

update install instructions

  • should show alternative to using sudo
  • should differentiate the directory that the code will be in for development versus release installs
  • should generalize the result of calling which convert_biom.py

conversion of .txt to .biom to .txt with taxonomy formats taxonomy incorrectly

We need to add a new option to convert_biom.py to define the function that should be applied to format the taxonomy strings before writing them to file. They are currently treated as a list of taxonomy assignments (QIIME's default handling), so a plain taxonomy string gets written as follows:

B; a; c; t; e; r; i; a; ;; P; r; o; t; e; o; b; a; c; t; e; r; i; a; ;; A; l; p; h; a; p; r; o; t; e; o; b; a; c; t; e; r; i; a; ;; S; p; h; i; n; g; o; m; o; n; a; d; a; l; e; s; ;; S; p; h; i; n; g; o; m; o; n; a; d; a; c; e; a; e; ;; S; p; h; i; n; g; o; b; i; u; m

Sparse representation benchmarking: relational database

Need to benchmark using a relational database (RDBMS) as the backend/underlying sparse representation. This technique seems to be most promising for extremely large tables. If this approach proves to be useful, it would be nice to support many types of databases (e.g. MySQL, PostgreSQL, etc.).

Sparse representation benchmarking: scipy.sparse

Need to benchmark using Scipy's sparse module (scipy.sparse), which provides several different sparse matrix representations. Each one has strengths and weaknesses, so we would need to explore converting between different representations under the hood, depending on the table size, density, and the operations that need to be performed on it.

Useful links:

http://www.scipy.org/SciPyPackages/Sparse
http://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse
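To make the representation trade-offs concrete, here is a pure-Python sketch (no scipy dependency; `dok_to_csr` is a hypothetical helper, not part of biom) converting a dict-of-keys matrix, the SparseDict-style layout, into the three CSR arrays that make row slicing cheap:

```python
def dok_to_csr(dok, shape):
    """Convert {(row, col): value} to CSR arrays (indptr, indices, data).

    CSR stores all nonzeros row by row: row r's entries live at
    data[indptr[r]:indptr[r + 1]], with column ids in indices.
    """
    n_rows, _ = shape
    indptr = [0]
    indices = []
    data = []
    for r in range(n_rows):
        # collect this row's entries in column order
        for c, v in sorted((c, v) for (rr, c), v in dok.items() if rr == r):
            indices.append(c)
            data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data
```

For example, `dok_to_csr({(0, 2): 5, (1, 0): 1, (3, 1): 2}, (4, 3))` yields `([0, 1, 2, 2, 3], [2, 0, 1], [5, 1, 2])`; the repeated `2` in `indptr` marks the empty row.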

Improve load and write mechanisms

The JSON decoder and encoder are heavyweight, as they need to work on data without prior knowledge of the types. The following prototype code allows for faster loading of BIOM tables and cuts the peak memory usage for getting a table in memory in half. A similar mechanism would likely work for writing tables to improve performance (I don't remember if the memory bloat is there), although the writer may be I/O bound; testing would be necessary.

The following is proof-of-concept code. Please bounce questions off of Daniel McDonald as necessary.

def light_biom_parse(s):
    # gut the data: splice a placeholder in for the real "data" entry
    start_idx = s.find('"data":') + 8
    end_idx = s[start_idx:].find(']]') + start_idx
    data = s[start_idx:end_idx]
    new_s = s[:start_idx]
    new_s += '[[0, 0, 1]]'
    new_s += s[(end_idx + 2):]

    # get the shape
    start_idx = s.find('"shape":') + 10
    end_idx = s[start_idx:start_idx + 30].find('],') + start_idx
    row, col = map(int, s[start_idx:end_idx].replace('[', '').split(', '))

    # parse the gutted table, then refill the data directly
    biom_table = parse_biom_table(new_s)
    biom_table._data = SparseMat(row, col)
    for rec in data.replace('[', '').split('], '):
        r, c, count = map(int, rec.split(', '))
        biom_table._data[r, c] = count
    return biom_table

def light_biom_formatter(obj, generated_by):
    # gut the data: temporarily swap in a 1x1 placeholder matrix
    data = obj._data
    obj._data = data.__class__(1, 1)
    obj._data[0, 0] = 1

    # convert the real data to a string
    newdata = []
    for (r, c), v in data.items():
        newdata.append('[%d, %d, %d]' % (r, c, v))
    newdata = "[%s]" % ', '.join(newdata)

    biom_str = obj.getBiomFormatString(generated_by)

    # insert the real data in place of the placeholder
    start_idx = biom_str.find('"data":') + 8
    end_idx = biom_str[start_idx:].find(']]') + start_idx
    new_s = biom_str[:start_idx]
    new_s += newdata
    new_s += biom_str[(end_idx + 2):]

    # update the shape to match the real data
    start_idx = new_s.find('"shape":') + 10
    end_idx = new_s[start_idx:start_idx + 30].find('],') + start_idx
    newshape = '%d, %d' % data.shape
    final_s = new_s[:start_idx]
    final_s += newshape
    final_s += new_s[end_idx:]

    return final_s

Incorporating observation relationships into the BIOM format

As a possible extension/addition to the current BIOM format, it would be nice to have some way to store relationship data for observations in a table.

For example, a phylogenetic tree showing evolutionary relationships among OTUs (the observations) in a sample x OTU table could be stored in Newick format in a BIOM file. Another example would be gene networks (DAGs) stored in eNewick format for a table containing metagenomes x genes.

A nested JSON structure might be appropriate for storing this data. Here's my first cut at what this would look like:

{
  "observation_relationships": [
    {"type": "phylogenetic tree", "format": "newick", "representation": "some newick string..."},
    {"type": "gene network", "format": "enewick", "representation": "some enewick string..."},
    ...
  ]
}

The "type" and "format" values would be a controlled vocabulary, and this would allow us to support an arbitrary number of relationships among observations, as well as store them in different supported formats (e.g. maybe we want to support more than just Newick format for trees, etc.). This nested JSON structure would reside in the top-level portion of the BIOM format (e.g. along with "id", "format", etc.) and would be completely optional.

I'd like to solicit some input on this proposed format. Specifically,

  1. Does the proposed structure make sense? What changes need to be made to it?
  2. What additional types of relationships should we support?
  3. What additional types of formats should we support?

Thanks for your input!

Specify sparse matrix representation at runtime

We need an easy way to specify the underlying sparse matrix representation to be used at runtime (e.g. be able to specify that the database backend is used instead of the pure-python dictionary structure). Currently, the code tries to import the sparse matrix cython code, and if this fails, it uses the pure-python dictionary structure.

This will benefit us in two ways: first, it will be easy to test out different sparse structures for benchmarking purposes, and second, we'll be able to test each sparse structure using the unit tests that are already in place to ensure they are working correctly.
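One common shape for this is a registry plus an environment-variable override, so both test harnesses and end users can pick a backend without touching code. A sketch under assumed names (`_BACKENDS`, `BIOM_SPARSE_BACKEND`, and `get_backend` are all illustrative, not existing biom API):

```python
import os

# Hypothetical registry mapping backend names to classes; each real
# backend (SparseMat, SparseDict, a database-backed one, ...) would
# register itself here at import time.
_BACKENDS = {}

def register_backend(name, cls):
    _BACKENDS[name] = cls

def get_backend(name=None):
    """Resolve a sparse backend: explicit argument > env var > default."""
    name = name or os.environ.get('BIOM_SPARSE_BACKEND') or 'SparseDict'
    try:
        return _BACKENDS[name]
    except KeyError:
        raise ValueError('unknown sparse backend: %r' % name)

class SparseDict(dict):
    """Stand-in for the pure-Python backend."""

register_backend('SparseDict', SparseDict)
```

With this, running the existing unit tests under each backend is just a loop over `_BACKENDS`, or an exported `BIOM_SPARSE_BACKEND` before invoking the suite.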

Add wiki for this project

An admin for this repo should click the button that instantiates the GitHub wiki for this repo. Then we can start posting documentation to it. Writing the Markdown pages is really fast and easy.

can't pickle PySparseMatInt objects

import pickle
from biom.parse import parse_biom_table
biom_table = parse_biom_table(open('/Users/antoniog/svn_programs/qiime/examples/qiime_tutorial/otus/otu_table.biom', 'U'))
output = open('data_biom.pkl', 'wb')
biom_table.SampleMetadata = None
biom_table.ObservationMetadata = None
pickle.dump(biom_table, output)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1370, in dump
    Pickler(file, protocol).dump(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 419, in save_reduce
    save(state)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 663, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 725, in save_inst
    save(stuff)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 663, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle PySparseMatInt objects

Need easy way to benchmark sparse representations

We need an easy way to benchmark various types of underlying sparse matrix representations in order to be consistent in our evaluations. It would be nice to be able to specify the size and density of the matrix and have one created, and I think it would be good to also have some real tables to test on as well.
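A small generator along these lines would cover the "specify size and density" half of this request; `random_sparse_entries` is a hypothetical helper sketched here, not existing biom code. Using a fixed seed keeps the generated matrix identical across backends, so timings are directly comparable:

```python
import random

def random_sparse_entries(n_rows, n_cols, density, seed=0):
    """Return (row, col, value) triples for a reproducible random matrix.

    density is the fraction of entries that are nonzero; a fixed seed
    makes every backend see exactly the same table.
    """
    rng = random.Random(seed)
    n_nonzero = int(round(n_rows * n_cols * density))
    coords = set()
    # rejection-sample distinct coordinates until we hit the target count
    while len(coords) < n_nonzero:
        coords.add((rng.randrange(n_rows), rng.randrange(n_cols)))
    return [(r, c, rng.randint(1, 100)) for r, c in sorted(coords)]
```

Feeding the same triples into each candidate backend, plus a handful of real tables for realistic row/column density skew, would give a consistent benchmark corpus.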

Add ability to include phylogenetic tree

This feature request is not R-specific, but a format feature.

Should not be necessary to re-implement phylogenetic tree formats, just borrow one (or more) standard(s) and define how it will be embedded in the .biom file.

This would be extremely useful for downstream tools that would also care about the structure of the tree.

This should be considered an enhancement to the current format.

Add --classic_table_style to convert_biom.py

Currently convert_biom.py does not handle --header_key correctly. This ticket has two parts:

  • md_format is not being passed correctly to Table.delimitedSelf when converting from biom to a classic table with a column for metadata
  • supporting generalized formats here is difficult; the easy solution is a --classic_table_style option such that --classic_table_style=qiime would add a consensus taxonomy column

Address CSMat limitations

CSMat currently does not support empty (i.e. all zeros) matrices, nor empty rows/columns if in CSR/CSC format. A ValueError or index out of bounds error is thrown inside convert (or the private methods convert calls).

Make Table object immutable

The current Table objects are mutable, though many of the methods return new Table objects with the necessary modifications. After discussion with Daniel, we think it'll be best (performance-wise and for clarity) to make Table objects immutable. This will cut down on the bookkeeping required in the sparse backends which should yield performance gains.
