
blosc / bcolz

959 stars · 62 watchers · 149 forks · 3.54 MB

A columnar data container that can be compressed.

Home Page: http://bcolz.blosc.org

Languages: C 59.83%, Python 12.07%, Shell 0.02%, CMake 0.69%, Makefile 0.22%, C++ 2.07%, Batchfile 0.12%, PowerShell 0.06%, Jupyter Notebook 24.56%, Objective-C 0.29%, Starlark 0.08%
Topics: column-store, python, compressed-data

bcolz's Introduction

Unmaintained Package Notice

Unfortunately, due to lack of resources, the Blosc Development Team is unable to maintain this package anymore. During the last 10 years we managed to find resources (even if in a quite irregular way) to develop what we think is a nice package for handling compressed data containers, especially tabular data. Regrettably, in recent years we have not found enough sponsorship to continue the maintenance of this package.

For those that depend on bcolz, a fork is welcome, and we will try our best to provide advice to possible new maintainers. Indeed, if we manage to get some decent grants via Blosc (https://blosc.org/pages/donate/), our umbrella project, we would be glad to reconsider the maintenance of bcolz. But again, we would be very open and supportive of this project getting a new maintenance team.

Finally, thanks to all the people that used and contributed in one way or another to bcolz; it has been a nice ride! Let's hope it still has a bright future ahead.

The Blosc Development Team

bcolz: columnar and compressed data containers


bcolz provides columnar, chunked data containers that can be compressed either in-memory or on-disk. Column storage allows for efficient querying of tables, as well as cheap column addition and removal. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects, but it also comes with support for import/export facilities to/from HDF5/PyTables tables and pandas dataframes.

bcolz objects are compressed by default, not only to reduce memory/disk storage, but also to improve I/O speed. The compression process is carried out internally by Blosc, a high-performance, multithreaded meta-compressor that is optimized for binary data (although it works just fine with text data too).

bcolz can also use numexpr internally (it does so by default if it detects numexpr installed) or dask to accelerate many vector and query operations (although it can use pure NumPy for this too). numexpr/dask can optimize memory usage and use multithreading for the computations, so it is blazing fast. This, in combination with carray/ctable disk-based, compressed containers, can be used for performing out-of-core computations efficiently and, most importantly, transparently.
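For instance, here is a minimal sketch of an accelerated query (the carray contents are made up for illustration; bcolz.eval picks up the carrays from the caller's namespace and uses numexpr under the hood when it is available):

import numpy as np
import bcolz

a = bcolz.carray(np.arange(1000000))
b = bcolz.carray(np.linspace(0.0, 1.0, 1000000))

# Evaluated block by block over the compressed containers; the result
# is a (compressed) carray of booleans
result = bcolz.eval("a * 2 + b > 100")
print(repr(result))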

Just to whet your appetite, here is an example with real data, where bcolz is already fulfilling the promise of accelerating memory I/O by using compression:

http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb

Rationale

By using compression, you can deal with more data using the same amount of memory, which is very good in itself. But in case you are wondering about the price to pay in terms of performance, you should know that nowadays memory access is the most common bottleneck in many computational scenarios, and that CPUs spend most of their time waiting for data. Hence, having data compressed in memory can reduce the stress on the memory subsystem as well.

Furthermore, columnar means that tabular datasets are stored in column-wise order, which turns out to offer better opportunities to improve the compression ratio. This is because data tends to expose more similarity among elements that sit in the same column than among those in the same row, so compressors generally do a much better job when data is aligned in column-wise order. In addition, when you have to deal with tables with a large number of columns and your operations only involve some of them, columnar storage tends to be much more effective because it minimizes the amount of data that travels to the CPU caches.

So, the ultimate goal for bcolz is not only to reduce the memory needs of large arrays/tables, but also to make bcolz operations go faster than with a traditional data container like those in NumPy or Pandas. That is actually already the case in some real-life scenarios (see the notebook above), and it will become even more noticeable in combination with forthcoming, faster CPUs integrating more cores and wider vector units.

Requisites

  • Python >= 2.7 or >= 3.5
  • NumPy >= 1.8
  • Cython >= 0.22 (just for compiling the beast)
  • C-Blosc >= 1.8.0 (optional, as the internal Blosc will be used by default)

Optional:

  • numexpr >= 2.5.2
  • dask >= 0.9.0
  • pandas
  • tables (pytables)

Building

There are different ways to compile bcolz, depending on whether you want to link with an already installed Blosc library or not.

Compiling with an installed Blosc library (recommended)

Python and Blosc-powered extensions have a difficult relationship when compiled using GCC, which is why using an external C-Blosc library is recommended for maximum performance (for details, see Blosc/python-blosc#110).

Go to https://github.com/Blosc/c-blosc/releases and download and install the C-Blosc library. Then, you can tell bcolz where the C-Blosc library is in a couple of ways:

Using an environment variable:

$ BLOSC_DIR=/usr/local     (or "set BLOSC_DIR=\blosc" on Win)
$ export BLOSC_DIR         (not needed on Win)
$ python setup.py build_ext --inplace

Using a flag:

$ python setup.py build_ext --inplace --blosc=/usr/local

Compiling without an installed Blosc library

bcolz also ships with the Blosc sources, so, assuming that you have a C++ compiler installed, just do:

$ python setup.py build_ext --inplace

That's all. You can proceed to the Testing section now.

Note: the requirement for the C++ compiler is just for the Snappy dependency. The rest of the Blosc components are pure C (including the LZ4 and Zlib libraries).

Testing

After compiling, you can quickly check that the package is sane by running:

$ PYTHONPATH=.   (or "set PYTHONPATH=." on Windows)
$ export PYTHONPATH    (not needed on Windows)
$ python -c"import bcolz; bcolz.test()"  # add `heavy=True` if desired

Installing

Install it as a typical Python package:

$ pip install -U .

Optionally, install the additional dependencies:

$ pip install .[optional]

Documentation

You can find the online manual at:

http://bcolz.blosc.org

but of course, you can always access docstrings from the console (e.g. help(bcolz.ctable)).

Also, you may want to look at the bench/ directory for some examples of use.

Resources

Visit the main bcolz site repository at: http://github.com/Blosc/bcolz

Home of Blosc compressor: http://blosc.org

User's mailing list: http://groups.google.com/group/bcolz

An introductory talk (20 min) about bcolz at EuroPython 2014. Slides here.

License

Please see BCOLZ.txt in LICENSES/ directory.

Share your experience

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy Data!

bcolz's People

Contributors

alimanfoo, apalepu23, bgrant, bingyao, brentp, carstvaartjes, catawbasam, cpcloud, dirkbike, edwardbetts, esc, felixendres, francescalted, francescelies, gitter-badger, grahamc, jseabold, kdm9, mindw, mrocklin, msarahan, sdvillal, ssanderson, thequackdaddy, yang-zhang


bcolz's Issues

option to make user_dict the only namespace

Feature suggestion: an option for eval() to make user_dict the only namespace where variables are looked up.

Reason: there might be situations, such as a server-based computation engine, where execution of untrusted code might not be desirable. Currently the code within the expression string can access all variables of the local python execution context.

Example:

import carray as ca

a = ca.carray([4,5,6,7,8,9,10,11,24,35])
ca.eval("a < b", user_dict={"b":7}, vm="numexpr")

ca.eval() uses user_dict in addition to local context variables. There should be a way to override this and make such code raise an exception for the undefined a.

Python eval() is not safe in general, as it cannot be easily sandboxed. However, if it were possible to limit which variables are passed to numexpr, that would provide a sufficient alternative.
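For what it's worth, a sketch of a workaround with today's pieces (calling numexpr directly, not carray API): numexpr's evaluate() accepts explicit local_dict/global_dict arguments, so an untrusted expression can be evaluated against a fixed namespace only:

import numexpr as ne
import numpy as np

a = np.array([4, 5, 6, 7, 8, 9, 10, 11, 24, 35])

# Only names present in local_dict/global_dict can be resolved; leaving
# "a" out of both dicts would raise a KeyError instead of silently
# picking it up from the calling frame.
mask = ne.evaluate("a < b", local_dict={"a": a, "b": 7}, global_dict={})
print(mask)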

ctable addcol() doesn't account for list input

After creating a ctable:

In [6]: ct = ca.ctable((np.array([1,2,3]),np.array([4,5,6])), ('a', 'b'))

Adding a new column as a list fails:

In [7]: ct.addcol([7,8,9], 'c')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in addcol(self, newcol, name, pos, **kwargs)
    323         self.cols[name] = newcol
    324         # Update _arr1

--> 325         self._arr1 = np.empty(shape=(1,), dtype=self.dtype)
    326 
    327 

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in dtype(self)
     55         "The data type of this ctable (numpy dtype)."
     56         names, cols = self.names, self.cols
---> 57         l = [(name, cols[name].dtype) for name in names]
     58         return np.dtype(l)
     59 

AttributeError: 'list' object has no attribute 'dtype'

Consequently breaking the ctable:

In [10]: ct[0]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __getitem__(self, key)
    603             ra = self._arr1.copy()
    604             # Fill it

--> 605             ra[0] = tuple([self.cols[name][key] for name in self.names])
    606             return ra[0]
    607         # Slices


ValueError: size of tuple must match number of fields.
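Until lists are handled, a sketch of a workaround (reusing the ct from the session above) is to wrap the list in a NumPy array, so the new column carries a dtype:

import numpy as np

ct.addcol(np.array([7, 8, 9]), 'c')  # an ndarray has .dtype, so addcol works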

ctable where selection with "in" clause

Hi!

Not an issue but more of an enhancement discussion; something I'm looking at with @fran-xeco, and we are wondering what we should do.

The where clause works fine in the tutorial examples, but when you have a case with a large "in" selection you run into issues. When you want to select on 1,000 or 50,000 values there's a problem: with a limited number of values you can chain "or" conditions in the where clause, but that does not seem great for performance, and with a larger number it runs into errors because of the maximum nesting depth.
So we checked NumPy & pandas:

Numpy in1d
NumPy has the in1d function, but its performance is not really great in our tests (and I don't see how I could apply it easily to bcolz ;)
http://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html

Pandas isin()
Pandas solves it itself by cython-looping through the series one by one with a "value in set" check, generating a boolean mask:
https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L124

So if you had a clause like a == 1 and b == 2 and c in [3, 4, ..., 5000], you could do something like

result = [row for row in btable_example.where('a==1 and b==2') if row['c'] in (3, 4, ..., 5000)]

but then probably cythonized to make it fast. Ideally, where should also accept lists and handle them internally, so you would just say:

result = [row for row in btable_example.where('a == 1 and b == 2 and c in [3, 4, ..., 5000]')]

But that does mean that internally it needs to parse the string and see what numexpr will handle and what will be done otherwise.
I'm a bit clueless as to whether numexpr can be applied for the first two filters and a cython filter for the third one (would this break the vectorization of the numexpr selection?)

Can you give some ideas of what would be the best solution direction and how we could implement this?
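For reference, a sketch of the two-stage idea in pure Python (btable_example and the column names follow the example above; the '&' form is used because numexpr does not understand the 'and' keyword):

wanted = set([3, 4, 5000])  # stand-in for the full list of values

# Stage 1: numexpr-backed filter on the cheap comparisons; stage 2: a
# per-row set-membership check on the survivors only. where() yields
# namedtuple-like rows, hence the attribute access.
result = [row for row in btable_example.where('(a == 1) & (b == 2)')
          if row.c in wanted]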

Intended behavior of `carray.__next__`?

What is the intended behavior of carray.__next__?

In [1]: import bcolz

In [2]: b = bcolz.carray([1, 2, 3])

In [3]: next(b)
StopIteration: 

I would expect either a TypeError (because __next__ isn't implemented) or 1.

nosetest incompatible

I can't run the bcolz tests properly with nose:

zsh» nosetests
........................................................................................................................................................................................................ssssss...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 873 tests in 4.037s

OK (skipped=6)
.....................................................................................................................................................................................................................................SSSSSS...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 874 tests in 8.473s

OK (SKIP=6)
nosetests  7.05s user 1.66s system 100% cpu 8.630 total

Consider adding a sort function for carrays.

Sorting carrays will be a bit tricky, since random access and inserts are more expensive than sequential ones. I would suggest using the classical combination of mergesort and insertion sort, i.e. insertion sort at the chunk (or maybe even better, block) level and chunk/block-wise merge sort beyond that. Also, due to the random access issues mentioned above, an out-of-place mergesort will probably be better than an in-place one. It might also be worth looking at Knuth's TAOCP Volume 3; IIRC there is a section about sorting algorithms for data that is stored on tape, and I figure we may have a similar situation here.

Additionally the algorithm outlined above should work on out-of-core carrays too.
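A minimal sketch of that chunk-wise scheme (np.sort standing in for the in-chunk sort, heapq.merge for the k-way merge; a real implementation would stream the merged output to disk in blocks instead of materializing it):

import heapq
import numpy as np
import bcolz

def external_sort(carr):
    n = carr.chunklen
    # Stage 1: sort each chunk independently (sequential reads only)
    runs = [np.sort(carr[i:i + n]) for i in range(0, len(carr), n)]
    # Stage 2: k-way merge of the sorted runs
    merged = np.fromiter(heapq.merge(*runs), dtype=carr.dtype, count=len(carr))
    return bcolz.carray(merged)

c = bcolz.carray(np.random.randint(0, 100, size=100000))
print(repr(external_sort(c)))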

Row level types (numpy.ndarray 0d)

Hi,

While running some tests I stumbled upon something unexpected to me: a bcolz ctable containing different data types for items that should have the same data type (code copied and modified from the bcolz tests).
Some objects are of type numpy.ndarray, while all objects inside the same column should have the same data type.
Is this issue known? If yes, is this a bcolz or a numpy problem, or am I misunderstanding something?

I tried in different machines and all of them had the same issue.

Could this issue be related to #21, or are they completely independent?

Can somebody else reproduce this issue (code below)? A summary of the output object types is given in the comments.

import numpy as np
import bcolz
from collections import defaultdict

def check_types(col):
    check = defaultdict(int)
    for i in range(len(col)):
        val = col[i]
        check[str(type(val))] += 1
    return dict(check)

if __name__ == '__main__':
    ra = np.fromiter(((i, i * 2., i * 3)
                      for i in range(500000)), dtype='i4,f8,i8')
    t = bcolz.ctable(ra)

    print(check_types(t['f0']))
    # {"<type 'numpy.ndarray'>": 401407, "<type 'int'>": 98593}
    print(check_types(t['f1']))
    # {"<type 'float'>": 37153, "<type 'numpy.ndarray'>": 462847}
    print(check_types(t['f2']))
    # {"<type 'numpy.ndarray'>": 462847, "<type 'int'>": 37153}

Thank you guys for the great work you are doing in this project.

bad import in __init__ in version 0.5

Hi

In the __init__ of the 0.5 version of the module (installed by pip today) there is this line:
import carray.test as test
but the interpreter (ipython 0.12.1 in my case) is complaining because there is no such carray.test module to import.

Commenting out the line is enough to fix this issue.

BColz.ctable.where / __iter__ cause unexpected behavior

Using ctable.where sets state that affects future operations, causing some confusing issues. It took me quite a while to track this down.

In [1]: import bcolz

In [2]: bc = bcolz.ctable([[1, 2, 3], [10, 20, 30]], names=['a', 'b'])

In [3]: bc.where('a >= 2')  # call .where but don't do anything with it
Out[3]: <itertools.imap at 0x7fd7a84f5750>

In [4]: list(bc['b'])  # Later iterate over table, get where result
Out[4]: [20, 30]

eval out_flavor 'numpy' error for multidim carrays

c = ca.carray([[1,2,3],[4,5,6],[7,8,9]])
ca.eval('c>5', out_flavor='numpy')

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/local/lib/python2.6/dist-packages/carray/toplevel.py", line 502, in eval
 **kwargs)
 File "/usr/local/lib/python2.6/dist-packages/carray/toplevel.py", line 575, in _eval_blocks
   result[:bsize] = res_block
ValueError: output operand requires a reduction, but reduction is not enabled

Compression appears to not work on all arrays

One of my interests in bcolz is the excellent compression benefit, and I was getting on great in testing while I was only using test arrays of zeros, ones or arange. But the first time I created a carray from my actual data (a dense million-element array of uint8 values, i.e. none higher than 255), the compression literally stops. I don't understand. In this first test, a, which is created with bcolz's own arange and dtype="uint8", and b, which is numpy's arange, both achieve 55.99 compression, but an array created with randint achieves no compression.

In the next test you can see some compression if I pass the array in as int64 and dtype-cast it to uint8, but the result is only enough compression to bring the total size down to what a uint8 array of this nature would be anyway.

I am very confused about whether this is a bug or what I am not understanding.
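(For reference, a quick sketch that reproduces the effect; exact byte counts will vary by machine and bcolz version. Uniformly random uint8 values carry no structure for Blosc to exploit, whereas arange is highly predictable:)

import numpy as np
import bcolz

regular = bcolz.carray(np.arange(1000000, dtype="uint8"))
random = bcolz.carray(np.random.randint(0, 256, 1000000).astype("uint8"))

# The repr shows nbytes, cbytes and the compression ratio
print(repr(regular))  # high ratio: data is predictable
print(repr(random))   # ratio near (or below) 1: random data is incompressible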

Reading a HDF5 table generated by Pandas

Hi!

When I'm trying to import an HDF5 file that was created by Pandas as an appendable table, I get an error:

In [20]: import pandas as pd
In [21]: import bcolz
In [22]: x = pd.HDFStore('/srv/hdf5/dev_fact_rv_standard.h5', mode='r')
In [23]: fact_df = x['/store_0']
In [24]: x.close()
In [25]: fact_bcolz = bcolz.ctable.fromhdf5('/srv/hdf5/dev_fact_rv_standard.h5', nodepath='/store_0', rootdir='/srv/hdf5/dev_fact_rv_standard.bcolz', mode='w')
---------------------------------------------------------------------------
NoSuchNodeError                           Traceback (most recent call last)
<ipython-input-25-164a62ef0752> in <module>()
----> 1 fact_bcolz = bcolz.ctable.fromhdf5('/srv/hdf5/dev_fact_rv_standard.h5', nodepath='/store_0', rootdir='/srv/hdf5/dev_fact_rv_standard.bcolz', mode='w')

/srv/python/venv/local/lib/python2.7/site-packages/bcolz/ctable.pyc in fromhdf5(filepath, nodepath, **kwargs)
    669             names = kwargs.pop('names')
    670         else:
--> 671             names = t.colnames
    672         # Collect metadata
    673         dtypes = [dt[0] for dt in t.dtype.fields.values()]

/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in __getattr__(self, name)
    809             self._g_add_children_names()
    810             return mydict[name]
--> 811         return self._f_get_child(name)
    812 
    813     def __setattr__(self, name, value):

/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in _f_get_child(self, childname)
    679         self._g_check_open()
    680 
--> 681         self._g_check_has_child(childname)
    682 
    683         childpath = join_path(self._v_pathname, childname)

/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in _g_check_has_child(self, name)
    403             raise NoSuchNodeError(
    404                 "group ``%s`` does not have a child named ``%s``"
--> 405                 % (self._v_pathname, name))
    406         return node_type
    407 

NoSuchNodeError: group ``/store_0`` does not have a child named ``colnames``

The description of the HDFStore:

<class 'pandas.io.pytables.HDFStore'>
File path: /srv/hdf5/dev_fact_rv_standard.h5
/store_0            frame_table  (typ->appendable,nrows->451612,ncols->91,indexers->[index],dc->[r1,r2,r3,r4,r5,r6])

Am I doing something wrong / Is there any way to solve this?
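A possible workaround (a sketch, assuming the frame fits in memory): a pandas frame_table node is not a plain PyTables Table, so instead of fromhdf5, read the frame back with pandas and build the ctable with ctable.fromdataframe:

import pandas as pd
import bcolz

df = pd.read_hdf('/srv/hdf5/dev_fact_rv_standard.h5', 'store_0')
fact_bcolz = bcolz.ctable.fromdataframe(
    df, rootdir='/srv/hdf5/dev_fact_rv_standard.bcolz', mode='w')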

`wheretrue()` for multidimensional carray

Does not work completely:

In [23]: c = ca.carray([True]*9).reshape((3,3))
In [24]: c[1,2] = False
In [26]: list(c.wheretrue())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/carrayExtension.so in carray.carrayExtension.carray.wheretrue (carray/carrayExtension.c:12949)()

ValueError: `self` is not an array of booleans

Will bcolz support numpy 1.9?

We are interested in including bcolz in the upcoming Anaconda release, which will use numpy 1.9. Do you have plans to support this? I hope so, since it's a very useful package. Thanks!

ctable construction doesn't account for list input

This works:

In [4]: ct = ca.ctable((np.array([1,2,3]),np.array([4,5,6])), ('a', 'b'))

This doesn't:

In [5]: ct = ca.ctable(([1,2,3],[4,5,6]), ('a', 'b'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
    144             raise ValueError, "`cols` input is not supported"
    145         if not (calist or nalist or ratype):
--> 146             raise ValueError, "`cols` input is not supported"
    147 
    148         # The compression parameters


ValueError: `cols` input is not supported

Issue with indexing multidimensional carrays

NumPy:
In [5]: a = np.ones((27,27,729), dtype=int)
In [6]: a[0,0,0]
Out[6]: 1

Carray:
In [1]: c = ca.ones((27,27,729), dtype=int)
In [2]: c[0,0,0]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)

 /usr/local/lib/python2.6/dist-packages/carray/carrayExtension.so in  carray.carrayExtension.carray.__getitem__ (carray/carrayExtension.c:8872)()

TypeError: 'int' object is unsubscriptable

In [3]: c[0]
Out[3]: 1

That is, c[0] doesn't return an array for index 0 either; it collapses to a scalar.

Test failure with MKL linked numpy - small precision difference

When numpy is linked with MKL and numexpr isn't, one of the tests fails:

FAIL: test11 (carray.tests.test_carray.eval_big_ne)
Testing eval() with functions like np.sin()

The arrays are actually very close, so using assert_array_almost_equal with 14 decimal places allows it to pass:
diff --git a/carray/tests/test_carray.py b/carray/tests/test_carray.py
index 19fbcf9..d6ca9a1 100644
--- a/carray/tests/test_carray.py
+++ b/carray/tests/test_carray.py
@@ -1096,7 +1096,7 @@ class evalTest(unittest.TestCase):
nr = np.sin(a) + 2 * np.log(b) - 3
#print "ca.eval ->", cr
#print "numpy ->", nr
- assert_array_equal(cr[:], nr, "eval does not work correctly")
+ assert_array_almost_equal(cr[:], nr, 14, "eval does not work correctly")

     def test12(self):
         """Testing eval() with `out_flavor` == 'numpy'"""

running diff on carray returns short arrays

When calculating an array derivative (diff), carray shortens the array:

import carray as ca
import numpy as np

carr = ca.arange(1000000)
diff_arr = ca.eval("np.diff(carr)", vm="python")
nd_arr = np.diff(carr)

print "Number of elements in c:", len(carr)
print "Number of elements in diff_arr:", len(diff_arr)
print "Number of elements in nd_arr:", len(np_arr)

This returns on my computer:

Number of elements in carr: 1000000
Number of elements in diff_arr: 999877
Number of elements in nd_arr: 999999

Derivatives calculated with ndarray and carray have different lengths.

Reverse indexing on chunks might be bust

In [10]: chunk = c.chunks[-2]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-2988c8c546fe> in <module>()
----> 1 chunk = c.chunks[-2]

/home/esc/gw/bcolz/bcolz/carray_ext.so in bcolz.carray_ext.chunks.__getitem__ (bcolz/carray_ext.c:10336)()

/home/esc/gw/bcolz/bcolz/carray_ext.so in bcolz.carray_ext.chunks.read_chunk (bcolz/carray_ext.c:9884)()

ValueError: chunkfile /tmp/foo/data/__-2.blp not found
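In the meantime, a sketch of a workaround is to translate the negative index by hand using the chunk count (nchunks is an attribute of carray):

chunk = c.chunks[c.nchunks - 2]  # positive-index equivalent of [-2]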

Large Multi-Dimensional Arrays how to create entirely on disk?

I successfully created a trillion-element bcolz array on disk. Fantastic. The compression and the disk-backed storage make a 2-terabyte array take 13.7 GB on disk. But the problem is, I want a multidimensional array, and I can't reshape it to (1000000, 1000000) using .reshape because that does it all in RAM; non-starter. Then I tried to create the shape from the start using .zeros((1000000, 1000000)) or .ones, and again it just does it in memory. Is there a way to create it entirely on disk? If not, can you change this behavior?
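One pattern that stays on disk (a sketch; bcolz containers grow along the leading axis, and 'big.bcolz' is just an illustrative rootdir): create the container with a zero-length leading dimension and append blocks of rows:

import numpy as np
import bcolz

n_cols = 1000000
c = bcolz.zeros((0, n_cols), dtype='uint8', rootdir='big.bcolz', mode='w')
block = np.zeros((64, n_cols), dtype='uint8')  # one block of 64 rows
for _ in range(4):  # append as many blocks as needed
    c.append(block)
c.flush()  # persist pending chunks and metadata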

Fancy indexing fails on leading axis of a carray

When selecting slices from a carray, fancy indexing works as expected for all axes that are not the leading axis. A fancy index applied to the leading axis triggers a TypeError. I could not find the error message in the bcolz codebase, so maybe 0.7.1 fixed it?

bcolz version: 0.7.0

(Apologies for missing In/Out prompts.)

Build a carray:

import bcolz as bz
a = np.random.randint(0, 3, (4, 3))
c = bz.carray(a)
c
carray((4, 3), int64)
nbytes: 96; cbytes: 15.98 KB; ratio: 0.01
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[1 2 2]
[1 1 2]
[0 0 0]
[2 2 0]]

This works:

c[:, [0, 2]]
array([[1, 2],
[1, 2],
[0, 0],
[2, 0]])

This does not:

c[[0, 2], :]
Traceback (most recent call last):
File "<pyshell#15>", line 1, in
c[[0, 2], :]
File "carray_ext.pyx", line 1879, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21106)
File "carray_ext.pyx", line 1897, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21344)
File "carray_ext.pyx", line 1913, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21618)
TypeError: object of type 'numpy.int64' has no len()

Update Bloscpack

Currently bcolz uses parts of an older version of Bloscpack for historical reasons.

repr() does not work well for bidimensional arrays

This is a case where it does not work:

In [55]: carray.arange(1010).reshape((101,10))
Out[55]:
carray((101, 10), int64) nbytes: 7.89 KB; cbytes: 15.94 KB; ratio: 0.50
cparams := cparams(clevel=5, shuffle=True)
[[1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], ..., [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009]]

which is clearly wrong.

Poor shaping after append

In [1]: import bcolz

In [2]: b = bcolz.ctable([[1, 2, 3], [1., 2., 3.]])

In [3]: b
Out[3]: 
ctable((3,), [('f0', '<i8'), ('f1', '<f8')])
  nbytes: 48; cbytes: 32.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(1, 1.0) (2, 2.0) (3, 3.0)]

In [4]: b.append([[4, 5, 6], [4., 5., 6.]])

In [5]: b
Out[5]: 
ctable((4,), [('f0', '<i8'), ('f1', '<f8')])
  nbytes: 96; cbytes: 32.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(3, 3.0) (4, 4.0) (5, 5.0) (6, 6.0)]

In [6]: b.shape
Out[6]: (4,)

In [7]: b[5]
Out[7]: (6, 6.0)

Can't create large multidimensional carray

With Numpy I can do something like this:

foo = np.zeros([ 2 ] * 20)

And get an ndarray with the corresponding shape. I can then:

ac = ca.carray(foo)

To get a carray object. Great. But I'm playing around with carray because I want to use array sizes that are larger than could otherwise fit in memory, and [2] * 20 is an easy shape for Numpy to handle, so for me it's a baseline of sorts.

Looking to explore the capabilities of carray, I try to create the object directly, without the intermediate Numpy step:

ac = ca.zeros([2] * 20)

But I get an error:

~/python/lib/python2.7/site-packages/carray/toplevel.pyc in zeros(shape, dtype, **kwargs)
    291     """
    292     dtype = np.dtype(dtype)
--> 293     return fill(shape=shape, dflt=np.zeros((), dtype), dtype=dtype, **kwargs)
    294
    295 def ones(shape, dtype=np.float, **kwargs):

~/python/lib/python2.7/site-packages/carray/toplevel.pyc in fill(shape, dflt, dtype, **kwargs)
    256     # Then fill it
    257     # We need an array for the defaults so as to keep the atom info
--> 258     dflt = np.array(obj.dflt, dtype=dtype)
    259     # Making strides=(0,) below is a trick to create the array fast and
    260     # without memory consumption

Which leads me to wonder: is something like this possible using carray?

Where fails on unicode text

In [1]: import bcolz

In [2]: b = bcolz.ctable([['a', 'b', 'c']], dtype='U4', names=['text'])

In [3]: b.where('text == "b"')
ValueError: unkown type unicode128

Although this may be a numexpr issue.

``repr`` of ``chunk`` class is broken

bcolz port of:
ContinuumIO/blz#11

In [5]: import bcolz

In [6]: b = bcolz.arange(1e8)

In [7]: b.chunks[0]
Out[7]: <repr(<bcolz.carray_ext.chunk at 0x7f4dff38d578>) failed: AttributeError: 'bcolz.carray_ext.chunk' object has no attribute 'shape'>

I had a look at the code and I could probably fix it and submit a PR, but I don't understand the intention. What should shape be?

def __repr__(self):
    """Represent the chunk as a string, with additional info."""
    cratio = self.nbytes / float(self.cbytes)
    fullrepr = "chunk(%s, %s)  nbytes: %d; cbytes: %d; ratio: %.2f\n%r" % \
        (self.shape, self.dtype, self.nbytes, self.cbytes, cratio, str(self))
    return fullrepr

ctable construction doesn't account for single column input

In [11]: ct = ca.ctable(np.array([1,3]), 'a')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
    125                     raise ValueError, "cannot convert `names` into a list"
    126             if len(names) != len(cols):
--> 127                 raise ValueError, "`cols` and `names` must have the same length"
    128         # Check name validity

    129         nt = namedtuple('_nt', names, verbose=False)

ValueError: `cols` and `names` must have the same length

Related?

In [13]: ct = ca.ctable(np.array([1,3]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
    115         if names is None:
    116             if isinstance(cols, np.ndarray):  # ratype case
--> 117                 names = list(cols.dtype.names)
    118             else:
    119                 names = ["f%d"%i for i in range(len(cols))]

TypeError: 'NoneType' object is not iterable
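A sketch of a possible workaround (an untested assumption on my part): pass the column inside a list, and the name inside a list too, so the length check compares one name against one column:

ct = ca.ctable([np.array([1, 3])], names=['a'])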

Bug in chunk.pointer (and maybe more)

There is a miscalculation in the chunk.pointer code:

    @property
    def pointer(self):
        return <Py_uintptr_t> self.data + BLOSCPACK_HEADER_LENGTH

Let me demonstrate what is wrong. First let's set the scene:

In [1]: import bcolz

In [2]: a = np.arange(100000)

In [3]: c = bcolz.carray(a)

In [4]: chunk = c.chunks[0]

In [5]: chunk
Out[5]: 
chunk(int64)  nbytes: 262144; cbytes: 4720; ratio: 55.54
'[    0     1     2 ..., 32765 32766 32767]'

We now have a chunk in memory. Let's try to use other means to decompress it:

In [6]: import ctypes

In [7]: comp = ctypes.string_at(chunk.pointer, 4720)

In [8]: import blosc

In [9]: dcmp = blosc.decompress(comp)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-9-8f8c24dd6b78> in <module>()
----> 1 dcmp = blosc.decompress(comp)

/home/esc/anaconda/lib/python2.7/site-packages/blosc/toplevel.pyc in decompress(bytesobj)
    440     _check_bytesobj(bytesobj)
    441 
--> 442     return _ext.decompress(bytesobj)
    443 
    444 

error: Error 4720 : not a Blosc buffer or header info is corrupted

I have used the unmodified pointer and a length value of 4720 as indicated by the __repr__ of chunk.

My suspicion is that the pointer should not be offset by BLOSCPACK_HEADER_LENGTH which is still 16 since it uses an older version of bloscpack, so let's try that:

In [10]: comp = ctypes.string_at(chunk.pointer-16, 4720)

In [11]: dcmp = blosc.decompress(comp)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-11-8f8c24dd6b78> in <module>()
----> 1 dcmp = blosc.decompress(comp)

/home/esc/anaconda/lib/python2.7/site-packages/blosc/toplevel.pyc in decompress(bytesobj)
    440     _check_bytesobj(bytesobj)
    441 
--> 442     return _ext.decompress(bytesobj)
    443 
    444 

error: Error 4720 : not a Blosc buffer or header info is corrupted

Okay so that doesn't work either. Let's use bloscpack to discover the contents of the blosc header, if possible:

In [12]: import bloscpack

In [13]: bloscpack.headers.decode_blosc_header(comp)
Out[13]: 
{'blocksize': 32768,
 'ctbytes': 4592,
 'flags': 1,
 'nbytes': 262144,
 'typesize': 8,
 'version': 2,
 'versionlz': 1}

Okay great, so we need to subtract 128 from the length (I'll explain why later):

In [14]: comp = ctypes.string_at(chunk.pointer-16, 4592)

In [15]: dcmp = blosc.decompress(comp)

In [16]: dcmp[:8]
Out[16]: '\x00\x00\x00\x00\x00\x00\x00\x00'

In [17]: dcmp[:9]
Out[17]: '\x00\x00\x00\x00\x00\x00\x00\x00\x01'

And as desired we can recover the first 8 byte 0 and all the rest of course too.

So, one thing is the chunk.pointer, which can be fixed easily. The other issue, the 128 extra bytes, comes from adding the approximate footprint in bytes to the chunk:

        footprint += 128  # add the (aprox) footprint of this instance in bytes

Maybe we do want this, maybe not; I am not sure. It was somewhat confusing in this instance.

Make private methods of carray/ctable actually private

In the last 0.5 release, several public methods in carray and ctable were actually meant to be private:

In [36]: c. # a carray
c.append c.copy c.dtype c.len c.ndim c.reshape c.size c.wheretrue
c.attrs c.cparams c.fill_chunks c.mkdirs c.next c.resize c.sum c.write_meta
c.cbytes c.create_carray c.flush c.mode c.open_carray c.rootdir c.trim
c.chunklen c.dflt c.iter c.nbytes c.read_meta c.shape c.where

In [35]: t. # a ctable
t.addcol t.cbytes t.cparams t.dtype t.iter t.mode t.ndim t.rootdir t.trim
t.append t.cols t.create_ctable t.eval t.len t.names t.open_ctable t.shape t.where
t.attrs t.copy t.delcol t.flush t.mkdir_rootdir t.nbytes t.resize t.size

Something wrong with cython version detection

[ec2-user@ip-172-31-42-166 ~]$ cython --version    
Cython version 0.21
[ec2-user@ip-172-31-42-166 ~]$ pip install bcolz   
Downloading/unpacking bcolz
  Downloading bcolz-0.7.1.tar.gz (666kB): 666kB downloaded
  Running setup.py (path:/tmp/pip_build_ec2-user/bcolz/setup.py) egg_info for package bcolz
    .. ERROR:: You need Cython 0.20 or greater to compile bcolz!
    Complete output from command python setup.py egg_info:
    .. ERROR:: You need Cython 0.20 or greater to compile bcolz!

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_ec2-user/bcolz
Storing debug log for failure in /home/ec2-user/.pip/pip.log

ctable["field"] should return an ndarray object for consistency

The __getitem__() operator for a ctable always returns an ndarray object, except when fetching a whole column:

In [27]: type(t[1])
Out[27]: numpy.void

In [28]: type(t[1:30])
Out[28]: numpy.ndarray

In [29]: type(t["f0<10"])
Out[29]: numpy.ndarray

In [30]: type(t["f0"])
Out[30]: carray.carrayExtension.carray

This should return an ndarray for consistency. The user can always access the column in its original form with:

t.cols['f0']

setup.py missing

The build instructions in README.txt describe the use of the conventional "python setup.py xxxxx", but setup.py is missing from the distribution.

Shapes error when using big arrays

import carray
a = carray.zeros( ( 1e7,10 ), dtype=float )
carray.eval('a+2*a')
ValueError: operands could not be broadcast together with shapes (131072) (38528,10) (131072)

I even got a core dump when trying

carray.eval('a+a')

Can't run carray: ImportError

The error:

In [1]: import carray
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/home/roland/<ipython-input-1-df49a4222118> in <module>()
----> 1 import carray

/home/roland/.local/lib/python2.7/site-packages/carray/__init__.pyc in <module>()
     64     cparams, eval, walk )
     65 from carray.version import __version__
---> 66 from carray.tests import test
     67 from defaults import defaults
     68 

The build log:

$ pip install --user carray
Downloading/unpacking carray
  Downloading carray-0.5.tar.gz (318Kb): 318Kb downloaded
  Running setup.py egg_info for package carray
    * Found Cython 0.16 package installed.
    * Found numpy 1.6.1 package installed.
    * Found numexpr 1.4.2 package installed.

    warning: no previously-included files found matching 'post-release.txt'
    warning: no previously-included files found matching 'releasing.txt'
    warning: no files found matching '*.txt' under directory 'bench'
    warning: no files found matching '*.pdf' under directory 'doc'
    warning: no previously-included files matching '*' found under directory 'doc/_build'
Installing collected packages: carray
  Running setup.py install for carray
    * Found Cython 0.16 package installed.
    * Found numpy 1.6.1 package installed.
    * Found numexpr 1.4.2 package installed.
    skipping 'carray/carrayExtension.c' Cython extension (up-to-date)
    building 'carray.carrayExtension' extension
    C compiler: gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC

    compile options: '-Iblosc -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c'
    extra options: '-msse2'
    gcc: blosc/blosclz.c
    gcc: carray/carrayExtension.c
    gcc: blosc/shuffle.c
    gcc: blosc/blosc.c
    blosc/blosc.c: In function ‘blosc_decompress’:
    blosc/blosc.c:738:41: warning: variable ‘ctbytes’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:732:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:732:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:731:12: warning: variable ‘_dest’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c: In function ‘blosc_getitem’:
    blosc/blosc.c:818:41: warning: variable ‘ctbytes’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:809:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:809:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c: In function ‘blosc_cbuffer_sizes’:
    blosc/blosc.c:1248:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:1248:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c: In function ‘blosc_cbuffer_metainfo’:
    blosc/blosc.c:1267:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:1267:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
    gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro build/temp.linux-x86_64-2.7/carray/carrayExtension.o build/temp.linux-x86_64-2.7/blosc/blosc.o build/temp.linux-x86_64-2.7/blosc/blosclz.o build/temp.linux-x86_64-2.7/blosc/shuffle.o -o build/lib.linux-x86_64-2.7/carray/carrayExtension.so

    warning: no previously-included files found matching 'post-release.txt'
    warning: no previously-included files found matching 'releasing.txt'
    warning: no files found matching '*.txt' under directory 'bench'
    warning: no files found matching '*.pdf' under directory 'doc'
    warning: no previously-included files matching '*' found under directory 'doc/_build'
Successfully installed carray
Cleaning up...

Appends not persisted

Am I being stupid? When I append to a bcolz array, it seems to work and its shape changes. After I ctrl-z it and then try to reload with bcolz.open, it only opens the original array; none of the appended arrays are in fact appended, certainly not persisted. What am I missing? Is it something to do with .flush()?

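Most likely, yes: for on-disk containers, pending data sits in memory until it is flushed, so call flush() before leaving (or suspending) the process. A sketch ('mydata.bcolz' is an illustrative rootdir holding a one-dimensional carray):

import bcolz

c = bcolz.open('mydata.bcolz', mode='a')
c.append([1, 2, 3])
c.flush()  # without this, the appended chunks may never reach disk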

Invalid dtype segfaults python

Try this with the following invalid dtype:

carray.carray([], dtype="string")

this will crash the python interpreter with:

Floating point exception: 8

It should rather raise an exception.

Python version: 2.7.3

Relative paths are baked into bcolz tables

For some reason the path to the bcolz table is baked into the metadata. This makes using the table from other locations or copying tables around very difficult.

In [1]: pwd
Out[1]: u'/home/mrocklin/tmp'

In [2]: import bcolz

In [3]: bcolz.ctable([[1, 2, 3], [1., 2., 3.]], rootdir='foo/mytable.bcolz')
Out[3]: 
ctable((3,), [('f0', '<i8'), ('f1', '<f8')])
  nbytes: 48; cbytes: 32.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := 'foo/mytable.bcolz'
[(1, 1.0) (2, 2.0) (3, 3.0)]

In [4]: b = bcolz.ctable(rootdir='foo/mytable.bcolz')  # works fine from same directory

In [5]: cd ..
/home/mrocklin

In [6]: b = bcolz.ctable(rootdir='tmp/foo/mytable.bcolz')  # fails from other directory
IOError: [Errno 2] No such file or directory: u'foo/mytable.bcolz/f0/meta/sizes'
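Until the metadata stores a normalized path, a sketch of a workaround is to create the table with an absolute rootdir, so the baked-in path stays valid from any working directory (although still not if the table is moved):

import os
import bcolz

# An absolute path gets baked into the metadata instead of a relative one
rootdir = os.path.abspath('foo/mytable.bcolz')
bcolz.ctable([[1, 2, 3], [1., 2., 3.]], rootdir=rootdir)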

carray inconsistency for boolean selections

Han Genuit reported this problem to the mailing list:

There is an issue which cropped up when I tried out v0.4, which is demonstrated by this code:

carr = ca.carray([1,2,3,4]) # create an array
carr[:]<3 # evaluate carr < 3 within NumPy

array([ True, True, False, False], dtype=bool)

carr[carr[:]<3] # select the elements found

array([1, 2])

You can use fancy selection to select the elements from 'carr' with a boolean array. But this does not seem to work for multidimensional arrays:

carr = ca.carray([[1,2,3],[4,5,6],[7,8,9]])
carr[:]<3

array([[ True, True, False],
[False, False, False],
[False, False, False]], dtype=bool)

carr[carr[:]<3]

Traceback (most recent call last):
File "", line 1, in
File "carrayExtension.pyx", line 1059, in carray.carrayExtension.carray.__getitem__ (carray\carrayExtension.c:9079)
ValueError: setting an array element with a sequence.
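A workaround sketch in the meantime (carr as in the session above): materialize the carray into NumPy first and do the boolean selection there:

a = carr[:]          # decompress the whole carray to an ndarray
selected = a[a < 3]  # NumPy handles multidimensional boolean masks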
