blosc / bcolz

A columnar data container that can be compressed.

Home Page: http://bcolz.blosc.org

Topics: column-store, python, compressed-data

bcolz's Issues

option to make user_dict the only namespace

Feature suggestion: an option for eval() to make user_dict the only namespace in which variables are looked up.

Reason: there are situations, such as a server-based computation engine, where execution of untrusted code is not desirable. Currently the code within the expression string can access all variables of the local Python execution context.

Example:

import carray as ca

a = ca.carray([4,5,6,7,8,9,10,11,24,35])
ca.eval("a < b", user_dict={"b":7}, vm="numexpr")

ca.eval() uses user_dict in addition to local context variables. There should be a way to override this and make such code raise an exception for the undefined a.

Python's eval() is not safe in general, as it cannot be easily sandboxed. However, if it were possible to limit which variables are passed to numexpr, that would provide a sufficient alternative.
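For what it's worth, numexpr itself can already be restricted this way: when both local_dict and global_dict are passed explicitly, names should be looked up only in those dicts rather than in the calling frame. A minimal sketch of the desired behavior (assuming numexpr is importable; ca.eval would need to forward these arguments):

import numexpr as ne

a = [4, 5, 6, 7]  # present in the calling scope, but should stay invisible

# With both dicts given explicitly, numexpr does not consult the frame,
# so the undefined 'a' raises a lookup error instead of being picked up:
ne.evaluate("a < b", local_dict={"b": 7}, global_dict={})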

ctable["field"] should return an ndarray object for consistency

The __getitem__() operator for a ctable always returns an ndarray object, except when retrieving a whole column:

In [27]: type(t[1])
Out[27]: numpy.void

In [28]: type(t[1:30])
Out[28]: numpy.ndarray

In [29]: type(t["f0<10"])
Out[29]: numpy.ndarray

In [30]: type(t["f0"])
Out[30]: carray.carrayExtension.carray

This should return an ndarray for consistency. The user can always access the column in the original form with:

t.cols['f0']
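In the meantime the ndarray form can be obtained explicitly with a full slice, since slicing a carray already returns an ndarray:

t.cols['f0'][:]  # carray -> ndarray via an explicit full slice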

ctable construction doesn't account for list input

This works:

In [4]: ct = ca.ctable((np.array([1,2,3]),np.array([4,5,6])), ('a', 'b'))

This doesn't:

In [5]: ct = ca.ctable(([1,2,3],[4,5,6]), ('a', 'b'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
    144             raise ValueError, "`cols` input is not supported"
    145         if not (calist or nalist or ratype):
--> 146             raise ValueError, "`cols` input is not supported"
    147 
    148         # The compression parameters


ValueError: `cols` input is not supported

Issue with indexing multidimensional carrays

NumPy:
In [5]: a = np.ones((27,27,729), dtype=int)
In [6]: a[0,0,0]
Out[6]: 1

Carray:
In [1]: c = ca.ones((27,27,729), dtype=int)
In [2]: c[0,0,0]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)

 /usr/local/lib/python2.6/dist-packages/carray/carrayExtension.so in  carray.carrayExtension.carray.__getitem__ (carray/carrayExtension.c:8872)()

TypeError: 'int' object is unsubscriptable

In [3]: c[0]
Out[3]: 1

Also, c[0] doesn't return an array: it yields a scalar instead of the expected (27, 729) slice.

nosetest incompatible

I can't run the bcolz tests properly with nose:

zsh» nosetests
........ssssss... [progress dots truncated]
----------------------------------------------------------------------
Ran 873 tests in 4.037s

OK (skipped=6)
.....SSSSSS... [progress dots truncated]
----------------------------------------------------------------------
Ran 874 tests in 8.473s

OK (SKIP=6)
nosetests  7.05s user 1.66s system 100% cpu 8.630 total

Reverse indexing on chunks might be bust

In [10]: chunk = c.chunks[-2]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-2988c8c546fe> in <module>()
----> 1 chunk = c.chunks[-2]

/home/esc/gw/bcolz/bcolz/carray_ext.so in bcolz.carray_ext.chunks.__getitem__ (bcolz/carray_ext.c:10336)()

/home/esc/gw/bcolz/bcolz/carray_ext.so in bcolz.carray_ext.chunks.read_chunk (bcolz/carray_ext.c:9884)()

ValueError: chunkfile /tmp/foo/data/__-2.blp not found

Will bcolz support numpy 1.9?

We are interested in including bcolz in the upcoming Anaconda release, which will use numpy 1.9. Do you have plans to support this? I hope so, since it's a very useful package. Thanks!

Can't create large multidimensional carray

With Numpy I can do something like this:

foo = np.zeros([2] * 20)

And get an ndarray with the corresponding shape. I can then:

ac = ca.carray(foo)

To get a carray object. Great. But I'm playing around with carray because I want to use array sizes that are larger than could otherwise fit in memory, and [2] * 20 is an easy shape for Numpy to handle, so for me it's a baseline of sorts.

Looking to explore the capabilities of carray, I try to create the object directly, without the intermediate Numpy step:

ac = ca.zeros([2] * 20)

But I get an error:

~/python/lib/python2.7/site-packages/carray/toplevel.pyc in zeros(shape, dtype, **kwargs)
    291     """
    292     dtype = np.dtype(dtype)
--> 293     return fill(shape=shape, dflt=np.zeros((), dtype), dtype=dtype, **kwargs)
    294
    295 def ones(shape, dtype=np.float, **kwargs):

~/python/lib/python2.7/site-packages/carray/toplevel.pyc in fill(shape, dflt, dtype, **kwargs)
    256     # Then fill it
    257     # We need an array for the defaults so as to keep the atom info
--> 258     dflt = np.array(obj.dflt, dtype=dtype)
    259     # Making strides=(0,) below is a trick to create the array fast and
    260     # without memory consumption

Which leads me to wonder: is something like this possible using carray?

repr() does not work well for bidimensional arrays

This is a case where it does not work:

In [55]: carray.arange(1010).reshape((101,10))
Out[55]:
carray((101, 10), int64) nbytes: 7.89 KB; cbytes: 15.94 KB; ratio: 0.50
cparams := cparams(clevel=5, shuffle=True)
[[1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], ..., [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009]]

which is clearly wrong.

Can't run carray: ImportError

The error:

In [1]: import carray
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/home/roland/<ipython-input-1-df49a4222118> in <module>()
----> 1 import carray

/home/roland/.local/lib/python2.7/site-packages/carray/__init__.pyc in <module>()
     64     cparams, eval, walk )
     65 from carray.version import __version__
---> 66 from carray.tests import test
     67 from defaults import defaults
     68 

The build log:

$ pip install --user carray
Downloading/unpacking carray
  Downloading carray-0.5.tar.gz (318Kb): 318Kb downloaded
  Running setup.py egg_info for package carray
    * Found Cython 0.16 package installed.
    * Found numpy 1.6.1 package installed.
    * Found numexpr 1.4.2 package installed.

    warning: no previously-included files found matching 'post-release.txt'
    warning: no previously-included files found matching 'releasing.txt'
    warning: no files found matching '*.txt' under directory 'bench'
    warning: no files found matching '*.pdf' under directory 'doc'
    warning: no previously-included files matching '*' found under directory 'doc/_build'
Installing collected packages: carray
  Running setup.py install for carray
    * Found Cython 0.16 package installed.
    * Found numpy 1.6.1 package installed.
    * Found numexpr 1.4.2 package installed.
    skipping 'carray/carrayExtension.c' Cython extension (up-to-date)
    building 'carray.carrayExtension' extension
    C compiler: gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC

    compile options: '-Iblosc -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c'
    extra options: '-msse2'
    gcc: blosc/blosclz.c
    gcc: carray/carrayExtension.c
    gcc: blosc/shuffle.c
    gcc: blosc/blosc.c
    blosc/blosc.c: In function ‘blosc_decompress’:
    blosc/blosc.c:738:41: warning: variable ‘ctbytes’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:732:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:732:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:731:12: warning: variable ‘_dest’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c: In function ‘blosc_getitem’:
    blosc/blosc.c:818:41: warning: variable ‘ctbytes’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:809:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:809:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c: In function ‘blosc_cbuffer_sizes’:
    blosc/blosc.c:1248:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:1248:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c: In function ‘blosc_cbuffer_metainfo’:
    blosc/blosc.c:1267:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
    blosc/blosc.c:1267:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
    gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro build/temp.linux-x86_64-2.7/carray/carrayExtension.o build/temp.linux-x86_64-2.7/blosc/blosc.o build/temp.linux-x86_64-2.7/blosc/blosclz.o build/temp.linux-x86_64-2.7/blosc/shuffle.o -o build/lib.linux-x86_64-2.7/carray/carrayExtension.so

    warning: no previously-included files found matching 'post-release.txt'
    warning: no previously-included files found matching 'releasing.txt'
    warning: no files found matching '*.txt' under directory 'bench'
    warning: no files found matching '*.pdf' under directory 'doc'
    warning: no previously-included files matching '*' found under directory 'doc/_build'
Successfully installed carray
Cleaning up...

setup.py missing

The build instructions in README.txt describe the use of the conventional "python setup.py xxxxx", but no setup.py file is actually shipped.

Reading a HDF5 table generated by Pandas

Hi!

When I'm trying to import an HDF5 file that was created by Pandas as an appendable table, I get an error:

In [20]: import pandas as pd
In [21]: import bcolz
In [22]: x = pd.HDFStore('/srv/hdf5/dev_fact_rv_standard.h5', mode='r')
In [23]: fact_df = x['/store_0']
In [24]: x.close()
In [25]: fact_bcolz = bcolz.ctable.fromhdf5('/srv/hdf5/dev_fact_rv_standard.h5', nodepath='/store_0', rootdir='/srv/hdf5/dev_fact_rv_standard.bcolz', mode='w')
---------------------------------------------------------------------------
NoSuchNodeError                           Traceback (most recent call last)
<ipython-input-25-164a62ef0752> in <module>()
----> 1 fact_bcolz = bcolz.ctable.fromhdf5('/srv/hdf5/dev_fact_rv_standard.h5', nodepath='/store_0', rootdir='/srv/hdf5/dev_fact_rv_standard.bcolz', mode='w')

/srv/python/venv/local/lib/python2.7/site-packages/bcolz/ctable.pyc in fromhdf5(filepath, nodepath, **kwargs)
    669             names = kwargs.pop('names')
    670         else:
--> 671             names = t.colnames
    672         # Collect metadata
    673         dtypes = [dt[0] for dt in t.dtype.fields.values()]

/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in __getattr__(self, name)
    809             self._g_add_children_names()
    810             return mydict[name]
--> 811         return self._f_get_child(name)
    812 
    813     def __setattr__(self, name, value):

/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in _f_get_child(self, childname)
    679         self._g_check_open()
    680 
--> 681         self._g_check_has_child(childname)
    682 
    683         childpath = join_path(self._v_pathname, childname)

/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in _g_check_has_child(self, name)
    403             raise NoSuchNodeError(
    404                 "group ``%s`` does not have a child named ``%s``"
--> 405                 % (self._v_pathname, name))
    406         return node_type
    407 

NoSuchNodeError: group ``/store_0`` does not have a child named ``colnames``

The description of the HDFStore:

<class 'pandas.io.pytables.HDFStore'>
File path: /srv/hdf5/dev_fact_rv_standard.h5
/store_0            frame_table  (typ->appendable,nrows->451612,ncols->91,indexers->[index],dc->[r1,r2,r3,r4,r5,r6])

Am I doing something wrong? Is there any way to solve this?

Shapes error when using big arrays

import carray
a = carray.zeros((1e7, 10), dtype=float)
carray.eval('a+2*a')
ValueError: operands could not be broadcast together with shapes (131072) (38528,10) (131072)

I even got a core dump when trying

carray.eval('a+a')

Something wrong with cython version detection

[ec2-user@ip-172-31-42-166 ~]$ cython --version    
Cython version 0.21
[ec2-user@ip-172-31-42-166 ~]$ pip install bcolz   
Downloading/unpacking bcolz
  Downloading bcolz-0.7.1.tar.gz (666kB): 666kB downloaded
  Running setup.py (path:/tmp/pip_build_ec2-user/bcolz/setup.py) egg_info for package bcolz
    .. ERROR:: You need Cython 0.20 or greater to compile bcolz!
    Complete output from command python setup.py egg_info:
    .. ERROR:: You need Cython 0.20 or greater to compile bcolz!

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_ec2-user/bcolz
Storing debug log for failure in /home/ec2-user/.pip/pip.log

Consider adding a sort function for carrays.

Sorting carrays will be a bit tricky, since random access and inserts are more expensive than sequential ones. I would suggest using the classical combination of mergesort and insertion sort, i.e. insertion sort at the chunk (or maybe even better, block) level and chunk/block-wise merge sort beyond that. Due to the random access issues mentioned above, an out-of-place mergesort will probably be better than an in-place one. It might also be worth looking at Knuth's TAOCP Volume 3; IIRC there is a section about sorting algorithms for data stored on tape, and I figure we have a similar situation here.

Additionally the algorithm outlined above should work on out-of-core carrays too.
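Here is a minimal in-memory sketch of that approach (sorted runs per chunk, combined with a k-way merge); a production version would spill the runs to disk instead of holding them all in RAM, but the structure would be the same:

import heapq
import itertools
import numpy as np
import bcolz

def merge_sorted(c):
    # Sort one chunk-sized run at a time (slicing a carray yields ndarrays).
    n = c.chunklen
    runs = [np.sort(c[i:i + n]) for i in range(0, len(c), n)]
    # k-way merge of the sorted runs, appended to the output block by block.
    out = bcolz.carray(np.empty(0, dtype=c.dtype), expectedlen=len(c))
    merged = heapq.merge(*runs)
    while True:
        block = np.fromiter(itertools.islice(merged, n), dtype=c.dtype)
        if block.size == 0:
            break
        out.append(block)
    return out

c = bcolz.carray(np.random.randint(0, 1000, 100000))
s = merge_sorted(c)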

`wheretrue()` for multidimensional carray

It does not work for multidimensional carrays:

In [23]: c = ca.carray([True]*9).reshape((3,3))
In [24]: c[1,2] = False
In [26]: list(c.wheretrue())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/carrayExtension.so in carray.carrayExtension.carray.wheretrue (carray/carrayExtension.c:12949)()

ValueError: `self` is not an array of booleans

carray inconsistency for boolean selections

Han Genuit reported this problem to the mailing list:

There is an issue which cropped up when I tried out v0.4, which is demonstrated by this code:

carr = ca.carray([1,2,3,4]) # create an array
carr[:]<3 # evaluate carr < 3 within NumPy

array([ True, True, False, False], dtype=bool)

carr[carr[:]<3] # select the elements found

array([1, 2])

You can use fancy selection to select the elements from 'carr' with a boolean array. But this does not seem to work for multidimensional arrays:

carr = ca.carray([[1,2,3],[4,5,6],[7,8,9]])
carr[:]<3

array([[ True, True, False],
[False, False, False],
[False, False, False]], dtype=bool)

carr[carr[:]<3]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "carrayExtension.pyx", line 1059, in carray.carrayExtension.carray.__getitem__ (carray\carrayExtension.c:9079)
ValueError: setting an array element with a sequence.

Compression appears to not work on all arrays

One of my interests in bcolz is the excellent compression, and I was getting on great in testing while I was only using test arrays of zeros, ones or arange. But the first time I created a carray from my actual data (a dense million-element array of uint8 values, i.e. nothing higher than 255), the compression literally stops. I don't understand. In this first test, a (created with bcolz's own arange and dtype="uint8") and b (numpy's arange) both achieve a 55.99 compression ratio, but an array created with randint achieves no compression.

[screenshot: testcompression]

In the next test you can see some compression if I pass the array in as int64 and cast it to uint8 via dtype, but the result only brings the total size down to what an uncompressed uint8 array of this nature would occupy.

[screenshot: testcompression2]

I am very confused: is this a bug, or what am I failing to understand?
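For what it's worth, this is expected behavior for a compressor rather than a bug: the output of randint has essentially maximal entropy, so Blosc can find no redundancy to remove, whereas arange-style data is extremely regular. A quick sketch that shows the difference (exact ratios will vary with machine and cparams):

import numpy as np
import bcolz

regular = bcolz.carray((np.arange(1000000) % 256).astype('uint8'))
noise = bcolz.carray(np.random.randint(0, 256, 1000000).astype('uint8'))

for name, c in (('regular', regular), ('noise', noise)):
    # nbytes / cbytes is the compression ratio shown in carray reprs
    print("%s ratio: %.2f" % (name, c.nbytes / float(c.cbytes)))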

Test failure with MKL linked numpy - small precision difference

When numpy is linked with MKL and numexpr isn't, one of the tests fails:

FAIL: test11 (carray.tests.test_carray.eval_big_ne)
Testing eval() with functions like np.sin()

The arrays are actually very close, so using assert_array_almost_equal with 14 decimal places allows it to pass:
diff --git a/carray/tests/test_carray.py b/carray/tests/test_carray.py
index 19fbcf9..d6ca9a1 100644
--- a/carray/tests/test_carray.py
+++ b/carray/tests/test_carray.py
@@ -1096,7 +1096,7 @@ class evalTest(unittest.TestCase):
         nr = np.sin(a) + 2 * np.log(b) - 3
         #print "ca.eval ->", cr
         #print "numpy ->", nr
-        assert_array_equal(cr[:], nr, "eval does not work correctly")
+        assert_array_almost_equal(cr[:], nr, 14, "eval does not work correctly")
 
     def test12(self):
         """Testing eval() with `out_flavor` == 'numpy'"""

Make private methods of carray/ctable actually private

In the last (0.5) release, several public methods of carray and ctable were actually meant to be private:

In [36]: c. # a carray
c.append c.copy c.dtype c.len c.ndim c.reshape c.size c.wheretrue
c.attrs c.cparams c.fill_chunks c.mkdirs c.next c.resize c.sum c.write_meta
c.cbytes c.create_carray c.flush c.mode c.open_carray c.rootdir c.trim
c.chunklen c.dflt c.iter c.nbytes c.read_meta c.shape c.where

In [35]: t. # a ctable
t.addcol t.cbytes t.cparams t.dtype t.iter t.mode t.ndim t.rootdir t.trim
t.append t.cols t.create_ctable t.eval t.len t.names t.open_ctable t.shape t.where
t.attrs t.copy t.delcol t.flush t.mkdir_rootdir t.nbytes t.resize t.size

Bug in chunk.pointer (and maybe more)

There is a miscalculation in the chunk.pointer code:

    @property
    def pointer(self):
        return <Py_uintptr_t> self.data + BLOSCPACK_HEADER_LENGTH

Let me demonstrate what is wrong. First let's set the scene:

In [1]: import bcolz

In [2]: a = np.arange(100000)

In [3]: c = bcolz.carray(a)

In [4]: chunk = c.chunks[0]

In [5]: chunk
Out[5]: 
chunk(int64)  nbytes: 262144; cbytes: 4720; ratio: 55.54
'[    0     1     2 ..., 32765 32766 32767]'

We now have a chunk in memory. Let's try to use other means to decompress it:

In [6]: import ctypes

In [7]: comp = ctypes.string_at(chunk.pointer, 4720)

In [8]: import blosc

In [9]: dcmp = blosc.decompress(comp)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-9-8f8c24dd6b78> in <module>()
----> 1 dcmp = blosc.decompress(comp)

/home/esc/anaconda/lib/python2.7/site-packages/blosc/toplevel.pyc in decompress(bytesobj)
    440     _check_bytesobj(bytesobj)
    441 
--> 442     return _ext.decompress(bytesobj)
    443 
    444 

error: Error 4720 : not a Blosc buffer or header info is corrupted

I have used the unmodified pointer and a length value of 4720 as indicated by the __repr__ of chunk.

My suspicion is that the pointer should not be offset by BLOSCPACK_HEADER_LENGTH, which is still 16 since bcolz uses an older version of Bloscpack, so let's try that:

In [10]: comp = ctypes.string_at(chunk.pointer-16, 4720)

In [11]: dcmp = blosc.decompress(comp)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-11-8f8c24dd6b78> in <module>()
----> 1 dcmp = blosc.decompress(comp)

/home/esc/anaconda/lib/python2.7/site-packages/blosc/toplevel.pyc in decompress(bytesobj)
    440     _check_bytesobj(bytesobj)
    441 
--> 442     return _ext.decompress(bytesobj)
    443 
    444 

error: Error 4720 : not a Blosc buffer or header info is corrupted

Okay so that doesn't work either. Let's use bloscpack to discover the contents of the blosc header, if possible:

In [12]: import bloscpack

In [13]: bloscpack.headers.decode_blosc_header(comp)
Out[13]: 
{'blocksize': 32768,
 'ctbytes': 4592,
 'flags': 1,
 'nbytes': 262144,
 'typesize': 8,
 'version': 2,
 'versionlz': 1}

Okay great, so we need to subtract 128 from the length (I'll explain why later):

In [14]: comp = ctypes.string_at(chunk.pointer-16, 4592)

In [15]: dcmp = blosc.decompress(comp)

In [16]: dcmp[:8]
Out[16]: '\x00\x00\x00\x00\x00\x00\x00\x00'

In [17]: dcmp[:9]
Out[17]: '\x00\x00\x00\x00\x00\x00\x00\x00\x01'

And as desired we can recover the first 8 byte 0 and all the rest of course too.

So, one issue is chunk.pointer, which can be fixed easily. The other issue, the 128 extra bytes, comes from adding the approximate footprint of the instance to the chunk's cbytes:

        footprint += 128  # add the (aprox) footprint of this instance in bytes

Maybe we do want this, maybe not, I am not sure. It was somewhat confusing in this instance.
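Putting the findings together, here is a sketch of how the buffer can be recovered today (continuing the session above; the compressed length is taken from the blosc header rather than from cbytes, which includes the ~128-byte footprint):

import ctypes
import blosc
import bloscpack

start = chunk.pointer - 16  # undo the BLOSCPACK_HEADER_LENGTH offset
# 'ctbytes' in the 16-byte blosc header is the true compressed length;
# chunk.cbytes is 128 bytes larger because of the instance footprint.
header = bloscpack.headers.decode_blosc_header(ctypes.string_at(start, 16))
comp = ctypes.string_at(start, header['ctbytes'])
dcmp = blosc.decompress(comp)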

ctable where selection with "in" clause

Hi!

Not an issue, but an enhancement discussion: something I'm looking at with @fran-xeco, and we are wondering what we should do.

The where clause works fine in the tutorial examples, but a large "in" selection runs into trouble. With a limited number of values you can chain "or" conditions in the where clause, but that is not great for performance, and selecting on 1,000 or 50,000 values fails with errors because of the maximum expression nesting.
So we checked numpy & pandas:

Numpy in1d
Numpy has the in1d function, but it's not really great performance-wise in our tests (and I don't see how I could apply it easily to bcolz ;)
http://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html

Pandas isin()
Pandas solves it itself by looping through the series one by one in Cython with a "value in set" check, generating a boolean mask:
https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L124

So if you had a clause like a == 1 and b == 2 and c in [3, 4, ..., 5000], you could do something like

result = [row for row in btable_example.where('a==1 and b==2') if row['c'] in (3, 4, ..., 5000)]

but then probably cythonized to make it fast. Ideally, where should also accept lists and handle this internally, so you would just say:

result = [row for row in btable_example.where('a == 1 and b == 2 and c in [3, 4, ..., 5000]')]

But that does mean that internally it needs to parse the string and decide what numexpr will handle and what will be done otherwise. I'm a bit clueless as to whether numexpr can be applied for the first two filters and a Cython filter for the third one (would this break the vectorization of the numexpr selection?).

Can you give some ideas of what would be the best solution direction and how we could implement this?
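Until something like that exists, a pure-Python version of the two-stage filter sketched above might look like the following (note that where yields namedtuple rows, hence attribute access, and that numexpr wants & rather than and; the set is a stand-in for the large value list):

import bcolz

bt = bcolz.ctable([[1, 1, 2], [2, 2, 2], [3, 999, 4]], names=['a', 'b', 'c'])
wanted = set([3, 4, 5000])  # stand-in for the big "in" list

# numexpr evaluates the cheap predicates; the membership test runs in
# plain Python (the proposal above is essentially this loop, cythonized).
result = [row.c for row in bt.where('(a == 1) & (b == 2)') if row.c in wanted]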

Update Bloscpack

Currently bcolz uses parts of an older version of Bloscpack for historical reasons.

eval out_flavor 'numpy' error for multidim carrays

c = ca.carray([[1,2,3],[4,5,6],[7,8,9]])
ca.eval('c>5', out_flavor='numpy')

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/local/lib/python2.6/dist-packages/carray/toplevel.py", line 502, in eval
 **kwargs)
 File "/usr/local/lib/python2.6/dist-packages/carray/toplevel.py", line 575, in _eval_blocks
   result[:bsize] = res_block
ValueError: output operand requires a reduction, but reduction is not enabled

Intended behavior of `carray.__next__`?

What is the intended behavior of carray.__next__?

In [1]: import bcolz

In [2]: b = bcolz.carray([1, 2, 3])

In [3]: next(b)
StopIteration: 

I would expect either a TypeError (because __next__ isn't really implemented) or the value 1.

Where fails on unicode text

In [1]: import bcolz

In [2]: b = bcolz.ctable([['a', 'b', 'c']], dtype='U4', names=['text'])

In [3]: b.where('text == "b"')
ValueError: unkown type unicode128

Although this may be a numexpr issue.

bcolz.ctable.where / __iter__ causes unexpected behavior

Using ctable.where sets state that affects future operations, causing some confusing issues. It took me quite a while to track this down.

In [1]: import bcolz

In [2]: bc = bcolz.ctable([[1, 2, 3], [10, 20, 30]], names=['a', 'b'])

In [3]: bc.where('a >= 2')  # call .where but don't do anything with it
Out[3]: <itertools.imap at 0x7fd7a84f5750>

In [4]: list(bc['b'])  # Later iterate over table, get where result
Out[4]: [20, 30]

ctable addcol() doesn't account for list input

After creating a ctable:

In [6]: ct = ca.ctable((np.array([1,2,3]),np.array([4,5,6])), ('a', 'b'))

Adding a new column as list fails:

In [7]: ct.addcol([7,8,9], 'c')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in addcol(self, newcol, name, pos, **kwargs)
    323         self.cols[name] = newcol
    324         # Update _arr1

--> 325         self._arr1 = np.empty(shape=(1,), dtype=self.dtype)
    326 
    327 

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in dtype(self)
     55         "The data type of this ctable (numpy dtype)."
     56         names, cols = self.names, self.cols
---> 57         l = [(name, cols[name].dtype) for name in names]
     58         return np.dtype(l)
     59 

AttributeError: 'list' object has no attribute 'dtype'

Consequently breaking the ctable:

In [10]: ct[0]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __getitem__(self, key)
    603             ra = self._arr1.copy()
    604             # Fill it

--> 605             ra[0] = tuple([self.cols[name][key] for name in self.names])
    606             return ra[0]
    607         # Slices


ValueError: size of tuple must match number of fields.

Large multi-dimensional arrays: how to create them entirely on disk?

I successfully created a trillion-element bcolz array on disk. Fantastic: the compression and the disk-based storage turn a 2-terabyte array into 13.7 GB on disk. The problem is that I want a multi-dimensional array, and I can't reshape it to (1000000, 1000000) using .reshape because that does it all in RAM, a non-starter. Then I tried to create the shape from the start using .zeros((1000000, 1000000)) or .ones, and again it is done in memory. Is there a way to create such an array directly on disk? If not, can you change this behavior?
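A possible workaround, sketched under the assumption that appends to a rootdir-backed carray are written to disk chunk by chunk: start from an empty on-disk carray whose leading dimension grows, then append row blocks that fit in RAM (the sizes here are hypothetical and much smaller than the trillion-element case):

import numpy as np
import bcolz

rows, cols, block = 100000, 1000, 10000  # hypothetical, scaled-down sizes
c = bcolz.carray(np.empty((0, cols)), rootdir='big.bcolz', mode='w',
                 expectedlen=rows)
zeros = np.zeros((block, cols))
for _ in range(rows // block):
    c.append(zeros)  # each append goes to disk, not to RAM
c.flush()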

Fancy indexing fails on leading axis of a carray

When selecting slices from a carray, fancy indexing works as expected for all axes that are not the leading axis. A fancy index applied to the leading axis triggers a TypeError. I could not find the error message in the bcolz codebase, so maybe 0.7.1 fixed it?

bcolz version: 0.7.0

(Apologies for missing In/Out prompts.)

Build a carray:

import bcolz as bz
a = np.random.randint(0, 3, (4, 3))
c = bz.carray(a)
c
carray((4, 3), int64)
nbytes: 96; cbytes: 15.98 KB; ratio: 0.01
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[1 2 2]
[1 1 2]
[0 0 0]
[2 2 0]]

This works:

c[:, [0, 2]]
array([[1, 2],
[1, 2],
[0, 0],
[2, 0]])

This does not:

c[[0, 2], :]
Traceback (most recent call last):
File "<pyshell#15>", line 1, in
c[[0, 2], :]
File "carray_ext.pyx", line 1879, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21106)
File "carray_ext.pyx", line 1897, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21344)
File "carray_ext.pyx", line 1913, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21618)
TypeError: object of type 'numpy.int64' has no len()

Invalid dtype segfaults python

Try this with following invalid dtype:

carray.carray([], dtype="string")

this will fail python interpreter with:

Floating point exception: 8

It should rather raise an exception.

Python version: 2.7.3

``repr`` of ``chunk`` class is broken

bcolz port of:
ContinuumIO/blz#11

In [5]: import bcolz

In [6]: b = bcolz.arange(1e8)

In [7]: b.chunks[0]
Out[7]: <repr(<bcolz.carray_ext.chunk at 0x7f4dff38d578>) failed: AttributeError: 'bcolz.carray_ext.chunk' object has no attribute 'shape'>

I had a look at the code and could probably fix it and submit a PR, but I don't understand the intention. What should shape be?

    def __repr__(self):
        """Represent the chunk as an string, with additional info."""
        cratio = self.nbytes / float(self.cbytes)
        fullrepr = "chunk(%s, %s)  nbytes: %d; cbytes: %d; ratio: %.2f\n%r" % \
            (self.shape, self.dtype, self.nbytes, self.cbytes, cratio, str(self))
        return fullrepr

running diff on carray returns short arrays

When calculating an array derivative (diff), carray shortens the array:

import carray as ca
import numpy as np

carr = ca.arange(1000000)
diff_arr = ca.eval("np.diff(carr)", vm="python")
nd_arr = np.diff(carr)

print "Number of elements in c:", len(carr)
print "Number of elements in diff_arr:", len(diff_arr)
print "Number of elements in nd_arr:", len(np_arr)

This returns on my computer:

Number of elements in carr: 1000000
Number of elements in diff_arr: 999877
Number of elements in nd_arr: 999999

Derivatives calculated with ndarray and carray have different lengths, presumably because eval applies np.diff to each block separately, dropping one element per block.

Poor shaping after append

In [1]: import bcolz

In [2]: b = bcolz.ctable([[1, 2, 3], [1., 2., 3.]])

In [3]: b
Out[3]: 
ctable((3,), [('f0', '<i8'), ('f1', '<f8')])
  nbytes: 48; cbytes: 32.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(1, 1.0) (2, 2.0) (3, 3.0)]

In [4]: b.append([[4, 5, 6], [4., 5., 6.]])

In [5]: b
Out[5]: 
ctable((4,), [('f0', '<i8'), ('f1', '<f8')])
  nbytes: 96; cbytes: 32.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(3, 3.0) (4, 4.0) (5, 5.0) (6, 6.0)]

In [6]: b.shape
Out[6]: (4,)

In [7]: b[5]
Out[7]: (6, 6.0)

Relative paths are baked into bcolz tables

For some reason the path to the bcolz table is baked into the metadata. This makes using the table from other locations or copying tables around very difficult.

In [1]: pwd
Out[1]: u'/home/mrocklin/tmp'

In [2]: import bcolz

In [3]: bcolz.ctable([[1, 2, 3], [1., 2., 3.]], rootdir='foo/mytable.bcolz')
Out[3]: 
ctable((3,), [('f0', '<i8'), ('f1', '<f8')])
  nbytes: 48; cbytes: 32.00 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := 'foo/mytable.bcolz'
[(1, 1.0) (2, 2.0) (3, 3.0)]

In [4]: b = bcolz.ctable(rootdir='foo/mytable.bcolz')  # works fine from same directory

In [5]: cd ..
/home/mrocklin

In [6]: b = bcolz.ctable(rootdir='tmp/foo/mytable.bcolz')  # fails from other directory
IOError: [Errno 2] No such file or directory: u'foo/mytable.bcolz/f0/meta/sizes'

bad import in __init__ in version 0.5

Hi

In the __init__ of the 0.5 version of the module (installed by pip today) there is this line:
import carray.test as test
but the interpreter (IPython 0.12.1 in my case) complains because there is no carray.test module to import.

Commenting out the line is enough to fix this issue.

Appends not persisted

Am I being stupid? When I append to a bcolz array, it seems to work and its shape changes. But after I Ctrl-Z it and then try to reload it with bcolz.open, only the original array opens; none of the appended arrays were in fact appended, certainly not persisted. What am I missing? Is it something to do with .flush()?

[screenshot: testappend]
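For reference, this is almost certainly the missing flush(): on-disk carrays buffer recent appends in memory, and flush() persists them. A minimal sketch of the assumed round trip:

import numpy as np
import bcolz

c = bcolz.carray(np.arange(10), rootdir='demo.bcolz', mode='w')
c.append(np.arange(10, 20))
c.flush()  # without this, the appended chunks may never reach disk

c2 = bcolz.open(rootdir='demo.bcolz')
print(len(c2))  # 20 after the flush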

ctable construction doesn't account for single column input

In [11]: ct = ca.ctable(np.array([1,3]), 'a')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
    125                     raise ValueError, "cannot convert `names` into a list"
    126             if len(names) != len(cols):
--> 127                 raise ValueError, "`cols` and `names` must have the same length"
    128         # Check name validity

    129         nt = namedtuple('_nt', names, verbose=False)

ValueError: `cols` and `names` must have the same length

Related?

In [13]: ct = ca.ctable(np.array([1,3]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
    115         if names is None:
    116             if isinstance(cols, np.ndarray):  # ratype case
--> 117                 names = list(cols.dtype.names)
    118             else:
    119                 names = ["f%d"%i for i in range(len(cols))]

TypeError: 'NoneType' object is not iterable

Row level types (numpy.ndarray 0d)

Hi,

While running some tests I stumbled upon something unexpected to me: a bcolz ctable containing different data types for items that should all have the same data type (code copied and modified from the bcolz tests). Some objects are of type numpy.ndarray, while all objects inside the same column should share one data type. Is this issue known? If yes, is it a bcolz or a numpy problem, or am I misunderstanding something?

I tried on different machines and all of them showed the same issue.

Could this issue be related to #21, or are they completely independent?

Can somebody else reproduce this issue (code below)? A summary of the output object types is given in the comments.

import numpy as np
import bcolz
from collections import defaultdict

def check_types(col):
    check = defaultdict(int)
    for i in range(len(col)):
        val = col[i]
        check[str(type(val))] += 1
    return dict(check)

if __name__ == '__main__':
    ra = np.fromiter(((i, i * 2., i * 3)
                      for i in range(500000)), dtype='i4,f8,i8')
    t = bcolz.ctable(ra)

    print(check_types(t['f0']))
    # {"<type 'numpy.ndarray'>": 401407, "<type 'int'>": 98593}
    print(check_types(t['f1']))
    # {"<type 'float'>": 37153, "<type 'numpy.ndarray'>": 462847}
    print(check_types(t['f2']))
    # {"<type 'numpy.ndarray'>": 462847, "<type 'int'>": 37153}

Thank you guys for the great work you are doing in this project.
