blosc / bcolz
A columnar data container that can be compressed.
Home Page: http://bcolz.blosc.org
Feature suggestion: add an option for eval() that makes user_dict the only namespace in which variables are looked up.
Reason: there are situations, such as a server-based computation engine, where execution of untrusted code is not desirable. Currently, the code within the expression string can access all variables of the local Python execution context.
Example:
import carray as ca
a = ca.carray([4,5,6,7,8,9,10,11,24,35])
ca.eval("a < b", user_dict={"b":7}, vm="numexpr")
ca.eval() uses user_dict in addition to local context variables. There should be a way to override this and make such code raise an exception for the undefined a.
Python eval() is not safe in general, as it cannot be easily sandboxed. However, if it were possible to limit which variables are passed to numexpr, that would provide a sufficient alternative.
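A minimal pure-Python sketch of the proposed semantics (restricted_eval is a hypothetical name, not part of the carray API): only names supplied in user_dict resolve, and anything else raises NameError.

```python
def restricted_eval(expr, user_dict):
    # Hypothetical sketch: resolve names ONLY from user_dict, with
    # builtins disabled, so "a < b" fails unless 'a' is supplied too.
    return eval(expr, {"__builtins__": {}}, dict(user_dict))

restricted_eval("b < 7", {"b": 3})     # evaluates fine
# restricted_eval("a < b", {"b": 7})   # raises NameError for 'a'
```

This only illustrates the lookup rule; it is not a sandbox for untrusted expressions, which is exactly why delegating the restricted evaluation to numexpr is attractive.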
The __getitem__() operator for a ctable always returns an ndarray object, except when a whole column is selected:
In [27]: type(t[1])
Out[27]: numpy.void
In [28]: type(t[1:30])
Out[28]: numpy.ndarray
In [29]: type(t["f0<10"])
Out[29]: numpy.ndarray
In [30]: type(t["f0"])
Out[30]: carray.carrayExtension.carray
This should return an ndarray for consistency. The user can always access the column in the original form with:
t.cols['f0']
This works:
In [4]: ct = ca.ctable((np.array([1,2,3]),np.array([4,5,6])), ('a', 'b'))
This doesn't:
In [5]: ct = ca.ctable(([1,2,3],[4,5,6]), ('a', 'b'))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
144 raise ValueError, "`cols` input is not supported"
145 if not (calist or nalist or ratype):
--> 146 raise ValueError, "`cols` input is not supported"
147
148 # The compression parameters
ValueError: `cols` input is not supported
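A likely workaround (only the NumPy conversion step is exercised here; the ctable call itself is left commented) is to coerce the plain lists to ndarrays before handing them to the constructor, since per the traceback it only accepts carray/ndarray columns or a structured array:

```python
import numpy as np

cols = ([1, 2, 3], [4, 5, 6])
# Convert each list column to an ndarray first:
cols = tuple(np.asarray(col) for col in cols)
# ct = ca.ctable(cols, ('a', 'b'))  # should now take the supported path
```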
NumPy:
In [5]: a = np.ones((27,27,729), dtype=int)
In [6]: a[0,0,0]
Out[6]: 1
Carray:
In [1]: c = ca.ones((27,27,729), dtype=int)
In [2]: c[0,0,0]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python2.6/dist-packages/carray/carrayExtension.so in carray.carrayExtension.carray.__getitem__ (carray/carrayExtension.c:8872)()
TypeError: 'int' object is unsubscriptable
In [3]: c[0]
Out[3]: 1
So c[0] doesn't return an array for index 0; it returns a scalar, which is why the chained indexing above fails.
I can't run the bcolz tests properly with nose:
zsh» nosetests
........................................................................................................................................................................................................ssssss...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 873 tests in 4.037s
OK (skipped=6)
.....................................................................................................................................................................................................................................SSSSSS...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
----------------------------------------------------------------------
Ran 874 tests in 8.473s
OK (SKIP=6)
nosetests 7.05s user 1.66s system 100% cpu 8.630 total
In [10]: chunk = c.chunks[-2]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-2988c8c546fe> in <module>()
----> 1 chunk = c.chunks[-2]
/home/esc/gw/bcolz/bcolz/carray_ext.so in bcolz.carray_ext.chunks.__getitem__ (bcolz/carray_ext.c:10336)()
/home/esc/gw/bcolz/bcolz/carray_ext.so in bcolz.carray_ext.chunks.read_chunk (bcolz/carray_ext.c:9884)()
ValueError: chunkfile /tmp/foo/data/__-2.blp not found
As suggested by @handloomweaver in #58, the documentation for using append on disk-backed arrays should include a hint about calling flush. This applies to both the method docstring and any tutorial-like documentation.
We are interested in including bcolz in the upcoming Anaconda release, which will use numpy 1.9. Do you have plans to support this? I hope so, since it's a very useful package. Thanks!
The front page of the bcolz GitHub project has an https link to https://blosc.org; the https link is broken, though http works.
c = ca.carray([1,2,3,4,5])
list(c.where(c>2))
[3, 4, 5]
c = ca.carray([[1,2,3],[4,5,6],[7,8,9]])
list(c.where(c>5))
[]
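For comparison, NumPy's boolean selection on the same 2-d data returns the flattened matches, which is presumably what where should mirror here:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Boolean selection flattens the result to the matching elements:
a[a > 5]   # array([6, 7, 8, 9])
```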
With Numpy I can do something like this:
foo = np.zeros([ 2 ] * 20)
And get an ndarray
with the corresponding shape. I can then:
ac = ca.carray(foo)
To get a carray
object. Great. But I'm playing around with carray because I want to use array sizes that are larger than could otherwise fit in memory, and [2] * 20
is an easy shape for Numpy to handle, so for me it's a baseline of sorts.
Looking to explore the capabilities of carray
, I try to create the object directly, without the intermediate Numpy step:
ac = ca.zeros([2] * 20)
But I get an error:
~/python/lib/python2.7/site-packages/carray/toplevel.pyc in zeros(shape, dtype, **kwargs)
291 """
292 dtype = np.dtype(dtype)
--> 293 return fill(shape=shape, dflt=np.zeros((), dtype), dtype=dtype, **kwargs)
294
295 def ones(shape, dtype=np.float, **kwargs):
~/python/lib/python2.7/site-packages/carray/toplevel.pyc in fill(shape, dflt, dtype, **kwargs)
256 # Then fill it
257 # We need an array for the defaults so as to keep the atom info
--> 258 dflt = np.array(obj.dflt, dtype=dtype)
259 # Making strides=(0,) below is a trick to create the array fast and
260 # without memory consumption
Which leads me to wonder: is something like this possible using carray
?
This is a case where it does not work:
In [55]: carray.arange(1010).reshape((101,10))
Out[55]:
carray((101, 10), int64) nbytes: 7.89 KB; cbytes: 15.94 KB; ratio: 0.50
cparams := cparams(clevel=5, shuffle=True)
[[1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], ..., [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009], [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009]]
which is clearly wrong.
The error:
In [1]: import carray
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
/home/roland/<ipython-input-1-df49a4222118> in <module>()
----> 1 import carray
/home/roland/.local/lib/python2.7/site-packages/carray/__init__.pyc in <module>()
64 cparams, eval, walk )
65 from carray.version import __version__
---> 66 from carray.tests import test
67 from defaults import defaults
68
The build log:
$ pip install --user carray
Downloading/unpacking carray
Downloading carray-0.5.tar.gz (318Kb): 318Kb downloaded
Running setup.py egg_info for package carray
* Found Cython 0.16 package installed.
* Found numpy 1.6.1 package installed.
* Found numexpr 1.4.2 package installed.
warning: no previously-included files found matching 'post-release.txt'
warning: no previously-included files found matching 'releasing.txt'
warning: no files found matching '*.txt' under directory 'bench'
warning: no files found matching '*.pdf' under directory 'doc'
warning: no previously-included files matching '*' found under directory 'doc/_build'
Installing collected packages: carray
Running setup.py install for carray
* Found Cython 0.16 package installed.
* Found numpy 1.6.1 package installed.
* Found numexpr 1.4.2 package installed.
skipping 'carray/carrayExtension.c' Cython extension (up-to-date)
building 'carray.carrayExtension' extension
C compiler: gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC
compile options: '-Iblosc -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c'
extra options: '-msse2'
gcc: blosc/blosclz.c
gcc: carray/carrayExtension.c
gcc: blosc/shuffle.c
gcc: blosc/blosc.c
blosc/blosc.c: In function ‘blosc_decompress’:
blosc/blosc.c:738:41: warning: variable ‘ctbytes’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c:732:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c:732:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c:731:12: warning: variable ‘_dest’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c: In function ‘blosc_getitem’:
blosc/blosc.c:818:41: warning: variable ‘ctbytes’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c:809:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c:809:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c: In function ‘blosc_cbuffer_sizes’:
blosc/blosc.c:1248:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c:1248:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c: In function ‘blosc_cbuffer_metainfo’:
blosc/blosc.c:1267:20: warning: variable ‘versionlz’ set but not used [-Wunused-but-set-variable]
blosc/blosc.c:1267:11: warning: variable ‘version’ set but not used [-Wunused-but-set-variable]
gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro build/temp.linux-x86_64-2.7/carray/carrayExtension.o build/temp.linux-x86_64-2.7/blosc/blosc.o build/temp.linux-x86_64-2.7/blosc/blosclz.o build/temp.linux-x86_64-2.7/blosc/shuffle.o -o build/lib.linux-x86_64-2.7/carray/carrayExtension.so
warning: no previously-included files found matching 'post-release.txt'
warning: no previously-included files found matching 'releasing.txt'
warning: no files found matching '*.txt' under directory 'bench'
warning: no files found matching '*.pdf' under directory 'doc'
warning: no previously-included files matching '*' found under directory 'doc/_build'
Successfully installed carray
Cleaning up...
The build instructions in README.txt describe the use of the conventional "python setup.py xxxxx".
We might want to look at scikit-learn for some inspiration.
Hi!
When I'm trying to import a hdf5 file that was created by Pandas as an appendable table, I get an error:
In [20]: import pandas as pd
In [21]: import bcolz
In [22]: x = pd.HDFStore('/srv/hdf5/dev_fact_rv_standard.h5', mode='r')
In [23]: fact_df = x['/store_0']
In [24]: x.close()
In [25]: fact_bcolz = bcolz.ctable.fromhdf5('/srv/hdf5/dev_fact_rv_standard.h5', nodepath='/store_0', rootdir='/srv/hdf5/dev_fact_rv_standard.bcolz', mode='w')
---------------------------------------------------------------------------
NoSuchNodeError Traceback (most recent call last)
<ipython-input-25-164a62ef0752> in <module>()
----> 1 fact_bcolz = bcolz.ctable.fromhdf5('/srv/hdf5/dev_fact_rv_standard.h5', nodepath='/store_0', rootdir='/srv/hdf5/dev_fact_rv_standard.bcolz', mode='w')
/srv/python/venv/local/lib/python2.7/site-packages/bcolz/ctable.pyc in fromhdf5(filepath, nodepath, **kwargs)
669 names = kwargs.pop('names')
670 else:
--> 671 names = t.colnames
672 # Collect metadata
673 dtypes = [dt[0] for dt in t.dtype.fields.values()]
/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in __getattr__(self, name)
809 self._g_add_children_names()
810 return mydict[name]
--> 811 return self._f_get_child(name)
812
813 def __setattr__(self, name, value):
/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in _f_get_child(self, childname)
679 self._g_check_open()
680
--> 681 self._g_check_has_child(childname)
682
683 childpath = join_path(self._v_pathname, childname)
/srv/python/venv/local/lib/python2.7/site-packages/tables/group.pyc in _g_check_has_child(self, name)
403 raise NoSuchNodeError(
404 "group ``%s`` does not have a child named ``%s``"
--> 405 % (self._v_pathname, name))
406 return node_type
407
NoSuchNodeError: group ``/store_0`` does not have a child named ``colnames``
The description of the HDFStore:
<class 'pandas.io.pytables.HDFStore'>
File path: /srv/hdf5/dev_fact_rv_standard.h5
/store_0 frame_table (typ->appendable,nrows->451612,ncols->91,indexers->[index],dc->[r1,r2,r3,r4,r5,r6])
Am I doing something wrong / Is there any way to solve this?
import carray
a = carray.zeros( ( 1e7,10 ), dtype=float )
carray.eval('a+2*a')
ValueError: operands could not be broadcast together with shapes (131072) (38528,10) (131072)
I even got a core dump when trying
carray.eval('a+a')
I get a 404 when I point my browser at: http://bcolz.blosc.org/docs/manual
[ec2-user@ip-172-31-42-166 ~]$ cython --version
Cython version 0.21
[ec2-user@ip-172-31-42-166 ~]$ pip install bcolz
Downloading/unpacking bcolz
Downloading bcolz-0.7.1.tar.gz (666kB): 666kB downloaded
Running setup.py (path:/tmp/pip_build_ec2-user/bcolz/setup.py) egg_info for package bcolz
.. ERROR:: You need Cython 0.20 or greater to compile bcolz!
Complete output from command python setup.py egg_info:
.. ERROR:: You need Cython 0.20 or greater to compile bcolz!
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_ec2-user/bcolz
Storing debug log for failure in /home/ec2-user/.pip/pip.log
Sorting carrays will be a bit tricky, since random access and inserts are more expensive than sequential ones. I would suggest using the classical combination of mergesort and insertion sort, i.e. insertion sort at the chunk (or maybe even better, block) level and chunk/block-wise mergesort beyond that. Also, due to the random access issues mentioned above, an out-of-place mergesort will probably be better than an in-place one. It might also be worth looking at Knuth's TAOCP Volume 3; IIRC there is a section about sorting algorithms for data stored on tape, and I figure we have a similar situation here.
Additionally the algorithm outlined above should work on out-of-core carrays too.
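A toy pure-Python sketch of the strategy outlined above (a list of lists stands in for the chunks; a real implementation would sort blocks in C and stream the merge rather than materialize it):

```python
import heapq

def chunked_sort(chunks):
    # Sort each chunk independently (cheap, sequential access), then
    # k-way merge the sorted runs, again reading each run sequentially.
    runs = [sorted(chunk) for chunk in chunks]
    return list(heapq.merge(*runs))

chunked_sort([[3, 1], [9, 2, 5], [4, 0]])   # [0, 1, 2, 3, 4, 5, 9]
```

Because heapq.merge consumes its inputs lazily, the merge phase reads each sorted run front-to-back, which maps well onto out-of-core chunk storage.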
Does not work completely:
In [23]: c = ca.carray([True]*9).reshape((3,3))
In [24]: c[1,2] = False
In [26]: list(c.wheretrue())
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python2.6/dist-packages/carray/carrayExtension.so in carray.carrayExtension.carray.wheretrue (carray/carrayExtension.c:12949)()
ValueError: `self` is not an array of booleans
Han Genuit reported this problem to the mailing list:
There is an issue which cropped up when I tried out v0.4, which is
demonstrated by this code:
carr = ca.carray([1,2,3,4])  # create an array
carr[:] < 3                  # evaluate carr < 3 within NumPy
array([ True,  True, False, False], dtype=bool)
carr[carr[:] < 3]            # select the elements found
array([1, 2])
You can use fancy selection to select the elements from 'carr' with a
boolean array. But this does not seem to work for multidimensional arrays:
carr = ca.carray([[1,2,3],[4,5,6],[7,8,9]])
carr[:] < 3
array([[ True,  True, False],
       [False, False, False],
       [False, False, False]], dtype=bool)
carr[carr[:] < 3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "carrayExtension.pyx", line 1059, in carray.carrayExtension.carray.__getitem__ (carray\carrayExtension.c:9079)
ValueError: setting an array element with a sequence.
One of my interests in bcolz is the excellent compression, and I was getting on great in testing where I was only using test arrays of zeros, ones, or arange. But the first time I created a carray from my actual data (a dense million-element array of uint8 values, i.e. none higher than 255), the compression essentially stops, and I don't understand why. In this first test, a, created with bcolz's own arange and dtype="uint8", and b, created with numpy's arange, both achieve 55.99 compression. But an array created with randint achieves no compression.
In the next test you can see some compression if I pass the array in as int64 and cast it to uint8, but that only brings the total size down to what a uint8 array of this nature would occupy uncompressed.
I am very confused about whether this is a bug or what I am not understanding.
When numpy is linked with MKL and numexpr isn't, one of the tests fails:
FAIL: test11 (carray.tests.test_carray.eval_big_ne)
Testing eval() with functions like np.sin()
The arrays are actually very close, so using assert_array_almost_equal with 14 decimal places allows it to pass:
diff --git a/carray/tests/test_carray.py b/carray/tests/test_carray.py
index 19fbcf9..d6ca9a1 100644
--- a/carray/tests/test_carray.py
+++ b/carray/tests/test_carray.py
@@ -1096,7 +1096,7 @@ class evalTest(unittest.TestCase):
nr = np.sin(a) + 2 * np.log(b) - 3
#print "ca.eval ->", cr
#print "numpy ->", nr
- assert_array_equal(cr[:], nr, "eval does not work correctly")
+ assert_array_almost_equal(cr[:], nr, 14, "eval does not work correctly")
def test12(self):
"""Testing eval() with `out_flavor` == 'numpy'"""
In the last 0.5 release, several public methods in carray and ctable were actually meant to be private:
In [36]: c. # a carray
c.append c.copy c.dtype c.len c.ndim c.reshape c.size c.wheretrue
c.attrs c.cparams c.fill_chunks c.mkdirs c.next c.resize c.sum c.write_meta
c.cbytes c.create_carray c.flush c.mode c.open_carray c.rootdir c.trim
c.chunklen c.dflt c.iter c.nbytes c.read_meta c.shape c.where
In [35]: t. # a ctable
t.addcol t.cbytes t.cparams t.dtype t.iter t.mode t.ndim t.rootdir t.trim
t.append t.cols t.create_ctable t.eval t.len t.names t.open_ctable t.shape t.where
t.attrs t.copy t.delcol t.flush t.mkdir_rootdir t.nbytes t.resize t.size
This parameter can be useful for the end user in many situations.
There is a miscalculation in the chunk.pointer
code:
@property
def pointer(self):
return <Py_uintptr_t> self.data + BLOSCPACK_HEADER_LENGTH
Let me demonstrate what is wrong. First let's set the scene:
In [1]: import bcolz
In [2]: a = np.arange(100000)
In [3]: c = bcolz.carray(a)
In [4]: chunk = c.chunks[0]
In [5]: chunk
Out[5]:
chunk(int64) nbytes: 262144; cbytes: 4720; ratio: 55.54
'[ 0 1 2 ..., 32765 32766 32767]'
We now have a chunk in memory. Let's try to use other means to decompress it:
In [6]: import ctypes
In [7]: comp = ctypes.string_at(chunk.pointer, 4720)
In [8]: import blosc
In [9]: dcmp = blosc.decompress(comp)
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-9-8f8c24dd6b78> in <module>()
----> 1 dcmp = blosc.decompress(comp)
/home/esc/anaconda/lib/python2.7/site-packages/blosc/toplevel.pyc in decompress(bytesobj)
440 _check_bytesobj(bytesobj)
441
--> 442 return _ext.decompress(bytesobj)
443
444
error: Error 4720 : not a Blosc buffer or header info is corrupted
I have used the unmodified pointer and a length value of 4720, as indicated by the __repr__ of chunk.
My suspicion is that the pointer should not be offset by BLOSCPACK_HEADER_LENGTH (which is still 16, since bcolz uses an older version of bloscpack), so let's try that:
In [10]: comp = ctypes.string_at(chunk.pointer-16, 4720)
In [11]: dcmp = blosc.decompress(comp)
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-11-8f8c24dd6b78> in <module>()
----> 1 dcmp = blosc.decompress(comp)
/home/esc/anaconda/lib/python2.7/site-packages/blosc/toplevel.pyc in decompress(bytesobj)
440 _check_bytesobj(bytesobj)
441
--> 442 return _ext.decompress(bytesobj)
443
444
error: Error 4720 : not a Blosc buffer or header info is corrupted
Okay so that doesn't work either. Let's use bloscpack to discover the contents of the blosc header, if possible:
In [12]: import bloscpack
In [13]: bloscpack.headers.decode_blosc_header(comp)
Out[13]:
{'blocksize': 32768,
'ctbytes': 4592,
'flags': 1,
'nbytes': 262144,
'typesize': 8,
'version': 2,
'versionlz': 1}
Okay great, so we need to subtract 128 from the length (I'll explain why later):
In [14]: comp = ctypes.string_at(chunk.pointer-16, 4592)
In [15]: dcmp = blosc.decompress(comp)
In [16]: dcmp[:8]
Out[16]: '\x00\x00\x00\x00\x00\x00\x00\x00'
In [17]: dcmp[:9]
Out[17]: '\x00\x00\x00\x00\x00\x00\x00\x00\x01'
And, as desired, we recover the first 8-byte zero, and all the rest of course too.
So, one issue is the chunk.pointer offset, which can be fixed easily. The other issue, the 128 extra bytes, comes from adding the approximate footprint of this instance in bytes to the chunk:
footprint += 128 # add the (aprox) footprint of this instance in bytes
Maybe we do want this, maybe not; I am not sure. It was somewhat confusing in this instance.
Hi!
Not an issue but more as an enhancement discussion, something i'm looking at with @fran-xeco and we are wondering about what we should do.
The where clause works fine in the tutorial examples, but you run into issues when the "in" selection is large. When you want to select on 1,000 or 50,000 values there's a problem: with a limited number of values you can chain "or" conditions in the where clause, but that doesn't seem great for performance, and with a larger number it runs into errors because of the maximum nesting.
So we checked numpy & pandas:
Numpy in1d
Numpy has the in1d clause but it's not really great in performance in our tests (and I don't see how I could apply it easily to bcolz ;)
http://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html
Pandas isin()
Pandas solves it itself by looping through the series one by one in Cython with a "value in set" check, generating a boolean mask:
https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L124
So if you would have a clause like a == 1 and b == 2 and c in [3, 4, ..., 5000] you could do something like
result = [row for row in btable_example.where('a==1 and b==2') if row['c'] in (3, 4, ..., 5000)]
but then probably cythonized to make it fast. Ideally, where should also accept lists and handle this internally, so you would just say:
result = [row for row in btable_example.where('a==1 and b==2 and c == [3, 4, ..., 5000]')]
But that does mean that internally it needs to review the string and see what numexpr will handle and what will be done otherwise.
I'm a bit clueless if numexpr can be applied for the first 2 filters and a cython filter for the third one (would this break the vectorization of the numexpr selection?)
Can you give some ideas of what would be the best solution direction and how we could implement this?
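One possible direction, sketched in plain Python over dicts (filter_in and the row format are hypothetical stand-ins, not bcolz API): run the numexpr-friendly predicates through where as usual, then apply the membership test against a set in a second pass, as pandas' isin does.

```python
def filter_in(rows, col, allowed):
    # Second-pass filter: O(1) membership via a set, applied to the
    # rows that survived the numexpr-style predicate.
    allowed = set(allowed)
    return [row for row in rows if row[col] in allowed]

# Stand-in for the output of btable_example.where('a==1 and b==2'):
prefiltered = [{'c': 3}, {'c': 99}, {'c': 4}]
filter_in(prefiltered, 'c', [3, 4, 5000])   # [{'c': 3}, {'c': 4}]
```

Since the set check runs outside numexpr, it would not disturb the vectorized evaluation of the first two filters; it only post-processes their result stream.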
Currently bcolz uses parts of an older version of Bloscpack for historical reasons.
c = ca.carray([[1,2,3],[4,5,6],[7,8,9]])
ca.eval('c>5', out_flavor='numpy')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/carray/toplevel.py", line 502, in eval
**kwargs)
File "/usr/local/lib/python2.6/dist-packages/carray/toplevel.py", line 575, in _eval_blocks
result[:bsize] = res_block
ValueError: output operand requires a reduction, but reduction is not enabled
What is the intended behavior of carray.__next__?
In [1]: import bcolz
In [2]: b = bcolz.carray([1, 2, 3])
In [3]: next(b)
StopIteration:
I would expect either a TypeError because __next__ isn't implemented, or 1.
In [1]: import bcolz
In [2]: b = bcolz.ctable([['a', 'b', 'c']], dtype='U4', names=['text'])
In [3]: b.where('text == "b"')
ValueError: unkown type unicode128
Although this may be a numexpr issue.
Using ctable.where sets state that affects future operations, causing some confusing issues. It took me quite a while to track this down.
In [1]: import bcolz
In [2]: bc = bcolz.ctable([[1, 2, 3], [10, 20, 30]], names=['a', 'b'])
In [3]: bc.where('a >= 2') # call .where but don't do anything with it
Out[3]: <itertools.imap at 0x7fd7a84f5750>
In [4]: list(bc['b']) # Later iterate over table, get where result
Out[4]: [20, 30]
After creating a ctable:
In [6]: ct = ca.ctable((np.array([1,2,3]),np.array([4,5,6])), ('a', 'b'))
Adding a new column as list fails:
In [7]: ct.addcol([7,8,9], 'c')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in addcol(self, newcol, name, pos, **kwargs)
323 self.cols[name] = newcol
324 # Update _arr1
--> 325 self._arr1 = np.empty(shape=(1,), dtype=self.dtype)
326
327
/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in dtype(self)
55 "The data type of this ctable (numpy dtype)."
56 names, cols = self.names, self.cols
---> 57 l = [(name, cols[name].dtype) for name in names]
58 return np.dtype(l)
59
AttributeError: 'list' object has no attribute 'dtype'
Consequently breaking the ctable:
In [10]: ct[0]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __getitem__(self, key)
603 ra = self._arr1.copy()
604 # Fill it
--> 605 ra[0] = tuple([self.cols[name][key] for name in self.names])
606 return ra[0]
607 # Slices
ValueError: size of tuple must match number of fields.
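Per the traceback, the dtype property expects every column to have a .dtype attribute, so a plausible workaround (not verified against carray here; the addcol call is left commented) is to wrap the list in an ndarray first:

```python
import numpy as np

newcol = np.array([7, 8, 9])   # has .dtype, unlike a plain list
# ct.addcol(newcol, 'c')       # should no longer hit the AttributeError
```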
I successfully created a trillion-element bcolz array on disk. Fantastic: the compression and the disk-backed storage make a 2 terabyte array occupy 13.7 GB on disk. The problem is, I want a multidimensional array, and I can't reshape it to (1000000, 1000000) using .reshape because it does it all in RAM, which is a non-starter. Then I tried to create the shape from the start using .zeros((1000000, 1000000)) or .ones, and again it just does it in memory. Is there a way to create such an array? If not, can you change this behavior?
When selecting slices from a carray, fancy indexing works as expected for all axes that are not the leading axis. A fancy index applied to the leading axis triggers a TypeError. I could not find the error message in the bcolz codebase, so maybe 0.7.1 fixed it?
bcolz version: 0.7.0
(Apologies for missing In/Out prompts.)
Build a carray:
import bcolz as bz
a = np.random.randint(0, 3, (4, 3))
c = bz.carray(a)
c
carray((4, 3), int64)
nbytes: 96; cbytes: 15.98 KB; ratio: 0.01
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[1 2 2]
[1 1 2]
[0 0 0]
[2 2 0]]
This works:
c[:, [0, 2]]
array([[1, 2],
[1, 2],
[0, 0],
[2, 0]])
This does not:
c[[0, 2], :]
Traceback (most recent call last):
File "<pyshell#15>", line 1, in
c[[0, 2], :]
File "carray_ext.pyx", line 1879, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21106)
File "carray_ext.pyx", line 1897, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21344)
File "carray_ext.pyx", line 1913, in bcolz.carray_ext.carray.getitem (bcolz/carray_ext.c:21618)
TypeError: object of type 'numpy.int64' has no len()
nose should not be necessary; only unittest2 should be.
Try the following invalid dtype:
carray.carray([], dtype="string")
This crashes the Python interpreter with:
Floating point exception: 8
It should raise an exception instead.
Python version: 2.7.3
To be inspired by:
https://pandas-docs.github.io/pandas-docs-travis/categorical.html
bcolz port of:
ContinuumIO/blz#11
In [5]: import bcolz
In [6]: b = bcolz.arange(1e8)
In [7]: b.chunks[0]
Out[7]: <repr(<bcolz.carray_ext.chunk at 0x7f4dff38d578>) failed: AttributeError: 'bcolz.carray_ext.chunk' object has no attribute 'shape'>
I had a look at the code and could probably fix it and submit a PR, but I don't understand the intention. What should shape be?
def __repr__(self):
"""Represent the chunk as an string, with additional info."""
cratio = self.nbytes / float(self.cbytes)
fullrepr = "chunk(%s, %s) nbytes: %d; cbytes: %d; ratio: %.2f\n%r" % \
(self.shape, self.dtype, self.nbytes, self.cbytes, cratio, str(self))
return fullrepr
PyCharm files?
When calculating array derivative (diff) carray shortens the array:
import carray as ca
import numpy as np
carr = ca.arange(1000000)
diff_arr = ca.eval("np.diff(carr)", vm="python")
nd_arr = np.diff(carr)
print "Number of elements in carr:", len(carr)
print "Number of elements in diff_arr:", len(diff_arr)
print "Number of elements in nd_arr:", len(nd_arr)
This returns on my computer:
Number of elements in carr: 1000000
Number of elements in diff_arr: 999877
Number of elements in nd_arr: 999999
Derivatives calculated with ndarray and carray have different lengths.
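For reference, NumPy's contract is that diff of an n-element array has exactly n - 1 elements, so the 999877 above suggests (this is a guess) that eval's block-wise evaluation drops elements at block boundaries:

```python
import numpy as np

a = np.arange(1000000)
# np.diff of an n-element array always has n - 1 elements:
len(np.diff(a))   # 999999
```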
In [1]: import bcolz
In [2]: b = bcolz.ctable([[1, 2, 3], [1., 2., 3.]])
In [3]: b
Out[3]:
ctable((3,), [('f0', '<i8'), ('f1', '<f8')])
nbytes: 48; cbytes: 32.00 KB; ratio: 0.00
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(1, 1.0) (2, 2.0) (3, 3.0)]
In [4]: b.append([[4, 5, 6], [4., 5., 6.]])
In [5]: b
Out[5]:
ctable((4,), [('f0', '<i8'), ('f1', '<f8')])
nbytes: 96; cbytes: 32.00 KB; ratio: 0.00
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(3, 3.0) (4, 4.0) (5, 5.0) (6, 6.0)]
In [6]: b.shape
Out[6]: (4,)
In [7]: b[5]
Out[7]: (6, 6.0)
For some reason the path to the bcolz table is baked into the metadata. This makes using the table from other locations or copying tables around very difficult.
In [1]: pwd
Out[1]: u'/home/mrocklin/tmp'
In [2]: import bcolz
In [3]: bcolz.ctable([[1, 2, 3], [1., 2., 3.]], rootdir='foo/mytable.bcolz')
Out[3]:
ctable((3,), [('f0', '<i8'), ('f1', '<f8')])
nbytes: 48; cbytes: 32.00 KB; ratio: 0.00
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
rootdir := 'foo/mytable.bcolz'
[(1, 1.0) (2, 2.0) (3, 3.0)]
In [4]: b = bcolz.ctable(rootdir='foo/mytable.bcolz') # works fine from same directory
In [5]: cd ..
/home/mrocklin
In [6]: b = bcolz.ctable(rootdir='tmp/foo/mytable.bcolz') # fails from other directory
IOError: [Errno 2] No such file or directory: u'foo/mytable.bcolz/f0/meta/sizes'
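Until that's fixed, a possible mitigation (a guess, not verified bcolz behavior) is to pass an absolute rootdir at creation time, so the path baked into the metadata does not depend on the current working directory:

```python
import os

# Resolve the rootdir before creating the ctable (the bcolz call
# itself is left commented; only the path handling is shown):
rootdir = os.path.abspath('foo/mytable.bcolz')
# bcolz.ctable([[1, 2, 3], [1., 2., 3.]], rootdir=rootdir)
```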
Hi
In the init of the 0.5 version of the module (installed by pip today) there is this line:
import carray.test as test
but the interpreter (IPython 0.12.1 in my case) is complaining because there is no such carray.test module to import.
Commenting out the line is enough to fix this issue.
Am I being stupid? When I append to a bcolz array, it seems to work and its shape changes. But after I Ctrl-Z it and then try to reload with bcolz.open, it only opens the original array; none of the appended data is there, certainly not persisted. What am I missing? Is it something to do with .flush()?
As suggested by @CarstVaartjes in #63 this would allow access to the carray from other Cython code.
In [11]: ct = ca.ctable(np.array([1,3]), 'a')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
125 raise ValueError, "cannot convert `names` into a list"
126 if len(names) != len(cols):
--> 127 raise ValueError, "`cols` and `names` must have the same length"
128 # Check name validity
129 nt = namedtuple('_nt', names, verbose=False)
ValueError: `cols` and `names` must have the same length
Related?
In [13]: ct = ca.ctable(np.array([1,3]))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python2.6/dist-packages/carray/ctable.pyc in __init__(self, cols, names, **kwargs)
115 if names is None:
116 if isinstance(cols, np.ndarray): # ratype case
--> 117 names = list(cols.dtype.names)
118 else:
119 names = ["f%d"%i for i in range(len(cols))]
TypeError: 'NoneType' object is not iterable
It looks like the setup.py file is missing.
Hi,
While running some tests I stumbled upon something unexpected: a bcolz ctable containing different data types for items that should have the same data type (code copied and modified from the bcolz tests).
Some objects are of type numpy.ndarray, while all objects inside the same column should have the same data type.
Is this issue known? If yes, is this a bcolz or a numpy problem, or am I misunderstanding something?
I tried on different machines and all of them had the same issue.
Could this issue be related to #21, or are they completely independent?
Can somebody else reproduce this issue (code below)? A summary of the output object types is given in the comments.
import numpy as np
import bcolz
from collections import defaultdict
def check_types(col):
    check = defaultdict(int)
    for i in range(len(col)):
        val = col[i]
        check[str(type(val))] += 1
    return dict(check)

if __name__ == '__main__':
    ra = np.fromiter(((i, i * 2., i * 3)
                      for i in range(500000)), dtype='i4,f8,i8')
    t = bcolz.ctable(ra)
    print(check_types(t['f0']))
    # {"<type 'numpy.ndarray'>": 401407, "<type 'int'>": 98593}
    print(check_types(t['f1']))
    # {"<type 'float'>": 37153, "<type 'numpy.ndarray'>": 462847}
    print(check_types(t['f2']))
    # {"<type 'numpy.ndarray'>": 462847, "<type 'int'>": 37153}
Thank you guys for the great work you are doing in this project.