xhochy / fletcher
Pandas ExtensionDType/Array backed by Apache Arrow
Home Page: https://fletcher.readthedocs.io/
License: MIT License
Currently we don't support `__setitem__` on any of the ExtensionArray classes. Supporting it will not be simple, as Arrow arrays are immutable; an implementation of `__setitem__` will therefore have to create a copy on each operation. This is also required to support `fillna()`.
Otherwise they look like they are official Pandas types.
`dask.dataframe` should also be able to handle `fletcher` columns and accessors. Thus we should have at least tests that confirm:
- `dask.dataframe` can have `fletcher.Fletcher{Chunked,Continuous}Array` columns
- the `fr_text` accessor is working with `dask.dataframe`
Currently slices are not supported. For this to work, Arrow's internal offset needs to be exposed to Python.
Currently the description on https://pypi.org/project/fletcher/ is empty. It should be filled with the README.
We should also ensure that the README contains a simple code example.
The current master fails due to the new py.test version.
Set up the initial infrastructure for sphinx documentation
We will depend on Pandas and Apache Arrow master for a while. They will also change from time to time and breakage is expected. Once we have CI up and running, we should also check continuously that upstream introduces no new breakages.
See failing tests:
tests/test_pandas_integration.py::test_set_index
tests/test_pandas_integration.py::test_copy
Currently we spend a lot of time in creating the initial conda environment. We should build a custom Docker image that comes with conda et al. installed and then only install dependencies like Pandas or Arrow that change more often.
We support string functions using the `fr_text` and `text` accessors. These only work on `fletcher` columns. To ease the conversion from the `object`/`string` dtype to a fletcher-based string type, we should support the following:
- `fr_str` is an accessor that works on `fletcher` and `pandas` columns.
- `fr_strx` only works on `fletcher` columns. This is useful for users that want to be certain that accelerated operations are used.
- Rename the `text` and `fr_text` accessors to be in line with `pandas` naming.
Is there some benchmark on the `str_concat` operation? On my local machine I've tried a naive Python implementation and got better results than with `NumbaStringArray`:
import numpy as np
import pyarrow as pa
from fletcher._numba_compat import NumbaStringArray, buffers_as_arrays
from fletcher._algorithms import str_concat
a1 = pa.array(np.random.rand(10**6).astype(str).astype('O'))
a2 = pa.array(np.random.rand(10**6).astype(str).astype('O'))
%timeit pa.array([x + y for x, y in zip(a1.to_pandas(), a2.to_pandas())])
# 860 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit str_concat(NumbaStringArray.make(a1), NumbaStringArray.make(a2))
# 1.11 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is this something that you expect?
Pseudo-Code:
Inputs: others, sep, na_rep
# TODO: Write pseudo code for all of the possible variants
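Since the variants are still TODO, here is a hedged pure-Python sketch of the `sep`/`na_rep` semantics over two columns (function and variable names are illustrative, not fletcher API; lists stand in for Arrow arrays, `None` for null):

```python
def str_cat(rows, others, sep="", na_rep=None):
    # Mimics the pandas `Series.str.cat(others, sep=..., na_rep=...)` contract:
    # if either side is null and na_rep is None, the result is null;
    # otherwise nulls are replaced by na_rep before joining with sep.
    output = []
    for left, right in zip(rows, others):
        if left is None or right is None:
            if na_rep is None:
                output.append(None)
                continue
            left = na_rep if left is None else left
            right = na_rep if right is None else right
        output.append(left + sep + right)
    return output

str_cat(["a", None, "c"], ["x", "y", None], sep="-", na_rep="?")
# → ["a-x", "?-y", "c-?"]
```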
The current pandas extension unit tests for `TestBaseGroupbyTests` are marked as `pytest.mark.xfail`, c.f. https://github.com/xhochy/fletcher/blob/master/tests/string_array/test_pandas_extension.py
The `xfail` marker should be removed and the failing tests should be fixed in the `StringArray` and/or `FletcherArrayBase` implementation.
The current pandas extension unit tests for `TestBaseMissingTests` are marked as `pytest.mark.xfail`, c.f. https://github.com/xhochy/fletcher/blob/master/tests/string_array/test_pandas_extension.py
The `xfail` marker should be removed and the failing tests should be fixed in the `StringArray` and/or `FletcherArrayBase` implementation.
Currently we use the default implementation https://github.com/pandas-dev/pandas/blob/2cbdd9a2cd19501c98582490e35c5402ae6de941/pandas/core/arrays/base.py#L382 that simply converts to object, runs Pandas' `unique`, and then recreates a new ExtensionArray instance.
This depends on the Arrow upstream issue https://issues.apache.org/jira/browse/ARROW-2663
In [319]: pa.__version__
Out[319]: '0.15.1'
In [320]: fletcher.pandas_from_arrow(tbl)
C:\Users\dhirschf\envs\dev\lib\site-packages\fletcher\base.py:731: FutureWarning: Calling .data on ChunkedArray is provided for compatibility after Column was removed, simply drop this attribute
data[col.name] = FletcherArray(col.data)
Traceback (most recent call last):
File "<ipython-input-320-a9ca90d82a7d>", line 1, in <module>
fletcher.pandas_from_arrow(tbl)
File "C:\Users\dhirschf\envs\dev\lib\site-packages\fletcher\base.py", line 731, in pandas_from_arrow
data[col.name] = FletcherArray(col.data)
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'name'
In [321]: type(tbl)
Out[321]: pyarrow.lib.Table
py37/win64 - fletcher=0.2.0
Pseudo-Code:
Inputs: pat
output = BooleanArray(len(rows))
for i, row in enumerate(rows):
    if isnull(row):
        output.setnull(i)
    else:
        output[i] = (row[-len(pat):] == pat)
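A runnable pure-Python rendering of this pseudo-code (hedged: a list stands in for `BooleanArray`, `None` for null):

```python
def str_endswith(rows, pat):
    # One boolean (or None for null) per row, as in the pseudo-code above.
    output = [None] * len(rows)
    for i, row in enumerate(rows):
        if row is None:
            continue  # leave the null marker in place
        output[i] = row[-len(pat):] == pat
    return output

str_endswith(["fletcher", None, "arrow"], "ow")
# → [False, None, True]
```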
Some rows are at least as long as `width` (need no centering) and some aren't (need centering).
Pseudo-Code:
Inputs: width, fillchar
builder = StringBuilder()
for row in rows:
    if utf8_len(row) >= width:
        builder.append(row)
    else:
        n_missing = width - utf8_len(row)
        left = (n_missing // 2) + (n_missing % 2)
        right = (n_missing // 2)
        builder.append(fillchar * left + row + fillchar * right)
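The padding arithmetic can be exercised with a direct pure-Python translation (a sketch; `utf8_len` becomes Python's `len`, which is already per-code-point on `str`):

```python
def str_center(rows, width, fillchar=" "):
    output = []
    for row in rows:
        if len(row) >= width:
            output.append(row)
        else:
            n_missing = width - len(row)
            left = n_missing // 2 + n_missing % 2  # odd padding goes left
            right = n_missing // 2
            output.append(fillchar * left + row + fillchar * right)
    return output

str_center(["ab", "abcdef"], 5, "*")
# → ["**ab*", "abcdef"]
```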
I often get asked whether `fletcher` will just make strings faster or whether it is planned to replace `pandas` completely. Both assumptions are wrong, and we should describe the actual intentions of the project in the documentation.
Pseudo-Code:
Inputs: -
builder = StringBuilder()
for row in rows:
    builder.append(map_utf8_characters(utf8_casefold, row))
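Python's built-in `str.casefold` already implements the Unicode case-folding step, so a null-aware sketch of the per-row mapping can lean on it (list/`None` conventions as above):

```python
def str_casefold(rows):
    # Case-folding is more aggressive than lowercasing: e.g. the German
    # sharp s "ß" folds to "ss", which .lower() leaves unchanged.
    return [None if row is None else row.casefold() for row in rows]

str_casefold(["Straße", None, "ARROW"])
# → ["strasse", None, "arrow"]
```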
fletcher/base.py uses `pa.Column`, but it is no longer there in v0.15.0.
Add a `list` type to the unit tests and fix the failing ones.
As we want to expand the scope of this project to handle all Arrow types as Pandas extension arrays, we should extract the functionality that is independent of the underlying Arrow type to a common base class.
Pseudo-Code:
Inputs: pat
output = IntArray(len(rows))
for i, row in enumerate(rows):
    count = 0
    for offset in range(len(row)):
        if pat == row[offset:offset + len(pat)]:
            count += 1
    output[i] = count
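Translated to runnable Python (a sketch; note this counts overlapping matches, while pandas' regex-based `str.count` counts non-overlapping ones, and it adds the null handling the pseudo-code glosses over):

```python
def str_count(rows, pat):
    output = [None] * len(rows)
    for i, row in enumerate(rows):
        if row is None:
            continue  # propagate nulls
        count = 0
        for offset in range(len(row) - len(pat) + 1):
            if row[offset:offset + len(pat)] == pat:
                count += 1
        output[i] = count
    return output

str_count(["aaa", "banana", None], "aa")
# → [2, 0, None]
```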
The current output is simply `<fletcher.string_array.StringArray at 0x7ff472ebf828>`. Ideally we should be able to mostly forward to the repr of `pyarrow.ChunkedArray`, which is also currently not really helpful. The upstream Arrow issue to fix this is https://issues.apache.org/jira/browse/ARROW-889
The current pandas extension unit tests for `TestBaseReshapingTests` are marked as `pytest.mark.xfail`, c.f. https://github.com/xhochy/fletcher/blob/master/tests/string_array/test_pandas_extension.py
The `xfail` marker should be removed and the failing tests should be fixed in the `StringArray` and/or `FletcherArrayBase` implementation.
`TestBaseSetitemTests::test_setitem_preserves_views` is failing as views don't update their parent on `__setitem__`.
`series[series.isna()] = series[series.isna()]` removes all null entries and replaces them with garbage.
Set up packaging and upload to PyPI
It's not clear how to use `benchmarks.py`. IMO it would be nice to have a main method in that file that is able to run the benchmarks and print results.
(can use Python's `in` in case `case=True`, otherwise we need a `lower` implementation)
Pseudo-Code:
Inputs: pat
output = BooleanArray(len(rows))
for i, row in enumerate(rows):
    output[i] = pat in row
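A null-aware pure-Python sketch, including the `case` flag the note above hints at (names are illustrative, not fletcher API):

```python
def str_contains(rows, pat, case=True):
    # Case-insensitive matching lowers both the pattern and each row.
    if not case:
        pat = pat.lower()
    output = [None] * len(rows)
    for i, row in enumerate(rows):
        if row is None:
            continue  # propagate nulls
        haystack = row if case else row.lower()
        output[i] = pat in haystack
    return output

str_contains(["Fletcher", None, "arrow"], "fle", case=False)
# → [True, None, False]
```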
Currently take converts to an object array and delegates to Pandas' implementation. This is much slower than it has to be. Depends on the upstream Arrow issue https://issues.apache.org/jira/browse/ARROW-2667
The numba implementation should be much slower than Arrow's StringBuilder.
Question: what can numba interface with? The docs mention cffi.
We can store dates in `fletcher` that `pandas` cannot store, as we allow for other precisions than nanoseconds. Sadly, our code currently converts to nanoseconds for printing a DataFrame.
Reproducible example:
import fletcher as fr
import pandas as pd
import datetime
df = pd.DataFrame({
"date": fr.FletcherContinuousArray([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])
})
print(df.head())
Exception:
Traceback (most recent call last):
File "extreme_dates.py", line 8, in <module>
print(df.head())
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 680, in __repr__
self.to_string(
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 820, in to_string
return formatter.to_string(buf=buf, encoding=encoding)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 914, in to_string
return self.get_result(buf=buf, encoding=encoding)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 521, in get_result
self.write_result(buf=f)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 823, in write_result
strcols = self._to_str_columns()
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 759, in _to_str_columns
fmt_values = self._format_col(i)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 948, in _format_col
return format_array(
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1172, in format_array
return fmt_obj.get_result()
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1203, in get_result
fmt_values = self._format_strings()
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1489, in _format_strings
array = np.asarray(values)
File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "/Users/uwe/Development/fletcher/fletcher/base.py", line 328, in __array__
return self.data.to_pandas().values
File "pyarrow/array.pxi", line 567, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/array.pxi", line 1027, in pyarrow.lib.Array._to_pandas
File "pyarrow/array.pxi", line 1209, in pyarrow.lib._array_like_to_pandas
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000
To ensure that we have sufficient coverage of our tests, we should upload the code coverage to codecov to get automated reporting on pull requests.
I've got version `0.2.0` installed, but I can't find that out from within Python as `fletcher.__version__` doesn't exist.
It's very convenient within Python to be able to introspect the version of installed packages.
Currently we have no continuous integration; this will lead to bugs that go unnoticed. We should add either Travis or CircleCI.
The failure is without `fletcher` in the stacktrace, so I'm a bit confused:
tests/test_pandas_extension.py::TestBaseSetitemTests::test_setitem_integer_array[True-fletcher_type7-chunked-list] FAILED
>>>>>>>>>> traceback >>>>>>>>>>
self = 0 ['B' 'C']
1 ['A']
2 [None]
3 ['A' 'A']
4 []
dtype: fletcher_chunked[list<item: string>], key = [0, 1, 2], value = ['B', 'C']
def __setitem__(self, key, value):
key = com.apply_if_callable(key, self)
cacher_needs_updating = self._check_is_chained_assignment_possible()
try:
> self._set_with_engine(key, value)
../pandas/pandas/core/series.py:982:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = 0 ['B' 'C']
1 ['A']
2 [None]
3 ['A' 'A']
4 []
dtype: fletcher_chunked[list<item: string>], key = [0, 1, 2], value = ['B', 'C']
def _set_with_engine(self, key, value):
# fails with AttributeError for IntervalIndex
> loc = self.index._engine.get_loc(key)
../pandas/pandas/core/series.py:1015:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/index.pyx:61:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E TypeError: '[0, 1, 2]' is an invalid key
pandas/_libs/index.pyx:66: TypeError
During handling of the above exception, another exception occurred:
self = <test_pandas_extension.TestBaseSetitemTests object at 0x12679e450>, data = <FletcherChunkedArray>
[['B', 'C'], ['A'], [None], ['A', 'A'], [], ['B', 'C'],
['A'], [None..., ['B', 'C'],
['A'], [None], ['A', 'A'], []]
Length: 100, dtype: fletcher_chunked[list<item: string>], idx = [0, 1, 2], box_in_series = True
@pytest.mark.parametrize(
"idx",
[[0, 1, 2], pd.array([0, 1, 2], dtype="Int64"), np.array([0, 1, 2])],
ids=["list", "integer-array", "numpy-array"],
)
def test_setitem_integer_array(self, data, idx, box_in_series):
arr = data[:5].copy()
expected = data.take([0, 0, 0, 3, 4])
if box_in_series:
arr = pd.Series(arr)
expected = pd.Series(expected)
> arr[idx] = arr[0]
../pandas/pandas/tests/extension/base/setitem.py:153:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../pandas/pandas/core/series.py:1008: in __setitem__
self._set_with(key, value)
../pandas/pandas/core/series.py:1051: in _set_with
self._set_labels(key, value)
../pandas/pandas/core/series.py:1065: in _set_labels
self._set_values(indexer, value)
../pandas/pandas/core/series.py:1070: in _set_values
self._data = self._data.setitem(indexer=key, value=value)
../pandas/pandas/core/internals/managers.py:544: in setitem
return self.apply("setitem", **kwargs)
../pandas/pandas/core/internals/managers.py:424: in apply
applied = getattr(b, f)(**kwargs)
../pandas/pandas/core/internals/blocks.py:1816: in setitem
check_setitem_lengths(indexer, value, self.values)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
indexer = array([0, 1, 2]), value = ['B', 'C'], values = <FletcherChunkedArray>
[['B', 'C'], ['A'], [None], ['A', 'A'], []]
Length: 5, dtype: fletcher_chunked[list<item: string>]
def check_setitem_lengths(indexer, value, values) -> None:
"""
Validate that value and indexer are the same length.
An special-case is allowed for when the indexer is a boolean array
and the number of true values equals the length of ``value``. In
this case, no exception is raised.
Parameters
----------
indexer : sequence
Key for the setitem.
value : array-like
Value for the setitem.
values : array-like
Values being set into.
Returns
-------
None
Raises
------
ValueError
When the indexer is an ndarray or list and the lengths don't match.
"""
# boolean with truth values == len of the value is ok too
if isinstance(indexer, (np.ndarray, list)):
if is_list_like(value) and len(indexer) != len(value):
if not (
isinstance(indexer, np.ndarray)
and indexer.dtype == np.bool_
and len(indexer[indexer]) == len(value)
):
raise ValueError(
> "cannot set using a list-like indexer "
"with a different length than the value"
)
E ValueError: cannot set using a list-like indexer with a different length than the value
../pandas/pandas/core/indexers.py:115: ValueError
Before spamming the arrow repo or email, should we use GH to discuss or would you prefer something else?
Anyway, I have some questions regarding the arrow API:
test_array = pa.array(["Test", "string", None])
buffers = test_array.buffers()
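To make concrete what those buffers hold, the Arrow variable-length string layout (validity bitmap, int32 offsets, UTF-8 data) can be emulated in plain Python. This is a sketch of the format per the Arrow columnar spec, not the actual pyarrow buffer API:

```python
def build_string_buffers(values):
    # Arrow string arrays consist of three buffers:
    #   validity: one bit per row (least-significant-bit order), 1 = non-null
    #   offsets:  len(values) + 1 int32 positions into the data buffer
    #   data:     the concatenated UTF-8 bytes of all non-null rows
    validity = 0
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity |= 1 << i
            data += value.encode("utf-8")
        offsets.append(len(data))
    return validity, offsets, bytes(data)

build_string_buffers(["Test", "string", None])
# → (0b011, [0, 4, 10, 10], b"Teststring")
```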
Set up conda packaging and include `fletcher` in https://anaconda.org/conda-forge
Just wondering if a function `fletcher.read_csv` would be in scope, which reads CSV data directly into Arrow tables?
Whilst `pd.read_csv` is damn good and the workhorse of many data analytics pipelines, it suffers slightly from pandas' limited type system, which I'm hoping could be improved using native Arrow types. Also it would be nice to not have to go through pandas/Python types at all and so avoid the serialization cost.
I have tried, perhaps incorrectly, to convert my column to the pyarrow string type as follows:
fletcher_string_dtype = fr.FletcherDtype(pa.string())
df['string_col'] = df.string_col.astype(fletcher_string_dtype)
But now I can't do string functions on it because I get the error message `AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas`. Specifically, I'm trying to do `.str.contains()`.
I may be casting the column incorrectly, or it may be that there's no value in using fletcher for this.
I saw in your talk that `groupby` was a nice use case. Related to this question: what are the best use cases for this dtype? Just a link to some additional reading material would be great.
Pseudo-Code:
Inputs: -
builder = StringBuilder()
for row in rows:
    builder.append(map_utf8_characters(utf8_capitalize, row))
[…]
Pseudo-Code:
Inputs: start, end, step
builder = StringBuilder()
for row in rows:
    if isnull(row):
        builder.addnull()
    else:
        builder.add(utf8_slice(row, start, end, step))
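On Python `str` objects, `utf8_slice` is just code-point-level slicing, so a runnable sketch is short (list/`None` conventions as in the other sketches):

```python
def str_slice(rows, start=None, end=None, step=None):
    # Slice each row by Unicode code points; None rows stay null.
    return [None if row is None else row[start:end:step] for row in rows]

str_slice(["abcdef", None, "xy"], 1, 5, 2)
# → ["bd", None, "y"]
```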
We want to support strings (UTF-8 encoded) as fast as possible inside of `pandas`. Therefore we need to implement several things. This will be split into many issues and hard to track just with the issue search, so we will list them all here.
We try to add the functionality in three stages:
1. Provide a string type that is computationally not faster than `pandas.StringDtype` but already provides the API to `fletcher` users. This will allow us to add faster implementations bit-by-bit while already providing a fully usable library.
2. Move from the `pandas`/`object` implementation to ours.
3. Implement the algorithms with `numba`. This will allow us to provide a fast algorithm with less implementation overhead than adding it to Apache Arrow.
It seems serialisation is not yet supported by `FletcherArray`. Here's my `DataFrame`:
>>> df_amd.dtypes
time datetime64[ns]
evt_name fletcher[string]
evt_value float64
evt_unit fletcher[string]
bus uint64
route fletcher[string]
stop_code fletcher[int64]
stop fletcher[string]
lat float64
lon float64
dtype: object
And here are the backtraces when I try to serialise to various formats (`.to_csv(..)` works):
HDF5
>>> df_amd.to_hdf('data/road_safety.h5', 'AMD')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 1996, in to_hdf
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 279, in to_hdf
f(store)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 273, in <lambda>
f = lambda store: store.put(key, value, **kwargs)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
self._write_to_group(key, value, append=append, **kwargs)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
self.write_array('block%d_values' % i, blk.values, items=blk_items)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2686, in write_array
value = value.T
AttributeError: 'FletcherArray' object has no attribute 'T'
Parquet
>>> df_amd.to_parquet('data/ahmedabad_event_report.parquet')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1945, in to_parquet
compression=compression, **kwargs)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 257, in to_parquet
return impl.write(df, path, compression=compression, **kwargs)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 118, in write
table = self.api.Table.from_pandas(df)
File "pyarrow/table.pxi", line 1136, in pyarrow.lib.Table.from_pandas
File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 386, in dataframe_to_arrays
convert_types))
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 375, in convert_column
raise e
File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 369, in convert_column
return pa.array(col, from_pandas=True, type=ty)
File "pyarrow/array.pxi", line 182, in pyarrow.lib.array
File "pyarrow/array.pxi", line 76, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column evt_name with type fletcher[string]')
Feather
>>> df = df_amd.reset_index()
>>> df.to_feather('data/ahmedabad_event_report.feather')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1892, in to_feather
to_feather(self, fname)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
feather.write_dataframe(df, path)
File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 181, in write_feather
writer.write(df)
File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
batch = RecordBatch.from_pandas(df, preserve_index=False)
File "pyarrow/table.pxi", line 901, in pyarrow.lib.RecordBatch.from_pandas
File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 386, in dataframe_to_arrays
convert_types))
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 375, in convert_column
raise e
File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 369, in convert_column
return pa.array(col, from_pandas=True, type=ty)
File "pyarrow/array.pxi", line 182, in pyarrow.lib.array
File "pyarrow/array.pxi", line 76, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column evt_name with type fletcher[string]')
Not sure if this is related, but for some DataFrames `.memory_usage()` (and consequently also `.info()`) triggers the following backtrace:
>>> df_amd.memory_usage()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2365, in memory_usage
for col, c in self.iteritems()], index=self.columns)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2365, in <listcomp>
for col, c in self.iteritems()], index=self.columns)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/series.py", line 3503, in memory_usage
v = super(Series, self).memory_usage(deep=deep)
File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/base.py", line 1143, in memory_usage
v = self.values.nbytes
File "/home/jallad/.local/lib/python3.6/site-packages/fletcher/base.py", line 410, in nbytes
size += buf.size
AttributeError: 'NoneType' object has no attribute 'size'
We have several benchmarks like https://github.com/xhochy/fletcher/blob/a63581d10381a41595695a9c3c89edd156375f74/benchmarks/take.py that compare the performance of a specific method of plain `pandas` with the implementation in `fletcher`. The performance difference is not covered by the standard plots that `asv` provides.
We should therefore generate our own comparison plots for these benchmarks (e.g. using `altair` for plotting).
The current pandas extension unit tests for `TestBaseMethodsTests` are marked as `pytest.mark.xfail`, c.f. https://github.com/xhochy/fletcher/blob/master/tests/string_array/test_pandas_extension.py
The `xfail` marker should be removed and the failing tests should be fixed in the `StringArray` and/or `FletcherArrayBase` implementation.