fletcher's People

Contributors

alaxe, chmp, cristianpirnogqc, felixhoehleqc, fhoehle, fjetter, fossabot, higgser, ivandimitrovqc, jbrockmendel, krivonogov, marc9595, marcantoineschmidtqc, radoslav11, simonjayhawkins, windiana42, xhochy


fletcher's Issues

Support __setitem__

Currently we don't support __setitem__ on any of the ExtensionArray classes. Supporting it will not be simple, as Arrow arrays are immutable: an implementation of __setitem__ will have to create a modified copy on each operation.

This is also required to support fillna().
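
A minimal sketch of the copy-on-write approach described above; the tuple stands in for an immutable Arrow array, and a real implementation would rebuild the array with pa.array(...) instead:

```python
def setitem_with_copy(arr, key, value):
    # Arrow arrays are immutable, so __setitem__ cannot mutate in
    # place: materialize a mutable copy, change it, and rebuild.
    # The tuple here models the immutable Arrow array.
    values = list(arr)
    values[key] = value
    return tuple(values)
```

Every assignment pays the cost of a full copy, which is also why fillna() has to take the same route.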

Test integration with dask.dataframe

dask.dataframe should also be able to handle fletcher columns and accessors. Thus we should at least have tests that confirm:

  • dask.dataframe can hold fletcher.Fletcher{Chunked,Continuous}Array columns
  • The fr_text accessor works with dask.dataframe

Fix slice handling

Currently slices are not supported. For this to work, Arrow's internal offset needs to be exposed to Python.

Nightly builds

We will depend on pandas and Apache Arrow master for a while. They will change from time to time and breakage is expected. Once we have CI up and running, we should also check continuously that upstream changes introduce no new breakage.

Create custom circleci docker container

Currently we spend a lot of time creating the initial conda environment. We should build a custom Docker image that comes with conda et al. pre-installed and then only install the dependencies that change more often, such as pandas or Arrow.

Add string accessor that works on native pandas and fletcher columns

We support string functions using fr_text and text. These only work on fletcher columns. To ease the conversion from the object / string dtype to a fletcher-based string type, we should support the following:

  • fr_str is an accessor that works on fletcher and pandas columns.
  • fr_strx only works on fletcher columns. This is useful for users who want to be certain that accelerated operations are used.
  • Drop the text and fr_text accessors to be in line with pandas naming.

str_concat benchmark

Is there a benchmark for the str_concat operation?
On my local machine I tried a naive Python implementation and got better results than with NumbaStringArray:

import numpy as np
import pyarrow as pa

from fletcher._numba_compat import NumbaStringArray, buffers_as_arrays
from fletcher._algorithms import str_concat

a1 = pa.array(np.random.rand(10**6).astype(str).astype('O'))
a2 = pa.array(np.random.rand(10**6).astype(str).astype('O'))


%timeit pa.array([x + y for x, y in zip(a1.to_pandas(), a2.to_pandas())])
# 860 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit str_concat(NumbaStringArray.make(a1), NumbaStringArray.make(a2))                                                           
# 1.11 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this something you expect?

Add str.cat

  • ✔️ pandas function
  • ✔️ Python function: +
  • ✔️ C++ STL function: +
  • ✔️ no need for a regular expression library
  • ✔️ no need for a Unicode database as we just append
  • ✔️/ ❌ can pre-compute output depending on the parameters

Pseudo-Code:

Inputs: others, sep, na_rep

# TODO: Write pseudo code for all of the possible variants
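
As a starting point, a naive Python sketch of the element-wise case (others given, equal length), using None as the null marker; str_cat here is a hypothetical helper, not an existing fletcher function:

```python
def str_cat(rows, others, sep="", na_rep=None):
    # Element-wise concatenation mirroring pandas' Series.str.cat
    # semantics: without na_rep, a null on either side yields null;
    # with na_rep, nulls are replaced before concatenating.
    out = []
    for left, right in zip(rows, others):
        if left is None:
            left = na_rep
        if right is None:
            right = na_rep
        if left is None or right is None:
            out.append(None)
        else:
            out.append(left + sep + right)
    return out
```

The remaining variants (others=None joining all rows, alignment, list of arrays) would still need to be spelled out.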

Can't convert from pyarrow.Table

In [319]: pa.__version__
Out[319]: '0.15.1'

In [320]: fletcher.pandas_from_arrow(tbl)
C:\Users\dhirschf\envs\dev\lib\site-packages\fletcher\base.py:731: FutureWarning: Calling .data on ChunkedArray is provided for compatibility after Column was removed, simply drop this attribute
  data[col.name] = FletcherArray(col.data)
Traceback (most recent call last):

  File "<ipython-input-320-a9ca90d82a7d>", line 1, in <module>
    fletcher.pandas_from_arrow(tbl)

  File "C:\Users\dhirschf\envs\dev\lib\site-packages\fletcher\base.py", line 731, in pandas_from_arrow
    data[col.name] = FletcherArray(col.data)

AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'name'


In [321]: type(tbl)
Out[321]: pyarrow.lib.Table

py37/win64 - fletcher=0.2.0

Add str.center method

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ✔️ no need for a regular expression library
  • ✔️ no need for a Unicode database
  • ❌ cannot pre-compute output size without going over the data, as some strings are wider than width (need no centering) and some aren't (need centering).

Pseudo-Code:

Inputs: width, fillchar

builder = StringBuilder()
for row in rows:
    if utf8_len(row) >= width:
        builder.append(row)
    else:
        n_missing = width - utf8_len(row)
        left = (n_missing // 2) + (n_missing % 2)
        right = (n_missing // 2)
        builder.append(fillchar * left + row + fillchar * right)
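
The pseudo-code translates directly into a naive Python version (a sketch; note it puts the extra fill character on the left for odd padding, which can differ from Python's own str.center):

```python
def str_center(rows, width, fillchar=" "):
    # Runnable variant of the pseudo-code; None marks a null row and
    # len() on a Python str already counts code points like utf8_len.
    out = []
    for row in rows:
        if row is None:
            out.append(None)
            continue
        n_missing = width - len(row)
        if n_missing <= 0:
            out.append(row)
        else:
            left = n_missing // 2 + n_missing % 2
            right = n_missing // 2
            out.append(fillchar * left + row + fillchar * right)
    return out
```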

Add documentation about the long-term goal of fletcher

I often get asked whether fletcher will just make strings faster or whether it is planned to replace pandas completely. Both assumptions are wrong, and we should state the project's actual intentions in the documentation.

Add str.casefold

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ✔️ no need for a regular expression library
  • ❌ need for a Unicode database for capitalization
  • ❌ cannot pre-compute output size as casefolded letters can have a different byte-width

Pseudo-Code:

Inputs: -

builder = StringBuilder()
for row in rows:
    builder.append(map_utf8_characters(utf8_casefold, row))
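
In plain Python, str.casefold already wraps the Unicode database lookup that utf8_casefold stands for, so a naive sketch is a one-liner:

```python
def str_casefold(rows):
    # str.casefold performs full Unicode case folding; the result can
    # have a different byte length (e.g. "ß" folds to "ss"), which is
    # why the output size cannot be pre-computed.
    return [None if row is None else row.casefold() for row in rows]
```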

Add str.count

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ❌ needs a regular expression library
  • ✔️ no need for a Unicode database for capitalization
  • ✔️ can pre-compute output size as return value is a numeric array

Pseudo-Code:

Inputs: pat

output = IntArray(len(rows))
for i, row in enumerate(rows):
    count = 0
    for offset in range(len(row) - len(pat) + 1):
        if pat == row[offset:offset + len(pat)]:
            count += 1
    output[i] = count
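
A runnable literal (non-regex) variant of the pseudo-code; because it checks every offset, overlapping matches are counted, unlike Python's str.count (null handling omitted for brevity):

```python
def str_count(rows, pat):
    # Count occurrences of pat in each row, including overlaps.
    out = []
    for row in rows:
        count = 0
        for offset in range(len(row) - len(pat) + 1):
            if row[offset:offset + len(pat)] == pat:
                count += 1
        out.append(count)
    return out
```

Matching pandas' regex semantics would instead require a regular expression library, as the checklist notes.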

How to use benchmarks.py?

It's not clear how to use benchmarks.py. IMO it would be nice to have a main method in that file that is able to run the benchmarks and print the results.

Add str.contains method

  • ✔️ pandas function
  • ✔️ Python function: in
  • ✔️ C++ STL function
  • ❌ needs a regular expression library
  • ✔️ no need for a Unicode database (for case=True, otherwise need lower implementation)
  • ✔️ can precompute output size as the result is a boolean array

Pseudo-Code:

Inputs: pat

output = BooleanArray(len(rows))

for i, row in enumerate(rows):
    output[i] = pat in row
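
For the exact-match (non-regex) case, Python's in operator is all the sketch needs:

```python
def str_contains_exact(rows, pat):
    # Substring test per row; None propagates as null. The output is
    # a boolean array, so its size is known up front.
    return [None if row is None else (pat in row) for row in rows]
```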

Extreme dates cannot be rendered when displaying DataFrames

We can store dates in fletcher that pandas cannot, as we allow precisions other than nanoseconds. Sadly, our code currently converts to nanoseconds when printing a DataFrame.

Reproducible example:

import fletcher as fr
import pandas as pd
import datetime

df = pd.DataFrame({
    "date": fr.FletcherContinuousArray([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])
})
print(df.head())

Exception:

Traceback (most recent call last):
  File "extreme_dates.py", line 8, in <module>
    print(df.head())
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 680, in __repr__
    self.to_string(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 820, in to_string
    return formatter.to_string(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 914, in to_string
    return self.get_result(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 521, in get_result
    self.write_result(buf=f)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 823, in write_result
    strcols = self._to_str_columns()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 759, in _to_str_columns
    fmt_values = self._format_col(i)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 948, in _format_col
    return format_array(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1172, in format_array
    return fmt_obj.get_result()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1203, in get_result
    fmt_values = self._format_strings()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1489, in _format_strings
    array = np.asarray(values)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/uwe/Development/fletcher/fletcher/base.py", line 328, in __array__
    return self.data.to_pandas().values
  File "pyarrow/array.pxi", line 567, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/array.pxi", line 1027, in pyarrow.lib.Array._to_pandas
  File "pyarrow/array.pxi", line 1209, in pyarrow.lib._array_like_to_pandas
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000

Upload coverage to codecov

To ensure that we have sufficient coverage of our tests, we should upload the code coverage to codecov to get automated reporting on pull requests.

[FeatureRequest] Add __version__

I've got version 0.2.0 installed, but I can't find that out from within Python, as fletcher.__version__ doesn't exist.

It's very convenient to be able to introspect the versions of installed packages from within Python.

BaseSetitemTests.test_setitem_integer_array fails with ValueError

The failure has no fletcher frames in the stack trace, so I'm a bit confused:

tests/test_pandas_extension.py::TestBaseSetitemTests::test_setitem_integer_array[True-fletcher_type7-chunked-list] FAILED
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

self = 0    ['B' 'C']
1        ['A']
2       [None]
3    ['A' 'A']
4           []
dtype: fletcher_chunked[list<item: string>], key = [0, 1, 2], value = ['B', 'C']

    def __setitem__(self, key, value):
        key = com.apply_if_callable(key, self)
        cacher_needs_updating = self._check_is_chained_assignment_possible()

        try:
>           self._set_with_engine(key, value)

../pandas/pandas/core/series.py:982:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 0    ['B' 'C']
1        ['A']
2       [None]
3    ['A' 'A']
4           []
dtype: fletcher_chunked[list<item: string>], key = [0, 1, 2], value = ['B', 'C']

    def _set_with_engine(self, key, value):
        # fails with AttributeError for IntervalIndex
>       loc = self.index._engine.get_loc(key)

../pandas/pandas/core/series.py:1015:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pandas/_libs/index.pyx:61:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: '[0, 1, 2]' is an invalid key

pandas/_libs/index.pyx:66: TypeError

During handling of the above exception, another exception occurred:

self = <test_pandas_extension.TestBaseSetitemTests object at 0x12679e450>, data = <FletcherChunkedArray>
[['B', 'C'],      ['A'],     [None], ['A', 'A'],         [], ['B', 'C'],
      ['A'],     [None..., ['B', 'C'],
      ['A'],     [None], ['A', 'A'],         []]
Length: 100, dtype: fletcher_chunked[list<item: string>], idx = [0, 1, 2], box_in_series = True

    @pytest.mark.parametrize(
        "idx",
        [[0, 1, 2], pd.array([0, 1, 2], dtype="Int64"), np.array([0, 1, 2])],
        ids=["list", "integer-array", "numpy-array"],
    )
    def test_setitem_integer_array(self, data, idx, box_in_series):
        arr = data[:5].copy()
        expected = data.take([0, 0, 0, 3, 4])

        if box_in_series:
            arr = pd.Series(arr)
            expected = pd.Series(expected)

>       arr[idx] = arr[0]

../pandas/pandas/tests/extension/base/setitem.py:153:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../pandas/pandas/core/series.py:1008: in __setitem__
    self._set_with(key, value)
../pandas/pandas/core/series.py:1051: in _set_with
    self._set_labels(key, value)
../pandas/pandas/core/series.py:1065: in _set_labels
    self._set_values(indexer, value)
../pandas/pandas/core/series.py:1070: in _set_values
    self._data = self._data.setitem(indexer=key, value=value)
../pandas/pandas/core/internals/managers.py:544: in setitem
    return self.apply("setitem", **kwargs)
../pandas/pandas/core/internals/managers.py:424: in apply
    applied = getattr(b, f)(**kwargs)
../pandas/pandas/core/internals/blocks.py:1816: in setitem
    check_setitem_lengths(indexer, value, self.values)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

indexer = array([0, 1, 2]), value = ['B', 'C'], values = <FletcherChunkedArray>
[['B', 'C'], ['A'], [None], ['A', 'A'], []]
Length: 5, dtype: fletcher_chunked[list<item: string>]

    def check_setitem_lengths(indexer, value, values) -> None:
        """
        Validate that value and indexer are the same length.

        An special-case is allowed for when the indexer is a boolean array
        and the number of true values equals the length of ``value``. In
        this case, no exception is raised.

        Parameters
        ----------
        indexer : sequence
            Key for the setitem.
        value : array-like
            Value for the setitem.
        values : array-like
            Values being set into.

        Returns
        -------
        None

        Raises
        ------
        ValueError
            When the indexer is an ndarray or list and the lengths don't match.
        """
        # boolean with truth values == len of the value is ok too
        if isinstance(indexer, (np.ndarray, list)):
            if is_list_like(value) and len(indexer) != len(value):
                if not (
                    isinstance(indexer, np.ndarray)
                    and indexer.dtype == np.bool_
                    and len(indexer[indexer]) == len(value)
                ):
                    raise ValueError(
>                       "cannot set using a list-like indexer "
                        "with a different length than the value"
                    )
E                   ValueError: cannot set using a list-like indexer with a different length than the value

../pandas/pandas/core/indexers.py:115: ValueError

Arrow API Questions

Before spamming the Arrow repo or mailing list: should we use GitHub issues to discuss, or would you prefer something else?

Anyway, I have some questions regarding the arrow API:

test_array = pa.array(["Test", "string", None])
buffers = test_array.buffers()
  • I read somewhere that Arrow arrays can be chunked. What is the simplest way to create a chunked string array?
  • How can I reconstruct an Arrow StringArray from individual buffers?

fletcher.read_csv?

Just wondering if a function fletcher.read_csv, which reads CSV data directly into Arrow tables, would be in scope?

Whilst pd.read_csv is damn good and the workhorse of many data analytics pipelines, it suffers slightly from pandas' limited type system, which I'm hoping could be improved with native Arrow types. It would also be nice to avoid going through pandas/Python types at all and so skip the serialization cost.

Using .str functions

I have tried, perhaps incorrectly, to convert my column to the pyarrow string type as follows:

fletcher_string_dtype = fr.FletcherDtype(pa.string())
df['string_col'] = df.string_col.astype(fletcher_string_dtype)

But now I can't do string functions on it because I get the error message AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Specifically, I'm trying to do .str.contains()

I may be casting the column incorrectly. It may also be that there's no value in using fletcher for this.

I saw in your talk that groupby was a nice use case. Related to this question: what are the best use cases for this dtype? Just a link to some additional reading material would be great.

Add str.capitalize

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ✔️ no need for a regular expression library
  • ❌ need for a Unicode database for capitalization
  • ❌ cannot pre-compute output size as capital letters could have a different byte-width

Pseudo-Code:

Inputs: -

builder = StringBuilder()
for row in rows:
    builder.append(map_utf8_characters(utf8_capitalize, row))

Add str.slice

Pseudo-Code:

Inputs: start, end, step

builder = StringBuilder()
for row in rows:
    if isnull(row):
        builder.addnull()
    else:
        builder.add(utf8_slice(row, start, end, step))
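
Since Python strings already index by code point, utf8_slice maps onto ordinary slicing in a naive sketch (stop corresponds to the end input above):

```python
def str_slice(rows, start=None, stop=None, step=None):
    # Per-row code-point slicing; None propagates as null.
    return [None if row is None else row[start:stop:step] for row in rows]
```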

🚀 String Super-Issue

We want to support strings (UTF-8 encoded) as fast as possible inside of pandas. Therefore we need to implement several things. This will be split into many issues and hard to track just with the issue search, so we will list them all here.

We try to add the functionality in three stages:

  1. Implement the functionality using plain Python operations. This will be the same speed as with pandas.StringDtype but already provides the API to fletcher users. This will allow us to add faster implementations bit-by-bit while already providing a fully usable library.
    a) Also ensure that we have benchmarks set up to compare the pandas/object implementation to ours.
  2. Given the algorithm isn't too complicated, we try to make an efficient implementation with numba. This will allow us to provide a fast algorithm with less implementation overhead than adding it to Apache Arrow.
  3. For all methods, add an efficient implementation to Apache Arrow if there is none yet.
| function | meta issue | naïve implementation | numba implementation | pyarrow implementation |
| --- | --- | --- | --- | --- |
| capitalize | #124 | #200 | | |
| casefold | #125 | #200 | | |
| cat | #126 | #200 | | |
| center | #122 | #200 | | |
| contains (exact match) | #123 | #140 | #141 | ARROW-9160 / #151 |
| contains (other) | #123 | #200 | | |
| count | #127 | #200 | | |
| decode | | | | |
| encode | | | | |
| endswith | #130 | - | #131 | |
| extract | #137 | #200 | | |
| extractall | | #200 | | |
| find | | #200 | | |
| findall | | #200 | | |
| get | | #200 | | |
| index | | #200 | | |
| join | | | | |
| len | | #200 | | |
| ljust | | #200 | | |
| lower | #135 | #200 | | ARROW-9133 |
| lstrip | | #200 | | |
| match | | #200 | | |
| normalize | | #200 | | |
| pad | | #200 | | |
| partition | | #200 | | |
| repeat | | #200 | | |
| replace | #133 | #200 | | |
| rfind | | #200 | | |
| rindex | | #200 | | |
| rjust | | #200 | | |
| rpartition | | #200 | | |
| rstrip | | #200 | | |
| slice | #114 | #200 | | |
| slice_replace | | #200 | | |
| split | | #200 | | |
| rsplit | | #200 | | |
| startswith | #132 | - | #131 | |
| strip | #136 | #160 | | |
| swapcase | | #200 | | |
| title | | #200 | | |
| translate | | #200 | | |
| upper | | #200 | | ARROW-9133 |
| wrap | | #200 | | |
| zfill | #134 | #139 | | |
| isalnum | | #200 | | ARROW-9268 / #203 |
| isalpha | | #200 | | ARROW-9268 / #203 |
| isdigit | | #200 | | ARROW-9268 / #203 |
| isspace | | #200 | | ARROW-9268 / #203 |
| islower | | #200 | | ARROW-9268 / #203 |
| isupper | | #200 | | ARROW-9268 / #203 |
| istitle | | #200 | | ARROW-9268 / #203 |
| isnumeric | | #200 | | ARROW-9268 / #203 |
| isdecimal | | #200 | | ARROW-9268 / #203 |
| get_dummies | | #200 | | |

Cannot serialise dataframe with FletcherArray columns

It seems serialisation is not yet supported by FletcherArray. Here's my DataFrame:

>>> df_amd.dtypes
time           datetime64[ns]
evt_name     fletcher[string]
evt_value             float64
evt_unit     fletcher[string]
bus                    uint64
route        fletcher[string]
stop_code     fletcher[int64]
stop         fletcher[string]
lat                   float64
lon                   float64
dtype: object

And here are the backtraces when I try to serialise to various formats (.to_csv(..) works):

HDF5

>>> df_amd.to_hdf('data/road_safety.h5', 'AMD')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 1996, in to_hdf
    return pytables.to_hdf(path_or_buf, key, self, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 279, in to_hdf
    f(store)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 273, in <lambda>
    f = lambda store: store.put(key, value, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
    self.write_array('block%d_values' % i, blk.values, items=blk_items)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2686, in write_array
    value = value.T
AttributeError: 'FletcherArray' object has no attribute 'T'

Parquet

>>> df_amd.to_parquet('data/ahmedabad_event_report.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1945, in to_parquet
    compression=compression, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 257, in to_parquet
    return impl.write(df, path, compression=compression, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 118, in write
    table = self.api.Table.from_pandas(df)
  File "pyarrow/table.pxi", line 1136, in pyarrow.lib.Table.from_pandas
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 386, in dataframe_to_arrays
    convert_types))
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception 
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 375, in convert_column
    raise e
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 369, in convert_column
    return pa.array(col, from_pandas=True, type=ty)
  File "pyarrow/array.pxi", line 182, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 76, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column evt_name with type fletcher[string]')

Feather

>>> df = df_amd.reset_index()
>>> df.to_feather('data/ahmedabad_event_report.feather')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1892, in to_feather
    to_feather(self, fname)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
    feather.write_dataframe(df, path)
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 181, in write_feather
    writer.write(df)
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
    batch = RecordBatch.from_pandas(df, preserve_index=False)
  File "pyarrow/table.pxi", line 901, in pyarrow.lib.RecordBatch.from_pandas
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 386, in dataframe_to_arrays
    convert_types))
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 375, in convert_column
    raise e
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 369, in convert_column
    return pa.array(col, from_pandas=True, type=ty)
  File "pyarrow/array.pxi", line 182, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 76, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column evt_name with type fletcher[string]')

Not sure if this is related, but for some DataFrames .memory_usage() (and consequently also .info()) triggers the following backtrace:

>>> df_amd.memory_usage()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2365, in memory_usage
    for col, c in self.iteritems()], index=self.columns)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2365, in <listcomp>
    for col, c in self.iteritems()], index=self.columns)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/series.py", line 3503, in memory_usage
    v = super(Series, self).memory_usage(deep=deep)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/base.py", line 1143, in memory_usage
    v = self.values.nbytes
  File "/home/jallad/.local/lib/python3.6/site-packages/fletcher/base.py", line 410, in nbytes
    size += buf.size
AttributeError: 'NoneType' object has no attribute 'size'

Visualize pandas performance comparisons

We have several benchmarks like https://github.com/xhochy/fletcher/blob/a63581d10381a41595695a9c3c89edd156375f74/benchmarks/take.py that compare the performance of a specific method of plain pandas with the implementation in fletcher. The performance difference is not covered by the standard plots that asv provides.

We should therefore:

  • Run the benchmarks and produce machine-readable output
  • Parse the output and make plots in a notebook that show the performance differences (preferably use altair for plotting).
  • Have a way to publish the run notebook as static HTML somewhere.
