fletcher's People

Contributors

alaxe, chmp, cristianpirnogqc, felixhoehleqc, fhoehle, fjetter, fossabot, higgser, ivandimitrovqc, jbrockmendel, krivonogov, marc9595, marcantoineschmidtqc, radoslav11, simonjayhawkins, windiana42, xhochy


fletcher's Issues

Support __setitem__

Currently we don't support __setitem__ on any of the ExtensionArray classes. Supporting it will not be simple, as Arrow arrays are immutable: an implementation of __setitem__ will have to create a modified copy on each operation.

This is also required to support fillna().
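
A minimal sketch of the copy-on-write approach described above; the tuple stands in for an immutable Arrow array, and a real implementation would rebuild the array with pa.array(...) instead:

```python
def setitem_with_copy(arr, key, value):
    # Arrow arrays are immutable, so __setitem__ cannot mutate in
    # place: materialize a mutable copy, change it, and rebuild.
    # The tuple here models the immutable Arrow array.
    values = list(arr)
    values[key] = value
    return tuple(values)
```

Every assignment pays the cost of a full copy, which is also why fillna() has to take the same route.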

Test integration with dask.dataframe

dask.dataframe should also be able to handle fletcher columns and accessors. Thus we should at least have tests that confirm:

  • dask.dataframe can hold fletcher.Fletcher{Chunked,Continuous}Array columns
  • The fr_text accessor works with dask.dataframe

Fix slice handling

Currently slices are not supported. For this to work, Arrow's internal offset needs to be exposed to Python.

Nightly builds

We will depend on pandas and Apache Arrow master for a while. They will change from time to time and breakage is expected. Once we have CI up and running, we should also check continuously that upstream changes introduce no new breakage.

Create custom circleci docker container

Currently we spend a lot of time creating the initial conda environment. We should build a custom Docker image that comes with conda et al. pre-installed and then only install the dependencies that change more often, such as pandas or Arrow.

Add string accessor that works on native pandas and fletcher columns

We support string functions using fr_text and text. These only work on fletcher columns. To ease the conversion from the object / string dtype to a fletcher-based string type, we should support the following:

  • fr_str is an accessor that works on fletcher and pandas columns.
  • fr_strx only works on fletcher columns. This is useful for users who want to be certain that accelerated operations are used.
  • Drop the text and fr_text accessors to be in line with pandas naming.

str_concat benchmark

Is there a benchmark for the str_concat operation?
On my local machine I tried a naive Python implementation and got better results than with NumbaStringArray:

import numpy as np
import pyarrow as pa

from fletcher._numba_compat import NumbaStringArray, buffers_as_arrays
from fletcher._algorithms import str_concat

a1 = pa.array(np.random.rand(10**6).astype(str).astype('O'))
a2 = pa.array(np.random.rand(10**6).astype(str).astype('O'))


%timeit pa.array([x + y for x, y in zip(a1.to_pandas(), a2.to_pandas())])
# 860 ms ± 6.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit str_concat(NumbaStringArray.make(a1), NumbaStringArray.make(a2))                                                           
# 1.11 s ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this something you expect?

Add str.cat

  • ✔️ pandas function
  • ✔️ Python function: +
  • ✔️ C++ STL function: +
  • ✔️ no need for a regular expression library
  • ✔️ no need for a Unicode database as we just append
  • ✔️/ ❌ can pre-compute output depending on the parameters

Pseudo-Code:

Inputs: others, sep, na_rep

# TODO: Write pseudo code for all of the possible variants
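
As a starting point, a naive Python sketch of the element-wise case (others given, equal length), using None as the null marker; str_cat here is a hypothetical helper, not an existing fletcher function:

```python
def str_cat(rows, others, sep="", na_rep=None):
    # Element-wise concatenation mirroring pandas' Series.str.cat
    # semantics: without na_rep, a null on either side yields null;
    # with na_rep, nulls are replaced before concatenating.
    out = []
    for left, right in zip(rows, others):
        if left is None:
            left = na_rep
        if right is None:
            right = na_rep
        if left is None or right is None:
            out.append(None)
        else:
            out.append(left + sep + right)
    return out
```

The remaining variants (others=None joining all rows, alignment, list of arrays) would still need to be spelled out.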

Can't convert from pyarrow.Table

In [319]: pa.__version__
Out[319]: '0.15.1'

In [320]: fletcher.pandas_from_arrow(tbl)
C:\Users\dhirschf\envs\dev\lib\site-packages\fletcher\base.py:731: FutureWarning: Calling .data on ChunkedArray is provided for compatibility after Column was removed, simply drop this attribute
  data[col.name] = FletcherArray(col.data)
Traceback (most recent call last):

  File "<ipython-input-320-a9ca90d82a7d>", line 1, in <module>
    fletcher.pandas_from_arrow(tbl)

  File "C:\Users\dhirschf\envs\dev\lib\site-packages\fletcher\base.py", line 731, in pandas_from_arrow
    data[col.name] = FletcherArray(col.data)

AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'name'


In [321]: type(tbl)
Out[321]: pyarrow.lib.Table

py37/win64 - fletcher=0.2.0

Add str.center method

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ✔️ no need for a regular expression library
  • ✔️ no need for a Unicode database
  • ❌ cannot pre-compute output size without going over the data, as some strings are wider than width (need no centering) and some aren't (need centering).

Pseudo-Code:

Inputs: width, fillchar

builder = StringBuilder()
for row in rows:
    if utf8_len(row) >= width:
        builder.append(row)
    else:
        n_missing = width - utf8_len(row)
        left = (n_missing // 2) + (n_missing % 2)
        right = (n_missing // 2)
        builder.append(fillchar * left + row + fillchar * right)
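
The pseudo-code translates directly into a naive Python version (a sketch; note it puts the extra fill character on the left for odd padding, which can differ from Python's own str.center):

```python
def str_center(rows, width, fillchar=" "):
    # Runnable variant of the pseudo-code; None marks a null row and
    # len() on a Python str already counts code points like utf8_len.
    out = []
    for row in rows:
        if row is None:
            out.append(None)
            continue
        n_missing = width - len(row)
        if n_missing <= 0:
            out.append(row)
        else:
            left = n_missing // 2 + n_missing % 2
            right = n_missing // 2
            out.append(fillchar * left + row + fillchar * right)
    return out
```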

Add documentation about the long-term goal of fletcher

I often get asked whether fletcher will just make strings faster or whether it is planned to replace pandas completely. Both assumptions are wrong, and we should state the project's actual intentions in the documentation.

Add str.casefold

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ✔️ no need for a regular expression library
  • ❌ need for a Unicode database for capitalization
  • ❌ cannot pre-compute output size as casefolded letters can have a different byte-width

Pseudo-Code:

Inputs: -

builder = StringBuilder()
for row in rows:
    builder.append(map_utf8_characters(utf8_casefold, row))
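
In plain Python, str.casefold already wraps the Unicode database lookup that utf8_casefold stands for, so a naive sketch is a one-liner:

```python
def str_casefold(rows):
    # str.casefold performs full Unicode case folding; the result can
    # have a different byte length (e.g. "ß" folds to "ss"), which is
    # why the output size cannot be pre-computed.
    return [None if row is None else row.casefold() for row in rows]
```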

Add str.count

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ❌ needs a regular expression library
  • ✔️ no need for a Unicode database for capitalization
  • ✔️ can pre-compute output size as return value is a numeric array

Pseudo-Code:

Inputs: pat

output = IntArray(len(rows))
for i, row in enumerate(rows):
    count = 0
    for offset in range(len(row) - len(pat) + 1):
        if pat == row[offset:offset + len(pat)]:
            count += 1
    output[i] = count
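
A runnable literal (non-regex) variant of the pseudo-code; because it checks every offset, overlapping matches are counted, unlike Python's str.count (null handling omitted for brevity):

```python
def str_count(rows, pat):
    # Count occurrences of pat in each row, including overlaps.
    out = []
    for row in rows:
        count = 0
        for offset in range(len(row) - len(pat) + 1):
            if row[offset:offset + len(pat)] == pat:
                count += 1
        out.append(count)
    return out
```

Matching pandas' regex semantics would instead require a regular expression library, as the checklist notes.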

How to use benchmarks.py?

It's not clear how to use benchmarks.py. IMO it would be nice to have a main method in that file that is able to run the benchmarks and print the results.

Add str.contains method

  • ✔️ pandas function
  • ✔️ Python function: in
  • ✔️ C++ STL function
  • ❌ needs a regular expression library
  • ✔️ no need for a Unicode database (for case=True, otherwise need lower implementation)
  • ✔️ can precompute output size as the result is a boolean array

Pseudo-Code:

Inputs: pat

output = BooleanArray(len(rows))

for i, row in enumerate(rows):
    output[i] = pat in row
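
For the exact-match (non-regex) case, Python's in operator is all the sketch needs:

```python
def str_contains_exact(rows, pat):
    # Substring test per row; None propagates as null. The output is
    # a boolean array, so its size is known up front.
    return [None if row is None else (pat in row) for row in rows]
```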

Extreme dates cannot be rendered when displaying DataFrames

We can store dates in fletcher that pandas cannot, as we allow precisions other than nanoseconds. Sadly, our code currently converts to nanoseconds when printing a DataFrame.

Reproducible example:

import fletcher as fr
import pandas as pd
import datetime

df = pd.DataFrame({
    "date": fr.FletcherContinuousArray([datetime.datetime(9999, 12, 1), datetime.datetime(9999, 12, 1)])
})
print(df.head())

Exception:

Traceback (most recent call last):
  File "extreme_dates.py", line 8, in <module>
    print(df.head())
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 680, in __repr__
    self.to_string(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/core/frame.py", line 820, in to_string
    return formatter.to_string(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 914, in to_string
    return self.get_result(buf=buf, encoding=encoding)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 521, in get_result
    self.write_result(buf=f)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 823, in write_result
    strcols = self._to_str_columns()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 759, in _to_str_columns
    fmt_values = self._format_col(i)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 948, in _format_col
    return format_array(
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1172, in format_array
    return fmt_obj.get_result()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1203, in get_result
    fmt_values = self._format_strings()
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1489, in _format_strings
    array = np.asarray(values)
  File "/Users/uwe/miniconda3/envs/fletcher/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/uwe/Development/fletcher/fletcher/base.py", line 328, in __array__
    return self.data.to_pandas().values
  File "pyarrow/array.pxi", line 567, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/array.pxi", line 1027, in pyarrow.lib.Array._to_pandas
  File "pyarrow/array.pxi", line 1209, in pyarrow.lib._array_like_to_pandas
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253399622400000000

Upload coverage to codecov

To ensure that we have sufficient coverage of our tests, we should upload the code coverage to codecov to get automated reporting on pull requests.

[FeatureRequest] Add __version__

I've got version 0.2.0 installed, but I can't find that out from within Python, as fletcher.__version__ doesn't exist.

It's very convenient to be able to introspect the versions of installed packages from within Python.

BaseSetitemTests.test_setitem_integer_array fails with ValueError

The failure has no fletcher frames in the stack trace, so I'm a bit confused:

tests/test_pandas_extension.py::TestBaseSetitemTests::test_setitem_integer_array[True-fletcher_type7-chunked-list] FAILED
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

self = 0    ['B' 'C']
1        ['A']
2       [None]
3    ['A' 'A']
4           []
dtype: fletcher_chunked[list<item: string>], key = [0, 1, 2], value = ['B', 'C']

    def __setitem__(self, key, value):
        key = com.apply_if_callable(key, self)
        cacher_needs_updating = self._check_is_chained_assignment_possible()

        try:
>           self._set_with_engine(key, value)

../pandas/pandas/core/series.py:982:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 0    ['B' 'C']
1        ['A']
2       [None]
3    ['A' 'A']
4           []
dtype: fletcher_chunked[list<item: string>], key = [0, 1, 2], value = ['B', 'C']

    def _set_with_engine(self, key, value):
        # fails with AttributeError for IntervalIndex
>       loc = self.index._engine.get_loc(key)

../pandas/pandas/core/series.py:1015:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???

pandas/_libs/index.pyx:61:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: '[0, 1, 2]' is an invalid key

pandas/_libs/index.pyx:66: TypeError

During handling of the above exception, another exception occurred:

self = <test_pandas_extension.TestBaseSetitemTests object at 0x12679e450>, data = <FletcherChunkedArray>
[['B', 'C'],      ['A'],     [None], ['A', 'A'],         [], ['B', 'C'],
      ['A'],     [None..., ['B', 'C'],
      ['A'],     [None], ['A', 'A'],         []]
Length: 100, dtype: fletcher_chunked[list<item: string>], idx = [0, 1, 2], box_in_series = True

    @pytest.mark.parametrize(
        "idx",
        [[0, 1, 2], pd.array([0, 1, 2], dtype="Int64"), np.array([0, 1, 2])],
        ids=["list", "integer-array", "numpy-array"],
    )
    def test_setitem_integer_array(self, data, idx, box_in_series):
        arr = data[:5].copy()
        expected = data.take([0, 0, 0, 3, 4])

        if box_in_series:
            arr = pd.Series(arr)
            expected = pd.Series(expected)

>       arr[idx] = arr[0]

../pandas/pandas/tests/extension/base/setitem.py:153:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../pandas/pandas/core/series.py:1008: in __setitem__
    self._set_with(key, value)
../pandas/pandas/core/series.py:1051: in _set_with
    self._set_labels(key, value)
../pandas/pandas/core/series.py:1065: in _set_labels
    self._set_values(indexer, value)
../pandas/pandas/core/series.py:1070: in _set_values
    self._data = self._data.setitem(indexer=key, value=value)
../pandas/pandas/core/internals/managers.py:544: in setitem
    return self.apply("setitem", **kwargs)
../pandas/pandas/core/internals/managers.py:424: in apply
    applied = getattr(b, f)(**kwargs)
../pandas/pandas/core/internals/blocks.py:1816: in setitem
    check_setitem_lengths(indexer, value, self.values)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

indexer = array([0, 1, 2]), value = ['B', 'C'], values = <FletcherChunkedArray>
[['B', 'C'], ['A'], [None], ['A', 'A'], []]
Length: 5, dtype: fletcher_chunked[list<item: string>]

    def check_setitem_lengths(indexer, value, values) -> None:
        """
        Validate that value and indexer are the same length.

        An special-case is allowed for when the indexer is a boolean array
        and the number of true values equals the length of ``value``. In
        this case, no exception is raised.

        Parameters
        ----------
        indexer : sequence
            Key for the setitem.
        value : array-like
            Value for the setitem.
        values : array-like
            Values being set into.

        Returns
        -------
        None

        Raises
        ------
        ValueError
            When the indexer is an ndarray or list and the lengths don't match.
        """
        # boolean with truth values == len of the value is ok too
        if isinstance(indexer, (np.ndarray, list)):
            if is_list_like(value) and len(indexer) != len(value):
                if not (
                    isinstance(indexer, np.ndarray)
                    and indexer.dtype == np.bool_
                    and len(indexer[indexer]) == len(value)
                ):
                    raise ValueError(
>                       "cannot set using a list-like indexer "
                        "with a different length than the value"
                    )
E                   ValueError: cannot set using a list-like indexer with a different length than the value

../pandas/pandas/core/indexers.py:115: ValueError

Arrow API Questions

Before spamming the Arrow repo or mailing list: should we use GitHub issues to discuss, or would you prefer something else?

Anyway, I have some questions regarding the arrow API:

test_array = pa.array(["Test", "string", None])
buffers = test_array.buffers()
  • I read somewhere that Arrow arrays can be chunked. What is the simplest way to create a chunked string array?
  • How can I reconstruct an Arrow StringArray from individual buffers?

fletcher.read_csv?

Just wondering if a function fletcher.read_csv, which reads CSV data directly into Arrow tables, would be in scope?

Whilst pd.read_csv is damn good and the workhorse of many data analytics pipelines, it suffers slightly from pandas' limited type system, which I'm hoping could be improved with native Arrow types. It would also be nice to avoid going through pandas/Python types at all and so skip the serialization cost.

Using .str functions

I have tried, perhaps incorrectly, to convert my column to the pyarrow string type as follows:

fletcher_string_dtype = fr.FletcherDtype(pa.string())
df['string_col'] = df.string_col.astype(fletcher_string_dtype)

But now I can't do string functions on it because I get the error message AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Specifically, I'm trying to do .str.contains()

I may be casting the column incorrectly. It may also be that there's no value in using fletcher for this.

I saw in your talk that groupby was a nice use case. Related to this question: what are the best use cases for this dtype? Just a link to some additional reading material would be great.

Add str.capitalize

  • ✔️ pandas function
  • ✔️ Python function
  • ❌ C++ STL function
  • ✔️ no need for a regular expression library
  • ❌ need for a Unicode database for capitalization
  • ❌ cannot pre-compute output size as capital letters could have a different byte-width

Pseudo-Code:

Inputs: -

builder = StringBuilder()
for row in rows:
    builder.append(map_utf8_characters(utf8_capitalize, row))

Add str.slice

Pseudo-Code:

Inputs: start, end, step

builder = StringBuilder()
for row in rows:
    if isnull(row):
        builder.addnull()
    else:
        builder.add(utf8_slice(row, start, end, step))
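
Since Python strings already index by code point, utf8_slice maps onto ordinary slicing in a naive sketch (stop corresponds to the end input above):

```python
def str_slice(rows, start=None, stop=None, step=None):
    # Per-row code-point slicing; None propagates as null.
    return [None if row is None else row[start:stop:step] for row in rows]
```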

🚀 String Super-Issue

We want to support strings (UTF-8 encoded) as fast as possible inside of pandas. Therefore we need to implement several things. This will be split into many issues and hard to track just with the issue search, so we will list them all here.

We try to add the functionality in three stages:

  1. Implement the functionality using plain Python operations. This will be the same speed as with pandas.StringDtype but already provides the API to fletcher users. This will allow us to add faster implementations bit-by-bit while already providing a fully usable library.
    a) Also ensure that we have benchmarks set up to compare the pandas/object implementation to ours.
  2. Given the algorithm isn't too complicated, we try to make an efficient implementation with numba. This will allow us to provide a fast algorithm with less implementation overhead than adding it to Apache Arrow.
  3. For all methods, add an efficient implementation to Apache Arrow if there is none yet.
| function | meta issue | naïve implementation | numba implementation | pyarrow implementation |
| --- | --- | --- | --- | --- |
| capitalize | #124 | #200 | | |
| casefold | #125 | #200 | | |
| cat | #126 | #200 | | |
| center | #122 | #200 | | |
| contains (exact match) | #123 | #140 | #141 | ARROW-9160 / #151 |
| contains (other) | #123 | #200 | | |
| count | #127 | #200 | | |
| decode | | | | |
| encode | | | | |
| endswith | #130 | - | #131 | |
| extract | #137 | #200 | | |
| extractall | | #200 | | |
| find | | #200 | | |
| findall | | #200 | | |
| get | | #200 | | |
| index | | #200 | | |
| join | | | | |
| len | | #200 | | |
| ljust | | #200 | | |
| lower | #135 | #200 | | ARROW-9133 |
| lstrip | | #200 | | |
| match | | #200 | | |
| normalize | | #200 | | |
| pad | | #200 | | |
| partition | | #200 | | |
| repeat | | #200 | | |
| replace | #133 | #200 | | |
| rfind | | #200 | | |
| rindex | | #200 | | |
| rjust | | #200 | | |
| rpartition | | #200 | | |
| rstrip | | #200 | | |
| slice | #114 | #200 | | |
| slice_replace | | #200 | | |
| split | | #200 | | |
| rsplit | | #200 | | |
| startswith | #132 | - | #131 | |
| strip | #136 | #160 | | |
| swapcase | | #200 | | |
| title | | #200 | | |
| translate | | #200 | | |
| upper | | #200 | | ARROW-9133 |
| wrap | | #200 | | |
| zfill | #134 | #139 | | |
| isalnum | | #200 | | ARROW-9268 / #203 |
| isalpha | | #200 | | ARROW-9268 / #203 |
| isdigit | | #200 | | ARROW-9268 / #203 |
| isspace | | #200 | | ARROW-9268 / #203 |
| islower | | #200 | | ARROW-9268 / #203 |
| isupper | | #200 | | ARROW-9268 / #203 |
| istitle | | #200 | | ARROW-9268 / #203 |
| isnumeric | | #200 | | ARROW-9268 / #203 |
| isdecimal | | #200 | | ARROW-9268 / #203 |
| get_dummies | | #200 | | |

Cannot serialise dataframe with FletcherArray columns

It seems serialisation is not yet supported by FletcherArray. Here's my DataFrame:

>>> df_amd.dtypes
time           datetime64[ns]
evt_name     fletcher[string]
evt_value             float64
evt_unit     fletcher[string]
bus                    uint64
route        fletcher[string]
stop_code     fletcher[int64]
stop         fletcher[string]
lat                   float64
lon                   float64
dtype: object

And here are the backtraces when I try to serialise to various formats (.to_csv(..) works):

HDF5

>>> df_amd.to_hdf('data/road_safety.h5', 'AMD')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 1996, in to_hdf
    return pytables.to_hdf(path_or_buf, key, self, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 279, in to_hdf
    f(store)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 273, in <lambda>
    f = lambda store: store.put(key, value, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
    self.write_array('block%d_values' % i, blk.values, items=blk_items)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/pytables.py", line 2686, in write_array
    value = value.T
AttributeError: 'FletcherArray' object has no attribute 'T'

Parquet

>>> df_amd.to_parquet('data/ahmedabad_event_report.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1945, in to_parquet
    compression=compression, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 257, in to_parquet
    return impl.write(df, path, compression=compression, **kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 118, in write
    table = self.api.Table.from_pandas(df)
  File "pyarrow/table.pxi", line 1136, in pyarrow.lib.Table.from_pandas
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 386, in dataframe_to_arrays
    convert_types))
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception 
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 375, in convert_column
    raise e
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 369, in convert_column
    return pa.array(col, from_pandas=True, type=ty)
  File "pyarrow/array.pxi", line 182, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 76, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column evt_name with type fletcher[string]')

Feather

>>> df = df_amd.reset_index()
>>> df.to_feather('data/ahmedabad_event_report.feather')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1892, in to_feather
    to_feather(self, fname)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
    feather.write_dataframe(df, path)
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 181, in write_feather
    writer.write(df)
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
    batch = RecordBatch.from_pandas(df, preserve_index=False)
  File "pyarrow/table.pxi", line 901, in pyarrow.lib.RecordBatch.from_pandas
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 386, in dataframe_to_arrays
    convert_types))
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 375, in convert_column
    raise e
  File "/home/jallad/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 369, in convert_column
    return pa.array(col, from_pandas=True, type=ty)
  File "pyarrow/array.pxi", line 182, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 76, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column evt_name with type fletcher[string]')

Not sure if this is related, but for some DataFrames .memory_usage() (and consequently also .info()) triggers the following backtrace:

>>> df_amd.memory_usage()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2365, in memory_usage
    for col, c in self.iteritems()], index=self.columns)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2365, in <listcomp>
    for col, c in self.iteritems()], index=self.columns)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/series.py", line 3503, in memory_usage
    v = super(Series, self).memory_usage(deep=deep)
  File "/home/jallad/.local/lib/python3.6/site-packages/pandas/core/base.py", line 1143, in memory_usage
    v = self.values.nbytes
  File "/home/jallad/.local/lib/python3.6/site-packages/fletcher/base.py", line 410, in nbytes
    size += buf.size
AttributeError: 'NoneType' object has no attribute 'size'

Visualize pandas performance comparisons

We have several benchmarks like https://github.com/xhochy/fletcher/blob/a63581d10381a41595695a9c3c89edd156375f74/benchmarks/take.py that compare the performance of a specific method of plain pandas with the implementation in fletcher. The performance difference is not covered by the standard plots that asv provides.

We should therefore:

  • Run the benchmarks and produce machine-readable output
  • Parse the output and make plots in a notebook that show the performance differences (preferably use altair for plotting).
  • Have a way to publish the run notebook as static HTML somewhere.
