Comments (21)
plotly currently does this:

```python
# (excerpt from plotly; assumes `from packaging import version` and
# `import pandas as pd` earlier in the module)
if hasattr(args["data_frame"], "__dataframe__") and version.parse(
    pd.__version__
) >= version.parse("2.0.2"):
    import pandas.api.interchange

    df_not_pandas = args["data_frame"]
    args["data_frame"] = df_not_pandas.__dataframe__()
    # According to the interchange protocol, `def column_names(self) -> Iterable[str]:`
    # can return, for example, a generator.
    # The easiest way is to convert `columns` to `pandas.Index` so that the
    # type is similar to the types in other code branches.
    columns = pd.Index(args["data_frame"].column_names())
    needs_interchanging = True
elif hasattr(args["data_frame"], "to_pandas"):
    args["data_frame"] = args["data_frame"].to_pandas()
    columns = args["data_frame"].columns
```
and other issues have come up since.

I think it would have been cleaner to just set a minimum version on dataframe-api-compat, and handle different pandas versions in there
> I've toyed around with moving part of the implementation down to Cython to have the pandas buffers adhere to the Python buffer protocol

OK yeah, if there's something like that which ties it to living in pandas, then it probably shouldn't be removed, thanks for bringing this up
from pandas.
Sure, no objections to removing it from the pandas repo
I could take it into https://github.com/data-apis/dataframe-api-compat and maintain it, if people are OK with that. There should be enough projects in the community and people in the consortium interested in this that we can keep it alive 💪
I think there's only a handful of repos* currently using it anyway (plotly, seaborn, vscode), it'd be pretty easy to manually make PRs to all of them. I could take this on too
I think that scikit-learn and altair wouldn't need to change anything:
- the former only uses the interchange protocol for not-pandas: https://github.com/scikit-learn/scikit-learn/blob/4ce8e19859cb8b2f2bef197ed5b28beea44ee4b4/sklearn/utils/validation.py#L281-L288
- the latter only uses the interchange protocol for not-pandas, and uses pyarrow's `from_dataframe` method
Would we still have some sort of testing burden for the interchange protocol even after it's removed?
(e.g. do we need to run a CI job against df-api-compat?)
there could be just a simple test, like there is for the api standard (which also defers to dataframe-api-compat):

`pandas/tests/test_downstream.py`, lines 307 to 323 (at 8fb8b9f)
How often does the issue pop up where tying the fixes to a release is problematic? I think that is actually preferable; syncing this across two projects seems more problematic to me
> The package would be a pure and lightweight Python dependency

This is true today, although I've toyed around with moving part of the implementation down to Cython to have the pandas buffers adhere to the Python buffer protocol. I don't think that is possible in a pure Python package. See #55671
IIUC the PandasBuffer (which seems like the thing that would get a Cython buffer implementation mixed into it) takes an ndarray in `__init__` and doesn't seem to rely on anything pandas-specific. So I don't see that potential refactor as requiring this stuff to live in pandas (though it would negate the "pure-Python" part of the description above).

Really, the reason to split this out is: "This shouldn't be our problem."
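For illustration, here's a minimal pure-Python sketch of such a buffer wrapper. The class name and attributes here are hypothetical, loosely modelled on the description of PandasBuffer above, not the actual pandas implementation:

```python
import numpy as np

# Hypothetical sketch: a buffer wrapper that takes an ndarray, as
# PandasBuffer is described above. A Cython version could additionally
# implement the Python buffer protocol so memoryview(buf) works directly;
# in pure Python the raw memory is only reachable via the wrapped array.
class ArrayBuffer:
    def __init__(self, x: np.ndarray) -> None:
        self._x = x

    @property
    def bufsize(self) -> int:
        # total size of the buffer in bytes
        return self._x.nbytes

    @property
    def ptr(self) -> int:
        # address of the first byte of the underlying memory
        return self._x.__array_interface__["data"][0]
```

Nothing here touches pandas internals, which is the point being made: the refactor wouldn't by itself require the code to live in the pandas repo.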
@WillAyd could you please expand on why it would be beneficial for the buffers to adhere to the Python buffer protocol?
Furthermore, would it be possible to take that out into a non-pure-Python package, or would it still need to live in pandas?
The advantage of adhering to the Python protocol is that it allows other packages to access the raw data of a dataframe purely from Python. It doesn't necessarily need to live in pandas, but you would end up duplicating a lot of the same things that we already have in pandas (ex: build system, CI) just in another package.
Thanks
Just for my understanding, would this require other libraries to also adhere to the Python buffer protocol in order to be useful?
No, other libraries would not need to do anything. An object that adheres to the buffer protocol can have its raw bytes accessed through standard Python objects. If you look at the tests added in the above PR, you can see how that works with a `memoryview`
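As a small stand-alone illustration (not the PR's actual test; a NumPy array stands in here for a column buffer), the buffer protocol means the raw bytes are reachable through a plain `memoryview`:

```python
import numpy as np

# Any object implementing the Python buffer protocol exposes its raw
# memory through standard Python objects; a NumPy array stands in here
# for a dataframe column buffer.
arr = np.array([1, 2, 3], dtype="int64")
view = memoryview(arr)  # zero-copy view of the underlying memory
raw = bytes(view)       # 3 values * 8 bytes each = 24 raw bytes
```

A buffer that implemented the protocol would support exactly this, with no library-specific API involved.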
Sure, but I mean: would consumers be able to consume a column's raw bytes through standard Python objects in a library-agnostic manner if other libraries don't change anything?

For example, I can currently write

```python
if hasattr(df, '__dataframe__'):
    df_xch = df.__dataframe__()
    schema = {
        name: col.get_buffers()['data'][1][0]
        for (name, col) in zip(df_xch.column_names(), df_xch.get_columns())
    }
```

and know that it will give me the schema of any dataframe which implements the interchange protocol.

If pandas buffers were to adhere to the Python buffer protocol, then IIUC you'd also be able to do, say, `bytearray(memoryview(df_xch.get_column_by_name('a').get_buffers()["data"][0]))`. But if not all libraries which implement the interchange protocol also adhere to the Python buffer protocol, then you'd have no guarantee that that code would work in a library-agnostic manner, right?
Yeah, I don't think it replaces or changes any parts of the dataframe protocol; I think of it more as a lightweight convenience that a third party can use to perform smaller tasks, or to aid in debugging a larger implementation of the dataframe protocol
In theory it could be brought up to the DataFrame Interchange Protocol spec to add Python buffer protocol support as a general requirement, but not sure all implementations would want to do that (for pyarrow it would be easy, since our main Buffer object already implements the buffer protocol).
If we move it out, would the idea be that it becomes a required dependency of pandas, and not optional? (I assume so, otherwise consumers cannot have the guarantee that calling `__dataframe__` on a pandas object would work)
it could still be optional; anyone using it (e.g. plotly, seaborn) would need to make it required
But in general, we want to allow projects to be able to accept a pandas object without depending on pandas.

Of course, we could ensure that this new `pandas-interchange-protocol` package doesn't actually have a hard dependency on pandas, so they could add a dependency on that package without depending on pandas. But it does require every project that wants to use the interchange protocol to add such dependencies, which feels a bit strange?
If a library already depends on pandas and does:

```python
if isinstance(df, pd.DataFrame):
    pass
elif hasattr(df, '__dataframe__'):
    df = pd.api.interchange.from_dataframe(df)
```

then instead of just having pandas as a dependency, they would need to also have the interchange protocol package as a dependency. Users need not change anything.

Hypothetically* speaking, if a library does not already depend on pandas, does not have a special path for pandas, and does, say:

```python
if hasattr(df, '__dataframe__'):
    df = pyarrow.interchange.from_dataframe(df)
```

then they could either:
- accept that pandas users are going to get a runtime error telling them to install an optional dependency. pandas users already get this if they try to use `df.to_excel`, `df.plot`, or `df.to_markdown`, so it doesn't feel terrible
- only add the interchange protocol package as a dependency (and not pandas!). If we can keep it pure-Python and under 50 kilobytes, I don't foresee objections; it's a small price to pay in exchange for interoperability. They could even just vendor it.

*I don't think any such libraries currently exist
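The runtime-error option in the first bullet could follow the same pattern as pandas' optional-dependency errors; here's a rough stand-alone sketch (the helper name is made up for illustration):

```python
import importlib

def require_optional(name: str):
    """Hypothetical helper in the spirit of the error pandas raises
    when e.g. df.to_excel is called without openpyxl installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        raise ImportError(
            f"{name} is required for this feature. "
            f"Install it with `pip install {name}`."
        ) from None
```

The library fails loudly only for users who actually hit the interchange path, which is why the bullet argues it "doesn't feel terrible".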
On the call today @jorisvandenbossche brought up the C data interface that pyarrow is pushing for, which looks further along in development and would probably make the interchange protocol obsolete. If that materializes I would definitely be in favor of deprecating / removing this.
I guess the only theoretical advantage this would have over the Arrow approach is that we would have the flexibility to serialize non-Arrow types, but I'm not sure if that is really useful in practice
I'm confused, doesn't the dataframe interchange protocol use the Arrow C Data Interface?

If you inspect the third element of the `dtype` property of a `Column`, you'll see a format string from the Arrow C Data Interface:

```python
In [1]: df = pd.DataFrame({'a': [pd.Timestamp('2020-01-01', tz='Asia/Kathmandu')]})

In [2]: df
Out[2]:
                           a
0 2020-01-01 00:00:00+05:45

In [3]: df.__dataframe__().get_column_by_name('a').dtype
Out[3]: (<DtypeKind.DATETIME: 22>, 64, 'tsn:Asia/Kathmandu', '=')
```

`'tsn:Asia/Kathmandu'` means "timestamp [nanoseconds] with timezone 'Asia/Kathmandu'", according to https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings

I'm probably missing something, but could you please elaborate on how it would make the dataframe interchange protocol obsolete?
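To make the format-string point concrete, here is a hypothetical decoder for a few entries from that spec page (only `l`, `g`, and `tsn:` are handled; everything else raises, and the function itself is not part of any library):

```python
def describe_format(fmt: str) -> str:
    """Decode a small subset of Arrow C Data Interface format strings.

    Hypothetical sketch covering only a few cases from the spec at
    https://arrow.apache.org/docs/format/CDataInterface.html
    """
    if fmt == "l":
        return "int64"
    if fmt == "g":
        return "float64"
    if fmt.startswith("tsn:"):
        # timestamp [nanoseconds] with the given timezone
        return f"timestamp[ns, tz={fmt[4:]}]"
    raise ValueError(f"unhandled format string: {fmt!r}")
```

So the `'tsn:Asia/Kathmandu'` above decodes to a nanosecond timestamp in the Asia/Kathmandu timezone, exactly as the spec describes.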
The C Data Interface documentation includes a comparison.

There is definitely a lot of overlap. The difference would really come down to how important we think it is to exchange non-Arrow types
> Hypothetically* speaking, if a library does not already depend on pandas, does not have a special path for pandas ... *I don't think any such libraries currently exist

True, but of course that's rather the point of the interchange protocol: to allow this more easily
> On the call today @jorisvandenbossche brought up the C data interface that pyarrow is pushing for, which looks further along developed and probably would make the interchange protocol obsolete.

While there is certainly overlap, it doesn't make it fully obsolete; there are definitely still use cases for both. In fact, I actually also proposed to add the Arrow protocol to the DataFrame Interchange protocol: data-apis/dataframe-api#279 / data-apis/dataframe-api#342.
The interchange protocol still gives you a nicer Python API to do some inspection, while the Arrow capsule protocol just gives you an opaque capsule (unless you use a library to consume those). So there can be a use case to first get `__dataframe__` to inspect the column names or number of rows, select the columns you need, etc., and then, when you actually want to consume the data, use `__arrow_c_array/stream__` on the dataframe interchange object to access the actual data (instead of accessing the data through the columns and buffers of the interchange protocol). In summary, it's mostly an alternative to the manual access of individual buffer addresses for consumers, but not necessarily for the top-level dataframe object returned by `__dataframe__`.
> I guess the only theoretical advantage this would have over the Arrow approach is that we would have the flexibility to serialize non-Arrow types, but I'm not sure if that is really useful in practice
Indeed, that's the main disadvantage. For example, if you have one dataframe library that implements missing values with a sentinel or byte mask, and interchanges data with another dataframe library that uses the same representation, then passing through Arrow incurs some extra overhead (because it only has one way of representing missing values), while going through the columns/buffers of the interchange protocol allows you to get exactly the data as they are represented by the producing library.
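A toy illustration of the representations mentioned above (NumPy is used here just to show the conversions; the exact layouts real libraries use will differ):

```python
import numpy as np

# The same column with one missing value, represented three ways:
values = np.array([1.0, np.nan, 3.0])                # sentinel: NaN marks missing
byte_mask = np.array([1, 0, 1], dtype="uint8")       # byte mask: 1 = valid
bitmask = np.packbits(byte_mask, bitorder="little")  # Arrow-style validity bitmask

# Round-tripping byte mask <-> bitmask is the kind of conversion a
# sentinel/byte-mask library would pay for when forced through Arrow,
# even if the consumer uses the same native representation.
recovered = np.unpackbits(bitmask, count=3, bitorder="little")
```

Two byte-mask libraries talking through Arrow would each perform a mask conversion that the interchange protocol's raw buffers let them skip.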
> I'm confused, doesn't the dataframe interchange protocol use the Arrow C Data Interface?

The dataframe interchange protocol uses the type format string notation of the Arrow C Data Interface as a way to have a more detailed type description than the generic enum (the first element of the `dtype` tuple). But that's only a format string; it doesn't actually use the core of the Arrow C Data Interface, i.e. the C struct with the actual buffer pointers.
To mention it here as well, I have a PR to add the Arrow C Data Interface (PyCapsule protocol) to pandas: #56587
Thanks, all, for the comments (and ensuing conversations), interesting stuff
> The interchange protocol still gives you a nicer Python API to do some inspection

Indeed! I do think it is nice to be able to do

```python
pandas_df = pd.api.interchange.from_dataframe(some_other_df)
```

and to know that, so long as `some_other_df` implements `__dataframe__`, then `some_other_df` can be converted to pandas.

If I've understood Joris' proposal in #56587, then this would allow for the Python API to be kept, but for it to use (where possible) the Arrow C Data Interface.

This way, plotly / seaborn / altair could keep saying "if you input something which isn't pandas/pyarrow, we can agnostically convert to them"
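A sketch of what that agnostic entry point might look like in such a library (the function name is made up; the pandas calls are the public interchange API):

```python
import pandas as pd

def to_pandas_agnostic(df):
    """Hypothetical entry point: accept pandas directly, otherwise fall
    back to the dataframe interchange protocol."""
    if isinstance(df, pd.DataFrame):
        return df
    if hasattr(df, "__dataframe__"):
        return pd.api.interchange.from_dataframe(df)
    raise TypeError("input does not implement the dataframe interchange protocol")
```

Whether `from_dataframe` walks buffers manually or defers to the Arrow C Data Interface under the hood is then an implementation detail hidden from the calling library.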