Comments (17)
I am wondering now if it is even worth trying to support string aliases or if we should push users towards using a more explicit dtype construction. This would be a change from where we are today but could be better in the long run (?)
From a typing perspective, supporting all the different string versions of valid types for dtype is a PITA in pandas-stubs. So I'd be supportive of just having a class hierarchy to represent valid dtypes.
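For illustration, pandas' public `pd.api.types.pandas_dtype` helper shows the problem: many string spellings resolve to dtype objects, and a stub file has to enumerate every accepted alias to type the `dtype` argument precisely.

```python
import pandas as pd

# Several accepted spellings of "64-bit integer"; stub files have to
# enumerate every such alias to type the `dtype` argument precisely.
for alias in ["int64", "i8", "Int64"]:
    print(alias, "->", repr(pd.api.types.pandas_dtype(alias)))
```

With an explicit class hierarchy, users would pass dtype objects like `pd.Int64Dtype()` directly, and the stubs could simply accept instances of a common base class.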
Having said that, if we are to deprecate the strings, we'd probably need a PDEP for that....
from pandas.
To your original point, I very much agree with this (at least for the physical storage, not necessarily for nullability semantics because I personally think we should move to just having one nullability semantic, but that's the topic for another PDEP)
Is this in reference to how nulls are stored or how they are expressed to the end user? Storage-wise I feel like it would be a mistake to stray from the Arrow implementation
Meant to tag @jorisvandenbossche
I like this idea, though as I mentioned at the sprint I think we should avoid "backend". Maybe dtype "family"?
Maybe "type provider"?
Thinking through it some more, what I suggested above for type_category[type_provider, nullability_provider] won't always work as a pattern, because there are still types that accept more arguments, e.g. datetime, pa.list_, pa.dictionary, etc...
I am wondering now if it is even worth trying to support string aliases or if we should push users towards using a more explicit dtype construction. This would be a change from where we are today but could be better in the long run (?)
As an exercise I tried to map out all of the types that pandas does today support (or reasonably could in the near term) and place in a hierarchy. Here is what I was able to come up with:
Tagging @pandas-dev/pandas-core in case this is of use to the larger team
graphviz used to build this:
digraph type_graph {
node [shape=box];
"type"
"type" -> "scalar"
"scalar" -> "numeric"
"numeric" -> "integral"
"integral" -> "signed"
subgraph cluster_signed {
edge [style=invis]
node [fillcolor="lightgreen" style=filled] "np.int8";
node [fillcolor="lightgreen" style=filled] "np.int16";
node [fillcolor="lightgreen" style=filled] "np.int32";
node [fillcolor="lightgreen" style=filled] "np.int64";
node [fillcolor="lightblue" style=filled] "pd.Int8Dtype";
node [fillcolor="lightblue" style=filled] "pd.Int16Dtype";
node [fillcolor="lightblue" style=filled] "pd.Int32Dtype";
node [fillcolor="lightblue" style=filled] "pd.Int64Dtype";
node [fillcolor="lightgray" style=filled] "pa.int8";
node [fillcolor="lightgray" style=filled] "pa.int16";
node [fillcolor="lightgray" style=filled] "pa.int32";
node [fillcolor="lightgray" style=filled] "pa.int64";
"np.int8" -> "np.int16" -> "np.int32" -> "np.int64"
"pd.Int8Dtype" -> "pd.Int16Dtype" -> "pd.Int32Dtype" -> "pd.Int64Dtype"
"pa.int8" -> "pa.int16" -> "pa.int32" -> "pa.int64"
}
"signed" -> "pd.Int8Dtype" [arrowsize=0]
"integral" -> "unsigned"
subgraph cluster_unsigned {
edge [style=invis]
node [fillcolor="lightgreen" style=filled] "np.uint8";
node [fillcolor="lightgreen" style=filled] "np.uint16";
node [fillcolor="lightgreen" style=filled] "np.uint32";
node [fillcolor="lightgreen" style=filled] "np.uint64";
node [fillcolor="lightblue" style=filled] "pd.UInt8Dtype";
node [fillcolor="lightblue" style=filled] "pd.UInt16Dtype";
node [fillcolor="lightblue" style=filled] "pd.UInt32Dtype";
node [fillcolor="lightblue" style=filled] "pd.UInt64Dtype";
node [fillcolor="lightgray" style=filled] "pa.uint8";
node [fillcolor="lightgray" style=filled] "pa.uint16";
node [fillcolor="lightgray" style=filled] "pa.uint32";
node [fillcolor="lightgray" style=filled] "pa.uint64";
"np.uint8" -> "np.uint16" -> "np.uint32" -> "np.uint64"
"pd.UInt8Dtype" -> "pd.UInt16Dtype" -> "pd.UInt32Dtype" -> "pd.UInt64Dtype"
"pa.uint8" -> "pa.uint16" -> "pa.uint32" -> "pa.uint64"
}
"unsigned" -> "pd.UInt8Dtype" [arrowsize=0]
"numeric" -> "floating point"
subgraph cluster_floating {
edge [style=invis]
node [fillcolor="lightgreen" style=filled] "np.float32";
node [fillcolor="lightgreen" style=filled] "np.float64";
node [fillcolor="lightblue" style=filled] "pd.Float32Dtype";
node [fillcolor="lightblue" style=filled] "pd.Float64Dtype";
node [fillcolor="lightgray" style=filled] "pa.float32";
node [fillcolor="lightgray" style=filled] "pa.float64";
"np.float32" -> "np.float64"
"pd.Float32Dtype" -> "pd.Float64Dtype"
"pa.float32" -> "pa.float64"
}
"floating point" -> "pd.Float32Dtype" [arrowsize=0]
"numeric" -> "fixed point"
subgraph cluster_fixed {
edge [style=invis]
node [fillcolor="lightgray" style=filled] "pa.decimal128";
node [fillcolor="lightgray" style=filled] "pa.decimal256";
"pa.decimal128" -> "pa.decimal256"
}
"fixed point" -> "pa.decimal128" [arrowsize=0]
"scalar" -> "boolean"
subgraph cluster_boolean {
edge[style=invis]
node[fillcolor="lightgreen" style=filled] "np.bool_";
node[fillcolor="lightblue" style=filled] "pd.BooleanDtype";
node[fillcolor="lightgray" style=filled] "pa.bool_";
}
"boolean" -> "pd.BooleanDtype" [arrowsize=0]
"scalar" -> "temporal"
"temporal" -> "date"
subgraph cluster_date {
edge [style=invis]
node [fillcolor="lightgray" style=filled] "pa.date32"
node [fillcolor="lightgray" style=filled] "pa.date64"
"pa.date32" -> "pa.date64"
}
"date" -> "pa.date32" [arrowsize=0]
"temporal" -> "datetime"
subgraph cluster_timestamp {
edge [style=invis]
node [fillcolor="lightblue" style=filled] "datetime64[unit, tz]";
node [fillcolor="lightgray" style=filled] "pa.timestamp(unit, tz)";
"datetime64[unit, tz]" -> "pa.timestamp(unit, tz)" [style=invis]
}
"datetime" -> "datetime64[unit, tz]" [arrowsize=0]
"temporal" -> "duration"
subgraph cluster_duration {
edge [style=invis]
node [fillcolor="lightblue" style=filled] "timedelta64[unit]";
node [fillcolor="lightgray" style=filled] "pa.duration(unit)";
"timedelta64[unit]" -> "pa.duration(unit)" [style=invis]
}
"duration" -> "timedelta64[unit]" [arrowsize=0]
"temporal" -> "interval"
"pa.month_day_nano_interval" [fillcolor="lightgray" style=filled]
"interval" -> "pa.month_day_nano_interval"
"scalar" -> "binary"
subgraph cluster_binary {
edge [style=invis]
node [fillcolor="lightgray" style=filled] "pa.binary";
node [fillcolor="lightgray" style=filled] "pa.large_binary";
"pa.binary" -> "pa.large_binary"
}
"binary" -> "pa.binary"
"binary" -> "string"
subgraph cluster_string {
edge [style=invis]
node [fillcolor="lightgreen" style=filled] "object";
node [fillcolor="lightgreen" style=filled] "np.StringDType";
node [fillcolor="lightblue" style=filled] "pd.StringDtype";
node [fillcolor="lightgray" style=filled] "pa.string";
node [fillcolor="lightgray" style=filled] "pa.large_string";
node [fillcolor="lightgray:lightgreen" style=filled] "string[pyarrow_numpy]";
"object" -> "np.StringDType"
"pa.string" -> "pa.large_string"
}
"string" -> "pa.string" [arrowsize=0]
"scalar" -> "categorical"
subgraph cluster_categorical {
edge [style=invis]
node [fillcolor="lightblue" style=filled] "pd.CategoricalDtype";
node [fillcolor="lightgray" style=filled] "pa.dictionary(index_type, value_type)";
"pd.CategoricalDtype" -> "pa.dictionary(index_type, value_type)"
}
"categorical" -> "pd.CategoricalDtype" [arrowsize=0]
"scalar" -> "sparse"
"pd.SparseDtype(dtype)" [fillcolor="lightblue" style=filled];
"sparse" -> "pd.SparseDtype(dtype)" [arrowsize=0]
"type" -> "aggregate"
"aggregate" -> "list"
subgraph cluster_list {
edge [style=invis]
node [fillcolor="lightgray" style=filled] "pa.list_(value_type)";
node [fillcolor="lightgray" style=filled] "pa.large_list(value_type)";
"pa.list_(value_type)" -> "pa.large_list(value_type)"
}
"list" -> "pa.list_(value_type)" [arrowsize=0]
"aggregate" -> "struct"
"pa.struct(fields)" [fillcolor="lightgray" style=filled]
"struct" -> "pa.struct(fields)" [arrowsize=0]
"aggregate" -> "dictionary"
"dictionary" -> "pa.dictionary(index_type, value_type)" [arrowsize=0]
"pa.map(index_type, value_type)" [fillcolor="lightgray" style=filled]
"dictionary" -> "pa.map(index_type, value_type)" [arrowsize=0]
}
I am wondering now if it is even worth trying to support string aliases or if we should push users towards using a more explicit dtype construction.
I would be supportive of this as well. Especially for dtypes as strings that take parameters (timezone types, decimal types), it would be great to avoid string parsing in favor of direct dtype object construction.
In any case I am just hoping we can start to detach the logical type from the physical storage / nullability semantics with a well-defined pattern.
To your original point, I very much agree with this (at least for the physical storage, not necessarily for nullability semantics because I personally think we should move to just having one nullability semantic, but that's the topic for another PDEP)
This is a topic that I brought up last summer during the sprint, but never got around to writing up publicly. The summary is that I would like to see us move to just having "pandas" dtypes, at least for the majority of users who don't need to know the lower-level details.
Most users just need to know they have e.g. an "int64" or "string" column, and don't have to care whether that is under the hood stored using a single numpy array, a combo of numpy arrays (our masked arrays), or a pyarrow array.
The current string aliases for non-default dtypes are, I think, mostly a band-aid to let people more easily specify those dtypes, and I fully agree those aren't very pretty. I do think it will be hard (and maybe not even desirable) to fully do away with string aliases, though, at least for the default data types, because their use is so widespread.
But IMO we should at least make the alternative to string aliases, constructing dtypes programmatically, better supported and more consistent (e.g. so a user can just do pd.Series(..., dtype=pd.int64()) or pd.Series(..., dtype=pd.Int64Dtype()) and get the default int64 dtype based on their settings (which currently would be the numpy dtype, but could also be a masked or pyarrow dtype)).
So maybe then for each category in the type hierarchy above we have wrappers with signatures like:
class pd.int8(dtype_backend="pyarrow"): ...
class pd.string(dtype_backend="pyarrow", nullability="numpy"): ...
class pd.datetime(dtype_backend="pyarrow", unit="us", tz=None): ...
class pd.list(value_type, dtype_backend="pyarrow"): ...
class pd.categorical(key_type="infer", value_type="infer", dtype_backend="pandas"): ...
I know @jbrockmendel prefers something besides dtype_backend, but I'm keeping that for now for consistency with the I/O methods.
Having said that, if we are to deprecate the strings, we'd probably need a PDEP for that....
I was thinking this as well
The current string aliases for non-default dtypes are I think mostly a band-aid to let people more easily specify those dtypes, and I fully agree those aren't very pretty. I do think it will be hard (or even desirable) to fully do away with string aliases though, at least for the default data types, because this is so widespread.
Yea, this would be a long process. I think what's hard about the string alias is that it only works for very basic types. It definitely has been, and would continue to be, relatively easy for users to just say "int64" and get a 64-bit integer irrespective of what it is backed by, but if the user wants to then create a list column, they can't just do "list". I think users will end up with a Frankenstein mix of string aliases alongside arguments like dtype=pd.ArrowDtype(pa.list_(pa.string())), which I find confusing.
Yea, this would be a long process. I think what's hard about the string alias is that it only works for very basic types. It definitely has been, and would continue to be, relatively easy for users to just say "int64" and get a 64-bit integer irrespective of what it is backed by, but if the user wants to then create a list column, they can't just do "list". I think users will end up with a Frankenstein mix of string aliases alongside arguments like dtype=pd.ArrowDtype(pa.list_(pa.string())), which I find confusing.
I agree. One possibility to consider is to limit the number of string aliases to simple types ("int", "float", "string", "object", "datetime", "timedelta") that default to something based on the default backends, and even default sizes (e.g., "int" means "int64"), as I guess only a few of the strings are really used most often.
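As a sketch of that idea, the reduced alias set could be a small lookup that expands short names to sized defaults before normal dtype resolution. Nothing like this exists in pandas today; the table and function names are illustrative.

```python
import pandas as pd

# Hypothetical reduced alias table; anything not listed falls through
# to pandas' existing resolution.
_SHORT_ALIASES = {
    "int": "int64",
    "float": "float64",
    "datetime": "datetime64[ns]",
    "timedelta": "timedelta64[ns]",
}


def resolve_dtype(alias: str):
    """Expand a short alias, then fall back to pandas' own resolution."""
    return pd.api.types.pandas_dtype(_SHORT_ALIASES.get(alias, alias))


print(resolve_dtype("int"))    # int64
print(resolve_dtype("float"))  # float64
```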
I found the notebook that I presented at the sprint last summer. It's a little bit off topic for the discussion just about string aliases, but I think it is relevant for the bigger picture (that we need to look at anyway if considering to move away from string aliases), so just dumping the content here (updated a little bit).
I'd like to have "pandas data types" with a consistent interface:
- all data types are instances of our own DType class (hierarchy)
- most users won't have to care about numpy vs arrow (and at least I think we should allow you to write code that is agnostic to it)
- we might not want to follow all naming choices and implementation details of arrow
For example, for datetime-like data, we currently have:
# current options
ser.astype("datetime64[ms]")
# vs
ser.astype("timestamp[us][pyarrow]")
# How to specify a "datetime" dtype being agnostic about the exact backend you are using?
# -> should use a single name and will pick the default backend based on your settings
ser.astype(pd.datetime("ns"))
# or
ser.astype(pd.timestamp("ns"))
# for users that want to be explicit
ser.astype(pd.datetime("ns", backend=".."))
Another example: we currently have pd.ArrowDtype("date64") or "date64[pyarrow]", but if we want to enable a date dtype by default, users shouldn't need to know this is stored using pyarrow under the hood, so this could be pd.date() or "date"?
Logical data types vs physical data types:
- For the "string" data type, we can store that with python objects or pyarrow strings (and now potentially also numpy's new string dtype), and then further for pyarrow, there are still 5 physical data types that all represent strings: string, large_string, string_view, dictionary[string], run_length_encoded[string] -> they all hold "logical" string values
For pandas, I think most users should care about logical data types, and not too much about the physical data type (and we can choose the best default, and advanced users can give hints which to use for performance optimizations)
Assuming we want a single pandas interface to all dtypes, we need to decide:
- how to let users construct them: functional (pd.string()) vs class constructors (pd.StringDtype())
- how to structure the DType classes: a single class per logical type that is "backend-parametrized" (e.g. pd.StringDtype(backend="arrow"), pd.StringDtype(backend="numpy")), or separate classes based on the physical storage (what we have right now with e.g. pd.Int64Dtype() and pd.ArrowDtype(pa.int64()))
We can either use "backend-parametrized" classes, or hide the classes a bit more and use dtype constructor factory functions.
Backend-parametrized classes:
- pd.StringDtype(), pd.StringDtype(backend="arrow"), pd.StringDtype(backend="numpy")
- isinstance(dtype, pd.StringDtype) keeps working
- but that means choosing the approach of the current StringDtype with different backends, instead of ArrowDtype("string")
Or we could have different classes, but then we definitely need the functional interface and dtype-checking helpers (because isinstance then doesn't work):
- pd.string(), pd.string(backend="arrow"), pd.string(backend="numpy")
- pd.api.types.is_string(..) (and maybe pd.string(backend="arrow", storage="string_view")?)
In this case we are more free to keep whatever class structure we want under the hood.
I forget the details, but remember finding Joris's presentation at the sprint compelling.
Another example: we currently have pd.ArrowDtype("date64") or "date64[pyarrow]", but if we want to enable a date dtype by default, users shouldn't need to know this is stored using pyarrow under the hood, so this could be pd.date() or "date"?
This is an interesting example, but do we even need to support the pyarrow date64? I'm not really clear what advantages that has over date32. Per the hierarchy above I would just abstract this as pd.date(), which under the hood would only use pyarrow's date32. It would be a suboptimal API if we had to do something like pd.date(backend="pyarrow", size=32), but I'm not sure how likely that is.
Outside of date types I do see that issue with strings where dtype_backend="pyarrow"
would leave it open to interpretation if you wanted pa.string(), pa.large_string(), or any of the other pyarrow string types you already mentioned.
Either we can use "backend-parametrized" classes or either hide classes a bit more and use dtype constructor factory functions:
In an ideal world I would be indifferent, but the problem with the class constructors is that they already exist (pd.StringDtype, pd.Int64Dtype, etc...). Repurposing them might only add to the confusion
Overall though I agree with your sentiment of starting at a place where we think in terms of logical data types foremost, which should cover the majority of use cases, and then giving some control over the physical data types via keyword arguments or options
Either we can use "backend-parametrized" classes or either hide classes a bit more and use dtype constructor factory functions:
In an ideal world I would be indifferent, but the problem with the class constructors is that they already exist (pd.StringDtype, pd.Int64Dtype, etc...). Repurposing them might only add to the confusion
I think we already have both approaches somewhat, so we will need to clean this up to a certain extent whichever choice we make:
- pd.StringDtype(backend="python"|"pyarrow") is an example of using a single dtype class (that users can instantiate) as an interface to multiple implementations of the actual memory/compute (i.e. in this case we actually have multiple ExtensionArray classes that map to this single dtype depending on the backend), although I know we also have pd.ArrowDtype(pa.string()), which then uses a different DType class.
- np.dtype("int64") / pd.Int64Dtype() / pd.ArrowDtype(pa.int64()) is essentially an example of a logical integer type where the entry point is a different class depending on which implementation you want (I know this wasn't necessarily designed together and grew historically, and I am mixing in numpy dtypes as well, but it is the current situation as users face it).
While we could decide to have a single pd.Int64Dtype(backend="numpy"|"pyarrow") parametrized class (mimicking the string case above), we could also decide that we are fine with having both pd.Int64Dtype (numpy based) and pd.ArrowDtype classes, but then I think we would need another entry point for users, like a pd.int64() factory function that can create an instance of either of those classes (depending on your settings, or on a keyword you pass).
So while those class constructors indeed already exist, I think we have to repurpose or change the existing ones (and add new ones) to some extent anyway. And it is also not because we have those classes right now, that we can't decide we want to hide them more from the user by providing an alternative. I don't think there are already that many users that use pd.Int64Dtype()
directly, and (if we would prefer that interface) there is certainly still room to start pushing a functional constructor interface.
To your original point, I very much agree with this (at least for the physical storage, not necessarily for nullability semantics because I personally think we should move to just having one nullability semantic, but that's the topic for another PDEP)
Is this in reference to how nulls are stored or how they are expressed to the end user? Storage-wise I feel like it would be a mistake to stray from the Arrow implementation
In the first place to how they are expressed to the end user, because IMO that's the most important aspect (since we are talking about the user interface how dtypes are specified / presented). Personally I would also prefer to use a consistent implementation storage-wise, but that's more of an implementation detail that we could discuss/compromise per dtype.
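The user-facing difference between the current nullability semantics is easy to see today, since the missing-value sentinel depends on the dtype's backend:

```python
import numpy as np
import pandas as pd

# numpy-backed float column: missing values surface as NaN
s_np = pd.Series([1.0, None])
# masked (nullable) integer column: missing values surface as pd.NA
s_ma = pd.Series([1, None], dtype="Int64")

print(s_np[1], s_ma[1])  # nan <NA>
```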