Comments (12)
I think pip install pandas and conda install pandas should install PyArrow, and possibly Matplotlib and other dependencies. And there should be a way to install pandas without any optional dependencies, pip install pandas[core] and conda install pandas-core, or whatever makes sense and is feasible.
As long as PDEP-10 holds, I think pyarrow is a core package. Outside of that, how much of a difference is this expected to make? I think there is also a downside to having separate packages, because then you start to fragment the user base.
from pandas.
I like the idea of having a "minimal" installation that covers most common use cases and avoids downloading unneeded packages. I would suggest the name minipandas, akin to how miniconda and anaconda are the minimal and maximal versions of the same distribution.
With respect to the pyarrow issue, we'd then have to make sure that minipandas works without pyarrow being installed.
One other thought: I imagine the current test suite would have to be split into tests appropriate for minipandas and pandas, and there would be an additional burden when building distributions. We'd also have to carefully examine the docs to determine which parts need a "full pandas" label to indicate that you need the full package (or specific dependencies) for it to work.
I do think this would be useful but mainly for considering how the code is packaged and less about how dependencies are bundled (which seems to be the focus here?). For pip installations we have the pip extras set up and understandably conda doesn't have something like that (yet). If this re-packaging is to make the conda installation story nicer I'm not sure if it's worth it.
Just noting that core seems to be the "common" suffix for minimal packages in Python too:
https://anaconda.org/conda-forge/jupyter_core
https://anaconda.org/conda-forge/dask-core
https://anaconda.org/conda-forge/poetry-core
https://anaconda.org/conda-forge/botocore
which seems to be the focus here?
My main point is about the UX; on anything else I'm personally flexible, and it can be discussed later.
I think pip install pandas and conda install pandas should install PyArrow, and possibly Matplotlib and other dependencies. And there should be a way to install pandas without any optional dependencies, pip install pandas[core] and conda install pandas-core, or whatever makes sense and is feasible.
The difference is that by default users will get our recommended dependencies, as opposed to now, since the main packages will add them, while still leaving users the option to install a version with no optional dependencies.
Making up the numbers, but if 20% of users have PyArrow now, maybe we'll get to 80% of them, making pandas faster for many users who don't know or don't care much about what to install, and who trust us to provide what they need by default.
I personally don't see the fragmentation problem you mention. This solution has been implemented for decades in the Linux world. If you want KDE, for example, you just install the kde package and you get a notepad, a calculator, a calendar... If you have a reason not to have everything that KDE provides, you can still install kde-core and the specific packages you want. I wouldn't say KDE users are fragmented because of this, or that pandas users will be. We are already dealing with a user base where each individual has a different set of dependencies. We'll affect the percentage of users who have some of the pandas optional dependencies, but other than that I personally don't see a significant change or any drawback. The pandas installed will be exactly the same: the one in pandas-core, which will be installed by both the pandas and the pandas-core packages.
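To make the meta-package idea above concrete, here is a minimal sketch of how the two packages would relate. The package names and the dependency lists are hypothetical and only mirror the proposal in this comment; they are not how pandas is packaged today.

```python
# Hypothetical package graph, for illustration only: "pandas" is a
# meta-package that depends on "pandas-core" plus recommended extras.
PACKAGES = {
    "pandas-core": ["numpy"],
    "pandas": ["pandas-core", "pyarrow", "matplotlib"],
}


def resolve(name, packages=PACKAGES):
    """Return every package that installing `name` would pull in."""
    seen = set()
    stack = [name]
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        # Packages not listed in the graph are treated as leaves.
        stack.extend(packages.get(pkg, []))
    return seen


minimal = resolve("pandas-core")
default = resolve("pandas")
# Under these assumptions, the default install is a superset of the minimal one.
print(sorted(default - minimal))
```

In this sketch both installs contain the same pandas-core, so the pandas code itself is identical and only the set of co-installed packages differs.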
There is already a great mechanism, and all that is needed are some recommendations like installing pandas[all] (or full or kitchen-sink) and possibly other subsets like pandas[io].
I think it would be a mistake to try and redefine pandas to be some huge set of dependencies, and to introduce some other package to be the current pandas.
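As a rough illustration of the extras mechanism mentioned above, the sketch below lists the extras an installed distribution declares, using the standard library's importlib.metadata. The helper name declared_extras is made up for this example, and the result depends on which pandas version (if any) is installed.

```python
from importlib import metadata


def declared_extras(dist_name: str) -> list[str]:
    """Return the extras declared by an installed distribution.

    Hypothetical helper: returns an empty list when the distribution
    is not installed or declares no extras.
    """
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return []
    # "Provides-Extra" is the core metadata field that names each extra.
    return sorted(meta.get_all("Provides-Extra") or [])


# Any name printed here can be requested as: pip install "pandas[<name>]"
print(declared_extras("pandas"))
```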
I think this is well known, but it feels worth stating anyway: no matter how it's implemented, if there are ways of using pandas without pyarrow, then we have to maintain both "pandas with pyarrow" and "pandas without pyarrow", and avoiding that was, to me, the main reason for PDEP-10.
If pyarrow is always opt-in, then I don't see much issue with this. But if we are having e.g. "string[pyarrow] when pyarrow is installed and otherwise numpy object" type inference, then users will have different behavior in pandas itself depending on whether a third party package is installed or not. That seems like a very bad user experience to me.
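The concern above can be sketched in a few lines. This is not pandas' actual inference logic; inferred_string_dtype is a hypothetical rule that only mirrors the scenario described in this comment, where the result depends on whether a third-party package happens to be importable.

```python
import importlib.util


def pyarrow_available() -> bool:
    """Check whether pyarrow can be imported, without importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def inferred_string_dtype(has_pyarrow: bool) -> str:
    # Hypothetical rule from the scenario above: the same input data
    # would get a different dtype depending on the environment.
    return "string[pyarrow]" if has_pyarrow else "object"


# The same script prints different dtypes on machines with and without pyarrow.
print(inferred_string_dtype(pyarrow_available()))
```

Every branch of this kind is also a case that has to be tested and maintained twice, which is the maintenance cost the comment refers to.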
I think this is well known, but it feels worth stating anyway: no matter how it's implemented, if there are ways of using pandas without pyarrow, then we have to maintain both "pandas with pyarrow" and "pandas without pyarrow", and avoiding that was, to me, the main reason for PDEP-10.
That was my understanding of one of the core reasons for PDEP-10. I was one of the few people voting against PDEP-10, but now that it has been voted on, is it not supposed to be accepted and stuck with? Unless a new PDEP or an amendment to it is put forward, surely this is out of scope until then. According to the PDEP, the warning should also be retained and not repealed by a close majority vote, which might also not follow PDEP rules.
I agree, and it's surely not the goal of this issue to cancel PDEP-10. Also, while having two packages could be used to install PyArrow more broadly without requiring it, the scope of what I'm discussing here is not limited to PyArrow and could apply to other dependencies that we recommend (or assume users are most likely to want) but don't want to force, for example Matplotlib.
From the previous discussions, it seems that several people are interested in not moving forward with PDEP-10, at least as is. I fully agree that this issue is not where we want to decide or even discuss it. But if there is interest in implementing the two packages for default and minimal dependencies, I think it can make a difference for future discussions on requiring Arrow.
And clearly, this issue doesn't help with cleaning our codebase of "if pyarrow" checks or with having to deal with two separate cases. The main change I envision is a significant increase in the number of users who have PyArrow installed.
I'm personally +1 on moving forward with PDEP-10, fully requiring PyArrow and keeping the warning, but if many people dislike the PDEP now, I think we'll have to have a new discussion.
There is already a great mechanism, and all that is needed are some recommendations like installing pandas[all] (or full or kitchen-sink) and possibly other subsets like pandas[io].
I think it would be a mistake to try and redefine pandas to be some huge set of dependencies, and to introduce some other package to be the current pandas.
Agreed with this. Extras should stay extras.
IIRC, the -core thing is probably specific to conda-forge, I've never seen it used with a project on PyPI.
There is already a great mechanism, and all that is needed are some recommendations like installing pandas[all] (or full or kitchen-sink) and possibly other subsets like pandas[io].
I think it would be a mistake to try and redefine pandas to be some huge set of dependencies, and to introduce some other package to be the current pandas.
Agree. And we should have more detailed installation instructions to educate users on using extras.
And I think we could have a pandas-core containing all (or some) of the extension modules, so we could have more fine-grained tests and benchmarks. It would also speed up CI and improve the developer experience.
Thanks all for the feedback. It doesn't seem there is much interest in moving forward with this at this point. I guess something similar can be considered in the future for conda-forge, which doesn't have extras like pip. But I'll close this issue, which was specific to making the "normal" pandas package install a subset of optional dependencies, an idea that doesn't have much support.