Comments (7)
>>> import pandas as pd
>>> df = pd.DataFrame({'column': [0.0, 1.0, 2.0]})
>>> df.dtypes
column    float64
dtype: object
>>> df.convert_dtypes()
   column
0       0
1       1
2       2
>>> df.convert_dtypes().dtypes
column    Int64
dtype: object
I can confirm that the issue is reproducible and is not the intended behavior. Creating a dataframe that has the same data as newdf gives the intended behavior, as shown above.
from pandas.
It seems like convert_dtypes does not do any conversion if the existing dtypes already support pd.NA.
This might be intended, because originally the point of convert_dtypes was to encourage users to use pandas ExtensionDtypes instead of numpy dtypes, but that conflicts with the documentation: "Convert columns to the best possible dtypes".
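The difference described above can be sketched as follows. This is a minimal illustration, assuming the behavior reported in this thread; the variable names are made up for the example, and results may differ in other pandas versions.

```python
import pandas as pd

# Same values, two different starting dtypes.
numpy_backed = pd.DataFrame({'column': [0.0, 1.0, 2.0]})  # numpy float64
nullable = numpy_backed.astype('Float64')                 # already supports pd.NA

from_numpy = numpy_backed.convert_dtypes()
from_nullable = nullable.convert_dtypes()

# The numpy-backed column is converted to Int64.
print(from_numpy.dtypes)
# The thread reports that the Float64 column is left as Float64.
print(from_nullable.dtypes)
```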
from pandas.
Thanks for the issue @caballerofelipe, but this is the expected behavior of convert_dtypes. As mentioned, it's only intended to convert to a dtype that supports pd.NA.
I believe the functionality you're expecting is in to_numeric(downcast=), so closing: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
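For reference, a small sketch of the to_numeric(downcast=) suggestion, assuming plain numpy-backed float64 input (the Series names are made up for this example). Note that downcast picks the smallest numpy dtype that can hold the values, which is not the same goal as moving to a nullable Int64.

```python
import pandas as pd

# Whole-number floats can be downcast to a small integer dtype.
s = pd.Series([0.0, 1.0, 2.0])
as_int = pd.to_numeric(s, downcast='integer')
print(as_int.dtype)  # int8

# If any value has a fractional part, the column stays float64.
with_fraction = pd.to_numeric(pd.Series([0.0, 1.0, 3.3]), downcast='integer')
print(with_fraction.dtype)  # float64
```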
from pandas.
@mroeschke don't you think the doc is incorrect though?
It says it converts columns to the best possible dtypes that support pd.NA, but that is not actually the case. If it were, it should have converted from Float64 to Int64.
from pandas.
I guess "best possible" is a bit too subjective, so I would call the doc unclear rather than incorrect. A doc improvement changing "best possible" to "convert a numpy type to a type that supports pd.NA" would probably be better.
from pandas.
I believe that using Int64 instead of Float64 is "best" when I don't need a decimal number. For instance, from the point of view of legibility, it's easier to read an int than a number with a point and a zero (without doing some formatting). Also, the maximum possible numbers are bigger.
Is there a processing reason for not changing from Float64 to Int64? Is it expensive somehow? (This is not a rhetorical question; I don't know the answer.)
Also, is it more expensive than going from float64 (lowercase f) to Int64 (capital I)?
Also, maybe the function could have a parameter to make it do what I thought it was going to do?
from pandas.
So I found a workaround for what I want: allow pandas to change to Int64 when no decimals are present.
In Step 6, instead of doing newdf.convert_dtypes(), to force a simpler dtype you can do newdf.astype('object').convert_dtypes(). It's one more step than I would have liked, but it works.
Full Example
import pandas as pd

df = pd.DataFrame({'column': [0.0, 1.0, 2.0, 3.3]})
df = df.convert_dtypes()
print(df.dtypes)
# Returns
# column Float64
# dtype: object
newdf = df.iloc[:-1]
print(newdf)
# Returns
# column
# 0 0.0
# 1 1.0
# 2 2.0
newdf_convert = newdf.convert_dtypes()
print(newdf_convert.dtypes)
print(newdf_convert)
# Returns
# column Float64
# dtype: object
# column
# 0 0.0
# 1 1.0
# 2 2.0
newdf_astype_convert = newdf.astype('object').convert_dtypes()
print(newdf_astype_convert.dtypes)
print(newdf_astype_convert)
# Returns
# column Int64
# dtype: object
# column
# 0 0
# 1 1
# 2 2
# You could also use a more complex way to obtain int64 (lowercase i) or float64 (lowercase f)
newdf_astype_convert_int64 = (
    newdf
    .astype('object')
    .convert_dtypes()  # To dtype with pd.NA
    .astype('object')
    .replace(pd.NA, float('nan'))  # Remove pd.NA created before
    .infer_objects()
)
print(newdf_astype_convert_int64.dtypes)
print(newdf_astype_convert_int64)
# Returns
# column int64
# dtype: object
# column
# 0 0
# 1 1
# 2 2
The function convert_dtypes could have a parameter 'simplify_dtypes' (or maybe a better keyword that I haven't thought of) that would do the same thing without much implementation effort: convert_dtypes(simplify_dtypes=True) would do .astype('object') before the actual conversion.
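Until something like that exists, the workaround can be wrapped in a small helper. This is only a sketch: convert_dtypes_simplified is a hypothetical name, not a pandas API, and it has only been reasoned through for columns without missing values, as in the full example.

```python
import pandas as pd


def convert_dtypes_simplified(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cast to object first so that
    convert_dtypes re-infers every column from its values."""
    return df.astype('object').convert_dtypes()


# Same setup as the full example: a Float64 column whose
# remaining values are all whole numbers.
df = pd.DataFrame({'column': [0.0, 1.0, 2.0, 3.3]}).convert_dtypes()
newdf = df.iloc[:-1]

simplified = convert_dtypes_simplified(newdf)
print(simplified.dtypes)  # column is Int64, as in the full example
```

The extra round trip through object dtype copies the data, so on large frames it may be worth restricting the helper to the float columns only.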
Also, you could use this to simplify "even further" to int64 (lowercase i) or float64 (lowercase f); see the full example. You would do: df.astype('object').convert_dtypes().astype('object').replace(pd.NA, float('nan')).infer_objects(). However, you might want to do this inside a with pd.option_context('future.no_silent_downcasting', True): block because of the replace() in there (see this issue).
Edit: Added .replace(pd.NA, float('nan')) in the example to allow conversion to float64 when a nan is present.
from pandas.