Comments (10)
If I were to fix this, should I add logic to process it, or should I remove it (which, as far as I understand, means MultiIndex creation will always use inferred dtypes)?
It appears that there is a similar issue asking this question, but it probably needs more discussion first about what direction to take: #54523
from pandas.
Thanks for raising the issue. The problem is with None value handling in MultiIndex.insert. In the example code, df["one", None, "yes"] = 1 calls MultiIndex.insert, and the MultiIndex object gets updated to:
MultiIndex::insert 1 ('one', None, 'yes')
# before
levels [['one'], ['a'], ['yes']]
codes [[0], [0], [0]]
# updated
new_levels [['one'], ['a', None], ['yes']]
new_codes [[0, 0], [0, 1], [0, 0]]
Whereas if the None value is inserted correctly, for example using pd.concat as you have mentioned, the MultiIndex object should be updated to:
# df
MultiIndex([('one', 'a', 'yes')],
)
levels: [['one'], ['a'], ['yes']], codes: [[0], [0], [0]]
# df_add
MultiIndex([('one', nan, 'yes')],
)
levels: [['one'], [], ['yes']], codes: [[0], [-1], [0]]
# df_concat
MultiIndex([('one', 'a', 'yes'),
('one', nan, 'yes')],
)
levels: [['one'], ['a'], ['yes']], codes: [[0, 0], [0, -1], [0, 0]]
Here, when a key is an NA value, its location in the codes is uniformly encoded as -1.
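The -1 encoding can be seen directly on a correctly constructed index; a minimal sketch, assuming a recent pandas version:

```python
import pandas as pd

# When a tuple contains a missing value, pandas does not store it as a
# level value; instead the corresponding code is set to -1.
mi = pd.MultiIndex.from_tuples([("one", "a", "yes"), ("one", None, "yes")])
print(mi.levels[1].tolist())  # ['a'] -- None never appears in the level values
print(mi.codes[1].tolist())   # [0, -1] -- the NA entry is encoded as -1
```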
I have submitted PR #59069 which will hopefully resolve this issue.
@chaoyihu I came across a similar problem, but I'm not sure if I should open a new issue:
Creating a MultiIndex like this:
pd.MultiIndex.from_tuples([(1, None), (2, 3)], names=["Idx1", "Idx2"])
yields:
MultiIndex([(1, nan),
(2, 3.0)],
names=['Idx1', 'Idx2'])
Note the conversion from None to nan. Is this behavior intended? If not, how can I circumvent it?
@KilianKW Thanks for your question!
I'm not sure if I should open a new issue
I would recommend not opening a new issue, since similar issues such as #56366 already exist. Lookup involving NA was labeled as part of the discussion in the ice cream agreement, which, afaik, was an in-person agreement about distinguishing between NA and NaN reached by the developers some time last year.
Cross referencing a long discussion regarding this topic: #32265, and the latest mention of that issue linking to an updated discussion about 2 weeks ago: #59122 (comment).
Is this behavior intended?
Yes, PR #59069 intended that both None and np.nan be treated as NA values, i.e. isna() on both values returns True without distinction, and both become nan when you do lookups.
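A quick check of that behavior (a sketch, assuming a pandas version that includes PR #59069):

```python
import numpy as np
import pandas as pd

# None and np.nan both register as NA once inside a MultiIndex level,
# and both come back as nan from the (float) level values.
mi = pd.MultiIndex.from_tuples([(1, None), (2, np.nan)])
vals = mi.get_level_values(1)
print(pd.isna(vals).tolist())  # [True, True] -- no distinction between the two
```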
However, I am not sure if that would remain the intended behavior from the design perspective - I think that depends on how the developer team want it to be eventually.
how can I circumvent it?
I think in general missing values in indices should be avoided, except maybe as a temporary index in a middle step. One possible workaround I saw, as was mentioned in #56366, is to replace NA with another value. If you could provide more context, maybe I can try to find a better solution for you.
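For illustration, the replace-with-a-sentinel workaround from #56366 might look like this (the sentinel value "missing" is just a placeholder, not something the issue prescribes):

```python
import pandas as pd

# Fill NA index entries with a sentinel string so lookups stay unambiguous.
df = pd.DataFrame({"v": [1, 2]}, index=pd.Index([None, "b"], name="key"))
df.index = df.index.fillna("missing")
print(df.loc["missing", "v"])  # 1
```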
@chaoyihu Thanks for your response! This clarifies things quite a bit.
My use-case for having None (or pd.NA) values in the index would be the following:
I have an experiment conducted under different parameters, which each time yields a DataFrame with the results. I would like to use the parameters as the index of a MultiIndex dataframe. One of the parameters is an integer like upper_limit.
In some experiments there is no upper limit, so inf would fit quite well semantically as an index value (but doesn't work technically, since upper_limit is of type int or Int64). My next guess was using None instead, but that yields a floating-point nan, which doesn't really fit upper_limit being an integer.
Is there a way to directly enforce the MultiIndex dtypes at construction time?
@KilianKW I see, so you are trying to find a proper value to represent the edge case when the integer param upper_limit reaches infinity.
Is there a way to directly enforce the MultiIndex dtypes at construction time?
I would probably use mixed types in upper_limit. The dtype of that level will be inferred as object, and the indexing is straightforward:
>>> import pandas as pd
>>> import numpy as np
>>> mi = pd.MultiIndex.from_tuples(
... [
... (0.1, 100_000_000),
... (0.1, 'inf'), # use string 'inf' to represent infinity
... (0.5, 100_000_000),
... (0.5, 'inf'),
... ],
... names=["float_param", "upper_limit"],
... )
>>> mi.dtypes
float_param float64
upper_limit object
dtype: object
>>> df = pd.DataFrame(np.random.randn(2, 4), columns=mi)
>>> df
float_param 0.1 0.5
upper_limit 100000000 inf 100000000 inf
0 0.133185 0.152419 -1.812430 0.486254
1 -0.082580 0.413587 -2.086529 -0.453249
>>> df[0.1][100_000_000]
0 0.133185
1 -0.082580
Name: 100000000, dtype: float64
>>> df[0.1]['inf']
0 0.152419
1 0.413587
Name: inf, dtype: float64
Or, in case you would like to keep the integer dtype of upper_limit:
import pandas as pd
import numpy as np
mi = pd.MultiIndex(
levels = [
[0.1, 0.5], # this is level 0
[100_000_000, None], # this is level 1
],
codes = [
[0, 0, 1, 1], # location of keys in level 0, i.e.: 0.1, 0.1, 0.5, 0.5
[0, 1, 0, 1], # location of keys in level 1, i.e.: 100_000_000, None, 100_000_000, None
],
names=["float_param", "upper_limit"],
dtype={
"float_param": pd.Float32Dtype,
"upper_limit": pd.Int64Dtype, # nullable integer: https://pandas.pydata.org/docs/user_guide/integer_na.html
}
)
print("========== MultiIndex Dtypes =============")
print(mi, "\n", mi.dtypes)
df = pd.DataFrame(np.random.randn(2, 4), columns=mi)
print("========== DataFrame Indexing =============")
print("df\n", df)
print("df[0.1]\n", df[0.1])
print("df[0.1, 100_000_000]\n", df[(0.1, 100_000_000)])
print("df[0.5, None]\n", df[0.5, None])
output:
========== MultiIndex Dtypes =============
MultiIndex([(0.1, 100000000.0),
(0.1, nan),
(0.5, 100000000.0),
(0.5, nan)],
names=['float_param', 'upper_limit'])
float_param float64
upper_limit float64
dtype: object
========== DataFrame Indexing =============
df
float_param 0.1 0.5
upper_limit 100000000.0 NaN 100000000.0 NaN
0 -0.008430 0.587714 -0.063724 0.722172
1 -1.288321 -0.557332 0.502185 -0.358260
df[0.1]
upper_limit 100000000.0 NaN
0 -0.008430 0.587714
1 -1.288321 -0.557332
df[0.1, 100_000_000]
0 -0.008430
1 -1.288321
Name: (0.1, 100000000.0), dtype: float64
df[0.5, None]
0 0.722172
1 -0.358260
Name: (0.5, nan), dtype: float64
@chaoyihu Thanks a lot for your suggestions! Using a mixed index type sounds like an interesting option.
I see that the constructor of pd.MultiIndex actually has a dtype parameter. This is interesting, and I have two questions related to that:
- Why is the dtype of upper_limit still float64 even though you set it to Int64 explicitly in the constructor of pd.MultiIndex? Is this intended or a bug?
- Why don't other methods like pd.MultiIndex.from_tuples have the dtype parameter?
Why is the dtype of upper_limit still float64 even though you set it to Int64 explicitly in the constructor of pd.MultiIndex?
Sorry, you are right, I didn't notice that upper_limit came out as float64. I was under the impression that it used the specified dtype, since df[0.1, 100_000_000] looked up successfully with an integer key. This is yet another inconsistency, since I'm indexing a float64 level with an integer key.
I think the presence of floating-point values in that level (in this case nan) probably triggered some type inference logic, introducing the inconsistencies we saw in MultiIndex creation and indexing.
Interestingly, the type inference logic does not reside in the MultiIndex constructor. And to my surprise, the dtype parameter passed to the MultiIndex constructor is actually never accessed in the function logic.
So the takeaway here is that the second solution I proposed in the previous reply was wrong. The dtype param is currently not functioning, or at least should not be used as a setter in MultiIndex creation.
Is this intended or a bug?
Why don't other methods like pd.MultiIndex.from_tuples have the dtype parameter?
I think this is a bug. Missing code logic for the dtype param in the MultiIndex constructor might also be the reason why helper methods such as from_tuples do not support customized dtypes.
I may try to work out a fix for this. If you are interested, I can keep you updated.
@mroeschke The dtype parameter is passed but not accessed in the MultiIndex constructor. If I were to fix this, should I add logic to process it, or should I remove it (which, as far as I understand, means MultiIndex creation will always use inferred dtypes)?
@chaoyihu Thanks for your effort. I'd appreciate being updated on this.
@KilianKW The issue mentioned by mroeschke gives a workaround, which is to initialize Index objects with dtypes and pass them to from_arrays:
mi = pd.MultiIndex.from_arrays(
[
pd.Index([0.1, 0.1, 0.5, 0.5], dtype=float),
pd.Index([100_000_000, None, 100_000_000, None], dtype=pd.Int64Dtype())
],
names=("float_param", "upper_limit")
)
output:
========== MultiIndex Dtypes =============
MultiIndex([(0.1, 100000000),
(0.1, <NA>),
(0.5, 100000000),
(0.5, <NA>)],
names=['float_param', 'upper_limit'])
float_param float64
upper_limit Int64
dtype: object
========== DataFrame Indexing =============
df
float_param 0.1 0.5
upper_limit 100000000 <NA> 100000000 <NA>
0 1.260549 -0.100241 -1.227271 -0.265970
1 -1.680282 -0.629497 0.195997 -0.131484
df[0.1]
upper_limit 100000000 <NA>
0 1.260549 -0.100241
1 -1.680282 -0.629497
df[0.1, 100_000_000]
0 1.260549
1 -1.680282
Name: (0.1, 100000000), dtype: float64
df[0.5, None]
0 -0.265970
1 -0.131484
Name: (0.5, nan), dtype: float64