
Comments (10)

mroeschke commented on September 27, 2024

If I were to fix this, should I add logic to process it, or should I remove it (which, as far as I understand, means MultiIndex creation will always use inferred dtypes)?

It appears that there is a similar issue asking this question but it probably needs more discussion first what direction to take #54523

from pandas.

chaoyihu commented on September 27, 2024

Thanks for raising the issue. The problem lies in how None values are handled in MultiIndex.insert.

In the example code, df["one", None, "yes"] = 1 calls MultiIndex.insert, and the MultiIndex object gets updated to:

MultiIndex.insert 1 ('one', None, 'yes')
# before
levels [['one'], ['a'], ['yes']]
codes [[0], [0], [0]]
# updated
new_levels [['one'], ['a', None], ['yes']]
new_codes [[0, 0], [0, 1], [0, 0]]
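The original example isn't reproduced in this thread, so here is a minimal sketch of a setup that exercises this path (the frame shape and values are my assumptions):

```python
import pandas as pd

# Assumed minimal setup: a one-row frame with a 3-level MultiIndex on the columns
cols = pd.MultiIndex.from_tuples([("one", "a", "yes")])
df = pd.DataFrame([[1.0]], columns=cols)

# Assigning to a tuple key containing None triggers MultiIndex.insert on the columns
df["one", None, "yes"] = 1
print(df.columns.levels)
print(df.columns.codes)
```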

If the None value is instead inserted correctly, for example via pd.concat as you mentioned, the MultiIndex objects should look like this:

# df
MultiIndex([('one', 'a', 'yes')],
           )
levels: [['one'], ['a'], ['yes']], codes: [[0], [0], [0]]

# df_add
MultiIndex([('one', nan, 'yes')],
           )
levels: [['one'], [], ['yes']], codes: [[0], [-1], [0]]

# df_concat
MultiIndex([('one', 'a', 'yes'),
            ('one', nan, 'yes')],
           )
levels: [['one'], ['a'], ['yes']], codes: [[0, 0], [0, -1], [0, 0]]

Note that when a key is an NA value, its code is uniformly -1 rather than a position in the level.
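As a runnable sketch of the concat path (df and df_add below are assumed reconstructions of the frames described above):

```python
import numpy as np
import pandas as pd

# Assumed reconstructions of the frames from the example
df = pd.DataFrame([[1.0]], columns=pd.MultiIndex.from_tuples([("one", "a", "yes")]))
df_add = pd.DataFrame([[2.0]], columns=pd.MultiIndex.from_tuples([("one", np.nan, "yes")]))

# Concatenating along the columns keeps the NA key encoded as -1 in the codes
df_concat = pd.concat([df, df_add], axis=1)
print(df_concat.columns.codes)
```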

I have submitted PR #59069 which will hopefully resolve this issue.


KilianKW commented on September 27, 2024

@chaoyihu I came across a similar problem but I'm not sure if I should open a new issue:

Creating a MultiIndex like this

pd.MultiIndex.from_tuples([(1, None), (2,3)], names=["Idx1", "Idx2"])

yields

MultiIndex([(1, nan),
            (2, 3.0)],
           names=['Idx1', 'Idx2'])

Note the conversion from None to nan. Is this behavior intended? If not, how can I circumvent it?


chaoyihu commented on September 27, 2024

@KilianKW Thanks for your question!

I'm not sure if I should open a new issue

I would recommend not opening a new issue, since similar issues such as #56366 already exist. Lookup involving NA was labeled as part of the discussion in the "ice cream agreement", which, as far as I know, was an in-person agreement about distinguishing between NA and NaN that the developers reached some time last year.

Cross-referencing a long discussion on this topic: #32265, and the latest mention of that issue, which links to an updated discussion from about two weeks ago: #59122 (comment).

Is this behavior intended?

Yes, PR #59069 intended that both None and np.nan be treated as NA values, i.e. isna() returns True for both without distinction, and both become nan on lookup.

However, I am not sure whether that will remain the intended behavior from a design perspective; that depends on where the developer team eventually wants to take it.
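A quick sketch of that NA equivalence:

```python
import numpy as np
import pandas as pd

# Both None and np.nan register as NA
print(pd.isna(None), pd.isna(np.nan))  # True True

# Both are treated as missing when placed in a MultiIndex level
mi = pd.MultiIndex.from_tuples([(1, None), (2, np.nan), (3, 4)])
print(mi.to_frame(index=False).isna().sum().tolist())  # [0, 2]
```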

how can I circumvent it?

I think missing values in indices should generally be avoided, except perhaps as a temporary index in an intermediate step. One possible workaround, mentioned in #56366, is to replace NA with another value. If you can provide more context, maybe I can find a better solution for you.
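For instance, a sentinel-value sketch (the sentinel and the sample data are my assumptions, not something prescribed by #56366):

```python
import pandas as pd

raw = [(1, None), (2, 3)]
SENTINEL = -1  # assumed placeholder; pick any value that cannot occur in the real data

# Replace missing keys before building the index, so no NA ever enters it
mi = pd.MultiIndex.from_tuples(
    [(a, SENTINEL if b is None else b) for a, b in raw],
    names=["Idx1", "Idx2"],
)
print(mi.dtypes)  # both levels stay int64
```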


KilianKW commented on September 27, 2024

@chaoyihu Thanks for your response! This clarifies things quite a bit.

My use-case for having None (or pd.NA) values in the index would be the following:

I have an experiment conducted under different parameters, and each run yields a DataFrame with the results. I would like to use the parameters as the index of a MultiIndex DataFrame. One of the parameters is an integer like upper_limit.

In some experiments there is no upper limit, so inf would fit well semantically as an index value (but doesn't work technically, since upper_limit is of type int or Int64). My next guess was None, but that yields a floating-point nan, which doesn't fit upper_limit being an integer.

Is there a way to directly enforce the MultiIndex dtypes at construction time?


chaoyihu commented on September 27, 2024

@KilianKW I see, so you are trying to find a proper value to represent the edge case when the integer param upper_limit reaches infinity.

Is there a way to directly enforce the MultiIndex dtypes at construction time?

I would probably use mixed types in upper_limit. The dtype of that level will be inferred as object, and the indexing is straightforward:

>>> import pandas as pd
>>> import numpy as np
>>> mi = pd.MultiIndex.from_tuples(
...         [
...             (0.1, 100_000_000),
...             (0.1, 'inf'),    # use string 'inf' to represent infinity
...             (0.5, 100_000_000),
...             (0.5, 'inf'),
...         ],
...         names=["float_param", "upper_limit"],
...     )
>>> mi.dtypes
float_param    float64
upper_limit     object
dtype: object
>>> df = pd.DataFrame(np.random.randn(2, 4), columns=mi)
>>> df
float_param       0.1                 0.5          
upper_limit 100000000       inf 100000000       inf
0            0.133185  0.152419 -1.812430  0.486254
1           -0.082580  0.413587 -2.086529 -0.453249
>>> df[0.1][100_000_000]
0    0.133185
1   -0.082580
Name: 100000000, dtype: float64
>>> df[0.1]['inf']
0    0.152419
1    0.413587
Name: inf, dtype: float64

Or, in case you would like to keep the integer dtype of upper_limit:

import pandas as pd
import numpy as np

mi = pd.MultiIndex(
        levels = [
            [0.1, 0.5],  # this is level 0
            [100_000_000, None],  # this is level 1
        ],
        codes = [
            [0, 0, 1, 1],  # location of keys in level 0, i.e.: 0.1, 0.1, 0.5, 0.5
            [0, 1, 0, 1],  # location of keys in level 1, i.e.: 100_000_000, None, 100_000_000, None
        ],
        name=["float_param", "upper_limit"],
        dtype={
            "float_param": pd.Float32Dtype,
            "upper_limit": pd.Int64Dtype,  # nullable integer: https://pandas.pydata.org/docs/user_guide/integer_na.html
        }
    )

print("========== MultiIndex Dtypes =============")
print(mi, "\n", mi.dtypes)

df = pd.DataFrame(np.random.randn(2, 4), columns=mi)

print("========== DataFrame Indexing =============")
print("df\n", df)
print("df[0.1]\n", df[0.1])
print("df[0.1, 100_000_000]\n", df[(0.1, 100_000_000)])
print("df[0.5, None]\n", df[0.5, None])

output:

========== MultiIndex Dtypes =============
MultiIndex([(0.1, 100000000.0),
            (0.1,         nan),
            (0.5, 100000000.0),
            (0.5,         nan)],
           names=['float_param', 'upper_limit'])
float_param    float64
upper_limit    float64
dtype: object
========== DataFrame Indexing =============
df
float_param         0.1                   0.5          
upper_limit 100000000.0       NaN 100000000.0       NaN
0             -0.008430  0.587714   -0.063724  0.722172
1             -1.288321 -0.557332    0.502185 -0.358260
df[0.1]
upper_limit  100000000.0  NaN        
0              -0.008430     0.587714
1              -1.288321    -0.557332
df[0.1, 100_000_000]
0   -0.008430
1   -1.288321
Name: (0.1, 100000000.0), dtype: float64
df[0.5, None]
0    0.722172
1   -0.358260
Name: (0.5, nan), dtype: float64


KilianKW commented on September 27, 2024

@chaoyihu Thanks a lot for your suggestions! Using a mixed index type sounds like an interesting option.

I see that the constructor of pd.MultiIndex actually has a dtype parameter. This is interesting and I have two questions related to that:

  1. Why is the dtype of upper_limit still float64 even though you set it to Int64 explicitly in the constructor of pd.MultiIndex? Is this intended or a bug?
  2. Why don't other methods like pd.MultiIndex.from_tuples have the dtype parameter?


chaoyihu commented on September 27, 2024

@KilianKW

Why is the dtype of upper_limit still float64 even though you set it to Int64 explicitly in the constructor of pd.MultiIndex?

Sorry, you are right; I didn't notice that upper_limit came out as float64. I was under the impression that it used the specified dtype, since df[0.1, 100_000_000] looked up successfully with an integer key. That is itself another inconsistency, since I'm indexing a float64 level with an integer key.

I think the presence of floating point values in that level (in this case nan) probably triggered some type inference logic, introducing the inconsistencies we saw in MultiIndex creation and indexing.

Interestingly, the type inference logic does not reside in the MultiIndex constructor. And to my surprise, the dtype parameter passed to the MultiIndex constructor is actually never accessed in the function logic.

So the takeaway is that the second solution I proposed in my previous reply was wrong: the dtype param is currently non-functional and should not be relied on to set dtypes during MultiIndex creation.

Is this intended or a bug?
Why don't other methods like pd.MultiIndex.from_tuples have the dtype parameter?

I think this is a bug. The missing handling of the dtype param in the MultiIndex constructor may also be why helper methods such as from_tuples do not expose a dtype parameter.

I may try to work out a fix for this. If you are interested, I can keep you updated.


@mroeschke The dtype parameter is passed but not accessed in the MultiIndex constructor. If I were to fix this, should I add logic to process it, or should I remove it (which, as far as I understand, means MultiIndex creation will always use inferred dtypes)?


KilianKW commented on September 27, 2024

@chaoyihu Thanks for your effort. I'd appreciate being updated on this.


chaoyihu commented on September 27, 2024

@KilianKW The issue @mroeschke mentioned gives a workaround: initialize each level as a pd.Index with an explicit dtype and pass them to from_arrays.

mi = pd.MultiIndex.from_arrays(
    [
        pd.Index([0.1, 0.1, 0.5, 0.5], dtype=float),
        pd.Index([100_000_000, None, 100_000_000, None], dtype=pd.Int64Dtype())
    ],
    names=("float_param", "upper_limit")
)

output:
========== MultiIndex Dtypes =============
MultiIndex([(0.1, 100000000),
            (0.1,      <NA>),
            (0.5, 100000000),
            (0.5,      <NA>)],
           names=['float_param', 'upper_limit']) 
 float_param    float64
upper_limit      Int64
dtype: object
========== DataFrame Indexing =============
df
 float_param       0.1                 0.5          
upper_limit 100000000      <NA> 100000000      <NA>
0            1.260549 -0.100241 -1.227271 -0.265970
1           -1.680282 -0.629497  0.195997 -0.131484
df[0.1]
 upper_limit  100000000  <NA>     
0             1.260549  -0.100241
1            -1.680282  -0.629497
df[0.1, 100_000_000]
 0    1.260549
1   -1.680282
Name: (0.1, 100000000), dtype: float64
df[0.5, None]
 0   -0.265970
1   -0.131484
Name: (0.5, nan), dtype: float64
