avm19 commented on July 16, 2024

I believe the code you are complaining about is in these lines (but it's worth double-checking):

# NaN value cannot be a min/max value
elif is_max:
    ai = -np.inf
else:
    ai = np.inf


Let me explain how I think your issue report could be modified and why.

Scratch that

From my perspective, a nan must generate other nan, an aggregation of nan, must again generate nan

semantically: "An invalid value, cannot be computed, so a transformation of it should result again into an invalid value"
an aggregation (via groupby) of nan, should result into nan

We need to be precise about the quantifier, "all" vs "any", when dealing with aggregation or reduction.

NumPy follows "any nan implies nan":

np.array([0, 1, 2]).max()  # 2 (no nan => no nan)
np.array([0, 1, np.inf]).max()  # inf (no nan => no nan)
np.array([np.nan, 0, 1, np.inf]).max()  # nan (some nan => nan)
(np.array([0, 1, np.inf]) / np.array([0, 1, np.inf])).max()  # nan (some nan => nan)
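For contrast, NumPy also ships an explicit opt-out from "any nan implies nan". A minimal sketch (the variable names are mine, chosen only for illustration) comparing `np.max` with `np.nanmax` on the same array:

```python
import numpy as np

a = np.array([np.nan, 0.0, 1.0, np.inf])

propagating = np.max(a)   # default reduction: a single NaN poisons the result
skipping = np.nanmax(a)   # NaN-aware reduction: NaN entries are ignored

print(propagating, skipping)  # nan inf
```

So in NumPy, ignoring nan is something the caller asks for by name, not the default.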

Pandas follows "all NA implies NA":

pd.Series([0, 1, 2], dtype='Float64').max()  # 2 (no NA => no NA)
pd.Series([0, 1, np.inf], dtype='Float64').max()  # inf (no NA => no NA)
pd.Series([np.nan, 0, 1, np.inf], dtype='Float64').max()  # inf (some NA =/=> NA)
pd.Series([np.nan, np.nan, np.nan], dtype='Float64').max()  # <NA> (all NA => NA)
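The "all NA implies NA" rule is the default only because reductions skip missing values unless told otherwise. A small sketch, assuming the nullable `Float64` dtype used above, of how the `skipna` argument should switch the same Series to the stricter "any NA implies NA" reading:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 0, 1, np.inf], dtype="Float64")

default_max = s.max()             # skipna=True is the default: NA is ignored
strict_max = s.max(skipna=False)  # NA is no longer skipped

print(default_max, strict_max)
```

With `skipna=False` I would expect `<NA>`, but it is worth running this on your own Pandas version.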

To complicate matters, Pandas sometimes treats NA and np.nan differently and sometimes does not. What exactly the semantics of NA and np.nan should be in Pandas is still being decided by senior maintainers in #32265 (which you referenced). The consensus tends to be that NA is a missing value, while np.nan is a bad value. In most cases, missing values can simply be ignored, unlike bad values. This explains why a single bad value np.nan ruins the computation in NumPy, while a single missing value pd.NA does not do the same in Pandas.

Now, to complicate the matter even further, Pandas transforms np.nan into pd.NA:

s = pd.Series([np.nan, 0, 1], dtype="Float64")
s.max()  # 1.0, because max([<NA>, 0, 1]) is 1
(s / s).max()  # <NA>, because max([<NA>, np.nan, 1]) is np.nan which becomes <NA>
  • In the first line, Pandas transforms np.nan (which historically denoted a missing value in Pandas, before nullable arrays were introduced) into <NA>, so the Series is [<NA>, 0, 1].
  • In the second line, as expected and as we saw before, a missing value is simply ignored: max([<NA>, 0, 1]) gives 1.
  • In the third line, s / s becomes [<NA>, np.nan, 1], where 0 / 0, i.e. np.nan, is a bad value that must derail the aggregation, so max([<NA>, np.nan, 1]) gives np.nan. But this is not the end: for some reason, Pandas converts np.nan to <NA> again.
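The three steps above can be modelled without Pandas at all. The sketch below is not how Pandas implements its masked arrays; `masked_max` is a hypothetical helper of mine that only encodes the rule "skip missing entries, let bad values propagate", with `None` standing in for <NA>:

```python
import numpy as np

def masked_max(values, mask):
    """Hypothetical helper: skip masked (missing) entries, let NaN (bad) propagate."""
    valid = values[~mask]
    if valid.size == 0:
        return None  # stand-in for <NA>: nothing left to aggregate
    return valid.max()  # ndarray.max propagates NaN

# s = [<NA>, 0, 1]: the first slot is masked, so its stored value is irrelevant
s_values = np.array([np.nan, 0.0, 1.0])
s_mask = np.array([True, False, False])

# s / s = [<NA>, nan, 1]: 0 / 0 is written as np.nan directly, mask unchanged
q_values = np.array([np.nan, np.nan, 1.0])
q_mask = s_mask.copy()

max_s = masked_max(s_values, s_mask)  # missing value skipped
max_q = masked_max(q_values, q_mask)  # bad value propagates

print(max_s, max_q)  # 1.0 nan
```

Under this model the reduction of s / s ends in nan, not in <NA>; the final conversion to <NA> is the extra step Pandas adds on top.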

The expected behaviour you propose, @glaucouri, would equate pd.NA to np.nan. I don't think the council of maintainers would support this. Therefore I suggest reframing your issue differently:

I misunderstood your suggestion initially. You indeed insist on treating np.nan as an invalid value consistently in aggregation functions. I personally care more about consistency, so here is another example of the supposed bug:

s = pd.Series([np.nan, 0, 1], dtype="Float64")
(s / s).max()  # <NA>
(s / s).groupby([9, 9, 9]).max().iat[0]  # 1.0

The last two lines were expected to give the same result (whatever it should be).
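If it helps the issue report, the comparison can be turned into a small self-checking script. This is only a sketch: `same_result` is a hypothetical helper of mine, and the script makes no claim about which answer is right, only whether the two routes agree on the installed Pandas version:

```python
import numpy as np
import pandas as pd

def same_result(a, b):
    """Hypothetical helper: compare two scalars, treating NA/NaN as equal to each other."""
    a_missing, b_missing = pd.isna(a), pd.isna(b)
    if a_missing or b_missing:
        return bool(a_missing and b_missing)
    return bool(a == b)

s = pd.Series([np.nan, 0, 1], dtype="Float64")
q = s / s

direct = q.max()                               # plain reduction
grouped = q.groupby([9, 9, 9]).max().iat[0]    # same data, one group

agree = same_result(direct, grouped)
print(pd.__version__, direct, grouped, agree)
```

Printing the version alongside the two values makes it easy to see whether the discrepancy is still present in a given release.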
