Comments (2)

deanm0000 commented on July 20, 2024

You can do this with coalesce

pl.select(pl.coalesce(
    pl.Series(DATES).str.to_date(x, strict=False)
    for x in ['%Y-%m-%d', '%Y/%m/%d', '%Y.%m.%d', '%d-%m-%Y']
    ))

As far as making this the default behavior or available with a parameter, I'd disagree, as your example switches the order of month and day. I think it's better for users to specify what they want than for polars to have some hard-coded guesses. It is already lenient about changing delimiters. Consider that 01/01/2020 is valid as either d-m-y or m-d-y and that some countries always use m-d-y and never d-m-y, so it's not clear how that should be resolved.
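To make the ambiguity concrete, here is a minimal sketch using only the standard library `datetime` module rather than polars. The helper name `parse` and the sample strings other than 01/01/2020 are made up for illustration.

```python
from datetime import date, datetime
from typing import Optional


def parse(value: str, fmt: str) -> Optional[date]:
    """Return the parsed date, or None if `value` does not match `fmt`."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None


DMY, MDY = "%d/%m/%Y", "%m/%d/%Y"

# Both readings succeed and agree, so this value cannot reveal the format.
same = (parse("01/01/2020", DMY), parse("01/01/2020", MDY))

# Both readings succeed but disagree: 1 February versus 2 January.
differ = (parse("01/02/2020", DMY), parse("01/02/2020", MDY))

# Only the day-first reading is valid, because there is no month 13.
only_dmy = (parse("13/01/2020", DMY), parse("13/01/2020", MDY))
```

Only the third kind of value tells a parser which ordering was intended; the first two parse cleanly under either format.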

Back to your example, we can even simplify the above a bit by using some regex on the delimiters first

pl.select(
    pl.coalesce(pl.Series(DATES).str.replace_all(r"-|/|\.","-").str.to_date(x, strict=False)
    for x in ['%Y-%m-%d', '%d-%m-%Y'])
)
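For readers who want to see the per-value logic spelled out, the following is a plain-Python stand-in for the pattern above, not polars code: normalize the delimiters, then take the first format that parses, and fall back to `None` the way `strict=False` yields null. The function name `parse_first_match` and the sample values are assumptions for illustration.

```python
import re
from datetime import date, datetime
from typing import Optional

# Same two orderings as in the snippet above, tried in this order.
FORMATS = ["%Y-%m-%d", "%d-%m-%Y"]


def parse_first_match(value: str) -> Optional[date]:
    """Normalize delimiters, then return the first successful parse."""
    normalized = re.sub(r"[-/.]", "-", value)
    for fmt in FORMATS:
        try:
            return datetime.strptime(normalized, fmt).date()
        except ValueError:
            continue  # This format did not match; try the next one.
    return None  # Nothing matched, analogous to a null.


# Hypothetical sample values.
samples = ["2020-01-01", "2020/01/01", "2020.01.01", "01-01-2020", "not a date"]
parsed = [parse_first_match(s) for s in samples]
```

The order of `FORMATS` matters only for values that more than one format can parse; with these two formats a four-digit year can never be read as a day, so the order does not change the result here.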


JulianCologne commented on July 20, 2024

Thanks for the suggestion. Yes, there are workarounds but this is not my point.

My point is that the current Date & Datetime inference is confusing and inconsistent.

IMO, parsing Date or Datetime in a column should allow either

  • only 1 format, determined by the first valid date / datetime; everything else becomes null
  • any format that is available; try everything you have

but not

  • allow some formats and not others.
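The two consistent behaviours in the list above can be sketched in plain Python as follows. This is an illustration only, not how polars implements inference; the candidate format list and the names `parse_single_format` and `parse_any_format` are hypothetical.

```python
from datetime import date, datetime
from typing import List, Optional

# Hypothetical candidate formats, checked in this order.
CANDIDATES = ["%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y"]


def try_parse(value: str, fmt: str) -> Optional[date]:
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None


def parse_single_format(values: List[str]) -> List[Optional[date]]:
    """Option 1: lock in the format of the first valid value; the rest become None."""
    for value in values:
        for fmt in CANDIDATES:
            if try_parse(value, fmt) is not None:
                return [try_parse(v, fmt) for v in values]
    return [None] * len(values)


def parse_any_format(values: List[str]) -> List[Optional[date]]:
    """Option 2: try every candidate format for every value."""
    results: List[Optional[date]] = []
    for value in values:
        parsed = None
        for fmt in CANDIDATES:
            parsed = try_parse(value, fmt)
            if parsed is not None:
                break
        results.append(parsed)
    return results


values = ["2020-01-01", "2020/01/01", "01.01.2020"]
strict_result = parse_single_format(values)
lenient_result = parse_any_format(values)
```

With these inputs, the first option keeps only the value matching the first detected format, while the second parses all three.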

> I'd disagree as your example switches the order of month and day

Disagree here. I have seen it in multiple companies. This happens a lot. Data is appended to CSVs. Excel files are shared or sent around the globe, where people from US, UK, BR, CN, ZA, DE, ... are just copying data into the Excel sheets. You end up with all sorts of formats.
IMO either be strict and allow only 1 format, or allow all of them.
The current implementation does not reflect real-world use cases in any way. Why are different separators okay but not different ordering? Either you have clean data or chaos.

> I think it's better for users to specify what they want than for polars to have some hard coded guesses

100% agree. If you know the format, you should definitely always specify it!!
But for a lot of exploratory work and for very messy data, it is super useful as a first step to let polars try to find something before cleaning up.

> Consider that 01/01/2020 is valid as either d-m-y or m-d-y and that some countries always use m-d-y and never d-m-y so it's not clear how that should be resolved

This is already the case now! Has nothing to do with my issue.
Maybe this should be added to the documentation: currently polars CANNOT parse month first! Only "YMD" or "DMY".

Also, this is the case not only with Date but also with Datetime:

DATES = [
    "2020-01-01",
    "2020/01/01",
    "2020-01-01 1415",
    "2020/01/01 16:17:18",
    "01.01.2020",
]

pl.DataFrame({"date": DATES}).with_columns(
    pl.col("date").str.to_datetime(strict=False).name.suffix("_parsed"),
)

shape: (5, 2)
┌─────────────────────┬─────────────────────┐
│ date                ┆ date_parsed         │
│ ---                 ┆ ---                 │
│ str                 ┆ datetime[μs]        │
╞═════════════════════╪═════════════════════╡
│ 2020-01-01          ┆ 2020-01-01 00:00:00 │ >>> format changed -> No Problem ✅
│ 2020/01/01          ┆ 2020-01-01 00:00:00 │ >>> format changed -> No Problem ✅
│ 2020-01-01 1415     ┆ 2020-01-01 14:15:00 │ >>> format changed -> No Problem ✅
│ 2020/01/01 16:17:18 ┆ 2020-01-01 16:17:18 │ >>> format changed -> No Problem ✅
│ 01.01.2020          ┆ null                │ >>> format changed -> "Error" 💥
└─────────────────────┴─────────────────────┘
