Comments (2)
You can do this with coalesce
pl.select(pl.coalesce(
pl.Series(DATES).str.to_date(x, strict=False)
for x in ['%Y-%m-%d', '%Y/%m/%d', '%Y.%m.%d', '%d-%m-%Y']
))
As for making this the default behavior or making it available through a parameter, I'd disagree, as your example switches the order of month and day. I think it's better for users to specify what they want than for polars to have some hard-coded guesses. It is already lenient about changing delimiters. Consider that 01/01/2020 is valid as either d-m-y or m-d-y, and that some countries always use m-d-y and never d-m-y, so it's not clear how that should be resolved.
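To make the ambiguity concrete, here is a minimal sketch using only the Python standard library (not polars). The input string is an illustrative value chosen so that the two readings differ:

```python
from datetime import datetime

raw = "01/02/2020"  # illustrative value, not taken from the issue

# The same string is a valid date under both field orders.
as_dmy = datetime.strptime(raw, "%d/%m/%Y").date()  # day first
as_mdy = datetime.strptime(raw, "%m/%d/%Y").date()  # month first

print(as_dmy.isoformat(), as_mdy.isoformat())
```

Both calls succeed, so a parser that guesses has no way to tell which reading the author of the data intended.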
Back to your example, we can even simplify the above a bit by using some regex on the delimiters first
pl.select(
pl.coalesce(pl.Series(DATES).str.replace_all(r"-|/|\.","-").str.to_date(x, strict=False)
for x in ['%Y-%m-%d', '%d-%m-%Y'])
)
Thanks for the suggestion. Yes, there are workarounds, but that is not my point.
My point is that the current Date and Datetime inference is confusing and inconsistent.
IMO, parsing Date or Datetime in a column should allow either
- only one format, determined by the first valid date/datetime, with everything else set to null
- any format that is available; try everything you have
but should not
- allow some formats and not others.
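The two acceptable policies above can be sketched in plain Python. This is only an illustration of the semantics being proposed, not how polars implements inference; the `FORMATS` list and the function names are hypothetical:

```python
from datetime import datetime

# Illustrative candidate formats (an assumption, not polars' internal list).
FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y"]


def try_parse(value, fmt):
    """Return a date if `value` matches `fmt`, otherwise None."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None


def parse_first_format_wins(values, formats=FORMATS):
    """Policy 1: lock onto the format of the first parseable value."""
    chosen = next(
        (f for v in values for f in formats if try_parse(v, f) is not None),
        None,
    )
    if chosen is None:
        return [None] * len(values)
    return [try_parse(v, chosen) for v in values]


def parse_any_format(values, formats=FORMATS):
    """Policy 2: try every format for every value, keep the first match."""
    results = []
    for v in values:
        candidates = (try_parse(v, f) for f in formats)
        results.append(next((d for d in candidates if d is not None), None))
    return results


values = ["2020-01-01", "2020/01/01", "01.01.2020"]
strict_result = parse_first_format_wins(values)
lenient_result = parse_any_format(values)
print(strict_result)
print(lenient_result)
```

Under the first policy only the first value parses, because the other two do not match the locked format. Under the second policy all three values parse.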
"I'd disagree as your example switches the order of month and day"
I disagree here. I have seen it in multiple companies, and it happens a lot. Data is appended to CSVs. Excel files are shared or sent around the globe, and people from the US, UK, BR, CN, ZA, DE, ... just copy data into the sheets. You end up with all sorts of formats.
IMO, either be strict and allow only one format, or allow all of them.
The current implementation does not reflect real-world use cases. Why are different separators okay but not different orderings? Either you have clean data or chaos.
"I think it's better for users to specify what they want than for polars to have some hard coded guesses"
100% agree. If you know the format, you should definitely always specify it!
But for exploratory work with very messy data, it is super useful as a first step to let polars try to find something before cleaning up.
"Consider that 01/01/2020 is valid as either d-m-y or m-d-y and that some countries always use m-d-y and never d-m-y so it's not clear how that should be resolved"
This is already the case now! It has nothing to do with my issue.
Maybe this should be added to the documentation: without an explicit format, polars currently CANNOT parse month-first dates! Only "YMD" or "DMY".
Also, this is the case not only with Date but also with Datetime:
DATES = [
"2020-01-01",
"2020/01/01",
"2020-01-01 1415",
"2020/01/01 16:17:18",
"01.01.2020",
]
pl.DataFrame({"date": DATES}).with_columns(
pl.col("date").str.to_datetime(strict=False).name.suffix("_parsed"),
)
shape: (5, 2)
┌─────────────────────┬─────────────────────┐
│ date ┆ date_parsed │
│ --- ┆ --- │
│ str ┆ datetime[μs] │
╞═════════════════════╪═════════════════════╡
│ 2020-01-01 ┆ 2020-01-01 00:00:00 │ >>> format changed -> No Problem ✅
│ 2020/01/01 ┆ 2020-01-01 00:00:00 │ >>> format changed -> No Problem ✅
│ 2020-01-01 1415 ┆ 2020-01-01 14:15:00 │ >>> format changed -> No Problem ✅
│ 2020/01/01 16:17:18 ┆ 2020-01-01 16:17:18 │ >>> format changed -> No Problem ✅
│ 01.01.2020 ┆ null │ >>> format changed -> "Error" 💥
└─────────────────────┴─────────────────────┘