comtravo / ctparse
Parse natural language time expressions in python
Home Page: https://www.comtravo.com
License: MIT License
When calling ctparse("1 hour and 42 minutes")
I expected to get a duration containing 1 hour and 42 minutes, but I got back just the 42 minutes.
With debug enabled I got 2 duration objects: the first for the hour, the second for the minutes.
Copied from wrong issue, originally created by @GeniaSh (thanks for reporting!)
Hi, some common date formats are not parsed, for example " 29/ Nov/2016".
Furthermore, the property that builds the datetime object (def dt(self):) should check whether year, month and day were found; currently we get an error when the date is not complete.
Please see the example below.
val = ctparse('whatever dat: 29/ Nov/2016',datetime.now())
print(val)
2016-X-X X:X (X/X) s=-3.393 p=(111, 'ruleYear')
val.resolution.dt
TypeError: an integer is required (got type NoneType)
TypeError Traceback (most recent call last)
in engine
----> 1 val.resolution.dt
/home/XXXX/.local/lib/python3.6/site-packages/ctparse/types.py in dt(self)
260 return datetime(self.year, self.month, self.day,
261 self.hour or 0,
--> 262 self.minute or 0)
263
264
TypeError: an integer is required (got type NoneType)
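A minimal sketch of the kind of guard requested above. PartialTime is a hypothetical stand-in written for this example, not the actual class in ctparse/types.py; it only illustrates checking year, month and day before building a datetime.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PartialTime:
    """Hypothetical stand-in for a partially resolved time (not ctparse's Time)."""

    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    hour: Optional[int] = None
    minute: Optional[int] = None

    @property
    def has_date(self) -> bool:
        # A datetime can only be built when year, month and day are all known.
        return None not in (self.year, self.month, self.day)

    @property
    def dt(self) -> Optional[datetime]:
        # Return None for incomplete dates instead of raising a TypeError.
        if not self.has_date:
            return None
        return datetime(self.year, self.month, self.day,
                        self.hour or 0, self.minute or 0)


print(PartialTime(year=2016).dt)                    # incomplete date
print(PartialTime(year=2016, month=11, day=29).dt)  # complete date
```

Whether an incomplete date should yield None or raise a more descriptive error is a design choice; the sketch picks None only to keep the example short.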
Currently ctparse depends on numpy, scikit-learn and scipy - only for the relatively simple naive Bayes and vectorizer. Removing these deps, by simply having a private implementation, would significantly reduce package size and issues integrating with other libs.
Only caveat: the naive Bayes is called very often, and at least its prediction runtime is crucial for the speed of ctparse. Any replacement should not be slower than what we currently have.
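To give an idea of how small a dependency-free version could be, here is a rough sketch of a multinomial naive Bayes over token lists in pure Python. The class name, the token names in the usage lines and the Laplace smoothing default are assumptions made for this illustration; it is not ctparse's model code and says nothing about whether it would meet the speed requirement.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Illustrative pure-Python multinomial naive Bayes (hypothetical)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        vocab = sorted({tok for doc in docs for tok in doc})
        counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            counts[label].update(doc)
        n_per_class = Counter(labels)
        self.log_prior = {
            c: math.log(n_per_class[c] / len(labels)) for c in self.classes
        }
        self.log_like = {}
        for c in self.classes:
            denom = sum(counts[c].values()) + self.alpha * len(vocab)
            self.log_like[c] = {
                tok: math.log((counts[c][tok] + self.alpha) / denom)
                for tok in vocab
            }
        return self

    def log_scores(self, doc):
        # Unnormalized log posterior per class; unseen tokens are ignored.
        scores = {}
        for c in self.classes:
            table = self.log_like[c]
            scores[c] = self.log_prior[c] + sum(
                table[tok] for tok in doc if tok in table
            )
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


model = TinyMultinomialNB().fit(
    [["ruleA", "ruleA", "ruleB"], ["ruleB", "ruleC", "ruleC"]],
    ["good", "bad"],
)
print(model.predict(["ruleA"]))
```

Prediction here is a handful of dictionary lookups per token, so any real replacement would still need to be benchmarked against the current implementation.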
Setting latent_time=False still uses latent rules (e.g. "ruleLatentDOY"), and consequently delivers a DateTime relative to the current one. I would expect that none of the latent rules are considered.
I run the following command:
list(ctparse("4.Jan 22", latent_time=False, debug=True))
which delivers the following output:
[CTParse(2023-01-04 22:00 (X/X), (108, 103, 129, 'ruleHHMM', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD'), 24.70392674175538),
CTParse(2022-12-04 X:X (X/X), (108, 103, 129, 'ruleHHMM', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM'), -1379.6945574679112),
CTParse(X-01-X X:X (X/X), (108, 103, 129, 'ruleHHMM', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM'), -974.2294493597469),
CTParse(X-X-X 22:00 (X/X), (108, 103, 129, 'ruleHHMM', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM'), -1379.6945574679112),
CTParse(2023-01-04 X:X (X/X), (108, 103, 129, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY'), -464.62600230356384),
CTParse(2022-11-22 X:X (X/X), (124, 108, 'ruleDOM1', 'ruleDDMM', 'ruleLatentDOY', 'ruleLatentDOM'), -1386.2557536714821),
CTParse(X-01-04 X:X (X/X), (124, 108, 'ruleDOM1', 'ruleDDMM', 'ruleLatentDOM'), -471.5873006213257),
CTParse(2023-01-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleLatentDOY', 'ruleDOM1', 'ruleLatentDOM'), -283.055702689241),
CTParse(X-01-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleDOM1', 'ruleLatentDOM'), -285.2758563452916),
CTParse(2022-11-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleLatentDOM'), -1381.2580980306086),
CTParse(X-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOM'), -468.07426686688797),
CTParse(X-X-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOM1', 'ruleLatentDOM'), -1387.0195868535332),
CTParse(2023-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDOM1', 'ruleLatentDOM'), -462.24216151516646),
CTParse(2022-11-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDOM1', 'ruleLatentDOM'), -1378.5328933893213),
CTParse(X-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleDOM1', 'ruleLatentDOM'), -465.9835442173078),
CTParse(2023-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleLatentDOM'), -457.9653277397515),
CTParse(2022-11-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleLatentDOM'), -1374.2560596139065),
CTParse(X-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOM'), -463.7046680492498),
CTParse(2022-12-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleLatentDOY', 'ruleLatentDOM'), -1378.5807416488096),
CTParse(2023-01-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleLatentDOY', 'ruleLatentDOM'), -279.96845298069996),
CTParse(X-01-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleLatentDOM'), -283.9414417860744),
CTParse(X-X-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM'), -1385.4914732680768)]
P.S. Thank you for your great work!
Hi, thanks for making this! I'm trying to parse a simple string like 17. August 2020, but the year gets interpreted as hours and minutes. Am I missing something?
In [6]: for res in ctparse('Datum: 17. August 2020', ts=datetime.now(), debug=True):
   ...:     print(res)
   ...:
2023-08-17 20:20 (X/X) s=-367.858 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')
2023-06-17 X:X (X/X) s=-1989.968 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
X-08-X X:X (X/X) s=-1296.821 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
2023-06-01 20:20 (X/X) s=-1702.286 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
2023-08-17 X:X (X/X) s=-781.179 p=(108, 103, 130, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY')
2023-08-17 20:20 (X/X) s=-357.082 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')
2023-06-17 X:X (X/X) s=-1984.401 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
X-08-X X:X (X/X) s=-1291.254 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
2023-06-01 20:20 (X/X) s=-1696.719 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
2023-08-17 X:X (X/X) s=-773.528 p=(108, 103, 130, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY')
2020-X-X X:X (X/X) s=-1701.773 p=(126, 111, 'ruleDDMM', 'ruleLatentDOY', 'ruleYear')
2020-08-17 X:X (X/X) s=-379.974 p=(126, 111, 'ruleDDMM', 'ruleYear', 'ruleDOYYear')
X-08-17 X:X (X/X) s=-790.539 p=(126, 111, 'ruleYear', 'ruleDDMM')
2020-X-X X:X (X/X) s=-1701.432 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleYear')
2020-08-17 X:X (X/X) s=-379.009 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleYear', 'ruleDOYYear')
X-X-17 X:X (X/X) s=-1994.620 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleYear', 'ruleMonthYear')
2020-08-X X:X (X/X) s=-695.337 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleYear', 'ruleMonthYear')
2020-08-17 X:X (X/X) s=-372.308 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleDOYYear')
2020-X-X X:X (X/X) s=-1694.826 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY')
Time intervals where the first int is larger than the second, as in "tomorrow 9-5", are always parsed incorrectly. Such a case should resolve to 9am-5pm (there is no other feasible reading of that parse). Instead, the second int is parsed as the end time on the next day. I assume this is due to the 24-hour system the parser uses.
r = ctparse("tomorrow 9-5")
r
Out[8]: CTParse(2021-02-23 09:00 (X/X) - 2021-02-24 05:00 (X/X), (114, 128, 136, 128, 'ruleHHMM', 'ruleHHMM', 'ruleTODTOD', 'ruleTomorrow', 'ruleDateInterval'), 4.300637883728172)
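The expected behaviour can be sketched as a small heuristic on the two bare hours. The function name and the exact condition are assumptions for illustration only; this is not how ctparse's rules are written.

```python
def infer_interval_hours(start_hour, end_hour):
    """Resolve a bare 'H-H' interval given without am/pm (hypothetical helper).

    If the end hour is smaller than the start hour on a 12-hour clock,
    read the end as pm instead of rolling over to the next day.
    """
    if 1 <= end_hour < start_hour <= 12:
        return start_hour, end_hour + 12
    return start_hour, end_hour


print(infer_interval_hours(9, 5))   # "9-5"
print(infer_interval_hours(9, 11))  # "9-11" is left unchanged
```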
When parsing time intervals where am/pm is not specified for the start time, ctparse assumes it's AM. However, for people (at least in the US) it's very natural to specify an interval in a format like "3-5PM" or "from 3 to 5PM", assuming both start and end are in the same period.
>>> ts
datetime.datetime(2023, 1, 18, 0, 0)
>>> ctparse('3-5pm', ts=ts)
CTParse(2023-01-18 03:00 (X/X) - 2023-01-18 17:00 (X/X), (131, 125, 131, 'ruleHHMM', 'ruleHHMM', 'ruleTODTOD'), -0.43965638477402536)
>>> ctparse('between 3 and 5pm', ts=ts)
CTParse(2023-01-18 03:00 (X/X) - 2023-01-18 17:00 (X/X), (101, 131, 125, 131, 'ruleHHMM', 'ruleHHMM', 'ruleTODTOD', 'ruleAbsorbFromInterval'), 0.11287064948939474)
>>> ctparse('from 3 to 5pm', ts=ts)
CTParse(2023-01-18 03:00 (X/X) - 2023-01-18 17:00 (X/X), (101, 131, 125, 131, 'ruleHHMM', 'ruleHHMM', 'ruleTODTOD', 'ruleAbsorbFromInterval'), 0.11287064948939474)
In the above examples, I expected start time to be resolved to 15:00, not 3:00. Maybe some optional argument to ctparse could be added to customize this behavior?
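As a sketch of what such an optional behaviour might do: when only the end carries "pm", shift the start to pm as well if that still leaves it before the end. The helper below is hypothetical and works on plain hours; it is not part of ctparse.

```python
def align_start_with_pm_end(start_hour, end_hour_24):
    """Let a bare start hour inherit 'pm' from the end (hypothetical helper).

    start_hour is the hour as written (no am/pm); end_hour_24 is the end
    hour after its 'pm' was applied, on a 24-hour clock.
    """
    shifted = start_hour + 12
    if start_hour < 12 <= end_hour_24 and shifted < end_hour_24:
        return shifted
    return start_hour


print(align_start_with_pm_end(3, 17))   # "3-5pm"
print(align_start_with_pm_end(11, 13))  # "11-1pm" keeps the am start
```

The second call shows why the shift needs the "still before the end" check: moving 11 to 23:00 would put the start after the end.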
The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.
Thanks for the helpful documentation. However, while playing around and implementing custom changes, I found a few inaccuracies in the Contributing section.
py.test tests/test_run_corpus.py should be changed to py.test tests/test_corpus.py, I presume, as test_run_corpus.py doesn't exist anymore.
The make train command is giving me trouble as well when running it from the root directory.
Those are the parses:
for parse in ctparse.ctparse_gen("from Tuesday to Friday",
                                 datetime.datetime(year=2018, month=8, day=16),
                                 relative_match_len=1.0,
                                 max_stack_depth=0, timeout=0):
    print(parse)
2018-08-21 X:X (X/X) s=-1.012 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleLatentDOW', 'ruleNamedDOW', 'ruleLatentDOW')
2018-08-17 X:X (X/X) s=-1.300 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleLatentDOW', 'ruleNamedDOW', 'ruleLatentDOW')
2018-08-21 X:X (X/X) s=-0.861 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleNamedDOW', 'ruleLatentDOW')
X-X-X X:X (4/X) s=-1.149 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleNamedDOW', 'ruleLatentDOW')
2018-08-17 X:X (X/X) s=-1.229 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleNamedDOW', 'ruleLatentDOW', 'ruleLatentDOW')
X-X-X X:X (1/X) - None s=-2.039 p=(134, 102, 135, 102, 'ruleNamedDOW', 'ruleLatentDOW', 'ruleNamedDOW', 'ruleAfterTime')
2018-08-21 X:X (X/X) - None s=-3.146 p=(134, 102, 135, 102, 'ruleNamedDOW', 'ruleNamedDOW', 'ruleLatentDOW', 'ruleLatentDOW', 'ruleAfterTime')
This is likely because Friday comes before Tuesday
The slash can be in a lot of different contexts and therefore it may affect a lot of different strings
This issue seems to be related to the fact that there are too many possible matches and the correct parse doesn't make it to the top.
ctparse.ctparse('Montag 9. März bis Mittwoch 11. März')
Out[6]: CTParse(None - 2020-03-11 X:X (X/X), (102, 108, 103, 134, 102, 108, 103, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleNamedDOW', 'ruleDOWDate', 'ruleNamedDOW', 'ruleDOWDate', 'ruleBeforeTime'), 20.224422964079018)
I ran into issues with ctparse while using it for natural language date and time processing, especially with return dates. During date-time validation, ctparse did not return None for incorrect inputs and sometimes generated seemingly random dates.
I have attached image for reference:
The examples one is correct,
I plan to raise this issue on GitHub and seek a solution if possible.
Maybe you can tell us the difference in the document or readme
The problem here is only the max_stack_depth. Setting it to 50 fixes the problem.
>>> ctparse.ctparse("ab dem 11.05.20-15.05.20")
CTParse(2020-05-11 X:X (X/X) - None, (135, 100, 126, 136, 126, 'ruleDDMMYYYY', 'ruleAbsorbOnTime', 'ruleAfterTime', 'ruleDDMMYYYY'), 14.872431305790162)
A small collection of problematic time expressions that should be supported
between 6- 8 AM
CTParse(2021-02-06 X:X (X/X) - 2021-02-07 X:X (X/X), (101, 108, 136, 108, 138, 'ruleDOM1', 'ruleDOM1', 'ruleNamedNumberDuration', 'ruleLatentDOM', 'ruleDOMDate', 'ruleAbsorbFromInterval'), 4.582549598503859)
ctparse.ctparse("Mo. 13 Jan.13:50")
Out[7]: CTParse(2021-01-13 X:X (X/X), (102, 108, 103, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleNamedDOW', 'ruleDOWDate'), 27.889635370185275)
What is the easiest / best way to get a timedelta object from a Duration object? I want to go from the string "in two days" to a timedelta -> datetime, but the only thing I can think of is to manually match up the ctparse enums and fill out the timedelta myself. Is that the best way or is there something I'm missing?
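One possible shape for the manual mapping, as a sketch. It assumes the parsed Duration gives you an integer value and a unit name such as "days"; those unit names and the helper itself are assumptions for this example, so check them against the ctparse version you use.

```python
from datetime import datetime, timedelta

# Units with a fixed length. "nights" is treated like days here, which is an
# assumption; months are left out because their length is not fixed.
_FIXED_UNITS = {
    "minutes": timedelta(minutes=1),
    "hours": timedelta(hours=1),
    "days": timedelta(days=1),
    "nights": timedelta(days=1),
    "weeks": timedelta(weeks=1),
}


def duration_to_timedelta(value, unit):
    """Convert a (value, unit name) pair to a timedelta (hypothetical helper)."""
    try:
        return value * _FIXED_UNITS[unit]
    except KeyError:
        raise ValueError(f"no fixed-length timedelta for unit {unit!r}") from None


# "in two days", relative to some reference time
reference = datetime(2020, 1, 1, 9, 30)
print(reference + duration_to_timedelta(2, "days"))
```

Month-based durations need calendar arithmetic relative to the reference date, which is why the sketch raises for them instead of guessing a length.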
top-2 parses are:
2015-10-21 X:X (X/morning),
2015-10-21 06:55 (X/X),
Somehow, the time of day and specific times are conflicting
Expressions like these should parse as YYYY/MM/DD
There seems to be a rule missing that can merge a month and a year:
Example: list(ctparse_gen('June 2020'))
Expected: 2020-06-X X:X (X/X)
Actual:
X-06-X X:X (X/X) s=-5.902 p=(103, 111, 'ruleYear', 'ruleNamedMonth')
2020-X-X X:X (X/X) s=-5.902 p=(103, 111, 'ruleYear', 'ruleNamedMonth')
and some more, but nothing that combines the month and the year into a single resolution.
What seems to be missing is a rule like
@rule(predicate('isMonth'), predicate('isYear'))
def ruleMonthYear(ts, m, y):
    return Time(year=y.year, month=m.month)
Currently there is no fuzzy matching done. However, for longer literals like January, Thursday we could easily allow some edits inside the regex match (i.e. instead of matching monday match (?:monday){e<=2}).
This applies to named months and days and probably a few other longer literals.
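The {e<=2} syntax above is the fuzzy-matching extension of the third-party regex module. To illustrate the same idea without that dependency, the sketch below uses a plain Levenshtein distance instead of a fuzzy regex; the literal list and function names are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


# Hypothetical subset of the longer literals mentioned above.
LITERALS = ("monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday", "january", "february")


def fuzzy_literal(token, max_edits=2):
    """Return the closest literal within max_edits edits, or None."""
    token = token.lower()
    best = min(LITERALS, key=lambda lit: edit_distance(token, lit))
    return best if edit_distance(token, best) <= max_edits else None


print(fuzzy_literal("Thrusday"))
print(fuzzy_literal("Janury"))
```

Some literals are themselves only two edits apart (monday and sunday, for example), so a real rule would need a tie-breaking policy or a smaller edit budget for shorter words.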
import ctparse
ctparse.ctparse("15/11 bis don 16/11", timeout=0, max_stack_depth=0)
the culprit seems to be having a dd/mm followed by a number
Hi, according to your docs, we should call regenerate_model after updating the rules. But I can't find that function anywhere in the repository. So what should I do after updating the rules?
When trying to parse something like 19.10.93 the result is in 2093 - even when setting the reference time to be before 1993. This is due to how ruleDDMMYYYY handles two digit years. It should probably not "hard move" them to the 21st century but rather use some cutoff year within the century of the reference time.
This is how Excel handles this: https://docs.microsoft.com/en-us/office/troubleshoot/excel/two-digit-year-numbers
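A sketch of one such cutoff, relative to the reference year rather than fixed to a century. The function name and the 20-year look-ahead window are assumptions chosen for illustration; this is not what ruleDDMMYYYY currently does and not Excel's exact rule.

```python
def expand_two_digit_year(yy, ref_year, max_years_ahead=20):
    """Expand a two-digit year using a window around ref_year (hypothetical).

    The result is the year ending in yy that lies at most max_years_ahead
    years after ref_year, preferring the latest such year.
    """
    century = ref_year - ref_year % 100
    year = century + yy
    if year > ref_year + max_years_ahead:
        year -= 100  # too far in the future: use the previous century
    elif year + 100 <= ref_year + max_years_ahead:
        year += 100  # next century still falls inside the window
    return year


print(expand_two_digit_year(93, ref_year=2018))  # from "19.10.93"
print(expand_two_digit_year(93, ref_year=1990))
```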
Suggestion:
ccyy (e.g. 1983 => cc=19, yy=83).
I'm trying to add a new attribute called period to the Time object in types.py which will hold a string "am" or "pm" if it's specified in the input ("5am"). This is so I can make some further modifications based on it to the logic of some rules. For this, I've modified the _maybe_apply_am_pm in rules.py as below:
if ampm_match.lower().startswith("a") and t.hour <= 12:
    t.period = "am"
    return t
if ampm_match.lower().startswith("p") and t.hour < 12:
    return Time(hour=t.hour + 12, minute=t.minute, period='pm')
Initially, this worked but disabled many of the other applicable rules in the pipeline and resulted in latent, incomplete outputs (for "5am" I'd get a latent datetime X-X-X with 05:00 instead of a complete datetime for today / tomorrow etc.)
I figured I had to modify the Time property isTOD to
@property
def isTOD(self) -> bool:
    """isTimeOfDay - only a time, not date"""
    return self._hasOnly("hour") or self._hasOnly("hour", "minute") or self._hasOnly("hour", "period") or self._hasOnly("hour", "minute", "period")
This enabled the previously latent datetime completion, intervals etc. However, my new attribute somehow gets lost in the pipeline of rules and is not propagated to the resulting production. I've spent hours debugging and experimenting with this, but I'm really not sure what else to change to make this work.
When testing the simple "5pm", I tried to look into the latentDOM rule, but the attribute is not propagated to it from the HHMM rule, where it's present. When debugging I used the logging as you described in the documentation, but it hardly gives me all the rules that are applied. Or am I missing something somewhere else?
Thank you