comtravo / ctparse

Parse natural language time expressions in python

Home Page: https://www.comtravo.com

License: MIT License

Python 99.05% Makefile 0.95%

nlp time-parsing python python-library regular-expression machine-learning

ctparse's People

hitxujian mattilyra dutc giancastro gabrielelanaro pombredanne bharathi-srini contexity acreom saheen-ahamed yghokim rohit-singh07 aksanoble ahmad-abdellatif capuanob mayhemheroes hayasam discoverdigitalweb

ctparse's Issues

The duration of "1 hour and 42 minutes" returns just 42 minutes, debug returns the values correctly

ctparse - Parse natural language time expressions in python version:
Python version: 3.8
Operating System: Win10

Description

When calling ctparse("1 hour and 42 minutes") I expected to get a duration of 1 hour and 42 minutes, but I got back just the 42 minutes.

What I Did

I enabled debug and got two duration objects: the first holds the hour, the second the minutes.
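Until the two durations are merged by the parser itself, a caller could sum the parts by hand. The following is only a workaround sketch under stated assumptions: `sum_durations` is a hypothetical helper that is not part of ctparse, it handles digits only (not "two hours"), and it covers just the four fixed-length units listed in the table.

```python
import re
from datetime import timedelta

# Hypothetical helper, not part of ctparse: seconds per fixed-length unit.
_UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400, "week": 604800}

# Matches "<number> <unit>" pairs such as "1 hour" or "42 minutes".
_PAIR = re.compile(r"(\d+)\s*(minute|hour|day|week)s?\b", re.IGNORECASE)


def sum_durations(text: str) -> timedelta:
    """Add up every "<number> <unit>" pair found in the text."""
    total = 0
    for value, unit in _PAIR.findall(text):
        total += int(value) * _UNIT_SECONDS[unit.lower()]
    return timedelta(seconds=total)


print(sum_durations("1 hour and 42 minutes"))  # 1:42:00
```

Months are left out on purpose, since they have no fixed length in seconds.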

Parsing expressions like `29/Nov/2016` fails

Copied from wrong issue, originally created by @GeniaSh (thanks for reporting!)

Hi, some common date formats are not parsed, for example " 29/ Nov/2016".

Furthermore, the method that builds the datetime object (def dt(self):) should check whether year, month and day were found; currently we get an error when the date is incomplete.

Please see the example below.

val = ctparse('whatever dat: 29/ Nov/2016',datetime.now())
print(val)

2016-X-X X:X (X/X) s=-3.393 p=(111, 'ruleYear')

val.resolution.dt

TypeError: an integer is required (got type NoneType)
TypeError Traceback (most recent call last)
in engine
----> 1 val.resolution.dt

/home/XXXX/.local/lib/python3.6/site-packages/ctparse/types.py in dt(self)
260 return datetime(self.year, self.month, self.day,
261 self.hour or 0,
--> 262 self.minute or 0)
263
264

TypeError: an integer is required (got type NoneType)
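The check requested above could look roughly like the sketch below. `PartialTime` is a hypothetical stand-in written for this example, not ctparse's own `Time` class, and the `has_date` property and the error message are assumptions. The idea is only that `dt` fails with a descriptive error instead of passing `None` into `datetime`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PartialTime:
    """Hypothetical stand-in for a partially resolved time (not ctparse's Time)."""

    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[int] = None
    hour: Optional[int] = None
    minute: Optional[int] = None

    @property
    def has_date(self) -> bool:
        """True only if year, month and day are all set."""
        return None not in (self.year, self.month, self.day)

    @property
    def dt(self) -> datetime:
        """Build a datetime, or raise a clear error if the date is incomplete."""
        if not self.has_date:
            missing = [n for n in ("year", "month", "day") if getattr(self, n) is None]
            raise ValueError(f"incomplete date, missing: {', '.join(missing)}")
        return datetime(self.year, self.month, self.day, self.hour or 0, self.minute or 0)
```

With this shape a caller can test `has_date` first, for example on a result such as `2016-X-X`, and skip the conversion when only the year was found.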

Remove dependencies on sklearn

Currently ctparse depends on numpy, scikit-learn and scipy, solely for the relatively simple naive Bayes classifier and vectorizer. Removing these dependencies in favor of a private implementation would significantly reduce package size and problems integrating with other libraries.

One caveat: the naive Bayes model is called very often, and its prediction runtime is crucial for the speed of ctparse. Any replacement should not be slower than what we currently have.
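To give a feel for the size of such a private implementation, here is a minimal multinomial naive Bayes in pure Python. It is an illustrative sketch only: the class name `TinyNaiveBayes`, the use of raw rule names as tokens and the add-one smoothing are assumptions, and nothing here claims to reproduce the features or the speed of the current scikit-learn based model.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, samples, labels):
        # Count how often each token occurs per class.
        token_counts = defaultdict(Counter)
        for tokens, label in zip(samples, labels):
            token_counts[label].update(tokens)
        vocab = {tok for counter in token_counts.values() for tok in counter}
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
        self.log_likelihood = {}
        for c in class_counts:
            denom = sum(token_counts[c].values()) + len(vocab)
            self.log_likelihood[c] = {
                tok: math.log((token_counts[c][tok] + 1) / denom) for tok in vocab
            }
        return self

    def log_scores(self, tokens):
        # Tokens never seen during training are ignored.
        return {
            c: prior + sum(self.log_likelihood[c][t] for t in tokens if t in self.log_likelihood[c])
            for c, prior in self.log_prior.items()
        }

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

Because prediction is a handful of dictionary lookups and additions, the log-likelihood tables could be precomputed once and stored with the model; whether that matches the current runtime would have to be measured.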

`latent_time=False` works only half-way

ctparse - Parse natural language time expressions in python version:
Python version: 3.9
Operating System: Ubuntu 22.04 (WSL2)

Description

Setting latent_time=False still uses latent rules (e.g. "ruleLatentDOY"), and consequently delivers a DateTime relative to the current one. I would expect that none of the latent rules are considered.

What I Did

I ran the following command:

list(ctparse("4.Jan 22", latent_time=False, debug=True))

which delivers the following output:

[CTParse(2023-01-04 22:00 (X/X), (108, 103, 129, 'ruleHHMM', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD'), 24.70392674175538),
CTParse(2022-12-04 X:X (X/X), (108, 103, 129, 'ruleHHMM', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM'), -1379.6945574679112),
CTParse(X-01-X X:X (X/X), (108, 103, 129, 'ruleHHMM', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM'), -974.2294493597469),
CTParse(X-X-X 22:00 (X/X), (108, 103, 129, 'ruleHHMM', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM'), -1379.6945574679112),
CTParse(2023-01-04 X:X (X/X), (108, 103, 129, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY'), -464.62600230356384),
CTParse(2022-11-22 X:X (X/X), (124, 108, 'ruleDOM1', 'ruleDDMM', 'ruleLatentDOY', 'ruleLatentDOM'), -1386.2557536714821),
CTParse(X-01-04 X:X (X/X), (124, 108, 'ruleDOM1', 'ruleDDMM', 'ruleLatentDOM'), -471.5873006213257),
CTParse(2023-01-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleLatentDOY', 'ruleDOM1', 'ruleLatentDOM'), -283.055702689241),
CTParse(X-01-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleDOM1', 'ruleLatentDOM'), -285.2758563452916),
CTParse(2022-11-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleLatentDOM'), -1381.2580980306086),
CTParse(X-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOM'), -468.07426686688797),
CTParse(X-X-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOM1', 'ruleLatentDOM'), -1387.0195868535332),
CTParse(2023-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDOM1', 'ruleLatentDOM'), -462.24216151516646),
CTParse(2022-11-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDOM1', 'ruleLatentDOM'), -1378.5328933893213),
CTParse(X-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleDOM1', 'ruleLatentDOM'), -465.9835442173078),
CTParse(2023-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleLatentDOM'), -457.9653277397515),
CTParse(2022-11-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleLatentDOM'), -1374.2560596139065),
CTParse(X-01-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOM'), -463.7046680492498),
CTParse(2022-12-04 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleLatentDOY', 'ruleLatentDOM'), -1378.5807416488096),
CTParse(2023-01-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleLatentDOY', 'ruleLatentDOM'), -279.96845298069996),
CTParse(X-01-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleMonthDOM', 'ruleLatentDOM'), -283.9414417860744),
CTParse(X-X-22 X:X (X/X), (108, 103, 108, 'ruleDOM1', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM'), -1385.4914732680768)]

P.S. Thank you for your great work!
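As a stopgap, results whose production still contains a latent rule can be filtered out after parsing. The sketch below works on plain tuples shaped like the second field of each `CTParse` shown in the output above. The helper name `uses_latent_rule` is hypothetical, and the `ruleLatent` name prefix is an assumption based on the rule names visible in that output.

```python
from typing import Tuple, Union

# A production as printed above: regex ids (ints) followed by rule names (strs).
Production = Tuple[Union[int, str], ...]


def uses_latent_rule(production: Production) -> bool:
    """True if any applied rule name starts with "ruleLatent"."""
    return any(
        isinstance(step, str) and step.startswith("ruleLatent")
        for step in production
    )


productions = [
    (108, 103, 129, "ruleHHMM", "ruleDOM1", "ruleNamedMonth", "ruleDOMMonth", "ruleLatentDOY", "ruleDateTOD"),
    (126, 111, "ruleDDMM", "ruleYear", "ruleDOYYear"),
]
non_latent = [p for p in productions if not uses_latent_rule(p)]
```

This only hides the symptom; the latent rules are still evaluated and still cost parsing time.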

Year gets parsed as hours and minutes

ctparse - Parse natural language time expressions in python version: 0.3.6.
Python version: 3.9.15
Operating System: MacOS 13.3.1

Description

Hi, thanks for making this! I'm trying to parse a simple string like 17. August 2020, but the year gets interpreted as hours and minutes. Am I missing something?

What I Did

In [6]: for res in ctparse('Datum: 17. August 2020', ts=datetime.now(), debug=True):
   ...:     print(res)
   ...:
2023-08-17 20:20 (X/X) s=-367.858 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')
2023-06-17 X:X (X/X) s=-1989.968 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
X-08-X X:X (X/X) s=-1296.821 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
2023-06-01 20:20 (X/X) s=-1702.286 p=(108, 103, 130, 'ruleNamedMonth', 'ruleHHMMmilitary', 'ruleDOM1', 'ruleLatentDOM')
2023-08-17 X:X (X/X) s=-781.179 p=(108, 103, 130, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY')
2023-08-17 20:20 (X/X) s=-357.082 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleDateTOD')
2023-06-17 X:X (X/X) s=-1984.401 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
X-08-X X:X (X/X) s=-1291.254 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
2023-06-01 20:20 (X/X) s=-1696.719 p=(108, 103, 130, 'ruleHHMMmilitary', 'ruleDOM1', 'ruleNamedMonth', 'ruleLatentDOM')
2023-08-17 X:X (X/X) s=-773.528 p=(108, 103, 130, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY')
2020-X-X X:X (X/X) s=-1701.773 p=(126, 111, 'ruleDDMM', 'ruleLatentDOY', 'ruleYear')
2020-08-17 X:X (X/X) s=-379.974 p=(126, 111, 'ruleDDMM', 'ruleYear', 'ruleDOYYear')
X-08-17 X:X (X/X) s=-790.539 p=(126, 111, 'ruleYear', 'ruleDDMM')
2020-X-X X:X (X/X) s=-1701.432 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleYear')
2020-08-17 X:X (X/X) s=-379.009 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleYear', 'ruleDOYYear')
X-X-17 X:X (X/X) s=-1994.620 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleYear', 'ruleMonthYear')
2020-08-X X:X (X/X) s=-695.337 p=(108, 103, 111, 'ruleNamedMonth', 'ruleDOM1', 'ruleYear', 'ruleMonthYear')
2020-08-17 X:X (X/X) s=-372.308 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleDOYYear')
2020-X-X X:X (X/X) s=-1694.826 p=(108, 103, 111, 'ruleYear', 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY')

parsing time intervals incorrectly

ctparse - Parse natural language time expressions in python version: 0.3.0
Python version: 3.8
Operating System: macOS BigSur

Description

Time intervals where the first number is larger than the second, as in "tomorrow 9-5", are always parsed incorrectly. This should resolve to 9am-5pm, since no other reading is plausible. Instead, the second number is parsed as the end time on the following day. I assume this is due to the 24-hour system the parser uses.

r = ctparse("tomorrow 9-5")
r
Out[8]: CTParse(2021-02-23 09:00 (X/X) - 2021-02-24 05:00 (X/X), (114, 128, 136, 128, 'ruleHHMM', 'ruleHHMM', 'ruleTODTOD', 'ruleTomorrow', 'ruleDateInterval'), 4.300637883728172)
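The expected behavior can be stated as a small heuristic: if the end hour is not after the start hour, try shifting the end into the afternoon before rolling over to the next day. The function below is a sketch of that idea only. `resolve_hour_range` is a hypothetical name and is not how ctparse's ruleTODTOD is implemented.

```python
from typing import Tuple


def resolve_hour_range(start: int, end: int) -> Tuple[int, int]:
    """Return (start, end) in 24-hour form for a bare "start-end" range.

    If the end is not after the start, move the end 12 hours later when
    that yields a valid hour after the start (e.g. 9-5 becomes 9-17).
    """
    shifted = end + 12
    if end <= start and shifted <= 23 and shifted > start:
        return start, shifted
    return start, end


print(resolve_hour_range(9, 5))  # (9, 17)
```

Ranges that are already ordered, such as 10-11 or 9-17, are returned unchanged.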

Time interval parsing issues

ctparse - Parse natural language time expressions in python version:
Python version: 3.9.7
Operating System: Mac OS

Description

When parsing time intervals where am/pm is not specified for the start time, ctparse assumes AM. However, for people (at least in the US) it is very natural to write an interval like "3-5PM" or "from 3 to 5PM", meaning that both start and end fall in the same half of the day.

What I Did

>>> ts
datetime.datetime(2023, 1, 18, 0, 0)
>>> ctparse('3-5pm', ts=ts)
CTParse(2023-01-18 03:00 (X/X) - 2023-01-18 17:00 (X/X), (131, 125, 131, 'ruleHHMM', 'ruleHHMM', 'ruleTODTOD'), -0.43965638477402536)
>>> ctparse('between 3 and 5pm', ts=ts)
CTParse(2023-01-18 03:00 (X/X) - 2023-01-18 17:00 (X/X), (101, 131, 125, 131, 'ruleHHMM', 'ruleHHMM', 'ruleTODTOD', 'ruleAbsorbFromInterval'), 0.11287064948939474)
>>> ctparse('from 3 to 5pm', ts=ts)
CTParse(2023-01-18 03:00 (X/X) - 2023-01-18 17:00 (X/X), (101, 131, 125, 131, 'ruleHHMM', 'ruleHHMM', 'ruleTODTOD', 'ruleAbsorbFromInterval'), 0.11287064948939474)

In the above examples, I expected start time to be resolved to 15:00, not 3:00. Maybe some optional argument to ctparse could be added to customize this behavior?
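One way to picture the requested behavior: let the start hour inherit the am/pm marker of the end hour, and fall back to the opposite half of the day only when inheriting would put the start after the end. This is a sketch under those assumptions; `apply_shared_meridiem` and `to_24h` are hypothetical helpers, not ctparse functions or options.

```python
from typing import Tuple


def to_24h(hour: int, meridiem: str) -> int:
    """Convert a 12-hour clock value plus "am"/"pm" to a 24-hour value."""
    return hour % 12 + (12 if meridiem.lower() == "pm" else 0)


def apply_shared_meridiem(start_hour: int, end_hour: int, end_meridiem: str) -> Tuple[int, int]:
    """Resolve "start-end<am|pm>" where only the end carries a marker."""
    end_24 = to_24h(end_hour, end_meridiem)
    start_same = to_24h(start_hour, end_meridiem)
    if start_same <= end_24:
        return start_same, end_24
    # Sharing the marker would put the start after the end (e.g. "11-1pm"),
    # so the start falls in the other half of the day.
    other = "am" if end_meridiem.lower() == "pm" else "pm"
    return to_24h(start_hour, other), end_24


print(apply_shared_meridiem(3, 5, "pm"))  # (15, 17)
```

With an "am" end marker the fallback can produce a start later than the end (for example 11 to 1am), which a caller would have to treat as a range crossing midnight.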

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

obsolete documentation

Thanks for the helpful documentation. However, while playing around and implementing custom changes I found a few inaccuracies in the Contributing section.

py.test tests/test_run_corpus.py should presumably be changed to py.test tests/test_corpus.py, as test_run_corpus.py doesn't exist anymore.

The make train command is also giving me trouble when I run it from the root directory.

from Tuesday to Friday

Those are the parses:

for parse in ctparse.ctparse_gen("from Tuesday to Friday",
                                 datetime.datetime(year=2018, month=8, day=16),
                                 relative_match_len=1.0,
                                 max_stack_depth=0, timeout=0):
    print(parse)

2018-08-21 X:X (X/X) s=-1.012 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleLatentDOW', 'ruleNamedDOW', 'ruleLatentDOW')
2018-08-17 X:X (X/X) s=-1.300 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleLatentDOW', 'ruleNamedDOW', 'ruleLatentDOW')
2018-08-21 X:X (X/X) s=-0.861 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleNamedDOW', 'ruleLatentDOW')
X-X-X X:X (4/X) s=-1.149 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleNamedDOW', 'ruleLatentDOW')
2018-08-17 X:X (X/X) s=-1.229 p=(101, 102, 135, 102, 'ruleNamedDOW', 'ruleNamedDOW', 'ruleLatentDOW', 'ruleLatentDOW')
X-X-X X:X (1/X) - None s=-2.039 p=(134, 102, 135, 102, 'ruleNamedDOW', 'ruleLatentDOW', 'ruleNamedDOW', 'ruleAfterTime')
2018-08-21 X:X (X/X) - None s=-3.146 p=(134, 102, 135, 102, 'ruleNamedDOW', 'ruleNamedDOW', 'ruleLatentDOW', 'ruleLatentDOW', 'ruleAfterTime')

This is likely because, seen from the reference date (a Thursday), the next Friday comes before the next Tuesday.
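The intended reading can be sketched by resolving the start weekday first and then searching for the end weekday from that start, rather than resolving both against the reference date independently. `next_weekday` and `weekday_interval` are hypothetical helpers written for this example and do not mirror ctparse's rules.

```python
from datetime import date, timedelta
from typing import Tuple

TUESDAY, FRIDAY = 1, 4  # date.weekday() numbering, Monday == 0


def next_weekday(ref: date, weekday: int) -> date:
    """First date on or after ref that falls on the given weekday."""
    return ref + timedelta(days=(weekday - ref.weekday()) % 7)


def weekday_interval(ref: date, start_wd: int, end_wd: int) -> Tuple[date, date]:
    """Resolve "from <start_wd> to <end_wd>" so the end never precedes the start."""
    start = next_weekday(ref, start_wd)
    end = next_weekday(start, end_wd)
    return start, end


start, end = weekday_interval(date(2018, 8, 16), TUESDAY, FRIDAY)
print(start, end)  # 2018-08-21 2018-08-24
```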

Parse string of the type "7.6 evening/night"

The slash can be in a lot of different contexts and therefore it may affect a lot of different strings

Time expression "Montag 9. März bis Mittwoch 11. März" doesn't parse correctly

This issue seems to be related to the fact that there are too many possible matches and the correct parse doesn't make it to the top.

ctparse.ctparse('Montag 9. März bis Mittwoch 11. März')                                                                                                                                                                                                               
Out[6]: CTParse(None - 2020-03-11 X:X (X/X), (102, 108, 103, 134, 102, 108, 103, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleNamedMonth', 'ruleDOM1', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleNamedDOW', 'ruleDOWDate', 'ruleNamedDOW', 'ruleDOWDate', 'ruleBeforeTime'), 20.224422964079018)

Returning invalid values when input is wrong

ctparse - Parse natural language time expressions in Python version:
Python version: 3.10
Operating System: Linux

Description


What I Did

I encountered issues while using ctparse for natural language date and time processing, especially with return dates. When validating date-times, I observed that ctparse does not return None for incorrect inputs and sometimes generates seemingly random dates.

I have attached image for reference:

The examples one is correct,

I plan to raise this issue on GitHub and seek a solution if possible.


What's the difference between time/corpus.py and time/auto_corpus.py?

Could the documentation or the README explain the difference?

Time expression "02 Mär 2020 - 03 Mär 2020" fails to parse correctly

The only problem here is max_stack_depth; setting it to 50 fixes the parse.

Intervals of the kind "ab dem xx.xx.xx-xx.xx.xx" parsed incorrectly

ctparse - Parse natural language time expressions in python version: 0.2.0
Python version: 3.8
Operating System: MacOS

Description

>>> ctparse.ctparse("ab dem 11.05.20-15.05.20")
CTParse(2020-05-11 X:X (X/X) - None, (135, 100, 126, 136, 126, 'ruleDDMMYYYY', 'ruleAbsorbOnTime', 'ruleAfterTime', 'ruleDDMMYYYY'), 14.872431305790162)

strings ctparse can't parse

A small collection of problematic time expressions that should be supported

between 6- 8 AM

CTParse(2021-02-06 X:X (X/X) - 2021-02-07 X:X (X/X), (101, 108, 136, 108, 138, 'ruleDOM1', 'ruleDOM1', 'ruleNamedNumberDuration', 'ruleLatentDOM', 'ruleDOMDate', 'ruleAbsorbFromInterval'), 4.582549598503859)

ctparse.ctparse("Mo. 13 Jan.13:50")                                                                                                   
Out[7]: CTParse(2021-01-13 X:X (X/X), (102, 108, 103, 'ruleDOM1', 'ruleNamedMonth', 'ruleDOMMonth', 'ruleLatentDOY', 'ruleNamedDOW', 'ruleDOWDate'), 27.889635370185275)

Get TimeDelta from Duration

ctparse - Parse natural language time expressions in python version: 0.3.6
Python version: 3.11.1
Operating System: MacOS

Description

What is the easiest / best way to convert a Duration object to a timedelta object? I want to go from the string "in two days" to a timedelta and then to a datetime, but the only thing I can think of is to manually match up the ctparse enums and fill out the timedelta myself. Is that the best way, or is there something I'm missing?
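The manual mapping mentioned above is short enough to write out. The sketch below assumes that a parsed duration yields a numeric value and a unit name. The function `duration_to_timedelta` and the unit strings it accepts are assumptions made for this example, so they should be checked against the actual `Duration` attributes before use.

```python
from datetime import datetime, timedelta

# Units with a fixed length that timedelta accepts directly as keyword arguments.
_FIXED_UNITS = ("minutes", "hours", "days", "weeks")


def duration_to_timedelta(value: int, unit: str) -> timedelta:
    """Convert a (value, unit name) pair to a timedelta.

    The unit names are assumptions for this sketch. "nights" is treated as
    days; "months" has no fixed length, so it is rejected.
    """
    unit = unit.lower()
    if unit in _FIXED_UNITS:
        return timedelta(**{unit: value})
    if unit == "nights":
        return timedelta(days=value)
    raise ValueError(f"unit {unit!r} has no fixed length")


# "in two days" relative to a reference timestamp
print(datetime(2023, 1, 18) + duration_to_timedelta(2, "days"))  # 2023-01-20 00:00:00
```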

Time expression of the kind am 21.10.2015 früh 06:55 parse incorrectly

top-2 parses are:

2015-10-21 X:X (X/morning),
2015-10-21 06:55 (X/X),

Somehow, the time of day and specific times are conflicting

Parse simple US dates

ctparse: 0.3.1
Python version: 3.9

Description

Expressions like these should parse as YYYY/MM/DD

2022/11/01
2022-11-01
2022.11.01
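For illustration, the three expressions above can be matched with one pattern that requires the same separator in both positions. This is a standalone sketch using only the standard library; `parse_ymd` is a hypothetical helper and not a ctparse rule.

```python
import re
from datetime import date
from typing import Optional

# Four-digit year, then month and day, with the same separator (/, . or -) twice.
_YMD = re.compile(r"\b(\d{4})([/.-])(\d{1,2})\2(\d{1,2})\b")


def parse_ymd(text: str) -> Optional[date]:
    """Return the first YYYY<sep>MM<sep>DD date in the text, or None."""
    match = _YMD.search(text)
    if match is None:
        return None
    year, _sep, month, day = match.groups()
    try:
        return date(int(year), int(month), int(day))
    except ValueError:
        # Matched the shape but not a real calendar date (e.g. month 13).
        return None


for text in ("2022/11/01", "2022-11-01", "2022.11.01"):
    print(parse_ymd(text))  # 2022-11-01 each time
```

The backreference `\2` rejects mixed separators such as 2022/11-01.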

Month and Year not joined

Python version: 3.8
Operating System: MacOS

Description

There seems to be a rule missing that can merge a month and a year:

Example: list(ctparse_gen('June 2020'))
Expected: 2020-06-X X:X (X/X)
Actual:

X-06-X X:X (X/X) s=-5.902 p=(103, 111, 'ruleYear', 'ruleNamedMonth')
2020-X-X X:X (X/X) s=-5.902 p=(103, 111, 'ruleYear', 'ruleNamedMonth')

and some more, but nothing that combines the month and the year into a single resolution.

What seems to be missing is a rule like

@rule(predicate('isMonth'), predicate('isYear'))                                                                                                                                     
def ruleMonthYear(ts, m, y):                                                                                                                                                         
    return Time(year=y.year, month=m.month)

Allow edits when matching named months, days, or other longer literals

Currently there is no fuzzy matching done. However, for longer literals like January, Thursday we could easily allow some edits inside the regex match (i.e. instead of matching monday match (?:monday){e<=2}).

This applies to named months and days and probably a few other longer literals.
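The same edit bound can be illustrated without fuzzy regex syntax by computing a plain Levenshtein distance against each literal. Note that this swaps in a different technique from the `{e<=2}` pattern proposed above, and `edit_distance` and `fuzzy_literal` are hypothetical helpers written for this sketch.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_literal(token: str, literals: Iterable[str], max_edits: int = 2) -> Optional[str]:
    """Return the closest literal within max_edits of the token, else None."""
    token = token.lower()
    best = min(literals, key=lambda lit: edit_distance(token, lit))
    return best if edit_distance(token, best) <= max_edits else None


print(fuzzy_literal("Thrusday", ["monday", "thursday", "january"]))  # thursday
```

One design caveat: some literals are themselves close to each other ("monday" and "sunday" differ by two edits), so a budget of two edits can make a misspelled token ambiguous between them.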

Time expression hangs ctparse when timeout=0 and max_stack_depth=0

import ctparse
ctparse.ctparse("15/11 bis don 16/11", timeout=0, max_stack_depth=0)

the culprit seems to be having a dd/mm followed by a number

Broken ruleYear

This does not look right:

ctparse/ctparse/time/rules.py

Line 150 in 15836ab

if y < 1900:

Maybe solved using a similar suggestion as in #56

function regenerate_model missing

Hi, according to your docs, we should call regenerate_model after updating the rules. But I can't find the function anywhere in the repository. So what should I do after updating the rules?

Interpretation of two-digit years

When trying to parse something like 19.10.93 the result is in 2093 - even when setting the reference time to be before 1993. This is due to how ruleDDMMYYYY handles two digit years. It should probably not "hard move" them to the 21st century but rather use some cutoff year within the century of the reference time.

This is how Excel handles this: https://docs.microsoft.com/en-us/office/troubleshoot/excel/two-digit-year-numbers

Suggestion:

Let the reference year be ccyy (e.g. 1983 => cc=19, yy=83).
Then any two digit year between 0 and yy+10 is interpreted to be within the century cc (e.g. 83 maps to 1983, 93 to 1993), anything above maps to the previous century (e.g. 94 maps to 1894).
In 2019 this would mean that anything up to 29 maps to 20xx, 30+ would map to 1930+
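Written out as code, the suggested cutoff looks like this. It is a direct transcription of the rule above, not the current behavior of ruleDDMMYYYY, and the function name `expand_two_digit_year` is hypothetical.

```python
def expand_two_digit_year(two_digit: int, reference_year: int) -> int:
    """Map a two-digit year into a century relative to the reference year.

    With the reference year written as ccyy, values from 0 to yy + 10 stay
    in century cc; larger values fall into the previous century.
    """
    century, yy = divmod(reference_year, 100)
    if two_digit <= yy + 10:
        return century * 100 + two_digit
    return (century - 1) * 100 + two_digit


print(expand_two_digit_year(93, 1983))  # 1993
print(expand_two_digit_year(94, 1983))  # 1894
print(expand_two_digit_year(30, 2019))  # 1930
```

When yy + 10 reaches 100 or more (reference years ending in 90 to 99), every two-digit year stays in the reference century under this rule.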

[Question] adding new attribute to Time object

I'm trying to add a new attribute called period to the Time object in types.py which will hold a string "am" or "pm" if it's specified in the input ("5am"). This is so I can make some further modifications based on it to the logic of some rules. For this, I've modified the _maybe_apply_am_pm in rules.py as below:

if ampm_match.lower().startswith("a") and t.hour <= 12:
    t.period = "am"
    return t
if ampm_match.lower().startswith("p") and t.hour < 12:
    return Time(hour=t.hour + 12, minute=t.minute, period='pm')

Initially, this worked but disabled many of the other applicable rules in the pipeline and resulted in latent, incomplete outputs (for "5am" I'd get a latent datetime X-X-X with 05:00 instead of a complete datetime for today / tomorrow etc.)

I figured I had to modify Time property isTOD to

@property
def isTOD(self) -> bool:
    """isTimeOfDay - only a time, not date"""
    return self._hasOnly("hour") or self._hasOnly("hour", "minute") or self._hasOnly("hour", "period") or self._hasOnly("hour", "minute", "period")

This re-enabled the previously latent datetime completion, intervals etc. However, my new attribute somehow gets lost in the pipeline of rules and is not propagated to the resulting production. I've spent hours debugging and experimenting with this, but I'm really not sure what else to change to make this work.

When testing the simple "5pm", I tried to look into the latentDOM rule, but the attribute is not propagated to it from HHMM, where it's present. When debugging I used the logging as described in the documentation, but it does not seem to show all the rules that are applied. Or am I missing something somewhere else?

Thank you
