Comments (8)
or?
SPLITTER_PATTERNS = [
# ------Original Message------ or ---- Reply Message ----
re.compile("[\s]*[-]+[ ]*(Original|Reply|Urspr=C3=BCngliche|Antwort) (Message|Nachricht)[ ]*[-]+", re.I),
...
from talon.
Hi, thanks for contributing! I like the first variant better. It looks like it's a more structured way to support multiple languages. E.g. it's easy to put a comment to clarify that it's the same quotations splitter pattern but e.g. in German, etc.
About special characters: I believe you should be able to put them in as they are, e.g. the pattern would look like this:
re.compile("[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I)
And I'd also put the rest of the pattern characters in German. Some characters look the same as English ones, but I once entered a password with a German layout and later wasn't able to log in because the layout had changed :)
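To illustrate the point about special characters, here is a minimal sketch, assuming Python 3 (where `str` patterns are Unicode by default). The `=C3=BC` in the first snippet looks like the quoted-printable encoding of the UTF-8 bytes for `ü`, so a raw mail body would need decoding before matching. The sample strings and variable names are illustrative only and are not part of talon.

```python
import quopri
import re

# The literal "ü" can go straight into the pattern.
splitter = re.compile(
    r"[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I)

# A quoted-printable encoded line, as it might appear in a raw mail body.
raw = b"-----Urspr=C3=BCngliche Nachricht-----"

# Decode quoted-printable to bytes, then UTF-8 bytes to text.
decoded = quopri.decodestring(raw).decode("utf-8")

# The decoded text matches; the still-encoded text does not.
matched_decoded = splitter.match(decoded) is not None
matched_raw = splitter.match(raw.decode("ascii")) is not None
```

If the text is decoded before it reaches the regex, the pattern itself never needs the `=C3=BC` form.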
Indeed, it seems there are some issues with running the tests on macOS. I'll try to figure it out. We plan to migrate to scikit in the future, which should be easier to integrate with.
from talon.
Yes, me too. OK, I added a comment to this one (see #23), but what do you think about RE_ON_DATE_SMB_WROTE? Should we duplicate it or mix the different languages in it? If we duplicate it, we have to refactor preprocess.
from talon.
Hi @obukhov-sergey, shall I improve some of this, or how do we proceed? I'm asking because our project (which depends on this mailgun feature) keeps moving forward, and I need to know whether there is any hope that German support will reach mailgun's production "soon" or whether I have to include a dev version of talon in our projects first. Thank you very much, and as I already wrote, I'm willing to help!
from talon.
I have another proposal for SPLITTER_PATTERNS in #29:
re.compile(u'[\s]*[-]+[ ]*({})[ ]*[-]+'.format(
    u'|'.join((
        # English
        'Original Message', 'Reply Message',
        # German
        u'Ursprüngliche Nachricht', 'Antwort Nachricht',
        # Danish
        'Oprindelig meddelelse',
    ))), re.I)
from talon.
Hi @jeremyschlatter, nobody answered me, so I gave up and made my own copy of talon without the signature stuff and with my own test cases. It works fine for me.
I'm still hoping they let us improve the language handling, but in my opinion this whole locale thing should be done differently. Maybe I'll open another pull request for this, but the changes would be so big that I don't think they would let me rewrite the whole library.
If the owner points me in the right direction, I'll come back and contribute my changes the way they want. Until then, I have to keep working on my own copy, because this has to work in my project now.
from talon.
Sorry, I didn't realize you're an owner, @jeremyschlatter! As I said in the pull request, one test is still broken and needs further investigation, and maybe we should rewrite this whole locale stuff so that we have one file per language or something.
from talon.
I agree that there is probably a better way to do locale stuff, but I'm not sure it needs to be a big change to the library. RE_ORIGINAL_MESSAGE and RE_FROM_COLON_OR_DATE_COLON can easily be extended to other languages now. To make it a little cleaner, we could have a separate data file that lists the translations for "original message", "reply message", "from", and "date" in lots of languages, and then read in that file and construct the two regexes.
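The data-file idea could be sketched roughly as follows. This is only an illustration under assumptions: the JSON layout, the helper names `alternation` and `build_patterns`, and the simplified "from/date" pattern are all hypothetical and are not talon's actual code or patterns. The JSON is inlined as a string to keep the example self-contained; in practice it would be read from a separate file.

```python
import json
import re

# Hypothetical contents of a translations data file (layout is an assumption).
TRANSLATIONS_JSON = """
{
  "original_message": ["Original Message", "Ursprüngliche Nachricht",
                       "Oprindelig meddelelse"],
  "reply_message": ["Reply Message", "Antwort Nachricht"],
  "from": ["From", "Von"],
  "date": ["Date", "Datum"]
}
"""


def alternation(phrases):
    """Join phrases into a regex alternation, escaping special characters."""
    return "|".join(re.escape(phrase) for phrase in phrases)


def build_patterns(translations):
    """Construct the two regexes from a dict of translated phrases."""
    headers = alternation(
        translations["original_message"] + translations["reply_message"])
    original_message = re.compile(
        r"[\s]*[-]+[ ]*(%s)[ ]*[-]+" % headers, re.I)

    # Simplified stand-in for the real "From:" / "Date:" pattern.
    fields = alternation(translations["from"] + translations["date"])
    from_or_date = re.compile(r"[\s]*(%s)[ ]*:" % fields, re.I)
    return original_message, from_or_date


RE_ORIGINAL_MESSAGE, RE_FROM_COLON_OR_DATE_COLON = build_patterns(
    json.loads(TRANSLATIONS_JSON))
```

Adding a language would then mean adding entries to the data file, with no change to the regex-building code.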
Grammar differences might be harder, as in the case of RE_ON_DATE_SMB_WROTE. Though in the worst case there we could have one regex per language, listed again in a separate file. Or maybe one regex per unique grammar structure, with lots of translations or'd in as in RE_ORIGINAL_MESSAGE?
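The "one regex per language" fallback might look something like this. It is a rough sketch, not talon's implementation: the two patterns, the `PATTERNS` dict, and the function name `match_on_date_smb_wrote` are hypothetical and heavily simplified. It shows why word order matters: the English verb comes last, while the German verb sits between the date and the sender.

```python
import re

# Hypothetical, simplified patterns keyed by language code.
PATTERNS = {
    # English: "On <date>, <somebody> wrote:"
    "en": re.compile(r"^\s*On\s(?P<date_and_sender>.+)\swrote:\s*$", re.I),
    # German: "Am <date> schrieb <somebody>:"
    "de": re.compile(
        r"^\s*Am\s(?P<date>.+?)\sschrieb\s(?P<sender>.+):\s*$", re.I),
}


def match_on_date_smb_wrote(line):
    """Return (language, match) for the first matching pattern, else None."""
    for language, pattern in PATTERNS.items():
        match = pattern.match(line)
        if match:
            return language, match
    return None


english = match_on_date_smb_wrote("On Mon, 5 Jan 2015, John Doe wrote:")
german = match_on_date_smb_wrote(
    "Am 05.01.2015 um 10:00 schrieb Max Mustermann:")
```

Languages that share a grammar structure could share one entry with the translated words or'd in, which would keep the table from growing by one regex per language.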
from talon.