Improve German formats (talon, CLOSED)

mailgun commented on July 28, 2024
Improve German formats

from talon.

Comments (8)

sspross commented on July 28, 2024

or?

SPLITTER_PATTERNS = [
    # ------Original Message------ or ---- Reply Message ----
    re.compile("[\s]*[-]+[ ]*(Original|Reply|Urspr=C3=BCngliche|Antwort) (Message|Nachricht)[ ]*[-]+", re.I),
    ...

obukhov-sergey commented on July 28, 2024

Hi, thanks for contributing! I like the first variant better. It looks like a more structured way to support multiple languages; for example, it's easy to add a comment clarifying that a pattern is the same quotation splitter, just in German.

About special characters: I believe you should be able to put them in as they are, e.g. the pattern would look like

re.compile("[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I)

And I'd also put the rest of the pattern characters in German. Some characters look the same as English ones, but I once entered a password in German and later wasn't able to log in because the layout had changed :)
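A minimal sketch of the special-characters point, assuming Python 3 and assuming the `=C3=BC` in the earlier pattern came from a body that was still quoted-printable encoded. The sample line and the name `RE_GERMAN_SPLITTER` are made up for illustration; `quopri` is the standard-library decoder. It only shows that a literal "ü" in the pattern matches once the text has been decoded.

```python
import quopri
import re

# Hypothetical name; the pattern is the one quoted above, written as a raw string.
RE_GERMAN_SPLITTER = re.compile(
    r"[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I)

# Made-up sample of a body line that is still quoted-printable encoded.
raw_line = b"-----Urspr=C3=BCngliche Nachricht-----"

# Decode quoted-printable to bytes, then UTF-8 bytes to text.
decoded_line = quopri.decodestring(raw_line).decode("utf-8")

# The literal "ü" in the pattern matches the decoded text, not the encoded form.
matches_decoded = RE_GERMAN_SPLITTER.match(decoded_line) is not None
matches_raw = RE_GERMAN_SPLITTER.match(raw_line.decode("ascii")) is not None
```

If the text reaching the splitter is already decoded, the pattern needs no encoded variants at all.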

Indeed, it seems there are some issues with running the tests on macOS. I'll try to figure it out. We plan to migrate to scikit in the future, which should be easier to integrate with.

sspross commented on July 28, 2024

Yes, me too. OK, I added a comment to this one (see #23), but what do you think about RE_ON_DATE_SMB_WROTE? Should we duplicate it or mix the different languages in it? If we duplicate it, we have to refactor preprocess.

sspross commented on July 28, 2024

Hi @obukhov-sergey, shall I improve some of this, or what are we going to do? I'm asking because our project (which depends on this mailgun feature) keeps moving forward, and I need to know whether there is any hope that this German support will hit mailgun's production "soon" or whether I have to include a dev version of talon in our projects first. Thank you very much, and as I already wrote, I'm willing to help!

jeremyschlatter commented on July 28, 2024

I have another proposal for SPLITTER_PATTERNS in #29:

re.compile(u'[\s]*[-]+[ ]*({})[ ]*[-]+'.format(
    u'|'.join((
        # English
        'Original Message', 'Reply Message',
        # German
        u'Ursprüngliche Nachricht', 'Antwort Nachricht',
        # Danish
        'Oprindelig meddelelse',
    ))), re.I)

sspross commented on July 28, 2024

Hi @jeremyschlatter, nobody answered me, so I gave up and made my own copy of talon without the signature stuff and with my own test cases. It works fine for me.

I'm still hoping they let us improve the language support, but in my opinion this whole locale thing should be done differently. Maybe I'll open another pull request for this, but the changes would be so big that I don't think they would let me rewrite the whole library.

If the owner points me in the right direction, I'll come back and contribute my changes. Until then, I have to keep working on my own copy, because this has to work in my project now.

sspross commented on July 28, 2024

Sorry, I didn't realize you're an owner, @jeremyschlatter! As I said in the pull request, one test is still broken and needs further investigation, and maybe we should rewrite this whole locale stuff so that we have one file per language or something.

jeremyschlatter commented on July 28, 2024

I agree that there is probably a better way to do locale stuff, but I'm not sure it needs to be a big change to the library. RE_ORIGINAL_MESSAGE and RE_FROM_COLON_OR_DATE_COLON can easily be extended to other languages now. To make it a little cleaner, we could have a separate data file that lists the translations for "original message", "reply message", "from", and "date" in lots of languages, and then read in that file and construct the two regexes.
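A rough sketch of the data-file idea, assuming Python 3 and a JSON layout invented here for illustration. The inline string stands in for the separate file, and the two patterns are simplified stand-ins, not talon's actual RE_ORIGINAL_MESSAGE or RE_FROM_COLON_OR_DATE_COLON. The helper `alternation` and the lowercase variable names are hypothetical.

```python
import json
import re

# Stand-in for the contents of a separate data file (hypothetical layout).
TRANSLATIONS_JSON = """
{
  "original_message": ["Original Message", "Ursprüngliche Nachricht",
                       "Oprindelig meddelelse"],
  "reply_message": ["Reply Message", "Antwort Nachricht"],
  "from": ["From", "Von"],
  "date": ["Date", "Datum"]
}
"""


def alternation(phrases):
    """Escape each phrase so it matches literally, then OR them together."""
    return "|".join(re.escape(phrase) for phrase in phrases)


translations = json.loads(TRANSLATIONS_JSON)

# Simplified splitter: dashes, a translated phrase, dashes.
re_original_message = re.compile(
    r"[\s]*[-]+[ ]*({})[ ]*[-]+".format(alternation(
        translations["original_message"] + translations["reply_message"])),
    re.I)

# Simplified header matcher: a translated "from" or "date" followed by a colon.
re_from_or_date_colon = re.compile(
    r"[\s]*({})[ ]*:".format(alternation(
        translations["from"] + translations["date"])),
    re.I)
```

Adding a language would then mean adding entries to the data file, with no change to the regex-building code.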

Grammar differences might be harder, as in the case of RE_ON_DATE_SMB_WROTE. Though in the worst case there we could have one regex per language, listed again in a separate file. Or maybe one regex per unique grammar structure, with lots of translations or'd in as in RE_ORIGINAL_MESSAGE?
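The "one regex per unique grammar structure" option could look roughly like this, assuming Python 3. Everything here is a hypothetical sketch: the `GRAMMARS` layout, the helper names, and the word lists are illustrative and not checked against real mail clients, and the patterns are much looser than talon's RE_ON_DATE_SMB_WROTE. It distinguishes two structures, sender before the verb (as in English) and verb before the sender (as in German).

```python
import re


def alternation(words):
    """Escape each word so it matches literally, then OR them together."""
    return "|".join(re.escape(word) for word in words)


# One entry per grammar structure (hypothetical layout, illustrative words).
GRAMMARS = [
    # "<on> <date>, <sender> <wrote>:"  sender before verb, e.g. English
    (r"^[ \t]*(?:{on})\s.+\s(?:{wrote}):[ \t]*$",
     {"on": ["On"], "wrote": ["wrote"]}),
    # "<on> <date> <wrote> <sender>:"   verb before sender, e.g. German, Dutch
    (r"^[ \t]*(?:{on})\s.+?\s(?:{wrote})\s.+:[ \t]*$",
     {"on": ["Am", "Op"], "wrote": ["schrieb", "schreef"]}),
]

# Build one compiled regex per structure, with the translations OR'd in.
ON_DATE_SMB_WROTE_PATTERNS = [
    re.compile(
        template.format(on=alternation(words["on"]),
                        wrote=alternation(words["wrote"])),
        re.I)
    for template, words in GRAMMARS
]


def is_on_date_smb_wrote(line):
    """Return True if the line matches any of the grammar structures."""
    return any(p.match(line) for p in ON_DATE_SMB_WROTE_PATTERNS)
```

A new language that shares an existing word order only adds words to a list; a new word order adds one template.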
