mailgun / talon

License: Apache License 2.0

Python 99.38% Dockerfile 0.44% Shell 0.18%

talon's Issues

Samsung quotations separator

The pattern goes like this:

Sent from Samsung MobileName <[email protected]> wrote: {{the previous 
message text will be inserted here}}

Where Name is the display name for the address <Name [email protected]>
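A minimal sketch of a regex that could recognise this splitter, using only the standard library. The names `RE_SAMSUNG_SPLITTER` and `split_samsung_reply` are made up for illustration and are not part of talon; the pattern assumes the display name follows "Samsung Mobile" directly, as in the example above.

```python
import re

# Hypothetical pattern: "Sent from Samsung Mobile" immediately followed by
# the display name, the address in angle brackets, and "wrote:".
RE_SAMSUNG_SPLITTER = re.compile(
    r"Sent from Samsung Mobile\s*(?P<name>[^<\n]*?)\s*<(?P<email>[^>\n]+)>\s*wrote:",
    re.I,
)


def split_samsung_reply(body):
    """Return (reply, quoted) split at the Samsung marker, or (body, '')."""
    match = RE_SAMSUNG_SPLITTER.search(body)
    if match is None:
        return body, ""
    return body[:match.start()].rstrip(), body[match.end():].lstrip()
```

Everything before the marker is treated as the reply and everything after "wrote:" as the quoted message.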

PyML installation from SourceForge is not working

During Talon setup, the installer references a CDN URL that is unreachable, which breaks the installation.

Also, is there any news on moving to scikit?

After extracting a signature block, it would be nice to break the signature into fields, such as name, title, org, mobile-phone, work-phone, home-phone. Have you thought about building that?
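As a rough illustration of what field extraction could look like, here is a sketch that only picks up "label: value" pairs from an already extracted signature. The label list and the function name `signature_fields` are assumptions for the example; real signatures vary far more than this handles, and nothing here is talon API.

```python
import re

# Hypothetical set of labels; real signatures use many more variants.
RE_LABELLED = re.compile(
    r"\b(?P<label>mobile|office|work|home|phone|email|web|twitter)\s*:\s*(?P<value>[^|\n]+)",
    re.I,
)


def signature_fields(signature):
    """Collect 'label: value' pairs from a signature into a dict."""
    fields = {}
    for line in signature.splitlines():
        for match in RE_LABELLED.finditer(line):
            # Keep the first value seen for each label.
            fields.setdefault(match.group("label").lower(),
                              match.group("value").strip())
    return fields
```

Unlabelled fields such as name and title would need a different approach, for example position within the block.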

Not extracting mail signatures for minor changes in email/signature.

The following code, taken from the example, works and returns the signature as expected.

message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.

John Doe
via mobile"""

text, signature = signature.extract(message, sender='[email protected]')

But when I make a minor edit, such as changing the sender's name and email address, the signature comes back as None. The following values were passed to signature.extract():

message = """
Hello ,
Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.

Sam John
via mobile"""

text, signature = signature.extract(message, sender='[email protected]')

The signature was None for most of the messages I tried.

installing via pip fails

Issue with stripped_text when replying from Hotmail

I have found that if I send an email from Mailgun to a Hotmail account and then reply to it, the body looks like this:

My reply message...
<div><hr id="stopSpelling">
Subject: Sending a message from Mailgun's API<br>
From: [email protected]<br>
To: [email protected]<br>
Date: Thu, 18 Jun 2015 07:45:36 +0000<br>
etc.....

The stripped_text now is "My reply message... Subject: Sending a message from Mailgun's API" but it should be "My reply message..."

As I understand it, extract_from_html applies the cut_from_block rule, which checks for "From" and "Date". In this case "Subject" comes before both "From" and "Date", so the block passes the check and doesn't get stripped.

This issue appears only when I send from Mailgun to Hotmail.
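One way to make the check independent of header order is to look for a run of header-like lines instead of a specific first line. The sketch below works on plain text lines and is only an illustration; `RE_HEADER_LINE`, `find_header_block_start` and the `min_headers` threshold are assumptions, not talon's implementation.

```python
import re

# Header labels that commonly open a quoted block, in any order.
RE_HEADER_LINE = re.compile(r"^\s*(Subject|From|To|Date|Sent|Cc)\s*:", re.I)


def find_header_block_start(lines, min_headers=2):
    """Index of the first line of a run of >= min_headers header-like
    lines, or None if there is no such run."""
    run_start = None
    for i, line in enumerate(lines + [""]):  # sentinel closes a trailing run
        if RE_HEADER_LINE.match(line):
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_headers:
                return run_start
            run_start = None
    return None
```

With the Hotmail body above, the run starts at the "Subject:" line, so the stripped text would end at "My reply message...".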

Signature Detection in HTML Emails

I haven't had a lot of success parsing signatures out of text/html emails. It seems to work pretty well for text/plain emails. Is there a good strategy to parse out the signature for text/html emails?

Thanks,
Pete

nested gmail_quote

Gmail started to nest "gmail_quote" tags:

<div class="gmail_quote">
On <date> [email protected] wrote:
  <blockquote class="gmail_quote">
    Original message
  </blockquote>
</div>

talon removes the nested tag containing the quoted message but leaves the outer tag with the quotation splitter in it.

emails/ folder not included for training

Hi,

It seems like the raw emails you used for the ML training are not included in the repo. I'd like to train the AI on my own emails, can you tell me what's the right format to use?

Talon does not detect signatures with this common format

The signature format:

Hi Mailgun,

Please fix the parsing logic so that it detects and strips signatures such as the one on this email.

Regards


David Perks
Managing Director
email: [email protected]<mailto:[email protected]> | mobile: 0424 282 465 | office: 1300 N REACH (1300 6 73224)
twitter: @withinreachsw | web: www.withinreach.com.au<http://www.withinreach.com.au/> | www.goalhuddle.com<http://www.goalhuddle.com/>

Within Reach Software Pty Ltd, Suite 102, 21 Berry St, North Sydney, NSW 2060

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. Please note that any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Finally, the recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email

Unknown function `detect_encoding` in to_unicode

In the function to_unicode, there's a call to detect_encoding, which isn't defined. See here:

talon/talon/utils.py

Line 45 in 170f110

encoding = detect_encoding(str_or_unicode) if precise else 'utf-8'

Does this really work? Or is to_unicode not used anywhere? Perhaps it should be removed?

Support Ukrainian format

 Reply
23 лист. 2015 р. 09:18 "John Smith" <[email protected]> пише:

> Original message
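A possible pattern for this format, sketched with the standard library only. The name `RE_UK_WROTE` and the helper are hypothetical, and the pattern is derived from the single example above, so other Ukrainian date layouts may not match.

```python
import re

# Hypothetical pattern for lines such as:
#   23 лист. 2015 р. 09:18 "John Smith" <address> пише:
RE_UK_WROTE = re.compile(
    r"^[ \t]*\d{1,2} \w+\.? \d{4} р\. \d{1,2}:\d{2} .*<[^>]+> пише:[ \t]*$",
    re.M,
)


def reply_before_ukrainian_splitter(body):
    """Return the text before the splitter line, or the body unchanged."""
    match = RE_UK_WROTE.search(body)
    if match is None:
        return body
    return body[:match.start()].strip()
```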

Detect signatures with long lines

Generally signatures have short lines (no more than 60 characters), but there is also a class of signatures that have long lines containing long URLs, etc.

Example:

Some text

-- 


John Smith
Co-Founder and CEO
Xxxxxxxxx

mobile: 555.115.4274 | book a mtg
<http://example.com/soooooome/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/path?t=looooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
 | @handle
<http://example.com/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/paaaaaaaaath?t=loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
 | linkedin
<http://example.com/loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong?t=loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
 | video
<http://example.com/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/paaath?t=loooooooooooooooong-parameeeeeeeeeeeeeeeeeeeeeeeeter-query-string>

Currently talon doesn't parse signatures like this
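One idea, sketched below, is to ignore bracketed URLs when measuring line length, so that a line consisting mostly of a long link still counts as short. The 60-character limit is taken from the description above; the function names are invented for the example and this is not how talon currently measures lines.

```python
import re

# URLs wrapped in angle brackets, as mail clients render links in plain text.
RE_BRACKETED_URL = re.compile(r"<https?://[^>\s]+>")
MAX_LINE_LENGTH = 60  # limit mentioned in this issue, used here as an assumption


def effective_length(line):
    """Length of the line once bracketed URLs are removed."""
    return len(RE_BRACKETED_URL.sub("", line).strip())


def looks_like_signature_line(line):
    return effective_length(line) <= MAX_LINE_LENGTH
```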

Pip install fails

When I install talon via pip from PyPI, it installs fine, but it does not install PyML, and installing PyML separately also fails.
So I decided to install the latest version of talon from GitHub, and here is what I got:
https://gist.github.com/EugeneFeshchenko/6335f648838842209a00
I was trying to install it in a newly created env.

Is it me doing something wrong, or does the setup file need fixing?

Also, in general, the installation process is not very obvious:
When you install from PyPI, you have to install PyML separately.
When you install from GitHub, PyML is installed as well.
I didn't find any mention that numpy should be preinstalled in order to install PyML, but its docs say numpy is required.

Add support for Python 3.4

flanker should be listed in setup.py.

flanker is required for some of the tests. It should be added to setup.py.

Zimbra HTML parsing

Run this through http://talon.mailgun.net/

It only strips the reply headers but misses the actual quote in the HTML part.
text/plain extraction works fine.

Date: Thu, 4 Feb 2016 16:56:47 +0100 (CET)
From: [email protected]
To: [email protected]
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <[email protected]>
Subject: Re: Lorem Ipsum
MIME-Version: 1.0
Content-Type: multipart/alternative; 
    boundary="----=_Part_35_1109890054.1454601407386"
X-Originating-IP: [1.1.1.1]
X-Mailer: Zimbra 8.6.0_GA_1153 (ZimbraWebClient - FF44 (Win)/8.6.0_GA_1153)
Thread-Topic: Lorem Ipsum
Thread-Index: ddFMd6wnxYPGpAbdA2oNKj8MgU0bH6/lWgJ/

------=_Part_35_1109890054.1454601407386
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 


From: [email protected] 
To: "admin" <[email protected]> 
Sent: Thursday, February 4, 2016 4:56:33 PM 
Subject: Lorem Ipsum 

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 


------=_Part_35_1109890054.1454601407386
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<html><body><div style=3D"font-family: arial, helvetica, sans-serif; font-s=
ize: 12pt; color: #000000"><div>Lorem ipsum dolor sit amet, consectetur adi=
piscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna al=
iqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris ni=
si ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehender=
it in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteu=
r sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt =
mollit anim id est laborum.</div><div><br></div><hr id=3D"zwchr" data-marke=
r=3D"__DIVIDER__"><div data-marker=3D"__HEADERS__"><b>From: </b>admin@mymon=
eyex.com<br><b>To: </b>"admin" &lt;[email protected]&gt;<br><b>Sent: </b>T=
hursday, February 4, 2016 4:56:33 PM<br><b>Subject: </b>Lorem Ipsum<br></di=
v><br><div data-marker=3D"__QUOTED_TEXT__"><div style=3D"font-family: arial=
, helvetica, sans-serif; font-size: 12pt; color: #000000"><div>Lorem ipsum =
dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididu=
nt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud =
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis =
aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu =
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt=
 in culpa qui officia deserunt mollit anim id est laborum.</div></div><br><=
/div></div></body></html>
------=_Part_35_1109890054.1454601407386--

HtmlEntity(13)

Carriage return "\r" is replaced with "&#13;" when HTML quotations are extracted. This happens when deepcopy is applied to the HTML tree.

To reproduce:

>>> from copy import deepcopy
>>> from lxml import html
>>>
>>> html_tree = html.document_fromstring("<html>\r\n</html>")
>>> html_tree_copy = deepcopy(html_tree)
>>> html.tostring(html_tree)
'<html>\r\n</html>'
>>> html.tostring(html_tree_copy)
'<html>&#13;\n</html>'

Failing Testcase

Just in case it's helpful, we developed this test data based on a signature we received that wasn't parsed correctly. Here's the data I pasted at http://talon.mailgun.net/ to reproduce:

From: <[email protected]>
To: <[email protected]>
Subject: Re: [SPF] Still trying to figure out your signature
Date: Tue, 16 Dec 2014 16:46:22 +0000
Message-Id: <D0B5BDE2.882AA%[email protected]>
Accept-Language: en-US, ja-JP
Content-Language: en-US
User-Agent: Microsoft-MacOutlook/14.4.6.141106
Content-Type: multipart/alternative; boundary="_000_D0B5BDE2882AAtestbotmatoncompanysdomaincom_"
Mime-Version: 1.0
Sender: [email protected]

--_000_D0B5BDE2882AAtestbotmatoncompanysdomaincom_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Sure thing - happy to help! Hope you can get it sorted out :)
-----
Testbot Maton
Manager, Communications Platforms
Company Liner

Phone: +1 555.555.5555
Mobile: +1 555.555.5555

[email protected]<mailto:[email protected]>
http://companysdomain.com<http://companysdomain.com/>

Numpy dep is obsolete

Talon requires Numpy 1.6.1, while the latest version is 1.9.2. Version 1.6.1 doesn't install on Mac OS X, failing with this error: clang: error: invalid argument '-faltivec' only allowed with 'ppc/ppc64/ppc64le'

Arabic quotation splitter

Here's a new quotation splitter example:

في ١٨‏/٠٨‏/٢٠١٤، الساعة ٢:٣٣ م، كتب XXX <[email protected]>:

or:

\u202b\u0641\u064a \u0661\u0668\u200f/\u0660\u0668\u200f/\u0662\u0660\u0661\u0664\u060c \u0627\u0644\u0633\u0627\u0639\u0629 \u0662:\u0663\u0663 \u0645\u060c \u0643\u062a\u0628 XXX <[email protected]>:\u202c

Here's a translation to English:

On 08.18.2014, at 14:33, wrote XXX <[email protected]>:
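A sketch of how this line could be matched: strip the bidirectional control characters first, then look for the fixed words around the date, time and sender. The pattern name, the helper and the placeholder address in the usage are assumptions based on the single example above.

```python
import re

# Bidirectional control characters that wrap or punctuate the line.
BIDI_MARKS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"

# Hypothetical pattern: "fi <date>, al-sa'a <time>, kataba <name> <address>:"
# written with escapes to avoid mixing text directions in the source.
RE_AR_WROTE = re.compile(
    "^\u0641\u064a .+\u060c \u0627\u0644\u0633\u0627\u0639\u0629 .+\u060c "
    "\u0643\u062a\u0628 .+<[^>]+>:$"
)


def is_arabic_splitter(line):
    """True if the line looks like the Arabic 'On <date>, at <time>, wrote' header."""
    cleaned = line.translate({ord(ch): None for ch in BIDI_MARKS}).strip()
    return RE_AR_WROTE.match(cleaned) is not None
```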

Improve Date: splitter

Lines like:

Date: ....
From: ...

usually indicate quotations. But sometimes "Date:" could be part of the text. Parsing could be improved by, e.g., checking that several lines are present, such as a "Date:" line and a "From:" line.

Improve German formats

Hi guys, I just got the link to this repo from your support. It's awesome that this is open source and I would like to help improve the German formats.

I just need some help to get started. E.g. I would like to handle this German Windows Outlook split pattern:

-----Ursprüngliche Nachricht-----

As you can see, this includes a special character ü, which is represented as =C3=BC (in Base64, I guess). So what do I have to do, something like this in quotations.py?

SPLITTER_PATTERNS = [
    # ------Original Message------ or ---- Reply Message ----
    re.compile("[\s]*[-]+[ ]*(Original|Reply) Message[ ]*[-]+", re.I),
    re.compile("[\s]*[-]+[ ]*(Urspr=C3=BCngliche|Antwort) Nachricht[ ]*[-]+", re.I),
    ...

Is this correct? I'll try to add more cases which do not work at the moment (I tested a few, which led me to your support :) ). I'll open a pull request then. Also, how can I run the tests, and should I create tests for each new case? Thanks for your advice.
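For reference, =C3=BC is the quoted-printable form of the UTF-8 bytes for ü, not Base64. A small sketch follows, assuming the body is decoded before the splitter patterns are applied, in which case the pattern can contain the literal ü. `RE_DE_SPLITTER` and `decode_qp` are names invented for this example.

```python
import quopri
import re

# Hypothetical German counterpart of the "Original Message" splitter,
# written against decoded text rather than the raw transfer encoding.
RE_DE_SPLITTER = re.compile(
    r"[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I
)


def decode_qp(raw_bytes, charset="utf-8"):
    """Decode a quoted-printable byte string into text."""
    return quopri.decodestring(raw_bytes).decode(charset)
```

Decoding `b"-----Urspr=C3=BCngliche Nachricht-----"` this way yields the readable line, which the pattern then matches.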

`extract_from_html` breaks when xhtml encoding specified in document

To reproduce, add the following line to the top of tests/fixtures/html_replies/hotmail.html:

<?xml version="1.0" encoding="UTF-8"?>

and run nosetests tests/html_quotations_test.py:test_hotmail_reply. lxml cannot parse html documents from a unicode string when the encoding is declared in the document.
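A possible workaround, sketched here with the standard library only, is to drop the XML declaration before handing the unicode string to the parser (passing bytes instead of unicode would be another option). The helper name is an assumption and the regex only handles a declaration at the very start of the document.

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
RE_XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>\s*", re.I)


def strip_xml_declaration(markup):
    """Remove a leading XML declaration so the encoding is no longer declared."""
    return RE_XML_DECLARATION.sub("", markup, count=1)
```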

PyML failed while installing talon via pip (Ubuntu)

A few thoughts here:
Any idea why PyML was used rather than scikit? scikit would naturally ease deployment.
How does it handle a chain of multiple replies, to extract all the signatures?

sh: 0: getcwd() failed: No such file or directory

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
tar: PyML-0.7.9: Cannot mkdir: No such file or directory
tar: PyML-0.7.9: Cannot mkdir: No such file or directory
tar: PyML-0.7.9/data: Cannot mkdir: No such file or directory

Parse email disclaimer

Sometimes emails have disclaimers. Disclaimer lines are generally longer than 60 chars, so the signature won't be extracted.

C port

Can we possibly run talon in C?

Open source the talon demo app?

Would it be possible to publish the code used for the talon demo app? http://talon.mailgun.net/

It looks like it's hosted on Heroku, which perfectly suits my needs (I want to access it as an external service rather than integrate the library into my app).

Fails to find signatures with 61+ characters.

The first test email I pasted into http://talon.mailgun.net/ had a >60 character signature and it failed to find it. 60 characters works fine. Here's a simplified length checker test case suitable for inclusion into bruteforce_test.py:

def test_signature_lengths():
    for n in range(80):
        content = "CONTENT"
        sig = "\n-- \n" + "."*n
        msg_body = content + sig
        eq_((n, (content.strip(), sig.strip())),
                (n, bruteforce.extract_signature(msg_body)))

Error Initializing Talon

I installed talon using pip on Python 2.7. I'm trying to run the example but I get an error as soon as talon.init() is called. I can't figure out what parser it is colliding with - any ideas what is wrong? I didn't receive any error installing talon with pip.

Here's the code, error, and the other files I see on my system called parser.py.

Parser.py files:

vagrant@vagrant-ubuntu-trusty-64:/$ sudo find . -name "parser.py"
./usr/lib/python3.4/html/parser.py
./usr/lib/python3.4/email/parser.py
./usr/lib/python2.7/email/parser.py
./usr/lib/python2.7/dist-packages/jinja2/parser.py
./usr/lib/python2.7/dist-packages/yaml/parser.py
./usr/lib/python2.7/dist-packages/dateutil/parser.py
./usr/lib/python3/dist-packages/ufw/parser.py
./usr/local/lib/python2.7/dist-packages/cssselect/parser.py

Code:

import talon
from talon import quotations

talon.init()

text =  """Reply

-----Original Message-----

Quote"""

reply = quotations.extract_from(text, 'text/plain')
reply = quotations.extract_from_plain(text)
print(reply)

Error

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    talon.init()
  File "/usr/local/lib/python2.7/dist-packages/talon/__init__.py", line 7, in init
    signature.initialize()
  File "/usr/local/lib/python2.7/dist-packages/talon/signature/__init__.py", line 38, in initialize
    EXTRACTOR_DATA)
  File "/usr/local/lib/python2.7/dist-packages/talon/signature/learning/classifier.py", line 31, in load
    return joblib.load(saved_classifier_filename)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 425, in load
    obj = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 291, in load_build
    array = nd_array_wrapper.read(self)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 113, in read
    mmap_mode=unpickler.mmap_mode)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 394, in load
    return format.read_array(fid)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/format.py", line 437, in read_array
    shape, fortran_order, dtype = read_array_header_1_0(fp)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/format.py", line 334, in read_array_header_1_0
    d = safe_eval(header)
  File "/usr/lib/python2.7/dist-packages/numpy/lib/utils.py", line 1128, in safe_eval
    ast = compiler.parse(source, mode="eval")
  File "/usr/lib/python2.7/compiler/transformer.py", line 53, in parse
    return Transformer().parseexpr(buf)
  File "/usr/lib/python2.7/compiler/transformer.py", line 132, in parseexpr
    return self.transform(parser.expr(text))
AttributeError: 'module' object has no attribute 'expr'

Extract both msg and text from text message

I sent a plain-text message from Gmail and I am unable to extract the signature from it.

How can I do this? extract_signature doesn't work for plain messages.

In [34]: msg.text
Out[34]: u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik  <[email protected]> napisa\u0142:\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>'

In [35]: quotations.extract_from(msg.text, 'text/plain')
Out[35]: u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik  <[email protected]> napisa\u0142:'

In [36]: extract_signature(msg.text)
Out[36]: 
(u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik  <[email protected]> napisa\u0142:\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>',
 None)

I want to receive a result like:

In [40]: msg.text.replace(quotations.extract_from(msg.text, 'text/plain'),'')
Out[40]: u'\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>

Parsing of gmail forwarded messages

Forwarded email content sent from Gmail is pruned by Talon. Normally when you forward a message you need to preserve it.

A potential improvement could detect the ------Forwarded message------ text inside the first child node of the <div class="gmail_quote"> element and output the original message if gmail_quote div is not deeply nested.

Thanks,
Sokratis

JSON for Linking Data

Hey ho!

Thanks for creating this awesome library!

I was wondering if you have any thoughts or plans on supporting json for linking data. It is used to power gmail actions (see https://developers.google.com/gmail/actions/actions/actions-overview) and it generally seems like a great thing to have anyway.

I do not mind spending the time doing the development work to get this in but was wondering if you would accept a PR if I do this. Also, is there any specific way it should be done?

Improve handling dashes

Dashes could be mistaken for signature separators, e.g:

Hi,

some item list:
- item 1
- item 2
item 2 continued
- item3
some text

Because "item 3" is separated from the rest of the list (by "item 2 continued" line) it's mistaken for signature.
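One way to tell the two apart, as a sketch: treat a line made only of two or more dashes as a signature delimiter, and a single dash followed by whitespace and text as a list item. The names and the exact rules are assumptions for illustration, not talon's logic.

```python
import re

# A line consisting only of two or more dashes (optionally padded), e.g. "-- ".
RE_SIGNATURE_DELIMITER = re.compile(r"^\s*-{2,}\s*$")
# A single dash followed by whitespace and some text, e.g. "- item 1".
RE_LIST_ITEM = re.compile(r"^\s*-\s+\S")


def classify_dash_line(line):
    """Classify a line as 'delimiter', 'list_item' or 'text'."""
    if RE_SIGNATURE_DELIMITER.match(line):
        return "delimiter"
    if RE_LIST_ITEM.match(line):
        return "list_item"
    return "text"
```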

Fix date format

Reply

2016-09-12 10:55 GMT 02:00 Lara Croft <
[email protected]>: 

Original message

Mailgun Talon: Signature extraction example throwing error

Crossposting with StackOverflow (http://stackoverflow.com/questions/25639506/mailgun-talon-signature-extraction-example-throwing-error)

I installed mailgun/talon on GCE and was trying out the example in the README section, but it threw the following error at me:

from talon import signature
message = """Thanks Sasha, I can't go any higher and is why I limited it to the
... homepage.
...
... John Doe
... via mobile"""
message
"Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage.\n\nJohn Doe\nvia mobile"
text,signtr = signature.extract(message, sender='[email protected]')
ERROR:talon.signature.extraction:ERROR when extracting signature with classifiers
Traceback (most recent call last):
File "talon/signature/extraction.py", line 57, in extract
markers = _mark_lines(lines, sender)
File "talon/signature/extraction.py", line 99, in _mark_lines
elif is_signature_line(line, sender, EXTRACTOR):
File "talon/signature/extraction.py", line 40, in is_signature_line
return classifier.decisionFunc(data, 0) > 0
AttributeError: 'NoneType' object has no attribute 'decisionFunc'

Do I need to train the model somehow (this signature seems to be the ML example)? I installed it using pip.

Tests look strange

I'm looking at https://github.com/mailgun/talon/blob/master/tests/signature/learning/helpers_test.py

And don't understand why in

'Sergey N.  Obukhov <[email protected]>': ['Sergey', 'Obukhov'],

the expected result doesn't include 'serobnic'

Detect disclaimers

Many messages have disclaimers at the bottom. Signature detection doesn't work unless you strip disclaimers first.
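A very rough sketch of stripping a trailing disclaimer before signature detection. It assumes the disclaimer is the last paragraph and starts with one of a few typical phrases; both the phrase list and the function name are made up for the example, and paragraph spacing is normalised to a single blank line in the returned text.

```python
import re

# Hypothetical openers; real disclaimers vary a lot by organisation and language.
RE_DISCLAIMER_START = re.compile(
    r"^\s*(this (e-?mail|message)|confidentiality notice|disclaimer)\b", re.I
)


def strip_disclaimer(body):
    """Return (body_without_disclaimer, disclaimer) or (body, None)."""
    paragraphs = re.split(r"\n\s*\n", body.rstrip())
    if len(paragraphs) > 1 and RE_DISCLAIMER_START.match(paragraphs[-1]):
        return "\n\n".join(paragraphs[:-1]), paragraphs[-1]
    return body, None
```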

More granular message / signature parts classification

The algo would be more accurate if the classification were more detailed. E.g. if it classifies a message greeting, there could be a sanity check saying that a signature can't come right after a greeting.

Similarly, if a closing phrase (e.g. "Kind Regards,", "Thanks,", etc.) is detected in the signature candidate lines, we could say for sure that it should be the first signature line.

Another example - disclaimers. Some messages have them and that totally breaks signature detection logic.
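The closing-phrase idea could look roughly like this: scan from the bottom for a line that consists only of a closing phrase and start the signature there. The phrase list and the names are assumptions for the sketch; this is not the classifier talon uses.

```python
import re

# Hypothetical list of closing phrases, each on a line of its own.
RE_CLOSING = re.compile(
    r"^\s*(kind regards|best regards|regards|thanks|best|cheers)\s*,?\s*$", re.I
)


def split_at_closing(body):
    """Return (text, signature), with the signature starting at the last
    closing-phrase line, or (body, None) if none is found."""
    lines = body.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if RE_CLOSING.match(lines[i]):
            return "\n".join(lines[:i]).rstrip(), "\n".join(lines[i:])
    return body, None
```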

Handle gmail_quote div in emails created by clicking the Reply button in gmail

If you click "reply" in gmail, and then change the subject line and undo the quotation in the WYSIWYG editor, you get something that looks a lot like a fresh email that you just typed yourself. If you send the email to another gmail user, it will render for them the way you would expect if you composed the email yourself from scratch.

But gmail wraps the content in a <div class="gmail_quote">, which gets stripped by quotations.extract_from_html. Ideally, talon would know when this was the case and not treat the div as a quote. This seems hard, though. It's not clear to me how to identify this case.

Misclassified signature line

Hi Stephen,

Yep Tuesday @3pm suits. I'll chat to you then.

Kind Regards,
John Doe

Date time appointment data is stripped off

Consider the following example:

Hi Xxx,

Good day to you. I have already scheduled an appointment with Dr. 
Xxx Xxx for you to pick up your xxx. Please see 
appointment details below:

Date: Monday, 8/25/2014
Time: 9:15 AM

If you have any questions, please let me know and I will be happy to help.

Have a great weekend!

Best,

Xxx X.
Perssist Assistant
www.example.com

The "Date: ..." part and the rest of the email are stripped off because they are mistaken for quotations.
In talon/talon/quotations.py the following splitter pattern is checked:

SPLITTER_PATTERNS = [
...
    re.compile('(_+\r?\n)?[\s]*(:?[*]?From|Date):[*]? .*'),
...
    ]

The pattern should be adjusted to check for both "From: ..." and "Date: ..." A possible implementation could be similar to RE_ON_DATE_SMB_WROTE pattern, also see mark_message_lines() on how multi-line patterns are handled.
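A sketch of such a combined check: a multi-line regex that only fires when a "From:" or "Date:" line is followed, within the next three lines, by the other one of the pair. `RE_FROM_DATE_BLOCK` and `find_quotation_start` are hypothetical names, and the three-line window is an arbitrary choice for the example.

```python
import re

# Hypothetical multi-line pattern: "From:" then "Date:" (or the reverse),
# with at most two other lines in between.
RE_FROM_DATE_BLOCK = re.compile(
    r"^[ \t]*[*]?(?P<first>From|Date):[*]? .*\n"
    r"(?:.*\n){0,2}?"
    r"[ \t]*[*]?(?!(?P=first):)(?:From|Date):[*]? .*$",
    re.M,
)


def find_quotation_start(body):
    """Offset where a From/Date header block begins, or None."""
    match = RE_FROM_DATE_BLOCK.search(body)
    return None if match is None else match.start()
```

With the appointment email above, the lone "Date:" line is not followed by "From:", so nothing is cut.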

One div wraps quotation and reply

Consider email html like this:

<div>
<p>Reply</p>
<p>
  <b>From:</b>[email protected]
  ...
</p>
<p>Original email</p>
</div>

talon detects the From: ... block, goes up till it finds a "wrapping div" and cuts off the whole message!

Better handling of Date: From: block

Usually "Date:" and "From:" indicate quotations but sometimes looking for them leads to false positives (e.g. when invitation details are provided with "Date:") or false negatives (when "Subject:" goes first).

No extraction of forwarded part

If I read the code correctly, it is intended that

reply = quotations.extract_from_plain(body)

does not extract forwarded messages?

So something like:

test

----- Forwarded Message -----
From: "test" [email protected]
Sent: Thursday, January 28, 2016 4:31:32 PM
Subject: test

testmail

will be ignored and returned as is, as part of the reply?

Would it be possible to extract forwarded messages just like replies?

So either chop off the forwarded part with quotations.extract_from_...(body), or provide a dedicated function to run against the body if reply equals body?
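The dedicated-function option could be sketched as below: split the body at a "Forwarded Message" marker line and return both parts. `RE_FORWARDED` and `split_forwarded` are hypothetical names and not part of talon's quotations module.

```python
import re

# Marker lines such as "----- Forwarded Message -----" or
# "---------- Forwarded message ----------".
RE_FORWARDED = re.compile(
    r"^[ \t]*-{2,}[ \t]*Forwarded [Mm]essage[ \t]*-{2,}[ \t]*$", re.M
)


def split_forwarded(body):
    """Return (own_text, forwarded_part) or (body, None) if no marker is found."""
    match = RE_FORWARDED.search(body)
    if match is None:
        return body, None
    return body[:match.start()].rstrip(), body[match.end():].lstrip("\r\n")
```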

Update lxml to 3.3.1

Update regex to 2014.12.24

Hello!

I tried installing talon in Python 3 today and ran into the following issue installing regex:

Downloading/unpacking regex==0.1.20110315
  Downloading regex-0.1.20110315.tar.gz (948kB): 948kB downloaded
  Running setup.py (path: .../regex/setup.py) egg_info for package regex
    No unicodedata_db.h could be prepared for Python 3.4.2 (default, Oct 19 2014, 17:55:38)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]
    Complete output from command python setup.py egg_info:
    No unicodedata_db.h could be prepared for Python 3.4.2 (default, Oct 19 2014, 17:55:38)

[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]

----------------------------------------
Cleaning up...

I don't have this problem installing regex==2014.12.24. Could you please consider updating the dependency version?

I would be happy to bump the version, ensure tests pass, and open a PR. Thanks!

Install via pip fails due to missing README.rst file

In a fresh environment:

~ > pip install talon
Downloading/unpacking talon
  Downloading talon-1.0.tar.gz
  Running setup.py (path:/private/tmp/pip_build_root/talon/setup.py) egg_info for package talon
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/private/tmp/pip_build_root/talon/setup.py", line 13, in <module>
        long_description=open("README.rst").read(),
    IOError: [Errno 2] No such file or directory: 'README.rst'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 17, in <module>

  File "/private/tmp/pip_build_root/talon/setup.py", line 13, in <module>

    long_description=open("README.rst").read(),

IOError: [Errno 2] No such file or directory: 'README.rst'

Dutch Apple mail splitter

Hi all,

After contacting support, I got this link to submit a new Dutch splitter format. Hope this improves this great library!

"Op 14 jan. 2015, om 13:52 heeft ... het volgende geschreven:"
Op <:date>. om heeft het volgende geschreven:

How should I proceed?
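A possible starting point is a regex along these lines. The name `RE_NL_WROTE` is made up, and the pattern is derived only from the single example above, so other Dutch date formats may need adjustments.

```python
import re

# Hypothetical pattern for lines such as:
#   Op 14 jan. 2015, om 13:52 heeft <name> het volgende geschreven:
RE_NL_WROTE = re.compile(
    r"^[ \t]*Op .+?,? om \d{1,2}:\d{2} heeft .+? het volgende geschreven:[ \t]*$",
    re.M,
)


def is_dutch_splitter(line):
    return RE_NL_WROTE.match(line) is not None
```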
