mailgun / talon
License: Apache License 2.0
The pattern goes like this:
Sent from Samsung MobileName <[email protected]> wrote: {{the previous
message text will be inserted here}}
Where Name is the display name for the address <Name [email protected]>.
Here's a sample:
Second from gmail
2014-10-17 11:28 GMT+03:00 Postmaster <
[email protected]>:
> First from site
>
The quotation SPLITTER_PATTERNS from talon.quotations need to be adjusted. In this particular case no new pattern is needed; instead, one of the following needs to be fixed:
re.compile("(\d+/\d+/\d+|\d+\.\d+\.\d+).*@", re.VERBOSE)
re.compile('\S{3,10}, \d\d? \S{3,10} 20\d\d,? \d\d?:\d\d(:\d\d)?'
'( \S+){3,6}@\S+:')
Consider merging them into one.
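A minimal sketch of what a single merged pattern could look like, using only Python's re module. RE_DATE_NAME_WROTE and find_splitter are hypothetical names, the pattern is not one of talon's SPLITTER_PATTERNS, and postmaster@example.com is a stand-in for the address in the sample. It covers only the year-first date format from the sample above and tolerates the line wrap before the address.

```python
import re

# Hypothetical pattern, not taken from talon.
# Matches "<date> <time> GMT+HH:MM <display name> <address>:" where the
# address may have been wrapped onto the next line by the mail client.
RE_DATE_NAME_WROTE = re.compile(
    r"\d{4}-\d{2}-\d{2}\s+\d{1,2}:\d{2}\s+GMT[+-]\d{2}:\d{2}"  # date, time, offset
    r"\s+[^<\n]+"                                               # display name
    r"<\s*\S+@\S+\s*>:"                                         # <address>:
)


def find_splitter(text):
    """Return the offset where the quotation header starts, or None."""
    match = RE_DATE_NAME_WROTE.search(text)
    return match.start() if match else None


sample = (
    "Second from gmail\n"
    "\n"
    "2014-10-17 11:28 GMT+03:00 Postmaster <\n"
    "postmaster@example.com>:\n"
    "> First from site\n"
)
offset = find_splitter(sample)
reply = sample[:offset].strip() if offset is not None else sample
```

Because the header is matched against the whole text rather than line by line, the wrapped address does not need special handling.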
During setup of Talon, a CDN reference points to an unreachable host, which breaks the installation.
Also, is there any news on moving to scikit?
After extracting a signature block, it would be nice to break the signature into fields, such as name, title, org, mobile-phone, work-phone, home-phone. Have you thought about building that?
The following code, which is the example from the documentation, works and returns the signature as expected.
message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.
John Doe
via mobile"""
text, signature = signature.extract(message, sender='[email protected]')
But when I make a minor edit, changing the sender's name and email address, the signature comes back as "None". These are the values passed to signature.extract():
message = """
Hello ,
Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.
Sam John
via mobile"""
text, signature = signature.extract(message, sender='[email protected]')
Signature returned None for most of the messages that were tried.
I have found that if I send an email from mailgun to a hotmail account and I want to reply to that, the body looks like this:
My reply message...
<div><hr id="stopSpelling">
Subject: Sending a message from Mailgun's API<br>
From: [email protected]<br>
To: [email protected]<br>
Date: Thu, 18 Jun 2015 07:45:36 +0000<br>
etc.....
The stripped_text is now "My reply message... Subject: Sending a message from Mailgun's API", but it should be "My reply message..."
As I understand it, extract_from_html applies the cut_from_block rule, which checks for "From" and "Date". In this case "Subject" appears above both "From" and "Date", so it passes the check and is not stripped.
This issue appears only when I send from Mailgun to Hotmail.
I haven't had a lot of success parsing signatures out of text/html emails. It seems to work pretty well for text/plain emails. Is there a good strategy to parse out the signature for text/html emails?
Thanks,
Pete
Gmail started to nest "gmail_quote" tags:
<div class="gmail_quote">
On <date> [email protected] wrote:
<blockquote class="gmail_quote">
Original message
</blockquote>
</div>
talon removes the nested tag containing the quoted message but leaves the outer tag, with the quotation splitter still in it.
Hi,
It seems like the raw emails you used for the ML training are not included in the repo. I'd like to train the AI on my own emails, can you tell me what's the right format to use?
The signature format:
Hi Mailgun,
Please fix the parsing logic so that it detects and strips signatures such as the one on this email.
Regards
David Perks
Managing Director
email: [email protected]<mailto:[email protected]> | mobile: 0424 282 465 | office: 1300 N REACH (1300 6 73224)
twitter: @withinreachsw | web: www.withinreach.com.au<http://www.withinreach.com.au/> | www.goalhuddle.com<http://www.goalhuddle.com/>
Within Reach Software Pty Ltd, Suite 102, 21 Berry St, North Sydney, NSW 2060
This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the system manager. Please note that any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Finally, the recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email
In the function to_unicode, there's a call to detect_encoding, which isn't defined. See here:
Line 45 in 170f110
Does this really work? Or is to_unicode not used anywhere? Perhaps it should be removed?
Reply
23 лист. 2015 р. 09:18 "John Smith" <[email protected]> пише:
> Original message
Generally signatures have short lines - no more than 60 characters but there is also a class of signatures that have long lines with long URLs, etc.
Example:
Some text
--
John Smith
Co-Founder and CEO
Xxxxxxxxx
mobile: 555.115.4274 | book a mtg
<http://example.com/soooooome/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/path?t=looooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
| @handle
<http://example.com/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/paaaaaaaaath?t=loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
| linkedin
<http://example.com/loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong?t=loooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong>
| video
<http://example.com/looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong/paaath?t=loooooooooooooooong-parameeeeeeeeeeeeeeeeeeeeeeeeter-query-string>
Currently talon doesn't parse signatures like this
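One possible approach, sketched here with only the standard library, is to ignore URLs when measuring line length, so that a line such as "| linkedin" followed by a long link still counts as short. RE_URL, effective_length and looks_short are hypothetical names, and the 60-character threshold is taken from the description above, not from talon's code.

```python
import re

# Assumption: URLs appear bare or wrapped in angle brackets, as in the example.
RE_URL = re.compile(r"<?https?://\S+>?")
MAX_LINE_LENGTH = 60  # threshold quoted in the issue text


def effective_length(line):
    """Length of the line once URLs are removed and whitespace is trimmed."""
    return len(RE_URL.sub("", line).strip())


def looks_short(line):
    """True if the line is short enough to be a signature line."""
    return effective_length(line) <= MAX_LINE_LENGTH


long_link = "| linkedin <http://example.com/" + "o" * 90 + "?t=" + "o" * 60 + ">"
```

With this measure, long_link has an effective length of 10 characters even though the raw line is far longer than 60.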
When I install talon via pip from PyPI, it installs fine, but it does not install PyML, and installing PyML separately also fails.
So I decided to install the latest version of talon from GitHub, and here is what I got:
https://gist.github.com/EugeneFeshchenko/6335f648838842209a00
I was trying to install it in a newly created env.
Am I doing something wrong, or does the setup file need fixing?
Also, in general, the installation process is not very obvious:
When you install from PyPI, you have to install PyML separately.
When you install from GitHub, PyML is installed as well.
I didn't find any note that numpy must be preinstalled in order to install PyML, but its documentation says numpy is required.
flanker is required for some of the tests. It should be added to setup.py.
Run this through http://talon.mailgun.net/
It strips only the reply headers and misses the actual quote in the HTML part.
text/plain extraction works fine.
Date: Thu, 4 Feb 2016 16:56:47 +0100 (CET)
From: [email protected]
To: [email protected]
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <[email protected]>
Subject: Re: Lorem Ipsum
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_Part_35_1109890054.1454601407386"
X-Originating-IP: [1.1.1.1]
X-Mailer: Zimbra 8.6.0_GA_1153 (ZimbraWebClient - FF44 (Win)/8.6.0_GA_1153)
Thread-Topic: Lorem Ipsum
Thread-Index: ddFMd6wnxYPGpAbdA2oNKj8MgU0bH6/lWgJ/
------=_Part_35_1109890054.1454601407386
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
From: [email protected]
To: "admin" <[email protected]>
Sent: Thursday, February 4, 2016 4:56:33 PM
Subject: Lorem Ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
------=_Part_35_1109890054.1454601407386
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
<html><body><div style=3D"font-family: arial, helvetica, sans-serif; font-s=
ize: 12pt; color: #000000"><div>Lorem ipsum dolor sit amet, consectetur adi=
piscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna al=
iqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris ni=
si ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehender=
it in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteu=
r sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt =
mollit anim id est laborum.</div><div><br></div><hr id=3D"zwchr" data-marke=
r=3D"__DIVIDER__"><div data-marker=3D"__HEADERS__"><b>From: </b>admin@mymon=
eyex.com<br><b>To: </b>"admin" <[email protected]><br><b>Sent: </b>T=
hursday, February 4, 2016 4:56:33 PM<br><b>Subject: </b>Lorem Ipsum<br></di=
v><br><div data-marker=3D"__QUOTED_TEXT__"><div style=3D"font-family: arial=
, helvetica, sans-serif; font-size: 12pt; color: #000000"><div>Lorem ipsum =
dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididu=
nt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud =
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis =
aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu =
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt=
in culpa qui officia deserunt mollit anim id est laborum.</div></div><br><=
/div></div></body></html>
------=_Part_35_1109890054.1454601407386--
Carriage return "\r" is replaced with " " when HTML quotations are extracted. This happens when deepcopy is applied to the HTML tree.
To reproduce:
>>> from copy import deepcopy
>>> from lxml import html
>>>
>>> html_tree = html.document_fromstring("<html>/r/n</html>")
>>> html_tree_copy = deepcopy(html_tree)
>>> html.tostring(html_tree)
'<html>/r/n</html>'
>>> html.tostring(html_tree_copy)
'<html> /n</html>'
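One possible workaround, sketched under the assumption that the caller controls the text before it is parsed, is to hand the extractor an LF-only copy and restore CRLF afterwards. extract_with_lf is a hypothetical helper and the extract argument stands for any callable that takes and returns a string; neither is part of talon's API.

```python
def extract_with_lf(body, extract):
    """Run `extract` on an LF-only copy of `body`, then restore CRLF.

    `extract` is any callable taking and returning a string (hypothetical).
    Restoring is approximate: every "\n" in the result becomes "\r\n".
    """
    had_crlf = "\r\n" in body
    result = extract(body.replace("\r\n", "\n"))
    return result.replace("\n", "\r\n") if had_crlf else result


seen = []
reply = extract_with_lf("<p>Reply</p>\r\n<p>Quote</p>",
                        lambda s: seen.append(s) or s.split("<p>Quote")[0])
```

The extractor never sees a carriage return, so nothing is left for the tree copy to rewrite.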
Just in case it's helpful, we developed this test data based on a signature we received that wasn't parsed correctly. Here's the data I pasted at http://talon.mailgun.net/ to reproduce:
From: <[email protected]>
To: <[email protected]>
Subject: Re: [SPF] Still trying to figure out your signature
Date: Tue, 16 Dec 2014 16:46:22 +0000
Message-Id: <D0B5BDE2.882AA%[email protected]>
Accept-Language: en-US, ja-JP
Content-Language: en-US
User-Agent: Microsoft-MacOutlook/14.4.6.141106
Content-Type: multipart/alternative; boundary="_000_D0B5BDE2882AAtestbotmatoncompanysdomaincom_"
Mime-Version: 1.0
Sender: [email protected]
--_000_D0B5BDE2882AAtestbotmatoncompanysdomaincom_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Sure thing - happy to help! Hope you can get it sorted out :)
-----
Testbot Maton
Manager, Communications Platforms
Company Liner
Phone: +1 555.555.5555
Mobile: +1 555.555.5555
[email protected]<mailto:[email protected]>
http://companysdomain.com<http://companysdomain.com/>
Talon requires Numpy 1.6.1, while the latest version is 1.9.2. Version 1.6.1 doesn't install on Mac OS X, failing with this error: clang: error: invalid argument '-faltivec' only allowed with 'ppc/ppc64/ppc64le'
Here's a new quotation splitter example:
في ١٨/٠٨/٢٠١٤، الساعة ٢:٣٣ م، كتب XXX <[email protected]>:
or:
\u202b\u0641\u064a \u0661\u0668\u200f/\u0660\u0668\u200f/\u0662\u0660\u0661\u0664\u060c \u0627\u0644\u0633\u0627\u0639\u0629 \u0662:\u0663\u0663 \u0645\u060c \u0643\u062a\u0628 XXX <[email protected]>:\u202c
Here's a translation to English:
On 08.18.2014, at 14:33, wrote XXX <[email protected]>:
Lines like:
Date: ....
From: ...
usually indicate quotations. But sometimes "Date:" could be part of the text. Parsing could be improved by, for example, requiring several such lines to be present together, such as both a "Date:" line and a "From:" line.
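A minimal sketch of that idea: treat header-like lines as a quotation only when at least two of them appear on consecutive lines. RE_HEADER, find_quotation_headers and the list of header names are assumptions made for illustration and are not talon's patterns.

```python
import re

# Header names assumed for this sketch; talon's real patterns differ.
RE_HEADER = re.compile(r"^\s*\*?(From|Date|Sent|To|Subject):\*?\s+\S", re.I)


def find_quotation_headers(lines, min_headers=2):
    """Return the index of the first line of a run of at least
    `min_headers` consecutive header-like lines, or None."""
    run_start = None
    for i, line in enumerate(lines):
        if RE_HEADER.match(line):
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_headers:
                return run_start
        else:
            run_start = None  # the run was interrupted by ordinary text
    return None


invitation = ["See details below:", "Date: Monday, 8/25/2014", "Time: 9:15 AM"]
quoted = ["My reply", "", "Date: Thu, 18 Jun 2015 07:45:36 +0000",
          "From: someone@example.com", "> Hi"]
```

A lone "Date:" line inside an invitation is then left alone, while a block that starts with "Subject:" or "Date:" and continues with "From:" is still found.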
Hi guys, I just got the link to this repo from your support. It's awesome that this is open source, and I would like to help improve the German formats.
I just need some help to get started. For example, I would like to support this German Windows Outlook split pattern:
-----Ursprüngliche Nachricht-----
As you can see, this includes a special character ü which is represented as =C3=BC (in Base64, I guess). So what do I have to do, something like this in quotations.py?
SPLITTER_PATTERNS = [
# ------Original Message------ or ---- Reply Message ----
re.compile("[\s]*[-]+[ ]*(Original|Reply) Message[ ]*[-]+", re.I),
re.compile("[\s]*[-]+[ ]*(Urspr=C3=BCngliche|Antwort) Nachricht[ ]*[-]+", re.I),
...
Is this correct? I'll try to add more cases which do not work at the moment (I tested a few, which led me to your support :) and then open a pull request. Also, how can I run the tests, and should I create tests for each new case? Thanks for your advice.
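For reference, =C3=BC is the quoted-printable form of the UTF-8 bytes for ü, not Base64. A sketch, assuming the body is decoded before it reaches the patterns: decode with the standard quopri module, then write the pattern with the real character. decode_quoted_printable and is_splitter are hypothetical helpers, and this SPLITTER_PATTERNS list is a two-entry stand-in, not talon's full list.

```python
import quopri
import re

# Stand-in list for illustration; patterns are applied to decoded text.
SPLITTER_PATTERNS = [
    # ------Original Message------ or ---- Reply Message ----
    re.compile(r"[\s]*[-]+[ ]*(Original|Reply) Message[ ]*[-]+", re.I),
    # -----Ursprüngliche Nachricht-----
    re.compile(r"[\s]*[-]+[ ]*(Ursprüngliche|Antwort) Nachricht[ ]*[-]+", re.I),
]


def decode_quoted_printable(raw, charset="utf-8"):
    """Decode quoted-printable bytes into text."""
    return quopri.decodestring(raw).decode(charset)


def is_splitter(line):
    """True if any pattern matches at the start of the line."""
    return any(p.match(line) for p in SPLITTER_PATTERNS)


raw_line = b"-----Urspr=C3=BCngliche Nachricht-----"
line = decode_quoted_printable(raw_line)
```

Matching the encoded form directly would tie the pattern to one transfer encoding, which is why the sketch decodes first.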
To reproduce, add the following line to the top of tests/fixtures/html_replies/hotmail.html:
<?xml version="1.0" encoding="UTF-8"?>
and run nosetests tests/html_quotations_test.py:test_hotmail_reply. lxml cannot parse HTML documents from a unicode string when the encoding is declared in the document.
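lxml rejects unicode string input that carries an encoding declaration and asks for bytes input or a document without the declaration. One way around it, sketched here with the standard library only, is to drop a leading XML declaration before parsing; passing html.encode("utf-8") instead is the other option. RE_XML_DECLARATION and strip_xml_declaration are hypothetical names, not talon functions.

```python
import re

# Matches a single XML declaration at the very start of the document.
RE_XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>", re.I)


def strip_xml_declaration(html):
    """Remove a leading <?xml ... ?> declaration, if present."""
    return RE_XML_DECLARATION.sub("", html, count=1)


document = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body>Reply</body></html>'
cleaned = strip_xml_declaration(document)
```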
A few questions:
Any idea why scikit was not used rather than PyML? It would naturally ease deployment.
How does signature extraction work across a chain of multiple replies, so that all the signatures are extracted?
sh: 0: getcwd() failed: No such file or directory
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0tar: PyML-0.7.9: Cannot mkdir: No such file or directory
tar: PyML-0.7.9: Cannot mkdir: No such file or directory
tar: PyML-0.7.9/data: Cannot mkdir: No such file or directory
Sometimes emails have disclaimers. Disclaimer lines are generally longer than 60 characters, so the signature won't be extracted.
Can we possibly run talon in C?
Would it be possible to publish the code used for the talon demo app? http://talon.mailgun.net/
It looks like it's hosted on Heroku, which perfectly suits my needs (I want to access it as an external service rather than integrate the library into my app).
The first test email I pasted into http://talon.mailgun.net/ had a >60 character signature and it failed to find it. 60 characters works fine. Here's a simplified length checker test case suitable for inclusion into bruteforce_test.py:
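from nose.tools import eq_
from talon.signature import bruteforce

def test_signature_lengths():
    # Try every signature length from 0 to 79 characters.
    for n in range(80):
        content = "CONTENT"
        sig = "\n-- \n" + "."*n
        msg_body = content + sig
        eq_((n, (content.strip(), sig.strip())),
            (n, bruteforce.extract_signature(msg_body)))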
I installed talon using pip on Python 2.7. I'm trying to run the example, but I get an error as soon as talon.init() is called. I can't figure out which parser it is colliding with. Any ideas what is wrong? I didn't receive any error installing talon with pip.
Here's the code, error, and the other files I see on my system called parser.py.
Parser.py files:
vagrant@vagrant-ubuntu-trusty-64:/$ sudo find . -name "parser.py"
./usr/lib/python3.4/html/parser.py
./usr/lib/python3.4/email/parser.py
./usr/lib/python2.7/email/parser.py
./usr/lib/python2.7/dist-packages/jinja2/parser.py
./usr/lib/python2.7/dist-packages/yaml/parser.py
./usr/lib/python2.7/dist-packages/dateutil/parser.py
./usr/lib/python3/dist-packages/ufw/parser.py
./usr/local/lib/python2.7/dist-packages/cssselect/parser.py
Code:
import talon
from talon import quotations
talon.init()
text = """Reply
-----Original Message-----
Quote"""
reply = quotations.extract_from(text, 'text/plain')
reply = quotations.extract_from_plain(text)
print(reply)
Error
Traceback (most recent call last):
File "test.py", line 4, in <module>
talon.init()
File "/usr/local/lib/python2.7/dist-packages/talon/__init__.py", line 7, in init
signature.initialize()
File "/usr/local/lib/python2.7/dist-packages/talon/signature/__init__.py", line 38, in initialize
EXTRACTOR_DATA)
File "/usr/local/lib/python2.7/dist-packages/talon/signature/learning/classifier.py", line 31, in load
return joblib.load(saved_classifier_filename)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 425, in load
obj = unpickler.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 291, in load_build
array = nd_array_wrapper.read(self)
File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 113, in read
mmap_mode=unpickler.mmap_mode)
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 394, in load
return format.read_array(fid)
File "/usr/lib/python2.7/dist-packages/numpy/lib/format.py", line 437, in read_array
shape, fortran_order, dtype = read_array_header_1_0(fp)
File "/usr/lib/python2.7/dist-packages/numpy/lib/format.py", line 334, in read_array_header_1_0
d = safe_eval(header)
File "/usr/lib/python2.7/dist-packages/numpy/lib/utils.py", line 1128, in safe_eval
ast = compiler.parse(source, mode="eval")
File "/usr/lib/python2.7/compiler/transformer.py", line 53, in parse
return Transformer().parseexpr(buf)
File "/usr/lib/python2.7/compiler/transformer.py", line 132, in parseexpr
return self.transform(parser.expr(text))
AttributeError: 'module' object has no attribute 'expr'
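An error like this usually means that the name parser resolved to a different file than expected. A small diagnostic sketch for Python 3 that reports where a module name resolves without importing it; locate_module is a hypothetical helper, and since the traceback above comes from Python 2.7 this illustrates the idea and is not a drop-in diagnostic for that setup.

```python
import importlib.util


def locate_module(name):
    """Return the origin (usually a file path) a top-level module name
    resolves to, or None if the name cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Compare the reported location with the one you expect.
json_location = locate_module("json")
missing = locate_module("surely_missing_module_xyz")
```

If the reported path points into the project directory or an unrelated package, that file is shadowing the module the traceback expected.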
I sent a plain-text message from Gmail and I am unable to extract the signature from it.
How can I do it? extract_signature doesn't work for plain messages.
In [34]: msg.text
Out[34]: u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik <[email protected]> napisa\u0142:\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>'
In [35]: quotations.extract_from(msg.text, 'text/plain')
Out[35]: u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik <[email protected]> napisa\u0142:'
In [36]: extract_signature(msg.text)
Out[36]:
(u'Och, zapomnia\u0142em o jeszcze jednej decyzji.\r\n\r\nW dniu 14 marca 2015 19:58 u\u017cytkownik <[email protected]> napisa\u0142:\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>',
None)
I want to receive a result like:
In [40]: msg.text.replace(quotations.extract_from(msg.text, 'text/plain'),'')
Out[40]: u'\r\n> Szanowna Pani/Pan,\r\n>\r\n> Przesy\u0142amy w za\u0142\u0105czeniu \u017c\u0105dane informacje.\r\n>\r\n> Z wyrazami szacunku,\r\n> Anna Kmie\u0107\r\n>\r\n> W dniu 14 marca 2015 19:57 u\u017cytkownik <[email protected]> napisa\u0142:\r\n>\r\n>> Szanowni Pa\u0144stwo,\r\n>>\r\n>> Dzia\u0142aj\u0105c na podstawie art. 61 Konstytucji RP wnosz\u0119 o przes\u0142anie:\r\n>> - kopie wszystkich decyzji w sprawie KM.1431.2.215,\r\n>>\r\n>> Odpowiedz przes\u0142a\u0107 pod adres [email protected].\r\n>>\r\n>> Z powa\u017caniem,\r\n>\r\n>
Forwarded email content sent from Gmail is pruned by Talon. Normally, when you forward a message, it needs to be preserved.
A potential improvement could detect the ------Forwarded message------ text inside the first child node of the <div class="gmail_quote"> element and output the original message if the gmail_quote div is not deeply nested.
Thanks,
Sokratis
Hey ho!
Thanks for creating this awesome library!
I was wondering if you have any thoughts or plans on supporting json for linking data. It is used to power gmail actions (see https://developers.google.com/gmail/actions/actions/actions-overview) and it generally seems like a great thing to have anyway.
I do not mind spending the time doing the development work to get this in but was wondering if you would accept a PR if I do this. Also, is there any specific way it should be done?
Dashes could be mistaken for signature separators, e.g:
Hi,
some item list:
- item 1
- item 2
item 2 continued
- item3
some text
Because "item 3" is separated from the rest of the list (by "item 2 continued" line) it's mistaken for signature.
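One way to avoid this kind of false positive, sketched under the assumption that a real delimiter is a line made only of dashes: require two or more dashes with nothing else on the line, so that "- item3" is never a candidate. RE_DASH_DELIMITER, is_signature_delimiter and find_delimiter are hypothetical names and do not reflect how talon's bruteforce extractor is written.

```python
import re

# Assumption for this sketch: a delimiter is a line consisting only of two or
# more dashes, optionally surrounded by whitespace ("--", "-- ", "-----").
RE_DASH_DELIMITER = re.compile(r"^\s*-{2,}\s*$")


def is_signature_delimiter(line):
    """True if the line is a dash-only delimiter."""
    return RE_DASH_DELIMITER.match(line) is not None


def find_delimiter(lines):
    """Index of the last delimiter line, or None if there is none."""
    for i in range(len(lines) - 1, -1, -1):
        if is_signature_delimiter(lines[i]):
            return i
    return None


message = ["Hi,", "some item list:", "- item 1", "- item 2",
           "item 2 continued", "- item3", "some text"]
```

This trades recall for precision: signatures introduced by a single dash would no longer be detected by the delimiter alone.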
Reply
2016-09-12 10:55 GMT 02:00 Lara Croft <
[email protected]>:
Original message
Crossposting with StackOverflow (http://stackoverflow.com/questions/25639506/mailgun-talon-signature-extraction-example-throwing-error)
I installed mailgun/talon on GCE and was trying out the example in the README section, but it threw the following error at me:
from talon import signature
message = """Thanks Sasha, I can't go any higher and is why I limited it to the
... homepage.
...
... John Doe
... via mobile"""
message
"Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage.\n\nJohn Doe\nvia mobile"
text,signtr = signature.extract(message, sender='[email protected]')
ERROR:talon.signature.extraction:ERROR when extracting signature with classifiers
Traceback (most recent call last):
File "talon/signature/extraction.py", line 57, in extract
markers = _mark_lines(lines, sender)
File "talon/signature/extraction.py", line 99, in _mark_lines
elif is_signature_line(line, sender, EXTRACTOR):
File "talon/signature/extraction.py", line 40, in is_signature_line
return classifier.decisionFunc(data, 0) > 0
AttributeError: 'NoneType' object has no attribute 'decisionFunc'
Do I need to train the model somehow (this signature seems to be the ML example)? I installed it using pip.
I'm looking at https://github.com/mailgun/talon/blob/master/tests/signature/learning/helpers_test.py and don't understand why, in
'Sergey N. Obukhov <[email protected]>': ['Sergey', 'Obukhov'],
the expected result doesn't include 'serobnic'.
Many messages have disclaimers at the bottom. Signature detection doesn't work unless you strip disclaimers first.
The algorithm would be more accurate if the classification were more detailed. For example, if it classified the message greeting, there could be a sanity check saying that a signature can't come right after a greeting.
Similarly if a closing phrase (e.g. "Kind Regards,", "Thanks,", etc) is detected in signature candidate lines we could say for sure that it should be the first signature line.
Another example - disclaimers. Some messages have them and that totally breaks signature detection logic.
If you click "reply" in gmail, and then change the subject line and undo the quotation in the WYSIWYG editor, you get something that looks a lot like a fresh email that you just typed yourself. If you send the email to another gmail user, it will render for them the way you would expect if you composed the email yourself from scratch.
But Gmail wraps the content in a <div class="gmail_quote">, which gets stripped by quotations.extract_from_html. Ideally, talon would know when this was the case and not treat the div as a quote. This seems hard, though. It's not clear to me how to identify this case.
Hi Stephen,
Yep Tuesday @3pm suits. I'll chat to you then.
Kind Regards,
John Doe
Consider the following example:
Hi Xxx,
Good day to you. I have already scheduled an appointment with Dr.
Xxx Xxx for you to pick up your xxx. Please see
appointment details below:
Date: Monday, 8/25/2014
Time: 9:15 AM
If you have any questions, please let me know and I will be happy to help.
Have a great weekend!
Best,
Xxx X.
Perssist Assistant
www.example.com
The "Date: ..." part and the rest of the email are stripped off because they are mistaken for a quotation.
In talon/talon/quotations.py the following splitter pattern is checked:
SPLITTER_PATTERNS = [
...
re.compile('(_+\r?\n)?[\s]*(:?[*]?From|Date):[*]? .*'),
...
]
The pattern should be adjusted to check for both "From: ..." and "Date: ...". A possible implementation could be similar to the RE_ON_DATE_SMB_WROTE pattern; also see mark_message_lines() for how multi-line patterns are handled.
Consider email html like this:
<div>
<p>Reply</p>
<p>
<b>From:</b>[email protected]
...
</p>
<p>Original email</p>
</div>
talon detects the From: ... block, goes up until it finds a "wrapping div" and cuts off the whole message!
Usually "Date:" and "From:" indicate quotations but sometimes looking for them leads to false positives (e.g. when invitation details are provided with "Date:") or false negatives (when "Subject:" goes first).
If I read the code correctly, it is intended that
reply = quotations.extract_from_plain(body)
does not extract forwarded messages?
So something like:
test
----- Forwarded Message -----
From: "test" [email protected]
Sent: Thursday, January 28, 2016 4:31:32 PM
Subject: testtestmail
will be ignored and returned as is, as part of the reply?
Would it be possible to extract forwarded messages just like replies?
That is, either strip the forwarded part with quotations.extract_from_...(body), or provide a dedicated function to run against the body when reply equals body?
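A sketch of what such a dedicated function could look like for plain text, using only the standard library. RE_FORWARDED and split_forwarded are hypothetical names, the marker format is taken from the example above, and test@example.com is a stand-in address; none of this is an existing talon function.

```python
import re

# Marker line such as "----- Forwarded Message -----" on a line of its own.
RE_FORWARDED = re.compile(
    r"^[ \t]*-+[ \t]*Forwarded Message[ \t]*-+[ \t]*$",
    re.I | re.M,
)


def split_forwarded(body):
    """Split `body` into (own_text, forwarded_text).

    forwarded_text is None when no forwarded-message marker is found.
    """
    match = RE_FORWARDED.search(body)
    if match is None:
        return body, None
    own_text = body[:match.start()].rstrip()
    forwarded_text = body[match.end():].lstrip("\r\n")
    return own_text, forwarded_text


body = (
    "test\n"
    "\n"
    "----- Forwarded Message -----\n"
    'From: "test" test@example.com\n'
    "Sent: Thursday, January 28, 2016 4:31:32 PM\n"
    "Subject: testtestmail\n"
)
own, forwarded = split_forwarded(body)
```

Returning both parts leaves it to the caller to decide whether a forwarded block should be kept or stripped.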
Hello!
I tried installing talon in Python 3 today and ran into the following issue installing regex:
Downloading/unpacking regex==0.1.20110315
Downloading regex-0.1.20110315.tar.gz (948kB): 948kB downloaded
Running setup.py (path: .../regex/setup.py) egg_info for package regex
No unicodedata_db.h could be prepared for Python 3.4.2 (default, Oct 19 2014, 17:55:38)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]
Complete output from command python setup.py egg_info:
No unicodedata_db.h could be prepared for Python 3.4.2 (default, Oct 19 2014, 17:55:38)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)]
----------------------------------------
Cleaning up...
I don't have this problem installing regex==2014.12.24. Could you please consider updating the dependency version?
I would be happy to bump the version, ensure tests pass, and open a PR. Thanks!
In a fresh environment:
~ > pip install talon
Downloading/unpacking talon
Downloading talon-1.0.tar.gz
Running setup.py (path:/private/tmp/pip_build_root/talon/setup.py) egg_info for package talon
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/private/tmp/pip_build_root/talon/setup.py", line 13, in <module>
long_description=open("README.rst").read(),
IOError: [Errno 2] No such file or directory: 'README.rst'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/private/tmp/pip_build_root/talon/setup.py", line 13, in <module>
long_description=open("README.rst").read(),
IOError: [Errno 2] No such file or directory: 'README.rst'
Hi all,
After a request to support, I got this link to submit a new Dutch splitter format. I hope this improves this great library!
"Op 14 jan. 2015, om 13:52 heeft ... het volgende geschreven:"
Op <:date>. om heeft het volgende geschreven:
How should I proceed?