govcert-lu / eml_parser

python eml parser module

Home Page: http://eml-parser.readthedocs.io/

License: GNU Affero General Public License v3.0

eml_parser's Introduction

eml_parser serves as a python module for parsing eml files and returning various information found in the e-mail as well as computed information.

Extracted and generated information includes, but is not limited to:

- attachments
  - hashes
  - names
- from, to, cc
- received servers path
- subject
- list of URLs parsed from the text content of the mail (including HTML body/attachments)

Please feel free to send me your comments / pull requests.

For the changelog, please see CHANGELOG.md.

Installation:

pip install eml_parser[filemagic]

⚠️ Note: If you don't want to / cannot use file-magic (e.g. if you are using python-magic), install via:

pip install eml_parser

Known Issues

OSX users

Make sure to install libmagic, else eml_parser will not work.

Python <=3.7.4 "rare header field parsing issue"

It has been reported (in #60) that there are parsing issues in some particular cases which seem to be caused by a bug in the email module of the Python standard library. At least versions <=3.7.4 are affected.

Python versions >=3.7.11 are not affected. If you do get KeyError exceptions on header field parsing, you should consider upgrading to a more recent version of Python.

-> Please open an issue if the error persists after upgrading.

Example usage:

import datetime
import json
import eml_parser


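def json_serial(obj):
    """Serialise datetime objects, which json.dumps cannot handle natively."""
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError(f'Type {type(obj).__name__} is not JSON serializable')


with open('sample.eml', 'rb') as fhdl:
    raw_email = fhdl.read()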

ep = eml_parser.EmlParser()
parsed_eml = ep.decode_email_bytes(raw_email)

print(json.dumps(parsed_eml, default=json_serial))

For a minimal EML file, this gives something like:

  {
    "body": [
      {
        "content_header": {
          "content-language": [
            "en-US"
          ]
        },
        "hash": "6c9f343bdb040e764843325fc5673b0f43a021bac9064075d285190d6509222d"
      }
    ],
    "header": {
      "received_src": null,
      "from": "[email protected]",
      "to": [
        "[email protected]"
      ],
      "subject": "Sample EML",
      "received_foremail": [
        "[email protected]"
      ],
      "date": "2013-04-26T11:15:47+00:00",
      "header": {
        "content-language": [
          "en-US"
        ],
        "received": [
          "from localhost\tby mta.example.com (Postfix) with ESMTPS id 6388F684168\tfor <[email protected]>; Fri, 26 Apr 2013 13:15:55 +0200"
        ],
        "to": [
          "[email protected]"
        ],
        "subject": [
          "Sample EML"
        ],
        "date": [
          "Fri, 26 Apr 2013 11:15:47 +0000"
        ],
        "message-id": [
          "<[email protected]>"
        ],
        "from": [
          "John Doe <[email protected]>"
        ]
      },
      "received_domain": [
        "mta.example.com"
      ],
      "received": [
        {
          "with": "esmtps id 6388f684168",
          "for": [
            "[email protected]"
          ],
          "by": [
            "mta.example.com"
          ],
          "date": "2013-04-26T13:15:55+02:00",
          "src": "from localhost by mta.example.com (postfix) with esmtps id 6388f684168 for <[email protected]>; fri, 26 apr 2013 13:15:55 +0200"
        }
      ]
    }
  }

eml_parser's Issues

'NoneType' object has no attribute 'group'

Hello,

I'm getting this error on some EML files with the newest version of the library.

With the old version 1.11.7, I don't have this problem.

Python 3.8.7 / eml_parser 1.11.7

python3
Python 3.8.7 (default, Dec 30 2020, 10:13:09)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import eml_parser
>>> parsed_eml = eml_parser.eml_parser.decode_email("1.eml", include_raw_body=True, include_attachment_data=False, parse_attachments=True)
>>> parsed_eml
{'body': [{'uri': ['....

Python 3.8.7 / eml_parser 1.14.4

python3
Python 3.8.7 (default, Dec 30 2020, 10:13:09)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import eml_parser
>>> parser = eml_parser.EmlParser(include_raw_body=True, include_attachment_data=False, parse_attachments=True)
>>> parsed_eml = parser.decode_email("1.eml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/eml_parser/eml_parser.py", line 979, in decode_email
    return decode_email_b(eml_file=raw_email,
  File "/usr/local/lib/python3.8/site-packages/eml_parser/eml_parser.py", line 1040, in decode_email_b
    return ep.decode_email_bytes(eml_file)
  File "/usr/local/lib/python3.8/site-packages/eml_parser/eml_parser.py", line 192, in decode_email_bytes
    return self.parse_email()
  File "/usr/local/lib/python3.8/site-packages/eml_parser/eml_parser.py", line 315, in parse_email
    parsed_routing = eml_parser.routing.parserouting(received_line_flat)
  File "/usr/local/lib/python3.8/site-packages/eml_parser/routing.py", line 164, in parserouting
    out[item.strip()] = cleanline(reparseg.group(item.strip()))  # type: ignore
AttributeError: 'NoneType' object has no attribute 'group'

I sent an EML with this problem to George's email address.

Regards
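
The traceback above ends in a call to .group() on a failed regex match. As a minimal sketch of the general guard (the pattern BY_CLAUSE and the helper extract_by are hypothetical and much simpler than whatever routing.py really uses), a failed match can be checked before it is dereferenced:

```python
import re

# Hypothetical pattern for the "by" clause of a Received header;
# the expressions in the library's routing.py are more involved.
BY_CLAUSE = re.compile(r'\bby\s+(?P<by>\S+)')


def extract_by(received_line):
    """Return the host after 'by', or None when the clause is absent."""
    match = BY_CLAUSE.search(received_line)
    if match is None:
        # re.search() returns None on failure; calling .group() on it
        # raises the AttributeError shown in the report.
        return None
    return match.group('by')


print(extract_by('from localhost by mta.example.com (Postfix) with ESMTPS'))
print(extract_by('a line without that clause'))
```

Returning None for an unparsable clause is only one possible design; a caller might instead prefer to skip the whole Received line.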

Example?

Hi. Thanks for this module!

By the way, could you show a simple example of the workflow?
Personally, I need to save all attachments from a bunch of .eml files.

thank you in advance!
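
A possible sketch for this request, working on the dictionary the parser returns. It assumes, based on other reports in this listing, that attachments appear under the "attachment" key and that each entry has a "filename" and a base64-encoded "raw" field; the helper name save_attachments and the fallback file name are made up for illustration.

```python
import base64
import os


def save_attachments(parsed_eml, out_dir):
    """Write every attachment that carries data into out_dir; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, attachment in enumerate(parsed_eml.get('attachment', [])):
        raw = attachment.get('raw')
        if raw is None:
            # No data present, e.g. the parser was not asked to include it.
            continue
        # basename() drops any directory components from untrusted names.
        name = os.path.basename(attachment.get('filename') or '')
        name = name or f'part-{index:03d}'
        path = os.path.join(out_dir, name)
        with open(path, 'wb') as fhdl:
            fhdl.write(base64.b64decode(raw))
        written.append(path)
    return written
```

The dictionary would typically come from something like eml_parser.EmlParser(include_attachment_data=True).decode_email_bytes(raw_email), called once per .eml file.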

Attachment content not being supplied (or None)

I've been trying to parse EMLs (which this library is amazing for), but I have issues getting the attachment content. It seems it's always None (null in the JSON):

    {
        "content_header": {
            "content-disposition": [
                "attachment; filename=\"rekha resume3.doc\""
            ],
            "content-transfer-encoding": [
                "base64"
            ],
            "content-type": [
                "application/msword; name=\"rekha resume3.doc\""
            ],
            "x-attachment-id": [
                "f_inzzd1g90"
            ]
        },
        "extension": "doc",
        "filename": "rekha resume3.doc",
        "hash": {
            "md5": "019ba196161169e6028d2c4761663c49",
            "sha1": "eb59d3fba44bde4585e885d0923a0727d18a0ab4",
            "sha256": "4510a62b2e582167ebbabe67c79e9ab54040c68f4c67b6434a60ca78fe8d502a",
            "sha512": "d4c35801db7940fe858f52d8bba92bb12ad07d23630e446cbae789414f70df9f759c642961cca196f7b979b61d736c72c9355f4112bd6cc1b9bd579ab2afa76f"
        },
        "raw": null,
        "size": 80642
    }
]

I initially parsed the content as: msg = eml_parser.eml_parser.decode_email_b(fdata, include_raw_body=True, include_attachment_data=True) per the instructions on the readthedocs pages. I can't seem to get back the actual content of the attachment (it does give the hashes, so it does have the file data).

Error when importing lib: symbol not found

Hello,

I don't think this is an issue on your side, but you might be able to advise/make the code more generic on different platforms.
When trying to execute one simple example:

import datetime
import json
import eml_parser

def json_serial(obj):
    if isinstance(obj, datetime.datetime):
        serial = obj.isoformat()
        return serial


with open('sample-message.eml', 'rb') as fhdl:
    raw_email = fhdl.read()

parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)

print(json.dumps(parsed_eml, default=json_serial))

it fails saying
AttributeError: dlsym(RTLD_DEFAULT, magic_open): symbol not found

I am using OSX, python3.6

Could you please help on this?

Thanks in advance.

Get empty To, From, Subject and Body of the email through eml_parser

I get empty To, From, Subject and Body of an email when parsed through eml_parser, but the eml has all these when opened in a mail client.
8E662D2D05B0.zip

Message ID parsing fails if no new line exists after value

I received a mail where the header looked like below:

...
Date: Fri, 19 Feb 2021 19:36:50 +0000
Message-ID:  
 <[email protected]>MIME-Version: 1.0
Accept-Language: en-US, en-GB
...

After parsing it:

ep = eml_parser.EmlParser(include_raw_body=True)
parsed_eml = ep.decode_email_bytes(raw_email)

The value of parsed_eml["header"]["header"]["message-id"][0] looked like this:
<[email protected]>MIME-Version: 1.0
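
As a caller-side workaround sketch (not the library's behaviour), the bracketed identifier can be cut out of such a fused value with a small regex. The pattern MSG_ID and the helper first_message_id are hypothetical, and the pattern is deliberately looser than the RFC 5322 msg-id grammar.

```python
import re

# Loose approximation of a msg-id: anything between angle brackets
# that contains no whitespace and no further brackets.
MSG_ID = re.compile(r'<[^<>\s]+>')


def first_message_id(value):
    """Return the first '<...>' token in a header value, or None."""
    match = MSG_ID.search(value)
    return match.group(0) if match else None


print(first_message_id('<abc@example.com>MIME-Version: 1.0'))
```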

How to parse a bad/different format?

Hello,

This is another question about the parser.
I am trying to parse the famous Enron dataset EML files:
https://archive.org/download/edrm.enron.email.data.set.v2.xml

(full folder: https://archive.org/download/edrm.enron.email.data.set.v2.xml/edrm-enron-v2_harris-s_xml.zip )

Unfortunately, there seem to be messages that are "not fully/correctly parsed".
For instance, the attached file contains all the needed data (Subject, From address, etc.) but these do not appear in the parsing results.
3.287079.LTUWB1UEUURY0AMLCGNSEUNK52PXR2CPB.eml.zip

Again, this is just a question to check if this is due to a wrong format of the EML file, or to something I am using incorrectly in the parser.

In advance, thank you for your answer.

Best regards.

coercing to Unicode: need string or buffer, NoneType found

Have you seen this error before?

$ python eml_parser.py 
Traceback (most recent call last):
  File "eml_parser.py", line 515, in <module>
    main()
  File "eml_parser.py", line 507, in main
    m = decode_email(msgfile)
  File "eml_parser.py", line 318, in decode_email
    fp = open(eml_file)
TypeError: coercing to Unicode: need string or buffer, NoneType found

Getting following error on windows python 3.6.4 while importing

import eml_parser
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\eml_parser\__init__.py", line 8, in <module>
    from . import eml_parser
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\eml_parser\eml_parser.py", line 63, in <module>
    import magic
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\magic.py", line 23, in <module>
    _libraries['magic'] = _init()
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\magic.py", line 20, in _init
    return ctypes.cdll.LoadLibrary(find_library('magic'))
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\ctypes\__init__.py", line 426, in LoadLibrary
    return self._dlltype(name)
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None
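
The last frames show find_library('magic') returning None, which LoadLibrary() then rejects. A small diagnostic sketch using only the standard library (the helper name libmagic_path is made up here) shows whether ctypes can locate libmagic at all on a given machine:

```python
from ctypes.util import find_library


def libmagic_path():
    """Return the name or path ctypes resolves for libmagic, or None if absent."""
    return find_library('magic')


path = libmagic_path()
if path is None:
    print('libmagic was not found; ctypes has nothing to load')
else:
    print(f'libmagic found: {path}')
```

If this prints the "not found" message, the problem most likely lies with the libmagic installation rather than with eml_parser itself.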

Text attachments are not retrieved

Hi,

I'm trying to parse eml files with text attachments, example :

--===============0219148833355454106==
Content-Type: text/plain; Name="text.txt"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="text.txt"

TG9yZW0gaXBzdW0gZG9sb3Igc2l0IGFtZXQsIGNvbnNlY3RldHVyIGFkaXBpc2NpbmcgZWxpdCwg
c2VkIGRvIGVpdXNtb2QgdGVtcG9yIGluY2lkaWR1bnQgdXQgbGFib3JlIGV0IGRvbG9yZSBtYWdu
YSBhbGlxdWEuIFV0IGVuaW0gYWQgbWluaW0gdmVuaWFtLCBxdWlzIG5vc3RydWQgZXhlcmNpdGF0
aW9uIHVsbGFtY28gbGFib3JpcyBuaXNpIHV0IGFsaXF1aXAgZXggZWEgY29tbW9kbyBjb25zZXF1
YXQuIER1aXMgYXV0ZSBpcnVyZSBkb2xvciBpbiByZXByZWhlbmRlcml0IGluIHZvbHVwdGF0ZSB2
ZWxpdCBlc3NlIGNpbGx1bSBkb2xvcmUgZXUgZnVnaWF0IG51bGxhIHBhcmlhdHVyLiBFeGNlcHRl
dXIgc2ludCBvY2NhZWNhdCBjdXBpZGF0YXQgbm9uIHByb2lkZW50LCBzdW50IGluIGN1bHBhIHF1
aSBvZmZpY2lhIGRlc2VydW50IG1vbGxpdCBhbmltIGlkIGVzdCBsYWJvcnVtLg==

The problem seems to occur when parsing the message, at line 850 (eml_parser.py):

if ('content-disposition' in lower_keys and msg.get_content_disposition() != 'inline') \
        or msg.get_content_maintype() != 'text':

It always enters via the second condition (except when the attachment is text).

Thanks.
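
To illustrate the point with the standard library only, here is a sketch that picks out parts by their Content-Disposition instead of their main type, so a text/plain attachment is still found. The sample message and the helper attachments_by_disposition are invented for this example and are not the library's logic.

```python
from email import message_from_bytes, policy

# Hand-written sample: a plain text body plus a base64-encoded
# text/plain part that is explicitly marked as an attachment.
RAW = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/mixed; boundary="BOUND"\r\n'
    b'\r\n'
    b'--BOUND\r\n'
    b'Content-Type: text/plain\r\n'
    b'\r\n'
    b'body text\r\n'
    b'--BOUND\r\n'
    b'Content-Type: text/plain; name="text.txt"\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'Content-Disposition: attachment; filename="text.txt"\r\n'
    b'\r\n'
    b'TG9yZW0gaXBzdW0=\r\n'
    b'--BOUND--\r\n'
)


def attachments_by_disposition(msg):
    """Return every part whose Content-Disposition is 'attachment'."""
    return [part for part in msg.walk()
            if part.get_content_disposition() == 'attachment']


msg = message_from_bytes(RAW, policy=policy.default)
found = attachments_by_disposition(msg)
for part in found:
    print(part.get_filename(), part.get_payload(decode=True))
```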

"ImportError: cannot import name 'version'" with eml_parser v1.6

A simple import eml_parser fails with version 1.6:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-dbfbe77cbceb> in <module>()
----> 1 import eml_parser

~/.virtualenvs/XXXXX/lib/python3.5/site-packages/eml_parser/__init__.py in <module>()
      7 
      8 from . import eml_parser
----> 9 from . import version
     10 
     11 __version__ = '1.6'

ImportError: cannot import name 'version'

v1.5 works (it does not contain the problematic import statement).

Probably a missing file. Since the code for v1.6 is not on github, I can't provide a fix.

Distinguishing the body of attached emails from the main email body

Hello,

First of all I would like to say that I have been using eml_parser to extract data from emails for a while now and so far I am very happy with it!

One thing that I am struggling with is to distinguish the body of attached emails from a certain email. I have looked through the json multiple times but cannot figure out a good way.

To parse attachments and the full body I am using:
ep = eml_parser.EmlParser(include_raw_body=True, include_attachment_data=True, parse_attachments=True)
parsed_eml = ep.decode_email_bytes(raw_email)

This works great for emails that do not have attachments, or emails that have attachments but no attachments that are eml files themselves. If an email is attached to an email, I can correctly read that it is there as it is listed in
parsed_eml["attachment"]

But the body of this attached email is appended to the body of the email. So
parsed_eml["body"][0]["content"]
would give me the body of the main email in text format. If the main email also has html, I can retrieve the content using
parsed_eml["body"][1]["content"]
And the attached email's body in text format can then be retrieved using
parsed_eml["body"][2]["content"]
And if the attached email also has HTML then I can retrieve it with
parsed_eml["body"][3]["content"]
...etc

This is fine, but problems arise when the main email is only in text format. Because then the first body (body[0]) is the main email, and body[1] would already give me the attached email's body instead of the html of the main email. I currently do not see how I can distinguish the bodies from each other to determine to which email they belong.

I hope I was able to describe the problem in a clear way.

Thank you for your time
Patrick
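
One way to tell the bodies apart outside of eml_parser is to walk the message with the standard library and record how deeply each text part is nested inside message/rfc822 attachments. This is only a sketch of that idea; collect_bodies and the sample messages are hypothetical and nothing here reflects how the library builds its "body" list.

```python
from email.message import EmailMessage


def collect_bodies(msg, depth=0, out=None):
    """Return (depth, content_type, text) for each non-attachment text part.

    depth is 0 for the main email and grows by one per attached email.
    """
    if out is None:
        out = []
    if msg.get_content_type() == 'message/rfc822':
        # The payload of an attached email is a list holding that email.
        for inner in msg.get_payload():
            collect_bodies(inner, depth + 1, out)
    elif msg.is_multipart():
        for part in msg.get_payload():
            collect_bodies(part, depth, out)
    elif msg.get_content_maintype() == 'text' and not msg.is_attachment():
        out.append((depth, msg.get_content_type(), msg.get_content()))
    return out


inner = EmailMessage()
inner['Subject'] = 'attached mail'
inner.set_content('inner text')

outer = EmailMessage()
outer['Subject'] = 'main mail'
outer.set_content('outer text')
outer.add_attachment(inner)  # stored as a message/rfc822 part

bodies = collect_bodies(outer)
for depth, ctype, text in bodies:
    print(depth, ctype, text.strip())
```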

wrong attachment filename?

my attachment parse result:

{'filename': 'part-000',
 'size': 430, 
'hash': 
{'md5': '65c2b2c519925d7c6df9a990f03c80ca', 
'sha1': '2746fe51cc7d21e8f701a1c86261801d44a27513', 
'sha256': '9fc8fde257977393b0...0960f45fdb7', 'sha512': 'c31...db65'}, 'raw': b'VGhpcyBpcyBhbiBhdXRvbWF0aWNhbGx5IGdlbmVyYXRlZCBtZXNzYWdlIGZyb20gU2VuZEdyaWQuDQoNCkknbSBzb3JyeSB0byBoYXZlIHRvIHRlbGwgeW91IHRoYXQgeW91ciBtZXNzYWdlIHdhcyBub3QgYWJsZSB0byBiZQ0KZGVsaXZlcmVkIHRvIG9uZSBvZiBpdHMgaW50ZW5kZWQgcmVjaXBpZW50cy4NCg0KSWYgeW91IHJlcXVpcmUgYXNzaXN0YW5jZSB3aXRoIHRoaXMsIHBsZWFzZSBjb250YWN0IFNlbmRHcmlkIHN1cHBvcnQuDQoNCm51enplbDoyMDUzNTc6PGZvb2JhckBmb29iYXIuY29tPiA6IDE5OC4zNy4xNTIuMzQgOiBteDAzLm1haWwuZ29vLm5lLmpwOlsyMTAuMTY1LjEwLjFdIDogNTUwIDUuMS4xIHNpZD1pMDFLMW4wMGwwa24xRW0wMSBBZGRyZXNzIHJlamVjdGVkIGZvb2JhckBmb29iYXIuY29tLiBbY29kZT0yOF0gIGluIFJDUFQgVE8NCg==', 
'content_header': 
{'content-type': ['text/plain'], 
'content-disposition': ['inline'], 
'content-transfer-encoding': ['7bit'],
 'content-description': ['Notification']}}

But the file name in Microsoft Outlook is 189844630.

Which file name is right?
The raw .eml is:

------------=_1395792079-24137-58419
Content-Type: message/rfc822; name="189844630"
Content-Disposition: inline; filename="189844630"
Content-Description: Undelivered Message
Content-Transfer-Encoding: 7bit

I think 189844630 is the right value. Maybe the parser got the wrong value?

Question: why some fields in header.header are returned as list?

Is there a reason why, for example, "from" or "message-id" are lists instead of just strings?
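
One plausible reason (an assumption, not confirmed by the maintainers) is that any header field may occur more than once in a message, so a list is the only shape that never loses data. A short standard-library sketch with an invented sample and a hypothetical helper headers_as_lists:

```python
from email import message_from_string

RAW = (
    'Received: from a.example.com\n'
    'Received: from b.example.com\n'
    'Subject: hi\n'
    '\n'
    'body\n'
)


def headers_as_lists(msg):
    """Group header values by lower-cased name, keeping every occurrence."""
    grouped = {}
    for name, value in msg.items():
        grouped.setdefault(name.lower(), []).append(value)
    return grouped


headers = headers_as_lists(message_from_string(RAW))
print(headers['received'])
print(headers['subject'])
```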

Bug - URL not retrieved

Hello,

Did you forget to extend this list with the result of the previous iteration?

eml_parser/eml_parser/eml_parser.py

Line 431 in a72d5c2

list_observed_urls = self.get_uri_ondata(body_slice)

Regards,
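
To make the suspected problem concrete, here is a sketch of the difference between reassigning a list on every slice and extending it. The regex and function names are hypothetical stand-ins, not the library's get_uri_ondata.

```python
import re

# Deliberately simple URL pattern, for illustration only.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def urls_overwritten(chunks):
    """Buggy variant: each iteration replaces the previous result."""
    found = []
    for chunk in chunks:
        found = URL_RE.findall(chunk)
    return found


def urls_accumulated(chunks):
    """Fixed variant: results from all slices are kept."""
    found = []
    for chunk in chunks:
        found.extend(URL_RE.findall(chunk))
    return found


chunks = ['see http://a.example/x', 'and http://b.example/y']
print(urls_overwritten(chunks))
print(urls_accumulated(chunks))
```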

AttributeError: module 'eml_parser' has no attribute 'decode'

import datetime
import json
import eml_parser

def json_serial(obj):
    if isinstance(obj, datetime.datetime):
        serial = obj.isoformat()
        return serial

with open('sample.eml', 'rb') as fhdl:
    raw_email = fhdl.read()

parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)

print(json.dumps(parsed_eml, default=json_serial))

AttributeError Traceback (most recent call last)
in ()
13 raw_email = fhdl.read()
14
---> 15 parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)
16
17 print(json.dumps(parsed_eml, default=json_serial))

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eml_parser/eml_parser.py in decode_email_b(eml_file, include_raw_body, include_attachment_data, pconf, policy)
320 msg = email.message_from_bytes(eml_file, policy=policy)
321
--> 322 return parse_email(msg, include_raw_body, include_attachment_data, pconf)
323
324

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eml_parser/eml_parser.py in parse_email(msg, include_raw_body, include_attachment_data, pconf)
426 # parse and decode subject
427 subject = msg.get('subject', '')
--> 428 headers_struc['subject'] = eml_parser.decode.decode_field(subject)
429
430 # If parsing had problem... report it...

AttributeError: module 'eml_parser' has no attribute 'decode'

Module working intermittently

Hi @sim0nx,

First of all thanks for the module, it helped a lot easing procedures. I have been testing it and I saw that the module sometimes is not working as expected. I am using the following:

m = eml_parser.decode_email(k, include_raw_body=True)

to extract the data from the eml and get the URLs. However, with the same code and the same eml, half of the time it gets the text and processes all the data correctly, and the other half it doesn't. The header always gets processed, though.

As no error is raised, I have no idea what the source of the problem is. Have you seen this before, or do you have any idea what the problem could be?

Thanks in advance :)

Hello,

encoding error on bounced messages

In a bounced message I got:

The undelivered mail returned to sender with Content-Type: text/plain; charset=us-ascii
The attached message returned has Content-Type: text/plain; charset=iso-8859-1

eml_parser returns the following error:
Traceback (most recent call last):
File "/home/joao/Projects/email-engine/eml_parser_test.py", line 18, in
parsed_eml = eml_parser.eml_parser.decode_email_b(message, include_raw_body=False, include_attachment_data=True)
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 417, in decode_email_b
return parse_email(msg, include_raw_body, include_attachment_data, pconf, parse_attachments=parse_attachments)
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 893, in parse_email
report_struc['attachment'] = traverse_multipart(msg, 0, include_attachment_data)
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 206, in traverse_multipart
attachments.update(traverse_multipart(part, counter, include_attachment_data)) # type: ignore
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 203, in traverse_multipart
prepare_multipart_part_attachment(msg, counter, include_attachment_data)) # type: ignore
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 249, in prepare_multipart_part_attachment
data = bytes(payload[0])
File "/usr/lib64/python3.7/email/message.py", line 164, in bytes
return self.as_bytes()
File "/usr/lib64/python3.7/email/message.py", line 178, in as_bytes
g.flatten(self, unixfrom=unixfrom)
File "/usr/lib64/python3.7/email/generator.py", line 116, in flatten
self._write(msg)
File "/usr/lib64/python3.7/email/generator.py", line 195, in _write
self._write_headers(msg)
File "/usr/lib64/python3.7/email/generator.py", line 418, in _write_headers
self._fp.write(self.policy.fold_binary(h, v))
File "/usr/lib64/python3.7/email/policy.py", line 200, in fold_binary
folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
File "/usr/lib64/python3.7/email/policy.py", line 214, in _fold
return self.header_factory(name, ''.join(lines)).fold(policy=self)
File "/usr/lib64/python3.7/email/headerregistry.py", line 258, in fold
return header.fold(policy=policy)
File "/usr/lib64/python3.7/email/_header_value_parser.py", line 157, in fold
return _refold_parse_tree(self, policy=policy)
File "/usr/lib64/python3.7/email/_header_value_parser.py", line 2698, in _refold_parse_tree
part.ew_combine_allowed, charset)
File "/usr/lib64/python3.7/email/_header_value_parser.py", line 2785, in _fold_as_ew
encoded_word = _ew.encode(to_encode_word, charset=encode_as)
File "/usr/lib64/python3.7/email/_encoded_words.py", line 222, in encode
bstring = string.encode('ascii', 'surrogateescape')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-6: ordinal not in range(128)

Process finished with exit code 1

An option to NOT parse attached messages would be great, or just change the encoding to open the attachment would work, I think.

I've attached the message too.

Thanks in advance.
amostra_eml.txt

OSX Mojave install error

Trying to install on OSX Mojave gives me this error:

clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

Content of email

Hi, is it possible to get the content of the email?
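
Going by another report in this listing, the body text seems to be exposed as parsed_eml["body"][i]["content"] when the parser is created with include_raw_body=True. Under that assumption, a small sketch for collecting it (the helper name body_contents is made up):

```python
def body_contents(parsed_eml):
    """Return the 'content' string of every body part that has one."""
    return [part['content']
            for part in parsed_eml.get('body', [])
            if 'content' in part]


# Stand-in for a parser result; a real one would come from something like
# eml_parser.EmlParser(include_raw_body=True).decode_email_bytes(raw_email)
sample = {'body': [{'content': 'Hello', 'hash': '...'}, {'hash': '...'}]}
print(body_contents(sample))
```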

Body always empty

Hi, the parser always shows me that the body of every mail is empty. Everything else, like parsing headers or attachments, works.

edit: email body appears to be listed as two attachments (one simple text file and one html text file)

Handle python bug 30681 for invalid date parsing

Currently the code does not handle the bug mentioned here https://bugs.python.org/issue30681

The following part of the code breaks since we are not catching TypeError and ValueError.

  753:      try:
  754:          raw_body.append((encoding, raw_body_str, msg.items()))
  755:      except AttributeError:

https://github.com/GOVCERT-LU/eml_parser/blob/master/eml_parser/eml_parser.py#L753-L755

Sample eml that can raise this bug

From: <[email protected]>
Orig-Date: Wed Jul 2020 23:11:43 +0100
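
As a sketch of the kind of handling being asked for, the standard-library date parser can be wrapped so that both exception types are caught. Depending on the Python version it signals an unparsable date with either TypeError or ValueError. The helper name safe_parse_date is hypothetical, and returning None is just one possible fallback.

```python
from email.utils import parsedate_to_datetime


def safe_parse_date(value):
    """Parse an RFC 2822 style date, returning None when it is malformed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Pythons raise TypeError here (bpo-30681), newer ones ValueError.
        return None


print(safe_parse_date('Fri, 26 Apr 2013 11:15:47 +0000'))
print(safe_parse_date('Wed Jul 2020 23:11:43 +0100'))
```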

module 'eml_parser' has no attribute 'EmlParser'

Hello,
I'm using IDLE with python 3.6.1 on Windows 10

I installed the eml_parser library, then I tried to execute the example code from your page.
I get the following error:

Traceback (most recent call last):
File "C:\Users\980\Desktop\temp\Thunderbird export\x-spam data extractor.py", line 17, in
ep = eml_parser.EmlParser()
AttributeError: module 'eml_parser' has no attribute 'EmlParser'

Can you please help me?

Thanks
Bruno
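
A quick check that sometimes helps with this kind of AttributeError is to see which file Python actually imports under that name. That an outdated install or a local script named eml_parser.py is to blame here is only a guess. The helper module_origin below is hypothetical and uses only the standard library.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints a path if the module is importable, otherwise None.
print(module_origin('eml_parser'))
```

If the printed path points into the current working directory rather than site-packages, a local file is shadowing the installed package.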

Infinite loop in regex searching for URL ending with unicode characters

I am hitting what appears to be an infinite loop when trying to parse the following eml file:
eml_sample.txt

import eml_parser

with open("eml_sample.txt", "rb") as f:
    data = f.read()

result = eml_parser.eml_parser.decode_email_b(data, include_raw_body=True)

The problem is with the following line. The email body contains a URL that is immediately followed by multiple occurrences of \xef\xbf\xbd.

xcxcxcxcxcxcxcxcxcxc
http://xxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxx.com������������������������������������������������

If I remove those, the parsing is just fine. It is also fine if I remove only some of them, but the parsing takes longer. So I assume that it is not an infinite loop, but rather that it throws off the regex and creates some kind of cycle...

Is it a problem with the regex, or the way I am reading that file (not decoding Unicode characters)?
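
Since the reporter observed that removing those bytes makes parsing fast again, a caller-side workaround sketch is to strip them before handing the data to the parser. The byte sequence \xef\xbf\xbd is the UTF-8 encoding of U+FFFD, the Unicode replacement character. The helper name is made up, and this changes the input, so it is a stopgap and not a fix for the regex itself.

```python
REPLACEMENT_CHAR = '\ufffd'.encode('utf-8')  # b'\xef\xbf\xbd'


def strip_replacement_chars(data):
    """Remove UTF-8 encoded U+FFFD sequences from raw message bytes."""
    return data.replace(REPLACEMENT_CHAR, b'')


raw = b'http://example.com' + REPLACEMENT_CHAR * 5 + b'\r\n'
print(strip_replacement_chars(raw))
```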

prevent TypeError: 'NoneType' object is not iterable #4

This issue refers to #4

recursively_extract_attachments example fails

$ PYTHONPATH=.. python3 ../examples/recursively_extract_attachments.py
You are using python-magic, though this module requires file-magic. Disabling magic usage due to incompatibilities.
Parsing: sample_body_data.eml
Traceback (most recent call last):
  File "../examples/recursively_extract_attachments.py", line 25, in <module>
    for a_id, a in m['attachments'].items():
KeyError: 'attachments'

Cannot use eml-parser

I have a problem using the eml-parser library.

My environment is as follows:

WSL on Windows (Ubuntu 18.04)

Python 3.7 and 3.6

Jupyter notebook

In Jupyter, the following error message appears:

-> module 'eml_parser' has no attribute 'EmlParser'

when I use code like this:

import datetime
import json
import eml_parser

def json_serial(obj):
    if isinstance(obj, datetime.datetime):
        serial = obj.isoformat()
        return serial

with open('YourECG.eml', 'rb') as fhdl:
    raw_email = fhdl.read()

ep = eml_parser.EmlParser()
parsed_eml = ep.decode_email_bytes(raw_email)

print(json.dumps(parsed_eml, default=json_serial))

Thank you, a lot!

Text email not recognized

I've tried to parse an email with the following particular part-message:

--_----------=_1235550737204165
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="utf-8"

# HTML CODE HERE

This part is recognized as an attachment, but it isn't one, because it's the actual text of the email.

I found a solution by editing the line (116-117) in eml_parser.py as follows:

Original

if ('content-disposition' not in msg and msg.get_content_maintype() == 'text') or (
                filename.endswith('.html') or filename.endswith('.htm')):

Modified

if (msg.is_attachment() == False and msg.get_content_maintype() == 'text') or (
                filename.endswith('.html') or filename.end

Could this solution work for you? Or does this break something I've not thought about?

should use python-magic instead of file-magic

Problem:

file-magic from requirements.txt results in AttributeError: module 'magic' has no attribute 'magic_open'.

Solution:
Use python-magic instead of file-magic
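
Both packages install a module called magic, which is easy to mix up. The following is a heuristic sketch for telling them apart at runtime. It assumes that python-magic exposes from_buffer while file-magic does not, and that file-magic exposes detect_from_content. The helper name magic_flavour is made up.

```python
def magic_flavour():
    """Guess which 'magic' binding is importable, if any."""
    try:
        import magic
    except (ImportError, AttributeError, TypeError, OSError):
        # Covers a missing package as well as a libmagic that fails to load.
        return 'unavailable'
    # Check from_buffer first: recent python-magic releases also ship a
    # compatibility layer that mimics parts of the file-magic API.
    if hasattr(magic, 'from_buffer'):
        return 'python-magic'
    if hasattr(magic, 'detect_from_content'):
        return 'file-magic'
    return 'unknown'


print(magic_flavour())
```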

Error while importing eml_parser

Hi,

I am using python3, but I am unable to import eml_parser.

error

Traceback (most recent call last):
  File "", line 1, in <module>
  File "/Users/rameshchowdeshetty/venv/lib/python3.4/site-packages/eml_parser/__init__.py", line 8, in <module>
    from . import eml_parser
  File "/Users/rameshchowdeshetty/venv/lib/python3.4/site-packages/eml_parser/eml_parser.py", line 59, in <module>
    import magic
  File "/Users/rameshchowdeshetty/venv/lib/python3.4/site-packages/magic.py", line 61, in <module>
    _open = _libraries['magic'].magic_open
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/ctypes/__init__.py", line 364, in __getattr__
    func = self.__getitem__(name)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/ctypes/__init__.py", line 369, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: dlsym(RTLD_DEFAULT, magic_open): symbol not found

Dependency error (ctypes) - TypeError: bad argument type for built-in operation

Hello. I've been trying to use this package but I can't get it working because of an error in a dependency. I'm working with Python 3.6.1 on Windows 7. Below is the full traceback. Let me know if I can provide additional info.

Traceback (most recent call last):
  File "C:\Program Files (x86)\Python\lib\site-packages\django\core\handlers\exception.py", line 35, in inner
    response = get_response(request)
  File "C:\Program Files (x86)\Python\lib\site-packages\django\core\handlers\base.py", line 128, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "C:\Program Files (x86)\Python\lib\site-packages\django\core\handlers\base.py", line 126, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "C:/Users/USER/Desktop/Antoine/GitHub/ChronoManager\Chronos\views\views.py", line 18, in index
    import eml_parser
  File "C:\Program Files (x86)\Python\lib\site-packages\eml_parser\__init__.py", line 8, in <module>
    from . import eml_parser
  File "C:\Program Files (x86)\Python\lib\site-packages\eml_parser\eml_parser.py", line 63, in <module>
    import magic
  File "C:\Program Files (x86)\Python\lib\site-packages\magic.py", line 23, in <module>
    _libraries['magic'] = _init()
  File "C:\Program Files (x86)\Python\lib\site-packages\magic.py", line 20, in _init
    return ctypes.cdll.LoadLibrary(find_library('magic'))
  File "C:\Program Files (x86)\Python\lib\ctypes\__init__.py", line 426, in LoadLibrary
    return self._dlltype(name)
  File "C:\Program Files (x86)\Python\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: bad argument type for built-in operation

Error when importing eml_parser

Hello. This is my first time making an issue, so be easy on me. I have tried importing eml_parser on 3.6.1, 3.6.2, and 3.6.3 and when I import eml_parser I get an error. I posted the error below. Any help would be greatly appreciated.

Traceback (most recent call last):
  File "getError.py", line 5, in <module>
    import eml_parser
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\eml_parser\__init__.py", line 8, in <module>
    from . import eml_parser
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\eml_parser\eml_parser.py", line 63, in <module>
    import magic
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\magic.py", line 23, in <module>
    _libraries['magic'] = _init()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\magic.py", line 20, in _init
    return ctypes.cdll.LoadLibrary(find_library('magic'))
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 426, in LoadLibrary
    return self._dlltype(name)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None

include_attachment_data flag not actually implemented

The include_attachment_data flag should be implemented like include_raw_body, but the only reference to the flag in eml_parser.eml_parser.parse_email is in line 875:

report_struc['attachment'] = traverse_multipart(msg, 0, include_attachment_data)

This doesn't stop the processing of the attachment when include_attachment_data=False, which affects processing time (when you don't want the attachment) and can throw a lot of binascii errors.

I suggest an if include_attachment_data: around the block from line 874-890 in eml_parser.eml_parser.

Thanks!
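
The suggestion above can be sketched with the standard library alone. This is not eml_parser's code: `collect_attachments` is a hypothetical helper, and it only illustrates the idea of skipping the payload decoding step when `include_attachment_data` is false.

```python
from email.message import EmailMessage


def collect_attachments(msg, include_attachment_data=False):
    """List attachment metadata; decode payloads only when explicitly requested."""
    attachments = []
    for part in msg.walk():
        filename = part.get_filename()
        if part.is_multipart() or filename is None:
            continue  # containers and unnamed body parts are not attachments here
        entry = {'filename': filename, 'content_type': part.get_content_type()}
        if include_attachment_data:
            # The potentially slow and error-prone decoding is guarded by the flag.
            entry['raw'] = part.get_payload(decode=True)
        attachments.append(entry)
    return attachments


# Small demo message with one binary attachment.
demo = EmailMessage()
demo.set_content('hello')
demo.add_attachment(b'abc', maintype='application', subtype='octet-stream', filename='a.bin')
print(collect_attachments(demo))
```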

Memory leak / Infinite loop with eml attachment

Hey there,

First, thanks for this great library, which has been helping me for over 6 months now.

I recently ran into a problem when using eml-parser to extract data from more than 20k emails. After a while, my Python process started eating all of my 32 GB of RAM. At first everything was fine, then suddenly the process running eml-parser grew by hundreds of megabytes every second.

I found the eml file responsible for the problem, but for confidentiality reasons I can't upload it. The most important detail, and I think the main cause, is that the eml file has another eml file as an attachment. I suspect the library gets confused and tries to parse the eml recursively and indefinitely.

The problem doesn't appear when eml-parser is initialised like this:

parser = eml_parser.EmlParser(include_raw_body=True, include_attachment_data=False, parse_attachments=False)

If I set parse_attachments to True, it runs forever and eats all my RAM.
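
The issue only speculates that nested eml attachments cause unbounded recursion. As a general illustration of bounding such a traversal, here is a sketch with the standard library; `walk_limited` and its `max_depth` parameter are assumptions made for this example and do not exist in eml_parser.

```python
from email.message import EmailMessage


def walk_limited(msg, max_depth=3, _depth=0):
    """Yield (depth, part) pairs, descending at most max_depth levels."""
    yield _depth, msg
    if _depth >= max_depth or not msg.is_multipart():
        return
    for child in msg.get_payload():
        yield from walk_limited(child, max_depth, _depth + 1)


# Build an email that carries another email as an attachment.
inner = EmailMessage()
inner['Subject'] = 'inner'
inner.set_content('inner body')

outer = EmailMessage()
outer['Subject'] = 'outer'
outer.set_content('outer body')
outer.add_attachment(inner)  # stored as a message/rfc822 part

for depth, part in walk_limited(outer, max_depth=1):
    print(depth, part.get_content_type())
```

With a depth cap, an attached message is still listed as a part, but its contents are not traversed beyond the configured level.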

Create new Python Bug to Header Parsing Issue

test_headeremail2list_2 mentions Python bug 27257. However, bug 27257 appears to be about empty groups in the header, not about issues with the obsolete period. With Python 3.7, I do not see any problem with the decoded value, unless eml_parser is supposed to include address groups.

eml_parser/tests/test_emlparser.py

Line 131 in f98980a

with open(os.path.join(samples_dir, 'sample_bug27257.eml'), 'rb') as fhdl:

From the bug:

To: unlisted-recipients: ;,
""@pop.kundenserver.de (no To-header on input)
The current output below appears to be the expected output.
'to': ['@pop.kundenserver.de']

From the RFC:

To: A Group:Ed Jones [email protected],[email protected],John [email protected];
Again, the current output below appears to be the expected output.
'to': ['[email protected]', '[email protected]', '[email protected]']
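
For reference, the standard library's handling of an address group can be checked directly. This sketch uses `policy.default` and placeholder `example.com` addresses, since the addresses in the RFC excerpt above are obfuscated; it is not taken from the eml_parser test suite.

```python
import email
from email import policy

# Placeholder addresses; the group syntax mirrors the RFC example above.
raw = (
    'To: A Group:Ed Jones <ed@example.com>,joe@example.com,John <jdoe@example.com>;\n'
    '\n'
    'body\n'
)

msg = email.message_from_string(raw, policy=policy.default)
to_header = msg['To']

# .groups keeps the group structure, .addresses flattens it.
groups = to_header.groups
addresses = [address.addr_spec for address in to_header.addresses]
print(len(groups), addresses)
```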

I have not found a related issue in the Python bug tracker, but perhaps something like the following in _header_value_parser.py would be appropriate to prevent the exception:

import eml_parser returns error

Environment:
Python 2.7.10
virtualenv
eml-parser installed via pip

Steps to reproduce:

$ pip install eml-parser
Collecting eml-parser
Using cached eml_parser-1.7-py2.py3-none-any.whl
Requirement already satisfied: cchardet in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from eml-parser)
Requirement already satisfied: python-dateutil in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from eml-parser)
Requirement already satisfied: file-magic in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from eml-parser)
Requirement already satisfied: typing in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from eml-parser)
Requirement already satisfied: six>=1.5 in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from python-dateutil->eml-parser)
Installing collected packages: eml-parser
Successfully installed eml-parser-1.7

(connect_to_cloud) my_pc:ccccc xxxxx$ python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import eml_parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages/eml_parser/__init__.py", line 8, in <module>
    from . import eml_parser
  File "/x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages/eml_parser/eml_parser.py", line 82
    def get_raw_body_text(msg: email.message.Message) -> typing.List[typing.Tuple[typing.Any, typing.Any, typing.Any]]:
                             ^
SyntaxError: invalid syntax

Result:-

>>> import eml_parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages/eml_parser/__init__.py", line 8, in <module>
    from . import eml_parser
  File "/x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages/eml_parser/eml_parser.py", line 82
    def get_raw_body_text(msg: email.message.Message) -> typing.List[typing.Tuple[typing.Any, typing.Any, typing.Any]]:
                             ^
SyntaxError: invalid syntax
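
The failing line uses function annotations, which Python 2 cannot parse; they were introduced in Python 3.0 (PEP 3107). A small guard like the following can make the cause explicit before the import is attempted. The helper name and the `minimum` default are illustrative assumptions, and the minimum Python version actually supported by eml_parser is not stated in this issue.

```python
import sys


def interpreter_is_new_enough(version_info=sys.version_info, minimum=(3, 0)):
    """Return True if the interpreter version is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == '__main__':
    if not interpreter_is_new_enough():
        raise SystemExit('This interpreter cannot parse function annotations; use Python 3.')
    print('Interpreter version is fine:', sys.version_info[:3])
```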

Infinite wait when parsing specific eml file

Hello.
Thank you for the library.

I ran into a problem while using it.

ep = eml_parser.EmlParser(parse_attachments=False)
parsed_eml = ep.decode_email_bytes(raw_email)

I am trying to parse an eml file like this, but for one specific eml file the call never returns.
If I forcibly stop it, the following traceback is shown.

File "/usr/local/lib/python3.7/site-packages/eml_parser/eml_parser.py", line 192, in decode_email_bytes
return self.parse_email()
File "/usr/local/lib/python3.7/site-packages/eml_parser/eml_parser.py", line 431, in parse_email
list_observed_urls = self.get_uri_ondata(body_slice)
File "/usr/local/lib/python3.7/site-packages/eml_parser/eml_parser.py", line 641, in get_uri_ondata
for match in eml_parser.regex.url_regex_simple.findall(body):

I'm not sure why this is happening.
What is the workaround?

The sample is attached in txt format.

sample.txt
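
One possible workaround, sketched under clear assumptions: wrap the parsing call in a timer-based time limit. The traceback above suggests the regex call can be interrupted, but that is not guaranteed. `time_limit` is a hypothetical helper, works only on Unix, and only in the main thread.

```python
import signal
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the wrapped block runs longer than `seconds` (Unix only)."""
    def _raise_timeout(signum, frame):
        raise TimeoutError('operation exceeded %s seconds' % seconds)

    previous_handler = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous_handler)


# Intended usage with the calls from this issue (not executed here):
#     with time_limit(30):
#         parsed_eml = ep.decode_email_bytes(raw_email)
```

Running the parser in a separate process that can be terminated is a more robust alternative when signals are not an option.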

Getting following error when running "import eml_parser"

Unexpected Error: list index out of range

I'm using eml_parser with TheHive project and all analysis fails with

Unexpected Error: list index out of range

I'm not sure where else to find any other logging or info to troubleshoot this.

Getting following error while importing this package

import eml_parser
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\eml_parser\__init__.py", line 8, in <module>
    from . import eml_parser
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\eml_parser\eml_parser.py", line 63, in <module>
    import magic
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 20, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\magic.py", line 23, in <module>
    _libraries['magic'] = _init()
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\magic.py", line 20, in _init
    return ctypes.cdll.LoadLibrary(find_library('magic'))
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\ctypes\__init__.py", line 426, in LoadLibrary
    return self._dlltype(name)
  File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None

TypeError: coercing to str: need a bytes-like object, NoneType found

Hello,

I need help parsing a particular EML file that cannot be attached here for confidentiality reasons. If the EML is still needed after reading this information, I can send it by email or in some other way.

OS: Debian 9.2
Python: 3.5.3
eml_parser: 1.8
file-magic: 0.3.0
EML generated by: User-agent: Microsoft-MacOutlook/10.c.0.180410
Has attachments: Yes

test.py (code from readme)

import eml_parser


def json_serial(obj):
    if isinstance(obj, datetime.datetime):
        serial = obj.isoformat()
        return serial


with open('sample.eml', 'rb') as fhdl:
    raw_email = fhdl.read()

parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)

print(json.dumps(parsed_eml, default=json_serial))

Running python3.5 test.py returns an error

Traceback (most recent call last):
  File "test.py", line 13, in <module>
    parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)
  File "/usr/local/lib/python3.5/dist-packages/eml_parser/eml_parser.py", line 317, in decode_email_b
    return parse_email(msg, include_raw_body, include_attachment_data, pconf)
  File "/usr/local/lib/python3.5/dist-packages/eml_parser/eml_parser.py", line 774, in parse_email
    report_struc['attachment'] = traverse_multipart(msg, 0, include_attachment_data)
  File "/usr/local/lib/python3.5/dist-packages/eml_parser/eml_parser.py", line 196, in traverse_multipart
    attachments.update(traverse_multipart(part, counter, include_attachment_data))  # type: ignore
  File "/usr/local/lib/python3.5/dist-packages/eml_parser/eml_parser.py", line 234, in traverse_multipart
    attachments[file_id]['mime_type'] = magic_none.buffer(data)
  File "/usr/local/lib/python3.5/dist-packages/magic.py", line 152, in buffer
    return str(r, 'utf-8')
TypeError: coercing to str: need a bytes-like object, NoneType found

Do you have an idea without reviewing the EML file?

Thank you!

Email body parsing

Are there any branches to extract the email body?

Regex email.

I think the email regex should be tuned ...

[
'6f@k', '6f@k', '%@3', '%@3', 'i@0', '%@3', 'i@0', 'i@0', '/8r@gbj', 'i@0', '/8r@gbj', '/8r@gbj', '8@c', '8@c', '8@c', 'n@jg', 'xf9}8q@f', 'n@jg', 'xf9}8q@f', 'bt@b2x', 'n@jg', 'xf9}8q@f', 'bt@b2x', 'n@jg', 'xf9}8q@f', 'bt@b2x', 'bt@b2x', '5pa@ao', '5pa@ao', 'b2h$xa@vljqq', '8@d', 'b2h$xa@vljqq', '8@d', 'b2h$xa@vljqq', '8@d', '/8@d', '/8@d', 'm8y@v', '/8@d', 'm8y@v', '/8@d', 'm8y@v', 'm8y@v', 'r@gez', 'r@gez', 'zx%f@p', 'zx%j@s', 'zx%f@p', 'zx%j@s', 'kc@pl', 'zx%f@p', 'zx%j@s', 'kc@pl', 'zx%f@p', 'zx%j@s', 'kc@pl', 'zx%f@p', 'zx%j@s', 'kc@pl', 'kc@pl', 'ke@n', 'ke@n', 'ke@n', 'w@o', 'w@o', 'w@o', 'w@o', 'w@o', 'h.@oa', 'h.@oa', '5.@q', '8@d', '5.@q', '8@d', '5.@q', '8@d', '5.@q', '8@d', '^/#@tc', '^/#@tc', '/65@kt', '/65@kt', '5@kt', '9@o', 'g@x', 'g@x', '0#@p1', 'p@x', 'xt@5w', '0#@p1', 'p@x', 'xt@5w', 'xt@5w', 'i@sr', 'i@sr', '6z@n', '6z@n', '.@-ks', '.@-ks', "$'@cjd", 'ng@mxjj', "$'@cjd", 'ng@mxjj', "$'@cjd", 'ng@mxjj', '01@djll', '60@a7cf', '60@a7cf', '60@a7cf', '60@a7cf', '60@a7cf', '}@9j', '}@9j', '}@9j', '}@9j', '}@9j', 'l@el', 'l@el', 'l@el', 'l@el', 'p@y', 'p@y', 'p@y', 'p@y', 'ou@m0', 'y@64', '}@3', 'y@64', '}@3', 'y@64', '}@3', '}@3', 'nhyfhu@b', '&@he', 'nhyfhu@b', '&@he', 'nhyfhu@b', '&@he', '&@he', '&@he', '.t*@oh', '.t*@oh', '!}qv@k', '!}qv@k', 'ljd@z', 'ljd@z', 'ljd@z', 'ljd@z', 'c=@i', 'c=@i', 'm@qcoyub', 'c=@i', 'm@qcoyub', 'm@qcoyub', 'm@qcoyub', '&@rm', '&@rm', 'o$@t', 'o$@t', 'o$@t', 'o$@t', 'o$@t', '=e%@gyw', 'cq@u', 'cq@u', '*_ioj5@2', '*_ioj5@2', '*_ioj5@2', '#tid*{4|@8', '#tid*{4|@8', 'ny@f', 'ny@f', 'ny@f', 'fn8b@i', '0}.`@ne.9u', '0}.`@ne.9u', 'qq}@i', ',@u6xr', 'qq}@i', ',@u6xr', 'qq}@i', ',@u6xr', 'a@zf', 'a@zf', '3v@e', '}s_@a', 'a@zf', '3v@e', '}s_@a', '3v@e', '}s_@a', '3v@e', '}s_@a', 'gf@b', 'gf@b', 'ud@qt', 'gf@b', 'ud@qt', 'gf@b', 'ud@qt', 'gf@b', 'ud@qt', 'nis@eond', '-f@ynw', 'nis@eond', '-f@ynw', 'nis@eond', '-f@ynw',](url)
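
As an illustration of what a stricter check might look like, the sketch below post-filters candidates with a pattern that requires a dotted domain and an alphabetic top-level label. The pattern and the helper name are assumptions for this example, not eml_parser's regex, and the pattern is deliberately simpler than the full address grammar.

```python
import re

# Hypothetical stricter pattern: plain local part, dotted domain, alphabetic TLD.
PLAUSIBLE_EMAIL = re.compile(
    r'[a-z0-9._%+-]+@(?:[a-z0-9-]+\.)+[a-z]{2,}',
    re.IGNORECASE,
)


def plausible_emails(candidates):
    """Keep only candidates that fully match the stricter pattern."""
    return [candidate for candidate in candidates if PLAUSIBLE_EMAIL.fullmatch(candidate)]


sample = ['6f@k', '0}.`@ne.9u', 'h.@oa', 'john.doe@example.com']
print(plausible_emails(sample))
```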

KeyError when parsing subject

Hello there,

Using the latest version of the package, I get an error on a particular EML with the following subject:

Subject: [MANAGER COMME UN COACH 1 - Presentiel] Une évaluation à
 =?ISO-8859-1?Q	?compl=E9ter=20?=sur votre plateforme de formation

Here is the full stack trace:

<ipython-input-28-eb0e5e7f17ee> in extract_email_to_txt_file(eml_file, destination)
     29 
     30     # convert eml raw file to an iterable object
---> 31     parsed_eml = ep.decode_email(eml_file)
     32 
     33     # write subject and content of the email to a txt file

/usr/local/Caskroom/miniconda/base/lib/python3.7/site-packages/eml_parser/eml_parser.py in decode_email(self, eml_file, ignore_bad_start)
    151             raw_email = fp.read()
    152 
--> 153         return self.decode_email_bytes(raw_email, ignore_bad_start=ignore_bad_start)
    154 
    155     def decode_email_bytes(self, eml_file: bytes, ignore_bad_start: bool = False) -> dict:

/usr/local/Caskroom/miniconda/base/lib/python3.7/site-packages/eml_parser/eml_parser.py in decode_email_bytes(self, eml_file, ignore_bad_start)
    190         self.msg = email.message_from_bytes(_eml_file, policy=self.policy)
    191 
--> 192         return self.parse_email()
    193 
    194     def parse_email(self) -> dict:

/usr/local/Caskroom/miniconda/base/lib/python3.7/site-packages/eml_parser/eml_parser.py in parse_email(self)
    223 
    224         # parse and decode subject
--> 225         subject = self.msg.get('subject', '')
    226         headers_struc['subject'] = eml_parser.decode.decode_field(subject)
    227 

/usr/local/Caskroom/miniconda/base/lib/python3.7/email/message.py in get(self, name, failobj)
    469         for k, v in self._headers:
    470             if k.lower() == name:
--> 471                 return self.policy.header_fetch_parse(k, v)
    472         return failobj
    473 

/usr/local/Caskroom/miniconda/base/lib/python3.7/email/policy.py in header_fetch_parse(self, name, value)
    161         # We can't use splitlines here because it splits on more than \r and \n.
    162         value = ''.join(linesep_splitter.split(value))
--> 163         return self.header_factory(name, value)
    164 
    165     def fold(self, name, value):

/usr/local/Caskroom/miniconda/base/lib/python3.7/email/headerregistry.py in __call__(self, name, value)
    587 
    588         """
--> 589         return self[name](name, value)

/usr/local/Caskroom/miniconda/base/lib/python3.7/email/headerregistry.py in __new__(cls, name, value)
    195     def __new__(cls, name, value):
    196         kwds = {'defects': []}
--> 197         cls.parse(value, kwds)
    198         if utils._has_surrogates(kwds['decoded']):
    199             kwds['decoded'] = utils._sanitize(kwds['decoded'])

/usr/local/Caskroom/miniconda/base/lib/python3.7/email/headerregistry.py in parse(cls, value, kwds)
    270     @classmethod
    271     def parse(cls, value, kwds):
--> 272         kwds['parse_tree'] = cls.value_parser(value)
    273         kwds['decoded'] = str(kwds['parse_tree'])
    274 

/usr/local/Caskroom/miniconda/base/lib/python3.7/email/_header_value_parser.py in get_unstructured(value)
   1100         if value.startswith('=?'):
   1101             try:
-> 1102                 token, value = get_encoded_word(value)
   1103             except errors.HeaderParseError:
   1104                 # XXX: Need to figure out how to register defects when

/usr/local/Caskroom/miniconda/base/lib/python3.7/email/_header_value_parser.py in get_encoded_word(value)
   1046     value = ''.join(remainder)
   1047     try:
-> 1048         text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
   1049     except ValueError:
   1050         raise errors.HeaderParseError(

/usr/local/Caskroom/miniconda/base/lib/python3.7/email/_encoded_words.py in decode(ew)
    176     # Recover the original bytes and do CTE decoding.
    177     bstring = cte_string.encode('ascii', 'surrogateescape')
--> 178     bstring, defects = _cte_decoders[cte](bstring)
    179     # Turn the CTE decoded bytes into unicode.
    180     try:

KeyError: 'q\t'
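
The `KeyError` comes from the standard library's header parser, so one way to keep processing such a message is to fall back to the raw, undecoded header value. This is a sketch: `safe_get_header` is a hypothetical helper, and returning the raw value is a design choice for illustration, not eml_parser's behaviour.

```python
import email
from email import policy


def safe_get_header(msg, name, default=''):
    """Return the parsed header, or its raw value if the stdlib parser raises KeyError."""
    try:
        return msg.get(name, default)
    except KeyError:
        # raw_items() yields (name, value) pairs without running the header parser.
        for key, value in msg.raw_items():
            if key.lower() == name.lower():
                return value
        return default


ok_msg = email.message_from_string('Subject: Hello\n\nbody', policy=policy.default)
print(safe_get_header(ok_msg, 'subject'))
```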

AttributeError: module 'eml_parser' has no attribute 'EmlParser'

When running the example usage, Python returned this error message:

AttributeError: module 'eml_parser' has no attribute 'EmlParser'

Cannot install the latest version

pip install eml_parser==1.17.0

ERROR: Could not find a version that satisfies the requirement eml_parser==1.17.0 (from versions: 0.9, 1.0, 1.1, 1.3, 1.4, 1.5,
 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.11.1, 1.11.2, 1.11.4, 1.11.5, 1.11.6, 1.11.7)
ERROR: No matching distribution found for eml_parser==1.17.0

If I run pip install eml_parser, I get version 1.11.7, which leads to this error:
AttributeError: module 'eml_parser' has no attribute 'EmlParser'
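
Code that has to run against both API styles could check for the class before using it. The sketch below takes the imported module as a parameter; `decode_eml` is a hypothetical helper, and the assumption that releases without `EmlParser` expose `eml_parser.eml_parser.decode_email_b` is based only on the version 1.8 report elsewhere on this page.

```python
def decode_eml(module, raw_email):
    """Decode raw EML bytes with whichever API the given eml_parser module exposes."""
    if hasattr(module, 'EmlParser'):
        # Class-based API, as used in the README example.
        return module.EmlParser().decode_email_bytes(raw_email)
    # Function-based API seen in an older report (assumed to exist in this release).
    return module.eml_parser.decode_email_b(raw_email)


# Intended usage (requires eml_parser to be installed):
#     import eml_parser
#     parsed_eml = decode_eml(eml_parser, raw_email)
```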

file-magic / python-magic

Hi,

My context requires me to use python-magic instead of file-magic.
When using eml_parser, I got the AttributeError: module 'magic' has no attribute 'open'.

I noticed that in eml_parser.py, instead of having:

try:
    import magic
except ImportError:
    magic = None
    magic_mime = None
    magic_none = None
else:
    # MAGIC_MIME_TYPE gives the real mime-type
    magic_mime = magic.open(magic.MAGIC_MIME_TYPE)
    magic_mime.load()
    # MAGIC_NONE gives the meta-information on the analysed file
    magic_none = magic.open(magic.MAGIC_NONE)
    magic_none.load()

If I put

import magic
magic = None
magic_mime = None
magic_none = None

The parsing works but I don't get the mime info, which is fine in my use-case.
Do you know a way to handle the wrong magic module (python-magic), in other words, to parse the eml with python-magic instead of file-magic?

Thanks in advance.
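
One way to tolerate either package is to test for the `open` attribute before using it, since the error above shows that python-magic does not provide it. The sketch below is an assumption about how such a guard could look, not the library's actual fix; `open_detectors` is a hypothetical helper that receives the imported module (or `None`).

```python
def open_detectors(magic_module):
    """Return (magic_mime, magic_none) detectors, or (None, None) if unsupported."""
    if magic_module is None or not hasattr(magic_module, 'open'):
        # Module missing, or it is python-magic, which has no magic.open().
        return None, None
    # MAGIC_MIME_TYPE gives the real mime-type
    magic_mime = magic_module.open(magic_module.MAGIC_MIME_TYPE)
    magic_mime.load()
    # MAGIC_NONE gives the meta-information on the analysed file
    magic_none = magic_module.open(magic_module.MAGIC_NONE)
    magic_none.load()
    return magic_mime, magic_none


# Intended usage:
#     try:
#         import magic
#     except ImportError:
#         magic = None
#     magic_mime, magic_none = open_detectors(magic)
```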

Constantly different output

I used your example code with a single eml file, but every run produces different output. What could be the reason for this?

Enhancement: Match HTML Anchor href and Image src

In the simple URL regex, many URLs that don't include the scheme in the href or src are skipped.

<a target="_blank" href="www.wikipedia.org">
  Wikipedia (opens in new tab)
</a>

Should URLs like this be extracted?
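
If such URLs were to be extracted, reading the attributes with an HTML parser is one option that does not depend on a scheme being present. A minimal sketch with the standard library follows; `LinkCollector` and `extract_links` are hypothetical names, and this is not how eml_parser currently extracts URLs.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if (tag == 'a' and name == 'href') or (tag == 'img' and name == 'src'):
                self.links.append(value)


def extract_links(html_text):
    """Return all anchor href and image src values found in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


print(extract_links('<a target="_blank" href="www.wikipedia.org">Wikipedia</a>'))
```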
