Comments (26)

oliver006 avatar oliver006 commented on August 20, 2024

What's the error you're seeing?

from elasticsearch-gmail.

spaceman10 avatar spaceman10 commented on August 20, 2024

It was just a blank terminal (nothing returned).

I ended up running it like this:

python2.7 index_emails.py -vvvvvv

and it printed the options.

Then I ran this, and it worked:

python2.7 index_emails.py --infile=test.mbox

It does not seem to print the options if you give it incomplete directions (no options or a blank CLI).

spaceman10 avatar spaceman10 commented on August 20, 2024

is this the expected output?

root@precise32:/vagrant# python2.7 index_emails.py --infile=test.mbox
Errors during upload: False - upload took: 1486ms, total messages uploaded: 500
Errors during upload: False - upload took: 614ms, total messages uploaded: 1000
Errors during upload: False - upload took: 621ms, total messages uploaded: 1500
Errors during upload: False - upload took: 353ms, total messages uploaded: 2000
Errors during upload: False - upload took: 369ms, total messages uploaded: 2500
Errors during upload: False - upload took: 373ms, total messages uploaded: 3000
Errors during upload: False - upload took: 295ms, total messages uploaded: 3500
Errors during upload: False - upload took: 289ms, total messages uploaded: 4000
Errors during upload: False - upload took: 377ms, total messages uploaded: 4500
Errors during upload: False - upload took: 452ms, total messages uploaded: 5000
Errors during upload: False - upload took: 297ms, total messages uploaded: 5500
Errors during upload: False - upload took: 500ms, total messages uploaded: 6000
Errors during upload: False - upload took: 273ms, total messages uploaded: 6500
Errors during upload: False - upload took: 435ms, total messages uploaded: 7000
Errors during upload: False - upload took:

spaceman10 avatar spaceman10 commented on August 20, 2024

above entry ended with the following

Errors during upload: False - upload took: 365ms, total messages uploaded: 22000
Errors during upload: False - upload took: 306ms, total messages uploaded: 22500
Errors during upload: False - upload took: 242ms, total messages uploaded: 23000
Errors during upload: False - upload took: 236ms, total messages uploaded: 23500
Errors during upload: False - upload took: 202ms, total messages uploaded: 24000
Traceback (most recent call last):
  File "index_emails.py", line 179, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 418, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 399, in run
    result = func()
  File "index_emails.py", line 136, in load_from_file
    item = convert_msg_to_json(msg)
  File "index_emails.py", line 102, in convert_msg_to_json
    tz = tt[9] or 0
TypeError: 'NoneType' object has no attribute '__getitem__'

spaceman10 avatar spaceman10 commented on August 20, 2024

A bunch of stuff ended up in the Elasticsearch instance, so I'm unsure whether this is all expected.

oliver006 avatar oliver006 commented on August 20, 2024

It added the first ~ 24k emails to the index but then failed with an error.
I updated the src file to do a bit more robust error checking during tz parsing, can you try again?
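
For anyone reading along, here is a minimal sketch of the kind of guard that avoids the traceback above. It assumes the failure comes from `email.utils.parsedate_tz` returning `None` for a `Date` header it cannot parse. The function name `date_header_to_epoch_ms` is made up for illustration and is not the code in `index_emails.py`.

```python
import calendar
from email.utils import parsedate_tz


def date_header_to_epoch_ms(date_header):
    """Convert a Date header to epoch milliseconds, or None if unparseable.

    Hypothetical helper for illustration only.
    """
    tt = parsedate_tz(date_header)
    if tt is None:
        # parsedate_tz could not make sense of the header at all,
        # so indexing into the result (tt[9]) would raise TypeError.
        return None
    tz = tt[9] or 0  # offset in seconds east of UTC; fall back to 0
    return int(calendar.timegm(tt) - tz) * 1000
```

A caller can then skip the `date_ts` field, or the whole message, whenever the helper returns `None`.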

oliver006 avatar oliver006 commented on August 20, 2024

I also changed it so that it outputs the --help info blurb if no parameters are passed.
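
The idea of falling back to the help text can be sketched as below. This uses the standard library's `argparse` purely for illustration; the script itself appears to use a different option parser, so `build_parser` and `parse_or_show_help` are hypothetical names and not the actual change.

```python
import argparse


def build_parser():
    # Illustrative parser with only the one option used in this thread.
    parser = argparse.ArgumentParser(description="Index an mbox file (illustrative).")
    parser.add_argument("--infile", help="path to the mbox file to index")
    return parser


def parse_or_show_help(argv):
    """Print the help text and return None when no arguments were given."""
    parser = build_parser()
    if not argv:
        parser.print_help()
        return None
    return parser.parse_args(argv)
```

In a script this would typically be called as `parse_or_show_help(sys.argv[1:])`, exiting early when it returns `None`.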

spaceman10 avatar spaceman10 commented on August 20, 2024

Thanks. Just did a git pull and it's running now. I think it takes about 10 minutes on my setup; will report back in a few.

spaceman10 avatar spaceman10 commented on August 20, 2024

latest run

Upload: OK - upload took: 491ms, total messages uploaded: 25000
Upload: OK - upload took: 435ms, total messages uploaded: 25500
Traceback (most recent call last):
  File "index_emails.py", line 178, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 418, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 399, in run
    result = func()
  File "index_emails.py", line 135, in load_from_file
    item = convert_msg_to_json(msg)
  File "index_emails.py", line 103, in convert_msg_to_json
    result['date_ts'] = int(calendar.timegm(tt) - tz) * 1000
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

spaceman10 avatar spaceman10 commented on August 20, 2024

running it with -vvvv

spaceman10 avatar spaceman10 commented on August 20, 2024

Redoing; the scrollback buffer was too small. Capturing a logfile this go-around.

spaceman10 avatar spaceman10 commented on August 20, 2024

Ran strace on the process while it was running. Seeing this every so often:

read(9, "W0OHn+\r\nKAywzHpJtqzQypD4NLRlcJ3D"..., 4096) = 4096
read(9, "kw5DxMEDejhXUjIpjaa1zfhHPe9TIGKc"..., 4096) = 4096
_llseek(9, 3475939328, [3475939328], SEEK_SET) = 0
read(9, "uu4/\r\npB41IpTitShkGejrkw0/DCr04q"..., 4096) = 4096
read(9, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
brk(0xac8c000) = 0xac8c000
mmap2(NULL, 8269824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb672b000
brk(0xa248000) = 0xa248000
mmap2(NULL, 8269824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb5f48000
munmap(0xb672b000, 8269824) = 0
_llseek(9, 3475947520, [3475947520], SEEK_SET) = 0
_llseek(9, 3475947520, [3475947520], SEEK_SET) = 0
read(9, "AAAAAAAAAAAAAAAA

I suspect it is the upload taking place.

spaceman10 avatar spaceman10 commented on August 20, 2024

The only difference with -vvvv is this bit at the end:

Upload: OK - upload took: 314ms, total messages uploaded: 21500
Upload: OK - upload took: 331ms, total messages uploaded: 22000
Upload: OK - upload took: 290ms, total messages uploaded: 22500
Upload: OK - upload took: 283ms, total messages uploaded: 23000
Upload: OK - upload took: 227ms, total messages uploaded: 23500
Upload: OK - upload took: 255ms, total messages uploaded: 24000
Upload: OK - upload took: 241ms, total messages uploaded: 24500
Upload: OK - upload took: 271ms, total messages uploaded: 25000
Upload: OK - upload took: 417ms, total messages uploaded: 25500
Traceback (most recent call last):
  File "index_emails.py", line 178, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 418, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 399, in run
    result = func()
  File "index_emails.py", line 135, in load_from_file
    item = convert_msg_to_json(msg)
  File "index_emails.py", line 103, in convert_msg_to_json
    result['date_ts'] = int(calendar.timegm(tt) - tz) * 1000
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
# clear __builtin__._
# clear sys.path
# clear sys.argv
# clear sys.ps1
# clear sys.ps2
# clear sys.exitfunc

spaceman10 avatar spaceman10 commented on August 20, 2024

latest run command looks like this
python index_emails.py --infile=../../test.mbox --log_file_prefix=./real.log --logging=debug

spaceman10 avatar spaceman10 commented on August 20, 2024

Same errors with the above command, and real.log is empty.
Let me know if you have any ideas.

oliver006 avatar oliver006 commented on August 20, 2024

Interesting. I pushed up another version, this time catching all errors in tz parsing with a try/except - let me know if that fixes the issue.

spaceman10 avatar spaceman10 commented on August 20, 2024

Cool, this worked. I'm interested to see which change actually fixed it. At first glance you made two distinct changes, both related to return values: one that seems to sanitize the bool vs. int handling (removal of return False), and another that returns None if the value is outside spec and not an int.

I could run again reverting one or the other to see if it fails on one portion or the other if you want.

I'm interested to figure out how to validate the data now that it is inside Elasticsearch.

Great work, and thanks much for the help !

oliver006 avatar oliver006 commented on August 20, 2024

You could add a print msg between line 104 and 105 and run the import again. That would output the message that causes the exception. I suspect it's an archived GChat transcript, I've seen them cause trouble in the past due to not having a timestamp.
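
As an alternative to editing the script, a standalone pass over the mbox file can list the messages whose `Date` header does not parse. This is only a sketch: it assumes the file is a regular mbox readable by the standard `mailbox` module, and `find_unparseable_dates` is a hypothetical helper name.

```python
import mailbox
from email.utils import parsedate_tz


def find_unparseable_dates(mbox_path):
    """Return (position, raw Date header) pairs for messages whose Date
    header is missing or cannot be parsed. Hypothetical helper."""
    bad = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for position, msg in enumerate(box):
            raw_date = msg.get('Date')
            # A missing header (e.g. a transcript without a timestamp)
            # is reported the same way as an unparseable one.
            parsed = parsedate_tz(str(raw_date)) if raw_date is not None else None
            if parsed is None:
                bad.append((position, raw_date))
    finally:
        box.close()
    return bad
```

Printing the returned pairs shows which headers to look at without re-running the whole import.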

spaceman10 avatar spaceman10 commented on August 20, 2024

It might be unusual or non-ASCII characters in the Date header:
Date: ������, 26 ��� 2008 08:21:36 -0900

spaceman10 avatar spaceman10 commented on August 20, 2024

The other failures look like this
Date: Tue, 24 Apr 2007 01:01:10 GMT-07:00

oliver006 avatar oliver006 commented on August 20, 2024

The second one looks alright, not sure why it'd fail on that, weird.

spaceman10 avatar spaceman10 commented on August 20, 2024

GMT string data / formatting issues?
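
One possible explanation, offered as a guess: the zone is written as `GMT-07:00` rather than as a plain numeric offset such as `-0700`, and the standard parser does not appear to recognize that combined form. A minimal sketch of normalizing it before parsing follows; `normalize_gmt_offset` is a hypothetical helper and not part of the script.

```python
import re

# Matches zones written like "GMT-07:00" or "UTC+0530" (assumed variants).
_GMT_OFFSET = re.compile(r'\b(?:GMT|UTC)([+-])(\d{2}):?(\d{2})\b')


def normalize_gmt_offset(date_header):
    """Rewrite 'GMT-07:00' style zones as '-0700'; leave other text alone."""
    return _GMT_OFFSET.sub(r'\1\2\3', date_header)
```

The normalized string can then be passed to `email.utils.parsedate_tz` as usual; a bare `GMT` with no offset is left unchanged.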

spaceman10 avatar spaceman10 commented on August 20, 2024

I edited my local code to print tt, separating each failing message with a 'mushroom' marker:

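if "date" in result:
    tt = None  # defined up front so the except block can always print it
    try:
        tt = email.utils.parsedate_tz(result['date'])
        tz = tt[9] if len(tt) == 10 else 0
        result['date_ts'] = int(calendar.timegm(tt) - tz) * 1000
    except Exception:
        # debug marker so failing messages stand out in the output
        print "\n\n\nmushroom \n\n\n"
        #print msg
        print tt
        #print tz
        return None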

This is the output:

Upload: OK - upload took: 265ms, total messages uploaded: 21000
Upload: OK - upload took: 420ms, total messages uploaded: 21500
Upload: OK - upload took: 326ms, total messages uploaded: 22000
Upload: OK - upload took: 329ms, total messages uploaded: 22500
Upload: OK - upload took: 330ms, total messages uploaded: 23000
Upload: OK - upload took: 229ms, total messages uploaded: 23500
Upload: OK - upload took: 341ms, total messages uploaded: 24000

mushroom

None
Upload: OK - upload took: 173ms, total messages uploaded: 24500
Upload: OK - upload took: 310ms, total messages uploaded: 25000
Upload: OK - upload took: 283ms, total messages uploaded: 25500

mushroom

(2007, 4, 24, 1, 1, 10, 0, 1, -1, None)

mushroom

(2007, 4, 25, 0, 58, 28, 0, 1, -1, None)

mushroom

(2007, 4, 28, 1, 0, 21, 0, 1, -1, None)
Upload: OK - upload took: 349ms, total messages uploaded: 26000
Upload: OK - upload took: 287ms, total messages uploaded: 26500
Upload: OK - upload took: 200ms, total messages uploaded: 27000
Upload: OK - upload took: 74ms, total messages uploaded: 27222
Done - total count 27245

oliver006 avatar oliver006 commented on August 20, 2024

Thanks for the detailed log, that helps.
We can handle this case (2007, 4, 25, 0, 58, 28, 0, 1, -1, None) - I added a bit of code for that.
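
A minimal sketch of handling that tuple shape, where the last element (the UTC offset) is `None`. Treating an unknown offset as UTC is an assumption made here for illustration, and `time_tuple_to_epoch_ms` is a hypothetical name; the actual change in the repository may differ.

```python
import calendar


def time_tuple_to_epoch_ms(tt):
    """Convert a 10-tuple shaped like parsedate_tz output to epoch ms.

    Hypothetical helper; an unknown offset (None) is treated as UTC.
    """
    if tt is None:
        return None
    tz_offset = tt[9] if tt[9] is not None else 0
    return int(calendar.timegm(tt) - tz_offset) * 1000
```

With this, a tuple such as `(2007, 4, 25, 0, 58, 28, 0, 1, -1, None)` yields a timestamp instead of raising `TypeError`.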

spaceman10 avatar spaceman10 commented on August 20, 2024

Clean run. Solid. Thanks!!!

oliver006 avatar oliver006 commented on August 20, 2024

Nice, guess we can close this.
