I am trying tablite with a CSV file with many fields and some of which are long, more

Tablite throws IndexError when reading a complex CSV file about tablite HOT 13 CLOSED

root-11 commented on July 1, 2024

from tablite.

Comments (13)

root-11 commented on July 1, 2024

Hey @ypanagis
Thanks for this. I'll make the error message more informative in the next release.

The file you use claims to have 31 headers, but there are only 30 columns.
The column for "court" is missing.

This will work for you:

from tablite import Table
from pathlib import Path

# Locate the CSV relative to this script and fail early if it is missing.
path = Path(__file__).parent / "data" / 'long_text_test.csv'
assert path.exists()

# The 30 columns that actually have data; "court" from the header is left out.
columns = [
    "sharepointid","Rank","ITEMID","docname","doctype","application","APPNO",
    "ARTICLES","violation","nonviolation","CONCLUSION","importance","ORIGINATING BODY ID",
    "typedescription","kpdate","kpdateAsText","documentcollectionid","documentcollectionid2",
    "languageisocode","extractedappno","isplaceholder","doctypebranch","RESPONDENT",
    "respondentOrderEng","scl","ECLI","ORIGINATING BODY","YEAR","FULLTEXT","judges"]

# Declare the columns explicitly and use '"' as the text qualifier.
t = Table.import_file(path, import_as='csv', columns={c:'f' for c in columns}, text_qualifier='"')
# Show only the first five columns.
selection = columns[:5]
t.__getitem__(*selection).show()

PS: Note that your file test.csv is cut mid-row (the last row has only 25 fields).
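
A mismatch like this between the header and the data rows can be spotted before importing by counting fields per row with Python's standard csv module. The following is only a minimal sketch: the function name `field_count_report` and the sample data are hypothetical, it is not part of tablite, and it assumes a comma delimiter with `"` as the text qualifier.

```python
import csv
import io
from collections import Counter


def field_count_report(lines, delimiter=",", quotechar='"'):
    """Return (number of header fields, Counter of field counts per data row)."""
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar)
    header = next(reader)
    # Quoted fields may span several physical lines; csv.reader joins them.
    counts = Counter(len(row) for row in reader)
    return len(header), counts


# Hypothetical sample: 3 headers, but no data row has 3 fields.
sample = 'a,b,c\n1,"x, y"\n2,"long\ntext"\n3\n'
header_len, counts = field_count_report(io.StringIO(sample, newline=""))
print(header_len, dict(counts))
```

To run it on a real file, pass an open file object instead, for example `open(path, newline="", encoding="utf-8")` (the encoding is an assumption). Any row count that differs from the header length points at a missing column or a truncated row.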

Here's what the output looks like on my machine:

ypanagis commented on July 1, 2024

Hi @root-11 and thank you for your reply. First of all, yes, this CSV is rather ill-structured and is missing values in some columns. One of those is the column "court", as you very correctly noticed.

I didn't know of the columns and text_qualifier parameters of import_file and it is a real convenience that they are included!

I played a bit with the example script that you gave, setting selection = columns[-3:] to see how it works with the last few columns, which include FULLTEXT, the long one. However, I got the error in the attached file. After browsing the messages, it seems that lines 45-46 of the CSV cause an error, but it is not obvious what the problem is, and I couldn't see anything wrong in the CSV (though there can be something, of course).
error.txt

root-11 commented on July 1, 2024

That's Python's multiprocessing module crashing.
As the test suite runs Python 3.8 on Linux, just like you, this seems strange. Could it be a difference between your conda env and Python's own venv?

ypanagis commented on July 1, 2024

I run the script on macOS. Can this also be an issue with multiprocessing?

root-11 commented on July 1, 2024

I'm not sure. Can you try to run the multiprocessing test suite in this script:
https://github.com/root-11/mplite/blob/main/tests/test_basics.py

If that doesn't work, I'll have to do a deeper dive into why macOS behaves differently.

root-11 commented on July 1, 2024

I've added Windows and macOS to the test matrix and they all come out positive:

ypanagis commented on July 1, 2024

I changed to Python 3.9 as you suggested, but it now gives me the error in the attached file. My PC also has mamba installed, so the environment is now a mamba one, but I hope this is not a problem.

Note that I saw the same error when I removed the last two columns that had some empty values, in case that caused issues.

tablite is in version 2022.10.08.
error.txt

root-11 commented on July 1, 2024

Thanks for that Yannis! I'll look into that immediately.

root-11 commented on July 1, 2024

So the error says that psutil.virtual_memory().free is zero.

Could you run this on your mac for me:

import psutil
psutil.virtual_memory().free
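
For context, psutil's documentation describes `free` as memory that is not being used at all and recommends `available` as the estimate of what a process can actually obtain. The sketch below shows how a caller might prefer `available`. It is only an illustration: the helper `usable_bytes` and the stand-in `FakeVM` are hypothetical, and the thread does not say whether tablite does anything like this.

```python
from collections import namedtuple


def usable_bytes(vm):
    """Prefer the `available` field; fall back to `free` if it is absent."""
    return getattr(vm, "available", vm.free)


# Hypothetical stand-in for psutil's result, so the sketch runs without psutil.
FakeVM = namedtuple("FakeVM", ["total", "available", "free"])
print(usable_bytes(FakeVM(total=16_000, available=6_000, free=0)))

# With psutil installed, the same helper accepts the real object:
#   import psutil
#   print(usable_bytes(psutil.virtual_memory()))
```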

ypanagis commented on July 1, 2024

Thanks Bjorn, I just ran it and it gives this RuntimeError from a Python 3.8 bug:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

...
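
Python's multiprocessing raises this message when a child process is started while the main module is still being imported. That commonly happens under the spawn start method (the default on macOS since Python 3.8) when a script creates processes at import time without a main guard. The following is a minimal sketch of the usual guard, assuming that is what happened here; the names `square` and `main` are illustrative and not taken from tablite.

```python
import multiprocessing as mp


def square(x):
    """Worker function; defined at module level so child processes can import it."""
    return x * x


def main():
    # "spawn" starts a fresh interpreter that re-imports this module,
    # so process creation must not run as a side effect of the import.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))


if __name__ == "__main__":
    # Only the original process enters this branch; re-imports in children skip it.
    main()
```

A script that imports a library which starts worker processes would follow the same pattern: put the import-and-run logic inside `main()` and call it only under the guard.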

root-11 commented on July 1, 2024

@ypanagis - you think we can close this ticket now?

ypanagis commented on July 1, 2024

Yes @root-11 makes total sense to me. Will try the package some more, but this part is definitely over now.

root-11 commented on July 1, 2024

Neat. Just FYI: I've released a new version today with slightly better memory management.
The details are in the changelog: https://github.com/root-11/tablite/blob/master/changelog.md
