Comments (13)
Hey @ypanagis
Thanks for this. I'll make the error message more informative in the next release.
The file you use claims to have 31 headers, but there are only 30 columns.
The column for "court" is missing.
This will work for you:
from tablite import Table
from pathlib import Path
path = Path(__file__).parent / "data" / 'long_text_test.csv'
assert path.exists()
columns = [
"sharepointid","Rank","ITEMID","docname","doctype","application","APPNO",
"ARTICLES","violation","nonviolation","CONCLUSION","importance","ORIGINATING BODY ID",
"typedescription","kpdate","kpdateAsText","documentcollectionid","documentcollectionid2",
"languageisocode","extractedappno","isplaceholder","doctypebranch","RESPONDENT",
"respondentOrderEng","scl","ECLI","ORIGINATING BODY","YEAR","FULLTEXT","judges"]
t = Table.import_file(path, import_as='csv', columns={c:'f' for c in columns}, text_qualifier='"')
selection = columns[:5]
t.__getitem__(*selection).show()
PS: Note that your file test.csv is cut off mid-row (the last row has only 25 columns).
Here's what the output looks like on my machine:
from tablite.
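The header/column mismatch described above can be checked before importing. The following is a minimal sketch using only the standard library csv module, not tablite's own validation; the helper name field_count_report and the sample text are hypothetical and chosen for illustration.

```python
import csv
import io
from collections import Counter


def field_count_report(text, delimiter=",", text_qualifier='"'):
    """Return (number of header fields, Counter of field counts per data row)."""
    reader = csv.reader(
        io.StringIO(text), delimiter=delimiter, quotechar=text_qualifier
    )
    header = next(reader)  # first row is treated as the header
    counts = Counter(len(row) for row in reader)
    return len(header), counts


# Hypothetical sample: 3 headers, but every data row has fewer fields.
# The quoted field "x, y" contains the delimiter and still counts as one field.
sample = 'a,b,c\n1,"x, y"\n2,z\n3\n'
n_header, counts = field_count_report(sample)
print(n_header, dict(counts))
```

For a real file, the same reader can be fed from open(path, newline="", encoding=...) instead of io.StringIO. Any row length that differs from the header count points to a missing column or a truncated row.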
Hi @root-11, and thank you for your reply. First of all, yes, this CSV is rather ill-structured and is missing values in some columns. One of those is the column "court", as you correctly noticed.
I didn't know about the columns and text_qualifier parameters of import_file, and it is a real convenience that they are included!
I played a bit with the example script that you gave, setting selection = columns[-3:] to see how it works with the last few columns, which include FULLTEXT, the long one. However, I got the error in the attached file. After browsing the messages, it seems that lines 45-46 of the CSV cause an error, but it is not obvious why, and I couldn't see anything wrong in the CSV (though there may be something).
error.txt
That's Python's multiprocessing module crashing.
As the test suite runs Python 3.8 on Linux, just like you, this seems strange. Could it be a difference between your conda env and Python's own venv?
I run the script on macOS. Could this also be an issue with multiprocessing?
I'm not sure. Can you try to run the multiprocessing test suite in this script:
https://github.com/root-11/mplite/blob/main/tests/test_basics.py
If that doesn't work, I'll have to do a deeper dive into why macOS behaves differently.
I've added Windows and macOS to the test matrix, and the tests pass on all of them.
I changed to Python 3.9 as you suggested, but it now gives me the error in the attached file. My machine also has mamba installed, so the environment is now a mamba one, but I hope this is not a problem.
Note that I saw the same error when I removed the last two columns, which had some empty values, in case those caused issues.
tablite is at version 2022.10.08.
error.txt
Thanks for that Yannis! I'll look into that immediately.
So the error says that psutil.virtual_memory().free is zero.
Could you run this on your mac for me:
import psutil
psutil.virtual_memory().free
Thanks Bjorn, I just ran it and it gives this RuntimeError
from Python 3.8 bug
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
...
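For context: Python's multiprocessing raises this RuntimeError when a script starts worker processes at import time under the spawn start method, which has been the macOS default since Python 3.8. The usual remedy, which the full error message itself suggests, is to put the entry point behind an if __name__ == "__main__": guard. The following is a minimal sketch of that idiom, not code from the thread; the names square and main are hypothetical, and the thread does not show whether this was the resolution here.

```python
import multiprocessing as mp


def square(x):
    """Toy worker; it must be defined at module level so child processes can import it."""
    return x * x


def main():
    # The pool is created only when main() is called, never at import time.
    with mp.Pool(processes=2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    # Under spawn, each child re-imports this module; the guard keeps
    # the children from trying to start pools of their own.
    print(main())
```

The same applies to any script that calls a library which starts processes internally: the call belongs inside the guarded block, not at the top level of the module.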
@ypanagis - do you think we can close this ticket now?
Yes @root-11, that makes total sense to me. I will try the package some more, but this part is definitely over now.
Neat. Just FYI: I've released a new version today with slightly better memory management.
The details are in the changelog: https://github.com/root-11/tablite/blob/master/changelog.md