Comments (10)
Renaming the installed package name from table to tablite to match the project name?
Funny. I've just had the same conversation with another user. We decided the rename about 25 minutes ago.
from tablite.
In regard to HTML import. I'm totally okay with that. If you want to write the code and make a pull request I can merge it.
from tablite.
I've just pushed a refactor of the code in the latest commit. Could you have a look and tell me if it's easier to navigate?
from tablite.
Summary thus far:
- Great idea.
- Done.
- Done.
from tablite.
You can have the latest version from pip as I've just packaged it.
https://pypi.org/project/tablite/2021.2.18.60263/
from tablite.
@danieldjewell reg. the HTML import: Is it something like this you wanted to do:
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python
?
from tablite.
Funny. I've just had the same conversation with another user. We decided the rename about 25 minutes ago.
Hehe. It is. Must be thinking on the same wavelength. 😁 🍻
pushed the refactor
Yeah. Both the rename and refactoring helps loads. (The fact that the rename is a breaking change sucks, but I guess it's kinda like "Well, let's get this one over with because it's just going to be harder in the future." 😁 )
Is it better
I would suggest moving Table
out of __init__.py
as well - that way init can be used to import what you want to be publicly viewable (from what I've seen and done, to streamline & remove clutter).
For example -- if you do a:
dir(tablite)
... the result is nearly (?) everything in the package.
Imports
Semi Long Philosophical Paragraph Follows
Side note/philosophical note/disclaimer: I come from a pretty strong object oriented background (PHP/Java/C#/C++) - For whatever reason, I never really liked the Pythonic way of importing specific classes/functions from a package (e.g. from tablite import Table
). For me, it's somehow like a deadly sin to pollute the global namespace of a script like that. That's not meant to be a criticism of tablite
(even though the code does have some of that 😁 ) ... but more of an observation on the way I tend to use things - I personally very much like when I can do a one line import and keep everything nice and tidy: e.g. import numpy as np
, or import xarray as xr
, or import pandas as pd
... and not have to mess around with secondary imports.
That said, when a package imports nothing (see pyca/cryptography) I find it rather infuriating. (I personally do a lot of prototyping using either JupyterLab or more often an ipython session in a terminal - having to run multiple help()
commands to find the contents of a package is really a pain... (And I don't always have a GUI web browser to open API docs easily...)
tl;dr;: I recognize that my opinions may not be representative of what other Python developers might say. (The Python community seems to have lots of differing opinions on lots of things... )
That said, have a look at pandas' init.py and xarray's init.py (or even numpy's but that one has a lot more going on in it than is necessary to illustrate my point). Basically, everything that's needed for a user/developer to use/interact with the package is imported in to the root namespace. So after a import xarray as xr
you pretty much have every class/function/enum/whatever you'll need (and if there are more, they are usually static/class methods within some of the imported classes). The top of the cake is the __all__ list/tuple that defines what to import.
So, specifically for tablite
... Table, GroupBy, file_reader
definitely need to be in the "root" namespace. Not sure about the others. It looks like GroupBy already has imports setup -- which is great!
Lines 744 to 754 in 2072e8e
One interesting other way to do it would be to put the entire "from tablite.groupby_utils ...
" line as the first line inside of the GroupBy class declaration... That way they're local... something like:
class GroupBy(object):
from tablite.groupby_utils import GroupbyFunction, Max, Min, Sum, First, Last, Count, CountUnique, Average, StandardDeviation, Median, Mode
Of course this would be breaking if anyone is using tablite.Max
directly ... The other way to do it could be to leave it inside of GroupBy and add almost like an short-hand accessor
import tablite as tl
tab = tl.Table(...)
g = tab.groupby(keys=['a','b'],
functions=[('f', tl.gb.Max)])
## if "gb" is just an alias for the GroupBy class and the GroupBy functions (Max,Min, etc.) are imported into that class, this becomes possible:
g = tab.gb(keys=['a', 'b'],
.... )
### OR if you do something like
import tablite.GroupBy as gb
## then
g = tab.groupby(keys=['a','b'],
functions=[('f', gb.Max)])
### This is what I'm referring to when I mentioned the way I like to personally import things and not import things into the main global namespace
It might also be interesting to handle groupby a bit like pandas -- at least just the syntax -- e.g. df.groupby(['key1','key2]).agg(['min', 'mean', 'max'])
HTML Tables
reg. the HTML import: Is it something like this you wanted to do:
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python
Kinda. That particular code example is the general idea... And having sort of a 2-stage option would be really nice I think:
- The "lower level" interface could accept a
bs4.element.Tag
object that has been pre-processed to be (hopefully) just one HTML table object. (You'd still need options to help guide headers - or even overriding column names with a separate list, rowskip, and definitely the ability to specify datatypes for casting... kinda the same general requirements as for a CSV) - A higher-level "get all the tables" could accept a file pointer object (e.g. StringIO/BytesIO), a filename, or a str/bytes object with the raw HTML. Some preliminary parsing could attempt to find all of the tables and then essentially loop through using the "lower level" interface (above) and return a collection of some kind of
tablite.Table()
objects
It just now occurred to me that pandas actually has a read_html() function ... The trouble always is that there are so many ways to write an HTML table. Colspan? Rowspan? (And I'm not even going to go to that place where some tables are built using <dl> and <dt>
... sheesh)
[Sorry this got kinda long... I started writing and then it was like "oh! one more idea! oh! can't forget to mention this other thing..." 😁 ]
from tablite.
Hi Daniel,
TL;DR: I think you've made a good case. Let's get it right.
I've refactored into the branch name_space_review for you to review. All tests pass.
Imports
As the only things the user really needs are Table
and GroupBy
this should do:
__all__ = [Table, Groupby]
For pycharm users, the helper will give a compact presentation as shown below:
and for jupyter users dir
is compact and unpolluted:
>>> import tablite as tl
>>> dir(tl)
['GroupBy', 'Table', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'core', 'datatypes', 'file_reader_utils', 'groupby_utils', 'stored_list']
There are dependencies between file_reader
and Table
, so moving the file readers away from core.py will cause cyclic imports. However as file_reader
isn't in __all__
it won't appear and users can use the class method as always:
Table.from_file(path)
which uses the file_reader
s
Shorthand will also work fine: I've updated groupby_tests.py, but am unsure if there is a better way than this:
tablite/tests/groupby_tests.py
Lines 1 to 3 in cac3a90
tablite/tests/groupby_tests.py
Lines 34 to 47 in cac3a90
https://github.com/root-11/tablite/blob/name_space_review/tests/groupby_tests.py#L34
HTML reader.
As you want to extend the file_readers with your own html_reader you can use:
>>> def html_reader(path): # define the reader.
>>> # do magic
>>> return 1
>>> from tablite.core import readers
>>> readers['my_html_reader']= [html_reader, {}]
>>>for kv in readers.items():
>>> print(kv)
csv [<function text_reader at 0x0000020FFF373C18>, {}]
tsv [<function text_reader at 0x0000020FFF373C18>, {}]
txt [<function text_reader at 0x0000020FFF373C18>, {}]
xls [<function excel_reader at 0x0000020FFF299DC8>, {}]
xlsx [<function excel_reader at 0x0000020FFF299DC8>, {}]
xlsm [<function excel_reader at 0x0000020FFF299DC8>, {}]
ods [<function ods_reader at 0x0000020FFF299E58>, {}]
zip [<function zip_reader at 0x0000020FFF299EE8>, {}]
log [<function log_reader at 0x0000020FFF299F78>, {'sep': False}]
my_html_reader [<function html_reader at 0x0000020FFF3828B8>, {}] # <---------
....
Longer answer too, but in the face of ambiguity ten extra words may save ten days of work ;-)
from tablite.
@danieldjewell - Hi - I'm just picking up this thread. Are you still reviewing?
from tablite.
@danieldjewell
name_space_review
has been merged into master. closing ticket.
from tablite.
Related Issues (20)
- Join (reindexing) fails when table spans multiple pages HOT 2
- Documentation is out of sync HOT 1
- Determine method to handle out-of-memory for large joins. HOT 1
- Proposed format specification HOT 1
- multi proc groupby HOT 1
- multi proc join HOT 3
- Add warning in add_rows that is the slowest method HOT 1
- Deprecating support for python 3.8 in favor of type hints throughout the code HOT 1
- Columns with empty names HOT 2
- Table.load very slow with dtype('O') HOT 5
- Bloat in H5 storage following repeated SIGKILL HOT 3
- Statistics discrepancies in median/mode HOT 1
- Do Tablite Support different datasets Concurrently ? HOT 6
- Addition of match operator HOT 5
- HDF5 file size never decreases + concurrent interpreters can overwrite each others files. HOT 14
- sorting problem with datetime dt columns HOT 1
- Inconsistent row slice HOT 3
- Slow import of files with text escape HOT 16
- statistics() fails on time column HOT 2
- my first issue
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from tablite.