Just stumbled across tablite. Really cool, definitely can agree with the use case (pan


Load HTML and/or Consider Refactoring (tablite issue, closed)

root-11 commented on July 19, 2024

Load HTML and/or Consider Refactoring

from tablite.

Comments (10)

root-11 commented on July 19, 2024

Renaming the installed package name from table to tablite to match the project name?

Funny. I've just had the same conversation with another user. We decided on the rename about 25 minutes ago.

from tablite.

root-11 commented on July 19, 2024

In regard to HTML import. I'm totally okay with that. If you want to write the code and make a pull request I can merge it.

from tablite.

root-11 commented on July 19, 2024

I've just pushed a refactor of the code in the latest commit. Could you have a look and tell me if it's easier to navigate?

from tablite.

root-11 commented on July 19, 2024

Summary thus far:

Great idea.
Done.
Done.

from tablite.

root-11 commented on July 19, 2024

You can have the latest version from pip as I've just packaged it.

https://pypi.org/project/tablite/2021.2.18.60263/

from tablite.

root-11 commented on July 19, 2024

@danieldjewell reg. the HTML import: Is it something like this you wanted to do:
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python

from tablite.

danieldjewell commented on July 19, 2024

Funny. I've just had the same conversation with another user. We decided on the rename about 25 minutes ago.

Hehe. It is. Must be thinking on the same wavelength. 😁 🍻

pushed the refactor

Yeah. Both the rename and the refactoring help loads. (The fact that the rename is a breaking change sucks, but I guess it's kinda like "Well, let's get this one over with because it's just going to be harder in the future." 😁 )

Is it better

I would suggest moving Table out of __init__.py as well - that way __init__.py can be used to import only what you want to be publicly visible (from what I've seen and done, this streamlines things and removes clutter).

For example -- if you do a:

dir(tablite)

... the result is nearly (?) everything in the package.

Imports

Semi Long Philosophical Paragraph Follows

Side note/philosophical note/disclaimer: I come from a pretty strong object oriented background (PHP/Java/C#/C++) - For whatever reason, I never really liked the Pythonic way of importing specific classes/functions from a package (e.g. from tablite import Table). For me, it's somehow like a deadly sin to pollute the global namespace of a script like that. That's not meant to be a criticism of tablite (even though the code does have some of that 😁 ) ... but more of an observation on the way I tend to use things - I personally very much like when I can do a one line import and keep everything nice and tidy: e.g. import numpy as np, or import xarray as xr, or import pandas as pd ... and not have to mess around with secondary imports.

That said, when a package imports nothing (see pyca/cryptography) I find it rather infuriating. (I personally do a lot of prototyping using either JupyterLab or, more often, an ipython session in a terminal - having to run multiple help() commands to find the contents of a package is really a pain... And I don't always have a GUI web browser to open API docs easily...)

tl;dr: I recognize that my opinions may not be representative of what other Python developers might say. (The Python community seems to have lots of differing opinions on lots of things... )

That said, have a look at pandas' __init__.py and xarray's __init__.py (or even numpy's, but that one has a lot more going on in it than is necessary to illustrate my point). Basically, everything that's needed for a user/developer to use/interact with the package is imported into the root namespace. So after an import xarray as xr you pretty much have every class/function/enum/whatever you'll need (and if there are more, they are usually static/class methods within some of the imported classes). The icing on the cake is the __all__ list/tuple that defines what a from package import * picks up.
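To make the root-namespace idea concrete, here is a minimal sketch that builds a throwaway package on disk and imports it. The package name demo_pkg, its core.py module and the _internal_helper function are all made up for illustration; this is not tablite's actual layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk. "demo_pkg" and everything in it
# is hypothetical and only serves to illustrate the pattern.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "core.py").write_text(
    "class Table:\n"
    "    pass\n"
    "\n"
    "class GroupBy:\n"
    "    pass\n"
    "\n"
    "def _internal_helper():\n"
    "    pass\n"
)
# __init__.py re-exports the public classes and lists them in __all__.
(pkg / "__init__.py").write_text(
    "from demo_pkg.core import Table, GroupBy\n"
    "\n"
    "__all__ = ['Table', 'GroupBy']\n"
)

sys.path.insert(0, str(root))
demo_pkg = importlib.import_module("demo_pkg")

# One import is enough to reach the public classes.
table = demo_pkg.Table()

# A star import copies only the names listed in __all__.
namespace = {}
exec("from demo_pkg import *", namespace)
public_names = sorted(name for name in namespace if not name.startswith("__"))
print(public_names)
```

Note that __all__ only governs the star import. Submodules that have been imported (core in this sketch) are still attributes of the package and still show up in dir(demo_pkg).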

So, specifically for tablite... Table, GroupBy, file_reader definitely need to be in the "root" namespace. Not sure about the others. It looks like GroupBy already has imports setup -- which is great!

tablite/tablite/__init__.py

Lines 744 to 754 in 2072e8e

    
           max = Max  # shortcuts to avoid having to type a long list of imports. 
        
           min = Min 
        
           sum = Sum 
        
           first = First 
        
           last = Last 
        
           count = Count 
        
           count_unique = CountUnique 
        
           avg = Average 
        
           stdev = StandardDeviation 
        
           median = Median 
        
           mode = Mode

One interesting other way to do it would be to put the entire "from tablite.groupby_utils ..." line as the first line inside of the GroupBy class declaration... That way they're local... something like:

class GroupBy(object):
    from tablite.groupby_utils import GroupbyFunction, Max, Min, Sum, First, Last, Count, CountUnique, Average, StandardDeviation, Median, Mode

Of course this would be a breaking change if anyone is using tablite.Max directly ... The other way to do it could be to leave it inside of GroupBy and add something like a shorthand accessor:

import tablite as tl 
tab = tl.Table(...) 
g = tab.groupby(keys=['a','b'],
               functions=[('f', tl.gb.Max)])

## if "gb" is just an alias for the GroupBy class and the GroupBy functions (Max,Min, etc.) are imported into that class, this becomes possible: 

g = tab.gb(keys=['a', 'b'], 
        .... )

### OR if you do something like
from tablite import GroupBy as gb  # GroupBy is a class, so "import tablite.GroupBy as gb" would fail
## then 
g = tab.groupby(keys=['a','b'],
               functions=[('f', gb.Max)])

### This is what I'm referring to when I mentioned the way I like to personally import things and not import things into the main global namespace
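The shorthand idea above can be sketched with plain class attributes. The Max and Min classes here are simplified stand-ins, and their update() method is an assumption made for the example, not tablite's real aggregation API.

```python
class Max:
    """Hypothetical stand-in for an aggregation class."""

    def __init__(self):
        self.value = None

    def update(self, value):
        if self.value is None or value > self.value:
            self.value = value


class Min:
    """Hypothetical stand-in for an aggregation class."""

    def __init__(self):
        self.value = None

    def update(self, value):
        if self.value is None or value < self.value:
            self.value = value


class GroupBy:
    # Class-level aliases: the aggregation classes become reachable as
    # GroupBy.max / GroupBy.min without separate imports. Inside the class
    # body these names shadow the builtins, but only in that scope.
    max = Max
    min = Min


gb = GroupBy  # short alias, in the spirit of "import numpy as np"

highest = gb.max()  # looks up the Max class on GroupBy and instantiates it
for value in [3, 9, 4]:
    highest.update(value)
print(highest.value)
```

Because the aliases live on the class, a caller who only has GroupBy in scope can reach every aggregation through it, and the module namespace stays small.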

It might also be interesting to handle groupby a bit like pandas -- at least just the syntax -- e.g. df.groupby(['key1','key2']).agg(['min', 'mean', 'max'])
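A rough, pure-Python sketch of what such a chained syntax could look like. MiniTable and Grouped are hypothetical names, and unlike pandas this agg() takes the column to aggregate explicitly, so treat it as an illustration of the call shape only.

```python
from collections import defaultdict
from statistics import mean

# Aggregation names mapped to plain functions (an assumption for this sketch).
AGGREGATIONS = {"min": min, "max": max, "mean": mean}


class Grouped:
    """Result of MiniTable.groupby(): rows bucketed by their key tuple."""

    def __init__(self, groups):
        self._groups = groups  # {key tuple: list of row dicts}

    def agg(self, column, names):
        result = {}
        for key, rows in self._groups.items():
            values = [row[column] for row in rows]
            result[key] = {name: AGGREGATIONS[name](values) for name in names}
        return result


class MiniTable:
    """Hypothetical table holding rows as a list of dicts."""

    def __init__(self, rows):
        self.rows = rows

    def groupby(self, keys):
        groups = defaultdict(list)
        for row in self.rows:
            groups[tuple(row[k] for k in keys)].append(row)
        return Grouped(dict(groups))


rows = [
    {"a": 1, "b": "x", "f": 10},
    {"a": 1, "b": "x", "f": 30},
    {"a": 2, "b": "y", "f": 5},
]
summary = MiniTable(rows).groupby(["a", "b"]).agg("f", ["min", "mean", "max"])
print(summary)
```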

HTML Tables

reg. the HTML import: Is it something like this you wanted to do:
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python

Kinda. That particular code example is the general idea... And having sort of a 2-stage option would be really nice I think:

The "lower level" interface could accept a bs4.element.Tag object that has been pre-processed to be (hopefully) just one HTML table object. (You'd still need options to help guide headers - or even overriding column names with a separate list, rowskip, and definitely the ability to specify datatypes for casting... kinda the same general requirements as for a CSV)
A higher-level "get all the tables" could accept a file pointer object (e.g. StringIO/BytesIO), a filename, or a str/bytes object with the raw HTML. Some preliminary parsing could attempt to find all of the tables and then essentially loop through using the "lower level" interface (above) and return a collection of some kind of tablite.Table() objects

It just now occurred to me that pandas actually has a read_html() function ... The trouble always is that there are so many ways to write an HTML table. Colspan? Rowspan? (And I'm not even going to go to that place where some tables are built using <dl> and <dt> ... sheesh)
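The two-stage idea can be sketched as follows. To keep the example free of third-party dependencies it uses the standard library's html.parser instead of BeautifulSoup, so the lower-level step takes a list of rows rather than a bs4 Tag. The function names are made up, and colspan, rowspan, nested tables and datatype casting are deliberately left out.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def read_html_tables(html):
    """Higher-level step: return every table found in a raw HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def rows_to_columns(rows, names=None):
    """Lower-level step: one table's rows -> {column name: list of values}.

    If names is None, the first row is treated as the header.
    """
    if names is None:
        names, rows = rows[0], rows[1:]
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


html = """
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
<table>
  <tr><td>x</td><td>1</td></tr>
</table>
"""
tables = read_html_tables(html)
first = rows_to_columns(tables[0])
second = rows_to_columns(tables[1], names=["key", "value"])
print(first)
print(second)
```

All values stay strings here; a real reader would still need the casting and row-skipping options mentioned above.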

[Sorry this got kinda long... I started writing and then it was like "oh! one more idea! oh! can't forget to mention this other thing..." 😁 ]

from tablite.

root-11 commented on July 19, 2024

Hi Daniel,

TL;DR: I think you've made a good case. Let's get it right.

I've refactored into the branch name_space_review for you to review. All tests pass.

Imports

As the only things the user really needs are Table and GroupBy this should do:

__all__ = ["Table", "GroupBy"]

For pycharm users, the helper will give a compact presentation as shown below:

and for jupyter users dir is compact and unpolluted:

>>> import tablite as tl
>>> dir(tl)
['GroupBy', 'Table', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'core', 'datatypes', 'file_reader_utils', 'groupby_utils', 'stored_list']

There are dependencies between file_reader and Table, so moving the file readers away from core.py will cause cyclic imports. However as file_reader isn't in __all__ it won't appear and users can use the class method as always:

Table.from_file(path)

which uses the file_readers

Shorthand will also work fine: I've updated groupby_tests.py, but am unsure if there is a better way than this:

tablite/tests/groupby_tests.py

Lines 1 to 3 in cac3a90

    
           from tablite import Table, GroupBy 
        
           gb = GroupBy

tablite/tests/groupby_tests.py

Lines 34 to 47 in cac3a90

    
           g = GroupBy(keys=['a', 'b'], 
        
                       functions=[('f', gb.max), 
        
                                  ('f', gb.min), 
        
                                  ('f', gb.sum), 
        
                                  ('f', gb.first), 
        
                                  ('f', gb.last), 
        
                                  ('f', gb.count), 
        
                                  ('f', gb.count_unique), 
        
                                  ('f', gb.avg), 
        
                                  ('f', gb.stdev), 
        
                                  ('a', gb.stdev), 
        
                                  ('f', gb.median), 
        
                                  ('f', gb.mode), 
        
                                  ('g', gb.median)])

https://github.com/root-11/tablite/blob/name_space_review/tests/groupby_tests.py#L34

HTML reader.

As you want to extend the file_readers with your own html_reader you can use:

>>> def html_reader(path):   # define the reader.
...     # do magic
...     return 1
...
>>> from tablite.core import readers
>>> readers['my_html_reader'] = [html_reader, {}]
>>> for key, value in readers.items():
...     print(key, value)
...
csv [<function text_reader at 0x0000020FFF373C18>, {}]
tsv [<function text_reader at 0x0000020FFF373C18>, {}]
txt [<function text_reader at 0x0000020FFF373C18>, {}]
xls [<function excel_reader at 0x0000020FFF299DC8>, {}]
xlsx [<function excel_reader at 0x0000020FFF299DC8>, {}]
xlsm [<function excel_reader at 0x0000020FFF299DC8>, {}]
ods [<function ods_reader at 0x0000020FFF299E58>, {}]
zip [<function zip_reader at 0x0000020FFF299EE8>, {}]
log [<function log_reader at 0x0000020FFF299F78>, {'sep': False}]
my_html_reader [<function html_reader at 0x0000020FFF3828B8>, {}]  # <---------

....
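For readers unfamiliar with this kind of registry, here is a self-contained sketch of the pattern. The registry shape (name mapped to [function, kwargs]) follows the printed output above; the dispatch on file extension, the read_file helper and both reader bodies are assumptions made for the example, not tablite's actual code.

```python
import os
import tempfile
from pathlib import Path


def text_reader(path, sep=","):
    """Hypothetical stand-in: split each line of a text file on `sep`."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n").split(sep) for line in handle]


def html_reader(path):
    """Hypothetical stand-in for a user-supplied reader."""
    return [["parsed", "from", Path(path).name]]


# Same shape as the registry printed above: name -> [function, kwargs].
readers = {
    "csv": [text_reader, {}],
    "tsv": [text_reader, {"sep": "\t"}],
}

# Registering a new reader is a single dictionary assignment.
readers["html"] = [html_reader, {}]


def read_file(path):
    """Pick a reader by file extension (assumed dispatch rule) and call it."""
    extension = Path(path).suffix.lstrip(".").lower()
    if extension not in readers:
        raise ValueError(f"no reader registered for .{extension}")
    function, kwargs = readers[extension]
    return function(path, **kwargs)


with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "data.csv")
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write("a,b\n1,2\n")
    csv_rows = read_file(csv_path)
    html_rows = read_file(os.path.join(folder, "page.html"))

print(csv_rows)
print(html_rows)
```

If the lookup really is by extension, registering under 'html' rather than 'my_html_reader' would be the key that matches .html files; that is a guess from the other keys in the printed registry.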

Longer answer too, but in the face of ambiguity ten extra words may save ten days of work ;-)

from tablite.

root-11 commented on July 19, 2024

@danieldjewell - Hi - I'm just picking up this thread. Are you still reviewing?

from tablite.

root-11 commented on July 19, 2024

@danieldjewell
name_space_review has been merged into master. closing ticket.

from tablite.
