Giter Site home page Giter Site logo

Comments (10)

root-11 avatar root-11 commented on June 19, 2024

Renaming the installed package name from table to tablite to match the project name?

Funny. I've just had the same conversation with another user. We decided the rename about 25 minutes ago.

from tablite.

root-11 avatar root-11 commented on June 19, 2024

In regard to HTML import. I'm totally okay with that. If you want to write the code and make a pull request I can merge it.

from tablite.

root-11 avatar root-11 commented on June 19, 2024

I've just pushed a refactor of the code in the latest commit. Could you have a look and tell me if it's easier to navigate?

from tablite.

root-11 avatar root-11 commented on June 19, 2024

Summary thus far:

  1. Great idea.
  2. Done.
  3. Done.

from tablite.

root-11 avatar root-11 commented on June 19, 2024

You can have the latest version from pip as I've just packaged it.

https://pypi.org/project/tablite/2021.2.18.60263/

from tablite.

root-11 avatar root-11 commented on June 19, 2024

@danieldjewell reg. the HTML import: Is it something like this you wanted to do:
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python

?

from tablite.

danieldjewell avatar danieldjewell commented on June 19, 2024

Funny. I've just had the same conversation with another user. We decided the rename about 25 minutes ago.

Hehe. It is. Must be thinking on the same wavelength. 😁 🍻

pushed the refactor

Yeah. Both the rename and refactoring helps loads. (The fact that the rename is a breaking change sucks, but I guess it's kinda like "Well, let's get this one over with because it's just going to be harder in the future." 😁 )

Is it better

I would suggest moving Table out of __init__.py as well - that way init can be used to import what you want to be publicly viewable (from what I've seen and done, to streamline & remove clutter).

For example -- if you do a:

dir(tablite)

... the result is nearly (?) everything in the package.

Imports

Semi Long Philosophical Paragraph Follows

Side note/philosophical note/disclaimer: I come from a pretty strong object oriented background (PHP/Java/C#/C++) - For whatever reason, I never really liked the Pythonic way of importing specific classes/functions from a package (e.g. from tablite import Table). For me, it's somehow like a deadly sin to pollute the global namespace of a script like that. That's not meant to be a criticism of tablite (even though the code does have some of that 😁 ) ... but more of an observation on the way I tend to use things - I personally very much like when I can do a one line import and keep everything nice and tidy: e.g. import numpy as np, or import xarray as xr, or import pandas as pd ... and not have to mess around with secondary imports.

That said, when a package imports nothing (see pyca/cryptography) I find it rather infuriating. (I personally do a lot of prototyping using either JupyterLab or more often an ipython session in a terminal - having to run multiple help() commands to find the contents of a package is really a pain... (And I don't always have a GUI web browser to open API docs easily...)


tl;dr;: I recognize that my opinions may not be representative of what other Python developers might say. (The Python community seems to have lots of differing opinions on lots of things... )

That said, have a look at pandas' init.py and xarray's init.py (or even numpy's but that one has a lot more going on in it than is necessary to illustrate my point). Basically, everything that's needed for a user/developer to use/interact with the package is imported in to the root namespace. So after a import xarray as xr you pretty much have every class/function/enum/whatever you'll need (and if there are more, they are usually static/class methods within some of the imported classes). The top of the cake is the __all__ list/tuple that defines what to import.

So, specifically for tablite... Table, GroupBy, file_reader definitely need to be in the "root" namespace. Not sure about the others. It looks like GroupBy already has imports setup -- which is great!

tablite/tablite/__init__.py

Lines 744 to 754 in 2072e8e

max = Max # shortcuts to avoid having to type a long list of imports.
min = Min
sum = Sum
first = First
last = Last
count = Count
count_unique = CountUnique
avg = Average
stdev = StandardDeviation
median = Median
mode = Mode

One interesting other way to do it would be to put the entire "from tablite.groupby_utils ..." line as the first line inside of the GroupBy class declaration... That way they're local... something like:

class GroupBy(object):
    from tablite.groupby_utils import GroupbyFunction, Max, Min, Sum, First, Last, Count, CountUnique, Average, StandardDeviation, Median, Mode

Of course this would be breaking if anyone is using tablite.Max directly ... The other way to do it could be to leave it inside of GroupBy and add almost like an short-hand accessor

import tablite as tl 
tab = tl.Table(...) 
g = tab.groupby(keys=['a','b'],
               functions=[('f', tl.gb.Max)])

## if "gb" is just an alias for the GroupBy class and the GroupBy functions (Max,Min, etc.) are imported into that class, this becomes possible: 

g = tab.gb(keys=['a', 'b'], 
        .... )

### OR if you do something like
import tablite.GroupBy as gb 
## then 
g = tab.groupby(keys=['a','b'],
               functions=[('f', gb.Max)])

### This is what I'm referring to when I mentioned the way I like to personally import things and not import things into the main global namespace 

It might also be interesting to handle groupby a bit like pandas -- at least just the syntax -- e.g. df.groupby(['key1','key2]).agg(['min', 'mean', 'max'])

HTML Tables

reg. the HTML import: Is it something like this you wanted to do:
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python

Kinda. That particular code example is the general idea... And having sort of a 2-stage option would be really nice I think:

  • The "lower level" interface could accept a bs4.element.Tag object that has been pre-processed to be (hopefully) just one HTML table object. (You'd still need options to help guide headers - or even overriding column names with a separate list, rowskip, and definitely the ability to specify datatypes for casting... kinda the same general requirements as for a CSV)
  • A higher-level "get all the tables" could accept a file pointer object (e.g. StringIO/BytesIO), a filename, or a str/bytes object with the raw HTML. Some preliminary parsing could attempt to find all of the tables and then essentially loop through using the "lower level" interface (above) and return a collection of some kind of tablite.Table() objects

It just now occurred to me that pandas actually has a read_html() function ... The trouble always is that there are so many ways to write an HTML table. Colspan? Rowspan? (And I'm not even going to go to that place where some tables are built using <dl> and <dt> ... sheesh)

[Sorry this got kinda long... I started writing and then it was like "oh! one more idea! oh! can't forget to mention this other thing..." 😁 ]

from tablite.

root-11 avatar root-11 commented on June 19, 2024

Hi Daniel,

TL;DR: I think you've made a good case. Let's get it right.

I've refactored into the branch name_space_review for you to review. All tests pass.

Imports

As the only things the user really needs are Table and GroupBy this should do:

__all__ = [Table, Groupby]

For pycharm users, the helper will give a compact presentation as shown below:

image

and for jupyter users dir is compact and unpolluted:

>>> import tablite as tl
>>> dir(tl)
['GroupBy', 'Table', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'core', 'datatypes', 'file_reader_utils', 'groupby_utils', 'stored_list']

There are dependencies between file_reader and Table, so moving the file readers away from core.py will cause cyclic imports. However as file_reader isn't in __all__ it won't appear and users can use the class method as always:

Table.from_file(path)

which uses the file_readers


Shorthand will also work fine: I've updated groupby_tests.py, but am unsure if there is a better way than this:

from tablite import Table, GroupBy
gb = GroupBy

g = GroupBy(keys=['a', 'b'],
functions=[('f', gb.max),
('f', gb.min),
('f', gb.sum),
('f', gb.first),
('f', gb.last),
('f', gb.count),
('f', gb.count_unique),
('f', gb.avg),
('f', gb.stdev),
('a', gb.stdev),
('f', gb.median),
('f', gb.mode),
('g', gb.median)])

https://github.com/root-11/tablite/blob/name_space_review/tests/groupby_tests.py#L34

HTML reader.

As you want to extend the file_readers with your own html_reader you can use:

>>> def html_reader(path):   # define the reader.
>>>     # do magic
>>>     return 1

>>> from tablite.core import readers

>>> readers['my_html_reader']= [html_reader, {}]

>>>for kv in readers.items():
>>>    print(kv)
    
csv [<function text_reader at 0x0000020FFF373C18>, {}]
tsv [<function text_reader at 0x0000020FFF373C18>, {}]
txt [<function text_reader at 0x0000020FFF373C18>, {}]
xls [<function excel_reader at 0x0000020FFF299DC8>, {}]
xlsx [<function excel_reader at 0x0000020FFF299DC8>, {}]
xlsm [<function excel_reader at 0x0000020FFF299DC8>, {}]
ods [<function ods_reader at 0x0000020FFF299E58>, {}]
zip [<function zip_reader at 0x0000020FFF299EE8>, {}]
log [<function log_reader at 0x0000020FFF299F78>, {'sep': False}]
my_html_reader [<function html_reader at 0x0000020FFF3828B8>, {}]  # <---------

....

Longer answer too, but in the face of ambiguity ten extra words may save ten days of work ;-)

from tablite.

root-11 avatar root-11 commented on June 19, 2024

@danieldjewell - Hi - I'm just picking up this thread. Are you still reviewing?

from tablite.

root-11 avatar root-11 commented on June 19, 2024

@danieldjewell
name_space_review has been merged into master. closing ticket.

from tablite.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.