stestagg / pytubes

A module for getting data into python from large data sources

License: MIT License

Makefile 1.09% Python 18.95% C++ 66.98% Shell 0.58% Batchfile 0.22% Dockerfile 0.24% Cython 11.94%

python cython data numpy cpp11 cpp

pytubes's Introduction

pytubes

Source: https://github.com/stestagg/pytubes

Pytubes is a library that optimizes loading datasets into memory.

At its core is a set of specialized C++ classes that can be chained together to load and manipulate data using a standard iterator pattern. Around this is a Cython extension module that makes defining and configuring a tube simple and straightforward.
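To illustrate the chained-iterator idea in plain Python, here is a minimal sketch. The `Pipe` class and its `map` and `filter` methods are hypothetical names invented for this illustration; they are not part of pytubes, whose stages are implemented in C++.

```python
import json


class Pipe:
    """Hypothetical wrapper: each method returns a new lazy stage."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        return iter(self._iterable)

    def map(self, func):
        # Lazily apply func to every item from the upstream stage.
        return Pipe(func(item) for item in self._iterable)

    def filter(self, predicate):
        # Lazily keep only the items for which predicate is true.
        return Pipe(item for item in self._iterable if predicate(item))


rows = ['{"country_code": "GB"}', '{"country_code": "FR"}', '{}']
codes = (Pipe(rows)
         .map(json.loads)                                   # parse each line
         .map(lambda row: row.get("country_code", "null"))  # extract a field
         .filter(lambda code: code != "null"))              # drop missing values
result = set(codes)  # nothing runs until the chain is consumed here
```

Each stage only pulls items from the one before it when the final consumer iterates, which is the property that lets a chain process large inputs without holding them all in memory.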

Simple Example

>>> from tubes import Each
>>> import glob
>>> tube = (Each(glob.glob("*.json"))   # Iterate over some filenames
...     .read_files()                   # Read each file, chunk by chunk
...     .split()                        # Split the file, line-by-line
...     .json()                         # parse json
...     .get('country_code', 'null'))   # extract field named 'country_code'
>>> set(tube)                           # collect results in a set
{'A1', 'AD', 'AE', 'AF', 'AG', 'AL', 'AM', 'AO', 'AP', ...}

More Complex Example

>>> from tubes import Each
>>> import glob

>>> x = (Each(glob.glob('*.jsonz'))
...     .map_files()
...     .gunzip()
...     .split(b'\n')
...     .json()
...     .enumerate()
...     .skip_unless(lambda x: x.slot(1).get('country_code', '""').to(str).equals('GB'))
...     .multi(lambda x: (
...         x.slot(0),
...         x.slot(1).get('timestamp', 'null'),
...         x.slot(1).get('country_code', 'null'),
...         x.slot(1).get('url', 'null'),
...         x.slot(1).get('file', '{}').get('filename', 'null'),
...         x.slot(1).get('file', '{}').get('project'),
...         x.slot(1).get('details', '{}').get('installer', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('python', 'null'),
...         x.slot(1).get('details', '{}').get('system', 'null'),
...         x.slot(1).get('details', '{}').get('system', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('cpu', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('lib', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('version', 'null'),
...     ))
... )

>>> print(list(x)[-3])
(15,612,767, '2017-12-14 09:33:31 UTC', 'GB', '/packages/29/9b/25ef61e948321296f029f53c9f67cc2b54e224db509eb67ce17e0df6044a/certifi-2017.11.5-py2.py3-none-any.whl', 'certifi-2017.11.5-py2.py3-none-any.whl', 'certifi', 'pip', '2.7.5', {'name': 'Linux', 'release': '2.6.32-696.10.3.el6.x86_64'}, 'Linux', 'x86_64', 'glibc', '2.17')

pytubes's Issues

Windows build

Currently, pytubes builds fine on Windows (on the winbuild branch), but it needs a fairly modern VC++. I haven't yet been able to work out the magic incantation to make AppVeyor use the right VC++, so binary wheels of pytubes for Windows aren't available at the moment.

gzip decoder should automatically try to start new stream on error

If the first decode of an upstream slice fails, it would make sense to reset the decoder and try again. This would make using read_files() across many gzipped files much simpler, as the crossover between files would be easier to handle.
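As a rough illustration of restarting a decoder at a stream boundary, here is a sketch using only Python's standard `zlib` module. It is not the pytubes C++ decoder, and it swaps the proposed "reset on error" approach for a closely related one: it detects the end of a gzip stream via `eof` and feeds the leftover bytes (`unused_data`) to a fresh decoder. The function name `gunzip_concatenated` is made up for this example.

```python
import gzip
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # tell zlib to expect a gzip header


def gunzip_concatenated(chunks):
    """Yield decompressed bytes from chunks holding back-to-back gzip streams."""
    decoder = zlib.decompressobj(GZIP_WBITS)
    for chunk in chunks:
        data = chunk
        while data:
            out = decoder.decompress(data)
            if out:
                yield out
            if decoder.eof:
                # The current stream ended inside this chunk: hand the
                # remaining bytes to a fresh decoder for the next stream.
                data = decoder.unused_data
                decoder = zlib.decompressobj(GZIP_WBITS)
            else:
                data = b""


# Two gzip files read "across the crossover" in small arbitrary chunks.
payload = gzip.compress(b"first\n") + gzip.compress(b"second\n")
chunks = [payload[i:i + 7] for i in range(0, len(payload), 7)]
decoded = b"".join(gunzip_concatenated(chunks))
```

Restarting on end-of-stream rather than on a decode error avoids having to guess whether a failure means "new file" or "corrupt data", though either policy could be layered on top.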

CSV support not working

I installed the latest version with pip install pytubes and when I attempt to run the example in the docs list(Each(['a,b', 'c,d']).csv()) I get AttributeError: 'tubes.Each' object has no attribute 'csv'

Looking at PyPI, it looks like 0.6.0 was published in March 2018, and some commits related to CSV support were made in April 2018. #17

So I imagine it's related to that. @stestagg could you publish a new version to PyPI?

I also cloned the master and tried using python setup.py install and got the error pyx/ctubes.pyx:275:15: undeclared name not builtin: ReadFileObj

After removing read_fileobj, since it looks like it's not used anywhere, and running setup.py again, I got another error: ar: no archive members specified

Direct Json to python fails to clear temporary buffer with escaped values between rows

When pytubes converts JSON string values directly to Python values (with no explicit conversion specified) and the string value contains '\' escaped values, a temporary buffer is used to hold the decoded string data. This buffer is reused between iterations to improve performance but isn't cleared before reuse, so consecutive rows with escaped JSON values in the same slot result in the values being concatenated on the second row.
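The failure mode described here can be reproduced in miniature in plain Python. This is only a model of the bug, not the pytubes C++ code: the `EscapedStringDecoder` class, its `clear_between_rows` flag, and the small escape table are all invented for illustration, and unlike the real decoder it routes every string through the buffer.

```python
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


class EscapedStringDecoder:
    """Hypothetical decoder that reuses one scratch buffer across rows."""

    def __init__(self, clear_between_rows):
        self._buffer = []                 # reused between calls for speed
        self._clear = clear_between_rows  # False reproduces the bug

    def decode(self, raw):
        if self._clear:
            self._buffer.clear()          # the missing step in the buggy case
        i = 0
        while i < len(raw):
            if raw[i] == "\\" and i + 1 < len(raw):
                # Translate a two-character escape into one character.
                self._buffer.append(ESCAPES.get(raw[i + 1], raw[i + 1]))
                i += 2
            else:
                self._buffer.append(raw[i])
                i += 1
        return "".join(self._buffer)


buggy = EscapedStringDecoder(clear_between_rows=False)
buggy_rows = [buggy.decode("a\\nb"), buggy.decode("c\\td")]

fixed = EscapedStringDecoder(clear_between_rows=True)
fixed_rows = [fixed.decode("a\\nb"), fixed.decode("c\\td")]
```

In the buggy variant the second row comes back prefixed with the first row's decoded text, which matches the concatenation symptom reported above; clearing the buffer at the start of each decode removes it.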

Can't import tubes Reason: image not found. Library not loaded: @rpath/libarrow.12.dylib

Data type conversion between python bool and pytubes Bool

Hi, I need to use a custom filter function with the pytubes skip_if function. Specifically, it is a complicated lambda expression that makes more checks than 'lambda x: x.is_blank()'. For example, for the string list ['apple','banana','cherry'], I want to get the strings that do not contain the character 'a'. With the help of pytubes, I wrote code like this:
(Each(['apple', 'banana', ''])
    .enumerate()
    .skip_if(lambda x: contain_a(x))
)
Here, contain_a is a function that returns a Python bool indicating whether the string contains the character:
def contain_a(x):
    if 'a' in x:
        return True
    else:
        return False

However, I got the error "TypeError: Argument 'conditional' has incorrect type (expected ctubes.Tube, got bool)". It means the skip_if function of pytubes is not compatible with Python bool, only with tubes.Bool. Does pytubes provide a conversion between Python bool and tubes.Bool?
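The error message suggests that the lambda passed to skip_if must return a tube expression rather than a Python value. One workaround, sketched below under that assumption, is to apply the arbitrary Python predicate after the values have left the tube, using ordinary iteration. The plain list here stands in for any iterable, including an iterated tube; no pytubes API is used or implied.

```python
def contain_a(text):
    """Return True when the string contains the character 'a'."""
    return "a" in text


fruits = ["apple", "banana", "cherry"]

# Filter in Python instead of inside the tube: keep (index, value) pairs
# whose value does not contain 'a'.
kept = [(index, fruit)
        for index, fruit in enumerate(fruits)
        if not contain_a(fruit)]
```

This gives up the speed of doing the test inside the C++ pipeline, so it is best suited to predicates that cannot be expressed with the tube methods that are available.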

https://docs.pytubes.com/ is BAD, where is doc? thanks!

Support 1, 2, and 4 byte integers in ndarray() call

Currently, all integers translate to I8 dtypes in numpy, which can be wasteful.

For strings, we already have the slot_info tuple for specifying the string length, so we can use that to restrict the size of the array type (but we'll need some checking on the C++ filler side, so that values of the correct size are copied in).

The same could be done for float/double types too.
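As a small sketch of the size-selection side of this idea, the helper below picks the narrowest signed integer width that can hold a known value range. The function name `smallest_int_dtype` is hypothetical; the returned strings ('i1', 'i2', 'i4', 'i8') are the standard NumPy dtype codes for 1, 2, 4, and 8 byte signed integers. How such a choice would be passed to ndarray() is not specified in the issue and is not assumed here.

```python
# (dtype code, width in bits), narrowest first
INT_WIDTHS = (("i1", 8), ("i2", 16), ("i4", 32), ("i8", 64))


def smallest_int_dtype(low, high):
    """Return the narrowest signed dtype code that holds low..high inclusive."""
    for code, bits in INT_WIDTHS:
        minimum = -(1 << (bits - 1))
        maximum = (1 << (bits - 1)) - 1
        if minimum <= low and high <= maximum:
            return code
    raise OverflowError("range does not fit in a 64-bit signed integer")


status_dtype = smallest_int_dtype(0, 100)        # fits in one byte
offset_dtype = smallest_int_dtype(0, 3_000_000)  # needs four bytes
```

For a column of small values, a 1 byte dtype uses one eighth of the memory of an 8 byte one, which is the waste the issue refers to.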

Support non-utf8 conversion between bytes and utf8

how to skip blank line when reading json file

pytubes is a great package that supports flexible data iteration, thank you @stestagg. I have read the pytubes docs. You already implement the skip_unless and equals functions for us, so why not offer a skip function similar to skip_unless? There is a skip function, which skips specific rows. I want to use skip_unless with a not-equal function to skip blank lines, using an expression such as "skip_unless(lambda x: x.slot(1).no_equal('null'))".
However, I cannot achieve this with the interfaces pytubes provides. I would be very grateful if anyone could give me some help.
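For comparison, this is how blank lines can be skipped when parsing line-delimited JSON in plain Python with the standard `json` module. It is a sketch of the desired behaviour only; the generator name `parse_json_lines` is made up, and it does not show or assume any pytubes method.

```python
import json


def parse_json_lines(lines):
    """Yield one parsed object per non-blank line."""
    for line in lines:
        if not line.strip():
            continue  # skip empty and whitespace-only lines
        yield json.loads(line)


lines = ['{"id": 1}', '', '   ', '{"id": 2}']
records = list(parse_json_lines(lines))
```

The blank-line test runs before parsing, so empty lines never reach the JSON decoder and cannot raise a parse error.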

Add support for converting between doubles and bytes/UTF8

This seems like a useful and obvious thing to have