
stestagg / pytubes


A module for getting data into python from large data sources

License: MIT License

Makefile 1.09% Python 18.95% C++ 66.98% Shell 0.58% Batchfile 0.22% Dockerfile 0.24% Cython 11.94%
python cython data numpy cpp11 cpp

pytubes's Introduction

pytubes

Source: https://github.com/stestagg/pytubes

Pytubes is a library that optimizes loading datasets into memory.

At its core is a set of specialized C++ classes that can be chained together to load and manipulate data using a standard iterator pattern. Around this there is a Cython extension module that makes defining and configuring a tube simple and straightforward.

Simple Example

>>> from tubes import Each
>>> import glob
>>> tube = (Each(glob.glob("*.json"))    # Iterate over some filenames
...     .read_files()                    # Read each file, chunk by chunk
...     .split()                         # Split the file, line-by-line
...     .json()                          # parse json
...     .get('country_code', 'null'))    # extract field named 'country_code'
>>> set(tube)                            # collect results in a set
{'A1', 'AD', 'AE', 'AF', 'AG', 'AL', 'AM', 'AO', 'AP', ...}
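The chained-iterator idea can be illustrated without pytubes at all. The sketch below is a plain-Python analogue built from generators. The helper names (split_lines, parse_json, get_field) are hypothetical and are not part of the pytubes API, the input is an in-memory list rather than files, and it assumes every chunk ends on a line boundary. It makes no attempt to match pytubes's performance.

```python
import json


def split_lines(chunks):
    """Yield each line from an iterable of text chunks."""
    for chunk in chunks:
        # Assumption: a chunk never ends in the middle of a line.
        for line in chunk.splitlines():
            yield line


def parse_json(lines):
    """Parse each line as one JSON document."""
    for line in lines:
        yield json.loads(line)


def get_field(records, name, default=None):
    """Extract one field from each parsed record."""
    for record in records:
        yield record.get(name, default)


# Each stage wraps the previous one; nothing runs until the result is consumed.
chunks = ['{"country_code": "GB"}\n{"country_code": "AD"}\n', '{"other": 1}\n']
pipeline = get_field(parse_json(split_lines(chunks)), "country_code")
result = set(pipeline)
```

Each generator pulls one item at a time from the stage before it, which is the same pull-based shape the tube chain above expresses with method calls.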

More Complex Example

>>> from tubes import Each
>>> import glob
>>> x = (Each(glob.glob('*.jsonz'))
...     .map_files()
...     .gunzip()
...     .split(b'\n')
...     .json()
...     .enumerate()
...     .skip_unless(lambda x: x.slot(1).get('country_code', '""').to(str).equals('GB'))
...     .multi(lambda x: (
...         x.slot(0),
...         x.slot(1).get('timestamp', 'null'),
...         x.slot(1).get('country_code', 'null'),
...         x.slot(1).get('url', 'null'),
...         x.slot(1).get('file', '{}').get('filename', 'null'),
...         x.slot(1).get('file', '{}').get('project'),
...         x.slot(1).get('details', '{}').get('installer', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('python', 'null'),
...         x.slot(1).get('details', '{}').get('system', 'null'),
...         x.slot(1).get('details', '{}').get('system', '{}').get('name', 'null'),
...         x.slot(1).get('details', '{}').get('cpu', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('lib', 'null'),
...         x.slot(1).get('details', '{}').get('distro', '{}').get('libc', '{}').get('version', 'null'),
...     ))
... )
>>> print(list(x)[-3])
(15,612,767,
 '2017-12-14 09:33:31 UTC',
 'GB',
 '/packages/29/9b/25ef61e948321296f029f53c9f67cc2b54e224db509eb67ce17e0df6044a/certifi-2017.11.5-py2.py3-none-any.whl',
 'certifi-2017.11.5-py2.py3-none-any.whl',
 'certifi',
 'pip',
 '2.7.5',
 {'name': 'Linux', 'release': '2.6.32-696.10.3.el6.x86_64'},
 'Linux',
 'x86_64',
 'glibc',
 '2.17')


pytubes's Issues

Windows build

Currently, pytubes builds fine on Windows (on the winbuild branch), but it needs a fairly modern VC++. I haven't yet been able to work out the magic incantation to make AppVeyor use the right VC++, so binary wheels of pytubes for Windows are not available at the moment.

CSV support not working

I installed the latest version with pip install pytubes and when I attempt to run the example in the docs list(Each(['a,b', 'c,d']).csv()) I get AttributeError: 'tubes.Each' object has no attribute 'csv'


Looking at PyPI, it looks like 0.6.0 was published in March 2018, and in April 2018 some commits were made related to CSV support. #17

So I imagine it's related to that. @stestagg could you publish a new version to PyPI?

I also cloned the master and tried using python setup.py install and got the error pyx/ctubes.pyx:275:15: undeclared name not builtin: ReadFileObj

After removing read_fileobj, since it looks like it's not used anywhere, and running setup.py again, I got another error: ar: no archive members specified
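One quick way to confirm which release pip actually installed (the report above mentions 0.6.0 on PyPI) is to ask the standard library's importlib.metadata, available from Python 3.8. This is only a sketch: the helper name installed_version is hypothetical, and the distribution name "pytubes" is taken from the pip command quoted in the report.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version, or None when the package is absent.
print(installed_version("pytubes"))
```

If the printed version predates the CSV commits, the missing csv attribute would be consistent with the release simply not containing that feature.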

Direct Json to python fails to clear temporary buffer with escaped values between rows

When pytubes converts JSON string values directly to Python values (with no explicit conversion specified) and the string value contains '\'-escaped values, a temporary buffer is used to hold the decoded string data. This buffer is reused between iterations to improve performance but isn't cleared before reuse, so consecutive rows with escaped JSON values in the same slot result in the values being concatenated on the second row.
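The failure mode described above can be reproduced in miniature with plain Python. The sketch below is not pytubes's C++ code; decode_rows and its clear_between_rows flag are hypothetical names, and only the "\n" escape is handled, purely to show how a reused scratch buffer leaks one row into the next.

```python
def decode_rows(raw_values, clear_between_rows):
    """Decode escaped strings through a scratch buffer shared by all rows."""
    scratch = []  # reused across rows to avoid reallocating
    decoded = []
    for raw in raw_values:
        if clear_between_rows:
            scratch.clear()  # the step the bug report says is missing
        # Toy unescape: turn the two characters backslash + n into a newline.
        scratch.extend(raw.replace("\\n", "\n"))
        decoded.append("".join(scratch))
    return decoded


rows = ["a\\nb", "c\\nd"]
buggy = decode_rows(rows, clear_between_rows=False)
fixed = decode_rows(rows, clear_between_rows=True)
```

With the buffer left uncleared, the second row comes back as the first row's text followed by its own; clearing the buffer at the start of each row keeps the rows independent.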

Data type conversion between python bool and pytubes Bool

Hi, I need to use a custom filter function with the pytubes skip_if function. Specifically, it is a more complicated lambda expression that makes more checks than 'lambda x: x.is_blank()'. For example, for the string list ['apple', 'banana', 'cherry'], I want to get the strings that do not contain the character 'a'. With pytubes, I wrote a snippet like:

    (Each(['apple', 'banana', ''])
        .enumerate()
        .skip_if(lambda x: contain_a(x))
    )

Here, contain_a is a function that returns a Python bool indicating whether the string contains the character:

    def contain_a(x):
        if 'a' in x:
            return True
        else:
            return False

However, I got the error "TypeError: Argument 'conditional' has incorrect type (expected ctubes.Tube, got bool)". It means the pytubes skip_if function is not compatible with a Python bool, only with tubes.Bool. Does pytubes provide a conversion between Python bool and tubes.Bool?
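For reference, the result the question is after can be written in plain Python with no pytubes calls. This sketch does not answer whether pytubes offers a bool conversion; it only pins down the expected output, using the 'apple', 'banana', 'cherry' list from the description and a one-line version of the contain_a helper.

```python
def contain_a(text):
    """Return True when the string contains the character 'a'."""
    return "a" in text


items = ["apple", "banana", "cherry"]

# Mirror enumerate() followed by "skip if contain_a": keep (index, value)
# pairs whose value has no 'a' in it.
kept = [(index, value) for index, value in enumerate(items) if not contain_a(value)]
```

Here the predicate runs once per item and returns an ordinary bool, whereas the error message quoted above indicates that skip_if expects the lambda to return a Tube.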

Support 1, 2, and 4 byte integers in ndarray() call

Currently, all integers translate to I8 dtypes in numpy, which can be wasteful.

For strings, we already have the slot_info tuple for specifying the string length, so we can use that to restrict the size of the array type (but we'll need some checking on the C++ filler side, so that values of the correct size are copied in).

The same could be done for float/double types too.
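To make the potential saving concrete, the sketch below picks the narrowest signed width (1, 2, 4 or 8 bytes) that can hold a set of values. It is plain Python, not the proposed ndarray() change; smallest_int_width is a hypothetical helper and it assumes a non-empty list of signed integers.

```python
def smallest_int_width(values):
    """Return the smallest signed width in bytes (1, 2, 4 or 8) that fits all values."""
    lowest, highest = min(values), max(values)
    for width in (1, 2, 4, 8):
        bound = 1 << (8 * width - 1)  # e.g. 128 for a 1-byte signed integer
        if -bound <= lowest and highest < bound:
            return width
    raise OverflowError("values do not fit in an 8-byte signed integer")


values = [0, 17, 120, -5]
width = smallest_int_width(values)

# Bytes needed for the payload at the chosen width versus always using 8 bytes.
compact_bytes = len(values) * width
default_bytes = len(values) * 8
```

For this sample every value fits in a single byte, so the payload shrinks by a factor of eight; the range check is the kind of validation the filler would need before copying values into a narrower array.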

How to skip blank lines when reading a JSON file

pytubes is a great package that supports flexible data iteration, thank you @stestagg. I have read the pytubes docs: you already implement the skip_unless and equals functions for us, so why not offer a skip function similar to skip_unless? There is a skip function which will skip specific rows. I want to use skip_unless with a not-equal function to skip blank lines, with an expression such as "skip_unless(lambda x: x.slot(1).no_equal('null'))".
However, I cannot achieve this with the interfaces pytubes provides. I would really appreciate it if anyone could give me some help.
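The behaviour being asked for can be stated precisely in plain Python. The sketch below uses only the standard json module and is not a pytubes solution; parse_non_blank is a hypothetical helper, and it treats a line as blank when it contains only whitespace.

```python
import json


def parse_non_blank(lines):
    """Parse one JSON document per line, skipping whitespace-only lines."""
    for line in lines:
        if not line.strip():
            continue  # blank line: nothing to parse
        yield json.loads(line)


lines = ['{"a": 1}\n', '\n', '   \n', '{"a": 2}\n']
records = list(parse_non_blank(lines))
```

Filtering before parsing matters here because json.loads raises an error on an empty string, so the blank lines have to be dropped ahead of the JSON step.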
