
datapipes's Introduction

⚠️ Deprecation notice

The datapipes website is now archived and read-only. The gh-pages branch hosts the content of the static version of the website, which is now available at datapipes.datopian.com.

datapipes

A node library, command line tool and webapp to provide "pipe-able" Unix-Style data transformations on row-based data like CSVs.

DataPipes offers unix-style cut, grep, sed operations on row-based data like CSVs in a streaming, connectable "pipe-like" manner.

DataPipes can be used as a command-line tool, as a Node library, or through the web app.

Install

npm install -g datapipes

Usage - Command line

Once installed, datapipes will be available on the command line:

datapipes -h

See the help for usage instructions, but to give a quick taster:

# head (first 10 rows) of this file
datapipes https://raw.githubusercontent.com/datasets/browser-stats/c2709fe7/data.csv head

# search for occurrences of London (ignore case) and show first 10 results
datapipes https://raw.githubusercontent.com/rgrp/dataset-gla/75b56891/data/all.csv "grep -i london" head

Usage - Library

See the Developer Docs.


Developers

Installation

This is a Node Express application. To install and run it, do the following:

  1. Clone this repo
  2. Change into the repository base directory
  3. Run:
$ npm install

Testing

Once installed, you can run the tests locally with:

$ npm test

Running

To start the app locally, run:

$ node app.js

You can then access it from http://localhost:5000/

Deployment

For deployment we use Heroku.

The primary app is called datapipes on Heroku. To add it as a git remote, do:

$ heroku git:remote -a datapipes

Then to deploy:

$ git push datapipes

Inspirations and Related

  • https://github.com/substack/dnode dnode is an asynchronous rpc system for node.js that lets you call remote functions. You can pass callbacks to remote functions, and the remote end can call the functions you passed in with callbacks of its own and so on. It's callbacks all the way down!

Copyright and License

Copyright 2013-2014 Open Knowledge Foundation and Contributors.

Licensed under the MIT license:

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

datapipes's People

Contributors

andylolz, ashenm, davidmiller, dependabot[bot], floppy, nikitavlaznev, roll, rossjones, rufuspollock, serahkiburu


datapipes's Issues

Arbitrary map/filter functions

(was: Provide JS sandbox for user-specified filter functions)

It would be great if users could provide a filter function to be executed on each row.

This would be more powerful than grep, as it could take into account values in other cells. Something similar could also map a new column onto the table using a user-specified function (for example).

Something like http://gf3.github.io/sandbox/ looks like a reasonably good solution for JS. This particular one would be in-process, but one can imagine other languages being allowed to run code over the rows in a different type of sandbox.

API modification – variable substitution

I really like the current API (and am aware that #2 is the most-commented ticket!). I’d like to propose a very slight modification.

Why?

  1. Currently, pipelines look like this:

    / csv / head / html ?url=…
    

    Instead of the more familiar, perhaps more intuitive:

    / csv http://… / head / html
    
  2. Some regular expression metacharacters can’t be used in grep, because of URL limitations.

So…?

Variable substitutions! In a sort of printf style. e.g.:

# find the first 5 rows that include a happy smiley face
/ csv $url / grep $re / head -n 5 / html ?re=:\)&url=http://…

@ everyone?

Simple Map functions

You can execute arbitrary JS in node - see the vm module.

Hack approach

We are already running on Heroku, which is somewhat sandboxed.

We could also only use gists - and ban bad users ;-)

User Stories

Please add user stories here (in comments)

  • Joe = Data Wrangler
  • Irene = (Data) Journalist (less techy user interested in data)

View a CSV file quickly

As Joe or Irene I want to quickly view a CSV file online without downloading and opening so that I get a sense of what is in it

  • Want to view as a simple HTML table
  • Especially beneficial for large CSV files that might crash Excel or OpenOffice or take a long time to load (e.g. CSV files > 100 MB) (in this case probably just want to see the first 1000 or 10k rows)

Share viewable CSV link online

As Joe I want to share a link (email, irc, IM) for a CSV file that is easily viewable by someone else (HTML) so that they can see it

  • Want to only share top part of the file (if a large file) so I don't crash their browser

General Transform

Comment: A bit too general of a use case

As Joe I have a file on the web that I want to use in my no-backend html5/js app, and I need to operate on it / transform it on the way.

  • CORS support essential
  • what kind of transforms / operations
  • transforms like those (see #9 - inlined here for convenience)

Possible transforms

  • delete = delete rows - #6
  • cut = filter columns - #37
  • grep = filter rows - #31
  • head / tail = delete rows at top / bottom #13
  • sed (note sed 3d = delete ...) = find and replace #77
  • addindex = add an index column #24
  • map = perform an arbitrary map function specified in this gist (dangerous)? - #3

cf https://github.com/rgrp/command-line-data-wrangling#tools-of-the-trade for a list of unix tools

(from @mihi-tr)

Share my transform

As Joe I want to quickly share online a transformation I did on a data file so that someone else can see it (and can repeat it on other files)

  • transforms like find and replace
  • filtering rows or columns

Arbitrary transforms

As Joe I want to do arbitrary JS transforms on rows of data online (and share this with others) so that it is easy to do

colinfo - show columns names and information

This is a bit of a hack, but basically this would take the entire sheet and convert it into a tabular output where:

  • one row for each field/column in original (in order)
  • fields are:
    • name

To discuss

  • What other fields - if we just have column names then we don't have to parse much of the file :-)

License?

Please add a license for this code.

CSV HTML view operation

  • URL: /csv/view/ or /csv/html/ ?
    • latter fits better with /csv/plain/ or similar
  • Simple HTML table - can we do ellipsis on table headings?
  • Line number support ala github e.g. #L97 or #L97-103

Pipeline wizard!

The form at the top here is cool: http://datapipes.okfnlabs.org/html

This could be generalised to construct a complete pipeline of operations.

Implementation

  • Fixed field for URL
  • Fixed field for choosing output type (defaults to csv)
  • JS driven "pipeline" creator
    • Choose from a list of operators to add from a menu. When you select it gets added along with text field for arguments
    • you can delete that operator later
    • suggest we actually keep state in the JS and re-render ourselves each time rather than trying to manipulate in HTML (?)

extras

  • support for drag and drop reordering (?)

Transform: sort

Not sure I like this, as it requires the whole file to be read into memory (not streaming).

HTML output sortability

Regression introduced with pipe support - tables are no longer sortable.

(Doesn't deal with the Magic String 'end' index)

range should accept negative indexes

The range in delete currently accepts a comma-separated list, but doesn't work with negative indexes. This would be really handy for clearing junk at the end of a file, see:

http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv

Would expect the following URL to remove the first 5 lines, and the last 4 lines of the CSV.

http://datapipes.okfnlabs.org/csv/delete?range=0:6,-4:&url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv&

Pipe support

cf #2

Approach

 /csv/op1 args/op2 args/?url=...

e.g.

/csv/head -n 20/html/?url=....
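Splitting such a path into (operation, arguments) pairs can be sketched as follows (`parsePipeline` is a hypothetical helper, not the actual implementation):

```javascript
// Sketch: split a pipeline path like "/csv/head -n 20/html/" into
// { op, args } pairs. `parsePipeline` is a hypothetical helper.
function parsePipeline(path) {
  return path
    .split('/')
    .filter((seg) => seg.length > 0)
    .map((seg) => {
      const parts = seg.trim().split(/\s+/);
      return { op: parts[0], args: parts.slice(1) };
    });
}

const ops = parsePipeline('/csv/head -n 20/html/');
```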

Standard CSV arguments to all functions

Question: How are these passed in? Are they passed in to /csv .../ at the start or in query string?

Suggest that you have to pass these in like any other parameters so like:

/csv -d\t /...

Arguments are (taken from csvkit):

-d DELIMITER, --delimiter DELIMITER
                      Delimiting character of the input CSV file.
-t, --tabs            Specifies that the input CSV file is delimited with
                      tabs. Overrides "-d".
-q QUOTECHAR, --quotechar QUOTECHAR
                      Character used to quote strings in the input CSV file.
-b, --doublequote     Whether or not double quotes are doubled in the input
                      CSV file.
-p ESCAPECHAR, --escapechar ESCAPECHAR
                      Character used to escape the delimiter if quoting is
                      set to "Quote None" and the quotechar if doublequote
                      is not specified.
-e ENCODING, --encoding ENCODING
-S, --skipinitialspace
                      Ignore whitespace immediately following the delimiter.

Note:

  • Not sure we need skipinitialspace (is it supported by our csv parser?)
  • Encoding is probably needed at point we open the file stream so may be encoding should be query string parameter i.e. ?url=...&encoding=...

References

Support options from: http://www.dataprotocols.org/en/latest/csv-dialect.html

May want to follow csvkit and relevant unix commands where possible - csvkit common args http://csvkit.readthedocs.org/en/latest/scripts/common_arguments.html

Delete operation

/delete/?range=...

range = 1 or 1-5 or ...

Questions

Nice to be able to do reverse index e.g. range=-1 (delete last line)

However this is hard on streaming data ...

Support posting data

Then we can pipe from local using curl!

Notes on Getting Streaming Data in Express

Express as of 3.0 has a pretty neat bodyParser middleware that will take care of normal posts, file uploads etc. However, it does not give streaming access to uploads AFAICT.

// a multipart form upload will give us:
req.files

Write your own

let data = '';
req.on('data', function (chunk) { data += chunk; });
req.on('end', function () { /* all chunks received - process `data` */ });

Though you may need to selectively disable bodyParser - see http://stackoverflow.com/questions/11295554/how-to-disable-express-bodyparser-for-file-uploads-node-js

Other random notes

Common arguments across transforms

  • column/field selection. In cut this is -f for fields

    -f, --fields=LIST
    select only these fields; also print any line that contains no delimiter character, unless the -s option is specified

  • ...

Strip op.

Not sure about the name.

Should be used to strip out any lines where none of the 'cells' have a value. An easy way of deleting lines like ,,,,,,,,, without having to work out where they are in the file.

head operation

normal unix head, but probably with some tweaks

Implementation

  • @davidmiller made a start on this in #10
  • Key point is that we close the incoming stream once we have the data we need ...

Unix head

   head [OPTION]... [FILE]...

DESCRIPTION

   Print  the  first  10 lines of each FILE to standard output.  With more
   than one FILE, precede each with a header giving the file  name.   With
   no FILE, or when FILE is -, read standard input.

   Mandatory  arguments  to  long  options are mandatory for short options
   too.

   -c, --bytes=[-]N
      print the first N bytes of each  file;  with  the  leading  '-',
      print all but the last N bytes of each file

   -n, --lines=[-]N
      print  the first N lines instead of the first 10; with the lead-
      ing '-', print all but the last N lines of each file

   -q, --quiet, --silent
      never print headers giving file names

   -v, --verbose
      always print headers giving file names

   --help display this help and exit

   --version
      output version information and exit

   N may have a multiplier suffix: b 512, k 1024, m 1024*1024.

Command line interface

Do the web stuff but from the command line (cf #18)

  • Is this worth doing when you could just pipe to the web service?

Design API (URLs)

  • Support for different input formats (?) - atm just CSV
  • Always have a url input - can either POST or pass ?url argument (tbd)
  • How far do we seek to mimic unix commands in the url interface
  • Should we have a mini json format for defining the operations?

Possibilities

Option 1

Mixed arguments - some in query string, some in path

/csv/cut/-d, -f1,2,3?url=...

Option 2

All arguments in query string

/csv/cut/?d=,&f=1,2,3&url=...

Option 3

/csv/cut/?args={json-structure}&url=...

Piping

Option 1

Pass the url of the previous pipe into the new pipe as a source e.g.

 /csv/cut/?url=http://datapipes.../csv/head/?url=....

Where that 2nd url is obviously appropriately encoded.

Option 2a

Have a way to put this in the url

/csv/op1/args/op2/args/?url=...

Option 2b

/csv/op1 args/op2 args/?url=...

Option 3

Mimic unix style piping!

/csv/op1 args | op2 args | op3 ?url=...

So there is one special parameter (the url) which is passed in the usual way ...

`html` output doesn't render if it isn't first in chain

As the first transform in a series, html works fine. For example, this produces the expected nice-looking output:

http://datapipes.okfnlabs.org/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv

But as anything but the first member, html gives me unrendered HTML. This, for example, just gives me a page of raw, unrendered HTML:

http://datapipes.okfnlabs.org/csv/delete%200:6/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv

Delete should use an internal counter, rather than static row indexes

I feel like I have seen this raised in another ticket, but I can’t find it now…

delete should keep an internal counter, rather than relying on the static row indexes. If it’s after tail or another delete operation, it’s almost certainly not going to remove the right rows.

Tail operation

Operation similar to unix tail - especially -n +3 (start from the 3rd line, aka skip 2 lines)
