
ysv-rs / ysv


ysv: clean and transform CSV data according to your rules encoded in YAML, lightning fast

Rust 98.30% Shell 1.70%
csv-converter rust-lang yaml-configuration data-science data-mining data-ops

ysv's People


Forkers

forever-young

ysv's Issues

replace_regex: a regular expression causes YAML parsing error

Code:

    - replace_regex:
        pattern: "^(\d{4}-\d{2}-\d{2})"
        replace: $1

Error:

thread 'main' panicked at 'YAML config could not be parsed.: Scan(ScanError { mark: Marker { index: 176, line: 10, col: 17 }, info: "while parsing a quoted scalar, found unknown escape character" })', /home/anatoly/.cargo/registry/src/github.com-1ecc6299db9ec823/ysv-0.1.3/src/compile/mod.rs:27:8
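
The escape error comes from YAML itself: inside a double-quoted scalar, \d is treated as an escape sequence, and the scanner rejects it. A single-quoted scalar does not process backslash escapes, so a variant like the following should at least parse (whether ysv then applies the pattern as intended is a separate question):

    - replace_regex:
        pattern: '^(\d{4}-\d{2}-\d{2})'
        replace: $1

Doubling the backslashes inside double quotes ("^(\\d{4}-\\d{2}-\\d{2})") would also work.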

Replace function must be more generic

Several examples follow.

Replace by file of patterns

This is similar to grep -F.

phone:
  - input: phone
  - replace:
      patterns-file:
        path: blacklist.txt
        format: text
      value: ""

Replace by mapping file

phone:
  - input: phone
  - replace:
      mappings-file:
        path: mapping.csv
        format:
          name: csv
          header: true

Replace by regex

phone:
  - input: phone
  - replace:
      regex: "+"
      value: ""

When working with a mappings or patterns file, we will have to optimize:

  • For small files, it will be faster to load the whole file into memory in a searchable form.
  • For large files, we will have to index the file somehow and search it on disk, with an LRU cache for speed.

`var` data source support

In the configuration file, support a var: some_variable_name filter. This will be another source for filling a column (besides providing data from the input file or from another column of the output).

The program must look the variables up in the current environment. Similar to Terraform, YSV_VAR_something will be accessible as just something in the configuration file.
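
As a minimal sketch, assuming the syntax proposed above (the column name source_file and the variable name filename are made up for illustration):

version: 1
columns:
  source_file:
    - var: filename

With YSV_VAR_filename=input.csv set in the environment, every row of the source_file output column would be filled with the string input.csv.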

New command: `shell`

columns:
  uuid:
    shell:
      command: uuid -v4
      pipe: false

ysv should leverage the ecosystem of UNIX command line tools. It should permit the user to process the values of a given column through an external program.

There are multiple use cases for this.

  1. As we have seen in practice, sometimes ysv's built-in filters are not enough, and we have to write custom code in another language to do complex processing of particular columns.

With the shell command, we could teach ysv to call our Python script in a separate process and feed the values, line by line, to that script. ysv would then read the script's stdout and incorporate the resulting values into the output CSV dataset.

This would make ysv enormously extensible. Moreover, we could allow it to run multiple instances of the external program and thus exploit the multiprocessing capabilities of modern hardware (which, say, Python alone cannot easily do).

  2. Even without custom code, communication over UNIX pipes lets us use standard command line tools, for example awk (see the sketch below).

In both of these cases, we get a substantial expansion in functionality by leveraging tools that already exist out there, and we can do that with great efficiency.
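
For example, assuming a pipe: true mode in which ysv feeds each value of the column to the command's standard input, line by line (the opposite of the pipe: false example above), an awk-based formatter might be configured like this sketch (the amount column and the exact pipe semantics are assumptions):

columns:
  amount:
    - input: amount
    - shell:
        command: awk '{ printf "%.2f\n", $1 }'
        pipe: true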

Construct a list of values

Suppose we want to join several values into one column.

version: 1
columns:
  greeting:
    - [
        value: "Our beloved client",
        input: title,
        input: first_name,
        input: last_name,
      ]
    - join: " "

We are leveraging YAML's built-in construct for making a list. I think this looks rather user-friendly.

Unknown operation results in an ugly panic message

./sample unknown-operation

yields

thread 'main' panicked at 'YAML config could not be parsed.: Message("data did not match any variant of untagged enum Column", Some(Pos { marker: Marker { index: 28, line: 3, col: 8 }, path: "columns" }))', src/config.rs:42:8
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

We need to provide a readable error description.

Write processed data to stdout in a separate thread

Right now, reading the data, processing it, and writing it out all happen in one main thread. That is not efficient, because the I/O operations can and should run in parallel rather than impede each other.

How to

Preliminary test

  • cat a big file from one disk to another, measure performance
  • ysv a big file from one disk to another, and measure performance

The difference should show how much slower ysv is compared to cat.

Modifications

  • Get acquainted with Rust concurrency, of which I currently know nothing
  • For now, only move writing to a separate thread, with communication via a queue; the transformations will still happen in the main thread
  • The writer thread should make sure the rows are printed in the proper order; for that, they need to be numbered

Evaluation

  • Run the same file as before through ysv and measure how long the processing has taken.

Release ysv version 0.1.1

  • Find out how to bump version
  • Change docs URL (use Gitbook for now)
  • Update description in README.md

`input:` transformation has to support multiple arguments

Example:

version: 1
columns:
    first_name:
      - input:
        - First Name
        - FirstName
        - First
        - owner__first_name

The system has to pick the first of these columns that is present in the input file and use it as the source.

If none of them is present, a warning is issued and the output column (in this case, first_name) will be empty.

This is an important feature for many use cases where we need to convert diverse input data to some sort of standard form.

A bit of refactoring & modularization

  • Get rid of compilation warnings about unused imports, etc.;
  • create lib.rs and refactor functions to be cleaner, with concerns better separated;
  • make command line option parsing a bit cleaner.

Provide variables via command line options

Besides providing parameters as environment variables, ysv must support command line parameters, like this (modeled after Terraform):

cat input.csv | ysv -var "filename=input.csv" > output.csv

(Depends on #5.)

Create a catalog of examples

We need an examples directory featuring many use cases of how ysv can be used. Every example directory should contain:

  • input.csv
  • ysv.yaml
  • output.csv
  • stdout.log
  • stderr.log

This will illustrate how the program processes certain input and what it gives in return. We of course need a script to automatically generate all these files from input.csv and ysv.yaml.

Call a Python function from ysv

Depends on: #58

integrations:
    python:
        interpreter: python3.8
        module: my_python_module.ysv_functions
columns:
    - first_name:
        - python: fake_first_name
        - uppercase

The integrations key permits us to support multiple integrations with different programming languages. If at least one python command appears in the ysv configuration, the system must spawn a Python interpreter attached to every worker thread.

The Python module specified in the module argument will be imported into that interpreter.

For every cell, the specified Python function will be called, and the result of its execution will be used as the cell value.

Split the data processor into a number of parallel workers

The CSV data processor is currently single-threaded: it processes the lines of the incoming data stream one by one, sequentially.

This was acceptable for a proof-of-concept version, but now we should spawn a number of parallel threads instead. Every thread will work independently of the others.

Every row of the input will be dispatched to one of those worker threads via a channel. When processing is complete, the output row will be sent via another channel to the writer thread.

Implement a transformation with if-else semantics

Example use cases:

  • If the value is empty, replace it with a value from a variable (or a hardcoded one, or from another input column, or from another output column)
  • If the value is equal to the string "aaa", replace it with something else
  • If the value is equal to the variable named bbb, replace it with something else
  • If the value matches the "\d{2}" regex, prepend it with "000"

...and so on. We need to come up with:

  • syntax for this in the configuration file (a tentative sketch follows below)
  • and an efficient implementation
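
A purely hypothetical sketch of the syntax, only to illustrate the shape of the feature (none of the keys if, is_empty, matches, then, or prepend exist yet):

phone:
  - input: phone
  - if:
      is_empty: true
      then:
        - var: default_phone
  - if:
      matches: '\d{2}'
      then:
        - prepend: "000"

The implementation question (how to evaluate such conditions efficiently for every cell) remains open.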

Ugly exception when stdout buffer is closed

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))', src/worker.rs:50:9

TODO: how to reproduce?

Use ysv to anonymize datasets

Depends on: #59

Prerequisites:

  • ysv must support calling Python functions;
  • There has to be a csv2ysv tool (I will rename xsv2schema) to avoid spending time on writing boilerplate YAML configs.

After that, I need to write an article in the ysv docs explaining how the tool can be used to anonymize datasets with faker, mimesis, or similar Python libraries. I believe this is a very useful application of the tool.

`date` must accept multiple date formats

If the first format does not match, try the next one; the user should be able to supply multiple formats.
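
A sketch of how the configuration might look, assuming date accepts either a single format or a list of formats to try in order (the column name and the strftime-style format strings are assumptions):

birth_date:
  - input: birth_date
  - date:
      - "%Y-%m-%d"
      - "%d.%m.%Y"
      - "%d %B %Y"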

We might use this as an occasion to refactor transformer.rs into multiple files, by the way.
