
ysv-rs / ysv


ysv: clean and transform CSV data according to your rules encoded in YAML, lightning fast

Rust 98.30% Shell 1.70%
csv-converter rust-lang yaml-configuration data-science data-mining data-ops

ysv's People


Forkers

forever-young

ysv's Issues

replace_regex: a regular expression causes YAML parsing error

Code:

    - replace_regex:
        pattern: "^(\d{4}-\d{2}-\d{2})"
        replace: $1

Error:

thread 'main' panicked at 'YAML config could not be parsed.: Scan(ScanError { mark: Marker { index: 176, line: 10, col: 17 }, info: "while parsing a quoted scalar, found unknown escape character" })', /home/anatoly/.cargo/registry/src/github.com-1ecc6299db9ec823/ysv-0.1.3/src/compile/mod.rs:27:8
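
The escape error comes from YAML itself: inside a double-quoted scalar, \d is treated as an escape sequence, and the scanner rejects it. A single-quoted scalar does not process backslash escapes, so a variant like the following should at least parse (whether ysv then applies the pattern as intended is a separate question):

    - replace_regex:
        pattern: '^(\d{4}-\d{2}-\d{2})'
        replace: $1

Doubling the backslashes inside double quotes ("^(\\d{4}-\\d{2}-\\d{2})") would also work.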

Replace function must be more generic

Several examples follow.

Replace by file of patterns

This is similar to grep -F.

phone:
  - input: phone
  - replace:
      patterns-file:
        path: blacklist.txt
        format: text
      value: ""

Replace by mapping file

phone:
  - input: phone
  - replace:
      mappings-file:
        path: mapping.csv
        format:
          name: csv
          header: true

Replace by regex

phone:
  - input: phone
  - replace:
      regex: "+"
      value: ""

When working with a mappings or patterns file, we will have to optimize:

  • For small files, it will be faster to load the whole file into memory in a searchable form.
  • For large files, we will have to index the file somehow and search it on disk, with an LRU cache for speed.

`var` data source support

In the configuration file, support a var: some_variable_name filter. This will be another source for filling a column (besides providing data from the input file or from another column of the output).

The program must look the variables up in the current environment. Similar to Terraform, YSV_VAR_something will be accessible as just something in the configuration file.
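
As a minimal sketch, assuming the syntax proposed above (the column name source_file and the variable name filename are made up for illustration):

version: 1
columns:
  source_file:
    - var: filename

With YSV_VAR_filename=input.csv set in the environment, every row of the source_file output column would be filled with the string input.csv.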

New command: `shell`

columns:
  uuid:
    shell:
      command: uuid -v4
      pipe: false

ysv should leverage the ecosystem of UNIX command line tools. It should permit the user to process the values of a given column through an external program.

There are multiple use cases for this.

  1. As we have seen in practice, sometimes ysv's built-in filters are not enough, and we have to write custom code in another language to do complex processing of particular columns.

With the shell command, we could teach ysv to call our Python script in a separate process and feed the values, line by line, to that script. ysv would then read the script's stdout and incorporate the resulting values into the output CSV dataset.

This would make ysv enormously extensible. Moreover, we could allow it to run multiple instances of the external program and thus exploit the multiprocessing capabilities of modern hardware (which, say, Python alone cannot easily do).

  2. Even without custom code, communication over UNIX pipes lets us use standard command line tools, for example awk (see the sketch below).

In both of these cases, we get a substantial expansion in functionality by leveraging tools that already exist out there, and we can do that with great efficiency.
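
For example, assuming a pipe: true mode in which ysv feeds each value of the column to the command's standard input, line by line (the opposite of the pipe: false example above), an awk-based formatter might be configured like this sketch (the amount column and the exact pipe semantics are assumptions):

columns:
  amount:
    - input: amount
    - shell:
        command: awk '{ printf "%.2f\n", $1 }'
        pipe: true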

Construct a list of values

Suppose we want to join several values into one column.

version: 1
columns:
  greeting:
    - [
        value: "Our beloved client",
        input: title,
        input: first_name,
        input: last_name,
      ]
    - join: " "

We are leveraging YAML's built-in construct for making a list. I think this looks rather user-friendly.

Unknown operation results in an ugly panic message

./sample unknown-operation

yields

thread 'main' panicked at 'YAML config could not be parsed.: Message("data did not match any variant of untagged enum Column", Some(Pos { marker: Marker { index: 28, line: 3, col: 8 }, path: "columns" }))', src/config.rs:42:8
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

We need to provide a readable error description.

Write processed data to stdout in a separate thread

Right now, reading the data, processing it, and writing it out all happen in one main thread. That is not efficient, because the I/O operations can and should run in parallel rather than impede each other.

How to

Preliminary test

  • cat a big file from one disk to another, measure performance
  • ysv a big file from one disk to another, and measure performance

The difference should show how much slower ysv is compared to cat.

Modifications

  • Get acquainted with Rust concurrency, of which I currently know nothing
  • For now, only move writing to a separate thread, with communication via a queue; the transformations will still happen in the main thread
  • The writer thread should make sure the rows are printed in the proper order; for that, they need to be numbered

Evaluation

  • Run the same file as before through ysv and measure how long the processing has taken.

Release ysv version 0.1.1

  • Find out how to bump version
  • Change docs URL (use Gitbook for now)
  • Update description in README.md

`input:` transformation has to support multiple arguments

Example:

version: 1
columns:
    first_name:
      - input:
        - First Name
        - FirstName
        - First
        - owner__first_name

The system has to pick the first of these columns that is present in the input file and use it as the source.

If none of them is present, a warning is issued and the output column (in this case, first_name) will be empty.

This is an important feature for many use cases where we need to convert diverse input data to some sort of standard form.

A bit of refactoring & modularization

  • Get rid of compilation warnings about unused imports, etc.;
  • create lib.rs and refactor functions to be cleaner, with concerns better separated;
  • make command line option parsing a bit cleaner.

Provide variables via command line options

Besides providing parameters as environment variables, ysv must support command line parameters, like this (modeled after Terraform):

cat input.csv | ysv -var "filename=input.csv" > output.csv

(Depends on #5.)

Create a catalog of examples

We need an examples directory featuring many use cases of how ysv can be used. Every example directory should contain:

  • input.csv
  • ysv.yaml
  • output.csv
  • stdout.log
  • stderr.log

This will illustrate how the program processes certain input and what it gives in return. We of course need a script to automatically generate all these files from input.csv and ysv.yaml.

Call a Python function from ysv

Depends on: #58

integrations:
    python:
        interpreter: python3.8
        module: my_python_module.ysv_functions
columns:
    - first_name:
        - python: fake_first_name
        - uppercase

The integrations key permits us to support multiple integrations with different programming languages. If at least one python command appears in the ysv configuration, the system must spawn a Python interpreter attached to every worker thread.

The Python module specified in the module argument will be imported into that interpreter.

For every cell, the specified Python function will be called, and the result of its execution will be used as the cell value.

Split the data processor into a number of parallel workers

The CSV data processor is currently single-threaded: it processes the lines of the incoming data stream one by one, sequentially.

This was acceptable for a proof-of-concept version, but now we should spawn a number of parallel threads instead. Every thread will work independently of the others.

Every row of the input will be dispatched to one of those worker threads via a channel. When processing is complete, the output row will be sent via another channel to the writer thread.

Implement a transformation with if-else semantics

Example use cases:

  • If the value is empty, replace it with a value from a variable (or a hardcoded one, or from another input column, or from another output column)
  • If the value is equal to the string "aaa", replace it with something else
  • If the value is equal to the variable named bbb, replace it with something else
  • If the value matches the "\d{2}" regex, prepend it with "000"

...and so on. We need to come up with:

  • syntax for this in the configuration file (a tentative sketch follows below)
  • and an efficient implementation
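
A purely hypothetical sketch of the syntax, only to illustrate the shape of the feature (none of the keys if, is_empty, matches, then, or prepend exist yet):

phone:
  - input: phone
  - if:
      is_empty: true
      then:
        - var: default_phone
  - if:
      matches: '\d{2}'
      then:
        - prepend: "000"

The implementation question (how to evaluate such conditions efficiently for every cell) remains open.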

Ugly exception when stdout buffer is closed

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))', src/worker.rs:50:9

TODO: how to reproduce?

Use ysv to anonymize datasets

Depends on: #59

Prerequisites:

  • ysv must support calling Python functions;
  • There has to be a csv2ysv tool (I will rename xsv2schema) to avoid spending time on writing boilerplate YAML configs.

After that, I need to write an article in the ysv docs explaining how the tool can be used to anonymize datasets with faker, mimesis, or similar Python libraries. I believe this is a very useful application of the tool.

`date` must accept multiple date formats

If the first format does not match, try the next one; the user should be able to supply multiple formats.
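
A sketch of how the configuration might look, assuming date accepts either a single format or a list of formats to try in order (the column name and the strftime-style format strings are assumptions):

birth_date:
  - input: birth_date
  - date:
      - "%Y-%m-%d"
      - "%d.%m.%Y"
      - "%d %B %Y"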

We might use this as an occasion to refactor transformer.rs into multiple files, by the way.
