ysv-rs / ysv
ysv: clean and transform CSV data along your rules encoded in YAML format, lightning fast
Code:
- replace_regex:
    pattern: "^(\d{4}-\d{2}-\d{2})"
    replace: $1
Error:
thread 'main' panicked at 'YAML config could not be parsed.: Scan(ScanError { mark: Marker { index: 176, line: 10, col: 17 }, info: "while parsing a quoted scalar, found unknown escape character" })', /home/anatoly/.cargo/registry/src/github.com-1ecc6299db9ec823/ysv-0.1.3/src/compile/mod.rs:27:8
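The cause: inside YAML double quotes, a backslash starts an escape sequence, and \d is not a valid YAML escape, hence the "unknown escape character" error. Single quotes pass the backslash through to the regex engine unchanged:

```yaml
- replace_regex:
    pattern: '^(\d{4}-\d{2}-\d{2})'
    replace: $1
```

(Doubling the backslashes inside double quotes would also work. Either way, ysv should catch this ScanError and report the line and column instead of panicking.)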
Example:
version: 1
columns:
  something:
    - value: "boo"
    - input: "foo"
In this config, the value transformation is moot because its result will be replaced by input.
Several examples follow. The first one is similar to grep -F:
phone:
  - input: phone
  - replace:
      patterns-file:
        path: blacklist.txt
        format: text
      value: ""
phone:
  - input: phone
  - replace:
      mappings-file:
        path: mapping.csv
        format:
          name: csv
          header: true
phone:
  - input: phone
  - replace:
      regex: '\+'
      value: ""
When working with a mappings file or a search-patterns file, we have to optimize: such a file should be read and parsed only once, not once per cell.
In the configuration file, support a var: some_variable_name filter. This will be another source for filling a column (besides providing data from the input file or from another column of the output).
The program must search for variables in the current environment. Similar to Terraform, YSV_VAR_something will be accessible as just something in the configuration file.
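A sketch of how this could look, assuming the proposed var filter and the Terraform-style prefix (this syntax is a proposal, not implemented):

```yaml
version: 1
columns:
  source_file:
    - var: filename
```

With YSV_VAR_filename=input.csv set in the environment, every row of the source_file output column would be filled with input.csv.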
version: 1
columns:
  event_date:
    - input: Date
    - date: "%m/%d/%Y"
    - date-to-string: "%Y-%m-%d"
columns:
  uuid:
    shell:
      command: uuid -v4
      pipe: false
ysv should leverage the ecosystem of UNIX command line tools. It should permit the user to process the values of a given column through an external program.
There are multiple use cases for this.
With a shell command, we could teach ysv to call our Python script in a separate process and feed the values, line by line, to that script. ysv will read the output from the script's stdout and incorporate the resulting values into the output CSV dataset.
This would make ysv enormously extensible. Moreover, we could allow it to run multiple instances of the external program and thus facilitate the multiprocessing capabilities of modern hardware (which, say, Python alone cannot easily do).
The same goes for awk.
In both of these cases, we will get a substantial expansion in functionality by leveraging tools that already exist out there, and we can do that with great efficiency.
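Under these assumptions (one child process, values piped line by line), the core of the shell transformation could look like the sketch below; pipe_through is a hypothetical helper name, not existing ysv code:

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Command, Stdio};

// Sketch of the proposed `shell` transformation: feed every cell value to an
// external program's stdin, one per line, and read the transformed values
// back from its stdout.
fn pipe_through(program: &str, args: &[&str], values: &[&str]) -> std::io::Result<Vec<String>> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    let mut stdin = child.stdin.take().expect("stdin was piped");
    for value in values {
        writeln!(stdin, "{}", value)?;
    }
    drop(stdin); // close the pipe so the child sees EOF

    let stdout = child.stdout.take().expect("stdout was piped");
    let lines = BufReader::new(stdout).lines().collect::<Result<Vec<_>, _>>()?;
    child.wait()?;
    Ok(lines)
}
```

Note that writing all values before reading any output can deadlock once the OS pipe buffers fill; a production version should read the child's stdout from a separate thread.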
Suppose we want to join two columns.
version: 1
columns:
  greeting:
    - [
        value: "Our beloved client",
        input: title,
        input: first_name,
        input: last_name,
      ]
    - join: " "
We are leveraging the built-in YAML flow-sequence construct to make a list. I think this looks rather user-friendly.
fn main() -> Result<(), String> {
    let matches = clap_app!(ysv =>
        (version: "0.1.6")
        (author: ...
That's obviously not right :)
Running ./sample unknown-operation yields:
thread 'main' panicked at 'YAML config could not be parsed.: Message("data did not match any variant of untagged enum Column", Some(Pos { marker: Marker { index: 28, line: 3, col: 8 }, path: "columns" }))', src/config.rs:42:8
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
We need to provide a readable error description.
Right now, reading data, processing it, and writing it all happen in one main thread. That is not productive, because I/O operations can and should run in parallel rather than impede each other.
- cat a big file from one disk to another, and measure performance
- ysv a big file from one disk to another, and measure performance

The difference should show how much slower ysv is compared to cat.
- The writer thread should take care that the rows are printed in proper order; for that, they need to be numbered.
- Run ysv and measure how long the processing has taken. Maybe use it as an excuse to get a proper command line processing library into the project.
Example:
version: 1
columns:
  first_name:
    - input:
        - First Name
        - FirstName
        - First
        - owner__first_name
The system has to check which of these columns is the first one present in the input file, and use that column as the source.
If none is present, a warning is issued and the output column (in this case first_name) is going to be empty.
This is an important feature for many use cases where we need to convert diverse input data to some sort of standard form.
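A sketch of the header-resolution step, assuming the candidate list from the config and the header row of the input file; resolve_input_column is a hypothetical name, not existing ysv code:

```rust
// Pick the first candidate header that actually occurs in the input file's
// header row; warn when none matches, so the output column stays empty.
fn resolve_input_column<'a>(candidates: &[&'a str], headers: &[&str]) -> Option<&'a str> {
    let found = candidates.iter().copied().find(|c| headers.contains(c));
    if found.is_none() {
        eprintln!("Warning: none of {:?} found in input; column will be empty", candidates);
    }
    found
}
```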
Refactor lib.rs: make the functions cleaner and the concerns more separated.

Cannot parse date 4/10/2020 with format %Y-%m-%d.
(this error is repeated once for every row)
Besides providing parameters as environment variables, ysv must support command line parameters, like this (modeled after Terraform):
cat input.csv | ysv -var "filename=input.csv" > output.csv
(Depends on #5.)
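A std-only sketch of how the -var pairs could be pulled from the argument list; parse_vars is a hypothetical helper, and a real build would more likely wire this into the clap-based command line handling the project already uses:

```rust
// Extract `-var "name=value"` pairs from the command line arguments,
// modeled after Terraform's -var flag.
fn parse_vars(args: &[String]) -> Vec<(String, String)> {
    let mut vars = Vec::new();
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        if arg == "-var" {
            if let Some(pair) = iter.next() {
                if let Some((name, value)) = pair.split_once('=') {
                    vars.push((name.to_string(), value.to_string()));
                }
            }
        }
    }
    vars
}
```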
We need to have an examples directory which will feature many use cases of how ysv can be used. Every directory should contain at least an input.csv, a ysv.yaml, and the generated output.
It will illustrate how the program processes certain input and what it gives in return. We of course need a script to automatically generate all these files based on input.csv and ysv.yaml.
Depends on: #58
integrations:
  python:
    interpreter: python3.8
    module: my_python_module.ysv_functions
columns:
  - first_name:
      - python: fake_first_name
      - uppercase
The integrations key permits us to support multiple integrations with different programming languages. If at least one python command appears in the ysv configuration, the system must spawn a Python interpreter attached to every worker thread.
The Python module specified in the module argument will be imported into that interpreter.
For every cell, the specified Python function will be called, and the result of its execution will be used as the cell value.
How to call this?
- output
- column
- from-column
- copy-column
- duplicate
This task also means that we need to process columns in the order of their dependencies on one another, which makes this task rather complicated.
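A sketch of how that ordering could be computed, assuming each output column's list of source columns is known (this uses Kahn's topological-sort algorithm; the function name and data shape are hypothetical):

```rust
use std::collections::{HashMap, VecDeque};

// Order the output columns so that every column is computed after the columns
// it reads from. `deps` maps each output column to the columns it uses.
// Returns None when the dependencies contain a cycle.
fn column_order<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> Option<Vec<&'a str>> {
    let mut indegree: HashMap<&str, usize> = HashMap::new();
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (&col, col_deps) in deps {
        indegree.entry(col).or_insert(0);
        for &dep in col_deps {
            // Only dependencies on other *output* columns constrain the order;
            // plain input-file columns are always available.
            if deps.contains_key(dep) {
                *indegree.entry(col).or_insert(0) += 1;
                dependents.entry(dep).or_default().push(col);
            }
        }
    }

    let mut ready: VecDeque<&str> = indegree
        .iter()
        .filter(|&(_, &n)| n == 0)
        .map(|(&col, _)| col)
        .collect();

    let mut order = Vec::new();
    while let Some(col) = ready.pop_front() {
        order.push(col);
        if let Some(children) = dependents.get(col) {
            for &child in children {
                let n = indegree.get_mut(child).unwrap();
                *n -= 1;
                if *n == 0 {
                    ready.push_back(child);
                }
            }
        }
    }

    if order.len() == deps.len() { Some(order) } else { None }
}
```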
The CSV data processor is right now single-threaded; it processes the lines of incoming data stream one by one, sequentially.
This was acceptable for a proof of concept version, but now we should spawn a number of parallel threads instead. Every thread will work independently of the others.
Every row of the input will be dispatched to those worker threads via channels. When processing is complete, the output row will be sent via another channel to the writer thread.
...and whatever. We need to come up with:
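The scheme described above, numbered rows fanned out to workers over a channel and reordered on the writer side, can be sketched with std threads and channels; the uppercasing stands in for the real per-row transformations:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Number the rows, dispatch them to worker threads over one channel, collect
// results over another, and restore the original order by row number.
fn process_parallel(rows: Vec<String>, workers: usize) -> Vec<String> {
    let (work_tx, work_rx) = mpsc::channel::<(usize, String)>();
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (done_tx, done_rx) = mpsc::channel::<(usize, String)>();

    let mut handles = Vec::new();
    for _ in 0..workers {
        let work_rx = Arc::clone(&work_rx);
        let done_tx = done_tx.clone();
        handles.push(thread::spawn(move || loop {
            // Take one numbered row; stop when the work channel is closed.
            let job = work_rx.lock().unwrap().recv();
            match job {
                Ok((number, row)) => done_tx.send((number, row.to_uppercase())).unwrap(),
                Err(_) => break,
            }
        }));
    }
    drop(done_tx); // only the worker clones remain

    for (number, row) in rows.into_iter().enumerate() {
        work_tx.send((number, row)).unwrap();
    }
    drop(work_tx); // signal "no more work"

    // Writer side: collect everything, then sort back into input order.
    let mut numbered: Vec<(usize, String)> = done_rx.into_iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    numbered.sort_by_key(|&(number, _)| number);
    numbered.into_iter().map(|(_, row)| row).collect()
}
```

Sorting at the end is the simplest way to honor the numbering; a streaming writer would instead buffer out-of-order rows and emit each row as soon as its number comes up.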
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Io(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))', src/worker.rs:50:9
TODO: how to reproduce? (A broken pipe typically occurs when ysv's output is piped into a program such as head that exits before reading everything.)
Output current line number, starting from 1.
Depends on: #59
Prerequisites:
- ysv must support calling Python functions;
- a csv2ysv tool (I will rename xsv2schema) to avoid spending time on writing boilerplate YAML configs.

After that, I need to write an article in the ysv docs to explain how the tool can be used to anonymize datasets using faker, mimesis, or similar Python tools. I believe this is a very useful application for this instrument.
If the first format did not work, try the next one. The user might supply multiple formats.
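One possible configuration shape for this, reusing the list syntax proposed elsewhere for input (this is a proposal, not implemented syntax):

```yaml
event_date:
  - input: Date
  - date:
      - "%m/%d/%Y"
      - "%Y-%m-%d"
  - date-to-string: "%Y-%m-%d"
```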
Might use this to refactor transformer.rs
into multiple files btw.
This kills performance, because this function is called for every transformation applied to generate every column of the output.