hawkfish / textform
A data transformation pipeline library based on Potter's Wheel.
License: MIT License
The examples folder is a good start, but examples such as downloading a remote CSV or writing fun `Select` predicates like `skip` (ignore rows until a pattern has matched N times) and `stop` (raise `StopException` when a pattern matches) are good ideas for a first round.
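The two predicates described above could be sketched roughly as follows. This is only an illustration: the `StopException` class, the `select` helper, and the predicate signatures are stand-ins written for this example, not textform's actual API. It also assumes that the row containing the Nth match is itself skipped.

```python
import re


class StopException(Exception):
    """Hypothetical signal that the pipeline should stop reading rows."""


def skip(pattern, count=1):
    """Reject rows until `pattern` has matched `count` times, then accept the rest.

    Assumption: the row holding the final match is rejected too.
    """
    regex = re.compile(pattern)
    seen = 0

    def predicate(value):
        nonlocal seen
        if seen >= count:
            return True
        if regex.search(value):
            seen += 1
        return False

    return predicate


def stop(pattern):
    """Accept rows until `pattern` matches, then raise StopException."""
    regex = re.compile(pattern)

    def predicate(value):
        if regex.search(value):
            raise StopException(value)
        return True

    return predicate


def select(rows, predicate):
    """Stand-in for a Select step: yield accepted rows, end on StopException."""
    try:
        for row in rows:
            if predicate(row):
                yield row
    except StopException:
        return


lines = ["junk", "---", "a", "b", "END", "c"]
kept = list(select(select(lines, skip(r"^---")), stop(r"^END")))
```

Here `kept` holds only the rows between the `---` marker and the `END` marker.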
The initial implementation of `Unfold` relied on the rows being contiguous. Instead, it should maintain a table of partly assembled rows, keyed off the fixed fields, and use the `tags` to distribute the values to the correct group column(s).
The mapping from `tags` to group offset can be inferred from the `tags` stream (which is essentially what currently happens) or provided explicitly as an optional argument.
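A minimal sketch of the table-based approach, assuming rows are plain dicts. The function name `unfold` and the parameters `tag`, `value`, `outputs`, and `tag_to_offset` are hypothetical and chosen for this example; they are not taken from the library. For simplicity it buffers every partial row until the input ends.

```python
def unfold(rows, tag, value, outputs, tag_to_offset=None):
    """Pivot (tag, value) pairs into the `outputs` columns.

    rows: iterable of dicts; tag/value: names of the input columns;
    outputs: names of the group columns to produce;
    tag_to_offset: optional explicit mapping from tag to index in `outputs`.
    """
    infer = tag_to_offset is None
    offsets = {} if infer else dict(tag_to_offset)
    partial = {}  # fixed-field key -> partly assembled output row

    for row in rows:
        # Everything except the tag and value columns is a fixed field.
        fixed = {k: v for k, v in row.items() if k not in (tag, value)}
        key = tuple(fixed.items())
        if key not in partial:
            partial[key] = {**fixed, **{name: None for name in outputs}}
        t = row[tag]
        if infer and t not in offsets:
            # Infer offsets from the order in which tags first appear.
            offsets[t] = len(offsets)
        partial[key][outputs[offsets[t]]] = row[value]

    # Dicts preserve insertion order, so rows come out in first-seen order.
    yield from partial.values()


rows = [
    {"id": 1, "k": "x", "v": "a"},
    {"id": 2, "k": "x", "v": "c"},
    {"id": 1, "k": "y", "v": "b"},  # not adjacent to the first id=1 row
    {"id": 2, "k": "y", "v": "d"},
]
wide = list(unfold(rows, "k", "v", ["X", "Y"]))
explicit = list(unfold(rows, "k", "v", ["Y", "X"], {"x": 1, "y": 0}))
```

Because the partial rows are keyed by the fixed fields, interleaved input such as the `id` values above still assembles into one output row per key.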
This should wrap tabula-py or camelot-py.
PDF tables can be very messy, so being able to clean them up with filling and other shaping tools is very useful.
fuzzywuzzy might work, but we really need a streaming, join-style implementation.
`Iterate` is a cross between `Fold` and `Unnest`. It takes a nested value and expands it, but it assumes the value is ragged (e.g., a variable-length array or a record with an inconsistent schema). To adapt the ragged structure to a fixed schema, it produces two columns: the tag and the value. Each input row then generates one row per entry from the nested value. Because the schema is variable, both columns will be strings and later transforms can sort out the data typing. To avoid losing data, empty records will produce one row with empty strings for the outputs.
For arrays (JSON arrays, Python `list`/`tuple` or `csv` rows), the tags are the numeric indices; for structured records, the tags are the record keys.
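The behaviour described above might look something like this sketch. The function name `iterate` and its parameters are hypothetical, rows are assumed to be dicts, nested values are assumed to be a `list`, `tuple`, or `dict`, and dropping the input column from the output is a choice made for this example.

```python
def iterate(rows, input_name, tag="tag", value="value"):
    """Expand a ragged nested column into one (tag, value) row per entry."""
    for row in rows:
        rest = {k: v for k, v in row.items() if k != input_name}
        nested = row[input_name]
        if isinstance(nested, dict):
            entries = list(nested.items())    # tags are the record keys
        else:
            entries = list(enumerate(nested))  # tags are the numeric indices
        if not entries:
            entries = [("", "")]               # keep empty records as one row
        for t, v in entries:
            # Both outputs are strings; later transforms can fix the typing.
            yield {**rest, tag: str(t), value: str(v)}


rows = [
    {"id": 1, "data": ["a", "b"]},
    {"id": 2, "data": {"x": 10}},
    {"id": 3, "data": []},
]
expanded = list(iterate(rows, "data"))
```

The empty list for `id` 3 still yields a single row, with empty strings in both output columns.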
`Fold` and `Unfold` have a member `fixed` that was set to a `filter()` iterator instead of the result of the iterator. This meant that the fixed fields were never updated.
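One way this kind of failure can show up is illustrated below. The field names are made up for the example and the code is not taken from the library; it only demonstrates that a Python 3 `filter()` object is a one-shot iterator.

```python
fields = ["id", "tag", "value"]
inputs = {"tag", "value"}

# Buggy pattern: filter() returns a lazy, one-shot iterator.
fixed_lazy = filter(lambda f: f not in inputs, fields)
first_pass = list(fixed_lazy)   # consumes the iterator
second_pass = list(fixed_lazy)  # nothing left to yield on the second pass

# Fixed pattern: materialize the result once so every row can reuse it.
fixed = list(filter(lambda f: f not in inputs, fields))
reuse_a = [f for f in fixed]
reuse_b = [f for f in fixed]
```

Storing the list rather than the iterator means the fixed fields are available for every row, not just the first one processed.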
textform should be able to read data from any record source, and this is a big one.
Right now it is hard coded.
This will make it easier to generate sample fragments for documentation.
The package should go to PyPI and the docs to Read the Docs.
A transform is needed for fusing adjacent rows. It requires a predicate to determine fusion boundaries and a separator for each column. Non-string types could be fun.
This is a bit like `Fill`.
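A rough sketch of such a fusing transform, for string columns only. The name `fuse` and the parameters `starts_new` and `separators` are hypothetical. It assumes every row has the same columns and, as a further assumption, ignores empty cells when joining.

```python
def fuse(rows, starts_new, separators):
    """Merge runs of adjacent rows into single rows.

    starts_new(row) -> True when `row` begins a new fused row;
    separators: column name -> string used to join that column's values.
    """
    pending = None  # column name -> values collected for the current run

    def flush():
        return {k: separators[k].join(v) for k, v in pending.items()}

    for row in rows:
        if pending is not None and starts_new(row):
            yield flush()
            pending = None
        if pending is None:
            pending = {k: [] for k in row}
        for k, v in row.items():
            if v != "":  # assumption: empty cells contribute nothing
                pending[k].append(v)
    if pending is not None:
        yield flush()


rows = [
    {"name": "Widget", "desc": "A small"},
    {"name": "", "desc": "blue part"},   # continuation of the previous row
    {"name": "Gadget", "desc": "Large"},
]
fused = list(fuse(rows, lambda r: r["name"] != "", {"name": "", "desc": " "}))
```

In this example a non-empty `name` marks a fusion boundary, so the wrapped description is joined back into the `Widget` row.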