hawkfish / textform

A data transformation pipeline library based on Potter's Wheel.

License: MIT License

Makefile 0.10% Python 99.90%

wrangling-data wrangling potter-wheel data-transformation-pipeline record-stream pipelines markdown-converter textform csv-converter json-converter text-processing

textform's Issues

Add cookbook section to documentation

The examples folder is a good start. Good candidates for a first round of recipes are downloading a remote CSV and writing custom Select predicates such as skip (ignore rows until a pattern has matched N times) and stop (raise StopException when a pattern matches).
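As a rough illustration of the two predicates, here is a minimal sketch in plain Python. It does not use the textform API: the `select` generator, the `skip` and `stop` factories, and the local `StopException` class are hypothetical stand-ins, and the sketch assumes that the row producing the N-th match is itself kept.

```python
import re


class StopException(Exception):
    """Stand-in for the StopException named in the issue (assumed, not textform's class)."""


def skip(pattern, count=1):
    """Build a predicate that rejects rows until `pattern` has matched `count` times."""
    regex = re.compile(pattern)
    seen = 0

    def predicate(text):
        nonlocal seen
        if seen < count and regex.search(text):
            seen += 1
        # Assumption: the row that produces the final match is kept.
        return seen >= count

    return predicate


def stop(pattern):
    """Build a predicate that raises StopException on the first match of `pattern`."""
    regex = re.compile(pattern)

    def predicate(text):
        if regex.search(text):
            raise StopException(pattern)
        return True

    return predicate


def select(lines, *predicates):
    """Hypothetical stand-in for a Select step: yield lines accepted by every predicate."""
    try:
        for line in lines:
            if all(p(line) for p in predicates):
                yield line
    except StopException:
        return  # a stop predicate fired, so end the stream quietly


lines = ["junk", "---", "header", "---", "a", "b", "END", "c"]
kept = list(select(lines, skip("---", 2), stop("END")))
```

Because `all` short-circuits, a stop pattern that appears while rows are still being skipped is ignored in this sketch; swapping the predicate order changes that behaviour.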

Make Unfold order-independent.

The initial implementation of Unfold relied on the rows being contiguous. Instead, it should maintain a table of partly assembled rows, keyed off the fixed fields and using the tags to distribute the values to the correct group column(s).

The mapping from tags to group offset can be inferred from the tags stream (which is essentially what currently happens) or provided explicitly as an optional argument.
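The table-of-partial-rows idea can be sketched as follows. This is an illustration in plain Python, not the textform implementation: the function name `unfold_unordered` and its parameters are hypothetical, it handles a single value column, and it assumes rows are only emitted once the input is exhausted, with missing tags filled by empty strings.

```python
def unfold_unordered(rows, fixed, tag_field, value_field, tags=None):
    """Unfold (tag, value) rows into one wide row per distinct fixed-field key.

    rows:  iterable of dicts, in any order
    fixed: names of the fields that identify an output row
    tags:  optional explicit tag order; inferred from the stream when omitted
    """
    tag_order = list(tags) if tags is not None else []
    partial = {}  # fixed-field key -> {tag: value}; dicts keep first-seen order

    for row in rows:
        key = tuple(row[name] for name in fixed)
        tag = row[tag_field]
        if tag not in tag_order:
            if tags is not None:
                raise ValueError(f"unexpected tag: {tag!r}")
            tag_order.append(tag)  # infer the column order from the tag stream
        partial.setdefault(key, {})[tag] = row[value_field]

    # Emit the assembled rows, filling any tag that never arrived.
    for key, values in partial.items():
        out = dict(zip(fixed, key))
        for tag in tag_order:
            out[tag] = values.get(tag, "")
        yield out


interleaved = [
    {"name": "Ann", "test": "math", "score": "90"},
    {"name": "Bob", "test": "math", "score": "70"},
    {"name": "Ann", "test": "art", "score": "85"},
    {"name": "Bob", "test": "art", "score": "60"},
]
wide = list(unfold_unordered(interleaved, ["name"], "test", "score"))
```

When the tag list is supplied explicitly, a streaming variant could emit each row as soon as all of its tags have arrived instead of waiting for the end of the input.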

Add pdf support

This should wrap tabula-py or camelot-py.

PDF tables can be very messy, so being able to clean them up with filling and other shaping tools is very useful.
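To make the clean-up step concrete, here is a small fill-down sketch in plain Python. It assumes the table has already been extracted into a stream of dict rows (the extraction with tabula-py or camelot-py is not shown), and the `fill_down` name and signature are hypothetical rather than part of textform.

```python
def fill_down(rows, fields, blank=""):
    """Replace blank or missing cells in `fields` with the last non-blank value seen above."""
    last = {}
    for row in rows:
        filled = dict(row)  # copy so the input rows are not mutated
        for name in fields:
            if filled.get(name) in (None, blank):
                filled[name] = last.get(name, blank)
            else:
                last[name] = filled[name]
        yield filled


# Typical PDF residue: a group label that is printed only on the first row of each group.
extracted = [
    {"region": "North", "city": "Oslo"},
    {"region": "", "city": "Bergen"},
    {"region": None, "city": "Tromso"},
    {"region": "South", "city": "Rome"},
]
cleaned = list(fill_down(extracted, ["region"]))
```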

Fuzzy string matching

fuzzywuzzy might work, but we really need a streaming, join-style implementation.
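One possible shape for a streaming fuzzy join is sketched below. It swaps in the standard library's `difflib` for fuzzywuzzy so that the example is self-contained, and it assumes the reference table is small enough to hold in memory while the other side is streamed. The `fuzzy_join` name, its parameters, and the 0.8 cutoff are illustrative choices, not an existing textform API.

```python
import difflib


def fuzzy_join(stream, lookup, key, out_field, cutoff=0.8):
    """Stream rows, attaching the lookup value whose key best matches row[key].

    lookup: dict mapping reference strings to the value to attach
    cutoff: minimum difflib similarity ratio (0..1) for a match to count
    """
    candidates = list(lookup)
    for row in stream:
        # get_close_matches returns the best candidates at or above the cutoff, best first.
        matches = difflib.get_close_matches(row[key], candidates, n=1, cutoff=cutoff)
        joined = dict(row)
        joined[out_field] = lookup[matches[0]] if matches else ""
        yield joined


codes = {"New York": "NY", "California": "CA"}
messy = [{"state": "New Yrok"}, {"state": "Californa"}, {"state": "Texas"}]
joined_rows = list(fuzzy_join(messy, codes, "state", "code"))
```

Unmatched rows are kept with an empty output field here, which mirrors an outer join; dropping them instead would give inner-join behaviour.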

Iterate transform

Iterate is a cross between Fold and Unnest. It takes a nested value and expands it, but it assumes the value is ragged (e.g., a variable length array or a record with inconsistent schema). To adapt the ragged structure to a fixed schema, it produces two columns: the tag and the value. Each input row then generates one row per entry from the nested value. Because the schema is variable, both columns will be strings and later transforms can sort out the data typing. To avoid losing data, empty records will produce one row with empty strings for the outputs.

For arrays (JSON arrays, Python lists/tuples, or CSV rows), the tags are the numeric indices; for structured records, the tags are the record keys.
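The behaviour described above might look like the following sketch. It is plain Python rather than a textform transform: the `iterate` function and its parameter names are hypothetical, it works on values that are already parsed into Python lists, tuples, or dicts, and it assumes the nested input column is dropped from the output.

```python
def iterate(rows, input_field, tag_field, value_field):
    """Expand a ragged nested value into one (tag, value) row per entry."""
    for row in rows:
        nested = row[input_field]
        # Assumption: the nested column itself is not carried into the output.
        base = {k: v for k, v in row.items() if k != input_field}

        if nested is None:
            entries = []
        elif isinstance(nested, dict):
            entries = list(nested.items())      # tags are the record keys
        else:
            entries = list(enumerate(nested))   # tags are the numeric indices

        if not entries:
            entries = [("", "")]                # keep empty records as one blank row

        for tag, value in entries:
            # Both outputs are strings; later transforms can sort out the types.
            yield {**base, tag_field: str(tag), value_field: str(value)}


ragged = [
    {"id": 1, "v": ["a", "b"]},
    {"id": 2, "v": {"x": 10}},
    {"id": 3, "v": []},
]
expanded = list(iterate(ragged, "v", "tag", "value"))
```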

Fold and Unfold don't handle fixed rows correctly

Add data frame support

Add layout registration system

Add rst table IO formatters

Hook into GitHub publishing system

Row fusion
