hawkfish / textform

A data transformation pipeline library based on Potter's Wheel.

License: MIT License

Makefile 0.10% Python 99.90%
wrangling-data wrangling potter-wheel data-transformation-pipeline record-stream pipelines markdown-converter textform csv-converter json-converter text-processing

textform's Introduction

textform

A data transformation pipeline module based on the seminal Potter's Wheel data wrangling formalism. The name is a portmanteau of "text" and "transform".

Overview

textform (abbreviated txf) is a text-oriented data transformation module. With it, you can create sequential record processing pipelines that convert data from (say) lines of text into records and then route the final record stream for another use (e.g., write the records to a csv file).

Pipelines are constructed from a sequence of transforms that each take in a record and modify it in some way. For example, the Split transform replaces an input field with several new fields derived by splitting the input on a pattern.
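To make the idea concrete, here is a minimal sketch of what a Split-style transform does to a single record, assuming records are plain dicts. The `split_field` helper is hypothetical and only illustrates the behavior; it is not the library's actual API.

```python
import re

def split_field(record, field, outputs, pattern):
    """Illustrative stand-in for Split: replace `field` with new
    fields derived by splitting its value on a regex pattern.
    (Hypothetical helper, not the textform API.)"""
    parts = re.split(pattern, record.pop(field), maxsplit=len(outputs) - 1)
    # Pad with empty strings if the pattern produced too few pieces.
    parts += [''] * (len(outputs) - len(parts))
    record.update(zip(outputs, parts))
    return record

row = {'Query': 'Q01_PARALLEL'}
split_field(row, 'Query', ('Query', 'Mode'), r'_')
# row is now {'Query': 'Q01', 'Mode': 'PARALLEL'}
```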

While inspired by the Potter's Wheel transform list, textform is designed for practical everyday use. This means it includes transforms for limiting the number of rows, writing intermediate results to files, and capturing fields via regular expressions.

Audience

How do I know if textform is right for me? The simplest use case is when you want to use Python's DictReader but the file isn't a csv. With textform you can write a pipeline that ends up producing the same records you would get from DictReader.
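For reference, this is the record shape `csv.DictReader` produces from a csv, using only the standard library; a textform pipeline aims to deliver the same stream of dicts from files that are not csvs:

```python
import csv
import io

# csv.DictReader turns each CSV line into a dict keyed by the header row.
text = "name,time\nQ01,0.12\nQ02,0.34\n"
records = list(csv.DictReader(io.StringIO(text)))
# records == [{'name': 'Q01', 'time': '0.12'}, {'name': 'Q02', 'time': '0.34'}]
```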

More complex use cases can be built on top of this kind of record stream. Reshaping, computing values, splitting, dividing, merging, filling in blanks and other kinds of data cleaning and preparation tasks can all be implemented in a reusable fashion with textform. A pipeline effectively describes the format of a text file in an executable fashion that can be reused.

Example

I created textform because I had worked on a similar research system in the past, and I had two text files produced by the DuckDB performance test suite that I needed to convert into csvs:

------------------
|| Q01_PARALLEL ||
------------------
Cold Run...Done!
Run 1/5...0.12345
Run 1/5...0.12345
Run 1/5...0.12345
Run 1/5...0.12345
Run 1/5...0.12345
------------------
|| Q02_PARALLEL ||
------------------
...

This file is essentially a sequence of records grouped by higher attributes. Instead of writing a one-off Python script, I decided to write some simple transforms and build a pipeline, which looked like this:

import sys
from textform import (Add, Capture, Cast, Divide, Fill,
                      Match, Split, Text, Write)   # import path assumed

p = Text(sys.stdin, 'Line')                         # Read a line
p = Add(p, 'Branch', sys.argv[1])                   # Tag the file with the branch name
p = Match(p, 'Line', r'------', invert=True)        # Remove horizontal lines
p = Divide(p, 'Line', 'Query', 'Run', r'Q')         # Separate the query names from the run data
p = Fill(p, 'Query', '00')                          # Fill down the blank query names
p = Capture(p, 'Query', ('Query',), r'\|\|\s+Q(\w+)\s+\|\|')  # Capture the query number
# Split the execution mode from the query name
p = Split(p, 'Query', ('Query', 'Mode',), r'_', ('00', 'SERIAL',))
p = Cast(p, 'Query', int)                           # Cast the query number to an integer
p = Match(p, 'Run', r'\d')                          # Filter to the runs with data
# Capture the run components
p = Capture(p, 'Run', ('Run #', 'Run Count', 'Time',), r'(\d+)/(\d+)\.\.\.(\d+\.\d+)')
p = Cast(p, 'Run #', int)                           # Cast the run components
p = Cast(p, 'Run Count', int)
p = Cast(p, 'Time', float)
p = Write(p, sys.stdout)                            # Write the records to stdout as a csv
p.pump()

We can now invoke the pipeline script as:

$ python3 pipeline.py master < performance.txt > performance.csv

Contributing

You know the drill: fork, branch, test, and submit a PR. This is a completely open source, free as in beer project.

textform's People

Contributors: hawkfish

textform's Issues

Row fusion

A transform is needed for fusing adjacent rows. It requires a predicate to determine fusion boundaries and a separator for each column. Non-string types could be fun.

This is a bit like Fill.
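A rough sketch of the behavior this issue asks for, assuming dict records: a `fuse_rows` helper (hypothetical name and signature) that starts a new output row whenever a boundary predicate fires and joins each column of the run with its own separator.

```python
def fuse_rows(rows, boundary, seps):
    """Sketch of the proposed row-fusion transform (hypothetical API):
    merge each run of adjacent rows into one, starting a new run
    whenever boundary(row) is true, joining each column with its
    separator (default ' ')."""
    groups = []
    for row in rows:
        if boundary(row) or not groups:
            groups.append([row])        # start a new fusion group
        else:
            groups[-1].append(row)      # fuse into the current group
    return [{col: seps.get(col, ' ').join(r[col] for r in group)
             for col in group[0]}
            for group in groups]

rows = [{'line': 'A'}, {'line': 'continued'}, {'line': 'B'}]
# Start a new fused row whenever the line is upper-case.
fused = fuse_rows(rows, lambda r: r['line'].isupper(), {'line': ' '})
# fused == [{'line': 'A continued'}, {'line': 'B'}]
```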

Add cookbook section to documentation

The examples folder is a good start, but things like downloading a remote csv, or creating fun Select predicates like skip (ignore rows until you hit a pattern N times) and stop (raise StopException when you hit a pattern), are good ideas for a first round.

Iterate transform

Iterate is a cross between Fold and Unnest. It takes a nested value and expands it, but it assumes the value is ragged (e.g., a variable length array or a record with inconsistent schema). To adapt the ragged structure to a fixed schema, it produces two columns: the tag and the value. Each input row then generates one row per entry from the nested value. Because the schema is variable, both columns will be strings and later transforms can sort out the data typing. To avoid losing data, empty records will produce one row with empty strings for the outputs.

For arrays (JSON arrays, Python list/tuple or csv rows), the tags are the numeric indices; for structured records, the tags are the record keys.
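The tag/value expansion described above could look something like the following sketch, assuming dict records; the `iterate` function and its output column names are hypothetical.

```python
def iterate(record, field, tag_out='Tag', value_out='Value'):
    """Sketch of the proposed Iterate transform (hypothetical API):
    expand a ragged nested value into one output row per entry with
    string Tag/Value columns; empty inputs yield one row of empty
    strings so no data is lost."""
    value = record.pop(field)
    if isinstance(value, dict):
        entries = list(value.items())       # record keys become the tags
    else:
        entries = list(enumerate(value))    # array indices become the tags
    if not entries:
        entries = [('', '')]                # keep empty inputs visible
    return [dict(record, **{tag_out: str(t), value_out: str(v)})
            for t, v in entries]

rows = iterate({'id': 7, 'payload': ['a', 'b']}, 'payload')
# rows == [{'id': 7, 'Tag': '0', 'Value': 'a'},
#          {'id': 7, 'Tag': '1', 'Value': 'b'}]
```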

Make Unfold order-independent.

The initial implementation of Unfold relied on the rows being contiguous. Instead, it should maintain a table of partly assembled rows, keyed off the fixed fields and using the tags to distribute the values to the correct group column(s).

The mapping from tags to group offset can be inferred from the tag stream (which is essentially what currently happens) or provided explicitly as an optional argument.
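The keyed-table idea could be sketched like this, with the explicit tag argument mentioned above; the `unfold` function, its signature, and the column names are all hypothetical.

```python
def unfold(rows, keys, tag_field, value_field, tags):
    """Sketch of an order-independent Unfold (hypothetical API):
    collect tagged rows into a table of partly assembled output rows,
    keyed off the fixed fields, so row order no longer matters.
    `tags` is the explicit tag-to-column mapping argument."""
    table = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        group = table.setdefault(key, dict.fromkeys(tags, ''))
        group.update({k: row[k] for k in keys})     # carry the fixed fields
        group[row[tag_field]] = row[value_field]    # distribute the value
    return list(table.values())

rows = [
    {'Query': 1, 'Tag': 'Time', 'Value': '0.12'},
    {'Query': 2, 'Tag': 'Mode', 'Value': 'SERIAL'},    # groups interleaved
    {'Query': 1, 'Tag': 'Mode', 'Value': 'PARALLEL'},  # non-contiguous
]
out = unfold(rows, ('Query',), 'Tag', 'Value', ('Time', 'Mode'))
```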

Fuzzy string matching

fuzzywuzzy might work, but we really need a streaming-join-style implementation.
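As a zero-dependency starting point, the standard library's difflib can do per-record fuzzy lookups against a known key set; fuzzywuzzy (or rapidfuzz) would give better scoring, and a real streaming join would need more than this sketch.

```python
import difflib

# Known keys to join against, streamed records get matched one at a time.
known = ['Q01_PARALLEL', 'Q02_PARALLEL', 'Q01_SERIAL']

def fuzzy_lookup(value, cutoff=0.8):
    """Return the closest known key above the similarity cutoff,
    or None if nothing is close enough."""
    matches = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

fuzzy_lookup('Q01_PARALEL')  # a typo still resolves to 'Q01_PARALLEL'
```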

Add pdf support

This should wrap tabula-py or camelot-py.

PDF tables can be very messy, so being able to clean them up with filling and other shaping tools is very useful.
