stenway / rsv-specification

Rows of String Values (RSV Data Format) Specification - A Simple Binary Alternative to CSV

Home Page: https://www.stenway.com

License: MIT License

binary-file binary-file-format csv delimiter dsv file-format rsv specification tsv delimiter-collision

rsv-specification's Issues

Extending RSV to support Base64-encoded binary data out-of-the-box

Idea

From a comment by Rik Schaaf (me) on your YouTube video (https://www.youtube.com/watch?v=tb_70o6ohMA&lc=Ugzsfj_OUAK4s_IYaNZ4AaABAg):

What about extending the RSV format to support Base64-encoded binary data by prefixing a value with \xFB? (I chose FB because the B is easy to remember as Binary/Base64, and because the byte is still invalid in UTF-8 according to your table at 4:41, which prevents collisions.)
This would make it much cheaper to represent numbers (with more than 2 digits), and dates (for example as timestamps).
It would also allow for lossless transfer of floating-point values (which is a problem when just using strings, since their decimal string representation doesn't losslessly map to their binary representation).
It could even allow the encoding of an image as a bitmap or any other binary data.
This extension would turn the Array<Array<String | null>> data structure into Array<Array<String | Array | null>> instead.
With this addition, you could even embed an RSV file within an RSV file, because the inner RSV file would be Base64 encoded, preventing any collisions with the special characters.

Example

So:

[
 [1234567890, "Hello", "🌎", null]
]

Would translate to:

251 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 72, 101, 108, 108, 111 | 255 | 240, 159, 140, 142 | 255 | 254 | 255 | 253
\FB | B64: SZYC0g==                   | \FF |        "Hello"         | \FF |        "🌎"        | \FF | \FE | \FF | \FD
    | Hex: 499602D2                   |
    | Dec: 1234567890                 |

So in essence, without a prefix you get UTF-8-encoded data, and with the \FB prefix you get Base64-encoded data (which is ASCII- and UTF-8-compatible, to my knowledge).
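
To make the proposal concrete, here is a minimal Python sketch of how an encoder and decoder for this extension might look. It is not part of RSV or of any existing library: the function names are hypothetical, the 0xFB marker is only the one proposed in this issue, and the big-endian byte order of the integer is an assumption taken from the hex value shown in the example above.

```python
import base64

VALUE_END = b"\xff"    # RSV: terminates every value
NULL_MARK = b"\xfe"    # RSV: marks a null value
ROW_END = b"\xfd"      # RSV: terminates every row
BINARY_MARK = b"\xfb"  # proposed in this issue only, not part of RSV


def encode_value(value):
    """Encode None, bytes, or str as one terminated value."""
    if value is None:
        return NULL_MARK + VALUE_END
    if isinstance(value, (bytes, bytearray)):
        # Binary data: marker byte followed by the Base64 text.
        return BINARY_MARK + base64.b64encode(bytes(value)) + VALUE_END
    return value.encode("utf-8") + VALUE_END


def encode_rows(rows):
    """Encode a list of rows; each row is a list of None, bytes, or str."""
    return b"".join(
        b"".join(encode_value(v) for v in row) + ROW_END for row in rows
    )


def decode_rows(data):
    """Inverse of encode_rows; assumes well-formed input."""
    rows = []
    for raw_row in data.split(ROW_END)[:-1]:
        row = []
        for raw in raw_row.split(VALUE_END)[:-1]:
            if raw == NULL_MARK:
                row.append(None)
            elif raw.startswith(BINARY_MARK):
                row.append(base64.b64decode(raw[1:]))
            else:
                row.append(raw.decode("utf-8"))
        rows.append(row)
    return rows


# The example row from above, with the integer stored as 4 big-endian bytes.
number = (1234567890).to_bytes(4, "big")
encoded = encode_rows([[number, "Hello", "\U0001F30E", None]])
print(list(encoded))
```

Splitting on 0xFD and 0xFF is safe here because neither byte can appear in valid UTF-8 or in Base64 text.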

What is this addition trying to do

The advantage of this encoding addition is that non-Unicode characters could also be represented without risk of collisions, including the RSV special characters themselves.

Another advantage is that some data types can be stored more efficiently, like numbers and dates.

What is this addition NOT trying to do (but what could be added in a separate issue)

This is not a change to add the data types themselves to RSV. This additional special character only signifies the encoding, not the datatype, so you wouldn't know if the data represents an integer, timestamp, float, etc., just like you wouldn't know this with the current implementation. This is still left to the program that is using the RSV file.

If the data type had to be derivable from the binary data, the Base64 value could be prefixed (after the \FB) by a string enclosed in non-Base64 characters to signify the data type, such as (i32) for 32-bit integers.
Example:

251 | 40, 105, 51, 50, 41 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 253
\FB |  type: 32-bit int   |         value: SZYC0g==         | \FF | \FD

...which would represent a single integer (int32) value that equals 1234567890.
Or you could use a simpler but more restrictive typing system that uses a single non-Base64 character to define the type, followed by a single character for the size.

251 | 35, 52 | 83, 90, 89, 67, 48, 103, 61, 61 | 255 | 253
\FB |   #4   |             SZYC0g==            | \FF | \FD

...where # defines an integer and 4 defines a size of 4 bytes (32 bit): 1234567890

251 | 126, 52 | 81, 69, 107, 80, 50, 119, 61, 61 | 255 | 253
\FB |   ~4    |             QEkP2w==             | \FF | \FD

...where ~ defines a floating point value and 4 defines a size of 4 bytes (32 bit): 3.141592...
This is out of scope for this issue though.
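
For illustration only, the single-character typing idea could be decoded roughly as in the Python sketch below. Everything in it is an assumption layered on the examples above: the `FORMATS` table and the `decode_typed` function are hypothetical, integers are treated as signed, and big-endian byte order is inferred from the hex value 499602D2.

```python
import base64
import struct

# Hypothetical mapping from (type character, size in bytes) to struct formats.
# ">" selects big-endian, which is an assumption, not something the issue fixes.
FORMATS = {
    ("#", 4): ">i",  # 32-bit signed integer
    ("#", 8): ">q",  # 64-bit signed integer
    ("~", 4): ">f",  # 32-bit float
    ("~", 8): ">d",  # 64-bit float
}


def decode_typed(field: bytes):
    """Decode the bytes between the 0xFB marker and the 0xFF terminator."""
    kind = chr(field[0])       # '#' or '~'
    size = int(chr(field[1]))  # single digit: payload size in bytes
    payload = base64.b64decode(field[2:])
    if len(payload) != size:
        raise ValueError("payload length does not match declared size")
    (value,) = struct.unpack(FORMATS[(kind, size)], payload)
    return value


print(decode_typed(b"#4SZYC0g=="))  # the integer example
print(decode_typed(b"~4QEkP2w=="))  # the float example
```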

Considerations

With this addition, the name isn't really accurate anymore, so would this be RBSV (Rows of Binary or String Values)?

RSV & Wikipedia

RSV is an interesting idea, but the format has no Wikipedia page, it has no RFC, and there are no tools to support this file format...

There should be tools or utilities to convert CSV, JSON, or TSV to and from RSV, and tools to validate the "syntax" of RSV files. For a new data format it is important to have a tool that verifies compatibility with the standard. JSON has jq, xmlstarlet can be used to validate XML files, etc. Something like this should be designed for RSV...

Could RSV be processed by awk or gawk? It can be generated in awk, but I am not sure how difficult it would be to parse RSV in awk...
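
As a rough idea of what a validating parser involves, here is a minimal Python sketch (Python rather than awk, and not an official tool). The name `decode_rsv` is hypothetical, and the sketch assumes the byte layout used elsewhere on this page: 0xFF terminates each value, 0xFE marks a null value, and 0xFD terminates each row.

```python
def decode_rsv(data: bytes):
    """Parse RSV bytes into a list of rows, raising ValueError if malformed."""
    if data and data[-1] != 0xFD:
        raise ValueError("last row is not terminated by 0xFD")
    rows = []
    row = []
    start = 0  # index where the current value begins
    for i, byte in enumerate(data):
        if byte == 0xFF:
            raw = data[start:i]
            if raw == b"\xfe":
                row.append(None)
            else:
                # Invalid UTF-8 raises UnicodeDecodeError (a ValueError subclass).
                row.append(raw.decode("utf-8"))
            start = i + 1
        elif byte == 0xFD:
            if start != i:
                raise ValueError("value is not terminated by 0xFF")
            rows.append(row)
            row = []
            start = i + 1
    return rows


print(decode_rsv(b"Hello\xff\xfe\xff\xfd"))
```

The validation amounts to three checks: every value ends with 0xFF, every row ends with 0xFD, and every non-null value is valid UTF-8.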

BTW, you have interesting videos on YouTube. Unfortunately, those videos about structured file formats lack structure! Please check how to add chapters to YouTube videos; it is an easy and powerful feature... See one of the many tutorials on YouTube chapters, or another video tutorial.

One more note: this XKCD story is about RSV ;-)
