Extend support of CSV with CSV Dialect (fq issue, open, 8 comments)

wader commented on August 25, 2024
Extend support of CSV with CSV Dialect

from fq.

Comments (8)

nichtich commented on August 25, 2024

More specifically, I'd first:

  • rename comma to delimiter and optionally keep comma as alias
  • change default of comment to null (most CSV in practice does not allow comments by default)
  • add quoteChar and skipInitialSpace as csvd and csvw agree on those
  • add header to automatically convert rows to objects when set (default false although csvd and csvw both have true as default but this can be discussed).
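
To make the list above concrete, here is a minimal sketch of how the proposed options might map onto Go's standard `encoding/csv` reader. The `Dialect` struct and `newReader` helper are hypothetical names for illustration, not fq's actual options. Note that `csv.Reader` has no field for the quote character (it always uses `"`), so `quoteChar` has no counterpart in this sketch.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Dialect is a hypothetical options struct; the field names follow the
// proposal in this thread and are not fq's actual option names.
type Dialect struct {
	Delimiter        rune // proposed rename of "comma"; 0 keeps the default ','
	Comment          rune // 0 disables comments, so every line is data
	SkipInitialSpace bool // maps to csv.Reader.TrimLeadingSpace
}

// newReader configures an encoding/csv reader from the dialect.
// quoteChar is left out: csv.Reader always uses '"' as the quote character.
func newReader(src string, d Dialect) *csv.Reader {
	r := csv.NewReader(strings.NewReader(src))
	if d.Delimiter != 0 {
		r.Comma = d.Delimiter
	}
	r.Comment = d.Comment
	r.TrimLeadingSpace = d.SkipInitialSpace
	return r
}

func main() {
	// With Comment left at 0, the line starting with '#' is read as data.
	src := "a; b\n#not a comment;x\n"
	rows, err := newReader(src, Dialect{Delimiter: ';', SkipInitialSpace: true}).ReadAll()
	if err != nil {
		panic(err)
	}
	fmt.Println(rows)
}
```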

More support of CSV dialect requires at least someone with experience in actually working with messy CSV data (e.g. users of mr) because authors of standards tend to add features without common use cases.

wader commented on August 25, 2024

Hey, that is great and very helpful research, I didn't know about any of those CSV dialect standards. I think your suggestions make sense. Is it something you would like to help out with coding-wise? It might speed things up a bit.

rename comma to delimiter and optionally keep comma as alias

Yep, good idea. The only (not great) reason it's called comma now is that this is what it's called in the CSV parser used at the moment: https://pkg.go.dev/encoding/csv#Reader

change default of comment to null (most CSV in practice does not allow comments by default)

Ok, so all lines will be treated as data?

add quoteChar and skipInitialSpace as csvd and csvw agree on those
add header to automatically [convert rows to objects](https://github.com/wader/fq/blob/master/doc/formats.md#convert-rows-to-objects-based-on-header-row) when set (default false although csvd and csvw both have true as default but this can be discussed).

👍 We could possibly also move the convert-to-object code into Go if doing it in jq is slow.
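
For illustration, a header-based conversion done on the Go side could look roughly like the sketch below. `rowsToObjects` is a hypothetical helper, not code from fq, and it assumes that header names are unique and that missing trailing fields are simply omitted from the object.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// rowsToObjects treats the first record as the header row and converts each
// following record into a map keyed by the header names.
func rowsToObjects(records [][]string) []map[string]string {
	if len(records) == 0 {
		return nil
	}
	header := records[0]
	objs := make([]map[string]string, 0, len(records)-1)
	for _, rec := range records[1:] {
		obj := make(map[string]string, len(header))
		for i, name := range header {
			if i < len(rec) { // skip header columns the record does not have
				obj[name] = rec[i]
			}
		}
		objs = append(objs, obj)
	}
	return objs
}

func main() {
	records, err := csv.NewReader(strings.NewReader("id,name\n1,fq\n2,jq\n")).ReadAll()
	if err != nil {
		panic(err)
	}
	for _, obj := range rowsToObjects(records) {
		fmt.Println(obj["id"], obj["name"])
	}
}
```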

Maybe the csv decoder could have a "dialect" option that is either a string naming a dialect or an object with settings?

One thing to figure out is whether we can still use the CSV parser in the Go standard library, or need to find another existing one or write one ourselves.

nichtich commented on August 25, 2024

Is it something you would like to help out with coding-wise? It might speed things up a bit.

I'm very motivated but have not coded in Go yet (should be doable and happy to learn) so the "might speed things up" does not apply. So it depends :-) But data formats are my research topic and I heavily use jq so sooner or later I need to dig deeper into fq anyway.

Ok, so all lines will be treated as data?

Yes, most CSV parsers don't enable comments by default.

Maybe the csv decoder could have a "dialect" option that is either a string naming a dialect or an object with settings?

Yes but then you need to manage names of dialects. The only commonly agreed names I know are RFC 4180 and TSV (probably better as rfc4180 and tsv). Most people don't document their data formats on this level with names but just assume CSV as supported by the software library they happen to use (and end up with incompatible edge cases).

One thing to figure out is whether we can still use the CSV parser in the Go standard library, or need to find another existing one or write one ourselves.

The more dialect aspects are supported, the greater the danger of having to write your own CSV library. That's why I'd first limit implementation to compatibility with a subset of CSVD and CSVW.

wader commented on August 25, 2024

I'm very motivated but have not coded in Go yet (should be doable and happy to learn) so the "might speed things up" does not apply. So it depends :-) But data formats are my research topic and I heavily use jq so sooner or later I need to dig deeper into fq anyway.

Great, there is no hurry, it was more a question of whether you wanted something fast :) I can help out with both Go and jq stuff. Maybe a possible route is that I start looking at it and see how much work it seems to be, possibly with some initial PR etc., and then we figure something out?

What kind of research are you doing? As a student, PhD etc.? Just curious. And I'm of course happy to help out with other fq or format-related things.

Yes but then you need to manage names of dialects. The only commonly agreed names I know are RFC 4180 and TSV (probably better as rfc4180 and tsv). Most people don't document their data formats on this level with names but just assume CSV as supported by the software library they happen to use (and end up with incompatible edge cases).

Aha, I see. But it's nice that both csvddf and csvw have default values, so an fq decoder could always use those as a fairly safe fallback for properties that are not set?

I had no idea there were even efforts to standardize CSV like this. It seems like a good idea, since CSV is quite confusing. I've had to explain at least a couple of times that "export it as CSV" is sadly not that straightforward :) I've also run into issues with numbers in CSV, such as which decimal symbol to use. That seems to be out of scope for csvddf and csvw?

The more dialect aspects are supported, the greater the danger of having to write your own CSV library. That's why I'd first limit implementation to compatibility with a subset of CSVD and CSVW.

Yes, true, good point. So maybe try to stick with the standard library csv reader/writer and see how far it can go?

nichtich commented on August 25, 2024

What kind of research are you doing? As a student, PhD etc.? Just curious.

I did my PhD thesis on patterns in data formats some years ago and I manage a structured register of data formats (in German, with focus on bibliographic data).

wader commented on August 25, 2024

I did my PhD thesis on patterns in data formats some years ago and I manage a structured register of data formats (in German, with focus on bibliographic data).

Interesting, and the thesis looks like something I would like to have a look at.

As you might have noticed, fq currently does not support much when it comes to schemas or generic format description languages like Kaitai Struct. But I think it should be possible to add in some form, at least for decoding; encoding is a different kind of beast, at least for complex formats like mp4.

wader commented on August 25, 2024

Did some research about good test suites; csvw seems to have one in a nice format: https://github.com/w3c/csvw/tree/gh-pages/tests

wader commented on August 25, 2024

Did an initial PR to try some things out: #546, see comments.
