Comments (15)
The proposed method seems to be better in performance. Benchmark results:
BenchmarkInferSmall-4 20000 113778 ns/op
BenchmarkInferMedium-4 100 10257832 ns/op
BenchmarkInferBig-4 20 86271503 ns/op
BenchmarkInferImplicitCastingSmall-4 30000 42894 ns/op
BenchmarkInferImplicitCastingMedium-4 500 2939785 ns/op
BenchmarkInferImplicitCastingBig-4 50 32592837 ns/op
PASS
ok github.com/frictionlessdata/tableschema-go/schema 12.000s
from tableschema-go.
I didn't want to only port the code, so I checked the code (JS and Python), issues (e.g. tableschema-py/issues/105) and gitter discussions around infer. After that, I am not entirely sure what the actual scope of infer is within the Go library V1.0. In other words, is it super critical to have the exact same behavior -- and maybe the exact same problems -- or is it ok to have a simpler but more consistent behavior (by consistent I mean consistent with other cases in the target language)?
That said, I would like to base the first Go implementation of infer on json.Unmarshal, which is well known amongst Go developers. More specifically:
To unmarshal JSON into an interface value, Unmarshal stores one of
these in the interface value:
bool, for JSON booleans
float64, for JSON numbers
string, for JSON strings
[]interface{}, for JSON arrays
map[string]interface{}, for JSON objects
nil for JSON null
That would give users a pretty nice bootstrap Schema. This can also be implemented quite efficiently and would reuse pretty solid code. From that, users could refine the types very easily.
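To make the idea concrete, here is a minimal sketch of how the json.Unmarshal rules quoted above could drive a per-cell type guess. The helper names (schemaTypeOf, inferCell) and the fallback choices ("any" for null, "string" for cells that are not valid JSON) are assumptions for illustration, not part of tableschema-go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// schemaTypeOf maps the dynamic Go type chosen by json.Unmarshal to a
// Table Schema type name. The mapping itself is an illustrative assumption.
func schemaTypeOf(v interface{}) string {
	switch v.(type) {
	case bool:
		return "boolean"
	case float64:
		return "number"
	case string:
		return "string"
	case []interface{}:
		return "array"
	case map[string]interface{}:
		return "object"
	default: // nil (JSON null) or anything unexpected
		return "any"
	}
}

// inferCell decodes a raw cell as JSON. Cells that are not valid JSON
// (for example the bare word Alex) fall back to "string".
func inferCell(cell string) string {
	var v interface{}
	if err := json.Unmarshal([]byte(cell), &v); err != nil {
		return "string"
	}
	return schemaTypeOf(v)
}

func main() {
	cells := []string{"true", "42", "3.14", `"hi"`, "[1,2]", `{"a":1}`, "null", "Alex"}
	for _, cell := range cells {
		fmt.Printf("%s -> %s\n", cell, inferCell(cell))
	}
}
```

One consequence of this approach: every JSON number decodes to float64, so it cannot tell integer from number without an extra check.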
What do you think?
from tableschema-go.
hey @danielfireman ! @roll will be best placed to help you with this.
from tableschema-go.
@danielfireman
Sorry for the late answer. TBH I can't be very helpful on internals. I think infer could still work consistently between different implementations because it's kind of a pure mapping from a text file (CSV, or values from this CSV) to a text file (JSON, Table Schema).
from tableschema-go.
Sorry @roll .. but could you please clarify?
Is it ok if I follow the widely known rules of json.Unmarshal for infer?
boolean, for JSON booleans
float64, for JSON numbers
string, for JSON strings
array, for JSON arrays
object, for JSON objects
nil for JSON null
Asking because in the gitter channel you asked for contributors to think about it. I am suggesting we simplify the rules and, at least in Go, we have a solid implementation for this slightly simpler version.
I am also curious about the usage/use cases of infer. That would help to think/suggest improvements.
from tableschema-go.
Another project that was an inspiration for my suggestion. Very powerful features:
- tags to have structs with fields whose names do not match the CSV headers
- Ability to provide a custom Unmarshal method
from tableschema-go.
A use case is where a user has a plain CSV file, and we use infer to generate a Table Schema for it, without requiring the user to know anything about Table Schema. An example of where we do this is in the user interface of OpenSpending (which does a bunch of other stuff too, beyond basic schema inference).
About the rules of json.Unmarshal: It is a hard question for us to give you a definitive answer on from outside. There are several steps in the infer process, all relevant even if we put aside bugs that result from the very naive "algorithm" in place at present:
- Get a sample of data (per column)
- Attempt to cast the samples to a number of possible types + formats
- Choose the best matching type + format, based on some type of specificity
- Return the inferred type + format to the user.
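The steps above could be sketched for a single column roughly as follows. This is only one reading of them: "best match" is taken here to mean the most frequent per-cell type, with ties going to the more specific type. The set of candidate types, their order, and the names mostSpecificType and inferColumn are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// specificity lists the candidate types from most to least specific.
var specificity = []string{"integer", "number", "boolean", "date", "string"}

// mostSpecificType returns the first candidate type the cell can be cast to.
func mostSpecificType(cell string) string {
	if _, err := strconv.ParseInt(cell, 10, 64); err == nil {
		return "integer"
	}
	if _, err := strconv.ParseFloat(cell, 64); err == nil {
		return "number"
	}
	if cell == "true" || cell == "false" {
		return "boolean"
	}
	if _, err := time.Parse("2006-01-02", cell); err == nil {
		return "date"
	}
	return "string"
}

// inferColumn counts the most specific type of every sampled cell and
// returns the most frequent one. The format is always "default" because
// format guessing is left out of this sketch.
func inferColumn(sample []string) (typ, format string) {
	counts := make(map[string]int)
	for _, cell := range sample {
		counts[mostSpecificType(cell)]++
	}
	typ = "string"
	best := 0
	for _, t := range specificity {
		if counts[t] > best {
			typ, best = t, counts[t]
		}
	}
	return typ, "default"
}

func main() {
	typ, format := inferColumn([]string{"1", "2", "3", "n/a"})
	fmt.Println(typ, format)
}
```

Note that with a majority rule like this, a column that is mostly integers with a few stray strings is still reported as integer, so casting those stray cells later would fail.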
For a subset of types in Table Schema, I assume json.Unmarshal will work. But it will not help you with, for example, geojson. Also, it will not help you with the format of the type. You can ignore that (format) for sure at this stage, as I do not think format guessing is actually implemented anywhere.
My advice to you is to not overthink the infer method too much - get a version that works in the spirit of the current implementations, and then explicitly outline the pros and cons of the approach in an issue for future consideration.
from tableschema-go.
Thanks for the answer, @pwalsh !
from tableschema-go.
I am going to aim for the following conversion rules for the first version (please notice that below we have frictionless data types -> JSON types, not Go types):
boolean, for JSON booleans
float64, for JSON numbers
string, for JSON strings
array, for JSON arrays
object, for JSON objects
nil for JSON null
from tableschema-go.
Just had a chat with @pwalsh and we agreed on implementing:
- string, number, integer, boolean, object, array, date
- For the first version, the format will always be "default"
- The only accepted date format will be "default" (ISO8601)
We've got an open question at the end of the chat: whether the Infer function should return invalid schemas.
Looking at the Python implementation, the Infer function chooses the type with the most occurrences. That might return an invalid schema, e.g. if 95% of the cells in a certain column are numbers and 5% are strings. To check that, the user would need to call the Validate function on the returned schema.
I am going to follow the Python implementation for now.
from tableschema-go.
Looking at the Python that implementation, the Infer function chooses the type with most occurrences. That might return an invalid schema, i.e. if 95% of the cells in a certain column are numbers and 5% are strings. To check that out, the user would need to call the Validate function on the returned schema.
tableschema.validate only validates a Table Schema descriptor against the Table Schema profile (jsonschema). That's the case across all our implementations for now. So please take it into account - because implementing it is much easier 😄
So when talking about schema validity we are talking only about a schema check, not data. So the output of infer can't be invalid. It could just be non-optimal for the given data sample. But that's pretty OK. I would say improving infer is an incremental process.
Consider a real-life scenario of working with a dataset - https://github.com/frictionlessdata/tableschema-js#table - the user creates a table model for existing data, infers a schema, updates it and saves it. So infer here is more of an initial step towards getting a perfect schema, but of course it should still generate a valid Table Schema descriptor. And then the user could tweak it.
Sorry for the initial short comment - I had internet problems. So, expanding on it - it's OK for initial implementations not to be too precise on inferring, but any implementation should be able to pass a test like this:
id,name,joining
1,Alex,2015-05-06
2,John,2014-03-07
infer
"fields": [
{"name": "id", "type": "integer", "format": "default"},
{"name": "name", "type": "string", "format": "default"},
{"name": "joining", "type": "date", "format": "default"}
]
So what I meant by "infer is kinda pure mapping from text file to text file" is that we have text input and should produce text output. So it should be easier for statically typed languages than dealing with native types. At least I hope so)
So related to the Gitter discussion with @georgeslabreche, I meant that the current infer implementation is really simple - https://github.com/frictionlessdata/tableschema-js/blob/master/src/schema.js#L170-L197. It is a 20-line high-level algorithm. But it's not optimal because it does too many type casts and doesn't use a row-by-row approach.
Internally it works pretty simply:
- for every cell, try Field.cast_value for a list of types (1)
- based on this information, compose a Table Schema descriptor (2)
What I've suggested is that the high-level algorithm (2) could be improved (link to the code above). It's not critical, but if you write infer from scratch, why not, because it should be very easy.
from tableschema-go.
Thanks for the awesome explanation, @roll !
So talking about schema validity we talks only about schema check. Not data. So output of infer can't be invalid. It could be just non-optimal for given data sample. But it's pretty OK. I would say improving infer is an incremental process.
Got it. I meant that the use of CastRow on that data would return errors. But this is the user perspective I was trying to get.
Just for the record: another way to deal with this problem is to use the wider type. For instance, if we have 3 integers and one float in a column, we could use number as the field type (instead of integer) and somehow warn the user. In that way, the user could cast rows without errors, but he/she could refine the schema manually later.
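A minimal sketch of the "wider type" idea, assuming only three types (integer, number, string) ordered from narrowest to widest. The names cellType, rank and widestType are hypothetical. The boolean return value stands in for the "somehow warn the user" part.

```go
package main

import (
	"fmt"
	"strconv"
)

// rank orders the types handled by this sketch from narrowest to widest.
var rank = map[string]int{"integer": 0, "number": 1, "string": 2}

// cellType returns the narrowest type the cell can be cast to.
func cellType(cell string) string {
	if _, err := strconv.ParseInt(cell, 10, 64); err == nil {
		return "integer"
	}
	if _, err := strconv.ParseFloat(cell, 64); err == nil {
		return "number"
	}
	return "string"
}

// widestType returns the widest type seen in the column, plus a flag that
// is true when the cells did not all agree, so the caller can warn the user.
func widestType(cells []string) (typ string, widened bool) {
	for _, cell := range cells {
		t := cellType(cell)
		switch {
		case typ == "":
			typ = t
		case t != typ:
			widened = true
			if rank[t] > rank[typ] {
				typ = t
			}
		}
	}
	if typ == "" {
		typ = "string" // empty column: nothing to infer from
	}
	return typ, widened
}

func main() {
	typ, widened := widestType([]string{"1", "2", "3", "4.5"})
	fmt.Println(typ, widened)
}
```

With this rule, three integers and one float yield number, so casting every row against the inferred schema succeeds and the flag tells the user that a refinement may be worthwhile.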
Consider real-life scenario of working with dataset - https://github.com/frictionlessdata/tableschema-js#table - user creates table model for existent data, infers schema, updates it and saves it. So infer here it's more initial step for getting a perfect schema but of course it still should generate valid Table Schema descriptor. And then user could tweak it.
Thanks for pointing me to the example. Great stuff! infer being an initial step totally matches my expectations.
If we change the example by adding the column I mentioned above, traversing the table using keyed=true (or CastRow) before changing the schema would error out (i.e. if there are 3 integers and 1 float in a column). But as far as I understood, this is the time where you either refine the schema or change the data.
Finally, thanks a lot for the explanations about infer!
from tableschema-go.
Just for the records: another way to deal with this problem is to use the wider type. For instance, if we have 3 integers and one float in a column, we could use float as the field type and somehow warn the user. In that way, the user could cast rows without errors, but he/she could refine the schema manually later.
So that's exactly the case (high-level algorithm) where I've suggested implementers get creative) And later we could re-use the best approach in the existing implementations.
from tableschema-go.
Perfect. I am going to give this a shot.
from tableschema-go.
Need to improve documentation. Leaving this issue open.
from tableschema-go.