frictionlessdata / tableschema-go
A Go library for working with Table Schema.
License: MIT License
Notice that this includes supporting the boolean field customizations described below.
https://specs.frictionlessdata.io/table-schema/#types-and-formats
boolean
The field contains boolean (true/false) data.
In the physical representations of data where boolean values are represented with strings, the values set in trueValues and falseValues are to be cast to their logical representation as booleans. trueValues and falseValues are arrays which can be customised to user need. The default values for these are in the additional properties section below.
The boolean field can be customised with these additional properties:
trueValues: [ "true", "True", "TRUE", "1" ]
falseValues: [ "false", "False", "FALSE", "0" ]
format: no options (other than the default).
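The casting rule above can be sketched in Go. This is an illustrative implementation of the spec's logic, not the library's actual API; the `castBoolean` helper and its signature are assumptions for the example.

```go
package main

import "fmt"

// castBoolean casts a raw string to bool using the configurable
// trueValues/falseValues lists from the field descriptor.
// The helper name is illustrative, not the library's actual API.
func castBoolean(value string, trueValues, falseValues []string) (bool, error) {
	for _, v := range trueValues {
		if value == v {
			return true, nil
		}
	}
	for _, v := range falseValues {
		if value == v {
			return false, nil
		}
	}
	return false, fmt.Errorf("invalid boolean value: %q", value)
}

func main() {
	// Default lists from the spec.
	trues := []string{"true", "True", "TRUE", "1"}
	falses := []string{"false", "False", "FALSE", "0"}

	b, err := castBoolean("TRUE", trues, falses)
	fmt.Println(b, err) // true <nil>

	// Customised lists, e.g. for yes/no data.
	b, err = castBoolean("yes", []string{"yes"}, []string{"no"})
	fmt.Println(b, err) // true <nil>
}
```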
This constraint can be used in all types:
If true, then all values for that field MUST be unique within the data file in which it is found.
In tableschema-go's implementation, it only makes sense to check the unique constraint during schema.DecodeTable.
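A minimal sketch of what that check could look like, assuming rows are visited while decoding the table; the `checkUnique` helper and its signature are hypothetical, not part of the library:

```go
package main

import "fmt"

// checkUnique enforces a unique constraint on a single column while the
// whole table is being visited (as during DecodeTable). Names are
// illustrative, not the library's actual API.
func checkUnique(rows [][]string, col int) error {
	seen := make(map[string]bool)
	for i, row := range rows {
		v := row[col]
		if seen[v] {
			return fmt.Errorf("row %d: duplicate value %q violates unique constraint", i, v)
		}
		seen[v] = true
	}
	return nil
}

func main() {
	rows := [][]string{{"1", "foo"}, {"2", "bar"}, {"2", "baz"}}
	fmt.Println(checkUnique(rows, 1)) // <nil>
	fmt.Println(checkUnique(rows, 0)) // error: duplicate "2"
}
```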
Besides aggregating information about the work to be done (implementation and testing), the document should contain a proposal for the first version of the API.
The idea is to make it easier to discuss and gather comments. Having someone more acquainted with the libraries (and with experience in statically typed languages) sign off on the document would be great.
Creating a new instance is better, but I am not sure whether we can do it cleanly.
For now the readme is useful, but it still lacks more detailed information on the Table/Schema/Field
APIs: what exact methods/functions are supported, etc. See https://github.com/frictionlessdata/tableschema-js#documentation
A document describing an API proposal and some implementation aspects (i.e. coding, testing and performance considerations).
It should not be very long. The main goal is to enable discussion and get a sign-off before jumping into coding.
It should follow the Python implementation and config.
Python implementation here
So first.. I've been watching this progress along..
Very excited to start using this...
On that point though I tried dep (my first use of dep by the way)
I get
Fils:frictionlessdata dfils$ dep ensure -add github.com/frictionlessdata/[email protected]
ensure Solve(): No versions of github.com/frictionlessdata/tableschema-go met constraints: v0.1: Could not introduce github.com/frictionlessdata/[email protected], as its subpackage github.com/frictionlessdata/tableschema-go does not contain usable Go code (*build.NoGoError).. (Package is required by (root).)
master: Could not introduce github.com/frictionlessdata/tableschema-go@master, as it is not allowed by constraint ^0.1.0 from project opencoredata.org/ocdGarden/frictionlessdata.
Fils:frictionlessdata dfils$
Is there a different version number I should use? How does one tell what versions are available?
See http://specs.frictionlessdata.io/table-schema/#number. Also we should update current behavior on currencies etc.
This should transparently deal with *schema.GeoPoint, which currently returns an error.
The same should work for *int and other types.
As per the Table Schema specs, datetime fields can have two constraints associated with them: maximum and minimum.
A similar implementation (for yearmonth fields) can be found at e86d9a7
There is:
schema.ReadFromFile("schema.json")
Not sure whether there is a counterpart for remote schemas.
Inside datetime.go, in case of an error the function returns time.Now():
func decodeYearMonth(value string, c Constraints) (time.Time, error) {
y, err := decodeYearMonthWithoutChecks(value)
if err != nil {
return time.Now(), err
}
This could cause a bug when the caller of the function does not check the returned error. I would suggest changing the return value to a zero time.Time:
return time.Time{}, err
This would make CastRow faster (in particular for tables with many fields)
The flexibility in schema description seems to play really well in dynamically typed languages like Python or JavaScript. For example, let's take a look at the following code fragment from tableschema-py:
def skip_under_30(erows):
for number, headers, row in erows:
krow = dict(zip(headers, row))
if krow['age'] >= 30:
yield (number, headers, row)
table = Table(SOURCE, post_cast=[skip_under_30])
The Go and Java counterparts of the tableschema library could simply translate this code using map[string]interface{} and Map<String, Object>, respectively. All type checking is gone, and development becomes slower and more error prone, as the code would be full of type conversions (likely with lots of methods converting maps back and forth to POJOs). I have the feeling this is not ideal.
Some research would lead to APIs like Amazon's DynamoDB, which relies on objects like Item. Even though the Item object is super general, it has methods that return its inner contents using the right type. It is important to notice that access to an item's fields is done through a string key, so the developer needs to take extra care to use the same constant when creating, updating (describing) and using the item. I feel this is an improvement over pure general maps, but still not ideal.
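The DynamoDB-style approach can be sketched in a few lines of Go. The `item` type and its getters are hypothetical, purely to illustrate the trade-off described above:

```go
package main

import "fmt"

// item mimics a DynamoDB-style general container with typed getters.
// Field access goes through a string key, so the caller must reuse the
// same constant everywhere. All names here are illustrative.
type item map[string]interface{}

// GetString returns the value for key if it is a string.
func (it item) GetString(key string) (string, bool) {
	s, ok := it[key].(string)
	return s, ok
}

// GetInt returns the value for key if it is an int.
func (it item) GetInt(key string) (int, bool) {
	i, ok := it[key].(int)
	return i, ok
}

func main() {
	const ageKey = "age" // the string constant the developer must keep in sync
	it := item{"name": "Ada", ageKey: 36}

	if age, ok := it.GetInt(ageKey); ok {
		fmt.Println("age:", age)
	}
	// A typo in the key fails silently at runtime instead of at compile time.
	if _, ok := it.GetInt("aeg"); !ok {
		fmt.Println("typo in key: value not found")
	}
}
```

The typed getters recover some safety compared to a bare map, but the string-keyed lookup is exactly the weak spot the paragraph above points out.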
Another option would be Go's database/sql Query. The difference from the above is that type conversion is done transparently (without the need to invoke the right method based on the type). It is important to notice that it is fully positional: the order of the queried columns must exactly match the order of the arguments passed to Rows.Scan. Depending on your background, you may feel this improves things a bit. I feel we are exchanging the field name (the string constant above) for a field id.
More research and thought led me to remember Google's protocol buffers approach (not suggesting to use protocol buffers here): we could generate the code to access the table from the table schema. We could even generate the JSON and CSV serializers.
I would argue that these objects need to be shipped in the data package describing them. As for the other two cons, it is really a trade-off. I am personally in favor of having the tableschema team (and maybe external contributors) create and maintain the code generator. I believe that would lead to much cleaner and more performant code for reading, processing and extracting useful information from data, which is where the bulk of the complexity will (and should) be.
It seems the lib supports all Table Schema types except "any" type. If it's not applicable for Go we need some kind of fallback to still support descriptors with "any" type.
Currently, Infer and InferImplicitCasting only receive a table as a parameter, but there are some other important inputs, for instance missingValues.
As an example, please take a look at the Schema and Configuration section of the Readme. When "N/A" is used as a missing value, the algorithm cannot properly detect the integer type. So my first suggestion would be to follow the functional options approach, which is already used in csv.NewTable.
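A sketch of how a functional option for missing values could look; `inferType`, `WithMissingValues` and the toy digit-only inference are assumptions for the example, not the library's actual API:

```go
package main

import "fmt"

// inferConfig collects optional inputs to inference; WithMissingValues
// is a functional option in the style csv.NewTable already uses.
// All names are illustrative, not the library's actual API.
type inferConfig struct {
	missingValues []string
}

type inferOption func(*inferConfig)

// WithMissingValues declares which raw strings stand for missing data.
func WithMissingValues(values ...string) inferOption {
	return func(c *inferConfig) { c.missingValues = values }
}

// inferType is a toy inference: a column is "integer" only if every
// non-missing cell consists of digits.
func inferType(column []string, opts ...inferOption) string {
	var cfg inferConfig
	for _, opt := range opts {
		opt(&cfg)
	}
	missing := make(map[string]bool)
	for _, v := range cfg.missingValues {
		missing[v] = true
	}
	for _, v := range column {
		if missing[v] {
			continue // "N/A" etc. no longer break integer detection
		}
		for _, r := range v {
			if r < '0' || r > '9' {
				return "string"
			}
		}
	}
	return "integer"
}

func main() {
	col := []string{"1", "N/A", "3"}
	fmt.Println(inferType(col))                           // string
	fmt.Println(inferType(col, WithMissingValues("N/A"))) // integer
}
```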
Given an import like github.com/frictionlessdata/tableschema-go/csv in a program that also uses the standard encoding/csv, won't namespace collisions happen?
Yes, we can disambiguate with a named import, but it might be better not to make new users deal with that.
According to the Table Schema specs, time fields can have two constraints associated with them: maximum and minimum.
A similar implementation (for yearmonth fields) can be found at e86d9a7
Let's start with only lists and then extend the support to string.
Description of the field: https://specs.frictionlessdata.io/table-schema/#primary-key
This issue documents the initial steps to get started with a new Frictionless Data implementation.
According to the Table Schema specs, date fields can have two constraints associated with them: maximum and minimum.
A similar implementation (for yearmonth fields) can be found at e86d9a7
According to the Table Schema specs, string fields can have several constraints associated with them: pattern, minLength and maxLength.
A similar implementation (for yearmonth fields) can be found at e86d9a7
A fragment from the spec
string: The field contains strings, that is, sequences of characters.
format:
default: any valid string.
email: A valid email address.
uri: A valid URI.
binary: A base64 encoded string representing binary data.
uuid: A string that is a uuid.
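A sketch of the pattern/minLength/maxLength checks for string fields; the `stringConstraints` struct and helper are assumptions for illustration, not the library's actual types:

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// stringConstraints mirrors the spec's pattern/minLength/maxLength
// constraints for string fields; names are illustrative.
type stringConstraints struct {
	Pattern   string
	MinLength int
	MaxLength int // 0 means "not set"
}

// checkString validates a value against the constraints, counting
// length in runes rather than bytes.
func checkString(v string, c stringConstraints) error {
	n := utf8.RuneCountInString(v)
	if n < c.MinLength {
		return fmt.Errorf("%q is shorter than minLength %d", v, c.MinLength)
	}
	if c.MaxLength > 0 && n > c.MaxLength {
		return fmt.Errorf("%q is longer than maxLength %d", v, c.MaxLength)
	}
	if c.Pattern != "" {
		ok, err := regexp.MatchString(c.Pattern, v)
		if err != nil {
			return err
		}
		if !ok {
			return fmt.Errorf("%q does not match pattern %q", v, c.Pattern)
		}
	}
	return nil
}

func main() {
	c := stringConstraints{Pattern: `^[a-z]+$`, MinLength: 2, MaxLength: 5}
	fmt.Println(checkString("abc", c))     // <nil>
	fmt.Println(checkString("toolong", c)) // maxLength error
}
```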
This issue describes the set of tasks to complete in order to finish up work on the library.
This feature probably refers to Table#Save() in tableschema-py.
Even though this is blocked by #3, let's make the possible ways forward explicit. The proposals are inspired by mgo#Iter, AWS DynamoDB and database/sql.
We are going to follow good practice, and a context object will be part of the API. This might allow setting things like timeouts or early termination when the outer context is Done.
Having the struct describing the row would allow us to have fragments like:
var results []people.Person
if err := table.All(ctx, &results); err != nil {
return err
}
for i, p := range results {
fmt.Printf("%d - %+v", i, p)
}
This loads all table records into memory and ranges over the results. If the table is too big, another option could be used:
var person people.Person
iter := table.Batch(100).Iter()
defer iter.Close()
for iter.Next(ctx, &person) {
fmt.Printf("%+v", person)
}
if err := iter.Err(); err != nil {
return err
}
Marking this issue as fixed includes augmenting the infer and cast methods.
Python implementation here
This will make it more flexible to declare structs to encode/decode table rows.
Example:
type MyRow struct {
MyName string `tableheader:"name"`
}
The MyName field will be populated with values from the "name" column (header).
For reference:
This includes changing the Travis CI config to test with Go 1.8 and 1.9.
The value of the field must exactly match a value in the enum array.
Unique constraint, e.g. in Python, is checked when iterating table rows - https://github.com/frictionlessdata/tableschema-py/blob/master/tableschema/table.py#L94-L104
Also, we should make sure we support the full list of constraints:
There are cases where the Field struct needs some processing before Field.Decode(); thus, the schema reading process ends up populating it. To embrace those cases, we should create a kind of constructor which receives a field struct and some other parameters and updates the field's internal state.