
tableschema-go


Table schema tooling in Go.

Getting started

Installation

This package uses semantic versioning 2.0.0.

Go version >= 1.11

Please use Go modules if you're using a version that supports them. To find out which version of Go you're running, run:

$ go version
go version go1.14 linux/amd64

If you're running go1.13+, you're good to go!

If you cannot upgrade right now, make sure your environment is using Go modules by setting the GO111MODULE environment variable. In bash, that can be done with the following command:

$ export GO111MODULE=on
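
With modules enabled, you can then fetch the package in the usual way, for example:

$ go get github.com/frictionlessdata/tableschema-go/csv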

Go version >= 1.8 && < 1.11

$ dep init
$ dep ensure -add github.com/frictionlessdata/tableschema-go/csv@>=0.1

Main Features

Tabular Data Load

Have tabular data stored in local files? Remote files? Packages like csv will help you load the data you need and make it ready for processing.

package main

import (
   "log"

   "github.com/frictionlessdata/tableschema-go/csv"
)

func main() {
   tab, err := csv.NewTable(csv.Remote("myremotetable"), csv.LoadHeaders())
   if err != nil {
      log.Fatal(err)
   }
   _ = tab // tab is now ready for processing.
}

Supported physical representations:

  • CSV (local or remote files, via the csv package)

Would you like to use tableschema-go but the physical representation you use is not listed here? No problem! Please create an issue before you start contributing. We will be happy to help you along the way.

Schema Inference and Configuration

Got that new dataset and want to start getting your hands dirty ASAP? No problem, let the schema package try to infer the data types based on the table data.

package main

import (
   "fmt"

   "github.com/frictionlessdata/tableschema-go/csv"
   "github.com/frictionlessdata/tableschema-go/schema"
)

func main() {
   tab, _ := csv.NewTable(csv.Remote("myremotetable"), csv.LoadHeaders())
   sch, _ := schema.Infer(tab)
   fmt.Printf("%+v", sch)
}

Want to go faster? Please give InferImplicitCasting a try and let us know how it goes.

There might be cases in which the inferred schema is not correct. One of those cases is when your data uses strings like "N/A" to represent missing cells. That would usually make our inference algorithm think the field is a string.

When that happens, you can manually perform those last-minute tweaks to the Schema:

   sch.MissingValues = []string{"N/A"}
   sch.GetField("ID").Type = schema.IntegerType

After all that, you could persist your schema to disk:

sch.SaveToFile("users_schema.json")

And use the local schema later:

sch, _ := schema.LoadFromFile("users_schema.json")

Finally, if your schema is saved remotely, you can also use it:

sch, _ := schema.LoadRemote("http://myfoobar/users/schema.json")

Processing Tabular Data

Once you have the data, you will want to process it using language data types. schema.CastTable and schema.CastRow are your friends on this journey.

package main

import (
   "github.com/frictionlessdata/tableschema-go/csv"
   "github.com/frictionlessdata/tableschema-go/schema"
)

type user struct {
   ID   int
   Age  int
   Name string
}

func main() {
   tab, _ := csv.NewTable(csv.FromFile("users.csv"), csv.LoadHeaders())
   sch, _ := schema.Infer(tab)
   var users []user
   sch.CastTable(tab, &users)
   // The users slice now contains the table contents properly cast into
   // language types. Each row is a new user appended to the slice.
}

If you have a lot of data and cannot load everything in memory, you can easily iterate through it:

...
   iter, _ := tab.Iter()
   for iter.Next() {
      var u user
      sch.CastRow(iter.Row(), &u)
      // Variable u is now filled with the row contents properly cast
      // into language types.
   }
...

If you store data in a GZIP file, you can load it compressed using the same csv.FromFile:

...
   tab, _ := csv.NewTable(csv.FromFile("users.csv.gz"), csv.LoadHeaders())
...

Even better if you could do it regardless of the physical representation! The table package declares some interfaces that will help you achieve this goal.
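
For example, a helper can be written against the table package's Table interface so the same code works for CSV or any other representation. A minimal sketch, assuming the Iter/Next iteration used in the examples above; the helper name is illustrative, not part of the library:

package process

import "github.com/frictionlessdata/tableschema-go/table"

// countRows depends only on the table.Table interface, so it works for any
// physical representation that implements it.
func countRows(t table.Table) (int, error) {
   iter, err := t.Iter()
   if err != nil {
      return 0, err
   }
   n := 0
   for iter.Next() {
      n++
   }
   return n, nil
}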

Field

The Field type represents a field in the schema.

For example, data values can be cast to native Go types. Casting a value checks whether it is of the expected type, whether it is in the correct format, and whether it complies with any constraints imposed by the schema.

{
    "name": "birthday",
    "type": "date",
    "format": "default",
    "constraints": {
        "required": true,
        "minimum": "2015-05-30"
    }
}

The following example will return an error because the passed-in date is earlier than the minimum allowed by the field's constraints. Errors are also returned when the user tries to cast values that are not well-formatted dates.

date, err := field.Cast("2014-05-29")
// uh oh, something went wrong

Values that can't be cast return an error. Casting a value that doesn't meet the constraints also returns an error.

Available types, formats and resultant value of the cast:

Type        Formats                        Casting result
any         default                        interface{}
object      default                        interface{}
array       default                        []interface{}
boolean     default                        bool
duration    default                        time.Time
geopoint    default, array, object         [float64, float64]
integer     default                        int64
number      default                        float64
string      default, uri, email, binary    string
date        default, any, <PATTERN>        time.Time
datetime    default, any, <PATTERN>        time.Time
time        default, any, <PATTERN>        time.Time
year        default                        time.Time
yearmonth   default                        time.Time

Saving Tabular Data

Once you're done processing the data, it is time to persist results. As an example, let us assume we have a remote table schema called summary, which contains two fields:

import (
   "os"
   "time"

   "github.com/frictionlessdata/tableschema-go/csv"
   "github.com/frictionlessdata/tableschema-go/schema"
)


type summaryEntry struct {
    Date time.Time
    AverageAge float64
}

func WriteSummary(summary []summaryEntry, path string) {
   sch, _ := schema.LoadRemote("http://myfoobar/users/summary/schema.json")

   f, _ := os.Create(path)
   defer f.Close()

   w := csv.NewWriter(f)
   defer w.Flush()

   w.Write([]string{"Date", "AverageAge"})
   for _, summ := range summary {
       row, _ := sch.UncastRow(summ)
       w.Write(row)
   }
}

API Reference and More Examples

More detailed documentation about the API methods and plenty of examples are available at https://godoc.org/github.com/frictionlessdata/tableschema-go

Contributing

Found a problem and would like to fix it? Have a great idea and would love to see it in the repository?

Please open an issue before you start working.

That could save a lot of time for everyone and we are super happy to answer questions and help you along the way. Furthermore, feel free to join the frictionlessdata Gitter chat room and ask questions.

This project follows the Open Knowledge International coding standards

  • Before start coding:

    • Fork and pull the latest version of the master branch
    • Make sure you have go 1.8+ installed and you're using it
    • Make sure you have dep installed
  • Before sending the PR:

$ cd $GOPATH/src/github.com/frictionlessdata/tableschema-go
$ dep ensure
$ go test ./...

And make sure all tests pass.


tableschema-go's Issues

Create a "constructor" for Field

There are cases where the Field struct needs to have some processing:

  • MissingValues is a field of the schema descriptor, but it is needed by Field.Decode(). Thus, the schema reading process ends up populating it.
  • The pattern constraint should be compiled only once, at field creation.

To embrace those cases, we should create a kind of constructor which receives a field struct and some other parameters and updates the field's internal state. A sketch of the idea follows.
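
A minimal sketch of what such a constructor could look like; the type and field names below are hypothetical, not the library's API:

package schema

import "regexp"

// fieldSpec and field are hypothetical names used only to illustrate the idea.
type fieldSpec struct {
   Name          string
   Pattern       string   // "pattern" constraint, compiled once below
   MissingValues []string // copied from the schema descriptor
}

type field struct {
   spec            fieldSpec
   compiledPattern *regexp.Regexp
}

// newField receives the raw descriptor data and updates the field's internal
// state: it keeps MissingValues and compiles the pattern constraint once.
func newField(spec fieldSpec) (*field, error) {
   f := &field{spec: spec}
   if spec.Pattern != "" {
      re, err := regexp.Compile(spec.Pattern)
      if err != nil {
         return nil, err
      }
      f.compiledPattern = re
   }
   return f, nil
}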

Potential namespace collision?

given an import like

github.com/frictionlessdata/tableschema-go/csv

that could happen in a program with the standard

encoding/csv

won't namespace collisions happen?
Yes, we can work around it with a named import, but it might be better not to have that happen to new users.
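
For reference, this is what the named-import workaround mentioned above looks like; a small, self-contained sketch:

package main

import (
   stdcsv "encoding/csv"
   "os"

   tscsv "github.com/frictionlessdata/tableschema-go/csv"
)

func main() {
   // Standard-library CSV writer, aliased as stdcsv.
   w := stdcsv.NewWriter(os.Stdout)
   defer w.Flush()

   // tableschema-go CSV table, aliased as tscsv.
   tab, err := tscsv.NewTable(tscsv.FromFile("users.csv"), tscsv.LoadHeaders())
   _, _ = tab, err
}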

Add Field.CastValue to Integers

  • Python implementation here
  • Tableschema spec is here

Fragment from the spec

The field contains integers - that are whole numbers.

Integer values are indicated in the standard way for any valid integer.

format: no options (other than the default).
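
A hedged sketch of what casting an integer field could look like, reusing the field.Cast call shown earlier in this README; the field definition and values are illustrative:

package main

import (
   "fmt"

   "github.com/frictionlessdata/tableschema-go/schema"
)

func main() {
   f := schema.Field{Name: "age", Type: schema.IntegerType}
   v, err := f.Cast("42")
   if err != nil {
      // The value is not a valid integer for this field.
      fmt.Println(err)
      return
   }
   // Per the casting table in the README, integers are cast to int64.
   fmt.Println(v.(int64) + 1)
}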

Support for type "any"

Overview

It seems the lib supports all Table Schema types except "any" type. If it's not applicable for Go we need some kind of fallback to still support descriptors with "any" type.

Create a design document

Besides aggregating information about the work to be done, implementation, and testing, the document should contain a proposal for the first version of the API.

The idea is to make it easier to discuss and gather comments. Having someone more acquainted with the libraries (and with experience in statically typed languages) sign off on the document would be great.

Implement unique constraint

This constraint can be used in all types:

If true, then all values for that field MUST be unique within the data file in which it is found.

In tableschema-go's implementation, it only makes sense to have the unique constraint checked during schema.DecodeTable.
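
A hypothetical sketch of how such a check could work while scanning a table's raw rows; the helper is illustrative, not the library's API:

package sketch

import "fmt"

// checkUnique returns an error if the column at index col contains a
// duplicate value, which would violate a "unique": true constraint.
func checkUnique(rows [][]string, col int) error {
   seen := make(map[string]bool)
   for _, row := range rows {
      v := row[col]
      if seen[v] {
         return fmt.Errorf("duplicate value %q violates the unique constraint", v)
      }
      seen[v] = true
   }
   return nil
}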

Add MissingValues option to Infer and InferImplicitCasting

Currently, Infer and InferImplicitCasting only receive a table as a parameter, but there are other important inputs, for instance missingValues.

As an example, please take a look at the Schema Inference and Configuration section of the Readme. When "N/A" is used as a missingValue, the algorithm cannot properly detect the integer type. So my first suggestion would be to follow the functional options approach, which was already used in csv.NewTable; a sketch follows.
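
A hedged sketch of what that could look like for Infer; the option names are hypothetical:

package sketch

// inferConfig and InferOpts illustrate the functional-options pattern,
// mirroring the one already used by csv.NewTable.
type inferConfig struct {
   missingValues []string
}

type InferOpts func(*inferConfig)

// WithMissingValues tells the inference algorithm which strings mark
// missing cells, e.g. "N/A".
func WithMissingValues(values ...string) InferOpts {
   return func(c *inferConfig) { c.missingValues = values }
}

// Usage (hypothetical): schema.Infer(tab, WithMissingValues("N/A"))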

dep install issue

So first.. I've been watching this progress along..
Very excited to start using this...

On that point though I tried dep (my first use of dep by the way)

I get

Fils:frictionlessdata dfils$ dep ensure -add github.com/frictionlessdata/tableschema-go@v0.1
ensure Solve(): No versions of github.com/frictionlessdata/tableschema-go met constraints:        v0.1: Could not introduce github.com/frictionlessdata/tableschema-go@v0.1, as its subpackage github.com/frictionlessdata/tableschema-go does not contain usable Go code (*build.NoGoError).. (Package is required by (root).)
        master: Could not introduce github.com/frictionlessdata/tableschema-go@master, as it is not allowed by constraint ^0.1.0 from project opencoredata.org/ocdGarden/frictionlessdata.
Fils:frictionlessdata dfils$

Is there a different version number I should use? How does one tell what versions are available?

Design document

A document describing an API proposal and some implementation aspects (i.e. coding, testing and performance considerations).

It should not be very long. The main goal is to enable discussion and get a sign-off before jumping into coding.

Maps versus strong types

The flexibility in schema description seems to play really well in dynamically typed languages like Python or JavaScript. For example, let's take a look at the following code fragment from tableschema-py:

def skip_under_30(erows):
    for number, headers, row in erows:
        krow = dict(zip(headers, row))
        if krow['age'] >= 30:
            yield (number, headers, row)
table = Table(SOURCE, post_cast=[skip_under_30])

The Go and Java counterparts of the tableschema library could simply translate this code using map[string]interface{} and Map<String, Object>, respectively. All type checking is gone, and development becomes slower and more error prone, as the code would be full of type conversions (likely lots of methods converting back and forth between maps and POJOs). I have the feeling this is not ideal.
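
To make the trade-off concrete, here is a small contrast between map-based and struct-based row access in Go (illustrative values only):

package main

import "fmt"

type person struct {
   Name string
   Age  int
}

func main() {
   // Map-based access: no compile-time checks, type assertions everywhere.
   row := map[string]interface{}{"name": "Foo", "age": 42}
   age, ok := row["age"].(int) // runtime assertion; ok is false on a mismatch
   fmt.Println(age, ok)

   // Struct-based access: the compiler checks field names and types.
   p := person{Name: "Foo", Age: 42}
   fmt.Println(p.Age + 1) // no conversion needed
}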

Some research would lead to APIs like Amazon's DynamoDB, which relies on objects like Item. Even though the Item object is super general, it has methods that return its inner contents using the right type. Note that accessing a field of an item is done through a string, so the developer needs to take extra care to use the same constant when creating, updating (describing) and using the item. I feel this is an improvement over pure general maps, but still not ideal.

Another option would be Go's database/sql Query. The difference from the above is that type conversion is done transparently (without needing to invoke the right method based on the type). Note that it is fully positional: the order of the query columns is the exact same order used by the Rows.Scan method. Depending on your background, this may feel like it improves things a bit. I feel we are exchanging the field name (the string constant above) for a field index.

More research and thought led me to remember Google's protocol buffers approach (not suggesting to use protocol buffers here): we could generate the code to access the table from the table schema. We could even generate the JSON and CSV serializers.

  • Pros: cleaner, more idiomatic and more performant code
  • Cons:
    • The object describing each table row would need to be generated beforehand.
    • Need to maintain one more piece of software: the code generator.
    • Supporting more languages requires adding a new generator.

I would argue that these objects should be shipped in the data package describing them. As for the other two cons, it is really a trade-off. I am personally in favor of having the tableschema team (and maybe external contributors) create and maintain the code generator. I believe that would lead to much cleaner and more performant code for reading, processing and extracting useful information from data, and this is where the bulk of the complexity will (and should) be.


  • Edit: Added database/sql option

Getting started

Description

This issue documents the initial steps to get started with a new Frictionless Data implementation.

Tasks

  • Travis configuration
  • Coveralls configuration
  • Basic setup of README with badges
  • Basic setup of license
  • Review the whole family of specifications
  • Review the implementation notes
  • Review either the Python (Data Package Table Schema) or JavaScript (Data Package Table Schema) reference implementations (whichever language you feel most comfortable reading)
    • Note that we have high nineties test coverage on these libraries. Similar test coverage is expected here
  • Review the stack reference
  • Review the blog post that announces v1 of the specifications
  • Review the test packages that can be used to test your work (in addition to the normal and expected unit tests)
  • Review the OKFN Coding Standards
    • Parts of the coding standard are language specific, and parts are workflow specific. The workflow points are important for you. If you want to contribute language-related standards to our docs for your language, we welcome it!
  • Write a set of high-level issues for each library, on the respective issue tracker, that outline the work plan
    • Note the structure of this issue: A narrative description and a specific list of tasks. Follow a similar pattern
    • The sequence of work is important: start with the Table Schema library first, as the Data Package library has a direct dependency on it
  • Note the communication protocol for this work: All communication around the implementation must be in public. We want our work dynamic here to serve as an example for other implementors, and to share as much information as possible. There are two channels of communication:
  • Note the staff from OKI who are here to support you:
    • Jo Barratt - Project Manager for Frictionless Data
    • Evgeny Karev - Tech Lead for core Frictionless Data libraries
    • Serah Rono - Developer Advocate at Open Knowledge International, Dissemination Lead for the Tool Fund
    • Dan Fowler - Developer Advocate at Open Knowledge International, Pilot Lead for Frictionless Data, OKI Labs Lead
    • Adam Kariv - Engineering Lead at Open Knowledge International, Tech Lead on OpenSpending
    • Paul Walsh - Chief Product Officer at Open Knowledge International
  • Any communication around the grant agreement should be directly done by email with Jo Barratt, Frictionless Data Project Manager

Release version-1

Description

This issue describes the set of tasks to complete in order to finish up work on the library.

Tasks

  • Touch base with @jobarratt and @pwalsh to notify them that you consider the work complete
  • Provide a short description / link to code for how each action is implemented, with a link to unit tests that prove each action
  • Tag your candidate code as v0.1
  • Set up Travis to auto deploy tagged versions to the package management solution for your language
  • Ensure that the OKI account on the package management platform is an administrator/maintainer of the package, along with yourself
  • Receive code review from @pwalsh and address any remaining issues
  • Publish final version

Refactor datetime.go file

Inside datetime.go, in case of an error the function returns time.Now():

func decodeYearMonth(value string, c Constraints) (time.Time, error) {
	y, err := decodeYearMonthWithoutChecks(value)
	if err != nil {
		return time.Now(), err
	}

This could cause a bug when the caller of the function does not check the returned error. I would suggest changing the return value to an empty time.Time:

return time.Time{}, err
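
Applied to the fragment above, the suggested change would look roughly like this (the final return is an assumption about the rest of the original function):

func decodeYearMonth(value string, c Constraints) (time.Time, error) {
   y, err := decodeYearMonthWithoutChecks(value)
   if err != nil {
      // Return the zero value instead of time.Now() so callers that
      // ignore the error do not get a misleading timestamp.
      return time.Time{}, err
   }
   return y, nil // assumed remainder of the original function
}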

Add Boolean support to CastValue and TestValue

boolean
The field contains boolean (true/false) data.

In the physical representations of data where boolean values are represented with strings, the values set in trueValues and falseValues are to be cast to their logical representation as booleans. trueValues and falseValues are arrays which can be customised to user need. The default values for these are in the additional properties section below.

The boolean field can be customised with these additional properties:

trueValues: [ "true", "True", "TRUE", "1" ]
falseValues: [ "false", "False", "FALSE", "0" ]
format: no options (other than the default).
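
A hedged sketch of what boolean casting with custom values could look like, assuming the Field struct exposes trueValues/falseValues as Go fields (the field names and values here are illustrative and may differ from the actual API):

package main

import (
   "fmt"

   "github.com/frictionlessdata/tableschema-go/schema"
)

func main() {
   f := schema.Field{
      Name:        "active",
      Type:        schema.BooleanType,
      TrueValues:  []string{"yes", "Y"},
      FalseValues: []string{"no", "N"},
   }
   v, err := f.Cast("yes")
   fmt.Println(v, err) // expected: true <nil>
}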

Add Field.Test support to strings

  • Python implementation here
  • Tableschema spec is here

A fragment from the spec

string: The field contains strings, that is, sequences of characters.

format:
default: any valid string.
email: A valid email address.
uri: A valid URI.
binary: A base64 encoded string representing binary data.
uuid: A string that is a uuid.
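
A hedged sketch of how a string-format test could be used, assuming a Field.Test(value) bool helper as implied by the issue title; the field definition and values are illustrative only:

package main

import (
   "fmt"

   "github.com/frictionlessdata/tableschema-go/schema"
)

func main() {
   f := schema.Field{Name: "contact", Type: schema.StringType, Format: "email"}
   // Test reports whether the value conforms to the field's type and format.
   fmt.Println(f.Test("not-an-email"))        // expected: false
   fmt.Println(f.Test("someone@example.com")) // expected: true
}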

Add Table Iterator

Even though this is blocked by #3, let's make the possible ways to go explicit. The proposals are inspired by mgo#Iter, AWS DynamoDB and database/sql.

We are going to follow good practices and a context object will be part of the API. This might allow setting things like timeouts or early termination if the outer context is done.

Having the struct describing the row would allow us to have fragments like:

var results []people.Person
if err := table.All(ctx, &results); err != nil {
    return err
}
for i,p := range results {
    fmt.Printf("%d - %+v", i, p)
}

Which loads all table records to memory and ranges over its results. If the table is too big, another option could be used:

var person people.Person
iter := table.Batch(100).Iter()
defer iter.Close()
for iter.Next(ctx, &person) {
    fmt.Printf("%+v", p)
}
if err := iter.Err(); err != nil {
    return err
}

  • Edit: Added context to the more complicated case
