
s5tf's Introduction

S5TF

S5TF general helper utilities

Concepts

DataLoaders

Data loaders are objects that load data and make it iterable in mini batches. You can create a custom data loader tailored to your specific needs, or use one of the default data loaders.

An example of using a data loader (inspired by UCI Iris):

let dataLoader = CSVDataLoader(fromFileAt: URL(string: "~/.s5tf-datasets/iris/iris.csv")!,
                               columnNames: ["sepal length in cm",
                                             "sepal width",
                                             "petal length",
                                             "petal width",
                                             "species"],
                               featureColumnNames: ["sepal length in cm",
                                                    "sepal width",
                                                    "petal length",
                                                    "petal width"],
                               labelColumnNames: ["species"])

for batch in dataLoader.batched(32) {
    print(batch.data, batch.labels)
}
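
A custom data loader could look roughly like the sketch below. It is a minimal in-memory illustration, not the S5TF API: the names `Batch` and `InMemoryDataLoader` are hypothetical, and plain Swift arrays stand in for the tensors the real loaders presumably return.

```swift
// Hypothetical batch type; arrays are used here in place of tensors.
struct Batch {
    let data: [[Float]]
    let labels: [Int]
}

// A minimal in-memory loader that yields mini batches (illustrative only).
struct InMemoryDataLoader: Sequence, IteratorProtocol {
    let data: [[Float]]
    let labels: [Int]
    var batchSize = 1
    private var position = 0

    init(data: [[Float]], labels: [Int]) {
        precondition(data.count == labels.count, "Every row needs a label")
        self.data = data
        self.labels = labels
    }

    // Returns a copy configured with the given batch size.
    func batched(_ batchSize: Int) -> InMemoryDataLoader {
        precondition(batchSize >= 1, "Batch size must be at least 1")
        var copy = self
        copy.batchSize = batchSize
        copy.position = 0
        return copy
    }

    // Produces the next batch, or nil once all rows have been consumed.
    mutating func next() -> Batch? {
        guard position < data.count else { return nil }
        let end = Swift.min(position + batchSize, data.count)
        defer { position = end }
        return Batch(data: Array(data[position..<end]),
                     labels: Array(labels[position..<end]))
    }
}

let inMemoryLoader = InMemoryDataLoader(data: [[5.1, 3.5], [4.9, 3.0], [6.2, 3.4]],
                                        labels: [0, 0, 1])
for batch in inMemoryLoader.batched(2) {
    print(batch.data, batch.labels)
}
```

Because the type conforms to both `Sequence` and `IteratorProtocol`, it can be used directly in a `for` loop; the last batch is simply smaller when the row count is not a multiple of the batch size.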

Check out s5tf-team/datasets for predefined data loaders for a selection of public datasets.

Contributing ❤️

Thanks for even considering contributing.

Make sure to run swiftlint on your code. If you are not sure about how to format something, refer to the Google Swift Style Guide.

We use jazzy to generate documentation for this project. If your contribution creates new objects, please create documentation with the following command:

jazzy \
--author S5TF Team \
--author_url http://s5tf-team.github.io \
--github_url https://github.com/s5tf-team/ \
--theme fullwidth

Please link to the completed GitHub Actions build test in your fork with your PR.

s5tf's People

Contributors

rickwierenga, williamhyzhang

s5tf's Issues

Make CSVDataLoader work with more types

We should aim to be able to support datasets like these. Perhaps we need to extend the label logic to work with string labels and convert them to Ints. Or we could use DataLoader<Any> once generics are supported.
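
One way the string-to-Int conversion could work is sketched below. The helper `encodeLabels` is hypothetical and not part of CSVDataLoader; it assigns class indices in order of first appearance and returns the vocabulary so indices can be mapped back to names.

```swift
// Maps string labels to integer class indices in order of first appearance.
// Hypothetical helper for illustration; not part of CSVDataLoader.
func encodeLabels(_ labels: [String]) -> (indices: [Int], vocabulary: [String]) {
    var lookup: [String: Int] = [:]
    var vocabulary: [String] = []
    let indices = labels.map { label -> Int in
        if let index = lookup[label] { return index }
        let index = vocabulary.count
        lookup[label] = index
        vocabulary.append(label)
        return index
    }
    return (indices, vocabulary)
}

let encodedLabels = encodeLabels(["setosa", "virginica", "setosa", "versicolor"])
print(encodedLabels.indices)     // [0, 1, 0, 2]
print(encodedLabels.vocabulary)  // ["setosa", "virginica", "versicolor"]
```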

Update proposal for S5TFDataLoader

As @WilliamHYZhang and I discussed we need shuffling for data loaders. I would like to make a proposal to facilitate this.

We need to store an array of indices on a data loader. The element type of this array is an associatedtype Index on S5TFDataLoader, which each implementor sets with a type alias. An implementation of S5TFDataLoader needs to be able to load the element at a given Index, so we add the following requirement to the protocol:

func getElement(at index: Index) -> Tensor<DataType>

where DataType is the associated data type discussed in s5tf-team/datasets#14 (comment).

For example, indices would be an array of row numbers in CSVDataLoader, an array of IDs in MNIST and comparable datasets, and an array of paths in Imagenette.

We should add shuffled() in a protocol extension:

func shuffled() -> Self {
    var mutableSelf = self.copy()
    mutableSelf.indices = self.indices.shuffled()
    return mutableSelf
}

This uses copy() which should be another requirement of S5TFDataLoader. This will allow for more default implementations like the following:

func batched(_ batchSize: Int) -> Self {
    guard batchSize >= 1 else {
        fatalError("Batch size must be greater than or equal to 1")
    }

    guard batchSize <= count else {
        fatalError("Batch size must be less than or equal to the number of items.")
    }

    var mutableSelf = self.copy()
    mutableSelf.batchSize = batchSize
    return mutableSelf
}

getElement(at:) should be used by next(). If we write a function createBatch(from indices:), we can write next() in advance too. Implementing a data loader would then only involve writing a few basic functions, because everything else is handled by S5TFDataLoader. The implementor should not have to duplicate the next() functionality for every dataset, because it will always be roughly the same code.

indices also allows for a default implementation of count:

var count: Int {
    return indices.count
}

This will save a lot of work in future implementations of datasets.
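
Put together, the proposal might look roughly like the sketch below. Several things are assumptions made for illustration: the protocol is named `DataLoaderSketch` to avoid implying this is the final S5TFDataLoader, `[Float]` stands in for `Tensor<DataType>`, the `position` property and the `RowLoader` type are hypothetical, and the protocol does not conform to `Sequence` here so that `shuffled()` does not collide with the standard library method of the same name.

```swift
// Sketch of the proposed protocol. [Float] stands in for Tensor<DataType>.
protocol DataLoaderSketch {
    associatedtype Index
    var indices: [Index] { get set }
    var batchSize: Int { get set }
    var position: Int { get set }  // Offset of the next batch within `indices` (assumed).
    func getElement(at index: Index) -> [Float]
    func copy() -> Self
}

extension DataLoaderSketch {
    var count: Int { return indices.count }

    // Returns a copy whose indices are in random order.
    func shuffled() -> Self {
        var mutableSelf = copy()
        mutableSelf.indices = indices.shuffled()
        return mutableSelf
    }

    // Returns a copy configured with the given batch size.
    func batched(_ batchSize: Int) -> Self {
        precondition(batchSize >= 1 && batchSize <= count,
                     "Batch size must be between 1 and the number of items")
        var mutableSelf = copy()
        mutableSelf.batchSize = batchSize
        return mutableSelf
    }

    // Loads every element of one batch through getElement(at:).
    func createBatch(from batchIndices: [Index]) -> [[Float]] {
        return batchIndices.map { getElement(at: $0) }
    }

    // Shared by all loaders: slices the next indices and builds the batch.
    mutating func next() -> [[Float]]? {
        guard position < count else { return nil }
        let end = min(position + batchSize, count)
        defer { position = end }
        return createBatch(from: Array(indices[position..<end]))
    }
}

// A loader now only supplies storage, getElement(at:), and copy().
struct RowLoader: DataLoaderSketch {
    let rows: [[Float]]
    var indices: [Int]
    var batchSize = 1
    var position = 0

    init(rows: [[Float]]) {
        self.rows = rows
        self.indices = Array(rows.indices)
    }

    func getElement(at index: Int) -> [Float] { return rows[index] }
    func copy() -> RowLoader { return self }
}

var rowLoader = RowLoader(rows: [[1], [2], [3], [4], [5]]).shuffled().batched(2)
while let batch = rowLoader.next() {
    print(batch)
}
```

For a value type, `copy()` can simply return `self`; a class-based loader would need to construct a new instance instead.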

Data augmentation proposal

It would be great if DataLoaders could facilitate data augmentation on batches. I would like to make a proposal for a possible implementation of this.

Start by defining what an augmentation is:

typealias Augmentation = (SomeBatch) -> SomeBatch

Then create a modifier as proposed in #14:

func augmenting(_ augmentation: @escaping Augmentation) -> Self {
    var mutableSelf = self.copy()
    mutableSelf.augmentation = augmentation
    return mutableSelf
}

Make this a requirement on all S5TFDataLoaders:

var augmentation: Augmentation? { get }

Apply the augmentation when loading data (in getElement(at:)):

listOfData.map(augmentation)

Result:

MNIST.train.augmenting(flipRight).augmenting(blur(0.2)).batched(32)
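
One detail worth noting: if augmenting(_:) simply overwrites the stored closure, only the last augmentation in a chain would apply. The sketch below composes closures instead, so chained calls run in order. Everything in it is an assumption for illustration: `ImageBatch` is a nested array rather than a tensor, and `AugmentingLoader`, `flipHorizontally`, and `scaled(by:)` are hypothetical names.

```swift
// Hypothetical batch type: each image is a grid of pixel rows.
typealias ImageBatch = [[[Float]]]
typealias Augmentation = (ImageBatch) -> ImageBatch

struct AugmentingLoader {
    let images: ImageBatch
    var augmentation: Augmentation?

    init(images: ImageBatch) {
        self.images = images
        self.augmentation = nil
    }

    // Composes the new augmentation with any existing one so chained calls all apply.
    func augmenting(_ next: @escaping Augmentation) -> AugmentingLoader {
        var mutableSelf = self
        if let current = augmentation {
            mutableSelf.augmentation = { next(current($0)) }
        } else {
            mutableSelf.augmentation = next
        }
        return mutableSelf
    }

    // Applies the stored augmentation, if any, when data is loaded.
    func load() -> ImageBatch {
        return augmentation?(images) ?? images
    }
}

// Mirrors every image horizontally.
let flipHorizontally: Augmentation = { batch in
    batch.map { image in image.map { row in Array(row.reversed()) } }
}

// Returns an augmentation that multiplies every pixel by `factor`.
func scaled(by factor: Float) -> Augmentation {
    return { batch in
        batch.map { image in image.map { row in row.map { $0 * factor } } }
    }
}

let augmentedLoader = AugmentingLoader(images: [[[1, 2], [3, 4]]])
    .augmenting(flipHorizontally)
    .augmenting(scaled(by: 0.5))
print(augmentedLoader.load())  // [[[1.0, 0.5], [2.0, 1.5]]]
```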

We should offer a list of predefined augmentations using SwiftCV.

Simplified example: https://colab.research.google.com/drive/1YCyx59FXcDXDrtGBHcRFKmbakSNScAJk.

Add tests for DataLoaders

Data loaders are currently untested, and we need to figure out a way to test them to ensure the API is robust. @WilliamHYZhang, what do you think about storing small test files, as TFDS does?
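
As a rough illustration of the small-fixture idea, the sketch below writes a three-row CSV to a temporary file and checks what comes back. The functions `readCSVRows(at:)` and `writeFixture()` are hypothetical stand-ins, not the CSVDataLoader API, and a real test would use XCTest with a fixture file checked into the repository.

```swift
import Foundation

// Hypothetical stand-in for a loader: parses a CSV file into rows of fields.
func readCSVRows(at url: URL) throws -> [[String]] {
    let contents = try String(contentsOf: url, encoding: .utf8)
    return contents
        .split(separator: "\n")
        .map { line in line.split(separator: ",").map { String($0) } }
}

// Writes a tiny fixture to a unique temporary file and returns its location.
func writeFixture() throws -> URL {
    let url = FileManager.default.temporaryDirectory
        .appendingPathComponent("s5tf-fixture-\(UUID().uuidString).csv")
    try "5.1,3.5,0\n4.9,3.0,0\n6.2,3.4,1\n".write(to: url, atomically: true, encoding: .utf8)
    return url
}

do {
    let fixtureURL = try writeFixture()
    defer { try? FileManager.default.removeItem(at: fixtureURL) }
    let rows = try readCSVRows(at: fixtureURL)
    assert(rows.count == 3)
    assert(rows[0] == ["5.1", "3.5", "0"])
    print("Fixture check passed with \(rows.count) rows")
} catch {
    print("Fixture check failed: \(error)")
}
```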

Utils structuring

I think we should structure S5TFUtils like so:

.
├── S5TFUtils.swift
├── S5TFUtils+IO.swift
├── ...

where +IO is an extension (by convention).

public extension S5TFUtils {
    struct IO {
        public static func ...() {
            ...
        }
    }
}

S5TFUtils.IO.downloadAndExtract()
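
A compilable version of this layout might look like the sketch below. It swaps in uninhabited enums as namespaces instead of the struct shown above, which prevents accidental instantiation, and the helper `fileName(ofPath:)` is a hypothetical placeholder rather than an existing S5TFUtils function.

```swift
// S5TFUtils.swift: an enum without cases works as a pure namespace.
public enum S5TFUtils {}

// S5TFUtils+IO.swift: IO helpers grouped in a nested namespace.
extension S5TFUtils {
    public enum IO {
        // Hypothetical helper: returns the last component of a slash-separated path.
        public static func fileName(ofPath path: String) -> String {
            return path.split(separator: "/").last.map { String($0) } ?? ""
        }
    }
}

print(S5TFUtils.IO.fileName(ofPath: "~/.s5tf-datasets/iris/iris.csv"))  // iris.csv
```

Helpers must be declared static for the S5TFUtils.IO.someFunction() call style to work without creating an instance.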

How to handle NaN?

Both the current and new (#29) implementations of the CSV reader try to parse each item as a Float; when that fails, the value is assumed to be a string and encoded accordingly. As a result, a "NaN" or "?" entry is encoded as if it were a String.

My proposal is to let the user specify, when initializing the CSVDataLoader, a string that should be encoded as Float.nan.
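
A minimal sketch of that idea follows. The names `CSVValue`, `parseCell`, and `missingValueMarkers` are hypothetical and do not reflect the actual CSV reader; the sketch also accepts a set of markers rather than a single string, which is an extension of the proposal.

```swift
// Hypothetical cell representation; the real reader's encoding may differ.
enum CSVValue {
    case number(Float)
    case text(String)
}

// Encodes user-specified markers as Float.nan before trying the usual parsing.
func parseCell(_ raw: String, missingValueMarkers: Set<String>) -> CSVValue {
    if missingValueMarkers.contains(raw) { return .number(.nan) }
    if let value = Float(raw) { return .number(value) }
    return .text(raw)
}

let parsedCells = ["5.1", "?", "setosa"].map {
    parseCell($0, missingValueMarkers: ["?"])
}
for cell in parsedCells {
    switch cell {
    case .number(let value): print("number:", value)
    case .text(let string): print("text:", string)
    }
}
```

Checking the markers first means the caller's choice always wins, regardless of how the numeric parser would treat the same string.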
