s5tf-team / s5tf

S5TF general helper utilities

Home Page: https://s5tf-team.github.io/S5TF/

License: Apache License 2.0

Swift 100.00%

swift s4tf s5tf machine-learning swift-for-tensorflow

s5tf's Introduction

S5TF

Concepts

DataLoaders

Data loaders load data and make it iterable in mini-batches. You can write a custom data loader tailored to your needs, or use one of the default data loaders:

CSVDataLoader

An example of using a data loader (inspired by UCI Iris):

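import Foundation

// URL(string:) does not expand "~", so build a file URL from the expanded path.
let irisPath = NSString(string: "~/.s5tf-datasets/iris/iris.csv").expandingTildeInPath
let dataLoader = CSVDataLoader(fromFileAt: URL(fileURLWithPath: irisPath),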
                               columnNames: ["sepal length in cm",
                                             "sepal width",
                                             "petal length",
                                             "petal width",
                                             "species"],
                               featureColumnNames: ["sepal length in cm",
                                                    "sepal width",
                                                    "petal length",
                                                    "petal width"],
                               labelColumnNames: ["species"])

for batch in dataLoader.batched(32) {
    print(batch.data, batch.labels)
}

Check out s5tf-team/datasets for predefined data loaders for a selection of public datasets.

Contributing ❤️

Thanks for even considering contributing.

Make sure to run swiftlint on your code. If you are not sure about how to format something, refer to the Google Swift Style Guide.

We use jazzy to generate documentation for this project. If your contribution adds new public types or functions, please regenerate the documentation with the following command:

jazzy \
--author S5TF Team \
--author_url http://s5tf-team.github.io \
--github_url https://github.com/s5tf-team/ \
--theme fullwidth

In your PR, please link to the completed GitHub Actions build in your fork.

s5tf's People

Contributors

williamhyzhang rickwierenga

s5tf's Issues

Make S5TFDataLoader and S5TFBatch generic

See s5tf-team/datasets#14 (comment).

Make CSVDataLoader work with more types

We should aim to be able to support datasets like these. Perhaps we need to extend the label logic to work with string labels and convert them to Ints. Or we could use DataLoader<Any> once generics are supported.
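
The string-label idea above could look roughly like the following. This is only a sketch: LabelEncoder and its methods are hypothetical names, not part of S5TF, and it assigns integer indices in order of first appearance.

```swift
// Hypothetical helper, not part of S5TF: maps each distinct string label
// to an integer index, in order of first appearance.
struct LabelEncoder {
    private(set) var indexForLabel: [String: Int] = [:]
    private(set) var labels: [String] = []

    // Returns the index for `label`, assigning a new index if it is unseen.
    mutating func encode(_ label: String) -> Int {
        if let index = indexForLabel[label] {
            return index
        }
        let index = labels.count
        indexForLabel[label] = index
        labels.append(label)
        return index
    }

    // Returns the original label for an index, or nil if it is out of range.
    func decode(_ index: Int) -> String? {
        return labels.indices.contains(index) ? labels[index] : nil
    }
}

var encoder = LabelEncoder()
let species = ["setosa", "versicolor", "setosa", "virginica"]
let encoded = species.map { encoder.encode($0) }
```

Keeping the reverse mapping around means predictions can be turned back into the original label strings.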

Update proposal for S5TFDataLoader

As @WilliamHYZhang and I discussed, we need shuffling for data loaders. I would like to propose a way to support this.

A data loader needs to store an array of indices. S5TFDataLoader gets an associatedtype Index, and each implementor sets the element type of this array by defining Index (for example with a type alias). An implementation of S5TFDataLoader must be able to load the element at a given Index, so we add the following requirement to the protocol:

func getElement(at index: Index) -> Tensor<DataType>

where DataType is the associated data type discussed in s5tf-team/datasets#14 (comment).

For example, indices would be an array of row numbers in CSVDataLoader, an array of IDs in MNIST and comparable datasets, and an array of paths in Imagenette.

We should add shuffled() in a protocol extension:

func shuffled() -> Self {
    var mutableSelf = self.copy()
    mutableSelf.indices = self.indices.shuffled()
    return mutableSelf
}

This uses copy() which should be another requirement of S5TFDataLoader. This will allow for more default implementations like the following:

func batched(_ batchSize: Int) -> Self {
    guard batchSize >= 1 else {
        fatalError("Batch size must be greater than or equal to 1")
    }

    guard batchSize <= count else {
        fatalError("Batch size must be less than or equal to the number of items")
    }

    var mutableSelf = self.copy()
    mutableSelf.batchSize = batchSize
    return mutableSelf
}

getElement(at:) should be used by next(). If we write a function createBatch(from indices:), we can provide a default implementation of next() too. Implementing a data loader would then only involve writing a few basic functions, because everything else is handled by S5TFDataLoader. The implementor should not have to duplicate the next() logic for every dataset, since it will always be roughly the same code.

indices also allows for a default implementation of count:

var count: Int {
    return indices.count
}

This will save a lot of work in future implementations of datasets.
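
Putting the pieces of this proposal together, a minimal sketch might look like the one below. It rests on several assumptions: plain [Float] rows stand in for Tensor<DataType> so that the code runs without TensorFlow, and DataLoaderSketch, InMemoryLoader, and the position cursor are hypothetical names used only for illustration.

```swift
// A batch of rows; [Float] stands in for Tensor<DataType> in this sketch.
typealias Batch = [[Float]]

// Hypothetical version of the proposed protocol requirements.
protocol DataLoaderSketch {
    associatedtype Index
    var indices: [Index] { get set }
    var batchSize: Int { get set }
    var position: Int { get set }  // hypothetical cursor into `indices`
    func copy() -> Self
    func getElement(at index: Index) -> [Float]
}

// Everything below is shared by all implementations.
extension DataLoaderSketch {
    var count: Int { return indices.count }

    func shuffled() -> Self {
        var mutableSelf = copy()
        mutableSelf.indices = indices.shuffled()
        return mutableSelf
    }

    func batched(_ batchSize: Int) -> Self {
        precondition(batchSize >= 1 && batchSize <= count,
                     "Batch size must be between 1 and the number of items")
        var mutableSelf = copy()
        mutableSelf.batchSize = batchSize
        return mutableSelf
    }

    func createBatch(from batchIndices: [Index]) -> Batch {
        return batchIndices.map { getElement(at: $0) }
    }

    // Returns the next batch, or nil once every index has been consumed.
    // The final batch may be smaller than batchSize.
    mutating func next() -> Batch? {
        guard position < count else { return nil }
        let end = Swift.min(position + batchSize, count)
        let batch = createBatch(from: Array(indices[position..<end]))
        position = end
        return batch
    }
}

// An implementor only supplies storage, copy() and getElement(at:).
struct InMemoryLoader: DataLoaderSketch {
    let rows: [[Float]]
    var indices: [Int]
    var batchSize = 1
    var position = 0

    init(rows: [[Float]]) {
        self.rows = rows
        self.indices = Array(rows.indices)
    }

    func copy() -> InMemoryLoader { return self }  // structs copy on assignment
    func getElement(at index: Int) -> [Float] { return rows[index] }
}

var loader = InMemoryLoader(rows: [[1], [2], [3], [4], [5]]).shuffled().batched(2)
var batchSizes: [Int] = []
var seen: [Float] = []
while let batch = loader.next() {
    batchSizes.append(batch.count)
    seen.append(contentsOf: batch.map { $0[0] })
}
```

In this sketch the last batch is allowed to be smaller than the batch size; dropping it instead would be an equally reasonable choice.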

Data augmentation proposal

It would be great if DataLoaders could facilitate data augmentation on batches. I would like to make a proposal for a possible implementation of this.

Start by defining what an augmentation is:

typealias Augmentation = (SomeBatch) -> SomeBatch

Then create a modifier as proposed in #14:

func augmenting(_ augmentation: @escaping Augmentation) -> Self {
    var mutableSelf = self.copy()
    mutableSelf.augmentation = augmentation
    return mutableSelf
}

Make this a requirement on all S5TFDataLoaders:

// Settable so that augmenting(_:) can assign it on the copy.
var augmentation: Augmentation? { get set }

Apply the augmentation when loading data (in getElement(at:)):

// augmentation is optional, so fall back to the unmodified element when it is nil.
listOfData.map { augmentation?($0) ?? $0 }

Result:

MNIST.train.augmenting(flipRight).augmenting(blur(0.2)).batched(32)

We should offer a list of predefined augmentations using SwiftCV.

Simplified example: https://colab.research.google.com/drive/1YCyx59FXcDXDrtGBHcRFKmbakSNScAJk.
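
One detail worth noting: if augmenting(_:) simply assigns the new closure, a second call in a chain replaces the first augmentation instead of adding to it. A minimal sketch of a composing variant follows. SketchBatch, SketchLoader, reversedOrder, and scaled(by:) are hypothetical stand-ins for illustration only, and arrays of Float replace tensors.

```swift
// Stand-in for a batch; the real type would hold tensors.
struct SketchBatch: Equatable {
    var data: [Float]
}

typealias Augmentation = (SketchBatch) -> SketchBatch

struct SketchLoader {
    var batches: [SketchBatch]
    var augmentation: Augmentation?

    // Applies the new augmentation after any augmentation already set,
    // so chained calls accumulate instead of overwriting each other.
    func augmenting(_ newAugmentation: @escaping Augmentation) -> SketchLoader {
        var mutableSelf = self
        if let existing = augmentation {
            mutableSelf.augmentation = { newAugmentation(existing($0)) }
        } else {
            mutableSelf.augmentation = newAugmentation
        }
        return mutableSelf
    }

    // Returns all batches with the augmentation applied, if there is one.
    func load() -> [SketchBatch] {
        guard let augmentation = augmentation else { return batches }
        return batches.map(augmentation)
    }
}

// Hypothetical augmentations used only for this example.
let reversedOrder: Augmentation = { SketchBatch(data: Array($0.data.reversed())) }
func scaled(by factor: Float) -> Augmentation {
    return { batch in SketchBatch(data: batch.data.map { $0 * factor }) }
}

let source = SketchLoader(batches: [SketchBatch(data: [1, 2, 3])], augmentation: nil)
let augmented = source.augmenting(reversedOrder).augmenting(scaled(by: 2)).load()
```

Here the batch is reversed first and scaled second, matching the order of the calls in the chain.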

Add tests for DataLoaders

Data loaders are currently untested, and we need a way to test them to ensure the API is robust. @WilliamHYZhang, what do you think about storing small test files, as TFDS does?
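
As a rough illustration of the small-fixture idea, the sketch below writes a two-row CSV to a temporary file and loads it back. loadRows(fromFileAt:) is a hypothetical stand-in for the loader under test, not the CSVDataLoader API, and a real test suite would more likely use XCTest with checked-in fixture files.

```swift
import Foundation

// Hypothetical stand-in for the loader under test: reads a CSV file and
// returns every row after the header as an array of Floats.
func loadRows(fromFileAt url: URL) throws -> [[Float]] {
    let contents = try String(contentsOf: url, encoding: .utf8)
    return contents
        .split(separator: "\n")
        .dropFirst()  // skip the header row
        .map { line in
            line.split(separator: ",").compactMap { Float(String($0)) }
        }
}

// Writes a tiny fixture to a temporary file, loads it, and cleans up.
func loadFixtureRows() throws -> [[Float]] {
    let fixture = "sepal length,sepal width\n5.1,3.5\n4.9,3.0\n"
    let url = FileManager.default.temporaryDirectory
        .appendingPathComponent("s5tf-fixture-\(UUID().uuidString).csv")
    try fixture.write(to: url, atomically: true, encoding: .utf8)
    defer { try? FileManager.default.removeItem(at: url) }
    return try loadRows(fromFileAt: url)
}

// An empty result signals that the fixture could not be written or read.
let fixtureRows = (try? loadFixtureRows()) ?? []
```

Fixtures this small keep the tests fast and avoid any network access.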

Downloader potential bug

See https://colab.research.google.com/drive/1-ROoq7w-vAaK7F1LgIoAorcQBV0TSjsz

Utils structuring

I think we should structure S5TFUtils like so:

.
├── S5TFUtils.swift
├── S5TFUtils+IO.swift
├── ...

where +IO is an extension (by convention).

public extension S5TFUtils {
    struct IO {
        // static, so it can be called as S5TFUtils.IO.downloadAndExtract()
        public static func ...() {
            ...
        }
    }
}

S5TFUtils.IO.downloadAndExtract()

How to handle NaN?

Both the current and new (#29) implementations of the CSV reader try to parse each item as a Float; when that fails, the value is assumed to be a string and encoded accordingly. As a result, a missing-value marker such as "NaN" or "?" is encoded as if it were a String.

My proposal is to let the user specify, when initializing the CSVDataLoader, a string that should be encoded as Float.nan.
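
A minimal sketch of how this could work is shown below. CSVValue, parseField, and the nanMarkers parameter are hypothetical names, not the actual CSVDataLoader API, and the sketch accepts a set of markers rather than a single string.

```swift
// Hypothetical result type for one parsed CSV field.
enum CSVValue: Equatable {
    case number(Float)
    case text(String)
}

// Parses one field. Any string listed in `nanMarkers` becomes Float.nan;
// other fields are parsed as Float and fall back to text on failure.
func parseField(_ field: String, nanMarkers: Set<String>) -> CSVValue {
    if nanMarkers.contains(field) {
        return .number(Float.nan)
    }
    if let number = Float(field) {
        return .number(number)
    }
    return .text(field)
}

let fields = ["2.5", "?", "setosa"]
let parsed = fields.map { parseField($0, nanMarkers: ["?"]) }
```

Because NaN never compares equal to itself, callers should check the result with isNaN rather than with ==.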