
Lightweight library to build and train neural networks in Theano

Home Page: http://lasagne.readthedocs.org/

License: Other


lasagne's Introduction

Lasagne

Lasagne is a lightweight library to build and train neural networks in Theano. Its main features are:

  • Supports feed-forward networks such as Convolutional Neural Networks (CNNs), recurrent networks including Long Short-Term Memory (LSTM), and any combination thereof
  • Allows architectures of multiple inputs and multiple outputs, including auxiliary classifiers
  • Many optimization methods including Nesterov momentum, RMSprop and ADAM
  • Freely definable cost function and no need to derive gradients due to Theano's symbolic differentiation
  • Transparent support of CPUs and GPUs due to Theano's expression compiler

Its design is governed by six principles:

  • Simplicity: Be easy to use, easy to understand and easy to extend, to facilitate use in research
  • Transparency: Do not hide Theano behind abstractions, directly process and return Theano expressions or Python / numpy data types
  • Modularity: Allow all parts (layers, regularizers, optimizers, ...) to be used independently of Lasagne
  • Pragmatism: Make common use cases easy, do not overrate uncommon cases
  • Restraint: Do not obstruct users with features they decide not to use
  • Focus: "Do one thing and do it well"

Installation

In short, you can install a known compatible version of Theano and the latest Lasagne development version via:
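pip install -r https://raw.githubusercontent.com/Lasagne/Lasagne/master/requirements.txt
pip install https://github.com/Lasagne/Lasagne/archive/master.zip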

For more details and alternatives, please see the Installation instructions.

Documentation

Documentation is available online: http://lasagne.readthedocs.org/

For support, please refer to the lasagne-users mailing list.

Example

import lasagne
import theano
import theano.tensor as T

# create Theano variables for input and target minibatch
input_var = T.tensor4('X')
target_var = T.ivector('y')

# create a small convolutional neural network
from lasagne.nonlinearities import leaky_rectify, softmax
network = lasagne.layers.InputLayer((None, 3, 32, 32), input_var)
network = lasagne.layers.Conv2DLayer(network, 64, (3, 3),
                                     nonlinearity=leaky_rectify)
network = lasagne.layers.Conv2DLayer(network, 32, (3, 3),
                                     nonlinearity=leaky_rectify)
network = lasagne.layers.Pool2DLayer(network, (3, 3), stride=2, mode='max')
network = lasagne.layers.DenseLayer(lasagne.layers.dropout(network, 0.5),
                                    128, nonlinearity=leaky_rectify,
                                    W=lasagne.init.Orthogonal())
network = lasagne.layers.DenseLayer(lasagne.layers.dropout(network, 0.5),
                                    10, nonlinearity=softmax)

# create loss function
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var)
loss = loss.mean() + 1e-4 * lasagne.regularization.regularize_network_params(
        network, lasagne.regularization.l2)

# create parameter update expressions
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01,
                                            momentum=0.9)

# compile training function that updates parameters and returns training loss
train_fn = theano.function([input_var, target_var], loss, updates=updates)

# train network (assuming you've got some training data in numpy arrays)
for epoch in range(100):
    loss = 0
    for input_batch, target_batch in training_data:
        loss += train_fn(input_batch, target_batch)
    print("Epoch %d: Loss %g" % (epoch + 1, loss / len(training_data)))

# use trained network for predictions
test_prediction = lasagne.layers.get_output(network, deterministic=True)
predict_fn = theano.function([input_var], T.argmax(test_prediction, axis=1))
print("Predicted class for first test input: %r" % predict_fn(test_data[0]))

For a fully-functional example, see examples/mnist.py, and check the Tutorial for in-depth explanations of the same. More examples, code snippets and reproductions of recent research papers are maintained in the separate Lasagne Recipes repository.

Citation

If you find Lasagne useful for your scientific work, please consider citing it in resulting publications. We provide a ready-to-use BibTeX entry for citing Lasagne.

Development

Lasagne is a work in progress; input is welcome.

Please see the Contribution instructions for details on how you can contribute!


lasagne's Issues

nntools.layers.DenseLayer 'flattens' input

nntools.layers.DenseLayer normally operates on matrix-shaped input (ndim = 2), but is also able to deal with input of a higher dimensionality simply by 'flattening' all trailing dimensions:

if input.ndim > 2:
    input = input.reshape((input.shape[0], T.prod(input.shape[1:])))

I think this is convenient because that way you don't have to insert a FlattenLayer or something when going from the convolutional layers to the dense layers in a convnet, for example.

I think the added complexity is okay here because it simplifies a very common use case. But I wanted to discuss this nevertheless, because maybe we want the Layer classes to focus on one thing (we have the separate DropoutLayer for that reason).

I would like to keep this feature, but it's not really 'pure' and there is something to be said for having to make this operation explicit. What does everyone think?

(Regardless of what we decide it's probably useful to implement a FlattenLayer that does just this anyway).
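For reference, a minimal sketch of what such a FlattenLayer could look like, assuming the current Layer base class with get_output_for() and get_output_shape_for():

import numpy as np
import theano.tensor as T
from nntools.layers import Layer

class FlattenLayer(Layer):
    # flattens all trailing dimensions into one, producing 2D output
    def get_output_shape_for(self, input_shape):
        return (input_shape[0], int(np.prod(input_shape[1:])))

    def get_output_for(self, input, *args, **kwargs):
        return input.reshape((input.shape[0], T.prod(input.shape[1:])))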

Layer.get_fan_in(), get_fan_out()

The Uniform initializer currently tries to guess the lower and upper bounds if they are not specified, using the method proposed by Xavier Glorot et al. (see http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf ).

Basically this method says that the initial values should be drawn from a uniform distribution ranging between +/- sqrt(6.0 / (fan_in + fan_out)), where fan_in and fan_out are the fan-in and fan-out of an individual unit. This ensures that the variance of activations across layers stays relatively close to 1.

Currently this code only works for DenseLayer, and it's kind of ugly: it checks whether len(shape) == 2, and if so, it assumes that fan_in == shape[0] and fan_out == shape[1]. But this is only really true for DenseLayer.

To clean this up, and to be able to generalise this to other layer types, it is necessary to be able to determine the fan-in and fan-out of a unit. For the Conv2DLayer for example, this would be fan_in = shape[1] * shape[2] * shape[3] and fan_out = shape[0] * shape[2] * shape[3].

After talking to @avdnoord about this, we agreed that the cleanest solution would be to allow layers to implement two methods, get_fan_in() and get_fan_out(), that compute these values. These would make the Layer interface a bit bigger unfortunately, but they would of course be optional. For now, they would only be needed to support init.Uniform without arguments, but there may be other uses for this information as well.

Personally I plan to use this type of initialization a lot. It seems to work quite well with very deep networks, the Gaussian initialization approach tends to require more fiddling in my experience. So I feel that this interface extension is warranted, but I wanted to put this up for discussion first.

In the library we could just implement the methods for DenseLayer and the Conv*DLayers for now, and then update init.Uniform to make use of them (and provide a helpful error message if they are missing).
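A rough sketch of the proposed methods, based on the shapes mentioned above (method names as proposed; the W_shape attribute is a hypothetical placeholder, not an agreed-upon interface):

# DenseLayer: weights have shape (num_inputs, num_units)
def get_fan_in(self):
    return self.num_inputs

def get_fan_out(self):
    return self.num_units

# Conv2DLayer: weights have shape (num_filters, num_channels, rows, cols)
def get_fan_in(self):
    num_filters, num_channels, rows, cols = self.W_shape  # hypothetical attribute
    return num_channels * rows * cols

def get_fan_out(self):
    num_filters, num_channels, rows, cols = self.W_shape
    return num_filters * rows * cols

init.Uniform could then compute its range as sqrt(6.0 / (layer.get_fan_in() + layer.get_fan_out())) when no bounds are given, and raise an informative error if the layer does not implement these methods.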

Travis failing because it can't download mnist.pkl.gz

mnist.pkl.gz is used in a bunch of examples, and Travis runs the examples for one iteration to make sure they all still work.

Apparently deeplearning.net is down today, so mnist.pkl.gz can't be downloaded from there, and those tests fail. Maybe we should grab it from somewhere else.

It's also at http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz , which seems to be working currently, but I don't know if it's appreciated to constantly redownload that file from there. As @dnouri said, maybe we should host it ourselves somewhere?

Storing models and parameters, 'picklability'

It should be possible to easily store trained models. There are two ways to do this:

  1. store the parameter values. This is typically what I do myself. The actual architecture of the model is not stored, but the definition of the model architecture is just a Python file, and that is where the architecture is 'stored', as far as I'm concerned. So I'd do something like this:
l_out = ... # output layer of the network
all_params = nntools.layers.get_all_params(l_out)
all_param_values = [p.get_value() for p in all_params]
store(all_param_values, ...)

where store is a placeholder for pickling or something to that effect.

To load this model again:

all_param_values = load(...)
all_params = nntools.layers.get_all_params(l_out)
for p, v in zip(all_params, all_param_values):
    p.set_value(v)

This works because get_all_params always gives you the parameter variables in the same order. We should probably provide two utility functions for this.

  2. pickle the entire model, parameters included (i.e. l_out). I personally don't do this usually, but it's kind of the natural thing to do in Python and I believe a lot of people will try this and expect it to work. So I think we should support this approach.

This imposes some limitations on what language features we can use. For example, lambdas are not 'picklable', and functions aren't either unless they are defined in the top level of a module (see https://docs.python.org/2/library/pickle.html#what-can-be-pickled-and-unpickled ). So some functional programming patterns cannot be used if we want to keep everything picklable.

That's unfortunate, but I think it is a good idea to keep picklability in mind nevertheless. What does everyone think?

I still think we should support the approach of storing only the parameter values as well, though.
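A minimal sketch of the two utility functions suggested above (names are hypothetical; this assumes get_all_params returns the parameter variables in a deterministic order, as described):

import pickle
from nntools.layers import get_all_params

def save_param_values(l_out, filename):
    # store only the parameter values, not the architecture
    with open(filename, 'wb') as f:
        pickle.dump([p.get_value() for p in get_all_params(l_out)], f)

def load_param_values(l_out, filename):
    # the network must be rebuilt from its definition before calling this
    with open(filename, 'rb') as f:
        values = pickle.load(f)
    for p, v in zip(get_all_params(l_out), values):
        p.set_value(v)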

imports of nntools.layers.cuda_convnet and .corrmm should fail gracefully

nntools/layers/__init__.py currently imports the submodules cuda_convnet and corrmm automatically, but these both fail if no GPU is available, and currently this is not caught. The only way to get it working is to comment out both of these imports.

Obviously both of these modules are not usable if no GPU is available, but import nntools should not fail because of this. This should trigger a warning at most (probably not even that), and then fail at runtime if any of the GPU-dependent code is called directly.

What is the cleanest way to handle this? I know the Theano codebase has some conventions for optional dependencies, should we inherit those or should we handle this in a different way? Suggestions are welcome!
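One simple option, sketched below for nntools/layers/__init__.py, is to catch the failure and emit a warning instead (this assumes the failure surfaces as an ImportError, which should be verified):

import warnings

from .base import *

try:
    from . import cuda_convnet
except ImportError:
    warnings.warn("nntools.layers.cuda_convnet is unavailable (GPU support missing)")

try:
    from . import corrmm
except ImportError:
    warnings.warn("nntools.layers.corrmm is unavailable (GPU support missing)")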

Training loops

So far the only thing that's been implemented is a bunch of tools to generate Theano expressions for neural nets. There is no actual training code in the library yet.

We should provide some premade 'training loops' for common use cases, that take care of things like compiling the necessary Theano functions and updating the parameters given a dataset.

It would be great if we could rely on Python generators for this - although at this point I'm not sure if they offer enough flexibility. But if they do, it would be great to be able to avoid adding another class / abstraction.

We could provide a few different types of training loops, for different dataset sizes and approaches. For example, some datasets fit into GPU memory, so we should provide a loop that loads up the data into a shared variable and then iterates over that in batches. But a lot of datasets actually don't, so then we'd have to load a new 'chunk' of data into the shared variable at regular intervals.

Until now I've always reimplemented this type of thing specifically for the problems I was working on (layers.py only provided tools to generate Theano expressions, nothing else). But I've definitely felt like I was reinventing the wheel a couple of times :)

I don't have a concrete idea yet of how we should implement this, input is very welcome.
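As a starting point for the generator-based approach, here is a sketch of a plain minibatch iterator over in-memory numpy arrays (illustrative only; it does not address the chunked-loading case):

import numpy as np

def iterate_minibatches(X, y, batch_size, shuffle=True):
    indices = np.arange(len(X))
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(X) - batch_size + 1, batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

# a training loop can then consume it directly:
# for input_batch, target_batch in iterate_minibatches(X_train, y_train, 100):
#     train_fn(input_batch, target_batch)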

Feedback alignment

I'd like to try out feedback alignment, a recently proposed alternative to backprop. http://arxiv.org/abs/1411.0247

In short, this algorithm uses random fixed matrices to backpropagate the error signal, and supposedly works at least as well as backprop. Pretty intriguing!

Implementing it is rather easy, if you implement the backward pass manually. The latter is precisely what Theano allows us to avoid, and I'm wondering if there's a way to implement feedback alignment in a sufficiently general way, so we can leverage as much of Theano's automatic differentiation infrastructure as possible.

It's not just a question of using theano.clone to swap out weight matrices with random feedback matrices where appropriate, because the weight matrices are still used for forward propagation. So some would have to be replaced, some wouldn't.

Does anybody have any ideas or insights about this? It would be great to have something with the same interface as the functions in nntools.updates, except it does feedback alignment instead of backprop. Is this feasible?
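To make the idea concrete, here is a sketch for a single hidden layer with the backward pass written out by hand (all names are hypothetical; the point is that B, a fixed random matrix, replaces V.T when propagating the error, which is exactly the part theano.grad would normally derive for us):

import theano.tensor as T

def feedback_alignment_updates(x, t, W, V, B, learning_rate=0.01):
    # W, V: shared weight matrices; B: fixed random feedback matrix of the
    # same shape as V.T, created once and never updated
    h = T.tanh(T.dot(x, W))
    y = T.nnet.softmax(T.dot(h, V))
    e = y - t                                # output delta (softmax + cross-entropy)
    grad_V = T.dot(h.T, e)                   # identical to backprop
    delta_h = T.dot(e, B) * (1 - h ** 2)     # B instead of V.T for the backward pass
    grad_W = T.dot(x.T, delta_h)
    return [(W, W - learning_rate * grad_W),
            (V, V - learning_rate * grad_V)]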

Which versions of Python / numpy / Theano should be supported?

Which versions of Python, numpy and Theano should we support? It's probably okay to rely on Theano 0.6+, but what versions of Python and numpy are in common use at the moment? I would be in favor of not burdening the project with too much 'backward compatibility', so requiring recent Python and numpy versions is okay for me.

Also, what about Python 3? Should we support this? Afaik the scientific community is still relying mostly on Python 2 (and that's what I use), but Python 3 support is probably desirable in the interest of future-proofing the library.

Objective functions

There is currently a submodule nntools.objectives, but like the regularization one this was a bit of an afterthought.

It would be great if the code in nntools.objectives were defined in terms of Theano expressions (not Layer instances), so that it can be used in isolation.

But sometimes it might be convenient to have an object that represents the objective function, which defines a Theano variable for the target labels (in the case of a supervised training objective), so that you can just do something like:

l_out = ... # output layer of the network
obj = nntools.objectives.Objective(l_out, loss_function=nntools.objectives.mse)
loss = obj.get_loss()

I don't know if this warrants the extra class though. It doesn't really do that much. What does everyone think?
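For comparison, the class would only amount to something like this (a sketch; it assumes get_output() as currently implemented and a loss function operating on Theano expressions):

import theano.tensor as T

class Objective(object):
    def __init__(self, input_layer, loss_function):
        self.input_layer = input_layer
        self.loss_function = loss_function
        self.target_var = T.matrix("target")  # declared for the user

    def get_loss(self, input=None, target=None, *args, **kwargs):
        output = self.input_layer.get_output(input, *args, **kwargs)
        if target is None:
            target = self.target_var
        return self.loss_function(output, target)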

Deciding for a testing policy

Discussion started in #66:

@dnouri said: "At some point we should make a decision about whether or not we're committing to tests. If we want tests, then pull requests that add functionality should add tests, too. get_output is (was) thoroughly tested, thus adding a test for this extra feature is trivial."

@benanne responded: "On the one hand tests are obviously a great thing to have - on the other hand they may create a bit of a barrier for potential contributors. I think it might actually be a good idea to require tests for modifications to core functionality like this one, but if someone decides to add a new layer class or something, maybe we shouldn't be so strict. We should probably flesh out an official policy, maybe we can create a separate issue for it."

Let's continue the discussion here.

First release

People have already started using the library, so we should make an effort to put out a first release. I made a milestone to tag issues and pull requests that need to be sorted out before we can make a release (thanks to @craffel for the suggestion).

The most important things will be sorting out our test coverage, and writing some basic documentation. I've been adding some docstrings now and then, but progress is slow and we will probably need a concerted effort to get this done in a reasonable amount of time.

What else should we take care of for the first release? Are there any other issues that need to be tagged?

Data vectors as columns vs rows

After using nntools a little bit I quickly noticed that data vectors are being represented as rows. That is, if A is a data matrix, then A.shape = (n_data, n_features). I have always represented data vectors as columns (A.shape = (n_features, n_data)). I considered not bringing this up because I don't expect to change your mind, but because design choices are still being made I figured I'd start a discussion. Here's how I see it -

Arguments for data vectors as rows

  • In Python, A[0] has the intuitive meaning of "the first data vector"
  • It's what sklearn uses

Arguments for data vectors as columns

  • This is a pretty global convention for machine learning, particularly outside of code

cc @bmcfee who has strong feelings about this.

Name

For now this is called 'nntools' since that is what we discussed before, but if we can come up with something better that would be great :)

As we discussed by email before, 'layers' is not the best name because that would lead to things like "from layers.layers import Layer" which is a tad confusing.

Any ideas? :) If we're going to change the name I guess we should do it sooner rather than later, preferably before we start using this code.

get_output fails when input is a np.ndarray

This came up today when I was transitioning some old code to use nntools instead of my own stuff. This is valid in Theano:

a = np.array([1, 2], dtype=theano.config.floatX)
b = theano.shared(np.array([3, 4], dtype=theano.config.floatX))
a_plus_b = a + b
a_plus_b.eval()

Essentially you're making a function where a is a constant. In the code I was transitioning, I was getting the output of a neural net where the input was constant - e.g. I always wanted to use the same input matrix. Simplified it looks something like this:

l_in = nntools.layers.InputLayer((2, 2))
l_out = nntools.layers.DenseLayer(l_in, 3)
outpt = l_out.get_output(np.array([[1, 2], [3, 4]], dtype=theano.config.floatX))

which gives

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-f98d4bb6dced> in <module>()
      1 l_in = nntools.layers.InputLayer((2, 2))
      2 l_out = nntools.layers.DenseLayer(l_in, 3)
----> 3 l_out.get_output(np.array([[1, 2], [3, 4]], dtype=theano.config.floatX))

/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nntools-0.1dev-py2.7.egg/nntools/layers/base.pyc in get_output(self, input, *args, **kwargs)
     98         else: # in all other cases, just pass the network input on to the next layer.
     99             layer_input = self.input_layer.get_output(input, *args, **kwargs)
--> 100             return self.get_output_for(layer_input, *args, **kwargs)
    101 
    102     def get_output_shape_for(self, input_shape):

/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nntools-0.1dev-py2.7.egg/nntools/layers/base.pyc in get_output_for(self, input, *args, **kwargs)
    215 
    216     def get_output_for(self, input, *args, **kwargs):
--> 217         if input.ndim > 2:
    218             # if the input has more than two dimensions, flatten it into a
    219             # batch of feature vectors.

AttributeError: 'NoneType' object has no attribute 'ndim'

input is None because nntools.layers.InputLayer.get_output doesn't have a case for when the input is a np.ndarray, so the function doesn't reach a return statement and therefore returns None. The obvious simple solution is to do this:

l_in = nntools.layers.InputLayer((2, 2))
l_out = nntools.layers.DenseLayer(l_in, 3)
inpt = T.matrix()
outpt = l_out.get_output(inpt)
outpt.eval({inpt:np.array([[1, 2], [3, 4]], dtype=theano.config.floatX)})

I'd be OK if everyone agreed that we don't want to allow input to be a np.ndarray, although in that case we should throw a more useful exception (by type checking the input). But if we do that, we are preventing a valid use case, so I'm in support of allowing input to be a np.ndarray. This would just involve modifying the cases in nntools.layers.InputLayer.get_output. Can anyone think of any reasons we shouldn't do this?
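The change could be as small as the following sketch of InputLayer.get_output (assuming the existing class structure; wrapping the array ensures downstream layers always receive a Theano expression):

import numpy as np
import theano.tensor as T

# inside nntools/layers/base.py
class InputLayer(Layer):
    # ...
    def get_output(self, input=None, *args, **kwargs):
        if isinstance(input, dict):
            input = input.get(self, None)
        if input is None:
            return self.input_var
        if isinstance(input, np.ndarray):
            input = T.constant(input)  # treat the array as a constant input
        return input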

Testing and continuous integration

We should probably write some tests for this stuff :)

I have very little experience with this. There seem to be a few testing frameworks for Python (I think Theano uses nose). Which one should we use and why?

Also, it would be great to set up some kind of continuous integration environment. Travis-CI looks pretty cool and it integrates with GitHub (also used by Theano), are there any alternatives worth looking into?

EDIT: relevant reading: http://www.jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/

Use development version of Theano on Travis

Looks like I just broke the build by committing some stuff based on GpuCorrMM. This is not available yet in Theano 0.6. Since so much stuff has been added already (and Theano 0.7 is not coming anytime soon, as far as I can tell), we should probably test with a more recent version.

Maybe we don't want to use the latest version from git at any given time, since that would mean changes in Theano could affect the outcome of the build. I guess we should fix it to a specific commit. What does everyone think?

interfacing with scikit-learn

As we've discussed before, it would be great to have some tools for scikit-learn integration. It would be nice to provide a general purpose wrapper that takes a bunch of Theano expressions as input and generates a class with the scikit-learn fit/transform interface, so that can be used in scikit-learn pipelines. I don't know how realistic this is as I don't use scikit-learn much, but it would be nice to make this work with any Theano expression so it can be used in isolation. We can put the code for this in nntools.sklearn.

There is also the question of how to deal with large datasets, for which the fit/transform interface is too limited. There is a partial_fit function in scikit-learn, but supposedly this is poorly supported, so maybe we shouldn't spend time on supporting this?
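To give an idea of the intended shape of such a wrapper, here is a very rough sketch (class name and constructor arguments are placeholders; fit() naively trains on the full batch):

import theano

class TheanoEstimator(object):
    def __init__(self, input_var, target_var, output_expr, loss_expr, updates):
        self.train_fn = theano.function([input_var, target_var], loss_expr,
                                        updates=updates)
        self.predict_fn = theano.function([input_var], output_expr)

    def fit(self, X, y, num_epochs=10):
        for _ in range(num_epochs):
            self.train_fn(X, y)
        return self

    def predict(self, X):
        return self.predict_fn(X)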

default initialization for `Conv2DCCLayer` does not make sense when dimshuffle=False

The default initialization for the cuda-convnet based 2D convolution (Conv2DCCLayer) is init.Uniform without any parameters, which defaults to the "Glorot"-style uniform initialization based on fan in and fan out.

If Conv2DCCLayer is used with dimshuffle=False, it uses c01b ordering for the input dimensions instead of the default of bc01. This causes the initializer to compute the fan in and fan out incorrectly: it computes the receptive field size as the product of the last two dimensions, which is correct for bc01 but not for c01b.

The easiest solution is probably to define a custom Uniform_c01b initializer and use that as the default instead when dimshuffle=False. I'll try to sort that out soon, unless someone has a better idea for this.

Code in lasagne.updates should not deal with weight decay

Weight decay should be implemented by adding an L2 term to the loss (I guess it would be useful to have a helper function that generates this term for an entire network, given the top layer). Support for weight decay should be removed from all update generation code (see discussion in #58).

Keep track of parameter shapes?

Should the Layer classes keep track of the shapes of their parameters? Currently nntools.Layer.create_param takes the provided initialization (either a numpy array, a Theano shared variable or a callable that takes a shape as input), as well as the parameter shape as inputs. It returns a Theano shared variable, but the shape information is only used for initialization and then discarded.

If we store the shapes somewhere (say in a dictionary), we can do things like check whether the current contents of the shared variables have the right shapes (could be a useful debugging tool), and compute the total number of parameters of a model.

All of this stuff could probably be implemented in the Layer base class, but it does add some complexity. Is this worth it? What are some other use cases, and how common are they?

Recurrence

layers.py, on which most of the nntools code is based, has always been geared towards feed-forward neural networks. We should look into recurrent neural networks as well. Personally I don't have a lot of hands-on experience with this type of model. I'm not entirely sure how hard it would be to implement the necessary elements - would modifications to the library design be required? Does anyone have any insights about this?

do we need a Model class?

The current setup follows the old layers.py code (see https://github.com/benanne/kaggle-galaxies/blob/master/layers.py ), in the sense that there is no overarching Model class that aggregates the layers and takes care of communication between them. Instead, this is taken care of by whatever the topmost layer in your graph happens to be.

This design works because of Theano: a lot of things can be implemented by recursion, going from the topmost (output) layer to the lower layers. If we had to implement backprop this would be an issue, because then lower layers need references to higher layers. But luckily Theano takes care of that.

Nevertheless there may still be benefits attached to having an overarching Model class, so here's an issue about it so we can discuss the pros and cons.

At the moment I would prefer not to add it. I can't really think of good use cases for Model that can't be handled adequately by the current setup with only Layer (but of course that doesn't necessarily mean there aren't any). And since we're aiming for small interfaces, fewer classes is better.

Where should functions that operate on the entire graph be defined?

There are a bunch of top-level functions in nntools.layers right now to work with the constructed layer graph: get_all_layers, get_all_params, get_all_bias_params, get_all_non_bias_params.

I didn't put these in the Layer class itself because they operate on the given layer and all layers below it in the graph. But maybe these should be somewhere else? If they were part of the Layer class we'd have both Layer.get_params and Layer.get_all_params, which could be confusing.

VOTE: name

I'm creating a new issue so we can take a vote and decide what to call this thing :) Please use this issue to cast your vote only, any debate or discussion should be posted over at #3.

Everyone is welcome to share their preference, but just a disclaimer up front: the votes of current contributors (code and/or ideas) will carry more weight.

If you don't have a strong preference for one option or the other, you're also welcome to specify a categorical distribution over the options :)

Here are the options:

  1. nntools
  2. layercake
  3. onion
  4. onionn
  5. lasagne
  6. lasagnne
  7. katana

Conv3DLayer implementation

I'd like to have Conv1DLayer, Conv2DLayer and Conv3DLayer available. Unfortunately Theano's support for 3D convolutions is a bit of a mess imho, there are some different implementations (see documentation for details):

  • theano.tensor.nnet.Conv3D.conv3D, which I guess is the default. Its expected input shape configuration is b01tc (b = batch, 0 = width, 1 = height, t = time, c = channels, note that 0, 1 and t are basically interchangeable), which is sort of unusual because the channel dimension is last. In practice this is apparently CPU only because the GPU implementation is slow (the docs even encourage using conv3d2d on GPU instead).
  • theano.tensor.nnet.conv3d2d.conv3d has btc01 as its input shape configuration. Also kind of weird to have the channel dimension in the middle there. This works by turning the 3D convolution into a 2D convolution, and given the recent improvements for the 2D convolution, this is probably the most useful implementation.
  • theano.sandbox.cuda.fftconv.conv3d_fft has bct01 as its input shape configuration, which is the most natural way to do things imho (since conv2d uses bc01). This is a 3D version of conv2d_fft. I don't know how practical this is because FFT convolutions use a lot of memory. There is no gradient implemented.
  • There is an optimization that swaps theano.tensor.nnet.Conv3D.conv3D for theano.sandbox.cuda.fftconv.conv3d_fft. It must be enabled manually.

Since conv3d_fft has no gradient implemented, there is no point in supporting it directly. Instead we should support Conv3D.conv3D.

I think we should also support conv3d2d.conv3d. Afaik there is no optimization to replace Conv3D.conv3D by conv3d2d.conv3d. So that means the Conv3DLayer class needs a switch to choose the implementation. This can just be an implementation keyword argument on the constructor (which is how I've tackled this issue in the past in layers.py).

In general I think having an implementation keyword for all of the Conv*DLayers is a good idea, there are a lot of different ways to do things and although there are sometimes optimizations that can be enabled manually, those aren't always available. And for the 1D convolution there is currently no implementation at all (but I have a bunch of code for that already).

What should be the default implementation to use? I think having conv3d2d.conv3d as the default makes the most sense since it is probably the most practically usable implementation these days. But we should definitely support Conv3D.conv3D as well.

Regardless of this, I think Conv3DLayer should use bct01 input order by default. This means some reshapes may be necessary. We could also add a keyword argument to disable this reshaping, as a performance optimization. In that case the user would be expected to ensure that the input to the layer already has the right order.
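The reshaping in question is just a dimshuffle between orderings. For the conv3d2d backend, for example (a sketch; the filters are assumed to be in the matching (num_filters, time, channels, rows, cols) order):

from theano.tensor.nnet import conv3d2d

# input arrives in bct01 order; conv3d2d.conv3d expects btc01
input_btc01 = input_bct01.dimshuffle(0, 2, 1, 3, 4)
output_btc01 = conv3d2d.conv3d(input_btc01, filters)
# shuffle the result back to bct01
output_bct01 = output_btc01.dimshuffle(0, 2, 1, 3, 4)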

Regularization

We should implement some commonly used regularizers (L1, L2, sparsity penalties on the activations as in sparse autoencoders, ...).

How should we do this? The nntools.regularization module I included in the initial commit was an afterthought and should be treated as more of a placeholder.

In #11 @f0k already mentioned that it's probably a good idea to make the regularization module operate on Theano expressions, not Layer instances, so that it can be used in isolation.

Any ideas? We should also take into account that some regularizers operate on model parameters (e.g. L1, L2) and others operate on activations (autoencoder sparsity penalty) and are data-dependent.
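Following that suggestion, the parameter penalties could be plain functions on Theano expressions, e.g. (a sketch):

import theano.tensor as T

def l1(x):
    return T.sum(abs(x))

def l2(x):
    return T.sum(x ** 2)

def apply_penalty(params, penalty):
    # sum a penalty function over a list of parameter variables
    return sum(penalty(p) for p in params)

# usage: loss = loss + 1e-4 * apply_penalty(get_all_params(l_out), l2)

Activation-based penalties would instead take the output expression of a layer as input, which fits the same pattern.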

nntools.layers submodules

We will probably want to create some submodules within nntools.layers, for example for cuda-convnet-backed layers (they have an extra dependency on pylearn2).

So that means we'd have to convert the layers module from a file to a directory, where all the top-level module members are defined in __init__.py. I've been told this is considered 'ugly', but I'd really like to keep the most commonly used layers directly in nntools.layers, not in submodules.

So either we'd have to define them in __init__.py, or we should define them in some submodule nntools.layers.base and then have __init__.py import its contents. What's the best solution?

Should each layer support dropout, or should there be a separate DropoutLayer?

In layers.py ( https://github.com/benanne/kaggle-galaxies/blob/master/layers.py ) dropout was taken care of by each layer that supports it. Each layer has a dropout keyword argument.

This unfortunately leads to some code duplication, because each layer basically reimplements dropout. Although that issue could be mitigated somewhat by refactoring this piece of code into a function.

However, since dropout can be viewed as a transformation of the inputs, I figured it would be cleaner to implement this operation as a separate layer instead.

That means creating a typical neural net with dropout on all layers is a bit more verbose though. So maybe this is not the best solution?

Compare:

l_in = nntools.layers.InputLayer(num_features=input_dim, batch_size=BATCH_SIZE)
l_hidden = nntools.layers.DenseLayer(l_in, num_units=NUM_HIDDEN_UNITS, dropout=0.5)
l_out = nntools.layers.DenseLayer(l_hidden, num_units=output_dim)

vs.

l_in = nntools.layers.InputLayer(num_features=input_dim, batch_size=BATCH_SIZE)
l_hidden = nntools.layers.DenseLayer(l_in, num_units=NUM_HIDDEN_UNITS)
l_hidden_dropout = nntools.layers.DropoutLayer(l_hidden, p=0.5)
l_out = nntools.layers.DenseLayer(l_hidden_dropout, num_units=output_dim)

Updates.py format

I'm trying to understand the updates.py input/output format.

  1. loss: Theano expression calculating the cost, e.g. squared or nll
  2. all_params: list from nntools.layers.get_all_params(top_layer). This is a list of Theano shared variables representing the parameters
  3. learning_rate: float

The SGD is easy to understand:

all_grads = [theano.grad(loss, param) for param in all_params]
updates = []

for param_i, grad_i in zip(all_params, all_grads):
    updates.append((param_i, param_i - learning_rate * grad_i))

Get the gradient expression for all parameters and return a list which calculates the updates.
The output format is a list of tuples: [0] the original parameter, [1] the updated parameter.

I don't understand momentum and the following methods:

all_grads = [theano.grad(loss, param) for param in all_params]
updates = []

for param_i, grad_i in zip(all_params, all_grads):
    mparam_i = theano.shared(np.zeros(param_i.get_value().shape, dtype=theano.config.floatX))
    v = momentum * mparam_i - weight_decay * learning_rate * param_i  - learning_rate * grad_i
    updates.append((mparam_i, v))
    updates.append((param_i, param_i + v))

mparam_i: a shared tensor of zeros?

In the calculation of v, wouldn't it be sensible to scale the second part of the equation by (1 - momentum):

v = momentum * mparam_i - (1 - momentum) * (weight_decay * learning_rate * param_i - learning_rate * grad_i)

What are the two tuples appended to updates? The second is the parameter update; what is the first, and how is it handled?

Track batch sizes and input shape separately or jointly?

Currently nntools.layers.Layer.get_output_shape and .get_output_shape_for are expected to return shape tuples that include the batch size as the leading element. If no batch size is specified, the first tuple element is None.

For some things it might be easier to keep track of the batch size separately. Unfortunately this requires a larger interface (separate get_batch_size method). I also think it makes sense to have the number of elements in the shape match the actual number of dimensions of the input tensors. So I'm in favor of keeping things as they are.

Nevertheless I wanted to put this up for discussion, maybe there are other compelling arguments for keeping track of the batch size separately.

Split up layers/base.py

nntools.layers currently has the following structure:

  • corrmm.py: Conv2DMMLayer (2D convolution using GpuCorrMM)
  • cuda_convnet.py: Conv2DCCLayer, MaxPool2DCCLayer, ShuffleBC01ToC01BLayer, ShuffleC01BToBC01Layer (2D convolution and pooling using pylearn2's cuda-convnet wrappers)
  • base.py: all other layer classes we have
  • __init__.py: imports * from base and imports the two other submodules

base.py is a very long file with several different kinds of layers. I propose to split it up into smaller files that each hold a group of layers that belong together. __init__.py should still import those layers directly, so one doesn't have to remember which group a layer belongs to. A possible layout might be:

base.py:

__all__ = [  # so "from .base import *" only imports the relevant names
        'Layer',
        'MultipleInputsLayer', 
        'InputLayer',
        ]

class Layer(object): ...
class MultipleInputsLayer(Layer): ...
class InputLayer(Layer): ...  # not really a base class, but probably does not warrant its own submodule

helper.py: get_all_*, set_all_*
?.py: DenseLayer, NINLayer
noise.py: DropoutLayer, GaussianNoiseLayer
conv.py: Conv1DLayer, Conv2DLayer, [Conv3DLayer]
pool.py: MaxPool2DLayer, FeaturePoolLayer, FeatureWTALayer, GlobalPoolLayer
?.py: FlattenLayer, [ReshapeLayer, ReshuffleLayer], PadLayer?
merge.py: ConcatLayer, ElemwiseSumLayer

There are some question marks, but that's about the grouping I see (some of these groups are already formed in the existing base.py, some are a little different). Any comments, refinements, suggestions? Let's try to converge on a good set of categories and split it up!

get_ prefix for Layer methods

The methods of nntools.layers.Layer are called get_output, get_params, ... because I wanted to make clear that they are methods, not attributes (e.g. if the latter method was called params instead, there could be uncertainty about whether you should use layer.params or layer.params()). Same for the functions in nntools.layers (get_all_layers and friends).

Is this a good idea or not? Omitting the get_ is a bit more concise, so I'm thinking about getting rid of these. This may seem like a detail but I think giving classes and methods the right name is very important.

Dropout rescaling

nntools.layers.DropoutLayer currently implements dropout in a nonstandard way. During training (deterministic=False), a Bernoulli dropout mask is applied to the input (with retention probability 1 - p) and the input is then scaled by dividing it by 1 - p. At test time (deterministic=True), the layer does nothing and just passes through its input.

This is different from the typical approach, where the mask is applied during training and rescaling by multiplying with 1 - p happens at test time.

I decided to do the rescaling during training, instead of at test time. I think this is cleaner because then dropout is something that only affects training. At test time nothing changes, regardless of whether dropout is used or not.

Also, the weight initialization does not need to be changed depending on whether dropout is used.

That said, although I think this is the cleanest way to implement dropout, it might be confusing for some users. So maybe we should do the rescaling at test time instead. Or maybe this should be an option?

I also included an option to disable the rescaling altogether, which is useful for implementing denoising autoencoders (no rescaling is typically used in that context).
#9 is also related to this issue.
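In code, the two variants differ only in where the 1 - p factor is applied (a sketch using Theano's MRG random streams):

import theano
from theano.sandbox.rng_mrg import MRG_RandomStreams

_srng = MRG_RandomStreams()

def dropout_rescale_at_train_time(input, p=0.5, deterministic=False):
    if deterministic:
        return input  # test time: pass through unchanged
    retain_prob = 1 - p
    mask = _srng.binomial(input.shape, p=retain_prob, dtype=theano.config.floatX)
    return input * mask / retain_prob

def dropout_rescale_at_test_time(input, p=0.5, deterministic=False):
    retain_prob = 1 - p
    if deterministic:
        return input * retain_prob  # typical approach: rescale at test time
    mask = _srng.binomial(input.shape, p=retain_prob, dtype=theano.config.floatX)
    return input * mask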

Move this repository to an organization?

I created this repository on my personal GitHub account, but since this is a joint project, maybe it's better to create an organization and move this repository to it. Added benefits would be more granular access control (commit rights etc.).

Also I would just be able to submit pull requests like everyone else, instead of committing to the repo directly. Currently this is convenient because a lot of the 'base' library is still missing, but I probably don't want to make a habit out of this in the long term.

What do you guys think?

If we decide to do this, it would be a good time to change the name as well (if we're going to do that, see #3), so we only have to update configurations once.

Blocks (and other high level Theano-based neural net frameworks)

Are you guys aware of this newish project? https://github.com/bartvm/blocks

It looks like there's a bit of overlap with Lasagne. Maybe you can use some of the ideas. As with any new abstraction interface on top of Theano, it's unclear exactly what this provides that Pylearn2 doesn't. It seems like many researchers just want to reinvent the wheel because previous offerings have gotten too bloated or they don't like the interface. I guess it will be difficult to prevent bloat in Lasagne as more features are added. What steps are being taken to prevent Lasagne from just becoming another Pylearn2? :)

Network input handling in nntools.layers.Layer.get_output()

The way nntools.layers.Layer.get_output() is currently implemented, it supports three ways to deal with network input.

In the following let's say l_out is the output layer of a network, and l_in is the input layer (an instance of nntools.layers.InputLayer).

  1. l_out.get_output()
    The first way is to just call get_output() without an input argument. In that case, the symbolic variable l_in.input_var will be used to represent the network input. This can be swapped out for something else with the givens argument of theano.function, or with theano.clone. The nice thing about this approach is that you don't have to bother 'declaring' an input variable, the InputLayer instance takes care of it for you.

  2. l_out.get_output(some_theano_expression)
    If you don't want to do that, you can pass in a Theano expression and then that will be used to represent the network input. Layer.get_output() will propagate the expression down, and InputLayer.get_output() will just return said expression.

  3. l_out.get_output({ l_in: some_theano_expression })
    ... but what if you have multiple InputLayer instances in your network? Then approach 2) will map all of them to the same input expression. So to deal with that situation, get_output() also supports passing in a dictionary, mapping Layer instances to Theano expressions. That way it's easy to map different input layers to different Theano expressions.
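For example, with two input layers (l_in_a and l_in_b, names made up for illustration):

import theano.tensor as T

x1 = T.matrix('x1')
x2 = T.matrix('x2')
output = l_out.get_output({l_in_a: x1, l_in_b: x2})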

As an added bonus, you can map any Layer instance to an expression, not just InputLayer instances. So you can replace any layer output with an arbitrary expression. Not sure how useful this is in practice, but it's nice to have I guess. It could be useful for debugging purposes, perhaps.

This may all seem a little overly complicated at first, but I think it's actually pretty clean because everything is implemented in Layer.get_output() and InputLayer.get_output(), and other layers don't need to worry about this stuff (they just implement get_output_for(), and get_output() is inherited).

I feel it's the cleanest way to support the use case of having multiple input layers, and it also makes it easy to use arbitrary Theano expressions as input.

Nevertheless I wanted to make an issue for this - maybe there is a simpler way to achieve the same flexibility? Or maybe there are use cases that this implementation does not handle cleanly?

EDIT: almost forgot to mention, most of this was @f0k's idea as well :)

mnist example b0rked?

Is it just me or is the mnist example broken? (I just saw a pull request for it, so that's why I'm wondering.) In any case, I'm getting this error:

Loading data
Building model
Traceback (most recent call last):
  File "mnist.py", line 49, in <module>
    l_in = nntools.layers.InputLayer(num_features=input_dim, batch_size=BATCH_SIZE)
TypeError: __init__() got an unexpected keyword argument 'num_features'

One working example would greatly help people trying to get into this library, that is, me. ;-)

add support for cudnn pooling

cudnn pooling has been implemented in Theano: Theano/Theano#2185
We should add support for it to nntools. We could make the existing pooling layer(s) support it as an alternative implementation, or we could provide a separate set of classes that function as drop-in replacements. We should probably think this over before writing any code.

On a related note, it might also be useful to implement a Conv2DDNNLayer that directly uses dnn_conv, just like we have dedicated layers for cuda-convnet and the conv_gemm implementation at the moment.
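As a starting point, a dedicated pooling layer might look roughly like this (a sketch; the dnn_pool signature is an assumption based on the Theano pull request and should be double-checked):

from theano.sandbox.cuda import dnn
from nntools.layers import Layer

class MaxPool2DDNNLayer(Layer):
    def __init__(self, input_layer, ds, stride=None):
        super(MaxPool2DDNNLayer, self).__init__(input_layer)
        self.ds = ds                       # pool size, e.g. (2, 2)
        self.stride = stride if stride is not None else ds

    def get_output_for(self, input, *args, **kwargs):
        return dnn.dnn_pool(input, self.ds, self.stride, mode='max')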

Decorators

There are a few places where new library components can be created essentially by defining one method. We could provide decorators there to reduce boilerplate code.

Examples are:

  • Creating a new layer that does not change the shape of its input: subclass nntools.layers.Layer, implement get_output_for(). We could provide a decorator that allows you to implement such a layer like this (a sketch of the decorator itself follows at the end of this issue):
@shape_preserving_layer
def l2_normalization(input, *args, **kwargs):
    norms = T.sqrt((input**2).sum(axis=1, keepdims=True))
    return input / norms
  • Parameter initialization: subclass nntools.init.Initializer, implement sample(). (Technically you also need __init__() to set distribution parameters etc.). Instead of this (from nntools.init):
class Normal(Initializer):
    def __init__(self, std=0.01, avg=0.0):
        self.std = std
        self.avg = avg

    def sample(self, shape):
        return floatX(np.random.normal(self.avg, self.std, size=shape))

You could write:

@initializer
def normal(shape, std=0.01, avg=0.0):
    return floatX(np.random.normal(avg, std, size=shape))
  • Another instance of this pattern could be wrapping an objective function so it takes a Layer instance as input

These are just some example use cases for decorators across the library. My question is, should we provide these (and use them) or not? I like the conciseness but there might be issues related to picklability of the models (see #7), and not all users might be familiar with the concept of Python decorators, so it may make the code harder to understand.

What do you guys think?
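For reference, the first decorator is only a few lines (a sketch; it assumes the base Layer passes shapes through unchanged by default):

from nntools.layers import Layer

def shape_preserving_layer(fn):
    # build a Layer subclass from a function mapping input to output
    class WrappedLayer(Layer):
        def get_output_for(self, input, *args, **kwargs):
            return fn(input, *args, **kwargs)
    WrappedLayer.__name__ = fn.__name__
    return WrappedLayer

Note that the class created inside the closure is exactly the kind of object that breaks pickling, which supports the concern raised above.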

PEP8

What is everyone's stance on enforcing PEP8? Personally I think it's good to enforce some coding standards, but the 80 character line length limit really annoys me sometimes. @avdnoord feels the same way. Maybe there are good arguments in favor of it though?

Are there any good tools (perhaps with GitHub integration) that make enforcing this easier?

Dropout performance

Looking at the dropout code, I see two potential performance problems:

return input * utils.floatX(_srng.binomial(input.shape, p=retain_prob, dtype='int32'))
  1. If you follow the implementation, the binomial is computed in 'floatX', then cast to 'int32' (inside binomial()) and then back to 'floatX'. It should be better to set the dtype to theano.config.floatX right away, or was there a reason?
  2. The size is symbolic (input.shape) rather than explicit (self.input_layer.get_output_shape()). This way MRG_RandomStreams cannot determine the number of streams to use and falls back to a default value. Is there a reason against using the explicit shape? Does DropoutLayer have to support shapes different from the compile time shape? Should we support this via an extra optional input_shape argument in get_output_for, similar to the convolutional layers, or maybe via a flag in the constructor (use_runtime_shape=True, fixed_shape=False or something)?
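Combining both points, the line could become (a sketch; get_output_shape() per the current Layer interface):

# floatX avoids the int32 round-trip; the explicit compile-time shape lets
# MRG_RandomStreams determine the number of streams to use
shape = self.input_layer.get_output_shape()
return input * _srng.binomial(shape, p=retain_prob, dtype=theano.config.floatX)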

Reproducibility, randomness

For the sake of reproducibility, it should be possible to run the same piece of code twice and get the exact same result.

In practice this means that any step involving randomness should be able to reuse the same seeds. The cleanest way to achieve this is to allow for custom RNG objects to be used for anything that involves randomness.

An interesting complication is that randomness occurs both on the numpy side of things (e.g. parameter initialization, the order in which training examples are presented), and on the Theano side (e.g. dropout and other forms of training noise).

So I'm wondering how we should implement this. We could make every class or method involving randomness take an optional rng argument, but that might get messy. Maybe it's better if it was possible to control this globally? I haven't really thought about this yet.

Arbitrary expressions as Layer parameters

Currently, Layer parameters are assumed to be Theano shared variables. That way we can call get_value() and set_value() on them, and pass them to theano.function in updates.

However, it would be cool if Layer parameters could be arbitrary Theano expressions. There are a few use cases for this:

  • autoencoders with tied weights. If you have two layers l1 and l2, you might want to do something like l2 = nntools.layers.DenseLayer(l1, W=l1.W.T). This is currently not possible because l1.W.T is not a shared variable, but rather an expression.
  • sometimes the domain of parameters is restricted, and you might want to reparameterize things. For example, you might want all values to be positive. In that case you could reparameterize as follows: l2 = nntools.layers.DenseLayer(l1, W=T.exp(V)), where V is a shared variable. This ensures that all values in W are positive.

However, this requires some modifications to nntools.layers.get_all_params(), because we need a way to get the actual shared variables containing parameter values, not just the Theano expressions built on top of them. Otherwise there's no way to update them.

Given an arbitrary Theano expression, it is fairly easy to get a list of all shared variables that occur in it, by traversing the Theano graph. This isn't very 'clean' I suppose, but definitely possible.
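A sketch of that traversal (the exact import locations are an assumption to verify against the Theano version in use):

import theano
from theano.gof import graph

def get_shared_variables(expression):
    # collect all shared variables the expression depends on
    return [v for v in graph.inputs([expression])
            if isinstance(v, theano.compile.SharedVariable)]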

However, that just gives us a list of all shared variables, and we have no way of knowing if all of those contain learnable parameters. Perhaps some of them should not be touched by the learning.

We could assume that all shared variables represent learnable parameters by default. This would usually be the case. But then we need to provide a way for the user to specify that a given variable is not to be touched (for example, this could be a variable that contains a binary mask that restricts some of the parameters to be zero). Perhaps an extra attribute on the Layer instances that lists all "non-trainable" shared variables.

Should we support arbitrary expressions as Layer parameters? If so, there will be some added complexity, but it might be worth it. If we do not support this, a new Layer subclass has to be implemented for every new parameterization (so to support autoencoders with tied weights, we would need to implement a TransposedDenseLayer, for example).

What do you guys think?

Parameter initialization

I had a fruitful discussion with @f0k about how to handle parameter initialization. Following his suggestion I implemented a static method on the Layer class, Layer.create_param, which can be used to 'prepare' a shared variable containing layer parameters.

This method can take three types of input: a shared variable instance to use as is, a numpy array with initial parameter values to wrap in a shared variable, or a callable that takes an array shape as input and returns a numpy array of initial values.

The idea is that this method should be used when implementing a new layer, to initialize the shared variables containing the parameters associated with this layer. For example, in layers.DenseLayer:

def __init__(self, input_layer, num_units, W=init.Normal(0.01), b=init.Constant(0.), nonlinearity=nonlinearities.rectify):
    # ...
    self.W = self.create_param(W, (num_inputs, num_units))
    self.b = self.create_param(b, (num_units,))

The default values of W and b are callables that generate random gaussian and constant initial values respectively (implemented in nntools.init). But you could also pass in a numpy array that has been created on beforehand, or a Theano shared variable. The latter is useful for parameter sharing, for example:

l1 = nntools.layers.DenseLayer(l_in, num_units=100)
l2 = nntools.layers.DenseLayer(l_in, num_units=100, W=l1.W, b=l1.b)
# l1 and l2 now share the same parameter variables

I think this approach to initialization is pretty clean and flexible, but I was wondering if there are maybe any issues with it, or if there is a simpler way to achieve the same kind of flexibility.

One thing I definitely don't want to do is have the initialization code inside the *Layer classes, as was the case in my original layers.py code ( https://github.com/benanne/kaggle-galaxies/blob/master/layers.py ).
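For reference, the whole method is roughly this (a sketch of create_param as described above; in the library it would be a @staticmethod on Layer):

import numpy as np
import theano

def create_param(param, shape):
    if isinstance(param, theano.compile.SharedVariable):
        return param                    # use as is (enables parameter sharing)
    elif isinstance(param, np.ndarray):
        return theano.shared(param)     # wrap precomputed initial values
    elif callable(param):
        return theano.shared(np.asarray(param(shape)))  # run the initializer
    else:
        raise RuntimeError("param must be a shared variable, a numpy array "
                           "or a callable taking a shape")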

add batch-wise dropout layer, leaky ReLUs, ...

The results of the CIFAR-10 competition on Kaggle are out, and the 1st ranked solution uses some interesting tricks that we should probably make available in nntools: http://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network

leaky ReLUs
This idea has been proposed before, but I haven't really seen it used in the wild until now. Instead of having y = max(0, x), you have y = max(x/a, x), where a is some constant (say a = 10 or a = 3). This means you still get some sort of nonlinearity, but the gradient can flow through in both regimes. I don't think we have a clean way to do parameterised nonlinearities yet; maybe we need to create a callable class for this, like we have for the initializers. We could just use nested functions, but that will cause issues with picklability of the models (see #7).
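A callable class for this, mirroring the initializer classes, could look like the following sketch (being a top-level class, it also stays picklable):

import theano.tensor as T

class LeakyRectify(object):
    # y = max(x / a, x), with the constant a fixed at construction time
    def __init__(self, a=3.0):
        self.a = a

    def __call__(self, x):
        return T.maximum(x / self.a, x)

# usage: DenseLayer(l_in, num_units=100, nonlinearity=LeakyRectify(a=3))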

batch-wise dropout
Instead of sampling a random dropout mask for each sample, and multiplying the batch with the mask, sample only one dropout mask for the entire batch and index the batch and parameter matrices to completely delete all dropped out units. Between two layers with batch-wise dropout, the size of the dot product is reduced by 75%, so this could mean a speedup of up to 4x.

Of course this will also affect the learning trajectory, but I suspect it has advantages and disadvantages (it's probably not inherently worse than doing sample-wise dropout). I don't know if it will be feasible to reap the gains in computational efficiency in Theano, since implementing this will require advanced indexing (i.e. making copies). But it's worth implementing and evaluating, it doesn't sound too hard.

Unfortunately it probably isn't possible to isolate this into a separate layer like we did with regular dropout, since the parameter matrices have to be indexed with the dropout mask.
