
returnn_common

This repo provides common building blocks for RETURNN, such as models or networks, network creation code, datasets, etc.

nn: Network definitions, models

RETURNN originally used dicts to define the network (model, computation graph). The network consists of layers, where each layer represents a block of operations and potentially also parameters. Here, we adopt many conventions from PyTorch, functional Keras and other frameworks, so that you define the network (model, computation graph) in pure Python code. Further, a module instance does not represent the actual computation; only when you call it with actual inputs does it perform the computation (i.e. create the corresponding RETURNN layer, or RETURNN layer dict).

See the wiki for a starting point for documentation.

Usage examples

from returnn_common import nn


class MyModelBlock(nn.Module):
  def __init__(self, dim: nn.Dim, *,
               hidden: nn.Dim = nn.FeatureDim("hidden", 2048),
               dropout: float = 0.1):
    super().__init__()
    self.layer_norm = nn.LayerNorm(dim)
    self.linear_hidden = nn.Linear(dim, hidden)
    self.linear_out = nn.Linear(hidden, dim)
    self.dropout = dropout

  def __call__(self, x: nn.Tensor) -> nn.Tensor:
    y = self.layer_norm(x)
    y = self.linear_hidden(y)
    y = nn.sigmoid(y)
    y = self.linear_out(y)
    y = nn.dropout(y, dropout=self.dropout, axis=nn.any_feature_dim)
    return x + y

In case you want to have this block three times, each with its own separate parameters:

class MyModel(nn.Module):
  def __init__(self, dim: nn.Dim):
    super().__init__()
    self.block1 = MyModelBlock(dim)
    self.block2 = MyModelBlock(dim)
    self.block3 = MyModelBlock(dim)

  def __call__(self, x: nn.Tensor) -> nn.Tensor:
    x = self.block1(x)
    x = self.block2(x)
    x = self.block3(x)
    return x

Or if you want to share the parameters but run this three times:

class MyModel(nn.Module):
  def __init__(self, dim: nn.Dim):
    super().__init__()
    self.block = MyModelBlock(dim)

  def __call__(self, x: nn.Tensor) -> nn.Tensor:
    x = self.block(x)
    x = self.block(x)
    x = self.block(x)
    return x

Installation and usage

When this is integrated as part of a Sisyphus recipe, the common way people use it is similar to i6_experiments, i.e. you would git clone this repo into your recipe directory.

Usage as Sisyphus recipe submodule

See i6_experiments.

Usage via returnn.import_

Earlier, this repo was intended to be used via the RETURNN returnn.import_ mechanism. See returnn #436 for the initial import_ discussions, and #2 for discussions on import_ usage here. Note that this might not be the preferred usage pattern anymore, but this is up to you.

Usage example for config:

from returnn.import_ import import_
test = import_("github.com/rwth-i6/returnn_common", "test.py", "20210602-1bc6822")
print(test.hello())

You can also make use of auto-completion features in your editor (e.g. PyCharm). Add ~/returnn/_pkg_import to your Python paths, and use this alternative code:

from returnn.import_ import import_
import_("github.com/rwth-i6/returnn_common", ".", "20210602-1bc6822")
from returnn_import.github_com.rwth_i6.returnn_common.v20210302133012_01094bef2761 import test
print(test.hello())

During development of a new feature in returnn_common, you would use a special None placeholder for the version, such that you can directly work in the checked-out repo. The config code looks like this:

from returnn.import_ import import_
import_("github.com/rwth-i6/returnn_common", ".", None)
from returnn_import.github_com.rwth_i6.returnn_common.dev import test
print(test.hello())

You would also edit the code in ~/returnn/pkg/..., and once finished, you would commit and push to returnn_common, and then change the config to that specific version (date & commit).

Code principles

These are the ideas behind the recipes. If you want to contribute, please try to follow them. (If something is unclear, or even in general, it is better to speak with someone before you make changes or add something.)

Simplicity

This is supposed to be simple. Functions or classes can have some options, but with reasonable defaults; this should not become too complicated. E.g. a function to return a Librispeech corpus should not be totally generic and cover every possible case. When it doesn't fit your use case, instead of making the function more complicated, just provide your alternative LibrispeechCustomX class. There should be reasonable defaults, e.g. just Librispeech() should give you some reasonable dataset.

E.g. ~5 arguments per function is OK (and each argument should have a good default), but it should not be much more. Better make separate functions instead, even if that means some amount of duplicated code (make_transformer, which creates a standard Transformer, vs. make_linformer, which creates a Linformer, etc.).

Building blocks

It should be simple to use functions as basic building blocks to build something more complex. E.g. when you implement the Transformer model (put that into models/segmental/transformer.py), make functions make_trafo_enc_block and make_trafo_encoder(num_layers=..., ...) in models/encoder/transformer.py, and then make_transformer_decoder and make_transformer in models/segmental/transformer.py. That makes parts of it easily reusable. Break it down as far as is reasonable; a sketch of such a layering follows below.
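For illustration, a hedged sketch of how such building-block functions could be layered. The exact signatures, defaults and argument names here are made up, not the actual API:

from returnn_common import nn


def make_trafo_enc_block(dim: nn.Dim, *, num_heads: int = 8, dropout: float = 0.1) -> nn.Module:
  """One Transformer encoder block, reusable on its own."""
  ...


def make_trafo_encoder(dim: nn.Dim, *, num_layers: int = 6, **block_opts) -> nn.Module:
  """Stacks num_layers encoder blocks."""
  ...


def make_transformer(dim: nn.Dim, *, num_enc_layers: int = 6, num_dec_layers: int = 6) -> nn.Module:
  """Combines encoder and decoder into the full model."""
  ...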

Code dependencies

The building blocks will naturally depend on each other. In most cases, you should use relative imports to make use of other building blocks, and not import_.

Data dependencies

Small files (e.g. vocabularies up to a certain size, <100kb or so) could be put directly into the repository next to the Python files. This should be kept minimal and only be used for the most common files. (E.g. our Librispeech BPE vocab is stored this way.) The repository should stay small, so avoid this unless it is really needed.

For any larger or other files, the idea is that this can easily be used across different systems. So there would be a common directory structure in some directory, which could be symlinked from elsewhere. (We could also provide some scripts to simplify handling this.) To refer to such a file path, use the functions in data.py.

Requirements

Python 3.7+. See #43.

Recent RETURNN (>=2022), needs behavior version >=12.

returnn_common's People

Contributors

albertz, atticus1806, jacktemaki, mmz33, vieting


returnn_common's Issues

`Loop.State` incomplete

(Initial design via #16.)

The initial state is not really handled yet.

Also, we need to think about shape and dtype.

Maybe we also want to pass shape and dtype on to RETURNN, to simplify the recurrent template construction.
Currently, this would be via out_type.
When we have rwth-i6/returnn#706, maybe this would be another way, by out_shape or so.

shape also must be able to handle dynamic dims which could change in each iteration (e.g. for cum_concat).

We also don't handle nested state (e.g. LayerState) yet but this is really required.

Remove deprecated generated layers

This is more a question at this point:

There are a couple of deprecated layers.

  • E.g. SelfAttentionLayer (so the SelfAttention module), which can be constructed more explicitly via CumConcatLayer etc. We should provide one implementation for self attention, but maybe directly using other atomic layers (CumConcatLayer etc).
  • GaussWindowAttention
  • Etc

(We should specify a full list here.)

So, should we remove them now? Better now than later when people start to use them.

Eager-style debugging

The way models are defined with the PyTorch-style API is oriented to allow for a simple mental model for the user, which allows for eager-like thinking/reasoning about the code / model definitions. This is even for recurrent definitions (#16).

For debugging purpose, it would be helpful to also allow eager execution.

This should be optional and not used by default (the default would be graph mode), because eager execution would be way less efficient. The code behavior itself should not change at all.

This should be technically possible though, because for all definitions / module calls, all values can be calculated at the time when the Python code is called. Some details on how we do this internally need to be sorted out. Not sure which is the easiest way. E.g.:

  • Really use TF eager mode. But this probably needs some changes on RETURNN side. (E.g. replace tf.placeholder.)
  • Implement this purely on returnn-common side.

Incorrect network when applying a simple rec layer

class DemoModule(Module):

    def __init__(self):
        super().__init__()
        self.fw_rec = models.Rec(n_out=128, unit="nativelstm2", direction=1)

    def forward(self):
        inp = get_root_extern_data("data")
        fw_out = self.fw_rec(inp)
        return fw_out

currently gives:

{'fw_rec': {'class': 'rec',
 'from': 'data:data',
 'n_out': 128,
 'unit': 'nativelstm2',
 'direction': 1},
'fw_rec_state': {'class': 'get_last_hidden_state', 'from': 'fw_rec'},
'output': {'class': 'copy', 'from': ('fw_rec', 'fw_rec_state')}}

This is definitely not what I would expect. Why does it create a state variable? And then it even tries to merge the recurrent layer and the state together; this cannot be correct in any way.

Relative imports, RETURNN import_, or intended more stable?

Relative imports can be difficult to read in some cases, when you need to go up a couple of packages.
Something like from returnn_common... import ... can look nicer and cleaner in some cases.

However, this currently is not possible when it should work with the RETURNN import_ mechanism.

Maybe we can somehow fix the RETURNN import_ mechanism to make this possible, although I'm afraid this would be hacky, and I don't see a simple solution currently. (If we want to go this route, let's open a separate issue in the RETURNN repo about this.)

The question is also whether we need import_ here. import_ is intended for unstable code which can easily break. But in e.g. i6_experiments we handle this differently:

  • Common code (in the common subdir in i6_experiments) is supposed to be relatively stable, not break too much, and try to stay compatible with older setups. (It's not really settled, and maybe less strict than RETURNN itself, or i6_core, but still.)
    This corresponds to the code we have here directly, e.g. in models, etc.

  • User code (in the users subdir in i6_experiments) can be as unstable as someone wants. There are no guarantees.
    We have the same here, although the naming might be different (we would not have a users subdir but instead custom modules with some postfix, like conformer_chris.py or so).

So maybe this is fine, and then we do not need import_. Instead, we can handle this repo in a similar way as i6_experiments, clone it into the recipe subdir for Sisyphus, and make it also available for RETURNN configs.

Masked computation wrapper

Similar to the rec loop design (#16), for the masked computation, we could have some API like with MaskedComputation(mask=...). This would wrap MaskedComputationLayer, and also automatically apply UnmaskLayer.

Note that while it is quite trivial to implement such masking logic by hand (using nn.where given the mask, to update the output and state or take the previous output and state), such explicit nn.MaskedComputation allows for efficiency optimization on RETURNN side. Specifically, when it can optimize this part out of the loop, it can calculate it much more efficiently by only going over the relevant frames.
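A minimal sketch of that by-hand masking logic, for comparison; it is assumed to run inside the nn.Loop context of the example below, with loop.state.slow holding the previous SlowRNN output, and the names (slow_rnn, BLANK, x_t) taken from that example:

update = loop.state.align_label != BLANK           # mask: recompute SlowRNN only on non-blank frames
slow_new = slow_rnn(loop.state.align_label, x_t)   # always computed, even where masked out
loop.state.slow = nn.where(update, slow_new, loop.state.slow)  # keep the previous value where masked
slow = loop.state.slow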


Example for transducer using SlowRNN and FastRNN:

x  # shape {batch,enc_time,dim}
slow_rnn = SlowRNN()
fast_rnn = FastRNN()
blank_pred = nn.Linear(1)
non_blank_pred = nn.Linear(...)
loop = nn.Loop()  # over alignment labels
loop.state.t = nn.zeros([nn.batch_dim], dtype="int32")
loop.state.align_label = nn.zeros([nn.batch_dim], dtype="int32")
with loop:
  x_t = x[loop.state.t]  # shape {batch,dim}
  with nn.MaskedComputation(mask=(loop.state.align_label != BLANK)):
    slow = slow_rnn(loop.state.align_label, x_t)
  fast = fast_rnn(loop.state.align_label, x_t, slow)
  blank_pred_energy = blank_pred(fast)
  log_prob_blank = nn.log_sigmoid(blank_pred_energy)
  log_prob_not_blank = nn.log_sigmoid(-blank_pred_energy)
  log_prob_non_blank_labels = nn.log_softmax(non_blank_pred(fast))
  log_prob_combined = nn.concat(log_prob_non_blank_labels + log_prob_not_blank, log_prob_blank)
  loop.state.align_label = nn.choice(log_prob_combined, input_type="log_prob")
  loop.state.t = loop.state.t + nn.where(loop.state.align_label == BLANK, 1, 0)
  loop.end(loop.state.t >= x.seq_len)

Generated layers, some arguments only for module call

By definition (or maybe convention), a module constructor gets the arguments which define the model aspects, such as parameter sizes, etc. E.g. PyTorch Linear has input-dim and output-dim.

A module call will then perform the actual operation. It should be possible to call a module with different inputs, and it would then use the same parameters. The inputs might have different numbers of dimensions.

We break this convention in multiple cases:

  • Linear currently is without input dimension. With rwth-i6/returnn#597, it is at least optional. When called with multiple inputs, it would assume that all inputs have the same input dimension. Maybe this is fine and not too much of an issue. It behaves more like PyTorch LazyLinear.
  • For layers operating on axes (basically also rwth-i6/returnn#597), those axes (dim tags) very likely depend on the input. So they should not be arguments to the module constructor but to the module call.
  • Layers creating new axes, so again rwth-i6/returnn#597, e.g. ConvLayer: the discussed options out_spatial_dims and in_spatial_dims should only be for the module call, while in_dim and out_dim would be for the module constructor.

So more in general:

  • Arguments which define the general model, esp parameter sizes, should be module constructor arguments.
  • Arguments which are specific to the input, e.g. axes (dim tags) to operate on, should be module call arguments.

Although this classification is ambiguous for some other arguments...
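A hedged sketch of what this convention could look like for a convolution wrapper; the class name Conv1d, the exact signatures, and the dims (in_dim, out_dim, time_dim) and input x are only illustrative, not a fixed API:

# Model-defining arguments (parameter dims, filter size) go to the constructor;
# input-specific arguments (which spatial axis to operate on) go to the call.
conv = nn.Conv1d(in_dim=in_dim, out_dim=out_dim, filter_size=3)  # defines the parameters
y, out_spatial_dim = conv(x, in_spatial_dim=time_dim)            # axes depend on the given input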

Generated layers, functional-like for pure functional layers

Some layers (modules) are functional, similar to the PyTorch functional API. Basically this means that they do not have their own parameters.

For example:

  • ReduceLayer
  • DotLayer
  • ChoiceLayer
  • ActivationLayer
  • Etc

(We should specify a whole list.)

It is common (at least in PyTorch) that such functional ops (without parameters) are defined as functions, not as modules. E.g. there is a function like reduce, or dot, or activation (or, more directly, tanh), etc.

In any case, we should provide such functions.

Still open is the question whether we should also export the modules which wrap the RETURNN layers (Reduce, Dot, etc.) or whether these should be hidden.

Related is also #29.

Somewhat related is also #28, as it is not really clear whether the arguments should be for the module constructor or module call.

How to handle Sisyphus hashes

When this becomes more widely used, the resulting net dicts will often also be used for Sisyphus hashes. This means that every minor change can lead to changed Sis hashes, including things like the layer name heuristics, etc.

I have heard already about different opinions and preferences on this aspect, so returnn-common will not enforce anything.

I expect the net dicts to change quite often even when there is no semantic or logical change (e.g. just some layer name heuristic changed, without changing param name spaces though). And then the consequence is that people either don't update returnn-common (which is bad), end up with forks of returnn-common with only selected changes (even worse), or we are forced to not make changes anymore to the net dict unless really necessary, which will possibly restrict us or require ugly workarounds later or so (also not good).

Because of that, my original idea was to not use the resulting net dict but some other intermediate representation for Sis hashes. This is kind of similar to how a Sisyphus Job object can also make some aspects of the Sis hash explicit, e.g. by overriding the hash function. However, this is not implemented yet, and it will probably also have some other drawbacks, depending on the specific implementation. One concern was that people were afraid that actual semantic changes would possibly not lead to a changed Sis hash due to potential bugs in this implementation. Although my counter argument would be that this could be true for any Sisyphus Job with some custom hash logic (or even without, when it depends on external things).

In any case, maybe we should think a bit about this before the first release.

Higher-level encoder decoder interfaces for transducer, attention, LM, ILM, etc

The encoder interface is quite trivial, basically just any [LayerRef] -> LayerRef function, although the interface also should imply the tensor format {B,T,D} or so.
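A hedged, minimal sketch of what such an encoder interface could look like, written with the nn.Tensor naming from the usage example at the top (the issue text uses the older LayerRef name); the details are only illustrative:

from returnn_common import nn


class IEncoder(nn.Module):
  """Generic encoder: maps an input sequence {B,T,D_in} to an encoded sequence {B,T,D_enc}."""

  def __call__(self, source: nn.Tensor) -> nn.Tensor:
    raise NotImplementedError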

The idea was to have a generic interface for the decoder which allows defining both a transducer (in its most generic form, including RNN-T, RNA, etc., either time-sync or alignment-sync) and a standard attention-based label-sync decoder.

The interface should allow for easy integration of an external LM, and also allow for integration of ILM estimation and subtraction.

A current draft is here.


We should implement some attention-based encoder-decoder example and some transducer example, both using an external LM + ILM estimation and subtraction.

Transformer should then also be refactored to make use of this interface.

Rec design for recurrent definitions / loops

This issue is to collect some thoughts on the recurrent loops design, which wraps the RecLayer with an explicit subnetwork in RETURNN.

The main goal is to have this very straight-forward and simple for the user. We can abstract away from the underlying RecLayer if that makes things easier. We can also extend RETURNN itself if needed.

Related is also #6 (rec prev mechanism), and this issue here might fix/resolve #6, although not necessarily.

This also needs some mechanism for unrolling/unstacking, i.e. when we iterate over an input x with some time axis, to get x[t]. This is rwth-i6/returnn#552.


To define a loop like this pseudo Python code:

x  # given, shape {batch, time, dim}
h = Zeros({batch,dim})()  # initial state, shape {batch,dim}
out = []
for t in range(x.max_seq_len):
  x_lin = Linear(dim)(x[t])
  h_prev = h
  h = Linear(dim)(x_lin + h_prev)
  out.append(h)

h  # final state
out  # shape {time, batch, dim}

Current design:

There is Loop() which can be used in a with context, which corresponds to the for-loop in the example, or in general to a while-loop. Like:

with Loop() as loop:
  ...

There is State() which can define hidden state (for any module or any code).

The example above can be written as:

h = State({batch, dim}, initial=0)
with Loop() as loop:  # this introduces a new loop
  x_t = loop.unstack(x)  # shape {batch, dim}

  x_lin = Linear(dim)(x_t)
  h_prev = h.get()
  h_ = Linear(dim)(x_lin + h_prev)  # shape {batch, dim}
  h.assign(h_)

  out = loop.stack(h_)  # shape {time,batch,dim}
  h_last = loop.last(h_)

# h.get() would now return the last state
# h_last is an alternative

Or with a module as:

class MyRec(Module):
  def __init__(self):
    super().__init__()
    self.x_linear = Linear(dim)
    self.h_linear = Linear(dim)
    self.h = State({batch, dim}, initial=0)

  def forward(self, x):
    # x shape is {batch, dim}
    x_lin = self.x_linear(x)
    h_prev = self.h.get()
    h = self.h_linear(x_lin + h_prev)  # shape {batch, dim}
    self.h.assign(h)
    return h

rec = MyRec()
with Loop() as loop:  # this introduces a new loop
  x_t = loop.unstack(x)  # shape {batch, dim}
  h_ = rec(x_t)  # shape {batch,dim}. this represents the inner value
  h = loop.last(h_)  # shape {batch,dim}
  out = loop.stack(h_)  # shape {time,batch,dim}

For the TF name scopes (and variable scopes), we should follow #25, i.e. make them exactly match the module hierarchy.

The RETURNN layer name of the created RecLayer via Loop does not matter too much. It could be arbitrary, or some clever (but simple) logic to use the first module name or so. The RETURNN layer hierarchy can be independent from the actual TF name scopes (via #25).

Special options for the RecLayer like include_eos can be options for Loop, like Loop(include_eos=True). Or as a method, like loop.set_include_eos(True).

Loop (potential) methods:

  • unstack.
    We need rwth-i6/returnn#552 for this.
    unstack also implicitly implies that the loop runs over the time-axis of x.
  • last
  • stack
  • idx: to return some layer which wraps RETURNN ':i'

State has methods get and assign. (... See discussion below for more ...)

Current reasonings:

Why no special base class Rec which derives from Module? We want to easily allow to use any kind of module inside a loop. We think the current API makes this more straight-forward.

Why is h not an argument of forward, and why State instead? This allows to call other sub modules, which might define their own hidden state. So the root recurrent module does not need to know about all the hidden states of sub modules.

Why to have the hidden state explicit, and not use sth more close to self.prev? To make the behavior more straight-forward.

The current design allows for nested loops and sub modules with hidden state.
Only the Loop() call actually introduces a new loop.

class MySubRec(Module):
  def __init__(self):
    super().__init__()
    self.h = State({batch,dim})

  def forward(self, a):
    # assume a shape {batch,dim}
    h = self.h.get() + a
    self.h.assign(h)
    return h

class MyRec(Module):
  def __init__(self):
    super().__init__()
    self.sub = MySubRec()
    self.h = State({batch,dim})

  def forward(self, x):
    a = self.h.get() + x

    # example with sub as nested loop
    with Loop() as loop:
      y = self.sub(a)
      y = loop.last(y)

    # or: example with sub in same loop
    y = self.sub(a)
    
    self.h.assign(y)
    return y

There should not be any special handling needed for the Choice layer.
Note that the search flag and train flag logic is a separate thing (#18).

There should not be any special handling needed whether the input to a rec module call would be inside the current/same loop or not. unstack on some value which is already inside the loop would not make sense, though, and should result in an error. But this would all be covered by RETURNN logic already.

RETURNN rec automatic optimization should not cause any problems. RETURNN already should guarantee that it is equivalent. From the user view point, it never ever should matter whether it is optimized. Otherwise this is rwth-i6/returnn#573. On this returnn-common level, it should not matter.


Example for LSTM for a single step:

class Lstm(Module):
  def __init__(self):
    super().__init__()
    self.h = State({batch,dim})
    self.c = State({batch,dim})
    self.ff_linear = Linear(dim * 4)
    self.rec_linear = Linear(dim * 4)

  def forward(self, x):
    # x shape is {batch,dim} (single frame)
    x_ = self.ff_linear(x)
    h_ = self.rec_linear(self.h.get())
    x_in, g_in, g_forget, g_out = split(x_ + h_, 4)
    c = self.c.get() * sigmoid(g_forget) + tanh(x_in) * sigmoid(g_in)
    self.c.assign(c)
    h = tanh(c) * sigmoid(g_out)
    self.h.assign(h)
    return h

Reorganization of code and module names, and user network code conventions

We should define (or sketch) common conventions when someone uses the returnn-common network definition code, e.g. Module and other layer wrappers.


In TensorFlow, import tensorflow as tf is really standard.

In PyTorch, see examples.
PyTorch code usually always does import torch and then all user code would write out the names like torch.nn.Module, torch.nn.Linear, or torch.nn.MSELoss etc.
Sometimes you also see from torch import nn.
But I also have seen from torch.nn import Sequential, Conv2d, MaxPool2d, Linear, ReLU.
For the functional API, the convention is import torch.nn.functional as F.

In Keras, see examples.
Keras code usually does from tensorflow import keras and then keras.layers.Layer, or also from tensorflow.keras import layers and then layers.Dense etc.

See Trax.
In Trax, it is common to use from trax import layers as tl, and then tl.Embedding, tl.Dense, tl.Serial, etc.

See Flax.
In Flax, you see import flax.linen as nn and then nn.Module.
(Also see the doc on flax.linen as evolved from the earlier flax.nn, and this.)


So, for us:

One problem we have is that returnn_common is already long, too long for a user to want to write it out everywhere. So we will not see names fully written out, like returnn_common.models.base.Module.

So, maybe the convention: import returnn_common as rc? But that's bad because rc is too commonly used in other contexts.
The name should be at most 3 chars (at 4 chars, e.g. PyCharm will perform spell checking).
The name should not be commonly used in other contexts.
The name should somehow reflect "RETURNN common".
Suggestions (for X in import returnn_common as X) are welcome.
As a good proxy, use GitHub code search, e.g. this. The number should not be too high; rc yields 174,541,911 code results.

Or maybe not so important, but instead some convention for returnn_common.models, like from returnn_common import models as rcm? Or rmm?

Wildcard imports (from returnn_common.models.layers import *) are bad in general for Python, not just specific here.

I guess explicitly importing things is probably fine, so when the user does from returnn_common.models.layers import Module, Linear or so.

Maybe from returnn_common.models import layers is also fine, and then layers.Linear etc.


Along the lines, we maybe should also restructure returnn_common, and specifically returnn_common.models, or returnn_common.models.layers etc. Now we still can do that.

I think it's bad that currently returnn_common.models.layers is the main import for the user, which also includes the returnn_common.models.layers.base. In Python, usually only packages (__init__.py) might include things from sub-packages (so torch.nn includes things from torch.nn.modules and that includes things from torch.nn.modules.module, torch.nn.modules.linear, etc), but you usually never see that one module includes everything from another module on the same level (eg sth like torch.nn.modules.rnn will not do from .linear import *), although single imports are fine (you often see from .module import Module).
So this definitely has to change. E.g. we could make returnn_common.models the base import for the user, which includes everything from base and layers, and layers and base are just two sub modules.

Maybe returnn_common.models is also not so optimal as it really contains also the basic structures. Maybe returnn_common.models.base should just move to returnn_common.base, and base already be imported as part of returnn_common.

Maybe we could rename returnn_common.models to returnn_common.nn. Then it becomes very similar to PyTorch or Flax.

We might also split the base modules (layer wrappers) in returnn_common.models from higher-level interfaces (what IEncoder, IDecoderLabelSyncRnn etc is supposed to become) and higher-level models (Conformer etc)? Although the separation is not always clear. E.g. LSTM, is it part of the base or higher-level? Transformer? Conformer? In PyTorch, it's all in torch.nn.

Loss Scale

So as I understand it right now, we want to move away from using RETURNN losses and use losses 'as is'. How would adding a loss scale be intended here? Use a multiplication with a constant? Or is there a better way, or is it not implemented yet? Maybe add a parameter to the mark_as_loss() function with a default value of 1?
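A hedged sketch of the two variants discussed above; neither is confirmed as the actual API, and the scale argument in particular is only the proposal from the question:

loss = ...  # some tensor that should contribute to training

(loss * 0.3).mark_as_loss()   # explicit: just scale the tensor before marking it as loss
loss.mark_as_loss(scale=0.3)  # proposed: a scale parameter on mark_as_loss, with default 1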

Prev: in Rec

What would be the proper way to access the previous timestep for the Rec class? Is there a logic for that already? Just using a function like get_special_layer would not make sense, I think; since this is a crucial feature of a rec unit, making it part of the class would be good.

Remove automatic layer input concatenation?

Many layers in RETURNN (all that derive from _ConcatInputLayer) support automatic concatenation of inputs. This simplifies writing RETURNN networks, as this is very common.

Currently we have simply adopted this behavior here.

However, I now question whether this is good, or whether we should require only a single input argument for Linear, Rec and others. The concatenation would then have to be made explicit by the user, via concat, as in the sketch below.
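A hedged sketch of the difference; concat here only stands for whatever explicit concatenation helper we end up providing, and linear, a, b are placeholders:

y = linear([a, b])        # current behavior: the inputs get concatenated automatically
y = linear(concat(a, b))  # proposed: the user concatenates explicitly, Linear takes a single input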

Make minimum Python version explicit

We need some check like assert sys.version_info[:2] >= (3, 7) or so, together with some documentation of which features we actually rely on (listed below).
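A minimal sketch of such a check (the exact minimum version is what the list below is about):

import sys

assert sys.version_info[:2] >= (3, 7), "returnn_common requires Python 3.7+"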

  • We rely on deterministic insertion order of dict, which requires Python >=3.6 (see here).

  • We use from __future__ import annotations which requires Python >=3.7 (doc).

  • typing.Protocol requires Python >=3.8.

  • Slash (/) in function argument list to mark positional-only arguments (doc) requires Python >=3.8.

Originally posted by @albertz in #36 (comment)

Related discussion on RETURNN side: rwth-i6/returnn#487 (which currently still supports much older Python versions, also Python 2).

RETURNN layers with hidden state should make it explicit

As it was discussed in #16, RETURNN layers with (hidden) state (e.g. RecLayer with unit="lstm") should make the state explicit in the API. E.g. the Rec module should get two arguments input and prev_state and return output and state. So the usage would look like this in a loop:

lstm = Lstm(...)
with Loop() as loop:
  ...
  out, loop.state.lstm = lstm(x, loop.state.lstm)

Or like this outside a loop (using default initial state, ignoring last state):

lstm = Lstm(...)
out, _ = lstm(x)

This applies for all RETURNN layers with rec hidden state, and further modules like Lstm.
See RETURNN layers with rec hidden state.

Relevant modules here:

  • _Rec based, e.g. Lstm (only one so far)
  • window
  • cumsum
  • ken_lm_state
  • edit_distance_table
  • unmask
  • _TwoDLSTM
  • cum_concat

Replay buffer / auxiliary database storage

This is partly already supported: use HDFDumpLayer, and then use that HDF in the next epoch via the multi-stage training support (a custom get_network), overwriting the datasets and using MetaDataset to also load the HDF via HDFDataset.
(See rwth-i6/returnn#311 for some discussion on this.)

This could be made much simpler. Maybe on RETURNN side (then this becomes obsolete here), or on returnn-common side.

Name scope for custom module functions?

The module forward function has the known semantics of creating a subnetwork when the module is called directly (handled via __call__).

When the forward function is called directly, this logic does not apply, but maybe that doesn't really matter as this is not intended to be used like this.

It can be common for some applications that modules define other custom functions which can be called from "outside" ("outside" means not within its forward). E.g. initial_state (as discussed in #31, #35) is one such method.

Should this somehow automatically make some name scope, such that initial_state would also end up in a name scope? E.g. when the module is called lstm and ends up in such a subnetwork (or just layer), calling lstm.initial_state() would make a subnetwork lstm_initial_state and set the name_scope="lstm". But how would this be done?

Multiple returns doesn't use output layer

When returning multiple layers, referencing these layers causes all of them to be referenced in the RETURNN config via the sublayer name (with /), instead of one of them (the first one) being referenced with just the subnetwork name (which would point to the output layer). This then causes an issue when starting RETURNN, since RETURNN does not find the output layer in the subnetwork. I guess this happens because it gets optimized away, since it is not used. Setting one of them as the output layer by hand fixes this issue. There might be an easy returnn_common fix for this, but right now I don't have the time to look into it, so I will just leave this issue here; maybe you see an easy fix.

An example (which, once fixed, could maybe also be turned into a test) would be:

class SubSub(Module):

  def __init__(self):
    super().__init__()
    self.linear = layers.Linear(n_out=2)

  def forward(self, x: LayerRef) -> LayerRef:
    lin1 = self.linear(x)
    lin2 = self.linear(lin1)
    return lin1, lin2


class Sub(Module):
  def __init__(self):
    super().__init__()
    self.linear = layers.Linear(n_out=3)
    self.sub = SubSub()

  def forward(self) -> LayerRef:
    data = get_extern_data("x")
    lin1, lin2 = self.sub(data)
    lin3 = self.linear([lin1, lin2])
    return lin3

produces the following config, where you can see that just sub/linear and sub/linear_0 are used in from, but never sub itself:

{'sub': {'class': 'subnetwork', 
           'from': [], 'subnetwork': {'linear': {'class': 'linear', 'from': 'base:data:x', 'n_out': 2},
                                      'linear_0': {'class': 'linear', 'from': 'linear', 'n_out': 2, 'reuse_params': 'linear'}, 
                                      'output': {'class': 'copy', 'from': 'linear'}}}, 
 'linear': {'class': 'linear', 'from': ['sub/linear', 'sub/linear_0'], 'n_out': 3},
 'output': {'class': 'copy', 'from': 'linear'}}





Cond wrapper

Similar to #23, some API like with Cond(...) as cond_obj:, which corresponds to if ...:, and then some further with cond_obj.false_branch():, which corresponds to the else: branch.

Example:

x = ... # whatever
cond = ...  # scalar tensor, True or False
with Cond(cond) as cond_obj:
  y = mod_true_case(x)
with cond_obj.false_branch():
  y = mod_false_case(x)

But this is not so clear yet.

Functional layer API, conventions

Definition: Functional means that the layer / op does not have trainable parameters.

Examples:

  • tanh, sigmoid etc, i.e. all math ops. RETURNN: ActivationLayer
  • dot/matmul/einsum. RETURNN: DotLayer
  • split. RETURNN: SplitLayer

Instead of writing Activation(activation="tanh")(x), one should be able to write simply tanh(x).
Instead of Dot(...)(x, y), one should be able to write dot(x, y, ...) or so. (Or maybe using a more einsum-like API.)
Instead of Split(...)(x), split(x, ...).

Similar to the PyTorch functional API.
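A hedged sketch of how such a functional wrapper could be defined on top of the generated modules; Activation and LayerRef here only stand in for the generated module wrapping ActivationLayer and the layer-ref type:

def tanh(x: LayerRef) -> LayerRef:
  """Functional tanh: no parameters, just a thin wrapper around the generated activation module."""
  return Activation(activation="tanh")(x)


def sigmoid(x: LayerRef) -> LayerRef:
  """Functional sigmoid, same pattern."""
  return Activation(activation="sigmoid")(x)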

The naming convention would be to start with a lowercase letter, unlike modules, which start with an uppercase letter.

Also, modules are classes and need to be instantiated. The functional API would just behave like functions.

Related are also the elemwise ops on a LayerRef, such as +, == etc.


Some open questions:

  • Where to define? In which namespace?
  • How far automatically generated?
    • E.g. we could extend the current layer generation code to automatically put layers without params into the functional namespace.
    • Still we additionally want to manually/explicitly define some functions, e.g. einsum/dot. Also tanh etc need to be explicit.
  • Should we have always both variants, like Sigmoid as a module, and sigmoid as a function?
    • PyTorch has this for some functions. But not always.
    • Flax only has the functional variant when some op is purely functional.

Design/Handling of dimension tags

Like batch dim, spatial dims (with dynamic lengths), or static dims (named, or also unnamed).

Dim (earlier DimensionTag) in RETURNN. Directly use that, or wrap it somehow?

Should this (the batch dim) include beam information, or be separate from this?

Relevant for all layers which define some shape or dimension (e.g. Const, Variable).

Should this be enforced, i.e. no simple int allowed in n_out or so but always a Dim object?
And maybe better use out_dim instead of n_out (consistent with rwth-i6/returnn#597).
Edit: It was decided to make nn.Dim mandatory, and use out_dim instead of n_out.

Very related is this issue on RETURNN side on explicit dim tags: rwth-i6/returnn#597

Related is also whether we want unique dim tags? (#48, rwth-i6/returnn#632)


This issue covers multiple aspects:

  • Use dim tags. Directly use RETURNN Dim. We inherit all its logic on equality etc. We also have FeatureDim, SpatialDim from RETURNN.
  • Dim tags (Dim instances) are mandatory for any shape or size
  • Shape for all tensors is always available (via #47)
  • We have Tensor.verify_out_shape for easy verification
  • Solution for in_dim == out_dim, square matrices, #17 (comment), rwth-i6/returnn#871
  • Instead of an out_spatial_dim argument, a new spatial dim gets returned. See pool1d for an example.
  • Tensor shape annotations, moved to #97
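A hedged sketch of the resulting conventions; the dim names and sizes, and the input x, are made up for illustration:

from returnn_common import nn

time_dim = nn.SpatialDim("time")
in_dim = nn.FeatureDim("in", 40)
out_dim = nn.FeatureDim("out", 128)

# x is assumed to be some tensor with dims {batch, time, in}.
linear = nn.Linear(in_dim, out_dim)  # out_dim instead of n_out, and it must be a Dim
y = linear(x)
y.verify_out_shape({nn.batch_dim, time_dim, out_dim})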

Beam Size in Choice and BaseChoice

Both BaseChoice and Choice have the parameter beam_size, which causes the following error when building a Choice layer:
TypeError: __init__() missing 1 required keyword-only argument: 'beam_size'

From reading the code, I would suggest removing beam_size from the Choice layer (class), since other classes inheriting from BaseChoice might also need it. What do you think?

Also how would you handle this? Make an exception in the _generate_layers script or is there a better way?
I can make a PR once we have decided on a good solution.

Definition of losses, `mark_as_loss`?

Currently, to define some tensor (layer ref) as a loss, you call mark_as_loss on it. The idea was to be somewhat analogous to calling loss.backward() in PyTorch.
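For example, roughly like this (a hedged sketch; cross_entropy and its arguments here only stand in for whatever loss helper is actually used):

loss = nn.cross_entropy(target=targets, estimated=logits, estimated_type="logits")
loss.mark_as_loss()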

Common code in PyTorch looks like this (see here):

class MyModel(nn.Module):
  ...

model = MyModel(...)

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for input, targets in dataset:
  optimizer.zero_grad()
  output = model(input)
  loss = loss_fn(output, targets)
  loss.backward()
  optimizer.step()

So the loss.backward() call, and also the definition of loss itself, is somewhat separate from MyModel. In MyModel, you would not really define the loss. So this is usually decoupled.

This is not how it would work for returnn-common currently, where it cannot be separated.
When you call make_root_net_dict (#44) on model, it just calls model(...) (using extern data) and that's it.

So the current API (make_root_net_dict) implies that the loss is defined inside the model, inside MyModel, and cannot be decoupled. Or can it?

I think we should be able to decouple it, if we want to. Any module (e.g. Transformer #53) should just define the model and not be specific about losses.

The question is how exactly.

Maybe we can extend make_root_net_dict to pass train_loss as well or so.

(I open a separate issue on this because #38 is just on the aspect of what loss functions or modules we want and their naming and usage conventions.)

Collecting frame error, label error, edit distance?

In RETURNN, the losses actually did two things:

  • Calculate a differentiable loss (sometimes also called "score", although this is misleading, as we want to minimize it).
  • Optionally, calculate some non-differentiable error. Usually frame error, e.g. comparing arg max to the target class which is either correct or wrong. In case of sequence-level losses like CTC, this would often calculate the edit distance.

The error was mostly just used as running statistics, although it was also potentially used for learning rate scheduling.

We don't really have that now. But we need it, to cover similar behavior or at least reporting as before.

We could just use the normal loss mechanism (mark_as_loss or maybe sth different #56) and also handle frame error, edit distance or other things this way. This is maybe already the solution. Maybe it should set scale=0, although for non-differentiable losses, this is anyway not needed.

In any case, there should be some functions to calculate this easily (error, edit_distance, etc), maybe as part of #38.

OrderedSet or IndexedSet

For some internal handling (storing the parents of a module), I need an ordered set to make the behavior deterministic.

Python does not have this as a builtin.

So our options:

  • Just do not use a set, but a dict instead, with dummy values. (And then, due to #43, it is ordered.) A minimal sketch of this is shown below.
  • Implement OrderedSet by ourselves here in returnn-common.
    • Such an implementation could use dict (due to #43, no need for OrderedDict). Note that set does not have a deterministic order, in contrast to dict (since Python 3.6).
  • Implement it by ourselves in RETURNN (returnn.util.ordered_set or so) and use it here.
  • Use some external dependency (should be recent and used by many people):

Suggestions? Opinions?
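The sketch referenced in the first option above, i.e. emulating an ordered set with a dict whose values are unused ("mod_a"/"mod_b" are just placeholders):

parents = {}               # behaves like an ordered set
parents["mod_a"] = None    # "add"
parents["mod_b"] = None
assert "mod_a" in parents  # "contains"
ordered = list(parents)    # iteration follows insertion order (guaranteed since Python 3.7)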

Another aspect, not really relevant for my use case currently, but maybe it will be: __eq__ on OrderedDict (and also on Boltons IndexedSet) is usually defined differently from dict.__eq__/set.__eq__, such that two objects are only equal when they also have the same order. I also see use cases where we want to keep the original dict.__eq__/set.__eq__ logic.

Consistent variable name scopes for modules

The TF variable name scopes in RETURNN are determined by the layer names and layer hierarchy.

In PyTorch, the model variable names are determined by the module hierarchy (i.e. the attrib names).

When we use the module concept here to define some model, the mapping to RETURNN layers might not always yield the same variable name scopes.
Consider some code like this:

if cfg_option:
  y = mod(x)
else:
  with Ctx():
    y = mod(x)

Maybe Ctx is Cond (#24) or Loop (#16).
Depending on how Ctx works, the absolute layer name of mod might differ based on whether cfg_option is enabled.

Originally, I thought this would not be a problem.
However, when you save the model checkpoint with cfg_option disabled, and then later want to load the model with cfg_option enabled, I think the user would expect this to work. And this requires that the variables to match.

So much for the problem.
On possible solutions:

I think it is not possible in general to always create the RETURNN layer hierarchy such that it matches. Depending on what Ctx is, it needs to be wrapped inside another layer (e.g. RecLayer). If mod is some Linear instance, this would yield different variable names.

One potential solution is when we allow to define a custom TF name (variable) scope for a layer. Then in the second case, RecLayer can specify to not consume any new TF name scope (i.e. flat), and then it would work.

RecUnit additional parameters

Currently the class Rec is missing the possibility to add additional parameters like max_seq_len, I think. Maybe by modifying _make_layer_dict_from_subnet_ctx and adding an extra dict for additional params?

Better way to define main network dict (`make_root_net_dict`)

Currently, when you look at the examples (test cases), you see mostly such code:

class Model(rc.nn.Module):
  def __init__(self):
    super().__init__()
    self.linear = rc.nn.Linear(n_out=13)

  def forward(self) -> rc.nn.LayerRef:
    x = rc.nn.get_extern_data("data")
    x = self.linear(x)
    return x

model = Model()
net_dict = model.make_root_net_dict()

I don't like this too much, as Model does not get any inputs, but uses rc.nn.get_extern_data instead. This makes the code of Model not easily reusable in other context.

An alternative currently is also this variant:

linear = Linear(n_out=13)

with NameCtx.new_root() as name_ctx:
  out = linear(get_extern_data("data"))
  name_ctx.make_default_output(out)

  net_dict = name_ctx.make_net_dict()

This is equivalent.

But now you have not created any Model at all, so you also cannot easily reuse this code as a module or building block in some other context. So this is also not a good solution.

What I want somehow is this definition of the model:

class Model(rc.nn.Module):
  def __init__(self):
    super().__init__()
    self.linear = rc.nn.Linear(n_out=13)

  def forward(self, x: rc.nn.LayerRef) -> rc.nn.LayerRef:
    return self.linear(x)

model = Model()

But then, where do you get the extern data in, or rather, how do you connect the extern data to the inputs of model? And how do you get the net dict in the end?

Of course, you could do this now:

with NameCtx.new_root() as name_ctx:
  out = model(get_extern_data("data"))
  name_ctx.make_default_output(out)

  net_dict = name_ctx.make_net_dict()

However, that might not be the behavior you want, as you would now get one big subnetwork named Model. But maybe you want linear to be directly a layer in the net dict, and not just inside some subnetwork.

You can also already do this:

with NameCtx.new_root() as name_ctx:
  out = model(get_extern_data("data"), name=name_ctx)
  name_ctx.make_default_output(out)

  net_dict = name_ctx.make_net_dict()

With name=name_ctx, you explicitly tell it to use the root name scope as the name, which has the effect that it will not become a subnetwork.

But we should maybe also introduce a better make_root_net_dict. Maybe like:

net_dict = make_root_net_dict(model, x="data")

Other suggestions?

Enforce dim tags to be unique?

Related is #17 on how we handle dim tags in general here.

This needs rwth-i6/returnn#632 on the RETURNN side, which would maybe be a flag like behavior_unique_dim_tags = True.

We could just always enable this flag here. See the discussion in rwth-i6/returnn#632 why this might be useful.

I think this also would imply that we always need to have dim tags explicitly (so never just an integer for n_out) but I'm not exactly sure. This is again #17.

Tensorflow dependency

The code uses from tensorflow.python.util import nest. Is there a possibility to include this code standalone in the repository without depending on TensorFlow? I think this is a very heavy dependency, and it causes the following two issues for me:

  • The loading time of the sisyphus manager is increased by 4 seconds just with the tensorflow import, which is not nice for a tool you start often and interactively.
  • I now have a numpy version conflict. I usually keep the RETURNN runtime environment completely separated from anything in Sisyphus, but have one library (for G2P) that needs numpy > 1.20 while TensorFlow needs < 1.20. Of course I could create yet another environment and call this externally, or maybe TensorFlow 2.6 (which would be fine for just loading nest) does not have this dependency issue anymore (2.5 still has it).

Nevertheless, the first issue is already big enough for me to search for a solution.

Make shape and dims available?

Currently, given some LayerRef (which can be thought of as a normal tensor, when comparing our code conceptually to PyTorch or TensorFlow), we cannot really get any information about it (besides the layer ref itself), such as its dtype or shape.

It would not be too hard to make this available, though, because all RETURNN layers have get_out_data_from_opts, where this can be inferred. And get_out_data_from_opts by design does not involve any TF operations, nor does it add anything to the currently active TF computation graph. (There are some smaller technical things to be considered with this approach, but they are all easily solvable.)

Having this would allow to more easily check e.g. if the input is sparse (maybe for #38), maybe reusing or checking for specific dim tags, etc.

We can either make just the RETURNN Data instance available as-is (LayerRef.data), or provide some simpler wrappers, like LayerRef.shape = data.dim_tags (or, with #48, more like LayerRef.shape = set(data.dim_tags)) or so.

This is also very related to #17.

Calling Module twice results in error

Hey, I am writing this issue to clarify whether this is indeed intended behavior (and maybe the error message should be clarified or caught somewhere else) or whether this is actually not correct:
Building a Module and then calling it a second time (after initialization and the first call) results in an assertion error:
assert cur_scope_abs[0] is self_name_abs[0]  # same root
Steps to reproduce:

class TestBlock(Module):

  def __init__(self, l2=1e-07, dropout=0.5, n_out=128):
    super().__init__()
    self.linear = layers.Linear(n_out=n_out, l2=l2, dropout=dropout,  with_bias=False, activation=None)

  def forward(self, x: LayerRef, *args, **kwargs) -> LayerRef:
    x = self.linear(x, name="test")
    return x
x = TestBlock()
y = x("test")
z = x("test2")

The last line causes the assertion error. Right now I cannot say whether calling it twice actually makes sense; if not, maybe the message can be clarified. I caused the error by accident and went into debugging, thinking my class had an error.

How to define the API for parameter initialization, regularization (L2, weight dropout, etc), maybe updater opts per-param

It is maybe not such a nice idea if every new Module will have this option and then explicitly passes it on to all submodules. Some modules might also not implement this. And maybe there are other options which are not handled yet, e.g. param noise, etc.

How should this be handled? This mostly concerns arguments which every layer can potentially accept, mostly arguments about certain behavior of the parameters (variables).

In TF, the natural way would be to use a tf.variable_scope context and have some custom getter or other custom logic there.

So maybe also some context manager here?
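Purely as a hypothetical sketch of what a context-manager based API could look like; nothing like nn.param_options exists, and the names and arguments are made up for illustration only:

with nn.param_options(l2=1e-4, weight_dropout=0.1):
  # Modules created (or called) in this scope would pick up the options for their parameters.
  self.linear_hidden = nn.Linear(dim, hidden)
  self.linear_out = nn.Linear(hidden, dim)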

Losses to implement and losses naming conventions

Some modules we should implement:

  • CrossEntropy or CE. Should this cover both dense and sparse targets, or do we want a separate module for the sparse case, like SparseCrossEntropy or so? Should this potentially also allow for logits? log-probs?
  • KL or KullbackLeiblerDivergence
  • BinaryCrossEntropy or BCE
  • L2Dist (absolute or mean?) Or MSE or MeanSquaredError? (The mean reduction is over the feature axis. Not over time or batch.)
  • L1Dist (absolute or mean?) Or MeanL1Dist?
  • Ctc or CTC or CtcLogProb
  • CosineSimilarity

I don't like the naming of the PyTorch losses too much here.
They have the postfix Loss on all of them, although these modules are generic and not necessarily just for loss computation (although that's probably their most common usage).
Also CrossEntropyLoss is actually log-softmax + CE together. So very much like the TF tf.nn.sparse_softmax_cross_entropy_with_logits.
And there is a separate NLLLoss, which is just like CrossEntropyLoss but takes log-probs instead of logits. I find this naming confusing.

Also the question is how we should handle things like label smoothing. On RETURNN side (and also in TF), it is just an option to the CE-loss. On PyTorch side, it is not implemented yet as part of the official PyTorch API. Some background here. It was only very recently added (pytorch/pytorch#7455, pytorch/pytorch#63122). This also adds it as an option label_smoothing to CrossEntropyLoss. An alternative would be that the user makes this more explicit, like:

target_prob_smooth = smooth_one_hot(targets, label_prob=0.9)
loss = cross_entropy(target_prob_smooth, out_prob)

Although label smoothing has become very common, so maybe it makes sense to have this also just as an option.

Note also that the loss accumulation over the dataset and handling of calculating the correct average (mean) is handled by RETURNN. All such losses would just yield a vector of shape [B] or [B,T].

More explicit support for reinforcement learning

returnn-common should come with simple ways to use common building blocks / mechanisms of reinforcement learning, which can be useful in general.

This issue is supposed to be a collection of things we need. Although we probably should have individual issues for each individual feature.

(I consider this as resolved when we have some simple RL examples, which seems simple enough. E.g. producing some common actor critic training example on some common task. Basically reproducing some of the basic examples of other RL frameworks. Only when we have that, we know that returnn-common covers the basic needed utilities.)

How to define whether search (or train flag) is enabled?

How to define whether search is enabled? Just not do any special handling on this level at all and leave it to the current RETURNN behavior? Or make it more explicit?

From what I can see, the RETURNN handling should be enough. Assuming this is used with automation, changes in the config can be done elsewhere, I think.

Note that this is not just about what is enough to be able to define all networks. Or not sure how you mean it.

It's about being straightforward and clear, i.e. at no time should it be unclear to the user when search is used.

We do not have to follow exactly the behavior of RETURNN. There are also multiple ways in RETURNN. We can restrict it to one clean way. We can also change it. Or introduce a simpler variant here.

I'm tending to make it explicit. But not sure.

PyTorch also has a similar concept for the train flag (as we do have as well in RETURNN). I.e. some PyTorch modules behave differently depending if they are in train or eval mode (e.g. Dropout). We have exactly the same in RETURNN. And search is a flag like train.

The difference is how these flags are set:

  • In RETURNN, this is all global, and for the search flag, there are some additional (maybe unintuitive) ways to overwrite it. The flags are implied automatically in RETURNN, depending e.g. on the task, and the user does not have much control over them. It is quite hidden.

  • In PyTorch, there are no implicit automatic implied global flags. Every module has its own flag, and it is set explicitly (and easily recursively for all sub modules). Every module has always the train flag set initially, and you can disable it explicitly. So to the user, it's always clear how the flags are set, because the user sets them, and no automatic behavior. The user explicitly writes model.train() or model.eval().

Maybe again, here in returnn-common, we can follow the PyTorch style a bit for this, and also copy it for the search flag? Not sure...

Originally posted by @albertz in #16 (comment)

Kind parameter in Eval Layer

Currently the kind parameter for the Eval layer needs to be set to "eval" by hand, otherwise the super() call to the Combine layer is missing a parameter. Maybe it would be a good idea to add this as a default for the Eval layer? This should not change anything in any use case of the Eval layer.

Again: Does this require some special case handling (if so, again through modifying _init_args) or can we handle it in another way?

PositionalEncoding Layer missing parameters

When using PositionalEncoding, it is not possible to set n_out and out_type, as would be possible in RETURNN itself. Is this intended? If so, what would be the workaround?
