
tatp22 / linformer-pytorch

My take on a practical implementation of Linformer for PyTorch.

Home Page: https://arxiv.org/pdf/2006.04768.pdf

License: MIT License

Python 100.00%
artificial-intelligence deep-learning attention-mechanism pytorch machine-learning linformer paper

linformer-pytorch's Introduction

Peter (tatp22)


Currently working with PyTorch.

If you want to contact me, send an email to the address on my profile.

I motorbiked through Vietnam 🇻🇳 recently and wrote about it here.


linformer-pytorch's Issues

padding mask and attention mask

Hi there,

You've done a great job, and thanks for sharing. I'm wondering how you deal with masking in Linformer, since the attention matrix, keys, and values now have shape (n, k) instead of (n, n). I didn't find this in the code. Thanks for your time!
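
(For readers hitting the same question: below is a minimal sketch, not this repo's API, of one common workaround: zero out padded positions in the keys/values before the length-wise projection, so padding cannot leak into the k compressed positions. The pad_mask name and shapes are assumptions.)

import torch

# Hypothetical shapes: keys (batch, n, d), pad_mask (batch, n) with 1 for real tokens
# and 0 for padding. E projects the length dimension n down to k, as in Linformer.
batch, n, d, k = 2, 8, 4, 3
keys = torch.randn(batch, n, d)
pad_mask = torch.ones(batch, n)
pad_mask[:, 6:] = 0                                     # pretend the last two positions are padding

E = torch.randn(n, k)                                   # length-wise projection matrix

keys = keys * pad_mask.unsqueeze(-1)                    # zero out padded positions first
keys_projected = torch.einsum("bnd,nk->bkd", keys, E)   # (batch, k, d); padding contributes nothing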

Error when using method="no_params" and GPU, because E and F incorrectly remain on CPU

When you create a Linformer() with method="no_params" and then load the model on your cuda device, you will get an error when trying to use the model. This is because the E and F matrices in the model accidentally remain on the CPU. When you call forward with an input, you will get an error at some point because the attention heads are trying to multiply the E matrix on CPU with another matrix on the GPU.

Basically, when you call Linformer().cuda() under this situation, the E and F matrices are not moved to the GPU.
(From what I've read so far, for them to be moved to the GPU by a cuda() call, you also need to assign E to self.E in Linformer. However, I think this still doesn't fix it because of the lambdas in your Linformer initializer: the cuda() call seemingly can't track down the E and F inside the MHAttention objects created in the lambda.)

My temporary fix is changing E_proj = get_EF(input_size, dim_k, method, head_dim) in the __init__ of Linformer to E_proj = get_EF(input_size, dim_k, method, head_dim).cuda(), but I think this would raise an error if you do not have a GPU installed.
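
(A device-agnostic alternative, offered as a sketch rather than this repo's code: register the fixed projection as a buffer on an nn.Module, so that .cuda()/.to(device) moves it along with the rest of the model and it is saved in the state_dict.)

import torch
import torch.nn as nn

class ProjectionHolder(nn.Module):
    # Sketch: hold the fixed E (or F) projection as a buffer so .to()/.cuda()
    # moves it with the model, while keeping it untrained ("no_params"-style).
    def __init__(self, input_size, dim_k):
        super().__init__()
        self.register_buffer("E", torch.randn(input_size, dim_k))

    def forward(self, keys):  # keys: (batch, input_size, head_dim)
        return torch.einsum("bnd,nk->bkd", keys, self.E)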

Any results on any benchmark?

Hi, thanks for sharing the implementation. Could you please share some reproduction results, possibly on some benchmarks?

Possible bug

I may have discovered a possible bug. When I run python pretrain_tutorial.py, I get some vertical lines when running the visualizer on the random data (Try it yourself, and run the trained model on a new random array).

In the extreme case, I get that all of the queries are attending to the same key. This leads to the effect that, for an input of size (batch_size, input_len, ch), every vector of length ch along the input_len axis has the same value. To put this in a concrete example, imagine a 32x32x3 picture (RGB) fed into the model with a batch size of 1. If the input is (1,32*32,3), every channel of the image will have the same value at every pixel; for example, the R channel might be 128 everywhere, the G channel 43, and the B channel 212. If the input is (1,3,32*32), each R, G, B channel will look like an image, but they will all have exactly the same pixel values.

The problem is being investigated. I believe it is a problem with the MHAttention layer, or possibly the positional embedding, but I cannot say for sure. Will update when a fix is found.
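
(A quick way to check for the collapse described above. This is only a sketch; `model` stands in for any trained model that maps (batch_size, input_len, ch) to the same shape.)

import torch

x = torch.randn(1, 32 * 32, 3)          # random input, as in the visualizer experiment
with torch.no_grad():
    y = model(x)                        # hypothetical trained model

# If every position carries the same ch-vector, the variance along the length axis is ~0.
print(y.var(dim=1))                     # near-zero values indicate the collapse described above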

Error with DistributedDataParallel

Hi, I'm trying to run Linformer training with DistributedDataParallel and I get this error:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/test_ddp_vanila_torch.py", line 71, in demo_basic
    outputs = ddp_model(torch.randint(20000, (3, 5120)))
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 364, in forward
    tensor = self.linformer(tensor, **kwargs)
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 321, in forward
    tensor = checkpoint(layer, tensor)
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 74, in forward
    outputs = run_function(*args)
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 61, in forward
    tensor = tensor + self.fn(tensor, **kwargs)
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 235, in forward
    head_outputs.append(checkpoint(head,Q,K,V))
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 163, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 74, in forward
    outputs = run_function(*args)
  File "/home/dbliznyuk/.conda/envs/test_crash/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dbliznyuk/test_crash_skript/linformer-pytorch/linformer_pytorch/linformer_pytorch.py", line 162, in forward
    P_bar = Q/torch.sqrt(torch.tensor(self.dim).type(Q.type()))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

It seems this error is connected to parameter sharing.

Code for reproducing

import os
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from linformer_pytorch import LinformerLM
from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
    
def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    
    model = LinformerLM(
            num_tokens=30522,  # Number of tokens in the LM
            input_size=5120,  # Dimension 1 of the input
            channels=128,  # Dimension 2 of the input
            dim_d=None,
            # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
            dim_k=128,  # The second dimension of the P_bar matrix from the paper
            dim_ff=128,  # Dimension in the feed forward network
            dropout_ff=0.15,  # Dropout for feed forward network
            nhead=16,  # Number of attention heads
            depth=12,  # How many times to run the model
            dropout=0.1,  # How much dropout to apply to P_bar after softmax
            activation="gelu",
            # What activation to use. Currently, only gelu and relu supported, and only on ff network.
            checkpoint_level="C2",  # What checkpoint level to use. For more information, see below.
            parameter_sharing="none",  # What level of parameter sharing to use. For more information, see below.
            k_reduce_by_layer=0,
            # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
            full_attention=False,
            # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
            include_ff=True,  # Whether or not to include the Feed Forward layer
            w_o_intermediate_dim=None,
            # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
            emb_dim=128,  # If you want the embedding dimension to be different than the channels for the Linformer
        ).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randint(20000, (3, 5120)))
    labels = torch.randint(20000, (3, 5120)).to(rank)
    loss_mx = labels != -100
    output = outputs[loss_mx].view(-1, 30522)
    labels = labels[loss_mx].view(-1)
    loss_fn(output, labels).backward()
    optimizer.step()

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)
    
if __name__ == "__main__":
    run_demo(demo_basic, 2)

Also, training with DataParallel works normally.

Composed linear layers?

Hey @tatp22 great repo!

I'm having trouble wrapping my head around the w_q, w_k, and w_v linear layers in the LinearAttentionHead module. Are they needed? There's no activation between the previous linear layers (to_q, to_k, to_v in MHAttention) and those weights, so they wouldn't add any expressivity to the model: you would just be multiplying two matrices together, which is equivalent to a single linear layer. The E and F projections also seem to be composed with w_k and w_v without a non-linearity.
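
(For reference, a small self-contained check of that point: two linear layers with no activation in between collapse to a single linear map. Biases are omitted for brevity.)

import torch
import torch.nn as nn

d = 8
to_q = nn.Linear(d, d, bias=False)      # first projection (as in MHAttention)
w_q = nn.Linear(d, d, bias=False)       # second projection (as in LinearAttentionHead)

x = torch.randn(2, 5, d)
composed = w_q(to_q(x))

# One linear layer whose weight is the product of the two weights gives the same result,
# so the composition adds no expressivity without a nonlinearity in between.
merged = nn.Linear(d, d, bias=False)
with torch.no_grad():
    merged.weight.copy_(w_q.weight @ to_q.weight)
assert torch.allclose(composed, merged(x), atol=1e-5)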

Looking at Eq. 7 from the paper, your implementation seems correct, though.

Any thoughts on this?

Enquiry about your implementation

Thanks for your great work!

I have a few enquiries about your implementation:

  1. Could you reproduce the paper's results (or something approximately similar) with your implementation?
  2. While an ordinary Transformer requires multiple GPUs to train from scratch, is it possible to train your Linformer implementation from scratch on a single GPU (8 GB / 11 GB)?

Thanks,
Alex Lau

Question: Is Linformer permutation equivariant (set-operation)?

Hi. Thanks for the wonderful implementation!

I was wondering if Linformer can be used with any unordered set of tensors (or is it just for sequence data?). Specifically, is Linformer permutation equivariant?

I'm looking to apply linear attention on points in 3d space (e.g. a point cloud with ~100k points). Would linformer attention be meaningful?

(I'm concerned about the n -> k projection, which, if I understand correctly, assumes the n points are in some fixed order.)

Thanks!
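
(For what it's worth, the length-wise projection on its own is sensitive to the ordering of the n positions, which a small standalone sketch can show. This is not this repo's API, just the bare projection.)

import torch

n, k, d = 6, 3, 4
E = torch.randn(n, k)                    # fixed n -> k projection along the length axis
x = torch.randn(n, d)

perm = torch.randperm(n)
out_original = torch.einsum("nd,nk->kd", x, E)        # project the original ordering
out_permuted = torch.einsum("nd,nk->kd", x[perm], E)  # project a permuted ordering

# These generally differ, because each of the k summaries is tied to specific positions,
# so the result depends on how the n points are ordered.
print(torch.allclose(out_original, out_permuted))     # almost surely False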

causal_mask of the decoder

Hi ,
You've done a great job, and thanks for sharing.
I don't understand the causal_mask of the decoder: the shape of the attention matrix is (n, k), so only the (k, k) part is masked. Does it work? Are there any test results on a language model?
Thanks for your time!
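
(To make the concern concrete, a toy-shape sketch: a standard causal mask is lower-triangular with shape (n, n), but after the Linformer projection the score matrix is (n, k), so that mask no longer lines up with the compressed key positions.)

import torch

n, k = 8, 3
causal_mask = torch.tril(torch.ones(n, n, dtype=torch.bool))   # usual (n, n) causal mask
scores = torch.randn(n, k)                                      # Linformer scores are (n, k)

# The (n, n) mask cannot be applied to the (n, k) scores directly, and masking only a
# (k, k) corner does not hide future tokens, because each of the k columns already
# mixes information from all n positions through the projection.
print(causal_mask.shape, scores.shape)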

Error with DistributedDataParallel and parameter_sharing="layerwise"

Hi, I'm trying to run Linformer training with DistributedDataParallel and parameter_sharing="layerwise", and I get this error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/jovyan/nlpdata/test_ddp_vanila_torch.py", line 95, in demo_basic
    loss_fn(output, labels).backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.3) Incorrect unused parameter detection. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. For unused parameters, DDP would not expect gradients from then. However, if an unused parameter becomes part of the autograd graph at a later point in time (e.g., in a reentrant backward when using `checkpoint`), the gradient will show up unexpectedly. If all parameters in the model participate in the backward pass, you can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.
Exception raised from mark_variable_ready at ../torch/csrc/distributed/c10d/reducer.cpp:484 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f62b61fd99b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::mark_variable_ready(c10d::Reducer::VariableIndex) + 0xbe7 (0x7f62ef7edac7 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: c10d::Reducer::autograd_hook(c10d::Reducer::VariableIndex) + 0x93 (0x7f62ef7ede23 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0xad2006 (0x7f62ef7ee006 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0xad902a (0x7f62ef7f502a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x4f9 (0x7f62ea50b889 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x4b4 (0x7f62ea50d3f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x33c (0x7f62ea50aa1c in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x4c (0x7f62ef2495bc in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #9: torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x82f (0x7f62ea509d5f in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x74 (0x7f62ef2492f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: THPEngine_run_backward(THPEngine*, _object*, _object*) + 0xa10 (0x7f62ef24a070 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #12: _PyCFunction_FastCallDict + 0x154 (0x5572c4395304 in /opt/conda/bin/python)
frame #13: _PyCFunction_FastCallKeywords + 0x50 (0x5572c43c1cd0 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x199b0c (0x5572c441cb0c in /opt/conda/bin/python)
frame #15: _PyEval_EvalFrameDefault + 0x10c9 (0x5572c44405d9 in /opt/conda/bin/python)
frame #16: <unknown function> + 0x192f26 (0x5572c4415f26 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x193f31 (0x5572c4416f31 in /opt/conda/bin/python)
frame #18: <unknown function> + 0x199be5 (0x5572c441cbe5 in /opt/conda/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x30a (0x5572c443f81a in /opt/conda/bin/python)
frame #20: PyEval_EvalCodeEx + 0x329 (0x5572c4417a49 in /opt/conda/bin/python)
frame #21: <unknown function> + 0x195864 (0x5572c4418864 in /opt/conda/bin/python)
frame #22: PyObject_Call + 0x3e (0x5572c439510e in /opt/conda/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x1aaf (0x5572c4440fbf in /opt/conda/bin/python)
frame #24: <unknown function> + 0x192f26 (0x5572c4415f26 in /opt/conda/bin/python)
frame #25: _PyFunction_FastCallDict + 0x1be (0x5572c441740e in /opt/conda/bin/python)
frame #26: _PyObject_FastCallDict + 0x26f (0x5572c43956cf in /opt/conda/bin/python)
frame #27: _PyObject_Call_Prepend + 0x63 (0x5572c439a143 in /opt/conda/bin/python)
frame #28: PyObject_Call + 0x3e (0x5572c439510e in /opt/conda/bin/python)
frame #29: torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x193 (0x7f62ef2519f3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #30: <unknown function> + 0x29d82c5 (0x7f62ea5112c5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #31: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x14a8 (0x7f62ea50c838 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #32: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x4b4 (0x7f62ea50d3f4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #33: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x99 (0x7f62ea504ec9 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #34: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x5a (0x7f62ef24905a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #35: <unknown function> + 0xbd6df (0x7f62fb49b6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #36: <unknown function> + 0x76db (0x7f6318d876db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #37: clone + 0x3f (0x7f6318ab0a3f in /lib/x86_64-linux-gnu/libc.so.6)

Code for reproducing

import os
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from linformer_pytorch import LinformerLM
from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
    
def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    
    model = LinformerLM(
            num_tokens=30522,  # Number of tokens in the LM
            input_size=5120,  # Dimension 1 of the input
            channels=128,  # Dimension 2 of the input
            dim_d=None,
            # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
            dim_k=128,  # The second dimension of the P_bar matrix from the paper
            dim_ff=128,  # Dimension in the feed forward network
            dropout_ff=0.15,  # Dropout for feed forward network
            nhead=16,  # Number of attention heads
            depth=12,  # How many times to run the model
            dropout=0.1,  # How much dropout to apply to P_bar after softmax
            activation="gelu",
            # What activation to use. Currently, only gelu and relu supported, and only on ff network.
            checkpoint_level="C2",  # What checkpoint level to use. For more information, see below.
            parameter_sharing="layerwise",  # What level of parameter sharing to use. For more information, see below.
            k_reduce_by_layer=0,
            # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
            full_attention=False,
            # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
            include_ff=True,  # Whether or not to include the Feed Forward layer
            w_o_intermediate_dim=None,
            # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
            emb_dim=128,  # If you want the embedding dimension to be different than the channels for the Linformer
        ).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randint(20000, (3, 5120)))
    labels = torch.randint(20000, (3, 5120)).to(rank)
    loss_mx = labels != -100
    output = outputs[loss_mx].view(-1, 30522)
    labels = labels[loss_mx].view(-1)
    loss_fn(output, labels).backward()
    optimizer.step()

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)
    
if __name__ == "__main__":
    run_demo(demo_basic, 2)

Also, this issue reproduces with any parameter sharing setting other than "none".

Huggingface

Could you integrate this model into the Hugging Face Transformers repository?
It would be a great addition there.

Any performance test on different checkpoint levels?

Hello,
Thanks for the code! I am testing your code with different checkpoint levels. I see a massive drop in required GPU memory if I use "C1" or "C2" (about 50% in my case). It is odd that both C1 and C2 report the same allocated memory. So my first question is: what is the difference between C1 and C2?

From what I can tell from the checkpoint function, increasing the checkpoint level only affects the backward pass. So another question is: does it hurt the overall performance if we use C2 instead of C0?
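
(For reference, a minimal sketch of what gradient checkpointing does in general, independent of this repo's C0/C1/C2 levels: the checkpointed block's activations are not stored during the forward pass and are recomputed during backward, so forward outputs are identical and only training-time memory/compute changes.)

import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(64, 64)
x = torch.randn(8, 64, requires_grad=True)

out_plain = layer(x)
out_ckpt = checkpoint(layer, x)              # same values; activations recomputed in backward

print(torch.allclose(out_plain, out_ckpt))   # True: checkpointing does not change the output
out_ckpt.sum().backward()                    # the extra forward recomputation happens here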

input seq length

Great work!
I noticed the Linformer input is (batch_size, seq_len, channels). Can seq_len be variable, or should the attention be masked if the sequence is padded? Why is seq_len a fixed length?
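
(In case it helps other readers: one common workaround for the fixed length is to pad each sequence up to input_size with a padding token. The sketch below is generic PyTorch; the pad_id and whether the repo masks padding are assumptions, not guarantees.)

import torch

input_size = 16        # the fixed seq_len the model was built with
pad_id = 0             # hypothetical padding token id

seqs = [torch.tensor([5, 3, 9]), torch.tensor([7, 2, 2, 8, 1])]   # variable-length sequences
batch = torch.full((len(seqs), input_size), pad_id, dtype=torch.long)
for i, s in enumerate(seqs):
    batch[i, : len(s)] = s   # left-align each sequence, padding on the right

print(batch.shape)           # (2, 16): a fixed-length batch for a model that expects seq_len=input_size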

Loss goes to 0 when using LinformerLM

Hi, I used the LinformerLM class with causal=True to do some language modelling. However, there seems to be some leakage, as the loss goes to 0 after 1 epoch. Or am I using it incorrectly? Thank you.

These are my settings

model = LinformerLM(
        num_tokens=ntoken, # Number of tokens in the LM
        input_size=args.seq_len, # Dimension 1 of the input
        channels=args.embsize, # Dimension 2 of the input
        dim_d=None, # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
        dim_k=16, # The second dimension of the P_bar matrix from the paper
        dim_ff=args.nhid, # Dimension in the feed forward network
        dropout_ff=args.dropout, # Dropout for feed forward network
        nhead=8, # Number of attention heads
        depth=12, # How many times to run the model
        dropout=args.dropout, # How much dropout to apply to P_bar after softmax
        activation="relu", # What activation to use. Currently, only gelu and relu supported, and only on ff network.
        checkpoint_level="C0", # What checkpoint level to use. For more information, see below.
        parameter_sharing="none", # What level of parameter sharing to use. For more information, see below.
        k_reduce_by_layer=0, # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
        full_attention=False, # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
        include_ff=True, # Whether or not to include the Feed Forward layer
        w_o_intermediate_dim=None, # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
        emb_dim=None, # If you want the embedding dimension to be different than the channels for the Linformer
        causal=True, # If you want this to be a causal Linformer, where the upper right of the P_bar matrix is masked out.
        method="learnable", # The method of how to perform the projection. Supported methods are 'convolution', 'learnable', and 'no_params'
        ff_intermediate=None, # See the section below for more information
        )
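
(One thing worth checking, offered as a sketch rather than a statement about this repo's internals: for causal language modelling the targets are usually the inputs shifted by one position. If the model is instead trained to predict the unshifted token at each position, the loss can collapse toward zero because the answer is available at the same position.)

import torch
import torch.nn as nn

# tokens: (batch, seq_len) integer ids; logits: (batch, seq_len, vocab) from the model.
tokens = torch.randint(0, 100, (4, 32))
logits = torch.randn(4, 32, 100)

# Predict token t+1 from positions <= t: drop the last logit and the first target.
shift_logits = logits[:, :-1, :].reshape(-1, 100)
shift_labels = tokens[:, 1:].reshape(-1)
loss = nn.CrossEntropyLoss()(shift_logits, shift_labels)
print(loss)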

Different number of tokens and Character Level Modeling

Hi
Thank you for the open-source code. I have been using Transformers for a while now, and I generally use them for character-level modeling, that is, translation between two different languages. I was wondering if you could answer the following questions:

1- Can I use a different number of tokens for the encoder and decoder? Two different languages will have different token sets.
2- I can probably use your code for character-level modeling; at what point should I split the input stream of string tokens into characters? Any particular module you can point me to? (See the sketch after this list.)
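
(A sketch of the character-level side of question 2: tokenization happens before the model sees anything, so splitting into characters is just a matter of building a character vocabulary. The mapping below is an assumption for illustration, not part of this repo.)

# Build a character-level vocabulary and encode a string into integer ids.
text = "hello world"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
ids = [vocab[ch] for ch in text]

print(vocab)   # e.g. {' ': 0, 'd': 1, 'e': 2, ...}
print(ids)     # integer ids, ready to be padded/batched and fed to the model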

I hope I am not asking for much :)

Thank you!
