openai / sparse_attention

Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

Python 100.00%

sparse_attention's Issues

Questions about novelty

The paper is well written and achieves strong results on various datasets.
However, its novel contribution is unclear.
Q1: How is the Sparse Transformer (strided) different from local attention?
Q2: How is the Sparse Transformer (fixed) different from block self-attention (ICLR 2018, https://openreview.net/forum?id=H1cWzoxA-)?

Problem with reproducing "strided" attention scheme from the paper

Hi,
I am trying to visualize the attention schemes using this code, essentially reproducing Fig. 3 from the paper. I could reproduce the "fixed" attention scheme as shown below:

The problem is that I could not reproduce the "strided" scheme (Fig. 3b in the paper). All I get is the following, no matter what parameters I try:

If I change some code then I can get the correct "strided" version as shown in the paper. The following is after some code changes:

Did anyone face the same issue?
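
For comparison, the two patterns in Fig. 3 can be written down independently of this repository's code. The NumPy snippet below is a minimal sketch based on one reading of the pattern definitions in the paper, not on the blocksparse kernels. The function names `strided_masks` and `fixed_masks`, and the exact width of the local window, are illustrative assumptions.

```python
import numpy as np


def _positions(n):
    # Row index i is the query position, column index j is the key position.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return i, j


def strided_masks(n, stride):
    """Return (local, strided) boolean masks of shape (n, n).

    Entry [i, j] is True when query i may attend to key j.
    Assumption: the local head covers the `stride` most recent positions,
    including position i itself.
    """
    i, j = _positions(n)
    causal = j <= i
    local = causal & (i - j < stride)
    strided = causal & ((i - j) % stride == 0)
    return local, strided


def fixed_masks(n, stride, c=1):
    """Return (same_block, summary) boolean masks of shape (n, n).

    same_block: keys in the same block of length `stride` as the query.
    summary: keys in the last `c` columns of every block.
    """
    i, j = _positions(n)
    causal = j <= i
    same_block = causal & (i // stride == j // stride)
    summary = causal & (j % stride >= stride - c)
    return same_block, summary
```

Plotting each returned array as an image gives a picture that can be compared against the panels of Fig. 3. In this sketch the second strided head shows diagonal lines spaced `stride` apart, whereas the second fixed head shows vertical columns.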

Great work! but seems insufficient "related work"

See the title. DynamicConv has claimed state-of-the-art performance on many tasks (e.g., WMT14 En-De), but I find that DynamicConv is never mentioned in your paper.

Would your team be willing to run comparison experiments, as in issue #659 of the pytorch/fairseq repository?

can it be used for cpu?

Has anyone been able to reproduce the results for image generation?

It seems that the code for images is not provided, and in #7 it was mentioned that the strided attention is difficult to reproduce. I am wondering whether anyone has successfully reproduced the results for image generation.

a problem in running code

When I tried to run the code, the following error occurred:
Traceback (most recent call last):
  File "attention.py", line 4, in <module>
    from blocksparse import BlocksparseTransformer
  File "/home/user/anaconda3/lib/python3.7/site-packages/blocksparse/__init__.py", line 3, in <module>
    from blocksparse.utils import (
  File "/home/user/anaconda3/lib/python3.7/site-packages/blocksparse/utils.py", line 16, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/home/en/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libcudart.so.10.0: cannot open shared object file: No such file or directory
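
The last line of the traceback says the dynamic loader could not open libcudart.so.10.0, the CUDA 10.0 runtime library that blocksparse_ops.so was built against. A minimal sketch for checking this outside TensorFlow, assuming a Linux system, is shown below. The helper name `can_load_shared_library` is hypothetical and not part of blocksparse or TensorFlow.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open `name`, otherwise False."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library, or one of its dependencies, cannot be found.
        return False
    return True


if __name__ == "__main__":
    # The file name is taken from the error message above.
    print(can_load_shared_library("libcudart.so.10.0"))
```

If this prints False, the CUDA 10.0 runtime is either not installed or not on the loader's search path (for example, its directory is missing from LD_LIBRARY_PATH).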

Expected throughput?

Can you provide any insight into expected throughput, relative to a "base" transformer implementation?

I.e., if you consider two models with the same hidden size, number of layers, etc., will the sparse_attention version run significantly slower (and if so, presumably because of recomputation)?

Apologies if this was covered in the paper--I skimmed and didn't see it addressed.

Am considering getting this up and running--extremely interesting--but would like a sense on whether there is a major throughput hit before doing so.

Thank you--very neat to see successful evolution from https://openai.com/blog/block-sparse-gpu-kernels/.
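
As a rough point of reference only, the number of query-key pairs each scheme evaluates can be counted directly. The sketch below assumes a causal model and a strided pattern in which each query sees the `stride` most recent positions plus every `stride`-th earlier position. The function names are illustrative, and the counts say nothing about wall-clock throughput, which also depends on the kernels and on any recomputation.

```python
def dense_causal_pairs(n):
    """Query-key pairs evaluated by full causal attention over n positions."""
    return n * (n + 1) // 2


def strided_pairs(n, stride):
    """Query-key pairs in the union of the local and strided patterns.

    Assumption: the local window holds the `stride` most recent positions
    (including the query itself), and the strided set holds positions
    i - stride, i - 2 * stride, ... that are still >= 0.
    """
    total = 0
    for i in range(n):
        local = min(i + 1, stride)
        strided = i // stride
        total += local + strided
    return total


if __name__ == "__main__":
    n, stride = 1024, 32  # stride chosen close to sqrt(n)
    print(dense_causal_pairs(n), strided_pairs(n, stride))
```

With the stride near the square root of the sequence length, the sparse count grows roughly as n times sqrt(n) rather than n squared, which is the scaling argument made in the paper.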


PyTorch Implementation

version of TensorFlow and python

What is the LICENSE?
