openai / sparse_attention

Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"


sparse_attention's Introduction

Status: Archive (code is provided as-is, no updates expected)

Update August 2020: For an example repository that achieves state-of-the-art modeling performance on CIFAR-10 using Sparse Transformers, please see https://github.com/openai/distribution_augmentation

Sparse Attention

This repository contains the sparse attention primitives used in Sparse Transformers (see blog and paper). Specifically, it includes the following:

A faster implementation of normal attention (the upper triangle is not computed, and many operations are fused).
An implementation of "strided" and "fixed" attention, as in the Sparse Transformers paper.
A simple recompute decorator, which can be adapted for usage with attention.

We hope this code can further accelerate research into sparse attention.

An example Transformer implementation which is close to the version we use internally can be found at https://github.com/openai/blocksparse/blob/master/examples/transformer/enwik8.py.

Overview of kernels

The repository contains fused implementations of the attention operation, which takes in Q, K, V matrices (all of dimensionality batch, time, dim) representing the queries, keys, and values for a sequence. For every query element, a weighted sum of the values is returned, where the weightings are determined by the scaled matrix product of Q and K^T.
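
The operation described above can be written as a plain, unfused reference. What follows is a minimal NumPy sketch, not the repository's kernel: the function name `dense_causal_attention` is hypothetical, it handles a single head only, and it assumes the causal (lower-triangular) masking implied by "the upper triangle is not computed".

```python
import numpy as np

def dense_causal_attention(q, k, v):
    """Unfused reference attention for q, k, v of shape [batch, time, dim].

    Hypothetical helper for illustration; single head, causal masking assumed.
    """
    time, dim = q.shape[1], q.shape[2]
    # Scaled matrix product of Q and K^T: shape [batch, time, time].
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(dim)
    # Causal mask: query i may only attend to keys j <= i.
    causal = np.tril(np.ones((time, time), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values for every query element.
    return weights @ v, weights
```

A dense reference like this is mainly useful for checking the output of a fused or sparse kernel on small inputs.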

The kernels allow specification of block sparsity in the QK^T matrix. This means you define a pattern of 0/1s on a [time/blocksize, time/blocksize] matrix of blocks, and the values where it is 0 will not be computed, and not be included in the softmax calculation. Additionally, one can define "callbacks" on the computed blocks, which will further mask out values in any given block from the softmax (though the matrix product will still be computed for those elements).
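
To make the block layout concrete, here is a small NumPy sketch of a causal layout and of the extra masking that a callback would apply inside the diagonal blocks. All function names are hypothetical and this is not the blocksparse API; it only illustrates how a [time/blocksize, time/blocksize] pattern of 0/1s relates to the element-level mask.

```python
import numpy as np

def causal_block_layout(time, blocksize):
    """0/1 layout of shape [time/blocksize, time/blocksize]; 1 means the block is computed."""
    assert time % blocksize == 0, "time must be a multiple of blocksize"
    n_blocks = time // blocksize
    return np.tril(np.ones((n_blocks, n_blocks), dtype=np.int32))

def expand_layout(layout, blocksize):
    """Expand a block layout into an element-level [time, time] boolean mask."""
    block = np.ones((blocksize, blocksize), dtype=np.int32)
    return np.kron(layout, block).astype(bool)

def apply_diagonal_callback(mask, layout, blocksize):
    """Stand-in for a callback: mask the upper triangle inside each diagonal block."""
    out = mask.copy()
    in_block = np.tril(np.ones((blocksize, blocksize), dtype=bool))
    for b in range(layout.shape[0]):
        s = slice(b * blocksize, (b + 1) * blocksize)
        out[s, s] &= in_block
    return out

layout = causal_block_layout(time=64, blocksize=16)
mask = apply_diagonal_callback(expand_layout(layout, 16), layout, 16)
# Fraction of blocks whose matrix product is actually computed.
computed_fraction = layout.mean()
```

In this sketch the layout decides which blocks are skipped entirely, while the callback only removes elements from the softmax inside blocks that are still computed, which mirrors the distinction drawn in the paragraph above.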

Block sizes of {8, 16, 32, 64} are supported, and slight advantages in speed may be seen from using larger blocks.

Prerequisites

For fp32 and blocksize 32, any NVIDIA GPU past Kepler can be used (i.e. compute capability beyond 3.5).

For fp16 and blocksize 8, 16, 32, 64, a GPU with Tensor Cores (e.g. the V100 GPU, compute capability >= 7.0) is required.

The primary dependency is the OpenAI blocksparse package.

With CUDA 10 and tensorflow-gpu, you can install blocksparse with pip install blocksparse.

For other setups, you must install blocksparse from source, and directions can be found in the root of the repository.

Examples

Run the following on a non-V100 GPU:

python attention.py

On a V100 GPU:

python attention.py fp16

General usage

An example can be found at the bottom of attention.py.

full_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="all", recompute=True)
full_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="all", recompute=True)

# first step of strided attention
local_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="local", local_attn_ctx=32, recompute=True)
local_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="local", local_attn_ctx=32, recompute=True)

# second step of strided attention
strided_attn_bs = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="strided", local_attn_ctx=32, recompute=True)
strided_attn_tf = attention_impl(q, k, v, heads=4, attn_mode="strided", local_attn_ctx=32, recompute=True)

# the 'fixed' attention pattern
fixed = blocksparse_attention_impl(q, k, v, heads=4, attn_mode="fixed", local_attn_ctx=128, num_verts=4, vertsize=1, recompute=True)
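
For readers who want to see which positions each mode attends to, the following NumPy sketch builds element-level masks following the definitions in the Sparse Transformers paper. It is an illustration under assumptions, not the code in attention.py: the function name `sparse_mask`, the mode names `fixed_block` and `fixed_summary`, and the parameter `c` (number of summary positions per block, as in the paper) are chosen here, and the local window is assumed to span `stride` positions including the current one.

```python
import numpy as np

def sparse_mask(n, mode, stride, c=1):
    """Boolean [n, n] mask; entry (i, j) is True if query i may attend to key j.

    Hypothetical helper; `stride` plays the role of local_attn_ctx.
    """
    q = np.arange(n)[:, None]  # query positions (rows)
    k = np.arange(n)[None, :]  # key positions (columns)
    causal = k <= q
    if mode == "all":
        return causal
    if mode == "local":
        # First step of strided attention: the most recent `stride` positions.
        return causal & (q - k < stride)
    if mode == "strided":
        # Second step of strided attention: every `stride`-th earlier position.
        return causal & ((q - k) % stride == 0)
    if mode == "fixed_block":
        # Fixed pattern, first head: positions within the same block.
        return causal & (q // stride == k // stride)
    if mode == "fixed_summary":
        # Fixed pattern, second head: the last `c` positions of every block.
        return causal & (k % stride >= stride - c)
    raise ValueError(f"unknown mode: {mode}")

# Positions visible to query 10 after combining both strided steps.
combined = sparse_mask(16, "local", 4) | sparse_mask(16, "strided", 4)
visible_to_10 = np.flatnonzero(combined[10])
```

Plotting such masks with any image viewer is a quick way to compare the patterns against Fig. 3 of the paper.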

Referencing this work

If you find this helpful in your work, you can consider citing the following:

@article{child2019sparsetransformer,
  title={Generating Long Sequences with Sparse Transformers},
  author={Child, Rewon and Gray, Scott and Radford, Alec and Sutskever, Ilya},
  journal={URL https://openai.com/blog/sparse-transformers},
  year={2019}
}

sparse_attention's Issues

version of TensorFlow and python

For Ubuntu 18.04 with CUDA 10.0, which versions of Python and TensorFlow are recommended?

Problem with reproducing "strided" attention scheme from the paper

Hi,
I am trying to visualize the attention schemes using this code, basically to reproduce Fig. 3 from the paper. I could reproduce the "fixed" attention scheme as shown below:

The problem is that I could not reproduce the "strided" scheme (Fig. 3(b) from the paper). All I get is the following, no matter what parameters I try:

If I change some code then I can get the correct "strided" version as shown in the paper. The following is after some code changes:

Did anyone face the same issue?

a problem in running code

When I tried to run the code the following error occurred:
Traceback (most recent call last):
  File "attention.py", line 4, in <module>
    from blocksparse import BlocksparseTransformer
  File "/home/user/anaconda3/lib/python3.7/site-packages/blocksparse/__init__.py", line 3, in <module>
    from blocksparse.utils import (
  File "/home/user/anaconda3/lib/python3.7/site-packages/blocksparse/utils.py", line 16, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/home/en/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libcudart.so.10.0: cannot open shared object file: No such file or directory

can it be used for cpu?

PyTorch Implementation

Is it possible to release a PyTorch implementation of the method?

Expected throughput?

Can you provide any insight into expected throughput, relative to a "base" transformer implementation?

I.e., if you consider two models with the same hidden size, number of layers, etc., will the sparse_attention version run significantly slower (and if so, presumably because of recompute)?

Apologies if this was covered in the paper--I skimmed and didn't see it addressed.

Am considering getting this up and running--extremely interesting--but would like a sense on whether there is a major throughput hit before doing so.

Thank you--very neat to see successful evolution from https://openai.com/blog/block-sparse-gpu-kernels/.

Has anyone been able to reproduce the results for image generation?

It seems that the code for images is not provided, and in #7 it was mentioned that the strided attention is difficult to reproduce. I am wondering whether anyone has successfully reproduced the results for image generation.

What is the LICENSE?

See title. The GPT-2 repo was MIT Licensed which was very helpful!

Great work! but seems insufficient "related work"

See title. As we all know, DynamicConv claims state-of-the-art performance on many tasks (e.g., WMT14 En-De), but I find that DynamicConv is never mentioned in your paper.

Would your team be willing to run comparison experiments, as in issue 659 in the pytorch/fairseq repository?

Questions about novelty

The paper is well written and achieves strong results on various datasets.
However, the novelty of the contribution is unclear.
Q1: How is the Sparse Transformer (strided) different from local attention?
Q2: How is the Sparse Transformer (fixed) different from block self-attention (ICLR 2018, https://openreview.net/forum?id=H1cWzoxA-)?