
EBERT

This repository serves as the official code release of the paper EBERT: Efficient BERT Inference with Dynamic Structured Pruning (published at Findings of ACL 2021).

EBERT is a dynamic structured pruning algorithm for efficient BERT inference. Unlike previous methods that prune the model weights statically, so that the same pruned network is used for every input, EBERT dynamically determines and prunes the unimportant heads in the multi-head self-attention layers and the unimportant structured computations in the feed-forward networks for each input sample at run-time.
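For intuition, below is a minimal, self-contained sketch of per-sample structured pruning in a BERT-style FFN block: a small predictor scores the intermediate channels from the pooled input, and only the top-scoring channels are kept. The DynamicFFN class, the mean pooling, and the top-k thresholding are illustrative assumptions for this sketch, not the exact mechanism used in EBERT.

import torch
import torch.nn as nn

class DynamicFFN(nn.Module):
    """Illustrative per-sample structured pruning of a BERT-style FFN.
    A sketch of the general idea, not the EBERT implementation."""

    def __init__(self, hidden=768, inter=3072, keep_ratio=0.5):
        super().__init__()
        self.fc1 = nn.Linear(hidden, inter)
        self.fc2 = nn.Linear(inter, hidden)
        self.predictor = nn.Linear(hidden, inter)  # scores each intermediate channel
        self.keep = int(inter * keep_ratio)

    def forward(self, x):                              # x: (batch, seq_len, hidden)
        pooled = x.mean(dim=1)                         # per-sample summary of the input
        scores = self.predictor(pooled)                # (batch, inter)
        idx = scores.topk(self.keep, dim=-1).indices   # keep top-k channels per sample
        mask = torch.zeros_like(scores).scatter_(1, idx, 1.0)
        h = torch.relu(self.fc1(x)) * mask.unsqueeze(1)  # zero out pruned channels
        return self.fc2(h)

ffn = DynamicFFN()
out = ffn(torch.randn(2, 128, 768))                    # -> (2, 128, 768)

The same idea applies to the attention layers, where the mask is predicted per head rather than per channel.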

Prerequisites

The code has the following dependencies:

  • python >= 3.8.5
  • pytorch >= 1.4.0
  • transformers == 3.3.1
    Note: transformers v3.3.1 has a bug when the evaluation strategy is epoch, so you need to make the following change in the transformers library:
--- a/src/transformers/training_args.py
+++ b/src/transformers/training_args.py
@@ -323,7 +323,7 @@ class TrainingArguments:
     def __post_init__(self):
         if self.disable_tqdm is None:
             self.disable_tqdm = logger.getEffectiveLevel() > logging.WARN
-        if self.evaluate_during_training is not None:
+        if self.evaluate_during_training:
             self.evaluation_strategy = (
                 EvaluationStrategy.STEPS if self.evaluate_during_training else EvaluationStrategy.NO
             )

Usage

We provide script files for training and validation in the scripts folder; you can run them from the repo root, e.g. bash scripts/eval_glue.sh. In each script, there are several arguments to modify before running:

  • --data_dir: path to the dataset (GLUE or SQuAD)
  • MODEL_PATH or --model_name_or_path: path to the trained model folder
  • TASK_NAME: task name in GLUE (SST-2, MNLI, ...)
  • RUN_NAME: name of the current experiment, which determines the save path and the wandb log name
  • other hyper-parameters, e.g., head_mask_mode

You can download the original pretrained BERT and RoBERTa models from HuggingFace.
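For example, the checkpoints can be fetched with the transformers library; the local save directory below is just an example name for the folder that MODEL_PATH or --model_name_or_path would point to.

from transformers import BertModel, BertTokenizer

# Download the pretrained BERT-base checkpoint and save it locally.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./bert-base-uncased")
model.save_pretrained("./bert-base-uncased")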

Citation

If you find this library useful for your work, please cite:

@inproceedings{liu-etal-2021-ebert,
    title = "{EBERT}: Efficient {BERT} Inference with Dynamic Structured Pruning",
    author = "Liu, Zejian  and
              Li, Fanrong  and
              Li, Gang  and
              Cheng, Jian",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.425",
    doi = "10.18653/v1/2021.findings-acl.425",
    pages = "4814--4823",
}

Issues

A question about inference

Hi, thank you for your impressive work :)
As mentioned in your paper, "For MHA, heads with mask ’0’ will not be executed. For FFN, as matrix-matrix multiplication can be transformed to multiple matrix vector multiplications, we only need to complete part of computations where vector’s mask is not zero."
However, it seems that in modeling_ebert.py you simply multiply the mask with the hidden states or attention probs, so the computation is not actually reduced and the inference FLOPs are only computed theoretically. Is that right?
But if you actually prune the channels and heads, the feature dimension (e.g. 768) of the hidden states would shrink, causing a mismatch with the linear layers (e.g. in the FFN 768 -> 3072 -> 768 the weight matrix is (3072, 768), so if the intermediate dim < 3072 the multiplication is invalid). How did you deal with this mismatch?
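A small standalone sketch (not code from this repository) of the FLOP-saving argument in the quoted sentence: index-selecting the kept rows of the first FFN weight and the kept columns of the second is equivalent to masking the full intermediate activation, and the output dimension stays 768, so no shape mismatch arises downstream. The mask m and the shapes here are toy values for illustration.

import torch

hidden, inter = 768, 3072
x = torch.randn(1, 128, hidden)            # (batch, seq_len, hidden)
W1 = torch.randn(inter, hidden)            # first FFN weight, 768 -> 3072
W2 = torch.randn(hidden, inter)            # second FFN weight, 3072 -> 768
m = torch.rand(inter) > 0.5                # toy per-sample binary mask over channels

idx = m.nonzero(as_tuple=True)[0]          # indices of kept intermediate channels
h = torch.relu(x @ W1[idx].T)              # (1, 128, kept): only kept rows of W1 are used
y = h @ W2[:, idx].T                       # (1, 128, 768): only kept columns of W2 are used

# Equivalent to masking the full intermediate activation, but with fewer FLOPs;
# the output dimension remains 768, so the following layers are unaffected.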

code

Hello, when will you release all the code?
