Multi-EURLEX

MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

This is the code used for the experiments described in the following paper:

I. Chalkidis, M. Fergadiotis, and I. Androutsopoulos, "MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer". Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 2021 (xxx)

Requirements:

  • tensorflow==2.3.1
  • tensorflow-addons==0.11.2
  • transformers==4.3.3
  • tokenizers==0.10.1
  • scipy==1.5.4
  • torch==1.7.1
  • tqdm==4.43.0
  • cudatoolkit==10.1.243 (for GPU acceleration)
  • cudnn==7.6.0 (for GPU acceleration)
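To confirm that an environment matches the pinned versions above, a small check along these lines may help. This is a sketch rather than part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and it only covers the pip-installable packages (cudatoolkit and cudnn are typically installed through conda and are not visible to `importlib.metadata`).

```python
from importlib import metadata

# Pins copied from the requirements list above (pip-installable packages only).
PINNED = {
    "tensorflow": "2.3.1",
    "tensorflow-addons": "0.11.2",
    "transformers": "4.3.3",
    "tokenizers": "0.10.1",
    "scipy": "1.5.4",
    "torch": "1.7.1",
    "tqdm": "4.43.0",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a tuple (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.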

Quick start:

Install python requirements:

pip install -r requirements.txt

Download dataset (MultiEURLEX):

The dataset is hosted and described in detail on Hugging Face Datasets (https://huggingface.co/datasets/multi_eurlex). The Trainer downloads and uses it automatically. If you want to review and familiarize yourself with the dataset, you can download it using the following Python code:

from datasets import load_dataset
dataset = load_dataset('multi_eurlex', languages='all_languages')

Train a model:

The following configuration (command-line) arguments can be used:

  • 'bert_path' (default='xlm-roberta-base'): The name of the pretrained transformer-based model hosted by Hugging Face, or the full path to a local directory.
  • 'native_bert' (default=False): If the ISO code of a language (e.g., 'en') is provided, then the relevant monolingual model will be fine-tuned.
  • 'multilingual_train' (default=False): If True, the model will be trained across multiple languages ('train_langs').
  • 'use_adapters' (default=False): If True, the model will be fine-tuned using Adapter modules (Houlsby et al., 2019).
  • 'use_ln' (default=False): If True, only the parameters of the LayerNorm layers of the model will be fine-tuned.
  • 'bottleneck_size' (default=256): The size of the bottleneck layer in Adapter modules (if used).
  • 'n_frozen_layers' (default=0): The number of initial layers that will remain frozen during fine-tuning.
  • 'epochs' (default=70): The maximum number of training epochs (early stopping with patience 5 is used by default).
  • 'batch_size' (default=8): The number of samples in a single batch.
  • 'learning_rate' (default=3e-5): The initial learning rate to be used by the Adam optimizer.
  • 'label_smoothing' (default=0.0): The rate of label smoothing (Szegedy et al., 2016).
  • 'max_document_length' (default=512): The maximum number of tokens to be considered per document.
  • 'monitor' (default='val_rp'): The score to be monitored for early stopping ('val_rp' or 'val_loss').
  • 'train_lang' (default='en'): The ISO code of the training language (e.g., 'en') in a one-to-many setting.
  • 'train_langs' (default=['en']): The list of languages to be used for fine-tuning, in a many-to-one setting.
  • 'eval_langs' (default='all'): The list of languages to be used for evaluation.
  • 'label_level' (default='level_3'): The level of EUROVOC (e.g., 'level_1', 'level_2', 'level_3', 'all') used for the classification task.
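To show how the arguments above and their defaults fit together, here is a minimal sketch that mirrors a subset of them with the standard library's argparse. It is an illustration only: trainer.py may define its options with a different library or different types, and the `str2bool` and `build_parser` helpers are hypothetical names. 'native_bert' and 'eval_langs' are left out because their value types are not fully specified above.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' style strings, since boolean flags are passed
    with an explicit value (e.g. --use_adapters True)."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Hypothetical parser reproducing the documented names and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the documented arguments")
    parser.add_argument("--bert_path", default="xlm-roberta-base")
    parser.add_argument("--multilingual_train", type=str2bool, default=False)
    parser.add_argument("--use_adapters", type=str2bool, default=False)
    parser.add_argument("--use_ln", type=str2bool, default=False)
    parser.add_argument("--bottleneck_size", type=int, default=256)
    parser.add_argument("--n_frozen_layers", type=int, default=0)
    parser.add_argument("--epochs", type=int, default=70)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--label_smoothing", type=float, default=0.0)
    parser.add_argument("--max_document_length", type=int, default=512)
    parser.add_argument("--monitor", choices=["val_rp", "val_loss"], default="val_rp")
    parser.add_argument("--train_lang", default="en")
    parser.add_argument("--train_langs", nargs="+", default=["en"])
    parser.add_argument(
        "--label_level",
        choices=["level_1", "level_2", "level_3", "all"],
        default="level_3",
    )
    return parser


if __name__ == "__main__":
    # Print the resolved configuration for whatever flags were passed.
    print(vars(build_parser().parse_args()))
```

Parsing an empty argument list yields the documented defaults, which makes it easy to see which settings a given command overrides.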

You can run experiments by simply calling:

python trainer.py --bert_path 'xlm-roberta-base' --use_adapters True --train_lang 'en' --label_level 'level_1'
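To repeat the command above across several settings, for example one run per EUROVOC label level, the command lines can be generated programmatically. The sketch below only builds and prints the commands. The `build_command` helper is a hypothetical name, and the sketch assumes every option is passed as a `--name value` pair, as in the example above.

```python
import shlex


def build_command(**options):
    """Return the argv list for one trainer.py run; each keyword
    argument becomes a '--name value' pair."""
    command = ["python", "trainer.py"]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


LABEL_LEVELS = ["level_1", "level_2", "level_3"]

# One command per label level, otherwise identical to the example above.
commands = [
    build_command(
        bert_path="xlm-roberta-base",
        use_adapters=True,
        train_lang="en",
        label_level=level,
    )
    for level in LABEL_LEVELS
]

for command in commands:
    print(shlex.join(command))
    # To launch a run instead of printing it:
    #   import subprocess; subprocess.run(command, check=True)
```

Keeping the commands as argument lists avoids shell quoting issues if they are later passed to `subprocess.run`.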

Credits

Thanks to @Essex97 for pointing out minor bugs in the codebase.
