
ir_efficiency

This repository contains the code used in the experiments for our paper "Moving Stuff Around: A study on the efficiency of moving documents into memory for Neural IR models", published at the first ReNeuIR workshop at SIGIR 2022.

You can find the paper here and an open Weights & Biases dashboard with the results here.

To re-run the experiments, first make sure you have CUDA installed on your machine (check here for instructions) and use the Pipfile to install the dependencies. We recommend using Pipenv to do so in a new virtual environment.
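A minimal setup sketch, assuming the Pipfile sits at the repository root (Pipenv resolves the dependencies from it):

```shell
# Sketch: set up the environment with Pipenv (assumes Python and pip are available).
pip install --user pipenv   # install Pipenv itself, if you don't have it yet
pipenv install              # create a virtual environment and install the Pipfile deps
pipenv shell                # activate the environment before running experiments
```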

To run an experiment using DataParallel (i.e., multithreading), call the main.py file like this:

python main.py --loader ir_datasets --parallel DataParallel --n_gpus 8 \
               --n_steps 1000 --learning_rate 1e-5 --base_model distilbert-base-uncased \
               --batch_per_gpu 8 --pin_memory --num_workers 8 --ramdisk

For an experiment using DistributedDataParallel (i.e., using Accelerate), use the accelerate launch command instead of python:

accelerate launch --config_file config_<n_gpus>.yaml main.py --loader ir_datasets --n_gpus <n_gpus> --parallel accelerator

Replace <n_gpus> with the number of GPUs you want to use.
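The config_<n_gpus>.yaml files tell Accelerate how the machine is set up; accelerate config can also generate one interactively. A sketch of what a single-machine, 4-GPU file might contain (the field values here are assumptions, not the repository's actual configs):

```yaml
# Hypothetical config_4.yaml: one machine, one process per GPU.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
machine_rank: 0
num_processes: 4
mixed_precision: "no"
```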

Other parameters are:

  • --loader: the type of dataset loader to use. Options are ir_datasets, indexed, or in_memory.
  • --parallel: the parallelism strategy. Options are accelerator for Hugging Face's Accelerate or DataParallel for PyTorch's native DataParallel.
  • --n_gpus: the number of GPUs to use in the experiment.
  • --n_steps: the number of steps to train for.
  • --learning_rate: the learning rate for the optimiser.
  • --base_model: the base BERT model to use.
  • --batch_per_gpu: the batch size per GPU.
  • --pin_memory: whether or not to use the pin_memory option of PyTorch's DataLoader.
  • --num_workers: the number of workers (threads) to use when loading data from disk.
  • --ramdisk: whether or not to use a ramdisk. If set, you must manually move the dataset to the ramdisk (usually /dev/shm).
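For the --ramdisk runs, the dataset has to be staged into the tmpfs mount by hand before launching. A minimal sketch, assuming /dev/shm is available; the staging directory name and source path below are illustrative, not fixed by the code:

```shell
# Stage the dataset into the tmpfs mount before launching with --ramdisk.
# "ir_efficiency" and "path/to/dataset" are illustrative names; adjust to your setup.
RAMDISK=/dev/shm/ir_efficiency
mkdir -p "$RAMDISK"
# cp -r path/to/dataset "$RAMDISK"/   # uncomment and point at your actual data
```

Reads then come from RAM instead of disk, which is what the --ramdisk experiments measure.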
