A simple, consistent, and extensible toolkit for the IndicTrans2 tokenizer

Home Page: https://openreview.net/forum?id=vfT4YuzAYA

License: MIT License

Topics: ai4bharat, indicnlp, python, tokenizer, transformers, translation


IndicTransTokenizer

This repository provides a simple, modular, and extensible toolkit for IndicTrans2 that is compatible with the released HuggingFace models.

Changelog

Major Update (v1.0.0)

  • The PreTrainedTokenizer for IndicTrans2 is now available on HF 🎉🎉 Note that you still need the IndicProcessor to pre-process sentences before tokenization.
  • In favor of the standard PreTrainedTokenizer, we have deprecated the custom tokenizer. It will remain available here for backward compatibility, but no further updates or bug fixes will be provided.
  • The indic_evaluate function is now consolidated into a concrete IndicEvaluator class.
  • The data collation function for training is consolidated into a concrete IndicDataCollator class (see the hedged sketch after this list).
  • A simple batching method is now available in the IndicProcessor.
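
For the training use case, the collator can be plugged in roughly as follows. This is a minimal, hedged sketch: it assumes IndicDataCollator follows the HuggingFace DataCollatorForSeq2Seq convention (tokenizer/model/padding arguments), so please check the class source for the exact signature.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransTokenizer import IndicDataCollator

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

# assumed constructor arguments, mirroring DataCollatorForSeq2Seq
collator = IndicDataCollator(tokenizer=tokenizer, model=model, padding="longest")

# the collator is then passed to a Trainer/DataLoader as usual, e.g.
# trainer = Seq2SeqTrainer(model=model, data_collator=collator, ...)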

Update (v1.0.1)

  • Added an argument to show a progress bar during preprocessing (show_progress_bar=True).
  • Added an argument to prepend additional tags like __bt__ and __ft__, similar to IT2 BT/FT data preprocessing (additional_tag="__bt__"). Both arguments are illustrated in the sketch after this list.
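
Both arguments plug directly into preprocess_batch; a short sketch based on the inference examples below:

from IndicTransTokenizer import IndicProcessor

ip = IndicProcessor(inference=True)
batch = ip.preprocess_batch(
    ["This is a test sentence."],
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
    show_progress_bar=True,   # display a progress bar during preprocessing
    additional_tag="__bt__",  # prepend a tag, as in IT2 BT/FT data preprocessing
)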

Pre-requisites

Configuration

  • Editable installation (note: this may take a while):
git clone https://github.com/VarunGumma/IndicTransTokenizer
cd IndicTransTokenizer

pip install --editable ./
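
To confirm the editable installation worked, a quick smoke test (it only imports the package and prints a confirmation):

python3 -c "from IndicTransTokenizer import IndicProcessor; print('import OK')"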

Examples

For the training use case, please refer here. Please do not use the custom tokenizer to train or fine-tune models; training with the custom tokenizer is untested and can lead to unexpected results.

PreTrainedTokenizer

import torch
from IndicTransTokenizer import IndicProcessor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", show_progress_bar=False)
batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

with tokenizer.as_target_tokenizer():
    # This scoping is absolutely necessary, as it will instruct the tokenizer to tokenize using the target vocabulary.
    # Failure to use this scoping will result in gibberish/unexpected predictions as the output will be de-tokenized with the source vocabulary instead.
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']

Custom Tokenizer (DEPRECATED)

import torch
from transformers import AutoModelForSeq2SeqLM
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

tokenizer = IndicTransTokenizer(direction="en-indic")
ip = IndicProcessor(inference=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", show_progress_bar=False)
batch = tokenizer(batch, src=True, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

outputs = tokenizer.batch_decode(outputs, src=False)
outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']

Evaluation

  • IndicEvaluator is a Python implementation of compute_metrics.sh.
  • We have found that this Python implementation gives slightly lower scores than the original compute_metrics.sh, so please use it cautiously, and feel free to raise a PR if you find the bug/fix.

from IndicTransTokenizer import IndicEvaluator

# this method returns a dictionary with BLEU and ChrF2++ scores with appropriate signatures
evaluator = IndicEvaluator()
scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=pred_file, refs=ref_file) 

# alternatively, you can pass the list of predictions and references instead of files 
# scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=preds, refs=refs)

Batching

ip = IndicProcessor(inference=True)

for batch in ip.get_batches(source_sentences, batch_size=32):
    # perform the necessary operations on the batch:
    # ... pre-processing
    # ... tokenization
    # ... generation
    # ... decoding
    ...
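
Putting it together with the PreTrainedTokenizer workflow from above (a sketch that reuses ip, tokenizer, and model from that example; source_sentences is any list of input strings):

translations = []
for batch in ip.get_batches(source_sentences, batch_size=32):
    batch = ip.preprocess_batch(batch, src_lang="eng_Latn", tgt_lang="hin_Deva")
    inputs = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

    with torch.inference_mode():
        outputs = model.generate(**inputs, num_beams=5, num_return_sequences=1, max_length=256)

    with tokenizer.as_target_tokenizer():
        # decode with the target vocabulary (see the note in the example above)
        outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

    translations += ip.postprocess_batch(outputs, lang="hin_Deva")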

Authors

Bugs and Contribution

Since this is a bleeding-edge module, you may occasionally encounter broken features and import issues. If you encounter any bugs or want additional functionality, please feel free to raise an Issue/Pull Request or contact the authors.

Citation

If you use our codebase, models, or tokenizer, please cite the following paper:

@article{gala2023indictrans,
    title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
    author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=vfT4YuzAYA}
}

Contributors

pranjalchitale, varungumma


Issues

Issue with IndicTransTokenizer Installation and Usage

Encountering an issue when installing and using the IndicTransTokenizer package. The installation seems to succeed, but an error occurs when trying to import and use the tokenizer.

@VarunGumma

Steps to Reproduce

  1. Install indic-nlp-library-IT2:

    !pip install indic-nlp-library-IT2
  2. Clone the IndicTransTokenizer GitHub repository:

    !git clone https://github.com/VarunGumma/IndicTransTokenizer.git
  3. Install the IndicTransTokenizer package:

    !pip install -e ./IndicTransTokenizer/
  4. Import and initialize IndicTransTokenizer:

    from IndicTransTokenizer import IndicTransTokenizer
    tokenizer = IndicTransTokenizer(direction="en-indic")

Expected Behavior

The IndicTransTokenizer should be imported and initialized without any issues.

Actual Behavior

Receiving a FileNotFoundError when trying to initialize IndicTransTokenizer.

Error Message:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/IndicTransTokenizer/en-indic/dict.SRC.json'

Environment

  • Python version: 3.10
  • Operating System: Kaggle


Issue with editable installation

I tried setting up IndicTrans2 for translation purposes in my project and also tried to import this tokenizer in the same repo. Can someone please help with my errors?
Laptop: Windows
(error screenshot omitted)

IndicProcessor issue

The code was running fine until yesterday, but now IndicProcessor fails to initialize due to an error in IndicTransTokenizer. Please help.
(error screenshot omitted)

Getting an error while importing: from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

The same code was working about a week ago, but now I get this error when running the code on Modal Labs remote GPUs.

code

import modal
stub = modal.Stub()


volume = modal.NetworkFileSystem.persisted("data")
MODEL_DIR = "/data"

@stub.function( cpu=2, memory = 4276, gpu = 'A10G', timeout=1200, network_file_systems={MODEL_DIR: volume})
def loadIndicTrans2(dataset_name):
    import time
    start_time = time.time()

    import os 
    import subprocess
    
    commands = [
    "pip install -q bitsandbytes",
    "apt update ", 
    "apt install -y git",
    "git clone https://github.com/AI4Bharat/IndicTrans2"
    ]
    for command in commands:
        subprocess.run(command, shell=True)

    os.chdir("IndicTrans2/huggingface_interface")
    subprocess.run("bash install.sh", shell=True)


    with open('importIndic.py', 'w') as file:
        file.write(f'''
try:
    import torch
    import os
    import pandas as pd
    import csv
    print(torch.cuda.get_device_name(0))
    import sys
    from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
    print('from transformers imported')
    from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
    print('from indictranstokenizer imported')
    
    en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # ai4bharat/indictrans2-en-indic-dist-200M
    
    BATCH_SIZE = 4
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    
    if len(sys.argv) > 1:
        quantization = sys.argv[1]
    else:
        quantization = ""
    
    
    def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
        if quantization == "4-bit":
            qconfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
            )
        elif quantization == "8-bit":
            qconfig = BitsAndBytesConfig(
                load_in_8bit=True,
                bnb_8bit_use_double_quant=True,
                bnb_8bit_compute_dtype=torch.bfloat16,
            )
        else:
            qconfig = None
    
        tokenizer = IndicTransTokenizer(direction=direction)
        model = AutoModelForSeq2SeqLM.from_pretrained(
            ckpt_dir,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            quantization_config=qconfig,
        )
    
        if qconfig is None:
            model = model.to(DEVICE)
            model.half()
        model.eval()
        return tokenizer, model
    
    def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
        translations = []
        for i in range(0, len(input_sentences), BATCH_SIZE):
            batch = input_sentences[i : i + BATCH_SIZE]
    
            batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)
    
            inputs = tokenizer(
                batch,
                src=True,
                truncation=True,
                padding="longest",
                return_tensors="pt",
                return_attention_mask=True,
            ).to(DEVICE)
    
            with torch.no_grad():
                generated_tokens = model.generate(
                    **inputs,
                    use_cache=True,
                    min_length=0,
                    max_length=256,
                    num_beams=5,
                    num_return_sequences=1,
                )
    
            generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)
    
            translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)
            del inputs
            torch.cuda.empty_cache()
        return translations

    
    ip = IndicProcessor(inference=True)
    en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, "en-indic", quantization)


    from datasets import load_dataset
    dataset_name = '{dataset_name}'
    if(dataset_name == "ai2_arc"):
        possible_configs = [
        'ARC-Challenge',
        'ARC-Easy'
        ]
        # columns to translate
        columns = ['question','choices']
        # columns not to translate, to keep in converted dataset as is.
        columns_asis = ['id','answerKey']

    dataset = []
    if(dataset_name == 'ai2_arc'):
        for config in possible_configs:
            base_url = 'https://huggingface.co/api/datasets/allenai/ai2_arc/parquet/{{config}}'
            data_files = {{'train': base_url + '/train/0.parquet','test':base_url + '/test/0.parquet', 'validation': base_url + '/validation/0.parquet'}}
            dataset_slice = load_dataset('parquet', data_files=data_files)
            dataset.append(dataset_slice)

    
except Exception as e:
    # Handle the exception
    print('An error occurred:'+ str(e))
        ''')
    result = subprocess.run(['python', 'importIndic.py'], stdout=subprocess.PIPE)


@stub.local_entrypoint()
def main():
    # provide dataset name among ai2_arc, gsm8k, lukaemon/mmlu
    dataset_name = "ai2_arc"
    
    loadIndicTrans2.remote(dataset_name)

The error says: An error occurred: [Errno 2] No such file or directory: '/usr/local/lib/python3.11/site-packages/RESOURCES/script/all_script_phonetic_data.csv'

ModuleNotFoundError: No module named 'indicnlp'

Getting this error again:

File "/home/gcpuser/sky_workdir/./app.py", line 6, in <module>
(task, pid=11924)     from api.routes import inference_api
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./api/routes.py", line 12, in <module>
(task, pid=11924)     from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./IndicTransTokenizer/IndicTransTokenizer/__init__.py", line 2, in <module>
(task, pid=11924)     from .utils import IndicProcessor
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./IndicTransTokenizer/IndicTransTokenizer/utils.py", line 4, in <module>
(task, pid=11924)     from indicnlp.tokenize import indic_tokenize, indic_detokenize
(task, pid=11924) ModuleNotFoundError: No module named 'indicnlp'

Unable to import IndicProcessor from IndicTransTokenizer

Below is the code I ran:

!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!git clone https://github.com/VarunGumma/indic_nlp_library.git

%cd indic_nlp_library

!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece


!python3 -m pip install --editable ./
%cd ..
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer


Unable to import IndicProcessor, though IndicTransTokenizer is imported fine. Kindly help.

ValueError when running example

Getting an error when running the example in the README:

ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'configuration_indictrans.IndicTransConfig'> and you passed <class 'transformers_modules.ai4bharat.indictrans2-en-indic-dist-200M.f7f37e522d6612a10cbc8563af6820434e854047.configuration_indictrans.IndicTransConfig'>. Fix one of those so they match!

Getting this error after the latest release:

Traceback (most recent call last):
  File "/home/gcpuser/sky_workdir/translate.py", line 15, in <module>
    from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
  File "/home/gcpuser/sky_workdir/IndicTransTokenizer/IndicTransTokenizer/__init__.py", line 4, in <module>
    from .collator import IndicDataCollator
  File "/home/gcpuser/sky_workdir/IndicTransTokenizer/IndicTransTokenizer/collator.py", line 8, in <module>
    from transformers.data.data_collator import pad_without_fast_tokenizer_warning
ImportError: cannot import name 'pad_without_fast_tokenizer_warning' from 'transformers.data.data_collator' (/opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py)
ERROR: Job 1 failed with return code list: [1]
INFO: Job finished (status: FAILED).
Shared connection to 35.246.42.131 closed.
