A simple, consistent, and extensible toolkit for the IndicTrans2 tokenizer

Home Page: https://openreview.net/forum?id=vfT4YuzAYA

License: MIT License

Topics: ai4bharat, indicnlp, python, tokenizer, transformers, translation


IndicTransTokenizer

This repository provides a simple, modular, and extensible toolkit for IndicTrans2 that is compatible with the released HuggingFace models.

Changelog

Major Update (v1.0.0)

  • The PreTrainedTokenizer for IndicTrans2 is now available on HF 🎉🎉 Note that you still need the IndicProcessor to pre-process sentences before tokenization.
  • In favor of the standard PreTrainedTokenizer, we have deprecated the custom tokenizer. It will remain available here for backward compatibility, but no further updates or bug fixes will be provided.
  • The indic_evaluate function is now consolidated into a concrete IndicEvaluator class.
  • The data collation function for training is consolidated into a concrete IndicDataCollator class (see the hedged sketch after this list).
  • A simple batching method is now available in the IndicProcessor.
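
For the training use case, the collator can be plugged in roughly as follows. This is a minimal, hedged sketch: it assumes IndicDataCollator follows the HuggingFace DataCollatorForSeq2Seq convention (tokenizer/model/padding arguments), so please check the class source for the exact signature.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransTokenizer import IndicDataCollator

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

# assumed constructor arguments, mirroring DataCollatorForSeq2Seq
collator = IndicDataCollator(tokenizer=tokenizer, model=model, padding="longest")

# the collator is then passed to a Trainer/DataLoader as usual, e.g.
# trainer = Seq2SeqTrainer(model=model, data_collator=collator, ...)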

Update (v1.0.1)

  • Added an argument to show a progress bar during preprocessing (show_progress_bar=True).
  • Added an argument to prepend additional tags like __bt__ and __ft__, similar to IT2 BT/FT data preprocessing (additional_tag="__bt__"). Both arguments are illustrated in the sketch after this list.
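
Both arguments plug directly into preprocess_batch; a short sketch based on the inference examples below:

from IndicTransTokenizer import IndicProcessor

ip = IndicProcessor(inference=True)
batch = ip.preprocess_batch(
    ["This is a test sentence."],
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",
    show_progress_bar=True,   # display a progress bar during preprocessing
    additional_tag="__bt__",  # prepend a tag, as in IT2 BT/FT data preprocessing
)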

Pre-requisites

Configuration

  • Editable installation (note: this may take a while):
git clone https://github.com/VarunGumma/IndicTransTokenizer
cd IndicTransTokenizer

pip install --editable ./
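
To confirm the editable installation worked, a quick smoke test (it only imports the package and prints a confirmation):

python3 -c "from IndicTransTokenizer import IndicProcessor; print('import OK')"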

Examples

For the training use case, please refer here. Please do not use the custom tokenizer to train or fine-tune models; training with the custom tokenizer is untested and can lead to unexpected results.

PreTrainedTokenizer

import torch
from IndicTransTokenizer import IndicProcessor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", show_progress_bar=False)
batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

with tokenizer.as_target_tokenizer():
    # This scoping is absolutely necessary, as it will instruct the tokenizer to tokenize using the target vocabulary.
    # Failure to use this scoping will result in gibberish/unexpected predictions as the output will be de-tokenized with the source vocabulary instead.
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']

Custom Tokenizer (DEPRECATED)

import torch
from transformers import AutoModelForSeq2SeqLM
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

tokenizer = IndicTransTokenizer(direction="en-indic")
ip = IndicProcessor(inference=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", show_progress_bar=False)
batch = tokenizer(batch, src=True, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)

outputs = tokenizer.batch_decode(outputs, src=False)
outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)

>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']

Evaluation

  • IndicEvaluator is a Python implementation of compute_metrics.sh.
  • We have found that this Python implementation gives slightly lower scores than the original compute_metrics.sh, so please use it cautiously, and feel free to raise a PR if you find the bug/fix.

from IndicTransTokenizer import IndicEvaluator

# this method returns a dictionary with BLEU and ChrF2++ scores with appropriate signatures
evaluator = IndicEvaluator()
scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=pred_file, refs=ref_file) 

# alternatively, you can pass the list of predictions and references instead of files 
# scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=preds, refs=refs)

Batching

ip = IndicProcessor(inference=True)

for batch in ip.get_batches(source_sentences, batch_size=32):
    # perform the necessary operations on the batch:
    # ... pre-processing
    # ... tokenization
    # ... generation
    # ... decoding
    ...
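
Putting it together with the PreTrainedTokenizer workflow from above (a sketch that reuses ip, tokenizer, and model from that example; source_sentences is any list of input strings):

translations = []
for batch in ip.get_batches(source_sentences, batch_size=32):
    batch = ip.preprocess_batch(batch, src_lang="eng_Latn", tgt_lang="hin_Deva")
    inputs = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

    with torch.inference_mode():
        outputs = model.generate(**inputs, num_beams=5, num_return_sequences=1, max_length=256)

    with tokenizer.as_target_tokenizer():
        # decode with the target vocabulary (see the note in the example above)
        outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

    translations += ip.postprocess_batch(outputs, lang="hin_Deva")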

Authors

Bugs and Contribution

Since this is a bleeding-edge module, you may occasionally encounter broken features and import issues. If you encounter any bugs or want additional functionality, please feel free to raise an Issue/Pull Request or contact the authors.

Citation

If you use our codebase, models, or tokenizer, please cite the following paper:

@article{gala2023indictrans,
    title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
    author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2023},
    url={https://openreview.net/forum?id=vfT4YuzAYA}
}

Contributors

pranjalchitale, varungumma


Issues

Issue with IndicTransTokenizer Installation and Usage

Encountering an issue when installing and using the IndicTransTokenizer package. The installation seems to succeed, but an error occurs when trying to import and use the tokenizer.

@VarunGumma

Steps to Reproduce

  1. Install indic-nlp-library-IT2:

    !pip install indic-nlp-library-IT2
  2. Clone the IndicTransTokenizer GitHub repository:

    !git clone https://github.com/VarunGumma/IndicTransTokenizer.git
  3. Install the IndicTransTokenizer package:

    !pip install -e ./IndicTransTokenizer/
  4. Import and initialize IndicTransTokenizer:

    from IndicTransTokenizer import IndicTransTokenizer
    tokenizer = IndicTransTokenizer(direction="en-indic")

Expected Behavior

The IndicTransTokenizer should be imported and initialized without any issues.

Actual Behavior

Receiving a FileNotFoundError when trying to initialize IndicTransTokenizer.

Error Message:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/IndicTransTokenizer/en-indic/dict.SRC.json'

Environment

  • Python version: 3.10
  • Operating System: Kaggle


Issue with editable installation

I tried setting up IndicTrans2 for translation purposes in my project and also tried to import this tokenizer in the same repo. Can someone please help with my errors?
Laptop: Windows
(error screenshot omitted)

IndicProcessor issue

The code was running fine until yesterday, but now IndicProcessor fails to initialize due to an error in IndicTransTokenizer. Please help.
(error screenshot omitted)

Getting an error while importing: from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

The same code was working about a week ago, but now I get this error when running the code on Modal Labs remote GPUs.

code

import modal
stub = modal.Stub()


volume = modal.NetworkFileSystem.persisted("data")
MODEL_DIR = "/data"

@stub.function( cpu=2, memory = 4276, gpu = 'A10G', timeout=1200, network_file_systems={MODEL_DIR: volume})
def loadIndicTrans2(dataset_name):
    import time
    start_time = time.time()

    import os 
    import subprocess
    
    commands = [
    "pip install -q bitsandbytes",
    "apt update ", 
    "apt install -y git",
    "git clone https://github.com/AI4Bharat/IndicTrans2"
    ]
    for command in commands:
        subprocess.run(command, shell=True)

    os.chdir("IndicTrans2/huggingface_interface")
    subprocess.run("bash install.sh", shell=True)


    with open('importIndic.py', 'w') as file:
        file.write(f'''
try:
    import torch
    import os
    import pandas as pd
    import csv
    print(torch.cuda.get_device_name(0))
    import sys
    from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
    print('from transformers imported')
    from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
    print('from indictranstokenizer imported')
    
    en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # ai4bharat/indictrans2-en-indic-dist-200M
    
    BATCH_SIZE = 4
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    
    if len(sys.argv) > 1:
        quantization = sys.argv[1]
    else:
        quantization = ""
    
    
    def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
        if quantization == "4-bit":
            qconfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
            )
        elif quantization == "8-bit":
            qconfig = BitsAndBytesConfig(
                load_in_8bit=True,
                bnb_8bit_use_double_quant=True,
                bnb_8bit_compute_dtype=torch.bfloat16,
            )
        else:
            qconfig = None
    
        tokenizer = IndicTransTokenizer(direction=direction)
        model = AutoModelForSeq2SeqLM.from_pretrained(
            ckpt_dir,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            quantization_config=qconfig,
        )
    
        if qconfig is None:
            model = model.to(DEVICE)
            model.half()
        model.eval()
        return tokenizer, model
    
    def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
        translations = []
        for i in range(0, len(input_sentences), BATCH_SIZE):
            batch = input_sentences[i : i + BATCH_SIZE]
    
            batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)
    
            inputs = tokenizer(
                batch,
                src=True,
                truncation=True,
                padding="longest",
                return_tensors="pt",
                return_attention_mask=True,
            ).to(DEVICE)
    
            with torch.no_grad():
                generated_tokens = model.generate(
                    **inputs,
                    use_cache=True,
                    min_length=0,
                    max_length=256,
                    num_beams=5,
                    num_return_sequences=1,
                )
    
            generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)
    
            translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)
            del inputs
            torch.cuda.empty_cache()
        return translations

    
    ip = IndicProcessor(inference=True)
    en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, "en-indic", quantization)


    from datasets import load_dataset
    dataset_name = '{dataset_name}'
    if(dataset_name == "ai2_arc"):
        possible_configs = [
        'ARC-Challenge',
        'ARC-Easy'
        ]
        # columns to translate
        columns = ['question','choices']
        # columns not to translate, to keep in converted dataset as is.
        columns_asis = ['id','answerKey']

    dataset = []
    if(dataset_name == 'ai2_arc'):
        for config in possible_configs:
            base_url = 'https://huggingface.co/api/datasets/allenai/ai2_arc/parquet/{{config}}'
            data_files = {{'train': base_url + '/train/0.parquet','test':base_url + '/test/0.parquet', 'validation': base_url + '/validation/0.parquet'}}
            dataset_slice = load_dataset('parquet', data_files=data_files)
            dataset.append(dataset_slice)

    
except Exception as e:
    # Handle the exception
    print('An error occurred:'+ str(e))
        ''')
    result = subprocess.run(['python', 'importIndic.py'], stdout=subprocess.PIPE)


@stub.local_entrypoint()
def main():
    # provide dataset name among ai2_arc, gsm8k, lukaemon/mmlu
    dataset_name = "ai2_arc"
    
    loadIndicTrans2.remote(dataset_name)

The error says: An error occurred: [Errno 2] No such file or directory: '/usr/local/lib/python3.11/site-packages/RESOURCES/script/all_script_phonetic_data.csv'

ModuleNotFoundError: No module named 'indicnlp'

Getting this error again:

File "/home/gcpuser/sky_workdir/./app.py", line 6, in <module>
(task, pid=11924)     from api.routes import inference_api
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./api/routes.py", line 12, in <module>
(task, pid=11924)     from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./IndicTransTokenizer/IndicTransTokenizer/__init__.py", line 2, in <module>
(task, pid=11924)     from .utils import IndicProcessor
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./IndicTransTokenizer/IndicTransTokenizer/utils.py", line 4, in <module>
(task, pid=11924)     from indicnlp.tokenize import indic_tokenize, indic_detokenize
(task, pid=11924) ModuleNotFoundError: No module named 'indicnlp'

Unable to import IndicProcessor from IndicTransTokenizer

Below is the code I ran:

!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!git clone https://github.com/VarunGumma/indic_nlp_library.git

%cd indic_nlp_library

!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece


!python3 -m pip install --editable ./
%cd ..
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer


Unable to import IndicProcessor, though IndicTransTokenizer is imported fine. Kindly help.

ValueError when running example

Getting an error when running the example in the README:

ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'configuration_indictrans.IndicTransConfig'> and you passed <class 'transformers_modules.ai4bharat.indictrans2-en-indic-dist-200M.f7f37e522d6612a10cbc8563af6820434e854047.configuration_indictrans.IndicTransConfig'>. Fix one of those so they match!

Getting this error after the latest release:

Traceback (most recent call last):
  File "/home/gcpuser/sky_workdir/translate.py", line 15, in <module>
    from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
  File "/home/gcpuser/sky_workdir/IndicTransTokenizer/IndicTransTokenizer/__init__.py", line 4, in <module>
    from .collator import IndicDataCollator
  File "/home/gcpuser/sky_workdir/IndicTransTokenizer/IndicTransTokenizer/collator.py", line 8, in <module>
    from transformers.data.data_collator import pad_without_fast_tokenizer_warning
ImportError: cannot import name 'pad_without_fast_tokenizer_warning' from 'transformers.data.data_collator' (/opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py)
ERROR: Job 1 failed with return code list: [1]
INFO: Job finished (status: FAILED).
Shared connection to 35.246.42.131 closed.
