
varungumma / indictranstokenizer


A simple, consistent and extendable toolkit for IndicTrans2 tokenizer

Home Page: https://openreview.net/forum?id=vfT4YuzAYA

License: MIT License

Python 100.00%
ai4bharat indicnlp python tokenizer transformers translation

indictranstokenizer's People

Contributors

pranjalchitale, varungumma


indictranstokenizer's Issues

Unable to import IndicProcessor from IndicTransTokenizer

Below is the code I ran:

!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!git clone https://github.com/VarunGumma/indic_nlp_library.git

%cd indic_nlp_library

!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece


!python3 -m pip install --editable ./
%cd ..
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer


Unable to import IndicProcessor, though IndicTransTokenizer imports fine. Kindly help.
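
A quick sanity check (hedged; assuming the editable install completed and the notebook kernel was restarted afterwards) is to confirm which copy of the package Python is importing and whether it exposes IndicProcessor:

    import IndicTransTokenizer

    print(IndicTransTokenizer.__file__)                    # should point into the cloned repo
    print(hasattr(IndicTransTokenizer, "IndicProcessor"))  # False suggests a stale/older install is being picked up

If the printed path points at an old copy in site-packages rather than the clone, uninstalling that copy and re-running the editable install usually resolves the import.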

Issue with editable installation

I tried setting up IndicTrans2 for translation purposes in my project and also tried to import this tokenizer in the same repo. Can someone please help with my errors?
Laptop: Windows
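
Without a traceback it is hard to pin down, but the editable-install sequence used elsewhere in this tracker is a reasonable starting point on Windows as well (hedged sketch; run from a command prompt in the project directory):

    git clone https://github.com/VarunGumma/IndicTransTokenizer.git
    cd IndicTransTokenizer
    pip install --editable ./
    python -c "from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer"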

IndicProcessor issue

The code was running fine till yesterday, but now IndicProcessor is not getting initialised because of an error in IndicTransTokenizer. Please help.

ModuleNotFoundError: No module named 'indicnlp'

Getting this error again:

File "/home/gcpuser/sky_workdir/./app.py", line 6, in <module>
(task, pid=11924)     from api.routes import inference_api
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./api/routes.py", line 12, in <module>
(task, pid=11924)     from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./IndicTransTokenizer/IndicTransTokenizer/__init__.py", line 2, in <module>
(task, pid=11924)     from .utils import IndicProcessor
(task, pid=11924)   File "/home/gcpuser/sky_workdir/./IndicTransTokenizer/IndicTransTokenizer/utils.py", line 4, in <module>
(task, pid=11924)     from indicnlp.tokenize import indic_tokenize, indic_detokenize
(task, pid=11924) ModuleNotFoundError: No module named 'indicnlp'
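
The traceback shows the indicnlp package itself missing from the environment. A likely fix, assuming the indic-nlp fork referenced elsewhere in this tracker is the intended dependency, is to install it before importing the tokenizer:

    pip install indic-nlp-library-IT2
    # or, from source:
    git clone https://github.com/VarunGumma/indic_nlp_library.git
    pip install --editable ./indic_nlp_library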

Issue with IndicTransTokenizer Installation and Usage

Encountering an issue when installing and using the IndicTransTokenizer package. The installation seems to succeed, but an error occurs when trying to import and use the tokenizer.

@VarunGumma

Steps to Reproduce

  1. Install indic-nlp-library-IT2:

    !pip install indic-nlp-library-IT2
  2. Clone the IndicTransTokenizer GitHub repository:

    !git clone https://github.com/VarunGumma/IndicTransTokenizer.git
  3. Install the IndicTransTokenizer package:

    !pip install -e ./IndicTransTokenizer/
  4. Import and initialize IndicTransTokenizer:

    from IndicTransTokenizer import IndicTransTokenizer
    tokenizer = IndicTransTokenizer(direction="en-indic")

Expected Behavior

The IndicTransTokenizer should be imported and initialized without any issues.

Actual Behavior

Receiving a FileNotFoundError when trying to initialize IndicTransTokenizer.

Error Message:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/IndicTransTokenizer/en-indic/dict.SRC.json'

Environment

  • Python version: 3.10
  • Operating System: Kaggle

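
The failing path resolves into site-packages rather than into the cloned repository, which suggests a stale non-editable copy of the package is shadowing the editable install. A hedged cleanup, assuming the dictionary JSON files ship with the repository checkout so the import should resolve there:

    !pip uninstall -y IndicTransTokenizer
    !pip install -e ./IndicTransTokenizer/

    import IndicTransTokenizer
    print(IndicTransTokenizer.__file__)  # should now point into ./IndicTransTokenizer/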

Getting this error after the latest release

Traceback (most recent call last):
  File "/home/gcpuser/sky_workdir/translate.py", line 15, in <module>
    from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
  File "/home/gcpuser/sky_workdir/IndicTransTokenizer/IndicTransTokenizer/__init__.py", line 4, in <module>
    from .collator import IndicDataCollator
  File "/home/gcpuser/sky_workdir/IndicTransTokenizer/IndicTransTokenizer/collator.py", line 8, in <module>
    from transformers.data.data_collator import pad_without_fast_tokenizer_warning
ImportError: cannot import name 'pad_without_fast_tokenizer_warning' from 'transformers.data.data_collator' (/opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py)
ERROR: Job 1 failed with return code list: [1]
INFO: Job finished (status: FAILED).
Shared connection to 35.246.42.131 closed.
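
pad_without_fast_tokenizer_warning only exists in relatively recent transformers releases, so an older pinned version would explain the ImportError. A hedged fix is to upgrade transformers and re-check the import (the install line used elsewhere in this tracker pins transformers>=4.33.2, which may be too old for this symbol):

    pip install -U transformers
    python -c "from transformers.data.data_collator import pad_without_fast_tokenizer_warning; print('ok')"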

Getting an error on the import: from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer

The same code was working about a week back, but now I get this error. I am running the code on Modal Labs remote GPUs.

code

import modal
stub = modal.Stub()


volume = modal.NetworkFileSystem.persisted("data")
MODEL_DIR = "/data"

@stub.function( cpu=2, memory = 4276, gpu = 'A10G', timeout=1200, network_file_systems={MODEL_DIR: volume})
def loadIndicTrans2(dataset_name):
    import time
    start_time = time.time()

    import os 
    import subprocess
    
    commands = [
    "pip install -q bitsandbytes",
    "apt update ", 
    "apt install -y git",
    "git clone https://github.com/AI4Bharat/IndicTrans2"
    ]
    for command in commands:
        subprocess.run(command, shell=True)

    os.chdir("IndicTrans2/huggingface_interface")
    subprocess.run("bash install.sh", shell=True)


    with open('importIndic.py', 'w') as file:
        file.write(f'''
try:
    import torch
    import os
    import pandas as pd
    import csv
    print(torch.cuda.get_device_name(0))
    import sys
    from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
    print('from transformers imported')
    from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
    print('from indictranstokenizer imported')
    
    en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # ai4bharat/indictrans2-en-indic-dist-200M
    
    BATCH_SIZE = 4
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    
    if len(sys.argv) > 1:
        quantization = sys.argv[1]
    else:
        quantization = ""
    
    
    def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
        if quantization == "4-bit":
            qconfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
            )
        elif quantization == "8-bit":
            qconfig = BitsAndBytesConfig(
                load_in_8bit=True,
                bnb_8bit_use_double_quant=True,
                bnb_8bit_compute_dtype=torch.bfloat16,
            )
        else:
            qconfig = None
    
        tokenizer = IndicTransTokenizer(direction=direction)
        model = AutoModelForSeq2SeqLM.from_pretrained(
            ckpt_dir,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            quantization_config=qconfig,
        )
    
        if qconfig == None:
            model = model.to(DEVICE)
            model.half()
        model.eval()
        return tokenizer, model
    
    def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
        translations = []
        for i in range(0, len(input_sentences), BATCH_SIZE):
            batch = input_sentences[i : i + BATCH_SIZE]
    
            batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)
    
            inputs = tokenizer(
                batch,
                src=True,
                truncation=True,
                padding="longest",
                return_tensors="pt",
                return_attention_mask=True,
            ).to(DEVICE)
    
            with torch.no_grad():
                generated_tokens = model.generate(
                    **inputs,
                    use_cache=True,
                    min_length=0,
                    max_length=256,
                    num_beams=5,
                    num_return_sequences=1,
                )
    
            generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)
    
            translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)
            del inputs
            torch.cuda.empty_cache()
        return translations

    
    ip = IndicProcessor(inference=True)
    en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, "en-indic", quantization)


    from datasets import load_dataset
    dataset_name = '{dataset_name}'
    if(dataset_name == "ai2_arc"):
        possible_configs = [
        'ARC-Challenge',
        'ARC-Easy'
        ]
        # columns to translate
        columns = ['question','choices']
        # columns not to translate, to keep in converted dataset as is.
        columns_asis = ['id','answerKey']

    dataset = []
    if(dataset_name == 'ai2_arc'):
        for config in possible_configs:
            base_url = 'https://huggingface.co/api/datasets/allenai/ai2_arc/parquet/{{config}}'
            data_files = {{'train': base_url + '/train/0.parquet','test':base_url + '/test/0.parquet', 'validation': base_url + '/validation/0.parquet'}}
            dataset_slice = load_dataset('parquet', data_files=data_files)
            dataset.append(dataset_slice)

    
except Exception as e:
    # Handle the exception
    print('An error occurred:'+ str(e))
        ''')
    result = subprocess.run(['python', 'importIndic.py'], stdout=subprocess.PIPE)


@stub.local_entrypoint()
def main():
    # provide dataset name among ai2_arc, gsm8k, lukaemon/mmlu
    dataset_name = "ai2_arc"
    
    loadIndicTrans2.remote(dataset_name)

The error says: An error occurred: [Errno 2] No such file or directory: '/usr/local/lib/python3.11/site-packages/RESOURCES/script/all_script_phonetic_data.csv'
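
That file belongs to the indic-nlp resources, so the lookup is failing because the resources are not where the library expects them in the remote environment. A hedged workaround, assuming the packaged fork bundles those resources, is to install it inside the Modal image before importing the tokenizer (or to point the library at a local resources checkout):

    pip install indic-nlp-library-IT2

    # Alternatively (path below is illustrative, not from this repo):
    # import os
    # os.environ["INDIC_RESOURCES_PATH"] = "/path/to/indic_nlp_resources"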

ValueError when running example

Getting this error when running the example in the README:

ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'configuration_indictrans.IndicTransConfig'> and you passed <class 'transformers_modules.ai4bharat.indictrans2-en-indic-dist-200M.f7f37e522d6612a10cbc8563af6820434e854047.configuration_indictrans.IndicTransConfig'>. Fix one of those so they match!
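
The two class paths in the message show that the config object and the model class were loaded from different copies of the IndicTrans code (one imported locally as configuration_indictrans, the other fetched via trust_remote_code). A hedged sketch of a setup that keeps the two consistent by loading both from the same hub checkpoint:

    from transformers import AutoConfig, AutoModelForSeq2SeqLM

    # Assumption: the README's distilled checkpoint; any IndicTrans2 checkpoint should behave the same.
    ckpt = "ai4bharat/indictrans2-en-indic-dist-200M"
    config = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt, config=config, trust_remote_code=True)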
