varungumma / indictranstokenizer
A simple, consistent and extendable toolkit for IndicTrans2 tokenizer
Home Page: https://openreview.net/forum?id=vfT4YuzAYA
License: MIT License
Below is the code I ran:
!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!git clone https://github.com/VarunGumma/indic_nlp_library.git
%cd indic_nlp_library
!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece
!python3 -m pip install --editable ./
%cd ..
from IndicTransTokenizer import IndicProcessor,IndicTransTokenizer
I am unable to import IndicProcessor, although IndicTransTokenizer imports fine. Kindly help.
Getting this error again:
File "/home/gcpuser/sky_workdir/./app.py", line 6, in <module>
(task, pid=11924) from api.routes import inference_api
(task, pid=11924) File "/home/gcpuser/sky_workdir/./api/routes.py", line 12, in <module>
(task, pid=11924) from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
(task, pid=11924) File "/home/gcpuser/sky_workdir/./IndicTransTokenizer/IndicTransTokenizer/__init__.py", line 2, in <module>
(task, pid=11924) from .utils import IndicProcessor
(task, pid=11924) File "/home/gcpuser/sky_workdir/./IndicTransTokenizer/IndicTransTokenizer/utils.py", line 4, in <module>
(task, pid=11924) from indicnlp.tokenize import indic_tokenize, indic_detokenize
(task, pid=11924) ModuleNotFoundError: No module named 'indicnlp'
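The traceback above shows that IndicTransTokenizer/utils.py imports indicnlp, so the tokenizer package cannot load unless that module is importable in the same environment. A minimal sketch for checking this before the real import follows; the helper name find_missing_modules is hypothetical, and the module names are taken only from the import lines visible in the traceback.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located on sys.path."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Top-level modules taken from the import lines in the traceback above.
    print(find_missing_modules(["indicnlp", "IndicTransTokenizer"]))
```

If indicnlp shows up in the result, the dependency was probably installed into a different interpreter or environment than the one running the application.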
Encountering an issue when installing and using the IndicTransTokenizer package. The installation seems to succeed, but an error occurs when trying to import and use the tokenizer.
Install indic-nlp-library-IT2:
!pip install indic-nlp-library-IT2
Clone the IndicTransTokenizer GitHub repository:
!git clone https://github.com/VarunGumma/IndicTransTokenizer.git
Install the IndicTransTokenizer package:
!pip install -e ./IndicTransTokenizer/
Import and initialize IndicTransTokenizer:
from IndicTransTokenizer import IndicTransTokenizer
tokenizer = IndicTransTokenizer(direction="en-indic")
The IndicTransTokenizer should be imported and initialized without any issues.
Receiving a FileNotFoundError when trying to initialize IndicTransTokenizer.
Error Message:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/IndicTransTokenizer/en-indic/dict.SRC.json'
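The path in the error message points inside site-packages, which suggests Python resolved the import to an installed copy that lacks the en-indic/dict.SRC.json data file. A rough sketch for confirming which directory is actually used and which data files are absent from it is shown here. The helper names package_directory and missing_data_files are hypothetical, the relative path is copied from the error message, and the idea that the package looks for its data relative to its own directory is an assumption based on that path.

```python
import importlib.util
from pathlib import Path


def package_directory(package_name):
    """Return the directory a package would be imported from, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def missing_data_files(package_dir, relative_paths):
    """Return the relative paths that do not exist as files under package_dir."""
    return [rel for rel in relative_paths if not (Path(package_dir) / rel).is_file()]


if __name__ == "__main__":
    pkg_dir = package_directory("IndicTransTokenizer")
    if pkg_dir is None:
        print("IndicTransTokenizer is not importable in this environment")
    else:
        # Relative path copied from the FileNotFoundError above.
        print(pkg_dir, missing_data_files(pkg_dir, ["en-indic/dict.SRC.json"]))
```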
Hey @VarunGumma
I have converted my fine-tuned fairseq model to HF format using the following link: https://github.com/AI4Bharat/IndicTrans2/blob/main/huggingface_interface/convert_indictrans_checkpoint_to_pytorch.py
Presently, I am stuck on how to convert the custom tokenizer (vocab and final_bin) into an HF AutoTokenizer. It would be great if you could share the script/steps for the same.
Thanks!
Context: I am running this notebook - https://colab.research.google.com/github/AI4Bharat/IndicTrans2/blob/main/huggingface_interface/colab_inference.ipynb.
I cannot get past
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
I get this error - "ImportError: cannot import name 'IndicProcessor' from 'IndicTransTokenizer' (unknown location)"
Is this something that needs to be installed from some place?
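The "(unknown location)" part of the message usually means Python imported IndicTransTokenizer as a namespace package, that is, a plain directory with no __init__.py. One way this can happen (an assumption here, not confirmed in the report) is that the cloned repository folder IndicTransTokenizer/ in the working directory shadows the real package nested inside it. A small sketch for checking what the import resolves to, with the hypothetical helper name describe_import_target:

```python
import importlib.util


def describe_import_target(name):
    """Classify what `import name` would resolve to in this environment."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "not found"
    # Namespace packages have search locations but no concrete origin file.
    if spec.submodule_search_locations is not None and spec.origin in (None, "namespace"):
        return "namespace package: " + ", ".join(spec.submodule_search_locations)
    return "regular module or package: " + str(spec.origin)


if __name__ == "__main__":
    print(describe_import_target("IndicTransTokenizer"))
```

If the result is a namespace package pointing at the cloned folder, installing the package (for example with the editable install shown elsewhere on this page) or changing the working directory may let the real package be found instead.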
Traceback (most recent call last):
File "/home/gcpuser/sky_workdir/translate.py", line 15, in <module>
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
File "/home/gcpuser/sky_workdir/IndicTransTokenizer/IndicTransTokenizer/__init__.py", line 4, in <module>
from .collator import IndicDataCollator
File "/home/gcpuser/sky_workdir/IndicTransTokenizer/IndicTransTokenizer/collator.py", line 8, in <module>
from transformers.data.data_collator import pad_without_fast_tokenizer_warning
ImportError: cannot import name 'pad_without_fast_tokenizer_warning' from 'transformers.data.data_collator' (/opt/conda/lib/python3.10/site-packages/transformers/data/data_collator.py)
ERROR: Job 1 failed with return code list: [1]
INFO: Job finished (status: FAILED).
Shared connection to 35.246.42.131 closed.
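The last frame shows that collator.py expects pad_without_fast_tokenizer_warning in transformers.data.data_collator, and the installed transformers does not provide it. That typically points to a version mismatch between the two packages. A minimal sketch for checking whether a module exposes a given name follows; the helper name module_exposes is hypothetical, and which transformers release introduces the function is not stated in the report and is not assumed here.

```python
import importlib


def module_exposes(module_name, attribute):
    """Return True if module_name imports cleanly and defines attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attribute)


if __name__ == "__main__":
    # Module and attribute names copied from the ImportError above.
    print(module_exposes("transformers.data.data_collator",
                         "pad_without_fast_tokenizer_warning"))
```

If this prints False while transformers itself imports, upgrading transformers in that environment is the likely next step to try.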
import modal

stub = modal.Stub()
volume = modal.NetworkFileSystem.persisted("data")
MODEL_DIR = "/data"

@stub.function(cpu=2, memory=4276, gpu='A10G', timeout=1200, network_file_systems={MODEL_DIR: volume})
def loadIndicTrans2(dataset_name):
    import time
    start_time = time.time()
    import os
    import subprocess
    commands = [
        "pip install -q bitsandbytes",
        "apt update ",
        "apt install -y git",
        "git clone https://github.com/AI4Bharat/IndicTrans2"
    ]
    for command in commands:
        subprocess.run(command, shell=True)
    os.chdir("IndicTrans2/huggingface_interface")
    subprocess.run("bash install.sh", shell=True)
    # The generated script starts at column 0 so that it is valid Python when written to disk.
    with open('importIndic.py', 'w') as file:
        file.write(f'''
try:
    import torch
    import os
    import pandas as pd
    import csv
    print(torch.cuda.get_device_name(0))
    import sys
    from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
    print('from transformers imported')
    from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
    print('from indictranstokenizer imported')
    en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B" # ai4bharat/indictrans2-en-indic-dist-200M
    BATCH_SIZE = 4
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    if len(sys.argv) > 1:
        quantization = sys.argv[1]
    else:
        quantization = ""

    def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
        if quantization == "4-bit":
            qconfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
            )
        elif quantization == "8-bit":
            qconfig = BitsAndBytesConfig(
                load_in_8bit=True,
                bnb_8bit_use_double_quant=True,
                bnb_8bit_compute_dtype=torch.bfloat16,
            )
        else:
            qconfig = None
        tokenizer = IndicTransTokenizer(direction=direction)
        model = AutoModelForSeq2SeqLM.from_pretrained(
            ckpt_dir,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
            quantization_config=qconfig,
        )
        if qconfig == None:
            model = model.to(DEVICE)
            model.half()
        model.eval()
        return tokenizer, model

    def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
        translations = []
        for i in range(0, len(input_sentences), BATCH_SIZE):
            batch = input_sentences[i : i + BATCH_SIZE]
            batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)
            inputs = tokenizer(
                batch,
                src=True,
                truncation=True,
                padding="longest",
                return_tensors="pt",
                return_attention_mask=True,
            ).to(DEVICE)
            with torch.no_grad():
                generated_tokens = model.generate(
                    **inputs,
                    use_cache=True,
                    min_length=0,
                    max_length=256,
                    num_beams=5,
                    num_return_sequences=1,
                )
            generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)
            translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)
            del inputs
            torch.cuda.empty_cache()
        return translations

    ip = IndicProcessor(inference=True)
    en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, "en-indic", quantization)
    from datasets import load_dataset
    dataset_name = '{dataset_name}'
    if(dataset_name == "ai2_arc"):
        possible_configs = [
            'ARC-Challenge',
            'ARC-Easy'
        ]
        # columns to translate
        columns = ['question','choices']
        # columns not to translate, to keep in converted dataset as is.
        columns_asis = ['id','answerKey']
    dataset = []
    if(dataset_name == 'ai2_arc'):
        for config in possible_configs:
            base_url = 'https://huggingface.co/api/datasets/allenai/ai2_arc/parquet/{{config}}'
            data_files = {{'train': base_url + '/train/0.parquet','test':base_url + '/test/0.parquet', 'validation': base_url + '/validation/0.parquet'}}
            dataset_slice = load_dataset('parquet', data_files=data_files)
            dataset.append(dataset_slice)
except Exception as e:
    # Handle the exception
    print('An error occurred:'+ str(e))
''')
    result = subprocess.run(['python', 'importIndic.py'], stdout=subprocess.PIPE)

@stub.local_entrypoint()
def main():
    # provide dataset name among ai2_arc, gsm8k, lukaemon/mmlu
    dataset_name = "ai2_arc"
    loadIndicTrans2.remote(dataset_name)
The error says: An error occurred:[Errno 2] No such file or directory: '/usr/local/lib/python3.11/site-packages/RESOURCES/script/all_script_phonetic_data.csv'
Is it possible to stream tokens from the example model?
Please host this package on PyPI
After cloning the repo, the module 'sacrebleu' has to be manually installed
Getting an error when running the example in the README:
ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'configuration_indictrans.IndicTransConfig'> and you passed <class 'transformers_modules.ai4bharat.indictrans2-en-indic-dist-200M.f7f37e522d6612a10cbc8563af6820434e854047.configuration_indictrans.IndicTransConfig'>. Fix one of those so they match!