ai4bharat / indic-bert-v1 Goto Github PK

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For latest Indic-BERT v2, check: https://github.com/AI4Bharat/IndicBERT

Home Page: https://indicnlp.ai4bharat.org

License: MIT License

Python 95.06% Jupyter Notebook 3.37% Shell 1.58%

indian-languages bert multilingual-models language-model nlp

indic-bert-v1's People

Contributors

Stargazers

Watchers

Forkers

vishwanath1999 bharat3012 sudhir22 supreet21 tejamoy smitakshigupta cmhashim stjordanis kiranvarghesev ritesh-nanda vaibhavdih durgaprasd sciencepal ssusantachary adityamh96 hv42 vjagannath786 vasu0403 nemani amarjitghuman murtaza-s jaideepmurkute zenonnp suvanbalu cuckoos-ai aadil-srivastava01 euihyeonmoon kanwalarora029 bhanu0925 tiwari-pranav tororoin yashmadhani97 abhijitmanepatil hanumantappabudihal pucho-digital-health-inc elijahahianyo rakhilgit aparnarajct iq-scm hemantasarma subramanyamsahoo

indic-bert-v1's Issues

Is there any documentation as how to implement Q&A and NER using indic-bert ?

Can someone please share link of a notebook or documentation with one example of how to use indic-bert for Q&A and NER ?

Hey, I am working on adding support for specifying custom dataset. Will revert in a couple of days

Hi there I was wondering if there is such a feature. If so, how I can use the fine tuning on google colab with new dataset. "Hey, I am working on adding support for specifying custom dataset. Will revert in a couple of days"
Thanks

NER finetuning: FileNotFoundError labels.txt

labels.txt file is referenced even before it is created. The code was working fine with previous commit.

Documentation to implement model on an external dataset

Hi All,

Is there any documentation to train the model on our own dataset ... I wanted to know what dataset format the model takes and how we can fine-tune the model based on that dataset.

Not able to fine tune for text classification using Huggingface library

Hi I am trying to fine tune indic bert on IITP Movie Reviews . It is not working for AutoTokenizer, AutoModelForSequenceClassification or AlbertTokenizer, AlbertForSequenceClassification
Following is the code I am trying to run , getting same exception if I change to Auto from Albert.

Getting error : ValueError: Checkpoint was expecting a trackable object (an object derived from TrackableBase), got AlbertForSequenceClassification(

import pandas as pd
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
# Import generic wrappers
#from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AlbertTokenizer, AlbertForSequenceClassification

train_df = pd.read_csv(r'data\iitp-movie-reviews\hi\hi-train.csv' , names=['label','text'])
train_df['text'] = train_df['text'].str.replace('\n','')

test_df = pd.read_csv(r'data\iitp-movie-reviews\hi\hi-test.csv' , names=['label','text'])
test_df['text'] = test_df['text'].str.replace('\n','')

valid_df = pd.read_csv(r'data\iitp-movie-reviews\hi\hi-valid.csv' , names=['label','text'])
valid_df['text'] = test_df['text'].str.replace('\n','')

display(train_df.head(2))
display(train_df['label'].unique())

le = LabelEncoder()
x_train = train_df['text'].tolist()
y_train = list(le.fit_transform(train_df['label'].tolist()))

x_test = test_df['text'].tolist()
y_test = list(le.transform(test_df['label'].tolist()))

x_valid = valid_df['text'].tolist()
y_valid = list(le.transform(valid_df['label'].tolist()))

# Define the model repo
model_name = "ai4bharat/indic-bert" 

tokenizer = AlbertTokenizer.from_pretrained(model_name)

train_encodings = tokenizer(x_train, truncation=True, padding=True)
test_encodings = tokenizer(x_test, truncation=True, padding=True)
valid_encodings = tokenizer(x_valid, truncation=True, padding=True)

import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(valid_encodings),
    y_valid
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))

from transformers import TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

with training_args.strategy.scope():
    model = AlbertForSequenceClassification.from_pretrained(model_name)

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

finetuning.ipynb - Colab is broken

Greetings,

Thank you for this excellently documented package. I am having some trouble getting the Collab to run step 4, Fine-tune the Model. Here is the output:

/content/indic-bert/indic-bert
/usr/local/lib/python3.6/dist-packages/transformers/modeling_auto.py:798: FutureWarning: The class 'AutoModelWithLMHead' is deprecated and will be removed in a future version. Please use 'AutoModelForCausalLM' for causal language models, 'AutoModelForMaskedLM' for masked language models and 'AutoModelForSeq2SeqLM' for encoder-decoder models.
  FutureWarning,
Some weights of the model checkpoint at ai4bharat/indic-bert were not used when initializing AlbertForMaskedLM: ['sop_classifier.classifier.weight', 'sop_classifier.classifier.bias']
- This IS expected if you are initializing AlbertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing AlbertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
---------------------------------------------------------------------------
MisconfigurationException                 Traceback (most recent call last)
<ipython-input-4-df6be9fbd108> in <module>()
     17 ]
     18 
---> 19 finetune_main(argvec)

5 frames
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py in sanitize_gpu_ids(gpus)
    394                 You requested GPUs: {gpus}
    395                 But your machine only has: {all_available_gpus}
--> 396             """)
    397     return gpus
    398 

MisconfigurationException: 
                You requested GPUs: [0]
                But your machine only has: []

What do you think might be going wrong?

Cheers,
Joe

Error in loading the model from huggingface transformers

Thank you for your work! I'm loading the model from huggingface transformers and am running into the following error:

RuntimeError: Error(s) in loading state_dict for AlbertModel:
size mismatch for albert.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 128]) from checkpoint, the shape in current model is torch.Size([128, 128]).

The error occurs when I load the model as follows:
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

Please let me know if you need any more information.Thanks!

Documentation to implement NER

Hey,
I tried using IndicBert NER for news article clustering using transformers. While tokenization, some of the tokens are getting split up. I wanted to know if there is any way to avoid it.
Also, when I implemented the same example as you have mentioned in your documentation, I get different results.

kindly help me on why the tokens are not getting recognized properly. When I tried giving custom inputs in the same format of the tokenizer, tokens are not getting recognized and giving encoding as 1 even with add_special_token.

It would be helpful if you could share some implementations of the NER.

Finetuning on IndicGlue News Headline Prediction Task

I'm trying to find the IndicGlue dataset to finetune on the news headline prediction task using finetune.cli.py but I'm not able to find the IndicGlue dataset. Can someone please guide me with this?

Tokenization doesn't preserve diacritics

I was working recently with the IndicBERT SentencePiece tokenizer and found something which I was curious about. It turns out that when we encode sentences, a good amount of diacritics do not get encoded. So for example, in Hindi, the sentences - "मेंने उसकी गेंद दी।" and "मैने उसको गेंद दी।" have the same encodings despite one having the genitive and the other the dative marker. I have seen this for Gujarati and Hindi. The reason I think the diacritics are ignored is that when the encodings are decoded, some diacritics are missing.

I was curious to know why this happens and if there is a work-around.

Cloze-style multiple-choice QA

we able to download the dataset of csqa in telugu but we unable apply the indic bert.how do we give input to the model.

Regarding IndicBERT fine tuning and Evaluation splits

Hi, Thanks for this great repo and initiative. I have a question about IndicBERT evaluation. I read from that the paper that IndicBERT was fine tuned on the following datasets: –CVIT-MannKiBaat–WNLI–COPA–Amritha–MIDAS Hindi Discourse–IIT-Patna Movie and Product Sentiment Analysis (Hindi)–ACTSA Sentimenet analysis corpus (Telugu).

However in evaluation results of IndicBERT, I see again the same datasets references (Referring to the "Table 9" of the IndicNLPSuite paper). Could you please let know if this was done after splits? If yes, is it possible to please let know the splits. Please let me know in case I have overlooked something. Thank you.

Cannot instantiate Tokenizer

I am using Huggingface Transformers 4.0.0. When I instantiate the autotokenizer for indicbert, I get the following issue:

My code:
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')

Error:
Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

What is special in IndicBERT compared to other models?

Documentation for IndicGLUE

Hi,

This repository has some great resources!

I was wondering where the documentation for the fine-tuning datasets are. Sorry if I overlooked something obvious.
I want to evaluate my models (not IndicBERT) on your dataset, so wanted to see how that is possible.

Thanks!

How to decode token embeddings into token ids?

I'm trying to build a machine translation model using the indicBERT model as an embedding. I'm able to obtain token embeddings from a tokenized sentence as follows:

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert') 
model = AutoModel.from_pretrained('ai4bharat/indic-bert')

vocab_to_embedding_convertor = model.get_input_embeddings()
tokens = tokenizer(["హలో","పేరు"], return_tensors="pt")['input_ids']

embeddings = vocab_to_embedding_convertor(tokens)

However, I'm unable to find a way to obtain token ids from these embeddings. How would I go about doing this?

Thanks!
Vimal

Error during downloading the en-indic dataset

Getting the above error while trying to download the en-indic dataset.

Tokenization issues

Running into some potentially troublesome issues in the tokenization for indic-bert. It seems all vowel matras (diacritics) are getting dropped in the tokenization, which loses a lot of information about the word. Perhaps some sort of Unicode issue?

Minimal example (prints True) where two very different words get treated as the same token.

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
print(tokenizer.tokenize("यहाँ") == tokenizer.tokenize("यह"))

bert-base-multilingual-cased does not have this issue.

Is this an issue on my end? I have this problem on Colab and on my machine (Mac, Python 3.8.8). @nitinvwaran also has this issue. I had to install sentencepiece to get the tokenizer to work btw.

Unable to download the BBC News dataset

Hi, I'm trying to train the model for classification on the BBC News dataset.

When I' visiting this link, getting the below error.

UserProjectAccountProblemUser project billing account not in good standing.The billing account for the owning project is disabled in state delinquent

Is there a plan to reinstate the dataset?
Also, can the weights of the model trained on this dataset be released?

Error encountered while training IndicBert on cvit-mkb dataset

Hi Authors,
First of all, It's great to have GLUE benchmarks for Indian languages, and thanks for the great effort.

Problem faced:
While executing the code for cvit-mkb (mann-ki-baat) dataset, I faced an issue. It seems like the ManKiBaat Dataset processor module doesn't have the method named: get_labels due to which the code terminates.

To get the inference results, I have used the following command:
argvec = ['--lang', 'hi',
'--dataset', 'cvit-mkb',
'--model', 'ai4bharat/indic-bert',
'--iglue_dir', '../indic-glue',
'--output_dir', '../outputs',
'--max_seq_length', '128',
'--learning_rate', '2e-5',
'--num_train_epochs', '3',
'--train_batch_size', '32'
]

finetune_main(argvec)`

I have another concern regarding the cvit-mkb dataset for cross-lingual sentence retrieval. In the ManKiBaat Dataset processor module only supports 'en' and 'in' modes but there is no such Unicode available for any of the languages mentioned in the IndicNLPSuite paper.

I am also attaching the error logs which I encountered during the inference:

ValueError: Shape of variable bert/embeddings/LayerNorm/beta:0 ((768,)) doesn't match with shape of tensor bert/embeddings/LayerNorm/beta ([128]) from checkpoint reader.

when i used model.ckpt from pretrained IndicBERT then i got this error
ile "extract_features.py", line 339, in
tf.compat.v1.app.run()
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "extract_features.py", line 305, in main
for result in estimator.predict(input_fn, yield_single_examples=True):
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3078, in predict
rendezvous.raise_errors()
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
six.reraise(typ, value, traceback)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3072, in predict
yield_single_examples=yield_single_examples):
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 622, in predict
features, None, ModeKeys.PREDICT, self.config)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
config)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3126, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1663, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "extract_features.py", line 153, in model_fn
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 291, in init_from_checkpoint
init_from_checkpoint_fn)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1940, in merge_call
return self._merge_call(merge_fn, args, kwargs)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1947, in _merge_call
return merge_fn(self._strategy, *args, **kwargs)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 286, in
ckpt_dir_or_file, assignment_map)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/checkpoint_utils.py", line 329, in _init_from_checkpoint
tensor_name_in_ckpt, str(variable_map[tensor_name_in_ckpt])
ValueError: Shape of variable bert/embeddings/LayerNorm/beta:0 ((768,)) doesn't match with shape of tensor bert/embeddings/LayerNorm/beta ([128]) from checkpoint reader.

NER Finetuning- TypeError: join() argument must be str or bytes, not 'NoneType'

Key error:KeyError: '[UNK]' [[{{node PyFunc}}]] [[IteratorGetNext]]

raceback (most recent call last):

File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)

File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 594, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))

File "extract_features.py", line 244, in convert_examples_to_features
window_size)

File "extract_features.py", line 188, in _convert_example_to_features
input_ids = tokenizer.convert_tokens_to_ids(tokens)

File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 242, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)

File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 160, in convert_by_vocab
output.append(vocab[item])

KeyError: '[UNK]'

ERROR:tensorflow:Error recorded from prediction_loop: exceptions.KeyError: '[UNK]'
Traceback (most recent call last):

File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)

File "extract_features.py", line 244, in convert_examples_to_features
window_size)

File "extract_features.py", line 188, in _convert_example_to_features
input_ids = tokenizer.convert_tokens_to_ids(tokens)

File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 242, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)

File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 160, in convert_by_vocab
output.append(vocab[item])

KeyError: '[UNK]'

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]

E1223 17:05:18.867840 140097924953920 error_handling.py:75] Error recorded from prediction_loop: exceptions.KeyError: '[UNK]'
Traceback (most recent call last):

File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)

File "extract_features.py", line 244, in convert_examples_to_features
window_size)

File "extract_features.py", line 188, in _convert_example_to_features
input_ids = tokenizer.convert_tokens_to_ids(tokens)

File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 242, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)

File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 160, in convert_by_vocab
output.append(vocab[item])

KeyError: '[UNK]'

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]

INFO:tensorflow:prediction_loop marked as finished
I1223 17:05:18.869065 140097924953920 error_handling.py:101] prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1223 17:05:18.869143 140097924953920 error_handling.py:135] Reraising captured error
0%| | 0/2451534 [00:02<?, ?it/s]
Traceback (most recent call last):
File "extract_features.py", line 338, in
tf.compat.v1.app.run()
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "extract_features.py", line 304, in main
for result in estimator.predict(input_fn, yield_single_examples=True):
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3078, in predict
rendezvous.raise_errors()
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
six.reraise(typ, value, traceback)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3072, in predict
yield_single_examples=yield_single_examples):
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 640, in predict
preds_evaluated = mon_sess.run(predictions)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: exceptions.KeyError: '[UNK]'
Traceback (most recent call last):

File "/home/dr/anaconda3/envs/hcoref/lib/python2.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)

File "extract_features.py", line 244, in convert_examples_to_features
window_size)

File "extract_features.py", line 188, in _convert_example_to_features
input_ids = tokenizer.convert_tokens_to_ids(tokens)

File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 242, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)

File "/home/dr/Desktop/Hindi-coref/extract_bert_features/tokenization.py", line 160, in convert_by_vocab
output.append(vocab[item])

KeyError: '[UNK]'

 [[{{node PyFunc}}]]
 [[IteratorGetNext]]

why it is occuring

Need documentation in using the model

Hi,
I would like to use this model for extracting word vectors for Telugu text.
I am looking for code snippets on how to use the model for extracting word vectors. Can you please help with this? Or if there is any documentation available on using the model in Hugging Transformers, please share the same.

Thank you in advance

How do I use this in a sentence transformer?

I used a basic sample with SentenceTransformer to find the text similarity, but the text is in tamil language, i used this model, but not able to find matches for simple words too.

ai4bharat / indic-bert-v1 Goto Github PK

indic-bert-v1's People

Contributors

Stargazers

Watchers

Forkers

indic-bert-v1's Issues

Recommend Projects

Recommend Topics

Recommend Org