
ru_transformers's People

Contributors

broair, kepler-br, mgrankin


ru_transformers's Issues

Failed to tokenize big dataset with YTTM

Hi, thanks for the implementation.

How did you manage to tokenize your 200+ GB of text with YTTM?

I tried with ~150 GB in a single text file and got a memory error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

I'm using an n1-standard-8 (8 vCPUs, 30 GB memory) on GCP; maybe I need double the memory?

Meanwhile, it seems to work with Hugging Face Tokenizers.
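
A hedged workaround sketch (not the authors' documented procedure): YouTokenToMe appears to hold its training statistics in RAM, so on a 30 GB machine one option is to train the BPE model on a random sample of the corpus rather than the full 150 GB file. File names, the sampling ratio, and the vocabulary size below are illustrative.

import random
import youtokentome as yttm

SAMPLE_EVERY = 20  # keep roughly 1 line in 20 (~7-8 GB out of 150 GB)

with open("full_corpus.txt", encoding="utf-8") as src, \
     open("bpe_sample.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if random.randrange(SAMPLE_EVERY) == 0:
            dst.write(line)

# BPE merge statistics usually stabilize long before the full corpus is seen.
yttm.BPE.train(data="bpe_sample.txt", model="bpe/yt.model", vocab_size=50257)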

Seems like optimizer.step() has been overridden after learning rate scheduler initialization

Hi,

I'm using your run_lm_finetuning.py script. It works, but I would like to know why, at the start of training, I get:

  • the following message, three times: Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to ...
  • and the following warning:
/opt/anaconda3/envs/gpt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:91: 
UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler 
initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. 
See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

This is strange, since we can see that optimizer.step() is indeed called before scheduler.step(), starting at line 359.

Any thoughts on this?
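
For reference, this warning is usually harmless in this setup: judging from the tracebacks in other issues here, the scheduler is built around line 289 and amp.initialize runs at line 296, so apex wraps optimizer.step() after the scheduler was created, which is exactly what the PyTorch check complains about. The "Gradient overflow" messages are normal loss-scale calibration at the start of mixed-precision training. Below is a minimal PyTorch sketch (not this repo's code) of the call order the warning refers to.

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)

for _ in range(3):
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()       # update the weights first
    scheduler.step()       # then advance the learning-rate schedule
    optimizer.zero_grad()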

get_constant_schedule() got an unexpected keyword argument 'warmup_steps'

I tried to fine-tune your Russian GPT-2 on a Russian dataset, but got this error:

Traceback (most recent call last):
  File "run_lm_finetuning.py", line 662, in <module>
    main()
  File "run_lm_finetuning.py", line 630, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 289, in train
    scheduler = get_constant_schedule(optimizer, warmup_steps=warmup_steps)
TypeError: get_constant_schedule() got an unexpected keyword argument 'warmup_steps'

Here's how I run run_lm_finetuning.py:

python run_lm_finetuning.py \
    --output_dir=$MODEL_PATH \
    --model_type=gpt2 \
    --model_name_or_path=$MODEL_PATH \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size=$BATCH_SIZE \
    --save_steps=10000 \
    --logging_steps=1 \
    --warmup_samples 16000 \
    --learning_rate $LEARNING_RATE \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=$VALIDATION_FILE \
    --num_train_epochs 1.0 \
    --unfreeze_level 0 \
    --overwrite_output_dir 

But after I deleted warmup_steps from the function call, training starts successfully, yet it somehow cancels itself during the first iteration:

Epoch:   0% 0/3 [00:00<?, ?it/s]
Iteration:   0% 0/9928 [00:00<?, ?it/s]^C

I'm using Google Colab for training.
Even replacing get_constant_schedule with get_constant_schedule_with_warmup doesn't help: training still cancels itself with ^C.
I tried different pip transformers versions, but nothing works.
Sampling works flawlessly, by the way. (See the hedged note on the scheduler API below.)
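
A hedged note on the scheduler API: in newer transformers releases the warmup variant is a separate function whose keyword is num_warmup_steps, which is why the original call with warmup_steps= fails. An illustrative sketch, not the repo's exact code:

import torch
from transformers import get_constant_schedule_with_warmup

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-5)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=100)

The ^C printed in the Colab log and the SIGKILL in the TPU run usually mean the process was killed externally, most often for exhausting the instance's RAM, rather than a scheduler problem.
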
This is what happens when I try to use a TPU on Colab:

Traceback (most recent call last):
  File "tpu_lm_finetuning.py", line 697, in <module>
    xmp.spawn(main)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGKILL

Here are my pip packages:

absl-py                       0.9.0          
alabaster                     0.7.12         
albumentations                0.1.12         
altair                        4.1.0          
apex                          0.1            
argon2-cffi                   20.1.0         
asgiref                       3.2.10         
astor                         0.8.1          
astropy                       4.0.1.post1    
astunparse                    1.6.3          
atari-py                      0.2.6          
atomicwrites                  1.4.0          
attrs                         19.3.0         
audioread                     2.1.8          
autograd                      1.3            
Babel                         2.8.0          
backcall                      0.2.0          
backports.tempfile            1.0            
backports.weakref             1.0.post1      
beautifulsoup4                4.6.3          
bleach                        3.1.5          
blis                          0.4.1          
bokeh                         2.1.1          
boto                          2.49.0         
boto3                         1.14.37        
botocore                      1.17.37        
Bottleneck                    1.3.2          
branca                        0.4.1          
bs4                           0.0.1          
bz2file                       0.98           
CacheControl                  0.12.6         
cachetools                    4.1.1          
catalogue                     1.0.0          
certifi                       2020.6.20      
cffi                          1.14.1         
chainer                       7.4.0          
chardet                       3.0.4          
click                         7.1.2          
cloudpickle                   1.3.0          
cmake                         3.12.0         
cmdstanpy                     0.4.0          
colorlover                    0.3.0          
community                     1.0.0b1        
contextlib2                   0.5.5          
convertdate                   2.2.1          
coverage                      3.7.1          
coveralls                     0.5            
crcmod                        1.7            
cufflinks                     0.17.3         
cupy-cuda101                  7.4.0          
cvxopt                        1.2.5          
cvxpy                         1.0.31         
cycler                        0.10.0         
cymem                         2.0.3          
Cython                        0.29.21        
daft                          0.0.4          
dask                          2.12.0         
dataclasses                   0.7            
datascience                   0.10.6         
decorator                     4.4.2          
defusedxml                    0.6.0          
descartes                     1.1.0          
dill                          0.3.2          
distributed                   1.25.3         
Django                        3.1            
dlib                          19.18.0        
dm-sonnet                     1.35           
dm-tree                       0.1.5          
docopt                        0.6.2          
docutils                      0.15.2         
dopamine-rl                   1.0.5          
earthengine-api               0.1.229        
easydict                      1.9            
ecos                          2.0.7.post1    
editdistance                  0.5.3          
en-core-web-sm                2.2.5          
entrypoints                   0.3            
ephem                         3.7.7.1        
et-xmlfile                    1.0.1          
fa2                           0.3.5          
fancyimpute                   0.4.3          
fastai                        1.0.59         
fastapi                       0.61.0         
fastdtw                       0.3.4          
fastprogress                  0.2.5          
fastrlock                     0.5            
fbprophet                     0.6            
feather-format                0.4.1          
featuretools                  0.4.1          
filelock                      3.0.12         
firebase-admin                4.1.0          
fix-yahoo-finance             0.0.22         
Flask                         1.1.2          
folium                        0.8.3          
fsspec                        0.8.0          
future                        0.16.0         
gast                          0.3.3          
GDAL                          2.2.2          
gdown                         3.6.4          
gensim                        3.6.0          
geographiclib                 1.50           
geopy                         1.17.0         
gevent                        1.4.0          
gin-config                    0.3.0          
glob2                         0.7            
google                        2.0.3          
google-api-core               1.16.0         
google-api-python-client      1.7.12         
google-auth                   1.17.2         
google-auth-httplib2          0.0.4          
google-auth-oauthlib          0.4.1          
google-cloud-bigquery         1.21.0         
google-cloud-core             1.0.3          
google-cloud-datastore        1.8.0          
google-cloud-firestore        1.7.0          
google-cloud-language         1.2.0          
google-cloud-storage          1.18.1         
google-cloud-translate        1.5.0          
google-colab                  1.0.0          
google-pasta                  0.2.0          
google-resumable-media        0.4.1          
googleapis-common-protos      1.52.0         
googledrivedownloader         0.4            
graph-nets                    1.0.5          
graphviz                      0.10.1         
greenlet                      0.4.15         
grpcio                        1.31.0         
gspread                       3.0.1          
gspread-dataframe             3.0.7          
gunicorn                      20.0.4         
gym                           0.17.2         
h5py                          2.10.0         
HeapDict                      1.0.1          
holidays                      0.9.12         
holoviews                     1.13.3         
html5lib                      1.0.1          
httpimport                    0.5.18         
httplib2                      0.17.4         
httplib2shim                  0.0.3          
humanize                      0.5.1          
hyperopt                      0.1.2          
ideep4py                      2.0.0.post3    
idna                          2.10           
image                         1.5.32         
imageio                       2.4.1          
imagesize                     1.2.0          
imbalanced-learn              0.4.3          
imblearn                      0.0            
imgaug                        0.2.9          
importlib-metadata            1.7.0          
imutils                       0.5.3          
inflect                       2.1.0          
iniconfig                     1.0.1          
intel-openmp                  2020.0.133     
intervaltree                  2.1.0          
ipykernel                     4.10.1         
ipython                       5.5.0          
ipython-genutils              0.2.0          
ipython-sql                   0.3.9          
ipywidgets                    7.5.1          
itsdangerous                  1.1.0          
jax                           0.1.75         
jaxlib                        0.1.52         
jdcal                         1.4.1          
jedi                          0.17.2         
jieba                         0.42.1         
Jinja2                        2.11.2         
jmespath                      0.10.0         
joblib                        0.16.0         
jpeg4py                       0.1.4          
jsonschema                    2.6.0          
jupyter                       1.0.0          
jupyter-client                5.3.5          
jupyter-console               5.2.0          
jupyter-core                  4.6.3          
kaggle                        1.5.6          
kapre                         0.1.3.1        
Keras                         2.3.1          
Keras-Applications            1.0.8          
Keras-Preprocessing           1.1.2          
keras-vis                     0.4.1          
kfac                          0.2.0          
kiwisolver                    1.2.0          
knnimpute                     0.1.0          
librosa                       0.6.3          
lightgbm                      2.2.3          
llvmlite                      0.31.0         
lmdb                          0.98           
lucid                         0.3.8          
LunarCalendar                 0.0.9          
lxml                          4.2.6          
magenta                       0.3.19         
Markdown                      3.2.2          
MarkupSafe                    1.1.1          
matplotlib                    3.2.2          
matplotlib-venn               0.11.5         
mesh-tensorflow               0.1.12         
mido                          1.2.6          
mir-eval                      0.5            
missingno                     0.4.2          
mistune                       0.8.4          
mizani                        0.6.0          
mkl                           2019.0         
mlxtend                       0.14.0         
more-itertools                8.4.0          
moviepy                       0.2.3.5        
mpi4py                        3.0.3          
mpmath                        1.1.0          
msgpack                       1.0.0          
multiprocess                  0.70.10        
multitasking                  0.0.9          
murmurhash                    1.0.2          
music21                       5.5.0          
natsort                       5.5.0          
nbconvert                     5.6.1          
nbformat                      5.0.7          
networkx                      2.4            
nibabel                       3.0.2          
nltk                          3.2.5          
notebook                      5.3.1          
np-utils                      0.5.12.1       
numba                         0.48.0         
numexpr                       2.7.1          
numpy                         1.18.5         
nvidia-ml-py3                 7.352.0        
oauth2client                  4.1.3          
oauthlib                      3.1.0          
okgrade                       0.4.3          
opencv-contrib-python         4.1.2.30       
opencv-python                 4.1.2.30       
openpyxl                      2.5.9          
opt-einsum                    3.3.0          
osqp                          0.6.1          
packaging                     20.4           
palettable                    3.3.0          
pandas                        1.0.5          
pandas-datareader             0.8.1          
pandas-gbq                    0.11.0         
pandas-profiling              1.4.1          
pandocfilters                 1.4.2          
panel                         0.9.7          
param                         1.9.3          
parso                         0.7.1          
pathlib                       1.0.1          
patsy                         0.5.1          
pbr                           5.4.5          
pexpect                       4.8.0          
pickleshare                   0.7.5          
Pillow                        7.0.0          
pip                           19.3.1         
pip-tools                     4.5.1          
plac                          1.1.3          
plotly                        4.4.1          
plotnine                      0.6.0          
pluggy                        0.7.1          
portpicker                    1.3.1          
prefetch-generator            1.0.1          
preshed                       3.0.2          
pretty-midi                   0.2.8          
prettytable                   0.7.2          
progressbar2                  3.38.0         
prometheus-client             0.8.0          
promise                       2.3            
prompt-toolkit                1.0.18         
protobuf                      3.12.4         
psutil                        5.4.8          
psycopg2                      2.7.6.1        
ptyprocess                    0.6.0          
py                            1.9.0          
pyarrow                       0.14.1         
pyasn1                        0.4.8          
pyasn1-modules                0.2.8          
pycocotools                   2.0.1          
pycparser                     2.20           
pyct                          0.4.6          
pydantic                      1.6.1          
pydata-google-auth            1.1.0          
pydot                         1.3.0          
pydot-ng                      2.0.0          
pydotplus                     2.0.2          
PyDrive                       1.3.1          
pyemd                         0.5.1          
pyglet                        1.5.0          
Pygments                      2.1.3          
pygobject                     3.26.1         
pymc3                         3.7            
PyMeeus                       0.3.7          
pymongo                       3.11.0         
pymystem3                     0.2.0          
PyOpenGL                      3.1.5          
pyparsing                     2.4.7          
pypng                         0.0.20         
pyrsistent                    0.16.0         
pysndfile                     1.3.8          
PySocks                       1.7.1          
pystan                        2.19.1.1       
pytest                        3.6.4          
python-apt                    1.6.5+ubuntu0.3
python-chess                  0.23.11        
python-dateutil               2.8.1          
python-louvain                0.14           
python-rtmidi                 1.4.0          
python-slugify                4.0.1          
python-utils                  2.4.0          
pytz                          2018.9         
pyviz-comms                   0.7.6          
PyWavelets                    1.1.1          
PyYAML                        3.13           
pyzmq                         19.0.2         
qtconsole                     4.7.5          
QtPy                          1.9.0          
regex                         2019.12.20     
requests                      2.23.0         
requests-oauthlib             1.3.0          
resampy                       0.2.2          
retrying                      1.3.3          
rpy2                          3.2.7          
rsa                           4.6            
s3fs                          0.4.2          
s3transfer                    0.3.3          
sacremoses                    0.0.43         
scikit-image                  0.16.2         
scikit-learn                  0.22.2.post1   
scipy                         1.4.1          
screen-resolution-extra       0.0.0          
scs                           2.1.2          
seaborn                       0.10.1         
semantic-version              2.8.4          
Send2Trash                    1.5.0          
sentencepiece                 0.1.91         
setuptools                    49.2.0         
setuptools-git                1.2            
Shapely                       1.7.0          
simplegeneric                 0.8.1          
six                           1.15.0         
sklearn                       0.0            
sklearn-pandas                1.8.0          
smart-open                    2.1.0          
snowballstemmer               2.0.0          
sortedcontainers              2.2.2          
spacy                         2.2.4          
Sphinx                        1.8.5          
sphinxcontrib-serializinghtml 1.1.4          
sphinxcontrib-websupport      1.2.3          
SQLAlchemy                    1.3.18         
sqlparse                      0.3.1          
srsly                         1.0.2          
stable-baselines              2.2.1          
starlette                     0.13.6         
statsmodels                   0.10.2         
sympy                         1.1.1          
tables                        3.4.4          
tabulate                      0.8.7          
tblib                         1.7.0          
tendo                         0.2.15         
tensor2tensor                 1.14.1         
tensorboard                   1.15.0         
tensorboard-plugin-wit        1.7.0          
tensorboardcolab              0.0.22         
tensorflow                    1.15.2         
tensorflow-addons             0.8.3          
tensorflow-datasets           2.1.0          
tensorflow-estimator          1.15.1         
tensorflow-gan                2.0.0          
tensorflow-gcs-config         2.3.0          
tensorflow-hub                0.8.0          
tensorflow-metadata           0.22.2         
tensorflow-privacy            0.2.2          
tensorflow-probability        0.7.0          
termcolor                     1.1.0          
terminado                     0.8.3          
testpath                      0.4.4          
text-unidecode                1.3            
textblob                      0.15.3         
textgenrnn                    1.4.1          
tflearn                       0.3.2          
Theano                        1.0.5          
thinc                         7.4.0          
tifffile                      2020.7.24      
toml                          0.10.1         
toolz                         0.10.0         
torch                         1.6.0+cu101    
torchsummary                  1.5.1          
torchtext                     0.3.1          
torchvision                   0.7.0+cu101    
tornado                       5.1.1          
tqdm                          4.41.1         
traitlets                     4.3.3          
transformers                  2.2.0          
tweepy                        3.6.0          
typeguard                     2.7.1          
typing-extensions             3.7.4.2        
tzlocal                       1.5.1          
umap-learn                    0.4.6          
uritemplate                   3.0.1          
urllib3                       1.24.3         
vega-datasets                 0.8.0          
wasabi                        0.7.1          
wcwidth                       0.2.5          
webencodings                  0.5.1          
Werkzeug                      1.0.1          
wheel                         0.34.2         
widgetsnbextension            3.5.1          
wordcloud                     1.5.0          
wrapt                         1.12.1         
xarray                        0.15.1         
xgboost                       0.90           
xkit                          0.0.0          
xlrd                          1.1.0          
xlwt                          1.3.0          
yellowbrick                   0.9.1          
youtokentome                  1.0.6          
zict                          2.0.0          
zipp                          3.1.0          
zmq                           0.0.0

Website doesn't respond

When I try to continue my story, the error shown in the attached screenshot (Screenshot_20210423-101612_Chrome) happens over and over again on different devices; this started yesterday evening.
The neural network doesn't respond on any device, and many people report the same issue.

Up-to-date requirements.txt for running rest.py

Hi!

I ran pip3 install -r .\tpu_requirements.txt, but after running python3 .\rest.py I get several errors, for example:

Traceback (most recent call last):
  File "ru_transformers\rest.py", line 77, in <module>
    from pydantic import BaseModel, Schema
ImportError: cannot import name 'Schema' from 'pydantic' (C:\Users\jjj\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pydantic\__init__.cp39-win_amd64.pyd)

Please provide an up-to-date requirements.txt for running rest.py.
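
A hedged compatibility sketch: recent pydantic releases removed the old Schema helper in favour of Field, which is why the import in rest.py fails on a fresh install. Either pin the older pydantic version from the repo's requirements or alias the import; the request model below is hypothetical, for illustration only.

try:
    from pydantic import BaseModel, Schema           # older pydantic
except ImportError:
    from pydantic import BaseModel, Field as Schema  # newer pydantic without Schema

class Prompt(BaseModel):        # hypothetical request model, not the repo's schema
    prompt: str = Schema(..., max_length=3000)
    length: int = 60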

model directory structure

First of all, thank you so much for the incredible work on it.

I'm trying to reproduce your results locally. rest.py apparently expects a different gpt2 directory structure, with a medium folder inside it, etc. I was able to make it work by moving some files around, but I believe something is still wrong, because the quality of generation is much lower than yours on https://porfirevich.ru/

curl -i -X POST -H 'Content-Type: application/json' -d '{"prompt":"Московский городской суд вынес постановление об отмене признания Ираном вины в ракетной атаке на украинский Боинг","length":60,"num_samples":4}' http://0.0.0.0:8000/gpt2/medium/

Results are somewhat funny, but yours are so much better.

Runtime error

Hey guys,
I have a problem with the Colab notebook for fine-tuning: it returns a RuntimeError every time I try to launch run_lm_finetuning.py:

RuntimeError: Found param transformer.wte.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.

Maybe I'm doing something wrong... Could you give me a hint?
Thank you in advance.
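
For reference, a minimal sketch of what the apex message asks for (not the repo's code): the model's parameters must already be CUDA tensors when amp.initialize runs. On Colab this error often means the notebook is not actually attached to a GPU runtime (Runtime -> Change runtime type -> GPU), so the move to CUDA never succeeds.

import torch
from apex import amp

assert torch.cuda.is_available(), "select a GPU runtime first"

model = torch.nn.Linear(768, 768).to('cuda')     # parameters must live on the GPU
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')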

'GPT2Config' has no attribute 'pretrained_config_archive_map'

Hi! Thank you for sharing your work!
After installation I try to run rest.py and get this error:

ALL_MODELS = sum((tuple(conf.pretrained_config_archive_map.keys()) for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)), ())
AttributeError: type object 'GPT2Config' has no attribute 'pretrained_config_archive_map'
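
A hedged workaround sketch: newer transformers releases dropped the pretrained_config_archive_map attribute from config classes, so this line in rest.py breaks. Pinning the transformers version from the repo's requirements is the simplest fix; alternatively, ALL_MODELS (which appears to be informational only) can be built defensively. Illustrative, not the repo's actual patch.

from transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig

ALL_MODELS = sum(
    (tuple(getattr(conf, "pretrained_config_archive_map", {}).keys())
     for conf in (GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig)),
    (),
)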

References in the text

Input: Захотелось выпить ("Felt like having a drink")
Output:

Захотелось выпить и понять. Э.  По: Шаг 1. Перевод Виктора Пелевина. СПб.: Азбука, 2014. С. 17.].) Гегель тоже вспоминает, что наблюдатель в положении сидя способен видеть вещи определенным образом и оставаться при этом вне их видимости.

(Note the bibliographic reference embedded in the generated text, roughly: "E. Poe: Step 1. Translated by Victor Pelevin. St. Petersburg: Azbuka, 2014, p. 17.")

Where is fit.sh?

I was going through the instructions. I see you have referred to fit.sh for training. But I don't see it in the repo. Did you forget to add it? Thanks.

I got confused on dataset paths (colab notebook)

Hello! I am sorry, these questions are probably silly, but I really can't figure it out :-(

I got a little confused about what should go in my dataset folders. I have my raw dataset in /mydataset/ and I have created the following folders:
/dataset/prepared
/dataset/train
/dataset/validation
The prepared folder is populated according to process_function; that all works fine.
But then it is unclear: I should move the smallest file into validation, but from which directory? Is it /mydataset?
And what should I move into train? Everything from /mydataset, or from /dataset/prepared?

Running on Google Colab TPU

I was trying to get the TPU training to run on a Google Colab TPU.
There is a TPU MNIST demo that works fine with XLA (https://colab.sandbox.google.com/github/pytorch/xla/blob/master/contrib/colab/mnist-training-xrt-1-15.ipynb), so I thought I should be able to run your test_train_mp_mnist.py, but I am running into issues. Even tpu_lm_finetuning.py failed to run, with a thread-join error.
Do you have any plans to make your code run on a Google Colab TPU? That would be very helpful.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
    fn(gindex, *args)
  File "/content/ru_transformers/tpu/test_train_mp_mnist.py", line 180, in _mp_fn
    accuracy = train_mnist()
  File "/content/ru_transformers/tpu/test_train_mp_mnist.py", line 79, in train_mnist
    transforms.Normalize((0.1307,), (0.3081,))]))
  File "/usr/local/lib/python3.6/dist-packages/torchvision/datasets/mnist.py", line 71, in __init__
    self.download()
  File "/usr/local/lib/python3.6/dist-packages/torchvision/datasets/mnist.py", line 144, in download
    read_image_file(os.path.join(self.raw_folder, 'train-images-idx3-ubyte')),
  File "/usr/local/lib/python3.6/dist-packages/torchvision/datasets/mnist.py", line 483, in read_image_file
    x = read_sn3_pascalvincent_tensor(f, strict=False)
  File "/usr/local/lib/python3.6/dist-packages/torchvision/datasets/mnist.py", line 461, in read_sn3_pascalvincent_tensor
    magic = get_int(data[0:4])
  File "/usr/local/lib/python3.6/dist-packages/torchvision/datasets/mnist.py", line 426, in get_int
    return int(codecs.encode(b, 'hex'), 16)
ValueError: invalid literal for int() with base 16: b''

Run on CPU

Is it possible to run this model for evaluation on a CPU?

How does one preprocess the dataset?

I've managed to get fine-tuning running, but hit another wall: how would one go about preprocessing the dataset? Apparently, the model doesn't pick up the default <|startoftext|> tag, and from what I can see in corpus.ipynb, there is only the newline <|n|> tag. How would you mark the start and end of a piece for the model and tokenizer? (See the sketch below.)
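
Below is a hedged preprocessing sketch based only on what the question itself describes: no <|startoftext|> token, newlines inside a piece encoded as the <|n|> token, and one piece per line so that the line break itself acts as the piece boundary. The delimiter choices are assumptions, not the canonical format from corpus.ipynb.

def prepare_pieces(pieces, out_path="train_prepared.txt"):
    """Write one piece per line, keeping internal line breaks as <|n|> tokens."""
    with open(out_path, "w", encoding="utf-8") as f:
        for text in pieces:
            one_line = text.strip().replace("\n", " <|n|> ")
            f.write(one_line + "\n")

prepare_pieces(["First piece.\nIts second line.", "Second piece."])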

Google Colab sheet spawning ru_transformers folders

Add this line to the Google Colab sheet
!rm -rf '/content/ru_transformers' || :
before !git clone https://github.com/mgrankin/ru_transformers
otherwise it starts to spawn new folders every time the script is re-run (because of an error or whatever).

774M model?

It is mentioned briefly in the README when setting up configs and environment variables, but it doesn't seem to be present on the server. Was a 774M model ever trained on the Russian corpus? Quoting from the README:

# GPT-2 774M, final perplexity 21.09?

export CUDA_VISIBLE_DEVICES=3
export MODEL_SIZE=gpt2-large
export OUTPUT=output_yt/l
export BS=1
export LR=1e-5

num_samples should be a positive integer value, but got num_samples=0

Hi. I'm extremely new to neural networks, but I've managed to set up a TPU (my RTX 2060 was not enough, it seems, since I was getting OOMs) and get the python /pytorch/xla/test/test_train_mp_mnist.py test running. I've downloaded the medium untuned Russian model and put it into $OUTPUT.
Now I want to fine-tune it on a log of chat messages (about 1 million messages, a ~50 MB text file).
I have a few questions that were not clear to me from the README:

  1. Should I just use the original text file or the one that I can get using corpus notebook?
  2. The validation file is just a plain simple text that looks like something I'd like to get at the end when generating text, is that correct?
  3. No matter what train file I choose I get this error:
06/12/2020 15:29:23 - INFO - __mp_main__ -   Loading features from ./data/full/chat.txt  
Exception in device=TPU:1: num_samples should be a positive integer value, but got num_samples=0

The cached folder does get created with a file in it, but it fails to extract the features from my dataset, as I understand it. What could be the reason? Do I need to prepare the text in some other way?
I also had to comment out these two lines in yt_encoder.py, since I was getting the "Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up" error:

self.max_len_single_sentence = 1024
self.max_len_sentences_pair = 1024

Could that be the reason, and if it is, how do I get rid of the initial error without breaking things?
Thanks

Beginner question about the models

Hello and thank you for the great work.

This might be a rather stupid question, but I'm just beginning with NLP. Apologies in advance.
Could you give me a brief intro to the files generated in the models?

I am asking this because when I try to load the head model class and the tokenizer class I get the following error: We assumed '../russian_models/gpt2/m_checkpoint-3364613' was a path or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url

Of course, those files are not present there, but I'm not sure where to start at the moment.
And a second question, if possible: are you licensing this project under MIT, by any chance?

Thank you in advance.
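
A hedged loading sketch: these checkpoints were trained with the repo's YTEncoder tokenizer (a YouTokenToMe BPE wrapper, see yt_encoder.py and the bpe/yt.model file), not with GPT2Tokenizer, so vocab.json and merges.txt simply do not exist in the checkpoint directory. Paths below are illustrative.

from transformers import GPT2LMHeadModel
from yt_encoder import YTEncoder     # ships with the ru_transformers repo

tokenizer = YTEncoder.from_pretrained("bpe/yt.model")
model = GPT2LMHeadModel.from_pretrained("../russian_models/gpt2/m_checkpoint-3364613")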

temperature vs overfitting?

This may be a rather generic question for NLP/NLG:
does increasing the temperature help fight overfitting somehow?
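
Roughly speaking, no: temperature only rescales the logits at sampling time and does not touch the trained weights, so it cannot undo overfitting; raising it just trades fidelity for diversity (which can make memorized continuations less likely to be picked, but does not remove them). A minimal sketch of temperature sampling:

import torch
import torch.nn.functional as F

def sample_with_temperature(logits: torch.Tensor, temperature: float = 1.0) -> int:
    probs = F.softmax(logits / temperature, dim=-1)   # T > 1 flattens, T < 1 sharpens
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(50257)                 # fake vocabulary-sized logits
token_id = sample_with_temperature(logits, temperature=0.9)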

Colab notebook GPU issue

Running the notebook provided here, I got the error below:

Traceback (most recent call last):
  File "run_lm_finetuning.py", line 662, in <module>
    main()
  File "run_lm_finetuning.py", line 630, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_lm_finetuning.py", line 296, in train
    model, optimizer = amp.initialize(model.to('cuda'), optimizer, opt_level=args.fp16_opt_level)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/frontend.py", line 358, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py", line 171, in _initialize
    check_params_fp32(models)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/_initialize.py", line 93, in check_params_fp32
    name, param.type()))
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/_amp_state.py", line 32, in warn_or_err
    raise RuntimeError(msg)
RuntimeError: Found param transformer.wte.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.

Even after changing the line to:
device = torch.device("cuda")

I get another error:


10/21/2020 08:22:51 - INFO - transformers.modeling_utils -   loading weights file gpt2/m_checkpoint-3364613/pytorch_model.bin
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=47 error=100 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "run_lm_finetuning.py", line 662, in <module>
    main()
  File "run_lm_finetuning.py", line 596, in main
    model.to(args.device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 607, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 605, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 190, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:47

I am definitely running a GPU environment.
Am I doing something wrong?

Provide the Google Colab ipynb

Please provide a working ipynb. I'm trying to run it in Google Colab (using transformers), but it's so messy it won't even run. And where is the transformers-compliant tokenizer?


Update 17.07: no response.

Confused on how to train the model.

Hello,

Sorry if this is a silly question. I am trying to fine-tune the English GPT-2 model on my language. I tried without the "while loop" and got a memory problem (OOM error) and the run crashed. So then I tried to run it like this:

export TRAIN_FILE=/ru_transformers/fulljan21
export CUDA_VISIBLE_DEVICES=2
export MODEL_SIZE=gpt2-large
export OUTPUT=output_yt/lfeb
export BS=1
export LR=1e-5

  while true
  do
      python run_lm_finetuning.py \
          --output_dir=$OUTPUT \
          --model_type=gpt2 \
          --model_name_or_path=$OUTPUT \
          --do_train \
          --train_data_file=$TRAIN_FILE \
          --per_gpu_train_batch_size $BS \
          --save_steps=10000 \
          --logging_steps=10 \
          --fp16 \
          --fp16_opt_level O2 \
          --warmup_samples 16000 \
          --learning_rate $LR \
          --overwrite_output_dir \
          --tokenizer_class YTEncoder \
          --tokenizer_name bpe/yt.model \
          --do_eval \
          --evaluate_during_training \
          --eval_steps 1000 \
          --eval_data_file=./data/classic/valid \
          --save_total_limit 30 \
          --num_train_epochs 10.0 \
          --unfreeze_level 0

      sleep 1
  done

The model has now been running for around two weeks and it keeps evaluating. I cannot see the epoch percentage or the sample.txt file that I could see when I was running the model with a very small sample of data (2 GB). Now my files are around 40 GB. I am using an NVIDIA GV102 graphics card and it is running at around 40% utilization. Am I doing something wrong?

Currently I have this output:

 02/10/2021 14:35:22 - INFO - __main__ -   Loading features from ./evaluate_dir
 ./evaluate_dir

1675459███████████████████████████████████████████████████████████████████████████████████████████████| 100.00% [2259/2259 00:17<00:00]

 02/10/2021 14:35:39 - INFO - __main__ -   ***** Running evaluation checkpoint-27000 *****
 02/10/2021 14:35:39 - INFO - __main__ -     Num examples = 1675459
 02/10/2021 14:35:39 - INFO - __main__ -     Batch size = 4
 Evaluating:  17%|███████████████████████████████▏                                                                                                                                                   | 72983/418865 [58:22<4:35:44, 20.91it/s]

Thank you

Didn't find files vocab.json and merges.txt in model

Hi!
When I try to run with models downloaded from aws, I get an error:
I1227 05:17:21.238354 140364474943296 tokenization_utils.py:335] Didn't find file gpt2/s_checkpoint-1900000/vocab.json. We won't load it.
I1227 05:17:21.238430 140364474943296 tokenization_utils.py:335] Didn't find file gpt2/s_checkpoint-1900000/merges.txt. We won't load it.
I1227 05:17:21.238494 140364474943296 tokenization_utils.py:359] Didn't find file gpt2/s_checkpoint-1900000/added_tokens.json. We won't load it.
I1227 05:17:21.238545 140364474943296 tokenization_utils.py:359] Didn't find file gpt2/s_checkpoint-1900000/special_tokens_map.json. We won't load it.
I1227 05:17:21.238594 140364474943296 tokenization_utils.py:359] Didn't find file gpt2/s_checkpoint-1900000/tokenizer_config.json. We won't load it.
Where can I find vocab.json and merges.txt, or the tokenization files? I tried starting with the files from the "bpe" directory, but nothing happened.
Thanks.

New version of fast ai broke everything?

It seems that a newer version of fastai broke this repository. I'm getting an error:

conda ImportError: cannot import name progress_bar from fastprogress

which disappears if I install fastai==1.0.59.

But there are still errors with different packages like transformers. Could you please send the output of your pip freeze inside the working environment?

Usage of service

Hello!

Superb work! I've been looking for something like this for quite a long time :)

But I am very new to Python and especially to DL, so I have two questions:

  1. How do I use it to get text output?

  2. What are the requirements for the model? For example, I have a lot of Russian dialogues, usually not very connected to each other; will that fit well?

Saving the model freezes the TPU when using the latest torch_xla package and provided tpu_lm_finetuning script

xm.save(model_to_save.state_dict(), output_model_file)
This call freezes the TPU because of the rendezvous inside the xm.save(...) function (https://github.com/pytorch/xla/blob/46bff8a6e12035f1857c52e74e263c7077cd3ed2/torch_xla/core/xla_model.py#L635). In tpu_lm_finetuning.py, however, the call is made only when xm.is_master_ordinal(), so the rendezvous point is never reached by the other processes.
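
A hedged sketch of the fix this implies: have every TPU process call xm.save (it writes only on the master ordinal by default, but all ordinals must reach its internal rendezvous), instead of guarding the call with xm.is_master_ordinal(). The variable names come from the issue text.

import torch_xla.core.xla_model as xm

# before (hangs): only the master ever reaches the rendezvous inside xm.save
# if xm.is_master_ordinal():
#     xm.save(model_to_save.state_dict(), output_model_file)

# after: every process participates; master_only=True (the default) still writes one file
xm.save(model_to_save.state_dict(), output_model_file)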

Training doesn't start, just says "loading weights..."

08/08/2020 14:07:25 - INFO - transformers.modeling_utils -   loading weights file ../all/classic/m_checkpoint-3396533/pytorch_model.bin

At that line, the script just stops. There is no error message or anything, but training doesn't start either. I'm running it on Colab, using all the libraries from the pip freeze posted here in the issues. What's wrong?

CUDA out of memory.

I'm trying to fine-tune the medium Russian model on my small amount of data.

After 5-10 seconds I get an error:
RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 7.79 GiB total capacity; 6.59 GiB already allocated; 74.06 MiB free; 6.79 GiB reserved in total by PyTorch)

My script:

export TRAIN_FILE=./corpus/texts.txt

export CUDA_VISIBLE_DEVICES=0
export MODEL_SIZE=gpt2-medium
export OUTPUT=gpt2/m_checkpoint-3364613
export BS=3
export LR=3e-5

while true
do
    python run_lm_finetuning.py \
        --output_dir=$OUTPUT \
        --model_type=gpt2 \
        --model_name_or_path=$OUTPUT \
        --do_train \
        --train_data_file=$TRAIN_FILE \
        --per_gpu_train_batch_size $BS \
        --save_steps=10000 \
        --logging_steps=10 \
        --fp16 \
        --fp16_opt_level O2 \
        --warmup_samples 16000 \
        --learning_rate $LR \
        --overwrite_output_dir \
        --tokenizer_class YTEncoder \
        --tokenizer_name bpe/yt.model \
        --do_eval \
        --evaluate_during_training \
        --eval_steps 1000 \
        --eval_data_file=./data/classic/valid \
        --save_total_limit 30 \
        --num_train_epochs 10.0 \
        --unfreeze_level 0

    sleep 1
done

What should I do to lower the amount of memory used?
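
A generic PyTorch sketch (not this repo's script): on an 8 GB GPU the usual levers are a smaller per-step batch together with gradient accumulation, which keeps the effective batch size while lowering peak memory. The upstream Hugging Face fine-tuning script exposes this as --gradient_accumulation_steps; whether this fork kept that flag is an assumption, otherwise simply lower BS.

import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
accumulation_steps = 4                       # effective batch = BS * accumulation_steps

optimizer.zero_grad()
for step in range(100):
    batch = torch.randn(1, 1024, device="cuda")              # per-step batch of 1
    loss = model(batch).pow(2).mean() / accumulation_steps   # scale so gradients average
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()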

generation of text

Hi,

After training my GPT-2 in Portuguese following your README.md, I tried to generate text. I used the following command in my terminal but got an error message. Could you give me the correct command?
Thank you.

export OUTPUT=output_yt/s
python run_generation.py \
    --model_type=gpt2 \
    --model_name_or_path=$OUTPUT \
    --padding_text="Eu gosto do carro que comprei ontem"

The error message I got:

01/06/2020 15:25:37 - INFO - transformers.tokenization_utils -   Model name 'output_yt/s' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'output_yt/s' is a path or url to a directory containing tokenizer files.
01/06/2020 15:25:37 - INFO - transformers.tokenization_utils -   Didn't find file output_yt/s/vocab.json. We won't load it.
01/06/2020 15:25:37 - INFO - transformers.tokenization_utils -   Didn't find file output_yt/s/merges.txt. We won't load it.
01/06/2020 15:25:37 - INFO - transformers.tokenization_utils -   Didn't find file output_yt/s/added_tokens.json. We won't load it.
01/06/2020 15:25:37 - INFO - transformers.tokenization_utils -   Didn't find file output_yt/s/special_tokens_map.json. We won't load it.
01/06/2020 15:25:37 - INFO - transformers.tokenization_utils -   Didn't find file output_yt/s/tokenizer_config.json. We won't load it.
Traceback (most recent call last):
  File "run_generation.py", line 204, in <module>
    main()
  File "run_generation.py", line 166, in main
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
  File "/opt/anaconda3/envs/gpt/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 302, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/opt/anaconda3/envs/gpt/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 370, in _from_pretrained
    list(cls.vocab_files_names.values())))
OSError: Model name 'output_yt/s' was not found in tokenizers model name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). We assumed 'output_yt/s' was a path or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
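
A hedged generation sketch that sidesteps run_generation.py: the output_yt/s checkpoint was trained with the repo's YTEncoder rather than GPT2Tokenizer, so the stock script cannot find vocab.json/merges.txt. With a reasonably recent transformers, something like this should work (paths and sampling settings are illustrative):

import torch
from transformers import GPT2LMHeadModel
from yt_encoder import YTEncoder

tokenizer = YTEncoder.from_pretrained("bpe/yt.model")
model = GPT2LMHeadModel.from_pretrained("output_yt/s").eval()

input_ids = torch.tensor([tokenizer.encode("Eu gosto do carro que comprei ontem")])
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, do_sample=True, temperature=0.9)
print(tokenizer.decode(output[0].tolist()))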

YouTube tutorial

Hi!
It is a bit difficult for me as a beginner to understand the steps. Could you make a YouTube tutorial?
Thank you!

Notepad

First of all, thank you for your work!
On behalf of beginners and people who are simply interested in this topic, could you please make an easy-to-use Jupyter notebook (or even a Google Colab instance)?

I'm thinking of something like this, but maybe a little simpler.

P.S.
I tried to do this myself (with the help of closed issues), but failed due to lack of experience.

P.P.S.
I'm aware that the project web page exists, but it's too inflexible (no offence, it's quite good and entertaining) for real creative use.

Training on tokens from a text file, and how can I achieve the best model?

Dear all,

I am new to the field of NLP. I found the transformers library, which is amazingly good at generating text.

I came across your post about how to train a new language using transformers.

Based on that, I have a question. I already have a bunch of programming-language tokens in a text file (.txt), gathered from many repositories and separated by spaces. I would like to first train a tokenizer model, as you recommend, and then use the run_lm_finetuning script to fine-tune the model as you have suggested.

For this purpose, what changes do I need to make to the code, and how can I get the best model?
Please advise.

TPU hanging with message "Waiting to connect to client mesh master (300 seconds) localhost:57343"

Thanks to your detailed instructions, I am able to run the training loop, but after some time it gets stuck. When I press Ctrl-C, we can see it is stuck in socket polling.
Previously I was able to run the MNIST example successfully, but once this error appears, even running that throws the same error.
Have you ever faced this? Any idea how I can get around it? I have tried restarting the TPU, but the same problem occurs.

Iteration: 100%|############################################################################################################################################################| 32/32 [01:47<00:00,  3.37s/it]
100%|#########################################################################################################################################################################| 1/1 [00:00<00:00, 39.03it/s]
Evaluating: 36it [00:23,  1.55it/s]                                                                                                                                                   | 0/1 [00:00<?, ?it/s]
100%|#########################################################################################################################################################################| 1/1 [00:00<00:00,  4.96it/s]
Epoch: 100%|#################################################################################################################################################################| 1/1 [02:15<00:00, 135.51s/it]
2020-03-08 09:46:32.542284: I      68 tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57343
^CTraceback (most recent call last):
  File "tpu_lm_finetuning.py", line 697, in <module>
    xmp.spawn(main)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 78, in join
    timeout=timeout,
  File "/root/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt

Unable to use web app

I am deploying the model as per the instructions, but I am getting either '404 Not Found' or '405 Method Not Allowed'. What am I doing wrong?

(gpt) nikhil_subscribed@fastai-1:~/ru_transformers$ uvicorn rest:app --reload --host 0.0.0.0
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [6759]
INFO:     TensorFlow version 2.1.0 available.
INFO:     PyTorch version 1.4.0 available.
INFO:     Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-03-08 14:21:07.099055: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-03-08 14:21:07.105627: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-03-08 14:21:07.106342: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5623f942d910 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-08 14:21:07.106381: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-03-08 14:21:07.106523: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:     127.0.0.1:60018 - "GET /gpt2_poetry/ HTTP/1.1" 405 Method Not Allowed
INFO:     127.0.0.1:60020 - "GET / HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:60030 - "GET /gpt2_poetry HTTP/1.1" 307 Temporary Redirect
INFO:     127.0.0.1:60030 - "GET /gpt2_poetry/ HTTP/1.1" 405 Method Not Allowed
INFO:     127.0.0.1:60040 - "GET /gpt/medium HTTP/1.1" 404 Not Found
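
For reference, the FastAPI endpoints accept POST requests with a JSON body, which is why plain browser GETs return 404/405. A minimal client sketch, with field names mirroring the curl example earlier on this page (they may differ from the actual schema):

import requests

resp = requests.post(
    "http://0.0.0.0:8000/gpt2_poetry/",
    json={"prompt": "Мой дядя самых честных правил", "length": 60, "num_samples": 3},
)
print(resp.json())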

Memory Issue

Hello,

I have 28 GB of text and I want to train from scratch. I have 4 GPUs (product: GV102, vendor: NVIDIA Corporation) and the training crashes due to memory issues. I saw this in your README:

 # My dataset is 230Gb and it doesn't fit in RAM, so each epoch is a random sample from it. That is why the loop.
while true
do
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$OUTPUT \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=10 \
    --fp16 \
    --fp16_opt_level O2 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --overwrite_output_dir \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --save_total_limit 30 \
    --num_train_epochs 10.0 \
    --unfreeze_level 0

sleep 1

done

So if I use these parameters to train the model, when will it stop, given that you use while true? Do you believe this will fix the memory problem? Also, does it eventually use all the training data, or is each epoch a random sample as you say?

Thank you
