It's not clear to me how to train the GPT3XL via GPU/Colab. Could you add more det

GPT3XL training about gpt-neo HOT 10 CLOSED

eleutherai commented on July 3, 2024 4

GPT3XL training

from gpt-neo.

Comments (10)

StellaAthena commented on July 3, 2024 2

Hey, it seems to be working now.
I collected the small changes needed to get the Colab example running:

Change the tokenizers version in the requirements file to 0.9.4, or add the command
!pip install tokenizers==0.9.4

In the # Tokenize Data line, rename the argument “base_dir” to “input_dir”.

Delete the argument “--use_gpt2_tokenizer”, because the fast GPT-2 tokenizer is used by default.
I might add more changes soon if needed.

Great! Can you put these changes on a branch and open a PR? That way we can verify that it doesn’t break anything on the TPUs and merge it.

srulikbd commented on July 3, 2024 2

Yeah, of course. I'll do that as soon as possible.
Well done on your awesome work!

srulikbd commented on July 3, 2024 1

There is an incompatibility between the tokenizers and transformers versions (the setup installs the current transformers version but the old tokenizers one).

Which versions should we use?

loretoparisi commented on July 3, 2024 1

@srulikbd I asked Thomas Wolf from HF about this, and he suggested using the latest version of both. Could you be more specific about the tokenizers version issue?
Thank you.
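
To make the version question concrete, here is a minimal sketch that reports which tokenizers and transformers versions a runtime actually has installed. It assumes Python 3.8 or newer for importlib.metadata, and the helper name installed_versions is made up for illustration. On an older runtime, such as the Python 3.6 one in the log further down the thread, `pip show tokenizers transformers` gives the same information.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["tokenizers", "transformers"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Posting this output alongside the error message would show exactly which pair of versions is clashing.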

srulikbd commented on July 3, 2024 1

Hey, it seems to be working now.
I collected the small changes needed to get the Colab example running:

Change the tokenizers version in the requirements file to 0.9.4, or add the command
!pip install tokenizers==0.9.4
In the # Tokenize Data line, rename the argument “base_dir” to “input_dir”.
Delete the argument “--use_gpt2_tokenizer”, because the fast GPT-2 tokenizer is used by default.
I might add more changes soon if needed.
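
The two argument changes above can be sketched as a small rewrite of the notebook command. This is only an illustration: the script name tokenize_data.py, the helper name patch_tokenize_command, and the assumption that the argument is passed as a --base_dir flag are all hypothetical and not taken from the Colab notebook.

```python
import shlex


def patch_tokenize_command(command):
    """Rename --base_dir to --input_dir and drop --use_gpt2_tokenizer."""
    patched = []
    for token in shlex.split(command):
        if token == "--use_gpt2_tokenizer":
            # Dropped: per the comment above, the fast GPT-2 tokenizer is the default.
            continue
        if token == "--base_dir" or token.startswith("--base_dir="):
            token = token.replace("--base_dir", "--input_dir", 1)
        patched.append(token)
    return " ".join(shlex.quote(t) for t in patched)


if __name__ == "__main__":
    # Hypothetical command line, used only to show the rewrite.
    old = "python tokenize_data.py --base_dir ./data --use_gpt2_tokenizer --name demo"
    print(patch_tokenize_command(old))
    # prints: python tokenize_data.py --input_dir ./data --name demo
```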

srulikbd commented on July 3, 2024

@StellaAthena Hey.
I got to the training stage, but it got stuck for some reason. Do you have any idea why?
I easily managed to run train_enwik8 with the gpt-neox library... What is the difference between the two packages?

Here is the output after running the GPTNeo example on Google Colab:

2021-01-08 22:33:49.795424: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:49.795465: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 0
Saving config to /content/GPTNeo/model_weights
2021-01-08 22:33:53.177601: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-08 22:33:53.177746: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-08 22:33:53.177944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 22:33:53.178523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-08 22:33:53.178667: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178760: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178842: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.180363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-08 22:33:53.180792: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-08 22:33:53.182284: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-08 22:33:53.182413: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182497: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182519: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-01-08 22:33:53.285094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-08 22:33:53.285162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-08 22:33:53.285182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-08 22:33:53.291654: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7f9ed91c0158>, {'n_head': 32, 'n_vocab': 50260, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0.1, 'train_batch_size': 1, 'attn_dropout': 0, 'train_steps': 1, 'eval_steps': 0, 'predict_steps': 1, 'res_dropout': 0, 'eval_batch_size': 64, 'predict_batch_size': 1, 'iterations': 1, 'n_embd': 2048, 'datasets': [['openwebtexts', 21, 'documents_random', 1.0]], 'model': 'GPT', 'model_path': '/content/GPTNeo/model_weights', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global'], 'mesh_shape': 'x:4,y:2', 'layout': 'intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 2048, 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'openwebtexts': {'path': '/content/GPTNeo/openwebtext-small/bundestag_*.tfrecords', 'eval_path': '', 'n_vocab': 50256, 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 8, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': False, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 2, 'predict': False, 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': '/content/GPTNeo/model_weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1, num_shards=8, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
_TPUContext: eval_on_tpu True
eval_on_tpu ignored because use_tpu is False.
From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:root:Changing batch size with sequential_input() will result in some data being skipped or repeated. Please ensure your batch size stays constant throughout training.
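
One thing the output above does show is that TensorFlow failed to load several CUDA 11 libraries and then printed "Skipping registering GPU devices...", so the Tesla T4 was not registered for this run. Whether that is why training looked stuck is not confirmed anywhere in the thread. The sketch below is a hedged illustration of how to pull that information out of such a log. The helper name summarize_gpu_log is made up, and the sample string is a shortened excerpt of the messages above.

```python
import re

# Matches the library name in TensorFlow's "Could not load dynamic library" warnings.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
GPU_SKIPPED = "Skipping registering GPU devices"


def summarize_gpu_log(log_text):
    """Return (sorted unique missing libraries, whether TF skipped the GPU)."""
    missing = sorted(set(MISSING_LIB.findall(log_text)))
    return missing, GPU_SKIPPED in log_text


# Shortened excerpt of the log above, used only as sample input.
sample = "\n".join([
    "Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file",
    "Successfully opened dynamic library libcufft.so.10",
    "Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file",
    "Cannot dlopen some GPU libraries. Skipping registering GPU devices...",
])

if __name__ == "__main__":
    missing, skipped = summarize_gpu_log(sample)
    print("missing libraries:", missing)
    print("GPU registration skipped:", skipped)
```

If the second value is True, the TensorFlow build and the CUDA libraries on the machine do not match, and the run is not using the GPU.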

StellaAthena commented on July 3, 2024

Where are you running this code? Are you using your own GPUs?

srulikbd commented on July 3, 2024

I tried both GPU and TPU on Colab. I also tried an AWS AMI instance with a V100.
The bug I quoted is from the Colab GPU run.

StellaAthena commented on July 3, 2024

I tried both GPU and TPU on Colab. I also tried an AWS AMI instance with a V100.
The bug I quoted is from the Colab GPU run.

Sorry this slipped through the cracks. I assume you got everything working based on your PR?

srulikbd commented on July 3, 2024

Actually, it might still not work. I saw that you are focused on gpt-neox, so I switched over there :)
