GPT3XL training (gpt-neo issue, 10 comments, closed)

eleutherai commented on July 3, 2024
GPT3XL training

from gpt-neo.

Comments (10)

StellaAthena commented on July 3, 2024

> Hey, it seems to be working now.
> I collected a few small changes needed to get the Colab example running:
>
>   1. Change the tokenizers version in the requirements file to 0.9.4, or add the command
>     !pip install tokenizers==0.9.4
>   2. In the "# Tokenize Data" line, rename the argument "base_dir" to "input_dir".
>   3. Delete the argument "--use_gpt2_tokenizer", because the fast GPT-2 tokenizer is already used by default.
>
> I may add more changes soon if needed.

Great! Can you put these changes on a branch and open a PR? That way we can verify that it doesn’t break anything on the TPUs and merge it.

srulikbd commented on July 3, 2024

yeah, of course. I'll do that as soon as possible.
well done for your awesome work!

srulikbd commented on July 3, 2024

There is an incompatibility between the tokenizers and transformers versions (the setup installs the current transformers version but the old tokenizers one).

Which versions should we use?

loretoparisi commented on July 3, 2024

@srulikbd I asked Thomas Wolf from HF about this, and his suggestion was to use the latest version of both. Could you be more specific about the tokenizers version issue?
Thank you.

srulikbd commented on July 3, 2024

Hey, it seems to be working now.
I collected a few small changes needed to get the Colab example running:

  1. Change the tokenizers version in the requirements file to 0.9.4, or add the command
    !pip install tokenizers==0.9.4
  2. In the "# Tokenize Data" line, rename the argument "base_dir" to "input_dir".
  3. Delete the argument "--use_gpt2_tokenizer", because the fast GPT-2 tokenizer is already used by default.

I may add more changes soon if needed.
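
To make the version question easier to answer, here is a minimal sketch for reporting what is actually installed in the runtime after the pin. It relies only on the standard library's importlib.metadata (Python 3.8+, so newer than the Python 3.6 runtime used at the time); the helper names installed_version and check_pin are illustrative and not part of gpt-neo.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package, expected):
    """Describe whether the installed version of `package` matches `expected`."""
    found = installed_version(package)
    if found is None:
        return f"{package}: not installed (expected {expected})"
    status = "ok" if found == expected else "MISMATCH"
    return f"{package}: installed {found}, expected {expected} [{status}]"


if __name__ == "__main__":
    # 0.9.4 is the pin suggested in step 1 above.
    print(check_pin("tokenizers", "0.9.4"))
    # No pin was agreed on for transformers in this thread, so only report it.
    print("transformers:", installed_version("transformers"))
```

Pasting the two printed lines into the issue would show exactly which tokenizers/transformers pair the notebook ended up with.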

srulikbd commented on July 3, 2024

@StellaAthena hey.
I got to the training stage, but it got stuck for some reason. Do you have any idea why?
I could easily run train_enwik8 with the gpt-neox library... what is the difference between the two packages?

Here is the output from running the GPTNeo example on Google Colab:

2021-01-08 22:33:49.795424: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:49.795465: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 0
Saving config to /content/GPTNeo/model_weights
2021-01-08 22:33:53.177601: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-08 22:33:53.177746: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-08 22:33:53.177944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-08 22:33:53.178523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2021-01-08 22:33:53.178667: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178760: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.178842: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.180363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-08 22:33:53.180792: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-08 22:33:53.182284: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-08 22:33:53.182413: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182497: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-01-08 22:33:53.182519: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2021-01-08 22:33:53.285094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-08 22:33:53.285162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-08 22:33:53.285182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-08 22:33:53.291654: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7f9ed91c0158>, {'n_head': 32, 'n_vocab': 50260, 'embed_dropout': 0, 'lr': 0.0002, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'opt_name': 'adam', 'weight_decay': 0.1, 'train_batch_size': 1, 'attn_dropout': 0, 'train_steps': 1, 'eval_steps': 0, 'predict_steps': 1, 'res_dropout': 0, 'eval_batch_size': 64, 'predict_batch_size': 1, 'iterations': 1, 'n_embd': 2048, 'datasets': [['openwebtexts', 21, 'documents_random', 1.0]], 'model': 'GPT', 'model_path': '/content/GPTNeo/model_weights', 'n_ctx': 2048, 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global'], 'mesh_shape': 'x:4,y:2', 'layout': 'intermediate_expanded:x,heads:x,vocab:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 2048, 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'openwebtexts': {'path': '/content/GPTNeo/openwebtext-small/bundestag_*.tfrecords', 'eval_path': '', 'n_vocab': 50256, 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 8, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': False, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 2, 'predict': False, 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': '/content/GPTNeo/model_weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1, num_shards=8, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': None}
_TPUContext: eval_on_tpu True
eval_on_tpu ignored because use_tpu is False.
From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:root:Changing batch size with sequential_input() will result in some data being skipped or repeated. Please ensure your batch size stays constant throughout training.
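
One thing the output above does show is that several CUDA libraries could not be loaded and that TensorFlow then logged "Skipping registering GPU devices...". Whether that explains the stall is not established in this thread. Here is a minimal sketch for pulling those two facts out of a saved log; the function names are illustrative, and the sample lines are copied from the output above.

```python
import re

# Matches the dso_loader warning emitted when a shared library cannot be opened.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the libraries TensorFlow failed to load, in first-seen order."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def gpu_registration_skipped(log_text):
    """True if the log says TensorFlow gave up on registering GPU devices."""
    return "Skipping registering GPU devices" in log_text


# Four lines taken verbatim from the output quoted above.
SAMPLE_LOG = "\n".join([
    "2021-01-08 22:33:53.178667: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
    "Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: "
    "cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia",
    "2021-01-08 22:33:53.180363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] "
    "Successfully opened dynamic library libcufft.so.10",
    "2021-01-08 22:33:53.182497: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
    "Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: "
    "cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia",
    "2021-01-08 22:33:53.182519: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] "
    "Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above "
    "are installed properly if you would like to use GPU. Follow the guide at "
    "https://www.tensorflow.org/install/gpu for how to download and setup the required "
    "libraries for your platform. Skipping registering GPU devices...",
])

if __name__ == "__main__":
    print("missing:", missing_libraries(SAMPLE_LOG))
    print("GPU registration skipped:", gpu_registration_skipped(SAMPLE_LOG))
```

In the full output, the libraries that fail to load carry .so.11 or .so.8 suffixes, while the ones that open successfully are .so.10 builds, so the installed TensorFlow and the runtime's CUDA libraries appear to target different versions.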

StellaAthena commented on July 3, 2024

Where are you running this code? Are you using your own GPUs?

srulikbd commented on July 3, 2024

I tried both GPU and TPU on Colab. I also tried an AWS AMI instance with a V100.
The output I quoted is from the Colab GPU run.
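
The params in the quoted output list mesh_shape 'x:4,y:2' and num_cores 8 alongside a single entry in gpu_ids. A small sketch for comparing those numbers follows; it only does the arithmetic, and whether the difference matters for a single-GPU run is an open question in this thread. The helper names are illustrative and not taken from gpt-neo.

```python
from math import prod  # Python 3.8+


def parse_mesh_shape(mesh_shape):
    """Turn a string such as 'x:4,y:2' into {'x': 4, 'y': 2}."""
    dims = {}
    for part in mesh_shape.split(","):
        name, size = part.split(":")
        dims[name.strip()] = int(size)
    return dims


def mesh_size(mesh_shape):
    """Total number of mesh slots: the product of all dimension sizes."""
    return prod(parse_mesh_shape(mesh_shape).values())


# Values copied from the params printed in the quoted output.
params = {"mesh_shape": "x:4,y:2", "num_cores": 8, "gpu_ids": ["device:GPU:0"]}

if __name__ == "__main__":
    print("mesh slots:", mesh_size(params["mesh_shape"]))
    print("num_cores:", params["num_cores"])
    print("GPU devices listed:", len(params["gpu_ids"]))
```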

StellaAthena commented on July 3, 2024

> I tried both GPU and TPU on Colab. I also tried an AWS AMI instance with a V100.
> The output I quoted is from the Colab GPU run.

Sorry this slipped through the cracks. I assume you got everything working based on your PR?

srulikbd commented on July 3, 2024

actually it might still not work. I saw that you are focused on gptneox, so I switched over there :)
