Comments (8)

EmilyReif commented on August 22, 2024

Hi, thanks for reaching out. When do you get the error (i.e., when training the model or when running it)? Also, can you share the rest of the stack trace?

from lit.

entslscheia commented on August 22, 2024

@EmilyReif Thanks for the quick response!
The complete stack trace is as follows:

2020-08-17 17:02:48.345849: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-08-17 17:02:50.591791: E tensorflow/stream_executor/cuda/cuda_blas.cc:225] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-08-17 17:02:50.600075: E tensorflow/stream_executor/cuda/cuda_blas.cc:225] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-08-17 17:02:50.600110: W tensorflow/stream_executor/stream.cc:2055] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/scratch/gu.826/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/scratch/gu.826/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/scratch/gu.826/lit/lit_nlp/examples/quickstart_sst_demo.py", line 60, in <module>
    app.run(main)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/scratch/gu.826/lit/lit_nlp/examples/quickstart_sst_demo.py", line 48, in main
    run_finetuning(model_path)
  File "/scratch/gu.826/lit/lit_nlp/examples/quickstart_sst_demo.py", line 40, in run_finetuning
    model = glue_models.SST2Model(FLAGS.encoder_name, for_training=True)
  File "/scratch/gu.826/lit/lit_nlp/examples/models/glue_models.py", line 319, in __init__
    **kw)
  File "/scratch/gu.826/lit/lit_nlp/examples/models/glue_models.py", line 79, in __init__
    from_pt=True,
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_auto.py", line 941, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_utils.py", line 393, in from_pretrained
    return load_pytorch_checkpoint_in_tf2_model(model, resolved_archive_file, allow_missing_keys=True)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_pytorch_utils.py", line 93, in load_pytorch_checkpoint_in_tf2_model
    tf_model, pt_state_dict, tf_inputs=tf_inputs, allow_missing_keys=allow_missing_keys
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_pytorch_utils.py", line 125, in load_pytorch_weights_in_tf2_model
    tf_model(tf_inputs, training=False)  # Make sure model is built
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_bert.py", line 924, in call
    outputs = self.bert(inputs, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_bert.py", line 572, in call
    encoder_outputs = self.encoder([embedding_output, extended_attention_mask, head_mask], training=training)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_bert.py", line 378, in call
    layer_outputs = layer_module([hidden_states, attention_mask, head_mask[i]], training=training)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_bert.py", line 354, in call
    attention_outputs = self.attention([hidden_states, attention_mask, head_mask], training=training)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_bert.py", line 303, in call
    self_outputs = self.self_attention([input_tensor, attention_mask, head_mask], training=training)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/transformers/modeling_tf_bert.py", line 232, in call
    mixed_query_layer = self.query(hidden_states)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py", line 1198, in call
    dtype=self._compute_dtype_object)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/ops/core.py", line 56, in dense
    outputs = standard_ops.tensordot(inputs, kernel, [[rank - 1], [0]])
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 4519, in tensordot
    ab_matmul = matmul(a_reshape, b_reshape)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 3255, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5624, in mat_mul
    _ops.raise_from_not_ok_status(e, name)
  File "/scratch/gu.826/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(15, 128), b.shape=(128, 128), m=15, n=128, k=128 [Op:MatMul]

Any idea?

iftenney commented on August 22, 2024

This looks like a fairly low-level error. Are you able to run any of the examples from https://github.com/huggingface/transformers?

This one uses TF2 and the same underlying Keras models as our sentiment demo: https://github.com/huggingface/transformers/blob/master/examples/text-classification/run_tf_glue.py

entslscheia commented on August 22, 2024

@iftenney I have worked with the PyTorch version of Transformers a lot but have never used it with TF2. I can check whether I can get that example to run. But since your tool is described as framework-agnostic, I guess I can just ignore this TF2 error and learn how to use the tool with PyTorch, so it is not a blocker for me.

selous123 commented on August 22, 2024

I ran into the same problem after upgrading TensorFlow from 2.0.0 to 2.3.0.

Environment:

RTX 2080 Ti + CUDA 10.0 + cuDNN 7.6.0 + Ubuntu 16.04 + tensorflow-gpu 2.3.0

The code is very simple.

import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()
print(predictions)

Error Msg:

2020-09-01 12:48:46.703511: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-01 12:48:47.713105: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-01 12:48:47.740276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:47.740779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.75GiB deviceMemoryBandwidth: 573.69GiB/s
2020-09-01 12:48:47.740798: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-01 12:48:47.741669: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-01 12:48:47.742320: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-01 12:48:47.742489: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-01 12:48:47.743454: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-01 12:48:47.744218: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-01 12:48:47.746263: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-01 12:48:47.746327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:47.746794: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:47.747208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-01 12:48:47.747393: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-09-01 12:48:47.772100: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3192000000 Hz
2020-09-01 12:48:47.772975: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e3db1aba90 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-09-01 12:48:47.772994: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-09-01 12:48:47.852506: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:47.852956: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e3db217d60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-01 12:48:47.852971: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-09-01 12:48:47.853082: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:47.853437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.75GiB deviceMemoryBandwidth: 573.69GiB/s
2020-09-01 12:48:47.853459: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-01 12:48:47.853472: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-01 12:48:47.853480: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-01 12:48:47.853489: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-01 12:48:47.853497: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-01 12:48:47.853504: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-01 12:48:47.853512: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-01 12:48:47.853544: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:47.853904: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:47.854236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-09-01 12:48:47.854253: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-01 12:48:48.306067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-01 12:48:48.306095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-09-01 12:48:48.306101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-09-01 12:48:48.306254: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:48.306654: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-01 12:48:48.307018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9554 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-09-01 12:48:48.420102: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-01 12:48:48.420404: E tensorflow/stream_executor/cuda/cuda_blas.cc:225] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-09-01 12:48:48.420442: E tensorflow/stream_executor/cuda/cuda_blas.cc:225] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-09-01 12:48:48.420448: W tensorflow/stream_executor/stream.cc:2055] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "tf2_test.py", line 16, in <module>
    predictions = model(x_train[:1]).numpy()
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/keras/engine/sequential.py", line 372, in call
    return super(Sequential, self).call(inputs, training=training, mask=mask)
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/keras/engine/functional.py", line 386, in call
    inputs, training=training, mask=mask)
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/keras/engine/functional.py", line 508, in _run_internal_graph
    outputs = node.layer(*args, **kwargs)
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/keras/layers/core.py", line 1198, in call
    dtype=self._compute_dtype_object)
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/keras/layers/ops/core.py", line 53, in dense
    outputs = gen_math_ops.mat_mul(inputs, kernel)
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5624, in mat_mul
    _ops.raise_from_not_ok_status(e, name)
  File "/home/ibrain/anaconda3/envs/nlp/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(1, 784), b.shape=(784, 128), m=1, n=128, k=784 [Op:MatMul]

Everything worked with TensorFlow 2.0.0, but it fails with TensorFlow 2.3.0.
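
One way to narrow down a version mismatch like this is to check which CUDA libraries TensorFlow actually loaded. The following is a minimal sketch, assuming the log format shown above (the "Successfully opened dynamic library ..." lines). The helper name `loaded_cuda_libs` is hypothetical, and the code uses only the Python standard library:

```python
import re

# Matches dso_loader lines such as:
#   "... Successfully opened dynamic library libcudart.so.10.1"
_LIB_RE = re.compile(r"Successfully opened dynamic library (lib\w+)\.so\.([\d.]+)")


def loaded_cuda_libs(log_text):
    """Return {library name: sorted list of versions} found in a TensorFlow log."""
    found = {}
    for name, version in _LIB_RE.findall(log_text):
        found.setdefault(name, set()).add(version)
    return {name: sorted(versions) for name, versions in found.items()}


# Three lines copied from the log above.
sample_log = """\
2020-09-01 12:48:46.703511: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-01 12:48:47.713105: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-01 12:48:47.741669: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
"""

print(loaded_cuda_libs(sample_log))
```

In the log above, the loader reports libcudart.so.10.1 while the environment was described as CUDA 10.0, which may be worth a closer look.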

iftenney commented on August 22, 2024

This does look like a much lower-level error than LIT, unfortunately.

Some issues on the TensorFlow tracker point to GPU memory (tensorflow/tensorflow#25403), and there are specific reports of RTX cards needing a certain config option set. Do any of those fixes work?
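
The config option is not named in this comment. One setting that is often suggested for this kind of cuBLAS initialization failure is GPU memory growth, so the sketch below assumes that is the option meant. It sets the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable, which TensorFlow reads when it initializes the GPU; the helper name `allow_gpu_memory_growth` is hypothetical.

```python
import os


def allow_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all.

    Assumption: memory growth is the config option meant in the comment above.
    Set this before TensorFlow initializes the GPU; doing it before the
    import is the simplest way to be safe.
    """
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]


allow_gpu_memory_growth()
# import tensorflow as tf  # import only after the variable is set
```

The in-code equivalent in TF 2.x is to call `tf.config.experimental.set_memory_growth(gpu, True)` for each device returned by `tf.config.experimental.list_physical_devices('GPU')`, before any GPU has been initialized.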

selous123 commented on August 22, 2024

Thanks for your reply.

All of these fixes target earlier versions of TensorFlow.

Everything works with TensorFlow 2.0.0 in my environment, but it fails with TensorFlow 2.3.0.

Maybe something in the newest version is not compatible with my environment, but I haven't found what.

Good luck to me!

Holy shit, TensorFlow (just complaining, never mind).

selous123 commented on August 22, 2024

Everything works after I updated CUDA to 10.1.
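
This fits TensorFlow's tested build configurations, which list CUDA 10.0 for tensorflow-gpu 2.0.0 and CUDA 10.1 for 2.1.0 through 2.3.0. A minimal sketch of that check follows. The table is a hand-copied excerpt covering only the releases mentioned in this thread, and `expected_cuda` and `cuda_matches` are hypothetical helpers, not TensorFlow APIs.

```python
# Excerpt of TensorFlow's tested GPU build configurations (Linux),
# limited to the releases discussed in this thread.
TESTED_CUDA = {
    "2.0": "10.0",
    "2.1": "10.1",
    "2.2": "10.1",
    "2.3": "10.1",
}


def expected_cuda(tf_version):
    """Return the CUDA version a tensorflow-gpu release was tested with, or None."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return TESTED_CUDA.get(major_minor)


def cuda_matches(tf_version, installed_cuda):
    """True if the installed CUDA version equals the tested one for this release."""
    return expected_cuda(tf_version) == installed_cuda


print(cuda_matches("2.0.0", "10.0"))  # working setup from the thread
print(cuda_matches("2.3.0", "10.0"))  # failing setup from the thread
print(cuda_matches("2.3.0", "10.1"))  # setup after the CUDA upgrade
```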
