opennmt / ctranslate2
Fast inference engine for Transformer models
Home Page: https://opennmt.net/CTranslate2
License: MIT License
This would make the installation easier for users, but it could make the packaging more complex, especially for GPU support.
This issue is to track progress on this front.
Hi @guillaumekln
There seems to be an issue when deleting a model from a device other than the 0th one.
import ctranslate2
translator = ctranslate2.Translator(
    "enes_general_medium_ctranslate2",
    device="cuda",
    device_index=0)
del translator
--> OK
import ctranslate2
translator = ctranslate2.Translator(
    "enes_general_medium_ctranslate2",
    device="cuda",
    device_index=1)
del translator
--> ERROR
terminate called after throwing an instance of 'std::runtime_error'
what(): /root/ctranslate2-dev/src/primitives/cuda.cu:72: CUDA failed with error invalid resource handle
Aborted (core dumped)
(Inference works fine though, it's only when deleting the object that it fails.)
EDIT: This also happens when using the CLI entrypoint ctranslate2/bin/translate.
Hi,
I've been digging around for a while in the code integration but it is not clear to me which arguments are necessary. I guess "model" and "ct2_model" are not required at the same time...
Thanks
Hi,
Running pip install ctranslate2 with the latest pip, as per the installation instructions, results in the following:
ERROR: Could not find a version that satisfies the requirement ctranslate2== (from versions: none)
ERROR: No matching distribution found for ctranslate2==
> pip --version
pip 20.0.2 from [...]/lib/python3.8/site-packages/pip (python 3.8)
> conda --version
conda 4.7.12
This is on macOS Mojave 10.14.6 (18G2022)
We should look into implementing the TopK layer with a custom CUDA kernel instead of using TensorRT. The motivation is to remove the TensorRT and cuDNN dependencies (cuDNN is a dependency of TensorRT).
The benefits are:
On OS X Catalina, now I get this error when I try to convert a model:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/panos/Development/CTranslate2/python/ctranslate2/bin/opennmt_tf_converter.py", line 23, in <module>
main()
File "/Users/panos/Development/CTranslate2/python/ctranslate2/bin/opennmt_tf_converter.py", line 19, in main
tgt_vocab=args.tgt_vocab).convert_from_args(args)
File "/Users/panos/Development/CTranslate2/python/ctranslate2/converters/converter.py", line 39, in convert_from_args
force=args.force)
File "/Users/panos/Development/CTranslate2/python/ctranslate2/converters/converter.py", line 53, in convert
src_vocab, tgt_vocab = self._load(model_spec)
File "/Users/panos/Development/CTranslate2/python/ctranslate2/converters/opennmt_tf.py", line 107, in _load
tgt_vocab=self._tgt_vocab)
File "/Users/panos/Development/CTranslate2/python/ctranslate2/converters/opennmt_tf.py", line 57, in load_model
src_vocab = _get_asset_path(imported.examples_inputter.features_inputter)
AttributeError: 'AutoTrackable' object has no attribute 'examples_inputter'
Intel MKL is currently required to use the project on CPU. However, it is not always a good fit, especially on non-Intel hardware. It is likely that MKL checks the CPU vendor ID before activating some fast execution paths.
See for example this performance analysis on AMD Epyc where Intel MKL has poor results.
1. Integrate an alternative GEMM
The main requirements are:
BLIS appears to be a good candidate.
2. Dynamically select a GEMM backend
We should consider compiling with multiple backends and selecting one at runtime (e.g. on GenuineIntel call Intel MKL, otherwise call BLIS); a sketch of this dispatch idea follows at the end of this list.
3. (optional) Integrate an alternative caching allocator
We also rely on MKL to provide a caching allocator via mkl_malloc and mkl_free. We should measure the performance cost of disabling those and possibly find alternatives.
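For illustration, the dispatch idea from point 2, sketched in Python under the assumption that the vendor ID is read from /proc/cpuinfo (Linux only); the real selection would happen inside the C++ library at load time:

def cpu_vendor():
    # Read the CPU vendor string from /proc/cpuinfo (Linux only).
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("vendor_id"):
                return line.split(":", 1)[1].strip()
    return "unknown"

def select_gemm_backend():
    # GenuineIntel -> Intel MKL, anything else (e.g. AuthenticAMD) -> BLIS.
    return "mkl" if cpu_vendor() == "GenuineIntel" else "blis"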
I've started making adaptations to the OpenNMT-py REST server to allow the use of CTranslate2 models.
I'm thinking of some wrapping object in onmt.translate.translation_server that would provide a similar API to onmt.translate.translator:
class CTranslate2Translator(object):
    """
    This should reproduce the onmt.translate.translator API.
    """

    def __init__(self, model_path, device, device_index, beam_size, n_best):
        import ctranslate2
        self.translator = ctranslate2.Translator(
            model_path,
            device=device,
            device_index=device_index,
            inter_threads=1,
            intra_threads=1,
            compute_type="default")
        self.beam_size = beam_size
        self.n_best = n_best

    def translate(self, texts_to_translate, batch_size=8):
        batch = [item.split(" ") for item in texts_to_translate]
        print(batch)
        preds = self.translator.translate_batch(
            batch,
            beam_size=self.beam_size,
            num_hypotheses=self.n_best)
        scores = [[item["score"] for item in ex] for ex in preds]
        predictions = [[" ".join(item["tokens"]) for item in ex] for ex in preds]
        return scores, predictions
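For illustration, assuming a converted model directory (the path below is a placeholder), the wrapper would be used like this:

wrapper = CTranslate2Translator(
    "enes_general_medium_ctranslate2", device="cpu", device_index=0,
    beam_size=5, n_best=1)
scores, predictions = wrapper.translate(["▁H ello ▁world !"])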
This works fine for the translation API part.
Only remaining issue is that there is some logic in the server that requires models to move back and forth between CPU and CUDA (to_cpu / to_gpu methods that call .to(device) on the model).
Is this something we could easily add to the ctranslate2.Translator API?
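In the meantime, one possible workaround is to emulate to_cpu / to_gpu at the wrapper level by dropping and rebuilding the underlying ctranslate2.Translator on the target device. A minimal sketch (the class and its methods are hypothetical, not part of CTranslate2):

import ctranslate2

class SwitchableTranslator:
    # Hypothetical wrapper: emulates to_cpu/to_gpu by rebuilding the Translator.

    def __init__(self, model_path, device="cpu", device_index=0):
        self._model_path = model_path
        self._device = device
        self._device_index = device_index
        self._load()

    def _load(self):
        self.translator = ctranslate2.Translator(
            self._model_path,
            device=self._device,
            device_index=self._device_index)

    def to_cpu(self):
        del self.translator            # releases the device copy of the model
        self._device = "cpu"
        self._load()

    def to_gpu(self, device_index=0):
        del self.translator
        self._device = "cuda"
        self._device_index = device_index
        self._load()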
Hi,
I'd like to install the CTranslate2 module without using Docker. Is it possible?
Are there any scripts for this? I've tried generating a shell script from the dockerfile but it gives me some errors.
Thanks
I'm curious why the translation does not have a coverage penalty option in CTranslate2.
Hi,
I was trying to use TransformerAAN to train a translation model. But I found that CTranslate2 does not support TransformerAAN for now.
Any plan on this kind of architecture?
Many thanks.
Regards
When loading a model variable, the code currently deduces the data type from the size in bytes of one item (typically: if itemsize == 4 then float32). This is a weak test. We should instead save an identifier that unambiguously defines the data type.
Current fields:
item_size
data_size
Suggested fields:
dtype_id
nbytes
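For illustration, a minimal sketch of writing a variable with the suggested fields, assuming a hypothetical dtype_id mapping and a simple little-endian layout (not the project's actual format):

import struct
import numpy as np

# Hypothetical identifiers; the real mapping would be fixed by the model format.
DTYPE_IDS = {np.dtype("float32"): 0, np.dtype("int8"): 1, np.dtype("int16"): 2}

def write_variable(fh, array):
    # Write dtype_id and nbytes instead of the ambiguous item_size/data_size pair.
    fh.write(struct.pack("<B", DTYPE_IDS[array.dtype]))  # dtype_id
    fh.write(struct.pack("<Q", array.nbytes))            # nbytes
    fh.write(array.tobytes())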
Dimensions are currently represented with size_t. There are at least 2 issues with that:
for loops converging to 0
It would be nice to provide an efficient execution on ARM. This architecture is widespread on mobile devices and will be used for future Apple Mac CPUs. AWS also provides instances based on ARM.
To do:
If a model is sharing some variables (e.g. embeddings), the current serialization will duplicate them in the converted model. It should be improved to only save one copy of the variable.
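One way to detect such sharing at conversion time, sketched in Python under the assumption that shared variables are literally the same array object (the real serializer would also need to record which names alias which):

def deduplicate(variables):
    # Map each variable name to either its array or the name of an identical, earlier variable.
    seen = {}      # id(array) -> first name that used it
    result = {}
    for name, array in variables.items():
        key = id(array)                  # shared variables are the same object
        if key in seen:
            result[name] = seen[key]     # store an alias instead of a second copy
        else:
            seen[key] = name
            result[name] = array
    return result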
I trained a Transformer model using OpenNMT-tf 2.0. The converter ran well but the translation result became weird. Does CTranslate2 support OpenNMT-tf 2.0?
Here are versions:
OpenNMT-tf == 2.3.0
tensorflow-gpu == 2.0.0
This is a general issue to discuss and track ONNX support.
The current limitation of the project is that only weights are extracted from pretrained models and the computation graph is redefined in the code itself. This could be mitigated by loading and executing ONNX graphs.
The current TranslatorPool implementation uses a producer/consumer approach. The producer reads batches from the file and pushes them to a queue. Each consumer dequeues a batch and translates it.
As reading batches is commonly much faster than translating, batches quickly pile up in the work queue. This increases memory usage, especially when translating large files.
A basic fix is to limit the queue size. If the maximum size is reached, the producer should wait and be notified when a consumer dequeues a batch.
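The same idea in a few lines of Python, only to illustrate the bounded-queue behavior (the actual implementation is C++ with its own synchronization); producer and consumer would each run in their own thread:

import queue

work = queue.Queue(maxsize=8)   # producer blocks once 8 batches are pending

def producer(batches):
    for batch in batches:
        work.put(batch)         # blocks when the queue is full
    work.put(None)              # sentinel: no more batches

def consumer(translate):
    while True:
        batch = work.get()
        if batch is None:
            break
        translate(batch)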
Now that Python 2 is EOL, we should update all Docker images to use Python 3 by default.
Checking int8 support currently involves creating and destroying a TensorRT builder. This is expensive. To avoid this overhead in future calls, we could cache the result.
Approach: use std::call_once and store the result in a static variable.
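For reference, the analogous once-only caching pattern, sketched in Python with a hypothetical probe function:

import functools

@functools.lru_cache(maxsize=None)
def has_fast_int8():
    # Placeholder for the expensive capability probe; it runs once and later
    # calls return the cached result.
    print("running expensive int8 support check...")
    return True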
In 1976f45, we already tried to move to pybind11 but there were some compatibility issues with other pybind extensions that use a different toolchain.
There seems to be improvement in pybind11 2.4 on this issue:
https://pybind11.readthedocs.io/en/master/changelog.html#v2-4-0-sep-19-2019
# (The imports and the argument/translator setup are not part of the original excerpt;
# the lines below are likely candidates, added so the snippet is self-contained.)
import json
import sentencepiece as spm
from flask import Flask, request
from gevent.pywsgi import WSGIServer

app = Flask(__name__)
# ... `args`, `sp`, `modelPath` and the ctranslate2 `translator` are created here ...

if sp == 'in' or sp == 'out' or sp == 'inout':
    s = spm.SentencePieceProcessor()
    s.Load(modelPath + 'all.en.shuffled.filtered.spiece.model')

@app.route('/translate', methods=['POST'])
def trans():
    try:
        line = request.values.get('src')
        if sp == 'in' or sp == 'inout':
            sentence = s.EncodeAsPieces(line)
        else:
            sentence = list(line)
        results = translator.translate_batch([sentence], beam_size=1, max_decoding_length=250, num_hypotheses=1, length_penalty=0, min_decoding_length=1, use_vmap=False, return_attention=False)
        itemResult = ''
        for itemStr in results:
            item = itemStr[0]['tokens']
            if sp == 'out' or sp == 'inout':
                itemResult = s.DecodePieces(item)
            else:
                itemResult = str(''.join(item))
        # print(result)
        resultHtml = json.dumps([{"tgt": itemResult}], ensure_ascii=False)
    except Exception as e:
        resultHtml = json.dumps({"error": 1, "message": str(e)}, ensure_ascii=False)
    return resultHtml, 200

server = WSGIServer((args.ip, args.port), app)
print('Server ready!')
server.serve_forever()
When I make a lot of requests, I get this error:
terminate called after throwing an instance of 'std::runtime_error'
what(): /root/ctranslate2-dev/src/ops/layer_norm_gpu.cu:32: cuDNN failed with status CUDNN_STATUS_BAD_PARAM
Aborted (core dumped)
https://github.com/OpenNMT/CTranslate2/blob/master/src/ops/layer_norm_gpu.cu
I haven't tested this extensively but a small test seems to indicate slower times when using CUDA 10.1 (Update 2 - i.e., latest) vs CUDA 10.0 (as is used in the Docker file). It's around 1.5 times slower. Have you tried using CUDA 10.1 and have you seen similar results?
So I managed to compile everything with MSVC but I can't figure out why the client doesn't translate as expected. With short sentences containing only a few words (~10), it seems to be working fine. With longer sentences, I get very short, truncated, and irrelevant translations or just a single irrelevant word. Under OS X, it works wonderfully, no matter the length of the sentence. In both systems I'm using the same converted tf model and the same sentencepiece model.
The only weird thing I can notice is that the special underscore character from SentencePiece in shared_vocabulary.txt has encoding issues under Windows and appears as an empty box.
The relevant versions are:
MKL > 2019.5
MKL-DNN: 1.1.1
The current quantization code is based on thrust::reduce_by_key to get the absolute maximum of each row. However, this approach appears to be very slow in this context. It should be improved for better INT8 performance on GPU.
$ ./tests/benchmark_ops quantize cuda int8
benchmarking quantize_op(x, y, scale)
avg 0.186348 ms
$ ./tests/benchmark_ops quantize cpu int8
benchmarking quantize_op(x, y, scale)
avg 0.0024638 ms
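For reference, the computation being benchmarked is row-wise absolute-maximum scaling; in NumPy terms it is roughly the following (a sketch of the math only, not the CUDA implementation):

import numpy as np

def quantize_int8(x):
    # Row-wise symmetric int8 quantization: scale each row by 127 / max(|row|).
    amax = np.abs(x).max(axis=1, keepdims=True)   # the per-row reduction that is slow on GPU
    scale = 127.0 / np.maximum(amax, 1e-6)
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale.squeeze(1)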
When I run the demo from the README, I get an error:
Traceback (most recent call last):
File "/root/miniconda3/bin/ct2-opennmt-py-converter", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/lib/python3.7/site-packages/ctranslate2/bin/opennmt_py_converter.py", line 11, in main
converters.OpenNMTPyConverter(args.model_path).convert_from_args(args)
File "/root/miniconda3/lib/python3.7/site-packages/ctranslate2/converters/converter.py", line 39, in convert_from_args
force=args.force)
File "/root/miniconda3/lib/python3.7/site-packages/ctranslate2/converters/converter.py", line 53, in convert
src_vocab, tgt_vocab = self._load(model_spec)
File "/root/miniconda3/lib/python3.7/site-packages/ctranslate2/converters/opennmt_py.py", line 22, in _load
checkpoint = torch.load(self._model_path, map_location="cpu")
File "/root/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/root/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
result = unpickler.load()
File "/root/miniconda3/lib/python3.7/site-packages/torchtext/vocab.py", line 119, in __setstate__
if state['unk_index'] is None:
KeyError: 'unk_index'
The torch version is 1.4.0 and ctranslate2 is 1.5.1 on my development machine. After adding 'unk_index' not in state or to the condition in "/root/miniconda3/lib/python3.7/site-packages/torchtext/vocab.py:199", the conversion passes; a proper fix would be great.
The linker on OS X (LLVM 10) doesn't understand the --start-group and --end-group linking options. When building with Apple's default toolset, removing these options allows building the project, although with a ton of warnings due to linking order, particularly related to boost::program_options. At least it builds and runs fine, as far as I have tested it.
If I change the compiler to gcc-9, it won't link at all.
I tried but I couldn't find a solution (maybe ordering the libraries manually?)
Same system configuration with TensorRT v5.1.5 does not have this issue.
I am using Ubuntu 18.04, and other than these two things, am using the same configuration as in the Centos7-gpu Docker file.
Note there are warnings of deprecated nvinfer function use when building.
gdb output:
[Switching to Thread 0x7fffc68d3700 (LWP 3773)]
0x00007fffe339d604 in nvinfer1::rt::SafeExecutionContext::~SafeExecutionContext() () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
(gdb) bt
#0 0x00007fffe339d604 in nvinfer1::rt::SafeExecutionContext::~SafeExecutionContext() () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#1 0x00007fffe31b5449 in nvinfer1::rt::ExecutionContext::~ExecutionContext() () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#2 0x00007ffff79093e8 in ctranslate2::cuda::TensorRTLayer::clear (this=0x7fffc68d3438) at /home/ubuntu/CTranslate2/src/cuda/utils.cc:189
#3 0x00007ffff790923c in ctranslate2::cuda::TensorRTLayer::~TensorRTLayer (this=0x7fffc68d3438, __in_chrg=<optimized out>) at /home/ubuntu/CTranslate2/src/cuda/utils.cc:165
#4 0x00007ffff79bf604 in ctranslate2::ops::TopKLayer::~TopKLayer (this=0x7fffc68d3438, __in_chrg=<optimized out>) at /home/ubuntu/CTranslate2/src/ops/topk_gpu.cu:8
#5 0x00007ffff6b1c8af in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:155
#6 0x00007ffff74726e9 in start_thread (arg=0x7fffc68d3700) at pthread_create.c:470
#7 0x00007ffff6bfa88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Hi,
While using the Python ctranslate2.Translator API, it seems that an OOM can cause the whole Python session to crash.
>>> import ctranslate2
>>> translator = ctranslate2.Translator("ende_ctranslate2/")
>>> translator.translate_batch([["a"]*20000]) # very long dummy batch to force OOM for reproducibility
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to allocate memory
Aborted (core dumped)
Would it be possible to better catch such exceptions so that we can handle them on the Python side?
Thanks!
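Until the error can be propagated as a Python exception, one user-side mitigation is to cap the batch size before calling translate_batch. A hedged sketch (it only reduces the chance of an OOM, it does not catch one):

def translate_in_chunks(translator, batch, max_batch_size=32):
    # Translate at most max_batch_size examples per call to keep memory bounded.
    results = []
    for start in range(0, len(batch), max_batch_size):
        results.extend(translator.translate_batch(batch[start:start + max_batch_size]))
    return results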
The Multinomial op currently falls back on the CPU. This issue tracks the future addition of a dedicated CUDA implementation in multinomial_gpu.cu.
We currently generate a custom shared library for Intel MKL. Instead, we should consider statically linking against it.
Pros:
gomp can be used instead of iomp5
Cons:
MKL code would be included in both libctranslate2.so and libmkldnn.so (if the latter also statically links against MKL)
We should support FP16 execution on compatible GPUs.
Hi @guillaumekln ,
As far as I can see, if we create an instance of Translator, we can't change the model without destroying the object and creating a new one, as the model can only be defined in the constructor -- unless I missed something. Wouldn't it make sense to have a function to change the current model? Even if deleting and making new translators is trivial, IMO it would improve the already excellent interface. If this makes sense, I could work on that soon, when I have some time.
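For context, a sketch of what this would change, assuming a hypothetical set_model method (today the only option is deleting and recreating the object; the model paths below are placeholders):

import ctranslate2

translator = ctranslate2.Translator("model_a_ctranslate2", device="cpu")
# Today, switching models means destroying and recreating the object:
del translator
translator = ctranslate2.Translator("model_b_ctranslate2", device="cpu")

# The proposal would look something like this (hypothetical, not an existing method):
# translator.set_model("model_b_ctranslate2")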
Hi @guillaumekln,
I was trying to compile under Visual Studio 2019 and I got an error that 'max': is not a member of 'std' in layer_norm_cpu.cc (line 30). Adding the <algorithm> header does the trick. After a bit of searching, it seems this is because some Windows headers (WinDef.h) define their own macros for max and min.
Maybe it would be better to fix this in the CMakeLists.txt instead of adding the header just for Windows, so I tried adding a block
if(MSVC)
add_definitions(-D_USE_MATH_DEFINES)
add_definitions(-DNOMINMAX)
endif()
but it won't work --to be more specific, the error disappears but the build is not fully successful and no libraries are created.
Hi @guillaumekln ,
Why did you make the assumption "interactive mode" in this discussion OpenNMT/OpenNMT-py#1800 ?
It would really be helpful to use this feature in batch mode.
Hey @guillaumekln
If we take a shared embeddings setup between encoder and decoder for instance, some aliases are made here:
CTranslate2/python/ctranslate2/specs/model_spec.py, lines 83 to 99 in 9379e2f,
which is called when .validate() is called.
Here, we .validate() before getting the vocabulary sizes:
CTranslate2/python/ctranslate2/converters/converter.py, lines 59 to 61 in 9379e2f
But these {source,target}_vocabulary_size properties/methods do not handle aliases:
CTranslate2/python/ctranslate2/specs/transformer_spec.py, lines 34 to 40 in 9379e2f
--->
MODEL_SPEC AFTER VALIDATE {'weight': 'decoder/embeddings/weight', 'multiply_by_sqrt_depth': 'decoder/embeddings/multiply_by_sqrt_depth'}
Traceback (most recent call last):
File "/home/moses/CTranslate2/env_onmt/bin/onmt_release_model", line 8, in <module>
sys.exit(main())
File "/home/moses/CTranslate2/env_onmt/lib/python3.6/site-packages/onmt/bin/release_model.py", line 52, in main
converter.convert(opt.output, model_spec, force=True)
File "/home/moses/CTranslate2/env_onmt/lib/python3.6/site-packages/ctranslate2/converters/converter.py", line 74, in convert
self._check_vocabulary_size("source", src_vocab, model_spec.source_vocabulary_size)
File "/home/moses/CTranslate2/env_onmt/lib/python3.6/site-packages/ctranslate2/specs/transformer_spec.py", line 32, in source_vocabulary_size
return self.encoder.embeddings.weight.shape[0]
AttributeError: 'str' object has no attribute 'shape'
Am I missing something here?
The script in Quickstart -> 2. Convert a model fails.
pip install OpenNMT-py
wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
tar xf transformer-ende-wmt-pyOnmt.tar.gz
ct2-opennmt-py-converter --model_path averaged-10-epoch.pt --model_spec TransformerBase \
--output_dir ende_ctranslate2
Traceback (most recent call last):
File "/mnt/f/python-venv/onmt/bin/ct2-opennmt-py-converter", line 8, in <module>
sys.exit(main())
File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/ctranslate2/bin/opennmt_py_converter.py", line 11, in main converters.OpenNMTPyConverter(args.model_path).convert_from_args(args)
File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/ctranslate2/converters/converter.py", line 40, in convert_from_args
force=args.force)
File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/ctranslate2/converters/converter.py", line 52, in convert src_vocab, tgt_vocab = self._load(model_spec)
File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/ctranslate2/converters/opennmt_py.py", line 22, in _load checkpoint = torch.load(self._model_path, map_location="cpu")
File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/torch/serialization.py", line 529, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/torch/serialization.py", line 702, in _legacy_load
result = unpickler.load()
File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/torchtext/vocab.py", line 119, in __setstate__
if state['unk_index'] is None:
KeyError: 'unk_index'
I'm trying to convert an OpenNMT-py model to CTranslate2 format, but it fails because of a KeyError. The model that I'm trying to convert is available here (it is named paracrawl.pt but it was renamed during uploading).
When I try to run conversion:
ct2-opennmt-py-converter --model_path paracrawl.pt --model_spec TransformerBase --output_dir paracrawl
It fails with KeyError:
Traceback (most recent call last):
File "/usr/local/bin/ct2-opennmt-py-converter", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/ctranslate2/bin/opennmt_py_converter.py", line 11, in main
converters.OpenNMTPyConverter(args.model_path).convert_from_args(args)
File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/converter.py", line 35, in convert_from_args
return self.convert(
File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/converter.py", line 52, in convert
src_vocab, tgt_vocab = self._load(model_spec)
File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 27, in _load
set_transformer_spec(model_spec, variables)
File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 39, in set_transformer_spec
set_transformer_encoder(spec.encoder, variables, relative=spec.with_relative_position)
File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 43, in set_transformer_encoder
set_input_layers(spec, variables, "encoder", relative=relative)
File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 59, in set_input_layers
set_position_encodings(
File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 136, in set_position_encodings
spec.encodings = _get_variable(variables, "%s.pe" % scope).squeeze()
File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 141, in _get_variable
return variables[name].numpy()
KeyError: 'encoder.embeddings.make_embedding.pe.pe'
I'm using Python 3.8 on my custom python:buster Docker image with these Python packages installed:
Package Version
-------------------- ----------
absl-py 0.9.0
cachetools 4.0.0
certifi 2019.11.28
chardet 3.0.4
click 7.1.1
ConfigArgParse 1.0
ctranslate2 1.8.0
Flask 1.1.1
future 0.18.2
google-auth 1.11.3
google-auth-oauthlib 0.4.1
grpcio 1.27.2
idna 2.9
itsdangerous 1.1.0
Jinja2 2.11.1
Markdown 3.2.1
MarkupSafe 1.1.1
numpy 1.18.1
oauthlib 3.1.0
OpenNMT-py 1.0.2
pip 19.3.1
protobuf 3.11.3
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyonmttok 1.18.3
requests 2.23.0
requests-oauthlib 1.3.0
rsa 4.0
setuptools 41.6.0
six 1.14.0
tensorboard 2.1.1
torch 1.4.0
torchtext 0.4.0
tqdm 4.30.0
urllib3 1.25.8
waitress 1.4.3
Werkzeug 1.0.0
wheel 0.33.6
All Docker images on Docker Hub are Python 2 environments. What should I do if I want to build a Docker image that includes a Python 3 environment?
The script in Quickstart -> 2. Convert a model fails.
$ ct2-opennmt-tf-converter --model_path averaged-ende-export500k-v2 --model_spec TransformerBase --output_dir ende_ctranslate2 --force
...
File ".local/lib/python3.6/site-packages/ctranslate2/bin/opennmt_tf_converter.py", line 19, in main
tgt_vocab=args.tgt_vocab).convert_from_args(args)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/converter.py", line 40, in convert_from_args
force=args.force)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/converter.py", line 52, in convert
src_vocab, tgt_vocab = self._load(model_spec)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/opennmt_tf.py", line 126, in _load
tgt_vocab=self._tgt_vocab)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/opennmt_tf.py", line 66, in load_model
src_vocab = _get_asset_path(imported.examples_inputter.features_inputter)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/opennmt_tf.py", line 51, in _get_asset_path
asset = getattr(lookup_table._initializer, "_filename", None)
AttributeError: '_RestoredResource' object has no attribute '_initializer'
The code below will allocate some memory on GPU 0 even if the Translator is placed on another device:
import ctranslate2
translator = ctranslate2.Translator("ende_transformer", device="cuda", device_index=1)
Ideally, it should only allocate on GPU 1.
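A common workaround, assuming nothing else in the process needs GPU 0, is to hide the other devices before any CUDA context is created:

import os

# Must be set before anything initializes CUDA in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import ctranslate2

# GPU 1 is now the only visible device, so device_index=0 maps to it.
translator = ctranslate2.Translator("ende_transformer", device="cuda", device_index=0)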
The model converters should accept Transformer models sharing embeddings and/or softmax weights.
I would like to load the model file from memory (in a std::vector<unsigned char>) but I think it's not possible, as all related methods use the model directory as an std::string at some point. I can see the necessity in this, as the vocabularies and the vmap are also loaded from this directory.
Still, do you think there could be a use case (apart from mine, obviously :)) for some overrides with arguments that accept std::strings pointing directly to the model and the vocabularies?
It seems that ctranslate2 doesn't include the Intel MKL library when I install it with pip. So it will be faster if the computer has Intel MKL installed, right?
We should investigate the dynamic loading of NVIDIA libraries. This would be helpful to publish a ctranslate2 Python package that is compatible with both CPU and GPU while allowing execution on a CPU-only system.
If that proves to be too complex, we might need to publish a separate package for GPU support.
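The general idea, illustrated with a small Python probe (a sketch only; the library name and the check are assumptions about how such a package could decide between CPU and GPU at runtime):

import ctypes

def cuda_runtime_available():
    # Try to dlopen the CUDA runtime instead of linking against it at build time.
    try:
        ctypes.CDLL("libcudart.so")
        return True
    except OSError:
        return False

# A single package could then enable the GPU code path only when this returns True.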
Can you please support models trained in fairseq? Or, since it is a Torch model, can it be imported for inference and quantized?
Also, are the model sizes those of transformer_big? If it were transformer_base it would be around half of that.
Please consider distilling the model into a smaller model; that would help with inference speed and size.
Command line installation results in:
module 'ctranslate2' has no attribute 'Translator'
I trained a model whose size is about 460 MB. About 669 MB of CUDA memory was allocated for this model when it was loaded in the Python environment:
import ctranslate2
translator = ctranslate2.Translator("/data/ende_ctranslate2/", device="cuda")
My first question is: why does the loaded model occupy much more memory than the model size?
When I tried to translate the first batch of sentences:
translator.translate_batch([["▁H", "ello", "▁world", "!"]])
The CUDA memory occupied by this model gradually increased, then suddenly reached about 2600 MB and finally fell back to about 800 MB. I would really like to know what happens during this period, as this behavior always leads to CUDA out-of-memory errors in my other programs running on the same GPU.
Besides, when I translate some longer sentences, the memory occupied by this model always increases and never decreases back to the previous size. This seems quite abnormal, and I wonder whether these phenomena are caused by a memory leak? Thanks.