opennmt / ctranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2

License: MIT License

CMake 0.84% C++ 80.60% Shell 0.41% C 0.05% Python 12.31% Cuda 5.48% Dockerfile 0.31%
neural-machine-translation cpp mkl quantization cuda thrust opennmt deep-neural-networks openmp onednn

ctranslate2's Introduction


CTranslate2

CTranslate2 is a C++ and Python library for efficient inference with Transformer models.

The project implements a custom runtime that applies many performance optimization techniques such as weights quantization, layers fusion, batch reordering, etc., to accelerate and reduce the memory usage of Transformer models on CPU and GPU.

The following model types are currently supported:

  • Encoder-decoder models: Transformer base/big, M2M-100, NLLB, BART, mBART, Pegasus, T5, Whisper
  • Decoder-only models: GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT, Llama, Mistral, Gemma, CodeGen, GPTBigCode, Falcon
  • Encoder-only models: BERT, DistilBERT, XLM-RoBERTa

Compatible models must first be converted into an optimized model format. The library includes converters for multiple frameworks, including OpenNMT-py, OpenNMT-tf, Fairseq, Marian, and Hugging Face Transformers.

The project is production-oriented and comes with backward compatibility guarantees, but it also includes experimental features related to model compression and inference acceleration.

Key features

  • Fast and efficient execution on CPU and GPU
    The execution is significantly faster and requires fewer resources than general-purpose deep learning frameworks on supported models and tasks, thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanisms, etc.
  • Quantization and reduced precision
    The model serialization and computation support weights with reduced precision: 16-bit floating points (FP16), 16-bit brain floating points (BF16), 16-bit integers (INT16), and 8-bit integers (INT8).
  • Multiple CPU architectures support
    The project supports x86-64 and AArch64/ARM64 processors and integrates multiple backends that are optimized for these platforms: Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate.
  • Automatic CPU detection and code dispatch
    One binary can include multiple backends (e.g. Intel MKL and oneDNN) and instruction set architectures (e.g. AVX, AVX2) that are automatically selected at runtime based on the CPU information.
  • Parallel and asynchronous execution
    Multiple batches can be processed in parallel and asynchronously using multiple GPUs or CPU cores.
  • Dynamic memory usage
    The memory usage changes dynamically depending on the request size while still meeting performance requirements thanks to caching allocators on both CPU and GPU.
  • Lightweight on disk
    Quantization can make the models 4 times smaller on disk with minimal accuracy loss.
  • Simple integration
    The project has few dependencies and exposes simple APIs in Python and C++ to cover most integration needs.
  • Configurable and interactive decoding
    Advanced decoding features allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence (see the sketch after this list).
  • Tensor parallelism for distributed inference
    Very large models can be split across multiple GPUs. Follow the documentation to set up the required environment.
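
A minimal sketch of these decoding features, assuming a converted En->De model in "ende_ctranslate2" and SentencePiece-style tokens (the prefix token below is purely illustrative):

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2")

# Autocomplete a partial target sequence and return alternatives at the
# first position that follows the prefix.
results = translator.translate_batch(
    [["▁H", "ello", "▁world", "!"]],
    target_prefix=[["▁Hallo"]],
    num_hypotheses=5,
    return_alternatives=True,
)
for hypothesis in results[0].hypotheses:
    print(" ".join(hypothesis))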

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.

Installation and usage

CTranslate2 can be installed with pip:

pip install ctranslate2

The Python module is used to convert models and can translate or generate text in a few lines of code:

import ctranslate2

# Translate tokenized text with an encoder-decoder model.
translator = ctranslate2.Translator(translation_model_path)
translator.translate_batch(tokens)

# Generate text with a decoder-only model.
generator = ctranslate2.Generator(generation_model_path)
generator.generate_batch(start_tokens)
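
Models can also be converted from Python. A minimal sketch using the OpenNMT-py converter (assuming a checkpoint such as averaged-10-epoch.pt from the quickstart; the output directory name is arbitrary and quantization is optional):

import ctranslate2

converter = ctranslate2.converters.OpenNMTPyConverter("averaged-10-epoch.pt")
converter.convert("ende_ctranslate2", quantization="int8")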

See the documentation for more information and examples.

Benchmarks

We translate the En->De test set newstest2014 with multiple models:

  • OpenNMT-tf WMT14: a base Transformer trained with OpenNMT-tf on the WMT14 dataset (4.5M lines)
  • OpenNMT-py WMT14: a base Transformer trained with OpenNMT-py on the WMT14 dataset (4.5M lines)
  • OPUS-MT: a base Transformer trained with Marian on all OPUS data available on 2020-02-26 (81.9M lines)

The benchmark reports the number of target tokens generated per second (higher is better). The results are aggregated over multiple runs. See the benchmark scripts for more details and to reproduce these numbers.

Please note that the results presented below are only valid for the configuration used during this benchmark: absolute and relative performance may change with different settings.

CPU

Model                                        Tokens per second   Max. memory   BLEU
OpenNMT-tf WMT14 model
  OpenNMT-tf 2.31.0 (with TensorFlow 2.11.0)             209.2        2653MB   26.93
OpenNMT-py WMT14 model
  OpenNMT-py 3.0.4 (with PyTorch 1.13.1)                 275.8        2012MB   26.77
  - int8                                                 323.3        1359MB   26.72
  CTranslate2 3.6.0                                      658.8         849MB   26.77
  - int16                                                733.0         672MB   26.82
  - int8                                                 860.2         529MB   26.78
  - int8 + vmap                                         1126.2         598MB   26.64
OPUS-MT model
  Transformers 4.26.1 (with PyTorch 1.13.1)              147.3        2332MB   27.90
  Marian 1.11.0                                          344.5        7605MB   27.93
  - int16                                                330.2        5901MB   27.65
  - int8                                                 355.8        4763MB   27.27
  CTranslate2 3.6.0                                      525.0         721MB   27.92
  - int16                                                596.1         660MB   27.53
  - int8                                                 696.1         516MB   27.65

Executed with 4 threads on a c5.2xlarge Amazon EC2 instance equipped with an Intel(R) Xeon(R) Platinum 8275CL CPU.

GPU

Model                                        Tokens per second   Max. GPU memory   Max. CPU memory   BLEU
OpenNMT-tf WMT14 model
  OpenNMT-tf 2.31.0 (with TensorFlow 2.11.0)            1483.5            3031MB            3122MB   26.94
OpenNMT-py WMT14 model
  OpenNMT-py 3.0.4 (with PyTorch 1.13.1)                1795.2            2973MB            3099MB   26.77
  FasterTransformer 5.3                                 6979.0            2402MB            1131MB   26.77
  - float16                                             8592.5            1360MB            1135MB   26.80
  CTranslate2 3.6.0                                     6634.7            1261MB             953MB   26.77
  - int8                                                8567.2            1005MB             807MB   26.85
  - float16                                            10990.7             941MB             807MB   26.77
  - int8 + float16                                      8725.4             813MB             800MB   26.83
OPUS-MT model
  Transformers 4.26.1 (with PyTorch 1.13.1)             1022.9            4097MB            2109MB   27.90
  Marian 1.11.0                                         3241.0            3381MB            2156MB   27.92
  - float16                                             3962.4            3239MB            1976MB   27.94
  CTranslate2 3.6.0                                     5876.4            1197MB             754MB   27.92
  - int8                                                7521.9            1005MB             792MB   27.79
  - float16                                             9296.7             909MB             814MB   27.90
  - int8 + float16                                      8362.7             813MB             766MB   27.90

Executed with CUDA 11 on a g5.xlarge Amazon EC2 instance equipped with an NVIDIA A10G GPU (driver version: 510.47.03).

Additional resources

ctranslate2's People

Contributors

amrrs, anterart, brightxiaohan, cdockes, chengduozh, chiiyeh, clementchouteau, dependabot[bot], ebraraktas, flyingleafe, funboarder13920, guillaumekln, homink, jgcb00, jhnwnd, jordimas, keichi, masa-oi, michaelfeil, minhthuc2502, panosk, raphaelmerx, scotfang, sebastianbodza, shas3011, vadi2, vakkov, vince62s, yc-wang00, zxdvd

ctranslate2's Issues

CTranslate2 layer_norm_gpu.cu:32: cuDNN failed with status CUDNN_STATUS_BAD_PARAM

# Imports and app setup added for completeness; `sp`, `modelPath`, `args`, and the
# `translator` object are assumed to be defined earlier in the original script, and
# gevent is assumed as the WSGI server.
import json

import sentencepiece as spm
from flask import Flask, request
from gevent.pywsgi import WSGIServer

app = Flask(__name__)

if sp=='in' or sp=='out' or sp=='inout':
    s = spm.SentencePieceProcessor()
    s.Load(modelPath + 'all.en.shuffled.filtered.spiece.model')
@app.route('/translate', methods=['Post'])
def trans():
    try:
        line = request.values.get('src')
        
        if sp=='in' or sp=='inout':
            sentence = s.EncodeAsPieces(line)
        else:
            sentence = list(line)

        results = translator.translate_batch([sentence], beam_size=1, max_decoding_length=250, num_hypotheses=1, length_penalty=0, min_decoding_length=1, use_vmap=False, return_attention=False)

        itemResult = ''

        for itemStr in results:
            item = itemStr[0]['tokens']

            if sp=='out' or sp=='inout':
                itemResult = s.DecodePieces(item)
            else:
                itemResult = str(''.join(item))

            # print(result)

        resultHtml = json.dumps([{"tgt": itemResult}], ensure_ascii=False)
    except Exception as e:
        resultHtml = json.dumps(({"error": 1, "message": str(e)}), ensure_ascii=False)

    return resultHtml, 200

server = WSGIServer((args.ip, args.port), app)
print('Server ready!')

server.serve_forever()

When I send many requests, it crashes with this error:

terminate called after throwing an instance of 'std::runtime_error'
what(): /root/ctranslate2-dev/src/ops/layer_norm_gpu.cu:32: cuDNN failed with status CUDNN_STATUS_BAD_PARAM
Aborted (core dumped)

https://github.com/OpenNMT/CTranslate2/blob/master/src/ops/layer_norm_gpu.cu

Moving model/translator object between devices

I've started adapting the OpenNMT-py REST server to allow the use of CTranslate2 models.
I'm thinking of a wrapper object in onmt.translate.translation_server that would provide a similar API to onmt.translate.translator:

class CTranslate2Translator(object):
    """
    This should reproduce the onmt.translate.translator API.
    """

    def __init__(self, model_path, device, device_index, beam_size, n_best):
        import ctranslate2
        self.translator = ctranslate2.Translator(
            model_path,
            device=device,
            device_index=device_index,
            inter_threads=1,
            intra_threads=1,
            compute_type="default")
        self.beam_size = beam_size
        self.n_best = n_best

    def translate(self, texts_to_translate, batch_size=8):
        batch = [item.split(" ") for item in texts_to_translate]
        print(batch)
        preds = self.translator.translate_batch(
            batch,
            beam_size=self.beam_size,
            num_hypotheses=self.n_best
        )
        scores = [[item["score"] for item in ex] for ex in preds]
        predictions = [[" ".join(item["tokens"]) for item in ex] for ex in preds]
        return scores, predictions

This works fine for the translation API part.
The only remaining issue is that there is some logic in the server that requires models to move back and forth between CPU and CUDA (to_cpu / to_gpu methods that call .to(device) on the model).
Is this something we could easily add to the ctranslate2.Translator API?
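
For reference, a possible sketch using methods available in more recent CTranslate2 releases (unload_model / load_model; availability depends on the installed version), which free or restore the model on the device without destroying the Python object:

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2", device="cuda")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])

translator.unload_model(to_cpu=True)  # move the weights to CPU memory, freeing the GPU
# ... later ...
translator.load_model()               # move the weights back to the original device
translator.translate_batch([["▁H", "ello", "▁world", "!"]])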

The example of converting opennmt-py model does not work.

The script in (Quickstart -> 2. Convert a model) fails.

pip install OpenNMT-py

wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
tar xf transformer-ende-wmt-pyOnmt.tar.gz

ct2-opennmt-py-converter --model_path averaged-10-epoch.pt --model_spec TransformerBase \
    --output_dir ende_ctranslate2
Traceback (most recent call last):
  File "/mnt/f/python-venv/onmt/bin/ct2-opennmt-py-converter", line 8, in <module>
    sys.exit(main())
  File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/ctranslate2/bin/opennmt_py_converter.py", line 11, in main      converters.OpenNMTPyConverter(args.model_path).convert_from_args(args)
  File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/ctranslate2/converters/converter.py", line 40, in convert_from_args
    force=args.force)
  File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/ctranslate2/converters/converter.py", line 52, in convert       src_vocab, tgt_vocab = self._load(model_spec)
  File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/ctranslate2/converters/opennmt_py.py", line 22, in _load        checkpoint = torch.load(self._model_path, map_location="cpu")
  File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/mnt/f/python-venv/onmt/lib/python3.5/site-packages/torchtext/vocab.py", line 119, in __setstate__
    if state['unk_index'] is None:
KeyError: 'unk_index'

Are there any plans for TransformerAAN?

Hi,

I was trying to use TransformerAAN to train a translation model, but I found that CTranslate2 does not currently support it.
Are there any plans for this kind of architecture?
Many thanks.

Regards

Implement GPU TopK without TensorRT

We should look into implementing the TopK layer with a custom CUDA kernel instead of using TensorRT. The motivation is to remove the TensorRT and cuDNN dependencies (cuDNN is a dependency of TensorRT).

The benefits are:

  • make it easier to build Python wheels with GPU support (cuBLAS would be the only external NVIDIA dependency);
  • reduce the total installation size.

pip install ctranslate2: no package found on macOS

Hi,

Running pip install ctranslate2 with the latest pip as per the installation instructions results in the following:

ERROR: Could not find a version that satisfies the requirement ctranslate2== (from versions: none)
ERROR: No matching distribution found for ctranslate2==
> pip --version
pip 20.0.2 from [...]/lib/python3.8/site-packages/pip (python 3.8)
> conda --version
conda 4.7.12

This is on macOS Mojave 10.14.6 (18G2022)

Plans to support model trained in fairseq

Can you please support models trained in fairseq? Or, since they are PyTorch models, can they be imported for inference and quantization?

Also, are the model sizes those of transformer_big? If they were transformer_base, they would be around half the size.
Please consider distilling the models into smaller ones, which would help with inference speed and size.

Limit work queue size when translating large files

The current TranslatorPool implementation is using a producer/consumer approach. The producer reads batches from the file and pushes them in a queue. Each consumer dequeues a batch and translates it.

As reading batches is commonly much faster than translating, batches quickly pile up in the work queue. This increases memory usage, especially when translating large files.

A basic fix is to limit the queue size. If the maximum size is reached, the producer should wait and be notified when a consumer dequeues a batch.
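
An illustrative Python sketch of the proposed fix (the actual implementation is in C++; queue.Queue with a maxsize gives exactly the blocking producer behavior described above):

import queue
import threading

work_queue = queue.Queue(maxsize=8)  # put() blocks once 8 batches are pending

def producer(batches):
    for batch in batches:
        work_queue.put(batch)  # waits here if the consumers are behind
    work_queue.put(None)       # sentinel to stop the consumer

def consumer(translate_fn):
    while True:
        batch = work_queue.get()
        if batch is None:
            break
        translate_fn(batch)

threading.Thread(target=consumer, args=(print,), daemon=True).start()
producer([["▁Hello"], ["▁world"]])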

Dynamic loading of NVIDIA libraries

We should investigate the dynamic loading of NVIDIA libraries. This would be helpful to publish a ctranslate2 Python package that is compatible with both CPU and GPU while allowing execution on a CPU-only system.

If that proves to be too complex, we might need to publish a separate package for GPU support.
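
An illustrative sketch of the idea in Python (the real loading would happen in the C++ library; the library names below assume a Linux system):

import ctypes

def cuda_runtime_available():
    # Try to load the CUDA runtime at runtime instead of linking against it.
    for name in ("libcudart.so", "libcudart.so.11.0", "libcudart.so.10.2"):
        try:
            ctypes.CDLL(name)
            return True
        except OSError:
            continue
    return False

device = "cuda" if cuda_runtime_available() else "cpu"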

error during model conversion

On OS X Catalina, now I get this error when I try to convert a model:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/panos/Development/CTranslate2/python/ctranslate2/bin/opennmt_tf_converter.py", line 23, in <module>
    main()
  File "/Users/panos/Development/CTranslate2/python/ctranslate2/bin/opennmt_tf_converter.py", line 19, in main
    tgt_vocab=args.tgt_vocab).convert_from_args(args)
  File "/Users/panos/Development/CTranslate2/python/ctranslate2/converters/converter.py", line 39, in convert_from_args
    force=args.force)
  File "/Users/panos/Development/CTranslate2/python/ctranslate2/converters/converter.py", line 53, in convert
    src_vocab, tgt_vocab = self._load(model_spec)
  File "/Users/panos/Development/CTranslate2/python/ctranslate2/converters/opennmt_tf.py", line 107, in _load
    tgt_vocab=self._tgt_vocab)
  File "/Users/panos/Development/CTranslate2/python/ctranslate2/converters/opennmt_tf.py", line 57, in load_model
    src_vocab = _get_asset_path(imported.examples_inputter.features_inputter)
AttributeError: 'AutoTrackable' object has no attribute 'examples_inputter'

Improve int8 quantization performance on GPU

The current quantization code is based on thrust::reduce_by_key to get the absolute maximum of each row. However, this approach appears to be very slow in this context. It should be improved for better INT8 performance on GPU.

$ ./tests/benchmark_ops quantize cuda int8
benchmarking quantize_op(x, y, scale)
avg   0.186348 ms

$ ./tests/benchmark_ops quantize cpu int8
benchmarking quantize_op(x, y, scale)
avg   0.0024638 ms
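
For reference, a NumPy sketch of the row-wise absolute-maximum INT8 quantization this op computes (reference semantics only, not the GPU kernel):

import numpy as np

def quantize_int8(x):
    # Scale each row so that its absolute maximum maps to 127.
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = 127.0 / np.maximum(amax, 1e-12)
    y = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return y, scale.squeeze(1)

q, scales = quantize_int8(np.random.rand(4, 8).astype(np.float32))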

compiled client doesn't work as expected in Windows

So I managed to compile everything with MSVC, but I can't figure out why the client doesn't translate as expected. With short sentences containing only a few words (~10), it seems to work fine. With longer sentences, I get very short, truncated, and irrelevant translations, or just a single irrelevant word. Under OS X it works wonderfully, no matter the length of the sentence. On both systems I'm using the same converted TF model and the same SentencePiece model.
The only weird thing I can notice is that the special underscore character from SentencePiece in shared_vocabulary.txt has encoding issues under Windows and appears as an empty box.

The Demo in the ReadMe doesn't work.

When I run the demo from the README, I get an error:

Traceback (most recent call last):
  File "/root/miniconda3/bin/ct2-opennmt-py-converter", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.7/site-packages/ctranslate2/bin/opennmt_py_converter.py", line 11, in main
    converters.OpenNMTPyConverter(args.model_path).convert_from_args(args)
  File "/root/miniconda3/lib/python3.7/site-packages/ctranslate2/converters/converter.py", line 39, in convert_from_args
    force=args.force)
  File "/root/miniconda3/lib/python3.7/site-packages/ctranslate2/converters/converter.py", line 53, in convert
    src_vocab, tgt_vocab = self._load(model_spec)
  File "/root/miniconda3/lib/python3.7/site-packages/ctranslate2/converters/opennmt_py.py", line 22, in _load
    checkpoint = torch.load(self._model_path, map_location="cpu")
  File "/root/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/root/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/root/miniconda3/lib/python3.7/site-packages/torchtext/vocab.py", line 119, in __setstate__
    if state['unk_index'] is None:
KeyError: 'unk_index'

On my development machine, the torch version is 1.4.0 and ctranslate2 is 1.5.1. After adding 'unk_index' not in state or to the condition in "/root/miniconda3/lib/python3.7/site-packages/torchtext/vocab.py:199", the check passes.

The example of converting opennmt-tf model does not work.

The script in (Quickstart -> 2. Convert a model) fails.

$ ct2-opennmt-tf-converter --model_path averaged-ende-export500k-v2 --model_spec TransformerBase --output_dir ende_ctranslate2 --force

...

File ".local/lib/python3.6/site-packages/ctranslate2/bin/opennmt_tf_converter.py", line 19, in main
tgt_vocab=args.tgt_vocab).convert_from_args(args)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/converter.py", line 40, in convert_from_args
force=args.force)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/converter.py", line 52, in convert
src_vocab, tgt_vocab = self._load(model_spec)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/opennmt_tf.py", line 126, in _load
tgt_vocab=self._tgt_vocab)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/opennmt_tf.py", line 66, in load_model
src_vocab = _get_asset_path(imported.examples_inputter.features_inputter)
File ".local/lib/python3.6/site-packages/ctranslate2/converters/opennmt_tf.py", line 51, in _get_asset_path
asset = getattr(lookup_table._initializer, "_filename", None)
AttributeError: '_RestoredResource' object has no attribute '_initializer'

python3 docker image?

All Docker images on Docker Hub are Python 2 environments. What should I do if I want to build a Docker image that includes a Python 3 environment?

cuda memory leak with python api?

I trained a model whose size is about 460M. About 669M of CUDA memory was allocated for this model when it was loaded into the Python environment:

import ctranslate2
translator = ctranslate2.Translator("/data/ende_ctranslate2/", device="cuda")

My first question is: why does the loaded model occupy much more memory than the model size?

When I tried to translate the first batch of sentences:

translator.translate_batch([["▁H", "ello", "▁world", "!"]])

the CUDA memory occupied by this model gradually increased, suddenly reached about 2600M, and finally fell back quickly to 800M. I would really like to know what happens during this period, as this behavior often causes CUDA out-of-memory errors for my other programs running on the same GPU.

Besides, when I translate longer sentences, the memory occupied by this model keeps increasing and never decreases to its previous size. This is quite abnormal, and I wonder whether these phenomena are caused by a memory leak? Thanks.

Query int8 support on GPU once

Checking int8 support currently involves creating and destroying a TensorRT builder. This is expensive. To avoid this overhead in future calls, we could cache the result.

Approach: use std::call_once and store the result in a static variable.
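
The same caching idea expressed as a Python sketch (the actual change would use std::call_once and a static variable in C++; the body below is a stand-in for the expensive TensorRT query):

import functools

@functools.lru_cache(maxsize=None)
def gpu_has_fast_int8(device_index=0):
    # Runs once per device index; subsequent calls return the cached result.
    print("querying int8 support for device", device_index)
    return True  # placeholder result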

Use int64_t for dimension values

Dimensions are currently represented with size_t. There are at least 2 issues with that:

  • platform-dependent size
  • negative values are sometimes useful:
    • for loops converging to 0
    • -1 support in reshape (to not explicitly set a value for a dimension)

suggestion to add function for changing models without deleting/creating new translators

Hi @guillaumekln ,

As far as I can see, if we create an instance of Translator, we can't change the model without destroying the object and creating a new one, since the model can only be set in the constructor -- unless I missed something. Wouldn't it make sense to have a function to change the current model? Even if deleting and creating new translators is trivial, IMO it would improve the already excellent interface. If this makes sense, I could work on it soon, when I have some time.
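
A minimal wrapper sketch of the current workaround (recreating the Translator when the model changes); a dedicated set_model-style API in CTranslate2 itself would avoid this:

import ctranslate2

class ReloadableTranslator:
    """Illustrative workaround: swap models by recreating the underlying Translator."""

    def __init__(self, model_path, **kwargs):
        self._kwargs = kwargs
        self._translator = ctranslate2.Translator(model_path, **kwargs)

    def set_model(self, model_path):
        del self._translator  # release the current model before loading the new one
        self._translator = ctranslate2.Translator(model_path, **self._kwargs)

    def translate_batch(self, *args, **kwargs):
        return self._translator.translate_batch(*args, **kwargs)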

segmentation fault at the finish of a translation when using TensorRT v6.0.1

Same system configuration with TensorRT v5.1.5 does not have this issue.
I am using Ubuntu 18.04, and other than these two things, am using the same configuration as in the Centos7-gpu Docker file.

Note there are warnings of deprecated nvinfer function use when building.

gdb output:

[Switching to Thread 0x7fffc68d3700 (LWP 3773)]
0x00007fffe339d604 in nvinfer1::rt::SafeExecutionContext::~SafeExecutionContext() () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
(gdb) bt
#0  0x00007fffe339d604 in nvinfer1::rt::SafeExecutionContext::~SafeExecutionContext() () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#1  0x00007fffe31b5449 in nvinfer1::rt::ExecutionContext::~ExecutionContext() () from /usr/lib/x86_64-linux-gnu/libnvinfer.so.6
#2  0x00007ffff79093e8 in ctranslate2::cuda::TensorRTLayer::clear (this=0x7fffc68d3438) at /home/ubuntu/CTranslate2/src/cuda/utils.cc:189
#3  0x00007ffff790923c in ctranslate2::cuda::TensorRTLayer::~TensorRTLayer (this=0x7fffc68d3438, __in_chrg=<optimized out>) at /home/ubuntu/CTranslate2/src/cuda/utils.cc:165
#4  0x00007ffff79bf604 in ctranslate2::ops::TopKLayer::~TopKLayer (this=0x7fffc68d3438, __in_chrg=<optimized out>) at /home/ubuntu/CTranslate2/src/ops/topk_gpu.cu:8
#5  0x00007ffff6b1c8af in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:155
#6  0x00007ffff74726e9 in start_thread (arg=0x7fffc68d3700) at pthread_create.c:470
#7  0x00007ffff6bfa88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Support execution without Intel MKL

Intel MKL is currently required to use the project on CPU. However, it is not always a good fit, especially on non-Intel hardware. It is likely that MKL checks the CPU vendor ID before activating some fast execution paths.

See for example this performance analysis on AMD Epyc where Intel MKL has poor results.

1. Integrate an alternative GEMM

The main requirements are:

  • multi-threading support (ideally with OpenMP)
  • runtime dispatch to architecture-specific code (ideally including AMD and ARM)
  • bonus: integer-based GEMM

BLIS appears to be a good candidate.

2. Dynamically select a GEMM backend

We should consider compiling with multiple backends and selecting one at runtime (e.g. on GenuineIntel call Intel MKL, otherwise call BLIS).
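
An illustrative sketch of the vendor-based dispatch (Linux-only check shown here; the actual selection would happen in C++ at runtime):

def select_gemm_backend():
    try:
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
    except OSError:
        return "blis"
    # Prefer Intel MKL on Intel CPUs, fall back to an alternative GEMM elsewhere.
    return "mkl" if "GenuineIntel" in cpuinfo else "blis"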

3. (optional) Integrate an alternative caching allocator

We also rely on MKL to provide a caching allocator via mkl_malloc and mkl_free. We should measure the performance cost of disabling those and possibly find alternatives.

Proper configuration for server

Hi,
I've been digging around in the code integration for a while, but it is not clear to me which arguments are necessary. I guess "model" and "ct2_model" are not required at the same time...
Thanks

Link error/warning in OS X with --start-group and --end-group

The linker in OS X (LLVM 10) doesn't understand the --start-group and --end-group linking options. When building with Apple's default toolset, removing these options allows building the project, although with a ton of warnings due to linking order, particularly related to boost::program_options. At least it builds and runs fine, as far as I have tested it.
If I change the compiler to gcc-9, it won't link at all.
I tried but couldn't find a solution (maybe ordering the libraries manually?)

compilation error with MKL-DNN

Hi, I tried to compile with MKL-DNN, but the following error occurs:
/data/work/c-translator/CTranslate2-1.2.0/src/primitives/cpu.cc: In static member function 'static void ctranslate2::primitives::gemm(const In*, const In*, bool, bool, size_t, size_t, size_t, float, float, Out*) [with In = signed char; Out = int; ctranslate2::Device D = (ctranslate2::Device)0; size_t = long unsigned int]':
/data/work/c-translator/CTranslate2-1.2.0/src/primitives/cpu.cc:546:39: error: invalid conversion from 'const char*' to 'char' [-fpermissive]
c, &ldc, &co),
^
/data/work/c-translator/CTranslate2-1.2.0/src/primitives/cpu.cc:546:39: error: invalid conversion from 'const char*' to 'char' [-fpermissive]
/data/work/c-translator/CTranslate2-1.2.0/src/primitives/cpu.cc:546:39: error: invalid conversion from 'const char*' to 'char' [-fpermissive]
/data/work/c-translator/CTranslate2-1.2.0/src/primitives/cpu.cc:546:39: error: invalid conversion from 'int*' to 'dnnl_dim_t {aka long int}' [-fpermissive]
/data/work/c-translator/CTranslate2-1.2.0/src/primitives/cpu.cc:546:39: error: invalid conversion from 'int*' to 'dnnl_dim_t {aka long int}' [-fpermissive]
/data/work/c-translator/CTranslate2-1.2.0/src/primitives/cpu.cc:546:39: error: invalid conversion from 'int*' to 'dnnl_dim_t {aka long int}' [-fpermissive]
/data/work/c-translator/CTranslate2-1.2.0/src/primitives/cpu.cc:546:39: error: cannot convert 'float*' to 'float' for argument '7' to 'dnnl_status_t dnnl_gemm_s8s8s32(char, char, char, dnnl_dim_t, dnnl_dim_t, dnnl_dim_t, float, const int8_t*, dnnl_dim_t, int8_t, const int8_t*, dnnl_dim_t, int8_t, float, int32_t*, dnnl_dim_t, const int32_t*)'
CMakeFiles/ctranslate2.dir/build.make:758: recipe for target 'CMakeFiles/ctranslate2.dir/src/primitives/cpu.cc.o' failed
make[2]: *** [CMakeFiles/ctranslate2.dir/src/primitives/cpu.cc.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/ctranslate2.dir/all' failed
make[1]: *** [CMakeFiles/ctranslate2.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

The relevant versions are:
MKL > 2019.5
MKL-DNN: 1.1.1

ARM support

It would be nice to provide efficient execution on ARM. This architecture is widespread on mobile devices and will be used for future Apple Mac CPUs. AWS also provides instances based on ARM.

To do:

  1. Figure out what is required to cross-compile to ARM.
  2. Look into the ARM Compute Library which has GEMM primitives optimized for ARM NEON.
  3. Add ARM NEON vectorization for CPU kernels, and update automatic ISA dispatch accordingly.

Placing a Translator on GPU N > 0 allocates memory on GPU 0

The code below will allocate some memory on GPU 0 even if the Translator is placed on another device:

import ctranslate2
translator = ctranslate2.Translator("ende_transformer", device="cuda", device_index=1)

Ideally, it should only allocate on GPU 1.

Save a data type identifier in converted models

When loading a model variable, the code currently deduces the data type from the size in bytes of one item (it typically does: if itemsize == 4, then float32). This is a weak test. We should instead save an identifier that unambiguously defines a data type.

Current fields:

  • item_size
  • data_size

Suggested fields:

  • dtype_id
  • nbytes

compilation needs the <algorithm> header for std::max with MSVC

Hi @guillaumekln,

I was trying to compile under Visual Studio 2019 and I got an error that 'max' is not a member of 'std' in layer_norm_cpu.cc (line 30). Adding the <algorithm> header does the trick. After a bit of searching, it seems this is because some Windows headers (WinDef.h) define their own macros for max and min.
Maybe it would be better to fix this in CMakeLists.txt instead of adding the header just for Windows, so I tried adding a block

if(MSVC)
  add_definitions(-D_USE_MATH_DEFINES)
  add_definitions(-DNOMINMAX)
endif()

but it won't work -- to be more specific, the error disappears but the build is not fully successful and no libraries are created.

FP16 support

We should support FP16 execution on compatible GPU.

Conversion breaks in some shared parameters setups.

Hey @guillaumekln

If we take a shared embeddings setup between the encoder and decoder, for instance, some aliases are created here:

def _alias_variables(self):
    """Find duplicate variables in spec and create aliases."""
    # When a variable is duplicated, keep the version that comes first in
    # the alphabetical order and alias the others.
    variables = self.variables(ordered=True)
    for name, value in reversed(variables):
        for other_name, other_value in variables:
            if name == other_name:
                break
            # Because variables can be transformed on load (e.g. transposed),
            # we use an element-wise equality check.
            if value.dtype == other_value.dtype and np.array_equal(value, other_value):
                # Replace variable value by the alias name.
                scope, attr_name = _parent_scope(name)
                spec = index_spec(self, scope)
                setattr(spec, attr_name, other_name)
                break

which is called when .validate() is called.

Here, we .validate() before getting the vocabulary sizes:

model_spec.validate()
self._check_vocabulary_size("source", src_vocab, model_spec.source_vocabulary_size)
self._check_vocabulary_size("target", tgt_vocab, model_spec.target_vocabulary_size)

But, these {source,target}_vocabulary_size property/methods do not handle aliases:

@property
def source_vocabulary_size(self):
    return self.encoder.embeddings.weight.shape[0]

@property
def target_vocabulary_size(self):
    return self.decoder.embeddings.weight.shape[0]

--->

MODEL_SPEC AFTER VALIDATE {'weight': 'decoder/embeddings/weight', 'multiply_by_sqrt_depth': 'decoder/embeddings/multiply_by_sqrt_depth'}
Traceback (most recent call last):
  File "/home/moses/CTranslate2/env_onmt/bin/onmt_release_model", line 8, in <module>
    sys.exit(main())
  File "/home/moses/CTranslate2/env_onmt/lib/python3.6/site-packages/onmt/bin/release_model.py", line 52, in main
    converter.convert(opt.output, model_spec, force=True)
  File "/home/moses/CTranslate2/env_onmt/lib/python3.6/site-packages/ctranslate2/converters/converter.py", line 74, in convert
    self._check_vocabulary_size("source", src_vocab, model_spec.source_vocabulary_size)
  File "/home/moses/CTranslate2/env_onmt/lib/python3.6/site-packages/ctranslate2/specs/transformer_spec.py", line 32, in source_vocabulary_size
    return self.encoder.embeddings.weight.shape[0]
AttributeError: 'str' object has no attribute 'shape'

Am I missing something here?
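
One possible direction, sketched with the helpers used in _alias_variables above (untested, illustrative only): resolve the alias before reading the shape.

@property
def source_vocabulary_size(self):
    weight = self.encoder.embeddings.weight
    if isinstance(weight, str):
        # The weight was aliased to another variable's name; resolve it first.
        scope, attr_name = _parent_scope(weight)
        weight = getattr(index_spec(self, scope), attr_name)
    return weight.shape[0]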

Invalid resource handle when deleting ctranslate2.Translator

Hi @guillaumekln
There seems to be an issue when deleting a model from a device other than the 0th one.

import ctranslate2
translator = ctranslate2.Translator(
    "enes_general_medium_ctranslate2",
    device="cuda",
    device_index=0)
del translator

--> OK

import ctranslate2
translator = ctranslate2.Translator(
    "enes_general_medium_ctranslate2",
    device="cuda",
    device_index=1)
del translator

--> ERROR

terminate called after throwing an instance of 'std::runtime_error'
  what():  /root/ctranslate2-dev/src/primitives/cuda.cu:72: CUDA failed with error invalid resource handle
Aborted (core dumped)

(Inference works fine though, it's only when deleting the object that it fails.)

EDIT: This also happens when using the cli entrypoint ctranslate2/bin/translate.

Statically link to Intel MKL

We currently generate a custom shared library for Intel MKL. Instead, we should consider statically linking against it.

Pros:

  • no need to move the custom library around
  • allow linking to the GNU OpenMP library gomp instead of iomp5

Cons:

  • larger binary size (some symbols could be duplicated in libctranslate2.so and libmkldnn.so if the later also statically links against MKL)

Read model and vocabs from memory

Hi @guillaumekln

I would like to load the model file from memory (in a std::vector<unsigned char>), but I think it's not possible, as all the related methods at some point use the model directory as an std::string. I can see the necessity of this, as the vocabularies and the vmap are also loaded from this directory.

Still, do you think there could be a use case (apart from mine, obviously :)) for some overloads with arguments that accept std::strings pointing directly to the model and the vocabularies?

OpenNMT-py model conversion failed because of KeyError

I'm trying to convert OpenNMT-py model to CTranslate2 format, but it fails because of KeyError. The model that I'm trying to convert is available here (it is named paracrawl.pt but it was renamed during uploading).


When I try to run conversion:

ct2-opennmt-py-converter --model_path paracrawl.pt --model_spec TransformerBase --output_dir paracrawl

It fails with KeyError:

Traceback (most recent call last):
  File "/usr/local/bin/ct2-opennmt-py-converter", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/bin/opennmt_py_converter.py", line 11, in main
    converters.OpenNMTPyConverter(args.model_path).convert_from_args(args)
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/converter.py", line 35, in convert_from_args
    return self.convert(
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/converter.py", line 52, in convert
    src_vocab, tgt_vocab = self._load(model_spec)
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 27, in _load
    set_transformer_spec(model_spec, variables)
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 39, in set_transformer_spec
    set_transformer_encoder(spec.encoder, variables, relative=spec.with_relative_position)
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 43, in set_transformer_encoder
    set_input_layers(spec, variables, "encoder", relative=relative)
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 59, in set_input_layers
    set_position_encodings(
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 136, in set_position_encodings
    spec.encodings = _get_variable(variables, "%s.pe" % scope).squeeze()
  File "/usr/local/lib/python3.8/site-packages/ctranslate2/converters/opennmt_py.py", line 141, in _get_variable
    return variables[name].numpy()
KeyError: 'encoder.embeddings.make_embedding.pe.pe'

I'm using Python 3.8 on my custom python:buster Docker image with these Python packages installed:

Package              Version
-------------------- ----------
absl-py              0.9.0
cachetools           4.0.0
certifi              2019.11.28
chardet              3.0.4
click                7.1.1
ConfigArgParse       1.0
ctranslate2          1.8.0
Flask                1.1.1
future               0.18.2
google-auth          1.11.3
google-auth-oauthlib 0.4.1
grpcio               1.27.2
idna                 2.9
itsdangerous         1.1.0
Jinja2               2.11.1
Markdown             3.2.1
MarkupSafe           1.1.1
numpy                1.18.1
oauthlib             3.1.0
OpenNMT-py           1.0.2
pip                  19.3.1
protobuf             3.11.3
pyasn1               0.4.8
pyasn1-modules       0.2.8
pyonmttok            1.18.3
requests             2.23.0
requests-oauthlib    1.3.0
rsa                  4.0
setuptools           41.6.0
six                  1.14.0
tensorboard          2.1.1
torch                1.4.0
torchtext            0.4.0
tqdm                 4.30.0
urllib3              1.25.8
waitress             1.4.3
Werkzeug             1.0.0
wheel                0.33.6

Better catch of CUDA OOMs

Hi,
While using the python ctranslate2.Translator API, it seems that an OOM can cause the whole python session to crash.

>>> import ctranslate2
>>> translator = ctranslate2.Translator("ende_ctranslate2/")
>>> translator.translate_batch([["a"]*20000]) # very long dummy batch to force OOM for reproducibility
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to allocate memory
Aborted (core dumped)

Would it be possible to better catch such exceptions so that we can handle them on the Python side?
Thanks!

OpenNMT-tf 2.0 supported?

I trained a Transformer model using OpenNMT-tf 2.0. The converter ran well but the translation result became weird. Does CTranslate2 support OpenNMT-tf 2.0?
Here are versions:
OpenNMT-tf == 2.3.0
tensorflow-gpu == 2.0.0

Support ONNX graphs

This is a general issue to discuss and track ONNX support.

The current limitation of the project is that only weights are extracted from pretrained models and the computation graph is redefined in the code itself. This could be mitigated by loading and executing ONNX graphs.

Release Python package on PyPI

This would make the installation easier for users but this could make the packaging more complex, especially for GPU support.

This issue is to track progress on this front.

How to install Ctranslate2 without Docker

Hi,
I'd like to install the CTranslate2 module without using Docker. Is that possible?
Are there any scripts for this? I've tried generating a shell script from the Dockerfile, but it gives me some errors.
Thanks

Much slower with CUDA 10.1 than with 10.0

I haven't tested this extensively, but a small test seems to indicate slower times when using CUDA 10.1 (Update 2, i.e. the latest) vs CUDA 10.0 (as used in the Dockerfile). It's around 1.5 times slower. Have you tried using CUDA 10.1, and have you seen similar results?
