Comments (5)
Hello @Qeenon ,
Thank you for raising this issue. I tried reproducing the error and observed that the sanitizer reports a leak of 20 bytes when variables are stored on the GPU. The problem does not seem to appear when loading on the CPU.
I tried looping over a sequence of model loading/inference (see gist https://gist.github.com/guillaume-be/76e0d287dc125592e8a2088cc48f7066). No memory leak seems to be visible after ~ 5000 iterations. I have an intuition that the issue could come from the tokenizer loading, and not necessarily from the model itself. Which tokenizer are you loading?
Are you using a GPU for your service? Is there a model in particular for which the memory leak is more severe? Would you be able to share a snippet of code to reproduce the issue?
I will raise the GPU memory leak with the author of the torch bindings and see whether it could originate there. Note that the models are not meant to be reloaded for every query; using lazy_static or the batched_fn crate (see the example at https://github.com/epwalsh/rust-dl-webserver) is the appropriate way to serve such models.
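A minimal sketch of that load-once pattern, using std's OnceLock as a stand-in for lazy_static (the Model type and all names here are hypothetical placeholders for e.g. a rust-bert TranslationModel):

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for an expensive-to-load model; in a real
// service this would be e.g. a rust-bert TranslationModel.
struct Model {
    name: String,
}

impl Model {
    fn load() -> Model {
        // Expensive work (downloading weights, building the graph)
        // would happen here, exactly once.
        println!("loading model (should happen once)");
        Model { name: "marian-en-ru".to_string() }
    }

    fn translate(&self, input: &str) -> String {
        format!("[{}] {}", self.name, input)
    }
}

// Lazily initialized on first use, then shared by every request,
// mirroring what lazy_static! achieves in the linked example.
fn model() -> &'static Model {
    static MODEL: OnceLock<Model> = OnceLock::new();
    MODEL.get_or_init(Model::load)
}

fn main() {
    // Both calls hit the same cached instance; load() runs only once.
    let a = model().translate("hello");
    let b = model().translate("world");
    assert_eq!(a, "[marian-en-ru] hello");
    assert_eq!(b, "[marian-en-ru] world");
    // Both references point at the same static instance.
    assert!(std::ptr::eq(model(), model()));
    println!("ok");
}
```

In a web service, each handler would call `model()` instead of constructing the model itself, so memory use stays constant regardless of the number of queries.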
from rust-bert.
Yet I'm not 100% sure the leaks are related to the models; so far I'm trying to determine whether that is the case.
This file loads the QA / Conversation / Translation models. On this commit they were loaded on demand, and after some time the bot would grow to around 25GB of RAM. I've now changed the code to put those inside lazy_static, and so far it's going okay-ish, but I'd like to let it keep running for some more time to be sure the growth was related to the models.
They were running on CPU (I didn't set up CUDA properly on the host machine).
Translation example (see gist at https://gist.github.com/guillaume-be/60d4a4a61ec16d21478ba497d517a054):
This does not raise any warning using RUSTFLAGS=-Zsanitizer=leak cargo run -Zbuild-std --target x86_64-unknown-linux-gnu translation.
Below I am including the logs from valgrind target/debug/examples/translation --leak-check=full, which seem to indicate that some memory is not freed at exit: https://gist.github.com/guillaume-be/c278ff8c9665264ef901736ea53ab88f
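As an aside on the invocation above: valgrind reads its own options only before the program under test, and anything placed after the binary name is forwarded as arguments to the binary itself. That would explain why the log still suggests rerunning with --leak-check=full. A sketch of the likely intended command:

```shell
# valgrind options must precede the target binary; flags placed after
# the binary are passed to the binary, not to valgrind.
valgrind --leak-check=full target/debug/examples/translation
```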
Some of the warnings seem to be caused by the reqwest library. I tried to rerun a minimal example:
extern crate anyhow;

use rust_bert::resources::{Resource, RemoteResource};
use rust_bert::marian::{MarianModelResources, MarianVocabResources, MarianSpmResources, MarianConfigResources};

fn main() -> anyhow::Result<()> {
    let model_resource = Resource::Remote(RemoteResource::from_pretrained(MarianModelResources::ENGLISH2RUSSIAN));
    let vocab_resource = Resource::Remote(RemoteResource::from_pretrained(MarianVocabResources::ENGLISH2RUSSIAN));
    let merge_resource = Resource::Remote(RemoteResource::from_pretrained(MarianSpmResources::ENGLISH2RUSSIAN));
    let config_resource = Resource::Remote(RemoteResource::from_pretrained(MarianConfigResources::ENGLISH2RUSSIAN));
    let _out1 = model_resource.get_local_path();
    let _out2 = vocab_resource.get_local_path();
    let _out3 = merge_resource.get_local_path();
    let _out4 = config_resource.get_local_path();
    Ok(())
}
valgrind log (minimal example):

==28563== Memcheck, a memory error detector
==28563== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28563== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==28563== Command: target/debug/examples/resource_download --leak-check=full
==28563==
==28563== Warning: set address range perms: large range [0x4dab000, 0x40fba000) (defined)
==28563== Warning: set address range perms: large range [0x40fba000, 0x51cf0000) (defined)
==28563== Source and destination overlap in memcpy_chk(0x1ffefff5c0, 0x1ffefff5c0, 5)
==28563==    at 0x4843BF0: __memcpy_chk (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==28563==    by 0x44EC0F83: cpuinfo_linux_parse_cpulist (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBFC96: cpuinfo_linux_get_max_possible_processor (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBDFA1: cpuinfo_x86_linux_init (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x51FDF47E: __pthread_once_slow (pthread_once.c:116)
==28563==    by 0x44EBA3B6: cpuinfo_initialize (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B157: at::native::compute_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B30C: at::native::get_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x42706CB8: THFloatVector_startup::THFloatVector_startup() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4197ED85: _GLOBAL__sub_I_THVector.cpp (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==28563==    by 0x4011C90: call_init (dl-init.c:30)
==28563==    by 0x4011C90: _dl_init (dl-init.c:119)
==28563==

[three further "Source and destination overlap in memcpy_chk" warnings with near-identical stacks, entering via cpuinfo_linux_get_max_present_processor, cpuinfo_linux_detect_possible_processors and cpuinfo_linux_detect_present_processors]

==28563== Thread 2 reqwest-internal:
==28563== Syscall param statx(file_name) points to unaddressable byte(s)
==28563==    at 0x522579FE: statx (statx.c:29)
==28563==    by 0xAB4B00: statx (weak.rs:134)
==28563==    by 0xAB4B00: std::sys::unix::fs::try_statx (fs.rs:123)
==28563==    by 0xAB30A7: std::sys::unix::fs::stat (fs.rs:1105)
==28563==    by 0x510D3D: std::fs::metadata (fs.rs:1567)
==28563==    by 0x5126E1: openssl_probe::find_certs_dirs::{{closure}} (lib.rs:31)
==28563==    by 0x5125ED: core::ops::function::impls:: for &mut F>::call_mut (function.rs:269)
==28563==    by 0x510E3E: core::iter::traits::iterator::Iterator::find::check::{{closure}} (iterator.rs:2227)
==28563==    by 0x514733: core::iter::adapters::map::map_try_fold::{{closure}} (map.rs:87)
==28563==    by 0x5116F9: core::iter::traits::iterator::Iterator::try_fold (iterator.rs:1888)
==28563==    by 0x514454: as core::iter::traits::iterator::Iterator>::try_fold (map.rs:113)
==28563==    by 0x514642: core::iter::traits::iterator::Iterator::find (iterator.rs:2231)
==28563==    by 0x510C5C: as core::iter::traits::iterator::Iterator>::next (filter.rs:55)
==28563==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==28563==

[an identical "Syscall param statx(buf) points to unaddressable byte(s)" warning with the same stack]

==28563== HEAP SUMMARY:
==28563==     in use at exit: 1,897,131 bytes in 30,570 blocks
==28563==   total heap usage: 379,202 allocs, 348,632 frees, 54,173,510 bytes allocated
==28563==
==28563== LEAK SUMMARY:
==28563==    definitely lost: 114 bytes in 1 blocks
==28563==    indirectly lost: 0 bytes in 0 blocks
==28563==      possibly lost: 2,120 bytes in 8 blocks
==28563==    still reachable: 1,894,897 bytes in 30,561 blocks
==28563==         suppressed: 0 bytes in 0 blocks
==28563== Rerun with --leak-check=full to see details of leaked memory
==28563==
==28563== For lists of detected and suppressed errors, rerun with: -s
==28563== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 0 from 0)
I am not quite sure what is going on here, this goes a bit beyond my comfort zone.
@jerry73204 I see you did some troubleshooting on the tch-rs crate related to memory leaks; do you have an idea of what may be happening here?
@proycon, @epwalsh reaching out to you as well for support, if you have some time.
Note I also ran the following script for ~200 iterations and did not notice a significant increase in memory consumption (stable at ~2544MB): https://gist.github.com/guillaume-be/34a982ca33749ba4be2951836ab36b97
I also ran an end-to-end translation example, including reloading the entire model and tokenizer at each iteration (see https://gist.github.com/guillaume-be/06bbc56639522d8745f2d357b310bc17). I ran the script for 20 minutes (500 full model reloads and translations); the memory consumption remained stable at 2560MB. I could run it longer, but I am unlikely to reach a 25GB memory use in a realistic amount of time.
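The memory monitoring used in these experiments can be sketched as follows, reading the process's resident set size from /proc/self/status on Linux (the allocation loop is a hypothetical stand-in for a model reload; `rss_kb` is an illustrative helper, not part of rust-bert):

```rust
use std::fs;

// Read the process's resident set size (in kB) from /proc/self/status.
// Linux-only; returns None if the field cannot be found or parsed.
fn rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|l| l.starts_with("VmRSS:"))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    let baseline = rss_kb().unwrap_or(0);
    // Stand-in for repeated model reloads: allocate and drop a large buffer.
    for _ in 0..100 {
        let buf = vec![0u8; 10 * 1024 * 1024];
        drop(buf);
    }
    let after = rss_kb().unwrap_or(0);
    // With everything correctly freed, RSS should stay roughly flat
    // between the two readings.
    println!("baseline: {} kB, after: {} kB", baseline, after);
}
```

Sampling RSS before and after the loop, rather than relying on a leak detector, is what distinguishes an actual leak from a false positive such as the one discussed below.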
@Qeenon I ran a few more experiments on my end.
Looking at the valgrind logs, there are two potential sources of leaks, both entirely related to model loading. I have compared the values with experiments running inference and can now rule out memory leaks during prediction.
For the model loading, these seem to appear when registering variables or modules to the variable store. As a validation, I ran a quick experiment: a very basic module creation using the base tch-rs library also raises a memory leak warning in valgrind:
use tch::{nn, Device};

fn main() -> anyhow::Result<()> {
    // Create a variable store on GPU if available, otherwise on CPU
    let device = Device::cuda_if_available();
    let vs = nn::VarStore::new(device);
    // Registering a single linear layer is enough to trigger the warning
    let _module = nn::linear(&vs.root() / "dense", 1024, 1024, Default::default());
    Ok(())
}
Since running this for more than a million iterations does not lead to any actual memory leak when monitoring the resources consumed by the process, I believe this is a spurious error.
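If the warning is indeed a false positive, one way to keep future valgrind runs clean is a suppression file. A sketch under that assumption (the suppression name is arbitrary and the frame patterns would need to be copied from the actual backtrace in the report; `...` matches any number of intermediate frames):

```
{
   tch_varstore_registration_false_positive
   Memcheck:Leak
   match-leak-kinds: possible,reachable
   ...
   obj:*libtorch_cpu.so
}
```

It would then be passed to valgrind with --suppressions=tch.supp so only genuinely new leaks surface in the report.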
Here is a summary of my investigations so far:
- The LeakSanitizer does not return any memory leak error for the models I tested (masked language model and translation)
- Valgrind indicates the potential for a moderate memory leak at model loading (not prediction). This is linked to the registration of variables in the variable store in tch-rs, and does not seem to result in an actual memory leak (false positive)
- I tried running several hundred rounds of model loading followed by one prediction on the translation model and did not see a noticeable increase in memory consumption
Based on the above, I would assume there is no obvious memory leak in the models from the library. Loading the model once and running predictions on demand is indeed the right way of using them - is this working for you?
Thank you for your investigations here.
Right now I really can't be sure about it; next time I will run tests on my side first.