Comments (7)
From what I could find online, I think this is actually down to how Python and the OS manage memory.
See the example below (note that `psutil.virtual_memory` is system-wide, so I also print the current Python process's resident set size (RSS)):
```python
import pickle
import random
import sys

import psutil
from river.utils.pretty import humanize_bytes


def print_vmem_and_rss():
    vmem = humanize_bytes(psutil.virtual_memory().used)
    rss = humanize_bytes(psutil.Process().memory_info().rss)
    print(f"{vmem = } | {rss = }", end="\n\n")


print_vmem_and_rss()

print("1) Create a list of 100_000_000 random floats")
my_list = [random.random() for _ in range(100_000_000)]
print(f"{humanize_bytes(sys.getsizeof(my_list)) = }")
print_vmem_and_rss()

print("2) Dump `my_list` to disk")
with open("my_list.pickle", "wb") as f:
    pickle.dump(my_list, f)
print_vmem_and_rss()

print("3) Load `my_list` from disk into `my_list2`")
with open("my_list.pickle", "rb") as f:
    my_list2 = pickle.load(f)
print(f"{humanize_bytes(sys.getsizeof(my_list2)) = }")
print_vmem_and_rss()
```
which outputs:
```
vmem = '17.59 GB' | rss = '38.3 MB'

1) Create a list of 100_000_000 random floats
humanize_bytes(sys.getsizeof(my_list)) = '796.44 MB'
vmem = '20.85 GB' | rss = '3.09 GB'

2) Dump `my_list` to disk
vmem = '19.55 GB' | rss = '3.78 GB'

3) Load `my_list` from disk into `my_list2`
humanize_bytes(sys.getsizeof(my_list2)) = '785.06 MB'
vmem = '22.29 GB' | rss = '6.85 GB'
```
Despite the list object itself being under 800 MB, memory usage increases by ≈ 3 GB on my machine when creating the list or loading it from disk. Note that `sys.getsizeof` on a list only counts the list's pointer array, not the float objects it points to, so the real footprint is much larger. I'm not sure whether other serialization formats would perform better in Python. Curious if someone wants to run such experiments and report here.
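For intuition on where the extra gigabytes go, here's a rough accounting (assuming a typical 64-bit CPython): the list stores 8-byte pointers, and each float is a separate 24-byte heap object on top of that.

```python
import random
import sys

floats = [random.random() for _ in range(1_000_000)]

# The list object itself: header plus one pointer slot per element.
pointer_array = sys.getsizeof(floats)

# Each float is its own heap object (24 bytes on 64-bit CPython).
boxed_floats = sum(sys.getsizeof(v) for v in floats)

print(f"list object:   {pointer_array / 1e6:.1f} MB")
print(f"float objects: {boxed_floats / 1e6:.1f} MB")
```

Scaled to 100 million floats, the boxed floats alone come to roughly 2.4 GB on top of the ~800 MB pointer array, which together with allocator overhead is in the ballpark of the ~3 GB increase observed above.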
from river.
Hey @jpfeil, would you mind providing a minimal snippet reproducing the observed behaviour on a toy dataset? If possible, with a dataset from the `datasets` module.
Hi @gbolmier,
I haven't tried it with the `datasets` data because I think you need a pretty large model to see this effect, so I made some synthetic data that gives an idea of what is happening. Basically, when the model is dumped, creating the pickle takes a lot of memory, but that memory is eventually released. When the model is loaded, however, memory spikes and is then not released, so I suspect there is a reference keeping the pickle machinery around longer than needed. But I'm not sure.
```python
import pickle

import psutil
from river.forest import ARFClassifier
from sklearn.datasets import make_classification
from tqdm import tqdm

X, y = make_classification(
    n_samples=1000,
    n_features=1000,
    n_informative=800,
    n_clusters_per_class=100,
)

model = ARFClassifier(n_models=300)

for i in tqdm(range(X.shape[0])):
    xi = {f"x{n}": v for n, v in enumerate(X[i])}
    yi = y[i]
    model.learn_one(xi, yi)

print(model._memory_usage)

with open("test.pkl", "wb") as f:
    pickle.dump(model, f)
```
Now start a new Python session:
```python
import pickle

import psutil

initial_memory = psutil.virtual_memory().used

with open("test.pkl", "rb") as f:
    rmodel = pickle.load(f)

print(rmodel._memory_usage)
# > '739.71 MB'

final_memory = psutil.virtual_memory().used
print("RAM Used (GB):", (final_memory - initial_memory) / 1_000_000_000)
# > RAM Used (GB): 5.968044032
```
Thanks, @gbolmier! Yeah, I wonder if a simple type change could improve memory efficiency. I'm not familiar with the implementation code, but using an array instead of a list where possible might lead to a significant improvement.
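For what it's worth, here's a rough comparison (a generic sketch, not river's internals) of storing a million floats as a plain list versus the standard library's `array` module, which packs doubles contiguously without per-element objects:

```python
import array
import random
import sys

values = [random.random() for _ in range(1_000_000)]

# Plain list: pointer array plus one boxed float object per element.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array('d'): raw 8-byte doubles, no per-element boxing.
packed = array.array("d", values)
array_bytes = sys.getsizeof(packed)

print(f"list:  {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")
```

The packed array is roughly a quarter of the list's footprint, though whether river's tree structures could actually use one is a separate question.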
from river.
Also, I don't see the memory difference when I train the model, only when I pickle it. If it were just the list usage, then I should see the same memory growth during training, right? Training should look like case 1 above, but with the river model I just see the expected memory usage.
I finally figured it out! The pickler keeps a memo table around to support memoization, which is needed to preserve shared and self-referential objects. So, as long as river's models don't contain such references, you can enable "fast" mode, which skips the memo table, and the memory no longer blows up:
```python
with open("test-fast.pkl", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True
    p.dump(model)
```
I haven't tested whether this affects predictive performance, but this solves the memory issue.
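To illustrate what the memo table does, and what `fast` mode gives up: the default pickler records every object it has seen, so shared references are written once and object identity survives a round trip; with `fast = True` the memo is skipped and shared objects get duplicated on load. A small sketch:

```python
import io
import pickle

shared = [1, 2, 3]
obj = [shared, shared]  # two references to the same list

# Default pickler: the memo preserves shared identity.
loaded = pickle.loads(pickle.dumps(obj))
print(loaded[0] is loaded[1])  # True

# Fast mode: no memo, so the shared list is written (and loaded) twice.
buf = io.BytesIO()
p = pickle.Pickler(buf)
p.fast = True
p.dump(obj)
loaded_fast = pickle.loads(buf.getvalue())
print(loaded_fast[0] is loaded_fast[1])  # False
```

So if a river model internally shares state between components, fast mode would silently duplicate it on load (and genuinely cyclic objects fail to pickle at all), which seems worth checking before relying on the trick.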
Glad to hear! Neat trick indeed 😁 (please be aware that `fast` mode is deprecated, though)
> Also, I don't see the memory difference when I train the model, only when I pickle it. If it was just the list usage, then I should see the same memory usage when training, right?

This is because you're looking at the wrong memory metric: `psutil.virtual_memory()` is system-wide, whereas the resident set size (RSS) measures the Python process's own RAM usage. See the increase with this code:
```python
import pickle

import psutil
from river.forest import ARFClassifier
from river.utils.pretty import humanize_bytes
from sklearn.datasets import make_classification


def print_vmem_and_rss():
    vmem = humanize_bytes(psutil.virtual_memory().used)
    rss = humanize_bytes(psutil.Process().memory_info().rss)
    print(f"{vmem = } | {rss = }", end="\n\n")


print_vmem_and_rss()

print("1) Create a dataset of 1_000 samples by 1_000 features")
X, y = make_classification(
    n_samples=1000,
    n_features=1000,
    n_informative=800,
    n_clusters_per_class=100,
)
print_vmem_and_rss()

print("2) Instantiate an ARF classifier")
model = ARFClassifier(n_models=300)
print(f"{model._memory_usage = }")
print_vmem_and_rss()

print("3) Train the ARF classifier on the created dataset")
for i in range(X.shape[0]):
    xi = {f"x{n}": v for n, v in enumerate(X[i])}
    yi = y[i]
    model.learn_one(xi, yi)
print(f"{model._memory_usage = }")
print_vmem_and_rss()

print("4) Dump `model` to disk")
with open("test.pkl", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True
    p.dump(model)
print_vmem_and_rss()

print("5) Load `model` from disk into `model2`")
with open("test.pkl", "rb") as f:
    model2 = pickle.load(f)
print(f"{model2._memory_usage = }")
print_vmem_and_rss()
```

which outputs:
```
vmem = '22.05 GB' | rss = '130.98 MB'

1) Create a dataset of 1_000 samples by 1_000 features
vmem = '22.18 GB' | rss = '269.94 MB'

2) Instantiate an ARF classifier
model._memory_usage = '0.99 MB'
vmem = '22.18 GB' | rss = '271.39 MB'

3) Train the ARF classifier on the created dataset
model._memory_usage = '1.06 GB'
vmem = '23.59 GB' | rss = '1.77 GB'

4) Dump `model` to disk
vmem = '23.01 GB' | rss = '1.77 GB'

5) Load `model` from disk into `model2`
model2._memory_usage = '1.12 GB'
vmem = '24.03 GB' | rss = '2.78 GB'
```
Note that the RAM usage increased by ≅ 500 MB more than the model size after training it.
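To separate live objects from allocator overhead and fragmentation, `tracemalloc` can show how much of an RSS spike corresponds to Python objects actually kept alive after a load. A generic sketch (plain nested lists standing in for a model):

```python
import pickle
import tracemalloc

# A stand-in for a large model: many small heap objects.
data = [[float(i)] for i in range(100_000)]
blob = pickle.dumps(data)

tracemalloc.start()
loaded = pickle.loads(blob)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# `peak` includes transient allocations made during unpickling;
# `current` is what is still referenced once the load finishes.
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so any gap between its numbers and the RSS increase points at the C level (allocator arenas not returned to the OS, fragmentation, etc.).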