Comments (7)

gbolmier commented on June 2, 2024

From what I could find online, I think this is actually due to how Python and the operating system manage memory.

See the example below (note that psutil.virtual_memory is system-wide, so I also print the Resident Set Size (rss) of the current Python process):

import pickle
import psutil
import sys
import random

from river.utils.pretty import humanize_bytes

def print_vmem_and_rss():
    vmem = humanize_bytes(psutil.virtual_memory().used)
    rss = humanize_bytes(psutil.Process().memory_info().rss)
    print(f"{vmem = } | {rss = }", end="\n\n")

print_vmem_and_rss()

print("1) Create a list of 100_000_000 random floats")
my_list = [random.random() for _ in range(100_000_000)]
print(f"{humanize_bytes(sys.getsizeof(my_list)) = }")
print_vmem_and_rss()

print("2) Dump `my_list` to disk")
with open("my_list.pickle", "wb") as f:
    pickle.dump(my_list, f)
print_vmem_and_rss()

print("3) Load `my_list` from disk into `my_list2`")
with open("my_list.pickle", "rb") as f:
    my_list2 = pickle.load(f)
print(f"{humanize_bytes(sys.getsizeof(my_list2)) = }")
print_vmem_and_rss()

which outputs:

vmem = '17.59 GB' | rss = '38.3 MB'

1) Create a list of 100_000_000 random floats
humanize_bytes(sys.getsizeof(my_list)) = '796.44 MB'
vmem = '20.85 GB' | rss = '3.09 GB'

2) Dump `my_list` to disk
vmem = '19.55 GB' | rss = '3.78 GB'

3) Load `my_list` from disk into `my_list2`
humanize_bytes(sys.getsizeof(my_list2)) = '785.06 MB'
vmem = '22.29 GB' | rss = '6.85 GB'

Although sys.getsizeof reports less than 800 MB for the list object, rss increases by ≅ 3 GB on my machine both when creating the list and when loading it from disk. I'm not sure other serialization formats would perform better in Python. I'm curious whether someone wants to run such experiments and report back here.
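One likely reason for the gap, sketched here under the assumption of a 64-bit CPython build: sys.getsizeof is shallow. For a list it counts only the list header and its array of pointers, not the float objects those pointers refer to. The snippet below uses a scaled-down list (the size `n` is an illustrative choice, not the 100_000_000 from the example) and adds up the element sizes separately.

```python
import random
import sys

n = 100_000  # scaled down from the 100_000_000 used in the example above
my_list = [random.random() for _ in range(n)]

# Shallow size: the list header plus one pointer per slot.
shallow = sys.getsizeof(my_list)

# Each float is its own heap object, so its size has to be counted separately.
# Every element here is a distinct object, so a plain sum does not double count.
elements = sum(sys.getsizeof(x) for x in my_list)

print(f"shallow list size: {shallow:,} bytes")
print(f"float objects:     {elements:,} bytes")
print(f"deep / shallow:    {(shallow + elements) / shallow:.1f}x")
```

On a typical 64-bit build each float takes 24 bytes, so 100_000_000 floats would account for roughly 2.4 GB on top of the pointer array, which is in the same range as the rss increase reported above.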

from river.

gbolmier commented on June 2, 2024

Hey @jpfeil, would you mind providing a minimal snippet reproducing the observed behaviour on a toy dataset? If possible, with a dataset from the datasets module.

jpfeil commented on June 2, 2024

Hi @gbolmier

I haven't tried it with the datasets data because I think you need to make a pretty large model to see this effect. So I made some synthetic data that will give you an idea of what is happening. Basically, when the model is dumped, it takes a lot of memory to create the pickled model but the memory is eventually released. However, when the model is loaded, the memory spikes but is then not released, so I suspect there is a reference that is keeping the pickle VM around longer than needed. But I'm not sure.

from sklearn.datasets import make_classification
from river.forest import ARFClassifier
from tqdm import tqdm
import pickle
import psutil

X, y = make_classification(n_samples=1000,
                           n_features=1000,
                           n_informative=800,
                           n_clusters_per_class=100)

model = ARFClassifier(n_models=300)

for i in tqdm(range(X.shape[0])):
    
    xi = dict((f"x{n}", v) for n, v in enumerate(X[i]))
    yi = y[i]
    
    model.learn_one(xi, yi)
    
print(model._memory_usage)

with open("test.pkl", "wb") as f:
    pickle.dump(model, f)

Now start a new Python session:

import pickle
import psutil

initial_memory = psutil.virtual_memory().used
with open("test.pkl", "rb") as f:
    rmodel = pickle.load(f)

rmodel._memory_usage
> '739.71 MB'

final_memory = psutil.virtual_memory().used
# Parenthesize the difference before converting bytes to GB
print("RAM Used (GB):", (final_memory - initial_memory) / 1_000_000_000)
> RAM Used (GB): 5.968044032
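Since psutil.virtual_memory() is system-wide, other processes can shift that number. As a rough per-process cross-check, the standard library's tracemalloc can report what a load retains and what it peaks at. This is only a sketch: `payload` is a hypothetical stand-in for the pickled model, and tracemalloc only sees allocations made through Python's allocators, so its figures will not match rss.

```python
import pickle
import tracemalloc

# Hypothetical stand-in for the pickled model: many small nested Python objects.
payload = pickle.dumps(
    [{"x": float(i), "children": [i, i + 1]} for i in range(50_000)]
)

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
obj = pickle.loads(payload)
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# `after - before` is what the loaded object still holds;
# `peak - before` also includes temporary allocations made during the load.
print(f"retained after load: {(after - before) / 1e6:.1f} MB")
print(f"peak during load:    {(peak - before) / 1e6:.1f} MB")
```

Comparing the two figures gives a hint of how much of the spike is temporary and how much is kept alive by the loaded object.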

jpfeil commented on June 2, 2024

Thanks, @gbolmier! Yeah, I wonder if a simple type change could improve memory efficiency. I'm not familiar with the implementation code, but I wonder if using an array instead of a list when possible would lead to a significant improvement.
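To give a feel for what such a type change could save, here is a small comparison using only the standard library. It assumes the values are plain floats that fit array.array's "d" (C double) typecode. Whether river's internals could actually be stored this way is not something this sketch checks.

```python
import array
import random
import sys

values = [random.random() for _ in range(100_000)]

# A list stores pointers to separate float objects, so count both parts.
as_list = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array.array packs the raw C doubles into one contiguous buffer,
# and its reported size includes that buffer.
packed = array.array("d", values)
as_array = sys.getsizeof(packed)

print(f"list of floats: {as_list:,} bytes")
print(f"array('d'):     {as_array:,} bytes")
```

The trade-off is that an array can only hold one primitive type, and reading an element creates a new float object on each access.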

jpfeil commented on June 2, 2024

Also, I don't see the memory difference when I train the model, only when I pickle it. If it was just the list usage, then I should see the same memory usage when training, right? The training should look like case 1, but I don't see that with the river model. I see the expected memory usage.

jpfeil commented on June 2, 2024

I figured it out finally! The pickle VM sticks around to support memoization. Apparently, this is needed for recursive (self-referential) objects. So, as long as the river model doesn't contain any, you can set "fast" mode, which does not set up memoization, and the memory doesn't blow up.

with open("test-fast.pkl", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True
    p.dump(model)

I haven't tested whether this affects predictive performance, but this solves the memory issue.
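To see what fast mode changes, the sketch below pickles a small object that references the same list twice and counts the memo-related opcodes with pickletools. The helper names `dumps_fast` and `count_memo_ops` are made up for this illustration, and the toy dict is not a river model.

```python
import io
import pickle
import pickletools


def dumps_fast(obj):
    """Pickle `obj` with the deprecated `Pickler.fast` flag set (no memo)."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf)
    pickler.fast = True
    pickler.dump(obj)
    return buf.getvalue()


def count_memo_ops(data):
    """Count opcodes that store an object in the unpickler's memo."""
    memo_ops = {"PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"}
    return sum(1 for op, _arg, _pos in pickletools.genops(data) if op.name in memo_ops)


shared = [1.0, 2.0, 3.0]
obj = {"a": shared, "b": shared}  # the same list is referenced twice

default_bytes = pickle.dumps(obj)
fast_bytes = dumps_fast(obj)

print("memo opcodes (default):", count_memo_ops(default_bytes))
print("memo opcodes (fast):   ", count_memo_ops(fast_bytes))

default_copy = pickle.loads(default_bytes)
fast_copy = pickle.loads(fast_bytes)

# Without a memo, the shared list is written twice and comes back as two lists.
print("shared identity kept (default):", default_copy["a"] is default_copy["b"])
print("shared identity kept (fast):   ", fast_copy["a"] is fast_copy["b"])
```

The values survive the round trip either way, but in fast mode shared sub-objects are duplicated on load, and the pickle documentation warns against using it with self-referential objects. Whether a river model relies on shared references is not verified here.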

gbolmier commented on June 2, 2024

Glad to hear! Neat trick indeed 😁 (Please be aware that Pickler.fast is deprecated, though.)

@jpfeil wrote: "Also, I don't see the memory difference when I train the model, only when I pickle it. If it was just the list usage, then I should see the same memory usage when training, right?"

This is because you're looking at the wrong memory usage metric: psutil.virtual_memory() is system-wide. Resident set size (rss) instead measures the RAM used by the Python process itself. See the increase with this code:

import pickle
import psutil

from river.forest import ARFClassifier
from river.utils.pretty import humanize_bytes
from sklearn.datasets import make_classification

def print_vmem_and_rss():
    vmem = humanize_bytes(psutil.virtual_memory().used)
    rss = humanize_bytes(psutil.Process().memory_info().rss)
    print(f"{vmem = } | {rss = }", end="\n\n")

print_vmem_and_rss()

print("1) Create a dataset of 1_000 samples by 1_000 features")
X, y = make_classification(
    n_samples=1000,
    n_features=1000,
    n_informative=800,
    n_clusters_per_class=100,
)
print_vmem_and_rss()

print("2) Instantiate an ARF classifier")
model = ARFClassifier(n_models=300)
print(f"{model._memory_usage = }")
print_vmem_and_rss()

print("3) Train the ARF classifier on the created dataset")
for i in range(X.shape[0]):
    xi = dict((f"x{n}", v) for n, v in enumerate(X[i]))
    yi = y[i]
    model.learn_one(xi, yi)
print(f"{model._memory_usage = }")
print_vmem_and_rss()

print("4) Dump `model` to disk")
with open("test.pkl", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True
    p.dump(model)
print_vmem_and_rss()

print("5) Load `model` from disk into `model2`")
with open("test.pkl", "rb") as f:
    model2 = pickle.load(f)
print(f"{model2._memory_usage = }")
print_vmem_and_rss()

which outputs:

vmem = '22.05 GB' | rss = '130.98 MB'

1) Create a dataset of 1_000 samples by 1_000 features
vmem = '22.18 GB' | rss = '269.94 MB'

2) Instantiate an ARF classifier
model._memory_usage = '0.99 MB'
vmem = '22.18 GB' | rss = '271.39 MB'

3) Train the ARF classifier on the created dataset
model._memory_usage = '1.06 GB'
vmem = '23.59 GB' | rss = '1.77 GB'

4) Dump `model` to disk
vmem = '23.01 GB' | rss = '1.77 GB'

5) Load `model` from disk into `model2`
model2._memory_usage = '1.12 GB'
vmem = '24.03 GB' | rss = '2.78 GB'

Note that after training, rss increased by ≅ 500 MB more than the reported model size.
