Comments (7)
From what I could find online, I think this is actually down to how Python and the OS manage memory.
See the example below (note that `psutil.virtual_memory` is system-wide, so I also print the current Python process's resident set size (RSS)):
```python
import pickle
import random
import sys

import psutil
from river.utils.pretty import humanize_bytes


def print_vmem_and_rss():
    vmem = humanize_bytes(psutil.virtual_memory().used)
    rss = humanize_bytes(psutil.Process().memory_info().rss)
    print(f"{vmem = } | {rss = }", end="\n\n")


print_vmem_and_rss()

print("1) Create a list of 100_000_000 random floats")
my_list = [random.random() for _ in range(100_000_000)]
print(f"{humanize_bytes(sys.getsizeof(my_list)) = }")
print_vmem_and_rss()

print("2) Dump `my_list` to disk")
with open("my_list.pickle", "wb") as f:
    pickle.dump(my_list, f)
print_vmem_and_rss()

print("3) Load `my_list` from disk into `my_list2`")
with open("my_list.pickle", "rb") as f:
    my_list2 = pickle.load(f)
print(f"{humanize_bytes(sys.getsizeof(my_list2)) = }")
print_vmem_and_rss()
```
which outputs:
```
vmem = '17.59 GB' | rss = '38.3 MB'

1) Create a list of 100_000_000 random floats
humanize_bytes(sys.getsizeof(my_list)) = '796.44 MB'
vmem = '20.85 GB' | rss = '3.09 GB'

2) Dump `my_list` to disk
vmem = '19.55 GB' | rss = '3.78 GB'

3) Load `my_list` from disk into `my_list2`
humanize_bytes(sys.getsizeof(my_list2)) = '785.06 MB'
vmem = '22.29 GB' | rss = '6.85 GB'
```
Despite the list object itself being under 800 MB, memory usage increases by ≈ 3 GB on my machine when creating the list or loading it from disk. Note that `sys.getsizeof` on a list only counts the list's pointer array, not the float objects it points to, so the real footprint is much larger. I'm not sure whether other serialization formats would perform better in Python. Curious if someone wants to run such experiments and report here.
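For intuition on where the extra gigabytes go, here's a rough accounting (assuming a typical 64-bit CPython): the list stores 8-byte pointers, and each float is a separate 24-byte heap object on top of that.

```python
import random
import sys

floats = [random.random() for _ in range(1_000_000)]

# The list object itself: header plus one pointer slot per element.
pointer_array = sys.getsizeof(floats)

# Each float is its own heap object (24 bytes on 64-bit CPython).
boxed_floats = sum(sys.getsizeof(v) for v in floats)

print(f"list object:   {pointer_array / 1e6:.1f} MB")
print(f"float objects: {boxed_floats / 1e6:.1f} MB")
```

Scaled to 100 million floats, the boxed floats alone come to roughly 2.4 GB on top of the ~800 MB pointer array, which together with allocator overhead is in the ballpark of the ~3 GB increase observed above.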
from river.
Hey @jpfeil, would you mind providing a minimal snippet reproducing the observed behaviour on a toy dataset? If possible, with a dataset from the `datasets` module.
Hi @gbolmier,
I haven't tried it with the `datasets` data because I think you need a pretty large model to see this effect, so I made some synthetic data that gives an idea of what is happening. Basically, when the model is dumped, creating the pickle takes a lot of memory, but that memory is eventually released. When the model is loaded, however, memory spikes and is then not released, so I suspect there is a reference keeping the pickle machinery around longer than needed. But I'm not sure.
```python
import pickle

import psutil
from river.forest import ARFClassifier
from sklearn.datasets import make_classification
from tqdm import tqdm

X, y = make_classification(
    n_samples=1000,
    n_features=1000,
    n_informative=800,
    n_clusters_per_class=100,
)

model = ARFClassifier(n_models=300)

for i in tqdm(range(X.shape[0])):
    xi = {f"x{n}": v for n, v in enumerate(X[i])}
    yi = y[i]
    model.learn_one(xi, yi)

print(model._memory_usage)

with open("test.pkl", "wb") as f:
    pickle.dump(model, f)
```
Now start a new Python session:
```python
import pickle

import psutil

initial_memory = psutil.virtual_memory().used

with open("test.pkl", "rb") as f:
    rmodel = pickle.load(f)

print(rmodel._memory_usage)
# > '739.71 MB'

final_memory = psutil.virtual_memory().used
print("RAM Used (GB):", (final_memory - initial_memory) / 1_000_000_000)
# > RAM Used (GB): 5.968044032
```
Thanks, @gbolmier! Yeah, I wonder if a simple type change could improve memory efficiency. I'm not familiar with the implementation code, but using an array instead of a list where possible might lead to a significant improvement.
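For what it's worth, here's a rough comparison (a generic sketch, not river's internals) of storing a million floats as a plain list versus the standard library's `array` module, which packs doubles contiguously without per-element objects:

```python
import array
import random
import sys

values = [random.random() for _ in range(1_000_000)]

# Plain list: pointer array plus one boxed float object per element.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array('d'): raw 8-byte doubles, no per-element boxing.
packed = array.array("d", values)
array_bytes = sys.getsizeof(packed)

print(f"list:  {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")
```

The packed array is roughly a quarter of the list's footprint, though whether river's tree structures could actually use one is a separate question.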
from river.
Also, I don't see the memory difference when I train the model, only when I pickle it. If it were just the list usage, then I should see the same memory growth during training, right? Training should look like case 1 above, but with the river model I just see the expected memory usage.
I finally figured it out! The pickler keeps a memo table around to support memoization, which is needed to preserve shared and self-referential objects. So, as long as river's models don't contain such references, you can enable "fast" mode, which skips the memo table, and the memory no longer blows up:
```python
with open("test-fast.pkl", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True
    p.dump(model)
```
I haven't tested whether this affects predictive performance, but this solves the memory issue.
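To illustrate what the memo table does, and what `fast` mode gives up: the default pickler records every object it has seen, so shared references are written once and object identity survives a round trip; with `fast = True` the memo is skipped and shared objects get duplicated on load. A small sketch:

```python
import io
import pickle

shared = [1, 2, 3]
obj = [shared, shared]  # two references to the same list

# Default pickler: the memo preserves shared identity.
loaded = pickle.loads(pickle.dumps(obj))
print(loaded[0] is loaded[1])  # True

# Fast mode: no memo, so the shared list is written (and loaded) twice.
buf = io.BytesIO()
p = pickle.Pickler(buf)
p.fast = True
p.dump(obj)
loaded_fast = pickle.loads(buf.getvalue())
print(loaded_fast[0] is loaded_fast[1])  # False
```

So if a river model internally shares state between components, fast mode would silently duplicate it on load (and genuinely cyclic objects fail to pickle at all), which seems worth checking before relying on the trick.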
Glad to hear! Neat trick indeed 😁 (please be aware that `fast` mode is deprecated, though)
> Also, I don't see the memory difference when I train the model, only when I pickle it. If it was just the list usage, then I should see the same memory usage when training, right?

This is because you're looking at the wrong memory metric: `psutil.virtual_memory()` is system-wide, whereas the resident set size (RSS) measures the Python process's own RAM usage. See the increase with this code:
```python
import pickle

import psutil
from river.forest import ARFClassifier
from river.utils.pretty import humanize_bytes
from sklearn.datasets import make_classification


def print_vmem_and_rss():
    vmem = humanize_bytes(psutil.virtual_memory().used)
    rss = humanize_bytes(psutil.Process().memory_info().rss)
    print(f"{vmem = } | {rss = }", end="\n\n")


print_vmem_and_rss()

print("1) Create a dataset of 1_000 samples by 1_000 features")
X, y = make_classification(
    n_samples=1000,
    n_features=1000,
    n_informative=800,
    n_clusters_per_class=100,
)
print_vmem_and_rss()

print("2) Instantiate an ARF classifier")
model = ARFClassifier(n_models=300)
print(f"{model._memory_usage = }")
print_vmem_and_rss()

print("3) Train the ARF classifier on the created dataset")
for i in range(X.shape[0]):
    xi = {f"x{n}": v for n, v in enumerate(X[i])}
    yi = y[i]
    model.learn_one(xi, yi)
print(f"{model._memory_usage = }")
print_vmem_and_rss()

print("4) Dump `model` to disk")
with open("test.pkl", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True
    p.dump(model)
print_vmem_and_rss()

print("5) Load `model` from disk into `model2`")
with open("test.pkl", "rb") as f:
    model2 = pickle.load(f)
print(f"{model2._memory_usage = }")
print_vmem_and_rss()
```

which outputs:
```
vmem = '22.05 GB' | rss = '130.98 MB'

1) Create a dataset of 1_000 samples by 1_000 features
vmem = '22.18 GB' | rss = '269.94 MB'

2) Instantiate an ARF classifier
model._memory_usage = '0.99 MB'
vmem = '22.18 GB' | rss = '271.39 MB'

3) Train the ARF classifier on the created dataset
model._memory_usage = '1.06 GB'
vmem = '23.59 GB' | rss = '1.77 GB'

4) Dump `model` to disk
vmem = '23.01 GB' | rss = '1.77 GB'

5) Load `model` from disk into `model2`
model2._memory_usage = '1.12 GB'
vmem = '24.03 GB' | rss = '2.78 GB'
```
Note that the RAM usage increased by ≅ 500 MB more than the model size after training it.
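To separate live objects from allocator overhead and fragmentation, `tracemalloc` can show how much of an RSS spike corresponds to Python objects actually kept alive after a load. A generic sketch (plain nested lists standing in for a model):

```python
import pickle
import tracemalloc

# A stand-in for a large model: many small heap objects.
data = [[float(i)] for i in range(100_000)]
blob = pickle.dumps(data)

tracemalloc.start()
loaded = pickle.loads(blob)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# `peak` includes transient allocations made during unpickling;
# `current` is what is still referenced once the load finishes.
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so any gap between its numbers and the RSS increase points at the C level (allocator arenas not returned to the OS, fragmentation, etc.).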