OOM with simple scan_csv + sink_parquet about polars HOT 9 OPEN

raayu83 commented on June 5, 2024

OOM with simple scan_csv + sink_parquet

from polars.

Comments (9)

ritchie46 commented on June 5, 2024

This is hard to reproduce as is. Can you tell me more: what is your schema? How does memory increase over time? Why do you need to glob? Does that have any influence?

raayu83 commented on June 5, 2024

Just tried setting up a more complete minimal example with random dummy csv data, but can't reproduce it there.
I'll continue looking into this and report back.

ritchie46 commented on June 5, 2024

Yes, maybe compile with debug symbols and get a heaptrack report?

raayu83 commented on June 5, 2024

I managed to generate some sample data with similar behavior: 22 GB peak memory usage for a 23 GB CSV file.
The resulting flame graph from memray is attached:
memray-flamegraph.py.15.zip

The code looks like this:

import csv
import logging
import random
from pathlib import Path

import polars as pl

# Without a configured handler, the logger.info() calls below print nothing.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")
pl.show_versions()

local_folder = Path("downloads/polars_example")
local_file_csv = local_folder / "example.csv"
local_file_parquet = local_folder / "example.parquet"
local_folder.mkdir(parents=True, exist_ok=True)

# Write 200 million rows of three columns: two random strings and a constant date.
logger.info("Generating dummy CSV")
with open(local_file_csv, "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    for _ in range(200_000_000):
        writer.writerow(
            [
                "3c436c803d23be8" + str(random.random()),  # noqa
                "lEmeKiDdvHkkvpPlnvWPBAQhfG3DpFjDDDEA6ndhLX-dQXeyWvSCY" + str(random.random()),  # noqa
                "2023-01-01",
            ]
        )

# Stream the CSV into Parquet; this is the step where memory usage grows.
logger.info("Generating Parquet")
lf = pl.scan_csv(local_file_csv, low_memory=True)
lf.sink_parquet(local_file_parquet)
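
For anyone who wants to check the peak memory figure without a profiler, the following is a minimal sketch that uses only the Python standard library. The helper names `peak_rss_bytes` and `peak_to_file_ratio` are hypothetical and not part of the original report, and the `resource` module is only available on Unix-like systems (Linux, WSL, macOS).

```python
import os
import resource  # Unix only; not available on native Windows
import sys
from pathlib import Path


def peak_rss_bytes() -> int:
    """Return the peak resident set size of this process in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


def peak_to_file_ratio(path: Path) -> float:
    """Peak RSS so far divided by the size of the file at `path`."""
    return peak_rss_bytes() / os.path.getsize(path)
```

Calling `peak_to_file_ratio(local_file_csv)` after `sink_parquet` returns gives a rough indicator: a value close to 1.0 would correspond to the reported behavior (about 22 GB peak for a 23 GB file), whereas a properly streaming conversion would be expected to stay well below that.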

raayu83 commented on June 5, 2024

Oh and this happens in both Kubernetes (Debian bullseye based image) and WSL (Debian bookworm based image)

raayu83 commented on June 5, 2024

I've now run the exact code above (the previous flame graph contained some additional code) on my macOS computer, with the same results regarding memory usage.
Please find the flame graph attached (I ran memray in native mode this time, so it contains more detail on the Rust side).
memray-flamegraph-main.py.76054.html.zip

The version info is the following:

--------Version info---------
Polars:               0.20.3
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.12.1 (main, Jan  5 2024, 19:05:58) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

raayu83 commented on June 5, 2024

Hi @ritchie46,
I've now provided a complete example and memory consumption flame graphs for the example above.
It would be great if you could take a look at this, as it breaks the promise of support for "larger than memory files" with scan_csv/sink_parquet.

Btw, in case this is related, a similar memory consumption problem occurs when reading Parquet files with scan_parquet. I can provide an example for that once this bug is fixed (since I'd first need to be able to create a Parquet file dynamically without consuming much memory).

dwink commented on June 5, 2024

I see the same behavior. Memory seems to continually grow while streaming from csv to parquet. I’ve tried to turn off all optimizations as well as compression, and adjusted row_group_size.

I’ll see if I can profile further as well.
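
For reference, the settings described in this comment could look roughly like the sketch below. It assumes the `sink_parquet` keyword arguments `compression`, `row_group_size`, and `no_optimization` as they exist in the Polars 0.20.x series reported in this thread; later versions may differ. The helper names `sink_kwargs` and `convert` are hypothetical, and this is not necessarily the exact code that was run.

```python
from pathlib import Path
from typing import Optional


def sink_kwargs(row_group_size: Optional[int] = None) -> dict:
    """Build keyword arguments for LazyFrame.sink_parquet (assumed 0.20.x signature)."""
    kwargs = {
        "compression": "uncompressed",  # rule out compression buffers
        "no_optimization": True,  # turn off all query optimizations
    }
    if row_group_size is not None:
        kwargs["row_group_size"] = row_group_size
    return kwargs


def convert(csv_path: Path, parquet_path: Path, row_group_size: Optional[int] = None) -> None:
    """Stream a CSV file into Parquet with the settings above."""
    import polars as pl  # imported lazily so sink_kwargs works without Polars

    lf = pl.scan_csv(csv_path, low_memory=True)
    lf.sink_parquet(parquet_path, **sink_kwargs(row_group_size))
```

Running `convert` with a few different `row_group_size` values while watching memory is one way to check whether any of these settings affects the growth.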

dwink commented on June 5, 2024

I was able to reproduce this in Rust, creating a LazyCsvReader and then calling .sink_parquet() on the resulting LazyFrame.
