Comments (11)
Can it be that the order of the filtering criteria determines whether partition filtering is used or not?
I think this is the case. We should keep hive partitioned predicates separate.
@nameexhaustion another data point.
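As a user-side experiment (not a confirmed fix), one can try keeping the hive-partition predicate in its own .filter() call ahead of the row-level predicates and compare the plans. The sketch below uses a hypothetical path and the column names from this thread; whether the optimizer keeps the filters separate depends on the polars version.
import polars as pl

# Minimal sketch, assuming a hive-partitioned dataset with a CALENDAR_DATE partition column
# under a hypothetical path.
lf = pl.scan_parquet("/my_path/my_table/**/*.parquet")
lf = (lf.filter(pl.col("CALENDAR_DATE").is_between(pl.lit("2023-12-01"), pl.lit("2023-12-31")))  # partition predicate on its own
        .filter((pl.col("SEC_ID") > 3) & (pl.col("ADT_ID") != 7010)))                            # remaining row-level predicates
print(lf.explain())  # compare against the plan produced by a single combined filter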
Hmm... We did some checks and we do apply predicate pushdown based on hive partitions. Can you create a that shows the difference? That would help us fix it.
Create a what? A test dataset?
There is a similar one that can be created: apache/arrow#39768
But I do not know if you can reproduce the problem with that.
How could I test it on the real data?
@ritchie46 See #13908 and subsequently #14244
@lmocsi it would be more helpful if you brought up the old issues from the start. Did it work in 0.20.7, which is the first version that the above PR was in?
The performance difference is there with older versions (e.g. 0.20.3) as well.
I did not go further back in time.
Can it be that the datetime partition name formatting is confusing scan_parquet?
I'm thinking of the 00%3A00%3A00 part.
E.g.:
/my_path/my_table/CALENDAR_DATE=2019-01-01 00%3A00%3A00/part-00000-35c18ead-whatever.c000.snappy.parquet
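For reference, the %3A sequences are just percent-encoded colons; decoding the directory name makes that visible:
import urllib.parse

# '%3A' decodes to ':', so the partition value is '2019-01-01 00:00:00'
print(urllib.parse.unquote("CALENDAR_DATE=2019-01-01 00%3A00%3A00"))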
> Create a what? A test dataset?

A full reproducible example that shows the problem. So test data and the query that goes with it.
I tried, but it is not clear which part of the dataset the error comes from.
Unfortunately, I cannot disclose the dataset.
Is there a way I can test what you'd be testing on a full reproducible example?
Try to create some fake data with the same hive partition schemas. It is probably only the hive partitions and the query that matter.
See the sample data creation and reading below:
#create data
import polars as pl
from faker import Faker
import random as rnd
from datetime import datetime,date
import pyarrow.dataset as ds
print(f"polars version: {pl.__version__}")
def ido():
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')
fake = Faker()
print(ido(),'Started')
path = '/mypath/'
dflen = 10000000
df = pl.DataFrame({
    'ID': pl.Series(fake.unique.random_int(min=13127924000, max=14127924000) for i in range(dflen)),
    'BA_ID': pl.Series(fake.unique.random_int(min=2, max=2585456410) for i in range(dflen)),
    'PART_ID': pl.Series(fake.unique.random_int(min=2163520, max=16320804) for i in range(dflen)),
    'CU_ID': pl.Series(rnd.choice([1096, 3342, 3374, 4272, 3098, 3099]) for i in range(dflen)),
    'DEA_ID': pl.Series(fake.unique.random_int(min=996000, max=53237133) for i in range(dflen)),
    'AM_CY': pl.Series(fake.pyfloat(min_value=10000.0, max_value=990000.0, right_digits=1) for i in range(dflen)),
    'CR_FL': pl.Series(rnd.choice(['Y', 'N']) for i in range(dflen)),
    'PA_COM': pl.Series(rnd.choice(["######", None]) for i in range(dflen)),
    'CO_TE': pl.Series(rnd.choice(["######", "A:Techn. part######", " -5755.00 MAD -5755.0"]) for i in range(dflen)),
    'OT_AC': pl.Series(rnd.choice(["121223234545565678788989", "111122224444555577778888", None]) for i in range(dflen)),
    'OP_AC_NA': pl.Series(rnd.choice(["Donald Arthur Biden###########", "Joe William Trump#############", None]) for i in range(dflen)),
    'DWS_ID': pl.Series(rnd.choice([198, 1395, 5121, 2473]) for i in range(dflen)),
    'ADT_ID': pl.Series(rnd.choice([570, 1309, 1680, 1798, 1916, 13856, 355136]) for i in range(dflen)),
    'ADC_ID': pl.Series(rnd.choice([1019, 1134, 1455]) for i in range(dflen)),  # note: 'ADC_ID' is repeated below; the later entry overrides this one in the dict literal
    'ADK_ID': pl.Series(rnd.choice([2058, 2185, 160279, 240274]) for i in range(dflen)),
    'ABDO_ID': pl.Series(rnd.choice([2, 31248967]) for i in range(dflen)),
    'ADS_ID': pl.Series(rnd.choice([1271, 1265, 1399, 1342, 1652, 1266]) for i in range(dflen)),
    'INT_FL': pl.Series(rnd.choice(['Y', None]) for i in range(dflen)),
    'MN_DIR': pl.Series(rnd.choice(['INT', 'DOM']) for i in range(dflen)),
    'ADC_ID': pl.Series(rnd.choice([2, 2688, 2689, 24605]) for i in range(dflen)),
    'ADO_ID': pl.Series(rnd.choice([2, 3126]) for i in range(dflen)),
    'REF': pl.Series(rnd.choice(['12345679801','AD789789_12345645','DAS7894561230315','12345678','81051314_239_02_01_00_4566']) for i in range(dflen)),
    'SEC_ID': pl.Series(rnd.choice([2, 93708]) for i in range(dflen)),
    'ADL_ID': pl.Series(rnd.choice([2, 1125, 1134, 1364, 20834]) for i in range(dflen)),
    'CH_ID': pl.Series(rnd.choice([50141, 50016, 49904, 49838, None]) for i in range(dflen)),
    # 'CALENDAR_DATE': pl.Series(fake.date_between_dates(date(2023,1,1),date(2024,4,30)) for i in range(dflen)),  # calendar_date:date -> scan_parquet reads it very fast
    'CALENDAR_DATE': pl.Series(fake.date_between_dates(datetime(2023,1,1),datetime(2024,4,30)) for i in range(dflen)),
}).with_columns(AM=pl.col('AM_CY'))
print(ido(),dflen,'records created')
ds.write_dataset(
    df.to_arrow(),
    path + 'my_transaction',
    format="parquet",
    partitioning=["CALENDAR_DATE"],
    partitioning_flavor="hive",
    existing_data_behavior="delete_matching",
)
print(ido(),'finished')
# 2024.06.19. 17:26:12 Started
# 2024.06.19. 17:41:46 10000000 records created
# 2024.06.19. 17:43:02 finished
Then reading the data shows the difference: scan_pyarrow_dataset runs in 2 seconds, while scan_parquet runs in 21 seconds:
import polars as pl
from datetime import datetime
import pyarrow.dataset as ds
print(f"polars version: {pl.__version__}")
def ido():
    return datetime.now().strftime('%Y.%m.%d. %H:%M:%S')
parq_path = '/mypath/'
ext = '/**/*.parquet'
tr = pl.scan_pyarrow_dataset(ds.dataset(parq_path+"my_transaction", partitioning='hive'))
df = (tr.filter((pl.col('CALENDAR_DATE').is_between(pl.lit('2023-12-01'), pl.lit('2023-12-31'))) &
                (pl.col('CR_FL') == 'I') &
                (pl.col('SEC_ID') > 3) &
                (pl.col('ADL_ID') == 2905) &
                (~pl.col('PART_ID').is_in([5086634, 2149316, 6031676])) &
                (pl.col('ADT_ID') != 7010))
        .select('PART_ID').unique()
        .rename({'PART_ID': 'PARTY_ID'})
        .with_columns(pl.lit(1).alias('LU_NEXT_FL')))
print(ido(),'scan_pyarrow_dataset started') # 2 sec
df.collect()
print(ido(),'scan_pyarrow_dataset finished')
# explain plan:
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
# RENAME
# UNIQUE[maintain_order: false, keep_strategy: Any] BY None
# simple π 1/6 ["PART_ID"]
# PYTHON SCAN
# PROJECT 6/26 COLUMNS
# SELECTION: [([([([([([(col("ADL_ID")) == (2905)]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("SEC_ID")) > (3)])]) & ([(col("ADT_ID")) != (7010)])]) & ([(col("CR_FL")) == (String(I))])]) & (col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)]))]
tr2 = pl.scan_parquet(parq_path+"my_transaction"+ext)
df2 = (tr2.filter((pl.col('CALENDAR_DATE').is_between(pl.lit('2023-12-01'), pl.lit('2023-12-31'))) &
                  (pl.col('CR_FL') == 'I') &
                  (pl.col('SEC_ID') > 3) &
                  (pl.col('ADL_ID') == 2905) &
                  (~pl.col('PART_ID').is_in([5086634, 2149316, 6031676])) &
                  (pl.col('ADT_ID') != 7010))
          .select('PART_ID').unique()
          .rename({'PART_ID': 'PARTY_ID'})
          .with_columns(pl.lit(1).alias('LU_NEXT_FL')))
print(ido(),'scan_parquet started') # 21 sec
df2.collect()
print(ido(),'scan_parquet finished')
# explain plan:
# WITH_COLUMNS:
# [dyn int: 1.alias("LU_NEXT_FL")]
# RENAME
# UNIQUE[maintain_order: false, keep_strategy: Any] BY None
# simple π 1/6 ["PART_ID"]
# Parquet SCAN 485 files: first file: /mypath/my_transaction/CALENDAR_DATE=2023-01-01/part-0.parquet
# PROJECT 6/26 COLUMNS
# SELECTION: [([([([([([(col("ADL_ID")) == (2905)]) & ([(col("ADT_ID")) != (7010)])]) & (col("CALENDAR_DATE").is_between([String(2023-12-01), String(2023-12-31)]))]) & ([(col("SEC_ID")) > (3)])]) & (col("PART_ID").is_in([Series]).not())]) & ([(col("CR_FL")) == (String(I))])]
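One way to probe whether the partition filter actually prunes files is to compare the optimized plans (and the file count on the Parquet SCAN line) with hive partition parsing requested explicitly. The snippet below is only a sketch against the fake dataset generated above, mirroring the filter used there; the behaviour of the hive_partitioning parameter with glob paths can differ between polars versions.
import polars as pl

# Sketch: request hive partition parsing explicitly and inspect the optimized plan.
parq_path = '/mypath/'  # same location as the generated dataset above
tr3 = pl.scan_parquet(parq_path + 'my_transaction/**/*.parquet', hive_partitioning=True)
plan = (tr3.filter(pl.col('CALENDAR_DATE').is_between(pl.lit('2023-12-01'), pl.lit('2023-12-31')))
           .select('PART_ID')
           .explain())
print(plan)  # check whether the CALENDAR_DATE predicate is pushed into the Parquet SCAN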
One small addition: in the data creation part, the CALENDAR_DATE field should be calculated like this:
'CALENDAR_DATE': pl.Series(fake.date_time_between_dates(datetime(2023,1,1),datetime(2024,4,30)) for i in range(dflen)).dt.truncate('1d'),
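For what it's worth, the truncated column stays a Datetime rather than a Date, so each partition value carries a midnight time component, which is presumably what produces the 00%3A00%3A00 part in the hive directory names. A quick illustrative check:
import polars as pl
from datetime import datetime

# truncate('1d') zeroes the time of day but keeps the Datetime dtype
s = pl.Series([datetime(2023, 1, 15, 13, 45)]).dt.truncate('1d')
print(s.dtype)  # Datetime, not Date
print(s[0])     # 2023-01-15 00:00:00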