Comments (8)

ritchie46 commented on August 18, 2024

@coastalwhite we don't have a repro, but we do have a panic on statistics unwrap. Maybe you know what it is?

from polars.

coastalwhite commented on August 18, 2024

It is hard to tell, but I think there are two panics here:

  • An unwrap of an Option::None at crates/polars-parquet/src/arrow/read/statistics/mod.rs:376.
  • An expect_as_binary, which I suspect is in crates/polars-parquet/src/arrow/read/statistics/mod.rs, somewhere between lines 527 and 532.

I don't see an immediate problem, but since the problem only happens when globbing, there might be a schema mismatch?
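
One way to narrow down a failure that only shows up when globbing is to collect each file, and each pair of files, separately and note which combinations fail. The following is a minimal sketch, not part of the original report. It assumes a caller-supplied `collect` callable, and `failing_combinations` is a hypothetical helper name. PyO3 surfaces Rust panics as `PanicException`, which derives from `BaseException` rather than `Exception`, so the sketch catches `BaseException` and re-raises interrupts.

```python
from itertools import combinations
from typing import Callable, Iterable, Sequence


def failing_combinations(
    files: Sequence[str],
    collect: Callable[[Sequence[str]], object],
    sizes: Iterable[int] = (1, 2),
) -> list[tuple[tuple[str, ...], str]]:
    """Run `collect` on every combination of `files` and report the failures.

    `collect` is a hypothetical callable supplied by the caller; it should
    raise if reading the given group of files fails.
    """
    failures = []
    for size in sizes:
        for combo in combinations(files, size):
            try:
                collect(combo)
            except (KeyboardInterrupt, SystemExit):
                # Never swallow a user interrupt or interpreter shutdown.
                raise
            except BaseException as exc:
                # PanicException is not an Exception subclass, hence BaseException.
                failures.append((combo, f"{type(exc).__name__}: {exc}"))
    return failures
```

With Polars installed, `collect` could be something like `lambda paths: pl.scan_parquet(list(paths)).drop_nulls().collect()`. A file that appears in every failing pair but reads fine on its own is a likely candidate for the mismatch.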

failable commented on August 18, 2024

Hello, there are 3 files in total. Not sure if this information helps.

user@macos:~/git/med-data $ ll data/*-*-*-*-*-*.parquet
-rw-r--r-- 1 user staff 41M Oct 20  2021 data/2020-02-04-2020-11-01.parquet
-rw-r--r-- 1 user staff 35M Oct 20  2021 data/2020-11-01-2021-03-01.parquet
-rw-r--r-- 1 user staff 59M Oct 20  2021 data/2021-03-01-2021-09-05.parquet

user@macos:~/git/med-data $ rp
Python 3.10.11 (main, May  7 2023, 18:32:37) [Clang 16.0.3 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (789_880, 2)
┌─────────────────────────────────┬──────────────────────────┐
│ Genericname                     ┆ Diagnosis                │
│ ---                             ┆ ---                      │
│ str                             ┆ str                      │
╞═════════════════════════════════╪══════════════════════════╡
│ 磷酸奥司他韦颗粒                ┆ 预防性抗流行性感冒治疗   │
│ 硝苯地平控释片                  ┆ 原发性高血压             │
│ 富马酸替诺福韦二吡呋酯片        ┆ 慢性乙型肝炎             │
│ 苯磺酸氨氯地平片                ┆ 原发性高血压             │
│ 布地奈德福莫特罗粉吸入剂        ┆ 哮喘                     │
│ …                               ┆ …                        │
│ 金匮肾气丸;尿感宁颗粒           ┆ 尿路感染;肾气不足证      │
│ 地奈德乳膏;非洛地平缓释片;复方  ┆ 高血压病;气滞血瘀证;湿疹 │
│ 丹参滴丸                        ┆                          │
│ 牛黄解毒片;蒲地蓝消炎口服液;头  ┆ 牙龈炎                   │
│ 孢呋辛酯胶囊                    ┆                          │
│ 地奈德乳膏;替米沙坦片           ┆ 高血压;脂溢性皮炎        │
│ 头孢氨苄片                      ┆ 毛囊炎;中耳炎            │
└─────────────────────────────────┴──────────────────────────┘

>>> pl.scan_parquet("data/2020-11-01-2021-03-01.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (601_951, 2)
┌─────────────────────────────────┬───────────────────────────────┐
│ Genericname                     ┆ Diagnosis                     │
│ ---                             ┆ ---                           │
│ str                             ┆ str                           │
╞═════════════════════════════════╪═══════════════════════════════╡
│ 复方酮康唑软膏;鲜竹沥           ┆ 皮肤真菌感染;上呼吸道感染     │
│ 甲钴胺分散片;双氯芬酸钠缓释胶囊 ┆ 腰椎间盘突出                  │
│ 氨溴特罗口服溶液;孟鲁司特钠片   ┆ 上呼吸道感染;上呼吸道过敏反应 │
│ 护肝片;心脑康胶囊               ┆ 肝气郁结证;瘀血阻络证         │
│ 奥硝唑片;双氯芬酸钠缓释胶囊;头  ┆ 慢性牙周炎                    │
│ 孢克洛分散片                    ┆                               │
│ …                               ┆ …                             │
│ 埃索美拉唑镁肠溶片;玻璃酸钠滴眼 ┆ 干眼症;十二指肠溃疡           │
│ 液                              ┆                               │
│ 丹黄祛瘀胶囊;散结镇痛胶囊       ┆ 血瘀证;子宫内膜异位症         │
│ 陈香露白露片                    ┆ 慢性胃炎;特指急性胃炎         │
│ 玉龙油                          ┆ 关节炎;痛风                   │
│ 罗红霉素片;清热散结片           ┆ 口腔溃疡;皮肤感染             │
└─────────────────────────────────┴───────────────────────────────┘

>>> pl.scan_parquet("data/2021-03-01-2021-09-05.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (994_078, 2)
┌─────────────────────────────────┬──────────────────────────────┐
│ Genericname                     ┆ Diagnosis                    │
│ ---                             ┆ ---                          │
│ str                             ┆ str                          │
╞═════════════════════════════════╪══════════════════════════════╡
│ 灯盏生脉胶囊                    ┆ 类风湿性关节炎;心绞痛;银屑病 │
│ 头孢克洛分散片                  ┆ 皮肤感染;皮肤裂伤            │
│ 阿司匹林肠溶片;甲硝唑片;牙痛停  ┆ 牙周炎                       │
│ 滴丸                            ┆                              │
│ 奥硝唑分散片;头孢泊肟酯胶囊     ┆ 阑尾炎                       │
│ 玻璃酸钠滴眼液;肠胃宁片         ┆ 干眼症;泄泻病                │
│ …                               ┆ …                            │
│ 达格列净片;复方酮康唑发用洗剂   ┆ 糖尿病;头皮糠疹              │
│ 六神丸;维生素A软胶囊            ┆ 痤疮;咽炎                    │
│ 甲钴胺片;腰痛宁胶囊;依托考昔片  ┆ 腰椎病                       │
│ 急支糖浆;盐酸氨溴索糖浆         ┆ 上呼吸道感染                 │
│ 桂枝茯苓丸(浓缩水丸);血府逐瘀颗 ┆ 闭经;血瘀证                  │
│ 粒                              ┆                              │
└─────────────────────────────────┴──────────────────────────────┘

>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").select(["Genericname", "Diagnosis"]).describe()
shape: (9, 3)
┌────────────┬───────────────────────┬─────────────────────────────────┐
│ statistic  ┆ Genericname           ┆ Diagnosis                       │
│ ---        ┆ ---                   ┆ ---                             │
│ str        ┆ str                   ┆ str                             │
╞════════════╪═══════════════════════╪═════════════════════════════════╡
│ count      ┆ 789880                ┆ 789880                          │
│ null_count ┆ 0                     ┆ 0                               │
│ mean       ┆ null                  ┆ null                            │
│ std        ┆ null                  ┆ null                            │
│ min        ┆  ;特非那定片&#x0D     ┆     肠炎  ;上呼吸道感染         │
│ 25%        ┆ null                  ┆ null                            │
│ 50%        ┆ null                  ┆ null                            │
│ 75%        ┆ null                  ┆ null                            │
│ max        ┆ (畅迪5号)粉尘螨滴剂 ┆ A族高甘油三脂血症;高血压病;脑  │
│            ┆                       ┆ 梗死后遗症                      │
└────────────┴───────────────────────┴─────────────────────────────────┘

>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").describe()
shape: (9, 17)
┌────────────┬───────────────┬──────────────────────┬──────────────────────┬───┬─────────┬───────────┬───────────┬──────────────────────┐
│ statistic  ┆ Id            ┆ Genericname          ┆ Diagnosis            ┆ … ┆ Checker ┆ CheckTime ┆ Confirmer ┆ ConfirmTime          │
│ ---        ┆ ---           ┆ ---                  ┆ ---                  ┆   ┆ ---     ┆ ---       ┆ ---       ┆ ---                  │
│ str        ┆ f64           ┆ str                  ┆ str                  ┆   ┆ str     ┆ str       ┆ str       ┆ str                  │
╞════════════╪═══════════════╪══════════════════════╪══════════════════════╪═══╪═════════╪═══════════╪═══════════╪══════════════════════╡
│ count      ┆ 789880.0      ┆ 789880               ┆ 789880               ┆ … ┆ 601427  ┆ 601427    ┆ 601427    ┆ 601427               │
│ null_count ┆ 0.0           ┆ 0                    ┆ 0                    ┆ … ┆ 188453  ┆ 188453    ┆ 188453    ┆ 188453               │
│ mean       ┆ 419382.573516 ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ std        ┆ 234739.505744 ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ min        ┆ 1.0           ┆  ;特非那定片&#x0D    ┆ 肠炎  ;上呼吸道感染  ┆ … ┆ 何黎敏  ┆ 2020/10/1 ┆ 何黎敏    ┆ 2020/10/1 15:00:13   │
│            ┆               ┆                      ┆                      ┆   ┆         ┆ 14:29:43  ┆           ┆                      │
│ 25%        ┆ 217296.0      ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ 50%        ┆ 421387.0      ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ 75%        ┆ 622312.0      ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ max        ┆ 823498.0      ┆ (畅迪5号)粉尘螨滴  ┆ A族高甘油三脂血症;  ┆ … ┆ 黄羡    ┆ 2020/9/30 ┆ 黄羡      ┆ 2020/9/30 15:11:42   │
│            ┆               ┆ 剂                   ┆ 高血压病;脑梗死后遗  ┆   ┆         ┆ 15:22:40  ┆           ┆                      │
│            ┆               ┆                      ┆ 症                   ┆   ┆         ┆           ┆           ┆                      │
└────────────┴───────────────┴──────────────────────┴──────────────────────┴───┴─────────┴───────────┴───────────┴──────────────────────┘

>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/2020-11-01-2021-03-01.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/2021-03-01-2021-09-05.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").drop_nulls().columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").drop_nulls().collect().columns
thread 'polars-1' panicked at /rustc/ab14f944afe4234db378ced3801e637eae6c0f30/library/core/src/ops/function.rs:250:5:
Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
stack backtrace:
   0:        0x1113238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
   1:        0x10eb8239b - core::fmt::write::h4a73583a3886d3b0
   2:        0x1112f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
   3:        0x1113279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
   4:        0x111327269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
   5:        0x111328f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
   6:        0x111327d12 - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
   7:        0x111327c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
   8:        0x111327c56 - _rust_begin_unwind
   9:        0x1114e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
  10:        0x1108767ce - core::ops::function::FnOnce::call_once::h5067212405562c9f
  11:        0x110872ffe - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
  12:        0x10fa59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
  13:        0x10fa5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
  14:        0x10febe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
  15:        0x10febf682 - rayon_core::join::join_context::{{closure}}::hdb785e885a11ecf5
  16:        0x10febec58 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
  17:        0x10fec0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
  18:        0x111831120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
  19:        0x1110ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
  20:        0x1110ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
  21:        0x11132c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
  22:     0x7ff803c5d18b - __pthread_start
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
>>> 

coastalwhite commented on August 18, 2024

One thing I notice here is that there are columns that are typed as strings but contain numbers. Could it be that one of the files has the same column but with a different type?
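
A quick way to test this hypothesis is to compare the per-file schemas and list the columns whose dtype differs. The sketch below is illustrative only. `read_schemas` and `dtype_conflicts` are hypothetical helper names, and `read_schemas` assumes Polars is installed (it relies on `pl.read_parquet_schema`). The comparison itself is plain Python over `{path: {column: dtype name}}` mappings.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def read_schemas(paths: Iterable[str]) -> dict[str, dict[str, str]]:
    """Return {path: {column: dtype name}} for each Parquet file."""
    import polars as pl  # imported lazily; assumes polars is installed

    return {
        path: {name: str(dtype) for name, dtype in pl.read_parquet_schema(path).items()}
        for path in paths
    }


def dtype_conflicts(
    schemas: Mapping[str, Mapping[str, str]],
) -> dict[str, dict[str, list[str]]]:
    """Map each column whose dtype differs between files to {dtype: [files]}."""
    by_column: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for path, schema in schemas.items():
        for column, dtype in schema.items():
            by_column[column][dtype].append(path)
    # Keep only columns that were seen with more than one dtype.
    return {
        column: dict(dtypes)
        for column, dtypes in by_column.items()
        if len(dtypes) > 1
    }
```

Calling `dtype_conflicts(read_schemas(files))` on the files in question would return an empty dict if all schemas agree. Note that this sketch does not flag columns that are missing from some files; it only compares dtypes of columns with the same name.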

failable commented on August 18, 2024

That seems to be the issue.

>>> files = ["data/2020-02-04-2020-11-01.parquet", "data/2020-11-01-2021-03-01.parquet", "data/2021-03-01-2021-09-05.parquet"]

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").collect().row(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented

>>> import pandas as pd
>>> for f in files:
...     df = pd.read_parquet(f)
...     print(df.info())
... 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 789880 entries, 0 to 789879
Data columns (total 16 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           789880 non-null  int64 
 1   Genericname  789880 non-null  object
 2   Diagnosis    789880 non-null  object
 3   InquiryId    789880 non-null  object
 4   CreateTime   789880 non-null  object
 5   UpdateTime   9165 non-null    object
 6   InqCount     789880 non-null  int64 
 7   Level        789880 non-null  int64 
 8   UpdateBy     254 non-null     object
 9   Creater      789880 non-null  object
 10  Platform     712729 non-null  object
 11  Remark       15070 non-null   object
 12  Checker      601427 non-null  object
 13  CheckTime    601427 non-null  object
 14  Confirmer    601427 non-null  object
 15  ConfirmTime  601427 non-null  object
dtypes: int64(3), object(13)
memory usage: 96.4+ MB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 601952 entries, 0 to 601951
Data columns (total 16 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           601952 non-null  int64 
 1   Genericname  601952 non-null  object
 2   Diagnosis    601951 non-null  object
 3   InquiryId    601952 non-null  int64 <----------------------------------------- DIFFERENCE
 4   CreateTime   601952 non-null  object
 5   UpdateTime   1857 non-null    object
 6   InqCount     601952 non-null  int64 
 7   Level        601952 non-null  int64 
 8   UpdateBy     91 non-null      object
 9   Creater      601952 non-null  object
 10  Platform     599108 non-null  object
 11  Remark       8939 non-null    object
 12  Checker      599108 non-null  object
 13  CheckTime    599108 non-null  object
 14  Confirmer    599108 non-null  object
 15  ConfirmTime  599108 non-null  object
dtypes: int64(4), object(12)
memory usage: 73.5+ MB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 994078 entries, 0 to 994077
Data columns (total 16 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           994078 non-null  int64 
 1   Genericname  994078 non-null  object
 2   Diagnosis    994078 non-null  object
 3   InquiryId    994078 non-null  object
 4   CreateTime   994078 non-null  object
 5   UpdateTime   2329 non-null    object
 6   InqCount     994078 non-null  int64 
 7   Level        994078 non-null  int64 
 8   UpdateBy     8 non-null       object
 9   Creater      994078 non-null  object
 10  Platform     981809 non-null  object
 11  Remark       2654 non-null    object
 12  Checker      981809 non-null  object
 13  CheckTime    981809 non-null  object
 14  Confirmer    981809 non-null  object
 15  ConfirmTime  981809 non-null  object
dtypes: int64(3), object(13)
memory usage: 121.3+ MB
None

>>> pl.scan_parquet([files[0], files[2]]).collect()
shape: (1_783_958, 16)
┌─────────┬───────────────────┬───────────────────┬───────────────────┬───┬──────────────┬───────────┬──────────────┬───────────────────┐
│ Id      ┆ Genericname       ┆ Diagnosis         ┆ InquiryId         ┆ … ┆ Checker      ┆ CheckTime ┆ Confirmer    ┆ ConfirmTime       │
│ ---     ┆ ---               ┆ ---               ┆ ---               ┆   ┆ ---          ┆ ---       ┆ ---          ┆ ---               │
│ i64     ┆ str               ┆ str               ┆ str               ┆   ┆ str          ┆ str       ┆ str          ┆ str               │
╞═════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══╪══════════════╪═══════════╪══════════════╪═══════════════════╡
│ 1       ┆ 磷酸奥司他韦颗粒  ┆ 预防性抗流行性感  ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│         ┆                   ┆ 冒治疗            ┆                   ┆   ┆              ┆           ┆              ┆                   │
│ 2       ┆ 硝苯地平控释片    ┆ 原发性高血压      ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│ 4       ┆ 富马酸替诺福韦二  ┆ 慢性乙型肝炎      ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│         ┆ 吡呋酯片          ┆                   ┆                   ┆   ┆              ┆           ┆              ┆                   │
│ 5       ┆ 苯磺酸氨氯地平片  ┆ 原发性高血压      ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│ 6       ┆ 布地奈德福莫特罗  ┆ 哮喘              ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│         ┆ 粉吸入剂          ┆                   ┆                   ┆   ┆              ┆           ┆              ┆                   │
│ …       ┆ …                 ┆ …                 ┆ …                 ┆ … ┆ …            ┆ …         ┆ …            ┆ …                 │
│ 2428039 ┆ 达格列净片;复方酮 ┆ 糖尿病;头皮糠疹   ┆ 14346925769166766 ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:23  │
│         ┆ 康唑发用洗剂      ┆                   ┆ 96                ┆   ┆              ┆ 9:39:23   ┆              ┆                   │
│ 2428040 ┆ 六神丸;维生素A软  ┆ 痤疮;咽炎         ┆ 3862545784759552  ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:28  │
│         ┆ 胶囊              ┆                   ┆                   ┆   ┆              ┆ 9:39:28   ┆              ┆                   │
│ 2428041 ┆ 甲钴胺片;腰痛宁胶 ┆ 腰椎病            ┆ 3862545878415360  ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:35  │
│         ┆ 囊;依托考昔片     ┆                   ┆                   ┆   ┆              ┆ 9:39:35   ┆              ┆                   │
│ 2428042 ┆ 急支糖浆;盐酸氨溴 ┆ 上呼吸道感染      ┆ 4347993676906752  ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:36  │
│         ┆ 索糖浆            ┆                   ┆                   ┆   ┆              ┆ 9:39:36   ┆              ┆                   │
│ 2428043 ┆ 桂枝茯苓丸(浓缩水 ┆ 闭经;血瘀证       ┆ 14346923868894577 ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:41  │
│         ┆ 丸);血府逐瘀颗粒  ┆                   ┆ 53                ┆   ┆              ┆ 9:39:41   ┆              ┆                   │
└─────────┴───────────────────┴───────────────────┴───────────────────┴───┴──────────────┴───────────┴──────────────┴───────────────────┘

>>> pl.scan_parquet([files[0], files[2]]).drop_nulls().collect()
shape: (48, 16)
┌────────┬───────────────────┬───────────────────┬───────────────────┬───┬──────────────┬────────────┬──────────────┬───────────────────┐
│ Id     ┆ Genericname       ┆ Diagnosis         ┆ InquiryId         ┆ … ┆ Checker      ┆ CheckTime  ┆ Confirmer    ┆ ConfirmTime       │
│ ---    ┆ ---               ┆ ---               ┆ ---               ┆   ┆ ---          ┆ ---        ┆ ---          ┆ ---               │
│ i64    ┆ str               ┆ str               ┆ str               ┆   ┆ str          ┆ str        ┆ str          ┆ str               │
╞════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══╪══════════════╪════════════╪══════════════╪═══════════════════╡
│ 210034 ┆ 利拉鲁肽注射液;缬 ┆ 缺血性脑血管病;糖 ┆ 3692776318482176  ┆ … ┆ 陈佩斯       ┆ 2020/5/15  ┆ 韩丽琴       ┆ 2020/5/15         │
│        ┆ 沙坦氢氯噻嗪胶囊; ┆ 尿病;原发性高血压 ┆                   ┆   ┆              ┆ 4:04:59    ┆              ┆ 19:26:42          │
│        ┆ 银杏叶提取物片    ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 210734 ┆ 阿奇霉素分散片;枸 ┆ 高血压病;男性勃起 ┆ 3692783911425792  ┆ … ┆ 陈佩斯       ┆ 2020/5/15  ┆ 唐明嵩       ┆ 2020/5/15         │
│        ┆ 橼酸西地那非片;马 ┆ 障碍;软组织感染   ┆                   ┆   ┆              ┆ 2:57:29    ┆              ┆ 20:35:13          │
│        ┆ 来酸依那普利片;双 ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│        ┆ 氯芬酸…           ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 211476 ┆ 复方酮康唑发用洗  ┆ 肺动脉高压;甲状腺 ┆ 3692745775624705  ┆ … ┆ 陈佩斯       ┆ 2020/5/15  ┆ 唐明嵩       ┆ 2020/5/15         │
│        ┆ 剂;枸橼酸西地那非 ┆ 功能减退症;心绞痛 ┆                   ┆   ┆              ┆ 4:59:10    ┆              ┆ 22:01:27          │
│        ┆ 片;通脉颗粒;左甲  ┆ ;脂溢性皮炎       ┆                   ┆   ┆              ┆            ┆              ┆                   │
│        ┆ 状腺素钠…         ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 214903 ┆ 苯磺酸左氨氯地平  ┆ 高血压病;男性勃起 ┆ 12607392863288279 ┆ … ┆ 唐明嵩       ┆ 2020/5/16  ┆ 吴雪静       ┆ 2020/5/16         │
│        ┆ 片;枸橼酸西地那非 ┆ 障碍              ┆ 22                ┆   ┆              ┆ 16:17:25   ┆              ┆ 17:19:47          │
│        ┆ 片                ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 220492 ┆ 富马酸比索洛尔片; ┆ 不稳定性心绞痛;男 ┆ 3693396635109120  ┆ … ┆ 陈祉羽       ┆ 2020/5/17  ┆ 苏锡茵       ┆ 2020/5/18 8:05:50 │
│        ┆ 枸橼酸西地那非片  ┆ 性勃起障碍        ┆                   ┆   ┆              ┆ 14:54:51   ┆              ┆                   │
│ …      ┆ …                 ┆ …                 ┆ …                 ┆ … ┆ …            ┆ …          ┆ …            ┆ …                 │
│ 743981 ┆ 阿司匹林肠溶片;胱 ┆ 头皮糠疹;脱发     ┆ 13166628469894104 ┆ … ┆ 智能审方判断 ┆ 2020/10/15 ┆ 智能审方判断 ┆ 2020/10/15        │
│        ┆ 氨酸片            ┆                   ┆ 02                ┆   ┆              ┆ 16:57:01   ┆              ┆ 16:57:01          │
│ 745923 ┆ 坎地沙坦酯片;马来 ┆ 高血压病          ┆ 3724199280330496  ┆ … ┆ 翁庸徳       ┆ 2020/10/10 ┆ 苏锡茵       ┆ 2020/10/15        │
│        ┆ 酸依那普利片      ┆                   ┆                   ┆   ┆              ┆ 18:19:03   ┆              ┆ 22:40:58          │
│ 751981 ┆ 酚酞片;牛黄解毒片 ┆ 便秘病;热毒证     ┆ 12906196614272082 ┆ … ┆ 黄羡         ┆ 2020/10/12 ┆ 翁庸徳       ┆ 2020/10/16        │
│        ┆                   ┆                   ┆ 79                ┆   ┆              ┆ 9:25:59    ┆              ┆ 22:29:34          │
│ 809815 ┆ 地特胰岛素注射液; ┆ 1型糖尿病;高血压  ┆ 3713215386539776  ┆ … ┆ 翁庸徳       ┆ 2020/10/28 ┆ 苏锡茵       ┆ 2020/10/29        │
│        ┆ 厄贝沙坦片;罗红霉 ┆ 病;支气管炎       ┆                   ┆   ┆              ┆ 16:21:42   ┆              ┆ 11:17:13          │
│        ┆ 素氨溴索片        ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 811401 ┆ 非诺贝特胶囊;门冬 ┆ 1型糖尿病;高脂血  ┆ 3732088374658816  ┆ … ┆ 翁庸徳       ┆ 2020/10/28 ┆ 苏锡茵       ┆ 2020/10/29        │
│        ┆ 胰岛素注射液      ┆ 症                ┆                   ┆   ┆              ┆ 16:35:08   ┆              ┆ 17:47:00          │
└────────┴───────────────────┴───────────────────┴───────────────────┴───┴──────────────┴────────────┴──────────────┴───────────────────┘

>>> pl.scan_parquet([files[0], files[1]]).drop_nulls().collect()
thread 'polars-1' panicked at /rustc/ab14f944afe4234db378ced3801e637eae6c0f30/library/core/src/ops/function.rs:250:5:
Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
stack backtrace:
   0:        0x1113238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
   1:        0x10eb8239b - core::fmt::write::h4a73583a3886d3b0
   2:        0x1112f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
   3:        0x1113279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
   4:        0x111327269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
   5:        0x111328f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
   6:        0x111327d12 - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
   7:        0x111327c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
   8:        0x111327c56 - _rust_begin_unwind
   9:        0x1114e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
  10:        0x1108767ce - core::ops::function::FnOnce::call_once::h5067212405562c9f
  11:        0x110872ffe - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
  12:        0x10fa59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
  13:        0x10fa5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
  14:        0x10febe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
  15:        0x10fec0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
  16:        0x111831120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
  17:        0x1110ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
  18:        0x1110ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
  19:        0x11132c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
  20:     0x7ff803c5d18b - __pthread_start
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead

But does that mean that even when specific columns are selected, the whole schema is still checked?

pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()

Is the error polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented related? With either error message, I find it hard to locate the problem.
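
While the files disagree on a column's type, one possible workaround is to scan each file separately, cast the mismatched column to a common dtype, and concatenate the lazy frames. This is a minimal sketch under assumptions: `choose_common_dtype` and `scan_with_common_dtype` are hypothetical helper names, falling back to a string dtype is an illustrative choice rather than anything Polars does itself, and it has not been checked against the Polars version that panics above.

```python
from typing import Iterable, Sequence


def choose_common_dtype(dtype_names: Iterable[str]) -> str:
    """Keep the dtype if all files agree, otherwise fall back to "String"."""
    unique = set(dtype_names)
    return unique.pop() if len(unique) == 1 else "String"


def scan_with_common_dtype(files: Sequence[str], column: str, dtype_name: str = "String"):
    """Scan files one by one, cast `column` to one dtype, then concatenate lazily."""
    import polars as pl  # imported lazily; assumes polars is installed

    dtype = getattr(pl, dtype_name)  # e.g. pl.String or pl.Int64
    frames = [
        pl.scan_parquet(path).with_columns(pl.col(column).cast(dtype))
        for path in files
    ]
    # Every frame now has the same dtype for `column`, so a vertical concat is valid.
    return pl.concat(frames)
```

For the three files above, usage might look like `scan_with_common_dtype(files, "InquiryId").drop_nulls().collect()`. Casting per file before concatenating keeps the query lazy, at the cost of one scan node per file.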

ritchie46 commented on August 18, 2024

@failable if this still occurs after #17321, can you open a new issue with a proper reproducible example? We cannot take action on this one.

failable commented on August 18, 2024

@ritchie46 Thanks, it seems the issue has been fixed now!

failable commented on August 18, 2024

When will we have a release? It took me an hour to build the main branch on my local machine.
