Comments (7)
Neither column data nor row data should be automatically removed without permission, regardless of whether it is completely empty. Neither pandas nor fastexcel, nor any other Excel reading tool, would behave like that.
from polars.
Two things:
- I agree with you that dropping empty rows should be an option that defaults to False.
- I'm irrationally perplexed at the inclusion of that screen capture (with a personal watermark too). I think it's preferred to include a github permalink like this (Note: To expand the selection do a shift-click):
@alexander-beedie this one is your baby, what do you think about having an option in `read_excel` to drop null rows and have its default be to not drop rows? The default, of course, could be to keep the existing behavior for people who, by now, expect that behavior.
from polars.
I have users who write configuration files in Excel (bless them) and they include blank lines as whitespace, so I definitely see why this behavior is useful. But yeah, I guess you can filter the rows out in a one-liner.
from polars.
This was required in earlier versions in order to extract data successfully; I didn't particularly like it then, but the alternative was failing to load certain shapes of table data at all. I'll revisit this as several relevant patches have landed between this code being implemented and the current `fastexcel` 0.9.1 release 👌

> what do you think about having an option in read_excel to drop null rows and have its default be to not drop rows

Not really necessary to have a dedicated param here, as you can always choose to drop rows after load via the usual methods.
A frame-level trim is likely the ideal behaviour here; dropping trailing empty rows (as they are likely not data at all), while leaving all other empty rows alone.
from polars.
> Not really necessary to have a dedicated param here, as you can always choose to drop rows after load via the usual methods.

If you just want to get rid of the behavior then I agree with this. I was coming at it from the idea that the behavior exists for a reason and that it wouldn't be dropped entirely.

> A frame-level trim is likely the ideal behaviour here; dropping trailing empty rows (as they are likely not data at all), while leaving all other empty rows alone.

Here's a tweak to do the same on columns:
polars/py-polars/polars/io/spreadsheet/functions.py
Lines 666 to 680 in ac0131a
null_cols = []
found_first_data = False
last_data_col = 0
for i, col_name in enumerate(df.columns):
    # note that if multiple unnamed columns are found then all but the first one
    # will be named as "_duplicated_{n}" (or "__UNNAMED__{n}" from calamine)
    if col_name == "" or re.match(r"(_duplicated_|__UNNAMED__)\d+$", col_name):
        col = df[col_name]
        if found_first_data is False and (
            col.dtype == pl.Null
            or col.null_count() == len(df)
            or (
                col.dtype in NUMERIC_DTYPES
                and col.replace(0, None).null_count() == len(df)
            )
        ):
            null_cols.append(col_name)
        else:
            found_first_data = True
            last_data_col = i
    else:
        found_first_data = True
        last_data_col = i
null_cols.extend(df.columns[last_data_col+1:])
from polars.
> This was required in earlier versions in order to extract data successfully; I didn't particularly like it then, but the alternative was failing to load certain shapes of table data at all. I'll revisit this as several relevant patches have landed between this code being implemented and the current `fastexcel` 0.9.1 release 👌
>
> > what do you think about having an option in read_excel to drop null rows and have its default be to not drop rows
>
> Not really necessary to have a dedicated param here, as you can always choose to drop rows after load via the usual methods.
> A frame-level trim is likely the ideal behaviour here; dropping trailing empty rows (as they are likely not data at all), while leaving all other empty rows alone.

Removing intermediate empty rows makes the processing of standard-formatted Excel files unexpectedly complex. I believe it is acceptable to delete the empty rows at the end of the sheet.
from polars.
> Two things:
>
> 1. I agree with you that dropping empty rows should be an option that defaults to False.
> 2. I'm irrationally perplexed at the inclusion of that screen capture (with a personal watermark too). I think it's preferred to include a github permalink like this (Note: To expand the selection do a shift-click):
>
> @alexander-beedie this one is your baby, what do you think about having an option in `read_excel` to drop null rows and have its default be to not drop rows? The default, of course, could be to keep the existing behavior for people who, by now, expect that behavior.

Thank you! I am just a beginner with polars and am not very familiar with the routine operations on GitHub.
from polars.