wireservice / agate-excel Goto Github PK

View Code? Open in Web Editor NEW

16.0 4.0 12.0 386 KB

Adds read support for Excel files (xls and xlsx) to agate.

Home Page: https://agate-excel.readthedocs.io

License: MIT License

Python 100.00%

agate-excel's Introduction

agate-excel adds read support for Excel files (xls and xlsx) to agate.

Important links:

agate https://agate.rtfd.org
Documentation: https://agate-excel.rtfd.org
Repository: https://github.com/wireservice/agate-excel
Issues: https://github.com/wireservice/agate-excel/issues

agate-excel's People

Contributors

Stargazers

Watchers

Forkers

autodataming shadyabhi joseevanan sksundaram-learning rasjani timfreund lcorbasson dotlambda fagan2888 jeffamaxey isabella232 arpitjain799

agate-excel's Issues

Breaking change in openpyxl 2.6.0

openpyxl 2.6.0 has a breaking change producing the following error :
iter_rows() got an unexpected keyword argument 'row_offset'

The row_offset parameter does not exist anymore and is replaced by a combination of min_col max_col

forcing install of openpyxl 2.3.0 works around the issue.

You should freeze the dependencies version to avoid that kind of issues.

Test failure with openpyxl 3.0.7

Hi,

I see a single test failure related to ambiguous dates in the NixOS builds of this package: https://nix-cache.s3.amazonaws.com/log/gj2vn41aa7ccl3frmkcvssn4na8z3ngm-python3.8-agate-excel-0.2.3.drv
I was able to git bisect the change to our upgrade of the openpyxl dependency from 3.0.6 to 3.0.7 -- the test succeeds with 3.0.6 and fails with 3.0.7. I haven't investigated further, but I wanted to let you know, in case you have any suggestions on how to fix this.

XLSX: On python2 integers are converted to `long` which isn't handled by `agate.Number`

XLS: Zeros are converted to empty strings

Cross filed from wireservice/csvkit#225 (comment)

Extract a series of sheets into tables

Dependency openpyxl dropped support for python<3.6

agate-excel==0.2.3 depends on openpyxl>=2.3.0. openpyxl released 3.0.0 on 2019-09-25, which breaks compatibility with python<3.6. Should agate-excel depend on openpyxl>=2.3.0,<3?

links

Release version 0.2.3

Hi, there is an entry for 0.2.3 version in change log, but there is no such release. Could you make one?

File isnt closed when we read a wrong sheet

My intention is to loop and read data from all sheet:

sheet = 0
while True:
    # try to parse at all sheet
    try:
        table = agate.Table.from_xlsx(filename, sheet=sheet)
    except IndexError:
        break
    sheet = sheet + 1

The problem is: if I pass the sheet argument which does not exist, then it throws an exception and it doesn't close the file. I received this warning:

sys:1: ResourceWarning: unclosed file <_io.BufferedReader name='filename.xlsx'>

Passing column_names to from_xls fails with TypeError

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/Users/jani.mikkonen/src/customers/vr/vr/venv/lib/python3.7/site-packages/agateexcel/table_xls.py", line 86, in from_xls
    tables[sheet.name] = agate.Table(rows, column_names, **kwargs)
TypeError: __init__() got multiple values for argument 'column_names'

if column_names is present in kwargs, maybe remove column_names from there and copy it to the positional parameter passed to agate.Table() in https://github.com/wireservice/agate-excel/blob/master/agateexcel/table_xlsx.py#L85 & https://github.com/wireservice/agate-excel/blob/master/agateexcel/table_xls.py#L86 ?

Processing xls file leads to all RAM consumption

Processing a xls file will trigger crazy RAM consumption, until oomkiller is called.

    import agateexcel
    result = agate.Table.from_xls(filepath)

File size is around 30 Mb.
Did you already encounter such an scenario with this function?

from_xls() failes when a cell =1 and is formatted as a date

Problem file: temp.zip

Python 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import agate
>>> import agateexcel
>>> agateexcel.patch()
>>> agate.Table.from_xls('temp.xls')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/appl/anaconda/anaconda3/lib/python3.5/site-packages/agateexcel/table_xls.py", line 52, in from_xls
    values = normalize_dates(values, book.datemode)
  File "/appl/anaconda/anaconda3/lib/python3.5/site-packages/agateexcel/table_xls.py", line 102, in normalize_dates
    v_tuple = xlrd.xldate_as_tuple(v, datemode)
  File "/appl/anaconda/anaconda3/lib/python3.5/site-packages/xlrd/xldate.py", line 86, in xldate_as_tuple
    raise XLDateAmbiguous(xldate)
xlrd.xldate.XLDateAmbiguous: 1.0
>>>

Possibly related: wireservice/csvkit#607

Importing agateexcel prevents exiting with non-zero status

Hi,
MWE with agate-excel 0.2.2 (Python 3.6):
import agateexcel
raise Exception()

When agateexcel isn't imported, the script exits with code 1; if agateexcel is present, it exists with code 0.

Version constraint on openpyxl

After the fix for #45 the test fails with older versions of openpyxl. Should the version constraint in setup.py be updated to reflect this? As this package stands now, it should be constrained to >= 3.0.7

I haven't looking into the possibility to rewrite the test and take openpyxl version into consideration.

Is it possible to specify column types when creating a table from Excel?

It seems like Table.from_xls() only takes a file (and sheet) as an argument. The default TypeTester is incorrectly parsing number codes (similar to FIPS codes) as numbers instead of strings.

Is there a better way of forcing column types when importing from Excel? I could compute new columns, but it seems like something that could be done on creation.

Tests fail on Py27 and PyPy

Workbook method is deprecated

/Users/cgroskopf/src/agate-excel/.tox/py35/lib/python3.5/site-packages/openpyxl/workbook/workbook.py:194: DeprecationWarning: Call to deprecated function or class get_sheet_by_name (Use wb[sheetname]).

Dependency listed in the wrong place

agate-excel/setup.py

Line 44 in 5374a24

'six',

The user of import six is in the library itself, not in the test suite. This should be listed under install_requires not as a test requirement.

Also nose doesn't seem to be used at all.

openpyxl's read_only mode breaks parsing on certain sheets

openpyxl's read_only option for load_workbook can cause issues with parsing spreadsheets that do not properly report their dimensions. This issue is documented in the openpyxl's documentation.

(Here is a direct link to the spreadsheet surfacing this issue.)

While openpyxl does give you a possible solution for resetting the dimensions — ws.max_row = ws.max_column = None — it may be easier to add an option for disabling read_only mode. I'm not sure if that dimension resetting is viable across all the possible sheets you may hit. (I did confirm, however, that it solved this particular spreadsheet's issue.)

This issue also manifests in csvkit too, and prevents in2csv from being able to parse this sheet.

Designate header row when using `from_xls` or `from_xlsx`

In our tx_salaries repo, we use csvkit (and by extension, now agate-excel!) to prep Excel files for processing. An extra option we also have — after the conversion has happened — is to peel off rows until we hit the header row. We always know where that header row is, so we can pass that index in and be good to go.

When I was doing some tests with agate-excel directly, I quickly hit that wall. It gets a little funky if an Excel spreadsheet has some useless extra rows before the data.

That sound like something that'd be worth having as part of the package? I imagine it could be another option to be passed into the from_ commands.

table = agate.Table.from_xlsx('i_dont_want_your_extra_rows.xlsx', header_row=4)

The biggest decision to make? Do you use zero-based indexing, or start at 1 like Excel? 😬 (We forget what we picked constantly.)

When loading spreadsheets with column_type text cell numeric values get altered

I am trying to load tables from spreadsheets that contain text and numeric information (i.e. serial_numbers like "1234567890" and "12GNHGZ34567655". I have set column_types to be text only (confirmed to work) - however in the target table numeric values have been treated as float resulting in "1234567890.0". Alphanumeric values just stay as is.

Update documents for install

pip install agateexcel results in

Collecting agateexcel
  Could not find a version that satisfies the requirement agateexcel (from versions: )
No matching distribution found for agateexcel

pip install agate-excel results in correct install.

XLS: Number columns are losing precision

text,date,integer,boolean,float,datetime,empty_column,_unnamed
Chicago Reader,1971-01-01,40,True,1.0,1971-01-01T04:14:00,,
Chicago Sun-Times,1948-01-01,63,True,1.27,1948-01-01T14:57:13,,Extra data beyond headers will be trimmed
Chicago Tribune,1920-01-01,164,False,41800000.01,1920-01-01T00:00:00,,
This row has blanks,,,,,,,,
Unicode! Σ,,,,,,,,

text,date,integer,boolean,float,datetime,empty_column,
Chicago Reader,1971-01-01,40.0000000000,True,1.0000000000,1971-01-01 04:14:00,,
Chicago Sun-Times,1948-01-01,63.0000000000,True,1.2700000000,1948-01-01 14:57:13,,Extra data beyond headers will be trimmed
Chicago Tribune,1920-01-01,164.0000000000,False,41800000.0099999979,1920-01-01,,
This row has blanks,,,,,,,
Unicode! Σ,,,,,,,