Giter Site home page Giter Site logo

agate-excel's Introduction

agate-excel's People

Contributors

dotlambda avatar jpmckinney avatar lcorbasson avatar onyxfish avatar palewire avatar pmlandwehr avatar timfreund avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

agate-excel's Issues

Breaking change in openpyxl 2.6.0

openpyxl 2.6.0 has a breaking change producing the following error :
iter_rows() got an unexpected keyword argument 'row_offset'

The row_offset parameter does not exist anymore and is replaced by a combination of min_col max_col

forcing install of openpyxl 2.3.0 works around the issue.

You should freeze the dependencies version to avoid that kind of issues.

Test failure with openpyxl 3.0.7

Hi,

I see a single test failure related to ambiguous dates in the NixOS builds of this package: https://nix-cache.s3.amazonaws.com/log/gj2vn41aa7ccl3frmkcvssn4na8z3ngm-python3.8-agate-excel-0.2.3.drv
I was able to git bisect the change to our upgrade of the openpyxl dependency from 3.0.6 to 3.0.7 -- the test succeeds with 3.0.6 and fails with 3.0.7. I haven't investigated further, but I wanted to let you know, in case you have any suggestions on how to fix this.

Release version 0.2.3

Hi, there is an entry for 0.2.3 version in change log, but there is no such release. Could you make one?

File isnt closed when we read a wrong sheet

My intention is to loop and read data from all sheet:

sheet = 0
while True:
    # try to parse at all sheet
    try:
        table = agate.Table.from_xlsx(filename, sheet=sheet)
    except IndexError:
        break
    sheet = sheet + 1

The problem is: if I pass the sheet argument which does not exist, then it throws an exception and it doesn't close the file. I received this warning:

sys:1: ResourceWarning: unclosed file <_io.BufferedReader name='filename.xlsx'>

Passing column_names to from_xls fails with TypeError

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/Users/jani.mikkonen/src/customers/vr/vr/venv/lib/python3.7/site-packages/agateexcel/table_xls.py", line 86, in from_xls
    tables[sheet.name] = agate.Table(rows, column_names, **kwargs)
TypeError: __init__() got multiple values for argument 'column_names'

if column_names is present in kwargs, maybe remove column_names from there and copy it to the positional parameter passed to agate.Table() in https://github.com/wireservice/agate-excel/blob/master/agateexcel/table_xlsx.py#L85 & https://github.com/wireservice/agate-excel/blob/master/agateexcel/table_xls.py#L86 ?

Processing xls file leads to all RAM consumption

Processing a xls file will trigger crazy RAM consumption, until oomkiller is called.

    import agateexcel
    result = agate.Table.from_xls(filepath)

File size is around 30 Mb.
Did you already encounter such an scenario with this function?

from_xls() failes when a cell =1 and is formatted as a date

Problem file: temp.zip

Python 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import agate
>>> import agateexcel
>>> agateexcel.patch()
>>> agate.Table.from_xls('temp.xls')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/appl/anaconda/anaconda3/lib/python3.5/site-packages/agateexcel/table_xls.py", line 52, in from_xls
    values = normalize_dates(values, book.datemode)
  File "/appl/anaconda/anaconda3/lib/python3.5/site-packages/agateexcel/table_xls.py", line 102, in normalize_dates
    v_tuple = xlrd.xldate_as_tuple(v, datemode)
  File "/appl/anaconda/anaconda3/lib/python3.5/site-packages/xlrd/xldate.py", line 86, in xldate_as_tuple
    raise XLDateAmbiguous(xldate)
xlrd.xldate.XLDateAmbiguous: 1.0
>>>

Possibly related: wireservice/csvkit#607

Version constraint on openpyxl

After the fix for #45 the test fails with older versions of openpyxl. Should the version constraint in setup.py be updated to reflect this? As this package stands now, it should be constrained to >= 3.0.7

I haven't looking into the possibility to rewrite the test and take openpyxl version into consideration.

Is it possible to specify column types when creating a table from Excel?

It seems like Table.from_xls() only takes a file (and sheet) as an argument. The default TypeTester is incorrectly parsing number codes (similar to FIPS codes) as numbers instead of strings.

Is there a better way of forcing column types when importing from Excel? I could compute new columns, but it seems like something that could be done on creation.

Workbook method is deprecated

/Users/cgroskopf/src/agate-excel/.tox/py35/lib/python3.5/site-packages/openpyxl/workbook/workbook.py:194: DeprecationWarning: Call to deprecated function or class get_sheet_by_name (Use wb[sheetname]).

openpyxl's read_only mode breaks parsing on certain sheets

openpyxl's read_only option for load_workbook can cause issues with parsing spreadsheets that do not properly report their dimensions. This issue is documented in the openpyxl's documentation.

(Here is a direct link to the spreadsheet surfacing this issue.)

While openpyxl does give you a possible solution for resetting the dimensions — ws.max_row = ws.max_column = None — it may be easier to add an option for disabling read_only mode. I'm not sure if that dimension resetting is viable across all the possible sheets you may hit. (I did confirm, however, that it solved this particular spreadsheet's issue.)

This issue also manifests in csvkit too, and prevents in2csv from being able to parse this sheet.

Designate header row when using `from_xls` or `from_xlsx`

In our tx_salaries repo, we use csvkit (and by extension, now agate-excel!) to prep Excel files for processing. An extra option we also have — after the conversion has happened — is to peel off rows until we hit the header row. We always know where that header row is, so we can pass that index in and be good to go.

When I was doing some tests with agate-excel directly, I quickly hit that wall. It gets a little funky if an Excel spreadsheet has some useless extra rows before the data.

That sound like something that'd be worth having as part of the package? I imagine it could be another option to be passed into the from_ commands.

table = agate.Table.from_xlsx('i_dont_want_your_extra_rows.xlsx', header_row=4)

The biggest decision to make? Do you use zero-based indexing, or start at 1 like Excel? 😬 (We forget what we picked constantly.)

When loading spreadsheets with column_type text cell numeric values get altered

I am trying to load tables from spreadsheets that contain text and numeric information (i.e. serial_numbers like "1234567890" and "12GNHGZ34567655". I have set column_types to be text only (confirmed to work) - however in the target table numeric values have been treated as float resulting in "1234567890.0". Alphanumeric values just stay as is.

Update documents for install

pip install agateexcel results in

Collecting agateexcel
  Could not find a version that satisfies the requirement agateexcel (from versions: )
No matching distribution found for agateexcel

pip install agate-excel results in correct install.

XLS: Number columns are losing precision

text,date,integer,boolean,float,datetime,empty_column,_unnamed
Chicago Reader,1971-01-01,40,True,1.0,1971-01-01T04:14:00,,
Chicago Sun-Times,1948-01-01,63,True,1.27,1948-01-01T14:57:13,,Extra data beyond headers will be trimmed
Chicago Tribune,1920-01-01,164,False,41800000.01,1920-01-01T00:00:00,,
This row has blanks,,,,,,,,
Unicode! Σ,,,,,,,,
text,date,integer,boolean,float,datetime,empty_column,
Chicago Reader,1971-01-01,40.0000000000,True,1.0000000000,1971-01-01 04:14:00,,
Chicago Sun-Times,1948-01-01,63.0000000000,True,1.2700000000,1948-01-01 14:57:13,,Extra data beyond headers will be trimmed
Chicago Tribune,1920-01-01,164.0000000000,False,41800000.0099999979,1920-01-01,,
This row has blanks,,,,,,,
Unicode! Σ,,,,,,,

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.