
tablib's Introduction

Tablib: format-agnostic tabular dataset library


_____         ______  ___________ ______
__  /_______ ____  /_ ___  /___(_)___  /_
_  __/_  __ `/__  __ \__  / __  / __  __ \
/ /_  / /_/ / _  /_/ /_  /  _  /  _  /_/ /
\__/  \__,_/  /_.___/ /_/   /_/   /_.___/

Tablib is a format-agnostic tabular dataset library, written in Python.

Output formats supported:

  • Excel (Sets + Books)
  • JSON (Sets + Books)
  • YAML (Sets + Books)
  • Pandas DataFrames (Sets)
  • HTML (Sets)
  • Jira (Sets)
  • LaTeX (Sets)
  • TSV (Sets)
  • ODS (Sets)
  • CSV (Sets)
  • DBF (Sets)

Note that tablib purposefully excludes XML support. It always will. (Note: This is a joke. Pull requests are welcome.)

Tablib documentation is graciously hosted on https://tablib.readthedocs.io

It is also available in the docs directory of the source distribution.

Make sure to check out Tablib on PyPI!

Contribute

Please see the contributing guide.

tablib's People

Contributors

audiolion, bluetech, claudep, durden, gavinwahl, hugovk, iurisilvio, jdufresne, jean, jezdez, joshourisman, jqb, kennethreitz, kontza, lbeltrame, matthewhegarty, mloesch, msabramo, mwalling, nuno-andre, parths007, peymanslh, pre-commit-ci[bot], rabinnankhwa, rbonvall, rogersmark, timofurrer, xando, xdanielsb, xdissent


tablib's Issues

import problems with xls and xlsx

I tried to import a couple of xls and xlsx files but got the following error messages

In [18]: datset = tablib.import_set(open('../private/Excel_Models/YM4_-_Completed_v7.1.xls','rb'))
---------------------------------------------------------------------------
ReaderError                               Traceback (most recent call last)

/home/fkrause/Dev/web2py/applications/yeastmap/modules/ in ()

/home/fkrause/Dev/web2py/applications/yeastmap/modules/tablib/core.py in import_set(stream)
    937 def import_set(stream):
    938     """Return dataset of given stream."""
--> 939     (format, stream) = detect(stream)
    940 
    941     try:

/home/fkrause/Dev/web2py/applications/yeastmap/modules/tablib/core.py in detect(stream)
    928     for fmt in formats.available:
    929         try:
--> 930             if fmt.detect(stream):
    931                 return (fmt, stream)
    932         except AttributeError:

/home/fkrause/Dev/web2py/applications/yeastmap/modules/tablib/formats/_yaml.py in detect(stream)
     56     """Returns True if given stream is valid YAML."""
     57     try:
---> 58         _yaml = yaml.load(stream)
     59         if isinstance(_yaml, (list, tuple, dict)):
     60             return True

/usr/lib/python2.7/dist-packages/yaml/__init__.pyc in load(stream, Loader)
     67     and produce the corresponding Python object.
     68     """
---> 69     loader = Loader(stream)
     70     try:
     71         return loader.get_single_data()

/usr/lib/python2.7/dist-packages/yaml/loader.pyc in __init__(self, stream)
     32 
     33     def __init__(self, stream):
---> 34         Reader.__init__(self, stream)
     35         Scanner.__init__(self)
     36         Parser.__init__(self)

/usr/lib/python2.7/dist-packages/yaml/reader.pyc in __init__(self, stream)
     83             self.eof = False
     84             self.raw_buffer = ''
---> 85             self.determine_encoding()
     86 
     87     def peek(self, index=0):

/usr/lib/python2.7/dist-packages/yaml/reader.pyc in determine_encoding(self)
    133                 self.raw_decode = codecs.utf_8_decode
    134                 self.encoding = 'utf-8'
--> 135         self.update(1)
    136 
    137     NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')

/usr/lib/python2.7/dist-packages/yaml/reader.pyc in update(self, length)
    163                         position = exc.start
    164                     raise ReaderError(self.name, position, character,
--> 165                             exc.encoding, exc.reason)
    166             else:                                                                                                                                                                                                              
    167                 data = self.raw_buffer

ReaderError: 'utf8' codec can't decode byte #xd0: invalid continuation byte
  in "../private/Excel_Models/YM4_-_Completed_v7.1.xls", position 0

similar message, different file

/usr/lib/python2.7/dist-packages/yaml/reader.pyc in update(self, length)
    163                         position = exc.start
    164                     raise ReaderError(self.name, position, character,
--> 165                             exc.encoding, exc.reason)
    166             else:                                                                                                                                                                                                              
    167                 data = self.raw_buffer

ReaderError: 'utf8' codec can't decode byte #xa7: invalid start byte
  in "../private/Excel_Models/YM4 - Final.xlsx", position 14

not sure if I did something wrong
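The traceback shows the YAML detector trying to decode a binary Excel file as UTF-8: the 0xd0 byte it chokes on is the first byte of the OLE2 container's magic number. A minimal sketch of the idea of sniffing magic bytes before running any text-based detector (the function name is illustrative, not tablib's API):

```python
# Hypothetical sketch: identify binary spreadsheet formats by their magic
# bytes before handing the stream to text-based detectors such as YAML.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a ZIP archive

def detect_binary_format(data):
    """Return 'xls', 'xlsx', or None for the given file prefix."""
    if data.startswith(OLE2_MAGIC):
        return "xls"
    if data.startswith(ZIP_MAGIC):
        return "xlsx"
    return None
```

With that check in place, binary streams never reach the UTF-8 decoder at all.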

Dataset.html does not correctly close table cells for None values

For Example (notice the None value in the first column):

d = tablib.Dataset()
d.append_col(['h1-val1', None], header="heading1")
d.append_col(['h2-val1', 'h2-val2', ], header="heading2")
print d.html

Produces:

<table>
<tr><td>h1-val1</td>
<td>h2-val1</td></tr>
<tr><td>  <!-- This should have a closing td tag -->
<td>h2-val2</td></tr>
</table> 
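A minimal sketch of the intended behaviour: every cell, including a None, should render as a fully closed <td> pair. This is plain Python, not tablib's actual HTML writer:

```python
# Sketch: render one table row, emitting an empty but closed <td></td>
# for None values instead of a dangling <td>.
from html import escape

def render_row(row):
    cells = []
    for value in row:
        text = "" if value is None else escape(str(value))
        cells.append("<td>%s</td>" % text)
    return "<tr>%s</tr>" % "".join(cells)
```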

Unicode data breaks HTML export

This is nearly identical to the issue reported in #5 -- basically if there are unicode characters in data they'll break in the HTML output.

Here's the same example from that error report:

Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tablib import Dataset
>>> data = [u'\xfc', u'\xfd']
>>> dataset = Dataset()
>>> dataset.append(data)
>>> dataset.dict
[[u'\xfc', u'\xfd']]
>>> dataset.csv
'\xc3\xbc,\xc3\xbd\r\n'
>>> dataset.html
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "*****/lib/python2.7/site-packages/tablib/formats/_html.py", line 44, in export_set
    stream.writelines(str(page))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)

Looks like this is because of cStringIO -- see this stackoverflow question: http://stackoverflow.com/questions/1817695/python-how-to-get-stringio-writelines-to-accept-unicode-string
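A sketch of the direction suggested by that StackOverflow answer: swap cStringIO for a Unicode-aware text buffer. On Python 3, io.StringIO handles this natively:

```python
# io.StringIO accepts unicode text, unlike Python 2's cStringIO, so
# writelines() with non-ASCII data succeeds. Illustrative only.
import io

def export_html_fragment(values):
    stream = io.StringIO()
    stream.writelines("<td>%s</td>" % v for v in values)
    return stream.getvalue()
```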

Can't update a field

Updating a field fails silently.
It should either work, or fail loudly

how to reproduce

import tablib

headers = ('h1', 'h2')
data = tablib.Dataset(headers=headers)
data.append(('foo', 'bar'))
data.append(('foo2', 'bar2'))

data['h1'][1] = 'new'   # update a single cell
assert data['h1'][1] == 'new', "error still contains %s" % data['h1'][1]

result

The update does not barf, but doesn't work either.

Traceback (most recent call last):
File "test.py", line 10, in
assert data['h1'][1] == 'new', "error still contains %s" % data['h1'][1]
AssertionError: error still contains foo2

expected result

cell is updated or tablib refuses the assignment.
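The likely cause is that column access builds a fresh list on every lookup, so mutating the result never touches the stored rows. A toy illustration of the same trap in plain Python (not tablib's code):

```python
# Toy dataset whose column lookup returns a *copy* of the data, so item
# assignment on the returned list is silently lost.
class ColumnView:
    def __init__(self, headers, rows):
        self.headers = list(headers)
        self.rows = [list(r) for r in rows]

    def __getitem__(self, header):
        i = self.headers.index(header)
        return [row[i] for row in self.rows]  # a new list every call

data = ColumnView(('h1', 'h2'), [('foo', 'bar'), ('foo2', 'bar2')])
data['h1'][1] = 'new'           # mutates the throwaway copy
assert data['h1'][1] == 'foo2'  # the stored row is unchanged

# Writing through the row does update the cell:
data.rows[1][0] = 'new'
assert data['h1'][1] == 'new'
```

Either making column assignment write through, or having `__getitem__` return a read-only view, would satisfy "work or fail loudly".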

append_col fails with 0 rows

>>> dataset = tablib.Dataset(headers=['a', 'b', 'c'])
>>> dataset.append_col([], header='d')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tablib/core.py", line 684, in append_col
    self.rpush_col(col, header)
  File "/usr/local/lib/python2.7/dist-packages/tablib/core.py", line 649, in rpush_col
    self.insert_col(self.width, col, header=header)
  File "/usr/local/lib/python2.7/dist-packages/tablib/core.py", line 624, in insert_col
    col = self._clean_col(col)
  File "/usr/local/lib/python2.7/dist-packages/tablib/core.py", line 360, in _clean_col
    header = [col.pop(0)]
IndexError: pop from empty list

A workaround for adding columns in the zero-rows situation is just to append to the list of headers. But headers normally should not be edited, because dataset.width is not updated to match the number of headers--which is the right behavior, because the existing rows are not necessarily the same length as the headers. The result is that the only way to add a new column to a dataset of zero or more rows is somewhat ugly:

      if dataset.height == 0:
          dataset.headers.append('new column name')
      else:
          dataset.append_col([''] * dataset.height, header='new column name')

This situation occurs when building datasets where not all the columns are known in advance.

Proposed changes:

  1. append_col should be modified to work when there are zero rows
  2. Direct modification of headers should be forbidden, either when there are more than zero rows, or all the time (so headers must either be given in the argument to Dataset.__init__ or added through append_col/insert_col). The latter seems cleaner, but it would break backward compatibility for those who rely on modifying headers for whatever reason. (The only good reason I can think of is changing the names of headers after they've been created, but without adding or removing them.)
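A sketch of proposal 1: an append_col that tolerates zero rows by only extending the header list when there is no data yet. The Sheet class is a hypothetical stand-in, not tablib's implementation:

```python
# Minimal stand-in dataset whose append_col handles the zero-row case
# instead of raising IndexError.
class Sheet:
    def __init__(self, headers=None):
        self.headers = list(headers or [])
        self.rows = []

    @property
    def height(self):
        return len(self.rows)

    def append_col(self, col, header):
        if self.height == 0:
            # No rows yet: just register the header.
            self.headers.append(header)
        else:
            if len(col) != self.height:
                raise ValueError("column length must match dataset height")
            self.headers.append(header)
            for row, value in zip(self.rows, col):
                row.append(value)

dataset = Sheet(headers=['a', 'b', 'c'])
dataset.append_col([], header='d')   # no longer raises IndexError
assert dataset.headers == ['a', 'b', 'c', 'd']
```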

pip install does not work

Hi!

Just tried to install tablib via pip and it does not work.

Installation goes well, but when you want to import tablib it raises an ImportError saying "No module named odf".
(On the compat.py file line 45)

Looks like there are some files missing in the packages folder compared to what you have on github.

Regards,

Row validation/manipulation

Do you have any thoughts on the addition of row validation support, perhaps in the form of a per-Dataset array of callables that are applied to the row on append/insert? I'm unsure if the Dataset is the place for this kind of row data manipulation and validation.

I envisage passing an array of callables on Dataset initialisation which are then applied on row insert. Each callable returns the value to include in the row, or raises an exception. All errors are captured and the Dataset raises InvalidData. For example:

def oneof(val, row):
    valid = ['red', 'green', 'blue']
    if val in valid:
        return val
    raise InvalidData('Row column %d must be one of %s' % (row.index(val), valid))

data = tablib.Dataset(processors=[int, int, oneof])

data.append(('1', '45', 'green')) ## insert successfully
data.append(('1', '45', 'orange'))  ## InvalidData raised
data.append(('NaN', '45', 'orange'))  ## InvalidData raised

I don't want to start any work on this if you have already given it some thought and decided against it.
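The proposal above can be sketched in plain Python. InvalidData and the processors= argument are the proposal's names, not an existing tablib feature, and oneof is simplified to take a single value:

```python
# Sketch of per-Dataset processors applied on append: each callable
# returns the cleaned value or raises; any failure surfaces as InvalidData.
class InvalidData(Exception):
    pass

class ValidatingDataset:
    def __init__(self, processors):
        self.processors = processors
        self.rows = []

    def append(self, row):
        cleaned = []
        for proc, value in zip(self.processors, row):
            try:
                cleaned.append(proc(value))
            except InvalidData:
                raise
            except Exception as exc:
                raise InvalidData(str(exc))
        self.rows.append(tuple(cleaned))

def oneof(val):
    valid = ['red', 'green', 'blue']
    if val in valid:
        return val
    raise InvalidData('value must be one of %s' % valid)

data = ValidatingDataset(processors=[int, int, oneof])
data.append(('1', '45', 'green'))        # inserted successfully
try:
    data.append(('NaN', '45', 'green'))  # int('NaN') fails -> InvalidData
except InvalidData:
    pass
assert data.rows == [(1, 45, 'green')]
```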

tablib.compat module

v0.9.5 added a complete copy of tablib.core for 2.5 compatibility. Very messy.

A tablib.compat module should be added, and used as a sort of proxy to the proper definitions/overrides based on Python version.

Preserve leading zeros on XLSX output

Trying to determine the best way to handle this. Tablib is keeping them correctly. However, openpyxl needs an alternate syntax if you want to assign a literal.

An example is if I have a column of zip codes or social security numbers that I do not want to be interpreted as numeric.
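One approach is to decide per value whether it must be written as text. The helper below is a heuristic sketch, not tablib or openpyxl code; with openpyxl, a flagged value would then be assigned to the cell as a string rather than a number:

```python
# Heuristic: values with significant leading zeros (zip codes, SSNs)
# should be written to the spreadsheet as text, not numbers.
def needs_text_cell(value):
    s = str(value)
    return s.isdigit() and s.startswith('0') and len(s) > 1

assert needs_text_cell('02134')        # Boston zip code
assert not needs_text_cell('12345')
assert not needs_text_cell('0')        # a lone zero is a fine number
```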

Hangs generating ODS with certain dataset

Dear Sir Kenneth,

I have a dataset that has no problems saving to all other formats available in your wonderful lib, but when it comes to saving in ODS, the fan in my portable computer starts to sound like a modern jet engine and the process never ends.

What would be the necessary steps to debug and get to the root of this matter?

I wish you the best and want to thank you for the effort you have put in this fantastic package.

Appending columns does not add the specified header

Create a Dataset, and append a column, specifying a header:

>>> d = tablib.Dataset()
>>> d.append_col(['foo', 'bar', 'baz'], header="h1")
>>> d.headers

>>> type(d.headers)
<type 'NoneType'>

This does not save the header, and exporting the dataset (as csv, html, etc) will omit any header info.

Creating the headers before appending a column results in duplicate headers:

>>> d = tablib.Dataset()
>>> d.headers = ['h1', ]
>>> d.append_col(['foo', 'bar', 'baz'], header="h1")
>>> d.headers
['h1', 'h1']

Creating the headers after appending a column appears to provide the expected result.

>>> d = tablib.Dataset()
>>> d.append_col(['foo', 'bar', 'baz'], header="h1")
>>> d.headers = ['h1', ]
>>> d.headers
['h1']

Not quite compatible with python 2.5

Things work fine with python 2.5 except for one place: anyjson.py, line 85. This uses 2.6-style exceptions. Changing the 'as' to a comma would fix it.

setup.py improvements

Install speedup dependencies:

`$ python setup.py speedups`

Test suite (run by py.test):

`$ python setup.py test`

Unicode data breaks CSV export (Python 2.x)

>>> from tablib import Dataset
>>> data = [u'\xfc', u'\xfd']
>>> dataset = Dataset()
>>> dataset.append(data)
>>> dataset.dict
[[u'\xfc', u'\xfd']]
>>> dataset.csv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/tablib/formats/_csv.py", line 30, in export_set
    _csv.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)

The CSV export function uses StringIO to build the csv data, which itself is capable of handling Unicode. The problem, as far as I can tell, is that the stdlib csv module in 2.x requires some extra hoops to read and write non-ASCII data. Unfortunately, those hoops are hard to apply to tablib's csv export on demand, given how the export is exposed.

I have not tested this, but I believe this might not be an issue in 3.x, which has no mention of Unicode trouble in the CSV docs.
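On Python 3 the problem indeed goes away: the csv module writes str (which is Unicode) directly into an io.StringIO buffer. A sketch of that export path:

```python
# Python 3 csv handles Unicode natively; no encode/decode hoops needed.
import csv
import io

def export_csv(rows):
    stream = io.StringIO()
    writer = csv.writer(stream)
    for row in rows:
        writer.writerow(row)
    return stream.getvalue()

assert export_csv([['\xfc', '\xfd']]) == '\xfc,\xfd\r\n'
```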

Documentation Improvements

  • 'Fork Me'
  • Add Dataset.title information
  • Orgs using Tablib
  • Simple code example on landing page
  • Left side Project Name

ScannerError while importing CSV data with colon

When trying to import a CSV dataset with a colon inside it, a ScannerError is raised during the detection phase. Specifically, the YAML library raises this error:

ScannerError: mapping values are not allowed here

Test to reproduce the error:

>>> data = 'id,text,created_at\n83275,"random string with: colon",2012-02-09 23:35:15\n'
>>> tablib.import_set(data)
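A defensive sketch of one fix: let the stdlib csv.Sniffer claim delimited text before the YAML detector ever sees it, so a stray colon cannot trigger a ScannerError:

```python
# If csv.Sniffer can infer a dialect, treat the stream as CSV and skip
# the YAML detector entirely. Illustrative ordering, not tablib's code.
import csv

def looks_like_csv(text):
    try:
        csv.Sniffer().sniff(text)
        return True
    except csv.Error:
        return False

data = 'id,text,created_at\n83275,"random string with: colon",2012-02-09 23:35:15\n'
assert looks_like_csv(data)
```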

can't add a dataset to an empty Databook

Hi,

I created a subclass of Databook. Then I create one and add Datasets:

    document = ExcelDocument("test.xls") <- ExcelDocument is the subclass of Databook - don't ask why ;)
    sheet = Dataset()
    document.add_sheet(sheet)

but executing this leads to this error:

ERROR: testDocumentExport (TestJSONExport.Test)

Traceback (most recent call last):
File "...\TestJSONExport.py", line 21, in testDocumentExport
document.add_sheet(sheet)
File "...\Python25\Lib\site-packages\tablib\core.py", line 898, in add_sheet
self._datasets.append(dataset)
AttributeError: 'ExcelDocument' object has no attribute '_datasets'

Did I miss something?

Useless use of map in example

>>> data = tablib.Dataset(headers=['First Name', 'Last Name', 'Age'])
>>> map(data.append, [('Kenneth', 'Reitz', 22), ('Bessie', 'Monke', 21)])

map is used to construct lists (see the Python docs). You don't want to construct a list here, so it doesn't make sense to use map. Further, it's needlessly confusing. A regular Python for statement is much more readable, and doesn't have the side effect of creating a list:

>>> for datum in [('Kenneth', 'Reitz', 22), ('Bessie', 'Monke', 21)]:
...     data.append(datum)

Apologies for being pedantic, but due to your popularity in the Python community you have a lot of influence on others' coding styles.

Develop and Master branch are both version 0.9.11 even though the APIs are different

The documentation says it is for v0.9.11 but if one installs tablib v0.9.11 via pip some of the documented API calls are not available e.g. extend() and get_col()

The Tablib version available via pip corresponds to the Master branch while the documentation corresponds to the Develop branch, but both branches have __version__ = '0.9.11'

Assuming that the "latest" docs are generated from the Develop branch should that version not be bumped?

Also can see my StackOverflow question for more information

XLSX Support

With Excel spreadsheets there is a row limit of 65,536 rows. This is a limitation of the older .xls format. Apparently, according to my Excel wizard of a client, newer versions of Excel don't have this problem. I can't find an exact piece of evidence that agrees with that statement. All I have is a client that says "If I save the file like this I can have more rows". :)

If it's not possible, it's not possible. Just thought I'd see if you had any ideas. Thanks!

Import support - plans? roadmap?

I just wondered if there are plans to improve import support, and document it as a feature. I tried working with import_set() without much success.

I am mostly interested in CSV, but also excel.

ImportError: No module named xlrd with github HEAD version

If I install into my virtualenv using the github head version:

pip install git+git://github.com/kennethreitz/tablib.git

Then python can't find xlrd:

(env)[watson@watson-thinkpad latte2 (master)]$ python
Python 2.7.2 (default, Oct 27 2011, 01:40:22) 
[GCC 4.6.1 20111003 (Red Hat 4.6.1-10)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tablib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/watson/git/latte2/env/lib/python2.7/site-packages/tablib/__init__.py", line 3, in <module>
    from tablib.core import (
  File "/home/watson/git/latte2/env/lib/python2.7/site-packages/tablib/core.py", line 15, in <module>
    from tablib import formats
  File "/home/watson/git/latte2/env/lib/python2.7/site-packages/tablib/formats/__init__.py", line 6, in <module>
    from . import _csv as csv
  File "/home/watson/git/latte2/env/lib/python2.7/site-packages/tablib/formats/_csv.py", line 6, in <module>
    from tablib.compat import is_py3, csv, StringIO
  File "/home/watson/git/latte2/env/lib/python2.7/site-packages/tablib/compat.py", line 42, in <module>
    import tablib.packages.xlrd as xlrd
ImportError: No module named xlrd

This works fine with the normal pip install version.

Allow use of utf-8-sig encoding for Excel-compatible CSV export

Exporting tables with Unicode values to CSV does not include the byte-order mark needed by Excel to recognize that the CSV file contains Unicode. As a result, double-clicking the exported CSV file will not show the correct characters in cells with non-ASCII values. This is arguably an Excel limitation (see http://www.sqlsnippets.com/en/topic-13412.html) but given that tablib is presumably trying to make life easier for people dealing with Excel, this would be nice to fix.

I believe this can be addressed by using utf-8-sig instead of utf-8 as the encoding during export.

The following demonstrates the problem and solution using a different encoding.

def testCSVandBOM():
    # requires the UnicodeWriter and UnicodeReader classes (see Python csv module docs)
    val = 'Etel\xc3\xa4-Suomi, Finland'.decode('utf-8')
    print val

    # double-clicking this file to open in Excel decodes correctly
    with open('with-BOM.csv', 'wb') as f:
        w = UnicodeWriter(f, delimiter = ",", encoding = 'utf-8-sig' )
        w.writerow(['Someplace I want to visit',val])

    # double-clicking this file to open in Excel does NOT decode correctly
    with open('without-BOM.csv', 'wb') as f:
        w = UnicodeWriter(f, delimiter = ",", encoding = 'utf-8' )
        w.writerow(['Someplace I want to visit',val])

If compatibility with current CSV export behavior is a concern, maybe this could be added as a new export format?
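The suggested fix can be sketched in a few lines: encode the exported CSV with utf-8-sig so the file starts with the BOM that Excel uses to recognize UTF-8. This is an illustration, not tablib's export code:

```python
# Encode CSV output with 'utf-8-sig' so the result starts with the
# UTF-8 byte-order mark that Excel looks for.
import codecs
import csv
import io

def export_excel_csv(rows):
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    return text.getvalue().encode('utf-8-sig')

payload = export_excel_csv([['Someplace I want to visit', 'Etel\xe4-Suomi, Finland']])
assert payload.startswith(codecs.BOM_UTF8)
```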

Strict tag filtering

This is a great library, thanks so much for all the work so far.

I needed to filter a dataset by multiple criteria at the same time using tags. When appending each row of my data to the set I added tags for a state ('on' or 'off') and a date (ex '2011-11-01'). I found that the filter function, when given multiple tags, will return a row if it has at least one match between the filter parameter and the list of tags for the row (if the intersection of the parameters and the tag list is greater than 0). However I needed a row returned only when it matched all parameters given to the filter. I've added new copies of the has_tag() and filter() functions in core.py that will only return rows where the set of filter parameters all match with tags in a row.

There might be a more appropriate solution for this use case but I thought it might be useful to bring it up.

The code for strict filtering is in my fork of TabLib if there is interest. Apologies if I am going about submitting this the wrong way but it's a first for me.
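The strict behaviour described above reduces to a set-subset test rather than a set-intersection test. A minimal sketch, independent of tablib's internals:

```python
# Strict filtering: a row matches only when *all* requested tags are
# present, rather than any one of them.
def has_all_tags(row_tags, wanted):
    return set(wanted) <= set(row_tags)

rows = [
    (('2011-11-01', 'on'),  'row1'),
    (('2011-11-01', 'off'), 'row2'),
    (('2011-11-02', 'on'),  'row3'),
]
strict = [name for tags, name in rows if has_all_tags(tags, ['2011-11-01', 'on'])]
assert strict == ['row1']   # any-match filtering would return all three
```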

Failing test: test_yaml_import_book

~/dev/git-repos/tablib$ PAGER=cat git log -1 --oneline
e8c923d Merge pull request #58 from jqb/develop
~/dev/git-repos/tablib$ .tox/py27/bin/pip freeze
PyYAML==3.10
distribute==0.6.24
omnijson==0.1.2
py==1.4.7
pytest==2.2.3
tablib==0.9.11
wsgiref==0.1.2
xlrd==0.7.7
~/dev/git-repos/tablib$ .tox/py27/bin/py.test test_tablib.py
===================================== test session starts ======================================
platform darwin -- Python 2.7.3 -- pytest-2.2.3
collected 44 items 

test_tablib.py ..........................................F.

====================================== FAILURES ======================================
_____________________________ TablibTestCase.test_yaml_import_book _____________________________

self = <test_tablib.TablibTestCase testMethod=test_yaml_import_book>

    def test_yaml_import_book(self):
        """Generate and import YAML book serialization."""
        data.append(self.john)
        data.append(self.george)
        data.headers = self.headers

        book.add_sheet(data)
        _yaml = book.yaml

>       book.yaml = _yaml

test_tablib.py:386: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

dbook = <databook object>
in_stream = '- !!python/object/apply:collections.OrderedDict\n  - - - data\n      - - !!python/object/apply:collections.OrderedDic...       - - [first_name, George]\n            - [last_name, Washington]\n            - [gpa, 67]\n    - [title, null]\n'

    def import_book(dbook, in_stream):
        """Returns databook from YAML stream."""

        dbook.wipe()

>       for sheet in yaml.safe_load(in_stream):

tablib/formats/_yaml.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

stream = '- !!python/object/apply:collections.OrderedDict\n  - - - data\n      - - !!python/object/apply:collections.OrderedDic...       - - [first_name, George]\n            - [last_name, Washington]\n            - [gpa, 67]\n    - [title, null]\n'

    def safe_load(stream):
        """
        Parse the first YAML document in a stream
        and produce the corresponding Python object.
        Resolve only basic YAML tags.
        """
>       return load(stream, SafeLoader)

    def safe_load_all(stream):

.tox/py27/lib/python2.7/site-packages/yaml/__init__.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

stream = '- !!python/object/apply:collections.OrderedDict\n  - - - data\n      - - !!python/object/apply:collections.OrderedDic...       - - [first_name, George]\n            - [last_name, Washington]\n            - [gpa, 67]\n    - [title, null]\n'
Loader = <class 'yaml.loader.SafeLoader'>

    def load(stream, Loader=Loader):
        """
        Parse the first YAML document in a stream
        and produce the corresponding Python object.
        """
        loader = Loader(stream)
        try:
>           return loader.get_single_data()
        finally:

.tox/py27/lib/python2.7/site-packages/yaml/__init__.py:71: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <yaml.loader.SafeLoader object at 0x101e9ab90>

    def get_single_data(self):
        # Ensure that the stream contains a single document and construct it.
        node = self.get_single_node()
        if node is not None:
>           return self.construct_document(node)

.tox/py27/lib/python2.7/site-packages/yaml/constructor.py:39: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <yaml.loader.SafeLoader object at 0x101e9ab90>
node = SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[SequenceNode(tag=u'tag:yaml....lue=u'title'), ScalarNode(tag=u'tag:yaml.org,2002:null', value=u'null')])])])])

    def construct_document(self, node):
        data = self.construct_object(node)
        while self.state_generators:
            state_generators = self.state_generators
            self.state_generators = []
            for generator in state_generators:
>               for dummy in generator:

.tox/py27/lib/python2.7/site-packages/yaml/constructor.py:48: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <yaml.loader.SafeLoader object at 0x101e9ab90>
node = SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[SequenceNode(tag=u'tag:yaml....lue=u'title'), ScalarNode(tag=u'tag:yaml.org,2002:null', value=u'null')])])])])

    def construct_yaml_seq(self, node):
        data = []
        yield data
>       data.extend(self.construct_sequence(node))

.tox/py27/lib/python2.7/site-packages/yaml/constructor.py:393: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <yaml.loader.SafeLoader object at 0x101e9ab90>
node = SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[SequenceNode(tag=u'tag:yaml....lue=u'title'), ScalarNode(tag=u'tag:yaml.org,2002:null', value=u'null')])])])])
deep = False

    def construct_sequence(self, node, deep=False):
        if not isinstance(node, SequenceNode):
            raise ConstructorError(None, None,
                    "expected a sequence node, but found %s" % node.id,
                    node.start_mark)
        return [self.construct_object(child, deep=deep)
>               for child in node.value]

.tox/py27/lib/python2.7/site-packages/yaml/constructor.py:118: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <yaml.loader.SafeLoader object at 0x101e9ab90>
node = SequenceNode(tag=u'tag:yaml.org,2002:python/object/apply:collections.OrderedDi...value=u'title'), ScalarNode(tag=u'tag:yaml.org,2002:null', value=u'null')])])])
deep = False

    def construct_object(self, node, deep=False):
        if node in self.constructed_objects:
            return self.constructed_objects[node]
        if deep:
            old_deep = self.deep_construct
            self.deep_construct = True
        if node in self.recursive_objects:
            raise ConstructorError(None, None,
                    "found unconstructable recursive node", node.start_mark)
        self.recursive_objects[node] = None
        constructor = None
        tag_suffix = None
        if node.tag in self.yaml_constructors:
            constructor = self.yaml_constructors[node.tag]
        else:
            for tag_prefix in self.yaml_multi_constructors:
                if node.tag.startswith(tag_prefix):
                    tag_suffix = node.tag[len(tag_prefix):]
                    constructor = self.yaml_multi_constructors[tag_prefix]
                    break
            else:
                if None in self.yaml_multi_constructors:
                    tag_suffix = node.tag
                    constructor = self.yaml_multi_constructors[None]
                elif None in self.yaml_constructors:
                    constructor = self.yaml_constructors[None]
                elif isinstance(node, ScalarNode):
                    constructor = self.__class__.construct_scalar
                elif isinstance(node, SequenceNode):
                    constructor = self.__class__.construct_sequence
                elif isinstance(node, MappingNode):
                    constructor = self.__class__.construct_mapping
        if tag_suffix is None:
>           data = constructor(self, node)

.tox/py27/lib/python2.7/site-packages/yaml/constructor.py:88: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <yaml.loader.SafeLoader object at 0x101e9ab90>
node = SequenceNode(tag=u'tag:yaml.org,2002:python/object/apply:collections.OrderedDi...value=u'title'), ScalarNode(tag=u'tag:yaml.org,2002:null', value=u'null')])])])

    def construct_undefined(self, node):
        raise ConstructorError(None, None,
                "could not determine a constructor for the tag %r" % node.tag.encode('utf-8'),
>               node.start_mark)
E       ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:collections.OrderedDict'
E         in "<string>", line 1, column 3:
E           - !!python/object/apply:collection ... 
E             ^

.tox/py27/lib/python2.7/site-packages/yaml/constructor.py:414: ConstructorError
============================= 1 failed, 43 passed in 0.41 seconds ==============================

Change Row/Col Append API

The Row/Col appending api is a little odd.

Maybe replicate more data-centric apis (e.g. redis' rpush and lpush).

Dataset[header_name] doesn't like Unicode

>>> data = tablib.Dataset()
>>> csv_data = open('test.csv').read()
>>> csv_data
'Header 1,Header 2\nJunk,Data\n'
>>> data.csv = csv_data
>>> data.headers
[u'Header 1', u'Header 2']
>>> data['Header 1']
[u'Junk']
>>> data[data.headers[0]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build\bdist.win32\egg\tablib\core.py", line 172, in __getitem__
TypeError: list indices must be integers, not unicode

Seems to be near tablib/core.py line 165.

(It isn't a blocker; I'll have a pull request for it tonight.)
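The likely fix is to treat any string key, including `unicode` on Python 2, as a header lookup rather than passing it to the underlying list as an index. A minimal stand-in (hypothetical `MiniDataset`, not tablib's real `core.py`) showing the behaviour:

```python
# Stand-in for Dataset demonstrating the fix: string keys are resolved
# via headers.index() instead of being used as list indices.
class MiniDataset:
    def __init__(self, headers, rows):
        self.headers = list(headers)
        self._data = [tuple(r) for r in rows]

    def __getitem__(self, key):
        if isinstance(key, str):           # str/unicode -> header lookup
            pos = self.headers.index(key)  # raises ValueError if missing
            return [row[pos] for row in self._data]
        return self._data[key]             # integer/slice -> row access


data = MiniDataset(['Header 1', 'Header 2'], [('Junk', 'Data')])
assert data['Header 1'] == ['Junk']
assert data[data.headers[0]] == ['Junk']   # no TypeError now
```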

YAML format chokes on TSV when detecting

See here:

======================================================================
ERROR: Test YAML format detection.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbrauer/projects/tablib/test_tablib.py", line 486, in test_yaml_format_detect
    self.assertFalse(tablib.formats.yaml.detect(_tsv))
  File "/home/jbrauer/projects/tablib/tablib/formats/_yaml.py", line 58, in detect
    _yaml = yaml.safe_load(stream)
  File "/usr/lib/python2.6/dist-packages/yaml/__init__.py", line 75, in safe_load
    return load(stream, SafeLoader)
  File "/usr/lib/python2.6/dist-packages/yaml/__init__.py", line 58, in load
    return loader.get_single_data()
  File "/usr/lib/python2.6/dist-packages/yaml/constructor.py", line 42, in get_single_data
    node = self.get_single_node()
  File "/usr/lib/python2.6/dist-packages/yaml/composer.py", line 35, in get_single_node
    if not self.check_event(StreamEndEvent):
  File "/usr/lib/python2.6/dist-packages/yaml/parser.py", line 93, in check_event
    self.current_event = self.state()
  File "/usr/lib/python2.6/dist-packages/yaml/parser.py", line 138, in parse_implicit_document_start
    StreamEndToken):
  File "/usr/lib/python2.6/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/lib/python2.6/dist-packages/yaml/scanner.py", line 257, in fetch_more_tokens
    % ch.encode('utf-8'), self.get_mark())
ScannerError: while scanning for the next token
found character '\t' that cannot start any token
  in "<string>", line 1, column 4:
    foo bar
       ^
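A tolerant `detect()` would treat any parse failure as "not YAML" instead of letting the scanner error escape. A sketch (assuming PyYAML, where `ScannerError` is a subclass of `yaml.YAMLError`; this is illustrative, not the actual `_yaml.py`):

```python
import yaml

def detect(stream):
    """Return True if the given string parses as a YAML list/dict."""
    try:
        loaded = yaml.safe_load(stream)
    except yaml.YAMLError:   # covers ScannerError on tab characters
        return False
    return isinstance(loaded, (list, tuple, dict))


assert detect('- a\n- b\n') is True
assert detect('foo\tbar\n') is False   # TSV input no longer raises
```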

XLSX Export: Right Column Frozen

Overview

Whenever I export to xlsx (via data.xlsx), the rightmost column is frozen. I have seen this on Mac OS X 10.7, Ubuntu 10, and Windows 7, with both large and small files (I'll show you how I recreated the problem below).

System Specs

  • Operating systems file generated on: Ubuntu 10, Mac OS X 10.7, Windows 7
  • Python versions: 2.6, 2.7
  • Microsoft Excel versions file read on: Excel for Mac 2011 (14.2.3), Excel 2010 (whatever the most recent one for windows is)
  • Tablib version: most recent one
  • Note: I haven't opened up Calc (or whatever it's called) on Ubuntu to test. I merely generated an xlsx file on Ubuntu and saw the problem.

Recreate the problem

>>> import tablib
>>> headers = ('first_name', 'last_name', 'frozen_frame')
>>> data = [ ('Aaron', 'Levin','is cool'), ('David', 'Steinberg','is pretty cool'), ('Marie', 'Flanagan','is the best')]
>>> data = tablib.Dataset(*data, headers=headers)
>>> with open('frozen_frames.xlsx','wb') as frozen_frames:
...     frozen_frames.write(data.xlsx)
...  
>>> 

Then open up the file. You should see the rightmost column frozen/fixed.

break_on() feature

I am considering whether this should be a feature or whether I should use another construct.

I want to execute a callback when the value of a cell changes. I may implement something this weekend, but I'm putting it out here to spark discussion.
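To make the proposal concrete, here is one shape `break_on()` could take: a helper that walks a column and fires a callback at each value change. Nothing here is tablib API; the function name and signature are hypothetical.

```python
# Hypothetical break_on(): call callback(prev, current, row_index)
# whenever the value in column `col` changes between adjacent rows.
def break_on(rows, col, callback):
    prev = object()  # sentinel that never equals a real cell value
    for i, row in enumerate(rows):
        value = row[col]
        if i and value != prev:
            callback(prev, value, i)
        prev = value


changes = []
rows = [('a', 1), ('a', 1), ('b', 2), ('b', 3)]
break_on(rows, 0, lambda old, new, i: changes.append((old, new, i)))
print(changes)  # [('a', 'b', 2)]
```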

Excel import broken

I create an empty Databook, write it to an xls file, and then reading it back throws an exception. The code I use is:

d = tablib.Databook()
s = tablib.Dataset()
d.add_sheet(s)
open('tablib.xls','wb').write(d.xls)
d = tablib.import_set(open('tablib.xls','rb'))

The traceback:

ReaderError                              Traceback (most recent call last)
/Users/dvelkov/<ipython-input-59-072c251b8638> in <module>()
----> 1 d = tablib.import_set(open('tablib.xls','rb'))

/Library/Python/2.7/site-packages/tablib/core.pyc in import_set(stream)
    937 def import_set(stream):
    938     """Return dataset of given stream."""
--> 939     (format, stream) = detect(stream)
    940 
    941     try:

/Library/Python/2.7/site-packages/tablib/core.pyc in detect(stream)
    928     for fmt in formats.available:
    929         try:
--> 930             if fmt.detect(stream):
    931                 return (fmt, stream)
    932         except AttributeError:

/Library/Python/2.7/site-packages/tablib/formats/_yaml.pyc in detect(stream)
     56     """Returns True if given stream is valid YAML."""
     57     try:
---> 58         _yaml = yaml.load(stream)
     59         if isinstance(_yaml, (list, tuple, dict)):
     60             return True

/Library/Python/2.7/site-packages/tablib/packages/yaml/__init__.pyc in load(stream, Loader)
     55     and produce the corresponding Python object.
     56     """
---> 57     loader = Loader(stream)
     58     return loader.get_single_data()
     59 

/Library/Python/2.7/site-packages/tablib/packages/yaml/loader.pyc in __init__(self, stream)
     32 
     33     def __init__(self, stream):
---> 34         Reader.__init__(self, stream)
     35         Scanner.__init__(self)
     36         Parser.__init__(self)

/Library/Python/2.7/site-packages/tablib/packages/yaml/reader.pyc in __init__(self, stream)
    118             self.eof = False
    119             self.raw_buffer = ''
--> 120             self.determine_encoding()
    121 
    122     def peek(self, index=0):

/Library/Python/2.7/site-packages/tablib/packages/yaml/reader.pyc in determine_encoding(self)
    168                 self.raw_decode = utf_8_decode
    169                 self.encoding = 'utf-8'
--> 170         self.update(1)
    171 
    172     NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')

/Library/Python/2.7/site-packages/tablib/packages/yaml/reader.pyc in update(self, length)
    198                         position = exc.start
    199                     raise ReaderError(self.name, position, character,
--> 200                             exc.encoding, exc.reason)
    201             else:
    202                 data = self.raw_buffer

ReaderError: 'utf8' codec can't decode byte #xd0: invalid continuation byte
  in "tablib.xls", position 0
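The root cause is format detection: the YAML detector is handed raw .xls bytes and PyYAML's reader raises `ReaderError` instead of the detector simply returning False. A more defensive `detect()` loop would treat any exception from a format's `detect()` as "not this format". A sketch (the `Crashy`/`CSVish` classes are toy stand-ins, not tablib code):

```python
# Sketch of a tolerant core.detect() loop: any exception from a
# format detector means "not this format", so binary data no longer
# escapes as a ReaderError.
def detect(stream, formats):
    for fmt in formats:
        try:
            if fmt.detect(stream):
                return (fmt, stream)
        except Exception:  # was: except AttributeError only
            continue
    return (None, stream)


class Crashy:
    """Stand-in for the YAML detector choking on binary data."""
    def detect(self, stream):
        raise UnicodeDecodeError('utf-8', b'\xd0', 0, 1,
                                 'invalid continuation byte')

class CSVish:
    """Toy CSV detector."""
    def detect(self, stream):
        return ',' in stream


fmt, _ = detect('a,b\n1,2\n', [Crashy(), CSVish()])
print(type(fmt).__name__)  # CSVish
```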
