dataframer

Tries to load any file into a pandas DataFrame, with a minimum of configuration, and a focus on bioinformatics

Examples

Typically, you’ll read a file from disk (open('my-file.txt', 'rb')), but a byte stream is simpler here.

>>> from io import BytesIO
>>> from dataframer import dataframer
>>> from pandas import set_option

>>> set_option('display.max_columns', None)

>>> bytes = b'a,b,c,z\n1,2,3,foo\n4,5,6,bar'
>>> stream = BytesIO(bytes)

By default, non-numeric columns after the first column are stripped.

>>> df_info = dataframer.parse(stream)
>>> df_info.data_frame
   b  c
a      
1  2  3
4  5  6
>>> df_info.label_map is None
True

Alternatively, they can be preserved in place...

>>> df_info = dataframer.parse(stream, keep_strings=True)
>>> df_info.data_frame
   b  c    z
a           
1  2  3  foo
4  5  6  bar
>>> df_info.label_map is None
True

... or they can be used to compose more meaningful row labels.

>>> df_info = dataframer.parse(stream, relabel=True)
>>> df_info.data_frame
   b  c
a      
1  2  3
4  5  6
>>> df_info.label_map
{1: 'foo / 1', 4: 'bar / 4'}

Alternatively, the first column can also be treated as data.

>>> df_info = dataframer.parse(stream, col_zero_index=False)
>>> df_info.data_frame
   a  b  c
0  1  2  3
1  4  5  6
>>> df_info.label_map is None
True

If you don't need the whole file, but instead only want the first row for column information:

>>> df_info = dataframer.parse(stream, first_row_only=True)
>>> df_info.data_frame
   b  c
a      
1  2  3
>>> df_info.label_map is None
True

Single-column lists are given an implicit header:

>>> bytes = b'banana\napple\npear'
>>> stream = BytesIO(bytes)
>>> df_info = dataframer.parse(stream)
>>> df_info.data_frame
     item
0  banana
1   apple
2    pear

Release process

In your branch, update VERSION.txt, using semantic versioning. When the PR is merged, the successful Travis build will push a new version to PyPI.

Contributors

mccalluc, scottx611x

dataframer's Issues

handle strings, not just files

Where I would plug this into lineup-refinery-docker (in parse_to_dicts), the calling code expects strings, not files. Fix it here, and then follow up there.
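
One way to accept both kinds of input could be a small normalizing step in front of the parser. The sketch below is an assumption, not dataframer's actual code: as_byte_stream is a hypothetical name, and UTF-8 is assumed when encoding text.

```python
from io import BytesIO


def as_byte_stream(source, encoding='utf-8'):
    """Return a binary file-like object for str, bytes, or file-like input.

    Hypothetical helper: not part of dataframer.
    """
    if isinstance(source, str):
        # Text is encoded first; UTF-8 is an assumption.
        return BytesIO(source.encode(encoding))
    if isinstance(source, (bytes, bytearray)):
        return BytesIO(bytes(source))
    if hasattr(source, 'read'):
        # Already file-like: pass it through unchanged.
        return source
    raise TypeError('expected str, bytes, or a file-like object')
```

A caller holding a string could then pass as_byte_stream(text) wherever an open file is expected today.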

Re-enable zip files, but first decide how to handle multi-file zips...

If we do zip files, we need to figure out what should happen when multiple files are zipped up together. On the back burner until we have a use case that will clarify this.

dataframer.py:

    compression = {
        b'\x1f\x8b': 'gzip',
        # TODO:
        # b'\x50\x4b': 'zip'
    }.get(file.read(2))
    ...
        # elif compression == 'zip':
        #     zf = zipfile.ZipFile(file)
        #     files = zf.namelist()
        #     first_bytes = zf.open(files[0]).peek(peek_window)

test:

    # No use-case for zip files right now, but it could be brought back.
    # def test_read_zip(self):
    #     self.assert_file_read(
    #         b'PK\x03\x04\n\x00\x00\x00\x00\x00\x8dZML\xfb\x9a\xc9\xa6\n\x00\x00\x00\n\x00\x00\x00\x08\x00\x1c\x00fake.csvUT\t\x00\x03J\x10\x83Zk\x11\x83Zux\x0b\x00\x01\x04\xf6\x01\x00\x00\x04\x14\x00\x00\x00,b,c\n1,2,3PK\x01\x02\x1e\x03\n\x00\x00\x00\x00\x00\x8dZML\xfb\x9a\xc9\xa6\n\x00\x00\x00\n\x00\x00\x00\x08\x00\x18\x00\x00\x00\x00\x00\x01\x00\x00\x00\xa4\x81\x00\x00\x00\x00fake.csvUT\x05\x00\x03J\x10\x83Zux\x0b\x00\x01\x04\xf6\x01\x00\x00\x04\x14\x00\x00\x00PK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x00N\x00\x00\x00L\x00\x00\x00\x00\x00', self.target  # noqa: E501
    #     )
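
If zip support is revisited, one conservative option is to accept only single-member archives and fail loudly otherwise, which defers the multi-file decision. The sketch below uses only the standard library. The names detect_compression and open_payload are hypothetical, and rejecting multi-file zips is an assumption, not a decision recorded in this issue.

```python
import gzip
import io
import zipfile

MAGIC = {
    b'\x1f\x8b': 'gzip',
    b'PK': 'zip',  # the same two bytes as b'\x50\x4b'
}


def detect_compression(file):
    """Look at the first two bytes of a seekable binary stream."""
    start = file.tell()
    magic = file.read(2)
    file.seek(start)  # leave the stream where we found it
    return MAGIC.get(magic)


def open_payload(file):
    """Return a binary stream of the uncompressed content."""
    compression = detect_compression(file)
    if compression == 'gzip':
        return gzip.GzipFile(fileobj=file)
    if compression == 'zip':
        archive = zipfile.ZipFile(file)
        names = archive.namelist()
        if len(names) != 1:
            # Assumption: refuse to guess which member the caller wants.
            raise ValueError(
                'expected exactly one file in zip, found %d' % len(names))
        return archive.open(names[0])
    return file
```

Note that an uncompressed file that happens to begin with the bytes PK would be treated as a zip here and fail inside zipfile, so a real implementation might want a stricter check.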

Option to just get headers

heatmap-scatter-dash is currently doing two passes in app_runner_refinery, but the first time we really just want the column headers. For large files in particular, it'd be great if we didn't need to read the whole file.
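
The first_row_only option shown in the examples addresses this. Independently of dataframer, the underlying idea can be sketched with the standard library alone: read one line and stop. Here read_header is a hypothetical name, and a comma delimiter and UTF-8 encoding are assumed.

```python
import csv
import io


def read_header(binary_stream, delimiter=',', encoding='utf-8'):
    """Return the first row of a delimited file as a list of column names.

    Only the first line is consumed, so the cost does not grow with
    file size. A header containing a quoted newline is not handled.
    """
    first_line = binary_stream.readline().decode(encoding)
    rows = csv.reader(io.StringIO(first_line), delimiter=delimiter)
    return next(rows, [])
```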

Python 2?

What was the problem with Python 2? If we can't make it work with Python 2, make sure it's not being advertised as Python 2 compatible.

I see this when I install:

Collecting dataframer==0.0.1 (from -r context/python/requirements.txt (line 3))
  Using cached https://files.pythonhosted.org/packages/64/34/3d8b9519b98ae19e6b0e32e7754766f8530c22749d3c33a760c1da91d52e/dataframer-0.0.1-py2.py3-none-any.whl

Single column, "Banana" header

Make sure we can parse single column files, and that the column delimiter only picks non-word characters. We've had trouble with examples like:

banana
pear
apple

parsing to:

b n n
pe r
pple
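
One way to avoid this failure is to consider only a fixed set of non-word candidate delimiters, and to fall back to "single column" when none of them appears consistently. The sketch below illustrates that idea only. It is not dataframer's implementation, guess_delimiter and CANDIDATES are hypothetical names, and it ignores quoting.

```python
# Candidate delimiters, in order of preference. Letters such as 'a'
# can never be chosen because they are not in this list.
CANDIDATES = [',', '\t', ';', '|']


def guess_delimiter(sample):
    """Return a delimiter that splits every line evenly, or None.

    None means "treat the input as a single column".
    """
    lines = [line for line in sample.splitlines() if line]
    for candidate in CANDIDATES:
        counts = {line.count(candidate) for line in lines}
        # Accept only if every line has the same nonzero count.
        if len(counts) == 1 and counts != {0}:
            return candidate
    return None
```

With this approach, 'banana\npear\napple' yields None, so the lines are kept whole instead of being split on the letter a.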
