csvsort's Introduction

CSV Sort

For sorting CSV files on disk that do not fit into memory. The merge sort algorithm is used to break up the original file into smaller chunks, sort these in memory, and then merge these sorted files.
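
The split, sort, and merge steps described above can be sketched as follows. This is a minimal illustration of an external merge sort, not csvsort's actual code: the function name `external_sort` and the `max_rows` parameter are assumptions made for the example (csvsort itself limits chunks by size via `max_size`), the input is assumed to have no header, and values are compared as strings.

```python
import csv
import heapq
import os
import tempfile
from operator import itemgetter


def external_sort(input_path, output_path, columns, max_rows=1000):
    """Sort a headerless CSV by `columns`, holding at most `max_rows` rows in memory."""
    key = itemgetter(*columns)
    chunk_paths = []

    def flush(rows):
        # Sort one chunk in memory and write it to its own temporary file.
        rows.sort(key=key)
        with tempfile.NamedTemporaryFile(
            "w", delete=False, newline="", suffix=".csv"
        ) as tmp:
            csv.writer(tmp).writerows(rows)
            chunk_paths.append(tmp.name)

    # Phase 1: split the input into sorted chunks on disk.
    with open(input_path, newline="") as src:
        rows = []
        for row in csv.reader(src):
            rows.append(row)
            if len(rows) >= max_rows:
                flush(rows)
                rows = []
        if rows:
            flush(rows)

    # Phase 2: k-way merge of the sorted chunks, streamed row by row.
    handles = [open(path, newline="") for path in chunk_paths]
    try:
        with open(output_path, "w", newline="") as dst:
            merged = heapq.merge(*(csv.reader(h) for h in handles), key=key)
            csv.writer(dst).writerows(merged)
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)
```

Only one chunk is held in memory during the split phase, and the merge phase reads one row per chunk at a time, which is what keeps memory use bounded.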

Example usage

>>> from csvsort import csvsort
>>> # sort this CSV on the 5th and 3rd columns (columns are 0 indexed)
>>> csvsort('test1.csv', [4,2])
>>> # sort this CSV with no header on 4th column and save results to separate file
>>> csvsort('test2.csv', [3], output_filename='test3.csv', has_header=False)
>>> # sort this TSV on the first column and use a maximum of 10MB per split
>>> csvsort('test3.tsv', [0], max_size=10, delimiter='\t')
>>> # sort this CSV on the first column and force quotes around every field (default is csv.QUOTE_MINIMAL)
>>> import csv
>>> csvsort('test4.csv', [0], quoting=csv.QUOTE_ALL)

>>> # sort multiple CSV files into one
>>> csvsort(["test1.csv", "test2.csv"], [0], output_filename="test_all.csv")

Install

Supports Python 2 and 3:

$ pip install csvsort
$ pip3 install csvsort

test

csvsort's People

Contributors

antlauzon, benkibejs, manchicken, philip-sterne, polyg314, ramwin, richardpenman, suhaibmujahid

csvsort's Issues

Custom Number of Worker for Parallel Processing

Hi there,

Currently, the parallel option utilizes all cores available, including v-cores:

csvsort/__init__.py

Lines 84 to 86 in 8eed99e

if parallel:
    concurrency = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=concurrency) as pool:

This issue asks for a way to set the number of workers explicitly, or to pass a multiprocessing.Pool instance explicitly.

The main reason is that the max_size parameter is intended to control the amount of RAM used, but running N workers will consume at least N × max_size of RAM. With no control over N, it is difficult to control overall RAM usage.

Regards
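
One possible shape for this request is sketched below. The names `capped_workers` and `sort_chunks`, and the `workers`, `pool`, and `memory_budget_mb` parameters, are hypothetical and do not exist in csvsort; the estimate of N × max_size is the lower bound stated in the issue, not a measured figure.

```python
import multiprocessing


def capped_workers(max_size_mb, memory_budget_mb, workers=None):
    """Return a worker count N such that N * max_size_mb fits the memory budget."""
    # Fall back to all cores when the caller does not ask for a specific count.
    requested = workers if workers is not None else multiprocessing.cpu_count()
    by_memory = memory_budget_mb // max_size_mb
    # Always keep at least one worker, even if the budget is below max_size_mb.
    return max(1, min(requested, by_memory))


def sort_chunks(chunks, workers=None, pool=None):
    """Sort each chunk, reusing a caller-supplied pool when one is given."""
    if pool is not None:
        # The caller owns this pool and is responsible for closing it.
        return pool.map(sorted, chunks)
    # processes=None lets multiprocessing default to the CPU count.
    with multiprocessing.Pool(processes=workers) as own_pool:
        return own_pool.map(sorted, chunks)
```

With max_size=100 and a 400 MB budget, `capped_workers(100, 400, workers=8)` gives 4 workers rather than 8. Accepting a pool object lets the caller share one pool across several sorts.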

Update PyPI package

Hi there 👋🏻

Thank you for publishing the library!

However, it looks like the PyPI version has been behind the main branch for a while.

Could a fresher PyPI package be published, please? The latest changes contain some pretty nice features.

Regards

Fills disk with junk data

Running the library filled all free hard drive space with inaccessible junk data, over 500 GB in my case. The file I was trying to sort was only 12 GB, so the issue is likely something else.

Handle Empty Datasets

Hello,

Currently, sorting a file that has a header but no data rows raises an error:

    csvsort(
  File "/Users/xxx/.virtualenvs/zzz/lib/python3.9/site-packages/csvsort/__init__.py", line 68, in csvsort
    sorted_filename = mergesort(filenames, columns, encoding=encoding)
  File "/Users/xxx/.virtualenvs/zzz/lib/python3.9/site-packages/csvsort/__init__.py", line 175, in mergesort
    return sorted_filenames[0]
IndexError: list index out of range

The workaround is to check on the client side whether there are any rows.

Is it possible for the library to handle this corner case and still produce a file with a header if the output_filename param is provided?
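
Until the library handles this case, the client-side check can be wrapped in a small helper. This is a sketch under assumptions: `split_header` and `sort_or_copy_header` are hypothetical names, and `sort_func` stands for any callable taking an input path and an output path, such as a thin wrapper around csvsort.

```python
import csv


def split_header(path, has_header=True, encoding=None):
    """Return (header, has_rows) after reading at most two rows of the file."""
    with open(path, newline="", encoding=encoding) as src:
        reader = csv.reader(src)
        header = next(reader, None) if has_header else None
        # A second successful read means at least one data row exists.
        has_rows = next(reader, None) is not None
    return header, has_rows


def sort_or_copy_header(input_path, output_path, sort_func):
    """Call sort_func only when there are data rows; otherwise write the header."""
    header, has_rows = split_header(input_path)
    if has_rows:
        return sort_func(input_path, output_path)
    # Nothing to sort: the output is just the header (or an empty file).
    with open(output_path, "w", newline="") as dst:
        if header is not None:
            csv.writer(dst).writerow(header)
```

A call could look like `sort_or_copy_header('in.csv', 'out.csv', lambda src, dst: csvsort(src, [0], output_filename=dst))`.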

Custom Workdir Support

Hello,

Is it possible to explicitly specify a path to the workdir? Currently, the default is used:

ntf = tempfile.NamedTemporaryFile(delete=False, mode='w')

An excerpt from https://docs.python.org/3/library/tempfile.html#tempfile.mkstemp:

If dir is not None, the file will be created in that directory; otherwise, a default directory is used.

The reasoning behind this request is that some systems have their tmp dir located on a mounted drive with limited storage. This can become an issue when there is no control over the target system running the code. In such situations, the only workaround is to explicitly specify another location.

An example of the same issue in other software: dask/dask#1659.

Regards
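
A minimal sketch of what the request amounts to: `tempfile.NamedTemporaryFile` accepts the same `dir` argument as `mkstemp`, so a working directory can be passed straight through. The helper name `make_chunk_file` and the `workdir` parameter are hypothetical and not part of csvsort.

```python
import os
import tempfile


def make_chunk_file(workdir=None):
    """Create a chunk file in `workdir`, or in the default temp dir if it is None."""
    if workdir is not None:
        # Create the working directory on first use.
        os.makedirs(workdir, exist_ok=True)
    return tempfile.NamedTemporaryFile(delete=False, mode="w", dir=workdir)
```

Without library support, setting the TMPDIR environment variable before the first tempfile call is another route, since tempfile consults it when choosing the default directory.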

Reverse sorting

From reading the examples and the code, it seems that there is no way to sort in reverse order.
Is there a way, or is it not implemented?
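
For reference, a reverse option would have to apply to both phases of a merge sort: the in-memory chunk sort and the merge. The sketch below shows this with the standard library; `sort_chunk`, `merge_chunks`, and the `reverse` parameter are hypothetical names, not part of csvsort's API.

```python
import heapq
from operator import itemgetter


def sort_chunk(rows, columns, reverse=False):
    """Sort one in-memory chunk on the given column indexes."""
    return sorted(rows, key=itemgetter(*columns), reverse=reverse)


def merge_chunks(chunks, columns, reverse=False):
    """Merge chunks that were all sorted in the same direction."""
    # heapq.merge needs reverse=True when its inputs are sorted descending.
    return list(heapq.merge(*chunks, key=itemgetter(*columns), reverse=reverse))
```

Sorting the chunks in descending order but merging without `reverse=True` would produce a wrongly ordered result, so the flag has to be threaded through both calls.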

Incorrect usage of sys.getsizeof() (+ PyPy incompatibility)

I tried to use csvsort in PyPy; however, its implementation of sys.getsizeof() always raises TypeError (see https://doc.pypy.org/en/latest/cpython_differences.html#miscellaneous).
This differs from CPython's implementation, which only sometimes raises TypeError (see https://docs.python.org/3/library/sys.html#sys.getsizeof).

sys.getsizeof() is used here:

current_size += sys.getsizeof(row)

This usage is very suspicious: it returns the size of the list object, not of the strings it contains. In CPython:

>>> from sys import getsizeof
>>> getsizeof(["a", "b", "c"])
120
>>> getsizeof(["aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc"])
120
>>> getsizeof(["a"*1_000_000_000])
64

Not only does the use of sys.getsizeof prevent your useful library from being used on PyPy, the usage itself makes no sense.

writer.writerow returns whatever file.write returned, which is usually* the number of bytes actually written to the file.
Fortunately, that "usually" covers our use case here, in both CPython and PyPy.

I'll make a PR to fix that :)
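
A sketch of the proposed approach, assuming a text-mode output file: the return value of `writer.writerow` is accumulated instead of calling `sys.getsizeof`. The function name `write_until_full` and the `max_chars` parameter are made up for the example. For text files the count is in characters, which equals bytes only for single-byte encodings of the data.

```python
import csv
import io


def write_until_full(rows, out, max_chars):
    """Write rows until about max_chars are written; return (row count, chars)."""
    writer = csv.writer(out)
    written = 0
    count = 0
    for row in rows:
        if written >= max_chars:
            break
        # writerow returns what out.write returned: the number of
        # characters written for a text-mode file.
        written += writer.writerow(row)
        count += 1
    return count, written


buf = io.StringIO()
count, written = write_until_full([["a", "b"]] * 10, buf, max_chars=12)
# Each row is written as "a,b\r\n", so `written` matches the buffer length.
assert written == len(buf.getvalue())
```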
