
py-splice's Introduction


splice

A Python interface to the splice(2) system call.

About

splice(2) moves data between two file descriptors without copying between kernel address space and user address space. It transfers up to nbytes bytes of data from the file descriptor in to the file descriptor out.

zero-copy

Normally, when you copy data from one data stream to another, the data is first read into a buffer in user space and then written from that buffer to the target stream. This round trip through user space adds overhead.
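For contrast, here is a minimal sketch of that traditional copy path, written with plain os.read / os.write. The helper name copy_via_userspace and the 64 KiB buffer size are illustrative assumptions and are not part of py-splice.

```python
import os


def copy_via_userspace(fd_in, fd_out, bufsize=65536):
    """Copy fd_in to fd_out with a read/write loop; return the bytes copied."""
    total = 0
    while True:
        # First copy: kernel -> userspace buffer.
        chunk = os.read(fd_in, bufsize)
        if not chunk:  # an empty read means end of file
            break
        view = memoryview(chunk)
        while view:
            # Second copy: userspace buffer -> kernel. write() may be partial.
            written = os.write(fd_out, view)
            view = view[written:]
        total += len(chunk)
    return total
```

Every chunk crosses the kernel/user boundary twice, which is the overhead that zero-copy techniques aim to remove.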

Zero-copy lets us operate on data without copying it to user space. It essentially transfers the data by remapping pages rather than actually copying it, which improves performance.

Illustrated below is a simple example of copying data from one file to another using the splice(2) system call. For the complete documentation see API Documentation.

# copy data from one file to another using splice

from splice import splice

# open both files in a with block so they are closed after the copy
with open("read.txt") as to_read, open("write.txt", "w+") as to_write:
    splice(to_read.fileno(), to_write.fileno())

Copying the data twice (once into a userland buffer and once out of it) imposes performance and resource penalties. The splice(2) syscall avoids these penalties by not using userland buffers at all. The caller also makes a single splice() call instead of driving a series of read(2) / write(2) system calls, each of which requires a context switch.
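At the system-call level, splice(2) requires at least one of the two descriptors to be a pipe, so a file-to-file copy goes through an intermediate pipe. The sketch below illustrates that pattern with the standard library's os.splice (Python 3.10+, Linux only) rather than with py-splice itself. The helper name splice_copy and the 64 KiB chunk size are illustrative assumptions, and this is not a description of how py-splice is implemented.

```python
import os


def splice_copy(fd_in, fd_out, chunk=65536):
    """Copy fd_in to fd_out through a pipe with os.splice; return bytes copied."""
    pipe_r, pipe_w = os.pipe()
    total = 0
    try:
        while True:
            # file -> pipe: data is queued in the pipe without entering user space
            moved = os.splice(fd_in, pipe_w, chunk)
            if moved == 0:  # end of input
                break
            # pipe -> file: drain exactly what was queued above
            left = moved
            while left:
                left -= os.splice(pipe_r, fd_out, left)
            total += moved
    finally:
        os.close(pipe_r)
        os.close(pipe_w)
    return total
```

The output descriptor must not be opened in append mode, since splice(2) rejects such targets.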

Installation

pip

$ pip install py-splice

manual

$ git clone https://github.com/danishprakash/py-splice && cd py-splice
$ python3 setup.py install

API Documentation

The splice module provides a single function: splice().

  • splice.splice(in, out, offset, nbytes, flags)

    Copy nbytes bytes from file descriptor in (a regular file) to file descriptor out (a regular file), starting at offset. Return the number of bytes copied. When the end of the file is reached, return 0. If offset is not specified, the bytes are read from the current position of in, and the position of in is updated. If nbytes is not specified, the whole of in is copied over to out.

    Required arguments

    • in: file descriptor of the file from which data is to be read.
    • out: file descriptor of the file to which data is to be transferred.

    Optional positional arguments

    • offset: position in the input file at which reading starts.

    • nbytes: total number of bytes to be copied; if omitted, the whole of in is copied.

    • flags: a bit mask which can be composed by ORing together the following.

      • splice.SPLICE_F_MOVE
      • splice.SPLICE_F_NONBLOCK
      • splice.SPLICE_F_MORE
      • splice.SPLICE_F_GIFT

    More information on what each flag means can be found on the splice(2) man page.
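As a small illustration of composing such a bit mask, the sketch below ORs two flags together and decodes the result. The numeric values are the ones defined in the Linux headers; that the splice.SPLICE_F_* constants carry the same values is an assumption, and describe_flags is a hypothetical helper, not part of this module.

```python
# Values from the Linux headers (assumed to match splice.SPLICE_F_*).
SPLICE_F_MOVE = 0x01
SPLICE_F_NONBLOCK = 0x02
SPLICE_F_MORE = 0x04
SPLICE_F_GIFT = 0x08

FLAG_NAMES = {
    SPLICE_F_MOVE: "SPLICE_F_MOVE",
    SPLICE_F_NONBLOCK: "SPLICE_F_NONBLOCK",
    SPLICE_F_MORE: "SPLICE_F_MORE",
    SPLICE_F_GIFT: "SPLICE_F_GIFT",
}


def describe_flags(flags):
    """Return the names of all flags set in the bit mask (hypothetical helper)."""
    return [name for bit, name in FLAG_NAMES.items() if flags & bit]


flags = SPLICE_F_MOVE | SPLICE_F_MORE  # combine flags with bitwise OR
print(describe_flags(flags))  # ['SPLICE_F_MOVE', 'SPLICE_F_MORE']
```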

Usage

>>> from splice import splice

# init file objects
>>> to_read = open("read.txt") # file to read from
>>> to_write = open("write.txt", "w+") # file to write to

>>> len(to_read.read())
50

# copying whole file
>>> splice(to_read.fileno(), to_write.fileno())
50  # bytes copied

# copying file starting from an offset
>>> splice(to_read.fileno(), to_write.fileno(), offset=10)
40

# copying certain amount of bytes
>>> splice(to_read.fileno(), to_write.fileno(), nbytes=20)
20

# copying certain amount of bytes beginning from an offset
>>> splice(to_read.fileno(), to_write.fileno(), offset=10, nbytes=20)
20

# specifying flags
>>> import splice
>>> splice.splice(to_read.fileno(), to_write.fileno(), flags=splice.SPLICE_F_MORE)
50

Why would I use this?

splice(2) is expected to outperform traditional read/write copying because it avoids the overhead of copying data into user address space and instead performs the transfer by remapping pages in kernel address space. This makes it useful wherever copy performance matters to the task at hand.

Supported platforms

The splice(2) system call is Linux-specific.

Support

Feel free to add improvements, report issues or contact me about anything related to the project.

LICENSE

GNU GPL

py-splice's People

Contributors

danishprakash


py-splice's Issues

Please make buffer size tunable

You use a buffer of one page size for each splice call. That's fine if the second splice (to pipe) lets the kernel just mark memory pages as page cache (thus, it's fast). But it falls short if the underlying pipe is slow, for example if it belongs to a file opened with O_DIRECT.

Simple test to splice data between 1GB files opened with O_DIRECT each:

(2 x SATA3 RAID0 ssd)

buf_size                time     throughput
4k (your default)     60.67s      17MiB/s
64k                   13.17s      78MiB/s
256k                   9.05s     113MiB/s
1024k                  5.46s     188MiB/s
4096k                  3.80s     270MiB/s
8192k                  3.16s     325MiB/s
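As a quick sanity check of the figures above, the throughput column can be recomputed from the elapsed times. This sketch assumes that "1GB" means 1024 MiB; under that assumption the reported numbers agree with size / time to within about 1 MiB/s.

```python
FILE_SIZE_MIB = 1024  # assumption: the 1GB test file is 2**30 bytes

# buffer size -> elapsed seconds, as reported in the table above
RUNS = {
    "4k": 60.67,
    "64k": 13.17,
    "256k": 9.05,
    "1024k": 5.46,
    "4096k": 3.80,
    "8192k": 3.16,
}


def throughput_mib_s(seconds, size_mib=FILE_SIZE_MIB):
    """Average throughput in MiB/s for copying size_mib in the given time."""
    return size_mib / seconds


for buf, secs in RUNS.items():
    print(f"{buf:>6}: {throughput_mib_s(secs):6.1f} MiB/s")
```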

What did I do above:

  1. Tuned buf_size accordingly for each run
  2. Tuned the pipe buf size for both ends (fcntl F_SETPIPE_SZ) to the same value as buf_size

The pipe buf is 64k by default on modern Linux. Everything above 1M requires CAP_SYS_RESOURCE for non-root users. 8M is the maximum I was able to test; everything above that returns ENOMEM for root.
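For reference, resizing a pipe from Python can be sketched with the standard fcntl module. The names F_SETPIPE_SZ and F_GETPIPE_SZ are exposed by fcntl from Python 3.10; the numeric fallbacks below are the Linux values. resize_pipe is a hypothetical helper and is not something py-splice currently offers.

```python
import fcntl
import os

# Linux-specific fcntl commands; the fallbacks cover Python versions before 3.10.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)


def resize_pipe(fd, size):
    """Request a pipe buffer of at least `size` bytes; return the resulting size."""
    fcntl.fcntl(fd, F_SETPIPE_SZ, size)
    return fcntl.fcntl(fd, F_GETPIPE_SZ)


if __name__ == "__main__":
    pipe_r, pipe_w = os.pipe()
    try:
        # The kernel may round the request up, so print what was actually granted.
        print(resize_pipe(pipe_w, 256 * 1024))
    finally:
        os.close(pipe_r)
        os.close(pipe_w)
```

Both ends of a pipe share one buffer, so setting the size on either descriptor is enough.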

Thanks!
