arcanaframework / fileformats

a Python package for the specification, validation and manipulation of file formats and types

Home Page: https://arcanaframework.github.io/fileformats/

License: Other

Python 100.00%
file-extensions file-format-converter file-formats magic-numbers mime-types

fileformats's Introduction

FileFormats

FileFormats provides a library of file-format types implemented as Python classes. The file-format types were designed to be used in type validation and data movement during the construction and execution of data workflows. However, they can also be used for some basic data handling (e.g. loading data to dictionaries) and for format conversions between some equivalent types, via methods defined in the associated fileformats-extras package.

File-format types are typically identified by a combination of file extension and "magic numbers" where applicable. However, unlike many other file-type Python packages, FileFormats supports multi-file data formats ("file sets") often found in scientific workflows, e.g. with separate header/data files. FileFormats also provides a flexible framework for adding custom identification routines for exotic file formats, e.g. formats that require inspecting headers to locate data files, directories containing certain file types, or peeking at metadata fields to define specific sub-types (e.g. a functional MRI DICOM file set). It is in the handling of multi-file formats that FileFormats comes into its own, since it keeps track of auxiliary files when moving or copying them to different file-system locations and when calculating hashes.
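To illustrate why tracking auxiliary files matters, here is a minimal sketch that uses only the standard library rather than the FileFormats API. The helpers `copy_file_set` and `hash_file_set`, and the `.hdr`/`.img` pair, are hypothetical names chosen for illustration. The point is that copying and hashing operate on the whole set, not on a single primary file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def copy_file_set(paths, dest_dir):
    """Copy every file in the set to dest_dir, keeping the file names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(p, dest_dir / Path(p).name)) for p in paths]


def hash_file_set(paths):
    """Hash the names and contents of all files in a stable (sorted) order."""
    digest = hashlib.sha256()
    for p in sorted(Path(p) for p in paths):
        digest.update(p.name.encode())
        digest.update(p.read_bytes())
    return digest.hexdigest()


# Hypothetical header/data pair standing in for a multi-file format
src = Path(tempfile.mkdtemp())
(src / "scan.hdr").write_text("dims=2x2")
(src / "scan.img").write_bytes(b"\x00\x01\x02\x03")
file_set = [src / "scan.hdr", src / "scan.img"]

copied = copy_file_set(file_set, Path(tempfile.mkdtemp()) / "out")
same_hash = hash_file_set(file_set) == hash_file_set(copied)
```

Because the hash covers both files, a copy that silently dropped the data file would produce a different digest.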

See the extension template for instructions on how to design FileFormats extension modules that augment the standard file types implemented in the main repository with custom domain- or vendor-specific file-format types (e.g. fileformats-medimage).

Notes on MIME-type coverage

Support for all non-vendor standard MIME types (i.e. ones not matching */vnd.* or */x-*) has been added to FileFormats by semi-automatically scraping the IANA MIME types website for file extensions and magic numbers. As such, many of the formats in the library have not been properly tested on real data and so should be treated with some caution. If you encounter any issues with an implemented file type, please raise an issue in the GitHub tracker.
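The "non-vendor standard" criterion above can be expressed as a pair of glob patterns. The following is a small standalone sketch, not part of FileFormats; the function name `is_standard_mime` is made up for illustration.

```python
from fnmatch import fnmatchcase


def is_standard_mime(mime: str) -> bool:
    """Return True unless the type matches */vnd.* (vendor) or */x-* (unregistered)."""
    return not (fnmatchcase(mime, "*/vnd.*") or fnmatchcase(mime, "*/x-*"))


standard = [
    m
    for m in ["image/png", "application/vnd.ms-excel", "application/x-tar", "text/csv"]
    if is_standard_mime(m)
]
```

Here only `image/png` and `text/csv` pass the filter.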

Adding support for vendor formats will be relatively straightforward and is planned for v1.0.

Installation

FileFormats can be installed for Python >= 3.7 from PyPI with

$ python3 -m pip install fileformats

Support for converter methods between a few select formats is provided by the separate 'extras' package, which can be installed with

$ python3 -m pip install fileformats-extras

Examples

Using the WithMagicNumber mixin class, the Png format can be defined concisely as

from fileformats.generic import File
from fileformats.core.mixin import WithMagicNumber

class Png(WithMagicNumber, File):
    binary = True
    ext = ".png"
    iana_mime = "image/png"
    magic_number = b"\x89PNG"  # PNG files start with byte 0x89 followed by "PNG"

Files can then be checked to see whether they are of PNG format by

png = Png("/path/to/image/file.png")  # Checks the extension and magic number

which raises a FormatMismatchError if initialisation or validation fails. Alternatively, for a boolean check, use the matches method

if Png.matches(a_path_to_a_file):
    ...  # handle the PNG case
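For readers unfamiliar with magic numbers, the check performed on extension and leading bytes can be approximated with the standard library alone. This is a sketch of the general technique, not the FileFormats implementation; `looks_like_png` and the temporary files are hypothetical.

```python
import tempfile
from pathlib import Path

PNG_EXT = ".png"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the standard 8-byte PNG signature


def looks_like_png(path) -> bool:
    """Check the extension, then compare the leading bytes to the signature."""
    path = Path(path)
    if path.suffix.lower() != PNG_EXT:
        return False
    with path.open("rb") as f:
        return f.read(len(PNG_MAGIC)) == PNG_MAGIC


# Hypothetical files: one with a valid signature, one merely renamed to .png
tmp = Path(tempfile.mkdtemp())
good = tmp / "image.png"
good.write_bytes(PNG_MAGIC + b"rest of file")
bad = tmp / "renamed.png"
bad.write_bytes(b"not a png")
```

Checking the leading bytes as well as the extension is what catches files such as `renamed.png`, which carry the right suffix but the wrong content.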

Format Identification

There are two main functions that can be used for format identification:

  • fileformats.core.from_mime
  • fileformats.core.find_matching

from_mime

As the name suggests, this function returns the FileFormats class corresponding to a given MIME string. All non-vendor official MIME types are supported. Non-official types can be loaded using the application/x-name-of-type form, as long as the name of the type is unique amongst all installed format types. To avoid name clashes between different extension types, a "MIME-like" string can be used instead, in which the registry part is an informal one corresponding to the fileformats extension namespace, e.g. medimage/nifti-gz or datascience/hdf5.
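As a rough illustration of how a "MIME-like" string separates into an extension namespace and a type name, consider the sketch below. It is not the logic used by from_mime: the helper `split_mime_like` is hypothetical, and the mapping of hyphenated names to CamelCase class names is an assumed convention.

```python
def split_mime_like(mime_like: str):
    """Split 'namespace/type-name' into (namespace, 'TypeName')."""
    namespace, _, name = mime_like.partition("/")
    if not namespace or not name:
        raise ValueError(f"Expected 'namespace/name', got {mime_like!r}")
    # Assumed convention: hyphen-separated words become a CamelCase class name
    class_name = "".join(part.capitalize() for part in name.split("-"))
    return namespace, class_name


nifti = split_mime_like("medimage/nifti-gz")
hdf5 = split_mime_like("datascience/hdf5")
```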

find_matching

Given a set of file-system paths, by default, find_matching will iterate through all installed fileformats classes and return all that validate successfully (formats without any specific constraints are excluded by default). The potential candidate classes can be restricted by using the candidates keyword argument.
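The search described above amounts to trying each candidate class in turn and keeping those that validate. The sketch below mimics that loop with stand-in classes. `SketchFormat`, `SketchPng`, `SketchGif` and `find_matching_sketch` are hypothetical and do not reflect the library's internals. For brevity it takes a single path rather than a set of paths.

```python
import tempfile
from pathlib import Path


class SketchFormat:
    """Hypothetical stand-in for a format class: an extension plus magic bytes."""

    ext = ""
    magic = b""

    @classmethod
    def matches(cls, path) -> bool:
        path = Path(path)
        if path.suffix != cls.ext:
            return False
        return path.read_bytes().startswith(cls.magic)


class SketchPng(SketchFormat):
    ext = ".png"
    magic = b"\x89PNG"


class SketchGif(SketchFormat):
    ext = ".gif"
    magic = b"GIF8"


def find_matching_sketch(path, candidates):
    """Return every candidate class that validates against the path."""
    return [c for c in candidates if c.matches(path)]


sample = Path(tempfile.mkdtemp()) / "sample.gif"
sample.write_bytes(b"GIF89a...")
matched = find_matching_sketch(sample, [SketchPng, SketchGif])
```

Passing a shorter candidate list narrows the search in the same way that the candidates keyword argument is described as doing.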

Format Conversion

While not implemented in the main FileFormats package itself, FileFormats provides hooks for other packages to implement extra behaviour such as format conversion. The fileformats-extras package implements a number of converters between standard file-format types, e.g. archive types to/from generic files/directories, which, if installed, can be called using the convert() method.

from fileformats.application import Zip
from fileformats.generic import Directory

zip_file = Zip.convert(Directory("/path/to/a/directory"))
extracted = Directory.convert(zip_file)
copied = extracted.copy_to("/path/to/output")

The converters are implemented in the Pydra dataflow framework, and can be linked into wider Pydra workflows by creating a converter task

import pydra
from pydra.tasks.mypackage import MyTask
from fileformats.application import Json, Yaml

wf = pydra.Workflow(name="a_workflow", input_spec=["in_json"])
wf.add(
    Yaml.get_converter(Json, name="json2yaml", in_file=wf.lzin.in_json)
)
wf.add(
    MyTask(
        name="my_task",
        in_file=wf.json2yaml.lzout.out_file,
    )
)
...

Alternatively, the conversion can be executed outside of a Pydra workflow with

from fileformats.application import Json, Yaml

json_file = Json("/path/to/file.json")
yaml_file = Yaml.convert(json_file)

License

This work is licensed under a Creative Commons Attribution 4.0 International License

fileformats's People

Contributors

tclose

fileformats's Issues

Type checking

Given the focus on typing in Pydra 0.23, it would be nice to get it to type-check cleanly, which would be helped by the types of fileformats being available.

I don't consider this blocking, just opening an issue for the sake of having the plan on the record.

Adding Authorship

@effigies, I was wondering whether you would like to be considered an author of fileformats or not (see AUTHORS file). FileSet.copy uses the logic that you outlined when we were discussing this in the context of Pydra.

I have also just incorporated the CIFS/mount logic that was in Pydra into fileformats as I have found that I need it when using fileformats outside of Pydra.

No worries if you'd rather not put your name to it, but just thought I would offer/mention it.
