dmqb-grp1-2018's Introduction

Valifor

Valifor is a small, easily extensible command-line tool for validating different NGS file formats. The following file formats are currently supported:

  • Fasta
  • Fastq

Motivation

In any data-driven workflow, data quality is critical. Formats that can be edited by hand are especially prone to corruption, producing files that no longer conform to the format. If a corrupted file enters the workflow undetected, the whole system can break and may have to be reset. Even if it does not break, the final result will be unusable and all the work will have to be repeated. The second case can be even worse if the mistake is not found quickly: all further work and conclusions that build on the corrupted result may also be unusable.

To guard against this, there are validators for individual formats that check whether a file conforms to all of the format's rules. For example, FastaValidator and FastqValidator each validate their respective format. There are also programs, such as bamUtil, that not only validate certain formats but also run analyses on them.

Yet there are no tools that concentrate on validating a broad range of formats, so that a single program could validate all of the different files going into a workflow. Instead, multiple tools are currently needed, and each of them has to be maintained.

To address this issue, we started creating valifor, a command-line application for validating different NGS file formats. It is currently a small start, but it is easily extensible, and a needed format can be added quickly.

The valifor architecture is shown in the following graph:

[Figure: valifor architecture diagram]

The basic structure follows the factory pattern.

Following the flow of information, the command-line interface receives one or more paths to files or directories to validate and, optionally, the format to check against. This input starts the validation process in the main logic. There, the factory is used to get the correct validator for the requested format from the child classes of AbsValidator. AbsValidator is the base class for all implemented validators, and each child class has to implement the interface defined in the parent. The validator returned by the factory then validates the files and reports whether each file is valid and, if not, where the format is broken.
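The flow described above can be illustrated with a minimal sketch. It is not valifor's actual code: only the names AbsValidator and get_validator come from this README, while the validate method, its return value, the FastaDnaValidator class, and the allowed alphabet are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class AbsValidator(ABC):
    """Stand-in for valifor's base class; the real interface may differ."""

    @abstractmethod
    def validate(self, lines):
        """Return (is_valid, message). Hypothetical signature."""


class FastaDnaValidator(AbsValidator):
    """Hypothetical child class that checks sequence characters only."""

    ALLOWED = set("ACGTN")  # assumed alphabet, for illustration only

    def validate(self, lines):
        for line_no, line in enumerate(lines, start=1):
            if line.startswith(">"):
                continue  # header line, nothing to check in this sketch
            for col, char in enumerate(line.strip(), start=1):
                if char.upper() not in self.ALLOWED:
                    return False, (
                        f"Character [{char}] not allowed in sequence. "
                        f"At line: {line_no}:{col}"
                    )
        return True, ""


def get_validator(name):
    """Factory: map a format name to an instance of the matching validator."""
    validators = {"fasta-dna": FastaDnaValidator}
    return validators[name]()


# The main logic asks the factory for a validator and then uses it.
validator = get_validator("fasta-dna")
valid, message = validator.validate([">seq1", "ACGT", "ACGTNACGTO"])
print("valid" if valid else f"failed: {message}")
```

The caller never names a concrete validator class. It only passes a format name to the factory, which is what keeps the main logic unchanged when new formats are added.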

You can install valifor from the source files found on GitHub.

After you have downloaded the directory, you can call:

$ pip install path_to_the_directory

to install it on your system.

Once valifor is installed, you can get an overview of how to use it by adding the --help option on the command line. It shows an overview and explanation of valifor:

$ valifor --help
Usage: valifor [OPTIONS] [PATHS]...

Welcome to the format validator Valifor:

Valifor is a easily extensible validator for different formats. To get
started you can call "valifor --help" to get this message again and more
information to the options.

To use valifor you can call:

       "valifor [path_to_file]" to check its format based on its
       file-ending.

       "valifor [path_to_file] --format [format]" to check its format
       based on the given format.

Options:
  -f, --format [fasta-dna|fasta-aa|fastq]
                                  Type of format to be tested for all given files
  -h, --help                      Show this message and exit.

You can also test whole directories by giving the path to the directory:

$ valifor [path_to_dir]

or test multiple files and/or directories:

$ valifor [path_to_dir] [path_to_file]

An example use case:

To validate a fasta file with valifor, call:

$ valifor example.fasta --format fasta-dna

Valifor then prints the result on the command line. For a valid file, it prints the format tested and the file name:

$ valid: example.fasta - fasta-dna

For a corrupted file, it additionally prints the reason, if one can be determined:

$ failed: example.fasta - fasta-dna: Character [O] not allowed in sequence. At line: 3:10

If the file or directory doesn't exist, it prints a warning with the full path:

Given path does not exist: /path/to/fileOrDir
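The handling of files, directories, and missing paths described above could look roughly like the following sketch. The function name collect_files is hypothetical, and whether valifor descends into subdirectories is not stated in this README; the recursive search here is an assumption.

```python
import tempfile
from pathlib import Path


def collect_files(paths):
    """Expand files and directories into a file list plus warnings.

    Hypothetical helper, not part of valifor's documented API.
    """
    files, warnings = [], []
    for raw in paths:
        path = Path(raw).resolve()
        if path.is_file():
            files.append(path)
        elif path.is_dir():
            # Assumption: directories are searched recursively.
            files.extend(sorted(p for p in path.rglob("*") if p.is_file()))
        else:
            warnings.append(f"Given path does not exist: {path}")
    return files, warnings


# Small demonstration with a temporary directory and one missing path.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.fasta").write_text(">seq1\nACGT\n")
    files, warnings = collect_files([tmp, Path(tmp) / "missing.fastq"])

for warning in warnings:
    print(warning)
```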

To add a new format, you have to write a corresponding validator. The new plug-in must consist of a class that inherits from AbsValidator and overrides all of its functions, following the description in the documentation of AbsValidator. Additionally, there should be a unittest class with test files.
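As a rough sketch of what such a plug-in and its test class might look like, the example below defines a validator for a made-up format in which every line must contain exactly two tab-separated fields. The format, the class names, and the validate signature are all hypothetical; the methods that actually have to be overridden are the ones described in the AbsValidator documentation.

```python
import unittest
from abc import ABC, abstractmethod


class AbsValidator(ABC):
    """Stand-in for valifor's base class; the real interface may differ."""

    @abstractmethod
    def validate(self, lines):
        """Return (is_valid, message). Hypothetical signature."""


class TwoColumnValidator(AbsValidator):
    """Validator for a made-up format: two tab-separated fields per line."""

    def validate(self, lines):
        for line_no, line in enumerate(lines, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                return False, (
                    f"Expected 2 fields, found {len(fields)}. "
                    f"At line: {line_no}"
                )
        return True, ""


class TestTwoColumnValidator(unittest.TestCase):
    """Unit tests; a real plug-in would read from test files instead."""

    def test_valid_lines(self):
        result = TwoColumnValidator().validate(["a\tb", "c\td"])
        self.assertEqual(result, (True, ""))

    def test_broken_line(self):
        valid, message = TwoColumnValidator().validate(["a\tb", "c"])
        self.assertFalse(valid)
        self.assertIn("line: 2", message)
```

Saved in a test module, the test class can be run with python -m unittest.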

With the new validator class, most of the work is done; the only thing left is to integrate it into the project. For this, you only need to add it to the functions of the validator factory, which is also in the validator module.

You will find 4 functions:

  • available_formats():
    • add the new name which the option in the CLI should have.
  • get_format_from_ending(file_ending):
    • add the conversion from the file-ending to the name of the option.
  • get_validator(name):
    • add the new class as a return for the new option.
  • get_uncertain_endings():
    • if the file-ending is not enough to be completely certain about the exact format also add it here.

With that, you have integrated a new format into the project.
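The four integration points might look as follows once a new format has been registered. The function names come from the list above, but their bodies are a guess: the lookup tables, the stub classes, the format name "myformat", the ending "myf", and the choice of "fasta" as an uncertain ending are assumptions for illustration, and the real factory may be structured differently.

```python
class AbsValidator:
    """Stand-in for valifor's base class."""


class FastaDnaValidator(AbsValidator):
    pass


class FastaAaValidator(AbsValidator):
    pass


class FastqValidator(AbsValidator):
    pass


class MyFormatValidator(AbsValidator):
    """Hypothetical validator for the newly added format."""


# Hypothetical lookup tables; the last entry of each is the new format.
_VALIDATORS = {
    "fasta-dna": FastaDnaValidator,
    "fasta-aa": FastaAaValidator,
    "fastq": FastqValidator,
    "myformat": MyFormatValidator,
}
_ENDINGS = {"fasta": "fasta-dna", "fastq": "fastq", "myf": "myformat"}


def available_formats():
    """Names offered by the --format option of the CLI."""
    return list(_VALIDATORS)


def get_format_from_ending(file_ending):
    """Convert a file ending to a format name, or None if unknown."""
    return _ENDINGS.get(file_ending.lstrip(".").lower())


def get_validator(name):
    """Return a validator instance for the given format name."""
    return _VALIDATORS[name]()


def get_uncertain_endings():
    """Endings that do not identify the exact format on their own."""
    return ["fasta"]  # assumed: could be fasta-dna or fasta-aa


print(available_formats())
```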

Contributors

najiaahmadi, sven1103, ott-alexander
