tmaklin / unmulti

Split a multifasta file into many fasta files containing a single sequence, or extract a list of sequences/contigs. Supports compressed input and output files.

License: MIT License

unmulti - extract individual sequences from a fasta file

unmulti is a tool for splitting a fasta file containing several sequences into many files containing one sequence each, or for extracting a list of named sequences from the same file. The tool supports compressed input in .gz, .bz2, or .xz format, and compressed output in .gz format.

Installation

Prebuilt binaries

Prebuilt binaries for generic linux_x86-64 are available from the releases page.

Compiling from source

Requirements

  • C++11 compliant compiler.
  • CMake v2.8.2 or newer.
  • git.
  • zlib.
  • Internet access.

How-to

Clone the repository, enter the directory and run

mkdir build
cd build
cmake ..
make

This will download the necessary dependencies and compile the unmulti executable in build/bin/.

Usage

unmulti -f <input multifasta> -o <output directory (default: working directory)>

Example

Split a multifasta

Running unmulti on an input file in.fasta with the following contents

>seq_1
AAACGT
>seq_2
GGGTAC

will produce files 0.fasta and 1.fasta in the output directory with contents

>seq_1
AAACGT

and

>seq_2
GGGTAC
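
The splitting shown above can be sketched in a few lines of Python. This is only an illustration of the idea, not unmulti's actual C++ implementation; the function name split_multifasta and its seq_start parameter are hypothetical, and the sketch assumes a record starts at every line whose first character is the header character.

```python
def split_multifasta(lines, seq_start=">"):
    """Group lines into records; each record starts at a header line."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(seq_start):
            records.append([line])      # a header line opens a new record
        elif records:
            records[-1].append(line)    # other lines join the current record
    return records


example = [">seq_1", "AAACGT", ">seq_2", "GGGTAC"]
for number, record in enumerate(split_multifasta(example)):
    # Record i corresponds to the file "<i>.fasta" in the example above.
    print(f"{number}.fasta:", record)
```

Numbering the records from zero with enumerate mirrors the 0.fasta and 1.fasta names in the example.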

Extract specific sequence(s)

Running unmulti -f in.fasta --extract seq_2 on the example input above will extract only the sequence starting at >seq_2. Multiple sequences can be supplied by delimiting them with ,. Running unmulti -f in.fasta --extract seq_2,seq_1 will extract both sequences from the example input.
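
As a rough illustration of extraction by name, the Python sketch below keeps only the records whose header matches one of the comma-delimited names. It is not unmulti's code: extract_records is a hypothetical helper, and it assumes the sequence name is everything after the header character, which may differ from how unmulti matches names.

```python
def extract_records(lines, wanted, seq_start=">"):
    """Return {name: [lines]} for records whose name appears in `wanted`."""
    names = set(wanted.split(","))      # e.g. "seq_2,seq_1"
    found = {}
    current = None                      # name of the record being kept, if any
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(seq_start):
            name = line[len(seq_start):]  # assumption: name is the full header
            current = name if name in names else None
            if current is not None:
                found[current] = [line]
        elif current is not None:
            found[current].append(line)
    return found


example = [">seq_1", "AAACGT", ">seq_2", "GGGTAC"]
print(extract_records(example, "seq_2"))
```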

Other options

The input file can be supplied compressed in zlib, libbz2, or liblzma format, depending on which of these libraries were available on the machine where unmulti was compiled. Adding the --compress toggle will compress the output files using zlib.
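
For readers who want to handle the same three input formats in their own scripts, the Python standard library offers a matching opener for each. The sketch below picks one by filename extension; that choice is an assumption made for illustration, and unmulti may detect the format in another way. The names OPENERS and open_text are hypothetical.

```python
import bz2
import gzip
import lzma

# Map of filename suffix to the standard-library opener for that format.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path):
    """Open a plain or compressed text file for reading, chosen by suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt")   # "rt" decodes the stream as text
    return open(path, "r")
```

A call such as open_text("in.fasta.gz") then yields lines in the same way as an uncompressed file would.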

Adding the -t number_to_sequence.tsv argument will write a table linking the output filenames to their sequence names into the file named by the argument. In the example above, running unmulti -f in.fasta -t number_to_sequence.tsv would produce the following file

0	seq_1
1	seq_2
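
The table is tab-separated, with the output file number in the first column and the sequence name in the second. A minimal Python sketch that produces the same layout is given below; write_table is a hypothetical helper and is not part of unmulti.

```python
import io


def write_table(names, out):
    """Write one 'number<TAB>name' row per sequence, numbering from zero."""
    for number, name in enumerate(names):
        out.write(f"{number}\t{name}\n")


buffer = io.StringIO()
write_table(["seq_1", "seq_2"], buffer)
print(buffer.getvalue(), end="")
```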

If your sequences begin with a character other than '>', the --seq-start option can be used to change the character. For example, running unmulti -f in.fasta --seq-start @ would make unmulti compatible with a file in the following format

@read_1
CGCCTAC
+
GGFGGCD
@read_2
TGAGCCA
+
FFGFG=G
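
The effect of changing the start character can be illustrated with the short Python sketch below. It assumes that a record start is recognised purely by the first character of a line, which is how the option is described here; record_names is a hypothetical helper and not unmulti's code.

```python
def record_names(lines, seq_start):
    """Return the names of all lines that begin with the start character."""
    return [line[len(seq_start):] for line in lines if line.startswith(seq_start)]


fastq = ["@read_1", "CGCCTAC", "+", "GGFGGCD",
         "@read_2", "TGAGCCA", "+", "FFGFG=G"]
print(record_names(fastq, "@"))

# Under this first-character rule, a quality line that happens to begin
# with '@' is indistinguishable from a header line.
tricky = ["@read_1", "ACGT", "+", "@FFF"]
print(record_names(tricky, "@"))
```

Because '@' is also a valid quality character in this format, a sketch like this one would miscount records whenever a quality line starts with it.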

Accepted flags/parameters

unmulti accepts the following flags/parameters:

-f            Input multifasta.
-o            Output directory (default: working directory)
-t            Write a table linking the output filenames to sequence names into the file given as the argument.
--compress    Compress the output files with zlib (default: false)
--extract     Extract only the named sequence(s). Multiple sequences should be delimited by ','.
--seq-start   Sequence begin character (default: '>')

License

The source code from this project is subject to the terms of the MIT license. A copy of the MIT license is supplied with the project, or can be obtained at https://opensource.org/licenses/MIT.

unmulti's Issues

`terminate called after throwing an instance of 'std::out_of_range'` when input file has an empty line at the end

If the input multifasta contains empty lines, unmulti crashes with the error message:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::at: __n (which is 0) >= this->size() (which is 0)
Aborted (core dumped)

The intended behaviour is that empty lines are treated as part of whatever sequence is currently being processed.

Tested example input:

>seq_1
AAACGT
>seq_2
GGGTAC
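
The error message indicates that the character at position 0 of an empty string was accessed. The Python sketch below illustrates the intended behaviour described in this issue: an empty line is never tested by index and is simply appended to the current record. It is an illustration only, not the project's C++ fix; is_header and split_keep_empty are hypothetical names.

```python
def is_header(line, seq_start=">"):
    # startswith() returns False for an empty string, whereas line[0]
    # would raise IndexError, the Python analogue of the reported crash.
    return line.startswith(seq_start)


def split_keep_empty(lines, seq_start=">"):
    """Split into records, keeping empty lines with the current record."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if is_header(line, seq_start):
            records.append([line])
        elif records:
            records[-1].append(line)    # includes empty lines
    return records


with_trailing_empty = [">seq_1", "AAACGT", ">seq_2", "GGGTAC", ""]
print(split_keep_empty(with_trailing_empty))
```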
