Fasta

Perl scripts dealing with fasta files.

========================================================

fasta_Cut_based-on-Mo.pl

perl fasta_Cut_based-on-Mo.pl <FastaFile> <File size in Mo>

This script splits a fasta file (typically a genome) into pieces of about X Mo (megabytes; file size roughly tracks the total length of the sequences).

<FastaFile>       --> (STRING)  Input file in fasta format (typically, a genome)
<File size in Mo> --> (INTEGER) To set X, the size of the output files in Mo

NB: - if the cut results in more than 100 files, replace "%02d" with "%03d" at line 39 of the script.
    - if a sequence is longer than the set size in Mo, the file containing it will be bigger than X, since the script never cuts a sequence.
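
The splitting rule described above (fill a piece up to the target size, never cut a sequence) can be sketched as follows. This is a minimal Python illustration, not the Perl script's code: the function names, the `piece.%02d.fa` naming pattern and the exact point at which a new piece is started are assumptions.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence_text) records from an open fasta handle."""
    header, chunks = None, []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line.rstrip("\n"), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def group_by_size(records, max_bytes):
    """Group records into pieces of about max_bytes, never cutting a sequence."""
    pieces, current, size = [], [], 0
    for header, seq in records:
        rec_size = len(header) + 1 + len(seq)
        # Assumption: start a new piece when this record would push the
        # current one past the target. An oversized record still goes in
        # whole, so its piece ends up bigger than max_bytes.
        if current and size + rec_size > max_bytes:
            pieces.append(current)
            current, size = [], 0
        current.append((header, seq))
        size += rec_size
    if current:
        pieces.append(current)
    return pieces

demo = io.StringIO(">a\nACGT\n>b\nACGTACGTACGT\n>c\nAC\n")
pieces = group_by_size(read_fasta(demo), max_bytes=10)
# Hypothetical output names; "%02d" gives two-digit numbering.
names = ["piece.%02d.fa" % i for i in range(len(pieces))]
```

With a 10-byte target, record `b` is larger than the target on its own, so it gets a piece to itself rather than being cut.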

========================================================

fasta_extract_random.pl [v1.2]
fasta_extract_random_pieces.pl [v1.1]

PURPOSE:
These scripts extract random sequences (fasta_extract_random.pl)
or random subsequences (fasta_extract_random_pieces.pl).

perl fasta_extract_random.pl -i <in.fa> [-n <X>] [-p <X>] [-d] [-u] [-c] [-m <X>] [-nom] [-v] [-h|help]
perl fasta_extract_random_pieces.pl -i <in.fa> -l <min,max> [-n <X>] [-p <X>] [-o <out.fa>] [-a <X>] [-u] [-s] [-b] [-v] [-h|help]

Check their usage (use -h) for details of the options. Briefly:
-n or -p to set the number of sequences to extract
-u to write sequences in uppercase
-a to set the allowed overlap (0% to 100%)
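
As a rough illustration of the two kinds of extraction, here is a minimal Python sketch. It is not the scripts' Perl code: the function names are hypothetical, and overlap control (-a) is left out.

```python
import random

def pick_random_sequences(records, n, rng):
    """Return n records chosen at random, without replacement."""
    return rng.sample(records, min(n, len(records)))

def pick_random_piece(seq, min_len, max_len, rng, upper=False):
    """Return one random subsequence with a length in [min_len, max_len]."""
    max_len = min(max_len, len(seq))
    if max_len < min_len:
        return None  # sequence too short for the requested range
    length = rng.randint(min_len, max_len)
    start = rng.randint(0, len(seq) - length)
    piece = seq[start:start + length]
    return piece.upper() if upper else piece

rng = random.Random(42)  # fixed seed so the demo is repeatable
records = [("seq1", "acgtacgtacgt"), ("seq2", "ttttcccc"), ("seq3", "gg")]
chosen = pick_random_sequences(records, 2, rng)
piece = pick_random_piece("acgtacgtacgt", 3, 5, rng, upper=True)
```

Here `min_len` and `max_len` play the role of the `-l <min,max>` range, and `upper=True` mirrors -u.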

========================================================

fasta_FetchSeqs.pl [v1.0]

WHAT IT DOES
This script extracts fasta sequences from a file:
  - matching an ID (from the command line, or from another fasta file or a file containing a list of IDs using -file)
  - containing a word in the ID, in the description (-desc), or in both (-both)
  - the complement of that (meaning, extract when it does not match), option -inv (inverse match)

Note that for a given fasta header:
   >ID description
   The ID is anything before the first space; the description is anything after it (spaces included)
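
The ID/description rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated here; the function name and the example header are made up.

```python
def split_header(header):
    """Split '>ID description' into (ID, description) at the first space."""
    text = header.rstrip("\n").lstrip(">")
    seq_id, _, description = text.partition(" ")
    return seq_id, description

seq_id, description = split_header(
    ">ERV1_LTR endogenous retrovirus, long terminal repeat"
)
```

A header with no space has an empty description.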

Usage:
	perl fasta_FetchSeqs.pl -in <fa> -m <X> [-file] [-out <X>] [-fq] [-grep] [-desc] [-both] [-regex] [-inv] [-noc] [-chlog] [-v] [-h]

Examples:
   To extract all sequences containing ERV or LTR in IDs only:
	  perl fasta_FetchSeqs.pl -in fastafile.fa -m ERV,LTR -regex -v
   To extract all sequences that don't have the word "virus" in the description or in the ID
	  perl fasta_FetchSeqs.pl -in fastafile.fa -m virus -both -inv -v
   To extract all sequences that have their ID listed in a file
	  perl fasta_FetchSeqs.pl -in fastafile.fa -m list.txt -file -v
   To extract all sequences that have their full header listed in a file
	  perl fasta_FetchSeqs.pl -in fastafile.fa -m list.txt -file -both -v
	
MANDATORY:	
-in     => (STRING) input fasta file
-m      => (STRING) provide (i) a word or a list of words, or (ii) a path to a file
                    (i) in command line: you can set several words using , (comma) as a separator.
                        For example: -m ERV,LTR
                        Note that there can't be spaces in the command line, or they have to be escaped with \
                    (ii) a file: it can be a fasta/fastq file, or simply a file with a list of IDs (one column)
                        If the ">" or @ is kept with the ID, then all lines need to have it (unless -grep)
                        Headers can contain:
                         - fasta/fastq IDs only (no spaces) [default: search is done against IDs only]
                         - full fasta headers (use -both to match both, otherwise only ID is looked at)
                         - descriptions only (spaces allowed) if -desc is set
                        Note that you need to use the -file flag

OPTIONAL:
-file   => (BOOL)   choose this if -m corresponds to a file
-out    => (STRING) to set the name of the output file (default = input.extract.fa)
-fq     => (BOOL)   if input file is in fastq format; output will also be fastq
-grep   => (BOOL)   Choose this with -fq to use grep instead of using BioSeq
                    But this is even slower on large files.
                    Only relevant if -fq is set as well, because the sequences
                    will be extracted using grep -A 3 for each word set with -m
                    (extracting the line that matches + 3 lines after the match)
                    Also, this makes these options irrelevant:
                    -desc, -both, -regex, -inv, -noc
-desc   => (BOOL)   to look for a match in the description and not in the ID
-both   => (BOOL)   to look into both the ID and the description
-regex  => (BOOL)   to match when the word is contained, instead of requiring an exact match
                    Special characters in names or descriptions will be an issue;
                    the only ones that are taken care of are: | / . [ ] 
-inv    => (BOOL)   to extract what DOES NOT match
-noc    => (BOOL)   to ignore case in matching  
-chlog  => (BOOL)   print updates
-v      => (BOOL)   verbose mode, make the script talk to you
-v      => (BOOL)   print version if only option
-h|help => (BOOL)   print this help
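
How the matching options combine can be sketched as below. This is a hedged Python illustration, not the script's Perl logic: the function name is hypothetical, every special character is escaped (the script only handles | / . [ ]), and how the script treats -desc or -both without -regex is an assumption.

```python
import re

def header_matches(header, words, desc=False, both=False,
                   regex=False, inv=False, noc=False):
    """Return True if the record with this header should be extracted."""
    text = header.lstrip(">")
    seq_id, _, description = text.partition(" ")
    if both:
        target = text          # ID and description together
    elif desc:
        target = description   # description only
    else:
        target = seq_id        # default: ID only
    flags = re.IGNORECASE if noc else 0
    hit = False
    for word in words:
        pattern = re.escape(word)
        if regex:
            found = re.search(pattern, target, flags)     # word is contained
        else:
            found = re.fullmatch(pattern, target, flags)  # exact match
        if found:
            hit = True
            break
    # -inv flips the decision: extract what does NOT match.
    return hit != inv

headers = [
    ">ERV1 endogenous retrovirus",
    ">LTR5_Hs long terminal repeat",
    ">gene7 virus-like protein",
    ">gene8 kinase",
]
erv_or_ltr = [h for h in headers if header_matches(h, ["ERV", "LTR"], regex=True)]
no_virus = [h for h in headers
            if header_matches(h, ["virus"], both=True, regex=True, inv=True)]
```

In this sketch, `erv_or_ltr` corresponds to `-m ERV,LTR -regex`, and `no_virus` to a contains-style `-m virus -both -inv` search.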

========================================================

fasta_keep-unique.pl [v2.0]

WHAT IT DOES
This script will filter out non-unique sequences (based on sequences, not names)
The first occurrence of a sequence will be kept, so the order of input files matters
There will be 2 output files per input file:
  - sequences that are unique when all files are considered
  - removed sequences
Use -cat to get concatenated files
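
The "first occurrence wins" rule across several input files can be sketched as follows. This is a minimal Python illustration with hypothetical names, assuming sequences are compared exactly as written (whether the script normalizes case is not stated here).

```python
def keep_unique(files):
    """files: list of (name, [(header, sequence), ...]) in input order.

    Returns {name: (unique_records, removed_records)}.
    """
    seen = set()  # sequences already kept, shared across all files
    result = {}
    for name, records in files:
        unique, removed = [], []
        for header, seq in records:
            if seq in seen:
                removed.append((header, seq))
            else:
                seen.add(seq)
                unique.append((header, seq))
        result[name] = (unique, removed)
    return result

files = [
    ("a.fa", [("s1", "ACGT"), ("s2", "GGCC"), ("s3", "ACGT")]),
    ("b.fa", [("t1", "GGCC"), ("t2", "TTAA")]),
]
result = keep_unique(files)
```

Because `a.fa` comes first, its copy of `GGCC` is kept and the one in `b.fa` is removed; swapping the input order would swap that outcome.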


perl <scriptname.pl> -i <in.fa> [-all] [-v] [-h|help]
 
MANDATORY	
-i <X>   => (STRING) fasta file. If several, separate with ,
                     Typically: -i inputfile1,inputfile2,inputfileN

OPTIONAL
-cat     => (BOOL)   To concatenate all unique sequences as well as all removed sequences 
                     (-> get 2 output files for the run)
-out     => (STRING) To rename the output names when -cat is chosen
                     default = name of the first file in -i is used
-rm      => (BOOL)   To remove single files after they are concatenated
-v       => (BOOL)   verbose mode, make the script talk to you / version if only option
-h|help  => (BOOL)   this usage

========================================================

fasta_split_blast_parse_P.pl [v1.1]

WHAT IT DOES
This script will blast a fasta file against the fasta file set with -db (set the blast type with -type)
To allow threading, the input fasta file is split into one file per sequence
Outputs (standard ones with alignments in the files) can be parsed into table format if -parse is chosen,
with optional filtering using -s, -e, -id (and/or): a hit will be kept if at least one of the conditions is met
Additionally, like the tabular output of blasts, the top X hits can be extracted (independently of filtering)
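
The filtering rule (keep a hit if at least one condition is met, plus the top X hits regardless) can be sketched as below. This is a Python illustration under stated assumptions, not the script's code: the hit dictionaries and function name are hypothetical, hits are assumed to arrive in blast output order, and a lower e-value is assumed to be the one that passes the -e filter.

```python
def select_hits(hits, min_score=None, max_evalue=None,
                min_identity=None, top=None):
    """Keep hits that meet at least one set condition, plus the top hits."""
    def passes(hit):
        checks = []
        if min_score is not None:
            checks.append(hit["score"] >= min_score)
        if max_evalue is not None:
            checks.append(hit["evalue"] <= max_evalue)
        if min_identity is not None:
            checks.append(hit["identity"] >= min_identity)
        # No filter set: keep everything. Otherwise one met condition is enough.
        return any(checks) if checks else True

    kept = []
    for rank, hit in enumerate(hits):
        in_top = top is not None and rank < top
        if in_top or passes(hit):
            kept.append(hit)
    return kept

hits = [
    {"id": "h1", "score": 50, "evalue": 1e-5, "identity": 60.0},
    {"id": "h2", "score": 300, "evalue": 1e-80, "identity": 70.0},
    {"id": "h3", "score": 40, "evalue": 1e-3, "identity": 95.0},
    {"id": "h4", "score": 30, "evalue": 0.1, "identity": 50.0},
]
kept = select_hits(hits, min_score=200, min_identity=90.0, top=1)
kept_ids = [h["id"] for h in kept]
```

In this example `h1` survives only because it is the top hit, `h2` passes on score, `h3` passes on identity, and `h4` meets no condition.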


perl <fasta_split_blast_parse_P.pl> -in <in.fa> -db <db.fa> -type <blast_type> [-blast <path/bin>] 
                                   [-dbtype <db_type>] [-eval <evalue>] [-parse] [-s <score>] [-e <evalue>] 
                                   [-id <%id>] [-top <X>] [-cat] [-cpu <number>] [-v]

MANDATORY ARGUMENT:	
-in     => (STRING) input fasta file
-db     => (STRING) fasta file that will be used as the db to blast against
                    writing access needed (for makeblastdb)
-type   => (STRING) the type of blast, to choose among the usual blasts (blastn, blastp, tblastn, tblastx...)

  
OPTIONAL ARGUMENTS
-blast  => (STRING) To override default blast path
                    Default = /home/software/ncbi-blast-2.2.29+/bin
-dbtype => (STRING) molecule_type for makeblastdb (-dbtype nucl or -dbtype prot)
                    default = prot      
-eval   => (STRING) (or FLOAT) evalue used as threshold during the blast. 
                    default = 10e-50
-parse  => (BOOL)   to parse the blast outputs (if not chosen script will end when all blasts are done)
                    If no hits, there won't be a parsed file
-s      => (INT)    when -parse is chosen: filter out hits with a bit score < X  
-e      => (STRING) when -parse is chosen: filter out hits with an evalue > X
-id     => (FLOAT)  when -parse is chosen: filter out hits with a % identity < X 
-top    => (INT)    DESPITE the filters, print anyway parsed data for the top X hits
-cat    => (BOOL)   when -parse is chosen: concatenate the parsed outputs in one file
-cpus   => (INT)    number of cpus that will be used (number of threads started)
                    default = 1 (i.e. no threading)
-chlog  => (BOOL)   print updates
-v      => (BOOL)   verbose mode, make the script talk to you
-v      => (BOOL)   print version if only option
-h|help => (BOOL)   print this help

Contributors: 4ureliek
