Giter Site home page Giter Site logo

chasewnelson / ebt Goto Github PK

View Code? Open in Web Editor NEW
11.0 4.0 4.0 102 KB

Evolutionary Bioinformatics Toolkit (EBT)

License: GNU General Public License v3.0

Perl 95.82% Python 4.18%
perl vcf vcf-files fasta fasta-sequences sequence sequence-alignment gtf gff dna-sequences

ebt's Introduction

Evolutionary Bioinformatics Toolkit

A smorgasbord of scripts providing computational help for the evolutionary analysis of sequence data.

Contents

  • Before You Begin
  • Citation
  • Troubleshooting
  • Scripts
    • align_codon2aa.pl. You want to impose an amino acid alignment on the underlying nucleotide sequence.
    • aligned_fasta_group_diffs.pl. You have two or more groups of sequences, aligned to one another but in separate FASTA files, and want to identify the sites at which the groups exhibit differences.
    • aligned_fasta2site_nt_freqs.pl. You want to tabulate the number (and proportion) of each nucleotide at each variable position across an aligned nucleotide FASTA file.
    • calculate_p_distance.pl. You want to calculate a p-distance between two nucleotide sequences.
    • count_k-mers.pl. You want to tally all k-mers for an aligned sequence and its genes.
    • determine_consensus_IUPAC_seqs.pl. You want to determine the consensus and/or IUPAC sequence(s) for an aligned nucleotide FASTA file.
    • extract_codon_from_ANN_VCFs.pl. You want to pull codon variants out of annotated VCF files (i.e., files that have been annotated using the snpeff -formatEff option), so that reference and variant codons can be compared.
    • extract_fasta_by_sites.pl. You want to create separate FASTA files for segments (e.g., each of the genes) in an aligned sequence file.
    • gb2gtf.pl. You want to create a GTF file from a GenBank file, compatible with SNPGenie input.
    • generate_seqs_from_VCF.py. You want to generate a FASTA file of sequences with variants randomly interspersed at the appropriate frequencies from a VCF file.
    • get_random_integers.pl. You want to get a certain number of random integer values in a range.
    • get_unique_values.pl. You have a table with a column containing multiple rows with the same value(s) (i.e., duplicates), and you want to extract just the unique ones (i.e., remove duplicate values) and generate summary statistics for the duplicates.
    • gff2gtf.pl. You want to convert a GFF file to a simpler GTF file, compatible with SNPGenie input.
    • haplotypes_provided_sites.pl. You want to determine all haplotypes in a given FASTA alignment for a specific combination of sites.
    • LinkGe_all_site_pairs.pl. You have an aligned BAM file and a list of sites with their SNPs, and want to use Gabriel Starrett's LinkGe program to check for linkage of SNPs in the same reads.
    • remove_seqs_with_stops.pl. You have a FASTA file where each sequence is an in-frame coding sequence, and want to remove all sequences containing mid-sequence STOP codons.
    • reverse_complement.pl. You want to obtain the reverse complement of a sequence supplied at the command line.
    • split_fasta.pl. You have a FASTA file, and want to create an individual FASTA file for each sequence inside.
    • store_fasta_by_ID.pl. You want to create a new FASTA file containing only a certain subset of another FASTA.
    • vcf2revcom.pl. You want to convert a VCF SNP report file, along with its accompanying FASTA file and GTF file, to the reverse complement strand.

Before You Begin

I provide these scripts for anyone to use freely for help in performing routine bioinformatics tasks with nucleotide sequence and related data. Please just cite this GitHub page. The scripts are meant for use on Unix/Mac machines; no support is offered for Windows. Readers of code, be warned: very little effort has been made to 'clean up' my coding comments and alternative operations. Pseudogenes abound!

How to Use

Most scripts are designed to take (unnamed) arguments for simple data file manipulations, in the following format:

    EBT_script.pl <argument1> <argument2>

where <argument1> is ommitted and replaced with the desired input value. Some more complicated scripts will contain named arguments in Perl's long form, in the following format:

    EBT_script.pl --argument-name=<value>

where <value> is ommitted and replaced with the desired input value.

Citation

When using this software, please refer to and cite:

Nelson CW, Evolutionary Bioinformatics Toolkit, https://github.com/chasewnelson/EBT

Troubleshooting

If a script isn't working, try working through the following checklist:

  • Remember that you must open the Terminal and navigate to the directory which contains the script before you're able to call it. More advanced users may wish to place the script(s) in a directory included in your computer's PATH so that it can be called from any directory.

  • Did you place the script in the same directory as your data? If not, you could move the script to the data-containing directory, or vice versa. Another option is to provide the full path of the input files to the script, e.g., /Users/cwnelson88/Desktop/my_project/my_data.fasta

  • These scripts assume that your computer's copy of Perl is located at /usr/bin/perl. You can check whether this is the case by opening the Terminal and typing which perl at the command line. If this does not return /usr/bin/perl, then (1) copy the path it provides; (2) open the script; (3) replace #! /usr/bin/perl at the top of the script with your own computer's path.

  • Scripts must be made executable before use. You can check whether this is the case my opening the Terminal, navigating to the directory containing the script, and typing ls -l, which will return something like the following:

      -rwxr-xr-x@  1  name  staff  155  Sep 25  2013  script.pl
    

    If the string of letters at the beginning of the line containing your script (here, it's -rwxr-xr-x) does not contain an 'x', it means it is not yet executable. You can add executable status by typing chmod +x <script.pl>, where <script.pl> is replaced with the script's name.

Scripts

  • align_codon2aa.pl. You want to impose an amino acid alignment on the underlying nucleotide sequence. At the command line, call this script with two arguments: (1) aligned amino acid sequence; and (2) nucleotide sequence. The codon-aligned nucleotide sequence will the be printed to the screen. For example:

      align_codon2aa.pl <aligned amino acid sequence> <nucleotide sequence>
      align_codon2aa.pl MSAARKRTL-L ATGTCGGCGGCTCGTAAGCGCACCTTGTTG
    
  • aligned_fasta_group_diffs.pl. You have two or more groups of sequences, aligned to one another but in separate FASTA files, and want to identify the sites at which the groups exhibit differences. First, make sure all sequences are aligned to one another (even across groups), place each group of sequences in a separate FASTA file, and create a directory containing all the group FASTA files (but no others). Then, at the command line, call this script. A results file will be placed in the working directory describing positions at which the groups differ in their major nucleotide. For more control use the following options:

    • --min_variant_maj_nt_freq to specify a minimum frequency cutoff for assigning a group's majority (consensus) nucleotide. Must be a decimal. Default=0.9 (90%).
    • --min_site_coverage to specify a minimum number of defined sequences required for calculating a group's frequencies. Must be an integer. For example, some sites in some groups may be mostly gaps (-) or undetermined (N), in which case we may not want to consider frequency calculations reliable. Default=5.

    Here is an example using defaults:

      aligned_fasta_group_diffs.pl
    

    Here is an example using using a frequency cutoff of 90% with a minimum of 8 defined nucleotides per site for each group:

      aligned_fasta_group_diffs.pl --min_variant_maj_nt_freq=.9 --min_site_coverage=8
    
  • aligned_fasta2site_nt_freqs.pl. You want to tabulate the number (and proportion) of each nucleotide at each variable position across an aligned nucleotide FASTA file. At the command line, call this script with one argument, an aligned FASTA file. One TAB-delimited results file will be placed in the working directory (*_site_summary.txt) and brief summary statistics will be printed to the Terminal. Here's an example:

      aligned_fasta2site_nt_freqs.pl <aligned_seqs.fasta>
    
  • calculate_p_distance.pl. You want to calculate a p-distance between two nucleotide sequences. At the command line, provide this script with two arguments: two FASTA (.fa or .fasta) files, each containing one sequence, which are aligned to each other. This script will exclude positions which are gaps (-) or undetermined (N) in both sequences and return a p-distance. Here's an example:

      calculate_p_distance.pl <aligned_seq_1.fasta> <aligned_seq_2.fasta>
    
  • count_k-mers.pl. You want to tally all k-mers for a group of aligned sequence and separately for its genes. At the command line, provide this script with three arguments: one ALIGNED FASTA (.fa or .fasta) file; one GTF file with gene annotations; and the max k-mer length. This script will exclude k-mers which contain gaps (-) or undetermined nucleotides (N). Here's an example:

      count_k-mers.pl <aligned.fasta> <gene_annotations.gtf> <max_k>
    
  • determine_consensus_IUPAC_seqs.pl. You want to determine the consensus and/or IUPAC sequence(s) for an aligned nucleotide FASTA file. At the command line, call this script with one argument, an aligned FASTA file. Two results files will be placed in the working directory, one with a consensus sequence (*_consensus.fa) and one with an IUPAC sequence (*_IUPAC.fa). Brief summary statistics will be printed to the Terminal. Here's an example:

      determine_consensus_IUPAC_seqs.pl <aligned_seqs.fasta>
    
  • extract_codon_from_ANN_VCFs.pl. You want to pull codon variants out of annotated VCF files (i.e., files that have been annotated using the snpeff -formatEff option), so that reference and variant codons can be compared. At the command line, call this script in a directory containing one or more files ending in .ann.vcf. It will extract the information in the "CODON: Codon change" field (e.g. ggT/ggG) for each record and output to the files ref_codons_seq.txt, alt_codons_seq.txt, and codon_metadata.txt. Here's an example:

      extract_codon_from_ANN_VCFs.pl
    
  • extract_fasta_by_sites.pl. You want to create separate FASTA files for segments (e.g., each of the genes) in an aligned sequence file. At the command line, provide this script with two arguments: (1) one FASTA file containing one or more aligned sequences; and (2) a GTF file containing the CDS products to extract. It will extract the sites for each coding element (CDS) annotation in the Gene Transfer Format, and output one FASTA file for each. Here's an example:

      extract_fasta_by_sites.pl <multiple_aligned_seqs.fasta> <gene_coordinates_to_extract.gtf>
    
  • gb2gtf.pl. (Helpful for preparing SNPGenie input!) DEPRECATED! Use a tool like gffread.

  • generate_seqs_from_VCF.py. You want to generate a FASTA file of sequences with variants randomly interspersed at the appropriate frequencies from a VCF file. Two or more sets of sequences thus generated can then be used as input for the current SNPGenie between-group script. At the command line, provide this script with three arguments: (1) FASTA reference (1 sequence); (2) a SNP report in VCF format that includes the 'AF' INFO entry; and (3) number of sequences to generate. Note that this Python script requires the following libraries: SeqIO, os, random, re, and sys. For example, for a reference sequence in a file named reference.fasta, a VCF-format SNP report in a file named variants.vcf, and to create 1,000 sequences, one would run the following:

      generate_seqs_from_VCF.py reference.fasta variants.vcf 1000
    
  • get_random_integers.pl. You want to get a certain number of random integer values in a range. At the command line, provide this script with three arguments: (1) number of random integer values wanted; (2) bottom of range (inclusive); (3) top of range (inclusive). Returns the specified number of random integers in the desired range. Here's an example to return 10 values in the range [3,999]:

      get_random_integers.pl 10 3 999
    
  • get_unique_values.pl. You have a table with a column containing multiple rows with the same value(s) (i.e., duplicates), and you want to extract just the unique ones (i.e., remove duplicate values) and generate summary statistics for the duplicates. At the command line, provide this script with two arguments: (1) the name of the input tab-delimited file; and (2) the column number to analyze. The script outputs the unique values in the column, i.e., duplicate values eliminated, including the header. Also creates a summary statistics file in the working directory, by the name *_unique_values_summary.txt. Here's an example:

      get_unique_values.pl <my_table.txt> <column_number>
    
  • gff2gtf.pl. (Helpful for preparing SNPGenie input!) DEPRECATED! Use a tool like gffread.

  • haplotypes_provided_sites.pl. You want to determine all haplotypes in a given FASTA alignment for a specific set of sites provided. At the command line, provide this script with two named arguments: (1) --fasta_file, the name of the aligned FASTA file; and (2) --sites_file, the name of a file containing a list of the sites of interst, separated by any whitespace (TABS, spaces, or newlines). The script outputs the unique haplotypes and their counts. Run as follows:

      haplotypes_provided_sites.pl --sites_file=<sites_of_interest>.txt --fasta_file=<aligned_seqs>.fa
    
  • LinkGe_all_site_pairs.pl. You have an aligned BAM file and a list of sites with their SNPs, and want to use Gabriel Starrett's LinkGe program to check for linkage of SNPs in the same reads. For example, you may wish to test if high-frequency variant nucleotides are present in linked haplotypes in a deep-sequenced viral sample from a single host. At the command line, provide this script with the named arguments:

    • --snp_file: REQUIRED. One TAB-delimited, CLC-style "SNP report" file with, at minimum, the following three columns (with headers):
      • "Reference Position": the site number of the single nucleotide variant within the aligned BAM reads.
      • "Reference": the reference (majority) nucleotide.
      • "Allele": the allele (variant or mutant) nucleotide.
    • --BAM_file: REQUIRED. Standard format; must be aligned; reads of any length.
    • --max_dist: the maximum distance at which to search for pairs of sites covered by the same read. DEFAULT: 1000.

    This script will then determine all pairs of sites provided in the snp_file which fall within max_dist of one another; run LinkGe for all identified pairs; and output read linkage information for all individual pairs (LinkGe_* file generated in working directory) and total summary data for linkage of reference and variant nucleotides (printed to screen; WT refers to the reference nucleotide, Mut refers to the alternative nucleotide). Here's an example:

      LinkGe_all_site_pairs.pl --snp_file=<my_SNPs>.txt --BAM_file=<aligned_BAM>.bam --max_dist=500
    

    Note that this script requires you to install LinkGe, and to add it to your computer's PATH. Three dependencies are also required: Bio::DB::Sam, BioPerl, and Samtools version 0.1.15. See the LinkGe page for details.

  • remove_seqs_with_stops.pl. You have a FASTA file where each sequence is an in-frame coding sequence, and want to remove all sequences containing mid-sequence STOP codons. At the command line, provide this script with one argument: a FASTA (.fa or .fasta) file containing multiple coding sequences that all begin at the first position of the first codon (they need not be aligned to each other). This script will create a new FASTA file in the working directory, containing just those sequences lacking an in-frame mid-sequence STOP codon. Here's an example:

      remove_seqs_with_stops.pl <my_fasta_file.fasta>
    
  • reverse_complement.pl. You want to obtain the reverse complement of a sequence supplied at the command line. Simply provide this script with one argument, the sequence:

      reverse_complement.pl <nucleotide_sequence>
    
  • split_fasta.pl. You want to create an individual FASTA file for each sequence in another FASTA file. At the command line, provide this script with one argument: a FASTA (.fa or .fasta) file containing multiple sequences. This script will create multiple files in the working directory, each containing one of the sequences. Here's an example:

      split_fasta.pl <my_fasta_file.fasta>
    
  • store_fasta_by_ID.pl. You want to create a new FASTA file containing only a certain subset of another FASTA. At the command line, provide this script with two arguments: (1) a FASTA (.fa or .fasta) file containing multiple sequences; and (2) a .txt file containing one column with the FASTA IDs you want in your new FASTA file โ€” there should be NO HEADER. This script will create a new FASTA file in the working directory, containing the sequences whose headers begin with the IDs in argument (2). Redirect (>) the output to create a file. Here's an example:

      store_fasta_by_ID.pl <all_seqs.fasta> <wanted_seqs_headers.txt> > <just_wanted_seqs.fasta>
    
  • vcf2revcom.pl. (Helpful for preparing SNPGenie input!) You want to convert a VCF SNP report file, along with its accompanying FASTA file and GTF file, to the reverse complement strand. This script automates the creation of such reverse complement files, compatible with SNPGenie input, but the input must already be compatible with SNPGenie specifications. Note that the resulting SNP report will not be a VCF file, but rather a CLC Genomics Workbench format file. At the command line, provide this script with three arguments, in the following order:

    1. A '+' strand FASTA (.fa or .fasta) file containing the reference sequence (ALL UPPERCASE) against which SNPs were called;
    2. A '+' strand GTF file containing both '+' and 'โ€“' strand products from the '+' strand point of view; and
    3. A '+' strand SNP report in VCF format.

    This script will then create a '-' strand (reverse complement) version of each file in the working directory, with "_revcom" concatenated to the original file name. Here's an example:

      vcf2revcom.pl <my_reference_sequence.fasta> <my_cds_file.gtf> <my_snp_report.vcf>
    

ebt's People

Contributors

singing-scientist avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

ebt's Issues

Unexpected output from vcf2revcom.pl

Dear SNPGenie/CHASeq authors:
I was in CHASeq for I would like to use CHASeq to deal with gtf, fasta and vcf for reverse strand genes/products to SNPGenie. But I got this message when I run the perl script. I was wondering if there is anything I did wrong, or I hope it would be fixed soon.

seq length is 51304566

Two products have the same starting position, causing an error.
Please contact script author for a revision.

1). gtf was downloaded from an old version of Ensembl annotation (ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens/Homo_sapiens.GRCh37.73.gtf.gz), and filtered for CDS records whose full length are multiples of 3. These were done by the following commands (e.g. for chr22):

## to get the ids for only those ORFs/CDSs with both start codon and stop codon. These ORFs/CDSs should also be annotated as protein coding genes. The final ids should be gene(ENSG???????????)_transcript(ENSG???????????)
zcat Homo_sapiens.GRCh37.73.gtf.gz | awk '{if($3=="start_codon"){a[$12]=0;print $12"\tstart";} if($3=="stop_codon"){b[$12]=0;print $12"\tstop";} } ' | sort -u | cut -f1| uniq -d | sed 's/"//g;s/;//;' > tmp
zcat Homo_sapiens.GRCh37.73.gtf.gz | awk 'BEGIN{OFS="\t"} FILENAME==ARGV[1]{aa[$1]=0;} $1==22 && $2=="protein_coding" && $3=="CDS"{split($10,a,"\"");split($12,b,"\"");if(b[2] in aa){print $1,$2,$3,$4,$5,$6,$7,$8,"gene_id \""a[2]"_"b[2]"\"";}}' tmp - > chr22.gtf
## to filter for ORFs/CDSs whose length are multiples of 3.
sort -k9 -k4 chr22.gtf | awk ' {if($10==a){len=$5-$4+1+len;}else{if(len % 3 != 0 ){print a;}a=$10;len=$5-$4+1; } } END{if(len % 3 != 0 ){print a;}}' > tmp
grep -v -f tmp chr22.gtf > chr22.filter.gtf

2). The vcf was downloaded from OneKGenome phase 3(http://www.internationalgenome.org/data and more specificly ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).
3). I got fasta file from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz).
4). I run vcf2revcom.pl
vcf2revcom.pl chr22.fa chr22.filter.gtf ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf
P.S. the order of input is not vcf, fasta and then gtf as written in the git page (https://github.com/chasewnelson/CHASeq). But instead the order above.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.