chasewnelson / ebt

Evolutionary Bioinformatics Toolkit (EBT)

License: GNU General Public License v3.0

Perl 95.82% Python 4.18%
perl vcf vcf-files fasta fasta-sequences sequence sequence-alignment gtf gff dna-sequences

ebt's Issues

Unexpected output from vcf2revcom.pl

Dear SNPGenie/CHASeq authors:
I would like to use CHASeq to convert GTF, FASTA and VCF files for reverse-strand genes/products into input for SNPGenie. However, I got the message below when I ran the Perl script. I am wondering whether I did something wrong; if not, I hope it can be fixed soon.

seq length is 51304566

Two products have the same starting position, causing an error.
Please contact script author for a revision.

1). The GTF was downloaded from an old release of the Ensembl annotation (ftp://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens/Homo_sapiens.GRCh37.73.gtf.gz) and filtered for CDS records whose full lengths are multiples of 3. This was done with the following commands (e.g. for chr22):

## Get the IDs of only those ORFs/CDSs that have both a start codon and a stop codon. These ORFs/CDSs should also be annotated as protein-coding genes. The final IDs have the form gene(ENSG???????????)_transcript(ENST???????????)
zcat Homo_sapiens.GRCh37.73.gtf.gz | awk '{if($3=="start_codon"){a[$12]=0;print $12"\tstart";} if($3=="stop_codon"){b[$12]=0;print $12"\tstop";} } ' | sort -u | cut -f1| uniq -d | sed 's/"//g;s/;//;' > tmp
zcat Homo_sapiens.GRCh37.73.gtf.gz | awk 'BEGIN{OFS="\t"} FILENAME==ARGV[1]{aa[$1]=0;} $1==22 && $2=="protein_coding" && $3=="CDS"{split($10,a,"\"");split($12,b,"\"");if(b[2] in aa){print $1,$2,$3,$4,$5,$6,$7,$8,"gene_id \""a[2]"_"b[2]"\"";}}' tmp - > chr22.gtf
## Filter for ORFs/CDSs whose lengths are multiples of 3.
sort -k9 -k4 chr22.gtf | awk ' {if($10==a){len=$5-$4+1+len;}else{if(len % 3 != 0 ){print a;}a=$10;len=$5-$4+1; } } END{if(len % 3 != 0 ){print a;}}' > tmp
grep -v -f tmp chr22.gtf > chr22.filter.gtf

2). The VCF was downloaded from the 1000 Genomes Project phase 3 (http://www.internationalgenome.org/data, and more specifically ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).
3). The FASTA file was obtained from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz).
4). I ran vcf2revcom.pl:
vcf2revcom.pl chr22.fa chr22.filter.gtf ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf
P.S. The order of the input files is not VCF, FASTA and then GTF, as written on the git page (https://github.com/chasewnelson/CHASeq), but rather the order used above: FASTA, GTF, VCF.
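
A possible diagnostic, in case it helps narrow this down: the filtered GTF keeps every transcript of a gene, so several CDS records may begin at the same coordinate. The sketch below lists start coordinates (GTF column 4) that occur in more than one record on the same sequence. It assumes that the "same starting position" in the error message refers to that column, which is not confirmed by the script itself. The function name dup_starts is hypothetical and not part of EBT/CHASeq.

```shell
# Hypothetical helper (not part of EBT/CHASeq): report start coordinates
# (GTF column 4) that occur in more than one record on the same sequence.
# Assumes a tab-separated GTF, such as chr22.filter.gtf from step 1.
# Output columns: seqname, start, number of records sharing that start.
dup_starts() {
  awk -F'\t' '
    { count[$1 "\t" $4]++ }
    END {
      for (key in count)
        if (count[key] > 1) print key "\t" count[key]
    }
  ' "$1" | sort -k1,1 -k2,2n
}

# Example call on the filtered file from the steps above:
# dup_starts chr22.filter.gtf | head
```

If this prints any lines, the GTF contains records that share a start coordinate, for example identical CDS segments from different transcripts of one gene.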
