solvebio / veppy
Variant Effect Prediction for Python
License: MIT License
SolveBio documents using the following assembly:
https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13
The primary assembled chromosomes can be downloaded as follows:
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr1.fa.gz | gzip -d > chr1.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr2.fa.gz | gzip -d > chr2.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr3.fa.gz | gzip -d > chr3.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr4.fa.gz | gzip -d > chr4.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr5.fa.gz | gzip -d > chr5.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr6.fa.gz | gzip -d > chr6.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr7.fa.gz | gzip -d > chr7.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr8.fa.gz | gzip -d > chr8.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr9.fa.gz | gzip -d > chr9.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr10.fa.gz | gzip -d > chr10.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr11.fa.gz | gzip -d > chr11.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr12.fa.gz | gzip -d > chr12.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr13.fa.gz | gzip -d > chr13.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr14.fa.gz | gzip -d > chr14.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr15.fa.gz | gzip -d > chr15.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr16.fa.gz | gzip -d > chr16.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr17.fa.gz | gzip -d > chr17.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr18.fa.gz | gzip -d > chr18.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr19.fa.gz | gzip -d > chr19.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr20.fa.gz | gzip -d > chr20.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr21.fa.gz | gzip -d > chr21.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr22.fa.gz | gzip -d > chr22.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chrX.fa.gz | gzip -d > chrX.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chrY.fa.gz | gzip -d > chrY.fa
cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa > data/Homo_sapiens.GRCh37.75.genbank.fa
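The download-and-concatenate steps above could also be scripted. The following is a minimal Python sketch, not part of veppy: the names `BASE_URL`, `CHROMOSOMES`, `chromosome_urls` and `build_fasta` are hypothetical, the base URL is copied from the curl commands, and whether that archive path still resolves has not been checked here.

```python
import gzip
import shutil
import urllib.request

# Base URL copied from the curl commands above.
BASE_URL = (
    "http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/"
    "vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/"
    "assembled_chromosomes/FASTA/"
)

# Same order as the cat command: 1..22, then X and Y.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def chromosome_urls():
    """Return the 24 per-chromosome download URLs in concatenation order."""
    return [BASE_URL + "chr" + name + ".fa.gz" for name in CHROMOSOMES]


def build_fasta(out_path):
    """Download each chromosome, decompress it, and append it to out_path."""
    with open(out_path, "wb") as out:
        for url in chromosome_urls():
            with urllib.request.urlopen(url) as response:
                # Decompress the stream on the fly, as `gzip -d` does.
                with gzip.GzipFile(fileobj=response) as unzipped:
                    shutil.copyfileobj(unzipped, out)
```

Calling `build_fasta("data/Homo_sapiens.GRCh37.75.genbank.fa")` would write the same output path as the cat command, provided the `data` directory already exists.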
The resulting Homo_sapiens.GRCh37.75.genbank.fa file does not match the original FASTA MD5:
Homo_sapiens.GRCh37.75.genbank.fa
3e560b5d4f595d2249143e4c98f890de
The mismatch is due, at least in part, to SolveBio's copy of Homo_sapiens.GRCh37.75.genbank.fa having already been "flattened in place" by pyfasta, which removes the headers.
It would be useful to have the original, non-flattened FASTA file if it's available.
Otherwise, for now I am going to change the stored MD5 check, in a "build source data" branch, to match the FASTA file I generate from current public sources. Later I will verify by other means that the sequence content is the same in the FASTA data we can currently download from public sources and in SolveBio's copy.
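One possible way to compare sequence content "by other means" is to hash only the sequence characters, so that headers and line breaks no longer affect the digest. This is a sketch under the assumption that flattening removes header lines and line breaks and nothing else; `sequence_md5` is a hypothetical helper, not a veppy or pyfasta function.

```python
import hashlib


def sequence_md5(path, chunk_size=1 << 20):
    """MD5 of the sequence characters only, skipping headers and line breaks.

    The file is read in chunks so that a flattened file stored as one very
    long line does not have to fit in memory.
    """
    digest = hashlib.md5()
    in_header = False
    at_line_start = True
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for piece in chunk.splitlines(keepends=True):
                if at_line_start:
                    # A FASTA header line starts with '>'.
                    in_header = piece.startswith(b">")
                if not in_header:
                    digest.update(piece.rstrip(b"\r\n"))
                # A piece without a terminator continues in the next chunk.
                at_line_start = piece.endswith((b"\n", b"\r"))
    return digest.hexdigest()
```

If the two digests agree, the files hold the same sequence characters in the same order. A difference in case (for example soft-masked lowercase bases) would still show up as a mismatch.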
The instructions at https://guides.github.com/activities/citable-code/ will help make veppy citable before publication.
Add automatic tests with Travis when pushing to any branch.
The file gencode.v19.annotation.gtf.gz from SolveBio S3 turns out not to be identical to:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
We need to determine whether this file was post-processed after download, or whether a matching version is available for download from the GENCODE project.
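Before concluding that the annotation itself differs, it may be worth comparing the decompressed content: a gzip header stores a modification time and optionally a file name, so two archives can differ byte for byte while holding identical data. The sketch below uses hypothetical helper names (`gzip_content_md5`, `first_difference`) and only the standard library.

```python
import gzip
import hashlib
from itertools import zip_longest


def gzip_content_md5(path, chunk_size=1 << 20):
    """MD5 of the decompressed bytes, ignoring gzip header metadata."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return (line_number, line_a, line_b) for the first differing line.

    Returns None when the decompressed files are identical. If one file is
    shorter, the missing line is reported as None.
    """
    with gzip.open(path_a, "rb") as a, gzip.open(path_b, "rb") as b:
        for number, (line_a, line_b) in enumerate(zip_longest(a, b), start=1):
            if line_a != line_b:
                return number, line_a, line_b
    return None
```

If the content digests match, the two files differ only in compression metadata. Otherwise the first differing line gives a starting point for telling post-processing apart from a different release.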
@dandanxu could you take a look at this and see what's still a TODO?
TODO
$ git grep TODO
README.md:Tests are based on chr1 versions of input data (TODO upload this somewhere)
veppy/file/fasta.py: # TODO: This is slow (3s per file). Move somewhere else.
veppy/file/fasta.py: # TODO: eventually, this should be dynamic and file-specific
veppy/file/feature.py: # TODO: uggh...linear scan...this should be a dict lookup...
veppy/file/features.py: # TODO: This is no longer used...
veppy/file/features.py: # TODO: This is no longer used...
veppy/file/features.py: # TODO: calling sorted every time we request a feature list
veppy/file/features.py: # TODO:
veppy/file/features.py: # TODO: optimize this?
veppy/file/features.py: # TODO: fix this to use 'stop_codon' if it exists...
veppy/file/features.py: # TODO: we may not need this, depending on how we choose to handle
veppy/file/features.py: # TODO: set 'intron' attribute on leaves in '_intron_tree'
veppy/io/readers.py:# TODO this is only used by the test cases
veppy/io/readers.py: # TODO: make encoding a config option
veppy/io/readers.py: # TODO: proper CSV parsing...
veppy/io/readers.py:# TODO merge this with other SolveBio VCFReaders
veppy/io/readers.py: # TODO: eventually remove and renamed 'self.__reader' to 'self._reader'
veppy/vep/sequence.py: # TODO: how do we ever get here??
veppy/vep/sequence.py: # TODO: this is broken! what about exon deletion in which the deletion
veppy/vep/sequence.py: # TODO: need to bound upstream_seq and downstream_seq
veppy/vep/sequence.py: # # TODO: better documentation
veppy/vep/sequence.py: # TODO: should we update ref_seq, alt_seq, etc. to remove
veppy/vep/vep.py: # TODO: handle this case more reasonably.
veppy/vep/vep.py: # TODO: check that transcript is cached in feature file.
FIXME
$ git grep FIXME
veppy/tests/test_vep.py: # FIXME: Local does not have chrX
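To see which files carry the most open items, the `git grep` output above can be tallied per file. This is a small illustrative sketch: `count_markers` is a hypothetical helper, and it assumes each output line has the form `path:text`, as in the listings above.

```python
from collections import Counter


def count_markers(grep_lines):
    """Count matches per file from `git grep` output lines ('path:text')."""
    counts = Counter()
    for line in grep_lines:
        # Split on the first colon only; the matched text may contain more.
        path, sep, _text = line.partition(":")
        if sep:
            counts[path.strip()] += 1
    return counts


sample = [
    "veppy/file/fasta.py: # TODO: This is slow (3s per file). Move somewhere else.",
    "veppy/file/fasta.py: # TODO: eventually, this should be dynamic and file-specific",
    "veppy/vep/vep.py: # TODO: handle this case more reasonably.",
]
for path, total in count_markers(sample).most_common():
    print(path, total)
```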
A full Python 3.x audit was never done on this. There are likely many simple fixes needed for compatibility, and a few big ones :)
See the failed builds: https://travis-ci.org/solvebio/veppy/builds/165497165
This has recently been added to the Sequence Ontology (SO): The-Sequence-Ontology/SO-Ontologies#334.