veppy's People

Contributors

dandanxu, davecap, jsh2134, surajnarwade

veppy's Issues

Confirm correct FASTA data source is used

SolveBio documents using:
https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13

The primary assembled chromosomes can be downloaded like so:

curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr1.fa.gz | gzip -d > chr1.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr2.fa.gz | gzip -d > chr2.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr3.fa.gz | gzip -d > chr3.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr4.fa.gz | gzip -d > chr4.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr5.fa.gz | gzip -d > chr5.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr6.fa.gz | gzip -d > chr6.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr7.fa.gz | gzip -d > chr7.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr8.fa.gz | gzip -d > chr8.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr9.fa.gz | gzip -d > chr9.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr10.fa.gz | gzip -d > chr10.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr11.fa.gz | gzip -d > chr11.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr12.fa.gz | gzip -d > chr12.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr13.fa.gz | gzip -d > chr13.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr14.fa.gz | gzip -d > chr14.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr15.fa.gz | gzip -d > chr15.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr16.fa.gz | gzip -d > chr16.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr17.fa.gz | gzip -d > chr17.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr18.fa.gz | gzip -d > chr18.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr19.fa.gz | gzip -d > chr19.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr20.fa.gz | gzip -d > chr20.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr21.fa.gz | gzip -d > chr21.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chr22.fa.gz | gzip -d > chr22.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chrX.fa.gz | gzip -d > chrX.fa
curl http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/chrY.fa.gz | gzip -d > chrY.fa
cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa > data/Homo_sapiens.GRCh37.75.genbank.fa
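
The per-chromosome commands above can also be written as a loop. The following is a minimal sketch in Python, assuming the same NCBI base URL is still reachable; `chromosome_urls` and `build_combined_fasta` are hypothetical helper names used for illustration and are not part of veppy.

```python
import gzip
import shutil
import urllib.request

# Base URL copied from the curl commands above.
BASE_URL = (
    "http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/"
    "vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/"
    "assembled_chromosomes/FASTA/"
)

# Same order as the cat command: chr1..chr22, then X and Y.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def chromosome_urls(base_url=BASE_URL):
    """Return the per-chromosome .fa.gz URLs in concatenation order."""
    return [f"{base_url}chr{name}.fa.gz" for name in CHROMOSOMES]


def build_combined_fasta(out_path, urls=None):
    """Download each gzipped FASTA, decompress it, and append it to out_path."""
    urls = chromosome_urls() if urls is None else urls
    with open(out_path, "wb") as out:
        for url in urls:
            with urllib.request.urlopen(url) as response:
                # Decompress the stream on the fly instead of saving the .gz.
                with gzip.GzipFile(fileobj=response, mode="rb") as unzipped:
                    shutil.copyfileobj(unzipped, out)
```

Calling `build_combined_fasta("data/Homo_sapiens.GRCh37.75.genbank.fa")` would write the same concatenation as the `cat` command, provided the `data` directory already exists.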

The resulting Homo_sapiens.GRCh37.75.genbank.fa file does not match the original FASTA MD5:

Homo_sapiens.GRCh37.75.genbank.fa
3e560b5d4f595d2249143e4c98f890de

The mismatch is due at least in part to Homo_sapiens.GRCh37.75.genbank.fa from SolveBio having already been "flattened in place" by pyfasta, removing headers.

It would be useful to have the original "non-flattened" FASTA file if it's available.

Otherwise, for now I am going to modify the stored MD5 check, in a "build source data" branch, to match the FASTA file I generate from current public sources. Later I will compare by other means to make sure the sequence content is the same between the FASTA data we can currently download from public sources and SolveBio's.
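
One way to compare sequence content independently of headers and line wrapping is to hash only the sequence characters. The sketch below is an assumption about how such a check could look, not the method used in veppy: `sequence_md5` is a hypothetical helper, it assumes the flattened file contains nothing but sequence characters (and possibly line breaks), that `>` appears only at the start of header lines, and it upper-cases bases so soft-masking differences are ignored.

```python
import hashlib


def sequence_md5(path, chunk_size=1 << 20):
    """MD5 of the sequence characters only.

    Header lines (starting with '>') and line breaks are skipped, so a
    headered, line-wrapped FASTA and a flattened copy of the same
    sequence produce the same digest. The file is read in chunks so a
    flattened file without line breaks does not have to fit in memory.
    """
    digest = hashlib.md5()
    in_header = False
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            pos = 0
            while pos < len(chunk):
                if in_header:
                    end = chunk.find(b"\n", pos)
                    if end == -1:
                        # Header continues into the next chunk.
                        pos = len(chunk)
                    else:
                        in_header = False
                        pos = end + 1
                else:
                    start = chunk.find(b">", pos)
                    end = len(chunk) if start == -1 else start
                    piece = chunk[pos:end]
                    piece = piece.replace(b"\n", b"").replace(b"\r", b"")
                    digest.update(piece.upper())
                    if start != -1:
                        in_header = True
                    pos = end
    return digest.hexdigest()
```

If `sequence_md5` returns the same value for the locally built file and for SolveBio's flattened file, the two hold the same bases in the same order under the assumptions above; a mismatch would still need a per-chromosome comparison to localize.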

Locate public URL for gencode gtf file

The file

gencode.v19.annotation.gtf.gz

from SolveBio's S3 bucket turns out not to be identical to:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

We need to determine whether this file was post-processed after download, or whether a matching version is available for download from the GENCODE project.
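
Two `.gz` files can differ byte for byte even when their decompressed contents are identical, because the gzip header stores a timestamp and, optionally, the original file name. A first check is therefore to compare the decompressed data and, if it differs, to find where. The following is a minimal sketch with hypothetical helper names (`decompressed_md5`, `first_difference`); it makes no claim about how the SolveBio copy was produced.

```python
import gzip
import hashlib
import itertools


def decompressed_md5(gz_path, chunk_size=1 << 20):
    """MD5 of the decompressed bytes, so gzip header differences are ignored."""
    digest = hashlib.md5()
    with gzip.open(gz_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(gz_a, gz_b):
    """Return (line_number, line_a, line_b) for the first differing line.

    Returns None when both files decompress to the same lines. A missing
    line (one file is shorter) is reported as None on that side.
    """
    with gzip.open(gz_a, "rt") as a, gzip.open(gz_b, "rt") as b:
        pairs = itertools.zip_longest(a, b)
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                return number, line_a, line_b
    return None
```

If the decompressed digests match, the files differ only in compression metadata. If they do not, the first differing line usually shows whether the change is in the header comments, the sort order, or the records themselves.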

@dandanxu could you take a look at this?

Look at Repository TODOs and FIXMEs

and see what's still a TODO

TODO

$ git grep TODO
README.md:Tests are based on chr1 versions of input data (TODO upload this somewhere)
veppy/file/fasta.py:        # TODO: This is slow (3s per file). Move somewhere else.
veppy/file/fasta.py:        # TODO: eventually, this should be dynamic and file-specific
veppy/file/feature.py:    # TODO: uggh...linear scan...this should be a dict lookup...
veppy/file/features.py:    # TODO: This is no longer used...
veppy/file/features.py:    # TODO: This is no longer used...
veppy/file/features.py:    # TODO: calling sorted every time we request a feature list
veppy/file/features.py:    # TODO:
veppy/file/features.py:        # TODO: optimize this?
veppy/file/features.py:    # TODO: fix this to use 'stop_codon' if it exists...
veppy/file/features.py:            # TODO: we may not need this, depending on how we choose to handle
veppy/file/features.py:            # TODO: set 'intron' attribute on leaves in '_intron_tree'
veppy/io/readers.py:# TODO this is only used by the test cases
veppy/io/readers.py:        # TODO: make encoding a config option
veppy/io/readers.py:    # TODO: proper CSV parsing...
veppy/io/readers.py:# TODO merge this with other SolveBio VCFReaders
veppy/io/readers.py:    # TODO: eventually remove and renamed 'self.__reader' to 'self._reader'
veppy/vep/sequence.py:        # TODO: how do we ever get here??
veppy/vep/sequence.py:        # TODO: this is broken! what about exon deletion in which the deletion
veppy/vep/sequence.py:        # TODO: need to bound upstream_seq and downstream_seq
veppy/vep/sequence.py:            #     # TODO: better documentation
veppy/vep/sequence.py:        # TODO: should we update ref_seq, alt_seq, etc. to remove
veppy/vep/vep.py:    # TODO: handle this case more reasonably.
veppy/vep/vep.py:    # TODO: check that transcript is cached in feature file.

FIXME

$ git grep FIXME
veppy/tests/test_vep.py:        # FIXME: Local does not have chrX
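
To triage the listings above, it can help to tally markers per file. This is a small sketch that parses text in the `path:line` form printed by `git grep`; `count_markers` is a hypothetical helper, and it assumes file paths contain no colon.

```python
from collections import Counter


def count_markers(grep_output):
    """Count matching lines per file in `git grep` output (path:text per line)."""
    counts = Counter()
    for line in grep_output.splitlines():
        path, sep, _ = line.partition(":")
        if sep:  # skip blank lines or anything without a path prefix
            counts[path] += 1
    return counts
```

Feeding it the saved output of `git grep TODO` and calling `.most_common()` on the result lists the files with the most open markers first.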
