standage / genhub

Explore eukaryotic genome composition and organization with iLoci

License: BSD 3-Clause "New" or "Revised" License

Languages: Makefile 0.31%, Python 99.36%, Shell 0.33%
Topics: genome, reference-genome, iloci, annotation

genhub's People

Contributors

standage, vpbrendel

genhub's Issues

Refactor `format.py`

The functions have gotten quite long, and could benefit from some decomposing.

Dmel format task fails

This is due to a gene model where the exon is labeled as a pseudogene but the gene feature itself is not. It therefore eludes the tidygff3 attempts to correct the feature types, and causes problems when downstream processing software tries to find the exon's parent RNA feature.

Polish documentation, dev environment

The pyyaml and pycurl packages have now been properly added to the setup.py file, so they are installed when you pip install genhub. The other packages in requirements.txt are for development. I need to update the docs (and my own development environment) to account for this.

Deprecate `*.simple-iloci.txt` file

This is a vestige of an old system, before we had the current terminology settled. It should be removed as soon as we can be sure it's not used anywhere.

Overlapping exons prematurely kill C. reinhardtii build

Gene GeneID:5716281 in C. reinhardtii prematurely kills an important step in the prepare task. Two issues to address. The first, and pressing, matter is to discard this gene using the typical annotfilter mechanism. The second matter, which will probably have to wait, is the fact that the build continued even though the canon-gff3 step of the grep -v | pmrna --locus | canon-gff3 pipeline failed.
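For the second matter, one option is to have the shell report a failure from any stage of a pipeline rather than only the last one. The following is a minimal sketch, assuming the build step is launched through bash; `run_pipeline` is a hypothetical helper and not part of GenHub.

```python
import subprocess


def run_pipeline(command):
    """Run a shell pipeline and raise if any stage fails.

    Hypothetical helper. With bash's `pipefail` option, the pipeline's exit
    status is non-zero whenever any stage exits non-zero, instead of
    reflecting only the final stage.
    """
    subprocess.run(['bash', '-c', 'set -o pipefail; ' + command], check=True)
```

With this in place, a failure in an early or middle stage raises `subprocess.CalledProcessError`, so the build can stop instead of continuing with incomplete output.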

nose --> py.test

The nose testing framework is no longer supported. The transition to py.test should be pretty seamless, as the two frameworks follow similar conventions for naming test functions. It'll mostly be a matter of updating dependencies, fixing the Makefile, etc.

Support for generic input

One of the benefits we claim regarding iLoci is that the software accepts a small number of standard inputs and generates a wealth of useful output. While the latter half of that claim is definitely true, we need to work on the first half.

If your genome happens to already be in RefSeq, you're in luck: all you need to do is copy one of the existing RefSeq configuration files, change a few values (species name, genome accession, etc.) and you're in business. But if you simply have a pair of Fasta and GFF3 files? You're basically relegated to running all the genhub-build.py steps by yourself, from scratch.

We need better support for generic inputs: we need to document what is expected from the Fasta and GFF3 files, and then we need to fix the genhub-build.py script to support this.

Script names

Scripts have very generic names at this point, which is fine for git clone installation but not for a system-wide/virtualenv/pip type installation. Need to select concise names that minimize collision risk.

.txt --> .tsv for some ancillary data files

Some of the plain text supporting data files in each genome's directory are in fact simply tab-separated value (TSV) files that lack a header row. There isn't really a compelling reason for these files not to have self-documenting headers, which in turn facilitate easy loading into R/Python/etc for data analysis.

  • *.ilocus.mrnas.txt (this will probably require a change to AEGeAn's pmrna program)
  • *.protein2ilocus.txt (this is internal to GenHub)
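
To illustrate the benefit, a header row lets Python's standard `csv.DictReader` label every column without extra bookkeeping. This is only a sketch: the column names `protein` and `ilocus` and the example values are assumptions, not the actual layout of these files.

```python
import csv
import io


def load_tsv(stream):
    """Read a tab-separated stream whose first row names the columns."""
    return list(csv.DictReader(stream, delimiter='\t'))


# Hypothetical contents of a *.protein2ilocus.tsv file with a header row
example = io.StringIO('protein\tilocus\nXP_0001.1\tiLocus1\n')
rows = load_tsv(example)
```

Each row is then a mapping keyed by column name, e.g. `rows[0]['ilocus']`.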

Anything I'm missing or any other comments @vpbrendel @cycoyuk?

Test whether ilens file can be ignored

If all iiLocus and ziLocus lengths are easily parsed directly from GFF3, this can prevent cluttering of the working directory with many ancillary files.
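
A quick way to test this is to pull the lengths straight from the GFF3 and compare them with the ilens file. The sketch below assumes that each iLocus feature carries an `iLocus_type` attribute; that attribute name is an assumption about the GFF3 produced upstream. It computes end - start + 1 for every matching feature, and whether that is the right length for both types would still need to be confirmed.

```python
def ilocus_lengths(gff3_lines, types=('iiLocus', 'ziLocus')):
    """Return (type, length) pairs for iLoci of the requested types.

    Assumes an `iLocus_type` attribute in column 9 (hypothetical name).
    """
    lengths = []
    for line in gff3_lines:
        if line.startswith('#') or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 9:
            continue  # not a feature line
        attrs = dict(
            pair.split('=', 1) for pair in fields[8].split(';') if '=' in pair
        )
        ilocus_type = attrs.get('iLocus_type')
        if ilocus_type in types:
            # GFF3 coordinates are 1-based and inclusive
            lengths.append((ilocus_type, int(fields[4]) - int(fields[3]) + 1))
    return lengths
```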

Recipes for green algae

  • Auxenochlorella_protothecoides
  • Chlamydomonas_reinhardtii
  • Chlorella_variabilis
  • Coccomyxa_subellipsoidea
  • Micromonas_pusilla
  • Micromonas_sp._RCC299
  • Ostreococcus_lucimarinus
  • Ostreococcus_tauri
  • Volvox_carteri

More species to integrate

BeeBase consortium data sets (10 bee genomes paper)

  • Dufourea novaeangliae
  • Eufriesea mexicana
  • Habropoda laboriosa
  • Lasioglossum albipes
  • Melipona quadrifasciata

Species only available in HymenopteraBase

  • Cardiocondyla obscurior

HymenopteraBase versions of already integrated species?

  • Apis mellifera
  • Bombus impatiens
  • Bombus terrestris
  • Nasonia vitripennis
  • Atta cephalotes
  • Acromyrmex echinatior
  • Camponotus floridanus
  • Harpegnathos saltator
  • Linepithema humile
  • Pogonomyrmex barbatus
  • Solenopsis invicta

Easily identify piLocus representatives

The *protein2ilocus.tsv file shows all proteins, not just those chosen to represent each piLocus. It would be helpful to have another mapping file with only the iLocus representatives shown.

New Amel genome

GFF3 checksums are failing, presumably due to an update of the RefSeq files. Need to investigate and take action.
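
As a starting point for the investigation, the digest of the freshly downloaded file can be recomputed and compared against the stored value. This is a sketch only: `file_digest` is a hypothetical helper, and the default algorithm (SHA-1) is an assumption, since the checksum type used for these files is not stated here.

```python
import hashlib


def file_digest(path, algorithm='sha1', chunk_size=1 << 20):
    """Compute a hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```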

Consolidate file name resolution

Currently, different scripts and modules all redundantly implement similar functionality for resolving file paths.

```python
filepath = '%s/%s/%s' % (workdir, speclabel, filename)
```

It would be more robust and easier to maintain/fix/change in the future if we used a single function for doing this.

```python
# File doesn't exist yet, no need to test file existence
outfilepath = genhub.file_path(filename, speclabel, workdir=workdir)

# Input file, check to make sure it exists
infilepath = genhub.file_path(filename, speclabel, workdir=workdir, check_exist=True)
```
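
One possible shape for such a function is sketched below. The signature follows the calls above, but the default `workdir` and the choice of `FileNotFoundError` are assumptions.

```python
import os


def file_path(filename, speclabel, workdir='.', check_exist=False):
    """Resolve the path of a data file in a species' working directory.

    Sketch only: the default workdir and the exception type are assumptions.
    """
    path = os.path.join(workdir, speclabel, filename)
    if check_exist and not os.path.isfile(path):
        raise FileNotFoundError('file "%s" not found' % path)
    return path
```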

Script paths during build

Currently, the format task and the format.sh script are calling other scripts using relative file paths, assuming the user is calling from the genhub root directory. There needs to be a better way to resolve the script paths that doesn't involve clogging up a bunch of function signatures.
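
One approach that avoids threading a path through every function signature is a single module-level lookup. The sketch below uses a hypothetical `script_path` helper: it checks an optional directory first (for a git clone, this could be the directory containing the package) and otherwise falls back to the user's `PATH`, which suits pip or virtualenv installs.

```python
import os
import shutil


def script_path(name, search_dir=None):
    """Locate a helper script by name (hypothetical helper)."""
    if search_dir is not None:
        candidate = os.path.join(search_dir, name)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    # Fall back to whatever is installed on the PATH
    found = shutil.which(name)
    if found is None:
        raise FileNotFoundError('cannot locate script "%s"' % name)
    return found
```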

Improve configuration parsing

Currently there are two available options for loading configuration files.

  • the -c/--cfg option for providing the path of a single config file
  • the --cfgdir option for providing the path of a directory, from which GenHub will attempt to load all .yml files

The --cfgdir option is fine as is, but I propose the following additions and changes for other config loading options.

  • --cfglist option for providing a file with config files (one per line)
  • --cfgpath option for providing one or more directories in which to search for config files
  • --cfgfullpath option for indicating that value(s) provided by -c/--cfg option or --cfglist option are full file paths; by default, they are treated as relative paths and GenHub searches all directories specified by --cfgpath for these files

The option labels might need tweaking, but I think the functionality supports most/all conceivable use cases with a relatively simple interface.
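
A rough `argparse` sketch of the proposed interface follows. The option names come from the proposal above; the default search path (the current directory), the single-valued `--cfg`, and the `resolve_configs` helper are assumptions. Handling of `--cfgdir` is left out since it would stay as is.

```python
import argparse
import os


def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--cfg', metavar='FILE',
                        help='a single config file')
    parser.add_argument('--cfgdir', metavar='DIR',
                        help='directory from which to load all .yml files')
    parser.add_argument('--cfglist', metavar='FILE',
                        help='file listing config files, one per line')
    parser.add_argument('--cfgpath', metavar='DIR', nargs='+', default=['.'],
                        help='directories to search for config files')
    parser.add_argument('--cfgfullpath', action='store_true',
                        help='treat --cfg/--cfglist values as full paths')
    return parser


def resolve_configs(args):
    """Return paths for files named by --cfg and --cfglist (not --cfgdir)."""
    names = [args.cfg] if args.cfg else []
    if args.cfglist:
        with open(args.cfglist) as handle:
            names.extend(line.strip() for line in handle if line.strip())
    if args.cfgfullpath:
        return names
    resolved = []
    for name in names:
        # Use the first search directory that contains the file
        for directory in args.cfgpath:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                resolved.append(candidate)
                break
        else:
            raise FileNotFoundError('config "%s" not found' % name)
    return resolved
```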

New rice recipe?

New rice entry in RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Oryza_sativa/all_assembly_versions/GCF_001433935.1_IRGSP-1.0/

NT_ and NW_ accessions in mammalian genomes correspond to patches, variants

The mouse and human genomes are very well finished, and the chromosome sequences are assigned NC_ accessions. In these genomes, NW_ and NT_ accessions do not correspond to unplaced genomic scaffolds as they do in many other species; instead, they correspond to patches or variants not (yet?) integrated into a major build release. This information is redundant and should be filtered out in preprocessing. Filtering annotations is simple, but if we don't want redundant sequences to be included in calculations, we will need to implement a new filtering mechanism for sequences.
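
A sequence-level filter could be as simple as keeping only the Fasta records whose ID begins with an allowed accession prefix. The sketch below is illustrative: `filter_fasta` is a hypothetical helper, and keeping only `NC_` records is an assumption that would need to be configurable per genome.

```python
def filter_fasta(lines, keep_prefixes=('NC_',)):
    """Return only the Fasta lines from records with an allowed ID prefix."""
    kept = []
    keep = False
    for line in lines:
        if line.startswith('>'):
            # The sequence ID is the first whitespace-delimited token
            tokens = line[1:].split()
            seqid = tokens[0] if tokens else ''
            keep = seqid.startswith(tuple(keep_prefixes))
        if keep:
            kept.append(line)
    return kept
```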

Move, rename config files

  • Move the directory containing the config files into the distribution
  • Update setup.py accordingly
  • Rename to "recipes" or something like that

Pass source to `genhub-format-gff3.py`

The script is a mess of unnecessary if/elif/else statements right now, and it would be a lot cleaner (although probably just as verbose) if it were passed the annotation source as a parameter.
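
One way to structure this is a dispatch table keyed by annotation source, sketched below. The source labels (`refseq`, `beebase`) and the placeholder formatting functions are hypothetical; the real per-source logic would replace them.

```python
def format_refseq(line):
    # Placeholder for RefSeq-specific cleanup
    return line


def format_beebase(line):
    # Placeholder for BeeBase-specific cleanup
    return line


# Map each annotation source to its formatting function
FORMATTERS = {
    'refseq': format_refseq,
    'beebase': format_beebase,
}


def format_gff3(lines, source):
    """Apply the formatter registered for the given annotation source."""
    try:
        formatter = FORMATTERS[source]
    except KeyError:
        raise ValueError('unsupported annotation source "%s"' % source)
    return [formatter(line) for line in lines]
```

Supporting a new source then means adding one function and one dictionary entry rather than another branch.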
