standage / genhub
Explore eukaryotic genome composition and organization with iLoci
License: BSD 3-Clause "New" or "Revised" License
The functions have gotten quite long, and could benefit from some decomposing.
This is due to a gene model where the exon is labeled as a pseudogene but the gene feature itself is not. It therefore eludes the tidygff3 attempts to correct the feature types, and causes problems when downstream processing software tries to find the exon's parent RNA feature.
The pyyaml and pycurl packages have now been properly added to the setup.py file, so they are installed when you pip install genhub. The other packages in requirements.txt are for development. I need to update the docs (and my own development environment) to account for this.
A lot more could potentially be handled internally by the module, relieving the burden from the user.
This is a vestige of an old system, before we had the current terminology settled. It should be removed as soon as we can be sure it's not used anywhere.
Gene GeneID:5716281 in C. reinhardtii prematurely kills an important step in the prepare task. Two issues to address. The first, and pressing, matter is to discard this gene using the typical annotfilter mechanism. The second matter, which will probably have to wait, is the fact that the build continued even though the canon-gff3 step of the grep -v | pmrna --locus | canon-gff3 pipeline failed.
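The second matter above (a build that keeps going after a failed pipeline stage) can be illustrated with a minimal Python sketch. It assumes the pipeline is launched from Python with `subprocess`; the helper name `run_pipeline` is hypothetical and is not part of GenHub. The idea is to inspect the exit status of every stage, not only the last one.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run commands as a shell-style pipeline and return the final stdout.

    Hypothetical helper: raises subprocess.CalledProcessError for the first
    stage that exits with a non-zero status, even if later stages succeed.
    """
    procs = []
    prev_stdout = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Close the parent's copy so only the child holds the read end.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)

    # Drain the last stage, then wait for every earlier stage to finish.
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()

    # Check every stage, not just the final one.
    for cmd, proc in zip(commands, procs):
        if proc.returncode != 0:
            raise subprocess.CalledProcessError(proc.returncode, cmd)
    return output
```

Each entry in `commands` is an argument list, one per stage. A caller would pass the three stages named above with whatever arguments the build actually uses (not given in this issue) and let the exception abort the build.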
The nose testing framework is no longer supported. The transition to py.test should be pretty seamless, as they support similar conventions for naming test functions. It'll mostly be a matter of updating dependencies, fixing makefile, etc.
One of the benefits we claim regarding iLoci is that the software accepts a small number of standard inputs and generates a wealth of useful output. While the latter half of that claim is definitely true, we need to work on the first half.
If your genome happens to already be in RefSeq, you're in luck: all you need to do is copy one of the existing RefSeq configuration files, change a few values (species name, genome accession, etc) and you're in business. But if you simply have a pair of Fasta and GFF3 files? You're basically relegated to running all the genhub-build.py steps by yourself, from scratch.
We need better support for generic inputs: we need to document what is expected from the Fasta and GFF3 files, and then we need to fix the genhub-build.py script to support this.
Scripts have very generic names at this point, which is fine for git clone installation but not for a system-wide/virtualenv/pip type installation. Need to select concise names that minimize collision risk.
Probably best to specify something like a --cdhit_params option. If they want to adjust any parameter, then they'll need to set all the parameters.
Putting the piLocus type in these types would make it easier to, for example, filter out only those features related to siLoci. Currently requires joins across tables.
See here for tips on a context manager: http://stackoverflow.com/questions/31661177/why-wont-python-multiprocessing-workers-die.
Some of the plain text supporting data files in each genome's directory are in fact simply tab-separated value (TSV) files that lack a header row. There isn't really a compelling reason for these files not to have self-documenting headers, which in turn facilitate easy loading into R/Python/etc for data analysis.
*.ilocus.mrnas.txt (this will probably require a change to AEGeAn's pmrna program)
*.protein2ilocus.txt (this is internal to GenHub)
Anything I'm missing or any other comments @vpbrendel @cycoyuk?
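As a rough illustration of the benefit, here is a minimal sketch that prepends a header row to a headerless TSV stream and then loads it by column name. The column names `protein_id` and `ilocus_id` are assumptions for illustration only; the actual columns of these files are not specified here.

```python
import csv
import io

# Hypothetical column names; the real file layout is not documented here.
COLUMNS = ["protein_id", "ilocus_id"]


def add_header(in_stream, out_stream, columns):
    """Copy a headerless TSV stream to out_stream with a header row first."""
    reader = csv.reader(in_stream, delimiter="\t")
    writer = csv.writer(out_stream, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in reader:
        if len(row) != len(columns):
            raise ValueError("expected %d fields, found %d: %r"
                             % (len(columns), len(row), row))
        writer.writerow(row)


def load_table(stream):
    """Load a TSV stream with a header row as a list of dict-like rows."""
    return list(csv.DictReader(stream, delimiter="\t"))
```

With a header in place, downstream code can refer to `row["ilocus_id"]` instead of a positional index, and the same file loads directly in R or pandas without the caller supplying column names.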
If all iiLocus and ziLocus lengths are easily parsed directly from GFF3, this can prevent cluttering of the working directory with many ancillary files.
This shouldn't be difficult, since they're organized very similarly.
Should be able to handle both with a single module.
Might make sense to rename the module though.
BeeBase consortium data sets (10 bee genomes paper)
Species only available in HymenopteraBase
HymenopteraBase versions of already integrated species?
The *protein2ilocus.tsv file shows all proteins, not just those chosen to represent each piLocus. It would be helpful to have another mapping file with only the iLocus representatives shown.
GFF3 checksums are failing, presumably due to an update of the RefSeq files. Need to investigate and take action.
Currently, different scripts and modules all redundantly implement similar functionality for resolving file paths.
filepath = '%s/%s/%s' % (workdir, speclabel, filename)
It would be more robust and easier to maintain/fix/change in the future if we used a single function for doing this.
# File doesn't exist yet, no need to test file existence
outfilepath = genhub.file_path(filename, speclabel, workdir=workdir)
# Input file, check to make sure it exists
infilepath = genhub.file_path(filename, speclabel, workdir=workdir, check_exist=True)
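A minimal sketch of what such a helper might look like, with the signature inferred from the two calls above. This is an assumption about the eventual design, not the actual GenHub implementation; the default `workdir` value and the choice of exception are illustrative.

```python
import os


def file_path(filename, speclabel, workdir=".", check_exist=False):
    """Build the path workdir/speclabel/filename in one place.

    If check_exist is True, raise FileNotFoundError when the file is absent.
    """
    path = os.path.join(workdir, speclabel, filename)
    if check_exist and not os.path.isfile(path):
        raise FileNotFoundError("file not found: %s" % path)
    return path
```

Using `os.path.join` instead of string formatting keeps the separator handling in one spot, and the `check_exist` flag lets input files fail early with a clear message while output paths are returned unchecked.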
Enforce by default, but provide information about override flag upon failure.
Using the gt speck command.
Currently, the format task and the format.sh script are calling other scripts using relative file paths, assuming the user is calling from the genhub root directory. There needs to be a better way to resolve the script paths that doesn't involve clogging up a bunch of function signatures.
Currently there are two available options for loading configuration files.
-c/--cfg option for providing the path of a single config file
--cfgdir option for providing the path of a directory, from which GenHub will attempt to load all .yml files
The --cfgdir option is fine as is, but I propose the following additions and changes for other config loading options.
--cfglist option for providing a file with config files (one per line)
--cfgpath option for providing one or more directories in which to search for config files
--cfgfullpath option for indicating that value(s) provided by -c/--cfg option or --cfglist option are full file paths; by default, they are treated as relative paths and GenHub searches all directories specified by --cfgpath for these files
The option labels might need tweaking, but I think the functionality supports most/all conceivable use cases with a relatively simple interface.
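The proposal could be sketched with `argparse` roughly as follows. The option names come from the proposal; everything else is an assumption for illustration: the helper names `get_parser`, `read_cfglist`, and `resolve_configs` are hypothetical, and allowing `-c/--cfg` and `--cfgpath` to be repeated is one possible reading of "value(s)" and "one or more directories".

```python
import argparse
import os


def get_parser():
    """Build a parser exposing the proposed config-loading options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--cfg", action="append", default=[],
                        metavar="FILE", help="config file (repeatable)")
    parser.add_argument("--cfgdir", metavar="DIR",
                        help="load all .yml files from this directory")
    parser.add_argument("--cfglist", metavar="FILE",
                        help="file listing config files, one per line")
    parser.add_argument("--cfgpath", action="append", default=[],
                        metavar="DIR", help="directory to search (repeatable)")
    parser.add_argument("--cfgfullpath", action="store_true",
                        help="treat --cfg/--cfglist values as full paths")
    return parser


def read_cfglist(stream):
    """Return the non-empty lines of a config list, stripped of whitespace."""
    return [line.strip() for line in stream if line.strip()]


def resolve_configs(names, search_dirs, fullpath=False):
    """Resolve config names to paths, searching search_dirs in order."""
    if fullpath:
        return list(names)
    resolved = []
    for name in names:
        for directory in search_dirs:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                resolved.append(candidate)
                break
        else:
            raise FileNotFoundError("config not found in search path: %s" % name)
    return resolved
```

Searching directories in the order given means an earlier `--cfgpath` entry shadows a later one, which is one way a user-specific config could override a bundled default.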
Plain text, one per line? Or a YAML file for easy parsing?
CC @vpbrendel @cycoyuk
Looks like $HOME/local can be cached! http://docs.travis-ci.com/user/migrating-from-legacy
New rice entry in RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Oryza_sativa/all_assembly_versions/GCF_001433935.1_IRGSP-1.0/
The mouse and human genomes are very well finished, and the chromosome sequences are assigned NC_ accessions. Unlike in many other species, NW_ and NT_ accessions here do not correspond to unplaced genomic scaffolds; they correspond to patches or variants not (yet?) integrated into a major build release. This information is redundant and should be filtered out in preprocessing. Filtering annotations is simple, but if we don't want redundant sequences to be included in calculations, this will require implementing a new filtering mechanism.
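One possible shape for such a sequence filter, as a minimal sketch: it drops Fasta records whose sequence ID starts with a given accession prefix. The function names are hypothetical, and the default prefixes only make sense for genomes like mouse and human; for species where NW_ and NT_ are legitimate scaffolds, the filter should not be applied.

```python
def parse_fasta(lines):
    """Yield (defline, sequence) tuples from an iterable of Fasta lines."""
    defline, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif line and defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def drop_by_prefix(records, prefixes=("NW_", "NT_")):
    """Yield only records whose sequence ID does not start with a prefix."""
    prefixes = tuple(prefixes)
    for defline, sequence in records:
        fields = defline.split()
        seqid = fields[0] if fields else ""
        if not seqid.startswith(prefixes):
            yield defline, sequence
```

Because both functions are generators, chromosome-scale sequences are processed one record at a time. The same prefix test could be applied to the first column of the GFF3 file so that annotations and sequences stay in sync.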
The scaffolding already appears to be there in the iloci.py module; it should simply be a matter of adding it to the genhub-build.py script's CLI.
The script is a mess of unnecessary if/elif/else statements right now, and it would be a lot cleaner (although probably just as verbose) if it was passed the annotation source as a parameter.