The genhub from standage

Refactor `format.py`

The functions have gotten quite long, and could benefit from some decomposing.

This is due to a gene model where the exon is labeled as a pseudogene but the gene feature itself is not. It therefore eludes the tidygff3 attempts to correct the feature types, and causes problems when downstream processing software tries to find the exon's parent RNA feature.

Polish documentation, dev environment

The pyyaml and pycurl packages have now been properly added to the setup.py file, so they are installed when you pip install genhub. The other packages in requirements.txt are for development. I need to update the docs (and my own development environment) to account for this.

Configuration registry is cumbersome

A lot more could potentially be handled internally by the module, relieving the burden on the user.

Deprecate `*.simple-iloci.txt` file

This is a vestige of an old system, before we had the current terminology settled. It should be removed as soon as we can be sure it's not used anywhere.

Overlapping exons prematurely kill C. reinhardtii build

Gene GeneID:5716281 in C. reinhardtii prematurely kills an important step in the prepare task. Two issues to address. The first, and pressing, matter is to discard this gene using the typical annotfilter mechanism. The second matter, which will probably have to wait, is the fact that the build continued even though the canon-gff3 step of the grep -v | pmrna --locus | canon-gff3 pipeline failed.
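On the second matter, one way to make a failed pipeline stage fatal is to run the pipeline through bash with `pipefail` set and check the exit status. This is only a sketch: the `run_pipeline` helper is hypothetical, and it assumes bash is available and that the build invokes the pipeline as a single shell command.

```python
import subprocess


def run_pipeline(command):
    """Run a shell pipeline; raise CalledProcessError if any stage fails."""
    # 'set -o pipefail' makes the pipeline's exit status reflect a failure in
    # any stage, not only the last one. check=True turns a nonzero status
    # into an exception, so the build cannot silently continue.
    subprocess.run(['bash', '-c', 'set -o pipefail; ' + command], check=True)
```

The caller can let the exception propagate to abort the build, or catch it to report which species failed before moving on.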

Update `check.py` to check for PATH dependencies

AEGeAn programs
AEGeAn scripts
GenomeTools binary
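A minimal sketch of what such a check could look like, using `shutil.which` from the standard library. The `missing_programs` helper is hypothetical, and the program names are only the ones mentioned elsewhere in these issues (`pmrna`, `canon-gff3`, `gt`); the complete list of required programs is an assumption.

```python
import shutil

# Names taken from other issues in this list; not a complete inventory.
REQUIRED_PROGRAMS = ['pmrna', 'canon-gff3', 'gt']


def missing_programs(programs=REQUIRED_PROGRAMS):
    """Return the program names that cannot be found on the PATH."""
    return [prog for prog in programs if shutil.which(prog) is None]
```

`check.py` could then print the returned list and exit with a nonzero status if it is not empty.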

nose --> py.test

The nose testing framework is no longer supported. The transition to py.test should be pretty seamless, as they support similar conventions for naming test functions. It'll mostly be a matter of updating dependencies, fixing makefile, etc.

Support for generic input

One of the benefits we claim regarding iLoci is that the software accepts a small number of standard inputs and generates a wealth of useful output. While the latter half of that claim is definitely true, we need to work on the first half.

If your genome happens to already be in RefSeq, you're in luck: all you need to do is copy one of the existing RefSeq configuration files, change a few values (species name, genome accession, etc) and you're in business. But if you simply have a pair of Fasta and GFF3 files? You're basically relegated to running all the genhub-build.py steps by yourself, from scratch.

We need better support for generic inputs: we need to document what is expected from the Fasta and GFF3 files, and then we need to fix the genhub-build.py script to support this.

Script names

Scripts have very generic names at this point, which is fine for git clone installation but not for a system-wide/virtualenv/pip type installation. Need to select concise names that minimize collision risk.

Allow user to override cd-hit defaults

Probably best to specify something like a --cdhit_params option. If they want to adjust any parameter, then they'll need to set all the parameters.
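Roughly what I have in mind, as a sketch. The default parameter string below is a placeholder rather than the values GenHub currently uses, and the `get_parser` and `cdhit_command` helpers are hypothetical. The cd-hit flags shown (`-i`, `-o`, `-c`, `-M`, `-T`) are standard cd-hit options.

```python
import argparse
import shlex

# Placeholder default; substitute whatever GenHub currently hard-codes.
DEFAULT_CDHIT_PARAMS = '-c 0.9 -M 0 -T 1'


def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--cdhit_params', metavar='STR',
                        default=DEFAULT_CDHIT_PARAMS,
                        help='parameters passed to cd-hit; replaces the '
                        'defaults entirely (default: "%(default)s")')
    return parser


def cdhit_command(infile, outfile, params):
    """Build the cd-hit argument list from a user-supplied parameter string."""
    # shlex.split handles quoting the same way a shell would.
    return ['cd-hit', '-i', infile, '-o', outfile] + shlex.split(params)
```

Because the value itself starts with a dash, the `--cdhit_params='-c 0.95 -T 4'` form (with the equals sign) is the safest way to pass it on the command line.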

piLocus type for pre-mRNA, mRNA, exon, and intron TSV files

Putting the piLocus type in these files would make it easier to, for example, select only those features related to siLoci. Currently this requires joins across tables.

Better handling of multiprocessing pool

See here for tips on a context manager: http://stackoverflow.com/questions/31661177/why-wont-python-multiprocessing-workers-die.
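A sketch of the kind of context manager I mean. The name `managed_pool` and the `pool_class` parameter are hypothetical, not existing GenHub code. The idea is to close the pool on a normal exit, terminate it on any error (including Ctrl-C), and always join so no workers are left behind.

```python
import multiprocessing
from contextlib import contextmanager


@contextmanager
def managed_pool(processes=None, pool_class=multiprocessing.Pool):
    """Yield a worker pool and make sure its workers are cleaned up."""
    pool = pool_class(processes)
    try:
        yield pool
        pool.close()      # normal exit: accept no new tasks, finish the rest
    except BaseException:
        pool.terminate()  # error or KeyboardInterrupt: stop workers now
        raise
    finally:
        pool.join()       # wait for the workers to exit in either case
```

Usage would be `with managed_pool(4) as pool: pool.map(func, items)`, which keeps the cleanup logic out of every script that needs a pool.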

.txt --> .tsv for some ancillary data files

Some of the plain text supporting data files in each genome's directory are in fact simply tab-separated value (TSV) files that lack a header row. There isn't really a compelling reason for these files not to have self-documenting headers, which in turn facilitate easy loading into R/Python/etc for data analysis.

*.ilocus.mrnas.txt (this will probably require a change to AEGeAn's pmrna program)
*.protein2ilocus.txt (this is internal to GenHub)

Anything I'm missing or any other comments @vpbrendel @cycoyuk?
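To illustrate the payoff, here is a sketch of adding a header to one of these files and then loading it by column name. The `add_header` helper and the column names `ProteinID` and `iLocusID` are made up for the example; the real column names would need to be agreed on.

```python
import csv


def add_header(infile, outfile, columns):
    """Copy a headerless tab-separated file, prepending a header row."""
    with open(infile) as instream, open(outfile, 'w') as outstream:
        outstream.write('\t'.join(columns) + '\n')
        for line in instream:
            outstream.write(line)


def load_tsv(path):
    """Load a TSV file with a header row as a list of dicts."""
    with open(path, newline='') as stream:
        return list(csv.DictReader(stream, delimiter='\t'))
```

With a header in place, the same file also loads directly with `read.table(..., header=TRUE)` in R or `pandas.read_csv(..., sep='\t')` in Python.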

Test whether ilens file can be ignored

If all iiLocus and ziLocus lengths are easily parsed directly from GFF3, this can prevent cluttering of the working directory with many ancillary files.

Recipes for green algae

Extend RefSeq module with support for GenBank

This shouldn't be difficult, since they're organized very similarly.
Should be able to handle both with a single module.
Might make sense to rename the module though.

More species to integrate

BeeBase consortium data sets (10 bee genomes paper)

Species only available in HymenopteraBase

Cardiocondyla obscurior

HymenopteraBase versions of already integrated species?

Easily identify piLocus representatives

The *protein2ilocus.tsv file shows all proteins, not just those chosen to represent each piLocus. It would be helpful to have another mapping file with only the iLocus representatives shown.

New Amel genome

GFF3 checksums are failing, presumably due to an update of the RefSeq files. Need to investigate and take action.

Consolidate file name resolution

Currently, different scripts and modules all redundantly implement similar functionality for resolving file paths.

filepath = '%s/%s/%s' % (workdir, speclabel, filename)

It would be more robust and easier to maintain/fix/change in the future if we used a single function for doing this.

# File doesn't exist yet, no need to test file existence
outfilepath = genhub.file_path(filename, speclabel, workdir=workdir)

# Input file, check to make sure it exists
infilepath = genhub.file_path(filename, speclabel, workdir=workdir, check_exist=True)
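A minimal sketch of how the proposed `file_path` function might be implemented, matching the call signature used above. The default value for `workdir` and the choice of exception are assumptions.

```python
import os


def file_path(filename, speclabel, workdir='.', check_exist=False):
    """Resolve the path of a data file for the given species label."""
    path = os.path.join(workdir, speclabel, filename)
    if check_exist and not os.path.isfile(path):
        raise IOError('file "%s" does not exist' % path)
    return path
```

Using `os.path.join` instead of string formatting also avoids doubled slashes when `workdir` ends with a path separator.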

Option for overriding integrity (shasum) check

Enforce by default, but provide information about override flag upon failure.
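Something along these lines, as a sketch. It assumes the recorded checksums are SHA-1 digests (the `shasum` default); the `check_integrity` helper and the `--relax` flag named in the error message are hypothetical.

```python
import hashlib


def check_integrity(path, expected_sha1, enforce=True):
    """Compare a file's SHA-1 digest to the expected value.

    Returns True on a match. On a mismatch, raises ValueError when enforce
    is set (the default) and returns False otherwise.
    """
    sha = hashlib.sha1()
    with open(path, 'rb') as stream:
        # Read in chunks so large genome files do not need to fit in memory.
        for chunk in iter(lambda: stream.read(65536), b''):
            sha.update(chunk)
    if sha.hexdigest() == expected_sha1:
        return True
    if enforce:
        # '--relax' is a hypothetical flag name for the override.
        raise ValueError('integrity check failed for "%s"; '
                         'use --relax to proceed anyway' % path)
    return False
```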

Create spec file for GFF3 input, integrate into workflow

Using the gt speck command.

Script paths during build

Currently, the format task and the format.sh script are calling other scripts using relative file paths, assuming the user is calling from the genhub root directory. There needs to be a better way to resolve the script paths that doesn't involve clogging up a bunch of function signatures.
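One possibility is to resolve script paths relative to the location of the calling module rather than the current working directory. The sketch below assumes a layout in which the scripts sit in a `scripts` directory next to the package directory; both that layout and the `script_path` helper are hypothetical.

```python
import os


def script_path(name, module_file):
    """Resolve a helper script relative to the module that calls it."""
    moddir = os.path.dirname(os.path.abspath(module_file))
    # Assumed layout: <root>/genhub/<module>.py and <root>/scripts/<name>
    return os.path.normpath(os.path.join(moddir, os.pardir, 'scripts', name))
```

A module would call `script_path('format.sh', __file__)` once at import time and store the result, so nothing extra has to be threaded through function signatures.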

Improve configuration parsing

Currently there are two available options for loading configuration files.

the -c/--cfg option for providing the path of a single config file
the --cfgdir option for providing the path of a directory, from which GenHub will attempt to load all .yml files

The --cfgdir option is fine as is, but I propose the following additions and changes for other config loading options.

--cfglist option for providing a file with config files (one per line)
--cfgpath option for providing one or more directories in which to search for config files
--cfgfullpath option for indicating that value(s) provided by -c/--cfg option or --cfglist option are full file paths; by default, they are treated as relative paths and GenHub searches all directories specified by --cfgpath for these files

The option labels might need tweaking, but I think the functionality supports most/all conceivable use cases with a relatively simple interface.
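Here is a sketch of how the proposed options could fit together with `argparse`. The option names are the ones proposed above; the `get_parser` and `resolve_configs` helpers are hypothetical, and allowing `-c/--cfg` to be repeated and defaulting `--cfgpath` to the current directory are assumptions.

```python
import argparse
import os


def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--cfg', action='append', default=[],
                        help='config file; may be given more than once')
    parser.add_argument('--cfgdir', help='load all .yml files in this dir')
    parser.add_argument('--cfglist', help='file listing configs, one per line')
    parser.add_argument('--cfgpath', nargs='+', default=['.'],
                        help='directories to search for config files')
    parser.add_argument('--cfgfullpath', action='store_true',
                        help='treat --cfg/--cfglist values as full paths')
    return parser


def resolve_configs(args):
    """Return the list of config file paths implied by the parsed options."""
    names = list(args.cfg)
    if args.cfglist:
        with open(args.cfglist) as stream:
            names.extend(line.strip() for line in stream if line.strip())
    paths = []
    for name in names:
        if args.cfgfullpath:
            paths.append(name)
            continue
        # Take the first match along the search path.
        for directory in args.cfgpath:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                paths.append(candidate)
                break
        else:
            raise IOError('config file "%s" not found in: %s'
                          % (name, ', '.join(args.cfgpath)))
    if args.cfgdir:
        paths.extend(sorted(os.path.join(args.cfgdir, fname)
                            for fname in os.listdir(args.cfgdir)
                            if fname.endswith('.yml')))
    return paths
```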

Recipes for specific versions

TAIR 6
TAIR 10?
Amel_2.0/OGSv1.0
Amel_4.5/OGSv3.2

Support for lists of conf files

Plain text, one per line? Or a YAML file for easy parsing?

Add documentation for creating new .yml config files

CC @vpbrendel @cycoyuk

More PycURL stuff

http://pycurl.sourceforge.net/doc/install.html#pip-and-cached-pycurl-package

Cache dependencies

Looks like $HOME/local can be cached! http://docs.travis-ci.com/user/migrating-from-legacy

New rice recipe?

New rice entry in RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Oryza_sativa/all_assembly_versions/GCF_001433935.1_IRGSP-1.0/

NT_ and NW_ accessions in mammalian genomes correspond to patches, variants

The mouse and human genomes are very well finished, and the chromosome sequences are assigned NC_ accessions. Here, NW_ and NT_ accessions do not correspond to unplaced genomic scaffolds as they do in many other species. Instead, they correspond to patches or variants not (yet?) integrated into a major build release. This information is redundant and should be filtered out in preprocessing. Filtering annotations is simple, but if we don't want redundant sequences to be included in calculations, this will require implementing a new filtering mechanism.
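A rough sketch of both filters, operating on iterables of lines. The function names are hypothetical, and the Fasta filter assumes each defline begins with the accession (for example `>NT_...`), which may not hold for older RefSeq deflines. These filters should only be applied to genomes such as mouse and human, since in other species the same prefixes denote legitimate scaffolds.

```python
PATCH_PREFIXES = ('NT_', 'NW_')


def drop_patch_annotations(lines, prefixes=PATCH_PREFIXES):
    """Yield GFF3 lines, skipping features and regions on patch sequences."""
    for line in lines:
        if line.startswith('##sequence-region'):
            fields = line.split()
            seqid = fields[1] if len(fields) > 1 else ''
        elif line.startswith('#') or not line.strip():
            yield line  # other directives, comments, and blank lines
            continue
        else:
            seqid = line.split('\t', 1)[0]
        if not seqid.startswith(prefixes):
            yield line


def drop_patch_sequences(lines, prefixes=PATCH_PREFIXES):
    """Yield Fasta lines, skipping records whose ID has a patch prefix."""
    keep = True
    for line in lines:
        if line.startswith('>'):
            keep = not line[1:].startswith(prefixes)
        if keep:
            yield line
```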

Move, rename config files

Move the directory containing the config files into the distribution
Update setup.py accordingly
Rename to "recipes" or something like that
