gjuggler / greg-ensembl

A repository of Greg's Ensembl perl stuff.

Perl 43.98% R 40.74% ActionScript 0.02% CSS 0.01% Java 3.34% TeX 11.69% Shell 0.03% Emacs Lisp 0.19%

greg-ensembl's People

Forkers

jergosh

greg-ensembl's Issues

Collect whole-genome alignments based on Compara alignments

For a given ProteinTree alignment, use the Compara API to collect an equivalent alignment derived from the EPO pipeline. Things to do might include:

Filter out bad Compara gene trees based on lack of overlap with the EPO alignments.
Apply the same overlap filter to individual exons or aligned residues.
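
A minimal sketch of the overlap filter, assuming the exon coordinates and the EPO block coordinates have already been fetched and reduced to plain (start, end) pairs on one chromosome. The function name and the coordinate convention (1-based, inclusive, non-overlapping exons) are assumptions for illustration and are not part of the Compara API.

```python
def covered_fraction(exons, blocks):
    """Fraction of exon bases covered by at least one alignment block.

    Both arguments are lists of (start, end) pairs, 1-based and inclusive,
    on the same chromosome. Exons are assumed not to overlap each other.
    """
    # Merge overlapping or adjacent blocks so no base is counted twice.
    merged = []
    for start, end in sorted(blocks):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    total = sum(end - start + 1 for start, end in exons)
    covered = 0
    for ex_start, ex_end in exons:
        for bl_start, bl_end in merged:
            lo, hi = max(ex_start, bl_start), min(ex_end, bl_end)
            if lo <= hi:
                covered += hi - lo + 1
    return covered / total if total else 0.0
```

A gene tree could then be flagged when, say, too few of its genes reach a chosen coverage cutoff; the cutoff itself would have to be picked empirically.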

Gorilla: Generate substitution trees

Basically, automate the creation of figures like the substitution tree in the Pääbo FOXP2 paper: http://www.nature.com/nature/journal/v418/n6900/fig_tab/nature01025_F2.html

Calculate gene family duplication / gain / loss rates

People might be interested in primate-specific acceleration within certain families.

Possible tools:

CAFE: http://sites.bio.indiana.edu/~hahnlab/Software.html

Refactor ALL Runnables into Bio::Greg locations

I should move all code currently in the Bio::EnsEMBL::Compara::* folders into Bio::Greg::* instead. This should make any future integration into EnsEMBL's codebase easier.

Generate sets of genes under clade-specific relaxed or increased purifying constraint

This can be calculated in two ways for each gene:

Calculate the overall mean dN/dS (perhaps excluding positively selected sites) for each gene, then compare these overall values between sub-clades.
Do site-wise comparisons between one sub-clade and either (a) other individual sub-clades, e.g. primates vs. glires, or (b) the complement of that sub-clade. Count the number or proportion of sites where the sub-clade of interest is significantly lower (LRT-like).

(Note: this can also be done by collating on domains, etc...)
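
Both approaches can be sketched roughly as follows, assuming per-site dN/dS estimates are already available for each sub-clade. The function names, the exclusion cutoff, and the use of non-overlapping confidence intervals as a crude stand-in for an LRT-like test are all assumptions made for illustration.

```python
from statistics import mean


def mean_omega(site_omegas, exclude_above=None):
    """Mean per-site dN/dS, optionally dropping sites above a cutoff
    (e.g. to leave out putatively positively selected sites)."""
    kept = [w for w in site_omegas
            if exclude_above is None or w <= exclude_above]
    return mean(kept) if kept else None


def fraction_lower(sites_a, sites_b):
    """Fraction of comparable sites where clade A is clearly below clade B.

    Each site is a (lower, upper) interval for dN/dS, or None when no
    estimate exists. A site counts as lower when A's upper bound is
    below B's lower bound.
    """
    comparable = [(a, b) for a, b in zip(sites_a, sites_b)
                  if a is not None and b is not None]
    if not comparable:
        return None
    lower = sum(1 for a, b in comparable if a[1] < b[0])
    return lower / len(comparable)
```

Here `sites_b` can hold either another individual sub-clade or the complement of the sub-clade of interest, matching options (a) and (b) above.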

Real-time visualization of inferred vs true alignment residues

This needs to be done in Java / Processing, but it would be cool.

Given a (hidden) true alignment and a (shown) inferred alignment, plot the entire inferred alignment. When the user hovers over a given residue, highlight (a) the current column of inferred homologous residues and (b) the (potentially scattered) set of truly homologous residues.
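
The hover lookup itself is independent of the drawing layer. Below is a sketch of that logic in Python (the display would still live in Java / Processing), assuming each alignment is a dict from sequence ID to a gapped string with "-" as the gap character. All names here are hypothetical.

```python
GAP = "-"


def residue_index_by_column(row):
    """For each column of a gapped row, the ungapped residue index,
    or None where the column holds a gap."""
    out, n = [], 0
    for ch in row:
        if ch == GAP:
            out.append(None)
        else:
            out.append(n)
            n += 1
    return out


def column_by_residue(row):
    """For each ungapped residue index, the column it occupies."""
    return [col for col, ch in enumerate(row) if ch != GAP]


def highlight_sets(true_aln, inferred_aln, seq_id, col):
    """Return two sets of (seq_id, inferred_column) cells to highlight:
    the inferred column, and the residues truly homologous to the
    hovered residue, located in the inferred alignment."""
    idx = residue_index_by_column(inferred_aln[seq_id])[col]
    if idx is None:  # hovering over a gap
        return set(), set()
    inferred_cells = {(s, col) for s, row in inferred_aln.items()
                      if row[col] != GAP}
    # Find the hovered residue's column in the true alignment, then map
    # every residue in that true column back to its inferred position.
    true_col = column_by_residue(true_aln[seq_id])[idx]
    true_cells = set()
    for s, row in true_aln.items():
        j = residue_index_by_column(row)[true_col]
        if j is not None:
            true_cells.add((s, column_by_residue(inferred_aln[s])[j]))
    return inferred_cells, true_cells
```

For interactive use the two index tables would be computed once per sequence and cached rather than rebuilt on every hover event.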

Compute gc3 and gci/gcf for genes

These values per gene tree could be useful for isochore / codon bias analysis...

From Lavner and Kotlar 2005:

For the first three of the four methods described above, we need non-coding sequences neighboring a given gene (the fourth method, MCB, uses the coding sequence itself). We used the sequence consisting of the introns of the gene, the 1000 nucleotides immediately preceding the coding area of the gene, and similarly, those 1000 nucleotides immediately succeeding it (or truncated, as necessary, in the case that genes were less than 1000 nucleotides apart; see also Hey and Kliman, 2002 and Urrutia and Hurst, 2003). If an intron is longer than 2000 bp, only the 1000 nucleotides on each of the intron's ends were taken. By taking 1000 flanking bases, we assure that regions that may be under selective constrains, both in flanking regions and introns, constitute only a small portion of the strands that are used as control. On the other hand, regions of large introns that are far from any coding sequence may not represent the mutational bias that acts on the nearby exons, and thus introns were truncated to 1000 bases on each end. We masked repetitive elements using RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker).
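
As a rough illustration, GC3 and the intron truncation rule from the quoted passage could look like this in Python. The function names are made up for this sketch, and it assumes the coding sequence is in frame and that masked bases appear as N (which are excluded from the GC denominator).

```python
def gc_fraction(seq):
    """GC fraction over unambiguous bases; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def gc3(cds):
    """GC fraction at third codon positions of an in-frame CDS.
    Any trailing partial codon is ignored."""
    usable = len(cds) - len(cds) % 3
    return gc_fraction(cds[2:usable:3])


def truncate_intron(intron, flank=1000):
    """Keep only `flank` bases from each end of an intron longer
    than 2 * flank; shorter introns are returned unchanged."""
    if len(intron) <= 2 * flank:
        return intron
    return intron[:flank] + intron[-flank:]
```

The GC content of intronic or flanking sequence would then be `gc_fraction` applied to the truncated introns or to the up-to-1000-nucleotide flanks described in the quote.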

Optimize Indelign run time

See paper for more info: http://www.citeulike.org/user/gjuggler/article/6593838

Simulate domain-loop structures with differing indel rates

In order to achieve genome-wide bootstrap simulations, we need to go one level deeper than random simulations with indel processes. Domain-loop structure seems to be the next logical level to get at.

On the Slrsim side, we should have a parameter which accepts a Perl arrayref that defines a series of "domains", where each domain has a separate set of simulation parameters. These are then sent to Indelible as different blocks for simulation in consecutive order.
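
To make the shape of such a parameter concrete, here is a Python analogue of the proposed arrayref: an ordered list of per-domain parameter sets, laid out as consecutive blocks. The keys (`name`, `length`, `indel_rate`) and the function name are hypothetical, and the coordinates refer to the root sequence only, since indels change lengths downstream.

```python
def layout_domains(domains):
    """Assign consecutive root-sequence ranges to an ordered list of
    domain specs. Each spec is a dict that must carry a positive
    'length'; every other key is passed through untouched."""
    blocks, start = [], 0
    for index, dom in enumerate(domains):
        length = dom["length"]
        if length <= 0:
            raise ValueError(f"domain {index} has non-positive length")
        # Copy the spec and add its position; 'end' is exclusive.
        blocks.append(dict(dom, index=index, start=start, end=start + length))
        start += length
    return blocks


# Example: a conserved domain followed by an indel-rich loop.
example = layout_domains([
    {"name": "domain", "length": 120, "indel_rate": 0.01},
    {"name": "loop", "length": 30, "indel_rate": 0.20},
])
```

Each resulting block would then be written out as its own simulation block, in order.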

Non-interactive visualization of inferred vs true alignments

In the SLRsim project, it would be nice to directly compare the inferred alignments with the true alignment. We could do this by either:

Coloring columns of the inferred alignment with less than a given % cutoff of correctly-placed homology inferences.
Coloring individual residues with less than a given % of correct pairwise homology inferences.
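
Both scores can be derived from pairwise homology statements. The sketch below assumes alignments are dicts from sequence ID to gapped strings with "-" as the gap character, and treats a pair of residues in an inferred column as correct when the same two residues share a column in the true alignment. The function names are illustrative only.

```python
from itertools import combinations

GAP = "-"


def residue_columns(aln):
    """Map (seq_id, residue_index) -> column for every residue."""
    cols = {}
    for seq_id, row in aln.items():
        n = 0
        for col, ch in enumerate(row):
            if ch != GAP:
                cols[(seq_id, n)] = col
                n += 1
    return cols


def score_inferred(true_aln, inferred_aln):
    """Return (column_scores, residue_scores) for the inferred alignment.

    column_scores[c] is the fraction of residue pairs in inferred column c
    that are truly homologous (None if the column has fewer than two
    residues). residue_scores[(seq_id, i)] is the fraction of that
    residue's column partners that are truly homologous to it.
    """
    true_col = residue_columns(true_aln)
    width = len(next(iter(inferred_aln.values())))
    members = [[] for _ in range(width)]
    for res, col in residue_columns(inferred_aln).items():
        members[col].append(res)

    column_scores, residue_scores = [], {}
    for col_members in members:
        hits = {res: 0 for res in col_members}
        correct = total = 0
        for a, b in combinations(col_members, 2):
            total += 1
            if true_col[a] == true_col[b]:
                correct += 1
                hits[a] += 1
                hits[b] += 1
        column_scores.append(correct / total if total else None)
        partners = len(col_members) - 1
        for res in col_members:
            residue_scores[res] = hits[res] / partners if partners else None
    return column_scores, residue_scores
```

Columns or residues to color are then simply those whose score is not None and falls below the chosen cutoff.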