Comments (14)
It seems to identify two separate sets of links, within-strain and between-strain:
In here within label is 1 with mean [0.52112061 0.48266401] and between label is 2 with mean [0.64633422 0.49405153]
from poppunk.
So the non-origin end of the scale is:
euclidean 0.1257303629256175
Hence the returned boundary likelihoods are:
origin end -0.028069695293248376 other end -0.13515692159251055
Looks like the centres of the two distributions are too close together?
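As a quick sanity check on that separation, the "euclidean" value in the output above appears to be just the straight-line distance between the two reported means. A minimal sketch, assuming Python 3.8+ for math.dist and using only the two means printed above (the variable names are mine, not PopPUNK's):

```python
import math

# Component means copied from the fit output quoted in this thread
within_mean = (0.52112061, 0.48266401)
between_mean = (0.64633422, 0.49405153)

# Straight-line distance between the two centres
separation = math.dist(within_mean, between_mean)
print(f"{separation:.4f}")
```

This comes out at roughly 0.1257, matching the reported length of the search line, so the root search only covers a short segment in which both ends already favour the same component.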
from poppunk.
Yeah I don't think it'll work with that as the initial fit. It assumes two good cluster means for these components, whereas the fit above doesn't look like it's worked. So the first error will be because there's no root between the two means (should be able to see this from the likelihood contours + decision boundary plot too)
You could try with K = 2 or 4 and see if you get something reasonable to start from, or use --manual-start with the means from the previous GPS fit/SPARC fit.
from poppunk.
Trying K = 4 now; K = 2 gave a similar fit. I think the previous GPS fit was t-dist based, so would the means still be useful for starting?
from poppunk.
To "solve" this issue, I've been trying to look at whether there's a simple function that will identify a scaling factor that will ensure the two means have different signs:
def getBoundSize(model, start, end, within, between, origin, other):
    boundSize = 1
    sameSign = True
    while sameSign:
        origin_value = likelihoodBoundary(origin, model, start, end, within, between, boundSize)
        other_value = likelihoodBoundary(other, model, start, end, within, between, boundSize)
        if origin_value < 0 and other_value > 0:
            sameSign = False
        else:
            boundSize = boundSize + 1
    return boundSize
Followed by:
def likelihoodBoundary(s, model, start, end, within, between, boundSize):
    """Wrapper function around :func:`~PopPUNK.bgmm.fit2dMultiGaussian` so that it can
    go into a root-finding function for probabilities between components

    Args:
        s (float)
            Distance along line from mean0
        model (BGMMFit)
            Fitted mixture model
        start (numpy.array)
            The co-ordinates of the centre of the within-strain distribution
        end (numpy.array)
            The co-ordinates of the centre of the between-strain distribution
        within (int)
            Label of the within-strain distribution
        between (int)
            Label of the between-strain distribution

    Returns:
        responsibility (float)
            The difference between responsibilities of assignment to the within component
            and the between assignment
    """
    X = transformLine(s, start, end).reshape(1*boundSize, -1*boundSize)
    responsibilities = model.assign(X, values = True)
    return(responsibilities[0, within] - responsibilities[0, between])
But it just throws:
X = transformLine(s, start, end).reshape(1*boundSize, -1*boundSize)
ValueError: cannot reshape array of size 2 into shape (3,newaxis)
It's probably not a good idea anyway, but I suppose we should either (a) throw a more comprehensible error or (b) use this as a criterion that has to be met for the fitting to be considered valid?
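On the reshape error itself: reshape(1, -1) only turns the two coordinates into a single-row array, so multiplying its arguments changes the requested shape rather than the length of the line being searched. If the aim is to look further along the line for a sign change, it is the interval in s that would need to grow. Below is a minimal sketch of that idea in plain Python, assuming two 1-D Gaussian components along the line, each given as a (weight, mean, sd) tuple. The names boundary_score and expand_bracket are hypothetical and are not PopPUNK functions, and the score is the log-ratio of the weighted densities, which has the same sign and root as the difference in responsibilities.

```python
import math

def log_weighted_density(s, weight, mean, sd):
    """Log of weight * N(s | mean, sd) for a 1-D Gaussian component."""
    z = (s - mean) / sd
    return (math.log(weight) - math.log(sd)
            - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z)

def boundary_score(s, within, between):
    """Positive where the within component is more likely, negative otherwise.

    Computed on the log scale so it does not underflow far from the means.
    """
    return log_weighted_density(s, *within) - log_weighted_density(s, *between)

def expand_bracket(f, lo, hi, factor=2.0, max_steps=10):
    """Widen [lo, hi] about its midpoint until f changes sign across it.

    Returns the bracket (lo, hi), or None if no sign change is found
    within max_steps expansions.
    """
    centre = 0.5 * (lo + hi)
    half_width = 0.5 * (hi - lo)
    for _ in range(max_steps + 1):
        lo, hi = centre - half_width, centre + half_width
        if f(lo) * f(hi) < 0:
            return lo, hi
        half_width *= factor
    return None

# Illustrative components only: (weight, mean, sd) along the line
within = (0.5, 0.0, 1.0)
between = (0.5, 3.0, 1.0)
bracket = expand_bracket(lambda s: boundary_score(s, within, between), 2.0, 2.5)
```

The cap on max_steps matters: when one component dominates everywhere there is no root to find, and an uncapped loop like the one in getBoundSize would never terminate.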
from poppunk.
I think in some cases when the initial fit is just crap (as here) there's not much that can be used to salvage it. Eschewing the mixture model and using the KDE estimate (or DBSCAN, if that gives something better) I think is the way to go.
Doing a check before doing the minimum search is definitely a good idea - I'll add this in now
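For illustration, such a check could look something like the sketch below: evaluate the function at both ends of the line, fail with a readable message if the signs match, and only then search for the root. This is a plain-Python bisection written for this thread, not the code that went into PopPUNK. find_boundary and its error message are hypothetical.

```python
def find_boundary(f, lo, hi, tol=1e-9, max_iter=200):
    """Find a root of f in [lo, hi] by bisection, after checking for a sign change."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    # Same sign at both ends: no guaranteed root, so fail early and clearly
    if (f_lo > 0) == (f_hi > 0):
        raise ValueError(
            "No boundary between the within- and between-strain means: "
            f"values at the two ends are {f_lo:.4g} and {f_hi:.4g}. "
            "The initial model fit is probably not usable for refinement."
        )
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0 or (hi - lo) < tol:
            return mid
        # Keep the half-interval that still contains the sign change
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With the values reported earlier in this thread (about -0.028 at the origin end and -0.135 at the other end), the check would raise before any search is attempted.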
from poppunk.
Did we have a decent BGMM fit from the previous iteration of GPS data, which we ran refinement on to produce the last set of clusters we sent to Becca and Steph?
from poppunk.
The last good fit I remember was the t-distribution for which we provided priors - not sure about the BGMM fits.
from poppunk.
The one I was thinking of was the one started from in issue #12, but I can't quite remember how/with what it was obtained – probably it was the t-distribution as you say!
from poppunk.
Alas it was t distribution based, from my neglected bsub outputs:
python3 ./PopPUNK/poppunk-runner.py --fit-model --distances GPS/GPS.dists --output GPS --full-db --ref-db GPS --t-dist --priors priors.txt --K 3
from poppunk.
Wow, that's really quite something. I particularly don't understand the yellow component.
One final thing to try could be changing lines 44-49 of bgmm.py to:
dpgmm = mixture.BayesianGaussianMixture(n_components = dpgmm_max_K,
                                        n_init = 5,
                                        covariance_type = 'full',
                                        weight_concentration_prior = 1,
                                        mean_precision_prior = 1).fit(X)
Though I don't really expect that to change anything. So either a manual start from KDE/previous or DBSCAN?
from poppunk.
Looks like --manual-start fixed this, so closing for now
from poppunk.