rc / dist_mixtures Goto Github PK

0.0 4.0 1.0 747 KB

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

dist_mixtures's Introduction

dist_mixtures
=============

A package for fitting histograms of spatial orientations by a mixture of von
Mises distributions.

License: New BSD License, see the LICENSE file.

Data
----

This package has been developed to analyze muscle cells directions data coming
from aortic segments. The related data files can be found at [1].

[1] https://github.com/rc/dist_mixtures_data

Installation
------------

Download the source code from [1].

[1] https://github.com/rc/dist_mixtures

The following dependencies need to be installed:

- NumPy (http://numpy.org)
- SciPy (http://scipy.org) - Git version
- Matplotlib (http://matplotlib.sourceforge.net)
- Statsmodels (http://statsmodels.sourceforge.net) - Git version

Then ``cd`` into the ``dist_mixtures/`` directory.

Usage
-----

Run ``fit_von_mises.py`` without arguments to see the help message::

  $ ./fit_von_mises.py
  Usage: fit_von_mises.py [options] pattern data_dir

  Fit data files with names matching a given pattern by a mixture of von Mises
  distributions.

  Options:
    --version             show program's version number and exit
    -h, --help            show this help message and exit
    -o dirname, --output-dir=dirname
                          output directory [default: output]
    -c filename, --conf=filename
                          use configuration file with parameter sets. Ignored,
                          if n_components option is given.
    -n positive_int, --n-components=positive_int
                          number of components of the mixture [default: 2]
    -p kappa0,mu0,kappa1,mu1,..., --parameters=kappa0,mu0,kappa1,mu1,...
                          initial guess of von Mises parameters for each
                          component as a comma separated list, e.g., for two
                          components: "1,0,1,0" corresponding to kappa0, mu0,
                          kappa1, mu1 respectively. The location parameter mu
                          should be given in degrees in [-90, 90[.
    -d pattern, --dir-pattern=pattern
                          pattern that subdirectories should match [default: *]
    -m positive_int, --merge-bins=positive_int
                          number of consecutive bins in data to merge [default:
                          None]
    --plot-bins-step=positive_int
                          step to choose bins from all bins for histogram plots
                          [default: 4]
    --spread-data         spread raw data using their counts instead of just
                          repeating them
    -a, --area-angles     compute and draw angles of two systems of fibres
                          determined by equal histogram areas
    --no-neg-shift        do not add 180 degrees to negative angles
    -s, --show            show the figures

Example runs
------------

By default, the results are stored into a directory called ``output``. Use
``-o`` option to change that.

- Analyze histograms in ``*.txt`` files of data sets in ``data/<dataset name>``
  directories, assume 2 von Mises mixture (VMM) components::

    $ ./fit_von_mises.py "*.txt" data/ -n 2

- Run analysis with parameters given in a file. This example shows how to run
  the analysis with varying number of VMM components::

    $ ./fit_von_mises.py "*.txt" data/ -c examples/psets/n_components.py

  See other example parameter set files in ``examples/psets/``.

dist_mixtures's People

Contributors

Watchers

Forkers

josef-pkt

dist_mixtures's Issues

rvs method has some points outside (-pi,pi)

VonMisesMixture.rvs_mix for a large sample size=50000 up to 1 Million
I got a small (2 to 3) percent observations out of range.

-> add modulo/remainder

Merge 'updates' branch

Merge 'updates' branch to 'sensitivity' branch, add Struct class to parents of other classes. This would allow:

ipdb> print res.model
VonMisesMixture
  _data_attr:
    list: ['exog', 'endog', 'data.exog', 'data.endog', 'data.orig_endog', 'data.orig_exog']
  confint_dist:
    <scipy.stats.distributions.norm_gen object at 0x29343d0>
  data:
    <statsmodels.base.data.ModelData object at 0x38a4d10>
  endog:
    (40097,) array of float64
  exog:
    None
  k_constant:
    0

that is one would see immediately which attributes objects have.

normalize in fit(), update tests

fix coding (excessive LFs in dist_mixtures/mixture_von_mises.py)

In the sensitivity branch:

$ file dist_mixtures/mixture_von_mises.py
dist_mixtures/mixture_von_mises.py: Python script, ASCII text executable, with CRLF, LF line terminators

Originally, it was:

$ file dist_mixtures/mixture_von_mises.py
dist_mixtures/mixture_von_mises.py: Python script, ASCII text executable, with CRLF line terminators

Maybe autocrlf [1] setting needs to be fixed?

[1] http://stackoverflow.com/questions/4181870/git-on-windows-what-do-the-crlf-settings-mean

design: convert endog, exog to float ?

I just ran into a bug with fit_ls for a test case when endog, which is currently the count, is int (int32 returned by np.histogram).

It might be better to convert the data in the models to float if they are not already, since there might be other, less obvious problems with other dtypes.

(not a problem with the current data which is float64)

differences between nm and bfgs solutions

I am running the code with the test data, looking for two components as follows:

$ ./fit_von_mises.py "*.txt" test-data oo  -s

When nm is used, the two components obtained are:

dist0: shape=3.7856, loc=-3.1304, prob=0.5117
dist1: shape=1.7051, loc=3.0787, prob=0.4883

And with bfgs:

dist0: shape=2.3769, loc=3.1221, prob=0.5000
dist1: shape=2.3769, loc=3.1221, prob=0.5000

The bfgs result seems strange - can you reproduce it?

problem in vonmises cdf of scipy.stats

It looks like the vonmises.cdf doesn't work correctly if loc is different from zero.
I think it doesn't shift the support correctly.

I tried to use stats.vonmises.cdf for the chisquare test, but get negative cdf values.

Verify the kind of distribution

I may have asked this already, but (my colleagues ask...): Is there a way of saying (a statistical test with some p value?) that our data come

a) from a single
b) from a mixture of von Mises distributions?

I am sceptical about b), as given a sufficient count of mixture components, anything can be fitted. I tried 4, and got perfect fits for even rather irregular data.

The question is related to the hypothesis "there are two preferential direction symmetric w.r.t. axis (axis = e.g. the location of a single von Mises)" - I think this can be tested only under assumption of the von Mises mixture. Then even if rejected, in reality there can be two preferential direction symmetric w.r.t. axis, but with a different distribution, which cannot be verified. What do you think?

binning

the original data is binned, so we have a discretized distribution

I don't think the approximation error is large for the estimation

>>> plt.plot(stats.vonmises.pdf(xxpi-d/2., 4)[1:])
[<matplotlib.lines.Line2D object at 0x0DAE1E10>]
>>> plt.show()
>>> np.max(np.abs(np.diff(stats.vonmises.cdf(xxpi, 4)) / d -  stats.vonmises.pdf(xxpi, 4)[1:]))
0.015765796647894514
>>> np.max(np.abs(np.diff(stats.vonmises.cdf(xxpi, 4)) / d -  stats.vonmises.pdf(xxpi-d/2., 4)[1:]))
0.00015579844725244207
>>> np.max(np.abs(np.diff(stats.vonmises.cdf(xxpi, 4)) / d /  stats.vonmises.pdf(xxpi-d/2., 4)[1:] -1))
0.0008251767346727501

Note: cdf in cython looks pretty fast

some possibilities:

computational: we could take advantage that we only have 180 unique values
goodness of fit:
- chi-square test could be almost directly applied
- Watson (Anderson-Darling/Von Mises statistic): I need to check, it should work especially since
  we need bootstrap p-values anyway.
- parametric bootstrap: we might have to apply the binning to mimic the data sampling/binning
alternative estimators that take binning into account: maybe just as sensitivity check
My guess is that it wouldn't change the results but would be computationally more direct.
Disadvantage: We don't get standard maximum likelihood statistics and inference.

normalize params

Because of the periodicity of cosine, the parameters are not globally unique, but they are locally identified.

for example kappa only needs to be positive
loc only needs to be in interval (-pi, pi) (?)

We don't impose this during estimation to avoid boundary conditions
However, to get parameter estimates that are more easily interpretable, we could normalize the params to the observationally equivalent "standard" form.

e.g. getting rid of negative kappa by adjusting loc

>>> xx = np.linspace(-2*np.pi, 2*np.pi, 21)
>>> -np.cos(xx)
array([-1.        , -0.80901699, -0.30901699,  0.30901699,  0.80901699,
        1.        ,  0.80901699,  0.30901699, -0.30901699, -0.80901699,
       -1.        , -0.80901699, -0.30901699,  0.30901699,  0.80901699,
        1.        ,  0.80901699,  0.30901699, -0.30901699, -0.80901699, -1.        ])
>>> np.cos(xx+np.pi)
array([-1.        , -0.80901699, -0.30901699,  0.30901699,  0.80901699,
        1.        ,  0.80901699,  0.30901699, -0.30901699, -0.80901699,
       -1.        , -0.80901699, -0.30901699,  0.30901699,  0.80901699,
        1.        ,  0.80901699,  0.30901699, -0.30901699, -0.80901699, -1.        ])
>>> np.cos(xx-np.pi)
array([-1.        , -0.80901699, -0.30901699,  0.30901699,  0.80901699,
        1.        ,  0.80901699,  0.30901699, -0.30901699, -0.80901699,
       -1.        , -0.80901699, -0.30901699,  0.30901699,  0.80901699,
        1.        ,  0.80901699,  0.30901699, -0.30901699, -0.80901699, -1.        ])

gap in histogram for Estimated Distribution plot

the histogram misses one slice

I didn't look at the code yet

VonMisesMixtureBinned vs. VonMisesMixture

Some notes: when comparing the results of the two classes:

the fits are very similar, but the pdf plots plots slightly shifted by a "half of histogram bin width"
this shift can be removed by using the --spread-data option,
then the fits are visually almost the same, with VonMisesMixture looking slightly better.
aic and bic are much higher for VonMisesMixtureBinned
p-value is 0 for VonMisesMixtureBinned, 1 for VonMisesMixture

What causes the last two bullets?

(Commenting on results of $ ./fit_von_mises.py "*.txt" test-data/ -c examples/psets/models.py --spread-data)

VonMisesMixtureBinned does not work with statsmodels from git

Traceback (most recent call last):
  File "./fit_von_mises.py", line 128, in <module>
    main()
  File "./fit_von_mises.py", line 118, in main
    logs, alog = analyze(source, psets, options)
  File "./analyses/fit_mixture.py", line 110, in analyze
    pl.plot_histogram_comparison(pset.output_dir, res, source, ii)
  File "./analyses/plots.py", line 116, in plot_histogram_comparison
    ret_sizes=True)
  File "./aorta/dist_mixtures/mixture_von_mises.py", line 380, in rvs_mix
    size=sizes[ii]))
  File "~/software/usr/local/lib/python/dist-packages/scipy/stats/distributions.py", line 627, in rvs
    vals = self._rvs(*args)
  File "~/software/usr/local/lib/python/dist-packages/scipy/stats/distributions.py", line 5748, in _rvs
    return mtrand.vonmises(0.0, b, size=self._size)
  File "mtrand.pyx", line 2240, in mtrand.RandomState.vonmises (numpy/random/mtrand/mtrand.c:10433)
  File "mtrand.pyx", line 201, in mtrand.cont2_array_sc (numpy/random/mtrand/mtrand.c:1867)
ValueError: negative dimensions are not allowed

normalize_params reorder components in params

(To test failure in PR #10 comments, I'm not able to add a comment there now (?))

I'm getting the same failure, I think I have seen this before.
One addition I wanted to make is to rearrange components in normalize_params, so we get a unique parameterization (instead of permutations).

I thought of sorting by probability, but that can change which is the left-out component in probabilities, the last part of params.