Giter Site home page Giter Site logo

przechoj / gips Goto Github PK

View Code? Open in Web Editor NEW
6.0 4.0 2.0 24.43 MB

gips - Gaussian model Invariant by Permutation Symmetry

Home Page: https://przechoj.github.io/gips/

License: GNU General Public License v3.0

R 100.00%
machine-learning normal-distribution r r-package covariance-estimation

gips's People

Contributors

kolodziejek avatar mrdomani avatar przechoj avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

kolodziejek

gips's Issues

`det_phi_part` of a `goal_function`

I've convinced myself that goal_function works properly for two simple edge cases, where det_phi_part is one or consists of a product of a single element. Unfortunately, I think the det_phi_part is not calculated correctly for the general case.

I would love to see this det_phi_part outsourced as a function with some unit tests.

statistical tests

We decide if the covariance matrix is invariant under permutation based on the a posteriori distribution in the Bayesian approach. We could then instruct users to use the statistical test to decide whether the found permutation is reasonable. Such tests are developed theoretically. IDK if someone implemented those in R, but it seems easy to do by ourselves...

Continue the optimization

When the user optimizes his S matrix and then plots the results, he may conclude that he wants to run the optimization for longer, for example, for twice as many iterations. Then he plots the results and sees that gips forgot the original path, showing only the new one.

I think the good idea would be to make an additional parameter in find_gips() called continue. When a user sets it to TRUE, then the optimization is performed in such a way to remember the previously made progress.

When the user calls find_gips(g, continue=FALSE) on already optimized g, there should be a warning that his process will be lost. Maybe even ask for confirmation?

`heatmap` of gips object

I thought it would be nice to be able to call just heatmap(g), where g is of gips class. It would be implemented like heatmap(gips::project_matrix(attr(g, "S"), g[[1]]), Rowv = NA, Colv = NA).

Is it a good idea? Can it be useful?

The minor technical issue is that stats::heatmap is not a generic function. We have two options now: make it generic in our package and set the heatmap.default to stats::heatmap; or call this function sth like heatmap_gips(g).

What do you think, @MrDomani ?

`basis` argument for calculations

For now, everything works on a natural orthonormal basis $I_p$.

Users may be interested in getting the decomposition for different orthonormal basis.

validate_gips breaks when log_posteriori is nan

library(gips)
library(rags2ridges)

GIPS_N_ITER <- 2

n_points <- 50
S <- createS(n=n_points, p=150)

D_coef <- 1000
delta <- 3

D_matrix <- diag(ncol(S)) * D_coef
g <- gips(S, n_points, delta = delta, D_matrix = D_matrix,
          was_mean_estimated = TRUE)
g_MAP <- find_MAP(g, max_iter = GIPS_N_ITER, return_probabilities = TRUE,
                  optimizer = "MH", save_all_perms = TRUE)
validate_gips(g_MAP)

Set up Github Actions

Right now we don't have any tests nor code. Once we have some, it will be useful to automatically run R CMD Check.
Execute

usethis::use_github_action_check_standard()

once ready.

Optimizers notes

In this issue, I will share my thoughts on optimizers in find_MAP().

For an introduction to the topic, see the vignette("Optimizers", package="gips") also available as the pkgdown page.

We encourage anyone to comment on those.

Our MH "walks" on permutations, not on groups

We want to find the cyclic group. In the reference paper, the process of walking on groups is referred to as the First approach in 4.1.1. The process of walking on permutations is referred to as the Second approach in 4.1.2.

In the gips package, the Second approach is implemented as the "MH" optimizer. We examined the First approach throughout and decided not to implement it. We don't reject it, though, and we may choose to do it in the future.

"BF" optimization

find_MAP(optimizer = "BF") for perm_size < 10 calculates a posteriori function only for group generators, while for 10 <= perm_size calculates a posteriori function for all permutations of a given size. This is faster to compute it for fewer perms, but we had to stop somewhere and decided to stop at perm_size == 10. We can always later extend the perm_group_generators_list object generated in data-raw/perm_group_generators_list.R.

Other optimizers

We have implemented other optimizers that we decided not to yet add to gips. Those include:

  1. Random - In every iteration, draw a random permutation uniformly on all possible ones. It was spectacularly worse than other optimizers.
  2. Simulated Annealing - Generalization of MH; has an additional parameter that makes it easier/harder to go to worse perm. The results were promising, but the optimal choice of the additional hyperparameter was hard. We tested only small spaces (p < 30).
  3. Evolutionary Optimization - the one we implemented gave promising results in small spaces (for p < 30, the Evolutionary Optimization was better than "MH"). Still, for bigger spaces, it was a complete disappointment (p = 100). It added a massive number of hyperparameters.

Ideas for future optimizers

  1. Greedy - in an iteration, it will go through all the transpositions, just like the Hill Climbing, but will go to the first found better permutation.

Release gips 1.2.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted ๐ŸŽ‰
  • Add preemptive link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • Finish blog post
  • Tweet

Returning acceptable permutational group

For now, we will implement a function that will get the observations $Z = (Z_1, Z_2, ..., Z_n) $ or matrix $U = {\sum_{i=1}}^n Z_i^T \cdot Z_i $ and will find the best permutational group $\Gamma $ to project the matrix $U $ on $\mathcal{P}_{\Gamma} $.

However, if $n &lt; p $, then $\frac{1}{n}\cdot U $ is not the maximum likelihood Covariance estimator because such the likelihood function does not exist.

The projection $U $ on $\mathcal{P}_{\Gamma} $ will be the maximum likelihood estimator, iff $p\ge n_0 $, where $n_0 $ depends on $\Gamma $, according to the paper.

It also is, that $\forall_{\Gamma} \text{ }n_0 \le p $

It may be that the found group $\Gamma$ will still have $n_0 &gt; n $. In such a case, the function should return the best Gamma among those with $n_0\le n $. Such a group always exists because for $\Gamma = \text{ }&lt; (1,2,...,p)&gt; $, we have $n_0 = 1 $, which is an acceptable Gamma for every number $n $.

Cycle-based goal function calculations

The goal function of the Metrop-Heist algorithm looks like this:
, where matrixes , and function

I suspect the calculation of function may be very expensive.

Edit, the following part contains errors. See the following comment
Let's indicate the disjoint cycles of as . Then notice that depends only on and .

It may be a big computational profit to save the calculated and recall them in stead of calculating them every time. Keep in mind that in single M-H step we are going from into such that , therefore and are different on the maximum 2 of theirs disjoint cycles:

and and most of the does not change.

`NaN` values of `calculate_phi_part` function

Sometimes the calculate_phi_part tries to return the NaN or Inf value. This is most likely caused by the situation for .

Further investigation is needed. It is not known if this is caused by the actual showing up or if sth very close to that R cannot distinguish from it.

Wrong output of `calculate_r`

The values of get_structure_constants(perm, 5), where perm$=(1,2,3,4,5)$ are wrong. The r vector should be c(1,1,1), but is c(1,1), therefore the bug is in calculate_r function. The rest of get_structure_constants works fine on this example.

Redesign the package

The design and names used in the package have to be changed slightly.

  • Introducing the gips() function that will be the wrapper for optimization algorithms looking for the permutation. There are now two algorithms, the Metropolis-Hastings and the best growth, but we plan to introduce the simulated annealing and the evolutionary algorithm in the future
  • MH() function will have changed the name to sth more meaningful
  • It is the same for goal_function(), especially since now it is logarithm-based
  • The terms of the attributes have to be thought through: n_number, max_iter (it is precisely the number of iterations for MH, but will be the maximum number of iterations for simulated annealing and the evolutionary algorithm in the future)
  • The U parameter from $U={\sum_{i=1}}^n(Z_i\cdot Z_i^T)$ has to be changed into the S parameter $S=\frac{U}{n}$

Set optimization time

The user can now set the maximum number of iterations for the optimization process. Worse, the time of every iteration is vastly different for each type.

I think a good idea would be to add the user option to choose whether he wants to set the maximum optimization time.

We did not encounter such a feature before, which is surprising.

Progress bar for `find_MAP()`

The user will run the M-H algorithm for a long time on his machine. The process has to be more informative and eye-pleasing than right now.

Randomized test for `goal_function`

The test 'goal_function has the desired property' in the test_MH.R file is designed randomly. It is supposed to show the desired property of the goal_function that is not guaranteed. I am not sure whether it is a problem. The code below shows that this test will fail about 6% of the time.

I put the set.seed(1234) in the test to make sure it will not be a false red flag while developing, but we should consider some other solution in the future.

For example, the test fails for set.seed(24) or 29, 36, 39, 62, 66, 69.

library(gips)

set.seed(1234)
M <- 10000
progressBar <- utils::txtProgressBar(min = 0, max = M, initial = 1)

p <- 10
n <- 20
actual_permutation <- permutations::as.cycle(permutations::as.word(c(2:p, 1)))
id_perm <- permutations::id

mu <- numeric(p)
sigma_matrix <- matrix(numeric(p*p), nrow=p)
for(i in 1:p){
  for(j in 1:p){
    sigma_matrix[i,j] <- 1 - abs(i-j) / p
  }
  sigma_matrix[i,i] <- 1 + 1/p
}

bad_i <- 0
for(i in 1:M){
  utils::setTxtProgressBar(progressBar, i)
  
  Z <- MASS::mvrnorm(n, mu = mu, Sigma = sigma_matrix)
  U <- t(Z) %*% Z
  
  actual_permutation_function_value <- goal_function(actual_permutation,
                                                     n, U)
  another_permutation_function_value <- goal_function(id_perm,
                                                      n, U)
  
  if(actual_permutation_function_value <= another_permutation_function_value){
    bad_i <- bad_i + 1
  }
}
close(progressBar)
paste0(bad_i/M * 100, "% of times test failed")  # 5.77% of times test failed

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.