przechoj / gips Goto Github PK

View Code? Open in Web Editor NEW

6.0 4.0 2.0 24.43 MB

gips - Gaussian model Invariant by Permutation Symmetry

Home Page: https://przechoj.github.io/gips/

License: GNU General Public License v3.0

R 100.00%

machine-learning normal-distribution r r-package covariance-estimation

gips's People

Contributors

Stargazers

Watchers

Forkers

kolodziejek

gips's Issues

`det_phi_part` of a `goal_function`

I've convinced myself that goal_function works properly for two simple edge cases, where det_phi_part is one or consists of a product of a single element. Unfortunately, I think the det_phi_part is not calculated correctly for the general case.

I would love to see this det_phi_part outsourced as a function with some unit tests.

statistical tests

We decide if the covariance matrix is invariant under permutation based on the a posteriori distribution in the Bayesian approach. We could then instruct users to use the statistical test to decide whether the found permutation is reasonable. Such tests are developed theoretically. IDK if someone implemented those in R, but it seems easy to do by ourselves...

Continue the optimization

When the user optimizes his S matrix and then plots the results, he may conclude that he wants to run the optimization for longer, for example, for twice as many iterations. Then he plots the results and sees that gips forgot the original path, showing only the new one.

I think the good idea would be to make an additional parameter in find_gips() called continue. When a user sets it to TRUE, then the optimization is performed in such a way to remember the previously made progress.

When the user calls find_gips(g, continue=FALSE) on already optimized g, there should be a warning that his process will be lost. Maybe even ask for confirmation?

`heatmap` of gips object

I thought it would be nice to be able to call just heatmap(g), where g is of gips class. It would be implemented like heatmap(gips::project_matrix(attr(g, "S"), g[[1]]), Rowv = NA, Colv = NA).

Is it a good idea? Can it be useful?

The minor technical issue is that stats::heatmap is not a generic function. We have two options now: make it generic in our package and set the heatmap.default to stats::heatmap; or call this function sth like heatmap_gips(g).

What do you think, @MrDomani ?

`basis` argument for calculations

For now, everything works on a natural orthonormal basis $I_p$.

Users may be interested in getting the decomposition for different orthonormal basis.

validate_gips breaks when log_posteriori is nan

library(gips)
library(rags2ridges)

GIPS_N_ITER <- 2

n_points <- 50
S <- createS(n=n_points, p=150)

D_coef <- 1000
delta <- 3

D_matrix <- diag(ncol(S)) * D_coef
g <- gips(S, n_points, delta = delta, D_matrix = D_matrix,
          was_mean_estimated = TRUE)
g_MAP <- find_MAP(g, max_iter = GIPS_N_ITER, return_probabilities = TRUE,
                  optimizer = "MH", save_all_perms = TRUE)
validate_gips(g_MAP)

Set up Github Actions

Right now we don't have any tests nor code. Once we have some, it will be useful to automatically run R CMD Check.
Execute

usethis::use_github_action_check_standard()

once ready.

Optimizers notes

In this issue, I will share my thoughts on optimizers in find_MAP().

For an introduction to the topic, see the vignette("Optimizers", package="gips") also available as the pkgdown page.

We encourage anyone to comment on those.

Our MH "walks" on permutations, not on groups

We want to find the cyclic group. In the reference paper, the process of walking on groups is referred to as the First approach in 4.1.1. The process of walking on permutations is referred to as the Second approach in 4.1.2.

In the gips package, the Second approach is implemented as the "MH" optimizer. We examined the First approach throughout and decided not to implement it. We don't reject it, though, and we may choose to do it in the future.

"BF" optimization

find_MAP(optimizer = "BF") for perm_size < 10 calculates a posteriori function only for group generators, while for 10 <= perm_size calculates a posteriori function for all permutations of a given size. This is faster to compute it for fewer perms, but we had to stop somewhere and decided to stop at perm_size == 10. We can always later extend the perm_group_generators_list object generated in data-raw/perm_group_generators_list.R.

Other optimizers

We have implemented other optimizers that we decided not to yet add to gips. Those include:

Random - In every iteration, draw a random permutation uniformly on all possible ones. It was spectacularly worse than other optimizers.
Simulated Annealing - Generalization of MH; has an additional parameter that makes it easier/harder to go to worse perm. The results were promising, but the optimal choice of the additional hyperparameter was hard. We tested only small spaces (p < 30).
Evolutionary Optimization - the one we implemented gave promising results in small spaces (for p < 30, the Evolutionary Optimization was better than "MH"). Still, for bigger spaces, it was a complete disappointment (p = 100). It added a massive number of hyperparameters.

Ideas for future optimizers

Greedy - in an iteration, it will go through all the transpositions, just like the Hill Climbing, but will go to the first found better permutation.

Release gips 1.2.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Returning acceptable permutational group

For now, we will implement a function that will get the observations $Z = (Z_1, Z_2, ..., Z_n) $ or matrix $U = {\sum_{i=1}}^n Z_i^T \cdot Z_i $ and will find the best permutational group $\Gamma $ to project the matrix $U $ on $\mathcal{P}_{\Gamma} $.

However, if $n < p $, then $\frac{1}{n}\cdot U $ is not the maximum likelihood Covariance estimator because such the likelihood function does not exist.

The projection $U $ on $\mathcal{P}_{\Gamma} $ will be the maximum likelihood estimator, iff $p\ge n_0 $, where $n_0 $ depends on $\Gamma $, according to the paper.

It also is, that $\forall_{\Gamma} \text{ }n_0 \le p $

It may be that the found group $\Gamma$ will still have $n_0 > n $. In such a case, the function should return the best Gamma among those with $n_0\le n $. Such a group always exists because for $\Gamma = \text{ }< (1,2,...,p)> $, we have $n_0 = 1 $, which is an acceptable Gamma for every number $n $.

Cycle-based goal function calculations

The goal function of the Metrop-Heist algorithm looks like this:
$\pi(c) \propto f(c) := \frac{I_c(\delta + n, D_c+ U_c)}{I_c(\delta, D_c)} = exp(-\frac{1}{2}nA_c) \cdot \frac{G_c(\delta + n)}{G_c(\delta)} \cdot Det(\frac{1}{2}(D_c+ U_c))^{-\frac{1}{2}(\delta + n-2)} \cdot Det(\frac{1}{2}D_c)^{\frac{1}{2}(\delta-2)} \cdot \frac{\varphi_c(\frac{1}{2}(D_c+ U_c))}{\varphi_c(\frac{1}{2}D_c)}$ , where matrixes $D_c := \pi_c(D), U_c := \pi_c(U)$ , and function $G_c(\delta) := \prod_{i=1}^L \Gamma_{\Omega_i}(\frac{1}{2} \cdot k_i \cdot (\delta -2)+ \frac{dim(\Omega_i)}{r_i})$

I suspect the calculation of $G_c(\delta)$ function may be very expensive.

Edit, the following part contains errors. See the following comment
Let's indicate the disjoint cycles of $c$ as $c = c_1\circ c_2\circ ... \circ c_L$ . Then notice that $\Gamma_{\Omega_i}(\frac{1}{2} \cdot k_i \cdot (\delta -2)+ \frac{dim(\Omega_i)}{r_i})$ depends only on $c_i$ and $\delta$ .

It may be a big computational profit to save the calculated $\Gamma_{\Omega_i}(\frac{1}{2} \cdot k_i \cdot (\delta -2)+ \frac{dim(\Omega_i)}{r_i})$ and recall them in stead of calculating them every time. Keep in mind that in single M-H step we are going from $c$ into $c'$ such that $\exists_{(i,j)} c' = c\circ (i,j)$ , therefore $c'$ and $c$ are different on the maximum 2 of theirs disjoint cycles:

$|L - L'| \le 1$ and $|\{c_1, c_2, ..., c_L\}\cap \{c_1', c_2', ..., c_{L'}'\}| \ge min(L, L') -1$ and most of the $c_i$ does not change.

Full `ggplot2` supportability by `plot.gips`

plot.gips(type="heatmap") will produce a ggplot2 graph. It would be nice for the other types to also support ggplot2.

`return_probabilities=TRUE`does not work with rlang < 0.4.10

In non-interactive sessions, one should use is_installed instead.
We have check_installed in 2 places: for return_probabilities=TRUE and for plot.gips.
Both could be used in non-interactive sessions, the first one surely.

`NaN` values of `calculate_phi_part` function

Sometimes the calculate_phi_part tries to return the NaN or Inf value. This is most likely caused by the $0^{-\varepsilon}$ situation for $\varepsilon>0$ .

Further investigation is needed. It is not known if this is caused by the actual $0$ showing up or if sth very close to $0$ that R cannot distinguish from it.

Wrong output of `calculate_r`

The values of get_structure_constants(perm, 5), where perm$=(1,2,3,4,5)$ are wrong. The r vector should be c(1,1,1), but is c(1,1), therefore the bug is in calculate_r function. The rest of get_structure_constants works fine on this example.

Redesign the package

The design and names used in the package have to be changed slightly.

Introducing the gips() function that will be the wrapper for optimization algorithms looking for the permutation. There are now two algorithms, the Metropolis-Hastings and the best growth, but we plan to introduce the simulated annealing and the evolutionary algorithm in the future
MH() function will have changed the name to sth more meaningful
It is the same for goal_function(), especially since now it is logarithm-based
The terms of the attributes have to be thought through: n_number, max_iter (it is precisely the number of iterations for MH, but will be the maximum number of iterations for simulated annealing and the evolutionary algorithm in the future)
The U parameter from $U={\sum_{i=1}}^n(Z_i\cdot Z_i^T)$ has to be changed into the S parameter $S=\frac{U}{n}$

Set optimization time

The user can now set the maximum number of iterations for the optimization process. Worse, the time of every iteration is vastly different for each type.

I think a good idea would be to add the user option to choose whether he wants to set the maximum optimization time.

We did not encounter such a feature before, which is surprising.

Progress bar for `find_MAP()`

The user will run the M-H algorithm for a long time on his machine. The process has to be more informative and eye-pleasing than right now.

Randomized test for `goal_function`

The test 'goal_function has the desired property' in the test_MH.R file is designed randomly. It is supposed to show the desired property of the goal_function that is not guaranteed. I am not sure whether it is a problem. The code below shows that this test will fail about 6% of the time.

I put the set.seed(1234) in the test to make sure it will not be a false red flag while developing, but we should consider some other solution in the future.

For example, the test fails for set.seed(24) or 29, 36, 39, 62, 66, 69.

library(gips)

set.seed(1234)
M <- 10000
progressBar <- utils::txtProgressBar(min = 0, max = M, initial = 1)

p <- 10
n <- 20
actual_permutation <- permutations::as.cycle(permutations::as.word(c(2:p, 1)))
id_perm <- permutations::id

mu <- numeric(p)
sigma_matrix <- matrix(numeric(p*p), nrow=p)
for(i in 1:p){
  for(j in 1:p){
    sigma_matrix[i,j] <- 1 - abs(i-j) / p
  }
  sigma_matrix[i,i] <- 1 + 1/p
}

bad_i <- 0
for(i in 1:M){
  utils::setTxtProgressBar(progressBar, i)
  
  Z <- MASS::mvrnorm(n, mu = mu, Sigma = sigma_matrix)
  U <- t(Z) %*% Z
  
  actual_permutation_function_value <- goal_function(actual_permutation,
                                                     n, U)
  another_permutation_function_value <- goal_function(id_perm,
                                                      n, U)
  
  if(actual_permutation_function_value <= another_permutation_function_value){
    bad_i <- bad_i + 1
  }
}
close(progressBar)
paste0(bad_i/M * 100, "% of times test failed")  # 5.77% of times test failed