LowRankModels.jl's Issues

Compute X given Y

Once a model has been fit to a matrix A, is there any way to fit it to another matrix holding Y constant? For example, if factor analysis is part of a pipeline that ends with an SVM classifier, the cross-validation code should learn the feature matrix Y on the training set, and compute the data matrix X on the test set, given Y.
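For the quadratic-loss case, what I have in mind is essentially a ridge problem per row of the new matrix with the fitted Y held fixed. A rough sketch (the function name and λ are placeholders, not part of the LowRankModels API; for non-quadratic losses one would instead run proximal-gradient updates on X alone):

using LinearAlgebra   # `I` and `factorize` live here on Julia >= 0.7; in Base before that

function solve_X_given_Y(A_new::AbstractMatrix, Y::AbstractMatrix, λ::Real)
    k = size(Y, 1)                                        # Y is k×n, as in X'Y ≈ A
    X_new = zeros(k, size(A_new, 1))
    F = factorize(Y * Y' + λ * Matrix{Float64}(I, k, k))  # one k×k ridge system shared by all rows
    for i in 1:size(A_new, 1)
        X_new[:, i] = F \ (Y * vec(A_new[i, :]))          # x_i = argmin ‖Y'x - a_i‖² + λ‖x‖²
    end
    return X_new
end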

documentation needs update / OrdinalHinge constructor

  1. In losses.jl, LogisticLoss is called LogLoss, which is inconsistent with the documentation.
  2. Using the OrdinalHinge loss raises an error because it does not define an empty (zero-argument) constructor, which the copy method in the utilities requires.

call julia from R

Hi

I noticed in todo.md that there is an intent to call Julia from R.

There is an R package, https://github.com/armgong/rjulia, that enables calling Julia from R. Is that a feasible choice for this project? I would like to help if this is still needed.

How to run in a distributed setting?

Sorry for the newbie question: is there a way to run this in a distributed setting?

For example, in Hadoop, or on Spark or in other distributed frameworks?

Thanks.

fit_rdataset.jl fails

When trying to run fit_rdataset.jl (which is not included in Pkg.test("LowRankModels")), this block of code:

# now we'll try it without type imputation
# we'll just fit four of the columns, to try out all four data types
dd = DataFrame([df[s] for s in [:TOD, :Scale, :Vigorous, :Wakeful]])
dd[end] = (dd[end].==1)
datatypes = [:real, :cat, :ord, :bool]

gives the following error:

MethodError: Cannot convert an object of type Array{DataArrays.AbstractDataArray{T,1},1} to an object of type DataFrames.DataFrame
This may have arisen from a call to the constructor DataFrames.DataFrame(...),
since type constructors fall back to convert methods.

in DataFrames.DataFrame(::Array{DataArrays.AbstractDataArray{T,1},1}) at .\sysimg.jl:53

Thank you.
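A possible workaround (untested) might be to select the columns directly, instead of rebuilding the DataFrame from a vector of columns:

dd = df[[:TOD, :Scale, :Vigorous, :Wakeful]]
dd[end] = (dd[end] .== 1)
datatypes = [:real, :cat, :ord, :bool]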

Use Sparse Matrices

I have a few applications that involve very large, very sparse (incomplete), matrices.

At the moment it appears that the entire data matrix, A, is estimated/reconstructed from X'*Y on each gradient step (see here). However, if many entries of A are missing, then only a subset of the X'*Y entries need to be computed to evaluate the loss function.

Would it make sense to make a new file sparse_proxgrad.jl in src/algorithms/ that is called when glrm.A is a sparse matrix?
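The kind of evaluation I mean would look roughly like this (sketch only: evaluate(loss, u, a) follows LowRankModels' loss interface, and obs is the list of observed (i,j) index pairs):

using LinearAlgebra, SparseArrays   # stdlib on Julia >= 0.7; in Base before that

function sparse_objective(X, Y, A::SparseMatrixCSC, obs, losses)
    err = 0.0
    for (i, j) in obs
        u = dot(view(X, :, i), view(Y, :, j))    # compute only the (i,j) entry of X'Y
        err += evaluate(losses[j], u, A[i, j])
    end
    return err
end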

offsets in binary LowRankModels are not the same for different rank-k

The data is binary; I used the logistic loss function and added an offset to the model. According to the definition of the generalized mean u_{j} for different loss functions in your paper, the offset under the logistic loss should be the natural parameter corresponding to the column probability, and this is indeed the case for the rank-1 model. However, when I fit models of different rank k, I get different offsets. I think it is important to get the same offsets for different ranks k, since the null model with only an offset provides the baseline against which to compare models. Do you know how I can get the same offsets? Regards.
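For reference, my reading of the paper is that the offset for a Bernoulli column with empirical mean p_j should be the log-odds (the natural parameter),

$$\mu_j = \operatorname{logit}(p_j) = \log\frac{p_j}{1 - p_j},$$

independent of the rank of the rest of the model.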

Is a nonlinear GLRM good for visualization?

By a nonlinear GLRM I mean using a multilayer perceptron to solve for X and Y in an alternating-minimization fashion. By visualization I mean dimensionality reduction to 2D.

I think a nonlinear GLRM would improve the reconstruction, but do you think it would also improve the low-dimensional representation (the X variable)?

What kind of model (i.e., losses and regularizers) do you think would be good for visualization?

LossFunctions.jl

I think it's worth moving the loss-function implementations from here into LossFunctions.jl. LowRankModels has many of the multivariate and ordinal losses that LossFunctions lacks, as well as functions like M_estimator for centering and scaling, but they would be more widely known if they lived in an explicit loss-functions package.

Stats is a requirement

We need to ensure the Stats package is installed, or else you get:

julia> using LowRankModels
ERROR: Stats not found
in require at loading.jl:47
in include at /Applications/Julia-0.3.0.app/Contents/Resources/julia/lib/julia/sys.dylib
in include_from_node1 at /Applications/Julia-0.3.0.app/Contents/Resources/julia/lib/julia/sys.dylib
in include at /Applications/Julia-0.3.0.app/Contents/Resources/julia/lib/julia/sys.dylib
in include_from_node1 at /Applications/Julia-0.3.0.app/Contents/Resources/julia/lib/julia/sys.dylib
in reload_path at loading.jl:152
in _require at loading.jl:67
in require at loading.jl:51
while loading /Users/feldt/.julia/v0.3/LowRankModels/src/glrm.jl, in expression starting on line 1
while loading /Users/feldt/.julia/v0.3/LowRankModels/src/LowRankModels.jl, in expression starting on line 6

Issue with concatenating regularizer terms

Hi Madeleine,

I am basically trying to use GLRM on a dataset with heterogeneous data, assigning each column of my matrix a its own loss function and regularizers on X and Y. I use the code below to do so:

(m,n) = size(a)
k = 190
losses = [QuadLoss()]
rx = [ZeroReg()]
ry = [ZeroReg()]

for nn = 2:n
	if a[1,nn] == 0 || a[1,nn] == 1
		losses = cat(1, losses, HingeLoss())
		rx = cat(1, rx, QuadReg())
		ry = cat(1, ry, QuadReg())
	else
		losses = cat(1, losses, QuadLoss())
		rx = cat(1, rx, ZeroReg())
		ry = cat(1, ry, ZeroReg())
	end
end

glrm = GLRM(a, losses, rx, ry, k)
X, Y, ch = fit!(glrm)

My issue is that I receive an error on the regularizer terms. The error is:

ERROR: LoadError: MethodError: no method matching LowRankModels.GLRM{L<:LowRankModels.Loss,R<:LowRankModels.Regularizer}(::Array{Any,2}, ::Array{LowRankModels.Loss,1}, ::Array{LowRankModels.Regularizer,1}, ::Array{LowRankModels.Regularizer,1}, ::Int64)

What is strange is that if I assign rx and ry a single regularizer each for the entire matrix, but still pass multiple loss functions, GLRM() actually runs (i.e., the loss terms work perfectly fine when I concatenate them as above; see the call sketched below).

Has anyone experienced this problem and/or come up with a solution for this issue?

Thank you!
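For reference, this is roughly the call that does run for me: per-column losses, but a single regularizer shared across all of X and all of Y:

glrm = GLRM(a, losses, QuadReg(), QuadReg(), k)   # losses built per column as in the loop above
X, Y, ch = fit!(glrm)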

shareglrm `fit!` recomputes X*Y

In the X update:

...
@everywhere begin
            # X update
            gemm!('T','N',1.0,X[:,xlcols],Y.s,0.0,XYX)
...

And then again in the objective evaluation:

...
# evaluate objective 
        obj[1] = 0
        @everywhere begin
            XY = X[:,xlcols]'*Y
...

I see that this second computation has to be done because by the time the objective is computed, X and Y have changed. On the other hand, I don't see why we can't omit the first calculation and just do the second one, holding that value of XY over into the next iteration of the X update. We can add a pre-computation of XY into the initialization block so that it doesn't choke the first time around.

Also, why is gemm! used in the X update, but not in the objective eval? Shouldn't gemm! be faster for both?
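A rough sketch of what I mean: allocate one buffer up front and reuse it for both the X update and the objective evaluation (variable names follow the snippets above; untested, and gemm! is assumed to already be imported, as it is in shareglrm):

# in the initialization block: allocate the product buffer once
XY = zeros(length(xlcols), size(Y, 2))

# objective evaluation: overwrite the buffer in place instead of allocating X[:,xlcols]'*Y
gemm!('T', 'N', 1.0, X[:, xlcols], Y.s, 0.0, XY)

The value left in XY could then be carried into the next X update instead of being recomputed there.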

svd warm start is not working

I have a glrm model that runs fine by itself and also with kmeans++, but breaks when I try to initialize it with svd:

ERROR: BoundsError()
 in checkbounds at abstractarray.jl:65
 in setindex! at multidimensional.jl:61
 in setindex! at sharedarray.jl:216
 in init_svd! at /home/aschuler/.julia/v0.3/LowRankModels/src/initialize.jl:70
 in include at ./boot.jl:245
 in include_from_node1 at loading.jl:128
 in process_options at ./client.jl:285
 in _start at ./client.jl:354
while loading /home/aschuler/LatentSyndromes/src/run.jl, in expression starting on line 56

The problem persists even when I use a simple GLRM:

using LowRankModels
import StatsBase: sample

srand(1)
m, n, k, s = 1000, 1000, 5, 50000
A = randn(m,k)*randn(k,n)
losses = fill(quadratic(), n)
r = quadreg(.1)
obsx = [sample(1:int(m/4), int(s/2)), sample(int(m/4)+1:m, s-int(s/2))]
obsy = sample(1:n, s)
obs = [(obsx[i], obsy[i]) for i=1:s]
glrm = GLRM(A, obs, losses, r, r, k)
@time init_svd!(glrm)
ERROR: BoundsError()
 in checkbounds at abstractarray.jl:65
 in setindex! at multidimensional.jl:61
 in setindex! at sharedarray.jl:216
 in init_svd! at /home/aschuler/.julia/v0.3/LowRankModels/src/initialize.jl:70

However, it works fine using the serial implementation.

Downgrading many packages on install

When I install the package in Julia v0.6.2 it downgrades many other packages. For example:

INFO: Downgrading DataFrames: v0.11.6 => v0.10.1
INFO: Downgrading DataFramesMeta: v0.3.0 => v0.2.0

Any ideas why this might be happening, and suggestions on how to fix or work around it?

Pkg.test on v0.6 fails in Fitting GLRM with "UndefVarError: int not defined"

Using:

> versioninfo()
Julia Version 0.6.0-dev.437
Commit 8d236ea* (2016-08-30 19:45 UTC)
Platform Info:
  System: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

Pkg.add and Pkg.update work fine, then Pkg.test("LowRankModels") fails, in part, with:

Fitting GLRM
obj went up to 1.4417276364138017e6; reducing step size to 0.6666666666666666
obj went up to 147167.67651808265; reducing step size to 0.4666666666666666
Iteration 10: objective value = 2638.1033704712863
obj went up to 4287.300517189247; reducing step size to 0.4377645759375
obj went up to 401.7944286747264; reducing step size to 0.3910975998892637
Iteration 20: objective value = 234.28069178601592
obj went up to 230.37928252754898; reducing step size to 0.31692105135026627
Convergence history:[41117.9,21370.0,14895.9,11458.6,8695.98,6706.84,5198.71,3894.06,2638.1,1766.83,1110.21,654.743,429.906,381.599,364.834,262.088,234.281,226.921,225.621,210.322,206.981,205.839,205.299,205.009,204.846,204.77,204.77]
ERROR: LoadError: LoadError: UndefVarError: int not defined
 in fit_pca_nucnorm_sparse_nonuniform(::Int64, ::Int64, ::Int64, ::Int64) at /home/colin/.julia/v0.6/LowRankModels/test/../examples/simple_glrms.jl:77
 in include_from_node1(::String) at ./loading.jl:481 (repeats 2 times)
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318
while loading /home/colin/.julia/v0.6/LowRankModels/test/../examples/simple_glrms.jl, in expression starting on line 107
while loading /home/colin/.julia/v0.6/LowRankModels/test/runtests.jl, in expression starting on line 4
============================[ ERROR: LowRankModels ]============================

failed process: Process(`/home/colin/Downloads/julia/usr/bin/julia -Cnative -J/home/colin/Downloads/julia/usr/lib/julia/sys.so --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes /home/colin/.julia/v0.6/LowRankModels/test/runtests.jl`, ProcessExited(1)) [1]

================================================================================
ERROR: LowRankModels had test errors
 in #test#61(::Bool, ::Function, ::Array{AbstractString,1}) at ./pkg/entry.jl:740
 in (::Base.Pkg.Entry.#kw##test)(::Array{Any,1}, ::Base.Pkg.Entry.#test, ::Array{AbstractString,1}) at ./<missing>:0
 in (::Base.Pkg.Dir.##2#3{Array{Any,1},Base.Pkg.Entry.#test,Tuple{Array{AbstractString,1}}})() at ./pkg/dir.jl:31
 in cd(::Base.Pkg.Dir.##2#3{Array{Any,1},Base.Pkg.Entry.#test,Tuple{Array{AbstractString,1}}}, ::String) at ./file.jl:59
 in #cd#1(::Array{Any,1}, ::Function, ::Function, ::Array{AbstractString,1}, ::Vararg{Array{AbstractString,1},N}) at ./pkg/dir.jl:31
 in (::Base.Pkg.Dir.#kw##cd)(::Array{Any,1}, ::Base.Pkg.Dir.#cd, ::Function, ::Array{AbstractString,1}, ::Vararg{Array{AbstractString,1},N}) at ./<missing>:0
 in #test#3(::Bool, ::Function, ::String, ::Vararg{String,N}) at ./pkg/pkg.jl:258
 in test(::String, ::Vararg{String,N}) at ./pkg/pkg.jl:258
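For what it's worth, int was removed from Base after Julia 0.4, so the example needs the replacement syntax wherever int appears. Following the similar snippet quoted in the svd warm-start issue above (the exact lines in simple_glrms.jl may differ):

# Julia 0.3/0.4 style
obsx = sample(1:int(m/4), int(s/2))
# Julia 0.5+ style
obsx = sample(1:round(Int, m/4), round(Int, s/2))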

expand_categoricals function not working

I enjoyed the H2O presentation and the paper, given my research interests, but I'm new to Julia.

The expand_categoricals function appears not to be working.
Here's an illustrative example in Julia 0.6.2

julia> using DataFrames
julia> using LowRankModels
julia> df = DataFrame(A = 1:4, gender = ["M", "F", "F", "M"])
4×2 DataFrames.DataFrame
│ Row │ A │ gender │
├─────┼───┼────────┤
│ 1 │ 1 │ M │
│ 2 │ 2 │ F │
│ 3 │ 3 │ F │
│ 4 │ 4 │ M │

julia> expand_categoricals!(df, [:gender])

ERROR: UndefVarError: symbol not defined
Stacktrace:
[1] expand_categoricals!(::DataFrames.DataFrame, ::Array{Int64,1}) at /Users/research/.julia/v0.6/LowRankModels/src/fit_dataframe.jl:250
[2] expand_categoricals!(::DataFrames.DataFrame, ::Array{Symbol,1}) at /Users/research/.julia/v0.6/LowRankModels/src/fit_dataframe.jl:268

Is there a simple solution?
Thanks!
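From a quick look, the error seems to come from the removal of the lowercase symbol function in Julia 0.6, so the call at fit_dataframe.jl:250 presumably just needs to become Symbol(...). Illustrative only, not the package's actual line:

# Julia 0.5 and earlier:  symbol(string(col, "_", level))
# Julia 0.6:              Symbol(string(col, "_", level))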

Penalties/regularization of imputed values

In many experimental contexts (RNA-sequencing is a good example), ground truth data that are below a detection threshold are not observed due to technical error. Thus, we have a data matrix A with many NA entries. However, we suspect many of these NA entries to be small, though not necessarily zero.

For each observed entry A_ij we have a loss function: L[A_ij, dot(x_i,y_j)]

Maybe for each unobserved entry we could add some regularization: R_a[dot(x_i,y_j)]

Where R_a is a function defined by the user... Perhaps R_a[z] = z^2 or R_a[z] = abs(z)

Aside: For the RNA sequencing application, something like R_a[z] = sqrt(z), z > 0 would be interesting (though not convex). Actually even something non-monotonic would be interesting for reasons I won't go into. These are probably too weird/specialized to include, but I would be curious.
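In GLRM notation, the combined objective I have in mind would look something like

$$\operatorname*{minimize}_{X,Y}\;\sum_{(i,j)\in\Omega} L_j\!\left(x_i^\top y_j,\; A_{ij}\right)\;+\;\sum_{(i,j)\notin\Omega} R_a\!\left(x_i^\top y_j\right)\;+\;\sum_i r(x_i)\;+\;\sum_j \tilde r(y_j),$$

where Omega is the set of observed entries and R_a is the user-supplied penalty on imputed entries.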

Load error with v0.4

So I tried upgrading to v0.4 and I've been having trouble loading the package.

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+6354 (2015-07-29 14:17 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit d7351cf (0 days old master)
|__/                   |  x86_64-apple-darwin14.3.0

julia> using LowRankModels
ERROR: LoadError: LoadError: TypeError: apply_type: in Array, expected Type{T}, got Tuple{DataType,DataType}
 in include at ./boot.jl:254
 in include_from_node1 at ./loading.jl:197
 in include at ./boot.jl:254
 in include_from_node1 at ./loading.jl:197
 in require at ./loading.jl:146
while loading /Users/alex/.julia/v0.4/LowRankModels/src/glrm.jl, in expression starting on line 60
while loading /Users/alex/.julia/v0.4/LowRankModels/src/LowRankModels.jl, in expression starting on line 18

The issue relates to the argument list for find_observations(). In v0.4, this doesn't work:

julia> function test(obs::Array{(Int,Int),1})
       end
ERROR: TypeError: apply_type: in Array, expected Type{T}, got Tuple{DataType,DataType}

But this does:

julia> function test(obs::Array{Tuple{Int,Int},1})
       end
test (generic function with 1 method)

However, this second option doesn't seem to work with v0.3, so fixing this would create a compatibility issue.

As a side note... I decided to upgrade to 0.4 in the first place because I was having trouble using the fit! function in 0.3 -- I think the func(;kwargs...) syntax in 0.3 doesn't create a dictionary of keyword args, leading to the following error.

>> fit!(glrm)
ERROR: `keys` has no method matching keys(::Array{Any,1})
 in fit! at /Users/alex/.julia/v0.3/LowRankModels/src/fit.jl:11

It seems like it might be too much work to keep supporting v0.3, but I don't know the timeline for the 0.4 release.
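One possible middle ground: LowRankModels already uses Compat, and I believe Compat's @compat macro can rewrite the new tuple-type syntax on 0.3, so something like the following might keep both versions working (untested sketch):

using Compat
@compat function test(obs::Array{Tuple{Int,Int},1})
end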

Parallelize Everything

I think it would be useful to parallelize all of the code, since parallelized code can still be run on a single processor. We can get rid of glrm altogether, or keep it for pedagogical purposes. For massive datasets, it currently takes forever to run stuff like observations() and we can turn df2array into df2sharedarray in one shot.

@time data = readtable("data/sample_100000");
elapsed time: 29.124417016 seconds (5721166992 bytes allocated, 7.27% gc time)

julia> @time obs = observations(data);
 elapsed time: 547.430868981 seconds (12869023916 bytes allocated, 93.97% gc time)

OrdinalDomain does not accept real-valued ordinal data on an interval of length less than 2

Hi,
OrdinalDomain only takes Ints as its min and max values, but this does not work when the ordinal values are Floats. For example, ordinal values [0.0, 0.1, 0.2, 0.3] result in an inaccurate warning (domains.jl:40) and a subsequent error.

The way I'm thinking of solving this is to define

# Ordinal data should take real values ranging from `min` to `max`
immutable OrdinalDomain<:Domain
	min::Real
	max::Real
	function OrdinalDomain(elements)
		if length(elements) <= 2
			warn("The ordinal variable you've created is degenerate: it has two or fewer levels. Consider using a Boolean variable instead; ordinal loss functions may have unexpected behavior on a degenerate ordinal domain.")
		end
		return new(minimum(elements), maximum(elements))
	end
end

Any comments/concerns on this fix?

Ordinalizing Margin Losses

Right now LowRankModels implements an OrdinalHingeLoss which is related to the HingeLoss. However, as we port the losses over to LossFunctions, I was thinking that this ordinalization is not unique to the HingeLoss, but could be used for any margin loss (LogisticLoss, SquaredHingeLoss, etc.). The Rennie paper that describes the ordinal hinge loss ( http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.124.9242 ) seems to confirm this.

Perhaps we should reimplement it generically across margin losses by using a wrapper type a la ScaledLosses from LossFunctions (I'm thinking OrdinalMarginLoss{T<:MarginLoss, min, max} or something like that).
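A rough sketch of the wrapper I have in mind, purely illustrative: the type and field names are made up, and the evaluate signature follows LowRankModels' existing evaluate(loss, u, a) convention with ±1 labels rather than LossFunctions' API:

immutable OrdinalMarginLoss{T<:Loss} <: Loss
    base::T     # any margin loss, e.g. HingeLoss() or LogisticLoss()
    min::Int
    max::Int
end

# Rennie-style ("all-threshold") ordinalization: sum the base margin loss over every level threshold
function evaluate(l::OrdinalMarginLoss, u::Float64, a::Number)
    total = 0.0
    for t = l.min:(a - 1)
        total += evaluate(l.base, u - t, 1)    # u should lie above threshold t
    end
    for t = a:(l.max - 1)
        total += evaluate(l.base, u - t, -1)   # u should lie below threshold t
    end
    return total
end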

Some regularizers do not work

Hello,

I just noticed something odd in the code and thought it would be helpful to share it with you: the copy function constructs the copy of a regularizer by calling its constructor with no arguments, i.e.,

function copy(r::Regularizer)
  newr = typeof(r)()
  for field in @compat fieldnames(r)
    setfield!(newr, field, copy(getfield(r, field)))
  end
  newr
end

makes it impossible to use regularizers that do require arguments, for instance fixed_latent_features and fixed_last_latent_features.
I guess one could solve this by also copying the constructor arguments?
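One minimal fix that avoids the zero-argument constructor entirely might be (sketch, untested against the package):

copy(r::Regularizer) = deepcopy(r)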

Update the version in METADATA?

The current version that one gets from Pkg.add has failing tests and pulls in a really old version of Optim.jl. I did Pkg.checkout to get the latest version of LowRankModels.jl, and it seems to work fine and pass all the tests. Would it make sense to tag a new version?

Log of test errors on 0.5 based on the latest release: http://pkg.julialang.org/logs/LowRankModels_0.5.log

One regularizer per column for X?

I've noticed that while one can pass an array of regularizers for Y (i.e. one per column of Y), the same is not possible for X. Is there a reason for that? Is there any interest in generalizing the code in that direction? Maybe I could help!

init_svd! fails for rank-one model

Minor issue I happened to come across. I might try to chase this down this afternoon.

using LowRankModels

m = 100
n = 100
k = 1 # changing k to be 2 or 3 fixes this

A = randn(m,k)*randn(k,n)
loss = quadratic()
r = zeroreg()
glrm = GLRM(A,loss,r,r,k)
init_svd!(glrm)

Produces an error:

ERROR: DimensionMismatch("*")
 in gemm_wrapper! at linalg/matmul.jl:270
 in A_mul_Bt at linalg/matmul.jl:141
 in A_mul_Bc at linalg/matmul.jl:172
 in init_svd! at /Users/alex/.julia/v0.3/LowRankModels/src/initialize.jl:74

This seems to only be an issue for rank-one models (changing k to be 2 or 3 fixes it).

Pkg.test on v0.4.6 fails with "MethodError: `getindex` has no method matching getindex"

Using:

> versioninfo()
Julia Version 0.4.6
Commit 2e358ce* (2016-06-19 17:16 UTC)
Platform Info:
  System: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: liblapack.so.3
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Pkg.add and update run fine, but then Pkg.test("LowRankModels") fails with (partial output below; I can gist the entire output if helpful):

while loading /home/colin/.julia/v0.4/LowRankModels/test/../examples/fit_rdataset.jl, in expression starting on line 8
ERROR: LoadError: LoadError: MethodError: `getindex` has no method matching getindex(::Tuple{DataType,DataType})
Closest candidates are:
  getindex(::Tuple, ::Int64)
  getindex(::Tuple, ::Real)
  getindex(::Tuple, ::AbstractArray{Bool,N})
  ...
 in df_observations at /home/colin/.julia/v0.4/LowRankModels/src/fit_dataframe.jl:41
 in GLRM at /home/colin/.julia/v0.4/LowRankModels/src/fit_dataframe.jl:29
while loading /home/colin/.julia/v0.4/LowRankModels/test/../examples/fit_rdataset.jl, in expression starting on line 8
while loading /home/colin/.julia/v0.4/LowRankModels/test/runtests.jl, in expression starting on line 5
============================[ ERROR: LowRankModels ]============================

failed process: Process(`/usr/bin/julia --check-bounds=yes --code-coverage=none --color=yes /home/colin/.julia/v0.4/LowRankModels/test/runtests.jl`, ProcessExited(1)) [1]

================================================================================
ERROR: LowRankModels had test errors
 in test at pkg/entry.jl:803
 in anonymous at pkg/dir.jl:31
 in cd at file.jl:22
 in cd at pkg/dir.jl:31
 in test at pkg.jl:71

High GC (with shareglrm)

Hey, I'm getting some runtimes (using @time) of the parallelized version of fit!() that look like this:
(1000 observations of 550 features)
elapsed time: 12.81354915 seconds (592596512 bytes allocated, 5.76% gc time)

(10000 observations of 550 features)
elapsed time: 42.705608708 seconds (2916525292 bytes allocated, 42.89% gc time)

(50000 observations of 550 features)
elapsed time: 160.638440166 seconds (11031343640 bytes allocated, 51.76% gc time)

(100000 observations of 550 features)
elapsed time: 1421.327476493 seconds (32386723552 bytes allocated, 88.19% gc time)

This is using a machine with 10 processors.

Also, when I look at top, I see that the vast majority of the time the process is only running on one CPU. Every now and again I'll see the parallelization (suddenly multiple julia processes all taking ~100% CPU), but then it all goes back to a single CPU. This happens less frequently when I'm running on the larger dataset, which could be reflective of the same thing I'm seeing with the gc time.

This doesn't happen with the serial implementation.

(10000 observations of 550 features)
obj went up to 8.760023035973597e6; reducing step size to 0.7000000000000001
obj went up to 8.1832688405943755e6; reducing step size to 0.46666666666666673
obj went up to 7.83172884385661e6; reducing step size to 0.31111111111111117
obj went up to 7.611766313323507e6; reducing step size to 0.15555555555555559
obj went up to 7.403213032260401e6; reducing step size to 0.051851851851851864
obj went up to 7.270450432882523e6; reducing step size to 0.012962962962962966
obj went up to 7.221932865355106e6; reducing step size to 0.0025925925925925934
elapsed time: 88.786999616 seconds (12047987036 bytes allocated, 9.12% gc time)

Something wrong with unitonesparse() and kmeans

Edit: Reverting back to this commit fixes things: ed9e680

I'm not sure when this happened, but the unitonesparse() regularizer doesn't seem to be working correctly. All the columns of X have two nonzero elements (both equal to 1.0).

julia> include("simple_glrms.jl");
julia> A,X,Y,ch = fit_kmeans(50,50,3);
julia> display(sum(X,1))
1x50 Array{Float64,2}:
 2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  …  2.0  2.0  2.0  2.0  2.0  2.0  2.0

julia> display(X[:,3:5])
7x3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 1.0  1.0  1.0
 0.0  0.0  0.0
 1.0  1.0  1.0

Interestingly, the objective function is not infinite:

julia> ch.objective[end]
6.215511635538903e6

Yet the evaluate function correctly returns Inf when a column of X is passed with the appropriate regularizer:

julia> evaluate(unitonesparse(),X[:,1])
Inf

Specifying MultinomialLoss() for multiple columns

When specifying the loss function for several different categorical variables with varying numbers of factors, I encountered the following error, which can be traced back to src: "Y must be of size (k,d) where d is the sum of the embedding dimensions of all the losses. (1 for real-valued losses, and the number of categories for categorical losses)".

The error is generated when size(Y) != (k, sum(map(embedding_dim, losses))). However, the size of Y is determined by Y = randn(k, embedding_dim(losses)), so if you specify multiple loss functions there is bound to be a mismatch, because without the map call, embedding_dim extracts the number of categories from only the first loss function.

If this is a misunderstanding, I apologize.
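If I'm reading the code right, the fix would just be to construct Y with the summed embedding dimension so that it matches the check, i.e. something like:

Y = randn(k, sum(map(embedding_dim, losses)))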

fit_rdataset.jl does not run

I get the following error when running the example in examples/fit_rdataset.jl:

julia> auto_glrm = GLRM(df, 3) ;
WARNING: [a,b,...] concatenation is deprecated; use [a;b;...] instead
 in depwarn at deprecated.jl:73
 in oldstyle_vcat_warning at ./abstractarray.jl:29
 in GLRM at /root/.julia/v0.4/LowRankModels/src/fit_dataframe.jl:22
while loading no file, in expression starting on line 0
WARNING: [a,b,...] concatenation is deprecated; use [a;b;...] instead
 in depwarn at deprecated.jl:73
 in oldstyle_vcat_warning at ./abstractarray.jl:29
 in GLRM at /root/.julia/v0.4/LowRankModels/src/fit_dataframe.jl:23
while loading no file, in expression starting on line 0
ERROR: MethodError: `getindex` has no method matching getindex(::Tuple{DataType,DataType})
Closest candidates are:
  getindex(::Tuple, ::Int64)
  getindex(::Tuple, ::Real)
  getindex(::Tuple, ::AbstractArray{Bool,N})
  ...
 in df_observations at /root/.julia/v0.4/LowRankModels/src/fit_dataframe.jl:41
 in GLRM at /root/.julia/v0.4/LowRankModels/src/fit_dataframe.jl:29

The example works just fine if I use an earlier version of LowRankModels.jl (specifically, 676dfc0). Removing include("scikitlearn.jl") from src/LowRankModels.jl fixes things, but I'm afraid I don't know enough of the Julia language to know why or offer a fix.

Logistic() overwritten by using DataFrames

Just wanted to highlight this annoyance:

using LowRankModels
using DataFrames
logistic(1.0) # returns 0.731 instead of logistic(1.0,BoolDomain())
logistic() # ERROR: `logistic` has no method matching logistic()

This is the problematic function: https://github.com/JuliaStats/StatsFuns.jl/blob/9991e25012e29cf4bc1f081eec4ed9a1a6874291/src/basicfuns.jl#L16

I'm not sure why DataFrames needs to import all of StatsBase... but seeing as many users will likely be using one or both of these packages in conjunction with LowRankModels, this could be a problem.

Should we consider changing names: logistic() ==> logloss(), quadratic() ==> quadloss(), etc...

Using LowRankModels, DataFrames does not import fit!(g::GLRM) into the namespace

julia> using DataFrames, LowRankModels

julia> methods(fit!)
#1 method for generic function "fit!":
fit!(glrm::GLRM) at /Users/aschuler/.julia/v0.3/LowRankModels/src/glrm.jl:171

but (note the order of the using statement):

julia> using LowRankModels, DataFrames

julia> methods(fit!)
#1 method for generic function "fit!":
fit!(obj::StatisticalModel,data...) at /Users/aschuler/.julia/v0.3/StatsBase/src/statmodels.jl:14
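In the meantime, qualifying the call avoids depending on the using order:

LowRankModels.fit!(glrm)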

Non-negative Matrix Factorization Example Fails

using LowRankModels
A,X,Y,ch = fit_nnmf(100,50,2)

Produces the following error for me:

ERROR: `similar` has no method matching similar(::StridedView{Float64,2,0,Array{Float64,2}}, ::Type{Bool}, ::(Int64,Int64))
 in similar at abstractarray.jl:116
 in map at abstractarray.jl:1329
 in evaluate at /home/alex/.julia/v0.3/LowRankModels/src/loss_and_reg.jl:217
 in objective at /home/alex/.julia/v0.3/LowRankModels/src/glrm.jl:49
 in objective at /home/alex/.julia/v0.3/LowRankModels/src/glrm.jl:37
 in objective at /home/alex/.julia/v0.3/LowRankModels/src/glrm.jl:57
 in fit! at /home/alex/.julia/v0.3/LowRankModels/src/glrm.jl:108
 in fit_nnmf at none:7

The other examples in simple_glrms.jl seem to work for me though.

Working with the "psych" RDataset and converting NaN to NA in a DataFrame

Hi,

Just wanted to let you know that the basic example in the README file is not working right now because of the presence of NaN values. Currently running

import RDatasets
df = RDatasets.dataset("psych", "msq")
glrm, labels = GLRM(df,2)

returns the error "ERROR: Observed value in entry (1, 9) is NaN."

It would be useful to have an option in the GLRM command to treat NaN as NA. In the absence of that, could you let me know how to convert NaN values to NA in a Julia DataFrame? It seems to be a non-trivial problem (I unsuccessfully tried many variants of "df[isnan(df)] = NA").

Thanks!
Nandana
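Concretely, I was hoping something like this cell-by-cell sweep would work (untested sketch for the old DataArrays-based DataFrames):

for col in names(df)
    if eltype(df[col]) <: AbstractFloat
        for i in 1:size(df, 1)
            if !isna(df[i, col]) && isnan(df[i, col])
                df[i, col] = NA    # flag NaN entries as missing
            end
        end
    end
end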

Robustify to NaNs

NaNs in the data set introduce NaNs in the low-rank model. Let's check for NaNs and exclude them from the observed entries (and give a warning), roughly as sketched below.
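A minimal sketch of the observation filtering (the warning and the integration into the GLRM constructor are left out):

obs = Tuple{Int,Int}[]
for j in 1:size(A, 2), i in 1:size(A, 1)
    isnan(A[i, j]) || push!(obs, (i, j))   # keep only non-NaN entries as observed
end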

shareglrm doesn't play nice with fit_dataframe

shareglrm wants ry to be a single regularizer, whereas fit_dataframe expects to use glrm, which assumes ry::Array{Regularizer,1}; as a result, equilibrate_variance and offset don't work with shareglrm.

Reproducing PCA results

I can't reproduce PCA outcomes from other libraries with glrm. Am I doing something wrong?

using LowRankModels: pca, fit!, ProxGradParams
using RDatasets: dataset

iris = dataset("datasets", "iris")
norm_rows(mat) = mat ./ sqrt(sum(mat .^ 2, 2))
X_iris = convert(Matrix, iris[[:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
srand(33)
glrm_pca = pca(X_iris, 2)
fit!(glrm_pca, ProxGradParams(max_iter=1000, convergence_tol=1.e-20))
norm_rows(glrm_pca.Y)

> 2x4 Array{Float64,2}:
>   0.802641   0.560961   0.201397   0.0229959
>  -0.622311  -0.187197  -0.711173  -0.268178 

Both scikit-learn-python and MultivariateStats.jl return

2x4 Array{Float64,2}:
  0.361387  -0.0845225  0.856671  0.358289
 -0.656589  -0.730161   0.173373  0.075481

v0.5 Compatibility

Many of the DataFrame helper functions don't work in v0.5. For example, the following snippet produces an error (it assumes that RDatasets and LowRankModels are imported):

df = RDatasets.dataset("psych", "msq")
observations(df)

Gives the following error:

ERROR: MethodError: no method matching getindex(::Tuple{DataType,DataType})
 in df_observations(::DataFrames.DataFrame) at /home/mihir/.julia/v0.5/LowRankModels/src/fit_dataframe.jl:41
 in observations(::DataFrames.DataFrame) at /home/mihir/.julia/v0.5/LowRankModels/src/fit_dataframe.jl:39

Detecting and dealing with Categorical Data

ISSUE 1:
Currently when running
glrm, labels = GLRM(df, k)
on a dataset 'df' which has categorical variables, the GLRM is run ignoring these variables. The only way to figure out that this is happening from the user's end is to check length(labels).

DESIRED BEHAVIOR:
A warning message informing the user that variables in df are being ignored and suggesting the use of the expand_categoricals() function.

ISSUE 2:
Currently when running
df2 = expand_categoricals(df, v)
where df is a DataFrame and v is a vector with column indices, the command returns a Match error. It runs only after I run
v = convert(Array, v)

DESIRED BEHAVIOR:
Automatically convert the column-indices vector to an Array, or accept general vectors in the expand_categoricals() function.

Shareglrm "broadcasting" overwrites variables

oe = 10 # some variable I was using earlier


A = ones(5,5)
A::Array # A is clearly an array here
glrm = GLRM(A, losses, rx, ry, k) # define some losses, regs, etc.
X,Y,ch = fit!(glrm)

A::Array # This throws an error because A has been overwritten by fit!() in this scope during broadcasting of variables to all processes!
oe == 10 # this returns FALSE because oe has also been overwritten!

Use Traits?

Consider using traits for losses and regularizers, in order to ensure that newly added losses and regularizers implement the required methods (evaluate, grad, prox, etc.). It might also be more efficient than the current implementation.

Working on randomly generated data returns 0 element Arrays for labels and glrm.ry

Hi again,

I was trying to debug my analysis by working with a randomly generated dataset:

A = rand(100,2)*rand(2,100)
A = convert(DataFrame, A)
glrm, labels = GLRM(A, 2)

However, the corresponding labels and glrm.ry objects are empty:

julia> labels
0-element Array{Symbol,1}

julia> glrm.ry
0-element Array{Regularizer,1}

It would be great if you could help me understand why this is happening.

Also, could you suggest a base dataset that could be used to explore the different aspects of the LowRankModels code, like the loss and regularizer options? I tried using the "psych" dataset referred to in the README as well as this randomly generated matrix, and ran into issues with both.

Thanks a lot,
Nandana

Regularizers for Soft K-means

I've implemented some new regularizers that let you move between a soft and hard clustering. Here is a quick demo/explanation:

http://nbviewer.ipython.org/github/ahwillia/notebooks/blob/master/code_pubs/2015_08_11_regularized_soft_kmeans.ipynb

Is this too specialized to include in the repository? The code is pretty short so I've just pasted it below. Let me know if you want me to open a PR.

## indicator of vectors in the simplex, plus a penalty on Shannon entropy
## (intuition: soft k-means with encouraged sparseness)
type entr_simplex<:Regularizer
    scale::Float64
end
function evaluate(r::entr_simplex,a::AbstractArray)
    evaluate(simplex(),a) == Inf && return Inf # simplex constraint
    b = length(a) # base for entropy calculation (normalizes: 0<=entropy<=1)
    return r.scale*entropy(a,b) # penalize entropy if constraint satisfied
end
function prox!(r::entr_simplex,u::AbstractArray,alpha::Number)
    prox!(simplex(),u,alpha) # first project onto unit simplex
    b = length(u)    # base for entropy calculation (normalizes: 0<=entropy<=1)
    g = -log(b,u)-1  # gradient of entropy

    # project entropy gradient onto simplex, ignoring Infs
    gs = g - mean(g[g.!=Inf]) 
    for i = 1:length(u)
        u[i] = (g[i]==Inf) ? 0.0 : u[i] - r.scale*alpha*gs[i]
    end

    # make sure we didn't step off the simplex before returning
    prox!(simplex(),u,alpha)
end
entr_simplex() = entr_simplex(1)

## indicator of vectors in the simplex, plus a penalty on
## the l1 distance to the center of the simplex
## (intuition: soft k-means with encouraged sparseness)
type dist_simplex<:Regularizer
    scale::Float64
    k::Int # rank of model (store up front)
    d::Float64 # distance from corner to center (calculate up front)
end
function evaluate(r::dist_simplex,a::AbstractArray)
    evaluate(simplex(),a) == Inf && return Inf # simplex constraint
    dist = sum(abs(a-ones(r.k)/r.k)) # distance from center
    return r.scale*(1 - (dist/r.d))  # penalize dist from corners
end
function prox!(r::dist_simplex,u::AbstractArray,alpha::Number)
    prox!(simplex(),u,alpha) # first project onto unit simplex

    # Calculate gradient
    shrink_step = -r.scale*alpha/(r.k-1)
    imax = indmax(u)
    for i = 1:length(u)
        if i == imax
            u[i] += r.scale*alpha
        else
            u[i] += shrink_step
        end
    end

    # make sure we didn't step off the simplex before returning
    prox!(simplex(),u,alpha)
end
function dist_simplex(s::Float64,k::Int)
    # k is the rank of the data
    e = zeros(k); e[1] = 1.0  # corner of simplex
    c = ones(k)/k             # center of simplex
    return dist_simplex(s,k,sum(abs(c-e)))
end
dist_simplex(k::Int) = dist_simplex(1.0,k)

Quadratic Regularization Scale

Something small is bugging me... Shouldn't

prox(r::quadreg,u::AbstractArray,alpha::Number) = 1/(1+alpha*r.scale/2)*u

Actually be:

prox(r::quadreg,u::AbstractArray,alpha::Number) = 1/(1+alpha*r.scale*2)*u

My reasoning: [screenshot of the derivation, 2015-08-26]

Sorry if this is a dumb math error or misunderstanding on my part
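In text form (assuming quadreg with scale s evaluates to s*||u||_2^2; if it's instead defined as (s/2)*||u||_2^2, the existing /2 would be right):

$$\operatorname{prox}_{\alpha r}(u) \;=\; \arg\min_x \tfrac{1}{2}\|x-u\|_2^2 + \alpha s \|x\|_2^2 \;\;\Longrightarrow\;\; (x - u) + 2\alpha s\,x = 0 \;\;\Longrightarrow\;\; x = \frac{u}{1 + 2\alpha s}.$$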

NNMF intialization

I was looking into implementing Nonnegative Double Singular Value Decomposition (NNDSVD) to initialize NNMF problems. There is already an implementation of this in NMF.jl, which we could follow very closely.

My only hesitation is that I'm not exactly sure how this would apply to a table with mixed data types. For the SVD initialization, we first normalize each column of A to have zero mean and unit variance... But this wouldn't be ideal if the data are meant to be strictly positive. Maybe we can just apply NNDSVD without any normalization?
