
R1 Regularization (knet.jl issue, open)

denizyuret opened this issue on June 6, 2024

Comments (9)

denizyuret commented on June 6, 2024

Looking at the error message more carefully, it seems to be trying to find the gradient of uncat wrt its 4th argument. The signature for uncat is: uncat(dy, argn, dims, x...). Its operation can be described as follows: cat concatenates a bunch of x's into a y. In the backward pass we receive dy, the gradient of the loss wrt y. Uncat takes this dy and extracts the region that corresponds to the argn'th input argument. It is basically an indexing operation into dy. Therefore only the first argument affects its return value; the x's only determine the shape of the return value. The derivative of uncat wrt any argument other than its first is 0. We never defined those derivatives because under normal (first-order) use back(::uncat, ...) never gets called with argn != 1.

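To make the indexing picture concrete, here is a minimal 1-D sketch of what uncat does for vcat (an illustration only, not the actual AutoGrad code, which also handles the dims argument):

# Minimal sketch: extract the slice of dy that belongs to the argn'th input of vcat.
function uncat_sketch(dy, argn, x...)
    offset = sum(length, x[1:argn-1]; init = 0)   # total length of the inputs before argn
    return dy[offset+1 : offset+length(x[argn])]
end

dy = collect(1.0:5.0)        # pretend this is the gradient of y = vcat(a, b)
a, b = zeros(2), zeros(3)
uncat_sketch(dy, 2, a, b)    # -> [3.0, 4.0, 5.0], the region that belongs to b
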
Now I don't quite understand why the second order code calls uncat's back method for the fourth argument. But assuming it does so for legitimate reasons, the fix is simple. Just define:

AutoGrad.back(::typeof(AutoGrad.uncat), ::Type{AutoGrad.Arg{N}}, dy, y, x...) where {N} = nothing
AutoGrad.back(::typeof(AutoGrad.uncat1), ::Type{AutoGrad.Arg{N}}, dy, y, x...) where {N} = nothing

as a catch-all for any derivative request for any argument other than the first. And see if the code works with this. If it does I will add this definition to core.jl.

You can try the following version of AutoGrad which includes the above fix:

pkg> add AutoGrad#dy/fix671

Kausta commented on June 6, 2024

I wrote a minimal working example to test the issue:

using Knet
using Statistics: mean
atype = Knet.atype()

# A simple model for the example
struct Linear; w; b; end
Linear(in_dim::Int, out_dim::Int) = Linear(param(out_dim,in_dim,atype=atype), param0(out_dim,atype=atype))
(l::Linear)(x) = l.w * x .+ l.b

struct Model; lin1; lin2; lin3; end
Model(in_dim1::Int,in_dim2::Int) = Model(Linear(in_dim1, 1), Linear(in_dim2, 1), Linear(2, 1))
function (m::Model)(x, y)
    out1 = m.lin1(x)
    out2 = m.lin2(y)
    outc = vcat(out1, out2)
    return m.lin3(outc)
end

# A sample loss function
function loss(model, x, y)
    out = model(x, y)
    loss = mean(out)
    
    gradfn = grad(t -> sum(model(t, y)))
    grad_out = gradfn(x)
    loss += sum(abs2.(grad_out)) / size(x)[end]
    
    return loss
end

x = convert(atype, randn(10, 8))
y = convert(atype, randn(5, 8))
model = Model(10, 5)

L = @diff loss(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

With AutoGrad 1.2.4, differentiating loss produces the following error, as expected:

ERROR: LoadError: MethodError: no method matching back(::typeof(AutoGrad.uncat), ::Type{AutoGrad.Arg{4}}, ::Knet.KnetArrays.KnetMatrix{Float32}, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::Int64, ::Int64, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, ::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}})

With AutoGrad#dy/fix671, it works and outputs the following as expected:

(value(L), grad(L, model.lin1.w)) = (4.3223014f0, K32(1,10)[0.32105368⋯])

However, the gradients are the same even if we don't include the following block:

gradfn = grad(t -> sum(model(t, y)))
grad_out = gradfn(x)
loss += sum(abs2.(grad_out)) / size(x)[end]

Moreover, the following outputs nothing:

L = @diff sum(abs2.(grad(t -> sum(model(t, y)))(x))) / size(x)[end]
@show grad(L, model.lin1.w)

Hence, it now runs without the MethodError; however, I don't think it computes any second-order gradients. Is it possible that the newly defined back functions are too generic and always get used as the gradient of uncat?

denizyuret commented on June 6, 2024

First, mixing the old grad interface (i.e. grad(f)) with the new grad interface (i.e. @diff and grad(result, param)) is not well tested, and part of the problem seems to come from mixing the two. So if you can find a way to express the computation using only the new interface, that could solve the problem.

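For example, the nested pattern looks like this (a minimal sketch; xp, g, gx, and penalty are illustrative names):

xp = Param(x)                  # wrap the input we differentiate with respect to
g  = @diff sum(model(xp, y))   # inner (first-order) tape
gx = grad(g, xp)               # gradient of the inner loss w.r.t. xp
penalty = sum(abs2.(gx))       # use gx in the outer loss; @diff of the outer loss
                               # then gives the second-order gradients
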
Nevertheless I am also trying to figure out what goes wrong when we do mix the two interfaces. I found two problems and pushed a new update to the dy/fix671 branch:

  1. The old grad function got confused when there was more than one Param in the computation; this should be fixed now.

  2. This is more difficult: there was a PR (denizyuret/AutoGrad.jl#75) for fixing "tape confusion", which I understood at some point but have now forgotten what the problem was. The change is at https://github.com/denizyuret/AutoGrad.jl/blob/1daede9b3215c170b5f9f0860042dca39c54805f/src/core.jl#L135 to L139, which is commented out in dy/fix671. What this code does is: if there are multiple tapes, it duplicates the Params and Results using the identity function. When I comment it out, your code seems to work. However, I presume it was added for a reason and was fixing some other problem which I have now broken, so this needs to be investigated a bit more.

Kausta commented on June 6, 2024

It now works both when using only the @diff interface and when mixing the two interfaces. I updated the MWE as follows to first test using only @diff:

using Knet
using Statistics: mean
atype = Knet.atype()

# A simple model for the example
struct Linear; w; b; end
Linear(in_dim::Int, out_dim::Int) = Linear(param(out_dim,in_dim,atype=atype), param0(out_dim,atype=atype))
(l::Linear)(x) = l.w * x .+ l.b

struct Model; lin1; lin2; lin3; end
Model(in_dim1::Int,in_dim2::Int) = Model(Linear(in_dim1, 1), Linear(in_dim2, 1), Linear(2, 1))
function (m::Model)(x, y)
    out1 = m.lin1(x)
    out2 = m.lin2(y)
    outc = vcat(out1, out2)
    return m.lin3(outc)
end

# Loss1: Only first order, Loss2: first+second order, test: only second order
function loss1(model, x, y)
    out = model(x, y)
    return mean(out)
end

function loss2(model, x, y)
    out = model(x, y)
    loss = mean(out)
    
    xp = isa(x, Param) ? x : Param(x)
    g = @diff sum(model(xp, y))
    grad_out = grad(g, xp)
    loss += sum(abs2.(grad_out)) / size(x)[end]
    
    return loss
end

function test(model, x, y)
    xp = Param(x)
    g = @diff sum(model(xp, y))
    grad_out = grad(g, xp)
    return sum(abs2.(grad_out)) / size(x)[end]
end

x = convert(atype, randn(10, 8))
y = convert(atype, randn(5, 8))
model = Model(10, 5)

L = @diff loss1(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

L = @diff loss2(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

L = @diff test(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

grad_result = @Knet.gcheck loss2(model, Param(x), y) (verbose=1,)
println("gcheck result: $grad_result")

and it works without an error using the AutoGrad#dy/fix671 branch. We get the output:

(value(L), grad(L, model.lin1.w)) = (-0.1450544f0, K32(1,10)[-0.06908582⋯])
(value(L), grad(L, model.lin1.w)) = (-0.033414274f0, K32(1,10)[-0.017087717⋯])
(value(L), grad(L, model.lin1.w)) = (0.111640126f0, K32(1,10)[0.051998105⋯])
gcheck result: true

Moreover, the gradients are no longer nothing, and gcheck also reports correct gradients.

In addition, the fix for the mixed interface also seems to work for this test case. By adding the following code:

function loss_mixed_interface(model, x, y)
    out = model(x, y)
    loss = mean(out)
    
    gradfn = grad(t -> sum(model(t, y)))
    grad_out = gradfn(x)
    loss += sum(abs2.(grad_out)) / size(x)[end]
    
    return loss
end

function test_mixed_interface(model, x, y)
    gradfn = grad(t -> sum(model(t, y)))
    grad_out = gradfn(x)
    return sum(abs2.(grad_out)) / size(x)[end]
end

L = @diff loss_mixed_interface(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

L = @diff test_mixed_interface(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing

we get the following additional output:

(value(L), grad(L, model.lin1.w)) = (-0.033414274f0, K32(1,10)[-0.017087717⋯])
(value(L), grad(L, model.lin1.w)) = (0.111640126f0, K32(1,10)[0.051998105⋯])

which agree with the results from using only the @diff interface.

Although it now works for higher-order gradients, I think this reintroduces the bug from denizyuret/AutoGrad.jl#75, as I get true for both of the following statements (the inner gradient d(x+y)/dy is identically 1, so both outer functions reduce to x -> x and should have derivative 1; the first expression evaluating to 2 is the tape-confusion symptom):

grad(x -> x*grad(y -> x+y)(x))(5.0) == 2
grad(x -> x*grad(y -> x+y)(1x))(5.0) == 1

I am trying to understand why the fix for denizyuret/AutoGrad.jl#75 breaks higher-order gradients with the mixed interface, and I will update if I find a solution. I will also re-check whether the @diff-only version works without removing the tape-confusion bug-fix; in that case, requiring the same interface throughout the code could also be an option for now.

BariscanBozkurt commented on June 6, 2024

I came across a very similar error while implementing an Implicit-GON (Gradient Origin Network) model for an implicit learning task. pkg> add AutoGrad#dy/fix671 seems to fix the problem for small working examples. After this fix I tried to debug my implementation with a low-dimensional toy dataset, and it worked fine. However, for high-dimensional data I could not obtain an output for nearly 10 minutes and stopped the code. I will share my MWEs below for a detailed explanation.

As I mentioned in the previous issue #670, I am trying to take the derivative of a loss function that involves two forward passes, which leads to a second-order derivative. In the following MWE, I want to take the derivative of the loss_train(theta, x) function: I first feed the origin to the model and take the negative gradient of the MSE loss w.r.t. this origin as my new latent point. I then feed this new latent point to the model and compute the MSE. I am able to take the gradient of loss_train(theta, x) in this example; note that the dimensions are very small (num_latent is 2, batch_size is 3, etc.).

using Knet
using Statistics: mean
atype = Knet.atype()

Knet.seed!(0)

function batched_linear(theta, x_in; atype = KnetArray{Float32})
#     """
#     multiply a weight matrix of size (O, I) with a batch of matrices 
#     of size (I, W, B) to have an output of size (O, W, B), 
#     where B is the batch size.
    
#     size(theta) = (O, I)
#     size(x_in) = (I, W, B)
#     """
    o = size(theta,1)
    w = size(x_in, 2)
    b = size(x_in, 3)
    x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
    out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
    return out
end

function get_mgrid(sidelen) # Create a grid
    iterator = (range(-1,stop=1,length = sidelen))
    return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end

function model_forw(theta, z) #Forward implementation of the model
    # It is kind of a decoder model where we try to reconstruct a 
    # target by using z_in 
    z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
    z_in = cat(c, z_rep, dims = 3)
    z_in = (permutedims(z_in, (3,2,1)))
    z = batched_linear(theta, z_in) .+ 0.001
end

function loss(theta, z, x) # Compute mean squared error loss
    x_hat = model_forw(theta, z)
    L = mean(sum((x_hat- x).^2, dims = 2))
end

function loss_train(theta,x)
    z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
    derivative_origin = @diff loss(theta, z, x) # Feed zero latent to model and take the gradient w.r.t. it
    z = -grad(derivative_origin, z) # New latent point as negative gradient
    x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
    L = mean((x_hat- x).^2) # Compute mean squared error loss
end

num_latent = 2
i = 4
o = 1
w = 4
batch_size = 3

x = atype(randn(o,w,batch_size)) # Target
theta = Param(atype(randn(o,i))) # Model Weight
mgrid = get_mgrid(2) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
# z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type

derivative_model = @diff loss_train(theta,x) # Differentiate the loss_train.
# It is working in this example

However, if I use higher-dimensional data and more layers in the model, as in the following modification of the above MWE, it takes too long and I cannot obtain an output within 10 minutes, so I stop the execution of the code. My implementation includes lots of cat operations, reshaping, and permuting of dims, but I am not sure whether these operations are what slows down taking the derivative.

using Knet
using Statistics: mean
atype = Knet.atype()

Knet.seed!(0)

function batched_linear(theta, x_in; atype = KnetArray{Float32})
#     """
#     multiply a weight matrix of size (O, I) with a batch of matrices 
#     of size (I, W, B) to have an output of size (O, W, B), 
#     where B is the batch size.
    
#     size(theta) = (O, I)
#     size(x_in) = (I, W, B)
#     """
    o = size(theta,1)
    w = size(x_in, 2)
    b = size(x_in, 3)
    x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
    out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
    return out
end

function get_mgrid(sidelen) # Create a grid
    iterator = (range(-1,stop=1,length = sidelen))
    return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end

function model_forw(theta, z) #Forward implementation of the model
    # It is kind of a decoder model where we try to reconstruct a 
    # target by using z_in 
    z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
    z_in = cat(c, z_rep, dims = 3)
    z_in = (permutedims(z_in, (3,2,1)))
    z = batched_linear(theta[1], z_in) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[2], z) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[3], z) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[4], z)
end

function loss(theta, z, x) # Compute mean squared error loss
    x_hat = model_forw(theta, z)
    L = mean(sum((x_hat- x).^2, dims = 2))
end

function loss_train(theta,x)
    z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
    derivative_origin = @diff loss(theta, z, x) # Feed zero latent to model and take the gradient w.r.t. it
    z = -grad(derivative_origin, z) # New latent point as negative gradient
    x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
    L = mean((x_hat- x).^2) # Compute mean squared error loss
end

num_latent = 32
i = 34
o1 = 256
o2 = 256
o3 = 256
o4 = 1
w = 784
batch_size = 64

x = atype(randn(o4,w,batch_size)) # Target
# Model Weights : theta1, ..., theta4
theta1 = Param(atype(randn(o1,i)))
theta2 = Param(atype(randn(o2,o1)))
theta3 = Param(atype(randn(o3,o2)))
theta4 = Param(atype(randn(o4,o3)))
# Model Weight List
theta = []
push!(theta, theta1)
push!(theta, theta2)
push!(theta, theta3)
push!(theta, theta4)

mgrid = get_mgrid(28) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
derivative_origin = @diff loss(theta, z, x) # This works fine
println(derivative_origin)
derivative_model = @diff loss_train(theta,x) # This might work but takes too much time (I waited for 10 min and did not obtain an output)

Is there any implementation detail I am missing that makes my code run extremely slowly? Even taking the derivative of loss(theta, z, x) takes 2-3 seconds. Also, I don't think the sine activations inside the model's forward pass are slowing things down: I cannot obtain an output even if I remove them.

Kausta commented on June 6, 2024

@BariscanBozkurt, can you try it with a smaller batch size for at least 2 iterations ? In my case, the first past through the model takes significantly longer. Currently, first 10 iterations complete in approximately 100 seconds, whereas 10 iterations take approximately 15 seconds. If we assume only the first iteration is slow ( which I suspect is due to precompilation ), than it would imply that first iteration takes approximately 135 seconds whereas the other iterations take 1.5 seconds. In other words, the first iteration is approximately 90 times slower. Maybe it's the case for your model too and it would speed up significantly after the first iteration.

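A quick way to check this (illustrative, using the names from your MWE) is to time the first call and a second call separately, since the first one includes compilation:

@time @diff loss_train(theta, x)   # first call: dominated by compilation
@time @diff loss_train(theta, x)   # subsequent calls: steady-state cost per iteration
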

BariscanBozkurt commented on June 6, 2024

Hi @Kausta, thank you for your quick reply. I think I understand the problem now. It is not about compilation, since the other iterations also take a very long time. It is most probably due to the custom batched_linear(theta, x_in; atype = KnetArray{Float32}) function. I think taking the second-order derivative of a function that goes through batched_linear twice is very slow, because AutoGrad has to work out the second derivative of this custom function. Taking the derivative of loss(theta, z, x) after compilation is fast, and it involves only one pass of the model. However, I do not know how to make the second-order derivative faster. In PyTorch, the default matrix multiplication can perform such a vectorized multiplication for each element of the batch; since Julia's matrix multiplication does not support that, I had to write it myself, and it apparently slows everything down significantly.

BariscanBozkurt commented on June 6, 2024

Disregard my previous comment. In my second example, the hcat call inside model_forw(theta, z) concatenates z 784 times. Normally, I want to take the gradient of the loss() function with respect to z inside loss_train(). However, if I instead define z_rep as a Param outside model_forw() and take the gradient of loss() with respect to z_rep inside loss_train(theta, x), the code runs quite fast. The following piece of code therefore works well.

using Knet
using Statistics: mean
atype = Knet.atype()

Knet.seed!(0)

function batched_linear(theta, x_in; atype = KnetArray{Float32})
#     """
#     multiply a weight matrix of size (O, I) with a batch of matrices 
#     of size (I, W, B) to have an output of size (O, W, B), 
#     where B is the batch size.
    
#     size(theta) = (O, I)
#     size(x_in) = (I, W, B)
#     """
    o = size(theta,1)
    w = size(x_in, 2)
    b = size(x_in, 3)
    x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
    out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
    return out
end

function get_mgrid(sidelen) # Create a grid
    iterator = (range(-1,stop=1,length = sidelen))
    return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end

function model_forw(theta, z_rep) #Forward implementation of the model
    # It is kind of a decoder model where we try to reconstruct a 
    # target by using z_in 
#     z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
    z_in = cat(c, z_rep, dims = 3)
    z_in = (permutedims(z_in, (3,2,1)))
    z = batched_linear(theta[1], z_in) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[2], z) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[3], z) .+ 0.001
    z = sin.(30 * z)
    z = batched_linear(theta[4], z)
end

function loss(theta, z, x) # Compute mean squared error loss
    x_hat = model_forw(theta, z)
    L = mean(sum((x_hat- x).^2, dims = 2))
end

function loss_train(theta,x)
    z = (atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector (plain array, not a Param this time)
    z_rep = Param(atype(hcat([z for _ = 1:size(c,2)]...)))
    derivative_origin = @diff loss(theta, z_rep, x) # Feed zero latent to model and take the gradient w.r.t. it
    z = -grad(derivative_origin, z_rep) # New latent point as negative gradient
    x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
    L = mean((x_hat- x).^2) # Compute mean squared error loss
end

num_latent = 32
i = 34
o1 = 256
o2 = 256
o3 = 256
o4 = 1
w = 784
batch_size = 64

x = atype(randn(o4,w,batch_size)) # Target
# Model Weights : theta1, ..., theta4
theta1 = Param(atype(randn(o1,i)))
theta2 = Param(atype(randn(o2,o1)))
theta3 = Param(atype(randn(o3,o2)))
theta4 = Param(atype(randn(o4,o3)))
# Model Weight List
theta = []
push!(theta, theta1)
push!(theta, theta2)
push!(theta, theta3)
push!(theta, theta4)

mgrid = get_mgrid(28) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
z = (atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector 
z_rep = Param(hcat([z for _ = 1:size(c,2)]...)) # Make z_rep Param type this time
# The following line (derivative_origin ) works fine again. However, I do not want to obtain the gradient 
# w.r.t. z_rep actually. I need the gradient w.r.t z !!!
derivative_origin = @diff loss(theta, z_rep, x) 
# The following line to take the derivative w.r.t. model weights is fast now.
derivative_model = @diff loss_train(theta,x)

Here, instead of defining z as a Param, I defined z_rep as a Param outside model_forw(). This way I can take the gradient of loss_train() w.r.t. the model weights (theta) very quickly, and it runs fast in the follow-up iterations as well. Therefore, I suspect that concatenating many matrices inside a forward pass makes taking the derivative very difficult. However, I have not found a workaround, since I need the gradient of loss() w.r.t. z inside loss_train(). If I could use the repeat() function on a Param KnetArray instead of hcat or cat, since I keep concatenating the same matrix, maybe that would solve my problem. This corresponds to issue #635, but since I need z to be a Param I cannot use the workaround recommended there.

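One untested idea, sketched below under the assumption that broadcast .* over KnetArray/Param handles these shapes and is differentiable like other broadcast operations: repeat z along the width dimension by broadcasting against an array of ones, so that inside @diff the 784 copies are summed back into z by the broadcast gradient instead of going through a 784-way uncat.

# Untested sketch: repeat z over dim 2 via broadcasting instead of hcat.
ones_w = atype(ones(1, 784, 1))                          # (1, W, 1) with W = 784
z      = Param(atype(zeros(batch_size, 1, num_latent)))  # (B, 1, num_latent)
z_rep  = z .* ones_w                                     # (B, W, num_latent); inside @diff the gradient of z_rep sums back into z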

BariscanBozkurt commented on June 6, 2024

I found my workaround. Instead of using hcat to repeat my z matrix 784 times along its second dimension, I used a 1x1 convolution whose weights are all ones. If I use the following lines in place of the hcat in my example above, my code runs fast. Credit goes to @ugrulas for this solution.

using Knet
using Statistics: mean
atype = Knet.atype()

one_conv_weight = atype(ones(1,1,1,784)) #Globally define convolution weights of all ones

num_latent = 32
batch_size = 64

z = Param(atype(zeros(batch_size, 1, num_latent))) #size : (64,1,32)
# We won't use the following line which includes hcat function to repeat z
# 784 times. Instead, we utilize 1x1 convolution.
# z_rep = hcat([z for _ = 1:784]...) # size : (64,784,32)
z_ = copy(z) # Create a copy of z, so that z_ is not param type
z_ = permutedims(reshape(z_,64,1,1,32),(4,3,2,1)) # size : (32,1,1,64)
z_ = conv4(one_conv_weight, z_)[:,1,:,:] # size : (32, 784, 64)
z_rep = permutedims(z_, (3,2,1)) #size : (64, 784, 32)
