
nn-zero-to-hero's Introduction

I like deep neural nets.

nn-zero-to-hero's People

Contributors

edvenson, karpathy


nn-zero-to-hero's Issues

BatchNorm1d in makemore part 3 may have an error

Hello Andrej, in your notebook for makemore part 3, you built BatchNorm1d as

class BatchNorm1d:
  
  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.momentum = momentum
    self.training = True
    # parameters (trained with backprop)
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)
    # buffers (trained with a running 'momentum update')
    self.running_mean = torch.zeros(dim)
    self.running_var = torch.ones(dim)
  
  def __call__(self, x):
    # calculate the forward pass
    if self.training:
      xmean = x.mean(0, keepdim=True) # batch mean
      xvar = x.var(0, keepdim=True) # batch variance
    else:
      xmean = self.running_mean
      xvar = self.running_var
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    # update the buffers
    if self.training:
      with torch.no_grad():
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
    return self.out
  
  def parameters(self):
    return [self.gamma, self.beta]

I think there may be something wrong with how xvar is computed. If the input has batch size = 1, x.var uses the unbiased estimator by default; since 1 - 1 = 0, this results in a division by zero (nan).

In fact, when I ran your sampling code:

# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
      # forward pass the neural net
      emb = C[torch.tensor([context])] # (1,block_size,n_embd)
      x = emb.view(emb.shape[0], -1) # concatenate the vectors
      for layer in layers:
        x = layer(x)
      logits = x
      probs = F.softmax(logits, dim=1)
      # sample from the distribution
      ix = torch.multinomial(probs, num_samples=1, generator=g).item()
      # shift the context window and track the samples
      context = context[1:] + [ix]
      out.append(ix)
      # if we sample the special '.' token, break
      if ix == 0:
        break
    
    print(''.join(itos[i] for i in out)) # decode and print the generated word

I got an error at BatchNorm1d layer. Since the input x has batch_size=1, the calculated variance is all nan.
Btw, when I used PyTorch's implementation

layers = [
  Linear(n_embd * block_size, n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden), Tanh(),
  Linear(           n_hidden, vocab_size),
]

instead of

layers = [
  Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]

everything was fine.

I know your notebook has been run and tested, so maybe there is something I missed in the video or notebook?
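In case it helps anyone hitting the same thing, two possible workarounds as minimal sketches (my own, not from the notebook):

# Workaround 1 (sketch): switch the custom layers to inference mode before sampling,
# so BatchNorm1d uses running_mean/running_var instead of batch statistics.
for layer in layers:
  layer.training = False

# Workaround 2 (sketch): inside BatchNorm1d.__call__, use the biased estimator,
# so a batch of size 1 does not divide by n - 1 = 0.
xvar = x.var(0, keepdim=True, unbiased=False) # batch variance (biased)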

Fun solution to loop replacement (`makemore_part4_backprop.ipynb`)

Hi Andrej, hi everyone,

First of all, let me add my voice to the chorus: such awesome lectures, very grateful for them, I recommend them around me as soon as I have the opportunity!

At one point in the backprop lecture, you mention that there might be a slicker way to update the last gradient tensor, dC, instead of the Python loop you used. This tickled my curiosity, so I tinkered, and here's the solution I came up with; maybe others have found even better ways! (Although, arguably, if you're not into Torch nerdiness, the hit to time management/peace of mind when basking in advanced indexing might not be a great trade-off against the slow but straightforward loop! : >)

So, instead of:

dC = torch.zeros_like(C)
for k in range(Xb.shape[0]):
  for j in range(Xb.shape[1]):
    ix = Xb[k,j]
    dC[ix] += demb[k,j]

It is possible to do:

# arange        -> unsqueeze  -> tile         -> flatten
# [ 0,1,...,31] -> [[0],      -> [[0,0,0],    -> [0,0,0,1,1,1,...,31,31,31] # batch_size * block_size entries
#                   [1],          [1,1,1], 
#                   ...           ...  
#                   [31]]         [31,31,31]]
rows_xi = torch.tile(torch.arange(0, Xb.shape[0]).unsqueeze(1), (1,3)).flatten() # (1,3): 3 == block_size == Xb.shape[1]

# [0,1,2] -> [[0,1,2],[0,1,2],...,[0,1,2]] # block_size * batch_size times
cols_xi = torch.tile(torch.arange(0, Xb.shape[1]), (Xb.shape[0],))

emb_xi = Xb[rows_xi, cols_xi] # block_size * batch_size indices to retrieve rows

dC1 = torch.zeros_like(C)

dC1.index_put_((emb_xi,), demb[rows_xi, cols_xi], accumulate=True)

A torch.allclose(dC1, dC) yields True on my end.

I'm indebted to the all-answering @ptrblck for that .index_put_(... accumulate=True) reference!
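If you want to skip building the row/column index tensors entirely, the same index_put_ call also accepts the flattened Xb directly; a minimal sketch of what I mean (using the same Xb, demb and C as above):

# Same idea without the explicit index construction: flatten Xb so every (k, j)
# position becomes a row index into C, and scatter-add the matching row of demb.
dC2 = torch.zeros_like(C)
dC2.index_put_((Xb.view(-1),), demb.view(-1, demb.shape[-1]), accumulate=True)
# torch.allclose(dC2, dC) should hold as well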

Have a great day!

difference on calculating loss for count-matrix

When I tried to re-implement the code for the count matrix myself, I noticed that there is a slight difference between my nll loss and the one in the notebook.

Here is what we have in the video: -nll = 559891.75

but here is what I got: -nll = 559873.5915061831
The difference is about 18.

After reviewing both pieces of code, I found out that the issue is with .item(), which converts the tensor to a Python number.
In the notebook, the accumulation is done with the tensor directly:

log_likelihood += logprob

But I have calculated this:

nll += logprob.item()

Here is an example:

print(torch.log(prob).item())
print(torch.log(prob))

>>> -1.4469189643859863
>>> tensor(-1.4469)

The PyTorch documentation for torch.Tensor.item() doesn't mention anything about this.
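For anyone curious, the size of the gap is consistent with the two accumulators having different precisions: the notebook adds float32 tensors into a float32 tensor, while the .item() version adds 64-bit Python floats. A small standalone sketch (my own, not from the notebook) illustrating the effect:

import torch

# Accumulate the same float32 values two ways: inside a float32 tensor, and as
# Python floats (double precision) obtained via .item(). The totals drift apart.
torch.manual_seed(0)
vals = -3.0 * torch.rand(100000)  # stand-ins for log-probabilities

acc_tensor = torch.tensor(0.0)    # float32 accumulator
acc_float = 0.0                   # Python float accumulator (64-bit)
for v in vals:
  acc_tensor += v
  acc_float += v.item()

print(acc_tensor.item(), acc_float)  # the two sums differ slightly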

makemore_part4_backprop dhpreact exact part is False.

Hello Andrej. First of all, I would like to express my gratitude to you for sharing such valuable videos with us for free.

While watching the makemore part 4 video, I was also trying to apply it to a dataset I created myself. When I computed the chained derivative for the 'dhpreact' part, the comparison started reporting a mismatch, and since it is a chained derivative, the mismatch propagated to the subsequent outputs as well. Below I share the code and the output.
Please share any other solution if you have one. Using different Torch versions and changing the dtype to 'double', as suggested in the comments, didn't work for me.

dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1.0/n
dprobs = (1.0 / probs) * dlogprobs
dcounts_sum_inv = (counts * dprobs).sum(1, keepdim=True)
dcounts = counts_sum_inv * dprobs
dcounts_sum = (-counts_sum**-2) * dcounts_sum_inv
dcounts += torch.ones_like(counts) * dcounts_sum
dnorm_logits = counts * dcounts
dlogits = dnorm_logits.clone()
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits += F.one_hot(logits.max(1).indices, num_classes=logits.shape[1]) * dlogit_maxes
dh = dlogits @ W2.T
dW2 = h.T @ dlogits
db2 = dlogits.sum(0)
dhpreact = (1.0 - h**2) * dh
dbngain = (bnraw * dhpreact).sum(0, keepdim=True)
dbnraw = bngain * dhpreact


Output : 

logprobs        | exact: True  | approximate: True  | maxdiff: 0.0
probs           | exact: True  | approximate: True  | maxdiff: 0.0
counts_sum_inv  | exact: True  | approximate: True  | maxdiff: 0.0
counts_sum      | exact: True  | approximate: True  | maxdiff: 0.0
counts          | exact: True  | approximate: True  | maxdiff: 0.0
norm_logits     | exact: True  | approximate: True  | maxdiff: 0.0
logit_maxes     | exact: True  | approximate: True  | maxdiff: 0.0
logits          | exact: True  | approximate: True  | maxdiff: 0.0
h               | exact: True  | approximate: True  | maxdiff: 0.0
W2              | exact: True  | approximate: True  | maxdiff: 0.0
b2              | exact: True  | approximate: True  | maxdiff: 0.0
hpreact         | exact: False | approximate: True  | maxdiff: 4.656612873077393e-10
bngain          | exact: False | approximate: True  | maxdiff: 1.862645149230957e-09
bnbias          | exact: False | approximate: True  | maxdiff: 7.450580596923828e-09
bnraw           | exact: False | approximate: True  | maxdiff: 6.984919309616089e-10
bnvar_inv       | exact: False | approximate: True  | maxdiff: 3.725290298461914e-09
bnvar           | exact: False | approximate: True  | maxdiff: 9.313225746154785e-10

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
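For context, exact: False with a maxdiff around 1e-9 is the scale of ordinary float32 rounding rather than a wrong derivative; a standalone sketch of my own (not from the notebook) where two analytically identical tanh-backward formulas already disagree at a similar level:

import torch

# d/dx tanh(x) computed two mathematically equivalent ways in float32.
torch.manual_seed(0)
hpreact = torch.randn(32, 64)
dh = torch.randn(32, 64)

h = torch.tanh(hpreact)
dhpreact_a = (1.0 - h**2) * dh           # via 1 - tanh(x)^2
dhpreact_b = dh / torch.cosh(hpreact)**2 # via sech(x)^2

print(torch.equal(dhpreact_a, dhpreact_b))    # typically False
print(torch.allclose(dhpreact_a, dhpreact_b)) # True
print((dhpreact_a - dhpreact_b).abs().max())  # tiny, float32 rounding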

Getting the error below while running the backward pass:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

Code

#Forward Pass
logits = (xenc @ W)
counts = logits.exp()
prob = counts/counts.sum(1,keepdim=True)
loss = - prob[torch.arange(5),ys].log().mean()
print(loss.item())

#Backward Pass
W.grad=None
loss.backward()

#update the weights
W.data += -0.1 * W.grad

Query:

Why are we performing the one-hot encoding of the input every time we iterate the forward pass?
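For reference, the RuntimeError typically goes away once the forward pass is re-run before every backward() call, so that each backward consumes a freshly built graph. A minimal sketch of the full loop (my own rearrangement of the code above, assuming the same xenc, ys and W; I've also swapped the hard-coded 5 for xenc.shape[0]):

# training loop (sketch)
for _ in range(100):
    # forward pass (rebuilds the autograd graph each iteration)
    logits = xenc @ W
    counts = logits.exp()
    prob = counts / counts.sum(1, keepdim=True)
    loss = -prob[torch.arange(xenc.shape[0]), ys].log().mean()

    # backward pass
    W.grad = None
    loss.backward()

    # update the weights
    W.data += -0.1 * W.grad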

need Discussions :)

Thank you karpathy for open-sourcing this great course series.
I think the Discussions board should be opened.
I found that in the process of learning I had many thoughts and questions rather than issues. These thoughts could be enlightening to others, but because they are not issues, I cannot find a suitable place to post them.



Factorized code for dC in makemore_part4_backprop.ipynb

First, thanks a lot Andrej for the great series.
Here is the factorized version for dC:
one_hot_Xb = F.one_hot(Xb,27).view((Xb.shape[0]*Xb.shape[1],27)).float() # (batch_size*block_size, 27)
dC = one_hot_Xb.T @ demb.view((demb.shape[0]*demb.shape[1],demb.shape[2])) # (27, n_embd), same shape as C

I came up with it mainly through shape comparison; could you please look into it? :-)

Bug in makemore part 1, in computing the likelihood

n +=1 should be in the outer loop, I guess.
You want to count the words, not the characters.

A forum would be very nice. YouTube comments could do it, but they are full of non-technical comments. Could we maybe use the PyTorch forums?

Not bug, just curiosity

I'm doing some research into methods for finding an optimal learning rate.

I built the models both from scratch, as in the videos, and also in a torch-friendly way, i.e. using torch modules, dataloaders, an optimizer, etc.

However, something weird happens when running the following, which 'should' be the same as the code from the video. The lr-loss graph is shown in im1 below.

im2 uses code very similar to the videos, i.e. manually updating the weights. Why are the results not the same? Is the optimizer doing something different in the backend? Overall the training is about the same; both converge at roughly the same rate.

def findlr(model, data, test_dataloader):
    lrs = torch.linspace(0.01, 1, 1000)   # (overridden by the next line)
    lrs = 10**torch.linspace(-3, 0, 1000) # log-spaced learning rates
    lri = []
    lss = []

    optim = torch.optim.SGD(model.parameters(), lr=lrs[0])
    for i in range(len(lrs)):
        for g in optim.param_groups:
            g['lr'] = lrs[i]

        x, y = next(iter(data))
        l = calcLoss(model(x), y)
        model.zero_grad()
        l.backward()
        optim.step()

        lri.append(lrs[i].item())
        lss.append(l.item())

        print(lrs[i], l.item())
    plt.plot(lri, lss)
    plt.show()
    
im1: https://user-images.githubusercontent.com/95486801/193437520-55d14507-867c-411f-9c9e-e11db3b9e67c.png

im2: https://user-images.githubusercontent.com/95486801/193437537-481b38c6-447a-4bf7-86ee-bffb291b737a.png
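For comparison, here is what I believe the video-style manual update looks like for the same sweep; a sketch of my own, assuming the same model, calcLoss, dataloader and plt as above:

def findlr_manual(model, data):
    # sweep learning rates on a log scale, updating the weights by hand
    lrs = 10**torch.linspace(-3, 0, 1000)
    lri, lss = [], []
    for lr in lrs:
        x, y = next(iter(data))
        loss = calcLoss(model(x), y)
        for p in model.parameters():
            p.grad = None
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p += -lr * p.grad
        lri.append(lr.item())
        lss.append(loss.item())
    plt.plot(lri, lss)
    plt.show()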

intro to neural networks first

Hi Andrej, this is really great and thank you so very much for the material. It is truly super useful!

I wonder: what would you recommend before this series to introduce neural nets, what they do / can do, as a foundation before diving into backprop?

Have you seen a nice lecture that does that in a simple way, just aligned with your course and material?

Problem with makemore_part2_mlp.ipynb

loss = -prob[torch.arange(32), Y].log().mean()
loss
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 loss = -prob[torch.arange(32), Y].log().mean()
      2 loss

IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [32], [228146]
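A likely cause (my own note, not from the notebook): prob was built from only 32 examples, while Y here holds all 228146 labels, so the two indexing tensors cannot broadcast. A minimal sketch of one fix is to index with prob's own row count and the labels for exactly those examples:

# Y must be the labels for the same examples that produced prob
loss = -prob[torch.arange(prob.shape[0]), Y].log().mean()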

zero_grad implementation correctness question

Hi,

First of all, thanks for the great video tutorial! You describe in the video at 2:11:48 that you need to reset the grad values to zero between iterations by assigning zero to the grad value of every parameter.

I may be misunderstanding this part, but it seems to me that one would also need to zero the grad values for all nodes and not just for the ones representing the parameters, otherwise the grad values for the internal nodes between the parameters will still keep accumulating between iterations.

I may easily be mistaken, so I'm phrasing this issue as a question rather than as a bug report. Is it enough to zero_grad the parameters and if so, why?

Thanks!

Numerical instability in Google Colab - Part 4 of Makemore

I ran into an interesting issue in makemore part 4 (backprop ninja) where dhpreact was not exactly matching hpreact.grad.

However, this happened only in the Colab notebook; when I put the same code into a local Jupyter notebook it works fine.

Not sure why this would be the case but just an odd curiosity.

TypeError: unsupported operand type(s) for +: 'int' and 'Value'

When I run the following command from micrograd_lecture_second_half_roughly.ipynb:

[(yout - ygt)**2 for ygt, yout in zip(ys, ypred)]

I get a clean output:

[Value(data=1.8688676392069992), Value(data=0.37661915959598025), Value(data=0.13611849326555403), Value(data=1.69235533142263)]

But whenever I tried to enclose it in a sum() function:

sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

It throws an error:

TypeError: unsupported operand type(s) for +: 'int' and 'Value'
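One way to make this work, as a minimal sketch (my own, assuming the lecture's Value class): sum() starts from the integer 0, so Python ends up evaluating 0 + Value(...), which needs either a reflected add on Value or an explicit Value start value:

# Option 1: inside the Value class, let reflected addition fall back to __add__
def __radd__(self, other): # handles other + self, e.g. 0 + Value(...)
  return self + other

# Option 2: start the sum from a Value instead of the default int 0
loss = sum(((yout - ygt)**2 for ygt, yout in zip(ys, ypred)), Value(0.0))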

Discrepancy in makemore_part1_bigrams single sample draw

When I run the makemore_part1_bigrams notebook, in the cell where we draw a single sample based on the probability distribution in the first row of N, I get a different sample ('c') compared to the one in the video ('m'). Everything else until then seems to be the same; with a manually seeded generator, I'd expect even this to match exactly.

Value in this repo (and the video):
[screenshot]

My run:
[screenshot]

I have pushed the full run until this cell in the commit here.

What am I missing?
