
nn-zero-to-hero's Introduction

I like deep neural nets.

nn-zero-to-hero's People

Contributors

edvenson, karpathy


nn-zero-to-hero's Issues

BatchNorm1d in makemore part 3 may have an error

Hello Andrej, in your notebook for makemore part 3, you built BatchNorm1d as

class BatchNorm1d:
  
  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.momentum = momentum
    self.training = True
    # parameters (trained with backprop)
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)
    # buffers (trained with a running 'momentum update')
    self.running_mean = torch.zeros(dim)
    self.running_var = torch.ones(dim)
  
  def __call__(self, x):
    # calculate the forward pass
    if self.training:
      xmean = x.mean(0, keepdim=True) # batch mean
      xvar = x.var(0, keepdim=True) # batch variance
    else:
      xmean = self.running_mean
      xvar = self.running_var
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    # update the buffers
    if self.training:
      with torch.no_grad():
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
    return self.out
  
  def parameters(self):
    return [self.gamma, self.beta]

I think there may be something wrong with how xvar is computed. If the input has batch size = 1, x.var uses the unbiased estimator by default; since 1 - 1 = 0, this results in a division by zero (nan).

In fact, when I ran your sampling code:

# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):
    
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
      # forward pass the neural net
      emb = C[torch.tensor([context])] # (1,block_size,n_embd)
      x = emb.view(emb.shape[0], -1) # concatenate the vectors
      for layer in layers:
        x = layer(x)
      logits = x
      probs = F.softmax(logits, dim=1)
      # sample from the distribution
      ix = torch.multinomial(probs, num_samples=1, generator=g).item()
      # shift the context window and track the samples
      context = context[1:] + [ix]
      out.append(ix)
      # if we sample the special '.' token, break
      if ix == 0:
        break
    
    print(''.join(itos[i] for i in out)) # decode and print the generated word

I got an error at BatchNorm1d layer. Since the input x has batch_size=1, the calculated variance is all nan.
Btw, when I used PyTorch's implementation

layers = [
  Linear(n_embd * block_size, n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden), Tanh(),
  Linear(           n_hidden, vocab_size),
]

instead of

layers = [
  Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(           n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]

everything was fine.

I know your notebook has been run and tested, so maybe there is something I missed in the video or notebook?
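In case it helps anyone hitting the same thing, two possible workarounds as minimal sketches (my own, not from the notebook):

# Workaround 1 (sketch): switch the custom layers to inference mode before sampling,
# so BatchNorm1d uses running_mean/running_var instead of batch statistics.
for layer in layers:
  layer.training = False

# Workaround 2 (sketch): inside BatchNorm1d.__call__, use the biased estimator,
# so a batch of size 1 does not divide by n - 1 = 0.
xvar = x.var(0, keepdim=True, unbiased=False) # batch variance (biased)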

Fun solution to loop replacement (`makemore_part4_backprop.ipynb`)

Hi Andrej, hi everyone,

First of all, let me add my voice to the chorus: such awesome lectures, very grateful for them, I recommend them around me as soon as I have the opportunity!

At one point in the backprop lecture, you mention that there might be a slicker way to update the last gradient tensor, dC, instead of the Python loop you used. This tickled my curiosity, so I tinkered, and here's the solution I came up with; maybe others have found even better ways! (Although, arguably, if you're not into Torch nerdiness, the hit to time management/peace of mind when basking in advanced indexing might not be a great trade-off against the slow but straightforward loop! : >)

So, instead of:

dC = torch.zeros_like(C)
for k in range(Xb.shape[0]):
  for j in range(Xb.shape[1]):
    ix = Xb[k,j]
    dC[ix] += demb[k,j]

It is possible to do:

# arange        -> unsqueeze  -> tile         -> flatten
# [ 0,1,...,31] -> [[0],      -> [[0,0,0],    -> [0,0,0,1,1,1,...,31,31,31] # batch_size * block_size entries
#                   [1],          [1,1,1], 
#                   ...           ...  
#                   [31]]         [31,31,31]]
rows_xi = torch.tile(torch.arange(0, Xb.shape[0]).unsqueeze(1), (1,3)).flatten() # (1,3): 3 == block_size == Xb.shape[1]

# [0,1,2] -> [[0,1,2],[0,1,2],...,[0,1,2]] # block_size * batch_size times
cols_xi = torch.tile(torch.arange(0, Xb.shape[1]), (Xb.shape[0],))

emb_xi = Xb[rows_xi, cols_xi] # block_size * batch_size indices to retrieve rows

dC1 = torch.zeros_like(C)

dC1.index_put_((emb_xi,), demb[rows_xi, cols_xi], accumulate=True)

A torch.allclose(dC1, dC) yields True on my end.

I'm indebted to the all-answering @ptrblck for that .index_put_(... accumulate=True) reference!
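If you want to skip building the row/column index tensors entirely, the same index_put_ call also accepts the flattened Xb directly; a minimal sketch of what I mean (using the same Xb, demb and C as above):

# Same idea without the explicit index construction: flatten Xb so every (k, j)
# position becomes a row index into C, and scatter-add the matching row of demb.
dC2 = torch.zeros_like(C)
dC2.index_put_((Xb.view(-1),), demb.view(-1, demb.shape[-1]), accumulate=True)
# torch.allclose(dC2, dC) should hold as well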

Have a great day!

difference on calculating loss for count-matrix

When I tried to re-implement the code for the count matrix myself, I noticed that there is a slight difference between my nll loss and the one in the notebook.

Here is what we have in the video: -nll = 559891.75

but here is what I got: -nll = 559873.5915061831
The difference is about 18.

After reviewing both pieces of code, I found out that the issue is with .item(), which converts the tensor to a Python number.
In the notebook, the accumulation is done with the tensor directly:

log_likelihood += logprob

But I have calculated this:

nll += logprob.item()

Here is an example:

print(torch.log(prob).item())
print(torch.log(prob))

>>> -1.4469189643859863
>>> tensor(-1.4469)

The PyTorch documentation for torch.Tensor.item() doesn't mention anything about this.
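For anyone curious, the size of the gap is consistent with the two accumulators having different precisions: the notebook adds float32 tensors into a float32 tensor, while the .item() version adds 64-bit Python floats. A small standalone sketch (my own, not from the notebook) illustrating the effect:

import torch

# Accumulate the same float32 values two ways: inside a float32 tensor, and as
# Python floats (double precision) obtained via .item(). The totals drift apart.
torch.manual_seed(0)
vals = -3.0 * torch.rand(100000)  # stand-ins for log-probabilities

acc_tensor = torch.tensor(0.0)    # float32 accumulator
acc_float = 0.0                   # Python float accumulator (64-bit)
for v in vals:
  acc_tensor += v
  acc_float += v.item()

print(acc_tensor.item(), acc_float)  # the two sums differ slightly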

makemore_part4_backprop dhpreact exact part is False.

Hello Andrej. First of all, I would like to express my gratitude to you for sharing such valuable videos with us for free.

While watching the makemore part 4 video, I was also trying to apply it to a dataset I created myself. When I computed the chained derivative for the 'dhpreact' part, the comparison started reporting a mismatch, and since it is a chained derivative, the mismatch propagated to the subsequent outputs as well. Below I share the code and the output.
Please share any other solution if you have one. Using different Torch versions and changing the dtype to 'double', as suggested in the comments, didn't work for me.

dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1.0/n
dprobs = (1.0 / probs) * dlogprobs
dcounts_sum_inv = (counts * dprobs).sum(1, keepdim=True)
dcounts = counts_sum_inv * dprobs
dcounts_sum = (-counts_sum**-2) * dcounts_sum_inv
dcounts += torch.ones_like(counts) * dcounts_sum
dnorm_logits = counts * dcounts
dlogits = dnorm_logits.clone()
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits += F.one_hot(logits.max(1).indices, num_classes=logits.shape[1]) * dlogit_maxes
dh = dlogits @ W2.T
dW2 = h.T @ dlogits
db2 = dlogits.sum(0)
dhpreact = (1.0 - h**2) * dh
dbngain = (bnraw * dhpreact).sum(0, keepdim=True)
dbnraw = bngain * dhpreact


Output : 

logprobs        | exact: True  | approximate: True  | maxdiff: 0.0
probs           | exact: True  | approximate: True  | maxdiff: 0.0
counts_sum_inv  | exact: True  | approximate: True  | maxdiff: 0.0
counts_sum      | exact: True  | approximate: True  | maxdiff: 0.0
counts          | exact: True  | approximate: True  | maxdiff: 0.0
norm_logits     | exact: True  | approximate: True  | maxdiff: 0.0
logit_maxes     | exact: True  | approximate: True  | maxdiff: 0.0
logits          | exact: True  | approximate: True  | maxdiff: 0.0
h               | exact: True  | approximate: True  | maxdiff: 0.0
W2              | exact: True  | approximate: True  | maxdiff: 0.0
b2              | exact: True  | approximate: True  | maxdiff: 0.0
hpreact         | exact: False | approximate: True  | maxdiff: 4.656612873077393e-10
bngain          | exact: False | approximate: True  | maxdiff: 1.862645149230957e-09
bnbias          | exact: False | approximate: True  | maxdiff: 7.450580596923828e-09
bnraw           | exact: False | approximate: True  | maxdiff: 6.984919309616089e-10
bnvar_inv       | exact: False | approximate: True  | maxdiff: 3.725290298461914e-09
bnvar           | exact: False | approximate: True  | maxdiff: 9.313225746154785e-10

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
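For context, exact: False with a maxdiff around 1e-9 is the scale of ordinary float32 rounding rather than a wrong derivative; a standalone sketch of my own (not from the notebook) where two analytically identical tanh-backward formulas already disagree at a similar level:

import torch

# d/dx tanh(x) computed two mathematically equivalent ways in float32.
torch.manual_seed(0)
hpreact = torch.randn(32, 64)
dh = torch.randn(32, 64)

h = torch.tanh(hpreact)
dhpreact_a = (1.0 - h**2) * dh           # via 1 - tanh(x)^2
dhpreact_b = dh / torch.cosh(hpreact)**2 # via sech(x)^2

print(torch.equal(dhpreact_a, dhpreact_b))    # typically False
print(torch.allclose(dhpreact_a, dhpreact_b)) # True
print((dhpreact_a - dhpreact_b).abs().max())  # tiny, float32 rounding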

Getting the error below while running the backward pass:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

Code

#Forward Pass
logits = (xenc @ W)
counts = logits.exp()
prob = counts/counts.sum(1,keepdim=True)
loss = - prob[torch.arange(5),ys].log().mean()
print(loss.item())

#Backward Pass
W.grad=None
loss.backward()

#update the weights
W.data += -0.1 * W.grad

Query:

Why are we performing the one-hot encoding of the input every time we iterate the forward pass?
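For reference, the RuntimeError typically goes away once the forward pass is re-run before every backward() call, so that each backward consumes a freshly built graph. A minimal sketch of the full loop (my own rearrangement of the code above, assuming the same xenc, ys and W; I've also swapped the hard-coded 5 for xenc.shape[0]):

# training loop (sketch)
for _ in range(100):
    # forward pass (rebuilds the autograd graph each iteration)
    logits = xenc @ W
    counts = logits.exp()
    prob = counts / counts.sum(1, keepdim=True)
    loss = -prob[torch.arange(xenc.shape[0]), ys].log().mean()

    # backward pass
    W.grad = None
    loss.backward()

    # update the weights
    W.data += -0.1 * W.grad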

need Discussions :)

Thank you karpathy for open-sourcing this great course series.
I think the Discussions board should be opened.
I found that in the process of learning I had many thoughts and questions rather than issues. These thoughts could be enlightening to others, but because they are not issues, I cannot find a suitable place to post them.



Factorized code for dC in makemore_part4_backprop.ipynb

First, thanks a lot Andrej for the great series.
Here is the factorized version for dC:
one_hot_Xb = F.one_hot(Xb,27).view((Xb.shape[0]*Xb.shape[1],27)).float() # (batch_size*block_size, 27)
dC = one_hot_Xb.T @ demb.view((demb.shape[0]*demb.shape[1],demb.shape[2])) # (27, n_embd), same shape as C

I came up with it mainly through shape comparison; could you please look into it? :-)

Bug in makemore part 1, in computing the likelihood

n +=1 should be in the outer loop, I guess.
You want to count the words, not the characters.

A forum would be very nice. YouTube comments could do it, but they are full of non-technical comments. Could we maybe use the PyTorch forums?

Not bug, just curiosity

I'm doing some research into methods for finding an optimal learning rate.

I built the models both from scratch, as in the videos, and also in a torch-friendly way, i.e. using torch modules, dataloaders, an optimizer, etc.

However, something weird happens when running the following, which 'should' be the same as the code from the video. The lr-loss graph is shown in im1 below.

im2 uses code very similar to the videos, i.e. manually updating the weights. Why are the results not the same? Is the optimizer doing something different in the backend? Overall the training is about the same; both converge at roughly the same rate.

def findlr(model, data, test_dataloader):
    lrs = torch.linspace(0.01, 1, 1000)   # (overridden by the next line)
    lrs = 10**torch.linspace(-3, 0, 1000) # log-spaced learning rates
    lri = []
    lss = []

    optim = torch.optim.SGD(model.parameters(), lr=lrs[0])
    for i in range(len(lrs)):
        for g in optim.param_groups:
            g['lr'] = lrs[i]

        x, y = next(iter(data))
        l = calcLoss(model(x), y)
        model.zero_grad()
        l.backward()
        optim.step()

        lri.append(lrs[i].item())
        lss.append(l.item())

        print(lrs[i], l.item())
    plt.plot(lri, lss)
    plt.show()
    
im1: https://user-images.githubusercontent.com/95486801/193437520-55d14507-867c-411f-9c9e-e11db3b9e67c.png

im2: https://user-images.githubusercontent.com/95486801/193437537-481b38c6-447a-4bf7-86ee-bffb291b737a.png
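For comparison, here is what I believe the video-style manual update looks like for the same sweep; a sketch of my own, assuming the same model, calcLoss, dataloader and plt as above:

def findlr_manual(model, data):
    # sweep learning rates on a log scale, updating the weights by hand
    lrs = 10**torch.linspace(-3, 0, 1000)
    lri, lss = [], []
    for lr in lrs:
        x, y = next(iter(data))
        loss = calcLoss(model(x), y)
        for p in model.parameters():
            p.grad = None
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p += -lr * p.grad
        lri.append(lr.item())
        lss.append(loss.item())
    plt.plot(lri, lss)
    plt.show()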

intro to neural networks first

Hi Andrej, this is really great and thank you so very much for the material. It is truly super useful!

I wonder: what would you recommend before this series to introduce neural nets, what they do / can do, as a foundation before diving into backprop?

Have you seen a nice lecture that does that in a simple way, just aligned with your course and material?

Problem with makemore_part2_mlp.ipynb

loss = -prob[torch.arange(32), Y].log().mean()
loss
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 loss = -prob[torch.arange(32), Y].log().mean()
      2 loss

IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [32], [228146]
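A likely cause (my own note, not from the notebook): prob was built from only 32 examples, while Y here holds all 228146 labels, so the two indexing tensors cannot broadcast. A minimal sketch of one fix is to index with prob's own row count and the labels for exactly those examples:

# Y must be the labels for the same examples that produced prob
loss = -prob[torch.arange(prob.shape[0]), Y].log().mean()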

zero_grad implementation correctness question

Hi,

First of all, thanks for the great video tutorial! You describe in the video at 2:11:48 that you need to reset the grad values to zero between iterations by assigning zero to the grad value of every parameter.

I may be misunderstanding this part, but it seems to me that one would also need to zero the grad values for all nodes and not just for the ones representing the parameters, otherwise the grad values for the internal nodes between the parameters will still keep accumulating between iterations.

I may easily be mistaken, so I'm phrasing this issue as a question rather than as a bug report. Is it enough to zero_grad the parameters and if so, why?

Thanks!

Numerical instability in Google Colab - Part 4 of Makemore

I ran into an interesting issue in makemore part 4 (backprop ninja) where dhpreact was not exactly matching hpreact.grad.

However, this happened only in the Colab notebook; when I put the same code into a local Jupyter notebook it works fine.

Not sure why this would be the case but just an odd curiosity.

TypeError: unsupported operand type(s) for +: 'int' and 'Value'

When I run the following command from micrograd_lecture_second_half_roughly.ipynb:

[(yout - ygt)**2 for ygt, yout in zip(ys, ypred)]

I get a clean output:

[Value(data=1.8688676392069992), Value(data=0.37661915959598025), Value(data=0.13611849326555403), Value(data=1.69235533142263)]

But whenever I tried to enclose it in a sum() function:

sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

It throws an error:

TypeError: unsupported operand type(s) for +: 'int' and 'Value'
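One way to make this work, as a minimal sketch (my own, assuming the lecture's Value class): sum() starts from the integer 0, so Python ends up evaluating 0 + Value(...), which needs either a reflected add on Value or an explicit Value start value:

# Option 1: inside the Value class, let reflected addition fall back to __add__
def __radd__(self, other): # handles other + self, e.g. 0 + Value(...)
  return self + other

# Option 2: start the sum from a Value instead of the default int 0
loss = sum(((yout - ygt)**2 for ygt, yout in zip(ys, ypred)), Value(0.0))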

Discrepancy in makemore_part1_bigrams single sample draw

When I run the makemore_part1_bigrams notebook, in the cell where we draw a single sample based on the probability distribution in the first row of N, I get a different sample ('c') compared to the one in the video ('m'). Everything else until then seems to be the same; with a manually seeded generator, I'd expect even this to match exactly.

Value in this repo (and the video):
[screenshot]

My run:
[screenshot]

I have pushed the full run until this cell in the commit here.

What am I missing?
