karpathy / micrograd
A tiny scalar-valued autograd engine and a neural net library on top of it with a PyTorch-like API.
License: MIT License
https://github.com/rohit-krish/Deeplex
Your work sparked me to build this.
CuPy is used for GPU support.
Planning on improving it; peace!
Hi, Andrej,
Thanks for this excellent library! It may be useful not only for Python developers, but for C#, F#, Pascal etc. developers too, so I wrote a C# port for .NET ecosystem. The basic info about this is here: https://github.com/ColorfulSoft/System.AI/blob/master/Docs/micrograd.NET.md
Best,
Gleb S. Brykin
Why can't this function simply be implemented as follows? Am I missing something? We are dealing with a composite structure.
def backward(self, is_first=True):
    if is_first:
        self.grad = 1.0
    self._backward()
    for c in self._prev:
        c.backward(False)
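For what it's worth, here is a sketch (my own example, not from the repo) of why this recursion goes wrong as soon as a node is shared by more than one parent:

# sketch: x -> a = x + 1 -> b = a * 2 and c = a * 3 -> d = b + c
# correct result: d = 5*a = 5*(x + 1), so dd/dx = 5
# with the recursive scheme above:
#   d._backward()         -> b.grad = 1, c.grad = 1
#   recurse into b        -> a.grad += 2       (a.grad = 2, c's term missing)
#   recurse into a        -> x.grad += a.grad  (x.grad = 2, premature)
#   recurse into c        -> a.grad += 3       (a.grad = 5, now complete)
#   recurse into a again  -> x.grad += a.grad  (x.grad = 7, double-counted)
# the topological ordering guarantees each node's _backward runs exactly once,
# and only after every parent has contributed its share of the gradient.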
In zero_grad, we just zero the weight and bias graph nodes. But don't we need to zero the other graph nodes too, like those created for addition and multiplication, since the backprop gradient flows through them as well?
The sub operation implemented here would utilize the _backward method of the add operation. I believe this is wrong, because the _backward method for the add operation accumulates out.grad into both operands, but in the case of the sub operation it should accumulate out.grad for the positive operand and -out.grad for the negative operand.
For example:
a = b + c
d(a)/db = 1
d(a)/dc = 1
a = b - c
d(a)/db = 1
d(a)/dc = -1
So I think we need to add a separate _backward function for the sub operation, or modify the _backward method of the add operation.
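For reference, a note on how I believe subtraction is actually wired up in engine.py: there is no standalone sub backward at all. Subtraction is composed from addition and negation, and negation is a multiply by -1, so it is the multiply node's backward that flips the gradient sign for the negated operand:

# from micrograd/engine.py (as I understand it): sub is not a primitive op
def __neg__(self):  # -self
    return self * -1

def __sub__(self, other):  # self - other
    return self + (-other)

With this composition, d(b - c)/dc = -1 falls out automatically: the add node passes out.grad through unchanged, and the *(-1) node multiplies it by -1.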
Hi Prof Karpathy,
I wanted to create a discussion to ask this question, but there was no option to do so. I was watching https://youtu.be/VMj-3S1tku0 and got an idea.
This is in reference to the step of clearing accumulated gradients at:
Line 265 in c911406
People tend to forget to clear the accumulated gradients before the loss function's backward pass.
Idea: create a way to bind the loss function to the network once, and then clear accumulated gradients automatically when performing the backward pass.
We could then perform the backward pass whenever, wherever, and as many times as we want without worrying about accumulated gradients.
class Loss(Value):
    def __init__(self, bound_network):
        super().__init__(0.0)
        self.bound_network = bound_network

    def __call__(self, batch_size=None):
        # loss function definition
        self.data = data_loss + reg_loss

    def backward(self):
        # clear gradients of the bound network before accumulating new ones
        self.bound_network.zero_grad()
        super().backward()

total_loss = Loss(bound_network=model)

for k in range(100):
    # ...
    # model.zero_grad() is no longer needed here: since total_loss is bound
    # to the network, backward() performs it automatically
    total_loss.backward()
    # ...
@karpathy Hi,
Great work... simple and clean.
I was inspired and made my own mini deep learning library :P
I'm wondering whether we could extend this with GPU support?
I would be more than happy if you checked out my repository.
https://github.com/kartik4949/deepops
Thanks and keep doing work like this more :)
p.s: I'm a fresher :P
def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward

    return out
If you have an expression of the form (x*y)*(x*z), then the gradient w.r.t. x is not additive, right?
So you can do higher-order autodiff.
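As a sanity check (my own example, assuming micrograd is installed), the += accumulation in _backward does handle a variable that is used twice: for f = (x*y)*(x*z) = x^2*y*z, the product-rule contributions along each path are summed, giving df/dx = 2xyz:

from micrograd.engine import Value

x, y, z = Value(3.0), Value(2.0), Value(4.0)
f = (x * y) * (x * z)   # f = x^2 * y * z
f.backward()
print(x.grad)           # 2*x*y*z = 48.0: both paths' contributions accumulate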
In micrograd's engine.py, line 76: the code should be 'return other + self', not 'return self + other'.
Hello guys. I wrote a MiniGrad with the RAdam optimizer; it can be found there.
Here is a vectorized implementation with PyTorch flavor built on top of NumPy / CuPy: https://github.com/conscell/ugrad
Hi, unless I'm misunderstanding something, zero_grad in nn.py is zeroing out the gradients on the parameter nodes, but shouldn't it do it on all the nodes in the graph? Otherwise the inner nodes will keep accumulating them.
In the video "The spelled-out intro to neural networks and backpropagation: building micrograd" you present the following code:
n = MLP(3, [4, 4, 1])

xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired targets

for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # backward pass
    for p in n.parameters():
        p.grad = 0.0
    loss.backward()

    # update
    for p in n.parameters():
        p.data += -0.1 * p.grad

    print(k, loss.data)
However, before calling loss.backward() we should reset the grad for ALL values, not just for n.parameters(), because every iteration of loss.backward() changes the grad (+=...) for all of them.
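For what it's worth, a small check (my own sketch, assuming micrograd's engine) suggesting why zeroing just the parameters is enough in this loop: the intermediate Values (ypred, loss, and everything in between) are rebuilt from scratch on every iteration, so their stale gradients are discarded along with them; only the parameter nodes survive across iterations:

from micrograd.engine import Value

w = Value(2.0)                # long-lived "parameter"
for k in range(2):
    x = Value(3.0)            # fresh "activation" rebuilt every iteration
    loss = w * x
    w.grad = 0.0              # zeroing the surviving node is enough
    loss.backward()
    print(k, w.grad, x.grad)  # w.grad is 3.0 both times; x is brand new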
Hi Andrej,
Many thanks for micrograd & its accompanying video; they deepened my understanding of backprop considerably!
I notice that in the current implementation, calling backward() repeatedly is non-idempotent, because the grads just keep accumulating. This seems like something people are likely to trip over. The fix is simple: in the def of backward(), just above
# go one variable at a time and apply the chain rule to get its gradient
self.grad = 1
add
# reset gradients to ensure they don't get repeatedly accumulated
for v in reversed(topo):
    v.grad = 0
Just submitted PR 54 for your consideration; it makes that one change.
Example of non-idempotence with current master branch: given a simple tree where a = 3, b = 2, c = a + b, d = 1, e = c * d (all leaves as Values of course):
>>> print_grads()
a: 0, b: 0, c: 0, d: 0, e: 0
>>> e.backward()
>>> print_grads()
a: 1.0, b: 1.0, c: 1.0, d: 5.0, e: 1
>>> e.backward()
>>> print_grads()
a: 3.0, b: 3.0, c: 2.0, d: 10.0, e: 1
Maybe not PR worthy, but I guess one can abstract the MLP implementation even more by passing in the layers themselves instead of the numbers of inputs and outputs, since each individual layer already knows them.
As such, I wrote it as:
class MLP:
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        for l in self.layers:
            x = l(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
by which you can define a network more intuitively, much like PyTorch's Sequential:
n = MLP([Layer(3, 6), Layer(6, 3), Layer(3, 1)])
To be even more rigorous, a dimension assertion can be added in the __init__:
class MLP:
    def __init__(self, layers):
        self.layers = layers
        for i in range(1, len(layers)):
            assert layers[i-1].nout == layers[i].nin
for which I would have to store the nin and nout on the Layer as well:
class Layer:
    def __init__(self, nin, nout):
        self.nin = nin
        self.nout = nout
        self.neurons = [Neuron(nin) for _ in range(nout)]
Hi @karpathy,
I was solving the assignment mentioned in the YouTube video. In the softmax function, I was getting the following error: TypeError: unsupported operand type(s) for +: 'int' and 'Value'
This is the code where I am getting the error:
def softmax(logits):
    counts = [logit.exp() for logit in logits]
    denominator = sum(counts)  # here I am getting the TypeError
    out = [c / denominator for c in counts]
    return out
And my __add__ function in the Value class is the following:
def __add__(self, other):  # exactly as in the video
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward

    return out
So my query is about the built-in sum over a list. It presumably does something like counts[i].__add__(counts[i+1]) and keeps adding to the result until the end of the list, so this __add__ should work well. But I am not sure why it is not working; am I missing something?
Thanks in advance
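In case it helps (my reading of the error, not an official answer): Python's sum() starts from the integer 0, so the very first addition is 0 + counts[0]. That dispatches to int.__add__, which fails, and Python then falls back to counts[0].__radd__; since the Value class above defines only __add__, the TypeError is raised. A minimal fix, matching what micrograd itself does in engine.py:

def __radd__(self, other):  # other + self, e.g. the 0 + counts[0] inside sum()
    return self + other

Alternatively, sum(counts, Value(0.0)) sidesteps the integer start value.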
I think micrograd.nn would be easier to understand if the function signatures had type annotations. The backwards compatibility of type annotations is limited, but it seems likely that very few people will use micrograd with Python < 3.8. I'm happy to help with this.
Also, docstrings might be helpful for people studying the code.
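For instance, a sketch of what an annotated signature could look like (the exact types here are my guess, not settled API):

from typing import List, Union
from micrograd.engine import Value

class Neuron:
    def __call__(self, x: List[Union[Value, float]]) -> Value:
        ...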
I followed 'The spelled-out intro to neural networks and backpropagation: building micrograd' on YouTube and am grateful to Andrej for making this. Learned a lot!
It's a nit that won't matter most of the time, but the topo sort implementation doesn't work if you have cycles in the graph.
I.e., there is a hard assumption that you're operating over a DAG.
I have implemented a tensor version based on numpy.ndarray; code: https://github.com/hkxIron/tensorgrad
Hi @karpathy,
congratulations on this repo/talk. The educational value is truly immense. Good job!
Can you please explain the main motivation for the _backward methods being implemented as lambdas, as opposed to one regular method that starts with a hypothetical switch on self._op and contains the implementation for all arithmetic cases?
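A sketch of the alternative being asked about (hypothetical code, not from the repo; it assumes the two operands are recoverable in order from the node, which micrograd's set-valued _prev doesn't actually guarantee):

def _backward(self):
    # hypothetical single-method dispatch on the op that produced this node
    if self._op == '+':
        a, b = self._operands  # assumed ordered pair, not micrograd's _prev set
        a.grad += self.grad
        b.grad += self.grad
    elif self._op == '*':
        a, b = self._operands
        a.grad += b.data * self.grad
        b.grad += a.data * self.grad
    # ... one branch per remaining op

One appeal of the closure style is that each _backward captures exactly the operands it needs and keeps the backward rule right next to the forward computation that produced it.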
Hey Andrej -- just want to say thanks so much for your YouTube video on micrograd. The video has been absolutely enlightening.
Quick question -- while re-implementing micrograd on my end, I noticed that __pow__ (in Value) was missing a back-propagation definition for other. Is this expected?
Lines 39 to 40 in c911406
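For context, here is the relevant definition as I understand it from engine.py (quoted from memory, so double-check against the pinned commit): the exponent is asserted to be a plain int/float rather than a Value, so there is no node for other to receive a gradient in the first place:

def __pow__(self, other):
    assert isinstance(other, (int, float)), "only supporting int/float powers for now"
    out = Value(self.data**other, (self,), f'**{other}')

    def _backward():
        self.grad += (other * self.data**(other - 1)) * out.grad
    out._backward = _backward

    return out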
I suggest you rename engine.py to value.py.
Reasoning:
- The engine name is misleading; the file doesn't contain some framework or domain logic.
- The engine.py file contains a single class named Value, so the best name for it is value.py.
I really appreciate your videos! Such a gift to all of us.
When adjusting parameters after computing the loss, the example multiplies the step size by the gradient, i.e., by both its sign and its magnitude. With steep gradients near a local minimum, a large gradient value will jump the parameter far from the desired solution; with shallow gradients, the parameter will struggle to reach its local minimum in the given number of iterations.
Thus, I think the adjustment should be the step size times the sign of the gradient.
What are your thoughts?
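A sketch of the proposed sign-only update next to the current one (the 0.01 step size and the model name are placeholders, and this ignores the zero-gradient edge case):

import math

step = 0.01
for p in model.parameters():
    # current update: scales with gradient magnitude
    # p.data -= step * p.grad
    # proposed update: fixed step in the direction of the gradient
    p.data -= step * math.copysign(1.0, p.grad)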
Hello,
I came across this from your YT video tutorials, thank you for making these!
In engine.py, you implement back propagation using explicit topological order computation.
Are there any reasons why we would not recursively call _backward for every child?
e.g. implement backward function in Value as such:
def backward(self):
    self._backward()
    for v in self._prev:
        v.backward()
Does it have something to do with how backprop is implemented in actual NN libraries? Is recursion harder to parallelise in practice compared to using topological ordering?
Thank you
Thank you @evcu for raising this. My little 2D toy problem converged, and instead of going on to proper tests and double-checking through the recursion I got all trigger-happy and amused with puppies. The core issue is that if variables are re-used, then their gradient will be accumulated once per path. Do you think this reference counting idea will work as a potentially simpler solution? The idea is to suppress backward() calls until the very last one.
(Love your stylized puppy in your branch btw! :D)
class Value:
    """ stores a single scalar value and its gradient """

    def __init__(self, data):
        self.data = data
        self.grad = 0
        self.backward = lambda: None
        self.refs = 0

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data)
        self.refs += 1
        other.refs += 1

        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += out.grad
            other.grad += out.grad
            self.backward()
            other.backward()
        out.backward = backward

        return out

    def __radd__(self, other):
        return self.__add__(other)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data)
        self.refs += 1
        other.refs += 1

        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
            self.backward()
            other.backward()
        out.backward = backward

        return out

    def __rmul__(self, other):
        return self.__mul__(other)

    def relu(self):
        out = Value(0 if self.data < 0 else self.data)
        self.refs += 1

        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += (out.data > 0) * out.grad
            self.backward()
        out.backward = backward

        return out

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"