karpathy / micrograd
A tiny scalar-valued autograd engine and a neural net library on top of it with a PyTorch-like API.
License: MIT License
https://github.com/rohit-krish/Deeplex
Your work sparked me to build this.
CuPy is used for GPU support.
Planning on improving it; peace!
Hi, Andrej,
Thanks for this excellent library! It may be useful not only for Python developers, but for C#, F#, Pascal etc. developers too, so I wrote a C# port for .NET ecosystem. The basic info about this is here: https://github.com/ColorfulSoft/System.AI/blob/master/Docs/micrograd.NET.md
Best,
Gleb S. Brykin
Why can't this function simply be implemented as follows? Am I missing something? We are dealing with a composite structure.
def backward(self, is_first=True):
    if is_first:
        self.grad = 1.0
    self._backward()
    for c in self._prev:
        c.backward(False)
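For what it's worth, here is a sketch (my own example, not from the repo) of why this recursion goes wrong as soon as a node is shared by more than one parent:

# sketch: x -> a = x + 1 -> b = a * 2 and c = a * 3 -> d = b + c
# correct result: d = 5*a = 5*(x + 1), so dd/dx = 5
# with the recursive scheme above:
#   d._backward()         -> b.grad = 1, c.grad = 1
#   recurse into b        -> a.grad += 2       (a.grad = 2, c's term missing)
#   recurse into a        -> x.grad += a.grad  (x.grad = 2, premature)
#   recurse into c        -> a.grad += 3       (a.grad = 5, now complete)
#   recurse into a again  -> x.grad += a.grad  (x.grad = 7, double-counted)
# the topological ordering guarantees each node's _backward runs exactly once,
# and only after every parent has contributed its share of the gradient.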
In zero_grad, we just zero the weight and bias graph nodes. But don't we need to zero the other graph nodes too, like those created for addition and multiplication, since the backprop gradient flows through them as well?
The sub operation implemented here would utilize the _backward method of the add operation. I believe this is wrong, because the _backward method for the add operation accumulates out.grad into both operands, but in the case of the sub operation it should accumulate out.grad for the positive operand and -out.grad for the negative operand.
For example:
a = b + c
d(a)/db = 1
d(a)/dc = 1
a = b - c
d(a)/db = 1
d(a)/dc = -1
So I think we need to add a separate _backward function for the sub operation, or modify the _backward method of the add operation.
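For reference, a note on how I believe subtraction is actually wired up in engine.py: there is no standalone sub backward at all. Subtraction is composed from addition and negation, and negation is a multiply by -1, so it is the multiply node's backward that flips the gradient sign for the negated operand:

# from micrograd/engine.py (as I understand it): sub is not a primitive op
def __neg__(self):  # -self
    return self * -1

def __sub__(self, other):  # self - other
    return self + (-other)

With this composition, d(b - c)/dc = -1 falls out automatically: the add node passes out.grad through unchanged, and the *(-1) node multiplies it by -1.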
Hi Prof Karpathy,
I wanted to create a discussion to ask this question, but there was no option to do so. I was watching https://youtu.be/VMj-3S1tku0 and got an idea.
This is in reference to the step of clearing accumulated gradients at:
Line 265 in c911406
People tend to forget to clear the accumulated gradients before the loss function's backward pass.
Idea: create a way to bind the loss function to the network once, and then clear accumulated gradients automatically when performing the backward pass.
We could then perform the backward pass whenever, wherever, and as many times as we want without worrying about accumulated gradients.
class Loss(Value):
    def __init__(self, bound_network):
        super().__init__(0.0)
        self.bound_network = bound_network

    def __call__(self, batch_size=None):
        # loss function definition
        self.data = data_loss + reg_loss

    def backward(self):
        # clear gradients of the bound network before accumulating new ones
        self.bound_network.zero_grad()
        super().backward()

total_loss = Loss(bound_network=model)

for k in range(100):
    # ...
    # model.zero_grad() is no longer needed here: since total_loss is bound
    # to the network, backward() performs it automatically
    total_loss.backward()
    # ...
@karpathy Hi,
Great work... simple and clean.
I was inspired and made my own mini deep learning library :P
I'm wondering whether we could extend this with GPU support?
I would be more than happy if you checked out my repository.
https://github.com/kartik4949/deepops
Thanks and keep doing work like this more :)
p.s: I'm a fresher :P
def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward

    return out
If you have an expression of the form (x*y)*(x*z), then the gradient w.r.t. x is not additive, right?
So you can do higher-order autodiff.
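As a sanity check (my own example, assuming micrograd is installed), the += accumulation in _backward does handle a variable that is used twice: for f = (x*y)*(x*z) = x^2*y*z, the product-rule contributions along each path are summed, giving df/dx = 2xyz:

from micrograd.engine import Value

x, y, z = Value(3.0), Value(2.0), Value(4.0)
f = (x * y) * (x * z)   # f = x^2 * y * z
f.backward()
print(x.grad)           # 2*x*y*z = 48.0: both paths' contributions accumulate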
In micrograd's engine.py, line 76: the code should be 'return other + self', not 'return self + other'.
Hello guys. I wrote a MiniGrad with the RAdam optimizer; it can be found there.
Here is a vectorized implementation with PyTorch flavor built on top of NumPy / CuPy: https://github.com/conscell/ugrad
Hi, unless I'm misunderstanding something, zero_grad in nn.py is zeroing out the gradients on the parameter nodes, but shouldn't it do it on all the nodes in the graph? Otherwise the inner nodes will keep accumulating them.
In the video "The spelled-out intro to neural networks and backpropagation: building micrograd" you present the following code:
n = MLP(3, [4, 4, 1])

xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired targets

for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # backward pass
    for p in n.parameters():
        p.grad = 0.0
    loss.backward()

    # update
    for p in n.parameters():
        p.data += -0.1 * p.grad

    print(k, loss.data)
However, before calling loss.backward() we should reset the grad for ALL values, not just for n.parameters(), because every iteration of loss.backward() changes the grad (+=...) for all of them.
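For what it's worth, a small check (my own sketch, assuming micrograd's engine) suggesting why zeroing just the parameters is enough in this loop: the intermediate Values (ypred, loss, and everything in between) are rebuilt from scratch on every iteration, so their stale gradients are discarded along with them; only the parameter nodes survive across iterations:

from micrograd.engine import Value

w = Value(2.0)                # long-lived "parameter"
for k in range(2):
    x = Value(3.0)            # fresh "activation" rebuilt every iteration
    loss = w * x
    w.grad = 0.0              # zeroing the surviving node is enough
    loss.backward()
    print(k, w.grad, x.grad)  # w.grad is 3.0 both times; x is brand new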
Hi Andrej,
Many thanks for micrograd & its accompanying video; they deepened my understanding of backprop considerably!
I notice that in the current implementation, calling backward() repeatedly is non-idempotent, because the grads just keep accumulating. This seems like something people are likely to trip over. The fix is simple: in the def of backward(), just above
# go one variable at a time and apply the chain rule to get its gradient
self.grad = 1
add
# reset gradients to ensure they don't get repeatedly accumulated
for v in reversed(topo):
    v.grad = 0
Just submitted PR 54 for your consideration; it makes that one change.
Example of non-idempotence with current master branch: given a simple tree where a = 3, b = 2, c = a + b, d = 1, e = c * d (all leaves as Values of course):
>>> print_grads()
a: 0, b: 0, c: 0, d: 0, e: 0
>>> e.backward()
>>> print_grads()
a: 1.0, b: 1.0, c: 1.0, d: 5.0, e: 1
>>> e.backward()
>>> print_grads()
a: 3.0, b: 3.0, c: 2.0, d: 10.0, e: 1
Maybe not PR worthy, but I guess one can abstract the MLP implementation even more by passing in the layers themselves instead of the numbers of inputs and outputs, since each individual layer already knows them.
As such, I wrote it as:
class MLP:
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, x):
        for l in self.layers:
            x = l(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
by which you can define a network more intuitively, much like PyTorch's Sequential:
n = MLP([Layer(3, 6), Layer(6, 3), Layer(3, 1)])
To be even more rigorous, a dimension assertion can be added in the __init__:
class MLP:
    def __init__(self, layers):
        self.layers = layers
        for i in range(1, len(layers)):
            assert layers[i-1].nout == layers[i].nin
for which I would have to store the nin and nout on the Layer as well:
class Layer:
    def __init__(self, nin, nout):
        self.nin = nin
        self.nout = nout
        self.neurons = [Neuron(nin) for _ in range(nout)]
Hi @karpathy,
I was solving the assignment mentioned in the YouTube video. In the softmax function, I was getting the following error: TypeError: unsupported operand type(s) for +: 'int' and 'Value'
This is the code where I am getting the error:
def softmax(logits):
    counts = [logit.exp() for logit in logits]
    denominator = sum(counts)  # here I am getting the TypeError
    out = [c / denominator for c in counts]
    return out
And my __add__ function in the Value class is the following:
def __add__(self, other):  # exactly as in the video
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward

    return out
So my query is about the built-in sum over a list. It presumably does something like counts[i].__add__(counts[i+1]) and keeps adding to the result until the end of the list, so this __add__ should work well. But I am not sure why it is not working; am I missing something?
Thanks in advance
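In case it helps (my reading of the error, not an official answer): Python's sum() starts from the integer 0, so the very first addition is 0 + counts[0]. That dispatches to int.__add__, which fails, and Python then falls back to counts[0].__radd__; since the Value class above defines only __add__, the TypeError is raised. A minimal fix, matching what micrograd itself does in engine.py:

def __radd__(self, other):  # other + self, e.g. the 0 + counts[0] inside sum()
    return self + other

Alternatively, sum(counts, Value(0.0)) sidesteps the integer start value.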
I think micrograd.nn would be easier to understand if the function signatures had type annotations. The backwards compatibility of type annotations is limited, but it seems likely that very few people will use micrograd with Python < 3.8. I'm happy to help with this.
Also, docstrings might be helpful for people studying the code.
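For instance, a sketch of what an annotated signature could look like (the exact types here are my guess, not settled API):

from typing import List, Union
from micrograd.engine import Value

class Neuron:
    def __call__(self, x: List[Union[Value, float]]) -> Value:
        ...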
I followed 'The spelled-out intro to neural networks and backpropagation: building micrograd' on YouTube and am grateful to Andrej for making this. Learned a lot!
It's a nit that won't matter most of the time, but the topo sort implementation doesn't work if you have cycles in the graph.
I.e., there is a hard assumption that you're operating over a DAG.
I have implemented a tensor version based on numpy.ndarray; code: https://github.com/hkxIron/tensorgrad
Hi @karpathy,
congratulations on this repo/talk. The educational value is truly immense. Good job!
Can you please explain the main motivation for the _backward methods being implemented as lambdas, as opposed to one regular method that starts with a hypothetical switch on self._op and contains the implementation for all arithmetic cases?
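A sketch of the alternative being asked about (hypothetical code, not from the repo; it assumes the two operands are recoverable in order from the node, which micrograd's set-valued _prev doesn't actually guarantee):

def _backward(self):
    # hypothetical single-method dispatch on the op that produced this node
    if self._op == '+':
        a, b = self._operands  # assumed ordered pair, not micrograd's _prev set
        a.grad += self.grad
        b.grad += self.grad
    elif self._op == '*':
        a, b = self._operands
        a.grad += b.data * self.grad
        b.grad += a.data * self.grad
    # ... one branch per remaining op

One appeal of the closure style is that each _backward captures exactly the operands it needs and keeps the backward rule right next to the forward computation that produced it.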
Hey Andrej -- just want to say thanks so much for your YouTube video on micrograd. The video has been absolutely enlightening.
Quick question -- while re-implementing micrograd on my end, I noticed that __pow__ (in Value) was missing a back-propagation definition for other. Is this expected?
Lines 39 to 40 in c911406
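For context, here is the relevant definition as I understand it from engine.py (quoted from memory, so double-check against the pinned commit): the exponent is asserted to be a plain int/float rather than a Value, so there is no node for other to receive a gradient in the first place:

def __pow__(self, other):
    assert isinstance(other, (int, float)), "only supporting int/float powers for now"
    out = Value(self.data**other, (self,), f'**{other}')

    def _backward():
        self.grad += (other * self.data**(other - 1)) * out.grad
    out._backward = _backward

    return out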
I suggest you rename engine.py to value.py.
Reasoning:
- The engine name is misleading; the file doesn't contain some framework or domain logic.
- The engine.py file contains a single class named Value, so the best name for it is value.py.
I really appreciate your videos! Such a gift to all of us.
When adjusting parameters after computing the loss, the example multiplies the step size by the gradient, i.e., by both its sign and its magnitude. With steep gradients near a local minimum, a large gradient value will jump the parameter far from the desired solution; with shallow gradients, the parameter will struggle to reach its local minimum in the given number of iterations.
Thus, I think the adjustment should be the step size times the sign of the gradient.
What are your thoughts?
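A sketch of the proposed sign-only update next to the current one (the 0.01 step size and the model name are placeholders, and this ignores the zero-gradient edge case):

import math

step = 0.01
for p in model.parameters():
    # current update: scales with gradient magnitude
    # p.data -= step * p.grad
    # proposed update: fixed step in the direction of the gradient
    p.data -= step * math.copysign(1.0, p.grad)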
Hello,
I came across this from your YT video tutorials, thank you for making these!
In engine.py, you implement back propagation using explicit topological order computation.
Are there any reasons why we would not recursively call _backward for every child?
e.g. implement backward function in Value as such:
def backward(self):
    self._backward()
    for v in self._prev:
        v.backward()
Does it have something to do with how backprop is implemented in actual NN libraries? Is recursion harder to parallelise in practice compared to using topological ordering?
Thank you
Thank you @evcu for raising this. My little 2D toy problem converged, and instead of going on to proper tests and double-checking through the recursion I got all trigger-happy and amused with puppies. The core issue is that if variables are re-used, then their gradient will be accumulated once per path. Do you think this reference counting idea will work as a potentially simpler solution? The idea is to suppress backward() calls until the very last one.
(Love your stylized puppy in your branch btw! :D)
class Value:
    """ stores a single scalar value and its gradient """

    def __init__(self, data):
        self.data = data
        self.grad = 0
        self.backward = lambda: None
        self.refs = 0

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data)
        self.refs += 1
        other.refs += 1

        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += out.grad
            other.grad += out.grad
            self.backward()
            other.backward()
        out.backward = backward

        return out

    def __radd__(self, other):
        return self.__add__(other)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data)
        self.refs += 1
        other.refs += 1

        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
            self.backward()
            other.backward()
        out.backward = backward

        return out

    def __rmul__(self, other):
        return self.__mul__(other)

    def relu(self):
        out = Value(0 if self.data < 0 else self.data)
        self.refs += 1

        def backward():
            if out.refs > 1:
                out.refs -= 1
                return
            self.grad += (out.data > 0) * out.grad
            self.backward()
        out.backward = backward

        return out

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"