Comments (6)
This is cool. A small note that it's not just dense updates that we'd stop on - one could imagine other update types (e.g. "randomly set the values by drawing from a normal distribution with seed N" or "set all the values to 1") which are not dense per se (i.e. they don't involve storing explicit parameter values) but do involve setting all the parameter values while ignoring the previous values. It's probably best to distinguish between updates that are truly updates (i.e. they rely on modifying the previous state) and those that aren't, and just look for the first instance of the latter kind. As an aside, I think it's informative to think about a from-scratch training run - ideally the first commit would just say "add these parameter groups and initialize them in this way".
I had something else to say but I forgot; maybe I will think of it another time.
from git-theta.
Yeah, we definitely want to support stopping at other update types.
I think a recursive solution would handle that. A "true update" would look up the previous update type in git and call its `.get` (or whatever) method. If the previous one is also a "true update" it will continue the recursion. A "fake update" like dense or your random-value one would just return its values as-is and function as a base case, without needing to enumerate which update types overwrite all params.
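A toy sketch of that recursion; the class and method names (`DenseUpdate`, `SparseUpdate`, `get`) are illustrative assumptions, not git-theta's real API:

```python
class DenseUpdate:
    """A "fake" update: stores full values and ignores previous state (base case)."""

    def __init__(self, values):
        self.values = values

    def get(self):
        return list(self.values)


class SparseUpdate:
    """A "true" update: depends on whatever the previous update produced."""

    def __init__(self, previous, indices, values):
        self.previous = previous  # the prior update, as looked up from git
        self.indices = indices
        self.values = values

    def get(self):
        # Recurse; any update type that ignores history terminates the chain,
        # so nobody has to enumerate which update types overwrite all params.
        params = self.previous.get()
        for i, v in zip(self.indices, self.values):
            params[i] = v
        return params


base = DenseUpdate([1.0, 1.0, 1.0, 1.0])  # e.g. "set all the values to 1"
patched = SparseUpdate(base, indices=[2], values=[-3.0])
print(patched.get())  # [1.0, 1.0, -3.0, 1.0]
```

The nice property is that adding a new history-ignoring update type only requires it to return its own values; the recursion terminates there automatically.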
#84 was very close to implementing this approach; however, it seems like it is not possible to do this correctly when using git to time travel (`git checkout ${commit}`, `git checkout branch`, etc.).
The gist of it is that the code that looks back through the git history to rebuild a parameter needs to know where in the history to start looking. In something like a checkout, we only know the commit we are currently at; within the smudge filter there isn't a way to know what commit we are going to.
So basically the result is that whenever we time travel, we end up with the smudged model checkpoint of where we were, not where we wanted to be. Running `git reset --hard` fixes this, but we don't want to have to run that every time.
I talked with @nkandpa2 about this issue and neither of us found a way to fix it. Thus we took a lot of the ideas from how this implementation of updates worked and applied them to a file-system-based method of tracking and applying updates in #92.
I'm closing this as we don't think the git approach will work, but I'll leave the branch with the implementation on my fork as it may be useful to revisit in the future.
Re-opening this discussion. I can't remember if we talked about this solution, but why wouldn't it work to store the hash of HEAD in the metadata file at clean time?
For example:
- I stage a model for the first time and the hash of HEAD is 1. When it gets staged, the metadata file contains the key `"previous_commit": 1`. I commit this checkpoint and now the hash of HEAD is 2.
- I make a sparse update to the model and stage that. The staged metadata file contains `"previous_commit": 2`. I commit this checkpoint and now HEAD is 3.
- I make another sparse update to the model and stage it. The staged metadata file contains `"previous_commit": 3`. I commit this checkpoint and now HEAD is 4.
- I make as many other commits as I like.
Now say I run `git checkout 4`. The smudge filter reads the metadata file and loads up the files in commit 4. It sees that commit 4 was a sparse update, so it looks up the `"previous_commit"` key in the metadata file and recursively loads commit 3. Since 3 is also a sparse update, it looks up the key in its metadata file and recursively loads commit 2. Finally, commit 2 is a dense update, so we don't need to recurse any further.
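A stand-alone sketch of that chain. Real git-theta would read metadata out of git objects; here the history is a plain dict so the recursion over `"previous_commit"` links can run by itself, and the metadata layout is an assumption for illustration:

```python
# Toy commit history keyed by commit id; layout is hypothetical.
HISTORY = {
    2: {"update_type": "dense", "values": [0.0, 0.0, 0.0, 0.0]},
    3: {"update_type": "sparse", "previous_commit": 2,
        "indices": [1], "values": [5.0]},
    4: {"update_type": "sparse", "previous_commit": 3,
        "indices": [3], "values": [7.0]},
}


def smudge(commit):
    """Rebuild full parameter values at `commit` by following the chain."""
    metadata = HISTORY[commit]
    if metadata["update_type"] == "dense":
        # Base case: a dense update stores the full values directly.
        return list(metadata["values"])
    # Recursive case: rebuild the previous checkpoint, then overlay
    # this commit's sparse changes on top of it.
    params = smudge(metadata["previous_commit"])
    for i, v in zip(metadata["indices"], metadata["values"]):
        params[i] = v
    return params


print(smudge(4))  # [0.0, 5.0, 0.0, 7.0]
```

Because the starting commit comes from the metadata file itself rather than from HEAD, the smudge filter no longer needs to know "where we are going" during a checkout.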
Are there any issues with this solution?
We talked about this solution and it seems like it will work. This branch has some tools for getting files from the git history which should help in the multi-pointer PR too.
We can get this up and running once the multi-pointer branch is working with dense updates.
One of the main questions for us to explore is whether we will be able to track the last update directly or will need to iterate through history to find it, but either way it will work.
In the original git-tracks-updates implementation I occasionally had times where it was slow to re-build indices on something like a checkout. In the new format, the only file getting indexed is the main metadata file (not each parameter file), so it should be faster?
One question this does bring up is our tree-processing algorithms. Currently we essentially process the parameter tree depth-first, where each parameter is processed individually (which might involve moving backwards through the git history), and that could cause repeated work. It might be more efficient to collect all parameters that have changed in a batch and then go back in time once, updating each parameter as appropriate. But before a large refactor like that we should 1) test that it is actually an issue and 2) check whether memoizing our "get file from git history" function fixes whatever issue there is.
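The memoization idea could be as small as wrapping the history lookup in `functools.lru_cache`, so that several parameters walking back through the same commit only trigger one underlying read. The function below is a counting stand-in for the real git read, not git-theta's actual helper:

```python
import functools

CALLS = {"n": 0}  # counts how often the expensive lookup actually runs


@functools.lru_cache(maxsize=None)
def get_file_from_history(commit, path):
    # Stand-in for reading `path` at `commit` out of git; arguments must
    # be hashable for lru_cache to work, which commit ids and paths are.
    CALLS["n"] += 1
    return f"contents of {path} at {commit}"


# Three parameters that each walk back through the same commit
# only hit the underlying read once.
for _ in range(3):
    get_file_from_history("abc123", "model/metadata.json")
print(CALLS["n"])  # 1
```

If the cache alone removes the slowdown, the depth-first traversal can stay as-is and the batched refactor may not be needed.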
Closed by #114