
git-theta's Issues

Prevent users from unintentionally running `git add <checkpoint>`

Regular files are staged with git add <file> while checkpoints are staged with git theta add <checkpoint file>. From talking with users, we've found a common mistake is trying to stage some code by running something like git add . and unintentionally staging a checkpoint file in the current directory. We should prevent this behavior.
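
One possible direction (purely a sketch; the environment variable and the idea that git theta add sets it are assumptions, not a settled design) is for the clean filter to refuse to run unless it was invoked via git theta add:

# Hypothetical guard at the top of the clean filter: `git theta add` would set
# GIT_THETA_INVOKED=1 in the environment before calling `git add`, so a bare
# `git add <checkpoint>` (or `git add .`) fails loudly instead of silently
# staging the checkpoint.
import os
import sys

def ensure_invoked_via_git_theta(path: str):
    if os.environ.get("GIT_THETA_INVOKED") != "1":
        sys.stderr.write(
            f"{path} is tracked by git-theta; stage it with `git theta add {path}` "
            "instead of `git add`.\n"
        )
        sys.exit(1)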

Updates to git_cml root

  1. Use TensorStore (or something else) for the leaf nodes (see the sketch after this list)
  2. Integrate LFS for tracking files in the git_cml root
  3. Parent directory should be full filename
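
For item 1, a minimal sketch of what writing a leaf node with TensorStore could look like (the zarr driver and the path layout under the git_cml root are assumptions, not a settled design):

import numpy as np
import tensorstore as ts

# Open (and create) a zarr array under the git_cml root for one parameter.
store = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": ".git_cml/model/layer1/weight"},
    },
    create=True,
    dtype=ts.float32,
    shape=[768, 768],
).result()

# Write the parameter value; reading back is store.read().result().
store.write(np.zeros((768, 768), dtype=np.float32)).result()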

Figure out interface (functions vs command-line tool)

We also need to decide what operations we would support. The obvious requirements for a POC, in order of implementation:

  1. Commit
  2. Apply
  3. Revert
  4. Checkout
  5. Log

Beyond that, we would also want to consider:

  1. Merge
  2. Branch

More robust logging configuration

Currently, logging is configured only through basicConfig and everything is logged at the debug level.

We should update this: we should also log to a file (whose location is user controllable), especially for the clean and smudge filters; some messages should be at the debug level and some at info; and there should be a user-configurable way to control log verbosity.

Ideally there would also be a way to see our debug messages without getting the ones from GitPython, as some of their debug logs look like errors (the message about CYGWIN, for example) and appear to come from git-theta because we are currently configuring the root logger.
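
A minimal sketch of what this could look like, using a dedicated git_theta logger so the root logger (and therefore GitPython) is left alone; the environment variable names below are hypothetical:

# Hypothetical sketch: configure only the `git_theta` logger from environment
# variables (GIT_THETA_LOG_FILE / GIT_THETA_LOG_LEVEL are assumed names), so we
# never touch the root logger and GitPython's debug output stays separate.
import logging
import os

def configure_logging():
    logger = logging.getLogger("git_theta")
    logger.setLevel(os.environ.get("GIT_THETA_LOG_LEVEL", "INFO").upper())
    logger.propagate = False  # don't bubble messages up to the root logger

    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")

    # Clean/smudge filters must not write log noise to stdout (git reads it),
    # so log to stderr plus an optional user-controlled file.
    stream = logging.StreamHandler()  # defaults to stderr
    stream.setFormatter(formatter)
    logger.addHandler(stream)

    log_file = os.environ.get("GIT_THETA_LOG_FILE")
    if log_file:
        file_handler = logging.FileHandler(log_file)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger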

Force merge conflicts

Have the global checkpoint checksum written out somewhere so that git always flags a merge conflict.

Support `git reset --hard`

After staging a change to a checkpoint with git theta add /path/to/my-model.pt, we should be able to use git reset --hard to unstage the change and discard working-tree modifications, restoring the checkpoint to the last commit.

Currently this results in a file-not-found error for the ${git_repo}/path/to/my-model.pt file during one of the smudges.

Create a System Diagram

  • Define all the pieces in the pipeline
  • Define input and output for each piece
  • Define functionality of each piece
  • Identify which pieces are required for the PoC and which pieces can be built later

Replace init with install and track and support specifying checkpoint format

Replace git cml init with git cml install (only run once) and git cml track (run separately for each file).

git cml track should specify both the file to be tracked and the checkpoint format. There will need to be an attribute/metadata file somewhere in .git_cml that records which format each tracked checkpoint uses.
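
A sketch of what git cml track might record; the file name and schema below are placeholders, not a settled format:

# Hypothetical bookkeeping for `git cml track <checkpoint> <format>`: record the
# checkpoint's format in a metadata file under .git_cml.
import json
import os

def track(checkpoint_path: str, checkpoint_format: str, repo_root: str = "."):
    meta_path = os.path.join(repo_root, ".git_cml", "tracked.json")
    os.makedirs(os.path.dirname(meta_path), exist_ok=True)
    tracked = {}
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            tracked = json.load(f)
    tracked[checkpoint_path] = {"checkpoint_format": checkpoint_format}
    with open(meta_path, "w") as f:
        json.dump(tracked, f, indent=2, sort_keys=True)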

Add `.name` as a `@property` to our checkpoint handler objects

We should record a .name value on each of the plugins. This property on a checkpoint object should return a string such that calling get_checkpoint with that string returns the class of the object.

This will make things like logging what checkpoint type is used (and making sure we use the same one across multiple cleans, etc) much easier, especially when the value is set via an environment variable.
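
A minimal sketch, assuming handlers are classes looked up by name; get_checkpoint and the handler below are illustrative, not the current API:

class Checkpoint:
    @property
    def name(self) -> str:
        raise NotImplementedError

class PyTorchCheckpoint(Checkpoint):
    @property
    def name(self) -> str:
        return "pytorch"

_CHECKPOINTS = {"pytorch": PyTorchCheckpoint}

def get_checkpoint(name: str):
    # Look up a handler class by the string its instances report via `.name`.
    return _CHECKPOINTS[name]

# Round trip: the name recorded from an instance (e.g. logged, or read from an
# environment variable) resolves back to the same class.
assert get_checkpoint(PyTorchCheckpoint().name) is PyTorchCheckpoint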

Hello world git example

  • Run on a simple .json file (specifying parameter name -> parameter value)
  • Implement simple workflow for initial model -> make a change to a model -> produce diff file -> checkout commit
  • Simple example showing applying and rewinding a few changes

Figure out storage of the initial checkpoint

Probably needs to be able to refer to an external location for the initial checkpoint since we don't want to store it all in git - it's too big. Might look like git LFS. Bonus: Store the random seed that can be used to reconstruct the initial parameter values.

Run black formatting on `bin/` scripts.

The scripts in our bin/ directory don't end in .py, so they seem to get missed by black (I have confirmed they are missed in the pre-commit hook and I am pretty sure they are missed in the CI lint).

Update both the pre-commit hook and the CI to actually format these files. This will probably require a regex, as I think specifying specific files in pre-commit removes the default file-type-based change detection.

Unify flattened leaf iteration into flattened maps.

Lots of code uses slight variations of (sorted) iteration through (value, key) pairs to do things like intersections and unions.

Convert these functions to use flattened maps and things like dict.update methods.
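
As a sketch of the target style (flatten and the dicts below are illustrative):

def flatten(tree, prefix=()):
    # Flatten a nested dict into {(path, to, leaf): value}.
    flat = {}
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

old = flatten({"layer1": {"weight": 1, "bias": 2}})
new = flatten({"layer1": {"weight": 3}, "layer2": {"weight": 4}})

# Intersections and unions become plain dict/set operations instead of
# carefully aligned sorted iterations over (value, key) pairs.
changed = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
merged = {**old, **new}  # dict.update semantics: values from `new` win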

Investigate using git for tracking sparse updates and git smudge to apply them.

Instead of tracking/applying sparse updates manually (for example storing them in a different directory), can we just check in sparse updates and then move backwards through git history to build the real value (applying the updates)?

I have written this recursive smudge: when you smudge a file, it is transformed to include the content at each point in the history where it changed (and the commit at which the change happened).

#!/bin/bash
# Recursive smudge: print the file's content at every commit where it changed,
# walking backwards from ${COMMIT} (defaults to HEAD).

COMMIT=${2:-"HEAD"}

echo "----------------------------" >> /tmp/smudge.log
echo "${COMMIT}" >> /tmp/smudge.log

if [ "${COMMIT}" != "HEAD" ]; then
  PREV_COMMIT="${COMMIT}~1"
else
  PREV_COMMIT="${COMMIT}"
fi

echo "${PREV_COMMIT}" >> /tmp/smudge.log

echo "I'm running smudge"
# The most recent commit at or before ${PREV_COMMIT} that touched the file.
LAST_CHANGE=$(git rev-list -1 "${PREV_COMMIT}" -- "$1")
echo "${LAST_CHANGE}" >> /tmp/smudge.log

if [ -z "${LAST_CHANGE}" ]; then
  exit 0
else
  echo "The last time this file changed was ${LAST_CHANGE}"
  git show "${LAST_CHANGE}:$1"
  # Recurse to include content from earlier points in the file's history.
  /usr/local/google/home/brianlester/dev/git-theta-test/smudge.sh "$1" "${LAST_CHANGE}"
fi

Note, we can't run something like git checkout ${COMMIT} from inside a smudge but we can run things like git show and git rev-list.

We can apply this same idea to parameters. Reading in a sparse update will recurse backwards through history until it hits a dense update. Once the dense update is reached (it just returns its value), each sparse update (read from git) will be applied as we move back up the stack.
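
Roughly, in sketch form (is_dense, load_update, and apply_sparse are hypothetical helpers; only the git commands are real):

import subprocess

def git_show(commit: str, path: str) -> bytes:
    # Contents of `path` as of `commit`, without touching the working tree.
    return subprocess.run(
        ["git", "show", f"{commit}:{path}"], capture_output=True, check=True
    ).stdout

def reconstruct(path: str, commit: str = "HEAD"):
    # Walk back until a dense update, then replay sparse updates up the stack.
    last_change = subprocess.run(
        ["git", "rev-list", "-1", commit, "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    update = load_update(git_show(last_change, path))  # hypothetical
    if is_dense(update):                               # hypothetical
        return update
    base = reconstruct(path, f"{last_change}~1")
    return apply_sparse(base, update)                  # hypothetical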

The main open questions are:

  • Does this still work when we hit a commit with multiple parents (from a merge, for example)?
  • Can tensorstore read a tensor when the binary blob (and the metadata file) are byte sequences from git show?
    • If it can't, this solution would need to write the blobs to a temporary space, causing an extra read/write per updated parameter. This could be mitigated by only doing it for updated parameters, but could be costly otherwise.

Update git-theta metadata file format

Currently the metadata file produced by the clean filter looks like this

{
  "model/scoping/to/param/1-weight shape": List[int],
  "model/scoping/to/param/1-weight dtype": str,
  "model/scoping/to/param/1-weight hash": str,
  ...,
  "model/scoping/to/param/2-bias shape": List[int],
  "model/scoping/to/param/2-bias dtype": str,
  "model/scoping/to/param/2-bias": str,
  ...

To make fetching metadata for a single parameter we are converting to a nested format:

{
  "model/scoping/to/param/1-weight": {
      "tensor_metadata": {
        "shape": List[str],
        "dtype": str,
        "hash": str,
      },
  },
  ...,
  "model/scoping/to/param/2-bias": {
      "tensor_metadata": {
        "shape": List[str],
        "dtype": str,
        "hash": str,
      },
  },
  ...,
}

Tensor metadata is in its own nested dict because we may eventually add other keys, like git_theta_metadata, for tracking things like update types.

Note: We need a consistent serialization order (lexical sort on keys of each dict) when writing to disk to support diffs.
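
For the serialization note, json.dump with sorted keys is probably enough (write_metadata is just a sketch name):

import json

def write_metadata(metadata: dict, path: str):
    # Lexically sorted keys at every nesting level keep the on-disk order stable,
    # so textual diffs of the metadata file only show real changes.
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)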

Add basic integration tests

Add a simple test that creates a pytorch checkpoint and does a few operations on it; set up continuous integration to run it.

Remove `iterate_*_leaves`

With the change to using flat maps, we are no longer using the iterate_(dict|dir)_leaves functions.

They should be removed. The biggest part of the effort is that new functions like flatten and walk_dir are mostly only tested indirectly through these iterate functions; the tests need to be updated to exercise the functions we actually use.

Create binary for merging

It should probably just always signal a merge conflict for now. Eventually we could also implement parameter averaging, or allow merges when complementary sets of parameters are updated.

Add optional framework installs.

Now that checkpoint handling is plugin-based, we don't need all of the deep learning frameworks installed all the time, so we shouldn't install them all by default, especially given that they can be heavy.

Update the setup.py to include extras_require for various frameworks that install them with git-theta. Also include some target that installs all the frameworks, or at least some of the most popular ones.
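
A sketch of the setup.py change; the framework lists, version pins, and extra names are placeholders:

from setuptools import setup

setup(
    name="git_theta",
    # ... existing arguments ...
    install_requires=["numpy"],
    extras_require={
        "pytorch": ["torch"],
        "tensorflow": ["tensorflow"],
        # convenience target that pulls in every supported framework
        "all": ["torch", "tensorflow"],
    },
)

Users would then run something like pip install git_theta[pytorch] to get only the framework they need.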

Update `params` module

Currently the params module uses torch as a dependency only to convert the tensor back into a numpy array.

As we are working on supporting multiple checkpoint formats, can we just use numpy for most of these methods?
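
One way to do that (names are illustrative) is to keep the framework-specific conversion in the checkpoint handler so params only ever sees numpy arrays:

import numpy as np

def to_numpy(tensor) -> np.ndarray:
    # Each checkpoint handler would own this conversion for its framework; the
    # params module then operates purely on np.ndarray.
    if hasattr(tensor, "detach"):  # e.g. a torch.Tensor
        tensor = tensor.detach().cpu()
    if hasattr(tensor, "numpy"):
        return tensor.numpy()
    return np.asarray(tensor)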

Meeting Notes (Running Thread)

January 19th, 2022


Summary of work the previous week

  • Read the proposal and blog post for VCS for collaborative update of models
  • Created Drive Folder for project

Meeting Summary

  • Why do we need sparse updates and other communication-efficiency strategies?
  • With large models, updating all the parameters can create very large checkpoints that would become infeasible to store (diff history) and communicate.
  • May not be as much of a problem with small models or models that are rarely updated.
  • Merging updates from models is not fully in the scope of this project; it is the next layer after building a version control system.
  • Fall back on some kind of averaging method, or, for newly added layers that are not conflicting, it would be a simple merge (e.g. kNN), mixture of models.
  • What do we do in the case of merge conflicts that cannot be resolved automatically?
  • Some form of distillation.
  • Last semester we tried to see how we could merge different update methods.
  • Evaluation/downstream tasks.
  • Differentiate the scope of this project as building something similar to Git but not dealing with CI (continuous integration) just yet.
  • Eventually we may also want to know what data and hyperparameters resulted in a given model update, but that's an added layer.
  • If one were to update a large model, wouldn't one also need to be resource-rich to even load these large models for training?
  • Yes, but there are ways to run them on a single GPU -> DeepSpeed ZeRO.
  • A very basic version of a VCS using Git with a model stored in ONNX format? So every time you update the model, git saves your version history?
  • May support some update types and not others - need to explore this.
  • Does git only store line-level changes or is it more nuanced?

ToDo List

Please take a look at the notebook and see if you can figure out a cleaner way to update a specific parameter value in the ONNX checkpoint. I'm currently doing initializer[1]; it would be nice to choose it by parameter name, and also to figure out why it's called "6", etc. Possibly also play around with the on-disk format and see whether it's at all usable by git.
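
For reference, one way to select an initializer by name with the onnx Python API ("model.onnx" is a placeholder path, and "6" is just the opaque generated name mentioned above):

import onnx
from onnx import numpy_helper

model = onnx.load("model.onnx")

# Index the initializers by name instead of by position (initializer[1]).
params = {init.name: init for init in model.graph.initializer}
print(sorted(params))  # inspect the generated names, e.g. "6"

weight = numpy_helper.to_array(params["6"])             # read as a numpy array
updated = numpy_helper.from_array(weight + 1.0, name="6")
params["6"].CopyFrom(updated)                           # write the update back in place
onnx.save(model, "model-updated.onnx")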
