
Comments (7)

nkandpa2 commented on August 14, 2024

Desired Behavior:

  • When you commit changes to a model, only the diffs are stored
  • Integrate with git so that the same system is used to manage source code and models

Background:

  • Clean filter
    • Specifies a program that runs when staging files that match some pattern
    • git add foo.txt ==> clean foo.txt | git add
  • Smudge filter
    • Specifies a program that runs when checking out files that match some pattern
    • git checkout <commit_hash> ==> smudge <commit_hash> | git checkout
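Both filters are wired up through git's filter-driver mechanism. A minimal configuration might look like the following, where the filter name `theta` and the driver programs `theta-clean`/`theta-smudge` are placeholders for whatever this project ships (`filter.<name>.clean`, `filter.<name>.smudge`, and the `%f` path placeholder are standard git config):

```shell
# .gitattributes: route *.pt files through a custom filter driver
#   *.pt filter=theta

# Register the (hypothetical) clean/smudge programs; %f expands to the
# path of the file being filtered
git config filter.theta.clean  "theta-clean %f"
git config filter.theta.smudge "theta-smudge %f"

# Fail the checkout/add rather than fall back to the raw content
git config filter.theta.required true
```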

Design 1

  • Workflow:
    1. Commit initial my_model.pt as you would any other file in git
    2. Update my_model.pt in-place
    3. Commit updated my_model.pt as you would any other file in git
  • Implementation
    • Define a clean filter for *.pt files that:
      1. If it does not yet exist, creates a directory .git_ml/diffs/my_model/
      2. Computes the diff between the updated my_model.pt and its previous version
      3. Stores the diff file in .git_ml/diffs/my_model/model.diff
      4. Stages the diff file
      5. Does not add my_model.pt itself to the staging area
    • Define a smudge filter for *.pt files that:
      1. Checks for diff at .git_ml/diffs/my_model/model.diff
      2. Goes back through revision history of model.diff
      3. Iteratively applies diffs from revision history of model.diff to my_model.pt
  • Pros
    • The fact that my_model.pt is handled specially is transparent to the user
    • git checkout only requires checking out the initial version of my_model.pt (which is large and never changes)
      and all the small diffs up until the commit you are checking out
  • Cons
    • Requires clean filter to compute diff which is hard in general (e.g., disambiguating low-rank vs. dense update)
    • model.diff is stored in .git_ml/diffs/my_model, which makes it impossible to rename my_model.pt once it's been initially committed
      • This could probably be solved by keying the diff directory on something like the model file's hash instead of its name
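As a rough sketch of what the Design 1 filters would compute, here is a dense diff over a flat {name: values} checkpoint, with the smudge side replaying a sequence of diffs in order. Function names are hypothetical and real checkpoints would hold tensors rather than Python lists; the point is only the clean-computes / smudge-replays split:

```python
def compute_diff(prev, curr):
    """Clean side: elementwise dense diff between two {name: [values]} checkpoints."""
    return {name: [c - p for p, c in zip(prev[name], curr[name])]
            for name in curr}

def apply_diffs(base, diffs):
    """Smudge side: replay a sequence of diffs onto the base checkpoint."""
    state = {name: list(vals) for name, vals in base.items()}
    for diff in diffs:
        for name, delta in diff.items():
            state[name] = [v + d for v, d in zip(state[name], delta)]
    return state
```

Note that the hard part flagged in the cons above (recognizing that a given delta is actually low-rank) is exactly what this dense sketch does not attempt.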

Design 2

  • Workflow:
    1. Commit initial my_model.pt as you would any other file in git
    2. Training code modifies my_model.pt in-place and also produces a diff file in .git_ml/diffs/my_model/model.diff
    3. Commit updated my_model.pt as you would any other file in git
  • Implementation:
    • Define a clean filter for *.pt files that:
      1. Stages .git_ml/diffs/my_model/model.diff
      2. Restores my_model.pt to its previous version
    • Define a smudge filter for *.pt files that:
      1. Checks for diff at .git_ml/diffs/my_model/model.diff
      2. Goes back through revision history of model.diff
      3. Iteratively applies diffs from revision history of model.diff to my_model.pt
  • Pros:
    • git checkout only requires checking out the initial version of my_model.pt (which is large and never changes)
      and all the small diffs up until the commit you are checking out
    • User can specify the type of diff (e.g., low-rank vs. dense update) which is easier than figuring it out from model checkpoints
  • Cons:
    • Couples training code and version control since training code needs to "know about" diff files and where to store them
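The low-rank case mentioned in the pros shows why user-specified diff types help: a rank-1 update to an n×m weight can be stored as two vectors instead of a dense n×m delta, but only the training code knows the update has that structure. A minimal sketch (function names are hypothetical; lists stand in for tensors):

```python
def lowrank_diff(u, v):
    """Training code records a rank-1 update as two vectors, not a full matrix."""
    return {"type": "low_rank", "u": u, "v": v}

def apply_lowrank(weight, diff):
    """Smudge side: add the outer product u v^T onto the stored weight."""
    u, v = diff["u"], diff["v"]
    return [[w + ui * vj for w, vj in zip(row, v)]
            for row, ui in zip(weight, u)]
```

For an n×m weight this stores n + m numbers per update instead of n·m, which is the storage win Design 2 buys at the cost of coupling training code to version control.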

from git-theta.

nkandpa2 commented on August 14, 2024

I think there's a much simpler design to consider. Suppose a user has a model checkpoint with the following parameter group structure:

{
    'layer1': {
        'w': [1, 2, 3, 4],
        'b': [10]
    },
    'layer2': {
        'w': [-1, -2, -3, -4],
        'b': [3]
    },
    'other_params': {
        'a': 0.2
    }
}

When a user runs git add model.pt the clean filter loads the model checkpoint and explodes the dictionary structure onto the filesystem under .git/ml

.git
└── ml
    └── model
        ├── layer1
        │   ├── b.pt
        │   └── w.pt
        ├── layer2
        │   ├── b.pt
        │   └── w.pt
        └── other_params
            └── a.pt

The clean filter puts this directory structure into the git index. Also, similar to git-lfs, instead of staging model.pt with its full contents the clean filter will stage a placeholder model.pt that points to .git/ml/model.

When checking out a commit, all that the smudge filter needs to do is re-synthesize model.pt from the exploded version of that model at .git/ml/model.
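The explode/re-synthesize pair can be sketched in plain Python. This is illustrative only: function names are hypothetical, pickle stands in for whatever tensor serialization the real filter would use, and leaves are plain Python values rather than tensors:

```python
import os
import pickle

def explode(params, root):
    """Clean side: write each leaf of a nested checkpoint dict to its own
    file under root, mirroring the dict structure as directories."""
    for key, value in params.items():
        path = os.path.join(root, key)
        if isinstance(value, dict):
            os.makedirs(path, exist_ok=True)
            explode(value, path)
        else:
            with open(path + ".pt", "wb") as f:
                pickle.dump(value, f)

def synthesize(root):
    """Smudge side: rebuild the nested dict from the exploded tree."""
    params = {}
    for entry in os.listdir(root):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            params[entry] = synthesize(path)
        else:
            with open(path, "rb") as f:
                params[entry[:-3]] = pickle.load(f)  # strip ".pt"
    return params
```

Because each parameter group lands in its own file, git's normal change detection does the per-group diffing for free, which is the point of the design.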

This design has the following advantages:

  • No need for diff files since updating a single parameter group will result in only that parameter group's file being updated in the git commit.
  • In the future we'll want to store model data with LFS or something like it since these will exceed git's maximum file size. In this design we simply need to make the .git/ml/model directory all LFS objects and everything should work the same. In the previous proposal, the smudge filter needed the whole history of diff files to re-synthesize the model checkpoint. This is problematic if the diff files are stored with LFS since git pull-ing from an LFS store only pulls the latest version of the object.


craffel commented on August 14, 2024

Thanks @nkandpa2 . I think this has a clear advantage in terms of the fact that it will make git natively aware of which parameter groups were updated. A disadvantage I guess is that you would ultimately need to effectively materialize a second copy of the checkpoint, right? I'm not clear on what you mean:

> instead of staging model.pt with its full contents the clean filter will stage a placeholder model.pt that points to .git/ml/model.

Wouldn't we need to stage model.pt so that programs could make use of it?


craffel commented on August 14, 2024

Also, the directory structure you're proposing is (I think) very similar to how t5x(/flax?) represents checkpoints - see e.g. gsutil ls gs://t5-data/pretrained_models/t5x/byt5_base/checkpoint_1000000/. Each parameter "group" gets its own subdirectory, e.g. gs://t5-data/pretrained_models/t5x/byt5_base/checkpoint_1000000/target.encoder.layers_9.mlp.wo.kernel/. Each subdirectory has a TensorStore object (which is a very nice library for storing and accessing tensors on disk) and a .zarray metadata file (required by TensorStore's storage format). Using TensorStore is probably much cleaner than using individual .pt files (though I know you were being illustrative).


craffel commented on August 14, 2024

I should mention the naming of the TensorStore object corresponds to sharding, e.g. in the example above the TensorStore file is called 0.0. I don't really understand that naming convention/the sharding but just FYI in case you were wondering.


nkandpa2 commented on August 14, 2024

> Thanks @nkandpa2 . I think this has a clear advantage in terms of the fact that it will make git natively aware of which parameter groups were updated. A disadvantage I guess is that you would ultimately need to effectively materialize a second copy of the checkpoint, right? I'm not clear on what you mean
>
> > instead of staging model.pt with its full contents the clean filter will stage a placeholder model.pt that points to .git/ml/model.
>
> Wouldn't we need to stage model.pt so that programs could make use of it?

model.pt never gets staged (i.e., put into the area holding everything about to be committed); instead, a pointer to the exploded checkpoint view gets staged. The working directory (the user's view of the repository) still contains the full model.pt, so after git add/git commit the full model.pt file is still there from the user's point of view.

git-lfs does something very similar. For LFS-tracked files, (1) the file gets copied to .git/lfs, (2) a pointer file referencing the LFS-tracked file gets staged, and (3) on git push the file being pointed to gets synced to an LFS store. After git add the LFS-tracked file is still in the working directory even though only a pointer file is in the staging area.
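For reference, the placeholder that git-lfs stages in place of the real content is a tiny text pointer of roughly this shape (the hash and size here are illustrative):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a2146abc0000000000000000000000000000000000000000000000000cafe
size 12345
```

A git-theta pointer staged for model.pt could work the same way, referencing the exploded tree instead of a single blob.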


craffel commented on August 14, 2024

I see, that makes sense, thanks.

