
Comments (7)

nkandpa2 commented on August 14, 2024

Desired Behavior:

  • When you commit changes to a model, only the diffs are stored
  • Integrate with git so that the same system is used to manage source code and models

Background:

  • Clean filter
    • Specifies a program that runs when staging files that match some pattern
    • git add foo.txt ==> clean foo.txt | git add
  • Smudge filter
    • Specifies a program that runs when checking out files that match some pattern
    • git checkout <commit_hash> ==> smudge <commit_hash> | git checkout
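Both filters are wired up through git's filter-driver mechanism. A minimal configuration might look like the following, where the filter name `theta` and the driver programs `theta-clean`/`theta-smudge` are placeholders for whatever this project ships (`filter.<name>.clean`, `filter.<name>.smudge`, and the `%f` path placeholder are standard git config):

```shell
# .gitattributes: route *.pt files through a custom filter driver
#   *.pt filter=theta

# Register the (hypothetical) clean/smudge programs; %f expands to the
# path of the file being filtered
git config filter.theta.clean  "theta-clean %f"
git config filter.theta.smudge "theta-smudge %f"

# Fail the checkout/add rather than fall back to the raw content
git config filter.theta.required true
```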

Design 1

  • Workflow:
    1. Commit initial my_model.pt as you would any other file in git
    2. Update my_model.pt in-place
    3. Commit updated my_model.pt as you would any other file in git
  • Implementation
    • Define a clean filter for *.pt files that:
      1. If it does not yet exist, creates a directory .git_ml/diffs/my_model/
      2. Computes the diff between the updated my_model.pt and its previous version
      3. Stores the diff file in .git_ml/diffs/my_model/model.diff
      4. Stages the diff file
      5. Does not add my_model.pt itself to the staging area
    • Define a smudge filter for *.pt files that:
      1. Checks for diff at .git_ml/diffs/my_model/model.diff
      2. Goes back through revision history of model.diff
      3. Iteratively applies diffs from revision history of model.diff to my_model.pt
  • Pros
    • The fact that my_model.pt is handled specially is transparent to the user
    • git checkout only requires checking out the initial version of my_model.pt (which is large and never changes)
      and all the small diffs up until the commit you are checking out
  • Cons
    • Requires clean filter to compute diff which is hard in general (e.g., disambiguating low-rank vs. dense update)
    • model.diff is stored in .git_ml/diffs/my_model, which makes it impossible to rename my_model.pt once it's been initially committed
      • This could probably be solved by keying the diff directory on something like the model file's hash instead of its name
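As a rough sketch of what the Design 1 filters would compute, here is a dense diff over a flat {name: values} checkpoint, with the smudge side replaying a sequence of diffs in order. Function names are hypothetical and real checkpoints would hold tensors rather than Python lists; the point is only the clean-computes / smudge-replays split:

```python
def compute_diff(prev, curr):
    """Clean side: elementwise dense diff between two {name: [values]} checkpoints."""
    return {name: [c - p for p, c in zip(prev[name], curr[name])]
            for name in curr}

def apply_diffs(base, diffs):
    """Smudge side: replay a sequence of diffs onto the base checkpoint."""
    state = {name: list(vals) for name, vals in base.items()}
    for diff in diffs:
        for name, delta in diff.items():
            state[name] = [v + d for v, d in zip(state[name], delta)]
    return state
```

Note that the hard part flagged in the cons above (recognizing that a given delta is actually low-rank) is exactly what this dense sketch does not attempt.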

Design 2

  • Workflow:
    1. Commit initial my_model.pt as you would any other file in git
    2. Training code modifies my_model.pt in-place and also produces a diff file in .git_ml/diffs/my_model/model.diff
    3. Commit updated my_model.pt as you would any other file in git
  • Implementation:
    • Define a clean filter for *.pt files that:
      1. Stages .git_ml/diffs/my_model/model.diff
      2. Restores my_model.pt to its previous version
    • Define a smudge filter for *.pt files that:
      1. Checks for diff at .git_ml/diffs/my_model/model.diff
      2. Goes back through revision history of model.diff
      3. Iteratively applies diffs from revision history of model.diff to my_model.pt
  • Pros:
    • git checkout only requires checking out the initial version of my_model.pt (which is large and never changes)
      and all the small diffs up until the commit you are checking out
    • User can specify the type of diff (e.g., low-rank vs. dense update) which is easier than figuring it out from model checkpoints
  • Cons:
    • Couples training code and version control since training code needs to "know about" diff files and where to store them
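The low-rank case mentioned in the pros shows why user-specified diff types help: a rank-1 update to an n×m weight can be stored as two vectors instead of a dense n×m delta, but only the training code knows the update has that structure. A minimal sketch (function names are hypothetical; lists stand in for tensors):

```python
def lowrank_diff(u, v):
    """Training code records a rank-1 update as two vectors, not a full matrix."""
    return {"type": "low_rank", "u": u, "v": v}

def apply_lowrank(weight, diff):
    """Smudge side: add the outer product u v^T onto the stored weight."""
    u, v = diff["u"], diff["v"]
    return [[w + ui * vj for w, vj in zip(row, v)]
            for row, ui in zip(weight, u)]
```

For an n×m weight this stores n + m numbers per update instead of n·m, which is the storage win Design 2 buys at the cost of coupling training code to version control.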

from git-theta.

nkandpa2 commented on August 14, 2024

I think there's a much simpler design to consider. Suppose a user has a model checkpoint with the following parameter group structure:

{
    'layer1': {
        'w': [1, 2, 3, 4],
        'b': [10]
    },
    'layer2': {
        'w': [-1, -2, -3, -4],
        'b': [3]
    },
    'other_params': {
        'a': 0.2
    }
}

When a user runs git add model.pt the clean filter loads the model checkpoint and explodes the dictionary structure onto the filesystem under .git/ml

.git
└── ml
    └── model
        ├── layer1
        │   ├── b.pt
        │   └── w.pt
        ├── layer2
        │   ├── b.pt
        │   └── w.pt
        └── other_params
            └── a.pt

The clean filter puts this directory structure into the git index. Also, similar to git-lfs, instead of staging model.pt with its full contents the clean filter will stage a placeholder model.pt that points to .git/ml/model.

When checking out a commit, all that the smudge filter needs to do is re-synthesize model.pt from the exploded version of that model at .git/ml/model.
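The explode/re-synthesize pair can be sketched in plain Python. This is illustrative only: function names are hypothetical, pickle stands in for whatever tensor serialization the real filter would use, and leaves are plain Python values rather than tensors:

```python
import os
import pickle

def explode(params, root):
    """Clean side: write each leaf of a nested checkpoint dict to its own
    file under root, mirroring the dict structure as directories."""
    for key, value in params.items():
        path = os.path.join(root, key)
        if isinstance(value, dict):
            os.makedirs(path, exist_ok=True)
            explode(value, path)
        else:
            with open(path + ".pt", "wb") as f:
                pickle.dump(value, f)

def synthesize(root):
    """Smudge side: rebuild the nested dict from the exploded tree."""
    params = {}
    for entry in os.listdir(root):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            params[entry] = synthesize(path)
        else:
            with open(path, "rb") as f:
                params[entry[:-3]] = pickle.load(f)  # strip ".pt"
    return params
```

Because each parameter group lands in its own file, git's normal change detection does the per-group diffing for free, which is the point of the design.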

This design has the following advantages:

  • No need for diff files since updating a single parameter group will result in only that parameter group's file being updated in the git commit.
  • In the future we'll want to store model data with LFS or something like it since these will exceed git's maximum file size. In this design we simply need to make the .git/ml/model directory all LFS objects and everything should work the same. In the previous proposal, the smudge filter needed the whole history of diff files to re-synthesize the model checkpoint. This is problematic if the diff files are stored with LFS since git pull-ing from an LFS store only pulls the latest version of the object.


craffel commented on August 14, 2024

Thanks @nkandpa2 . I think this has a clear advantage in terms of the fact that it will make git natively aware of which parameter groups were updated. A disadvantage I guess is that you would ultimately need to effectively materialize a second copy of the checkpoint, right? I'm not clear on what you mean:

> instead of staging model.pt with its full contents the clean filter will stage a placeholder model.pt that points to .git/ml/model.

Wouldn't we need to stage model.pt so that programs could make use of it?


craffel commented on August 14, 2024

Also, the directory structure you're proposing is (I think) very similar to how t5x(/flax?) represents checkpoints - see e.g. gsutil ls gs://t5-data/pretrained_models/t5x/byt5_base/checkpoint_1000000/. Each parameter "group" gets its own subdirectory, e.g. gs://t5-data/pretrained_models/t5x/byt5_base/checkpoint_1000000/target.encoder.layers_9.mlp.wo.kernel/. Each subdirectory has a TensorStore object (which is a very nice library for storing and accessing tensors on disk) and a .zarray metadata file (required by TensorStore's storage format). Using TensorStore is probably much cleaner than using individual .pt files (though I know you were being illustrative).


craffel commented on August 14, 2024

I should mention the naming of the TensorStore object corresponds to sharding, e.g. in the example above the TensorStore file is called 0.0. I don't really understand that naming convention/the sharding but just FYI in case you were wondering.


nkandpa2 commented on August 14, 2024

> Thanks @nkandpa2 . I think this has a clear advantage in terms of the fact that it will make git natively aware of which parameter groups were updated. A disadvantage I guess is that you would ultimately need to effectively materialize a second copy of the checkpoint, right? I'm not clear on what you mean
>
> > instead of staging model.pt with its full contents the clean filter will stage a placeholder model.pt that points to .git/ml/model.
>
> Wouldn't we need to stage model.pt so that programs could make use of it?

model.pt never gets staged (i.e., put into the area holding everything about to be committed); instead, a pointer to the exploded checkpoint view gets staged. The working directory (the user's view of the repository) still contains the full model.pt, so after git add/git commit the full model.pt file is still there from the user's point of view.

git-lfs does something very similar. For LFS-tracked files, (1) the file gets copied to .git/lfs, (2) a pointer file referencing the LFS-tracked file gets staged, and (3) on git push the file being pointed to gets synced to an LFS store. After git add the LFS-tracked file is still in the working directory even though only a pointer file is in the staging area.
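For reference, the placeholder that git-lfs stages in place of the real content is a tiny text pointer of roughly this shape (the hash and size here are illustrative):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a2146abc0000000000000000000000000000000000000000000000000cafe
size 12345
```

A git-theta pointer staged for model.pt could work the same way, referencing the exploded tree instead of a single blob.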


craffel commented on August 14, 2024

I see, that makes sense, thanks.

