Comments (11)
Some file formats that can store dict-of-numpy-array-like objects:
- https://numpy.org/doc/stable/reference/generated/numpy.savez.html (should not use this one)
- https://developers.google.com/protocol-buffers/ (and maybe https://github.com/telamonian/numpy-protobuf)
- https://msgpack.org/ (and https://github.com/lebedov/msgpack-numpy)
- https://www.h5py.org/ (and others, e.g. https://github.com/telegraphic/hickle and http://www.pytables.org/)
from git-theta.
Possibly of interest: https://github.com/mverleg/array_storage_benchmark
from git-theta.
Per discussion in the lab meeting last week, instead of having separate `.index` and `.content` files for each diff, we will provide a tool to produce a human-readable view of a diff, but each diff file will be a binary (thus minimizing the number of files floating around).
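As a rough illustration of what a human-readable view could look like, here is a minimal sketch. It assumes a diff has already been loaded from its binary file into a plain `{name: ndarray}` dictionary; `describe_diff` and the example parameter names are hypothetical, not part of git-theta.

```python
import numpy as np


def describe_diff(diff):
    """Return a plain-text summary of a diff held as {name: ndarray}.

    Hypothetical helper: one line per parameter, sorted by name.
    """
    lines = []
    for name in sorted(diff):
        arr = np.asarray(diff[name])
        lines.append(
            f"{name}: dtype={arr.dtype} shape={arr.shape} "
            f"nonzero={int(np.count_nonzero(arr))}/{arr.size}"
        )
    return "\n".join(lines)


# Made-up example data standing in for a loaded diff.
example_diff = {
    "layer1.weight": np.zeros((2, 3), dtype=np.float32),
    "layer1.bias": np.array([0.5, 0.0], dtype=np.float32),
}
summary = describe_diff(example_diff)
print(summary)
```

The binary file stays the single source of truth; the text view is derived on demand and never stored.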
from git-theta.
Another consideration is the metadata in each diff. We will want the ability to include e.g. author, commit date, previous commit id, ids of the commits that produced this merge, etc. So we are not just storing a dictionary of numpy-like objects, we are also storing purely string key-value pairs.
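One way to keep string metadata and arrays in a single binary file is a length-prefixed JSON header followed by raw array bytes. The sketch below is only an illustration of that idea: the byte layout, the function names `write_diff` and `read_diff`, and the metadata values are all assumptions, not an agreed format.

```python
import io
import json
import struct

import numpy as np


def write_diff(fileobj, metadata, arrays):
    """Write string metadata plus named arrays into one binary stream.

    Assumed layout: 8-byte little-endian header length, UTF-8 JSON header,
    then the raw array bytes back to back.
    """
    entries, payload, offset = [], [], 0
    for name, value in arrays.items():
        arr = np.ascontiguousarray(value)
        data = arr.tobytes()
        entries.append({
            "name": name,
            "dtype": arr.dtype.str,  # e.g. "<f4", includes byte order
            "shape": list(arr.shape),
            "offset": offset,
            "nbytes": len(data),
        })
        payload.append(data)
        offset += len(data)
    header = json.dumps({"metadata": metadata, "arrays": entries}).encode("utf-8")
    fileobj.write(struct.pack("<Q", len(header)))
    fileobj.write(header)
    for data in payload:
        fileobj.write(data)


def read_diff(fileobj):
    """Read back (metadata, {name: ndarray}) written by write_diff."""
    (header_len,) = struct.unpack("<Q", fileobj.read(8))
    header = json.loads(fileobj.read(header_len).decode("utf-8"))
    body = fileobj.read()
    arrays = {}
    for entry in header["arrays"]:
        raw = body[entry["offset"]: entry["offset"] + entry["nbytes"]]
        arrays[entry["name"]] = np.frombuffer(
            raw, dtype=np.dtype(entry["dtype"])
        ).reshape(entry["shape"])
    return header["metadata"], arrays


# Round trip through an in-memory buffer with made-up values.
buf = io.BytesIO()
write_diff(
    buf,
    {"author": "example-author", "previous_commit": "abc123"},
    {"w": np.arange(6, dtype=np.float32).reshape(2, 3)},
)
buf.seek(0)
meta, arrs = read_diff(buf)
```

Unlike pickle, reading this layout only parses JSON and reinterprets bytes, so loading a file never executes code from it.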
from git-theta.
Protobuf messages max out at 2 GB, so they are not ideal. h5py seems to be built for manipulating large distributed datasets, which seems like overkill for a POC. I know we discussed that pickle's ability to store arbitrary objects presents security issues, but it stands out as the simplest and most widely understood tool that achieves what we are looking for here. Cognizant of the risk of bikeshedding ourselves into oblivion, would it be such a bad thing to charge ahead with pickle while it's just us working on this?
from git-theta.
I think we can use pickle right now for testing, but I don't think we can or should use pickle going forward - it's just too general of a format.
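For context on why "too general" matters, the small demonstration below shows that unpickling can call an arbitrary callable chosen by whoever produced the bytes. The `Payload` class is a made-up example and the expression it evaluates is deliberately harmless.

```python
import pickle


class Payload:
    """Made-up class whose pickle asks the loader to call eval()."""

    def __reduce__(self):
        # pickle stores "call eval('1 + 1')" instead of the object's state.
        return (eval, ("1 + 1",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it runs the stored call.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

A diff file that is shared between collaborators is effectively untrusted input, which is why a format limited to arrays and strings is safer for anything beyond local testing.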
from git-theta.
Fair enough. I think I'll use it for now, and then we can discuss the minimal set of features we need to keep and hunt for a tool that can represent those.
from git-theta.
Per a discussion with Anisha & Vishal, we want to add a few things here:
- Change to storing deltas (instead of values) in the diff.
- Define the mechanism for sparse update types.
- Call ONNX functions when applying an update instead of hard-coding it.
- Define `create_diff_file` as the 'commit' operation. Validate that weight names in the update list are unique.
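The uniqueness check in the last item could be as small as the sketch below. It assumes the update list is a sequence of `(weight_name, update)` pairs; that shape and the helper name `validate_unique_names` are assumptions for illustration.

```python
from collections import Counter


def validate_unique_names(updates):
    """Raise ValueError if any weight name appears more than once.

    `updates` is assumed to be an iterable of (weight_name, update) pairs.
    """
    counts = Counter(name for name, _ in updates)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate weight names in update list: {duplicates}")


# Made-up update lists: the second one repeats "fc.weight".
good = [("fc.weight", "delta-a"), ("fc.bias", "delta-b")]
bad = [("fc.weight", "delta-a"), ("fc.weight", "delta-c")]

validate_unique_names(good)
try:
    validate_unique_names(bad)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Running the check at the start of the commit step would reject an ambiguous update list before anything is written.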
We decided to store deltas instead of values so that we can efficiently represent low-rank updates (as two vectors, instead of a large matrix) - it is unclear whether it is possible to decompose a low-rank matrix into its constituent vectors.
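To make the storage argument concrete, here is a minimal NumPy sketch of three delta representations (dense, low-rank factors, sparse entries) and how each would be applied. The function names and the sparse encoding (flat indices plus values) are assumptions, and the sketch applies updates directly in NumPy rather than through ONNX.

```python
import numpy as np


def apply_dense(weight, delta):
    """Dense delta: one stored number per weight entry."""
    return weight + delta


def apply_low_rank(weight, u, v):
    """Low-rank delta stored as factors u (m, r) and v (r, n).

    For a rank-1 update, r == 1 and the factors are just two vectors.
    """
    return weight + u @ v


def apply_sparse(weight, flat_indices, values):
    """Sparse delta stored as flat indices into the weight plus values."""
    out = weight.copy()
    # reshape(-1) on the fresh contiguous copy is a view, so this edits `out`.
    np.add.at(out.reshape(-1), flat_indices, values)
    return out


weight = np.zeros((4, 3))
u = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (4, 1)
v = np.array([[1.0, 0.0, -1.0]])            # shape (1, 3)

low_rank_result = apply_low_rank(weight, u, v)
stored_low_rank = u.size + v.size           # numbers kept in the diff
stored_dense = weight.size                  # numbers a dense delta would need

sparse_result = apply_sparse(weight, np.array([0, 5]), np.array([1.0, -2.0]))
```

In this toy case the factors hold 7 numbers against 12 for the dense delta; the gap grows quickly with matrix size.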
from git-theta.
Addressed the above, and largely built the POC using MNIST + rotated MNIST digits. Had an issue trying to convert ONNX models back to PyTorch, but have largely figured out the kinks of the onnx2pytorch library.
Currently struggling to apply a sparse training mask à la FISHMask. There (at least in the section I read) Varun is using a modified version of the Trainer class from the HuggingFace transformers library to zero out the gradients of model parameters that he doesn't want. I want to keep things simple and use `torch.optim`, but quick experiments are failing to actually zero out any gradients.
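For reference, one common way to get this effect with a plain `torch.optim` optimizer is to multiply each gradient by a binary mask after `backward()` and before `step()`. The sketch below is not necessarily the fix that was used here; the model, the mask, and the data are made up.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)

# Hypothetical binary masks: 1 = trainable entry, 0 = frozen entry.
masks = {
    "weight": torch.zeros_like(model.weight),
    "bias": torch.zeros_like(model.bias),
}
masks["weight"][0, :] = 1.0  # only the first output row may change

before = {name: p.detach().clone() for name, p in model.named_parameters()}
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
y = torch.randn(8, 2)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Zero the gradients of masked-out entries in place, then step.
with torch.no_grad():
    for name, p in model.named_parameters():
        p.grad.mul_(masks[name])
optimizer.step()
```

With plain SGD a zero gradient means no update. Optimizers that apply weight decay, or that carry momentum from earlier unmasked steps, can still move masked entries, so the mask may also need to be applied to the parameters or the optimizer state in those cases.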
from git-theta.
Aha! Fixed the above.
from git-theta.
Moving on to the VCS. Now wondering how to calculate deltas during / immediately after training. I guess we require the users to explicitly specify them, through direct calculation?
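If users do compute deltas directly, the simplest approach is probably to snapshot the parameters before training and subtract afterwards. A minimal sketch, assuming parameters are available as a `{name: ndarray}` dictionary; `snapshot` and `compute_deltas` are hypothetical helper names.

```python
import numpy as np


def snapshot(params):
    """Copy every parameter so later in-place training cannot alter it."""
    return {name: np.array(value, copy=True) for name, value in params.items()}


def compute_deltas(before, after):
    """Return {name: after - before}; both dicts must have the same names."""
    if before.keys() != after.keys():
        raise KeyError("parameter names differ between snapshots")
    return {name: after[name] - before[name] for name in before}


# Made-up parameters and a stand-in for a training step.
params = {"w": np.ones((2, 2)), "b": np.zeros(2)}
start = snapshot(params)
params["w"] += 0.5  # in-place update, as an optimizer would do
deltas = compute_deltas(start, params)
```

This produces dense deltas only; a low-rank or sparse update would still have to be declared by whoever knows the structure of the training procedure.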
from git-theta.