Create a format for representing incremental changes (git-theta issue, closed)

craffel commented on August 14, 2024

Create a format for representing incremental changes

from git-theta.

Comments (11)

craffel commented on August 14, 2024

Some file formats that can store dict-of-numpy-array-like objects:

https://numpy.org/doc/stable/reference/generated/numpy.savez.html (should not use this one)
https://developers.google.com/protocol-buffers/ (and maybe https://github.com/telamonian/numpy-protobuf)
https://msgpack.org/ (and https://github.com/lebedov/msgpack-numpy)
https://www.h5py.org/ (and others, e.g. https://github.com/telegraphic/hickle and http://www.pytables.org/)

craffel commented on August 14, 2024

Possibly of interest: https://github.com/mverleg/array_storage_benchmark

zoeqevans commented on August 14, 2024

Per discussion in the lab meeting last week: instead of having separate .index and .content files for each diff, each diff will be a single binary file (minimizing the number of files floating around), and we will provide a tool that produces a human-readable view of a diff.

zoeqevans commented on August 14, 2024

Another consideration is the metadata in each diff. We will want the ability to include e.g. author, commit date, previous commit id, ids of the commits that produced this merge, etc. So we are not just storing a dictionary of numpy-like objects, we are also storing purely string key-value pairs.
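One way to picture a single binary diff that carries both string metadata and named arrays is a JSON header followed by the raw array bytes. The sketch below is only an illustration under stated assumptions: the `DIFF` magic bytes, the header layout, and the function names `dumps_diff`, `loads_diff`, and `describe_diff` are all hypothetical and are not the format git-theta adopted.

```python
import json
import struct

import numpy as np

MAGIC = b"DIFF"  # hypothetical 4-byte file signature


def dumps_diff(metadata, arrays):
    """Serialize string metadata and named arrays into one bytes object."""
    index, chunks, offset = {}, [], 0
    for name, arr in arrays.items():
        arr = np.asarray(arr)
        raw = arr.tobytes()  # C-order bytes, regardless of memory layout
        index[name] = {
            "dtype": arr.dtype.str,
            "shape": list(arr.shape),
            "offset": offset,
            "nbytes": len(raw),
        }
        chunks.append(raw)
        offset += len(raw)
    header = json.dumps({"metadata": metadata, "index": index}).encode("utf-8")
    # Layout: magic | 8-byte little-endian header length | JSON header | array bytes
    return MAGIC + struct.pack("<Q", len(header)) + header + b"".join(chunks)


def loads_diff(blob):
    """Inverse of dumps_diff: return (metadata, arrays)."""
    if blob[:4] != MAGIC:
        raise ValueError("not a diff file")
    (header_len,) = struct.unpack("<Q", blob[4:12])
    header = json.loads(blob[12 : 12 + header_len].decode("utf-8"))
    body = blob[12 + header_len :]
    arrays = {}
    for name, info in header["index"].items():
        raw = body[info["offset"] : info["offset"] + info["nbytes"]]
        flat = np.frombuffer(raw, dtype=np.dtype(info["dtype"]))
        arrays[name] = flat.reshape(info["shape"])
    return header["metadata"], arrays


def describe_diff(blob):
    """Human-readable view: metadata pairs, then one line per array."""
    metadata, arrays = loads_diff(blob)
    lines = [f"{key}: {value}" for key, value in metadata.items()]
    lines += [f"{name}: dtype={a.dtype}, shape={a.shape}" for name, a in arrays.items()]
    return "\n".join(lines)
```

Unlike pickle, a layout like this can only ever yield strings and numeric buffers when read back, and the header can be inspected without touching the array payload.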

zoeqevans commented on August 14, 2024

Protobuf messages max out at 2 GB, so they are not ideal. h5py seems to be built for manipulating large distributed datasets, which seems to be overkill for a POC. I know we discussed that pickle's arbitrary storage properties present security issues, but it stands out as the simplest and most widely understood tool that achieves what we are looking for here. Cognizant of the risk of bike-shedding ourselves into oblivion, would it be such a bad thing to charge ahead with pickle while it's just us working on this?

craffel commented on August 14, 2024

I think we can use pickle right now for testing, but I don't think we can or should use pickle going forward - it's just too general of a format.
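For readers unfamiliar with why pickle counts as "too general", the short sketch below shows that loading a pickle can call an arbitrary callable. The class name `RunsOnLoad` is made up for illustration, and the payload here is deliberately harmless.

```python
import pickle


class RunsOnLoad:
    """Benign stand-in for a malicious payload."""

    def __reduce__(self):
        # pickle records "call eval with this argument"; the call happens at load time.
        return (eval, ("6 * 7",))


payload = pickle.dumps(RunsOnLoad())
# Loading does not rebuild a RunsOnLoad instance; it evaluates the expression.
result = pickle.loads(payload)
print(result)  # 42
```

Anyone who can hand you a diff file in this format can therefore run code on your machine, which is acceptable for local testing but not for diffs shared between users.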

zoeqevans commented on August 14, 2024

Fair enough. I think I'll use it for now, and then we can discuss the minimal set of features we need to keep and hunt for a tool that can represent those.

zoeqevans commented on August 14, 2024

Per a discussion with Anisha & Vishal, we want to add a few things here:

Change to storing deltas (instead of values) in the diff.
Define the mechanism for sparse update types.
Call ONNX functions when applying an update instead of hard-coding it.
Define create_diff_file as the 'commit' operation.
Validate that weight names in the update list are unique.

We decided to store deltas instead of values so that we can efficiently represent low-rank updates (as two vectors instead of a large matrix). It is unclear whether a stored low-rank matrix could be reliably decomposed back into its constituent vectors.
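To make the delta idea concrete, here is a rough sketch of applying dense, sparse, and rank-1 deltas to a dict of weights, including the uniqueness check on weight names. It uses plain NumPy rather than ONNX operators, and the update dict keys (`name`, `type`, `delta`, `indices`, `values`, `u`, `v`) and the function name `apply_update` are assumptions made for this example only.

```python
import numpy as np


def apply_update(weights, updates):
    """Return new weights after adding each delta; `updates` is a list of dicts."""
    names = [u["name"] for u in updates]
    if len(names) != len(set(names)):
        raise ValueError("weight names in the update list must be unique")

    # Copy so the caller's weights are left untouched.
    new_weights = {name: value.copy() for name, value in weights.items()}
    for u in updates:
        w = new_weights[u["name"]]
        if u["type"] == "dense":
            w += u["delta"]
        elif u["type"] == "sparse":
            # Flat indices into the weight plus one delta value per index.
            np.add.at(w.reshape(-1), u["indices"], u["values"])
        elif u["type"] == "low_rank":
            # Rank-1 delta stored as two vectors: delta = outer(u, v).
            w += np.outer(u["u"], u["v"])
        else:
            raise ValueError(f"unknown update type: {u['type']}")
    return new_weights
```

For an m x n weight, the rank-1 form stores m + n numbers instead of m * n, which is the saving that motivates storing deltas rather than the updated values.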

zoeqevans commented on August 14, 2024

Addressed the above, and largely built the POC using MNIST + rotated MNIST digits. Had an issue trying to convert ONNX models back to PyTorch, but have largely figured out the kinks of the onnx2pytorch library.

Currently struggling to apply a sparse training mask à la FISHMask. There (at least in the section I read) Varun is using a modified version of the Trainer class from the HuggingFace transformers library to zero out the gradients of model parameters that he doesn't want to update. I want to keep things simple and use torch.optim, but quick experiments are failing to actually zero out any gradients.

zoeqevans commented on August 14, 2024

Aha! Fixed the above.

zoeqevans commented on August 14, 2024

Moving on to the VCS. Now wondering how to calculate deltas during / immediately after training. I guess we require the users to explicitly specify them, through direct calculation?
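One possible answer, sketched here purely as an assumption and not as what the project settled on, is to snapshot the parameters before training and subtract afterwards. The helper names `snapshot` and `compute_deltas` are hypothetical, and parameters are modeled as a plain dict of NumPy arrays.

```python
import numpy as np


def snapshot(params):
    """Copy every parameter so later in-place training steps cannot alter it."""
    return {name: np.array(value, copy=True) for name, value in params.items()}


def compute_deltas(before, after, atol=0.0):
    """Return {name: after - before} for every parameter that changed."""
    if before.keys() != after.keys():
        raise ValueError("parameter names differ between checkpoints")
    deltas = {}
    for name in before:
        delta = after[name] - before[name]
        # Skip parameters whose change is within the tolerance.
        if np.any(np.abs(delta) > atol):
            deltas[name] = delta
    return deltas
```

This only produces dense deltas; a sparse or low-rank representation would still need the user (or the training code) to say how the update was structured.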
