Comments (11)

craffel commented on August 14, 2024

Some file formats that can store dict-of-numpy-array-like objects:

from git-theta.

craffel commented on August 14, 2024

Possibly of interest: https://github.com/mverleg/array_storage_benchmark

zoeqevans commented on August 14, 2024

Per discussion in the lab meeting last week, instead of having separate .index and .content files for each diff, we will provide a tool to produce a human-readable view of a diff, but each diff file will be a binary (thus minimizing the number of files floating around).

zoeqevans commented on August 14, 2024

Another consideration is the metadata in each diff. We will want the ability to include e.g. author, commit date, previous commit id, ids of the commits that produced this merge, etc. So we are not just storing a dictionary of numpy-like objects, we are also storing purely string key-value pairs.
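
A minimal sketch of one way to keep both kinds of data in a single binary file, assuming NumPy's `.npz` container is acceptable for a prototype. The helpers `save_diff` and `load_diff` and the reserved key `__metadata__` are hypothetical names used for illustration, not part of git-theta. The metadata is serialized to JSON and stored as a zero-dimensional string array, so the file can be read back with `allow_pickle=False`.

```python
import io
import json

import numpy as np

METADATA_KEY = "__metadata__"  # hypothetical reserved key, not a git-theta name


def save_diff(file, arrays, metadata):
    """Write named arrays plus string key-value metadata into one .npz archive."""
    payload = dict(arrays)
    # JSON text stored as a 0-d unicode array, so no pickling is needed.
    payload[METADATA_KEY] = np.array(json.dumps(metadata))
    np.savez(file, **payload)


def load_diff(file):
    """Read the archive back without enabling pickle."""
    with np.load(file, allow_pickle=False) as npz:
        metadata = json.loads(npz[METADATA_KEY].item())
        arrays = {k: npz[k] for k in npz.files if k != METADATA_KEY}
    return arrays, metadata


# Round trip through an in-memory buffer; a file path works the same way.
buf = io.BytesIO()
save_diff(
    buf,
    {"fc.weight": np.ones((2, 3), dtype=np.float32)},
    {"author": "someone", "parent": "abc123"},
)
buf.seek(0)
arrays, meta = load_diff(buf)
```

Whether `.npz` scales to the model sizes discussed in this thread is not something the sketch addresses.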

zoeqevans commented on August 14, 2024

Protobufs max out at 2GB, so they are not ideal. h5py seems to be built for manipulating large distributed datasets, which seems like overkill for a POC. I know we discussed that pickle's ability to store (and, on load, execute) arbitrary objects presents security issues, but it stands out as the simplest and most widely understood tool that achieves what we are looking for here. Cognizant of the risk of bikeshedding ourselves into oblivion, would it be such a bad thing to charge ahead with pickle while it's just us working on this?

craffel commented on August 14, 2024

I think we can use pickle right now for testing, but I don't think we can or should use pickle going forward - it's just too general of a format.

zoeqevans commented on August 14, 2024

Fair enough. I think I'll use it for now, and then we can discuss the minimal set of features we need to keep and hunt for a tool that can represent those.

zoeqevans commented on August 14, 2024

Per a discussion with Anisha & Vishal, we want to add a few things here:

  1. Change to storing deltas (instead of values) in the diff.
  2. Define the mechanism for sparse update types.
  3. Call ONNX functions when applying an update instead of hard-coding it.
  4. Define create_diff_file as the 'commit' operation. Validate that weight names in the update list are unique.

We decided to store deltas instead of values so that we can efficiently represent low-rank updates (as two vectors, instead of a large matrix). It is unclear whether it is possible to decompose a low-rank matrix back into its constituent vectors after the fact.
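
A rough sketch of what applying the different update types could look like, assuming plain NumPy arrays rather than ONNX operators. The function names `apply_dense`, `apply_low_rank`, and `apply_sparse` are hypothetical. The point is only that a low-rank delta can be stored as its two factors and expanded when it is applied.

```python
import numpy as np


def apply_dense(weight, delta):
    """Dense update: the delta has the same shape as the weight."""
    return weight + delta


def apply_low_rank(weight, u, v):
    """Low-rank update stored as factors u (m, r) and v (r, n)."""
    # The full (m, n) delta u @ v only exists in memory, never on disk.
    return weight + u @ v


def apply_sparse(weight, flat_indices, values):
    """Sparse update: add values at the given flat indices (assumed unique)."""
    out = weight.copy()
    flat = out.reshape(-1)  # view into the contiguous copy
    flat[flat_indices] += values
    return out


W = np.zeros((3, 4))
u = np.array([[1.0], [2.0], [3.0]])      # shape (3, 1)
v = np.array([[1.0, 0.0, 0.0, 1.0]])     # shape (1, 4)
W_low_rank = apply_low_rank(W, u, v)     # stores 7 numbers instead of 12

W_sparse = apply_sparse(W, np.array([0, 5]), np.array([1.0, -2.0]))
```

For a rank-1 update of an m-by-n matrix, the factors need m + n stored values instead of m * n.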

zoeqevans commented on August 14, 2024

Addressed the above, and largely built the POC using MNIST + rotated MNIST digits. Had an issue trying to convert ONNX models back to PyTorch, but have largely figured out the kinks of the onnx2pytorch library.

Currently struggling to apply a sparse training mask à la FISHMask. There (at least in the section I read) Varun is using a modified version of the Trainer class from the HuggingFace transformers library to zero out the gradients of model parameters that he doesn't want. I want to keep things simple and use torch.optim, but quick experiments are failing to actually zero out any gradients.
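
For reference, a minimal sketch of one common way to mask gradients with a stock `torch.optim` optimizer: multiply each parameter's `.grad` by a 0/1 mask after `backward()` and before `step()`. This is an illustration under stated assumptions, not necessarily the approach used in FISHMask or in this POC. The `masks` dictionary and the choice of which entries to train are hypothetical.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)

# Hypothetical masks: 1 = trainable entry, 0 = frozen entry.
masks = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
masks["weight"][0, :] = 1.0  # only the first output row of the weight is trained

# Plain SGD with no momentum and no weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
before = {name: p.detach().clone() for name, p in model.named_parameters()}

x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# The mask must be applied after backward() and before step().
for name, p in model.named_parameters():
    p.grad.mul_(masks[name])
optimizer.step()

after = {name: p.detach().clone() for name, p in model.named_parameters()}
```

Note that weight decay, or optimizer state (momentum, Adam moments) accumulated before the mask was introduced, can still move entries whose gradient has been zeroed.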

zoeqevans commented on August 14, 2024

Aha! Fixed the above.

zoeqevans commented on August 14, 2024

Moving on to the VCS. Now wondering how to calculate deltas during / immediately after training. I guess we require the users to explicitly specify them, through direct calculation?
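
One possible form of "direct calculation", sketched with NumPy under the assumption that the user can snapshot the weights before training: subtract the snapshot from the trained values and keep only the entries that changed. The helper `compute_deltas` and the weight names are hypothetical.

```python
import numpy as np


def compute_deltas(before, after, atol=0.0):
    """Return {name: after - before} for every weight that changed by more than atol."""
    deltas = {}
    for name, new_value in after.items():
        delta = new_value - before[name]
        if np.any(np.abs(delta) > atol):
            deltas[name] = delta
    return deltas


# Snapshot taken before training (copies, so later updates do not alias it).
before = {"fc.weight": np.zeros((2, 2)), "fc.bias": np.zeros(2)}
# Stand-in for the weights after training; only fc.weight changed here.
after = {"fc.weight": np.array([[0.5, 0.0], [0.0, 0.0]]), "fc.bias": np.zeros(2)}

deltas = compute_deltas(before, after)
```

This always yields dense deltas. Recovering a low-rank or sparse structure from them would need extra information from the training procedure.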
