Comments (2)

josiahls commented on June 9, 2024

I don't want to lick the cookie, but one of the things I'm excited about in mojo is the type safety / memory management.

What are everyone's thoughts on torchdata? I experimented with building an RL framework on top of it: fastrl.

Pros

  • Linking pipelines was very cool, and making transforms was easy. I could build some very complex pipelines with it, both personally and for work.
  • Very horizontal inheritance. Learning how to do custom stuff / tear apart torchdata was very easy since the hierarchy was basically flat. All pipes inherited from IterDataPipe or MapDataPipe. I think onboarding new users is a lot easier because of this.
    • I was surprised how important this was. An issue I've seen in a lot of dataloading frameworks is that they turn into OOP hell and are thus very hard to extend. From talking to research friends who tried using / extending it, my understanding is that ray data has this issue.
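
To illustrate the flat hierarchy described above, here is a minimal sketch in plain python. It is not torchdata's actual API: `Pipe`, `Mapper` and `Batcher` are hypothetical names, and the only idea carried over is that every pipe inherits directly from one base class and wraps the pipe before it.

```python
from typing import Callable, Iterable, Iterator


class Pipe:
    """Hypothetical flat base class: every pipe wraps a source iterable."""

    def __init__(self, source: Iterable):
        self.source = source

    def __iter__(self) -> Iterator:
        raise NotImplementedError


class Mapper(Pipe):
    """Applies fn to each item pulled from the source."""

    def __init__(self, source: Iterable, fn: Callable):
        super().__init__(source)
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.source:
            yield self.fn(item)


class Batcher(Pipe):
    """Groups items into lists of at most batch_size."""

    def __init__(self, source: Iterable, batch_size: int):
        super().__init__(source)
        self.batch_size = batch_size

    def __iter__(self) -> Iterator:
        batch = []
        for item in self.source:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final partial batch
            yield batch


# Linking pipes is just nesting: range -> Mapper -> Batcher.
pipeline = Batcher(Mapper(range(5), lambda x: x * 2), batch_size=2)
result = list(pipeline)
```

Because each pipe only depends on the base class, a custom pipe is one small subclass with an `__iter__` method, which is roughly why a flat hierarchy is easy to tear apart and extend.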

Cons

  • The future of torchdata is hazy, and they vaguely/unhelpfully noted they need to redesign some stuff. Below are my guesses.
  • Limitations related to python:
    • How do you verify pipeline A -> B -> C is valid in python? E.g. how do we know those pipes plug into each other correctly? Python doesn't enforce type safety, so unless we somehow check the signatures in python / use pydantic, this doesn't appear possible.
    • How do you pass values / references between pipes reliably? e.g. You want to cache data at certain points in the pipeline, but don't want to duplicate the data from earlier in the pipeline.
    • If you have a pipeline and want to do multiprocessing, how do you nicely get around the python GIL?
      • torchdata was recently testing a dataloader2 that uses pub/sub/messaging, but it doesn't look like that got anywhere?
  • Limitations not related to python
    • Exception messages in pipelines (torchdata or not) are simply awful. If you have a pipeline A -> B -> C, and there is an exception in A, you will get a long stack trace all the way up the pipeline. I feel like this might be the Achilles' heel of a lot of pipeline dataloader frameworks.
      • I think mojo has inlining / nodebug capabilities that can make this not so bad (skip internal functions), which would otherwise not be possible in python (?)
      • Probably needs an innovation here: Modify the exception / stack trace when using the pipelines so the stack traces are easier to read.
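
One possible shape for the stack trace idea above, as a rough python sketch: wrap each stage so a failure is re-raised as a single error that names the stage. `PipelineError` and `run_stage` are hypothetical names, not part of torchdata. This does not remove the generator frames of the downstream stages from the traceback; it only puts the failing stage into the message and hides the chained original exception with `from None`.

```python
class PipelineError(Exception):
    """Hypothetical error type that records which stage failed."""

    def __init__(self, stage, original):
        super().__init__(
            f"stage {stage!r} failed: {type(original).__name__}: {original}"
        )
        self.stage = stage
        self.original = original  # keep the real exception for debugging


def run_stage(name, fn, items):
    """Lazily apply fn to each item, tagging failures with the stage name."""
    for item in items:  # errors from upstream stages pass through untouched
        try:
            result = fn(item)
        except Exception as exc:
            # 'from None' suppresses the chained traceback in the output.
            raise PipelineError(name, exc) from None
        yield result


# Pipeline A -> B -> C where stage A fails on the third item.
a = run_stage("A:parse", int, ["1", "2", "oops"])
b = run_stage("B:double", lambda x: x * 2, a)
c = run_stage("C:inc", lambda x: x + 1, b)

try:
    list(c)
except PipelineError as err:
    failed_stage = err.stage
    message = str(err)
```

Here `failed_stage` ends up as `"A:parse"`, so the first line of the error already says where to look, even though the exception surfaced while iterating stage C.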

Some things I'm seeing that would be needed from mojo:

Major blockers

  • Iterable / Iterator / Gettable traits that pipes can implement.
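
For comparison, the closest python analogue to such traits is probably `typing.Protocol`. The sketch below is an assumption about what the traits could look like, not a proposal for mojo syntax: `IterablePipe` and `GettablePipe` are hypothetical names, and `runtime_checkable` only checks that the methods exist, not their signatures.

```python
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class IterablePipe(Protocol):
    """Hypothetical 'Iterable' trait: the pipe can be looped over."""

    def __iter__(self) -> Iterator: ...


@runtime_checkable
class GettablePipe(Protocol):
    """Hypothetical 'Gettable' trait: the pipe supports indexed access."""

    def __getitem__(self, index: int): ...

    def __len__(self) -> int: ...


class Countdown:
    """Streams start, start-1, ..., 1. Implements only the iterable trait."""

    def __init__(self, start: int):
        self.start = start

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.start, 0, -1))


class Squares:
    """Random-access pipe of squares. Implements only the gettable trait."""

    def __init__(self, n: int):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, index: int) -> int:
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index


def first_item(pipe: GettablePipe):
    """Accepts any pipe that satisfies the gettable trait."""
    return pipe[0]


checks = {
    "countdown_iterable": isinstance(Countdown(3), IterablePipe),
    "countdown_gettable": isinstance(Countdown(3), GettablePipe),
    "squares_gettable": isinstance(Squares(4), GettablePipe),
}
```

Neither class inherits from the protocols; they match structurally. A static checker such as mypy can use the same protocols to reject `first_item(Countdown(3))` before the code runs, which is the kind of guarantee a compiled language with traits would give by default.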

Minor needs

  • yield / coroutines. I think a working pipeline can hack around this for now.

I'm curious what other frameworks / libs people have used, liked, disliked.

from basalt.

StijnWoestenborghs commented on June 9, 2024

Hi @josiahls, I think data pipelining is quite a complex, but important, topic that Basalt might not be focusing on in the near future. What I read about it is that torchdata suffers from a lack of lower-level control over things like multiprocessing, and even though that should be possible in Mojo, other than algorithm.parallelize, Mojo doesn't have something like a threading API (yet!).

For sure the type safety and safely passing through references to the data without copies will and must be possible. And as a first rework of the current dataloader (which simply loads all data in memory), I think an ultra simple pipeline that 'chunk-loads' the data in memory & passes it to the model like that should be the goal. Additionally, Mojo might have an edge here with its very convenient and easy-to-use compile time features. Are you perhaps interested in trying this out?
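
As a rough picture of the 'chunk-load' idea, here is a minimal python sketch rather than Mojo or Basalt code. `chunk_load` and `fake_model` are hypothetical names, the in-memory `io.StringIO` stands in for a real file, and the one-number-per-line format is an assumption made only for the example.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_load(lines: Iterable[str], chunk_size: int) -> Iterator[List[float]]:
    """Yield lists of at most chunk_size parsed values, reading lazily."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(lines)
    while True:
        # islice pulls only the next chunk_size lines from the source.
        chunk = [float(line) for line in islice(it, chunk_size)]
        if not chunk:
            return
        yield chunk


def fake_model(batch: List[float]) -> float:
    """Hypothetical stand-in for a model forward pass."""
    return sum(batch)


# Stand-in for a file on disk: one value per line.
data = io.StringIO("1\n2\n3\n4\n5\n")
outputs = [fake_model(chunk) for chunk in chunk_load(data, chunk_size=2)]
```

Only one chunk is held in memory at a time, so the same loop works when `data` is an open file that is too large to load whole.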

Thinking long term, I can see cloud storage integration & distributed computing being massively important here as well. And I wonder if that was one of the re-design considerations for torchdata.
