Comments (2)
I don't want to lick the cookie, but one of the things I'm excited about in mojo is the type safety / memory management.
What are everyone's thoughts on torchdata? I experimented with building an RL framework, fastrl, on top of it.
Pros
- Linking pipelines was very cool and making transforms was easy and I could build some very complex pipelines using it personally and for work.
- Very horizontal inheritance. Learning how to do custom stuff / tear apart torchdata was very easy since the hierarchy was basically flat. All pipes inherited from IterDataPipe or MapDataPipe. I think onboarding new users is a lot easier because of this.
- I was surprised how important this was. An issue I've seen from a lot of dataloading frameworks is they turn into OOP hell, and thus are very hard to extend. My understanding is ray data has this issue from talking to other research friends that tried using it / extending it.
Cons
- The future of torchdata is hazy, and they vaguely/unhelpfully noted they need to redesign some stuff. Below are my guesses.
- Limitations related to python:
- How do you verify that a pipeline A -> B -> C is valid in python? E.g. how do we know those pipes plug into each other correctly? Python doesn't enforce type safety, so unless we somehow check the signatures in python / use pydantic, this doesn't appear possible.
- How do you pass values / references between pipes reliably? E.g. you want to cache data at certain points in the pipeline, but don't want to duplicate the data from earlier in the pipeline.
- If you have a pipeline and want to do multiprocessing, how do you nicely get around the python GIL?
- torchdata was recently testing a dataloader2 that uses pub/sub/messaging but doesn't look like that got anywhere?
- Limitations not related to python
- Exception messages in pipelines (torchdata or not) are simply awful. If you have a pipeline A -> B -> C, and there is an exception in A, you will get a long stack trace all the way up the pipeline. I feel like this might be the Achilles heel of a lot of pipeline dataloader frameworks.
- I think mojo has inlining / nodebug capabilities that can make this not so bad (skip internal functions), which would otherwise not be possible in python (?)
- Probably needs an innovation here: Modify the exception / stack trace when using the pipelines so the stack traces are easier to read.
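As a rough illustration of the stack trace idea above, here is a minimal Python sketch. Everything in it (`PipelineError`, `run_stage`, `format_user_trace`) is a hypothetical name made up for this example, not part of torchdata or mojo. It tags a failure with the stage that raised it and formats only the frames inside the user's function, dropping the pipeline plumbing.

```python
import traceback
from typing import Callable, Iterable, Iterator


class PipelineError(Exception):
    """Hypothetical wrapper that records which stage failed."""

    def __init__(self, stage_name: str, original: BaseException) -> None:
        super().__init__(f"stage {stage_name!r} failed: {original!r}")
        self.stage_name = stage_name
        self.original = original


def run_stage(name: str, fn: Callable, items: Iterable) -> Iterator:
    """Lazily apply fn to each item, tagging failures with the stage name."""
    for item in items:
        try:
            result = fn(item)
        except Exception as exc:
            # "from None" hides the chained traceback; the original exception
            # is kept on the wrapper so it can be formatted separately.
            raise PipelineError(name, exc) from None
        yield result


def format_user_trace(err: PipelineError) -> str:
    """Format the original exception without the run_stage frame."""
    original = err.original
    tb = original.__traceback__
    # The first traceback entry is the run_stage frame that caught the error;
    # skipping it leaves only frames inside the user's function.
    user_tb = tb.tb_next if tb is not None else None
    return "".join(traceback.format_exception(type(original), original, user_tb))


def parse(text: str) -> int:
    return int(text)


# A -> B -> C, with the failure happening in A on the second item.
pipeline = run_stage("C", str, run_stage("B", abs, run_stage("A", parse, ["1", "x"])))
try:
    for value in pipeline:
        print(value)
except PipelineError as err:
    print(err)
    print(format_user_trace(err))
```

This only approximates what compiler-level support could do: the generator frames of B and C still exist at runtime, they are just left out of the formatted message.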
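For the "how do we know pipes plug into each other" question, one option in python is to compare type annotations before running anything. The sketch below assumes every stage is a fully annotated function, and `check_pipeline` is a hypothetical helper, not a torchdata or pydantic API. It uses plain equality on annotations, so it does not understand subtyping or unions.

```python
import inspect
from typing import Callable, Sequence, get_type_hints


def first_param_type(fn: Callable):
    """Return the annotation of fn's first parameter, or None if missing."""
    params = list(inspect.signature(fn).parameters)
    if not params:
        raise TypeError(f"{fn.__name__} takes no input")
    return get_type_hints(fn).get(params[0])


def check_pipeline(stages: Sequence[Callable]) -> None:
    """Raise TypeError if adjacent stages have mismatched annotations."""
    for upstream, downstream in zip(stages, stages[1:]):
        produced = get_type_hints(upstream).get("return")
        expected = first_param_type(downstream)
        if produced is None or expected is None:
            raise TypeError(
                f"{upstream.__name__} -> {downstream.__name__}: missing annotation"
            )
        # Simplification: exact match only, no subclass or Union handling.
        if produced != expected:
            raise TypeError(
                f"{upstream.__name__} produces {produced}, "
                f"but {downstream.__name__} expects {expected}"
            )


def read_bytes(path: str) -> bytes:
    return path.encode()


def decode(raw: bytes) -> list[int]:
    return list(raw)


def mean(values: list[int]) -> float:
    return sum(values) / len(values)


check_pipeline([read_bytes, decode, mean])  # passes silently
try:
    check_pipeline([read_bytes, mean])  # bytes does not match list[int]
except TypeError as err:
    print(err)
```

The check only runs when someone remembers to call it, which is the gap a compiler-enforced type system would close.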
Some things I'm seeing that would be needed from mojo:
Major blockers
Iterable / Iterator / Gettable traits that pipes can implement.
Minor needs
yield / coroutines. I think a working pipeline can hack around this for now.
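For comparison, this is roughly what those two needs look like in python, where the iteration protocol (`__iter__`) plays the role of an Iterable trait and `yield` makes each pipe lazy. `MapPipe` and `FilterPipe` are hypothetical names for this sketch, not torchdata classes. Note the flat structure: neither pipe inherits from the other.

```python
from typing import Callable, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class MapPipe(Generic[T, U]):
    """Applies fn to every item of the upstream iterable, lazily."""

    def __init__(self, source: Iterable[T], fn: Callable[[T], U]) -> None:
        self.source = source
        self.fn = fn

    def __iter__(self) -> Iterator[U]:
        for item in self.source:
            yield self.fn(item)


class FilterPipe(Generic[T]):
    """Passes through only the items for which keep(item) is true."""

    def __init__(self, source: Iterable[T], keep: Callable[[T], bool]) -> None:
        self.source = source
        self.keep = keep

    def __iter__(self) -> Iterator[T]:
        for item in self.source:
            if self.keep(item):
                yield item


# Any iterable can be a source, and any pipe can feed any other pipe.
evens = FilterPipe(range(6), lambda x: x % 2 == 0)
squares = MapPipe(evens, lambda x: x * x)
print(list(squares))  # [0, 4, 16]
```

Without `yield`, each `__iter__` would have to return a hand-written iterator object with its own `__next__` and state, which is the kind of workaround the "hack around this" comment refers to.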
I'm curious what other frameworks / libs people have used, liked, disliked.
from basalt.
Hi @josiahls, I think data pipelining is quite a complex, but important, topic that Basalt might not be focusing on in the near future. What I read about it is that torchdata suffers from a lack of lower-level control over things like multiprocessing, and even though it should be possible in Mojo, other than algorithm.parallelize it doesn't have something like a threading API (yet!).
For sure, type safety and safely passing references to the data without copies will and must be possible. And as a first rework of the current dataloader (which simply loads all data in memory), I think an ultra simple pipeline that 'chunk-loads' the data in memory & passes it to the model like that should be the goal. Additionally, Mojo might have an edge here with its very convenient and easy-to-use compile-time features. Are you perhaps interested in trying this out?
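To make the 'chunk-load' idea concrete, here is a minimal sketch in python rather than Mojo. It assumes a CSV file of numeric rows as the data source, and `iter_rows` and `chunk_load` are hypothetical helpers, not Basalt's dataloader API. Only one chunk is held in memory at a time.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_rows(path: str) -> Iterator[list[float]]:
    """Lazily parse one CSV row at a time instead of reading the whole file."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:  # skip blank lines
                yield [float(value) for value in row]


def chunk_load(samples: Iterable[T], chunk_size: int) -> Iterator[list[T]]:
    """Yield lists of up to chunk_size samples; the last chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(samples)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

A training loop would then look like `for chunk in chunk_load(iter_rows("train.csv"), 1024): ...`, where the file name and chunk size are placeholders.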
Thinking long term, I can see cloud storage integration & distributed computing being massively important here as well. And I wonder if that was one of the re-design considerations for torchdata.