Comments (7)

ejguan commented on May 18, 2024

> general data transforms

I am not a hundred percent sure about this. I would say we guarantee that the DataPipe graph (pipeline) is serializable together with user-provided functions. Our current approach is to pickle lambda functions using dill.
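As a rough illustration of why lambdas need special handling, here is a minimal sketch that uses only the standard library `pickle` (not dill, which is what the comment above refers to). The helper `is_picklable` is a hypothetical name introduced for this example and is not part of any DataPipe API.

```python
import functools
import operator
import pickle


def is_picklable(obj):
    """Return True if the standard-library pickle can serialize obj.

    Hypothetical helper for illustration only.
    """
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A lambda has no importable name, so plain pickle cannot reference it.
lambda_is_ok = is_picklable(lambda x: x + 1)

# An importable function wrapped in functools.partial pickles by reference.
add_one = functools.partial(operator.add, 1)
partial_is_ok = is_picklable(add_one)
```

With plain `pickle`, one common workaround is to replace the lambda with a named, importable function (or a `functools.partial` of one), which avoids the need for a more permissive serializer.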

> data splitting into train/validation sets

We provide a utility DataPipe that lets users split data into two separate pipelines. This may not be related, but I want to let you know: we will provide dynamic sharding, which means users don't need to hardcode sharding settings in their DataSet.

> summary statistic computation

We currently have a way to retrieve the graph of a data pipeline, but better visualization is not done yet: https://github.com/pytorch/pytorch/blob/3202028ed1ca24c91dc7192ef69b305690db7abc/torch/utils/data/graph.py#L54

> Are DataPipes guaranteed to be pickle safe and is there anything that needs to be done to support that?

The DataPipes we provide are guaranteed to be serializable, but we can't guarantee that for users' own DataPipe implementations. If users choose to use DataLoader2 with their DataPipes, though, they will be notified whether their DataPipe is serializable.

> I was also wondering if there's multiprocessing based datapipes and how that works since this seems comparable

We will provide multiprocessing. The functionality is in place, but we are still working with internal teams to align the API of DataLoaderV2.

> should this be on the pytorch discussion forums instead?

I don't think this is the right time, as we have not officially released yet. Also, the RFC is tracked in PyTorch Core, not in this repo.

from data.

ejguan commented on May 18, 2024

cc: @VitalyFedyunin to see if you want to supply other comments.

kiukchung commented on May 18, 2024

@ejguan regarding builtin datapipes being pickle-safe... is this the way you'd recommend folks implement checkpointing for datapipes?

ejguan commented on May 18, 2024

> regarding builtin datapipes being pickle-safe

IIRC, it's a requirement for both multiprocessing and checkpointing. @NivekT is working on checkpointing, so feel free to chime in.

NivekT commented on May 18, 2024

Yes, though you can write custom __getstate__ and __setstate__ methods to accomplish that.
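A minimal sketch of what custom `__getstate__` and `__setstate__` methods might look like, assuming a plain Python iterable in place of a real DataPipe. The class `ResumableRange` and its attributes are hypothetical names for illustration. It holds a lock that cannot be pickled, so the lock is dropped when saving state and recreated when restoring.

```python
import pickle
import threading


class ResumableRange:
    """Hypothetical iterable standing in for a custom DataPipe."""

    def __init__(self, stop):
        self.stop = stop
        self.position = 0  # how many items have been produced so far
        self._lock = threading.Lock()  # locks cannot be pickled

    def __iter__(self):
        while self.position < self.stop:
            with self._lock:
                value = self.position
                self.position += 1
            yield value

    def __getstate__(self):
        # Copy the attributes and leave out the unpicklable lock.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()
```

Because `position` is part of the saved state, an unpickled copy resumes where the original left off instead of restarting from zero.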

kiukchung commented on May 18, 2024

IIUC when num_workers > 1 the DataPipes are iterated on the dataloader worker (child process). Therefore, the "state" of the datapipe will be resident on the child proc not the main parent (where the trainer loop will run). How exactly does one get the pickled state of the datapipe from the child process back to the parent for checkpointing?

NivekT commented on May 18, 2024

Good question! The plan is to use PrototypeMultiprocessingReadingService to pass request/response messages, where the response is the pickled state of the DataPipe.
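The request/response idea can be sketched roughly as follows. This is not the PrototypeMultiprocessingReadingService API: it is a toy illustration in which a thread stands in for the worker process and two `queue.Queue` objects stand in for the message channel. All names here (`CountingSource`, `worker_loop`, `checkpoint_via_messages`, and the message strings) are assumptions made for the example.

```python
import pickle
import queue
import threading


class CountingSource:
    """Hypothetical stand-in for a DataPipe that lives in the worker."""

    def __init__(self):
        self.position = 0

    def next_item(self):
        value = self.position
        self.position += 1
        return value

    def state(self):
        return {"position": self.position}


def worker_loop(source, requests, responses):
    """Serve requests until a message other than "next" or "get_state" arrives."""
    while True:
        request = requests.get()
        if request == "next":
            responses.put(("item", source.next_item()))
        elif request == "get_state":
            # The response carries the pickled state back to the parent.
            responses.put(("state", pickle.dumps(source.state())))
        else:
            break


def checkpoint_via_messages(items_before_checkpoint=3):
    """Consume a few items, then ask the worker for its pickled state."""
    requests, responses = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=worker_loop, args=(CountingSource(), requests, responses)
    )
    worker.start()
    for _ in range(items_before_checkpoint):
        requests.put("next")
        responses.get()
    requests.put("get_state")
    kind, payload = responses.get()
    requests.put("stop")
    worker.join()
    return kind, pickle.loads(payload)
```

The parent never touches the worker's object directly; it only sees bytes that it can unpickle or write into a checkpoint, which is the property that matters once the worker is a separate process.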
