
Comments (40)

rom1504 commented on May 18, 2024

With the current architecture, it should be pretty natural to make it possible to choose between a multiprocessing pool and a Spark or Dask distributed environment.
This may be a good thing to add in order to get multi-node support.
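
A minimal sketch of what that could look like, assuming a hypothetical pure process_shard mapper; the multiprocessing and pyspark wiring here is illustrative, not img2dataset's actual API:

from multiprocessing import Pool

def process_shard(shard_path):
    # hypothetical pure mapper: download, resize and write one shard,
    # using nothing but its argument (no driver-side state)
    print(f"processing {shard_path}")

def distribute(shard_paths, mode="multiprocessing"):
    # dispatch the same mapper to either backend
    if mode == "multiprocessing":
        with Pool() as pool:
            pool.map(process_shard, shard_paths)
    elif mode == "pyspark":
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
        spark.sparkContext.parallelize(shard_paths, len(shard_paths)).foreach(process_shard)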

rom1504 commented on May 18, 2024

https://github.com/rom1504/img2dataset/blob/main/img2dataset/downloader.py#L337: at least do it at the file level so this can be a pure mapper, following the same idea as rom1504/clip-retrieval#79 (comment).

rom1504 commented on May 18, 2024

https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/distributed_backends/distributed_backend.py can also be an interesting inspiration

rom1504 commented on May 18, 2024

https://github.com/horovod/horovod/blob/386be429b1417a1f6cb5e715bbe36efd2e74f402/horovod/spark/runner.py#L244 is a good trick for letting the user build their own Spark context.
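
A sketch of that trick, assuming the job entry point grows an optional spark_session parameter (hypothetical name):

from pyspark.sql import SparkSession

def get_or_create_spark(spark_session=None):
    # reuse the session the user built themselves, and only fall back
    # to a local default when none is given
    if spark_session is not None:
        return spark_session
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("img2dataset")
        .getOrCreate()
    )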

rom1504 commented on May 18, 2024

To move forward on this, moving the reader to the executor level could be good.
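
A sketch of the idea, assuming the url list is pre-split into parquet files: only the paths travel through the driver, and each executor does its own read (the paths and the pandas-based reader are illustrative):

import pandas as pd
from pyspark.sql import SparkSession

def read_and_process(shard):
    # executor-side reader: open the shard file locally instead of
    # shipping its contents from the driver
    shard_id, path = shard
    urls = pd.read_parquet(path)["url"]
    for url in urls:
        pass  # download / resize / write

spark = SparkSession.builder.getOrCreate()
shards = [(i, f"urls/part-{i}.parquet") for i in range(10)]  # illustrative paths
spark.sparkContext.parallelize(shards, len(shards)).foreach(read_and_process)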

rom1504 commented on May 18, 2024

https://docs.ray.io/en/latest/data/dataset-pipeline.html looks good

rom1504 commented on May 18, 2024

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

rom1504 commented on May 18, 2024

Spark streaming can handle a streaming collection of files in a folder.
However, it may not be able to handle partial files.
Solutions:

  • write the partial files to a temporary dir and move them at the end (the Spark solution)
  • just run many standard Spark batches instead
  • simply push file names into a TCP stream / queue and have Spark streaming read that!

The third solution is the best.
It should also allow this to work in distributed inference mode, and for any kind of inference.
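
A sketch of the third option, assuming some feeder process pushes one shard file name per line over TCP on port 9999, and reusing the hypothetical process_shard mapper from above (the DStream API shown here exists in classic Spark releases but is deprecated in recent ones):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="img2dataset-streaming")
ssc = StreamingContext(sc, batchDuration=10)  # 10 s micro-batches

# each line received on the socket is one shard file name
file_names = ssc.socketTextStream("localhost", 9999)
file_names.foreachRDD(lambda rdd: rdd.foreach(process_shard))

ssc.start()
ssc.awaitTermination()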

rom1504 commented on May 18, 2024

https://www.bogotobogo.com/Hadoop/BigData_hadoop_Apache_Spark_Streaming.php

Internally, a DStream is represented as a sequence of RDDs

rom1504 commented on May 18, 2024

https://livebook.manning.com/book/spark-in-action/chapter-6/12

rom1504 commented on May 18, 2024

https://towardsdatascience.com/apache-spark-stream-reading-data-from-local-http-server-d37e90e70fb0

rom1504 commented on May 18, 2024

https://stackoverflow.com/questions/33214988/spark-streaming-over-tcp

rom1504 commented on May 18, 2024

https://github.com/criteo/cluster-pack/tree/master/examples/spark-with-S3 may be helpful for creating a pyspark session, but it should probably not be included by default; instead, put it behind an option or even an example script, and let the user create the session as they prefer.

rom1504 commented on May 18, 2024

OK, we now have pyspark support.

The next step is to actually try running it on some pyspark clusters.
I intend to try (and document):

rom1504 commented on May 18, 2024

standalone

https://spark.apache.org/downloads.html

https://spark.apache.org/docs/latest/spark-standalone.html

wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar xf spark-3.2.0-bin-hadoop3.2.tgz

on master:
bash ./sbin/start-master.sh

on nodes:
bash ./sbin/start-worker.sh "spark://master-ip:7077"
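
Once the master and workers are up, a client session just needs to point at the master URL (the ip and the memory setting below are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-ip:7077")   # the URL printed by start-master.sh
    .appName("img2dataset")
    .config("spark.executor.memory", "1G")
    .getOrCreate()
)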

rom1504 commented on May 18, 2024

Make sure all writers overwrite, so this works well with Spark's retry feature (just delete the file if it already exists).
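
A minimal sketch of such an idempotent write, assuming fsspec-style file systems (write_shard is a hypothetical helper, not the actual writer):

import fsspec

def write_shard(path, data):
    # a retried Spark task may find output from a failed earlier attempt;
    # delete it so the retry cleanly overwrites
    fs, _, (p,) = fsspec.get_fs_token_paths(path)
    if fs.exists(p):
        fs.rm(p)
    with fs.open(p, "wb") as f:
        f.write(data)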

rom1504 commented on May 18, 2024

https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html

rom1504 commented on May 18, 2024

Not obvious how to run standalone: how do you send the env to the other nodes? Where do you write? (How do you set up a distributed fs locally?)
Maybe just try AWS EMR next.

rom1504 commented on May 18, 2024

maybe using sshfs could work

rom1504 commented on May 18, 2024

An end-to-end Docker example for deploying standalone PySpark with SparkSession.builder and PEX can be found here; it uses cluster-pack, a library on top of PEX that automates the intermediate step of having to create & upload the PEX manually.

rom1504 commented on May 18, 2024

Since this is just a mapper, it would also be possible to build a Docker image and spawn it once per input file, like https://blog.iron.io/docker-iron-io-super-easy-batch-processing/.
Might be interesting.

rom1504 commented on May 18, 2024

https://cloud.google.com/blog/products/data-analytics/how-cloud-batch-and-stream-data-processing-works

rom1504 commented on May 18, 2024

https://beam.apache.org/get-started/quickstart-py/
https://beam.apache.org/documentation/programming-guide/
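
For comparison, the file-level mapper expressed as a Beam pipeline would look roughly like this (process_shard is the hypothetical mapper from above; the chosen runner then decides whether it executes locally, on Spark, or elsewhere):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ListShards" >> beam.Create(["shard-0.parquet", "shard-1.parquet"])  # illustrative
        | "ProcessShards" >> beam.Map(process_shard)
    )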

rom1504 commented on May 18, 2024

https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

rom1504 commented on May 18, 2024

https://beam.apache.org/documentation/runners/spark/

rom1504 commented on May 18, 2024

https://towardsdatascience.com/pex-the-secret-sauce-for-the-perfect-pyspark-deployment-of-aws-emr-workloads-9aef0d8fa3a5

rom1504 commented on May 18, 2024

https://aws.github.io/aws-emr-containers-best-practices/submit-applications/docs/spark/pyspark/

rom1504 commented on May 18, 2024

Possibly reconsider a streaming-based approach, to eliminate the concept of a file from most of the pipeline.

rom1504 commented on May 18, 2024

Consider yielding examples in the downloader and moving the aggregation done by the writer to the distributor level (not the driver, but an abstraction on top of the downloader running in the workers).
That would allow perfect balancing of the written files.
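
A sketch of that design, with hypothetical fetch_and_resize and write_file helpers: the downloader yields examples one by one, and the aggregation into fixed-size output files happens inside each worker partition rather than on the driver:

def process_partition(urls, examples_per_file=10000):
    # runs on the workers: buffer examples and flush one full file at a time,
    # so every written file (except possibly the last) has exactly the target size
    buffer = []
    for url in urls:
        buffer.append(fetch_and_resize(url))  # hypothetical per-example downloader
        if len(buffer) == examples_per_file:
            write_file(buffer)                # hypothetical writer
            buffer = []
    if buffer:
        write_file(buffer)

url_rdd.foreachPartition(process_partition)   # url_rdd: an existing RDD of urls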

rom1504 commented on May 18, 2024

https://github.com/intel-analytics/analytics-zoo looks really good

rom1504 commented on May 18, 2024

https://analytics-zoo.readthedocs.io/en/latest/doc/Orca/QuickStart/orca-pytorch-distributed-quickstart.html

rom1504 commented on May 18, 2024

This could potentially be made easier by having a service handle the http/dns part, returning the original image, and letting the img2dataset job do the resizing and packaging.

The pipeline is:

  • read urls
  • shard
  • download each url
  • resize
  • write

The download part may be complicated to scale beyond 1000 requests/s due to DNS, so maybe it's better to let a service do that part.
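
A sketch of that split, assuming a hypothetical internal service reachable at DOWNLOAD_SERVICE_URL that owns the DNS + HTTP part and returns the original image bytes; the job keeps only resize and packaging:

import requests

DOWNLOAD_SERVICE_URL = "http://download-service.internal/fetch"  # assumed endpoint

def process_url(url):
    # the service resolves DNS and fetches the image; this job only
    # resizes and packages the result
    resp = requests.get(DOWNLOAD_SERVICE_URL, params={"url": url}, timeout=30)
    resp.raise_for_status()
    image = resize(resp.content)   # hypothetical resizer
    return package(image)          # hypothetical packager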

rom1504 commented on May 18, 2024

Consider making two-way shared file systems optional (this can be done by distributing the shards via pyspark/python serialization instead of arrow files saved to the file system).
That would make it possible to use an rsync target as the target file system.
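
A sketch of that change: the driver keeps the shard contents in memory and ships them through Spark's serialization, so workers never need read access back to the driver's file system (load_url_list and process_shard_in_memory are hypothetical helpers):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

urls = load_url_list()        # read once, on the driver
num_shards = 64
shards = [(i, urls[i::num_shards]) for i in range(num_shards)]

# (shard_id, url list) pairs travel via Spark serialization, not via
# arrow files on a shared file system
spark.sparkContext \
    .parallelize(shards, numSlices=num_shards) \
    .foreach(lambda shard: process_shard_in_memory(*shard))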

rom1504 commented on May 18, 2024

rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_cluster.sh 
bash go_master.sh
bash go_worker.sh

rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_master.sh 
./sbin/start-master.sh -h ip -p 7077

rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_worker.sh 
export SPARK_IDENT_STRING=worker1
./sbin/start-worker.sh -c 2 -m 1G -h ip -p 3456 spark://ip:7077
export SPARK_IDENT_STRING=worker2
./sbin/start-worker.sh -c 2 -m 1G -h ip -p 3456 spark://ip:7077

rom1504 commented on May 18, 2024

This is almost done now.
The last thing to do is a guide on how to set up a Spark cluster on a set of machines available through ssh.

rom1504 commented on May 18, 2024

https://github.com/rom1504/img2dataset/blob/main/examples/distributed_img2dataset_tutorial.md here is the guide.

It works, but it's a bit complex.

I would also like to propose these alternatives:

  • using AWS EMR on EKS
  • maybe offering the user another distribution mode without Spark, using ssh directly

rom1504 commented on May 18, 2024

https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks.html

rom1504 commented on May 18, 2024

AWS EMR on EKS is actually rather painful to set up.

I'm considering going the raw EC2 route instead.
It would have the added benefit of working naturally with any other instance provider.

The options are to document the Spark setup for this case, or to do a no-Spark option (which would require implementing robustness).

rom1504 commented on May 18, 2024

Writing to s3 (and hdfs) from any machine is working just fine now.

I believe the only additional thing I will try here is a pure ssh-based strategy, to make it easier for people to run in distributed mode.

rom1504 commented on May 18, 2024

This is working. A little troublesome to set up, but overall working!
