Comments (40)
with the current architecture, it should be pretty natural to make it possible to choose between a multiprocessing pool and a Spark or Dask distributed environment
Could be a good thing to add in order to get multi-node support
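A minimal sketch of what such a pluggable distributor could look like; the function names and the `worker_fn` contract here are illustrative assumptions, not the actual img2dataset API:

```python
from multiprocessing import Pool

def multiprocessing_distributor(process_count, worker_fn, shards):
    # single-node path: fan the shards out over a local process pool
    with Pool(process_count) as pool:
        for _ in pool.imap_unordered(worker_fn, shards):
            pass

def pyspark_distributor(spark, worker_fn, shards):
    # multi-node path: one Spark task per shard, each running the same mapper
    shards = list(shards)
    rdd = spark.sparkContext.parallelize(shards, len(shards))
    rdd.foreach(worker_fn)
```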
https://github.com/rom1504/img2dataset/blob/main/img2dataset/downloader.py#L337 at least do it at the file level, so this can be a pure mapper; follow the same idea as rom1504/clip-retrieval#79 (comment)
https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/distributed_backends/distributed_backend.py can also be an interesting inspiration
https://github.com/horovod/horovod/blob/386be429b1417a1f6cb5e715bbe36efd2e74f402/horovod/spark/runner.py#L244 is a good trick to let the user build their own Spark context
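Applied here, the trick could look roughly like this (a sketch; `local[16]` and the app name are arbitrary defaults, not the actual defaults):

```python
from pyspark.sql import SparkSession

def get_spark_session():
    # reuse the session the user built themselves, if any
    session = SparkSession.getActiveSession()
    if session is not None:
        return session
    # otherwise fall back to a local default
    return (
        SparkSession.builder
        .master("local[16]")
        .appName("img2dataset")
        .getOrCreate()
    )
```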
to move forward on this, moving the reader to the executor level could be good
https://docs.ray.io/en/latest/data/dataset-pipeline.html looks good
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Spark Streaming can handle a streaming collection of files in a folder.
However, it may not be able to handle partial files.
Solutions:
- write the partial files to a temporary dir and move them at the end (the standard Spark solution)
- just do many standard Spark batches instead
- simply push file names into a TCP stream / queue and have Spark Streaming read that!
The third solution is the best (sketched below).
That should also allow this to work in distributed inference mode, and for any kind of inference
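A rough sketch of the third solution using the DStream API; the host, port, and the `process_file` handler are made-up placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def process_file(path):
    # placeholder for the real work: read the completed file and process it
    print("processing", path)

sc = SparkContext(appName="img2dataset-streaming")
ssc = StreamingContext(sc, batchDuration=5)

# the producer pushes one completed file name per line into this socket
file_names = ssc.socketTextStream("localhost", 9999)
file_names.foreachRDD(lambda rdd: rdd.foreach(process_file))

ssc.start()
ssc.awaitTermination()
```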
https://www.bogotobogo.com/Hadoop/BigData_hadoop_Apache_Spark_Streaming.php
Internally, a DStream is represented as a sequence of RDDs
https://livebook.manning.com/book/spark-in-action/chapter-6/12
https://towardsdatascience.com/apache-spark-stream-reading-data-from-local-http-server-d37e90e70fb0
https://stackoverflow.com/questions/33214988/spark-streaming-over-tcp
https://github.com/criteo/cluster-pack/tree/master/examples/spark-with-S3 may be helpful to create a pyspark session, but it should probably not be included by default; instead put it behind an option or even an example script, and let the user create the session as they prefer
ok, we now have pyspark support.
The next step here is to actually try running it on some pyspark clusters.
I intend to try (and document):
- just 2 simple nodes by using https://spark.apache.org/docs/latest/spark-standalone.html
- amazon emr
- maybe a yarn cluster
standalone:
https://spark.apache.org/downloads.html
https://spark.apache.org/docs/latest/spark-standalone.html

```bash
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar xf spark-3.2.0-bin-hadoop3.2.tgz

# on the master:
bash ./sbin/start-master.sh

# on the worker nodes:
bash ./sbin/start-worker.sh "spark://master-ip:7077"
```
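Once the master and workers are up, a job can target the cluster like this (a sketch):

```python
from pyspark.sql import SparkSession

# point a session at the standalone master started above
spark = (
    SparkSession.builder
    .master("spark://master-ip:7077")
    .appName("img2dataset")
    .getOrCreate()
)
```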
make sure all writers overwrite their output, so this works well with Spark's feature of retrying tasks (just delete the file if it already exists)
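The idea in a sketch, using fsspec (which the img2dataset writers are built on); the function shape is illustrative:

```python
import fsspec

def write_shard(output_path, data):
    # idempotent write: a retried Spark task first deletes any partial
    # file left behind by a previous attempt, then rewrites from scratch
    fs, path = fsspec.core.url_to_fs(output_path)
    if fs.exists(path):
        fs.rm(path)
    with fs.open(path, "wb") as f:
        f.write(data)
```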
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
-> not obvious how to run standalone: how do you send the Python environment to the other nodes, and where do you write the output (how do you set up a distributed fs locally)?
maybe just try aws emr next
maybe using sshfs could work
An end-to-end Docker example for deploying standalone PySpark with SparkSession.builder and PEX can be found here - it uses cluster-pack, a library on top of PEX that automates the intermediate step of having to create & upload the PEX manually.
since this is just a mapper, it would also be possible to build a docker image and spawn it once per input file, like https://blog.iron.io/docker-iron-io-super-easy-batch-processing/
might be interesting
https://beam.apache.org/get-started/quickstart-py/
https://beam.apache.org/documentation/programming-guide/
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
https://beam.apache.org/documentation/runners/spark/
https://aws.github.io/aws-emr-containers-best-practices/submit-applications/docs/spark/pyspark/
possibly reconsider a streaming-based approach to eliminate the concept of a file from most of the pipeline
consider yielding examples in the downloader and moving the aggregation done by the writer to the distributor level (not the driver, but an abstraction on top of the downloader running in the workers)
that would allow for perfect balancing of written files
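A sketch of that shape (`fetch_and_resize` and the writer interface are hypothetical): the downloader becomes a generator, and the grouping into output files happens above it, inside the workers:

```python
def download(urls):
    # pure mapper: yield one example per successfully downloaded url
    for url in urls:
        image = fetch_and_resize(url)  # hypothetical helper
        if image is not None:
            yield {"url": url, "image": image}

def aggregate_in_worker(url_shards, writer, examples_per_file=10000):
    # runs inside each worker: regroup the example stream into
    # fixed-size output files, independent of input shard sizes
    buffer = []
    for shard in url_shards:
        for example in download(shard):
            buffer.append(example)
            if len(buffer) == examples_per_file:
                writer.write_file(buffer)
                buffer = []
    if buffer:
        writer.write_file(buffer)
```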
https://github.com/intel-analytics/analytics-zoo looks really good
This could potentially be made easier by having a service handle the http/dns part, returning the original image and letting the img2dataset job do the resizing and packaging.
The pipeline is:
- read urls
- shard
- download each url
- resize
- write
The download part may be complicated to scale beyond 1000 requests/s due to dns, so maybe it's better to delegate that part to a service (see the sketch below).
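A sketch of that split; the fetch service URL and its API are entirely hypothetical:

```python
import requests

# hypothetical internal service that handles dns + http and
# returns the original image bytes untouched
FETCH_SERVICE = "http://fetch-service.internal/get"

# one pooled session per worker: no per-url dns resolution on this side
session = requests.Session()

def download_via_service(url):
    resp = session.get(FETCH_SERVICE, params={"url": url}, timeout=30)
    if resp.status_code != 200:
        return None
    # resizing and packaging stay in the img2dataset job
    return resp.content
```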
consider making two-way shared file systems not required (can be done by distributing the shards via pyspark/python serialization instead of arrow + save to file system)
that would make it possible to use an rsync target as the output file system
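A sketch of what that could look like (`process_shard` is a hypothetical per-shard worker function):

```python
def make_shards(url_list, shard_size):
    # keep the shards in memory as plain (shard_id, urls) tuples
    return [
        (i, url_list[i * shard_size:(i + 1) * shard_size])
        for i in range((len(url_list) + shard_size - 1) // shard_size)
    ]

def run(spark, url_list, shard_size, process_shard):
    # shards reach the workers through Spark's own pickle serialization,
    # so no shared file system is needed on the input side; only the
    # output target (e.g. an rsync or s3 destination) is ever written to
    shards = make_shards(url_list, shard_size)
    spark.sparkContext.parallelize(shards, len(shards)).foreach(
        lambda shard: process_shard(*shard)
    )
```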
```bash
rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_cluster.sh
bash go_master.sh
bash go_worker.sh

rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_master.sh
./sbin/start-master.sh -h ip -p 7077

rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_worker.sh
export SPARK_IDENT_STRING=worker1
./sbin/start-worker.sh -c 2 -m 1G -h ip -p 3456 spark://ip:7077
export SPARK_IDENT_STRING=worker2
# the second worker needs its own port to avoid clashing with worker1
./sbin/start-worker.sh -c 2 -m 1G -h ip -p 3457 spark://ip:7077
```
this is almost done now
the last thing to do will be a guide on how to set up a spark cluster on a set of machines available through ssh
https://github.com/rom1504/img2dataset/blob/main/examples/distributed_img2dataset_tutorial.md here is the guide
it works, but it's a bit complex
I would also like to propose these alternatives:
- using aws emr on eks
- maybe offering the user another distribution mode without spark, using ssh directly
https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks.html
aws emr on eks is actually rather painful to set up
I'm considering going the raw ec2 route instead
it would have the added benefit of working in a natural way for any other instance provider
the options are to document the spark setup in this case, or to add a no-spark option (that would require implementing robustness ourselves)
writing to s3 (and hdfs) from any machine is working just fine now
I believe the only additional thing I will try here is a pure ssh-based strategy, to make it easier for people to run in distributed mode
this is working. A little troublesome to set up, but overall working!