Comments (40)
with the current architecture, it should be pretty natural to make it possible to choose between a multiprocessing pool and a Spark or Dask distributed environment
Could be a good thing to add in order to get multi-node support
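A minimal sketch of what such a pluggable distributor could look like; the function names and the `worker_fn` contract here are illustrative assumptions, not the actual img2dataset API:

```python
from multiprocessing import Pool

def multiprocessing_distributor(process_count, worker_fn, shards):
    # single-node path: fan the shards out over a local process pool
    with Pool(process_count) as pool:
        for _ in pool.imap_unordered(worker_fn, shards):
            pass

def pyspark_distributor(spark, worker_fn, shards):
    # multi-node path: one Spark task per shard, each running the same mapper
    shards = list(shards)
    rdd = spark.sparkContext.parallelize(shards, len(shards))
    rdd.foreach(worker_fn)
```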
https://github.com/rom1504/img2dataset/blob/main/img2dataset/downloader.py#L337 at least do it at the file level, so this can be a pure mapper; follow the same idea as rom1504/clip-retrieval#79 (comment)
https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/distributed_backends/distributed_backend.py can also be an interesting inspiration
https://github.com/horovod/horovod/blob/386be429b1417a1f6cb5e715bbe36efd2e74f402/horovod/spark/runner.py#L244 is a good trick to let the user build their own Spark context
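Applied here, the trick could look roughly like this (a sketch; `local[16]` and the app name are arbitrary defaults, not the actual defaults):

```python
from pyspark.sql import SparkSession

def get_spark_session():
    # reuse the session the user built themselves, if any
    session = SparkSession.getActiveSession()
    if session is not None:
        return session
    # otherwise fall back to a local default
    return (
        SparkSession.builder
        .master("local[16]")
        .appName("img2dataset")
        .getOrCreate()
    )
```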
to move forward on this, moving the reader to the executor level could be good
https://docs.ray.io/en/latest/data/dataset-pipeline.html looks good
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Spark Streaming can handle a streaming collection of files in a folder.
However, it may not be able to handle partial files.
Solutions:
- write the partial files to a temporary dir and move them at the end (the standard Spark solution)
- just do many standard Spark batches instead
- simply push file names into a TCP stream / queue and have Spark Streaming read that!
The third solution is the best (sketched below).
That should also allow this to work in distributed inference mode, and for any kind of inference
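A rough sketch of the third solution using the DStream API; the host, port, and the `process_file` handler are made-up placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def process_file(path):
    # placeholder for the real work: read the completed file and process it
    print("processing", path)

sc = SparkContext(appName="img2dataset-streaming")
ssc = StreamingContext(sc, batchDuration=5)

# the producer pushes one completed file name per line into this socket
file_names = ssc.socketTextStream("localhost", 9999)
file_names.foreachRDD(lambda rdd: rdd.foreach(process_file))

ssc.start()
ssc.awaitTermination()
```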
https://www.bogotobogo.com/Hadoop/BigData_hadoop_Apache_Spark_Streaming.php
Internally, a DStream is represented as a sequence of RDDs
https://livebook.manning.com/book/spark-in-action/chapter-6/12
https://towardsdatascience.com/apache-spark-stream-reading-data-from-local-http-server-d37e90e70fb0
https://stackoverflow.com/questions/33214988/spark-streaming-over-tcp
https://github.com/criteo/cluster-pack/tree/master/examples/spark-with-S3 may be helpful to create a pyspark session, but it should probably not be included by default; instead put it behind an option or even an example script, and let the user create the session as they prefer
ok, we now have pyspark support.
The next step here is to actually try running it on some pyspark clusters.
I intend to try (and document):
- just 2 simple nodes by using https://spark.apache.org/docs/latest/spark-standalone.html
- amazon emr
- maybe a yarn cluster
standalone:
https://spark.apache.org/downloads.html
https://spark.apache.org/docs/latest/spark-standalone.html

```bash
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar xf spark-3.2.0-bin-hadoop3.2.tgz

# on the master:
bash ./sbin/start-master.sh

# on the worker nodes:
bash ./sbin/start-worker.sh "spark://master-ip:7077"
```
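Once the master and workers are up, a job can target the cluster like this (a sketch):

```python
from pyspark.sql import SparkSession

# point a session at the standalone master started above
spark = (
    SparkSession.builder
    .master("spark://master-ip:7077")
    .appName("img2dataset")
    .getOrCreate()
)
```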
make sure all writers overwrite their output, so this works well with Spark's feature of retrying tasks (just delete the file if it already exists)
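The idea in a sketch, using fsspec (which the img2dataset writers are built on); the function shape is illustrative:

```python
import fsspec

def write_shard(output_path, data):
    # idempotent write: a retried Spark task first deletes any partial
    # file left behind by a previous attempt, then rewrites from scratch
    fs, path = fsspec.core.url_to_fs(output_path)
    if fs.exists(path):
        fs.rm(path)
    with fs.open(path, "wb") as f:
        f.write(data)
```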
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
-> not obvious how to run standalone: how do you send the Python environment to the other nodes, and where do you write the output (how do you set up a distributed fs locally)?
maybe just try aws emr next
maybe using sshfs could work
An end-to-end Docker example for deploying standalone PySpark with SparkSession.builder and PEX can be found here - it uses cluster-pack, a library on top of PEX that automates the intermediate step of having to create & upload the PEX manually.
since this is just a mapper, it would also be possible to build a docker image and spawn it once per input file, like https://blog.iron.io/docker-iron-io-super-easy-batch-processing/
might be interesting
https://beam.apache.org/get-started/quickstart-py/
https://beam.apache.org/documentation/programming-guide/
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
https://beam.apache.org/documentation/runners/spark/
https://aws.github.io/aws-emr-containers-best-practices/submit-applications/docs/spark/pyspark/
possibly reconsider a streaming-based approach to eliminate the concept of a file from most of the pipeline
consider yielding examples in the downloader and moving the aggregation done by the writer to the distributor level (not the driver, but an abstraction on top of the downloader running in the workers)
that would allow for perfect balancing of written files
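A sketch of that shape (`fetch_and_resize` and the writer interface are hypothetical): the downloader becomes a generator, and the grouping into output files happens above it, inside the workers:

```python
def download(urls):
    # pure mapper: yield one example per successfully downloaded url
    for url in urls:
        image = fetch_and_resize(url)  # hypothetical helper
        if image is not None:
            yield {"url": url, "image": image}

def aggregate_in_worker(url_shards, writer, examples_per_file=10000):
    # runs inside each worker: regroup the example stream into
    # fixed-size output files, independent of input shard sizes
    buffer = []
    for shard in url_shards:
        for example in download(shard):
            buffer.append(example)
            if len(buffer) == examples_per_file:
                writer.write_file(buffer)
                buffer = []
    if buffer:
        writer.write_file(buffer)
```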
https://github.com/intel-analytics/analytics-zoo looks really good
This could potentially be made easier by having a service handle the http/dns part, returning the original image and letting the img2dataset job do the resizing and packaging.
The pipeline is:
- read urls
- shard
- download each url
- resize
- write
The download part may be complicated to scale beyond 1000 requests/s due to dns, so maybe it's better to delegate that part to a service (see the sketch below).
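A sketch of that split; the fetch service URL and its API are entirely hypothetical:

```python
import requests

# hypothetical internal service that handles dns + http and
# returns the original image bytes untouched
FETCH_SERVICE = "http://fetch-service.internal/get"

# one pooled session per worker: no per-url dns resolution on this side
session = requests.Session()

def download_via_service(url):
    resp = session.get(FETCH_SERVICE, params={"url": url}, timeout=30)
    if resp.status_code != 200:
        return None
    # resizing and packaging stay in the img2dataset job
    return resp.content
```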
consider making two-way shared file systems not required (can be done by distributing the shards via pyspark/python serialization instead of arrow + save to file system)
that would make it possible to use an rsync target as the output file system
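A sketch of what that could look like (`process_shard` is a hypothetical per-shard worker function):

```python
def make_shards(url_list, shard_size):
    # keep the shards in memory as plain (shard_id, urls) tuples
    return [
        (i, url_list[i * shard_size:(i + 1) * shard_size])
        for i in range((len(url_list) + shard_size - 1) // shard_size)
    ]

def run(spark, url_list, shard_size, process_shard):
    # shards reach the workers through Spark's own pickle serialization,
    # so no shared file system is needed on the input side; only the
    # output target (e.g. an rsync or s3 destination) is ever written to
    shards = make_shards(url_list, shard_size)
    spark.sparkContext.parallelize(shards, len(shards)).foreach(
        lambda shard: process_shard(*shard)
    )
```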
```bash
rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_cluster.sh
bash go_master.sh
bash go_worker.sh

rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_master.sh
./sbin/start-master.sh -h ip -p 7077

rom1504@rom1504-Fixe:~/spark/spark-3.2.0-bin-hadoop3.2$ cat go_worker.sh
export SPARK_IDENT_STRING=worker1
./sbin/start-worker.sh -c 2 -m 1G -h ip -p 3456 spark://ip:7077
export SPARK_IDENT_STRING=worker2
# the second worker needs its own port to avoid clashing with worker1
./sbin/start-worker.sh -c 2 -m 1G -h ip -p 3457 spark://ip:7077
```
this is almost done now
the last thing to do will be a guide on how to set up a spark cluster on a set of machines available through ssh
https://github.com/rom1504/img2dataset/blob/main/examples/distributed_img2dataset_tutorial.md here is the guide
it works, but it's a bit complex
I would also like to propose these alternatives:
- using aws emr on eks
- maybe offering the user another distribution mode without spark, using ssh directly
https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks.html
aws emr on eks is actually rather painful to set up
I'm considering going the raw ec2 route instead
it would have the added benefit of working in a natural way for any other instance provider
the options are to document the spark setup in this case, or to add a no-spark option (that would require implementing robustness ourselves)
writing to s3 (and hdfs) from any machine is working just fine now
I believe the only additional thing I will try here is a pure ssh-based strategy, to make it easier for people to run in distributed mode
this is working. A little troublesome to set up, but overall working!