
Support parallel downloads · lukechampine/us · 3 comments · CLOSED

lukechampine commented on June 9, 2024
Support parallel downloads


Comments (3)

lukechampine commented on June 9, 2024

Parallel downloading implemented in: c004283 b9a4fc2 fb4bc71 d43b154


vargrant commented on June 9, 2024

I would like to comment on this: "Writes, on the other hand, cannot be significantly sped up by adding parallelism. This is because, as previously mentioned, writing requires accessing all of the hosts, so you'll always be bottlenecked by the slowest host. (Imagine that you have two hosts: one that takes 1s per sector, and one that takes 10s. You might be able to upload all of the first host's data before the second has finished a single sector, but this doesn't improve your overall throughput; you still need to wait for the second host to finish.)"
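To make the quoted bottleneck concrete, here is a minimal Go sketch (Go being the language of the us repo). The per-sector times are the hypothetical ones from the quote; the function name is illustrative, not part of the project.

```go
package main

import "fmt"

// naiveWriteTime models the quoted argument: every write must reach every
// host, so total time is sectors * the slowest host's per-sector time,
// no matter how parallel the uploads are.
func naiveWriteTime(perSectorTimes []float64, sectors int) float64 {
	slowest := 0.0
	for _, t := range perSectorTimes {
		if t > slowest {
			slowest = t
		}
	}
	return float64(sectors) * slowest
}

func main() {
	// Two hosts: 1s/sector and 10s/sector, writing 5 sectors to each.
	fmt.Printf("%.0fs\n", naiveWriteTime([]float64{1, 10}, 5)) // prints 50s
}
```

The fast host finishes in 5s, but the write as a whole still takes 50s.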

First, you need to add some levels of abstraction to make it possible to implement a more advanced algorithm. I propose the following entities:

  1. job - the total data that must be uploaded
  2. chunk - a small piece of the job's data (this one may not be strictly necessary)
  3. worker - a process that uploads data to one host (you will have several workers)
  4. pool of available hosts
  5. some database to store the map of chunk allocations
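The entities above might be sketched in Go as follows. These types and the chunking helper are hypothetical illustrations of the proposal, not the project's actual API.

```go
package main

import "fmt"

// Chunk is a small piece of a job's data.
type Chunk struct {
	Index int
	Data  []byte
}

// Job is the total data to upload, split into chunks.
type Job struct {
	Chunks []Chunk
}

// Worker uploads chunks to a single host.
type Worker struct {
	Host string
}

// HostPool is the set of hosts available for assignment.
type HostPool struct {
	Hosts []string
}

// AllocationMap records which host stores which chunk, standing in
// for the proposed "database" of allocations.
type AllocationMap map[int]string

// newJob splits data into fixed-size chunks.
func newJob(data []byte, chunkSize int) Job {
	var job Job
	for i := 0; i < len(data); i += chunkSize {
		end := i + chunkSize
		if end > len(data) {
			end = len(data)
		}
		job.Chunks = append(job.Chunks, Chunk{Index: i / chunkSize, Data: data[i:end]})
	}
	return job
}

func main() {
	job := newJob(make([]byte, 10), 4)
	fmt.Println(len(job.Chunks)) // 10 bytes in 4-byte chunks -> prints 3
}
```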

Algorithm

  1. Suppose you have 1 TB (1024 GB) of data and need to upload it to 2 hosts. That means each host should hold no more than 512 GB, but in practice the data could be spread across 3 or more hosts. It might also be possible to upload 724 GB to one host and 300 GB to the other. All of this must be configurable by users, because it depends on their security policy.
  2. When a job starts, some process must feed chunks of data to each worker. A worker is also a separate process; you have 2 workers for 2 hosts.
  3. When a worker requests its next chunk to upload, you check its performance against a stored map of the chunks already requested for this job. If you detect that this worker has uploaded more data than the others, you can give it not only the next chunk but also nominate an additional host from your pool for it to upload to.

This algorithm is very flexible and will improve upload speed even when some hosts are slow.


lukechampine commented on June 9, 2024

Closing this; we can address other forms of I/O parallelism in separate issues.

