
Comments (19)

kvz avatar kvz commented on May 22, 2024

I think the second solution is better too

The first one is hard to implement, but also brittle as it requires some form of synchronization between the parallel uploaders in order to guarantee they don't write in the same ranges.

I wonder if a combination of the solutions makes sense:

Upon file creation, the client specifies how many parallel blocks it wants to work on.
The server returns a range of absolute offsets.
The client starts writing into these blocks in parallel, but each 'worker' owns a block.
The upside of partitioning early on is less chance of conflicts / easier implementation. The downside is the upload is as slow as the slowest block.

A different thought I had: what if we leave parallel logic out of the uploading itself, but offer a merge request that lets the server know that several completed uploads (in reality they are blocks) should be concatenated?

from tus-resumable-upload-protocol.

Acconut avatar Acconut commented on May 22, 2024

The downside is the upload is as slow as the slowest block.

This is a general "problem" about parallel upload which occurs for every solution. The file isn't fully uploaded until the last byte is received.

A different thought I had: what if we leave parallel logic out of the uploading itself, but offer a merge request that lets the server know that several completed uploads (in reality they are blocks) should be concatenated?

That's a brilliant idea! It keeps the parallel extension spec as small and focused as possible, while the parallel uploads can also take advantage of all other features.
I could think of the following upload flow:
While creating the different blocks (using the file creation extension) you have to tell the server that each upload is just part of one bigger file (so it doesn't start processing once a single upload is done).
After the client has created (not necessarily uploaded) all blocks, it must tell the server which uploads (identified by their file URLs) are to be concatenated in which order.
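To make the proposed flow concrete, here is a minimal in-memory sketch. All names (`TusServer`, `create_upload`, `concatenate`) are hypothetical illustrations of the flow above, not part of any real tus implementation:

```python
# In-memory sketch of the two-step flow: partial uploads are created
# with a flag so the server defers processing, then a final request
# lists which uploads to concatenate, and in which order.
import itertools

class TusServer:
    def __init__(self):
        self._counter = itertools.count(1)
        self.uploads = {}  # url -> {"data": bytes, "partial": bool}

    def create_upload(self, partial=False):
        url = f"/files/{next(self._counter)}"
        self.uploads[url] = {"data": b"", "partial": partial}
        return url

    def patch(self, url, data):
        # Data is always appended at the current offset.
        self.uploads[url]["data"] += data

    def concatenate(self, urls):
        # The final upload is the parts joined in the requested order.
        final = self.create_upload(partial=False)
        self.uploads[final]["data"] = b"".join(
            self.uploads[u]["data"] for u in urls
        )
        return final

server = TusServer()
a = server.create_upload(partial=True)
b = server.create_upload(partial=True)
server.patch(a, b"hello ")   # the two workers may run in parallel
server.patch(b, b"world")
final = server.concatenate([a, b])
print(server.uploads[final]["data"])  # b'hello world'
```

Since each worker owns its own upload, no range coordination between workers is needed; ordering happens only once, in the concatenate step.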


Acconut avatar Acconut commented on May 22, 2024

@kvz After thinking a bit more about the merge approach, it offers a way to drop the Offset header entirely. Its value in a PATCH request must not be smaller than the current offset of the resource (see #51). Furthermore, non-contiguous uploads should be implemented by creating multiple uploads and merging them (at any time). This means that no Offset bigger than the current offset must be allowed. So the only allowed value of this request header is the one returned in the HEAD request.
In the end we should think about removing the Offset header entirely, since its main functionality, indicating the offset to write at, is no longer necessary. The only reason to keep it is to verify that the client is aware of which offset it uses. In some cases (e.g. a connection drop or internal errors) the offset on the server may differ from the one the client expected.

In addition, I had some initial thoughts on how to implement this feature. Here's my first draft using a new Merge header in the POST request: https://gist.github.com/Acconut/65228c8486e284d49e10


kvz avatar kvz commented on May 22, 2024

This is a general "problem" about parallel upload which occurs for every solution. The file isn't fully uploaded until the last byte is received.

I meant that if you don't work with fixed partitioning, you could repartition so that threads/workers could help out with remaining bytes of the slowest blocks. But let's not go into this as we both agree the downsides of that outweigh any upside.

As for merging 'regular' tus uploads, I'm glad you like it :)

To verify I understand, you're saying we can remove Offset as any PATCH will just be appended to the existing bytes? How would we overwrite a block with a failed checksum? I guess that block never gets added in the first place?

As for the example Gist, it makes sense to me. Should we also define that its named parts are to be removed upon a successful Merge: final;?


Acconut avatar Acconut commented on May 22, 2024

I meant that if you don't work with fixed partitioning, you could repartition so that threads/workers could help out with remaining bytes of the slowest blocks. But let's not go into this as we both agree the downsides of that outweigh any upside.

I agree, this case is too specific.

To verify I understand, you're saying we can remove Offset as any PATCH will just be appended to the existing bytes? How would we overwrite a block with a failed checksum? I guess that block never gets added in the first place?

Blocks with failed checksum validation never get written and so the offset of the file is never changed. As I said the only function left to the Offset header is verifying that the client uploads the data from the correct offset.
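As a sketch of that remaining role, a hypothetical server-side check (`handle_patch` is an illustrative name, not from any tus server) could compare the claimed offset against its own and reject a mismatch instead of seeking:

```python
# The Offset header's only remaining job: verify that the client is
# appending from the offset the server actually has.
def handle_patch(stored: bytes, claimed_offset: int, chunk: bytes) -> bytes:
    if claimed_offset != len(stored):
        # Client and server disagree (e.g. after a dropped connection):
        # reject instead of writing at an arbitrary position.
        raise ValueError("409 Conflict: offset mismatch")
    return stored + chunk  # append only; a rejected chunk never advances the offset

print(handle_patch(b"abc", 3, b"def"))  # b'abcdef'
```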

As for the example Gist, it makes sense to me. Should we also define that its named parts are to be removed upon a successful Merge: final;?

In some cases you want to do so and in others you don't (e.g. when reusing a partial upload to merge with another). What about keeping this application-specific?


kvz avatar kvz commented on May 22, 2024

Agreed, Marius. Unless other contributors have objections, I'd say this can be formalized into the protocol.


vayam avatar vayam commented on May 22, 2024

@Acconut that is awesome you are tackling the parallel upload problem.
@kvz some nice feedback there.

While the Merge approach looks good, I have two concerns:

  • The Merge: final; header getting bloated for large files.
    How about posting it as a JSON array instead?
  • The current proposal forces the client to keep track of how many chunks it has uploaded. We should provide a
    way for the client to query the list of chunks uploaded so far. I am not able to think of a good way for the server to list the chunks with the current approach.


Acconut avatar Acconut commented on May 22, 2024

@vayam Thanks for the feedback, I appreciate it a lot! 👍

The Merge: final; header getting bloated for large files. How about posting it as a JSON array instead?

This is a concern we have discussed internally, too. For this reason we have allowed removing the protocol and hostname from the URLs (see https://github.com/tus/tus-resumable-upload-protocol/blob/merge/protocol.md#merge-1):

The host and protocol scheme of the URLs MAY be omitted. In this case the value of the Host header MUST be used as the host and the scheme of the current request.

This enables you to use /files/24e533e02ec3bc40c387f1a0e460e216 instead of http://master.tus.io/files/24e533e02ec3bc40c387f1a0e460e216 to save some bytes.

Assuming a maximum of 4KB of total headers (default for nginx, see http://stackoverflow.com/a/8623061), the default headers (Host, Accept-, User-Agent, Cache-) take about 300 bytes, leaving enough space to fit about 90 URLs. For my use cases this would be enough since I can hardly imagine a case where you upload 90 chunks in parallel. Maybe you can throw in some experience there?
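The arithmetic behind that estimate can be checked directly (assuming the 4 KB budget and ~300 bytes of default headers quoted above, and a host-less URL like the one in the example plus one separator byte):

```python
# Rough header-budget estimate: how many host-less upload URLs fit
# into a 4 KB header limit after ~300 bytes of default headers.
budget = 4096 - 300                                            # bytes left for the Merge value
url_len = len("/files/24e533e02ec3bc40c387f1a0e460e216") + 1   # 39-byte URL + 1 separator
print(budget // url_len)  # → 94, i.e. roughly 90 URLs
```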

I would like to leave the body untouched by tus (against my older opinion). I had the idea of allowing merging of final uploads: the uploads A and B are merged into AB. C and D are merged into CD. In the end AB and CD are merged into ABCD. It may require a bit more coordination on the server side, but what do you think?
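Reduced to byte concatenation, the two-level merge described above would behave like this sketch (the contents are made-up placeholders):

```python
# Hierarchical merging: partial uploads are merged pairwise into
# intermediate finals, which are then merged again.
A, B, C, D = b"1st ", b"2nd ", b"3rd ", b"4th"
AB, CD = A + B, C + D   # two intermediate merged uploads
ABCD = AB + CD          # the complete file
print(ABCD)  # b'1st 2nd 3rd 4th'
```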

The current proposal forces the client to keep track of how many chunks it has uploaded. We should provide a
way for the client to query the list of chunks uploaded so far. I am not able to think of a good way for the server to list the chunks with the current approach.

What about including the Merge header in the HEAD response?


vayam avatar vayam commented on May 22, 2024

Assuming a maximum of 4KB of total headers (default for nginx, see http://stackoverflow.com/a/8623061), the default headers (Host, Accept-, User-Agent, Cache-) take about 300 bytes, leaving enough space to fit about 90 URLs. For my use cases this would be enough since I can hardly imagine a case where you upload 90 chunks in parallel. Maybe you can throw in some experience there?

You are right. I don't expect more than 10-15 parallel uploads. Say you are uploading 10 GB as 1024 chunks of 10 MB. The client uploads chunks 1-10 in parallel. Once it finishes those, it merges them into, say, B, then does 10 more parallel uploads, merges the result with B, and so on, right?

What about including the Merge header in the HEAD response?

That works! But how would the server tie the part uploads to the main upload? I don't understand. Can you explain?

I came up with a slightly modified version:
https://gist.github.com/vayam/774990b5919cc863dee7
Alert: I am thinking aloud here.
Let me know what you guys think.


Acconut avatar Acconut commented on May 22, 2024

Say you are uploading 10 GB as 1024 chunks of 10 MB. The client uploads chunks 1-10 in parallel. Once it finishes those, it merges them into, say, B, then does 10 more parallel uploads, merges the result with B, and so on, right?

Currently this is not allowed by the specification. You can only merge partial uploads and not final uploads which consist of merged partial uploads. But we may allow this in the future if it is necessary. Your example is basically the same as mine:

The uploads A and B are merged into AB. C and D are merged into CD. In the end AB and CD are merged into ABCD. It may require a bit more coordination on the server side, but what do you think?

Let me know if you want to see this in the protocol.

But how would the server tie the part uploads to main upload? I don't understand. Can you explain?

Ok, I think I paid too little attention to this point: basically, merging uploads is a simple concatenation. Assume the following three uploads:

| ID | Content |
|----|---------|
| 1  | abc     |
| 2  | def     |
| 3  | ghi     |

If you then merge the uploads 1, 2 and 3 in this order, the final upload will have a length of 9 bytes (3 * 3) and its content will be abcdefghi.
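In code, that concatenation rule is nothing more than joining the parts in the requested order:

```python
# Merging uploads 1, 2 and 3 in this order: their contents are simply
# concatenated, giving 9 bytes in total.
parts = {1: b"abc", 2: b"def", 3: b"ghi"}
final = b"".join(parts[i] for i in (1, 2, 3))
assert len(final) == 9 and final == b"abcdefghi"
```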

I came up with a slightly modified version:
https://gist.github.com/vayam/774990b5919cc863dee7
Alert: I am thinking aloud here.
Let me know what you guys think.

I am not able to follow your thought. Could you please explain it?


kvz avatar kvz commented on May 22, 2024

We 'decided' against allowing concatenation of finals to keep focus and to reduce the surface for bugs, but we're willing to be persuaded otherwise. Are there compelling use cases we can think of?

Also, would Concatenate be a better word than Merge?


vayam avatar vayam commented on May 22, 2024

@Acconut I reread the Merge proposal. Say I am uploading a 10 GB file with 10 MB chunks. I will have about 1024 partial files to merge. Wouldn't that make the header bloated? To keep the Merge: final header small, I have to make sure I upload larger chunks.

My second question is about the server listing the partial files. Say my browser crashed while uploading and I start over. If I don't store what I have uploaded so far in local storage, there is no way I can ask the server to send me the currently transferred partials corresponding to my file. That is what I meant by tracking.

We 'decided' against allowing concatenation of finals to keep focus and to reduce the surface for bugs, but we're willing to be persuaded otherwise. Are there compelling use cases we can think of? Also, would Concatenate be a better word than Merge?

@kvz concatenate or concat works too. I am also not very keen on concatenating finals.

@Acconut @kvz Sorry, my gist was unclear. I will try to explain better.

POST /files HTTP/1.1
Merge: partial
Entity-Length: 10737418240

HTTP/1.1 204 No Content
Location: http://tus.example.org/files/a/

The server creates a directory instead of a file. A directory can be a more logical unit if you are in turn persisting into cloud storage, say.

The client puts parts of the file using PUT /files/id/part1 .. /files/id/partN

PUT /files/a/part1 HTTP/1.1
Merge: partial
Entity-Length: 10485760

HTTP/1.1 204 No Content
PUT /files/a/part2 HTTP/1.1
Merge: partial
Entity-Length: 10485760

HTTP/1.1 204 No Content

Client lists files (needs more discussion)

HEAD /files/a/ HTTP/1.1
MERGE:final

HTTP/1.1 200 Ok
MERGE:final;part1-part1024

The next step is to create the final upload. In the following request no
Entity-Length header is present.

POST /files/a/ HTTP/1.1
Merge: final

HTTP/1.1 204 No Content
MERGE:final;part1-part1024
Location: http://tus.example.org/files/b

The goals I am trying to address:

Allow simultaneous stateless partial uploads.
Allow a large number of partial uploads.
Keep as little state as possible on the client. The client can request the partials for a specific whole entity at any time.


Acconut avatar Acconut commented on May 22, 2024

Second question about listing the partial files by server. Say my browser crashed while uploading. I start over. If I don't store what I have uploaded so far in local storage there is no way I can ask the server to send me the currently transferred partials corresponding to my file. That is what I mean't by tracking.

A client is able to get the offset of every partial upload using a HEAD request. Of course, the client MUST store the URLs of these uploads in order to resume them. If the browser/environment crashes and this information is lost, we, as the server, cannot do anything to prevent this. Your solution has the same "problem": if the client loses the URL of the created directory (/files/a/, in this case), it is not able to resume.

Speaking about your proposal I have a question:

Must the client send the Entity-Length when creating a new directory? This may collide with the idea of streaming uploads, where the length is not known at the beginning. The same applies to creating new parts (/files/id/part1).

A general problem with your approach is that it requires the server to define the URLs, which is against a principle written in the 1.0 branch:

This specification does not describe the structure of URLs, as that is left for the specific implementation to decide. All URLs shown in this document are meant for example purposes only.

While I wouldn't defend this rule to the death, I want to question breaking it as long as it is not absolutely necessary.


Acconut avatar Acconut commented on May 22, 2024

@vayam Another problem I see with your solution is that the client needs to create the parts in their final order. You are not able to change this afterwards and have to know how to partition the file before uploading it. Using the merge/concat approach you can indeed rearrange parts. This is especially important if you deal with non-contiguous chunks.


vayam avatar vayam commented on May 22, 2024

@Acconut I agree with you. part1 naming was an example. You are right the protocol shouldn't dictate.

Must the client send the Entity-Length when creating a new directory? This may collide with the idea of streaming uploads where the length is not known at the beginning.

You are right. Entity-Length is not needed.

The same applies to creating new parts (/files/id/part1).

Agreed

Using the merge/concat approach you can indeed rearrange parts. This is especially important if you deal with non-contiguous chunks.

Good point. What if we did

Client lists files (needs more discussion)

HEAD /files/a/ HTTP/1.1
MERGE:final

HTTP/1.1 200 Ok
MERGE:final;part1,part2....part1024

Final Merge

POST /files/a/ HTTP/1.1
MERGE:final;part1,part2,part1024

HTTP/1.1 204 No Content
Location: http://tus.example.org/files/b

I was thinking more on the lines of http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuAndPermissions.html

I am convinced your approach would be no issue for most use cases. Let us go with that.


Acconut avatar Acconut commented on May 22, 2024

I will add a section to the merge extension to return the Merge header on HEAD requests. In addition I will clarify how files are merged/concatenated.

@vayam Just to be sure: are you okay with going with the current proposal in #56 (plus the changes in the above sentence)? If so, I will merge the PR and release the 1.0 prerelease.


vayam avatar vayam commented on May 22, 2024

👍 sure go ahead and merge!


Acconut avatar Acconut commented on May 22, 2024

@vayam See 5e0ccb3.


Acconut avatar Acconut commented on May 22, 2024

Merged in #56.

