Comments (19)
I think the second solution is better, too.
The first one is not only hard to implement but also brittle, as it requires some form of synchronization between the parallel uploaders in order to guarantee they don't write to the same ranges.
I wonder if a combination of the solutions makes sense:
- Upon file creation, the client specifies how many parallel blocks it wants to work on.
- The server returns a range of absolute offsets.
- The client starts writing into these blocks in parallel, but each 'worker' owns a block.

The upside of partitioning early on is a lower chance of conflicts and an easier implementation. The downside is that the upload is as slow as the slowest block.
A different thought I had: what if we leave the parallel logic out of the uploading itself, but offer a merge request that lets the server know that several completed uploaded files (which are in reality blocks) should be concatenated?
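To make the fixed-partitioning variant concrete, here is a minimal client-side sketch. It is an assumption-laden illustration: `upload_block` is a stub and the file name is a placeholder, since this thread never defines how a worker would actually transmit its range.

```python
# Hypothetical sketch of fixed partitioning: each worker owns one block.
from concurrent.futures import ThreadPoolExecutor
import os

def partition(size: int, blocks: int):
    """Split [0, size) into contiguous, non-overlapping ranges."""
    step = -(-size // blocks)  # ceiling division
    return [(start, min(start + step, size)) for start in range(0, size, step)]

def upload_block(path: str, start: int, end: int) -> None:
    """Stub worker: read the owned range and send it to the server.
    The actual request is left out because the thread does not define it."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
        # ... transmit `data` for the range [start, end) here ...

if __name__ == "__main__":
    path = "bigfile.bin"  # placeholder file name
    ranges = partition(os.path.getsize(path), blocks=4)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for start, end in ranges:
            pool.submit(upload_block, path, start, end)
```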
---
> The downside is that the upload is as slow as the slowest block.
This is a general "problem" with parallel uploads which occurs for every solution: the file isn't fully uploaded until the last byte is received.
> A different thought I had: what if we leave the parallel logic out of the uploading itself, but offer a merge request that lets the server know that several completed uploaded files (which are in reality blocks) should be concatenated?
That's a brilliant idea! It keeps the parallel extension spec as small and focused as possible, while parallel uploads can still take advantage of all the other features.
I could think of the following upload flow (a rough client-side sketch follows the list):

1. While creating the different blocks (using the file creation extension) you have to tell the server that this upload is just part of one bigger file, so it doesn't start processing once a single upload is done.
2. After the client has created (though not necessarily uploaded) all blocks, it must tell the server which uploads (identified by their file URLs) are to be concatenated, and in which order.
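A rough sketch of that two-step flow, assuming Python with the `requests` library. The header names (`Merge: partial`, `Merge: final`) follow the draft gist linked later in this thread; the exact value syntax for listing the partial upload URLs (semicolon-prefixed, space-separated here) is an assumption, not settled spec.

```python
import requests  # third-party HTTP client, used here for illustration

BASE = "http://tus.example.org/files"  # hypothetical tus endpoint

def create_partial(length: int) -> str:
    # Step 1: create a block and mark it as part of a bigger file,
    # so the server does not process it on its own once it completes.
    resp = requests.post(BASE, headers={
        "Merge": "partial",
        "Entity-Length": str(length),
    })
    return resp.headers["Location"]

def upload_block(url: str, data: bytes) -> None:
    # Upload the block's bytes; a fresh block starts at offset 0.
    requests.patch(url, data=data, headers={"Offset": "0"})

blocks = [b"hello ", b"world"]
urls = [create_partial(len(b)) for b in blocks]
for url, block in zip(urls, blocks):
    upload_block(url, block)

# Step 2: tell the server which uploads to concatenate, in which order.
# The exact header value format is an assumption based on this thread.
requests.post(BASE, headers={"Merge": "final;" + " ".join(urls)})
```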
---
@kvz After thinking a bit more about the merge approach: it offers a way to drop the `Offset` header entirely. Its value in a `PATCH` request must not be smaller than the current offset of the resource (see #51). Furthermore, non-contiguous uploads should be implemented by creating multiple uploads and merging them (at any time). This means that no `Offset` bigger than the current offset must be allowed. So the only allowed value of this request header is the one returned in the `HEAD` request.

In the end we should think about removing the `Offset` header altogether, since its main functionality, indicating the offset to write at, is not necessary any more. The only reason to keep it is to verify that the client is aware of which offset it uses. In some cases (e.g. connection drops or internal errors) the offset on the server may be different from the one the client expected.
In addition I had some initial thoughts on how to implement this feature. Here's my first draft using a new `Merge` header in the `POST` request: https://gist.github.com/Acconut/65228c8486e284d49e10
---
> This is a general "problem" with parallel uploads which occurs for every solution: the file isn't fully uploaded until the last byte is received.
I meant that if you don't work with fixed partitioning, you could repartition so that threads/workers could help out with the remaining bytes of the slowest blocks. But let's not go into this as we both agree the downsides of that outweigh any upside.
As for merging 'regular' tus uploads, I'm glad you like it :)
To verify I understand: you're saying we can remove `Offset` as any `PATCH` will just be appended to the existing bytes? How would we overwrite a block with a failed checksum? I guess that block never gets added in the first place?
As for the example Gist, it makes sense to me. Should we also define that its named parts are to be removed upon a successful `Merge: final;`?
---
> I meant that if you don't work with fixed partitioning, you could repartition so that threads/workers could help out with the remaining bytes of the slowest blocks. But let's not go into this as we both agree the downsides of that outweigh any upside.
I agree, this case is too specific.
> To verify I understand: you're saying we can remove `Offset` as any `PATCH` will just be appended to the existing bytes? How would we overwrite a block with a failed checksum? I guess that block never gets added in the first place?
Blocks with failed checksum validation never get written and so the offset of the file is never changed. As I said, the only function left to the `Offset` header is verifying that the client uploads the data from the correct offset.
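Purely as an illustration of that invariant (the checksum algorithm here, SHA-1, is an arbitrary choice, not something this thread specifies):

```python
import hashlib

def append_block(file_buf: bytearray, data: bytes, expected_sha1: str) -> bool:
    """Append `data` only if its checksum matches; otherwise nothing is
    written and the file's offset stays unchanged."""
    if hashlib.sha1(data).hexdigest() != expected_sha1:
        return False  # block rejected, offset (len(file_buf)) unchanged
    file_buf.extend(data)
    return True

buf = bytearray()
ok = append_block(buf, b"abc", hashlib.sha1(b"abc").hexdigest())
assert ok and len(buf) == 3  # offset advances only after a valid block
```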
> As for the example Gist, it makes sense to me. Should we also define that its named parts are to be removed upon a successful `Merge: final;`?
In some cases you want to and in some you don't (e.g. reusing a partial upload to merge with another file). What about keeping this application-specific?
---
Agreed, Marius. Unless other contributors have objections, I'd say this can be formalized into the protocol.
---
@Acconut that is awesome you are tackling the parallel upload problem.
@kvz some nice feedback there.
While the `Merge` approach looks good, I have 2 concerns:

- The `Merge: final;` header getting bloated for large files. How about posting it as a JSON array instead?
- The current proposal forces the client to keep track of how many chunks it has uploaded. We should provide a way for the client to query the list of chunks uploaded so far. I am not able to think of a good way for the server to list the chunks with the current approach.
---
@vayam Thanks for the feedback, I appreciate it a lot! 👍
> The `Merge: final;` header getting bloated for large files. How about posting it as a JSON array instead?
This is a concern we have discussed internally, too. For this reason we have allowed removing the protocol and hostname from the URLs (see https://github.com/tus/tus-resumable-upload-protocol/blob/merge/protocol.md#merge-1):
> The host and protocol scheme of the URLs MAY be omitted. In this case the value of the Host header MUST be used as the host and the scheme of the current request.
This enables you to use `/files/24e533e02ec3bc40c387f1a0e460e216` instead of `http://master.tus.io/files/24e533e02ec3bc40c387f1a0e460e216` to save some bytes.
Assuming a maximum of 4KB of total headers (the default for nginx, see http://stackoverflow.com/a/8623061), the default headers (Host, Accept-*, User-Agent, Cache-*) take about 300 bytes, leaving enough space to fit about 90 URLs. For my use cases this would be enough, since I can hardly imagine a case where you upload 90 chunks in parallel. Maybe you can throw in some experience there?
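A quick back-of-the-envelope check of that estimate, using the numbers from this comment and the sample URL quoted above:

```python
# Rough check of the "about 90 URLs" estimate.
max_header_bytes = 4096   # nginx default total header budget
default_headers = 300     # Host, Accept-*, User-Agent, Cache-*, ...
url = "/files/24e533e02ec3bc40c387f1a0e460e216"  # scheme/host omitted
per_url = len(url) + 1    # +1 byte for a separator between URLs

print((max_header_bytes - default_headers) // per_url)  # -> 94, roughly 90
```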
I would like to leave the body untouched by tus (against my older opinion). I had the idea of allowing merging of final uploads: the uploads A and B are merged into AB. C and D are merged into CD. In the end AB and CD are merged into ABCD. It may require a bit more coordination on the server side, but what do you think?
> The current proposal forces the client to keep track of how many chunks it has uploaded. We should provide a way for the client to query the list of chunks uploaded so far. I am not able to think of a good way for the server to list the chunks with the current approach.
What about including the `Merge` header in the `HEAD` response?
---
> Assuming a maximum of 4KB of total headers (the default for nginx, see http://stackoverflow.com/a/8623061), the default headers (Host, Accept-*, User-Agent, Cache-*) take about 300 bytes, leaving enough space to fit about 90 URLs. For my use cases this would be enough, since I can hardly imagine a case where you upload 90 chunks in parallel. Maybe you can throw in some experience there?
You are right. I don't expect more than 10-15 parallel uploads. Say you are uploading 10GB in 1024 10MB chunks. The client uploads chunks 1-10 in parallel. Once it finishes those it merges them into, say, B. Then it does 10 more parallel uploads and merges them with the first merged file B, and so on, right?
> What about including the `Merge` header in the `HEAD` response?
That works! But how would the server tie the part uploads to the main upload? I don't understand. Can you explain?
I came up with a slightly modified version:
https://gist.github.com/vayam/774990b5919cc863dee7
Alert: I am thinking aloud here.
Let me know what you guys think.
---
> Say you are uploading 10GB in 1024 10MB chunks. The client uploads chunks 1-10 in parallel. Once it finishes those it merges them into, say, B. Then it does 10 more parallel uploads and merges them with the first merged file B, and so on, right?
Currently this is not allowed by the specification. You can only merge partial uploads, not final uploads which themselves consist of merged partial uploads. But we may allow this in the future if it is necessary. Your example is basically the same as mine:
> The uploads A and B are merged into AB. C and D are merged into CD. In the end AB and CD are merged into ABCD. It may require a bit more coordination on the server side, but what do you think?
Let me know if you want to see this in the protocol.
> But how would the server tie the part uploads to the main upload? I don't understand. Can you explain?
Ok, I think I paid this point too little attention. Basically, merging uploads is a simple concatenation. Assume the following three uploads:
| ID | Content |
|----|---------|
| 1  | abc     |
| 2  | def     |
| 3  | ghi     |
If you then merge the uploads 1, 2 and 3 in this order, the final upload will have a length of 9 bytes (3 * 3) and its content will be `abcdefghi`.
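As a trivial illustration of those semantics:

```python
# Merging is plain concatenation in the requested order.
parts = {1: b"abc", 2: b"def", 3: b"ghi"}
final = b"".join(parts[i] for i in (1, 2, 3))
assert final == b"abcdefghi"
assert len(final) == 9  # 3 uploads of 3 bytes each
```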
> I came up with a slightly modified version: https://gist.github.com/vayam/774990b5919cc863dee7
> Alert: I am thinking aloud here. Let me know what you guys think.
I am not able to follow your thought. Could you please explain it?
---
We 'decided' against allowing the concatenation of finals for focus, and to reduce the surface for bugs to appear, but I am willing to be persuaded otherwise. Are there compelling use cases we can think of?
Also, would `Concatenate` be a better word than `Merge`?
---
@Acconut I reread the `Merge` proposal. Say I am uploading a 10GB file with 10MB chunks. I will have about 1024 partial files to merge. Wouldn't that make the header bloated? To keep the `Merge: final` header small I have to make sure I upload larger file chunks.

Second question, about the listing of partial files by the server. Say my browser crashed while uploading and I start over. If I don't store what I have uploaded so far in local storage, there is no way I can ask the server to send me the currently transferred partials corresponding to my file. That is what I meant by tracking.
> We 'decided' against allowing the concatenation of finals for focus, and to reduce the surface for bugs to appear, but I am willing to be persuaded otherwise. Are there compelling use cases we can think of? Also, would Concatenate be a better word than Merge?
@kvz `concatenate` or `concat` works too. I am also not very keen on concatenating finals.
@Acconut @kvz sorry, my gist was unclear. I will try to explain better.
```
POST /files HTTP/1.1
Merge: partial
Entity-Length: 10737418240

HTTP/1.1 204 No Content
Location: http://tus.example.org/files/a/
```
The server creates a directory instead of a file. A directory can be the more logical thing if you are in turn persisting into cloud storage, say.
The client puts parts of the file using `PUT /files/id/part1` .. `PUT /files/id/partN`:
```
PUT /files/a/part1 HTTP/1.1
Merge: partial
Entity-Length: 10485760

HTTP/1.1 204 No Content
```

```
PUT /files/a/part2 HTTP/1.1
Merge: partial
Entity-Length: 10485760

HTTP/1.1 204 No Content
```
The client lists the files (needs more discussion):

```
HEAD /files/a/ HTTP/1.1
Merge: final

HTTP/1.1 200 OK
Merge: final;part1-part1024
```
The next step is to create the final upload. In the following request no `Entity-Length` header is present.

```
POST /files/a/ HTTP/1.1
Merge: final

HTTP/1.1 204 No Content
Merge: final;part1-part1024
Location: http://tus.example.org/files/b
```
The goals I am trying to address:

- Allow simultaneous stateless partial uploads.
- Allow a large number of partial uploads.
- The client has as little state as possible and can request the partials for a specific whole entity at any time.
---
> Second question, about the listing of partial files by the server. Say my browser crashed while uploading and I start over. If I don't store what I have uploaded so far in local storage, there is no way I can ask the server to send me the currently transferred partials corresponding to my file. That is what I meant by tracking.
A client is able to get the offset of every partial upload using a `HEAD` request. Of course, the client MUST store the URLs of these uploads in order to resume them. If the browser/environment crashes and this information is lost, we, as the server, cannot do anything to prevent this. Your solution has the same "problem": if the client loses the URL of the created directory (/files/a/, in this case) it is not able to resume.
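For illustration, resuming from stored URLs could look roughly like this (Python with `requests`; the URLs are hypothetical, and the `Offset` header returned by `HEAD` is the one discussed earlier in this thread):

```python
import requests  # illustrative only; URLs below are hypothetical

def current_offset(upload_url: str) -> int:
    """Ask the server how many bytes of this partial upload it already has."""
    resp = requests.head(upload_url)
    resp.raise_for_status()
    return int(resp.headers["Offset"])

# The client must have persisted these URLs itself (e.g. in localStorage);
# if they are lost, the server cannot recover them for us.
stored_urls = [
    "http://tus.example.org/files/24e533e02ec3bc40c387f1a0e460e216",
    "http://tus.example.org/files/bd41d8cd98f00b204e9800998ecf8427",
]
for url in stored_urls:
    print(url, "resume from byte", current_offset(url))
```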
Speaking about your proposal, I have a question: must the client send the Entity-Length when creating a new directory? This may collide with the idea of streaming uploads where the length is not known at the beginning. The same applies to creating new parts (/files/id/part1).
A general problem with your approach is that it requires the server to define the URLs, which is against a principle written in the 1.0 branch:

> This specification does not describe the structure of URLs, as that is left for the specific implementation to decide. All URLs shown in this document are meant for example purposes only.
While I won't stick to this rule until death, I want to question breaking it as long as it is not absolutely necessary.
---
@vayam Another problem I see with your solution is that the client needs to create the parts chronologically, in their final order. You are not able to change them afterwards and have to be aware of how to partition the file before uploading it. Using the merge/concat approach you indeed have the possibility to throw stuff around. This is especially important if you deal with non-contiguous chunks.
---
@Acconut I agree with you. The `part1` naming was just an example. You are right, the protocol shouldn't dictate it.
> Must the client send the Entity-Length when creating a new directory? This may collide with the idea of streaming uploads where the length is not known at the beginning.
You are right. `Entity-Length` is not needed.
> The same applies to creating new parts (/files/id/part1).
Agreed
> Using the merge/concat approach you indeed have the possibility to throw stuff around. This is especially important if you deal with non-contiguous chunks.
Good point. What if we did the following?
The client lists the files (needs more discussion):

```
HEAD /files/a/ HTTP/1.1
Merge: final

HTTP/1.1 200 OK
Merge: final;part1,part2,...,part1024
```
Final merge:

```
POST /files/a/ HTTP/1.1
Merge: final;part1,part2,...,part1024

HTTP/1.1 204 No Content
Location: http://tus.example.org/files/b
```
I was thinking more along the lines of http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuAndPermissions.html.
I am convinced your approach would be no issue for most use cases. Let us go with that.
---
I will add a section to the merge extension to return the `Merge` header on `HEAD` requests. In addition I will clarify how files are merged/concatenated.
@vayam Just to be sure: are you OK to go with the current proposal in #56 (plus the changes in the above sentence)? If so I will merge the PR and release the 1.0 Prerelease.
---
👍 sure go ahead and merge!
---
Merged in #56.