Yes, we are calling the server's copy object command, not downloading and rewriting the data - that would be very expensive!
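As a sketch of what that server-side copy looks like at the API level (a hedged illustration, not s3fs's internal code; the function name and key names are hypothetical, and `s3` is assumed to be a boto3-style client):

```python
def server_side_copy(s3, bucket, src_key, dst_key):
    """Issue an S3 CopyObject request: the server duplicates the
    object internally, so no data is downloaded or re-uploaded
    by the client."""
    s3.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
```

With boto3 this would be called as `server_side_copy(boto3.client("s3"), ...)`; s3fs issues the same underlying CopyObject request.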
from s3fs.
Although named after the POSIX mv command, for S3 this is actually copy-and-delete. How this is implemented internally within S3 I don't know, but S3 does not in general promise immediate consistency, so I am not totally surprised that either presto is reading the file before it is fully available, or possibly that mv is copying a file which is not yet available. I am open to suggestions, but I can't suggest more than to add sleep statements into your workflow.
As a side note, what does presto offer you that you could not accomplish in python, since the data already passes through the memory of your machine(s) as pandas dataframes?
@martindurant thanks for the suggestion, but I don't quite understand how sleep would be used here.
Coming to the presto part: we have clickstream data flowing into our system. We write it to JSON files, convert them to parquet, and then store them in S3. So we use presto heavily to run multiple queries on previous data along with the currently flowing data.
When we query existing data, there is no problem. But if we query the current hour's data, it can cause problems, though it happens rarely. Predicting this is almost impossible.
Just in case calling mv too soon after writing has an effect, I would put a time.sleep() between the two calls. The probability of this fixing things isn't enormous, but it's worth a go as the simplest thing.
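A minimal sketch of that suggestion (the key names and the 2-second default are made up; `fs` is assumed to be an s3fs.S3FileSystem):

```python
import time

def write_then_move(fs, tmp_key, final_key, settle=2.0):
    """Pause between finishing the write and renaming, in case the
    copy is issued before the new key is fully available."""
    time.sleep(settle)           # give S3 a moment to settle
    fs.mv(tmp_key, final_key)    # for S3, mv is copy-and-delete
```

The delay value is a guess; it only helps if the problem really is the copy racing the just-finished write.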
Are you certain that the files are indeed complete and valid? We have seen with GCS cases where some files were truncated at times of heavy usage.
Maybe, I guess so. I'll try adding a sleep and check.
As per my understanding and a simple test, the files are complete. As soon as presto threw the error, I checked with fastparquet, which was able to read the file. So I guess presto is failing exactly at the time the file is being copied.
How big are the files? The REST documentation suggests behaviour may be different above 5GB than below: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html
No, the files are only around 500 MB (JSON). To be precise, they never exceed 490 MB on disk, or 50 MB in S3 after converting to Parquet.
@martindurant
I believe we are only calling copy_object when using s3fs mv. For reference: presto-groups
@martindurant
I'm a little confused about block_size. It looks like our file uploads are all multi-part uploads. If that is the case, the error seems quite natural, since at some point in time partial data could be present.
Correct me if I'm wrong.
Correct that they are multi-part uploads, but the final key should not be created until the multi-part-upload is finalised.
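The semantics described here can be modelled with a toy example (this is an illustration of the multi-part contract, not s3fs's or S3's actual implementation): parts are staged invisibly, and the key is published atomically only when the upload completes.

```python
class MultipartUpload:
    """Toy model of S3 multi-part upload semantics: staged parts
    are not visible; the key appears only on complete()."""
    def __init__(self, store, key):
        self.store, self.key, self.parts = store, key, []

    def upload_part(self, data):
        self.parts.append(data)  # buffered server-side; key not yet visible

    def complete(self):
        # finalising publishes the key atomically
        self.store[self.key] = b"".join(self.parts)

store = {}
up = MultipartUpload(store, "bucket/data.parquet")
up.upload_part(b"part-1,")
up.upload_part(b"part-2")
assert "bucket/data.parquet" not in store  # mid-upload: no key yet
up.complete()
assert store["bucket/data.parquet"] == b"part-1,part-2"
```

So a reader should never see a partially-written object; it either sees no key at all or the complete one.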
Yes, you are right. I just ran a simple test: the file is not created until the upload is done. Not sure what is causing this issue.
@martindurant what do you mean by "the upload is finalised"?
If you remember, I raised an issue in fastparquet. I am just trying to understand what "finalised" means, and how an invalid parquet file (S3 key) gets created when the process is killed (signal 9) due to a memory issue. I am trying to figure out whether there is any relation between that invalid parquet file, the final key creation, and the current issue I have.
Thanks for your time and patience. I'm overwriting the existing files in S3 when running a one-minute cron job. This overwriting is causing the issue when reading from presto. I think we can close this issue.
Would you like to share your inputs on avoiding S3 consistency issues when overwriting?
Here is the reference: http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
Amazon claims never to give partial or corrupted data: you either get the old version or the new. That alone could be enough to break presto, if the two versions are not compatible. Another failure mode would be one action fetching the file list and the next downloading the data, where the new file's size differs from the old one's.
If you write your own code, you can check the generation of a key to make sure it hasn't changed, or download a specific generation (old data may still be available), or be sure to match each file of a batch by time-stamp.
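One way to sketch the "check the generation of a key" idea in your own code (a hedged example assuming s3fs's `info()` exposes the object's ETag; the key name is hypothetical, and this says nothing about how presto reads):

```python
def read_if_unchanged(fs, key):
    """Read an object, then verify its ETag did not change mid-read;
    raise if the file was overwritten underneath us."""
    etag_before = fs.info(key)["ETag"]
    with fs.open(key, "rb") as f:
        data = f.read()
    if fs.info(key)["ETag"] != etag_before:
        raise RuntimeError(f"{key} was overwritten during the read")
    return data
```

The same pattern extends to a batch: record every file's ETag up front and re-check after all reads, retrying the batch if any changed.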
I cannot, however, give any advice on how you might implement any of this for presto, sorry.