Comments (14)

martindurant commented on August 11, 2024

Yes, we are calling the server's copy object command, not downloading and rewriting the data - that would be very expensive!

from s3fs.

martindurant commented on August 11, 2024

Although named after the POSIX mv command, for S3 this is actually copy-and-delete. How this is implemented internally within S3, I don't know, but it does not in general promise immediate consistency, so I am not totally surprised that either presto is reading the file before it is all available, or possibly that mv is copying a file which is not yet available. I am open to suggestions, but I can't suggest more than to add sleep statements into your workflow.
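
To make the copy-and-delete point concrete, here is a minimal sketch of what a "move" amounts to on an object store. It assumes `fs` is an fsspec-style filesystem such as `s3fs.S3FileSystem`, exposing `copy` and `rm`; the function name and the `FakeFS` stand-in are hypothetical and exist only so the example runs without S3 credentials. It is not the actual s3fs implementation.

```python
def move_by_copy_and_delete(fs, src, dst):
    """Emulate ``mv`` on an object store: copy first, then delete the source."""
    fs.copy(src, dst)  # on S3 this is a server-side copy; no data is downloaded
    fs.rm(src)         # the source is removed only after the copy call returns


class FakeFS:
    """Hypothetical in-memory stand-in for an fsspec-style filesystem."""

    def __init__(self):
        self.store = {}

    def copy(self, src, dst):
        self.store[dst] = self.store[src]

    def rm(self, path):
        del self.store[path]


fs = FakeFS()
fs.store["bucket/tmp/part.parquet"] = b"data"
move_by_copy_and_delete(fs, "bucket/tmp/part.parquet", "bucket/final/part.parquet")
print(sorted(fs.store))
```

Because the two steps are separate requests, a reader can in principle observe the state between them, which is one reason a "move" on S3 is not atomic in the POSIX sense.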

As a side note, what does presto offer to you that you are not able to accomplish in python, since the data already passes through the memory of your machine(s) as pandas dataframes ?

fsck-mount commented on August 11, 2024

@martindurant thanks for the suggestion, but I couldn't understand how sleep would be used here.

Coming to the presto part: we have clickstream data flowing into our system. We write it to JSON files, convert them to parquet, and store them in s3. We use presto heavily to run multiple queries on previous data along with the currently flowing data.

When we query existing data, there is no problem. But when we query the current hour's data, it occasionally causes a problem, though rarely. Predicting this is almost impossible.

martindurant commented on August 11, 2024

Just in case calling mv too soon after writing has an effect, I would put a time.sleep() between the two calls. The probability of this fixing things isn't enormous, but it's worth a go as the simplest thing.
Are you certain that the files are indeed complete and valid? We have seen with GCS cases where some files were truncated at times of heavy usage.
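
A minimal sketch of the suggested pause between writing and moving, assuming `fs` is an fsspec-style filesystem such as `s3fs.S3FileSystem` with `open` and `mv`. The function name, paths, and the default `pause_seconds` value are illustrative assumptions, not recommended settings.

```python
import time


def write_then_move(fs, data, tmp_path, final_path, pause_seconds=1.0):
    """Write to a temporary key, wait briefly, then move it to the final key."""
    # Leaving the ``with`` block closes the file, which completes the upload.
    with fs.open(tmp_path, "wb") as f:
        f.write(data)
    # Give the store a moment before issuing the copy-and-delete.
    time.sleep(pause_seconds)
    fs.mv(tmp_path, final_path)
```

As noted above, the pause is a cheap experiment rather than a guarantee; if the failures persist with it in place, the cause is probably elsewhere.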

fsck-mount commented on August 11, 2024

Maybe, I guess so. I'll try adding a sleep and check.

As per my understanding and a simple test, the files are complete. As soon as presto throws the error, I checked with fastparquet, which is able to read the file. So I guess presto is failing exactly at the time the file is being copied.

martindurant commented on August 11, 2024

How big are the files? The REST documentation suggests behaviour may be different above 5GB than below: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html

fsck-mount commented on August 11, 2024

No, the files are only around 500 MB (JSON file). To be precise, a file never exceeds 490 MB on disk as JSON, and about 50 MB in s3 after conversion to parquet.

fsck-mount commented on August 11, 2024

@martindurant
I believe we are doing copy_object only when we are using s3fs mv. For reference: presto-groups

fsck-mount commented on August 11, 2024

@martindurant
I'm a little bit confused about block_size.

It looks like all our file uploads are multi-part uploads. If that is the case, the error seems quite natural, since at some point in time there is a chance for partial data to be present.

Correct me if I'm wrong.

martindurant commented on August 11, 2024

Correct that they are multi-part uploads, but the final key should not be created until the multi-part-upload is finalised.

fsck-mount commented on August 11, 2024

Yes, you are right. I just ran a simple test: the file is not created until the upload is done. Not sure what is causing this issue.
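
A test along those lines might look like the sketch below. It assumes `fs` is an fsspec-style filesystem such as `s3fs.S3FileSystem` with `open` and `exists`, and that `path` did not exist beforehand; the function name is hypothetical, and this is not necessarily the test that was run here.

```python
def key_visible_before_and_after_close(fs, path, chunks):
    """Report whether ``path`` exists before and after the upload is closed."""
    f = fs.open(path, "wb")
    for chunk in chunks:
        f.write(chunk)                  # parts may be uploaded as the buffer fills
    visible_before = fs.exists(path)    # checked while the upload is still open
    f.close()                           # closing completes the upload
    visible_after = fs.exists(path)
    return visible_before, visible_after
```

If the key only appears once the upload completes, the expected result against S3 is `(False, True)`. Cached listings could affect what `exists` reports, so treat a surprising result with some caution.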

fsck-mount commented on August 11, 2024

@martindurant what do you mean by the upload being finalised?

If you remember, I raised an issue in fastparquet. I am just trying to understand what finalised means, and how an invalid parquet file (s3 key) is created when the process is killed (signal 9) due to a memory issue. I am trying to figure out whether there is any relation between this invalid parquet file, final key creation, and the current issue I have.

fsck-mount commented on August 11, 2024

@martindurant

Thanks for your time and patience. I'm overwriting the existing files in s3 when running a one-minute cron job. This overwriting is causing the issue when reading from presto. I think we can close this issue.
Would you like to share your thoughts on avoiding s3 consistency problems when overwriting?

martindurant commented on August 11, 2024

Here is the reference: http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

Amazon claims never to give partial or corrupted data: you either get the old version or the new. That could be enough to break presto, if the versions are not compatible. Another failure mode would be one action to get the file-list, then the next to download data, but the file size of the new file is different from the old one.
If you write your own code, you can check the generation of a key to make sure it hasn't changed, or download a specific generation (old data may still be available), or be sure to match each file of a batch by time-stamp.
I cannot, however, give any advice on how you might implement any of this for presto, sorry.
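
For code you write yourself, the "check that the key hasn't changed" idea could be sketched as below. It assumes `fs` is an fsspec-style filesystem whose `info()` returns a dict with an `ETag` entry, as s3fs typically does for S3 objects; the function name and the retry count are hypothetical, and cached metadata may need to be bypassed for the comparison to be meaningful.

```python
def read_if_unchanged(fs, path, retries=3):
    """Read ``path`` and accept the bytes only if its ETag did not change."""
    for _ in range(retries):
        etag_before = fs.info(path).get("ETag")
        with fs.open(path, "rb") as f:
            data = f.read()
        etag_after = fs.info(path).get("ETag")
        if etag_before == etag_after:
            return data  # the object was not overwritten during the read
    raise RuntimeError(f"{path} kept changing across {retries} read attempts")
```

This only detects an overwrite that lands during the read; it does not make presto, or any other reader that lists first and downloads later, see a consistent batch.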
