Comments (5)

crwilcox commented on June 20, 2024

It might be interesting to log the http traffic and inspect what is going on.

Python

HTTP request logging for the Python GCS library can be turned on through the standard library's HTTP debug level. The following example sets http.client.HTTPConnection.debuglevel, which makes http.client print request and response headers to stdout:

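# Python 3 required
import http.client

from google.cloud import storage

# Any value above 0 makes http.client print request and response headers to stdout
http.client.HTTPConnection.debuglevel = 5

storage_client = storage.Client()

# List every object in the bucket; each HTTP call is echoed as it happens
blobs = storage_client.list_blobs("anima-frank")
for blob in blobs:
    print(blob.name)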

Ref: https://docs.python.org/3/library/logging.html
Ref: https://docs.python.org/2/library/logging.html
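The two mechanisms can also be combined. The following is a minimal sketch, assuming the storage client sends its requests through urllib3 (an assumption about the transport, not something stated in this thread); the helper name enable_http_debug is hypothetical.

```python
import http.client
import logging


def enable_http_debug(level=1):
    """Turn on wire-level HTTP output and DEBUG logging for urllib3.

    Hypothetical helper for illustration; not part of google-cloud-storage.
    """
    # http.client prints request and response headers when debuglevel > 0
    http.client.HTTPConnection.debuglevel = level

    # basicConfig attaches a stderr handler to the root logger
    # if no handler has been configured yet
    logging.basicConfig(level=logging.DEBUG)

    # Assumption: the HTTP transport in use logs under the "urllib3" logger
    urllib3_logger = logging.getLogger("urllib3")
    urllib3_logger.setLevel(logging.DEBUG)
    urllib3_logger.propagate = True
    return urllib3_logger
```

Calling enable_http_debug() before creating storage.Client() should cause connection events to be logged alongside the raw header output.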

GSUTIL

gsutil has the --debug flag, which enables HTTP request logs.

For example:

gsutil --debug ls gs://bucket-name

Ref: https://cloud.google.com/storage/docs/gsutil/addlhelp/TopLevelCommandLineOptions

from python-storage.

braussjoss commented on June 20, 2024

httplog_4workers_20files.txt
httplog.txt

"Per Chris's request I ran the test program with HTTP logging turned on. Here is the output for a run with 1 worker retrieving one file."


crwilcox commented on June 20, 2024

First off, I made some test data by running this locally and uploading to a directory in storage:

for n in {1..1000}; do
    dd if=/dev/urandom of=file$( printf %03d "$n" ).data bs=1 count=1024
done

I also made some small modifications to the code to make it a bit more flexible.

  • I set defaults in the file to make running in a debugger easier. If you set your own values, they will still be used.
  • The code didn't support objects without metadata. It now checks that metadata exists before trying to access it.
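The metadata check in the second bullet can be sketched as follows. This is a minimal illustration, not the code from the attached file: the helper name read_metadata is hypothetical, and it relies on Blob.metadata being None when an object has no custom metadata.

```python
def read_metadata(blob, key, default=None):
    """Return one custom-metadata value, tolerating objects that have none.

    Hypothetical helper; `blob` is expected to expose a `metadata`
    attribute that is either a dict or None.
    """
    # Fall back to an empty dict so .get() is always safe to call
    metadata = blob.metadata or {}
    return metadata.get(key, default)
```

With this guard, a call such as read_metadata(blob, "processed", "no") returns the default instead of raising when the object was uploaded without metadata.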

code.txt

log_crw_100f_10t.txt

crwilcox commented on June 20, 2024

After a bit of investigation: I tested from my network (Seattle, WA), running 8 workers on a machine with a quad-core Intel Core i7 (8 vCores). The bucket is multi-region US.

I timed three operations: retrieving metadata, downloading the 1 KB file, and setting metadata. Each takes roughly 0.15 to 0.25 seconds. If an operation takes longer than 0.25 seconds, I print a warning. The attached log has a single warning, from a metadata update that took 0.27 seconds.
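A per-operation timer with a warning threshold could look roughly like this. It is a sketch under assumptions, not the attached code: the names timed_call and SLOW_THRESHOLD_SECONDS and the logger name are hypothetical; only the 0.25 second threshold comes from the description above.

```python
import logging
import time

# Threshold taken from the comment above; the constant name is hypothetical
SLOW_THRESHOLD_SECONDS = 0.25

logger = logging.getLogger("gcs_timing")


def timed_call(label, func, *args, **kwargs):
    """Run func, measure wall-clock time, and warn if it was slow.

    Returns a (result, elapsed_seconds) tuple.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    if elapsed > SLOW_THRESHOLD_SECONDS:
        logger.warning("%s took %.2f s", label, elapsed)
    return result, elapsed
```

Each of the three operations can then be wrapped individually, for example timed_call("download", blob.download_to_filename, path), so that a slow step is attributed to the right call.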

The code has changed slightly from above, as I added additional logging.
code.txt
log_1000f_8w.txt

Downloading with code.py takes around 40 seconds. The timing includes the final sleep on the threads, so the actual time is less.

Using gsutil -m cp -r gs://bucket/demo-data/ I see it taking 35.76s


crwilcox commented on June 20, 2024

Closing this out, as the customer has been helped. I will open bugs to dig into specific things we can do to help folks avoid this in the future. The threaded version of this code appears to suffer from contention; moving to multiprocessing is much faster.

Using threads: ~30 seconds
After moving to multiprocessing:
16 workers: 19.4 seconds
32 workers: 13.2 seconds
64 workers: 10.3 seconds
128 workers: 9.1 seconds

multiprocessing_code.txt
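For readers without access to the attachment, the general shape of a process-based download can be sketched as below. This is an illustration under assumptions, not the attached file: init_worker, download_one and download_all are hypothetical names, and the sketch assumes each worker process should build its own storage.Client rather than share one across processes.

```python
import os
from multiprocessing import Pool

# One client per worker process, set by init_worker
_client = None


def init_worker(client=None):
    """Pool initializer: runs once inside each worker process."""
    global _client
    if client is None:
        # Imported and constructed in the worker so every process
        # gets its own client and connection pool
        from google.cloud import storage
        client = storage.Client()
    _client = client


def download_one(task):
    """Download a single object; task is (bucket_name, blob_name, dest_dir)."""
    bucket_name, blob_name, dest_dir = task
    blob = _client.bucket(bucket_name).blob(blob_name)
    dest = os.path.join(dest_dir, os.path.basename(blob_name))
    blob.download_to_filename(dest)
    return dest


def download_all(bucket_name, blob_names, dest_dir, workers=16):
    """Fan the downloads out over a pool of worker processes."""
    tasks = [(bucket_name, name, dest_dir) for name in blob_names]
    with Pool(processes=workers, initializer=init_worker) as pool:
        return pool.map(download_one, tasks)
```

Because each process has its own interpreter, the workers do not compete for a single GIL, which is one plausible reason the worker counts above scale better than threads did. On platforms that spawn rather than fork, download_all should be called from under an `if __name__ == "__main__":` guard.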
