datafileutil's People

Contributors

briehl, jamesjeffryes, jayrbolton, jsfillman, mrcreosote, qzzhang, realmarcin, scanon, sychan, tianhao-gu, ugswork

datafileutil's Issues

Handle inaccessible Google Drive files sensibly

Currently, if a Google Drive file is not publicly available, the request is forwarded to a request-access page that returns a 200 status code. This tricks the download code into thinking everything is fine, and the code saves a file (at least currently) called uc containing the contents of that page. Hopefully any remaining steps will then fail on the file format. If not, things could get messy.

Hopefully this doesn't happen too often, as most people will get the link from google drive directly, which means they have access.

The Google download code should be replaced so that it avoids direct links where possible, since Google treats requests to those links as coming from a browser.

Investigate the Google Drive Python library.
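
As a stopgap for the 200-status access page described in this issue, the download code could refuse to save a response that looks like an HTML page. This is only a sketch of a heuristic, not the existing implementation: the function name looks_like_html_page is hypothetical, and the check would also reject a legitimate HTML download, so it would only make sense where HTML is never an expected file type.

```python
def looks_like_html_page(content_type, first_bytes):
    """Heuristic: True if a response appears to be an HTML page.

    content_type is the raw Content-Type header value (or None) and
    first_bytes is the start of the response body. Both parameters are
    assumptions about what the calling download code has available.
    """
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return True
    # Fall back to sniffing the body in case the header is missing or wrong.
    head = first_bytes[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

A caller could raise an error instead of writing the file whenever this returns True, which turns a confusing downstream format failure into an immediate, explicit one.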

Document how to deal with tokens in tests

  1. It's possible (though unlikely) that the org level token may expire - list contact info in the case that it does
  2. Document how to set up a token in your own fork to use for testing

Downloading large files from Google no longer works

The code looks for a cookie called download_warning and then uses its contents in a confirm query parameter when downloading the file. However, that cookie is no longer set when querying the download URL.

This is another case where we should probably look into using the Google Drive library.

Always sort objects before saving

This has 2 benefits:

  1. Distributes the computational load from sorting across many nodes rather than localizing it to the workspace server
  2. Prevents most cases of the error caused by using too much memory to store keys for maps while sorting the object

Re 2), note that if the object has map keys that are workspace refs but not UPAs, that might trigger a re-sort, so the client application may need to swap UPAs for refs to avoid a workspace-side sort.

If the object is already sorted, this would add only a low-cost scan step to the save.
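
A minimal sketch of the client-side sort and the cheap "already sorted" scan, assuming the object is JSON-like data (dicts with string keys, lists, and scalars) and that plain string ordering of map keys is what the workspace expects. The issue does not state the exact ordering, and the names sort_object and is_sorted are hypothetical.

```python
def sort_object(obj):
    """Return a copy of obj with every map's keys in sorted order."""
    if isinstance(obj, dict):
        # Python dicts preserve insertion order, so inserting in sorted
        # key order yields a sorted map when serialized.
        return {key: sort_object(obj[key]) for key in sorted(obj)}
    if isinstance(obj, list):
        # List order is data, so only the elements are recursed into.
        return [sort_object(item) for item in obj]
    return obj


def is_sorted(obj):
    """Scan obj and report whether every map already has sorted keys."""
    if isinstance(obj, dict):
        keys = list(obj)
        return keys == sorted(keys) and all(is_sorted(v) for v in obj.values())
    if isinstance(obj, list):
        return all(is_sorted(item) for item in obj)
    return True
```

Calling is_sorted first and only falling back to sort_object when it returns False keeps the common case to a single pass over the object.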

Update local function specs

At least unpack_files is now missing... although I'm not sure if these documents are even used. I sure don't use them. 3rd party devs maybe?

Add retry options

Add options to retry save / get calls for the workspace and the blobstore.

For the workspace, we might want to add a load ID to the object metadata (or make the workspace support it directly) to make it easy to check whether the object you expected was actually loaded.
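
One possible shape for the retry option is a small wrapper with exponential backoff. This is a sketch only: the name call_with_retries and all of its parameters are hypothetical, and which exceptions are actually safe to retry for the workspace and the blobstore would need to be decided per client.

```python
import time


def call_with_retries(func, *args, attempts=3, initial_delay=1.0,
                      backoff=2.0, retry_on=(Exception,),
                      sleep=time.sleep, **kwargs):
    """Call func, retrying on the given exception types.

    The delay between attempts starts at initial_delay seconds and is
    multiplied by backoff after each failure. The final failure is
    re-raised unchanged. sleep is injectable so tests need not wait.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

Retrying a save is only safe if a repeated call cannot silently create a duplicate, which is where the load ID idea above comes in: after a failed save, the client could check for an object carrying the expected load ID before trying again.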

Rework downloading web files re naming

Currently the code:

  • Does a GET to the source to read the Content-Disposition header, which I assume will also start (and maybe finish) pulling the file from the source
  • Inspects the Content-Disposition header to get the file name
  • Otherwise gets the name from the URL
  • Does another GET to the source using wget to download the file

Instead

  • Use a temporary file name like a UUID
  • Download the file
  • Figure out the filename as above
  • rename the file

Result: one fewer GET.
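
The proposed flow could look roughly like the sketch below, which uses only the standard library rather than wget. The function names pick_filename and download_once are hypothetical, and the sketch leaves out error handling, name collisions, and the Google Drive special cases.

```python
import os
import shutil
import uuid
from email.message import Message
from urllib.parse import unquote, urlparse
from urllib.request import urlopen


def pick_filename(content_disposition, url):
    """Prefer the Content-Disposition filename, else the last URL path segment."""
    if content_disposition:
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()
        if name:
            return os.path.basename(name)  # drop any directory components
    return os.path.basename(unquote(urlparse(url).path)) or None


def download_once(url, dest_dir):
    """Download with a single GET under a UUID name, then rename the file."""
    tmp_path = os.path.join(dest_dir, str(uuid.uuid4()))
    with urlopen(url) as resp, open(tmp_path, "wb") as out:
        shutil.copyfileobj(resp, out)
        # The headers come from the same response, so no second request.
        name = pick_filename(resp.headers.get("Content-Disposition"), url)
    if not name:
        return tmp_path  # no usable name was found; keep the temporary one
    final_path = os.path.join(dest_dir, name)
    os.replace(tmp_path, final_path)
    return final_path
```

Because the headers and the body come from the same response, the file name is resolved without a second request to the source.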

Switch to pytest

nose is deprecated and pytest tests are much cleaner, IMO. This is a big job though...
