alphagov / asset-manager

Manages uploaded assets (images, PDFs etc.) for applications on GOV.UK

Home Page: https://docs.publishing.service.gov.uk/apps/asset-manager.html

License: MIT License

Asset Manager

Manages uploaded assets (images, PDFs etc.) for applications in the GOV.UK Publishing stack.

The app receives uploaded files from publishing applications and returns the URLs that they will be made available at. Before an asset is available to the public, it is virus scanned. Once a file is found to be clean, Asset Manager serves it at the previously generated URL. Unscanned or Infected files return a 404 Not Found error.

Scanning uses ClamAV and occurs asynchronously via govuk_sidekiq.
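
As a rough sketch of the shape of that flow (the worker name, scanner invocation and state values here are illustrative, not necessarily the app's real code):

require "sidekiq"

# Hypothetical Sidekiq job: scan an uploaded file with ClamAV and
# record the outcome on the asset. clamdscan exits 0 for clean files.
class VirusScanWorker
  include Sidekiq::Worker

  def perform(asset_id)
    asset = Asset.find(asset_id)
    if system("clamdscan", "--no-summary", asset.file.path)
      asset.update(state: "clean")    # becomes servable at its URL
    else
      asset.update(state: "infected") # continues to return 404
    end
  end
end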

Technical Documentation

This is a Ruby on Rails app, and should follow our Rails app conventions.

You can use the GOV.UK Docker environment to run the application and its tests with all the necessary dependencies. Follow the usage instructions to get started.

Use GOV.UK Docker to run any commands that follow.

Running the test suite

bundle exec rspec

Assets on S3

All assets are uploaded to the S3 bucket via a separate govuk_sidekiq job, triggered if virus scanning succeeds. Assets are currently also saved to the NFS mount, as per the original behaviour.

In non-production environments if the AWS_S3_BUCKET_NAME environment variable is not set, then a fake version of S3 (S3Storage::Fake) is used and the other AWS_* environment variables do not need to be set. In this case, files are saved to the local filesystem instead of S3 and are served via an instance of Rack::File mounted on the appropriate route path prefix.
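
A minimal sketch of how that kind of fallback can be wired up (the constructor arguments are assumptions; only the S3Storage::Fake name and the env var come from the app):

# Hypothetical factory: use real S3 when a bucket is configured,
# otherwise fall back to the fake that writes to the local filesystem.
def build_storage
  if ENV["AWS_S3_BUCKET_NAME"]
    S3Storage.new(ENV["AWS_S3_BUCKET_NAME"])
  else
    S3Storage::Fake.new(Rails.root.join("fake-s3"))
  end
end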

Manuals and decisions

Check the docs directory for detailed instructions, including API documentation.

Licence

MIT License

asset-manager's People

Contributors

alext, barrucadu, beckal, benthorner, brucebolt, callumknights, cbaines, cdccollins, chrisbashton, chrislo, chrisroos, danielburnley, dependabot-preview[bot], dependabot-support, dependabot[bot], elliotcm, floehopper, h-lame, jkempster34, jonathanhallam, jordanhatch, kevindew, kushalp, murilodalri, nsabri1, sengi, steventux, theseanything, thomasleese, tijmenb


asset-manager's Issues

Consider using delayed-plugins-airbrake or airbrake/delayed_job

Currently it looks as if the app is using explicit calls to Airbrake.notify_or_ignore to report exceptions to Errbit. I suspect we could reduce duplication and have this happen automatically by using either delayed-plugins-airbrake (which I think only works with Airbrake <= v4, i.e. the current version) or airbrake/delayed_job (which is built into Airbrake >= v5, but would require us to upgrade Airbrake within the app).

Before doing anything it would be worth checking what's being used elsewhere for Airbrake in apps running DelayedJob.

Consider encapsulating asset key in a method

It feels as if there are many occurrences of asset.id.to_s spread across the codebase. If these are all used for the same purpose, it would be good to encapsulate the logic in one place, e.g. Asset#to_param, Asset#key, or something like that. It's possible that the logic is being used for multiple purposes in which case there should probably be multiple corresponding methods on Asset.
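
For example, something along these lines (a sketch of the suggestion, not the final API):

class Asset
  include Mongoid::Document

  # Single place that defines the external identifier for an asset,
  # e.g. for building public URLs or storage keys.
  def key
    id.to_s
  end
end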

ActionController::BadRequest exceptions

We've seen a number of ActionController::BadRequest exceptions in Errbit. These seem to be the result of some invalid encoding in the URL.

You can replicate the problem locally by trying to visit http://localhost:3000/media/asset-id/scrutiny%94.65.

There's a related problem where these exceptions are being reported in Errbit as Notifications. I've created issue #122 to capture this.

Work out what to do about updating existing assets with new content

The API currently allows new content to be uploaded against an existing asset. Since the public URL generated by the Asset Manager includes the filename in its path, if the new content has a different filename, then Asset Manager redirects requests to the old URL (including the old filename) to the new URL (including the new filename).
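
A sketch of that redirect behaviour (the param and attribute names, and the redirect status, are assumptions):

class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])
    # If the asset was updated with a new filename, requests for the
    # old URL are redirected to the URL containing the new filename.
    if params[:filename] != asset.filename
      redirect_to "/media/#{asset.id}/#{asset.filename}", status: :moved_permanently
    else
      send_file asset.file.path
    end
  end
end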

We need to decide whether to retain this behaviour. If we decide we need to keep it, then we need to work out how to implement it in the new solution. These two issues are probably relevant:

Fix "Can't verify CSRF token authenticity" warning

Steps to reproduce with app running on developer VM:

$ echo `date` > tmp.txt

$ curl http://asset-manager.dev.gov.uk/assets --form "asset[file]=@tmp.txt"
{"_response_info":{"status":"created"},"id":"http://asset-manager.dev.gov.uk/assets/59775bc7759b74230c000000","name":"tmp.txt","content_type":"text/plain","file_url":"http://assets-origin.dev.gov.uk/media/59775bc7759b74230c000000/tmp.txt","state":"unscanned"}

This resulted in the following lines in the Rails development.log file:

Started POST "/assets" for ::1 at 2017-07-25 14:55:03 +0000
Processing by AssetsController#create as */*
  Parameters: {"asset"=>{"file"=>#<ActionDispatch::Http::UploadedFile:0x007f2c2a9c4a20 @tempfile=#<Tempfile:/tmp/user/1000/RackMultipart20170725-8972-vzax4h.txt>, @original_filename
="tmp.txt", @content_type="text/plain", @headers="Content-Disposition: form-data; name=\"asset[file]\"; filename=\"tmp.txt\"\r\nContent-Type: text/plain\r\n">}}
Can't verify CSRF token authenticity

I don't think we should be seeing the "Can't verify CSRF token authenticity" warning.
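
If these endpoints are only ever used by token-authenticated API clients, one conventional fix would be something like this (a sketch; whether it's appropriate here needs checking):

class AssetsController < ApplicationController
  # This API is authenticated with OAuth bearer tokens rather than
  # browser sessions, so CSRF protection doesn't apply; skipping the
  # filter also stops the warning being logged on every request.
  skip_before_action :verify_authenticity_token
end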

Confusing behaviour when requesting JSON

It took me a while to realise that the API always responds with JSON but behaves slightly differently depending on the value of the Accept header that's sent.

Requesting HTML responds with JSON:

$ curl http://localhost:3000/assets/596e30c4edf877ece4000000 \
-H"Accept: text/html"
{"_response_info":{"status":"ok"},"id":"http://localhost:3000/assets/596e30c4edf877ece4000000","name":"tmp123.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e30c4edf877ece4000000/tmp123.txt","state":"clean"}

Requesting JSON responds with a 401 response:

$ curl http://localhost:3000/assets/596e30c4edf877ece4000000 \
-H"Accept: application/json" \
-v
*   Trying ::1...
* Connected to localhost (::1) port 3000 (#0)
> GET /assets/596e30c4edf877ece4000000 HTTP/1.1
> Host: localhost:3000
> User-Agent: curl/7.43.0
> Accept: application/json
> 
< HTTP/1.1 401 Unauthorized 
< WWW-Authenticate: Bearer error="invalid_token"
< Cache-Control: no-cache
< X-Request-Id: 7f3f0415-4a5c-415b-b22d-c98aba5d5028
< X-Runtime: 0.004867
< Server: WEBrick/1.3.1 (Ruby/2.3.1/2016-04-26)
< Date: Mon, 31 Jul 2017 17:22:22 GMT
< Content-Length: 0
< Connection: Keep-Alive
< 
* Connection #0 to host localhost left intact

Requesting JSON and supplying an OAuth 2 Bearer Token responds with JSON:

$ curl http://localhost:3000/assets/596e30c4edf877ece4000000 \
-H"Accept: application/json" \
-H"Authorization: Bearer 123" 
{"_response_info":{"status":"ok"},"id":"http://localhost:3000/assets/596e30c4edf877ece4000000","name":"tmp123.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e30c4edf877ece4000000/tmp123.txt","state":"clean"}

I wonder whether we should only respond to JSON requests?
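
If we did decide to go JSON-only, a minimal sketch might look like this (names are illustrative):

class ApplicationController < ActionController::Base
  before_action :require_json

  private

  # Reject requests that don't ask for JSON instead of silently
  # answering them with JSON anyway.
  def require_json
    head :not_acceptable unless request.format.json?
  end
end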

Configure nginx to proxy requests to assets on S3

We've trialled this using a version of Asset Manager on EC2 and now want to make the change so that we can try the same thing within the GOV.UK environment.

We'll need to configure nginx to proxy requests for assets to S3. We'll need to add something like this config for the cloud-storage-proxy location to the asset-manager nginx config in govuk-puppet.

We should work out how much space we might be able to allow for the cache - I think the example above gives it 10GB.

Make the assets S3 bucket private

We made the contents of our S3 assets bucket public in alphagov/govuk-terraform-provisioning#130 in order to test serving assets directly from S3.

We're currently focussed on proxying requests to S3 via nginx so we no longer need this bucket to be public.

Todo

  • Ask someone to deploy/execute the task to update the bucket policy in integration

Add note to the README about making authenticated API requests

We have to set an OAuth 2 Bearer Token when requesting JSON from the API. This token can be set to anything in test/development but will need to be a real token issued by Signon in production.

The Bearer Token can be set using cURL:

$ curl http://localhost:3000/assets/596e30c4edf877ece4000000 \
-H"Accept: application/json" \
-H"Authorization: Bearer 123" \
-v

Updating assets doesn't always upload the file to S3

Updating an existing asset by uploading a different file but with the same name results in the file not being uploaded to S3.

We think this is related to issue #72 as files are uploaded to S3 after they're successfully virus scanned.

Requests to stream assets from S3 can result in an empty 200 response

I found it confusing that a request to serve an asset from S3 results in an empty 200 response if the AWS_S3_BUCKET_NAME environment variable isn't set. I think I'd prefer it to fail fast if I'm requesting a file from S3 when the app doesn't have enough information to be able to satisfy that request.
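
A sketch of the fail-fast behaviour I'd prefer (the build method is hypothetical; the env var name is the real one):

class S3Storage
  # Raise at construction time if streaming from S3 is requested
  # without a configured bucket, instead of returning an empty 200
  # at request time.
  def self.build
    bucket_name = ENV["AWS_S3_BUCKET_NAME"]
    raise "AWS_S3_BUCKET_NAME is not set" if bucket_name.nil?
    new(bucket_name)
  end
end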

Updating assets can bypass virus scanning

Updating an existing asset by uploading a different file but with the same name causes the asset-manager to bypass virus scanning.

Steps to reproduce:

  1. Create temporary file
$ echo `date` > tmp.txt
$ cat tmp.txt
Tue 18 Jul 2017 17:22:54 BST
  2. Upload file
$ curl http://localhost:3000/assets --form "asset[file]=@tmp.txt"
{"_response_info":{"status":"created"},"id":"http://localhost:3000/assets/596e35e4edf877ece4000003","name":"tmp.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e35e4edf877ece4000003/tmp.txt","state":"unscanned"}
  3. Request file info

Ensure virus scanning is running (bundle exec rake jobs:work) and that the response says the file is clean.

$ curl http://localhost:3000/assets/596e35e4edf877ece4000003
{"_response_info":{"status":"ok"},"id":"http://localhost:3000/assets/596e35e4edf877ece4000003","name":"tmp.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e35e4edf877ece4000003/tmp.txt","state":"clean"}
  4. Request the file

Observe that the content matches the content of our temporary file created in step 1.

$ curl http://localhost:3000/media/596e35e4edf877ece4000003/tmp.txt
Tue 18 Jul 2017 17:22:54 BST
  5. Update the temporary file
$ echo `date` > tmp.txt
$ cat tmp.txt
Tue 18 Jul 2017 17:25:09 BST
  6. Stop the virus scanning job

  7. Update the asset

NOTE: The file is already marked as clean in the response.

$ curl http://localhost:3000/assets/596e35e4edf877ece4000003 --request PUT --form "asset[file]=@tmp.txt"
{"_response_info":{"status":"success"},"id":"http://localhost:3000/assets/596e35e4edf877ece4000003","name":"tmp.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e35e4edf877ece4000003/tmp.txt","state":"clean"}
  8. Request the file

Observe that the content matches the content of our updated temporary file that we uploaded in step 7.

$ curl http://localhost:3000/media/596e35e4edf877ece4000003/tmp.txt
Tue 18 Jul 2017 17:25:09 BST

Test behaviour of Asset Manager clients when redirecting to S3

We want to test how Asset Manager's client apps behave when we redirect to assets on S3. This will help give us confidence that it's OK to deploy these changes to staging and production.

Dependencies

  • Making S3 bucket publicly readable as per PR #101
  • Deploying the branch in PR #120 to integration

Client apps of Asset Manager

  • Specialist Publisher
    • Assets are attached to Documents
  • Manuals Publisher
  • Service Manual
    • Assets are uploaded by dragging and dropping them into the text area of a "page"
  • Travel Advice Publisher
    • A single map attachment can be uploaded to a location
    • A single PDF attachment can be uploaded to a location
  • Publisher
    • Assets can be uploaded to Campaigns (which is a retired format)
    • Assets can be uploaded to Videos (which is a retired format)

Add functionality to redirect to the file on S3

We're currently able to stream the asset from S3 but want to compare the performance of redirecting to the asset instead.

We might want to toggle this functionality by using a querystring and/or using an environment variable as we've done for the streaming functionality.
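
A sketch of how such a toggle could look, in the spirit of the existing proxy_via_nginx param (all names here are assumptions; assumes the aws-sdk-s3 gem):

class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])
    if params[:redirect_to_s3] == "true" || ENV["REDIRECT_ALL_REQUESTS_TO_S3"]
      # Redirect the client to a time-limited signed S3 URL.
      redirect_to s3_object_for(asset).presigned_url(:get, expires_in: 60)
    else
      # Existing behaviour: stream the object body through the app.
      send_data s3_object_for(asset).get.body.read, filename: asset.filename
    end
  end

  private

  # Hypothetical helper mapping an asset to its S3 object.
  def s3_object_for(asset)
    Aws::S3::Object.new(ENV["AWS_S3_BUCKET_NAME"], asset.id.to_s)
  end
end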

Investigate the use of S3 versioning to manage updates to assets

The Asset Manager API allows you to update an asset. When updating an asset and replacing it with a file that has a different name the Asset Manager will put a redirect in place from the old asset to the new asset. We want to understand whether it's possible to get this behaviour by using the versioning functionality built into S3.
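
For example, with bucket versioning enabled, the SDK exposes every stored version of a key (a quick exploration script, not app code):

require "aws-sdk-s3"

# List the versions S3 keeps for one object, to see whether old
# content remains addressable after an update.
client = Aws::S3::Client.new
response = client.list_object_versions(
  bucket: ENV["AWS_S3_BUCKET_NAME"],
  prefix: "596e35e4edf877ece4000003"
)
response.versions.each do |version|
  puts "#{version.key} #{version.version_id} #{version.last_modified}"
end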

Asset Manager should generate IDs of objects stored on S3

The objects we write to S3 are currently named using the BSON ID of the corresponding record in the Asset Manager database. We want to decouple the storage of the assets from the storage of their metadata in Mongo. We should generate our own IDs and use those to write the files to S3.
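
A sketch of what that might look like on the model (the field name is an assumption):

require "securerandom"

class Asset
  include Mongoid::Document

  # Storage key generated independently of the Mongo BSON id, so the
  # S3 object name no longer couples asset storage to the database.
  field :uuid, type: String, default: -> { SecureRandom.uuid }
end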

Move non-secret config to govuk-puppet

Some of our non-secret asset-manager configuration options (e.g. the AWS_S3_BUCKET_NAME environment variable) are currently set in govuk-secrets. This is unnecessary (because they're not secret) and makes them hard for us to change (because we don't have access to the secrets repo). We think it's possible to move these config options to govuk-puppet which would allow us to change them when required.

Todo

  • Ask someone to check that the values in my PR match those in govuk-secrets
  • Merge and deploy my PR
  • Ask someone to remove the equivalent values from govuk-secrets

Log Delayed Job output in all environments

It would be useful if the delayed_job logs were available through Kibana so we can see how long virus checking and S3 uploading are taking, and trace any errors.

I asked in Slack and @tuzz said:

It doesn't look like they are. On backend-1.integration I can see /var/log/asset-manager/delayed_job.out.log but there's no JSON version of this.

A kibana query of @fields.application:asset-manager shows all logs for that application, but that only appears to be http requests.

So it looks like we might have to format the logs as JSON and then configure something to forward them to Kibana.
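
A sketch of that last step (the log path and JSON fields are guesses):

require "json"
require "logger"
require "time"

# Point Delayed Job at a logger that emits one JSON object per line,
# which the existing log shipping could then forward to Kibana.
logger = Logger.new(Rails.root.join("log", "delayed_job.json.log"))
logger.formatter = proc do |severity, time, _progname, message|
  JSON.generate(timestamp: time.utc.iso8601, level: severity, message: message) + "\n"
end
Delayed::Worker.logger = logger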

Add CNAME to S3 bucket

We're imagining wanting to do this before serving assets directly from S3, to avoid people being confused by seeing an AWS URL.

Use S3 for storing and serving assets

Things we need to do to get it working end-to-end:

  • Merge changes to Asset Manager
  • Deploy Asset Manager changes to integration
  • Deploy Asset Manager changes to staging and production
  • Merge changes to Terraform
  • Deploy Terraform changes
  • Store the IAM credentials generated in the deployment repository
  • Deploy the IAM credentials
  • Merge changes to puppet
  • Deploy puppet changes
  • (Deploy with hard restart of asset-manager - we think we need to do this so that the asset manager picks up the new environment variables)

Upgrade CarrierWave

The app is currently using v0.10.0 and the latest version is v1.1.0. It would be sensible to upgrade.

Avoid sending S3 errors to the client

In the proxying-to-S3 approach it's possible for errors from S3 to be sent to the client. I guess we'll want to send custom error messages if/when we enable this in production by default.

An illustration of the problem:

# Create temporary file
$ echo `date` > tmp.txt

# Upload file to asset-manager
$ curl http://asset-manager.dev.gov.uk/assets --form "asset[file]=@tmp.txt"

# Request file from S3 via nginx
$ curl "http://asset-manager.dev.gov.uk/media/5995b518759b7443e960b9b0/tmp.txt?proxy_via_nginx=true"
Thu Aug 17 15:22:25 UTC 2017

# Manually remove the file from S3 - to simulate the sort of problem we might encounter

# Request the file from S3 via nginx
$ curl "http://asset-manager.dev.gov.uk/media/5995b518759b7443e960b9b0/tmp.txt?proxy_via_nginx=true"
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>5995b518759b7443e960b9b0</Key><RequestId>5B19547BA6F8A875</RequestId><HostId>5ZkLa2PeoC4B2TzsK93fsJ7VK8c5eR/CSMmzmnuc01IBy53N392vTDZwvkvKwAcYq994bcQeQoU=</HostId></Error>

Ensure Etag from nginx+s3proxy matches Etag from nginx+Sendfile

We're imagining an incremental rollout of the nginx+s3proxy approach to serving assets. The same asset served via the two approaches currently has a different Etag. The Etag generated by nginx appears to be (according to this Serverfault answer) a combination of last modified date and content length. The Etag generated by S3 may or may not be an md5 digest of the content, depending on how the object was created and how it's encrypted.

The same asset having different Etags (e.g. etag-1 and etag-2) will result in some additional load on the asset-manager servers because a conditional GET using etag-1 might end up with a 200 response and etag-2, instead of a 304 response for etag-1.

I can think of two options (there may be others):

  1. Generate the same Etags that nginx currently generates and store them against the objects in S3

    • This assumes that we can set an Etag on S3 and have it sent in the response to the client.
    • The advantage is that Etags remain the same and so caches can be informed that they can continue using a previously cached version of the file
    • The problem might be in generating the Etag value the same as nginx
  2. Change the Etags for the current Sendfile mechanism and then use those same Etags when serving files from S3 (see the sketch after this list)

    • This assumes we can customise the Etag sent from nginx (maybe by setting it in the Rails app) and that we can customise the Etag sent from S3
    • The disadvantage is that all cached content might have to be re-populated because the Etags have changed. If we're going to do this then we should roll it out in nginx and give it at least 24 hours to ensure all caches have picked up the new Etags.
    • The advantage is that we're not tied to the Etag implementation of nginx.
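
A sketch of option 2's app-side half, computing a stable Etag in Rails so the same value could also be stored against the S3 object (Asset#md5_hexdigest is hypothetical):

class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])
    # stale? renders a 304 automatically when the client's
    # If-None-Match matches; otherwise we serve the file with the
    # app-controlled Etag.
    if stale?(etag: asset.md5_hexdigest)
      send_file asset.file.path
    end
  end
end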

Upgrade Airbrake Gem

We're using version 4.0.0 of the Airbrake Gem. This version doesn't report on ActionController::BadRequest exceptions correctly. They're reported in Errbit as Notifications, which isn't very helpful.

Pull request 271 in the Airbrake project appears to have fixed this problem.

The pull request was updated on Aug 26 2014 with a comment saying that the fix would be released the following week. Airbrake version 4.1.0 was released on Sep 4 2014 so I suspect we'll want at least this version.

Note that I've copied the text of this description from alphagov/smart-answers#1681 where we encountered the same problem.

Overwriting existing field file in class Asset

I noticed this warning (?), "Overwriting existing field file in class Asset", in the Rails log. I assume this is because Asset#file is first declared as a Mongoid field (field :file, type: String) and then as an uploader (mount_uploader :file, AssetUploader).

I assume this hasn't had any bad effects, but it does seem a bit odd.

Consider replacing release_1 tag with release_106 tag in remote git repo

Earlier today Jenkins CI lost track of build numbers on the master branch. The next build of the master branch then caused the release_1 tag to be added to this commit. The following build of the master branch tried to add the release_2 tag, but failed because the tag already existed.

Following advice from 2nd line support, I ran a script on Jenkins to update the current build number to 107. I chose 107, because release_105 was the last correctly numbered tag. The next build of the master branch had the correct build number, 107, and successfully added the release_107 tag. So I think everything is now working correctly.

However, it would be nice to replace the release_1 tag with release_106 so that the git history makes sense. I'm slightly uncertain whether doing this is a good idea, in case anything depends on the existing release_1 tag, so I've written this up as a GH issue so we can discuss the problem and decide what to do.

Updating assets not working as expected in production

Assets are cached for 24 hours in production. This caching means that updating assets won't have any effect until the cache has expired.

You can use one of the publishing apps that make use of Asset Manager (e.g. Manuals Publisher) to replicate the problem.

Updating the attachment by uploading a different file with the same name will have no observable effect until the caching of the original asset has expired.

Updating the attachment by uploading a file with a different name correctly uploads the new file but requests for the original attachment are still served from the cache instead of responding with a redirect to the new attachment as expected.

Allow Asset Manager to send location of asset on S3 to nginx

We want to be able to request an asset from Asset Manager and have the app respond with the S3 location of the asset in the X-Accel-Redirect header. This will allow us to configure nginx to serve the file from S3 while keeping the location of the file on S3 hidden.

We're planning to keep the assets on S3 private so we should send a signed URL from Asset Manager.
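
A sketch of the controller side (the cloud-storage-proxy prefix matches the nginx location mentioned in the earlier issue; everything else is an assumption, and it assumes the aws-sdk-s3 gem):

class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])
    object = Aws::S3::Object.new(ENV["AWS_S3_BUCKET_NAME"], asset.id.to_s)
    signed_url = object.presigned_url(:get, expires_in: 60)
    # nginx intercepts X-Accel-Redirect and fetches the signed URL
    # itself, so the S3 location is never revealed to the client.
    headers["X-Accel-Redirect"] = "/cloud-storage-proxy/#{signed_url}"
    head :ok
  end
end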

There's a spike of this in PR #126.
