alphagov / asset-manager

Manages uploaded assets (images, PDFs etc.) for applications on GOV.UK

Home Page: https://docs.publishing.service.gov.uk/apps/asset-manager.html

License: MIT License

Asset Manager

Manages uploaded assets (images, PDFs etc.) for applications in the GOV.UK Publishing stack.

The app receives uploaded files from publishing applications and returns the URLs that they will be made available at. Before an asset is available to the public, it is virus scanned. Once a file is found to be clean, Asset Manager serves it at the previously generated URL. Unscanned or Infected files return a 404 Not Found error.

Scanning uses ClamAV and occurs asynchronously via govuk_sidekiq.
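
As a rough sketch of the shape of that flow (the worker name, scanner invocation and state values here are illustrative, not necessarily the app's real code):

require "sidekiq"

# Hypothetical Sidekiq job: scan an uploaded file with ClamAV and
# record the outcome on the asset. clamdscan exits 0 for clean files.
class VirusScanWorker
  include Sidekiq::Worker

  def perform(asset_id)
    asset = Asset.find(asset_id)
    if system("clamdscan", "--no-summary", asset.file.path)
      asset.update(state: "clean")    # becomes servable at its URL
    else
      asset.update(state: "infected") # continues to return 404
    end
  end
end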

Technical Documentation

This is a Ruby on Rails app, and should follow our Rails app conventions.

You can use the GOV.UK Docker environment to run the application and its tests with all the necessary dependencies. Follow the usage instructions to get started.

Use GOV.UK Docker to run any commands that follow.

Running the test suite

bundle exec rspec

Assets on S3

All assets are uploaded to the S3 bucket via a separate govuk_sidekiq job, triggered if virus scanning succeeds. Assets are currently also saved to the NFS mount, as per the original behaviour.

In non-production environments if the AWS_S3_BUCKET_NAME environment variable is not set, then a fake version of S3 (S3Storage::Fake) is used and the other AWS_* environment variables do not need to be set. In this case, files are saved to the local filesystem instead of S3 and are served via an instance of Rack::File mounted on the appropriate route path prefix.
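
A minimal sketch of how that kind of fallback can be wired up (the constructor arguments are assumptions; only the S3Storage::Fake name and the env var come from the app):

# Hypothetical factory: use real S3 when a bucket is configured,
# otherwise fall back to the fake that writes to the local filesystem.
def build_storage
  if ENV["AWS_S3_BUCKET_NAME"]
    S3Storage.new(ENV["AWS_S3_BUCKET_NAME"])
  else
    S3Storage::Fake.new(Rails.root.join("fake-s3"))
  end
end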

Manuals and decisions

Check the docs directory for detailed instructions, including API documentation.

Licence

MIT License

asset-manager's People

Contributors

alext, barrucadu, beckal, benthorner, brucebolt, callumknights, cbaines, cdccollins, chrisbashton, chrislo, chrisroos, danielburnley, dependabot-preview[bot], dependabot-support, dependabot[bot], elliotcm, floehopper, h-lame, jkempster34, jonathanhallam, jordanhatch, kevindew, kushalp, murilodalri, nsabri1, sengi, steventux, theseanything, thomasleese, tijmenb


asset-manager's Issues

Consider using delayed-plugins-airbrake or airbrake/delayed_job

Currently it looks as if the app is using explicit calls to Airbrake.notify_or_ignore to report exceptions to Errbit. I suspect we could reduce duplication and have this happen automatically by using either delayed-plugins-airbrake (which I think only works with Airbrake <= v4, i.e. the current version) or airbrake/delayed_job (which is built into Airbrake >= v5, but would require us to upgrade Airbrake within the app).

Before doing anything it would be worth checking what's being used elsewhere for Airbrake in apps running DelayedJob.

Consider encapsulating asset key in a method

It feels as if there are many occurrences of asset.id.to_s spread across the codebase. If these are all used for the same purpose, it would be good to encapsulate the logic in one place, e.g. Asset#to_param, Asset#key, or something like that. It's possible that the logic is being used for multiple purposes in which case there should probably be multiple corresponding methods on Asset.
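
For example, something along these lines (a sketch of the suggestion, not the final API):

class Asset
  include Mongoid::Document

  # Single place that defines the external identifier for an asset,
  # e.g. for building public URLs or storage keys.
  def key
    id.to_s
  end
end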

ActionController::BadRequest exceptions

We've seen a number of ActionController::BadRequest exceptions in Errbit. These seem to be the result of some invalid encoding in the URL.

You can replicate the problem locally by trying to visit http://localhost:3000/media/asset-id/scrutiny%94.65.

There's a related problem where these exceptions are being reported in Errbit as Notifications. I've created issue #122 to capture this.

Work out what to do about updating existing assets with new content

The API currently allows new content to be uploaded against an existing asset. Since the public URL generated by the Asset Manager includes the filename in its path, if the new content has a different filename, then Asset Manager redirects requests to the old URL (including the old filename) to the new URL (including the new filename).
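
A sketch of that redirect behaviour (the param and attribute names, and the redirect status, are assumptions):

class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])
    # If the asset was updated with a new filename, requests for the
    # old URL are redirected to the URL containing the new filename.
    if params[:filename] != asset.filename
      redirect_to "/media/#{asset.id}/#{asset.filename}", status: :moved_permanently
    else
      send_file asset.file.path
    end
  end
end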

We need to decide whether to retain this behaviour. If we decide we need to keep it, then we need to work out how to implement it in the new solution. These two issues are probably relevant:

Fix "Can't verify CSRF token authenticity" warning

Steps to reproduce with app running on developer VM:

$ echo `date` > tmp.txt

$ curl http://asset-manager.dev.gov.uk/assets --form "asset[file]=@tmp.txt"
{"_response_info":{"status":"created"},"id":"http://asset-manager.dev.gov.uk/assets/59775bc7759b74230c000000","name":"tmp.txt","content_type":"text/plain","file_url":"http://assets-origin.dev.gov.uk/media/59775bc7759b74230c000000/tmp.txt","state":"unscanned"}

This resulted in the following lines in the Rails development.log file:

Started POST "/assets" for ::1 at 2017-07-25 14:55:03 +0000
Processing by AssetsController#create as */*
  Parameters: {"asset"=>{"file"=>#<ActionDispatch::Http::UploadedFile:0x007f2c2a9c4a20 @tempfile=#<Tempfile:/tmp/user/1000/RackMultipart20170725-8972-vzax4h.txt>, @original_filename
="tmp.txt", @content_type="text/plain", @headers="Content-Disposition: form-data; name=\"asset[file]\"; filename=\"tmp.txt\"\r\nContent-Type: text/plain\r\n">}}
Can't verify CSRF token authenticity

I don't think we should be seeing the "Can't verify CSRF token authenticity" warning.
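
If these endpoints are only ever used by token-authenticated API clients, one conventional fix would be something like this (a sketch; whether it's appropriate here needs checking):

class AssetsController < ApplicationController
  # This API is authenticated with OAuth bearer tokens rather than
  # browser sessions, so CSRF protection doesn't apply; skipping the
  # filter also stops the warning being logged on every request.
  skip_before_action :verify_authenticity_token
end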

Confusing behaviour when requesting JSON

It took me a while to realise that the API always responds with JSON but behaves slightly differently depending on the value of the Accept header that's sent.

Requesting HTML responds with JSON:

$ curl http://localhost:3000/assets/596e30c4edf877ece4000000 \
-H"Accept: text/html"
{"_response_info":{"status":"ok"},"id":"http://localhost:3000/assets/596e30c4edf877ece4000000","name":"tmp123.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e30c4edf877ece4000000/tmp123.txt","state":"clean"}

Requesting JSON responds with a 401 response:

$ curl http://localhost:3000/assets/596e30c4edf877ece4000000 \
-H"Accept: application/json" \
-v
*   Trying ::1...
* Connected to localhost (::1) port 3000 (#0)
> GET /assets/596e30c4edf877ece4000000 HTTP/1.1
> Host: localhost:3000
> User-Agent: curl/7.43.0
> Accept: application/json
> 
< HTTP/1.1 401 Unauthorized 
< WWW-Authenticate: Bearer error="invalid_token"
< Cache-Control: no-cache
< X-Request-Id: 7f3f0415-4a5c-415b-b22d-c98aba5d5028
< X-Runtime: 0.004867
< Server: WEBrick/1.3.1 (Ruby/2.3.1/2016-04-26)
< Date: Mon, 31 Jul 2017 17:22:22 GMT
< Content-Length: 0
< Connection: Keep-Alive
< 
* Connection #0 to host localhost left intact

Requesting JSON and supplying an OAuth 2 Bearer Token responds with JSON:

$ curl http://localhost:3000/assets/596e30c4edf877ece4000000 \
-H"Accept: application/json" \
-H"Authorization: Bearer 123" 
{"_response_info":{"status":"ok"},"id":"http://localhost:3000/assets/596e30c4edf877ece4000000","name":"tmp123.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e30c4edf877ece4000000/tmp123.txt","state":"clean"}

I wonder whether we should only respond to JSON requests?
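
If we did decide to go JSON-only, a minimal sketch might look like this (names are illustrative):

class ApplicationController < ActionController::Base
  before_action :require_json

  private

  # Reject requests that don't ask for JSON instead of silently
  # answering them with JSON anyway.
  def require_json
    head :not_acceptable unless request.format.json?
  end
end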

Configure nginx to proxy requests to assets on S3

We've trialled this using a version of Asset Manager on EC2 and now want to make the change so that we can try the same thing within the GOV.UK environment.

We'll need to configure nginx to proxy requests for assets to S3. We'll need to add something like this config for the cloud-storage-proxy location to the asset-manager nginx config in govuk-puppet.

We should work out how much space we might be able to allow for the cache - I think the example above gives it 10GB.

Make the assets S3 bucket private

We made the contents of our S3 assets bucket public in alphagov/govuk-terraform-provisioning#130 in order to test serving assets directly from S3.

We're currently focussed on proxying requests to S3 via nginx so we no longer need this bucket to be public.

Todo

  • Ask someone to deploy/execute the task to update the bucket policy in integration

Add note to the README about making authenticated API requests

We have to set an OAuth 2 Bearer Token when requesting JSON from the API. This token can be set to anything in test/development but will need to be a real token issued by Signon in production.

The Bearer Token can be set using cURL:

$ curl http://localhost:3000/assets/596e30c4edf877ece4000000 \
-H"Accept: application/json" \
-H"Authorization: Bearer 123" \
-v

Updating assets doesn't always upload the file to S3

Updating an existing asset by uploading a different file but with the same name results in the file not being uploaded to S3.

We think this is related to issue #72 as files are uploaded to S3 after they're successfully virus scanned.

Requests to stream assets from S3 can result in an empty 200 response

I found it confusing that a request to serve an asset from S3 results in an empty 200 response if the AWS_S3_BUCKET_NAME environment variable isn't set. I think I'd prefer it to fail fast if I'm requesting a file from S3 when the app doesn't have enough information to be able to satisfy that request.
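
A sketch of the fail-fast behaviour I'd prefer (the build method is hypothetical; the env var name is the real one):

class S3Storage
  # Raise at construction time if streaming from S3 is requested
  # without a configured bucket, instead of returning an empty 200
  # at request time.
  def self.build
    bucket_name = ENV["AWS_S3_BUCKET_NAME"]
    raise "AWS_S3_BUCKET_NAME is not set" if bucket_name.nil?
    new(bucket_name)
  end
end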

Updating assets can bypass virus scanning

Updating an existing asset by uploading a different file but with the same name causes the asset-manager to bypass virus scanning.

Steps to reproduce:

  1. Create temporary file
$ echo `date` > tmp.txt
$ cat tmp.txt
Tue 18 Jul 2017 17:22:54 BST
  2. Upload file
$ curl http://localhost:3000/assets --form "asset[file]=@tmp.txt"
{"_response_info":{"status":"created"},"id":"http://localhost:3000/assets/596e35e4edf877ece4000003","name":"tmp.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e35e4edf877ece4000003/tmp.txt","state":"unscanned"}
  3. Request file info

Ensure virus scanning is running (bundle exec rake jobs:work) and that the response says the file is clean.

$ curl http://localhost:3000/assets/596e35e4edf877ece4000003
{"_response_info":{"status":"ok"},"id":"http://localhost:3000/assets/596e35e4edf877ece4000003","name":"tmp.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e35e4edf877ece4000003/tmp.txt","state":"clean"}
  4. Request the file

Observe that the content matches the content of our temporary file created in step 1.

$ curl http://localhost:3000/media/596e35e4edf877ece4000003/tmp.txt
Tue 18 Jul 2017 17:22:54 BST
  5. Update the temporary file
$ echo `date` > tmp.txt
$ cat tmp.txt
Tue 18 Jul 2017 17:25:09 BST
  6. Stop the virus scanning job

  7. Update the asset

NOTE: The file is already marked as clean in the response.

$ curl http://localhost:3000/assets/596e35e4edf877ece4000003 --request PUT --form "asset[file]=@tmp.txt"
{"_response_info":{"status":"success"},"id":"http://localhost:3000/assets/596e35e4edf877ece4000003","name":"tmp.txt","content_type":"text/plain","file_url":"http://static.dev.gov.uk/media/596e35e4edf877ece4000003/tmp.txt","state":"clean"}
  8. Request the file

Observe that the content matches the content of our updated temporary file that we uploaded in step 7.

$ curl http://localhost:3000/media/596e35e4edf877ece4000003/tmp.txt
Tue 18 Jul 2017 17:25:09 BST

Test behaviour of Asset Manager clients when redirecting to S3

We want to test how Asset Manager's client apps behave when we redirect to assets on S3. This will help give us confidence that it's OK to deploy these changes to staging and production.

Dependencies

  • Making S3 bucket publicly readable as per PR #101
  • Deploying the branch in PR #120 to integration

Client apps of Asset Manager

  • Specialist Publisher
    • Assets are attached to Documents
  • Manuals Publisher
  • Service Manual
    • Assets are uploaded by dragging and dropping them into the text area of a "page"
  • Travel Advice Publisher
    • A single map attachment can be uploaded to a location
    • A single PDF attachment can be uploaded to a location
  • Publisher
    • Assets can be uploaded to Campaigns (which is a retired format)
    • Assets can be uploaded to Videos (which is a retired format)

Add functionality to redirect to the file on S3

We're currently able to stream the asset from S3 but want to compare the performance of redirecting to the asset instead.

We might want to toggle this functionality by using a querystring and/or using an environment variable as we've done for the streaming functionality.
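
A sketch of how such a toggle could look, in the spirit of the existing proxy_via_nginx param (all names here are assumptions; assumes the aws-sdk-s3 gem):

class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])
    if params[:redirect_to_s3] == "true" || ENV["REDIRECT_ALL_REQUESTS_TO_S3"]
      # Redirect the client to a time-limited signed S3 URL.
      redirect_to s3_object_for(asset).presigned_url(:get, expires_in: 60)
    else
      # Existing behaviour: stream the object body through the app.
      send_data s3_object_for(asset).get.body.read, filename: asset.filename
    end
  end

  private

  # Hypothetical helper mapping an asset to its S3 object.
  def s3_object_for(asset)
    Aws::S3::Object.new(ENV["AWS_S3_BUCKET_NAME"], asset.id.to_s)
  end
end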

Investigate the use of S3 versioning to manage updates to assets

The Asset Manager API allows you to update an asset. When updating an asset and replacing it with a file that has a different name the Asset Manager will put a redirect in place from the old asset to the new asset. We want to understand whether it's possible to get this behaviour by using the versioning functionality built into S3.
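
For example, with bucket versioning enabled, the SDK exposes every stored version of a key (a quick exploration script, not app code):

require "aws-sdk-s3"

# List the versions S3 keeps for one object, to see whether old
# content remains addressable after an update.
client = Aws::S3::Client.new
response = client.list_object_versions(
  bucket: ENV["AWS_S3_BUCKET_NAME"],
  prefix: "596e35e4edf877ece4000003"
)
response.versions.each do |version|
  puts "#{version.key} #{version.version_id} #{version.last_modified}"
end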

Asset Manager should generate IDs of objects stored on S3

The objects we write to S3 are currently named using the BSON ID of the corresponding record in the Asset Manager database. We want to decouple the storage of the assets from the storage of their metadata in Mongo. We should generate our own IDs and use those to write the files to S3.
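
A sketch of what that might look like on the model (the field name is an assumption):

require "securerandom"

class Asset
  include Mongoid::Document

  # Storage key generated independently of the Mongo BSON id, so the
  # S3 object name no longer couples asset storage to the database.
  field :uuid, type: String, default: -> { SecureRandom.uuid }
end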

Move non-secret config to govuk-puppet

Some of our non-secret asset-manager configuration options (e.g. the AWS_S3_BUCKET_NAME environment variable) are currently set in govuk-secrets. This is unnecessary (because they're not secret) and makes them hard for us to change (because we don't have access to the secrets repo). We think it's possible to move these config options to govuk-puppet which would allow us to change them when required.

Todo

  • Ask someone to check that the values in my PR match those in govuk-secrets
  • Merge and deploy my PR
  • Ask someone to remove the equivalent values from govuk-secrets

Log Delayed Job output in all environments

It would be useful if the delayed_job logs were available through Kibana so we can see how long virus checking and S3 uploading are taking, and trace any errors.

I asked in Slack and @tuzz said:

It doesn't look like they are. On backend-1.integration I can see /var/log/asset-manager/delayed_job.out.log but there's no JSON version of this.

A kibana query of @fields.application:asset-manager shows all logs for that application, but that only appears to be http requests.

So it looks like we might have to format the logs as JSON and then configure something to forward them to Kibana.
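
A sketch of that last step (the log path and JSON fields are guesses):

require "json"
require "logger"
require "time"

# Point Delayed Job at a logger that emits one JSON object per line,
# which the existing log shipping could then forward to Kibana.
logger = Logger.new(Rails.root.join("log", "delayed_job.json.log"))
logger.formatter = proc do |severity, time, _progname, message|
  JSON.generate(timestamp: time.utc.iso8601, level: severity, message: message) + "\n"
end
Delayed::Worker.logger = logger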

Add CNAME to S3 bucket

We're imagining wanting to do this before serving assets directly from S3, to avoid people being confused by seeing an AWS URL.

Use S3 for storing and serving assets

Things we need to do to get it working end-to-end:

  • Merge changes to Asset Manager
  • Deploy Asset Manager changes to integration
  • Deploy Asset Manager changes to staging and production
  • Merge changes to Terraform
  • Deploy Terraform changes
  • Store the IAM credentials generated in the deployment repository
  • Deploy the IAM credentials
  • Merge changes to puppet
  • Deploy puppet changes
  • (Deploy with hard restart of asset-manager - we think we need to do this so that the asset manager picks up the new environment variables)

Upgrade CarrierWave

The app is currently using v0.10.0 and the latest version is v1.1.0. It would be sensible to upgrade.

Avoid sending S3 errors to the client

In the proxying-to-S3 approach it's possible for errors from S3 to be sent to the client. I guess we'll want to send custom error messages if/when we enable this in production by default.

An illustration of the problem:

# Create temporary file
$ echo `date` > tmp.txt

# Upload file to asset-manager
$ curl http://asset-manager.dev.gov.uk/assets --form "asset[file]=@tmp.txt"

# Request file from S3 via nginx
$ curl "http://asset-manager.dev.gov.uk/media/5995b518759b7443e960b9b0/tmp.txt?proxy_via_nginx=true"
Thu Aug 17 15:22:25 UTC 2017

# Manually remove the file from S3 - to simulate the sort of problem we might encounter

# Request the file from S3 via nginx
$ curl "http://asset-manager.dev.gov.uk/media/5995b518759b7443e960b9b0/tmp.txt?proxy_via_nginx=true"
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>5995b518759b7443e960b9b0</Key><RequestId>5B19547BA6F8A875</RequestId><HostId>5ZkLa2PeoC4B2TzsK93fsJ7VK8c5eR/CSMmzmnuc01IBy53N392vTDZwvkvKwAcYq994bcQeQoU=</HostId></Error>

Ensure Etag from nginx+s3proxy matches Etag from nginx+Sendfile

We're imagining an incremental rollout of the nginx+s3proxy approach to serving assets. The same asset served via the two approaches currently has a different Etag. The Etag generated by nginx appears to be (according to this Serverfault answer) a combination of last modified date and content length. The Etag generated by S3 may or may not be an md5 digest of the content, depending on how the object was created and how it's encrypted.

The same asset having different Etags (e.g. etag-1 and etag-2) will result in some additional load on the asset-manager servers because a conditional GET using etag-1 might end up with a 200 response and etag-2, instead of a 304 response for etag-1.

I can think of two options (there may be others):

  1. Generate the same Etags that nginx currently generates and store them against the objects in S3

    • This assumes that we can set an Etag on S3 and have it sent in the response to the client.
    • The advantage is that Etags remain the same and so caches can be informed that they can continue using a previously cached version of the file
    • The problem might be in generating the Etag value the same as nginx
  2. Change the Etags for the current Sendfile mechanism and then use those same Etags when serving files from S3 (see the sketch after this list)

    • This assumes we can customise the Etag sent from nginx (maybe by setting it in the Rails app) and that we can customise the Etag sent from S3
    • The disadvantage is that all cached content might have to be re-populated because the Etags have changed. If we're going to do this then we should roll it out in nginx and give it at least 24 hours to ensure all caches have picked up the new Etags.
    • The advantage is that we're not tied to the Etag implementation of nginx.
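
A sketch of option 2's app-side half, computing a stable Etag in Rails so the same value could also be stored against the S3 object (Asset#md5_hexdigest is hypothetical):

class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])
    # stale? renders a 304 automatically when the client's
    # If-None-Match matches; otherwise we serve the file with the
    # app-controlled Etag.
    if stale?(etag: asset.md5_hexdigest)
      send_file asset.file.path
    end
  end
end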

Upgrade Airbrake Gem

We're using version 4.0.0 of the Airbrake Gem. This version doesn't report on ActionController::BadRequest exceptions correctly. They're reported in Errbit as Notifications, which isn't very helpful.

Pull request 271 in the Airbrake project appears to have fixed this problem.

The pull request was updated on Aug 26 2014 with a comment saying that the fix would be released the following week. Airbrake version 4.1.0 was released on Sep 4 2014 so I suspect we'll want at least this version.

Note that I've copied the text of this description from alphagov/smart-answers#1681 where we encountered the same problem.

Overwriting existing field file in class Asset

I noticed this warning (?), "Overwriting existing field file in class Asset", in the Rails log. I assume this is because Asset#file is first declared as a Mongoid field (field :file, type: String) and then as an uploader (mount_uploader :file, AssetUploader).

I assume this hasn't had any bad effects, but it does seem a bit odd.

Consider replacing release_1 tag with release_106 tag in remote git repo

Earlier today Jenkins CI lost track of build numbers on the master branch. The next build of the master branch then caused the release_1 tag to be added to this commit. The following build of the master branch tried to add the release_2 tag, but failed because the tag already existed.

Following advice from 2nd line support, I ran a script on Jenkins to update the current build number to 107. I chose 107, because release_105 was the last correctly numbered tag. The next build of the master branch had the correct build number, 107, and successfully added the release_107 tag. So I think everything is now working correctly.

However, it would be nice to replace the release_1 tag with release_106 so that the git history makes sense. I'm slightly uncertain whether doing this is a good idea, in case anything depends on the existing release_1 tag, so I've written this up as a GH issue so we can discuss the problem and decide what to do.

Updating assets not working as expected in production

Assets are cached for 24 hours in production. This caching means that updating assets won't have any effect until the cache has expired.

You can use one of the publishing apps that make use of Asset Manager (e.g. Manuals Publisher) to replicate the problem.

Updating the attachment by uploading a different file with the same name will have no observable effect until the caching of the original asset has expired.

Updating the attachment by uploading a file with a different name correctly uploads the new file but requests for the original attachment are still served from the cache instead of responding with a redirect to the new attachment as expected.

Allow Asset Manager to send location of asset on S3 to nginx

We want to be able to request an asset from Asset Manager and have the app respond with the S3 location of the asset in the X-Accel-Redirect header. This will allow us to configure nginx to serve the file from S3 while keeping the location of the file on S3 hidden.

We're planning to keep the assets on S3 private so we should send a signed URL from Asset Manager.
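
A sketch of the controller side (the cloud-storage-proxy prefix matches the nginx location mentioned in the earlier issue; everything else is an assumption, and it assumes the aws-sdk-s3 gem):

class MediaController < ApplicationController
  def download
    asset = Asset.find(params[:id])
    object = Aws::S3::Object.new(ENV["AWS_S3_BUCKET_NAME"], asset.id.to_s)
    signed_url = object.presigned_url(:get, expires_in: 60)
    # nginx intercepts X-Accel-Redirect and fetches the signed URL
    # itself, so the S3 location is never revealed to the client.
    headers["X-Accel-Redirect"] = "/cloud-storage-proxy/#{signed_url}"
    head :ok
  end
end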

There's a spike of this in PR #126.
