
Comments (19)

chrisroos commented on August 17, 2024

@floehopper and I have been through the 6 AssetManagerAttachment* workers in Whitehall and have confirmed that they appear to be safe to run during this migration (i.e. outside of the context in which they're normally run).

The only one that might be slightly odd is the AssetManagerAttachmentAccessLimitedWorker. It can end up sending access-limited data for a publicly accessible asset if the AttachmentData belongs to a published edition that still has the access_limited flag set. Although potentially confusing, we're not concerned about this as the asset will be public in Asset Manager and so the access-limiting functionality won't be used.

from asset-manager.

chrisroos commented on August 17, 2024

We've discovered that there are a number of AttachmentData records (in integration at least) without a corresponding file on the file system. Requesting a URL for one of these AttachmentData records can result in the user being redirected elsewhere by the BaseAttachmentsController#fail method. When we migrated all the Whitehall attachments to Asset Manager we did it based on files on disk (instead of records in the database) which means that we might need to create assets in Asset Manager representing some of these AttachmentData records. We might also want to delete those AttachmentData records that don't have a corresponding file on disk and that don't redirect the user elsewhere.


chrislo commented on August 17, 2024

I ran the rake task asset_manager:update_attachment_metadata[0,10000] on integration at 16:03:09 on 2018-03-15. It completed at 16:04:12 (~1min).

At peak (16:04:22) the rake task had added 47345 jobs to the asset_migration queue. The queue was empty at 16:29:02.


chrisroos commented on August 17, 2024

I've written a script to save all the AttachmentData URLs to a text file. The idea is that we can request these paths from assets-origin (Whitehall) and draft-assets (Asset Manager) to ensure that the responses are identical. I've not saved the output of the script anywhere public because I'm not sure we want it to be easy for someone to crawl such a list.
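
A minimal sketch of how such a comparison might look, not the script itself. It assumes that matching status, Content-Type and body is enough to call two responses identical. The method names, the keyword arguments and the example hostnames are all hypothetical; the real hostnames are not given in this thread.

```ruby
require "net/http"
require "uri"

# Reduces a response to the parts compared in this sketch.
def summarise(response)
  { status: response.code, content_type: response["Content-Type"], body: response.body }
end

# Fetches `path` from both base URLs and reports whether the summaries match.
# `fetch` is injectable so the comparison can be exercised without a network.
def identical_responses?(path, whitehall_base:, asset_manager_base:,
                         fetch: ->(url) { Net::HTTP.get_response(URI(url)) })
  summarise(fetch.call(whitehall_base + path)) ==
    summarise(fetch.call(asset_manager_base + path))
end
```

With the saved list of paths, each line could be passed to `identical_responses?` along with the two base URLs (for example the hypothetical `https://assets-origin.example.test` and `https://draft-assets.example.test`), and any path that returns false logged for investigation.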


chrisroos commented on August 17, 2024

I've written a script to print the IDs of all the AttachmentData records that don't have a corresponding file on the filesystem. We might need to create assets in Asset Manager for these AttachmentData records if requesting them results in a redirect in Whitehall.
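
A minimal sketch of this kind of check, not the script that was actually run. It assumes each record exposes an `id` and the `path` where its file should live. `Record` and `ids_without_files` are hypothetical names; in Whitehall the records would come from the AttachmentData model rather than a Struct.

```ruby
# Hypothetical stand-in for an AttachmentData record: an ID plus the
# path where its file is expected to live on disk.
Record = Struct.new(:id, :path)

# Returns the IDs of records whose file does not exist on the filesystem.
def ids_without_files(records)
  records.reject { |record| File.exist?(record.path) }.map(&:id)
end
```

Printing the result with `puts ids_without_files(records)` would give one ID per line, which is a convenient format for pasting into a gist.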

2ndline are currently running this script in staging for me.


chrisroos commented on August 17, 2024

We've now got a list of the IDs corresponding to AttachmentData records that don't have files on the filesystem in staging. I've added them to a file on this gist - https://gist.github.com/chrisroos/3c22337d23c015c88ae78efdf516d27a.


chrisroos commented on August 17, 2024

We created https://gist.github.com/chrisroos/5e9bd3e2312cf93da70cfd9fdcae32a1 to identify the subset of Whitehall attachments without files that have also been replaced/redirected to an alternative URL. We ran this script on integration to come up with the 20 Whitehall attachments that we need to create in Asset Manager.


chrisroos commented on August 17, 2024

We've opened alphagov/whitehall#3880 to update the AssetManagerAttachmentMetadataUpdater so that it ignores AttachmentData objects that don't have files on disk. This should be safe to do once we've created the missing Whitehall attachments as part of #528.


chrisroos commented on August 17, 2024

The changes in alphagov/whitehall#3880 and #528 have been deployed to staging and production.

@elenatanasoiu ran the create_missing_whitehall_assets Asset Manager Rake task in both staging and production and it created the 19 missing Whitehall attachments as expected:

16:01:42 Asset identified by /government/uploads/system/uploads/attachment_data/file/2205/business-plan-12.pdf created
16:01:44 Asset identified by /government/uploads/system/uploads/attachment_data/file/2206/business-plan-12-annexes.pdf created
16:01:47 Asset identified by /government/uploads/system/uploads/attachment_data/file/2319/epc.pdf created
16:01:50 Asset identified by /government/uploads/system/uploads/attachment_data/file/9559/att0212.xls created
16:01:53 Asset identified by /government/uploads/system/uploads/attachment_data/file/11510/climate-change-2011-tables-xls.zip created
16:01:56 Asset identified by /government/uploads/system/uploads/attachment_data/file/11723/Water_Efficiency_Calculator_Rev_02.xls created
16:01:59 Asset identified by /government/uploads/system/uploads/attachment_data/file/210359/positive-for-youth-consultation-responses.doc created
16:02:02 Asset identified by /government/uploads/system/uploads/attachment_data/file/210360/positive-for-youth-young-peoples-role-in-its-development.doc created
16:02:05 Asset identified by /government/uploads/system/uploads/attachment_data/file/211761/110713_Local_CO2_NS_-_Annex_A__Statistical_release_.pdf created
16:02:08 Asset identified by /government/uploads/system/uploads/attachment_data/file/211762/110713_Local_CO2_NS_-_Annex_B__Statistical_summary_.pdf created
16:02:11 Asset identified by /government/uploads/system/uploads/attachment_data/file/211763/Full_Dataset.xlsx created
16:02:14 Asset identified by /government/uploads/system/uploads/attachment_data/file/211767/110713_Local_CO2_NS_-_Annex_C__Methodology_summary__.pdf created
16:02:17 Asset identified by /government/uploads/system/uploads/attachment_data/file/211768/110713_Local_CO2_-_Technical_Report.pdf created
16:02:20 Asset identified by /government/uploads/system/uploads/attachment_data/file/211769/110713_LULUCF_Mapping_LULUCF_emissions.pdf created
16:02:23 Asset identified by /government/uploads/system/uploads/attachment_data/file/211883/e0010125-response.pdf created
16:02:25 Asset identified by /government/uploads/system/uploads/attachment_data/file/225280/SIN_Officer_Internal_Advert.docx created
16:02:28 Asset identified by /government/uploads/system/uploads/attachment_data/file/296338/mca-legislation-si.csv created
16:02:30 Asset identified by /government/uploads/system/uploads/attachment_data/file/490383/Local_Plans_Procedural_Guidance.pdf created
16:02:33 Asset identified by /government/uploads/system/uploads/attachment_data/file/627713/National_Lottery_Distribution_Fund_Investment_Account_2016-2017__web_.pdf created
16:02:36 Finished: SUCCESS


chrisroos commented on August 17, 2024

We just ran asset_manager:update_attachment_metadata[0,10000] on staging. It added about 36,000 jobs to the queue and processed all of them in about 6 minutes. There are 8004 AttachmentData records with an ID lower than 10,000, so it processed about 22 AttachmentData records per second (8004 / (6 * 60)).

There are 523134 AttachmentData records in total, so we expect it to take in the region of 6.5 hours to process all of them (523134 / 22 / 60 / 60).
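
The estimate above can be reproduced with a few lines of Ruby. This is only a sketch of the arithmetic, using the figures quoted in this comment; `estimated_hours` is an illustrative name, and the calculation assumes the rate observed on the first 10,000 IDs holds for the whole table.

```ruby
# Figures quoted in the comment above.
records_processed = 8004     # AttachmentData records with an ID below 10,000
elapsed_seconds   = 6 * 60   # roughly 6 minutes
total_records     = 523_134

# Estimates the hours needed to process `total` records at the observed rate.
def estimated_hours(processed, seconds, total)
  rate = processed.to_f / seconds
  total / rate / 3600
end

rate = records_processed.to_f / elapsed_seconds
puts format("%.1f records/s", rate)
puts format("%.1f hours", estimated_hours(records_processed, elapsed_seconds, total_records))
```

Using the unrounded rate (about 22.2 records per second) gives roughly the same answer as the rounded figure: a little over 6.5 hours.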


chrislo commented on August 17, 2024

We've made (and deployed to integration) a change to the task we'll use for this migration - it now calls the various update workers synchronously so that we can ensure the delete worker is called last for each attachment.

I ran asset_manager:update_attachment_metadata[0,10000] again on integration, and verified there were no exceptions caused by workers attempting to call the Asset Manager API for deleted assets. There were ~6.5k jobs on the queue at peak, and the queue was emptied in 10 minutes.
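
As a rough illustration of the ordering guarantee described above (not the actual task code), the sketch below runs a set of update workers one after another for a single attachment and only then runs the delete worker. `FakeWorker` and `update_metadata` are hypothetical names standing in for the real AssetManagerAttachment* workers and the Rake task's internals.

```ruby
# Hypothetical stand-in for one of the AssetManagerAttachment* workers.
# Each call is appended to a shared log so the call order is visible.
class FakeWorker
  def initialize(name, log)
    @name = name
    @log = log
  end

  def perform(attachment_data_id)
    @log << [@name, attachment_data_id]
  end
end

# Runs every update worker in turn for one attachment, then the delete
# worker, so the delete cannot overtake an update for the same record.
def update_metadata(attachment_data_id, update_workers:, delete_worker:)
  update_workers.each { |worker| worker.perform(attachment_data_id) }
  delete_worker.perform(attachment_data_id)
end
```

If each worker were instead enqueued as its own independent job, the order in which they ran would not be guaranteed, which is presumably how updates ended up being attempted against already-deleted assets.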


chrisroos commented on August 17, 2024

We've just run asset_manager:update_attachment_metadata[0,10000] on staging. There were about 6800 jobs on the queue at peak and the queue was empty within 5 minutes.


chrisroos commented on August 17, 2024

We've just run asset_manager:update_attachment_metadata[0,350000] on staging. It took about 16 minutes to complete the Rake task. It put about 184,000 jobs on the queue at peak. The jobs continue to be processed.

All the jobs have now finished. It took about 2.5 hours to process all the jobs - it started at about 1pm and finished just after 3:30pm.


chrisroos commented on August 17, 2024

I've seen some GdsApi::HTTPUnprocessableEntity exceptions raised during the metadata migration. Some of these were raised because the Asset#replacement can't be found. In at least one of the cases I've investigated, we're trying to update the access-limited state of the asset in Asset Manager but it's failing because the asset's replacement has been deleted. This only appears to affect 32 assets in Asset Manager:

# Collect the IDs of assets whose replacement_id points at an asset that no longer exists.
missing_replacements = []
Asset.where(:replacement_id.nin => ['', nil]).each do |asset|
  missing_replacements << asset.id if asset.replacement.blank?
end
p missing_replacements.size
=> 32

I think it should be possible to update an asset whose replacement has been deleted so I've added #533 to capture this problem.


chrisroos commented on August 17, 2024

Following on from the previous comment, I've also seen some exceptions caused by trying to update an asset to draft when it already has a replacement. This is problematic now that we're running all the AssetManagerAttachment* workers synchronously because a failure in one will mean that we don't run any of the subsequent workers.

I wonder whether we need a single worker that updates the metadata in one go based on the latest state of AttachmentData in Whitehall.
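
To make the failure mode concrete, here is a small sketch that uses lambdas in place of the workers. The step names and both methods are hypothetical. `run_all` mirrors a plain sequence of synchronous calls, where the first exception stops everything after it. `run_independently` is one possible alternative shown purely for contrast; it is not something proposed in this thread, and it would not by itself address the ordering concerns around deletion.

```ruby
# Runs the steps in order and stops at the first exception, which is how a
# plain sequence of synchronous calls behaves. Returns the completed steps.
def run_all(steps)
  completed = []
  steps.each do |name, step|
    step.call
    completed << name
  end
  completed
rescue StandardError
  completed
end

# Runs every step regardless of earlier failures and records which ones failed.
def run_independently(steps)
  result = { completed: [], failed: [] }
  steps.each do |name, step|
    begin
      step.call
      result[:completed] << name
    rescue StandardError
      result[:failed] << name
    end
  end
  result
end
```

With three steps where the second raises, `run_all` completes only the first, while `run_independently` completes the first and third and reports the second as failed.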


chrisroos commented on August 17, 2024

We ran asset_manager:update_attachment_metadata[350000,700000] on staging earlier today. It took about 35 minutes to run the Rake task and it added about 293,000 jobs to the queue at peak. The queue was empty within about 4 hours (from 15:15 to 21:15).

Unfortunately, we've got about 1500 jobs on the retry queue. According to the exceptions in Sentry the two most common causes of these errors are:

  • GdsApi::HTTPUnprocessableEntity (2,200 exceptions) when we're trying to set an asset to be draft in Asset Manager when that asset has already been replaced.
  • GdsApi::HTTPNotFound (1,400 exceptions) when we're trying to replace one asset with another in Asset Manager, and we can't find the replacement.

I think we're going to need to work out how to resolve these before we can run this in production.


chrisroos commented on August 17, 2024

We ran asset_manager:update_attachment_metadata[0,350000] on production at about 15:20 today. It took about 15 minutes to run the task and added about 196,000 jobs to the queue at peak.

It took about 4 hours to process all the jobs on the queue (from about 15:30 to 19:30).


chrisroos commented on August 17, 2024

We ran asset_manager:update_attachment_metadata[350000,700000] on production at about 13:40 yesterday. It took about 25 minutes to run the task and added about 320,000 jobs to the queue at peak.

It took about 7.5 hours to process all the jobs on the queue (from about 14:00 to 21:30).
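
For a rough comparison of the four large runs reported in this thread, the sketch below divides each peak queue size by the stated processing time. This is only an approximation: peak queue size is a proxy for the total number of jobs, the durations are the rounded figures quoted in the comments, and `jobs_per_second` is an illustrative name.

```ruby
# Peak queue sizes and approximate processing times quoted in this thread.
# Peak queue size is only a rough proxy for the total number of jobs.
runs = [
  { env: "staging",    range: "0-350000",      jobs: 184_000, hours: 2.5 },
  { env: "staging",    range: "350000-700000", jobs: 293_000, hours: 4.0 },
  { env: "production", range: "0-350000",      jobs: 196_000, hours: 4.0 },
  { env: "production", range: "350000-700000", jobs: 320_000, hours: 7.5 }
]

# Converts a job count and a duration in hours into jobs per second.
def jobs_per_second(jobs, hours)
  jobs / (hours * 3600.0)
end

runs.each do |run|
  rate = jobs_per_second(run[:jobs], run[:hours])
  puts format("%-10s %-14s %5.1f jobs/s", run[:env], run[:range], rate)
end
```

On these figures staging worked through roughly 20 jobs per second in both runs, while production managed roughly 12 to 14.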


thomasleese commented on August 17, 2024

This has been done now.

