
Collection of miscellaneous scripts for working with openEQUELLA

Home Page: https://vault.cca.edu/

License: ECL Version 2.0


openEQUELLA Scripts

Various scripts used in VAULT (our openEQUELLA instance). The scripts are categorized by where they're used:

asc contains Advanced Scripting Controls used in Contribution Wizards.

bookmarklets are browser-side JS meant to aid in copying useful oE URLs.

fine-arts-jr-review is a script for biannual updates to certain data in oE.

retention scripts are tools for applying the VAULT Retention Policy.

user-scripts contains bulk metadata update tools meant to be run from the Manage Resources section.

utilities are scripts for bulk processing that Manage Resources cannot handle, such as downloading files, manipulating user groups, or generating taxonomies. They often interact with oE via its REST API.

Setup

Most of these are Node scripts which share dependencies. Run pnpm install or npm install to install them.

The fine arts junior review directory is a Python project managed with Poetry. Run poetry install to get the dependencies.

Most tools will require their own rc file with settings, OAuth tokens, and other secrets. Each directory has an example and a readme with instructions.

Notes

openEQUELLA's internal JavaScript engine is most likely Mozilla's Rhino, for what it's worth. That means modern JavaScript features (e.g. ES5 stuff like Array#map) are not available.
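
For example, where a modern script would reach for Array#map, these scripts fall back to ES3-style loops (the items array and getName method here are hypothetical):

    // ES3-compatible stand-in for items.map(function (i) { return i.getName(); })
    var names = [];
    for (var i = 0; i < items.length; i++) {
        names.push(items[i].getName());
    }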

Probably the biggest gotcha I've found working with openEQUELLA's JavaScript is that the return value of xml.get on an empty metadata node is not strictly equal to empty string (xml.get('thisdoesnotexist') !== ""). That's why conditions throughout these scripts employ != or == when checking against strings returned by xml.get.
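
A minimal illustration (the metadata path is made up):

    var title = xml.get('mods/titleInfo/title');
    // strict comparison misbehaves: this runs even when the node is empty,
    // presumably because xml.get returns a Java string rather than a JS string
    if (title !== '') { /* true even for empty nodes -- avoid */ }
    // loose comparison coerces the value and behaves as expected
    if (title != '') { /* node actually has a value */ }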

Testing

npm test runs the retention procedures' tests. They require a separate .equellarc file specific to the tests at the path retention/test/.testretentionrc (there is an example file provided). npm run csvtest runs the metadata-csv tests and npm run grouptest runs the utilities/group.js tests.

As tests are added to other utilities, they will need to be run à la carte. I usually work on one utility at a time and it doesn't make sense to run tests over all of them, especially because they tend to involve HTTP requests and thus are quite slow. View the package.json scripts for shortcuts to different tools' tests.

LICENSE

ECL Version 2.0


Issues

script to download specific faculty member's syllabi

See John Jenkins' email from 2021-05-10: it would be helpful for faculty to have a way to bulk retrieve all their syllabi from VAULT, but the search interface doesn't support this. I could adapt the existing syllabi download tool in this repository to accept a faculty member's name or list of names, use the search API to retrieve all relevant items in the Syllabus Collection, download all the files (using the same section code naming convention), and then compress them into a zip file for sharing.
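
A rough sketch of the search step, assuming node-fetch and the oE search REST route (the collection UUID, status value, and paging are placeholders to verify against the API docs):

    const fetch = require('node-fetch');

    const SYLLABUS_COLLECTION_UUID = '...'; // placeholder

    // fetch the first page of live Syllabus Collection items matching a name
    async function facultySyllabi(name, token) {
        const params = new URLSearchParams({
            q: name,
            collections: SYLLABUS_COLLECTION_UUID,
            status: 'LIVE',
            length: 50,
        });
        const res = await fetch(`https://vault.cca.edu/api/search?${params}`, {
            headers: { 'X-Authorization': `access_token=${token}` },
        });
        const data = await res.json();
        return data.results; // page through data.available in a real run
    }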

retention - don't link to non-live items

Linking to non-live (e.g. DRAFT, SUSPENDED, ARCHIVE) items is probably more confusing than it is useful for end users and can cause the number of items in an email to become intimidatingly large. Consider someone who drafted an item, published it, then created a few new versions: they'd see several links for the same item and perhaps be confused about why, and about which one was the final version.

This should be a fairly simple array filter in the mailUser function.
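
Something like this, assuming each item object carries a status field:

    // keep only live items before building each owner's email
    var liveItems = items.filter(function (item) {
        return item.status === 'live';
    });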

port Makefile over to package.json scripts

Probably makes more sense to install uglify-js as a dev dependency, write the current make tasks into package.json's "scripts" section, and use npm run. This is a more common workflow for me and will allow adding new NPM tools in the future.
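
A sketch of what that could look like (the file paths are placeholders):

    {
      "devDependencies": {
        "uglify-js": "^3.17.0"
      },
      "scripts": {
        "build": "uglifyjs src/bookmarklet.js --compress --mangle -o dist/bookmarklet.min.js"
      }
    }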

retention - break items/emails into manageable chunks

We have some 28,000 items to remove with over 2,200 different owners. I don't want to send our emails to everyone right off the bat, both to judge the impact on my workload and to provide a chance to spot areas for improvement. It'd be nice to chunk the work into sets of emails, but we can't naively split the 28,000 items (the first 2,000, then the next, etc.) because owners whose items are spread throughout the set would receive multiple emails. We want to first group all the items by owner, then work through the owners in smaller groups.

The easiest way to do this is probably a new script that splits the huge items.json output of node ret.js into smaller JSON files such that no owner has items in more than one of the split files.
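
A sketch of that script (the owner field name is an assumption):

    const fs = require('fs');

    const items = JSON.parse(fs.readFileSync('items.json', 'utf8'));

    // bucket items by owner so one owner never spans two output files
    const byOwner = new Map();
    for (const item of items) {
        const key = item.owner.id; // field name is an assumption
        if (!byOwner.has(key)) byOwner.set(key, []);
        byOwner.get(key).push(item);
    }

    const OWNERS_PER_FILE = 100; // tune as we gauge workload
    const owners = [...byOwner.keys()];
    for (let i = 0; i < owners.length; i += OWNERS_PER_FILE) {
        const chunk = owners.slice(i, i + OWNERS_PER_FILE)
            .flatMap((o) => byOwner.get(o));
        fs.writeFileSync(`items-${i / OWNERS_PER_FILE}.json`,
            JSON.stringify(chunk, null, 2));
    }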

switch from request to node-fetch

Many newer scripts use node-fetch instead of request but the majority do not. Since request is deprecated and fetch has an API in browsers too, it makes sense to switch. The process isn't too tricky. These are the main things to keep in mind (with a sketch after the list):

  • going from the callback usage of request to promises in fetch
  • request had a nice json: true option which did a few things that have to be done individually in fetch
    • stringify the JSON payload in fetch options, e.g. body: JSON.stringify(data)
    • add a Content-Type: application/json header
  • we often specify a custom HTTP agent so we can set how many connections are used at once; need to confirm how this works with fetch
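
A sketch of the translation (url, data, and the connection cap are placeholders; node-fetch does accept the same agent option):

    const fetch = require('node-fetch');
    const https = require('https');

    const url = 'https://vault.cca.edu/api/...'; // placeholder
    const data = { example: true };              // placeholder payload

    // before: request.post({ url: url, json: true, body: data, agent: agent }, cb)
    const agent = new https.Agent({ maxSockets: 4 }); // cap concurrent connections

    fetch(url, {
        method: 'POST',
        agent: agent,
        headers: { 'Content-Type': 'application/json' }, // json: true did this...
        body: JSON.stringify(data),                      // ...and this
    })
        .then((res) => res.json())                       // ...and parsed the response
        .then((body) => { /* formerly the request callback body */ });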

migrate to eslint

Stop using jshint and switch to eslint. It shouldn't be a hard transition, and eslint probably recognizes that static class properties are OK (see retention/item.js Item.CSVHeaderRow).
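
A minimal .eslintrc.json starting point (rule choices are not settled; es2022 covers static class properties):

    {
      "env": { "node": true, "es2022": true },
      "parserOptions": { "ecmaVersion": 2022 },
      "extends": "eslint:recommended"
    }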

export: test on item with zip archive

The collection export tool assumes a flat hierarchy of attachments such that we can simply download them all into the same directory. This appears to be true for items without folders, but what about ones where a .zip archive is uploaded and unpacked? What do the attachments.filename strings look like then? Are they paths with directory separators in them? This requires some research.

retention: alumni emails

Students now lose their CCA Gmail addresses three to six months after they leave the college. Since we are following up about items that were contributed six years ago, it's highly likely everyone's CCA email will be defunct by then.

The offboarding recommendations for students tell them to ensure the alumni office has an accurate email; maybe we can get access to this data so we can map CCA usernames to alumni emails.

We should also add a note about exporting VAULT items to the Technical Offboarding for Students page.

retention: make an exception for VCS theses

I already wrote the exempt.js script to remove the VCS theses from the retention files I'd already generated, but we also do not want to remove copies in future retention runs.

There is an open question in that the VCS collection appears to be only PPD records and theses. So rather than create a new exception in Item.js, I could just make the VCS collection itself exempt. I do want to clean up the pile of "alumni success" PPD records that have no real data associated with them, but that's for another time.
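
If we go the exempt-collection route, it could be as small as this (the collection field and UUID are placeholders):

    // drop anything in an exempt collection before building removal lists
    var EXEMPT_COLLECTIONS = ['<vcs-collection-uuid>'];
    var toReview = items.filter(function (item) {
        return EXEMPT_COLLECTIONS.indexOf(item.collection.uuid) === -1;
    });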

metadata-csv modify tests

For the XML / item operations, it makes sense to have tests. We want to make sure these work, and we don't have to use the API or mock objects to do so.

retention - script to delete weeded items

This step should be incredibly simple: accept some kind of JSON file of items or simply item UUIDs and iterate over each, using the DELETE method of the Items REST API.
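
A sketch, assuming the input JSON carries uuid and version fields:

    const fetch = require('node-fetch');

    const items = require('./deleted-items.json'); // path is a placeholder

    // delete each weeded item via DELETE /api/item/{uuid}/{version}
    async function deleteAll(token) {
        for (const { uuid, version } of items) {
            const res = await fetch(`https://vault.cca.edu/api/item/${uuid}/${version}`, {
                method: 'DELETE',
                headers: { 'X-Authorization': `access_token=${token}` },
            });
            if (!res.ok) console.error(`failed on ${uuid}/${version}: ${res.status}`);
        }
    }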

retention - data improvements

For users and collaborators that are internal users, the oE API reports their UUID and not their username, which is opaque. Similarly, collections are reported by UUID rather than name. We could use the additional user and collection API routes to look up names given UUIDs, which would make our data more legible.
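
For collections, the lookup could look like this (the route should be double-checked against the REST docs; a similar user route presumably exists):

    const fetch = require('node-fetch');

    // cache names so we only hit the API once per collection UUID
    const collectionNames = new Map();

    async function collectionName(uuid, token) {
        if (!collectionNames.has(uuid)) {
            const res = await fetch(`https://vault.cca.edu/api/collection/${uuid}`, {
                headers: { 'X-Authorization': `access_token=${token}` },
            });
            collectionNames.set(uuid, (await res.json()).name);
        }
        return collectionNames.get(uuid);
    }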

export: check downloaded attachments against their size in metadata

In testing the collection export tool on a set of 26 items, mostly images with some large TIFFs, some files were downloaded but in a partial or malformed state. It's immediately evident from viewing the files that they're malformed, but there weren't any obvious errors during the script's execution that highlighted the problem.

In the absence of checksums, perhaps we could use the file size (item.attachments.size) to validate whether each attachment was successfully downloaded. In testing, the size in bytes on my laptop was identical to the size in VAULT's attachment info, but I also believe some attachments do not have a size property. Ideally, the collection script would do this as it downloads attachments, but it might be easier to write a separate validation script (which could also perform other checks, e.g. that the item's metadata files are present and valid).
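
A sketch of the per-item check, using the filename and size fields from the attachments info:

    const fs = require('fs');
    const path = require('path');

    // flag downloads whose on-disk size differs from the size the API reported
    function checkSizes(item, dir) {
        for (const att of item.attachments || []) {
            if (typeof att.size !== 'number') continue; // some attachments lack a size
            const file = path.join(dir, att.filename);
            if (!fs.existsSync(file) || fs.statSync(file).size !== att.size) {
                console.warn(`possible bad download: ${file}`);
            }
        }
    }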

create 'remove from group' script

We should have a 'remove from group' script analogous to the add to group one. It'd be useful in scenarios where someone has left CCA; we can check VAULT's diagnostics to see what groups they're in and then remove them without using the admin console at all.

export: folder name collision with --name option

See the @TODO in the collect.js script: if the --name option is passed and an item's directory is based on its title, collisions can occur. The way the script is structured, it's not easy to fix (some kind of recursive check for the dir and any others with appended integers? see the sketch below). An alternative might be to download to UUID dirs first, then use a script to rename them (e.g. an eq item shell script).
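
One low-tech version of that check:

    const fs = require('fs');

    // append an integer until we find a directory name that's free
    function uniqueDir(base) {
        let dir = base;
        for (let n = 2; fs.existsSync(dir); n++) {
            dir = `${base} (${n})`;
        }
        return dir;
    }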

export collection to CSV

Write a script to export a collection's metadata to a CSV. Our initial use case is the Hamaguchi Collection but we will probably end up doing this again. The script can use the Search API route limited to live items in a given collection. We will need to identify particular metadata nodes (e.g. date, author, title, dimensions) to export and use XML parsing to extract them.

Finally, as an added bonus, it might be nice to have an option to also download the files associated with each item and write their location into the CSV output. I'm thinking you could write files to subfolders like "attachments/$UUID/$VERSION/filename" e.g. "attachments/5b388638-3a2a-ddd5-9161-7c1d78126840/2/p_atlan18.jpg".

retention - email notifications for items to be removed

Create a bulk email notice routine, perhaps similar to how we do syllabus reminders. High-level outline:

  • iterate over all the items to be removed
  • collect them into buckets for each owner (question for the future when collaborators start to appear: do we email all collaborators of an item or only the true owner? probably has to be everyone)
  • format an email to the owner with a list of their items that will be removed and link to the Retention Policy
  • include instructions on how to export these items
  • ensure the mail routine can scale to hundreds or thousands of messages (use Mailgun?)

Questions

Who should the email's reply-to address be? Probably [email protected], right? We are talking about tens of thousands of items and thousands of users; individualized support will not be possible.

The syllabus reminders use a Python SMTP script run from a local development web server so that the messages are not flagged by Google as suspicious. However, we would really like to avoid adding another language/tech stack to this project, which is otherwise committed to Node. We need to test and find an SMTP library for Node that can send emails without them being flagged by Google.

Bonus: HTML formatted emails instead of plain text. We could hyperlink item titles instead of printing out their gross-looking URLs.
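
nodemailer is one widely used Node SMTP library worth evaluating (whether Google flags the mail likely depends on the sending setup more than the library). A minimal sketch including the HTML bonus; every address, host, and field name here is a placeholder:

    const nodemailer = require('nodemailer');

    const transporter = nodemailer.createTransport({
        host: 'smtp.example.edu',             // placeholder
        port: 587,
        auth: { user: 'user', pass: 'pass' }, // placeholders
    });

    // hyperlink item titles instead of printing their gross-looking URLs
    async function sendNotice(owner, items) {
        const list = items
            .map((i) => `<li><a href="${i.url}">${i.title}</a></li>`)
            .join('');
        await transporter.sendMail({
            from: 'vault@example.edu', // placeholder
            to: owner.email,
            subject: 'VAULT items scheduled for removal',
            html: `<p>Per the Retention Policy, these items will be removed:</p>
                   <ul>${list}</ul>`,
        });
    }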

retention: re-pack script for split data files

It would be nice to have a utility to combine all of the "deleted-items-X.json" files I have, possibly paired with a better plan for organizing all the scattered notes and data files from the retention process. Or is re-packing the chunked items unnecessary because we still have the original JSON file of all items? Either way, make a decision and document it.
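
If we do re-pack, the script is tiny:

    const fs = require('fs');

    // merge every deleted-items-N.json in the current directory into one file
    const merged = fs.readdirSync('.')
        .filter((f) => /^deleted-items-\d+\.json$/.test(f))
        .flatMap((f) => JSON.parse(fs.readFileSync(f, 'utf8')));
    fs.writeFileSync('deleted-items-all.json', JSON.stringify(merged, null, 2));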

Hamaguchi bulk modifications

https://docs.google.com/spreadsheets/d/1JherZkcsrvVmPfz7Xtd-B2wqxnpfw88kv5F6J4ixzvI/edit?usp=sharing

Find a way to apply the changes from a spreadsheet to EQUELLA now that EBI is defunct. We could modify records using the API or Manage Resources; we should research both methods to see whether they suit different use cases. Tools developed can go under utilities/metadata-csv.

We need to map the columns in the spreadsheet back to VAULT metadata fields, see hamaguchi-map.json. For the location details for "Move to studio" items, set Location = Printmedia Studio and remove other location details (drawer etc. are not accurate).

For the various color codes and statuses in the spreadsheet, here's what we need to do (sketched after the list):

  • Yellow (missing) or Deaccessioned status: delete record
  • Keep / keep for now / move to studio: update record
  • Salmon: new record
  • Green / Blue: requires research
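
As a sketch (the spreadsheet field names are assumptions):

    // map a spreadsheet row to one of the actions above
    function rowAction(row) {
        if (row.color === 'yellow' || row.status === 'Deaccessioned') return 'delete';
        if (/keep|move to studio/i.test(row.status)) return 'update';
        if (row.color === 'salmon') return 'create';
        return 'research'; // green / blue
    }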

Tasks

  • delete removed records
  • update records moved to Printmedia Studio
  • create new records - metadata-only, we do not have photos for them
  • update changed records - script is ready but we do not know where in FArchives the prints will go so we cannot complete mods/location/copyInformation
