VAULT Question Mark Fix

Problem: VAULT was mangling the names of uploaded files, replacing some Unicode characters with question marks. This typically happened with accented letters, smart quotes, and Chinese characters.

With Edalex's help, we tracked the problem down to the -Dfile.encoding=UTF8 setting in the JAVA_OPTS being passed to the app's JVM; removing that from the variable in {equella root}/manager/equellaserver-config.sh fixed the issue. That change prevents the problem in future uploads, but unfortunately we still have over a thousand files where this mangling might have happened.

Solution:

  1. find files with question marks in them in the data directory
  2. identify the actual names of these files (note: some question marks will be literals and the file does not need to be renamed)
  3. generate a shell script to rename the affected files
  4. run the script on the server & spot check several affected items
  5. (optional) under Manual Data Fixes run the Generate thumbnails and previews task to generate thumbnails for all the affected items

Setup

Copy a VAULT OAuth token with the necessary permissions to a file named ".token" in the root of this project.

Run the "find.sh" script on the server and download its output text file.

Run pnpm install (or npm install if that's your jam) to get node dependencies.
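Since the later lookup works per item, it can help to group the downloaded list by directory first. The sketch below is not taken from index.js. It assumes the find.sh output has one file path per line, which may not match the real format, and groupByDirectory is a hypothetical name.

```javascript
// Group file paths by their parent directory.
// Assumption: "listing" holds one path per line with "/" separators;
// the real find.sh output format may differ.
function groupByDirectory(listing) {
  const groups = new Map();
  for (const rawLine of listing.split('\n')) {
    // Drop a trailing carriage return but keep other whitespace,
    // since spaces can be a legitimate part of a filename.
    const filePath = rawLine.replace(/\r$/, '');
    if (filePath === '') continue;

    const slash = filePath.lastIndexOf('/');
    let dir = '.';
    if (slash > 0) dir = filePath.slice(0, slash);
    else if (slash === 0) dir = '/';
    const name = filePath.slice(slash + 1);

    if (!groups.has(dir)) groups.set(dir, []);
    groups.get(dir).push(name);
  }
  return groups;
}
```

Each key of the returned Map is a directory and each value is the list of filenames found in it, so all question-mark files for one item can be compared against that item's attachments in a single pass.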

Execution

Once we have the list of filenames with question marks in them, we can look up each item's attachment data with VAULT's REST API and attempt to figure out what the filename is supposed to be. The index.js script attempts this procedure with a few caveats:

  • some filenames have literal question marks in them & don't need to be changed
  • sometimes multiple attachments on the same item have question marks in the same positions of their name such that it's impossible to tell them apart from filenames alone, e.g., both "布1.jpg" and "无1.jpg" are mangled to "?1.jpg"
  • there are extraneous files in some item directories which are not listed in their attachments (example: the .psd files of this item)
  • it seems to be possible to have multiple attachments with the same name, at least in the metadata (example)
  • unpacked zip archives may have mangled filenames but only the .zip itself is present in metadata
  • there are vagaries with Unicode and regular expressions that I probably do not understand
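To make the first two caveats concrete, here is a minimal sketch of one way to match a mangled name against an item's attachment filenames. It is not the code in index.js, and the function names are hypothetical. It assumes each "?" stands in for exactly one Unicode code point, which may not hold for every mangled character (a letter stored with a separate combining accent, for instance).

```javascript
// Escape characters that are special inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern in which each "?" matches exactly one code point.
// The "u" flag makes "." consume a whole code point instead of a UTF-16
// code unit, and the "s" flag lets "." match line terminators as well.
// A literal "?" is still matched, because "." matches "?" too.
function patternFor(mangledName) {
  const source = mangledName.split('?').map(escapeRegExp).join('.');
  return new RegExp(`^${source}$`, 'su');
}

// Decide what to do with one mangled filename, given the filenames
// listed in the item's attachment metadata.
function resolve(mangledName, attachmentNames) {
  const pattern = patternFor(mangledName);
  // A Set collapses attachments that share the same name.
  const matches = [...new Set(attachmentNames.filter((n) => pattern.test(n)))];

  if (matches.length === 1) {
    return matches[0] === mangledName
      ? { status: 'literal', name: mangledName } // "?" really is in the name
      : { status: 'rename', name: matches[0] };  // safe to rename
  }
  // Zero matches (extraneous file) or several (ambiguous) need a human.
  return { status: matches.length === 0 ? 'unmatched' : 'ambiguous', matches };
}
```

With this sketch, resolve('?1.jpg', ['布1.jpg', '无1.jpg']) reports an ambiguous result with both candidates, while a file that is not listed in the attachments at all comes back as unmatched. Both cases belong in the list of complications rather than the rename script.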

In the end, we should generate both a list of mv commands to rename affected attachments and a list of complications that'll require manual intervention. The whole process is merely node index qmark-files.txt > rename.sh 2> errors.txt, where qmark-files.txt is the text output mentioned under Setup, rename.sh is the renaming script to run on the server, and errors.txt is the list of complications. You may want to sort rename.sh so it operates on each item in sequence; since a lot of processing happens in parallel, the output of index.js is not guaranteed to be in any particular order.
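Because the real filenames contain quotes, spaces, and non-ASCII characters, each mv command needs careful shell quoting. The following is a rough sketch of how such commands could be built and ordered. It is not the actual index.js code, and shellQuote, mvCommand, and sortCommands are hypothetical names.

```javascript
// Quote a string for a POSIX shell: wrap it in single quotes and rewrite
// each embedded single quote as '\'' (close quote, escaped quote, reopen).
function shellQuote(text) {
  return `'${text.replace(/'/g, `'\\''`)}'`;
}

// Build one mv command for a file inside "dir" (the item's directory).
// "--" stops option parsing, in case a path ever begins with a dash.
function mvCommand(dir, mangledName, realName) {
  const from = `${dir}/${mangledName}`;
  const to = `${dir}/${realName}`;
  return `mv -- ${shellQuote(from)} ${shellQuote(to)}`;
}

// Return a sorted copy so commands for the same directory are adjacent,
// regardless of the order in which parallel lookups finished.
function sortCommands(commands) {
  return [...commands].sort();
}
```

In this arrangement the commands would be printed with console.log so they land in rename.sh, and complications with console.error so they land in errors.txt, matching the redirection shown above.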

LICENSE

ECL Version 2.0

Contributors

phette23
