
Comments (21)

marcolarosa commented on July 18, 2024

@ptsefton

This is potentially a very expensive operation. Rclone doesn't have a stat command that can be called on a remote file, so the only way to get the timestamp seems to be to copy the file back to the server and perform the stat locally before doing the update. Given that describo writes the revised file back to the storage on every change, this sequence of operations will slow things down considerably.

The most sensible way to implement this is to alter describo to save to the backend periodically (e.g. every few minutes) if the data has changed, rather than immediately on each change. That introduces the issue of data in the DB not being immediately synced to the backend, so we end up with two sources of truth that can diverge. And it's a reasonably big architectural change: there would need to be some kind of queue in which to register file saves, with retry and asynchronous failure reporting.
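Such a write-behind saver might be sketched as follows (a sketch only; `CrateSaver`, `writeFn` and the interval are hypothetical names, not part of describo):

```javascript
// Sketch of a periodic write-behind saver: the DB stays the source of
// truth, and the backend is only written every `intervalMs` if the
// crate changed. All names here are hypothetical, not describo's API.
class CrateSaver {
    constructor(writeFn, intervalMs = 120000) {
        this.writeFn = writeFn;   // async fn that pushes a crate to storage
        this.intervalMs = intervalMs;
        this.pending = null;      // latest unsaved crate, or null if clean
        this.timer = null;
    }
    markDirty(crate) {
        // Coalesce rapid edits: only the most recent crate is kept.
        this.pending = crate;
    }
    async flush() {
        if (this.pending === null) return;
        const crate = this.pending;
        this.pending = null;
        try {
            await this.writeFn(crate);
        } catch (err) {
            // On failure, re-queue (unless a newer edit already arrived)
            // so the next tick retries.
            if (this.pending === null) this.pending = crate;
            throw err;
        }
    }
    start() {
        this.timer = setInterval(() => this.flush().catch(() => {}), this.intervalMs);
    }
    stop() {
        clearInterval(this.timer);
    }
}
```

Between flushes the DB is ahead of the storage backend, which is exactly the "two sources of truth" trade-off described above.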

from describo-online.

ptsefton commented on July 18, 2024

ATM, if you change the file under Describo, the changes are silently lost - this is not great. The workaround is to delete the describo ID that gets added to the crate, and then it will re-load (but does this leave cruft in the DB?)

marcolarosa commented on July 18, 2024

Yep - this is definitely an issue and there are a few problems to consider.

The cruft in the DB is the smallest one. As of 635c07e there is a periodic process that deletes collections that haven't been touched in some definable period of time. That is, whenever you make a change to a crate, the updatedAt time of the collection is updated. The default is to clean up anything older than 3 days.

The reason describo works like this is more complex.

Since describo works with shared user storage, and since we can't know if two users are working in the same folder at the same time, we need to load a crate into the DB and then that becomes the single source of truth for any user that can see the crate (on disk) with that describoId in it. That's why the DB exists.

If I load in the crate off disk every time, then two users will see different views of the data in the DB that would then keep overwriting each other on disk. I don't know that there is an acceptable solution to this problem. What are your thoughts?

ptsefton commented on July 18, 2024

Maybe we should not be serializing the crate to disk (at least not with the official RO-Crate file name). As I have often said, there's not really a use-case for describing data that sits at rest in a user-accessible file system. The whole point was to allow packaging for dissemination via a repository. So maybe Describo writes an ID of its own, and makes sure the crate is written back to the storage before deleting anything from the DB.

EDIT: In this comment I was working on the assumption that if we are unable to tell the file modification time of an RO-Crate using Rclone, then we might need to re-look at the way Describo stores crate info back to the storage, so that we reduce the likelihood of some other tool or human changing the crate file. I was not intending to rewrite history, or to imply that anyone had done anything wrong in implementing Describo.

marcolarosa commented on July 18, 2024

I don't understand what you are suggesting or how it solves the issue of the crate file changing on disk and that not being reflected in the DB whilst not creating a race condition between multiple users of the folder.

marcolarosa commented on July 18, 2024

> As I have often said, there's not really a use-case for describing data that sits at rest in a user-accessible file system.

This statement is patently false. Describo Online exists because you wanted an online system that users could use to describe data in OneDrive. OneDrive was the first storage resource it was connected to, and that is why we have a DB in the middle of the picture. We can't know who is using a OneDrive folder or whether it's shared, so all we can do is load the crate file into the DB and then have all users of the folder work with that version (i.e. if they can see the crate file on disk and can get a describo id out of it, then that points them to the DB version going forward).

Describo Desktop managed the data internally - i.e. in the crate file itself - by adding and manipulating entities directly. This was ok (though it didn't scale to hundreds of thousands of entities like the DB version does) because we assumed that, being a desktop app, the user was operating on data only they could see. But this is not true either if they connected to a locally mounted OneDrive/Dropbox/whatever shared folder that was visible to someone else. In that case, desktop would suffer the same problem: who has the main copy of the data, and what gets serialised to disk on change?

ptsefton commented on July 18, 2024

I guess I should have said "sits at rest indefinitely". I have tried to be very clear that Describo was originally intended to describe data as part of a repository deposit or publishing process - and I stand by the statement that that is not achieved by leaving a crate on a user-accessible file system.

When I was working at UTS we commissioned you @marcolarosa to write Describo Online for describing data in OneDrive, that's true - but the point of the exercise was to allow users to describe data and deposit it in a repository for safe-keeping and dissemination. As far as I know that work is not complete at UTS, but once we had that working, we would have been looking at options like allowing the user to delete the OneDrive copy, making data (including the ro-crate-metadata.json file) read-only in the interests of encouraging use of data from trusted repositories and managing disk usage.

It is possible that when we jointly designed Describo Online and had it write an RO-Crate file back to OneDrive, we did not understand the limitations of that scenario (e.g. I gather from what you have said @marcolarosa that even getting file modification times is problematic). I certainly didn't understand how different it would be from having access to a local file system.

EDITED TO ADD: I want to make it clear that @marcolarosa did point out the limitations of storing the RO-Crate file on disk in detail when we started on Describo Online.

To clarify what I said above: maybe we should consider having Describo still store a file, but make it less inviting to edit for RO-Crate aficionados - eg call it DESCRIBO_DO_NOT_EDIT-ro-crate-metadata.json and make the file minified rather than pretty printed. This would then be written to a 'proper' RO-Crate metadata file on export / repository deposit.

ptsefton commented on July 18, 2024

@marcolarosa Re Rclone and file stats - it does seem to be able to get file modification times, at least from Dropbox. Or is this not sufficient?

> rclone lsjson  dropbox:/pt/kp/rb.sh
[
{"Path":"rb.sh","Name":"rb.sh","Size":53,"MimeType":"application/x-sh","ModTime":"2018-05-28T04:37:09Z","IsDir":false,"ID":"id:mDmSon1L514AAAAAAAURWQ"}
]

UPDATE: there's a summary here: https://rclone.org/overview/ - it shows that most back ends do support mod times, and also checksums (which would be another way to check for conflicts). I think all the back ends we are using support this. @marcolarosa does this mean we can do what you originally suggested: saving a time-stamped conflict version?
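Since lsjson emits plain JSON, pulling the modification time out of it is straightforward (a sketch; the sample record is the one shown above):

```javascript
// Parse `rclone lsjson` output and pull out each entry's ModTime.
// The JSON below is the sample from the comment above; in practice it
// would come from running `rclone lsjson <remote:path>`.
const lsjsonOutput = `[
{"Path":"rb.sh","Name":"rb.sh","Size":53,"MimeType":"application/x-sh","ModTime":"2018-05-28T04:37:09Z","IsDir":false,"ID":"id:mDmSon1L514AAAAAAAURWQ"}
]`;

const entries = JSON.parse(lsjsonOutput);
const modTimes = entries.map((e) => ({
    path: e.Path,
    // ModTime is RFC 3339, which Date can parse directly.
    modTime: new Date(e.ModTime),
}));
```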

marcolarosa commented on July 18, 2024

> I guess I should have said "sits at rest indefinitely". I have tried to be very clear that Describo was originally intended to describe data as part of a repository deposit or publishing process - and I stand by the statement that that is not achieved by leaving a crate on a user-accessible file system.
>
> When I was working at UTS we commissioned you @marcolarosa to write Describo Online for describing data in OneDrive, that's true - but the point of the exercise was to allow users to describe data and deposit it in a repository for safe-keeping and dissemination. As far as I know that work is not complete at UTS, but once we had that working, we would have been looking at options like allowing the user to delete the OneDrive copy, making data (including the ro-crate-metadata.json file) read-only in the interests of encouraging use of data from trusted repositories and managing disk usage.
>
> It is possible that when we jointly designed Describo Online and had it write an RO-Crate file back to OneDrive, we did not understand the limitations of that scenario (e.g. I gather from what you have said @marcolarosa that even getting file modification times is problematic). I certainly didn't understand how different it would be from having access to a local file system.

I feel the comment above, and the ones preceding it, are misleading and suggest that what I built was not what was requested or appropriate. It's difficult not to take a rewriting of the past personally when it portrays my efforts in a negative light. And that's how I read the initial comments. To your credit - I note your edits to try to clear some of this up.

Perhaps your thinking about the overall architecture has changed - as it should in light of new information - but the design of both the original desktop and then web versions of describo was always based around the idea of writing a crate file back to disk. The reason web describo (this version) uses a DB internally is that we can't know if two users are using a shared folder in OneDrive or S3 or ownCloud or whatever. So the interaction with the folder is as follows:

  • Load the requested user folder
    • Is there a crate file?
      • no - stamp one with a describo id matching the collection id in the DB and continue with the DB representation
      • yes - does it contain a describo id that matches a collection id in the DB?
        • yes - continue with the DB data
        • no - load the crate file into a new collection and update the describo id to match (in the DB and in the crate file on disk) - continue with the DB data

(The code is @ https://github.com/Arkisto-Platform/describo-online/blob/master/api/src/lib/crate.js#L40-L100 - I had to comment it so that I could follow what was happening inside it!)
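The flow above might be sketched roughly as follows (all helper names here are hypothetical stand-ins, not the functions in crate.js):

```javascript
// Sketch of the load flow described above. readCrateFile, importCrate,
// collectionExists etc. are hypothetical stand-ins, not the real
// crate.js implementation.
async function loadFolder(ctx) {
    const crate = await ctx.readCrateFile();          // null if absent
    if (!crate) {
        // No crate file: stamp a new one with the collection's id.
        await ctx.writeCrateFile({ describoId: ctx.collectionId });
        return ctx.useCollection(ctx.collectionId);
    }
    if (crate.describoId && (await ctx.collectionExists(crate.describoId))) {
        return ctx.useCollection(crate.describoId);   // continue with DB data
    }
    // No id, or the id doesn't match a collection: import the crate,
    // then rewrite the id in the DB and in the file on disk.
    const id = await ctx.importCrate(crate);
    await ctx.writeCrateFile({ ...crate, describoId: id });
    return ctx.useCollection(id);
}
```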

For 2+ years this was known, understood and acceptable. The alternative would have been to load the crate file on disk into individual user sessions and have multiple users then fighting over the on disk representation serialised from their respective DB views.

To my recollection, until very recently (the last few months, maybe), we had not considered what happens when a crate is associated with a DB representation but some other process updates the file on disk. That is what is causing this problem, and it is the crux of this issue.

I went looking for an rclone stat command and didn't find one, hence one of my earlier comments. I didn't know about, or think to look at, the data from lsjson. That looks like what I need. Thank you for finding it.

Going back: stamping a changed (on disk) crate file with a conflict marker will result in other issues. In an environment where some automated process is continually writing metadata to the on-disk representation, there will be an accumulation of conflict files, which will be confusing to end users. (Why do I have all of these other crate files? What's the difference between them? What am I supposed to do with them?) And there will still be data loss and data mismatch, because the data on disk will not match the representation in the DB that describo considers the source of truth.

I've had this problem in nyingarn, where the workspace has information that needs to be put into the crate file, but it has to do that by passing the data to describo to write into the DB and then on to disk. I've not documented it because, whilst it works, it's clunky and now tightly couples nyingarn to describo, which is unacceptable: a failure in one cascades into the other.

My understanding was always that describo was a tool connected to some user facing storage system (desktop and web versions) that enabled users to easily describe their folders so that they could be ingested into a repository somehow. That is what we pitched to the Science Mesh last year (documented, presented and described in great detail). The issue of other processes also creating / updating metadata in those same folders is a problem to be solved, but it is not a failing of describo per se. If a user and some automated process both need to write to a folder on disk, then in the current design the only acceptable solution is for the automated process to instrument describo to do the work (as I implemented in nyingarn). But I really don't think this is a good idea and honestly believe that capability should be removed.

marcolarosa commented on July 18, 2024

@ptsefton

Doing this with modtime is not going to work as we expect. Consider the following:

  • load a crate file from disk - it has no describo id so a collection is made in the db, the data is loaded in, and a new version of that crate file is written to the backend with the describo id in it.
  • reload the page - and that folder - the updatedAt of the collection is older than the mod time of the file on disk because we wrote a new version to disk (with the describo id in it) after we loaded it into the db.

So modtime is a no go.

But there is the hashsum command. With that we can calculate the remote hash, compare it to the local hash, and then decide whether or not to create a conflict file before pushing the new crate to the backend.

ptsefton commented on July 18, 2024

Looks good

ptsefton commented on July 18, 2024

Re-opening this as it is not working as expected in Describo Local.

After updating, I am getting a conflict file created for every action - Describo is working as it should, but it is detecting conflicts every time something changes. Not sure what would be needed to debug this - @marcolarosa have you tried it with Describo Local?

ptsefton commented on July 18, 2024

Further to the discussion about the best way to do this: it looks like rclone check is a single command which tells you whether a local and a remote file are the same or different. This would avoid having to code for all the possible cases with hashsum, where different remotes support different algorithms, mod times, etc.

marcolarosa commented on July 18, 2024

I'm about to push a new release with that code commented out; with it disabled, things work as expected. The problem with the code is as follows:

  • the version on disk has a particular hash
  • when you update the data in UI, the data changes so the hash of the new file is different to the old and a conflict file is created
  • then updating the data again causes the same issue - over and over.

The check command is a good find if we can get around this issue.

This needs more thought about how to handle it, so let's leave this ticket open.

ptsefton commented on July 18, 2024

Could the problem be here:

async saveCrate({ session, user, resource, parent, localFile, crate }) {
        // write the file out locally
        await writeJSON(localFile, crate, { spaces: 2 });

        // sync it back to the remote
        syncLocalFileToRemote({
            session,
            user,
            resource,
            parent,
            localFile,
        });

This saves the crate (which means it WILL be different from the remote) and then does the sync. Should it not check for conflicts before saving what you have locally - that is, check whether another process has sent an update to the remote since describo last sent one?
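The check-before-save ordering suggested here might be sketched as follows (`fetchRemoteHash`, `writeConflictCopy` and friends are hypothetical stand-ins, not functions in the codebase):

```javascript
// Sketch of check-then-save: compare the remote against what we last
// synced BEFORE overwriting the local file, so a concurrent remote
// update is preserved as a conflict copy instead of being clobbered.
// All ctx members here are hypothetical stand-ins.
async function saveCrateSafely(ctx, crate) {
    const remoteHash = await ctx.fetchRemoteHash();
    if (remoteHash !== ctx.lastSyncedHash) {
        // Someone else changed the remote since our last sync: keep
        // their version under a conflict name before we overwrite it.
        await ctx.writeConflictCopy();
    }
    await ctx.writeLocal(crate);                   // now write our new version
    ctx.lastSyncedHash = await ctx.syncToRemote(); // and push it up
}
```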

marcolarosa commented on July 18, 2024

This might be a good one for you guys to work on.

ptsefton commented on July 18, 2024

I think it would be better to roll back this entire change, including the hard-coded SHA1 hash, which will not work with all remotes. I can probably get my team to look at this issue, but not immediately.

marcolarosa commented on July 18, 2024

Is the issue still happening? I've commented out that code path. When your team looks into it they can remove all the related code then.

ptsefton commented on July 18, 2024

That commit appears to have caused another issue - #57 - and you have not commented out all the code you added; from reading the code, I think it is now doing an unnecessary hash operation. Having looked into this a bit more, I think it will be simpler to just use rclone's check command rather than checksums.

marcolarosa commented on July 18, 2024

Yes, it was still doing the hash, but I didn't think it would cause an issue other than maybe slowing things down a touch. I haven't seen that problem, as I've noted in #57 (comment), but all that code is gone nonetheless.
