Giter Site home page Giter Site logo

Sync reserved characters proposal about syncthing HOT 17 OPEN

rasa avatar rasa commented on June 22, 2024 1
Sync reserved characters proposal

from syncthing.

Comments (17)

JanKanis avatar JanKanis commented on June 22, 2024 1

a nitpick, before I forget it. I'll see if I have more substantial comments when I have more time.

re url encoding and Samba's Catia mapping encoders: FAT12/16/32 do support unicode, as utf-16 I suppose, so they should be able to handle the PUA encoder just fine. The Catia mapping also requires unicode support. But the filesystem specifically using UTF-8 is not required for any encoder.

from syncthing.

AudriusButkevicius avatar AudriusButkevicius commented on June 22, 2024 1

I think I got lost in the levels of inception of re-encoding, but I think this should be handled exactly the same way as the "case insensitive fs" wrapper.
i.e., there is a file system wrapper that hides the gnarlyness of having to encode/decode files, so to syncthing it should look like every filesystem supports everything, and I'm not that concerned as to how the sausages are made.

I agree that you can end up with cases where you switch between the different wrappers leading to unexpected effects, i.e., files that were claimed to be with: now suddenly get deleted, and replaced with some encoded version, but I think that is ok, as there is no actual data loss, there is just change in names, and setting the encoding back on would unwind this.

In majority of the cases the codec should be a no-op, and that's fine, switching it back and fourth should have no effect, and will only matter for cases where you do have a genuine ":" in the paths, which should be very few cases.

I guess the more interesting case that I don't see handled is where our encoding scheme clashes with files that already exist.
Namely I replace : with unicorn, want to sync a:, but already have a file aunicorn.
I guess perhaps this was covered by the inception of re-encodings, but I guess it lacks clarity and examples for me to digest what is being said there.

Agreed, we can have helper cli utility that help "decode" or "encode" things in place to allow you to convert.

from syncthing.

acolomb avatar acolomb commented on June 22, 2024 1

I think taking a step back and defining the assumed invariants would be good before diving into details and an action plan.

  1. Is "encoding" always a well-defined, reversible process? Do encode and decode schemes have perfect symmetry and lossless round-trips?

  2. Do we acknowledge that there is no way to deduce an encoding solely based on looking at encoded names? As long as we don't have any reserved escape character(s) that are otherwise forbidden except in names encoded by Syncthing, we must assume that detecting "already encoded" names vs. "happens to use one of our replacement characters" is a best-effort heuristic.

  3. Can the encoding / decoding process be made foolproof if we assume the encoding scheme is known? Imagine a more radical encoder which, e.g. translates all names to their base64 equivalent. Then detecting a single file that uses any other character would point to a misconfiguration or externally placed, non-encoded files.

  4. Is the encoding a strictly local matter, or is it announced to other peers?

Regarding point 3, I really like the idea of storing the encoding scheme with the data, under .stfolder. It survives database resets and messing with the configuration. I would even argue that once put there, Syncthing should not support changing the encoding, but rather require the folder to be set up again from scratch. Putting an encoding marker there after the fact should be safe, as any file name found locally that cannot be a product of that encoding, will be renamed (encoded) at that time. Assuming the choice of encoder is sensible (e.g. FAT on a FAT filesystem), there cannot even be unencoded existing names, as the filesystem would not allow them.

As to point 1, we do have some kind of encoders already in Syncthing: encrypted names on untrusted devices (not easily reversible) and the Unicode normalization code (also not reversible if the previous name was not normalized). Looking at those might give some hints regarding the invariance questions. Integrating that functionality with the proposed encoding stuff is probably too far fetched though.

Thinking one step further, I could imagine even more radical encoders emerging, such as the mentioned base64 encoding. That might prove useful to implement further filesystem types in Syncthing, e.g. to add object stores. But then it needs to be clear whether this encoding machinery works with only a (non-reversible) hash function. Again, laying down these invariants / requirements for encoding schemes will help set the boundaries for designing the basic encoders we actually need in the first step.

from syncthing.

JanKanis avatar JanKanis commented on June 22, 2024 1

But the filesystem specifically using UTF-8 is not required for any encoder.

Sorry, I don't follow. Can you clarify?

s/UTF-8/unicode/. It doesn't matter if a filesystem uses UTF8, they need to support unicode, any unicode encoding will do.

from syncthing.

JanKanis avatar JanKanis commented on June 22, 2024 1

FAT12/16/32 do support unicode, as utf-16 I suppose, so they should be able to handle the PUA encoder just fine.

You may be right. I was going off of here, but it may be wrong. I will remove that claim.

The base filesystems only support 8.3 length non-unicode filenames, but Windows uses an extension to also store longer unicode filenames as an add-on.

from syncthing.

JanKanis avatar JanKanis commented on June 22, 2024 1

Hi @rasa, I finally got around to writing a full reply to your proposal.

Use cases

Judging by the other comments, the use cases/purpose needs fleshing out. My personal use case is the following:

I use Linux as desktop, and so I have some of my personal documents using windows reserved characters in their filenames. I also have some documentation downloaded for local use from a website using wget. Those pages used '?' in their url, so the files now also have that in their names. I would like to sync my documents to my Android phone, which does not accept those characters (depending on Android version etc). Of course I could go through all these files to rename them. But in the case of the wget-ed webpages that would break their internal links. I also don't really want to change my names as I mainly use these documents on my Linux laptop, and only occasionally on my phone.

For my use case I want to be able to view/edit the files with existing Android apps, so proposals such as base64-encoding or storing a filename hash don't cut it for me. In that case I would not be able to identify the file when browsing through the files in e.g. an Android file manager or any other app that is not Syncthing. Using the Unicode PUA works, as I only occasionally have a reserved character in my file names and I can still identify them from the rest of the file name. That is why I proposed including the Samba Catia mapping, in which all characters are still identifiable without using any special software.

Other use cases could be backing up your personal files to a Windows server, or using multiple computers with mixed operating systems where Mac or Linux is your main OS, or when handling files from a WSL/cygwin/etc environment on Windows. Other use cases could be when you are not syncing your own personal files, but a file set over which you have no direct control and for which you thus can't just change the file names. However some real use cases are probably more compelling than what I can think up.

Restricting file names

I see that, compared to my previous proposal, you've not adopted the part of configuring certain file name characters as disallowed for a folder. I'm totally fine with that, Syncthing cannot actually control what users put in to their synced folders anyway. It was primarily a way to surface something similar to the existing proposal in a way that would be easier to understand.

The Default encoder

I would suggest renaming it to "None". A setting encoder: None is imo the clearest way to signal to users that this encoder doesn't do anything. If it's named 'default', I would then need to go look up what the default encoding for syncthing is.

re-encoding 'inception'

As others have also mentioned, the whole part on re-encoding and re-re-encoding is overly complicated. First, the encoder doesn’t know the meaning of the file names it receives, it only knows that it sees some characters incoming that are also in its encoding codomain. The question then is how to handle that. I think it would be clearer to rephrase the section in terms of incoming characters instead of as “re-(re-)*encoding”.

IMO there are two sane ways to handle it: don’t, or escape. Adding a separate encoder for different re-encoding is imo way too much complexity for questionable gain, so the FAT encoder should just implement one of these options.

If the encoder does not handle incoming encoding target characters, the encoder should reject the file which should lead to a synchronization failure, just like currently already happens with file names with reserved characters. However encoder target characters, whether PUA or the Samba Catia mapping, should be a lot less common than e.g. : and ? and probably won’t be an issue for most users.

If the encoder escapes target characters, it would prefix such codomain characters with an escape character such as \xf05c (in case of the PUA encoder). If the input name already contains an \xf05c, that would be escaped just like any other encoding codomain character, so it would end up as \xf05c\xf05c. No need to handle re-re-encoding specially. The only problem is that this increases the file name length, so it would not work for files with a length of 255. That isn’t really an issue as the proposed PUA encoding already increases the file name length by requiring surrogate pairs to encode the PUA characters. Also, file names that are almost 255 bytes long are quite rare in my observation, and file length limits already differ between file systems.

For this issue it is also worth finding out what WSL/cygwin/Mys2/CIFS do when they encounter the PUA target characters.

There are other ways to store files that sidestep the whole filename character issue, such as base64-encoding them or storing a base64-d hash of the filename and a separate file with the real filenames. But with such solutions it is no longer practical to edit files on the encoded side, and the feature set that Syncthing would offer in such a case would be (practically speaking) one-way backup instead of two way synchronization, which is Syncthing’s unique selling point.

I have a slight preference for doing escaping, but I’m also fine with not handling re-encoding. Especially if that—being simpler—contributes to the proposal being accepted.

Switching encoder settings

A lot of the potential problems come from changing encoder settings. There’s one simple solution for (most of) this problem: don’t allow changing the encoding setting. Syncthing already does this by not allowing the folder path and ID to be changed. Even though changing the path shouldn’t be such a big problem (as far as I know). Users can still change the encoding by editing the config.xml file directly, but in that case they should know what they are doing, and the user should manually rename any affected files or make sure there are no affected files in the folder.

There is still an issue with downgrading Syncthing after creating a folder with the FAT encoding, but I don’t think that is worth bothering about a lot. It seems like a quite rare situation, there won’t be any data loss, all that happens is files can be duplicated or their names messed up, and it can be fixed by a script or cli tool.

Of course it is also possible to handle this in the GUI. That would certainly be more user friendly, but I’m not sure if it is worth the complexity. In that case Syncthing should handle it as proposed in Potential Issues 5. If all files can be renamed automatically (or don’t need renaming), Syncthing can just do the rename. Otherwise, it should ask the user and affected files would need to be deleted and un-synced.

Technical scope of the encoder framework

Judging by some of the discussion, the proposal should probably clarify that the encoding is something that happens purely locally. File names sent over the wire in the Syncthing protocol are always the unencoded names, and nodes don’t know about each other if they use any kind of encoder when storing their files.

Proposal document structure

The proposal document is a bit too complicated, in my opinion. And that probably contributes to the issue appearing more complex to readers than it actually is.

Specifically, the description of what the encoders actually do is spread around the document, under the headings “The encoders”, “Other encoders”, and “Possible encoding methods”. I think the main proposal (including the PUA encoding) should be at the beginning of the document, so readers not already familiar with the existing discussion have a clear view of what the (current) proposal constitutes.

Also, you divide the work into several “phases”, but you don’t specify that you’re doing that or what these phases are before referring to them. However I think the notion of defining separate phases should be dropped altogether. We only need two “phases”: what will be included in the current proposal and (assuming it gets accepted) pull request, and Future Extensions, i.e. everything that can be implemented as later enhancements. The goal being to limit the scope and complexity of the current proposal, both in the amount of code that needs to be written, but more importantly in the number of issues to be discussed and that can be disagreed about. IMO the proposal should correspond mostly to what is now phase 1 using the PUA encoder, and probably not allowing changing the encoder from the GUI. For the choices of changing encoder and handling filenames that are already encoded, these are the most simple options and other options can be implemented in the future. These other enhancements should still be mentioned under future enhancements, of course.

As a reference it might be helpful to see the list of sections a Python Enhancement Proposal (PEP) should include. Not all of them apply to Syncthing, but it is still a helpful and thought out structure.

from syncthing.

rasa avatar rasa commented on June 22, 2024 1

@JanKanis I completely rewrote the proposal, using the PEP layout you referenced. Let me know if I captured all your feedback, or if further simplification is needed. Clearly, my initial draft was way too complicated to be accepted. Lesson learned! Thanks again for the thoughtful and detailed feedback!

from syncthing.

JanKanis avatar JanKanis commented on June 22, 2024 1

If this issue is important to you, visit #1734, and click the thumbs up icon.

I'd like to do that, but I can not add any emoticon response/vote to it, I guess because the issue is locked. Also after authenticating to roadmap.syncthing.net, voting on the issue there doesn't do anything, presumably also because the issue is locked.

from syncthing.

calmh avatar calmh commented on June 22, 2024 1

I don't think all the details around config handling need to be defined right out of the gate, but my gut feeling is that this would be the new default on Windows and Android, editable at folder creation time, and otherwise handled pretty much like the folder path -- not easily editable, with some FAQ or doc article explaining the situation and what to do to change it safely.

from syncthing.

rasa avatar rasa commented on June 22, 2024

FAT12/16/32 do support unicode, as utf-16 I suppose, so they should be able to handle the PUA encoder just fine.

You may be right. I was going off of here, but it may be wrong. I will remove that claim.

The Catia mapping also requires unicode support.

Will update doc.

But the filesystem specifically using UTF-8 is not required for any encoder.

Sorry, I don't follow. Can you clarify?

from syncthing.

calmh avatar calmh commented on June 22, 2024

Thank you for writing this up, it's an excellent summary of the problem, your proposed solutions, and the potential issues. ❤️

For me, however, it also illustrates quite clearly why I'm disinclined to accept the proposal (and the corresponding PR). In a nutshell, the problem ("I want to sync filenames containing reserved/unsupported/special characters on any filesystem") is fairly easily avoided and/or corrected when it surfaces. The proposed solutions, however, are complicated and error prone, and the result of mistakes and misconfigurations much harder to reason about and fix than the original problem. In my mind this makes the cost higher than the benefit.

from syncthing.

rdebath avatar rdebath commented on June 22, 2024

Some small points to start with ...

  • There seems to be a requirement for a machine with a "Default" encoder within the swarm; it should not be assumed that there will be a Unix host available, all running on different versions of Windows seems rather likely. This would mean that if an particular host switches to a particular non-"Default" encoder you need to be able to fix the "mess" from that host.
  • Please use vFAT rather than FAT for your encoder name. This is because, pedantically, FAT is a filesystem that only supports uppercase Ascii 8+3 filenames (with some extras depending on localisation) that can (IMO) only be upgraded by a full overlay filesystem like umsdos.
  • Don't forget that the valid characters on a Windows filesystem depend on the version of Windows (sometimes even build).

from syncthing.

rdebath avatar rdebath commented on June 22, 2024

Oh and as a counterpoint.

The requirement seems to be that a particular host has all the files created on every other peer irrespective of the name it might be given here to overcome any local limitation. This is presumably useful for things like backup servers.

In that case a translation like the previously mentioned base64 would be acceptable, BUT might still hit a file length limitation. Taking a secure hash (MD5, SHA1 etc) of the pathname would give a name with four or five 8 character sections for any original filename which is (basically) guaranteed to be unique.

A small database containing a list of all the paths would be required to know what filenames are stored on the local FS. Working with the filesystem would be mostly trivial but there would be no method of migrating to or from this scheme except for adding another peer to the swarm. Though individual pathnames can be translated using simple tools like sha1sum so restoring particular files would be quite feasible.

Personally I'm more likely to make the backup server a Linux box.

from syncthing.

JanKanis avatar JanKanis commented on June 22, 2024

I added a number of comments to the Google doc version on the document structure.

One other question: you haven't commented on or adopted what I proposed w.r.t. changing the encoder setting from the GUI. (disallowing it, or if allowed, make sure all files are renamed) What do you think about that?

from syncthing.

rasa avatar rasa commented on June 22, 2024

One other question: you haven't commented on or adopted what I proposed w.r.t. changing the encoder setting from the GUI. (disallowing it, or if allowed, make sure all files are renamed) What do you think about that?

@JanKanis It's a good idea, but I don't think we can make a field read-only on the Actions > Advanced > Folder page. And that page already has a big red message

Be careful! Incorrect configuration may damage your folder contents and render Syncthing inoperable.

so the user's been warned enough. And it doesn't appear that page restricts the user from editing any field, including the folderID, so adding logic to change some fields to read-only may defeat the purpose of this page (which is to change any field, no matter how disastrous the change would be, like changing 'Filesystem Type' to fake).

But if we ever add the setting to the folder's setting page, we should make the field read-only if it's not None. I purposely left this idea out of the proposal, as, IMO, changing the encoder is an "Advanced" feature, such as changing the "Case Sensitive FS", "Junctions As Dirs", or "modTimeWindowS" settings.

from syncthing.

JanKanis avatar JanKanis commented on June 22, 2024

Ah, I wasn't aware of that advanced configuration page. That's basically equivalent to directly editing the configuration file, so I agree with you then. Does that mean this option will be an advanced configuration only feature, or do you still want to show the option in the regular UI when creating a new folder?

from syncthing.

rasa avatar rasa commented on June 22, 2024

Does that mean this option will be an advanced configuration only feature?

@JanKanis IMO, yes, as long as we have the potential for duplicate files, I think we need to hide the option from the user.

from syncthing.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.