
jkominek / fdbfs

12 stars · 3 watchers · 1 fork · 298 KB

A not-yet-ready-for-use FoundationDB-backed FUSE filesystem. Seriously, don't use it.

License: ISC License

Python 2.04% C++ 94.82% C 1.72% Shell 0.69% Meson 0.72%
fuse fuse-filesystem foundationdb does-not-work full-of-lies

fdbfs's People

Contributors: jkominek

Stargazers: 12

Watchers: 3

Forkers: iamfork

fdbfs's Issues

Convert to nlohmann/json 3.x

We're on 2.1.1, which is probably fine for some time; I haven't actually read what's changed, I just know we don't build with the latest version. It probably isn't too hard to fix, since we're not doing anything very exciting.

statfs.cc and the build instructions in the GitHub Action are the places to touch.

Liveness-aware garbage collection

The garbage collector should make use of live process information from the liveness management system, to ensure it doesn't collect anything early.

Increase complexity of FDB configuration

Install a configuration file which spawns multiple/many FDB processes, each taking on different roles. The goal wouldn't be to improve or even alter performance, but rather to increase the opportunities for concurrency in FDB's internals. Hopefully that will expose us to a wider variety of FDB behaviors.

Serialize our own operations on inodes?

Currently it is possible that we would receive, and dispatch to FDB, requests which we could predict will cause transaction conflicts: multiple updates to the same inode, or writes to the same region of a file.

We should consider implementing a locking scheme on our side, whereby we maintain inode-level locks to serialize these operations.

The necessary data structure wouldn't be totally trivial; we'd need some sort of weakly held map from inodes to dynamically created locks, itself guarded by a lock, so that competing threads don't both try to create the lock for a given inode at the same time. (Remember to release the lock on the outer structure before attempting to take the lock you really want.)
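The weakly held map could be sketched roughly like this (a minimal sketch — the class and names are illustrative, not from the fdbfs codebase):

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>

// Weakly held inode -> lock map. The outer mutex guards only the map
// itself and is released (lock_guard scope ends) before the caller
// takes the per-inode lock it actually wants.
class InodeLockTable {
public:
  std::shared_ptr<std::mutex> lock_for(uint64_t inode) {
    std::lock_guard<std::mutex> g(table_mutex_);
    auto &slot = table_[inode];
    if (auto existing = slot.lock())
      return existing;                    // lock already live, reuse it
    auto fresh = std::make_shared<std::mutex>();
    slot = fresh;                         // weak ref: dies with last user
    return fresh;
  }

private:
  std::mutex table_mutex_;
  std::map<uint64_t, std::weak_ptr<std::mutex>> table_;
};
```

Returning a `shared_ptr` is what makes the map "weakly held": once every in-flight operation on an inode drops its reference, the lock is destroyed, and the stale `weak_ptr` entry is simply replaced on the next access.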

Include SQLite in the test suite

By "SQLite" I mean:

  1. Compile sqlite3 in a directory on fdbfs
  2. Run the sqlite3 test suite in there.

That should represent a lot of real-world loads nicely.

Maybe there are some other disk-interaction-heavy packages with sizable test suites that we could incorporate?

Perform our own permissions checking

Right now we have to farm out permissions checking to the kernel with the default_permissions option. That works, but allows for... "permissions skew", where the kernel may use cached permissions to determine whether or not some operation is allowed. So we could see:

  1. system A reads the inode for permissions (read-only transaction)
  2. system B changes the permissions on the inode (read-write)
  3. system A uses those permissions to perform an operation on the inode (read-write)

Now, is that the end of the world? No, I think local filesystems probably don't guarantee that can't happen. But their time bounds on preventing it are probably muuuuch tighter than ours. In bad situations ours might be long enough to be human-perceivable and weird.

It shouldn't really be any more expensive to do this. I believe I've added reads for the inode in all the places where permissions would need to be checked, even if we don't actually use the retrieved value (and it is used in almost all cases). So we're already paying the price; it's just a matter of implementing the permission checking function and calling it inside our transactions.

This is marked very hard because you've got to know the subtleties of POSIX permissions checking.
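As a starting point, the core of the permission check is the classic mode-bit test. This sketch deliberately ignores the hard parts the issue alludes to — supplementary groups, ACLs, capabilities, and sticky/setgid subtleties — which a real implementation would have to handle:

```cpp
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Classic POSIX mode-bit check (sketch only). `req` is a mask of
// R_OK/W_OK/X_OK, whose values (4/2/1) line up with the rwx bits.
bool check_perms(mode_t mode, uid_t f_uid, gid_t f_gid,
                 uid_t uid, gid_t gid, int req) {
  if (uid == 0)  // root passes everything; exec still needs some x bit
    return (req & X_OK) == 0 || (mode & (S_IXUSR | S_IXGRP | S_IXOTH));
  int shift = (uid == f_uid) ? 6 : (gid == f_gid) ? 3 : 0;
  int bits = (mode >> shift) & 7;  // rwx triplet for this class
  return (req & ~bits) == 0;       // every requested bit must be granted
}
```

Since the inode (and so its mode/uid/gid) is already read inside the transaction, calling something like this adds no extra round trips.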

cmake

Switch to building with cmake.

Keep an eye out for Windows compatibility, if that's a concern, as we want to be able to build for Dokan as well.

Liveness management

Correct garbage collection and lock breaking require "liveness management", whereby every process registers an ID used for marking inodes as in-use and for holding locks. Every process increments a counter and sets a last-updated time at a constant frequency. Every process watches the PID table and, on spotting a process that isn't updating its entry, "pings" it. Failure to respond to the ping within a significant multiple of the update frequency means the pinged process is dead, and other processes may remove the dead process's entry from the PID table, which will kill it.
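The deadness decision could be isolated in a small pure function, something like the sketch below (the constants are assumptions for illustration, not fdbfs's actual values):

```cpp
#include <cstdint>

// Illustrative liveness constants (assumed, not fdbfs's real values).
constexpr uint64_t kUpdatePeriodMs = 1000;    // heartbeat frequency
constexpr uint64_t kPingTimeoutMultiple = 10; // "significant multiple"

// A process may be declared dead once it has neither updated its
// PID-table entry nor answered a ping for kPingTimeoutMultiple
// update periods.
bool may_declare_dead(uint64_t now_ms, uint64_t last_update_ms,
                      bool answered_ping) {
  uint64_t silence_ms = now_ms - last_update_ms;
  return !answered_ping &&
         silence_ms > kPingTimeoutMultiple * kUpdatePeriodMs;
}
```

Keeping this as a pure function of timestamps makes the rule easy to test and to tune independently of the FDB plumbing that reads the PID table.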

Convert away from Travis CI

To something/anything free. Or something which can securely store some AWS credentials and fire stuff up on AWS. Wouldn't mind paying a little bit in EC2 costs, just don't want to have to build/maintain significant infrastructure.

Macros for supporting static analysis

I think I'd like to have some macros like:

  • INODE_KEY
  • FILEDATA_KEY
  • DIRENT_KEY
  • ???

which just pass through their contents:

#define INODE_KEY(x, descriptor) (x)

and some helper macros:

#define TARGET 0
#define PARENT 1
???

which are used to mark all of the keys provided to fdb_transaction_* functions:

fdb_transaction_get(transaction, DIRENT_KEY(key, PARENT).data(), key.size(), ...)
fdb_transaction_set(transaction, INODE_KEY(key, TARGET).data(), key.size(), ...)

so that we can quickly and accurately identify what KV pairs are read/written to by any given operation. If we start generating "synthetic" conflict keys for the inodes, we'll need this to help with correct reasoning.

Implement filesystem lock operations

Add support for the various FUSE locking operations. Double-check them against the Dokan and Samba lock operations to make sure whatever KV layout we use supports the locking requirements of our major targets.

Conflict range "cleanup"

When reading file blocks, the ..._get_range call should be a snapshot read; we don't have any obligation to return a specific version of the blocks, and we don't want to conflict with writes. (Though we might still end up conflicting on the inode, if we're updating access times.) At a minimum, the snapshot read produces one fewer conflict range to send and check. (Tiny optimization? Yes.)

Similarly, when performing a write, we should produce a single write conflict range covering all possibly affected blocks, instead of the maybe three conflicts we'd currently produce (start block, middle range clear, stop block). Again, a tiny optimization: slightly fewer ranges to send over the wire, and fewer for the resolver to check.
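Collapsing a write into one covering range is just block arithmetic; a sketch, assuming a constant block size (the real value would come from our KV layout), with the result destined for something like fdb_transaction_add_conflict_range with a write-type range:

```cpp
#include <cstdint>
#include <utility>

constexpr uint64_t kBlockSize = 8192;  // assumed, for illustration

// Collapse a write at [offset, offset + len) into one half-open block
// range [first_block, last_block + 1), suitable for registering as a
// single write conflict range. Assumes len > 0.
std::pair<uint64_t, uint64_t> covering_block_range(uint64_t offset,
                                                   uint64_t len) {
  uint64_t first = offset / kBlockSize;
  uint64_t last = (offset + len - 1) / kBlockSize;
  return {first, last + 1};
}
```

The half-open convention matches FDB's begin-inclusive/end-exclusive ranges, so the block numbers map directly onto the begin/end keys of the conflict range.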

file extended attribute encodings

Our KV layout gives us a lot of flexibility for encoding the inode data blocks: we've got a small amount of arbitrary data tacked onto the ends of the keys, so that we can encode compression or parity information.

It'd be nice to get the same thing going for xattr data as well. Linux VFS allows attribute values up to 64 KiB, which is definitely a large enough lump of data to be worth compressing.

No immediate ideas for how to make that change.

Provide reusable interface to the filesystem

As it stands, the code ties the filesystem to FUSE. Which sort of makes sense.

But it would be useful to be able to reuse the same FDB code to produce a Samba VFS, a stand-alone library, and other things.

Consider sooner, rather than later, how to pull the FUSE-specifics out of the FDB code.
