googlecloudplatform / gcsfuse
A user-space file system for interacting with Google Cloud Storage
Home Page: https://cloud.google.com/storage/docs/gcs-fuse
License: Apache License 2.0
We should try out the integration tests for all of the following packages on a GCE instance in the US: things might work differently under different network latency and listing latency conditions.
docs/semantics.md should include a section about how to safely modify mmap'd files, such that the modifications are made durable (and the user sees an error otherwise).
There is a novel written about this in the documentation for fuseops.FlushFileOp. Summary: if the user wants to have this work on both OS X and Linux, they should call msync with the MS_SYNC flag, checking for errors on all calls.
This can probably be relaxed in various ways on Linux and/or OS X, but I'm relatively confident this particular dance works (and jacobsa/fuse contains a test for it).
Allow configuration of the temporary directory used for storing all temporary files. This lets users use a partition with more space, an SSD, etc.
Audit all uses of TempFile, AnonymousFile, and TempDir. Plumb in the setting.
When a bucket is mounted, the ownership of the mounted folder is: owner root, group root. But because the access permissions on the folder are 700, our application can't access this folder. Can it be fixed? Our application works under a different unprivileged user (not with root privileges). If we try to do chmod or chown, we get a "function is not implemented" error.
The tightening of tests in #30 revealed another problem that has been on my radar but which I haven't actually written down yet. Say the objects in a bucket are:
Then when we get a readdir op from fuse for the root inode, we call GCS's Objects.list method with a delimiter of "/" and a prefix of "". That returns an object named "foo" and a collapsed run named "bar/", so we return directory entries for a file named "foo" and for a directory named "bar".
The problem is that, unless the user is running with --implicit_dirs (see semantics.md), there is no accessible directory named "bar". When the kernel comes back to look up such an inode, it will receive ENOENT.
On OS X, when ls encounters this situation it prints an easy-to-miss error and then ignores the entry (which is why the problem hasn't been in my face):
% ls -l mp
ls: bar: No such file or directory
total 0
-rwx------ 1 jacobsa eng 0 Apr 1 09:42 foo
On Linux it's a bit more obvious:
% ls -l mp
ls: cannot access mp/bar: No such file or directory
total 0
d????????? ? ? ? ? ? bar
-rwx------ 1 jacobsa eng 0 Apr 1 09:42 foo
As seen in #22, enabling fuse "big writes" ups the kernel -> gcsfuse write atom from 4 KiB to 32 KiB, which makes a performance difference.
Things to do:
InitResponse.MaxWrite didn't appear to work.

If you start a very large copy into a mounted bucket with cp, kill it, then attempt to Ctrl-C the gcsfuse process, it will refuse to unmount if it made it to the Flush stage, because the request is still in progress. You simply have to wait for it to finish, or take more drastic measures.
However, on OS X and Linux an Interrupt request does come through. We just respond ENOSYS to it. So we need to plumb through support for cancelling the associated context when this is received, and set up the GCS package to pay attention to cancelled contexts (probably using http.Transport.CancelRequest). Test the latter by starting large uploads and downloads, cancelling them, and timing the duration until the bucket call returns in error.
Update the documentation about best practices for mmap in light of jacobsa/fuse#8. (Wait for the fuse package documentation to be updated, then translate that here.)
This probably involves setting a mode bit in some object metadata value saying "this is a symlink", maybe with a key called gcsfuse_mode
or similar. That's a bit unfortunate, because we haven't had to touch the metadata elsewhere yet.
@marcgel reports the following permissions-related weirdnesses:
Directory permissions of the mount point show up with question marks:
d????????? ? ? ? ? ? test
Need to use sudo to list the directory. (Probably related to the first issue.)
Reproduce each of these interactively (maybe Linux is required), add failing tests, and fix.
It looks like it is possible to re-use credentials built into the GCE instance when running on GCE—see this example. This would be a nice feature.
The primary subtlety is probably the scope of the credentials. Can we introspect that, or do we simply need to print a helpful error when a request fails with an HTTP 403 or whatever?
As mentioned at the end of #28, there are problems with ls -l
on an empty directory:
jacobsa@jacobsa-macpro:~/tmp% ls -l mp/
total 0
drwx------ 1 jacobsa eng 0 Aug 31 1754 foo
-rwx------ 1 jacobsa eng 0 Mar 31 16:28 foo?
jacobsa@jacobsa-macpro:~/tmp% ls -l mp/foo/
ls: foo: No such file or directory
jacobsa@jacobsa-macpro:~/tmp% ls -l mp/foo/
ls: foo: No such file or directory
I believe this is because this code doesn't filter out itself from the prefix-based results returned from GCS. The equivalent code used to, but must have regressed at some point during refactoring.
The reason this isn't reproduced in the integration tests is that ioutil.ReadDir ignores ENOENT when statting the names it reads from the directory, silently filtering out such entries.
So, update the integration tests to use a pickier version of ioutil.ReadDir, then fix the bug.
As of the changes in d054a1a, every operation happens on a single goroutine, blocking further fuse operations. This is good because it fixes the race discussed in jacobsa/fuse#3, but is certainly not optimal.
We should decide which types of operations need to be serialized (for example, WriteFile/SyncFile/FlushFile for a particular inode) and build queues for those, with a goroutine spawned when the queue goes from empty to non-empty. All other operations can simply receive their own goroutine. Optimization: things that don't block don't need a goroutine.
When mounting a bucket with gcsfuse, if the option tmp_dir is set to a non-existent folder, the bucket is still mounted, but it raises an I/O error when trying to copy files and all files end up with a size of 0.
Right now, gcsfuse caches nothing and allows the kernel to cache nothing. This is in order to support the consistency guarantees documented in semantics.md. But it makes things slow, particularly when the kernel is doing path resolution (which is very frequent).
There is probably room for a --go_fast
flag that users can enable if they are okay with relaxed guarantees. I would start with just allowing the kernel to cache attributes and entries, and see if anything else is truly needed. If so, the following additional things may or may not be helpful (measure first to find out!):
A cache mapping name to the gcs.Object record for it, perhaps with a TTL. Probably also supports negative entries.

Steps for repro:
Expected JSON API to reflect the size.
We should do some performance testing to identify completely obvious bottlenecks.
Use time cp to measure wall time taken to copy from local disk to GCS. Compare with gsutil cp. (Both can be measured here by just using time gsutil cp, I think.)

Both DirInode.Attributes and FileInode.Attributes stat the object in GCS just to find out what they should set the Nlink field to. This means that the many slow Getattr requests done by ls -l (see issue #39) are only for the sake of Nlink.
Assuming no one cares about Nlink for anything important, we can get rid of this and probably speed up ls -l significantly without any cost to consistency guarantees. But we need to check what Nlink is used for by the kernel, and whether it traditionally matters in userspace.
After fixing jacobsa/fuse#3, gcsfuse no longer builds. Fix it. Don't worry about parallelism for now; leave a TODO.
A user requests read-only mounting, which seems like a reasonable thing to support. Package bazilfuse has a read-only mount option, so this should be easy.
Add a --read_only flag to gcsfuse.

Figure out what is typical for fuse file system mount tools in terms of running in the foreground/background and logging to stderr/log files/both. Make it happen, and document how it works in the readme.
If there is no typical behavior, choose what seems like a good behavior and document it.
After the shakeup caused by switching fuse packages, the "foreign modifications test" now passes but the "read/write test" doesn't. Add features and/or update tests until it does, on both OS X and Linux.
I happened to mount a bucket containing the detritus left over from the InterestingNames test case in the jacobsa/gcloud package. ls did not like it.
We should have a similar test that covers the same cases for file and directory names, in order to discover what the OS chokes on. If nothing, we should recurse into why ls chokes. In either case, also check the POSIX standard.
For each guarantee, make sure there is a passing test. If there's not, file an issue. Replace the "make it not aspirational" TODO with a list of these known issues.
DirInode has become quite complicated. I need to keep myself honest by adding unit tests, covering implicit dirs on and off, using a mock bucket.
This will allow the GCS team to measure usage, more easily distinguish from abusive clients, etc.
I hear a second-hand report that doing ls
on a directory with a few hundred files takes multiple minutes, from GCE. This is probably due to ls
doing a ton of stats plus our consistency guarantees, so plays into #29 (adding a "go fast" mode with caching and reduced guarantees). Investigate and make sure.
GCS is quite slow and the kernel is quite chatty, so the cost of the consistency guarantees in semantics.md is very high. We probably want to turn on "fast mode" using caching by default. Sigh.
To do:
Another feature we think would be great is to have some sort of logging option for gcsfuse for ease of administration. As an example in the log:
In addition, it would be great to have a full debugging option.
The current design is terrible for the use case of a handful of small random reads within a very large file (e.g. hundreds of 20-byte reads within a 100 GiB file). I'm told that this use case may be important for e.g. genomics databases.
The obvious fix here is to read only the portions of the GCS object requested, on demand or with some cache. Many subtleties lurk though. If we do something about this, we probably want to start with minimal complexity:
Support O_RDONLY file handles only. Supporting writing is a whole other can of worms.

This is not currently implemented, and thus blocking #3.
We'll first need support in jacobsa/fuse. Relevant bazilfuse request structs: bazilfuse.FlushRequest and bazilfuse.FsyncRequest.
User report:
When copying files with cp to a mounted bucket, all other commands (e.g. df -h) hang until the file is entirely copied.
This is probably due to #23; confirm when that is fixed. This probably should not require setting GOMAXPROCS
greater than one, assuming the Go scheduler doesn't starve goroutines. I could be wrong.
Currently we destroy an inode's temporary files only when the kernel tells us to forget it. I believe the kernel does this only when it is running out of space in its inode cache, at least if the file hasn't been unlinked. In contrast, one temporary file per inode can grow to a lot of disk space, and we may want to clean up earlier.
Consider setting a (configurable) limit on the amount of disk space devoted to temporary files. The limit may be exceeded if we have dirty inodes that add up to more than the limit. But if we are over the limit and have any clean inodes, we will throw away their content (in least recently used order?) until we are under the limit or run out of clean inodes.
How to plumb this in? Some sort of central object used by the object proxies. They use a "grab file if still exists" method when clean, and a "register dirty file" and "unregister dirty file" methods when transitioning into and out of the dirty state. Something like that.
a) If we do Ctrl-C during copying of a file into a bucket (interrupting the cp command), then copying is not immediately cancelled (cp still copies the file). Our guess is that when we Ctrl-C cp, gcsfuse starts copying the data it has already put into gcsproxy.temp_dir into the bucket. So it seems that Ctrl-C interrupts copying of files into gcsproxy.temp_dir, but the next step of copying (from gcsproxy.temp_dir into the bucket) stays uninterrupted.
b) The example above leaves gcsproxy.temp_dir uncleaned, unlike un-interrupted copying (i.e. when we didn't try to stop the process with Ctrl-C).
Hello Aaron,
It is really great that you have added this option! It is exactly what is needed. Yet, still some bugs sneak around:
When we tried to use -o allow_root, we got the following message:
2015/05/19 09:28:22.936398 Mount:bazilfuse.Mount: fusermount: "fusermount: option allow_root only allowed if 'user_allow_other' is set in /etc/fuse.conf\n", exit status 1
When we set the option user_allow_other in /etc/fuse.conf, gcsfuse complained about a wrong option -o allow_root:
2015/05/19 09:29:33.606858 Mount:bazilfuse.Mount: fusermount: "fusermount: mount failed: Invalid argument\n", exit status 1
We also tried -o=allow_root (just in case it was a syntax issue), still with no success.
When we removed -o allow_root, the bucket connected correctly.
For testing we used fuse-2.9.3-4.fc21.x86_64
Do you know what might cause this?
On Linux, touch
complains about a setattr call:
jacobsa@fourier:~/tmp% touch mp/foobar
touch: setting times of ‘mp/foobar’: Function not implemented
This appears to be because of an attempt to set times:
fuse: 2015/04/02 15:59:45 Received: Setattr [ID=0x39 Node=0xd Uid=83333 Gid=5000 Pid=30939] atime=2015-04-02 15:59:45.180338035 +1100 AEDT atime=now mtime=2015-04-02 15:59:45.180338035 +1100 AEDT mtime=now handle=INVALID-0x0
fuse: 2015/04/02 15:59:45 Responding with error to *fuseops.SetInodeAttributesOp: function not implemented
We don't actually support arbitrary modification times, and don't support atime at all.
To do:
Report modification times (perhaps from the object's Updated field).

We have a scenario where we need to load GBs of data to ~100 VMs at the same time. Having 100+ MB/s throughput to each VM would make the E2E workload very fast, and would allow us to switch to gcsfuse!
Thank you!
I am using Google Compute Engine running a VM (default Debian image). I have installed Go as well as gcsfuse (from git head) based on documentation from: https://github.com/GoogleCloudPlatform/gcsfuse
I can mount a directory:
$ gcsfuse --key_file key.json --bucket my-test-bucket --mount_point /data --fuse.debug
2015/05/19 23:02:56.808281 Initializing GCS connection.
2015/05/19 23:02:56.808475 Opening a bazilfuse connection.
2015/05/19 23:02:56.810162 File system has been successfully mounted.
2015/05/19 23:02:56.810592 Op 0x00000000 connection.go:319] <- Init [ID=0x1 Node=0x0 Uid=0 Gid=0 Pid=0] 7.23 ra=131072 fl=InitAsyncRead+InitPosixLocks+InitAtomicTrunc+InitExportSupport+InitBigWrites+InitDontMask+InitSpliceWrite+InitSpliceMove+InitSpliceRead+InitFlockLocks+InitAutoInvalData+InitDoReaddirplus+InitReaddirplusAuto+InitAsyncDIO+InitWritebackCache+InitNoOpenSupport
2015/05/19 23:02:56.810954 Op 0x00000000 common_op.go:154] -> Init {MaxReadahead:131072 Flags:InitBigWrites MaxWrite:2097152}
But, as soon as I try to list the contents of the mounted directory I get:
$ ls /data
ls: reading directory /data: Input/output error
The debug output is here:
2015/05/19 23:02:59.455564 Op 0x00000001 connection.go:319] <- Getattr [ID=0x2 Node=0x1 Uid=1000 Gid=1000 Pid=22387]
2015/05/19 23:02:59.456040 Op 0x00000001 common_op.go:154] -> Getattr {AttrValid:0 Attr:{Inode:1 Size:0 Blocks:0 Atime:0001-01-01 00:00:00 +0000 UTC Mtime:0001-01-01 00:00:00 +0000 UTC Ctime:0001-01-01 00:00:00 +0000 UTC Crtime:0001-01-01 00:00:00 +0000 UTC Mode:drwxr-xr-x Nlink:1 Uid:1000 Gid:1000 Rdev:0 Flags:0}}
2015/05/19 23:02:59.456466 Op 0x00000002 connection.go:319] <- Getattr [ID=0x3 Node=0x1 Uid=1000 Gid=1000 Pid=22387]
2015/05/19 23:02:59.456924 Op 0x00000002 common_op.go:154] -> Getattr {AttrValid:0 Attr:{Inode:1 Size:0 Blocks:0 Atime:0001-01-01 00:00:00 +0000 UTC Mtime:0001-01-01 00:00:00 +0000 UTC Ctime:0001-01-01 00:00:00 +0000 UTC Crtime:0001-01-01 00:00:00 +0000 UTC Mode:drwxr-xr-x Nlink:1 Uid:1000 Gid:1000 Rdev:0 Flags:0}}
2015/05/19 23:02:59.457124 Op 0x00000003 connection.go:319] <- Open [ID=0x4 Node=0x1 Uid=1000 Gid=1000 Pid=22387] dir=true fl=OpenReadOnly+0x10800
2015/05/19 23:02:59.457501 Op 0x00000003 common_op.go:154] -> Open {Handle:0 Flags:0}
2015/05/19 23:02:59.457627 Op 0x00000004 connection.go:319] <- Read [ID=0x5 Node=0x1 Uid=1000 Gid=1000 Pid=22387] 0x0 4096 @0x0 dir=true
2015/05/19 23:02:59.619914 Op 0x00000004 common_op.go:131] -> (ReadDir(inode=1)) error: readAllEntries: ReadEntries: ListObjects: toObjects: toObject: Unexpected Md5Hash field:
2015/05/19 23:02:59.620324 Op 0x00000005 connection.go:319] <- Release [ID=0x6 Node=0x1 Uid=0 Gid=0 Pid=0] 0x0 fl=OpenReadOnly+0x10800 rfl=0 owner=0x0
2015/05/19 23:02:59.620587 Op 0x00000005 common_op.go:148] -> (ReleaseDirHandle(inode=1)) OK
As discussed in the semantics doc, we require objects to exist for directories as well as files; there is no such thing as an implicit directory. If an object named foo/bar
exists but no object named foo/
exists, then the file system behaves as if foo/bar
does not exist. So if the user mounts a bucket containing only an object named foo/bar
and then does cat foo/bar
, they will get a "file not found" error.
When the user does cat foo/bar
, fuse sends the following requests to gcsfuse:
The fundamental issue is that at the point of the call in (1), gcsfuse can see that an object named foo
doesn't exist and therefore can say "foo" doesn't refer to a file, but needs to decide between telling the kernel that it doesn't exist at all or telling the kernel that it refers to a directory.
The current behavior is that in (1) gcsfuse asks GCS to do a consistent read of the metadata for two objects, foo
and foo/
. If it finds the first it calls "foo" a file, if it finds the second it calls it a directory, and if it finds neither it says it doesn't exist. That's why we require the object foo/
to exist for the directory to appear to exist.
This method works because unlike listing objects by prefix, a read of the metadata for a single object is guaranteed to be fresh.
One alternative is that we implement (1) by asking whether foo exists (as today) and by scanning objects with the prefix foo/, saying that "foo" is a directory if the scan is non-empty. But there are drawbacks here:

- If the user deletes the last object under the prefix, e.g. with rm foo/bar, suddenly it will appear as if the file system is completely empty. This is contrary to expectations, since the user hasn't done rmdir foo.
- If the user does rm foo/bar then touch foo/baz, the second command will fail with a surprising "no such file or directory" error.
- A scan for the prefix foo/ maps down to an unbounded number of requests to GCS, since each response contains a continuation token that must be redeemed to continue scanning, and GCS does only a limited amount of work before bailing out and returning this. This means a single simple path resolution may result in enormous expense.
- Listing is not consistent: say the user does rm foo/bar. As discussed above, now the directory "foo" no longer exists because it was only implicitly defined, so the user gets the surprising behavior of touch foo/baz failing. Except they only get this behavior once the listing catches up. Worse, if they try the experiment several times then it may fail, succeed, fail, succeed, and fail again.

Even if GCS eventually offers list-your-own-writes consistency, negating the last point, the other issues remain.
If users want to mount buckets where they've created object names assuming that implicit directories will work, we can create a "fixup" tool that lists the buckets and creates the appropriate objects for the implicit directories.
The main caveat here is that the tool would itself depend upon listing, so may miss some objects in the bucket for the same reasons discussed above. Another caveat is the need to run such a tool, but the behavior could be built into the gcsfuse binary itself (either as the default when mounting or on an opt-in basis).
Goal: figure out how to set up an fstab entry for gcsfuse, so that buckets can be mounted at startup. Ideally set up such that a particular non-root user owns the mount.
Before calling it alpha, we should relax the "don't depend on this for anything" warnings throughout a bit.
@marcgel points out that if you hit Ctrl-C to kill gcsfuse, which is natural to do when testing it out, you're left in a state where the mount is still active as far as the kernel is concerned but interacting with the directory doesn't work. More importantly, you can't just run the gcsfuse command again, because you can't mount over an existing mount point. Even more importantly, on Linux you have to use sudo
to unmount because umount
refuses normal users.
It would be better if we could avoid this. Modify the binary to set up a SIGINT
handler that prints a message, attempts to unmount, and then exits with a status. Try it out on Linux and see how it works.
In these packages:
I think there are several things that should be turned into GitHub issues.
Currently:

- DirInode.LookUpChild prefers directories over files if there is a conflicting name.
- DirInode.ReadEntries makes no effort to do the same—a conflicting name will show up twice.

We don't have a good test for this, and I think files should probably be preferred over directories instead—they are more "local" in some sense. Action items:

- See what a conflicting name does to ReadDir. What is the ls user experience?

Currently fileSystem contains two indexes over its inodes:

- fileIndex is an index over all file inodes by (name, source generation).
- dirIndex is an index over all directory inodes by name.

It is suspicious that these are different, and indeed this has come around to bite me when implementing rmdir. The FakeGCS.DirectoryTest.Rmdir_OpenedForReading test case crashes in checkInvariants after it creates a directory, deletes it, then creates another with the same name.
Not having the generation number of the directory's backing object in there makes it very hard to reason about race conditions involving the backing object for the directory. We should make this look just like files.
Action items:
- Have NewDirInode accept a *storage.Object just like NewFileInode, and record it for a SourceGeneration method.
- Make lookUpOrCreateDirInode look like lookUpOrCreateFileInode.

This is blocking #3.
Relevant fuse request struct: bazilfuse.RenameRequest
Unfortunately we can't do this atomically. So:
Hello Aaron,
we think it would be great to have an option for running gcsfuse as a linux daemon. Do you think it is feasible?
Anna
Currently we never forget an inode that we created on behalf of the kernel, which means our RAM and disk usage grows without bound. When we receive a ForgetInodeOp
from the fuse package, we should destroy any temporary files for the inode and then throw it away.
This is blocked by jacobsa/fuse#7 (the fuse package doesn't yet actually issue these).
As of 863e4a5, the semantics doc says that we will allow unlinking directories regardless of content.
I did this because it's impossible to delete the placeholder object in GCS if and only if there are no contents in the directory. Short of a transactional change feature in GCS, this will be true no matter what for concurrent changes to the directory across machines, even if consistent listing is eventually added to GCS. We could solve it correctly for a single machine with consistent listing, though.
Is this the least surprising and/or most desirable compromise though? The alternative is to list the directory and delete if the listing is empty. Consequences:
- The user may do rmdir and see it fail with "not empty", due to lack of listing consistency.
- The user may do rmdir and see it succeed, due to lack of listing consistency. This is true today too, but maybe more surprising if rmdir "usually" appears to work as normal.

I don't yet know what this involves. Try it out with GCS as both a source and a sink, and see what the user experience is like.