sdpeters / ceph-rwl

Ceph write-back cache

License: Other

CMake 0.96% Roff 0.06% Shell 3.36% Python 10.29% Makefile 0.01% C++ 69.54% C 1.69% HTML 0.67% CSS 0.09% JavaScript 0.29% Perl 0.78% DIGITAL Command Language 0.01% Ruby 0.03% Assembly 0.15% Java 0.18% Lua 0.01% Perl 6 0.13% Terra 8.87% TypeScript 2.91% Dockerfile 0.01%

ceph-rwl's People

Contributors

6uv1s, adamemerson, andrewschoen, athanatos, batrick, cbodley, dalgaaf, dotnwat, dzafman, gregsfortytwo, idryomov, jdurgin, joscollin, jtlayton, ldachary, majianpeng, mattbenjamin, oritwas, rjfd, rzarzynski, smithfarm, tchaikov, theanalyst, trociny, ukernel, wjwithagen, xiexingguo, yehudasa, yuyuyu101, zmc


ceph-rwl's Issues

Add "image-cache discard-dirty" option to rbd CLI tool

This is October CDM feedback from Sage and Jason. This command is used to remove the configured image cache from an image without opening and flushing the image. This would be necessary if a node failed while writing through an RWL cache to an image, and the RWL pool file couldn't be recovered.
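
A rough sketch of what such a command would have to do on the librbd side follows; every name in it (discard_dirty_image_cache, the pool-file naming scheme, the metadata helper) is a hypothetical illustration, not the actual librbd API:

    // Sketch: remove a dirty RWL cache for an image without opening or
    // flushing it. All names here are illustrative assumptions.
    #include <filesystem>
    #include <string>
    #include <system_error>

    // Stand-in for the real metadata update (in librbd this would be an
    // omap/metadata write clearing the persisted image-cache state).
    static int clear_image_cache_state(const std::string& image_id) {
      (void)image_id;
      return 0;
    }

    int discard_dirty_image_cache(const std::string& image_id,
                                  const std::string& pool_file_dir) {
      // Delete the local RWL pool file if it still exists. A lost or
      // unrecoverable file is exactly the case this command is for, so a
      // missing file is not an error.
      std::filesystem::path pool_file =
          std::filesystem::path(pool_file_dir) /
          ("rbd-rwl." + image_id + ".pool");
      std::error_code ec;
      std::filesystem::remove(pool_file, ec);

      // Clear the persisted "dirty cache exists" marker so later opens
      // don't refuse to proceed looking for a cache that's gone.
      return clear_image_cache_state(image_id);
    }

Neither step opens the image, so nothing attempts to flush the lost cache.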

Make Image Caches plugins

Jason and Sage want RWL to be a plugin. I interpreted that to mean that all ImageCache modules should be plugins. This would include PassThroughImageCache.

A plugin is loaded separately, and only when a client uses an image that's configured to use that feature. This implies that opens must fail if the plugin is not available. There are some nuances to what that means that we should discuss. (Does open really fail, or do we just fail to initialize the cache if we don't have the plugin? Probably the second. If so, can we still discard a dirty image cache without its plugin? We can for RWL; I think it should always be true.)
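
A minimal sketch of the "open succeeds, cache init fails" behavior, assuming a dlopen-style loader (none of these symbol or library names come from librbd):

    // Sketch: resolve an ImageCache plugin at image open. If the shared
    // object is missing, the open itself proceeds; only cache
    // initialization reports failure.
    #include <dlfcn.h>
    #include <iostream>
    #include <string>

    struct ImageCache;  // opaque; implemented by the plugin
    using CreateCacheFn = ImageCache* (*)();

    ImageCache* load_image_cache_plugin(const std::string& name) {
      // e.g. "rwl" -> libceph_image_cache_rwl.so (naming is an assumption)
      std::string soname = "libceph_image_cache_" + name + ".so";
      void* handle = dlopen(soname.c_str(), RTLD_NOW | RTLD_LOCAL);
      if (handle == nullptr) {
        std::cerr << "image cache plugin " << soname
                  << " unavailable: " << dlerror() << "\n";
        return nullptr;  // caller fails cache init, not the open
      }
      auto create = reinterpret_cast<CreateCacheFn>(
          dlsym(handle, "create_image_cache"));
      return create != nullptr ? create() : nullptr;
    }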

A plugin can be distributed in a separate package. PassThroughImageCache should be included with librbd. A statically linked (with PMDK) RWL plugin could be included in the librbd rpm, but that might not be a good idea. We'd like RWL to be distributed in its own rpm (deb, whatever) so that package can depend on libpmemobj, libfabric, etc. That way RWL can use the installed PMDK (otherwise it really can't).

As noted here, there's a Ceph build option related to this (WITH_PMEM_PKG). What that option means and does needs to be nailed down as part of this work. That implies working out how an RWL plugin can be built and tested in every Ceph PR, even when the build machines don't have PMDK or libfabric installed.
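
One way to keep every PR build green on machines without PMDK is a compile-time guard around the plugin factory. A sketch, where the WITH_RBD_RWL guard name and the factory shape are assumptions:

    // Sketch: compile the RWL plugin only when PMDK is available, so PR
    // builds without libpmemobj/libfabric still build and can at least
    // test PassThrough. The guard name is an assumption.
    struct ImageCachePlugin {
      virtual ~ImageCachePlugin() = default;
    };

    #ifdef WITH_RBD_RWL
    #include "ReplicatedWriteLog.h"  // assumed header; pulls in libpmemobj
    static ImageCachePlugin* make_rwl_plugin() {
      return new ReplicatedWriteLogPlugin();  // provided by the RWL plugin
    }
    #else
    static ImageCachePlugin* make_rwl_plugin() {
      return nullptr;  // RWL not built here; callers fall back
    }
    #endif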

Making ImageCache modules plugins has the additional implication that enabling the image-cache feature should be generalized. Currently that's just a feature bit: on open, the image-cache bit causes librbd to enable RWL and the ImageWriteBack object under it. This should take arguments, and enable the configuration of a combination of (stacked) write-back caches. For example, a user might stack an instance of RWL using 256M of pmem on top of another instance of RWL using a mirrored NVMe-oF SSD (or some future SSD-based HA write-back cache). I'd suggest that initially we support configuring RWL, PassThrough, or both. Unit tests that don't build RWL can at least build and test the enabling of PassThrough. If we allow combining multiple instances of the same cache plugin with different parameters, the ImageCache constructor will need to gain a "layer" argument, which the modules can use in the names of the files (etc.) they create to implement the cache.
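
A sketch of the "layer" argument idea; the constructor signature and the file-naming scheme are assumptions:

    // Sketch: give each stacked cache instance a layer index so two
    // instances of the same plugin don't collide on the files they create.
    #include <cstdint>
    #include <string>

    class ReplicatedWriteLog {
     public:
      ReplicatedWriteLog(std::string image_id, uint32_t layer,
                         uint64_t cache_bytes)
          : m_image_id(std::move(image_id)),
            m_layer(layer),
            m_cache_bytes(cache_bytes) {}

      // Each layer gets a distinct backing file, e.g.
      //   rbd-rwl.<image>.l0.pool  (256M pmem layer on top)
      //   rbd-rwl.<image>.l1.pool  (NVMe-oF-backed layer beneath it)
      std::string pool_file_name() const {
        return "rbd-rwl." + m_image_id + ".l" +
               std::to_string(m_layer) + ".pool";
      }

     private:
      std::string m_image_id;
      uint32_t m_layer;
      uint64_t m_cache_bytes;
    };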

Allowing ImageCache modules to be configured in layers may mean that the "image-cache discard-dirty" command needs to selectively remove just one layer. If someone stacked an unreplicated RWL on top of a replicated one for some reason, they might want to discard only the one on top (flushing the next one down from its replica).

Stacking replicated write-back caches introduces some other issues. If they don't replicate to the same node, how do the replicas ever flush when the client fails? The replica for the top cache layer can't flush unless it can write to the cache layer below, and if the cache layer below is replicated to a different node, that won't be possible. One solution is to require all replicated caches to replicate to the same node, and all fail over at the same time (making that node's replica of each layer the master for its layer, and enabling writes down through the stack). Another solution is to flush the replica of the lowest layer first, then remove that layer before the replica of the next layer up is flushed; each layer's replica writes directly to ImageWriteBack. The simplest solution might be to allow only one layer to use replication, or only one layer to be write-back.
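
A sketch of the second approach (flush the replicas bottom-up, retiring each layer once it's clean); all the interfaces here are assumptions:

    // Sketch: on client failure, flush and retire the lowest replicated
    // layer first, so each higher layer's replica can then write directly
    // to ImageWriteBack.
    #include <memory>
    #include <vector>

    struct CacheReplica {
      virtual ~CacheReplica() = default;
      // Write this replica's dirty entries down to ImageWriteBack.
      virtual int flush_to_image() = 0;
    };

    // layers[0] is the top of the stack; flush from the bottom up.
    int flush_failed_client_stack(
        std::vector<std::unique_ptr<CacheReplica>>& layers) {
      while (!layers.empty()) {
        int r = layers.back()->flush_to_image();
        if (r < 0) {
          return r;  // can't retire this layer; leave the stack intact
        }
        layers.pop_back();  // clean; the next layer up can now flush
      }
      return 0;
    }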

SSD Support

We promised the community SSD support in RWL.

Wiki design discussion / document TBD.

It probably makes the most sense for an SSD write-back cache and RWL to be distinct things to the user. This probably means refactoring RWL so the logic near the ImageCache (incoming) and ImageWriteBack (outgoing) interfaces is in separate classes. That would allow an RWL class and some new SSD WBC class to reuse the same support logic for overlap detection, throttled ordered flushing, etc.
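
A sketch of what that refactor might look like; the class and method names are assumptions:

    // Sketch: hoist the logic near the ImageCache (incoming) and
    // ImageWriteBack (outgoing) interfaces into a shared base, so an SSD
    // cache can reuse overlap detection and ordered flushing.
    #include <cstddef>
    #include <cstdint>

    class WriteBackCacheBase {
     public:
      virtual ~WriteBackCacheBase() = default;

     protected:
      // Shared support logic, independent of the backing store.
      bool detect_write_overlap(uint64_t off, uint64_t len);
      void schedule_ordered_flush();

      // Backend-specific persistence.
      virtual int append_log_entry(uint64_t off, const void* data,
                                   std::size_t len) = 0;
    };

    // pmem-backed, RDMA-replicated log (today's RWL).
    class ReplicatedWriteLog : public WriteBackCacheBase {
      int append_log_entry(uint64_t off, const void* data,
                           std::size_t len) override;
    };

    // Future SSD-backed write-back cache reusing the same support logic.
    class SsdWriteLog : public WriteBackCacheBase {
      int append_log_entry(uint64_t off, const void* data,
                           std::size_t len) override;
    };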

Since you can't RDMA to an SSD, the SSD-based write-back cache won't be HA.

First IO hangs if RWL can't be initialized

In a vstart cluster, run rbd-bench on an image with RWL, and tell it not to flush on close. This leaves a dirty image cache behind.

Now run bench again with NODE_NAME set to something else. Open will succeed, but we won't hold the exclusive lock yet (this is by design). On the first IO, the exclusive lock acquire will proceed up to the PostAcquire stage. There, once we actually have the lock, we finally do all the open-related initialization that requires holding the lock; opening the journal or an image cache happens here.
In this case, we've left a dirty image cache behind for the host containing the vstart cluster, but this instance of librbd believes its hostname is something else. RWL init will now fail, as it must when a dirty cache exists somewhere else.

Now it's too late to fail the open, but we're supposed to get some kind of failure that tells rbd bench it should fail and quit. What actually happens is we retry the exclusive lock forever.

Jason says this is the same path that we'd take if we failed to open the journal.

Good next steps might be to force the journal init failure case to happen and see if it has the same problem, or at least to inspect that code path and see why image cache init doesn't behave the same way.
It's possible something in the RWL patch has broken this; if the journal failure path works on master but doesn't work here, that would narrow it down.
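
Whatever the fix is, it probably amounts to classifying PostAcquire failures so that a cache-init error fails the pending IO instead of re-queuing the lock acquire. A sketch under that assumption (the state names, error codes, and predicate are illustrative only):

    // Sketch: distinguish retryable lock-race failures from fatal
    // open-time init failures discovered in PostAcquire.
    #include <cerrno>

    enum class LockStep { Acquire, PostAcquire };

    bool should_retry_lock(LockStep step, int r) {
      if (step == LockStep::Acquire) {
        return true;  // losing the lock race is retryable
      }
      // PostAcquire runs open-time init (journal, image cache). An error
      // like "dirty cache exists on another node" won't clear on retry,
      // so propagate it to the waiting IO instead of looping forever.
      return r != -EEXIST && r != -EINVAL;  // illustrative codes only
    }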
