There is a terrible bug in WP Core that specifically affects persistent object caches like this one:
https://core.trac.wordpress.org/ticket/31245
(The thread is mostly about the idea of replacing the whole `alloptions` system with a one-at-a-time approach when a persistent object cache is installed. IMHO this is a bad plan, and predictably nothing has happened because of concerns over back-compat with all the existing `object-cache.php` implementations. If you get to the bottom you'll see my proposals, which explain our much more straightforward and probably more performant solution.)
The bug has been known for 10 months, but IMHO it goes back much farther and accounts for white-screen-of-death bugs on our site even back when we used the APC Object Cache. The bug isn't specific to Redis in any way, but in our testing we were able to fix it at the object caching layer by using a simple action hook we added to `WP_Object_Cache->set()`, so if you want, you can fix this for other users of your plugin by integrating the same fix (at least until Core fixes it, maybe in 4.5, but I won't hold my breath).
To summarize the situation:
- As most devs know, WP fetches all options with autoload=1 in a single query at the beginning of the pageload to avoid hundreds of fast but separate MySQL queries.
- When a persistent object cache is installed, the cache has no way to query by "autoload=1", so WP instead stores the entire set of autoloaded options in a single cache entry with name=alloptions group=options.
- When a pageload updates an autoload=1 option it updates the object cache for that option, but it also has to update `alloptions`, since that entry isn't reloaded automatically.
And now the bug:
- When a pageload updates `alloptions`, it takes the copy it got when it first fetched `alloptions` from the object cache, inserts the changed value, and saves the whole array back.
- When two concurrent pageloads change separate options, each writes back its own stale copy, so the change from whichever saves first is overwritten by the other.
- The result is that options WP expects to find in `alloptions` are totally missing, and in many cases WP can't handle that at all, resulting in WSOD in our case.
- The whole wp-cron system exacerbates the problem dramatically on sites that are rarely visited, since cold caches coincide with the need for a cron refresh and create the perfect storm for the bug to thrive in.
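The overwrite described above can be reproduced without WordPress at all. What follows is a minimal plain-PHP sketch, not core's actual code: `$cache` stands in for the persistent object cache, `$request_a` and `$request_b` are hypothetical names for the two pageloads' in-memory copies, and the option names and values are made up.

```php
<?php
// Plain-PHP model of the race; nothing here is WordPress API.

// Shared persistent cache (stands in for Redis, APC, etc.).
$cache = [
    'alloptions' => ['siteurl' => 'http://example.test', 'blogname' => 'Old name'],
];

// Both requests fetch alloptions at the start of their pageload.
$request_a = $cache['alloptions'];
$request_b = $cache['alloptions'];

// Request A adds an autoloaded option and writes its whole copy back.
$request_a['widget_text'] = 'added by A';
$cache['alloptions'] = $request_a;

// Request B changes a different option, but starts from its stale copy.
$request_b['blogname'] = 'New name';
$cache['alloptions'] = $request_b;

// A's option has vanished from the cached array.
var_dump(isset($cache['alloptions']['widget_text'])); // bool(false)
```

Neither request did anything wrong on its own; the loss comes purely from writing back a full array that was read before the other request saved.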
As I said above, the main solution being considered in the ticket was to remove `alloptions` entirely and rely on either the speed of the object cache to make it okay to fetch each option separately, or a `get_multi()` functionality that would pull in all autoloaded options with one query. This is complicated and likely to remain mired in compatibility concerns.
My solution is much simpler: _Just reload the `alloptions` cache (triggering a single MySQL query for autoload=1 options) whenever one of its autoloaded options is updated._
The result is that `alloptions` gets dropped a lot, but it reloads quickly, and on the vast majority of pageloads, where no options are changed, there is no effect on performance at all.
Solving this in core would involve replacing the system that manages updates to `alloptions` (pretty simple), but solving it at the `object-cache.php` level is even simpler: delete the `alloptions` cache whenever it's updated. It sounds counter-intuitive, but the effect is basically the same as if core just didn't update the value at all: `alloptions` is reloaded from the db, including updates done by any process, the next time it's required.
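To illustrate why dropping the entry on every write is enough, here is a small plain-PHP model of the idea. It is a sketch under assumptions, not WordPress code: `$db` stands in for the options table, `$cache` for the persistent cache, and `$cache_set` and `$load` are hypothetical helpers that model a patched `set()` and the reload-on-miss behaviour respectively.

```php
<?php
// Plain-PHP model; every name below is hypothetical and nothing here is WordPress API.
$db    = ['siteurl' => 'http://example.test', 'blogname' => 'Old name']; // options table
$cache = [];                                                             // persistent cache

// Model of the patched set(): a write to alloptions is discarded immediately.
$cache_set = function (string $key, $value) use (&$cache): void {
    $cache[$key] = $value;
    if ($key === 'alloptions') {
        unset($cache[$key]);
    }
};

// Model of loading alloptions: use the cache if present, else re-read the database.
$load = function () use (&$cache, &$db): array {
    if (!isset($cache['alloptions'])) {
        $cache['alloptions'] = $db; // the "single query" for autoload=1 options
    }
    return $cache['alloptions'];
};

// Two requests start from the same snapshot.
$request_a = $load();
$request_b = $load();

// Each saves its option to the database, then pushes its stale copy to the cache.
$db['widget_text']        = 'added by A';
$request_a['widget_text'] = 'added by A';
$cache_set('alloptions', $request_a);

$db['blogname']        = 'New name';
$request_b['blogname'] = 'New name';
$cache_set('alloptions', $request_b);

// The next request finds no cached entry, reloads, and sees both changes.
$fresh = $load();
var_dump(isset($fresh['widget_text']), $fresh['blogname'] === 'New name'); // bool(true) bool(true)
```

The stale copies are still written, but they never survive long enough for another request to read them, so the database stays the only source of truth.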
Here's the code we're using to fix this bug on our sites.
    /**
     * Redis object cache: Delete the alloptions cache whenever it is updated (SET) in the
     * object cache.
     *
     * Fixes bug in core that creates race condition when multiple processes update options in the alloptions cache.
     * By deleting the cache for all options we can be confident that any mistakes (inserting the new value in an out-of-date
     * version of the alloptions array) won't carry forward into other processes and cause WSOD.
     *
     * Shouldn't be necessary if that bug is ever fixed.
     *
     * Relies on 'redis_object_cache_set' action in redis' object-cache.php (WP_Object_Cache->set()) which we added manually for now (2015-12-09)
     *
     * @see https://core.trac.wordpress.org/ticket/31245
     */
    function gv_redis_object_cache_delete_alloptions($key, $value, $group, $expiration) {
        // Only react to writes of the combined autoloaded-options entry.
        if ('alloptions' === $key && 'options' === $group) {
            wp_cache_delete('alloptions', 'options');
        }
    }
    add_action('redis_object_cache_set', 'gv_redis_object_cache_delete_alloptions', 10, 4);
As it states in the PHPDoc, we're relying on an action we added to `WP_Object_Cache->set()` in `object-cache.php` for this to work, so anyone trying to use this will need to copy the action into their version of the file until/unless the plugin is updated with our proposed actions+filters from ticket #18.
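For anyone wondering what "copy the action into their version of the file" might look like, here is a rough sketch. It is not the plugin's real `set()` method: `Demo_Object_Cache` is a hypothetical in-memory stand-in, and the `add_action`/`do_action` stubs exist only so the example runs outside WordPress (inside WordPress the real functions are used). The only parts taken from our setup are the action name `redis_object_cache_set` and its four arguments.

```php
<?php
// Minimal stand-ins so the sketch runs outside WordPress (assumption: not the real hook system).
if (!function_exists('add_action')) {
    $GLOBALS['demo_hooks'] = [];
    function add_action($hook, $callback, $priority = 10, $accepted_args = 1) {
        $GLOBALS['demo_hooks'][$hook][] = [$callback, $accepted_args];
    }
    function do_action($hook, ...$args) {
        foreach ($GLOBALS['demo_hooks'][$hook] ?? [] as [$callback, $accepted_args]) {
            call_user_func_array($callback, array_slice($args, 0, $accepted_args));
        }
    }
}

// Hypothetical stand-in for the plugin's WP_Object_Cache class.
class Demo_Object_Cache {
    public $cache = [];

    public function set($key, $value, $group = 'default', $expiration = 0) {
        $this->cache[$group][$key] = $value; // the real method would write to Redis here
        // The one added line: announce the write so callbacks can react to it.
        do_action('redis_object_cache_set', $key, $value, $group, $expiration);
        return true;
    }

    public function delete($key, $group = 'default') {
        unset($this->cache[$group][$key]);
        return true;
    }
}

$demo = new Demo_Object_Cache();

// Same check as our fix, wired to the demo cache instead of wp_cache_delete().
add_action('redis_object_cache_set', function ($key, $value, $group, $expiration) use ($demo) {
    if ('alloptions' === $key && 'options' === $group) {
        $demo->delete('alloptions', 'options');
    }
}, 10, 4);

$demo->set('alloptions', ['blogname' => 'Example'], 'options');
$demo->set('blogname', 'Example', 'options');

var_dump(isset($demo->cache['options']['alloptions'])); // bool(false)
var_dump(isset($demo->cache['options']['blogname']));   // bool(true)
```

Firing the action after the write means the callback's delete always wins, and writes to any other key pass through untouched.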
If you wanted, you could add the same fix directly into `WP_Object_Cache->set()` without the hook; up to you. IMHO it would be better to keep it as a hook, since ideally it can be removed entirely once the core bug is fixed.
If you don't consider this something that your plugin needs to resolve, I understand completely. I wanted to lay out the parameters here either way in case someone else using the plugin has the same problems as us and wants to use our fix. Also, I'm hoping it emphasizes the value of having the actions+filters proposed in #18 added into your plugin so we don't need to keep patching it in the future to maintain our fix.
Thanks!