
dlsnode's People

Contributors

geod24, iain-buclaw-sociomantic, joseph-wakeling-frequenz, leandro-lucarella-sociomantic, mathias-baumann-sociomantic, mathias-lang-sociomantic, nemanja-boric-sociomantic

dlsnode's Issues

Add tests for checkpoint service

The checkpoint service is a feature that must be protected against regressions. There are several tests that can be added, some of which are very easy and some more complicated to implement.

Easy:

  • Write into multiple channels and expect all buckets to appear in the checkpoint file exactly once (see the sketch below)
  • Confirm that the checkpoint file doesn't exist after a clean exit of the node
  • Start the node with existing data and a checkpoint file and expect the buckets to be truncated at the right spots

Medium:

  • Write into the channels data spread over more than <"number of cached files" constant> buckets, so that some buckets get closed; confirm that they appear immediately after opening the new channel, and after that expect that they are not found there

Hard:

  • Open new buckets while the node is inside a checkpoint dump cycle, so that the checkpoints are scheduled to be dumped at the end of the cycle
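
A purely hypothetical D sketch of the first easy check ("each bucket appears exactly once in the checkpoint file"); the three helper functions are stubs standing in for whatever the real test harness provides, not actual APIs:

// Hypothetical sketch -- the helpers below are placeholders, not real APIs.
void writeRecords ( string channel, size_t count ) { /* stub */ }
void triggerCheckpointDump ( ) { /* stub */ }
string[] readCheckpointBucketPaths ( string checkpoint_path ) { return null; } // stub

void checkCheckpointUniqueness ( string[] channels )
{
    foreach (channel; channels)
        writeRecords(channel, 1_000);      // write some records per channel

    triggerCheckpointDump();               // force a checkpoint dump

    // Count how many times each bucket path is listed in the checkpoint file
    size_t[string] seen;
    foreach (bucket_path; readCheckpointBucketPaths("checkpoint.dat"))
    {
        if (auto count = bucket_path in seen)
            ++(*count);
        else
            seen[bucket_path] = 1;
    }

    foreach (bucket_path, count; seen)
        assert(count == 1, bucket_path ~ " appears more than once in the checkpoint file");
}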

Add systemd support

Systemd support should be added in 3 stages:

  • Create the unit file and add it to the package (when available); see the sketch below
  • Deploy the unit file (deploying the package when available)
  • Switch the servers running the app to systemd
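
For the first step, a minimal sketch of what such a unit file could look like (all paths and the command line are placeholders, not the actual packaging layout):

[Unit]
Description=DLS node
After=network.target

[Service]
Type=simple
# Placeholder binary path and arguments; adjust to the packaged layout.
ExecStart=/usr/sbin/dlsnode --config /etc/dlsnode/config.ini
WorkingDirectory=/srv/dlsnode
Restart=on-failure
User=dlsnode
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Once the file is installed under /etc/systemd/system/ (or shipped by the package), the switch-over on a server is a matter of systemctl enable --now dlsnode.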

Cache FileSystem layout and iterate over it

A PR from July 2017 has been lurking unmerged in the old private dlsnode repo. Here's Nemanja's description of it:


This is a WIP patch, with lots of work still to be done in terms of tidying up code/fixing up commits (I want to start having CI support for the final touches), so it's not yet ready for general review.

The patch itself introduces four main things:

  • A B-Tree data structure, managed on glibc's heap, and the auxiliary tools for making that possible
  • A FileSystemCache used to build the initial view of the file system and track its changes. It uses the B-Tree for storing the filesystem data
  • A FileSystemLayout which uses the range primitives and iterates over the files in the cache for the given range.
  • A StorageEngine/StorageEngineStepIterator which now builds and uses the cache to do the iteration, instead of the old directory iteration/stat method.

I downloaded the patch of the PR and tried to apply it to this repo, but it failed:

git apply 288.patch
error: patch failed: src/dlsnode/storage/BufferedBucketOutput.d:337
error: src/dlsnode/storage/BufferedBucketOutput.d: patch does not apply
error: patch failed: src/dlsnode/storage/StorageEngine.d:197
error: src/dlsnode/storage/StorageEngine.d: patch does not apply
error: patch failed: src/dlsnode/storage/BucketFile.d:18
error: src/dlsnode/storage/BucketFile.d: patch does not apply
error: patch failed: src/dlsnode/storage/iterator/StorageEngineStepIterator.d:41
error: src/dlsnode/storage/iterator/StorageEngineStepIterator.d: patch does not apply

I don't have time now to look into applying this properly, so I'll just upload the patch here for posterity:
288.patch.txt

Nemanja said that this is a useful PR that was tested but wasn't merged because we planned to install the DLS nodes on servers with SSDs (an alternative way of speeding up file access). He also mentioned that the BTree implementation in the patch was merged to ocean: sociomantic-tsunami/ocean#210.

Add DLS tests for more realistic scenarios

There should be a set of DLS tests covering real-world scenarios: multiple writers, simultaneous reads/writes, many combinations of iterations/writers, etc. In the current test suite there's no way, for example, to ignore the last chunk of the data (thereby forcing the node to flush), and no way to trigger fiber race conditions (as there's only one active fiber at a time).

Crash in Neo & AsyncIO

Two nodes crashed with the following backtrace:

(gdb) bt
#0  0x000000000087ddea in dlsnode.util.aio.internal.AioScheduler.AioScheduler.handle_(ulong).__foreachbody3764(ref dlsnode.util.aio.internal.JobQueue.Job*) (this=0x7ffc1f1f3810, 
    __applyArg0=0x7ffc1f1f3810) at ./src/dlsnode/util/aio/internal/AioScheduler.d:199
#1  0x000000000087e047 in swarm.neo.util.TreeQueue.TreeQueue!(dlsnode.util.aio.internal.JobQueue.Job*).TreeQueue.opApply(int(ref dlsnode.util.aio.internal.JobQueue.Job*) delegate).__dgliteral499(ref ulong) (this=0x7ffc1f1f38a0, value_=0x7ffc1f1f38a0) at ./submodules/swarm/src/swarm/neo/util/TreeQueue.d:101
#2  0x000000000082ac49 in swarm.neo.util.TreeQueue.TreeQueueCore.opApply(int(ref ulong) delegate) (this=0x0, dg=...) at ./submodules/swarm/src/swarm/neo/util/TreeQueue.d:464
#3  0x000000000087e01a in swarm.neo.util.TreeQueue.TreeQueue!(dlsnode.util.aio.internal.JobQueue.Job*).TreeQueue.opApply(int(ref dlsnode.util.aio.internal.JobQueue.Job*) delegate) (this=0x7f20d0759888, dg=...) at ./submodules/swarm/src/swarm/neo/util/TreeQueue.d:98
#4  0x000000000087dd78 in dlsnode.util.aio.internal.AioScheduler.AioScheduler.handle_(ulong) (this=0x7ffc1f1f3910, n=140720830626064)
    at ./src/dlsnode/util/aio/internal/AioScheduler.d:196
#5  0x00000000007c9ea4 in ocean.io.select.client.SelectEvent.ISelectEvent.handle(ocean.sys.Epoll.epoll_event_t.Event) (this=0x0, event=1)
    at ./submodules/ocean/src/ocean/io/select/client/SelectEvent.d:147
#6  0x000000000086cf0f in ocean.io.select.selector.SelectedKeysHandler.SelectedKeysHandler.handleSelectedKey(ocean.sys.Epoll.epoll_event_t, bool(Exception) delegate) (
    this=0x7ffc1f1f39a0, unhandled_exception_hook=..., key=...) at ./submodules/ocean/src/ocean/io/select/selector/SelectedKeysHandler.d:170
#7  0x000000000086ce89 in ocean.io.select.selector.SelectedKeysHandler.SelectedKeysHandler.opCall(ocean.sys.Epoll.epoll_event_t[], bool(Exception) delegate) (this=0x0, 
    unhandled_exception_hook=..., selected_set=...) at ./submodules/ocean/src/ocean/io/select/selector/SelectedKeysHandler.d:134
#8  0x00000000008714f8 in ocean.io.select.EpollSelectDispatcher.EpollSelectDispatcher.select(bool) (this=0x7f20d073c780, exit_asap=false)
    at ./submodules/ocean/src/ocean/io/select/EpollSelectDispatcher.d:836
#9  0x000000000087134f in ocean.io.select.EpollSelectDispatcher.EpollSelectDispatcher.eventLoop(bool() delegate, bool(Exception) delegate) (this=0x7f20d073c780, 
    unhandled_exception_hook=..., select_cycle_hook=...) at ./submodules/ocean/src/ocean/io/select/EpollSelectDispatcher.d:749
#10 0x000000000072f4e9 in dlsnode.main.DlsNodeServer.run(ocean.text.Arguments.Arguments, ocean.util.config.ConfigParser.ConfigParser) (this=0x0, config=0x7ffc1f1f3c30, 
    args=0x7ffc1f1f3c30) at src/dlsnode/main.d:355
#11 0x00000000007ba82b in ocean.util.app.DaemonApp.DaemonApp.run(char[][]) (this=0x0, args=...) at ./submodules/ocean/src/ocean/util/app/DaemonApp.d:538
#12 0x0000000000856c5d in ocean.util.app.Application.Application.main(char[][]) (this=0x7f20d0732400, args=...) at ./submodules/ocean/src/ocean/util/app/Application.d:260
#13 0x000000000072ee3c in D main (cl_args=...) at src/dlsnode/main.d:98
(gdb) p job
$2 = (dlsnode.util.aio.internal.JobQueue.Job *) 0x0
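
Frame #0 together with the p job output shows the foreach body over the queued Job pointers being entered with a null pointer. As a purely illustrative sketch (only the Job type and its module come from the backtrace; the function and variable names below are invented), a guard at that spot would turn the segfault into an observable condition:

import dlsnode.util.aio.internal.JobQueue : Job;

// Hypothetical stand-in for the loop in AioScheduler.handle_ (frame #0 above).
void handleReadyJobs ( Job*[] ready_jobs )
{
    foreach (ref Job* job; ready_jobs)
    {
        // gdb shows `job == null` at the crash site; skip (or assert/log)
        // rather than dereference it, so the root cause can be traced.
        if (job is null)
            continue;

        // ... existing completion handling of the job ...
    }
}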

Difference in storage engine performance for Neo on cold data

During the recent deployment of a test application we noticed a slowdown when using the Neo protocol, which we managed to pin down to an apparent difference in the node's performance when the page cache is empty. When the requested data is already paged into memory, the difference matches what we saw when we did the dry-run client tests.

  • The results with the data paged in:

Legacy:

Timing ---
Summary: Reading time:                                 60.85
Summary: Total dls time:                               60.85
Summary: Sorting time:                                 0.00
Summary: Reduction time:                               8.35
Summary: Writing time:                                 0.00
Summary: Copying time:                                 0.00
Summary: Total time:                                   69.20

Neo:

Timing ---
Summary: Reading time:                                 64.93
Summary: Total dls time:                               64.93
Summary: Sorting time:                                 0.00
Summary: Reduction time:                               8.13
Summary: Writing time:                                 0.00
Summary: Copying time:                                 0.00
Summary: Total time:                                   73.06

  • The results with the data not in cache:

Legacy:

Summary: Reading time:                                 62.08
Summary: Total dls time:                               62.08
Summary: Sorting time:                                 0.00
Summary: Reduction time:                               8.49
Summary: Writing time:                                 0.00
Summary: Copying time:                                 0.00
Summary: Total time:                                   70.57

Neo:

Timing ---
Summary: Reading time:                                 81.66
Summary: Total dls time:                               81.66
Summary: Sorting time:                                 0.00
Summary: Reduction time:                               8.42
Summary: Writing time:                                 0.00
Summary: Copying time:                                 0.00
Summary: Total time:                                   90.09

The results are consistent across multiple testing rounds.
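
For reference, one common way to force the cold-cache case between rounds is to drop the page cache on the test host (requires root; this is the standard Linux procfs knob, nothing DLS-specific):

# flush dirty pages, then drop the page cache plus dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches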

This shows that there's no unexpected difference in the Neo protocol overhead: when serving data that's already in the page cache, Neo achieves approximately the same performance. This is good news, since we applied all the tricks we could think of in the swarm and dlsproto implementations. The only thing we never tested nor optimized is the dlsnode storage engine's Neo path, which differs from the legacy one.

Evaluate `errors=remount-ro` mount option for DLS

In case of an underlying IO error, all bets are off. It might be useful to assume that not all servers will encounter an IO error at the same time, so we can let the filesystem remount itself read-only on error, still providing read access to consumers while redirecting writers elsewhere.
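
For illustration, the option is set per filesystem, e.g. in /etc/fstab (device and mount point here are placeholders for the DLS data partition):

# On an IO error the kernel remounts the filesystem read-only instead of
# continuing to write to it.
/dev/sdb1  /srv/dlsnode/data  ext4  defaults,errors=remount-ro  0  2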

Remove StorageProtocolLegacy from dlsnode

During the transitional period to storage protocol V1, we had to support the legacy storage protocol (the protocol which supported buckets without a bucket header) for as long as we had data stored like that. The need for that support is long gone and we can remove all references to it.
