simerplaha / swaydb Goto Github PK

View Code? Open in Web Editor NEW

288.0 16.0 15.0 20.42 MB

Persistent and in-memory key-value storage engine for JVM that scales on a single machine.

Home Page: https://swaydb.simer.au

License: Apache License 2.0

Scala 97.89% Java 2.11%

embeddable type-safe multiple-disks persistent in-memory key-value-store storage-engine scala java kotlin

swaydb's People

Contributors

Stargazers

Watchers

Forkers

gitter-badger myuwono frosner cih-y2k timothyklim sh0hei andrewts129 lvitaly ramy-ahmed wiltonlazary algobardo desmondyeung hemantgs akalexus clicking34

swaydb's Issues

Support for committed and uncommitted inserts for distributed databases

Distributed databases implementing algorithms like RAFT require temporarily inserted entries which are eventually committed by the RAFT leader.

Existing solutions would require these entries to be re-read and then committed. This creates 3 times more IO (1 key read + 1 value read + 1 key-value insert). Which means that for each write there needs to be additional reads and another write that adds a lot to read, write, space & compaction requirement.

This can simply be avoided by implementing a commit api where a key or range of keys can be committed without requiring to read the key or value.

Implement range API

Deletes keys between the ranges fromKey & toKey (exclusive)

db.remove(fromKey, toKey)

Implement Future API for reads

When two concurrent threads are reading a Segment, the second thread waits for the other thread to decompress data using AtomicBoolean which is similar to blocking.

The blocking API should be optionally enabled and a new Future API is required for reads.

The Future api should also optionally allow for two ExecutionContexts to be provided. One for compaction and other for reads.

RootMap should allow for SubMap iteration

Directly copy non-overlapping Segments into a Level instead of merge it with an existing Segments during compaction

v0.2 and older were merging new Segments into one of the existing Segments for key-values did not overlap with the Segment and only copied Segments if the Level was empty. Now the compaction will check if non-overlapping Segments exists in the compaction request and will simply copy them into the Level which reduces compaction workload.

This has also improved the write speed. Full benchmarks will be updated after the next release on the website.

Warm up start to init id variables in generated id classes

headId and lastId require warm up start for the macros to pre-fetch the min and max ids otherwise the first access to these variables will cost initial access speed during read.

This can easily be avoided by initialising these variables on boot-up.

Code coverage

Setup travis to run test cases and provide code coverage.

Match error on 'null' when key-values are being merged in to the Levels with grouping/compression enabled

MergeList is eventually returning 'Some(null)' on headOption during the merging process. This is probably due to mutable operations on MergeList.

The error is not critical as merge eventually succeeds on merge retry but a fix is require for performance.

Improve merge performance when grouping key-values

Performance of Grouping key-values during merge can be improved.

allow updates using Scala functions or Java lambdas

Currently key-values can be updated using one of the write APIs.

This release will allow updating key-values using any Scala or Java functions.

This task is complete. I will pull in the changes soon.

Increase accuracy of memory usage for the configuration setting cacheSize

The number of in-memory key-values for both Persistent and Memory databases require a more accurate calculation of total memory usage (cacheSize).

Currently for a Group of a key-value the total size of decompressed Group's byte size is used. This decompressed Group size does not account for internal compression (prefix compression for key, value, offset and length).

For example: if the value is fully compressed with previous or if the key is partially compressed with previous can lead to more bytes being allocated to the decompressed key-value in-memory which is not accounted for in the LimitQueues.

A temporary solution is to give cacheSize small enough size to account for "internal compression".

Controlled concurrent compaction to avoid IO congestion

Levels run independant compaction which can randomly lead to IO congestion where all Levels are writing Segments to disks at once (for persistent databases).

We can limit the number of threads in the ExecutionContext. But this still does not stop IO congestion from occurring when there are more than 1 thread.

There needs to be a configuration to set the number of concurrent compaction and prioritised compaction based on the number of Segments in the Level. For most cases compaction in upper Levels should always be prioritised so that there is always enough room for Level0 to flush key-values.

Prioritising concurrent compaction is also dependant on the type of application. If the application is read heavy then a different configuration will be required then if the application is write heavy. But the default configuration should be enough for most common use-cases.

Replace Batch API with prepare

Batch API can be replaced with a prepare function which will return a database instance that provides the same APIs as a normal Map or Set database instance which might make the Batch more familiar and easier to work with.

This API can be similar to map.maps API in extensions.

So for example instead of

val batches =
  Seq(
    Batch.Put(key = 1, value = "one value"),
    Batch.Remove(key = 1),
    Batch.Update(key = 1, value = "value updated"),
  )

db.batch(batches)

use prepare and commit

val preparedWrites =
  db.prepare.put(key = 1, value = "one value") ++
    db.prepare.remove(key = 1) ++
    db.prepare.update(key = 1, value = "value updated")

db.commit(preparedWrites)

Kotlin API

Kotlin and Scala have very similar Iterable syntax. Having official support for Kotlin would be a good milestone to achieve.

Also see if Jar file size could be further reduced for mobile deployment. There are not many external dependency other than bloom-filter-scala, Snappy & LZ4.

Split stress, performance, integration & unit test-cases

Test cases should be split so that they are easy to execute and easy to setup CI.

Implement persistent.Queue[T] and memory.Queue[T]

This is one of the goals. Queuing solutions are used often and having a Queue[T] implementation for both memory and persistent databases similar to Map[K, V] and Set[T] is needed.

This does not require any changes to database internals but can be implemented as an extension.

I'm guessing similar to how maps can be extended with map.maps we can implement a map.queues which will return set of queues (Set[Queue[T]]) under each Map[K, V]. Therefore each existing Map[K, V] can have it's own set of Queue[T].

`lower` should do equality check on previous key before returning result.

lower should do equality check on previous key-value when both previous and next key-values are know in the Segment before returning previous as the lower key-value for the input key.

Improve iterator performance for each new key request from client.

#37 improves iteration performance to find the next highest or lowest key from upper & lower level within the Levels.

But to have major impact on iterators submitted by clients/users (per key iterators) Seek pointers can be resubmitted to these per Level iterators to continue iteration from the previously fetched successful key. So basically maintaining state within the user created iterators of it's previous iteration position in each Level.

This would give a massive performance boost.

Watch-out: Since compaction happens asynchronously, on each start, these iterator will also need to check if the state of submitted iterator pointers are still valid i.e. if the state of the current Level has not changed for the keys the iterator is searching for. If the state has changed then the iterators should reset just the Lower Level's pointer and after a single seek the new Seek pointer returned for lower Level will be valid. Level State changes are not expected to occur as often as reads can occur. State changes can be 1 to 10 every second (depending on compaction speed) but reads can occur 300k to 600k every second so having a check to reset for every new iterator is not expensive.

Timer should be persistent for persistent databases

Currently Maps will use System.nanoTime as the start time for key-value's Time and increment by 1 for each new time. On restart current System.nanoTime is used again. But this would cause problem if the system clock changes.

Times needs to be persisted for persistent databases. This is planned for 0.7.1.

Implement an API to check if the registeredFunction is safe to remove.

registerFunction and applyFunction can be used to alter data using basic Scala or Java functions.

We also need to implement an API that checks if the registered function's functionID is safe to remove temporary from the function store.

Kotlin API

Kotlin and Scala have very similar Iterable syntax. Having official support for Kotlin would be a good milestone to achieve.

Also need to reduce the size of the JAR file for mobile deployment. There are not many external dependency other than bloom-filter-scala, Snappy & LZ4. The JAR file size could be reduced by changing the build file.

Support B-tree file format

Currently key-values can be written to two types of file formats.

Level0 - Append only log (also see #31)
Other Levels - And a skipList like file format with Groups.
In-memory databases with compression enabled also use the above file formats otherwise they simply write key-values to an in-memory skipList.

Having a mutable B-tree file format can be useful for certain application.

unmap memory-mapped files fails for java 9 +

Unmapping memory-mapped files fails for Java 9+. Need a fall back solution.

Scala.js & scala-native support

SwayDB does not have any external dependencies other than compression libraries (LZ4 and Snappy) which are optionally enabled.

Majority of the effects are implemented in Effect.scala. May be moving all java.io.* invocations to this class and providing an Effect.scala implementations for target platforms is required.

Implement repairAppendix

repairAppendix can be used to fix corrupted appendix files. This can be executed by supplying the Level's path and repair strategy.

SwayDB.repairAppendix[Int](
    levelPath = "/Disk1/myDB/1", //path of the level's folder.
    repairStrategy = AppendixRepairStrategy.Report
  )

Detailed documentation on the website www.swaydb.io/#api/repairAppendix.

This will be available in v0.1.1

TimeOrder & MergeOrder

TimeOrder - Implemented!

Now each key also stores a Time value i.e. the order in which the key was inserted. This Time value is used when applying atomic updates using Scala or Java functions.

Since reads and writes are both unaware of each other and we do not want to block, the Time value is used to determine if the currently read key has already been merged with previous Level's updates so that updates are applied atomically and are not duplicated.

Time is of type Time[T] where T can be of any type. SwayDB stores Time as raw bytes Time[Slice[Byte]].

Storage cost: The default Time implementation stores a Long requiring 1 to 8 bytes and is prefix compressed with previous key's Time. If the Long has no common prefix bytes with previous Time then it's stored as an unsigned Long to get maximum storage savings. GroupingStrategy can be configured to further compress the time values.

MergeOrder - Pending

TimeOrder can be updated to also include (MergeOrder) information where a single key can have multiple updates at different machine clock times but the the MergeOrder can override this machine clock time.

This will be mostly useful for distributed databases where multiple machines can insert updates to SwayDB in any order and the actual updates or the sequence in which these updates get applied is determined by mergeOrder which is just a function and will be consistent on all machines therefore resulting in a consistent database state on all machines.

For example: MergeOrder[CustomInsertTime] can be the following class.

case class CustomInsertTime(machineId: Int, machineLocalTime: Long)

and the order can be provided to on both the fields. As long as this Ordering[CustomInsertTime] is consistent on all machines, the state of the database will be consistent regardless of the order in which the data was actually inserted into SwayDB.

MergeOrder does not require any new implementation or file format changes as it can be a part of the Time value and a minor change is required to the merge implementation. Updating all the test-cases to account for mergeOrder will be the largest task for this implementation.

Allow Level0 to be a standalone level

A database with only a single Level (Level0) should be allowed.

A single leveled database will not run compaction. It would be a simple single Level setup backed by immutable append only log files. New log files will get added as they get full.

This type of configuration can be useful for applications requiring a small set of immutable data or data that rarely changes but require quick read and write access without any background compaction.

Status - Implemented!

Pending task - requires changes to Config api to set Level1 as None.

Implement compression for persistent files

Implement in-built database extensions to support multiple data types

This extension will be optional.
The extension will include implementation for data types such as Map, List, Queue, Set etc.

Currently the master branch includes implementation for creation of multiple maps within the database and will be available in the next release.

Performing update using Scala or Java functions

This is one of the goals - to submit any Scala or Java function to update data instead using a query language like SQL.

Any Scala or JVM libraries like Circe - JSON can be used to store and update data.

The implementation for this complete and will be pushed with 0.7 (#33).

Initial Travis CI test automation setup

Overview

We need a way to automate tests to increase confidence in components and features. Completion of this task will enable smoother collaboration.

Acceptance Criteria

Hook up Travis CI integration to project

Handling seek during Level state changes

Seeks can stash key-values read from upper & lower Level to avoid duplicate reads and can also Stop reads on either Level if the previous read from the Level returns empty. But since compaction occurs asynchronously the state of the Level can change making the previous read stale.

There needs to be a way to detect state changes in the Level from the Level's last Seek and if the state was updated then the previous Seek pointer from that Level is invalid and the Level should be re-read.

This can be resolved by performing hasPut check to find the next highest or the lowest Segment. If the Segment returned has not changed from previous check this means the previous read is still valid or else it's stale and the Level should be re-read. This check will only be required when the Seek pointer is set to Seek.Stop or Seek.Stash. This is a lightweight check requiring only the Segment footer to be read (usually cached) therefore not a costly operation.

This implementation will give the iterators a good performance upgrade mentioned in #42.

Update DB instances to use Scala's Map and Set instead of extends Iterable.

Extending Iterable with custom APIs deviates SwayDB's API from Scala's Map and Set APIs which I think adds unnecessary learning for us.

Need to provide Scala's MapLike or SetLike implementations for swaydb.Map or swaydb.Set database instances respectively.

Extensions can implement scala.collections.mutable.MultiMap.

Edit: May be instead of getting rid of Iterable another API similar to .asScala in JavaConverters can be implemented that converts a swaydb.Map or Set to a scala.collection.mutable.Map or Set.

The default implementations will support returning backpressure meters on each write and .asScala can be used where back-pressure or the Try return type is not required.

KeyValueLimiter should not allow dropping of Memory.Group key-values

Allowing partial commits of Segments being batch compacted.

Currently as Segments are batch compacted from one Level to another a failure to commit any one Segment will revert the entire compaction. The compaction is retried again based on the throttle configuration.

These failure can occur if a Segment is closed(maxSegmentsOpen) while its being read.

To avoid retries and to save IO partial commits of the batched Segments should be allowed - to write as many Segments as possible and alert the upper Level of the partial commit where only the failed Segments will be retried.

Updating key-values with functions

Allow updating key-values with functions

//increments all values by 1
db.update(from = 0, to = 1000, updater = oldValue => oldValue + 1)

Skipping Levels during compaction

Segments that do not overlap with existing Segment in the Level should be skipped and pushed to the next Level when pushForward flag is set to true in the configuration.

In some cases this can reduce IO and increase overall database performance.

The logic to skip Segments is required here. Currently Segments are forwarded only when both the current Level and next Level are empty.

Performance improvement for forward & backward iterators when there are too many dead updates within the current Level's iteration

Iterators can be slower if there are too many dead updates between two keys.

Put checkpoint can be added or something similar to help iterators skip dead updates entirely jump to the next valid key instantly.

A test case would be: Suppose there are 100,000 keys. Keys 1 & 100,000 are valid keys and updates were submitted for keys 2 to 99,999 that do not exist. This effects iterator performance when reading from key 1 to 100,000.

assertLevel() function for test-cases

All Levels (including Level0) behave the same when it comes to reading data. Therefore test-cases on any variation of Level config should yield the same result. Reads before/after the merge or during concurrent merge should always have consistent output.

Implemented `assertLevel` test function

Given a use case it will run the same test multiple times with different Level configurations to get maximum test coverage

with throttling on and off
with key-values being in inserted into different Levels
with a group of key-values being merged into a single Level
different grouping strategy
etc

Implement TTL APIs

TTL APIs to allow for key-values to automatically expire/remove from the database after a finite duration.

//put & expire
db.put(key = 1, value = "one", expireAfter = 1.day)
//expire existing keys
db.expire(key = 1, after = 10.seconds)
//expire range of key-values
db.expire(from = 1, until = 100, after = 1.hour)

Update APIs should allow to modify values without effecting the existing TTL/Expiry set for the key.

//update single key
db.update(key = 1, value = "one updated")
//update range
db.update(from = 1, until = 100, value = "updated")

Non-blocking reads using Monix

scala.Iterable[T] is blocking and requires the use of IO.Async[T].safeGetBlocking.

An in-build streaming support to provide non-blocking & back-pressured reads using IO.Async[T].safeGetFuture instead of IO.Async[T].safeGetBlocking is required.

I haven't compared all libraries yet but Monix sounds like a better option since it compiles to Scala.js and can interop with Akka-streams and FS2.

Allow Level0's Map files to be used as Segments in other Levels.

Currently Map files are only allowed in Level0. Having these file optionally available for other Levels will increase compaction speed specially in upper Levels.

Map file should extend Segment.

Simplify API to create a DB instance

Change

val db = SwayDB.persistent[K, V](...)
val db = SwayDB.persistentSet[T](...)

val db = SwayDB.memory[K, V](...)
val db = SwayDB.memorySet[T](...)

to similar to how a basic Scala collection would be initialised

import swaydb._

val map = persistent.Map[K, V]()
val set = persistent.Set[T]()

val map = memory.Map[K, V]()
val set = memory.Set[T]()

Java 8 API

Possibly an independant library for Java syntax with code samples.

Use trampolining if there is no performance penalty in Higher, Lower & SegmentMerger

Higher, Lower, Get & SegmentMerger are becoming to be large functions because they need to be tailrec for performance and to avoid any StackOverflow for deep seeks. Splitting these implementations to smaller dedicated functions would be better and easier to write tests for.

Problem

These implementations can run a minimum of 300k+ iterations per second to fetch a minimum of 300k key-values per second performance.

Using trampolining would require extra objects being created per iteration which could add to GC workload, if for each key read, it created another trampoline object which would double the number of objects created per second. And that is no good for performance.

Looking for alternatives and/or advice on this.

Assert that fromKey < toKey on range operations

Range update should perform splits if the range covers over too many Segments

Range writes - Remove range, Expire range & Update range can spread out to multiple Segments if the difference between from and to is too wide. In these cases ranges should be split to perform update in batches.

Implement backup

Back up all databases files to target path.

db.backup(path: Path)

cacheFunction updates can be applied multiple times due to non blocking

Since cacheFunction are being applied in real time, a faster compaction can result in these updates being applied multiple times during reads as read do not block compaction.

These updates will eventually be consistent but a check during read is required to assert that a function only get applied once. This information can be stored in the read context.

Actor timerLoop should control message overflow

Currently when Actor.timerLoop has a message overflow it process most messages in the inbox leaving approximately 1/4 messages to be processed for next interval. It will also reduce the next delay by half if the previous interval had too many messages.

Although this works ok for controlling overflow it still needs a more smoother approach which would improve the overall read performance.