simerplaha / swaydb
Persistent and in-memory key-value storage engine for JVM that scales on a single machine.
Home Page: https://swaydb.simer.au
License: Apache License 2.0
Distributed databases implementing consensus algorithms like Raft temporarily insert entries that are eventually committed by the Raft leader.
Existing solutions require these entries to be re-read and then re-written on commit. This creates 3 times more IO (1 key read + 1 value read + 1 key-value insert), meaning that each write incurs additional reads and another write, adding to the read, write, space & compaction requirements.
This can simply be avoided by implementing a commit API where a key or a range of keys can be committed without having to read the key or value.
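An in-memory sketch of the commit API idea: temporarily inserted entries are committed by key range, so the caller never re-reads keys or values and never issues a second insert (`CommitLog` and its methods are illustrative assumptions, not SwayDB's API):

```scala
import scala.collection.mutable

// In-memory sketch of a range-commit API. Entries are inserted as uncommitted
// and later committed by key range in one operation: the caller supplies only
// the range bounds and performs no reads or re-inserts.
class CommitLog[K: Ordering, V] {
  private val uncommitted = mutable.TreeMap.empty[K, V]
  private val committed = mutable.TreeMap.empty[K, V]

  def put(key: K, value: V): Unit =
    uncommitted.update(key, value)

  // Commits all keys in [fromKey, toKey) without the caller reading them back.
  def commit(fromKey: K, toKey: K): Unit = {
    val range = uncommitted.range(fromKey, toKey)
    committed ++= range
    range.keys.toList.foreach(uncommitted.remove)
  }

  def get(key: K): Option[V] =
    committed.get(key).orElse(uncommitted.get(key))

  def isCommitted(key: K): Boolean =
    committed.contains(key)
}
```

In a real storage engine the commit would flip a flag during compaction rather than move entries between maps; the sketch only illustrates the API shape.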
Deletes the keys between `fromKey` & `toKey` (`toKey` exclusive):

```scala
db.remove(fromKey, toKey)
```
When two concurrent threads read a Segment, the second thread waits (via an `AtomicBoolean`) for the other thread to decompress the data, which is effectively blocking.
The blocking API should be optionally enabled, and a new `Future`-based API is required for reads.
The `Future` API should also optionally accept two `ExecutionContext`s: one for compaction and the other for reads.
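A sketch of what the proposed `Future` read API could look like with two `ExecutionContext`s, one for reads and one for compaction (`AsyncReads` and its shape are assumptions, not SwayDB's actual API):

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

// Sketch of a Future-based read API accepting two ExecutionContexts.
// `AsyncReads` and its constructor are illustrative assumptions.
class AsyncReads[K, V](store: Map[K, V])
                      (readEC: ExecutionContext, compactionEC: ExecutionContext) {

  // Reads are scheduled on the read pool, so a busy compaction pool
  // never blocks them.
  def get(key: K): Future[Option[V]] =
    Future(store.get(key))(readEC)

  // Compaction work runs on its own, separately sized pool.
  def compact(): Future[Unit] =
    Future(())(compactionEC)
}
```

Callers can then size each pool independently, for example a larger pool for reads than for compaction.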
v0.2 and older merged new Segments into one of the existing Segments even when the key-values did not overlap with that Segment, and only copied Segments if the Level was empty. Now compaction checks whether non-overlapping Segments exist in the compaction request and simply copies them into the Level, which reduces the compaction workload.
This has also improved write speed. Full benchmarks will be published on the website after the next release.
`headId` and `lastId` require a warm-up start for the macros to pre-fetch the min and max ids; otherwise the first access to these variables incurs an initial access cost during reads.
This can easily be avoided by initialising these variables on boot-up.
Set up Travis to run test cases and provide code coverage.
`MergeList` eventually returns `Some(null)` on `headOption` during the merging process. This is probably due to mutable operations on `MergeList`.
The error is not critical, as the merge eventually succeeds on retry, but a fix is required for performance.
Performance of Grouping key-values during merge can be improved.
Currently key-values can be updated using one of the write APIs.
This release will allow updating key-values using any Scala or Java function.
This task is complete; I will pull in the changes soon.
The number of in-memory key-values for both persistent and memory databases requires a more accurate calculation of total memory usage (`cacheSize`).
Currently, for a Group of key-values, the total decompressed byte size of the Group is used. This decompressed size does not account for internal compression (prefix compression of key, value, offset and length).
For example, if the value is fully compressed with the previous one, or the key is partially compressed with the previous one, more bytes may be allocated to the decompressed key-value in memory than are accounted for in the LimitQueues.
A temporary solution is to set `cacheSize` small enough to account for this internal compression.
Levels run independent compaction, which can randomly lead to IO congestion where all Levels write Segments to disk at once (for persistent databases).
We can limit the number of threads in the `ExecutionContext`, but this still does not stop IO congestion from occurring when there is more than one thread.
There needs to be a configuration to set the number of concurrent compactions, with compaction prioritised based on the number of Segments in each Level. In most cases compaction in the upper Levels should be prioritised so that there is always enough room for Level0 to flush key-values.
Prioritising concurrent compaction also depends on the type of application: a read-heavy application will require a different configuration than a write-heavy one, but the default configuration should be enough for most common use-cases.
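As a sketch of the kind of configuration described above, compaction could be ordered by Segment count, breaking ties in favour of upper Levels (the `LevelState` type and the ordering rule are illustrative assumptions, not SwayDB's configuration API):

```scala
// Illustrative sketch: pick the compaction order for a set of Levels.
// Levels with more Segments compact first; on a tie, the upper Level
// (smaller level number) wins so Level0 always has room to flush.
case class LevelState(levelNumber: Int, segmentCount: Int)

def compactionOrder(levels: List[LevelState]): List[LevelState] =
  levels.sortBy(level => (-level.segmentCount, level.levelNumber))
```

A real configuration would also cap how many of these compactions may run concurrently.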
The Batch API can be replaced with a `prepare` function that returns a database instance providing the same APIs as a normal Map or Set database instance, which might make batching more familiar and easier to work with.
This API can be similar to the `map.maps` API in extensions.
So, for example, instead of
```scala
val batches =
  Seq(
    Batch.Put(key = 1, value = "one value"),
    Batch.Remove(key = 1),
    Batch.Update(key = 1, value = "value updated")
  )

db.batch(batches)
```
use `prepare` and `commit`:
```scala
val preparedWrites =
  db.prepare.put(key = 1, value = "one value") ++
    db.prepare.remove(key = 1) ++
    db.prepare.update(key = 1, value = "value updated")

db.commit(preparedWrites)
```
Kotlin and Scala have very similar `Iterable` syntax. Having official support for Kotlin would be a good milestone to achieve.
Also see if the JAR file size could be further reduced for mobile deployment. There are not many external dependencies other than bloom-filter-scala, Snappy & LZ4.
Test cases should be split so that they are easy to execute and easy to set up in CI.
This is one of the goals. Queuing solutions are used often, and having a `Queue[T]` implementation for both memory and persistent databases, similar to `Map[K, V]` and `Set[T]`, is needed.
This does not require any changes to database internals; it can be implemented as an extension.
Similar to how maps can be extended with `map.maps`, we can implement a `map.queues` which returns the set of queues (`Set[Queue[T]]`) under each `Map[K, V]`. Therefore each existing `Map[K, V]` can have its own set of `Queue[T]`.
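A minimal in-memory sketch of the `Queue[T]` semantics described above, modelling a queue as a sorted map keyed by insertion index, similar to how an extension could store queue entries under a `Map[K, V]` (all names are assumptions, not SwayDB's API):

```scala
import scala.collection.immutable.TreeMap

// Minimal in-memory sketch of a Queue[T] built on a sorted map.
// Each pushed item gets an increasing Long index; pop removes the
// entry with the lowest index (FIFO order).
class Queue[T] {
  private var entries = TreeMap.empty[Long, T]
  private var nextIndex = 0L

  def push(item: T): Unit = {
    entries += (nextIndex -> item)
    nextIndex += 1
  }

  // Removes and returns the oldest item, if any.
  def pop(): Option[T] =
    entries.headOption.map { case (index, item) =>
      entries -= index
      item
    }
}
```

An extension backed by a persistent map would persist the `(index, item)` pairs instead of holding them in memory, but the FIFO semantics are the same.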
`lower` should do an equality check on the previous key-value when both the `previous` and `next` key-values are known in the `Segment`, before returning `previous` as the lower key-value for the input key.
#37 improves iteration performance for finding the next highest or lowest key from the upper & lower Levels.
But to have a major impact on iterators submitted by clients/users (per-key iterators), `Seek` pointers can be resubmitted to these per-Level iterators to continue iteration from the previously fetched successful key: basically, maintaining within each user-created iterator the state of its previous iteration position in each Level.
This would give a massive performance boost.
Watch-out: since compaction happens asynchronously, on each start these iterators will also need to check whether the submitted iterator pointers are still valid, i.e. whether the state of the current Level has changed for the keys the iterator is searching for. If the state has changed, the iterator should reset just the lower Level's pointer; after a single seek, the new `Seek` pointer returned for the lower Level will be valid. Level state changes are not expected to occur as often as reads: state changes may occur 1 to 10 times every second (depending on compaction speed) but reads can occur 300k to 600k times every second, so a check-and-reset for every new iterator is not expensive.
Currently Maps use `System.nanoTime` as the start time for key-values' `Time` and increment it by 1 for each new time. On restart, the current `System.nanoTime` is used again, but this would cause problems if the system clock changes.
Times need to be persisted for persistent databases. This is planned for `0.7.1`.
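A sketch of the persisted-time idea: seed the generator from the last persisted time on restart instead of `System.nanoTime`, so issued times never go backwards even if the system clock changes (`TimeGenerator` and `persistedLastTime` are illustrative assumptions, not SwayDB's implementation):

```scala
import java.util.concurrent.atomic.AtomicLong

// Sketch of a monotonic Time generator. On a fresh database it seeds from
// System.nanoTime; on restart it seeds from the last persisted time, so the
// sequence is strictly increasing across restarts and clock changes.
class TimeGenerator(persistedLastTime: Option[Long]) {
  private val current =
    new AtomicLong(persistedLastTime.getOrElse(System.nanoTime()))

  // Each call returns a new, strictly larger time.
  def nextTime(): Long = current.incrementAndGet()
}
```

The last issued value would be flushed to disk alongside the key-values so that the `Option[Long]` is available on the next boot.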
`registerFunction` and `applyFunction` can be used to alter data using basic Scala or Java functions.
We also need to implement an API that checks whether a registered function's `functionID` is safe to temporarily remove from the function store.
Kotlin and Scala have very similar `Iterable` syntax. Having official support for Kotlin would be a good milestone to achieve.
We also need to reduce the size of the JAR file for mobile deployment. There are not many external dependencies other than bloom-filter-scala, Snappy & LZ4; the JAR file size could be reduced by changing the build file.
Currently key-values can be written in two file formats.
Having a mutable B-tree file format could be useful for certain applications.
Unmapping memory-mapped files fails on Java 9+. A fallback solution is needed.
SwayDB does not have any external dependencies other than the compression libraries (LZ4 and Snappy), which are optionally enabled.
The majority of the effects are implemented in `Effect.scala`. Maybe moving all `java.io.*` invocations into this class and providing `Effect.scala` implementations for target platforms is required.
`repairAppendix` can be used to fix corrupted appendix files. It can be executed by supplying the Level's path and a repair strategy:

```scala
SwayDB.repairAppendix[Int](
  levelPath = "/Disk1/myDB/1", //path of the level's folder.
  repairStrategy = AppendixRepairStrategy.Report
)
```

Detailed documentation is on the website: www.swaydb.io/#api/repairAppendix.
This will be available in v0.1.1.
Now each key also stores a `Time` value, i.e. the order in which the key was inserted. This `Time` value is used when applying atomic updates using Scala or Java functions.
Since reads and writes are unaware of each other and we do not want to block, the `Time` value is used to determine whether the currently read key has already been merged with the previous Level's updates, so that updates are applied atomically and are not duplicated.
`Time` is of type `Time[T]` where `T` can be of any type. SwayDB stores `Time` as raw bytes: `Time[Slice[Byte]]`.
Storage cost: the default `Time` implementation stores a `Long` requiring 1 to 8 bytes and is prefix-compressed with the previous key's `Time`. If the `Long` has no common prefix bytes with the previous `Time`, it is stored as an unsigned `Long` to get maximum storage savings. A GroupingStrategy can be configured to further compress the time values.
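The 1-to-8-byte cost comes from dropping the bytes a `Time` shares with its predecessor. A small sketch of counting those shared prefix bytes (the helpers are illustrative, not SwayDB's actual codec):

```scala
import java.nio.ByteBuffer

// Big-endian 8-byte representation of a Long.
def toBytes(value: Long): Array[Byte] =
  ByteBuffer.allocate(8).putLong(value).array()

// Number of leading bytes two consecutive Time values share. A prefix
// compressor stores only the remaining suffix bytes of the new value.
def commonPrefixByteCount(previous: Long, next: Long): Int =
  toBytes(previous).zip(toBytes(next)).takeWhile { case (a, b) => a == b }.size
```

Adjacent times produced by an incrementing counter differ only in their low bytes, so most of the 8 bytes compress away.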
`TimeOrder` can be updated to also include `MergeOrder` information, where a single key can have multiple updates at different machine clock times but the `MergeOrder` can override the machine clock time.
This will be mostly useful for distributed databases where multiple machines can insert updates into SwayDB in any order, and the actual sequence in which these updates get applied is determined by `mergeOrder`, which is just a function and will be consistent on all machines, therefore resulting in a consistent database state on all machines.
For example, `MergeOrder[CustomInsertTime]` can use the following class:

```scala
case class CustomInsertTime(machineId: Int, machineLocalTime: Long)
```

and the ordering can be provided on both fields. As long as this `Ordering[CustomInsertTime]` is consistent on all machines, the state of the database will be consistent regardless of the order in which the data was actually inserted into SwayDB.
`MergeOrder` does not require any new implementation or file-format changes, as it can be part of the `Time` value, and only a minor change is required to the merge implementation. Updating all the test-cases to account for `mergeOrder` will be the largest task for this implementation.
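A sketch of a consistent `Ordering[CustomInsertTime]` for the example above, ordering first by `machineLocalTime` and breaking ties by `machineId` (the tie-break rule is an assumption; any total order works as long as it is identical on all machines):

```scala
case class CustomInsertTime(machineId: Int, machineLocalTime: Long)

object CustomInsertTime {
  // Deterministic total order: local time first, machine id breaks ties.
  // Because the function is pure, every machine sorts the same updates
  // into the same sequence, regardless of arrival order.
  implicit val mergeOrder: Ordering[CustomInsertTime] =
    Ordering.by(time => (time.machineLocalTime, time.machineId))
}
```

Sorting the same set of updates on any machine then yields the same merge sequence, which is what makes the final database state consistent everywhere.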
A database with only a single Level (Level0) should be allowed.
A single-Level database will not run compaction. It would be a simple single-Level setup backed by immutable append-only log files, with new log files added as they fill up.
This type of configuration can be useful for applications requiring a small set of immutable data, or data that rarely changes but requires quick read and write access without any background compaction.
Pending task: requires changes to the `Config` API to allow setting `Level1` to `None`.
This extension will be optional.
The extension will include implementations for data types such as Map, List, Queue, Set etc.
Currently the master branch includes an implementation for creating multiple maps within the database, which will be available in the next release.
This is one of the goals: to submit any Scala or Java function to update data instead of using a query language like SQL.
Any Scala or JVM library, like Circe for JSON, can be used to store and update data.
The implementation for this is complete and will be pushed with 0.7 (#33).
We need a way to automate tests to increase confidence in components and features. Completion of this task will enable smoother collaboration.
Hook up Travis CI integration to the project.
`Seek`s can stash key-values read from the upper & lower Levels to avoid duplicate reads, and can also `Stop` reads on either Level if the previous read from that Level returned empty. But since compaction occurs asynchronously, the state of the Level can change, making the previous read stale.
There needs to be a way to detect state changes in the Level since the Level's last `Seek`; if the state was updated, then the previous `Seek` pointer from that Level is invalid and the Level should be re-read.
This can be resolved by performing a `hasPut` check to find the next `highest` or `lowest` Segment. If the `Segment` returned has not changed since the previous check, the previous read is still valid; otherwise it is stale and the Level should be re-read. This check is only required when the `Seek` pointer is set to `Seek.Stop` or `Seek.Stash`. It is a lightweight check requiring only the Segment footer to be read (usually cached), therefore not a costly operation.
This implementation will give the iterators the performance upgrade mentioned in #42.
Extending `Iterable` with custom APIs deviates SwayDB's API from Scala's Map and Set APIs, which I think adds unnecessary learning for us.
We need to provide Scala's `MapLike` or `SetLike` implementations for `swaydb.Map` and `swaydb.Set` database instances respectively.
Extensions can implement `scala.collection.mutable.MultiMap`.
Edit: maybe instead of getting rid of `Iterable`, another API similar to `.asScala` in `JavaConverters` can be implemented that converts a `swaydb.Map` or `Set` to a `scala.collection.mutable.Map` or `Set`.
The default implementations will support returning back-pressure meters on each write, and `.asScala` can be used where back-pressure or the `Try` return type is not required.
Currently, as Segments are batch-compacted from one Level to another, a failure to commit any one Segment reverts the entire compaction. The compaction is then retried based on the throttle configuration.
These failures can occur if a Segment is closed (`maxSegmentsOpen`) while it is being read.
To avoid retries and to save IO, partial commits of the batched Segments should be allowed: write as many Segments as possible and alert the upper Level of the partial commit, so that only the failed Segments are retried.
Allow updating key-values with functions
```scala
//increments all values by 1
db.update(from = 0, to = 1000, updater = oldValue => oldValue + 1)
```
Segments that do not overlap with existing Segments in the Level should be skipped and pushed to the next Level when the `pushForward` flag is set to true in the configuration.
In some cases this can reduce IO and increase overall database performance.
The logic to skip Segments is required here; currently, Segments are forwarded only when both the current Level and the next Level are empty.
Iterators can be slower if there are too many dead updates between two keys.
A Put checkpoint, or something similar, can be added to help iterators skip the dead updates entirely and jump to the next valid key instantly.
A test case would be: suppose there are 100,000 keys, keys 1 & 100,000 are valid, and updates were submitted for keys 2 to 99,999 which do not exist. This affects iterator performance when reading from key 1 to 100,000.
All Levels (including Level0) behave the same when it comes to reading data, therefore test-cases on any variation of Level configuration should yield the same result. Reads before/after a merge, or during a concurrent merge, should always have consistent output.
The `assertLevel` test function, given a use-case, will run the same test multiple times with different Level configurations to get maximum test coverage.
TTL APIs allow key-values to automatically expire and be removed from the database after a finite duration.
```scala
//put & expire
db.put(key = 1, value = "one", expireAfter = 1.day)

//expire existing keys
db.expire(key = 1, after = 10.seconds)

//expire range of key-values
db.expire(from = 1, until = 100, after = 1.hour)
```
Update APIs should allow modifying values without affecting the existing TTL/expiry set for the key.
```scala
//update single key
db.update(key = 1, value = "one updated")

//update range
db.update(from = 1, until = 100, value = "updated")
```
`scala.Iterable[T]` is blocking and requires the use of `IO.Async[T].safeGetBlocking`.
In-built streaming support is required to provide non-blocking & back-pressured reads using `IO.Async[T].safeGetFuture` instead of `IO.Async[T].safeGetBlocking`.
I haven't compared all the libraries yet, but Monix sounds like a better option since it compiles to Scala.js and can interop with Akka Streams and FS2.
Change

```scala
val db = SwayDB.persistent[K, V](...)
val db = SwayDB.persistentSet[T](...)
val db = SwayDB.memory[K, V](...)
val db = SwayDB.memorySet[T](...)
```

to be similar to how a basic Scala collection would be initialised:

```scala
import swaydb._

val map = persistent.Map[K, V]()
val set = persistent.Set[T]()
val map = memory.Map[K, V]()
val set = memory.Set[T]()
```

Possibly also an independent library for Java syntax with code samples.
`Higher`, `Lower`, `Get` & `SegmentMerger` are becoming large functions because they need to be `tailrec` for performance and to avoid `StackOverflow` on deep seeks. Splitting these implementations into smaller, dedicated functions would be better and easier to write tests for.
These implementations can run a minimum of 300k+ iterations per second to fetch a minimum of 300k key-values per second.
Trampolining would require an extra object to be created per iteration, which adds GC workload: if each key read created another trampoline object, it would double the number of objects created per second, and that is no good for performance.
Looking for alternatives and/or advice on this.
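For contrast, a minimal sketch of the `tailrec` style in question: a seek across Levels compiles down to a loop, allocating no per-iteration trampoline objects (`seekHigher` and the level model are illustrative, not SwayDB's code):

```scala
import scala.annotation.tailrec

// Illustrative tail-recursive seek: find the smallest key strictly greater
// than `key`, searching each level in order. The recursive call is in tail
// position, so the compiler rewrites it into a loop: constant stack depth
// and no trampoline objects per iteration.
@tailrec
def seekHigher(key: Int, levels: List[Map[Int, String]]): Option[String] =
  levels match {
    case Nil => None
    case level :: remaining =>
      level.keys.filter(_ > key).toList.sorted.headOption match {
        case Some(higherKey) => level.get(higherKey)
        case None            => seekHigher(key, remaining) // tail call
      }
  }
```

The trade-off the text describes is that keeping everything in one `@tailrec` method grows the function body, while splitting it into smaller methods breaks the tail-call guarantee unless each piece stays self-recursive.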
Range writes (Remove range, Expire range & Update range) can spread over multiple Segments if the difference between `from` and `to` is too wide. In these cases, ranges should be split to perform the update in batches.
Back up all database files to a target path:

```scala
db.backup(path: Path)
```
Since `cacheFunction`s are applied in real time, a faster compaction can result in these updates being applied multiple times during reads, as reads do not block compaction.
These updates will eventually be consistent, but a check during reads is required to assert that a function gets applied only once. This information can be stored in the read context.
Currently, when `Actor.timerLoop` has a message overflow, it processes most messages in the inbox, leaving approximately 1/4 of the messages to be processed in the next interval. It also halves the next delay if the previous interval had too many messages.
Although this works OK for controlling overflow, it needs a smoother approach, which would improve overall read performance.