Hello, I was just wondering what Bolt is planning for database compa

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Compaction,about boltdb/bolt

Comments (26)

benbjohnson commented on May 23, 2024

@pkieltyka Bolt uses the same model as LMDB which is a copy-on-write B+Tree. Older versions of the tree are maintained as long as old transactions need them but will be reclaimed once those transactions close. There's no write-ahead log (WAL) or SSTables like what LevelDB does. Everything is consistent and up to date when it's written. You can add a WAL in front of Bolt to improve write performance or coalesce transactions but Bolt is mainly geared toward fast read performance.

There's an issue with the buckets page (aka the list of bucket root nodes) where they are not getting reclaimed (#82). I suspect that's the issue you're having. I'm going to try to fix it tomorrow.

from bolt.

pkieltyka commented on May 23, 2024

Very cool! Thanks

from bolt.

benbjohnson commented on May 23, 2024

@pkieltyka If you install the bolt CLI then you can run it against your db to check the page information:

$ go get github.com/boltdb/bolt/...
$ bolt pages /path/to/my.db

If you see a bunch of buckets lines then that's issue #82.

from bolt.

pkieltyka commented on May 23, 2024

https://gist.github.com/pkieltyka/1234f06cf63d03e3a769

from bolt.

benbjohnson commented on May 23, 2024

@pkieltyka Yep. That's the same issue. :)

I'll close this issue since it's the same as #82.

from bolt.

pkieltyka commented on May 23, 2024

Sounds good. Thanks and keep up the awesome work. Btw, have you considered added a distributed component to bolt to turn it into a distirbuted store? Some form of consistent hashing algorithm + goraft?

from bolt.

benbjohnson commented on May 23, 2024

@pkieltyka Actually, can you post the program used to generate the database? Page 181 is concerning. That's a lot of items and a lot of overflow pages.

from bolt.

benbjohnson commented on May 23, 2024

@pkieltyka Thanks! I'm working on a second version of goraft right now so maybe after that I'll try making a distributed k/v store. I've been looking at porting Lucene on top of Bolt to provide full text search and indexing. Not sure what to tackle first. :)

from bolt.

pkieltyka commented on May 23, 2024

yea for sure I will, I just have to head out now but I will post it in a few hours

from bolt.

pkieltyka commented on May 23, 2024

and +1 for distributed store :) .. and also, why rewrite goraft tho?

from bolt.

benbjohnson commented on May 23, 2024

go-raft is good for small datasets but there's performance issues around allocation and serialization. I've learned a lot about Go perf tuning since I wrote goraft a year ago and I need something higher performance for SkyDB.

I'm looking to do a simpler design and use read-only mmaps for the log (similar to how Bolt uses a read-only mmap). It's wicked fast. The snapshotting in go-raft is very limited too so that needs to be fixed as well. There's already a lot of people using go-raft so trying to upgrade it and break the API is not something I want to do.

I still need to think of a name for the next goraft. raft2?

from bolt.

benbjohnson commented on May 23, 2024

@pkieltyka #90 should fix your growing db issue. Let me know if you still have any problems.

from bolt.

pkieltyka commented on May 23, 2024

@benbjohnson hey, sorry about the delay. I'll give it a test it now and paste the page info. Thx.

from bolt.

pkieltyka commented on May 23, 2024

btw, re: raft I still have to read the paper to fully grasp all the details. It uses an http transport? or is that just an option. Btw, are you on irc?

from bolt.

benbjohnson commented on May 23, 2024

Raft doesn't specify a transport. go-raft uses a pluggable transport layer but most people use HTTP as the transport. I'll probably do something similar with raft2.

Here's a (relatively) short visualization of Raft that's helpful if you don't have time for the full paper:

http://thesecretlivesofdata.com/raft/

from bolt.

pkieltyka commented on May 23, 2024

Nice, that is a beautiful visualization.. I'm gonna go through that now.

So, I'm still not seeing the space free up. The pages: https://gist.github.com/pkieltyka/9763425

I've pushed the dummy project here: https://github.com/pkieltyka/seamdb. The project uses a sample of ~200 small images from 10KB to 150KB. When the test suite boots, it will repeat the set with the same key and append a number, just to increase the size of the db (seamdb_test.go has the testData build code with a 100x repeat).

To run the tests and reproduce the issue:

Change the path to the db in bolt.go Open function
go test -run=TestBoltSetInit
go test -run=TestBoltSetGet .. just to validate the data is there..
go test -run=TestBoltSetDel .. I also noticed this is a pretty slow operation

at each point, I was just checking the size of the db file, and the finally checking the pages.

from bolt.

pkieltyka commented on May 23, 2024

Btw, I never finally delete the now empty bucket.. I keep it around as I would intend to put more data into it. Treating it as a namespace.

from bolt.

pkieltyka commented on May 23, 2024

re: raft -- I looked at the visualization, it's quite logical and very cool. I'm still new to it, but wouldn't it make writes to a cluster extremely slow as it's strongly consistent between the cluster? .. that could be fine depending on the application.

For something like etcd, yea that is fast enough. But, I don't know how it could be fast enough for influxdb if its an analytics events db (unless its far from real-time). Hrmm.. I guess depends on interpretation too.. the data wouldn't be committed to the cluster until a majority accept it, but the uncommitted data is still accessible from the leader. But then that sucks too cuz your followers are just more like backups now.

And yea, it makes sense the spec doesn't prescribe a transport. But, I'm still not sure why http is a good idea at the low-level to communicate data between nodes when you can easily use tcp sockets. An app developer could just use the raft client, open a connection, and have a few functions to check the node state, etc... it would all be abstracted away with the goal of max efficiency. Just some thoughts..

Maybe raft can do it.. but I'm thinking of a completely different distributed data design for an embedded db.. with raft, you'd always have to read/write from the leader it seems? I suppose you could easily let it read from a follower and maybe proxy writes to the leader.. but still, I'd rather always read/write to the local embedded db, and have the set become eventually consistent. Hrmm.. I wonder if this is how cassandra is designed.

from bolt.

benbjohnson commented on May 23, 2024

Thanks for the repo link. I'll probably be tomorrow before I can look at it.

You're right that the delete operation is really slow. I've done very limited optimization on Bolt so far. It's mostly been ensuring that everything is consistent and correct and the API works well. I'll get that fixed up though.

Another performance issue is that bulk loading more than 1000 items at a time is also really slow. The nodes aren't splitting before commit so there's a lot of large memmove() calls that's causing slowness.

I added two issues to track those performance problems: #93 & #94

from bolt.

benbjohnson commented on May 23, 2024

Wouldn't it make writes to a cluster extremely slow as it's strongly consistent between the cluster?

The nice thing about Raft is that it batches up log entries so it can still theoretically achieve high throughput but there's additional latency. Latencies are in the 10s to 100s of milliseconds range.

But then that sucks too cuz your followers are just more like backups now.

It mostly depends on your data. If you can accept stale reads then you can read from the followers. If you need consistent reads then you have to read from the leader. It's definitely a high cost but sometimes you really need it.

But, I'm still not sure why http is a good idea at the low-level to communicate data between nodes when you can easily use tcp sockets.

If an application already has an HTTP server embedded in it then go-raft can overlay its routes on there and then you only need to open up that single HTTP port. If you're running over TCP then you'd have to open a separate port in your application (and your firewall) and it's just more configuration overhead.

Another benefit to HTTP is that you can implement Raft over TLS. Raft doesn't provide any safety for byzantine errors (e.g. malformed messages, malicious messages) so TLS can provide encryption and authentication.

I'd rather always read/write to the local embedded db, and have the set become eventually consistent.

Eventual consistency is a whole different approach you can go with but it has its own drawbacks and challenges. Building against a strongly consistent data store is a lot easier than understanding things like vector clocks, last-write-wins, and other eventually consistent concepts. However, eventually consistent systems can scale larger and have higher availability since they don't require a quorum. It just depends on your use case.

from bolt.

pkieltyka commented on May 23, 2024

Cool! this is all so interesting. Btw, have a look at https://github.com/inconshreveable/muxado .. it's written by the same guy who made ngrok. It's a tcp multiplexing transport that is designed similar to http2.... so there's your single port :)

In that case, for boltdb.. I like that it's just an embedded db, and implementing raft2 or other distribution schemes could be completely separated out anyways but easily overlaid.. (perhaps you were thinking like this anyways, but that just occurred to me).

from bolt.

kardianos commented on May 23, 2024

The distributed system I hope to build soon-ish is a distributed data store
with explicit versions. When a client interacts with the data store, it can
require a minimum version. Such explicit version allow clients to know when
data has changed before it updates it, it can also provide a cross between
the scalability of an eventually consistent store and the ease of use of a
strongly consistent data store.
You would use it as follows:

For consistent business data: Client reads data with version "A" from
data store, client modifies data, client writes back to data source
requiring that at time of write data store should still be version "A".

For consistent data reads: Client writes data and returns version "32".
Client reads data and requires the version be 32 or newer.

Has this been done before?
-Daniel

On Tue, Mar 25, 2014 at 9:12 AM, Ben Johnson [email protected]:

Wouldn't it make writes to a cluster extremely slow as it's strongly
consistent between the cluster?

The nice thing about Raft is that it batches up log entries so it can
still theoretically achieve high throughput but there's additional latency.
Latencies are in the 10s to 100s of milliseconds range.

But then that sucks too cuz your followers are just more like backups now.

It mostly depends on your data. If you can accept stale reads then you can
read from the followers. If you need consistent reads then you have to read
from the leader. It's definitely a high cost but sometimes you really need
it.

But, I'm still not sure why http is a good idea at the low-level to
communicate data between nodes when you can easily use tcp sockets.

If an application already has an HTTP server embedded in it then go-raft
can overlay its routes on there and then you only need to open up that
single HTTP port. If you're running over TCP then you'd have to open a
separate port in your application (and your firewall) and it's just more
configuration overhead.

Another benefit to HTTP is that you can implement Raft over TLS. Raft
doesn't provide any safety for byzantine errors (e.g. malformed messages,
malicious messages) so TLS can provide encryption and authentication.

I'd rather always read/write to the local embedded db, and have the set
become eventually consistent.

Eventual consistency is a whole different approach you can go with but it
has its own drawbacks and challenges. Building against a strongly
consistent data store is a lot easier than understanding things like vector
clocks, last-write-wins, and other eventually consistent concepts. However,
eventually consistent systems can scale larger and have higher availability
since they don't require a quorum. It just depends on your use case.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/89#issuecomment-38584920
.

from bolt.

benbjohnson commented on May 23, 2024

@pkieltyka Yeah, distributed systems is a whole different beast. So many little nuances. I haven't seen muxado before. That looks cool. They mention something about performance so I'd be interested to try out some benchmarks at some point.

Layerable systems (e.g. bolt + raft + ...) is what I'm going for. Most systems tend to be broken up as individual servers such as MySQL + elasticsearch + redis but I've always been disappointed that these servers aren't also available as libraries. There's a lot of use cases where you can get significantly improved performance by combining some of these things into a single process. It can also be a lot easier to reason about and deploy when it's in a single process.

@kardianos The versioning you're doing is basically CAS (compare-and-swap). Zookeeper and etcd both have distributed CAS in this same way but they're backed by a consensus protocol (Raft and ZAB, respectively). Versioning alone doesn't help you distribute the data across multiple nodes. If you allow writes to multiple nodes then they can check their local version before committing but still have to resolve merges between nodes. If you're writing to a single node then you still have to handle failover and determine when something is safely committed in case of a node failure.

If you distribute the version across multiple nodes then you have Lamport Timestamps which can help determine causality but you still end up in situations where you have to resolve conflicts. There's also Vector Clocks which track Lamport timestamps per node and improve the granularity of the causality but you still have conflicts. Riak uses Vector Clocks.

There's also something new called Commutative Replicated Data Types (CRDTs) which allow you to merge conflicts better than last-write-wins (LWW), however, they're still early stage and not all data can use CRDTs.

from bolt.

pkieltyka commented on May 23, 2024

@benbjohnson yea I definitely like the idea of a layered infrastructure design. I feel go's unconventional package model helps with this too.

Btw, for bolt, it would be nice to have boltdb/harness or something that tests various data sets and really pounds it through some regressions. Once you deem it functionally complete, then to validate its integrity of all features. The repo I created above could be a start.

from bolt.

benbjohnson commented on May 23, 2024

@pkieltyka I think an external harness is a good idea. I added two issues:

bolt-bench - Database benchmarker
bolt-fsck - Database consistency checker

Those could probably be bolt bench and bolt fsck in the main CLI too. The harness could probably just be a shell script to combine a series of bolt commands. Thanks for the starting test case. :)

from bolt.

pkieltyka commented on May 23, 2024

Nice!

On Mar 25, 2014, at 2:26 PM, Ben Johnson [email protected] wrote:

@pkieltyka I think an external harness is a good idea. I added two issues:

bolt-bench - Database benchmarker
bolt-fsck - Database consistency checker
Those could probably be bolt bench and bolt fsck in the main CLI too. The harness could probably just be a shell script to combine a series of bolt commands. Thanks for the starting test case. :)

—
Reply to this email directly or view it on GitHub.

from bolt.

Compaction about bolt HOT 26 CLOSED

Comments (26)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent