mkrd / dictdatabase
A Python NoSQL dictionary database, with concurrent access and ACID compliance.
License: MIT License
External edits issue:
The first draft is:
```python
import dictdatabase as DDB
from path_dict import PathDict
import time
# Measure time to get all cups
t1 = time.perf_counter()
cups = DDB.at("cups/*", key="organizer_email").read()
# REMOVE where selector
# KEEP as_type since it is also important in session
# KEEP at() since it is also used in session
# ADD: read_key(key, as_type=None) and read_keys(keys, as_type=None)
# ... OR ADD: .select_key(key) and .select_keys(keys) and .select_all() and .select_where(f) as intermediate step before read() or session()
# File Alt 1
DDB.read_file("cups/aachen", as_type=PathDict)
DDB.read_file("cups/aachen", key="version", as_type=PathDict)
DDB.read_file("cups/aachen", keys=["version", "name"], as_type=PathDict)
DDB.read_file("cups/aachen", not_keys=["big_part"], as_type=PathDict)
with DDB.file_session("cups/aachen", as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)

with DDB.file_session("cups/aachen", key="version", as_type=PathDict) as (session, version):
    version += 1
    session.write(version)
# File Alt 2
DDB.at_file("cups/aachen").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_key("version").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_keys("version", "name").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_not_keys("big_part").read(as_type=PathDict)
with DDB.at_file("cups/aachen").session(as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)
# File Alt 3
DDB.file.at("cups/aachen").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_key("version").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_keys("version", "name").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_not_keys("big_part").read(as_type=PathDict)
with DDB.file.at("cups/aachen").session(as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)
# Dir Alt 1
DDB.read_dir("cups", as_type=PathDict)
DDB.read_dir("cups", key="version", as_type=PathDict)
DDB.read_dir("cups", keys=["version", "name"], as_type=PathDict)
with DDB.dir_session("cups", as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)

with DDB.dir_session("cups", key="version", as_type=PathDict) as (session, versions):
    versions["aachen"] += 1
    session.write(versions)
# Dir Alt 2
DDB.at_dir("cups").read(as_type=PathDict)
DDB.at_dir("cups").select_key("version").read(as_type=PathDict)
DDB.at_dir("cups").select_keys("version", "name").read(as_type=PathDict)
with DDB.at_dir("cups").session(as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)

with DDB.at_dir("cups").select_key("version").session(as_type=PathDict) as (session, versions):
    versions["aachen"] += 1
    session.write(versions)
# Dir Alt 3
DDB.dir.at("cups").read(as_type=PathDict)
DDB.dir.at("cups").select_key("version").read(as_type=PathDict)
DDB.dir.at("cups").select_keys("version", "name").read(as_type=PathDict)
with DDB.dir.at("cups").session(as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)

with DDB.dir.at("cups").select_key("version").session(as_type=PathDict) as (session, versions):
    versions["aachen"] += 1
    session.write(versions)
# Read key from top level file:
v1 = DDB.at("app_state", key="version").read(as_type=PathDict)
v2 = DDB.at(file="app_state").select_key("version").read(as_type=PathDict)
# Read one cup:
v1 = DDB.at("cups/1").read(as_type=PathDict)
v2 = DDB.at(file="cups/1").read(as_type=PathDict)
# Read all cups:
v1 = DDB.at("cups/*").read(as_type=PathDict)
v2 = DDB.at("cups").read(as_type=PathDict)
v3 = DDB.at(dir="cups").select_all().read(as_type=PathDict)
# Read only mail addresses:
v1 = DDB.at("cups/*", key="organizer_email").read(as_type=PathDict)
v2 = DDB.at("cups/*").select_key("organizer_email").read(as_type=PathDict)
# Read locations and emails:
v1 = ... # Not Possible
v2 = DDB.at("cups/*").select_keys("organizer_email", "location").read(as_type=PathDict)
t2 = time.perf_counter()
assert len(cups) == 228
print(f"Time to get all cups: {(t2-t1)*1000:.1f} ms")
print(cups)
```
Currently, if a file does not exist, a FileNotFoundError is raised.
If the file exists but a key that does not exist was specified, a KeyError is raised.
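For illustration (the db names here are hypothetical):
```python
import dictdatabase as DDB

DDB.at("does_not_exist").read()           # currently raises FileNotFoundError
DDB.at("app_state", key="missing").read() # raises KeyError if the file exists
```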
Proposal:
time.monotonic_ns() might not have the same baseline on different processes or threads
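For context, a minimal illustration of the two clocks (exact values are platform-dependent):
```python
import time

# monotonic_ns has an undefined reference point; Python only guarantees that
# differences between calls within the same process are meaningful.
print(time.monotonic_ns())  # e.g. nanoseconds since boot on some platforms
# time_ns is wall-clock time with a shared baseline (the Unix epoch) across
# processes, but it can jump when the system clock is adjusted.
print(time.time_ns())
```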
If keys are not sorted, the indexer cannot work properly, so performance will suffer.
By default, DDB creates an index file for each db file, in which the indices of the values and the value hashes are stored.
This drastically increases performance, but might not be wanted in some cases. If no index files should be written, a DDB.config.use_indexing = False switch could disable it.
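A minimal sketch of how such an index enables partial reads; the index layout here, {key: (start, end, value_hash)}, is an assumption for illustration, not the actual file format:
```python
import json

def partial_read(path: str, index: dict, key: str):
    # Look up the byte range of the key's value and read only that slice,
    # instead of parsing the whole file.
    start, end, _value_hash = index[key]
    with open(path, "rb") as f:
        f.seek(start)
        return json.loads(f.read(end - start))
```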
Using a different locking mechanism could fix both problems at once
DDB.at("test").create({"a": 0}, force_overwrite=True)
with DDB.at("test", key="a").session() as (session, a):
a = 1
session.write()
print(DDB.at("test").read())
This prints {"a": 0}, but it would be more intuitive if it were {"a": 1}. The reason is that primitives like int are passed by value, not by ref: a = 1 only rebinds the local name, so the session never sees the change.
The current way to solve this is:
DDB.at("test").create({"a": 0}, force_overwrite=True)
with DDB.at("test_pd").session() as (session, d):
d["a"] = 1
session.write()
print(DDB.at("test").read())
But this is not as efficient, since the whole test file has to be loaded instead of only the key.
Another solution would be to pass the variable into session.write():
DDB.at("test").create({"a": 0}, force_overwrite=True)
with DDB.at("test", key="a").session() as (session, a):
a = 1
session.write(a)
print(DDB.at("test").read())
A possible solution without changing the syntax is using the current frame with inspect:
```python
import inspect

class ObtainVarInWith:
    def __enter__(self):
        f = inspect.currentframe().f_back
        self.x = 1
        self.oldvars = dict(f.f_locals)  # Take a copy of the caller's locals dict.
        return self.x

    def __exit__(self, type, value, tb):
        f = inspect.currentframe().f_back
        for name, val in f.f_locals.items():
            if name not in self.oldvars:
                print("New variable:", name, val)
            elif val is not self.oldvars[name]:
                print("Changed variable:", name, val)

with ObtainVarInWith() as x:
    print("in with block", x)
    x = 2
    print("after assign in with block", x)
```
Suppose I have a database with the following layout:
```
A/
    meta.json
    data.json
B/
    meta.json
    data.json
C/
    ...
```
Then suppose I try to read all the metadata via:
```python
DDB.at('*/meta').read()
```
This will return the contents of only one of the meta.json files. What I'd like is a dictionary of all of them, indexed by 'A', 'B', 'C', and so on. In other words, I'd like the glob wildcard matches to be used as the keys, not the last component of the path.
Is it possible to add such functionality?
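For example, the desired (hypothetical) behavior would be:
```python
DDB.at('*/meta').read()
# => {"A": {...}, "B": {...}, "C": {...}}  # keyed by the wildcard match
```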
P.S. I realize that it's possible to reorganize the database into something like:
```
meta/
    A.json
    B.json
    C.json
data/
    A.json
    B.json
    C.json
```
But I think that's less flexible.
mmap could greatly improve the performance of .at(…, key=…).session()
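A minimal sketch of the idea, using a hypothetical helper name and assuming the simple case of finding a top-level key (the real parsing lives in utils.find_outermost_key_in_json_bytes):
```python
import mmap

def find_key_offset(path: str, key: str) -> int:
    # Memory-map the file so the OS pages in only the bytes that are touched,
    # instead of reading the whole file into a Python bytes object.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(f'"{key}":'.encode())
```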
Compiling the glob regex is slow, use listdir for more performance
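A sketch of the suggested replacement, assuming all files of a directory-level db live directly inside one folder (list_db_names is a hypothetical helper):
```python
import os

def list_db_names(storage_dir: str, dir_name: str) -> list[str]:
    # os.listdir avoids glob's pattern-to-regex compilation entirely.
    base = os.path.join(storage_dir, dir_name)
    return [f[:-len(".json")] for f in os.listdir(base) if f.endswith(".json")]
```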
.at() should be able to accept a lambda function at the final selection level to allow filtering for a specific condition:
```python
DDB.at("tasks", lambda t: t["annotator_id"] == current_user.id).read()
```
to get all tasks where the id matches the user id.
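Until then, the equivalent has to be done client-side after a full read (a sketch, assuming one file per task under "tasks/"):
```python
tasks = DDB.at("tasks/*").read()
my_tasks = {name: t for name, t in tasks.items() if t["annotator_id"] == current_user.id}
```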
Currently, a partial read still has to load the entire file into memory if the key was not found by the indexer.
Proposal:
The entire file still needs to be read, but by partially reading each key-value pair sequentially, we can prevent having to load the entire file into memory at once.
utils.find_outermost_key_in_json_bytes and seek_index_through_value_bytes should be changed to operate on batches of file bytes, going through at most x megabytes of data at once.
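A minimal sketch of such batched scanning, using a hypothetical find_in_batches helper; the overlap carry-over keeps matches that straddle a batch boundary from being missed:
```python
def find_in_batches(path: str, needle: bytes, batch_size: int = 8 * 2**20) -> int:
    # Scan the file in fixed-size batches, keeping a small tail of the previous
    # batch so a match spanning two batches is still found. Returns the byte
    # offset of the first match, or -1 if the needle does not occur.
    overlap = len(needle) - 1
    offset = 0
    prev_tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(batch_size)
            if not chunk:
                return -1
            buf = prev_tail + chunk
            i = buf.find(needle)
            if i != -1:
                return offset - len(prev_tail) + i
            prev_tail = buf[-overlap:] if overlap else b""
            offset += len(chunk)
```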
First draft finished. Issues to solve:
First, love your lib, saw it on Reddit, and have replaced my JSON config saver with it.
However, for me the order of the keys is important, and in the function serialize_data_to_json_bytes in io_unsafe.py you always sort the keys. This hurts.
My personal opinion is not to do that there. If you want to sort your keys, that could/should be done prior.
Kind of separate the purpose of serializing the data from ordering the data.
Alternatively, you could add an option when creating the instance if you want an alternative solution.
Just my two cents.
For now, I patch it with:
```python
import json
from dictdatabase import config, io_bytes  # import paths may differ between versions
from dictdatabase.sessions import SessionFileFull

def serialize_data_to_json_bytes(data: dict) -> bytes:
    if config.use_orjson:
        import orjson
        option = orjson.OPT_INDENT_2 if config.indent else 0
        return orjson.dumps(data, option=option)
    db_dump = json.dumps(data, indent=config.indent, sort_keys=False)
    return db_dump.encode()

def io_write(db_name: str, data: dict):
    data_bytes = serialize_data_to_json_bytes(data)
    io_bytes.write(db_name, data_bytes)

def write(self):
    super(SessionFileFull, self).write()
    io_write(self.db_name, self.data_handle)

SessionFileFull.write = write
```
Performance profiling shows that seek_index_through_value is the bottleneck for fast partial reading. Consider rewriting the function in C, which will probably be much faster.
I love the idea of this library! Thanks for building it. :)
This isn't an issue, but more of a question. When I was playing with this library, the results of key were surprising to me, because I was unclear what was going to match.
```json
{
    "users": {
        "Ben": {
            "age": 30,
            "job": "Software Engineer"
        }
    },
    "Ben": {"job": "Plumber"}
}
```
```python
print(DDB.at("users", key="job").read())
# >>> "Plumber"
```
Have you thought about requiring the key to be more explicit? Maybe something like XPath (key='Ben/job'), jq (key='.Ben.job'), or glom (key='Ben.job'), or something else?
Or maybe, if there is a reason to keep the current functionality, a note in the readme about this behavior might be useful for other devs?
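For example, a path-style key could make the lookup unambiguous (hypothetical syntax, not part of the current API, using '/' as the separator):
```python
DDB.at("users", key="users/Ben/job").read()  # => "Software Engineer"
DDB.at("users", key="Ben/job").read()        # => "Plumber"
```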
Orjson is so much faster that it is the better choice in all cases
Pro:
Cons:
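A micro-benchmark sketch to sanity-check the speed claim (numbers vary by machine and data shape):
```python
import json
import time

import orjson

data = {"users": {str(i): {"age": i, "job": "Engineer"} for i in range(100_000)}}

t0 = time.perf_counter()
json.dumps(data)
t1 = time.perf_counter()
orjson.dumps(data)
t2 = time.perf_counter()
print(f"stdlib json: {(t1 - t0) * 1000:.1f} ms, orjson: {(t2 - t1) * 1000:.1f} ms")
```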
Currently, all lock files and index files are stored in a .ddb folder in the storage dir.
Pro:
Con: