
mkrd / dictdatabase


A Python NoSQL dictionary database with concurrent access and ACID compliance.

License: MIT License

Python 99.47% Shell 0.05% Just 0.48%
compression database json python documentdb nosql dict acid multiprocessing multithreading

dictdatabase's People

Contributors

bobotig, github-actions[bot], matt-wisdom, mkrd, petersr, pires22


dictdatabase's Issues

Add indexing capabilities

The first draft is:

  • An index file exists for each db file in the storage dir, inside a separate .ddb folder. It contains a JSON object that stores, for each key, the start and end index of that key's value in the db file.
  • When doing a partial read, first check whether such an index file exists and whether it contains the given key. If yes, skip the whole finding and seeking process and read directly from the file. If not, do the finding and seeking, then write the start and end indexes to the index file.
  • After writing, also update the index file.

External edits issue:

  • If the user manually changes a file, the indexing will be broken. This might make the idea infeasible.
  • One solution would be to wrap the parsing of partial reads in a try block; if a JSONDecodeError is raised, delete the index file.
  • In some cases, the parsed JSON might still be valid after an external modification, so wrong data would be returned without any error.
  • Another option would be to also store a hash of the value string in the index; if the hash of the read data differs, delete that key from the index file and fall back to a regular seek and find (a rough sketch follows below).
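
A rough sketch of how the index lookup with a hash check could look. This is illustrative only: partial_read and find_and_seek are hypothetical names, and the index format {key: [start, end, value_hash]} is an assumption, not the actual implementation.

import hashlib
import json
import os

def partial_read(db_path: str, key: str):
    # Assumed index format: {"<key>": [start, end, value_hash]} in a .ddb folder
    index_path = os.path.join(os.path.dirname(db_path), ".ddb", os.path.basename(db_path) + ".index")
    if os.path.exists(index_path):
        with open(index_path, "r") as f:
            index = json.load(f)
        if key in index:
            start, end, value_hash = index[key]
            with open(db_path, "rb") as f:
                f.seek(start)
                value_bytes = f.read(end - start)
            # Guard against external edits: only trust the index if the hash still matches
            if hashlib.sha256(value_bytes).hexdigest() == value_hash:
                return json.loads(value_bytes)
            # Stale entry: fall through to a regular find and seek, then re-index
    value, start, end = find_and_seek(db_path, key)  # hypothetical regular path
    # ... update the index file with the new (start, end, hash) for this key ...
    return value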

API improvement

import dictdatabase as DDB
from path_dict import PathDict
import time

# Measure time to get all cups

t1 = time.perf_counter()
cups = DDB.at("cups/*", key="organizer_email").read()


# REMOVE where selector
# KEEP as_type since it is also important in session
# KEEP at() since it is also used in session
# ADD: read_key(key, as_type=None) and read_keys(keys, as_type=None)
# ... OR ADD: .select_key(key) and .select_keys(keys) and .select_all() and .select_where(f) as intermediate step before read() or session()









# File Alt 1
DDB.read_file("cups/aachen", as_type=PathDict)
DDB.read_file("cups/aachen", key="version", as_type=PathDict)
DDB.read_file("cups/aachen", keys=["version", "name"], as_type=PathDict)
DDB.read_file("cups/aachen", not_keys=["big_part"], as_type=PathDict)
with DDB.file_session("cups/aachen", as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)
with DDB.file_session("cups/aachen", key="version", as_type=PathDict) as (session, version):
    version += 1
    session.write(version)

# File Alt 2
DDB.at_file("cups/aachen").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_key("version").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_keys("version", "name").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_not_keys("big_part").read(as_type=PathDict)
with DDB.at_file("cups/aachen").session(as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)

# File Alt 3
DDB.file.at("cups/aachen").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_key("version").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_keys("version", "name").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_not_keys("big_part").read(as_type=PathDict)
with DDB.file.at("cups/aachen").session(as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)

# Dir Alt 1
DDB.read_dir("cups", as_type=PathDict)
DDB.read_dir("cups", key="version", as_type=PathDict)
DDB.read_dir("cups", keys=["version", "name"], as_type=PathDict)
with DDB.dir_session("cups", as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)
with DDB.dir_session("cups", key="version", as_type=PathDict) as (session, versions):
    versions["aachen", "version"] += 1
    session.write(versions)

# Dir Alt 2
DDB.at_dir("cups").read(as_type=PathDict)
DDB.at_dir("cups").select_key("version").read(as_type=PathDict)
DDB.at_dir("cups").select_keys("version", "name").read(as_type=PathDict)
with DDB.at_dir("cups").session(as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)
with DDB.at_dir("cups").select_key("version").session(as_type=PathDict) as (session, versions):
    versions["aachen", "version"] += 1
    session.write(versions)

# Dir Alt 3
DDB.dir.at("cups").read(as_type=PathDict)
DDB.dir.at("cups").select_key("version").read(as_type=PathDict)
DDB.dir.at("cups").select_keys("version", "name").read(as_type=PathDict)
with DDB.dir.at("cups").session(as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)
with DDB.dir.at("cups").select_key("version").session(as_type=PathDict) as (session, versions):
    versions["aachen", "version"] += 1
    session.write(versions)











# Read key from top level file:
v1 = DDB.at("app_state", key="version").read(as_type=PathDict)
v2 = DDB.at(file="app_state").select_key("version").read(as_type=PathDict)


# Read one cup:
v1 = DDB.at("cups/1").read(as_type=PathDict)
v2 = DDB.at("cups/1").read(as_type=PathDict)


# Read all cups:
v1 = DDB.at("cups/*").read(as_type=PathDict)
v2 = DDB.at("cups").read(as_type=PathDict)
v3 = DDB.at(dir="cups").select_all().read(as_type=PathDict)


# Read only mail addresses:
v1 = DDB.at("cups/*", key="organizer_email").read(as_type=PathDict)
v2 = DDB.at("cups/*").select_key("organizer_email").read(as_type=PathDict)



# Read locations and emails:
v1 = ... # Not Possible
v2 = DDB.at("cups/*").select_keys("organizer_email", "location").read(as_type=PathDict)



t2 = time.perf_counter()

assert len(cups) == 228

print(f"Time to get all cups: {(t2-t1)*1000:.1f} ms")

print(cups)

Change .session() Error behavior

Currently, if a file does not exist, a FileNotFoundError is raised.
If the file exists but a key that does not exist was specified, a KeyError is raised.

Proposal:

  • If the file exists but a nonexistent key was specified, no error is raised and the session variable is None. On write, the new key is added to the db file (a sketch of the proposed behavior follows below).
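
A minimal sketch of the proposed behavior, assuming session.write() could accept the new value (as also discussed in the primitive-value issue further down). None of this is current behavior.

import dictdatabase as DDB

DDB.at("users").create({}, force_overwrite=True)

# Proposed: no KeyError for a missing key; the session variable is None instead
with DDB.at("users", key="new_user").session() as (session, value):
    assert value is None           # key does not exist yet
    value = {"created": True}
    session.write(value)           # assumed: write() accepts the new value

print(DDB.at("users", key="new_user").read())  # -> {"created": True}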

Add config.use_indexing?

By default, DDB creates an index file for each db file, in which the indices of the values and the value hashes are stored.
This drastically increases performance, but might not be wanted in some cases. If no index files should be written, DDB.config.use_indexing = False could disable it.

Better locking

Problems

  1. Currently, a lock is removed after a certain time threshold in order to automatically remove dead locks. But if a session takes longer than the timeout, its still-active lock is removed as well
  2. Locking is relatively slow under heavy multithreaded work

Solution for 1

  • The timeout problem could be fixed if the active lock periodically (more often than the timeout duration) rewrites its lock with a newer timestamp, so no other process will remove it. The difficulty is how this should work; having an extra thread or process do the refresh work is an option (a rough heartbeat sketch follows below)
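
A hypothetical heartbeat sketch, not the current implementation: a daemon thread bumps the lock file's mtime more often than the timeout, so other processes never consider the lock dead while the session is still running. keep_lock_alive and the lock path are made-up names.

import os
import threading

def keep_lock_alive(lock_path: str, refresh_interval: float, stop_event: threading.Event):
    # Refresh the lock's timestamp until the session signals that it is done
    while not stop_event.wait(refresh_interval):
        os.utime(lock_path)  # bump mtime as the "newer timestamp"

stop = threading.Event()
refresher = threading.Thread(target=keep_lock_alive, args=("storage/.ddb/db.lock", 5.0, stop), daemon=True)
refresher.start()
# ... long-running session work, safely exceeding the lock timeout ...
stop.set()
refresher.join()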

Solution for 1 and 2

Using a different locking mechanism could fix both problems at once

Unintuitive behavior when session with sub key value is a primitive

DDB.at("test").create({"a": 0}, force_overwrite=True)
    with DDB.at("test", key="a").session() as (session, a):
        a = 1
        session.write()
    print(DDB.at("test").read())

The value of "a" is still 0 afterwards, but it would be more intuitive if it were 1. The reason is that primitives like int are passed by value, not by reference.

The current way to solve this is:

DDB.at("test").create({"a": 0}, force_overwrite=True)
    with DDB.at("test_pd").session() as (session, d):
        d["a"] = 1
        session.write()
    print(DDB.at("test").read())

But this is not as efficient, since the whole test file has to be loaded instead of only the key.

Another solution would be to pass the variable into session.write():

DDB.at("test").create({"a": 0}, force_overwrite=True)
    with DDB.at("test", key="a").session() as (session, a):
        a = 1
        session.write(a)
    print(DDB.at("test").read())

A possible solution without changing the syntax is using the current frame with inspect:

import inspect

class ObtainVarInWith:
    def __enter__(self):
        # Snapshot the caller's locals before the with block body runs
        f = inspect.currentframe().f_back
        self.oldvars = dict(f.f_locals)
        self.x = 2
        return self.x

    def __exit__(self, exc_type, exc_value, tb):
        # Compare the caller's locals against the snapshot to see what the block did
        f = inspect.currentframe().f_back
        for name, val in f.f_locals.items():
            if name not in self.oldvars:
                print("New variable:", name, val)
            elif val is not self.oldvars[name]:
                print("Changed variable:", name, val)


with ObtainVarInWith() as x:
    print("in with block", x)
    x = 3  # rebinding x is visible to __exit__ via the caller's frame
    print("after assignment in with block", x)

Use glob match for dictionary keys

Suppose I have a database with the following layout:

A/
  meta.json
  data.json
B/
  meta.json
  data.json
C/
  ...

Then suppose I try to read all the metadata via:

DDB.at('*/meta').read()

This will return the contents of only one of the meta.json files; what I'd like is a dictionary of all of them, indexed by 'A', 'B', 'C', .... In other words, I'd like the glob wildcard matches to be used as keys, not the last component of the path.

Is it possible to add such functionality?
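
To illustrate the requested behavior (this is the desired output, not what DDB currently returns):

all_meta = DDB.at("*/meta").read()
# Desired: {"A": <contents of A/meta.json>, "B": <contents of B/meta.json>, "C": ...}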

P.S. I realize that it's possible to reorganize the database into something like:

meta/
    A.json
    B.json
    C.json
data/
    A.json
    B.json
    C.json

But I think that's less flexible.

.at() filter capabilities

.at() should be able to accept a lambda function at the final selection level to allow filtering for a specific condition.

DDB.at("tasks", lambda t: t["annotator_id"] == current_user.id).read()
to get all tasks where the annotator id matches the current user's id (a rough sketch of the semantics follows below).
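
A hypothetical sketch of the semantics, built only on the existing wildcard read and assuming it returns a dict of file name to document, as in the API examples above. read_where is a made-up helper, not part of the DDB API.

def read_where(dir_name: str, predicate):
    # Read every file in the directory, keep only the documents matching the predicate
    docs = DDB.at(f"{dir_name}/*").read()
    return {name: doc for name, doc in docs.items() if predicate(doc)}

open_tasks = read_where("tasks", lambda t: t["annotator_id"] == current_user.id)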

Design decisions

  • Drop support for normal json in favor of orjson? orjson is so much faster that it is the better choice in all cases
  • Use the .ddb folder as of now, or write lock files into the storage dir? In the storage dir, they will be noticed more but disappear completely after use. The .ddb dir stays in the folder
  • Add config.use_indexing? Sometimes indexing might not be wanted

Improve where selector for file

Currently, the entire file needs to be read; by partially reading each key-value pair sequentially, we could avoid having to load the entire file into memory at once.

Proposal:

  • Do many partial reads instead of one full read (a rough sketch follows below)
  • Might lead to a lot of overhead, since multiple full reads will be executed if multiple keys are not in the index
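
A rough sketch of the idea, assuming the set of top-level keys is already known (for example from the index file). where_via_partial_reads is a made-up name; each key is fetched with an existing partial read.

def where_via_partial_reads(file_name: str, keys, predicate):
    result = {}
    for key in keys:
        value = DDB.at(file_name, key=key).read()  # partial read per key
        if predicate(key, value):
            result[key] = value
    return result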

Batched partial reading

Currently, a partial read still has to load the entire file into memory if the key was not found by the indexer

Proposal:

  • utils.find_outermost_key_in_json_bytes and seek_index_through_value_bytes should be changed to operate on batches of file bytes, going through at most x megabytes of data at once (a rough sketch of a chunked scan follows below).
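
A generic sketch of a chunked byte scan, independent of the actual utils functions. find_key_offset_batched and the chunk size are assumptions for illustration only.

CHUNK_SIZE = 16 * 1024 * 1024  # scan at most 16 MB of file bytes at a time

def find_key_offset_batched(path: str, key: bytes) -> int:
    # Return the byte offset of the first occurrence of `key`, or -1 if not found
    overlap = len(key) - 1
    offset = 0      # total bytes consumed so far
    tail = b""      # carry-over so matches spanning chunk boundaries are not missed
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            window = tail + chunk
            idx = window.find(key)
            if idx != -1:
                return offset - len(tail) + idx
            tail = window[-overlap:] if overlap else b""
            offset += len(chunk)
    return -1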

Feature: Object mapper

First draft finished. Issues to solve:

  • Available Bases should be improved to allow better nested mapping
  • When going into a read-write session, there is no way to map the object back into a dict. This is required to run recursively for all mapped subfields

Option not to sort the keys

First, I love your lib; I saw it on Reddit and have replaced my JSON config saver with it.
However, for me the order of the keys is important, and in serialize_data_to_json_bytes in io_unsafe.py you always sort the keys. This hurts.
My personal opinion is not to do that there. If you want to sort your keys, that could/should be done beforehand.
Kind of separate the purpose of serializing the data from ordering the data.
Alternatively, you could add an option when creating the instance if you want an alternative solution.
Just my two cents.

For now, I patch it with:

import json

# Note: `config`, `io_bytes`, and `SessionFileFull` come from dictdatabase's
# internal modules; the exact import paths are omitted here.

def serialize_data_to_json_bytes(data: dict) -> bytes:
    from dictdatabase import config
    if config.use_orjson:
        import orjson
        option = (orjson.OPT_INDENT_2 if config.indent else 0)
        return orjson.dumps(data, option=option)
    else:
        db_dump = json.dumps(data, indent=config.indent, sort_keys=False)
        return db_dump.encode()

def io_write(db_name: str, data: dict):
    data_bytes = serialize_data_to_json_bytes(data)
    io_bytes.write(db_name, data_bytes)

def write(self):
    super(SessionFileFull, self).write()
    io_write(self.db_name, self.data_handle)

# Monkeypatch the session write to use the non-sorting serializer
SessionFileFull.write = write

Write seek_index_through_value in C or Rust

Performance profiling shows that seek_index_through_value is the bottleneck for fast partial reading. Consider rewriting the function in C, which will probably be much faster.

Question about how `key` matches

I love the idea of this library! Thanks for building it. :)

This isn't an issue, but more of a question. When I was playing with this library, the results of key were surprising to me because I was unclear what was going to match.

{
  "users": {
    "Ben": {
      "age": 30,
      "job": "Software Engineer"
    }
  },
  "Ben": {"job": "Plumber"}
}
print(DDB.at("users", key="job").read())
>>> "Plumber"

Have you thought about requiring the key to be more explicit? Maybe something like XPath (key='Ben/job'), jq (key='.Ben.job'), or glom (key='Ben.job'), or something else?

Or maybe if there is a reason to keep the current functionality, a note in the readme about this behavior might be useful for other devs?

Drop json in favor of orjson?

Orjson is so much faster that it is the better choice in all cases

Pro:

  • Faster
  • Simpler code

Cons:

  • orjson only allows indenting with two spaces, or no indentation at all, when writing
  • Only allows strings as keys by default. This can be changed, but the performance impact should be tested (see the snippet below)
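
For reference, the orjson options involved in both points (this is orjson's actual API, shown here only to ground the pros and cons):

import orjson

data = {"a": 1, "nested": {"b": 2}}

# Indentation: either two spaces (OPT_INDENT_2) or none at all
pretty = orjson.dumps(data, option=orjson.OPT_INDENT_2)
compact = orjson.dumps(data)

# Non-string keys raise by default; OPT_NON_STR_KEYS allows them,
# at a serialization performance cost
with_int_keys = orjson.dumps({1: "one"}, option=orjson.OPT_NON_STR_KEYS)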

Add XPath-like searching for keys

{
  "users": {
    "Ben": {
      "age": 30,
      "job": "Software Engineer"
    }
  },
  "Ben": {"job": "Plumber"}
}
print(DDB.at("users", key="job").read())
>>> "Plumber"

Have you thought about requiring the key to be more explicit? Maybe something like XPath (key='Ben/job'), jq (key='.Ben.job'), or glom (key='Ben.job'), or something else?

Indexer behavior

  • Instead of only indexing the key that was not found by the indexer, do a full indexing of all keys, since the full file was read anyway

Use .ddb folder?

Currently, all lock files and index files are stored in a .ddb folder in the storage dir.

Pro:

  • Keep all indexes and locks in a separate place

Con:

  • Does not disappear, unlike when locks and indices are written directly next to the corresponding file.
  • index files are directly inside the folder structure
