
mkrd / dictdatabase


A Python NoSQL dictionary database with concurrent access and ACID compliance.

License: MIT License

Python 99.47% Shell 0.05% Just 0.48%
compression database json python documentdb nosql dict acid multiprocessing multithreading

dictdatabase's People

Contributors

bobotig, github-actions[bot], matt-wisdom, mkrd, petersr, pires22


dictdatabase's Issues

Add indexing capabilities

The first draft is:

  • An index file exists for each db file in the storage dir, inside a separate .ddb folder. It contains a JSON object that stores, for each key, the start and end index of that key's value in the db file.
  • When doing a partial read, first check whether such an index file exists and whether it contains the given key. If yes, skip the whole finding and seeking process and read directly from the file. If not, do the finding and seeking, then write the start and end indexes to the index file.
  • After writing, also update the index file.

External edits issue:

  • If the user manually changes a file, the indexing will be broken. This might make the idea infeasible.
  • One solution would be to wrap the parsing of partial reads in a try block; if a JSONDecodeError is raised, delete the index file.
  • In some cases, the parsed JSON might still be valid after an external modification, so wrong data would be returned without any error.
  • Another option would be to also store a hash of the value string in the index; if the hash of the read data differs, delete that key from the index file and fall back to a regular seek and find (a rough sketch follows below).
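
A rough sketch of how the index lookup with a hash check could look. This is illustrative only: partial_read and find_and_seek are hypothetical names, and the index format {key: [start, end, value_hash]} is an assumption, not the actual implementation.

import hashlib
import json
import os

def partial_read(db_path: str, key: str):
    # Assumed index format: {"<key>": [start, end, value_hash]} in a .ddb folder
    index_path = os.path.join(os.path.dirname(db_path), ".ddb", os.path.basename(db_path) + ".index")
    if os.path.exists(index_path):
        with open(index_path, "r") as f:
            index = json.load(f)
        if key in index:
            start, end, value_hash = index[key]
            with open(db_path, "rb") as f:
                f.seek(start)
                value_bytes = f.read(end - start)
            # Guard against external edits: only trust the index if the hash still matches
            if hashlib.sha256(value_bytes).hexdigest() == value_hash:
                return json.loads(value_bytes)
            # Stale entry: fall through to a regular find and seek, then re-index
    value, start, end = find_and_seek(db_path, key)  # hypothetical regular path
    # ... update the index file with the new (start, end, hash) for this key ...
    return value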

API improvement

import dictdatabase as DDB
from path_dict import PathDict
import time

# Measure time to get all cups

t1 = time.perf_counter()
cups = DDB.at("cups/*", key="organizer_email").read()


# REMOVE where selector
# KEEP as_type since it is also important in session
# KEEP at() since it is also used in session
# ADD: read_key(key, as_type=None) and read_keys(keys, as_type=None)
# ... OR ADD: .select_key(key) and .select_keys(keys) and .select_all() and .select_where(f) as intermediate step before read() or session()









# File Alt 1
DDB.read_file("cups/aachen", as_type=PathDict)
DDB.read_file("cups/aachen", key="version", as_type=PathDict)
DDB.read_file("cups/aachen", keys=["version", "name"], as_type=PathDict)
DDB.read_file("cups/aachen", not_keys=["big_part"], as_type=PathDict)
with DDB.file_session("cups/aachen", as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)
with DDB.file_session("cups/aachen", key="version", as_type=PathDict) as (session, version):
    version += 1
    session.write(version)

# File Alt 2
DDB.at_file("cups/aachen").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_key("version").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_keys("version", "name").read(as_type=PathDict)
DDB.at_file("cups/aachen").select_not_keys("big_part").read(as_type=PathDict)
with DDB.at_file("cups/aachen").session(as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)

# File Alt 3
DDB.file.at("cups/aachen").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_key("version").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_keys("version", "name").read(as_type=PathDict)
DDB.file.at("cups/aachen").select_not_keys("big_part").read(as_type=PathDict)
with DDB.file.at("cups/aachen").session(as_type=PathDict) as (session, cup):
    cup["count"] += 1
    session.write(cup)

# Dir Alt 1
DDB.read_dir("cups", as_type=PathDict)
DDB.read_dir("cups", key="version", as_type=PathDict)
DDB.read_dir("cups", keys=["version", "name"], as_type=PathDict)
with DDB.dir_session("cups", as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)
with DDB.dir_session("cups", key="version", as_type=PathDict) as (session, versions):
    versions["aachen", "version"] += 1
    session.write(versions)

# Dir Alt 2
DDB.at_dir("cups").read(as_type=PathDict)
DDB.at_dir("cups").select_key("version").read(as_type=PathDict)
DDB.at_dir("cups").select_keys("version", "name").read(as_type=PathDict)
with DDB.at_dir("cups").session(as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)
with DDB.at_dir("cups").select_key("version").session(as_type=PathDict) as (session, versions):
    versions["aachen", "version"] += 1
    session.write(versions)

# Dir Alt 3
DDB.dir.at("cups").read(as_type=PathDict)
DDB.dir.at("cups").select_key("version").read(as_type=PathDict)
DDB.dir.at("cups").select_keys("version", "name").read(as_type=PathDict)
with DDB.dir.at("cups").session(as_type=PathDict) as (session, cups):
    cups["aachen"]["count"] += 1
    session.write(cups)
with DDB.dir.at("cups").select_key("version").session(as_type=PathDict) as (session, versions):
    versions["aachen", "version"] += 1
    session.write(versions)











# Read key from top level file:
v1 = DDB.at("app_state", key="version").read(as_type=PathDict)
v2 = DDB.at(file="app_state").select_key("version").read(as_type=PathDict)


# Read one cup:
v1 = DDB.at("cups/1").read(as_type=PathDict)
v2 = DDB.at("cups/1").read(as_type=PathDict)


# Read all cups:
v1 = DDB.at("cups/*").read(as_type=PathDict)
v2 = DDB.at("cups").read(as_type=PathDict)
v3 = DDB.at(dir="cups").select_all().read(as_type=PathDict)


# Read only mail addresses:
v1 = DDB.at("cups/*", key="organizer_email").read(as_type=PathDict)
v2 = DDB.at("cups/*").select_key("organizer_email").read(as_type=PathDict)



# Read locations and emails:
v1 = ... # Not Possible
v2 = DDB.at("cups/*").select_keys("organizer_email", "location").read(as_type=PathDict)



t2 = time.perf_counter()

assert len(cups) == 228

print(f"Time to get all cups: {(t2-t1)*1000:.1f} ms")

print(cups)

Change .session() Error behavior

Currently, if a file does not exist, a FileNotFoundError is raised.
If the file exists but a key that does not exist was specified, a KeyError is raised.

Proposal:

  • If the file exists but a nonexistent key was specified, no error is raised and the session variable is None. On write, the new key is added to the db file (a sketch of the proposed behavior follows below).
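
A minimal sketch of the proposed behavior, assuming session.write() could accept the new value (as also discussed in the primitive-value issue further down). None of this is current behavior.

import dictdatabase as DDB

DDB.at("users").create({}, force_overwrite=True)

# Proposed: no KeyError for a missing key; the session variable is None instead
with DDB.at("users", key="new_user").session() as (session, value):
    assert value is None           # key does not exist yet
    value = {"created": True}
    session.write(value)           # assumed: write() accepts the new value

print(DDB.at("users", key="new_user").read())  # -> {"created": True}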

Add config.use_indexing?

By default, DDB creates an index file for each db file, in which the indices of the values and the value hashes are stored.
This drastically increases performance, but might not be wanted in some cases. If no index files should be written, DDB.config.use_indexing = False could disable it.

Better locking

Problems

  1. Currently, a lock is removed after a certain time threshold in order to automatically remove dead locks. But if a session takes longer than the timeout, its still-active lock is removed as well
  2. Locking is relatively slow under heavy multithreaded work

Solution for 1

  • The timeout problem could be fixed if the active lock periodically (more often than the timeout duration) rewrites its lock with a newer timestamp, so no other process will remove it. The difficulty is how this should work; having an extra thread or process do the refresh work is an option (a rough heartbeat sketch follows below)
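
A hypothetical heartbeat sketch, not the current implementation: a daemon thread bumps the lock file's mtime more often than the timeout, so other processes never consider the lock dead while the session is still running. keep_lock_alive and the lock path are made-up names.

import os
import threading

def keep_lock_alive(lock_path: str, refresh_interval: float, stop_event: threading.Event):
    # Refresh the lock's timestamp until the session signals that it is done
    while not stop_event.wait(refresh_interval):
        os.utime(lock_path)  # bump mtime as the "newer timestamp"

stop = threading.Event()
refresher = threading.Thread(target=keep_lock_alive, args=("storage/.ddb/db.lock", 5.0, stop), daemon=True)
refresher.start()
# ... long-running session work, safely exceeding the lock timeout ...
stop.set()
refresher.join()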

Solution for 1 and 2

Using a different locking mechanism could fix both problems at once

Unintuitive behavior when session with sub key value is a primitive

DDB.at("test").create({"a": 0}, force_overwrite=True)
    with DDB.at("test", key="a").session() as (session, a):
        a = 1
        session.write()
    print(DDB.at("test").read())

The value of "a" is still 0 afterwards, but it would be more intuitive if it were 1. The reason is that primitives like int are passed by value, not by reference.

The current way to solve this is:

DDB.at("test").create({"a": 0}, force_overwrite=True)
    with DDB.at("test_pd").session() as (session, d):
        d["a"] = 1
        session.write()
    print(DDB.at("test").read())

But this is not as efficient, since the whole test file has to be loaded instead of only the key.

Another solution would be to pass the variable into session.write():

DDB.at("test").create({"a": 0}, force_overwrite=True)
    with DDB.at("test", key="a").session() as (session, a):
        a = 1
        session.write(a)
    print(DDB.at("test").read())

A possible solution without changing the syntax is using the current frame with inspect:

import inspect

class ObtainVarInWith:
    def __enter__(self):
        # Snapshot the caller's locals before the with block body runs
        f = inspect.currentframe().f_back
        self.oldvars = dict(f.f_locals)
        self.x = 2
        return self.x

    def __exit__(self, exc_type, exc_value, tb):
        # Compare the caller's locals against the snapshot to see what the block did
        f = inspect.currentframe().f_back
        for name, val in f.f_locals.items():
            if name not in self.oldvars:
                print("New variable:", name, val)
            elif val is not self.oldvars[name]:
                print("Changed variable:", name, val)


with ObtainVarInWith() as x:
    print("in with block", x)
    x = 3  # rebinding x is visible to __exit__ via the caller's frame
    print("after assignment in with block", x)

Use glob match for dictionary keys

Suppose I have a database with the following layout:

A/
  meta.json
  data.json
B/
  meta.json
  data.json
C/
  ...

Then suppose I try to read all the metadata via:

DDB.at('*/meta').read()

This will return the contents of only one of the meta.json files; what I'd like is a dictionary of all of them, indexed by 'A', 'B', 'C', .... In other words, I'd like the glob wildcard matches to be used as keys, not the last component of the path.

Is it possible to add such functionality?
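
To illustrate the requested behavior (this is the desired output, not what DDB currently returns):

all_meta = DDB.at("*/meta").read()
# Desired: {"A": <contents of A/meta.json>, "B": <contents of B/meta.json>, "C": ...}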

P.S. I realize that it's possible to reorganize the database into something like:

meta/
    A.json
    B.json
    C.json
data/
    A.json
    B.json
    C.json

But I think that's less flexible.

.at() filter capabilities

.at() should be able to accept a lambda function at the final selection level to allow filtering for a specific condition.

DDB.at("tasks", lambda t: t["annotator_id"] == current_user.id).read()
to get all tasks where the annotator id matches the current user's id (a rough sketch of the semantics follows below).
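
A hypothetical sketch of the semantics, built only on the existing wildcard read and assuming it returns a dict of file name to document, as in the API examples above. read_where is a made-up helper, not part of the DDB API.

def read_where(dir_name: str, predicate):
    # Read every file in the directory, keep only the documents matching the predicate
    docs = DDB.at(f"{dir_name}/*").read()
    return {name: doc for name, doc in docs.items() if predicate(doc)}

open_tasks = read_where("tasks", lambda t: t["annotator_id"] == current_user.id)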

Design decisions

  • Drop support for normal json in favor of orjson? orjson is so much faster that it is the better choice in all cases
  • Use the .ddb folder as of now, or write lock files into the storage dir? In the storage dir, they will be noticed more but disappear completely after use. The .ddb dir stays in the folder
  • Add config.use_indexing? Sometimes indexing might not be wanted

Improve where selector for file

Currently, the entire file needs to be read; by partially reading each key-value pair sequentially, we could avoid having to load the entire file into memory at once.

Proposal:

  • Do many partial reads instead of one full read (a rough sketch follows below)
  • Might lead to a lot of overhead, since multiple full reads will be executed if multiple keys are not in the index
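
A rough sketch of the idea, assuming the set of top-level keys is already known (for example from the index file). where_via_partial_reads is a made-up name; each key is fetched with an existing partial read.

def where_via_partial_reads(file_name: str, keys, predicate):
    result = {}
    for key in keys:
        value = DDB.at(file_name, key=key).read()  # partial read per key
        if predicate(key, value):
            result[key] = value
    return result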

Batched partial reading

Currently, a partial read still has to load the entire file into memory if the key was not found by the indexer

Proposal:

  • utils.find_outermost_key_in_json_bytes and seek_index_through_value_bytes should be changed to operate on batches of file bytes, going through at most x megabytes of data at once (a rough sketch of a chunked scan follows below).
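
A generic sketch of a chunked byte scan, independent of the actual utils functions. find_key_offset_batched and the chunk size are assumptions for illustration only.

CHUNK_SIZE = 16 * 1024 * 1024  # scan at most 16 MB of file bytes at a time

def find_key_offset_batched(path: str, key: bytes) -> int:
    # Return the byte offset of the first occurrence of `key`, or -1 if not found
    overlap = len(key) - 1
    offset = 0      # total bytes consumed so far
    tail = b""      # carry-over so matches spanning chunk boundaries are not missed
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            window = tail + chunk
            idx = window.find(key)
            if idx != -1:
                return offset - len(tail) + idx
            tail = window[-overlap:] if overlap else b""
            offset += len(chunk)
    return -1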

Feature: Object mapper

First draft finished. Issues to solve:

  • Available Bases should be improved to allow better nested mapping
  • When going into a read-write session, there is no way to map the object back into a dict. This is required to run recursively for all mapped subfields

Option not to sort the keys

First, I love your lib; I saw it on Reddit and have replaced my JSON config saver with it.
However, for me the order of the keys is important, and in serialize_data_to_json_bytes in io_unsafe.py you always sort the keys. This hurts.
My personal opinion is not to do that there. If you want to sort your keys, that could/should be done beforehand.
Kind of separate the purpose of serializing the data from ordering the data.
Alternatively, you could add an option when creating the instance if you want an alternative solution.
Just my two cents.

For now, I patch it with:

import json

# Note: `config`, `io_bytes`, and `SessionFileFull` come from dictdatabase's
# internal modules; the exact import paths are omitted here.

def serialize_data_to_json_bytes(data: dict) -> bytes:
    from dictdatabase import config
    if config.use_orjson:
        import orjson
        option = (orjson.OPT_INDENT_2 if config.indent else 0)
        return orjson.dumps(data, option=option)
    else:
        db_dump = json.dumps(data, indent=config.indent, sort_keys=False)
        return db_dump.encode()

def io_write(db_name: str, data: dict):
    data_bytes = serialize_data_to_json_bytes(data)
    io_bytes.write(db_name, data_bytes)

def write(self):
    super(SessionFileFull, self).write()
    io_write(self.db_name, self.data_handle)

# Monkeypatch the session write to use the non-sorting serializer
SessionFileFull.write = write

Write seek_index_through_value in C or Rust

Performance profiling shows that seek_index_through_value is the bottleneck for fast partial reading. Consider rewriting the function in C, which will probably be much faster.

Question about how `key` matches

I love the idea of this library! Thanks for building it. :)

This isn't an issue, but more of a question. When I was playing with this library, the results of key were surprising to me because I was unclear what was going to match.

{
  "users": {
    "Ben": {
      "age": 30,
      "job": "Software Engineer"
    }
  },
  "Ben": {"job": "Plumber"}
}
print(DDB.at("users", key="job").read())
>>> "Plumber"

Have you thought about requiring the key to be more explicit? Maybe something like XPath (key='Ben/job'), jq (key='.Ben.job'), or glom (key='Ben.job'), or something else?

Or maybe if there is a reason to keep the current functionality, a note in the readme about this behavior might be useful for other devs?

Drop json in favor of orjson?

Orjson is so much faster that it is the better choice in all cases

Pro:

  • Faster
  • Simpler code

Cons:

  • orjson only allows indenting with two spaces, or no indentation at all, when writing
  • Only allows strings as keys by default. This can be changed, but the performance impact should be tested (see the snippet below)
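
For reference, the orjson options involved in both points (this is orjson's actual API, shown here only to ground the pros and cons):

import orjson

data = {"a": 1, "nested": {"b": 2}}

# Indentation: either two spaces (OPT_INDENT_2) or none at all
pretty = orjson.dumps(data, option=orjson.OPT_INDENT_2)
compact = orjson.dumps(data)

# Non-string keys raise by default; OPT_NON_STR_KEYS allows them,
# at a serialization performance cost
with_int_keys = orjson.dumps({1: "one"}, option=orjson.OPT_NON_STR_KEYS)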

Add XPath-like searching for keys

{
  "users": {
    "Ben": {
      "age": 30,
      "job": "Software Engineer"
    }
  },
  "Ben": {"job": "Plumber"}
}
print(DDB.at("users", key="job").read())
>>> "Plumber"

Have you thought about requiring the key to be more explicit? Maybe something like XPath (key='Ben/job'), jq (key='.Ben.job'), or glom (key='Ben.job'), or something else?

Indexer behavior

  • Instead of only indexing the key that was not found by the indexer, do a full indexing of all keys, since the full file was read anyway

Use .ddb folder?

Currently, all lock files and index files are stored in a .ddb folder in the storage dir.

Pro:

  • Keep all indexes and locks in a separate place

Con:

  • Does not disappear, unlike when locks and indices are written directly next to the corresponding file.
  • index files are directly inside the folder structure
