FatCatDB

- This project is still in its beta phase. If you want a well-tested version, please come back in April. -

FatCatDB is a zero-configuration database library for .NET Core. Its main target segment is ETL workflows (e.g. time-series data), so it's optimized for high throughput. It supports class-based schema definition, multiple indices per table and fluent, object-oriented query expressions. You would use it in a smaller project to avoid managing PostgreSQL or another full-fledged database system. With this library your project has data storage capability right after it's cloned from a Git repo; you don't need to create and maintain Docker images for a database server.

Example query

You can build fluent-style queries using lambda expressions:

var db = new DbContext();
var cursor = db.Metrics.Query()
    .Where(x => x.Date, "2020-01-02")
    .Where(x => x.AccountID, "a11")
    .FlexFilter(x => x.Revenue > x.Cost * 2.2 && x.Impressions > 10)
    .OrderByAsc(x => x.CampaignID)
    .OrderByDesc(x => x.Cost)
    .Limit(100)
    .AfterBookmark(bookmark)
    .GetCursor();

foreach(var item in cursor) {
    ...
}

NuGet package

Available at: https://www.nuget.org/packages/FatCatDB

To include it in a .NET Core project, execute:

$ dotnet add package FatCatDB

Creating a table schema

See the example below. You only have to add annotations to a class and to some of its public properties. Please find an explanation below the example. All annotated columns must have a nullable type, which you can achieve either by adding a question mark after a non-nullable type, like long?, or by using Nullable<long>. They also need to be comparable, meaning they have to implement the IComparable interface.

using System;
using FatCatDB.Annotation;
using NodaTime;

namespace FatCatDB.Test {
    [Table(Name = "test_event", Unique = "campaign_id, ad_id", NullValue = "n.a.")]
    [TableIndex(Name = "account_date", Columns = "account_id, date")]
    [TableIndex(Name = "date_account", Columns = "date, account_id")]
    public class MetricsRecord {
        [Column(Name = "date")]
        public LocalDate? Date { get; set; }

        [Column(Name = "account_id")]
        public string AccountID { get; set; }

        [Column(Name = "campaign_id")]
        public string CampaignID { get; set; }

        [Column(Name = "ad_id")]
        public string AdID { get; set; }

        [Column(Name = "last_updated")]
        public LocalDateTime? LastUpdated { get; set; }

        [Column(Name = "impressions")]
        public long? Impressions { get; set; }

        [Column(Name = "clicks")]
        public long? Clicks { get; set; }

        [Column(Name = "conversion")]
        public long? Conversions { get; set; }

        [Column(Name = "revenue")]
        public decimal? Revenue { get; set; }

        [Column(Name = "cost")]
        public decimal? Cost { get; set; }
    }
}

The annotations have the following meaning:

  • Table.Name: The name of the database table. Used in error messages and in the filesystem structure.
  • Table.Unique: Each TableIndex defines a way to partition the data into packets. The Unique property defines uniqueness inside a packet only. Do not use the same list of fields here as in any of the TableIndex annotations, because you would end up with packets containing a single record each. Think of this as a continuation of the indices.
  • Table.NullValue: The string representation used to store "unknown" or NULL values.
  • TableIndex.Name: The name of a database table index, which speeds up queries. If you define 3 indices, then the data is stored 3 times on the disk, redundantly.
  • TableIndex.Columns: Comma-separated list of columns. This works the same way as composite indices in a relational database. The encoded content of an indexed field cannot be longer than 248 characters. In FatCatDB an index defines a multi-level directory structure on the disk, which contains .tsv.gz files, called packets. The column list tells how to partition the data into packets; the optimal packet size is between 10 KB and 1 MB. These multi-level directory structures are what make queries quick.
  • Column.Name: If you want a property to be part of the database table, then add a Column annotation to it. The Name tells how the Table.Unique and the TableIndex.Columns fields refer to it. The data is also exported under this name by default.

All properties without annotations are simply ignored by FatCatDB and won't cause any problems. Feel free to include arbitrary logic (methods, custom properties, private members, etc.) in your record classes.
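
For example, a minimal sketch (ProfitRecord and its members are hypothetical; the Table and TableIndex annotations are omitted for brevity):

using FatCatDB.Annotation;

public class ProfitRecord {
    [Column(Name = "revenue")]
    public decimal? Revenue { get; set; }

    [Column(Name = "cost")]
    public decimal? Cost { get; set; }

    // No annotation: ignored by FatCatDB, free for application logic
    public decimal? Profit => Revenue - Cost;
}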

Creating a database context class

The design of FatCatDB follows dependency injection to make unit testing possible. Therefore, to use it, you have to instantiate a database context class, which is derived from DbContextBase.

A minimal database context class contains only your tables:

public class DbContext : DbContextBase {
    public Table<CatRecord> Cats { get; } = new Table<CatRecord>();
    public Table<DogRecord> Dogs { get; } = new Table<DogRecord>();
    public Table<BunnyRecord> Bunnies { get; } = new Table<BunnyRecord>();
}

But you can also change the default configuration:

public class DbContext : DbContextBase {
    public Table<MetricsRecord> Metrics { get; } = new Table<MetricsRecord>();
    
    protected override void OnConfiguring (TypeConverterSetup typeConverterSetup, Configurator configurator) {
        configurator
            .SetTransactionParallelism(8)
            .SetQueryParallelism(8)
            .EnableDurability(false);
    }
}

Two important things to note:

  • All tables must be defined inside the context class in the above way: as a property with a get accessor, initialized to an instance with the new operator. In the above example, Metrics is a table that contains MetricsRecord records.
  • You can optionally override the OnConfiguring method to change the default configuration or to extend the system with your custom types. For the latter, see the section about custom types.

The available configuration options are the following:

  • .SetTransactionParallelism(8): Specifies the number of threads working on a single data modification transaction. This should have a high value for console applications and a low one for servers. Default value: 4.
  • .SetQueryParallelism(8): Specifies the number of threads working on a single query. This should have a high value for console applications and a low one for servers. Default value: 4.
  • .EnableDurability(false): If durability is enabled, then instead of overwriting files they are first written to a temporary file, which is then swapped with the old one. Disabled by default.
  • .SetDatabasePath("/path/to/dir"): Configures a custom path to the database folder. By default it is {WorkDirectory}/var/data. Use a relative path to specify a path relative to the working directory (see the sketch below).
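
Here is a minimal sketch of pointing the database at a custom folder (the path is hypothetical):

public class DbContext : DbContextBase {
    public Table<MetricsRecord> Metrics { get; } = new Table<MetricsRecord>();

    protected override void OnConfiguring (TypeConverterSetup typeConverterSetup, Configurator configurator) {
        // A relative path is resolved against the working directory
        configurator.SetDatabasePath("var/metrics-data");
    }
}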

Inserting and modifying data

The data is modified in bigger chunks, called "transactions". To create one, just use the NewTransaction() method on one of your tables:

var db = new DbContext();
var transaction = db.YourTable.NewTransaction();

Adding and updating records are both done using the same Add method. By default FatCatDB always performs an upsert. You can change this behaviour with OnUpdate event handlers.

var record = new MyRecord();
record.Name = "Name1";
record.Time = db.NowUTC;

transaction.Add(record);

The unique fields determine when two records belong to the same entity. If a record already exists, then it gets updated automatically. You can also remove records using:

transaction.Remove(record);

At the end you have to commit the transaction to save the changes to disk:

foreach(var record in records) {
    transaction.Add(record);
}

transaction.Commit();

The bigger the transactions are, the higher performance you get. Feel free to store multiple gigabytes in a single commit. If you pass true to the Commit method, then it also forces garbage collection in the .NET runtime at the end:

transaction.Commit(true);

Queries

The following are the typical levels involved in a query: the database context, a query, a cursor and, optionally, an exporter for when you don't iterate through the records yourself.

var db = new DbContext();
var query = db.MyTable.Query()
    .Where(x => x.Name, "John Smith")
    .Where(x => x.Age, 25)
    .OrderByAsc(x => x.LastModified);

var cursor = query.GetCursor();
var exporter = cursor.GetExporter();
exporter.Print();

The above example prints the results to the standard output in Linear TSV format, but there's always a shortcut for everything:

db.MyTable.Query()
    .Where(x => x.Name, "John Smith")
    .Where(x => x.Age, 25)
    .OrderByAsc(x => x.LastModified)
    .Print();

The cursor is an enumerable of your record class, which you can loop through:

foreach(var item in cursor) {
    ...
}

You can fetch the first item with the FindOne() method. It returns null if no record is found:

var person = db.MyTable.Query()
    .Where(x => x.Name, "John Smith")
    .Where(x => x.Age, 25)
    .OrderByAsc(x => x.LastModified)
    .FindOne();

if (person != null) {
    ...
}

Please find below a complete list of query directives:

  • .Where(x => x.Date, "2020-02-09"): Filters on a specific value (exact match). This kind of filtering is fast, because it uses the indices. You can pass either a value of the column's original type or its string representation. (See "Adding new types" about the string conversion.)
  • .Where(x => x.Date, new LocalDate(2020, 2, 9)): The same Where filter, using a value of the column's original type. This also uses the indices.
  • .FlexFilter(x => x.Cost > x.Revenue && x.Impressions > 10): In flex filters you can specify an arbitrary expression over the columns. This filtering is slow, as it doesn't use the indices.
  • .OrderByAsc(x => x.Budget) / .OrderByDesc(x => x.Budget): Orders by a column in ascending or descending direction. You can append multiple sorting directives to sort over multiple fields, in which case the order of the directives matters.
  • .Limit(limit): Specifies the maximum number of items to return. For the offset see the next directive.
  • .AfterBookmark(bookmark): Instead of an offset value, FatCatDB uses strings called bookmarks. They provide a much more efficient way to continue a query than offset values. See the chapter "Paging with bookmarks".
  • .HintIndexPriority(IndexPriority.Sorting): Hints the index selection algorithm. See the section "Hinting the query planner" for more details.
  • .HintIndex("index_name"): Hints a specific index. See the section "Hinting the query planner" for more details.

Note that since the cursor is an enumerable of record objects, you can use LINQ expressions on it. But if you do that, then the whole result set gets loaded into memory (if there's enough memory for it). Therefore it's recommended to use LINQ only in the presence of a Limit directive.
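
For example, here is a sketch that aggregates a limited result set with standard LINQ (the grouping logic is illustrative and uses the MetricsRecord schema from above):

using System.Linq;

var totals = db.Metrics.Query()
    .Where(x => x.Date, "2020-01-02")
    .Limit(100)
    .GetCursor()                     // enumerable of MetricsRecord
    .GroupBy(x => x.CampaignID)
    .Select(g => new { CampaignID = g.Key, Cost = g.Sum(x => x.Cost ?? 0m) })
    .ToList();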

Paging with bookmarks

Paging in most database systems is done using the combination of a limit and an offset directive. Instead of offsets, FatCatDB uses bookmarks. Basically, a bookmark describes the last item fetched during a query, so the query can be continued later in a different request. Bookmarks are more efficient than offsets.

You can get a bookmark from either a cursor or from an exporter:

var cursor = db.People.Query()
    .Where(x => x.City, "Amsterdam")
    .Where(x => x.Age, 25)
    .OrderByAsc(x => x.ID)
    .Limit(10)
    .GetCursor();

// ... process data ...

string bookmark = cursor.GetBookMark();

Or from an exporter:

var exporter = db.People.Query()
    .Where(x => x.City, "Amsterdam")
    .Where(x => x.Age, 25)
    .OrderByAsc(x => x.ID)
    .Limit(10)
    .GetExporter();

// ... export data ...

string bookmark = exporter.GetBookMark();

The bookmark is something like:

eyJGcmFnbWVudHMiOlt7InRhYmxlTmFtZSI6InRlc3RfZXZlbnQiLCJpbmRleE5hbWUiOiJhY2NvdW50X2RhdGUiLCJQYXRoIjp7ImFjY291bnRfaWQiOiJhMTAiLCJkYXRlIjoiMjAyMC0wMS0wMSIsImFkX2lkIjoiMTAwMTIifX1dfQ==
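
A bookmark is a Base64-encoded JSON document that identifies the index used and the position of the last fetched record within it; treat it as an opaque string.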

Then to continue the same query, supply the bookmark using the AfterBookmark directive:

var cursor = db.People.Query()
    .Where(x => x.City, "Amsterdam")
    .Where(x => x.Age, 25)
    .OrderByAsc(x => x.ID)
    .Limit(10)
    .AfterBookmark(bookmark)
    .GetCursor();

If the bookmark is null, then it is disabled:

    .AfterBookmark(null)
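
A typical paging loop might look like the following sketch (it assumes the People table from the examples above; the page size and the stop condition are illustrative):

string bookmark = null;

while (true) {
    var cursor = db.People.Query()
        .Where(x => x.City, "Amsterdam")
        .OrderByAsc(x => x.ID)
        .Limit(10)
        .AfterBookmark(bookmark)    // null on the first iteration: disabled
        .GetCursor();

    int count = 0;
    foreach (var person in cursor) {
        count++;
        // ... process person ...
    }

    if (count < 10) {
        break;    // short page: no more records
    }

    bookmark = cursor.GetBookMark();
}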

Query plans

You can generate a user-friendly description of the query plan that would be used for your query. Example:

var db = new DbContext();
var plan = db.Metrics.Query()
    .Where(x => x.AccountID, "a11")
    .OrderByAsc(x => x.Date)
    .OrderByAsc(x => x.Cost)
    .FlexFilter(x => x.Impressions > x.Clicks && x.Revenue > 0)
    .Limit(100)
    .GetQueryPlan();

Console.Write(plan);

The response:

- The default index selection mode was selected which gives priority to filtering over sorting.
- The selected index is 'account_date'. The steps of the query are:
    - Index levels:
        - 1. account_id: Select one (exact match)
        - 2. date: Sort by (full scan)
    - Apply flex filtering.
    - Apply the sorting directives inside the packets, which weren't used for an index level:
        - cost
    - Limit: The maximal number of records to return is 100

Atomic operations with the OnUpdate event

During the update of a record there's a narrow window of time when both the old and the new version of the record are available in memory, and there's an exclusive lock on the record's packet. You can exploit this opportunity via the OnUpdate event.

You can specify a lambda function as an update event handler on a transaction. It is called during the commit phase whenever a record you pushed has the same unique key in a packet as an existing one. (So it would need to be updated.) Its return value is the new version of the record to be stored.

var db = new DbContext();
var transaction = db.MyTable.NewTransaction();

transaction.OnUpdate((oldRecord, newRecord) => {
    if (newRecord.Type == MyRecordTypes.NoUpdate) {
        // If you return null, then no changes will be made.
        return null;
    }

    // This incrementation is an atomic change
    newRecord.Counter = oldRecord.Counter + 1;
    
    return newRecord;
});

// The contents of "records" is probably imported from an external server.
foreach(var record in records) {
    ...
    // These are the "new records"
    transaction.Add(record);
}

transaction.Commit();

Note that a lambda function always brings its context with it, meaning it can see all variables and fields that are visible inside the method where you defined it. This can give great flexibility.

The return value can be of 4 kinds:

  • You can return the old record (or a modified version of it) if you would like to minimize the changes.
  • You can return the new record (or a modified version of it) if you would like to change most of the fields.
  • You can return null in order to avoid any modifications to the old record. (The new one will just be ignored and not stored anywhere.)
  • You can create a completely new record of the same type and return it.

But you can also throw an exception to stop the commit process. (Packets that were already saved will remain that way.)

Two things to note:

  • Changing the fields of the unique key is safe. You won't end up with duplicate records, no worries.
  • BUT changing fields which are used in any of the defined indices will result in an exception. It isn't allowed to change the indexed fields inside an OnUpdate event, because then the records would need to be relocated into another packet, which cannot be done efficiently. (You can do that in application code with the combination of a remove and an add on a transaction, as shown in the sketch below.)
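
Such a relocation could look like the following sketch (record is assumed to be an already fetched MetricsRecord whose indexed account_id field has to change):

var transaction = db.Metrics.NewTransaction();

transaction.Remove(record);    // remove it from its current packet
record.AccountID = "a12";      // change the indexed field
transaction.Add(record);       // it will be stored at the new location

transaction.Commit();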

If you don't provide an OnUpdate event handler, then the default behavior is to replace the old record with the new one: by default FatCatDB always does an upsert for conflicting unique keys. An OnUpdate event handler lets you change this behavior.

Making fields unchangeable using OnUpdate

In the previous section we described how to use the OnUpdate event handler in general and specifically for atomic operations.

Let's say that you are importing data from an external server. You would like to insert new records and update the old ones based on the unique key of the data. One of the fields of your schema is the date of creation, called Created, and you don't want it to change. One solution (the bad one) is to query the existing records, modify them based on the imported data and persist the result.

But you can do this more efficiently by just pushing all your data into the table (without any previous queries), and doing the fine-tuning inside the update event handler:

var db = new DbContext();
var transaction = db.MyTable.NewTransaction();

transaction.OnUpdate((oldRecord, newRecord) => {
    // Keep the creation date always unchanged
    newRecord.Created = oldRecord.Created;
    
    return newRecord;
});

foreach(var record in importedData) {
    ...
    transaction.Add(record);
}

transaction.Commit();

Async support

Asynchronous versions of all methods that are involved in input-output operations are available. Using async is only recommended for server applications. The only case where you would use async in a console application is when there's a source of async events, for example a FastCGI client or a hardware interface.

Examples for the query object:

await query.FindOneAsync();
await query.PrintAsync();

Async iteration over the cursor:

MyRecord item;
while ((item = await cursor.FetchNextAsync()) != null) {
    ...
}

You can also fetch multiple items in one call:

List<MyRecord> items = await cursor.FetchAsync(count);    // count: the maximum number of items to fetch

Async methods for the exporter that output the data in linear TSV text format:

await exporter.PrintAsync();
await exporter.PrintToTsvWriterAsync(tsvWriter);    // tsvWriter: a TsvWriter
await exporter.PrintToStreamAsync(stream);          // stream: a Stream
await exporter.PrintToFileAsync(path);              // path: a string
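
Putting the pieces together, here is a minimal async sketch (the method name and the filter values are illustrative; it assumes the usual System and System.Threading.Tasks imports and the MetricsRecord schema from above):

public async Task PrintTopCostsAsync(DbContext db) {
    var cursor = db.Metrics.Query()
        .Where(x => x.Date, "2020-01-02")
        .OrderByDesc(x => x.Cost)
        .Limit(10)
        .GetCursor();

    MetricsRecord item;
    while ((item = await cursor.FetchNextAsync()) != null) {
        Console.WriteLine($"{item.CampaignID}\t{item.Cost}");
    }
}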

Adding new types

With FatCatDB you can use columns of arbitrary types, and it's very easy to extend: the only thing to do is to use the TypeConverterSetup parameter in the OnConfiguring event of your database context class.

The following example adds the LocalDateTime type of the NodaTime library to FatCatDB. (This type is already added by default; this is only an example. See below.)

internal class DbContext : DbContextBase {
    public Table<MetricsRecord> Metrics { get; } = new Table<MetricsRecord>();
    
    private LocalDateTimePattern pattern = LocalDateTimePattern.CreateWithInvariantCulture(
        "yyyy-MM-dd HH:mm:ss"
    );

    protected override void OnConfiguring (TypeConverterSetup typeConverterSetup, Configurator configurator) {
        typeConverterSetup
            .RegisterTypeConverter<LocalDateTime, string>((x) => {
                return pattern.Format(x);
            })
            .RegisterTypeConverter<string, LocalDateTime>((x) => {
                return pattern.Parse(x).Value;
            });
    }
}

When you add a new type, you always have to register two converters: one that converts to string, and another that converts back from a string. The two generic type parameters of RegisterTypeConverter are the source and the target type. Examples:

typeConverterSetup
    .RegisterTypeConverter<MyType, string>((x) => {
        return x.ConvertToString( ... );
    })
    .RegisterTypeConverter<string, MyType>((x) => {
        return new MyType(x);
    });

By the way, the LocalDateTime and LocalDate types of NodaTime are added to FatCatDB by default, as that library is the recommended way of dealing with time, instead of the built-in classes of .NET.

If you want to sort by your custom type, then it has to implement the IComparable interface.
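
For example, here is a hypothetical Money type, sketched only to show the requirement (the class and its converters are illustrative, not part of FatCatDB):

using System;
using System.Globalization;

public class Money : IComparable {
    public decimal Amount { get; }

    public Money(decimal amount) {
        Amount = amount;
    }

    public int CompareTo(object other) {
        return Amount.CompareTo(((Money)other).Amount);
    }

    public override string ToString() {
        return Amount.ToString(CultureInfo.InvariantCulture);
    }
}

// Inside OnConfiguring, register the two converters:
typeConverterSetup
    .RegisterTypeConverter<Money, string>((x) => {
        return x.ToString();
    })
    .RegisterTypeConverter<string, Money>((x) => {
        return new Money(decimal.Parse(x, CultureInfo.InvariantCulture));
    });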

You can also overwrite the built-in converters with your own. Just use the same RegisterTypeConverter method as in the above examples. You can find the built-in ones in the constructor of TypeConverterSetup.

Hinting the query planner

The query planner tries to select the best index to execute a query. It has two modes of operation:

  • Filtering priority: This is the default. It selects the index by first looking at the Where statements and only then at the sorting directives. This mode gives the best performance if only a small fraction of all records is queried, but it can happen that sorting is not possible (considering your directives and the indexed fields). In that case you get an error message.
  • Sorting priority: Let's say you have 10 GB of data in a table and you want to query 95% of it with a complex sorting on multiple fields. In this case sorting priority is the best way to go, performance-wise. Use it only when you need the majority of records returned and you also have sorting directives in your query that match an index you defined.

Example:

var cursor = db.Metrics.Query()
    .FlexFilter(x => x.Impressions > 10)
    .OrderByAsc(x => x.AccountID)
    .OrderByAsc(x => x.CampaignID)
    .OrderByAsc(x => x.AdID)
    .OrderByDesc(x => x.Date)
    .HintIndexPriority(IndexPriority.Sorting)
    .GetCursor();

You can also hint a specific index if you know what you are doing:

query.HintIndex("index_name")

ACID and durability

FatCatDB is thread-safe, but provides only the read uncommitted isolation level for transactions. The primary usage scenario in mind is a single-threaded console application which loads data (most likely time-series data) from multiple sources, then transforms and stores it before pushing it to destination endpoints. Using it in servers is possible (since async methods are provided for everything), but not recommended, because of the high memory usage (packet size * concurrency) and the lack of complete ACID support.

In an average case the schema should be the same as, or at least resemble, the export format. So the data transformation ideally happens during the import, before storage. This means that high redundancy in the schema is normal and expected, contrary to relational databases.

Durability is provided by two different mechanisms. The first is that the data is stored independently for each index defined. If you define 3 indices for a table, then the data is stored redundantly 3 times on the disk in separate folder structures.

The other source of the durability is explicit, and can be enabled by a configuration setting in the database context class. If that setting is enabled, then instead of overwriting files, the library first creates temporary ones and then swaps them with the old ones.

Configurations

Debug build for development:

$ cd IntegrationTests
$ dotnet build -c Debug

Run the integration tests:

$ cd IntegrationTests
$ dotnet publish -c Release
$ ./run.sh

Create package for NuGet:

$ cd FatCatDB
$ dotnet build -c Release
$ dotnet pack -c Release

TODO

  • Add benchmarks
  • Implement tools for data recovery and maintenance
  • Extend the integration tests
  • Implement unit tests after the interfaces are finalized
  • Implement aggregation functionality
  • Implement left join and inner join
  • Implement query.Delete(), table.Truncate() and db.Drop()
  • Delete packets which became empty after removal of records.
  • Use local thread pool instead of global
