foundationdb / fdb-record-layer

A record-oriented store built on FoundationDB

License: Apache License 2.0

Python 0.06% Shell 0.11% Java 71.45% HTML 0.02% Raku 28.35%
relational-database foundationdb

fdb-record-layer's Introduction

FoundationDB logo

FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers. It organizes data as an ordered key-value store and employs ACID transactions for all operations. It is especially well-suited for read/write workloads but also has excellent performance for write-intensive workloads. Users interact with the database using API language bindings.

To learn more about FoundationDB, visit foundationdb.org

FoundationDB Record Layer

The Record Layer is a Java API providing a record-oriented store on top of FoundationDB, (very) roughly equivalent to a simple relational database, featuring:

  • Structured types - Records are defined and stored in terms of protobuf messages.
  • Indexes - The Record Layer supports a variety of different index types including value indexes (the kind provided by most databases), rank indexes, and aggregate indexes. Indexes and primary keys can be defined either via protobuf options or programmatically.
  • Complex types - Support for complex types, such as lists and nested records, including the ability to define indexes against such nested structures.
  • Queries - The Record Layer does not provide a query language, however it provides query APIs with the ability to scan, filter, and sort across one or more record types, and a query planner capable of automatic selection of indexes.
  • Many record stores, shared schema - The Record Layer provides the ability to support many discrete record store instances, all with a shared (and evolving) schema. For example, rather than modeling a single database in which to store all users' data, each user can be given their own record store, perhaps sharded across different FDB cluster instances.
  • Very lightweight - The Record Layer is designed to be used in a large, distributed, stateless environment. The time between opening a store and the first query is intended to be measured in milliseconds.
  • Extensible - New index types and custom index key expressions may be dynamically incorporated into a record store.

The Record Layer may be used directly or as an excellent foundation on which more complex systems can be constructed.
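
As a rough illustration (not part of the official documentation), opening a store and saving a record looks roughly like the sketch below; MyRecordsProto and path are stand-ins for an application's generated protobuf class and chosen KeySpacePath:

RecordMetaData metaData = RecordMetaData.build(MyRecordsProto.getDescriptor());
FDBDatabase db = FDBDatabaseFactory.instance().getDatabase();
db.run(context -> {
    FDBRecordStore store = FDBRecordStore.newBuilder()
            .setMetaDataProvider(metaData)
            .setContext(context)
            .setKeySpacePath(path)   // where in the cluster this store lives
            .createOrOpen();
    store.saveRecord(MyRecordsProto.MyRecord.newBuilder().setId(1066L).build());
    return null;
});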

Documentation

fdb-record-layer's People

Contributors

0xflotus, alecgrieser, ammolitor, arnaud-lacurie, butlermh, davelester, dme26, dsimmen, foundationdb-ci, g31pranjal, hatyo, jaydunk, jjezra, jleach4, mmcm, nblintao, normen662, nschiefer, ohadzeliger, panghy, pengpeng-lu, qdnguyen0, saintstack, scgray, scottdugas, scottfines, sophiabuell, tian-yizuo, vsharma999, wshaib-apple


fdb-record-layer's Issues

Add record type number (or just record type) as key expression type

From an earlier discussion:

The proposal here is for a new key expression type called either RECORD_TYPE, RECORD_TYPE_NO, or RECORD_TYPE_NUMBER. When evaluated, it would evaluate to that record type's field number within the union descriptor. (This is so that the value is unique per record type, allows for renaming without requiring on-disk changes, and is smaller than the written-out name.) This would allow the user to easily segregate the record keyspace by record type (and, in some sense, use "the same" primary key (but for record type) for different records with different types, à la different tables in a SQL database), which would allow for "efficient" full "table" scans instead of less efficient full store scans for operations that are limited to a single record type but otherwise not assisted by an index. (Obviously, scanning a full table or table-equivalent can be bad, but, like, it's still better than scanning the full store.)

Adding it as a first class key expression would allow users to turn this feature on and off as they wished, and then the semantics of using this should closely match the semantics of using the other types.

One could also use this field as a grouping in some interesting ways. For example, you could keep a COUNT index on records grouped by record type.

Right now, the user can implement this themselves by adding a field to their record that holds its type name or some type number, and then all of this would essentially work, but (1) it is a little sad that the user now has to keep around (and keep consistent) a field that we could derive from their message anyway, and (2) if we had this expression type, there are some features we might be able to enable. If, for example, a user is using this kind of primary key and they have an index on a single record type, we can use the fact that we know what the type code is to avoid storing that information in the index. (We would then require index rebuilds when the index goes from being single-type to multi-type.) If the record type is the first component of the primary key, then we can also short-circuit the index builder to skip over those ranges that it knows contain records of the wrong type rather than scanning and then discarding them. And even if the primary key doesn't contain record type information, then if there is a count index grouped by record type, we could theoretically query that when doing version checking to see if an index is being added on an empty type (such as a new type), and then automatically mark that index as readable without any kind of build.
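
For illustration, a primary key using the proposed expression might be defined roughly like this (recordType() stands in for the proposed expression and does not exist yet; metaDataBuilder and the record type and field names are made up for the example):

// Proposed: prefix the primary key with the record type's union field number.
metaDataBuilder.getRecordType("MyRecord").setPrimaryKey(
        Key.Expressions.concat(
                Key.Expressions.recordType(),   // the proposed RECORD_TYPE expression
                Key.Expressions.field("id")));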

Choosing between index scan to record scan

Right now, if a record scan and an index scan "score" the same (meaning, in practice, that they don't do much filter matching at all), the record scan is preferred (because it was tried first and only better ones are substituted). The exception is that an index on just the primary key (which isn't super useful otherwise) is preferred.

A record scan is more efficient when it is really what you want, but less efficient if it has to skip over lots of other record types, so it degrades less gracefully as the record store fills up.

The primary key index doesn't need to be a special case: it could just be preferred because its entries are smaller.

Text: Query.or leads to record rather than index scan

In a full text query, if we have something like:

Query.or(
    Query.field("text").text().containsAll("civil unclean blood", 4),
    Query.field("text").text().containsAll("king was 1016")
)

Then this gets planned as a scan over all records rather than an index scan. I suspect that this is because with an OR query, you need to be able to effectively take the union of elements, which the planner isn't quite smart enough to do with text queries, so it falls back to a full scan.

Queries with order-preserving filter sorting by primary key will not use index for filter

Suppose we have a schema like:

message MyRecord {
   optional int64 id = 1 [(field).primary_key = true];
   optional int64 value = 2 [(field).index = {}];
}

Then suppose someone has a query like:

RecordQuery query = RecordQuery.newBuilder()
    .setRecordType("MyRecord")
    .setFilter(Query.field("value").equalsValue(5L))
    .setSort(Key.Expressions.field("id"))
    .build();

Without the sort, the returned plan will be something like Index(MyRecord$value [[5],[5]]), which is correct. However, with the sort, this will do Scan(<,>) | [MyRecord] | value EQUALS 5 (i.e., it does a full store scan instead of using the index). But because the sort is by primary key and the filter is an equality comparison, the results of the index scan will be returned in the correct order, so it would be better to return the index scan instead (just like if there was no sort).

Note that if the filter had been something like Query.field("value").greaterThan(5), then this optimization cannot be made (the index entries would span multiple values of value, so they would no longer come back sorted by id).

A slightly more sophisticated example of this can occur if there are multiple filters. For example:

message MyRecord {
   optional int64 id = 1 [(field).primary_key = true];
   optional int64 value = 2 [(field).index = {}];
   optional int64 unindexed_value = 3;
}

Then the query:

RecordQuery query = RecordQuery.newBuilder()
    .setRecordType("MyRecord")
    .setFilter(Query.and(Query.field("value").equalsValue(5L), Query.field("unindexed_value").equalsValue(6L)))
    .setSort(Key.Expressions.field("id"))
    .build();

The query plan should use the index on value and then apply the unindexed_value filter manually.
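
Roughly, the desired plan here would be something like:

Index(MyRecord$value [[5],[5]]) | unindexed_value EQUALS 6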

Full Text: The Planner doesn't always correctly handle ands with nesteds

This has to do with the kind of book-keeping information the planner has associated with it when doing queries with nested fields. In particular, if you have a schema like:

message MapDocument {
   message Entry {
      optional string key = 1;
      optional string value = 2;
   }
   repeated Entry entry = 1;
}

And you have a text index on new Grouping(field("entry", FanType.FanOut).nest(concatenateFields("key", "value")), 1) (that is, grouping the map by key).

A legitimate query on this might be something like:

Query.field("entry").oneOfThem().matches(Query.field("key").equalsValue(keyValue), Query.field("value").text().something())

But unfortunately, if you have a query like:

Query.and(
   Query.field("entry").oneOfThem().matches(Query.field("key").equalsValue(keyValue)),
   Query.field("entry").oneOfThem().matches(Query.field("value").text().something())
)

Then it will get planned just like the first one (essentially distributing the and, which isn't correct).

I believe it is possible to run into a similar situation if you have an index on something like field("entry", FanType.FanOut).nest(field("value")).groupBy(field("entry", FanType.FanOut).nest("key")). This is effectively pairing each value with each key, but in a not-necessarily-sane way. Then the first query will get planned as if this index looks like the first kind (I think), which is almost fine, except that it kind of needs to do a filter on the returned results to verify that it actually is finding the key and the value in the same entry.

The FilterSatisfiedMask class has an expression member variable that's supposed to help with that book-keeping, but I didn't quite plumb that all the way through. I think making that better might fix this issue.

Define index "behaviors" with defined contracts, for use in planning and index maintenance

Our current handling of planning around indexes is a bit brittle. We originally just checked index types for equality, but this breaks when users of the Record Layer define new index types that the planner should treat as “the same” for purposes of planning.

The idea here is to be able to add new index types without adding a bunch of planning rules, and to change the planner without needing to change what index types mean. Basically, this requires us to define what each index "behavior" actually means and rigorously document it.

Be more selective about which values are re-written during re-tokenization

Currently, all tokens are rewritten during re-tokenization. In practice, if most tokenizer changes only do things like add stop words or change the way certain languages are handled (which we anticipate to be the case), then this is re-doing a lot of work. (It's somewhat like Sisyphus in that it pushes a boulder down a hill and then pushes it all the way back up to exactly where it started.) We should be able to detect which keys have changed and only update those, which should make re-tokenization a lot cheaper in most cases.

DirectoryLayerDirectory should support scopes and correctly detect incompatible peers

DirectoryLayerDirectory (DLD) considers its peers in the keyspace compatible if they are also DLDs and represent a different constant string. With scopes this is not correct (which is why DLD does not currently allow you to set the scope): if two peer DLDs have different scopes, then they could collide when mapping distinct strings. The compatibility check in DLD needs to check that the peer directory has the same scope. The tricky part is that this should happen when the keyspace is constructed, NOT when the path is constructed…

Implement continuation handling and record scan limit for RecordQueryLoadByKeysPlan

The RecordQueryLoadByKeysPlan does not support continuations, which makes it impossible to properly implement the record scan limit contract. Currently, the RecordQueryLoadByKeysPlan decrements the record scan limit but does not respect it.

This change is reasonably involved because we need to decide exactly what needs to be serialized as part of the continuation.

Add causal read risky support

Add FDB's causal read risky semantics flag to WeakReadSemantics.
The causal read risky option is described here:

Transaction.options.set_causal_read_risky(): This transaction does not require the strict causal consistency guarantee that FoundationDB provides by default. The read version of the transaction will be a committed version, and usually will be the latest committed, but it might be an older version in the event of a fault or network partition.
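
For reference, the option already exists in the FDB Java bindings as a per-transaction flag; the Record Layer work is to expose an equivalent boolean through WeakReadSemantics when it picks a read version. A minimal sketch of the raw binding call (db here is a plain FDB Database):

try (Transaction tr = db.createTransaction()) {
    // Opt this transaction out of strict causal consistency; its read version may be
    // slightly stale (but still committed) in the event of a fault or partition.
    tr.options().setCausalReadRisky();
    // ... reads ...
}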

Planner does not currently coalesce overlapping filters

The planner currently implements a query with overlapping (and therefore redundant) filters using several ScanComparisons predicates, rather than eliminating the less specific one.

Using an example from FDBFilterCoalescingQueryTest.overlappingFilters(), a query like

[MySimpleRecord] | And([str_value_indexed EQUALS $str, num_value_3_indexed GREATER_THAN_OR_EQUALS 3, num_value_3_indexed LESS_THAN_OR_EQUALS 4, num_value_3_indexed GREATER_THAN 0])

produces the plan

Index(multi_index [EQUALS $str, [GREATER_THAN_OR_EQUALS 3 && LESS_THAN_OR_EQUALS 4 && GREATER_THAN 0]])

even though the > 0 predicate is subsumed by the >= 3 predicate.
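
Ideally, the subsumed comparison would be dropped, producing a plan more like:

Index(multi_index [EQUALS $str, [GREATER_THAN_OR_EQUALS 3 && LESS_THAN_OR_EQUALS 4]])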

Apply stability annotations to *.cursors packages

Annotate public classes (and possibly their members) with the new API stability annotations in the packages:

com.apple.foundationdb.record.cursors
com.apple.foundationdb.record.provider.foundationdb.cursors

Part of #57.

IN join in filter can prevent index scan merging

If IN is turned into a join over a loop of comparison values, only to be used to drive a filter operator and not an actual index scan, it can make the branch of an OR in which it appears no longer ordered by the index scan. Furthermore, the index scan will be repeated unnecessarily.

The IN should be pushed further down into the filter; that is, it should be turned back into the runtime IN function.

Add API stability annotations to existing code

Once #42 is complete, we should annotate each public class (and possibly also its public members) with one of the stability statuses.

I'll track the work, but it's probably best done on a package-by-package basis by people who know that code well.
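
Once the annotations from #42 exist, marking a class would presumably look something like the following; the annotation name, the status value, and SomePublicCursor are assumptions for illustration, not a final API:

@API(API.Status.MAINTAINED)   // hypothetical status value
public class SomePublicCursor<T> implements RecordCursor<T> { /* ... */ }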

Some SonarQube issues with RecordMetaDataBuilder

  • RecordMetaData has too many arguments to the constructor (needed by the builder).
  • RecordMetaDataBuilder constructor is too complicated.
  • A couple of deprecated methods are missing the corresponding Javadoc tag.
  • buildPrimaryKeyComponentPositions should be @Nullable

Full Text: Sorts are not supported with full text queries

The proximate cause for this is this line in the planner:

https://github.com/FoundationDB/fdb-record-layer/blob/4b6455ffd24cd4e21c9db916f95d1afd8e3d0976/fdb-record-layer-core/src/com/apple/foundationdb/record/query/plan/RecordQueryPlanner.java#L986-L993

If sorts were to be supported, there would need to be extra logic to figure out what components are missing from the sort and if the remaining fields in the text index (after the text field) are compatible with the sort ordering.

Full Text: Support covered queries with text index scans

If a query comes in that only wants certain fields (say, the record's primary key, which is always in the index), and a text index scan picks up all of those fields, then it shouldn't resolve the record.

We already do this for regular index scans; we just need to make sure to do this for text as well. I guess some care needs to be taken to make sure that if the user wants the text field itself, the query doesn't get treated as covered, because we don't have the full value. Unless we do want to return it so that they get the token they care about (or something). That behavior can also be changed later.
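
For context, a caller signals that it only needs certain fields via setRequiredResults on the query; something like the following (the record type and field names are illustrative):

RecordQuery query = RecordQuery.newBuilder()
    .setRecordType("SimpleDocument")
    .setFilter(Query.field("text").text().containsAll("civil unclean blood"))
    .setRequiredResults(Collections.singletonList(Key.Expressions.field("doc_id")))
    .build();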

Full Text: Add support for querying the values of maps

Right now, the query planner gets tripped up if there is a nested structure within the grouping key. So, if you have a schema like:

message MapDocument {
   message MapEntry {
       optional string key = 1;
       optional string value = 2;
   }
   repeated MapEntry entry = 1;
}

If you have a text index on, let's say,

new Grouping(field("entry", FanType.FanOut).nest(concatenateFields("key", "value")), 2)

then you have essentially tokenized the text of each value and grouped by the key. This should let you issue queries like:

Query.field("entry").oneOfThem().matches(Query.and(Query.field("key").equalsValue(key), Query.field("value").text().someTextPredicate())

But because of deficiencies in how the planner handles nested structures, this doesn't use the index.

Use QueryToKeyMatcher when planning

The QueryToKeyMatcher class has logic on matching comparisons (expressed through the QueryComponent API) with key expressions to determine whether a filter can be satisfied by an index (for example). Some of this duplicates logic that is used within the planner (because they are tasked with doing similar things), so it would be nice if the duplicate logic was removed (if possible).

Make the parameterized JUnit test output less ugly

Something is up with the "name" function in our parameterized JUnit tests, because this is what I'm seeing:

FDBDatabaseTest > cachedReadVersionWithRetryLoops(BooleanEnum)[1] STARTED
FDBDatabaseTest > cachedReadVersionWithRetryLoops(BooleanEnum)[1] PASSED

FDBDatabaseTest > cachedReadVersionWithRetryLoops(BooleanEnum)[2] STARTED
FDBDatabaseTest > cachedReadVersionWithRetryLoops(BooleanEnum)[2] PASSED

(Package names omitted for brevity.) But the test itself looks like:

    @EnumSource(TestHelpers.BooleanEnum.class)
    @ParameterizedTest(name = "cachedReadVersionWithRetryLoops [async = {0}]")
    public void cachedReadVersionWithRetryLoops(TestHelpers.BooleanEnum asyncEnum) throws InterruptedException, ExecutionException {

Which makes me think it should say [async = FALSE] and [async = TRUE] in those two tests. I could have sworn it did in fact do that in the past, too, and then something changed. ¯\_(ツ)_/¯

Some query plans include redundant filtering operations even when the index is a complete specification

The planner adds filters to plans that don't need them because an index scan fully satisfies the predicate.

For example, in FDBFilterCoalescingQueryTest.duplicateFilters(), the query

[MySimpleRecord] | And([str_value_indexed EQUALS even, num_value_3_indexed EQUALS 3, num_value_3_indexed EQUALS 3])

produces the plan

Index(multi_index [[even, 3],[even, 3]]) | num_value_3_indexed EQUALS 3

Note that the filter on num_value_3_indexed EQUALS 3 is redundant, since that's provided by the index scan.

Simplify sort-only index planning in current planner

It seems that the logic in RecordQueryPlanner.planSortOnly() is (mostly) reimplementing a prefix comparison for KeyExpression. For a while now, we've had a pretty polished implementation of exactly this operation in BaseKeyExpression.isPrefixKey(). Assuming that all of the tests pass and there are no weird subtleties, we should simplify RecordQueryPlanner.planSortOnly() by using isPrefixKey() instead.
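
A rough sketch of the intended check, purely for illustration (the surrounding planner variables and the exact home of isPrefixKey() are glossed over here):

// A candidate index can satisfy a sort-only query if the requested sort key is a
// prefix of the index's key expression.
if (query.getSort().isPrefixKey(index.getRootExpression())) {
    // this index's entries are already in the requested order; plan a scan over it
}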

Shouldn't depend on union field names

RecordMetaData.getUnionFieldForRecordType(String) matches on the field's name. It really should find the field using the record type's descriptor, so that the union field can have any name. We may even want to keep a map of this to speed things up.
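
A sketch of the map-based lookup, using only the protobuf descriptor API (the variable names are illustrative):

// Built once from the union message's descriptor: record type descriptor -> union field.
Map<Descriptors.Descriptor, Descriptors.FieldDescriptor> unionFieldByType = new HashMap<>();
for (Descriptors.FieldDescriptor unionField : unionDescriptor.getFields()) {
    unionFieldByType.put(unionField.getMessageType(), unionField);
}
// Look up by descriptor rather than by name:
Descriptors.FieldDescriptor fieldForType = unionFieldByType.get(recordType.getDescriptor());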

Write planning rules for planning (simple, non-nested, unsorted) filter queries with value index scans

As a first test of the new planning infrastructure, write enough rules to be able to plan queries with no sort, and a simple filter like fieldName > 3 into an index scan over an index with a KeyExpression that starts with fieldName. Basically, port the planning logic from RecordQueryPlanner.planFieldWithComparison().

This also requires writing some rules to implement type filters (or eliminate unnecessary ones). It will probably also require fixing some bugs in the RewritePlanner, since it's never been tested. :)

At the end, simple query tests like FDBRecordStoreQueryTest.query() should run against both planners.

CODE_QUALITY & performIgnoredChecks

  • Building.md recommends CODE_QUALITY=YES ./gradlew -PperformIgnoredChecks ...
    • performIgnoredChecks is exactly about disabling the code that checks CODE_QUALITY.
  • Every build stanza sets CODE_QUALITY in the environment.
  • The build script sets performIgnoredChecks.
  • The code in question disables check tasks that don't have ignoreFailures.
    • Every check task has ignoreFailures = false.
  • performIgnoredChecks also controls whether jacoco runs.

I conclude that most of this is leftover control from when things were more complicated.

Using stored MetaData can lead to "descriptor did not match record type" when saving records

If one is using a statically-generated proto file (rather than a dynamically generated one) that has extensions specifying indexes on fields and is also using the MetaDataStore to distribute the metadata to applications, then the following set of circumstances can lead to an IllegalArgumentException being thrown with the message "descriptor did not match record type":

  1. One serializes and stores the record meta data using a meta data store.
  2. One then creates a record store off of the stored and (now) deserialized record meta data.
  3. One takes a record whose type is specified using the same record meta data that was serialized in step 1 and tries to save it in the store created in step 2.
  4. ILLEGAL ARGUMENT EXCEPTION

The problem here is that the descriptor of the record may have changed during the serialization and deserialization process. In particular, when we serialize RecordMetaData objects into MetaData proto messages, we strip out the extensions that specify indexes. Doing this simplifies our serialization process when it comes to how indexes are serialized and avoids issues that might arise from someone deleting a record that was specified with the extensions (and a few other edge cases), but it also means that this problem can happen, because we check that the record type of a record we try to save is within the record meta data of the store before we serialize it. We could loosen the rule and simply require that the name matches, and this would all work, but it would also create potential problems if people did weird things with their metadata that we don't check.

For now, this sterling bit of code works around the issue by serializing and then deserializing the message:

 Message recreatedRecord = DynamicMessage.parseFrom(
      recordStore.getRecordMetaData()
          .getRecordType(record.getDescriptorForType().getName()).getDescriptor(),
      record.toByteString());

Clients can use this today to get around the issue, but it's not pretty.

Use QueryToKeyMatcher when planning groups

This is, in some sense, a subtask of #20 in that planning groups for rank and text queries is a thing that the planner has to do, and it seems somewhat more straightforward to transition that over to the QueryToKeyMatcher class, as it will always want an equality comparison that it can use as a prefix for whatever else it needs to do. I think the main thing it doesn't do that the other group-matching logic does is track which comparisons it used, which is necessary to figure out which filters don't need to be applied later (and is also something needed for general query planning).

FDBRecordStoreBase.checkPossiblyRebuild() could take a long time if the record count index is split into many groups

It can be useful to split the record count index into several subgroups; for example, if all records are scoped by some common entity (e.g., "the records related to ") and we want to efficiently delete all records associated with that entity, we need to group the record count index by that entity.

If there are many (thousands?) of these entities, then FDBRecordStoreBase.checkPossiblyRebuild() will have to scan thousands of rows. We shouldn't count all the records during checkVersion to determine whether the store has too many records to build an index inline. Instead, there should be a limit on the number of rows scanned.

KeySpaceDirectory should allow for documentation

KeySpaceDirectory provides a nice, organized way to describe your FDB row keys; however, it would be nice if this included an actual textual description of the key elements so that the directory tree could be self-documenting.
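
One possible shape, purely hypothetical and only for illustration:

KeySpaceDirectory userDir = new KeySpaceDirectory("user_id", KeySpaceDirectory.KeyType.LONG);
// Hypothetical addition: attach a human-readable description to the directory so that
// printing the tree is self-documenting.
// userDir.setDescription("primary key of the user that owns this subtree");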

Move Away From JSR-305 Annotations

That project appears to have been dormant for a while, and there are more modern (and possibly more powerful) annotation frameworks that might be good replacements.

Release build fails because of re-publishing fdb-extensions

The current build will try to upload the fdb-extensions jar in both the proto2 and proto3 builds, which causes the build to fail. The proto3 build should only publish the proto3 jar, and fdb-extensions can then come from the proto2 build.

Move index-specific query planning behavior outside of planner

This is somewhat of a vague and possibly impossible task, but the idea here is that the index maintainers currently do a pretty good job of isolating the index lifecycle. However, to add a new index, one will still often have to change the query planner to make it use the new index when planning, which means that creating a new index type isn't really well isolated.

Perhaps the index maintainer class needs some kind of method like "can support query", or a method to turn a query into an index operation along with the filters it supports; or perhaps this should not go in the maintainer but somewhere else in com.apple.foundationdb.record.query.
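
Purely as a strawman (nothing like this exists today), the hook might be a method on the index maintainer that the planner consults; the return type below is invented for illustration:

// Strawman only: ask the maintainer whether, and how, its index can implement a filter.
@Nullable
public IndexScanDescription tryPlanFilter(@Nonnull QueryComponent filter) {
    return null;   // default: this index type cannot help with the filter
}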
