The evitadb from fgforrest

Support BinaryEntity in entity enrichment and deep fetching

Currently the deep fetching and entity enrichment is not fully implemented for BinaryEntities on Evita Java Client. The BinaryEntity is a research concept that is not yet fully completed and its benefit properly measured by performance tests. This issue targets to finalize binary entity:

enrichment / limitation on the server side
deserialization on the client side
performance tests measurement

Create GitHub pipeline for validating OpenAPI schema

Using GitHub action and docker, we could validate OpenAPI schemas from evitadb.io catalogs using redocly.

Metrics

Introduce metrics into the evitaDB. The servlet for metric should start as separate API on different port (or part of a system API). Although we are used to Prometheus API, we should analyze different options - namely Open Telemetry.

Metrics proposals

consider removing metrics type from name
add tracing to evitaDB core on more places and not only QueryPlan

Storage metrics

Transactions

Storage

Per collection

Per catalog

Per instance

Engine metrics

Queries

Per instance

active queries (tag: catalog, collection) - ∑ of catalogs
query process time (tag: catalog, collection) - ∑ of catalogs
query complexity (tag: catalogn, collection) - ∑ of catalogs
query records returned (tag: catalog, collection) - ∑ of catalogs
query records and extra results returned (tag: catalog, collection) - ∑ of catalogs
active sessions (tag: catalog) - ∑ of catalogs
executor threads
executor used threads (tag: process name, catalog)
executor thread execution time (tag: process name, catalog)

Cache

Web API metrics

Watching for the changes in certificates

If a Let's encrypt certificate is configured for the evitaDB server, it is automatically renewed every few months by an external script. The evitaDB, which is configured to use this certificate, currently needs to be restarted to reload the certificate. It would be nice if evitaDB would watch for changes in the file and if the file is changed, it would automatically update it in the internal keyStore.

Proof read - assignment documentation

Please proof read following documents:

Allow OR in generic entity queries in GraphQL

Right now in getEntity and listEntity queries, if one specifies more than one attribute to filter by, all attributes are joined with AND by default. Sometimes it would be beneficial to join them with OR, but this is on the user to specify.

We agreed on going with simple switch changing the behaviour for all arguments in single query:

listEntity(url: "", urlInactive: "", join: OR) {...}

Alternative approach would be to enable using containers here like in full query but this would be too complicated right now, and don't have an use-case for it.

Access to global attributes in generic REST/GraphQL query

Consider following GraphQL query:

query {
  getEntity(url: "/cs/doplnky") {
    primaryKey
    type
    ... on Category {
      locales
      attributes {
        name
        inactiveUrls
      }
    }
  }
}

To access global attributes, a developer must declare a full switch statement, which is cumbersome. The global attributes are guaranteed to have the same configuration in all entity schemas. The entity schema has only one option - it can choose to use the global attribute (with all its shared configuration) or not. It cannot redefine the attribute with a different data type or behavior.

Therefore, we can afford to provide access to global attributes directly in the generic query in the following way:

query {
  getEntity(url: "/cs/doplnky") {
    primaryKey
    type  
    locales
    attributes {
      name
      inactiveUrls
    }  
  }
}

If the entity that doesn't use this global attribute is returned, it simply provides a NULL value for such an attribute. The query format will be much cleaner and easier for developers to understand. The gRPC and Java implementations are not affected by this extension because they already provide easy access to global attributes.

The next problem is with the query's implicit AND' relation. There are cases where we would want to use OR` instead - real world use case ... finding an unknown entity by its active URL or one of its inactive URLS:

getEntity(url: "/en/accessories", urlInactive: "/en/accessories", joinType: OR)

We don't want to open up this generic method to full `filterBy' capabilities since it is rather limited and can effectively only search for globally unique attributes and nothing else. However, the wrapper container is a feature that would allow us to get rid of unnecessary additional requests.

Properly document ConstraintSchemaBuilder and ConstraintResolver

The ConstraintSchemaBuilder and ConstraintResolver have gotten quite robust and it is difficult to hold all concepts in head. Mainly the concept of switching contexts (domains and data locators).

Thus, it could be helpful if we would have those concepts written properly down with some diagrams of traversing of the constraint trees.

Exlusive facets

We want to provide support for a special type of facet that is represented by a select instead of a checkbox. This means that only a single facet can be selected from a single group at a time. This fact also affects the impact calculation. The exclusive facets should behave (and be defined) in the same way as negative facets.

Cache inconsistency

During manual testing of evitaDB we've run into scenarios where evitaDB returned inconsistent results. The problem disappeared when the cache was turned off, which means there is a problem in the cache key calculation. We need to track down this problem.

Support for read-only mode

We want to open access to our test dataset directly on https://evitadb.io for all to play. In order to avoid server / catalog pollution we want to allow read-only access. Until #25 is finalized we plan to add simple boolean flag that will start the server in full read-only mode that doesn't allow clients to:

create / update / delete catalog schemas
open read-write sessions

This will effectively mean that the clients would not be able to alter the demo data and can only play with read queries.

Too generic exception in gRPC implementation

Hello,

in current implementation of gRPC, it's not possible to identify the root cause of problem when exception is thrown.

Can you please expand the error conditions so that I can identify the problem?

Cheers

Stack:

io.grpc.StatusRuntimeException: INTERNAL: 
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[grpc_workaround_build-0.6-SNAPSHOT.jar:?]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[grpc_workaround_build-0.6-SNAPSHOT.jar:?]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[grpc_workaround_build-0.6-SNAPSHOT.jar:?]
	at io.evitadb.externalApi.grpc.generated.EvitaServiceGrpc$EvitaServiceBlockingStub.createReadOnlySession(EvitaServiceGrpc.java:685) ~[evita_external_api_grpc_shared-0.6-SNAPSHOT.jar:?]
	at io.evitadb.driver.EvitaClient.lambda$createSession$6(EvitaClient.java:250) ~[evita_java_driver-0.6-SNAPSHOT.jar:?]
	at io.evitadb.driver.EvitaClient.executeWithEvitaService(EvitaClient.java:148) ~[evita_java_driver-0.6-SNAPSHOT.jar:?]
	at io.evitadb.driver.EvitaClient.createSession(EvitaClient.java:249) ~[evita_java_driver-0.6-SNAPSHOT.jar:?]
	at io.evitadb.driver.EvitaClient.queryCatalog(EvitaClient.java:375) ~[evita_java_driver-0.6-SNAPSHOT.jar:?]
```

Multi-writer transactional logic (parallel transactions)

There is single test that tries to verify multi writer scenario: io.evitadb.api.EvitaApiFunctionalTest#shouldTrackAndFreeOpenSessions
The test is currently flaky. This issue is about WAL implementation, transaction ordering, handling conflict a proper testing of parallel transactions in a high load scenario.

Exact ordering as requested in input query

There are cases where the user wants to sort entities not by primary key, nor by attribute but by the order or arguments specified in entityPrimaryKeyInSet or attributeInSet constraint of the query.

The ordering will have following shapes:

Filter by order of primary keys listed in filtering part of the query:

orderBy(
   entityPrimaryKeyInFilter()
)

This requires that there is exactly one entityPrimaryKeyInSet in the filter.

Filter by order of primary keys listed directly:

orderBy(
   entityPrimaryKeyExact(5, 7, 9, 12, 6, 7)
)

Filter by order of attribute constants listed in filtering part of the query:

orderBy(
   attributeSetInFilter('code')
)

This requires that there is exactly one attributeInSet('code', 'e', 'x', 'y') in the filter.

Filter by order of attributes listed directly:

orderBy(
   attributeSetExact('code', 'a', 'b', 'c')
)

When there are multiple entities with the same attribute value, they will be inserted on the "same position" ordered by primary key ascending.

Aliases for extra results

GraphQL supports field aliases which can lead to duplicate fields with different arguments and different sub fields. This is problem especially for extra results e.g. facetSummary where we support only one summary for each reference. Unfortunately, in GraphQL client can sent following query:

facetSummary {
  brand {
    groupEntity {
      primaryKey
    }
  }
  otherBrand: brand(filterGroupBy: {}) {
    groupEntity {
      attributes {
        code
      }
    }
  }
}

this GraphQL query would need 2 separate summaries for the same reference with different filters and entity requirements.

This could by supported by evitaDB if each summary could be identified by outputName similary to hierarchies rather than references. The idea of outputName could be maybe used in other extra results like histograms for the same purposes.

There is also a problem with inner facet statistics collections which could be also duplicated but inside one summary. This cannot be solved simply with outputName. One idea is to fetch multiple summaries with otherwise same properties except for the facet statistics and then in GraphQL resolution somehow merge it together.

For now, we will support only one summary of single reference and throw an error otherwise.

Enable filter / order constraints for facets

We've recently implemented a new support for filtering / ordering of the fetched referenced entities. This new functionality is supported in referencedContent requirement only, but it has also sense for facetSummary which also relates to referenced entities and allows accessing them directly. The real life use-cases might be:

displaying and computing only the "most important" facets in the parameter filter (the next 4 filters don't need to be fetched and even don't need their statistics to be computed!):

sorting the parameter groups and the parameters within them by the server side relieving the client to perform these computations

Proposed solution

The facetSummary constraint will now support this layout:

query(
	collection(Entities.PRODUCT),
	filterBy(
		and(
			hierarchyWithin(Entities.CATEGORY, 1, excluding(entityPrimaryKeyInSet(excludedSubTrees))),
			userFilter(
				facetInSet(Entities.BRAND, 1),
				facetInSet(Entities.STORE, 5, 6, 7, 8),
				facetInSet(Entities.CATEGORY, 8, 9)
			)
		)
	),
	require(
		page(1, Integer.MAX_VALUE),
		facetSummaryOfReference(
			FacetStatisticsDepth.COUNTS,
			filterBy(
				attributeGreaterThan(ATTRIBUTE_QUANTITY, 950),
				entityHaving(
					attributeEqualsTrue(ATTRIBUTE_ALIAS)
				)
			),
			filterGroupBy(
				attributeEqualsTrue(ATTRIBUTE_ALIAS)
			),
			orderBy(
				attributeNatural(ATTRIBUTE_PRIORITY),
				entityProperty(
					attributeNatural(ATTRIBUTE_NAME)
				)
			),
			orderGroupBy(
				attributeNatural(ATTRIBUTE_NAME)
			),
			entityFetch(attributeContent()),
			entityGroupFetch(attributeContent())
		)
	)
);

Change GQL and REST query case notations

We want to make GQL and REST APIs easier to write and as close to standard convetions as possible.

Initially, we settled on using _ character for delimiting parts in dynamically constructed names of queries, mutations, query constraints and so on.
This is however far from conventional camelCase notation and is harder to write by users.

get_product
query_product
upsert_product

attribute_code_equals: ...
hierarchy_withinSelf: ...

But now we have tools and knowledge to implement pure camelCase notation for all dynamically constructed names in GQL and REST.

getProduct
queryProduct
upsertProduct

attributeCodeEquals: ...
hierarchyWithinSelf: ...

These changes will make the APIs easier to write, in case of REST easier to generate clients from and more close to the Java API.

Also, we need simple shell script that would take evitaDB schema, generates list of old queries, methods and constraints and replaces them with new syntax.

Spatial indexes and constraints

We want to be able to search by geo location query. Library https://github.com/locationtech/jts is recommended source of inspiration or reuse.

Also, we should look at: https://github.com/davidmoten/rtree2

Problem with deleting catalogs on Windows OS

When running evita tests on Windows, the used catalog cannot be deleted. Perhaps some IO handler on server side is prohibiting the delete operation.

Mar 29, 2023 10:51:27 AM io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@4fe29d07
io.evitadb.exception.UnexpectedIOException: Failed to delete file: .\data\catalogs\testingCatalog\testingCatalog.catalog
	at [email protected]/io.evitadb.utils.FileUtils.lambda$deleteDirectory$1(FileUtils.java:80)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
	at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
	at [email protected]/io.evitadb.utils.FileUtils.deleteDirectory(FileUtils.java:75)
	at evita.store.server/io.evitadb.store.catalog.DefaultCatalogPersistenceService.delete(DefaultCatalogPersistenceService.java:364)
	at evita.engine/io.evitadb.core.Catalog.delete(Catalog.java:498)
	at evita.engine/io.evitadb.core.Evita.removeCatalogInternal(Evita.java:566)
	at evita.engine/io.evitadb.core.Evita.update(Evita.java:340)
	at evita.engine/io.evitadb.core.Evita.deleteCatalogIfExists(Evita.java:310)
	at evita.external.api.grpc/io.evitadb.externalApi.grpc.services.EvitaService.deleteCatalogIfExists(EvitaService.java:176)
	at evita.external.api.grpc.shared/io.evitadb.externalApi.grpc.generated.EvitaServiceGrpc$MethodHandlers.invoke(EvitaServiceGrpc.java:733)
	at [email protected]/io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
	at [email protected]/io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
	at [email protected]/io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
	at [email protected]/io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
	at evita.external.api.grpc/io.evitadb.externalApi.grpc.services.interceptors.GlobalExceptionHandlerInterceptor$ExceptionHandler.onHalfClose(GlobalExceptionHandlerInterceptor.java:72)
	at [email protected]/io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:355)
	at [email protected]/io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:867)
	at [email protected]/io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at [email protected]/io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at [email protected]/org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)
	at [email protected]/org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
	at [email protected]/org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
	at java.base/java.lang.Thread.run(Thread.java:833)

Rework sorting by multiple attributes

We need to avoid and remove data type Multiple. Problem with this type is that it hides inner data types (which there may be multiple) and there is also problem how to describe this type to the end users. We need to stick to analogies to existing and well established data store - primarily SQL. On the other hand we need to allow both these scenarios:

sort by two or more attributes in "expected way" - i.e. secondary ordering decides order of elements when primary ordering cannot decide because values are the same
sort by two or more attributes when the sorting handles "NULLs" last scenario - i.e. there is no attribute found for primary attribute in an entity

We also want to avoid situation where we would need to compare values of the entities in real-time. We already know it's slower than current masking process for pre-sorted arrays. This was the original reason why Multiple constraint exists in the first place.

The proposed solution discussed with @lho is:

add support for creating multi-attribute sort indexes in EntitySchema / Reference schema by adding so called SortableAttributeCompounds which will aggregate sort direction for multiple existing attributes (which doesn't need necessarily to be sortable themselves)

The final composition would like this:

attribute(nameOfCompound, DESC)

we should also relax on requiring that sortable reference attribute is present in entity only once, when multiple references are selected the sort order should be combined:
- in case of hierarchy reference by a deep traversal order of the filter matching hierarchy tree
- in case of plain referenced entities by a ascending order of their primary keys

Chained entity support

When the primary system sorts entities in the list using drag'n'drop, there is a serious problem of how to implement the order. There are these possible options:

numerical priority from 1 ... N where first entity is marked with 1, last entity with N
→ disadvantage: if you move the last entity to the top, you have to update all entities
float priority, where the entire possible numeric scope is divided so that the first entity occupies the lowest possible number in the scope and the last entity occupies the highest possible number in the scope
→ Disadvantage: If you move entities, the new order might be calculated as follows (left entity order + right entity order) / 2 ... this looks ok, but the possible number of divisions is surprisingly low, and from time to time you have to recompute all orders from scratch and re-position entities evenly
instead of an order/priority number for each entity, the entities are organized as a chain of entities, where the last entity refers to its predecessor (linked list), this is a very powerful representation, but quite hard to manage and relational databases are quite slow to use it in more complex join queries

Evita should allow all three ordering variants.

Since evitaDB is considered as a secondary search index, we need to handle situations where the chain is not consistent because not every change in the primary database has been propagated to the secondary index. So the mechanism should require only one chain head, but allow multiple chain tails. Such a chain should behave like a tree, where the entities are ordered by the longest unbroken chain, but then partial branches of the tree are appended from top to bottom so that no entity is left out. The moments when the tree is not consistent should eventually be made consistent, but may prevail for a while. evitaDB should also allow setting strongly consistent requirements, where the chain must be fully consistent at transaction commit time.

Initialize catalogs on evitaDB startup in parallel

Currently, all catalogs are opened and loaded into memory synchronously. The goal of this task is to load them in parallel. Since the catalogs don't share any memory structure and don't write anything to disk (and if they did, they would write to different files/folders), it's easy to initialize them in parallel and use all CPU power to speed up evitaDB startup.

Introduce security

We should introduce at least basic level of security so that the database is not entirely open. The mechanism needs to be clarified first.

Endpoint trailing slashes in Undertow doesn't match

I've discovered that URLs with trailing slashes don't match any endpoint (all our endpoint URLs are without trailing slashes).
This is issue with Undertow itself, it was kinda addressed in the https://issues.redhat.com/browse/UNDERTOW-267 with now allows for defining endpoints with or without trailing slashes but still doesn't allow for two endpoints with same URL where one ends with slash and the other doesn't.

We would probably have to write ticket to Undertow because there no documentation for the RoutingHandler that we use (which could take long time to resolve), or we would have to write our routing handler (maybe on top of the PathHandler which could potentially have different behaviour.

Facet count per group

Some clients wants to display facet count on the group level (when the entire group listing is collapsed). The number cannot be computed as a mere sum of facet counts, because a single entity may have multiple facets in the same group. Therefore the count number must be computed by evitaDB directly during the facet computation phase.

Distributed database support

The issue is just a setup to collect interesting information:

Ctrl node protocol recovery paper:

https://ramalagappan.github.io/pdfs/papers/par.pdf

Cache serialization

When EvitaDB is stopped it seems beneficial to store entire cache (or at least most used results) to the MemTable so that when EvitaDB is restarted again the cache is already warmed up.

We also need to write much more tests for cache.

Note: We may be able to get much better results with Azul OpenJDK CRAC feature

Data wrapper layer for Java driver

We want to provide a simple data wrapper layer over generic Entity data structures in the Java ecosystem. This data wrapper should allow automatic implementation of interfaces / POJO classes that transparently delegate the calls to evitaDB Entity or EntityBuilder. The wrappers will be based on dynamic class generation, most likely using https://github.com/FgForrest/Proxycian, and we may also provide automatic implementation of DAO interfaces similar to Spring DAO that developers are used to.

The reason for this time investment is that the developers are more familiar with the interfaces they design and name themselves. Most developers would create such wrappers manually anyway, so why not help them if we know how?

Finalize storage format

Version storage file format

In order to evolve the storage format and automatically migrate existing data we need to include version information in all record types.

Consider moving control byte to CRC-checked part

Currently the control byte is not covered by CRC check and this might open a space for a undetected error.

Add version to RecordKey

Currently the record key doesn't contain the version information. This is necessary to properly locate the correct instance of the record in the file.

Enable filter / order constraints for references in GQL / REST API

Relating to recently finished issue #45 we're now able to define filter and ordering constraint within referencedContent requirement. This new funcionality is not propagated to the GQL and REST query language, however. This issue aims to get our APIs up-to date with possibilities that are available in plain Java query language.

We should be able to define following GQL:

query {

  queryProduct(
    filterBy: {
      hierarchyCategoriesWithin: {
        ofParent: 650
      },
      entityLocaleEquals: cs_CZ                
    }  
  ) {
    recordPage {
      data {
        primaryKey
        attributes {
          name
        }
        groups(
          filterBy: {
            attributeOrderInGroupLessThan: 10,
            entityHaving: {
              attributeNameContains: "..."
            }
          }
          orderBy: {
            attributeOrderInGroupNatural: ASC
          }
        ) {          
          attributes {
            orderInGroup      
          }
          group {
            primaryKey
            attributes {
              name
            }
          }
        }
      }
    }  
  }
  
}

Similar problem is in our REST API where the new possibilities need to be handled in our annotation framework.
The gRPC API doesn't need to be changed since it works with String format of the query that is translated directly to the Java format.

Remove HierarchyPlacementContract

The hierarchy playground has changed in issue #7. We're now able to sort hierarchies along any of their sortable attributes. The orderAmongSiblings property is thus no longer necessary and therefore we may get rid of entire HierarchyPlacementContract abstraction and merge it with EntityContract where we can maintain @Nullable Integer parentPrimaryKey.

Compute "dynamic" set of attribute histogram for references

During discussion with Next.JS team (namely @jru), new idea sprung up. In current situation is up to them to maitain list of filterable parameters, cache it and ask actively for attributeHistogram computation. This requires quite a complex logic in the middleware and also caching (which bring a lot of additional problems). @jru came with following idea - if he could ask for all "referenceHistograms" and specify a filterBy constraint which references:

reference_parameter_histogram(
        filterBy: entityHaving(
          attribute_inputWidgetType_equals: "interval"
        )
      ) {
        attributes {
          primaryKey
          name
          order
        }
        # we may specify one or more (numeric) attributes set on relation
        nameOfTheReferenceAttribute {
           min
           max
           overallCount
           buckets(requestedCount: 20) {
             index
             occurrences
             threshold
           }      
        }
        # we may specify one or more (numeric) attributes set on referenced entity
        entity_nameOfTheReferencedEntityAttribute {
           min
           max
           overallCount
           buckets(requestedCount: 20) {
             index
             occurrences
             threshold
           }      
        }
      }

and evitaDB would compute "dynamic" count of histograms for target attributes based on reference relevancy and grouped by ReferenceContract#group.

we need to remove temporary extension to the GraphQL API histograms allowing to retrieve histograms by names used in variable argument

Extend price fetch requirement

Actually there is no possibility how to fetch additional prices in io.evitadb.api.query.require.PriceContentMode#RESPECTING_FILTER and only those prices that reflect priceInPriceList are returned. This is not comprehensible to the users that needs to include the "non-sellable" prices to priceInPriceList in order to fetch them to the client. More comprehensible variant of this use case would be as follows:

In Java:

Query.query(
	collection(entityType),
	filterBy(
		priceInPriceLists("a", "b")
	),
	require(
		entityFetch(priceContent("c", "d"))
	)
)

Or in evitaQL:

query(
	collection("product"),
	filterBy(
		priceInPriceLists("a", "b")
	),
	require(
		fetch(prices(RESPECTING_FILTER, "c", "d"))
	)
)

This would fetch prices in price lists "a" + "b" required by filer and also "c" + "d" added in the requirement.

Query profiles

In current API there are some parts that complicate the storefront implementation. Namely:

set of "persistent filters" - such as PriceValidIn (now), attributeEquals (status = ACTIVE)
HierarchyExcluding inserting set of categories with having attributeEquals (visibility = INVISIBLE)
QueryPriceMode set always on WITH_VAT
FacetGroupsConjunction, FacetGroupsDisjunction, FacetGroupsNegation that is always derived from referenced entitites having attributeEquals (mode = NEGATIVE) and so on

These "relations" needs to be wired up in the storefront middleware and although doable they complicate the implementation, require caching (and proper invalidation) and may introduce subtle bugs if some of these "mandatory" constraints is omitted from the query by mistake.

That's why the idea of "query profiles" looks promising on the first sight. Let's say we can create a named profile using the evitaDB API that would allow to define declarative "rules". Then the client would just apply the profile for the query in following way:

query(
	collection(Entities.PRODUCT),
	filterBy(
		and(
			entityPrimaryKeyInSet(primaryKey),
			entityLocaleEquals(locale)
		)
	),
	require(
		profile("b2c")
	)
)

When such profile is used, server side (evita query engine) would automatically enrich the query of a new conjunctive filters / additional requirements that were associated with this profile. This would allow shifting a lot more logic to the server part that maintains the data set and make the logic on frontend / middle-ware a lot easier and maintainable. Also moving the selection logic to standard constraints and formula calculation tree would allow us to reuse existing core cache / invalidation baked in evitaDB. This removes the necessity to handle cache in the middle-ware at all and promises the correct and immediate invalidation process in case the data is changed.

The profiles should be able to:

define filtering constraint for one or multiple entity types
define requirement constraint for one or multiple entity types

Of course we'd need to extend certain constraints to be able to accept a sub constraint filter instead of exact primary keys (HierarchyExcluding, FacetGroupsConjunction, FacetGroupsDisjunction, FacetGroupsNegation) - which doesn't directly relate to this issue of query profiles, but can be solved separatedly.

The query profiles should be composable - so that we could use multiple profiles in a single query profile("b2c","visible").

Another thing that comes to my mind in connection with query profiles is implications to security. The profiles might play nicely with something like EdgeDB access policies or table permissions in SurrealDB - so when we introduce "logging in" process to the Evita session establishing proces we could enforce using one or more profiles for entire session based on the logged in user account credentials and thus limit the access of the user to the certain data.

We should also extend queryTelemetry hint to allow returning complete query composed from different profiles to allow debugging problems by the developers (because part of the query composition is hidden for them on the server side).

Create and document backup & restore & vacuum process

We need to implement the vacuuming process, the backup & restore procedure. We want all of them to be executable while the system is running and writing to the original files without interruption. This requires special integration tests and manual testing. We should also partially rewrite the current storage logic where we use "nice" names for entity collections and catalogs. We (probably) need to avoid situations that require renaming existing files and use monotonically increasing indexes for the files and switch the reference to the currently used file only in storage data structures. Finally, if the system detects that there is no living pointer to the old file, it will be permanently deleted.

Vacuuming processes:

small: when trash ratio exceeds limit ⇨ write memory snapshot to a new file with incremented index
big: regularly delete old files when:
- there is no active transaction working with the file (otherwise log error)
- their last modification timestamp is lesser than the required history to be kept
- compact catalog header file to track only those headers that relate to existing files

Backup process:

small: creates ZIP file with:
1. last record catalog header only
2. current catalog content file
3. current entity content files
big: creates ZIP file with:
1. entire catalog header file
2. all catalog content files
3. all entity content files

Ensures no vacuuming process is running and prevents it to be executed for particular catalog.

Restoring process:

simple - just unzip entire catalog contents and load the contents from it
chirurgical (can be used only on "big" backups) - it looks up into the ZIP file and finds catalog header that matches requested timestamp, if the catalog history is newer than the timestamp, error is printed with the information of the earliest timestamp, that can be used. If the particular header can be found, its contents are restored as a fresh snapshot without any waste; it accepts two arguments:
- backup file name (mandatory)
- timestamp the catalog should be restored in (optional - if not set, current date and time is used)

Implement parallelism for GraphQL resolvers

As we find out during status meeting, currently the queries in GraphQL implementation are evaluated serially. If the GraphQL request contains 3 queries A,B,C they will be evaluated in that order one after another. The proper implementaiton is that each query is executed within single task passed to the evitaDB thread pool and than the main process should wait for each of it to be processed using latch.

Conditional filter

Some e-commerce systems allow for sticky facet filters. These filters are remembered when browsing different product categories. This makes life easier for users in areas where the e-shop sells uniform (or similar) products. Real use case example:

Client sells food. The food is marked with an "eco" tag. The user selects that he wants to buy only "eco" food and the e-shop offers only products marked with "eco" tag. We want to make this selection sticky, so that when the user browses the categories, he/she will only see the products that apply to him/her. The store also has a special category with hand tools. If we apply the "eco" tag requirement to this category, the user wouldn't see any products because no single product is marked with such a tag (it just doesn't make sense).

To avoid duplicating the query to evitaDB, we want to provide a conditional block in the query that would "disable" the conditional part of the query if it returns an empty result. The result should also contain a used query, so that the client side can detect whether the conditional part was used or not.

In this way, we could provide a solution to this situation by using a single query that implicitly avoids returning empty results and provides enough information for the end store to display information to the user:

Your selection does not match any products. We offer other products instead.

Entity grouping by selected attribute / reference

The reality shows that the best approach for work with product variants and "master" products is that the listing / faceting queries list the variants, generate facets for them and then in additional run group them by the reference to their master products that are shown in the listings instead of them. The pagination and sorting is applied after the grouping is realized.

The situations where this brings a value are:

More precise filtering

In traditional approach master product aggregates all parameters of their variants. This seems ok on the first sight but eliminites possibility to filter precisely the products that contain only specific combination of the inner parameters. This is clearly visible on the sites that sell shoes or clothing. There you want regurarly find proper combination of the variant you want - for example "blue variant of size XXL". When using traditional approach you'll find a master product that might have red variant of size XXL but blue variant of size M. That is false positive for your search criteria. Searching for variants instead and grouping them to masters afterwards would solve this problem.

Access to the matching number of variants

With traditional model it's not possible to show number of matching variants in the listing where only the master products are listed. When we use the grouping approach the count would be naturally accessible.

Verify entity collection in catalog schema strictly

Now the entity collection is automatically created in the catalog when the first entity of this type is inserted. This behavior is not correct if we consider that it's possible to check the schema strictly, and also allows creation of unwanted entity collections by mistake or as a simple typo. This new feature requires the addition of something like CatalogEvolutionMode', which will control whether the entity collections of new types are created on demand, or if it would require a deliberate CatalogSchema` change.

Finalize REST API

We want to support REST API querying as first party citizen in evitaDB. We want to build our REST API on the Undertow web server that came out as the the server with the best performance, embedability and feature characteristics. There is already GraphQL API integrated into it and so we can inspire there.

Evaluate REST API Java libraries:

Requested features:

highly performant
lightweight
integrable with Undertow
open-source license
can use Jackson for conversion from/to JSON (Jackson is already internally used by evitaDB)
allows to create / consume dynamic endpoints / payload based on catalog schema

Features to avoid:

the solution is more framework than library
dictates us the maven structure,
requires a lot of external libraries,
forces us to adapt to their structure / API / code not the other way
is bloated / slow - "enterprise" :)

Main tasks of this issue:

open REST API endpoint in Undertow server
each catalog will have it's own "path" of the endpoint (e.g. https://evitadb-server.org/rest/catalog-name) - the catalog name must be properly sanitized for the URL
- list all available collections
- create new collection with optionally passing initial schema definition
- retrieve schema - beware! there must be 2 variants:
  - evitaDB schema that allows mutations to the schema definition,
  - evitaDB entity schema that allows querying and mutating entities in the collection (see the next major bullet point)
- update collection schema
- remove collection
each catalog endpoint will provide automatically generated REST API (Open API) schema that will cover:
- list all entities in the collection (paginated access only)
- get one or more entities by their primary keys
- get one or more entities by any of their unique attributes
- get one or more entities by any of their unique attributes, regardles of their entity type (typically accessing unknown entity by their unique URL) - requires https://gitlab.fg.cz/hv/evita/-/issues/39 (read concept of CatalogSchema there)
each catalog endpoint will provide a query/mutate end-point
- list entities by exact query (also used for getting the overal count of matching entities)
- mutate contents of entity
- remove entity
expose a "global" endpoint (e.g. https://evitadb-server.org/rest/) that will allow to:
- list all available catalogs (and their endpoints)
- create new catalog with configuration specification
- remove existing catalog
- get catalog configuration
- alter catalog configuration
- get evita configuration

The REST API should follow the FG best-practice rules.
The API should also behave similarly to GraphQL API created by @lho - close communication and cooperation with @lho is recommended and expected.

Expected obstacles:

dynamic Open API schema (for latest specification 3.1) needs to be generated from evitaDB schema - there are probably no suitable libraries for such purpose in Java ecosystem
dynamic endpoints (de)registration
API design that would be similar to GraphQL, yet follows the REST API best practices
- multi resource payloads (we want to avoid too many roundtrips between client and the server)
- support for transactions (batch updates? dynamic endpoint?)

Hierarchy statistics refinement

In research phase we implemented hierarchyStatistics constraint the computed the entity cardinality for all nodes in the hierarchy. This solution can be used for "mega-menu" rendering but usually the application needs to display only part of the menu so we need to come up with refined design that would allow clients to fetch only the part of the menu they actually need to render and nothing more (we don't want to spend time on calculating things that would get thrown out).

User requirements

we need to render entire menu in single query (along with the filtered entities that are assigned to a hierarchy)
we need to render "mega-menu" separately - i.e. get top X levels of all product categories - hierarchical entities (without actually fetching the product themselves)
when we look at specific category (hierarchical entity):
- we need to fetch parents to the root (specific levels or all)
- we need to fetch children from current node (specific levels or all)
- we need to fetch siblings of this node or any other node from previous two requirements
- we need to fetch immediate children

EvitaQL requirement changes

We will completely refactor the current hierarchyStatistics so that the statistics will be optional and will be composable in this way:

hierarchyOfReference(
   "categories",
   fromRoot(
      "topLevel",
      // only one sub constraint has sense - this is only example
      stopAt(level(6), distance(2), node(filterBy(attributeEquals('someAttribute', 'someValue')))), 
      fetch(attributes()),
      statistics()
   ),
   fromNode(
      "someDetachedHierarchy",
      node(filterBy(attributeEquals('code', 'specialNode'))),
      // only one sub constraint has sense - this is only example
      stopAt(level(6), distance(2), node(filterBy(attributeEquals('someAttribute', 'someValue')))), 
      fetch(attributes()),
      statistics()
   ),
   children(
      "myChildren",
      // only one sub constraint has sense - this is only example
      stopAt(level(6), distance(2), node(filterBy(attributeEquals('someAttribute', 'someValue')))), 
      fetch(attributes()),
      statistics()
   ),
   parents(
      "myParents",
      // only one sub constraint has sense - this is only example
      stopAt(level(2), distance(2), node(filterBy(attributeEquals('someAttribute', 'someValue'))))
      fetch(attributes()),
      statistics(),
      siblings(
         stopAt(level(3)), 
         fetch(attributes()),
         statistics()
      )
   ),
   siblings(
      "mySiblings",
      filterBy(attributeEquals('whatever','value')), 
      fetch(attributes()),
      statistics()
   )
)

Response data structure

We will reuse current DTO: io.evitadb.api.requestResponse.extraResult.HierarchyStatistics, but it will have deeper structure:

first Map level will be indexed by referenceName
second Map level will be indexed by constraint custom name (i.e. "mySiblings" or whatever developer specifies in query)
as a value the io.evitadb.api.requestResponse.extraResult.HierarchyStatistics.LevelInfo will be provided and it will be altered in following way (the properties will be not null only when statistics requirement is part of the query):
- cardinality will be optional (int -> @Nullable Integer)
- @Nullable Boolean hasChildren will be added to signalize whether the node has any additional children within it

Real-life use-cases

Mega-menu example

Render 2 level deep mega-menu:

hierarchyOfReference(
   "categories",
   fromRoot(
      "megamenu",
      stopAt(level(2)), 
      fetch(attributes())
   )
)

Left menu example

Render menu of the category on the third level - we need to render category siblings, all category parents and their siblings, and also immediate children.

hierarchyOfReference(
   "categories",
   siblings(
      "currentNodeSiblings",
      fetch(attributes())
   ),
   children(
      "currentNodeChildren",
      stopAt(distance(1)), 
      fetch(attributes())
   )
   parents(
      "currentNodeParents",
      fetch(attributes()),
      siblings(
         fetch(attributes())
      )
   )  
)

Render immediate children of the node

This query will be used in AJAX style gradual menu unrolling when only immediate sub level is opened.

hierarchyOfReference(
   "categories",
   children(
      "currentNodeChildren",
      stopAt(distance(1)), 
      fetch(attributes())
   )
)

GraphQL query mechanism

In GraphQL it could look like this:

hierarchy {
  # reference name
  categories {
    # list of all nodes requested from root, on FE this needs to be converted into tree
    topLevel: fromRoot(
      stopAt: {
        # only one of them could be used in real query, this would be validated on server
        level: 6
        distance: 2
        node: {
          # this would be same filterBy which is used when querying category entities
          filterBy: {
            attribute_someAttribute_equals: "someValue"
          }
        }
      }
    ) {
      parentId # id of parent node
      path # array of node ids in tree where this node resides in 
      cardinality
      childrenCount
      entity {
        primaryKey
        attributes {
          code
        }
      }
    }

    # list of all nodes requested from specific node, on FE this needs to be converted into tree
    someDetachedHierarhcy: fromNode(
      node: {
        filterBy: {
          attribute_code_equals: "specialNode"
        }
      }
      stopAt: {
        # only one of them could be used in real query, this would be validated on server
        level: 6
        distance: 2
        node: {
          # this would be same filterBy which is used when querying category entities
          filterBy: {
            attribute_someAttribute_equals: "someValue"
          }
        }
      }
    ) {
      parentId # id of parent node
      path # array of node ids in tree where this node resides in 
      cardinality
      childrenCount
      entity {
        primaryKey
        attributes {
          code
        }
      }
    }

    # list of all children nodes requested from current position, on FE this needs to be converted into tree
    myChildren: children(
      stopAt: {
        # only one of them could be used in real query, this would be validated on server
        level: 6
        distance: 2
        node: {
          # this would be same filterBy which is used when querying category entities
          filterBy: {
            attribute_someAttribute_equals: "someValue"
          }
        }
      }
    ) {
      parentId # id of parent node
      path # array of node ids in tree where this node resides in 
      cardinality
      childrenCount
      entity {
        primaryKey
        attributes {
          code
        }
      }
    }

    # list of all parent nodes requested from current position
    myParents: parents(
      stopAt: {
        # only one of them could be used in real query, this would be validated on server
        level: 6
        distance: 2
        node: {
          # this would be same filterBy which is used when querying category entities
          filterBy: {
            attribute_someAttribute_equals: "someValue"
          }
        }
      }
    ) {
      cardinality
      childrenCount
      entity {
        primaryKey
        attributes {
          code
        }
      }
      # for each node there would be flat list of sibling nodes, the level would filter where this list is returned
      siblings(level: 3) {
        entity {
          primaryKey
        }
        cardinality
      }
    }

    # list of all sibling nodes requested for current position
    mySiblings: siblings(
      # this would be same filterBy which is used when querying category entities
      filterBy: {
        attribute_whatever_equals: "value"
      }
    ) {
      cardinality
      childrenCount
      entity {
        primaryKey
        attributes {
          code
        }
      }
    }
  }
}

Production mode

We want to be able to switch between production and dev mode in configuration of evitaDB. Currently, mainly to disable introspection in GraphQL and disable OpenAPI schema endpoint.

Re-introduce automatized performance test suite in Kubernetes

The original project has a performance suite that has been adapted to run on Digital Ocean's K8S infrastructure. The scripts were part of the private repository from which this project was migrated. We need to run these performance tests even after project migration and this issue should help us with that. Please create a workflow that:

build a separate Docker image on top of the GitHub repository dedicated to performance testing built from the performance branch (who: @novoj)
order a K8S infrastructure on D/O and run tests on it (who: @fgmpe)
publish JSON result to Gist in THIS repository (who: @fgmpe)
destroy the K8S infrastructure (who: @fgmpe)
implement a scheduled workflow to ensure that K8S does not exist 4 hours after its creation

Developer documentation

Revisit our initial documentation and create basis of the technical documentation for the consumers. The structure of the planned developer documentation is as follows:

Get started

Run evitaDB
1. Run embedded in you application
2. Run as service inside Docker
Create your first database
Query our dataset

Use

Data model
1. Data types
2. Schema
Connectors
1. GraphQL
2. REST
3. gRPC
4. Java
5. C#
API
1. Define schema
2. Upsert data
3. Query data
4. Write tests
5. Troubleshoot

Query

Basics
Filtering
1. Bitwise
2. Boolean
3. Comparable
4. String
5. Locale
6. Range
7. Price
8. References
9. Hierarchy
10. Facet
Ordering
1. Natural
2. Price
3. Reference
4. Random
Requirements
1. Paging
2. Fetching
3. Price
4. Hierarchy
5. Facet
6. Histogram

Operate

Configure
1. Setup TLS
Run
Backup & Restore
Monitor

Deep dive

Storage model
Bulk vs. incremental indexing
Transactions
Cache
Observe changes

Solve

Render category menu
1. Mega-menu
2. Partial menu
3. Hide parts of the category menu
Filter products in category
1. With faceted search
2. With price filter
Render referenced brand
1. With product listing
2. With involved categories listing
Handle images & binaries
Model price policies

Limiting the relations accepted for computation of HierarchyWithin / HierarchyWithinRoot / FacetSummary / HierarchyStatistics

The HierarchyWithin only uses a filter constraint that selects the considered node for the hierarchy lookup, the same is true for FacetSummary and HierarchyStatistics.

However, there might be a use case where the client wants to consider only those relations to the target entities that match a certain constraint.

Let's look at this picture:

What if we want to consider only relations with attribute disabled="false" and want to list all products in category (hierarchy node) two?! Currently, product 1 would be returned even if we use the additional constraint referenceHaving(attributeEqualsFalse("disabled")) because it has another relation to another node with attribute disabled set to true.

Therefore, we need a new way to specify which relations should be considered when evaluating relation constraints. We came up with the following idea:

In filter section:

hierarchyWithin(
    attributeEquals("code", "ABC"),
    considerReferencesHaving(
    	attributeEqualsFalse("disabled")
    )
    excluding(
	attributeEquals("code", "DEF")
    )
)

In requirements:

facetSummary(
    filterBy(
       inRange("validity", now())
    )
    filterGroupBy(
       attributeEqualsTrue("filterable")
    )
    considerReferencesHaving(
    	attributeEqualsFalse("disabled")
    )
)

We first thought about reusing the referenceHaving constraint, but it also allows an entityHaving sub-constraint, which doesn't make sense in this situation. Therefore, we think we need a differently named constraint container that would accept filtering constraints that target only attributes of the reference and nothing else.

Document gRPC API

We need to document gRPC proto files with comments that would reflect the descriptions in GQL/REST api descriptors. Can @lukashornych please provide more information on how to find proper descriptions for analogous data in the descriptors?

Also we need to write article docs/user/en/use/connectors/grpc.md.

Allow filtering by unique attributes in WithinHierarchy and facetInSet constraints

During the meetings with the frontend team we've discovered necessity for multipl request/response turnarounds that are required in order to resolve the primary keys of the entities we want to constraint on. Exact use-cases are as follows:

withinHierarchy: currently accepts only primary keys of the parent category, the real use-case is that the client have easily accessible only globally unique URL attribute of the category
facetInSet*: currently accepts only primary keys of the facets, the real use-case is that the client have easily accessible only localy unique code attributes of the parameters (the codes are used because they're better readable and comprehensible for the end users)

This issue targets solving this discrepancy by allowing to target also any unique attribute of those entities.

There are two alternatives how to do it:

1. extend current constraints

by adding a new constructor like:

@ConstraintCreatorDef
public FacetInSet(@Nonnull @ConstraintClassifierParamDef String referenceName,
                  @Nonnull @ConstraintClassifierParamDef String uniqueAttributeName,
                  @Nonnull @ConstraintValueParamDef Serializable... attributeValue) {
	super(concat(referenceName, attributeName, attributeValue));
}

2. creating new constraints

by creating entirely new constraing like:

@ConstraintDef(
	name = "inSet",
	shortDescription = "The constraint if entity has at least one of the passed facet primary keys.",
	supportedIn = ConstraintDomain.ENTITY
)
public class FacetAttributeInSet extends AbstractFilterConstraintLeaf implements FacetConstraint<FilterConstraint> {

	@ConstraintCreatorDef
	public FacetAttributeInSet(@Nonnull @ConstraintClassifierParamDef String referenceName,
	                           @Nonnull @ConstraintClassifierParamDef String attributeName,
	                           @Nonnull @ConstraintValueParamDef Serializable... attributeValue) {
		super(concat(referenceName, attributeName, attributeValue));
	}

}

I like the first variant more, but I don't know whether the APIs can handle it.

Query validation

There are some rules for constructing query, that needs to be enforced - primarily to avoid confusion of the users. These rules should be validated and violations reported before processing with the query. Validations should contain these rules:

facet constraint should be placed directly in userFilter
userFilter is allowed only within AND condition chain
directRelation and excludingRoot make no sense in query (see https://gitlab.fg.cz/hv/evita/-/issues/51)
content requirements (referenceContent etc.) not enclosed within entityFetch
invalid nesting hierarchyContent(entityFetch(hierarchyContent()))
multiple hierarchy requirements ofSelf / ofReference targetting the same reference name, multiple outputNames of same name
contradictory contents: #407

Implement readiness and livenes probes for Kubernetes

See https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

liveness (aka restart needed) - Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.
readiness (aka i'm fine, but don't send me requests) - Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.

Write REST & GraphQL developer documentation

Please write these two articles:

docs/user/en/use/connectors/graphql.md
docs/user/en/use/connectors/rest.md
write also recommendations for testing docs/user/en/use/api/write-tests.md

Content of the articles is suggested in the stub of these articles. Please inspire in Run evitaDB article.

Another inspiring source can be found on EdgeDB or SurrealDB

fgforrest / evitadb Goto Github PK

evitadb's People

Contributors

Stargazers

Watchers

Forkers

evitadb's Issues