# Kibana Elastic Translation Clickhouse User Proxy

This is primarily a "simple" reverse proxy that emulates several Elasticsearch/OpenSearch APIs and, depending on its configuration and the nature of the request, either translates queries to SQL and dispatches them to Clickhouse-compatible servers, or passes requests through to an underlying Elasticsearch or OpenSearch server.

🚨 Definitely not ready for production use! 🚨
## What works

- aggregation queries
  - `date_histogram` as a top-level aggregation
    - `min`, `max`, etc. can be nested underneath it
  - `min`, `max`, `avg`, `sum`, `value_count`, and `cardinality` as top-level aggregations
    - we make no distinction between `value_count` and `cardinality`; they both currently translate to `COUNT(*)`.
  - `moving_fn` aggregations are supported under certain conditions:
    - The `script` MUST match the regex `MovingFunctions\.(.*?)\(values\)`
      - The functions we support are (moving) `max`, `min`, `sum`, and `unweightedAvg`.
    - The `buckets_path` MUST reference a bucket aggregation that has already been translated, i.e. is an ancestor or older sibling
    - The `window` parameter is required
    - The `shift` parameter is not supported (yet?)
  - `terms` as a top-level aggregation
    - the "other bucket" that counts the number of values that weren't in the top k isn't populated (it will make the SQL really hairy)
    - not correctly translated when not top-level
- non-aggregation queries
  - The way SQL result columns are divided between `_source` and `fields` is a heuristic at best
  - When no projections are requested, we just return the ID field, assuming there is one in the metadata -- maybe we'd want some notion of default fields?
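Putting the supported features above together, a request like the following should fall entirely within the translatable subset: a top-level `date_histogram` with a nested `sum`, plus a `moving_fn` whose `script` matches the required regex, whose `buckets_path` references an older sibling, and which supplies the mandatory `window`. (The field names `timestamp` and `bytes` are placeholders, not anything the proxy requires.)

```json
{
  "size": 0,
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "timestamp", "fixed_interval": "1h" },
      "aggs": {
        "total_bytes": { "sum": { "field": "bytes" } },
        "smoothed_bytes": {
          "moving_fn": {
            "buckets_path": "total_bytes",
            "window": 5,
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        }
      }
    }
  }
}
```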
## What doesn't work (yet)

- `terms` aggregations when not top-level
- Search targets with wildcards currently only work when the target string is mapped literally, e.g. if you ask for `GET /my-cluster-*:awesome-index-*/_search`, your mapping configuration files will need to mention `my-cluster-*:awesome-index-*` as an index name or alias.
  - TODO: find a place in metadata to model the full cluster and index names, for more precise matching
- queries that target multiple clusters/indexes, e.g. `GET /foo,bar/_search` -- currently we just take the first one
  - TODO: make this work if everything winds up pointing to the same DB after resolving the cluster/index names
- queries that want to use aggregations AND return documents in the same query (i.e. `aggs != null AND size != 0`) -- only the aggregation side of the query will be attempted.
- In aggregation queries, `hits.total.value` is the size of the result set after the results have already been reduced by a `GROUP BY`, not the number of records that matched the filter criteria before aggregation.
- In `moving_fn` aggregations, the following functions are not supported (yet?): `stdDev()`, `linearWeightedAvg()`, `ewma()`, `holt()`, `holtWinters()`
- Basically every kind of aggregation not listed above!
- Compatibility with OpenSearch is untested, and may have significant gaps.
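As a concrete illustration of the aggregations-plus-hits limitation (index and field names here are hypothetical): a request like the one below has both `aggs` and a non-zero `size`, so only the aggregation half will be attempted. Clients that need the matching documents too should set `"size": 0` here and fetch hits with a second, aggregation-free query.

```json
{
  "size": 10,
  "query": { "range": { "timestamp": { "gte": "now-1d" } } },
  "aggs": {
    "hosts": { "terms": { "field": "host" } }
  }
}
```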
The proxy server doesn't participate in Elastic/OpenSearch security mechanisms, other than passing through non-query requests and maintaining its own cookie cache.

Running the proxy with TLS may be a simple matter of adding the appropriate ktor directives to `application.conf`, but this isn't tested.

We currently use ktor for our (async) service implementation, but the (sync) Java HTTP client for talking to backends. We should use ktor's client library for the backend connections too, so we could get by with fewer threads.

We use a home-grown intermediate representation for queries and expressions, because we couldn't find anything that looked obviously better that would fit within our timeline. Substrait might be a future candidate.
## Architecture

```mermaid
flowchart LR
  other[Other Elastic clients]
  kibana[Kibana or OpenSearch Dashboards]
  proxy[KETCHUP Proxy]
  elastic[Elastic or OpenSearch]
  clickhouse[(Clickhouse Cluster)]
  config[(Config)]
  creds[(Credentials)]
  other == search ==> proxy
  kibana == search ==> proxy
  kibana -. read metadata .-> proxy
  proxy -. read metadata .-> elastic
  proxy -- passthrough --> elastic
  proxy == query via JDBC ==> clickhouse
  proxy -. read metadata & <br/> connections .-> config
  proxy -. retrieve .-> creds
```
## Building

```shell
./gradlew clean build
```
## Modules

- `elastic-proxy-server` is the main module; most of this README is about it.
- `cli-query-translator` is a standalone command-line utility that attempts to translate Elastic queries to Clickhouse-flavoured SQL without needing to connect to anything, and without attempting to run anything. It has its own README.
- `kibana-proxy-server` is a (currently non-functional) experiment in providing functionality similar to `elastic-proxy-server`, except it's intended to sit in front of Kibana/OpenSearch Dashboards, and supports Kibana/OpenSearch Dashboards APIs instead of Elastic/OpenSearch. In some deployment scenarios, this would be much easier to experiment with if it was restored to full function.
- Several internal libraries not designed for reuse:
  - `kql-parsing-scala` implements a KQL/Lucene parser using the excellent Fastparse library, which doesn't really have a Kotlin equivalent (yes, I tried `h0tk3y/better-parse`!)
  - `model` contains our configuration and metadata domain models, and the query IR/AST
  - `server-common` is stuff that would be useful for both `elastic-proxy-server` and `kibana-proxy-server`.
  - `util` is the usual util package: generic stuff that's used widely
- `hydrolix-specific` is some tooling that only applies to Hydrolix rather than Clickhouse generally.
## Running

```shell
./gradlew :elastic-proxy-server:shadowJar
java -jar elastic-proxy-server/build/libs/elastic-proxy-server-all.jar
```

Then try a search:

```shell
curl -d '{"query":{"query_string":{"query":"(new york city) OR (big apple)"}}}' \
  http://localhost:19200/my_index/_search
```
## Configuration

Configuration is mainly loaded from JSON files referenced by URLs contained in environment variables. Have a look at `application.conf` for more info, but we'll summarize here:

Q. Why are these all URLs instead of paths?
A. Because you might want your configuration files to be on a cloud storage medium someday!
### `LISTEN_PORT`

The TCP port the proxy should listen on; it defaults to `19200`.
### `DB_MAPPINGS_JSON_URL`

A URL (not a filesystem path!) that refers to an array of DBMapping objects, each of which maps an Elastic index name, and optionally aliases, to a particular Clickhouse connection, referencing it by its UUID. If not set, it defaults to `file:dbMappings.json`.

A DBMappings file looks like this:

```json
[
  {
    "indexPattern": "my-cluster*:my-index",
    "indexPatternIds": [],
    "clickHouseConnectionId": "<uuid>",
    "elasticConnectionId": "<uuid>",
    "clickHouseTableName": "my_db.my_table",
    "idFieldName": "id",
    "defaultFields": []
  }
]
```
### `ELASTICSEARCH_CONNECTIONS_JSON_URL`

A URL that refers to an array of one or more ElasticsearchConnection objects. If not set, it defaults to `file:elasticConnections.json`.

An ElasticsearchConnections file looks like this:

```json
[
  {
    "id": "<uuid>",
    "name": "My Elastic cluster",
    "created": "1970-01-01T00:00:00Z",
    "url": "http://10.10.0.137:9200/",
    "vendor": "Elastic",
    "default": true,
    "extraProperties": {}
  }
]
```
### `CLICKHOUSE_CONNECTIONS_JSON_URL`

A URL that refers to an array of one or more ClickhouseConnection objects. Note that credentials should be stored in a credential file, not here. If not set, it defaults to `file:clickhouseConnections.json`.

A ClickhouseConnections file looks like this:

```json
[
  {
    "id": "<uuid>",
    "created": "1970-01-01T00:00:00Z",
    "legacyDriver": false,
    "url": "jdbc:clickhouse://127.0.0.1:8123/my_db",
    "extraProperties": {
      "compress": "false"
    }
  }
]
```
### `ELASTIC_METADATA_JSON_URLS.<0..n>`

Numbered URLs; each one points to a single ElasticMetadata document describing how a particular Elastic/OpenSearch index is mapped to an SQL schema. Have a look at the source; it's too much to explain here.

There's no default value. Query translation can work without ElasticMetadata, but the translations can be wrong; e.g. nested field references like `foo.bar` will be left as-is, resulting in invalid SQL in most such cases.
### `CREDENTIALS_JSON_URL`

A URL that refers to an array of one or more CredentialEntry objects, so that credentials can be stored separately from connection metadata. These JSON files are the only implementation of credential storage at the moment, but we anticipate building more, e.g. for Kubernetes Secrets, AWS Secrets Manager, or other cloud providers' equivalents.

The `id` of each CredentialEntry must match that of a configured (elastic, clickhouse, etc.) connection to be useful. If not set, it defaults to `file:credentials.json`.

A Credentials file looks like this:

```json
[
  {
    "id": "<uuid>",
    "created": "1970-01-01T00:00:00Z",
    "username": "admin",
    "credential": "5uper_S3cre7"
  }
]
```